EBI patent sequence database

The EBI patent sequence database provides non-redundant, annotated nucleotide and protein sequences extracted from patent documents. It supports sequence retrieval, clustering, and patent-specific analyses.


Key Features:

  • Non-Redundant Databases: Provides non-redundant sequence sets, built by removing duplicate sequences from the EMBL-Bank nucleotide patent class and the patent protein databases.
  • Value-Added Annotations: Incorporates annotations derived from patent documents including publication number corrections, earliest publication dates, and feature collations.
  • Hierarchical Clustering: Implements two-level clustering. Level-1 clusters, built from MD5 checksums, group sequences that are 100% identical over their entire length; Level-2 sub-clusters divide each Level-1 cluster using patent family information.
  • Comprehensive Coverage: Includes both nucleotide sequences from the EMBL-Bank patent class and patent-associated protein sequences.
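
As a rough illustration of how checksum-based redundancy removal can work, the sketch below groups entries whose sequences have the same MD5 digest. It is not the database's own pipeline: the function names, the example entry identifiers, and the normalization step (stripping whitespace and upper-casing before hashing) are assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def sequence_md5(sequence: str) -> str:
    """Return the MD5 hex digest of a sequence.

    Assumption: whitespace is removed and letters are upper-cased first,
    so formatting differences do not produce different checksums.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def level1_clusters(entries):
    """Group (entry_id, sequence) pairs by checksum.

    Each resulting cluster holds the entries whose sequences are identical
    over their entire length after normalization.
    """
    clusters = defaultdict(list)
    for entry_id, sequence in entries:
        clusters[sequence_md5(sequence)].append(entry_id)
    return dict(clusters)


# Hypothetical entries: A1 and A2 carry the same sequence, A3 differs by one base.
entries = [("A1", "ACGTACGT"), ("A2", "acgt acgt"), ("A3", "ACGTACGA")]
clusters = level1_clusters(entries)
for checksum, members in clusters.items():
    print(checksum, members)
```

Keeping one representative per checksum yields a non-redundant set, while the member list preserves the link back to every patent entry that contains the sequence.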

Scientific Applications:

  • Enhanced Data Retrieval: Enables precise identification and retrieval of specific sequences and their patent metadata by reducing redundancy and improving annotation quality.
  • Cross-Disciplinary Research: Provides access to biological sequences and annotations embedded in patent documents for bioinformatics and interdisciplinary studies.
  • Intellectual Property Analysis: Supports patent validity and scope assessments through MD5-based clustering and patent-derived annotations.

Methodology:

Sequences are clustered using MD5 checksums into Level-1 (100% identity) and Level-2 (patent-family based) clusters. The clusters are annotated with information extracted from patent documents, such as publication number corrections, earliest publication dates, and feature collations.
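
The two-level grouping described here can be sketched as a nested mapping from checksum to patent family to member entries, with the earliest publication date derived per sub-cluster. This is a minimal sketch under stated assumptions: the `PatentSequence` record, its field names, and the placeholder checksums and family identifiers are hypothetical and do not reflect the database's actual schema.

```python
from collections import defaultdict
from datetime import date
from typing import NamedTuple


class PatentSequence(NamedTuple):
    # Hypothetical record layout, for illustration only.
    entry_id: str
    checksum: str          # MD5 digest of the sequence (Level-1 key)
    family_id: str         # patent family identifier (Level-2 key)
    publication_date: date


def two_level_clusters(records):
    """Return {checksum: {family_id: [records]}}.

    Level 1 groups identical sequences by checksum; Level 2 splits each
    Level-1 cluster by patent family.
    """
    level1 = defaultdict(lambda: defaultdict(list))
    for record in records:
        level1[record.checksum][record.family_id].append(record)
    return {checksum: dict(families) for checksum, families in level1.items()}


def earliest_publication(members):
    """Return the earliest publication date among the cluster members."""
    return min(member.publication_date for member in members)


# Placeholder data: "c1" and "c2" stand in for real MD5 digests.
records = [
    PatentSequence("E1", "c1", "F1", date(2005, 3, 1)),
    PatentSequence("E2", "c1", "F1", date(2004, 7, 15)),
    PatentSequence("E3", "c1", "F2", date(2006, 1, 10)),
    PatentSequence("E4", "c2", "F3", date(2003, 5, 5)),
]
clusters = two_level_clusters(records)
for checksum, families in clusters.items():
    for family_id, members in families.items():
        print(checksum, family_id, earliest_publication(members))
```

Separating the two keys means the same sequence claimed in unrelated patent families stays distinguishable, while repeat publications within one family collapse into a single sub-cluster.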

Details

Tool Type:
web application
Operating Systems:
Linux, Windows, Mac
Added:
3/30/2017
Last Updated:
11/25/2024

Publications

Li W, McWilliam H, de la Torre AR, Grodowski A, Benediktovich I, Goujon M, Nauche S, Lopez R. Non-redundant patent sequence databases with value-added annotations at two levels. Nucleic Acids Research. 2009;38(suppl_1):D52-D56. doi:10.1093/nar/gkp960. PMID:19884134. PMCID:PMC2808894.
