Nucleotide Sequence Databases


Understanding RefSeq



Prerequisites: Basic knowledge of Biology.
Level: Beginner.
Learning objectives:
- Gain a basic understanding of the RefSeq database.

Navigating the Complexity of DNA Sequence Databases: A Guide to NCBI's RefSeq - A Database of Reference Sequences


In the rapidly evolving field of genomics, DNA sequence data is being produced at an unprecedented pace, resulting in considerable redundancy within the major sequence databases. This redundancy makes it difficult for researchers to identify the most accurate and reliable sequence information. To address this issue, the National Center for Biotechnology Information (NCBI) has developed RefSeq, a comprehensive project that provides a single reference sequence for each component of the central dogma: DNA, RNA, and protein.

The Reference Sequence (RefSeq) database, an open-access initiative established by NCBI in 2000, is an annotated and curated collection of nucleotide sequences (DNA, RNA) and their associated protein products. Its sequences are obtained from the INSDC databases: GenBank, the European Nucleotide Archive (ENA), and the DNA Data Bank of Japan (DDBJ). Unlike GenBank, RefSeq provides a single record for each natural biological molecule (DNA, RNA, or protein) in a given organism. It covers organisms from viruses and bacteria to eukaryotes, focusing on significant organisms with substantial data, and encompassed almost 144,000 distinct "named" organisms as of November 2023.


Understanding the Challenge. The wealth of sequence data from systematic sequencing projects and individual laboratories brings specific challenges. A single biological entity may be represented by multiple entries across various databases, leading to confusion for end users. Additionally, distinguishing between experimentally determined sequences and computational predictions can be complicated.

RefSeq: A Non-Redundant Solution. NCBI's RefSeq project stands out as a pioneering initiative to overcome the challenges posed by redundant sequence data. The primary goal of RefSeq is to offer a non-redundant reference sequence for each molecule in the central dogma, encompassing DNA, mRNA, and protein. The uniqueness of RefSeq extends beyond its non-redundant nature; each entry includes comprehensive biological attributes of the gene, gene transcript, or protein.


Manually Curated Sequences


RefSeq is an indispensable resource in genomics because its nucleotide sequences (DNA, RNA) and their associated protein products are carefully curated. The curation process combines experimental evidence, computational predictions, and manual review by curators. This multi-faceted approach improves the accuracy and reliability of the sequences in the database. By providing a curated collection, RefSeq reduces the risk of the errors and inconsistencies often associated with uncurated or fully automated datasets.

The curated sequences in RefSeq help reduce ambiguity and confusion associated with poorly annotated or conflicting data. The transparent and standardized representation of genomic information minimizes misinterpretations, supporting more robust scientific conclusions.

The curated annotations include information about exons, introns, coding sequences, untranslated regions (UTRs), and other critical genomic features. This depth of knowledge is invaluable for understanding the functional elements of genes.

RefSeq also integrates genomic, transcriptomic, and proteomic information into a cohesive dataset. This integration allows researchers to systematically explore the relationships between genes, transcripts, and proteins. The hierarchical structure of RefSeq's notation format enables users to navigate through different levels of genomic information seamlessly.

The clinical domain. The manually curated sequences in RefSeq hold particular significance in the clinical domain. Researchers and clinicians rely on accurate and curated genomic information for applications such as disease diagnosis, identification of genetic variations, and understanding the molecular basis of disorders. The reliability of RefSeq is crucial in clinical genomics, where precision and accuracy are paramount.

Overall, the curated nature of RefSeq promotes standardization in genomic annotations. Researchers and laboratories worldwide can rely on RefSeq as a consistent reference, facilitating standardized analyses across different studies and experiments. This standardization is crucial for comparing and integrating genomic information from diverse sources.


Key Features of RefSeq


Linkage of Sequences: Nucleotide and protein sequences in RefSeq are explicitly linked, providing a holistic view of the molecular information.

Ongoing Curation: RefSeq entries undergo continuous curation, guaranteeing that the information remains up-to-date with the latest advancements in genomics.

Taxonomic Range: RefSeq entries cover a broad taxonomic range, reflecting the diversity of biological entities and ensuring relevance across different species.


RefSeq Accession Numbers


RefSeq entries are distinguishable from records in other databases, such as GenBank, by a distinct accession number series. An accession consists of a two-letter code indicating the type of reference sequence, followed by an underscore and a number of six or more digits. This is the classic "2 + 6" format; newer accessions, such as NM_001007083, use nine digits. Experimentally determined sequence data are denoted NT (genomic contigs), NM (mRNAs), and NP (proteins), whereas sequences derived from genome annotation efforts are marked XM (model mRNAs) and XP (model proteins).

Differentiating "N" and "X" Numbers. Understanding the distinction between "N" numbers and "X" numbers is crucial. "N" numbers represent experimentally determined sequences, providing higher confidence in their accuracy. "X" numbers, on the other hand, indicate computational predictions derived from raw DNA sequences and require more cautious interpretation.

Notable categories include:

Prefix  Description
AC      Complete genomic molecule (alternate assembly)
AP      Protein annotated on an AC alternate assembly
YP      Protein annotated on genomic molecules without an instantiated transcript record
NC      Complete genomic molecule
NG      Incomplete genomic region
NZ      Complete genomes and unfinished WGS data
NT      Contig or scaffold (clone-based or WGS)
NM      mRNA
NR      ncRNA
NP      Protein
NW      Contig or scaffold (primarily WGS)
XM      Predicted mRNA model
XR      Predicted ncRNA model
XP      Predicted protein model (eukaryotic sequences)
WP      Predicted protein model (prokaryotic sequences)
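
The prefix rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name describe_accession and the regular expression are assumptions made for this example, not part of any NCBI tool, and the pattern assumes that the part after the underscore ends in at least six digits (optionally preceded by letters, as in some NZ accessions).

```python
import re

# Prefix descriptions adapted from the table above.
REFSEQ_PREFIXES = {
    "AC": "Complete genomic molecule (alternate assembly)",
    "AP": "Protein annotated on an AC alternate assembly",
    "YP": "Protein annotated on genomic molecules without an instantiated transcript record",
    "NC": "Complete genomic molecule",
    "NG": "Incomplete genomic region",
    "NZ": "Complete genomes and unfinished WGS data",
    "NT": "Contig or scaffold (clone-based or WGS)",
    "NM": "mRNA",
    "NR": "ncRNA",
    "NP": "Protein",
    "NW": "Contig or scaffold (primarily WGS)",
    "XM": "Predicted mRNA model",
    "XR": "Predicted ncRNA model",
    "XP": "Predicted protein model (eukaryotic sequences)",
    "WP": "Predicted protein model (prokaryotic sequences)",
}

# Assumed pattern: two letters, underscore, optional letters, six or more
# digits, and an optional ".version" suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_([A-Z]*\d{6,})(?:\.(\d+))?$")


def describe_accession(accession):
    """Return a small summary dict, or None if this is not a known RefSeq prefix."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    prefix, _number, version = match.groups()
    if prefix not in REFSEQ_PREFIXES:
        return None
    return {
        "prefix": prefix,
        "description": REFSEQ_PREFIXES[prefix],
        # "X" prefixes mark computational models, as explained above.
        "is_model": prefix.startswith("X"),
        "version": int(version) if version else None,
    }


print(describe_accession("NM_001007083.2"))
print(describe_accession("CM000994.3"))  # GenBank accession, no underscore
```

A GenBank accession such as CM000994.3 has no underscore after its two letters, so the sketch returns None for it, which mirrors how RefSeq and GenBank records can be told apart by eye.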

The Curation Status of RefSeq Records


The RefSeq records have varied curation status levels. You can find the curation status of RefSeq entries in the COMMENT area of the record.

In the RefSeq database, the reliability of information is indicated by different status categories assigned to each record. The status categories can be broadly classified into two sets: one represents records that have undergone some level of review or validation, and the other includes records that are predictions or have yet to be individually reviewed.

More reliable status categories (have undergone some level of review or validation):

1. REVIEWED: This status indicates that NCBI staff or collaborators have reviewed the RefSeq record. The review process involves assessing available sequence data and relevant literature. These records are more reliable as they have undergone a quality check.

2. VALIDATED: This status indicates that the RefSeq record has undergone an initial review to provide the preferred sequence standard. These records have not yet been subject to final review, at which time additional functional information may be provided.

Less reliable status categories (predictions or not yet individually reviewed):

1. MODEL: These records are generated by the NCBI Genome Annotation pipeline and are not subject to individual review or revision between annotation runs. They are based on computational predictions.

2. INFERRED: This status indicates that the RefSeq record has been predicted by genome sequence analysis but is not yet supported by experimental evidence. It may be partially supported by homology data.

3. PREDICTED: Records in this category have yet to be subject to individual review, and some aspects of the record are predicted. They are based on computational methods without thorough manual validation.

4. PROVISIONAL: RefSeq records in this category have yet to be subject to individual review. The initial sequence-to-gene association has been established by outside collaborators or NCBI staff, indicating a lower reliability level than reviewed records.

WGS (Whole Genome Shotgun) records represent a collection of whole genome shotgun sequences and are not individually reviewed or revised between genome updates. Their detailed annotation may therefore be less reliable.

Paying attention to these differences is important because it helps you gauge how much confidence to place in the accuracy and completeness of the data. Reviewed and validated records are generally more trustworthy, while predicted or provisional records should be interpreted cautiously and may require further experimental validation. The status code appears in the COMMENT section of a record (Figure 1).

Status Code   Description
MODEL         The RefSeq record is provided by the NCBI Genome Annotation pipeline and is not subject to individual review or revision between annotation runs.
INFERRED      The RefSeq record has been predicted by genome sequence analysis, but it is not yet supported by experimental evidence. The record may be partially supported by homology data.
PREDICTED     The RefSeq record has not yet been subject to individual review, and some aspect of the RefSeq record is predicted.
PROVISIONAL   The RefSeq record has not yet been subject to individual review. The initial sequence-to-gene association has been established by outside collaborators or NCBI staff.
REVIEWED      The RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes assessing available sequence data and the literature. Some RefSeq records may incorporate expanded sequence and annotation information.
VALIDATED     The RefSeq record has undergone an initial review to provide the preferred sequence standard. The record has not yet been subject to final review at which time additional functional information may be provided.
WGS           The RefSeq record is provided to represent a collection of whole genome shotgun sequences. These records are not subject to individual review or revisions between genome updates.

For more details, see the NCBI Handbook, Chapter 18, "The Reference Sequence (RefSeq) Database."
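
The two groups of status codes described above can be expressed as a small lookup, which may be handy when filtering a list of records. This is a minimal sketch: the grouping follows this section's classification, while the names confidence_tier and keep_reviewed, and the (accession, status) pair layout, are assumptions made for the example.

```python
# Grouping taken from the classification in this section.
REVIEWED_STATUSES = {"REVIEWED", "VALIDATED"}
UNREVIEWED_STATUSES = {"MODEL", "INFERRED", "PREDICTED", "PROVISIONAL", "WGS"}


def confidence_tier(status):
    """Map a RefSeq status code to 'reviewed', 'unreviewed', or 'unknown'."""
    code = status.strip().upper()
    if code in REVIEWED_STATUSES:
        return "reviewed"
    if code in UNREVIEWED_STATUSES:
        return "unreviewed"
    return "unknown"


def keep_reviewed(records):
    """Keep accessions whose status is in the reviewed group.

    `records` is assumed to be an iterable of (accession, status) pairs.
    """
    return [acc for acc, status in records if confidence_tier(status) == "reviewed"]


print(confidence_tier("VALIDATED"))
print(confidence_tier("MODEL"))
```

Treating unrecognized codes as "unknown" rather than guessing keeps the sketch conservative if a record carries a status not listed here.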

      LOCUS       NM_001007083            1099 bp    mRNA    linear   VRT 16-DEC-2021
      DEFINITION  Gallus gallus interleukin 3 (IL3), mRNA.
      ACCESSION   NM_001007083
      VERSION     NM_001007083.2
      KEYWORDS    RefSeq.
      SOURCE      Gallus gallus (chicken)
        ORGANISM  Gallus gallus
                  Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                  Archelosauria; Archosauria; Dinosauria; Saurischia; Theropoda;
                  Coelurosauria; Aves; Neognathae; Galloanserae; Galliformes;
                  Phasianidae; Phasianinae; Gallus.
      REFERENCE   1  (bases 1 to 1099)
        AUTHORS   Avery S, Rothwell L, Degen WD, Schijns VE, Young J, Kaufman J and
                  Kaiser P.
        TITLE     Characterization of the first nonmammalian T2 cytokine gene
                  cluster: the cluster contains functional single-copy genes for
                  IL-3, IL-4, IL-13, and GM-CSF, a gene for IL-5 that appears to be a
                  pseudogene, and a gene encoding another cytokinelike transcript,
                  KK34
        JOURNAL   J Interferon Cytokine Res 24 (10), 600-610 (2004)
        PUBMED   15626157
        REMARK    GeneRIF: the chicken genome encodes genes for the homologs of
                  mammalian interleukin-3 (IL-3), IL-4, IL-5, IL-13, and
                  granulocyte-macrophage colony-stimulating factor (GM-CSF)
      COMMENT     VALIDATED REFSEQ: This record has undergone validation or
                  preliminary review. The reference sequence was derived from
                  JAENSK010000256.1.
      ...
    

Figure 1. RefSeq flatfile header where the COMMENT line has the status code 'VALIDATED.'
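
As a rough illustration of reading the status code programmatically, the sketch below searches flatfile text for a COMMENT line of the form shown in Figure 1. It assumes the status word directly follows COMMENT and precedes "REFSEQ:"; records that phrase the comment differently (for example, "REFSEQ INFORMATION:") yield None. The function name refseq_status is hypothetical.

```python
import re

# Assumed layout: "COMMENT     <STATUS> REFSEQ: ..." as in Figure 1.
STATUS_RE = re.compile(
    r"^\s*COMMENT\s+"
    r"(MODEL|INFERRED|PREDICTED|PROVISIONAL|REVIEWED|VALIDATED|WGS)"
    r"\s+REFSEQ:",
    re.MULTILINE,
)


def refseq_status(flatfile_text):
    """Return the status code found in the COMMENT line, or None."""
    match = STATUS_RE.search(flatfile_text)
    return match.group(1) if match else None


# Excerpt of the header shown in Figure 1.
example = """\
LOCUS       NM_001007083            1099 bp    mRNA    linear   VRT 16-DEC-2021
VERSION     NM_001007083.2
COMMENT     VALIDATED REFSEQ: This record has undergone validation or
            preliminary review. The reference sequence was derived from
            JAENSK010000256.1.
"""

print(refseq_status(example))  # VALIDATED
```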


RefSeq Flatfiles and FASTA Files


Below is an example of a RefSeq flatfile, which uses the same format as GenBank flatfiles. The sequence and feature annotations are not shown here because of their large size. You can view the entire record online at NCBI, where the "Customize view" panel lets you change the display to show the complete file.

LOCUS       NC_000067          195154279 bp    DNA     linear   CON 10-APR-2023
DEFINITION  Mus musculus strain C57BL/6J chromosome 1, GRCm39.
ACCESSION   NC_000067
VERSION     NC_000067.7
DBLINK      BioProject: PRJNA169
            Assembly: GCF_000001635.27
KEYWORDS    RefSeq.
SOURCE      Mus musculus (house mouse)
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Myomorpha;
            Muroidea; Muridae; Murinae; Mus; Mus.
REFERENCE   1  (bases 1 to 195154279)
  AUTHORS   Church,D.M., Schneider,V.A., Graves,T., Auger,K., Cunningham,F.,
            Bouk,N., Chen,H.C., Agarwala,R., McLaren,W.M., Ritchie,G.R.,
            Albracht,D., Kremitzki,M., Rock,S., Kotkiewicz,H., Kremitzki,C.,
            Wollam,A., Trani,L., Fulton,L., Fulton,R., Matthews,L.,
            Whitehead,S., Chow,W., Torrance,J., Dunn,M., Harden,G.,
            Threadgold,G., Wood,J., Collins,J., Heath,P., Griffiths,G.,
            Pelan,S., Grafham,D., Eichler,E.E., Weinstock,G., Mardis,E.R.,
            Wilson,R.K., Howe,K., Flicek,P. and Hubbard,T.
  TITLE     Modernizing reference genome assemblies
  JOURNAL   PLoS Biol. 9 (7), e1001091 (2011)
   PUBMED   21750661
REFERENCE   2  (bases 1 to 195154279)
  AUTHORS   Church,D.M., Goodstadt,L., Hillier,L.W., Zody,M.C., Goldstein,S.,
            She,X., Bult,C.J., Agarwala,R., Cherry,J.L., DiCuccio,M.,
            Hlavina,W., Kapustin,Y., Meric,P., Maglott,D., Birtle,Z.,
            Marques,A.C., Graves,T., Zhou,S., Teague,B., Potamousis,K.,
            Churas,C., Place,M., Herschleb,J., Runnheim,R., Forrest,D.,
            Amos-Landgraf,J., Schwartz,D.C., Cheng,Z., Lindblad-Toh,K.,
            Eichler,E.E. and Ponting,C.P.
  CONSRTM   Mouse Genome Sequencing Consortium
  TITLE     Lineage-specific biology revealed by a finished genome assembly of
            the mouse
  JOURNAL   PLoS Biol. 7 (5), e1000112 (2009)
   PUBMED   19468303
REFERENCE   3  (bases 1 to 195154279)
  CONSRTM   Genome Reference Consortium
  TITLE     Direct Submission
  JOURNAL   Submitted (24-JUN-2020) NCBI, NIH, Bethesda, MD 20892, USA
COMMENT     REFSEQ INFORMATION: The reference sequence is identical to
            CM000994.3.
            
            On Sep 22, 2020 this sequence version replaced NC_000067.6.
            Assembly Name: GRCm39
            The DNA sequence is composed of genomic sequence, primarily
            finished clones that were sequenced as part of the Mouse Genome
            Project. PCR products and WGS shotgun sequence have been added
            where necessary to fill gaps or correct errors. All such additions
            are manually curated by GRC staff. For more information see:
            https://genomereference.org.
            
            ##Genome-Annotation-Data-START##
            Annotation Provider         :: NCBI RefSeq
            Annotation Status           :: Updated annotation
            Annotation Name             :: GCF_000001635.27-RS_2023_04
            Annotation Pipeline         :: NCBI eukaryotic genome annotation
                                           pipeline
            Annotation Software Version :: 10.1
            Annotation Method           :: Best-placed RefSeq; Gnomon;
                                           RefSeqFE; cmsearch; tRNAscan-SE
            Features Annotated          :: Gene; mRNA; CDS; ncRNA
            Annotation Date             :: 04/05/2023
            ##Genome-Annotation-Data-END##
FEATURES             Location/Qualifiers
     source          1..195154279
                     /organism="Mus musculus"
                     /mol_type="genomic DNA"
                     /strain="C57BL/6J"
                     /db_xref="taxon:10090"
                     /chromosome="1"
CONTIG      join(gap(100000),gap(10000),gap(2890000),gap(50000),
            NT_039170.9:1..82274824,gap(50000),NT_078297.8:1..109679455,
            gap(100000))
//
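
The CONTIG line in the record above describes how the chromosome is assembled from gaps and component contigs. As a small worked check, the sketch below adds up the pieces and compares the total with the length on the LOCUS line. It is a simplified illustration: the function name contig_length is an assumption, and the pattern handles only the gap(N) and ACCESSION:start..end forms that appear in this record.

```python
import re


def contig_length(contig_expr):
    """Sum the lengths of gap(N) and ACCESSION:start..end pieces in a CONTIG join."""
    total = 0
    # Each match fills either the first group (a gap) or the last two (a range).
    for gap_len, start, end in re.findall(r"gap\((\d+)\)|:(\d+)\.\.(\d+)", contig_expr):
        if gap_len:
            total += int(gap_len)
        else:
            total += int(end) - int(start) + 1  # ranges are inclusive
    return total


# CONTIG expression copied from the NC_000067.7 record above.
contig = (
    "join(gap(100000),gap(10000),gap(2890000),gap(50000),"
    "NT_039170.9:1..82274824,gap(50000),NT_078297.8:1..109679455,"
    "gap(100000))"
)
locus_length = 195154279  # from the LOCUS line

print(contig_length(contig))                  # 195154279
print(contig_length(contig) == locus_length)  # True
```

The gaps contribute 3,200,000 bases and the two NT contigs contribute 191,954,279 bases, which together give the 195,154,279 bp reported on the LOCUS line.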
    

FASTA format

You can also obtain RefSeq sequences in FASTA format. Here is an example, truncated because of its large size:

      >NC_000067.7 Mus musculus strain C57BL/6J chromosome 1, GRCm39
      NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
      NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
      ...
      TATCTTAATTAGTTTTTGAGTTCTCCAAAGCTATTTGCTCTCTGTGTTGTTAACCTGTACAAGACTGAAG
      GTTCTTATTCCTATATCTTATTAATATTCACATTGACATTTTGATGTCTGCTTTCTATATTTTCCTAAAA
      ATATTTTAAAGTACACACTATACAGACTTTTAATTTAATTCAGTTTTCTATTCAGGTAATATATTTTGAT
      CACATTTACCCCTGCTTCAAAGTTGCTAGTATGAGATTATCCTAAATTTTTTATGAAGACACTTATTACT
      ATGAACTCTCCTCCTAGTATTGATTTCATTGGGTCTCATAAGTTTGGGTATGTTGTGAATTTGTTTTCAT
      ...
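
A FASTA record like the one above is simple to read with plain Python. The following sketch splits the header into accession and description and measures the fraction of unknown bases (N). The function names parse_fasta and fraction_unknown are assumptions for this example, and the toy input is a shortened stand-in for the real chromosome, not actual data from the full record.

```python
def parse_fasta(text):
    """Return a list of dicts with accession, description, and sequence."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The accession is the first word of the header line.
            accession, _, description = line[1:].partition(" ")
            records.append(
                {"accession": accession, "description": description, "sequence": []}
            )
        elif records:
            records[-1]["sequence"].append(line)
    for record in records:
        record["sequence"] = "".join(record["sequence"])
    return records


def fraction_unknown(sequence):
    """Fraction of bases reported as N (unknown)."""
    return sequence.upper().count("N") / len(sequence) if sequence else 0.0


# Toy input: the real header with a deliberately shortened sequence.
toy = """\
>NC_000067.7 Mus musculus strain C57BL/6J chromosome 1, GRCm39
NNNNNNNNNN
TATCTTAATT
"""

record = parse_fasta(toy)[0]
print(record["accession"])                   # NC_000067.7
print(fraction_unknown(record["sequence"]))  # 0.5
```

For very large records such as whole chromosomes, reading line by line from a file instead of holding the full text in memory would be the more practical choice.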
    

RefSeqGene


RefSeqGene records form an important subset of RefSeq, designed to be used together with the reference genome. This subset comprises well-defined, complete genes curated as stable reference sequences. Their primary purpose is to establish precise coordinates for all gene regions, including promoters, introns, exons, and flanking regions. RefSeqGene also plays a key role in describing gene mutations and biologically significant variants, such as copy number variations (CNVs) and single nucleotide polymorphisms (SNPs).

In contrast to the broader RefSeq database, RefSeqGene provides gene-specific sequences that present the complete set of regions associated with each gene's structure. These sequences are aligned to the reference chromosomes and represent standard, normal alleles. RefSeqGene thus serves as a baseline reference for gene sequences, supporting precise analysis of genetic elements and variations in the context of the entire genome. This makes it valuable in both research and clinical settings, where it contributes to a deeper understanding of genomic structures and their implications in health and disease.

