Pfam
Pfam is a database that classifies protein sequences into families and domains. It provides curated and automatically generated alignments and Hidden Markov Models for annotation and analysis in structural biology, genomics, and proteomics.
Key Features:
- Pfam-A: Contains well-characterized protein domain families with manually checked seed alignments and Hidden Markov Models (HMMs) that carry permanent accession numbers and form a library for sequence searching and automatic annotation of new proteins.
- Pfam-B: Provides an automatically generated supplement of novel sequence clusters not yet matched by a Pfam family; the version described in the 2021 publication is based on MMseqs2 clustering and contains 136,730 clusters.
- Release and coverage: Pfam 29.0 contains 16,295 entries, and Pfam has maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%.
- Reference proteomes basis: Reorganized to use UniProtKB reference proteomes as the primary sequence basis, reporting matches on a smaller, more stable set of sequences while retaining access to model organisms.
- Representative proteome alignments: Family alignments are provided based on four different representative proteome sequence datasets.
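The coverage figure quoted above can be read as the fraction of sequences that carry at least one family match. The sketch below works under that reading and also reports residue coverage (the fraction of residues inside any match), which is a different measure. The function name `coverage`, the tuple layout, and the sample identifiers are hypothetical and are not part of any Pfam interface.

```python
from typing import Dict, Iterable, Tuple

# (sequence ID, family label, start, end); coordinates are 1-based and inclusive.
Match = Tuple[str, str, int, int]


def coverage(seq_lengths: Dict[str, int], matches: Iterable[Match]) -> Tuple[float, float]:
    """Return (sequence coverage, residue coverage) as fractions in [0, 1]."""
    covered = {seq_id: set() for seq_id in seq_lengths}
    for seq_id, _family, start, end in matches:
        if seq_id in covered:
            # Clamp to the sequence bounds and store positions in a set so
            # that overlapping matches are not counted twice.
            lo = max(1, start)
            hi = min(end, seq_lengths[seq_id])
            covered[seq_id].update(range(lo, hi + 1))
    total_residues = sum(seq_lengths.values())
    if not seq_lengths or total_residues == 0:
        return 0.0, 0.0
    seq_cov = sum(1 for pos in covered.values() if pos) / len(seq_lengths)
    res_cov = sum(len(pos) for pos in covered.values()) / total_residues
    return seq_cov, res_cov


# Hypothetical toy data: three sequences, two of which have matches.
lengths = {"seqA": 100, "seqB": 50, "seqC": 50}
toy_matches = [
    ("seqA", "FAM_A", 1, 60),
    ("seqA", "FAM_B", 41, 80),   # overlaps the first match on seqA
    ("seqB", "FAM_A", 10, 29),
]
seq_cov, res_cov = coverage(lengths, toy_matches)
```

Sequence coverage is the more lenient of the two measures: a single short domain match counts the whole protein as covered.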
Scientific Applications:
- Protein annotation: Automatic and HMM-based annotation of new protein sequences across genomes and proteomes.
- Discovery of novel families: Identification and classification of previously unannotated proteins and novel family memberships, as exemplified in the Caenorhabditis elegans genome project.
- Pathogen proteome analysis: Support for analysis of viral proteomes, including studies of the SARS-CoV-2 proteome.
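HMM-based annotation of this kind is commonly run by searching query sequences against a profile library and then reading a per-domain hit table. The sketch below assumes the search was done with HMMER's `hmmscan` and its `--domtblout` option; this entry does not name the search software, so that choice is an assumption. `DomainHit`, `parse_domtblout`, the E-value cutoff, and the sample rows are hypothetical and do not model Pfam's own curated thresholds.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class DomainHit:
    family: str      # target (profile) name
    accession: str   # target accession
    query: str       # query sequence name
    i_evalue: float  # independent E-value of this domain
    score: float     # bit score of this domain
    ali_from: int    # alignment start on the query sequence
    ali_to: int      # alignment end on the query sequence


def parse_domtblout(lines: Iterable[str], max_evalue: float = 1e-3) -> List[DomainHit]:
    """Parse per-domain table rows, keeping domains at or below max_evalue."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # 22 whitespace-separated columns followed by a free-text description.
        cols = line.split(None, 22)
        if len(cols) < 22:
            continue  # malformed row
        hit = DomainHit(
            family=cols[0],
            accession=cols[1],
            query=cols[3],
            i_evalue=float(cols[12]),
            score=float(cols[13]),
            ali_from=int(cols[17]),
            ali_to=int(cols[18]),
        )
        if hit.i_evalue <= max_evalue:
            hits.append(hit)
    # Order hits along each query so they read as a domain architecture.
    return sorted(hits, key=lambda h: (h.query, h.ali_from))


# Hypothetical rows in the domtblout column layout (not real search output).
sample = [
    "# target name  accession  tlen  query name ...",
    "FamA PF_X.1 100 seq1 - 300 1e-30 100.0 0.1 1 2 1e-20 2e-20 60.5 0.0 1 100 10 110 8 112 0.95 Hypothetical family A",
    "FamB PF_Y.1 80 seq1 - 300 1e-5 20.0 0.1 1 1 0.5 0.9 5.0 0.0 1 80 200 270 198 272 0.70 Weak hit",
    "FamA PF_X.1 100 seq1 - 300 1e-30 100.0 0.1 2 2 1e-12 3e-12 40.0 0.0 1 100 150 250 148 252 0.93 Hypothetical family A",
]
hits = parse_domtblout(sample)
```

Filtering on the per-domain independent E-value rather than the full-sequence E-value keeps weak individual domains out of the result even when the sequence as a whole scores well.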
Methodology:
Pfam-A families start from manually checked seed alignments, from which Hidden Markov Models are built. Pfam-B sequence families are generated by MMseqs2 clustering. Matches are reported against UniProtKB reference proteomes, and family alignments are provided for representative proteome datasets.
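The idea of turning a seed alignment into a statistical model that can score new sequences can be illustrated with a deliberately simplified stand-in. The sketch below builds a position-specific log-odds profile from an ungapped toy alignment and slides it along a sequence. It is not a profile HMM (there are no insert or delete states and no transition probabilities) and it is not the method Pfam uses. The function names, the uniform background, the pseudocount, and the sequences are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seed: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-column log-odds scores (in bits) against a uniform background."""
    if not seed or len({len(row) for row in seed}) != 1:
        raise ValueError("seed alignment must be non-empty with equal-length rows")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*seed):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for aa in column:
            if aa in counts:  # gap characters and unknown residues are skipped
                counts[aa] += 1.0
        total = sum(counts.values())
        profile.append(
            {aa: math.log2((n / total) / background) for aa, n in counts.items()}
        )
    return profile


def best_window(profile: List[Dict[str, float]], sequence: str) -> Tuple[float, int]:
    """Slide the profile along the sequence; return (best score, 0-based start)."""
    width = len(profile)
    best = (float("-inf"), -1)  # returned unchanged if the sequence is too short
    for start in range(len(sequence) - width + 1):
        score = sum(
            profile[i].get(sequence[start + i], 0.0) for i in range(width)
        )
        best = max(best, (score, start))
    return best


# Hypothetical four-column seed alignment and a query containing the motif.
toy_seed = ["ACDE", "ACDE", "ACDF"]
toy_profile = build_profile(toy_seed)
best_score, best_start = best_window(toy_profile, "GGGACDEGGG")
```

A real profile HMM additionally models insertions and deletions relative to the seed alignment, which is what allows it to match family members of varying length.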
Details
- License: CC0-1.0
- Tool Type: api, web application
- Operating Systems: Linux, Windows, Mac
- Added: 2/4/2015
- Last Updated: 11/24/2024
Operations
- Query and retrieval
Publications
Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research. 2015;44(D1):D279-D285. doi:10.1093/nar/gkv1344. PMID:26673716. PMCID:PMC4702930.
Sonnhammer EL, Eddy SR, Durbin R. Pfam: A comprehensive database of protein domain families based on seed alignments. Proteins: Structure, Function, and Genetics. 1997;28(3):405-420. doi:10.1002/(sici)1097-0134(199707)28:3<405::aid-prot10>3.0.co;2-l. PMID:9223186.
Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, Tosatto SCE, Paladin L, Raj S, Richardson LJ, Finn RD, Bateman A. Pfam: The protein families database in 2021. Nucleic Acids Research. 2020;49(D1):D412-D419. doi:10.1093/nar/gkaa913. PMID:33125078. PMCID:PMC7779014.
Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, Sonnhammer ELL, Tate J, Punta M. Pfam: the protein families database. Nucleic Acids Research. 2013;42(D1):D222-D230. doi:10.1093/nar/gkt1223. PMID:24288371. PMCID:PMC3965110.