ProteinBERT
ProteinBERT is a pretrained deep language model for protein sequences. It predicts sequence-level properties, including Gene Ontology (GO) annotations, structural features, post-translational modifications, and biophysical attributes, and was pretrained on approximately 106 million proteins from UniRef90.
Key Features:
- Pretraining scheme: Combines masked language modeling with Gene Ontology (GO) annotation prediction to learn biologically informed sequence representations.
- Pretraining dataset: Pretrained on approximately 106 million protein sequences from the UniRef90 database.
- Architectural elements: Integrates local and global representations to enable end-to-end processing and handling of very long input sequences.
- Efficiency and performance: Obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks while using a far smaller and faster model than competing deep-learning methods.
- Fine-tuning capabilities: Can be fine-tuned for diverse protein-related prediction tasks with limited labeled data and minimal training time.
- Implementation: Built using TensorFlow/Keras.
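The dual-task pretraining scheme listed above can be illustrated with a minimal sketch of input corruption: some residues are hidden for the model to recover, and some GO annotations are dropped for the model to restore. This is a simplified illustration, not the authors' implementation. The function names, the `MASK` symbol, and the corruption rates are all hypothetical choices made for the example.

```python
import random

# Hypothetical placeholder symbol for a hidden residue (an assumption for this sketch).
MASK = "?"


def corrupt_sequence(seq, token_rate=0.05, rng=None):
    """Hide a random subset of residues; return the corrupted string and the hidden targets."""
    if rng is None:
        rng = random.Random()
    corrupted = []
    targets = {}  # position -> original residue the model should recover
    for i, residue in enumerate(seq):
        if rng.random() < token_rate:
            corrupted.append(MASK)
            targets[i] = residue
        else:
            corrupted.append(residue)
    return "".join(corrupted), targets


def corrupt_annotations(annotations, drop_rate=0.25, rng=None):
    """Zero out a random subset of a binary GO annotation vector."""
    if rng is None:
        rng = random.Random()
    return [a if rng.random() >= drop_rate else 0 for a in annotations]


if __name__ == "__main__":
    rng = random.Random(0)
    seq = "MKTAYIAKQRQISFVKSHFSRQ"
    go_vector = [1, 0, 1, 1, 0]  # toy annotation vector, one entry per GO term
    print(corrupt_sequence(seq, token_rate=0.2, rng=rng))
    print(corrupt_annotations(go_vector, rng=rng))
```

During pretraining, a model would receive the corrupted sequence and annotation vector as input and be trained to reproduce the uncorrupted versions.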
Scientific Applications:
- Structural property prediction: Predicts protein structural properties relevant to function and molecular interactions.
- Post-translational modifications (PTMs): Predicts PTMs, informing the study of regulatory and functional protein modifications.
- Biophysical attribute prediction: Predicts biophysical properties such as stability and behavior under varying conditions.
Methodology:
Pretraining combines masked language modeling with Gene Ontology annotation prediction on ~106 million UniRef90 sequences. The architecture integrates local and global representations for end-to-end processing and long-sequence handling. The model is implemented in TensorFlow/Keras.
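One way to picture how the two pretraining tasks combine is as a single objective with a per-residue term and a per-annotation term. The sketch below assumes categorical cross-entropy for residue prediction, binary cross-entropy for GO annotations, and an unweighted sum of the two. These choices and all function names are assumptions made for illustration and are not taken from the ProteinBERT code.

```python
import math

EPS = 1e-12  # guards against log(0)


def token_cross_entropy(pred_probs, true_index):
    """Categorical cross-entropy for one residue position."""
    return -math.log(max(pred_probs[true_index], EPS))


def annotation_bce(pred_probs, true_labels):
    """Mean binary cross-entropy over a GO annotation vector."""
    total = 0.0
    for p, y in zip(pred_probs, true_labels):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(true_labels)


def pretraining_loss(token_preds, token_targets, go_preds, go_targets):
    """Unweighted sum of the sequence loss and the annotation loss (an assumption)."""
    seq_loss = sum(
        token_cross_entropy(p, t) for p, t in zip(token_preds, token_targets)
    ) / len(token_targets)
    go_loss = annotation_bce(go_preds, go_targets)
    return seq_loss + go_loss


if __name__ == "__main__":
    # Two residue positions over a toy 3-letter alphabet, and three GO terms.
    token_preds = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    token_targets = [0, 1]
    go_preds = [0.9, 0.2, 0.6]
    go_targets = [1, 0, 1]
    print(pretraining_loss(token_preds, token_targets, go_preds, go_targets))
```

In a real training setup the two terms could be weighted differently; the equal weighting here only keeps the example short.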
Details
- Tool Type:
- command-line tool
- Programming Languages:
- Python
- Added:
- 11/29/2021
- Last Updated:
- 11/29/2021
Publications
Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M. ProteinBERT: A universal deep-learning model of protein sequence and function. bioRxiv. 2021. doi:10.1101/2021.05.24.445464.