ProteinBERT

ProteinBERT is a pretrained deep language model for protein sequences. It predicts protein properties including Gene Ontology annotations, structural features, post-translational modifications, and biophysical attributes, and was pretrained on approximately 106 million proteins from UniRef90.


Key Features:

  • Pretraining scheme: Combines masked language modeling with Gene Ontology (GO) annotation prediction to learn biologically informed sequence representations.
  • Pretraining dataset: Pretrained on approximately 106 million protein sequences from the UniRef90 database.
  • Architectural elements: Combines local (per-residue) and global (per-protein) representations, allowing end-to-end processing of both kinds of input and output and handling of very long sequences.
  • Efficiency and performance: Reaches near state-of-the-art performance, and in some cases exceeds it, on multiple benchmarks while using a smaller and faster model than competing deep-learning methods.
  • Fine-tuning capabilities: Can be fine-tuned for diverse protein-related prediction tasks with limited labeled data and minimal training time.
  • Implementation: Built using TensorFlow/Keras.
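The dual-task pretraining scheme listed above can be illustrated with a minimal, standard-library sketch of how one training example might be built: the sequence and its GO annotation vector are both corrupted, and the model is trained to recover the clean versions. The function names and the noise probabilities here are hypothetical and chosen for illustration only; they are not taken from the ProteinBERT code base. Corruption is done by random residue substitution as an assumption; replacing residues with a dedicated mask token would be an equally valid variant.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def corrupt_sequence(seq, token_noise=0.05, rng=random):
    """Replace each residue by a random amino acid with probability token_noise."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < token_noise else aa
        for aa in seq
    )


def corrupt_annotations(go_vector, drop_prob=0.25, add_prob=0.001, rng=random):
    """Randomly hide true GO labels and add a few spurious ones."""
    corrupted = []
    for label in go_vector:
        if label == 1:
            corrupted.append(0 if rng.random() < drop_prob else 1)
        else:
            corrupted.append(1 if rng.random() < add_prob else 0)
    return corrupted


def make_pretraining_example(seq, go_vector, rng=random):
    """Return (corrupted inputs, clean targets) for the two joint tasks."""
    inputs = (
        corrupt_sequence(seq, rng=rng),
        corrupt_annotations(go_vector, rng=rng),
    )
    targets = (seq, list(go_vector))
    return inputs, targets
```

Passing a seeded `random.Random` instance as `rng` makes the corruption reproducible, which is convenient when debugging a data pipeline.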

Scientific Applications:

  • Protein structure prediction: Predicts protein structural properties relevant to function and molecular interactions.
  • Post-translational modifications (PTMs): Predicts PTMs, which reflect regulatory and functional changes to proteins.
  • Biophysical attribute prediction: Predicts biophysical properties such as stability and behavior under varying conditions.

Methodology:

Pretraining combines masked language modeling with Gene Ontology annotation prediction on approximately 106 million UniRef90 sequences. The architecture integrates local and global representations for end-to-end processing and long-sequence handling. The model is implemented in TensorFlow/Keras.
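To make the idea of interacting local and global representations more concrete, here is a heavily simplified NumPy sketch of one update step. It is not the ProteinBERT architecture: the real model is built from TensorFlow/Keras layers, whereas this sketch keeps only the information flow between a per-residue track and a per-protein track. All names (`block`, `init_params`, the parameter keys) and dimensions are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def init_params(d_local, d_global, seed=0):
    """Random projection weights; their shapes do not depend on sequence length."""
    rng = np.random.default_rng(seed)
    return {
        "g2l": rng.normal(scale=0.1, size=(d_global, d_local)),
        "score": rng.normal(scale=0.1, size=(d_local,)),
        "l2g": rng.normal(scale=0.1, size=(d_local, d_global)),
    }


def block(local, global_, params):
    """One simplified update of the two tracks.

    local:   (L, d_local) array, one row per residue
    global_: (d_global,) array, one vector for the whole protein
    """
    # Global -> local: project the global vector and broadcast it to every position.
    local = local + global_ @ params["g2l"]
    # Local -> global: attention-style weighted pooling over positions, then project.
    weights = softmax(local @ params["score"])   # (L,), sums to 1
    pooled = weights @ local                     # (d_local,)
    global_ = global_ + pooled @ params["l2g"]   # (d_global,)
    return local, global_


if __name__ == "__main__":
    params = init_params(d_local=8, d_global=16)
    for length in (10, 1000):
        local, global_ = block(np.zeros((length, 8)), np.zeros(16), params)
        print(length, local.shape, global_.shape)
```

Because none of the weight shapes depend on the sequence length `L`, the same parameters apply to short and very long inputs, which is one way to read the long-sequence claim above.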

Details

Tool Type: command-line tool
Programming Languages: Python
Added: 11/29/2021
Last Updated: 11/29/2021

Publications

Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M. ProteinBERT: A universal deep-learning model of protein sequence and function. bioRxiv [Preprint]. 2021. doi:10.1101/2021.05.24.445464.