ProtTrans
ProtTrans leverages pre-trained Transformer-based protein language models to generate embeddings from protein sequences for downstream predictions of secondary structure, sub-cellular localization, and solubility.
Key Features:
- Diverse Model Architectures: ProtTrans employs Transformer-XL, XLNet, BERT, ALBERT, ELECTRA, and T5 models for protein sequence representation learning.
- Large-scale Unlabeled Pretraining: Models are trained on large-scale unlabeled protein sequence datasets to learn statistical and biophysical sequence patterns.
- Embedding Generation: Produces dense embeddings that encode biophysical features of proteins for use in downstream prediction tasks.
- Alignment-free Predictions: Enables predictions without relying on multiple sequence alignments or explicit evolutionary information.
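As a sketch of how sequences are typically prepared for these models (assuming the Hugging Face checkpoints published by the ProtTrans authors): residues are upper-cased, rare or ambiguous amino acids (U, Z, O, B) are mapped to X, and residues are space-separated before tokenization. The helper below is a hypothetical illustration of that preprocessing convention, not part of the ProtTrans codebase itself.

```python
import re

def preprocess_sequences(sequences):
    """Prepare raw amino-acid sequences for a ProtTrans-style tokenizer.

    Rare/ambiguous residues (U, Z, O, B) are mapped to X, and residues
    are separated by spaces, matching the convention used by the
    published ProtTrans checkpoints.
    """
    cleaned = []
    for seq in sequences:
        seq = seq.upper()
        seq = re.sub(r"[UZOB]", "X", seq)    # map rare residues to X
        cleaned.append(" ".join(seq))        # space-separate residues
    return cleaned

# Example: two short toy sequences
print(preprocess_sequences(["PRTEINU", "seqwence"]))
# prints ['P R T E I N X', 'S E Q W E N C E']
```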
Scientific Applications:
- Secondary Structure Prediction: Predicts protein secondary structure with high accuracy without using multiple sequence alignments or evolutionary information.
- Sub-cellular Localization: Predicts sub-cellular locations with reported ten-state accuracy of 81%.
- Solubility and Membrane Classification: Differentiates membrane versus water-soluble proteins with a reported two-state accuracy of 91%.
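Downstream predictors in this setting are typically lightweight models trained on frozen embeddings. As a toy illustration only (random vectors stand in for real ProtTrans embeddings, and the paper's own predictors are small neural networks rather than this method), a two-state membrane-versus-soluble classifier could be sketched as a nearest-centroid rule over per-protein embeddings:

```python
import numpy as np

def fit_centroids(embeddings, labels):
    """Compute one mean embedding (centroid) per class label."""
    labels = np.asarray(labels)
    return {c: embeddings[labels == c].mean(axis=0) for c in sorted(set(labels))}

def predict(centroids, embedding):
    """Assign the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda c: np.linalg.norm(embedding - centroids[c]))

# Toy data: 1024-dim "embeddings" clustered around two class means
rng = np.random.default_rng(0)
membrane = rng.normal(loc=1.0, size=(20, 1024))
soluble = rng.normal(loc=-1.0, size=(20, 1024))
X = np.vstack([membrane, soluble])
y = ["membrane"] * 20 + ["soluble"] * 20

centroids = fit_centroids(X, y)
print(predict(centroids, rng.normal(loc=1.0, size=1024)))
# prints "membrane" for these well-separated toy clusters
```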
Methodology:
Transformer-XL, XLNet, BERT, ALBERT, ELECTRA, and T5 protein language models are trained on large-scale unlabeled protein sequence data (UniRef and BFD); the resulting embeddings capture biophysical features of proteins and serve as input to downstream predictive tasks.
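These models emit one embedding per residue; for sequence-level tasks such as localization or solubility, the per-residue embeddings are commonly pooled into a single fixed-length vector per protein. A minimal NumPy sketch of mean pooling, assuming per-residue embeddings have already been computed (the array shapes here are illustrative):

```python
import numpy as np

def mean_pool(residue_embeddings, lengths):
    """Average per-residue embeddings into one vector per protein.

    residue_embeddings: (batch, max_len, dim) array, zero-padded.
    lengths: true sequence length of each protein in the batch.
    """
    pooled = []
    for emb, n in zip(residue_embeddings, lengths):
        pooled.append(emb[:n].mean(axis=0))  # ignore padding positions
    return np.stack(pooled)

# Toy batch: 2 proteins, max length 4, embedding dim 3
batch = np.zeros((2, 4, 3))
batch[0, :2] = [[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]]  # protein of length 2
batch[1, :4] = 1.0                                 # protein of length 4
print(mean_pool(batch, [2, 4])[0])
# prints [2. 2. 2.]
```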
Details
- License: AFL-3.0
- Cost: Free of charge
- Tool Type: Command-line tool
- Operating Systems: Mac, Linux, Windows
- Programming Languages: Python
- Added: 1/10/2022
- Last Updated: 1/10/2022
Publications
Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M, Bhowmik D, Rost B. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2022;44(10):7112-7127. doi:10.1109/tpami.2021.3095381. PMID:34232869.