ProtTrans

ProtTrans leverages pre-trained Transformer-based protein language models to generate embeddings from protein sequences for downstream predictions of secondary structure, sub-cellular localization, and solubility.


Key Features:

  • Diverse Model Architectures: ProtTrans employs Transformer-XL, XLNet, BERT, ALBERT, ELECTRA, and T5 models for protein sequence representation learning.
  • Large-scale Unlabeled Pretraining: Models are trained on large-scale unlabeled protein sequence corpora (UniRef and BFD) to learn statistical and biophysical sequence patterns.
  • Embedding Generation: Produces dense per-residue and per-protein embeddings that encode biophysical features of proteins for use in downstream prediction tasks (see the sketch after this list).
  • Alignment-free Predictions: Enables predictions without relying on multiple sequence alignments or explicit evolutionary information.
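
As a concrete illustration of embedding generation, here is a minimal sketch using the Hugging Face transformers interface to one of the released ProtTrans checkpoints (ProtT5-XL). The checkpoint name "Rostlab/prot_t5_xl_half_uniref50-enc", the example sequence, and the mean-pooling step for the per-protein embedding are illustrative choices, not the only supported usage.

    import re
    import torch
    from transformers import T5Tokenizer, T5EncoderModel

    checkpoint = "Rostlab/prot_t5_xl_half_uniref50-enc"   # illustrative choice
    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = T5Tokenizer.from_pretrained(checkpoint, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(checkpoint).to(device).eval()
    if device == "cpu":
        model = model.float()   # the half-precision weights need a GPU

    sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # example sequence
    # ProtT5 expects space-separated residues; map rare amino acids to X.
    prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))

    inputs = tokenizer(prepared, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(input_ids=inputs.input_ids,
                    attention_mask=inputs.attention_mask)

    # Drop the trailing </s> token; keep one 1024-d vector per residue.
    residue_emb = out.last_hidden_state[0, : len(sequence)]   # (L, 1024)
    protein_emb = residue_emb.mean(dim=0)                     # (1024,)
    print(residue_emb.shape, protein_emb.shape)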

Scientific Applications:

  • Secondary Structure Prediction: Predicts per-residue secondary structure with a reported three-state accuracy (Q3) of 81%-87%, without using multiple sequence alignments or evolutionary information.
  • Sub-cellular Localization: Predicts sub-cellular location with a reported ten-state accuracy of 81%.
  • Solubility and Membrane Classification: Differentiates membrane from water-soluble proteins with a reported two-state accuracy of 91% (a downstream classifier sketch follows this list).
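
To illustrate how such per-protein predictions are typically built on top of the embeddings, here is a hedged sketch of a two-state classifier (e.g., membrane vs. water-soluble) using scikit-learn. The random arrays are placeholders for real inputs, which would be mean-pooled ProtTrans embeddings and their labels.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Placeholders: real inputs would be (N, 1024) mean-pooled ProtT5
    # embeddings and 0/1 labels (membrane vs. water-soluble).
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(200, 1024))
    labels = rng.integers(0, 2, size=200)

    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.2, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("two-state accuracy:", accuracy_score(y_test, clf.predict(X_test)))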

Methodology:

ProtTrans trains Transformer-XL, XLNet, BERT, ALBERT, ELECTRA, and T5 protein language models on large-scale unlabeled protein sequence data from UniRef and BFD (up to 393 billion amino acids) using self-supervised objectives, producing embeddings that capture biophysical features for downstream predictive tasks; a toy illustration of the masked-reconstruction objective follows.
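
The following is a toy sketch of the self-supervised objective behind the BERT-style models: mask a fraction of residues and train a model to reconstruct them from context. The vocabulary, the 15% masking rate, and the embedding-plus-linear stand-in for a full Transformer encoder are simplifications for illustration only.

    import torch
    import torch.nn as nn

    VOCAB = "ACDEFGHIKLMNPQRSTVWYX"   # 20 standard amino acids plus X
    stoi = {aa: i for i, aa in enumerate(VOCAB)}
    MASK_ID = len(VOCAB)              # extra [MASK] token id

    def mask_tokens(ids, mask_prob=0.15):
        """Mask ~15% of positions; unmasked labels become -100 (ignored)."""
        labels = ids.clone()
        mask = torch.rand(ids.shape) < mask_prob
        if not mask.any():
            mask[0, 0] = True         # ensure at least one masked position
        labels[~mask] = -100
        masked = ids.clone()
        masked[mask] = MASK_ID
        return masked, labels

    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    ids = torch.tensor([[stoi[aa] for aa in seq]])
    inp, labels = mask_tokens(ids)

    # Stand-in "language model": an embedding plus a linear head instead of
    # a full Transformer; the loss is the masked-token cross-entropy that
    # drives BERT-style pretraining.
    model = nn.Sequential(nn.Embedding(len(VOCAB) + 1, 64),
                          nn.Linear(64, len(VOCAB)))
    logits = model(inp)
    loss = nn.CrossEntropyLoss()(logits.view(-1, len(VOCAB)), labels.view(-1))
    print(float(loss))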

Details

License: AFL-3.0
Cost: Free of charge
Tool Type: Command-line tool
Operating Systems: Mac, Linux, Windows
Programming Languages: Python
Added: 1/10/2022
Last Updated: 1/10/2022

Publications

Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M, Bhowmik D, Rost B. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2022;44(10):7112-7127. doi:10.1109/tpami.2021.3095381. PMID:34232869.

Funding:
  • National Research Foundation of Korea: 2019R1A6A1A10073437, NRF-2020M3A9G7103933
  • U.S. Department of Energy: DE-AC05-00OR22725
