ProtTrans

ProtTrans leverages pre-trained Transformer-based protein language models to generate embeddings from protein sequences for downstream predictions of secondary structure, sub-cellular localization, and solubility.


Key Features:

  • Diverse Model Architectures: ProtTrans employs Transformer-XL, XLNet, BERT, ALBERT, ELECTRA, and T5 models for protein sequence representation learning.
  • Large-scale Unlabeled Pretraining: Models are trained on large-scale unlabeled protein sequence corpora (UniRef and BFD) to learn statistical and biophysical sequence patterns.
  • Embedding Generation: Produces dense per-residue and per-protein embeddings that encode biophysical features of proteins for use in downstream prediction tasks (see the sketch after this list).
  • Alignment-free Predictions: Enables predictions without relying on multiple sequence alignments or explicit evolutionary information.
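
As a concrete illustration of embedding generation, here is a minimal sketch using the Hugging Face transformers interface to one of the released ProtTrans checkpoints (ProtT5-XL). The checkpoint name "Rostlab/prot_t5_xl_half_uniref50-enc", the example sequence, and the mean-pooling step for the per-protein embedding are illustrative choices, not the only supported usage.

    import re
    import torch
    from transformers import T5Tokenizer, T5EncoderModel

    checkpoint = "Rostlab/prot_t5_xl_half_uniref50-enc"   # illustrative choice
    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = T5Tokenizer.from_pretrained(checkpoint, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(checkpoint).to(device).eval()
    if device == "cpu":
        model = model.float()   # the half-precision weights need a GPU

    sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # example sequence
    # ProtT5 expects space-separated residues; map rare amino acids to X.
    prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))

    inputs = tokenizer(prepared, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(input_ids=inputs.input_ids,
                    attention_mask=inputs.attention_mask)

    # Drop the trailing </s> token; keep one 1024-d vector per residue.
    residue_emb = out.last_hidden_state[0, : len(sequence)]   # (L, 1024)
    protein_emb = residue_emb.mean(dim=0)                     # (1024,)
    print(residue_emb.shape, protein_emb.shape)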

Scientific Applications:

  • Secondary Structure Prediction: Predicts per-residue secondary structure with a reported three-state accuracy (Q3) of 81%-87%, without using multiple sequence alignments or evolutionary information.
  • Sub-cellular Localization: Predicts sub-cellular location with a reported ten-state accuracy of 81%.
  • Solubility and Membrane Classification: Differentiates membrane from water-soluble proteins with a reported two-state accuracy of 91% (a downstream classifier sketch follows this list).
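
To illustrate how such per-protein predictions are typically built on top of the embeddings, here is a hedged sketch of a two-state classifier (e.g., membrane vs. water-soluble) using scikit-learn. The random arrays are placeholders for real inputs, which would be mean-pooled ProtTrans embeddings and their labels.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Placeholders: real inputs would be (N, 1024) mean-pooled ProtT5
    # embeddings and 0/1 labels (membrane vs. water-soluble).
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(200, 1024))
    labels = rng.integers(0, 2, size=200)

    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.2, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("two-state accuracy:", accuracy_score(y_test, clf.predict(X_test)))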

Methodology:

ProtTrans trains Transformer-XL, XLNet, BERT, ALBERT, ELECTRA, and T5 protein language models on large-scale unlabeled protein sequence data from UniRef and BFD (up to 393 billion amino acids) using self-supervised objectives, producing embeddings that capture biophysical features for downstream predictive tasks; a toy illustration of the masked-reconstruction objective follows.
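
The following is a toy sketch of the self-supervised objective behind the BERT-style models: mask a fraction of residues and train a model to reconstruct them from context. The vocabulary, the 15% masking rate, and the embedding-plus-linear stand-in for a full Transformer encoder are simplifications for illustration only.

    import torch
    import torch.nn as nn

    VOCAB = "ACDEFGHIKLMNPQRSTVWYX"   # 20 standard amino acids plus X
    stoi = {aa: i for i, aa in enumerate(VOCAB)}
    MASK_ID = len(VOCAB)              # extra [MASK] token id

    def mask_tokens(ids, mask_prob=0.15):
        """Mask ~15% of positions; unmasked labels become -100 (ignored)."""
        labels = ids.clone()
        mask = torch.rand(ids.shape) < mask_prob
        if not mask.any():
            mask[0, 0] = True         # ensure at least one masked position
        labels[~mask] = -100
        masked = ids.clone()
        masked[mask] = MASK_ID
        return masked, labels

    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    ids = torch.tensor([[stoi[aa] for aa in seq]])
    inp, labels = mask_tokens(ids)

    # Stand-in "language model": an embedding plus a linear head instead of
    # a full Transformer; the loss is the masked-token cross-entropy that
    # drives BERT-style pretraining.
    model = nn.Sequential(nn.Embedding(len(VOCAB) + 1, 64),
                          nn.Linear(64, len(VOCAB)))
    logits = model(inp)
    loss = nn.CrossEntropyLoss()(logits.view(-1, len(VOCAB)), labels.view(-1))
    print(float(loss))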

Details

License: AFL-3.0
Cost: Free of charge
Tool Type: Command-line tool
Operating Systems: Mac, Linux, Windows
Programming Languages: Python
Added: 1/10/2022
Last Updated: 1/10/2022

Publications

Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M, Bhowmik D, Rost B. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2022;44(10):7112-7127. doi:10.1109/tpami.2021.3095381. PMID:34232869.

Funding:
  • National Research Foundation of Korea: 2019R1A6A1A10073437, NRF-2020M3A9G7103933
  • U.S. Department of Energy: DE-AC05-00OR22725
