TLDR: A new deep learning model leverages a pre-trained protein large language model (ESM-2) combined with bidirectional LSTM and GRU networks to accurately predict amyloidogenic regions in peptides and proteins. The method was trained and evaluated on multiple datasets, achieving an accuracy of 83% for hexapeptide classification and competitive performance against existing tools for peptide and whole protein amyloid prediction. This study demonstrates the significant potential of LLMs in advancing amyloid prediction, a critical area in bioinformatics for understanding and treating amyloid-related diseases.
The prediction of amyloidogenicity in peptides and proteins is a crucial area of study in bioinformatics. Amyloid proteins are linked to various diseases, including Alzheimer’s and Parkinson’s, making accurate prediction of their aggregation-prone regions vital for understanding disease mechanisms and developing new therapies.
Traditional methods for predicting amyloidogenicity often rely on features derived from evolutionary motifs or individual amino acid properties. However, recent research highlights the strong predictive power of features derived from the sequence itself. This study, titled "Deep Learning Model for Amyloidogenicity Prediction using a Pre-trained Protein LLM", explores a novel approach that uses contextual features of protein sequences obtained from a pre-trained protein large language model (LLM).
The Approach: Combining LLMs with Deep Learning
The researchers, Zohra YAGOUB and Hafida BOUZIANE, utilized ESM-2, a pre-trained protein LLM built on the Transformer architecture, similar to BERT. ESM-2 is trained on vast protein sequence databases, allowing it to learn intricate patterns and relationships between amino acids. This enables it to capture rich contextual information within protein sequences.
In their method, the team generated numerical representations (embeddings) for protein sequences using ESM-2. These embeddings were then fed into a deep learning model that combines bidirectional Long Short-Term Memory (Bi-LSTM) and bidirectional Gated Recurrent Unit (Bi-GRU) networks. These networks are well suited to sequential data because they read each sequence in both the forward and backward directions, so the representation of every residue reflects the context on both sides of it. Dropout layers were included to reduce overfitting and make the extracted features more robust.
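The pipeline described above can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the stacking order (Bi-LSTM followed by Bi-GRU), the hidden size, the dropout rate, the mean pooling, and the class name AmyloidClassifier are all assumptions made for illustration. The random tensor stands in for ESM-2 embeddings, and the embedding width of 320 corresponds to the smallest public ESM-2 checkpoint, which may not be the one used in the study.

```python
import torch
from torch import nn


class AmyloidClassifier(nn.Module):
    """Illustrative Bi-LSTM + Bi-GRU classifier over per-residue embeddings.

    All hyperparameters are assumptions, not values from the paper.
    """

    def __init__(self, embed_dim=320, hidden=64, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.drop1 = nn.Dropout(dropout)
        # A bidirectional layer outputs 2 * hidden features per residue.
        self.gru = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.drop2 = nn.Dropout(dropout)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, x):
        # x has shape (batch, sequence_length, embed_dim)
        h, _ = self.lstm(x)
        h = self.drop1(h)
        h, _ = self.gru(h)
        h = self.drop2(h)
        pooled = h.mean(dim=1)  # average over residues (assumed pooling)
        return self.out(pooled).squeeze(-1)  # one logit per sequence


model = AmyloidClassifier()
model.eval()  # disable dropout for inference

# Stand-in for ESM-2 embeddings of 4 hexapeptides (6 residues each).
fake_embeddings = torch.randn(4, 6, 320)
with torch.no_grad():
    probs = torch.sigmoid(model(fake_embeddings))
```

Here probs holds one amyloidogenicity score per input sequence. In a real setup the random tensor would be replaced by per-residue embeddings from an actual ESM-2 checkpoint, and the model would be trained with a binary cross-entropy loss.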
Datasets and Evaluation
The model was trained and evaluated using several well-known datasets. For hexapeptides (six-residue fragments often considered ‘hot spots’ for amyloid formation), the WaltzDB 2.0 dataset was used. To assess performance on longer peptides, the Pep-251 dataset was employed. Finally, for identifying amyloidogenic regions within whole protein sequences, the AmyPro27 dataset was utilized.
Promising Results
The proposed method demonstrated competitive performance across all evaluation stages. When classifying hexapeptide sequences, the model achieved an accuracy of 83% on the test dataset and 84.5% on 10-fold cross-validation, showing a good balance between sensitivity and specificity.
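For readers unfamiliar with 10-fold cross-validation, the following sketch shows the general idea in plain Python: the data is split into ten folds, and each fold serves once as the held-out test set while the other nine are used for training. The function name k_fold_indices and the sample count are hypothetical, and the study may have used a stratified or otherwise different splitting scheme.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Hypothetical dataset size, chosen only to make the example concrete.
splits = list(k_fold_indices(n_samples=100, k=10))
```

The reported cross-validation accuracy would then be the average of the accuracies measured on the ten held-out folds.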
On the Pep-251 dataset, which was used to compare the model against other state-of-the-art methods for amyloid peptide identification, it achieved an accuracy of 80.8%. Its sensitivity was 73.4% and its specificity 84.3%, so it outperformed several existing tools in specificity while maintaining strong sensitivity.
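Accuracy, sensitivity, and specificity all come from the same confusion-matrix counts. The sketch below shows how they are typically computed for a binary classifier; the function name binary_metrics and the toy labels are made up for illustration and are not data from the study.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Avoid division by zero when a class is absent.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # fraction of amyloids found
        "specificity": ratio(tn, tn + fp),  # fraction of non-amyloids rejected
    }


# Toy labels: 1 = amyloidogenic, 0 = non-amyloidogenic.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Because accuracy is a class-frequency-weighted mix of sensitivity and specificity, a dataset with more negatives than positives pulls accuracy toward the specificity value.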
For identifying aggregation-prone regions in whole proteins, the method showed balanced performance with Segment Overlap (SOV) scores of 53.1 for amyloid-prone regions and 54.1 for non-amyloid-prone regions, resulting in an average SOV of 53.6. The near-equal scores indicate that the model handles both types of regions similarly, without significant bias toward either.
Conclusion and Future Directions
This research highlights the significant potential of using contextual features from pre-trained protein large language models like ESM-2 for accurately identifying amyloidogenic regions. The combination with advanced deep learning architectures such as Bi-LSTM and Bi-GRU provides a robust and accurate tool for distinguishing between amyloidogenic and non-amyloidogenic sequences.
The authors suggest that future work could involve evaluating the model’s performance with other pre-trained protein LLMs and integrating additional sequence- and structure-based features to further enhance predictive accuracy.


