ESpritz - BioComputingUP

ESpritz is a protein disorder predictor based on bidirectional recursive neural networks and trained on three different flavors of disorder. It predicts each disorder flavor at two distinct false positive rates, using either a fast approach or a slower, slightly more accurate one. Given its state-of-the-art performance, it can be especially useful for high-throughput applications.

ESpritz is available as a web server under CAID Prediction Portal / Disorder Predictors, or as a standalone executable available for download.

Citing ESpritz

Walsh I, Martin AJ, Di Domenico T, Tosatto SC. ESpritz: accurate and fast prediction of protein disorder. Bioinformatics. 2012 Feb 15;28(4):503-9. doi: 10.1093/bioinformatics/btr682. PMID: 22190692.

Methods

This page provides a description of the methods implemented in ESpritz and a quick reference guide for the accompanying datasets.

Overview

ESpritz is based on an efficient prediction system for finding regions of protein disorder (also called unstructured regions). Disordered protein regions are key to numerous processes within an organism. Experimental annotations remain scarce, with the two most common sources of information being the DisProt and Protein Data Bank (PDB) databases. Determining disorder from the amino acid sequence is a difficult problem, but published methods have nonetheless shown promising results.

Prediction system

In short ESpritz is constructed as follows:

ESpritz is based on Bi-directional Recursive Neural Networks (BRNNs) [1], a machine learning method that does not require sliding windows or any complex sources of information. The method can process proteins on a genomic scale with little effort and state-of-the-art accuracy. We have shown that BRNNs are capable of extracting more information from the protein sequence than static neural networks.
Sequence information: The only source of information for the BRNN is the primary amino acid sequence, optionally supplemented by multiple sequence alignments generated using PSI-BLAST. Although the PSI-BLAST based input improves performance slightly, ESpritz without PSI-BLAST is much faster and only 0-3 percentage points lower in performance. We envisage the main usage of ESpritz being fast, genome-scale processing.
Learning: Learning proceeds by extracting the relevant information from the local context of the residue under consideration using the BRNN. Training used gradient descent together with the backpropagation through structure algorithm [2].
Datasets: Three categories of disorder data are available. To support the reproducibility of experiments, links to the corresponding publicly available datasets are provided here.
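
The idea of a bidirectional network that reads the whole sequence without a sliding window can be illustrated with a minimal sketch. This is not the ESpritz implementation: the weights are random and untrained, and the hidden size, the one-hot encoding, and all function names (one_hot, brnn_predict, and so on) are assumptions made for illustration only. It shows one chain of states running from the N-terminus, another running from the C-terminus, and a per-residue output computed from both chains plus the residue itself.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Encode a residue as a 20-dimensional indicator vector (unknown -> all zeros)."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def init_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def step(w_in, w_state, x, state):
    """One recurrent update: tanh(W_in * x + W_state * previous_state)."""
    a = matvec(w_in, x)
    b = matvec(w_state, state)
    return [math.tanh(p + q) for p, q in zip(a, b)]


def brnn_predict(sequence, hidden=4, seed=0):
    """Return one score in (0, 1) per residue (hypothetical, untrained model)."""
    rng = random.Random(seed)
    n_in = len(AMINO_ACIDS)
    w_f_in, w_f_state = init_matrix(hidden, n_in, rng), init_matrix(hidden, hidden, rng)
    w_b_in, w_b_state = init_matrix(hidden, n_in, rng), init_matrix(hidden, hidden, rng)
    w_out = init_matrix(1, n_in + 2 * hidden, rng)

    inputs = [one_hot(r) for r in sequence]
    n = len(inputs)

    # Forward chain: the state at position i summarises residues 0..i.
    forward, state = [], [0.0] * hidden
    for x in inputs:
        state = step(w_f_in, w_f_state, x, state)
        forward.append(state)

    # Backward chain: the state at position i summarises residues i..n-1.
    backward, state = [None] * n, [0.0] * hidden
    for i in range(n - 1, -1, -1):
        state = step(w_b_in, w_b_state, inputs[i], state)
        backward[i] = state

    # Output: a logistic unit over the residue and both context states.
    scores = []
    for i in range(n):
        z = matvec(w_out, inputs[i] + forward[i] + backward[i])[0]
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return scores
```

Because both chains are computed in a single pass each, the cost grows linearly with sequence length, which is in line with the genome-scale use case described above.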

Datasets

X-ray dataset

X-ray training set (download link): For proteins in the PDB, we defined disordered residues as those whose backbone Cα atoms lack coordinate information. To generate the training set, we downloaded the list of protein chains deposited in the PDB up to 1 May 2008, restricting the selection to X-ray structures with chain lengths between 25 and 2,000 amino acids, a resolution of at most 2.5 Å, and an R-factor not exceeding 25%. Sequence redundancy was reduced using UniqueProt [3] with an HSSP threshold of 0 and the quality-first option, which prioritizes proteins with higher-quality structures. The resulting protein lists were merged and subjected to a further round of redundancy reduction using the same criteria, giving priority to proteins containing disordered regions.
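
The chain selection criteria and the disorder definition above can be sketched as follows. This is an illustrative sketch, not the script used to build the dataset: the function names, the 0-based residue indexing, the "D"/"O" label characters, and the R-factor expressed as a fraction (0.25 for 25%) are all assumptions.

```python
def passes_quality_filters(length, resolution, r_factor):
    """Apply the X-ray selection criteria described in the text.

    length: chain length in residues
    resolution: in angstroms
    r_factor: as a fraction (assumed convention: 0.25 means 25%)
    """
    return 25 <= length <= 2000 and resolution <= 2.5 and r_factor <= 0.25


def label_disorder(sequence, observed_ca_positions):
    """Label each residue 'D' (disordered) if its C-alpha atom has no
    coordinates, otherwise 'O' (ordered).

    observed_ca_positions: 0-based indices into the full chain sequence
    for which C-alpha coordinates are present (hypothetical input format).
    """
    observed = set(observed_ca_positions)
    return "".join("O" if i in observed else "D" for i in range(len(sequence)))
```

For example, a five-residue chain with coordinates only for the three central residues would be labeled "DOOOD" under these assumptions.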

X-ray test set (download link): The test set was generated using the same procedure as the training set, considering proteins deposited in the PDB between 1 May 2008 and 13 September 2010. To ensure independence from the training data, the test and training sets were combined and sequence redundancy was reduced using UniqueProt with the same parameters and options, thereby removing proteins sharing significant sequence homology with the training set.

DisProt dataset

DisProt training set (download link): DisProt [4] is a manually curated database of partially or completely disordered proteins. In this dataset, a residue is considered disordered if it has been annotated as disordered by the DisProt curators in at least one experimental context. All other residues are considered ordered.
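
The labeling rule above (disordered if annotated in at least one experimental context, ordered otherwise) can be sketched as below. The input format is an assumption: regions are given as 1-based, inclusive (start, end) pairs, and the function name and "D"/"O" label characters are hypothetical.

```python
def disprot_labels(length, regions):
    """Return a per-residue label string for a protein of the given length.

    regions: iterable of (start, end) pairs, 1-based and inclusive
    (assumed format). A residue covered by at least one region is 'D';
    every other residue is 'O'. Overlapping regions are handled naturally.
    """
    labels = ["O"] * length
    for start, end in regions:
        # Clip each region to the sequence boundaries before marking it.
        for pos in range(max(start, 1), min(end, length) + 1):
            labels[pos - 1] = "D"
    return "".join(labels)
```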

DisProt test set (download link): This dataset is based on DisProt release 5.7. Each DisProt entry was mapped to one or more PDB entries using the UniProt accession code provided in the DisProt record and the corresponding mappings available through the SIFTS database [5]. The resulting dataset therefore combines information from both DisProt and the PDB. When conflicting annotations are present, DisProt annotations take precedence over all other sources of disorder information.
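
One way to express the precedence rule is sketched below. It assumes that both sources have already been mapped onto the same residue numbering (for example via SIFTS) and that each source provides one label per residue, with None where it says nothing about that residue. The function name, the label values, and the use of None are assumptions for illustration and do not describe the actual dataset files.

```python
def merge_annotations(disprot, pdb):
    """Combine two per-residue label lists, preferring DisProt on conflict.

    disprot, pdb: lists of equal length holding a label (e.g. 'D' or 'O')
    or None where the source has no annotation for that residue
    (hypothetical representation).
    """
    if len(disprot) != len(pdb):
        raise ValueError("annotation lists must cover the same residues")
    # Use the DisProt label whenever one exists; fall back to the PDB label.
    return [d if d is not None else p for d, p in zip(disprot, pdb)]
```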

NMR dataset

NMR training set (download link): NMR mobility and flexibility annotations were calculated using the Mobi server. Mobi employs an algorithm that identifies regions adopting different conformations across the models of an NMR ensemble and was optimized to reproduce the ordered/disordered residue definition used in CASP8. Dataset extraction and sequence redundancy reduction were performed using the same procedure as for the X-ray training set (see above), except that only NMR structures from the PDB were considered and no structural quality filters were applied.

NMR test set (download link): Dataset extraction and sequence redundancy reduction were performed using the same procedure as for the X-ray test set (see above), except that only NMR structures from the PDB were considered and no structural quality filters were applied.

References

[1] Pollastri G, Przybylski D, Rost B, Baldi P. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins. 2002 May 1;47(2):228-35. doi: 10.1002/prot.10082. PMID: 11933069.

[2] Sperduti A, Starita A. Supervised neural networks for the classification of structures. IEEE Trans Neural Netw. 1997;8(3):714-35. doi: 10.1109/72.572108. PMID: 18255672.

[3] Mika S, Rost B. UniqueProt: Creating representative protein sequence sets. Nucleic Acids Res. 2003 Jul 1;31(13):3789-91. doi: 10.1093/nar/gkg620. PMID: 12824419; PMCID: PMC169026.

[4] Nugnes MV, Bouhraoua KEA, Zoubiri M, Pancsa R, Fichó E; DisProt Consortium; Tompa P, Piovesan D, Tosatto SCE, Aspromonte MC. DisProt in 2026: enhancing intrinsically disordered proteins accessibility, deposition, and annotation. Nucleic Acids Res. 2026 Jan 6;54(D1):D383-D392. doi: 10.1093/nar/gkaf1175. PMID: 41249866; PMCID: PMC12807702.

[5] Dana JM, Gutmanas A, Tyagi N, Qi G, O’Donovan C, Martin M, Velankar S. SIFTS: updated Structure Integration with Function, Taxonomy and Sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins. Nucleic Acids Res. 2019 Jan 8;47(D1):D482-D489. doi: 10.1093/nar/gky1114. PMID: 30445541; PMCID: PMC6324003.