Boosting phosphorylation site prediction with sequence feature-based machine learning.

08:00 EDT 14th August 2019 | BioPortfolio

Summary of "Boosting phosphorylation site prediction with sequence feature-based machine learning."

Protein phosphorylation is one of the essential post-translational modifications and plays a vital role in regulating many fundamental cellular processes. We propose a LightGBM-based computational approach that uses evolutionary, geometric, sequence environment, and amino acid-specific features to identify phosphate binding sites from a protein sequence. When compared with existing methods on 2429 protein sequences taken from the standard Phospho.ELM (P.ELM) benchmark dataset, which covers 11 organisms, our method reports a higher F score of 0.504 (the harmonic mean of precision and recall) and a higher ROC AUC of 0.836 (the area under the receiver operating characteristic curve). The computation time of our approach is much lower than that of the recently developed deep learning-based framework. Structural analysis of selected protein sequences indicates that our predictions form a superset of the phosphorylation sites annotated in the P.ELM dataset. The foundation of our scheme is manual feature engineering and decision tree-based classification. Hence, it is intuitive, and one can interpret the final tree as a set of rules, leading to a deeper understanding of the relationships between biophysical features and phosphorylation sites. Our innovative problem transformation method permits more control over precision and recall, as demonstrated by the fact that when we incorporate the output probability of the existing deep learning framework as an additional feature, our prediction improves (F score = 0.546; ROC AUC = 0.849). The implementation of our method can be accessed at and is mirrored at This article is protected by copyright. All rights reserved.
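
The two metrics quoted in the summary can be illustrated with a minimal sketch. The labels and scores below are invented toy values, not data from the paper, and the function names (`f_score`, `roc_auc`, `predict_at_threshold`) are hypothetical helpers written for this illustration only. The sketch also shows how moving the decision threshold on a predicted probability trades precision against recall, which is one simple way to read the "control over precision and recall" mentioned above; it is not the authors' problem transformation method.

```python
def predict_at_threshold(scores, threshold):
    """Turn predicted probabilities into 0/1 labels at a given cutoff."""
    return [1 if s >= threshold else 0 for s in scores]


def f_score(y_true, y_pred):
    """F score: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (assumed values): 1 = phosphorylation site, 0 = non-site.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]

auc = roc_auc(y_true, scores)
f_low = f_score(y_true, predict_at_threshold(scores, 0.5))    # favors recall
f_high = f_score(y_true, predict_at_threshold(scores, 0.85))  # favors precision
print(f"ROC AUC = {auc:.3f}")
print(f"F score at threshold 0.50 = {f_low:.3f}")
print(f"F score at threshold 0.85 = {f_high:.3f}")
```

ROC AUC is independent of the threshold because it depends only on how the scores rank positives against negatives, whereas the F score changes with the cutoff. That is why the summary reports both.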


Journal Details

This article was published in the following journal.

Name: Proteins
ISSN: 1097-0134


PubMed Articles [24833 Associated PubMed Articles listed on BioPortfolio]

Single-site phosphorylation within the His-tag sequence attached to a recombinant protein.

We report the observation of single-site phosphorylation in a His-tag sequence N-terminally attached to a recombinant protein (UVI31+) in vitro. This modification was detected at position 23 at a se...

DeepPASTA: deep neural network based polyadenylation site analysis.

Alternative polyadenylation (polyA) sites near the 3' end of a pre-mRNA create multiple mRNA transcripts with different 3' untranslated regions (3' UTRs). The sequence elements of a 3' UTR are essenti...

Feature-specific prediction errors and surprise across macaque fronto-striatal circuits.

To adjust expectations efficiently, prediction errors need to be associated with the precise features that gave rise to the unexpected outcome, but this credit assignment may be problematic if stimuli...

DeepGOPlus: Improved protein function prediction from sequence.

Protein function prediction is one of the major tasks of bioinformatics that can help in a wide range of biological problems such as understanding disease mechanisms or finding drug targets. Many method...

Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition.

Identification of splice sites is imperative for prediction of gene structure. Machine learning-based approaches (MLAs) have been reported to be more successful than the rule-based methods for identif...

Clinical Trials [7043 Associated Clinical Trials listed on BioPortfolio]

Free Text Prediction Algorithm for Appendicitis

Computer-aided diagnostic software has been used to assist physicians in various ways. Text-based prediction algorithms have been trained on past medical records through data mining and fe...

CPAP Device In-lab Assessment NZ

The purpose of this trial is to assess device performance against participants in an overnight study to ensure the product meets user and clinical requirements.

Evaluation of Metabolism-Boosting Beverages

The purpose of this study is to assess the effect of Metabolism-boosting Beverages (MBB) containing green tea extract with a standardized amount of epigallocatechin gallate (EGCG) and caff...

Endogenous Testosterone Response to a Testosterone Boosting Supplement

The purpose of this study is to determine whether a proprietary 'testosterone-boosting' supplement, when used as recommended by the manufacturer, results in an increase in testosterone lev...

Effect of Cobicistat Versus Ritonavir Boosting on the Brain Permeation of Darunavir in HIV-infected Individuals

The purpose of this study is to assess whether a boosting by cobicistat results in similar concentrations of darunavir in the brain compared to a boosting by ritonavir.

Medical and Biotech [MESH] Definitions

A prediction of the probable outcome of a disease based on an individual's condition and the usual course of the disease as seen in similar situations.

The sequence at the 5' end of the messenger RNA that does not code for product. This sequence contains the ribosome binding site and other transcription and translation regulating sequences.

A theoretical representative nucleotide or amino acid sequence in which each nucleotide or amino acid is the one which occurs most frequently at that site in the different sequences which occur in nature. The phrase also refers to an actual sequence which approximates the theoretical consensus. A known CONSERVED SEQUENCE set is represented by a consensus sequence. Commonly observed supersecondary protein structures (AMINO ACID MOTIFS) are often formed by conserved sequences.

Short tracts of DNA sequence that are used as landmarks in GENOME mapping. In most instances, 200 to 500 base pairs of sequence define a Sequence Tagged Site (STS) that is operationally unique in the human genome (i.e., can be specifically detected by the polymerase chain reaction in the presence of all other genomic sequences). The overwhelming advantage of STSs over mapping landmarks defined in other ways is that the means of testing for the presence of a particular STS can be completely described as information in a database.

The prediction or projection of the nature of future problems or existing conditions based upon the extrapolation or interpretation of existing scientific data or by the application of scientific methodology.
