Protein phosphorylation is an essential post-translational modification that plays a vital role in regulating many fundamental cellular processes. We propose a LightGBM-based computational approach that uses evolutionary, geometric, sequence-environment, and amino-acid-specific features to identify phosphate binding sites from a protein sequence. Compared with existing methods on 2429 protein sequences taken from the standard Phospho.ELM (P.ELM) benchmark dataset, which covers 11 organisms, our method reports a higher F score of 0.504 (the harmonic mean of precision and recall) and a higher ROC AUC of 0.836 (the area under the receiver operating characteristic curve). The computation time of our approach is much lower than that of the recently developed deep-learning-based framework. Structural analysis of selected protein sequences indicates that our predictions form a superset of the phosphorylation sites annotated in the P.ELM dataset. Our scheme is founded on manual feature engineering and decision-tree-based classification. It is therefore intuitive, and the final tree can be interpreted as a set of rules, giving a deeper understanding of the relationships between biophysical features and phosphorylation sites. Our problem transformation method permits more control over precision and recall: when the output probability of the existing deep learning framework is incorporated as an additional feature, our prediction improves (F score = 0.546; ROC AUC = 0.849). The implementation of our method can be accessed at http://cse.iitkgp.ac.in/~pralay/resources/PPSBoost/ and is mirrored at https://cosmos.iitkgp.ac.in/PPSBoost.
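The two metrics quoted in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy labels, and the toy scores below are all hypothetical, and only the definitions (F score as the harmonic mean of precision and recall, ROC AUC as the area under the ROC curve) come from the abstract. The AUC is computed here through its standard pairwise-ranking interpretation, which is equivalent to the area under the curve.

```python
from typing import List, Sequence


def threshold(scores: Sequence[float], cutoff: float) -> List[int]:
    """Turn per-residue scores into binary calls (1 = predicted site)."""
    return [1 if s >= cutoff else 0 for s in scores]


def f_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 marks an annotated site, scores are model outputs.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.35, 0.2, 0.1]

if __name__ == "__main__":
    for cutoff in (0.5, 0.3):
        print(cutoff, f_score(labels, threshold(scores, cutoff)))
    print("ROC AUC:", roc_auc(labels, scores))
```

Lowering the cutoff trades precision for recall, which is the kind of control over the two quantities the abstract refers to; ROC AUC, by contrast, does not depend on any single cutoff.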
This article was published in the following journal.
We report the observation of single-site phosphorylation in a His-tag sequence N-terminally attached to a recombinant protein (UVI31+) in vitro. This modification was detected at position 23 at a se...
Alternative polyadenylation (polyA) sites near the 3' end of a pre-mRNA create multiple mRNA transcripts with different 3' untranslated regions (3' UTRs). The sequence elements of a 3' UTR are essenti...
To adjust expectations efficiently, prediction errors need to be associated with the precise features that gave rise to the unexpected outcome, but this credit assignment may be problematic if stimuli...
Protein function prediction is one of the major tasks of bioinformatics and can help with a wide range of biological problems, such as understanding disease mechanisms or finding drug targets. Many method...
Identification of splice sites is imperative for prediction of gene structure. Machine learning-based approaches (MLAs) have been reported to be more successful than the rule-based methods for identif...
Computer-aided diagnostic software has been used to assist physicians in various ways. Text-based prediction algorithms have been trained on past medical records through data mining and fe...
The purpose of this trial is to assess device performance with participants in an overnight study to ensure the product meets user and clinical requirements.
The purpose of this study is to assess the effect of Metabolism-boosting Beverages (MBB) containing green tea extract with a standardized amount of epigallocatechin gallate (EGCG) and caff...
The purpose of this study is to determine whether a proprietary 'testosterone-boosting' supplement, when used as recommended by the manufacturer, results in an increase in testosterone lev...
The purpose of this study is to assess whether boosting with cobicistat results in concentrations of darunavir in the brain similar to those achieved by boosting with ritonavir.
A prediction of the probable outcome of a disease based on an individual's condition and the usual course of the disease as seen in similar situations.
The sequence at the 5' end of the messenger RNA that does not code for product. This sequence contains the ribosome binding site and other transcription and translation regulating sequences.
A theoretical representative nucleotide or amino acid sequence in which each nucleotide or amino acid is the one which occurs most frequently at that site in the different sequences which occur in nature. The phrase also refers to an actual sequence which approximates the theoretical consensus. A known CONSERVED SEQUENCE set is represented by a consensus sequence. Commonly observed supersecondary protein structures (AMINO ACID MOTIFS) are often formed by conserved sequences.
Short tracts of DNA sequence that are used as landmarks in GENOME mapping. In most instances, 200 to 500 base pairs of sequence define a Sequence Tagged Site (STS) that is operationally unique in the human genome (i.e., can be specifically detected by the polymerase chain reaction in the presence of all other genomic sequences). The overwhelming advantage of STSs over mapping landmarks defined in other ways is that the means of testing for the presence of a particular STS can be completely described as information in a database.
The prediction or projection of the nature of future problems or existing conditions based upon the extrapolation or interpretation of existing scientific data or by the application of scientific methodology.