Given the current cost-effectiveness of next-generation sequencing, the amount of DNA-seq and RNA-seq data generated is ever increasing. One of the primary objectives of NGS experiments is calling genetic variants. While highly accurate, most variant calling pipelines are not optimized to run efficiently on large data sets. However, as variant calling in genomic data has become common practice, several methods have been proposed to reduce runtime for DNA-seq analysis through the use of parallel computing. Determining the effectively expressed variants from transcriptomics (RNA-seq) data has only recently become possible and, as such, does not yet benefit from efficiently parallelized workflows. We introduce Halvade-RNA, a parallel, multi-node RNA-seq variant calling pipeline based on the GATK Best Practices recommendations. Halvade-RNA uses the MapReduce programming model to create and manage parallel data streams on which multiple instances of existing tools such as STAR and GATK operate concurrently. Whereas single-threaded processing of a typical RNA-seq sample requires ∼28h, Halvade-RNA reduces this runtime to ∼2h using a small cluster with two 20-core machines. Even on a single multi-core workstation, Halvade-RNA can significantly reduce runtime compared to multi-threading alone, thus providing more cost-effective processing of RNA-seq data. Halvade-RNA is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of Hadoop distributions, including Cloudera and Amazon EMR.
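The map/reduce pattern described above can be sketched in plain Java. This is an illustrative sketch only, not Halvade-RNA's actual implementation: Halvade-RNA runs tools such as STAR and GATK inside Hadoop MapReduce tasks, whereas here a local thread pool stands in for the cluster and a trivial string transformation stands in for per-chunk alignment and variant calling. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

// Sketch of MapReduce-style parallelism: input reads are split into chunks,
// each chunk is processed independently ("map"), and the per-chunk results
// are merged into one output ("reduce").
public class ChunkedMapReduceSketch {

    // "Map" phase: each chunk of reads is processed concurrently.
    // toUpperCase() is a placeholder for the real per-chunk work.
    static List<String> mapPhase(List<List<String>> chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, chunks.size()));
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> chunk : chunks) {
                futures.add(pool.submit(() -> chunk.stream()
                        .map(String::toUpperCase)
                        .collect(Collectors.toList())));
            }
            List<String> mapped = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                mapped.addAll(f.get()); // preserves chunk submission order
            }
            return mapped;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    // "Reduce" phase: merge per-chunk results into a single output,
    // analogous to merging per-region variant calls into one VCF.
    static String reducePhase(List<String> mapped) {
        return String.join(",", mapped);
    }

    public static void main(String[] args) {
        List<List<String>> chunks = List.of(List.of("acg", "tta"), List.of("ggc"));
        System.out.println(reducePhase(mapPhase(chunks))); // ACG,TTA,GGC
    }
}
```

Because the chunks are independent, the map phase scales with the number of available cores, which is the property Halvade-RNA exploits across cluster nodes.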
This article was published in the following journal.
Name: PLoS ONE
Any single nucleotide variant detection study could benefit from a fast and cheap method of measuring the quality of a variant call list. It is advantageous to be able to see how the call list quality i...
Insertions and deletions (INDELs) comprise a significant proportion of human genetic variation, and recent papers have revealed that many human diseases may be attributable to INDELs. With the develop...
The application of next-generation sequencing in research and particularly in clinical routine requires valid variant calling results. However, evaluation of several commonly used tools has pointed ou...
De novo mutations (i.e., newly occurring mutations) are a predominant cause of sporadic dominant monogenic diseases and play a significant role in the genetics of complex disorders. De novo mutation s...
PCR-based DNA enrichment followed by massively parallel sequencing is a straightforward and cost-effective method to sequence genes to high depth. The full potential of amplicon-based sequencing as...
Transcriptomics is the study of how RNA is expressed under specific conditions. Transcriptomic analyses of lesional skin biopsies can be a useful way to track how a patient responds to a d...
Discovery of differences in the host response in patients with systemic inflammation and sepsis, and identification of novel, specific markers by using a longitudinal clinico-transcriptomi...
Systemic lupus erythematosus is an autoimmune disease whose activity remains very difficult to evaluate because of the heterogeneity of the clinical and biological symptom...
Although migraine is a common disorder, its pathogenesis is still not fully elucidated. Studying transcriptomic and biochemical changes during induced and spontaneous migraine...
A reduced content of FODMAPs (fermentable oligosaccharides, disaccharides, monosaccharides, and polyols) in the diet may be beneficial for patients with the diarrheal variant of IBS, but so far f...
Information application based on a variety of coding methods to minimize the amount of data to be stored, retrieved, or transmitted. Data compression can be applied to various forms of data, such as images and signals. It is used to reduce costs and increase efficiency in the maintenance of large volumes of data.
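As a minimal illustration of the coding methods described above (generic, and not tied to any specific tool mentioned on this page), the following sketch compresses and losslessly restores a repetitive byte sequence using the DEFLATE algorithm from Java's standard java.util.zip package:

```java
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Demonstrates lossless compression: repetitive input shrinks substantially
// and is recovered byte-for-byte on decompression.
public class CompressionDemo {

    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        // Generous buffer: DEFLATE output can slightly exceed incompressible input.
        byte[] buffer = new byte[input.length + 64];
        int length = deflater.deflate(buffer);
        deflater.end();
        byte[] out = new byte[length];
        System.arraycopy(buffer, 0, out, 0, length);
        return out;
    }

    static byte[] decompress(byte[] compressed, int originalLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[originalLength];
        try {
            inflater.inflate(out);
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        } finally {
            inflater.end();
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] original = "ACGT".repeat(256).getBytes(); // 1024 highly repetitive bytes
        byte[] compressed = compress(original);
        byte[] restored = decompress(compressed, original.length);
        System.out.println(original.length + " -> " + compressed.length
                + " bytes, lossless: " + java.util.Arrays.equals(original, restored));
    }
}
```

Highly repetitive data such as the example above compresses far below its original size, which is why compression is routinely applied to large stored data volumes.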
Various units or machines that operate in combination or in conjunction with a computer but are not physically part of it. Peripheral devices typically display computer data, store data from the computer and return the data to the computer on demand, prepare data for human use, or acquire data from a source and convert it to a form usable by a computer. (Computer Dictionary, 4th ed.)
The science and art of collecting, summarizing, and analyzing data that are subject to random variation. The term is also applied to the data themselves and to the summarization of the data.
Systematic gathering of data for a particular purpose from various sources, including questionnaires, interviews, observation, existing records, and electronic devices. The process is usually preliminary to statistical analysis of the data.
Devices capable of receiving data, retaining data for an indefinite or finite period of time, and supplying data upon demand.
DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. During DNA sequencing, the bases of a small fragment of DNA are sequentially identified from signals emitted as each fragment is re-synthesized from a ...
Bioinformatics is the application of computer software and hardware to the management of biological data to create useful information. Computers are used to gather, store, analyze and integrate biological and genetic information which can then be applied...