Journals
Journal:
BIOINFORMATICS
ISSN:
1367-4803
Year:
2022
Vol.:
39
No.:
9
Pages:
2488 - 2495
Motivation: An important step in the transcriptomic analysis of individual cells involves manually determining the cellular identities. To ease this labor-intensive annotation of cell types, there has been a growing interest in automated cell annotation, which can be achieved by training classification algorithms on previously annotated datasets. Existing pipelines employ dataset integration methods to remove potential batch effects between source (annotated) and target (unannotated) datasets. However, the integration and classification steps are usually independent of each other and performed by different tools. We propose JIND (joint integration and discrimination for automated single-cell annotation), a neural-network-based framework for automated cell-type identification that performs integration in a space suitably chosen to facilitate cell classification. To account for batch effects, JIND performs a novel asymmetric alignment in which unseen cells are mapped onto the previously learned latent space, avoiding the need to retrain the classification model for new datasets. JIND also learns cell-type-specific confidence thresholds to identify cells that cannot be reliably classified. Results: We show on several batched datasets that JIND's joint approach to integration and classification outperforms existing pipelines in accuracy, and that a smaller fraction of cells is rejected as unlabeled as a result of the cell-type-specific confidence thresholds. Moreover, we investigate cells misclassified by JIND and provide evidence suggesting that the misclassifications could be due to outliers in the annotated datasets or errors in the original approach used to annotate the target batch.
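The rejection step based on cell-type-specific confidence thresholds can be illustrated with a minimal sketch. This is not JIND's implementation: the classifier probabilities, threshold values and the "Unassigned" label below are assumptions made for illustration only.

import numpy as np

def assign_with_rejection(probs, cell_types, thresholds):
    """Assign each cell to its most probable type, or mark it "Unassigned"
    when the top probability falls below that type's learned threshold."""
    best = probs.argmax(axis=1)                       # most probable type per cell
    conf = probs[np.arange(len(probs)), best]         # its probability
    labels = np.array(cell_types, dtype=object)[best]
    labels[conf < np.array(thresholds)[best]] = "Unassigned"
    return labels

# toy usage: two cells, three candidate types, assumed thresholds of 0.8
probs = np.array([[0.92, 0.05, 0.03],
                  [0.40, 0.35, 0.25]])
print(assign_with_rejection(probs, ["T cell", "B cell", "NK"], [0.8, 0.8, 0.8]))
# -> ['T cell' 'Unassigned']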
Journal:
COMMUNICATIONS BIOLOGY
ISSN:
2399-3642
Year:
2022
Vol.:
5
No.:
1
Pages:
351
Single-cell RNA-Sequencing has the potential to provide deep biological insights by revealing complex regulatory interactions across diverse cell phenotypes at single-cell resolution. However, current single-cell gene regulatory network inference methods produce a single regulatory network per input dataset, limiting their capability to uncover complex regulatory relationships across related cell phenotypes. We present SimiC, a single-cell gene regulatory inference framework that overcomes this limitation by jointly inferring distinct, but related, gene regulatory dynamics per phenotype. We show that SimiC uncovers key regulatory dynamics missed by previously proposed methods across a range of systems, both model and non-model alike. In particular, SimiC was able to uncover CAR T cell dynamics after tumor recognition and key regulatory patterns on a regenerating liver, and was able to implicate glial cells in the generation of distinct behavioral states in honeybees. SimiC hence establishes a new approach to quantitating regulatory architectures between distinct cellular phenotypes, with far-reaching implications for systems biology.
Authors:
Pages-Zamora, A. (Corresponding author); Ochoa, Idoia; Ruiz-Cavero, G.; et al.
Journal:
PATTERN RECOGNITION
ISSN:
0031-3203
Year:
2022
Vol.:
129
Pages:
108721
Unsupervised ensemble learning refers to methods devised for a particular task that combine data provided by decision learners taking into account their reliability, which is usually inferred from the data. Here, the variant calling step of next generation sequencing technologies is formulated as an unsupervised ensemble classification problem. A variant calling algorithm based on the expectation-maximization algorithm is further proposed that estimates the maximum-a-posteriori decision among a number of classes larger than the number of different labels provided by the learners. Experimental results with real human DNA sequencing data show that the proposed algorithm is competitive compared to state-of-the-art variant callers such as GATK, HTSLIB, and Platypus.
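As a generic illustration of unsupervised ensemble classification via expectation-maximization, the sketch below implements a minimal Dawid-Skene-style estimator: each learner's confusion matrix and the per-item class posteriors are re-estimated in alternation, and the maximum-a-posteriori class is returned. This is a textbook construction, not the variant caller proposed in the paper; note only that, as in the paper, the number of latent classes may differ from the number of distinct labels emitted by the learners.

import numpy as np

def em_ensemble(labels, n_classes, n_iters=50, seed=0):
    """Dawid-Skene-style EM for unsupervised ensemble classification.
    labels: (n_items, n_learners) integer matrix of labels emitted by the learners.
    Returns the MAP class index per item (class IDs are arbitrary)."""
    n_items, n_learners = labels.shape
    n_obs = int(labels.max()) + 1                    # size of the observed label alphabet
    q = np.random.default_rng(seed).dirichlet(np.ones(n_classes), size=n_items)
    for _ in range(n_iters):
        # M-step: class priors and one confusion matrix per learner
        prior = q.mean(axis=0)
        conf = np.ones((n_learners, n_classes, n_obs))   # Laplace smoothing
        for j in range(n_learners):
            for l in range(n_obs):
                conf[j, :, l] += q[labels[:, j] == l].sum(axis=0)
        conf /= conf.sum(axis=2, keepdims=True)
        # E-step: posterior over the latent classes for every item
        logq = np.tile(np.log(prior), (n_items, 1))
        for j in range(n_learners):
            logq += np.log(conf[j][:, labels[:, j]]).T
        q = np.exp(logq - logq.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
    return q.argmax(axis=1)

# three learners label five items; the third learner is noisier than the others
votes = np.array([[0, 0, 1],
                  [1, 1, 0],
                  [0, 0, 0],
                  [1, 1, 1],
                  [0, 0, 1]])
print(em_ensemble(votes, n_classes=2))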
Journal:
NATURE COMMUNICATIONS
ISSN:
2041-1723
Year:
2021
Vol.:
12
No.:
1
Pages:
2204
Intra-tumor heterogeneity renders the identification of somatic single-nucleotide variants (SNVs) a challenging problem. In particular, low-frequency SNVs are hard to distinguish from sequencing artifacts. While the increasing availability of multi-sample tumor DNA sequencing data holds the potential for more accurate variant calling, there is a lack of high-sensitivity multi-sample SNV callers that utilize these data. Here we report Moss, a method to identify low-frequency SNVs that recur in multiple sequencing samples from the same tumor. Moss provides any existing single-sample SNV caller the ability to support multiple samples with little additional time overhead. We demonstrate that Moss improves recall while maintaining high precision in a simulated dataset. On multi-sample hepatocellular carcinoma, acute myeloid leukemia and colorectal cancer datasets, Moss identifies new low-frequency variants that meet manual review criteria and are consistent with the tumor's mutational signature profile. In addition, Moss detects the presence of variants in more samples of the same tumor than reported by the single-sample caller. Moss' improved sensitivity in SNV calling will enable more detailed downstream analyses in cancer genomics. The study of tumour heterogeneity can be improved by sequencing multiple samples, but currently available variant callers have not been tailored to integrate them. Here the authors present Moss, a tool that can leverage multiple samples to improve somatic variant calling in different cancers.
Authors:
Dufort y Álvarez, G.; Seroussi, G.; Smircich, P.; et al.
Journal:
BIOINFORMATICS
ISSN:
1367-4803
Year:
2021
Vol.:
37
No.:
24
Pages:
4862 - 4864
Motivation: Nanopore sequencing technologies are rapidly gaining popularity, in part, due to the massive amounts of genomic data they produce in short periods of time (up to 8.5 TB of data in <72 h). To reduce the costs of transmission and storage, efficient compression methods for this type of data are needed. Results: We introduce RENANO, a reference-based lossless data compressor specifically tailored to FASTQ files generated with nanopore sequencing technologies. RENANO improves on its predecessor ENANO, currently the state of the art, by providing a more efficient base call sequence compression component. Two compression algorithms are introduced, corresponding to the following scenarios: (1) a reference genome is available without cost to both the compressor and the decompressor and (2) the reference genome is available only on the compressor side, and a compacted version of the reference is included in the compressed file. We compare the compression performance of RENANO against ENANO on several publicly available nanopore datasets. RENANO improves the base call sequences compression of ENANO by 39.8% in scenario (1), and by 33.5% in scenario (2), on average, over all the datasets. As for total file compression, the average improvements are 12.7% and 10.6%, respectively. We also show that RENANO consistently outperforms the recent general-purpose genomic compressor Genozip.
Journal:
BIOINFORMATICS
ISSN:
1367-4803
Year:
2021
Vol.:
37
No.:
21
Pages:
3923 - 3925
Motivation: Mass spectrometry (MS) data, used for proteomics and metabolomics analyses, have seen considerable growth in the last years. Aiming at reducing the associated storage costs, dedicated compression algorithms for MS data have been proposed, such as MassComp and MSNumpress. However, these algorithms focus on either lossless or lossy compression, respectively, and do not exploit the additional redundancy existing across scans contained in a single file. We introduce mspack, a compression algorithm for MS data that exploits this additional redundancy and that supports both lossless and lossy compression, as well as the mzML and the legacy mzXML formats. mspack applies several preprocessing lossless transforms and optional lossy transforms with a configurable error, followed by the general purpose compressors gzip or bsc to achieve a higher compression ratio. Results: We tested mspack on several datasets generated by commonly used MS instruments. When used with the bsc compression backend, mspack achieves on average 76% smaller file sizes for lossless compression and 94% smaller file sizes for lossy compression, as compared with the original files. Lossless mspack achieves 10-60% lower file sizes than MassComp, and lossy mspack compresses 36-60% better than the lossy MSNumpress, for the same error, while exhibiting comparable accuracy and running time. Supplementary information: Supplementary data are available at Bioinformatics online.
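The combination described above, a lossy transform with a configurable error followed by a general-purpose backend, can be sketched as follows. This is not mspack's actual transform: zlib stands in for gzip/bsc, and the uniform quantizer and function names are assumptions used only to illustrate the error guarantee.

import struct
import zlib
import numpy as np

def compress_intensities(values, max_error):
    """Quantize floats so each value is recoverable to within +/- max_error,
    then hand the integer indices to a general-purpose backend (zlib here)."""
    step = 2.0 * max_error                             # uniform quantization step
    idx = np.round(np.asarray(values, dtype=float) / step).astype(np.int64)
    return zlib.compress(struct.pack("<d", step) + idx.tobytes(), level=9)

def decompress_intensities(blob):
    raw = zlib.decompress(blob)
    step = struct.unpack("<d", raw[:8])[0]
    idx = np.frombuffer(raw[8:], dtype=np.int64)
    return idx * step                                  # error bounded by max_error

data = np.random.default_rng(1).uniform(0, 1e6, size=10_000)
blob = compress_intensities(data, max_error=0.5)
assert np.max(np.abs(decompress_intensities(blob) - data)) <= 0.5
print(f"{data.nbytes} bytes raw -> {len(blob)} bytes compressed")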
Authors:
Alvarez, G. D. Y.; Seroussi, G.; Smircich, P.; et al.
Journal:
BIOINFORMATICS
ISSN:
1367-4803
Year:
2020
Vol.:
36
No.:
16
Pages:
4506 - 4507
Motivation: The amount of genomic data generated globally is seeing explosive growth, leading to increasing needs for processing, storage and transmission resources, which motivates the development of efficient compression tools for these data. Work so far has focused mainly on the compression of data generated by short-read technologies. However, nanopore sequencing technologies are rapidly gaining popularity due to the advantages offered by the large increase in the average size of the produced reads, the reduction in their cost and the portability of the sequencing technology. We present ENANO (Encoder for NANOpore), a novel lossless compression algorithm especially designed for nanopore sequencing FASTQ files. Results: The main focus of ENANO is on the compression of the quality scores, as they dominate the size of the compressed file. ENANO offers two modes, Maximum Compression and Fast (default), which trade off compression efficiency and speed. We tested ENANO, the current state-of-the-art compressor SPRING and the general compressor pigz on several publicly available nanopore datasets. The results show that the proposed algorithm consistently achieves the best compression performance (in both modes) on every considered nanopore dataset, with an average improvement over pigz and SPRING of >24.7% and 6.3%, respectively. In addition, in terms of encoding and decoding speeds, ENANO is 2.9x and 1.7x faster than SPRING, respectively, with memory consumption up to 0.2 GB.
Authors:
Voges, J.; Paridaens, T.; Müntefering, F.; et al.
Journal:
BIOINFORMATICS
ISSN:
1367-4803
MOTIVATION: In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the MPEG-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data.
RESULTS: We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM.
AVAILABILITY: The GABAC library is written in C++. We also provide a command line application which exercises all features provided by the library. GABAC can be downloaded from https://github.com/mitogen/gabac.
SUPPLEMENTARY INFORMATION: Supplementary data, including a complete list of the funding mechanisms and acknowledgements, are available at Bioinformatics online.
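One of the building blocks mentioned in the GABAC abstract, binarization, maps integer symbols to bit strings before the arithmetic coder assigns each bit a context model. The sketch below shows truncated unary binarization, a common scheme in CABAC-style codecs; whether GABAC uses this exact variant, and the function names used here, are assumptions made for illustration.

def truncated_unary(value, cmax):
    """Truncated unary (TU) binarization: 'value' ones followed by a terminating
    zero; the terminator is omitted when value equals cmax."""
    if not 0 <= value <= cmax:
        raise ValueError("value out of range")
    bits = [1] * value
    if value < cmax:
        bits.append(0)
    return bits

def truncated_unary_decode(bits, cmax):
    """Inverse mapping; returns the decoded value and the remaining bits."""
    value = 0
    while value < cmax and bits[value] == 1:
        value += 1
    consumed = value if value == cmax else value + 1
    return value, bits[consumed:]

for v in range(4):
    code = truncated_unary(v, cmax=3)
    assert truncated_unary_decode(code, cmax=3)[0] == v
    print(v, code)        # 0 [0] / 1 [1, 0] / 2 [1, 1, 0] / 3 [1, 1, 1]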
Journal:
BIOINFORMATICS
ISSN:
1367-4803
Year:
2020
Vol.:
36
No.:
18
Pages:
4810 - 4812
Motivation: Sequencing data are often summarized at different annotation levels for further analysis, generally using the general feature format (GFF) or its descendants, gene transfer format (GTF) and GFF3. Existing utilities for accessing these files, like gffutils and gffread, do not focus on reducing the storage space, significantly increasing it in some cases. We propose GPress, a framework for querying GFF files in a compressed form. GPress can also incorporate and compress expression files from both bulk and single-cell RNA-Seq experiments, supporting simultaneous queries on both the GFF and expression files. In brief, GPress applies transformations to the data which are then compressed with the general lossless compressor BSC. To support queries, GPress compresses the data in blocks and creates several index tables for fast retrieval. Results: We tested GPress on several GFF files of different organisms, and showed that it achieves on average a 61% reduction in size with respect to gzip (the current de facto compressor for GFF files) while being able to retrieve all annotations for a given identifier or a range of coordinates in a few seconds (when run in a common laptop). In contrast, gffutils provides faster retrieval but doubles the size of the GFF files. When additionally linking an expression file, we show that GPress can reduce its size by more than 68% when compared to gzip (for both bulk and single-cell RNA-Seq experiments), while still retrieving the information within seconds. Finally, applying BSC to the data streams generated by GPress instead of to the original file shows a size reduction of more than 44% on average.
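A minimal sketch of the block-plus-index idea described above: records are grouped into fixed-size blocks, each block is compressed independently (bz2 stands in for BSC), and an index table maps each record identifier to its block, so a query only decompresses one block. The file layout, block size and use of bz2 are assumptions for illustration, not GPress's actual format.

import bz2

BLOCK_SIZE = 1000  # records per block (an assumption, not GPress's value)

def build_archive(records):
    """records: iterable of (identifier, line) pairs.
    Returns compressed blocks plus an index mapping id -> (block_no, offset)."""
    blocks, index, current = [], {}, []
    for rec_id, line in records:
        index[rec_id] = (len(blocks), len(current))
        current.append(line)
        if len(current) == BLOCK_SIZE:
            blocks.append(bz2.compress("\n".join(current).encode()))
            current = []
    if current:
        blocks.append(bz2.compress("\n".join(current).encode()))
    return blocks, index

def query(blocks, index, rec_id):
    """Decompress only the block holding rec_id and return its record."""
    block_no, offset = index[rec_id]
    return bz2.decompress(blocks[block_no]).decode().split("\n")[offset]

recs = [(f"gene{i}", f"gene{i}\tchr1\t{i * 100}\t{i * 100 + 500}") for i in range(5000)]
blocks, index = build_archive(recs)
print(query(blocks, index, "gene4321"))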
Journal:
BIOINFORMATICS
ISSN:
1367-4803
Year:
2020
Vol.:
36
No.:
8
Pages:
2328 - 2336
MOTIVATION: Variants identified by current genomic analysis pipelines contain many incorrectly called variants. These can be potentially eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF). However, these methods are very user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on decision tree ensemble methods that overcomes the main drawbacks of VQSR and HF. Contrary to these methods, we treat filtering as a supervised learning problem, using variant call data with known "true" variants, i.e., gold standard, for training. Once trained, VEF can be directly applied to filter the variants contained in a given VCF file (we consider training and testing VCF files generated with the same tools, as we assume they will share feature characteristics).
RESULTS: For the analysis, we used Whole Genome Sequencing (WGS) human datasets for which the gold standards are available. We show on these data that the proposed filtering tool VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, when the training and testing datasets differ in coverage, and when sequencing pipelines other than GATK are used. Finally, since the training needs to be performed only once, there is a significant saving in running time when compared to VQSR (approximately 4 versus 50 minutes for filtering the SNPs of a WGS Human
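The supervised-filtering formulation described above can be sketched with a generic decision tree ensemble. The synthetic features, the scikit-learn RandomForestClassifier and the 0.5 probability cutoff are assumptions used only for illustration; VEF's actual feature set and ensemble are described in the paper.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy feature matrix: one row per called variant; columns play the role of VCF
# annotations such as QD, FS, MQ (the feature choice here is an assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 4))
# Labels come from comparing the training calls against a gold standard:
# 1 = true variant, 0 = incorrectly called variant.
y_train = (X_train[:, 0] + 0.5 * X_train[:, 2] + rng.normal(size=5000) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Filtering a new call set from the same pipeline: keep high-probability variants.
X_new = rng.normal(size=(10, 4))
keep = clf.predict_proba(X_new)[:, 1] >= 0.5
print(keep)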
Journal:
SCIENTIFIC REPORTS
ISSN:
2045-2322
Noise in genomic sequencing data is known to have effects on various stages of genomic data analysis pipelines. Variant identification is an important step of many of these pipelines, and is increasingly being used in clinical settings to aid medical practices. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual-chromosome and whole genome sequencing (WGS) datasets. In the WGS dataset, denoising led to identification of almost 2,000 additional true variants, and elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance.
Journal:
ANNUAL REVIEW OF BIOMEDICAL DATA SCIENCE
ISSN:
2574-3414
Year:
2019
Vol.:
2
Pages:
19 - 37
Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology, in terms of both efficiency and affordability. These developments have allowed many to envision whole-genome sequencing as an invaluable tool for both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic data sets are being generated. This poses a significant challenge for the storage and transmission of these data. Already, it is more expensive to store genomic data for a decade than it is to obtain the data in the first place. This situation calls for efficient representations of genomic information. In this review, we emphasize the need for designing specialized compressors tailored to genomic data and describe the main solutions already proposed. We also give general guidelines for storing these data and conclude with our thoughts on the future of genomic formats and compressors.
Journal:
BMC BIOINFORMATICS
ISSN:
1471-2105
Year:
2019
Vol.:
20
No.:
368
Background: Mass Spectrometry (MS) is a widely used technique in biology research, and has become key in proteomics and metabolomics analyses. As a result, the amount of MS data has significantly increased in recent years. For example, the MS repository MassIVE contains more than 123 TB of data. Somewhat surprisingly, these data are stored uncompressed, hence incurring a significant storage cost. Efficient representation of these data is therefore paramount to lessen the burden of storage and facilitate their dissemination. Results: We present MassComp, a lossless compressor optimized for the numerical (m/z)-intensity pairs that account for most of the MS data. We tested MassComp on several MS datasets and show that it delivers on average a 46% reduction in the size of the numerical data, and up to 89%. These results correspond to an average improvement of more than 27% when compared to the general compressor gzip and of 40% when compared to the state-of-the-art numerical compressor FPC. When tested on entire files retrieved from the MassIVE repository, MassComp achieves on average a 59% size reduction. MassComp is written in C++ and freely available at https://github.com/iochoa/MassComp. Conclusions: The compression performance of MassComp demonstrates its potential to significantly reduce the footprint of MS data, and shows the benefits of designing specialized compression algorithms tailored to MS data. MassComp is an addition to the family of omics compression algorithms designed to less
Journal:
BIOINFORMATICS
ISSN:
1367-4803
Year:
2019
Vol.:
35
No.:
15
Pages:
2674 - 2676
Motivation
High-Throughput Sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to completely exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack support for one or more crucial properties, such as support for variable length reads, scalability to high coverage datasets, pairing-preserving compression and lossless compression.
Results
In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools, for example, SPRING compresses 195 GB of 25× whole genome human FASTQ from Illumina's NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources.
Journal:
BIOINFORMATICS
ISSN:
1367-4811
The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed, and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files. The proposed algorithm does not use any reference sequences for compression, and permits the user to choose from several lossy modes to improve the overall compression ratio, depending on the specific needs. We demonstrate through extensive simulations that FaStore achieves a significant improvement in compression ratio with respect to previously proposed algorithms for this task. In addition, we perform an analysis on the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains, while preserving the essential genomic information and without affecting the variant calling performance.
Journal:
BIOINFORMATICS
ISSN:
1367-4803
Year:
2018
Vol.:
34
No.:
15
Pages:
2654 - 2656
Motivation
DNA methylation is one of the most important epigenetic mechanisms in cells that exhibits a significant role in controlling gene expressions. Abnormal methylation patterns have been associated with cancer, imprinting disorders and repeat-instability diseases. As inexpensive bisulfite sequencing approaches have led to significant efforts in acquiring methylation data, problems of data storage and management have become increasingly important. The de facto compression method for methylation data is gzip, which is a general purpose compression algorithm that does not cater to the special format of methylation files. We propose METHCOMP, a new compression scheme tailor-made for bedMethyl files, which supports random access.
Results
We tested the METHCOMP algorithm on 24 bedMethyl files retrieved from four randomly selected ENCODE assays. Our findings reveal that METHCOMP offers an average compression ratio improvement over gzip of up to 7.5x. As an example, METHCOMP compresses a 48 GB file to only 0.9 GB, which corresponds to a 98% reduction in size.
Journal:
BRIEFINGS IN BIOINFORMATICS
ISSN:
1467-5463
Year:
2017
Vol.:
18
No.:
2
Pages:
183 - 194
Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear.
Bioinformatic algorithms to identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets and simulated data, we are able to analyze how accurate the output of the variant calling is, both for the original data and that previously lossily compressed. We show that lossy compression can significantly alleviate the storage while maintaining variant calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant calling performance tha
Journal:
BIOINFORMATICS
ISSN:
1367-4803
Year:
2016
Vol.:
32
No.:
7
Pages:
1115 - 1117
Motivation: Data compression is crucial in effective handling of genomic data. Among several recently published algorithms, ERGC seems to be surprisingly good, easily beating all of the competitors.
Results: We evaluated ERGC and the previously proposed algorithms GDC and iDoComp, which are the ones used in the original paper for comparison, on a wide dataset including 12 assemblies of the human genome (instead of only four of them in the original paper). ERGC wins only when one of the genomes (reference or target) contains mixed-case letters (which is the case for only the two Korean genomes). In all other cases ERGC is on average an order of magnitude worse than GDC and iDoComp.
Journal:
BIOINFORMATICS
ISSN:
1367-4803
Year:
2016
Vol.:
32
No.:
17
Pages:
i479 - i486
Motivation
The dramatic decrease in the cost of sequencing has resulted in the generation of huge amounts of genomic data, as evidenced by projects such as the UK10K and the Million Veteran Project, with the number of sequenced genomes ranging in the order of 10 K to 1 M. Due to the large redundancies among genomic sequences of individuals from the same species, most of the medical research deals with the variants in the sequences as compared with a reference sequence, rather than with the complete genomic sequences. Consequently, millions of genomes represented as variants are stored in databases. These databases are constantly updated and queried to extract information such as the common variants among individuals or groups of individuals. Previous algorithms for compression of this type of databases lack efficient random access capabilities, rendering querying the database for particular variants and/or individuals extremely inefficient, to the point where compression is often relinquished altogether.
Results
We present a new algorithm for this task, called GTRAC, that achieves significant compression ratios while allowing fast random access over the compressed database. For example, GTRAC is able to compress a Homo sapiens dataset containing 1092 samples in 1.1 GB (compression ratio of 160), while allowing for decompression of specific samples in less than a second and decompression of specific variants in 17 ms. GTRAC uses and adapts techniques from information theory,
Journal:
BIOINFORMATICS
ISSN:
1367-4803
Year:
2015
Vol.:
31
No.:
19
Pages:
3122 - 3129
Motivation
Recent advancements in sequencing technology have led to a drastic reduction in the cost of sequencing a genome. This has generated an unprecedented amount of genomic data that must be stored, processed and transmitted. To facilitate this effort, we propose a new lossy compressor for the quality values presented in genomic data files (e.g. FASTQ and SAM files), which comprise roughly half of the storage space (in the uncompressed domain). Lossy compression allows for compression of data beyond its lossless limit.
Results
The proposed algorithm QVZ exhibits better rate-distortion performance than the previously proposed algorithms, for several distortion metrics and for the lossless case. Moreover, it allows the user to define any quasi-convex distortion function to be minimized, a feature not supported by the previous algorithms. Finally, we show that QVZ-compressed data exhibit better performance in the genotyping than data compressed with previously proposed algorithms, in the sense that for a similar rate, a genotyping closer to that achieved with the original quality values is obtained.
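A worked example of the kind of scalar quantizer design that underlies lossy quality-value compression: Lloyd's algorithm for minimizing mean squared error with a fixed number of reconstruction levels. QVZ's actual codebooks are context-dependent and support arbitrary quasi-convex distortions; this generic, MSE-only sketch is given for intuition and uses made-up data.

import numpy as np

def lloyd_quantizer(samples, n_levels, n_iters=30):
    """Design an MSE-minimizing scalar quantizer with Lloyd's algorithm."""
    samples = np.asarray(samples, dtype=float)
    codebook = np.quantile(samples, np.linspace(0, 1, n_levels))   # initial levels
    for _ in range(n_iters):
        idx = np.abs(samples[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(n_levels):                                  # centroid update
            if np.any(idx == k):
                codebook[k] = samples[idx == k].mean()
    idx = np.abs(samples[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx

# Synthetic quality scores roughly in the Phred range 0-41
scores = np.clip(np.random.default_rng(0).normal(30, 6, size=100_000), 0, 41)
codebook, idx = lloyd_quantizer(scores, n_levels=4)
print("codebook:", np.round(codebook, 2),
      "MSE:", round(float(np.mean((scores - codebook[idx]) ** 2)), 3))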
Journal:
BIOINFORMATICS
ISSN:
1367-4803
Year:
2015
Vol.:
31
No.:
5
Pages:
626 - 633
Motivation: With the release of the latest next-generation sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing a human has dropped to a mere $4000. Thus we are approaching a milestone in sequencing history, known as the $1000 genome era, where the sequencing of individuals is affordable, opening the doors to effective personalized medicine. Massive generation of genomic data, including assembled genomes, is expected in the following years. There is a crucial need for compression of genomes guaranteed to perform well simultaneously on different species, from simple bacteria to humans, which will ease their transmission, dissemination and analysis. Further, most of the new genomes to be compressed will correspond to individuals of a species from which a reference already exists in the database. Thus, it is natural to propose compression schemes that assume and exploit the availability of such references. Results: We propose iDoComp, a compressor of assembled genomes presented in FASTA format that compresses an individual genome using a reference genome for both the compression and the decompression. In terms of compression efficiency, iDoComp outperforms previously proposed algorithms in most of the studied cases, with comparable or better running time. For example, we observe compression gains of up to 60% in several cases, including H. sapiens data, when comparing with the best compression performance among the previously proposed algorithms.
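The core idea of reference-based genome compression can be illustrated by encoding the target as a list of (reference position, match length, next character) entries. iDoComp builds this kind of mapping efficiently with suffix arrays; the greedy, quadratic-time sketch below only illustrates the representation, and the function names and parameters are assumptions.

def encode_against_reference(target, reference, min_match=4):
    """Greedy parse of 'target' into (ref_pos, length, next_char) triplets."""
    triplets, i = [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for p in range(len(reference)):            # brute force; real tools use suffix arrays
            l = 0
            while (p + l < len(reference) and i + l < len(target)
                   and reference[p + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = p, l
        if best_len >= min_match:
            nxt = target[i + best_len] if i + best_len < len(target) else ""
            triplets.append((best_pos, best_len, nxt))
            i += best_len + 1
        else:
            triplets.append((-1, 0, target[i]))     # literal character
            i += 1
    return triplets

def decode(triplets, reference):
    out = []
    for pos, length, ch in triplets:
        if length:
            out.append(reference[pos:pos + length])
        out.append(ch)
    return "".join(out)

ref = "ACGTACGTTTGACCA"
tgt = "ACGTACGATTGACCA"     # one substitution relative to the reference
code = encode_against_reference(tgt, ref)
assert decode(code, ref) == tgt
print(code)                 # [(0, 7, 'A'), (8, 7, '')]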
Journal:
JOURNAL OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY
ISSN:
0219-7200
With the release of the latest Next-Generation Sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing the whole genome of a human is expected to drop to a mere $1000. This milestone in sequencing history marks the era of affordable sequencing of individuals and opens the doors to personalized medicine. In accord, unprecedented volumes of genomic data will require storage for processing. There will be dire need not only of compressing aligned data, but also of generating compressed files that can be fed directly to downstream applications to facilitate the analysis of and inference on the data. Several approaches to this challenge have been proposed in the literature; however, focus thus far has been on the low coverage regime and most of the suggested compressors are not based on effective modeling of the data.
We demonstrate the benefit of data modeling for compressing aligned reads. Specifically, we show that, by working with data models designed for the aligned data, we can improve considerably over the best compression ratio achieved by previously proposed algorithms. Our results indicate that the Pareto-optimal barrier for compression rate and speed claimed by Bonfield and Mahoney (2013) [Bonfield JK and Mahoney MV, Compression of FASTQ and SAM format sequencing data, PLOS ONE, 8(3): e59190, 2013.] does not apply for high coverage aligned data. Furthermore, our improved compression ratio is achieved by splitting the data in a manner conducive to o
Authors:
Manolakos, A. (Corresponding author); Ochoa, Idoia; Venkat, K.; et al.
Journal:
BMC GENOMICS
ISSN:
1471-2164
Year:
2014
Vol.:
15
No.:
10:S8
BACKGROUND:
Identification of genomic patterns in tumors is an important problem, which would enable the community to understand and extend effective therapies across the current tissue-based tumor boundaries. With this in mind, in this work we develop a robust and fast algorithm to discover cancer driver genes using an unsupervised clustering of similarly expressed genes across cancer patients. Specifically, we introduce CaMoDi, a new method for module discovery which demonstrates superior performance across a number of computational and statistical metrics.
RESULTS:
The proposed algorithm CaMoDi demonstrates effective statistical performance compared to the state of the art, and is algorithmically simple and scalable - which makes it suitable for tissue-independent genomic characterization of individual tumors as well as groups of tumors. We perform an extensive comparative study between CaMoDi and two previously developed methods (CONEXIC and AMARETTO), across 11 individual tumors and 8 combinations of tumors from The Cancer Genome Atlas. We demonstrate that CaMoDi is able to discover modules with better average consistency and homogeneity, with similar or better adjusted R2 performance compared to CONEXIC and AMARETTO.
CONCLUSIONS:
We present a novel method for Cancer Module Discovery, CaMoDi, and demonstrate through extensive simulations on the TCGA Pan-Cancer dataset that it achieves comparable or better performance than that of CONEXIC and AMARETTO, while achieving
Journal:
BMC BIOINFORMATICS
ISSN:
1471-2105
Background
Next Generation Sequencing technologies have revolutionized many fields in biology by reducing the time and cost required for sequencing. As a result, large amounts of sequencing data are being generated. A typical sequencing data file may occupy tens or even hundreds of gigabytes of disk space, prohibitively large for many users. This data consists of both the nucleotide sequences and per-base quality scores that indicate the level of confidence in the readout of these sequences. Quality scores account for about half of the required disk space in the commonly used FASTQ format (before compression), and therefore the compression of the quality scores can significantly reduce storage requirements and speed up analysis and transmission of sequencing data.
Results
In this paper, we present a new scheme for the lossy compression of the quality scores, to address the problem of storage. Our framework allows the user to specify the rate (bits per quality score) prior to compression, independent of the data to be compressed. Our algorithm can work at any rate, unlike other lossy compression algorithms. We envisage our algorithm as being part of a more general compression scheme that works with the entire FASTQ file. Numerical experiments show that we can achieve a better mean squared error (MSE) for small rates (bits per quality score) than other lossy compression schemes. For the organism PhiX, whose assembled genome is known and assumed to be correct, we show that it
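Since the framework above lets the user fix the rate (bits per quality score) before compression, the resulting storage footprint can be computed ahead of time. The figures below are made up for the example, not taken from the paper.

# Back-of-the-envelope storage estimate when the rate is fixed up front.
n_reads = 600_000_000        # e.g., a deep human WGS run (assumed)
read_length = 100            # quality scores per read (assumed)
rate_bits = 0.5              # user-chosen bits per quality score

total_scores = n_reads * read_length
compressed_gb = total_scores * rate_bits / 8 / 1e9
print(f"quality stream at {rate_bits} bits/score: {compressed_gb:.1f} GB")
print(f"uncompressed 8-bit qualities: {total_scores / 1e9:.1f} GB")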
Journal:
IEEE COMMUNICATIONS LETTERS
ISSN:
1089-7798
Year:
2010
Vol.:
14
No.:
9
Pages:
794 - 796
In this paper, we design a new energy allocation strategy for non-uniform binary memoryless sources encoded by Low-Density Parity-Check (LDPC) codes and sent over Additive White Gaussian Noise (AWGN) channels. The new approach estimates the a priori probabilities of the encoded symbols, and uses this information to allocate more energy to the transmitted symbols that are less likely to occur. It can be applied to systematic and non-systematic LDPC codes, improving in both cases the performance of previous LDPC-based schemes using binary signaling. The decoder introduces the source non-uniformity and estimates the source symbols by applying the SPA (Sum-Product Algorithm) over the factor graph describing the code.
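The unequal energy allocation can be made concrete with a small numerical sketch: under an average-energy constraint p0*E0 + p1*E1 = Eavg, giving more energy to the rarer symbol is a one-parameter choice. The fixed energy ratio used below is an assumption made only to illustrate the constraint; the paper derives its own allocation from the estimated a priori probabilities.

def unequal_energy(p1, ratio, e_avg=1.0):
    """Split the average symbol energy between the two binary symbols.
    p1:    estimated a priori probability of the rarer symbol '1' (p1 < 0.5)
    ratio: chosen E1/E0 > 1, i.e., how much more energy the rare symbol gets."""
    p0 = 1.0 - p1
    e0 = e_avg / (p0 + p1 * ratio)    # from p0*E0 + p1*(ratio*E0) = e_avg
    return e0, ratio * e0

e0, e1 = unequal_energy(p1=0.1, ratio=3.0)
print(round(e0, 3), round(e1, 3))         # 0.833 2.5
print(round(0.9 * e0 + 0.1 * e1, 3))      # average energy constraint: 1.0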
Journal:
IEEE COMMUNICATIONS LETTERS
ISSN:
1089-7798
Year:
2010
Vol.:
14
No.:
4
Pages:
336 - 338
This letter proposes a novel one-layer coding/shaping scheme with single-level codes and sigma-mapping for the bandwidth-limited regime. Specifically, we consider non-uniform memoryless sources sent over AWGN channels. At the transmitter, binary data are encoded by a Turbo code composed of two identical RSC (Recursive Systematic Convolutional) encoders. The encoded bits are randomly interleaved and modulated before entering the sigma-mapper. The modulation employed in this system follows the unequal energy allocation scheme first introduced in [1]. The receiver consists of an iterative demapping/decoding algorithm, which incorporates the a priori probabilities of the source symbols. To the authors' knowledge, work in this area has only been done for the power-limited regime. In particular, the authors in [2] proposed a scheme based on a Turbo code with RSC encoders and unequal energy allocation. Therefore, it is reasonable to compare the performance - with respect to the Shannon limit - of our proposed bandwidth-limited regime scheme with this former power-limited regime scheme. Simulation results show that our performance is as good as or slightly better than that of the system in [2].