Our researchers

Idoia Ochoa Álvarez

Departamento de Ingeniería Eléctrica y Electrónica
Escuela de Ingeniería (TECNUN). Universidad de Navarra
Research lines
Computation and Biology, Compression
8, (Scopus, 13/11/2020)

Most recent scientific publications (since 2010)

Authors: Voges, J.; Paridaens, T.; Müntefering, F.; et al.
ISSN 1367-4803  2020 
MOTIVATION: In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the MPEG-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data. RESULTS: We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM. AVAILABILITY: The GABAC library is written in C++. We also provide a command line application which exercises all features provided by the library. GABAC can be downloaded from https://github.com/mitogen/gabac. SUPPLEMENTARY INFORMATION: Supplementary data, including a complete list of the funding mechanisms and acknowledgements, are available at Bioinformatics online.
Authors: Zhang, C.; Ochoa Álvarez, Idoia
ISSN 1367-4803  Vol. 36  No. 8  2020 
MOTIVATION: Variants identified by current genomic analysis pipelines contain many incorrectly called variants. These can be potentially eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF). However, these methods are very user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on decision tree ensemble methods that overcomes the main drawbacks of VQSR and HF. Contrary to these methods, we treat filtering as a supervised learning problem, using variant call data with known "true" variants, i.e., gold standard, for training. Once trained, VEF can be directly applied to filter the variants contained in a given VCF file (we consider training and testing VCF files generated with the same tools, as we assume they will share feature characteristics). RESULTS: For the analysis, we used Whole Genome Sequencing (WGS) Human datasets for which the gold standards are available. We show on these data that the proposed filtering tool VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, when the training and testing datasets differ in coverage, and when sequencing pipelines other than GATK are used. Finally, since the training needs to be performed only once, there is a significant saving in running time when compared to VQSR (approximately 4 versus 50 minutes for filtering the SNPs of a WGS Human
Authors: Hernáez Arrazola, Mikel (Corresponding author); Pavlichin, D.; Weissman, T.; et al.
ISSN 2574-3414  Vol. 2  2019  pp. 19 - 37
Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology, in terms of both efficiency and affordability. These developments have allowed many to envision whole-genome sequencing as an invaluable tool for both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic data sets are being generated. This poses a significant challenge for the storage and transmission of these data. Already, it is more expensive to store genomic data for a decade than it is to obtain the data in the first place. This situation calls for efficient representations of genomic information. In this review, we emphasize the need for designing specialized compressors tailored to genomic data and describe the main solutions already proposed. We also give general guidelines for storing these data and conclude with our thoughts on the future of genomic formats and compressors.
Authors: Fisher-Hwang, I.; Ochoa Álvarez, Idoia
ISSN 2045-2322  Vol. 9  2019 
Noise in genomic sequencing data is known to have effects on various stages of genomic data analysis pipelines. Variant identification is an important step of many of these pipelines, and is increasingly being used in clinical settings to aid medical practices. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual chromosome and whole genome sequencing (WGS) data sets. In the WGS data set, denoising led to identification of almost 2,000 additional true variants, and elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance.
Authors: Chandak, S.; Tatwawadi, K.; Ochoa Álvarez, Idoia; et al.
ISSN 1367-4803  Vol. 35  No. 15  2019  pp. 2674 - 2676
Motivation: High-Throughput Sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to completely exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack support for one or more crucial properties, such as variable length reads, scalability to high coverage datasets, pairing-preserving compression and lossless compression. Results: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools; for example, SPRING compresses 195 GB of 25× whole genome human FASTQ from Illumina's NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources.
Authors: Yang, R.; Chen, X.; Ochoa Álvarez, Idoia (Corresponding author)
ISSN 1471-2105  Vol. 20  No. 368  2019 
Background: Mass Spectrometry (MS) is a widely used technique in biology research, and has become key in proteomics and metabolomics analyses. As a result, the amount of MS data has significantly increased in recent years. For example, the MS repository MassIVE contains more than 123 TB of data. Somewhat surprisingly, these data are stored uncompressed, hence incurring a significant storage cost. Efficient representation of these data is therefore paramount to lessen the burden of storage and facilitate its dissemination. Results: We present MassComp, a lossless compressor optimized for the numerical (m/z)-intensity pairs that account for most of the MS data. We tested MassComp on several MS data and show that it delivers on average a 46% reduction on the size of the numerical data, and up to 89%. These results correspond to an average improvement of more than 27% when compared to the general compressor gzip and of 40% when compared to the state-of-the-art numerical compressor FPC. When tested on entire files retrieved from the MassIVE repository, MassComp achieves on average a 59% size reduction. MassComp is written in C++ and freely available at https://github.com/iochoa/MassComp. Conclusions: The compression performance of MassComp demonstrates its potential to significantly reduce the footprint of MS data, and shows the benefits of designing specialized compression algorithms tailored to MS data. MassComp is an addition to the family of omics compression algorithms designed to less
Authors: Peng, J.; Milenkovic, O.; Ochoa Álvarez, Idoia (Corresponding author)
ISSN 1367-4803  Vol. 34  No. 15  2018  pp. 2654 - 2656
Motivation: DNA methylation is one of the most important epigenetic mechanisms in cells and plays a significant role in controlling gene expression. Abnormal methylation patterns have been associated with cancer, imprinting disorders and repeat-instability diseases. As inexpensive bisulfite sequencing approaches have led to significant efforts in acquiring methylation data, problems of data storage and management have become increasingly important. The de facto compression method for methylation data is gzip, which is a general purpose compression algorithm that does not cater to the special format of methylation files. We propose METHCOMP, a new compression scheme tailor-made for bedMethyl files, which supports random access. Results: We tested the METHCOMP algorithm on 24 bedMethyl files retrieved from four randomly selected ENCODE assays. Our findings reveal that METHCOMP offers an average compression ratio improvement over gzip of up to 7.5x. As an example, METHCOMP compresses a 48 GB file to only 0.9 GB, which corresponds to a 98% reduction in size.
Authors: Roguski, L.; Ochoa Álvarez, Idoia; Hernáez Arrazola, Mikel; et al.
ISSN 1367-4811  2018 
The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed, and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files. The proposed algorithm does not use any reference sequences for compression, and permits the user to choose from several lossy modes to improve the overall compression ratio, depending on the specific needs. We demonstrate through extensive simulations that FaStore achieves a significant improvement in compression ratio with respect to previously proposed algorithms for this task. In addition, we perform an analysis on the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains, while preserving the essential genomic information and without affecting the variant calling performance.
Authors: Ochoa Álvarez, Idoia (Corresponding author); Hernáez Arrazola, Mikel; Goldfeder, R.;
ISSN 1467-5463  Vol. 18  No. 2  2017  pp. 183 - 194
Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms to identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets and simulated data, we are able to analyze how accurate the output of the variant calling is, both for the original data and that previously lossily compressed. We show that lossy compression can significantly alleviate the storage while maintaining variant calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant calling performance tha
Authors: Tatwawadi, K. (Corresponding author); Hernáez Arrazola, Mikel; Ochoa Álvarez, Idoia; et al.
ISSN 1367-4803  Vol. 32  No. 17  2016  pp. i479 - i486
Motivation: The dramatic decrease in the cost of sequencing has resulted in the generation of huge amounts of genomic data, as evidenced by projects such as the UK10K and the Million Veteran Project, with the number of sequenced genomes ranging in the order of 10 K to 1 M. Due to the large redundancies among genomic sequences of individuals from the same species, most of the medical research deals with the variants in the sequences as compared with a reference sequence, rather than with the complete genomic sequences. Consequently, millions of genomes represented as variants are stored in databases. These databases are constantly updated and queried to extract information such as the common variants among individuals or groups of individuals. Previous algorithms for compression of this type of databases lack efficient random access capabilities, rendering querying the database for particular variants and/or individuals extremely inefficient, to the point where compression is often relinquished altogether. Results: We present a new algorithm for this task, called GTRAC, that achieves significant compression ratios while allowing fast random access over the compressed database. For example, GTRAC is able to compress a Homo sapiens dataset containing 1092 samples in 1.1 GB (compression ratio of 160), while allowing for decompression of specific samples in less than a second and decompression of specific variants in 17 ms. GTRAC uses and adapts techniques from information theory,
Authors: Deorowicz, S. (Corresponding author); Grabowski, S.; Ochoa Álvarez, Idoia (Corresponding author); et al.
ISSN 1367-4803  Vol. 32  No. 7  2016  pp. 1115 - 1117
Motivation: Data compression is crucial in effective handling of genomic data. Among several recently published algorithms, ERGC seems to be surprisingly good, easily beating all of the competitors. Results: We evaluated ERGC and the previously proposed algorithms GDC and iDoComp, which are the ones used in the original paper for comparison, on a wide data set including 12 assemblies of human genome (instead of only four of them in the original paper). ERGC wins only when one of the genomes (referential or target) contains mixed-cased letters (which is the case for only the two Korean genomes). In all other cases ERGC is on average an order of magnitude worse than GDC and iDoComp.
Authors: Ochoa Álvarez, Idoia (Corresponding author); Hernáez Arrazola, Mikel; Weissman, T.;
ISSN 1367-4803  Vol. 31  No. 5  2015  pp. 626 - 633
Motivation: With the release of the latest next-generation sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing a Human has dropped to a mere $4000. Thus we are approaching a milestone in the sequencing history, known as the $1000 genome era, where the sequencing of individuals is affordable, opening the doors to effective personalized medicine. Massive generation of genomic data, including assembled genomes, is expected in the following years. There is a crucial need for genome compression that is guaranteed to perform well simultaneously on different species, from simple bacteria to humans, which will ease their transmission, dissemination and analysis. Further, most of the new genomes to be compressed will correspond to individuals of a species for which a reference already exists in the database. Thus, it is natural to propose compression schemes that assume and exploit the availability of such references. Results: We propose iDoComp, a compressor of assembled genomes presented in FASTA format that compresses an individual genome using a reference genome for both the compression and the decompression. In terms of compression efficiency, iDoComp outperforms previously proposed algorithms in most of the studied cases, with comparable or better running time. For example, we observe compression gains of up to 60% in several cases, including H. sapiens data, when comparing with the best compression performance among the previously proposed algorithms.
Authors: Malysa, G. (Corresponding author); Hernáez Arrazola, Mikel (Corresponding author); Ochoa Álvarez, Idoia (Corresponding author); et al.
ISSN 1367-4803  Vol. 31  No. 19  2015  pp. 3122 - 3129
Motivation: Recent advancements in sequencing technology have led to a drastic reduction in the cost of sequencing a genome. This has generated an unprecedented amount of genomic data that must be stored, processed and transmitted. To facilitate this effort, we propose a new lossy compressor for the quality values presented in genomic data files (e.g. FASTQ and SAM files), which comprise roughly half of the storage space (in the uncompressed domain). Lossy compression allows for compression of data beyond its lossless limit. Results: The proposed algorithm QVZ exhibits better rate-distortion performance than the previously proposed algorithms, for several distortion metrics and for the lossless case. Moreover, it allows the user to define any quasi-convex distortion function to be minimized, a feature not supported by the previous algorithms. Finally, we show that QVZ-compressed data exhibit better performance in the genotyping than data compressed with previously proposed algorithms, in the sense that for a similar rate, a genotyping closer to that achieved with the original quality values is obtained.
Authors: Ochoa Álvarez, Idoia (Corresponding author); Hernáez Arrazola, Mikel; Weissman, T.;
ISSN 0219-7200  Vol. 12  No. 6  2014 
With the release of the latest Next-Generation Sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing the whole genome of a human is expected to drop to a mere $1000. This milestone in sequencing history marks the era of affordable sequencing of individuals and opens the doors to personalized medicine. Accordingly, unprecedented volumes of genomic data will require storage for processing. There will be dire need not only of compressing aligned data, but also of generating compressed files that can be fed directly to downstream applications to facilitate the analysis of and inference on the data. Several approaches to this challenge have been proposed in the literature; however, focus thus far has been on the low coverage regime and most of the suggested compressors are not based on effective modeling of the data. We demonstrate the benefit of data modeling for compressing aligned reads. Specifically, we show that, by working with data models designed for the aligned data, we can improve considerably over the best compression ratio achieved by previously proposed algorithms. Our results indicate that the pareto-optimal barrier for compression rate and speed claimed by Bonfield and Mahoney (2013) [Bonfield JK and Mahoney MV, Compression of FASTQ and SAM format sequencing data, PLOS ONE, 8(3): e59190, 2013.] does not apply for high coverage aligned data. Furthermore, our improved compression ratio is achieved by splitting the data in a manner conducive to o
Authors: Manolakos, A. (Corresponding author); Ochoa Álvarez, Idoia; Venkat, K.; et al.
ISSN 1471-2164  Vol. 15  No. 10:S8  2014 
BACKGROUND: Identification of genomic patterns in tumors is an important problem, which would enable the community to understand and extend effective therapies across the current tissue-based tumor boundaries. With this in mind, in this work we develop a robust and fast algorithm to discover cancer driver genes using an unsupervised clustering of similarly expressed genes across cancer patients. Specifically, we introduce CaMoDi, a new method for module discovery which demonstrates superior performance across a number of computational and statistical metrics. RESULTS: The proposed algorithm CaMoDi demonstrates effective statistical performance compared to the state of the art, and is algorithmically simple and scalable - which makes it suitable for tissue-independent genomic characterization of individual tumors as well as groups of tumors. We perform an extensive comparative study between CaMoDi and two previously developed methods (CONEXIC and AMARETTO), across 11 individual tumors and 8 combinations of tumors from The Cancer Genome Atlas. We demonstrate that CaMoDi is able to discover modules with better average consistency and homogeneity, with similar or better adjusted R2 performance compared to CONEXIC and AMARETTO. CONCLUSIONS: We present a novel method for Cancer Module Discovery, CaMoDi, and demonstrate through extensive simulations on the TCGA Pan-Cancer dataset that it achieves comparable or better performance than that of CONEXIC and AMARETTO, while achieving
Authors: Ochoa Álvarez, Idoia (Corresponding author); Asnani, H.; Bharadia, D.; et al.
ISSN 1471-2105  Vol. 14  No. 1  2013 
Background: Next Generation Sequencing technologies have revolutionized many fields in biology by reducing the time and cost required for sequencing. As a result, large amounts of sequencing data are being generated. A typical sequencing data file may occupy tens or even hundreds of gigabytes of disk space, prohibitively large for many users. These data consist of both the nucleotide sequences and per-base quality scores that indicate the level of confidence in the readout of these sequences. Quality scores account for about half of the required disk space in the commonly used FASTQ format (before compression), and therefore the compression of the quality scores can significantly reduce storage requirements and speed up analysis and transmission of sequencing data. Results: In this paper, we present a new scheme for the lossy compression of the quality scores, to address the problem of storage. Our framework allows the user to specify the rate (bits per quality score) prior to compression, independent of the data to be compressed. Our algorithm can work at any rate, unlike other lossy compression algorithms. We envisage our algorithm as being part of a more general compression scheme that works with the entire FASTQ file. Numerical experiments show that we can achieve a better mean squared error (MSE) for small rates (bits per quality score) than other lossy compression schemes. For the organism PhiX, whose assembled genome is known and assumed to be correct, we show that it
Authors: Ochoa Álvarez, Idoia (Corresponding author); Crespo Bofill, Pedro; del Ser Lorente, Javier; et al.
ISSN 1089-7798  Vol. 14  No. 4  2010  pp. 336 - 338
This letter proposes a novel one-layer coding/shaping scheme with single-level codes and sigma-mapping for the bandwidth-limited regime. Specifically, we consider non-uniform memoryless sources sent over AWGN channels. At the transmitter, binary data are encoded by a Turbo code composed of two identical RSC (Recursive Systematic Convolutional) encoders. The encoded bits are randomly interleaved and modulated before entering the sigma-mapper. The modulation employed in this system follows the unequal energy allocation scheme first introduced in [1]. The receiver consists of an iterative demapping/decoding algorithm, which incorporates the a priori probabilities of the source symbols. To the authors' knowledge, work in this area has only been done for the power-limited regime. In particular, the authors in [2] proposed a scheme based on a Turbo code with RSC encoders and unequal energy allocation. Therefore, it is reasonable to compare the performance, with respect to the Shannon limit, of our proposed bandwidth-limited regime scheme with this former power-limited regime scheme. Simulation results show that our performance is as good as or slightly better than that of the system in [2].
Authors: Ochoa Álvarez, Idoia (Corresponding author); Crespo Bofill, Pedro; Hernáez Arrazola, Mikel
ISSN 1089-7798  Vol. 14  No. 9  2010  pp. 794 - 796
In this paper, we design a new energy allocation strategy for non-uniform binary memoryless sources encoded by Low-Density Parity-Check (LDPC) codes and sent over Additive White Gaussian Noise (AWGN) channels. The new approach estimates the a priori probabilities of the encoded symbols, and uses this information to allocate more energy to the transmitted symbols that are less likely to occur. It can be applied to systematic and non-systematic LDPC codes, improving in both cases the performance of previous LDPC based schemes using binary signaling. The decoder introduces the source non-uniformity and estimates the source symbols by applying the SPA (Sum Product Algorithm) over the factor graph describing the code.

Teaching experience


Master's Thesis (MC2). 
Universidad de Navarra - Facultad de Ciencias.

Signal Theory (Ing.Gr.). 
Universidad de Navarra - Escuela de Ingeniería.

Machine Learning II. 
Universidad de Navarra - Escuela de Ingeniería.