Research Groups

Group Members

Carazo Melo
Juan Ángel
Ferrer-Bonsoms Hernández
Luis Vitores
Valcárcel García
Barrera Acuña
Ochoa Álvarez
Sada del Real

Research Lines

  • DNA data analysis. Methods to improve the identification of biological variants in germline and cancer
  • Development of computational methods for the analysis and integration of omics data
  • Compression schemes tailored to different omics data. Active members in the development of the MPEG-G standard for the representation of genomic data
  • Study of alternative splicing in different cancer types: its alterations, causes and effects
  • Inference of gene regulatory networks from bulk and single-cell RNA sequencing data
  • Influence of the gut microbiota on health and nutrition
  • Integration of large-scale gene silencing and drug screening experiments within the framework of precision oncology
  • Predictive models of drug toxicity based on structural features
  • Metabolic reprogramming in cancer to identify new therapeutic targets and response markers

Keywords

  • Algorithms
  • Alternative Splicing
  • Biomarkers
  • Cancer
  • Compression
  • Computational Models
  • Drug Targets
  • Drugs
  • Germline
  • Health
  • In-silico
  • Metabolic Networks
  • Metabolism
  • Microbiome
  • Microbiota
  • Nutrition
  • Omics
  • Personalized Nutrition
  • Precision Oncology
  • Regulatory Networks
  • RNAseq
  • Signaling Networks
  • Single-Cell RNAseq
  • Therapeutic Targets
  • Toxicity

Scientific Publications since 2018

  • Authors: Ferrer-Bonsoms, J. A.; Jareño, L.; Rubio Díaz-Cordoves, Ángel (corresponding author)
    ISSN: 1367-4803 Vol. 38 No. 3 2022 pp. 844-845
    Motivation: Discover is an algorithm developed to identify mutually exclusive genomic events. Its main contribution is a statistical analysis based on the Poisson-Binomial (PB) distribution to take into account the mutation rate of genes and samples. Discover is very effective for identifying mutually exclusive mutations at the expense of speed in large datasets: the PB is computationally costly to estimate, and checking all the potential mutually exclusive alterations requires millions of tests. Results: We have implemented a new version of the package called Rediscover that implements exact and approximate computations of the PB. Rediscover exact implementation is slightly faster than Discover for large and medium-sized datasets. The approximation is 100-1000 times faster for them making it possible to get results in less than a minute with a standard desktop. The memory footprint is also smaller in Rediscover. The new package is available at CRAN and provides some functions to integrate its usage with other R packages such as maftools and TCGAbiolinks. Availability and implementation: Rediscover is available at CRAN ( Rediscover/index.html).
  • Authors: Valcárcel García, Luis Vitores; San José Enériz, Edurne; Cendoya Garmendia, Xabier; et al.
    ISSN: 1553-7358 Vol. 18 No. 5 2022 pp. e1010180
    With the frenetic growth of high-dimensional datasets in different biomedical domains, there is an urgent need to develop predictive methods able to deal with this complexity. Feature selection is a relevant strategy in machine learning to address this challenge. We introduce a novel feature selection algorithm for linear regression called BOSO (Bilevel Optimization Selector Operator). We conducted a benchmark of BOSO with key algorithms in the literature, finding a superior accuracy for feature selection in high-dimensional datasets. Proof-of-concept of BOSO for predicting drug sensitivity in cancer is presented. A detailed analysis is carried out for methotrexate, a well-studied drug targeting cancer metabolism.
  • Authors: Carrasco-García, E. (corresponding author); López, L.; Moncho-Amor, V.; et al.
    Journal: CANCERS
    ISSN: 2072-6694 Vol. 14 No. 4 2022 pp. 916
    Simple Summary Pancreatic cancers are lethal types of cancer. A majority of patients progress to an advanced and metastatic disease, which remains a major clinical problem. Therefore, it is crucial to identify critical regulators to help predict the disease progression and to develop more efficacious therapeutic approaches. In this work we found that an increased expression of the developmental factor SOX9 is associated with metastasis, a poor prognosis and resistance to therapy in pancreatic ductal adenocarcinoma patients and in cell cultures. We also found that this effect is at least in part due to the ability of SOX9 to regulate the activity of stem cell factors, such as BMI1, in addition to those involved in EMT and metastasis. Background: Pancreatic ductal adenocarcinoma (PDAC) is one of the most lethal cancers mainly due to spatial obstacles to complete resection, early metastasis and therapy resistance. The molecular events accompanying PDAC progression remain poorly understood. SOX9 is required for maintaining the pancreatic ductal identity and it is involved in the initiation of pancreatic cancer. In addition, SOX9 is a transcription factor linked to stem cell activity and is commonly overexpressed in solid cancers. It cooperates with Snail/Slug to induce epithelial-mesenchymal transition (EMT) during neural development and in diseases such as organ fibrosis or different types of cancer. Methods: We investigated the roles of SOX9 in pancreatic tumor cell plasticity, metastatic dissemination and chemoresistance using pancreatic cancer cell lines as well as mouse embryo fibroblasts. In addition, we characterized the clinical relevance of SOX9 in pancreatic cancer using human biopsies. 
    Results: Gain- and loss-of-function of SOX9 in PDAC cells revealed that high levels of SOX9 increased migration and invasion, and promoted EMT and metastatic dissemination, whilst SOX9 silencing resulted in metastasis inhibition, along with a phenotypic reversion to epithelial features and loss of stemness potential. In both contexts, EMT factors were not altered. Moreover, high levels of SOX9 promoted resistance to gemcitabine. In contrast, overexpression of SOX9 was sufficient to promote metastatic potential in K-Ras transformed MEFs, triggering EMT associated with Snail/Slug activity. In clinical samples, SOX9 expression was analyzed in 198 PDAC cases by immunohistochemistry and in 53 patient derived xenografts (PDXs). SOX9 was overexpressed in primary adenocarcinomas and particularly in metastases. Notably, SOX9 expression correlated with high vimentin and low E-cadherin expression. Conclusions: Our results indicate that SOX9 facilitates PDAC progression and metastasis by triggering stemness and EMT.
  • Authors: Pages-Zamora, A. (corresponding author); Ochoa Álvarez, Idoia; Ruiz-Cavero, G.; et al.
    ISSN: 0031-3203 Vol. 129 2022 pp. 108721
    Unsupervised ensemble learning refers to methods devised for a particular task that combine data provided by decision learners taking into account their reliability, which is usually inferred from the data. Here, the variant calling step of the next generation sequencing technologies is formulated as an unsupervised ensemble classification problem. A variant calling algorithm based on the expectation-maximization algorithm is further proposed that estimates the maximum-a-posteriori decision among a number of classes larger than the number of different labels provided by the learners. Experimental results with real human DNA sequencing data show that the proposed algorithm is competitive compared to state-of-the-art variant callers such as GATK, HTSLIB, and Platypus.
  • Authors: Apaolaza Emparanza, Iñigo; San José Enériz, Edurne; Valcárcel García, Luis Vitores; et al.
    ISSN: 1553-7358 Vol. 18 No. 3 2022 pp. e1009395
    Synthetic Lethality (SL) is currently defined as a type of genetic interaction in which the loss of function of either of two genes individually has limited effect in cell viability but inactivation of both genes simultaneously leads to cell death. Given the profound genomic aberrations acquired by tumor cells, which can be systematically identified with -omics data, SL is a promising concept in cancer research. In particular, SL has received much attention in the area of cancer metabolism, due to the fact that relevant functional alterations concentrate on key metabolic pathways that promote cellular proliferation. With the extensive prior knowledge about human metabolic networks, a number of computational methods have been developed to predict SL in cancer metabolism, including the genetic Minimal Cut Sets (gMCSs) approach. A major challenge in the application of SL approaches to cancer metabolism is to systematically integrate tumor microenvironment, given that genetic interactions and nutritional availability are interconnected to support proliferation. Here, we propose a more general definition of SL for cancer metabolism that combines genetic and environmental interactions, namely loss of gene functions and absence of nutrients in the environment. We extend our gMCSs approach to determine this new family of metabolic synthetic lethal interactions. A computational and experimental proof-of-concept is presented for predicting the lethality of dihydrofolate reductase (DHFR) inhibition in different environments. Finally, our approach is applied to identify extracellular nutrient dependences of tumor cells, elucidating cholesterol and myo-inositol depletion as potential vulnerabilities in different malignancies.
  • Authors: Ferrer-Bonsoms, J. A.; Morales Urteaga, Xabier; Afshar, P. T.; et al.
    ISSN: 1367-4803 Vol. 38 No. 6 2022 pp. 1491-1496
    Motivation: Isoform deconvolution is an NP-hard problem. The accuracy of the proposed solutions is far from perfect. At present, it is not known if gene structure and isoform concentration can be uniquely inferred given paired-end reads, and there is no objective method to select the fragment length to improve the number of identifiable genes. Different pieces of evidence suggest that the optimal fragment length is gene-dependent, stressing the need for a method that selects the fragment length according to a reasonable trade-off across all the genes in the whole genome. Results: A gene is considered to be identifiable if it is possible to get both the structure and concentration of its transcripts univocally. Here, we present a method to state the identifiability of this deconvolution problem. Assuming a given transcriptome and that the coverage is sufficient to interrogate all junction reads of the transcripts, this method states whether or not a gene is identifiable given the read length and fragment length distribution. Applying this method using different read and fragment length combinations, the optimal average fragment length for the human transcriptome is around 400-600 nt for coding genes and 150-200 nt for long non-coding RNAs. The optimal read length is the largest one that fits in the fragment length. It is also discussed the potential profit of combining several libraries to reconstruct the transcriptome. Combining two libraries of very different fragment lengths results in a significant improvement in gene identifiability.
  • Authors: Ferrer-Bonsoms Hernández, Juan Ángel; Gimeno, M.; Olaverri, D.; et al.
    ISSN: 2631-9268 Vol. 4 No. 3 2022 pp. lqac067
    Alternative splicing (AS) plays a key role in cancer: all its hallmarks have been associated with different mechanisms of abnormal AS. The improvement of the human transcriptome annotation and the availability of fast and accurate software to estimate isoform concentrations has boosted the analysis of transcriptome profiling from RNA-seq. The statistical analysis of AS is a challenging problem not yet fully solved. We have included in EventPointer (EP), a Bioconductor package, a novel statistical method that can use the bootstrap of the pseudoaligners. We compared it with other state-of-the-art algorithms to analyze AS. Its performance is outstanding for shallow sequencing conditions. The statistical framework is very flexible since it is based on design and contrast matrices. EP now includes a convenient tool to find the primers to validate the discoveries using PCR. We also added a statistical module to study alteration in protein domain related to AS. Applying it to 9514 patients from TCGA and TARGET in 19 different tumor types resulted in two conclusions: i) aberrant alternative splicing alters the relative presence of Protein domains and, ii) the number of enriched domains is strongly correlated with the age of the patients.
  • Authors: Goyal, M.; Serrano Sanz, Guillermo; Argemí Ballbé, José María; et al.
    ISSN: 1367-4803 Vol. 38 No. 9 2022 pp. 2488-2495
    Motivation An important step in the transcriptomic analysis of individual cells involves manually determining the cellular identities. To ease this labor-intensive annotation of cell-types, there has been a growing interest in automated cell annotation, which can be achieved by training classification algorithms on previously annotated datasets. Existing pipelines employ dataset integration methods to remove potential batch effects between source (annotated) and target (unannotated) datasets. However, the integration and classification steps are usually independent of each other and performed by different tools. We propose JIND (joint integration and discrimination for automated single-cell annotation), a neural-network-based framework for automated cell-type identification that performs integration in a space suitably chosen to facilitate cell classification. To account for batch effects, JIND performs a novel asymmetric alignment in which unseen cells are mapped onto the previously learned latent space, avoiding the need of retraining the classification model for new datasets. JIND also learns cell-type-specific confidence thresholds to identify cells that cannot be reliably classified. Results We show on several batched datasets that the joint approach to integration and classification of JIND outperforms in accuracy existing pipelines, and a smaller fraction of cells is rejected as unlabeled as a result of the cell-specific confidence thresholds. Moreover, we investigate cells misclassified by JIND and provide evidence suggesting that they could be due to outliers in the annotated datasets or errors in the original approach used for annotation of the target batch.
  • Authors: Gimeno, M.; San José Enériz, Edurne; Villar Fernández, Sara; et al.
    ISSN: 1664-3224 Vol. 13 2022 pp. 977358
    Artificial intelligence (AI) can unveil novel personalized treatments based on drug screening and whole-exome sequencing experiments (WES). However, the concept of "black box" in AI limits the potential of this approach to be translated into the clinical practice. In contrast, explainable AI (XAI) focuses on making AI results understandable to humans. Here, we present a novel XAI method -called multi-dimensional module optimization (MOM)- that associates drug screening with genetic events, while guaranteeing that predictions are interpretable and robust. We applied MOM to an acute myeloid leukemia (AML) cohort of 319 ex-vivo tumor samples with 122 screened drugs and WES. MOM returned a therapeutic strategy based on the FLT3, CBF beta-MYH11, and NRAS status, which predicted AML patient response to Quizartinib, Trametinib, Selumetinib, and Crizotinib. We successfully validated the results in three different large-scale screening experiments. We believe that XAI will help healthcare providers and drug regulators better understand AI medical decisions.
  • Authors: Blasco, T.; Pérez-Burillo, S.; Balzerani, F.; et al.
    ISSN: 2041-1723 Vol. 12 No. 1 2021 pp. 4728
    Understanding how diet and gut microbiota interact in the context of human health is a key question in personalized nutrition. Genome-scale metabolic networks and constraint-based modeling approaches are promising to systematically address this complex problem. However, when applied to nutritional questions, a major issue in existing reconstructions is the limited information about compounds in the diet that are metabolized by the gut microbiota. Here, we present AGREDA, an extended reconstruction of diet metabolism in the human gut microbiota. AGREDA adds the degradation pathways of 209 compounds present in the human diet, mainly phenolic compounds, a family of metabolites highly relevant for human health and nutrition. We show that AGREDA outperforms existing reconstructions in predicting diet-specific output metabolites from the gut microbiota. Using 16S rRNA gene sequencing data of faecal samples from Spanish children representing different clinical conditions, we illustrate the potential of AGREDA to establish relevant metabolic interactions between diet and gut microbiota. The interplay between human diet and the gut microbiome is complex. Here, the authors present a model of human-microbiome interaction that can predict how phenolic compounds are metabolized by the human gut microbiome, identifying diet-specific metabolites in children of varied clinical conditions.
  • Authors: Medina Murua, Andoni (corresponding author); Bistue García, Guillermo; Rubio Díaz-Cordoves, Ángel
    Journal: VEHICLES
    ISSN: 2624-8921 Vol. 3 No. 1 2021 pp. 127-144
    Direct Yaw Moment Control (DYC) is an effective way to alter the behaviour of electric cars with independent drives. Controlling the torque applied to each wheel can improve the handling performance of a vehicle making it safer and faster on a race track. The state-of-the-art literature covers the comparison of various controllers (PID, LPV, LQR, SMC, etc.) using ISO manoeuvres. However, a more advanced comparison of the important characteristics of the controllers' performance is lacking, such as the robustness of the controllers under changes in the vehicle model, steering behaviour, use of the friction circle, and, ultimately, lap time on a track. In this study, we have compared the controllers according to some of the aforementioned parameters on a modelled race car. Interestingly, best lap times are not provided by perfect neutral or close-to-neutral behaviour of the vehicle, but rather by allowing certain deviations from the target yaw rate. In addition, a modified Proportional Integral Derivative (PID) controller showed that its performance is comparable to other more complex control techniques such as Model Predictive Control (MPC).
  • Authors: Perez, S.; Hinojosa, D.; Navajas, B.; et al.
    Journal: NUTRIENTS
    ISSN: 2072-6643 Vol. 13 No. 7 2021 pp. 2207
    The gut microbiota has a profound effect on human health and is modulated by food and bioactive compounds. To study such interaction, in vitro batch fermentations are performed with fecal material, and some experimental designs may require that such fermentations be performed with previously frozen stools. Although it is known that freezing fecal material does not alter the composition of the microbial community in 16S rRNA gene amplicon and metagenomic sequencing studies, it is not known whether the microbial community in frozen samples could still be used for in vitro fermentations. To explore this, we undertook a pilot study in which in vitro fermentations were performed with fecal material from celiac, cow's milk allergic, obese, or lean children that was frozen (or not) with 20% glycerol. Before fermentation, the fecal material was incubated in a nutritious medium for 6 days, with the aim of giving the microbial community time to recover from the effects of freezing. An aliquot was taken daily from the stabilization vessel and used for the in vitro batch fermentation of lentils. The microbial community structure was significantly different between fresh and frozen samples, but the variation introduced by freezing a sample was always smaller than the variation among individuals, both before and after fermentation.
  • Authors: Dufort y Álvarez, G.; Seroussi, G.; Smircich, P.; et al.
    ISSN: 1367-4803 Vol. 37 No. 24 2021 pp. 4862-4864
    Motivation: Nanopore sequencing technologies are rapidly gaining popularity, in part, due to the massive amounts of genomic data they produce in short periods of time (up to 8.5 TB of data in <72 h). To reduce the costs of transmission and storage, efficient compression methods for this type of data are needed. Results: We introduce RENANO, a reference-based lossless data compressor specifically tailored to FASTQ files generated with nanopore sequencing technologies. RENANO improves on its predecessor ENANO, currently the state of the art, by providing a more efficient base call sequence compression component. Two compression algorithms are introduced, corresponding to the following scenarios: (1) a reference genome is available without cost to both the compressor and the decompressor and (2) the reference genome is available only on the compressor side, and a compacted version of the reference is included in the compressed file. We compare the compression performance of RENANO against ENANO on several publicly available nanopore datasets. RENANO improves the base call sequences compression of ENANO by 39.8% in scenario (1), and by 33.5% in scenario (2), on average, over all the datasets. As for total file compression, the average improvements are 12.7% and 10.6%, respectively. We also show that RENANO consistently outperforms the recent general-purpose genomic compressor Genozip.
  • Authors: Zhang, C. Y.; El-Kebir, M.; Ochoa Álvarez, Idoia (corresponding author)
    ISSN: 2041-1723 Vol. 12 No. 1 2021 pp. 2204
    Intra-tumor heterogeneity renders the identification of somatic single-nucleotide variants (SNVs) a challenging problem. In particular, low-frequency SNVs are hard to distinguish from sequencing artifacts. While the increasing availability of multi-sample tumor DNA sequencing data holds the potential for more accurate variant calling, there is a lack of high-sensitivity multi-sample SNV callers that utilize these data. Here we report Moss, a method to identify low-frequency SNVs that recur in multiple sequencing samples from the same tumor. Moss provides any existing single-sample SNV caller the ability to support multiple samples with little additional time overhead. We demonstrate that Moss improves recall while maintaining high precision in a simulated dataset. On multi-sample hepatocellular carcinoma, acute myeloid leukemia and colorectal cancer datasets, Moss identifies new low-frequency variants that meet manual review criteria and are consistent with the tumor's mutational signature profile. In addition, Moss detects the presence of variants in more samples of the same tumor than reported by the single-sample caller. Moss' improved sensitivity in SNV calling will enable more detailed downstream analyses in cancer genomics. The study of tumour heterogeneity can be improved by sequencing multiple samples, but currently available variant callers have not been tailored to integrate them. Here the authors present Moss, a tool that can leverage multiple samples to improve somatic variant calling in different cancers.
  • Authors: Hanau, F.; Rost, H.; Ochoa Álvarez, Idoia (corresponding author)
    ISSN: 1367-4803 Vol. 37 No. 21 2021 pp. 3923-3925
    Motivation: Mass spectrometry (MS) data, used for proteomics and metabolomics analyses, have seen considerable growth in the last years. Aiming at reducing the associated storage costs, dedicated compression algorithms for MS data have been proposed, such as MassComp and MSNumpress. However, these algorithms focus on either lossless or lossy compression, respectively, and do not exploit the additional redundancy existing across scans contained in a single file. We introduce mspack, a compression algorithm for MS data that exploits this additional redundancy and that supports both lossless and lossy compression, as well as the mzML and the legacy mzXML formats. mspack applies several preprocessing lossless transforms and optional lossy transforms with a configurable error, followed by the general purpose compressors gzip or bsc to achieve a higher compression ratio. Results: We tested mspack on several datasets generated by commonly used MS instruments. When used with the bsc compression backend, mspack achieves on average 76% smaller file sizes for lossless compression and 94% smaller file sizes for lossy compression, as compared with the original files. Lossless mspack achieves 10-60% lower file sizes than MassComp, and lossy mspack compresses 36-60% better than the lossy MSNumpress, for the same error, while exhibiting comparable accuracy and running time. Supplementary information: Supplementary data are available at Bioinformatics online.
  • Authors: Ponce-de-Leon, M.; Apaolaza Emparanza, Iñigo; Valencia, A. (corresponding author); et al.
    ISSN: 1367-4803 Vol. 36 No. 6 2020 pp. 1986-1988
  • Authors: Ferrer-Bonsoms, J. A.; Cassol, I.; Fernandez-Acin, P.; et al.
    ISSN: 2045-2322 Vol. 10 No. 1 2020
    The advent of RNA-seq technologies has switched the paradigm of genetic analysis from a genome to a transcriptome-based perspective. Alternative splicing generates functional diversity in genes, but the precise functions of many individual isoforms are yet to be elucidated. Gene Ontology was developed to annotate gene products according to their biological processes, molecular functions and cellular components. Although a single gene may have several gene products, most annotations are not isoform-specific and do not distinguish the functions of the different proteins originating from a single gene. Several approaches have tried to automatically annotate ontologies at the isoform level, but this has proven to be a daunting task. We have developed ISOGO (ISOform + GO function imputation), a novel algorithm to predict the function of coding isoforms based on their protein domains and their correlation of expression along 11,373 cancer patients. Combining these two sources of information outperforms previous approaches: it provides an area under precision-recall curve (AUPRC) five times larger than previous attempts and the median AUROC of assigned functions to genes is 0.82. We tested ISOGO predictions on some genes with isoform-specific functions (BRCA1, MADD, VAMP7 and ITSN1) and they were coherent with the literature. Besides, we examined whether the main isoform of each gene (as predicted by APPRIS) was the most likely to have the annotated gene functions, and this occurs in 99.4% of the genes. We also evaluated the predictions for isoform-specific functions provided by the CAFA3 challenge and results were also convincing. To make these results available to the scientific community, we have deployed a web application to consult ISOGO predictions ( Initial data, website link, isoform-specific GO function predictions and R code is available at
  • Authors: Zhang, C.; Ochoa Álvarez, Idoia
    ISSN: 1367-4803 Vol. 36 No. 8 2020 pp. 2328-2336
    MOTIVATION: Variants identified by current genomic analysis pipelines contain many incorrectly called variants. These can be potentially eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF). However, these methods are very user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on decision tree ensemble methods that overcomes the main drawbacks of VQSR and HF. Contrary to these methods, we treat filtering as a supervised learning problem, using variant call data with known "true" variants, i.e., gold standard, for training. Once trained, VEF can be directly applied to filter the variants contained in a given VCF file (we consider training and testing VCF files generated with the same tools, as we assume they will share feature characteristics). RESULTS: For the analysis, we used Whole Genome Sequencing (WGS) Human datasets for which the gold standards are available. We show on these data that the proposed filtering tool VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, when the training and testing datasets differ in coverage, and when sequencing pipelines other than GATK are used. Finally, since the training needs to be performed only once, there is a significant saving in running time when compared to VQSR (4 versus 50 minutes approximately for filtering the SNPs of a WGS Human
  • Authors: Meng, Q. X.; Ochoa Álvarez, Idoia; Hernaez Arrazola, Mikel
    ISSN: 1367-4803 Vol. 36 No. 18 2020 pp. 4810-4812
    Motivation: Sequencing data are often summarized at different annotation levels for further analysis, generally using the general feature format (GFF) or its descendants, gene transfer format (GTF) and GFF3. Existing utilities for accessing these files, like gffutils and gffread, do not focus on reducing the storage space, significantly increasing it in some cases. We propose GPress, a framework for querying GFF files in a compressed form. GPress can also incorporate and compress expression files from both bulk and single-cell RNA-Seq experiments, supporting simultaneous queries on both the GFF and expression files. In brief, GPress applies transformations to the data which are then compressed with the general lossless compressor BSC. To support queries, GPress compresses the data in blocks and creates several index tables for fast retrieval. Results: We tested GPress on several GFF files of different organisms, and showed that it achieves on average a 61% reduction in size with respect to gzip (the current de facto compressor for GFF files) while being able to retrieve all annotations for a given identifier or a range of coordinates in a few seconds (when run in a common laptop). In contrast, gffutils provides faster retrieval but doubles the size of the GFF files. When additionally linking an expression file, we show that GPress can reduce its size by more than 68% when compared to gzip (for both bulk and single-cell RNA-Seq experiments), while still retrieving the information within seconds. Finally, applying BSC to the data streams generated by GPress instead of to the original file shows a size reduction of more than 44% on average.
  • Authors: Voges, J.; Paridaens, T.; Müntefering, F.; et al.
    ISSN: 1367-4803 2020
    MOTIVATION: In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the MPEG-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data. RESULTS: We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM. AVAILABILITY: The GABAC library is written in C++. We also provide a command line application which exercises all features provided by the library. GABAC can be downloaded from SUPPLEMENTARY INFORMATION: Supplementary data, including a complete list of the funding mechanisms and acknowledgements, are available at Bioinformatics online.
  • Authors: Dufort y Álvarez, G.; Seroussi, G.; Smircich, P.; et al.
    ISSN: 1367-4803 Vol. 36 No. 16 2020 pp. 4506-4507
    Motivation: The amount of genomic data generated globally is seeing explosive growth, leading to increasing needs for processing, storage and transmission resources, which motivates the development of efficient compression tools for these data. Work so far has focused mainly on the compression of data generated by short-read technologies. However, nanopore sequencing technologies are rapidly gaining popularity due to the advantages offered by the large increase in the average size of the produced reads, the reduction in their cost and the portability of the sequencing technology. We present ENANO (Encoder for NANOpore), a novel lossless compression algorithm especially designed for nanopore sequencing FASTQ files. Results: The main focus of ENANO is on the compression of the quality scores, as they dominate the size of the compressed file. ENANO offers two modes, Maximum Compression and Fast (default), which trade-off compression efficiency and speed. We tested ENANO, the current state-of-the-art compressor SPRING and the general compressor pigz on several publicly available nanopore datasets. The results show that the proposed algorithm consistently achieves the best compression performance (in both modes) on every considered nanopore dataset, with an average improvement over pigz and SPRING of >24.7% and 6.3%, respectively. In addition, in terms of encoding and decoding speeds, ENANO is 2.9x and 1.7x times faster than SPRING, respectively, with memory consumption up to 0.2 GB.
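As an illustration of the Poisson-Binomial (PB) computation mentioned in the Rediscover entry of this list, the following is a minimal sketch, not the code of Discover or Rediscover. It assumes that each sample i has its own probability p_i of carrying a co-occurring alteration, so that the number of co-occurrences follows a PB distribution, and that a small lower-tail probability suggests mutual exclusivity. The function names are hypothetical, and the normal approximation shown here is only one possible approximation; the abstract does not specify which one the package uses.

```python
import math


def poisson_binomial_pmf(probs):
    """Exact PMF of the number of successes among independent Bernoulli
    trials with (possibly different) success probabilities `probs`.
    Uses an O(n^2) dynamic programme: trials are added one at a time."""
    pmf = [1.0]  # with zero trials, P(X = 0) = 1
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this trial fails
            nxt[k + 1] += mass * p       # this trial succeeds
        pmf = nxt
    return pmf


def lower_tail_exact(probs, k):
    """P(X <= k) computed from the exact PMF."""
    return sum(poisson_binomial_pmf(probs)[: k + 1])


def lower_tail_normal(probs, k):
    """P(X <= k) from a normal approximation with continuity correction
    (an illustrative choice, assumed here for the sketch)."""
    mu = sum(probs)
    var = sum(p * (1.0 - p) for p in probs)
    z = (k + 0.5 - mu) / math.sqrt(var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    # Hypothetical per-sample co-occurrence probabilities for 200 samples.
    probs = [0.1] * 200
    observed = 15  # hypothetical number of observed co-occurrences
    print(lower_tail_exact(probs, observed))
    print(lower_tail_normal(probs, observed))
```

The exact computation grows quadratically with the number of samples, while the approximation is linear, which is consistent with the speed trade-off described in the abstract when millions of gene pairs have to be tested.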

Projects since 2018

  • Title: nG23 - Intelligent approach to the design of micro-nano-fluidic sensing devices and new functionalities for fluidic analysis
    Reference code: KK-2023/00001
    Principal investigator: IDOIA OCHOA ALVAREZ
    Funder: GOBIERNO VASCO
    Call: ELKARTEK 2023. Programa de Ayudas a la Investigación Colaborativa en áreas estratégicas
    Start date: 01-03-2023
    End date: 31-12-2024
    Amount awarded: 198.730,16€
    Other funds: -
  • Title: MEdicina PERsonalizada para el TRatamiento de la OBEsidad (personalized medicine for the treatment of obesity): integration of omics, dietary and lifestyle data to optimize personalized nutrition for patients with obesity
    Reference code: 0011-1383-2022-000015 (PC098 MEPERTROBE)
    Funder: GOBIERNO DE NAVARRA
    Call: 2022 GN Proyectos Colaborativos
    Start date: 01-09-2022
    End date: 30-11-2024
    Amount awarded: 280.995,26€
    Other funds: -
  • Title: bG22 - Development of diagnostic methods and new therapies in the era of precision medicine in cancer (GV Elkartek Tipo1)
    Reference code: KK-2022-00045
    Funder: GOBIERNO VASCO
    Call: Programa ELKARTEK 2022 K1: Proyecto de Investigación Fundamental Colaborativa - Investigación Fundamental
    Start date: 01-03-2022
    End date: 31-12-2023
    Amount awarded: 121.787,88€
    Other funds: -
  • Title: 2020 Juan de la Cierva Formación fellowship
    Reference code: SFJC2000I046252XV0
    Call: 2020 AEI JUAN DE LA CIERVA FORMACIÓN
    Start date: 01-01-2022
    End date: 09-02-2023
    Amount awarded: 0
    Other funds: -
  • Title: BECA Ochoa_I_Ramon y Cajal 2019
    Reference code: RYC2019-028578-I
    Call: 2019 AEI - MCIU RAMÓN Y CAJAL
    Start date: 01-05-2021
    End date: 28-04-2026
    Amount awarded: 208.600,00€
    Other funds: -
  • Title: Development and validation of new predictive algorithms for synthetic lethality in cancer
    Reference code: PIBA_2020_1_0055
    Funder: GOBIERNO VASCO
    Call: Ayudas para la realización de Proyectos Investigación Básica y/o Aplicada 2020-2022
    Start date: 04-11-2020
    End date: 30-09-2023
    Amount awarded: 0
    Other funds: -
  • Title: New computational approach to predict synthetic lethality in cancer
    Reference code: PID2019-110344RB-I00
    Call: 2019 AEI PROYECTOS I+D+i (incluye Generación del conocimiento y Retos investigación)
    Start date: 01-06-2020
    End date: 01-01-2023
    Amount awarded: 90.750,00€
    Other funds: -
  • Title: BECA G.FELLOWS IDOIA OCHOA Atracción al Talento - Integration of multi-omics data and its application to malignant hematological samples
    Reference code: 2020-FELL-000012-01
    Principal investigator: IDOIA OCHOA ALVAREZ
    Call: 2020 DFG Fellows Gipuzkoa de atracción y retención de talento
    Start date: 15-04-2020
    End date: 14-04-2021
    Amount awarded: 50.000,00€
    Other funds: -
  • Title: Precision medicine in cancer: development of diagnostic tools and new therapies
    Reference code: KK-2020/000008
    Funder: GOBIERNO VASCO
    Call: 2020 GV Elkartek - Proyectos de apoyo a la investigación colaborativa en áreas estratégicas. Tipo 1.
    Start date: 01-03-2020
    End date: 31-12-2021
    Amount awarded: 46.053,00€
    Other funds: -
  • Title: Modelling tool for giving value to agri-food residual streams in bio-based industries
    Reference code:
    Funder: COMISIÓN EUROPEA
    Call: H2020-BBI-JTI-2019
    Start date: 01-06-2020
    End date: 30-11-2023
    Amount awarded: 310.645,05€
    Other funds: -
  • Title: Early detection and detection: understanding the mechanisms of transformation and hidden resistance on incurable hematological malignancies
    Reference code:
    Funder: CANCER RESEARCH UK
    Call: Seed Funding
    Start date: 01-12-2018
    End date: 30-11-2023
    Amount awarded: 270.000,00€
    Other funds: -
  • Title: Smart Technologies for personalised nutrition and consumer engagement
    Reference code: SEP-210501740
    Funder: COMISIÓN EUROPEA
    Call: H2020-SFS-2018-1
    Start date: 11-01-2018
    End date: 30-09-2022
    Amount awarded: 200.668,00€
    Other funds: -
  • Title: Identification at single-cell level of molecular and metabolomic mechanisms governing antitumoral response of CAR T therapies in MM patients (M4CART)
    Call: 2022 FD RAMÓN ARECES Ciencias de la Vida
    Start date: 13-04-2023
    End date: 12-04-2026
    Amount awarded: 124.800,00€
  • Title: Crespo_P_PETAOPTIK_OSCM
    Start date: 23-11-2020
    End date: 28-02-2021
    Amount: 0
    Other funds: -
  • Title: Precision medicine approach to target deregulated metabolism in multiple myeloma
    Principal investigator: FRANCISCO JAVIER PLANES PEDREÑO
    Call: 2019 FD Ramón Areces - Ampliación de Estudios en el extranjero en Ciencias de la Vida y de la Materia
    Start date: 03-04-2019
    End date: 02-04-2022
    Amount awarded: 121.500,00€