Publication Details

RENANO: a REference-based compressor for NANOpore FASTQ files

Authors: Dufort y Álvarez, G.; Seroussi, G.; Smircich, P.; Sotelo-Silveira, J.; Ochoa Álvarez, Idoia (corresponding author); Martin, A. (corresponding author)
Journal title: BIOINFORMATICS
ISSN: 1367-4803
Volume: 37
Issue: 24
Pages: 4862 - 4864
Publication date: 2021
Motivation: Nanopore sequencing technologies are rapidly gaining popularity, in part due to the massive amounts of genomic data they produce in short periods of time (up to 8.5 TB of data in less than 72 h). To reduce the costs of transmission and storage, efficient compression methods for this type of data are needed.

Results: We introduce RENANO, a reference-based lossless data compressor specifically tailored to FASTQ files generated with nanopore sequencing technologies. RENANO improves on its predecessor ENANO, currently the state of the art, by providing a more efficient component for compressing base call sequences. Two compression algorithms are introduced, corresponding to the following scenarios: (1) a reference genome is available without cost to both the compressor and the decompressor, and (2) the reference genome is available only on the compressor side, and a compacted version of the reference is included in the compressed file. We compare the compression performance of RENANO against ENANO on several publicly available nanopore datasets. Averaged over all the datasets, RENANO improves the base call sequence compression of ENANO by 39.8% in scenario (1) and by 33.5% in scenario (2). For total file compression, the average improvements are 12.7% and 10.6%, respectively. We also show that RENANO consistently outperforms the recent general-purpose genomic compressor Genozip.
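
To make the idea behind scenario (1) concrete, the following is a minimal toy sketch of reference-based encoding, in which a read is stored as a position in a shared reference plus the bases that differ. It is not RENANO's algorithm: the function names `encode_read` and `decode_read`, the tuple layout, and the example sequences are all hypothetical, the alignment position is assumed to be already known, and only substitutions are handled (real nanopore reads also contain insertions and deletions).

```python
from typing import List, Tuple

# Hypothetical toy format (not RENANO's): start position in the reference,
# read length, and a list of (offset, base) substitutions.
Encoded = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(read: str, reference: str, start: int) -> Encoded:
    """Encode `read` as substitutions against reference[start:start + len(read)].

    Assumes the alignment position `start` is already known and that the
    read differs from the reference only by substitutions.
    """
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit in the reference at this position")
    diffs = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return start, len(read), diffs


def decode_read(encoded: Encoded, reference: str) -> str:
    """Rebuild the read; the decoder needs the same reference as the encoder."""
    start, length, diffs = encoded
    bases = list(reference[start:start + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


# Made-up example data for illustration only.
reference = "ACGTACGTTAGCCGATAGGCTA"
read = "ACGTTAGCAGAT"  # matches reference[4:16] except for one base

encoded = encode_read(read, reference, start=4)
restored = decode_read(encoded, reference)
print(encoded)
print(restored == read)
```

In this sketch the decoder can only rebuild the read if it holds the same reference, which mirrors scenario (1). Scenario (2) would additionally require shipping the needed part of the reference inside the compressed file.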