BAM Query Index (qidx)
Warning: this is a work in progress
qidx is a tool for indexing BAM alignments by query name. While samtools
can sort data by query name (also called the read name), htslib does not
provide built-in utilities to retrieve alignments by query name. Such
retrieval can be advantageous for examining multi-mapped alignments.
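
As a rough illustration of the gap described above, the sketch below uses simplified, hypothetical records (plain tuples, not real BAM records) that are already ordered by query name. Because alignments of the same read are adjacent, they can be grouped in one pass, but finding a single read without an index still means scanning from the start. The function names are made up for this example and are not part of qidx, samtools, or htslib.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical, simplified records: (query_name, reference, position).
# A real BAM record carries many more fields.
records = [
    ("read1", "chr1", 100),
    ("read1", "chr5", 2040),
    ("read2", "chr2", 77),
    ("read3", "chr1", 900),
    ("read3", "chr1", 5000),
    ("read3", "chrX", 12),
]


def group_by_query_name(sorted_records):
    """Yield (query_name, alignments) from records ordered by query name."""
    for name, group in groupby(sorted_records, key=itemgetter(0)):
        yield name, list(group)


def multi_mapped(sorted_records):
    """Return names with more than one alignment (a simplification)."""
    return [
        name
        for name, alignments in group_by_query_name(sorted_records)
        if len(alignments) > 1
    ]


def find_by_scan(sorted_records, wanted):
    """Without an index, locating one read is a linear scan."""
    for name, alignments in group_by_query_name(sorted_records):
        if name == wanted:
            return alignments
    return []


if __name__ == "__main__":
    print(multi_mapped(records))
    print(find_by_scan(records, "read2"))
```

Grouping only relies on equal names being adjacent, so it does not depend on the exact ordering rule used by the sorter.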
An earlier utility, bri, also indexes BAM files by query name. However, it
reads all alignments into memory, which is impractical for most human
genome data.
Notes:
- Currently, qidx is very inefficient in terms of disk space. When indexing a 33GiB BAM file (Illumina 35x), the index takes up 22GiB on disk when using ZSTD compression. It initially maps 1.2TiB into memory. This is reduced to ~120GiB due to file holes. ZSTD block compression again reduces this to 22GiB. When I get a chance, I hope to look into this further.
- qidx creates a disk-backed hashset using a sparse memory-mapped file. The underlying operating system must support mmap and file holes.
- qidx doesn't currently support compression. It is currently recommended to use block-level compression (such as zfs zstd compression).
- The BAM file must be sorted by query name before the index is built (samtools sort -n).
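
To make the idea of a disk-backed hashset on a sparse memory-mapped file more concrete, here is a minimal sketch. It is not qidx's actual implementation or on-disk format: the slot layout (one 64-bit fingerprint per slot), the use of BLAKE2 for hashing, linear probing, and the `SparseHashSet` class name are all assumptions made for illustration. The file is extended to its full logical size up front, and on filesystems that support file holes only the pages that are actually written consume disk space.

```python
import hashlib
import mmap
import os
import struct
import tempfile

SLOT = struct.Struct("<Q")  # assumed layout: one 64-bit fingerprint per slot


def fingerprint(name: bytes) -> int:
    """64-bit fingerprint of a query name; 0 is reserved for 'empty slot'."""
    fp = int.from_bytes(hashlib.blake2b(name, digest_size=8).digest(), "little")
    return fp or 1


class SparseHashSet:
    """Hypothetical open-addressing hash set stored in a memory-mapped file."""

    def __init__(self, path, num_slots=1 << 22):
        self.num_slots = num_slots
        size = num_slots * SLOT.size
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        # Extending the file without writing data leaves holes where supported.
        os.ftruncate(self.fd, size)
        self.mem = mmap.mmap(self.fd, size)

    def _probe(self, fp):
        # Linear probing; assumes the table is never completely full.
        i = fp % self.num_slots
        while True:
            (stored,) = SLOT.unpack_from(self.mem, i * SLOT.size)
            if stored in (0, fp):
                return i, stored
            i = (i + 1) % self.num_slots

    def add(self, name: bytes) -> None:
        fp = fingerprint(name)
        i, _ = self._probe(fp)
        SLOT.pack_into(self.mem, i * SLOT.size, fp)

    def __contains__(self, name: bytes) -> bool:
        fp = fingerprint(name)
        _, stored = self._probe(fp)
        return stored == fp

    def close(self) -> None:
        self.mem.flush()
        self.mem.close()
        os.close(self.fd)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "names.idx")
    names = SparseHashSet(path)
    names.add(b"read1")
    print(b"read1" in names, b"read2" in names)
    names.close()
    print("logical size:", os.stat(path).st_size)
```

On Linux, comparing `os.stat(path).st_size` with `os.stat(path).st_blocks * 512` gives a rough view of logical versus allocated size. A sparsely filled table touches pages scattered across the whole file, which may be one reason a large mapping still allocates a lot of disk space.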