
Bam Query Index (qidx)

Warning: this is a work in progress

qidx is a tool for indexing BAM alignments by query name. While samtools can sort data by query name (also called the read name), htslib does not provide built-in utilities to retrieve alignments by query name. Retrieval by query name can be useful for examining multi-mapped alignments.

An earlier utility, bri, also indexes BAM files by query name. However, it reads all alignments into memory, which is impractical for most human genome data.

Notes:

  • Currently, qidx is very inefficient in terms of disk space. When indexing a 33GiB BAM file (Illumina 35x), it initially maps 1.2TiB into memory. Because the backing file is sparse, file holes reduce the space actually used on disk to ~120GiB, and ZSTD block compression reduces this further to 22GiB. When I get a chance, I hope to look into this further.
  • qidx creates a disk-backed hashset using a sparse memory-mapped file. The underlying operating system must support mmap and file holes.
  • qidx doesn't currently support compression itself. For now, it is recommended to use block-level compression in the filesystem (such as ZFS zstd compression).
  • The BAM file must be sorted by query name (samtools sort -n) before the index is built.
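
To illustrate the sparse memory-mapped approach described in the notes above, here is a minimal Python sketch of a disk-backed hash table. It is not qidx's actual implementation or on-disk layout: the SparseTable class, the 16-byte slot format, the BLAKE2b fingerprint, and linear probing are all assumptions made for illustration. It assumes a Unix-like system and a filesystem that supports file holes.

```python
import hashlib
import mmap
import os
import struct
import tempfile

# Hypothetical slot layout: (64-bit name fingerprint, 64-bit file offset + 1).
# An offset field of 0 marks an empty slot, which is exactly what a file hole reads as.
SLOT = struct.Struct("<QQ")


class SparseTable:
    """Open-addressing hash table stored in a sparse, memory-mapped file."""

    def __init__(self, path, n_slots):
        self.n_slots = n_slots
        size = n_slots * SLOT.size
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        # Extending the file with ftruncate creates a hole on filesystems
        # that support sparse files; blocks are allocated only when written.
        os.ftruncate(self.fd, size)
        self.mm = mmap.mmap(self.fd, size)

    @staticmethod
    def _fingerprint(name):
        digest = hashlib.blake2b(name.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "little")

    def put(self, name, offset):
        """Store a file offset for a query name, using linear probing."""
        fp = self._fingerprint(name)
        i = fp % self.n_slots
        for _ in range(self.n_slots):
            slot_fp, slot_off = SLOT.unpack_from(self.mm, i * SLOT.size)
            if slot_off == 0 or slot_fp == fp:
                SLOT.pack_into(self.mm, i * SLOT.size, fp, offset + 1)
                return
            i = (i + 1) % self.n_slots
        raise RuntimeError("table is full")

    def get(self, name):
        """Return the stored offset, or None if the name is absent."""
        # Only fingerprints are compared here; a real index would also need
        # to check the name in the BAM record to rule out collisions.
        fp = self._fingerprint(name)
        i = fp % self.n_slots
        for _ in range(self.n_slots):
            slot_fp, slot_off = SLOT.unpack_from(self.mm, i * SLOT.size)
            if slot_off == 0:
                return None
            if slot_fp == fp:
                return slot_off - 1
            i = (i + 1) % self.n_slots
        return None

    def close(self):
        self.mm.close()
        os.close(self.fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.idx")
        table = SparseTable(path, n_slots=1 << 20)
        table.put("read/1", 4096)
        print(table.get("read/1"))
        st = os.stat(path)
        # On a filesystem with hole support, allocated bytes stay far below st_size.
        print(st.st_size, st.st_blocks * 512)
        table.close()
```

Because slots are scattered across the file by the hash, most blocks hold only a few live entries and the rest is zeros. That would be consistent with the note above that block-level compression shrinks the index substantially.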