BAM Query Index (qidx)
=====================

> Warning: this is a work in progress

`qidx` is a tool for indexing BAM alignments by query name. While
`samtools` can sort data by query name (also called the read name),
htslib does not provide built-in utilities to retrieve alignments
by query name. Such lookups are useful for examining multi-mapped
alignments.

An earlier utility, [bri](https://github.com/jts/bri), also indexes BAM
files by query name. However, it reads all alignments into memory,
which is impractical for most human genome data.

Notes:

- Currently, `qidx` is very inefficient in terms of disk space. For a 33GiB BAM file
  (Illumina 35x), the index takes up 22GiB on disk when using ZSTD compression. It initially maps
  1.2TiB into memory. File holes reduce this to ~120GiB, and ZSTD block compression reduces it
  further to 22GiB. When I get a chance, I hope to look into this further.
- `qidx` creates a disk-backed hashset using a sparse memory-mapped file. The underlying
operating system must support `mmap` and file holes.
- `qidx` doesn't currently compress the index itself. For now, it is recommended to
  store the index with block-level compression (such as `zfs` with `zstd` compression).
- The BAM file must be sorted by query name (`samtools sort -n`) before the index is built.
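
To illustrate the sparse memory-mapped hashset mentioned in the notes, here is a minimal Python sketch. It is not `qidx`'s actual on-disk layout: the slot size, the CRC32 hash, linear probing, and every function name are assumptions chosen for illustration. The point is only that a file extended with `ftruncate` occupies little disk space until slots are written, provided the filesystem supports holes.

```python
import mmap
import os
import tempfile
import zlib

SLOT_SIZE = 64        # bytes per slot (hypothetical fixed-width record)
NUM_SLOTS = 1 << 20   # 1 Mi slots, i.e. 64 MiB of address space
TABLE_SIZE = SLOT_SIZE * NUM_SLOTS


def open_table(path: str) -> mmap.mmap:
    """Create (or reopen) the backing file and map it into memory."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Extending the file writes no data, so on filesystems with
        # hole support the new range consumes (almost) no disk blocks.
        os.ftruncate(fd, TABLE_SIZE)
        return mmap.mmap(fd, TABLE_SIZE)
    finally:
        os.close(fd)  # the mapping keeps its own duplicate of the descriptor


def slot_offset(name: bytes) -> int:
    """Hash a query name to the byte offset of its home slot."""
    return (zlib.crc32(name) % NUM_SLOTS) * SLOT_SIZE


def _check(name: bytes) -> None:
    if not 0 < len(name) <= SLOT_SIZE or name.endswith(b"\0"):
        raise ValueError("name must be 1..SLOT_SIZE bytes without trailing NUL")


def insert(table: mmap.mmap, name: bytes) -> None:
    """Add a name, using linear probing; an all-zero slot counts as empty."""
    _check(name)
    off = slot_offset(name)
    for _ in range(NUM_SLOTS):
        stored = table[off:off + SLOT_SIZE].rstrip(b"\0")
        if stored == name:
            return  # already present
        if not stored:
            table[off:off + len(name)] = name
            return
        off = (off + SLOT_SIZE) % TABLE_SIZE
    raise RuntimeError("hash table is full")


def contains(table: mmap.mmap, name: bytes) -> bool:
    """Return True if the name was inserted earlier."""
    _check(name)
    off = slot_offset(name)
    for _ in range(NUM_SLOTS):
        stored = table[off:off + SLOT_SIZE].rstrip(b"\0")
        if not stored:
            return False  # reached an empty slot: the name is absent
        if stored == name:
            return True
        off = (off + SLOT_SIZE) % TABLE_SIZE
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.idx")
        table = open_table(path)
        insert(table, b"read/1")
        table.flush()
        st = os.stat(path)
        # st_blocks is counted in 512-byte units on Linux; the allocated size
        # is far below the logical size only if the filesystem supports holes.
        print("logical bytes:  ", st.st_size)
        print("allocated bytes:", st.st_blocks * 512)
        print("has read/1:", contains(table, b"read/1"))
        table.close()
```

A real index would store alignment file offsets next to each name rather than just membership, and the gap between logical and allocated size in this sketch depends entirely on the filesystem.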