Compressed Suffix Arrays and Burrows Wheeler Transforms

Ross A. Lippert (lippert@math.mit.edu)

http://www-math.mit.edu/~lippert/










Reviewing suffix trees

structures:
  Stree { string, root }
  Node  { start, depth, slink=NIL, parent=NIL, children=[] }

Implementations

Complexity variations:

childrenbuild-timesearch-timespace
linked listO(|A||S|)O(|A||T|)O(|S|)
arraysO(|A||S|)O(|T|)O(|A||S|)
maps/treesO(log|A| |S|)O(log|A| |T|)O(|S|)

Implementation hacks

Known implementations


Too much space? Try suffix arrays

Suffix array is the permutation of sorted suffixes

Think of it as a depth-first `dump' of the suffix tree

A(i)suffix
17$
16a$
9acatacga$
0acataggagacatacga$
13acga$
7agacatacga$
4aggagacatacga$
11atacga$
2ataggagacatacga$
10catacga$
1cataggagacatacga$
14cga$
15ga$
8gacatacga$
6gagacatacga$
5ggagacatacga$
12tacga$
3taggagacatacga$

What can you do with a suffix array?


What can we do with less?

Obviously, suffix trees and suffix arrays are not space efficient

Not all permutations are suffix arrays!


The structure of a suffix array permutation


An auxilliary permutation

The runs break up along character boundaries

|A|+1 monotonic runs

iAinv[A[i]-1]A[i]string
0 117$
11216a$
2139acatacga$
3 00acataggagacatacga$
41613acga$
5147agacatacga$
6174aggagacatacga$
7 911atacga$
8102ataggagacatacga$
9 210catacga$
10 31cataggagacatacga$
11 414cga$
121115ga$
13 58gacatacga$
14156gagacatacga$
15 65ggagacatacga$
16 712tacga$
17 83taggagacatacga$

Burrows Wheeler Transform

Example: B=agg$tgtccaaacagaaa

Monotonic runs can be `colored' by letters

Requires additional O(|A|) storage for block starts of each letter

char$acgt
block0191216
iS[A[i]-1]Ainv[A[i]-1]A[i]string
0a 117$
1g1216a$
2g139acatacga$
3$ 00acataggagacatacga$
4t1613acga$
5g147agacatacga$
6t174aggagacatacga$
7c 911atacga$
8c102ataggagacatacga$
9a 210catacga$
10a 31cataggagacatacga$
11a 414cga$
12c1115ga$
13a 58gacatacga$
14g156gagacatacga$
15a 65ggagacatacga$
16a 712tacga$
17a 83taggagacatacga$

What good is just another string?

Two important queries:

  • occ(x,i) = the number of `x's before i
  • find(x,i) = the location of the ith x

Pattern lookup comes down to doing 2|P| count queries

  lookup(P):
    lo = 0, hi = length(B)
    i = length(P)
    while i>0:
      i = i-1
      lo = block(P[i]) + occ(P[i],lo)
      hi = block(P[i]) + occ(P[i],hi)
    return hi-lo
Note: this style of lookup could be done on a suffix array

$gat
lo012517
hi1816718

Fast counts

Store subtotals for every W letters

  • Time (for occ) = O(1 + W)
  • Space = (1+4|A|/W)|S| bytes
  • Bitwise parallelism allows hacks to let W~1000

Common pattern: support queries with auxilliary data structure which fits into arbitrarily small memory

i#$#a#c#g#tB[i]
000000a
1g
2g
301020$
4t
5g
611031t
7c
8c
911232a
10a
11a
1214232c
13a
14g
1515342a
16a
17a

Bit-parallelism for fast counting

Recall the example `agg$tgtccaaacagaaa'

atgcindicator vector
a100000000111010111
c000000011000100000
g011001000000001000
t000010100000000000

Larger example: 101010101110101000101011111101010101010100010100001000011111110010010111

counts01325
words101010101110101000101011111101010101010100010100001000011111110010010111

occ reduced to population counts

  sum-up-to(i):
    q = i/(3*8), r = i/8, s = 8*(r+1)-i
    x = counts[q]
    for 3*q<=i<r:
      x = x + wordsum(words[i])
    x = x + wordsum(words[r] >> s)

  wordsum(w):
    w = (w & 01010101) + ((w>>1) & 01010101)
    w = (w & 00110011) + ((w>>2) & 00110011)
    w = (w & 00001111) + ((w>>4) & 00001111)
    return w

CSA/BWT state of the art

(2000) Grossi and Vitter introduce a general framework for CSAs (see their 2005 paper for a good survey)

(2000) Sadakane produces the first low memory construction scheme based on dynamic dictionaries (red-black trees)



Trade-off between leaf size and performance



Improvements and simplifications

Using a single dictionary + encoding:

charcode
A0
C01
G011
T0111
N01111


Finding matches

Matches can be quickly found by an exhaustive enumeration of intervals of the corresponding binary strings.

  matches([loA hiA],[loB hiB],zeros):
    if hiAn-loAn == 0 or hiBn-loBn == 0:
      return

    for bit=0,1
      loAn=forward(loA,bit)
      hiAn=forward(hiA,bit)
      loBn=forward(loB,bit)
      hiBn=forward(hiB,bit)

      if bit == 0 and zeros == K:
        report(loAn,hiAn,loBn,hiBn)
      else:
        matches(loAn,hiAn,loBn,hiBn,zeros+!bit)

Software and experiments

On a Mac G5 with 1.8 Gb of available RAM


Sample commands

BBBWTgen -vFC -i genomes/NCBImm.fasta -o NCBImm.fwd.bbb
MerMatches -vC -m20 -x1 -X1 -n1 -N1 \
     -i NCBIhs.fwd.bbb -I NCBImm.fwd.bbb \
     -o NCBIhs_NCBImm.mers
Extractor -b0 -s1 -i NCBIhs-rc_NCBImm.mers \
      | Positionator -vC -b NCBIhs.rev.bbb -o NCBIhs.rev.pos
PostMatches -v -m20 -p NCBIhs.fwd.pos -P NCBImm.fwd.pos \
     -i NCBIhs_NCBImm.mers -o NCBIhs_NCBImm.mat

Conclusions