
structures:
Stree { string, root }
Node { start, depth, slink=NIL, parent=NIL, children=[] }
Complexity variations:
| children | build-time | search-time | space |
| linked list | O(|A||S|) | O(|A||T|) | O(|S|) |
| arrays | O(|A||S|) | O(|T|) | O(|A||S|) |
| maps/trees | O(log|A| |S|) | O(log|A| |T|) | O(|S|) |
Implementation hacks
Known implementations
Suffix array is the permutation of sorted suffixes Think of it as a depth-first `dump' of the suffix tree |
|
find(P):
i = 0
lo = 0, hi = length(A)
for 0<=i<length(P):
Binary search for x,y where P[i]=S[A[j]+i] for lo<=x<=j<y<=hi
lo = x, hi = y
return {A[lo],A[lo+1],...,A[hi-1]}
Obviously, suffix trees and suffix arrays are not space efficient


The runs break up along character boundaries |A|+1 monotonic runs |
|
Example: B=agg$tgtccaaacagaaa Monotonic runs can be `colored' by letters Requires additional O(|A|) storage for block starts of each letter
|
|
Two important queries:
Pattern lookup comes down to doing 2|P| count queries lookup(P):
lo = 0, hi = length(B)
i = length(P)
while i>0:
i = i-1
lo = block(P[i]) + occ(P[i],lo)
hi = block(P[i]) + occ(P[i],hi)
return hi-lo
Note: this style of lookup could be done on a suffix array |
|
Store subtotals for every W letters
Common pattern: support queries with auxilliary data structure which fits into arbitrarily small memory |
|
Recall the example `agg$tgtccaaacagaaa'
| atgc | indicator vector |
| a | 100000000111010111 |
| c | 000000011000100000 |
| g | 011001000000001000 |
| t | 000010100000000000 |
Larger example: 101010101110101000101011111101010101010100010100001000011111110010010111
| counts | 0 | 13 | 25 | ||||||
| words | 10101010 | 11101010 | 00101011 | 11110101 | 01010101 | 00010100 | 00100001 | 11111100 | 10010111 |
occ reduced to population counts
sum-up-to(i):
q = i/(3*8), r = i/8, s = 8*(r+1)-i
x = counts[q]
for 3*q<=i<r:
x = x + wordsum(words[i])
x = x + wordsum(words[r] >> s)
wordsum(w):
w = (w & 01010101) + ((w>>1) & 01010101)
w = (w & 00110011) + ((w>>2) & 00110011)
w = (w & 00001111) + ((w>>4) & 00001111)
return w
(2000) Grossi and Vitter introduce a general framework for CSAs (see their 2005 paper for a good survey)
(2000) Sadakane produces the first low memory construction scheme based on dynamic dictionaries (red-black trees)




Using a single dictionary + encoding:
| char | code |
| A | 0 |
| C | 01 |
| G | 011 |
| T | 0111 |
| N | 01111 |
Matches can be quickly found by an exhaustive enumeration of intervals of the corresponding binary strings.
matches([loA hiA],[loB hiB],zeros):
if hiAn-loAn == 0 or hiBn-loBn == 0:
return
for bit=0,1
loAn=forward(loA,bit)
hiAn=forward(hiA,bit)
loBn=forward(loB,bit)
hiBn=forward(hiB,bit)
if bit == 0 and zeros == K:
report(loAn,hiAn,loBn,hiBn)
else:
matches(loAn,hiAn,loBn,hiBn,zeros+!bit)
creates backwards binary BWTs for nucleotide data. Uses a dynamic dictionary.
finds k-mer matches between BWT pairs
recovers position information for the matches
On a Mac G5 with 1.8 Gb of available RAM
BBBWTgen -vFC -i genomes/NCBImm.fasta -o NCBImm.fwd.bbb
MerMatches -vC -m20 -x1 -X1 -n1 -N1 \
-i NCBIhs.fwd.bbb -I NCBImm.fwd.bbb \
-o NCBIhs_NCBImm.mers
Extractor -b0 -s1 -i NCBIhs-rc_NCBImm.mers \
| Positionator -vC -b NCBIhs.rev.bbb -o NCBIhs.rev.pos
PostMatches -v -m20 -p NCBIhs.fwd.pos -P NCBImm.fwd.pos \
-i NCBIhs_NCBImm.mers -o NCBIhs_NCBImm.mat