Home/Magazine Archive/June 2022 (Vol. 65, No. 6)/Resolution of the Burrows-Wheeler Transform Conjecture/Full Text

Research highlights
## Resolution of the Burrows-Wheeler Transform Conjecture

The Burrows-Wheeler Transform (BWT) is an invertible text transformation that permutes symbols of a text according to the lexicographical order of its suffixes. BWT is the main component of popular lossless compression programs (such as `bzip2`

) as well as recent powerful compressed indexes (such as the *r*-index^{7}), central in modern bioinformatics. The compressibility of BWT is quantified by the number *r* of equal-letter runs in the output. Despite the practical significance of BWT, no nontrivial upper bound on *r* is known. By contrast, the sizes of nearly all other known compression methods have been shown to be either always within a poly-log *n* factor (where n is the length of the text) from *z*, the size of Lempel–Ziv (LZ77) parsing of the text, or much larger in the worst case (by an *n*^{ε} factor for ε > 0).

In this paper, we show that *r* = (*z* log^{2} *n*) holds for every text. This result has numerous implications for text indexing and data compression; in particular: (1) it proves that many results related to BWT automatically apply to methods based on LZ77, for example, it is possible to obtain functionality of the suffix tree in (*z* polylog *n*) space; (2) it shows that many text processing tasks can be solved in the optimal time assuming the text is compressible using LZ77 by a sufficiently large polylog *n* factor; and (3) it implies the first nontrivial relation between the number of runs in the BWT of the text and of its reverse.

In addition, we provide an (*z* polylog *n*)-time algorithm converting the LZ77 parsing into the run-length compressed BWT. To achieve this, we develop several new data structures and techniques of independent interest. In particular, we define compressed string synchronizing sets (generalizing the recently introduced powerful technique of string synchronizing sets^{11}) and show how to efficiently construct them. Next, we propose a new variant of wavelet trees for sequences of long strings, establish a nontrivial bound on their size, and describe efficient construction algorithms. Finally, we develop new indexes that can be constructed directly from the LZ77 parsing and efficiently support pattern matching queries on text substrings.

Lossless data compression aims to exploit redundancy in the input data to represent it in a small space. Despite the abundance of compression methods, nearly every existing tool falls into one of the few general frameworks, among which the three most popular are the following: Lempel–Ziv compression (where the nominal and most commonly used is the LZ77 variant^{25}), statistical compression (this includes, e.g., context mixing, prediction by partial matching (PPM), and dynamic Markov coding), and Burrows-Wheeler transform (BWT).^{4}

One of the features that best differentiates these algorithms is whether they better remove the redundancy caused by skewed symbol frequencies or by repeated fragments. The idea in LZ77 (which underlies, e.g., `7-zip`

and `gzip`

compressors) is to partition the input text into long substrings, each having an earlier occurrence in the text. Every substring is then encoded as a pointer to the previous occurrence using a pair of integers. This method natively handles long repeated substrings and can achieve an exponential compression ratio given sufficiently repetitive input. Statistical compressors, on the other hand, are based on representing (predicting) symbols in the input based on their frequencies. This is formally captured by the notion of the *kth order empirical entropy H _{k}*(

The above analysis raises the question about the nature of compressibility of the Burrows-Wheeler transform. The compression of BWT-based compressors, such as `bzip2`

, is quantified by the number *r* of equal-letter runs in the BWT. The clear picture described above no longer applies to the measure *r.* On the one hand, Manzini^{16} proved that *r* can be upper-bounded in terms of the kth order empirical entropy of the input string. On the other hand, already in 2008, Sirén et al.^{24} observed that BWT achieves excellent compression (superior to statistical methods) on highly repetitive collections and provided probabilistic analysis exhibiting cases when *r* is small. Yet, after more than a decade, no upper bound on *r* in terms of *z* (the size of the LZ77 parsing) was discovered.

This lack of understanding is particularly frustrating due to numerous applications of BWT in the field of bioinformatics and compressed computation. One of the most successful applications of BWT is in *compressed indexing*, which aims to store a compressed string simultaneously supporting various queries (such as random access, pattern matching, or even suffix array queries) concerning the uncompressed version. Although classical (uncompressed) indexes, such as suffix trees and suffix arrays, have been successful in many applications, they are not suitable for storing and searching big highly repetitive collections, which are virtually impossible to search without preprocessing. A recent survey^{17} provides several real-life examples of such datasets. In particular, Github stores more than 20 terabytes of data, with an average of over 20 versions per project, whereas the 100000 Human Genome Project produced over 70 terabytes of DNA, which is highly compressible due to 99.9% similarity between individual human genomes. Motivated by such applications, compressed indexing witnessed a remarkable surge of interest in recent years. BWT-based indexes, such as the r-index,^{7} are among the most powerful, and their space usage is up to (polylog *n*) factors away from the value* r.* For a comprehensive overview of this field, we refer the reader to a survey by Navarro.^{18}

In addition to text indexing, the Burrows-Wheeler transform has many applications in computational biology. In particular, BWT is the main component of the popular genome read aligners such as `Bowtie`

, `BWA`

, and `Soap2`

, which paved its way to general bioinformatics textbooks.^{21} Specialized literature on algorithmic analysis of biological sequences^{15,19} devotes dozens of pages to BWT applications.

Given the importance and practical significance of BWT, one of the biggest open problems that emerged in the field of lossless data compression and compressed computation asks:

What is the upper bound on the output size of the Burrows-Wheeler transform?

Except for BWT, essentially every other known compression method has been proven to produce output whose size is always within an (polylog *n*) factor from *z*, the output size of the LZ77 algorithm (e.g., grammar compression, collage systems, and macro schemes)^{a} or larger by a polynomial factor (*n*^{ε} for some ε > 0) in the worst case (e.g., LZ78, compressed word graphs (CDAWGs)). We refer the reader to Navarro^{17} for a survey of repetitiveness measures. Notably, BWT is known to never compress much better than LZ77, that is, *z* = (*r* log *n*),^{6} and the opposite relation *r* = (*z* poly-log *n*) was often conjectured to be false.^{5}

**1.1. Our contribution**

We prove that *r* = (*z* log^{2} *n*) holds for all strings, resolving the BWT conjecture in the more surprising way and answering an open question of Gagie et al.^{5,6} This result alone has multiple implications for indexing and compression:

- It is possible to support suffix array and suffix tree functionality in (
*z*polylog*n*) space.^{7} - It was shown by Kempa
^{10}that many string processing tasks (such as BWT and LZ77 construction) can be solved in (*n*/log_{σ}*n*+*r*polylog*n*) time (where*σ*is the alphabet size). Thus, if the text is sufficiently compressible by BWT (formally,*n/r*= Ω (polylog*n*)), these tasks can be solved in optimal time (which is unlikely to be possible for general texts^{11}). Our result loosens this assumption to*n/z*= Ω (polylog*n*). - Until now, methods based on the Burrows-Wheeler transform were thought to be neither statistical nor dictionary (LZ-like) compression algorithms.
^{5,24}Our result challenges the notion that the BWT forms a separate compression type: Because of our bound, BWT is much closer to LZ compressors than anticipated.

Our slightly stronger bound*r*= (δ log^{2}*n*), where δ ≤*z*is a symmetric (insensitive to string reversal) repetitiveness measure recently studied by Kociumaka et al.,^{13}further shows that: - The number of BWT runs in the reverse of the text satisfies , which is the first nontrivial bound in terms of
*r.*This result is of practical importance due to many algorithms whose efficiency depends on .^{2,20,22,23}

After proving *r* = (*z* log^{2} *n*), we refine our approach to obtain and subsequently . We then show that the latter bound, combined with the trivial one *r* ≤ *n*, is asymptotically tight for the full spectrum of values of *n* and δ. As a side result, we obtain a tight upper bound (*n* log δ) on the sum of irreducible LCP values, improving upon the previously known bound (*n* log *r*).^{9}

Next, we develop an (*z* log^{8} *n*)-time algorithm converting the LZ77 parsing into the run-length compressed BWT (the polylog *n* factor has not been optimized). This offers up to exponential speedup over the previously fastest space-efficient algorithms, which need Ω(*n* log *z*) time.^{20,22} To achieve this, we develop new data structures and techniques of independent interest. In particular, we introduce a notion of compressed string synchronizing sets, generalizing the technique by Kempa and Kociumaka.^{11} We then describe a new variant of wavelet trees,^{8} designed to work for sequences of long strings. In the full version of this paper, we also propose new indexes that can be built directly from the LZ77 parsing and support fast pattern matching queries on text substrings.

A *string* is a finite sequence of characters from a given *alphabet.* The length of a string *S* is denoted by |*S*| and, for *i* ∈ [1 . . |*S*|],^{b} the *i*th character of *S* is denoted by *S*[*i*]. A string *U* is a *substring* of *S* if *U* = *S*[*i*] *S*[*i* + 1] … *S*[*j* – 1] for some 1 ≤ *i* ≤ *j* ≤ |*S*| + 1. We then say that *U occurs* in *S* at position *i.* The *occurrence* of *U* at position *i* in *S* is denoted by *S*[*i* . . *j*) or *S*[*i* . . *j* – 1]. Such an occurrence is a *fragment* of *S* and can be represented by (a pointer to) *S* and the two positions *i*, *j.* Two fragments (perhaps of different strings) *match* if they are occurrences of the same substring. A fragment *S*[*i* . . *j*] is a *prefix* if *i* = 1 and a *suffix* if *j* = |*S*|.

We use to denote the *reverse* of *S*, that is, *S*[|*S*|] … *S*[2] *S*[1]. We denote the *concatenation* of two strings *U* and *V*, that is, *U*[1]*U*[2] … *U*[|*U*|]*V*[1]*V*[2] … *V*[|*V*|], by *UV* or *U* · *V.* Furthermore, denotes the concatenation of *k* ≥ 0 copies of *S*; note that *S*^{0} = ε is the *empty string.*

An integer *p* ∈ [1 . . |*S*|] is a *period* of a string *S* if *S*[i] = *S*[*i* + *p*] holds for every *i* ∈ [1 . . |*S*| – *p*]. The shortest period of *S* is denoted as per(*S*). A string *S* is called *periodic* if . The following fact is a consequence of the classic *periodicity lemma* (which we do not use directly).

FACT 2.1 (see Amir et al.,^{1} FACT 1). *Any two distinct periodic strings of the same length differ on at least two positions.*

Throughout the paper, we consider a string (called the *text*) *T* of length *n* ≥ 1 over an ordered alphabet Σ of size σ. We assume that *T*[*n*] = $, where $ is the smallest symbol in Σ, and that $ does not occur anywhere else in *T.* We use to denote the order on Σ, extended to the *lexicographic* order on Σ^{*} (the set of strings over Σ) so that *U*, *V* ∈ Σ^{*} satisfy *U* *V* if and only if either *U* is a prefix of *V*, or *U*[1 . . *i*) = *V*[1 . . *i*) and *U*[*i*] ≺ *V* [*i*] holds for some *i* ∈ [1 . . min(|*U*|, |*V*|)].

The *suffix array* SA[1 . . *n*] of *T* is a permutation of [1 . . *n*] such that *T*[SA[1] . . *n*] ≺ *T*[SA[2] . . *n*] ≺ … ≺ *T*[SA[*n*] . . *n*]. The closely related *Burrows-Wheeler transform* BWT[1 . . *n*] of *T* is defined by BWT[*i*] = *T*[SA[*i*] – 1] if SA[*i*] > 1 and BWT[*i*] = *T*[*n*] otherwise. In the context of BWT, it is convenient to consider an *infinite string T*^{∞} defined so that *T*^{∞}[*i*] = *T*[1 + (*i* – 1) mod *n*] for *i* ∈ ℤ in particular, *T*^{∞} [1 . . *n*] = *T* [1 . . *n*]. We then have BWT[*i*] = *T*^{∞} [SA[*i*] – 1] for *i* ∈ [1 . . *n*]. Moreover, the assumption *T*[*n*] = $ implies *T*^{∞} [SA[1] . .] ≺ … ≺ *T*^{∞}[SA[*n*] . .].

For any string , where *c _{i}* ∈ Σ and

`bbabaababababaababa$`

in Table 1, we have BWT `= a`^{1}b^{6}a^{1}b^{2}a^{6}b^{1}a^{2}$^{1}

, and hence

**Table 1. The lexicographically sorted suffixes of a string T = bbabaa-babababaababa$ along with the BWT, SA, and LCP tables.**

By lcp(*U*, *V*) we denote the length of the longest common prefix of strings *U* and *V.* For *j*_{1}, *j*_{2} ∈ [1 . . *n*], we let LCE(*j*_{1}, *j*_{2}) = lcp(*T*[*j*_{1} . .], *T*[*j*_{2} . .]). The *LCP array* LCP[1 . . *n*] is defined so that LCP[*i*] = LCE(SA[*i*], SA[*i* – 1]) for *i* ∈ [2 . . *n*] and LCP[1] = 0. We say that the value LCP[*i*] is *reducible* if BWT[i] = BWT[*i* – 1] and *irreducible* otherwise (including when *i* = 1). There are exactly *r* irreducible LCP values.

We say that a nonempty fragment *T*[*i* . . *i* + *ℓ*) is a *previous factor* if it has an earlier occurrence in *T*, that is, LCE (*i*, *i'*) ≥ *ℓ* holds for some *i'* ∈ [1 . . *i*). The *LZ77 parsing* of *T* is a factorization *T* = *F*_{1} ··· *F*_{f} into nonempty *phrases* such that the *j*th phrase *F _{j}* is the longest previous factor starting at position 1 + |

`bbabaababababaababa$`

of Table 1 has LZ77 factorization `b · b · a · ba · aba · bababa · ababa · $`

with The *trie* of a finite set *S* ⊆ Σ^{*} is an edge-labeled rooted tree with a node *v _{X}* for every string

`{$aba, aaba, abaa, abab, abb$, b$ab, baab, baba, babb, bb$a}`

.

**Figure 1. The trie of the reversed length-4 substrings of T^{∞} for T = **

`bbabaababababaababa$`

of Table 1. Light edges are thin and dotted. **3.1. Basic upper bound**

To illustrate the main idea of our proof technique, we first prove the upper bound in its simplest form *r* = (*z* log^{2} *n*). The following lemma stands at the heart of our proof.

LEMMA 3.1. *For every ℓ* ∈ [1 . . *n*], *the number of irreducible LCP values in* [*ℓ* . . 2*ℓ*) is (*z* log *n*).

PROOF. Denote *S _{m}* = {

The idea of the proof is as follows: With each irreducible value LCP[*i*] ∈ [*ℓ* . . 2*ℓ*), we associate a cost of *ℓ* units, which are charged to individual characters of strings in *S*_{3ℓ}. We then show that each of the strings in *S*_{3ℓ} is charged at most 2 log *n* times. The number of irreducible LCP values in [*ℓ* . . 2*ℓ*) equals times the total cost, which is at most

To devise the announced assignment of cost to characters of strings in *S*_{3ℓ}, consider the trie of all reversed strings in *S _{ℓ}* (see Figure 1 for example).

Let LCP[*i*] ∈ [*ℓ* . . 2*ℓ*) be an irreducible LCP value; note that *i* > 1 due to LCP[*i*] ≥ *ℓ* > 0. Let *j*_{0} = SA[*i* – 1] and *j*_{1} = SA[*i*] so that LCP[*i*] = LCE(*j*_{0}, *j*_{1}). Because LCP[*i*] is irreducible, we have *T*^{∞} [*j*_{0} – 1] = BWT[*i* – 1] ≠ BWT[*i*] = *T*^{∞} [*j*_{1} – 1]. For *k* ∈[1 .. *ℓ*], the *k*th unit of the cost associated with LCP[*i*] is charged to the *k*th character (*T*^{∞}[*j _{t}* – 1]) of the string (

**Figure 2. Proof of Lemma 3.1 on T from Table 1 for ℓ = 4, i = 11, and k = 2. Strings T^{∞} [j_{t} – k . . j_{t} – k + 3ℓ) are highlighted. The subtree of rooted in v_{baa} is smaller than in v_{bab} (see Figure 1), and hence we charge the second symbol of T^{∞} [j_{1} – k . . j_{1} – k + 3ℓ), that is, t = 1.**

Note that at most log *n* characters of each *S* ∈ *S*_{3ℓ} can be charged during the above procedure: Whenever *S*[*k*], with *k* ∈ [1 .. *ℓ*], is charged, the subtree of rooted at has at least twice as many leaves as the subtree rooted at , and this can happen for at most log|*S _{ℓ}*| ≤ log

It remains to show that, for every *S* ∈ *S*_{3ℓ}, a single position *S*[*k*], with *k* ∈[1 .. *ℓ*], can be charged at most twice. For this, observe that the characters charged for a single irreducible value LCP[*i*] are at different positions (of strings in *S*_{3ℓ}). Hence, to analyze the total charge assigned to *S*[*k*], we only need to bound the number of possible candidate positions *i.* Let [*b* . . *e*] be the set of indices *i'* such that *T*^{∞} [SA[*i'*]..] starts with *S*[*k* + 1 .. 3*ℓ*]. In the above procedure, if a character *S*[*k*] is charged a unit of cost corresponding to LCP[*i*], then *S*[*k* + 1 .. 3*ℓ*] is a prefix of either *T*^{∞}[SA[*i*−1]..]=*T*^{∞}[*j*_{0} ..] or *T*^{∞}[SA[*i*]..]=*T*^{∞}[*j*_{1} ..]. Hence, {*i*−1,*i*}∩[*b*..*e*]≠∅. At the same time, LCE(SA[*i* –1], SA[*i*]) < 2*ℓ*, and all strings *T*^{∞}[SA[*i'*]..] with *i'* ∈[*b* .. *e*] share a common prefix *S*[*k* + 1 .. 3*ℓ*] of length 3*ℓ* – *k* ≥ 2*ℓ*. Thus, *i* = *b* or *i* = *e* + 1. ⅈ

THEOREM 3.2. *All length-n strings satisfy r* = (*z* log^{2} *n*).

PROOF. Recall that *r* is the total number of irreducible LCP values. Thus, the claim follows by applying Lemma 3.1 for *ℓ _{i}* = 2

**3.2. Tighter upper bound**

To obtain a tighter bound, we refine the ideas from Section 3.1, starting with a counterpart of Lemma 3.1.

LEMMA 3.3. *For every ℓ* ∈[1 .. *n*], *the number of irreducible LCP values in* [*ℓ* .. 2*ℓ*) is (*z* log *z*).

PROOF. The proof follows closely that of Lemma 3.1. However, with each irreducible value LCP[*i*] ∈ [*ℓ* .. 2*ℓ*), we associate cost instead of *ℓ.* We then show that each of the strings in *S*_{3ℓ} is charged at most 2 · (3 + log *z*) times (rather than 2 log *n* times). Then, the number of irreducible LCP values in the range [*ℓ* .. 2*ℓ*) does not exceed times the total cost, which is bounded by

Recall the trie of all reversed strings in *S _{ℓ}*. For a node

Consider a light edge from a node *v* to its parent *u* at depth at least . Let *v'* be a sibling of *v* satisfying size(*v'*) ≥ size(*v*), and let *S _{v}*,

Consider the set *S* of length-*ℓ* strings corresponding to the leaves in the subtree of rooted at (i.e., the labels of the root-to-leaf paths passing through ). Define and note that , because is the trie of *reversed* strings from *S _{ℓ}*. Let denote the ending positions of the leftmost occurrences in

The reasoning above shows that once a root-to-leaf path encounters a light edge connecting a node *u* at depth at least to its child *v*, we have size(*v*) ≤ 4*z.* The number of the remaining light edges on the path is at most log(size(*v*)) ≤ 2 + log *z* by the standard bound applied to the subtree of rooted at *v.*

THEOREM 3.4. *Every string T of length n satisfies* .

PROOF. To obtain tighter bounds on the number of irreducible LCP values in [*ℓ* . . 2*ℓ*), we consider three cases:

*ℓ* ≤ **log z.** We repeat the proof of Lemma 3.1, except that we observe that the number of light edges on each root-to-leaf path in is bounded by

. We use the bound (*z* log *z*) of Lemma 3.3. . We repeat the proof of Lemma 3.3, except that we observe that |*S*_{3ℓ}| ≤ *n.* Thus, the number of irreducible LCP values in .

The above upper bounds, applied for every *ℓ* = 2^{i} with *i* ∈ [0 . . ⌊log *n*⌋], yield

**3.3. Tight bounds in terms of δ**

Let denote the (cyclic) *substring complexity* of *T.*^{13} Note that letting is equivalent because |*S _{m}*| ≤

To adapt the proof Lemma 3.3, we need to generalize the observation that at most *z* substrings from *S _{ℓ}* may have disjoint leftmost occurrences in

LEMMA 3.5 (based on Kociumaka et al.,^{13} LEMMA 6). *For any integer ℓ* ≥ 1, *the total number of positions in T*^{∞}[1 . .) *covered by the leftmost occurrences of strings from S _{ℓ} is at most* 3

PROOF. Let *C* denote the set of positions in *T*^{∞}[1 . .) covered by the leftmost occurrences of strings from *S _{ℓ}*, and let

We have thus shown that |*C'*| = |*S*|≤|*S*_{2ℓ}. Because |*S*_{2ℓ}| ≤ 2*δℓ* holds by definition of δ, we obtain |*C*|<|*C*'|+ ℓ ≤ |*S*_{2ℓ}|+*ℓ*≤(2*δ*+1)*ℓ*≤3*δℓ*. ⅈ

LEMMA 3.6. *For every ℓ* ∈[1 .. *n*], *the number of irreducible LCP values in* [*ℓ* .. 2*ℓ*) *is* (δ log δ).

PROOF. Compared to the proof of Lemma 3.3, we use the bound |*S*_{3ℓ}| ≤ 3*ℓ*δ instead of |*S*_{3ℓ}| ≤ 3*ℓz*. The only other modification required is that, for every light edge connecting a node *u* of at depth at least to its child *v*, we need to prove size(*v*) = (δ).

Let be defined as in the proof of Lemma 3.3. Recall that there are at least strings in *S _{ℓ}* with disjoint leftmost occurrences in

By replacing the thresholds log *z* and with log δ and , respectively, in the proof of Theorem 3.4, we immediately obtain a bound in terms of δ.

THEOREM 3.7. *Every string T of length n satisfies* .

The following construction, analyzed in the full version of this paper, provides tight examples with substring complexity δ covering the whole spectrum between (1) and . For *ℓ* ≥ 1 and *x* ∈[0 .. 2^{ℓ}), we set bin_{ℓ} (*x*) ∈ {0, 1}^{ℓ} to be the binary representation of *x.*

LEMMA 3.8. *For all integers ℓ* ≥ 2 *and K* ≥ *1, the length n, the substring complexity δ, and the number of runs r in the BWT of a string T _{l,K}* ∈ {$, 0, 1, 2}

*satisfy* *n* = Θ(2^{K+ℓ}*ℓ*, *δ*=Θ(2^{ℓ}), *and r* = Ω(2^{ℓ} *ℓK*).

If , on the other hand, then a trivial upper bound *r* = (*n*) is tighter than that of Theorem 3.7. In this case, the following construction, also analyzed in the full version of this paper, provides tight examples.

LEMMA 3.9. *For all integers ℓ* ≥ 2 *and* Δ∈Ω(2^{ℓ})∩(2^{ℓ}*ℓ*), *the length n, the substring complexity δ, and the number of runs r in the BWT of a string* , *defined with*

*satisfy* *n* =Θ(2^{ℓ}*ℓ*), δ=Θ(Δ), *and* *r* = Ω(2^{ℓ}*ℓ*).

Overall, combining Lemmas 3.8 and 3.9, we obtain the following tight lower bound for δ ranging from (1) to Ω(*n*).

THEOREM 3.10. *For every N* ≥ *1 and* Δ ∈ [1 .. *N*], *there exists a string T whose length n, substring complexity δ, and the number of runs r in the BWT satisfy* *n*=Θ(*N*), δ=Θ(Δ), *and* .

**3.4. Further combinatorial bounds**

By combining Theorem 3.7 with known properties of the substring complexity δ, we obtain the first bound relating the number of BWT runs in the string and its reverse. No such bounds (even polynomial in *r* log *n*) were known before.

COROLLARY 3.11. *If r and denote the number of runs in the BWT of a length-n text and its reverse, respectively, then* .

PROOF. Because the value of δ is the same for the text and its reverse, Theorem 3.7 yields . Combining the results of Kempa and Prezza^{12} (Theorem 3.9) and Kociumaka et al.^{13} (Lemma 2) gives δ ≤ *r.* Consequently, we obtain .

In the full version of this paper, we also characterize the sum of all irreducible LCP values:

THEOREM 3.12. *For every string of length n, the sum r*_{Σ} *of all irreducible LCP values satisfies r*_{Σ} = (*n* log δ).

Due to δ ≤ *r*, the presented upper bound is always (asymptotically) at least as strong as the previous bound *r*_{σ} ≤ *n* log *r* by Kärkkäinen et al.^{9} Furthermore, it can be strictly stronger because log δ = *o*(log *r*) is possible when δ = log^{o(1)} *n.* In the full version of this paper, we show that the strings of Lemmas 3.8 and 3.9 yield a matching lower bound:

THEOREM 3.13. *For every N* ≥ 1 *and* Δ ∈ [1 .. *N*], *there exists a string T whose length n, substring complexity δ, and sum r*_{Σ} *of irreducible LCP values satisfy* *n*=Θ(*N*), δ=Θ(Δ), *and* *r*_{Σ} = Θ(*n* log δ).

In this section, we outline our algorithm that, given the LZ77 parsing of a text *T* ∈ Σ^{n}, computes its run-length compressed BWT in (*z* polylog *n*) time. We start with an overview that explains the key concepts. Next, we present two new data structures utilized in our algorithm: the compressed string synchronizing set (Section 4.1) and the compressed wavelet tree (Section 4.2). The sketch of the final algorithm is then presented in Section 4.3.

For any substring *Y* of *T*^{∞}, we define

We say that a substring *Y* of *T*^{∞} is *left-maximal* if there exist distinct symbols *a*, *b* ∈ Σ such that the strings *aY* and *bY* are also substrings of *T*^{∞}. The following definition, assuming Σ∩ℕ=∅, plays a key role in our construction.

DEFINITION 4.1 (BWT MODULO *ℓ*). *Let* *T* ∈ Σ^{n}, *ℓ* ≥ 1 *be an integer, and* *Y*_{i}= *T*^{∞}[SA[*i*]..SA[*i*]+*ℓ*) *for* *i* ∈[1..*n*]. *We define the string* BWT_{ℓ} ∈(Σ∪ℕ)^{n}, *called the* BWT modulo *ℓ* *(of T), as follows. For i* ∈ [1 .. *n*],

The algorithm runs in *k* = ⌈log *n*⌉ rounds. For *q* ∈ [0 .. *k*), the input to the *q*th round is RL(BWT_{2q}) and the output is RL(BWT_{2q+1}). At the end of the algorithm, we have RL(BWT_{2k}) = RL(BWT), because *X* ∈ *S*_{2k} is never left-maximal for 2^{k} ≥ *n.*

Informally, in round *q*, we are given a (run-length compressed) subsequence of BWT that can be determined based on sorting the suffixes only up to their prefixes of length 2^{q}. Notice that BWT_{ℓ}[*b* .. *e*] ∈ Σ^{*} implies BWT_{ℓ+1} [*b* .. *e*] ∈ Σ^{*} (because a prefix of a left-maximal substring is left-maximal). Hence, these subsequences need not be modified until the end of the algorithm (except possibly merging their runs with adjacent runs). For the remaining positions, BWT_{ℓ} identifies the (leftmost occurrences of) substrings to be inspected in the *q*th round with the aim of replacing their corresponding runs in BWT_{ℓ} with previously unknown BWT symbols (as defined in BWT_{2ℓ}).

We call a block BWT[*b* . . *e*] *uniform* if all symbols in BWT[*b* . . *e*] are equal and *nonuniform* otherwise. The following lemma ensures the feasibility of the above construction.

LEMMA 4.2. *For any ℓ* ≥ 1, *it holds* |RL(BWT_{ℓ})| < 2*r.*

PROOF. Denote RL(BWT_{ℓ}) = ((*c*_{1}, λ_{1}, …,(*c*_{h},λ_{h})), letting λ_{0} = 0. By definition of BWT_{ℓ}, if *c _{i}* ∈ ℕ, then the block BWT(λ

On the other hand, *c _{i}* ∈ Σ and

**4.1. Compressed string synchronizing sets**

Our algorithm builds on the notion of *string synchronizing sets*, recently introduced by Kempa and Kociumaka.^{11} Synchronizing sets are one of the most powerful techniques for sampling suffixes. In the uncompressed setting, they are the key in obtaining time-optimal solutions to multiple problems, such as a state-of-the-art BWT construction algorithm for texts over small alphabets.^{11} In this section, we introduce a notion of *compressed string synchronizing sets.* Our construction is the first implementation of synchronizing sets in the compressed setting and thus of independent interest.

We start with the definition of basic synchronizing sets.

DEFINITION 4.3 (τ-SYNCHRONIZING SET^{11}). *Let T* ∈ Σ^{n} *be a string and let* *be a parameter. A set* S⊆[1..*n*−2τ+1] *is called a τ-synchronizing set of T if it satisfies the following* consistency *and* density *conditions:*

*If**T*[*i*..*i*+2τ)=*T*[*j*..*j*+2τ),*then i*∈ S*holds if and only if j*∈ S*(for i, j*∈[1 ..*n*– 2τ +1]),- S∩[
*i*..*i*+τ)=∅*if and only if**(for i*∈[1 ..*n*– 3τ + 2]).

In most applications, we want to minimize |S|. However, the density condition imposes a lower bound for strings of length *n* ≥ 3τ – 1 that do not contain substrings of length 3τ – 1 that are periodic with period at most . Therefore, we cannot hope to achieve an upper bound improving in the worst case upon the following one.

THEOREM 4.4 *(Kempa and Kociumaka ^{11}). For any string T of length n and parameter* ,

Storing S for compressible strings presents the following challenge: Although *is* sometimes possible, it *is not* implied by *z* *n.* For example, as noted by Bille et al.,^{3} Thue–Morse strings satisfy *z* = (log *n*) yet they contain no periodic substring *X* with , and thus their synchronizing sets satisfy . This prevents us from keeping the plain representation of S when .

We therefore exploit a different property of compressible strings: That all their substrings *Y* satisfy lpos(*Y*) , where *e _{j}* is the last position of the

DEFINITION 4.5 (COMPRESSED τ-SYNCHRONIZING SET). *Let* S *be a τ-synchronizing set of string T*[1 . . *n*] *for some* , *and, for every j* ∈[1 .. *z*], *let e _{j} denote the last position of the jth phrase in the LZ77 parsing of T. For k* ∈ ℤ

In the full version of this paper, we prove that every text *T* has a synchronizing set S with a small compressed representation, and we show how to efficiently compute such S from the LZ77 parsing of *T.*

THEOREM 4.6. *There exists a Las-Vegas randomized algorithm that, given the LZ77 parsing of a string T* ∈ Σ^{n}, *a parameter* , *and a constant k* ∈ ℤ_{≥2}, *constructs in* (*z* log^{5} *n*) *time a compressed representation* comp_{k}(S) *of a τ-synchronizing set* S *of T satisfying* |comp_{k}(S)|≤ 72*kz*.

**4.2. Compressed wavelet trees**

Along with string synchronizing sets, wavelet trees,^{8} originally invented for text indexing, play a central role in our algorithm. Unlike virtually all prior applications of wavelet trees, ours uses a sequence of very long strings (up to Θ(*n*) symbols). This approach is feasible because all strings are substrings of the text, which is stored in the LZ77 representation. In this section, we describe this novel variant of wavelet trees, dubbed here *compressed wavelet trees.* In particular, we prove an upper bound on their size, describe an efficient construction from the LZ77-compressed text, and show how to augment them to support some fundamental queries.

Let Σ be an alphabet of size σ ≥ 1. Consider a string *W*[1 . . *m*] over the alphabet Σ^{ℓ} so that *W* is a sequence of *m* ≥ 0 strings of length *ℓ* ≥ 0 over the alphabet Σ. The wavelet tree of *W* is the trie of Σ with each node *v _{X}* associated to a string

With each node *v _{X}* ∈

Based on *I _{X}*, we define

if |*X*| < *ℓ*, and *B _{X}* = ε if |

As typically done in the applications of wavelet trees, we only explicitly store the strings *B _{X}*. The values of primary indices

LEMMA 4.7 (Grossi et al.^{8}). *Let X* ∈ Σ^{d}, *where d* ∈ [0 . . *ℓ*). *For every c* ∈ Σ *and j* ∈ [1 . . |*I*_{Xc}|], *we have I _{Xc}* [

We define the *compressed wavelet tree* _{c} of *W* as the wavelet tree of *W* in which all strings *B _{X}* have been run-length compressed and, except , all nodes

We store the edge labels of _{c} as pointers to substrings in *W.* We assume that each edge of _{c} and each element of RL(*B _{X}*) can be encoded in (1) space. Because |RL(

THEOREM 4.8. *If* _{c} *is the compressed wavelet tree of a sequence W of length-l strings, then* .

PROOF. Let *m* = |*W*|, *k* = |RL(*W*)| ≤ *m*, and *k'* = |{*W*[*i*] : *i* ∈ [1 . . *m*]}|. Due to |*V*(*T*_{c})|≤ 1+2*k*'=(1 + k), we can focus on nodes *v*_{x}∈*V*(*T*_{c}) such that |RL(B_{X})| ≥ 2.

The proof resembles that of Lemma 3.1. With each *X* ∈ Σ^{*} such that |RL(*B _{X}*)| ≥ 2, we associate |RL(

Consider *X* ∈ Σ^{d} with |RL(*B _{X}*)| ≥ 2; note that

Now, consider a run *W*[*b* . . *b'*] = *Y*^{δ} in RL(*W*). For a single depth *d*, the run could be charged at most twice, with at most one unit assigned to *W*[*b*] due to *p*_{1} = *b* and at most one unit assigned to *W*[*b'*] due to *p*_{0} = *b'*, both for *X* = *Y*[1 . . *d*]. Moreover, note that the subtree size on the path from *v _{Y}* to the root

The key operation that we want to support on _{c} is to compute the value *I _{X}* [

THEOREM 4.9. *Given the LZ77 representation of T[1 . . n] and a sequence W[1 .. m] of m* ≤ *n same-length substrings of T*^{∞}, *represented as* RL((lpos(*W*[*i*]))_{i ∈[1 .. m]}), *the compressed wavelet tree of W, supporting primary index queries in time* (log^{4} *n*), *can be constructed in time* ((*z*+|RL(*W*)|) log^{2} *n*).

**4.3. The algorithm**

We are now ready to sketch the procedure constructing the sequences RL(BWT_{2q}), where *q* ∈ [0 . . ⌈log *n*⌉].

Let *q* ∈ [4 . . ⌈log *n*⌉) (values *q* ∈ [0 . . 3] are handled separately) and let *ℓ* = 2^{q}. We show how to compute RL(BWT_{2ℓ}) given RL(BWT_{ℓ}) and the LZ77 parsing of *T.* The main idea of the algorithm is as follows.

Let S be a τ-synchronizing set of *T*, where . As noted earlier, BWT_{ℓ} [*j*] ∈ Σ implies BWT_{2ℓ} [*j*] ∈ Σ. Let BWT_{ℓ} [*y* . . *y'*] ∈ ℕ^{+} be a run in BWT_{ℓ}. By definition of BWT_{ℓ}, the suffixes of *T*^{∞} starting at positions *i* ∈ SA[*y* . . *y'*] share a common prefix of length *ℓ* ≥ 3τ. Thus, assuming that S∩[*i*..*i*+τ) = holds for all *i* ∈ SA[*y* . . *y'*] (the periodic case is handled separately), by the consistency of S, all text positions *i* ∈ SA[*y* . . *y'*] share a common offset Δ with *i* + Δ = min (S∩[*i*..*i*+τ)). This lets us deduce the order of length-2*1* prefixes *T*[*i* . . *i*+2*ℓ*) based on the order of strings *T*[*i* + Δ . . *i* + 2*ℓ*) starting at synchronizing positions. For this, from the sorted list of fragments *T*[*s* . . *s* + 2*ℓ*) across *s* ∈ S, we extract, using a wavelet tree, those preceded by *T*[*i* . . *i* + Δ) (a prefix common to *T*^{∞}[*i* . .) for *i* ∈ SA[*y* . . *y'*]). Importantly, the synchronizing positions *s* sharing *T*[*s* – τ . . *s* + 2*ℓ*) can be processed together; hence, by Theorem 4.6, applied for *k* = 7 so that 2*ℓ* ≤ *k*τ, it suffices to use (*z*) distinct substrings.

In the full version of the paper, we provide the details of the above construction and obtain the following result.

THEOREM 4.10. *There exists a Las-Vegas randomized algorithm that, given the LZ77 parsing of a text T of length n, computes its run-length compressed Burrows-Wheeler transform in time* ((*r* + *z*) log^{6} *n*) = (*z* log^{8} *n*).

This work was performed when the first author was a post-doctoral scholar at the University of California, Berkeley (supported by NSF grants no. 1652303 and 1934846, and an Alfred P. Sloan Fellowship grant), whereas the second author was a postdoctoral scholar at Bar-Ilan University, Israel (supported by ISF grants no. 1278/16 and 1926/19, a BSF grant no. 2018364, and an ERC grant MPM under the EU's Horizon 2020 Research and Innovation Programme, agreement no. 683064).

1. Amir, A., Iliopoulos, C.S., Radoszewski, J. Two strings at Hamming distance 1 cannot be both quasiperiodic. *Inf. Process. Lett. 128*, (2017), 54–57.

2. Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M. Composite repetition-aware data structures. In *CPM* (2015), Springer, Cham, Switzerland, 26–39.

3. Bille, P., Gagie, T., Gørtz, I.L., Prezza, N. A separation between RLSLPs and LZ77. *J. Discrete Algorithms 50*, (2018), 36–39.

4. Burrows, M., Wheeler, D.J. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation. Palo Alto, CA, 1994.

5. Gagie, T., Navarro, G., Prezza, N. Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. arXiv 1809.02792 (2018).

6. Gagie, T., Navarro, G., Prezza, N. On the approximation ratio of Lempel–Ziv parsing. In *LATIN* (2018), Springer, Cham, Switzerland, 490–503.

7. Gagie, T., Navarro, G., Prezza, N. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. *J. ACM 67*, 1 (2020), 1–54.

8. Grossi, R., Gupta, A., Vitter, J.S. High-order entropy-compressed text indexes. In *SODA* (2003), ACM/SIAM, Philadelphia, PA, USA, 841–850.

9. Kärkkäinen, J., Kempa, D., Pitkowski, M. Tighter bounds for the sum of irreducible LCP values. *Theor. Comput. Sci. 656*, (2016), 265–278.

10. Kempa, D. Optimal construction of compressed indexes for highly repetitive texts. In *SODA* (2019), SIAM, Philadelphia, PA, USA, 1344–1357.

11. Kempa, D., Kociumaka, T. String synchronizing sets: Sublinear-time BWT construction and optimal LCE data structure. In *STOC* (2019), ACM, New York, NY, USA, 756–767.

12. Kempa, D., Prezza, N. At the roots of dictionary compression: String attractors. In *STOC* (2018), ACM, New York, NY, USA, 827–840.

13. Kociumaka, T., Navarro, G., Prezza, N. Towards a definitive measure of repetitiveness. In *LATIN* (2020), Springer, Cham, Switzerland, 207–219.

14. Kreft, S., Navarro, G. On compressing and indexing repetitive sequences. *Theor. Comput. Sci. 483* (2013), 115–133.

15. Mäkinen, V., Belazzougui, D., Cunial, F., Tomescu, A.I. *Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing.* Cambridge University Press, Cambridge, UK, 2015.

16. Manzini, G. An analysis of the Burrows-Wheeler transform. *J. ACM 48*, 3 (2001), 407–430.

17. Navarro, G. Indexing highly repetitive string collections, part I: Repetitiveness measures. *ACM Comput. Surv. 54*, 2 (2021), 1–31.

18. Navarro, G. Indexing highly repetitive string collections, part II: Compressed indexes. *ACM Comput. Surv. 54*, 2 (2021), 1–32.

19. Ohlebusch, E. *Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction.* Oldenbusch Verlag, Bremen, Germany, 2013.

20. Ohno, T., Sakai, K., Takabatake, Y., Tomohiro, I, Sakamoto, H. A faster implementation of online RLBWT and its application to LZ77 parsing. *J. Discrete Algorithms*, (2018), 52-53:18–28.

21. Pevsner, J. *Bioinformatics and Functional Genomics*, 3^{rd} edn. Wiley-Blackwell, Chichester, UK, 2015.

22. Policriti, A., Prezza, N. From LZ77 to the run-length encoded Burrows-Wheeler transform, and back. In *CPM* (2017), Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 17:1–17:10.

23. Policriti, A., Prezza, N. LZ77 computation based on the run-length encoded BWT. *Algorithmica 80*, 7 (2018), 1986–2011.

24. Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G. Run-length compressed indexes are superior for highly repetitive sequence collections. In *SPIRE* (2008), Springer, Berlin, Heidelberg, Germany, 164–175.

25. Ziv, J., Lempel, A. A universal algorithm for sequential data compression. *IEEE Trans. Inf. Theory 23*, 3 (1977), 337–343.

a. The choice for LZ77 as a representative in this class follows from the fact that most other methods are NP-hard to optimize, whereas LZ77 admits a simple linear-time compression.

b. For integers *i*, *j* ∈ ℤ, we denote [*i* . . *j*] = {*k* ∈ ℤ : *i* ≤ *k* ≤ *j*}, [*i* . . *j*) = {*k* ∈ ℤ : *i* ≤ *k* < *j*}, and (*i* . . *j*] = {*k* ∈ ℤ : *i* < *k* < *j*}.

The original version of this paper was presented at FOCS 2020 and is available at https://arxiv.org/abs/1910.10631

To view the accompanying Technical Perspective, visit doi.acm.org/10.1145/3531443

The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.

No entries found