### Abstract

The Burrows-Wheeler Transform (BWT) is an invertible text transformation that permutes symbols of a text according to the lexicographical order of its suffixes. BWT is the main component of popular lossless compression programs (such as `bzip2`

) as well as recent powerful compressed indexes (such as the *r*-index^{7}), central in modern bioinformatics. The compressibility of BWT is quantified by the number *r* of equal-letter runs in the output. Despite the practical significance of BWT, no nontrivial upper bound on *r* is known. By contrast, the sizes of nearly all other known compression methods have been shown to be either always within a poly-log *n* factor (where n is the length of the text) from *z*, the size of Lempel–Ziv (LZ77) parsing of the text, or much larger in the worst case (by an *n*^{ε} factor for ε > 0).

In this paper, we show that *r* =
(*z* log^{2} *n*) holds for every text. This result has numerous implications for text indexing and data compression; in particular: (1) it proves that many results related to BWT automatically apply to methods based on LZ77, for example, it is possible to obtain functionality of the suffix tree in
(*z* polylog *n*) space; (2) it shows that many text processing tasks can be solved in the optimal time assuming the text is compressible using LZ77 by a sufficiently large polylog *n* factor; and (3) it implies the first nontrivial relation between the number of runs in the BWT of the text and of its reverse.

In addition, we provide an
(*z* polylog *n*)-time algorithm converting the LZ77 parsing into the run-length compressed BWT. To achieve this, we develop several new data structures and techniques of independent interest. In particular, we define compressed string synchronizing sets (generalizing the recently introduced powerful technique of string synchronizing sets^{11}) and show how to efficiently construct them. Next, we propose a new variant of wavelet trees for sequences of long strings, establish a nontrivial bound on their size, and describe efficient construction algorithms. Finally, we develop new indexes that can be constructed directly from the LZ77 parsing and efficiently support pattern matching queries on text substrings.

### 1. Introduction

Lossless data compression aims to exploit redundancy in the input data to represent it in a small space. Despite the abundance of compression methods, nearly every existing tool falls into one of the few general frameworks, among which the three most popular are the following: Lempel–Ziv compression (where the nominal and most commonly used is the LZ77 variant^{25}), statistical compression (this includes, e.g., context mixing, prediction by partial matching (PPM), and dynamic Markov coding), and Burrows-Wheeler transform (BWT).^{4}

One of the features that best differentiates these algorithms is whether they better remove the redundancy caused by skewed symbol frequencies or by repeated fragments. The idea in LZ77 (which underlies, e.g., `7-zip`

and `gzip`

compressors) is to partition the input text into long substrings, each having an earlier occurrence in the text. Every substring is then encoded as a pointer to the previous occurrence using a pair of integers. This method natively handles long repeated substrings and can achieve an exponential compression ratio given sufficiently repetitive input. Statistical compressors, on the other hand, are based on representing (predicting) symbols in the input based on their frequencies. This is formally captured by the notion of the *kth order empirical entropy H _{k}*(

*T*). For any sufficiently long text

*T*, symbol frequencies (taking length-

*k*contexts into account) in any power of

*T*(the concatenation of several copies of

*T*) do not change significantly (see Kreft and Navarro,

^{14}Lemma 2.6). Therefore, |

*T*|

^{t}*H*(

_{k}*T*) ≈

^{t}*t*· |

*T*|

*H*(

_{k}*T*) holds for any

*t*≥ 1, which means that entropy is not sensitive to long repetitions, and hence it is not able to capture the same type of redundancy as the LZ77 compression.

^{7,12,14,24}

The above analysis raises the question about the nature of compressibility of the Burrows-Wheeler transform. The compression of BWT-based compressors, such as `bzip2`

, is quantified by the number *r* of equal-letter runs in the BWT. The clear picture described above no longer applies to the measure *r.* On the one hand, Manzini^{16} proved that *r* can be upper-bounded in terms of the kth order empirical entropy of the input string. On the other hand, already in 2008, Sirén et al.^{24} observed that BWT achieves excellent compression (superior to statistical methods) on highly repetitive collections and provided probabilistic analysis exhibiting cases when *r* is small. Yet, after more than a decade, no upper bound on *r* in terms of *z* (the size of the LZ77 parsing) was discovered.

This lack of understanding is particularly frustrating due to numerous applications of BWT in the field of bioinformatics and compressed computation. One of the most successful applications of BWT is in *compressed indexing*, which aims to store a compressed string simultaneously supporting various queries (such as random access, pattern matching, or even suffix array queries) concerning the uncompressed version. Although classical (uncompressed) indexes, such as suffix trees and suffix arrays, have been successful in many applications, they are not suitable for storing and searching big highly repetitive collections, which are virtually impossible to search without preprocessing. A recent survey^{17} provides several real-life examples of such datasets. In particular, Github stores more than 20 terabytes of data, with an average of over 20 versions per project, whereas the 100000 Human Genome Project produced over 70 terabytes of DNA, which is highly compressible due to 99.9% similarity between individual human genomes. Motivated by such applications, compressed indexing witnessed a remarkable surge of interest in recent years. BWT-based indexes, such as the r-index,^{7} are among the most powerful, and their space usage is up to
(polylog *n*) factors away from the value* r.* For a comprehensive overview of this field, we refer the reader to a survey by Navarro.^{18}

In addition to text indexing, the Burrows-Wheeler transform has many applications in computational biology. In particular, BWT is the main component of the popular genome read aligners such as `Bowtie`

, `BWA`

, and `Soap2`

, which paved its way to general bioinformatics textbooks.^{21} Specialized literature on algorithmic analysis of biological sequences^{15,19} devotes dozens of pages to BWT applications.

Given the importance and practical significance of BWT, one of the biggest open problems that emerged in the field of lossless data compression and compressed computation asks:

What is the upper bound on the output size of the Burrows-Wheeler transform?

Except for BWT, essentially every other known compression method has been proven to produce output whose size is always within an
(polylog *n*) factor from *z*, the output size of the LZ77 algorithm (e.g., grammar compression, collage systems, and macro schemes)^{a} or larger by a polynomial factor (*n*^{ε} for some ε > 0) in the worst case (e.g., LZ78, compressed word graphs (CDAWGs)). We refer the reader to Navarro^{17} for a survey of repetitiveness measures. Notably, BWT is known to never compress much better than LZ77, that is, *z* =
(*r* log *n*),^{6} and the opposite relation *r* =
(*z* poly-log *n*) was often conjectured to be false.^{5}

We prove that *r* =
(*z* log^{2} *n*) holds for all strings, resolving the BWT conjecture in the more surprising way and answering an open question of Gagie et al.^{5,6} This result alone has multiple implications for indexing and compression:

- It is possible to support suffix array and suffix tree functionality in
(
*z*polylog*n*) space.^{7} - It was shown by Kempa
^{10}that many string processing tasks (such as BWT and LZ77 construction) can be solved in (*n*/log_{σ}*n*+*r*polylog*n*) time (where*σ*is the alphabet size). Thus, if the text is sufficiently compressible by BWT (formally,*n/r*= Ω (polylog*n*)), these tasks can be solved in optimal time (which is unlikely to be possible for general texts^{11}). Our result loosens this assumption to*n/z*= Ω (polylog*n*). - Until now, methods based on the Burrows-Wheeler transform were thought to be neither statistical nor dictionary (LZ-like) compression algorithms.
^{5,24}Our result challenges the notion that the BWT forms a separate compression type: Because of our bound, BWT is much closer to LZ compressors than anticipated.

Our slightly stronger bound*r*= (δ log^{2}*n*), where δ ≤*z*is a symmetric (insensitive to string reversal) repetitiveness measure recently studied by Kociumaka et al.,^{13}further shows that: - The number
of BWT runs in the reverse of the text satisfies
, which is the first nontrivial bound in terms of
*r.*This result is of practical importance due to many algorithms whose efficiency depends on .^{2,20,22,23}

After proving *r* =
(*z* log^{2} *n*), we refine our approach to obtain
and subsequently
. We then show that the latter bound, combined with the trivial one *r* ≤ *n*, is asymptotically tight for the full spectrum of values of *n* and δ. As a side result, we obtain a tight upper bound
(*n* log δ) on the sum of irreducible LCP values, improving upon the previously known bound
(*n* log *r*).^{9}

Next, we develop an
(*z* log^{8} *n*)-time algorithm converting the LZ77 parsing into the run-length compressed BWT (the polylog *n* factor has not been optimized). This offers up to exponential speedup over the previously fastest space-efficient algorithms, which need Ω(*n* log *z*) time.^{20,22} To achieve this, we develop new data structures and techniques of independent interest. In particular, we introduce a notion of compressed string synchronizing sets, generalizing the technique by Kempa and Kociumaka.^{11} We then describe a new variant of wavelet trees,^{8} designed to work for sequences of long strings. In the full version of this paper, we also propose new indexes that can be built directly from the LZ77 parsing and support fast pattern matching queries on text substrings.

### 2. Preliminaries

A *string* is a finite sequence of characters from a given *alphabet.* The length of a string *S* is denoted by |*S*| and, for *i* ∈ [1 . . |*S*|],^{b} the *i*th character of *S* is denoted by *S*[*i*]. A string *U* is a *substring* of *S* if *U* = *S*[*i*] *S*[*i* + 1] … *S*[*j* – 1] for some 1 ≤ *i* ≤ *j* ≤ |*S*| + 1. We then say that *U occurs* in *S* at position *i.* The *occurrence* of *U* at position *i* in *S* is denoted by *S*[*i* . . *j*) or *S*[*i* . . *j* – 1]. Such an occurrence is a *fragment* of *S* and can be represented by (a pointer to) *S* and the two positions *i*, *j.* Two fragments (perhaps of different strings) *match* if they are occurrences of the same substring. A fragment *S*[*i* . . *j*] is a *prefix* if *i* = 1 and a *suffix* if *j* = |*S*|.

We use
to denote the *reverse* of *S*, that is, *S*[|*S*|] … *S*[2] *S*[1]. We denote the *concatenation* of two strings *U* and *V*, that is, *U*[1]*U*[2] … *U*[|*U*|]*V*[1]*V*[2] … *V*[|*V*|], by *UV* or *U* · *V.* Furthermore,
denotes the concatenation of *k* ≥ 0 copies of *S*; note that *S*^{0} = ε is the *empty string.*

An integer *p* ∈ [1 . . |*S*|] is a *period* of a string *S* if *S*[i] = *S*[*i* + *p*] holds for every *i* ∈ [1 . . |*S*| – *p*]. The shortest period of *S* is denoted as per(*S*). A string *S* is called *periodic* if
. The following fact is a consequence of the classic *periodicity lemma* (which we do not use directly).

FACT 2.1 (see Amir et al.,^{1} FACT 1). *Any two distinct periodic strings of the same length differ on at least two positions.*

Throughout the paper, we consider a string (called the *text*) *T* of length *n* ≥ 1 over an ordered alphabet Σ of size σ. We assume that *T*[*n*] = $, where $ is the smallest symbol in Σ, and that $ does not occur anywhere else in *T.* We use ⪯ to denote the order on Σ, extended to the *lexicographic* order on Σ^{*} (the set of strings over Σ) so that *U*, *V* ∈ Σ^{*} satisfy *U* ⪯ *V* if and only if either *U* is a prefix of *V*, or *U*[1 . . *i*) = *V*[1 . . *i*) and *U*[*i*] ≺ *V* [*i*] holds for some *i* ∈ [1 . . min(|*U*|, |*V*|)].

The *suffix array* SA[1 . . *n*] of *T* is a permutation of [1 . . *n*] such that *T*[SA[1] . . *n*] ≺ *T*[SA[2] . . *n*] ≺ … ≺ *T*[SA[*n*] . . *n*]. The closely related *Burrows-Wheeler transform* BWT[1 . . *n*] of *T* is defined by BWT[*i*] = *T*[SA[*i*] – 1] if SA[*i*] > 1 and BWT[*i*] = *T*[*n*] otherwise. In the context of BWT, it is convenient to consider an *infinite string T*^{∞} defined so that *T*^{∞}[*i*] = *T*[1 + (*i* – 1) mod *n*] for *i* ∈ ℤ in particular, *T*^{∞} [1 . . *n*] = *T* [1 . . *n*]. We then have BWT[*i*] = *T*^{∞} [SA[*i*] – 1] for *i* ∈ [1 . . *n*]. Moreover, the assumption *T*[*n*] = $ implies *T*^{∞} [SA[1] . .] ≺ … ≺ *T*^{∞}[SA[*n*] . .].

For any string
, where *c _{i}* ∈ Σ and

*ℓ*

_{i}∈ ℤ

_{>0}for

*i*∈ [1 . .

*h*], and

*c*≠

_{i}*c*

_{i+1}for

*i*∈ [1 . .

*h*), we define the

*run-length encoding*of

*S*as a sequence RL(

*S*) = ((

*c*

_{1}, λ

_{1}), …, (

*c*, λ

_{h}_{h})), where λ

_{i}=

*ℓ*

_{1}+ … +

*ℓ*for

_{i}*i*∈ [1 . .

*h*]. Throughout, we let

*r*= |RL(BWT)| denote the number of runs in the BWT of

*T.*For example, for the text

*T*=

`bbabaababababaababa$`

in Table 1, we have BWT `= a`^{1}b^{6}a^{1}b^{2}a^{6}b^{1}a^{2}$^{1}

, and hence *r*= 8.

**Table 1. The lexicographically sorted suffixes of a string T = bbabaa-babababaababa$ along with the BWT, SA, and LCP tables.**

By lcp(*U*, *V*) we denote the length of the longest common prefix of strings *U* and *V.* For *j*_{1}, *j*_{2} ∈ [1 . . *n*], we let LCE(*j*_{1}, *j*_{2}) = lcp(*T*[*j*_{1} . .], *T*[*j*_{2} . .]). The *LCP array* LCP[1 . . *n*] is defined so that LCP[*i*] = LCE(SA[*i*], SA[*i* – 1]) for *i* ∈ [2 . . *n*] and LCP[1] = 0. We say that the value LCP[*i*] is *reducible* if BWT[i] = BWT[*i* – 1] and *irreducible* otherwise (including when *i* = 1). There are exactly *r* irreducible LCP values.

We say that a nonempty fragment *T*[*i* . . *i* + *ℓ*) is a *previous factor* if it has an earlier occurrence in *T*, that is, LCE (*i*, *i’*) ≥ *ℓ* holds for some *i’* ∈ [1 . . *i*). The *LZ77 parsing* of *T* is a factorization *T* = *F*_{1} ··· *F*_{f} into nonempty *phrases* such that the *j*th phrase *F _{j}* is the longest previous factor starting at position 1 + |

*F*

_{1}···

*F*

_{j-1}|; if no previous factor starts there, then

*F*consists of a single character. In the underlying

_{j}*LZ77 representation*, every phrase

*F*=

_{j}*T*[

*i*. .

*i*+

*ℓ*) that is a previous factor is encoded as (

*i’*,

*ℓ*), where

*i’*∈ [1 . .

*i*) satisfies LCE(

*i*,

*i’*) ≥

*ℓ*(and is chosen arbitrarily in case of multiple possibilities); if

*F*=

_{j}*T*[

*i*] is not a previous factor, then it is encoded as (

*T*[

*i*], 0). We denote the number of phrases in the LZ77 parsing by

*z.*For example, the text

`bbabaababababaababa$`

of Table 1 has LZ77 factorization `b · b · a · ba · aba · bababa · ababa · $`

with *z*= 8 phrases, and its LZ77 representation is (b, 0), (1, 1), (a, 0), (2, 2), (3, 3), (7, 6), (10, 5), ($, 0).

The *trie* of a finite set *S* ⊆ Σ^{*} is an edge-labeled rooted tree with a node *v _{X}* for every string

*X*that is a prefix of a string

*S*∈

*S.*The trie is rooted at

*v*

_{ε}and the parent of each node

*v*with

_{X}*X*≠ ε is

*v*

_{X[1..|X|)}. In this case, the edge from

*v*

_{X[1..|X|)}to

*v*is

_{X}*labeled*with

*X*[|

*X*|]. See Figure 1 for the trie of

`{$aba, aaba, abaa, abab, abb$, b$ab, baab, baba, babb, bb$a}`

.
**Figure 1. The trie
of the reversed length-4 substrings of T^{∞} for T = **

`bbabaababababaababa$`

of Table 1. Light edges are thin and dotted.### 3. Combinatorial Bounds

To illustrate the main idea of our proof technique, we first prove the upper bound in its simplest form *r* =
(*z* log^{2} *n*). The following lemma stands at the heart of our proof.

LEMMA 3.1. *For every ℓ* ∈ [1 . . *n*], *the number of irreducible LCP values in* [*ℓ* . . 2*ℓ*) is
(*z* log *n*).

PROOF. Denote *S _{m}* = {

*S*∈ Σ

^{m}:

*S*is a substring of

*T*

^{∞}} for

*m*≥ 1. Observe that |

*S*| ≤

_{m}*mz*because every length-

*m*substring of

*T*

^{∞}has an occurrence crossing or beginning at a phrase boundary of the LZ77 parsing of

*T.*This includes substrings overlapping two copies of

*T*, which cross the boundary between the last and the first phrase.

The idea of the proof is as follows: With each irreducible value LCP[*i*] ∈ [*ℓ* . . 2*ℓ*), we associate a cost of *ℓ* units, which are charged to individual characters of strings in *S*_{3ℓ}. We then show that each of the strings in *S*_{3ℓ} is charged at most 2 log *n* times. The number of irreducible LCP values in [*ℓ* . . 2*ℓ*) equals
times the total cost, which is at most

To devise the announced assignment of cost to characters of strings in *S*_{3ℓ}, consider the trie
of all reversed strings in *S _{ℓ}* (see Figure 1 for example).

Let LCP[*i*] ∈ [*ℓ* . . 2*ℓ*) be an irreducible LCP value; note that *i* > 1 due to LCP[*i*] ≥ *ℓ* > 0. Let *j*_{0} = SA[*i* – 1] and *j*_{1} = SA[*i*] so that LCP[*i*] = LCE(*j*_{0}, *j*_{1}). Because LCP[*i*] is irreducible, we have *T*^{∞} [*j*_{0} – 1] = BWT[*i* – 1] ≠ BWT[*i*] = *T*^{∞} [*j*_{1} – 1]. For *k* ∈[1 .. *ℓ*], the *k*th unit of the cost associated with LCP[*i*] is charged to the *k*th character (*T*^{∞}[*j _{t}* – 1]) of the string (

*T*

^{∞}[

*j*–

_{t}*k*..

*j*–

_{t}*k*+ 3

*ℓ*)∈

*S*

_{3ℓ}, where

*t*∈ {0, 1} is such that the subtree of rooted at contains fewer leaves than the subtree rooted at (we choose

*t*= 0 in case of ties); see Figure 2 for an illustration.

**Figure 2. Proof of Lemma 3.1 on T from Table 1 for ℓ = 4, i = 11, and k = 2. Strings T^{∞} [j_{t} – k . . j_{t} – k + 3ℓ) are highlighted. The subtree of
rooted in v_{baa} is smaller than in v_{bab} (see Figure 1), and hence we charge the second symbol of T^{∞} [j_{1} – k . . j_{1} – k + 3ℓ), that is, t = 1.**

Note that at most log *n* characters of each *S* ∈ *S*_{3ℓ} can be charged during the above procedure: Whenever *S*[*k*], with *k* ∈ [1 .. *ℓ*], is charged, the subtree of
rooted at
has at least twice as many leaves as the subtree rooted at
, and this can happen for at most log|*S _{ℓ}*| ≤ log

*n*nodes on the path from to the root of .

It remains to show that, for every *S* ∈ *S*_{3ℓ}, a single position *S*[*k*], with *k* ∈[1 .. *ℓ*], can be charged at most twice. For this, observe that the characters charged for a single irreducible value LCP[*i*] are at different positions (of strings in *S*_{3ℓ}). Hence, to analyze the total charge assigned to *S*[*k*], we only need to bound the number of possible candidate positions *i.* Let [*b* . . *e*] be the set of indices *i’* such that *T*^{∞} [SA[*i’*]..] starts with *S*[*k* + 1 .. 3*ℓ*]. In the above procedure, if a character *S*[*k*] is charged a unit of cost corresponding to LCP[*i*], then *S*[*k* + 1 .. 3*ℓ*] is a prefix of either *T*^{∞}[SA[*i*−1]..]=*T*^{∞}[*j*_{0} ..] or *T*^{∞}[SA[*i*]..]=*T*^{∞}[*j*_{1} ..]. Hence, {*i*−1,*i*}∩[*b*..*e*]≠∅. At the same time, LCE(SA[*i* –1], SA[*i*]) < 2*ℓ*, and all strings *T*^{∞}[SA[*i’*]..] with *i’* ∈[*b* .. *e*] share a common prefix *S*[*k* + 1 .. 3*ℓ*] of length 3*ℓ* – *k* ≥ 2*ℓ*. Thus, *i* = *b* or *i* = *e* + 1. ⅈ

THEOREM 3.2. *All length-n strings satisfy r* =
(*z* log^{2} *n*).

PROOF. Recall that *r* is the total number of irreducible LCP values. Thus, the claim follows by applying Lemma 3.1 for *ℓ _{i}* = 2

^{i}, with

*i*∈[0 .. ⌊log

*n*⌋] , and observing that the number of LCP values 0 is exactly σ≤

*z*.

To obtain a tighter bound, we refine the ideas from Section 3.1, starting with a counterpart of Lemma 3.1.

LEMMA 3.3. *For every ℓ* ∈[1 .. *n*], *the number of irreducible LCP values in* [*ℓ* .. 2*ℓ*) is
(*z* log *z*).

PROOF. The proof follows closely that of Lemma 3.1. However, with each irreducible value LCP[*i*] ∈ [*ℓ* .. 2*ℓ*), we associate cost
instead of *ℓ.* We then show that each of the strings in *S*_{3ℓ} is charged at most 2 · (3 + log *z*) times (rather than 2 log *n* times). Then, the number of irreducible LCP values in the range [*ℓ* .. 2*ℓ*) does not exceed
times the total cost, which is bounded by

Recall the trie
of all reversed strings in *S _{ℓ}*. For a node

*v*of , by size(

*v*), we denote the number of leaves in the subtree of rooted in

*v.*An edge connecting

*v*≠root( ) to its parent in is called

*light*if

*v*has a sibling

*v’*satisfying size(

*v’*) ≥ size(

*v*) (see Figure 1). In the proof of Lemma 3.1, we observed that the characters

*S*[

*k*] of

*S*∈

*S*

_{3ℓ}that can be charged correspond to light edges on the path from the root of to the leaf : Whenever

*S*[

*k*], with

*k*∈[1 ..

*ℓ*], is charged, the edge connecting to its parent is light. We then noted that there are at most log |

*S*| ≤ log

_{ℓ}*n*light edges on each root-to-leaf path in . Here, we charge the characters of strings in

*S*

_{3ℓ}in the same way as in Lemma 3.1, but only for units . This implies that only characters

*S*[

*k*] of

*S*∈

*S*

_{3ℓ}with are charged. It remains to show that any root-to-leaf path in contains at most 3 + log

*z*light edges between a node at depth at least and its child.

Consider a light edge from a node *v* to its parent *u* at depth at least
. Let *v’* be a sibling of *v* satisfying size(*v’*) ≥ size(*v*), and let *S _{v}*,

*S*be the labels of the paths from the root to

_{v’}*v*and

*v’*, respectively. These labels differ on the last position only so, by Fact 2.1, they cannot be both periodic. Let

*ṽ*∈ {

*v*,

*v’*} be such that

*S*

_{ṽ}is not periodic, and let .

Consider the set *S* of length-*ℓ* strings corresponding to the leaves in the subtree of
rooted at *ṽ* (i.e., the labels of the root-to-leaf paths passing through *ṽ*). Define
and note that
, because
is the trie of *reversed* strings from *S _{ℓ}*. Let
denote the ending positions of the leftmost occurrences in

*T*

^{∞}[1 . .) of strings in . By definition, we have an occurrence of ending in

*T*

^{∞}at every position

*e*

_{i}with . Now, implies that for every (otherwise, the two close occurrences of would yield . Consequently, at least length-

*ℓ*substrings of

*T*

^{∞}[1 . .) have disjoint leftmost occurrences. Because each leftmost occurrence crosses or begins at a phrase boundary of the LZ77 parsing of

*T*, we conclude that , and therefore .

The reasoning above shows that once a root-to-leaf path encounters a light edge connecting a node *u* at depth at least
to its child *v*, we have size(*v*) ≤ 4*z.* The number of the remaining light edges on the path is at most log(size(*v*)) ≤ 2 + log *z* by the standard bound applied to the subtree of
rooted at *v.*

THEOREM 3.4. *Every string T of length n satisfies*
.

PROOF. To obtain tighter bounds on the number of irreducible LCP values in [*ℓ* . . 2*ℓ*), we consider three cases:

*ℓ* ≤ **log z.** We repeat the proof of Lemma 3.1, except that we observe that the number of light edges on each root-to-leaf path in
is bounded by

*ℓ.*Thus, the number of irreducible LCP values in [

*ℓ*. . 2

*ℓ*) is (

*z ℓ*).

. We use the bound
(*z* log *z*) of Lemma 3.3.
. We repeat the proof of Lemma 3.3, except that we observe that |*S*_{3ℓ}| ≤ *n.* Thus, the number of irreducible LCP values in
.

The above upper bounds, applied for every *ℓ* = 2^{i} with *i* ∈ [0 . . ⌊log *n*⌋], yield

**3.3. Tight bounds in terms of δ**

Let
denote the (cyclic) *substring complexity* of *T.*^{13} Note that letting
is equivalent because |*S _{m}*| ≤

*n*holds for

*m*≥ 1, which implies for

*m*≥

*n.*We start by noting that δ ≤

*z*because |

*S*| ≤

_{m}*mz*holds for every

*m*≥ 1, as observed in the proof of Lemma 3.1. Furthermore, |

*S*| ≤

_{m}*m*δ holds by definition of δ, so δ can replace

*z*in the proof of Lemma 3.1.

To adapt the proof Lemma 3.3, we need to generalize the observation that at most *z* substrings from *S _{ℓ}* may have disjoint leftmost occurrences in

*T*

^{∞}[1 . .). This observation is easy because the LZ77 parsing naturally yields a set of

*z*positions (phrase boundaries) in

*T.*The substring complexity δ does not provide such structure, but as the lemma here implies, we can replace

*z*by 3δ in the aforementioned observation. The proof of Lemma 3.5 is a straightforward modification of the argument used in Kociumaka et al.

^{13}(Lemma 6). For completeness, we write down the full reasoning, with technical details tailored to our notation (e.g.,

*S*defined in terms of

_{ℓ}*T*

^{∞}rather than

*T*).

LEMMA 3.5 (based on Kociumaka et al.,^{13} LEMMA 6). *For any integer ℓ* ≥ 1, *the total number of positions in T*^{∞}[1 . .) *covered by the leftmost occurrences of strings from S _{ℓ} is at most* 3

*δℓ*.

PROOF. Let *C* denote the set of positions in *T*^{∞}[1 . .) covered by the leftmost occurrences of strings from *S _{ℓ}*, and let

*C’*=

*C*\[1 ..

*ℓ*). For any

*i*∈

*C’*denote

*S*=

*T*

^{∞}[

*i*−

*ℓ*+1..

*i*+

*ℓ*], and let

*S*={

*S*

_{i}:

*i*∈

*C*‘}⊆

*S*

_{2ℓ}. We will show that |

*S*| = |

*C’*|. Let

*i*∈

*C’*. First, observe that, due to

*i*≥

*ℓ*, the fragment

*S*is entirely contained in

_{i}*T*

^{∞}[1 . .). Furthermore, by definition,

*S*. contains the leftmost occurrence of some

_{i}*S*∈

*S*. Thus, this occurrence of

_{ℓ}*S*in

_{i}*T*

^{∞}[1 . .) must also be the leftmost one in

*T*

^{∞}[1 . .). Consequently, the substrings

*S*for

_{i}*i*∈

*C’*are distinct.

We have thus shown that |*C’*| = |*S*|≤|*S*_{2ℓ}. Because |*S*_{2ℓ}| ≤ 2*δℓ* holds by definition of δ, we obtain |*C*|<|*C*‘|+ ℓ ≤ |*S*_{2ℓ}|+*ℓ*≤(2*δ*+1)*ℓ*≤3*δℓ*. ⅈ

LEMMA 3.6. *For every ℓ* ∈[1 .. *n*], *the number of irreducible LCP values in* [*ℓ* .. 2*ℓ*) *is*
(δ log δ).

PROOF. Compared to the proof of Lemma 3.3, we use the bound |*S*_{3ℓ}| ≤ 3*ℓ*δ instead of |*S*_{3ℓ}| ≤ 3*ℓz*. The only other modification required is that, for every light edge connecting a node *u* of
at depth at least
to its child *v*, we need to prove size(*v*) =
(δ).

Let
be defined as in the proof of Lemma 3.3. Recall that there are at least
strings in *S _{ℓ}* with disjoint leftmost occurrences in

*T*

^{∞}[1 . .). By Lemma 3.5, there are at most 3δ such substrings. Thus, . ⅈ

By replacing the thresholds log *z* and
with log δ and
, respectively, in the proof of Theorem 3.4, we immediately obtain a bound in terms of δ.

THEOREM 3.7. *Every string T of length n satisfies*
.

The following construction, analyzed in the full version of this paper, provides tight examples with substring complexity δ covering the whole spectrum between
(1) and
. For *ℓ* ≥ 1 and *x* ∈[0 .. 2^{ℓ}), we set bin_{ℓ} (*x*) ∈ {0, 1}^{ℓ} to be the binary representation of *x.*

LEMMA 3.8. *For all integers ℓ* ≥ 2 *and K* ≥ *1, the length n, the substring complexity δ, and the number of runs r in the BWT of a string T _{l,K}* ∈ {$, 0, 1, 2}

^{+},

*defined with*

*satisfy* *n* = Θ(2^{K+ℓ}*ℓ*, *δ*=Θ(2^{ℓ}), *and r* = Ω(2^{ℓ} *ℓK*).

If
, on the other hand, then a trivial upper bound *r* =
(*n*) is tighter than that of Theorem 3.7. In this case, the following construction, also analyzed in the full version of this paper, provides tight examples.

LEMMA 3.9. *For all integers ℓ* ≥ 2 *and* Δ∈Ω(2^{ℓ})∩
(2^{ℓ}*ℓ*), *the length n, the substring complexity δ, and the number of runs r in the BWT of a string*
, *defined with*

*satisfy* *n* =Θ(2^{ℓ}*ℓ*), δ=Θ(Δ), *and* *r* = Ω(2^{ℓ}*ℓ*).

Overall, combining Lemmas 3.8 and 3.9, we obtain the following tight lower bound for δ ranging from
(1) to Ω(*n*).

THEOREM 3.10. *For every N* ≥ *1 and* Δ ∈ [1 .. *N*], *there exists a string T whose length n, substring complexity δ, and the number of runs r in the BWT satisfy* *n*=Θ(*N*), δ=Θ(Δ), *and*
.

**3.4. Further combinatorial bounds**

By combining Theorem 3.7 with known properties of the substring complexity δ, we obtain the first bound relating the number of BWT runs in the string and its reverse. No such bounds (even polynomial in *r* log *n*) were known before.

COROLLARY 3.11. *If r and
denote the number of runs in the BWT of a length-n text and its reverse, respectively, then*
.

PROOF. Because the value of δ is the same for the text and its reverse, Theorem 3.7 yields
. Combining the results of Kempa and Prezza^{12} (Theorem 3.9) and Kociumaka et al.^{13} (Lemma 2) gives δ ≤ *r.* Consequently, we obtain
.

In the full version of this paper, we also characterize the sum of all irreducible LCP values:

THEOREM 3.12. *For every string of length n, the sum r*_{Σ} *of all irreducible LCP values satisfies r*_{Σ} =
(*n* log δ).

Due to δ ≤ *r*, the presented upper bound is always (asymptotically) at least as strong as the previous bound *r*_{σ} ≤ *n* log *r* by Kärkkäinen et al.^{9} Furthermore, it can be strictly stronger because log δ = *o*(log *r*) is possible when δ = log^{o(1)} *n.* In the full version of this paper, we show that the strings of Lemmas 3.8 and 3.9 yield a matching lower bound:

THEOREM 3.13. *For every N* ≥ 1 *and* Δ ∈ [1 .. *N*], *there exists a string T whose length n, substring complexity δ, and sum r*_{Σ} *of irreducible LCP values satisfy* *n*=Θ(*N*), δ=Θ(Δ), *and* *r*_{Σ} = Θ(*n* log δ).

### 4. From LZ77 to Run-Length BWT

In this section, we outline our algorithm that, given the LZ77 parsing of a text *T* ∈ Σ^{n}, computes its run-length compressed BWT in
(*z* polylog *n*) time. We start with an overview that explains the key concepts. Next, we present two new data structures utilized in our algorithm: the compressed string synchronizing set (Section 4.1) and the compressed wavelet tree (Section 4.2). The sketch of the final algorithm is then presented in Section 4.3.

For any substring *Y* of *T*^{∞}, we define

We say that a substring *Y* of *T*^{∞} is *left-maximal* if there exist distinct symbols *a*, *b* ∈ Σ such that the strings *aY* and *bY* are also substrings of *T*^{∞}. The following definition, assuming Σ∩ℕ=∅, plays a key role in our construction.

DEFINITION 4.1 (BWT MODULO *ℓ*). *Let* *T* ∈ Σ^{n}, *ℓ* ≥ 1 *be an integer, and* *Y*_{i}= *T*^{∞}[SA[*i*]..SA[*i*]+*ℓ*) *for* *i* ∈[1..*n*]. *We define the string* BWT_{ℓ} ∈(Σ∪ℕ)^{n}, *called the* BWT modulo *ℓ* *(of T), as follows. For i* ∈ [1 .. *n*],

The algorithm runs in *k* = ⌈log *n*⌉ rounds. For *q* ∈ [0 .. *k*), the input to the *q*th round is RL(BWT_{2q}) and the output is RL(BWT_{2q+1}). At the end of the algorithm, we have RL(BWT_{2k}) = RL(BWT), because *X* ∈ *S*_{2k} is never left-maximal for 2^{k} ≥ *n.*

Informally, in round *q*, we are given a (run-length compressed) subsequence of BWT that can be determined based on sorting the suffixes only up to their prefixes of length 2^{q}. Notice that BWT_{ℓ}[*b* .. *e*] ∈ Σ^{*} implies BWT_{ℓ+1} [*b* .. *e*] ∈ Σ^{*} (because a prefix of a left-maximal substring is left-maximal). Hence, these subsequences need not be modified until the end of the algorithm (except possibly merging their runs with adjacent runs). For the remaining positions, BWT_{ℓ} identifies the (leftmost occurrences of) substrings to be inspected in the *q*th round with the aim of replacing their corresponding runs in BWT_{ℓ} with previously unknown BWT symbols (as defined in BWT_{2ℓ}).

We call a block BWT[*b* . . *e*] *uniform* if all symbols in BWT[*b* . . *e*] are equal and *nonuniform* otherwise. The following lemma ensures the feasibility of the above construction.

LEMMA 4.2. *For any ℓ* ≥ 1, *it holds* |RL(BWT_{ℓ})| < 2*r.*

PROOF. Denote RL(BWT_{ℓ}) = ((*c*_{1}, λ_{1}, …,(*c*_{h},λ_{h})), letting λ_{0} = 0. By definition of BWT_{ℓ}, if *c _{i}* ∈ ℕ, then the block BWT(λ

_{i-1}.. λ

_{i}] is nonuniform. Thus, there are at most

*r*– 1 runs of symbols from ℕ in BWT

_{ℓ}.

On the other hand, *c _{i}* ∈ Σ and

*c*∈ Σ, with

_{j}*i*<

*j*, cannot both belong to the same run in BWT. If this was true, then either

*c*

_{i+1}∈ Σ (which implies

*c*

_{i+1}=

*c*, contradicting the definition of RL(BWT

_{i}_{ℓ}) ), or

*c*

_{i+1}∈ ℕ, which is impossible because BWT(λ

_{i}.. λ

_{i+1}] would then be nonuniform. Thus, there are at most

*r*runs of symbols from Σ in BWT

_{ℓ}. ⅈ

**4.1. Compressed string synchronizing sets**

Our algorithm builds on the notion of *string synchronizing sets*, recently introduced by Kempa and Kociumaka.^{11} Synchronizing sets are one of the most powerful techniques for sampling suffixes. In the uncompressed setting, they are the key in obtaining time-optimal solutions to multiple problems, such as a state-of-the-art BWT construction algorithm for texts over small alphabets.^{11} In this section, we introduce a notion of *compressed string synchronizing sets.* Our construction is the first implementation of synchronizing sets in the compressed setting and thus of independent interest.

We start with the definition of basic synchronizing sets.

DEFINITION 4.3 (τ-SYNCHRONIZING SET^{11}). *Let T* ∈ Σ^{n} *be a string and let*
*be a parameter. A set* S⊆[1..*n*−2τ+1] *is called a τ-synchronizing set of T if it satisfies the following* consistency *and* density *conditions:*

*If**T*[*i*..*i*+2τ)=*T*[*j*..*j*+2τ),*then i*∈ S*holds if and only if j*∈ S*(for i, j*∈[1 ..*n*– 2τ +1]),- S∩[
*i*..*i*+τ)=∅*if and only if**(for i*∈[1 ..*n*– 3τ + 2]).

In most applications, we want to minimize |S|. However, the density condition imposes a lower bound
for strings of length *n* ≥ 3τ – 1 that do not contain substrings of length 3τ – 1 that are periodic with period at most
. Therefore, we cannot hope to achieve an upper bound improving in the worst case upon the following one.

THEOREM 4.4 *(Kempa and Kociumaka ^{11}). For any string T of length n and parameter*
,

*there exists a τ-synchronizing set*S

*of size*.

*Moreover, such*S

*can be (deterministically) constructed in*(

*n*)

*time.*

Storing S for compressible strings presents the following challenge: Although
*is* sometimes possible, it *is not* implied by *z* ≪ *n.* For example, as noted by Bille et al.,^{3} Thue–Morse strings satisfy *z* =
(log *n*) yet they contain no periodic substring *X* with
, and thus their synchronizing sets satisfy
. This prevents us from keeping the plain representation of S when
.

We therefore exploit a different property of compressible strings: That all their substrings *Y* satisfy lpos(*Y*)
, where *e _{j}* is the last position of the

*j*th phrase in the LZ77 parsing of

*T.*By consistency of S, it suffices to store . To decide whether

*i*∈ S, we then locate

*i’*= lpos(

*T*[

*i*..

*i*+ 2τ) and check if . This motivates the following (more general) definition.

DEFINITION 4.5 (COMPRESSED τ-SYNCHRONIZING SET). *Let* S *be a τ-synchronizing set of string T*[1 . . *n*] *for some*
, *and, for every j* ∈[1 .. *z*], *let e _{j} denote the last position of the jth phrase in the LZ77 parsing of T. For k* ∈ ℤ

_{≥2},

*we define the*compressed representation

*of*S

*as*

In the full version of this paper, we prove that every text *T* has a synchronizing set S with a small compressed representation, and we show how to efficiently compute such S from the LZ77 parsing of *T.*

THEOREM 4.6. *There exists a Las-Vegas randomized algorithm that, given the LZ77 parsing of a string T* ∈ Σ^{n}, *a parameter*
, *and a constant k* ∈ ℤ_{≥2}, *constructs in*
(*z* log^{5} *n*) *time a compressed representation* comp_{k}(S) *of a τ-synchronizing set* S *of T satisfying* |comp_{k}(S)|≤ 72*kz*.

Along with string synchronizing sets, wavelet trees,^{8} originally invented for text indexing, play a central role in our algorithm. Unlike virtually all prior applications of wavelet trees, ours uses a sequence of very long strings (up to Θ(*n*) symbols). This approach is feasible because all strings are substrings of the text, which is stored in the LZ77 representation. In this section, we describe this novel variant of wavelet trees, dubbed here *compressed wavelet trees.* In particular, we prove an upper bound on their size, describe an efficient construction from the LZ77-compressed text, and show how to augment them to support some fundamental queries.

Let Σ be an alphabet of size σ ≥ 1. Consider a string *W*[1 . . *m*] over the alphabet Σ^{ℓ} so that *W* is a sequence of *m* ≥ 0 strings of length *ℓ* ≥ 0 over the alphabet Σ. The wavelet tree of *W* is the trie
of Σ with each node *v _{X}* associated to a string

*B*∈ Σ

_{X}^{*}defined here based on

*W.*We let denote the node-set of .

With each node *v _{X}* ∈

*V*( ), we associate an increasing sequence

*I*[1 . .

_{X}*h*] of

*primary indices*such that

Based on *I _{X}*, we define

*B*∈ Σ

_{X}^{*}so that, for

*i*∈ [1 . .

*h*],

if |*X*| < *ℓ*, and *B _{X}* = ε if |

*X*| =

*ℓ.*In other words,

*B*is a string containing the symbol at position |

_{X}*X*| + 1 for each string of

*W*that is prefixed by

*X.*Importantly, the symbols in

*B*occur in the same order as these strings occur in

_{X}*W.*

As typically done in the applications of wavelet trees, we only explicitly store the strings *B _{X}*. The values of primary indices

*I*are retrieved using additional data structures, based on the following observation.

_{X}LEMMA 4.7 (Grossi et al.^{8}). *Let X* ∈ Σ^{d}, *where d* ∈ [0 . . *ℓ*). *For every c* ∈ Σ *and j* ∈ [1 . . |*I*_{Xc}|], *we have I _{Xc}* [

*j*] =

*I*[

_{X}*i*],

*where B*[

_{X}*i*]

*is the jth occurrence of c in B*

_{X}.We define the *compressed wavelet tree*
_{c} of *W* as the wavelet tree of *W* in which all strings *B _{X}* have been run-length compressed and, except
, all nodes

*v*satisfying |RL(

_{X}*B*)| ≤ 1 have been removed (the unary paths are collapsed into single edges with their labels concatenated). The shape and edge labels of the resulting tree are identical to the

_{X}*compacted trie*of {

*W*[1], …,

*W*[

*m*]}.

We store the edge labels of
_{c} as pointers to substrings in *W.* We assume that each edge of
_{c} and each element of RL(*B _{X}*) can be encoded in
(1) space. Because |RL(

*B*)| ≥ 1 holds for every internal node

_{Y}*v*∈

_{Y}*V*(

_{c}) and, unless |

*V*(

_{c})| = 1, each leaf

*v*in

_{Z}_{c}can be injectively mapped to an element of RL(

*B*) for the parent

_{Z’}*v*of

_{Z’}*v*, the tree

_{Z}_{c}uses space.

THEOREM 4.8. *If*
_{c} *is the compressed wavelet tree of a sequence W of length-l strings, then*
.

PROOF. Let *m* = |*W*|, *k* = |RL(*W*)| ≤ *m*, and *k’* = |{*W*[*i*] : *i* ∈ [1 . . *m*]}|. Due to |*V*(*T*_{c})|≤ 1+2*k*‘=
(1 + k), we can focus on nodes *v*_{x}∈*V*(*T*_{c}) such that |RL(B_{X})| ≥ 2.

The proof resembles that of Lemma 3.1. With each *X* ∈ Σ^{*} such that |RL(*B _{X}*)| ≥ 2, we associate |RL(

*B*)| – 1 units of cost and charge them to individual elements of

_{X}*W.*We then show that each run in RL(

*W*) is in total charged at most 2 log

*k’*units of cost. Consequently,

Consider *X* ∈ Σ^{d} with |RL(*B _{X}*)| ≥ 2; note that

*d*<

*ℓ.*Let RL(

*B*) = ((

_{X}*c*

_{1}, …, (

*c*, λ

_{h}_{h})). Observe that if we let

*p*

_{0}=

*I*[λ

_{X}_{i}] and

*p*

_{1}=

*I*[λ

_{X}_{i}+ 1] for some

*i*∈ [1 . .

*h*), then

*W*[

*p*

_{0}][

*d*+ 1] =

*c*≠

_{i}*c*+

_{i}_{1}=

*W*[

*p*

_{1}][

*d*+ 1]. Moreover,

*B*[λ

_{X}_{i}] ≠

*B*[λ

_{X}_{i}+ 1] implies

*W*[

*p*

_{0}+ 1] ≠

*W*[

*p*

_{0}] and

*W*[

*p*

_{1}– 1] ≠

*W*[

*p*

_{1}]. The

*i*th unit of cost is charged to

*W*[

*p*], where

_{t}*t*∈ {0, 1} is chosen depending on the sizes of subtrees of

_{c}rooted at the children of

*v*, so that the subtree containing

_{X}*v*

_{w[pt]}has at most as many leaves as the subtree containing

*v*

_{w[p1−t]}.

Now, consider a run *W*[*b* . . *b’*] = *Y*^{δ} in RL(*W*). For a single depth *d*, the run could be charged at most twice, with at most one unit assigned to *W*[*b*] due to *p*_{1} = *b* and at most one unit assigned to *W*[*b’*] due to *p*_{0} = *b’*, both for *X* = *Y*[1 . . *d*]. Moreover, note that the subtree size on the path from *v _{Y}* to the root

*v*

_{ε}of

_{c}doubles for every depth

*d*for which the run was charged. Thus, the total charge of the run is at most 2 log

*k’*units. ⅈ

The key operation that we want to support on
_{c} is to compute the value *I _{X}* [

*q*] given

*q*∈ [1 . . |

*I*|] and a pointer to

_{X}*v*

_{x}∈

*V*(

*T*

_{c}). Let

*W*[1 . .

*m*] be a sequence of substrings of

*T*

^{∞}of common length

*ℓ*. Notice that if we have access to

*T*, then the sequence

*W*can be encoded in (1 + |RL(

*W*)|) space. Namely, it suffices to store the length

*ℓ*and the sequence RL((lpos(

*W*[

*i*]))

_{i ∈ [1 .. m]}). In the full version of this paper, we show that, given such encoding of

*W*and the LZ77 parsing of

*T*, the compressed wavelet tree of

*W*supporting fast primary index queries can be constructed efficiently.

THEOREM 4.9. *Given the LZ77 representation of T[1 . . n] and a sequence W[1 .. m] of m* ≤ *n same-length substrings of T*^{∞}, *represented as* RL((lpos(*W*[*i*]))_{i ∈[1 .. m]}), *the compressed wavelet tree of W, supporting primary index queries in time*
(log^{4} *n*), *can be constructed in time*
((*z*+|RL(*W*)|) log^{2} *n*).

We are now ready to sketch the procedure constructing the sequences RL(BWT_{2q}), where *q* ∈ [0 . . ⌈log *n*⌉].

Let *q* ∈ [4 . . ⌈log *n*⌉) (values *q* ∈ [0 . . 3] are handled separately) and let *ℓ* = 2^{q}. We show how to compute RL(BWT_{2ℓ}) given RL(BWT_{ℓ}) and the LZ77 parsing of *T.* The main idea of the algorithm is as follows.

Let S be a τ-synchronizing set of *T*, where
. As noted earlier, BWT_{ℓ} [*j*] ∈ Σ implies BWT_{2ℓ} [*j*] ∈ Σ. Let BWT_{ℓ} [*y* . . *y’*] ∈ ℕ^{+} be a run in BWT_{ℓ}. By definition of BWT_{ℓ}, the suffixes of *T*^{∞} starting at positions *i* ∈ SA[*y* . . *y’*] share a common prefix of length *ℓ* ≥ 3τ. Thus, assuming that S∩[*i*..*i*+τ) = ⌽ holds for all *i* ∈ SA[*y* . . *y’*] (the periodic case is handled separately), by the consistency of S, all text positions *i* ∈ SA[*y* . . *y’*] share a common offset Δ with *i* + Δ = min (S∩[*i*..*i*+τ)). This lets us deduce the order of length-2*1* prefixes *T*[*i* . . *i*+2*ℓ*) based on the order of strings *T*[*i* + Δ . . *i* + 2*ℓ*) starting at synchronizing positions. For this, from the sorted list of fragments *T*[*s* . . *s* + 2*ℓ*) across *s* ∈ S, we extract, using a wavelet tree, those preceded by *T*[*i* . . *i* + Δ) (a prefix common to *T*^{∞}[*i* . .) for *i* ∈ SA[*y* . . *y’*]). Importantly, the synchronizing positions *s* sharing *T*[*s* – τ . . *s* + 2*ℓ*) can be processed together; hence, by Theorem 4.6, applied for *k* = 7 so that 2*ℓ* ≤ *k*τ, it suffices to use
(*z*) distinct substrings.

In the full version of the paper, we provide the details of the above construction and obtain the following result.

THEOREM 4.10. *There exists a Las-Vegas randomized algorithm that, given the LZ77 parsing of a text T of length n, computes its run-length compressed Burrows-Wheeler transform in time*
((*r* + *z*) log^{6} *n*) =
(*z* log^{8} *n*).

### Acknowledgments

This work was performed when the first author was a post-doctoral scholar at the University of California, Berkeley (supported by NSF grants no. 1652303 and 1934846, and an Alfred P. Sloan Fellowship grant), whereas the second author was a postdoctoral scholar at Bar-Ilan University, Israel (supported by ISF grants no. 1278/16 and 1926/19, a BSF grant no. 2018364, and an ERC grant MPM under the EU’s Horizon 2020 Research and Innovation Programme, agreement no. 683064).

## Join the Discussion (0)

## Become a Member or Sign In to Post a Comment