论文标题
关于Burrows-wheeler-Transform的运行次数的新颖结果
Novel Results on the Number of Runs of the Burrows-Wheeler-Transform
论文作者
论文摘要
可逆的字符串转换是burrows-wheeler-transform(BWT),是字符串处理中许多当前数据结构的基本组件之一。它在数据压缩以及序列数据的有效查询算法中至关重要,例如网页,基因组和其他生物学序列或实际上任何文本数据。 BWT非常适合压缩,因为它的同等字母运行数量(通常称为$ r $)通常比原始字符串低得多。特别是,它非常适合具有许多重复因素的字符串。实际上,已经对$ r $参数的重复性量度非常关注,尤其是在压缩索引数据结构的时空和时间上评估性能。 在本文中,我们调查了$ρ(v)$,$ r $的比率和$ v $的反面的BWT跑步数。 Kempa和Kociumaka [焦点2020]将第一个非平凡的上限为$ρ(v)= o(\ log^2(n))$,对于任何长度$ n $的字符串$ v $。但是,对这种上限的紧密程度一无所知。我们提出了无限的二进制字符串家族,其中$ρ(v)=θ(\ log n)$保持,从而给出了第一个非平凡的下限$ρ(n)$,这是所有长度$ n $的最大值。 我们的结果表明,$ r $不是弦重复性的理想度量,因为重复因素的数量在字符串及其反向之间不变。我们认为,BWT的运行次数与字符串的组合属性之间存在更复杂的关系。
The Burrows-Wheeler-Transform (BWT), a reversible string transformation, is one of the fundamental components of many current data structures in string processing. It is central in data compression, as well as in efficient query algorithms for sequence data, such as webpages, genomic and other biological sequences, or indeed any textual data. The BWT lends itself well to compression because its number of equal-letter-runs (usually referred to as $r$) is often considerably lower than that of the original string; in particular, it is well suited for strings with many repeated factors. In fact, much attention has been paid to the $r$ parameter as measure of repetitiveness, especially to evaluate the performance in terms of both space and time of compressed indexing data structures. In this paper, we investigate $ρ(v)$, the ratio of $r$ and of the number of runs of the BWT of the reverse of $v$. Kempa and Kociumaka [FOCS 2020] gave the first non-trivial upper bound as $ρ(v) = O(\log^2(n))$, for any string $v$ of length $n$. However, nothing is known about the tightness of this upper bound. We present infinite families of binary strings for which $ρ(v) = Θ(\log n)$ holds, thus giving the first non-trivial lower bound on $ρ(n)$, the maximum over all strings of length $n$. Our results suggest that $r$ is not an ideal measure of the repetitiveness of the string, since the number of repeated factors is invariant between the string and its reverse. We believe that there is a more intricate relationship between the number of runs of the BWT and the string's combinatorial properties.