Paper Title

Optimal Reference for DNA Synthesis

Authors

Ohad Elishco, Wasim Huleihel

Abstract

In recent years, DNA has emerged as a potentially viable storage technology. DNA synthesis, the task of writing data into DNA, is perhaps the most costly part of existing storage systems, and its high cost and low throughput limit the practical use of available DNA synthesis technologies. It has been found that the homopolymer run (i.e., the repetition of the same nucleotide) is a major factor affecting synthesis and sequencing errors. Quite recently, [26] studied the role of batch optimization in reducing the cost of large-scale DNA synthesis, for a given pool $\mathcal{S}$ of random quaternary strings of fixed length. Among other things, it was shown that the asymptotic cost savings of batch optimization are significantly greater when the strings in $\mathcal{S}$ contain repeats of the same character (homopolymer run of length one) than when the strings are unconstrained. Following the lead of [26], in this paper we take a step toward the theoretical understanding of DNA synthesis and study the homopolymer run of length $k\geq 1$. Specifically, we are given a set of DNA strands $\mathcal{S}$ that we wish to synthesize, randomly drawn from a natural Markovian distribution modeling a general homopolymer run-length constraint. For this problem, we prove that for any $k\geq 1$, the optimal reference strand, i.e., the one minimizing the cost of DNA synthesis, is, perhaps surprisingly, the periodic sequence $\overline{\mathsf{ACGT}}$. It turns out that tackling the homopolymer constraint of length $k\geq 2$ is a challenging problem; our main technical contribution is the representation of the DNA synthesis process as a certain constrained system, for which string techniques can be applied.
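The abstract does not spell out the cost model, so the following is only an illustrative sketch under an assumed model: the synthesizer steps through the reference strand one nucleotide at a time, a strand is extended whenever the current reference nucleotide matches its next unwritten symbol, and the cost of a batch is the number of reference symbols consumed until every strand is complete. The function names `synthesis_cost` and `batch_cost` are hypothetical and do not come from the paper.

```python
from itertools import cycle


def synthesis_cost(strand, period="ACGT"):
    """Reference symbols consumed before `strand` is fully written.

    Assumed model (not stated in the abstract): the reference is `period`
    repeated forever, and a strand advances by one symbol whenever the
    current reference symbol equals its next unwritten symbol.
    """
    if not set(strand) <= set(period):
        raise ValueError("strand contains symbols missing from the reference")
    written = 0  # number of strand symbols already synthesized
    steps = 0    # number of reference symbols consumed so far
    for base in cycle(period):
        if written == len(strand):
            break
        steps += 1
        if base == strand[written]:
            written += 1
    return steps


def batch_cost(strands, period="ACGT"):
    """Cost of a batch: all strands share the reference, so the slowest one decides."""
    return max((synthesis_cost(s, period) for s in strands), default=0)


if __name__ == "__main__":
    pool = ["ACGT", "AA", "TGCA"]
    for s in pool:
        print(s, synthesis_cost(s))
    print("batch:", batch_cost(pool))
```

Under this assumed model, a strand that follows the reference order (such as `ACGT`) is cheap, while a repeated symbol (such as `AA`) forces a wait of a full period, which gives some intuition for why homopolymer runs affect the synthesis cost.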
