[Paper Reading] DNA storage: research landscape and future prospects

AIst 2024. 8. 2. 23:34

Reference

Dong, Y., Sun, F., Ping, Z., Ouyang, Q., & Qian, L. (2020). DNA storage: research landscape and future prospects. National Science Review, 7(6), 1092–1107. https://doi.org/10.1093/nsr/nwaa007

Introduction: Information and Storage

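Problem:
- Conventional storage methods are reaching their limits, and a new solution is needed to address issues such as maintenance cost and data loss.
Potential of DNA:
- In nature, DNA serves as a high-density, low-cost, stable information storage medium, and these properties have made it a candidate next-generation storage medium.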

Overview of DNA Storage

Research history

In 1953, Watson and Crick revealed the structure of DNA molecules as the carrier of genetic information
In 1988, the artist Joe Davis made the first attempt to construct real DNA storage
- Converted the pixel information of the image 'Microvenus' into a 0-1 sequence arranged in a 5 × 7 matrix, where 1 indicated a dark pixel and 0 indicated a bright one.
- This information was then encoded into a 28-base-pair long DNA molecule and inserted into Escherichia coli
- After retrieval by DNA sequencing, the original image was successfully restored
In 1999, Clelland proposed a steganography-like method based on 'DNA micro-dots' to store information in DNA molecules
- Steganography: a security technique that hides a secret message inside an image or audio file so that the very existence of the information is concealed
In 2001, Bancroft proposed using DNA bases to directly encode English letters, in a way similar to encoding amino acid sequences in DNA
Up to this point, however, no study had shown outstanding results: storing at most about 1 KB in DNA molecules was the limit.
Church et al. then successfully stored up to 659 KB of data in DNA molecules, and Goldman et al. stored even more, reaching 739 KB
- Not only texts but also images, sounds, PDFs

Self-information of DNA molecules:

The information capacity of a DNA base sequence is measured using Shannon information theory.
- Shannon information: the maximal amount of self-information ($H$) that a single base can hold is:
  - $H=-\sum_{i}^{A,T,C,G}P\left ( i \right )\log P\left ( i \right )\leq \log\sum_{i}^{A,T,C,G}P\left ( i \right )\frac{1}{P\left ( i \right )}=\log 4=2\ \text{bit}$
    - $ P\left ( i \right ) $: the probability of base $i$ occurring at any position
    - $ \log $: the base-2 logarithm, since the bit (binary unit) is the usual unit for measuring digital information
    - If the four bases are equally likely to occur, that is, $P\left ( i \right )=1/4$, each base in the DNA molecule provides the largest information capacity (2 bits)
By converting the 2 bit/base to physical density:
- $\rho =\frac{2\ \text{bit}}{1\ \text{base} \times 325\ \frac{\text{Dalton}}{\text{base}} \times 1.67\times 10^{-24}\ \frac{\text{g}}{\text{Dalton}}} = 3.69 \times 10^{21}\ \frac{\text{bit}}{\text{g}} = 4.61 \times 10^{20}\ \frac{\text{Byte}}{\text{g}}\approx 460\ \frac{\text{EB}}{\text{g}}$
  - $ \rho $: density
  - $1\ \text{EB} = 10^{18}\ \text{B}$
Note, however, that restrictions on the sequence of the DNA molecule can reduce the Shannon information capacity.
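
As a quick numerical check of the two formulas above, here is a minimal Python sketch. The function names and the skewed base composition are assumptions made for illustration. The constants 325 Dalton/base and $1.67\times 10^{-24}$ g/Dalton are the ones used in the density formula.

```python
import math

def entropy_per_base(probs):
    """Shannon entropy (in bits) of one base, given P(A), P(T), P(C), P(G)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def density_bits_per_gram(bits_per_base, dalton_per_base=325.0,
                          gram_per_dalton=1.67e-24):
    """Convert bits/base into bits/gram using the constants from the notes."""
    return bits_per_base / (dalton_per_base * gram_per_dalton)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.4, 0.4, 0.1, 0.1]   # hypothetical biased base composition

h_uniform = entropy_per_base(uniform)   # 2.0 bits, the upper bound
h_skewed = entropy_per_base(skewed)     # strictly below 2 bits

rho_bits = density_bits_per_gram(h_uniform)   # bits per gram
rho_eb = rho_bits / 8 / 1e18                  # exabytes per gram

print(f"H(uniform) = {h_uniform:.3f} bit, H(skewed) = {h_skewed:.3f} bit")
print(f"density = {rho_bits:.3e} bit/g = {rho_eb:.1f} EB/g")
```

The skewed case illustrates the closing remark: any constraint that makes the bases unequally likely pushes the per-base capacity below 2 bits.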

Mutual information and channel capacity

Mutual information between channel inputs and outputs is also an important factor in determining information capacity
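- Mutual information and channel capacity are introduced in order to evaluate and optimize the efficiency and reliability of a DNA storage system.
- Mutual information: how much information the input base $X$ provides about the output base $Y$
  - High mutual information means that few errors occur during sequencing and that the input information is transmitted well.
- Channel capacity: the maximum amount of information that can be reliably transmitted through a given channel
  - Channel capacity indicates how well a DNA storage system can preserve and recover information.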
Mutual information:
- Measures the fidelity with which the channel output $Y = \left\{ y_{j}|A,T,C,G\right\}$ (i.e. the readout of a DNA by sequencing) represents the channel input $X = \left\{ x_{i}|A,T,C,G\right\}$ (i.e. the preset DNA sequence)
- The mutual information of two variables $X$ and $Y$ is the amount of information gained about one variable by knowing the other. That is, it measures how well $Y$ (the DNA sequencing result) represents $X$ (the original DNA sequence).
  - Mutual information measures the dependence between the two random variables $X$ and $Y$: it indicates how much knowing $X$ reduces the uncertainty about $Y$.
- $I\left ( X;Y \right )=\sum_{i}^{A,T,C,G}\sum_{j}^{A,T,C,G}P\left ( x_{i},y_{j} \right )I\left ( x_{i};y_{j} \right )=\sum_{i}^{A,T,C,G}\sum_{j}^{A,T,C,G}P\left ( x_{i},y_{j} \right )\log\frac{P\left ( x_{i}|y_{j} \right )}{P\left ( x_{i} \right )}$
  - $P\left ( x_{i} ,y_{j}\right )$: the joint probability that $X$ is $x_{i}$ and $Y$ is $y_{j}$
- $I\left ( X;Y \right )= H\left ( X \right )-H\left ( X|Y \right )\leq H\left ( X \right )$
  - $H(X)$ is the entropy of $X$, representing the uncertainty of $X$.
  - $H(X|Y)$ is the conditional entropy of $X$ given $Y$, i.e. the uncertainty about $X$ that remains once $Y$ is known.
  - When the sequencing result represents the original DNA sequence perfectly:
    - $ H\left ( X|Y \right )=0 $, so $ I\left ( X;Y \right )= H\left ( X \right ) $
  - When errors occur during sequencing:
    - $ H\left ( X|Y \right )>0 $, and therefore $ I\left ( X;Y \right ) < H\left ( X \right ) $
      - That is, the mutual information decreases
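
The two cases above can be checked numerically. The following is a minimal sketch: `substitution_channel` is a hypothetical error model (each base is read correctly with probability $1-e$ and otherwise misread as one of the other three bases with equal probability), and all function names are assumptions for illustration rather than anything defined in the paper.

```python
import math

BASES = "ATCG"

def substitution_channel(error_rate):
    """Hypothetical model: P(y|x) = 1-e if y == x, else e/3."""
    return {x: {y: (1 - error_rate) if x == y else error_rate / 3
                for y in BASES} for x in BASES}

def joint_from_channel(p_x, p_y_given_x):
    """Joint distribution P(x_i, y_j) = P(x_i) * P(y_j | x_i)."""
    return {(x, y): p_x[x] * p_y_given_x[x][y] for x in BASES for y in BASES}

def mutual_information(joint):
    """I(X;Y) = sum_ij P(x_i, y_j) * log2( P(x_i|y_j) / P(x_i) )."""
    p_x = {x: sum(joint[(x, y)] for y in BASES) for x in BASES}
    p_y = {y: sum(joint[(x, y)] for x in BASES) for y in BASES}
    total = 0.0
    for (x, y), p_xy in joint.items():
        if p_xy > 0:
            # P(x|y) / P(x) equals P(x, y) / (P(x) * P(y))
            total += p_xy * math.log2(p_xy / (p_x[x] * p_y[y]))
    return total

uniform_input = {b: 0.25 for b in BASES}

# Perfect readout: H(X|Y) = 0, so I(X;Y) = H(X) = 2 bits
i_perfect = mutual_information(
    joint_from_channel(uniform_input, substitution_channel(0.0)))

# 1% substitution errors: H(X|Y) > 0, so I(X;Y) < H(X)
i_noisy = mutual_information(
    joint_from_channel(uniform_input, substitution_channel(0.01)))

print(f"I perfect = {i_perfect:.4f} bit, I noisy = {i_noisy:.4f} bit")
```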
Channel capacity:
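- Role of the channel:
  - The channel is the medium through which information passes from the sender (input) to the receiver (output).
  - In a DNA storage system, the channel represents the transformation that occurs when the original DNA sequence is read out by sequencing.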
- Can be defined by a 4 × 4 transfer matrix $T$
  - $XT = Y$
    - $X$: the input set
    - $Y$: the output set
    - $T$:
      - $T=\begin{bmatrix} P_{AA} & P_{AT} & P_{AC} & P_{AG} \\ P_{TA} & P_{TT} & P_{TC} & P_{TG} \\ P_{CA} & P_{CT} & P_{CC} & P_{CG} \\ P_{GA} & P_{GT} & P_{GC} & P_{GG} \\ \end{bmatrix}$
      - $P_{ij}$: the probability that the input base $i$ is received as base $j$ after channel transmission
      - In the ideal, error-free case $T$ is the identity matrix: $\left\{\begin{matrix} P_{ij} = 1, i=j\\ P_{ij} = 0, i\neq j\\ \end{matrix}\right.$
      - From the definition of $P_{ij}$, we can obtain: $H\left ( Y|x_{i} \right )=\sum_{j}^{A,T,C,G}H\left ( P_{ij} \right )$, where $H\left ( p \right )=-p\log p$
- $P_{i}^{'}=\sum_{j}^{A,T,C,G}P_{j}\cdot P_{ji}$
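  - $P_{i}$: the frequency (probability) of each base (A, T, C, G) at the channel input
  - $ P_{i}^{'} $: the frequency (probability) of each base (A, T, C, G) at the channel output
  - $P_{ji}$: an element of the transfer matrix $T$, i.e. the probability that input base $j$ is converted to output base $i$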
  - Therefore, we can obtain:
    - $H\left ( Y \right )=\sum_{i}^{A,T,C,G}H\left ( P_{i}^{'} \right )$
    - $H\left ( Y|X \right )=\sum_{i}^{A,T,C,G}P_{i}\sum_{j}^{A,T,C,G}H\left ( P_{ij} \right )$
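
The quantities above can be evaluated directly from a transfer matrix. The sketch below reads $H(p)$ as the single entropy term $-p\log_2 p$, which is an interpretation of the notation, and uses a made-up transfer matrix `T`. It also uses the standard identity $I(X;Y)=H(Y)-H(Y|X)$, which the notes do not state explicitly.

```python
import math

def h(p):
    """Single entropy term H(p) = -p * log2(p), with H(0) = 0."""
    return -p * math.log2(p) if p > 0 else 0.0

def output_distribution(p_in, T):
    """P'_i = sum_j P_j * P_ji (input frequencies pushed through T)."""
    n = len(p_in)
    return [sum(p_in[j] * T[j][i] for j in range(n)) for i in range(n)]

def entropy_output(p_in, T):
    """H(Y) = sum_i H(P'_i)."""
    return sum(h(p) for p in output_distribution(p_in, T))

def conditional_entropy(p_in, T):
    """H(Y|X) = sum_i P_i * sum_j H(P_ij)."""
    n = len(p_in)
    return sum(p_in[i] * sum(h(T[i][j]) for j in range(n)) for i in range(n))

# Hypothetical transfer matrix; rows are input A, T, C, G, columns are output.
T = [
    [0.97, 0.01, 0.01, 0.01],
    [0.02, 0.96, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.02, 0.96],
]
IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

p_in = [0.25, 0.25, 0.25, 0.25]
p_out = output_distribution(p_in, T)
h_y = entropy_output(p_in, T)
h_y_given_x = conditional_entropy(p_in, T)
info = h_y - h_y_given_x          # I(X;Y) for this input distribution

print(f"H(Y) = {h_y:.4f}, H(Y|X) = {h_y_given_x:.4f}, I(X;Y) = {info:.4f}")
```

The channel capacity described earlier is the maximum of this `info` value over all input distributions `p_in`. The sketch evaluates only one input distribution and does not perform that maximization.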

Implementation of DNA Storage

Source coding: converting digital information into DNA base sequences

Any digital information can be encoded into the DNA molecule by a simple conversion
- Bancroft et al.: English letters were directly encoded by base triplets in a manner like the amino acid codon table
  - Codon 'AAA' = Letter 'A'
  - Base 'G': reserved for sequencing primers
  - The three remaining bases give a triplet coding space of only $3^{3}=27$ elements
- Church et al.: used a more scalable approach
  - First converted different files into binary sequences in the HTML format and then converted these into DNA sequences
- Goldman et al.: applied the Huffman coding scheme in the first step, which employs ternary instead of binary conversion
  - Huffman coding also compresses the data, making this the first DNA storage study in which a data compression algorithm was used
    - It compresses the data efficiently by assigning short codes to frequently occurring items and longer codes to less frequent ones
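
To make the "short codes for frequent items" idea concrete, here is a minimal sketch of a generic binary Huffman coder. It is not the ternary scheme attributed to Goldman et al. above. The function names and the sample message are assumptions for illustration.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a binary Huffman code {symbol: bitstring} for the symbols in text."""
    freq = Counter(text)
    if len(freq) == 1:                      # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)     # two lightest subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(text, code):
    return "".join(code[ch] for ch in text)

def decode(bits, code):
    """Decode by reading bits until they match a code word (prefix-free code)."""
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

message = "AAAAABBBCCD"                     # 'A' is frequent, 'D' is rare
code = huffman_code(message)
bits = encode(message, code)
print(code, len(bits), "bits vs", 2 * len(message), "bits at fixed length")
```

For this message a fixed 2-bit code would need 22 bits, while the Huffman code needs 20, because the frequent symbol 'A' gets a 1-bit code word.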

Channel Coding

Channel coding adds extra symbols to protect the data during information transmission, where errors can occur
For DNA molecules, errors may occur during synthesis, replication, and sequencing
Two ways to recover the data despite information distortion:
- Physical redundancy:
  - Storing the same information in multiple copies
  - Entails increasing the copy number of DNA molecules that encode the same information
  - However, physical redundancy is not sufficient for achieving lossless data transmission
  - For large volumes, physical redundancy imposes a dramatic increase in costs
- Logical redundancy:
  - Using additional check bits through error detection and correction techniques
  - Add extra symbols, called 'check symbols' or 'supervised symbols', in addition to the symbols encoding information
    - When the information symbols are incorrect, the check symbols can be used to detect or correct errors so that the information can be accurately recovered
  - Linear block code (Figure 4b)
    - If a group of information symbols has a length of $k$, a check symbol of length $r$ can be added using a specific generator matrix
      - A linear block code with a code length of $n = k + r$
  - Hamming code (Figure 4a)
    - Only one error can be detected in each group of code words
      - That is, only one error at a time can be detected and corrected
  - Cyclic code: Bose-Chaudhuri-Hocquenghem (BCH) code
    - A code class that can correct multiple random errors based on the Galois binary field and its extension
- Quantitative assessments can be performed to compare the usefulness of physical redundancy and logical redundancy
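
As an illustration of logical redundancy, the following is a minimal sketch of the classic binary Hamming(7,4) code, a linear block code with $k=4$ information bits, $r=3$ check bits and $n=k+r=7$. It works on bits rather than bases and is not taken from the paper's Figure 4. The function names are assumptions.

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit code word [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4        # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4        # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(c):
    """Correct at most one flipped bit; return (data bits, error position)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4   # 1-based position of the error, 0 if none
    if pos:
        c[pos - 1] ^= 1          # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], pos

data = [1, 0, 1, 1]
codeword = hamming74_encode(data)

corrupted = list(codeword)
corrupted[4] ^= 1                # simulate a single error at position 5

recovered, error_pos = hamming74_decode(corrupted)
print(codeword, corrupted, recovered, error_pos)
```

With two flipped bits the syndrome points at the wrong position, which is the single-error limitation noted for the Hamming code. Codes such as BCH are used when multiple errors must be corrected.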

Encoding information in DNA sequences

After being converted to a binary (or other radix) sequence, the information needs to be transformed into base sequences in DNA
- The most intuitive conversion: encoding 2 bits with one base
  - Provides the maximal information storage capacity
  - However, it may result in sequences that are difficult to manipulate
    - Long tracts of homopolynucleotides that are error-prone in high-throughput sequencing
    - Church et al., Goldman et al., and Erlich et al. tried to solve this problem
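
The homopolymer problem and one way around it can be sketched as follows. The 2-bit table `BITS_TO_BASE` is an arbitrary assumed mapping. `rotating_encode` is a simplified illustration of the general idea of choosing each base from the three that differ from the previous one, using ternary digits as input. It is not the exact table used in any of the cited studies.

```python
from itertools import groupby

BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}  # assumed mapping
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def bits_to_dna(bits):
    """Direct conversion: every 2 bits become one base."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bits(seq):
    return "".join(BASE_TO_BITS[b] for b in seq)

def longest_homopolymer(seq):
    """Length of the longest run of identical bases."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)

def rotating_encode(trits, start="A"):
    """Map each trit (0, 1, 2) to one of the 3 bases differing from the previous base."""
    prev, out = start, []
    for t in trits:
        choices = [b for b in "ACGT" if b != prev]
        prev = choices[t]
        out.append(prev)
    return "".join(out)

def rotating_decode(seq, start="A"):
    prev, out = start, []
    for b in seq:
        choices = [x for x in "ACGT" if x != prev]
        out.append(choices.index(b))
        prev = b
    return out

direct = bits_to_dna("0000000011")          # "AAAAT": a run of four A's
rotated = rotating_encode([0, 0, 0, 0, 2])  # "CACAT": no repeated base
print(direct, longest_homopolymer(direct))
print(rotated, longest_homopolymer(rotated))
```

The trade-off is capacity: a base chosen from three options carries at most $\log_2 3 \approx 1.58$ bits instead of 2.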

Information density of DNA storage

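Theoretical information density:
- The information storage density of DNA is theoretically very high. DNA molecules are composed of four bases (A, T, C, G), and each base can store 2 bits of information.
Practical information density:
- In real applications, several constraints (storage environment, physical and logical redundancy, indexing) make the theoretical density difficult to achieve.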

Technical Aspects and Practical Considerations

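DNA synthesis and assembly technologies:
- Solid-phase phosphoramidite chemistry, array-based DNA synthesis, enzymatic synthesis, etc.
DNA sequencing technologies:
- Characteristics and performance of Sanger sequencing, next-generation sequencing (NGS), and single-molecule sequencing
Cost analysis:
- Compared with traditional storage methods, DNA has the advantage of low maintenance cost, but its high overall cost is still an obstacle to practical use.
Longevity:
- DNA is highly stable, which favors long-term storage. Kept at low temperature, it can preserve information for thousands of years.
In vivo DNA storage:
- Compared to in vitro DNA storage, in vivo storage takes advantage of the efficient cellular machineries of DNA replication, proofreading and long-chain DNA maintenance, offers the chance for assembly-free random access of data, and supports live recording of biochemical events in situ in living organisms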
