A.3.1 Principles of DNA Synthesis Costs.
Here, we provide an overview of
de novo and assembly synthesis along with quantitative insight into why assembly can be a much cheaper synthesis alternative for data. Figure
12 shows a comparative overview of standard
de novo synthesis on the left and the general idea of assembly on the right.
De novo synthesis grows each strand that encodes data one base at a time, and each unique encoding strand gets a specific location to grow on a silicon synthesis platform. The main mechanism that has been used to scale down
de novo synthesis costs has been to scale the platform in which the oligos are built. For example, Twist Bioscience utilizes a silicon substrate technology to allow for up to 700K unique oligos to be built on a single chip with a cost of
\(4.4\cdot 10^{-4}\) USD/base [
2]. This category of
de novo synthesis is known as
array-based synthesis. Another well-known
de novo synthesis method, termed
column synthesis, focuses on higher-quality strands at a lower unique-strand scale [
20]. Because column synthesis generates unique strands on 96- or 384-well plates, its cost per unique base, 0.29 USD/base, can be three orders of magnitude higher than that of array-based synthesis [
1].
On the right side of Figure
12, we have the general flow of an assembly-based synthesis process. Here, a strand of data to be synthesized is broken down into five codeword regions, where four are unique. We can now just synthesize this small set of codewords and assemble them using ligation chemistries that use a ligase enzyme to fully connect the backbone of the overhang with the backbone of the neighboring strand [
22,
30,
37]. From this, we recognize that we really do not need many unique features for such a small set of codewords. Furthermore, column-based synthesis is in fact more cost-effective than array-based synthesis if many copies of a single feature are desired. For example, array-based synthesis provides
\(10^8\) copies, while column synthesis can provide copy scales of
\(10^{16}\), each at its previously mentioned cost per base [
1,
2]. Per copy, column-based synthesis is therefore roughly five orders of magnitude cheaper: it yields eight orders of magnitude more copies at a per-base price that is only about three orders of magnitude higher. Thus, if column synthesis is paired with assembly-based synthesis to form longer strands with meaningful amounts of data, data synthesis costs can potentially be improved greatly.
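The order-of-magnitude argument above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the prices and copy counts quoted in this section; the function and constant names are illustrative, not part of the original analysis.

```python
import math

# Figures quoted in the text above.
ARRAY_COST_PER_BASE = 4.4e-4   # USD/base, array-based synthesis
COLUMN_COST_PER_BASE = 0.29    # USD/base, column synthesis
ARRAY_COPIES = 1e8             # copies per unique strand, array-based
COLUMN_COPIES = 1e16           # copies per unique strand, column


def cost_per_base_per_copy(cost_per_base, copies):
    """Per-base price spread over every physical copy delivered."""
    return cost_per_base / copies


array_rate = cost_per_base_per_copy(ARRAY_COST_PER_BASE, ARRAY_COPIES)
column_rate = cost_per_base_per_copy(COLUMN_COST_PER_BASE, COLUMN_COPIES)

# How many orders of magnitude column synthesis is more expensive per unique base.
price_gap_orders = math.log10(COLUMN_COST_PER_BASE / ARRAY_COST_PER_BASE)

# How many orders of magnitude column synthesis is cheaper once copies are counted.
copy_adjusted_orders = math.log10(array_rate / column_rate)

if __name__ == "__main__":
    print(f"price gap per unique base: 10^{price_gap_orders:.2f}")
    print(f"column advantage per copy: 10^{copy_adjusted_orders:.2f}")
```

The per-unique-base gap comes out just under three orders of magnitude, and the copy-adjusted advantage just over five, in line with the rounded figures in the text.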
A.3.2 Calculating Costs.
This section supplements the figures presented for cost analysis in Section
5.5 with the equations used to create those figures. We end this section with the set of parameters used to generate the figures from the equations. The cost of assembly synthesis, as mentioned before, takes into account the cost of obtaining the primitive block set with
de novo synthesis, maintenance of that set, and the materials expended on the reactions within the tree. These costs are summarized by the following equation:
The first term in the equation represents the cost associated with reactions needed to assemble strands, where
\(C_{LR}\) represents the cost of leaf reactions and, as we show later, also includes the costs of intermediate nodes in the tree. The second term calculates the cost associated with acquiring the first set of primitive blocks, along with the cost of replenishing that set as data is synthesized. In this equation,
\(T_{bits}\) is the total number of bits synthesized,
\(\alpha\) is a multiplier used to model the number of DNA strands needed per primitive block,
\(C_{dn}\) represents the per-base cost associated with the
de novo synthesis process used for instantiating the initial set of primitive blocks, and
\(N_{uses}\) represents the number of times a primitive block can be reused after being obtained from
de novo synthesis. The cost of leaf reactions is modeled by using scaling factors applied to a cost-per-overhang rate
\(R_{LR}\). The first scaling factor,
\({\vert {\mathcal {O}}\vert }-1\), models the increase in materials needed to support more ligations in a single reaction. The second factor models the increase in materials required to support deeper trees. The model assumes that each reaction in a tree requires the same amount of ligation materials as a single leaf reaction. However, we can work this cost into the cost of each leaf reaction by spreading the cost of an internal node among its children, and from there recursively until the leaves are reached. Thus, the share of an internal node's cost that is charged to a leaf reaction is divided by some power of
\({\vert {\mathcal {O}}\vert }-1\), depending on the internal node’s height. This is shown below in Equation (
7).
With the cost of leaf reactions defined, Equation (
6) is normalized by the simple factor
\(C_{dn,File} \cdot \frac{T_{bits}}{\sigma _{dn}}\) to produce the curves of Figure
9(a).
\(C_{dn,File}\) represents the cost per base of the
de novo synthesis process used to synthesize all strands for a file. To create the points in Figure
9(b) that show the cost per bit for various primitive block set sizes, it is assumed that the amount of data is asymptotically large, i.e.,
\(T_{bits} \rightarrow \infty\). The ceiling functions can then be dropped, since the cost impact of rounding the divisions up to the next integer is negligible for large
\(T_{bits}\). In turn,
\(T_{bits}\) can be factored out of both the reaction-cost term and the primitive-set synthesis term, and the remaining factor is the cost per bit plotted in Figure
9(b).
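To illustrate why dropping the ceiling functions is harmless for large amounts of data, the sketch below uses a deliberately generic cost of the form `cost_per_item * ceil(total_bits / bits_per_item)`. This is a hypothetical stand-in chosen only to show the limiting behavior, not the cost equation of this appendix, and the values of `bits_per_item` and `cost_per_item` are arbitrary.

```python
import math


def per_bit_cost_exact(total_bits, bits_per_item, cost_per_item):
    """Cost per bit when the number of purchased items is rounded up."""
    items = math.ceil(total_bits / bits_per_item)
    return items * cost_per_item / total_bits


def per_bit_cost_asymptotic(bits_per_item, cost_per_item):
    """Cost per bit with the ceiling dropped (the total_bits -> infinity limit)."""
    return cost_per_item / bits_per_item


def relative_gap(total_bits, bits_per_item=37, cost_per_item=1.0):
    """Relative error introduced by dropping the ceiling."""
    exact = per_bit_cost_exact(total_bits, bits_per_item, cost_per_item)
    limit = per_bit_cost_asymptotic(bits_per_item, cost_per_item)
    return (exact - limit) / limit


if __name__ == "__main__":
    for total_bits in (10**2, 10**4, 10**6, 10**9):
        print(total_bits, relative_gap(total_bits))
```

The gap is bounded by `bits_per_item / total_bits`, so it shrinks in proportion to the amount of data, which is the sense in which the rounding becomes negligible.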
To generate Figures
9(a) and
9(b), we assume that the costs of the assembly approach are purely the cost of the consumable materials, i.e., the cost of the ligase enzyme used to connect primitive blocks and the oligos ordered to fill the primitive block set. We therefore do not consider the costs of personnel or liquid handling platforms that would be needed to physically implement the algorithm at a sufficient throughput. As our cost comparison reference that represents the cost of
de novo synthesis for an entire file’s strands, we choose Twist Bioscience’s prices, since their synthesis process is tailored towards optimizing the cost per base for each unique ordered base [
2]. To take into account economy of scale, we assume Twist Bioscience’s lowest cost per base of
\(4.4\cdot 10^{-4}\) USD/base when ordering 250 base pair strands on the largest plate of 696,000 unique oligos at a price of 76,560 USD. Although this cost is a consumer-facing value that hides the actual costs of the raw consumables, since a company must factor in profit, personnel, and so on, the values we use to calculate the cost of assembly are also consumer-facing, providing a fair comparison. Costs are compared in a similar manner for other emerging synthesis techniques for DNA data storage, as described by Reference [
7].
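As a sanity check on the reference price, the per-base figure follows directly from the plate price, oligo count, and strand length quoted above. A minimal sketch, with illustrative names:

```python
# Order parameters quoted in the text.
PLATE_PRICE_USD = 76_560
UNIQUE_OLIGOS = 696_000
BASES_PER_OLIGO = 250


def cost_per_base(plate_price, oligos, bases_per_oligo):
    """Price of the whole order divided by the number of unique bases ordered."""
    return plate_price / (oligos * bases_per_oligo)


reference_cost = cost_per_base(PLATE_PRICE_USD, UNIQUE_OLIGOS, BASES_PER_OLIGO)

if __name__ == "__main__":
    print(f"{reference_cost:.2e} USD/base")
```

The order covers 174 million unique bases, which gives the quoted \(4.4\cdot 10^{-4}\) USD/base.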
For
de novo synthesis, costs attributed to building the primitive block set for assembly are chosen with synthesis yield as the priority. Thus, we choose Eurofins custom oligos, which can be ordered at a rate of 0.29 USD per ordered base pair at a synthesis yield of 50 nmol. In our experience, such an order yields approximately 25 nmol of strands for strand lengths of 17 base pairs, which is similar to all strand lengths considered in Figures
9(a) and
9(b) [
1]. This yield and cost per base pair correspond to a scale-adjusted cost of 0.0116 (USD/bp)/nmol, which is far more favorable for assembly synthesis than Twist Bioscience’s process, which guarantees a yield of only 0.2 fmol, corresponding to 2200 (USD/bp)/nmol [
2].
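The two scale-adjusted costs can be reproduced from the prices and yields given in this paragraph. The sketch below is illustrative only; the names are assumptions, and the only conversion used is 1 fmol = \(10^{-6}\) nmol.

```python
FMOL_PER_NMOL = 1e6  # 1 nmol = 10^6 fmol

# Column synthesis (Eurofins), as quoted in the text.
COLUMN_COST_PER_BP = 0.29      # USD/bp
COLUMN_YIELD_NMOL = 25.0       # nmol actually obtained

# Array synthesis (Twist Bioscience), as quoted in the text.
ARRAY_COST_PER_BP = 4.4e-4     # USD/bp
ARRAY_YIELD_FMOL = 0.2         # fmol guaranteed


def scale_adjusted_cost(cost_per_bp, yield_nmol):
    """Cost per base pair normalized by the delivered amount, in (USD/bp)/nmol."""
    return cost_per_bp / yield_nmol


column_adjusted = scale_adjusted_cost(COLUMN_COST_PER_BP, COLUMN_YIELD_NMOL)
array_adjusted = scale_adjusted_cost(
    ARRAY_COST_PER_BP, ARRAY_YIELD_FMOL / FMOL_PER_NMOL
)

if __name__ == "__main__":
    print(f"column: {column_adjusted:.4f} (USD/bp)/nmol")
    print(f"array:  {array_adjusted:.0f} (USD/bp)/nmol")
    print(f"ratio:  {array_adjusted / column_adjusted:.3g}")
```

The ratio between the two is roughly \(1.9\cdot 10^{5}\), consistent with the five-orders-of-magnitude advantage discussed in A.3.1.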
For other consumables outside of oligos, we assume that the only remaining chemical cost is that of the DNA ligase enzyme, as this is the dominant cost factor in creating the mixture that facilitates overhang ligation. For this cost, we assume New England BioLabs’ T4 DNA ligase [
4]. The cost of this ligase is used to calculate the cost per leaf reaction per overhang
\(R_{LR}\). To do this, we calculate a cost per unit of T4 DNA ligase, i.e.,
\(260/10^{5}\) USD/unit. A
unit is defined as the amount required to ligate 50% of 0.12
\(\mu\)M of
\(5^{\prime }\) DNA termini in a reaction volume of 20
\(\mu\)l over 30 minutes at
\(16^{\circ }\text{C}\) [
4]. A terminus can be defined as a physical overhang in the reaction that must be ligated. Using this information, we can calculate the fraction of a unit required for a single terminus in the reaction with the following calculation:
Next, we need to know the number of overhangs that will be present in any given reaction. We assume that we need
\(10^{4}\) copies of each primitive block in a leaf reaction. Thus, if the reactions were 100% efficient and converted each codeword copy into a complete strand, then there would be
\(10^{4}\) copies per strand. Although, realistically, these reactions will not be 100% efficient, the number of copies required downstream in the storage system is several orders of magnitude smaller. As shown by Chen et al., as few as 10 copies per oligo can be accessed with typical random access methods such as
polymerase chain reaction (PCR) [
12]. Thus, our assumption remains reasonable if a given reaction tree can be built with at least
\(10/10^{4} = 0.1 \%\) efficiency. With
\(10^{4}\) copies per codeword, the number of total overhangs to be ligated will be
\(({\vert {\mathcal {O}}\vert }-1)\cdot 10^{4}\). Thus, we can calculate the number of units needed for each reaction per logical overhang as:
Finally, the cost per logical overhang in the leaf reaction (
\(R_{LR}\)) can be calculated as
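The chain of calculations described in this paragraph can be sketched numerically from the unit definition and the ligase price given above. The following is one plausible reading, not necessarily the authors' exact calculation: it assumes that one unit ligates exactly 50% of the termini in the reference reaction, that ligase demand scales linearly with the number of termini, and that reaction time and temperature can be ignored. All names are illustrative.

```python
AVOGADRO = 6.02214076e23  # molecules per mole

# Ligase price and unit definition, as quoted in the text.
LIGASE_PRICE_USD = 260.0
LIGASE_UNITS = 1e5
TERMINI_CONC_MOLAR = 0.12e-6   # 0.12 uM of 5' termini
REACTION_VOLUME_L = 20e-6      # 20 ul
FRACTION_LIGATED = 0.5         # ASSUMPTION: one unit ligates exactly half

COPIES_PER_BLOCK = 1e4         # copies of each primitive block per leaf reaction


def termini_ligated_per_unit():
    """Number of termini one unit ligates under the reference conditions."""
    moles = TERMINI_CONC_MOLAR * REACTION_VOLUME_L * FRACTION_LIGATED
    return moles * AVOGADRO


def units_per_terminus():
    """Fraction of a unit needed per physical terminus (linear scaling assumed)."""
    return 1.0 / termini_ligated_per_unit()


def units_per_logical_overhang(copies=COPIES_PER_BLOCK):
    """Units needed for all physical copies of one logical overhang."""
    return copies * units_per_terminus()


def cost_per_logical_overhang(copies=COPIES_PER_BLOCK):
    """Ligase cost in USD attributed to one logical overhang in a leaf reaction."""
    usd_per_unit = LIGASE_PRICE_USD / LIGASE_UNITS
    return usd_per_unit * units_per_logical_overhang(copies)


if __name__ == "__main__":
    print(f"termini per unit:      {termini_ligated_per_unit():.3e}")
    print(f"units per terminus:    {units_per_terminus():.3e}")
    print(f"USD per logical o/h:   {cost_per_logical_overhang():.3e}")
```

If the full 0.12 \(\mu\)M of termini were counted instead of the ligated half, every result would change by a factor of two, so the output should be read as an order-of-magnitude estimate.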
As for the remaining parameters,
\(L_o\) was chosen to be 4 bases, as this is a typical overhang length for Golden Gate assembly [
30]; a base-3 encoding was assumed as in Section
5.6, which translates to a density of 1.33 bits per base;
\(\alpha\) was set to 2 because each primitive block needs 2 individual strands to be created from the initial
de novo synthesis to construct blocks similar to those in Figure
2; and last,
\(N_{uses}\) can be easily calculated from the 25 nmol synthesis yield and the number of codeword copies used in each reaction to get
\(N_{uses} = 1.5 \cdot 10^{12}\).
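The value of \(N_{uses}\) and the minimum-efficiency figure quoted earlier both follow from the 25 nmol yield and the \(10^{4}\) copies consumed per reaction. A minimal sketch, with illustrative names and the standard value of Avogadro's constant:

```python
AVOGADRO = 6.02214076e23  # molecules per mole

SYNTHESIS_YIELD_NMOL = 25.0    # delivered amount per primitive block strand
COPIES_PER_REACTION = 1e4      # copies of each codeword used per leaf reaction
MIN_COPIES_FOR_ACCESS = 10     # copies needed downstream for random access


def molecules_from_nmol(nmol):
    """Convert an amount in nanomoles to a molecule count."""
    return nmol * 1e-9 * AVOGADRO


def number_of_uses(yield_nmol, copies_per_reaction):
    """How many leaf reactions one synthesized batch can supply."""
    return molecules_from_nmol(yield_nmol) / copies_per_reaction


n_uses = number_of_uses(SYNTHESIS_YIELD_NMOL, COPIES_PER_REACTION)
min_efficiency = MIN_COPIES_FOR_ACCESS / COPIES_PER_REACTION

if __name__ == "__main__":
    print(f"N_uses ~ {n_uses:.2e}")
    print(f"minimum tree efficiency ~ {min_efficiency:.1%}")
```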