# Why does Nature use a 4-level system to encode information in DNA?

First, I am not a biologist, so this question might be naive:

Computer information processing and storage is based on a 2-digit system of bits with values 0 and 1. Now, DNA stores the information in a 4-digit system: A, C, G, T. Three base pairs form a codon and can encode 4^3 amino acids.

Is there a good reason why a 4-level system (which can store 2 bits per encoding entity) evolved rather than a 2-level or a system with a larger number of symbols in the alphabet?

Put differently: Why was a binary system not preferred for storage and processing of data? In computing, binary is much easier, and the very few tests of exotic higher-level data processing have not really been successful.

The current hypothesis is that RNA came first, DNA and proteins came later. So the reason that four bases are used might be related to the initial RNA world, and then DNA just reused the already existing RNA bases in a slightly modified form. In the RNA world, all functions had to be performed by RNA. Having more bases available than two would likely be important to be able to adopt various structures and create binding pockets or active sites for ribozymes.

You can't really think of the genetic code as an abstract data storage device. There are physical and chemical consequences to the choice of encoding. For example, proteins have to be able to bind to DNA and recognize particular patterns. With your binary code, the recognition sequence would have to be longer, because each basepair contains less information. The tRNA anticodons would have to be larger for protein biosynthesis to work with the binary code. Another issue that plays a role in some processes is that GC base pairs are more stable than AU/AT base pairs.
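To put a rough number on the "longer recognition sequence" point, here is a minimal sketch. It assumes every symbol is equally likely, and the 6-bp recognition site is a hypothetical example, not a figure from any source.

```python
def equivalent_length(length, from_size, to_size):
    """Smallest number of positions in a `to_size`-letter alphabet that can
    distinguish as many sequences as `length` positions of a
    `from_size`-letter alphabet."""
    target = from_size ** length  # number of distinct sequences to match
    positions = 0
    while to_size ** positions < target:
        positions += 1
    return positions


# Hypothetical 6-bp protein recognition site in a four-letter code.
print(equivalent_length(6, 4, 2))  # positions needed in a two-letter code
# A three-base anticodon in a four-letter code.
print(equivalent_length(3, 4, 2))
```

Under these assumptions, every recognition element doubles in length when the alphabet shrinks from four letters to two.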

These are all just hypotheses. Evolution doesn't necessarily choose the best option; sometimes it is just the most convenient one that still works well.

I also found a review titled "Why are there four letters in the genetic alphabet?" that makes a similar point as my first one.

All present models to explain the fact that we have four base types in our genetic alphabet hinge, in covert or overt form, on the assumption that the genetic alphabet evolved in an RNA world

Another factor I didn't think of that is mentioned there is that while more bases make better ribozymes, more bases also decrease the accuracy of replication.

In summary, two-dimensional RNA-like structures (and, presumably also the three-dimensional structures) become better defined as alphabet size increases, whereas the accuracy of replication decreases.

Why does nature use a 4-level system (DNA) to encode information?

Short answer: Ease of manufacture, simplicity of matching, sufficiency for requirements. A small set of simple bases takes less effort to create and provides fewer possible matches, yet is complex enough to code what is required while retaining sufficient degeneracy for success. It was also the coincidence of replicase-alphabet co-evolution, both occurring in the same place at the same time.

First, I am not a biologist, so this question might be naive:

Beginners and experts are welcome at SE.

All of our information processing and storing is based on 2-level logic, bits with 0 and 1.

Euler's number ($$e$$) is defined as the sum of an infinite series $$\sum_{n=0}^{\infty} \frac{1}{n!}$$ and has the lowest radix economy, but it's not convenient to implement in logic circuits. With the radix economy of $$e$$ set at 1.000, ternary is 1.0046 and binary is 1.0615.
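The figures quoted above can be reproduced with a short sketch. It assumes the usual asymptotic definition of radix economy, b / ln(b), normalised so that base e scores 1; the function name is just for illustration.

```python
import math


def relative_radix_economy(base):
    """Asymptotic radix economy b / ln(b), divided by its value at base e."""
    return (base / math.log(base)) / math.e


for b in (2, 3, 4):
    print(b, round(relative_radix_economy(b), 4))
# 2 -> 1.0615, 3 -> 1.0046, 4 -> 1.0615
```

By this particular measure a base-4 system scores exactly the same as binary, because 4 / ln(4) = 2 / ln(2).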

Ternary computers have been constructed using ternary logic, and while they are uncommon, ternary logic is used in SQL, even on binary-based computers.

Most, but not all of our information processing and storing is based on 2-level logic.

Now, DNA stores the information in a 4-level system: A, C, G, T. Three basepairs form a codon and can encode 4^3 amino acids.

Most, but not all.

The five canonical, or primary, nucleobases are: adenine (A), cytosine (C), guanine (G), thymine (T), and uracil (U). DNA uses A, G, C, and T while RNA uses A, G, C, and U.

In the laboratory, functional DNA has been created with 6 and with 8 bases.

See the (paywall) report: "Hachimoji DNA and RNA: A genetic system with eight building blocks", Feb 22 2019, by Shuichi Hoshika, Nicole A. Leal, Myong-Jung Kim, Myong-Sang Kim, Nilesh B. Karalkar, Hyo-Joong Kim, Alison M. Bates, Norman E. Watkins Jr., Holly A. SantaLucia, Adam J. Meyer, Saurja DasGupta, Joseph A. Piccirilli, Andrew D. Ellington, John SantaLucia Jr., Millie M. Georgiadis, and Steven A. Benner.

"Fig. 4 Structure and fluorescent properties of hachimoji RNA molecules.
(A) Schematic showing the full hachimoji spinach variant aptamer; additional nucleotide components of the hachimoji system are shown as black letters at positions 8, 10, 76, and 78 (B, Z, P, and S, respectively). The fluor binds in loop L12 (25). (B to E) Fluorescence of various species in equal amounts as determined by UV. Fluorescence was visualized under a blue light (470 nm) with an amber (580 nm) filter.
(B) Control with fluor only, lacking RNA.
(C) Hachimoji spinach with the sequence shown in (A).
(D) Native spinach aptamer with fluor.
(E) Fluor and spinach aptamer containing Z at position 50, replacing the A:U pair at positions 53:29 with G:C to restore the triple observed in the crystal structure. This places the quenching Z chromophore near the fluor; CD spectra suggest that this variant had the same fold as native spinach (fig. S8).".

Centrifuge tube C contains the hachimoji spinach aptamer, that is, the RNA built from the eight-base alphabet.

Is there a good reason for why during early evolution, a 4-level system (which can store 2 bits per encoding entity) is favoured over a 2-level system or over larger systems?

Yes.

• Copying fidelity decreases roughly exponentially with increasing size (N pairs) of the alphabet (keeping the length of the genome fixed). The reason for this is that as one adds more letters to the alphabet, they will resemble each other more and more, and hence the chance of mispairing and mutagenesis increases.

• Overall metabolic efficiency and fitness are determined by the size: we have 20 amino acids to code for (anything smaller gives 16 or fewer combinations) and 3 stop codons. So we have a space of 64 codons, and rely on degeneracy to provide a degree of 'error correction' (synonymization), where errors are usually converted into non-fatal ones. While seldom fatal, translation errors can still cause rare diseases.

We are already running inefficiently: going to a larger number of pairs introduces unnecessary complexity, and going smaller isn't an option for the number of amino acids that must be coded for. Increasing the codon length makes DNA larger; as it is, it must already be coiled to fit into cells, and DNA one third larger would better fit cells that are also one third larger.
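The degeneracy mentioned above can be counted directly. The sketch below assumes the standard genetic code (NCBI translation table 1), written with DNA letters and `*` for stop; the variable names are only for illustration.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons map to each meaning (20 amino acids plus stop).
degeneracy = Counter(CODON_TABLE.values())

print(len(CODON_TABLE), "codons for", len(degeneracy), "meanings")
for meaning, count in sorted(degeneracy.items(), key=lambda kv: -kv[1]):
    print(meaning, count)
```

Doublets of four bases would give only 16 combinations, fewer than the 21 meanings needed, which is why triplets leave this much redundancy.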

In the opinion piece "Why are there four letters in the genetic alphabet?", Nature Reviews Genetics volume 4, pages 995-1001 (2003), by Eörs Szathmáry there are the following observations:

Page 995:

"There are four main constraints on the successful incorporation of a new base pair$$^{[6-8]}$$:

• chemical stability (the base should not readily decompose);

• thermodynamic stability (new base pairs should not destabilize nucleic-acid structures);

• enzymatic processability (polymerases should accept the base pairs as substrates, catalyse addition to the primer and be able to carry on the process); and

• kinetic selectivity (orthogonality to other base pairs).

All four criteria are important but the combination of the last two, which we might call replicability, has received particular attention because it is the main obstacle to adding to the genetic alphabet.".

Page 997:

"Theoretical arguments
The feasibility of alternative base pairs raises the question: why are there four bases in the natural genetic alphabet? As Orgel pointed out, there are two types of answer: either evolution has never experimented with alternative base pairs or four bases 'were enough'$$^{[20]}$$. The first option might hold for the hydrophobic base pairs discussed above (an adequate early synthesis might be lacking), but it is unlikely to be true for all of the hydrogen-bonding bases in a prebiotic 'chemical mayhem'. At any rate, it does not explain why we do not have only two bases$$^{[21-24]}$$. It therefore seems worthwhile to pursue the second option: why might four bases be enough? If 'enough' is understood in terms of evolutionary stability, it means optimality within the frame of the structural constraints that are afforded by natural selection. Here, I describe attempts to show that four bases are optimal under STABILIZING SELECTION, especially when we consider MUTATION-SELECTION EQUILIBRIUM. I then discuss evidence for the optimal size of the genetic code obtained from in silico DIRECTIONAL SELECTION and finally analyse a more abstract contribution from so-called ERROR-CODING THEORY.".

Page 1000:

"Theoretical investigations based on structural, energetic and information-theoretic studies confirm the view that increased alphabet size decreases copying fidelity while increasing information density. This indicates that there must be an optimum alphabet size in terms of fitness, whether we assume that the genetic alphabet was fixed in an RNA world or not.

According to the RNA-world-based view, the genetic alphabet became fixed more than 3 billion years ago$$^{[31]}$$, and the origin of the genetic code and translation happened subsequently$$^{[42]}$$. This line of reasoning indicates that the informational/operational division of labour between nucleic acids and proteins$$^{[43]}$$ has uncoupled the genetic alphabet from enzymatic functionality constraints. As the genetic code evolved in the context of a certain genetic alphabet, any further change of the alphabet would have been unnecessary and/or extremely unlikely.

If, however, the genetic code originated by the simultaneous co-evolution of nucleic acids and proteins (a much more complicated model), then the fixation of the genetic alphabet must be considered in this complex context. Here, the general insight of Mac Dónaill$$^{[38]}$$ helps: the information density of the alphabet is a useful concept, whether the exercised function is ribozymic or a messenger function in protein synthesis. In this case, the problem of the size of the 'catalytic alphabet' (the number of encoded amino acids) readily arises: why do we have 20 rather than, for example, 16 or 25 different amino acids? It has been pointed out that some of the considerations discussed in this article (effects on catalytic efficiency and translation fidelity) apply to this related problem$$^{[32]}$$. However, another crucial factor is likely to be involved: the metabolic cost of producing amino acids. An amino acid that belongs to the same biosynthetic family$$^{[43]}$$ is expected to increase catalytic efficiency only modestly and its metabolic cost is likely to be small. By contrast, an amino acid from a new biosynthetic family is likely to confer a high enzymatic advantage, but is expected to incur high metabolic costs (for instance, many new ATP-requiring steps).".

References:

$$[6.]$$ Mathis, G. & Hunziker, J. Towards a DNA-like duplex without hydrogen-bonded base pairs. Angew. Chem. Int. Ed. 41, 3203-3205 (2002).

$$[7.]$$ Ogawa, A. K., Wu, Y., Berger, M., Schultz, P. G. & Romesberg, F. E. Rational design of an unnatural base pair with increased kinetic selectivity. J. Am. Chem. Soc. 122, 8803-8804 (2000).

$$[8.]$$ Kool, E. T. Synthetically modified DNAs as substrates for polymerases. Curr. Opin. Chem. Biol. 4, 602-608 (2000).

$$[20.]$$ Orgel, L. E. Nucleic acids - adding to the genetic alphabet. Nature 343, 18-20 (1990).

$$[21.]$$ Orgel, L. E. Evolution of the genetic apparatus. J. Mol. Biol. 38, 381-393 (1968).

$$[22.]$$ Crick, F. H. C. The origin of the genetic code. J. Mol. Biol. 38, 367-379 (1968).

$$[23.]$$ Wächtershäuser, G. An all-purine precursor of nucleic acids. Proc. Natl Acad. Sci. USA 85, 1134-1135 (1988).

$$[24.]$$ Zubay, G. An all-purine precursor of nucleic acids. Chemtracts 2, 439-442 (1991).

$$[31.]$$ Szathmáry, E. Four letters in the genetic alphabet: a frozen evolutionary optimum? Proc. R. Soc. Lond. B 245, 91-99 (1991).

$$[32.]$$ Szathmáry, E. What is the optimum size for the genetic alphabet? Proc. Natl Acad. Sci. USA 89, 2614-2618 (1992).

$$[38.]$$ Mac Dónaill, D. A. Why nature chose A, C, G and U/T: an error-coding perspective of nucleotide alphabet composition. Orig. Life Evol. Biosphere 33, 433-455 (2003).

$$[42.]$$ Szathmáry, E. The origin of the genetic code: amino acids as cofactors in an RNA world. Trends Genet. 15, 223-229 (1999).

$$[43.]$$ Wong, J. T. A coevolution theory of the genetic code. Proc. Natl Acad. Sci. USA 72, 1909-1912 (1975).

Further Information:

Eörs Szathmáry's Wikipedia web page
`http://www.colbud.hu/fellows/szathmary.shtml`- The Collegium Budapest is closed.

Scripps Research Institute

Steven Benner's web page
`http://www.chem.ufl.edu/benner.html`- Dr. Benner left UoF in 2005.

Asked differently: Why did evolution not prefer to have a binary system to store and process data? For us, binary is much easier, and the very few tests of exotic higher-level data processing were not really successful.

Binary has nothing to do with evolution. Few of us can count to 255 in binary; we prefer decimal. Both ternary computers and SQL are "really successful"; people just prefer the alternatives.

This is intended to be an answer suitable for a layperson. Eörs Szathmáry's article and its associated references can be consulted for more details.

The use of binary in computers arose primarily from practical considerations of how to represent digits using electric current or voltage (i.e. either 'on' or 'off' is the least equivocal). Such representation was not only - or even primarily - for storing information of different numerical types, but for programming logic using Boolean algebra. Physical storage of data can be in various different formats (magnetic, optical, electrical) but these are functionally equivalent and the retrieval and conversion of binary data into integers, real numbers, text or images is predominantly a mathematical rather than a physical concern.

DNA has various functions, but these do not include a concern with programming logic. In storage of data of different types, there is no problem representing different bases or radixes - sufficient different nucleic acid bases are available to represent base-4 digits. The most pertinent question in storage is one that does not arise in computer memory, that is the physical transformation of the information into other molecules. This can take the form of inverse copying of a genomic nucleic acid (DNA or, perhaps originally, RNA) in replication, the copying of one strand of information in a DNA duplex into a single strand of a related but not identical nucleic acid in the transcription to mRNA (messenger RNA), and 'reading' (translation) of the information in the chemical bases of the mRNA to produce a protein composed of amino acids - quite different chemical molecules.
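As a minimal sketch of the first two transformations described above, complementary copying and transcription can be written as string operations. The function names are hypothetical, strands are written 5' to 3', and everything about the real enzymology is ignored.

```python
# Watson-Crick partners for DNA: A pairs with T, C pairs with G.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def replicate(strand):
    """Return the complementary DNA strand, read 5'->3' (reverse complement)."""
    return strand.translate(DNA_COMPLEMENT)[::-1]


def transcribe(template):
    """Return the mRNA copied from a DNA template strand:
    complementary to the template, with U in place of T."""
    return replicate(template).replace("T", "U")


print(replicate("ATGC"))
print(transcribe("CAT"))
```

Copying a strand twice returns the original sequence, which is the sense in which replication is an "inverse" copy.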

Thus, the electronic or mathematical considerations that lead to the statement “In computing, binary is much easier” have no relevance for DNA and the genetic code, where chemical and structural molecular considerations are paramount. The supposition that there is a need to explain why information in DNA is not specifically binary is therefore false.

Speculation on the structural chemistry of the evolution of genetic information

The question of why a 4-digit system was adopted (rather than 2- or 6- etc.) is still valid, but cannot be answered definitively. However, it is worth discussing to illustrate to numerical scientists the ways in which structural considerations might have determined the choice of information system. I shall consider two early stages in biochemical evolution at which the 4-digit system may have been selected for, after which - one should recognize - there might have been severe barriers to further change. Giving up Java and moving to Python (or even just changing from Java I to Java II) was probably not an option.

THE CHEMISTRY OF THE SELF-REPLICATING GENOME

I shall assume one of the main tenets of the RNA World Hypothesis - that RNA preceded DNA as the cellular genome. Even if the original genome were DNA, the question is the same - why four nucleic acid bases rather than two or six etc. - and the requirements of its chemical constitution are similar: to allow self-replication (hence the need to consider even numbers).

One might consider that an RNA with two bases arose first - let us assume adenine (A) and uracil (U) for the sake of argument. Later the cell acquired the catalytic ability to synthesize guanine (G) and cytosine (C), so that development from a self-replicating AU genome to an AUGC genome became possible. Assuming that this occurred before the amino acid-coding potential of the RNA had arisen, what might have favoured the more complex genome? It might have been something to do with the fact that there are three, rather than two, hydrogen bonds in a GC base-pair, perhaps producing a different RNA:RNA structure which was either more stable or easier to replicate for some reason. Alternatively it may have had nothing to do with the structure of the RNA:RNA helix, but was a side-effect of the acquisition of additional bases the greater chemical versatility of which enhanced the enzymic functions (ribozyme activity) of primaeval RNA.

If more meant better, why not six bases, rather than four? There might be specific chemical reasons like the slower development of the enzyme activities to produce other nucleic acid bases, or that with more bases the possibility of mispairing of bases was higher. (The 'Goldilocks principle' takes one a long way.) Or it may have been that the system worked well enough, was followed by the development of a triplet code, at which stage the system was frozen.

THE STRUCTURAL CHEMISTRY OF TRANSLATION

The die may have already been cast in the genome, above, but it would be a shame not to look at the chemistry of the decoding of the genetic information, as it is hardly a consideration for computer systems. So let us consider a competition between closely related organisms, one with a two-base genome and another with a four-base genome (and even a six-base genome). The requirement is to encode the information for a number of amino acids that can give proteins functional versatility - somewhere about the 20 (plus termination signals) we have today. The size of the codon (the word size), is 3 in a 4-bit system, accommodating 64 ($$4^3$$) possible codons in a (the standard) genetic code. If a 2-bit system of data storage were used then a word-size of 5 would be required to generate 32 ($$2^5$$) possible codons, whereas a 6-bit system could reduce the word-size to 2 with 36 ($$6^2$$) possible codons.
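The word sizes above follow from a one-line rule: take the smallest codon length whose number of combinations covers the meanings to be encoded. The sketch assumes 21 meanings (20 amino acids plus a single termination signal); the function name is only for illustration.

```python
def min_codon_length(alphabet_size, meanings):
    """Smallest word size n such that alphabet_size ** n >= meanings."""
    length = 1
    while alphabet_size ** length < meanings:
        length += 1
    return length


MEANINGS = 21  # assumption: 20 amino acids plus one stop signal

for bases in (2, 4, 6):
    n = min_codon_length(bases, MEANINGS)
    print(f"{bases} bases: word size {n}, {bases ** n} possible codons")
```

This gives word sizes of 5, 3 and 2 for two, four and six bases, with 32, 64 and 36 possible codons respectively.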

The physical consequences of such different systems would be seen in the decoding process where an adaptor molecule - transfer RNA (tRNA) - delivers amino acids to the peptidyl transferase centre of the ribosome (at one end) while interacting with the messenger RNA (mRNA) at the other end through codon-anticodon base-pairing. One might argue that the tRNA anticodon of three bases fits into the loop at the end of the helical anticodon stem in a manner that allows it to adopt a relatively precise position (yes, I know about wobble) where it can make appropriate contact with the mRNA codon bases (see diagram below).

A quintuplet anticodon and five-base interaction (although not impossible) would appear less naturally adapted to the structural chemistry of RNA. Similar objections would not apply to a two-base interaction, although one might argue that the total energy of interaction between two base-pairs would be insufficient to prevent errors. The error consideration also applies to a binary system, where the difference in energy between a five-base interaction and a four-base interaction (i.e. a simple mismatch) would be low. Indeed, if the hypothetical competition between 2-bit and 4-bit organisms had occurred, the 2-bit system would also have been more prone to errors arising from slippage during replication.

The Last Word…

… goes to Steven Benner, whose group has constructed 8-base DNA in the laboratory:

“The ability to store information is not very interesting for evolution. You have to be able to transfer that information into a molecule that does something.”

## V(D)J recombination

V(D)J recombination is the mechanism of somatic recombination that occurs only in developing lymphocytes during the early stages of T and B cell maturation. It results in the highly diverse repertoire of antibodies/immunoglobulins and T cell receptors (TCRs) found in B cells and T cells, respectively. The process is a defining feature of the adaptive immune system.

V(D)J recombination in mammals occurs in the primary lymphoid organs (bone marrow for B cells and thymus for T cells) and in a nearly random fashion rearranges variable (V), joining (J), and in some cases, diversity (D) gene segments. The process ultimately results in novel amino acid sequences in the antigen-binding regions of immunoglobulins and TCRs that allow for the recognition of antigens from nearly all pathogens including bacteria, viruses, parasites, and worms as well as "altered self cells" as seen in cancer. The recognition can also be allergic in nature (e.g. to pollen or other allergens) or may match host tissues and lead to autoimmunity.

In 1987, Susumu Tonegawa was awarded the Nobel Prize in Physiology or Medicine "for his discovery of the genetic principle for generation of antibody diversity". [1]

## What is Uracil?

Uracil, represented by the letter U, is one of the four nitrogenous bases in RNA. It is a pyrimidine and has a single six-membered ring structure.

The chemical formula of Uracil is C4H4N2O2 and its IUPAC name is Pyrimidine-2,4(1H,3H)-dione.

Uracil is the demethylated form of thymine: it is thymine without the methyl group (CH3) at the C5 position of the ring.

In RNA, Uracil binds to Adenine using two hydrogen bonds. Thus, uracil acts as both a hydrogen bond acceptor and a hydrogen bond donor when bonded with Adenine.

Uracil binds to the pentose sugar ribose to form the ribonucleoside uridine. When a phosphate group attaches to uridine, uridine 5′-monophosphate is produced.

Uracil is a weak acid and is very resistant to oxidation, which allows RNA to exist freely outside the nucleus, something DNA cannot do.

Bound to ribose and phosphates, uracil performs various biological functions, such as acting as an allosteric regulator and as a coenzyme for reactions, and taking part in the biosynthesis of polysaccharides and the transport of sugars containing aldehydes.

The presence of uracil in RNA such as mRNA helps in the production of amino acid chains to make proteins. Of the 64 mRNA codons, 37 contain at least one uracil.

Uracil's role in terminating protein synthesis can't be ignored either: it is present in all 3 stop codons, UAA, UAG, and UGA.
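The codon counts in this section can be checked by simple enumeration. This is a minimal sketch: it lists every triplet over the four RNA bases and counts those containing at least one U; the stop codons are the three named in the text.

```python
from itertools import product

# All 64 possible mRNA codons over the four RNA bases.
codons = ["".join(triplet) for triplet in product("UCAG", repeat=3)]

# Codons with at least one uracil; the 27 codons built only from C, A, G are excluded.
with_uracil = [codon for codon in codons if "U" in codon]

STOP_CODONS = ("UAA", "UAG", "UGA")

print(len(codons), len(with_uracil))
print(all("U" in codon for codon in STOP_CODONS))
```

The count is 64 minus the 27 codons made only of C, A and G, which gives exactly 37.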

## The benefits of synthetic biology

Biology is chemistry! Biological systems are often turned to for inspiration in the search for new reactions and catalysts. Just think about the cell of a microorganism. Inside, staggeringly efficient metabolic enzymes are used to transform simple abundant starting materials into beautifully complex natural products. Chemists simply can’t compete with nature in this regard.

A bioreactor used to ferment ethanol from corncob waste being loaded with yeast

Source: United States Department of Agriculture

### What is synthetic biology?

Synthetic biology is about building artificial systems that can then do something useful. For example, we can use our knowledge of enzymatic chemistry to genetically encode a synthetic pathway within a sequence of DNA. Once inside a microbe, this DNA will then coerce the cell to produce whatever molecule we desire. To draw an analogy, this is the equivalent of a reaction flask that produces all the catalysts and reagents for a desired total synthesis at the exact point in the synthetic route that they are required, performed in water using renewable materials and completed in less than a day. If we could do this in a lab or on an industrial scale, it would make any chemical process incredibly efficient.

Photomicrograph of cross section of a seed

### What are the advantages of synthetic biology?

There are plenty. For starters, the products we want can be derived directly from starting materials such as glucose, CO2 or methane. Once the best enzymes have been identified, they can be moved between different species; we can even connect whole pathways together using genome engineering. Reactive and toxic intermediates can be contained within sub-cellular compartments, reducing hazardous waste. And the end product can be exported from the cell using natural secretion pathways – which means we don’t have to perform difficult purifications.

### Why isn’t synthetic biology used more widely?

We don’t have enzymes that can match every known reaction in organic chemistry, so synthetic biology cannot begin to rival the diversity of compounds that are accessible via traditional methods. Moreover, many commercially important molecules contain structural motifs that can’t be made using known enzymatic chemistry. For this reason, many current bioprocesses produce synthetic intermediates, which are then extracted and developed further in a synthetic chemistry lab to get the finished product.

### Can we develop the enzymes we need?

We hope so. Modern biology is currently looking to address this gap in reactivity by using the very tool that empowered it in the first place – evolution.

Directed evolution is a method that tries to mimic natural selection to get the genes/functions we want. Researchers can identify similarities between enzymatic reactions and chemical reactions in the lab and then repeatedly try to select for genes that show this function. This results in huge libraries of mutated genes, from which highly active variants can be identified.
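The mutate-and-select loop described above can be sketched as a toy simulation. Everything here is an assumption for illustration: the "gene" is a short string, and fitness is simply the number of positions matching a made-up target sequence, standing in for a real activity assay.

```python
import random

ALPHABET = "ACGT"


def fitness(gene, target):
    """Toy stand-in for an activity assay: positions that match the target."""
    return sum(a == b for a, b in zip(gene, target))


def mutate(gene, rate, rng):
    """Replace each base with a random one with probability `rate`."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else base for base in gene
    )


def directed_evolution(start, target, rounds=30, library_size=200,
                       rate=0.05, seed=0):
    """Repeatedly build a mutant library and keep the fittest variant."""
    rng = random.Random(seed)
    best = start
    history = [fitness(best, target)]
    for _ in range(rounds):
        library = [mutate(best, rate, rng) for _ in range(library_size)]
        library.append(best)  # keep the parent so fitness never decreases
        best = max(library, key=lambda gene: fitness(gene, target))
        history.append(fitness(best, target))
    return best, history


best, history = directed_evolution("A" * 12, "ACGTACGTACGT")
print(best, history[0], history[-1])
```

Real campaigns screen for a measured activity rather than a known target sequence, so the scoring function is the part that differs most from practice.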

Enzymes now exist that can perform functions not seen in nature but that are very useful to chemists. It is only a matter of time before these new families of designer enzymes are integrated within engineered metabolic pathways.

Taking sample from biotechnological bio-reactor in microbiological laboratory.

### Does this mean that chemists won’t be needed?

Not at all – in fact, it means the opposite. Chemists offer a unique molecular understanding of biology. They are indispensable to synthetic biology, and the best approach to creating the molecules we need will probably involve the use of tools from both fields. There is a lot of work chemists can do, such as integrating sustainable carbon sources into existing processes, developing biocompatible catalysts or designing new catalytic cofactors for use in artificial metalloenzymes. These are all exciting developments for synthetic biologists – and mean the futures of the two sciences are likely to be even more entwined.

## A role for RNA

A foundational tenet in molecular genetics — its central dogma — was that cellular machinery faithfully transcribes genetic information from a double-stranded DNA template into a single-stranded RNA messenger, which is then translated into a protein. But in the 1980s, a handful of labs noticed that some mRNA transcripts contained altered or extra letters that were not encoded in the DNA. The findings were controversial until scientists uncovered a family of enzymes called adenosine deaminases acting on RNA (ADARs). These proteins bind to RNAs and alter their sequence by changing a familiar base known as adenosine into a molecule called inosine. Although not one of the canonical RNA bases, inosine is read by the cell’s protein-translation machinery as the familiar guanosine. A handful of other RNA-editing enzymes surfaced around the same time.

Scientists have struggled over the past three decades to understand what exactly RNA editing accomplishes. The editors work only on double-stranded RNAs, which sometimes show up in the cell as regulatory elements — or as viruses. Some have speculated that the ADAR proteins evolved as a defence against viruses, but many viruses with double-stranded RNA are unaffected by the enzymes. The editing might serve a regulatory function, but most adult tissues don’t produce the high levels of the proteins required for the editing to occur.


Brenda Bass, a biochemist at the University of Utah in Salt Lake City, was among the first to identify ADARs in frog embryos [2]. She says that no one has found a specific role for the changes made to non-protein-coding RNAs, which account for the majority of edited molecules. The editing could serve to protect double-stranded RNAs from immune attack. Bass suspects that ADARs edit the double-stranded transcripts, adding inosines as a way of telling the body to leave them alone. The enzymes also seem to have a role in embryonic development: mice that lack ADAR genes die before birth or don’t live long after. The editors also seem to have some function in select tissues of adult organisms, such as the nervous system of cephalopods.

It was this activity that drew marine biologist Joshua Rosenthal to RNA editing in the early 2000s. It seems that highly intelligent cephalopods, such as squid, cuttlefish and octopuses, use RNA editing extensively to adjust genes involved in nerve-cell development and signal transmission. No other animals are known to use RNA editing in this way. Inspired by these observations, Rosenthal wondered whether it was possible to use the system to correct the messages produced by dysfunctional genes in a therapeutic setting. In 2013, his group at the University of Puerto Rico in San Juan re-engineered ADAR enzymes and attached them to guide RNAs that would bind to a specific point in an mRNA, creating a double strand. With these, they were able to edit transcripts in frog embryos, and even in human cells in culture [3].

Similar to Stafforst, Rosenthal, now at the Marine Biological Laboratory in Woods Hole, Massachusetts, saw his publication mostly ignored. A similar fate, he learnt, had befallen the work of researchers at a company called Ribozyme, who in 1995 proposed ‘therapeutic editing’ of mutated RNA sequences by inserting complementary sequences into frog embryos and allowing ADARs to edit the resulting double-stranded molecule and correct the mutation [4].

But in the past several years, multiple factors have converged to bring Rosenthal’s and Stafforst’s findings to the fore. Peter Beal, a chemist at the University of California, Davis, says that the 2016 publication [5] of the molecular structure of ADAR bound to double-stranded RNA made the system more understandable and enabled scientists to better engineer the enzyme to enhance its delivery or make it more efficient. And in 2018, the US Food and Drug Administration (FDA) approved the first therapy using RNA interference (RNAi): a technique in which a small piece of RNA is inserted into a cell in which it binds to native mRNAs and hastens their degradation. The approval has opened the door for other therapies that involve mRNA interactions, says Gerard Platenburg, chief innovation officer of ProQR Therapeutics in Leiden, the Netherlands, which is pursuing various RNA-based therapies. “Learning from the past, and with the number of approvals picking up, the field has matured a lot,” says Platenburg.

Many see RNA editing as an important alternative to DNA editing using techniques such as CRISPR. CRISPR technology is improving, but DNA editing can cause unwanted mutations in other parts of the genome — ‘off-target effects’ — which might create new problems.


Rosenthal expects, moreover, that RNA editing will prove useful for diseases without a genetic origin. He is currently using ADARs to edit the mRNA for a gene encoding the sodium channel Nav1.7, which controls how pain signals are transmitted to the brain. Permanently changing the Nav1.7 gene through DNA editing could eliminate the ability to feel pain and disrupt other necessary functions of the protein in the nervous system, but tuning it down through RNA editing in select tissues for a limited amount of time could help to alleviate pain without the risk of dependency or addiction associated with conventional painkillers.

Similarly, RNA editing could allow researchers to mimic genetic variants that provide a health advantage. For example, people with certain mutations in the gene PCSK9, which regulates cholesterol in the bloodstream, tend to have lower cholesterol levels, and modifying PCSK9 mRNA could confer a similar advantage without permanently disrupting the protein’s other functions. Immunologist Nina Papavasiliou of the German Cancer Research Center in Heidelberg says that RNA editing could be used to fight tumours. Some cancers hijack important cell-signalling pathways, such as those involved in cell death or proliferation. If RNA editors could be conscripted to turn off key signalling molecules temporarily, she says, “we could see the tumour die”. Then, the patient could stop the therapy, allowing the pathway to resume its normal functions.

As a treatment, RNA editing might be less likely to cause a potentially dangerous immune reaction than are CRISPR-based approaches. Unlike the DNA-editing enzyme Cas9, which comes from bacteria, ADARs are human proteins that don’t trigger an attack from the immune system. “You really don’t need heavy machinery to target RNA,” says Prashant Mali, a bioengineer at the University of California, San Diego.

In a paper published last year 6 , Mali and his colleagues injected guide RNAs into mice born with a genetic mutation that causes muscular dystrophy. The guide RNAs were designed to trigger production of a missing protein called dystrophin. Although the system edited only a small amount of the RNA encoding dystrophin, it restored the protein to about 5% of its normal level in the animals’ muscle tissue, an amount that has shown therapeutic potential.

Illustration by Joanna Gębal

In other diseases that result from a missing or dysfunctional protein, such as some types of haemophilia, “it makes a huge difference to go from nothing to something”, Stafforst says, and it might not be necessary to edit RNA in every cell in the body. RNA editing might perform better than forms of gene therapy that would involve injecting a new gene. Mali and others say that directing native ADARs to operate on the cell’s own mRNA might provide a more natural response than introducing an external, engineered gene.

RNA-editing technology is far from perfect, however, even when it comes to laboratory applications. “It is early days,” Bass says. “There’s lots of questions.” Because ADARs are much less efficient than CRISPR, they could be less useful for making genetically modified plants and animals. “As a research tool, it’s very limiting,” says Jin Billy Li, a geneticist at Stanford University in California.

Another major disadvantage is that ADARs can make only a few kinds of change to RNA. CRISPR systems act as scissors by cutting DNA at a designated spot and removing or inserting a new sequence; ADARs are more like an overwrite function that changes letters chemically, without breaking the RNA molecule’s ‘backbone’.

Although this process is less likely to cause unintended mutations, it limits the enzymes to making specific changes — adenosine to inosine in the case of ADARs, and cytosine to uridine by a set of enzymes called APOBECs (see ‘The RNA corrections’). There are a few other possibilities. Grape plants, for instance, can change cytidines to uridines, and some tumours can change guanosines to adenosines. “Biodiversity is giving us tons of answers to these things,” Rosenthal says. “I think down the line, things like the squid are going to teach us a lot.” But he says the field is understudied — researchers don’t understand the process that drives this editing. And it remains to be seen whether a plant enzyme, for instance, could function in human cells.
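The ‘overwrite’ idea can be made concrete with a small sketch. It relies on one fact not spelled out in the paragraph above: the cell's translation machinery reads inosine (I) as if it were guanosine (G). The function names, the tiny codon table and the example transcript are all hypothetical, chosen only to show how a single A-to-I edit could turn a stop codon (UAG) into a tryptophan codon (UGG).

```python
# Tiny codon table covering only the codons used in this illustration.
CODONS = {"AUG": "Met", "UAG": "Stop", "UGG": "Trp", "GCU": "Ala"}


def edit_a_to_i(mrna: str, position: int) -> str:
    """Return a copy of the transcript with the adenosine at `position` changed to inosine."""
    if mrna[position] != "A":
        raise ValueError("ADAR-style editing only acts on adenosine (A)")
    return mrna[:position] + "I" + mrna[position + 1:]


def translate(mrna: str) -> list:
    """Translate codon by codon, reading inosine as guanosine, stopping at a stop codon."""
    as_read = mrna.replace("I", "G")
    protein = []
    for i in range(0, len(as_read) - 2, 3):
        amino_acid = CODONS[as_read[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


transcript = "AUGUAGGCU"               # hypothetical mRNA with a premature stop (UAG)
edited = edit_a_to_i(transcript, 4)    # overwrite the A inside the stop codon

print(translate(transcript))  # translation halts at the stop codon
print(translate(edited))      # the edited codon is read as UGG, so translation continues
```

The backbone of the string is never cut or re-joined; one letter is simply rewritten, which is the contrast with the ‘scissors’ model drawn above.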

Scientists are already looking for ways to engineer new enzymes that could expand RNA-editing capabilities. “It’s quite a process where you don’t know what you’ll find,” says Omar Abudayyeh, a biological engineer at the Massachusetts Institute of Technology (MIT) in Cambridge. Working with Feng Zhang, a CRISPR pioneer at MIT, Abudayyeh and his colleagues linked an ADAR enzyme to Cas13 7 . A bacterial enzyme similar to the CRISPR-associated protein Cas9, Cas13 cuts RNA instead of DNA. The researchers altered the sequence of the ADAR until it could convert cytidines to uridines. They then used the new system in human cells to change bases in mRNAs encoded by several genes, including APOE. One naturally occurring genetic variant of this gene is associated with Alzheimer’s disease, and editing it could switch the variant to the harmless form.

Abudayyeh and his MIT collaborator, biological engineer Jonathan Gootenberg, admit it is possible that changing the ADAR protein could cause the immune system to stop recognizing it as a natural human protein and attack cells that contain it. But they say that because these edits are small, this risk pales next to known concerns about the immune system attacking Cas13 or the virus used to deliver the editing tools into cells.

Researchers see promise in a natural process called pseudouridylation, in which a set of protein and RNA enzymes chemically modify the structure of uridines in mRNA. Unlike ADAR modifications, pseudouridylation doesn’t change the sequence of the mRNA or protein. Instead, for reasons that are not entirely clear, the process stabilizes the RNA molecule and causes the translation machinery to ignore signals instructing it to stop making protein.

The ability to turn these molecular red lights into green lights could be powerful. Yi-Tao Yu, a biochemist at the University of Rochester in New York, says that hundreds of genetic diseases are caused by DNA mutations that create incorrect stop signals in mRNAs, resulting in a shortened protein that doesn’t function normally in the body. “The list is very long,” Yu says, and includes cystic fibrosis, the eye disease Hurler’s syndrome and numerous cancers.

Despite its early stage, researchers — and biotech investors — are excited about the wide potential of RNA editing. “I got into it way before it became cool,” says Papavasiliou, who is trying to map where natural ADARs work in the body. “For many years this was a backwater, and all of a sudden there’s a company popping up every two weeks.”

Numerous start-ups and established DNA-editing firms have announced their intention to move into RNA. They include Beam Therapeutics in Boston, Massachusetts, which was co-founded by Zhang and Liu and has been developing CRISPR DNA editing as a therapy for several blood diseases. Locana, based in San Diego, is also pursuing CRISPR-based RNA editing that it hopes could treat conditions including motor-neuron disease and Huntington’s disease.


The challenge for industry is to work out the best way to get the guide RNAs into the cell without triggering an immune reaction or causing the cell to degrade them. Beal says that this could include making strategic chemical modifications to the engineered RNAs that stabilize them, or embedding them in a nanoparticle or virus that can sneak into cells.

And although ADARs are already in human cells, the human body makes only small amounts of them in most tissues, meaning that any therapy might need to add ADARs or other enzymes to boost cells’ editing capabilities. Packing viruses with the genes that encode all the machinery needed for RNA editing might not be efficient. Many hope that it won’t be necessary.

Platenburg hopes to add RNAs and rely on the naturally occurring ADARs to help to correct the lettering of mRNAs that contribute to retinal disorders. “We use the system given to us by nature and harness it,” he says.

Researchers including Stafforst are engineering guide RNAs with chemical modifications that attract ADARs in the cell to the editing site. But some researchers worry that conscripting the natural ADARs into editing specific mRNAs could pull them away from their normal tasks and cause other health problems. Altering gene expression in one part of the body could affect other parts in unforeseen ways. In Mali’s muscular-dystrophy study, for instance, mice developed liver problems for unknown reasons. “It’s a tool in development still,” he says.

“ADAR evolved to allow the body to modify bases in a very targeted fashion,” says Nessan Bermingham, chief executive and a co-founder with Rosenthal and others of biotechnology company Korro Bio in Cambridge, Massachusetts. Bermingham is optimistic about the prospects of RNA editing, but cautious not to get ahead of the biology. “We have a lot of work to do as we start to mature these techniques,” he says. “We’re not leaving anything off the table, but we have to recognize certain limitations.”

Nature 578, 24-27 (2020)

## Junk or Undiscovered?

Making a protein is not as simple as following a recipe from a cookbook. The first step is a process called transcription. This is required, since the machinery that makes proteins can’t read DNA directly. The information coded in DNA is copied onto a new molecule called messenger RNA (mRNA). Like DNA, mRNA has 4 nucleotide bases, but thymine (T) is replaced by uracil (U). Another difference is that mRNA is a single-stranded molecule.

DNA and RNA (Photo Credit : ShadeDesign/Shutterstock)

During transcription, mRNA gets chopped up and rejoined. This is known as RNA splicing. This is done because sections of the gene don’t make “protein sense”; these are called introns. During RNA splicing, these bits are cut out and discarded. You could say that these pieces are lost in transcription!

Mechanism of mRNA splicing (Photo Credit: Udayadaithya M V/Shutterstock)

These discarded non-coding segments baffled scientists for decades. Introns looked like garbled nonsense interrupting genes. Because they had no apparent purpose, many scientists thought they were worthless. In 1972, Susumu Ohno, a geneticist, coined the term ‘junk DNA’ to catchily explain away this DNA waste. At times, it was also called “selfish DNA”, as it seemed to exist solely for itself, without contributing anything to the organism’s survival.

The Central Dogma of Gene Expression (Photo Credit : udaix/Shutterstock)

However, several scientists believed that these large chunks of DNA should not be so hastily labelled as “useless”. If you were reading this article and knew only ten words of the English language, would you think that everything in this article except those ten words was nonsense? In the same way, scientists believed that the function of this so-called ‘junk’ DNA had simply yet to be discovered.

## Why do we still have mitochondrial DNA?

The mitochondrion isn't the bacterium it was in its prime, say two billion years ago. Since getting consumed by our common single-celled ancestor the "energy powerhouse" organelle has lost most of its 2,000+ genes, likely to the nucleus. There are still a handful left--depending on the organism--but the question is why. One explanation, say a mathematician and biologist who analyzed gene loss in mitochondria over evolutionary time, is that mitochondrial DNA is too important to encode inside the nucleus and has thus evolved to resist the damaging environment inside of the mitochondrion. Their study appears February 18 in Cell Systems.

"It's not that the 'lost' genes no longer exist in many cases, it's that the nucleus produces the proteins and the proteins go into the mitochondria, but why bother having anything in the mitochondria when you could have it all in the nucleus?" says co-author Ben Williams, a postdoctoral fellow at the Whitehead Institute for Biomedical Research. "It's like saying you have a central library with all your books in it, but we're going to keep 10 of them off site in a leaky shed."

Despite our long-term relationship with mitochondria, a lot of how our cells and these commensal organelles work together is still mysterious and controversial. We know that acquiring mitochondria may have sparked one of the most important evolutionary events in history by giving the common ancestor of eukaryotes (our kingdom of life) the energy to go multicellular. And we know that each of our cells can possess dozens or hundreds of mitochondria, which are essential for powering everything from our muscles to our brain. But what's strange is that in nearly all multicellular organisms, mitochondria have stayed independent by holding on to a few vital genes--despite the fact it may be safer for the cell to store these genes in the nucleus.

To figure out what makes the few genes in mitochondria so essential, Williams and lead author Iain Johnston, a research fellow at the University of Birmingham, took all of the data generated about mitochondrial genes and threw them into a computer. After a few weeks, with the algorithm Johnston developed, the computer threw back a timeline for mitochondrial gene loss over evolutionary history.

"The hypotheses underlying potential reasons for mitochondria to keep their own genes have been debated for decades, and this is the first data-driven approach to address this question," says Johnston. "It's facilitated by the fact that there are thousands of mitochondrial genomes from across a very wide diverse set of taxa available so now we can harness the data and let it speak for itself."

The analysis revealed that the genes that are retained in the mitochondria are related to building the organelle's internal structure, are otherwise at risk of being misplaced by the cell, and the DNA in these genes use a very ancient pattern that allows the mitochondrial DNA to strongly bond together and resist breaking apart. Williams and Johnston believe this design, not typically found in our own DNA, is likely what keeps the mitochondrial genes from breaking apart during mitochondrial energy production.

As energy is produced within the mitochondria, in the form of ATP, free radicals are emitted--the same free radicals that are a common byproduct of radiation. In essence, the power produced by the mitochondria comes with a certain amount of destruction, and it could be that the mitochondria are capable of withstanding this damage. "You need specialists who can work in this ridiculously extreme environment because the nucleus is not necessarily the best fit," says Williams.

The investigators also observed that the mitochondrial gene loss that's taken place across the eukaryote kingdom has followed the same pattern. This is a lesson that evolution may follow the same path many times over, and it's not always this entirely random process. In the cellular environment, the evolution of mitochondrial gene loss became nearly predictable between different organisms. "If we can harness data on what evolution has done in the past and make predictive statements about where it's going to go next, the possibility for exploring synthetic biology and disease are massive," says Johnston.

Using their algorithm, the duo next plans to explore the reasons for chloroplasts as well as where mitochondrial diseases, which are often quite devastating, fit into this bigger picture. While this study doesn't close the door on why we still have mitochondrial DNA, the authors say it does find a middle ground for many different arguments in the debate.

Cell Systems, Johnston and Williams: "Evolutionary inference across eukaryotes identifies multiple pressures favoring mtDNA gene retention" http://dx.doi.org/10.1016/j.cels.2016.01.013


## Who Are the Leaders in this Field? What Are They Doing?

The scientists at the forefront of the DNA computer revolution are a brilliant breed indeed. It was all started by a professor of Computer Science at USC by the name of Leonard M. Adleman, who used recombinant DNA to solve a small instance of the Hamiltonian path problem, which is classified as NP-complete. His article in the Nov. '94 issue of Science magazine, "Molecular Computation of Solutions to Combinatorial Problems," touched off the current wave of interest in molecular computation. The Hamiltonian path problem, on a large scale, is effectively unsolvable by conventional computer systems (it's theoretically possible, but would take an extremely long time).
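To see why the problem scales so badly on a conventional computer, here is a minimal brute-force sketch. It is not Adleman's procedure: his experiment generated enormous numbers of candidate paths in parallel as DNA strands and then filtered them chemically, whereas this code simply tries every ordering of the vertices one after another. The function name and the four-vertex graph are made up for illustration.

```python
from itertools import permutations


def hamiltonian_paths(n, edges, start, end):
    """Return every path from `start` to `end` that visits each of the n vertices exactly once.

    Vertices are numbered 0..n-1 and `edges` is a list of directed (from, to) pairs.
    """
    edge_set = set(edges)
    middle_vertices = [v for v in range(n) if v not in (start, end)]
    found = []
    # Try every ordering of the intermediate vertices: (n - 2)! candidates in total.
    for middle in permutations(middle_vertices):
        path = (start, *middle, end)
        # Keep the candidate only if every consecutive pair is an edge in the graph.
        if all((a, b) in edge_set for a, b in zip(path, path[1:])):
            found.append(path)
    return found


# Hypothetical directed graph on four vertices.
example_edges = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3), (2, 1)]
print(hamiltonian_paths(4, example_edges, start=0, end=3))
```

With four vertices there are only two orderings to check, but the number of candidates grows as (n - 2)!, so a graph with a few dozen vertices is already far out of reach for this approach.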

His work was picked up by Dr. Donald Beaver, among others, who analyzed the approach and organized it into a highly accessible web page which includes a concise annotated bibliography. One major contributor to this page is the research group of Dr. Richard Lipton, Dan Boneh, and Christopher Dunworth: a professor of Computer Science and two graduate students at Princeton University. They are currently using a DNA Computer to break the government's data encryption standard (DES), as described in the article "Breaking DES Using a Molecular Computer."

## 23.2: Gene regulation: Eukaryotic

As was previously noted, regulation is all about decision making. Gene regulation, as a general topic, is related to making decisions about the functional expression of genetic material. Whether the final product is an RNA species or a protein, the production of the final expressed product requires processes that take multiple steps. We have spent some time discussing some of these steps (i.e. transcription and translation) and some of the mechanisms that nature uses for sensing cellular and environmental information to regulate the initiation of transcription.

When we discussed the concept of strong and weak promoters we introduced the idea that regulating the amount (number of molecules) of transcript that was produced from a promoter in some unit of time might also be important for function. This should not be entirely surprising. For a protein coding gene, the more transcript that is produced, the greater potential there is to make more protein. This might be important in cases where making a lot of a particular enzyme is key for survival. By contrast, in other cases only a little protein is required and making too much would be a waste of cellular resources. In this case low levels of transcription might be preferred. Promoters of differing strengths can accommodate these varying needs. With regards to transcript number, we also briefly mentioned that synthesis is not the only way to regulate abundance. Degradation processes are also important to consider.

In this section, we add to these themes by focusing on eukaryotic regulatory processes. Specifically, we examine - and sometimes re-examine - some of the multiple steps that are required to express genetic material in eukaryotic organisms in the context of regulation. We want you not only to think about the processes but also to recognize that each step in the process of expression is also an opportunity to fine tune not only the abundance of a transcript or protein but also its functional state, form (or variant), and/or stability. Each of these additional factors may also be vitally important to consider for influencing the abundance of conditionally-specific functional variants.

### Structural differences between bacterial and eukaryotic cells influencing gene regulation

The defining hallmark of the eukaryotic cell is the nucleus, a compartment bounded by a double membrane that encloses the cell's hereditary material. In order to efficiently fit the organism's DNA into the confined space of the nucleus, the DNA is first packaged and organized by protein into a structure called chromatin. This packaging of the nuclear material reduces access to specific parts of the chromatin. Indeed, some elements of the DNA are so tightly packed that the transcriptional machinery cannot access regulatory sites like promoters. This means that one of the first sites of transcriptional regulation in eukaryotes must be the control of access to the DNA itself. Chromatin proteins can be subject to enzymatic modification that can influence whether they bind tightly (limited transcriptional access) or more loosely (greater transcriptional access) to a segment of DNA. This process of modification - whichever direction is considered first - is reversible. Therefore DNA can be dynamically sequestered and made available when the "time is right".

The regulation of gene expression in eukaryotes also involves some of the same additional fundamental mechanisms discussed in the module on bacterial regulation (i.e. the use of strong or weak promoters, transcription factors, terminators etc.) but the actual number of proteins involved is typically much greater in eukaryotes than bacteria or archaea.

The post-transcriptional enzymatic processing of RNA that occurs in the nucleus and the export of the mature mRNA to the cytosol are two additional differences between bacterial and eukaryotic gene regulation. We will consider this level of regulation in more detail below.

Depiction of some key differences between the processes of bacterial and eukaryotic gene expression. Note in this case the presence of histone and histone modifiers, the splicing of pre-mRNA, and the export of the mature RNA from the nucleus as key differentiators between the bacterial and eukaryotic systems.
Attribution: Marc T. Facciotti (own work)

### DNA Packing and Epigenetic Markers

The DNA in eukaryotic cells is precisely wound, folded, and compacted into chromosomes so that it will fit into the nucleus. It is also organized so that specific segments of the chromosomes can be easily accessed as needed by the cell. Areas of the chromosomes that are more tightly compacted will be harder for proteins to bind and therefore lead to reduced gene expression of genes encoded in those areas. Regions of the genome that are loosely compacted will be easier for proteins to access, thus increasing the likelihood that the gene will be transcribed. Discussed here are the ways in which cells regulate the density of DNA compaction.

#### DNA packing

The first level of organization, or packing, is the winding of DNA strands around proteins. Histones package and order DNA into structural units called nucleosomes, which can control the access of proteins to specific DNA regions. Under the electron microscope, this winding of DNA around histone proteins to form nucleosomes looks like small beads on a string. These beads (nucleosome complexes) can move along the string (DNA) to alter which areas of the DNA are accessible to transcriptional machinery. While nucleosomes can move to open the chromosome structure to expose a segment of DNA, they do so in a very controlled manner.

DNA is folded around histone proteins to create (a) nucleosome complexes. These nucleosomes control the access of proteins to the underlying DNA. When viewed through an electron microscope (b), the nucleosomes look like beads on a string. (credit “micrograph”: modification of work by Chris Woodcock)

#### Histone Modification

How the histone proteins move is dependent on chemical signals found on both the histone proteins and on the DNA. These chemical signals are chemical tags added to histone proteins and the DNA that tell the histones if a chromosomal region should be "open" or "closed". The figure below depicts modifications to histone proteins and DNA. These tags are not permanent, but may be added or removed as needed. They are chemical modifications (phosphate, methyl, or acetyl groups) that are attached to specific amino acids in the histone proteins or to the nucleotides of the DNA. The tags do not alter the DNA base sequence, but they do alter how tightly wound the DNA is around the histone proteins. DNA is a negatively charged molecule; therefore, changes in the charge of the histone will change how tightly wound the DNA molecule will be. When unmodified, the histone proteins have a large positive charge; by adding chemical modifications like acetyl groups, the charge becomes less positive.

Nucleosomes can slide along DNA. When nucleosomes are spaced closely together (top), transcription factors cannot bind and gene expression is turned off. When the nucleosomes are spaced far apart (bottom), the DNA is exposed. Transcription factors can bind, allowing gene expression to occur. Modifications to the histones and DNA affect nucleosome spacing.

Why do histone proteins normally have a large amount of positive charge (histones contain a high number of lysine amino acids)? Would removal of the positive charges cause a tightening or loosening of the histone-DNA interaction?

Predict the state of the histones in areas of the genome that are transcribed regularly. How do these differ from areas that do not experience high levels of transcription?

#### DNA Modification

The DNA molecule itself can also be modified. This occurs within very specific regions called CpG islands. These are stretches with a high frequency of cytosine and guanine dinucleotide DNA pairs (CG) often found in the promoter regions of genes. When this configuration exists, the cytosine member of the pair can be methylated (a methyl group is added). This modification changes how the DNA interacts with proteins, including the histone proteins that control access to the region. Highly methylated (hypermethylated) DNA regions with deacetylated histones are tightly coiled and transcriptionally inactive.
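One way to make "a high frequency of CG dinucleotides" concrete is to compare the number of CG pairs actually observed in a stretch of DNA with the number expected from its C and G content alone. The sketch below does that. The function names are made up for illustration, and the thresholds (GC content above 50% and an observed/expected ratio above 0.6) are a commonly used heuristic rather than something stated in this text; real CpG-island finders also require a minimum length, which is omitted here.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c_count = seq.count("C")
    g_count = seq.count("G")
    observed_cpg = seq.count("CG")        # "CG" cannot overlap itself, so count() is exact
    gc_content = (c_count + g_count) / n
    expected_cpg = c_count * g_count / n  # expectation if C and G were placed independently
    ratio = observed_cpg / expected_cpg if expected_cpg else 0.0
    return gc_content, ratio


def looks_like_cpg_island(seq, min_gc=0.5, min_ratio=0.6):
    """Heuristic check; the default thresholds are assumptions, not taken from the text."""
    gc_content, ratio = cpg_stats(seq)
    return gc_content > min_gc and ratio > min_ratio


print(cpg_stats("CGCGCGCG"))              # CG-rich toy sequence
print(looks_like_cpg_island("CGCGCGCG"))
print(looks_like_cpg_island("ATATATAT"))  # no C or G at all
```

In practice the same calculation would be run over a sliding window along a promoter region, flagging windows that pass the thresholds as candidate sites for cytosine methylation.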

Epigenetic changes do not result in permanent changes in the DNA sequence. Epigenetic changes alter the chromatin structure (protein-DNA complex) to allow or deny access to transcribe genes. DNA modifications such as methylation of cytosine nucleotides can either recruit repressor proteins that block RNA polymerase's access to transcribe a gene or aid in compacting the DNA to block all protein access to that area of the genome. These changes are reversible, whereas mutations are not. However, epigenetic changes to the chromosome can also be inherited.
Source: modified from https://researcherblogski.wordpress. r/dudiwarsito/

Regulation of gene expression through chromatin remodeling is called epigenetic regulation. Epigenetic means “around genetics.” The changes that occur to the histone proteins and DNA do not alter the nucleotide sequence and are not permanent. Instead, these changes are temporary (although they often persist through multiple rounds of cell division and can be inherited) and alter the chromosomal structure (open or closed) as needed.


### Eukaryotic gene structure and RNA processing

#### Eukaryotic gene structure

Many eukaryotic genes, particularly those encoding protein products, are encoded on the genome discontinuously. That is, the coding region is broken into pieces by intervening non-coding gene elements. The coding regions are termed exons while the intervening non-coding elements are termed introns. The figure below depicts a generic eukaryotic gene.

The parts of a typical discontinuous eukaryotic gene. Attribution: Marc T. Facciotti (own work)

Parts of a generic eukaryotic gene include familiar elements like a promoter and terminator. Between those two elements, the region encoding all of the elements of the gene that have the potential to be translated (they have no stop codons), as in bacterial systems, is called the open reading frame (ORF). Enhancer and/or silencer elements are regions of the DNA that serve to recruit regulatory proteins. These can be relatively close to the promoter, as in bacterial systems, or thousands of nucleotides away. As in many bacterial transcripts, 5' and 3' untranslated regions (UTRs) are also present. These regions of the gene encode segments of the transcript which, as their names imply, are not translated and sit 5' and 3', respectively, to the ORF. The UTRs typically encode some regulatory elements critical for regulating transcription or steps of gene expression that occur post-transcriptionally.
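The idea of an open reading frame can be illustrated with a short scan for a start codon followed by an uninterrupted run of codons up to the first in-frame stop codon. This is a minimal sketch under simplifying assumptions: it looks only at the given strand, works on an already-spliced coding sequence written in DNA letters, and uses the standard start codon ATG and stop codons TAA, TAG and TGA. The function name and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs for ORFs in the three forward reading frames.

    Each ORF begins at an ATG and runs through the first in-frame stop codon;
    `end` is exclusive and includes the stop codon. `min_codons` counts the
    codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                     # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                  # close it and keep scanning this frame
    return orfs


sequence = "GGATGAAACCCTAGTT"                 # hypothetical toy sequence
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

In this toy sequence the bases before the ATG and after the stop codon play the role of the 5' and 3' untranslated regions: they are part of the transcript but fall outside the ORF.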

The RNA species resulting from the transcription of these genes are also discontinuous and must therefore be processed before exiting the nucleus to be translated or used in the cytosol as mature RNAs. In eukaryotic systems this includes RNA splicing, 5' capping, 3' end cleavage and polyadenylation. This series of steps is a complex molecular process that must occur within the closed confines of the nucleus. Each one of these steps provides an opportunity for regulating the abundance of exported transcripts and the functional forms that these transcripts will take. While these would be topics for more advanced courses, think about how to frame some of the following topics as subproblems of the Design Challenge of genetic regulation. If nothing else, begin to appreciate the highly orchestrated molecular dance that must occur to express a gene and how this is a stunning bit of evolutionary engineering.

#### 5' capping

As in bacterial systems, eukaryotic systems must assemble a pre-initiation complex at and around the promoter sequence to initiate transcription. The complexes that assemble in eukaryotes serve many of the same functions as those in bacterial systems but they are significantly more complex, involving many more regulatory proteins. This added complexity allows for a greater degree of regulation and for the assembly of proteins with functions that occur predominantly in eukaryotic systems. One of these additional functions is the "capping" of nascent transcripts.

In eukaryotic protein coding genes, the RNA that is first produced is called the pre-mRNA. The "pre" prefix signifies that this is not the full mature mRNA that will be translated and that it first requires some processing. The modification known as 5'-capping occurs after the pre-mRNA is about 20-30 nucleotides in length. At this point the pre-mRNA typically receives its first modification, a 5'-cap. The "cap" is a chemical modification - a 7-methylguanosine - whose addition to the 5' end of the transcript is catalyzed by the capping enzyme complex (CEC), a group of multiple enzymes that carry out the sequential steps involved in adding the 5'-cap. The CEC binds to the RNA polymerase very early in transcription and carries out a modification of the 5' triphosphate, the subsequent transfer of a GTP to this end (connecting the two nucleotides using a unique 5'-to-5' linkage), the methylation of the newly transferred guanine, and in some transcripts additional modifications to the first few nucleotides. This 5'-cap appears to function by protecting the emerging transcript from degradation and is quickly bound by RNA binding proteins known as the cap-binding complex (CBC). There is some evidence that this modification and the proteins bound to it play a role in targeting the transcript for export from the nucleus. Protecting the nascent RNA from degradation is not only important for conserving the energy invested in creating the transcript but is clearly involved in regulating the abundance of fully-functional transcript that is produced. Moreover, the role of the 5'-cap in guiding the transcript for export will directly help to regulate not only the amount of transcript that is made but, perhaps more importantly, the amount of transcript that is exported to the cytoplasm that has the potential to be translated.

The structure of a typical 7-methylguanylate cap. Attribution: Marc T. Facciotti (own work)

#### Transcript splicing

Nascent transcripts must be processed into mature RNAs by joining exons and removing the intervening introns. This is accomplished by a multicomponent complex of RNA and proteins called the spliceosome. The spliceosome complex assembles on the nascent transcript, and in many cases the decisions about which exons to combine into a mature transcript are made at this point. How these decisions are made is still not completely understood, but the process involves the recognition of specific RNA sequences at the splice sites by RNA and protein species and several catalytic events. It is interesting to note that the catalytic portion of the spliceosome is made of RNA rather than protein. Recall that the ribosome is another example of an RNA-protein complex where the RNA serves as the primary catalytic component. The selection of which splice variant to make is a form of regulating gene expression. In this case, rather than simply influencing the abundance of a transcript, alternative splicing allows the cell to make decisions about which form of the transcript is made.

The alternative splice forms of genes that result in protein products of related structure but of varying function are known as isoforms. The creation of isoforms is common in eukaryotic systems and is known to be important in different stages of development in multicellular organisms and in defining the functions of different cell types. Encoding multiple possible gene products in a single gene whose transcription is initiated from a single transcriptional regulatory site (and deciding post-transcriptionally which end-product to produce) obviates the need to create and maintain independent copies of each gene in different parts of the genome and to evolve independent regulatory sites. Therefore, the ability to form multiple isoforms from a single coding region is thought to be evolutionarily advantageous because it enables some efficiency in DNA coding, minimizes transcriptional regulatory complexity, and may lower the energy burden of maintaining more DNA and protecting it from mutation. Some examples of possible outcomes of alternative splicing include: the generation of enzyme variants with differential substrate affinity or catalytic rates; changes to the signal sequences that target proteins to various sub-cellular compartments; and the creation of entirely new functions via the swapping of protein domains. These are just a few examples.

One additional interesting possible outcome of alternative splicing is the introduction of stop codons that can, through a mechanism that seems to require translation, lead to the targeted decay of the transcript. This means that, in addition to the control of transcription initiation and 5'-capping, alternative splicing can also be considered one of the regulatory mechanisms that may influence transcript abundance. The effects of alternative splicing are therefore potentially broad - from complete loss of function to novel and diversified function to regulatory effects.
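As a rough illustration of the two ideas above (one gene yielding several isoforms, and an alternatively included exon introducing an early stop codon), here is a minimal Python sketch. The exon names and sequences are invented for the example, and the check for a premature stop codon is only a stand-in for the cell's translation-dependent decay mechanism, not a model of it.

```python
# Hypothetical exons of a toy gene (sequences are made up for illustration).
EXONS = {
    "E1": "AUGGCU",
    "E2": "GAAUAG",      # optional exon that happens to carry an in-frame UAG
    "E3": "CCGUUUUAA",
}

STOP_CODONS = {"UAA", "UAG", "UGA"}


def splice(exon_names):
    """Join the chosen exons, in order, into one mature transcript."""
    return "".join(EXONS[name] for name in exon_names)


def first_stop_index(mrna):
    """Return the index of the first in-frame stop codon, or None."""
    for i in range(0, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return i
    return None


def has_premature_stop(mrna):
    """True if an in-frame stop codon occurs before the final codon."""
    stop = first_stop_index(mrna)
    return stop is not None and stop < len(mrna) - 3


# Two splice variants of the same toy gene.
skipping_isoform = splice(["E1", "E3"])
inclusion_isoform = splice(["E1", "E2", "E3"])

for name, mrna in [("skip E2", skipping_isoform), ("include E2", inclusion_isoform)]:
    print(name, mrna, "premature stop:", has_premature_stop(mrna))
```

In this toy case, including the optional exon places a stop codon in the middle of the reading frame, which is the kind of transcript the text describes as a candidate for targeted decay.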

A figure depicting some of the different modes of alternative splicing illustrating how different splice variants can lead to different protein forms.
Attribution: Marc T. Facciotti (own work)

#### 3' end cleavage and polyadenylation

One final modification is made to nascent pre-mRNAs before they leave the nucleus - the cleavage of the 3' end and its polyadenylation. This two-step process is catalyzed by two different enzymes (as depicted below) and may decorate the 3' end of transcripts with up to nearly 200 adenine nucleotides. This modification enhances the stability of the transcript. Generally, the more As in the polyA tail, the longer the lifetime of the transcript. The polyA tail also seems to play a role in the export of the transcript from the nucleus. Therefore, the 3' polyA tail plays a role in gene expression by regulating functional transcript abundance and how much is exported from the nucleus for translation.

A two-step process is involved in modifying the 3' ends of transcripts prior to nuclear export. The steps are cutting the transcript just downstream of a conserved sequence (AAUAAA) and transferring adenylate groups. Both steps are enzymatically catalyzed.
Attribution: Marc T. Facciotti (own work)
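The two steps described above can be sketched in a few lines of Python. This is only a toy model: the function name, the `cut_offset` parameter (how far downstream of AAUAAA the cut is made), and the example sequence are assumptions chosen for illustration, not values taken from the text.

```python
POLYA_SIGNAL = "AAUAAA"  # conserved sequence named in the text


def cleave_and_polyadenylate(pre_mrna, cut_offset=15, tail_length=200):
    """Cut downstream of the first AAUAAA and append a run of As.

    cut_offset and tail_length are illustrative assumptions; the text only
    says the cut is "just downstream" and the tail is "up to nearly 200".
    """
    signal_pos = pre_mrna.find(POLYA_SIGNAL)
    if signal_pos == -1:
        # No signal found: in this toy model the transcript is left unchanged.
        return pre_mrna
    # Step 1: cleavage a fixed distance past the end of the signal.
    cut_site = min(signal_pos + len(POLYA_SIGNAL) + cut_offset, len(pre_mrna))
    cleaved = pre_mrna[:cut_site]
    # Step 2: polyadenylation of the new 3' end.
    return cleaved + "A" * tail_length


example = "GGCC" + "AAUAAA" + "CUGCUG" + "UUUUUU"
print(cleave_and_polyadenylate(example, cut_offset=6, tail_length=5))
```

Separating the cut from the tail addition mirrors the statement that two different enzymes carry out the two steps.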

### MicroRNAs

#### RNA Stability and microRNAs

In addition to the modifications of the pre-RNA described above and the associated proteins that bind to the nascent transcripts, there are other factors that can influence the stability of the RNA in the cell. One example is a class of elements called microRNAs. The microRNAs, or miRNAs, are short RNA molecules that are only 21–24 nucleotides in length. The miRNAs are transcribed in the nucleus as longer pre-miRNAs. These pre-miRNAs are subsequently chopped into mature miRNAs by a protein called Dicer. These mature miRNAs recognize a specific sequence of a target RNA through complementary base pairing. miRNAs, however, also associate with a ribonucleoprotein complex called the RNA-induced silencing complex (RISC). RISC binds a target mRNA, along with the miRNA, to degrade the target mRNA. Together, miRNAs and the RISC complex rapidly destroy the RNA molecule. As one might expect, the transcription of pre-miRNAs and their subsequent processing is also tightly regulated.
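Recognition "through complementary base pairing" can be made concrete with a short Python sketch. The sequences and function names below are hypothetical, and the sketch assumes perfect, full-length complementarity between the miRNA and its target site, which is a simplification made only to keep the example small.

```python
# Watson-Crick pairing rules for RNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the sequence that would base-pair with `rna` (read 5' to 3')."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_target_sites(mirna, mrna):
    """Return start positions in `mrna` that are fully complementary to `mirna`."""
    site = reverse_complement(mirna)
    positions = []
    start = mrna.find(site)
    while start != -1:
        positions.append(start)
        start = mrna.find(site, start + 1)
    return positions


mirna = "UAGCUUAUC"                        # made-up short miRNA
mrna = "AUGCC" + "GAUAAGCUA" + "GGUAA"     # made-up mRNA containing one site
print(find_target_sites(mirna, mrna))
```

An mRNA with no matching site returns an empty list, which in this toy picture corresponds to a transcript that this particular miRNA would not mark for degradation.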

### Nuclear export

#### Nuclear export

Fully processed, mature transcripts must be exported from the nucleus. Not surprisingly, this process involves the coordination of a mature RNA species to which are bound many accessory proteins - some of which have been intimately involved in the modifications discussed above - and a protein complex called the nuclear pore complex (NPC). Transport through the NPC allows proteins and RNA species to move in both directions and is mediated by a number of proteins. This process can be used to selectively regulate the transport of various transcripts depending on which proteins associate with the transcript in question. This means that not all transcripts are treated equally by the NPC - depending on its modification state and the proteins that have associated with it, a specific species of RNA can be moved either more or less efficiently across the nuclear membrane. Since the rate of movement across the pore will influence the abundance of mature transcript that is exported into the cytosol for translation, export control is another example of a step in the process of gene regulation that can be modulated. In addition, recent research has implicated interactions between the NPC and transcription factors in the regulation of transcription initiation, likely through some mechanism whereby the transcription factors tether themselves to the nuclear pores. This last example demonstrates how interconnected the regulation of gene expression is across the multiple steps of this complex process.

Many additional details of the processes described above are known to some level of detail, but many more questions remain to be answered. For the sake of Bis2a it is sufficient to begin forming a model of the steps that occur in the production of a mature transcript in eukaryotic organisms. We have painted a picture with very broad strokes, trying to present a scene that reflects what happens generally in all eukaryotes. In addition to learning the key differentiating features of eukaryotic gene regulation, we would also like Bis2a students to begin thinking of each of these steps as an opportunity for Nature to regulate gene expression in some way and to be able to rationalize how deficiencies or changes in these pathways - potentially introduced through mutation - might influence gene expression.

While we did not explicitly bring up the Design Challenge or Energy Story here, these formalisms are equally adept at helping you to make some sense of what is being described. We encourage you to try making an Energy Story for various processes. We also encourage you to use the Design Challenge rubric to reexamine the stories above: identify problems that need solving, and hypothesize potential solutions and criteria for success. Using these formalisms to dig deeper and ask new questions or identify new problems or things that you don't know about the processes is what experts do. Chances are that doing this suggested exercise will lead you to identify a direction of research that someone has already pursued (you'll feel pretty smart about that!). Alternatively, you may raise some brand new question that no one has thought of yet.

### Control of Protein Abundance

After an mRNA has been transported to the cytoplasm, it is translated into protein. Control of this process is largely dependent on the RNA molecule. As previously discussed, the stability of the RNA will have a large impact on its translation into a protein. As the stability changes, the amount of time that it is available for translation also changes.

#### The initiation complex and translation rate

Like transcription, translation is controlled by complexes of proteins and nucleic acids that must associate to initiate the process. In translation, one of the first complexes that must assemble to start the process is referred to as the initiation complex. The first protein to bind to the mRNA that helps initiate translation is called eukaryotic initiation factor-2 (eIF-2). Activity of the eIF-2 protein is controlled by multiple factors. The first is whether or not it is bound to a molecule of GTP. When eIF-2 is bound to GTP it is considered to be in an active form. The eIF-2 protein bound to GTP can bind to the small 40S ribosomal subunit. When bound, the eIF-2/40S ribosome complex, bringing with it the mRNA to be translated, also recruits the methionine initiator tRNA. At this point, when the initiation complex is assembled, the GTP is hydrolyzed into GDP, creating an "inactive" form of eIF-2 that is released, along with the inorganic phosphate, from the complex. This step, in turn, allows the large 60S ribosomal subunit to bind and to begin translating the RNA. The binding of eIF-2 to the RNA is further controlled by protein phosphorylation. When eIF-2 is phosphorylated, it undergoes a conformational change and cannot bind to GTP, thus preventing the initiation complex from forming - translation is therefore inhibited (see the figure below). In the dephosphorylated state eIF-2 can bind GTP and allow the assembly of the translation initiation complex as described above. The ability of the cell to tune the assembly of the translation initiation complex via a reversible chemical modification (phosphorylation) of a regulatory protein is another example of how Nature has taken advantage of even this seemingly simple step to tune gene expression.
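The order of events in the paragraph above can be written out as a small Python sketch. It is a bookkeeping aid only: the function name and the step wording are assumptions made for this example, and the single on/off switch ignores the fact that cells contain many eIF-2 molecules in a mix of states.

```python
def assemble_initiation_complex(eif2_phosphorylated):
    """Walk through the steps in the text for a single eIF-2 molecule.

    Returns (translation_starts, steps), where steps is the list of events
    that occurred, in order.
    """
    steps = []
    if eif2_phosphorylated:
        # Phosphorylation blocks GTP binding, so assembly stops here.
        steps.append("phosphorylated eIF-2 cannot bind GTP")
        steps.append("initiation complex does not form; translation is inhibited")
        return False, steps
    steps.append("eIF-2 binds GTP (active form)")
    steps.append("eIF-2/GTP binds the small 40S ribosomal subunit")
    steps.append("mRNA and the methionine initiator tRNA are recruited")
    steps.append("GTP is hydrolyzed; eIF-2/GDP and phosphate are released")
    steps.append("large 60S subunit binds; translation begins")
    return True, steps


for state in (False, True):
    started, events = assemble_initiation_complex(state)
    print("phosphorylated:", state, "-> translation starts:", started)
    for event in events:
        print("  ", event)
```

Tracing both branches makes the regulatory point of the text explicit: a single reversible modification decides whether the remaining assembly steps happen at all.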

An increase in phosphorylation levels of eIF-2 has been observed in patients with neurodegenerative diseases such as Alzheimer’s, Parkinson’s, and Huntington’s. What impact do you think this might have on protein synthesis?

#### Chemical Modifications, Protein Activity, and Longevity

Not to be outdone by nucleic acids, proteins can also be chemically modified with the addition of groups including methyl, phosphate, acetyl, and ubiquitin groups. The addition or removal of these groups from proteins can regulate their activity or the length of time they exist in the cell. Sometimes these modifications can regulate where a protein is found in the cell (for example, in the nucleus, the cytoplasm, or attached to the plasma membrane).

Chemical modifications can occur in response to external stimuli such as stress, the lack of nutrients, heat, or ultraviolet light exposure. In addition to regulating the function of the proteins themselves, if these changes occur on specific proteins they can alter epigenetic accessibility (in the case of histone modification), transcription (transcription factors), mRNA stability (RNA binding proteins), or translation (eIF-2) thus feeding back and regulating various parts of the process of gene expression. In the case of modification to regulatory proteins, this can be an efficient way for the cell to rapidly change the levels of specific proteins in response to the environment by regulating various steps in the process.

The addition of a ubiquitin group has another function - it marks that protein for degradation. Ubiquitin is a small protein that acts like a flag indicating that the tagged proteins should be targeted to the proteasome, a large multi-protein complex that functions to cleave proteins into smaller pieces that can then be recycled. Ubiquitination (the addition of a ubiquitin tag) therefore helps to control gene expression by altering the functional lifetime of the protein product.

Proteins with ubiquitin tags are marked for degradation within the proteasome.

## Key Concepts and Summary

• The entire genetic content of a cell is its genome.
• Genes code for proteins, or stable RNA molecules, each of which carries out a specific function in the cell.
• Although the genotype that a cell possesses remains constant, expression of genes is dependent on environmental conditions.
• A phenotype is the observable characteristics of a cell (or organism) at a given point in time and results from the complement of genes currently being used.
• The majority of genetic material is organized into chromosomes that contain the DNA that controls cellular activities.
• Prokaryotes are typically haploid, usually having a single circular chromosome found in the nucleoid. Eukaryotes are diploid; their DNA is organized into multiple linear chromosomes found in the nucleus.
• Supercoiling and DNA packaging using DNA binding proteins allows lengthy molecules to fit inside a cell. Eukaryotes and archaea use histone proteins, and bacteria use different proteins with similar function.
• Prokaryotic and eukaryotic genomes both contain noncoding DNA, the function of which is not well understood. Some noncoding DNA appears to participate in the formation of small noncoding RNA molecules that influence gene expression; some appears to play a role in maintaining chromosomal structure and in DNA packaging.
• Extrachromosomal DNA in eukaryotes includes the chromosomes found within organelles of prokaryotic origin (mitochondria and chloroplasts) that evolved by endosymbiosis. Some viruses may also maintain themselves extrachromosomally.
• Extrachromosomal DNA in prokaryotes is commonly maintained as plasmids that encode a few nonessential genes that may be helpful under specific conditions. Plasmids can be spread through a bacterial community by horizontal gene transfer.
• Viral genomes show extensive variation and may be composed of either RNA or DNA, and may be either double or single stranded.

## Why is almost everything in nature symmetrical?

The evolutionary explanation is that being asymmetrical rarely confers an advantage. If your left leg were longer than your right leg, for example, you would walk with a limp. It is also a “natural” aesthetic that we find symmetrical faces more beautiful than asymmetrical ones. After all, asymmetry may be caused by an injury or infection in part of the face. People with symmetrical faces therefore have more choice of partners and so have more and/or healthier children. The same applies to many other animals.

In addition, if symmetry has no major disadvantages, it is the default because it is genetically and developmentally cheaper to build a symmetrical body. To encode the recipe for building a symmetrical body in DNA, you only have to describe half of the body. An asymmetrical body would thus require a larger genome, which means more DNA to produce and a greater chance of something going wrong.

Where asymmetry has a great advantage, for example in the structure of the heart and the digestive system, we are indeed asymmetrical. But there must be a good reason. Machine builders, furniture makers, and so on also choose symmetry unless there is a good reason not to, for much the same reasons.

Yet the above argument is not fully convincing, and we need to think more deeply about it. There is a very interesting book on this subject: Right Hand, Left Hand. The book explains why it is very difficult to describe an asymmetrical body in DNA. It starts with this thought experiment:

Imagine coming into radio contact with an alien civilization. You talk to them about things you have in common, like math and raw materials. There is, however, one rule: you may not refer to things that the extraterrestrials can also see, such as specific constellations. You may refer only to universal knowledge.

You can explain to the extraterrestrials what you mean by “future” and “past”. “Up” and “down” also work. But “left” and “right”? You may be able to explain what the left/right distinction means, but you cannot tell them which one is “left” and which is “right”. You could name one of them “glubs” and the other “zluchts”, but you cannot explain which of the two is left. If you send a building plan of an asymmetrical object to the extraterrestrials, you have no guarantee that they will build it as you intended. They may well build its mirror image.

A developing embryo has the same problem. A symmetrical embryo can build asymmetric structures, but for each component there is a 50% chance of getting the mirror image. An asymmetrical body therefore carries a very high risk of birth defects.

The book then raises the question: how can it be that we still have asymmetric organs, and indeed that there is not a 50/50 chance of being left-handed, or of having the heart on the right side? Very early in evolution, D-glucose was “chosen” over L-glucose. This affects the direction in which DNA twists, and many other biochemical structures. On another planet it could be the other way around. Or maybe not. There are indications in particle physics that there is a subtle difference between left and right. Whether this difference is too subtle to influence the evolutionary “choice” between left- and right-handed molecules is not entirely clear, at least not when the book was written.