The Poetry Fan Who Taught an LLM to Read and Write DNA
15th February 2025
DNA is often compared to a written language. The metaphor leaps out: Like letters of the alphabet, molecules (the nucleotide bases A, T, C and G, for adenine, thymine, cytosine and guanine) are arranged into sequences — words, paragraphs, chapters, perhaps — in every organism, from bacteria to humans. Like a language, they encode information. But humans can’t easily read or interpret these instructions for life. We cannot, at a glance, tell the difference between a DNA sequence that functions in an organism and a random string of A’s, T’s, C’s and G’s.
“It’s really hard for humans to understand biological sequence,” said the computer scientist Brian Hie, who heads the Laboratory of Evolutionary Design at Stanford University, based at the nonprofit Arc Institute. This was the impetus behind his new invention, named Evo: a genomic large language model (LLM), which he describes as ChatGPT for DNA.