More Than Just A, T, C, and G: Screening for Hidden Dangers in DNA Sequences
post by sgd · 2025-04-21T20:12:21.522Z
Introduction
Synthetic biology has revolutionised the way modern biosciences are done, impacting every field within the subject. By engineering organisms in predictable ways, it is possible to create novel biological solutions using genetic components from other species, and those developed in other ways, to give specific characteristics to the species of interest. Because recent developments in nucleic acid synthesis allow sequences (such as DNA and RNA) to be produced ‘to order’, it is now possible to engineer a wide variety of species in ways previously not thought to be possible. Synthetic biology is a growing field, and is frequently described as being at the inflection point computing was at in the 1950s.
The advent of machine learning and artificial intelligence will accelerate this work, allowing DNA sequences to be computationally engineered ‘in silico’ and then synthesised. By automating highly repetitive and time-consuming lab tasks, organisms can be engineered at a faster pace and to a greater extent.
With all good technologies come drawbacks. One issue frequently raised with synthetic biology is dual-use, where a technology developed for good (for example, making wheat more drought resistant) can also be applied to a malicious use (such as developing a more virulent form of flu). Bad actors could misuse the technology of today to develop more virulent versions of pathogens like smallpox, malaria, or measles, evading existing immunity, spreading more quickly, and inflicting significant economic and societal harm. Or with the technology of tomorrow, completely novel, never before seen pathogens could be developed, wreaking havoc on society.
Combating the risks of synthetic biology will be an important process in the coming years, helping to make the field safe as more people get trained in the techniques, and general understanding becomes more widespread.
One way of reducing the risk of synthetic biology is by screening nucleic acid orders, and this is what I’ll focus on here. Scientists can order synthetic nucleic acids (for example, sequences engineered with computational techniques) and use them in their engineered organisms. They can do this through contractors that will send you DNA made from a sequence you provide, or by using a piece of equipment called a ‘benchtop synthesiser’ found in some research labs. Screening these orders against libraries of known harmful DNA sequences is now widespread, with around 80% of DNA synthesis orders being screened. Of course, this isn’t enough, and every order should be screened to prevent malicious actors from accessing dangerous DNA.
Over the past few years, machine learning and artificial intelligence have become more advanced, allowing proteins to be engineered for improved biological functions. More recently, it has become possible to develop completely new proteins that have a desired function. These brand new proteins are completely unseen, and will therefore not be recognised by the sequence screening algorithms used by synthesis companies today. Here, I look at how sequence analysis works today, and the possible mechanisms that could be introduced in the future to mitigate the risk of harmful DNA being ordered.
What is this post about? This post aims to explain sequence analysis, its shortcomings, and ways of improving its resilience in a way that is understandable to people that don’t have a specialist background in synthetic biology. Originally, this doc was intended for policy makers, but it should be accessible and useful for anyone interested in preventing engineered pandemics and making synthetic biology more secure.
Me? I'm a biochemist with a few years of experience in things like synthetic biology and bioinformatics.
How Current Sequence Analysis Algorithms Work
Sequence analysis encompasses algorithms analysing the risk of DNA and RNA sequences, allowing orders for concerning sequences to be flagged. Once flagged, these orders can be manually analysed, the customer contacted, and the risk determined. If the risk is found to be low, the provider can send the order out to the customer. However, high risk sequences may need further analysis, and this can involve law enforcement agencies (such as the FBI or Metropolitan Police) to make a decision. Here, I’ll focus on the initial step: analysing the sequence and deciding which orders to flag.
There are a number of different sequence analysis algorithms available today:
- SecureDNA (NAO)
- Common Mechanism (NTI / IBBIS)
- Battelle UltraSeq
- Aclid
- FAST-NA scanner
These all work in a similar way, cutting the sequence into short sections of around 30-50 nucleotides, or letters, and searching these against a proprietary database containing sequences of concern. Cutting the sequence into short sections makes searching databases much faster, but if the sections are too short, then the chance of getting false positives increases. 30-50 nucleotides is generally regarded as being a good trade-off between speed and accuracy.
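The windowing step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's implementation: the function names (`windows`, `build_index`, `screen_order`), the default window length of 40, and exact matching against an in-memory set are all choices made for the example.

```python
def windows(sequence: str, k: int = 40):
    """Yield (start, window) for every overlapping window of length k."""
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        yield start, seq[start:start + k]


def build_index(sequences_of_concern, k: int = 40) -> set:
    """Pre-compute every k-length window found in the hazard database."""
    index = set()
    for hazard in sequences_of_concern:
        for _, window in windows(hazard, k):
            index.add(window)
    return index


def screen_order(order: str, index: set, k: int = 40) -> list:
    """Return the start positions in the order whose window is in the index."""
    return [start for start, window in windows(order, k) if window in index]


# Toy usage with made-up strings and a short window so it is easy to follow.
toy_hazard = "ACGTTGCAAGCTTAGGCTAACG"  # hypothetical, not a real sequence of concern
toy_index = build_index([toy_hazard], k=8)
toy_order = "AAAA" + toy_hazard[5:15] + "CCCC"
flagged_positions = screen_order(toy_order, toy_index, k=8)
```

Looking up each window in a set takes roughly constant time, which is why windowing keeps the search fast. The trade-off from the paragraph above shows up in `k`: a very small `k` makes chance matches against harmless sequences far more likely.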
If an ordered sequence has a match in the database of concerning sequences, it’s flagged to the synthesis company, which then assesses the risk of the order. If they decide that the flag is a false positive, the order can be released. But this can be a challenging decision to make, especially where the sequence has no match against known organisms (e.g., engineered sequences, which may have used machine learning to develop a completely novel sequence).
Secure DNA
Secure DNA has developed a way of screening DNA orders by ‘hashing’ sequences (a one-way cryptographic transformation, loosely similar to encryption). Some customers may be concerned about their intellectual property, so this method ensures Secure DNA cannot see the orders being screened by their system. Once an order has been submitted, it is searched against their database of hazardous sequences, and a flag is raised if the sequence is similar to one in the database.
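To make the idea of screening without seeing the order more concrete, here is a toy sketch in which both the order and the hazard database are reduced to keyed hashes before they are compared. This is not Secure DNA's actual protocol, which this post does not describe; the shared key, the use of HMAC-SHA256, and all function names are assumptions made for illustration.

```python
import hashlib
import hmac


def hash_window(window: str, key: bytes) -> str:
    """One-way keyed hash of a single window (HMAC-SHA256, hex encoded)."""
    return hmac.new(key, window.upper().encode("ascii"), hashlib.sha256).hexdigest()


def hashed_windows(sequence: str, key: bytes, k: int = 40) -> list:
    """Hash every overlapping k-length window of a sequence."""
    seq = sequence.upper()
    return [hash_window(seq[i:i + k], key) for i in range(len(seq) - k + 1)]


def build_hashed_database(sequences_of_concern, key: bytes, k: int = 40) -> set:
    """The screener stores only hashes of hazard windows, never plain text."""
    database = set()
    for hazard in sequences_of_concern:
        database.update(hashed_windows(hazard, key, k))
    return database


def count_hashed_matches(order_hashes, hashed_database: set) -> int:
    """Count how many of the customer's hashed windows appear in the database."""
    return sum(1 for h in order_hashes if h in hashed_database)
```

In this sketch the customer would run `hashed_windows` locally and send only the resulting hashes, so the screener compares opaque strings. A toy like this has an obvious weakness: anyone holding the key could hash guessed sequences and compare them, so a production system would need a stronger protocol.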
The Common Mechanism
The Common Mechanism, developed by IBBIS, combines two types of sequence analysis algorithms to determine the risk of a given sequence: (1) searching sections of sequences against databases of known concerning sequences, and (2) using hidden Markov models, a form of AI / machine learning, to identify ‘functional homologues’. A functional homologue is a sequence that probably has the same function as a different sequence, but wouldn’t cause a match in the first searching strategy.
This mechanism also describes screening the customer — perhaps the most important part of this. When a customer sets up an account with a synthesis company, checks are carried out to ensure that the customer is a proper entity, has any relevant licenses, and the details match official records.
No synthesis companies have adopted the Common Mechanism as of Winter 2024, although a small number of suppliers is testing it. This is due to an absence of government regulation requiring the screening described by the Common Mechanism, and the added costs it imposes on a fairly low-margin business.
Battelle UltraSeq
This product, from the non-profit Battelle, was originally developed to analyse metagenomic sequencing data for pathogenicity factors (essentially bits of DNA that correspond to a part of a pathogen that causes the infection). This would only be effective against known pathogens/pathogenicity factors. Risk ratings can be assigned to sequences, allowing a threshold to be applied before the sequence is flagged.
Aclid
Aclid has developed a system similar to the Common Mechanism, combining screening of nucleic acid orders with customer checks. Synthesis screening searches nucleic acid orders for matches against databases of known sequences of concern and assigns a risk rating, which determines whether an order should be flagged. Customer screening evaluates the licenses and certifications an organisation holds, as well as its biosecurity policies and whether its intended use is legitimate.
Aclid’s system is in use with synthesis companies (both startups and enterprises), as well as government organisations.
FAST-NA Scanner
FAST-NA is a screening algorithm developed by RTX BBN with IDT. It searches sequences against databases of sequences of concern, along with records of known false positives and false negatives, to assign a risk rating. It can also screen protein sequences.
Issues with Current Sequence Screening Tools
The current tools, from IBBIS, Secure DNA, and others, all use databases of known hazardous sequences. Before the advent of cheap high-performance computing, it was only possible to use existing sequences - developing new sequences in silico (on a computer instead of in the lab) was not possible. This made creating unseen sequences much more challenging, and created a technological hurdle that only universities and research institutes were able to surmount.
However, computing power is now abundant and cheap, making it easy for almost anyone to use biodesign tools like AlphaFold and LigandMPNN to develop potentially hazardous sequences. These sequences will not appear in databases of hazardous sequences, so the existing database-matching approaches cannot be effective against them.
Some tools are unable to identify functional homologues, including Secure DNA and other software that evaluates sequence similarity.
Testing and Improving Algorithms
Algorithms used for sequence analysis can be tested through a process called penetration testing or ‘red teaming’. By identifying where the algorithms fall short and how they can be circumvented, safeguards can be implemented to prevent malicious actors from misusing DNA synthesis.
One study, conducted by researchers at Microsoft, IBBIS, IDT, Twist Bioscience, RTX, Aclid, and Battelle (essentially all the big organisations involved in DNA synthesis), looked at ways to circumvent sequence analysis algorithms. In work published in 2024, these researchers used modern methods (including machine learning/AI) to find weaknesses in the screening algorithms used by the participating organisations. ‘Patches’, ways of updating the screening system (similar to software updates), were applied to the algorithms where weaknesses were identified, reducing the likelihood of breaches occurring from the exploits that were found during this study.
Vulnerability Studies
An earlier study, carried out by researchers at Secure DNA, involved placing orders for sections of a pandemic influenza virus genome, and of the genome of a close relative of smallpox. The researchers succeeded in ordering these sequences with pandemic potential, exposing vulnerabilities in the screening algorithms used by the synthesis companies (these were sequences that should have been flagged as ‘of concern’ when the order was placed).
This isn’t an issue that has only come about since cheap computing power made computational design of proteins and nucleic acids possible. Journalists at the Guardian ordered sections of the smallpox genome in 2006, successfully passing the synthesis company’s screening system.
As with all computer systems, it’s important to assess the vulnerabilities of sequence analysis software. Once a vulnerability has been identified, its risk can be determined, and a solution can be implemented to reduce that risk.
How can unseen sequences be screened?
Artificial intelligence (AI) has become much more prevalent in bioscience research since the development of AlphaFold. By allowing researchers to develop brand new (or de novo) proteins, new drugs can be created, for example. In the coming years, this will likely lead to more personalised medicine, improved cancer treatments, and better clinical outcomes for patients who previously had a poor prognosis.
However, these sequences have never been seen before by anyone but the researchers. This gives the synthesis company several challenges:
- How can the risk of that sequence be determined, if it gets no matches against databases of known harmless / harmful sequences?
- How is the risk of shorter sections of sequence evaluated? For example, an order for a complete genome may be split across several synthesis companies. A short section of that genome may be harmless, but the complete sequence, once assembled in the lab, may become much more dangerous.
Codon[1] switching (reassigning which amino acid a codon codes for in an engineered organism) can be used to change the protein coded by the DNA / RNA ordered by the researchers. This could prevent sequence analysis algorithms from finding a match in existing databases, because those databases assume the codons shared by (almost) all known life.
One method to determine functional homologues (sequences that ultimately produce a protein that has the same biological function) is the Hidden Markov Model (HMM). This is a type of machine learning, allowing proteins with the same underlying biological function to be identified. By translating (converting the DNA or RNA sequence to a protein sequence) across the six possible reading frames (three in the forward direction, and three in reverse), the HMM can be used to determine if an engineered sequence may produce a similar protein that could be harmful.
What’s a reading frame? DNA is made up of individual bases, which can be represented by the letters A, T, G, and C. When DNA is converted to protein, the intermediary molecule (mRNA) is read in groups of three bases. Reading can begin at any of three starting positions, and the starting position defines the ‘reading frame’. In the image above, the three possible reading frames in the forward direction of the DNA strand are shown. As DNA is a double-stranded molecule, it’s also possible to read from the complementary strand (which is a different sequence to the forward direction). This gives another three reading frames, for a total of six.
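The six-frame translation step can be sketched as follows. It uses the standard genetic code and treats any incomplete or unrecognised codon as `X`; the function names are hypothetical. The HMM search that a screening tool would then run on each translated frame is not shown here.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def normalise(sequence: str) -> str:
    """Upper-case the sequence and treat RNA (U) as DNA (T)."""
    return sequence.upper().replace("U", "T")


def reverse_complement(sequence: str) -> str:
    """Return the complementary strand, read in its own forward direction."""
    return normalise(sequence).translate(COMPLEMENT)[::-1]


def translate(sequence: str) -> str:
    """Translate codon by codon; '*' is a stop codon, 'X' an unknown codon."""
    seq = normalise(sequence)
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(sequence: str) -> dict:
    """Translate three offsets on the given strand and three on its complement."""
    frames = {}
    strands = (("+", normalise(sequence)), ("-", reverse_complement(sequence)))
    for label, strand in strands:
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate(strand[offset:])
    return frames
```

For example, `six_frame_translations("ATGGCCTAA")` gives `MA*` for frame `+1` (methionine, alanine, stop) and a different protein fragment for each of the other five frames, which is why a screening tool has to check all six.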
Beyond Algorithms: Other Mechanisms to Reduce the Risk of Harmful Sequences
While having a technological solution to screen sequence synthesis orders is an important mechanism to prevent the misuse of nucleic acid synthesis, the potential for misuse is a great risk that must also be mitigated by other means. By implementing layered interventions (like the Swiss cheese model), the risk of any vulnerabilities in sequence analysis software can be reduced. It’s unlikely a database of known dangerous sequences will ever contain every possible harmful sequence, or that machine learning tools will be capable of identifying all possible harmful proteins. So it’s important to have additional methods to reduce risk. Finding these alternative methods should involve looking in other sectors and fields.
Background Checks and Registration Systems
Ensuring people doing synthetic biology research are properly trained and competent could go part of the way to reduce the risk of bad actors accessing nucleic acids. Similar programmes are used in other sectors:
- Childcare – In the United Kingdom, people that work with children and vulnerable people must undergo a background check (called the DBS) to ensure they do not have previous convictions that could make them a danger to children and vulnerable people. The UK also has background checks for masters and PhD programmes in fields that have potential for dual use (ATAS).
- Financial Services – Banking frequently has customer verification to reduce money laundering. See: know your customer checks.
These programmes could be applied to nucleic acid synthesis by requiring researchers to be registered on a database, and for research organisations (such as universities and research institutes) to undergo checks of their biosecurity policies.
Benchtop Synthesisers
So far, only synthesis companies have been looked at here. However, benchtop nucleic acid synthesis equipment is becoming cheaper, better, and therefore more widespread. This equipment allows researchers to ‘print’ their own DNA and RNA sequences in their lab without having to send the sequence away to a third party. This can have privacy benefits, especially where intellectual property is concerned, as the sequence isn’t sent away for processing by a third party.
Having this equipment in many labs increases the chances of the equipment being misused by malicious actors. To reduce these risks, some manufacturers require owners of their equipment to be registered with them. Manufacturers of the equipment are also the only supplier of reagents, preventing unregistered users from placing orders.
DNA Script, a developer of benchtop DNA synthesisers, also includes screening software in its equipment, checking synthesis requests against its system before approving the request. As with any software-controlled equipment, this safeguard may be vulnerable to hacking attempts.
Requiring manufacturers to register users of their equipment centrally, and to verify that the sequence screening component of the software is present and working whenever the equipment is turned on, could be potential mechanisms for protecting benchtop synthesiser equipment. Using a trusted platform module to check that the sequence analysis software is correctly installed (verifying platform integrity), as modern operating systems do, could prevent the synthesis equipment from activating without the security software being present.
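The integrity check described above can be illustrated with a small software simulation. Each software component is hashed and folded into a running register, in the style of a trusted platform module's "extend" operation, and the final value is compared with one the manufacturer expects. This is only a sketch: it does not call a real TPM, and the component contents and function names are hypothetical.

```python
import hashlib
import hmac


def measure(component: bytes) -> bytes:
    """Hash one software component (for example, the screening module)."""
    return hashlib.sha256(component).digest()


def extend(register: bytes, measurement: bytes) -> bytes:
    """Fold a measurement into the register: new = SHA-256(old || measurement)."""
    return hashlib.sha256(register + measurement).digest()


def boot_measurement(components) -> bytes:
    """Measure every component in load order, starting from an all-zero register."""
    register = bytes(32)
    for component in components:
        register = extend(register, measure(component))
    return register


def synthesis_allowed(components, expected: bytes) -> bool:
    """Enable synthesis only if the measured software matches the expected value."""
    return hmac.compare_digest(boot_measurement(components), expected)
```

Because each step hashes the previous register value, removing, replacing, or reordering any component changes the final measurement. In this sketch, a synthesiser whose screening module had been stripped out or altered would fail the check and refuse to start.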
Watermarking Synthetic Sequences
It’s possible to develop completely new (de novo) sequences using machine learning tools like AlphaFold and LigandMPNN, creating new protein sequences (which can easily be converted into DNA sequences). These sequences are usually unseen, and will almost never match any sequences in existing databases. Being able to add watermarks to these sequences could enable synthetic sequences to be identified and subsequently attributed in a potential outbreak.
Recent advances in machine learning have led to the proliferation of language models and chatbots like OpenAI’s GPT-4o, Google Gemini, and Anthropic’s Claude. Methods of watermarking output from these models have been suggested (mainly to reduce plagiarism in school and university coursework). One mechanism of watermarking the text output is to shift the probability with which the model chooses certain words, compared to human-written text. When analysing a piece of text, the frequency of those words is measured, and if it is in line with what the watermarked model produces, the text has probably been generated with a language model.
It may be possible to apply this logic to sequence development. By changing the probability of certain amino acids being selected, it may be possible to mark synthetic sequences to allow them to be identified as synthetic during subsequent sequence analysis. It may not be possible to alter the probabilities of bases, as these can be changed through a process called codon optimisation (some codons are more efficiently converted to amino acids than others in certain organisms, and can therefore be changed before placing a synthesis order).
Although keeping society safe from the risks of synthetic DNA is essential, it is also important to ensure molecular and synthetic biology are still accessible. Research into advanced biological drugs will be accelerated by synthetic nucleic acids, as modern lab techniques depend on computational techniques and synthetic nucleic acids working in conjunction.
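As a rough illustration of how such a watermark might work, the toy below borrows the ‘green list’ idea from text watermarking. A secret key and the previous amino acid pick a favoured half of the 20 amino acids, the generator leans towards that half, and a detector checks whether favoured residues appear more often than chance. Everything here is an assumption for the sake of the example: a real design tool would bias its own model probabilities rather than a uniform random choice, and the names `green_set`, `generate`, and `z_score` are hypothetical.

```python
import hashlib
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def green_set(previous: str, key: str) -> set:
    """Keyed, repeatable choice of half the amino acids, given the previous one."""
    digest = hashlib.sha256(f"{key}:{previous}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(list(AMINO_ACIDS), len(AMINO_ACIDS) // 2))


def generate(length: int, key: str, bias: float = 0.9, seed: int = 0) -> str:
    """Stand-in generator: with probability `bias`, pick from the green set."""
    rng = random.Random(seed)
    sequence = [rng.choice(AMINO_ACIDS)]
    while len(sequence) < length:
        if rng.random() < bias:
            # Sorted so the choice does not depend on set iteration order.
            pool = sorted(green_set(sequence[-1], key))
        else:
            pool = AMINO_ACIDS
        sequence.append(rng.choice(pool))
    return "".join(sequence)


def z_score(sequence: str, key: str) -> float:
    """Standard deviations by which green residues exceed the 50% chance rate."""
    n = len(sequence) - 1
    hits = sum(
        1 for prev, cur in zip(sequence, sequence[1:]) if cur in green_set(prev, key)
    )
    return (hits - 0.5 * n) / math.sqrt(0.25 * n)
```

A sequence generated with a strong bias should score far above zero for the right key, while an unmarked sequence should score close to zero, so a threshold (say, 4) separates the two in this toy. A real scheme would have to bias choices much more gently, because constraining which amino acids are used at each position could alter or destroy the protein’s function.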
To Summarise…
- Background Checks and Registration
Evaluating customers and end users as part of account set-up with synthesis companies will prevent overt bad actors from obtaining synthetic nucleic acids. Similar policies in other sectors, like childcare and finance, are successful at reducing the risk to individuals and society. These methods could easily be transferred to synthetic biology.
- Benchtop Synthesisers
Allowing users to synthesise nucleic acids on-site will make science faster, but reduces the supervision that synthesis companies are able to exert over orders. Increasing regulation of benchtop synthesis equipment will mitigate these risks.
- Watermarking Synthetic Sequences
Watermarks would make retrospective forensic analysis possible, allowing engineered sequences to be identified and an outbreak to be attributed to synthetic DNA.
Layered defense is the most effective way of reducing the risk of synthetic nucleic acids, protecting society from outbreaks of engineered pathogens.
Conclusions: Making synthetic biology safer will require a multi-faceted approach
As demonstrated by the Covid-19 pandemic, pathogens are able to spread throughout the world in a matter of weeks. The democratisation of synthetic biology increases the number of individuals with the capability to develop highly pathogenic organisms. This risk must be mitigated to prevent damaging bioterrorism attacks from occurring.
Sequence analysis is an important technique in mitigating the risk presented by synthetic biology and nucleic acids, but is unlikely to be the sole mechanism to eliminate these risks. Improvements in sequence analysis, likely with machine learning techniques, will play an important role in reducing the likelihood of misuse, ensuring hazardous sequences are harder to obtain.
By combining sequence analysis software with other mechanisms, such as background checks, registration of synthesis equipment, and potentially the watermarking of sequences, the risk can be further reduced. These actions will inhibit malicious actors’ access to nucleic acid synthesis, making the development of pathogens with pandemic potential much more challenging.
[1] A codon is a sequence of 3 bases in DNA and RNA, coding for an amino acid. Almost all known life uses the same codons for each amino acid, but there is potential for synthetic biology to develop life that uses different codons for the same amino acid through a process called codon swapping. Codon swapping would make screening nucleic acid sequences somewhat more complicated.
[2] Figure by Hornung Ákos, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17667908