How long till Inverse AlphaFold?

post by Daniel Kokotajlo (daniel-kokotajlo) · 2020-12-17T19:56:14.474Z · LW · GW · 18 comments

This is a question post.

AlphaFold 2 takes an amino acid sequence as input, and outputs a 3D structure which represents the protein that that sequence forms. It would be cool if we could do it in reverse, i.e. the user inputs a 3D model (e.g. a gear, an axle, a wall with a hole in it of a certain shape...) and then the system outputs an amino acid sequence that would form a protein with that structure.

I don't have a good sense of whether this is a very difficult problem that we are nowhere near solving, or an obvious next step after AlphaFold2.

My current median is that it's 4 years away, but I'm very uncertain about that.


answer by CellBioGuy · 2020-12-18T03:24:22.336Z · LW(p) · GW(p)

From what I understand, the pipeline depends strongly on homology to existing proteins with determined structures: it uses substitution correlations to create an interaction graph, which it then allows to evolve via learned rules.

I strongly suspect that as such it will not be very good at orphans without significant homology, be it sequence to structure or the reverse.

answer by Razied · 2020-12-18T21:13:19.897Z · LW(p) · GW(p)

The naive way can be done immediately: AlphaFold2 is differentiable end-to-end, so you could use gradient descent to optimise over the space of amino acid sequences (this is a discrete space, but you can treat it as continuous and add some penalties to your loss function to bias it towards the discrete points):

Correct sequence = argmin over x of || AlphaFold2(x) - Target structure ||

For some notion of distance between structures.  
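
A minimal sketch of that relaxation, under loud assumptions: `toy_fold` is a made-up differentiable stand-in for AlphaFold2 (each residue contributes a hypothetical unit step vector, and the "structure" is the running sum of steps), the distance is mean squared error between matching points, and the discreteness penalty is an entropy term. None of these names or choices come from AlphaFold2 itself; the point is only to show continuous logits, a loss against a target structure, and a final snap back to discrete amino acids.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def make_step_table(seed=0):
    """Hypothetical unit step vector per amino acid. NOT real chemistry."""
    rng = random.Random(seed)
    table = []
    for _ in AMINO_ACIDS:
        v = [rng.gauss(0, 1) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        table.append([c / norm for c in v])
    return table


def one_hot(sequence):
    return [[1.0 if aa == letter else 0.0 for letter in AMINO_ACIDS] for aa in sequence]


def softmax(row):
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]


def toy_fold(probs, steps):
    """Toy stand-in for AlphaFold2(x): coordinates are a running sum of expected steps."""
    coords, pos = [], [0.0, 0.0, 0.0]
    for p in probs:
        for d in range(3):
            pos[d] += sum(p[a] * steps[a][d] for a in range(len(steps)))
        coords.append(list(pos))
    return coords


def distance(coords, target):
    """Mean squared distance between matching points (one possible choice)."""
    total = sum((c - t) ** 2 for cr, tr in zip(coords, target) for c, t in zip(cr, tr))
    return total / len(target)


def design_sequence(target, steps, n_iters=1000, lr=0.05, penalty=0.01):
    """Gradient descent on per-position logits; returns (sequence, loss history)."""
    n, k = len(target), len(steps)
    logits = [[0.0] * k for _ in range(n)]
    history = []
    for _ in range(n_iters):
        probs = [softmax(row) for row in logits]
        coords = toy_fold(probs, steps)
        history.append(distance(coords, target))
        # Hand-written backward pass: gradient w.r.t. coordinates first.
        g_coords = [[2 * (c - t) / n for c, t in zip(cr, tr)]
                    for cr, tr in zip(coords, target)]
        acc = [0.0, 0.0, 0.0]
        for i in reversed(range(n)):
            # Step i affects every later coordinate, so sum gradients from i onward.
            acc = [a + g for a, g in zip(acc, g_coords[i])]
            p = probs[i]
            # Gradient w.r.t. probabilities, plus the gradient of an entropy
            # penalty that biases each position toward a single amino acid.
            g_p = [sum(acc[d] * steps[a][d] for d in range(3))
                   - penalty * (math.log(p[a] + 1e-12) + 1.0) / n
                   for a in range(k)]
            mean_g = sum(pa * ga for pa, ga in zip(p, g_p))
            for a in range(k):
                # Chain rule through softmax, then a plain gradient step.
                logits[i][a] -= lr * p[a] * (g_p[a] - mean_g)
    # Snap the relaxed solution back to the discrete space.
    sequence = "".join(AMINO_ACIDS[row.index(max(row))] for row in logits)
    return sequence, history
```

For example, `design_sequence(toy_fold(one_hot("MKTAYIAK"), make_step_table()), make_step_table())` returns a candidate sequence plus the loss at each iteration. Nothing guarantees the recovered sequence matches the one that generated the target, since several sequences can give similar structures even in this toy, and a real predictor would be far less well-behaved than a running sum.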

comment by maximkazhenkov · 2020-12-19T04:10:39.426Z · LW(p) · GW(p)

But is AlphaFold2(x) a (mostly) convex function? More importantly, is the real structure(x) convex?

I can see a potential bias here, in that AlphaFold and inverse AlphaFold might work well for biomolecules because evolution is also a kind of local search, so if it can find a certain solution, AlphaFold will find it, too. But both processes will be blind to a vast design space that might contain extremely useful designs.

Then again, we are biology, so maybe we only care about biomolecules and adjacent synthetic molecules anyway.

Replies from: Razied
comment by Razied · 2020-12-19T13:04:14.742Z · LW(p) · GW(p)

Yeah, I agree that there's probably a bias towards biomolecules right now, but I think that's relatively easily fixable by using the naive way to predict the amino acid sequence of a structure we want, then actually making that protein, then checking its structure with crystallography and retraining AlphaFold to predict the right structure. If we do this procedure with sequences that differ more and more from biomolecules, we'll slowly remove that bias from AlphaFold. 

Replies from: maximkazhenkov
comment by maximkazhenkov · 2020-12-22T15:21:55.851Z · LW(p) · GW(p)

By "bias" I didn't mean biases in the learned model, I meant "the class of proteins whose structures can be predicted by ML algorithms at all is biased towards biomolecules". What you're suggesting is still within the local search paradigm, which might not be sufficient for the protein folding problem in general, any more than it is sufficient for 3-SAT in general. No sampling is dense enough if large swaths of the problem space are discontinuous.

answer by koanchuk · 2020-12-18T01:06:38.006Z · LW(p) · GW(p)

I think that one problem is that an AA sequence generally results in a single, predictable 3D structure (at stable pH, and barring any misfolding events), whereas there are a lot of AA sequences that would result in something resembling e.g. an axle of a certain size, and even more that do not. It seems to me that this problem is in a different class of computational complexity.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-12-18T01:42:48.486Z · LW(p) · GW(p)

Shouldn't that make it easier? The AI has many options to choose from when seeking to generate the gear, or axle, or whatever that it is tasked with generating.

Replies from: koanchuk
comment by koanchuk · 2020-12-18T03:55:41.057Z · LW(p) · GW(p)

Predicting sequence from structure just belongs to a different class of problems. Pierce & Winfree (2002) seem to have proven that it is NP-hard.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-12-18T10:22:37.875Z · LW(p) · GW(p)

I'm not disputing that it's in a different complexity class, I'm just saying that it seems easier in practice. For example, suppose you gave me a giant bag of legos and then a sequence of pieces to add with instructions for which previous piece to add them to, and where. I could predict what the end result would look like, but it would involve simulating the addition of each piece to the whole, and might be too computationally intensive for my mortal brain to handle. But if you said to me "Give me some lego instructions for building a wall with a cross-shaped hole in it" I could pretty easily do so. The fact that there are zillions of different correct answers to that question only makes it easier. As the paper you link says,

"It is important to keep in mind that the classification of protein design as an NP-hard optimization problem is a reflection of worst-case behavior. In practice, it is possible for an exponential-time algorithm to perform well or for an approximate stochastic method to prove capable of finding excellent solutions to NP-complete and NP-hard problems."

Replies from: Measure, jmh
comment by Measure · 2020-12-18T15:50:14.042Z · LW(p) · GW(p)

I think the Lego example says more about the human brain's limited working memory to keep track of the current state without errors. It seems like it would be easier to write a computer program to do the first task than the second, and I think the first program would execute faster as well.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-12-18T16:06:27.434Z · LW(p) · GW(p)

Yeah, maybe. Or maybe not. Do you have arguments that artificial neural nets are more like computer programs in this regard than like human brains?

Replies from: Measure
comment by Measure · 2020-12-18T20:52:38.125Z · LW(p) · GW(p)

I'm not familiar enough with neural nets to have reliable intuitions about them. I was thinking in terms of more traditional computer programs. I wouldn't be surprised if a neural net behaved more like a human brain in this regard.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-12-18T21:11:10.053Z · LW(p) · GW(p)

OK, thanks. Well, we'll find out in a few years!

comment by jmh · 2020-12-20T19:11:04.108Z · LW(p) · GW(p)

I'm wondering a bit about the idea that there are X correct answers. That might be true of getting the shape, but is shape all that really matters here?

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-12-20T22:03:34.390Z · LW(p) · GW(p)

I'm not sure. I had in mind nanotech stuff--making little robots and factories using gears, walls, axles, etc. So I guess shape alone isn't enough, you need to be able to hold that shape under stress. A wall that crumbles at the slightest touch shouldn't count.


Comments sorted by top scores.

comment by kjz · 2020-12-18T01:15:59.929Z · LW(p) · GW(p)

For clarification, would you consider an amino acid sequence designed to have a certain function to pass this test? For example, a sequence that generates a protein capable of binding selectively to specific RNA sequences?

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-12-18T01:45:08.758Z · LW(p) · GW(p)

Good question. I had been thinking of it differently, where the user inputs a 3D shape and the AI outputs an amino acid sequence that codes for a protein of that shape. But perhaps it would be even more useful to have the inputs be functions, as you say. E.g. "Give me a protein that has two ends, one of which binds selectively to the SARS-CoV-2 virus, and the other of which signals the immune system to attack." It wasn't what I had in mind though.

Replies from: kjz
comment by kjz · 2020-12-18T02:39:11.078Z · LW(p) · GW(p)

Yup, that would be another good example. I would guess that sequences designed for functions like these will be developed faster than sequences designed for shape, because the incentives to do so already exist. If you generate a gear or axle, what could you do with it? Are there known applications for such things? Ultimately we could imagine molecular machines made of such a toolkit, but that seems like another level of complexity. (Although perhaps it could tie in with work along the lines of Fraser Stoddart's group.)