How Many (Smallish) Organic Compounds Are There?

chemslug

post by chemslug · 2022-02-20T04:43:08.623Z · LW · GW · 10 comments


Introduction

Here's a question I've been pondering recently: how many organic compounds are there? There are multiple ways to interpret the question, each of which leads to consideration of a different potential set of compounds, and each of which is informative in its own way.

First off, what is an organic compound? This sounds straightforward, but because of the history of the term, it isn't. Originally, "organic compounds" were carbon-containing compounds associated with living things. Over time, the term drifted to include most carbon-containing compounds, except for things like carbon dioxide, minerals like limestone (CaCO3), and allotropes of pure carbon (like diamond, graphite, or carbon nanotubes). Wikipedia defines an organic compound as "generally any chemical compounds that contain carbon-hydrogen bonds". That definition seems good enough for our purposes^[1], so we'll go with it.

Next, what do we mean by "are there"? Is it substances that have actually been found or made and characterized? A quick check of PubChem says that there are around 100 million (10^8) compounds with information submitted. However, that's not quite what I was envisioning when I asked the question. The compounds we've actually made or isolated (and more importantly, characterized to a greater or lesser extent) are only a tiny fraction of chemical space, which I will loosely define as the set of all possible^[2] molecules.

Example Chemical Sub-spaces: Hydrocarbons and Proteins

Organic chemical space contains, in theory, an infinite number of molecules. Consider the set of fully-saturated, unbranched hydrocarbon chains: methane with a single carbon, ethane with two, etc. In principle, one could construct an arbitrarily long chain with no reason to expect chemical instability. In practice, we make polyethylene chains up to about half a million carbon atoms in length that are reasonable approximations of that thought exercise.

Of course, ultra-long hydrocarbons aren't that interesting chemically, with only one monomer and no functional groups. What about molecules that actually do things? Let's take proteins, a strong contender for "most interesting class of molecules". There are 20 naturally-occurring amino acids. A 30-kDa protein^[3] has about 300 amino acids, so there are 20^300 (about 10^390) possible combinations of amino acids leading to a protein of moderate size. As a comparison, a quick Fermi calculation gives 10^80 atoms in the known universe.^[4]
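The protein count is easy to check with logarithms. Here is a minimal Python sketch, assuming an alphabet of 20 amino acids and a chain of 300 residues as in the paragraph above; the function name is made up for illustration.

```python
import math

def log10_sequence_count(alphabet_size: int, length: int) -> float:
    """Return log10 of the number of linear sequences of the given length."""
    return length * math.log10(alphabet_size)

# 20 amino acids, 300 residues (the figures used in the text)
protein_log10 = log10_sequence_count(20, 300)

# Fermi estimate of the number of atoms in the known universe, as a power of ten
atoms_log10 = 80

print(f"20^300 is about 10^{protein_log10:.0f}")
print(f"That is about {protein_log10 - atoms_log10:.0f} orders of magnitude more than the atom count")
```

Working in log10 avoids printing a 391-digit integer and makes the comparison with the atom count a simple subtraction.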

"Smallish" Organic Molecules

Okay, but that still isn't quite what I meant to ask when I wondered "how many organic molecules are there?" I'm a synthetic organic chemist, so what I really wondered was "how big is the space of the kinds of molecules that synthetic organic chemists are typically concerned with?" These molecules tend to be:

Relatively small compared to the examples we've considered so far (mostly less than 500 Da molecular weight)
Stable on a timescale of at least hours under conditions achievable in the lab
Composed of relatively dense arrangements of rings, chains, and functional groups

These constraints are fairly similar to the space of "potentially pharmacologically active molecules". Wikipedia led me to a paper by Bohacek, McMartin, and Guida^[5] that gives an estimate of 10^63 such molecules. This sounds like an answer to the question I actually wanted to ask, so it's worth unpacking their calculation further.

Here's the calculation as Bohacek et al. give it^[6]:

Although the number of possible molecules is difficult to estimate accurately, simple considerations show that it must be very large! Consider growing a linear molecule an atom at a time and choosing a carbon, nitrogen, oxygen, or sulfur atom at each position. Some of these atoms can be doubly or triply bonded, but not all combinations of atoms are chemically stable, and some multiple bonds will only be possible in nonlinear structures, i.e., a C=O group. Assuming a very approximate average choice multiplicity of 6, then 6^30 or 2*10^23 molecules could be grown containing 30 atoms. Now consider the ways of introducing branching or cyclization into the resulting structure. Closure of rings with three or more atoms involves selecting two atoms to form a bond and could be achieved in 30*28/2 ways. Making a branched molecule could be achieved by choosing a point to cut the chain and a point in the first part of the chain to attach to the cut end of the second part of the chain (i.e., 30^2 ways). Not all atoms can be joined in this way. However, this will be offset by the fact that when stereochemical considerations are introduced, the number of possibilities will be expanded. Based upon these considerations, approximately 10^40 molecules with up to four rings and 10 branch points could be produced from each linear chain, resulting in a very approximate estimate of 10^63 molecules in total. Although this is a rough estimate, it seems likely that when all the different possible combinations of ring closure and branching are taken into account, the true number will be well in excess of 10^60 and will rise steeply with increasing molecular weight.
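The arithmetic in the quoted passage can be reproduced in a few lines. This is only a sketch of the numbers as quoted, not a reconstruction of how the authors arrived at the 10^40 figure, which is taken here as given; all variable names are illustrative.

```python
import math

CHOICES_PER_ATOM = 6    # "average choice multiplicity" from the quote
CHAIN_LENGTH = 30       # atoms in the linear chain

# Number of linear chains: an exact (large) integer
linear_chains = CHOICES_PER_ATOM ** CHAIN_LENGTH

# Counts quoted for a single ring closure and a single branching move
ring_closures = 30 * 28 // 2
branch_moves = 30 ** 2

# Quoted figure: about 10^40 ring/branch variants per linear chain (assumed, not derived)
LOG10_TOPOLOGIES = 40

log10_linear = math.log10(linear_chains)
log10_total = log10_linear + LOG10_TOPOLOGIES

print(f"linear chains: about 10^{log10_linear:.1f}")
print(f"single ring closures per chain: {ring_closures}")
print(f"single branch moves per chain: {branch_moves}")
print(f"total: about 10^{log10_total:.1f}")
print(f"share of the exponent from rings and branches: {LOG10_TOPOLOGIES / log10_total:.2f}")
```

The last line shows how much of the final exponent comes from ring closure and branching rather than from the choice of atoms along the chain.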

We can quibble with some of the choices made in this calculation, but if anything, the authors' closing figure of 10^60 is a lower bound. There are other elements besides C, N, O, and S that show up in natural products or other potentially bioactive compounds (B, F, P, Cl, Br, and I, just for a start), and while the halogens would only substitute for hydrogen, rather than forming rings and chains, each added element contributes multiple orders of magnitude to the total.

It's also fun to compare this number with some estimates of how much carbon is available in various places. Carbon amounts in this section come from this paper. First, there are about 600 Pg of carbon in the atmosphere, mostly as carbon dioxide. That works out to about 3 x 10^40 carbon atoms in Earth's atmosphere.^[7] Between the atmosphere, the oceans, and the terrestrial biosphere, there are around 43,000 Pg of "circulating carbon", or ~2 x 10^42 carbon atoms.^[8] If we add in the estimates of "deep carbon" in the earth's interior, there are around 1.85 x 10^9 Pg C on the planet, or about 10^47 carbon atoms.^[9]
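The conversions from petagrams of carbon to atom counts can be checked with a short Python sketch. The reservoir sizes are the ones given in the paragraph above; the physical constants are standard values, and the final comparison assumes the 10^63 estimate for small-molecule space and, very generously, one carbon atom per molecule.

```python
import math

AVOGADRO = 6.022e23      # atoms per mole
MOLAR_MASS_C = 12.011    # grams of carbon per mole
GRAMS_PER_PG = 1e15      # 1 petagram = 10^15 grams

def carbon_atoms(petagrams: float) -> float:
    """Convert a mass of carbon in petagrams to a number of carbon atoms."""
    return petagrams * GRAMS_PER_PG / MOLAR_MASS_C * AVOGADRO

# Reservoir sizes in Pg of carbon, as given in the text
reservoirs = {
    "atmosphere": 600,
    "circulating carbon": 43_000,
    "whole planet (including deep carbon)": 1.85e9,
}

for name, pg in reservoirs.items():
    print(f"{name}: {carbon_atoms(pg):.1e} C atoms")

# Upper bound on coverage: pretend every molecule needs only one carbon atom
LOG10_CHEMICAL_SPACE = 63
planet_atoms = carbon_atoms(1.85e9)
shortfall = LOG10_CHEMICAL_SPACE - math.log10(planet_atoms)
print(f"Shortfall versus 10^63 molecules: about 10^{shortfall:.0f}")
```

Even under that one-atom-per-molecule assumption, all the carbon on the planet falls many orders of magnitude short of making one copy of each molecule in the estimated space.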

Conclusion

There are a few points I'd like to emphasize at the end of this chain of Fermi calculations. First, all the estimates here are incredibly rough, and could be off by multiple orders of magnitude without changing the primary conclusions. Second, the set of organic compounds considered as a search space is really big, and if you have a goal that involves picking a compound with a defined set of characteristics, you'll want to do something other than brute-force search. Finally, in the estimate for the size of small-molecule chemical space, the bulk of the work (about two-thirds of our log-units) is done by the ability of carbon to form rings and branched chains. In this respect, no other element comes close to the variety we see with carbon-based compounds.^[10] To the extent that molecular shape is relevant to processes we care about (like living systems we know about, and likely those we haven't encountered yet) we should expect carbon-based compounds to play a significant role.

^[1] It does exclude some fun edge cases like carbon tetrachloride and hexafluorobenzene, but any definition we choose will have to draw the line somewhere.

^[2] "Possible" here is intended to mean only chemical entities that are isolable and stable for at least long enough to be characterized.

^[3] That is, a protein with a molecular weight of 30,000 grams per mole. Mid-sized, as proteins go.

^[4] Wikipedia gives a figure of 1.5 x 10^53 kg total mass of ordinary matter. Most of this is hydrogen atoms, which come 6 x 10^26 per kilogram. That gives 9 x 10^79 or ~10^80 total atoms. The existence of non-hydrogen atoms doesn't change this significantly.

^[5] Apologies for the paywalled article. I'll quote the portion that interests us in full but wanted to include a link to the paper.

^[6] This is far from the main concern of the paper. In fact, the calculation is a footnote to a figure late in the paper.

^[7] 600 Pg carbon = 6 x 10^14 kg C
1 kg C = 83.333 mol C = 5 x 10^25 C atoms
600 Pg C = 3 x 10^40 C atoms

^[8] 43,000 Pg carbon = 4.3 x 10^16 kg C
1 kg C = 83.333 mol C = 5 x 10^25 C atoms
43,000 Pg C = 2 x 10^42 C atoms

^[9] 1.85 x 10^21 kg C on earth
1 kg C = 83.333 mol C = 5 x 10^25 C atoms
9.25 x 10^46 ~ 10^47 C atoms on earth

^[10] Silicon can form some rings and chains analogous to carbon's, but they tend to be less stable, in large part due to the reduced strength of Si-Si bonds relative to C-C bonds. Sulfur can form chains and rings of various sizes but doesn't branch well and also suffers from relatively low S-S bond strength.

10 comments

Comments sorted by top scores.

comment by Charlie Steiner · 2022-02-22T03:36:23.024Z · LW(p) · GW(p)

I thought Derek Lowe had a blog post on this but now I can't find it, sadly.

comment by ChristianKl · 2022-02-20T22:35:56.763Z · LW(p) · GW(p)

There are 20 naturally-occurring amino acids. A 30-kDa protein^[3] has about 300 amino acids, so there are 20^300 (about 10^390) possible combinations of amino acids leading to a protein of moderate size.

That's wrong. Humans alone use 21 amino acids to create proteins if you include selenocysteine, which is used, albeit less often than the other 20.

Wikipedia speaks of 500 naturally occurring amino acids. While most of them don't get used by any life form for protein creation, there's no fundamental reason why they can't be.

Replies from: jmh, chemslug

↑ comment by jmh · 2022-02-21T00:02:18.627Z · LW(p) · GW(p)

Wikipedia speaks of 500 naturally occurring amino acids. While most of them don't get used by any life form for protein creation, there's no fundamental reason why they can't be.

This strikes me as a rather strong claim. If we take it as true, wouldn't we expect that some evidence exists for such use and, if viable, that the use of these amino acids would persist?

I take it that you're really saying we don't know of any reasons they could not be used in life processes, but the observation that they are not suggests that the inquiry should be along the lines of why not, rather than assuming they could be but aren't (at least from our current observations).

Replies from: ChristianKl

↑ comment by ChristianKl · 2022-02-21T19:18:11.144Z · LW(p) · GW(p)

It's not easy to evolve the usage of additional amino acids for use in proteins.

You need to synthesize the protein, you need to create a way for the amino acid to be incorporated into existing proteins (and if you just replace an existing three-letter code, you are going to mess up a lot of proteins) and you finally need to use the amino acid in a productive fashion.

If we look at why pyrrolysine exists in a few organisms with the last common ancestor 3 billion years ago, it's mainly used in methyltransferases, which allow those organisms to digest methylamine.

That leads to the thesis that it's conserved in those organisms but not others that share the same common ancestors because it has a special use in them.

Most of the commonly used amino acids are simpler than pyrrolysine. From what I remember from my molecular biology lectures the professor did believe that more amino acids were used 3 billion years ago and that evolutionary pressure removed some complex amino acids from use.

↑ comment by chemslug · 2022-02-21T02:32:32.268Z · LW(p) · GW(p)

I should have written 'common proteinogenic' in place of 'naturally-occurring'. Thank you for the correction.

comment by Gunnar_Zarncke · 2022-02-20T17:46:33.586Z · LW(p) · GW(p)

I somehow expected this number, 10^63, to be put in relation to comparable search spaces, and not just described as "really big". The complete search space of chess positions is much bigger than this, but that doesn't rule out useful game play, for example.

A related question is: why is there not a corresponding multitude of silicon-based molecules? Silicon also has four possible bonds, but the only place where this seems to play a role is in semiconductors. Is that because there are no biological ways to make such molecules?

Replies from: chemslug, Charlie Steiner

↑ comment by chemslug · 2022-02-21T03:00:28.692Z · LW(p) · GW(p)

The lack of context for comparable search spaces is a fair criticism. The implicit assumption (which I now realize was inappropriate not to spell out for this audience) was that your search would, at some point, involve actually making the molecules in question in order to subject them to some form of experimental characterization. The comparison of the number of possible small molecules to the amount of available terrestrial carbon was intended to make the point that achieving sizable coverage of the search space experimentally is close to a non-starter. In practice, of course, there are all kinds of ways to bias your search in productive directions.

Some search-space context:

Number of possible chess games: Shannon conservatively estimated 10^120 possible games, 10^43 possible board positions.

Number of possible Go board configurations: Wikipedia gives 10^172 (that is, 3^361, which counts every arrangement of the 19 x 19 board, legal or not)

Number of ways to order a standard 52-card deck: 8 x 10^67

As for why we don't see complex silicon-containing compounds in biology, here's an attempt at an answer: We do see silicates in structural roles, for example in phytoliths. However, low Si-Si bond strength relative to C-C, combined with very strong Si-O bonds, means that you tend to get Si-O-Si linkages (like in silicone polymers) rather than Si-Si bonds, and in the absence of Si-C bonds to prevent further oxidation, you form silicates pretty quickly.

Replies from: Gunnar_Zarncke

↑ comment by Gunnar_Zarncke · 2022-02-21T09:23:57.440Z · LW(p) · GW(p)

Thanks for the explanation about silicon compounds. As oxygen is always more abundant than silicon, this makes it indeed unlikely to become the basis of silicon-based life.

↑ comment by Charlie Steiner · 2022-02-22T04:25:46.040Z · LW(p) · GW(p)

Phosphorus and nitrogen are also interesting elements capable of forming lots of cool structures... the problem is they'd often rather be doing other things, and can insist quite energetically.

comment by Capybasilisk · 2022-02-20T13:03:15.217Z · LW(p) · GW(p)

This Chemical Does Not Exist.

(Refresh the page to load new ones)
