Bitter Lessons in Chemistry

December 1, 2023

“And I took the little scroll from the hand of the angel and ate it. It was sweet as honey in my mouth, but when I had eaten it my stomach was made bitter.”

–Revelation 10:10

As machine learning becomes more and more important to chemistry, it’s worth reflecting on Richard Sutton’s 2019 blog post about the “bitter lesson.” In this now-famous post, Sutton argues that “the biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.” This might sound obvious, but it’s not:

[The bitter lesson] is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach. (emphasis added)

How might the bitter lesson be relevant in chemistry? One example is computer-assisted retrosynthesis in organic synthesis, i.e. figuring out how to make a given target from commercial starting materials. This task was first attempted by Corey’s LHASA program (1, 2), and more recently has been addressed by Bartosz Grzybowski’s Synthia. Despite considerable successes, both efforts have operated through manual encoding of human chemical intuition. If Sutton is to be believed, we should be pessimistic about the scalability and viability of such approaches relative to pure search-based alternatives in the coming years.
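To make the contrast concrete, here’s a deliberately naive Python sketch of what a pure search-based approach looks like. Everything in it is a placeholder: `transforms` stands in for a learned (or enumerated) set of retrosynthetic disconnections, `purchasable` for a building-block catalog lookup, and real planners perform AND/OR search over multi-precursor disconnections rather than this single-precursor simplification.

```python
from collections import deque

def retrosynthesis_bfs(target, transforms, purchasable, max_depth=5):
    """Toy breadth-first retrosynthetic search.

    Each transform maps a molecule to candidate precursors (single-precursor
    disconnections only, for simplicity). Returns a list of
    (product, precursor) steps ending in a purchasable material, or None.
    """
    queue = deque([(target, [])])
    seen = {target}
    while queue:
        molecule, route = queue.popleft()
        if purchasable(molecule):
            return route
        if len(route) >= max_depth:
            continue
        for transform in transforms:
            for precursor in transform(molecule):
                if precursor not in seen:
                    seen.add(precursor)
                    queue.append((precursor, route + [(molecule, precursor)]))
    return None
```

The point of the sketch is where the chemistry lives: in a Suttonian world, the intelligence sits in the learned `transforms` and in how much search you can afford, not in hand-written rules.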

Another example is machine-learned force fields (like ANI or NequIP). While one could argue that equivariant neural networks like e3nn aren’t so much incorporating domain-specific knowledge as exploiting relevant symmetries (much like convolutional neural networks exploit translational symmetry in images), there’s been a movement in recent years to combine chemistry-specific forces (e.g. long-range Coulombic forces) with neural networks: Parkhill’s TensorMol did this back in 2018, and more recently Dral, Piquemal, and Levitt & Fain (among others) have published on this as well. While I’m no expert in this area, the bitter lesson suggests that we should be skeptical about the long-term viability of such efforts, and instead just throw more data at chemistry-agnostic models.
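For readers who haven’t seen one of these hybrid potentials, here’s a minimal sketch of the general shape, not any published architecture: a learned short-range term added to an analytic long-range Coulomb term. The `short_range_model` callable, the fixed point charges, and the unit conventions are all illustrative assumptions.

```python
import numpy as np

COULOMB_K = 332.0637  # kcal·mol⁻¹·Å·e⁻² (charges in e, distances in Å)

def coulomb_energy(coords, charges):
    """Analytic pairwise Coulomb energy (no cutoff or damping, for clarity)."""
    energy = 0.0
    n = len(charges)
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(coords[i] - coords[j])
            energy += COULOMB_K * charges[i] * charges[j] / r
    return energy

def total_energy(coords, charges, short_range_model):
    """Physics-augmented potential: learned short-range term + analytic long-range term."""
    return short_range_model(coords) + coulomb_energy(coords, charges)
```

The appeal of this decomposition is that the network only has to learn the short-range residual; the bitter lesson asks whether that hand-built prior keeps paying off once models and datasets get big enough.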

A key assumption of the bitter lesson is that “over a slightly longer time than a typical research project, massively more computation inevitably becomes available.” This idea has also been discussed by Andrej Karpathy, who reproduced Yann LeCun’s landmark 1989 backpropagation paper last year using state-of-the-art techniques and reflected on how the field has progressed since then. In particular, Karpathy discussed how the last three decades of progress in ML can help us envision what the next three decades might look like:

Suppose that the lessons of this exercise remain invariant in time. What does that imply about deep learning of 2022? What would a time traveler from 2055 think about the performance of current networks?

If we take this seriously, we might expect that chemical ML will not be able to advance much farther without bigger datasets and bigger models. Today, experimental datasets rarely exceed 10⁴–10⁵ data points, and even computational datasets typically comprise 10⁷ data points or fewer—compare this to the ~10¹³ tokens reportedly used to train GPT-4! It’s not obvious how to get experimental datasets that are this large. HTE and robotics will help, but five orders of magnitude is a big ask. Even all of Reaxys doesn’t get you to 10⁸, poor data quality notwithstanding. (It’s probably not a coincidence that DNA-encoded libraries, which can actually have hundreds of millions of data points, also pair nicely with ML: I’ve written about this before.)

In contrast, computation permits the predictable generation of high-quality datasets. If Sutton is right about the inevitable availability of “massively more computation,” then we can expect it to become easier and easier to run hitherto expensive calculations like DFT in parallel to generate huge datasets, and to enable more and more downstream applications like chemical machine learning. With the right infrastructure (like Rowan, hopefully), it should be possible to turn computer time into high-quality chemical data with almost no non-financial scaling limit: we’re certainly not going to run out of molecules.
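In code, turning computer time into chemical data is close to an embarrassingly parallel map. The sketch below assumes a hypothetical `run_dft(smiles)` function (a wrapper around whatever quantum chemistry engine you like) and an arbitrary worker count; the record format is likewise made up.

```python
from concurrent.futures import ProcessPoolExecutor

def build_dataset(smiles_list, run_dft, n_workers=64):
    """Fan an expensive calculation out over many molecules in parallel,
    pairing each input with its computed properties."""
    # run_dft must be a module-level function so it can be pickled
    # and shipped to the worker processes.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(run_dft, smiles_list))
    return list(zip(smiles_list, results))
```

The only real scaling limits here are money and queue time, which is exactly the regime Sutton’s argument favors.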

The advent of big data, though, heralds the decline of academic relevance. Julian Togelius and Georgios Yannakakis wrote about this earlier this year in a piece on “survival strategies for depressed AI academics,” which discusses the fact that “the gap between the amount of compute available to ordinary researchers and the amount available to stay competitive is growing every year.” For instance, GPT-4 reportedly cost c. $60 million to train, far surpassing any academic group’s budget. Togelius and Yannakakis provide a lot of potential solutions, some sarcastic (“give up”) and others quite constructive—for instance, lots of interpretability work (like this) is done on toy models that don’t require GPT-4 levels of training. Even the most hopeful scenarios they present, however, still leave academics with a rather circumscribed role in the ML ecosystem.

The present academic era of chemical machine learning can thus be taken as a sign of the field’s immaturity. When will chemical ML reach a Suttonian era where scaling is paramount and smaller efforts become increasingly futile? I’m not sure, but my guess is that it will happen when (and only when) there are clear commercial incentives for developing sophisticated models, enabling companies to sink massive amounts of capital into training and development. This clearly hasn’t happened yet, but it also might not be as far away as it seems (cf.). It’s an interesting time to be a computational chemist…


