Blog


Books from 2024

January 1, 2025

(Previously: 2022, 2023.)

#1. Baldassar Castiglione, The Book of the Courier

This book gets cited from time to time as a sort of historical guide to "being cool," since the characters spend some time discussing the idea of sprezzatura, basically grace or effortlessness. More interesting to me was the differences between Renaissance conceptions of virtue, character, & masculinity / femininity and how our culture's used to thinking about these concepts—"the past is a foreign country."

#2. Grant Cardone, Sell Or Be Sold
#3. Andrew Chen, The Cold Start Problem
#4–7. Stephanie Meyer, The Twilight Saga

Having never read or watched any Twilight before this year, I found them much weirder than I was expecting.

#8. Fuschia Dunlop, Invitation to a Banquet

As featured on CWT!

#9. Iris Murdoch, The Black Prince
#10. David Kushner, Masters of Doom

A history of id Software, the company behind Wolfenstein 3D, Doom, Quake, and the fast inverse square root algorithm. John Carmack is a legendary figure in the software world, and after reading a fictionalized history inspired by id last year (Tomorrow and Tomorrow and Tomorrow) it was good to read the real thing.

#11. Michael Gerber, The E-Myth Revisited
#12. William Gibson, Neuromancer

A lot of old science fiction is hard to appreciate properly—the best ideas have been sucked out and copied a hundredfold, leaving only the author's weirder musings behind to be appreciated. Neuromancer's been copied as much as any novel, but I was impressed by the pace and general bleakness of this novel; it holds up well.

#13–26. Lois McMaster Bujold, The Vorkosigan Saga

I adored this series, which I read pretty steadily over the course of the year. Bujold writes satisfying, well-constructed plots that keep the focus on characters, not setting. The books fit together nicely, too: each story stands alone, but together paint a decades-long picture of her characters aging, gaining wisdom through their mistakes, and learning to handle the responsibilities placed on them. I think Captain Vorpatril's Alliance is my favorite one.

#27. R. F. Kuang, Babel
#28. Clay Christiansen, The Innovator’s Dilemma

As recommended by Jensen Huang; unlike most business books, this one is worth reading all the way through.

#29. Rob Fitzpatrick, The Mom Test

A canonical book for startup founders, which I probably should have read 1–2 years ago.

#30. Elena Ferrante, My Brilliant Friend

At its core, this is a very similar story to Wicked: a coming-of-age story focusing on the envious and unstable friendship between two women. I liked this book, but haven't yet picked up the rest of the Neopolitan Novels; somehow keeping track of the names must intimidate me on a subconscious level.

#31. Andy Grove, Only The Paranoid Survive
#32. Vernor Vinge, A Fire Upon The Deep

I liked this book a lot. I would have adored it if I'd read it as a kid, I think; there's something viscerally compelling about Vinge's "Zones of Thought."

#33. C. S. Lewis, The Discarded Image

This book examines what medieval Europeans thought of the world: how did they see their universe and their place in it? This is a surprisingly subtle question: obviously they were Christian, but their cosmology was considerably different than what even the most "traditional" modern people believe. Last year, I wrote this about The Canterbury Tales:

Reading Chaucer fills me with questions about the medieval mind. The stories are steeped in Christianity, as one might expect. Any argument goes back to the Bible, even those among animals, and Chaucer assumes a level of familiarity with e.g. the Psalms far exceeding that of most modern Christians. Yet at the same time the Greco-Roman world looms large: Roman gods appear as plot characters in three tales (the Knight’s Tale, the Merchant’s Tale, and the Manciple’s Tale), and Seneca is viewed as a moral authority on par with Scripture. I’m curious how all these beliefs and ideas fit together and welcome any recommendations on this subject.

The Discarded Image exactly answers these questions. If you're at all interested in medieval thought, I highly recommend it.

#34. Jim Collins, Good To Great
#35. R. T. France, The Gospel of Mark
#36. Nathan Azrin, Toilet Training In Less Than One Day

We didn't quite live up to the book's promise, but it took less than a week, so I'm happy.

#37. Tim Keller, Every Good Endeavor
#38. Brad Feld, Venture Deals

Another canonical book for startup founders, which I also probably should have read before now.

#39. Abigail Shrier, Bad Therapy

Shrier invites controversy here as with her other writing. Sweeping conclusions about American youth aside, I found this surprisingly compelling when viewed as a self-help book about how to be less fearful.

#40. Sheldon Vaunaken, A Severe Mercy

Caused me to weep uncontrollably while stuck in a middle seat on a five-hour flight: you've been warned.

#41. Thich Nhat Hanh, You Are Here
#42. Gunther Hagen, This Is Germany: An Art Book
#43. Thomas Malory, Le Mort D’ Arthur
#44. Georgette Heyer, A Civil Contract
#45. Alex Hormozi, $100M Offers
#46. R. F. Kuang, Yellowface
#47. Barry Werth, The Billion-Dollar Molecule

This book is crazy, and I can't believe I hadn't read it before, particularly since I'm not too distant from a lot of the action, professionally or physically. It's framed as a science story, but I think it works even better at conveying the sheer desperation of early-stage startup life.

#48. Diarmid McCullough, The Reformation

The Reformation is much weirder than most people, Protestant or Catholic, realize: I was surprised by the diversity of pre-Reformation religious practice in Europe, which was mostly stamped out in the doctrinal standardization of the 1500s. For both Protestants and Catholics, it became very important to separate "us" from "them," which led to the rise of catechisms, inquisitions, and so on.

This book also soured me on the "Albion's Seed" idea, as popularized by the SSC book review. Viewed in isolation, the Puritans seem like a bunch of religious fanatics, but really McCullough argues that the same impulse predominated all over Europe in a "Reformation of Manners," from Charles Borromeo's Milan to Plymouth Colony. Perhaps it's less about the Puritans and more about the 1620s.

#49. Amy Chua, Battle Hymn Of The Tiger Mother

This book made it back into the discourse, so I decided I'd actually read it—it's much better than I was expecting, and I don't think most of Chua's critics really understand the book. Conclusions for my own parenting have yet to be determined.



I also read good chunks of a number of textbooks this year, including:

Overall, this was a good year for books. As the stress of Rowan has ramped up more, I've found it more difficult to write creatively in my free time, and easier to just read other people's words—this manifests in a much-diminished rate of blogging, and a lot more energy diverted to reading fiction.

Next year, I hope to read:

Happy new year, and feel free to leave book recommendations in the Substack comments!

Are Forcefields Able To Describe Protein Dynamics?

October 11, 2024

This post assumes some knowledge of molecular dynamics and forcefields/molecular mechanics. For readers unfamiliar with these topics, Abhishaike Mahajan has a great guide to these topics on his blog.

Although forcefields are commonplace in all sorts of biomolecular simulation today, there’s a growing body of evidence showing that they often give unreliable results. For instance, here’s Geoff Hutchison criticizing the use of forcefields for small-molecule geometry optimizations:

The use of classical MM methods for optimizing molecular structures having multiple torsional degrees of freedom is only advised if the precision and accuracy of the final structures and rankings obtained from the conformer searches is of little or no concern... current small molecule force fields should not be trusted to produce accurate potential energy surfaces for large molecules, even in the range of “typical organic compounds.” (emphasis added)

Here’s a few other scattered case studies where forcefields have failed:

This list could be a lot longer, but I think the point is clear—even for normal, bio-organic molecules, forcefields often give bad or unreliable answers.

Despite all these results, though, it’s tough to know how bad the problem really is because there have been lots of scientific questions that can only be studied with forcefields. Studying protein conformational motion, for instance, is one of the tasks that forcefields have traditionally been developed for, and the scale and size of the systems in question makes it really challenging to study any other way. So although researchers can show that different forcefields give different answers, it’s tough to quantify how close any of these answers is to the truth, and it’s always been possible to hope that a good forcefield really is describing the underlying motion of the system quite well.

It’s for this reason that I’ve been so fascinated by this April 2024 work from Oliver Unke and co-workers, which studies the dynamics of peptides and proteins using neural network potentials (NNPs). NNPs allow scientists to approach the accuracy of quantum chemical calculations in a tiny fraction of the time by training ML models to reproduce the output of high-level QM-based simulations: although NNPs are still significantly slower than forcefields, they’re typically about 3–6 orders of magnitude faster than the corresponding high-level calculations would be, with only slightly lower accuracy.

A nice overview of the paper.

In this case, Unke and co-workers train a SpookyNet-based NNP to reproduce PBE0/def2-TZVPPD+MBD reference data comprising fragments from the precise systems under study. (MBD refers to Tkatchenko’s many-body dispersion correction, which can be thought of as a fancier alternative to pairwise dispersion corrections like D3 or D4.) In total, about 60 million atom-labeled data points were used to train the NNPs used in this study—which reportedly took 110,000 hours of CPU time to compute, equivalent to 12 CPU-years!

(This might be a nitpick, but I don’t love the use of PBE0 here. Range-separated hybrids are crucial for producing consistent and accurate results for large zwitterionic biomolecules (see e.g. this recent work from Valeev), so it’s possible that the underlying training data isn’t as accurate as it seems.)

The authors find that the resulting NNPs (“GEMS”) perform much better than existing forcefields in terms of overall error metrics: for instance, GEMS has an MAE of 0.45 meV/atom on snapshots of AceAla15Nme structures taken from MD simulations, while Amber has an MAE of 2.27 meV/atom. What’s much more interesting, however, is that GEMS gives significantly different dynamics than forcefields! While Amber simulations of AceAla15Nme predict that a stable α-helix will form at 300 K, GEMS predicts that a mixture of α- and 310 helices exist, which is exactly what’s seen in Ala-rich peptides experimentally. The CHARMM and GROMOS forcefields also get this system wrong, suggesting that GEMS really is significantly more accurate than forcefields at modeling the structure of peptides.

Amber-based simulations stay in one configuration, while GEMS-based simulations are significantly more flexible.

The authors next study crambin, a small 46-residue protein which is frequently chosen as a model system in papers like this. Similar to what was seen with the Ala15 helices, crambin is significantly more flexible when modeled by GEMS than when modeled with Amber (see below figure). The authors conduct a variety of other analyses, and argue that there are “qualitative differences between simulations with conventional FFs and GEMS on all timescales.” This is an incredibly significant result, and one that casts doubt on literal decades of forcefield-based MD simulations. Think about what this means for Relay’s MD-based platform, for instance!

A UMAP plot of protein motion through conformational space. (Yes, we all know UMAP is bad, but this is still a nice plot!)

Why do Amber and GEMS differ so much here? Here’s what Unke and coworkers think is going on:

AmberFF is a conventional FF, and as such, models bonded interactions with harmonic terms. Consequently, structural fluctuations on small timescales are mostly related to these terms. Intermediate-scale conformational changes as involved in, for example, the “flipping” of the dihedral angle in the disulfide bridges of crambin, on the other hand, can only be mediated by (nonbonded) electrostatic and dispersion terms, because the vast majority of (local) bonded terms stay unchanged for all conformations. On the other hand, GEMS makes no distinction between bonded and non-bonded terms, and individual contributions are not restricted to harmonic potentials or any other fixed functional form. Consequently, it can be expected that large structural fluctuations for AmberFF always correspond to “rare events” associated with large energy barriers, whereas GEMS dynamics arise from a richer interplay between chemical bonds and nonlocal interactions.

The overall idea that (1) forcefields impose an unphysical distinction between bonded and non-bonded interactions, and (2) this distinction leads to strange dynamical effects makes sense to me. There’s parts of this discussion that I don’t fully understand—what’s to stop a large structural fluctuation in Amber from having a small barrier? Aren’t all high-barrier processes “rare events” irrespective of where the barrier comes from?

There are some obvious caveats here that mean this sort of strategy isn’t ready for widespread adoption yet. These aren’t foundation models; the authors create a new model for each peptide and protein under study by adding system-specific fragments to the training data and retraining the NNP. This takes “between 1 and 2 weeks, depending on the system,” not counting the cost of running all the DFT calculations, so this is far too expensive and slow for routine use. While this might seem like a failure, I think it’s worth reflecting on how tough this problem is. Crambin alone has thousands of degrees of freedom, not counting the surrounding water molecules, and accurately reproducing the results of the Schrodinger equation for this system is an incredible feat. The fact that we can’t automatically also solve this problem in a zero-shot manner for every other protein is hardly a failure, particularly because it seems very likely that scaling these models will dramatically improve their generalizability!

The other big limitation is inference speed: the SpookyNet-based NNPs are about 250x slower than a conventional forcefield, so it’s much tougher to access the long timescales that are needed to simulate processes like protein folding. There are a lot of techniques that can help address these problems: NNPs can become faster and not require system-specific retraining, coarse graining can reduce the number of particles in the system, and Boltzmann generators can reduce the number of evaluations needed. So the future is bright, but there’s clearly a lot of ML engineering and applied research that will be needed to help NNP-based simulations scale.

But overall, I think this is a very significant piece of work, and one that should make anyone adjacent to forcefield-based MD pause and take note. One day it will be possible to run simulations like this just as quickly as people run regular MD simulations today, and I can’t wait to see what comes of that.

Thanks to Abhishaike Mahajan for helpful feedback on this post.

Robots Won't Solve Organic Synthesis

September 17, 2024

Abhishaike Mahajan recently wrote an excellent piece on how generative ML in chemistry is bottlenecked by synthesis (disclaimer: I gave some comments on the piece, so I may be biased). One of the common reactions to this piece has been that self-driving labs and robotics will soon solve this problem—this is a pretty common sentiment, and one that I’ve heard a lot.

Unfortunately, I think that the strongest version of this take is wrong: organic synthesis won’t be “solved” by just replacing laboratory scientists with robots, because (1) figuring out what reactions to run is hard and (2) running reactions is even harder and (3) we need scientific advances to fix this, not just engineering.

Predicting What Reactions To Run Is Hard

Organic molecules are typically made through a sequence of reactions, and figuring out how to make a molecule involves both the strategic question of which reactions to run in what order and the tactical question of how to run each reaction.

There’s been a ton of work on both of these problems, and it’s certainly true that computer-assisted retrosynthesis tools have come a long way in the last decade! But retrosynthesis is one of those problems that’s (relatively) easy to be good at and almost impossible to be great at. In part, this is because data in this field tends to be very bad: publications and patents are full of irreproducible or misreported reactions, and negative results are virtually never reported. (This post by Derek Lowe is a good overview of some of the problems that the field faces.)

But also, the problems are just hard! I got the chance to try out one of the leading retrosynthesis software packages back in my career as an organic chemist, and when we fed it some of the tough synthetic problems we were facing, it gave us all the logical suggestions that we had already tried (unsuccessfully) and then began suggesting insane reactions to us. I can’t really blame the model for not being able to invent new chemistry—but this illustrates the limits of what pure retrosynthesis can accomplish, absent new scientific discoveries.

The tactical problem of optimizing reaction conditions is also difficult. In cases where there are a lot of continuous variables (like temperatures or concentrations), conventional optimization methods like design-of-experiments can work well—but where reagents or catalysts are involved, optimization becomes significantly more challenging. Lots of cheminformatics/small-data ML work has been done in this area, but it’s still not straightforward to reliably take a reaction drawn on paper and get it to work in the lab.

Running Reactions Is Even Harder

All of the above problems are, in principle, solvable. Where I think robotics is likely to struggle even more is in the actual execution of these routes. Synthetic organic chemistry is an arcane and specialized craft that typically requires at least five years of training to be decent at—most published reaction procedures assume that the reader is themselves a trained organic chemist, and omit most of the “obvious” details that are needed to unambiguously specify a sequence of steps. (The incredibly detailed procedures in Organic Syntheses illustrate just how much is missing from the average publication.)

My favorite illustration of how irreproducible organic chemistry can be is BlogSyn, a brief project that aimed to anonymously assess how easily published reactions could be reproduced. The second BlogSyn post found that a reported olefination of pyridine could not be reproduced—the original author of the paper, Jin-Quan Yu (Scripps) responded, and the shape of reaction tube was ultimately found to be critical to reaction success.

The third BlogSyn post found that an IBX-mediated benzylic oxidation reported by Phil Baran (also of Scripps) could not be reproduced at all as written. Phil and his co-authors responded pretty aggressively, and after several weeks of back-and-forth it was ultimately found that the reaction could be reproduced after modifying virtually every parameter. A comment from Phil’s co-author Tamsyn illustrates some of the complexities at play:

There is in [BlogSyn’s] discussion a throw away comment about the 2-methylnaphthalene not being volatile. Have you never showered and then left your hair to dry at room temperature? – water evaporates at RT, just as 2-methylnaphthalene does at 95 ºC. I suggest to you that at the working temperatures of this reaction, the biggest problem may be substrate evaporation (or “hanging out” on the colder parts of the flask as Phil said)... We need fluorobenzene to reflux in these reactions and in so-doing wash substrate back into the reaction from the walls of the vessel, but it clearly slows/inhibits the reaction also – so, we need to tune this balance carefully and with patience. Scale will have a big influence on how well this process works.

Tamsyn is, of course, right—volatile substrates can evaporate, and part of setting up a reaction is thinking about the vapor pressure of your substrates and how you can address this. But this sort of thinking requires a trained chemist, and isn’t easily automated. There are a million judgment calls to make in organic synthesis—what concentration to use, how quickly to add the reagent, how to work up the reaction, what extraction solvent to use, and so on—and it’s hard enough to teach first-year graduate students how to do all this, let alone robots. Perhaps at the limit as robots achieve AGI this will be possible, but for now these remain difficult problems.

We Need Scientific Advances To Fix This

What can be done, then?

From a synthetic point of view, we need more robust reactions. Lots of academics work on reaction development, but the list of truly reliable reactions remains miniscule: amide couplings, Suzuki couplings, addition to Ellman auxiliaries, SuFFEx chemistry, and so on. From a practical point of view, every reaction like this is worth a thousand random papers with a terrible substrate scope (Sharpless said it better in 2001 than I ever could; see also this 2015 study about how basically no new reactions are used in industry). Approaches like skeletal editing are incredibly exciting, but there’s a limit to how impactful any non-general methodology can be.

Perhaps even more important is finding better methods for reaction purification. Purification is one of those topics which doesn’t get a lot of academic attention, but being able to efficiently automate purification unlocks a whole new set of possibilities. Solid-phase synthesis (which makes purification as simple as rinsing off some beads) has always seen some amount of use in organic chemistry, but a lot of commonly-used reactions aren’t compatible with solid support: either new supports or new reactions could address this problem. There are also cool approaches like Marty Burke’s “catch-and-release” boronate platform which haven’t yet seen broad adoption.

Ultimately, I share the dream of the robotics enthusiasts: if we’re able to make organic synthesis routine, we can stop worrying about how to make molecules and start thinking about what to make! I’m very optimistic about the opportunity of new technologies to address synthetic bottlenecks and enable higher-throughput data generation in chemistry. But getting to this point will take not only laboratory automation but also a ton of scientific progress in organic chemistry, and the first step in solving these problems is actually taking them seriously and recognizing that they’re unsolved.

Thanks to Abhishaike Mahajan and Ari Wagen for helpful comments about this post.

Choosing Maximally Dissimilar Molecules

July 10, 2024

I've been playing around with generating non-equilibrium conformations by molecular dynamics recently, and I've been thinking about how to best parse the outputs of a dynamics simulation. A technique I've seen quite often in the literature is "choose a dissimilar subset of conformers by RMSD"—for instance, here's what the SPICE paper says:

For each of the 677 molecules, the dataset includes 50 conformations of which half are low energy and half are high energy. To generate them, RDKit 2020.09.3 was first used to generate 10 diverse conformations. Each was used as a starting point for 100 ps of molecular dynamics at a temperature of 500 K using OpenMM 7.6 and the Amber14 force field. A conformation was saved every 10 ps to produce 100 candidate high energy conformations. From these, a subset of 25 was selected that were maximally different from each other as measured by all atom RMSD.

This makes a good amount of sense: you want to choose conformers which cover as much chemical space as possible so that you get information about the PES as efficiently as possible, and RMSD is a cheap and reasonable way to do this. But how do you actually do this in practice? Nothing super helpful came up after a quick Google search, so I wrote a little script myself:

import cctk
import numpy as np
from sklearn.cluster import AgglomerativeClustering
import sys
import tqdm
import copy

e = cctk.XYZFile.read_file(sys.argv[1]).ensemble
molecules = e.molecule_list()

rmsd_matrix = np.zeros((len(molecules), len(molecules)))
comparison_atoms = molecules[0].get_heavy_atoms()

def compute_rmsd(mol1: cctk.Molecule, mol2: cctk.Molecule) -> float:
    geom1 = copy.deepcopy(mol1.geometry[comparison_atoms])
    geom1 -= geom1.mean(axis=0)

    geom2 = copy.deepcopy(mol2.geometry[comparison_atoms])
    geom2 -= geom2.mean(axis=0)

    return cctk.helper_functions.compute_RMSD(geom1, geom2)


for i in tqdm.tqdm(range(len(molecules))):
    for j in range(i + 1, len(molecules)):
        rmsd_matrix[i, j] = compute_rmsd(molecules[i], molecules[j])
        rmsd_matrix[j, i] = rmsd_matrix[i, j]

clustering = AgglomerativeClustering(
    n_clusters=50,
    metric="precomputed",
    linkage="average"
)
clustering.fit(rmsd_matrix)

selected_molecules: list[int] = []
for cluster_id in range(50):
    cluster_indices = np.where(clustering.labels_ == cluster_id)[0]
    selected_molecule = cluster_indices[
        np.argmin(rmsd_matrix[cluster_indices].sum(axis=1))
    ]
    selected_molecules.append(selected_molecule)

e2 = cctk.ConformationalEnsemble()
for idx in selected_molecules:
    e2.add_molecule(molecules[idx])

cctk.XYZFile.write_ensemble_to_file(sys.argv[2], e2)
print("done!")

This script uses agglomerative clustering to sort conformations into clusters, but could easily be adapted to work with other algorithms. To run this script, simply paste into into a file (choose_dissimilar.py) and run:

python choose_dissimilar.py input.xyz output.xyz

This will dump 50 output conformers into output.xyz. Hopefully this saves someone some time... happy computing!

Running Simple MD Simulations

June 13, 2024

Scientific software can be confusing, particularly when you're doing something that the software isn't primarily intended for. I often find myself wanting to run quick-and-dirty molecular dynamics simulations on small organic molecules, but I've struggled to find an easy way to do this using open-source tools like OpenMM.

This is particularly frustrating since I feel like I should be equipped to succeed at this task:

Yet despite all this, I've probably tried and failed to use OpenMM for my research a half-dozen times over the past five years (evidence). I always get bogged down somewhere: I don't have PDB files for my compounds, or I can't figure out how to get the forcefield parameters right for a small molecule, or I just get lost in a sea of similar-sounding classes and objects. Part of the problem here is that all the "intro to MD" tutorials seem to assume that you're hoping to run an MD simulation on a protein from the PDB—so if you have an .xyz file, it's not obvious how to proceed.

Nevertheless, I've finally succeeded. Here's the code, with minor annotations interspersed:

from openff.toolkit import Molecule, Topology

from openmm import *
from openmm.app import *

import nglview
import mdtraj
import matplotlib.pyplot as plt
import numpy as np
import openmoltools
import tempfile
import cctk
import openmmtools
import math
from random import random, randint

from sys import stdout
import pandas as pd

from rdkit import Chem
from rdkit.Chem import AllChem

from openmmforcefields.generators import SMIRNOFFTemplateGenerator

%config InlineBackend.figure_format='retina'

Already, we're off to a complex start: we need OpenFF, OpenMM, openmoltools, openmmtools, and openmmforcefields (not to mention nglview and mdtraj). There's a broader point to be made here about the state of scientific software and how this relates to academic incentive structure, but I digress...

smiles = "c1cc(F)ccc1O"

def generate_forcefield(smiles: str) -> ForceField:
    """ Creates an OpenMM ForceField object that knows how to handle a given SMILES string """
    molecule = Molecule.from_smiles(smiles)
    smirnoff = SMIRNOFFTemplateGenerator(molecules=molecule)
    forcefield = ForceField(
      'amber/protein.ff14SB.xml',
      'amber/tip3p_standard.xml',
      'amber/tip3p_HFE_multivalent.xml'
     )
    forcefield.registerTemplateGenerator(smirnoff.generator)
    return forcefield

def generate_initial_pdb(
    smiles: str,
    min_side_length: int = 25, # Å
    solvent_smiles = "O",
) -> PDBFile:
    """ Creates a PDB file for a solvated molecule, starting from two SMILES strings. """

    # do some math to figure how big the box needs to be
    solute = cctk.Molecule.new_from_smiles(smiles)
    solute_volume = solute.volume(qhull=True)
    solvent = cctk.Molecule.new_from_smiles(solvent_smiles)
    solvent_volume = solvent.volume(qhull=False)

    total_volume = 50 * solute_volume # seems safe?
    min_allowed_volume = min_side_length ** 3
    total_volume = max(min_allowed_volume, total_volume)

    total_solvent_volume = total_volume - solute_volume
    n_solvent = int(total_solvent_volume // solvent_volume)
    box_size = total_volume ** (1/3)

    # build pdb
    with tempfile.TemporaryDirectory() as tempdir:
        solute_fname = f"{tempdir}/solute.pdb"
        solvent_fname = f"{tempdir}/solvent.pdb"
        system_fname = f"system.pdb"

        smiles_to_pdb(smiles, solute_fname)
        smiles_to_pdb(solvent_smiles, solvent_fname)
        traj_packmol = openmoltools.packmol.pack_box(
          [solute_fname, solvent_fname],
          [1, n_solvent],
          box_size=box_size
         )
        traj_packmol.save_pdb(system_fname)

        return PDBFile(system_fname)

def smiles_to_pdb(smiles: str, filename: str) -> None:
    """ Turns a SMILES string into a PDB file (written to current working directory). """
    m = Chem.MolFromSmiles(smiles)
    mh = Chem.AddHs(m)
    AllChem.EmbedMolecule(mh)
    Chem.MolToPDBFile(mh, filename)

forcefield = generate_forcefield(smiles)
pdb = generate_initial_pdb(smiles, solvent_smiles="O")

system = forcefield.createSystem(
    pdb.topology,
    nonbondedMethod=PME,
    nonbondedCutoff=1*unit.nanometer,
)

This code turns a SMILES string representing our molecule into an OpenMM System, which is a core object in the OpenMM ecosystem. To do this, we have to do a lot of shenanigans involving figuring out how many solvent molecules to add, calling PACKMOL, etc. One of the key steps here is the SMIRNOFFTemplateGenerator (documented here), which uses one of the recent OpenFF forcefields to describe our chosen molecule.

# initialize Langevin integrator and minimize
integrator = LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 1 * unit.femtoseconds)
simulation = Simulation(pdb.topology, system, integrator)
simulation.context.setPositions(pdb.positions)
simulation.minimizeEnergy()

# we'll make this an NPT simulation now
system.addForce(MonteCarloBarostat(1*unit.bar, 300*unit.kelvin))
simulation.context.reinitialize(preserveState=True)

checkpoint_interval = 100
printout_interval = 10000

# set the reporters collecting the MD output.
simulation.reporters = []
simulation.reporters.append(DCDReporter("traj_01.dcd", checkpoint_interval))
simulation.reporters.append(
    StateDataReporter(
        stdout,
        printout_interval,
        step=True,
        temperature=True,
        elapsedTime=True,
        volume=True,
        density=True
    )
)

simulation.reporters.append(
    StateDataReporter(
        "scalars_01.csv",
        checkpoint_interval,
        time=True,
        potentialEnergy=True,
        totalEnergy=True,
        temperature=True,
        volume=True,
        density=True,
    )
)

# actually run the MD
simulation.step(500000) # this is the number of steps, you may want fewer to test quickly

Here, we configure the settings that will be familiar to people who read MD textbooks. I've chosen to use a Langevin integrator/thermostat to control temperature, and added a Monte Carlo barostat to make this an NPT simulation (constant pressure & temperature). This isn't appropriate for all uses, but it means we don't have to worry about getting the box size exactly right. We also configure how we're going to save data from the simulation, and then the last line actually runs it (this might take a while if you don't have a GPU).

Once we've run this simulation, we can visualize the output and watch things wiggle around! The MDTraj trajectory is a pretty versatile object so you can do much more analysis here if you want to.

# build MDTraj trajectory
t = mdtraj.load("traj_01.dcd", top="init_01.pdb")

# visualize the trajectory - this works in jupyter at least
view = nglview.show_mdtraj(t)
view.clear_representations()
view.add_licorice()
view.add_unitcell()
view
NGLView is actually pretty great, except if you're forming or breaking bonds.

It's also good to check that the thermostat actually worked and nothing blew up:

# check that the thermostat is working
df = pd.read_csv("scalars_01.csv")
df.plot(kind="line", x='#"Time (ps)"', y="Temperature (K)")

This is probably laughably simple to an OpenMM expert, but it works well enough for me. Here's my current list of things I still don't know how to do:

Still, I hope this script is a useful asset to the community, and helps other people not make the same mistakes I did.

Thanks to Dominic Rufa for answering some of my stupid questions.