Are Forcefields Able To Describe Protein Dynamics?

October 11, 2024

This post assumes some knowledge of molecular dynamics and forcefields/molecular mechanics. For readers unfamiliar with these topics, Abhishaike Mahajan has a great introductory guide on his blog.

Although forcefields are commonplace in all sorts of biomolecular simulation today, there’s a growing body of evidence showing that they often give unreliable results. For instance, here’s Geoff Hutchison criticizing the use of forcefields for small-molecule geometry optimizations:

The use of classical MM methods for optimizing molecular structures having multiple torsional degrees of freedom is only advised if the precision and accuracy of the final structures and rankings obtained from the conformer searches is of little or no concern... current small molecule force fields should not be trusted to produce accurate potential energy surfaces for large molecules, even in the range of “typical organic compounds.” (emphasis added)

Here are a few other scattered case studies where forcefields have failed:

This list could be a lot longer, but I think the point is clear—even for normal, bio-organic molecules, forcefields often give bad or unreliable answers.

Despite all these results, though, it’s tough to know how bad the problem really is, because many scientific questions can only be studied with forcefields. Studying protein conformational motion, for instance, is one of the tasks forcefields have traditionally been developed for, and the scale of the systems in question makes it really challenging to study any other way. So although researchers can show that different forcefields give different answers, it’s tough to quantify how close any of these answers is to the truth, and it’s always been possible to hope that a good forcefield really is describing the underlying motion of the system quite well.

It’s for this reason that I’ve been so fascinated by this April 2024 work from Oliver Unke and co-workers, which studies the dynamics of peptides and proteins using neural network potentials (NNPs). NNPs allow scientists to approach the accuracy of quantum chemical calculations in a tiny fraction of the time by training ML models to reproduce the output of high-level QM-based simulations: although NNPs are still significantly slower than forcefields, they’re typically about 3–6 orders of magnitude faster than the corresponding high-level calculations would be, with only slightly lower accuracy.
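To make the recipe concrete, here’s a deliberately tiny cartoon of NNP training in Python: fit a model to reproduce reference energies, then use it in place of the expensive calculation. Everything below—the random five-atom “geometries,” the fake energy function standing in for QM labels, the inverse-distance descriptor, the small sklearn network—is a made-up illustration, not what Unke and co-workers actually do (SpookyNet is a far more sophisticated, symmetry-aware architecture trained on real energies and forces):

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def fake_qm_energy(coords: np.ndarray) -> float:
    # stand-in for an expensive QM calculation: a smooth function of pairwise distances
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    iu = np.triu_indices(len(coords), k=1)
    return float(np.sum((dists[iu] - 1.5) ** 2))

def featurize(coords: np.ndarray) -> np.ndarray:
    # crude translation/rotation-invariant descriptor: inverse pairwise distances
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    iu = np.triu_indices(len(coords), k=1)
    return 1.0 / dists[iu]

# label random five-atom "geometries" with the reference energy
geometries = [rng.uniform(0.5, 3.0, size=(5, 3)) for _ in range(2000)]
X = np.array([featurize(g) for g in geometries])
y = np.array([fake_qm_energy(g) for g in geometries])

# "train the potential": fit the network to reproduce the reference energies
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
model.fit(X[:1500], y[:1500])
print("held-out R^2:", model.score(X[1500:], y[1500:]))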

A nice overview of the paper.

In this case, Unke and co-workers train a SpookyNet-based NNP to reproduce PBE0/def2-TZVPPD+MBD reference data comprising fragments from the precise systems under study. (MBD refers to Tkatchenko’s many-body dispersion correction, which can be thought of as a fancier alternative to pairwise dispersion corrections like D3 or D4.) In total, about 60 million atom-labeled data points were used to train the NNPs used in this study—which reportedly took 110,000 hours of CPU time to compute, equivalent to 12 CPU-years!

(This might be a nitpick, but I don’t love the use of PBE0 here. Range-separated hybrids are crucial for producing consistent and accurate results for large zwitterionic biomolecules (see e.g. this recent work from Valeev), so it’s possible that the underlying training data isn’t as accurate as it seems.)

The authors find that the resulting NNPs (“GEMS”) perform much better than existing forcefields in terms of overall error metrics: for instance, GEMS has an MAE of 0.45 meV/atom on snapshots of Ace-Ala₁₅-Nme structures taken from MD simulations, while Amber has an MAE of 2.27 meV/atom. What’s much more interesting, however, is that GEMS gives significantly different dynamics than forcefields! While Amber simulations of Ace-Ala₁₅-Nme predict that a stable α-helix will form at 300 K, GEMS predicts that a mixture of α- and 3₁₀-helices exists, which is exactly what’s seen experimentally in Ala-rich peptides. The CHARMM and GROMOS forcefields also get this system wrong, suggesting that GEMS really is significantly more accurate than forcefields at modeling the structure of peptides.

Amber-based simulations stay in one configuration, while GEMS-based simulations are significantly more flexible.

The authors next study crambin, a small 46-residue protein frequently chosen as a model system in papers like this. Similar to what was seen with the Ala₁₅ helices, crambin is significantly more flexible when modeled by GEMS than when modeled with Amber (see the figure below). The authors conduct a variety of other analyses and argue that there are “qualitative differences between simulations with conventional FFs and GEMS on all timescales.” This is an incredibly significant result, and one that casts doubt on literal decades of forcefield-based MD simulations. Think about what this means for Relay’s MD-based platform, for instance!

A UMAP plot of protein motion through conformational space. (Yes, we all know UMAP is bad, but this is still a nice plot!)

Why do Amber and GEMS differ so much here? Here’s what Unke and co-workers think is going on:

AmberFF is a conventional FF, and as such, models bonded interactions with harmonic terms. Consequently, structural fluctuations on small timescales are mostly related to these terms. Intermediate-scale conformational changes as involved in, for example, the “flipping” of the dihedral angle in the disulfide bridges of crambin, on the other hand, can only be mediated by (nonbonded) electrostatic and dispersion terms, because the vast majority of (local) bonded terms stay unchanged for all conformations. On the other hand, GEMS makes no distinction between bonded and non-bonded terms, and individual contributions are not restricted to harmonic potentials or any other fixed functional form. Consequently, it can be expected that large structural fluctuations for AmberFF always correspond to “rare events” associated with large energy barriers, whereas GEMS dynamics arise from a richer interplay between chemical bonds and nonlocal interactions.

The overall idea that (1) forcefields impose an unphysical distinction between bonded and non-bonded interactions and (2) this distinction leads to strange dynamical effects makes sense to me. Still, there are parts of this discussion that I don’t fully understand—what’s to stop a large structural fluctuation in Amber from having a small barrier? Aren’t all high-barrier processes “rare events,” irrespective of where the barrier comes from?

There are some obvious caveats here that mean this sort of strategy isn’t ready for widespread adoption yet. These aren’t foundation models; the authors create a new model for each peptide and protein under study by adding system-specific fragments to the training data and retraining the NNP. This takes “between 1 and 2 weeks, depending on the system,” not counting the cost of running all the DFT calculations, so this is far too expensive and slow for routine use. While this might seem like a failure, I think it’s worth reflecting on how tough this problem is. Crambin alone has thousands of degrees of freedom, not counting the surrounding water molecules, and accurately reproducing the results of the Schrödinger equation for this system is an incredible feat. The fact that we can’t also solve this problem zero-shot for every other protein is hardly a failure, particularly because it seems very likely that scaling these models will dramatically improve their generalizability!

The other big limitation is inference speed: the SpookyNet-based NNPs are about 250x slower than a conventional forcefield, so it’s much tougher to access the long timescales that are needed to simulate processes like protein folding. There are a lot of techniques that can help address these problems: NNPs can become faster and not require system-specific retraining, coarse graining can reduce the number of particles in the system, and Boltzmann generators can reduce the number of evaluations needed. So the future is bright, but there’s clearly a lot of ML engineering and applied research that will be needed to help NNP-based simulations scale.

But overall, I think this is a very significant piece of work, and one that should make anyone adjacent to forcefield-based MD pause and take note. One day it will be possible to run simulations like this just as quickly as people run regular MD simulations today, and I can’t wait to see what comes of that.

Thanks to Abhishaike Mahajan for helpful feedback on this post.

Robots Won't Solve Organic Synthesis

September 17, 2024

Abhishaike Mahajan recently wrote an excellent piece on how generative ML in chemistry is bottlenecked by synthesis (disclaimer: I gave some comments on the piece, so I may be biased). One of the most common reactions to the piece has been that self-driving labs and robotics will soon solve this problem—a sentiment I’ve heard a lot elsewhere, too.

Unfortunately, I think that the strongest version of this take is wrong: organic synthesis won’t be “solved” by just replacing laboratory scientists with robots, because (1) figuring out what reactions to run is hard, (2) actually running those reactions is even harder, and (3) fixing this will take scientific advances, not just engineering.

Predicting What Reactions To Run Is Hard

Organic molecules are typically made through a sequence of reactions, and figuring out how to make a molecule involves both the strategic question of which reactions to run in what order and the tactical question of how to run each reaction.

There’s been a ton of work on both of these problems, and it’s certainly true that computer-assisted retrosynthesis tools have come a long way in the last decade! But retrosynthesis is one of those problems that’s (relatively) easy to be good at and almost impossible to be great at. In part, this is because data in this field tends to be very bad: publications and patents are full of irreproducible or misreported reactions, and negative results are virtually never reported. (This post by Derek Lowe is a good overview of some of the problems that the field faces.)

But also, the problems are just hard! I got the chance to try out one of the leading retrosynthesis software packages back in my career as an organic chemist, and when we fed it some of the tough synthetic problems we were facing, it gave us all the logical suggestions that we had already tried (unsuccessfully) and then began suggesting insane reactions to us. I can’t really blame the model for not being able to invent new chemistry—but this illustrates the limits of what pure retrosynthesis can accomplish, absent new scientific discoveries.

The tactical problem of optimizing reaction conditions is also difficult. In cases where there are a lot of continuous variables (like temperatures or concentrations), conventional optimization methods like design-of-experiments can work well—but where reagents or catalysts are involved, optimization becomes significantly more challenging. Lots of cheminformatics/small-data ML work has been done in this area, but it’s still not straightforward to reliably take a reaction drawn on paper and get it to work in the lab.

Running Reactions Is Even Harder

All of the above problems are, in principle, solvable. Where I think robotics is likely to struggle even more is in the actual execution of these routes. Synthetic organic chemistry is an arcane and specialized craft that typically requires at least five years of training to be decent at—most published reaction procedures assume that the reader is themselves a trained organic chemist, and omit most of the “obvious” details that are needed to unambiguously specify a sequence of steps. (The incredibly detailed procedures in Organic Syntheses illustrate just how much is missing from the average publication.)

My favorite illustration of how irreproducible organic chemistry can be is BlogSyn, a brief project that aimed to anonymously assess how easily published reactions could be reproduced. The second BlogSyn post found that a reported olefination of pyridine could not be reproduced—the original author of the paper, Jin-Quan Yu (Scripps), responded, and the shape of the reaction tube was ultimately found to be critical to the reaction’s success.

The third BlogSyn post found that an IBX-mediated benzylic oxidation reported by Phil Baran (also of Scripps) could not be reproduced at all as written. Phil and his co-authors responded pretty aggressively, and after several weeks of back-and-forth it was ultimately found that the reaction could be reproduced after modifying virtually every parameter. A comment from Phil’s co-author Tamsyn illustrates some of the complexities at play:

There is in [BlogSyn’s] discussion a throw away comment about the 2-methylnaphthalene not being volatile. Have you never showered and then left your hair to dry at room temperature? – water evaporates at RT, just as 2-methylnaphthalene does at 95 ºC. I suggest to you that at the working temperatures of this reaction, the biggest problem may be substrate evaporation (or “hanging out” on the colder parts of the flask as Phil said)... We need fluorobenzene to reflux in these reactions and in so-doing wash substrate back into the reaction from the walls of the vessel, but it clearly slows/inhibits the reaction also – so, we need to tune this balance carefully and with patience. Scale will have a big influence on how well this process works.

Tamsyn is, of course, right—volatile substrates can evaporate, and part of setting up a reaction is thinking about the vapor pressure of your substrates and how you can address this. But this sort of thinking requires a trained chemist, and isn’t easily automated. There are a million judgment calls to make in organic synthesis—what concentration to use, how quickly to add the reagent, how to work up the reaction, what extraction solvent to use, and so on—and it’s hard enough to teach first-year graduate students how to do all this, let alone robots. Perhaps at the limit as robots achieve AGI this will be possible, but for now these remain difficult problems.

We Need Scientific Advances To Fix This

What can be done, then?

From a synthetic point of view, we need more robust reactions. Lots of academics work on reaction development, but the list of truly reliable reactions remains minuscule: amide couplings, Suzuki couplings, addition to Ellman auxiliaries, SuFEx chemistry, and so on. From a practical point of view, every reaction like this is worth a thousand random papers with a terrible substrate scope (Sharpless said it better in 2001 than I ever could; see also this 2015 study about how basically no new reactions are used in industry). Approaches like skeletal editing are incredibly exciting, but there’s a limit to how impactful any non-general methodology can be.

Perhaps even more important is finding better methods for reaction purification. Purification is one of those topics which doesn’t get a lot of academic attention, but being able to efficiently automate purification unlocks a whole new set of possibilities. Solid-phase synthesis (which makes purification as simple as rinsing off some beads) has always seen some amount of use in organic chemistry, but a lot of commonly-used reactions aren’t compatible with solid support: either new supports or new reactions could address this problem. There are also cool approaches like Marty Burke’s “catch-and-release” boronate platform which haven’t yet seen broad adoption.

Ultimately, I share the dream of the robotics enthusiasts: if we’re able to make organic synthesis routine, we can stop worrying about how to make molecules and start thinking about what to make! I’m very optimistic about the opportunity of new technologies to address synthetic bottlenecks and enable higher-throughput data generation in chemistry. But getting to this point will take not only laboratory automation but also a ton of scientific progress in organic chemistry, and the first step in solving these problems is actually taking them seriously and recognizing that they’re unsolved.

Thanks to Abhishaike Mahajan and Ari Wagen for helpful comments about this post.

Choosing Maximally Dissimilar Molecules

July 10, 2024

I've been playing around with generating non-equilibrium conformations by molecular dynamics recently, and I've been thinking about how to best parse the outputs of a dynamics simulation. A technique I've seen quite often in the literature is "choose a dissimilar subset of conformers by RMSD"—for instance, here's what the SPICE paper says:

For each of the 677 molecules, the dataset includes 50 conformations of which half are low energy and half are high energy. To generate them, RDKit 2020.09.3 was first used to generate 10 diverse conformations. Each was used as a starting point for 100 ps of molecular dynamics at a temperature of 500 K using OpenMM 7.6 and the Amber14 force field. A conformation was saved every 10 ps to produce 100 candidate high energy conformations. From these, a subset of 25 was selected that were maximally different from each other as measured by all atom RMSD.

This makes a good amount of sense: you want to choose conformers that cover as much chemical space as possible so that you sample the PES efficiently, and RMSD is a cheap and reasonable way to do this. But how do you actually do it in practice? Nothing super helpful came up after a quick Google search, so I wrote a little script myself:

import cctk
import numpy as np
from sklearn.cluster import AgglomerativeClustering
import sys
import tqdm
import copy

e = cctk.XYZFile.read_file(sys.argv[1]).ensemble
molecules = e.molecule_list()

# pairwise RMSD matrix, computed over heavy atoms only
rmsd_matrix = np.zeros((len(molecules), len(molecules)))
comparison_atoms = molecules[0].get_heavy_atoms()

def compute_rmsd(mol1: cctk.Molecule, mol2: cctk.Molecule) -> float:
    """Heavy-atom RMSD between two conformers, after centering each geometry."""
    geom1 = copy.deepcopy(mol1.geometry[comparison_atoms])
    geom1 -= geom1.mean(axis=0)

    geom2 = copy.deepcopy(mol2.geometry[comparison_atoms])
    geom2 -= geom2.mean(axis=0)

    return cctk.helper_functions.compute_RMSD(geom1, geom2)


# fill in the symmetric matrix of pairwise RMSDs
for i in tqdm.tqdm(range(len(molecules))):
    for j in range(i + 1, len(molecules)):
        rmsd_matrix[i, j] = compute_rmsd(molecules[i], molecules[j])
        rmsd_matrix[j, i] = rmsd_matrix[i, j]

# cluster conformers using the precomputed RMSD matrix
clustering = AgglomerativeClustering(
    n_clusters=50,
    metric="precomputed",
    linkage="average"
)
clustering.fit(rmsd_matrix)

# from each cluster, keep the most central member
# (the one with the lowest total RMSD to all other conformers)
selected_molecules: list[int] = []
for cluster_id in range(50):
    cluster_indices = np.where(clustering.labels_ == cluster_id)[0]
    selected_molecule = cluster_indices[
        np.argmin(rmsd_matrix[cluster_indices].sum(axis=1))
    ]
    selected_molecules.append(selected_molecule)

# assemble the selected conformers into a new ensemble and write it out
e2 = cctk.ConformationalEnsemble()
for idx in selected_molecules:
    e2.add_molecule(molecules[idx])

cctk.XYZFile.write_ensemble_to_file(sys.argv[2], e2)
print("done!")

This script uses agglomerative clustering to sort conformations into clusters, but could easily be adapted to work with other algorithms. To run this script, simply paste it into a file (choose_dissimilar.py) and run:

python choose_dissimilar.py input.xyz output.xyz

This will dump 50 output conformers into output.xyz. Hopefully this saves someone some time... happy computing!
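P.S. If you'd rather skip clustering entirely, RDKit ships a MaxMin picker that greedily selects a maximally diverse subset given any distance function. Here's a sketch reusing the rmsd_matrix and molecules variables from the script above—the LazyPick API is real RDKit, but whether MaxMin or clustering gives better coverage for your ensemble is worth testing:

from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

# greedily pick 50 conformers, each as far as possible (by RMSD)
# from everything already picked
picker = MaxMinPicker()
selected = picker.LazyPick(
    lambda i, j: rmsd_matrix[i, j],  # distance callback into the precomputed matrix
    len(molecules),                  # size of the candidate pool
    50,                              # number of conformers to select
)
print(list(selected))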

Running Simple MD Simulations

June 13, 2024

Scientific software can be confusing, particularly when you're doing something that the software isn't primarily intended for. I often find myself wanting to run quick-and-dirty molecular dynamics simulations on small organic molecules, but I've struggled to find an easy way to do this using open-source tools like OpenMM.

This is particularly frustrating since I feel like I should be equipped to succeed at this task:

Yet despite all this, I've probably tried and failed to use OpenMM for my research a half-dozen times over the past five years (evidence). I always get bogged down somewhere: I don't have PDB files for my compounds, or I can't figure out how to get the forcefield parameters right for a small molecule, or I just get lost in a sea of similar-sounding classes and objects. Part of the problem here is that all the "intro to MD" tutorials seem to assume that you're hoping to run an MD simulation on a protein from the PDB—so if you have an .xyz file, it's not obvious how to proceed.

Nevertheless, I've finally succeeded. Here's the code, with minor annotations interspersed:

from openff.toolkit import Molecule, Topology

from openmm import *
from openmm.app import *

import nglview
import mdtraj
import matplotlib.pyplot as plt
import numpy as np
import openmoltools
import tempfile
import cctk
import openmmtools
import math
from random import random, randint

from sys import stdout
import pandas as pd

from rdkit import Chem
from rdkit.Chem import AllChem

from openmmforcefields.generators import SMIRNOFFTemplateGenerator

%config InlineBackend.figure_format='retina'

Already, we're off to a complex start: we need OpenFF, OpenMM, openmoltools, openmmtools, and openmmforcefields (not to mention nglview and mdtraj). There's a broader point to be made here about the state of scientific software and how this relates to academic incentive structure, but I digress...

smiles = "c1cc(F)ccc1O"

def generate_forcefield(smiles: str) -> ForceField:
    """ Creates an OpenMM ForceField object that knows how to handle a given SMILES string """
    molecule = Molecule.from_smiles(smiles)
    smirnoff = SMIRNOFFTemplateGenerator(molecules=molecule)
    forcefield = ForceField(
      'amber/protein.ff14SB.xml',
      'amber/tip3p_standard.xml',
      'amber/tip3p_HFE_multivalent.xml'
     )
    forcefield.registerTemplateGenerator(smirnoff.generator)
    return forcefield

def generate_initial_pdb(
    smiles: str,
    min_side_length: int = 25, # Å
    solvent_smiles = "O",
) -> PDBFile:
    """ Creates a PDB file for a solvated molecule, starting from two SMILES strings. """

    # do some math to figure out how big the box needs to be
    solute = cctk.Molecule.new_from_smiles(smiles)
    solute_volume = solute.volume(qhull=True)
    solvent = cctk.Molecule.new_from_smiles(solvent_smiles)
    solvent_volume = solvent.volume(qhull=False)

    total_volume = 50 * solute_volume # seems safe?
    min_allowed_volume = min_side_length ** 3
    total_volume = max(min_allowed_volume, total_volume)

    total_solvent_volume = total_volume - solute_volume
    n_solvent = int(total_solvent_volume // solvent_volume)
    box_size = total_volume ** (1/3)

    # build pdb
    with tempfile.TemporaryDirectory() as tempdir:
        solute_fname = f"{tempdir}/solute.pdb"
        solvent_fname = f"{tempdir}/solvent.pdb"
        system_fname = "system.pdb"  # written to the current directory (not the tempdir), so we can reuse it later

        smiles_to_pdb(smiles, solute_fname)
        smiles_to_pdb(solvent_smiles, solvent_fname)
        traj_packmol = openmoltools.packmol.pack_box(
          [solute_fname, solvent_fname],
          [1, n_solvent],
          box_size=box_size
         )
        traj_packmol.save_pdb(system_fname)

        return PDBFile(system_fname)

def smiles_to_pdb(smiles: str, filename: str) -> None:
    """ Turns a SMILES string into a PDB file (written to current working directory). """
    m = Chem.MolFromSmiles(smiles)
    mh = Chem.AddHs(m)
    AllChem.EmbedMolecule(mh)
    Chem.MolToPDBFile(mh, filename)

forcefield = generate_forcefield(smiles)
pdb = generate_initial_pdb(smiles, solvent_smiles="O")

system = forcefield.createSystem(
    pdb.topology,
    nonbondedMethod=PME,
    nonbondedCutoff=1*unit.nanometer,
)

This code turns a SMILES string representing our molecule into an OpenMM System, which is a core object in the OpenMM ecosystem. To do this, we have to do a lot of shenanigans involving figuring out how many solvent molecules to add, calling PACKMOL, etc. One of the key steps here is the SMIRNOFFTemplateGenerator (documented here), which uses one of the recent OpenFF forcefields to describe our chosen molecule.
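As a quick sanity check—my own addition here, not a required part of the workflow—you can poke at the assembled System directly and confirm that the expected force terms made it in:

# list the particle count and force terms in the assembled System
print("number of particles:", system.getNumParticles())
for force in system.getForces():
    print(type(force).__name__)  # e.g. HarmonicBondForce, NonbondedForce, ...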

# initialize Langevin integrator and minimize
integrator = LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 1 * unit.femtoseconds)
simulation = Simulation(pdb.topology, system, integrator)
simulation.context.setPositions(pdb.positions)
simulation.minimizeEnergy()

# we'll make this an NPT simulation now
system.addForce(MonteCarloBarostat(1*unit.bar, 300*unit.kelvin))
simulation.context.reinitialize(preserveState=True)

checkpoint_interval = 100
printout_interval = 10000

# set the reporters collecting the MD output.
simulation.reporters = []
simulation.reporters.append(DCDReporter("traj_01.dcd", checkpoint_interval))
simulation.reporters.append(
    StateDataReporter(
        stdout,
        printout_interval,
        step=True,
        temperature=True,
        elapsedTime=True,
        volume=True,
        density=True
    )
)

simulation.reporters.append(
    StateDataReporter(
        "scalars_01.csv",
        checkpoint_interval,
        time=True,
        potentialEnergy=True,
        totalEnergy=True,
        temperature=True,
        volume=True,
        density=True,
    )
)

# actually run the MD
simulation.step(500000) # this is the number of steps, you may want fewer to test quickly

Here, we configure the settings that will be familiar to people who read MD textbooks. I've chosen to use a Langevin integrator/thermostat to control temperature, and added a Monte Carlo barostat to make this an NPT simulation (constant pressure & temperature). This isn't appropriate for all uses, but it means we don't have to worry about getting the box size exactly right. We also configure how we're going to save data from the simulation, and then the last line actually runs it (this might take a while if you don't have a GPU).

Once we've run this simulation, we can visualize the output and watch things wiggle around! The MDTraj trajectory is a pretty versatile object so you can do much more analysis here if you want to.

# build MDTraj trajectory
t = mdtraj.load("traj_01.dcd", top="system.pdb")  # "system.pdb" is the solvated box we wrote earlier

# visualize the trajectory - this works in jupyter at least
view = nglview.show_mdtraj(t)
view.clear_representations()
view.add_licorice()
view.add_unitcell()
view
NGLView is actually pretty great, except if you're forming or breaking bonds.
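As a small taste of that analysis, here's a sketch computing the radius of gyration for each frame, plus the distance between one pair of atoms (the indices are placeholders—pick atoms that matter for your system). Note that compute_rg runs over everything in the box, solvent included, so slice the trajectory first if you only care about the solute:

# per-frame radius of gyration (nm) -- includes the solvent!
rg = mdtraj.compute_rg(t)

# distance between two atoms over the trajectory; the indices (0, 5)
# are placeholders -- choose atoms meaningful for your molecule
dists = mdtraj.compute_distances(t, [[0, 5]])
print("mean 0-5 distance (nm):", dists.mean())

plt.plot(t.time, rg)
plt.xlabel("time (ps)")
plt.ylabel("radius of gyration (nm)")
plt.show()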

It's also good to check that the thermostat actually worked and nothing blew up:

# check that the thermostat is working
df = pd.read_csv("scalars_01.csv")
df.plot(kind="line", x='#"Time (ps)"', y="Temperature (K)")
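Since we added a barostat, I'd also glance at the density to confirm the box has equilibrated—for a mostly-water box like this one, it should level off near 1 g/mL. (The column name below matches what StateDataReporter writes when density=True.)

# under NPT, density should settle near ~1 g/mL for dilute aqueous systems
df.plot(kind="line", x='#"Time (ps)"', y="Density (g/mL)")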

This is probably laughably simple to an OpenMM expert, but it works well enough for me. Here's my current list of things I still don't know how to do:

Still, I hope this script is a useful asset to the community, and helps other people not make the same mistakes I did.

Thanks to Dominic Rufa for answering some of my stupid questions.

Generative Linker Design

May 10, 2024

Much molecular design today can be boiled down to “put the right functional groups in exactly the right places.” In catalysis, proper positioning of functional groups to complement developing charge or engage in other stabilizing non-covalent interactions with the transition state can lead to vast rate accelerations. A classic demonstration of this is Uyeda and Jacobsen’s enantioselective Claisen rearrangement, where a simple catalyst presents a guanidinium ion to stabilize an anionic region and an electron-rich arene to stabilize a cationic region. Together, these interactions lead to high enantioselectivity and a 250-fold rate increase over the background reaction.

I freely admit this choice of example is biased, but this is a great paper.

While putting the right functional groups in the right positions might sound easy, the underlying interactions are often exquisitely sensitive to distance, which makes finding the right molecular scaffold very challenging. Jeremy Knowles put this nicely in his 1991 perspective on enzyme catalysis:

Although it is too early to generalize, it is evident that in this case [triose phosphate isomerase] at least, the positioning of functionality at the active site of the enzyme needs to be quite precise if full catalytic potency is to be realized… The good news for catalyst engineers is that proper placement of appropriate groups in the right environment seems to be enough. The not-so-good news is that this placement must be very precise.

Proper positioning of various groups isn’t just a problem in catalysis—it’s also very important in drug design. Lots of topics in medicinal chemistry essentially boil down to a variant of the positioning problem:

  1. Recent years have seen an explosion in the number of arene bioisosteres, i.e. “ways to connect two groups the same distance apart as an arene would without using any aromatic rings.” This is nice because you can use these bioisosteres to e.g. tune biophysical properties, like lipophilicity, solubility, and membrane permeability, while ideally not affecting the actual conformation of the molecule too much. Here’s a graphic from Chris Swain showing some options for replacing a 1,4-disubstituted arene:
    Taken from this page.
    (There's lots of work on this; see also the Baran Lab's great overview of bioisosteres.)
  2. “Scaffold hopping” generally means switching the core of a molecule while preserving key interactions and substituents (although the term is a little vague). Scaffold hops often look for new structures that don’t suffer from the ADME/toxicity liabilities of the original structure, which necessitates finding a new core that positions substituents in the same way. Here's a nice example of a scaffold hop in a tyrosine kinase inhibitor, from a review by Hongmao Sun and co-workers:
    (This is Figure 8 in the reference.)
  3. Linker design has become incredibly important for PROTACs and other heterobifunctional molecules, as choosing a good linker is often key to drug efficacy (e.g.). Unfortunately, there’s a vast variety of potential linkers, and it’s far from obvious which one will best balance affinity, bioavailability, stability, and other properties:
    Taken from this review.
  4. One of the toughest parts of fragment-based drug design is finding the right way to connect fragments which bind to different parts of the desired pocket. Here’s what Philine Kirsch and co-workers have to say about this:
    Finding the right linker motif, which orients the individual fragment units in the favourable geometry in relation to each other without introducing too much flexibility whilst maintaining the binding poses of both fragments, can be very challenging. If successful, the combination of two fragments with rather low affinity could result in significantly higher affinity and has the potential to result in “superadditive” contributions of both binding motifs. The challenge in fragment linking is the exploration of the binding mode of both fragments and the identification of an optimal linker. Only in this case, the overall reduced so-called rigid body entropy translates into synergistically-improved affinity.

What’s hard about all these positioning problems is that finding a molecule that orients substituents in a given way is incredibly non-obvious: molecules are inherently discrete and atomic, making it hard to change a distance or angle by a precise percent. You can have two carbon atoms between your substituents, or you can have three carbon atoms, but you can’t have 2.5 carbon atoms. This makes prospective design very challenging: I can model my protein’s active site and figure out that I want an ortho-pyridyl substituent and a tetrazole 8 Å apart at a 30º angle, but working backwards to an actual scaffold almost always requires a lot of trial and error.
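The checking step, at least, is easy to mechanize: given a candidate scaffold, you can embed a 3D structure and measure whether the exit vectors land where you want. Here’s a rough RDKit sketch along those lines—the biphenyl SMILES and atom indices are placeholders I picked for illustration, and a real workflow would scan many conformers rather than trusting a single embedding:

from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

# placeholder scaffold (biphenyl); swap in your actual candidate
mol = Chem.AddHs(Chem.MolFromSmiles("c1ccc(-c2ccccc2)cc1"))
AllChem.EmbedMolecule(mol, randomSeed=0)  # generate a 3D conformer
AllChem.MMFFOptimizeMolecule(mol)         # quick forcefield cleanup

pos = mol.GetConformer().GetPositions()   # (n_atoms, 3) coordinates in Å
i, j = 0, 9  # placeholder indices for the two substituent attachment points
print(f"substituent separation: {np.linalg.norm(pos[i] - pos[j]):.2f} Å")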

A recent paper from Ilia Igashov and co-workers sets out to solve exactly this “inverse design” problem: given two substituents, can we use ML to find a linker that connects them in the desired orientation? Their solution is DiffLinker, a diffusion-based method that takes separate atomic fragments and generates a linker that connects them.

There’s been other work in this area, but the DiffLinker authors argue that their model stands out in a few ways. DiffLinker generally produces more synthetically accessible and drug-like molecules than competing methods, although the relative ranking of models changes significantly from benchmark to benchmark. Also, it’s not limited to joining pairs of fragments: DiffLinker can perform “one-shot generation of the linker between any arbitrary number of fragments,” which lets it vastly outperform other models when linking three or more fragments.

For cases where fragments must be joined in a protein pocket, the authors train a pocket-conditioned model, and show that this model results in many fewer clashes than an unconstrained model. They can use this model to recapitulate known drug structures, which they demonstrate with a known HSP90 inhibitor derived from molecular fragments. (It’s worth noting that the authors got the desired inhibitor structure only 3 times out of 1000 DiffLinker predictions.) They also show that their protein-conditioned model produces molecules that have good binding affinity as assessed by docking (GNINA/Vina), with the huge caveat that docking scores are notoriously inaccurate.

a is the experimental fragments, b is the input to DiffLinker, c is the experimental inhibitor, and d is the DiffLinker-generated inhibitor. Not bad!

There’s still plenty of work that needs to be done here: for instance, the authors readily acknowledge that PROTACs are still too challenging:

While DiffLinker effectively suggests diverse and valid chemical structures in tasks like fragment linking and scaffold hopping, we have observed that generating relevant linkers for PROTAC-like molecules poses a greater challenge. The main difference between these problems lies on the linker length and the distance between the input fragments. While the average linker size in our training sets is around 8 atoms (5 for ZINC, 10 for GEOM, 10 for Pockets), a typical linker in a PROTAC varies between 12 and 20 atoms. It means that the distribution of linkers in PROTACs has different characteristics compared to the distributions of linkers provided in our training sets. Therefore, to improve the performance of DiffLinker in PROTAC design, one may consider retraining the model using more suitable PROTAC data.

DiffLinker is open-source and comes with pre-trained models, so I played around with it a bit myself to see how well it worked. I sketched out a classic meta-terphenyl scaffold, deleted the central phenyl ring, and then asked DiffLinker to connect the now-separated phenyl rings. I was hoping that DiffLinker would come up with one of Enamine’s cool suggestions for meta-arene bioisosteres, but in all five cases I just got back some variant on a benzene ring… which isn’t surprising in hindsight.

DiffLinker also doesn't output hydrogens, which is a little annoying.

Although I don’t think this version of DiffLinker is going to replace humans at any of the tasks I talked about above, this still seems like a pretty cool direction for generative chemical ML. I’m excited to see future versions of methods like DiffLinker that can condition their predictions on other molecular properties, allowing guided exploration of molecular space. (For instance, it would have been nice to request three-dimensional linkers above, so as to avoid getting boring benzenes back.)

I also suspect that DiffLinker, like other generative chemical models, will increase the demand for accurate physics-based methods for refining and validating the output predictions. DiffLinker’s grasp of potential energy surfaces is presumably worse than DFT or other dedicated ML potentials, and a hybrid workflow where DiffLinker generates structures and a higher-quality method optimizes and scores them will probably be much more accurate than just DiffLinker alone. Generative AI is having a moment right now, but for better or worse I think “classic” molecular simulation is here to stay too.