Source: https://www.youtube.com/watch?v=_YEk0YAd-cw

  • We have little understanding of how cells and complex biological systems work
  • We have a “piece-wise” understanding of it
    • what is DNA, RNA, proteins, …
    • but exactly how all the pieces interact and why is still not entirely clear
    • interactions are the hard part to model

Protein interactions

Cancer example

  • PD-1 and PD-L1 proteins
    • PD-1 is a receptor on immune cells (T cells)
    • PD-L1 is expressed on cancer cells
    • PD-L1 binds PD-1 to “turn off” the immune response against the cancer
    • if we can bind PD-L1 before PD-1 does, the immune system can keep fighting the cancer
    • need to design higher-affinity binders

Before AI

  • simulating molecular dynamics
    • the ground truth is the Boltzmann distribution (not a static structure)
      • trying to find the low-energy conformations and the transition states between them
      • one protein has multiple low-energy states (especially in water solution, where water molecules constantly bump into the protein)
    • if you don’t simulate long enough, you might not sample an accurate distribution and miss some of the states
  • the closest approximation to true protein structure came from X-ray crystallography, but protein crystals don’t really occur in nature
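The Boltzmann distribution mentioned above can be sketched with toy numbers: the probability of a conformational state with energy E is proportional to exp(-E/kT). The state names and energies below are made up for illustration.

```python
import math

# Toy sketch: Boltzmann weights over three hypothetical conformations.
kT = 0.593  # approx. kT at 298 K, in kcal/mol
energies = {"state_A": 0.0, "state_B": 1.0, "state_C": 2.5}  # kcal/mol

weights = {s: math.exp(-e / kT) for s, e in energies.items()}
Z = sum(weights.values())  # partition function
probs = {s: w / Z for s, w in weights.items()}

# Lower-energy states dominate, but higher-energy states keep nonzero
# probability -- a simulation that is too short can easily miss them.
```

This is why a single “static” structure is a lossy summary: the ensemble has several populated states.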

MSA

  • evolutionary information
  • indicates which parts of the protein are highly conserved and which are not
  • if you mutate a highly conserved residue, the protein’s properties are likely to change
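The conservation idea above can be made concrete with per-column entropy over an alignment: near-zero entropy means the column is conserved. The tiny alignment here is a made-up toy, not real data.

```python
from collections import Counter
import math

# Toy MSA: four hypothetical aligned sequences (same length).
msa = ["MKVLA", "MKVIA", "MKVLG", "MRVLA"]

def column_entropy(col):
    # Shannon entropy (bits) of the residue distribution in one column.
    counts = Counter(col)
    n = len(col)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

entropies = [column_entropy(col) for col in zip(*msa)]
# Columns with entropy near 0 are highly conserved; mutating those
# residues is most likely to change the protein's properties.
```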

After AI

  • MSA hacking for AlphaFold2

    • sub-sampling the MSA to change the structure AlphaFold2 predicts
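The sub-sampling trick can be sketched as below; the function and names are illustrative of the idea, not the AlphaFold2 interface.

```python
import random

# Hedged sketch: keep the query sequence, randomly keep a few homologs.
def subsample_msa(msa, max_depth, seed=0):
    query, *homologs = msa  # the query always stays first
    rng = random.Random(seed)
    kept = rng.sample(homologs, min(max_depth - 1, len(homologs)))
    return [query] + kept

full_msa = ["QUERYSEQ"] + [f"HOMOLOG_{i}" for i in range(100)]
shallow = subsample_msa(full_msa, max_depth=16)
# Different seeds give different shallow MSAs; feeding these to the
# predictor can steer it toward alternative conformations.
```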
  • Distributional Graphformer (DiG)

    • a first step toward estimating the conformational equilibrium distribution
  • AlphaFold-Multimer

    • takes into account interactions between proteins
    • tells you the quality of the predicted interfaces
  • ESMFold (built on ESM-2)

    • an alternative to AlphaFold2
    • the base is a BERT-style masked-language-model objective on amino acid sequences
    • put an Evoformer-like folding module on top of the embeddings, and it predicts something pretty good
  • AlphaFlow

    • generalization of AlphaFold2 to flow matching
    • tested on fold-switching proteins
      • example: KaiB
        • 10% of the time it is in one conformation (the fold-switch state)
        • 90% of the time in another
        • dictated by the circadian rhythm
  • physics-informed ML

  • not done yet

    • AlphaFlow but on AlphaFold-Multimer
  • training/validation splits are extremely crucial

    • many models don’t generalize well because of poor splits
    • need to split based on sequence similarity or structural similarity, not randomly
    • DiffDock-L split its data in a good way
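A similarity-aware split can be sketched as: cluster sequences first, then assign whole clusters to one side, so near-duplicates never straddle the split. Real pipelines use tools like MMseqs2; the greedy clustering and identity function below are toy stand-ins.

```python
# Hedged sketch of a similarity-aware train/validation split.
def identity(a, b):
    # Toy sequence identity for equal-or-unequal-length strings.
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.4):
    # Assign each sequence to the first cluster whose representative
    # it resembles; otherwise start a new cluster.
    clusters = []
    for s in seqs:
        for c in clusters:
            if identity(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

seqs = ["MKVLA", "MKVIA", "QQQQQ", "QQQQP"]
clusters = greedy_cluster(seqs)
train = [s for c in clusters[::2] for s in c]  # alternate whole clusters
val = [s for c in clusters[1::2] for s in c]
```

Because whole clusters move together, a validation sequence never has a near-identical twin in training.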
  • multi-modal / all-atom models

    • RoseTTAFold All-Atom
  • Usual workflow

    • RFdiffusion (for performing “surgery” on structures)
      • built on the RoseTTAFold backbone
      • a diffusion model over protein 3D structures
      • motif scaffolding
        • a sort of inpainting that keeps specified protein motifs fixed
      • fold conditioning
        • conditioning generation on a tertiary structure
      • it can design binders for you
        • can condition the generation to bind to a target protein
    • LigandMPNN
      • for designing the sequences
      • can bias the design towards or away from certain residues
      • inverse folding: given a 3D structure, it gives you a sequence that folds into it
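The “bias towards or away from certain residues” idea can be sketched as adding a per-residue offset to the model’s logits before sampling. This illustrates the concept only; it is not the LigandMPNN interface, and the logits and bias values are made up.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sample_residue(logits, bias, rng):
    # Add the bias to each amino acid's logit, then softmax-sample.
    biased = [l + bias.get(a, 0.0) for l, a in zip(logits, AMINO_ACIDS)]
    m = max(biased)
    probs = [math.exp(b - m) for b in biased]
    total = sum(probs)
    probs = [p / total for p in probs]
    r, acc = rng.random(), 0.0
    for a, p in zip(AMINO_ACIDS, probs):
        acc += p
        if r <= acc:
            return a
    return AMINO_ACIDS[-1]

rng = random.Random(0)
logits = [0.0] * 20                 # hypothetical uniform model output
bias = {"C": -10.0, "K": 3.0}       # avoid cysteine, favor lysine
picks = [sample_residue(logits, bias, rng) for _ in range(200)]
```

With a strong negative bias, the disfavored residue becomes vanishingly rare, while the favored one dominates the sampled designs.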
  • high specificity lets you design proteins that bind a target and essentially only that target.

  • main bottleneck for drug discovery is target identification