Source: https://www.youtube.com/watch?v=_YEk0YAd-cw

  • We have little understanding of how cells and complex biological systems work
  • We have a “piece-wise” understanding of it
    • what is DNA, RNA, proteins, …
    • but exactly how all the pieces interact and why is still not entirely clear
    • interactions are the hard part to model

Protein interactions

Cancer example

  • PD-1 and PD-L1 proteins
    • PD-1 is a receptor on immune cells (T cells)
    • PD-L1 is expressed on cancer cells
    • PD-L1 binds PD-1 to “turn off” the immune response against the cancer
    • if we can bind PD-L1 before PD-1 does, the immune system can keep fighting the cancer
    • need to design higher-affinity binders

Before AI

  • simulating molecular dynamics
    • the ground truth is the Boltzmann distribution (not a static structure)
      • trying to find the low-energy conformations and the transition states between them
      • one protein has multiple low-energy states (especially in water solution, where water molecules constantly bump into the protein)
    • if you don’t simulate long enough, you might not sample an accurate distribution and miss some of the states
  • the closest approximation to true protein structure came from X-ray crystallography, but protein crystals don’t really occur in nature
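The Boltzmann distribution mentioned above can be sketched with toy numbers: the probability of a conformational state with energy E is proportional to exp(-E/kT). The state names and energies below are made up for illustration.

```python
import math

# Toy sketch: Boltzmann weights over three hypothetical conformations.
kT = 0.593  # approx. kT at 298 K, in kcal/mol
energies = {"state_A": 0.0, "state_B": 1.0, "state_C": 2.5}  # kcal/mol

weights = {s: math.exp(-e / kT) for s, e in energies.items()}
Z = sum(weights.values())  # partition function
probs = {s: w / Z for s, w in weights.items()}

# Lower-energy states dominate, but higher-energy states keep nonzero
# probability -- a simulation that is too short can easily miss them.
```

This is why a single “static” structure is a lossy summary: the ensemble has several populated states.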

MSA

  • evolutionary information
  • indicates which parts of the protein are highly conserved and which are not
  • if you mutate a highly conserved residue, the protein’s properties are likely to change
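The conservation idea above can be made concrete with per-column entropy over an alignment: near-zero entropy means the column is conserved. The tiny alignment here is a made-up toy, not real data.

```python
from collections import Counter
import math

# Toy MSA: four hypothetical aligned sequences (same length).
msa = ["MKVLA", "MKVIA", "MKVLG", "MRVLA"]

def column_entropy(col):
    # Shannon entropy (bits) of the residue distribution in one column.
    counts = Counter(col)
    n = len(col)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

entropies = [column_entropy(col) for col in zip(*msa)]
# Columns with entropy near 0 are highly conserved; mutating those
# residues is most likely to change the protein's properties.
```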

After AI

  • MSA hacking for AlphaFold2

    • sub-sampling the MSA to change the structure AlphaFold2 predicts
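The sub-sampling trick can be sketched as below; the function and names are illustrative of the idea, not the AlphaFold2 interface.

```python
import random

# Hedged sketch: keep the query sequence, randomly keep a few homologs.
def subsample_msa(msa, max_depth, seed=0):
    query, *homologs = msa  # the query always stays first
    rng = random.Random(seed)
    kept = rng.sample(homologs, min(max_depth - 1, len(homologs)))
    return [query] + kept

full_msa = ["QUERYSEQ"] + [f"HOMOLOG_{i}" for i in range(100)]
shallow = subsample_msa(full_msa, max_depth=16)
# Different seeds give different shallow MSAs; feeding these to the
# predictor can steer it toward alternative conformations.
```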
  • Distributional Graphformer (DiG)

    • a first step toward estimating the conformational equilibrium distribution
  • AlphaFold-Multimer

    • takes into account interactions between proteins
    • tells you the quality of the predicted interfaces
  • ESMFold (built on ESM-2)

    • an alternative to AlphaFold2
    • the base is a BERT-style masked-language-model objective on amino acid sequences
    • put an Evoformer-like folding module on top of the embeddings, and it predicts something pretty good
  • AlphaFlow

    • generalization of AlphaFold2 to flow matching
    • tested on fold-switching proteins
      • example: KaiB
        • 10% of the time it is in one conformation (the fold-switch state)
        • 90% of the time in another
        • dictated by the circadian rhythm
  • physics-informed ML

  • not done yet

    • AlphaFlow but on AlphaFold-Multimer
  • training/validation splits are extremely crucial

    • many models don’t generalize well because of poor splits
    • need to split based on sequence similarity or structural similarity, not randomly
    • DiffDock-L split its data in a good way
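A similarity-aware split can be sketched as: cluster sequences first, then assign whole clusters to one side, so near-duplicates never straddle the split. Real pipelines use tools like MMseqs2; the greedy clustering and identity function below are toy stand-ins.

```python
# Hedged sketch of a similarity-aware train/validation split.
def identity(a, b):
    # Toy sequence identity for equal-or-unequal-length strings.
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.4):
    # Assign each sequence to the first cluster whose representative
    # it resembles; otherwise start a new cluster.
    clusters = []
    for s in seqs:
        for c in clusters:
            if identity(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

seqs = ["MKVLA", "MKVIA", "QQQQQ", "QQQQP"]
clusters = greedy_cluster(seqs)
train = [s for c in clusters[::2] for s in c]  # alternate whole clusters
val = [s for c in clusters[1::2] for s in c]
```

Because whole clusters move together, a validation sequence never has a near-identical twin in training.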
  • multi-modal / all-atom models

    • RoseTTAFold All-Atom
  • Usual workflow

    • RFdiffusion (for performing “surgery” on structures)
      • built on the RoseTTAFold backbone
      • a diffusion model over protein 3D structures
      • motif scaffolding
        • a sort of inpainting that keeps specified protein motifs fixed
      • fold conditioning
        • conditioning generation on a tertiary structure
      • it can design binders for you
        • can condition the generation to bind to a target protein
    • LigandMPNN
      • for designing the sequences
      • can bias the design towards or away from certain residues
      • inverse folding: given a 3D structure, it gives you a sequence that folds into it
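The “bias towards or away from certain residues” idea can be sketched as adding a per-residue offset to the model’s logits before sampling. This illustrates the concept only; it is not the LigandMPNN interface, and the logits and bias values are made up.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sample_residue(logits, bias, rng):
    # Add the bias to each amino acid's logit, then softmax-sample.
    biased = [l + bias.get(a, 0.0) for l, a in zip(logits, AMINO_ACIDS)]
    m = max(biased)
    probs = [math.exp(b - m) for b in biased]
    total = sum(probs)
    probs = [p / total for p in probs]
    r, acc = rng.random(), 0.0
    for a, p in zip(AMINO_ACIDS, probs):
        acc += p
        if r <= acc:
            return a
    return AMINO_ACIDS[-1]

rng = random.Random(0)
logits = [0.0] * 20                 # hypothetical uniform model output
bias = {"C": -10.0, "K": 3.0}       # avoid cysteine, favor lysine
picks = [sample_residue(logits, bias, rng) for _ in range(200)]
```

With a strong negative bias, the disfavored residue becomes vanishingly rare, while the favored one dominates the sampled designs.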
  • high specificity lets you design proteins that bind a target and essentially only that target.

  • main bottleneck for drug discovery is target identification