Source: https://www.youtube.com/watch?v=_YEk0YAd-cw
- We have little understanding of how cells and complex biological systems work
- We have a “piece-wise” understanding of it
- what is DNA, RNA, proteins, …
- but exactly how all the pieces interact and why is still not entirely clear
- interactions are the hard part to model
Protein interactions
Cancer example
- PD-1 and PD-L1 proteins
- PD-1 is on your immune cells (T cells)
- PD-L1 is on cancer cells
- PD-L1 binds to PD-1 to “turn off” the immune system w.r.t. the cancer
- if we can bind to PD-L1 before PD-1 does, the immune system can keep fighting the cancer
- need to design higher-affinity binders
Before AI
- simulating molecular dynamics
- the ground truth is the Boltzmann distribution (not a static structure)
- trying to get the low-energy conformations and the transition states between them
- one protein has multiple low-energy states (especially in a water solution, where water molecules are bumping into the protein)
- if you don’t simulate long enough, you might not get an accurate distribution and miss some of the states.
- closest approximation to the true protein structure came from protein crystals (X-ray crystallography), but crystals don’t really occur in nature
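The Boltzmann-distribution point can be made concrete: each conformation’s probability is proportional to exp(-E/kT), so a long-enough simulation should visit states with these frequencies. A minimal sketch with made-up energies (the numbers are illustrative, not from the talk):

```python
import numpy as np

# Hypothetical energies (kcal/mol) for three low-energy conformations
# of one protein; the values are illustrative.
energies = np.array([-12.0, -11.5, -9.0])
kT = 0.593  # k_B * T at ~300 K, in kcal/mol

# Boltzmann distribution: p_i proportional to exp(-E_i / kT).
# Subtracting the minimum energy keeps the exponentials numerically stable.
weights = np.exp(-(energies - energies.min()) / kT)
probs = weights / weights.sum()
print(probs.round(3))
```

The lowest-energy state dominates, but the other states still carry probability mass, which is why a too-short simulation can miss them.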
MSA
- evolution information
- indication of which parts of the protein are highly conserved or not
- if you mutate a highly conserved residue, the protein’s properties are likely to change
After AI
MSA hacking for AlphaFold2
- sub-sampling the MSA to change the AlphaFold2-predicted structure
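A sketch of what MSA sub-sampling looks like, assuming a toy alignment (the function and sequences are illustrative; a real pipeline would subsample AlphaFold2’s actual MSA input features):

```python
import numpy as np

def subsample_msa(msa_rows, n_keep, seed=0):
    """Keep the query (row 0) plus a random subset of the remaining rows.

    Feeding AlphaFold2 a reduced alignment weakens the evolutionary
    signal, which can push the prediction toward alternative conformations.
    """
    rng = np.random.default_rng(seed)
    picked = rng.choice(len(msa_rows) - 1, size=n_keep - 1, replace=False) + 1
    return [msa_rows[0]] + [msa_rows[i] for i in sorted(picked)]

# Toy alignment: query first, homologs after.
msa = ["MKTAYIA", "MKSAYIA", "MKTHYIA", "MRTAYLA", "MKTAYIV"]
print(subsample_msa(msa, n_keep=3))
```

Running the predictor over many random sub-samples (different seeds) is what turns one model into an ensemble of candidate structures.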
Distributional Graphformer (DiG)
- first step toward estimating the conformational equilibrium distribution
AlphaFold-Multimer
- takes into account interactions between proteins
- tells you the quality of the interfaces
ESMFold (built on ESM-2)
- alternative to AlphaFold2
- the base is a BERT-style masked-language-model objective on amino acid sequences
- just put an Evoformer-like folding module on the embeddings, and it predicts something pretty good
AlphaFlow
- generalization of AlphaFold2 to flow matching
- tested on fold-switching proteins
- example: KaiB
- 10% of the time it is in one conformation (the fold-switch state)
- 90% of the time in another
- dictated by circadian rhythm
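The flow-matching objective that AlphaFlow builds on can be sketched in 1-D, with toy data standing in for protein conformations (this is the generic conditional flow-matching loss, not AlphaFlow’s actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

x1 = rng.choice([-2.0, 2.0], size=256)   # two data modes ~ two folds
x0 = rng.standard_normal(256)            # noise samples
t = rng.uniform(size=256)                # random times in [0, 1]

xt = (1 - t) * x0 + t * x1               # point on the interpolation path
target_v = x1 - x0                       # velocity the network should predict

# A trained model v_theta(xt, t) would regress target_v; a constant-zero
# "model" stands in here, just to show the loss being computed.
pred_v = np.zeros_like(xt)
loss = np.mean((pred_v - target_v) ** 2)
print(loss)
```

Sampling then integrates the learned velocity field from noise to data, so a multimodal target (two folds) yields different end points from different noise draws.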
physics-informed ML
not done yet
- AlphaFlow but on AlphaFold-Multimer
training/validation splits are extremely crucial
- many models don’t generalize well because of this
- need to split based on sequence similarity or structural similarity
- DiffDock-L split their data in a good way
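A minimal sketch of similarity-aware splitting, assuming a toy identity metric and made-up sequences (real pipelines cluster with tools such as MMseqs2 and may also use structural similarity):

```python
def identity(a, b):
    """Fraction of matching positions (crude metric for similar-length toys)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def cluster(seqs, threshold=0.8):
    """Greedy single-linkage clustering: join a cluster if close to any member."""
    clusters = []
    for s in seqs:
        for c in clusters:
            if any(identity(s, m) >= threshold for m in c):
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

seqs = ["MKTAYIA", "MKTAYIV", "MKSAYIA", "GGGHHHH", "GGGHHHA"]
clusters = cluster(seqs)

# Assign WHOLE clusters to a split, so near-duplicates never straddle it.
train = [s for c in clusters[::2] for s in c]
val = [s for c in clusters[1::2] for s in c]
print(train, val)
```

A random per-sequence split would likely put `MKTAYIA` in train and its near-duplicate `MKTAYIV` in validation, inflating the measured generalization.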
multi-modal / all-atom models
- RoseTTAFold All-Atom
Usual workflow
- RF diffusion (for performing surgeries on structures)
- uses the RoseTTAFold backbone
- diffusion model on protein 3D structures
- motif-scaffolding
- sort of inpainting but for keeping protein motifs
- fold conditioning
- conditioning on tertiary structures
- it can design binders for you
- can condition the generation to bind to a target protein
- LigandMPNN
- for designing the sequences
- can bias towards or away from certain residues
- (maybe) given a 3D structure, it will give you the sequence
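The residue-biasing idea above can be sketched as adding offsets to a design model’s per-position logits before sampling, in the spirit of LigandMPNN’s bias options (random logits stand in for a real model’s output; the bias magnitudes are illustrative):

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard amino acids
rng = np.random.default_rng(0)
logits = rng.standard_normal((5, len(AA)))  # 5 positions x 20 amino acids

bias = np.zeros(len(AA))
bias[AA.index("C")] = -10.0          # push away from cysteine
bias[AA.index("G")] = +2.0           # push toward glycine

# Softmax over biased logits, then take the most likely residue per position.
probs = np.exp(logits + bias)
probs /= probs.sum(axis=1, keepdims=True)
designed = "".join(AA[i] for i in probs.argmax(axis=1))
print(designed)
```

With a strong negative bias, cysteine is effectively excluded from every position; milder biases just shift sampling frequencies.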
high specificity allows designing proteins that will bind to a target and pretty much only that target.
main bottleneck for drug discovery is target identification