Leveraging machine learning for improved functional annotation and remote homology detection
A central question in our genome mining projects for enzyme discovery is how to improve our ability to predict protein function. We have been investigating structure searches and neural-network-based sequence classifiers as ways to improve remote homology detection, and we recently found a way to use protein language models to dramatically increase the sensitivity of protein domain annotation while maintaining acceptable search speed. We look forward to working with other research groups to apply these computational approaches to enzyme discovery at NEB.
This figure compares logos of HMM profiles (Wheeler et al., 2014) derived from the 4HBT Pfam profile and from the YBGC_HELPY sequence in the seed alignment. The positional amino acid frequencies predicted by the ESM-2 3B protein language model resemble those found in multiple sequence alignments (MSAs) built from sequence searches.
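As a rough illustration of what comparing positional amino acid frequencies can involve, the sketch below computes per-column frequencies from a small alignment and scores each column against a second set of per-position probabilities using the Jensen-Shannon divergence. This is not the method used to produce the figure. The function names, the toy alignment, and the "predicted" probabilities are all hypothetical placeholders, and no real ESM-2 output or Pfam data is used.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(msa):
    """Per-column amino acid frequencies for a list of aligned sequences.

    Gaps and non-standard letters are ignored. A column with no standard
    residues is given a uniform distribution.
    """
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in msa if seq[i] in AMINO_ACIDS)
        total = sum(counts.values())
        if total == 0:
            profile.append({aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS})
        else:
            profile.append({aa: counts[aa] / total for aa in AMINO_ACIDS})
    return profile


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    def kl(a, b):
        return sum(a[x] * math.log2(a[x] / b[x]) for x in AMINO_ACIDS if a[x] > 0)

    m = {x: 0.5 * (p[x] + q[x]) for x in AMINO_ACIDS}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def as_distribution(partial):
    """Expand a sparse {residue: probability} dict to all 20 amino acids."""
    return {aa: partial.get(aa, 0.0) for aa in AMINO_ACIDS}


# Toy alignment (hypothetical, not taken from any Pfam family).
toy_msa = ["MKV", "MRV", "MKI", "M-V"]
msa_profile = column_frequencies(toy_msa)

# Placeholder per-position probabilities standing in for model predictions.
# These values are invented for illustration only.
predicted = [
    as_distribution({"M": 1.0}),
    as_distribution({"K": 0.5, "R": 0.5}),
    as_distribution({"V": 0.75, "I": 0.25}),
]

# One score per alignment column; lower means more similar.
per_column = [js_divergence(p, q) for p, q in zip(msa_profile, predicted)]
for position, score in enumerate(per_column, start=1):
    print(f"column {position}: JS divergence = {score:.3f} bits")
```

Jensen-Shannon divergence is used here only because it is symmetric and bounded, which makes columns easy to compare; other measures of profile similarity would serve equally well in a sketch like this.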