Speaker: Lucy Colwell (Cambridge)

A central challenge is to be able to predict functional properties of a
protein from its sequence, and thus (i) discover new proteins with
specific functionality and (ii) better understand the functional effect
of genomic mutations. Experimental breakthroughs in our ability to read
and write DNA allow data on the relationship between sequence and
function to be rapidly acquired. This data can be used to train and
validate machine learning models that predict protein function from
sequence. Because in many cases phenotypic changes are controlled by
more than one amino acid, the mutations that separate different
phenotypes may be epistatic, requiring us to build models that take the
correlation structure into account. Such models rival the accuracy of
existing hidden Markov models at sequence annotation, even when given
relatively little training data. The representation of sequence space
learned by the model can be used to build families that the model did
not see during training. Finally, prospective experiments show that
machine learning models identify, with >55% accuracy, variants of the
AAV capsid protein that assemble integral capsids and package their
genome, for gene therapy applications.