Combining Structure and Sequence Data to Predict Peptide-HLA Binding Affinity

Title:Combining Structure and Sequence Data to Predict Peptide-HLA Binding Affinity

Authors:Anja Conev, Mauricio Rigo, Dinler Antunes, Romanos Fasoulis, Sarah Hall-Swan and Lydia Kavraki

Conference:2020DSConf

Tags:binding affinity prediction, MHC class I, peptide HLA binding and structural and sequence features

Abstract:

Class I HLA alleles play a key role in the immune response by presenting peptides to T lymphocytes, which makes the HLA pathway highly relevant to vaccinology efforts. We are implementing a computational infrastructure for identifying highly conserved peptides within SARS-CoV-2 proteins that could serve as targets for a broad-spectrum peptide vaccine in the context of specific HLAs. Here, we show one step of this pipeline: a scoring function that considers the structure of the peptide-HLA (pHLA) complex, not only the sequence of the peptide. Our group developed a method that uses a random forest classifier trained on features extracted from modeled pHLA structures, which has shown results competitive with sequence-based methods. To build new models, we combined the available structural and sequence data, using 82,000 pHLA structures across 30 HLA alleles. We mapped the structures to pHLA binding affinity values and used these values as labels for our regression models. The features comprise a structural distance matrix along with chemical properties of the peptide sequence. We split the dataset into training and test sets (20% of the data for each HLA was left out of the training phase) and trained a random forest regressor on the features and labels for each HLA allele. The data are distributed such that each model is trained on at least 1,000 structures. Finally, we tuned the hyperparameters of the random forest regressors using 5-fold cross-validation. The coefficient of determination (R²) on the test set ranged from 0.74 to 0.99. These results for models trained on the combination of structural and sequence features of pHLA complexes highlight the power of leveraging both sequence and structural data. This work was funded by the National Science Foundation (NSF) (award number 2033262) and Rice University funds.
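
The evaluation protocol described in the abstract (a 20% hold-out per HLA allele, one model per allele, scored by the coefficient of determination) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the record layout `(allele, features, affinity)`, the function names, and the `fit` callback are all assumptions made here. The model itself is deliberately left pluggable, and hyperparameter tuning is not shown.

```python
import random
from collections import defaultdict


def split_per_allele(records, test_fraction=0.2, seed=0):
    """Hold out `test_fraction` of the records of each allele.

    `records` is a list of (allele, features, affinity) tuples; this layout
    is an assumption made for illustration.
    """
    rng = random.Random(seed)
    by_allele = defaultdict(list)
    for record in records:
        by_allele[record[0]].append(record)

    splits = {}
    for allele, rows in by_allele.items():
        rows = rows[:]  # copy so the grouped list is not shuffled in place
        rng.shuffle(rows)
        n_test = max(1, round(len(rows) * test_fraction))
        splits[allele] = (rows[n_test:], rows[:n_test])  # (train, test)
    return splits


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


def evaluate_per_allele(records, fit, test_fraction=0.2, seed=0):
    """Train one model per allele and score it on that allele's held-out set.

    `fit(train_X, train_y)` is a hypothetical callback that must return a
    function mapping one feature vector to a predicted affinity.
    """
    scores = {}
    splits = split_per_allele(records, test_fraction, seed)
    for allele, (train, test) in splits.items():
        predict = fit([r[1] for r in train], [r[2] for r in train])
        y_true = [r[2] for r in test]
        y_pred = [predict(r[1]) for r in test]
        scores[allele] = r2_score(y_true, y_pred)
    return scores
```

In practice, `fit` could wrap a random forest regressor (for example scikit-learn's `RandomForestRegressor`), with each feature vector formed from the flattened structural distance matrix and the peptide's chemical properties. Splitting within each allele, rather than across the pooled dataset, keeps every per-allele model's test set drawn from the same allele it was trained on.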