Scaling of binding scores and evaluation of prediction performance

Immunological data reported in different studies are measured under different experimental conditions or with different reference peptides, so it is inappropriate to combine heterogeneous data without first transforming the raw binding affinities. To enable inspection and comparison of predictions for different HLA alleles, we scaled all data to a common scale from 0 to 100 using logarithmic and linear transformations. The three data transformation equations (1), (2), and (3) below map onto the 0-100 scale, respectively, measurements of binding affinity expressed as concentrations (in nM), measurements expressed as relative binding affinity (in %) to the labeled reference peptide, and measurements used by the Multipred and Hotspot Hunter data sets.

where the final scaled binding score ranges from 0 to 100, xi is the original reported value of experimental binding affinity, and an intermediate value is used to map xi onto a slightly different scale.
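As an illustration of how a logarithmic transformation can place concentrations on a 0-100 scale, the sketch below uses a transform that is common in the MHC binding literature, 1 - log(x)/log(50000), multiplied by 100 and clipped. This is an assumption made for illustration only: it is not necessarily equation (1) of this resource, and the function name `scale_ic50` and the 50,000 nM upper bound are hypothetical choices.

```python
import math


def scale_ic50(ic50_nm, max_ic50_nm=50000.0):
    """Map a binding affinity in nM onto a 0-100 score (higher = stronger binding).

    ASSUMPTION: uses 100 * (1 - log(x) / log(max_ic50_nm)), a transform common
    in the literature; it is not confirmed to be this resource's equation (1).
    """
    if ic50_nm <= 0:
        raise ValueError("binding affinity must be a positive concentration")
    score = 100.0 * (1.0 - math.log(ic50_nm) / math.log(max_ic50_nm))
    # Clip so that very strong or very weak binders stay inside 0-100.
    return min(100.0, max(0.0, score))


if __name__ == "__main__":
    for value in (1.0, 50.0, 500.0, 5000.0, 50000.0):
        print(value, round(scale_ic50(value), 1))
```

Because the transform is logarithmic, each tenfold change in concentration shifts the score by the same amount, which keeps strong and weak binders on a comparable footing.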


Based on these datasets, a number of computational models can be developed to address different problems. Some models focus on distinguishing MHC-binding peptides from non-binding peptides, while others predict the binding affinity between MHC molecules and peptides. Different strategies should therefore be employed to evaluate their performance.

For the assessment of classification accuracy, the area under the receiver operating characteristic (ROC) curve (AROC) can be used. The ROC curve plots the true positive rate TP/(TP+FN) on the vertical axis versus the false positive rate FP/(TN+FP) on the horizontal axis over the complete range of decision thresholds. Values of AROC>=0.9 indicate excellent, 0.9>AROC>=0.8 good, 0.8>AROC>=0.7 marginal, and AROC<0.7 poor predictions.
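A minimal sketch of this evaluation follows. It computes AROC through the equivalent pairwise formulation (the fraction of binder/non-binder pairs in which the binder receives the higher score, with ties counted as one half) rather than by tracing the curve explicitly. The function names and the example labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = binder, 0 = non-binder)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one binder and one non-binder")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # binder ranked above non-binder
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


def rate_auc(auc):
    """Translate an AROC value into the qualitative categories given above."""
    if auc >= 0.9:
        return "excellent"
    if auc >= 0.8:
        return "good"
    if auc >= 0.7:
        return "marginal"
    return "poor"


if __name__ == "__main__":
    labels = [1, 1, 0, 1, 0, 0]          # made-up example data
    scores = [90, 80, 70, 60, 30, 10]    # made-up predicted scores
    auc = roc_auc(labels, scores)
    print(round(auc, 3), rate_auc(auc))
```

The pairwise loop is quadratic in the number of peptides, which is acceptable for small test sets; for large sets a rank-based computation would be preferable.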

To assess the accuracy of binding affinity predictions, the Pearson correlation coefficient can be used:

\[ r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \, \sum_{i}(y_i - \bar{y})^2}} \]

where $x_i$ and $\bar{x}$ are the individual and average experimental affinities, and $y_i$ and $\bar{y}$ are the individual and average peptide predictions. The correlation coefficient ranges from -1 to 1, with 1 representing a perfect positive linear relationship, -1 representing a perfect negative linear relationship, and 0 representing no linear correlation.
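The Pearson correlation coefficient described above can be computed directly from paired experimental and predicted values. The sketch below uses the standard definition; the function name `pearson_r` and the example values are hypothetical.

```python
import math


def pearson_r(experimental, predicted):
    """Pearson correlation between experimental and predicted affinities."""
    if len(experimental) != len(predicted) or len(experimental) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    n = len(experimental)
    mean_x = sum(experimental) / n
    mean_y = sum(predicted) / n
    # Sum of cross-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(experimental, predicted))
    var_x = sum((x - mean_x) ** 2 for x in experimental)
    var_y = sum((y - mean_y) ** 2 for y in predicted)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    measured = [10.0, 35.0, 60.0, 85.0]    # made-up scaled binding scores
    predicted = [15.0, 30.0, 70.0, 80.0]   # made-up model outputs
    print(round(pearson_r(measured, predicted), 3))
```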


Links to other online resources

  • IEDB (The Immune Epitope Database (IEDB) contains curated immunological data for antibody, B cell, and T cell epitopes derived from humans, non-human primates, rodents, and other animal species. IEDB also contains MHC binding data from a variety of antigenic sources and immune epitope data from 4 major contributors: the FIMM (Brusic), HLA Ligand (Hildebrand), TopBank (Sette), and MHC binding (Buus) databases. It also provides antigenic analysis tools for B cell and T cell epitope prediction.)
  • SYFPEITHI (SYFPEITHI is a database comprising more than 4500 peptide sequences known to bind class I and class II MHC molecules of bovine, equine, rodent, chicken, sheep, human, non-human primate, and other species. The entries are compiled from published reports only.)
  • MHCBN (A database containing information on more than 25,000 MHC-binding peptides, non-binding peptides, and T-cell epitopes.)
Version 1.0, Sep 2009. Developed by Bioinformatics Core at Cancer Vaccine Center, Dana-Farber Cancer Institute.