Predicting protein pKa by environment similarity

Abstract
A statistical method for predicting protein pKa has been developed that uses the 3D structure of a protein and a database of 434 experimental protein pKa values. Each pKa in the database is associated with a fingerprint that describes the chemical environment around an ionizable residue. A computational tool, MoKaBio, has been developed to automatically identify ionizable residues in a protein, generate fingerprints that describe the chemical environment around such residues, and predict their pKa from the experimental pKa values in the database by using a similarity metric. The method, which correctly retrieved the pKa of 429 of the 434 ionizable sites in the database, was cross-validated by leave-one-out and yielded a root mean square error (RMSE) of 0.95, a result that is superior to that obtained by using the Null Model (RMSE = 1.07) and other well-established protein pKa prediction tools. This novel approach is suitable for rationalizing protein pKa by comparing the region around the ionizable site with similar regions whose ionizable site pKa is known. The pKa of residues that have a unique environment not represented in the training set cannot be predicted accurately; however, the method offers the advantage of being trainable to increase its predictive power. Proteins 2009.
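The workflow summarized in the abstract (fingerprint lookup by similarity, leave-one-out cross-validation, and comparison against a Null Model by RMSE) can be illustrated with a minimal Python sketch. It is not the MoKaBio implementation: the abstract does not specify the fingerprint encoding or the similarity metric, so the sketch assumes fingerprints are sets of feature labels compared by Tanimoto similarity, and it assumes the Null Model assigns each residue type a fixed reference pKa. The data set, the feature names, the reference values in `MODEL_PKA`, and all function names are hypothetical.

```python
import math

# Hypothetical toy database: (residue type, environment fingerprint, experimental pKa).
# Fingerprints are modelled as sets of made-up feature labels; all values are invented.
SITES = [
    ("ASP", frozenset({"exposed", "near_LYS"}), 3.5),
    ("ASP", frozenset({"exposed", "near_LYS", "hbond"}), 3.2),
    ("ASP", frozenset({"buried", "near_GLU"}), 6.0),
    ("GLU", frozenset({"exposed"}), 4.4),
    ("GLU", frozenset({"buried", "near_ASP"}), 6.5),
    ("HIS", frozenset({"exposed", "hbond"}), 6.6),
]

# Assumed Null Model: one reference pKa per residue type (illustrative values only).
MODEL_PKA = {"ASP": 3.9, "GLU": 4.3, "HIS": 6.5}


def tanimoto(a, b):
    """Tanimoto similarity of two feature sets (assumed metric, not from the paper)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def predict(query_fp, residue, database):
    """Return the pKa of the most similar site of the same residue type.

    Ties in similarity are averaged; returns None if no comparable site exists.
    """
    scored = [(tanimoto(query_fp, fp), pka)
              for res, fp, pka in database if res == residue]
    if not scored:
        return None
    best = max(sim for sim, _ in scored)
    hits = [pka for sim, pka in scored if sim == best]
    return sum(hits) / len(hits)


def leave_one_out(database):
    """Predict each site from all other sites; return (predicted, experimental) pairs."""
    pairs = []
    for i, (res, fp, pka) in enumerate(database):
        rest = database[:i] + database[i + 1:]
        pred = predict(fp, res, rest)
        if pred is not None:  # skip sites with no same-type neighbour
            pairs.append((pred, pka))
    return pairs


def rmse(pairs):
    """Root mean square error over (predicted, experimental) pairs."""
    return math.sqrt(sum((p - e) ** 2 for p, e in pairs) / len(pairs))


if __name__ == "__main__":
    loo_pairs = leave_one_out(SITES)
    null_pairs = [(MODEL_PKA[res], pka) for res, _, pka in SITES]
    print("LOO RMSE: ", round(rmse(loo_pairs), 2))
    print("Null RMSE:", round(rmse(null_pairs), 2))
```

Querying the full database with a fingerprint it already contains returns the stored value, which mirrors the retrieval check described above. In the leave-one-out loop, a site whose environment has no close analogue among the remaining entries is predicted poorly, which illustrates the stated limitation for residues with unique environments.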