Reduced amino acid alphabet is sufficient to accurately recognize intrinsically disordered protein

Abstract
Intrinsically disordered proteins are an important class of proteins with unique functions and properties. Here, we have applied a support vector machine (SVM) trained on naturally occurring disordered and ordered proteins to examine the contribution of various parameters (vectors) to recognizing proteins that contain disordered regions. We find that an SVM that incorporates only amino acid composition has a recognition accuracy of 87 ± 2%. This result suggests that composition alone is sufficient to accurately recognize disorder. Interestingly, SVMs using reduced sets of amino acids based on chemical similarity preserve high recognition accuracy. A set as small as four retains an accuracy of 84 ± 2%; this suggests that general physicochemical properties rather than specific amino acids are important factors contributing to protein disorder.

Keywords
Unstructured protein; Support vector machine; Amino acid composition; Protein classification; Sequence complexity

1 Introduction
It is becoming increasingly clear that proteins or segments of proteins that lack a stable and well-defined three-dimensional structure, often referred to as intrinsically disordered proteins, have a range of important properties and functions that depend on or derive from being disordered [1–4]. For example, Wright and colleagues have proposed that intrinsically disordered protein segments confer conformational flexibility on some proteins, allowing a functionally important promiscuity in binding [5]. Other functions, such as regulating nuclear pore transport and serving as entropic clocks for ion channel gating, have also been proposed [5,6]. Our interest in this problem derives from a proposal that certain cytoskeletal proteins have intrinsically disordered protein segments [7].
In particular, the side-arms of neurofilament proteins NF-M and NF-H and the projection domain of MAP2 are highly unstructured and, as a consequence, exert long-range repulsive forces that are largely entropic in origin; these forces are critical to organizing the neuronal cytoskeleton [8–10]. The emerging importance of disordered proteins has led to the development of tools and approaches for recognizing and predicting the propensity for any given protein sequence to be disordered. Developing and testing these tools requires defining libraries of disordered protein sequences; however, there are no rigorous experimental criteria for defining disorder. Criteria used for identifying collections of disordered regions include considering data from X-ray crystallography, NMR, circular dichroism, and protease sensitivity [11–14]. One notable feature of disordered proteins identified in such collections is a strong bias towards charged and polar amino acids and against hydrophobic amino acids [12]. While there is no unambiguous test of these groupings, it is reasonable to assume that they are at least strongly biased in their relative composition of disordered versus ordered proteins. With that caveat in mind, Dunker and co-workers developed PONDR, a neural net-based predictor [15]. There are now a variety of implementations of PONDR with prediction accuracies as high as 87% [16]. Linding and co-workers also developed a neural net predictor for disorder, DisEMBL, which uses three data sets based on different definitions of disorder [13]. These sets are based on an analysis of proteins with known three-dimensional structure. Consistent with the previous work, the propensities of these sets show a bias for charged and polar amino acids and against hydrophobic amino acids, although there are significant differences in the relative compositions.
A significant limitation of these neural net-based approaches is that it is difficult to interrogate the relative contribution of individual parameters to recognizing or predicting disorder. Here, we have trained a support vector machine (SVM) to recognize intrinsically disordered proteins. SVMs are learning machines based on the development of statistical learning theory by Vapnik and colleagues [17]. An important feature of SVMs is that the results of the learning process can be quantified; thus, the relative influence of different parameters on the ability of the SVM to recognize disordered proteins can be measured. SVMs operate in two stages: data sets from two different classes are first mapped into a higher dimensional space based on vectors that represent some particular parameter, then the hyperplane that optimally separates the two classes is calculated. SVMs are designed to provide a globally optimized solution that ensures the highest level of recognition accuracy. SVMs have been successfully applied to many pattern classification and recognition problems; applications to biology include predictions of secondary structure, subcellular location, and solvent accessibility [18–20]. Jones and colleagues [21] have recently shown that SVMs are effective tools for predicting disordered proteins. Here, we use an SVM-based approach to gain further insight into the physicochemical principles important for recognition of disordered proteins.

2 Materials and methods
2.1 Protein data
The training set was that compiled by Dunker and colleagues [15]. This set contains 718 segments classified as disordered and 1190 sequences classified as structured.

2.2 Support vector machine
We used the mySVM implementation of support vector machine theory by Rüping (http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM/).
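As a minimal sketch of the composition-based representation described in the abstract, the following Python snippet converts a sequence into a 20-component amino acid frequency vector and, optionally, collapses it onto a reduced alphabet. The four-group partition (`REDUCED_GROUPS`), the function names, and the example sequence are assumptions made for illustration only; they are not taken from this work, and the analysis itself was carried out with mySVM rather than with this code.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order that defines the vector layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical four-group reduced alphabet (an assumption for illustration,
# not the grouping used in the paper). Each residue appears in exactly one group.
REDUCED_GROUPS = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGP",
    "positive": "KRH",
    "negative": "DE",
}


def composition(sequence):
    """Return the fractional amino acid composition as a 20-element list."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # Empty or non-standard input: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[ch] / total for ch in AMINO_ACIDS]


def reduced_composition(sequence, groups=REDUCED_GROUPS):
    """Sum residue frequencies within each group of the reduced alphabet."""
    full = dict(zip(AMINO_ACIDS, composition(sequence)))
    return [sum(full[ch] for ch in members) for members in groups.values()]


# Made-up example sequence, used only to show the calls.
example = "MKDEEKRSSTPA"
print(composition(example))
print(reduced_composition(example))
```

One such vector per training segment, labelled as disordered or ordered, would then serve as the input to the SVM.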
The initial stage of mapping data sets into higher dimensional spaces was accomplished using a kernel function, $K(s_i, x)$, where $s_i$ is a support vector and $x$ is the input sequence. For our analysis, we chose a dot kernel function where $K(s_i, x) = s_i \cdot x$. This kernel function provides high accuracy while avoiding the long training and testing times associated with higher order kernel functions. The results of the mapping process are represented as a set...