Pattern recognition and classification are important tools of NN. Examples of early NN methods that are still widely used today are PHD [66, 67] PSIPRED [68] and JPred [69] though advancement has occurred to a great deal as Deep neural network (DNN) models have been shown have an advantage of performance in image and language based problems [70] and has been seen to extend to some specific CASP areas such as residue-residue contact prediction and direct use for accurate tertiary structure generation [71–75].
Support Vector Machines (SVMs) Support Vector Machine (SVM) is a supervised Machine Learning technique that has been used to rank protein models [76]. SVM has been put to use in pattern classification problems related to biology. Support Vector Machine method is performed based on the database derived from SCOP, in which protein domains are classified which is based on
1 Known structures of protein in the data bank
2 Evolutionary relationships of the predicted protein
3 The various principles of bond formation governing the 3-D structure of protein.
The advantages of SVM include avoidance of over-fitting very effectively which is a disadvantage with several other methods and is able to manage large feature spaces, and condensation of large amount of information data.
Bayesian Methods The most successful methods for determining secondary structure from primary structure use machine learning approaches that are quite accurate, but they do not directly incorporate structural information. There is a need to determine higher order protein structure which can provide a better and deeper understanding of protein’s function in the cell as structure and function are strongly related. Various computational prediction methods have been developed for the prediction of secondary structure if the primary amino acid sequence is available and one such computational methods is the Bayesian method
The knob-socket model of protein packing in secondary structure forms the basis of Bayesian model. As it is known that when packaging of protein may result in residues that are packed close in space but distant in sequence if the primary structure is seen [77, 78] which is not taken into account by several other methods. The Bayesian model method considers the packing influence of residues on the secondary structure determination. Thus this method has an advantage over other methods of having constructs for the direct inclusion and prediction of the secondary states of coil and turn. Where other secondary structure prediction methods are indirect and do not make direct prediction of coil structure of alpha helix and beta sheet. The secondary folding is very much dependent upon the surrounding environment (aqueous/non aqueous) as a lot of hydrogen bonding and hydrophobic is involved. Thus this method helps in developing the understanding of the environment responsible for secondary structure formation.
Clustering Methods A protein rarely performs its function in isolation, various kinds of interaction is needed to perform its function [79] as discussed earlier in this chapter in context to quaternary structure. Protein–protein interactions are thus fundamental to almost all biological processes [80] and it’s really important to understand this phenomenon. Increasing availability of large-scale protein-protein interaction data has made it possible to understand the basic components and organization of cell machinery from the network level in terms of interactions taking place. Protein–protein interactions can be studied by advance high-throughput technologies such as yeast-two-hybrid, mass spectrometry, and protein chip technologies and making available huge data sets of such interactions [81] which can be put to great use in structure prediction. In computation analysis such protein– protein interaction data can be naturally represented in the form of networks. This network representation can provide the initial global picture of protein interactions on a genomic scale and can also help to build an understanding of the basic components and organization of cell machinery. In Clustering method protein interaction network is represented as an interaction graph. In this graphical representation the proteins are as vertices (or nodes) and interactions as edges. This method has been put to use in the study of surface or topological properties of protein interaction including the network diameter, the distribution of vertex degree, the clustering coefficient and shows that there is scale-free network [82–85] and effects in a very small area [86, 87]. It has been observed and shown that clustering protein interaction networks is an effective approach for system biology to understand the relationship between the organization of a network and its function [88] making it a very effective tool.
The proteins are grouped into sets (clusters) helping to demonstrate greater similarity among proteins in the same cluster than in different clusters. The clusters have two which are protein complexes and functional modules. Protein complexes are groups of proteins that interact with each other at the same time and place which form a single multimolecular structure as evident in RNA splicing and polyadenylation machinery, protein export and transport complexes to name a few [89]. The difference between protein complex and functional modules is that the functional module consists of proteins binding each other at a different time and place and participating in a cellular process. Example of functional module includes the yeast pheromone response pathway, MAP signalling cascades, etc. [90] which initiates with an extracellular signaling leading to a signal cascade pathway resulting in gene activation and other processes.
2.7 Role of Artificial Intelligence in Computer-Aided Drug Design
High throughput screening (HTS) is a set of techniques that are capable of identifying biologically active molecules with desired properties from any compound database of billions of compounds. The prediction and identification of active compounds with high accuracy and activity are crucial to decrease the time taken to discover potent drugs. Different medicinal chemistry-related companies use screening techniques to identify active compounds from drug databases in a significantly less amount of time. The decrease in search space or targeted search will reduce the overall cost of the drug discovery process. The critical problem is how to establish a relationship between the 3D structure of the lead molecule and its biological activity. QSAR is a technique that can able to predict the activity of a set of compounds using the derived equations from a set of known compounds [91]. While in QSPR (quantitative structure–property relationships), one predicts biological activity, using the physicochemical properties of known compounds as a response variable. Accurate prediction of the activity of chemical molecules is still a persistence issue in drug discovery. It is a general phenomenon in structural bioinformatics that if the two protein structures share structural similarities, then their functions may also be the same. Nevertheless, this is not always true in the case of chemical structures, where minute structural differences in pairs of compounds will lead to change in their activity against the same target receptor. This is an activity cliff problem which is being a hot topic of debate among computational and medicinal scientists [92, 93].
The lock-and-key hypothesis and induced fit model hypothesis deal with the biochemistry of binding of a ligand at the receptor. In general, a ligand–receptor complex comprises of a smaller ligand which attaches to the functional cavity of the receptor. The 3D structure information of both ligand, as well as receptor, is essential in order to understand their functional role. There is a change in 3D conformation of receptor protein upon binding of ligands at the active site and thus leads to change in their functional state. X-Ray Crystallography, Nuclear Magnetic Resonance (NMR), Electron Microscopy are the currently available experimental techniques to predict the 3D structure of proteins. Since there is a considerable gap between available protein sequences and their 3D structures, one can harness bioinformatics techniques, namely molecular modeling, to predict their 3D structures in a less amount of time with comparable accuracy. Molecular docking is a technique that can be used to predict the binding mode of ligand at the receptor if their 3D information is available. It is the most commonly used for pose prediction of ligand at the active site of the receptor. The approach of identifying lead compounds using 3D structure information of receptor–protein is known as Structure-Based Drug Design (SBDD). Nowadays, the process of identifying, predicting and optimising the activity of small molecules against a biological target comes under SBDD domain [94–96].
Читать дальше