3.3.2 The Coronaviridae Family
Coronaviruses are transmitted by sprays of respiratory secretions, with incubation periods estimated at between 2 days and a week; after about 2 weeks, the virus is eliminated from the body. The virus replicates in the host cell, and the new viral particles pass through the Golgi apparatus until they are finally released by exocytosis. In general, coronaviruses cause about 15% of colds and flu, and their effect intensifies in winter and early spring.
Coronaviruses infect a large number of birds and mammals. Some of them, particularly SARS-CoV-2, are responsible for severe respiratory diseases, which can spread even to babies; in a few cases, they are responsible for neurologic syndromes.
The virus remains in the upper respiratory tract and elicits a minimal immune response. Due to its capacity to mutate, it persists in the invaded species.
3.3.3 The SARS-CoV-2 Structural Proteins
Four structural proteins form SARS-CoV-2: the Spike, Membrane, Envelope, and Nucleocapsid proteins. Other protein groups are related to SARS-CoV-2, such as the non-structural proteins and the SARS-CoV-2 putative accessory factors; there are also related protein groups with important structural differences, such as the MERS and SARS-CoV proteins.
3.3.4 Protein Representations
Proteins are formed by amino acids and are fundamental units of every living organism. They make up tissues, muscles, skin, and nails, and they can also act as accelerators or retardants of chemical or physiological processes. Life could not be understood without them.
The first representation of a protein is a succession of amino acids, placed one after another; this is called the linear representation or sequence. Three further representations exist in three-dimensional space and relate to the way the amino acids arrange themselves.
When the amino acids of a protein cluster, they take forms such as alpha helices and beta sheets; this is called the secondary representation. When these small structures bind together, the protein has a tertiary representation. Finally, if the protein is so large that it is made up of two or more tertiary structures, it is said to have a quaternary structure.
A protein may be made up of a few amino acids or thousands of them; however, their number has to do not with the size the protein adopts but with its regulation. As an example, the SARS-CoV-2 structural protein Spike has 1273 aa and yet measures only tens of nanometers.
Note 3.1 Here, it is important to mention that viruses are also formed by proteins, although it is still debated whether they are living organisms.
3.4 Computational Predictors
Beyond the scientific interest, there is a practical interest in determining the predominant function of a protein. Proteins do not have a single action on a single pathogen; on the contrary, a protein typically acts on several pathogens. For this reason, when examining a database specialized in proteins, it is common to find the same protein reported several times, each time with a different action.
On the other hand, determining the action or function of a protein involves costly experiments and/or clinical trials. In addition, some proteins that act on pathogens are increasingly difficult to find in nature, as in the case of SCAAPs (selective cationic amphipathic antibacterial peptides), which are highly toxic to bacterial membranes and harmless to erythrocytes.
SCAAPs are also very short proteins, from 6 aa to 14 aa long. These characteristics make them very valuable in the production of pharmaceutical drugs; however, they are increasingly difficult to find in nature.
In this scenario, it is very useful to have computational-mathematical predictors that can identify the preponderant function of a protein from its sequence alone. This makes it possible to inspect databases in search of peptides with a specific function. Of course, this does not replace experimental testing, but it substantially reduces the number of proteins to be tested.
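As an illustration of this kind of database inspection, the following minimal Python sketch reads FASTA-formatted text and keeps only the records that fall in a length range and are accepted by a predictor. The names `read_fasta`, `screen`, and `is_cationic` are hypothetical; the default length range follows the 6 aa to 14 aa quoted above for SCAAPs, and `is_cationic` is only a crude stand-in for a real function predictor, not a method taken from this chapter.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_cationic(sequence):
    """Hypothetical stand-in predictor: more K/R than D/E residues."""
    positive = sum(sequence.count(aa) for aa in "KR")
    negative = sum(sequence.count(aa) for aa in "DE")
    return positive > negative


def screen(text, predictor, min_len=6, max_len=14):
    """Keep records whose length is in range and that the predictor accepts."""
    return [
        (header, seq)
        for header, seq in read_fasta(text)
        if min_len <= len(seq) <= max_len and predictor(seq)
    ]
```

Any function that maps a sequence to True or False can be passed as `predictor`, so a sequence-based predictor can be plugged in without changing the screening loop.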
Prediction algorithms for the predominant function of a protein can be classified in several ways, for example by whether they represent proteins in three-dimensional space or in linear form. Another classification distinguishes stochastic from deterministic algorithms, either of which may or may not evaluate physico-chemical properties.
In this work, we use two main divisions, supervised algorithms and non-supervised algorithms; both are discussed below.
3.4.1 Supervised Algorithms
A supervised algorithm is a computational code that requires calibration or training to know what to look for. Such codes are programmer-dependent: in a first stage they are calibrated and, in a second stage, once calibrated, they can search for a particular profile.
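The two stages can be sketched with a deliberately simple nearest-centroid classifier in Python. This is a generic illustration of calibrating first and searching second, not an algorithm described in this chapter; the function names `calibrate` and `search` and the idea of numeric feature vectors are assumptions made for the example.

```python
def calibrate(labeled_examples):
    """Stage 1: average the feature vectors of each label into a centroid."""
    sums, counts = {}, {}
    for features, label in labeled_examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, value in enumerate(features):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def search(centroids, features):
    """Stage 2: return the label whose centroid is closest to the features."""
    def squared_distance(centroid):
        return sum((c - f) ** 2 for c, f in zip(centroid, features))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))
```

The result of `search` depends entirely on the labeled examples passed to `calibrate`, which is the sense in which such codes depend on whoever prepares the training set.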
In the proteomics and genomics fields, there are many different algorithms designed under this assumption.
3.4.2 Non-Supervised Algorithms
A non-supervised algorithm is a computational code that does not require calibration or training to know what to look for; if calibration is needed at all, it is confined to one part of the code, which modifies itself to adjust the search criteria. Running these codes does not depend on the programmer.
In the proteomics and genomics fields, there are also algorithms of this type; although they are fewer, they are very useful.
In this chapter, we will use an algorithm of this type, named Polarity Index Method®, to explore the SARS-CoV-2 structural proteins.
3.5 Polarity Index Method®
The non-supervised algorithm named Polarity Index Method® (PIM®) is a system programmed in FORTRAN 77 and Linux. It calculates the PIM® profile of the target protein group and compares it with those of other groups, modifying the profile of the target group to make it representative of that group and discriminant against the other protein groups with which it is compared.
The metric of the PIM® profile consists of evaluating the 16 charge/polarity interactions identified by reading the sequence of a protein in pairs of residues, from left to right. The PIM® system has three stages:
1. The amino acid sequence is converted to the numeric charge/polarity-related annotations P+, P−, N, and NP, where P+ are H, His; K, Lys; and R, Arg; P− are D, Asp; and E, Glu; N are C, Cys; G, Gly; N, Asn; Q, Gln; S, Ser; T, Thr; and Y, Tyr; and NP are A, Ala; F, Phe; I, Ile; L, Leu; M, Met; P, Pro; V, Val; and W, Trp.
2. With the sequence expressed in FASTA format, all the incidences of these pairs of amino acids are registered in a 4 × 4 matrix whose rows and columns are the four PIM® profile groups. Once all amino acid pairs are recorded, the incidence matrix is normalized.
3. A 16-element vector is created by placing, from left to right, the 16 positions of the incidence matrix in increasing or decreasing order.
Two proteins are considered equal if their 16-element vectors are the same, in which case they are taken to share the same preponderant function.
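The three stages can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the FORTRAN 77 implementation. Several details are assumptions: pairs are taken as overlapping neighbours (residues 1-2, 2-3, and so on), the matrix is normalized by the total number of pairs, characters outside the 20 standard one-letter codes are skipped, positions are ranked in decreasing order, and ties are broken by matrix position. The names `incidence_matrix` and `polarity_vector` are hypothetical.

```python
from itertools import product

# Charge/polarity groups from stage 1 (one-letter amino acid codes).
GROUPS = {
    "P+": "HKR",
    "P-": "DE",
    "N": "CGNQSTY",
    "NP": "AFILMPVW",
}
ORDER = ["P+", "P-", "N", "NP"]  # row/column order of the matrix (assumed)
RESIDUE_TO_INDEX = {
    aa: ORDER.index(group) for group, letters in GROUPS.items() for aa in letters
}


def incidence_matrix(sequence):
    """Stage 2: count adjacent residue pairs and normalize by the pair total."""
    counts = [[0] * 4 for _ in range(4)]
    # Assumption: characters that are not standard residues are skipped.
    idx = [RESIDUE_TO_INDEX[aa] for aa in sequence.upper() if aa in RESIDUE_TO_INDEX]
    for a, b in zip(idx, idx[1:]):  # overlapping pairs, read left to right
        counts[a][b] += 1
    total = sum(map(sum, counts))
    if total == 0:
        return [[0.0] * 4 for _ in range(4)]
    return [[c / total for c in row] for row in counts]


def polarity_vector(sequence):
    """Stage 3: list the 16 matrix positions from most to least frequent."""
    m = incidence_matrix(sequence)
    cells = list(product(range(4), repeat=2))
    # Assumption: ties are broken by row-major matrix position.
    ranked = sorted(cells, key=lambda ij: (-m[ij[0]][ij[1]], ij))
    return [f"{ORDER[i]}/{ORDER[j]}" for i, j in ranked]
```

Under these assumptions, two sequences can be compared with `polarity_vector(a) == polarity_vector(b)`, which mirrors the equality criterion stated above.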
The main advantage of this method is that the metric acts on the linear representation of the protein and not on its three-dimensional representation, which makes a simple analysis possible. On the other hand, only one physico-chemical property is evaluated: the polarity/charge of the protein.