There are additional assumptions that the reader should be aware of regarding the construction of these PAM matrices. All sites have been assumed to be equally mutable, replacement has been assumed to be independent of surrounding residues, and there is no consideration of conserved blocks or motifs. The sequences being compared here are of average composition based on the small number of protein sequences available in 1978, so there is a bias toward small, globular proteins, even though efforts have been made to bring in additional sequence data over time (Gonnet et al. 1992; Jones et al. 1992). Finally, there is an implicit assumption that the forces responsible for sequence evolution over shorter time spans are the same as those for longer evolutionary time spans. Although there are significant drawbacks to the PAM matrices, it is important to remember that, given the information available in 1978, the development of these matrices marked an important advance in our ability to quantify the relationships between sequences. As these matrices are still available for use with numerous bioinformatic tools, the reader should keep these potential drawbacks in mind and use them judiciously.
In 1992, Steve and Jorja Henikoff took a slightly different approach to the one described above, one that addressed many of the drawbacks of the PAM matrices. The groundwork for the development of new matrices was a study aimed at identifying conserved motifs within families of proteins (Henikoff and Henikoff 1991, 1992). This study led to the creation of the BLOCKS database, which used the concept of a block to identify a family of proteins. The idea of a block is derived from the more familiar notion of a motif, which usually refers to a conserved stretch of amino acids that confers a specific function or structure to a protein. When these individual motifs from proteins in the same family can be aligned without introducing a gap, the result is a block, with the term block referring to the alignment, not the individual sequences themselves. Obviously, any given protein can contain one or more blocks, corresponding to each of its structural or functional motifs. With these protein blocks in hand, it was then possible to look for substitution patterns only in the most conserved regions of a protein, the regions that (presumably) were least prone to change. Two thousand blocks representing more than 500 groups of related proteins were examined and, based on the substitution patterns in those conserved blocks, blocks substitution matrices (or BLOSUMs, for short) were generated.
Given the pace of scientific discovery, many more protein sequences were available in 1992 than in 1978, providing for a more robust base set of data from which to derive these new matrices. However, the most important distinction between the BLOSUM and PAM matrices is that the BLOSUM matrices are directly calculated across varying evolutionary distances and are not extrapolated, providing a more accurate view of substitution patterns (and, in turn, evolutionary forces) at those various distances. The fact that the BLOSUM matrices are calculated directly based only on conserved regions makes these matrices more sensitive to detecting structural or functional substitutions; therefore, the BLOSUM matrices perform demonstrably better than the PAM matrices for local similarity searches (Henikoff and Henikoff 1993).
Returning to the point of directly deriving the various matrices, each BLOSUM matrix is assigned a number (BLOSUMn), and that number represents the conservation level of the sequences that were used to derive that particular matrix. For example, the BLOSUM62 matrix is calculated from sequences sharing no more than 62% identity; sequences with more than 62% identity are clustered and their contribution is weighted to 1. The clustering reduces the contribution of closely related sequences, meaning that there is less bias toward substitutions that occur (and may be over-represented) in the most closely related members of a family. Reducing the value of n therefore yields a matrix derived from, and better suited to, more distantly related sequences.
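The clustering step described above can be illustrated with a minimal sketch. It assumes a toy block of equal-length, ungapped segments, single-linkage clustering, and a weight of 1/(cluster size) per segment so that each cluster contributes a total weight of 1. The function names and the strict "more than n%" boundary are assumptions made for illustration; this is not the Henikoffs' actual implementation.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two equal-length, ungapped block segments."""
    if len(a) != len(b):
        raise ValueError("block segments must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_weights(segments: list[str], n: float) -> list[float]:
    """Cluster segments sharing more than n% identity (single linkage) and
    return one weight per segment so that every cluster sums to 1.

    Assumption: the boundary is strict (> n), following the wording
    "more than 62% identity" in the text.
    """
    parent = list(range(len(segments)))

    def find(i: int) -> int:
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Link any two segments whose identity exceeds the threshold.
    for i, j in combinations(range(len(segments)), 2):
        if percent_identity(segments[i], segments[j]) > n:
            parent[find(i)] = find(j)

    # Count cluster sizes, then weight each member by 1 / size.
    sizes: dict[int, int] = {}
    for i in range(len(segments)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return [1.0 / sizes[find(i)] for i in range(len(segments))]


# Hypothetical toy block: three close relatives and one divergent segment.
block = ["ACDEFGHIKL", "ACDEFGHIKM", "ACDEFGHIRM", "WYVTSRQPNM"]
print(cluster_weights(block, 62))  # the three close relatives share one cluster
print(cluster_weights(block, 95))  # nothing exceeds 95%, so every weight is 1
```

At the 62% threshold the three near-identical segments together count as a single sequence, which is how clustering keeps closely related family members from dominating the substitution counts.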
Which Matrices Should be Used When?
Although most bioinformatic software will provide users with a default choice of a scoring matrix, the default may not necessarily be the most appropriate choice for the biological question being asked. Table 3.1 is intended to provide some guidance as to the proper selection of scoring matrix, based on studies that have examined the effectiveness of these matrices in detecting known biological relationships (Altschul 1991; Henikoff and Henikoff 1993; Wheeler 2003). Note that the numbering schemes for the two matrix families move in opposite directions: more divergent sequences are found using higher numbered PAM matrices and lower numbered BLOSUM matrices. The following equivalencies are useful in relating PAM matrices to BLOSUM matrices (Wheeler 2003):
PAM250 is equivalent to BLOSUM45
PAM160 is equivalent to BLOSUM62
PAM120 is equivalent to BLOSUM80.
In addition to the protein matrices discussed here, there are numerous specialized matrices that are either specific to a particular species, concentrate on particular classes of proteins (e.g. transmembrane proteins), focus on structural substitutions, or use hydrophobicity measures in attempting to assess similarity (see Wheeler 2003). Given this landscape, the most important take-home message for the reader is that no single matrix is the complete answer for all sequence comparisons. A thorough understanding of what each matrix represents is critical to performing proper sequence-based analyses.
Table 3.1 Selecting an appropriate scoring matrix.
| Matrix | Best use | Similarity |
| --- | --- | --- |
| PAM40 | Short alignments that are highly similar | 70–90% |
| PAM160 | Detecting members of a protein family | 50–60% |
| PAM250 | Longer alignments of more divergent sequences | ∼30% |
| BLOSUM90 | Short alignments that are highly similar | 70–90% |
| BLOSUM80 | Detecting members of a protein family | 50–60% |
| BLOSUM62 | Most effective in finding all potential similarities | 30–40% |
| BLOSUM30 | Longer alignments of more divergent sequences | <30% |
The Similarity column gives the range of similarities that the matrix is able to best detect (Wheeler 2003).
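As a rough illustration of how the guidance in Table 3.1 might be applied programmatically, the sketch below encodes the table rows and returns the matrices whose stated similarity range covers an expected percent similarity. The names `MatrixGuide`, `TABLE_3_1`, and `suggest_matrices` are hypothetical. Two encodings are assumptions rather than values from the table: "∼30%" is treated as the band 25–35%, and "<30%" is treated as 0% up to, but not including, 30%.

```python
from typing import NamedTuple


class MatrixGuide(NamedTuple):
    name: str
    best_use: str
    low: float                   # lower bound of % similarity (inclusive)
    high: float                  # upper bound of % similarity
    high_inclusive: bool = True  # False encodes a strict "<" upper bound


TABLE_3_1 = [
    MatrixGuide("PAM40", "Short alignments that are highly similar", 70, 90),
    MatrixGuide("PAM160", "Detecting members of a protein family", 50, 60),
    # Assumption: "~30%" is encoded as the band 25-35%.
    MatrixGuide("PAM250", "Longer alignments of more divergent sequences", 25, 35),
    MatrixGuide("BLOSUM90", "Short alignments that are highly similar", 70, 90),
    MatrixGuide("BLOSUM80", "Detecting members of a protein family", 50, 60),
    MatrixGuide("BLOSUM62", "Most effective in finding all potential similarities", 30, 40),
    # Assumption: "<30%" is encoded as 0% up to, but excluding, 30%.
    MatrixGuide("BLOSUM30", "Longer alignments of more divergent sequences", 0, 30, False),
]


def suggest_matrices(similarity: float) -> list[str]:
    """Return names of matrices whose stated range covers the given % similarity."""
    hits = []
    for row in TABLE_3_1:
        below_high = similarity <= row.high if row.high_inclusive else similarity < row.high
        if row.low <= similarity and below_high:
            hits.append(row.name)
    return hits


print(suggest_matrices(75))  # highly similar sequences
print(suggest_matrices(20))  # divergent sequences
print(suggest_matrices(65))  # falls between the ranges listed in the table
```

An empty result, as for 65%, simply reflects that the table does not list a range covering that value; in such cases the choice is a judgment call rather than something the table decides.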
Nucleotide Scoring Matrices
At the nucleotide level, the scoring landscape is much simpler. More often than not, the matrices used here simply count matches and mismatches. These matrices also assume that each of the possible four nucleotide bases occurs with equal frequency (25% of the time). In some cases, ambiguities or chemical similarities between the bases are also considered; this type of matrix is shown in Figure 3.2. The basic differences in the construction of nucleotide and protein scoring matrices should make obvious the fact that protein-based searches are always more powerful than nucleotide-based searches of coding DNA sequences in determining similarity and inferring homology, given the inherently higher information content of the 20-letter amino acid alphabet versus the four-letter nucleotide alphabet.
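A minimal sketch of the simplest kind of nucleotide matrix described above, one that only distinguishes matches from mismatches, is given here. The score values (+1 and -1) and the function names are illustrative assumptions, not the defaults of any particular tool, and ambiguity codes are deliberately left out.

```python
from itertools import product

BASES = "ACGT"


def simple_nucleotide_matrix(match: int = 1, mismatch: int = -1) -> dict:
    """Build a 4 x 4 match/mismatch matrix keyed by (base, base) pairs.

    The default scores are illustrative assumptions only.
    """
    return {
        (a, b): (match if a == b else mismatch)
        for a, b in product(BASES, repeat=2)
    }


def score_ungapped(seq1: str, seq2: str, matrix: dict) -> int:
    """Sum the matrix scores over an ungapped alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length for an ungapped score")
    return sum(matrix[(a, b)] for a, b in zip(seq1.upper(), seq2.upper()))


matrix = simple_nucleotide_matrix()
# Five matching positions and one mismatch (A versus T at position 5).
print(score_ungapped("ACGTAC", "ACGTTC", matrix))
```

Because every mismatch receives the same score, such a matrix carries no information about which substitutions are more plausible, in contrast with the 20 x 20 protein matrices discussed earlier.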
Often, gaps are introduced to improve the alignment between two nucleotide or protein sequences. These gaps compensate for insertions and deletions between the sequences being studied; in essence, they represent biological events. As such, the number of gaps introduced into a pairwise sequence alignment needs to be kept reasonably small so as not to yield a biologically implausible scenario.