Bioinformatics

Praise for the third edition of Bioinformatics:

“This book is a gem to read and use in practice.”

“This volume has a distinctive, special value as it offers an unrivalled level of details and unique expert insights from the leading computational biologists, including the very creators of popular bioinformatics tools.”

“A valuable survey of this fascinating field. . . I found it to be the most useful book on bioinformatics that I have seen and recommend it very highly.”

“This should be on the bookshelf of every molecular biologist.”

The field of bioinformatics is advancing at a remarkable rate. With the development of new analytical techniques that make use of the latest advances in machine learning and data science, today’s biologists are gaining fantastic new insights into the natural world’s most complex systems. These rapidly progressing innovations can, however, be difficult to keep pace with.

The expanded fourth edition of the best-selling Bioinformatics aims to remedy this by providing students and professionals alike with a comprehensive survey of the current field. Revised to reflect recent advances in computational biology, it offers practical instruction on the gathering, analysis, and interpretation of data, as well as explanations of the most powerful algorithms presently used for biological discovery.

Bioinformatics offers the most readable, up-to-date, and thorough introduction to the field for biologists at all levels, covering both key concepts that have stood the test of time and the new and important developments driving this fast-moving discipline forwards.

This new edition features:

- New chapters on metabolomics, population genetics, metagenomics and microbial community analysis, and translational bioinformatics
- A thorough treatment of statistical methods as applied to biological data
- Special topic boxes and appendices highlighting experimental strategies and advanced concepts
- Annotated reference lists, comprehensive lists of relevant web resources, and an extensive glossary of commonly used terms in bioinformatics, genomics, and proteomics

Bioinformatics is an indispensable companion for researchers, instructors, and students of all levels in molecular biology and computational biology, as well as investigators involved in genomics, clinical research, proteomics, and related fields.

This chapter was written by Dr. Andreas D. Baxevanis in his private capacity. No official support or endorsement by the National Institutes of Health or the United States Department of Health and Human Services is intended or should be inferred.

3 Assessing Pairwise Sequence Similarity: BLAST and FASTA

Andreas D. Baxevanis

Introduction

One of the cornerstones of bioinformatics is the process of comparing nucleotide or protein sequences in order to deduce how the sequences are related to one another. Through this type of comparative analysis, one can draw inferences regarding whether two proteins have similar function, contain similar structural motifs, or have a discernible evolutionary relationship. This chapter focuses on pairwise alignments, where two sequences are directly compared, position by position, to deduce these relationships. Another approach, multiple sequence alignment, is used to identify important features common to three or more sequences; this approach, which is often used to predict secondary structure and functional motifs and to identify conserved positions and residues important to both structure and function, is discussed in Chapter 8.

Before entering into any discussion of how relatedness between nucleotide or protein sequences is assessed, two important terms need to be defined: similarity and homology. These terms tend to be used interchangeably when, in fact, they mean quite different things and imply quite different biological relationships.

Similarity is a quantitative measure of how related two sequences are to one another. Similarity is always based on an observable – usually a pairwise alignment of two sequences. When two sequences are aligned, one can simply count how many aligned positions contain identical residues, and this raw count can then be converted to the most commonly used measure of similarity: percent identity. Measures of similarity are used to quantify changes that occur as two sequences diverge over evolutionary time, considering the effect of substitutions, insertions, or deletions. They can also be used to identify residues that are crucial for maintaining a protein's structure or function. In short, a high percentage of sequence similarity may imply a common evolutionary history or a possible commonality in biological function.
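
As a rough illustration of the percent identity calculation described above, the following minimal sketch counts identical, non-gap columns in an already-aligned pair of sequences. The function name `percent_identity`, the use of `-` as the gap character, and the choice of the full alignment length as the denominator are assumptions made for this example; other conventions (such as dividing by the length of the shorter sequence) are also in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two aligned sequences (hypothetical helper).

    Both strings must already be aligned and may contain '-' for gaps.
    Assumption: the denominator is the total number of alignment columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if both residues match
    # and the shared character is not a gap.
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Five of the eight columns are identical.
print(percent_identity("GATTACA-", "GACTATAA"))  # 62.5
```

Note that the number alone says nothing about homology; it is only the observable on which such an inference might later be based.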

In contrast, homology implies an evolutionary relationship and is the putative conclusion reached based on examining the optimal alignment between two sequences and assessing their similarity. Genes (and their protein products) either are or are not homologous – homology is not measured in degrees or percentages. The concept of homology and the term homolog may apply to two different types of relationships, as follows.

If genes are separated by the event of speciation, they are termed orthologous. Orthologs are direct descendants of a sequence in a common ancestor, and they may have similar domain structure, three-dimensional structure, and biological function. Put simply, orthologs can be thought of as the same gene (or protein) in different species.

If genes within the same species are separated by a genetic duplication event, they are termed paralogous. The examination of paralogs provides insight into how pre-existing genes may have been adapted or co-opted toward providing a new or modified function within a given species.

The concepts of homology, orthology, and paralogy and methods for determining the evolutionary relationships between sequences are covered in much greater detail in Chapter 9.

Global Versus Local Sequence Alignments

The methods used to assess similarity (and, in turn, infer homology) can be grouped into two types: global sequence alignment and local sequence alignment. Global sequence alignment methods take two sequences and try to come up with the best alignment of the two sequences across their entire length. In general, global sequence alignment methods are most applicable to highly similar sequences of approximately the same length. Although these methods can be applied to any two sequences, as the degree of sequence similarity declines, they will tend to miss important biological relationships between sequences that may not be apparent when considering the sequences in their entirety.
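
To make the idea of aligning two sequences "across their entire length" concrete, here is a minimal sketch of a global alignment score computed by dynamic programming in the style of the Needleman-Wunsch algorithm, which the text above does not name. The scoring values (match +1, mismatch -1, linear gap penalty -2) and the function name are assumptions chosen purely for illustration.

```python
def global_alignment_score(
    a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2
) -> int:
    """Best end-to-end alignment score of a and b (illustrative scoring)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    # The bottom-right cell covers both sequences in full.
    return score[-1][-1]


print(global_alignment_score("ACGT", "ACT"))  # 1 (three matches, one gap)
```

Because every residue of both sequences must be accounted for, unrelated flanking regions drag the score down, which is one way to see why global methods suit similar sequences of similar length.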

Most biologists instead depend on the second class of alignment algorithm – local sequence alignments. In these methods, the sequence comparison is intended to find the most similar regions within the two sequences being aligned, rather than finding (or forcing) an alignment over the entire length of the two sequences being compared. By focusing on subsequences of high similarity that are more easily alignable, these methods make determining putative biological relationships between the two sequences a much easier proposition. This makes local alignment methods one of the approaches of choice for biological discovery. Often, these methods will return more than one result for the two sequences being compared, as there may be more than one domain or subsequence common to the sequences being analyzed. Local sequence alignment methods are best for sequences that share some degree of similarity or for sequences of different lengths, and the ensuing discussion will focus mostly on these methods.
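
The contrast with global alignment can be sketched with a Smith-Waterman-style dynamic program, again a technique the text above does not name. In this hedged example, scores are never allowed to fall below zero and the best cell anywhere in the table is reported, so only the most similar region contributes. The scoring values (match +2, mismatch -1, gap -2) and the function name are illustrative assumptions.

```python
def local_alignment_score(
    a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2
) -> int:
    """Score of the best-scoring local alignment (illustrative scoring)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at zero lets an alignment restart anywhere,
            # so dissimilar flanking regions are simply ignored.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared core "GATTACA" (7 matches x 2) is found despite unrelated flanks.
print(local_alignment_score("TTTTGATTACATTTT", "CCCGATTACACCC"))  # 14
```

This sketch returns only the single best score; reporting several non-overlapping regions, as the paragraph above describes, would require additional bookkeeping not shown here.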

Scoring Matrices

Whether one uses a global or local alignment method, once the two sequences under consideration are aligned, how does one actually measure how good the alignment is between “sequence A” and “sequence B”? The first step toward answering that question involves numerical methods that consider not just the position-by-position overlap between two sequences but also the nature and characteristics of the residues or nucleotides being aligned.

Much effort has been devoted to the development of constructs called scoring matrices. These matrices are empirical weighting schemes that appear in all analyses involving the comparison of two or more sequences, so it is important to understand how these matrices are constructed and how to choose between matrices. The choice of matrix can (and does) strongly influence the results obtained with most sequence comparison methods.
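
To show how a scoring matrix acts as a weighting scheme, the following sketch scores an ungapped protein alignment with a deliberately tiny, made-up scheme. The values (+5 for identity, +2 for a conservative substitution, -3 otherwise) and the two "conservative" pairs are assumptions invented for this example; they are not taken from BLOSUM, PAM, or any other published matrix, which should be used for real analyses.

```python
# Hypothetical toy scheme: pairs treated as conservative substitutions.
CONSERVATIVE_PAIRS = {frozenset("LI"), frozenset("DE")}


def pair_score(x: str, y: str) -> int:
    """Score one aligned residue pair under the toy scheme."""
    if x == y:
        return 5   # identical residues
    if frozenset((x, y)) in CONSERVATIVE_PAIRS:
        return 2   # physicochemically similar residues
    return -3      # any other substitution


def alignment_score(seq_a: str, seq_b: str) -> int:
    """Sum pair scores over an ungapped alignment of equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(seq_a, seq_b))


print(alignment_score("LDE", "LDE"))  # 15: three identities
print(alignment_score("LDE", "IED"))  # 6: three conservative substitutions
print(alignment_score("LDE", "DLE"))  # -1: two poor substitutions, one identity
```

Even in this toy setting, two alignments with the same percent identity can receive very different scores, which hints at why the choice of matrix influences the results.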

The most commonly used protein scoring matrices consider the following three major biological factors.

1 Conservation. The matrices need to consider absolute conservation between protein sequences and also need to provide a way to assess conservative amino acid substitutions. The numbers within the scoring matrix provide a way of representing what amino acid residues are capable of substituting for other residues while not adversely affecting the function of the native protein. From a physicochemical standpoint, characteristics such as residue charge, size, and hydrophobicity (among others) need to be similar.

Figure 3.1 The BLOSUM62 scoring matrix (Henikoff and Henikoff 1992). BLOSUM62 is the most widely used scoring matrix for protein analysis and provides best coverage for general-use cases. Standard single-letter codes to the left of each row and at the top of each column specify each of the 20 amino acids. The ambiguity codes B (for asparagine or aspartic acid; Asx) and Z (for glutamine or glutamic acid; Glx) also appear, as well as an X (denoting any amino acid). Note that the matrix is a mirror image of itself with respect to the diagonal. See text for details.