Stephen Winters-Hilt - Informatics and Machine Learning
Discover a thorough exploration of how to use computational, algorithmic, statistical, and informatics methods to analyze digital data in Informatics and Machine Learning: From Martingales to Metaheuristics.
2.7 Exercises
2.1 Evaluate the Shannon entropy, by hand, for the fair die probability distribution (1/6, 1/6, 1/6, 1/6, 1/6, 1/6), i.e. the probability of rolling a 1 through a 6 is the same, 1/6, for a uniform probability distribution. Also evaluate it for the loaded die distribution (1/10, 1/10, 1/10, 1/10, 1/10, 1/2).
2.2 Evaluate the Shannon entropy for the fair and loaded probability distributions in Exercise 2.1 computationally, by running the program described in Section 2.1 (a minimal entropy/mutual-information sketch, distinct from that program, appears after this exercise list).
2.3 Now consider you have two dice, where each separately rolls “fair,” but together they do not roll “fair,” i.e. each specific pair of die rolls does not have probability 1/36, but instead has the probability shown below:

Die 1 roll   Die 2 roll   Probability
1            1            (1/6)*(0.001)
1            2            (1/6)*(0.125)
1            3            (1/6)*(0.125)
1            4            (1/6)*(0.125)
1            5            (1/6)*(0.124)
1            6            (1/6)*(0.5)
2            Any          (1/6)*(1/6)
3            Any          (1/6)*(1/6)
4            Any          (1/6)*(1/6)
5            Any          (1/6)*(1/6)
6            1            (1/6)*(0.5)
6            2            (1/6)*(0.125)
6            3            (1/6)*(0.125)
6            4            (1/6)*(0.125)
6            5            (1/6)*(0.124)
6            6            (1/6)*(0.001)

What is the Shannon entropy for the Die 1 outcomes (call it H(1))? What is the Shannon entropy of the Die 2 outcomes (refer to it as H(2))? What is the Shannon entropy of the two-dice outcomes with the probabilities shown in the table above (denote it H(1,2))? Compute the function MI(Die 1, Die 2) = H(1) + H(2) − H(1,2). Is it positive? (See the entropy/mutual-information sketch after this exercise list.)
2.4 Go to GenBank ( https://www.ncbi.nlm.nih.gov/genbank/) and select the genome of a small virus (~10 kb). Using the Python code shown in Section 2.1, determine the base frequencies for {a,c,g,t}. What is the Shannon entropy if those frequencies are taken to be the probabilities of the associated outcomes?
2.5 Go to GenBank ( https://www.ncbi.nlm.nih.gov/genbank/) and select the genomes of three medium-sized viruses (~100 kb). Using the Python code shown in Section 2.1, determine the trinucleotide frequencies. What is the Shannon entropy of the trinucleotide frequencies for each of the three virus genomes? Using this as a distance measure, phylogenetically speaking, which two viruses are most closely related? (A dictionary-based k-mer counting sketch appears after this exercise list.)
2.6 Repeat Exercise 2.5, but now use the symmetrized relative entropy between the trinucleotide probability distributions as the distance measure instead (re-evaluate pairwise between the three viruses). Using this as a distance measure, phylogenetically speaking, which two viruses are most closely related?
2.7 Prove that relative entropy is always nonnegative (hint: use Jensen's Inequality from Section 2.4).
2.8 What is the Expectation for the two-dice roll with the pair outcome probabilities listed in Exercise 2.3?
2.9 What is the Expectation for the two-dice roll with fair dice? Is this expectation an actual outcome possibility? What does it mean if it is not?
2.10 Survey the literature and write a report on common occurrences of distributions of the type: uniform, geometric, exponential, Gaussian, log-normal, heavy-tail.
2.11 Survey the literature and write a report on common occurrences of series of the type: Martingale.
2.12 Consider the loaded die example, where the probability of rolling a 1, 2, 3, 4, or 5 is 0.1, and the probability of rolling a 6 is 0.5. What is the expectation for the loaded die? What is its variance? What is its mode? What is its median? The LLN for the loaded die above indicates that a sequence of rolls could be done, and if its average tends toward 4.5 you know it is loaded, while if it tends toward 3.5 you know it is fair. So it comes down to how soon you can resolve which of these two possible expectations, differing by 1.0, the average is converging on. Suppose someone is rolling a die that is either fair or loaded as described above; how many rolls do you think you will need to see before it is obvious how the average is trending? Is there a better way to spot the anomaly, e.g. is the frequency of seeing three sixes in a row notably skewed? (A minimal simulation sketch appears after this exercise list.)
2.13 You have a genomic sequence of length L. (For DNA genomes, L is approximately 10**4 for viruses, 10**6 for bacteria, and 10**9 for mammals.) A typical analysis is to get counts on subsequences of length N within the full sequence of length L, where there are L − N + 1 subsequences of length N (by sliding a window of width N across the length-L sequence and taking the window samples accordingly). The number of possible subsequences of length N grows exponentially with that length. For DNA subsequences of length six bases, the 6mers, with four base possibilities, {a,c,g,t}, there are thus 4**6 = 4096 possible 6mers. If the 6mers are equally probable, then in the approximately 10 000-base length of a virus each 6mer might be seen a couple of times (10 000/4096 ≈ 2.4 on average), while a particular 6mer can be seen millions of times in mammalian genomes. Sounds fine so far, but now consider an analysis of 25mers. The possible 25mers number 4**25 = 2**50 = (2**10)**5 = 1024**5 ≈ 10**15. So, a million billion possibilities. It turns out that DNA information does not have subsequences with approximately equal statistical counts (equal probabilities), but, instead, is highly structured with a variety of overlapping encoding schemes, so it has subsequences with very unequal statistics. The vast majority of the 25mer subsequences, in fact, will have zero counts, such that enumeration of the possibilities ahead of time in an array data structure is not useful, or even possible in some cases, which then leads to associative arrays in this context, as shown in the sample code (see also the dictionary-based counting sketch after this exercise list). Do a 25mer analysis on a bacterial genome (obtained from GenBank, e.g. E. coli). What is the highest-count 25mer subsequence?
2.14 A rare genetic disease has probability P(disease) = 0.0000001. Suppose you have a test for this condition with sensitivity that is perfect (SN = 1.0) and specificity that is 99.99% correct (SP = 0.9999), i.e. false positives occur with probability 0.0001. Is this test useful in practice? (A minimal Bayes' rule sketch appears after this exercise list.)
2.15 Prove P(X,Y|Z) = P(X|Z) P(Y|X,Z).
2.16 Prove the Bonferroni inequality: P(X,Y) ≥ P(X) + P(Y) − 1.
2.17 Suppose you are on a game show (with Monty Hall), and you are given the choice of three doors: behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what is behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to change your pick to door No. 2?” Is it to your advantage to switch your choice? Prove by tabulation of possibilities, and then prove using Bayes' Rule with an appropriate choice of variables.
2.18 Prove E(Y/X) ≥ E(Y)/E(X).
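The following minimal sketches are referenced from several of the exercises above. They are offered only as hedged starting points, not as the Section 2.1 program itself. First, a sketch for the Shannon entropy and mutual-information quantities in Exercises 2.1–2.3; the probability tables are copied from the exercise statements:

import math

def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over the nonzero p."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Exercise 2.1: fair and loaded single die
fair = [1/6] * 6
loaded = [1/10] * 5 + [1/2]
print("H(fair)   =", shannon_entropy(fair))
print("H(loaded) =", shannon_entropy(loaded))

# Exercise 2.3: joint distribution for the two-dice table
joint = {}
for d2, p in zip(range(1, 7), [0.001, 0.125, 0.125, 0.125, 0.124, 0.5]):
    joint[(1, d2)] = (1/6) * p                  # Die 1 rolls a 1
for d1 in range(2, 6):                          # Die 1 rolls 2..5: all pairs equally likely
    for d2 in range(1, 7):
        joint[(d1, d2)] = (1/6) * (1/6)
for d2, p in zip(range(1, 7), [0.5, 0.125, 0.125, 0.125, 0.124, 0.001]):
    joint[(6, d2)] = (1/6) * p                  # Die 1 rolls a 6

h12 = shannon_entropy(joint.values())
h1 = shannon_entropy([sum(p for (a, b), p in joint.items() if a == d) for d in range(1, 7)])
h2 = shannon_entropy([sum(p for (a, b), p in joint.items() if b == d) for d in range(1, 7)])
print("MI =", h1 + h2 - h12)                    # MI(Die 1, Die 2) = H(1) + H(2) - H(1,2)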
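Next, for Exercises 2.4–2.6 and 2.13, a dictionary (associative array) based k-mer counter and a symmetrized relative entropy distance. This is a minimal sketch assuming locally saved FASTA files; the filenames in the usage comments are hypothetical and the functions are not the book's own code:

import math
from collections import defaultdict

def read_fasta(path):
    """Concatenate the sequence lines of a (single-record) FASTA file."""
    seq = []
    with open(path) as handle:
        for line in handle:
            if not line.startswith(">"):
                seq.append(line.strip().lower())
    return "".join(seq)

def kmer_counts(seq, k):
    """Slide a window of width k across the sequence; L - k + 1 windows total."""
    counts = defaultdict(int)
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return counts

def to_probs(counts):
    """Convert raw counts to relative frequencies (probability estimates)."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def sym_relative_entropy(p, q, eps=1e-12):
    """Symmetrized relative entropy D(p||q) + D(q||p), with a small floor to avoid log(0)."""
    keys = set(p) | set(q)
    d_pq = sum(p[k] * math.log2(p[k] / q.get(k, eps)) for k in keys if p.get(k, 0) > 0)
    d_qp = sum(q[k] * math.log2(q[k] / p.get(k, eps)) for k in keys if q.get(k, 0) > 0)
    return d_pq + d_qp

# Hypothetical usage: genome files downloaded from GenBank and saved locally.
# virus1 = to_probs(kmer_counts(read_fasta("virus1.fasta"), 3))   # trinucleotide probabilities
# virus2 = to_probs(kmer_counts(read_fasta("virus2.fasta"), 3))
# print(sym_relative_entropy(virus1, virus2))
# For Exercise 2.13, use k = 25 and report max(counts, key=counts.get).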
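For Exercise 2.12, a minimal Monte Carlo sketch of the running average for the fair versus loaded die; the number of rolls is an arbitrary choice:

import random

def roll_loaded():
    """Loaded die from Exercise 2.12: P(1..5) = 0.1 each, P(6) = 0.5."""
    return random.choices([1, 2, 3, 4, 5, 6], weights=[0.1, 0.1, 0.1, 0.1, 0.1, 0.5])[0]

def running_average(roll_fn, n_rolls):
    """Track the average of the first i rolls, for i = 1..n_rolls."""
    total = 0
    averages = []
    for i in range(1, n_rolls + 1):
        total += roll_fn()
        averages.append(total / i)
    return averages

fair_avg = running_average(lambda: random.randint(1, 6), 1000)
loaded_avg = running_average(roll_loaded, 1000)
# The fair average should trend toward 3.5, the loaded one toward 4.5;
# inspect how many rolls it takes for the two traces to separate cleanly.
print(fair_avg[-1], loaded_avg[-1])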
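Finally, for Exercise 2.14, a minimal sketch of the Bayes' rule computation of the post-test probability P(disease | positive), using the numbers stated in the exercise; interpreting whether the test is useful in practice is left to the reader:

def posterior_disease_given_positive(p_disease, sensitivity, specificity):
    """Bayes' rule: P(D|+) = P(+|D) P(D) / [P(+|D) P(D) + P(+|~D) P(~D)]."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1.0 - specificity          # false-positive rate
    numerator = p_pos_given_disease * p_disease
    denominator = numerator + p_pos_given_healthy * (1.0 - p_disease)
    return numerator / denominator

# Numbers from Exercise 2.14: P(disease) = 1e-7, SN = 1.0, SP = 0.9999.
print(posterior_disease_given_positive(1e-7, 1.0, 0.9999))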
3 Information Entropy and Statistical Measures
In this chapter, we start with a description of information entropy and statistical measures (Section 3.1). Using these measures we then examine “raw” genomic data. No biology or biochemistry knowledge is needed in doing this analysis, and yet we almost trivially rediscover a three-element encoding scheme that is famous in biology, known as the codon. Analysis of information encoding in the four-element {a, c, g, t} genomic sequence alphabet is about as simple as you can get (without working with binary data), so it provides some of the introductory examples that are implemented. A few (simple) statistical queries to get the details of the codon encoding scheme are then straightforward (Section 3.2). Once the encoding scheme is known to exist, further structure is revealed via the anomalous placement of “stop” codons, e.g. anomalously large open reading frames (ORFs) are discovered. A few more (simple) statistical queries from there, and the relation of ORFs to gene structure is revealed (Section 3.3). Once you have a clear structure in the sequential data that can be referenced positionally, it is then possible to gather statistical information for a Markov model. One example of this is to look at the positional base statistics at various positions “upstream” from the start codon. We thereby identify binding sites for critical molecular interactions in both transcription and translation. Since the Markov model is needed in the analysis of sequential processes in general for what is discussed in later chapters (Chapters 6 and 7 in particular), a review of Markov models, and some of their specializations, is given in Section 3.4 (Chapters 6 and 7 cover hidden Markov models, or HMMs).
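As a pointer toward the Markov model review in Section 3.4, here is a minimal sketch, not the book's code, of estimating first-order transition probabilities over {a, c, g, t} from a genomic sequence string; the function and variable names are hypothetical:

from collections import defaultdict

def transition_probabilities(seq, alphabet="acgt"):
    """Estimate first-order Markov transition probabilities P(next base | current base)."""
    counts = {a: defaultdict(int) for a in alphabet}
    for cur, nxt in zip(seq, seq[1:]):          # walk adjacent base pairs
        if cur in counts and nxt in alphabet:
            counts[cur][nxt] += 1
    probs = {}
    for cur, row in counts.items():
        total = sum(row.values())
        probs[cur] = {nxt: row[nxt] / total for nxt in alphabet} if total else {}
    return probs

# Hypothetical usage on a genomic string read as in Chapter 2's examples:
# print(transition_probabilities("acgtacgtgggtttaacc"))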