Stephen Winters-Hilt - Informatics and Machine Learning
Informatics and Machine Learning: From Martingales to Metaheuristics — a thorough exploration of how to use computational, algorithmic, statistical, and informatics methods to analyze digital data.
Figure 3.1 Codon structure is revealed in the V. cholerae genome by mutual information between nucleotides in the genomic sequence when evaluated for different gap sizes.
In Figure 3.1, the mutual information at first falls off as we look at statistical linkages at greater and greater distances, which makes sense for any information construct that is "local," i.e. it should have weaker linkage with structure further away. After a certain point, however, the mutual information no longer falls off, instead cycling back to a certain level with a cycle period of three bases. This suggests that a long-range three-element encoding scheme might exist (among other things), which can easily be tested. In doing so we ask Nature "the right question," and the answer is the rediscovery of the codon encoding scheme, as will be shown in what follows.
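The measurement behind Figure 3.1 can be sketched as follows. This is a minimal illustration of gap-based mutual information, not the book's own code; the function name and toy sequence are ours, and a real run would use a genome read from a FASTA file:

```python
# Minimal sketch: mutual information (bits) between nucleotides
# separated by 'gap' positions, evaluated for a range of gap sizes.
import math
from collections import Counter

def mutual_information(seq, gap):
    """I(X;Y) between the base at position i and the base at i + gap."""
    pairs = [(seq[i], seq[i + gap]) for i in range(len(seq) - gap)]
    n = len(pairs)
    pair_counts = Counter(pairs)
    x_counts = Counter(p[0] for p in pairs)
    y_counts = Counter(p[1] for p in pairs)
    mi = 0.0
    for (x, y), c in pair_counts.items():
        pxy = c / n
        px = x_counts[x] / n
        py = y_counts[y] / n
        mi += pxy * math.log2(pxy / (px * py))
    return mi

# Toy sequence (not a genome) just to exercise the function:
seq = "agttagcgcgttagcatagccgtaa" * 200
for gap in range(1, 10):
    print(gap, round(mutual_information(seq, gap), 4))
```

On a genome, plotting this value against the gap size reproduces the falling-then-cycling profile of Figure 3.1.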
So, to clarify before proceeding, suppose we want to get information on a three-element encoding scheme for the Escherichia coli genome (Chromosome 1), say, in file EC_Chr1.fasta.txt. We therefore want order = 3 oligo counting, but on three-element windows that "step" across the genome, i.e. a stepping window for sampling, not a sliding window. This results in three choices of stepping, or framing, according to how you take your first step:
case 0: agttagcgcgt ‐‐> (agt)(tag)(cgc)gt
case 1: agttagcgcgt ‐‐> a(gtt)(agc)(gcg)t
case 2: agttagcgcgt ‐‐> ag(tta)(gcg)(cgt)
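The three framings above can be illustrated with a small helper (the function name is ours, not part of prog2.py):

```python
# Split a sequence into non-overlapping codons, starting at offset 'frame';
# trailing bases that do not fill a full codon are dropped.
def frame_codons(seq, frame):
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

seq = "agttagcgcgt"
print(frame_codons(seq, 0))  # ['agt', 'tag', 'cgc']
print(frame_codons(seq, 1))  # ['gtt', 'agc', 'gcg']
print(frame_codons(seq, 2))  # ['tta', 'gcg', 'cgt']
```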
In the code that follows we get codon counts for a particular frame‐pass (prog2.py addendum 4):
------------------- prog2.py addendum 4 ---------------------
# So we suspect the existence of a three-element coding scheme, the codon,
# and need stats (anomalous) on codons....
# 'frame' specifies the frame pass as case 0, 1, or 2 in the text.
# Getting codon counts for a specified framing and specified sequence
# will now be shown in two ways, one built from re-use of code blocks,
# one from re-use of an entire subroutine
# (re and numpy as np are imported, and count_to_freq defined,
# earlier in prog2.py):

def codon_counter(seq, frame):
    codon_counts = {}
    pattern = '[acgtACGT]'
    result = re.findall(pattern, seq)
    seqlen = len(result)   # count over the filtered bases, not the raw string
    for index in range(frame, seqlen - 2):
        if (index + 3 - frame) % 3 != 0:   # keep only in-frame start positions
            continue
        codon = result[index] + result[index + 1] + result[index + 2]
        if codon in codon_counts:
            codon_counts[codon] += 1
        else:
            codon_counts[codon] = 1
    counts = np.empty((0))
    for i in sorted(codon_counts):
        counts = np.append(counts, codon_counts[i] + 0.0)
        print("codon", i, "count =", codon_counts[i])
    probs = count_to_freq(counts)
    return probs

codon_counter(EC_sequence, 0)

# Could also get codon counts via shannon_order with a modification to the
# step (and with order=3 for the codon):
def shannon_codon(seq, frame):
    order = 3
    stats = {}
    pattern = '[acgtACGT]'
    result = re.findall(pattern, seq)
    seqlen = len(result)
    for index in range(order - 1 + frame, seqlen):
        if (index - frame) % 3 != 2:   # keep only in-frame codon end positions
            continue
        xmer = ""
        for xmeri in range(0, order):
            xmer += result[index - (order - 1) + xmeri]
        if xmer in stats:
            stats[xmer] += 1
        else:
            stats[xmer] = 1
    for i in sorted(stats):
        print("%d %s" % (stats[i], i))
    counts = np.empty((0))
    for i in sorted(stats):
        counts = np.append(counts, stats[i] + 0.0)
    probs = count_to_freq(counts)
    return probs

shannon_codon(EC_sequence, 0)
---------------- prog2.py addendum 4 end -------------------
In running prog2.py addendum 4 we find that the codon “tag” has much lower counts, and similarly for the codon “cta”:
frame 0 have tag with 8970 and cta with 8916
frame 1 have tag with 9407 and cta with 8821
frame 2 have tag with 8877 and cta with 9033
The tag and cta trinucleotides happen to be related – they are reverse complements of each other (the first hint of information encoding via duplex deoxyribonucleic acid (DNA) with Watson–Crick base-pairing). There are two other notably rare codons: taa and tga (and their reverse complements in this all-frame, genome-wide study as well).
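The reverse-complement relationship is easy to check directly. A minimal sketch (the helper name is ours; lowercase input is assumed):

```python
# Watson-Crick reverse complement of a lowercase DNA string:
# reverse the sequence, then swap a<->t and c<->g.
def reverse_complement(seq):
    pairing = {"a": "t", "t": "a", "c": "g", "g": "c"}
    return "".join(pairing[base] for base in reversed(seq))

print(reverse_complement("tag"))  # cta
print(reverse_complement("taa"))  # tta
print(reverse_complement("tga"))  # tca
```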
Now that we have identified an interesting feature, such as "tag," it is reasonable to ask about this feature's placement across the genome. Having done that, the follow-up is to identify any anomalously recurring feature proximate to the feature of interest. Such an analysis needs a generic subroutine for getting counts on gaps between occurrences of a given substring in the genome sequence data, and that is provided next as addendum 5 to prog2.py.
-------------------- prog2.py addendum 5 --------------------
# See that 'tag' is anomalous; want to get a sense of the distribution on
# gaps between 'tag' occurrences (still stepping 'in-frame').

def codon_gap_counter(seq, frame, delimiter):
    counts = {}
    pattern = '[acgtACGT]'
    result = re.findall(pattern, seq)
    seqlen = len(result)
    oldindex = 0
    for index in range(frame, seqlen - 2):
        if (index + 3 - frame) % 3 != 0:
            continue
        codon = result[index] + result[index + 1] + result[index + 2]
        if codon.lower() != delimiter.lower():   # case-insensitive match
            continue
        gap = index - oldindex
        quant = 100
        gapbin = gap // quant   # integer division bins gaps into widths of 100
        if oldindex != 0:       # skip the first occurrence (no prior delimiter)
            if gapbin in counts:
                counts[gapbin] += 1
            else:
                counts[gapbin] = 1
        oldindex = index
    npcounts = np.empty((0))
    for i in sorted(counts):
        npcounts = np.append(npcounts, counts[i] + 0.0)
        print("gapbin", i, "count =", counts[i])
    probs = count_to_freq(npcounts)
    return probs

# usage:
delimiters = ("AAA","AAC","AAG","AAT","ACA","ACC","ACG","ACT",
              "AGA","AGC","AGG","AGT","ATA","ATC","ATG","ATT",
              "CAA","CAC","CAG","CAT","CCA","CCC","CCG","CCT",
              "CGA","CGC","CGG","CGT","CTA","CTC","CTG","CTT",
              "GAA","GAC","GAG","GAT","GCA","GCC","GCG","GCT",
              "GGA","GGC","GGG","GGT","GTA","GTC","GTG","GTT",
              "TAA","TAC","TAG","TAT","TCA","TCC","TCG","TCT",
              "TGA","TGC","TGG","TGT","TTA","TTC","TTG","TTT")

for delimiter in delimiters:
    print("\n\ndelimiter is", delimiter)
    codon_gap_counter(EC_sequence, 0, delimiter)
----------------- prog2.py addendum 5 end ------------------
Upon running the above code with the codon delimiter set to "tag," we arrive at Table 3.1, which shows the distribution of tag gap sizes. The bin size is 100, so gap bin 0 counts all gaps of size 1–99, bin 1 counts gaps of size 100–199, and so on.
Table 3.1 (tag) Gap sizes, with bin size 100.
| Gap bin | Count |
|---|---|
| 0 | 2115 |
| 1 | 1428 |
| 2 | 1066 |
| 3 | 829 |
| 4 | 696 |
| 5 | 484 |
| 6 | 399 |
| 7 | 293 |
| 8 | 241 |
| 9 | 222 |
Table 3.2 (aaa) Gap sizes, with bin size 100.
| Gap bin | Count |
|---|---|
| 0 | 21256 |
| 1 | 7843 |
| 2 | 3375 |
| 3 | 1665 |
| 4 | 827 |
| 5 | 480 |
| 6 | 287 |
| 7 | 163 |
| 8 | 86 |
| 9 | 70 |
In order to see how strongly the tag distribution is skewed, we evaluate another codon for comparison: the gaps for "aaa," the most common codon. The aaa gaps, shown in Table 3.2, tend to be much smaller, with the standard exponential fall-off indicative of no long-range encoding linkages.
Thus the codon tag is clearly very different from aaa: it is as if tag roughly marks the boundaries of regions, while aaa is simply scattered throughout. Are any other codons similar to tag? Frequency analysis alone blurs such distinctions, so the gap counter must be run for each codon and the resulting distributions compared directly. Notice how the tag distribution has a long tail. The gap bins only go to 9 in the tables, but over the full dataset the last nonzero gap bin for "tag" is at a remarkable bin 70. For "aaa" the last nonzero bin comes much earlier, at bin 23 (even though there are ten times as many aaa codons as tag codons). For "taa" the last nonzero bin is at 60, while for tga it is at 53. The codons taa, tga, and tag are known as the stop codons, and the gaps between them are known as open reading frames (ORFs). A subtlety in the statistical analysis is that the stop codons do not have to match to define such anomalously large regions (according to observation). Thus, a biochemical encoding scheme must exist that treats any of the three stop codons as equivalent, hence the naming of this group as "stop" codons (and their grouping as such in Figure 3.2). For more nuances of the naming convention "stop" codon when delving into the encoding biochemistry, see [1, 3].
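The observation that any of the three stop codons can close a region suggests a simple in-frame scan. The following is a rough sketch under that assumption (the function and constant names are ours, not from prog2.py):

```python
# In-frame scan treating taa, tag, and tga as equivalent delimiters:
# the gaps between successive in-frame stop codons bound candidate ORFs.
STOPS = {"taa", "tag", "tga"}

def orf_gap_lengths(seq, frame):
    """Gaps (in bases) between successive in-frame stop codons."""
    gaps, last_stop = [], None
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            if last_stop is not None:
                gaps.append(i - last_stop)
            last_stop = i
    return gaps

print(orf_gap_lengths("tag" + "aaa" * 3 + "taa", 0))  # [12]
```

Anomalously large gaps in this scan are exactly the long-tail events seen in the tag, taa, and tga gap tables.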