Stephen Winters-Hilt - Informatics and Machine Learning
Informatics and Machine Learning: summary and description
Informatics and Machine Learning: From Martingales to Metaheuristics offers a thorough exploration of how to use computational, algorithmic, statistical, and informatics methods to analyze digital data, spanning both ad hoc and ab initio approaches.
Informatics and Machine Learning — excerpt
1.2 Informatics and Data Analytics
It is common to need to acquire a signal where the signal properties are not known, or the signal is only suspected and not yet discovered, or the signal properties are known but would be too much trouble to fully enumerate. There is no common solution, however, to the acquisition task. For this reason the initial phases of acquisition methods unavoidably tend to be ad hoc. As with data dependency in non-evolutionary search metaheuristics (where there is no optimal search method guaranteed to always work well), here there is no optimal signal acquisition method known in advance. In what follows, methods are described for bootstrap optimization of signal acquisition, to enable the most general-use, almost "common," solution possible. The bootstrap algorithmic method involves repeated passes over the data sequence, with improved priors and trained filters, among other things, so that signal acquisition improves on subsequent passes. The signal acquisition is guided by statistical measures to recognize anomalies. Informatics methods and information theory measures are central to the design of a good finite state automaton (FSA) acquisition method, and will be reviewed in the signal acquisition context in Chapters 2–4. Code examples are given in Python and C (with introductory Python described in Chapter 2 and Appendix A). Bootstrap acquisition methods may not automatically provide a common solution, but appear to offer a process whereby a solution can be improved to some desirable level of general-data applicability.
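To make the bootstrap idea concrete, consider the following minimal, self-contained Python sketch. It is illustrative only, not the book's code: the threshold scheme and parameter names are assumptions. Each pass re-estimates the noise baseline from the samples not yet flagged as signal, so the priors improve and the next pass acquires signal more cleanly.

import numpy as np

def bootstrap_acquire(signal, n_passes=3, k_sigma=5.0):
    # Toy bootstrap acquisition: flag samples above a noise-derived
    # threshold, then re-estimate the noise statistics from the unflagged
    # (baseline) samples and rescan with the improved priors.
    mu, sigma = signal.mean(), signal.std()
    flagged = np.zeros(len(signal), dtype=bool)
    for _ in range(n_passes):
        flagged = signal > mu + k_sigma * sigma   # O(L) anomaly scan
        baseline = signal[~flagged]               # improved prior: noise-only samples
        mu, sigma = baseline.mean(), baseline.std()
    return np.flatnonzero(flagged)

# Example: two injected spikes in Gaussian noise are recovered.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10_000)
x[[1_000, 7_500]] += 25.0
print(bootstrap_acquire(x))   # -> [1000 7500]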
The signal analysis and pattern recognition methods described in this book are mainly applied to problems involving stochastic sequential data: power signals and genomic sequences in particular. The information modeling, feature selection/extraction, and feature-vector discrimination methods, however, were each developed separately in a general-use context. Details on the theoretical underpinnings are given in Chapter 3, including a collection of ab initio information theory tools to help "find your way around in the dark." One of the main ab initio approaches is to search for statistical anomalies using information measures, so various information measures will be described in detail [103–115].
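As a concrete instance of such measures (an illustrative sketch, not code from the book), the functions below compute the Shannon entropy of a symbol window and its relative entropy (Kullback–Leibler divergence) against a background distribution; a window whose composition diverges sharply from background is a statistical anomaly worth acquiring.

import math
from collections import Counter

def shannon_entropy(seq):
    # Shannon entropy, in bits, of the empirical symbol distribution of seq.
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def relative_entropy(observed, background):
    # Kullback-Leibler divergence D(p || q) in bits, with p the observed
    # window's distribution and q the background. Symbols absent from the
    # background are skipped for simplicity; a pseudocount would be the
    # more careful treatment.
    p, q = Counter(observed), Counter(background)
    n, m = len(observed), len(background)
    return sum((c / n) * math.log2((c / n) / (q[s] / m))
               for s, c in p.items() if q[s] > 0)

print(shannon_entropy("acgtacgtacgt"))             # 2.0 bits: maximally mixed over {a,c,g,t}
print(relative_entropy("aaaaaaat", "acgt" * 100))  # ~1.46 bits: anomalous window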
The background on information theory and variational/statistical modeling has significant roots in variational calculus. Chapter 3 describes information theory ideas and the information "calculus" description (and related anomaly detection methods). The involvement of variational calculus methods, and the possible parallels with the nascent development of a new (modern) "calculus of information," motivate the detailed overview of the highly successful physics developments/applications of the calculus of variations (Appendix B). Using variational calculus, for example, it is possible to establish a link between a choice of information measure and a statistical formalism (maximum entropy, Section 3.1). Taking the maximum entropy on a distribution with moment constraints leads to the classic distributions seen in mathematics and nature (the Gaussian for fixed mean and variance, etc.). Not surprisingly, variational methods also help to establish and refine some of the main ML methods, including Neural Nets (NNs) (Chapters 9 and 13) and Support Vector Machines (SVMs) (Chapter 10). SVMs are the main tool presented for both classification (supervised learning) and clustering (unsupervised learning), and everything in between (such as bag learning).
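As a worked instance of that variational link (standard material, sketched here for orientation; see Section 3.1 and Appendix B for the full treatment), maximize the entropy functional subject to normalization and fixed first and second moments:

\[
\mathcal{L}[p] = -\int p(x)\,\ln p(x)\,dx
+ \lambda_0\!\left(\int p\,dx - 1\right)
+ \lambda_1\!\left(\int x\,p\,dx - \mu\right)
+ \lambda_2\!\left(\int x^2\,p\,dx - (\mu^2 + \sigma^2)\right).
\]

Setting the functional derivative to zero gives \(-\ln p(x) - 1 + \lambda_0 + \lambda_1 x + \lambda_2 x^2 = 0\), i.e. \(p(x) = \exp(\lambda_0 - 1 + \lambda_1 x + \lambda_2 x^2)\), the exponential of a quadratic (with \(\lambda_2 < 0\) required for normalizability). Fixing the multipliers by the three constraints recovers the Gaussian:

\[
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).
\]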
1.3 FSA‐Based Signal Acquisition and Bioinformatics
Many signal features of interest are time limited and not band limited in the observational context of interest, such as noise "clicks," "spikes," or impulses. To acquire these signal features a time-domain finite state automaton (tFSA) is often most appropriate [116–124]. Human hearing, for example, is a nonlinear system that thereby circumvents the restrictions of the Gabor limit (allowing, for example, musical geniuses with "perfect pitch"): its time-frequency acuity surpasses what would be possible by linear signal processing alone [116], such as with Nyquist-sampled linear-response recording devices that are bound by the limits imposed by the Fourier uncertainty principle (or Benedicks' theorem) [117]. Thus, even when the powerful Fourier Transform or Hidden Markov Model (HMM) feature extraction methods are utilized to full advantage, there is often a sector of the signal analysis that is only conveniently accessible by way of FSAs (without significant oversampling), such that parallel processing with both HMM and FSA methods is often needed (results demonstrating this in the context of channel current analysis [1–3] are described in Chapter 14). Not all of the methods employed at the FSA processing stage derive from standard signal processing approaches, either; some are purely statistical, such as oversampling [118] (used in radar range oversampling [119, 120]) and dithering [121] (used in device stabilization and to reduce quantization error [122, 123]).
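A minimal two-state tFSA in Python makes the idea concrete (a toy sketch, not the book's implementation; the thresholds and duration cap are illustrative parameters):

def tfsa_spike_acquire(samples, on_thresh, off_thresh, max_len):
    # Toy time-domain FSA with two states, BASELINE and IN_SPIKE.
    # A rising crossing of on_thresh enters the spike state; a falling
    # crossing of off_thresh exits it. Events longer than max_len samples
    # are rejected: the target features are time limited, not band limited.
    state = "BASELINE"
    events, start = [], 0
    for i, x in enumerate(samples):
        if state == "BASELINE":
            if x > on_thresh:
                state, start = "IN_SPIKE", i
        elif x < off_thresh:              # state == "IN_SPIKE"
            if i - start <= max_len:
                events.append((start, i))
            state = "BASELINE"
    return events

# Example: one brief impulse riding on a flat baseline.
trace = [0, 0, 0, 9, 9, 0, 0]
print(tfsa_spike_acquire(trace, on_thresh=5, off_thresh=2, max_len=10))  # [(3, 5)]

Note that the FSA touches each sample exactly once, which is what makes the O(L) repeated-pass bootstrap of the next paragraph affordable.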
All of the tFSA signal acquisition methods described in Chapters 2–4 are O(L), i.e. they scan the data with a computational complexity no greater than that of simply seeing the data (via a "read" or "touch" command); O(L) is "order of," or "big-oh," notation, where L is the length of the data. Because the signal acquisition is only O(L), it is not significantly costly, computationally, to simply repeat the acquisition analysis multiple times, with a more informed process on each iteration, arriving at a "bootstrap" signal acquisition process. In such a setting, signal acquisition is often biased to very high specificity initially (with very poor sensitivity), to get a "gold standard" set of highly likely true signals that can be data mined for their attributes. With a filter stage thereby trained, later scan passes can accept suspected signals with very weak specificity (now with very high sensitivity), with high specificity then recovered by use of the filter. This allows a bootstrap process to very high specificity (SP) and sensitivity (SN) at the tFSA acquisition stage on the signals of interest.
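A toy numerical sketch of that two-pass logic follows (the thresholds and the amplitude filter are illustrative assumptions, not values from the book):

import numpy as np

def two_pass_acquire(signal, strict_k=8.0, loose_k=3.0):
    # Pass 1: a strict threshold yields a small, high-specificity "gold
    # standard" set of events; its attributes (here just a minimum
    # amplitude) train a filter. Pass 2 rescans with a loose threshold
    # (high sensitivity), and the trained filter restores specificity.
    mu, sd = signal.mean(), signal.std()
    gold = np.flatnonzero(signal > mu + strict_k * sd)
    if gold.size == 0:
        return gold                        # nothing to train a filter on
    min_amp = 0.8 * signal[gold].min()     # toy attribute filter
    candidates = np.flatnonzero(signal > mu + loose_k * sd)
    return candidates[signal[candidates] >= min_amp]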
An example of a bootstrap FSA from genomic analysis is to first scan through a genome base-by-base and obtain counts on nucleotide pairs at different gap sizes between the observed nucleotides [1, 3]. This allows a mutual information analysis on the nucleotide pairs taken at the different gap sizes (shown in Chapters 3 and 4). What is found for prokaryotic genomes, with their highly dense, mostly protein-coding gene placement (i.e. where there is little "junk" deoxyribonucleic acid (DNA) and no introns), is a clear signal indicating anomalous statistical linkages on bases three apart [1, 3, 60]. What is discovered thereby is codon structure: the coding information comes in groups of three bases. Knowing this, a repeated pass (bootstrap) with frequency analysis of the 64 possible 3-base groupings can then be done, at which point anomalously low counts on "stop" codons are observed. Upon identification of the stop codons, their placement (topology) in the genome can be examined, and it is found that their counts are anomalously low because there are large stretches of the genome with no stop codon (stop codon "voids," known as open reading frames, or ORFs). The codon void topologies are examined in a comparative genomic analysis in [60] (and shown in Chapter 3). The stop codons, which should occur every 21 codons on average if the DNA sequence were random (3 of the 64 codons are stops), are sometimes not seen for stretches of several hundred codons. Such long ORFs flag the longer genes, whose anomalous non-random DNA sequence is more distinctive the longer the gene-coding region. This basic analysis provides a gene-finder for prokaryotic genomes comprising a one-page Python script that performs with 90–99% accuracy depending on the prokaryotic genome (shown in Chapter 3). A second page of Python code introducing a "filter," along the lines of the bootstrap learning process mentioned above, leads to an ab initio prokaryotic gene-predictor with 98.0–99.9% accuracy; Python code to accomplish this is shown in Chapter 4. In this bootstrap acquisition process all that is used is the raw genomic data (with its highly structured intrinsic statistics) together with methods for identifying statistical anomalies and informatics structural anomalies: (i) anomalously high mutual information is identified (revealing codon structure); (ii) anomalously high (or low) statistics on an attribute or event are identified (low stop codon counts, lengthy stop codon voids); and (iii) anomalously frequent sub-sequences (binding site motifs) are found in the neighborhood of the identified ORFs (used in the filter).
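The first bootstrap step above reduces to a few lines of Python. The sketch below (illustrative, not the book's script) computes the mutual information between bases separated by k positions in a single O(L) pass; on gene-dense prokaryotic genomes it peaks at k = 3 and its multiples, revealing the codon structure that the later stop-codon/ORF passes exploit.

from collections import Counter
from math import log2

def mutual_information_at_separation(seq, k):
    # Mutual information, in bits, between the base at position i and the
    # base at position i + k, estimated from one pass over the sequence.
    pairs = list(zip(seq, seq[k:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum((c / n) * log2((c / n) / ((left[a] / n) * (right[b] / n)))
               for (a, b), c in joint.items())

# Stand-in sequence for demonstration; use a real FASTA genome in practice.
genome = "atggctgaataa" * 500
for k in range(1, 10):
    print(k, round(mutual_information_at_separation(genome, k), 4))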