
Computational Statistics in Data Science




An essential roadmap to the application of computational statistics in contemporary data science
Wiley StatsRef: Statistics Reference Online


4.1 Edward, Pyro, NumPyro, and PyMC3

Several important probabilistic programming libraries have recently been released for Python, namely Edward, Pyro, NumPyro, and PyMC3. These packages can fit broad classes of models with massive numbers of parameters, using advanced particle simulators (such as Hamiltonian Monte Carlo (HMC)).

These packages differ in implementation, but all provide world-class computational solutions to probabilistic inference and Monte Carlo techniques. They offer current, optimized algorithms for many classes of models and methods: directed graphs, neural networks, implicit generative models, Bayesian nonparametrics, Markov chains, variational inference, Bayesian multilevel regression, Gaussian processes, mixture modeling, and survival analysis. Edward is built on a TensorFlow backend, while Pyro is built on PyTorch (and NumPyro offers a NumPy-based backend for Pyro, powered by JAX). Pyro is a universal probabilistic programming language (PPL) for specifying models. NumPyro compiles code for either the central processing unit (CPU) or the graphics processing unit (GPU), greatly increasing computation speed in many statistical/linear algebra computations. PyMC3 is built on a Theano backend and uses an intuitive syntax to specify models.
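As a rough illustration of the HMC sampler that these libraries build on, the following minimal sketch draws from a one-dimensional standard normal target using only the Python standard library. The target density, the function names (`hmc_step`, `hmc_sample`), the step size, and the number of leapfrog steps are all assumptions chosen for illustration; none of them is taken from Edward, Pyro, NumPyro, or PyMC3, whose production samplers add automatic differentiation and adaptive tuning.

```python
import math
import random


def log_density(q):
    """Log density of a standard normal target, up to an additive constant."""
    return -0.5 * q * q


def grad_log_density(q):
    """Derivative of log_density with respect to q."""
    return -q


def hmc_step(q, step_size, n_leapfrog, rng):
    """One HMC transition: draw a momentum, run leapfrog, accept or reject."""
    p = rng.gauss(0.0, 1.0)  # fresh Gaussian momentum
    q_new, p_new = q, p

    # Leapfrog integration of the Hamiltonian dynamics.
    p_new += 0.5 * step_size * grad_log_density(q_new)  # initial half step
    for i in range(n_leapfrog):
        q_new += step_size * p_new  # full position step
        if i < n_leapfrog - 1:
            p_new += step_size * grad_log_density(q_new)  # full momentum step
    p_new += 0.5 * step_size * grad_log_density(q_new)  # final half step

    # Hamiltonian = potential energy (-log density) + kinetic energy.
    h_old = -log_density(q) + 0.5 * p * p
    h_new = -log_density(q_new) + 0.5 * p_new * p_new

    # Metropolis correction for the integrator's discretization error.
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new, True
    return q, False


def hmc_sample(n_samples, step_size=0.2, n_leapfrog=10, seed=0):
    """Run a chain from q = 0; return the draws and the acceptance rate."""
    rng = random.Random(seed)
    q = 0.0
    samples = []
    n_accept = 0
    for _ in range(n_samples):
        q, accepted = hmc_step(q, step_size, n_leapfrog, rng)
        samples.append(q)
        n_accept += int(accepted)
    return samples, n_accept / n_samples


if __name__ == "__main__":
    draws, rate = hmc_sample(5000)
    print("acceptance rate:", rate)
    print("sample mean:", sum(draws) / len(draws))
```

The gradient is what separates HMC from random-walk Metropolis: each proposal follows the shape of the target for several leapfrog steps, so distant proposals can still be accepted with high probability.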

4.2 Julia

Julia is a new language designed by Bezanson et al. and was released in 2012 [27]. Julia's first stable version (1.0) was released in August 2018. The developers describe themselves as “greedy”: they want a software application that does it all. Users would no longer create prototypes in a scripting language and then port them to C or Java for speed. Below, we quote from Julia's public announcement (https://julialang.org/blog/2012/02/why-we-created-julia):

We want a language that's open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that's homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like MATLAB. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as MATLAB, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.

Despite the stated goals, we classify Julia as analysis software at this early stage. Julia's syntax is elegant and friendly to mathematics, and the language natively implements an extensive mathematical library. Julia's core distribution includes multidimensional arrays, sparse vectors/matrices, linear algebra, random number generation, statistical computation, and signal processing.

Julia's design affords speeds comparable to C because it is a dynamic, embeddable language whose code is compiled just in time (JIT) rather than interpreted. The software also implements multithreading, enabling parallel computing natively. Julia integrates well with other languages: it calls C directly, Python via PyCall, and R via RCall.

Julia exhibits great promise but remains nascent. We are intrigued by a language that does it all and is easy to use. Yet Julia's relative immaturity limits its statistical analysis capability. On the other hand, Julia is growing fast, with active support and a positive community outlook. Given Julia's advantages and MATLAB's diminishing appeal, we anticipate that Julia will contribute to the area for years to come.

4.3 NIMBLE

NIMBLE (https://r-nimble.org/) provides a framework for building and sharing computationally intensive statistical models. The software gained rapid recognition through its adoption of the familiar BUGS modeling language. This feature appeals to a broad base of Bayesian statisticians who have limited time to invest in learning new computing skills. NIMBLE is implemented as an R package, but all the under-the-hood work is done in compiled C++ code, providing near-optimal speed. Even if a user does not want the BUGS language, NIMBLE accelerates R for general-purpose numerical work via nimbleFunctions, without the burden of writing native C++ source code.

4.4 Scala

An emerging data science tool, Scala (https://www.scala-lang.org/), combines object-oriented and functional paradigms in a high-level programming language. Scala is built for complex applications and workflows. To serve such applications, static typing helps catch bugs at compile time, even in code that involves heavy parallelization or asynchronous programming (dependent jobs). Scala is designed for interoperability with Java and JavaScript; it runs on the Java Virtual Machine, which provides access to the entire Java ecosystem. Scala interfaces with Apache Spark (as do Python and R) for scalable, accurate numerical operations. In short, Scala scales Java for high-performance computing.

4.5 Stan

Stan [28] is a PPL for specifying models, most often Bayesian. Stan samples posterior distributions using HMC, a variant of Markov chain Monte Carlo (MCMC). For complex models, HMC offers a more robust and efficient approach than Gibbs or Metropolis-Hastings sampling, while providing insightful diagnostics to assess convergence and mixing. This may explain why Stan is gaining popularity over other Bayesian samplers (such as BUGS [10] and JAGS [11]).
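The text does not say which convergence diagnostics it has in mind. One widely used choice is the potential scale reduction factor (R-hat) of Gelman and Rubin, which compares between-chain and within-chain variance across several independently started chains. The sketch below computes the classic, non-split version using only the Python standard library. The function name `gelman_rubin_rhat` and the simulated chains are hypothetical, and the formula is an assumption for illustration rather than necessarily the variant Stan reports.

```python
import math
import random
from statistics import mean, variance


def gelman_rubin_rhat(chains):
    """Classic (non-split) R-hat for a list of equal-length chains."""
    m = len(chains)
    n = len(chains[0]) if m else 0
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")

    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])  # W: mean within-chain variance
    between = n * variance(chain_means)           # B: between-chain variance

    # Pooled estimate of the posterior variance.
    var_plus = (n - 1) / n * within + between / n
    return math.sqrt(var_plus / within)


# Hypothetical chains: four sets of independent standard normal draws.
rng = random.Random(1)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]

# Same draws, but the first chain is shifted to mimic a chain stuck elsewhere.
stuck = [list(c) for c in mixed]
stuck[0] = [x + 3.0 for x in stuck[0]]

rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
```

Values close to 1 suggest that the chains agree with one another; values well above 1, as for the shifted chains here, indicate that at least one chain is exploring a different region and the run has not converged.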

Stan provides a flexible and principled model specification framework. In addition to fully Bayesian inference, Stan computes log densities and Hessians and supports variational Bayes, expectation propagation, and approximate integration. Stan is available as a command-line tool or through R and Python interfaces (RStan and PyStan, respectively).

Stan has the potential to become the de facto Bayesian modeling software. Designed by thought leader Andrew Gelman and supported by a growing, enthusiastic community, Stan shows much promise. The language architecture promotes cross-compatibility and extensibility, and the general-purpose posterior sampler with innovative diagnostics appeals to novice and advanced modelers alike. Further, to our knowledge, Stan is the only general-purpose Bayesian modeler that scales to thousands of parameters, a boon for big data analytics.

5 The Future of Statistical Computing

Two key drivers will dictate statistical software moving forward: (i) increased model complexity and (ii) increased speed and sheer size of data collection (big data). These two factors will require software to be highly flexible: the languages must be easy to work with for small-to-medium data sets/models, while easily scaling to massive data sets/models. The software must give easy access to the latest computer hardware (including GPUs) and provide hassle-free parallel distribution of tasks. To this end, successful statistical software must feature compiled/optimized code for the latest algorithms, parallelization, and cloud/cluster computing support. One tool is unlikely to meet all these demands, and therefore cross-compatibility standards must be developed. Moreover, data visualization (including virtual reality) will become increasingly important for large, complex data sets where conventional inferential tools are suspect or of no use.

The advantages of open‐source, community‐based development have been emphasized throughout – especially in the scholarly arena and with smaller businesses. The open‐source paradigm enables rapid software development with limited resources. However, commercial software with dedicated support services will appeal to certain markets, including medium‐to‐large businesses.