Lillian Pierson - Data Science For Dummies

Make smart business decisions with your data by design! Take a deep dive to understand how developing your data science dogma can drive your business, ya dig? Every phone, tablet, computer, watch, and camera generates data, and we’re overwhelmed with the stuff. That’s why it’s become increasingly important that you know how to derive useful insights from the data you have, to understand which piece of data in the sea of data is important and which isn’t (trust us: not as scary as it sounds!), and to rely on said data to make critical business decisions. Enter the world of data science: the practice of using scientific methods, processes, and algorithms to gain knowledge and insights from any type of data.

Data Science For Dummies covers:

How natural language processing works

Strategies around data science

How to make decisions using probabilities

Ways to display your data using a visualization model

How to incorporate various programming languages into your strategy

Whether you’re a professional or a student, Data Science For Dummies will get you caught up on all the latest data trends. Find out how to ask the pressing questions you need your data to answer by picking up your copy today.

Reducing dimensionality with factor analysis

Factor analysis is along the same lines as SVD in that it’s a method you can use for filtering out redundant information and noise from your data. An offspring of the psychometrics field, this method was developed to help you derive a root cause in cases where a shared root cause results in shared variance — when a variable’s variance correlates with the variance of other variables in the dataset.

A variable’s variance measures how much its values spread around its mean. The greater a variable’s variance, the more information that variable contains.

When you find shared variance in your dataset, that means information redundancy is at play. You can use factor analysis or principal component analysis to clear your data of this information redundancy. You see more on principal component analysis in the following section, but for now, focus on factor analysis and the fact that you can use it to compress your dataset’s information into a reduced set of meaningful, non-information-redundant latent variables — meaningful inferred variables that underlie a dataset but are not directly observable.

Factor analysis makes the following assumptions:

Your features are metric — numeric variables on which meaningful calculations can be made.

Your features should be continuous or ordinal (if you’re not sure what ordinal is, refer back to the first class, business class, and economy class analogy in the probability distributions section of this chapter).

You have more than 100 observations in your dataset and at least 5 observations per feature.

Your sample is homogeneous.

The features in your dataset are correlated with one another at r > 0.3.
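
The numeric assumptions in this list can be screened before you fit anything. The following is a minimal sketch, assuming NumPy is available and that your data is already a numeric table with one row per observation. The function name and the reading of the correlation rule (every feature needs at least one partner with |r| above 0.3) are assumptions of this sketch, not something the book specifies. It doesn't check whether the features are metric or whether the sample is homogeneous.

```python
import numpy as np


def check_factor_analysis_assumptions(X, min_obs=100, min_obs_per_feature=5, min_r=0.3):
    """Return rough pass/fail checks for a 2-D numeric array X
    (rows = observations, columns = features; needs at least 2 features)."""
    X = np.asarray(X, dtype=float)
    n_obs, n_features = X.shape

    # Pairwise Pearson correlations between features (columns).
    corr = np.corrcoef(X, rowvar=False)
    # Drop each feature's correlation with itself, keeping one row per feature.
    mask = ~np.eye(n_features, dtype=bool)
    off_diag = np.abs(corr[mask]).reshape(n_features, n_features - 1)
    # Sketch assumption: every feature needs at least one partner with |r| > min_r.
    correlated = bool((off_diag.max(axis=1) > min_r).all())

    return {
        "enough_observations": n_obs > min_obs,
        "enough_observations_per_feature": n_obs / n_features >= min_obs_per_feature,
        "features_correlated": correlated,
    }


# Hypothetical data: 4 features that share one underlying cause plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = latent + 0.5 * rng.normal(size=(200, 4))
checks = check_factor_analysis_assumptions(X)
print(checks)
```

A failed check doesn't forbid factor analysis; it's a prompt to look more closely at the data before trusting the factors.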

In factor analysis, you do a regression — a topic covered later in this chapter — on features to uncover underlying latent variables, or factors. You can then use those factors as variables in future analyses, to represent the original dataset from which they’re derived. At its core, factor analysis is the process of fitting a model to prepare a dataset for analysis by reducing its dimensionality and information redundancy.

Decreasing dimensionality and removing outliers with PCA

Principal component analysis (PCA) is another dimensionality reduction technique that’s closely related to SVD: This unsupervised statistical method finds relationships between features in your dataset and then transforms and reduces them to a set of non-information-redundant principal components, which are uncorrelated features that embody and explain the information that’s contained within the dataset (that is, its variance). These components act as a synthetic, refined representation of the dataset, with the information redundancy, noise, and outliers stripped out. You can then use those reduced components as input for your machine learning algorithms to make predictions based on a compressed representation of your data. (For more on outliers, see the “Detecting Outliers” section, later in this chapter.)

The PCA model makes these two assumptions:

Multivariate normality (MVN), meaning a set of real-valued, correlated random variables that are each clustered around a mean, is desirable but not required.

Variables in the dataset should be continuous.

Although PCA is like factor analysis, they have two major differences: One difference is that PCA does not regress to find some underlying cause of shared variance, but instead decomposes a dataset to succinctly represent its most important information in a reduced number of features. The other key difference is that, with PCA, the first time you run the model, you don’t specify the number of components to be discovered in the dataset. You let the initial model results tell you how many components to keep, and then you rerun the analysis to extract those features.
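
The two-pass workflow described here can be sketched with NumPy alone. This is an illustration under stated assumptions, not the book’s own code: the helper names `pca_fit` and `pca_transform` are hypothetical, the data is synthetic, and the decomposition is done with SVD on the mean-centered data (one common way to compute PCA).

```python
import numpy as np


def pca_fit(X):
    """First pass: keep every component and report the share of variance each explains."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    explained_variance_ratio = s**2 / np.sum(s**2)
    return mean, vt, explained_variance_ratio


def pca_transform(X, mean, vt, n_components):
    """Second pass: project the data onto the first n_components directions."""
    return (np.asarray(X, dtype=float) - mean) @ vt[:n_components].T


rng = np.random.default_rng(0)
# Hypothetical data: 3 observed features driven mostly by 1 underlying signal.
signal = rng.normal(size=(300, 1))
X = signal @ np.array([[1.0, 0.8, 0.6]]) + 0.1 * rng.normal(size=(300, 3))

mean, vt, ratio = pca_fit(X)
print(np.round(np.cumsum(ratio), 3))  # inspect this before deciding how many to keep
scores = pca_transform(X, mean, vt, n_components=1)
```

The first call specifies no component count at all; the cumulative ratios it prints are what you would inspect before choosing `n_components` for the second call.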

Similar to the CVE discussion in the SVD part of this chapter, the amount of variance you retain depends on how you’re applying PCA, as well as the data you’re inputting into the model. Breaking it down based on how you’re applying PCA, the following rules of thumb become relevant:

Used for descriptive analytics: If PCA is being used for descriptive purposes only (for example, when working to build a descriptive avatar of your company’s ideal customer), the CVE can be lower than 95 percent. In this case, you can get away with a CVE as low as 75 to 80 percent.

Used for diagnostic, predictive or prescriptive analytics: If principal components are meant for downstream models that generate diagnostic, predictive or prescriptive analytics, then CVE should be 95 percent or higher. Just realize that the lower the CVE, the less reliable your model results will be downstream. Each percentage of CVE that’s lost represents a small amount of information from your original dataset that won’t be captured by the principal components.
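
These rules of thumb translate into a small helper that picks the number of components for a target CVE (taken here to mean the cumulative share of variance explained by the leading components). This is a sketch only: the function name and the per-component ratios are hypothetical, and the 0.75 and 0.95 targets simply restate the thresholds above.

```python
import numpy as np


def components_for_cve(explained_variance_ratio, target_cve):
    """Smallest number of leading components whose cumulative variance
    explained reaches target_cve (a fraction such as 0.95)."""
    cve = np.cumsum(explained_variance_ratio)
    # A small tolerance guards against floating-point round-off at the boundary.
    reached = np.nonzero(cve >= target_cve - 1e-12)[0]
    if reached.size == 0:
        raise ValueError("target_cve is higher than the total variance explained")
    return int(reached[0]) + 1


# Hypothetical per-component variance ratios from a first PCA pass.
ratios = [0.50, 0.28, 0.12, 0.06, 0.04]
k_descriptive = components_for_cve(ratios, 0.75)  # descriptive analytics
k_downstream = components_for_cve(ratios, 0.95)   # diagnostic, predictive, or prescriptive
print(k_descriptive, k_downstream)
```

With these made-up ratios, the stricter target keeps more components, which is the trade-off the rules of thumb describe: less compression in exchange for less lost information.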

When using PCA for outlier detection, simply plot the principal components on an x-y scatter plot and visually inspect for points that sit far from the main cluster. Those data points are potential outliers that are worth investigating.
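
If you want a numeric companion to the visual check, one option is to compute the same x-y coordinates you would plot and flag points that fall far from the center. The sketch below assumes NumPy; the function name, the synthetic data, and the cutoff of 3 standardized units are all assumptions of this example, and the distance rule is a stand-in for eyeballing the scatter plot, not a method the book prescribes.

```python
import numpy as np


def flag_pc_outliers(X, threshold=3.0):
    """Project X onto its first two principal components and flag points whose
    standardized distance from the center exceeds threshold."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:2].T        # x-y coordinates for a scatter plot
    z = scores / scores.std(axis=0)     # put both axes on the same scale
    distance = np.sqrt((z**2).sum(axis=1))
    return scores, distance > threshold


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[0] = [12.0, -12.0, 12.0, -12.0]       # one planted, hypothetical outlier
scores, is_outlier = flag_pc_outliers(X)
print(np.nonzero(is_outlier)[0])        # row indices worth investigating
```

The returned `scores` array holds the two columns you would pass to a plotting library. Flagged rows are candidates to investigate, not confirmed outliers.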

Modeling Decisions with Multiple Criteria Decision-Making

Life is complicated. We’re often forced to make decisions where several different criteria come into play, and it often seems unclear which criterion should have priority. Mathematicians, being mathematicians, have come up with quantitative approaches that you can use for decision support whenever you have several criteria or alternatives on which to base your decision. You see those approaches in Chapter 3, where I talk about neural networks and deep learning. Another method that fulfills this same decision-support purpose is multiple criteria decision-making (or MCDM, for short).

Turning to traditional MCDM

You can use MCDM methods in anything from stock portfolio management to fashion-trend evaluation, from disease outbreak control to land development decision-making. Anywhere you have two or more criteria on which you need to base your decision, you can use MCDM methods to help you evaluate alternatives.

To use multiple criteria decision-making, the following two assumptions must be satisfied:

Multiple criteria evaluation: You must have more than one criterion to optimize.

Zero-sum system: Optimizing with respect to one criterion must come at the sacrifice of at least one other criterion. This means that there must be trade-offs between criteria — to gain with respect to one means losing with respect to at least one other.

Another important thing to note about MCDM is that it’s characterized by binary membership. In mathematics, a set is a group of numbers that share a similar characteristic. In traditional set theory, membership is binary — in other words, an individual is either a member of a set or it’s not. If the individual is a member, it’s represented by the number 1, representing a “yes.” If it is not a member, it’s represented by the number 0, for “no.”
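
Binary membership is easy to see in a few lines of Python. In this sketch, the alternatives, criteria, and thresholds are all hypothetical, and summing the 0/1 memberships to rank alternatives is an assumption made for illustration.

```python
def binary_membership(value, threshold):
    """Return 1 if value meets the threshold (a member of the acceptable set), else 0."""
    return 1 if value >= threshold else 0


# Hypothetical alternatives scored on two hypothetical criteria (higher is better).
alternatives = {
    "option_a": {"expected_return": 0.07, "liquidity": 0.90},
    "option_b": {"expected_return": 0.03, "liquidity": 0.95},
    "option_c": {"expected_return": 0.09, "liquidity": 0.40},
}
thresholds = {"expected_return": 0.05, "liquidity": 0.50}

# Sum the 0/1 memberships across criteria for each alternative.
scores = {
    name: sum(binary_membership(values[c], thresholds[c]) for c in thresholds)
    for name, values in alternatives.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

Each alternative either satisfies a criterion or doesn’t, with nothing in between: `option_c` gets no partial credit for coming close on liquidity.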

The best way to gain a solid grasp on MCDM is to see how it’s used to solve a real-world problem. MCDM is commonly used in investment portfolio theory. Pricing of individual financial instruments typically reflects the level of risk you incur, but an entire portfolio can be a mixture of virtually riskless investments (US government bonds, for example) and minimum-, moderate-, and high-risk investments. Your level of risk aversion dictates the general character of your investment portfolio. Highly risk-averse investors seek safer and less lucrative investments, and less risk-averse investors choose riskier, more lucrative investments. In the process of evaluating the risk of a potential investment, you’d likely consider the following criteria: