Alan R. Simon - Data Lakes For Dummies

Здесь есть возможность читать онлайн «Alan R. Simon - Data Lakes For Dummies» — ознакомительный отрывок электронной книги совершенно бесплатно, а после прочтения отрывка купить полную версию. В некоторых случаях можно слушать аудио, скачать через торрент в формате fb2 и присутствует краткое содержание. Жанр: unrecognised, на английском языке. Описание произведения, (предисловие) а так же отзывы посетителей доступны на портале библиотеки ЛибКат.

Data Lakes For Dummies: краткое содержание, описание и аннотация

Предлагаем к чтению аннотацию, описание, краткое содержание или предисловие (зависит от того, что написал сам автор книги «Data Lakes For Dummies»). Если вы не нашли необходимую информацию о книге — напишите в комментариях, мы постараемся отыскать её.

Take a dive into data lakes  “Data lakes” is the latest buzz word in the world of data storage, management, and analysis. 
decodes and demystifies the concept and helps you get a straightforward answer the question: “What exactly is a data lake and do I need one for my business?” Written for an audience of technology decision makers tasked with keeping up with the latest and greatest data options, this book provides the perfect introductory survey of these novel and growing features of the information landscape. It explains how they can help your business, what they can (and can’t) achieve, and what you need to do to create the lake that best suits your particular needs. 
With a minimum of jargon, prolific tech author and business intelligence consultant Alan Simon explains how data lakes differ from other data storage paradigms. Once you’ve got the background picture, he maps out ways you can add a data lake to your business systems; migrate existing information and switch on the fresh data supply; clean up the product; and open channels to the best intelligence software for to interpreting what you’ve stored. 
Understand and build data lake architecture Store, clean, and synchronize new and existing data Compare the best data lake vendors Structure raw data and produce usable analytics Whatever your business, data lakes are going to form ever more prominent parts of the information universe every business should have access to. Dive into this book to start exploring the deep competitive advantage they make possible—and make sure your business isn’t left standing on the shore.

Data Lakes For Dummies — читать онлайн ознакомительный отрывок

Ниже представлен текст книги, разбитый по страницам. Система сохранения места последней прочитанной страницы, позволяет с удобством читать онлайн бесплатно книгу «Data Lakes For Dummies», без необходимости каждый раз заново искать на чём Вы остановились. Поставьте закладку, и сможете в любой момент перейти на страницу, на которой закончили чтение.

Тёмная тема
Сбросить

Интервал:

Закладка:

Сделать

The bronze zone

You load your data into the bronze zone when the data first enters the data lake. First, you extract the data from a source application (the E part of ELT), and then the data is transmitted into the bronze zone in raw form (thus, one of the alternative names for this zone). You don’t correct any errors or otherwise transform or modify the data at all. The original operational data should look identical to the copy of that data now in the bronze zone.

Data Lakes For Dummies - изображение 28Your catchphrase for loading data into the bronze zone is “the need for speed.” You may be trickling one piece of data at a time or bulk-loading hundreds of gigabytes or even terabytes of data. Your objective is to transmit the data into the data lake environment as quickly as possible. You’ll worry about checking out and refining that data later.

The silver zone

The silver zone consists of data that has been error-checked and cleansed but still remains in its original format. Data may be copied from a source application in JavaScript Object Notation (JSON) format and land in the bronze zone in raw form, looking exactly as the data was in the source system itself — errors and all.

You’ll patch up any known errors, handle missing data, and otherwise cleanse the data. Then you’ll store the cleansed data in the silver zone, still in JSON format.

Data Lakes For Dummies - изображение 29Not all data from your bronze zone will be cleansed and copied into your silver zone. The data lake model calls for loading massive amounts of data into the bronze zone without having to do upfront analysis to determine which data is definitely or likely needed for analysis. When you decide what data you need, you do the necessary data cleansing and move only the cleansed data into the silver zone.

The gold zone

The gold zone is the final home for your most valuable analytical data. You’ll curate data coming from the silver zone, meaning that you’ll group and restructure data into “packages” dedicated to your organization’s high-value analytical needs.

LINKING THE DATA LAKE ZONES TOGETHER

The following figure shows the progressive pipelines of data among the various zones, including the sandbox. Notice how not every piece or group of data is cleansed and then sent from the bronze zone to the silver zone. You’ll spend time refurbishing, refining, and transmitting data to the silver zone that you definitely or likely need for analytics.

Likewise select data sets are sent from the silver zone to the gold zone - фото 30

Likewise, select data sets are sent from the silver zone to the gold zone. Remember that another name for the gold zone is the curated zone, meaning that you’ve especially selected certain data to be consolidated and then placed in “packages” within the gold zone.

You might transmit raw, uncleansed data from the bronze zone into the sandbox along with data from the silver zone, depending on the specifics of your experimental or short-term analytical needs.

Data Lakes For Dummies - изображение 31You will almost certainly replicate data across the various gold zone packages, but that’s not a problem at all. As long as you carefully control the data flows and the replicated data, you’re unlikely to run into problems with uncontrolled data proliferation.

The sandbox

Your bronze, silver, and gold zones combine to form a data pipeline. In your gold zone, you create data packages that are closely aligned with high-value, pervasive analytical needs and that will provide data-driven insights to your organization for a long time.

But what about shorter-term analytical needs or experiments that you want to run with your data? You may be building new machine learning models to predict customer behavior, optimize your supply chain, or determine new treatment plans for a hospital system’s patients. You need to experiment with different machine learning techniques, and you need actual data for your work.

Head over to the sandbox and start playing. You’ll load whatever data you need for your short-term or experimental work and do your thing. The data lake isolates the sandbox from the data pipeline, so you can do whatever you need without interfering with your organization’s primary analytical work.

Data Lakes and Big Data

Turn the clock back to the early 2010s when big data burst onto the scene. Almost every organization was exploring how this new generation of data management technology can overcome many of the barriers and constraints of relational databases, particularly for analytical storage.

Big data promised — and delivered — significantly greater capacity than was possible with relational databases. With big data, you can store unstructured and semi-structured data alongside your structured data. You can also bring new data into a big data environment with lower latency than with relational databases.

Wait a minute! That sounds just like the description of a data lake! So, is a data lake just another name for big data?

Well, sort of … possibly … or maybe not… .

The best way to think of the two disciplines in relation to one another is as follows:

Big data is the underlying core technology used to build a data lake.

A data lake is an environment that includes big data but also potentially other data management technologies along with services for data transmission and data governance.

THE THREE (OR FOUR OR FIVE OR MORE) VS OF BIG DATA AND DATA LAKES

Quick quiz: Name all the Vs of big data and data lakes. You can start with the original three: volume, variety, and velocity. But you’ll also find blog posts and online articles that mention value, veracity (a formal term for accuracy), visualization, and many others. In fact, don’t be surprised if one day you read an article or blog post that also includes Valentine’s Day!

The original three Vs of big data came from a Gartner Group analyst named Doug Laney, way back in 2001. Volume, variety, and velocity were primarily aspirational characteristics of data environments, describing next-generational characteristics beyond what the relational databases of the time were capable of supporting.

Over the years, other industry analysts, bloggers, consultants, and product vendors added to the list with their own Vs. The difference between the original three Vs and those that followed, though, is that value, veracity, visualization, and others all apply to tried-and-true relational technology just as much as to big data.

Don’t get confused trying to decide how many Vs apply to big data and to data lakes. Just focus on the original three — volume, variety, and velocity — as the must-have characteristics of your data lake.

Data Lakes For Dummies - изображение 32You’ll find varying perspectives on the relationship between big data and data lakes, which certainly confuses the issue. Some technologists reverse the relationship between big data and data lakes; they consider a data lake to be the core technology and big data to be the overall environment. So, if you run across a blog post or another description that differs from the one I use, don’t worry. As with almost everything about data lakes and much of the technology world, you’ll find all sorts of opinions and perspectives, especially when you don’t have any official standards to govern a discipline.

Читать дальше
Тёмная тема
Сбросить

Интервал:

Закладка:

Сделать

Похожие книги на «Data Lakes For Dummies»

Представляем Вашему вниманию похожие книги на «Data Lakes For Dummies» списком для выбора. Мы отобрали схожую по названию и смыслу литературу в надежде предоставить читателям больше вариантов отыскать новые, интересные, ещё непрочитанные произведения.


Отзывы о книге «Data Lakes For Dummies»

Обсуждение, отзывы о книге «Data Lakes For Dummies» и просто собственные мнения читателей. Оставьте ваши комментарии, напишите, что Вы думаете о произведении, его смысле или главных героях. Укажите что конкретно понравилось, а что нет, и почему Вы так считаете.

x