Defining Big Data and the Three Vs
I am reluctant to even mention big data in this, the third, edition of Data Science For Dummies. Back about a decade ago, the industry hype was huge over what people called big data — a term that characterizes data that exceeds the processing capacity of conventional database systems because it’s too big, it moves too fast, or it lacks the structural requirements of traditional database architectures.
My reluctance stems from a tragedy I watched unfold across the second decade of the 21st century. Back then, the term big data was so overhyped across industry that countless business leaders made misguided impulse purchases. The narrative in those days went something like this: “If you’re not using big data to develop a competitive advantage for your business, the future of your company is in great peril. And, in order to use big data, you need to have big data storage and processing capabilities that are available only if you invest in a Hadoop cluster.”
Hadoop is a data processing platform that is designed to boil down big data into smaller datasets that are more manageable for data scientists to analyze. For reasons you’re about to see, Hadoop’s popularity has been in steady decline since 2015.
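To picture what "boiling down" big data looks like, here is a tiny, conceptual sketch in plain Python. It is not Hadoop code; it just mimics the map-and-reduce pattern of batch processing, using invented log records, by mapping raw lines to values and reducing them into a much smaller summary:

```python
# A conceptual (not Hadoop-specific) illustration of the batch "boil down" idea:
# map raw records to values, then reduce them into a compact, analyzable summary.
from collections import Counter

raw_log_lines = [
    "user=1 action=view", "user=2 action=buy",
    "user=1 action=view", "user=3 action=view",
]

# Map step: extract the piece of each record you care about.
mapped = (line.split("action=")[1] for line in raw_log_lines)

# Reduce step: collapse many tiny records into a small summary dataset.
action_counts = Counter(mapped)
print(action_counts)   # Counter({'view': 3, 'buy': 1})
```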
Despite its significant drawbacks, Hadoop is, and was, powerful at satisfying one requirement: batch-processing and storing large volumes of data. That's great if your situation requires precisely this type of capability, but the fact is that technology is never a one-size-fits-all sort of thing. If I learned anything from the years I spent building technical and strategic engineering plans for government institutions, it’s this: Before investing in any sort of technology solution, you must always assess the current state of your organization, select an optimal use case, and thoroughly evaluate competing alternatives, all before even considering whether a purchase should be made. This process is so vital to the success of data science initiatives that I cover it extensively in Part 4.
Unfortunately, in almost all cases back then, business leaders bought into Hadoop before having evaluated whether it was an appropriate choice. Vendors sold Hadoop and made lots of money. Most of those projects failed. Most Hadoop vendors went out of business. Corporations got burned on investing in data projects, and the data industry got a bad rap. For any data professional who was working in the field between 2012 and 2015, the term big data represents a blight on the industry.
Despite the setbacks the data industry has faced due to overhype, this fact remains: If companies want to stay competitive, they must be adept at infusing data insights into their processes, their products, and their growth and management strategies. This is especially true in light of the digital adoption explosion that occurred as a direct result of the COVID-19 pandemic. Whether your data volumes rank on the terabyte or petabyte scales, data-engineered solutions must be designed to meet requirements for the data’s intended destination and use.
When you’re talking about regular data, you’re likely to hear the words kilobyte and gigabyte used as measurements. Kilobyte refers to 1,024 bytes, or 2^10 bytes, and gigabyte refers to 2^30 bytes. (A byte is an 8-bit unit of data.)
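If you want to keep the powers-of-two convention straight, here is a quick, purely illustrative Python sketch that prints the storage units used throughout this discussion:

```python
# Storage units follow a powers-of-two convention: each step is 2**10 = 1,024 times larger.
UNITS = {
    "kilobyte (KB)": 2**10,   # 1,024 bytes
    "megabyte (MB)": 2**20,   # 1,048,576 bytes
    "gigabyte (GB)": 2**30,   # roughly 1.07 billion bytes
    "terabyte (TB)": 2**40,   # roughly 1.10 trillion bytes
    "petabyte (PB)": 2**50,   # roughly 1.13 quadrillion bytes
}

for name, num_bytes in UNITS.items():
    print(f"1 {name} = {num_bytes:,} bytes")
```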
Three characteristics — also called “the three Vs” — define big data: volume, velocity, and variety. Because the three Vs of big data are continually expanding, newer, more innovative data technologies must continuously be developed to manage big data problems.
In a situation where you’re required to adopt a big data solution to overcome a problem that’s caused by your data’s velocity, volume, or variety, you have moved past the realm of regular data — you have a big data problem on your hands.
Grappling with data volume
The lower limit of big data volume starts as low as 1 terabyte, and it has no upper limit. If your organization owns at least 1 terabyte of data, that data technically qualifies as big data.
In its raw form, most big data is low value — in other words, the value-to-data-quantity ratio is low in raw big data. Big data is composed of huge numbers of very small transactions that come in a variety of formats. These incremental components of big data produce true value only after they’re aggregated and analyzed. Roughly speaking, data engineers have the job of aggregating it, and data scientists have the job of analyzing it.
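As a rough illustration of how low-value individual transactions gain value through aggregation, the following sketch uses pandas to roll a handful of hypothetical purchase records up into per-customer totals. The column names and figures are invented for illustration only:

```python
import pandas as pd

# A few hypothetical raw transactions -- individually, each row tells you very little.
transactions = pd.DataFrame({
    "customer_id": ["C1", "C2", "C1", "C3", "C2", "C1"],
    "amount":      [4.99, 12.50, 7.25, 3.00, 8.75, 15.00],
})

# Value emerges once the tiny records are aggregated and summarized.
summary = (
    transactions
    .groupby("customer_id")["amount"]
    .agg(total_spend="sum", purchase_count="count")
    .reset_index()
)
print(summary)
```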
Handling data velocity
A lot of big data is created by automated processes and instrumentation nowadays, and because data storage costs are relatively inexpensive, system velocity is often the limiting factor. Keep in mind that big data is low-value. Consequently, you need systems that can ingest a lot of it, on short order, to generate timely and valuable insights.
In engineering terms, data velocity is data volume per unit time. Big data enters an average system at velocities ranging from 30 kilobytes (KB) per second to as much as 30 gigabytes (GB) per second. Latency is a characteristic of all data systems, and it quantifies the system’s delay in moving data after it has been instructed to do so. Many data-engineered systems are required to have latency of less than 100 milliseconds, measured from the time the data is created to the time the system responds.
Throughput is a characteristic that describes a system’s capacity for work per unit time. Throughput requirements can easily be as high as 1,000 messages per second in big data systems! High-velocity, real-time moving data presents an obstacle to timely decision-making. The capabilities of data-handling and data-processing technologies often limit data velocities.
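To make those velocity terms concrete, here is a small back-of-the-envelope calculation in Python. The message sizes, counts, and delay figures are made-up numbers used purely to illustrate the definitions:

```python
# Data velocity is data volume per unit time; throughput is work (messages) per unit time.
messages_per_second = 1_000          # hypothetical throughput requirement
avg_message_size_bytes = 2 * 2**10   # assume each message averages 2 KB

# Velocity in bytes per second, then expressed in kilobytes per second.
velocity_bytes_per_sec = messages_per_second * avg_message_size_bytes
print(f"Velocity: {velocity_bytes_per_sec / 2**10:,.0f} KB per second")

# A latency budget check: total delay from data creation to system response
# must stay under the 100-millisecond requirement.
ingest_delay_ms, processing_delay_ms = 35, 50   # invented component delays
total_latency_ms = ingest_delay_ms + processing_delay_ms
print(f"Latency: {total_latency_ms} ms "
      f"({'within' if total_latency_ms < 100 else 'over'} the 100 ms budget)")
```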
Tools that intake data into a system — otherwise known as data ingestion tools — come in a variety of flavors. Some of the more popular ones are described in the following list:
Apache Sqoop: You can use this data transfer tool to quickly move data back and forth between a relational data system and the Hadoop distributed file system (HDFS). HDFS makes big data handling and storage financially feasible by distributing storage tasks across clusters of inexpensive commodity servers.
Apache Kafka: This distributed messaging system acts as a message broker whereby messages can quickly be pushed onto, and pulled from, HDFS. You can use Kafka to consolidate and manage the data calls and pushes that producers and consumers make to and from the HDFS (see the sketch that follows this list).
Apache Flume: This distributed system primarily handles log and event data. You can use it to transfer massive quantities of unstructured data to and from the HDFS.
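Here is a minimal sketch of pushing one event onto a Kafka topic, assuming the kafka-python client library is installed and a broker is running. The broker address, topic name, and payload are hypothetical placeholders, and a downstream consumer or connector job would then pull these messages and land them in HDFS or another store:

```python
# A minimal sketch of producing messages to Kafka with the kafka-python client.
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8")  # serialize dicts as JSON bytes
)

# Each event is one of the many small, low-value records that big data is made of.
event = {"user_id": 42, "action": "page_view", "page": "/pricing"}
producer.send("clickstream-events", value=event)  # hypothetical topic name
producer.flush()  # block until the message is actually delivered to the broker
```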
Dealing with data variety
Big data gets even more complicated when you add unstructured and semistructured data to structured data sources. This high-variety data comes from a multitude of sources. The most salient point about it is that it’s composed of a combination of datasets with differing underlying structures (structured, unstructured, or semistructured). Heterogeneous, high-variety data is often composed of any combination of graph data, JSON files, XML files, social media data, structured tabular data, weblog data, and data that’s generated from user clicks on a web page — otherwise known as click-streams.
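To show what differing underlying structures look like in practice, the following sketch loads a structured tabular record, semistructured JSON and XML records, and an unstructured weblog line using only Python’s standard library. All of the sample contents are invented for illustration:

```python
import csv, json, io
import xml.etree.ElementTree as ET

# Structured: tabular data with a fixed schema.
csv_text = "user_id,country\n42,US\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semistructured: JSON and XML carry structure, but fields can vary from record to record.
json_record = json.loads('{"user_id": 42, "clicks": ["/home", "/pricing"]}')
xml_record = ET.fromstring("<event><user_id>42</user_id><action>signup</action></event>")

# Unstructured: raw weblog text with no declared schema at all.
log_line = '203.0.113.7 - - [10/Oct/2023:13:55:36] "GET /pricing HTTP/1.1" 200'

print(rows[0])                          # {'user_id': '42', 'country': 'US'}
print(json_record["clicks"])            # ['/home', '/pricing']
print(xml_record.find("action").text)   # 'signup'
print(log_line.split('"')[1])           # 'GET /pricing HTTP/1.1'
```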