Neal Fishman
Thanks to Jim Minatel, Tom Dinse, and the rest of the team at Wiley for recognizing the need for this book and for enhancing its value with their editorial guidance. I'd also like to thank Elizabeth Schaefer for introducing me to Neal and giving me the opportunity to work with him. Thanks also to Jason Oberholtzer and the folks at Gather for enabling my work at IBM. Lastly, I'm grateful to Neal Fishman for sharing his vision and inviting me to contribute to this important book.
Cole Stryker
Foreword for Smarter Data Science
There have been remarkable advances in artificial intelligence over the past decade, owing to a perfect storm at the confluence of three important forces: the rise of big data, the exponential growth of computational power, and the discovery of key algorithms for deep learning. IBM's Deep Blue beat the world's best chess player, Watson bested every human on Jeopardy!, and DeepMind's AlphaGo and AlphaZero have dominated the fields of Go and video games. On the one hand, these advances have proven useful in commerce and in science: AI has found an important role in manufacturing, banking, and medicine, to name a few domains. On the other hand, these advances raise some difficult questions, especially with regard to privacy and the conduct of war.
While discoveries in the science of artificial intelligence continue, the fruits of that science are now being put to work in the enterprise in very tangible ways, ways that are not only economically interesting but that also contribute to the human condition. As such, enterprises that want to leverage AI must turn their focus to engineering pragmatic systems of value that contain cognitive components.
That's where Smarter Data Science comes in.
As the authors explain, data is not an afterthought in building such systems; it is a forethought. To leverage AI for predicting, automating, and optimizing enterprise outcomes, the science of data must be made an intentional, measurable, repeatable, and agile part of the development pipeline. Here, you'll learn about best practices for collecting, organizing, analyzing, and infusing data in ways that make AI real for the enterprise. What I celebrate most about this book is that the authors not only explain these best practices from a foundation of deep experience but also do so in a manner that is actionable. Their emphasis on a results-driven methodology that is agile yet enables a strong architectural framework is refreshing.
I'm not a data scientist; I'm a systems engineer, and increasingly I find myself working with data scientists. Believe me, this is a book that has taught me many things. I think you'll find it quite informative as well.
Grady Booch
ACM, IEEE, and IBM Fellow
“There is no AI without IA.”
Seth Earley
IT Professional, vol. 18, no. 03, 2016.
(info.earley.com/hubfs/EIS_Assets/ITPro-Reprint-No-AI-without-IA.pdf)
In 2016, IT consultant and CEO Seth Earley wrote an article titled "There is no AI without IA" for the IEEE magazine IT Professional. Earley argued that enterprises seeking to fully capitalize on the capabilities of artificial intelligence must first build out a supporting information architecture. Smarter Data Science provides a comprehensive response: an IA for AI.
“What I'm trying to do is deliver results.”
Lou Gerstner
Business Week
“No one would have believed in the last years of the nineteenth century that this world was being watched keenly and closely…”
So begins H. G. Wells' The War of the Worlds, published in 1898 by Harper & Brothers. In the last years of the 20th century, such disbelief also prevailed. But unlike the fictional watchers of the 19th century, the late-20th-century watchers were real: pioneering, digitally enabled corporations. In The War of the Worlds, simple bacteria proved to be a defining weapon for both offense and defense. Today, the ultimate weapon is data. When data is misused, a corporate entity can implode. When data is used appropriately, a corporate entity can thrive.
Ever since the establishment of hieroglyphs and alphabets, data has been useful. The term business intelligence (BI) can be traced as far back as 1865 (ia601409.us.archive.org/25/items/cyclopaediacomm00devegoog). However, it wasn't until Herman Hollerith, whose company would eventually become known as International Business Machines, developed the punched card that data could be harvested at scale. Hollerith initially developed his punched card–processing technology for the 1890 U.S. government census. Later, in 1937, the U.S. government contracted IBM to use its punched card–reading machines for a new, massive bookkeeping project that involved 26 million Social Security numbers.
In 1965, the U.S. government built its first data center to store 742 million tax returns and 175 million sets of fingerprints on magnetic computer tape. With the advent of the Internet, and later mobile devices and IoT, it became possible for private companies to truly use data at scale, building massive stores of consumer data based on the growing number of touchpoints they now shared with their customers. Taken as an average, data is created at a rate of more than 1.7MB every second for every person (www.domo.com/solution/data-never-sleeps-6). That equates to approximately 154,000,000,000,000 punched cards. By coupling the volume of data with the capacity to meaningfully process that data, data can be used at scale for much more than simple record keeping.
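The punched-card equivalence is easier to trust once the arithmetic is spelled out. The following is a minimal back-of-the-envelope sketch in Python; the 80-byte capacity of a classic 80-column card and the world population figure are assumptions of this sketch, not values given in the text.

# A rough sanity check of the punched-card equivalence above.
# Assumptions (not stated in the text): a classic 80-column punched
# card holds 80 bytes, and the per-person rate applies across a
# world population of roughly 7.5 billion people.

MB = 1_000_000               # bytes per megabyte (decimal convention)
rate_per_person = 1.7 * MB   # bytes created per person per second
population = 7.5e9           # assumed world population
card_capacity = 80           # bytes per 80-column punched card

cards_per_second = rate_per_person * population / card_capacity
print(f"{cards_per_second:,.0f} punched cards per second")
# Prints roughly 159,375,000,000,000 -- the same order of magnitude
# as the ~154,000,000,000,000 cards cited above.

Under these assumptions the result lands near 1.6 × 10^14 cards per second, which is consistent with the figure quoted in the text once reasonable variation in the population estimate is allowed for.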
Clearly, our world is firmly in the age of big data. Enterprises are scrambling to integrate advanced analytics capabilities such as artificial intelligence and machine learning in order to best leverage their data. The need to draw out insights that improve business performance in the marketplace is nothing less than mandatory. Recent data management concepts such as the data lake have emerged to help guide enterprises in storing and managing data. In many ways, the data lake stands in stark contrast to its forerunner, the enterprise data warehouse (EDW). Typically, the EDW accepted only data that had already been deemed useful, and its content was organized in a highly systematic way.
When misused, a data lake serves as nothing more than a hoarding ground for terabytes and petabytes of unstructured and unprocessed data, much of it never to be used. However, a data lake can be meaningfully leveraged for the benefit of advanced analytics and machine learning models.
But are data warehouses and data lakes serving their intended purpose? More pointedly, are enterprises realizing any business-side benefit from having a place to hoard data?
The global research and advisory firm Gartner has provided sobering analysis. It has estimated that more than half of the enterprise data warehouses that were attempted have been failures and that the new data lake has fared even worse. At one time, Gartner analysts projected that the failure rate of data lakes might reach as high as 60 percent (blogs.gartner.com/nick-heudecker/big-data-challenges-move-from-tech-to-the-organization). However, Gartner has now dismissed that number as being too conservative. Actual failure rates are thought to be much closer to 85 percent (www.infoworld.com/article/3393467/4-reasons-big-data-projects-fail-and-4-ways-to-succeed.html).
Why have initiatives such as the EDW and the data lake failed so spectacularly? The short answer is that developing a proper information architecture isn't simple.