(Feyerabend, 1975, p. 7)
This book does not swear by the entire philosophy of Feyerabend, but it does align with his idea that it is good for science if we violate some of its rules every now and then. It might be a way to move forward. This is therefore neither a book about true data science nor about dogmatic sociology (whatever those might be). It demands that the reader keep an open mind in relation to the transcending character of the presented analytical approach.
As argued above, theory needs data. But this book is not about sociology telling data science how things ought to be done. It is just as much the other way around, and perhaps less a matter of telling than of mutual learning. Throughout the central parts of this book, we shall look at how knowledge about some particular data can be advanced through some particular social theory. I will also discuss how theory can advance the formulation of the methodology by which we approach the data. The overarching goal is the productive meeting of the two.
There are new types of data that demand new types of methods, while new types of research questions are also arising that call for new theoretical approaches. This demands advancing our perspective on data theory and methods in parallel. In other words, it calls for developing a data theory approach. The term ‘data theory’ as such has already been used to some extent in statistics. William G. Jacoby, a researcher on public opinion and voting behaviour, has used it to refer to the process by which the researcher, being theoretically driven, chooses some aspects of observable reality as the data to be analysed:
Data theory examines how real world observations are transformed into something to be analyzed – that is, data. Any empirical observation provides the observer with information. Typically, however, only certain aspects of this information will be useful for analytic purposes. The researcher takes a vitally important step in his or her analysis simply by culling out those pieces of information that are used from those that could be considered, but are not. The information that is used comprises the data, and it is clearly only a subset of observable reality. Hence, it is important to distinguish between observations (the information that we can see in the real world around us) and data (the information that we choose to analyze). The central concern of data theory is to specify how the latter are derived from the former.
(Jacoby, 1991, p. 4)
Furthermore, there was even a Department of Data Theory in the 1990s at the University of Leiden in the Netherlands, working to adapt classical statistical methods to suit ‘the particular characteristics of data obtained in the social and behavioral sciences’ as they ‘are often data that are non-numerical, with measurements recorded on scales that have an uncertain unit of measurement’ (Meulman, Hubert, and Heiser, 1998, p. 489). I, however, use the concept of data theory as a very broad label for the work that this book does in order to bring social theory and data science closer to one another.
While most data scientists are hired by industry, they also exist within a number of disciplines in academia where the focus is on computational methods applied to unconventional or messy data. Rachel Schutt and Cathy O’Neil (2013, p. 15) suggest that:
an academic data scientist is a scientist, trained in anything from social science to biology, who works with large amounts of data, and must grapple with computational problems posed by the structure, size, messiness, and the complexity and nature of the data, while simultaneously solving a real-world problem.
Social scientists should ideally play an important role in data science, as many of the problems that data science works with – friending, connections, linking, sharing, talking – are ‘social science-y problems’ (Schutt and O’Neil, 2013, p. 9). As put by new media theorist Lev Manovich (2012, p. 461):
The emergence of social media in the middle of the 2000s created opportunities to study social and cultural processes and dynamics in new ways. For the first time, we can follow imaginations, opinions, ideas, and feelings of hundreds of millions of people. We can see the images and the videos they create and comment on, monitor the conversations they are engaged in, read their blog posts and tweets, navigate their maps, listen to their track lists, and follow their trajectories in physical space. And we don’t need to ask their permission to do this, since they themselves encourage us to do so by making all of this data public.
But even if we sometimes may have actual, real-life, well-motivated questions to pose to the data, data science notoriously runs the risk of becoming too data-driven. Indeed, data science is sometimes referred to as ‘data-driven science’ as its main aim actually is to extract knowledge from data. It is mostly not about testing hypotheses or theories in the traditional scholarly way. Instead, the work that is done with the data is driven by the data itself – in terms of the possibilities for gathering it, and the available tools for probing it.
A related concept is data mining. As the word ‘mining’ hints, this approach is about discovering interesting patterns in large amounts of data, for example from the internet and social media. It marks a break with the established view of the research process – at least within the more objectivist types of science – where a problem or research question is formulated beforehand. This problem, formulated in response to a particular need for a certain type of knowledge about a specific issue, then guides the researcher in sampling data, devising the research methods, and choosing the theoretical perspectives – or even in formulating strict hypotheses to verify or falsify. Such a process is by no means axiomatic when it comes to data science, which makes no secret of often being highly explorative and of going fishing with a very wide net. In many cases a so-called data piñata approach is employed. As defined by the online resource Urban Dictionary:
data piñata: Big Data method that consists of whacking data with a stick and hopefully some insights will come out. [Example:] The Big Data Scientist made a Twitter data piñata and found that Saturdays are the weekdays with the most tweets linking to kitty pictures.
(Urban Dictionary, 2018)
Such strategies may be seen by some as unscientific, as they do not rely on actual questions about real problems, but on patterns that one stumbles across more or less randomly. Indeed, in the type of research that deals with solicited data, intentionally collected for certain research purposes, a data piñata approach would be odd. Why should we collect some random data, just to beat it with a stick to see what pops out? And what type of data should that be? What methods or informants should be engaged, and how? In the case of register-based or database research, a piñata strategy might be closer at hand. And this is most definitely true in the case of the types of data that are enabled by people’s use of the internet and social media.
Census and survey researcher Kingsley Purdam and his data scientist colleague Mark Elliot aptly point out that today data is, to a lesser and lesser degree, ‘something we have’; rather, ‘the reality and scale of the data transformation is that data is now something we are becoming immersed and embedded in’ (Purdam and Elliot, 2015, p. 26). Their notion of a data environment underlines that people today are at once generators of, and generated by, this new environment. ‘Instead of people being researched’, Purdam and Elliot (2015, p. 26) write, ‘they are the research’. Their point is that new data types have emerged – and are constantly emerging – that demand new flexible approaches. Doing digital social research, therefore, often entails discovering and experimenting with the challenges and possibilities of ever-new types and combinations of information. Among these are not only social media data, but also data traces that are left, often unknowingly, through digital encounters. Manovich gives an explanation that is so to the point that it is worth citing at length: