
Data Science–A Systematic Treatment

While the success of early data-science applications is evident, the full impact of data science has yet to be realized.

There is a data-driven revolution under way in science and society, disrupting every form of enterprise. We are collecting and storing data more rapidly than ever before. The value of data as a central asset in an organization is now well established and generally accepted. The Economist called data “the world’s most valuable resource.”40 The World Economic Forum’s briefing paper, A New Paradigm for Business of Data, states “At the heart of digital economy and society is the explosion of insight, intelligence and information—data.”5

Key Insights

  • Establishing data science as a field of study or an academic discipline requires identification of its foundations and scope.
  • Data science is not a proper subset of AI or of statistics. It is broader than data analytics using statistical or machine-learning techniques.
  • Data science is highly interdisciplinary and involves STEM and non-STEM disciplines.
  • Applications are crucial for data science; without them, there is only big-data management and processing.

The field of data science is expected to enable data to be leveraged for making better decisions and achieving more meaningful outcomes. Although the term data science has some history, in its current incarnation as a modern field of study, it has already had significant economic impact. A 2015 Organisation for Economic Co-operation and Development (OECD) report identified “data-driven innovation” (DDI) as having a central driving role in 21st century economies, defining DDI as “the use of data and analytics to improve and foster new products, processes, organisational methods and markets.” Data science deployments are still what might be called first generation, but their impact is already being felt in many areas: global sustainability,11 power and energy systems,25 biological and biomedical systems,38 health sciences and health informatics,12 finance and insurance,8 smart cities,33 digital humanities,28 and more.

The last decade has embedded the terms “big data,” “data analytics,” and “data science” in our lexicon, both as buzzwords and as names of important fields of study. Interest in the topic, as evidenced by Google Trends (see Figure 1), has exploded over the same period. An increasing number of countries have released policy statements related to data science. In academia, data-science programs and research institutes have been established at significant speed, while many industrial organizations have created data-science units. A quick survey of these programs and initiatives suggests a common core but also a lack of unified and clear framing of data science.

Figure 1. Trending of data-science-relevant terms.

There are several reasons why clarification is helpful. One is to be able to understand whether data science is an academic discipline. This is hard to know without a definition of data science and an identification of its core and scope. A related reason is to provide an intellectually consistent framing to the numerous data-science institutes and academic units being formed. A third reason is to bring some clarity to the question of who a data scientist is. The point is not to constrain what is meant by a data scientist or to limit the scope of current academic initiatives but to acknowledge the diversity around some commonalities. The final reason is that a systematic investigation of the field is likely to identify important techniques and tools that should be in a professional data scientist’s toolbox.

Part of the difficulty is the carelessly interchangeable use of the terms “big data,” “data analytics,” and “data science” in much of the popular literature, which frequently spills over to technical literature. It is important to get them right. Data analytics, as defined in the next two sections, is a component of data science and not synonymous with it. Data science is not the same as big data. Perhaps the best analogy between them is that big data is like raw material; it has considerable promise and potential if one knows what to do with it. Data science gives it purpose, specifying how to process it to extract its full potential and to what end. It does this typically in an application-driven manner, allowing applications to frame the study objective and question. Applications are central to data science; if there are no applications to drive the inquiry, it is hard to argue that there is a data-science deployment. Jagadish also emphasizes this point, stating “‘Big Data’ begins with the data characteristics (and works up from there), whereas ‘Data Science’ begins with data use (and works down from there).”22

A second difficulty is the vagueness in many definitions about the relationship between data science, machine learning (ML), and data mining (DM), which arises from the colloquial use of “data science” to mean data analytics using ML/DM. Data science is not a subfield of ML/DM, nor is it synonymous with these disciplines. More broadly, data science is not a subtopic of AI, a common claim originating from confusion about their boundaries. AI and data science are conceptually different fields that overlap when ML/DM techniques are used in data analytics but otherwise have their own broader concerns. The broader scope of data science is discussed in this article, highlighting its constituents that are not part of AI. Conversely, there are topics in AI, such as agents, robotics, and automated programming, that are not within the scope of data science. Thus, AI and data science are related, but one does not encompass the other.

A final difficulty is that data science is broad in scope, involving numerous disciplines, and finding the right synthesis is not straightforward. At the risk of oversimplification, the following are the different constituencies that have an interest in data science: STEM people who focus on foundational techniques and underlying principles (computer scientists, mathematicians, statisticians); STEM people who focus on science and engineering data-science applications (for example, biologists, ecologists, earth and environmental scientists, health scientists, engineers); and non-STEM people who focus on social, political, and societal aspects. It is important to include all these constituencies in discussions surrounding data science while establishing a recognizable core of the field. This is a difficult balance to maintain.

The objective of this article is to put forth an internally consistent and coherent view of data science. The article discusses core components, contributing disciplines, life-cycle considerations, and stakeholder communities. The main takeaways can be summarized as follows:

  • It is important to clearly establish a consistent and inclusive view of the entire field, and one is proposed.
  • To avoid becoming a catch-all or whatever particular circumstances allow, it is essential to define the core of the field while being inclusive, and four core areas are identified.
  • It is critical to take a holistic view of activities that comprise data science.
  • A framework must be established to facilitate cooperation and collaboration among a number of disciplines.

Data science is still in its early days as an emerging field. This article contributes to discussions around its nature and scope. There will, hopefully, be rejoinders to the discussion that further define the field.

What Is Data Science?

The origins of the term data science are fuzzy. Data is central to both statistics and computing, so both communities have tried to define the field. Statisticians trace its origins to John Tukey,41 who passionately argued in the 1960s for the separation of “data analysis” from “classical statistics.” His main point was that data analysis is an empirical science while classical statistics is pure mathematics. Tukey defines data analysis as “procedures for analyzing data, techniques for interpreting the results of such procedures, ways to plan the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.” Capturing a precise definition of data has been important since the start of computing as a discipline. The International Federation for Information Processing’s (IFIP) definition of data is “a representation of facts or ideas in a formalized manner capable of being communicated or manipulated by some process.”20 Naur builds on this definition: “Data science is the science of dealing with data, once they have been established, while the relation of data to what they represent is delegated to other fields and sciences.”30

Clearly, both statisticians and computer scientists have been thinking about data science for a long time, and the understanding of what it is has evolved over time. A subtle shift happened in the 2000s with the recognition that data science is broader than data analytics and that it involves a process from data ingestion to the production of insights and actionable recommendations. During this period, data-intensive approaches to problem solving started to produce results. This became known as the “big-data revolution” and resulted in data-centric approaches in many fields. What initially began as a computational paradigm (also called the third paradigm), in which computational methods could replace or enhance laboratory experimental methods (a 2001 New York Times article declared “all science is computer science”23), rather quickly shifted to data-intensive methods. This is frequently referred to as the fourth paradigm,19 and data science systematizes this understanding.

There are significant differences between what was called data analysis (or analytics) and what the current understanding of data science entails. More modern definitions of data science encompass this broader interpretation—for example, “Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from large data sets.”24 The National Consortium for Data Science (NCDS), a U.S. consortium of leaders from academia, industry, and government, defines data science as the “systematic study of organization and use of digital data for research discoveries, decision-making, and data-driven economy.”3 These definitions are not sufficiently precise to delimit a field, but they capture some important common themes: interdisciplinarity (as defined in Alvargonzález4 and Choi and Pak9), a data-based approach to problem solving, the use of large and multi-modal data, the focus on deriving insights and value by discovering patterns and relationships in data, and an underlying process-oriented life cycle.

A working, comprehensive definition that captures the essence of the field and explicitly recognizes that it involves a process would be: Data science is a data-based approach to problem solving that analyzes and explores large volumes of possibly multi-modal data, extracts knowledge and insight from it, and uses that information for better decision making. It involves the process of collecting, preparing, managing, analyzing, explaining, and disseminating the data and the analysis results. This is consistent with the current understanding of the broad scope of the field—see, for example, the CRA’s use of the term.17

Note that the definition intentionally uses the term “data-based” rather than the more common “data-driven.” The latter has frequently been interpreted as “data should be the main (only?) basis for decisions” since “data speaks for itself.” This is wrong—data certainly holds facts and can reveal a story, but it only speaks through those who interpret it, and that interpretation can introduce biases. Therefore, data should be one of the inputs to decision making, not the only one. Furthermore, “data-driven” has come to mean that it is possible to take data and analyze it with automated tools to generate automated actions. This is also problematic despite the recent popularity of, and over-reliance on, predictive and prescriptive analytics. Though data science has significant potential, and successful data-science applications are plentiful, there are sufficient misuses of data to give us pause and concern: Google’s algorithmic detection of influenza spread from search data is one prominent example, while the Risk Needs Assessment Test used in the U.S. justice system is another. Therefore, “data-based” is the preferable term, signaling that data-science deployments are aids to the decision maker, not decision makers themselves.a

Data-Science Ecosystem

Data science is inherently interdisciplinary: It builds on a core set of capabilities in data engineering, data analytics, data protection, and ethics—the four pillars of data science (see Figure 2). Some of this core is technical, some is not. Although the term “data science” is frequently used to refer only to data analysis, the scope is wider, and the contributing elements of the field should be properly recognized. The core is in close interaction with application domains, which play a dual role: They inform which technologies, tools, algorithms, and methodologies are worth developing, and they leverage these capabilities to solve their problems. Data-science application deployments are highly sensitive to the existing social and policy context, which influences both the core technologies and the application deployments.

Figure 2. Data-science building blocks.

Data engineering. Data is at the core of data science; the type of data that is used is commonly referred to as big data. There is no universal definition of big data; it is usually characterized as data that is large (volume); multi-modal (variety), spanning many types of data: structured, text, images, video, and others; sometimes streaming at high speed (velocity); and subject to quality issues (veracity). These are known as the “four Vs,” and addressing them appropriately is the domain of big-data management.31 Data engineering in data science addresses two main concerns: the management of big data (including the computing platforms for its processing) and the preparation of data for analysis.

Managing big data is challenging but critical in data-science application development and deployment. These data characteristics are quite different from those that traditional data-management systems are designed for, requiring new systems, methodologies, and approaches. What is needed is a data-management platform that provides appropriate functionality and interfaces for conducting data analysis, executing declarative queries, and enabling sophisticated search. This exceeds the current state of the art, where individual systems are specialized toward a specific data model; what is needed are meta-models that allow for different levels of abstraction, more fluent integration, and seamless interoperability.

Data preparation is typically understood to be the process of data-set selection, data acquisition, data integration, and data-quality enforcement. Applying appropriate analysis to the integrated data can provide new insights that improve organizational effectiveness and efficiency and result in evidence-informed policies. However, for this analysis to yield meaningful results, the input data must be appropriately prepared and trustworthy. The quality of the analysis model makes little difference; if the input data is not clean and trustworthy, the results will not be very valuable. The adage of “garbage in, garbage out” is real in data science. Data quality is an essential element in making the data trustworthy for analysis; it addresses the veracity characteristic of big data. Data quality is considered mission critical to data-science success, and it constitutes a major portion of the data-preparation effort for most organizations.


An important vehicle for data quality and data trustworthiness is metadata and its management. One particularly important kind of metadata is provenance, which tracks the origin of the data and the processing applied to it. A related challenge is developing and instituting appropriate system and tool support for managing provenance and for tracking data as it moves through the processing pipeline.
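
To make this concrete, the following is a minimal sketch, in Python, of one way provenance might be carried alongside a dataset as it is transformed. The class and field names are hypothetical illustrations, not any particular metadata standard or system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List

@dataclass
class ProvenanceRecord:
    """One step in a dataset's lineage: what was done, from what source, when."""
    operation: str
    source: str
    timestamp: str

@dataclass
class TrackedDataset:
    """A dataset bundled with its lineage; every transformation appends a record."""
    data: list
    lineage: List[ProvenanceRecord] = field(default_factory=list)

    def transform(self, name: str, fn: Callable[[list], list]) -> "TrackedDataset":
        new = TrackedDataset(fn(self.data), list(self.lineage))
        new.lineage.append(ProvenanceRecord(
            operation=name,
            source=f"derived from {len(self.data)} records",
            timestamp=datetime.now(timezone.utc).isoformat()))
        return new

# Usage: a derived dataset carries the full chain of operations that produced it.
raw = TrackedDataset([3, 1, None, 7], [ProvenanceRecord(
    "ingest", "sensor-feed-A", datetime.now(timezone.utc).isoformat())])
clean = raw.transform("drop-missing", lambda d: [x for x in d if x is not None])
for rec in clean.lineage:
    print(rec.operation, rec.source, rec.timestamp)
```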

A very important aspect of data quality is data cleaning.21 When data from multiple sources is used, there are bound to be inconsistencies, errors in data, and missing information that must be corrected (cleaned). Techniques and methodologies for data cleaning are an important part of data engineering.
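
As a small illustration, here is a hedged pandas sketch of typical cleaning steps on records integrated from two sources. The column names, the encoding fixes, and the median imputation are illustrative assumptions, not a prescription.

```python
import pandas as pd

# Hypothetical records merged from two sources, exhibiting common problems:
# inconsistent categorical encodings, duplicates, and missing values.
df = pd.DataFrame({
    "patient_id": [101, 101, 102, 103],
    "country":    ["CA", "Canada", "ca", None],
    "age":        [34, 34, None, 51],
})

# Normalize inconsistent encodings introduced by different sources.
df["country"] = df["country"].str.strip().str.upper().replace({"CANADA": "CA"})

# Remove duplicates introduced by integrating overlapping sources.
df = df.drop_duplicates(subset=["patient_id"], keep="first")

# Handle missing values explicitly; the imputation choice should be documented,
# since it affects every downstream analysis.
df["age"] = df["age"].fillna(df["age"].median())

print(df)
```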

Data analytics. Data analytics is the application of statistical and ML techniques to draw insights from the data under study and to make predictions about the behavior of the system it describes. A first-level distinction in data analysis is made between inference and prediction: Inference builds a model that describes the system’s behavior by representing the input variables and the relationships among them, while prediction uses such a model to anticipate behavior that has not yet been observed. This classification can be made finer grained by identifying four classes (a minimal sketch contrasting descriptive and predictive analysis follows the list):

  • Descriptive, which retrospectively examines historical data to answer “What happened?” or “What does the data tell us?”
  • Diagnostic, which is also retrospective but goes beyond descriptive analysis to answer “Why did it happen?”
  • Predictive, a forward-looking analysis of historical data that provides calculated predictions of what is likely to happen.
  • Prescriptive, which goes further by recommending courses of action.

Predictive and prescriptive analytics together are usually called advanced analytics. The relationship among these classes is usually evaluated along two dimensions: complexity and value.27 Going from descriptive to prescriptive, the analysis becomes far more complex, but the value derived from it also substantially increases.
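
The following minimal Python sketch contrasts descriptive and predictive analysis on the same series. The demand figures and the simple linear-trend model are illustrative assumptions only.

```python
import numpy as np

# Hypothetical monthly demand figures.
demand = np.array([112, 118, 121, 130, 134, 141, 149, 152])
months = np.arange(len(demand))

# Descriptive: what happened? Summarize the historical record.
print("mean:", demand.mean(), "total growth:", demand[-1] - demand[0])

# Predictive: what is likely to happen? Fit a trend and extrapolate.
slope, intercept = np.polyfit(months, demand, deg=1)
print("forecast for next month:", slope * len(demand) + intercept)

# Prescriptive analysis would go further, for example recommending inventory
# levels that minimize cost given the forecast: typically an optimization
# problem layered on top of the predictive model.
```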

There are six data-analysis tasks (methods) commonly used in data science;15,24 a brief sketch of the first two follows the list:

  • Clustering finds meaningful groups or collections of data based on the “similarity” of data points; data points in the same cluster are more similar to each other than they are to data points in other clusters.
  • Outlier detection identifies rare data items in a dataset that differ significantly from the majority of the data.
  • Association-rule learning discovers interesting relationships between variables in a large dataset.
  • Classification finds a function (model) that places a given data item in one of a set of predefined classes.
  • Regression finds a function that relates one or more independent variables to a dependent variable.
  • Summarization creates a more compact representation of the dataset.
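
As a brief illustration, here is a sketch of the first two tasks using scikit-learn. The synthetic data and the model choices (k-means for clustering, isolation forest for outlier detection) are assumptions made for illustration, not recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Two hypothetical groups of points, plus one far-away anomaly.
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(8, 1, (50, 2)),
               [[25.0, 25.0]]])

# Clustering: find groups of similar points.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Outlier detection: flag points that differ from the majority (-1 = outlier).
flags = IsolationForest(random_state=0).fit_predict(X)

print("cluster sizes:", np.bincount(labels))
print("outliers found:", (flags == -1).sum())
```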

As discussed earlier, an important data source in data science is streaming data. In this case, real-time analytics must be considered as data flows continuously. Real-time analytics is particularly difficult given that most analysis algorithms are computationally heavy and usually require multiple passes over the dataset, which is challenging in streaming data.
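
To illustrate the single-pass constraint, the sketch below flags anomalies in a stream using Welford's online mean/variance update, which needs constant memory and never revisits past elements. The threshold and the sample stream are illustrative assumptions.

```python
import math

class StreamingZScore:
    """Single-pass anomaly flagging via Welford's online mean/variance."""
    def __init__(self, threshold: float = 3.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.threshold = threshold

    def observe(self, x: float) -> bool:
        # Flag against the statistics of the stream seen so far.
        flagged = False
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            flagged = std > 0 and abs(x - self.mean) / std > self.threshold
        # Welford's update: constant time and memory per element.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return flagged

detector = StreamingZScore()
for value in [10, 11, 9, 10, 12, 11, 10, 95, 10]:
    if detector.observe(value):
        print("anomaly:", value)   # flags 95
```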

In a data-science project, an important consideration is the selection of appropriate techniques for the task and how they can be leveraged. Given the societal impact of data-science applications and deployments, the explainability of the analysis results is equally important.

Data protection. Data science’s reliance on large volumes of varied data from many sources raises important data-protection concerns. The scale, diversity, and interconnectedness of data (for example, in online social networks) requires revisiting data-protection techniques that have been mostly developed for corporate data.7,29

It is customary to discuss the relevant issues under the complementary topics of data security and data privacy. The former protects information from unauthorized access or malicious attacks, while the latter focuses on the rights of users and groups over data about themselves. Data security typically deals with data confidentiality, access control, infrastructure security, and system monitoring, and uses technologies such as encryption, trusted execution environments, and monitoring tools. Data privacy, on the other hand, deals with privacy policies and regulations, data retention and deletion policies, data subject access request (DSAR) policies, management of data use by third parties, and user consent. Data privacy normally involves privacy-enhancing technologies (PETs). Although research on these topics is usually conducted in isolation, it is helpful to take a holistic and broader view, hence the term data protection is more appropriate and informative.

The characteristics of data used in data science pose unique challenges. Data volumes make the enforcement of access-control mechanisms more difficult and the detection of malicious data and use more challenging. The numerosity and variety of data sources make it possible to inject mis/disinformation, skewing the analysis results.7 Data-science platforms are, by necessity, scale-out systems, which increases the possibility of infrastructure attacks. These environments also increase the potential for surveillance. The variability and potentially high number of end users, and, in many data-science deployments, the need for openness in sharing analysis results and bolstering the analysis, open the possibility of data breaches and misuse. These factors seriously increase the threats and the attack surface. Therefore, protection is required for the entirety of the data-science life cycle, from data acquisition to the dissemination of results, as well as for secure archiving or deletion. An implicit goal of data science is to gain access to as much data as possible, which directly conflicts with the fundamental least-privilege security principle of providing access to as few resources as necessary. Closing this gap requires careful redesign and advancement of security technologies to preserve the integrity of scientific results and data privacy, and to comply with regulations and agreements governing data access. Techniques developed for privacy-preserving data mining are examples of this consideration (a minimal sketch of one such technique follows).
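
To illustrate one well-known PET, here is a minimal sketch of the Laplace mechanism for releasing a differentially private count. The epsilon value and the query are illustrative; a production deployment would also involve privacy budgeting, composition accounting, and careful sensitivity analysis.

```python
import numpy as np

def dp_count(values, predicate, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy via the Laplace
    mechanism. A counting query has sensitivity 1: adding or removing one
    individual changes the true answer by at most 1."""
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.default_rng().laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical ages; analysts see only the noisy answer, which limits what
# any single individual's presence in the data can reveal.
ages = [23, 35, 41, 52, 29, 61, 44]
print(dp_count(ages, lambda a: a >= 40, epsilon=0.5))
```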

Data-science ethics. The fourth building block of data science is ethics. In many discussions, ethics is bundled with a discussion of data privacy. The two topics certainly have a strong relationship, but they should be considered separate pillars of the data-science core. Literature typically refers to data ethics as “. . . the branch of ethics that studies and evaluates moral problems related to data, . . . algorithms, . . . and corresponding practices, in order to formulate and support morally good solutions.”16 The definition recognizes the three dimensions of the issue—data, algorithms, and practice.

  • The ethics of data refers to the ethical problems posed by the collection and analysis of large datasets and to issues arising from the use of big data in a diverse set of applications.
  • The ethics of algorithms addresses concerns arising from the increasing complexity and autonomy of algorithms, their fairness, bias, equity, validity, and reliability.18
  • The ethics of practices addresses questions concerning the responsibilities and liabilities of people and organizations in charge of data processes, strategies, and policies. The growing research in AI ethics tackles many of these issues.

Perhaps one of the most important concepts in data-science ethics is informed consent. Participants in data-science projects should have full information about the project, its objectives, and its scope, and they should freely agree to participate. If data about participants is collected, they should have full knowledge of what is being collected and how it will be used (including by third parties) so they can agree to its collection and use.

One important issue in data-science ethics that has received significant attention is bias. The Oxford English Dictionary defines bias as the “inclination or prejudice for or against one person or group, especially in a way considered to be unfair.” Bias is inherent in human activities and decision making, and human biases are reflected in data science as biases in data and biases in algorithms. Bias in data can be introduced through what is included in the historical data used by the algorithms—for example, arrest records in the U.S. include more people from marginalized communities, primarily because those communities are over-policed. Bias in data may also be introduced through under-representation—for example, data used in face-recognition systems has comprised about 80% white faces, three-quarters of them male. Algorithmic bias can occur due to the inclusion or omission of features in the algorithm/model; in ML deployments, this can occur during feature engineering. These features include individual attributes such as race, religion, and gender. The use of proxy metrics (for instance, using standardized examination scores to predict student success) can also lead to bias. A sketch of one simple bias audit follows.
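
The sketch below checks demographic parity, that is, the gap in positive-decision rates across groups. The data is hypothetical, and demographic parity is only one of several, sometimes mutually incompatible, fairness criteria.

```python
import pandas as pd

# Hypothetical model decisions alongside a protected attribute.
decisions = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B", "B"],
    "approved": [1,   1,   0,   1,   0,   0,   0],
})

# Demographic parity: compare approval rates across groups. A large gap is
# one signal, among others, that the model or its training data is biased.
rates = decisions.groupby("group")["approved"].mean()
print(rates)
print("parity gap:", rates.max() - rates.min())
```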

Although considerable attention has focused on the problem of bias, data-science ethics should be considered more generally consistent with the previously offered, broader definition of ethics. Some of the broader ethical considerations have overlapping concerns with data protection. Related to ethics of practice, societal norms that usually get coded in legislation are important (more on this in the next section). Consequently, while some of the ethical concerns are universal, others can be specific to a jurisdiction.

Broader ethical concerns also include: ownership of data; transparency, which refers to subjects knowing what data is collected about them and how it will be stored and processed (including informed consent by the subject); privacy of personal data, in particular the revealing of personally identifiable information; and intention regarding how data will be used, especially for secondary use.

Social and policy context. As noted earlier, data-science deployments are highly sensitive to the societal and policy contexts in which they are deployed. For example, what can be done with data differs in different jurisdictions. The context can be legal, establishing legal norms for data-science deployments, or it can be societal, identifying what is socially acceptable. Furthermore, there are significant intersections between social science and humanities and the core issues in data science. There are four central concerns: ownership, representation, regulation, and public policy. Obviously, there is overlap between these and the data-ethics concerns previously discussed.

Ownership. Data ownership, access, and use—particularly in terms of how individual data is generated, who owns and can access it, and who profits from it—is a critical concern. At the societal and organizational levels, researchers analyze how economic systems are increasingly data-dependent in terms of both operations and revenue streams, and how pressures to collect and share ever-more intimate data may conflict with users’ own calls for privacy and autonomy. Data privacy from a technical perspective was previously discussed, but it obviously has a significant legal and social dimension that requires careful study (for example, Solove35).

Representation. A primary concern in the development of data-science technologies is ensuring diverse and equitable representation at all stages of the life cycle. This includes the evaluation of the training, tools, and techniques used in data science, including who designs them, who has access to them, and who is represented by them. Data representativeness is tightly tied to questions of marginalization and bias that appear throughout the design, data collection, analysis, and implementation of these technologies—for example, Richardson et al.32 Another concern is how data is increasingly used to “speak” for users, often without their knowledge, changing individuals’ relationships with their local communities, corporations, and the state.


Regulation and accountability. Ethical data science also includes a commitment to transparency and explainability: analyzing the inputs of data-driven decision making and the algorithms applied, and determining how they lead to specific outputs and recommendations. Ensuring that the benefits and opportunities afforded by advances in data science accrue to broader society requires accountability and regulatory practices at each stage of the data-science life cycle. These are not only centered on laws and policy interventions, such as the General Data Protection Regulation (GDPR) and the Canadian Personal Information Protection and Electronic Documents Act (PIPEDA); they must also include efforts to build values into design, interventions for more accessible and inclusive design, and tools for ethical thought at the levels of training, education, and ongoing daily practice.

Public policy. There is a critical and urgent need to integrate data science into the analysis of public policy.36 In an age where every Facebook, Twitter, and Instagram post is a data observation that can be archived and can become part of a historical dataset that can inform public policy, governments have been left behind in their ability to collect, aggregate, and analyze this data. With the necessary tools, this data could be managed and analyzed in a way that is explainable and that can be disseminated in a meaningful way to provide key insights. In a similar vein, there is a paradox in the large amount of “open” data that remains unused, along with concerns of a data deficit in terms of more information that could and should be collected to inform public policy.

Data-Science Life Cycle

The definition of data science previously given clearly identifies the process view of data science, namely that it consists of several processing stages, starting from data ingestion and eventually leading to better decisions, insights, and action. That process is called the data-science life cycle. The literature also refers to the data life cycle, which focuses only on data processing. A good definition of the data life cycle is given by the U.S. National Science Foundation Working Group on the Emergence of Data Science,6 which identifies five linear stages: acquiring the data; cleaning it and preparing it for analysis; using the data through analysis; publishing the data and the methods used to analyze it; and preserving or destroying the data according to policy. Variations of this data-life cycle model appear in various proposals, some predating the above formulation—for example, Agrawal et al.,2 Jagadish,22 and Stodden.37

This life cycle model and its variants give the impression that the entire process is linear and unidirectional. Real project development hardly works in a linear fashion. An alternative model that is more iterative, with built-in feedback loops, was proposed in the CRoss-Industry Standard Process for Data Mining (CRISP-DM) model for data-mining projects.34 CRISP-DM places data at the center and specifies a cyclical life cycle that is iterative and may be repeated over the lifetime of the project. PPDAC26 is similar to CRISP-DM but aimed at statistical analysis tasks. The Microsoft Team Data Science Process life cycle39 also emphasizes the iterative nature of the process.

The data-science life cycle proposed in this article (see Figure 3) derives from and builds on those iterative models. It starts with the specification of the research question, which may come from a particular application or may be exploratory. A good understanding of the research question is important, since it normally drives the entire process. The next step is data preparation, which includes determining which datasets are needed and available; selecting the appropriate datasets from within this larger set; ingesting the data; and addressing data-quality issues, including data cleaning and data provenance. The third step is the proper storage and management of the data, including big-data management. Specifically, data needs to be integrated, storage structures need to be chosen and designed for efficient access, suitable access interfaces must be specified, and provisions need to be made for metadata management, in particular for provenance data.

The prepared and suitably stored data is then open for analysis: The appropriate statistical and ML model(s) are selected or developed, feature engineering is performed to identify the most appropriate model parameters, and model-validation studies are conducted to determine the model’s suitability. If model validation is successful, the next step is deployment and dissemination, which involves different activities depending on the particular project and application. In some cases, the analysis and processing of data needs to be performed on a continuing basis, so deployment involves maintaining and monitoring the system over time. In other cases, deployment may involve compilation and dissemination of analysis results and their explanation.

Dissemination of the analysis results, and sometimes even of the curated data, is an important aspect of this phase. Open data, which is data that “anyone can freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness),”b is an important part of dissemination. Many governments and private institutions are adopting Open Data Principles stating that data should not only be open but should be complete, accurate, primary, and released in a timely manner. These properties make such data very valuable to data scientists, journalists, and the public. When open data is used effectively, data scientists can explore and analyze public resources, which allows them to question public policy, create new knowledge and services, and discover new value for social, scientific, or business initiatives.

Figure 3. Data-science life cycle.

Problems during the analysis phase may return the process to reformulating the research question (it may be under- or overspecified, making model building infeasible) or cycle it back to data preparation if the model requires different data that has not yet been prepared. As noted earlier, data-science deployments are not “one-and-done.” Following deployment, there must be constant monitoring: Perhaps the environment changes, the data changes, or a deeper understanding of the research question results in its revision and improvement. The process thus unfolds dialectically; every time it returns to the research question, we are at an elevated understanding of what needs to be studied. It is important to recognize that the stages in the life cycle are not isolated; the boundaries between stages are fuzzy, and important and interesting issues arise at their intersections.
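
As a schematic only, the control flow of this iterative life cycle might be sketched as follows. The stage functions are hypothetical stand-ins for real project work, not a framework.

```python
def prepare(question: str) -> list:
    """Stand-in for ingestion, cleaning, integration, and storage."""
    return [1, 2, 3]

def analyze(question: str, data: list) -> tuple:
    """Stand-in for model selection, feature engineering, and validation."""
    return (f"model for '{question}'", "refined" in question)

def run_project(question: str, max_passes: int = 5) -> None:
    """Each pass may loop back to the research question or to data
    preparation rather than proceeding linearly to deployment."""
    for i in range(max_passes):
        data = prepare(question)
        model, valid = analyze(question, data)
        if not valid:
            # Dialectic loop-back: refine the question (or gather other data).
            question = f"{question} (refined, pass {i + 1})"
            continue
        print("deploying:", model, "| monitoring for drift follows")
        return
    print("model building infeasible; the question needs rethinking")

run_project("what drives customer churn?")
```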


There is a continuous bi-directional interaction between this life cycle and the data-protection issues. Similarly, the ethical concerns, social norms, and policy framework impact each of the phases, sometimes even preventing the initiation of specific data-science studies.

Data Science Is Interdisciplinary

Who “owns” data science is a topic of some discussion, primarily between statisticians and computer scientists. This discussion bleeds into the question of who data scientists are, which then leads to different educational models of data science. Given the centrality of data to both disciplines, this discussion is perhaps not surprising. The concern among statisticians regarding the field of data science is long-standing. Given the early promotion of data analytics as an important topic by Tukey, there is a strong feeling among statisticians that they own (or should own) the topic. In a 2013 opinion piece, Davidian13 laments the absence of statisticians in a multi-institution data-science initiative and asks if data science is not what statisticians do. She indicates that data science is “described as a blend of computer science, mathematics, data visualization, machine learning, distributed data management—and statistics,” almost disappointed that these disciplines are involved along with statistics. Similarly, Donoho laments the current popular interest in data science, indicating that most statisticians view new data science programs as “cultural appropriation.”14

There is a well-known argument put forth by Conway10 on the nature of data science. He proposes three main areas organized as a Venn diagram: hacking skills, mathematics and statistics, and substantive expertise. The hacking skills he contends to be important are the ability “to manipulate text files at the command-line, understanding vectorized operations, thinking algorithmically.” Mathematics and statistics knowledge, at the level of “knowing what an ordinary least squares regression is and how to interpret it,” is required to analyze data. The substantive expertise concerns the research problem, which may come from an application domain or a specific research project. The Conway diagram, as it has come to be known, has become popular in these ownership debates among those who do not see a central role for computer science in data science, because Conway argues that hacking skills have nothing to do with computer science: “This, however, does not require a background in computer science—in fact, many of the most impressive hackers I have met never took a single CS course.”

The computer-science view instead espouses the centrality of computing. One such view has been championed by Ullman,42 who also uses a Venn diagram to express his viewpoint and counters Conway, since he considers “algorithms and techniques for processing large-scale data efficiently as the center of data science.” Ullman claims that the two big knowledge bases of data science are computer science and domain science (that is, the application domain), and their intersection is where data science resides. He sees, rightly, ML as part of computer science. He argues, again rightly, that some of ML is used for data science, but there are ML applications outside of data science. His diagram shows that there are aspects of data science that require computer-science techniques which have nothing to do with machine learning—data engineering, as previously discussed, would fall in that category. I suspect that some of these points may not be controversial. Where the argument is likely to be challenged is his view that mathematics and statistics “do not really impact domain sciences directly,” despite their importance in computer science. Within computer science, there is also the discussion regarding the relationship of AI/ML with data science, with some indicating that data science is part of AI; this was addressed in this article’s first section.

A more balanced view has been put forth by Marina Vogt,c who places data science at the intersection of computer science, mathematics and statistics, and domain knowledge. This is more in line with the view put forward by ACM in its curriculum proposal: “Data science is an interdisciplinary endeavor between computer science, mathematics, statistics, and applied areas such as natural sciences.”1 However, even this viewpoint is very STEM-centric and leaves out many topics that are of interest to data science as a field.

These discussions and the resulting controversies are not helpful or necessary; they do not move the data-science agenda forward. No single community “owns” data science—it is too big, the challenges are too great, and it requires involvement from many disciplines. The creation and use of knowledge are fundamental human activities that span millennia; they are at the core of what we can define as our collective human culture. Attempts to splinter this core of human achievement through claims of ownership are, at best, a narrow parochial view and, at worst, an ascendancy of self-interest and greed.

Data science should be viewed as a unifying force that connects several fields (see Figure 4), some of which are STEM and some are not. I go back to my discussion about stakeholders, who are diverse. Ownership arguments take place within one stakeholder group—STEM people who focus on foundational techniques and the underlying principles. Within this group, it is important to recognize and accept that there are communities with complementary and sometimes overlapping interests: computer scientists who bring expertise in computational techniques/tools that can effectively deal with scale and heterogeneity, statisticians who focus on statistical modeling for analysis, and mathematicians who have much to contribute with discrete and continuous optimization techniques and precise modeling of processes. However, this is only one stakeholder group; I have identified two others. One danger in such a unifying view is not finding the right balance between inclusiveness in accepting the contributions of all these fields and identifying the core of data science. I believe arguments made earlier in this article have established the core, so this danger is averted.

Figure 4. Unifying view of data science.

Conclusion

Despite its recent popularity, the field of data science is still in its infancy and much work needs to be done to scope and position it properly. The success of early data-science applications is evident—from health sciences, where social-network analytics have enabled the tracking of epidemics; to financial systems, where investment decisions are guided by the analysis of large volumes of data; to the customer-care industry, where advances in speech recognition have led to the development of chatbots for customer service. However, these advances only hint at what is possible; the full impact of data science has yet to be realized. Significant improvements are required in fundamental aspects of data science and in the development of integrated processes that turn data into insight. Current developments tend to be isolated within subfields of data science and do not consider the full scope of the field as discussed in this article. This siloing significantly impedes large-scale advances. As a result, the capacity of data-science applications to incorporate new foundational technologies is lagging.

The objective of this article is to lay out a systematic view of the data-science field and to reinforce the key takeaways: It is important to clearly establish a consistent and inclusive view of the entire field; it is essential to define the core of the field while being inclusive, to avoid becoming a catch-all or whatever the particular circumstances allow; it is critical to take a holistic view of the activities that comprise data science; and a framework needs to be established to facilitate cooperation and collaboration among a number of disciplines.

Acknowledgments

A preliminary version of the ideas in this article appeared in an opinion piece in the IEEE Data Engineering Bulletin 43, 3 (2020), 3–11. My views on data science were sharpened in discussions with many colleagues as we worked on several data-science proposals. I thank colleagues (too many to list individually) who participated in these initiatives and taught me different aspects of the issues; they will see their fingerprints in the text of this article. I would especially like to acknowledge the many early discussions on framing the field and the relevant joint work with Raymond Ng and Nancy Reid. I thank Samer Al-Kiswani, Angela Bonifati, Khuzaima Daudjee, Maura Grossman, John Hirdes, Florian Kerschbaum, Jatin Matwani, Renée Miller, and Patrick Valduriez for feedback on all or part of this article. I very much appreciate the feedback from the anonymous reviewers, who pointed to weaknesses in some of my arguments and challenged me to clarify others. These helped improve the article.

Figure. Watch the author discuss this work in the exclusive Communications video. https://cacm.acm.org/videos/data-science-systematic-treatment

 

    1. ACM Data Science Task Force. Computing competencies for undergraduate data science curricula. Association for Computing Machinery (January 2021); https://bit.ly/3Vs7z3M.

    2. Agrawal, D. et al. Challenges and opportunities with big data. The Computing Community Consortium of the CRA (2012); http://bit.ly/3UG0EDN.

    3. Ahalt, S.C. Why data science? The National Consortium for Data Science (October 2013); http://bit.ly/3KwC0AV.

    4. Alvargonzález, D. Multidisciplinarity, interdisciplinary, transdisciplinarity, and the sciences. Intern. Studies in the Philosophy of Science 25, 4 (2011), 387–403; http://bit.ly/3zRMBBD.

    5. A new paradigm for business of data. World Economic Forum; https://bit.ly/3VupVBf.

    6. Berman, F. et al. Realizing the potential of data science. Communications of the ACM 61, 4 (2018), 67–72; http://bit.ly/3Uvu8nS.

    7. Bertino, E. and Ferrari, E. Big data security and privacy. In A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years, S. Flesca, S. Greco, E. Masciari, and D. Saccà (eds.). Springer International Publishing (2018), 425–439; https://bit.ly/3obRJhh.

    8. Chakravaram, V. et al. The role of big data, data science and data analytics in financial engineering. In Proceedings of the 2019 Intern. Conf. on Big Data Engineering, Association for Computing Machinery, 44–50; http://bit.ly/3oawNas.

    9. Choi, B.C.K and Pak, A.W.P. Multidisciplinarity, interdisciplinarity and transdisciplinarity in health research, services, education and policy: 1. Definitions, objectives, and evidence of effectiveness. Clinical and Investigative Medicine 29, 6 (2006), 351–364.

    10. Conway, D. The data science Venn diagram. DrewConway.com (2015); http://bit.ly/3mtgtkF.

    11. Data Science Applied to Sustainability Analysis. J. Dunn and P. Balaprakash (eds.), Elsevier (2021); http://bit.ly/414PQSd.

    12. Data Science for Healthcare, S. Consoli, D.R. Recupero, and M. Petković (eds.), Springer (2019); http://bit.ly/3MY7Old.

    13. Davidian, M. Aren't we data science? Magazine of American Statistical Association (2013); http://bit.ly/413AwVH.

    14. Donoho, D. 50 years of data science. J. Computational and Graphical Statistics 26, 4 (2017), 745–766; http://bit.ly/3zU2qrh.

    15. Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. From data mining to knowledge discovery in databases. AI Magazine 17, 3 (March 1996), 37–54; https://bit.ly/4233xSa.

    16. Floridi, L. and Taddeo, M. What is data ethics? Philosophical Trans. of the Royal Society A: Mathematical, Physical and Engineering Sciences 374, 2083 (2016); https://bit.ly/3LxA5MS.

    17. Getoor, L. et al. Computing research and the emerging field of data science. Computing Research Association (2016); https://bit.ly/3Vwz7oH.

    18. Grimm, P.W., Grossman, M.R., and Cormack, G.V. Artificial intelligence as evidence. J. Technology and Intellectual Property 9, 1 (2021), Article 2. https://bit.ly/3nMPYax.

    19. Hey, T., Tansley, S., and Tolle, K. The fourth paradigm: Data-intensive scientific discovery. Microsoft Research (October 2009); https://bit.ly/3nnpxYE.

    20. IFIP Guide to Concepts and Terms in Data Processing Volume 1. Intern. Federation for Information Processing, North-Holland Publishing Company (1971); https://bit.ly/3LObrZS.

    21. Ilyas, I.F. and Chu, X. Data Cleaning. Association for Computing Machinery, New York, NY, USA (2019); https://bit.ly/3LMnaYN.

    22. Jagadish, H.V. Big data and science: Myths and reality. Big Data Research 2, 2 (2015), 49–52; https://bit.ly/44lxEFL.

    23. Johnson, G. The world: In silica fertilization; All science is computer science. The New York Times (March 25, 2001).

    24. Kelleher, J.D. and Tierney, B. Data Science. MIT Press, Cambridge, Mass. (2018).

    25. Machine Learning and Data Science in the Power Generation Industry. P. Bangert (ed.), Elsevier (2021); http://bit.ly/4174sAg.

    26. MacKay, R.J. and Oldford, R.W. Scientific method, statistical method and the speed of light. Statistical Science 15, 3 (2000), 254–278; https://bit.ly/41W5m3e.

    27. Maydon, T. The 4 Types of Data Analytics (2017); https://bit.ly/42jIeLA.

    28. Milligan, I. History in the Age of Abundance? How the Web is Transforming Historical Research. McGill-Queen's University Press, Montreal (2019).

    29. Moura, J. and Serrão, C. Security and privacy issues of big data. In Cloud Security: Concepts, Methodologies, Tools, and Applications, IGI Global (April 2019), 1598–1630; https://bit.ly/3LvfLM6.

    30. Naur, P. Concise Survey of Computer Methods. Petrocelli Books (1974); https://bit.ly/420GdEs.

    31. Özsu, M.T. and Valduriez, P. Principles of Distributed Database Systems (4th Edition), Springer (2020).

    32. Richardson, R., Schultz, J.M., and Crawford, K. Dirty data, bad predictions: How civil rights violations impact police data, predictive policing systems, and justice. New York University Law Review Online 95, 15 (2019), 15–55; https://bit.ly/3NAdeTx.

    33. Sarker, I.H. Smart city data science: Towards data-driven smart cities with open research issues. Internet of Things 19 (2022), 100528; https://bit.ly/40S0I4Q.

    34. Shearer, C. The CRISP-DM model: The new blueprint for data mining. J. Data Warehousing 5 (2000), 13–22.

    35. Solove, D.J. A taxonomy of privacy. University of Pennsylvania Law Review 154, 3 (2006), 477–560.

    36. Steif, K. Public Policy Analytics: Code and Context for Data Science in Government. CRC Press (2021); https://bit.ly/413JgKB.

    37. Stodden, V. The data science life cycle: A disciplined approach to advancing data science as a science. Communications of the ACM 63, 7 (2020), 58–66; https://bit.ly/42j3DEY.

    38. Supriya, P. et al. Trends and application of data science in bioinformatics. In Trends of Data Science and Applications: Theory and Practices, S.S. Rautaray, P. Pemmaraju, and H. Mohanty (eds.), Springer Singapore (2021), 227–244; https://bit.ly/42cpiym.

    39. Tabladillo, M. et al. The team data science process life cycle. Microsoft Learn (2022); https://bit.ly/4107xBt.

    40. The world's most valuable resource is no longer oil, but data. The Economist (May 2017).

    41. Tukey, J.W. The future of data analysis. The Annals of Mathematical Statistics 33, 1 (1962), 1–67.

    42. Ullman, J.D. The battle for data science. IEEE Data Engineering Bulletin 43, 2 (2020), 8–14; https://bit.ly/3p3RzJr.
