The Digital Physics of Data Mining

Being asked to project the future state of data mining is like being asked to predict what the descendants of a newborn baby will look like. We are at the dawn of the age of stored digital information and only in the first decade of the data mining field. This opportunity reminded me of an interview Joel L. Swerdlow of National Geographic magazine conducted with me as part of an article he was writing for the August 1999 issue on what he called the “history of writing.” The interview got off to a difficult start, as it took us a while to even begin communicating on the basics. Swerdlow viewed the history of writing in terms of alphabets, paper, and writing instruments. The word “write” means something completely different to me: spinning disks and the streams of 0s and 1s that computers and devices write on them. He was surprised when I pointed out that more information is written to digital media today than the total cumulative handwriting and printing during all of recorded human history. Could anyone have predicted the digitization of writing 1,000 years ago? I certainly would not have been able to.

Data mining is one of the central activities associated with understanding, navigating, and exploiting the new world of digital data. It is the mechanized process of identifying and discovering useful structure in data. I use the term “structure” to refer to patterns, models, and relations over the data. A pattern is classically described as a parsimonious description of a subset of the data. A model is a (statistical) description of the entire data set. A relation is simply a property specifying some dependency between fields (attributes) over a subset of the data (such as that a disproportionately large set of the people owning automobiles also happen to have undergone major surgery).
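To make the notion of a relation concrete, here is a minimal sketch of one common way to quantify a dependency between two fields: the ratio of how often they occur together to how often they would co-occur if they were independent (often called lift). The function name `lift`, the field names, and the ten toy records are all invented for illustration; they are not real statistics about automobile owners or surgery.

```python
def lift(records, field_a, field_b):
    """Ratio of the observed co-occurrence rate to the rate expected under independence.

    A value near 1 suggests no dependency between the two fields;
    a value well above 1 suggests they co-occur disproportionately often.
    """
    n = len(records)
    p_a = sum(r[field_a] for r in records) / n
    p_b = sum(r[field_b] for r in records) / n
    p_ab = sum(r[field_a] and r[field_b] for r in records) / n
    return p_ab / (p_a * p_b)


# Hypothetical toy records (made up for this sketch, not real data).
records = (
    [{"owns_car": True, "had_surgery": True}] * 4
    + [{"owns_car": True, "had_surgery": False}] * 1
    + [{"owns_car": False, "had_surgery": True}] * 1
    + [{"owns_car": False, "had_surgery": False}] * 4
)

car_surgery_lift = lift(records, "owns_car", "had_surgery")
print(f"lift = {car_surgery_lift:.2f}")
```

A lift above 1 on a sample this small proves nothing by itself; it only flags a candidate relation that would still need to be checked for statistical reliability.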

While fairly uncomplicated, this definition of data mining turns out to have important implications, including those needed to address some of the most challenging questions we face today. For example, as we attempt to understand what it means to model data and derive the latent information in it, what does it mean for a pattern to be true? What does it mean for a summary to be interesting or useful? Which of the multitude of models and patterns that can possibly describe any given finite data set do we choose to adopt as the truth and its implications? What is our basis for preferring one pattern over another? What is the difference between base and derived facts?

These are all aspects of the mysterious atomic essence of the emerging information economy in which data is the most basic and valuable asset. The person mastering techniques for exploiting and mining it and nimbly navigating the vast new high-dimensional frontiers will inevitably have power in the digital information environment and its related economies. Hence, data mining is as fundamental to future human history in that environment as geographic navigation and mineral mining have been in the physical environment.

While people have recorded and stored data for at least the past 5,000 years, systematic data analysis is a fairly recent phenomenon, one that emerged first in a few scientific fields, initially astronomy and medicine.

The early data sets, such as those involving business and military records, were small in size and, more important, were “low dimensional,” reflecting only a small number of variables, typically fewer than 10. The set of techniques and technologies to analyze and model this data revolved around visualization of the data using graphs, charts, and tables. Digital computers and data storage dramatically changed the picture. The vast scale of the new data volumes quickly ruled out traditional manual approaches to analysis. Humans are by nature and history dwellers in low-dimensional environments. Our senses and instincts help us deal with three to five dimensions, perhaps as many as 10 if we count all our natural senses and their derivatives. How are we to deal with 100 dimensions? 1,000 dimensions? How about tens of thousands, as in e-commerce, the Web, manufacturing, finance, and scientific observation?

With the reality of massive data sets came the realization of the difficulty and complexity of constructing computational tools extending our human analysis abilities to higher dimensions. Today’s data mining algorithms are used to determine statistical models that fit the data or reliable patterns that appear within it; they are fairly complex and draw on mathematical techniques from probability theory, information theory, estimation, reasoning under uncertainty, and graph theory, as well as on database techniques. Since in most cases a closed-form solution cannot be determined easily, data mining techniques wind up being search-intensive, often involving iterative convergence on acceptable solutions. Even with all these techniques, we have taken only the first unsteady steps toward addressing such difficult problems as understanding and exploiting the meaning of the information hidden from our perception in the higher dimensions. What about determining causality from data? If one event always seems to follow another in data, when is it appropriate to conclude that one of them is a cause, rather than a correlated symptom?
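The phrase "iterative convergence on acceptable solutions" can be illustrated with a small sketch. The example below uses a one-dimensional version of k-means clustering (Lloyd's iteration), chosen here only as a familiar instance of the idea; the essay does not name any particular algorithm. The function name `kmeans_1d`, the sample values, and the starting centers are assumptions made for illustration.

```python
def kmeans_1d(values, centers, max_iters=100, tol=1e-9):
    """Alternate between assigning values to the nearest center and
    moving each center to the mean of its assigned values, until the
    centers stop moving (or max_iters is reached)."""
    centers = list(centers)
    iteration = 0
    for iteration in range(1, max_iters + 1):
        # Assignment step: group each value with its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to its cluster mean
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break  # converged: no center moved by more than tol
    return centers, iteration


# Hypothetical sample: two loose groups of values.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centers, iterations = kmeans_1d(values, centers=[0.0, 5.0])
print(centers, iterations)
```

No formula gives the best centers directly, so the procedure searches for them step by step and stops at a solution that is acceptable rather than provably optimal; a different starting guess can converge to a different answer.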

With such a short history to constrain me, I can go on unrestrained to imagine the trends of the future. The most basic is that valuable data, much like money today, will need to be kept safe, easily accessible, and productive. The evolution of data banks is inevitable. Much as they deal today with utilities, such as those providing phone and electricity service, people will access, exchange, and manipulate data by interacting with the banks and institutions that host it. The equivalent of interest on the data deposits is the added value of turning data into information and knowledge, which is precisely data mining’s goal.

Data mining techniques will be used to filter, select, customize, and deliver the right doses of information in the right format and the right context. Data will be accessible and useful at all times, since all organizations and all individuals will connect at will to the vast store of human knowledge. When the data tone is cut off or interrupted, we will feel as uncomfortable and find it as disturbing as when we experience a disruption in electricity, transportation, or communications today. Raw human abilities, without the proper data mining tools, will stand helpless in the high-dimensional massive data universe. Our ability to harvest benefits from the vast data stores will rely entirely on the algorithms and mathematics of data mining.

Accurate modeling of data is the ultimate form of data compression. If we derive the generative distribution describing a data set, we can reduce terabytes and petabytes of data to a single formula. These compact summaries of data become the ultimate portable data stores and can serve to answer queries and quickly navigate the data. Data mining will enable the ultimate transformation of the physical, via the digital, to the mathematical instantiation. The reduction of data to its generative principles is the ultimate form of detective work.
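As a rough sketch of the compression idea, the example below replaces 100,000 stored numbers with the two parameters of a fitted normal distribution and then answers a query from the parameters alone. Everything here is an assumption made for illustration: the data is synthetic (drawn from a known Gaussian with a fixed seed), the normal model is picked because it matches how the data was generated, and the function names are hypothetical. Real data rarely fits a simple formula this well, so the summary is lossy and the answers are approximate.

```python
import math
import random
import statistics

# Synthetic data set (assumption: values really are drawn from a Gaussian).
random.seed(0)
data = [random.gauss(50.0, 5.0) for _ in range(100_000)]

# The "model": two numbers standing in for 100,000 stored values.
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)


def model_fraction_below(x):
    """Answer 'what fraction of values lie below x?' from the model alone,
    using the normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def data_fraction_below(x):
    """Answer the same query by scanning every stored value."""
    return sum(1 for v in data if v < x) / len(data)


threshold = 55.0
from_model = model_fraction_below(threshold)
from_data = data_fraction_below(threshold)
print(f"model: {from_model:.4f}  full scan: {from_data:.4f}")
```

The two answers agree closely only because the model family matches the process that generated the data; the quality of such a summary depends entirely on how well the chosen distribution describes the real data.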

Some worry about privacy and the ability of governments and other organizations to fully know and hence control the behavior and even the thoughts of individuals. I remain optimistic. With every advance in technology comes an advance in our individual ability to move faster or disappear more effectively. Information countermeasures will evolve to allow individuals to control which of their information properties make it into the vast data banks, how and by whom they are accessed, and how they are moved around. With more information (not with data overload) also comes the ability to make better individual decisions and choices. Moreover, data mining tools can work both ways, also helping individuals figure out when their space is being mined inappropriately.

Since competition is part of our human nature, wars will be waged over this information. Data banks will be held hostage and robbed. Magnificent digital libraries will be destroyed. The destruction of the ancient library at Alexandria, Egypt, will be repeated at ever larger scales in the cyberworld. Some nations will overtake others due to some edge in information infrastructure or data navigation and mining techniques. Information borders will evolve, and some descendants of data mining techniques will be classified as weapons. As in all endeavors in human history, nothing will drive progress in the field as fast as the data wars and the data mining arms race. We see it beginning in the business world today, while some encryption algorithms are being classified as restricted munitions by the U.S. government.

I am most excited by the prospects of the new mathematics and theory that will evolve to help us deal with and understand the digital information landscape. Our current abilities and understanding of the digital information environment are primitive compared with our abilities to model and understand the physical (analog) environment. Imagine what digital physics will give us as the equivalent of nuclear weaponry once we learn to crack the atoms of information. The consequences of unleashing data’s tremendous latent information will be bewildering, exhilarating, even frightening.

In the same vein, my imagination fails to begin to visualize the wonders we will discover as data mining evolves into the effective cybernavigation science of tomorrow. I envy the early discovery expeditions into the new world of data. A wondrous journey awaits us all.