
Communications of the ACM


What Does 'Big Data' Mean (Part 4)?

MIT Adjunct Professor Michael Stonebraker

In three previous blog postings, I talked about four possible meanings of "big data", namely: 

Big volume – small analytics

Big volume – big analytics

Big velocity

Big variety

and discussed the first three use cases. In this posting, I will finish this series with a discussion of the fourth use case. Big variety means attempting to cope with data arriving from too many data sources, which results in a daunting data integration challenge. I will focus on the integration of structured data, leaving the issue of text integration to others.

In the 1990s most large enterprises set up "enterprise data warehouses," led by the retail giants. These warehouses contained item-level historical sales data and were queried by product managers. In effect, the goal was to determine, for example, that Barbie dolls are in and pet rocks are out. The Barbie dolls were then moved to the front of the store for promotion and the pet rocks were discounted aggressively. Retailers found that sales data warehouses paid for themselves in less than a year due to better stock rotation and buying decisions. Most large enterprises followed suit with sales and customer data. Such warehouses typically integrated a dozen or fewer data sources, and became the focus of so-called "Extract, Transform and Load (ETL)" vendors. The ETL methodology that has emerged is:

For each data source to be integrated {

Assign a programmer to understand the data source – by talking to people, reading documentation, etc.

The programmer writes a program (in a scripting language) to extract data from the source

The programmer figures out how to map the local data source to a pre-defined global schema

The programmer writes any needed data cleaning and/or transformation routines

}
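This per-source loop can be sketched in a few lines of Python. Everything here is hypothetical (the source layouts, attribute names, and cleaning rule are invented for illustration), but it makes the key point visible: every new source requires its own hand-written extractor, mapping, and cleaner.

```python
# Hypothetical sketch of the traditional per-source ETL loop.
# Each source needs hand-written extraction, mapping, and cleaning --
# the human effort that limits this methodology to tens of sources.

def extract(source):
    # A programmer writes a source-specific extraction script.
    return source["rows"]

def map_to_global(row, mapping):
    # A programmer maps local attribute names to the global schema.
    return {global_name: row[local_name]
            for local_name, global_name in mapping.items()}

def clean(row):
    # A programmer writes source-specific cleaning routines.
    row["name"] = row["name"].strip().title()
    return row

# Two hypothetical local sources with different local schemas.
sources = [
    {"rows": [{"cust": " alice jones ", "amt": 100}],
     "mapping": {"cust": "name", "amt": "sales"}},
    {"rows": [{"customer_name": "BOB SMITH", "total": 200}],
     "mapping": {"customer_name": "name", "total": "sales"}},
]

warehouse = []
for src in sources:                       # for each data source...
    for row in extract(src):
        row = map_to_global(row, src["mapping"])
        warehouse.append(clean(row))

print(warehouse)
```

Note that each of the three hand-written steps must be repeated for every new source, which is exactly why the cost grows linearly in the number of sources.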

This (very widely used) methodology has a number of disadvantages:

  1. The global schema may not exist a priori. In one use case I am familiar with, a Big Pharma company wants to integrate the electronic lab notebooks of several thousand bench chemists and biologists. The global schema must be pieced together as a composite of the local schemas. Hence, it is a byproduct of the integration, not something that is known upfront.
  2. It may be difficult (or impossible) to write the transformations. For example, a "salary" attribute in a local source in Paris might well be a person's wages in Euros, including a lunch allowance, but after taxes. These semantics may not be in any data dictionary. Hence, the programmer may have a very difficult time understanding the meaning of attributes in local data sources.
  3. Data cleaning routines may be very difficult to write. For example, it is very difficult to decide whether two different restaurants at the same address are valid (a food court, for example) or whether one went out of business and was replaced by the other.
  4. Data integration is not a one-shot. All schemas, transformations and mappings may be time varying and may vary with the local updates performed on the data sources. It is not unusual to see monthly changes to these elements.
  5. The methodology does not scale. Since a lot of the steps are human-centric, this process will scale to a few tens of local sources before the cost becomes exorbitant.

Since the whole idea behind "big data" is scalability, the rest of this blog posting will focus on item 5) above.

I know of a number of enterprises that view data integration as their most serious issue. Generally speaking, most invested in data warehouses a while ago and constructed integrated composite databases of their sales/customer/product information. However, the vast majority are facing the issues in the following vignette.

I visited a major beer company in the late 1990s that had a traditional data warehouse of beer sales by distributor by brand by time period. At the time, the weather forecasters were predicting an "El Nino" winter, i.e. they predicted it would be wetter than usual on the West Coast and warmer than normal in the Northeast. I asked the business intelligence personnel at this company, "Do you think beer sales are correlated with either temperature or precipitation?" They replied that it would be a great question to answer, but unfortunately weather data was not in their data warehouse.

In effect, business analysts have an insatiable appetite for new data sources to correlate to enterprise data for business insight. A typical enterprise has more than 5000 operational data sources, some of which are relevant to business insight. Furthermore, much enterprise financial data is in spreadsheets, perhaps on the CFO's laptop. Lastly, there is a treasure chest of public data sources from the web (for example weather data). The net result is pressure from business analysts to integrate more and more data sources, thereby creating an increasingly difficult data integration problem.
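The beer-and-weather question is, at bottom, a one-line correlation once the weather source is integrated. A minimal sketch, using entirely invented weekly sales and temperature figures, shows how cheap the analysis itself is compared with the integration effort:

```python
import math

# Hypothetical weekly figures: beer shipments (cases) and average
# temperature (degrees F) for the same weeks -- invented numbers,
# purely for illustration.
sales = [1100, 1250, 1400, 1600, 1750, 1900]
temps = [40, 48, 55, 63, 70, 78]

def pearson(xs, ys):
    # Standard Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(sales, temps)
print(f"correlation(sales, temperature) = {r:.3f}")
```

The hard part the beer company faced was never this computation; it was getting the weather data into the warehouse in the first place.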

This same pressure will often be generated by new business requirements, for example the cross-selling of products between divisions of an enterprise or better ad placement by knowing more about the person at the other end of the web session. A case in point is car insurance, which is in the process of applying "mass personalization" to auto insurance rates. This was spearheaded a decade ago by Progressive, which put a sensor in your car and rewarded drivers for safe behavior. This concept is being expanded to include where and when you drive as well as other information (credit rating, recent life changes, etc.). Again, new business requirements dictate a much more difficult data integration problem.

In the rest of this post I offer a few possible suggestions for data integration at scale.

Human involvement: First, the traditional methodology will scale to tens of sites; it has no possibility of scaling to hundreds or thousands. To scale, the traditional methodology has to be replaced by machine learning and statistics to "pick the low hanging fruit" automatically. A human should be involved only when automatic algorithms fail. See my paper in CIDR '13 for one possible approach along these lines.
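The division of labor can be illustrated with a toy attribute-matching sketch (this is not the CIDR '13 system; the global schema, local attribute names, and the 0.6 threshold are all invented): score each local attribute against the global schema by string similarity, auto-map the confident matches, and queue only the rest for a human.

```python
from difflib import SequenceMatcher

# Toy illustration of "automatic first, human only on failure":
# match local attribute names against a hypothetical global schema
# and escalate only the low-confidence cases.
GLOBAL_SCHEMA = ["customer_name", "salary", "hire_date"]

def best_match(local_attr):
    scored = [(SequenceMatcher(None, local_attr.lower(), g).ratio(), g)
              for g in GLOBAL_SCHEMA]
    return max(scored)

local_attrs = ["CustName", "sal", "xq_37"]
decisions = {}
for attr in local_attrs:
    score, match = best_match(attr)
    if score >= 0.6:                      # low-hanging fruit: auto-map
        decisions[attr] = match
        print(f"auto: {attr} -> {match} ({score:.2f})")
    else:                                 # automatic algorithm failed
        decisions[attr] = None
        print(f"human review needed: {attr} (best guess {match}, {score:.2f})")
```

A real system would use much richer evidence (data values, types, co-occurrence statistics), but the control flow is the point: the human sees only the cases the machine cannot decide.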

Transformations: Second, having a programmer write transformations has no chance of scaling. Instead, the simpler ones should be guessed automatically and more complex ones should use a WYSIWYG interface oriented toward non-programmers. A good start in this direction is the Data Wrangler software from Stanford. Only the most complex ones should require a programmer. Although most proprietary ETL frameworks come with a library of transformations, a public catalog of them, not tied to software licenses, would be a very valuable service. I suspect that transformations are often written many times from scratch, because a programmer cannot find previously written versions.
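Guessing the simple transformations automatically might look like the following sketch (the candidate format list and sample column are hypothetical): try each known date format against every value in a column, and adopt the one that parses them all; if none fits, escalate, as the Paris salary example would require.

```python
from datetime import datetime

# Sketch of guessing a simple transformation automatically:
# probe candidate date formats against a column and keep the one
# that parses every value.  Formats and data are invented.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"]

def guess_date_format(values):
    for fmt in CANDIDATE_FORMATS:
        try:
            for v in values:
                datetime.strptime(v, fmt)
            return fmt                    # every value parsed: good guess
        except ValueError:
            continue
    return None                           # too hard: escalate to a human

column = ["31/12/2012", "01/01/2013", "15/06/2013"]
fmt = guess_date_format(column)
print("guessed format:", fmt)

# Normalize the column to ISO dates using the guessed format.
iso = [datetime.strptime(v, fmt).strftime("%Y-%m-%d") for v in column]
print(iso)
```

Semantic transformations (currency, tax treatment, allowances) cannot be probed this way, which is exactly why they need a WYSIWYG tool or, in the worst case, a programmer.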

Entity consolidation: In its simplest instantiation, this means de-duping of data. However, in most cases data sources will have different information about entities, and a consolidation phase is required. Consolidation can be domain specific (for example English names or company names) or general purpose (just look for clusters of records in attribute space). All low hanging fruit should be picked by automatic algorithms, with human involvement only in the tricky cases.
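A minimal general-purpose sketch, with invented restaurant records and an invented 0.7 similarity threshold: group records whose name-plus-address strings are similar enough, then merge each group, preferring non-null attribute values from whichever source has them.

```python
from difflib import SequenceMatcher

# Toy general-purpose consolidation: records from different sources
# are clustered by string similarity, then merged.  Data is invented.
records = [
    {"name": "Red Lobster", "addr": "12 Main St", "phone": None},
    {"name": "Red Lobster #44", "addr": "12 Main Street", "phone": "555-0101"},
    {"name": "Olive Garden", "addr": "98 Oak Ave", "phone": "555-0202"},
]

def similar(a, b, threshold=0.7):
    key = lambda r: (r["name"] + " " + r["addr"]).lower()
    return SequenceMatcher(None, key(a), key(b)).ratio() >= threshold

clusters = []
for rec in records:
    for cluster in clusters:
        if similar(rec, cluster[0]):      # same entity: join the cluster
            cluster.append(rec)
            break
    else:
        clusters.append([rec])            # new entity

# Merge each cluster, preferring non-null attribute values.
entities = []
for cluster in clusters:
    merged = {}
    for rec in cluster:
        for k, v in rec.items():
            if merged.get(k) is None:
                merged[k] = v
    entities.append(merged)

print(len(entities), "entities from", len(records), "records")
```

The food-court case from item 3 above is precisely where a threshold like this fails, and where the human comes in.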

Cleaning: Presumably a lot of mileage can be obtained by using WYSIWYG cleaning tools such as DbWipes. I expect there are many more good ideas with a substantial visualization component.

Moreover, when the same entity (say a Red Lobster restaurant) comes from multiple sources, then some cleaning can be done automatically during entity consolidation. If automatic algorithms determine that the same entity is listed at different addresses or with different phone numbers, then data correction can be performed automatically.
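One simple way to perform such automatic correction is majority vote across sources, sketched here with invented phone-number observations: when most sources agree on a value, adopt it; on a tie, flag the entity for a human.

```python
from collections import Counter

# Sketch of automatic correction during entity consolidation:
# conflicting values for the same entity are resolved by majority
# vote, with ties escalated to a human.  Data is invented.
observations = {
    ("Red Lobster", "12 Main St"): ["555-0101", "555-0101", "555-0199"],
}

corrections = {}
for entity, phones in observations.items():
    (value, votes), = Counter(phones).most_common(1)
    if votes > len(phones) / 2:           # clear majority: auto-correct
        corrections[entity] = value
        print(f"{entity[0]}: corrected phone -> {value}")
    else:                                 # no majority: human review
        corrections[entity] = None
        print(f"{entity[0]}: no majority, human review")
```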

Since any automatic system will inevitably make errors, 100% accuracy is an illusion, and enterprises need business processes and mechanisms to deal with incorrect data. For example, one large web retailer admits there will always be processing errors, and has a system to deal with the resulting human complaints.

Wrappers: One of the thorniest problems is to extract data from a legacy system (say SAP or PeopleSoft) and cast it into a format readable by a data integration system. The legacy ETL vendors have spent large amounts of time creating these wrappers for popular systems. But how do we create them for the rest of the data sources? Equally daunting is the web. Although there is some information in HTML tables that is easily read, most of the interesting data is buried in text, pull-down menus, or behind clickable buttons (the so-called deep web). How to build cost-effective extractors in these situations is problematic and in need of a good idea.
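The "easy" end of this spectrum, reading an HTML table, can be handled with nothing but the Python standard library, as in this sketch (the sample HTML is invented). The contrast is the point: nothing this simple exists for deep-web sources.

```python
from html.parser import HTMLParser

# Minimal wrapper for the easy case: extracting rows from an HTML
# table.  No comparably simple wrapper exists for data buried in
# text, pull-down menus, or behind buttons (the deep web).
class TableExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self.row:
            self.rows.append(self.row)
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.row.append(data.strip())

page = ("<table><tr><th>city</th><th>temp</th></tr>"
        "<tr><td>Boston</td><td>30</td></tr></table>")
p = TableExtractor()
p.feed(page)
print(p.rows)
```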


Marcos Sena

I think it's a good initiative from experienced researchers to explain some topics in IT.
I'm an undergraduate student and have a lot of questions about everything.
And it's so helpful to see that here!
Thank you!

William Mitlyng

A snippet of a war story
I am certain I do not understand. Twenty years ago I worked on integrating state-level data from DEC VAXes and IBM variable-length records, so ASCII and EBCDIC with binary and COMP fields from COBOL programs. We had a Sun SPARC 1 and added a SPARC 10 a year or two later. Minimal machinery, antiquated record types, with data from five states' social-science research agencies. The data sets would be translated from tape to Sybase in 24 to 48 hours in spite of file corruption and remapped data fields. Lastly, recoding characteristic variables let us complete a three-way join producing hundreds of millions of records in six hours, and then on to SAS.

We had one person who used dd, awk, and sed faster than putting it on the database.

That was Unix with a dedicated team of researchers at the University of Chicago.
I had someone who had me in an advisory position. There was a big to-do about creating stored procedures from entity-relationship diagrams. Thanks to some uncommon sense, she just wrote it out by hand and did a much better job.

There is a great potential for overly complicated solutions to simple problems.

I would think that with many solid-state disks, an OS that can handle multiple disks and CPUs effectively, and Fibre Channel, you could build, let's say, a redundant array of inexpensive computers. It is my old-fashioned approach. How can you say there is any quality without data dictionaries?

Are you going to find data based on a hypothesis about what has value in the data (making the decision NP-hard), as opposed to tossing everything in and creating a more sophisticated approach to translating everything?

Well that is my folksy industry perspective.

