Why the ‘Data Lake’ Is Really A ‘Data Swamp’

MIT Adjunct Professor Michael Stonebraker

A popular refrain I hear these days is "I am planning to put all of my data into a data lake so my employees can do analytics over this potential treasure trove of information." This point of view is also touted by several vendors selling products in the Hadoop ecosystem. Unfortunately, it has a serious flaw, which I illustrate in this posting using (mostly fake) data on one of my M.I.T. colleagues.

Consider two very simplistic data sources containing data on employees. The first data source has records of the form:

Employee (name, salary, hobbies, age, city, state)

while the second contains data with a layout of:

Person (p-id, wages, address, birthday, year-born, likes)

An example record from the first data set might be:

(Sam Madden, $4000, {bike, dogs}, 36, Cambridge, Mass)

An example from the second data set might be:

(Samuel E. Madden, $5000, Newton Ma., October 4, 1985, bicycling)

A first reasonable step is to assemble these records into a single place for subsequent processing. This Ingest step is the first phase of a data curation process with the following components:

Ingest: Data sources must be assembled as noted above.

Data transformation: The field "address" must be decomposed into its constituent components "city" and "state". Similarly, "age" must be computed from "year-born" and "birthday".

Schema integration. The fact that p-id and name mean the same thing must be ascertained. Similar statements hold for other attributes in the records, as well as in the transformed fields.

Data cleaning. The two salaries of Sam Madden are different numbers. One or both may be incorrect. Alternatively, both may be correct; for example, one could be total wages including consulting while the other could represent the W-2 salary of an individual. Similarly, Sam can only have a single age. Hence, one (or both) of the data sources has incorrect information.

Entity consolidation. It must be ascertained that the two Sam Madden records correspond to the same person, and not two different persons. Then, the two records must be merged into a composite record. In the process, decisions (often called merge rules) need to be made about Sam’s hobbies. For example, are "bike" and "bicycling" the same thing?
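To make the transformation and schema integration steps concrete, here is a minimal Python sketch that rewrites the example Person record into the Employee layout. Everything in it beyond the two record layouts is an assumption made for illustration: the function names, the attribute mapping other than p-id to name, the rule that the last token of an address is the state, and the reference date used to compute an age.

```python
from datetime import date, datetime

# Hypothetical attribute mapping (schema integration). The post only states
# that p-id and name correspond; the other two pairs are assumed here.
PERSON_TO_EMPLOYEE = {"p-id": "name", "wages": "salary", "likes": "hobbies"}

def split_address(address):
    """Split 'Newton Ma.' into city and state, assuming the last token is the state."""
    city, state = address.rsplit(" ", 1)
    return city.strip(), state.strip(" .")

def compute_age(birthday, year_born, as_of):
    """Derive an age from a 'Month day' string and a year, as of a given date."""
    born = datetime.strptime(f"{birthday} {year_born}", "%B %d %Y").date()
    had_birthday = (as_of.month, as_of.day) >= (born.month, born.day)
    return as_of.year - born.year - (0 if had_birthday else 1)

def person_to_employee(person, as_of):
    """Rewrite a Person record using the Employee attribute names."""
    out = {PERSON_TO_EMPLOYEE[k]: v for k, v in person.items() if k in PERSON_TO_EMPLOYEE}
    out["hobbies"] = {out["hobbies"]}  # Employee stores hobbies as a set
    out["city"], out["state"] = split_address(person["address"])
    out["age"] = compute_age(person["birthday"], person["year-born"], as_of)
    return out

person = {"p-id": "Samuel E. Madden", "wages": 5000, "address": "Newton Ma.",
          "birthday": "October 4", "year-born": 1985, "likes": "bicycling"}

# The reference date is an assumption; the post does not give one.
employee_view = person_to_employee(person, as_of=date(2014, 12, 1))
```

Even this toy version embeds judgment calls: "Ma" still has to be reconciled with "Mass", and a multi-word state or a street address would break the naive split.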

Several comments are immediately evident. First, data curation is an iterative process. After applying some of these steps, it may make sense to go back and repeat some of the other steps. For example, entity consolidation may reveal the problems with Sam’s salary and age. Hence, further data cleaning is warranted. In effect, data curation is a workflow of processing steps with some backtracking.
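The backtracking can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two records are taken to be already matched as the same person and already expressed in a common layout, the synonym table is a hypothetical merge rule, and the age of 29 in the second record is an assumed value derived from the birth date under an assumed reference date.

```python
# Hypothetical merge rule: treat "bike" as a synonym of "bicycling".
HOBBY_SYNONYMS = {"bike": "bicycling"}

def normalize_hobbies(hobbies):
    """Map each hobby to its canonical name where a synonym is known."""
    return {HOBBY_SYNONYMS.get(h, h) for h in hobbies}

def consolidate(a, b, single_valued=("salary", "age")):
    """Merge two records for one entity and report fields that disagree."""
    merged = dict(a)
    merged["hobbies"] = normalize_hobbies(a["hobbies"]) | normalize_hobbies(b["hobbies"])
    conflicts = {}
    for field in single_valued:
        if a[field] != b[field]:
            conflicts[field] = (a[field], b[field])
            merged[field] = None  # unresolved until data cleaning decides
    return merged, conflicts

rec1 = {"name": "Sam Madden", "salary": 4000, "hobbies": {"bike", "dogs"}, "age": 36}
rec2 = {"name": "Samuel E. Madden", "salary": 5000, "hobbies": {"bicycling"}, "age": 29}

merged, conflicts = consolidate(rec1, rec2)
# conflicts now lists salary and age, which sends both fields back to cleaning
```

The merge itself is mechanical; deciding which salary or age is right is not, and that decision is what sends the workflow back to an earlier step.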

Second, some data curation steps will require human involvement. It is not reasonable to expect automatic systems to do the whole job. Moreover, in many environments, it will take experts to provide the necessary human judgement. For example, integration of genomics data requires a skilled biologist, and cannot be performed by normal crowdsourcing techniques.

Third, real-world data is usually very dirty. Anecdotal evidence suggests that up to 10% of corporate data inside the firewall is incorrect. Hence, anyone who thinks that his troubles are over once he has ingested his data sources is sadly mistaken. The remaining four steps will be very costly.

In effect, the ingest phase is trivial compared to the other four steps. Hence, data ingest into an uncurated "data swamp" is just the tip of a data consolidation iceberg. A huge amount of effort will have to be subsequently invested to turn the swamp into a data lake.

The moral of this story is "don’t underestimate the difficulty of data curation." If you do, you will revisit a well-worn path, namely the experience of enterprises in the 1990s concerning data warehouses. A popular strategy at the time was to assemble customer-facing data into a data warehouse. Business analysts could then use the result to determine customer behavior and make better sales decisions. The average data warehouse project at the time was a factor of two over budget and a factor of two late because the data curation problems were underestimated. Don’t repeat that particular mistake.