Extract, Shoehorn, and Load

A lot of data is moved from system to system in an important and increasing part of the computing landscape. This is traditionally known as ETL (extract, transform, and load). While many systems are extremely good at this process, the source for the extraction and the destination for the load frequently have different representations for their data. It is common for this transformation to squeeze, truncate, or pad the data to make it fit into the target. This is really like using a shoehorn to fit into a shoe that is too small. Sometimes it’s a needed step. Frequently it’s a real pain!

Two major parts of ETL are the extraction and the load. These processes are where the rubber meets the participating data stores.

Extraction pulls data out of a source system. This may be relational data kept in a database. If so, it may be converted to an object relational format where each object transforms the join of multiple relational rows into a cohesive thing. Data is frequently organized as messages when it is sucked out. It’s also common for data to be extracted from key-value stores where it is kept in a semi-structured representation.

Load happens when the data is placed into the target system. The target will have its own metadata describing the shape and form of the data in its belly. If the target is an analytics system, then its data will likely be loaded into a relational form.

While it may be counterintuitive, it is frequently useful to take relational data out of a system as objects; convert, massage, and shoehorn the data from one object representation to another; and load it into the target system in relational form.

The new, shoehorned data is then used for analytic queries.

To make this work, you need metadata¹—for both the source being extracted and the target being loaded.

The metadata for the extracted source is descriptive. The data exists. The metadata describes its shape, form, and meaning. It is always the case that extracted data is copied out and the metadata describes what its shape was and what its shape is as it’s extracted.

The metadata for the loaded data is prescriptive. The ETL system makes it fit the output metadata’s shape and form exactly as it is prescribed to do. Only when the system knows what the results should look like can it do the work.

ETL systems always need to know the current shape of the input, as well as the shape the data should become.

Frequently the output data to be loaded into the target system is relational in its shape. Many analytics systems expect relational data to analyze and report. Relational is great as the format for both planned and ad hoc queries.

What If the Shoe Doesn’t Fit?

When incoming data and its descriptive metadata don’t match the outgoing data shape specified by the prescriptive data, you must do something about it.

When a shoe is too large, people will shove padding into it. Similarly, when the incoming data doesn’t have all the information required for the outgoing shape and form, you add stuff. This may be a default value or a null value.

If a shoe is too small and the foot is too large, sometimes you use a shoehorn to force the foot into the shoe, comfort be damned. This is a real pain! Similarly, when the incoming data has too much information, the system needs to discard data that doesn’t fit the outgoing metadata.

The process of discarding or padding data is very common.

All too often, the descriptive metadata for the input is not a perfect match to the prescriptive metadata for the desired output!

Sometimes data is extracted from many sources with either the same or different input metadata describing the stuff being loaded. It’s essential that the data from the various sources be modified to fit into the target metadata.

Note that normalizing the data to relational form may be difficult with different input data from different systems. The needed information may be missing from some input source.

Conclusion

ETL takes disparate sources and destinations and moves data from one to the other. Frequently there is only a partially useful mapping of the metadata. Sometimes data must be discarded to traverse the path from source to destination. Other times the source data may need to be augmented with null values or default values. It’s also possible that the mapping is complex and loses much of the meaning kept in the original translation as the data is reshaped and re-formed.

Metadata for the loaded source data is descriptive—it describes the data. Metadata for the data loaded into the target is prescriptive—it prescribes the required target shape and form. The challenge is that the described output may be ill fitting to the prescribed input.

It turns out the business value of ill-fitting data is extremely high. The process of taking the input data, discarding what doesn’t fit, adding default or null values for missing stuff, and generally shoehorning it to the prescribed shape is important. The prescribed shape is usually one that is amenable to analysis for deeper meaning.

It is the shoehorning that gives the data the shape it needs to be understood consistently.