Data provenance

After working at Steerads for some time now, and after earlier work and academic projects in machine learning, I’ve built up an interest in best practices for Big Data and machine learning architectures and work processes. I feel some nuggets of experience are worth sharing.

Data provenance

One theme I’ve grown interested in is data provenance (or lineage). It’s the idea of being able to track how some chunk of data was produced, what’s in it and, if you’re careful enough, how to reproduce it. For example, when manipulating datasets for experimental purposes, I’ll often apply some transformation (filtering, etc.) to one, obtaining a transformed set of files. To the naked eye innocently scanning ‘ls’ outputs and distractedly examining the files’ content, the result looks pretty much like the first dataset. However, results obtained on the second do not generalize to the first, and vice versa.

It’s insidious, just like any error in data analysis activities: if you mess up data at one stage, nothing crashes, but data is now tainted all the way forward, and all your results are then invalid. I’ve seen a few cases of working for days with flawed data assumptions…

So I grew an interest in the topic, but can’t say I’ve read as much as I’d like about it. There is a wide literature on the topic, and many solutions exist for various workflow systems (e.g. RAMP is one for Hadoop). I won’t strive for completeness. I’ll only share some impressions, experience and links.

I feel solving this is very much about adopting habits (as in “best practices”), as well as adding it as a design consideration in data workflows.


The tools used to make data transformations, when experimenting and analysing, are so very diverse. We can’t expect them all to follow some convention of tagging data with proper metadata. Heck, more often than not the transformation will be done with on-the-fly code in a Python/R/etc. shell prompt… So I feel a good part of keeping track of provenance for one-off data is about adopting habits.

Just like with code comments and clarity, I think it’s about thinking of the “future you” or the colleague who’ll have to deal with the code, or here, the data. The most obvious solution when doing manual transformations is to add a README file next to the files and keep it up to date. In it I’ll mention what the original data was and what manipulations were applied (a copy-paste of my commands, etc.). I’ll also explain the purpose of the manipulations, i.e. what I was trying to extract or get at by doing them, which I’ve found helps a lot.

It’s certainly not foolproof, as you need to think about it, but following this simple trick for the last few months has already proved very useful.
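To make the README habit a bit less dependent on memory, something like the following Python helper could do the writing for you. This is only a minimal sketch: the function name append_readme_note, the README.txt file name and the layout of the note are all made up for illustration, not a convention from any tool.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_readme_note(data_dir, source, commands, purpose):
    """Append a dated provenance note to README.txt inside data_dir.

    Hypothetical helper: the file name and note layout are assumptions.
    """
    readme = Path(data_dir) / "README.txt"
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    lines = [
        f"== {stamp} ==",
        f"Original data: {source}",
        f"Purpose: {purpose}",
        "Commands:",
    ]
    # Indent the commands so they stand out from the prose.
    lines += [f"    {cmd}" for cmd in commands]
    # Append rather than overwrite, so earlier notes are kept.
    with readme.open("a", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n\n")
    return readme
```

Calling it once per manual transformation, with the commands pasted from the shell history, leaves a running log next to the files.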

As a design consideration

Keeping track of provenance also becomes a design consideration as, of course, the workflow/pipeline used to obtain production data is worth modifying to leave some traces along with its outputs. Pipeline code changes with time, and numbers which look the same in two files from different days may have very different meanings.

Notice that tracking data provenance information integrates very well with an architecture arranged as a workflow: at every stage, you add a piece of metadata related to that stage, and point to the previous step in the graph. It’s no coincidence that so many workflow systems mention data provenance tracking as a feature (e.g. Pegasus, VisTrails). If you feel like spending a few hours reading (or at least vaguely scanning) academic papers on the topic, google for “scientific workflows”.
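The “point to the previous step” idea can be sketched in a few lines of Python. Everything here is hypothetical: the catalog dictionary, the stage names and the lineage function are invented for illustration and do not come from Pegasus, VisTrails or any other system.

```python
# Hypothetical in-memory catalog: one metadata entry per chunk of data.
# Each entry names the stage that produced it and points to its parents.
catalog = {
    "raw_logs": {"stage": "collect", "parents": []},
    "filtered": {"stage": "filter", "parents": ["raw_logs"]},
    "features": {"stage": "featurize", "parents": ["filtered"]},
}


def lineage(name, catalog):
    """List `name` and all of its ancestors, starting from `name` itself."""
    seen = []
    stack = [name]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a chunk shared by several branches is listed once
        seen.append(current)
        stack.extend(catalog[current]["parents"])
    return seen


print(lineage("features", catalog))
```

Walking the pointers backwards from any output gives the full list of data it depends on, which is the question you usually end up asking when a result looks suspicious.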

Here are some basic things to record about a piece of data:

  • The function or command used to produce it
  • The code version (including important libraries’ versions)
  • The data inputs (dependencies), which we’ll assume are fixed chunks of immutable data, with their own provenance metadata
  • The arguments to the code (as in command line arguments, or config file options)
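
As a rough illustration, the four items above could be gathered into a small JSON “sidecar” file written next to each output. This is a sketch under assumptions, not a standard: the function names, the .prov.json suffix and the field names are invented here, and hashing the inputs with SHA-256 is just one way to pin down which immutable chunk was used.

```python
import hashlib
import json
import sys
from pathlib import Path


def file_sha256(path):
    """Hash a file in blocks so large inputs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def provenance_record(command, code_version, input_paths, arguments):
    """Build a dict with the basic facts about how an output was produced."""
    return {
        "command": command,
        "code_version": code_version,  # e.g. a commit id, supplied by the caller
        "python_version": sys.version.split()[0],
        "inputs": {str(p): file_sha256(p) for p in input_paths},
        "arguments": arguments,
    }


def write_sidecar(output_path, record):
    """Write the record next to the output, as <output>.prov.json (assumed name)."""
    sidecar = Path(str(output_path) + ".prov.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return sidecar
```

A pipeline stage would call provenance_record just before writing its output, then write_sidecar right after. If the inputs have sidecars of their own, the recorded paths and hashes are enough to follow the chain backwards.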

For some situations, other factors are important. Environment variables, OS, machine configuration, etc. obviously count, but at some point you need to draw a line for what not to record.

This of course does not cover accumulating data from a pipe or live stream and doing continuous transformations. As I said, I assume the input data chunks are immutable, fixed chunks. Of course those chunks can come from accumulating data for some time, e.g. “Twitter tweets for October 1st, 2011”.

In some cases randomness will also play a part in the result, for example if you’re transcoding a live video (i.e. machine lag can count).

Conclusion and further reading

For full disclosure, I’m still searching myself for a good system that does not come with an integrated kitchen sink, and as yet it’s not been high enough on the priority list to really select something. Maybe I’ll code something up that does most of what I want (which is probably why there are so many such systems: everyone goes off creating their own in a bunch-of-scripts fashion 🙂 ).

My point here was mostly to raise awareness (I had never heard of this during my Master’s, and I’d have liked to), but if you want to go further, all you really need is a few keywords to start your search, so I suggest “data provenance survey”, “scientific workflow” or “workflow management system”.

As said above, there is no lack of existing systems and research on the topic. Having gotten lost at sea in there, I recommend heading straight to the survey papers and focusing on the analysis scheme they come up with. For example, see the “Categorization scheme for conceptual properties of data provenance” section in “Data Provenance: A Categorization of Existing Approaches”. That helped me view the problem more broadly.
