Archive for August 2012

Data provenance

After some time working at Steerads, and after earlier professional and academic projects in machine learning, I’ve developed an interest in best practices for Big Data and machine learning architectures and work processes. I feel some nuggets of experience are worth sharing.

Data provenance

One theme I’ve grown interested in is data provenance (or lineage): the ability to track how some chunk of data was produced, what’s in it and, if you’re careful enough, how to reproduce it. For example, when manipulating datasets for experimental purposes, I’ll often apply some transformation (filtering, etc.) to a dataset, obtaining a transformed set of files. To the naked eye, innocently scanning ‘ls’ output and distractedly examining the files’ content, the result looks pretty much like the original dataset; however, results obtained on the transformed set do not generalize to the original, and vice versa.
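As a rough illustration of the idea, here is a minimal sketch of one way to record provenance: write a small JSON "sidecar" file next to each transformed file, storing content hashes of the inputs and output along with the transformation applied. The names `sha256_of` and `write_manifest`, the `.provenance.json` suffix and the manifest fields are all hypothetical choices made for this example, not an established convention.

```python
import hashlib
import json
import time
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(output_path, input_paths, transform, params):
    """Write a hypothetical provenance sidecar next to output_path.

    The manifest records which inputs (by content hash) and which
    transformation (by name and parameters) produced the output.
    """
    output_path = Path(output_path)
    manifest = {
        "output": {"path": str(output_path), "sha256": sha256_of(output_path)},
        "inputs": [{"path": str(p), "sha256": sha256_of(p)} for p in input_paths],
        "transform": transform,
        "params": params,
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    manifest_path = output_path.with_name(output_path.name + ".provenance.json")
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest_path
```

With something like this in place, two datasets that look alike under ‘ls’ can be told apart by reading their manifests, and the recorded transformation and parameters give a starting point for reproducing the output.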


Logging memory usage in Python

Lately I’ve had to optimize machine learning code in Python. Many modules will give you a time breakdown per function or per line, such as cProfile or kernprof. However, I also have memory usage issues (as in MemoryError, RAM maxed out), and I suspect I sometimes have spikes in memory usage that I can’t see from ‘top’.

Here’s a little utility class I wrote to log memory usage at chosen points in the program, so I can track what has been allocated by then. I only create one MemoryUsageLogger, a global one, which makes it very easy to track usage at a new spot. E.g. once “global_logger” is created, I can simply call “global_logger.log_current_usage()” and I’ll get lines such as:

5292032 (test_MemoryUsageLogger)    2012-08-31 16:36:50.517296
5324800 (test_MemoryUsageLogger)    2012-08-31 16:36:50.517529
5332992 (test_MemoryUsageLogger)    2012-08-31 16:36:50.517768
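The class itself is not shown in this excerpt. As a minimal sketch of how such a logger could be written (not necessarily how the original works), the version below uses the standard library's `tracemalloc` module, available from Python 3.4. Two assumptions to keep in mind: `tracemalloc` only counts memory allocated through Python's allocator, so its figures will not match the process size reported by ‘top’, and the `stream` parameter and the extra `peak=` field are additions made for this example.

```python
import datetime
import inspect
import sys
import tracemalloc


class MemoryUsageLogger:
    """Sketch of a logger for Python-level memory usage (via tracemalloc)."""

    def __init__(self, stream=None):
        # Assumption: log to stderr unless another stream is given.
        self.stream = stream if stream is not None else sys.stderr
        # Tracing must be on before allocations can be counted.
        if not tracemalloc.is_tracing():
            tracemalloc.start()

    def log_current_usage(self):
        """Write one line: bytes in use, calling function, timestamp, peak."""
        current, peak = tracemalloc.get_traced_memory()
        # Frame 0 is this method; frame 1 is whoever called it.
        caller = inspect.stack(0)[1].function
        now = datetime.datetime.now()
        self.stream.write(f"{current} ({caller})    {now}    peak={peak}\n")
        return current, peak


# A single global instance, as described above.
global_logger = MemoryUsageLogger()


def build_list(n):
    """Example call site: allocate a list, then log usage."""
    data = list(range(n))
    global_logger.log_current_usage()
    return data
```

The peak value is reported alongside the current one because it can reveal a short-lived spike that occurred between two logging calls, which is the kind of event that sampling with ‘top’ tends to miss.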
