Code for a simple PostgreSQL plugin for StarCluster on Amazon EC2

I’m exploring Amazon EC2 as an alternative to dedicated compute clusters for running machine learning experiments. I typically use Jobman for hyperparameter exploration (i.e. trying to find the best parameters for statistical models), and in a distributed setting it requires a PostgreSQL database.

To automate cluster launching on EC2, StarCluster is very handy. However, it only ships with a MySQL plugin, no PostgreSQL one, and mighty Google didn’t turn up anything useful for “starcluster postgresql”. At first I thought of writing a bash script, but StarCluster’s plugin system was so easy to get started with that I wrote a proper plugin instead. The code is below; you just need to put it in a file called “postgresql_plugin.py” under your $HOME/.starcluster/plugins directory. Here’s a raw file with the code to avoid copy/pasting.

The plugin automates the procedure described here. It installs the postgresql-client package on each node, and sets up the master node with a PostgreSQL server, a new DB and a user.

Once you’ve copied the file, you simply need to add a [plugin] section to your .starcluster/config file, like so:

[plugin postgresql_plugin]
SETUP_CLASS = postgresql_plugin.PostgresqlPlugin
db_name = mydb
db_user = myuser
db_password = mypassword
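
For StarCluster to actually run the plugin, it also has to be listed in the plugins setting of the cluster template you launch; assuming a template named smallcluster (that name is just an example), it looks like:

[cluster smallcluster]
# ... your usual cluster settings (keypair, size, AMI, etc.) ...
PLUGINS = postgresql_plugin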

Please note that I’m certainly no PostgreSQL/StarCluster administration guru, but I think this setup should be enough for basic needs such as mine. The access rules added to pg_hba.conf should limit connections to the nodes of your StarCluster-launched cluster, so I’m not too worried about the plaintext password issue; if you know better, please leave a comment below.

from starcluster.clustersetup import ClusterSetup
from starcluster.logger import log

class PostgresqlPlugin(ClusterSetup):
    """Sets up a PostgreSQL server (with a new database and user) on the
    master node, and installs the client package on every other node."""

    def __init__(self, db_name, db_user, db_password):
        self.db_name = db_name
        self.db_user = db_user
        self.db_password = db_password

    def run(self, nodes, master, user, user_shell, volumes):
        master_packages = "postgresql postgresql-client"

        # install server packages on the master, then create the role and DB
        log.info("Installing %s on master, and setting up psql user and DB." % master_packages)

        master.ssh.execute('apt-get -y install %s' % master_packages)
        master.ssh.execute('sudo -u postgres psql -c "CREATE ROLE %s WITH LOGIN ENCRYPTED PASSWORD \'%s\';"' %
                           (self.db_user, self.db_password))
        master.ssh.execute('sudo -u postgres createdb -O %s %s' % (self.db_user, self.db_name))

        # now from master you should be able to connect with
        # psql -h 127.0.0.1 --password mydb myuser

        # configure host-based access rules in pg_hba.conf and postgresql.conf
        line_pattern = "host\tall\tall\t%s/32\tmd5"
        lines = [line_pattern % (master.private_ip_address,)]
        for node in nodes:
            if node.private_ip_address == master.private_ip_address:
                continue
            lines.append(line_pattern % (node.private_ip_address,))

        for l in lines:
            log.info("Adding this line to pg_hba.conf: %s" % l)
            master.ssh.execute('sudo -u postgres bash -c "echo \\"%s\\" >> /etc/postgresql/*/main/pg_hba.conf"' % l)

        # accept connections on all network interfaces
        log.info("Adding listen_addresses = '*' to postgres.conf")
        master.ssh.execute('sudo -u postgres bash -c "echo \\"listen_addresses = \'*\'\\" >> /etc/postgresql/*/main/postgresql.conf"')

        # nodes only need the psql client installed; no special config
        # needed (beyond the access rules added on the master above)
        node_packages = "postgresql-client"

        # TODO: do these installs in parallel
        for node in nodes:
            if node.private_ip_address == master.private_ip_address:
                continue
            log.info("Installing %s on %s" % (node_packages, node.alias))
            node.ssh.execute('sudo apt-get -y install %s' % node_packages)

        # restart so the pg_hba.conf and postgresql.conf changes take effect
        master.ssh.execute('sudo /etc/init.d/postgresql restart')
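
Once the cluster is up, you can sanity-check the setup by connecting from one of the nodes. Here’s a minimal sketch using psycopg2 (which the plugin does not install; “master” is the hostname alias StarCluster puts in /etc/hosts, and the credentials are the ones from the example config above):

import psycopg2

# connect to the master's PostgreSQL server with the example credentials;
# psycopg2 must be installed separately (e.g. via apt or pip)
conn = psycopg2.connect(host="master", dbname="mydb",
                        user="myuser", password="mypassword")
cur = conn.cursor()
cur.execute("SELECT version();")
print(cur.fetchone()[0])
conn.close()

If this works, pointing Jobman at the same database should just be a matter of giving it the matching postgres:// URL, since Jobman talks to the database through SQLAlchemy.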

Data provenance

Having worked for Steerads for some time now, and drawing on previous work and academic projects in machine learning, I’ve built up an interest in best practices surrounding Big Data and machine learning architectures and work processes. I feel some nuggets of experience are worth sharing.

Data provenance

One theme I’ve grown interested in is data provenance (or lineage): the idea of being able to track how some chunk of data was produced, what’s in it and, if you’re careful enough, how to reproduce it. For example, when manipulating a dataset for experimental purposes, I’ll often apply some transformation (filtering, etc.) to it, obtaining a transformed set of files. To the naked eye innocently scanning ‘ls’ outputs and distractedly examining the files’ content, the result looks pretty much like the first dataset; however, results obtained on the second do not generalize to the first, and vice versa.

Continue reading ‘Data provenance’ »

Logging memory usage in Python

Lately I’ve had to optimize machine learning code in Python, and there are many modules that will give you a time breakdown per function or per line, such as cProfile or kernprof. However, I also have memory usage issues (as in MemoryError, RAM maxed out), and I suspect I sometimes have spikes of memory usage that I can’t see from ‘top’.

Here’s a little utility class I wrote to log usage after certain calls, to track what memory was allocated at chosen points in the program. I only create one such MemoryUsageLogger, a global one, and then it’s very easy to track usage at a new spot. E.g. once “global_logger” is created, I can simply do “global_logger.log_current_usage()” and I’ll get lines such as:

5292032    memusage.py:52 (test_MemoryUsageLogger)    2012-08-31 16:36:50.517296
5324800    memusage.py:52 (test_MemoryUsageLogger)    2012-08-31 16:36:50.517529
5332992    memusage.py:52 (test_MemoryUsageLogger)    2012-08-31 16:36:50.517768
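
The full class is in the rest of the post, but the core mechanism is simple. Here’s a minimal sketch of that kind of logger, assuming Linux (it reads the process size from /proc/self/statm; the names mirror the ones mentioned above, but the details are illustrative):

import datetime
import inspect
import resource
import sys

class MemoryUsageLogger(object):
    """Logs the process's current memory usage along with the call site."""

    def __init__(self, output):
        self.output = output
        self.page_size = resource.getpagesize()

    def current_usage(self):
        # the first field of /proc/self/statm is the total program size, in pages
        with open('/proc/self/statm') as f:
            return int(f.read().split()[0]) * self.page_size

    def log_current_usage(self):
        # inspect.stack()[1] describes the caller:
        # (frame, filename, lineno, function, code_context, index)
        _, filename, lineno, function, _, _ = inspect.stack()[1]
        self.output.write("%d\t%s:%d (%s)\t%s\n" % (
            self.current_usage(), filename, lineno, function,
            datetime.datetime.now()))

# one global instance, usable from anywhere in the program
global_logger = MemoryUsageLogger(sys.stdout)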

Continue reading ‘Logging memory usage in Python’ »

Image stitching

As I said in the previous post, the first assignment in that computer vision course I took was to write an image stitching program. The basic idea is to take a series of pictures, rotating the camera around its own position between shots. Then you find how each picture “maps” onto the others and “stitch” them all into a single coherent panorama.

Continue reading ‘Image stitching’ »

Computer vision project: overlaying 3D reconstruction from webcam on the original scene

I’ve been taking a few graduate courses at Université de Montréal in the past two years, but the last one, the computer vision course, was by far the one with the most “showable” (mighty Google says that’s a word) projects. I’ll be posting later on the first project, a basic image stitching app. For now, I’ll just leave this here:

It’s an example result from my term project.

Continue reading ‘Computer vision project: overlaying 3D reconstruction from webcam on the original scene’ »