Archive for September 2012

Code for a simple PostgreSQL plugin for StarCluster on Amazon EC2

I’m exploring using Amazon EC2 as an alternative to dedicated compute clusters for running machine learning experiments. I typically use Jobman for hyperparameter exploration (ie. trying to find the best parameters for statistical models), and in a distributed setting it requires a PostgreSQL database.

To automate cluster launching on EC2, StarCluster is very handy. However they only have a MySQL plugin, no PostgreSQL one, and mighty Google didn’t give anything useful for “starcluster postgresql”. So at first I thought of writing some bash script, but StarCluster’s plugin system was very easy to get started with, so I wrote such a plugin. The code is down there; you just need to put it in a file called “postgresql_plugin.py” under your $HOME/.starcluster/plugins directory. Here’s a raw file with the code to avoid copy/pasting.

The plugin automates the procedure described here. It simply sets up each node with the postgresql-client package, and the master node with a new DB and user.

Once you’ve copied the file, you simply need to add a [plugin] section to your .starcluster/config file, like so:

[plugin postgresql_plugin]
SETUP_CLASS = postgresql_plugin.PostgresqlPlugin
db_name = mydb
db_user = myuser
db_password = mypassword

Please note that I’m certainly no PostgreSQL/StarCluster administration guru, but I think the setup should be enough to get you going for basic needs such as mine, and the firewall rules should be enough to limit access to nodes from your StarCluster-launched cluster (hence I’m not too worried about the plaintext password issue, but if you know better please leave a comment below).

from starcluster.clustersetup import ClusterSetup
from starcluster.logger import log

class PostgresqlPlugin(ClusterSetup):
    def __init__(self, db_name, db_user, db_password):
        self.db_name = db_name
        self.db_user = db_user
        self.db_password = db_password

    def run(self, nodes, master, user, user_shell, volumes):
        master_packages = "postgresql postgresql-client"

        # master package install
        log.info("Installing "+master_packages+" on master, and setting up psql user and DB.")

        # configure role and DB
        master.ssh.execute('apt-get -y install %s' % master_packages)
        master.ssh.execute('sudo -u postgres psql -c "CREATE ROLE %s WITH LOGIN ENCRYPTED PASSWORD \'%s\';"' % \
                                    (self.db_user, self.db_password))
        master.ssh.execute('sudo -u postgres createdb -O %s %s' % (self.db_user, self.db_name))

        # now from master you should be able to connect with
        # psql -h 127.0.0.1 --password mydb myuser

        # configure firewall rules in pg_hba.conf and pos
        line_pattern = "host\tall\tall\t%s/32\tmd5"
        lines = [line_pattern % (master.private_ip_address,)]
        for node in nodes:
            if node.private_ip_address == master.private_ip_address:
                continue
            lines.append(line_pattern % (node.private_ip_address,))

        for l in lines:
            log.info("Adding this line to pg_hba.conf: %s" % l)
            master.ssh.execute('sudo -u postgres bash -c "echo \\"%s\\" >> /etc/postgresql/*/main/pg_hba.conf"' % l)

        # accept connections on all network interfaces
        log.info("Adding listen_addresses = '*' to postgres.conf")
        master.ssh.execute('sudo -u postgres bash -c "echo \\"listen_addresses = \'*\'\\" >> /etc/postgresql/*/main/postgresql.conf"')

        # nodes only need to have the psql client installed, no special config
        # needed (except for firewall rules on master)
        node_packages = "postgresql-client"

        # TODO: do these installs in parallel
        for node in nodes:
            if node.private_ip_address == master.private_ip_address:
                continue
            log.info("Installing %s on %s" % (node_packages, node.alias))
            node.ssh.execute('sudo apt-get -y install %s' % node_packages)

        master.ssh.execute('sudo /etc/init.d/postgresql restart')