I’m exploring using Amazon EC2 as an alternative to dedicated compute clusters for running machine learning experiments. I typically use Jobman for hyperparameter exploration (ie. trying to find the best parameters for statistical models), and in a distributed setting it requires a PostgreSQL database.
To automate cluster launching on EC2, StarCluster is very handy. However they only have a MySQL plugin, no PostgreSQL one, and mighty Google didn’t give anything useful for “starcluster postgresql”. So at first I thought of writing some bash script, but StarCluster’s plugin system was very easy to get started with, so I wrote such a plugin. The code is down there; you just need to put it in a file called “postgresql_plugin.py” under your $HOME/.starcluster/plugins directory. Here’s a raw file with the code to avoid copy/pasting.
The plugin automates the procedure described here. It simply sets up each node with the postgresql-client package, and the master node with a new DB and user.
Once you’ve copied the file, you simply need to add a [plugin] section to your .starcluster/config file, like so:
[plugin postgresql_plugin] SETUP_CLASS = postgresql_plugin.PostgresqlPlugin db_name = mydb db_user = myuser db_password = mypassword
Please note that I’m certainly no PostgreSQL/StarCluster administration guru, but I think the setup should be enough to get you going for basic needs such as mine, and the firewall rules should be enough to limit access to nodes from your StarCluster-launched cluster (hence I’m not too worried about the plaintext password issue, but if you know better please leave a comment below).
from starcluster.clustersetup import ClusterSetup from starcluster.logger import log class PostgresqlPlugin(ClusterSetup): def __init__(self, db_name, db_user, db_password): self.db_name = db_name self.db_user = db_user self.db_password = db_password def run(self, nodes, master, user, user_shell, volumes): master_packages = "postgresql postgresql-client" # master package install log.info("Installing "+master_packages+" on master, and setting up psql user and DB.") # configure role and DB master.ssh.execute('apt-get -y install %s' % master_packages) master.ssh.execute('sudo -u postgres psql -c "CREATE ROLE %s WITH LOGIN ENCRYPTED PASSWORD \'%s\';"' % \ (self.db_user, self.db_password)) master.ssh.execute('sudo -u postgres createdb -O %s %s' % (self.db_user, self.db_name)) # now from master you should be able to connect with # psql -h 127.0.0.1 --password mydb myuser # configure firewall rules in pg_hba.conf and pos line_pattern = "host\tall\tall\t%s/32\tmd5" lines = [line_pattern % (master.private_ip_address,)] for node in nodes: if node.private_ip_address == master.private_ip_address: continue lines.append(line_pattern % (node.private_ip_address,)) for l in lines: log.info("Adding this line to pg_hba.conf: %s" % l) master.ssh.execute('sudo -u postgres bash -c "echo \\"%s\\" >> /etc/postgresql/*/main/pg_hba.conf"' % l) # accept connections on all network interfaces log.info("Adding listen_addresses = '*' to postgres.conf") master.ssh.execute('sudo -u postgres bash -c "echo \\"listen_addresses = \'*\'\\" >> /etc/postgresql/*/main/postgresql.conf"') # nodes only need to have the psql client installed, no special config # needed (except for firewall rules on master) node_packages = "postgresql-client" # TODO: do these installs in parallel for node in nodes: if node.private_ip_address == master.private_ip_address: continue log.info("Installing %s on %s" % (node_packages, node.alias)) node.ssh.execute('sudo apt-get -y install %s' % node_packages) master.ssh.execute('sudo /etc/init.d/postgresql restart')