July 9, 2011

Ruby on Rails 3, Octopus, and Sharding with Groups

The Ruby on Rails app I'm building from the ground up needs to be able to handle *lots* of data. From the get-go I've known I'd need to put some kind of data partitioning in place to make this thing scale, but I only got around to tackling it this week.

Rails doesn't come with a built-in ability to connect to multiple databases, but lots of folks have tackled it. Ryan Tomayko's post Rails and Scaling with Multiple Databases (written way back in 2007) is a good indicator that there are folks doing serious business with Rails and multiple databases, and it has some good links to other sources.

I was looking at writing something myself, but in digging around I came across Octopus a few times and realized it does exactly what we need: data sharding with the ability to programmatically select databases in controllers and models, as well as good migration handling.

For the application we'll have a master database which centralizes account, user, and session information, along with a pointer to the appropriate data shard. Once a user request has been authenticated, the actual application data gets pulled from the shard they've been assigned to. The master database has a simple schema for user, account, and session information. The schema on the shards is significantly more complex, but it is the same on every shard.
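As a rough sketch of that lookup, here is the idea in plain Ruby with no Rails or Octopus involved. The master database is modeled as an in-memory hash, and the names Account, MASTER_ACCOUNTS, and shard_for are all hypothetical, made up for illustration only.

```ruby
# Hypothetical stand-in for a row in the master database's accounts table.
# The data_shard field is the "pointer" to the shard holding the account's data.
Account = Struct.new(:id, :name, :data_shard)

# In-memory stand-in for the master database (assumption: real code would
# query the master database instead of a hash).
MASTER_ACCOUNTS = {
  1 => Account.new(1, "acme", :data_1),
  2 => Account.new(2, "globex", :data_2)
}.freeze

# Returns the shard name assigned to an account, or raises if the account
# is unknown.
def shard_for(account_id)
  account = MASTER_ACCOUNTS.fetch(account_id) do
    raise ArgumentError, "unknown account #{account_id}"
  end
  account.data_shard
end
```

With this in place, shard_for(1) hands back the shard name to use for the rest of the request, and an unknown account fails loudly instead of silently falling through to some default shard.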

The Octopus documentation exists, but it isn't extensive. Through some trial and error I found that the following shards.yml configuration file was what I needed (the Octopus documentation doesn't spell out how to use shard groups in a Rails environment, but it makes sense once you get it working). The master shard for the environment is defined in database.yml.

octopus:
  environments:
    - development
    - test
    - production
  development:
    data_shards:
      data_1:
        adapter: postgresql
        database: dev_data_1
        username: something
        password: cryptic
        template: template0
      data_2:
        adapter: postgresql
        database: dev_data_2
        username: something
        password: cryptic
        template: template0
...
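One way to sanity-check a file like this outside of Rails is to parse it with Ruby's standard YAML library and list which shards fall under each group. This is only a sketch: SHARDS_YML is a trimmed copy of the configuration above, shard_groups is a hypothetical helper, and it assumes every entry under the environment is a group, as in this file.

```ruby
require "yaml"

# Trimmed copy of the configuration above (assumption: only the development
# environment and the fields needed for this check are included).
SHARDS_YML = <<~YAML
  octopus:
    environments:
      - development
    development:
      data_shards:
        data_1:
          adapter: postgresql
          database: dev_data_1
        data_2:
          adapter: postgresql
          database: dev_data_2
YAML

# Hypothetical helper: maps each group name to the shard names nested under it.
# Assumes every entry under the environment is a group of shards.
def shard_groups(yaml_text, env)
  config = YAML.safe_load(yaml_text).fetch("octopus").fetch(env)
  config.each_with_object({}) do |(group, shards), groups|
    groups[group] = shards.keys
  end
end
```

Calling shard_groups(SHARDS_YML, "development") returns a hash with a single data_shards key whose value lists data_1 and data_2, which is a quick way to confirm the nesting is what you intended.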

It's obviously paramount that the models know where to connect for their data. When a user authenticates, I invoke ActiveRecord::Base.establish_connection() to connect to their shard, so that it's the default data source everywhere in the application for that request. Requests that need to interact with the master database are handled in the controller:

  # Wrap every action in this controller so its queries run against the master database.
  around_filter :select_shard

  def select_shard(&block)
    Octopus.using(:master, &block)
  end
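To make the block-scoped behavior concrete, here is a plain-Ruby illustration of the general pattern: switch the current shard for the duration of a block, then restore the previous one. This is not Octopus's implementation. ShardRouter and its methods are hypothetical names invented for this sketch.

```ruby
# Hypothetical stand-in for a connection router; this is NOT Octopus's code.
class ShardRouter
  attr_reader :current_shard

  def initialize(default_shard)
    @current_shard = default_shard
  end

  # Switches to the given shard for the duration of the block, then restores
  # whatever shard was active before, even if the block raises.
  def using(shard)
    previous = @current_shard
    @current_shard = shard
    yield
  ensure
    @current_shard = previous
  end
end

router = ShardRouter.new(:data_1)
# Inside the block the router points at :master.
shard_inside_block = router.using(:master) { router.current_shard }
# Once the block returns, the router is back on :data_1.
shard_after_block = router.current_shard
```

The ensure clause is the important design choice here: without it, an exception raised inside the block would leave the rest of the request pointed at the wrong database.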

The other important piece of this is migrations, and Octopus works well for our needs here. Each migration we write either needs to update the single master database or be applied to all data shards. This is where shard groups come in. To apply a migration to all shards, we specify the group with the using_group() method, and the migration then runs on each shard in the specified group.

class CreateInventoryEvents < ActiveRecord::Migration
  using_group(:data_shards)
  ...
end

Pretty slick. I'm happy with Octopus from a functionality perspective. I'm not sure yet about scalability or the overhead of moving requests between database connections, so I'll dig into that next.

Posted by mike at July 9, 2011 7:27 AM