Archives For December 2012

What if money were no object?

December 18, 2012

I saw this really thoughtful YouTube clip a while back, and wanted to record it here, so I can refer people to it in the future. If you haven’t seen it yet, take 3 minutes and watch it. It makes me feel very fortunate, I love writing software.

As you may know, we run our OrderPipe ecommerce dashboard on Google’s App Engine – I’m a big fan of the platform, but it has some traps for new (and old) players. A recent issue required us to bulk delete entities on App Engine, and for that it seemed the best tool for the job was the ‘Mapper’ part of Map Reduce. It was a good experience learning a) how it works and b) applying it, I thought I’d document a full Map Reduce example, because the worked examples I found were all based on an older version of the library.

Background

You might be interested to read a bit about why I had to use Map Reduce – it was my own fault. A subtle bug in a recent release caused an additional Account to be created under certain situations – it made it past our staging server, because that certain situation happened infrequently enough to not cause an issue, but once it got to production, where we have many thousands of orders arriving daily, we started seeing a lot of zombie accounts being created. My first thought was we were being attacked, but alas, it wasn’t an attack, just a bug! The nett result, we were left with 10’s of thousands of unwanted entities.

The Problem

In App Engine, the datastore is highly scalable, but it’s not relational, you can’t just run a simple query to delete all rows in a table with a particular created timestamp. There are simple tools to blindly delete all entities of a type, but not entities that meet a particular condition, and more importantly, not related entities that also meet a particular condition. That’s where Map Reduce comes in, basically we shard the entities and create multiple parallel workers to break the collection into smaller parts and process through them quickly. This seems like an inefficient approach until you consider scaling beyond a single database server, or even a single database cluster.
Continue Reading…