As you may know, we run our OrderPipe ecommerce dashboard on Google’s App Engine. I’m a big fan of the platform, but it has some traps for new (and old) players. A recent issue required us to bulk delete entities on App Engine, and the best tool for the job seemed to be the ‘Mapper’ part of Map Reduce. It was a good experience learning a) how it works and b) how to apply it, so I thought I’d document a full Map Reduce example, because the worked examples I found were all based on an older version of the library.
You might be interested to read a bit about why I had to use Map Reduce: it was my own fault. A subtle bug in a recent release caused an additional Account to be created in certain situations. It made it past our staging server because that situation happened too infrequently there to cause an issue, but once it got to production, where we have many thousands of orders arriving daily, we started seeing a lot of zombie accounts being created. My first thought was that we were being attacked, but alas, it wasn’t an attack, just a bug! The net result: we were left with tens of thousands of unwanted entities.
In App Engine, the datastore is highly scalable, but it’s not relational: you can’t just run a simple query to delete all rows in a table with a particular created timestamp. There are simple tools to blindly delete all entities of a type, but not entities that meet a particular condition and, more importantly, not related entities that also meet a particular condition. That’s where Map Reduce comes in. Basically, we shard the entities and create multiple parallel workers, so the collection is broken into smaller parts that can be processed quickly. This seems like an inefficient approach until you consider scaling beyond a single database server, or even a single database cluster.
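To make the shard-and-map idea concrete before getting into the library itself, here is a minimal plain-Python sketch. It is not the App Engine Map Reduce API: the in-memory `store` dict stands in for the datastore, and the names `make_store`, `shard_keys`, `map_shard` and `bulk_delete`, as well as the dates, are all made up for illustration. It only shows the shape of the approach: split the keys into shards, let parallel workers check each entity against a condition, then delete the matches.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime


def make_store():
    """Build a hypothetical in-memory stand-in for the datastore."""
    store = {}
    for i in range(10):
        # Illustrative data: even ids were created in the "bad" window.
        created = datetime(2013, 1, 2) if i % 2 == 0 else datetime(2012, 6, 1)
        store[i] = {"kind": "Account", "created": created}
    return store


def shard_keys(keys, shard_count):
    """Split the keys round-robin into shard_count roughly equal shards."""
    keys = sorted(keys)
    return [keys[i::shard_count] for i in range(shard_count)]


def map_shard(store, shard, start, end):
    """Mapper: visit every entity in one shard, return the keys to delete."""
    return [k for k in shard if start <= store[k]["created"] < end]


def bulk_delete(store, start, end, shard_count=4):
    """Run one mapper per shard in parallel, then delete what they flagged."""
    shards = shard_keys(store.keys(), shard_count)
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        results = pool.map(lambda s: map_shard(store, s, start, end), shards)
        doomed = [k for keys in results for k in keys]
    # Deletion happens after all mappers finish, so no worker sees a
    # store that is changing underneath it.
    for k in doomed:
        del store[k]
    return sorted(doomed)


if __name__ == "__main__":
    store = make_store()
    deleted = bulk_delete(store, datetime(2013, 1, 1), datetime(2013, 1, 3))
    print("deleted keys:", deleted)
    print("remaining keys:", sorted(store))
```

In the real thing the sharding and the workers are handled for you by the Map Reduce library and run across many instances; here threads on one machine play that role, and the per-entity condition is the only part you would actually write yourself.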