Bryan Whitehead

How Obama Helped Digg Fix Bugs

The election of Barack Obama as the next President of the United States is a historic moment. Many people shared it with the Digg community through a furious level of digging and commenting. In fact, election night generated the most traffic and activity in the history of Digg.

At about 8pm PST, most TV networks called the election for Obama. Some interesting stats for the 8pm hour (PST):
Submitting was 108% of normal.
Digging was 202% of normal.
Burying was 137% of normal.
Commenting was 278% of normal.
Comment Digging was 619% of normal.
Comment Burying was 689% of normal.

Note the resulting traffic on one of our DB chains:
[Graphs: Load, CPU]

Wow. Talk about making the DBAs sweat. If you were on Digg at this time, you might have noticed a couple of annoying issues. Logging in wasn’t pleasant. Page load times were longer than usual. Digging was… a bit… unresponsive, and submitting a new story may have been prone to failure.

If you recall the previous blog post from our lead DBA, timeless, you’ll remember we have a concept of ‘selectors’, where developers select a pool and a purpose for their DB-related code. An example selector might be “main_write”. This would give a handle to the main database chain for the purpose of updating or inserting a row. A selector like “main_read” would give a handle to one of the many slaves of our main master DB. Transactions going to “*_write” are supposed to be very quick and cheap. Our write masters do not run the dbmon.pl software to kill off connections – that is reserved for the massive pool of read slaves.
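
To make the selector idea concrete, here is a minimal sketch in Python. It is not Digg's actual code: the pool layout, the host names, and the `resolve_selector` function are all hypothetical, and the only thing taken from the description above is the "pool_purpose" naming, where "write" maps to the master and "read" maps to one of the slaves.

```python
import random

# Hypothetical pool layout; these host names are made up for illustration.
POOLS = {
    "main": {
        "master": "main-master",
        "slaves": ["main-slave-1", "main-slave-2", "main-slave-3"],
    },
}


def resolve_selector(selector, pools=POOLS, rng=random):
    """Map a selector such as 'main_write' or 'main_read' to a host name."""
    # Split on the last underscore so pool names may contain underscores.
    pool_name, sep, purpose = selector.rpartition("_")
    if not sep or pool_name not in pools:
        raise ValueError(f"unknown selector: {selector!r}")
    pool = pools[pool_name]
    if purpose == "write":
        # Writes always go to the single master of the chain.
        return pool["master"]
    if purpose == "read":
        # Reads are spread across the pool of slaves.
        return rng.choice(pool["slaves"])
    raise ValueError(f"unknown purpose in selector: {selector!r}")
```

In this sketch, `resolve_selector("main_write")` always returns the master, while `resolve_selector("main_read")` returns any one of the slaves.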

Now, here’s the bug. A small number of our queries on “*_write” are not writes. They are reads, for a very, very small subset of queries that need the absolute latest information. Any slave lag might cause weird problems with how we keep certain elements (like new user handles) unique. Unfortunately, a bug in our code that does this ‘quick check’ was generating 6 or 7 thousand quick checks… Multiply this by the huge amount of traffic Obama generated, and our “main_write” selector ran out of connections:
[Graph: Connections]

While the graphs show problems, they are only clues that an investigation needs to start. For the above, I parsed all the queries from the hour before as a baseline measurement of ‘normal traffic’, and then took a look at the 8pm hour. Most of the work is in figuring out ways to classify and group similar queries together to pinpoint anomalies. Often, after grouping everything together, it is quite obvious that a certain class of queries is causing problems. In this case, we were seeing a disproportionately high number of queries of one class in relation to the other classes. Once the issue had been isolated, I sent the results via email to one of our software engineers, Kurt, who looked at the anomaly and tracked down the bug, which had been introduced some time ago.
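
The grouping step described above can be sketched roughly as follows. This is an illustration in Python, not the script actually used. The `fingerprint` normalization (replacing literals with `?`), the `min_ratio` threshold, and the function names are all assumptions, and any table or column names you feed it are your own.

```python
import re
from collections import Counter

# Patterns used to strip literal values so similar queries group together.
_STRING = re.compile(r"'(?:[^'\\]|\\.)*'")
_NUMBER = re.compile(r"\b\d+\b")
_SPACE = re.compile(r"\s+")


def fingerprint(query):
    """Reduce a query to its class by replacing literals with '?'."""
    q = _STRING.sub("?", query)
    q = _NUMBER.sub("?", q)
    return _SPACE.sub(" ", q).strip().lower()


def class_counts(queries):
    """Count how many queries fall into each class."""
    return Counter(fingerprint(q) for q in queries)


def anomalies(baseline, current, min_ratio=3.0):
    """Return classes whose count grew by at least min_ratio over baseline.

    Each result is (class, baseline_count, current_count, ratio),
    sorted with the largest growth first.
    """
    base = class_counts(baseline)
    cur = class_counts(current)
    found = []
    for cls, count in cur.items():
        base_count = base.get(cls, 0)
        # Treat classes missing from the baseline as having one query,
        # which avoids division by zero.
        ratio = count / max(base_count, 1)
        if ratio >= min_ratio:
            found.append((cls, base_count, count, ratio))
    return sorted(found, key=lambda item: item[3], reverse=True)
```

Feeding it the queries from the hour before as `baseline` and the queries from the 8pm hour as `current` would list the classes that grew the most. A class that grows far faster than overall traffic is the kind of anomaly worth handing to an engineer.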

After pushing the fix, we have seen much more consistent, normal traffic:
[Graphs: Questions, Queries]

Thanks to Obama, the many people who took interest in the elections, and our digg community, we have been able to fix the bug. We’re always grateful for your participation and feedback on Digg. It’s incredibly helpful. So, keep it coming and stay tuned as more improvements are on the way.

Digg on.

Driver