plathrop

Master of Puppets #1: ralsh

Allow me to introduce myself. I’m Paul, Sr. Systems Engineer and resident Puppetmaster for Digg. “Puppetmaster?” I hear you say, “What do puppets have to do with Digg?” Well, that’s what I thought I’d write about today.

To answer the question, it isn’t “puppets”, it’s Puppet, the open-source configuration management tool, and a relatively new addition to the Operations Secret Sauce here at Digg. Dr. Timeless and Joe have already given you a good overview of the complex architecture that makes Digg possible; in my posts I’m going to go into detail on how we manage some of the components of that architecture using Puppet (and touching on other tools along the way).

First off, let me answer a couple of questions that I’ve had to answer in the past. Why configuration management? Why Puppet, specifically? Configuration management buys you a number of benefits, and other people have written about those benefits more extensively, and more eloquently, than I can. Suffice it to say the Ops team at Digg believes configuration management is an incredibly powerful tool for managing complexity in a large-scale architecture (which, by complete coincidence, is exactly what we have!)

We chose Puppet after evaluating a number of other options (cfengine, and bcfg2 among others). Like many people, we thought that Puppet, cfengine, and bcfg2 were the top contenders in this space. Cfengine is, in a way, the grand-daddy of open-source configuration management tools. Unfortunately, certain design decisions and philosophical stances it assumes leave it somewhat behind the curve. Bcfg2 showed a lot of promise, but in the end it lost out due to a few key factors. First, there was a lack of “Bundles” supporting the things we needed to manage; Puppet had “types” which more closely matched our needs. Second, the philosophy of Bcfg2 is a “total management” philosophy; you can’t deploy a Bcfg2 configuration that only manages a small portion of a machine’s configuration. For our needs, the ability to implement configuration management incrementally was very important, and Puppet gave us that flexibility. Third, the overall design philosophy of Puppet makes a lot of sense to us. Last, and least importantly, I had more experience with Puppet that I could draw on as we rolled it out.

For those of you that are unfamiliar, Puppet is a multi-tentacled beast project with several components: a declarative language for describing system configuration, a standalone parser for that language, a resource abstraction layer for providing platform-agnostic manipulation of the resources described by the language, and a client-server daemon for distributing and applying configurations described by the language. (Phew, what a mouthful!) The foundation of Puppet’s power is the resource abstraction layer, which I will talk about below. The rest of the Puppet components (and how we use them) will have to wait for another post.

Puppet’s resource abstraction layer (the RAL, to save me some typing) allows us to think of various aspects of system configuration as “resources”. Puppet comes with a tool called ralsh which allows you to interact directly with the underlying RAL; you can use it to list resources or manipulate them in the same way Puppet does. “But what exactly is a resource!?” I hear you cry. A resource is a discrete component of system configuration. Some examples are: cron jobs, users, host file entries, mount points… Puppet’s RAL knows how to manage a wide variety of resources natively; it is also fairly easy to extend it to manage resources it doesn’t already understand.

ralsh is an incredibly powerful tool, and decidedly underused in the Puppet community. Let’s pretend you’ve inherited a decidedly non-homogeneous architecture, filled with various flavors of Linux and *BSD. Want to add a user? Just remember that on linux you can use adduser, or useradd, but on FreeBSD you probably want pw, and who knows how many different command-line switches you need to remember, or maybe just consult the man pages every time… Or, use ralsh. It provides an interface that works on every platform Puppet works on. The interface is the same on every platform. Want to know what users are defined? ralsh user lists them. Just want to know about this guy named baduser? ralsh user baduser will give you the dirt. Want to add a user? Watch this:

~$ ralsh user newuser uid=9999 gid=9999 home=/home/newuser shell=/bin/bash ensure=present
notice: /User[newuser]/ensure: created
user { ‘newuser’:
uid => ‘9999′,
gid => ‘9999′,
home => ‘/home/newuser’,
shell => ‘/bin/bash’,
password => ‘!’,
ensure => ‘present’
}

This doesn’t get really cool until you realize you can use the exact same tool and syntax to manage any type of resource that Puppet understands. Check it out:

~$ ralsh package xmlstarlet ensure=present
package { 'xmlstarlet':
ensure => '1.0.1-2'
}


~$ ralsh cron check_my_mail command="/usr/bin/fetchmail mail.my.mail.server" user=plathrop hour='*' minute='*/30' ensure=present
notice: /Cron[check_my_mail]/ensure: created
cron { ‘check_my_mail’:
command => ‘/usr/bin/fetchmail mail.my.mail.server’,
monthday => ‘absent’,
hour => ['*'],
environment => ”,
target => ‘plathrop’,
special => ”,
minute => ['*/30'],
user => ‘plathrop’,
ensure => ‘present’,
weekday => ‘absent’,
month => ‘absent’
}

This is totally cool! Suddenly I don’t have to care what platform I’m on, and I can think of these things in an abstract, encapsulated manner. I can choose platforms based off of what they are good at instead of requiring homogeneity in order to minimize the costs of management. Like OpenBSD’s firewall, but you have a Debian network? By all means, throw an OpenBSD box in there; you can use the same commands to manage both. Want the performance of a FreeBSD network stack for a certain application? Go for it, you already have the tools you need to administer it.

Not only that, but if you are building a Puppet infrastructure, you can use ralsh to explore your existing configuration; the output you see above is valid Puppet code, and can be used in a Puppet configuration as-is!

Next time I’ll talk more about the higher-level components of Puppet; the RAL, though awesome, is just the foundation. The Puppet language allows us to treat configurations as code; we can use the same techniques software engineers use to manage their codebases to manage our systems configurations. The Puppet client/server allows us to apply consistent configurations across a cluster of machines, and manage the entire life-cycle of each system, from deployment to retirement. Combined with a flexible and robust automated deployment system like Debian FAI, Puppet can help you drastically reduce the intervention required to bring a machine up from bare-metal to production; saving both time and money as well as giving you a chance to focus on more important issues.

See you next time.

–Paul

Joe Stump

The codes are a-changin’!

Hi all -

Later today Digg will be undergoing some significant updates, but upon first glance, it may seem that nothing has changed. That’s because the changes included in the coming release are 100% under the hood. For the last few months, Digg’s developers have been heads down scrubbing, cleaning, porting and improving the core code of Digg. Before I get into the details, I want to give a big thanks to everyone on Digg’s development team. They’ve done a great job with what, technically, is one of the largest pushes Digg’s released.

Digg’s development team has grown tremendously over the last few years to over twenty developers (we’re hiring!). At the same time, Digg’s code has been maturing from a monolithic ad-hoc approach to a set of consistent frameworks. Our goal was to standardize the way applications at Digg are written, managed, and deployed. Sounds fun, but how did we do it?

- During the profile redesign, we also rewrote our login and registration system using an internal event driven framework we call App. It’s a small, simple framework for quickly creating applications. App implements the ‘V’ and ‘C’ in ‘MVC’ along with input sanitization and authentication. Basically, it allows Digg developers to rapidly develop applications without worrying about the basics. As of this latest push we’ve ported the majority of Digg’s applications into App. Every developer at Digg has had a hand in porting thousands and thousands of lines of code to App.

- We also have another framework, called AJAX (I’m not very creative at naming my frameworks), that manages all of our AJAX endpoints (the PHP code that processes and responds to AJAX calls from jQuery). AJAX allows our developers to create a simple PHP class to process requests while the AJAX framework handles JSON and AHAH encoding, token checking, authentication, XSS/CSRF checks, input sanitization, HTTP error codes, exceptions, etc. Before this push about 70% of our endpoints were in the AJAX framework. As of this forthcoming push we’ve managed to port 100% of our endpoints to AJAX.

- One of the major problems at Digg, with regards to development processes, was that our code was monolithic. With three developers this isn’t a problem, but with 20+ it can get hairy. It can make merging difficult, interdependencies impossible, and make it impossible to promote/enforce ownership within SVN. In response to this, all of our core frameworks, applications, etc. are now PEAR packages. For those who don’t know, PEAR is a package management system for PHP, along with being a world class repository of PHP libraries. This allows us to break our code up into dozens of smaller projects which are owned by individuals and teams. PEAR also allows teams to define specific PEAR, PHP and PECL code that it depends on. What this means is that App_Login can enforce dependencies on App, Message, Mail, etc. and deployed separately from App_PermaLink or App_StoryList.

- Since our VP of Engineering John Quinn has joined the team, he’s been spoon feeding us the Agile pixie dust. Part of the Agile philosophy is unit testing, and the result is that all code at Digg must now have unit tests which are fed into our continuous integration environment. Our CI environment automagically checks out code every ten minutes, runs unit tests and verifies the code conforms to our coding standards, and then emails developers of any problems.

So, despite the appearance of nothing changing, big changes have been afoot in Digg’s development department. Tens of thousands of lines of code have been rewritten, massaged, ported, unit tested and modularized. But, why? So we can continue to develop quality new features at a quick pace even as we continue to grow our development team.

Thanks!

- Joe

timeless

Digg Database Architecture

Welcome back to the Digg technology blog, where you get to read about what the tech people at Digg are thinking about. Let’s get right into it, shall we?

How Many Databases does Digg Have?

As Joe hinted in his earlier blog entry, the particulars of how many machines Digg has is one of the most often asked questions, and yet one of the least relevant to Digg employees. None of us directly responsible for putting machines into our production cluster bothers to know the answer to this question, including myself.

The simple answer, which seems flippant, is that we have enough to do what Digg does. But I suppose you deserve a more detailed answer. So let’s go into it a little, shall we?

We have about 1.8x to 2.5x the theoretical minimum number of machines required to run Digg. Operations mandates that we want 2x the server capacity required to run Digg, so this makes sense. But why the spread? Let’s go into some of the reasons as they pertain to the databases.

Load deltas can be caused by errors or brilliance by Digg employees, such as bugs or fixes from the developers or improved indexing or bad I/O subsystem layout by the DB team, for example. Although it’s fair to say improvements far outnumber the mistakes, they both exist and both must be dealt with (feel free to ponder about what various ways performance improvements must be “dealt with”).

Sometimes Digg performance is changed when we enable or disable a particular costly feature of Digg. The DB team must be prepared for the human element of performance. Sometimes a feature is considered valuable enough to keep even though it causes what may appear to be an undue strain on the databases. At some time in the future, the feature may become so system-intensive that the hard decision is made to cut it. Until that point, the systems can be “overloaded,” and more importantly, once the feature is disabled, the systems become underutilised.

Pool utilisation can also be changed when the Digg members, as a social group, stop using a feature, or start using another that Digg itself made no change to.

All such factors contribute to the number of machines deployed in our production cluster at a given moment being either more or fewer than we desire, but if we’re careful, there are always more than the theoretical minimum required to run Digg, even during spikes in load.

Database Pools

At the highest level, you can think of the Digg databases as a four-master set of clusters. We shall call them A, B, C, and D. Two of the masters (masters A and B) are masters only, and two (masters C and D) are slaves of one of the other masters (A).

Each cluster has several slaves in it. I shall call a grouping of slaves a “pool.” At the highest level, you can think of all the slaves in a particular pool of masters B through D as equivalent, but the slaves in the pool underneath master A are special. This is for historical reasons. So let’s dig into history, shall we?

The original Digg database was designed as a single monolithic DB, and additional capacity was created by adding slaves. As Digg grew, we added more database clusters (B, C, D) into the mix. This is your classical scaling via distribution of writes, but the original database cluster (A) remained with most of the read/write throughput. Scaling it has proven to require a bit more ingenuity as its slaves have always historically had the most disk contention.

At first, to get more cache hits on a slave in A’s DB pool while still keeping all tables on each slave, they were split into subpools, call them A_alpha and A_beta. All queries sent to the slaves of A were given descriptors, and the database access layer was given a mapping of descriptors to subpools. Thus queries that mostly hit only tables M and N could be sent to A_alpha, and those that hit mostly I and J could be sent to A_beta. Hence the index and data pages for M, N, I, and J would most likely be in RAM on their respective database slaves.

This worked rather well for some time until the write load on A_[alpha|beta] became too intense, and further optimisations were required. These include dropping indexes and tables that aren’t needed in their respective subpools.

If you’re a MySQL DBA running MySQL 5.0 or lower, you know that there isn’t a simple report that MySQL will generate that shows you a list of indexes or tables that are used in a database. The assumption is that if your company created a table or index, that it will always be in use.

The radical changes to Digg’s front-end architecture over the past several years means that isn’t true. We know there are tables and indexes we don’t use in our A_alpha and A_beta subpools. So to determine what could be dropped from these two subpools, we analysed the dbmon.pl output (dbmon.pl is covered in some detail in the section on Database Overload, below) using getServerIndexes.pl. Basically it does an EXPLAIN on every query on the slave to generate a list of tables and indexes used (special note, I didn’t use the word “exhaustive” in the preceding sentence). If you use this tool, be sure you read the caveats in the script comments! As with everything high volume, nothing is ever simple.

Database Access Layer

The Digg database access layer is written in PHP and lives at the level of the application server (Apache). Basically, when the application decides it needs to do a read query, it passes off the query with a descriptor to a method that grabs a list of servers for the database pool that can satisfy the query, then picks one at random, submits the query, and returns the results to the calling method.

If the server picked won’t respond in a very small amount of time, the code moves on to the next server in the list. So if MySQL is down on a database machine in one of the pools, the end-user of Digg doesn’t notice. This code is extremely robust and well-tested. We worry neither that shutting down MySQL on a read slave in the Digg cluster, nor a failure in alerting on a DB slave that dies will cause site degradation.

Every few months we consider using a SQL proxy to do this database pooling, failover, and balancing, but it’s a tough sell since the code is simple and works extremely well. Furthermore, the load balancing function itself is spread across all our Apaches. Hence there is no “single point of failure” as there would be if we had a single SQL proxy.

Monitoring and Alerting

Though it is an integral part of our database architecture, I will keep this section a bit short, since it isn’t my specialty. We use Nagios to alert us on predicted failure modes of databases. The most common alerts are slave lag, disk space low, and complete machine death. Slave lag is caused by a number of things, including spikes in system usage, long-running update queries, or intermittent disk failure.

For monitoring, we use a Cacti-like Digg-written tool called MotiRTG. Suffice it to say it resembles Cacti in several ways, but is specialised to Digg’s cluster layout and is more suited to a cluster that has machines entering and leaving every day. It is written and maintained by our Networks and Metrics manager, Mike Newton. It is a strong candidate for open sourcing in the future.

Our alerting and monitoring subsystems are used in the traditional fashion. Alerts are for predicted failure modes, and monitoring is for post-failure analysis and future trending.

Database Overload

One of the most common problems on Digg systems is a spike in load, often caused by large news events like Apple announcements or hurricanes or… well, anything newsworthy. Assuming the spike isn’t taken care of by one of the myriad other spike-limiting features of Digg’s infrastructure, and the spike makes it to Digg’s databases, there are two simple mechanisms to limit the effect.

The first is the aforementioned over deployment of machines in our DB cluster. I estimate that this takes care of  more than 99% of database spikes in load. It may be no exaggeration to say 99.99%; we get several such spikes in database load per minute.

It is possible that a combination of adverse conditions will contribute to a spike that causes a particular segment of our DB resources to become 100% utilised. Under such conditions, it is not acceptable that Digg go down entirely. Hence the second mechanism, a tool called “dbmon.pl” (which you can download the source code for), is used. It’s a daemon that watches the MySQL instance for queries that have been running longer than a time limit set on the commandline, and kills queries that take longer than that.

Combined with separate DB subpools for different sorts of functionality on Digg, an overload on the DBs will only affect the portion of Digg serviced by that subpool, and then only a subset of the total requests coming in.

Note that the front-end application must deal with killed queries gracefully, since some of the killed requests will originate from legitimate users. If you use this tool, be sure you test your code in some environment where the kill time is set to a much lower value than you’ll actually use in production, and under stress, to be sure your application doesn’t barf out some nasty stack trace on the user when a connection gets killed.

Be very careful of adding logic where the front-end resubmits the request at no explicit request from the user. I recommend adding no such logic. It can easily negate the advantage of using dbmon.pl. Don’t worry. The user will hit reload. There is no need to DOS your own databases.

See You Next Time

Thanks for reading! I hope you found something useful or interesting here. This is Digg’s lead DBA Timeless signing out. Digg on.

Joe Stump

How Digg works

We often get asked how Digg works from a technology perspective, so wanted to shed some insight on this with our first post from the newly-launched technology blog. We’ll be posting regularly to give you a peek at what’s under the hood from the Digg development teams.

Ask Ron - our Systems Engineering Lead - the exact number of servers we have in production and he’ll probably respond with, “I don’t honestly know.” I can say we’ve got dozens of web servers and dozens more DB servers. I can say with certainty it takes six specialized graph database servers to run the Recommendation Engine and we have another six to ten machines that serve files from MogileFS. But really, the numbers are the least interesting part of the equation. What makes Digg an interesting place to work are what the pieces are and how they fit together.

Overview of Digg's Architecture

An average request to Digg starts at the outer most point in our architecture - the load balancers. We run a few specialized appliances that do a number of things: monitor the application servers, constantly adjust the cluster according to health, balance incoming requests and caching JavaScript, CSS and images. They’re also constantly keeping track of one another so they can take over all traffic if one of them fails. This is often referred to as a “heartbeat”. If you’re looking for a FOSS solution that offers a similar solution we recommend checking out the Linux Virtual Server and Squid.

Once your request has made it through the gauntlet that are the load balancers your request is handed off to an application server. Application servers consist of a number of daemons, including Apache+PHP, Memcached, and Gearman. These servers then work as conductors orchestrating DB connections, MogileFS connections and other service requests. It then munges, parses and processes the result and returns it to your browser.

The databases are broken up across four masters, each of which have lots of slaves dangling off of them. All writes go to the masters and all reads go to the various slaves. All run MySQL. Our databases are pretty vanilla as far as design goes though we probably have a lot more denormalized data than your average database design. Also on these machines are more Memcached and a specialized daemon that monitors connections and kills ones that have been open too long.

The file servers run Danga’s MogileFS, which is an anagram for “OMG Files!”. MogileFS is, essentially, a distributed WebDAV cluster. This serves up all of our story icons and our user icons. We also use it to store copies of each story’s source so Research & Development can study them.

That leaves us with the Recommendation Engine, which is a really fancy term for “distributed graph database”. Traditionally RDBMS’s are not well suited for doing the math that goes into generating a recommendation so Research & Development created a custom one.

So, that’s about it. Digg uses Debian GNU/Linux across the board with a mixture of MySQL, Memcached, MogileFS, Python, PHP, Apache, Gearman and various appliances to serve up billions of requests a month (and more every day!)

Thanks,

Joe