We often get asked how Digg works from a technology perspective, so wanted to shed some insight on this with our first post from the newly-launched technology blog. We’ll be posting regularly to give you a peek at what’s under the hood from the Digg development teams.
Ask Ron – our Systems Engineering Lead – the exact number of servers we have in production and he’ll probably respond with, “I don’t honestly know.” I can say we’ve got dozens of web servers and dozens more DB servers. I can say with certainty it takes six specialized graph database servers to run the Recommendation Engine and we have another six to ten machines that serve files from MogileFS. But really, the numbers are the least interesting part of the equation. What makes Digg an interesting place to work are what the pieces are and how they fit together.
An average request to Digg starts at the outer most point in our architecture – the load balancers. We run a few specialized appliances that do a number of things: monitor the application servers, constantly adjust the cluster according to health, balance incoming requests and caching JavaScript, CSS and images. They’re also constantly keeping track of one another so they can take over all traffic if one of them fails. This is often referred to as a “heartbeat”. If you’re looking for a FOSS solution that offers a similar solution we recommend checking out the Linux Virtual Server and Squid.
Once your request has made it through the gauntlet that are the load balancers your request is handed off to an application server. Application servers consist of a number of daemons, including Apache+PHP, Memcached, and Gearman. These servers then work as conductors orchestrating DB connections, MogileFS connections and other service requests. It then munges, parses and processes the result and returns it to your browser.
The databases are broken up across four masters, each of which have lots of slaves dangling off of them. All writes go to the masters and all reads go to the various slaves. All run MySQL. Our databases are pretty vanilla as far as design goes though we probably have a lot more denormalized data than your average database design. Also on these machines are more Memcached and a specialized daemon that monitors connections and kills ones that have been open too long.
The file servers run Danga’s MogileFS, which is an anagram for “OMG Files!”. MogileFS is, essentially, a distributed WebDAV cluster. This serves up all of our story icons and our user icons. We also use it to store copies of each story’s source so Research & Development can study them.
That leaves us with the Recommendation Engine, which is a really fancy term for “distributed graph database”. Traditionally RDBMS’s are not well suited for doing the math that goes into generating a recommendation so Research & Development created a custom one.
So, that’s about it. Digg uses Debian GNU/Linux across the board with a mixture of MySQL, Memcached, MogileFS, Python, PHP, Apache, Gearman and various appliances to serve up billions of requests a month (and more every day!)
Thanks,
Joe
Digg Technology Blog