Micah Snyder

DUI.Stream and MXHR

Hola,

Although we usually wait until a feature is fully baked before showing off the technology behind it, we’re just too excited about this new project to keep it to ourselves. Digg’s Front End Architecture team has been experimenting a lot with performance improvements, and we’re just about done with one that we think will change the way you get data from high-traffic websites. First though, a bit of background:

One of the ways that high-performance websites like Yahoo suggest speeding up load times is by reducing the number of HTTP requests per page. We started thinking about what we could do to reduce HTTP overhead, and where we could get the biggest benefits from it. Well, one thing led to another and the next thing we knew we were talking about writing a generalized framework for bundling files, sending them through a single request, then separating them for use once they head down the pipe.

We call this technique MXHR (short for Multipart XMLHttpRequests), and we wrote an addition to our Digg User Interface library called DUI.Stream to implement it. Specifically, DUI.Stream opens and reads multipart HTTP responses piece-by-piece through an XHR, passing each chunk to a JavaScript handler as it loads.

Why do this? Well, DUI.Stream will allow developers to drastically improve the speed of uncached page loads by bundling most of their resources into a single HTTP request, with a single time-to-first-byte and no request throttling by the user agent. Additionally, the size of the response has no effect on the rendering time of each chunk, as the client handles each piece of the response on the fly and can inject it into the DOM for rendering immediately, in the exact order you specify. On a high traffic, high-activity site like Digg, we have to display incredible amounts of data on each permalink — typically hundreds of user images within the first 50 comment threads on a page alone, not to mention the UI chrome and actual comment data. (You can see this for yourself: notice the number of HTTP requests that queue up when you expand a page of comments). So our primary use case for DUI.Stream is turning that first long, arduous page load on an empty cache into something nearly indistinguishable from a page of data with fully cached resources.

Let’s talk a bit about the architectural benefits of implementing MXHRs with DUI.Stream. Back when the web was based largely on a page metaphor (i.e.: one central document with external references), whenever you loaded the page, the page requested its images, stylesheets, etc, then you were done. These days you’re just as often loading an application; the page progressively enhances into a stateful UI by loading extra stylesheets, scripts and a whole mess of UI chrome after the initial request. Yet, we’re still using the old model flow of get markup –> render markup –> request external resources –> load and display externals.

Take our modal login dialog box for example. In order to reduce requests we bundle its JavaScript in with the rest of the page, we put its CSS up in the header with the rest of the styles, then we request only the markup for the dialog box, render it, and let it fire its own HTTP requests for the images that make up its chrome. In this broken model, HTTP connections and rendering behaviors split our UI architecture up into different parts of the page that all render at different times at the browser’s discretion. Even if we put everything into one cohesive structure and loaded the CSS link, script tag and markup together, they’d still all fire their own HTTP requests and the images would still come in afterwards on the first page load. This just won’t do.

Now, let’s rethink how our login dialog could work using DUI.Stream. We can request a Stream that contains everything needed to render and use the dialog box. As each part comes in, it gets passed through to be built, and renders immediately with no image backfill or delayed JS behavior. The DUI.Stream framework can then pass those resources back into cacheable elements for our next page load, which can happily 302 its way quickly through the rendering process. Pretty sweet right? Right.

Before we get into actual code, I have to stress that the DUI.Stream client, with all its aforementioned badassitude, is an alpha release — it’s better than a proof of concept, but it’s definitely not ready for prime time. Think of it like Miles and Hurley’s time travel discussion on Lost:  There are still a few plot holes, but what we’ve seen so far is pretty cool. With that in mind, let’s save the talk about future features for a bit later and get into what DUI.Stream does right now.

Here’s a TLDR example:

var s = new DUI.Stream();
 
s.listen('text/html', function(payload) {
    $('#stream').append(payload);
});
 
s.listen('complete', function() {
    alert("D. Plainview: I'm finished!");
});
 
s.load('testStreamData.php');

DUI.Stream works on a pretty simple mechanism: create a DUI.Stream, register listeners for each mimetype you plan to send (as well as an optional onComplete), then send a request to a URL that implements an MXHR responder. I’ll leave the details of DUI.Stream’s inner-workings to those of you who want to look through the source. As always, our code is released under a open-source license so if you find any fugly parts, feel free to let us know or send us a patch.

You can see the demo code in action, and I’ve included some rough timing numbers to demonstrate just how much faster DUI.Stream is than a regular uncached page load. YMMV with the demo, so don’t be surprised if it falls over in your browser or occasionally reports wacky numbers. Check out the roadmap below for cross-browser info. Also, we have a demo that handles images that performs atrociously well in Firefox and Safari, but at present IE doesn’t play nice with it (shocking, I know). You can test drive it at your own risk.

If you’re curious about responders, we’ve thrown together a few demos in Python, Ruby, Perl and Java in the DUI.Stream GitHub repository. We’ll be releasing a full set of documentation once we hammer out the rest of the DUI.Stream client. Rest assured you won’t have to implement anything based on a blog post and some demos.

Here are the things still on our short-term roadmap:

  • Cache detection. If you wanted to implement DUI.Stream right now, you’d have to switch between regular requests and MXHRs on your own. We don’t recommend that.
  • Background caching. A big part of implementing this is going to be loading the MXHR through DUI.Stream, then turning any cacheable chunks into their normal, cache-friendly tags.
  • Support for multiple headers per chunk. Specifically, we’ll be adding a set of custom headers to allow for UI-specific information to be sent, like CSS selectors for greater control in handlers.
  • XMLHttpRequest.multipart support. Right now we’re not using this flag, since we aren’t keeping connections open for a server-push implementation yet. Like I said, It’s an alpha ;)
  • IE7 and IE8 object tag workarounds. In the current build, bouncing binary streams into object tags in IE has some interesting results.

Anyways, I hope you like what you’ve seen so far. Several of my teammates have been busting their asses to get this ready to show off — specifically Jordan Alperin, Arsenio Santos, Chris Goffinet, and Arin Sarkissian — and we’re all pretty excited about it. As always, if you have any questions, comments or thoughts, let us know!

Cheers,
- Micah

John Quinn

DiggBar Changes Live Today

Hi all,

As promised last week, we pushed a few key changes to the DiggBar today.  If you are logged-out, you will no longer see the DiggBar by default. If you miss the DiggBar, we encourage you to log-in, as we believe it creates a much more seamless Digging experience.

You can continue to use Digg to create short URLs from anywhere on the web by simply typing http://digg.com before any URL or using our bookmarklet.

If you are logged-in and have the DiggBar enabled, you may notice a few additional changes. We shortened the height of the DiggBar to make it lighter and more compact. We also temporarily removed the ‘view count’ number for now, as the data no longer represents the global views for that content.

Stay tuned for more enhancements. As always, let us know what you think, as we appreciate all your feedback!

John

Kurt Wilms

PWDTAD: Item-Based Collaborative Filtering

You may have noticed the “People who Dugg this also Dugg” feature on the Digg story list pages as an interesting new way to sift through all the available items on Digg and discover new content.

We recently rolled out a few subtle changes to the placement of this feature, along with Related by Keyword. Many users wanted to see the comments directly under the story summary, so we are exploring different locations of these features based on what’s more relevant to you.

So how are the items that appear in the “People who Dugg this also Dugg” section chosen? The short answer is that it is based on an new item-item collaborative filtering algorithm recently developed at Digg.

The Digg Recommendation Engine predicts what items on Digg you might like by finding the users most similar to you, and then using their digging history in order to make predictions for you. There are various tweaks to the algorithm, but that is the basic model. The “People who Dugg this also Dugg” feature explores the relationships between items on Digg rather than the relationship between users. Instead of computing correlations between people, here we compute correlations between pairs of items on Digg. Computer scientists call this type of collaborative filtering ‘item-item’ because the algorithm is focusing on items rather than users. When we want to show “People who Dugg this also Dugg” items for a target item, we look at all the other items on Digg and select those that are most similar to the target item. Again, there are various tweaks to the algorithm, but that is the basic model.

The critical step in calculating the “People who Dugg this also Dugg” items is to compute the similarity between items. The basic idea in similarity computation between two items is to first isolate the users who have dugg the two items you are comparing and then to apply a similarity computation technique to determine their similarity. There are various strategies for determining this value. For those interested I would suggest this paper for a good overview. We’ve chosen to use an implementation of Jaccard coefficients in our calculation.

Launching a recommender on Digg comes with a unique set of challenges, many of which have to do with the tens of thousands of items submitted daily and the incredible volume and rate of diggs occurring on those items. As you can imagine, computing similarity for all theses items on digg fast enough to produce relevant recommendations can be a tricky problem to solve! Our recommender, written in python, computes similarity in real time as digging activity occurs. When an item on Digg receives at least five diggs, we compute similarity scores, select the most similar items, and store the results in a special data structure. As digging activity continues, the similarities are recalculated and the most similar items are reselected. At any given point in time, the recommender is calculating similarities for tens of thousands of items on Digg.

Of course, even with some of the technical challenges solved, our work is far from finished. We’ll keep on studying how our users interact with the related stories and we’ll continue thinking of innovative ways to use the data we have to improve the Digg experience. Since this is the beta version of the feature we want to know what you think. Please contact us with your thoughts and suggestions.

-Kurt

Joe Stump

Introducing Digg’s IDDB Infrastructure

I’ve been hinting at a mysterious piece of infrastructure, called IDDB, for some time in various talks, blog posts and mailing list posts. Today, with the introduction of Digg’s DiggBar and URL minifying service, IDDB is finally powering a significant feature. I say that because it’s actually been quietly powering user IM’s and links for about a month with no issues (this is why that feature temporarily disappeared a few weeks ago only to reappear after a major data migration).

So what, exactly, is IDDB? IDDB is a way to partition both indexes (e.g. integer sequences and unique character indexes) and actual tables across multiple storage servers (MySQL and MemcacheDB are currently supported with more to follow). It started out as a harebrained idea that a few guys from operations and I would chat about during lunch. After a few lunches Ron and I decided to put the marker to the whiteboard and hammer out the details. A few afternoons later we had the basics whiteboarded and I’d prototyped a working package.

From there Ian and the Core Infrastructure team took over implementing and refining IDDB. There were tools to be written for operations, features to be fleshed out and tests to be written (In fact, 41% of IDDB’s code is its unit test suite). In the end some of the features that were cooked up for IDDB include:

  • Ability to partition data across multiple storage nodes using arbitrary node assignment. The hashing framework allows node assignment to be pluggable (e.g. We could create hashing algorithms that hash ID’s by colocation facility, geological location of the user, racks, etc.). In addition to this, we allow for multiple types of storage nodes.
  • Ability to break IDDB indexes, which store ID and location meta data, across up to 16 machines.
  • Each ID maps to N storage nodes and has an individual status on each node. All data, by default, is written to three storage nodes. MySQL and MemcacheDB both support replication so their slaves are added to each ID as a read-only node.
  • Each ID type can be served from its own cluster of servers. This means that the user ID sequence can be on a totally different cluster than the user email unique character index.
  • IDDB utilizes Gearman to move data around. For instance, we can take a user who’s consuming a lot of resources and migrate them to a less loaded set of storage nodes. We also use this to find ID’s that are in a state of error to fix them (e.g. A user ID requires 2 copies, but there’s only 1 so we lock it and make another copy).
  • Storage nodes can be arbitrarily added to the pool at any time and new ID’s will instantly start mapping to them. Additionally, we can set a storage node’s status to “full”, which means it can no longer accept new ID’s, but will continue to accept new data for existing ID’s. When combined with our migration tools we can elastically grow the storage clusters out and rebalance ID’s across them as needed. This also means we can keep heterogeneous hardware operating in unison (e.g. Older machines can hold 10,000 users, but new ones can hold 25,000).
  • All meta data about ID’s, which is pretty much entirely static, is kept in Memcached to reduce reads from the index cluster.
  • Ability to track number of queries being ran against a storage node or against a single ID.
  • An entire suite of CLI utilities to manage the clusters.

What’s nice is that IDDB abstracts all of this, making it quite easy for developers to partition data without having to worry about the basics. Here’s how you’d fetch a user record and query for the user’s IM links:

<?php

require_once 'IDDB/ID.php';

// Fetch a user's IDDB record, which contains basic meta data and location data.
$id = IDDB_ID::factory('User', 1234);

// Get all of the user's data from a random node.
$links = $id->db()->getAll('SELECT * FROM UserLinks WHERE userid = ?', array($id->id));
print_r($links);

?>

Pretty simple. But what about creating a new ID? Let’s say we’re creating a new user and want to add some IM’s to their table. What would that look like?

<?php

require_once 'IDDB/ID.php';

// Creates a new unique, auto-increment, ID in IDDB.
$id = IDDB_ID::create('User'); 

// The execute() method will write this query to all of the ID's writable nodes within individual
// transactions. If it fails to write on all nodes then the exception is surfaced. If it fails on 1
// of, say, 3 writes it will mark the ID as being in an error state on that 1 node while being
// live on the other two nodes.
$id->db()->execute('INSERT INTO UserIMs SET userid = ?, service = ?, handle = ?', array(
    $id->id, $_POST['service'], $_POST['handle']
);

?>

The DiggBar and URL minifying service is powered by a 16 machine IDDB cluster, which includes 8 write masters in the index and 8 MySQL storage nodes. It’s, to date, the largest IDDB cluster Digg has pushed into production, but we have plans for much bigger IDDB clusters.

On a final note I really want to call out IDDB’s test suite and how putting extra effort into tests has helped us rapidly iterate over IDDB. The test suite contains 2,300 lines of code, 990 tests and a whopping 8,081 assertions! In fact, IDDB’s test suite has surfaced (and fixed!) a number of bugs in PHPUnit itself. It’s imperative that such a crucial piece of code be completely unit tested and I’m insanely impressed with Ian’s work in this area.

Thanks!

Joe