StreamBuilder: our open-source framework for powering your dashboard.

Today, we’re abnormally jazzed to announce that we’re open-sourcing the custom framework we built to power your dashboard on Tumblr. We call it StreamBuilder, and we’ve been using it for many years.

First things first. What is open-sourcing? Open sourcing is a decentralized software development model that encourages open collaboration. In more accessible language, it is any program whose source code is made available for use or modification as users or other developers see fit.

What, then, is StreamBuilder? Well, every time you hit your Following feed, or For You, or search results, a blog’s posts, a list of tagged posts, or even check out blog recommendations, you’re using this framework under the hood. If you want to dive into the code, check it out here on GitHub!

StreamBuilder has a lot going on. The primary architecture centers around “streams” of content: whether posts from a blog, a list of blogs you’re following, posts using a specific tag, or posts relating to a search. These are separate kinds of streams, which can be mixed together, filtered based on certain criteria, ranked for relevancy or engagement likelihood, and more.

On your Tumblr dashboard today you can see how there are posts from blogs you follow, mixed with posts from tags you follow, mixed with blog recommendations. Each of those is a separate stream, with its own logic, but sharing this same framework. We inject those recommendations at certain intervals, filter posts based on who you’re blocking, and rank the posts for relevancy if you have “Best stuff first” enabled. Those are all examples of the functionality StreamBuilder affords for us.

So, what’s included in the box?

The full framework library of code that we use today, on Tumblr, to power almost every feed of content you see on the platform.
A YAML syntax for composing streams of content, and how to filter, inject, and rank them.
Abstractions for programmatically composing, filtering, ranking, injecting, and debugging streams.
Abstractions for composing streams together—such as with carousels, for streams-within-streams.
An abstraction for cursor-based pagination for complex stream templates.
Unit tests covering the public interface for the library and most of the underlying code.

What’s still to come

Documentation. We have a lot to migrate from our own internal tools and put in here!
More example stream templates and example implementations of different common streams.

If you have questions, please check out the code and file an issue there.

opensource engineering tumblr

Resizing Gifs On-Demand

Why dynamically resize images in the first place?

Before the dawn of on-demand resizing at Tumblr, every posted image was resized into seven or eight different sizes, and each was saved into our backing media store (a massive S3 bucket). This made serving our images very fast—just grab the size you want right from the bucket! While this was great, it also meant that any changes to our image processing would not affect any images we had already saved (billions of images). If we were to upgrade image quality, add a new size crop, or change how we handle taking down media, the effect would only be marginal…what a bummer! The cost of storing all the resizes as separate files (petabytes of data!), along with a lack of agility moving forward, motivated us to adopt a dynamic resizing and serving strategy.

We began with resizing jpg and png images on-demand instead of persisting each different resize crop in our S3 bucket. This has been a great success; our “Dynamic Image Resizer” churns through over 6,000 images a second, at a roundtrip request latency of only 250ms per image. Not having to store the resizes saves us tens of thousands of dollars a month! So, the natural question was, can we also do this for gifs and make a “Dynamic Gif Resizer?”

The problem with resizing gifs on-demand

Gifs, as a medium, are a wonderful thing. They capture a special or hilarious moment and repeat it back to you, forever. However, the actual Graphics Interchange Format leaves much to be desired. Last touched in 1989, the format is woefully outdated, and this begets massive, low quality animated images. When compared to video format counterparts (H.264 and the like), the gif file size can be tens of times larger at similar visual quality. Many companies have punted on the gif file format entirely; imgur released their gifv format, which wraps an mp4 video. Instagram will loop your video clips, but will flatten gifs to a still image. However, as the true “home of the gif,” Tumblr isn’t ever giving up on your gif files!

Resize it faster

A while ago, one of my colleagues @dngrm posted about updates we made in our gif resizing technology. Essentially, we switched our gif resizing library from ImageMagick to gifsicle with great success—we got lower latency and higher-quality results. In order to resize a gif in a realistic timeframe for on-demand resizing and serving, we proposed some changes to gifsicle that parallelizes the resizing step. Since a gif is just a stack of image frames, we figured that resizing them using a thread pool could lead to a performance improvement. Luckily for us (and the world!), gifsicle author Eddie Kohler accepted and merged our changes into gifsicle. With this new threaded resize option in gifsicle, we gained about a 1.5-2x speed-up in resizing an average gif against using the vanilla gifsicle. This brought down the average wall time of a gif resize to about 100ms. The entire gif resize request (downloading the upstream image, resizing the gif, and serving the response) is now only 400ms on average.

Cache is king

To make all this possible, tumblr heavily relies on CDNs to cache massive amounts of static content and avoid repeated work. Thanks to this, the Dynamic Gif Resizer only gets a little over 1,000 resize requests per second, thanks to an incredibly-high cache hit ratio on our CDNs.

On top of that, we rely on the usage of conditional GET requests and 304 Not Modified responses to cut down on the amount of real work we must do in the resizing level. The number of 304s we serve fluctuates between 30-50% for all gif responses, which saves us a tremendous amount of compute time!

Putting it all together

The resizer itself is an nginx server with a custom module that does the upstreaming and resizing, and is written in C. The jpg/png resizer utilizes OpenCV for image manipulation, while the gif resizer uses the aforementioned gifsicle library.
Our fleet of resizers and their surrounding architecture are housed in AWS. The main motivation for this was colocation to our image store (S3) and the ability to automatically scale our instance count up and down, depending on time of day (our traffic pattern is heavily cyclic over a 24h window). The rest of tumblr’s architecture is housed in our own DC. Below is a minimalistic diagram of our resizer setup.

Thanks

Both the Image Resizer and Gif Resizer were massive undertakings, and a lot of people deserve credit for fantastic work:
Massive thanks to co-developer @naklin, and to @neerajrajgure who helped with improvements.
To @dngrm, @michaelbenedict, @heinstrom, @yl3w, and @jeffreyweston for architecture help and sage advice.
To our AWS enterprise support team Frank Cincotta, Shaun Qualheim, Darrell DeCosta, and Dheeraj Achra.
And to Eddie Kohler who helped clean up my ugly gifsicle changes and let them be a part of his library.

Questions? Comments?

Talk to me on tumblr using our new messaging system! My tumblr is @hashtag-content

gif gif resizer resizer engineering tumblr staff

awnerd

Culture Matters: Casper Suite for People Who Fear Going Corporate

A few weeks ago, I presented at JNUC 2015 about an emerging approach in IT – a people-centric approach. IT teams have a significant impact on the culture of the organizations we support. Embracing this influence is challenging. But doing so can be transformative.

We covered a lot of ground:

The urgent business case for nurturing individual expression
Ways to build trust between IT and other departments
How choosing the right words can change how people think (and can be key when needing buy-in from managers and employees)

Each year, JAMF produces videos of every session from the 3-day JAMF Nation User Conference, and makes them freely available. I was thrilled to be asked to present, and I’m honored to have been chosen as a highlight of this year’s conference.

Take a look and let me know what you think.

Source: jamfsoftware.com

culture matters culturematters jamf jnuc jnuc2015 junc 2015 mac MacIT IT apple tumblr

infoq.com

Tumblr - Bits to Gifs

John Bunting talks about different services Tumblr has built and how their architecture helps them be fault tolerant as they continue to grow.

John Bunting talks about different services Tumblr has built and how their architecture helps them be fault tolerant as they continue to grow.

You are looking at a Tumblr post, so you love GIFs. You are reading an engineering post, so you love bits. Have you ever wanted to know how Tumblr turns these bits into GIFs? Thanks to QCon, you can watch cyborg hacker codingjester talk about how its done.

sre tumblr how it works bits gifs tumblr engineering engineering scale

kevinthecity

speakerdeck.com

Tumblr on Android

engineering

Tumblr on Android is always a hot topic among savvy Tumblr users. It’s easy to get caught up in the hustle and bustle of every day life and forget that this isn’t just a silly app for looking at cool and weird stuff - It’s a complex and robust spectacle of engineering that is under a constant barrage of updates, bug fixes, and new features - all in the sake of helping you look at the cool and weird stuff you love.

If you’ve ever daydreamed about the nuts and bolts of Tumblr’s Android app, then this presentation is right up your alley.

tumblr tumblr android app android development

bryan

speakerdeck.com

Tumblr iOS architecture

engineering

Some consider the Tumblr iOS app to be a marvel of modern medicine and science. Others find our custom native HTML rendering to be a telltale sign of our insanity. Other others view our staunch avoidance of Core Data migrations as both prudent and educational. Whichever camp you fall into, you may find something to enjoy in this high-level overview of our iOS application architecture.

tumblr ios architecture

Who doesn’t love animated GIFs?
Believe it or not, support for GIFs at Tumblr was a happy accident! When Tumblr put together the code for handling JPEGs, support and GIFs (and PNGs) happened to also work using the same code. Perhaps even more... — **Who doesn’t love animated GIFs?**
Believe it or not, support for GIFs at Tumblr was a happy accident! When Tumblr put together the code for handling JPEGs, support and GIFs (and PNGs) happened to also work using the same code. Perhaps even more surprising is that the tools used to handle GIFs at Tumblr hadn’t changed much from those early days.
The image above is an original from sukme that could not be posted to Tumblr last June. It also would have failed if he’d tried last Sunday. If you click-through to the original post, you will see a muddy, reduced-saturation mess. All this because our resizer couldn’t handle the original.
**I’ve got ninety-nine problems and the GIF is one**
There is a lot of misinformation about GIF limits on Tumblr, so let me set the record straight: We don’t count colors or frames or pixels. We only count bytes and seconds. Every image that comes in is scaled to a number of smaller sizes and the smaller your image is, the fewer resizes need to happen, which means less time.
We had two core failure modes in our prior resizer: Some images would take as much as several minutes to convert. This was not directly attributable to color, dimensions, or frame count, but a mysterious mix of all of them. Some images would balloon in size (600KB at 400x400, 27MB at 250x250).
The unpredictability of these failures made our GIF limits feel arbitrary and terrible to the end users. Some have gone so far as to threaten monkey kicks. I don’t want to get kicked by a monkey, so we started working hard late last year to fix it.
**A proposed solution**
Some of you may have seen this post where the performance of our current converter was compared with a new “mystery” converter. The mystery converter was roughly 1000x faster on the “slapping” GIF and happened to look great, but had quality problems on other images. Those were more fully explored in here a couple of days later.
If you haven’t figured it out yet, the mystery converter is gifsicle.
**Getting a better handle on it**
To get an unbiased test set, I took a random sample of roughly 90K GIFs that Tumblr users *tried* to upload, not limiting the corpus only to those that succeeded. These were tested against the current converter, resizing down to the next size we produce. Each resize is given up to 20 seconds to complete in our application, but all resizes must complete in 30 seconds. All resizes must be under 1MB or we will convert the first frame to JPEG and call it a day.
2.6% of my 90K GIFs took longer than 20 seconds to resize. This is an underestimation of how many GIFs would be rejected for time because this is only one of several resizes required. A whopping 17.1% of all GIFs were over 1MB. Even if we bump up to 2MB, the rejection rate is 2.75%. The converter was making over 25% of all resizes larger than the higher-resolution originals! The total rejection rate for my sample set was 4.46% of all original GIFs uploaded.
Using gifsicle is so much faster that our CPU rejection rate drops to 0.00 on my test set. Also, just under 99% of all images were smaller when resized than they were at their original resolution. The size rejection rate was a much lower 0.59%.
**Gifsicle problems**
As compelling as the performance of gifsicle is, the quality problems are too much to ignore. We played around with the code a bit, but eventually we just got in touch with the author, Dr. Eddie Kohler. The specifics are in this post, but the short version is that Eddie was able to improve quality by adding some more advanced resampling methods as well as palette expansion for small-palette images. This increased our size rejection rate to 0.68% while still keeping us well under our CPU budget.
**Proving it**
Image processing is all about choices. How do you resample? Do you sharpen? Where in the workflow is gamma correction applied, if at all? The list goes on and on.
As you can imagine from the performance differences, our previous converter and gifsicle take very different approaches to GIF resizing. The output images look different. Sometimes it is slight, sometimes it is significant, but there is no way we could put out a converter that messes up your images, even if it messes them up quickly.
We set up a qualitative study. The goal was simply to prove that we weren’t doing worse than our old converter, not necessarily that we were doing better. This study was opened up to all Tumblr employees, as well as some “randomly selected” outsiders (my friends and family). Participants were presented with one of two questions:
1.) Given an original and 1 resize, decide whether it is ok, unacceptable, or completely broken.
2.) Given an original and 2 resizes (randomly choses which was left and which was right, sometimes they were identical), choose the better image or say there is no difference.
The results were everything I could have hoped for. The “acceptable” test showed that users found gifsicle better at producing acceptable results (87% vs. 84%), but not by a statistically relevant amount (p=0.086) and that gifsicle produced fewer broken GIFs (0.71% vs. 1.38%), but again not enough to say it is definitively better (p=0.106). The “better” test found users preferring gifsicle 37% of the time, the prior converter only 16% of the time, but users also preferred one identical image over the other 27% of the time. Again, it is hard to say that gifsicle is better, but it is clear that it is no worse.
**Putting it all together**
The development and testing described above took from late October until the beginning of March. Packaging, deployment, and integration took only a couple of weeks!
We aren’t done. There is work underway exploring how we handle JPEGs and PNGs. There are a slew of features that we can go after. This was a big step, a necessary step, but not the end for sure.
**We are a community, it takes a village, there’s no “i” in GIF**
This project couldn’t have happened without the excellent work of Eddie Kohler in creating, maintaining, and enhancing gifsicle. Tumblr’s Site Reliability Engineering group packaged and helped deploy gifsicle onto hundreds and hundreds of machines in our datacenter. Tumblr’s Security Team vetted the code, both by inspection and by attacking it to make sure we stay safe. This was all for the awesome Tumblr creators, but I have to mention qilme/sukme (same dude, two blogs), reallivingartist, and especially gnumblr for their help in understanding and ultimately attacking this monstrous problem.

gif gif limit gifsicle tumblr

Tumblr Summer Intern: Walter Menendez

This summer I got to join Tumblr as a search team intern. This was a dream come true: I spend a lot of my day on Tumblr sharing content with others and generally indulging in the specific Tumblr lingo and lifestyle, so to do it all day was just too unreal. (Fun fact: I got hired through Tumblr, thanks to a post on this very same blog!) I didn’t expect to like New York as much as I did and I’m so heartbroken that I have to leave Tumblr and go back to school.

The search team is a really cool team to work on because we do a lot of really critical, quantitative thinking about the highly subjective, qualitative content on Tumblr. My projects definitely required a thorough understanding of Tumblr’s users and their idiosyncrasies in order for our data to make any sense. It was cool to just go into our databases, scrape some data and do all kinds of things with it. It definitely opened up my view of Tumblr as a community and, well, I also found a ton of stuff to follow/reblog while doing so.

I go to MIT, where a lot of my research background is in data visualization and large data sets, so one of my first projects was to work on a data visualization for our trending post stream. The search team had been working on trending content for quite some time now, focusing on the three core forms of content on the site: blogs, tags, and posts. Blogs and tags had a decent amount of front end work but trending posts not so much. We basically had a list of post URLs that were deemed trending based on our metrics, but we had no idea what they looked like, nor did we really have any intuition on to what extent they were trending. To fix that, I built a D3.js based visualization, while grabbing posts from our PHP framework. There was a fair amount of tech involved! From processing JSONs to animating images, the visualization took a fair bit of work. At the moment, it’s more of an internal thing but hopefully, users all over Tumblr will be able to see it as well.

My second project was much more data intensive and focused on search traffic analysis. Tumblr had rolled out trending tags to mobile during my time here, which was a great way to discover more content on Tumblr. However, those metrics and algorithms were only based on post creation. Since more people consume content than they do create it, there was a lot of interesting data lying in tags and their search counts. I started scraping data from our page logs and effectively did a very high level count over hundreds of thousands of tags. Afterwards, I would save the data into Redis and would then compute some statistics and ultimately rank the tags based on how trending they were.

This analysis was a really cool project because every morning as I ran my scripts to collect and process the tag data, I got to see the progression of current events all of a sudden going from single digit counts to thousands of hits. It was also a great way to stay up to date with news as I would often Google a tag that I had no idea what it was referring to. Another cool part was comparing my ranking to our current rankings in production, and seeing just how aligned and different we were.

My time at Tumblr definitely couldn’t have been possible had it not been for the staff here. They’re all so knowledgable and approachable and hilarious, so while my roommates would groan about going to “work”, I would skip on my merry way to a blogger’s haven. Plus, we have dogs. Who wouldn’t rush to the office to see those?

Forever reblog!

- Walter

Engineering interns tumblr

Tumblr Summer Intern: Eric Leong

I spent this summer as an intern on Tumblr’s Android team. I joined as the team had just transitioned to Android Studio, so we were all learning how to work with the new IDE together while familiarizing myself with the code. On the first day I even made a pull request to ignore a folder created when installing Android Studio. My tasks throughout the summer were to implement features the team hadn’t gotten around to yet, allowing me to brush the dust off my Java and touch the various parts of the codebase at my own pace.

I mainly focused on user-facing feature development, which involved grappling with the intricacies of Android animations and the Tumblr app. My first task was to develop the notifications widget and revamp the existing post creation widgets, which turned out to be a great way to play with the Tumblr codebase and the newest features of the Android framework. I had full responsibility for the features I developed, with the design and QA team assigning bugs to me, which was both exciting and humbling. I had the most fun working on features I knew no one else had implemented, like “swipe back”, and realizing that I was pushing the boundaries of the Android platform.

It was an incredible experience to watch my code end up in the hands of our Android users and watch my friends play with the features I built. I’m also proud that I had a chance to fix issues and add functionality that I wanted to see in the app. Even when I did not directly work on a feature, my feedback helped shape the design and implementation.

I had an amazing summer working with the Tumblr team, who provided guidance amidst a flurry of questions and showed me how design and code can come together to produce a beautiful app. Me and my fellow interns were brought closer together by our love of photography and the Tumblr community.

It was an unforgettable summer at Tumblr where I played my small role in bringing creators the audience they deserve.

- Eric

engineering interns tumblr

Tumblr Summer Intern: Jared Stern
Working at Tumblr as a summer engineering intern, I was flung headfirst into a mysterious world replete with terms I did not understand, people I did not know, and a codebase that defied all attempts at... — Tumblr Summer Intern: Jared Stern
Working at Tumblr as a summer engineering intern, I was flung headfirst into a mysterious world replete with terms I did not understand, people I did not know, and a codebase that defied all attempts at understanding. I asked many questions, and slowly realized that with a little thinking I could figure some things out myself. Slowly, I got to know the tools and languages (now I know some PHP!) and gained a bit of confidence in dealing with our code.

Much of my work dealt with a couple of Tumblr’s back-end services, which handle a lot of the heavy lifting for the web application—mostly moving data around so the app can quickly access everything it needs. I was given quite a bit of responsibility, which was exciting and instructive and frightening, particularly when I broke an important thing on my third Tuesday. Luckily, my other work was a bit less eventful.
I eventually got more comfortable (a bit more comfortable, anyway!) with deploying changes to our code. In addition to services, I worked on a few small projects closer to the site’s front end. I fixed a bug in the bookmarklet that caused posts to be created incorrectly. I fixed an issue in our internal administration site so the site would be more responsive at peak times. On one occasion, I worked with a member of our support team to fix a single tumblelog that curiously would not load past its thirteenth page. It was a pleasure to be able to fix an actual user’s actual problem.

Through all of this I had the honor of working with the fabulous Tumblr team, people who were friendly and helpful and knowledgeable, and who went out of their way to help me along. All told, it was a marvelously challenging, interesting, and educational summer.
- Jared

engineering tumblr interns

Tumblr Summer Intern: Jacob Haip
This summer I interned as a product engineer in the Creativity team under Pau Santesmasses. I was tasked with removing handlebars.js and prototype.js, two javascript libraries we were using for HTML tempting and... — Tumblr Summer Intern: Jacob Haip
This summer I interned as a product engineer in the Creativity team under Pau Santesmasses. I was tasked with removing handlebars.js and prototype.js, two javascript libraries we were using for HTML tempting and simplifying javascript code. Nothing was wrong with these libraries but we were also including underscore.js and jQuery, libraries with similar features, so it was a bit of a waste. To help free Tumblr of these libraries, over the course of the summer I rewrote ~15,000 lines of code to port prototype.js to jQuery and handlebars.js to underscore.js.
The transition was a challenge not because the libraries were that different but because of the how scattered the old code was throughout the website. I’m happy to end the summer having removed handlebars.js and all but a couple isolated pieces of prototype.js. It was also cool to have touched so many pieces of Tumblr in the process.
Users should enjoy faster load times throughout the website. Here in the office this is about a 1/3 of a second faster load time and this is even more noticeable on slower internet connections. It’s also nice to know that the rest of the people in the engineering team will have two less libraries they have to worry about.
It was a pleasure to be a part of the Tumblr team during such an exciting time: the Yahoo! acquisition and Tumblr gaining more of the attention it deserves as the place for brands. The other interns were amazing and I loved how I got to know all of them and not just the engineering interns. I look forward to joining my friends once again at MIT but I will miss all the people I have met here. MIT maybe be good at science but it sure isn’t good at art and I will miss the creativity of all the people in the office and being a part of the truly creative community on Tumblr.
- Jacob

engineering tumblr interns

This summer the engineering team at Tumblr got to work with some amazing Interns. We asked each of them to share their experiences and tell you a little bit about the projects they worked on. Here is the first in a series of upcoming intern profiles.

–

Tumblr Summer Intern: Iain Nash

This summer I interned at Tumblr as a front end web engineer working with the discovery team. I had amazing opportunities here to build awesome things that many people see, but more important, to work with a creative and dynamic team, and be able to contribute to Tumblr. Additionally, I loved spending the summer in New York - really is an exciting place to be.

I started off getting setup with my own development box and similar access to full time employees at Tumblr. While I started off with smaller projects to get to know the codebase and team, I quickly started getting bigger and bigger projects. It was really overwhelming at first to deploy the tumblr.com codebase within the first week I was here, going from only writing tiny things to a site as big as Tumblr.

The first big project I worked on was making a new logged out tag page. Throughout the summer, I was mostly mentored by Johnny Benson, who really helped me out with how things are done and constant creative and practical decisions involving my work. I also worked with Tag Savage for many of the design changes I was working on. It was always fun for me to hear “if this isn’t too hard to do…,” and be able to make a proof-of-concept by the day’s end. In fact, the current tag page design started off as refinements to a current tag page, then grew into a bigger project when I took these suggestions on. The layout of this page is rather unique - smart sliding rows, and it took a good deal of code to make it work properly.

I also had the opportunity to clean up and improve on some of the already stunning login and register views. These pages were mostly complete, just needing some design and code tweaks. It was a bit nerve wracking deploying these pages as there was a chance I could have missed something and would break login. Most of the changes I made were on the backend PHP, so that going out without a hitch was great for me.

I really enjoyed the other interns at tumblr this summer - previously, I would be the only intern at a company, now I had other interns working with me. Going to lunch, exploring the city was always fun with the other interns.

Now, it is time for me to head back to school at the University of Southern California, and face the homework, the time crunches, and the assignments once again. Seeing Tumblr grow and change over the past few months was a cool experience, and I’m glad I was able to be a part of it.

– Iain

tumblr engineering interns profiles iain nash

robewaschuk

My Philosophy On Alerting

robewaschuk

I wrote some stuff while I was at Google about writing clean alerts and keeping an oncall rotation sane; after some cleanup they’ve allowed me to make it public. Of course, this represents my opinions and not Google’s. They do reflect what we think are best practices at Tumblr, though. We're hiring Site Reliability Engineers.

Check out My Philosophy On Alerting.

Summary

When you are auditing or writing alerting rules, consider these things to keep your oncall rotation happier:

Pages should be urgent, important, actionable, and real.
They should represent either ongoing or imminent problems with your service.
Err on the side of removing noisy alerts.
You should almost always be able to classify the problem into one of
- availability & basic functionality
- latency
- correctness (completeness, freshness and durability of data)
- and feature-specific problems.
Symptoms are a better way to capture more problems more comprehensively and robustly with less effort.
The further up your serving stack you go, the more problems you catch in a single rule. Balance this with being able to distinguish what’s going on.

sre oncall tumblr monitoring notdevops engineering

adamlaiacano

Using entropy to route web traffic

adamlaiacano

Earlier this week, Blake asked me for some help with a problem he’s working on. He has a couple of hash functions that are being used to route web traffic to a number of different servers. A hash function takes an input, such as a blog’s url, and outputs a number between 0 and 2^32. Say we have 1000 servers, that means that each one will handle about 430 million points in the hash-space.

The data looked something like this (with fake blog names, of course):

##       blog.name        H1        H2        H3        H4
## 1 23.tumblr.com 3.137e+09 1.866e+09 6.972e+08 5.792e+08
## 2 19.tumblr.com 1.875e+09 2.545e+08 2.606e+09 1.312e+09
## 3 34.tumblr.com 1.366e+09 2.236e+09 1.106e+09 3.640e+09
## 4 43.tumblr.com 2.639e+09 1.098e+09 8.755e+08 1.507e+09
## 5 90.tumblr.com 6.564e+08 5.397e+07 3.084e+09 2.961e+09
## 6 29.tumblr.com 2.476e+09 4.532e+08 2.787e+08 4.894e+08

One important thing to point out before we get started is that this has to be a representative sample of the request data. Despite the wild popularity of my personal blog, it doesn’t get a sliver of the traffic that Beyonce gets. That fact needs to be represented in the sample data, meaning that her blog should appear in more rows of the sample data than mine.

Plot the data

The first thing I ever do is plot data to get a sense of what I’m working with and what I’m trying to accomplish. The density plots below show the distribution of values in the hash space for each algorithm. If you’re not familiar with kernel density plots, you can imagine this to be a smoothed (and prettier) version of a histogram. For the electrical engineers out there, it’s the sum of the convolution of a kernel function (usually a gaussian), with an impulse function at each of the points on the x-axis (represented here by dots).

By comparison, here are the density plots of a “near-ideal” example (1000 pulls from a uniform distribution) and a bad example (all assigned to the value 2e+09). The worst case example here shows the shape of the kernel function.

Calculate the entropy

In information theory, entropy is the minimum number of bits required (on average) to identify an encoded symbol (stay with me here…). If we’re trying to transmit a bunch of text digitally, we would encode the alphabet where each “symbol” is a letter that will be represented in 1’s and 0’s. In order to transmit our message quickly, we want to use as few bits as possible. Since the letter “e” appears more frequently than the letter “q”, we want the symbol for “e” to have fewer bits than “q”. Make sense?

Huffman Coding is one encoding algorithm. There’s an example implementation here, which assigns the code 100 to “e” and 1111110010 to “q”. The more “uneven” your symbol distribution is, the fewer bits it will take, on average, to transmit your message (meaning you’ll have a lower entropy). The entropy value is a lower bound for the actual weighted average of the symbol lengths. There are special cases where some encoding algorithms get closer to the entropy value than others, but none will ever surpass it.

The actual entropy formula is:

\[ H(x)=-\sum_{i=0}^{N-1} p_i log_2(p_i) \]

Where \( H(x) \) is the entropy, and \( p_i \) is the probability of that symbol \( i \) will appear. In the example I linked to above, \( p_e = 0.124 \) and \( p_q=0.0009 \), so it makes sense that e’s symbol is so much shorter. In the example, the average number of bits per symbol is \( \sum S_i p_i = 4.173 \frac{bits}{symbol} \), where \( S_i \) is the number of bits in the symbol. The entropy, from the above equation, is \( 4.142 \frac{bits}{symbol} \).

The example problem of web traffic distribution is a little different. We’re not actually encoding anything, but rather trying to make the theoretical lower bound for average number of bits/signal as high as possible.

We can consider each server to be a symbol, and the amount of traffic that it recieves is decided by the hash function that we’re trying to choose. If one of our servers is the equivalent of the letter “e”, it’s going to be totally overloaded while the “q” isn’t going to be handling much traffic at all. We want each symbol (server) to appear (receive traffic) equally often.

So to calculate the entropy, we’ll take a histogram of the hash values with 20 buckets (representing the 20 servers). That will give us the number of requests that go to each server. Dividng that by the total number of requests gives us each server’s probability of handing the next incoming request. These are the \( p_i \) values that we need in order to calculate the entropy. In code, it looks like this:

calc.entropy <- function(hash.value) {
    h = hist(hash.value, plot = FALSE, breaks = seq(0, 2^32, length.out = 21))
    probs = h$counts/sum(h$counts)
    print(probs)
    entropy = -sum(probs * log2(probs))
}

The entropy values for our four hash functions are:

##   hash.function entropy
## 1            H1   4.203
## 2            H2   4.226
## 3            H3   4.254
## 4            H4   4.180

And while we’re at it, here’s the entropy of our best/worst case example that we plotted earlier.

##    hash.function entropy
## 1     near.ideal   4.309
## 2 worst.possible   0.000

Why is the worst case value 0? Because if all traffic is going to one server, we wouldn’t need any bits at all to tell us which server the request is going to. The theoretical limit for a histogram with 20 buckets is: \( -20\frac{1}{20}log_2{\frac{1}{20}} = 4.32 \), which we’re close to but can never exceed.

All of our hash functions appear to be working pretty well, especially for such a small sample size that I’m using for this blog post. It looks like our winner is H3.

To summarize here’s what we did to find the optimal hashing function:

Get a representative sample of your web traffic
Run each request through the hashing function
Take a histogram of the resulting values with N bins, where N is the number of servers you have available
Divide the bin counts by the total number of requests in your sample to get the probability of handing a request for each server
Calculate the entropy, \( H(x)=-\sum_{i=0}^{N-1} p_i log_2(p_i) \), for each hash function

There are more considerations to take, like setting an upper bound on the value of \( p_i \) to ensure that no single server ever gets so busy that it can’t handle its load.

If you want to read up more on information theory, Elements of Information Theory by Thomas Cover and Joy Thomas is an excellent book that is reasonably priced (used) on Amazon.

There is also, of course, Claude Shannon’s landmark paper from 1948 “A Mathematical Theory of Communication”“, in which he essentially defines the entire field.

engineering tumblr tech computer science

0xa

0xa:
“ We have Engineering Summer Internships @ Tumblr! We’re looking for aspiring software engineers with a passion for open source software to join us for a summer of programming and fun. You will be integrated into a small engineering team working...

engineering tumblr internship

Tumblr Engineering (Posts tagged tumblr)

See, that’s what the app is perfect for.