The Great Migration

Recently we gave a talk at Velocity EU about the lessons learned from our big data center move in 2016. Check it out to learn more about what worked well and what you should consider when you are facing your own data center move.

Bitly Summer Intern Wrap 2015

We’ve invited this year’s summer interns to share their experiences from working on our engineering team. Our interns worked on the same problems as our full-time engineers and made a difference to our product and business on a massive scale.

They also appeared on the Bitly Tech Podcast which you should totally check out!

If you have an interest in making a difference on a massive scale then check out our job postings as we are hiring in both Denver and New York City!

Nathaniel

I’m Nathaniel and I’m writing this a few days before my awesome summer with Bitly comes to an end. I will be returning to Carnegie Mellon to start my 3rd year studying Computer Science, but I have to say, I’m going to miss this place. I spent the summer interning with the extremely talented Frontend team here at Bitly. While here I’ve gotten to create a JavaScript A/B testing utility and contribute to the soon-to-be-released BBT2 product. One of the best parts of working at Bitly was that even as an intern, I felt like I had a voice at the company. This was extremely evident with my work on BBT2, where I got to contribute not just code, but also ideas and opinions. Oftentimes you hear stories about how interns get stuck in corners to work on projects that never get to see the light of day. Bitly does the exact opposite. We were given seats amongst everyone else, we were given projects that will actually affect the direction and success of the company, we got to (and were encouraged to) attend all meetings, and most importantly, we were made to feel like members of the team.

I also learned a ton this summer. The new BBT2 product is being built in React JS with the help of Immutable JS. I’ve gotten to fully immerse myself in these tools and I’ve fallen in love with them. As a TA for a functional programming class, I can say that Immutable JS brings me as close as a JavaScript framework could to the Eden that is functional programming. Then React combined with JSX helped us write very clean and modular code, something any programmer can appreciate. Having previously worked in Ember JS, I have to say that React was a pleasant change from the rigidity of Ember. Although I worked primarily in JavaScript and CSS, Python is unavoidable at Bitly. Luckily, despite my love for functional programming, Python is my favorite language. Although I was only able to make one Meetup, I really appreciated that Bitly gives up its space in the evenings to host events for the Python community in NYC. Even for an NYC native like me, the city can feel too big. Having a space to meet up with like-minded people can be really nice. Bitly does everything it can to make sure every employee enjoys what they do, and this summer I truly did enjoy every moment of working at this great company.

Meina

My name is Meina. I am currently a graduate student at New York University, majoring in data science. In addition to data science, I have also focused my study on marketing at Stern. I worked with Bitly as a data science intern this summer. I decided to join Bitly because I was impressed by their innovative and collaborative culture. I wanted to utilize my knowledge of both data science and marketing at Bitly. I believed Bitly was a perfect fit for me because of Bitly’s ability to help its customers better understand and target their audience based on the results from data analysis.

I worked on two projects during my internship at Bitly: improving the MQL (Marketing Qualified Leads) Scoring System and understanding New York Times users’ browsing behaviors. For the MQL Scoring System project, I had the opportunity to work with the marketing team. I trained several statistical learning models based on Bitly’s historical sales data. I improved the accuracy of the MQL scoring system by 53.1%. Meeting frequently with the marketing team helped me practice communication across departments. I also gained more knowledge about the marketing side of Bitly.

While working on understanding New York Times users’ browsing behaviors, I started to learn a new cluster computing framework, Apache Spark, in order to process large-scale datasets. Learning Spark was a lot of fun and provided me with many challenges. I really appreciated that my colleagues were very helpful and patient whenever I asked for their help. Instead of solving the problem for me, they pointed me in the right direction and let me explore the solutions myself. I learned to solve problems more independently. The collaborative and welcoming atmosphere here at Bitly encouraged me to work through those challenges patiently and helped me become more productive.

In addition to working on these two projects, I had numerous other unforgettable experiences at Bitly. I remember the time when I went rafting with all my colleagues for our Q2 celebration. I remember the wonderful moments that I spent with my colleagues every Thursday during our cocktails and dreams events. We talked about our projects and enjoyed various special drinks. I remember my excitement when we moved into the new office. During my internship at Bitly, I not only gained valuable skills, including technical skills and communication skills, but also became friends with many of my exceptional colleagues. If I had the chance to choose my internship again, I would still join Bitly without any doubt.

Ben

I’m Ben, a rising junior at Columbia studying Computer Science. I was lucky enough to be selected to join Bitly’s application engineering (backend) team a couple of months ago, and it’s been a blast.

A big part of Bitly’s appeal is the scale it offers to its customers, and I was able to see firsthand what goes into creating and maintaining that. After a couple days of setting up, I was able to begin scouring the code for ways of tackling my first assignment, which involved addressing user search complaints. We use Elasticsearch to allow users to search through links they’ve created, so I spent about a week carefully combing through the relevant documentation, looking for and testing the right combination of queries that would address issues that users brought up, while keeping search lag low.

I realized that in order to fully improve search functionality, I would need to completely reindex our Elasticsearch database without interrupting user experience. It took a while to come up with a detailed plan of how I was to go about doing this; I first had to learn all about Bitly’s open source distributed messaging system which is largely written in Go. Learning about distributed systems and Go was hugely rewarding, as until then I was used to dealing with single-server applications. I now have insight into how companies like Bitly manage to deal with massive influxes of data every day – in other words, what it takes to scale.

At school, I’ve used agile development practices to build student-run applications for my peers, but seeing them as a part of a professional workflow was a whole new ballgame. Every morning we would state our goals for the day. Bi-weekly sprint planning meetings let everyone note progress and map out next steps, and department-wide meetings outlined goals for the company. Because the company isn’t too large, I knew of the goings-on of almost every division, even sales and revenue. I was always very impressed to hear about our latest partners and the ever-expanding scope of our product. At times, it made my task of migrating a massive amount of user data terrifying, as I became acutely aware of the extent to which people relied on our services. I would simulate my procedure over and over, paranoid that any change I made as I worked on it might ruin the whole system. Yet there was an underlying thrill in the possibility of creating something that would be able to transform our data and make it more useful for everyone.

I enjoyed Bitly’s culture. Despite its growing size (as indicated by our beautiful brand new offices in NYC), it retains an easygoing, youthful energy which is evident in everything from our weekly “Lunch & Learn” meetings to the mid-year company outing at Lehigh River during which everyone went rafting. People are very friendly, eager to explain the intricacies of something they worked on and how it is used throughout the codebase. I am incredibly grateful to have had such kind, engaging mentors throughout my time here. The experience I’ve gained here has been invaluable.

I still have a few weeks left here, so I’m excited to keep building!

Bitly In Denver

Denver (photo via Flickr)

Want to meet and hang out with some of the Bitly engineering team?

A bunch of us will be in Denver, Colorado the week of May 18-22. While we’re out there we’ll be catching up with our growing Denver-based team (we’re hiring, by the way) and participating in a bunch of local tech events.

Check out the schedule below and come join us:

Keep an eye on #bitlyindenver for live updates.

Please come out to any of these events and join in the conversation. We can’t wait to see you!

Introducing The Bitly Tech Podcast

At Bitly, we love to share as we learn.

Coming up through our careers, we have all learned and gained a ton of experience from others in the community who have shared their experiences and lessons with us. Accordingly, we’re always happy and excited to give back and share as much as we can.

So far we’ve shared through posts right here on our blog and by giving talks at meetups and conferences. These posts and talks have been great and will continue to be a big part of what we do, but recently we’ve been wondering if we could be doing more. What else could we do to share our experiences and start more conversations that let us learn from others?

As a result of this, we’re proud to announce that today we’re kicking off the Bitly Tech Podcast (iTunes).

The Bitly Tech Podcast will be a weekly conversation about life as tech folk building product. Each week, members of the Bitly tech team as well as friends from the larger community will be chatting about the experiences we go through and the things we learn as we build products. Topics will range from the soft and fuzzy like how we got to where we are today and how we work, to our experiences working with the latest new tech like Go and Elasticsearch.

We’re excited to see where this new project will go and what conversations it will start. Subscribe today and let us know what you think!

Frontend Dependency Management with Browserify

With frontend development moving as fast as it does at Bitly, things can get pretty messy. We found ourselves with piles of unmanaged script tags and little indication of what was still being used in the app’s current iteration. There had to be a better way!

Enter Browserify! A few months ago, we embarked upon a project to totally overhaul the “Your Bitlinks” interface, and with that came the opportunity to rethink our practices and introduce frontend dependency management. Now the page looks as good behind the scenes as it does in your browser!

So, what is Browserify?

Browserify is a great tool that lets you write your client-side scripts like it’s node, allowing you to use node’s package management and module systems. Each file should be a module and explicitly require() all its dependencies. Then you simply give Browserify your main entry point, such as an overarching app file, and the name of a destination file, e.g.

browserify app.js > bundle.js

This example takes the network of dependencies starting at app.js and streams all of those files into bundle.js.

(Side note: Being familiar with the node.js environment and NPM, node’s package manager, is essential for picking up Browserify. In the weeks before we started working with Browserify, I wrote and published a node module as a side project. Going through that process and making sure all the pieces were in place, specifically the package.json, so that I could add it to the NPM registry made figuring out Browserify much easier.)

Why use Browserify?

There was no question in our minds that using a frontend dependency management tool was an important step in our app’s development. We considered the field of options, ultimately settling on Browserify. We didn’t need the async loading functionality of Require.js and appreciated the power of using NPM, leaving Browserify as the natural choice.

Leveraging node to bundle our JavaScript provides many advantages over the traditional multiplicity of strictly ordered script tags in HTML files. We found that the Browserify mindset facilitated coding conventions and frontend practices we strive for at Bitly. Bundling your app’s files together is a good practice because it reduces the number of HTTP requests your app makes in order to load. Furthermore, the resulting code is all contained in an immediately-invoked function expression, helping to keep the global namespace unpolluted, just how we like it.

We especially noticed that using Browserify helps us reduce the amount of code we’re sending to our users, thanks to the module convention and NPM. Because each module must explicitly require() its dependencies, dead code is less likely to stick around. Granted, unused require() statements can still linger in files, but because these statements are conventionally listed at the beginning of the file, it’s easier to catch unused code.

Relying on NPM also helps make sure we don’t forget to include code we need. If you don’t require() a necessary package, you’ll get an error when bundling your components. NPM also takes care of resolving versions; if module foo needs one version of a package and module bar needs another, it’s all figured out for you. Being able to npm install modules right from NPM’s registry is a cinch, much easier than, say, hunting down the jQuery source code. NPM also updates packages as the author pushes updates to the registry, provided you’ve declared the module in your package.json file in a way that green-lights new versions using node’s semver syntax.

Furthermore, node comes with built-in modules that can also be require()-ed and therefore bundled. We use the node url module, which makes sense because URLs are a little important to our day-to-day. Even better, you don’t have to worry about include order, because if a module needs another module, it’ll require() it itself, ensuring the necessary code is included before it is ever used.

One of the most powerful and unique things about Browserify is the ability to use source transforms, packages that alter your source code as it’s streamed to your bundle. For example, we use coffeeify for translating CoffeeScript to JavaScript. Once you’ve installed coffeeify, you can just do

browserify -t coffeeify app.coffee > bundle.js

and your CoffeeScript becomes JavaScript without intermediate build steps.

Transforms depend heavily on node’s streaming interface, but you don’t need to know much of anything about streams to plug in other people’s transforms (they’re fascinating though; read up here if you’d like). I personally liken transforms to Legos— the ones I need generally already exist and it’s just a matter of rummaging through NPM to find them. Some other transforms we use are:

  • hbsify to precompile handlebars templates so we could require them right from the views that use them
  • browserify-swap to substitute one package or file for another based on an environment variable
  • browserify-shim to make “CommonJS-Incompatible Files Browserifyable”
  • watchify to automatically recompile files when a watched file gets edited
  • remapify to “map whole directories as different directories to browserify”

It all gets even better when combined with Browserify’s --debug flag, which adds source maps in a cinch. Source maps are useful because all of the modules get bundled into a single file, making tracking down errors from the browser’s developer tools hard. They become even more important if you decide to use a minification transform such as uglifyify. Instead of an error that points at a line somewhere in the JavaScript bundle, we can see where the bug is in the individual, unbundled, human-readable source file.
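For example, building the same bundle with source maps included is just:

browserify --debug app.js > bundle.js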

Notable Challenges

All in all, we were able to get Browserify up and running pretty smoothly. Granted, our job was much easier because we weren’t trying to migrate much existing code, and thus did not need to translate it to adhere to the CommonJS standard. Still, transitioning to Browserify was no small feat. There were some interesting roadblocks we ran into because of some of our established development practices.

Something we needed to reconcile early on was the fact that Backbone requires Underscore, but we use LoDash for all its extra gadgets. Terin Stock has an excellent solution involving browserify-swap in this blog post. Long story short, we needed to request that underscore.js itself resolve to LoDash. (Tip: browserify-swap can be a little finicky; make sure you bundle from the directory where node_modules is located.)

We also heavily use Handlebars, and therefore historically depended on Handlebars helpers. However, the fact that they get concatenated together into one huge file doesn’t fit the modular mentality we’re striving for. Thus, we decided to forgo the Handlebars helpers lifestyle in favor of utility modules called in our views that preprocess the data before it gets fed to Handlebars. Now, if one view needs a function to format large numbers with commas, the other views don’t need to know about that function. This also ensures that once a helper becomes obsolete, it won’t be included, which is a peril of the usual monolithic Handlebars helper registration system.

Another challenge we faced was that our organizational preferences for our code led to an extremely nested directory structure. Because of this, remapify has been a great asset. Before using the package, relative paths for require() statements for files not in node_modules often needed four or five levels of ../../. This was both hard to parse and hard to figure out right the first time. Remapify allows us to reference files top-down from where we knew they lived, i.e. in our “models” or “views” directories, instead of relative to where they were being required from. We needed to do some forking to fit our exact needs, however.

We’re not that Special!

Well, maybe a little special… but you are too, and you too can use Browserify! If you’re not starting a new project, the most cumbersome part will likely be making all your modules adhere to CommonJS (“node-able,” as I like to say). This means using the require() and module.exports syntaxes and making sure everything is require()’d, npm install-ed, and in a package.json. From there, you’re just a few keystrokes away from getting Browserify running from the command line!

So, if you’re deciding whether or not to use Browserify, even in production, I would highly recommend it. It’s a bundle of fun, will transform your code, and make your app load(s) better!

Joining Bitly Engineering

First post! (aka Introduction)

Hello everyone, my name is Peter Herndon. I recently started working at Bitly as an application engineer on Bitly’s backend systems (which are legion). My recent experience is with a series of smaller start-ups, preceded by a long stint in a much larger and more conservative enterprise setting. I bring to the table expertise in Python, systems administration (both cloudy and bare metal), databases and systems architecture.

I’ve been interested in Bitly for quite a while, and wanted to work here for much of that time. Since its beginning, Bitly has had a reputation for technical excellence. The engineers here have demonstrated that excellence both by solving engineering challenges and by the ingenuity of how they approach those solutions. Bitly’s former chief scientist, Hilary Mason, single-handedly popularized the concepts of Big Data and Data Science, and legitimized them as engineering disciplines. Her talks and blog posts created my own awareness of and interest in the field. So when I had the opportunity to work here, I gladly leapt into it head first.

What I Found

A Company in the Process of Renewing Itself

Bitly is a unique place to work, even among tech businesses. The company employs about 60 people, about 25 of them technical, and has been in existence for 4-5 years now. That said, Bitly is in many ways a very new company. Recently the company underwent a shift in management, resulting in a new focus on business. The new CEO, Mark Josephson, brings a laser-sharp clarity to helping Bitly’s customers become successful by providing insight into how their brands are performing. This clarity of purpose is in addition to continuing the company’s technical leadership. We began the new year here with a renewed sense of purpose that is reflected in the number of new hires and the number of open positions.

I’ve experienced the process of watching an ailing small business shed employees and management, in a downward spiral of despair, including my own exit from that company. This is the first time I’ve experienced the rebirth of a company, the upward swell of pride and energy that comes from active leadership and direction. I’m very happy to see that Bitly has retained a great deal of its technical team, thus providing good institutional memory and continuity. That retention speaks well of the new leadership and the amount of pride in what the folks here have previously built. And what they’ve built is tremendous.

A Remarkable Technical Architecture

Bitly’s business is insight: providing customers with information that helps them make better decisions regarding their business by analyzing shortlink creation (referred to as encodes internally) and link click data (internally, decodes). To that end, our infrastructure must handle accumulating and manipulating around 6 billion decodes per month. That’s a lot of incoming HTTP requests. Not Google scale, but not pocket change by a long shot. To handle that volume, we use a stream-based architecture, rather than batch processing. That is, instead of accumulating incoming data in a data store and periodically processing it to reveal insights, we have a very deep, very long chain of processing steps. Each step, each link in the chain (and chain is an oversimplification since the structure is more of a directed graph, mostly acyclic) is an asynchronous processor that accepts incoming event data and performs a single logical transformation on the data. That transformation may be as simple as writing the datum to a file, or it may involve comparing it to other aggregated data for building recommendations, or for detecting spam and abuse. Frequently, the processed datum is then emitted back into the queue system for consumption further down the chain. The processed data are then made available via a service-oriented API, which is used to power the dashboards and reports we present to our customers. If any given step in the chain requires more processing power to handle the load of incoming events, we can spin up additional servers to run that particular step.

The advantage of stream-based processing over a traditional batch processing system is that the stream processing system is a great deal more resilient to spikes in incoming data. Since each processing step is asynchronous and has a built-in capacity limit, messages remain in the queue for that step until the processor is ready to handle them. The result is that every step in the chain has its own, independent capacity for handling data, and while backlogs occur (and we do monitor for them), a backlog in a given step is by no means a breaking problem as a whole. It may signify a failure in a particular subsystem, but the rest of the Bitly world will usually remain unaffected. Of course, when the problem is corrected, the result will usually be a backlog in the next steps of the chain, but that is usually fine and expected. Each step of the chain will chew through its allotted tasks and move on.

This stream processing system is powered by NSQ (documentation), about which much has been written and said, both on this very blog (here, here, and here) and elsewhere. I won’t add more, as I’m far from an expert (yet!), but I will say that I am impressed with how useful NSQ is for building large distributed systems that are remarkably resilient.
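For a flavor of what a single link in that chain looks like, here is a minimal pynsq consumer sketch (this is not actual bitly code; the topic, channel and transformation are placeholders):

import nsq

def handle_event(message):
    """Perform one logical transformation on one incoming event."""
    event = message.body  # e.g. a JSON-encoded decode event
    # ... transform the event, write it somewhere, or re-publish it
    # to another topic for the next step in the chain ...
    return True  # True marks the message finished; False requeues it

reader = nsq.Reader(
    topic='decodes',            # placeholder topic name
    channel='example_step',     # placeholder channel name
    message_handler=handle_event,
    lookupd_http_addresses=['http://127.0.0.1:4161'],
    max_in_flight=100,
)
nsq.run()

Because every consumer on a channel splits that channel’s messages, adding capacity to a step is just a matter of running more copies of the same consumer.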

A Fanatical Attention to Code Quality

Another aspect of Bitly that has made a great impression on me is the devotion to code quality embodied in the code review process. Bitly experienced enormous growth at a time before modern configuration management tools became popular, and as a result wound up building their own system for managing server configuration. There is a certain amount of cruft in the system (how could there not be?), but Bitly’s engineers have paid a great deal of attention over time to making the deployment system as streamlined as possible. After all, maintaining the fleets of servers necessary to keep Bitly running is no small task. And that attention to operational maintainability spills over to the code that runs on those servers. Bitly has a code review process where equal emphasis is placed on functional correctness and test coverage, and on operational ease and maintainability. I’ve never had my code pored over with such a fine-toothed comb as I’ve had here, and going through the review process made me a better programmer overnight. In previous positions, I’ve quickly produced code that works; here at Bitly, I produce code that works, is aesthetically and semantically appropriate (i.e., consistent naming, following a reasonable style guide), and fits conceptually within the greater whole that is our code base. The review process can be frustrating at times, as I attempt to figure out the most efficient way to get my changes merged, but overall is a huge benefit, contributing greatly to the quality of the Bitly product.

A colleague asked me to comment on whether rigorous code review is better or worse than pair programming at improving code quality, since pair programming is something he has not done. My experience with pair programming is limited, but in that experience, pair programming does not provide a huge benefit to code quality. Instead, it is much more useful for design quality, hashing out architectural issues, and for transferring knowledge. The kinds of issues I’ve caught in pair programming, or been caught creating, are typically typos or minor logic bugs (brainos). These are the kinds of bugs that pop up immediately when you try to run your code for the first time, or when you run the tests. (Tests are a given, right? Everybody writes tests nowadays.) So while there might be a tiny bit of added productivity from pair programming on the code quality front, that benefit is offset by consuming double the amount of programmer hours. The trade-off is that rigorous code review improves code quality a great deal, but does tend to lose sight of architecture and design issues. It encourages deep focus on the code itself, without considering the design. I think code review is necessary (or at least more beneficial) for code quality, while pair programming is not. Pair programming can be swapped for design meetings, thus reducing the total time spent by multiple developers on a single task.

A New (to Bitly) Approach to Teams

A major change we’ve instituted recently is to create what are being called “feature teams”. These feature teams are composed of a cross-functional slice of Bitly, including back-end developers, front-end developers, product and project management, and most importantly, business stakeholders from our Customer Success team. Each feature team is tasked with making improvements to our products, starting with different sections of the Bitly Brand Tools. I think this is the number one change towards better directing Bitly’s amazing technical talent to creating something useful for our customers, rather than just yet another neat technical tool. With our Customer Success team getting feedback on our proposed improvements directly from our customers, we are now in a perfect position to make Bitly the best source of insight it can be. And that is our ultimate goal, to provide our customers with better insight into the world around them.

In my previous experience, I’ve never seen “improvements” ever actually improve anything without feedback from customers. Near-misses, yes, but not actual hits. The inspiration should often come from within, as we are in the best position to improve existing features for all our customers, rather than just taking the opinion of one. But without business-side involvement, and without customer feedback, I’ve never seen a tech-driven improvement result in success for the actual end-user, unless the intended end-user is in fact technical. That is why a large percentage of start-ups focus on tools for other engineers: it’s easier to get started.

10 Things We Forgot to Monitor

There is always a set of standard metrics that are universally monitored (Disk Usage, Memory Usage, Load, Pings, etc). Beyond that, there are a lot of lessons that we’ve learned from operating our production systems that have helped shape the breadth of monitoring that we perform at bitly.

One of my favorite all-time tweets is from @DevOps_Borat

“Law of Murphy for devops: if thing can able go wrong, is mean is already wrong but you not have Nagios alert of it yet.”

What follows is a small list of things we monitor at bitly that have grown out of those (sometimes painful!) experiences, and where possible little snippets of the stories behind those instances.

1 - Fork Rate

We once had a problem where IPv6 was intentionally disabled on a box via options ipv6 disable=1 and alias ipv6 off in /etc/modprobe.conf. This caused a large issue for us: each time a new curl object was created, modprobe would spawn, checking net-pf-10 to evaluate IPv6 status. This fork bombed the box, and we eventually tracked it down by noticing that the process counter in /proc/stat was increasing by several hundred a second. Normally you would only expect a fork rate of 1-10/sec on a production box with steady traffic.

check_fork_rate.sh
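That check essentially samples the processes counter in /proc/stat twice and alerts on the delta; a rough Python sketch of the same idea (the threshold here is illustrative):

import time

def count_forks():
    """Total forks since boot, from the 'processes' line of /proc/stat."""
    with open('/proc/stat') as f:
        for line in f:
            if line.startswith('processes '):
                return int(line.split()[1])

def fork_rate(interval=5):
    before = count_forks()
    time.sleep(interval)
    return (count_forks() - before) / float(interval)

rate = fork_rate()
# A steady production box usually sits around 1-10 forks/sec, so a few hundred is suspicious.
print('%s - fork rate %.1f/sec' % ('CRITICAL' if rate > 100 else 'OK', rate))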

2 - flow control packets

TL;DR: If your network configuration honors flow control packets and isn’t configured to disable them, they can temporarily cause dropped traffic. (If this doesn’t sound like an outage, you need your head checked.)

$ /usr/sbin/ethtool -S eth0 | grep flow_control
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0

Note: Read this to understand how these flow control frames can cascade to switch-wide loss of connectivity if you use certain Broadcom NICs. You should also trend these metrics on your switch gear. While you’re at it, watch your dropped frames.

3 - Swap In/Out Rate

It’s common to check for swap usage above a threshold, but even if you have a small quantity of memory swapped, it’s actually the rate at which memory is swapped in and out that impacts performance, not the quantity. This is a much more direct check for that state.

check_swap_paging_rate.sh

4 - Server Boot Notification

Unexpected reboots are part of life. Do you know when they happen on your hosts? Most people don’t. We use a simple init script that triggers an ops email on system boot. This is valuable for communicating the provisioning of new servers, and it helps capture state changes even when services handle the failure gracefully without alerting.

notify.sh

5 - NTP Clock Offset

If not monitored, yes, one of your servers is probably off. If you’ve never thought about clock skew you might not even be running ntpd on your servers. Generally there are 3 things to check for. 1) That ntpd is running, 2) Clock skew inside your datacenter, 3) Clock skew from your master time servers to an external source.

We use check_ntp_time for this check.
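If you want to trend the offset yourself, the third-party ntplib package makes it a one-liner (the reference server and threshold below are arbitrary):

import ntplib

# Ask a reference server how far off our local clock is, in seconds.
response = ntplib.NTPClient().request('0.pool.ntp.org', version=3)
offset = abs(response.offset)
print('%s - clock offset %.3fs' % ('CRITICAL' if offset > 0.5 else 'OK', offset))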

6 - DNS Resolutions

Internal DNS - It’s a hidden part of your infrastructure that you rely on more than you realize. The things to check for are 1) local resolution from each server, 2) resolution against (and query volume on) any local DNS servers in your datacenter, and 3) availability of each upstream DNS resolver you use.

External DNS - It’s good to verify your external domains resolve correctly against each of your published external nameservers. At bitly we also rely on several ccTLDs, and we monitor those authoritative servers directly as well (yes, it’s happened that all authoritative nameservers for a TLD have been offline).
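A minimal version of the “local resolution from each server” check is just timing a lookup and alerting on failure or slowness (the hostname and threshold here are arbitrary):

import socket
import time

def check_resolution(name):
    start = time.time()
    try:
        addr = socket.gethostbyname(name)
    except socket.gaierror as e:
        return 'CRITICAL - cannot resolve %s: %s' % (name, e)
    elapsed = time.time() - start
    status = 'WARNING' if elapsed > 0.5 else 'OK'
    return '%s - %s resolved to %s in %.3fs' % (status, name, addr, elapsed)

print(check_resolution('bitly.com'))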

7 - SSL Expiration

It’s the thing everyone forgets about because it happens so infrequently. The fix is easy: check it and get alerted with enough lead time to renew your SSL certificates.

define command{
    command_name    check_ssl_expire
    command_line    $USER1$/check_http --ssl -C 14 -H $ARG1$
}
define service{
    host_name               virtual
    service_description     bitly_com_ssl_expiration
    use                     generic-service
    check_command           check_ssl_expire!bitly.com
    contact_groups          email_only
    normal_check_interval   720
    retry_check_interval    10
    notification_interval   720
}

8 - DELL OpenManage Server Administrator (OMSA)

We run bitly split across two data centers, one is a managed environment with DELL hardware, and the second is Amazon EC2. For our DELL hardware it’s important for us to monitor the outputs from OMSA. This alerts us to RAID status, failed disks (predictive or hard failures), RAM Issues, Power Supply states and more.

9 - Connection Limits

You probably run things like memcached and mysql with connection limits, but do you monitor how close you are to those limits as you scale out application tiers?

Related to this is addressing the issue of processes running into file descriptor limits. We make a regular practice of running services with ulimit -n 65535 in our run scripts to minimize this. We also set Nginx worker_rlimit_nofile.
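For the file descriptor side of this, /proc has everything you need; a quick sketch (the 90% warning threshold is arbitrary):

import os

def fd_usage(pid):
    """Return (open_fds, soft_limit) for a process, straight from /proc."""
    open_fds = len(os.listdir('/proc/%d/fd' % pid))
    with open('/proc/%d/limits' % pid) as f:
        for line in f:
            if line.startswith('Max open files'):
                return open_fds, int(line.split()[3])

used, limit = fd_usage(os.getpid())
status = 'WARNING' if used > 0.9 * limit else 'OK'
print('%s - %d of %d file descriptors in use' % (status, used, limit))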

10 - Load Balancer Status

We configure our Load Balancers with a health check which we can easily force to fail in order to have any given server removed from rotation. We’ve found it important to have visibility into the health check state, so we monitor and alert based on the same health check. (If you use EC2 Load Balancers, you can monitor the ELB state from the Amazon APIs.)

Various Other things to watch

New entries written to Nginx error logs, service restarts (assuming you have something in place to auto-restart them on failure), NUMA stats, and new process core dumps (great if you run any C code).

EOL

This scratches the surface of how we keep bitly stable, but if that’s an itch you like scratching, we’re hiring.

Adventures in Optimizing Text Processing

Lessons learned while post-processing 1.75 billion lines of Hadoop output.

The Problem

Recently, I encountered a problem. I had a nightly Hadoop job running on EMR that churned over the past 30 days’ worth of Bitly redirect data in order to run reach analysis pertaining to about 1000 of our paid accounts. This job resulted in 175 gzipped “part” files, each containing at least 10 million lines of data. I needed to collate that data after the Hadoop job ran.

> ls part-*.gz
part-0000.gz
...
part-0174.gz

> zcat part-0000.gz | wc -l
10000000

The Hadoop output data inside the part files consisted of things like this:

"3,g,05-02,12SIMV6" 329
"175,geo,05,US,GA,Atlanta"  9987
"10,phrase,05,egg foo young"    1093
"11,n_clicks,05"    393999

Those were comma-delimited keys with the following structure and a count:

"[ACCOUNT_ID],[METRIC_TYPE],[DATE],[VALUE]"    COUNT

The challenge was this: How do I efficiently separate out this data by ACCOUNT_ID and METRIC_TYPE? That is, I wanted one file per ACCOUNT_ID-METRIC_TYPE combination.

First, Look on the Shelf

Like many people churning through volumes of data, we make use of the mr_job python package for our Hadoop processing. At first I thought this was a no-brainer: “I’ll use the oddjob plugin. Yay, a solution already exists!” The plugin’s description was tailor-made for me:

“oddjob.MultipleJSONOutputFormat - Writes to the directories specified by the first element in the key” – https://github.com/jblomo/oddjob

Wrong.

  • The oddjob plugin wouldn’t run at all on our Hadoop cluster.
  • The oddjob plugin wouldn’t run consistently on EMR.
  • This approach resulted in 890 x 175 x 5 = ~800K part files. To scp 800K files from EMR takes a nightmarishly long time.

Secondary Hadoop Jobs

After days of struggling with oddjob, I cut bait on it and looked at running a set of secondary Hadoop jobs, using the output from the first Hadoop job as input to the second ones. Something like this:

for account_id in $account_ids
do
    run_emr_job_to_extract $account_id
done

Even if each job only took one minute (which it wouldn’t), 890 mins == 14 hours. That was no good.

On to Text Processing

Here’s where I started putting the bash time command and python timeit to work to try to whittle down to an approach that was viable.

zgrep

First I tried the most straightforward approach I could think of – zgrep.

for account_id in $account_ids
do
    zgrep "^\"$account_id," part-*gz > $account_id.txt
done

That took 11.5 mins per account. 11.5 mins * 890 accounts = 170 hours

Blah. No.

zcat | awk

Next, I played around with zcat piped to awk. My final version of that was something akin to this:

zcat part-*gz | awk -F'[,\"]' '
{
    print >> $2"-"$3".txt"
}'

This approach seemed reasonably concise but it still took 15 hours to run. So yeah, no.

The problem with this approach is that it results in way more syscalls than is necessary. If you think about it, that append operator (>>) opens, writes, and closes the file being appended to every time you call it. Yikes, that’s 5.2 billion syscalls – three for each line of data!

Onward.

Python-Based Solutions

At this point, it was seeming like a bash solution wasn’t going to do it, so I ventured back into python. This was the concept:

import gzip
for part_file in part_files:
    with gzip.open(part_file, 'rb') as f:
        for line in f:
            # 1. parse line for account_id and metric_type
            # 2. write to appropriate file for account_id and metric_type

Before I even get into that second for loop, let’s consider gzip.open(). Let’s consider it, and then let’s ditch it.

Python gzip – Oh the Pain!

gzip.open() on those 10-million-line files was as slow as molasses in January. My first pass at an all-python solution took 15 hours. Much of that was spent in gzip.open(). Look at this:

> cat test-gzip.py
import gzip
f = gzip.open('10-million-line-file.gz')
for line in f:
    pass
f.close()

> time python test-gzip.py;  # on an m1.large
real    3m7.687s
user    3m6.844s
sys     0m0.068s

That’s 3.1 minutes per file. 3.1 mins * 175 files = 9 hours. JUST TO READ THE FILES!

zcat is much faster, so I quickly switched to a zcat-python combo solution. This approach also happens to leverage 2 cores, one for the zcat, the other for the python script.

> cat test-zcat.py
import sys
for line in sys.stdin:
    pass

> time $(zcat 10-million-lines.gz | python test-zcat.py)
real    0m3.642s
user    0m5.056s
sys     0m0.508s

3.6 SECONDS per file is much nicer, no? 3.6 seconds * 175 files = 10 minutes spent reading files. Definitely acceptable.

Dive into the python for loop

So now we’re here:

zcat part-*.gz | parse_accounts.py

What shall that parse_accounts.py look like? Something like this now:

import sys
for line in sys.stdin:
   # 1. parse line for account_id and metric_type
   # 2. write to appropriate file for account_id and metric_type

Parsing the lines

Recall that we want to extract ACCOUNT_ID and METRIC_TYPE from each line, and each line takes this form:

"[ACCOUNT_ID],[METRIC_TYPE],[DATE],[VALUE]"    COUNT

I’ll save us some pain and tell you to forget doing regular expression group matches. Using re.match() was 2X slower than line.split(). The fastest way I found was this:

# 1. parse line for account_id and metric_type
key = line.split(',')
account_id = key[ACCOUNT_ID_INDEX][1:] # strip the leading quote (")
metric_type = key[METRIC_TYPE_INDEX]

Note: I’m putting account_id and metric_type in variables here for clarity’s sake, but let me say this: If you’re going to be running a piece of code 1.75 billion times, it’s time to abandon clarity and embrace efficiency. If you’re only going to be accessing a variable one time, don’t bother setting it in a variable. If more than once, do bother setting it. You’ll see what I mean when it all comes together below.
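For what it’s worth, that 2X figure is easy to sanity-check with timeit (the regex below is just an illustrative equivalent of the split, and exact numbers will vary by machine):

import re
import timeit

line = '"175,geo,05,US,GA,Atlanta"  9987'
pattern = re.compile(r'"(\d+),(\w+),')

n = 1000000
print('split: %.2fs' % timeit.timeit(lambda: line.split(','), number=n))
print('regex: %.2fs' % timeit.timeit(lambda: pattern.match(line), number=n))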

Writing the files

At first, I tried to be clever and only open the files as needed, like this:

import sys
from collections import defaultdict
OUT_FILES = defaultdict(dict)
for line in sys.stdin:
    # 1. parse line for account_id and metric_type
    key = line.split(',')
    account_id = key[ACCOUNT_ID_INDEX][1:] # strip leading quote
    metric_type = key[METRIC_TYPE_INDEX]

    # 2. write to appropriate file for account_id and metric_type
    if metric_type not in OUT_FILES[account_id]:
         OUT_FILES[account_id][metric_type] = \
                 open(os.path.join(account_id, key[METRIC_TYPE_INDEX]), "wb")

    OUT_FILES[account_id][metric_type].write(line)

close_outfiles()  # close all the files we opened

Here I was at about 5 hours of processing time. Not bad considering where I started, but more paring could be done.

Again, I invoke the admonition to optimize the hell out of things when you’re performing an operation so many times.

So I should eliminate the extraneous if statement, right? Why did I need to open the files conditionally? So I didn’t open files unnecessarily? Who cares if a few extra files are opened and not used?

I ended up at essentially something like this:

import sys
from collections import defaultdict
OUT_FILES = defaultdict(dict)

open_outfiles()  # open all files I could possibly need

for line in sys.stdin:
    # 1. parse line for account_id and metric_type
    key = line.split(',')
    account_id = key[ACCOUNT_ID_INDEX][1:] # strip leading quote

    # 2. write to appropriate file for account_id and metric_type
    OUT_FILES[account_id][key[METRIC_TYPE_INDEX]].write(line)

close_outfiles()  # close all the files we opened

Final execution time: 1 hour 50 minutes.

try/finally

One last takeaway here is that sometimes you need to invest a good bit of development time in order to save yourself what might be an unacceptable amount of processing time. It requires patience and can be painful, sometimes with days’ worth of work ultimately abandoned.

But if you find yourself thinking, “There’s got to be a way to make this faster,” then it’s at least worth trying. I stopped short of rewriting it in C. Now that would have been faster!

If you’re interested in working on the next set of problems at Bitly, we’re actively hiring.

Getting more clues from Python’s logging module

During development of Python web applications, there are a lot of tools that can help provide clues about what caused a bug or exception. For example, Django’s default debug error page prints out local variables at every stack frame when there’s an unhandled exception. The popular django_debugtoolbar adds more information. In a similar vein there are things like Pyramid’s pyramid_debugtoolbar and its predecessor WebError.

But when troubleshooting production issues, those tools aren’t available, for good reason: we need the useful information exposed to developers without overwhelming our end users or exposing sensitive information to malicious eyes.

So, we turn to logging instead. But that provides a lot less information out of the box. So at bitly, we often found ourselves in an irritating cycle that went something like this:

  • Sometime soon after a deploy, we notice an unusual number of exceptions being logged. (Let’s say that old Python chestnut, 'NoneType' object is unsubscriptable.) So we know that we have some code that is expecting a dict or list and getting None instead.

  • We may decide the error is not critical enough to roll back without first trying to diagnose the underlying cause, but of course we want to fix it quickly. The pressure is on.

  • We find the traceback in our logs. It looks like:

[Python traceback example]

  • But where’s the None? Is it response['data'] or is it one of the entry values? We have no way to see. (Tip: We can actually tell that it’s not response['data']['subtotal']. Do you know why? Would you remember that while under pressure to fix a production bug?)

  • We try, and fail, to provoke the error on our local development or staging environments.

  • Then we add some logging code.

  • We get a quick code review and deploy the logging code.

  • We wait for the error to happen again.

  • We find the log message and realize we forgot to log everything we wanted to see, or we need more contextual information. Go back 3 steps and repeat as needed.

  • Finally we can see what the bad value was and we finally have enough clues that we can understand the cause of the error and fix it.

This is a frustrating amount of overhead when you’re trying to diagnose a production issue as quickly as possible, even with multiple people helping out.

Fortunately, if you make liberal use of logging.exception() – for example, if your web framework calls it on any unhandled exception, and if you take care to call it on unknown exceptions that you deliberately catch – then it’s actually easy to change your logging configuration so that a lot more useful information gets logged. And you can do this globally, without sprinkling logging changes throughout your code.

Here’s one possible way to hack your logging config to do this:

[Python log handler example]
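A minimal sketch of a formatter in that spirit (the real handler differs in its details, but the idea is to walk to the innermost frame and append its locals to the usual traceback):

import logging

MAX_VARS_LINES = 30
MAX_LINE_LENGTH = 100

class VerboseExceptionFormatter(logging.Formatter):
    def __init__(self, log_locals_on_exception=True, *args, **kwargs):
        self._log_locals = log_locals_on_exception
        super(VerboseExceptionFormatter, self).__init__(*args, **kwargs)

    def formatException(self, exc_info):
        # Start with the standard formatted traceback.
        formatted = super(VerboseExceptionFormatter, self).formatException(exc_info)
        if not self._log_locals or exc_info[2] is None:
            return formatted
        # Walk down to the innermost frame, the one that raised.
        tb = exc_info[2]
        while tb.tb_next:
            tb = tb.tb_next
        lines = [formatted, 'Locals at innermost frame:']
        local_vars = sorted(tb.tb_frame.f_locals.items())
        for name, value in local_vars[:MAX_VARS_LINES]:
            line = '%s: %r' % (name, value)
            if len(line) > MAX_LINE_LENGTH:
                line = line[:MAX_LINE_LENGTH - 3] + '...'
            lines.append(line)
        return '\n'.join(lines)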

Then, in whatever code sets up your logging configuration, you can add this to the relevant handler like so:

handler.setFormatter(VerboseExceptionFormatter(log_locals_on_exception=True))

The MAX_LINE_LENGTH and MAX_VARS_LINES values are used to limit the amount of data we dump out. And so far we’re only logging the local variables from the innermost frame, on the assumption that those are the most likely to be useful.

Having deployed this logging configuration, our hypothetical traceback would look more like this:

[Python traceback example, with locals]

Aha. Now we can clearly see that some upstream service failed, and our code is failing to handle that. And we can take action immediately.

This technique is not a magic crystal ball, certainly; sometimes the important clues are elsewhere in the stack, or they’re in the current global scope. If we wanted to, it would be trivial to log locals and globals at every frame, just by grabbing the values of tb.tb_frame.f_locals and tb.tb_frame.f_globals during the while tb.tb_next loop. But that could get very verbose, and some of our services get hit very hard and log a lot, so we haven’t gone that far yet.

We have already used these enhanced log entries to quickly diagnose a couple of bugs in production, so we’re happy with this little tweak. Hopefully you will find it useful too.

If you’d like to get your hands directly on our code, we’re hiring.

Networking: Using Linux Traffic Control for Fun and Profit Loss Prevention

Here at bitly, we are big fans of data, tubes and especially tubes that carry data.

This is a story about asking tubes to carry too much data.

A physical hadoop cluster has been a significant part of bitly’s infrastructure and a core tool of the bitly Data Science and Ops/Infra teams for some time. Long enough that we needed to cycle in a new cluster, copy data, and fold the old into the new.

Branches were opened, work was done, servers provisioned and the new cluster was stood up. Time to take a step back from the story and flesh out some technical details:

  • bitly operates at a consequential scale of data: at the time of this migration the hadoop cluster’s consumed disk space was just over 150TB of compressed stream data, that is, data that is the valuable output of our various applications after having been manipulated and expanded on by other applications.

  • bitly’s physical presence is colocated with our data center partner. There are three physical chassis classes (application, storage and database) racked together in contiguous cabinets in rows. At the time of this story each chassis had three physical 1Gb Ethernet connections (each logically isolated by VLANs): frontlink, backlink and lights-out (for out-of-band management of the server chassis). Each connection, after a series of cabinet-specific patch panels and switches, connects to our core switches over 10Gb glass in a hub and spoke topology.

  • While bitly also operates at a consequential physical scale (hundreds of physical server chassis), we depend on our data center partner for network infrastructure and topology. This means that within most levels of the physical networking stack, we have severely limited control and visibility.

Back to the story:

The distcp tool bundled with hadoop allowed us to quickly copy data from one cluster to the other. Put simply, the distcp tool creates a mapreduce job to shuffle data from one hdfs cluster to another, in a many-to-many node copy. Distcp was fast, which was good.

bitly broke, and the Ops/Infra team was very sad.

Errors and latent responses were being returned to website users and api clients. We discovered that services were getting errors from other services, database calls were timing out and even DNS queries internal to our network were failing. We determined that the copy had caused unforeseen duress on the network tubes, particularly the tubes of the network that carried traffic across physical cabinets. However, the information given to us by our data center partner only added to the confusion: no connection, not even cabinet to core, showed signs of saturation, congestion or errors.

We now had two conflicting problems: a need to continue making headway with the hadoop migration as well as troubleshooting and understanding our network issues.

We limited the number of mapreduce mappers that the distcp tool used to copy data between clusters, which artificially throttled throughput for the copies, allowing us to resume moving forward with the migration. Eventually the copies completed, and we were able to swap the new cluster in for the old.

The new cluster had more nodes, which meant that hadoop was faster.

The Data Science team was happy.

Unfortunately, with hadoop being larger and faster, more data was getting shipped around to more nodes during mapreduce workloads, which had the unintended effect of:

bitly broke, a lot. The Ops/Infra team was very sad.

The first response action was to turn hadoop off.

The Data Science team was sad.

A turned off hadoop cluster is bad (just not as bad as breaking bitly), so we time warped the cluster back to 1995 by forcing all NICs to re-negotiate at 100Mbps (as opposed to 1Gbps) using ethtool -s eth1 speed 100 duplex full autoneg on. Now we could safely turn hadoop on, but it was painfully slow.

The Data Science team was still sad.

In fact it was so slow and congested that data ingestion and scheduled ETL/reporting jobs began to fail frequently, triggering alarms that woke Ops/Infra team members up in the middle of the night.

The Ops/Infra team was still sad.

Because of our lack of sufficient visibility into the state of the network, working through triage and troubleshooting with our data center partner was going to be an involved and lengthy task. Something had to be done to get hadoop into a usable state, while protecting bitly from breaking.

Time to take another step back:

Some tools we have at bitly:

  • roles.json : A list of servers (app01, app02, userdb01, hadoop01 etc), roles (userdb, app, web, monitoring, hadoop_node etc), and the mapping of servers into roles (app01,02 -> app, hadoop01,02 -> hadoop_node etc).

  • $datacenter/jsons/* : A directory containing a json file per logical server, with attributes describing the server such as ip address, names, provisioning information and most importantly for this story; cabinet location.

  • Linux : Linux.

Since we could easily identify what servers do what things, where that server is racked and can leverage all the benefits of Linux, this was a solvable problem, and we got to work.

And the Ops/Infra team was sad.

Because Linux’s networking Traffic Control (tc) syntax was clunky and awkward and its documentation intimidating. After much swearing and keyboard smashing, perseverance paid off and working examples of tc magic surfaced. Branches were opened, scripts were written, deploys done, benchmarks run and finally some test nodes were left with the following:

$ tc class show dev eth1
class htb 1:100 root prio 0 rate 204800Kbit ceil 204800Kbit burst 1561b
    cburst 1561b
class htb 1:10 root prio 0 rate 819200Kbit ceil 819200Kbit burst 1433b 
    cburst 1433b
class htb 1:20 root prio 0 rate 204800Kbit ceil 204800Kbit burst 1561b 
    cburst 1561b

$ tc filter show dev eth1
filter parent 1: protocol ip pref 49128 u32 
filter parent 1: protocol ip pref 49128 u32 fh 818: ht divisor 1 
filter parent 1: protocol ip pref 49128 u32 fh 818::800 order 2048 key 
    ht 818 bkt 0 flowid 1:20 
    match 7f000001/ffffffff at 16
filter parent 1: protocol ip pref 49129 u32 
filter parent 1: protocol ip pref 49129 u32 fh 817: ht divisor 1 
filter parent 1: protocol ip pref 49129 u32 fh 817::800 order 2048 key 
    ht 817 bkt 0 flowid 1:10 
    match 7f000002/ffffffff at 16
filter parent 1: protocol ip pref 49130 u32 
filter parent 1: protocol ip pref 49130 u32 fh 816: ht divisor 1 
filter parent 1: protocol ip pref 49130 u32 fh 816::800 order 2048 key 
    ht 816 bkt 0 flowid 1:20 
    match 7f000003/ffffffff at 16
<snipped>

$ tc qdisc show
qdisc mq 0: dev eth2 root 
qdisc mq 0: dev eth0 root 
qdisc htb 1: dev eth1 root refcnt 9 r2q 10 default 100 
    direct_packets_stat 24

In plain English, there are three traffic control classes. Each class represents a logical group, to which a filter can be subscribed, such as:

class htb 1:100 root prio 0 rate 204800Kbit ceil 204800Kbit burst 1561b cburst 1561b

Each class represents a ceiling or throughput limit of outgoing traffic aggregated across all filters subscribed to that class.

Each filter is a specific rule for a specific IP (unfortunately each IP is printed in hex), so the filter:

filter parent 1: protocol ip pref 49128 u32 
filter parent 1: protocol ip pref 49128 u32 fh 818: ht divisor 1 
filter parent 1: protocol ip pref 49128 u32 fh 818::800 order 2048 key 
    ht 818 bkt 0 flowid 1:20 
    match 7f000001/ffffffff at 16

can be read as “subscribe hadoop14 to the class 1:20” where “7f000001” can be read as the IP for hadoop14 and “flowid 1:20” is the class being subscribed to.

Finally there is a qdisc, which is more or less the active queue for the eth1 device. That queue defaults to placing any host that is not otherwise defined in a filter for a class into the 1:100 class.

qdisc htb 1: dev eth1 root refcnt 9 r2q 10 default 100 direct_packets_stat 24

With this configuration, any host, hadoop or not, that is in the same cabinet as the host being configured gets a filter that is assigned to the “1:10” class, which allows up to ~800Mbps for the class as a whole. Similarly, there is a predefined list of roles that are deemed “roles of priority hosts”, which get a filter created on the same “1:100” rule. These are hosts that do uniquely important things, like running the hadoop namenode or jobtracker services, and also our monitoring hosts.

Any other hadoop host that is not in the same cabinet is attached to the “1:20” class, which is limited to a more conservative ~200Mbps class.

As mentioned before, any host not specified by a filter gets caught by the default class for the eth1 qdisc, which is “1:100”.
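The glue that generates those classes and filters can be surprisingly small once roles.json and the per-server jsons are in hand. A hypothetical sketch of that kind of script (the paths, file formats, rates, device and role names here are illustrative, not our actual tooling):

import json
import subprocess

DEV = 'eth1'
MY_CABINET = 'cab07'                                  # cabinet of the host being configured
PRIORITY_ROLES = {'hadoop_namenode', 'jobtracker', 'monitoring'}

def tc(cmd):
    subprocess.check_call(['tc'] + cmd.split())

# One htb qdisc on eth1; anything without an explicit filter lands in 1:100.
tc('qdisc add dev %s root handle 1: htb default 100' % DEV)
tc('class add dev %s parent 1: classid 1:10 htb rate 800mbit ceil 800mbit' % DEV)
tc('class add dev %s parent 1: classid 1:20 htb rate 200mbit ceil 200mbit' % DEV)
tc('class add dev %s parent 1: classid 1:100 htb rate 200mbit ceil 200mbit' % DEV)

roles = json.load(open('roles.json'))                 # {"hadoop14": ["hadoop_node"], ...}
for host, host_roles in roles.items():
    info = json.load(open('jsons/%s.json' % host))    # {"ip": "...", "cabinet": "..."}
    if info['cabinet'] == MY_CABINET:
        classid = '1:10'      # same cabinet: up to ~800Mbps
    elif PRIORITY_ROLES & set(host_roles):
        classid = '1:100'     # namenode, jobtracker, monitoring and friends
    elif 'hadoop_node' in host_roles:
        classid = '1:20'      # cross-cabinet hadoop traffic: ~200Mbps
    else:
        continue              # everyone else falls into the default class anyway
    tc('filter add dev %s parent 1: protocol ip u32 match ip dst %s/32 flowid %s'
       % (DEV, info['ip'], classid))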

What does this actually look like? Here is a host that is caught by the catch all “1:100” rule:

[root@hadoop27 ~]# iperf -t 30 -c NONHADOOPHOST
------------------------------------------------------------
Client connecting to NONHADOOPHOST, TCP port 5001
TCP window size: 23.2 KByte (default)
------------------------------------------------------------
[  3] local hadoop27 port 35897 connected with NONHADOOPHOST port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.1 sec   735 MBytes   205 Mbits/sec

Now when connecting to another host in the same cabinet, or the “1:10” rule :

[root@hadoop27 ~]# iperf -t 30 -c CABINETPEER
------------------------------------------------------------
Client connecting to CABINETPEER, TCP port 5001
TCP window size: 23.2 KByte (default)
------------------------------------------------------------
[  3] local hadoop27 port 39016 connected with CABINETPEER port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  2.86 GBytes   820 Mbits/sec

Now what happens when connecting to two servers that match the “1:10” rule?

[root@hadoop27 ~]# iperf -t 30 -c CABINETPEER1
------------------------------------------------------------
Client connecting to CABINETPEER1, TCP port 5001
TCP window size: 23.2 KByte (default)
------------------------------------------------------------
[  3] local hadoop27 port 39648 connected with CABINETPEER1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  1.47 GBytes   421 Mbits/sec

[root@hadoop27 ~]# iperf -t 30 -c CABINETPEER2
------------------------------------------------------------
Client connecting to 10.241.28.160, TCP port 5001
TCP window size: 23.2 KByte (default)
------------------------------------------------------------
[  3] local hadoop27 port 38218 connected with CABINETPEER2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  1.43 GBytes   408 Mbits/sec

So the traffic got halved? Sounds about right.

Even better, trending the data was relatively easy by mangling the stats output to our trending services:

$ /sbin/tc -s class show dev eth1 classid 1:100
class htb 1:100 root prio 0 rate 204800Kbit ceil 204800Kbit 
    burst 1561b cburst 1561b 
Sent 5876292240 bytes 41184081 pkt (dropped 0, overlimits 0 requeues 0) 
rate 3456bit 2pps backlog 0b 0p requeues 0 
lended: 40130273 borrowed: 0 giants: 0
tokens: 906 ctokens: 906

After testing, we cycled through hadoop hosts, re-enabling their links to 1Gb after applying the traffic control rules. With deploys done, hadoop was usably performant.

The Data Science team was happy.

The Ops/Infra team could begin tackling longer term troubleshooting and solutions while being able to sleep at night, knowing that bitly was not being broken.

The Ops/Infra team was happy.

Take aways:

  • In dire moments: your toolset for managing your environment is as important as the environment itself. Because we already had the toolset available to holistically control the environment, we were able to dig ourselves out of the hole almost as quickly as we had fallen into it.

  • Don’t get into dire moments: Understand the environment that you live in. In this case, we should have had a better understanding and appreciation for the scope of the hadoop migration and its possible impacts.

  • Linux TC is a high cost, high reward tool. It was almost certainly written by people with the very longest of beards, and requires time and patience to implement. However we found it to be an incredibly powerful tool that helped save us from ourselves.

  • Linux: Linux

EOL

This story is a good reminder of the “Law of Murphy for devops”. Temporary solutions like those in this story afforded us the time to complete troubleshooting of our network and implement permanent fixes. We have since unthrottled hadoop and moved it to its own dedicated network, worked around shoddy network hardware to harden our primary network and much more. Stay tuned.