Measure, Monitor, Observe and supervise
13 Jan 2014
Around 3 months ago (probably more, but I can’t recall the exact date), Gogobot set out to do a complete revamp of the underlying infrastructure that runs the site.
We decided to go for it for various reasons, like performance, maintainability, but the main reason was to do it right, with Chef, documentation etc…
Right off the bat, I knew I’d need help doing it, so I called my buddy @kesor6 who runs a Devops shop in Israel to help out.
Evgeny was (and still is) great. He took care of all the heavy lifting while I was still focused on delivering features, so he wrote the majority of what we needed, up to the point where domain knowledge was required, which is where I came in.
Even after all that heavy lifting, we did a ton of pair programming, which was one of the best experiences I had pairing with anyone, ever.
One of the main things Evgeny recommended was to start monitoring everything. I immediately said, “Dude, we have monitoring, what the hell are you talking about?” We had NewRelic, NewRelic server monitor, Pingdom, PagerDuty, and Amazon monitoring on all servers, including ELB, DB and everything.
I didn’t actually realize what I was missing out on.
We installed a few new services
- Logstash with Kibana
- Graphite (w/ statsd, collectd, collectd plugins)
Logstash indexes your logs (you can control the index properties) and converts everything to a searchable format.
You don’t realize how much you need something like that until you have it. Since then I’ve had a couple of bugs that would have taken me 2-3 more hours to solve without it. Some of the things I can now ask:
- Show me all the API response codes of people uploading postcards.
- Show me how many requests came from GoogleBot compared to organic traffic over the last 24 hours (same for Bing)
- Show me all CSS requests that are broken (same for JS)
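To make “convert everything to a searchable format” concrete, here is a minimal Python sketch of the idea: parse each access-log line into named fields, then answer questions like the ones above by filtering on those fields. This is not how Logstash works internally (it has its own filters and stores events in an index). It assumes nginx’s default `combined` log format, and the sample lines, paths and user agents are made up for illustration.

```python
import re
from collections import Counter

# Regex for nginx's default "combined" access log format (assumed here).
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Turn one raw log line into a dict of fields, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])
    return event

# Hypothetical sample data; real lines would come from the access log.
SAMPLE_LOG = [
    '10.0.0.1 - - [13/Jan/2014:10:00:00 +0000] "POST /api/postcards HTTP/1.1" 201 512 "-" "ExampleApp/1.0"',
    '10.0.0.2 - - [13/Jan/2014:10:00:01 +0000] "POST /api/postcards HTTP/1.1" 301 178 "-" "ExampleApp/1.0"',
    '66.249.0.1 - - [13/Jan/2014:10:00:02 +0000] "GET /places/1 HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.3 - - [13/Jan/2014:10:00:03 +0000] "GET /assets/app.css HTTP/1.1" 404 0 "-" "Mozilla/5.0"',
]

events = [e for e in map(parse_line, SAMPLE_LOG) if e is not None]

# "Show me all the API response codes of people uploading postcards."
postcard_codes = Counter(
    e["status"] for e in events if e["path"].startswith("/api/postcards")
)

# "Show me how many requests came from GoogleBot."
googlebot_hits = sum(1 for e in events if "Googlebot" in e["agent"])

# "Show me all CSS requests that are broken."
broken_css = [
    e["path"] for e in events
    if e["path"].endswith(".css") and e["status"] >= 400
]
```

Once every line is a structured event, each of those questions becomes a one-line filter instead of a grep session across several servers.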
Kibana is the client-side application that takes everything and makes sense of it, so you can visualize everything in a beautiful way.
With Kibana you can save searches, make them your homepage and more.
All the knowledge you didn’t have about your app is at your fingertips now. Since getting it, I’ve used it for many things and for insights we did not have before.
One of the nastiest bugs I encountered lately was that our mobile app could not post postcards. It happened only to specific users, and once it happened you could not fix it; you had to reinstall the app.
Luckily, this issue wasn’t widespread, but even then, it was hell to debug.
Here’s what it looked like:
What we saw was that, instead of doing a POST, the phones that had the bug did a GET request, which was then retried over and over again.
We also saw that the buggy phones did not send the right headers, so they could not authenticate.
Just seeing everything as it was happening was mind blowing. Since we have multiple API servers, I would never have seen this on my own; it was too difficult.
What we soon figured out with Kibana is that the phones that had the bug did a POST, got a 301 response (permanent redirect), and from then on did a GET without even trying another POST.
This pointed us to the bug in our Nginx configuration, which was redirecting API requests (DON’T ever do that, trust me).
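The pattern makes sense once you remember that many HTTP clients switch from POST to GET when they follow a 301, and that a 301 may be cached permanently. As a rough sketch of the kind of search that surfaces it, the Python below scans parsed log events (in time order) for clients whose POST was answered with a 301 and who then sent a GET. The event fields, client ids, paths and the 401 statuses are hypothetical; in our case the search itself was done in Kibana, not in code.

```python
def find_redirected_posters(events):
    """Return clients whose POST got a 301 and who later sent a GET.

    `events` is assumed to be a chronologically ordered list of dicts
    with "client", "method", "path" and "status" keys.
    """
    redirected = set()   # clients that received a 301 for a POST
    suspects = []        # clients that then fell back to GET, in first-seen order
    for event in events:
        client = event["client"]
        if event["method"] == "POST" and event["status"] == 301:
            redirected.add(client)
        elif (event["method"] == "GET"
              and client in redirected
              and client not in suspects):
            suspects.append(client)
    return suspects

# Hypothetical events, shaped like parsed access-log lines.
SAMPLE_EVENTS = [
    {"client": "phone-a", "method": "POST", "path": "/api/postcards", "status": 201},
    {"client": "phone-b", "method": "POST", "path": "/api/postcards", "status": 301},
    {"client": "phone-b", "method": "GET", "path": "/api/postcards/", "status": 401},
    {"client": "phone-b", "method": "GET", "path": "/api/postcards/", "status": 401},
    {"client": "phone-c", "method": "GET", "path": "/api/places", "status": 200},
]

buggy_clients = find_redirected_posters(SAMPLE_EVENTS)
```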
Again, Logstash has pretty amazing defaults, so the index and the data being sent from Nginx are enough to debug most problems you can encounter.
We use Kibana as a first research tool when we get a bug report: looking at similar requests, checking how widespread the bug is, and more.
With Kibana, you can look at a specific client as well (based on IP, user_id and more).
I am guessing you can imagine the level of research and insights you can draw from it.
Graphite is perhaps the most important piece of the monitoring puzzle for us. I can’t begin to explain how much we use it these days, for things we never knew about.
Before I go any further into this… let me show you what we had before
As you can see, there’s a HUGE spike in request queueing, and this is what always frustrated me about NewRelic: WHY?
Why do I have such a request queue bottleneck? What happened? Did the DB spike? Did Redis spike? Are 50% of the LB servers dead?
What the hell is going on?
With NewRelic, we were blind. It was really frustrating at times, especially when you suddenly see a DB spike but have no idea what caused it, because you lack the reference.
One of the most annoying bugs we had for a while was a recurring DB spike. About once a week, the DB would spike to around 80-100ms per request, and after a minute settle back down to 10-15ms.
We were trying to figure it out in NewRelic, but the slow query log only showed fast queries that were queued up, nothing really helpful. This is where Graphite really shines.
We sat down and started looking at the stats, cross-referencing things with one another. Soon we realized that one of the admin controllers was doing too many long queries, which slowed down the DB.
But, we had no proof, so we graphed it.
What you see in this graph, is that every time this controller was requested, it would spike the database, sometimes a light spike and sometimes a bigger spike.
Also, as you can clearly see from the graph, the issue was fixed and then tested over and over again, without the DB spiking again.
Using Graphite for everything
Now we use Graphite for every measurement we need. We send disk, CPU and memory data and more (using Collectd), Nginx connection data (using the Nginx collectd plugin), everything that Rails supports through ActiveSupport, and also custom data about the scoring system and more.
The level of insight you begin to develop is sometimes mind blowing. It’s really hard to comprehend how blind I was to all of these things in the past.
http://codeascraft.com/ (the Etsy engineering blog) has a ton of insights. Just seeing how they use monitoring is amazing, and I have learned a great deal from reading about it. I recommend you do the same.
StatsD (from the GitHub README):
A network daemon that runs on the Node.js platform and listens for statistics, like counters and timers, sent over UDP and sends aggregates to one or more pluggable backend services (e.g., Graphite).
Since we don’t want a performance hit when sending stats to Graphite, we want to send UDP packets, which are fast and fire-and-forget.
Collectd is a daemon that periodically collects system performance metrics and sends them over to Graphite.
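To show what fire-and-forget means in practice, here is a minimal Python sketch of a StatsD client. It relies on the StatsD line format (`name:value|type`, with `c` for counters and `ms` for timers) and the default StatsD port 8125; the class name, the metric names and the localhost address are assumptions for the example, and in a real app you would more likely use an existing client library.

```python
import socket

def format_metric(name, value, metric_type):
    """Build a StatsD line such as 'api.postcards.created:1|c'."""
    return f"{name}:{value}|{metric_type}"

class StatsClient:
    """Tiny UDP StatsD client (illustrative sketch, not a full implementation)."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        # UDP socket: no connection setup and no waiting for a reply.
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, payload):
        try:
            self.sock.sendto(payload.encode("ascii"), self.addr)
        except OSError:
            # Metrics must never break the request that emits them.
            pass

    def incr(self, name, count=1):
        """Increment a counter."""
        self._send(format_metric(name, count, "c"))

    def timing(self, name, milliseconds):
        """Record how long something took, in milliseconds."""
        self._send(format_metric(name, milliseconds, "ms"))
```

Usage would look like `StatsClient().incr("api.postcards.created")`. If the StatsD daemon is down, the packet is simply lost and the application carries on, which is exactly the trade-off we want for stats.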
For example: See the Mongo graphs together with the Sidekiq metrics, so you can see if you have errors in Sidekiq workers, what Mongo looks like during those times. (You can see some pretty amazing things)
You can collect a ridiculous amount of data, and then you can look at it with Graphite, again, with cross reference to other metrics.
For example, we had problems with Mongo reporting replica lag. One of the theories was that the disk was queuing reads/writes because of insufficient IOPS.
Two hours into having Graphite, we realized this theory was wrong and that we needed to look elsewhere.
Collectd has a very big list of plugins. It can watch Nginx, MySQL, Mongo, and others. You can read more about it here: https://collectd.org/wiki/index.php/Table_of_Plugins
One of the plugins we use is Nginx, so you can see some really useful stats about Nginx.
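If you are curious what “sending to Graphite” looks like on the wire, Graphite’s plaintext protocol is just one line per data point, `metric.path value timestamp`, sent to the Carbon listener (port 2003 by default). The Python sketch below builds such lines; the metric prefix, metric names and values are invented, and collectd does all of this for you through its own Graphite output plugin.

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """Format one data point in Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"

def build_payload(prefix, metrics, timestamp):
    """Format a dict of metrics under a common prefix, sorted by name."""
    return "".join(
        graphite_line(f"{prefix}.{name}", value, timestamp)
        for name, value in sorted(metrics.items())
    )

def send_to_graphite(payload, host="127.0.0.1", port=2003):
    """Send a payload over TCP to a Carbon listener (assumed reachable)."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload.encode("ascii"))

# Hypothetical readings for one host at one point in time.
payload = build_payload(
    "servers.web1",
    {"cpu.user": 12.5, "nginx.active_connections": 431},
    timestamp=1389600000,
)
```

Because the format is this simple, anything that can open a socket can feed Graphite, which is part of why so many tools integrate with it.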
I really touched just the tip of the iceberg of what we monitor now. We actually have a huge number of metrics.
We collect custom business logic events too, like scoring system events (which are critical to the business), search events and more.
Once you start implementing that into your workflow, you start to see the added value of it, day in day out.
When I start implementing a new feature, I immediately bake stats into it. This way I know how it functions in production.
Finally, remember that Graphite is an amazing platform you can build on. That amazing dashboard you always wanted can be achieved with Graphene or with Dashing. There’s a more comprehensive list here: http://dashboarddude.com/blog/2013/01/23/dashboards-for-graphite/.
You can expose those insights to other members of the team, from product to the CEO, who may care about totally different things than you personally do.
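As an illustration of baking stats into a feature, here is a small Python sketch of a decorator that times a function and emits a StatsD-style timer line (`name:value|ms`). Our own code does this in Rails, so this is only the idea, not our implementation; `timed`, the `search` function and the metric name are hypothetical, and `send` stands for whatever ships the line to your stats daemon.

```python
import time
from functools import wraps

def timed(metric_name, send):
    """Decorator: measure each call and pass a 'name:ms|ms' line to `send`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on success and on exceptions, so failures are timed too.
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                send(f"{metric_name}:{elapsed_ms:.0f}|ms")
        return wrapper
    return decorator

# For the example, collect the lines in a list instead of sending them.
sent = []

@timed("search.query", sent.append)
def search(term):
    """Stand-in for a real feature."""
    return [term.upper()]

results = search("paris")
```

The feature code stays untouched apart from one line, and from the first deploy you get a graph of how it behaves in production.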
The approach I take with measuring and collecting is: First collect, store, then realize what you want to do with this data. Once you know what you want, you already have some stored data, and you can begin work.
There are libraries that support alerts based on Graphite graphs, so you don’t have to actually “look” at the dashboard.
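The core of such an alert is simple. Graphite’s render API can return JSON (`format=json`) as a list of series, each with a `target` and a list of `[value, timestamp]` datapoints, where missing values are `null`. The Python sketch below checks whether the most recent points all exceed a threshold. The sample response, the target name and the thresholds are invented, and real alerting libraries add scheduling, notifications and much more.

```python
import json

def latest_values(series, count):
    """Return the last `count` non-null values of one Graphite series."""
    values = [value for value, _timestamp in series["datapoints"]
              if value is not None]
    return values[-count:]

def should_alert(series, threshold, count=3):
    """Alert only if the last `count` known values are all above the threshold."""
    recent = latest_values(series, count)
    return len(recent) == count and all(v > threshold for v in recent)

# Hypothetical render API response (what ?format=json would return).
SAMPLE_RESPONSE = """
[{"target": "stats.timers.db.query.mean",
  "datapoints": [[12.0, 1389600000], [14.0, 1389600060], [null, 1389600120],
                 [85.0, 1389600180], [92.0, 1389600240], [88.0, 1389600300]]}]
"""

series = json.loads(SAMPLE_RESPONSE)[0]
db_is_spiking = should_alert(series, threshold=80, count=3)
```

Requiring several consecutive points above the threshold avoids paging someone for a single noisy sample.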
Get to it!
I believe that absolutely every company, at any stage, can benefit from this, from the two-person bootstrapped startup to the funded startup with multiple engineers.
Every time you see a post from a leading tech company, it’s backed by graphs, data and insights, you can have it too.
Get to it, it’s not difficult.
Feel free to comment/feedback.