Introduction to StatsD

"If it moves, we track it." - Etsy

Graphite, originally developed at Orbitz, is a monitoring system that accepts incoming metrics without requiring any prior definition or schema. To complement it, Etsy created StatsD as a simple way for applications to send metrics to such a system.

The resulting combination lets developers collect metrics with minimal effort and use them downstream with as little friction as possible, whether for dashboards, alerts, or other types of analysis.

Contents

Benefits of Using StatsD
Types of Metrics in StatsD
- Gauges
- Counters
- Timers
- Sets
Noteworthy Concepts
- Flush Intervals and Time Resolution
- $BinSize Template Variable

Benefits of Using StatsD

StatsD sits between your code and your metrics platform. Over the years, it has become ubiquitous. As a result, you get to take advantage of a vast open-source ecosystem and enjoy StatsD's simplicity and flexibility.

The StatsD concepts and protocol are quite simple, and client libraries are available for most programming languages. Each one is a thin wrapper that follows its language's idioms. The data protocol itself is text-based, so it is easy to read, write, and test from the command line or your programming language's interactive prompt.
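To give a sense of how small the text protocol is, here is a minimal sketch in Python that builds metric lines by hand. The `name:value|type` layout and the type codes (`g`, `c`, `ms`, `s`) are those of the StatsD line format; the helper `format_metric` and the metric names are hypothetical and exist only for illustration.

```python
# Type codes used by the StatsD line protocol.
METRIC_TYPES = {"gauge": "g", "counter": "c", "timer": "ms", "set": "s"}


def format_metric(name, value, kind):
    """Build one protocol line, for example 'queue.depth:42|g'.

    `format_metric` is an illustrative helper, not part of any StatsD library.
    """
    return f"{name}:{value}|{METRIC_TYPES[kind]}"


# Hypothetical metric names, one per metric type.
for args in [
    ("queue.depth", 42, "gauge"),
    ("website.analytics.page_views", 1, "counter"),
    ("task_run_duration", 320, "timer"),
    ("users.seen", 7, "set"),
]:
    print(format_metric(*args))
```

In practice a client library does this formatting for you, but seeing the raw lines makes it easier to debug what is actually being sent.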

Sending data to StatsD is done using UDP, so it is fire-and-forget, with no error checking and no waiting for acknowledgements. For most use cases this trade-off is acceptable because occasional lost samples do not matter, and in return the app is not susceptible to errors and crashes originating in the instrumentation layer.
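The fire-and-forget behaviour can be sketched with Python's standard `socket` module. This is an illustration under assumptions, not a replacement for a real client library: the function name `send_metric` is made up, and the host and port assume a StatsD daemon listening locally on its default UDP port, 8125.

```python
import socket


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one protocol line over UDP without waiting for any reply.

    Illustrative helper; host and port are assumptions about your setup.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.sendto(line.encode("ascii"), (host, port))
        except OSError:
            # Instrumentation should never take the application down,
            # so a failed send is silently dropped.
            pass


# Works the same whether or not a daemon is listening: nothing is awaited.
send_metric("website.analytics.page_views:1|c")
```

Because nothing is read back from the socket, the call returns immediately even when no daemon is running, which is exactly why a StatsD outage does not affect the app.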

The StatsD daemon is decoupled from your app, so your app can be written in a different programming language and can even run on another server. If the StatsD daemon crashes, your app will continue running just fine. Lastly, instrumenting your app with StatsD adds negligible overhead.

Types of Metrics in StatsD

Gauges

Gauges in StatsD allow you to directly specify the values you want to send to the metrics database.

Counters

Counters keep track of a running total for you, and you only specify by how much to increment them. When the amount of time defined by the StatsD flushInterval config option elapses, StatsD sends the final cumulative value to the metrics backend and resets the counter to zero. For each counter you define, StatsD emits two metrics, distinguished by the .count and .rate suffixes. 'Count' is the total for the interval, whereas 'rate' is the per-second average over that interval.
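The counter behaviour described above can be modelled in a few lines of Python. This is a toy sketch of the idea, not the daemon's actual code; the class name `CounterAggregator` and its methods are hypothetical.

```python
from collections import defaultdict


class CounterAggregator:
    """Toy model of how a StatsD-style daemon handles counters."""

    def __init__(self, flush_interval_s=10.0):
        self.flush_interval_s = flush_interval_s
        self.totals = defaultdict(int)

    def increment(self, name, amount=1):
        self.totals[name] += amount

    def flush(self):
        """Return this interval's metrics, then reset every counter to zero."""
        out = {}
        for name, total in self.totals.items():
            out[f"{name}.count"] = total
            out[f"{name}.rate"] = total / self.flush_interval_s
        for name in self.totals:
            self.totals[name] = 0
        return out


agg = CounterAggregator(flush_interval_s=10.0)
agg.increment("page_views", 3)
agg.increment("page_views", 2)
print(agg.flush())  # count is 5, rate is 5 / 10 seconds
print(agg.flush())  # the counter was reset, so count and rate are zero
```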

Timers

Although originally designed for timing data, this metric type accepts many samples per time interval for any kind of measurement, and StatsD calculates summary statistics over those samples for you. These statistics have the following suffixes:

.count
.count_ps
.lower
.upper
.upper_90
.std
.sum
.sum_90
.mean
.mean_90
.median

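The _90 suffix refers to the 90th percentile: upper_90 is the value at the 90th percentile, while sum_90 and mean_90 are computed over only the samples at or below it.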
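As a rough illustration of what these suffixes mean, the following Python sketch computes the same set of statistics for one flush interval. It is a simplification under assumptions: the function name `timer_stats` is made up, and the rounding rule used to decide how many samples fall within the percentile is a guess that may differ from what the daemon does at the boundary.

```python
import math
import statistics


def timer_stats(samples, flush_interval_s=10.0, pct=90):
    """Summarize one interval's samples (illustrative, not StatsD's code)."""
    if not samples:
        raise ValueError("at least one sample is required")
    values = sorted(samples)
    count = len(values)
    # Assumption: keep the lowest pct% of samples, rounded down, minimum one.
    keep = max(1, math.floor(count * pct / 100))
    kept = values[:keep]
    return {
        "count": count,
        "count_ps": count / flush_interval_s,
        "lower": values[0],
        "upper": values[-1],
        f"upper_{pct}": kept[-1],
        "std": statistics.pstdev(values),
        "sum": sum(values),
        f"sum_{pct}": sum(kept),
        "mean": statistics.fmean(values),
        f"mean_{pct}": statistics.fmean(kept),
        "median": statistics.median(values),
    }


# Ten hypothetical durations in milliseconds.
for key, value in timer_stats([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]).items():
    print(key, value)
```

With ten samples, the slowest one (100) is excluded from the _90 statistics, which is why upper_90 is useful for spotting typical worst-case latency without a single outlier dominating.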

Sets

With the set metric type, StatsD does not record the values themselves but the number of unique values it received. For example, if during one time interval you send StatsD the values 7, 13, 7, and 5 for a given metric name, then that metric will record the value 3, because this set of values contains three unique values.
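The example above can be reproduced with a short Python sketch that parses raw protocol lines and counts unique values. The function `unique_count` and the metric name `users.seen` are hypothetical; the `|s` type code is the protocol's marker for sets.

```python
def unique_count(lines, name):
    """Count distinct values sent for one set metric (illustrative helper)."""
    unique = set()
    for line in lines:
        metric, rest = line.split(":", 1)
        value, kind = rest.split("|", 1)
        if metric == name and kind == "s":
            unique.add(value)
    return len(unique)


received = ["users.seen:7|s", "users.seen:13|s", "users.seen:7|s", "users.seen:5|s"]
print(unique_count(received, "users.seen"))  # 7, 13, and 5 are the distinct values
```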

Noteworthy Concepts

Relationship between Flush Intervals in StatsD and the Time Resolution of Metrics in the Database

The default flush interval in StatsD is 10 seconds, which means that every ten seconds StatsD sends its latest data to the configured backends, such as InfluxDB or Graphite. The backend stores each metric at a fixed time resolution, and for a given metric name, the last value received within one resolution interval overwrites any earlier values received within that same interval. This becomes a problem when you consider that StatsD resets its counters after every flush: if you don't choose the flush interval carefully, you can lose data.

An example will help clarify the concept. Say:

- you have a metric named website.analytics.page_views that measures page views,
- the metric's resolution defined in the backend is 1 minute,
- the flushInterval in StatsD is set to 10 seconds, and
- over the course of two minutes, StatsD happens to output the following counts: 17, 23, 11, 87, 59, 43, 47, 19, 29, 83, 73, and 61.

First, keep in mind that StatsD resets the counter every 10 seconds, so all these counts are only for their respective 10-second intervals.

Your backend (InfluxDB, Graphite, etc.) will receive the value 17, then overwrite it with 23 when that value is received, overwrite that with 11, and so on. It will store the value 43 for the first one-minute interval and the value 61 for the second one-minute interval, whereas the actual page views were 240 and 312 respectively. As you can see, the metric ends up severely misrepresenting the number of page views!
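The scenario above can be replayed in a few lines of Python. This is a simplified model that assumes the backend keeps only the last value written into each resolution bin; the function name `replay_flushes` is made up for illustration.

```python
# The counts flushed by StatsD every 10 seconds over two minutes.
FLUSHED_COUNTS = [17, 23, 11, 87, 59, 43, 47, 19, 29, 83, 73, 61]


def replay_flushes(values, flush_interval_s, resolution_s):
    """Return (stored, actual) per bin, where stored is last-write-wins."""
    stored = {}
    actual = {}
    for i, value in enumerate(values):
        timestamp = i * flush_interval_s
        bin_start = timestamp - timestamp % resolution_s
        stored[bin_start] = value  # overwrites any earlier value in this bin
        actual[bin_start] = actual.get(bin_start, 0) + value
    return stored, actual


stored, actual = replay_flushes(FLUSHED_COUNTS, flush_interval_s=10, resolution_s=60)
print("stored by backend:", stored)
print("actual page views:", actual)
```

Comparing the two dictionaries shows how much of each minute's traffic is silently discarded when six flushes land in the same bin.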

To avoid this problem, all you have to do is ensure that your StatsD flushInterval is at least as long as the metric's resolution in the backend. If they are the same, then each bin will receive a single value from StatsD, which is ideal. If the flushInterval is longer than the metric's resolution, then some bins will be empty and others will hold a value that covers the whole period since the last non-empty bin. While not ideal, that's fine and can be worked with.

Defining a `$BinSize` Template Variable in Grafana Can Make Things Easier

Defining the bin size in your dashboards can help keep you in control and speed up the queries that are powering your dashboards.

Say you want to zoom out and look at the last 30 days instead of the last 2 days, as you would usually do. Your dashboard would end up with many more data points than usual, they would be hard to distinguish, and the query would be slow. Once you have a variable that defines the desired bin size, your query can ask the database to aggregate the metrics to that bin size and only then deliver them to Grafana. The resulting graphs will load much faster and won't contain an overwhelming number of data points.

One way to do so is to define a template variable (which can be named $BinSize, for example) in each of your dashboards. A reasonable choice of values from which to select can be 5min,10min,15min,20min,30min,1h,2h,3h,4h,6h,12h,1d,2d,7d.

Once you have defined the template variable, you can use it to aggregate the data in your queries. In Graphite, you can aggregate your metric values using the summarize function. For example:

summarize(*.*.*.task_run_duration.upper, "$BinSize", "max")
summarize(some.task_run_duration.mean, "$BinSize", "avg")
summarize(*.*.task_runs.count, "$BinSize", "sum")

In InfluxDB, you can aggregate your metric values using the GROUP BY time() construct. For example:

SELECT max(duration_upper) FROM task_performance GROUP BY time($BinSize)
SELECT mean(duration_mean) FROM task_performance GROUP BY time($BinSize)
SELECT sum(runs_count) FROM task_performance GROUP BY time($BinSize)
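Conceptually, both query styles above do the same thing: they group data points into bins of the chosen size and reduce each bin to a single value. The following Python sketch shows that idea on a handful of made-up points; the function name `aggregate` and the sample data are hypothetical, and real backends handle details such as bin alignment and empty bins in their own way.

```python
def aggregate(points, bin_size_s, reducer):
    """Group (timestamp, value) points into fixed-size bins and reduce each."""
    bins = {}
    for timestamp, value in points:
        bin_start = timestamp - timestamp % bin_size_s
        bins.setdefault(bin_start, []).append(value)
    return [(start, reducer(values)) for start, values in sorted(bins.items())]


# Hypothetical one-minute data, aggregated into five-minute (300 s) bins.
points = [(0, 1), (60, 2), (300, 5), (330, 7)]
print(aggregate(points, 300, sum))  # suits counts such as task_runs.count
print(aggregate(points, 300, max))  # suits maxima such as the .upper suffix
```

The choice of reducer mirrors the queries above: sum for counts, max for upper bounds, and an average for means.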

To define your template variable, open the dashboard page, click the settings gear icon in the center of the page header, then click on 'Templating' to get to the configuration panel for template variables. Name your variable, for example $BinSize, give it a more human-friendly label if you wish, and change the 'Type' to 'Custom'. In the 'Custom Options' section, type the values you want to be able to choose from, for example, 5min,10min,15min,20min,30min,1h,2h,3h,4h,6h,12h,1d,2d,7d. Lastly, click the 'Add' button. You will now see a row at the top of your dashboard containing your variable and a dropdown with the available values.