Graphite

The Graphite system stores data points in files, called metrics, and generates graphs from those files.

The main quality supported by the Graphite system is performance. Let’s consider the following passage from the document:

Imagine you have a dashboard with 10 graphs in it and the page refreshes once a minute. Each time a user opens the dashboard in their browser, Graphite has to handle 10 more requests per minute. This quickly becomes expensive.

This passage describes a performance scenario, where the stimuli are requests to render graphs, sent periodically by dashboards, every minute. The consequence of this scenario is that Graphite needs to render the same graphs every minute, which is expensive in terms of CPU and may compromise overall performance. To describe the desirable behaviour of Graphite under this scenario, we can define that in these circumstances the system response should guarantee that graphs are sent back to the dashboards in less than one minute. This value is not explicit in the description, but dashboards have to display the graphs before sending a new request.

The tactic taken to support this performance scenario is a cache of graphs, so that it may not be necessary to regenerate a graph every time it is requested. Note that the rationale for this decision rests on the fact that the requests are identical. This is an example of how the characterization of the workload is essential when deciding on tactics for demanding architectural qualities. Also based on the workload, and considering that the same graph is seldom shared by different dashboards, a local cache could have been used. However, if the client applications run in a browser, an application-specific cache may not be feasible, and the solution needs to use a global or distributed cache.
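As a minimal sketch of this tactic, consider the class below, which caches rendered graphs under a key derived from the full request, so that identical dashboard requests arriving within the refresh interval are served without re-rendering. The names (GraphCache, render_fn) and the in-process dictionary are illustrative assumptions, not Graphite’s API; an actual deployment would place the cache in a shared service, in line with the global or distributed cache discussed above.

```python
import hashlib
import json
import time


class GraphCache:
    """Hypothetical shared cache of rendered graphs, keyed by the request."""

    def __init__(self, ttl_seconds=60):
        self._store = {}          # key -> (expires_at, image_bytes)
        self._ttl = ttl_seconds   # dashboards refresh once a minute

    def _key(self, request_params):
        # Identical requests produce identical keys, which is exactly why
        # caching pays off for the dashboard workload described above.
        canonical = json.dumps(request_params, sort_keys=True)
        return hashlib.sha1(canonical.encode()).hexdigest()

    def get_or_render(self, request_params, render_fn):
        key = self._key(request_params)
        entry = self._store.get(key)
        now = time.time()
        if entry is not None and entry[0] > now:
            return entry[1]                    # cache hit: no rendering
        image = render_fn(request_params)      # cache miss: render once
        self._store[key] = (now + self._ttl, image)
        return image
```

With a one-minute time to live, ten dashboards refreshing the same graphs once a minute cost one rendering per graph per minute instead of ten.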

Another performance scenario is:

Another common case that creates lots of rendering requests is when a user is tweaking the display options and applying functions in the Composer UI. Each time the user changes something, Graphite must redraw the graph.

In this case, the previous tactic is not effective because a new graph is generated on every change. This actually corresponds to a different workload, one where the same data points are used to generate different presentations, and the performance cost lies in the disk accesses needed to read the data points. The tactic used is a cache of data points, which reduces the overhead of reading the information from disk.
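A sketch of this second cache, under the same assumptions as above, is given below. The function read_points_from_disk is a hypothetical stand-in for whisper file access; the point is that redrawing the same data with different display options no longer touches the disk.

```python
import time


class DataPointCache:
    """Hypothetical cache of raw (timestamp, value) series per time range."""

    def __init__(self, ttl_seconds=60):
        self._store = {}  # (metric, start, end) -> (expires_at, points)
        self._ttl = ttl_seconds

    def fetch(self, metric, start, end, read_points_from_disk):
        key = (metric, start, end)
        entry = self._store.get(key)
        now = time.time()
        if entry is not None and entry[0] > now:
            return entry[1]                      # served from memory
        points = read_points_from_disk(metric, start, end)
        self._store[key] = (now + self._ttl, points)
        return points
```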

Note that, since the end user is experimenting with different possible representations of the data, they expect rapid feedback whenever they change an option, which raises the question of whether this scenario can be interpreted as a usability scenario. To answer this question it is necessary to identify the type of stimulus. In a usability scenario the stimulus is the use of the system, for instance how easily or efficiently a user accomplishes a task, while in the scenario above the stimulus is an input: a request to render graphs.

Is it enough to analyze the read workload to define the expected quality behaviour of Graphite? Regarding the write workload, the Graphite description states:

Imagine that you have 60,000 metrics that you send to your Graphite server, and each of these metrics has one data point per minute. Remember that each metric has its own whisper file on the filesystem. This means carbon must do one write operation to 60,000 different files each minute. As long as carbon can write to one file each millisecond, it should be able to keep up. This isn’t too far fetched, but let’s say you have 600,000 metrics updating each minute, or your metrics are updating every second, or perhaps you simply cannot afford fast enough storage. Whatever the case, assume the rate of incoming data points exceeds the rate of write operations that your storage can keep up with. How should this situation be handled?

This corresponds to another performance scenario, where the resource being stressed is the disk and delays are due to disk seek time. To choose the tactic we have to analyze the workload. The write operation appends data points to the metric files; there is no delete or update of data points, which means that if several data points for the same metric are written together, the seek cost is incurred only once for the whole batch. Therefore, by buffering several data points for the same metric, the performance of write operations is improved because the number of write operations is decreased. However, the application of this tactic has a drawback, as the document explains:
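A minimal sketch of the buffering tactic follows. BufferedWriter and append_points_to_file are illustrative names, and carbon’s real queueing logic is more elaborate, but the sketch captures the core idea: many data points of the same metric are written with a single file operation, amortizing the seek.

```python
from collections import defaultdict


class BufferedWriter:
    """Hypothetical per-metric write buffer that flushes points in batches."""

    def __init__(self, append_points_to_file, flush_threshold=10):
        self._append = append_points_to_file    # writes a batch to one file
        self._buffers = defaultdict(list)       # metric -> [(ts, value), ...]
        self._threshold = flush_threshold

    def add(self, metric, timestamp, value):
        buf = self._buffers[metric]
        buf.append((timestamp, value))
        if len(buf) >= self._threshold:
            # One seek and one write for a whole batch of data points.
            self._append(metric, buf)
            self._buffers[metric] = []
```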

Buffering data points was a nice way to optimize carbon’s I/O but it didn’t take long for my users to notice a rather troubling side effect. Revisiting our example again, we’ve got 600,000 metrics that update every minute and we’re assuming our storage can only keep up with 60,000 write operations per minute. This means we will have approximately 10 minutes worth of data sitting in carbon’s queues at any given time. To a user this means that the graphs they request from the Graphite webapp will be missing the most recent 10 minutes of data: Not good!

This kind of side effect is very common whenever asynchrony is introduced into an architecture. The use of buffers to optimize disk writes results in a situation where a write operation has already finished but its value is not yet available to read operations, which means that graphs will not contain the most recent data points. Therefore, the performance of Graphite is improved but reliability is compromised. To continue to provide reliable graphs, the tactic followed was to change the read operation to read both from the disk and from the buffer, supporting a higher level of consistency, though not necessarily strict consistency.
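The compensating read tactic can be sketched as follows, reusing the hypothetical read_points_from_disk and buffer structure from the sketches above: a read merges what is already on disk with what is still queued in memory, so graphs include the most recent data points even before they are flushed.

```python
def read_points(metric, start, end, read_points_from_disk, buffers):
    """Merge on-disk points with still-buffered points for one metric."""
    on_disk = read_points_from_disk(metric, start, end)
    buffered = [(ts, value) for ts, value in buffers.get(metric, [])
                if start <= ts <= end]
    # Buffered points are newer than anything on disk for this metric,
    # so concatenating and sorting by timestamp yields the full series.
    return sorted(on_disk + buffered)
```

This closes the window in which completed writes were invisible to readers, at the cost of a slightly more expensive read path.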