Best practices

Cache considerations

When doing read operations, it is vital that your working set is large enough that the storage backend cannot fulfil requests from cache - unless of course, cache performance is what you are trying to benchmark!

The object size and count parameters determine your working set. For example, if you have 10,000 objects of 1M size, then your working set will be 10 GB.

Exactly how big your working set needs to be is dependent on the storage system under test, and may be difficult to determine. For instance, when benchmarking Rados, we would need to consider not only Ceph’s own cache sizes, but also the combined amount of cache built into all the drives in the system.

When in doubt, use a bigger object count. The only downsides to using a larger count are the possibility of running out of memory on the sibench nodes themselves, and the increased amount of time it will take to clean up after the benchmark.

We regularly use working sets measured in terabytes when benchamarking medium sized clusters.

Throughput isn’t everything!

Storage systems are not usually run at peak throughput because it can lead to extremely long response times. In consequence, running without bandwidth limiting is only giving half the story: it’ll tell you what the maximum throughput in the system might be, but it is likely to be very misleading about the response times that the storage system is likely to give in real-world use.

More useful figures can often be obtained by first determining the peak throughput of the system, and then re-running the benchmarks with the bandwidth limited to 80 or 90 percent of the peak number.

Boosting throughput

sibench is inefficient with respect to the amount of load it puts on its own nodes. This is by design: we do not want to have to wait long for a thread to be scheduled in order to read data that has become available. Nor do we want to be interrupted during a write. Both of these scenarios can have a huge effect on the accuracy of our response time measurements, and may make them look much worse than they really are. In essence, we are trying to avoid benchmarking the benchmarking system itself!

As a consequence, a sibench node only starts up as many workers as it has cores. This is adjustable using the --workers option. (A factor of 2.0 will have twice as many workers as cores). This may be useful if we want to determine absolute maximum throughput, provided we don’t care about the accuracy of the response times.

Note that sibench considers hyperthreaded cores as real cores for the purposes of determining core counts.

Alternatively, you may also be able to boost read throughput from the sibench nodes by using the --skip-read-verification option, which does exactly what it suggests.

In general though, neither of these two options are recommended except for one particular use case: if disabling read verification or increasing the worker count boosts your throughput numbers, then that is an indication that more sibench nodes should be added in order to benchmark at those rates whilst still giving accurate timings.

Response times

Whilst sibench will output the maximum, minimum and average response times, in practice it is the 95%-response time - the time in which 95% of requests complete - that is likely to be the most informative. Maximum response times can be thrown out by one outlier result, which in turn poisons the average. The 95% figure (or the 99% figure if you wish to perform your own analysis) is a better indicator of a system’s behaviour.

Memory considerations

sibench is written to use as little memory as possible. The generators algorithmically create each object to be written or read-and-verified on the fly, and so objects do not need to be held in memory for longer than a single read or write operation as they can be recreated at will.

The one part of sibench that can take a lot of memory is the stats gathering, as stats are held in memory by each driver node until the completion of each phase of a run. At the end of each phase, the manager process collects the stats from all the nodes and merges them. This can be a lot of data if, say, you are running 30 driver nodes against an NVMe cluster for a long run time.

A consequence of this approach is that the manager node may need a lot more memory than the driver nodes, because it has to hold the stats of all of the driver nodes in memory in order to do the merge.

Unfortunately, some of the Ceph native libraries used by sibench appear to hold on to data for longer than would seem necessary. This can result in large amounts of memory being used, which can result in two undesirable outcomes:

  • Swapping: if the benchmarking process needs to swap, then performance figures are likely to be wildly wrong.

  • Process death: on Linux, the OOM Killer in the kernel will terminate processes that take too much memory with a SIGKILL. Since this is not a signal that can be caught, there is no warning or error when it occurs. (The systemd script should start a new copy of the server immediately though, so the sibench node will be usable for a new benchmark run with no further action.

At the start of each run, sibench determines how much physical memory each node has, and does some back-of-the-envelope maths to determine how much memory a benchmark may consume in the worst case. If the latter is within about 80% of the former, it outputs a warning message to alert the user of possible consequences. However, the benchmark will try to run (and because this assumes worst-case ceph library behaviour, it may well succeed).

Homogeneous cores

sibench divides its workload between nodes, with each taking responsibility for reading and writing some number of objects. The division of labour is done purely according to how many cores each node has. It does not attempt to measure the performance of each server node, nor does it use some artificial measure of performance such as BogoMIPS. Because of this, it is important that the nodes used as sibench servers be of roughly equivalent speed, at least on a per-core basis.

The reason for this is that if one sibench server is far quicker than its peers, then when it finishes reading its share of the objects and loops round to start at the beginning again, the data may still be in the storage system’s caches.