Best practices¶
Cache considerations¶
When doing read operations, it is vital that your working set is large enough that the storage backend cannot fulfil requests from cache - unless of course, cache performance is what you are trying to benchmark!
The object size and count parameters determine your working set. For example, if you have 10,000 objects of 1 MB each, then your working set will be 10 GB.
Exactly how big your working set needs to be is dependent on the storage system under test, and may be difficult to determine. For instance, when benchmarking Rados, we would need to consider not only Ceph’s own cache sizes, but also the combined amount of cache built into all the drives in the system.
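As a rough sanity check, you can compare the working set you have configured against an estimate of all the caches in the data path. The sketch below is purely illustrative: the cache figures and the 3x safety margin are made-up numbers, not anything sibench computes for you.

```python
# Rough sanity check (illustrative only): is the configured working set
# comfortably bigger than every cache the data might land in?
object_count = 10_000
object_size = 1 * 2**20                    # 1 MiB per object
working_set = object_count * object_size   # ~10 GiB, as in the example above

# Hypothetical cache estimate: Ceph-side caches plus per-drive caches.
ceph_caches = 4 * 2**30                    # e.g. total OSD cache across the cluster
drive_caches = 24 * 256 * 2**20            # e.g. 24 drives with 256 MiB of cache each
estimated_cache = ceph_caches + drive_caches

if working_set < 3 * estimated_cache:      # arbitrary 3x margin
    print("Working set is too close to the total cache size - "
          "increase the object count (or object size).")
```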
When in doubt, use a bigger object count. The only downsides to using a larger count are the possibility of running out of memory on the sibench nodes themselves, and the increased amount of time it will take to clean up after the benchmark. We regularly use working sets measured in terabytes when benchmarking medium-sized clusters.
Throughput isn’t everything!¶
Storage systems are not usually run at peak throughput because doing so can lead to extremely long response times. Consequently, running without bandwidth limiting only tells half the story: it shows what the maximum throughput of the system might be, but it is likely to be very misleading about the response times the storage system will deliver in real-world use.
More useful figures can often be obtained by first determining the peak throughput of the system, and then re-running the benchmarks with the bandwidth limited to 80 or 90 percent of the peak number.
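As a concrete illustration (the numbers are made up, and you would apply the result through whatever bandwidth-limiting mechanism your sibench invocation uses):

```python
# Illustrative arithmetic only: derive a bandwidth cap from a measured peak.
peak_throughput_mb_s = 5200            # hypothetical peak from an unlimited run
cap_fraction = 0.85                    # somewhere in the 80-90% range

bandwidth_limit_mb_s = int(peak_throughput_mb_s * cap_fraction)
print(bandwidth_limit_mb_s)            # 4420 - use this limit for the follow-up runs
```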
Boosting throughput¶
sibench is deliberately conservative about the amount of load it puts on its own nodes: we do not want to have to wait long for a thread to be scheduled in order to read data that has become available, nor do we want to be interrupted during a write. Both of these scenarios can have a huge effect on the accuracy of our response time measurements, and may make them look much worse than they really are. In essence, we are trying to avoid benchmarking the benchmarking system itself!
As a consequence, a sibench node only starts up as many workers as it has cores. This is adjustable using the --workers option (a factor of 2.0 starts twice as many workers as cores). Raising it may be useful if we want to determine absolute maximum throughput, provided we don’t care about the accuracy of the response times.
Note that sibench considers hyperthreaded cores as real cores for the purposes of determining core counts.
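The relationship can be sketched as follows. This is not sibench code: os.cpu_count() reports logical cores, which mirrors sibench treating hyperthreaded cores as real ones, and the 2.0 factor corresponds to the --workers example above.

```python
import os

# Sketch of the worker-count calculation described above (not sibench code).
def worker_count(factor: float = 1.0) -> int:
    """Workers per node = factor x logical cores (hyperthreads count as cores)."""
    logical_cores = os.cpu_count() or 1
    return max(1, int(logical_cores * factor))

print(worker_count())      # default: one worker per core
print(worker_count(2.0))   # a factor of 2.0: twice as many workers as cores
```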
Alternatively, you may also be able to boost read throughput from the sibench nodes by using the --skip-read-verification option, which does exactly what it suggests.
In general though, neither of these two options is recommended except for one particular use case: if disabling read verification or increasing the worker count boosts your throughput numbers, then that is an indication that more sibench nodes should be added in order to benchmark at those rates whilst still giving accurate timings.
Response times¶
Whilst sibench will output the maximum, minimum and average response times, in practice it is the 95%-response time - the time in which 95% of requests complete - that is likely to be the most informative. The maximum response time can be skewed by a single outlier result, which in turn poisons the average. The 95% figure (or the 99% figure if you wish to perform your own analysis) is a better indicator of a system’s behaviour.
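To see why, here is a small sketch (ordinary Python, not part of sibench) in which a single pathological request distorts both the maximum and the mean, while the 95th percentile still reflects typical behaviour:

```python
import math
import random

def percentile(data, pct):
    """Nearest-rank percentile: the value below which pct% of samples fall."""
    ordered = sorted(data)
    rank = math.ceil(pct / 100.0 * len(ordered))
    return ordered[max(rank - 1, 0)]

random.seed(1)
samples = [random.uniform(3.5, 5.0) for _ in range(199)]  # typical response times (ms)
samples.append(850.0)                                     # one pathological outlier

print("max : %6.1f ms" % max(samples))                    # dominated by the outlier
print("mean: %6.1f ms" % (sum(samples) / len(samples)))   # dragged up by the outlier
print("p95 : %6.1f ms" % percentile(samples, 95))         # still ~5 ms
```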
Memory considerations¶
sibench is written to use as little memory as possible. The generators algorithmically create each object to be written or read-and-verified on the fly, and so objects do not need to be held in memory for longer than a single read or write operation, as they can be recreated at will.
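The exact generator algorithms are internal to sibench, but the idea can be sketched like this: if an object’s contents are a pure function of a seed and the object’s name, the same bytes can be regenerated on demand for verification, so nothing needs to be retained between operations. The generate_object helper below is illustrative, not sibench’s actual implementation.

```python
import hashlib

def generate_object(name: str, seed: int, size: int) -> bytes:
    """Deterministically generate `size` bytes from (seed, name).

    Because the output depends only on its inputs, a reader can regenerate
    the expected contents to verify what it fetched, without ever having to
    keep the original object around in memory.
    """
    out = bytearray()
    counter = 0
    while len(out) < size:
        block = hashlib.sha256(f"{seed}:{name}:{counter}".encode()).digest()
        out.extend(block)
        counter += 1
    return bytes(out[:size])

# The write side and the read-verify side call the generator independently:
expected = generate_object("obj-0042", seed=1, size=1024 * 1024)
fetched = expected  # stand-in for the bytes actually read back from storage
assert fetched == generate_object("obj-0042", seed=1, size=1024 * 1024)
```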
The one part of sibench that can take a lot of memory is the stats gathering, as stats are held in memory by each driver node until the completion of each phase of a run. At the end of each phase, the manager process collects the stats from all the nodes and merges them. This can be a lot of data if, say, you are running 30 driver nodes against an NVMe cluster for a long run time.
A consequence of this approach is that the manager node may need a lot more memory than the driver nodes, because it has to hold the stats of all of the driver nodes in memory in order to do the merge.
Unfortunately, some of the Ceph native libraries used by sibench appear to hold on to data for longer than would seem necessary. This can result in large amounts of memory being used, which can lead to two undesirable outcomes:
Swapping: if the benchmarking process needs to swap, then performance figures are likely to be wildly wrong.
Process death: on Linux, the kernel’s OOM killer will terminate processes that take too much memory with a SIGKILL. Since this is not a signal that can be caught, there is no warning or error when it occurs. (The systemd script should start a new copy of the server immediately though, so the sibench node will be usable for a new benchmark run with no further action.)
At the start of each run, sibench determines how much physical memory each node has, and does some back-of-the-envelope maths to determine how much memory a benchmark may consume in the worst case. If the latter is within about 80% of the former, it outputs a warning message to alert the user to the possible consequences. The benchmark will still try to run, however, and because the estimate assumes worst-case Ceph library behaviour, it may well succeed.
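A sketch of the kind of check described above (the real accounting is internal to sibench, and the numbers here are hypothetical):

```python
# Illustrative only: warn when a worst-case estimate of benchmark memory
# approaches the node's physical memory, as described above.
def check_memory_headroom(physical_bytes: int,
                          worst_case_benchmark_bytes: int,
                          threshold: float = 0.80) -> None:
    """Print a warning if the worst-case estimate exceeds `threshold` of RAM."""
    if worst_case_benchmark_bytes > threshold * physical_bytes:
        print("WARNING: benchmark may use %.0f%% of physical memory; "
              "swapping or OOM kills are possible."
              % (100.0 * worst_case_benchmark_bytes / physical_bytes))

# Hypothetical numbers: a 64 GiB node and a 56 GiB worst-case estimate.
check_memory_headroom(64 * 2**30, 56 * 2**30)
```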
Homogeneous cores¶
sibench divides its workload between nodes, with each taking responsibility for reading and writing some number of objects. The division of labour is done purely according to how many cores each node has: it does not attempt to measure the performance of each server node, nor does it use some artificial measure of performance such as BogoMIPS. Because of this, it is important that the nodes used as sibench servers be of roughly equivalent speed, at least on a per-core basis.
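The kind of split this implies can be sketched as follows (not sibench’s actual code; the node names and counts are made up):

```python
# Objects are shared out between nodes purely in proportion to core counts.
def share_objects(object_count: int, cores_per_node: dict) -> dict:
    """Assign whole objects to nodes proportionally to their core counts."""
    total_cores = sum(cores_per_node.values())
    shares = {node: (object_count * cores) // total_cores
              for node, cores in cores_per_node.items()}
    # Hand any remainder from the integer division to the first few nodes.
    leftover = object_count - sum(shares.values())
    for node in list(shares)[:leftover]:
        shares[node] += 1
    return shares

print(share_objects(10000, {"driver1": 16, "driver2": 16, "driver3": 32}))
# {'driver1': 2500, 'driver2': 2500, 'driver3': 5000}
```

Note that a node with fast cores is assigned exactly the same share as a node with the same number of slow cores.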
The reason this matters is that if one sibench server is far quicker than its peers, then when it finishes reading its share of the objects and loops round to start at the beginning again, the data may still be in the storage system’s caches.