An answer that is just black or white

Right now we have two devices available for immediate deployment. Both are based on the same design that has so far best met our requirements as described in detail on the previous pages. The difference between them is that one is optimised for both maximum bitrate and robustness against external interference – while the other is a pared-down version aimed at reducing its production cost (but without seriously compromising its quality), so that we can also offer a genuinely good source of trusted entropy, at a 'no-excuses' price point, to anyone who needs one.

The BitBabbler White TRNG


The BitBabbler White is the premium device which we have designed and built for our own personal use. This version comes equipped with four independent circuits that will each gather entropy separately, and which can be sampled individually for testing that all of them are operating correctly. For normal use they are mixed together into a single output stream that is already white, or very nearly so (there is some expected variation between individual devices), even before any post-processing is done on it.

Due to the normally expected manufacturing and component tolerances, some individual devices produce a more robustly white output over a wider range of operating conditions and bitrates than others do, but so far we've had none of the devices – from the initial trial run of a hundred through to the production run of 600 – fail validation when using the conservative default of a 2.5Mbps bitrate, folded once. Many of them also passed at higher bitrates and without folding, but since the bias away from white can be very small and can take a very large number of samples to show up as being statistically significant, it probably isn't very practical for us to bin them into groups with different performance ratings at this stage. [If there is real demand for that, we might consider doing it as a high-end product, but given the time that would take, compared to giving each device a good long run with just a single configuration for a simple pass/fail result, it would significantly raise the production costs.] We've been experimenting to try to find tests which might be a good indicator of this over shorter runs, but at the moment having one configuration that we have a very high confidence of being good with every device seems like the more useful initial goal.

For now, if people want to try to push each device to its individual limits beyond that, we'll need to leave that profiling to them. We do, however, provide tools which will help with doing that; they just need you to invest the time to run them and study the results, and to exercise your own judgement about how close to the edge of reliability, under a wide range of operating conditions, you are willing to push it.

The conservative default gives a high enough bitrate, with a wide enough margin for safety, that users who just want it to work out of the box can simply plug it in and go – so only people who genuinely have an interest in experimenting with this should ever need to change it. The simplest, safest, and most effective way to greatly increase the available bitrate, if you really need to do that, is to just add as many more BitBabbler devices as you need to reach the target rate you are aiming for. We have a test machine with 60 of them in it, from which we can get about 30 GiB/hr of raw entropy.
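
For a rough sense of where that figure comes from (assuming it refers to the folded output at the conservative default, which is our reading of the numbers rather than a specification), the arithmetic is simple enough to sketch:

    # Back-of-the-envelope throughput, assuming the conservative default of
    # 2.5 Mbps per device, folded once (so half that rate actually delivered).
    raw_bps_per_device = 2.5e6
    output_bps_per_device = raw_bps_per_device / 2
    devices = 60

    total_bps = devices * output_bps_per_device
    gib_per_hour = total_bps / 8 * 3600 / 2**30
    print(f"{gib_per_hour:.1f} GiB/hr")   # ~31 GiB/hr, in line with "about 30 GiB/hr"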

The BitBabbler Black TRNG


The BitBabbler Black is the more budget-constrained version, which uses the same basic design as the White device above, but includes only a single entropy gatherer. When its output is folded twice it is almost functionally equivalent to the White model, and under testing it does appear to be essentially indistinguishable from it in quality too. The main differences are that its effective output bitrate is only a quarter of what the White model can supply, and that all entropy is accumulated in just a single gathering circuit.

Having only a single integrator could potentially make it a bit more vulnerable to external interference (though its design still makes it quite naturally resistant to that), and will make it a bit more sensitive to the effects that component tolerances can have on individual units. Having the three other circuits on the same board, as the White does, will tend to average out the overall effect of any 'edge case' components, and it means that each of them is likely to respond slightly, but notably, differently to any environmental change or ambient noise signal. How differently they will respond is difficult to model and predict, since it depends on a large number of variables where small changes cascade into significant effects, but in this context, that extra uncertainty seems useful to have when the extra cost is not a limiting factor to consider.

For people who don't need the highest bitrates they can get, or who don't need the stronger robustness guarantees of the White device, this offers an option to have the same high quality entropy source, backed by the same intensive research and testing, and supported by the same software tools, at a price which won't break even the smallest budget. We were originally planning to only produce the White device, but since the best use of good entropy is to make sure that everyone who has even a casual need for it also has a trusted and safe source of it, we decided to also offer an option that is as easily affordable as we could make it without having to compromise on any of the initial design goals.

The BitBabbler White device is the single package equivalent to mixing the output of four separate BitBabbler Black devices.
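
To make the relationship between mixing, folding, and the effective bitrate a little more concrete, here is a minimal sketch in Python. It assumes that 'folding' means XORing the second half of a block of samples into the first half (halving the amount of data each time), and that 'mixing' means XORing the parallel streams together; the exact mechanics inside the bit-babbler software may differ.

    from functools import reduce

    def fold(block: bytes, times: int = 1) -> bytes:
        # XOR the second half of the block into the first half, 'times' times.
        # Each fold halves the output size while combining the entropy of both halves.
        for _ in range(times):
            half = len(block) // 2
            block = bytes(a ^ b for a, b in zip(block[:half], block[half:2 * half]))
        return block

    def mix(*streams: bytes) -> bytes:
        # XOR several parallel streams together into a single output stream.
        return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*streams))

    # One White-like device: four generators mixed, then folded once.
    # Four Black-like devices: one generator each, folded twice.
    # Either way the output is an XOR combination of independent raw bits.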

Supporting software

We'd originally envisaged that the supporting software for this would be extremely simple. Since our design requirements called for there to be no firmware hidden on the device, we just needed to connect the raw output bitstream to whatever people wanted to use to consume it (in our case, the Linux /dev/random device) and we figured there were already existing tools (like rngd) that we could make good use of for that.

It quickly became obvious though that the very important 'but no simpler' clause of 'as simple as possible' was going to guide us down a slightly different path to that. It seemed wholly inadequate to us to just throw the output of some initial sample of devices through Dieharder and friends, and if it didn't obviously fail those, to just call it done. While that still would have been more than what a surprising number of the other TRNG devices we'd originally looked at had done before being declared fit for sale, it wasn't going to be enough to give us the sort of ongoing peace of mind we wanted, which is the need that started all of this work in the first place. If we weren't going to finish the job properly, it would have been a waste of time to even start it in the first place.

We needed two equally important things: tools that we could use to analyse, in whatever detail seemed necessary, the various designs that we were experimenting with; and good instrumentation which would be able to continue measuring the ongoing performance of every device in the actual conditions under which it was deployed.

There are a number of subtle but very important distinctions there. Practical test conditions aren't exactly the same as what would be expected for normal use. Ignoring for the moment the obvious factor of the surrounding environment changing when being used in different locations – the actual operating conditions are likely to be quite different too. If we're going to conduct statistical tests, which require a very large number of samples, then any practical initial test is going to try to pull those samples out of the device as quickly as it can. Which means the device is going to be running continuously, and under relatively steady-state conditions, during which it will be extracting every consecutive random bit, for hours, if not days, at a time. Which in turn is almost exactly the opposite of what it is likely to be doing when actually deployed, where it is more likely to sit idle for a comparatively long period of time when no entropy is required from it, with periodic bursts of activity extracting only a relatively small number of bits at a time.

While any TRNG device which could be considered to be worth its salt should perform identically in both those cases, unless that question has actually been put to the test and proven to be the case, there is no axiomatic reason to assume that it will. In fact there is a long list of good reasons to suspect that without careful attention to, and testing of, the design, this will not actually prove to be true at all.

The best of both worlds

Having established that the real world is not as easy to simulate as it is to exist in, the good news is that there is still quite a bit of overlap between the tools we needed for initial analysis and what we could make use of for ongoing assessment. Which is doubly good because it means we don't then have the problem of trying to compare apples to oranges in the results from each case. If the normal runtime results look as good as they did when the same tests were run under initial testing conditions, we can have some confidence that the quality of the entropy we are extracting for real use is still similar too.

The software we provide for use with these devices allows anyone to run the same sort of tests that we have (without restricting in any way what other tests they could also run), with the most rigorous suite of real-time performance and health monitoring enabled during normal use that we know of for any device of this type.

And obviously, the source is open and freely licenced, so you can audit every part of it yourself and contribute to improving it further too if you have your own itches that we haven't yet completely scratched.

The bit-babbler software package is available in Debian since Stretch (and is also available in Jessie-backports). Users of it can simply:

# apt-get install bit-babbler

The source is also directly available from this site. If you're using a Debian based Linux distribution, you can build packages of it for yourself by just running dpkg-buildpackage in the top level directory. For users on other systems, it can be built with the normal ./configure && make idiom.

The only required dependency is libusb-1.0 which we use to communicate with the USB interface chip.

If you have libudev available, then it will be used to provide hotplug support, enabling additional devices to be added without needing to restart the software (the list of allowed devices can be constrained by specifying them explicitly in the configuration). If you do not, the available devices will be scanned for when the software is first started. As of the 0.4 release, hotplug is also available on some other platforms where it is supported by libusb itself.

If you use munin, and wish to use the plugin we provide for it to continuously graph performance and send monitoring alerts, then you will also need the perl JSON::XS module (available in the libjson-xs-perl package on Debian systems).

The code has been tested on (at least) Linux, OpenBSD, FreeBSD, MacOS, and both 32 and 64-bit Windows. It should be easily portable (if it doesn't already just work right out of the box) to any other platform that libusb-1.0 supports or can be made to support. And we are always happy to help however we can if we get reports of a problem on some new platform.

The current release is: bit-babbler 0.9 · signature

Testing

There's fairly limited value to just pasting up some random test results here. On the one hand, it probably is the thing that people ultimately really care about, or at the very least it's what most of the sites for other devices we originally looked at seemed to think was about all it took to add some weight to their claims. But on the other hand, a couple of hand-picked results from a couple of tests on a couple of devices don't really tell you much of the story at all, unless the whole story is just really terrible (which some of them certainly were).

Part of the problem there is in the nature of statistical testing – significance only comes from a large number of tests, and in the case of a device like this, also conducted under a wide variety of operating conditions. The other part is you're reading this on the internet, and either you think we can be trusted, in which case simply saying the test results all look good carries just as much weight – or you don't think we can, in which case why would you believe any set of tabulated numbers that we put up? They could have been created from anything, or shopped to fit what we wanted you to see in them.

Ultimately the only testing which matters is the testing you do yourself, on the individual devices that you have. So we're going to show some test results here, but not because we expect you to be impressed by the numbers seen on them. We're going to use them to try to help explain what it is you should be looking for in them, what factors can influence their results, and what pitfalls to try to avoid both in running them and interpreting them.

ENT

One of our favourite sets of tests right now, at least for getting a very quick and mostly reasonable idea of the quality of a block of random numbers of just about any size, comes from John "Random" Walker's ENT suite. If you've looked into assessing the quality of random numbers even casually before now, you'll probably recognise the following, which is its assessment of a moderately large block of bits taken directly from the raw output of one of our devices (with no QA filtering applied to it).

Entropy = 8.000000 bits per byte.
Optimum compression would reduce the size of this 524288000 byte file by 0 percent.
Chi square distribution for 524288000 samples is 270.13, and randomly would exceed this value 24.61 percent of the times.
Arithmetic mean value of data bytes is 127.4989 (127.5 = random).
Monte Carlo value for Pi is 3.141623498 (error 0.00 percent).
Serial correlation coefficient is -0.000038 (totally uncorrelated = 0.0).

Here is another one, using an even larger block. This time it is testing the very same set of bits that we had analysed with both Dieharder and TestU01 for the descriptions of using those suites (which are detailed further below). In comparison to those, this test took about an hour and a half to complete.

Entropy = 8.000000 bits per byte.
Optimum compression would reduce the size of this 246488244224 byte file by 0 percent.
Chi square distribution for 246488244224 samples is 262.90, and randomly would exceed this value 35.35 percent of the times.
Arithmetic mean value of data bytes is 127.5000 (127.5 = random).
Monte Carlo value for Pi is 3.141596020 (error 0.00 percent).
Serial correlation coefficient is 0.000003 (totally uncorrelated = 0.0).

Skin deep

Looks pretty good, right? Except unless you're quite familiar with the details of exactly what it is testing there, and what things aside from the quality of the randomness can influence their precision, it's almost deceptively good. Or at the very least, that one sample of each set of bits is certainly a long way from really giving you the complete picture of everything that is important here.

Despite the fact that we haven't run this over a rigged set of numbers (I promise!), and that we didn't run this over multiple sets of them and cherry pick the nicest ones to show here (but I can say for certain that we could run it over multiple sets and find examples that look both better and worse than this, at least on some of the measures if not all of them) – there is definitely one decision that we made here which has essentially 'biased' every one of those measures, except the Chi-square result, to look shinier than it otherwise might: selecting the block size to analyse.

By analysing a single large block of a size that we choose, we can let the tests which converge over very large numbers of samples gain an almost arbitrary amount of extra precision, while minimising the effect of even significant anomalies, provided they are relatively infrequent. Only the Chi-square test becomes more sensitive to smaller errors as the size of the sample block grows.

It's not quite 'dishonest' – this is the actual result of the analysis for that many samples – but it does highlight some of these numbers (and particularly the ones that a naive user might most easily recognise and latch on to) with a 'softer focus' than what a diligent analysis might find to be ideal.

Let's have a look at what we see if we analyse exactly the same data in smaller chunks. The following three sample sets are each taken consecutively from non-overlapping blocks starting at the beginning of the first block that was analysed above.

Entropy = 7.996437 bits per byte.
Optimum compression would reduce the size of this 51200 byte file by 0 percent.
Chi square distribution for 51200 samples is 253.84, and randomly would exceed this value 50.87 percent of the times.
Arithmetic mean value of data bytes is 127.1955 (127.5 = random).
Monte Carlo value for Pi is 3.147310442 (error 0.18 percent).
Serial correlation coefficient is 0.010748 (totally uncorrelated = 0.0).

Entropy = 7.996191 bits per byte.
Optimum compression would reduce the size of this 51200 byte file by 0 percent.
Chi square distribution for 51200 samples is 270.00, and randomly would exceed this value 24.79 percent of the times.
Arithmetic mean value of data bytes is 127.4528 (127.5 = random).
Monte Carlo value for Pi is 3.126215868 (error 0.49 percent).
Serial correlation coefficient is -0.008923 (totally uncorrelated = 0.0).

Entropy = 7.996636 bits per byte.
Optimum compression would reduce the size of this 51200 byte file by 0 percent.
Chi square distribution for 51200 samples is 238.25, and randomly would exceed this value 76.70 percent of the times.
Arithmetic mean value of data bytes is 127.6872 (127.5 = random).
Monte Carlo value for Pi is 3.140278917 (error 0.04 percent).
Serial correlation coefficient is -0.002539 (totally uncorrelated = 0.0).

No prizes for guessing which of those five the marketing department would choose to put on their glossy brochure.

But the reality is, the data in all of those still has the same amount of real entropy on a per-byte basis, and the same expected variance from the mean. The larger sample gives us a more sensitive measure of the Chi-square statistic (but since these samples really are, by all measures that we've made on them, statistically random, that makes no actual difference to the result seen here), while the smaller samples are more sensitive to the variations that we actually expect to see in the other measures, if the blocks of samples really are genuinely random.
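
To put some rough numbers on that expected variance (a sketch using the textbook results for IID uniform bytes, not anything specific to these devices): the arithmetic mean of n uniform bytes has a standard deviation of about 73.9/√n, and the lag-one serial correlation of n IID samples has a standard deviation of roughly 1/√n.

    import math

    # Expected spread of two of the ENT measures for n IID uniform bytes.
    BYTE_SIGMA = math.sqrt((256**2 - 1) / 12)   # ~73.9, the sigma of one uniform byte

    for n in (51_200, 524_288_000, 246_488_244_224):
        mean_sd = BYTE_SIGMA / math.sqrt(n)     # spread of the arithmetic mean
        corr_sd = 1 / math.sqrt(n)              # rough spread of the serial correlation
        print(f"n={n:>15,}  mean 127.5 +/- {mean_sd:.3g}   serial corr 0 +/- {corr_sd:.1g}")

The small-block means and correlations above all sit within a few of these standard deviations of the ideal values, which is exactly the point.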

A very large block with a mean extremely close to 127.5, a value of π that was correct to more than 6 digits, and a serial correlation of less than ±1e-6 would not be terribly surprising, even from a block that may have very obviously failed other stronger tests for statistical randomness.

But a small block that reported results like that, for all of those measures in the same block, would be a statistical freak. And a string of consecutive small blocks that all produced results like that would be a very strong sign that the data is not actually random at all. A completely deterministic, but still uniform, sequence of bits could give a result like that, but a genuinely random source would be extremely unlikely to do so repeatedly in small sample blocks. Which makes our three small blocks above, when taken together, a potentially better proof than just the single large one alone, or even three very large ones which all looked about the same as the first one shown here.

A single bad run of these metrics can very quickly show us that our RNG is terrible. But a small number of runs (of any size) that look okay really doesn't tell us very much at all, except that it's worth doing the work to run some better tests that will more thoroughly assess the randomness of the bitstream.

The good news though, is that these test metrics are very cheap to run, which means that while the ent command line tool is of only limited usefulness as a rough gauge, the tests that it uses can easily be run, with various block sizes, over an enormous number of blocks. Doing that gives us a much stronger and more useful test, since we can then see that not only does each block fall within the expected bounds for each metric, but that they also vary from block to block across the full range that we would expect to see from a genuinely random source.

Trying to analyse that in the form above would be completely impractical, but these are an excellent set of metrics that we can use both for automated QA testing on devices that have passed more rigorous testing, and to plot pretty graphs that give a good indication of any periods of suspiciously abnormal behaviour.

Chi Chi Chi Chi Changes

The Chi-square test is probably the most valuable single analysis in the set performed by ENT, though it is also the most commonly and easily misunderstood of them.

Most of the tests that ent performs have some clear asymptote which they should converge on as the number of samples tested grows – so the only question for those is whether the observed deviation from it matches what would be expected for the given number of samples, and it's not too hard to see if they are at least in the right ballpark.

The χ2 test however does not. With it, the expected result for a block of random samples is itself a uniform distribution across its entire range – and so it is difficult to infer much from it with a high degree of confidence when given only a single result.

We can be suspicious if that result is near the edges of probability, but that suspicion is still only about as good as rolling some number of dice once, having them all come up with 6, then having to decide based on just that whether you think they are fair or not. You really can't be very sure they aren't unless you roll them again and they keep doing that more often than they should.

The most naive assumption which some people start with is that a result of 'and randomly would exceed this value 50.00 percent of the times' is the ideal for this test, and any variation from that is an anomaly – maybe because it seems intuitive that we want an outcome where the probability of a 0 or 1 being output is 0.5 – but that's not how this test works or what it is measuring for us.

A slightly less naive, but unfortunately common and almost as wrong, (mis)interpretation, is that values outside of a central range indicate the samples are not random – and even more wrongly, that values inside of it confirm they are. This mistake probably comes from a casual misreading of ENT's own documentation. While it is careful to say this would only flag them as suspected of being not random, it doesn't actually elaborate on the details of how to test that suspicion further and attempt to confirm or deny it.

Results near the extreme edges of probability are suspect in the same way which a horse that wins a race at 100 to 1 odds is suspect. Maybe the race was fixed, maybe the horse was doped – or maybe that really just was its lucky day. It's up to the race stewards to investigate all of those things, because all of them are possible. If it keeps winning at those odds, our suspicion should grow. But even at shorter odds, the race may still have been rigged and we need to be able to know that too if so.

But in order to do that, first we need to understand how the test works and what its output really means …

Clinical trials and tribulations

The χ2 test as run by ENT is a measure of how the observed frequency of each non-overlapping 8-bit value compares to the distribution of them which would be expected if each possible value was equally probable and each of the observed values had been selected independently at random from that entire set.
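
The computation itself is small enough to sketch in a few lines of Python (using scipy for the χ2 survival function); this mirrors what ENT reports, although ENT's own implementation is in C and differs in detail:

    from collections import Counter
    from scipy.stats import chi2

    def ent_style_chi_square(data: bytes):
        # Chi-square of the observed byte frequencies against a uniform expectation,
        # plus the "would exceed this value X percent of the times" figure that ENT
        # prints (the chi-square survival function with 255 degrees of freedom).
        expected = len(data) / 256
        counts = Counter(data)
        stat = sum((counts.get(v, 0) - expected) ** 2 / expected for v in range(256))
        return stat, chi2.sf(stat, df=255) * 100

    # e.g.  import os; stat, pct = ent_style_chi_square(os.urandom(1_000_000))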

What sets it apart from the other tests is the ability to tell us not only that the samples are not too biased, but also that they are not too uniform either. That the distribution of them really is (probably) Just Right.

For example, if you were to roll a fair dice 6 times, then even though there is a 1 in 6 chance of each value occurring for each roll, if you obtained a result where each value had occurred exactly once, that would actually be a rather improbably uniform outcome.

Not quite as improbable as the alternative non-uniform extreme, where every roll comes up with exactly the same value, but still far less probable than having some of the values not come up at all and others come up more than once – simply because there are more permutations in which that could happen, and individually each of those outcomes are all equally probable if the dice is indeed fair.

Ignoring for the moment that this example is far too small a set of trials for this type of test to have any significant amount of resolving power, such an outcome would in fact be reported, in the syntax used by ENT, something like this:

Chi square distribution for 6 samples is 0.00, and randomly would exceed this value 99.99 percent of the times.

If we were to take either of the naive interpretations here, we would then (wrongly) conclude that this dice is obviously not fair.

But that's not what this result is showing us at all. It is not saying there is a 99% chance that the dice is not random. And it should be self-evident that this isn't a completely implausible set of values for fair dice to have rolled.

It's instead telling us that if we ran this same test repeatedly, and if the dice was fair, we would expect that the χ2 statistic we obtained would have a value greater than the 0.00 which it returned this time in 99% of those trials.

The distinction is a subtle but important one. If we never saw an outcome like this, it would also be safe to assume the dice was not fair. What we really want to know is whether we only see it happen about the number of times that we should expect it to, because if it occurs either more or less than that, by a significant amount, then that would indeed be a suspicious result. What the χ2 test helps us quantify is what counts as a significant amount.
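
The survival-function reading can be checked directly for the dice example above, as a sketch using scipy's stock goodness-of-fit helper (and, as already noted, six rolls is far too small a sample for the χ2 approximation to mean much, which is part of the point):

    from scipy.stats import chisquare

    # Six rolls of a die, each face appearing exactly once.
    observed = [1, 1, 1, 1, 1, 1]
    stat, p = chisquare(observed)    # expected counts default to uniform
    print(stat, p)                   # 0.0 and 1.0, i.e. the "99.99 percent" shown above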

For trials of a genuinely random source, the expected values of the χ2 statistic should themselves be uniformly distributed when grouped according to their probability. If we bin those into blocks of 10% (0-9%, 10-19%, 20-29% etc.), or into any other equally spaced intervals, then over a large number of trials, each bin would be expected to have about the same number of results fall into it too – in much the same way that with a large number of dice rolls we would expect the variation in the total number of times that each value occurred to become increasingly small by proportion if it was both fair and fairly rolled.

So a result of 50% is not some ideal outcome that we are looking for here, but rather a value which we should expect to see (in the 50-59% range) in only about 10% of trials if the samples are genuinely random. If we always and only ever see a result in the 50% range, that would itself be a clear sign the RNG is almost certainly not random. It would be like rolling a dice which always came up with a 3 on it.

Significant others

A single result near the extremes of either end (let's use less than 1% or greater than 99% as an example) might raise initial suspicion because there is only a 1 in 100 chance of seeing that outcome in any individual trial.

If you see a result like this only occur with about that frequency when you repeat the test on more blocks of independent samples, then this is what we would normally expect. However if you see it for every block that you test, then you know something is almost certainly wrong. Likewise if you never see it in a large number of trials, then that should also be a red flag to investigate further.

So how do we know if the χ2 statistic values we are seeing do in fact occur with the expected frequency or are significantly anomalous? The answer to that is beautifully elegant. Since we know that they too should follow the same sort of distribution as the random samples themselves did, if the random samples were actually independent and identically distributed – we can run a χ2 test on them! If it shows that their distribution is improbable, then we can be more confidently suspicious that something is wrong than we can be from the result of a single test on a single block of samples.
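
As a sketch of that idea (the block size and count here are hypothetical, and the operating system's RNG stands in for whatever source is actually under test): compute the per-block χ2 p-value for many blocks, bin those p-values into ten equal intervals, and then χ2-test the bin counts against a flat distribution.

    import os
    from collections import Counter
    from scipy.stats import chi2, chisquare

    def block_p_value(block: bytes) -> float:
        expected = len(block) / 256
        counts = Counter(block)
        stat = sum((counts.get(v, 0) - expected) ** 2 / expected for v in range(256))
        return chi2.sf(stat, df=255)

    # Per-block p-values for 1000 blocks of 64 KiB each.
    p_values = [block_p_value(os.urandom(64 * 1024)) for _ in range(1000)]

    # Bin them into 10% intervals and test whether the bins are as flat as they should be.
    bins = Counter(min(int(p * 10), 9) for p in p_values)
    counts_per_bin = [bins.get(i, 0) for i in range(10)]
    second_order = chisquare(counts_per_bin)
    print(counts_per_bin, second_order.pvalue)   # a tiny p-value here is the real red flag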

But we'll come back to this again in more detail below in the discussion of Dieharder, since this is essentially what it does to assess the tests it performs, just with a slightly different test which is better suited to values that aren't naturally in discrete bins (like the p-values calculated for probability statistics).

One thing you can possibly infer with a relatively high degree of confidence though, is that if somebody shows you the output of ent to prove that their device is random – and all the results of it that they show you always fall into that 50% bin – then you now know what the probability is of it being a randomly selected representative result …

The χ2 test doesn't care where the bias came from, but with enough samples it will show you if there evidently is one. And the more samples you have to analyse, the better it gets at detecting even smaller anomalies as being significant.

The munin graphs of the short-block χ2 test results give a good visualisation of how this statistic is expected to be randomly distributed throughout its probable range when calculated for random samples.

FIPS 140-2

Perhaps the most interesting question we might ask about the FIPS 140-2 tests is 'Why are we even still bothering to run them?' They are computationally more expensive than the ENT tests, and also less able to detect small, but statistically relevant, flaws in the quality of a random stream.

One answer to that is we have another better test, which also requires us to count runs of consecutive bits, and which means that we get the most expensive part of these tests essentially 'for free'. But another is that they do operate on very small blocks. So while they are kind of useless for assessing the real quality of a random stream, they are still somewhat useful for very quickly detecting a catastrophic failure of the entropy source. They'll let significantly bad entropy through if the corruption is subtle, but assuming that the RNG hardware has been well tested before being deployed, and with stronger tests like those from the ENT suite to watch our back for small flaws which may creep in, they will almost immediately trap the case of a completely failed entropy source. Which is of enough value to us that we'll likely continue to run them until we find some other test which better covers that case as well.

We have somewhat enhanced the usefulness of them by tracking not only the expected failure rate (since their significance levels are calibrated to a point where they are expected to fail on average about once every 1250 blocks), but also that the run length without a failure does not exceed acceptably probable levels. A bitstream which these tests never failed on would most definitely not be genuinely random.
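
To put numbers on the run-length side of that: if each block independently fails with a probability of about 1 in 1250, the chance of a run of N consecutive blocks with no failure at all is (1 − 1/1250)^N, which falls off quickly enough to make a usable alarm threshold. A small sketch (the specific block counts are just illustrative):

    # Probability of seeing no FIPS 140-2 failures at all in a run of N blocks,
    # given the expected failure rate of roughly one in every 1250 blocks.
    FAIL_RATE = 1 / 1250

    def p_no_failures(n_blocks: int) -> float:
        return (1 - FAIL_RATE) ** n_blocks

    for n in (1_250, 5_000, 10_000, 20_000):
        print(f"{n:>6} blocks without a failure: p = {p_no_failures(n):.6f}")

    # By 10,000 blocks the probability is already down around 0.03%, so a stream
    # which never fails these tests is itself behaving improbably.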

We similarly also produce pretty graphs that give a good overview of any trends that may be evident in these test results too.

NIST SP 800-22

The NIST special publication 800-22 draft was last revised in 2012, and at the time of writing this, no newer or final version has yet been released. It provides a lot of detail, both theoretical and practical, about the tests implemented in the NIST Statistical Test Suite. We ran all of our initial prototypes through that suite as part of verifying them, though in general we've found that it didn't seem to catch much for us that some other test didn't also flag (which isn't necessarily surprising, other suites such as Dieharder also used that document in developing their own test sets and TestU01 includes equivalent or similar tests too). It can be quite quick to run, and produces detailed reports on the test results, but the front end that it comes with isn't exactly a thesis on either user or machine friendliness, so it would need some work to make it more suited for regular automated testing of large numbers of devices (which largely has already been done by other suites incorporating these tests).

There are some tests in it which are either variants of tests that we had already added to our own suite, or which were very easy to add to some existing part of it just from the detail given in the draft document. Currently we include the min-entropy calculation from it in our implementation of the ENT tests; the Adaptive Proportion Test, which fitted easily into our implementation of the FIPS 140-2 tests; and the Repetition Count test, which is a slightly better version of the 'Continuous test' from FIPS 140-2. There are at least a few other tests described in it which could usefully be added to the continuous QA checks that we have, but right now that's on the list of things that are still earmarked for Future Study.
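
As one illustration of the kind of continuous check being described there, here is a minimal sketch of a repetition-count style test. One common way to derive the cutoff is from the claimed entropy per sample H and an acceptable false-alarm probability α, roughly 1 + ⌈−log2(α)/H⌉; treat the specific numbers below as placeholders rather than anything we actually ship.

    import math

    def repetition_cutoff(bits_per_sample: float, alpha: float = 2**-30) -> int:
        # One common derivation of the cutoff: 1 + ceil(-log2(alpha) / H).
        return 1 + math.ceil(-math.log2(alpha) / bits_per_sample)

    def repetition_count_ok(samples, cutoff: int) -> bool:
        # Alarm (return False) if any value repeats 'cutoff' or more times in a row.
        last, run = object(), 0
        for s in samples:
            run = run + 1 if s == last else 1
            last = s
            if run >= cutoff:
                return False
        return True

    # e.g. for full-entropy bytes: repetition_cutoff(8.0) == 5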

Dieharder

It is probably not possible for us to say enough nice things about this suite of tests. It's almost single-handedly responsible for the push to extract higher and higher bitrates out of the devices we experimented with so that we could run them through it faster and more often. It mercilessly shattered any remaining illusions we might have had about how this project should be an easy job, and exposed just how far wrong 'well, that looks pretty random' can be when only naive tests are applied to something which is in fact very definitely not so. And it's at least partly responsible for us making the decision to take on this whole project in the first place …

When you come across somebody who is claiming to have a robust and reliable and well tested device, intended for cryptographic use, and then find out that their claim that it passed Dieharder was entirely based on running it with the wrong command line options and analysing its own default built-in PRNG rather than the hardware device that was supposedly under test – and that when run correctly by somebody else, that device unambiguously fails it – and that when that is pointed out, the response is no, no, that's ok, hardware generators are supposed to do that – skepticism about pretty much everything gets elevated to a whole new level. You start wondering who else didn't give what other things the sort of care and attention that they needed for this.

And when you later see something like this:
 test_name   |ntup| tsamples |psamples|  p-value |Assessment
  rgb_bitdist|   1|    100000|     100|0.00001100|   WEAK
  rgb_bitdist|   2|    100000|     100|0.00000000|  FAILED
being described as a pretty good result …
you know not a drop of that skepticism was wasted.

Probably improbable

Dieharder applies science to the problem that we touched on above of trying to read some real significance into the ENT results. Aside from just running many more tests, which are sensitive to many different (ir)regularities, it also runs each of those tests many times and looks at the distribution of the results for each test too.

In the short-block ENT examples above, we noted that the sort of variation seen there was not only expected, but it was actually a fundamental characteristic of a genuinely random source, and any stream of bits which does not exhibit it is very evidently not random. What we didn't try to quantify there was the Goldilocks zone, of exactly how much variation was Just Right, and how much would actually be too much or too little if those blocks were really random. This is something that we can calculate and measure though.

For any statistic where we know what the expected distribution of independent, identically distributed (IID) samples is, we can determine what the probability is of obtaining some particular outcome. This probability is known as the p-value, which is used to establish the statistical significance of a series of observed results. The Chi-square (χ2) test in the ENT suite is a measure of the goodness of fit that the observed distribution of sample values has to the expected distribution of a series of IID samples, and the percentage interpretation that is shown for it is its p-value.

In the horror result from Dieharder just above, the WEAK p-value of 0.000011 signifies that a test of the distribution of 0 and 1 bits from a good (IID) random source would be expected to produce a distribution of results which were more severely biased than those which were seen in that test roughly once in every 100,000 trials. Which is fairly improbable, but not impossible. If you repeated that test more than 100 thousand times, that is one of the results it is probable you would see – and if you didn't see it after running the test roughly that many times, that too would be an indication that the RNG is not truly random. In this case we can guess based on the adjacent result that further testing would show this isn't just the chance outcome of a good RNG, and that might be a fairly safe bet just on the back of 1 in 100k odds alone, but without more testing that would still just be a guess. Dieharder flags this as WEAK because the outcome is ambiguous, and only more testing can actually resolve that for us one way or the other.

The FAILED p-value seen there signifies that a test of the observed distribution of pairs of bits (the tuples 00, 01, 10, 11), produced results which were so heavily biased away from what would be expected that a good random source would have this outcome less than once in every 200,000,000 trials (and that's if we generously assume that the next digit, which was lost to rounding, might have been a 5). While still not a completely impossible event, we've now fairly unambiguously moved from improbable to implausible that this particular stream is really random. In statistical terms, the null hypothesis (that the stream of bits is random) can pretty safely be rejected, with only a very small chance of being wrong about that.

The interesting question for us then, is how do we interpret a weak result?

Turtles all the way down

On any given run of Dieharder (or any of the other statistical tests that we do), there is a non-zero probability that some result will be on the cusp of uncertainty. As we noted above, if this never happens, then that in itself is a strong sign that the RNG we are testing is almost certainly not random. Which means we're back to the same problem that we started with, how do we find out if that is happening the right number of times and not too often or too rarely? We could run the same test again, and see if it doesn't fail a second time. But what do we do if it does? And if it doesn't, what confidence do we have that it wouldn't happen again if run a third time? We could just run it repeatedly until we get the answer that we want, and not tell anybody else about the outcome of the others, but this isn't the pharmaceutical industry, so that simply isn't going to fly. Not for our peace of mind, and not in the face of independent scrutiny either.

The elegantly recursive answer to this is that we can use exactly the same solution, of comparing the observed distribution to an expected one, again. And again. And again.

If we obtain our initial statistic value from applying some test to some number of purportedly IID samples, and our p-value from the significance of that statistic, then we can repeat that same process multiple times to obtain some number of p-values. If each series of tests is also independent, then we know the expected distribution of the p-values that we obtain in this way too. They will also be IID samples in their own right. Which means we can apply the same analysis to them, to arrive at yet another p-value, which indicates the probability of obtaining the original set of p-values that we observed. If the original samples really are IID, then at each layer of repeating this, the p-values which we obtain should also be IID as well, and if they aren't then either numerical errors have crept into the calculation along the way, or the original samples were not IID either.
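
A sketch of one layer of that recursion, using a Kolmogorov–Smirnov test against the uniform distribution as the 'slightly different test' suited to continuous p-values (whether that is exactly the variant Dieharder uses doesn't matter for the illustration):

    from scipy.stats import kstest

    def second_order_p(p_values) -> float:
        # Given p-values from repeated runs of some test on supposedly IID samples,
        # return the probability of a set at least this non-uniform arising by chance.
        return kstest(p_values, 'uniform').pvalue

    # Hypothetical usage: feed it the p-values from, say, 100 repetitions of one test.
    # A very small result says the *pattern* of results is itself implausible, even if
    # no individual run failed outright - and the same step can be applied again to the
    # second-order p-values collected from many such batches.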

This still cannot prove without any doubt that our RNG is certainly random (nothing can do that), but it can tell us if the pattern of improbable results that we saw is itself highly implausible or whether they seem to happen at about the rate that we'd expect them to for a random source. Applied enough times, to enough test results, the power of the test will increase, to the limits of numerical accuracy, and the significance of the result will become less uncertain. It will either start to look like an increasingly probable outcome, or we'll get a fairly unambiguous FAILED result.

Fortunately for us, none of this is news to the people who wrote Dieharder. It allows us to both manually tweak its test parameters to increase the strength of tests where we are uncertain about the outcome or suspicious of some weakness, and to have it automatically repeat tests where the result appeared to be weakly significant, on an increasingly large number of trials, until the outcome is no longer heavily in doubt. The automation for that isn't (yet) perfect, but it removes at least one layer of potential confirmation bias from a human user who is more interested in seeing a good outcome than proving they can break things.

We've put up an annotated comparison of three consecutive test runs which we hope will help to demonstrate this a bit more concretely.

TestU01

It's tempting to think of the TestU01 suite as a kind of Swiss Army Dieharder, or maybe more aptly as its Swedish flatpack counterpart. Like Dieharder it provides a very comprehensive library of both generators and tests, but it does not include a standard application front end. Instead you get to assemble your own from a huge range of possible options. It does however make it quite easy to do this with a relatively small amount of code, abstracting out most of the basic things that are needed for a minimal tool, and providing test sets that are grouped into predefined 'batteries' with preset parameters to suit various common use cases.

It would be difficult to describe even part of the things which this library provides in a way that would come close to doing it justice. The short version of just the main library documentation runs to over 200 pages, and that's not counting the (separate) docs for the utility or probability distribution functions that it also includes …

Crush, Kill, Destroy

So far we have been using the Alphabit battery which was designed primarily for testing of hardware generators, the Rabbit battery which is oriented toward testing binary data (as opposed to random numbers in the interval [0,1) which much of TestU01 operates on), the SmallCrush battery which analyses about 51 million 32-bit random numbers and is quite fast, the Crush battery which tests about 2^35 random numbers (actually ~126GiB of data), and the BigCrush battery which tests about 2^38 random numbers (actually ~1.3TiB of data).

Mostly we've found that Alphabit, Rabbit, and SmallCrush will pass things that Dieharder will raise at least half an eyebrow about, though at least part of the reason for that is TestU01 does not make the same distinction about weak results in the grey area of significance. It simply alerts about results with p-values which fall outside of the range [0.001, 0.999], where Dieharder will flag a result that exceeds the 0.005 threshold as WEAK, and one which exceeds 0.000001 as an almost certain FAILED result.

In tests we did with streaming bits from multiple devices, where one device in the set was deliberately detuned by just enough to produce poor quality randomness that our long-term tests graphing the individual devices could detect (but only after several hours of running) – we found that Dieharder did not always fail the combined output, while Crush did more reliably detect the anomaly in at least one or two of its tests (but not always in the same tests each time we repeated this). The precise reasons for that are still at least partly an open question, but the primary goal of this experiment was to confirm that we did have tests which could detect that situation occurring, and it was reassuring to find that we did.

The TestU01 batteries also make it somewhat more difficult to just increase the statistical power to resolve an ambiguous test result as we described above for Dieharder. While the individual tests are all very configurable (which is a lesson both these suites learned from the limits of the original Diehard), the batteries of them are not. Some of the tests in the Alphabit and Rabbit batteries will tune their parameters based on the number of bits available, but many just always operate on the same sample size with the same test parameters each time. The same principles do apply here too, but the mechanics of applying them are far more DIY with TestU01 than with Dieharder.

AIS-31

The AIS-31 standard documents contain a wealth of good information on many aspects of RNG design and testing, and perhaps more importantly, also provide detailed rationales for the recommendations which are made in them. If you're interested in learning more about the gritty details of evaluating RNGs, they're certainly worth putting high on your reading list.

Unfortunately (for us at least), the test suite they provide to go along with that is written in both Java and German, which makes it a bit more … difficult … for us to recommend it as effusively as the documents themselves, especially since the tests it implements are the same as, or should be covered by, tests we are already running in other suites. There may be some variation in their overall rejection criteria and methodology for establishing that, but at this stage, like the NIST STS, we don't expect that to catch anything that we won't already trap in some other test.

We do have an independent implementation of what it calls the General Runs Test, which we'd already written and were using in our initial assessment work before we became aware of AIS-31, but it adds a nice validation to the usefulness of that test. That test compares the observed number of runs of consecutive 0 or 1 bits, of any length, to the expected number of them based on the total number of bits being analysed. The significance of any deviation from the expected distribution is established by a Chi-square goodness of fit test.
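
A sketch of the general idea (not the actual bit-babbler implementation): a random stream of n bits is expected to contain roughly n/2^(k+1) runs of length exactly k, counting runs of both values together, so we can tally the observed run lengths and χ2-test them against that, lumping the long tail into one bin. The real test's handling of the longest-run bins and the degrees of freedom is fussier than this.

    from collections import Counter
    from itertools import groupby
    from scipy.stats import chi2

    def runs_test_p(bits) -> float:
        # Compare the observed number of runs of each length (0s and 1s together)
        # against the ~n/2^(k+1) expectation, with a chi-square goodness of fit.
        bits = list(bits)
        n = len(bits)
        run_lengths = Counter(len(list(group)) for _, group in groupby(bits))

        # Keep only length bins whose expected count is reasonably large, and
        # lump all longer runs into a single catch-all bin.
        max_k = 1
        while n / 2 ** (max_k + 2) >= 5:
            max_k += 1
        observed = [run_lengths.get(k, 0) for k in range(1, max_k)]
        observed.append(sum(c for k, c in run_lengths.items() if k >= max_k))
        expected = [n / 2 ** (k + 1) for k in range(1, max_k)]
        expected.append(n / 2 ** max_k)   # geometric tail: all runs of length >= max_k

        stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        return chi2.sf(stat, df=len(observed) - 1)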

We continually calculate the results of that test during normal running (which is almost free to do if we're already counting runs of consecutive bits for the FIPS 140-2 tests), and make them available to query in real-time, but we aren't currently graphing them or using them for automated assessment of the stream's (lack of) quality. There are two main problems with doing that which we don't have a perfectly clear answer to yet:

Open questions for the Bitruns test

The first is selecting exactly what level of significance ought to unambiguously trigger a QA failure. As we described above, we expect this measure to swing fully between the limits of improbability, so if we were to censor the stream at a point where this statistic became fairly improbable, but not utterly implausible, then we would actually make the output stream less random than it would be expected to be. Since we also take only a long term measure of this (over every bit that has been seen since the monitoring process began), once a very large number of samples are being used it could hover at those limits for a considerable period of time before swinging back the other way again. We have occasionally seen this measure appear to be a leading indicator for a bad stream that will later fail one or more other statistics, but we've also seen it swing to (improbable but not implausible) extremes when every other statistic continues to assert (for long periods) that it is still statistically random. For now, we do expect that any genuinely bad stream will begin to fail other measures before this one actually gets to unambiguously implausible levels, so we're not losing much by not using it for that, but if we find some strong theoretical proof, or collect enough experimental evidence, that we can put a firm number on this, then it would be fairly trivial to also use it directly as an automatic QA trigger.

The other thing we don't quite have a good answer for with this test is how we might meaningfully graph it in a way that can sensibly be interpreted from the graph data alone. Both the raw data and the χ2 statistic are expected to increase in value over time if the stream is good, as the total number of bits that have been observed increases. The p-value has an element of discontinuity to it, since the number of degrees of freedom used to calculate it from the χ2 statistic will also grow over time, as longer runs become more probable when the total number of bits in the stream being measured increases. The correct number of degrees of freedom itself is also somewhat subjective, for a test like this it is common to only include the bins where the expected number of results is greater than 5, but that is merely a convention, which while experimentally reasonable for many data sets and widely accepted, has no strong theoretical basis. All of that being before we consider the simple fact that there are actually quite a lot of numbers in that measurement, any or all of which might individually show some real anomaly that is more interesting than the overall goodness of fit to the expected distribution.

So we're open to good suggestions about how people might think we can report on that one better, in ways that would be useful to them, but for now we just make the full report able to be requested at any time (in both machine and human readable forms). The user-friendly output of that looks something like this.

Other tests

In the early stages of developing this, we searched for and dabbled with just about every kind of test, or published tool for testing, that we could find which looked even half plausibly able to detect some kind of flaw. As our designs moved from 'well, that didn't take long to break' to 'we're going to need a bigger hammer', the tests that are described above were basically the only ones that so far have proved themselves enduringly useful at finding things that something else might not, or at the very least at possibly finding them more quickly.

Plotting samples visually, in various dimensions and after various manipulations, had an early appeal. It's widely claimed that our eyes are sensitive to patterns that other tests might miss, and who doesn't like pretty pictures as proof of something or another – but the reality was that this was a relatively poor way to detect anything but the most glaring departures from a random distribution, and only useful over relatively small sets of samples at that.

Which isn't to say that's not a useful test to subject PRNG algorithms to, where ultimately there is some known mathematical structure to the samples whether it is easy to see in this way or not. We also can't totally rule out that there may be some clever mapping of samples that could be done which we so far missed, but we tried a number of methods, and once our prototypes got even remotely good, the other tests were still finding things that were completely invisible (or at least far from being visually obvious) in every plot, no matter how hard we squinted at them or what dimensions or domains we tried warping the samples into.

We found a bunch of other test suites which people had created, some supposedly general purpose, and some for analysing very specific uses of random numbers, but all of those fell by the wayside pretty quickly too. Maybe we missed something actually good there too, and if we did, we'd love to hear about it and put it to the test, but right now, any such things are at least as secret to us still as we hope the entropy from the current BitBabbler designs really is.

Which also isn't to say these are the only tests that you should run, or the only tests that we will ever run. It cannot be emphasised enough that there is simply no way to prove that any source really is random, the best thing that we can do is succeed at proving that some sources most definitely are not, and not let failure to do so get in the way of continuing to try to prove that the rest of them are not either. We're making the BitBabbler devices available more widely now because we are consistently failing at that – but we know we won't ever really be able to trust them unless all of the rest of you fail at that too! The more people we have diligently trying to do that, the better off everyone who needs a trusted source of entropy ultimately will be.