On Linux’s Random Number Generation
I have been asked about the usefulness of security monitoring of entropy levels in the Linux kernel. This calls for some explanation of how random generation works in Linux systems.
So, randomness and the Linux kernel. This is an area where there is longstanding confusion, notably among some Linux kernel developers, including Linus Torvalds himself.
A long time ago (20+ years), a random generator was designed along the theory of “entropy depletion”. It goes thus: at any point, the generator has an internal state (the “entropy pool”) which has contents unknown to attackers. The amount of unknown-ness is called “entropy” (initially by analogy with thermodynamics, but that’s just an analogy; it’s not real thermodynamic entropy) expressed with a logarithmic scale, in bits; roughly speaking, if the pool contains “n bits of entropy” then this means that its unknown contents can be any of 2n potential values. (That’s a simplified definition that assumes that all potential values are equiprobable, but it will do for this explanation.)
Whenever some bits are obtained from the pool, they conceptually leak information about the pool: if you extract k random bits, then the remaining unknown-ness may be reduced down to 2n-k potential values, i.e. the entropy pool has been depleted by k bits.
Now, here, the important word is “conceptually”. The RNG will not give away its raw pool bits just like that; instead, it will use the pool as a seed in a cryptographically strong RNG, whose output is what is sent back to applications. The whole idea of a RNG being cryptographically strong is that it is computationally infeasible to actually obtain information about the seed by just observing the output. The RNG, effectively, plugs the leak, and no depletion occurs in any practical sense.
However, the person who designed that RNG was not very clear about that notion of “cryptographically strong”. In fact, out of some paranoid zeal, that person felt that some secret values required extra strong entropy, so that they may defeat adversaries that are powerful enough to break through the cryptographically strong RNG (it was in the 1990s, with the US rules on export of cryptography and the PGP craze, so these powerful adversaries were the usual fantasy of god-like NSA). If the CSRNG is assumed to be transparent, then depletion is back. Hence, the kernel was equipped with code that maintains a count of the current pool entropy contents, and /dev/random blocks when there is not enough to return the requested amount of pure NSA-proof randomness.
This is flawed reasoning, in several ways:
- The whole premise of entropy depletion is that cryptography does not work (the CSRNG does not prevent the leak), and yet the one and only example of values that require absolute randomness is “cryptographic keys”, i.e. the things that make sense only if cryptography, in fact, works. This is self-contradictory.
- The current amount of entropy in the pool is not known. It is estimated. Entropy is extracted from physical events (in particular exact timing of IRQs, as measured with the cycle counter), and this relies on that information being unpredictable by attackers. In other words, the god-like entities that can munch through cryptographic algorithms at breakfast are supposed not to be able to measure and accurately simulate physical systems. So much for divine abilities. In a sense, whether a given mechanism provides entropy is a matter of “this or that expert said that it does”; impossibility of accurate simulation comes from physics, specifically quantum mechanics, so the entropy pool estimator is based on a fair amount of trust in physicists such as Feynman or Bohr (but not Einstein, for that matter). But the entropy depletion is an assertion that cryptographers such as Shamir cannot be equally trusted.
- The measures from physical events are not uniform sequences of bits; they have biases, and successive measures can be correlated. Thus, the pool relies on some mixing which uses… a cryptographic hash function. So cryptography still has to be trusted to do something properly.
Notwithstanding its flaws, the entropy depletion theory got its followers early on, and was adamantly maintained by some big names in Linux kernel development, mostly because it is quite hard to admit to other people, and to oneself, that one might have got something wrong.
That theory is harmful in that /dev/random may block, possibly at inopportune moments. Typically, an OS or application installation may stall for long periods (possibly hours, I have seen it) because it tries to generate a cryptographic key (for SSH, PGP…) and the kernel does not believe it still has sufficient entropy. Note that, for all practical purposes, it really has more than enough entropy; the pool contents are not guessable by outsiders, since all previously extracted random elements were obtained through the CSRNG, and attackers do not have, in fact, the help of demoniac entities with unlimited computing abilities (or when they do, you usually have bigger problems).
To work around the blocking issues of /dev/random, an alternate API was added, called /dev/urandom. It’s the same as /dev/random, except that it does not block. Never. This is better, but not actually good: there are times when the entropy pool is really empty, namely during the early stages of the boot. At that point, the kernel did not obtain many physical events to work on, and it is conceivable that /dev/urandom output could be predicted. Note that this is not really about /dev/urandom, the special file: the early boot moments we are talking about are before there is any notion of a file; this is really about a single case, which is booting a diskless machine over the network, and mounting the root filesystem from a remote server. The relevant network protocol can need some randomness (e.g. TCP sequence numbers).
Since at least the early 2000s, Linux distributions have applied workarounds to ensure proper entropy at boot time, namely that a boot script injects the contents of a saved file upon boot, and immediately proceeds to regenerate the said file with /dev/urandom. In effect, this transports the entropy across reboots, so that even if the boot sequence was not enough, by itself, to generate enough entropy, the file contents would ensure that everything is all right.
Later on, a system call was added, to get randomness without having to open a file and use a file descriptor; it is named getrandom(). That system call finally implements the proper behavior, i.e. blocking until a sufficient amount of initial entropy has been gathered since last boot, but never blocking afterwards. Incidentally, this is what /dev/urandom does on sane systems (e.g. FreeBSD or macOS). Applications should simply use getrandom(), and be happy.
(Or not. Linux 5.3 will turn back getrandom() into /dev/urandom with its never-blocking behavior, because, quite frankly, Linus’s opinions on his own mastery of RNG theory exceed his actual abilities. The reasoning seems to be that if there is “not enough entropy”, then the application should make an interpretative dance of some kind to promote the appearance of new hardware events from which the entropy can be gathered. How the application does that, or why the kernel should not do it despite being much closer to the hardware, is not said.)
All of the above is the classical description, up to the early/mid-2010s. There are now a few extra relevant points to make:
- Virtual machines are a challenge to entropy gathering, in at least three ways:
- They provided access to virtual, emulated hardware only. The nice physical events from which entropy is supposed to come (thermal noise, mostly) are then just a simulation, and that which is simulated can, indeed, be simulated.
- The hypervisor can prevent access to the cycle counter (rdtsc opcode), which will further hinder attempts by the kernel to get entropy from the (not so) physical events.
- VM snapshots can be taken and replayed at will; each restart from the same snapshot will use the recorded pool contents.
- A contrario, sufficiently recent CPU have an embedded hardware generator which is totally available from VM (it’s the rdrand opcode on x86 CPU). The Linux kernel uses rdrand. It does not trust rdrand, because NSA (I’m not exaggerating! The kernel source code explicitly calls out the NSA), so it will not count the rdrand output as worth any entropy. But it will still use it. In all edge cases described above (network boot, VM snapshots…), rdrand will by itself ensure that there is enough entropy for all practical purposes.
Given all the above mess, there are people who try to use extra hardware RNG, usually by injecting the random bytes obtained from these devices into the kernel (by writing into
/dev/random). Surprise! It does not actually work; or, at least, not as well as is usually believed. It so happens that in 2017, somebody improved the RNG by replacing the old pseudorandom stream generator with something decent and fast (the ChaCha20 stream cipher). As a byproduct of these changes, the CSRNG reseeds from the entropy pool only if the last reseeding occurred more than five minutes earlier (this is a part of the performance improvements). It still uses rdrand for each call, so it is very fine. But that means that any extra entropy, obtained from a dedicated hardware RNG (possibly a very costly one) and injected into the kernel, may be wholly ignored for up to five minutes. This incarnates a new high on the scale of uselessness.
The entropy depletion cult is not very happy about that. Over time, it has spun off some extra sects, including haveged. This particular piece of software is not only operating on the flawed notion of entropy depletion, but also on the belief that where the kernel, which has access to the hardware, fails, a userland software, without access to the hardware, can succeed. In practice, it’s another moving part in a complicated but ultimately meaningless ritual.
Some words on statistical tests. The NIST has come up with a bunch of statistical tests meant to measure the quality of a random source. This is a semi-flawed idea. Any good CSRNG will pass these tests successfully. Many atrocious CSRNG will pass them, too. A bias that can be detected through statistics is the sign of an extremely bad design. The whole point of cryptography is to defeat intelligent attackers who have computers and know exactly what kind of software we run; this gives them a lot more power than statistics. Consider for instance the following RNG: from a seed s, it generates output in 32-byte chunks by doing the following:
- Replace s with SHA-256(s)
- Output s
- Goto 1 (until all requested output as been produced)
This will pass the FIPS 140-2 statistical tests with flying colours; none of the statistical tests will see anything wrong with that. But, of course, the first 32 bytes that this RNG outputs are its current internal seed; any attacker observing a 32-byte output chunk can compute all subsequent output with 100% reliability.
To be brief, a CSRNG should be unbreakable by smart attackers with lots of resources. If a CSRNG does not pass statistical tests, then this means that it can be broken by a chimpanzee. But if it passes the tests, this does in no way prove that it is cryptographically strong; it just means that the chimpanzee will be stumped.
Now, statistical tests are randomized in nature. Each such test is really a measure of implausibility: the test gets an output, then tries to compute how improbable such an outcome can be, assuming that the source is perfectly random. For instance, if there was less than 1 chance in 1000 to have such a bias, then the test reports a “significant bias” (this is the way all experimental science works, the probability being called “p-value” and the significance threshold being traditionally 0.05, i.e. that which could happen with probability 1/20 or less is considered significant).
1 in 1000 is a low probability. But if you run 10000 tests, each with a 1 in 1000 threshold, then there will be some “failures”: things that happen with probability 1 in 1000, do happen fairly reliably if you try 10000 times. Thus, the fact that some FIPS tests occasionally report a “failure” is here meaningless.
Summary. Monitoring entropy levels on Linux systems is not very useful. From a security point of view, the entropy estimates by the kernel are quite off. It may happen that some applications insist on reading from /dev/random instead of /dev/urandom (or from getrandom()), and block unduly for the sake of that specific brand of fetishism. This should not happen often. OpenSSL uses /dev/urandom, for instance. So does Java by default. The occasional blocking is so irksome that there has been great pressure to make all apps use the non-blocking /dev/urandom; only the very few apps that nobody really uses (e.g. PGP) still insist on /dev/random.
To fully ensure that no blocking ever happens, a simple solution might be to simply make /dev/random a link to /dev/urandom. (In recent Linux distributions, /dev is a virtual filesystem, so such renaming must be done at each boot, which may or may not be easy.)
[Editor’s note: This post was slightly modified on December 20 2019 to remove specific attribution of some of the ideas under discussion.]