BAT: a Fast and Small Key Encapsulation Mechanism

In this post we present a newly published key encapsulation mechanism (KEM) called BAT. It is a post-quantum algorithm, using NTRU lattices, and its main advantages are that it is both small and fast. The paper was accepted by TCHES (it should appear in volume 2022, issue 2) and is also available on ePrint:

An implementation (in C, both with and without AVX2 optimizations) is on GitHub:

What is a Post-Quantum KEM?

Asymmetric cryptography, as used in, for instance, an SSL/TLS connection, classically falls into two categories: digital signatures, and key exchange protocols. The latter designates a mechanism through which two parties send each other messages, and, at the end of the protocol, end up with a shared secret value that they can use to perform further tasks such as symmetric encryption of bulk data. In TLS, the key exchange happens during the initial handshake, along with signatures to convince the client that it is talking to the expected server and none other. A key encapsulation mechanism is a kind of key exchange protocol that can work with only two messages:

  • Party A (the server, in TLS) sends a generic, reusable message that is basically a public key (this message can conceptually be sent in advance and used by many clients, although in TLS servers usually make a new one for each connection, to promote forward secrecy).
  • Party B (the client, in TLS) sends a single message that uses the information from A’s message.

Key encapsulation differs from asymmetric encryption in the following sense: the two parties obtain in fine a shared secret, but neither gets to choose its value; it is an output of the protocol, not an input. The two concepts of KEM and asymmetric encryption are still very close to each other; an asymmetric encryption can be used as a KEM by simply generating a sequence of random bytes and encrypting it with the recipient’s public key, and, conversely, a KEM can be extended into an asymmetric encryption system by adjoining a symmetric encryption algorithm to it to encrypt a message using, as key, the shared secret that is produced by the KEM (when the KEM is Diffie-Hellman, there is even a rarely used standard for that, called IES).

In today’s world, we routinely use KEMs which are fast and small, based on elliptic curve cryptography (specifically, the elliptic curve Diffie-Hellman key exchange). However, tomorrow’s world might feature quantum computers, and a known characteristic of quantum computers is that they can easily break ECDH (as well as older systems such as RSA or classic Diffie-Hellman). There is currently no quantum computer that can do so, and it is unknown whether there will be one in the near future (or ever), but it makes sense to make some provisions for that potential ground-breaking event, that is, to develop some post-quantum algorithms, i.e. cryptographic algorithms that will (presumably) successfully defeat attackers who wield quantum computers.

The NIST has been running for the last few years a standardization process (officially not a competition, though it features candidates and rounds and finalists) that aims at defining a few post-quantum KEMs and signature algorithms. Among the finalists in both categories are algorithms based on lattice cryptography; for KEMs, the two lattice-based algorithms with the “best” performance trade-offs are CRYSTALS-Kyber and Saber.


BAT is not a candidate to the NIST post-quantum project; it has been just published and that is way too late to enter that process. However, just like standardization of elliptic curve cryptography in the late 1990s (with curves like NIST’s P-256) did not prevent the appearance and wide usage of other curve types (e.g. Curve25519), there is no reason not to keep researching and proposing new schemes with different and sometimes better performance trade-offs. Let us compare some performance measurements for Kyber, Saber and BAT (using the variants that target “NIST level 1” security, i.e. roughly the same as AES-128, and in practice the only one you really need):

AlgorithmKeygen cost
cost (kilocycles)
cost (kilocycles)
public key
size (bytes)
size (bytes)
Kyber, Saber and BAT performance on Intel x86 Coffee Lake

These values were all measured on the same system (Intel x86 Coffee Lake, 64-bit, Clang-10.0.0). Let us see what these numbers mean.

First, it is immediately obvious that BAT’s key pair generation is expensive, at close to 30 millions of clock cycles. It is a lot more than for the two other algorithms. However, it is not intolerably expensive; it it still about 10 times faster than RSA key pair generation (for 2048-bit keys), and we have been using RSA for decades. This can still run on small embedded microcontrollers. Key pair generation is, normally, a relatively rare operation. It is quite convenient if key pair generation is very fast, because, for instance, it would allow a busy TLS server to create a new key pair for every incoming connection, which seems best for forward secrecy, but if key pair generation is not so fast, then simply creating a new key pair once every 10 seconds can still provide a fair amount of forward secrecy, at negligible overall cost.

Then we move to encapsulation and decapsulation costs, and we see that BAT encapsulation (the client-side operation, in a TLS model) is very fast, with a cost lower than a third of the cost of Kyber encapsulation, while decapsulation is still on par with that of Saber. We could claim a partial victory here. Does it matter? Not a lot! With costs way lower than a million cycles, everything here is way too fast to have any noticeable impact on a machine such as a laptop, smartphone or server, even if envisioning a very busy server with hundreds of incoming connections per second. Cost may matter for very small systems, such as small microcontrollers working on minimal power, but the figures above are for a big 64-bit x86 CPU with AVX2 optimizations everywhere, which yields very little information on how things would run on a microcontroller (an implementation of BAT optimized for such a small CPU is still a future work at this time).

What really matters here for practical Web-like deployments is the sizes. Public keys and ciphertexts (the two messages of a KEM) travel on the wire, and while networks are fast, exchanges that require extra packets tend to increase latency, an especially sensitive parameter in the case of TLS key exchanges since that happens at the start of the connection, when the human user has clicked on the link is and is waiting for the target site to appear. Human users have very low patience, and it is critical to have as low a latency as possible. Cloudflare has recently run some experiments with post-quantum signature schemes in that area, and it appeared that the size of public keys and signatures of current candidates is a problem. Similar issues impact KEMs as well (though with a lower magnitude because in a TLS handshake, there is a single KEM but typically several signatures and public keys, conveyed in X.509 certificates).

We may also expect size of public keys and ciphertexts to be an important parameter for low-power embedded applications with radio transmission: the energy cost of message transmission is proportional to its size, and is typically much greater than the cost of the computations that produced that message.

We see that BAT offers public key and ciphertext sizes which are significantly lower than those of Kyber and Saber, with savings of about 25 to 40%. This is where BAT shines, and what makes the scheme interesting and worth investigating a bit. Like all new schemes, it shall certainly not be deployed in production! It should first undergo some months (preferably years) of analysis by other researchers. If BAT succeeds at defeating attackers for a few years, then it could become a good candidate for new protocols that need a post-quantum KEM.

The Lattice

Without entering into the fine details of the lattice used by BAT, I am going to try to give an idea about how BAT can achieve lower public key and ciphertext sizes.

Practical lattice-based algorithms tend to work with lattices expressed over a polynomial ring: values are polynomials with coefficients being integers modulo a given integer q (usually small or small-ish, not necessarily a prime), and polynomial computations being made modulo the polynomial Xn+1 with n being a power of 2, often 256, 512 or 1024 (these are convenient cyclotomic polynomials, that allows very fast computations thanks to the number-theoretic transform). Depending on the scheme, there may be one or several such values in a public key and/or a ciphertext. While the internal mechanics differ in their details and even in the exact hard problem they rely on, they tend to have the same kind of trade-off between security, reliability and modulus size.

Indeed, encapsulation can be thought of as injecting a random value as “error bits” in an operation, and decapsulation leverages the private key in order to find the most plausible initial input, before the errors were inserted. Larger errors hide the secret better, but also increase the probability that the decapsulation gets it wrong, i.e. obtains the wrong output in the end. In order to maintain the security level while getting decapsulation error probability so low that it will never happen anywhere in the world, the usual trick is to increase the value of the modulus q. However, a larger q mechanically implies a larger public key and a larger ciphertext, since these are all collections of values modulo q. There are various tricks that can save some bits here and there, but the core principle is that a larger q is a problem and an algorithm designer wants to have q as small as possible.

Saber uses q = 8192. Kyber uses q = 3329. In BAT, q = 257. This is why BAT keys and ciphertexts are smaller.

How does BAT cope with the smaller q? In a nutshell, it has an enhanced decapsulation mechanism that “gets it right” more often. The BAT lattice is a case of a NTRU lattice with a complete base: the private key consists of (in particular) four polynomials f, g, F and G, with small integer coefficients (not integer modulo q, but plain integers), which are such that gF – fG = q. This is known as the NTRU equation. Polynomials f and g are generated randomly with a given distribution, and finding matching F and G is the reason why the key pair generation of BAT is expensive. This is in fact the same key pair generation as in the signature scheme Falcon, though with slightly smaller internal values, and, I am pleased to report, no floating-point operations anywhere. BAT is completely FPU-free, including in the key pair generation; that should make it quite easier to implement on microcontrollers.

Any (F,G) pair that fulfills the NTRU equation allows some decapsulation to happen, but the error rate is lowest when F and G are smallest. The F and G polynomials are found as an approximate reduction using Babai’s nearest plane algorithm, which would return an “optimal” solution with non-integral coefficients, so the coefficients are rounded and F and G are only somewhat good (they are the smallest possible while still having integers as coefficients). The main idea of BAT is to make “better” F and G by also working part of the decapsulation modulo another prime (64513, in the case of BAT) with an extra polynomial (called w in the paper) that in a way incarnates the “missing decimals” of the coefficients of F and G. These computations are only on the decapsulation side, they don’t impact the public key or ciphertext sizes, but they considerably lower the risk of a decapsulation error, and thereby allow using a much smaller modulus q, which leads to the smaller public keys and ciphertexts.

Next Steps

BAT is still in its infancy and I hope other researchers will be motivated into trying to break it (and, hopefully, fail) and extending it. This is ongoing research. I will also try to make an optimized implementation for a small microcontroller (in the context of the NIST post-quantum project, the usual target is the ARM Cortex M4 CPU; the current C code should compile as is and run successfully, but this should be done and performance measured, and some well-placed assembly routines can most likely reduce costs).