Elliptic curves are commonly used to implement asymmetric cryptographic operations such as key exchange and signatures. These operations are used in many places, in particular to initiate secure network connections within protocols such as TLS and Noise. However, they are relatively expensive in terms of computing resources, especially for low-end embedded systems, which run on small microcontrollers and are limited in Flash storage, RAM size, raw CPU abilities, and available electrical power when running on batteries. In this post, we are talking about optimizing some of the internal operations used in elliptic curve implementations, in particular (but not only) on small microcontrollers.

For the impatient, here are the relevant links:

- The core article, which describes the optimized binary GCD algorithm: https://eprint.iacr.org/2020/972
- The implementation on x86 CPU (with inline assembly, for Intel Skylake and more recent cores): https://github.com/pornin/bingcd
- Another implementation, this time for ARM Cortex-M0 and M0+ microcontrollers; it also includes the Legendre symbol, and an X25519 implementation which runs in about 3.23 million cycles (new record!): https://github.com/pornin/x25519-cm0

And a summary of the performance results on ARM Cortex-M0 and M0+ CPUs:

Operation | Cost (clock cycles) | Previous records |

Inversion modulo 2^{255}-19 | 54793 | ~270000 |

Legendre symbol modulo 2^{255}-19 | 43726 | ~270000 |

X25519 (point multiplication on Curve25519) | 3229950 | 3474201 |

**Modular Inversion**

Elliptic curves run on finite fields; here, we focus on the most basic type of finite field, which is integers modulo a given prime *p*. An elliptic curve that provides the traditional cryptographic security level of “128 bits” must use a finite field of size at least 250 bits or so; thus, we consider that *p* is a prime integer of 250 bits (or slightly more). We will use Curve25519 as our main example; this is a curve that uses the specific modulus *p* = 2^{255}-19 for its finite field. Use of Curve25519 for key exchange is specified in RFC 7748 under the name “X25519”. This specific use case is fast becoming the dominant way to perform key exchange for HTTPS connections.

Operations modulo *p* are not immediate; such integers do not fit in individual registers, and must be represented as several internal words (often called *limbs*). Appropriate algorithms are then used to compute additions, subtractions, multiplications, squarings and inversions modulo *p*. In general, additions and subtractions are fast, multiplications much less so. Squarings are a special case of multiplication and, depending on target architecture and implementation technique, may be somewhat faster (cost of one squaring is between 50 and 100% of a multiplication, and commonly between 65 and 80%). The modulus *p* = 2^{255}-19 was specifically chosen to make multiplications easier and faster. Inversions, however, are much more expensive. The usually recommended method for inversion modulo 2^{255}-19 is Fermat’s Little Theorem, which states (indirectly) that 1/*x* = *x*^{p-2} mod *p*: this is a modular exponentiation, with a 255-bit exponent; with the best known modular exponentiation algorithms, this is going to require at least 254 squarings, and a few extra multiplications on top of that (11 extra multiplications for this specific modulus).

Unfortunately, each basic operation in an elliptic curve, in its classic description, requires an inversion in the field, and interesting cryptographic protocols will need to perform several *hundreds* of such basic operations. To avoid paying for the cost of a modular inversion several hundred times, the usual trick is to work with fractions: each field element *x* is represented as a pair (*X*, *Z*) such that the true value *x* is equal to *X*/*Z*. Using fractions increases the number of multiplications, but avoids inversions, until the very end of the algorithm. One inversion is still needed to convert the final result from fractional to integral. In a way, this merges all inversions into a final one; however, that final inversion must still be computed.

To put some numbers on that: an instance of X25519 requires 1276 multiplications, 1020 squarings, one inversion, and some other cheaper operations. The cost of the inversion will then represent between 7 and 10% of the total cost. On an ARM Cortex M0+, the previous record was held by Haase and Labrique at 3474201 cycles – if the microcontroller operates at 8 MHz, this represents close to half a second, and it must be used twice to make a TLS key exchange. Thus, the computational cost brings the computation time into the range of values that are perceptible to the human user; and human users are quite sensitive to latency and perceived slowness. This highlights the importance of optimizing such operations.

Fermat’s Little Theorem is not the only known modular inversion algorithm. In fact, more than 2300 years ago, Euclid described a method to compute the greatest common divisor between two integers that can be readily extended into a modular inversion method (this is called the extended Euclidean algorithm). While Euclid’s algorithm uses divisions (on plain integers), which are complicated and costly to implement, a binary version that involves only subtractions and halvings was described by आर्यभट 1500 years ago. In the case of Curve25519, the binary GCD algorithm was initially deemed not to be competitive with Fermat’s Little Theorem, especially on “large” architectures such as modern servers, laptops, and smartphones, which now have intrinsic support for fast 64-bit multiplications.

However, it is possible to considerably speed up the binary GCD by noticing that all operations on big integers within the algorithm only need to use a few bits of the said integers, namely the few least significant bits and the few most significant bits. It is thus possible to execute most of the algorithm on *approximations* of the values, that fit in a single register (the “middle bits” having been removed), and propagate updates to the real values only a few times within the course of the algorithm. This does not change the overall mathematical description of the algorithm, but can significantly lower the cost. Moreover, if done with some care, these operations can be done *safely* on secret values, i.e. with an execution time and memory access patterns that do not depend on any secret, and therefore immune to leakage through timing-based side channels (such code is said the be *constant-time* and it is highly desirable in cryptographic implementations).

I had initially devised and used this binary GCD optimization technique in 2018 for the implementation of a step of RSA key pair generation within the BearSSL library; I also used it as part of the key pair generation of Falcon, a signature algorithm (currently part of NIST’s post-quantum cryptography standardization project). In both cases it was used on relatively large integers (1024 bits for RSA, up to 6500 bits for Falcon), and did its job (i.e. the cost of modular inversion became a very minor part of the overall cost). But it was time to properly document the algorithm, and to try it out on smaller fields, especially those used for elliptic curves. This yielded the following article and demonstration code:

- Article: https://eprint.iacr.org/2020/972
- Implementation: https://github.com/pornin/bingcd

The article formally describes the algorithm and mathematically proves that it always works within the expected number of iterations. The implementation is for big x86 CPUs; specifically, it targets x86 systems running in 64-bit mode and implementing the ADX and BMI2 extensions (i.e. the adcx, adox and mulx opcodes). On such systems, Fermat’s Little Theorem leads to an inversion cost of about 9175 cycles; the optimized binary GCD uses only 6253 cycles (test CPU is an Intel i5-8295U “Coffee Lake” core), which happens to be a new record for this type of CPU. The gain of about 3000 cycles is not very significant; this is about one microsecond worth of CPU time. It won’t be perceptible anywhere except in extremely specialized applications that do *many* elliptic curve operations; nothing of the sort will be relevant to an even busy Web server. However, for small microcontrollers, the speed gain is comparatively greater, and the gains can become significant. When implemented on an ARM Cortex M0+, we achieve inversion modulo 2^{255}-19 in only **54793 cycles**, i.e. about 5 times faster than Fermat’s Little Theorem method (at around 270000 cycles). This implementation is part of the new X25519 implementation (see next section).

Getting substantial speed improvements on the final inversion are only one of the ways in which the optimized algorithm can help with elliptic curve implementations. For instance, when working with generic Weierstraß curves such as NIST’s P-256 (which is also extremely common in TLS, as well as being much used for signatures in X.509 certificates), curve operations split into *point doublings* (adding a point to itself) and generic *point additions*, the latter being more expensive (about twice more expensive than doublings). Optimized point multiplication algorithms on such curves use “window optimizations” that reduce the number of generic point additions (down to about one point addition for every four or five doublings), but the cumulative cost of point additions remains significant. These algorithms compute a *window* of a few multiples of the input point and then use these points repeatedly. A *fast* inversion routine makes it worthwhile to normalize these points to affine coordinates (i.e. reducing the fractions to make denominators equal to 1), which yields substantial speed-ups for point addition. Such savings are cumulative with those obtained from making the final inversion faster.

**A New X25519 Speed Record**

I wrote a new X25519 implementation for ARM Cortex-M0 and M0+ CPUs. These CPUs are popular in constrained applications because they offer “32-bit performance” while using only very low power and silicon area. The implementation is available here: https://github.com/pornin/x25519-cm0

This implementation sets a new speed record at **3229950 clock cycles**, i.e. about 7.6% faster than the previous record. The difference is slight, but such savings are cumulative over an application and can matter if they allow getting over a threshold effect (e.g. allowing the hardware to run at a lower frequency to save on battery power).

If we go into the details, the use of the optimized binary GCD saves about 215000 cycles. An additional 75000 cycles are also gained in the implementation thanks to the use of pure assembly, as opposed to Haase and Labrique’s code which is mostly written in C. They have assembly implementations of multiplications and squarings (in 1478 and 998 cycles, respectively) and mine offer similar performance (1464 and 997 cycles, respectively), which is expected since I used the same underlying techniques (three layers of signed Karatsuba multiplication). However, keeping to assembly allowed me to use a different internal ABI in which functions do not save the registers they modify; such operations are necessary to interoperate with external code (e.g. written in C) but most of it is wasted in practice, since many registers do not contain values that need preservation across calls. Thus, Haase and Labrique’s additions modulo *p* require 120 cycles, while mine use only 71 cycles. These are small savings, but they add up to about 75000 cycles over the course of an X25519 execution.

On the other hand, my implementation spends an extra 46080 cycles to stick to pure constant-time discipline. Haase and Labrique assume that accesses to RAM in small microcontrollers does not leak address-related information, because such CPUs do not have caches. This is a correct assumption in most cases, but it is hard to verify in practice. The microcontroller vendor assembles the CPU core with some RAM and Flash blocks, and other elements such as I/O connectivity and timers, and may put a data cache in there; also, memory accesses go through an interconnection matrix that arbitrates between concurrent accesses, and it is possible that some address-dependent effect may happen, especially if some I/O activity with DMA from a peripheral runs concurrently. Microcontroller vendors hardly ever document these aspects in all their fine details. Therefore, the safest way is to assume the worst and, defensively, insist on pure constant-time discipline in the implementation.

Apart from the optimized binary GCD in ARMv6-M assembly, the implementation also includes an implementation of the Legendre symbol, which is not actually used in X25519, but may be useful to other operations adjacent to elliptic curves. This is described in the next section.

**Legendre Symbol**

The *Legendre symbol*, denoted (*x*|*n*) for two nonnegative integers *x* and *n* (with *n* an odd prime integer) is defined as follows:

- If
*x*= 0 mod*n*, then (*x*|*n*) = 0. - Otherwise, if
*x*is a*quadratic residue*modulo*n*(i.e. there is an integer*y*such that*x*=*y*^{2}mod*n*), then (*x*|*n*) = 1. - Otherwise, (
*x*|*n*) = -1.

Computing the Legendre symbol is used in particular when hashing data into a curve point in a constant-time manner (it is often used as an “is_square” test function). The usual method to compute Legendre’s symbol, especially when a constant-time implementation is needed (e.g. because the data which is hashed into a curve point is a secret value, such as a password), is again a modular exponentiation, using the fact that (*x*|*n*) = *x*^{(n-1)/2} mod *n*. As in the case of inversion, the exponent has about the same size as the modulus, and the cost is very close to that of inversion with Fermat’s Little Theorem.

There again, other algorithms are known. The Legendre symbol can be extended to the case of non-prime *n* (in which case it becomes the Jacobi symbol), and then again to a generic case that includes negative integers as well (this is the Kronecker symbol). These extended symbols allow the use of algorithms that are very close to the extended Euclidean algorithm and the binary GCD. And, indeed, we can use a variant of the optimized binary GCD to compute Legendre symbols.

Namely, we perform a binary GCD. In the binary GCD, we have two variables *a* and *b* whose initial values are *x* and *p*, respectively; throughout the course of the algorithm, both values slowly shrink, down to a final state where *a* = 0, and *b* contains the GCD of *x* and *n* (i.e. 1, if we use a prime *n*, unless we started with *x* = 0 mod *n*). At each iteration, the three following steps happen:

- If
*a*is odd and*a*<*b*, then*a*and*b*are exchanged. - If
*a*is odd, then*b*is subtracted from*a*. *a*is divided by 2.

Throughout the algorithm, *b* is always odd, and all divisions by 2 (in step 3) are exact since step 2 makes sure that if *a* is odd, then it is replaced with *a*–*b*, which is then necessarily even. The optimized binary GCD uses the same outline, but it uses approximations of *a* and *b* most of the time, which can lead it to “miscompute” step 1, i.e. swap values when it should not, or not swap them when it should. The consequence is that *a* or *b* may become negative. However, this is fixed later on in the algorithm. The critical remark here is that while *a* or *b* may become negative, they cannot be both negative at the same time. If a subtraction makes *a* negative, then subsequent steps will have one value negative and one positive, until a fixing step (not shown above) occurs and makes them both positive again.

The Legendre/Jacobi/Kronecker symbol is obtained by doing the same computations, but also keeping track of the expected symbol value by applying the following rules:

- If
*x*=*y*mod*n*, then (*x*|*n*) = (*y*|*n*), as long as either*n*> 0, or*x*and*y*have the same sign. This means that step 2 above does not change the Kronecker symbol (*a*|*b*), since either*b*is positive, and/or*a*and*a*–*b*are both positive. - If
*x*and*y*are not both negative, then (*x*|*y*) = (*y*|*x*), unless*x*= 3 mod 4 and*y*= 3 mod 4, in which case (*x*|*y*) = -(*y*|*x*). Thus, when exchanging*a*with*b*, we just have to look at the two low bits of each in order to apply the corresponding change to the Kronecker symbol. - (2|
*n*) = 1 if*n*= 1 or 7 mod 8, or -1 if*n*= 3 or 5 mod 8. Since the Kronecker symbol is multiplicative, this means that when*a*is divided by 2 (in step 3), the Kronecker symbol must be negated if and only if*b*= 3 or 5 mod 8 at that point (which, again, uses only the low bits of*b*).

This adaptation of the binary GCD to the Kronecker symbol is not new, but it shows that it is compatible with the optimized binary GCD algorithm, with a few extra operations in the inner loop. *A contrario*, we do not need to keep permanent track of the Bézout coefficients (values *u* and *v* such that *a* = *xu* mod *p* and *b* = *xv* mod p, which are necessary to compute a modular inversion), which saves some of the cost. In total, the implementation of the Legendre symbol modulo *p* = 2^{255}-19 uses only **43726 cycles** on an ARM Cortex-M0+.

One specific case where this optimized Legendre symbol is useful is Haase and Labrique’s AuCPace protocol. This is a protocol for key exchange with password-based authentication, immune to offline dictionary attack. On the client side, this protocol requires two evaluations of X25519, as well as one Elligator2 map. The latter needs a modular inversion *and* a Legendre symbol computation. When using modular exponentiation methods, it is possible to combine both into a single exponentiation, which will then (with *p* = 2^{255}-19) require 254 squarings and 23 extra multiplications; Haase and Labrique find a total Elligator2 map cost of 289276 cycles. With the optimized binary GCD and the variant for Legendre symbol described above, the cost of the Elligator2 map can be expected to be lowered to about 102000 cycles, i.e. savings of about 187000 cycles. Combined with the savings on X25519 itself, we may hope for lowering the total client cost of AuCPace from 7.35 million cycles down to about 6.67 million cycles, i.e. about 10% faster. That kind of gain is not a complete game changer, but it is large enough to be significant and make it worth the implementation effort.