*This post is about elliptic curves as they are used in cryptography, in particular for signatures. There are many ways to define specific elliptic curves that strive to offer a good balance between security and performance; here, I am talking about specific contributions of mine: a new curve definition, and some algorithmic improvements that target signature verification with any curve.*

A few months ago, I defined and published a new elliptic curve called Curve9767. It was designed to explore the use of finite field extensions to allow efficient computation of divisions; the primary implementation target was the ARM Cortex M0+ CPU, a favourite among very small embedded microcontrollers. A few days ago, I published two new things:

**New implementations**, for the bigger (but still embedded) ARM Cortex M4 CPU, and, at the other end of the spectrum, for x86 CPU in 64-bit mode and leveraging AVX2 opcodes.**A novel optimization for the signature verification**, making it about 30% faster. This is an application of a method which was described in 2005, but that was not considered worthwhile for fast curves because of a large overhead in a preparatory computation. I found a way to make that computation faster, making the 2005 method worthwhile. That optimization is not specific to Curve9767, it can be applied to all cases of Schnorr signatures on any curve (or on any other group).

**Links:**

*Curve9767 implementations:*https://github.com/pornin/curve9767*Article describing the curve:*https://eprint.iacr.org/2020/009*Optimization of Schnorr signature verification:*https://eprint.iacr.org/2020/454

## Curve9767 Outline

Tremendous research efforts have been dedicated in the last three decades to finding specific fields and elliptic curves allowing efficient and secure implementations. In particular, special-format prime moduli (such as *p* = 2^{255}-19) make additions, subtractions and multiplications quite fast. However, divisions in the finite field of integers modulo *p* are still quite expensive (the traditional estimate is “at least 80 times the cost of a multiplication”), prompting the use of internal representation of values as fractions, so that very few divisions are needed (typically, a single one at the end of the complete computation on the curve). Using fractions increases the number of multiplications and additions. This, in turn, called for exploring alternate curve equation formats that would minimize the total number of multiplications. A full bestiary of curves has been defined, with many corresponding formulas.

One of the “best in class” is Edwards25519, an Edwards curve defined over the finite field of integers modulo 2^{255}-19 (Edwards25519 is birationally equivalent to Curve25519, which uses a Montgomery equation; for the purpose of this post, Edwards25519 and Curve25519 can be considered to be the same thing). It allows for simple, fast and secure implementations (in particular with no timing-based side channels, i.e. with “constant-time code”). However, it has an irksome feature, which is that the curve order is not prime. The curve contains a subgroup of prime order (and that prime is a 252-bit integer, appropriate for cryptographic computations), but the complete curve order is eight times larger. The ratio between the curve order and the subgroup order is called the *cofactor*. All the fastest known curve types (Edwards, Montgomery…) have a cofactor at least 4, so this is not a problem of Edwards25519 having been improperly chosen; this is inherent to the kind of curve equation used. A non-trivial cofactor is not a problem in many cases, but it has drawbacks in some protocols; roughly speaking, it means that a given curve point may have a few other “equivalent” (but different) points, leading to data malleability or discrepancies in distributed consensus protocols. This has sometimes resulted in serious vulnerabilities. In most cases, these can be fixed by modifying the protocols that use the curve to include some extra multiplications by the cofactor, but this increases the complexity of implementations and may add runtime overhead. In a way, a part of the speed and simplicity of Edwards25519 has been obtained not by reducing complexity, but by moving it up to another layer.

Curve9767 takes a different path. Instead of optimizing exclusively for fast multiplications, it uses another kind of finite field so that *divisions* are efficient. This allows us to use the traditional kind of curve equation (“short Weierstraß”, i.e. *y*^{2} = *x*^{3} + *ax* + *b*, for two constants *a* and *b*) and, in particular, to obtain a curve with a prime order, which removes all issues related to the cofactor. Since divisions are efficient, we can use affine coordinates (Jacobian coordinates are also used transiently, especially for computing several point doublings in a row). In Curve9767, the underlying base field is *GF*(9767^{19}): finite field elements are polynomials of degree at most 18, computed modulo the polynomial *X*^{19}-2, and their coefficients are themselves integers modulo the prime 9767. These values have been chosen so as to promote both security (e.g. the extension degree 19 is prime and greater than 11, so that attacks based on Weil descent are too expensive) and performance, especially on small embedded systems. The main implementation target was the ARM Cortex M0+ CPU; on that system, Curve9767 offers competitive performance (within 30% of the best reported Curve25519 speed on the same kind of hardware), while still having a convenient prime curve order. The ARM Cortex M0+ is one of the 32-bit CPU with the least power usage, making it appropriate for small devices that must remain able to function continuously for weeks with only battery power.

## New Implementations

All Curve9767 code is published on GitHub: https://github.com/pornin/curve9767

The ARM Cortex M4 implements the ARMv7-M architecture, with a much richer instruction set than the ARMv6-M architecture used by the ARM Cortex M0+. This allows for writing faster code; M4 cores can typically run at higher frequencies than M0+ cores, but even at the same frequency they will make computations faster. In particular, the following features are very helpful:

- All opcodes can use the whole register set (16 registers, but the instruction pointer PC and the stack pointer SP are best left alone) instead of the low half only (8 registers r0 to r7), as in the ARMv6-M architecture.
- Most computation opcodes accept three parameters: two sources and one destination. On the M0+, multiplications have only two arguments, the result being written over one of the operands, thereby often requiring an extra copy.
- Extra shifts and rotations can be attached to one operand of an addition or subtraction or Boolean bitwise operation, at no cost.
- The Cortex M4 includes a DSP, which is a set of extra opcodes that can perform computations with some degree of parallelism. For instance, the
`smlad`opcode can compute two 16×16 multiplications and then add the two 32-bit results to another 32-bit value, all in a single cycle. Since polynomial coefficients in the field used in Curve9767 fit on signed 16-bit words, we can leverage the DSP instructions to considerably speed up operations.

On large x86 systems, the new code leverages the AVX2 subsystem to do many operations in parallel: AVX2 opcodes can compute up to 16 operations on 16-bit values at the same time. Unintuitively, the best performance was not achieved that way, but instead by using 32-bit values. The reason is that multiplications and additions are so fast and pipelined that most of the cost is in moving the data around, and use of 32-bit words reduces that cost. Compared with the implementations for the ARM Cortex M0+ and M4, the x86/AVX2 code has some extra specificities:

- Since big x86 systems have a lot more RAM than small embedded systems, windows for the double-and-add curve point multiplication routines have been expanded to 5 bits instead of 4.
- On ARM, squarings have specialized implementations which are substantially faster than multiplications (squaring cost is only 63% of multiplication cost); thus, when making repeated doublings and Jacobian coordinates are transiently used, the implementation uses formulas that favour squarings (1M+8S formulas). On x86, no significant speed-up could be obtained for squarings, and thus other formulas were used (4M+4S).
- To gain an extra bit of speed (at the expanse of using an extra kilobyte of L1 cache, which is no worry on big systems), for constant-time point multiplication, Jacobian coordinates have been used internally for most of the computation. This can be done in a safe way thanks to the curve prime order: while formulas for point addition on Jacobian coordinates are not
*complete*(i.e. generic point addition mishandles the point-at-infinity, and also adding a point to itself), it can be shown that no problematic case may be reached before the last iteration of the point multiplication routine, and we handle that last iteration by switching back to affine coordinates and our generic complete routine.

There are mainly three relevant categories for performance:

- Multiplication of the conventional generator point by a secret scalar (for ECDH and signature key pair generation, and for each signature generation).
- Multiplication of a dynamically obtained point by a secret scalar (for completing ECDH).
- Joint multiplication of the generator and a dynamically obtained point by (non-secret) scalars (as used for Schnorr or ECDSA signature verification).

The following performance is achieved (expressed in clock cycles):

Operation | ARM Cortex M0+ | ARM Cortex M4 | x86+AVX2 (Coffee Lake) |

Generator multiplication | 1937792 | 887520 | 172660 |

Point multiplication | 4598756 | 2054792 | 392714 |

Signature verification | ~3818000 | ~1779000 | ~380500 |

We note that the performance on x86 is decent but not record-breaking; on the same kind of hardware, Edwards25519 implementations tend to be up to three times faster. On the other hand, these figures mean that the 2.3 GHz CPU on which these measurements have been taken can still generate 13000 signatures per second, and verify 6000 signatures per second, using a single core. This is more than enough to ensure that, for most purposes, the proportion of resources used for the cryptographic operations will be negligible. Generally speaking, raw performance is most important on systems which are especially starved of CPU resources, i.e. embedded systems.

Interestingly, the signature verification is *faster* than generic point multiplication, while it theoretically involves more work. Part of it is due to the use of non-constant-time algorithms (wNAF representation of scalars), which can be used because signature verification uses only public data elements. But most of the speed-up comes from an algorithmic improvement, described in the next section.

## Faster Schnorr Signature Verification

Schnorr signatures are a type of signature scheme that can be applied over any group where discrete logarithm is hard, in particular elliptic curves. Here is a simplified description (we do not take the cofactor into account, refer to RFC 8032 for details in the case of EdDSA, and the Curve9767 paper, section 4.8, for Schnorr signatures on Curve9767; notations tend to vary):

- The group has (prime) order
*n*. The conventional generator is*B*. The private key is*x*(a non-zero secret integer modulo*n*) and the public key is*A*=*xB*. - To sign a message
*m*, first generate a random uniform non-zero integer*r*modulo*n*(there are derandomization techniques that can be used to make*r*without using a random source, see e.g. RFC 6979 for the ECDSA case, but these do not matter in this description). From the chosen*r*, compute the element*R*=*rB*. Then, a “challenge” value (an integer modulo*n*) is computed as: c =*H*(*A*,*R*,*m*) (hash of the concatenation of the public key*A*, the point*R*and the message*m*, using a hash function that returns a value reduced modulo*n*). The signature is (*R*,*s*), where*s*=*r*+*cx*mod*n*. - To verify a signature, recompute
*c*from*A*(the public key),*R*(the first half of the signature) and*m*(the message), and then check that*sB*–*cA*=*R*.

Most of the cost of verification is in computing *sB* – *cA*. The two scalars *s* and *c* have the size of *n*. If using classic double-and-add algorithms, then the doublings between *sB* and *cA* can be mutualized (this is colloquially known as “Shamir’s trick” but it was invented by Straus). If *n* has size *k* bits, then this requires *k* doublings, and some extra additions; with window-based optimizations, the number of extra additions can be reduced, to the point that the doublings will constitute 60% or more of the total cost.

In 2005, Antipa *et al* described a generic method to halve the number of doublings, thereby speeding up verification. The core remark is that verifying that *sB* – *cA* = *R* is equivalent to verifying that *d*(*sB* – *cA* – *R*) = 0 for any non-zero scalar *d*, and we can rewrite that equation as:

*e*_{0}*B* + *e*_{1}(2^{k/2})*B* – (*dc* mod *n*)*A* – *dR* = 0

where *e*_{0} + 2^{k/2}*e*_{1} = *ds* mod *n*. Notice that *e*_{0} and *e*_{1} have size *k*/2 bits (half the size of *n*), and elements *B* and 2^{k/2}*B* are fixed and thus can be precomputed. If we can find a convenient non-zero value *d* such that both *d* and *dc* mod *n* are small (about half the size of *n*), then we end up with computing a linear combination of four points (instead of two), but with half-size multipliers. Straus’s algorithm again applies. Compared with the previous case, we have the same number of extra additions, but the number of doublings is halved. We can thus hope to save 30% or more of the computation cost (but finding a proper value *d* may have some overhead).

Finding the right *d* can be interpreted as a lattice basis reduction in dimension 2. This can be heuristically approximated with a variant of the extended Euclidean greatest common divisor algorithm (stopping at mid-course, as it were). A general solution can be obtained with the venerable algorithm due to Joseph-Louis Lagrange, back in 1773. However, implementation involves dabbling with big integer multiplication and division (not modular integers, but plain integers) and is complex and, in practice, somewhat expensive. This leads to that method being explicitly rejected as not being worth the effort (e.g. in the original Ed25519 article, on page 14).

I found an improvement which allows implementing a binary variant of Lagrange’s algorithm with only additions, subtractions and left-shifts. The variant is proven to always find a suitable solution in quadratic time in the size of the group order (i.e. *O*(*k*^{2}), with *n* being a *k*-bit integer), and, more importantly, it can be sufficiently fast to make Antipa *et al*‘s method worthwhile even for fast curves. Finding *d* is done in about 106300 cycles on an ARM Cortex M0+, down to an average of 15250 cycles on a recent x86 (Coffee Lake). In total, taking into account the cost of finding *d*, about 30% of the signature verification cost is removed. The new lattice basis reduction algorithm is fast enough that it should yield a non-negligible speed improvement for signature verification even for the very fast curve Edwards25519 (this is, as yet, not implemented, but theoretically one may expect saving about 25000 cycles from a verification routine that uses 110000 cycles).

## Conclusion

After 35 years of cryptographic research on elliptic curves, we still have not found an undisputed optimal solution. In the Internet/Web ecosystem, there is an ongoing alignment on Edwards25519, which has very good performance, especially on “large” systems (desktop computers, smartphones…), but possibly less so on low-end IoT; there is still room for improvements for some expensive operations such as signature verification. Moreover, the non-prime order of Edwards curves implies some extra protocol complexity in some cases. Curve9767 represents an alternate path which is worth exploring.