Tool Release – Reliably-checked String Library Binding

by Robert C. Seacord

Memory Safety

Reliably-checked Strings is a library binding I created that uses static array extents to improve diagnostics that can help identify memory safety flaws. This is part of a broader initiative in the C Standards Committee to improve bounds checking for array types. See my blog post Improving Software Security through C Language Standards for an overview of the work being done by the committee in this area.

Buffer overflows continue to be a well-known security problem even in 2021. The term “buffer overflow” generally refers to memory accesses (reads and writes) outside the bounds of an object (typically an array). Buffer overflows can be exploited to execute arbitrary code with the permissions of a vulnerable process [Seacord 2013]. The 2020 Common Weakness Enumeration (CWE) ranks “Improper Restriction of Operations within the Bounds of a Memory Buffer” fifth on its list of the top 25 most dangerous software weaknesses. Memory safety issues account for:

Traditionally, the C Library has contained many functions that trust the programmer to provide output character arrays big enough to hold the result being produced. Not only do these functions not check that the arrays are big enough, they frequently lack the information needed to perform such checks.

The memcpy function is a well-known and oft-used function that copies n characters from the object pointed to by s2 (the source array) into the object pointed to by s1 (the destination array).

The following is a simple test of memcpy. The source string str2 has 11 bytes and destination array str1 has 6 bytes, including the null termination characters:

bool memcpy_test(void) {
  char str1[] = "01234";  
  char str2[] = "abcdefghij";  
  
  memcpy(str1, str2, sizeof(str2));
  puts("\nstr1 after memcpy ");
  puts(str1);
  
  return true;
}

GCC produces no diagnostics when this code is compiled with gcc -Wall -Wextra, although the call to memcpy results in a five-byte buffer overflow.

Many GCC warnings (and all the flow-based ones) depend on optimization to avoid both false negatives and false positives.  Because optimization changes the shape of the code, it can also be the cause of false negatives and false positives.  There are known bugs and limitations that have yet to be addressed.  Overall, the generation of these diagnostics is still a work in progress and far from perfect.

The preceding call to the memcpy function does not trigger a warning at -O0 because, even though memcpy is always special, the GCC optimizer recognizes it as special only with optimization enabled, and the warning that is issued at -O1 and above uses the same logic.  This is neither intentional nor unavoidable: fixing it is just a matter of either decoupling the warning logic from the optimizer when it comes to looking at memcpy, or annotating the Glibc declaration to tell GCC about its properties [Sebor 2021].  But doing that means that uses of memcpy will be prone to more false positives at -O0 because the optimizations that the warning relies on to avoid them (e.g., dead code elimination) don’t run.
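To make the annotation approach more concrete, here is a minimal sketch that applies GCC's access function attribute (available in GCC 10 and later) to a hypothetical wrapper named copy_bytes. The wrapper, the ACCESS macro, and the demo function are assumptions made for illustration only; they are not taken from Glibc or from the library described in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The access attribute is a GCC extension, so it is guarded here and
 * expands to nothing on other compilers (an assumption of this sketch). */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 10
#define ACCESS(mode, ptr, size) __attribute__((access(mode, ptr, size)))
#else
#define ACCESS(mode, ptr, size)
#endif

/* Hypothetical wrapper:
 *   argument 1 is only written, and its size is given by argument 3;
 *   argument 2 is only read, and its size is given by argument 3. */
ACCESS(write_only, 1, 3) ACCESS(read_only, 2, 3)
void *copy_bytes(void *dst, const void *src, size_t n) {
  return memcpy(dst, src, n);
}

/* A correct call: both arrays hold at least sizeof(dst) bytes. */
int copy_bytes_demo(void) {
  char dst[6];
  const char src[6] = "hello";
  copy_bytes(dst, src, sizeof(dst));
  assert(strcmp(dst, "hello") == 0);
  return 0;
}
```

With an annotation like this, GCC has the information it needs to check the size argument against the arrays at the call site, although whether a warning is actually issued depends on the compiler version and flags.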

Static Array Extents

Static array extents were added in C99 [ISO/IEC 9899:1999]. They are supported by gcc and clang, but not yet by the 2022 preview release of Microsoft Visual Studio (a feature request is open for upvoting). Along with variably modified array types, static array extents provide a mechanism for specifying the minimum size of array arguments in a way that can be statically checked at compile time. C17 [ISO/IEC 9899:2018] Section 6.7.6.2, “Array declarators”, states that:

The optional type qualifiers and the keyword static shall appear only in a declaration of a function parameter with an array type, and then only in the outermost array type derivation.

Section 6.7.6.3, “Function declarators”, paragraph 6 states:

A declaration of a parameter as “array of type” shall be adjusted to “qualified pointer to type”, where the type qualifiers (if any) are those specified within the [ and ] of the array type derivation. If the keyword static also appears within the [ and ] of the array type derivation, then for each call to the function, the value of the corresponding actual argument shall provide access to the first element of an array with at least as many elements as specified by the size expression.
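As a small illustration of the quoted rule, the sketch below declares a parameter with a constant static extent. The function sum4 and its demo are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical function: the parameter is adjusted to a pointer, but the
 * static keyword requires every caller to pass an array that provides
 * at least 4 elements. */
unsigned sum4(const unsigned char data[static 4]) {
  unsigned total = 0;
  for (size_t i = 0; i < 4; i++) {
    total += data[i];
  }
  return total;
}

unsigned sum4_demo(void) {
  const unsigned char ok[4] = {1, 2, 3, 4};
  /* Passing a 2-element array or a null pointer here would violate the
   * requirement; a compiler may diagnose such a call, but need not. */
  return sum4(ok);
}
```

Because the extent is part of the declaration, the minimum size is visible to both the reader and the compiler without consulting separate documentation.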

Variably modified types and variable-length array types were required features of C99 but are optional features in C17. Consequently, these features cannot be used in required sections of the C Standard without first making variably modified types a required feature again.

Most standard string library functions accept arguments of type char * or void * to reference strings or memory, respectively. In WG14 N2660, Improved Bounds Checking for Array Types, Martin Uecker argues that arrays can be used instead of pointers for safe programming because compilers can use length information encoded in the type to detect errors. Amending standard library functions so that pointer arguments are declared as arrays with static array extents applies this idea. In line with the C23 charter [Keaton 2020], this would make the API self-documenting and allow tools to diagnose bounds violations at compile time or at runtime.

We can produce a binding for the memcpy function called memcpy_rcs that alters the signature as follows:

extern inline void* memcpy_rcs(
  size_t n, char s1[restrict static n], const char s2[restrict static n]
) {
  return memcpy(s1, s2, n);
}

The size_t parameter becomes the first parameter so we can use this size as part of a variably modified type in the second and third parameters. If you want to pass the array first and the size afterward, you can use a forward declaration in the parameter list—a GNU extension only implemented by GCC:

extern inline void* memcpy_rcs(
  size_t n; char s1[restrict static n], const char s2[restrict static n], size_t n
) {
  return memcpy(s1, s2, n);
}

The size_t n before the semicolon is a parameter forward declaration, and it serves the purpose of making the name n known when the declarations of s1 and s2 are parsed.

The second parameter is the destination array and is declared as char s1[restrict static n]. The keyword static appears within the [ and ] of the array type declaration, indicating that the corresponding actual argument must provide access to the first element of an array with at least n elements.

The third parameter is the source array and is declared as const char s2[restrict static n] because it is never written to, only read from. The size parameter n applies to both the source and destination arrays because the semantics of memcpy guarantee that n bytes will be read from the source array and written to the destination array. This requires both arrays to be at least of size n.
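For contrast with the overflowing calls, here is a minimal sketch of a call that satisfies both extents. It uses static inline rather than extern inline purely so that it compiles as a single file, and memcpy_rcs_demo is a hypothetical helper added for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same signature as the binding above; static inline is an assumption
 * made here so the sketch is self-contained in one translation unit. */
static inline void *memcpy_rcs(
  size_t n, char s1[restrict static n], const char s2[restrict static n]
) {
  return memcpy(s1, s2, n);
}

int memcpy_rcs_demo(void) {
  char dst[] = "hello";        /* 6 bytes including the null terminator */
  const char src[] = "world";  /* 6 bytes including the null terminator */

  /* Both arrays must provide at least n bytes, so check at compile time. */
  _Static_assert(sizeof(dst) >= sizeof(src), "destination too small");

  memcpy_rcs(sizeof(src), dst, src);
  assert(strcmp(dst, "world") == 0);
  return 0;
}
```

Passing sizeof(src) as n keeps the size tied to the source array, and the static assertion documents that the destination is at least as large.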

Compiling with gcc -Wall produces improved diagnostics for the reliably checked functions.

The use of inline changes the emitted code. If a function can be inlined, the overhead of a call is avoided, register spill is reduced, and register usage can generally be streamlined [Gustedt 2010]. An inline function definition can be included in a header file so that it is visible to the entire project. The extern keyword should then be included in exactly one compilation unit (.c file) to ensure that only one external symbol is generated in case the function cannot be inlined at some location.

Compiling the following test code with gcc 11.1.0 on Ubuntu 20.04 using the command gcc -Wall:

bool memcpy_rcs_test002(void) {
  char str1[] = "hello";  
  char str2[] = "world"; 
  
  memcpy_rcs(sizeof(str2)+1, str1, str2); // gcc -Warray-bounds
  puts(str1);    
  return true;
}

produces improved diagnostics for the reliably checked functions:

rcs.c: In function ‘memcpy_rcs_test002’:
rcs.c:81:3: warning: ‘memcpy_rcs’ accessing 7 bytes in a region of size 6 [-Wstringop-overflow=]
   81 |   memcpy_rcs(sizeof(str2)+1, str1, str2); // warns -Warray-bounds
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
rcs.c:81:3: note: referencing argument 2 of type ‘char *’
rcs.c:81:3: warning: ‘memcpy_rcs’ reading 7 bytes from a region of size 6 [-Wstringop-overread]
rcs.c:81:3: note: referencing argument 3 of type ‘const char *’
In file included from rcs.c:5:
string_rcs.h:7:21: note: in a call to function ‘memcpy_rcs’
    7 | extern inline void* memcpy_rcs(size_t n, char s1[static n], const char s2[static n]) {
      |                     ^~~~~~~~~~

If instead you pass the -O2 flag, for example, gcc -Wall -O2, you’ll get significantly different diagnostics:

In file included from /usr/include/string.h:495,
                 from string_rcs.h:2,
                 from rcs.c:5:
In function ‘memcpy’,
    inlined from ‘memcpy_rcs’ at string_rcs.h:7:9,
    inlined from ‘memcpy_rcs_test002’ at rcs.c:44:3:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:34:10: warning: ‘__builtin___memcpy_chk’ forming offset 6 is out of the bounds [0, 6] of object ‘str1’ with type ‘char[6]’ [-Warray-bounds]
   34 |   return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
rcs.c: In function ‘memcpy_rcs_test002’:
rcs.c:41:8: note: ‘str1’ declared here
   41 |   char str1[] = "hello";
      |        ^~~~

Optimization can lead to different warnings.  In the preceding example, it makes a difference because at -O0 the call to memcpy_rcs is not inlined (and so the [static n] argument in the function declaration is used to validate the call). When this same code is optimized, the function is inlined and the declaration is lost and so the optimizer sees the call to memcpy and not the call to memcpy_rcs. This is an unfortunate consequence of the warnings depending on the optimizer, and not something that is likely to be fixed. GCC’s static analyzer doesn’t depend on optimizations so it will eventually do better but does not yet provide these diagnostics. Consequently, the -O2 diagnostics are essentially the same for both the standard string functions and the reliably-checked string library bindings.

Not all implementations take the same approach as gcc. Clang, for example, does not emit any diagnostics from its optimization passes, so that the user does not see different warnings in an optimized (production) build from those in an unoptimized development build. Warnings that vary with the optimization level could be disruptive in a continuous integration (CI) pipeline, for example.

Function definitions are included in the string_rcs.h header file so the definitions are available in all translation units that include this file. The call to the memcpy_rcs function from memcpy_rcs_test compiles to the following instructions at -O2:

memcpy_rcs_test:
        sub     rsp, 24
        mov     eax, 100
        lea     rdi, [rsp+4]
        mov     DWORD PTR [rsp+4], 1819438967
        mov     WORD PTR [rsp+8], ax
        call    puts
        mov     eax, 1
        add     rsp, 24
        ret

Function Signatures

Reliably-checked function signatures depend upon the existing function signature and the semantics of the function. We’ve already examined memcpy, where the size applies to both the source and destination arrays.

The strncpy function

The reliably checked strncpy function has the following definition:

extern inline char *strncpy_rcs(
  size_t n, char s1[restrict static n], const char s2[restrict static 1]
) {
  return strncpy(s1, s2, n);
}

This function takes a size argument that specifies a maximum number of characters to copy, but not the actual number that will be copied (characters that follow a null character are not copied). Furthermore, the C Standard states that:

If the array pointed to by s2 is a string that is shorter than n characters, null characters are appended to the copy in the array pointed to by s1, until n characters in all have been written.
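The padding behavior in the quoted text can be observed with a short sketch. It defines the binding as static inline so that it compiles as a single file, and strncpy_rcs_demo is a hypothetical helper added for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same signature as the binding above; static inline is an assumption
 * made here so the sketch is self-contained in one translation unit. */
static inline char *strncpy_rcs(
  size_t n, char s1[restrict static n], const char s2[restrict static 1]
) {
  return strncpy(s1, s2, n);
}

int strncpy_rcs_demo(void) {
  char dst[6];
  memset(dst, 'x', sizeof(dst));  /* make stale contents visible */

  /* The source "hi" is shorter than n, so the rest of dst is null padded. */
  strncpy_rcs(sizeof(dst), dst, "hi");

  assert(dst[0] == 'h' && dst[1] == 'i');
  for (size_t i = 2; i < sizeof(dst); i++) {
    assert(dst[i] == '\0');
  }
  return 0;
}
```

The source here occupies only three bytes, while all six bytes of the destination are written.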

Because of these requirements, the size is applied to the destination array s1 but not to the source array, which can be shorter but must be a string (i.e., a null-terminated array of at least one byte). The source array is declared as const char s2[restrict static 1], again because it is read from but not written to. The static array extent of 1 requires that a pointer to an array of at least one byte is passed. This produces a diagnostic, but is not a significant improvement for either gcc or clang, where a diagnostic is already generated for these functions. For example, clang 12.0.0 produces similar diagnostics when passing a null pointer to both the strncpy and strncpy_rcs functions:

<source>:250:35: warning: null passed to a callee that requires a non-null argument [-Wnonnull]
  strncpy(str1, NULL, sizeof(str2)); 
                ~~~~              ^
<source>:283:3: warning: null passed to a callee that requires a non-null argument [-Wnonnull]
  strncpy_rcs(sizeof(str1), str1, NULL);
  ^                               ~~~~
<source>:27:73: note: callee declares array parameter as static here
extern inline char *strncpy_rcs(size_t n, char s1[static n], const char s2[static 1]) {

A similar effect can be achieved in gcc and clang using the nonnull attribute extension.
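As a rough sketch of that alternative, the following applies the nonnull attribute to a hypothetical wrapper that keeps the parameter order of strncpy. The names strncpy_nn, NONNULL, and strncpy_nn_demo are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The nonnull attribute is a GNU extension supported by gcc and clang;
 * the macro expands to nothing elsewhere (an assumption of this sketch). */
#if defined(__GNUC__)
#define NONNULL(...) __attribute__((nonnull(__VA_ARGS__)))
#else
#define NONNULL(...)
#endif

/* Hypothetical wrapper: arguments 1 and 2 must not be null pointers. */
NONNULL(1, 2)
char *strncpy_nn(char *restrict s1, const char *restrict s2, size_t n) {
  return strncpy(s1, s2, n);
}

int strncpy_nn_demo(void) {
  char dst[8];
  strncpy_nn(dst, "abc", sizeof(dst));  /* padded with nulls up to 8 bytes */
  assert(strcmp(dst, "abc") == 0);
  return 0;
}
```

Unlike a static array extent, the attribute says nothing about how many elements the argument must provide; it only rules out null pointers.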

The strcpy function

The strcpy function is perhaps the most infamous of the Section 7.24, “String handling” functions for buffer overflows, as no size is passed and the function basically screams YOLO as it writes all over memory, as in the following example:

bool strcpy_test003(void) {
  char str1[] = "01234";  
  char str2[] = "abcdefghi";  
  
  strcpy(str1, str2);
  puts(str1);

  return true;
}

So would inventing a size parameter for the strcpy function (below) improve security?

extern inline char *strcpy_rcs(
  size_t n, char s1[restrict static n], const char s2[restrict static 1]
) {
  return strcpy(s1, s2);
}

The answer is “no”. Compiling the test code using gcc -Wall -Wextra produces no diagnostics for either function when the destination array is too small for the copy operation. Adding the -O2 flag, however, catches both buffer overflows:

<source>:218:3: warning: 'strcpy' forming offset [6, 9] is out of the bounds [0, 6] of object 'str1' with type 'char[6]' [-Warray-bounds]
  218 |   strcpy(str1, str2);
      |   ^~~~~~~~~~~~~~~~~~
<source>:215:8: note: 'str1' declared here
  215 |   char str1[] = "01234";
      |        ^~~~

In function 'strcpy_rcs',
    inlined from 'strcpy_rcs_test003' at <source>:248:3:
<source>:23:16: warning: 'strcpy' writing 10 bytes into a region of size 6 [-Wstringop-overflow=]
   23 |         return strcpy(s1, s2);
      |                ^~~~~~~~~~~~~~
<source>: In function 'strcpy_rcs_test003':
<source>:246:8: note: destination object 'str1' of size 6
  246 |   char str1[] = "01234";
      |        ^~~~

Both diagnostics are on the call to the strcpy function and not the strcpy_rcs function, so passing the size provides no advantage. Consequently, there is no advantage to adding sizes for functions that don’t already take one, so we declare the binding to the strcpy_rcs function without the additional size parameter to more closely match the signature of the strcpy function:

extern inline char *strcpy_rcs(char s1[restrict static 1], const char s2[restrict static 1]) {
  return strcpy(s1, s2);
}

This suggests that the ideal functions for this treatment are ones that already take a size indicating the length of an array.

The strndup function

The strndup function is another interesting example. The reliably-checked binding for this function has the following definition:

inline char *strndup_rcs(const char s[static 1], size_t size) {
  return strndup(s, size);
}

In this case, we chose not to make the s array a variably modified type.

The strndup function creates a string initialized with no more than size initial characters of the array pointed to by s, stopping at the first null character if one occurs sooner, in a space allocated as if by a call to malloc. This means that even though a size is specified, the source array, and therefore the resulting string, could be smaller. Passing such a shorter string would not be permitted if the array were declared with a static extent of size.
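The following sketch shows both cases: a source string shorter than size, and a source string longer than size. It assumes a POSIX system where strndup is declared in string.h, defines the binding as static inline so that it compiles as a single file, and uses a hypothetical helper named strndup_rcs_demo.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* asks for the POSIX declaration of strndup */
#endif
#include <stdlib.h>
#include <string.h>

/* Same signature as the binding above; static inline is an assumption
 * made here so the sketch is self-contained in one translation unit. */
static inline char *strndup_rcs(const char s[static 1], size_t size) {
  return strndup(s, size);
}

/* Returns 0 on success, 1 on unexpected contents, -1 if allocation fails. */
int strndup_rcs_demo(void) {
  char *a = strndup_rcs("abc", 100);   /* source shorter than size */
  char *b = strndup_rcs("abcdef", 3);  /* source longer than size */
  if (a == NULL || b == NULL) {
    free(a);
    free(b);
    return -1;
  }
  int ok = strcmp(a, "abc") == 0 && strcmp(b, "abc") == 0;
  free(a);
  free(b);
  return ok ? 0 : 1;
}
```

The first call is valid only because s is declared with an extent of 1; with an extent of size, the four-byte source would not satisfy a requirement of 100 elements.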

Annex K: Bounds Checking Interfaces

Annex K provides alternative library functions that promote safer, more secure programming. The alternative functions verify that output buffers are large enough for the intended result and return a failure indicator if they are not. One such example is the strcpy_s function shown below:

#define __STDC_WANT_LIB_EXT1__ 1
#include <string.h>
errno_t strcpy_s(char * restrict s1, rsize_t s1max, const char * restrict s2);

The strcpy_s function ensures the string is either successfully copied or an error is indicated. However, the function declaration does not tie s1max to the actual size of the array s1, so a compiler or analysis tool cannot check that the two agree. The declaration can be improved as follows:

extern errno_t strcpy_s(
  rsize_t s1max, char s1[restrict static s1max], const char s2[restrict static 1]
);

Unfortunately, making this modification requires changing the order of arguments to ensure the size argument comes before the first variably modified type that references this size.
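To show how such a reordered signature might look in practice, here is a minimal sketch of a hypothetical function named copy_checked. It is not strcpy_s and does not implement the Annex K runtime-constraint handler mechanism; it only mirrors the size-first parameter order and the fail-safe behavior of emptying the destination on error. It assumes s1max is at least 1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical function, not part of Annex K or the rcs library.
 * Returns 0 on success. If s2 (including its null terminator) does not
 * fit in s1max bytes, sets s1 to the empty string and returns 1.
 * Precondition assumed by this sketch: s1max >= 1. */
int copy_checked(
  size_t s1max, char s1[restrict static s1max], const char s2[restrict static 1]
) {
  size_t len = strlen(s2);
  if (len >= s1max) {
    s1[0] = '\0';
    return 1;
  }
  memcpy(s1, s2, len + 1);  /* copy the string and its terminator */
  return 0;
}

int copy_checked_demo(void) {
  char dst[6];
  assert(copy_checked(sizeof(dst), dst, "hello") == 0);
  assert(strcmp(dst, "hello") == 0);
  assert(copy_checked(sizeof(dst), dst, "too long") == 1);
  assert(dst[0] == '\0');
  return 0;
}
```

Because the size comes first, the extent of s1 can refer to it, and the same declaration serves as both documentation and input to compiler diagnostics.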

Annex K functions are an ideal candidate to amend so that pointer arguments are declared as arrays with static array extents:

  • All of the Annex K string handling functions add sizes.
  • Annex K is optional, so it is acceptable that they depend on the optional variably modified type mechanism.
  • There are few conforming implementations that will need to be changed [O’Donell 2015, Seacord 2019].

Summary

Static array extents are an underutilized C programming language feature that can be used to produce better compile time diagnostics and consequently reduce software defects and vulnerabilities. This is already true when compiling with gcc -Wall today.

Changing Annex K functions to use these new signatures is a viable and recommended change for C23. Using the supplied binding for Section 7.24 string handling functions will improve the analyzability and security of these legacy functions without breaking existing code. Requiring variably modified types (again) and standardizing the parameter forward declaration GNU extension would also allow preserving ABI compatibility of the existing Section 7.24 string handling functions while still taking advantage of static array extents.

Reliably-checked string bindings, licensed under a permissive MIT License, are available at https://github.com/rcseacord/rcs.

Acknowledgements

Thanks to Aaron Ballman (Intel), Martin Sebor (RedHat), Martin Uecker (University of Göttingen), Graham Bucholz (NCC Group), and Ray Lai (NCC Group) for reviewing this paper.

Thanks also to NCC Group SVP, Global Head of Research Jennifer Fernick for sponsoring this project and Amanda Crowell for giving me the time to work on it.

References

[Gustedt 2010] Gustedt, Jens. Myth and reality about inline in C99. November 29, 2010.

[ISO/IEC 9899:1990] ISO/IEC. 1990. “Programming Languages—C,” 1st ed. ISO/IEC 9899:1990.

[ISO/IEC 9899:1999] ISO/IEC. 1999. “Programming Languages—C,” 2nd ed. ISO/IEC 9899:1999.

[ISO/IEC 9899:2011] ISO/IEC. 2011. “Programming Languages—C,” 3rd ed. ISO/IEC 9899:2011.

[ISO/IEC 9899:2018] ISO/IEC. 2018. “Programming Languages—C,” 4th ed. ISO/IEC 9899:2018.

[O’Donell 2015] Carlos O’Donell, Martin Sebor. N1967 Updated Field Experience With Annex K — Bounds Checking Interfaces. September, 2015.

[Keaton 2020] David Keaton. WG14 N2611 Programming Language C – C23 Charter. November 2020.

[Memarian 2019] Kayvan Memarian, Victor B. F. Gomes, Brooks Davis, Stephen Kell, Alexander Richardson, Robert N. M. Watson, and Peter Sewell. 2019. Exploring C semantics and pointer provenance. Proc. ACM Program. Lang. 3, POPL, Article 67 (January 2019), 32 pages. DOI:https://doi.org/10.1145/3290380

[Ritchie 1993] Dennis M. Ritchie. 1993. The development of the C language. In The second ACM SIGPLAN conference on History of programming languages (HOPL-II). Association for Computing Machinery, New York, NY, USA, 201–208. DOI:https://doi.org/10.1145/154766.155580

[Seacord 2013] Seacord, Robert C. 2013. Secure Coding in C and C++, 2nd ed. Boston: Addison-Wesley Professional.

[Seacord 2014] Seacord, Robert C. 2014. The CERT C Coding Standard: 98 Rules for Developing Safe, Reliable, and Secure Systems, 2nd ed. Boston: Addison-Wesley Professional.

[Seacord 2019] Seacord, Robert C. WG14 N2336 Bounds-checking Interfaces: Field Experience and Future Directions. February 3, 2019.

[Seacord 2020] Robert C. Seacord. Effective C: An Introduction to Professional C Programming. August 2020. ISBN-13: 9781718501041.

[Sebor 2021] Sebor, Martin. Use source-level annotations to help GCC detect buffer overflows. Red Hat Developer Blog. June 2021.