Perfaddict.NET

— A blog for .NET performance addicts.

UTF-8 String Literals, how useful are they?

👋 Introduction

For those who might not know: the string type in C# represents text encoded in UTF-16. UTF-8 string literals are a new feature introduced in C# 11. The motivation for the feature is that UTF-8 is the encoding used most of the time in practice, for example in protocols such as HTTP or when writing to a file. Let's check its performance and try to find some real-world scenarios in which it can be used.
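As a quick illustration of the difference, here is a minimal sketch (the header text is just an example value, and the code assumes C# 11 or later with top-level statements). It shows that the u8 suffix yields the UTF-8 bytes directly, while a regular string has to be encoded at run time to get the same bytes:

```csharp
using System;
using System.Text;

// A regular string literal: stored as UTF-16 code units.
string utf16 = "Authorization: ";

// A UTF-8 string literal (C# 11+): the u8 suffix produces a ReadOnlySpan<byte>.
ReadOnlySpan<byte> utf8 = "Authorization: "u8;

// Encoding the UTF-16 string at run time allocates a new byte[] with the same content.
byte[] encoded = Encoding.UTF8.GetBytes(utf16);

Console.WriteLine(utf16.Length);                // 15 (UTF-16 code units)
Console.WriteLine(utf8.Length);                 // 15 (UTF-8 bytes, the text is pure ASCII)
Console.WriteLine(utf8.SequenceEqual(encoded)); // True
```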

⏱️ Performance

Download the Sample Code

| Method                           | String Length | Mean       | Allocated |
|----------------------------------|---------------|------------|-----------|
| GetUtf8Value_32_u16const         | 32 bytes      | 27.8060 ns | 56 B      |
| GetUtf8Value_32_u8const          | 32 bytes      | 0.3061 ns  | -         |
| GetUtf8Value_64_u16const         | 64 bytes      | 42.9306 ns | 88 B      |
| GetUtf8Value_64_u8const          | 64 bytes      | 0.3020 ns  | -         |
| GetUtf8Value_128_u16const        | 128 bytes     | 53.7413 ns | 152 B     |
| GetUtf8Value_128_u8const         | 128 bytes     | 0.2976 ns  | -         |
| WriteAuthHeaderToStream_u16const | 32 bytes      | 68.54 ns   | 96 B      |
| WriteAuthHeaderToStream_u8const  | 32 bytes      | 47.06 ns   | 56 B      |
| WriteAuthHeaderToStream_u16const | 64 bytes      | 90.22 ns   | 128 B     |
| WriteAuthHeaderToStream_u8const  | 64 bytes      | 65.07 ns   | 88 B      |
| WriteAuthHeaderToStream_u16const | 128 bytes     | 103.83 ns  | 192 B     |
| WriteAuthHeaderToStream_u8const  | 128 bytes     | 77.68 ns   | 152 B     |

So what are we seeing in the table above? We have two base methods:

  1. GetUtf8Value: simply returns a UTF-8 value as ReadOnlySpan<byte> based on a constant. The constant is either a traditional UTF-16 string, which is then encoded using the Encoding.UTF8 class, or a new UTF-8 string literal, which doesn't need any conversion or encoding. In the latter case the value is just returned, so the method is almost empty.
  2. WriteAuthHeaderToStream: this scenario is meant to be much closer to a real-life problem. Here we write the auth header to an HTTP request stream (actually a MemoryStream), which means we first write the constant Authorization: and then an arbitrary value.
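The two base methods could look roughly like the sketch below. This is not the benchmark code from the sample download: the constant, the method bodies and the way the arbitrary value is encoded are assumptions made for illustration, and the methods are written as local functions so the snippet runs as a single top-level program.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical constant; the constants used in the actual benchmark are not shown in the post.
const string HeaderNameUtf16 = "Authorization: ";

// UTF-16 constant: encoded on every call, which allocates a new byte[].
ReadOnlySpan<byte> GetUtf8Value_u16const() => Encoding.UTF8.GetBytes(HeaderNameUtf16);

// UTF-8 literal: no conversion or encoding, the value is simply returned.
ReadOnlySpan<byte> GetUtf8Value_u8const() => "Authorization: "u8;

void WriteAuthHeaderToStream_u16const(Stream stream, string value)
{
    stream.Write(Encoding.UTF8.GetBytes(HeaderNameUtf16)); // constant is encoded each time
    stream.Write(Encoding.UTF8.GetBytes(value));           // arbitrary value
}

void WriteAuthHeaderToStream_u8const(Stream stream, string value)
{
    stream.Write("Authorization: "u8);                     // constant needs no encoding
    stream.Write(Encoding.UTF8.GetBytes(value));           // arbitrary value still has to be encoded
}

// Usage: a MemoryStream stands in for the HTTP request stream.
using var stream = new MemoryStream();
WriteAuthHeaderToStream_u8const(stream, "Bearer abc");
Console.WriteLine(Encoding.UTF8.GetString(stream.ToArray())); // Authorization: Bearer abc
```

Both variants write exactly the same bytes; the only difference is whether the constant part is encoded at run time.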

What we can instantly see is that this really is a micro-optimization, particularly because the UTF-16 -> UTF-8 conversion usually happens as part of some I/O operation, and I/O operations take time on the millisecond level. Saving a few tens of nanoseconds and at most 152 B per call (the largest differences in the table above) is therefore minimal in both relative and absolute terms, for CPU and memory alike. Don't get me wrong, it's a great feature, just don't expect a noticeable change in performance after upgrading a few constants. You would only see some improvement on hot paths, or if you use lots of constants that can be rewritten as UTF-8 string literals.

⚠️ API Support

What is really missing to make this an even greater feature is API support. In most cases you cannot write bytes to a raw stream but have to provide a string (for example when adding headers to an HTTP request using the HttpRequestMessage class), which makes the UTF-8 string literal pointless there. On the other hand, the string type is a really great abstraction for a series of characters, and UTF-16 is easy to handle. The only thing I could imagine is another string type, something like string8, for storing UTF-8 strings (instead of the current low-level type ReadOnlySpan<byte>, which by itself cannot guarantee that it indeed contains a UTF-8 string); APIs that convert strings to UTF-8 would then accept this UTF-8 string type in the first place. However, I'm not a language designer, and I'm pretty sure that there are areas I didn't take into account regarding my "string8" idea.
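To make the gap concrete, here is a small sketch under a few assumptions: the token value and the URL are made up, and a MemoryStream again stands in for a real stream. It shows a byte-oriented API that takes the literal as-is, a string-only API that forces a decode back to UTF-16, and why ReadOnlySpan<byte> alone says nothing about the encoding of its content:

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Text;

ReadOnlySpan<byte> tokenUtf8 = "Bearer hypothetical-token"u8; // hypothetical value

// A Stream accepts raw bytes, so the literal can be written without any conversion.
using var body = new MemoryStream();
body.Write(tokenUtf8);

// HttpRequestMessage headers only accept strings, so the UTF-8 bytes have to be
// decoded back to UTF-16 first, which defeats the purpose of the literal.
using var request = new HttpRequestMessage(HttpMethod.Get, "https://example.com/");
request.Headers.Add("Authorization", Encoding.UTF8.GetString(tokenUtf8));

// ReadOnlySpan<byte> carries no encoding guarantee: these two bytes are not valid UTF-8.
ReadOnlySpan<byte> notUtf8 = new byte[] { 0xC3, 0x28 };
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
bool valid;
try { strictUtf8.GetString(notUtf8); valid = true; }
catch (DecoderFallbackException) { valid = false; }
Console.WriteLine(valid); // False
```

A dedicated UTF-8 string type, as imagined above, would move that validity check into the type itself instead of leaving it to each caller.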

📋 Summary

✔️ Great optimization for hot paths
❌ API Support is missing which prevents broader usage