UTF-8 String Literals, how useful are they?
👋 Introduction
For those who might not know: the `string` type in C# represents text encoded in UTF-16. UTF-8 string literals are a new feature introduced in C# 11. The motivation for the feature is that UTF-8 is the encoding actually used most of the time, for example in protocols such as HTTP or when writing to a file. Let's check its performance and try to find some real-world scenarios in which it can be used.
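To make the feature concrete, here is a minimal sketch of the syntax next to the traditional approach. The class and member names are made up for illustration; only the `u8` suffix and `Encoding.UTF8` come from the language and the base class library.

```csharp
using System;
using System.Text;

// Hypothetical names, for illustration only.
public static class Utf8LiteralIntro
{
    // A regular string literal: stored and handled as UTF-16.
    public const string Utf16Text = "Content-Type";

    // The u8 suffix (C# 11) makes the compiler emit the UTF-8 bytes
    // at compile time; no encoding work happens at run time.
    public static ReadOnlySpan<byte> Utf8Text => "Content-Type"u8;

    // The traditional route: encode the UTF-16 string at run time,
    // which allocates a new byte array on every call.
    public static byte[] EncodeAtRuntime() => Encoding.UTF8.GetBytes(Utf16Text);
}
```

Both routes yield the same bytes; the difference is when the encoding happens and whether an array is allocated.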
⏱️ Performance
| Method | String Length | Mean | Allocated |
|---|---|---|---|
| GetUtf8Value_32_u16const | 32 bytes | 27.8060 ns | 56 B |
| GetUtf8Value_32_u8const | 32 bytes | 0.3061 ns | - |
| GetUtf8Value_64_u16const | 64 bytes | 42.9306 ns | 88 B |
| GetUtf8Value_64_u8const | 64 bytes | 0.3020 ns | - |
| GetUtf8Value_128_u16const | 128 bytes | 53.7413 ns | 152 B |
| GetUtf8Value_128_u8const | 128 bytes | 0.2976 ns | - |
| WriteAuthHeaderToStream_u16const | 32 bytes | 68.54 ns | 96 B |
| WriteAuthHeaderToStream_u8const | 32 bytes | 47.06 ns | 56 B |
| WriteAuthHeaderToStream_u16const | 64 bytes | 90.22 ns | 128 B |
| WriteAuthHeaderToStream_u8const | 64 bytes | 65.07 ns | 88 B |
| WriteAuthHeaderToStream_u16const | 128 bytes | 103.83 ns | 192 B |
| WriteAuthHeaderToStream_u8const | 128 bytes | 77.68 ns | 152 B |
So what are we seeing in the table above? We have two base methods:
- GetUtf8Value: simply returns a UTF-8 value as `ReadOnlySpan<byte>` based on a constant. This constant is either a traditional UTF-16 encoded string, which is then encoded using the `Encoding.UTF8` class, or a new UTF-8 string literal, which doesn't need any conversion or encoding. In the latter case the value is just returned, so the method is almost empty.
- WriteAuthHeaderToStream: this scenario is intended to be much closer to a real-life problem. Here, we're writing the auth header to an HTTP request stream (actually to a `MemoryStream`), which means we first write the constant `Authorization:` and then an arbitrary value.
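The benchmark code itself isn't shown here, so the following is only a sketch of what the two pairs of methods might look like. The class name, the sample constant and the exact method shapes are assumptions; the `u16const`/`u8const` suffixes mirror the table.

```csharp
using System;
using System.IO;
using System.Text;

// Assumed reconstruction of the benchmarked methods, not the original code.
public static class Utf8BenchmarkSketch
{
    // Hypothetical 32-character constant standing in for the benchmark input.
    private const string Value32 = "0123456789abcdef0123456789abcdef";

    // UTF-16 constant: encoded on every call, allocating a byte array.
    public static ReadOnlySpan<byte> GetUtf8Value_32_u16const()
        => Encoding.UTF8.GetBytes(Value32);

    // UTF-8 literal: the bytes already exist, so this only returns a span.
    public static ReadOnlySpan<byte> GetUtf8Value_32_u8const()
        => "0123456789abcdef0123456789abcdef"u8;

    // Header name encoded at run time, then the arbitrary value.
    public static void WriteAuthHeaderToStream_u16const(Stream stream, string value)
    {
        stream.Write(Encoding.UTF8.GetBytes("Authorization: "));
        stream.Write(Encoding.UTF8.GetBytes(value));
    }

    // Header name written straight from the UTF-8 literal; the arbitrary
    // value is only known at run time, so it still has to be encoded.
    public static void WriteAuthHeaderToStream_u8const(Stream stream, string value)
    {
        stream.Write("Authorization: "u8);
        stream.Write(Encoding.UTF8.GetBytes(value));
    }
}
```

If the real benchmark is shaped like this, it would explain why the `u8const` variant of WriteAuthHeaderToStream still allocates: only the constant part of the header escapes the encoding step.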
What we can instantly see from the performance point of view is that this really is a micro-optimization. The UTF-16 to UTF-8 conversion is usually tied to some I/O operation, and I/O operations take on the order of milliseconds. In the table, the u8 variant of WriteAuthHeaderToStream saves roughly 21 to 26 ns and 40 B per call, so the gain is minimal in both relative and absolute terms, for CPU and memory alike. Don't get me wrong, it's a great feature; just don't expect a noticeable change in performance after converting a few constants. You will only see some improvement on hot paths, or if you use lots of constants that can be rewritten as UTF-8 string literals.
⚠️ API Support
What is really missing to make this an even greater feature is API support. In most cases you cannot write bytes to a raw stream; you have to provide a `string` instead (for example when adding headers to an HTTP request using the `HttpRequestMessage` class), which leaves no room to take advantage of a UTF-8 string literal. On the other hand, the `string` type is a really great abstraction for a series of characters, and UTF-16 is easy to handle. The only thing I could imagine is another string type, like `string8`, for storing UTF-8 strings (instead of the current low-level type `ReadOnlySpan<byte>`, which itself cannot guarantee that it indeed contains a UTF-8 string), with APIs that convert strings to UTF-8 requiring this UTF-8 string type in the first place. However, I'm not a language designer, and I'm pretty sure there are areas I didn't take into account regarding my "string8" idea.
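Both points can be illustrated with a short sketch. The class name, the helper `IsValidUtf8` and the placeholder URL are hypothetical; `HttpRequestMessage`, `Headers.Add` and `UTF8Encoding` are the real types involved.

```csharp
using System;
using System.Net.Http;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class ApiSupportSketch
{
    // Decoder that throws on invalid input instead of silently replacing it.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // The header API is string-based, so a u8 literal cannot be passed here;
    // the strings are converted to bytes later, inside the HTTP stack.
    public static HttpRequestMessage BuildRequest(string token)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, "https://example.invalid/");
        request.Headers.Add("Authorization", "Bearer " + token);
        return request;
    }

    // A ReadOnlySpan<byte> is just bytes: nothing in the type says they are
    // valid UTF-8, so a consumer that cares has to check for itself.
    public static bool IsValidUtf8(ReadOnlySpan<byte> bytes)
    {
        try
        {
            StrictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

For example, `IsValidUtf8("héllo"u8)` succeeds, while a span wrapping the bytes `0xC3, 0x28` is rejected even though it has exactly the same static type.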
📋 Summary
✔️ Great optimization for hot paths
❌ API Support is missing which prevents broader usage