UTF-8 String Literals, how useful are they?

#net7

#speed

#memory

2023-10-07

👋 Introduction

For those who might not know: the string type in C# represents text encoded in UTF-16. UTF-8 string literals are a new feature introduced in C# 11. The motivation behind the feature is that UTF-8 is the encoding actually used most of the time, for example in protocols such as HTTP, or when writing to a file. Let's check their performance and try to find some real-world scenarios in which they can be used.
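To make the difference concrete, here is a minimal sketch of the two kinds of literal side by side. The variable names and the sample text are only illustrative; the point is that the u8 suffix yields a ReadOnlySpan<byte> directly, while a regular string has to be encoded at run time.

```csharp
using System;
using System.Text;

// A regular literal is a UTF-16 string.
string utf16 = "Authorization: ";

// The u8 suffix (C# 11) makes the compiler emit the UTF-8 bytes directly.
ReadOnlySpan<byte> utf8 = "Authorization: "u8;

// Encoding the UTF-16 string at run time produces the same bytes,
// but allocates a new byte[] every time it is called.
byte[] encoded = Encoding.UTF8.GetBytes(utf16);

Console.WriteLine(utf8.SequenceEqual(encoded)); // True
Console.WriteLine(utf8.Length);                 // 15
```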

⏱️ Performance


Method	String Length	Mean	Allocated
GetUtf8Value_32_u16const	32 bytes	27.8060 ns	56 B
GetUtf8Value_32_u8const	32 bytes	0.3061 ns	-
GetUtf8Value_64_u16const	64 bytes	42.9306 ns	88 B
GetUtf8Value_64_u8const	64 bytes	0.3020 ns	-
GetUtf8Value_128_u16const	128 bytes	53.7413 ns	152 B
GetUtf8Value_128_u8const	128 bytes	0.2976 ns	-
WriteAuthHeaderToStream_u16const	32 bytes	68.54 ns	96 B
WriteAuthHeaderToStream_u8const	32 bytes	47.06 ns	56 B
WriteAuthHeaderToStream_u16const	64 bytes	90.22 ns	128 B
WriteAuthHeaderToStream_u8const	64 bytes	65.07 ns	88 B
WriteAuthHeaderToStream_u16const	128 bytes	103.83 ns	192 B
WriteAuthHeaderToStream_u8const	128 bytes	77.68 ns	152 B

So what are we seeing in the table above? We have two base methods:

GetUtf8Value: simply returns a UTF-8 value as ReadOnlySpan<byte> based on a constant. This constant is either a traditional UTF-16 string, which is encoded at run time using the Encoding.UTF8 class, or a new UTF-8 string literal, which doesn't need any conversion or encoding: it's just returned, so this is almost an empty method.
WriteAuthHeaderToStream: this scenario is intended to be much closer to a real-life problem. Here, we're writing the auth header to an HTTP request stream (actually a MemoryStream), which means we first write the constant Authorization: and then an arbitrary value.
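
The two base methods could look roughly like the sketch below. This is not the benchmarked sample code itself: the class name, the parameter lists and the exact constant are assumptions, and only the method names follow the table above.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical container class; the real sample may be organized differently.
public static class Utf8Samples
{
    private const string HeaderNameUtf16 = "Authorization: ";

    // UTF-16 constant: encoded to UTF-8 on every call, which allocates a byte[].
    public static ReadOnlySpan<byte> GetUtf8Value_u16const()
        => Encoding.UTF8.GetBytes(HeaderNameUtf16);

    // UTF-8 literal: the bytes are embedded in the assembly, nothing is allocated.
    public static ReadOnlySpan<byte> GetUtf8Value_u8const()
        => "Authorization: "u8;

    // Both the constant prefix and the value are encoded at run time.
    public static void WriteAuthHeaderToStream_u16const(Stream stream, string value)
    {
        stream.Write(Encoding.UTF8.GetBytes(HeaderNameUtf16));
        stream.Write(Encoding.UTF8.GetBytes(value));
    }

    // Only the arbitrary value still needs encoding; the prefix is written as-is.
    public static void WriteAuthHeaderToStream_u8const(Stream stream, string value)
    {
        stream.Write("Authorization: "u8);
        stream.Write(Encoding.UTF8.GetBytes(value));
    }
}
```

In a sketch like this, the u8 variant of WriteAuthHeaderToStream avoids exactly one small array, the encoded prefix. That would be consistent with the constant 40 B difference between each pair of WriteAuthHeaderToStream rows in the table, although the actual benchmark code may differ.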

What we can instantly see from the performance point of view is that this really is a micro-optimization. The UTF-16 -> UTF-8 conversion is most of the time tied to some I/O operation, and I/O operations take milliseconds, so saving a few dozen nanoseconds and one small allocation per call is minimal next to that, in terms of both CPU and memory. Don't get me wrong, it's a great feature, just don't expect a noticeable change in performance after upgrading a few constants. You would only see some improvement on hot paths, or if you use lots of constants that can be rewritten as UTF-8 string literals.

⚠️ API Support

What is really missing here to make this an even greater feature is API support. In most cases you cannot write bytes to a raw stream but have to provide a string (like when adding headers to an HTTP request using the HttpRequestMessage class), which leaves no room to take advantage of a UTF-8 string literal. On the other hand, the string type is a really great abstraction for a series of characters, and UTF-16 is easy to handle. The only thing I could imagine is another string type, like string8, for storing UTF-8 strings (instead of the current low-level type ReadOnlySpan<byte>, which by itself cannot guarantee that it really contains valid UTF-8), and APIs that convert strings to UTF-8 internally could accept this UTF-8 string type in the first place. However, I'm not a language designer, and I'm pretty sure that there are areas I didn't take into account regarding my "string8" idea.
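
A small sketch of the gap, assuming a placeholder URL and token: the header collection of HttpRequestMessage only accepts strings, while Stream.Write has a ReadOnlySpan<byte> overload and takes a UTF-8 literal directly.

```csharp
using System;
using System.IO;
using System.Net.Http;

// Placeholder URL and token, used only for illustration.
var request = new HttpRequestMessage(HttpMethod.Get, "https://example.com/");

// Headers.Add has string-based overloads only, so a u8 literal cannot be used here.
request.Headers.Add("Authorization", "Bearer placeholder-token");

// Stream.Write accepts ReadOnlySpan<byte>, so the literal is written without any encoding step.
using var stream = new MemoryStream();
stream.Write("Authorization: "u8);

Console.WriteLine(stream.Length); // 15
```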

📋 Summary

✔️ Great optimization for hot paths
❌ API Support is missing which prevents broader usage