Replacing invalid UTF-8 octets

Why?

Ever since Unicode became common across systems, encoding-related problems have largely gone away. Every now and then you receive UTF-8 encoded strings that contain unexpected code points (e.g. control characters), but that’s fairly easy to solve. You don’t even have to do it yourself: you can use ready-made libraries such as patchwork/utf8 for it.

Recently, however, I stumbled upon something new in an API response that I had never seen before: the content contained octets (bytes) that do not form valid UTF-8 code points and that break the UTF-8 handling of some languages (such as PHP).

Say what?

A code point in UTF-8 describes a single character. UTF-8 uses a pagination approach so that frequently used characters take less space, while it can still accommodate more than a million code points. A single code point therefore consists of 1 to 4 octets. To make this work, the most significant bits of each octet signal information about the pagination.
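To make the 1 to 4 octets concrete, here is a small sketch that prints how many octets a few characters occupy. The sample characters and the helper name octets_of are chosen purely for illustration, and the "\u{...}" escape assumes PHP 7 or later.

```php
<?php
// Return the octets of a string as space-separated hex, e.g. "c3 a9".
// (octets_of is a hypothetical helper, not part of any library.)
function octets_of(string $char): string
{
    return implode(' ', str_split(bin2hex($char), 2));
}

$samples = [
    'letter A'  => 'A',          // U+0041
    'e-acute'   => "\u{E9}",     // U+00E9
    'euro sign' => "\u{20AC}",   // U+20AC
    'emoji'     => "\u{1F600}",  // U+1F600
];

foreach ($samples as $name => $char) {
    // strlen() counts bytes (octets), not characters.
    printf("%-10s %d octet(s): %s\n", $name, strlen($char), octets_of($char));
}
```

The four samples occupy 1, 2, 3 and 4 octets respectively, which is exactly the range described above.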

The Wikipedia article about UTF-8 does a great job of explaining the concept. Here is a short summary:

  • If an octet begins with 0xxxxxxx then this octet is a standalone code point. The lowest standalone code point is \x00, the highest \x7F.
  • If an octet begins with 110xxxxx then it is expected that another octet starting with 10xxxxxx follows. Both octets together are the full code point. The lowest octet containing a 2-page indicator is \xC0 while the highest is \xDF.
  • If an octet begins with 1110xxxx then it is expected that 2 other octets starting with 10xxxxxx follow. All octets together are the full code point. The lowest octet containing a 3-page indicator is \xE0 while the highest is \xEF.
  • If an octet begins with 11110xxx then it is expected that 3 other octets starting with 10xxxxxx follow. All octets together are the full code point. The lowest octet containing a 4-page indicator is \xF0 while the highest is \xF7.
  • The lowest octet containing a following page indicator (10xxxxxx) is \x80 while the highest is \xBF.

This also means, however:

  • That any octet starting with 10xxxxxx that is not preceded by a pagination indicator is invalid.
  • That any octet starting with 110xxxxx, 1110xxxx or 11110xxx that is not followed by the appropriate number of following page indicators (10xxxxxx) is invalid.

To avoid confusion: these invalid octets are not invalid or unwanted code points. They are bytes that do not add up to a full code point, which makes the whole string an invalid UTF-8 string.
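The rules above can be sketched as a small classifier that looks only at the most significant bits of an octet. The function name classify_octet and the labels it returns are hypothetical. The validity check at the end uses an empty pattern with the /u flag, which makes PCRE verify that the subject is valid UTF-8.

```php
<?php
// Classify one octet by its most significant bits, following the summary above.
// (classify_octet is a hypothetical helper for illustration.)
function classify_octet(string $octet): string
{
    $byte = ord($octet);

    if ($byte <= 0x7F) {
        return 'standalone';        // 0xxxxxxx
    }
    if ($byte <= 0xBF) {
        return 'following page';    // 10xxxxxx
    }
    if ($byte <= 0xDF) {
        return '2-page indicator';  // 110xxxxx
    }
    if ($byte <= 0xEF) {
        return '3-page indicator';  // 1110xxxx
    }
    if ($byte <= 0xF7) {
        return '4-page indicator';  // 11110xxx
    }
    return 'never valid';           // 11111xxx (\xF8 to \xFF)
}

// "caf" followed by a 2-page indicator that has no following page.
$broken = "caf\xC3";

foreach (str_split($broken) as $octet) {
    printf("\\x%02X  %s\n", ord($octet), classify_octet($octet));
}

// preg_match() returns 1 for valid UTF-8 and false for an invalid subject.
var_dump(preg_match('//u', $broken) === 1);        // bool(false)
var_dump(preg_match('//u', "caf\xC3\xA9") === 1);  // bool(true)
```

The last branch covers octets from \xF8 to \xFF, which match none of the patterns in the summary and therefore can never be part of a valid sequence.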

The solution

As with many things, regular expressions are a solution; in my case the only performant solution I could come up with. The example below shows the regular expressions used to replace the invalid octets with a space in PHP, although this solution should work in any language whose regex engine can match raw bytes and supports lookahead and lookbehind.

// 2-page indicator without 1 page behind it
$string = preg_replace('/[\xC0-\xDF](?![\x80-\xBF])/', ' ', $string);

// 3-page indicator without 2 pages behind it
$string = preg_replace('/[\xE0-\xEF](?![\x80-\xBF][\x80-\xBF])/', ' ', $string);

// 4-page indicator without 3 pages behind it
$string = preg_replace('/[\xF0-\xF7](?![\x80-\xBF][\x80-\xBF][\x80-\xBF])/', ' ', $string);

// Paginated character without either another paginated character or page indicator in front of it.
$string = preg_replace('/(?<!([\xC0-\xF7]|[\x80-\xBF]))[\x80-\xBF]/', ' ', $string);

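After this the string should be a valid UTF-8 string again, containing only octet sequences that form code points. This matters because PHP’s /u flag requires valid UTF-8: on an invalid subject preg_replace returns null instead of a string. With a valid string, other common UTF-8 sanitization measures can be taken, such as using the /u flag for regular expressions: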

// Replace control characters and unused code points with a space (requires valid UTF-8)
$string = preg_replace('/\p{C}/u', ' ', $string);

// Replace various kinds of whitespace with a single space
$string = preg_replace('/\s+/u', ' ', $string);
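Put together, the steps could be wrapped roughly like this. It is only a sketch: the function name sanitize_utf8 and the choice to return null when the string cannot be repaired are assumptions, not part of the original code. The check in the middle guards the /u expressions, which fail on invalid UTF-8.

```php
<?php
// Hypothetical wrapper around the expressions from this article.
// Returns the sanitized string, or null if the input could not be repaired.
function sanitize_utf8(string $string): ?string
{
    // Byte-level repairs: no /u flag, so PCRE treats the subject as raw bytes.
    $bytePatterns = [
        '/[\xC0-\xDF](?![\x80-\xBF])/',
        '/[\xE0-\xEF](?![\x80-\xBF][\x80-\xBF])/',
        '/[\xF0-\xF7](?![\x80-\xBF][\x80-\xBF][\x80-\xBF])/',
        '/(?<!([\xC0-\xF7]|[\x80-\xBF]))[\x80-\xBF]/',
    ];
    foreach ($bytePatterns as $pattern) {
        $string = preg_replace($pattern, ' ', $string);
        if ($string === null) {
            return null; // PCRE error, see preg_last_error()
        }
    }

    // The /u flag needs valid UTF-8, so verify before relying on it.
    if (preg_match('//u', $string) !== 1) {
        return null;
    }

    foreach (['/\p{C}/u', '/\s+/u'] as $pattern) {
        $string = preg_replace($pattern, ' ', $string);
        if ($string === null) {
            return null;
        }
    }

    return $string;
}

// A truncated 2-octet sequence, doubled spaces and a tab.
var_dump(sanitize_utf8("caf\xC3  and\ttabs")); // string(12) "caf and tabs"
```

Returning null rather than a half-cleaned string makes it obvious to the caller when an input slipped through the byte-level expressions.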

Why not one big regex?

Looking at this you can see 6 regular expressions that all replace things with a space, so you may wonder: “wouldn’t this be more efficient as a single regex?” In fact, all of this can be built into a single regular expression using the pipe | character pretty easily. I wondered about this and set out to test it.

According to my (very limited) results there was no performance difference between using 6 small regular expressions and one big one. I tested this with 1000 iterations on a 15MB text file and monitored runtime as well as peak memory usage: neither really changed.

Because they perform roughly the same I opted for 6 small regular expressions, as this makes it easier to separate them logically and to document them accordingly.
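A comparison of this kind could be set up roughly as follows. This is not the original test script: the helper name benchmark, the generated input and the iteration count are assumptions. Only the three lead-octet expressions are combined here, because those give the same result either way. The lookbehind expression is left out because in a single pass it would inspect the original octets rather than the spaces already inserted by the earlier steps, so a combined version needs its own tests.

```php
<?php
// Hypothetical benchmark helper: runs $fn on $input $iterations times and
// reports wall-clock seconds plus the peak memory of the PHP process.
function benchmark(callable $fn, string $input, int $iterations): array
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn($input);
    }

    return [
        'seconds'    => microtime(true) - $start,
        'peak_bytes' => memory_get_peak_usage(true),
    ];
}

// The three lead-octet expressions applied one after another ...
$separate = function (string $s) {
    $s = preg_replace('/[\xC0-\xDF](?![\x80-\xBF])/', ' ', $s);
    $s = preg_replace('/[\xE0-\xEF](?![\x80-\xBF][\x80-\xBF])/', ' ', $s);

    return preg_replace('/[\xF0-\xF7](?![\x80-\xBF][\x80-\xBF][\x80-\xBF])/', ' ', $s);
};

// ... and the same three joined with the pipe character.
$combined = function (string $s) {
    return preg_replace(
        '/[\xC0-\xDF](?![\x80-\xBF])'
        . '|[\xE0-\xEF](?![\x80-\xBF][\x80-\xBF])'
        . '|[\xF0-\xF7](?![\x80-\xBF][\x80-\xBF][\x80-\xBF])/',
        ' ',
        $s
    );
};

// Stand-in input; swap in file_get_contents() on your own test file.
$input = str_repeat("caf\xC3\xA9 \xE2\x82 \xF0\x9F ok\n", 1000);

foreach (['separate' => $separate, 'combined' => $combined] as $name => $fn) {
    $result = benchmark($fn, $input, 100);
    printf("%-9s %.4f s, peak %d bytes\n", $name, $result['seconds'], $result['peak_bytes']);
}
```

Note that memory_get_peak_usage() reports a high-water mark for the whole process, so for a fair memory comparison each variant should run in its own process.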

The disclaimer

A wise man once said

“if you ever find yourself thinking ‘A regex would be the perfect solution to this’ you will soon find you have two problems”.

Some problems are only feasibly solvable with regular expressions. These times are dire and you should not rush these kinds of implementations. Regular expressions are notoriously hard to read and debug, and I am very sure that there are still errors lurking in the expressions above.
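One way to reduce the room for such errors is to turn the problem around: describe what a well-formed sequence looks like and replace every octet that is not part of one. The sketch below is not the approach used above but a swapped-in alternative. It follows the stricter rules of the UTF-8 specification (RFC 3629), which also reject overlong forms such as \xC0\x80, surrogates and anything above U+10FFFF. It relies on PCRE’s (*SKIP)(*FAIL) backtracking verbs, so it is tied to PCRE-based engines such as PHP’s. The function name scrub_utf8 is hypothetical, and the pattern has not been benchmarked against the expressions above.

```php
<?php
// Hypothetical alternative: replace every octet that is not part of a
// well-formed UTF-8 sequence with a space, in a single pass.
function scrub_utf8(string $string): string
{
    $pattern = '/(?:
          [\x00-\x7F]                          # standalone octet
        | [\xC2-\xDF][\x80-\xBF]               # 2 octets, no overlong forms
        | \xE0[\xA0-\xBF][\x80-\xBF]           # 3 octets, no overlong forms
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3 octets
        | \xED[\x80-\x9F][\x80-\xBF]           # 3 octets, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4 octets, no overlong forms
        | [\xF1-\xF3][\x80-\xBF]{3}            # 4 octets
        | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4 octets, up to U+10FFFF
        )(*SKIP)(*FAIL)                        # well-formed: skip past it
        | .                                    # anything else: one bad octet
        /xs';

    // No u flag: the subject is deliberately treated as raw bytes.
    return (string) preg_replace($pattern, ' ', $string);
}

var_dump(bin2hex(scrub_utf8("a\x80\x80")));    // string(6) "612020"
var_dump(bin2hex(scrub_utf8("\xC3\xA9\xA9"))); // string(6) "c3a920"
```

Because every octet is either consumed as part of a well-formed sequence or replaced, the result should always pass a preg_match('//u', ...) check, which is worth asserting in your own tests.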

In times like this the only way to preserve your sanity and to keep your project moving without ignoring edge cases is to write tests. Don’t take my word for the regular expressions above: if you end up using them, be sure to include tests for all kinds of incredibly dumb invalid strings you can think of, and then some. If you cannot guarantee that something has no bugs, then at least test for the edge cases you know of.
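A starting point for such tests might look like the sketch below. The helper names replace_invalid_octets and is_valid_utf8 and the selection of inputs are assumptions. The first list contains cases the four byte-level expressions handle. The second list contains inputs they appear not to repair fully, which is exactly the kind of finding such tests are meant to surface.

```php
<?php
// The four byte-level expressions from this article, wrapped in a
// hypothetical helper so they can be tested.
function replace_invalid_octets(string $string): string
{
    $patterns = [
        '/[\xC0-\xDF](?![\x80-\xBF])/',
        '/[\xE0-\xEF](?![\x80-\xBF][\x80-\xBF])/',
        '/[\xF0-\xF7](?![\x80-\xBF][\x80-\xBF][\x80-\xBF])/',
        '/(?<!([\xC0-\xF7]|[\x80-\xBF]))[\x80-\xBF]/',
    ];
    foreach ($patterns as $pattern) {
        $string = (string) preg_replace($pattern, ' ', $string);
    }

    return $string;
}

// An empty pattern with the /u flag only matches on valid UTF-8.
function is_valid_utf8(string $string): bool
{
    return preg_match('//u', $string) === 1;
}

// input => expected output
$cases = [
    "caf\xC3\xA9"      => "caf\xC3\xA9",       // valid input stays untouched
    "\xF0\x9F\x98\x80" => "\xF0\x9F\x98\x80",  // valid 4-octet sequence
    "caf\xC3"          => "caf ",              // truncated 2-octet sequence
    "a\xE2\x82b"       => "a  b",              // truncated 3-octet sequence
    "\x80abc"          => " abc",              // following page without indicator
];

foreach ($cases as $input => $expected) {
    $actual = replace_invalid_octets($input);
    assert($actual === $expected);
    assert(is_valid_utf8($actual));
}

// Inputs these expressions appear not to repair fully: a run of following
// pages, a surplus following page, and an octet that is never valid.
$suspects = ["a\x80\x80", "\xC3\xA9\xA9", "\xFF"];

foreach ($suspects as $input) {
    $ok = is_valid_utf8(replace_invalid_octets($input));
    printf("%s: %s\n", bin2hex($input), $ok ? 'valid' : 'STILL INVALID');
}
```

Each failing input is a candidate for an additional expression, or a reason to validate the result before handing it to anything that expects valid UTF-8.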