Why?
Ever since Unicode became common between systems, encoding-related problems have largely gone away. Every now and then you receive some UTF-8 encoded strings that have some unexpected code points (e.g. control characters) in them, but that’s fairly easy to solve - you don’t even have to do it yourself, you can use ready-made libraries such as <code>patchwork/utf8</code> for it.
Recently, however, I stumbled upon something in an API response that I had never seen before: the content contained octets (bytes) that do not form valid UTF-8 code points and break some languages’ (such as PHP’s) UTF-8 handling.
Say what?
A code point in UTF-8 describes a single character. UTF-8 uses a pagination approach - leading and continuation bytes, in standard terminology - to let frequently used characters take up less space while still being able to accommodate thousands of code points. This way a single code point can consist of between 1 and 4 octets. To do this, the most significant bits of an octet are used to signal information about the pagination.
The Wikipedia article about UTF-8 does a great job of explaining the concept. Here is a short summary:
- If an octet begins with <code>0xxxxxxx</code> then this octet is a standalone code point. The lowest standalone code point is <code>\x00</code>, the highest <code>\x7F</code>.
- If an octet begins with <code>110xxxxx</code> then it is expected that another octet starting with <code>10xxxxxx</code> follows. Both octets together are the full code point. The lowest octet containing a 2-page indicator is <code>\xC0</code> while the highest is <code>\xDF</code>.
- If an octet begins with <code>1110xxxx</code> then it is expected that 2 more octets starting with <code>10xxxxxx</code> follow. All octets together are the full code point. The lowest octet containing a 3-page indicator is <code>\xE0</code> while the highest is <code>\xEF</code>.
- If an octet begins with <code>11110xxx</code> then it is expected that 3 more octets starting with <code>10xxxxxx</code> follow. All octets together are the full code point. The lowest octet containing a 4-page indicator is <code>\xF0</code> while the highest is <code>\xF7</code>.
- The lowest octet containing a following-page indicator (<code>10xxxxxx</code>) is <code>\x80</code> while the highest is <code>\xBF</code>.
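To see these layouts on real characters, here is a small PHP sketch (assuming the source file itself is saved as UTF-8):

<?php
// Dump the octets of characters that need 1, 2, 3 and 4 octets respectively.
foreach (['A', 'é', '€', '😀'] as $char) {
    printf("%s => %s\n", $char, bin2hex($char));
}
// A  => 41       (0xxxxxxx)
// é  => c3a9     (110xxxxx 10xxxxxx)
// €  => e282ac   (1110xxxx 10xxxxxx 10xxxxxx)
// 😀 => f09f9880 (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)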
This fact also means, however:
- that any octet starting with <code>10xxxxxx</code> that is not preceded by a pagination indicator is invalid,
- that any octet starting with <code>110xxxxx</code>, <code>1110xxxx</code> or <code>11110xxx</code> that is not followed by the appropriate number of pagination indicators (<code>10xxxxxx</code>) is invalid.
To avoid confusion: these invalid octets are not invalid or unwanted code points. They are invalid bytes that do not add up to a full code point, which makes the whole string invalid UTF-8.
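To make this concrete, here is a minimal snippet (the input string is made up) showing how a single stray octet trips up PHP functions that expect valid UTF-8:

<?php
// A 3-page indicator (\xE2) followed by only one page (\x82) - invalid UTF-8.
$broken = "Price: \xE2\x82 EUR";
var_dump(mb_check_encoding($broken, 'UTF-8'));   // bool(false)
// json_encode() is a typical place where this surfaces:
var_dump(json_encode($broken));                  // bool(false)
var_dump(json_last_error() === JSON_ERROR_UTF8); // bool(true)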
The solution
As with many things, regular expressions are a solution - in my case the only performant solution I could come up with. The example below shows the regular expressions used to replace the invalid octets with a space in PHP, although this approach should work in any language that has full regex support.
<?php
// 2-page indicator without a following page
$string = preg_replace('/[\xC0-\xDF](?![\x80-\xBF])/', ' ', $string);
// 3-page indicator without 2 following pages
$string = preg_replace('/[\xE0-\xEF](?![\x80-\xBF]{2})/', ' ', $string);
// 4-page indicator without 3 following pages
$string = preg_replace('/[\xF0-\xF7](?![\x80-\xBF]{3})/', ' ', $string);
// Page without either another page or a page indicator directly in front of it
// (runs last so it also catches pages orphaned by the replacements above)
$string = preg_replace('/(?<![\x80-\xF7])[\x80-\xBF]/', ' ', $string);
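As a quick sanity check, the broken example from earlier comes out as valid UTF-8 again (<code>sanitizeOctets()</code> is just a hypothetical wrapper around the four calls above):

<?php
// Hypothetical wrapper around the four octet-level replacements.
function sanitizeOctets(string $string): string
{
    $string = preg_replace('/[\xC0-\xDF](?![\x80-\xBF])/', ' ', $string);
    $string = preg_replace('/[\xE0-\xEF](?![\x80-\xBF]{2})/', ' ', $string);
    $string = preg_replace('/[\xF0-\xF7](?![\x80-\xBF]{3})/', ' ', $string);
    return preg_replace('/(?<![\x80-\xF7])[\x80-\xBF]/', ' ', $string);
}

$broken = "Price: \xE2\x82 EUR";
var_dump(mb_check_encoding($broken, 'UTF-8'));                 // bool(false)
var_dump(mb_check_encoding(sanitizeOctets($broken), 'UTF-8')); // bool(true)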
After this, the string is a valid UTF-8 string again, containing only octet sequences that form valid UTF-8 code points.
This means that other common UTF-8 sanitization measures can be taken, such as using the <code>/u</code> flag for regular expressions:
<?php
// Replace control characters and other non-printable code points with a space (requires valid UTF-8)
$string = preg_replace('/\p{C}/u', ' ', $string);
// Replace various kinds of whitespace with a single space
$string = preg_replace('/\s+/u', ' ', $string);
Why not one big regex?
Looking at this you can see 6 regular expressions that all replace things with a space - so you may wonder: “wouldn’t this be more efficient as a single regex?”
In fact, the four octet-level expressions can be combined into a single regular expression pretty easily using the pipe (<code>|</code>) character. The two <code>/u</code> expressions have to stay separate, since <code>/u</code> only works on already-valid UTF-8.
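For illustration, the combined variant could look like this sketch. One caveat: within a single pass the lookbehind still sees the original bytes, so a page orphaned by an indicator replaced in the same pass slips through - something the sequential version handles by running the last expression after the others:

<?php
// All four octet-level cases in one alternation.
$pattern = '/[\xC0-\xDF](?![\x80-\xBF])'
         . '|[\xE0-\xEF](?![\x80-\xBF]{2})'
         . '|[\xF0-\xF7](?![\x80-\xBF]{3})'
         . '|(?<![\x80-\xF7])[\x80-\xBF]/';
$string = preg_replace($pattern, ' ', $string);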
I wondered whether the combined version would actually be faster and set out to test it.
According to my (admittedly very limited) results there was no performance difference between using several small regular expressions and one big one. I tested this with 1000 iterations on a 15 MB text file and monitored runtime as well as peak memory usage: neither changed noticeably.
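A harness along those lines could look like this (a sketch; the file name and the <code>sanitize()</code> wrapper are placeholders for whichever variant is under test):

<?php
// Hypothetical benchmark: run one variant 1000 times over the same ~15 MB
// sample and report runtime and peak memory.
$text = file_get_contents('sample-15mb.txt'); // placeholder file name

$start = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    $result = sanitize($text); // the 6-regex or the combined variant
}

printf(
    "runtime: %.2f s, peak memory: %.2f MB\n",
    microtime(true) - $start,
    memory_get_peak_usage(true) / 1048576
);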
Because they perform roughly the same, I opted for 6 small regular expressions, as this makes it easier to logically separate them and document them accordingly.
The disclaimer
A wise man once said
“if you ever find yourself thinking ‘a regex would be the perfect solution to this’, you will soon find you have two problems”.
Some problems are only feasibly solvable by using regular expressions. Those times are dire, and you should not rush through these kinds of implementations. Regular expressions are notoriously hard to read and debug, and I am very sure that there are still errors lurking in the expressions above.
In times like this, the only way to preserve your sanity and to keep your project moving without ignoring edge cases is to write tests. Don’t take my word for the regular expressions above - if you end up using them, be sure to include tests for all kinds of incredibly dumb invalid strings you can think of, and then some. If you cannot guarantee that something has no bugs, then at least test for the edge cases you know of.
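For example, a small smoke test along these lines (a sketch; <code>sanitizeOctets()</code> is the hypothetical wrapper from above) is a good start:

<?php
// Every cleaned output must be valid UTF-8, and valid input must survive.
$cases = [
    "\xC0",           // lone 2-page indicator
    "\xE2\x82",       // 3-page indicator with only one page
    "\x80",           // a single orphaned page
    "ok \xC3\xA9 ok", // valid 2-octet sequence, must pass through unchanged
];
foreach ($cases as $case) {
    assert(mb_check_encoding(sanitizeOctets($case), 'UTF-8'));
}
assert(sanitizeOctets("ok \xC3\xA9 ok") === "ok \xC3\xA9 ok");
// This one fails with the expressions above: in a run of orphaned pages
// only the first byte gets replaced, because the lookbehind still sees the
// original bytes - exactly the kind of lurking error to test for.
// assert(mb_check_encoding(sanitizeOctets("\x80\x80"), 'UTF-8'));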