HN Gopher Feed (2017-10-07) - page 1 of 10 ___________________________________________________________________
A Branchless UTF-8 Decoder
166 points by zdw
http://nullprogram.com/blog/2017/10/06/
___________________________________________________________________
Aardwolf - 4 hours ago
You need to have padding bytes at the end of the stream. What if
you don't have them and got a const pointer to a buffer provided by
a user? Then you need to copy the entire buffer again with the
padding bytes behind it. Are there any tricks to avoid this being
costly?
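One way to avoid the full copy (a sketch, not from the article): decode in place while at least 4 bytes remain, then finish the tail from a small zero-padded scratch buffer, so only the last few bytes are ever copied. This assumes the article's utf8_decode(buf, &c, &e), which returns a pointer just past the consumed character:

    #include <stdint.h>
    #include <string.h>

    extern void *utf8_decode(void *buf, uint32_t *c, int *e);

    void decode_all(unsigned char *buf, size_t n)
    {
        unsigned char *p = buf, *end = buf + n;
        uint32_t c;
        int e;

        while (end - p >= 4)                 /* 4-byte overreads stay in bounds */
            p = utf8_decode(p, &c, &e);

        unsigned char tail[8] = {0};         /* zero padding built in */
        memcpy(tail, p, (size_t)(end - p));  /* copies at most 3 bytes */
        unsigned char *t = tail, *tend = tail + (end - p);
        while (t < tend)
            t = utf8_decode(t, &c, &e);      /* error handling omitted */
    }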
twoodfin - 5 hours ago
A constant reminder: Be wary when microbenchmarking functions that
make relatively heavy use of lookup tables vs. those that don't.
You may not see the effects of the pressure this adds to the data
cache, but a real program using many functions and other data
will. The same is true for long vs. short functions and the
instruction cache.
Animats - 4 hours ago
If that bothers you, there's this approach:

    #define VALIDTAB (0x7f00ffff)
    #define LENTAB (0x000000003a550000)

    inline bool isvalidutf8start(char c)
    {
        return((VALIDTAB >> (c>>3)) & 1);
    }

    inline unsigned utf8length(char c)
    {
        return(1+((LENTAB >> ((c>>3)*2)) & 2));
    }

The 32-entry table of lengths can be coded up as a 64-bit value.
The "is valid" table is stored as a separate constant.
Animats - 1 hours ago
Should be:

    inline unsigned utf8length(char c)
    {
        return(1+((LENTAB >> ((c>>3)*2)) & 3));
    }
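For illustration, a self-contained sketch of the same packed-table
idea; it derives the constants from the UTF-8 lead-byte rules at
startup instead of hard-coding them, and takes unsigned char so the
shift counts stay well-defined:

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t lentab;    /* 2-bit (length - 1) entries, 32 of them */
    static uint32_t validtab;  /* 1 bit per lead-byte class */

    static void init_tables(void)
    {
        for (int i = 0; i < 32; i++) {  /* i = top 5 bits of the lead byte */
            unsigned len = 1, valid = 0;
            if (i < 16)                  { len = 1; valid = 1; } /* 0xxxxxxx */
            else if (i >= 24 && i < 28)  { len = 2; valid = 1; } /* 110xxxxx */
            else if (i == 28 || i == 29) { len = 3; valid = 1; } /* 1110xxxx */
            else if (i == 30)            { len = 4; valid = 1; } /* 11110xxx */
            /* 16..23 (continuation bytes) and 31 stay invalid */
            lentab   |= (uint64_t)(len - 1) << (2 * i);
            validtab |= (uint32_t)valid << i;
        }
    }

    static int isvalidutf8start(unsigned char c)
    {
        return (validtab >> (c >> 3)) & 1;
    }

    static unsigned utf8length(unsigned char c)
    {
        return 1 + ((lentab >> ((c >> 3) * 2)) & 3);
    }

    int main(void)
    {
        init_tables();
        printf("lentab=%016llx validtab=%08x\n",
               (unsigned long long)lentab, validtab);
        printf("%u %u %u %u\n", utf8length('A'), utf8length(0xC3),
               utf8length(0xE2), utf8length(0xF0));  /* prints 1 2 3 4 */
        return 0;
    }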
ChuckMcM - 4 hours ago
Presumably !len is a test and set instruction which has the same
impact on the pipeline. But that said, branchless code that fits in
a couple of cache lines seems like it would be the best you could
do.
asveikau - 1 hours ago
I don't know the answer to this. Say on x86, if you do "test
eax, eax", then "setz eax", then use eax in further arithmetic -
obviously there is no jump involved, but does that stall the
pipeline?

At any rate the caller needs to check for errors coming out of this
function (the int *e parameter) and zero termination, so there will
be conditional jumps somewhere in the use of this thing. (And
reading the code it seems to assume s[3] is safe to
dereference...)
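For concreteness, here is the idiom being discussed; compilers
typically lower the ! to a flags-setting compare plus SETcc (e.g.
test/sete on x86-64), so the cost is a data dependency on the flags
rather than a potentially mispredicted jump:

    #include <stddef.h>

    /* s + len + !len, as in the article: advance by len, or by 1 when
       len == 0 (invalid lead byte), so the decoder always makes progress. */
    unsigned char *next_ptr(unsigned char *s, size_t len)
    {
        return s + len + !len;  /* typically compiled without a conditional jump */
    }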
userbinator - 4 hours ago
The core of the algorithm shows that UTF-8 is actually big-endian
in its ordering of the bits, making it somewhat more difficult to
implement efficiently on the usual little-endian machine --- the
first byte, at the lowest address, actually contains the most
significant bits. Had it been the other way around, the final shift
would not be length-dependent.

Also, none of the compilers I tried at https://godbolt.org/ were
able to combine the 4 s[0...3] 8-bit reads into one 32-bit read,
which is somewhat disappointing. They generated 4 separate
byte-read instructions, one for each s[0..3]. Yes, I know it could
be unaligned, but this is x86 where it would still be faster than 4
separate reads. You would need to do this instead (and sacrifice
endianness/alignment-independence):

    uint32_t v = *(uint32_t*)s;
    *c  = (uint32_t)(v & masks[len]) << 18;
    *c |= (uint32_t)((v>>8)  & 0x3f) << 12;
    *c |= (uint32_t)((v>>16) & 0x3f) << 6;
    *c |= (uint32_t)((v>>24) & 0x3f) << 0;
    *c >>= shiftc[len];

We're still left with a bunch of shifts, ands, and ors, the purpose
of which is to essentially "compact" the 11110xxx 10xxxxxx 10xxxxxx
10xxxxxx bitstring into one 21-bit codepoint by removing the bits
in the middle. All the compilers I tried weren't so clever as to do
this (assuming eax = v from above after masking, and cl = slightly
modified shiftc[len]) at any optimisation level:

    bswap eax
    mov   ebx, eax
    and   ebx, 00FF00FFh
    lea   ebx, [ebx + ebx*2]
    add   eax, ebx
    shr   eax, 2
    shl   ax, 4
    shr   eax, cl

I played around with optimising UTF-8 decoding a while ago, and the
above resembles one of the fastest and smallest ways to "compact"
the 4 bytes into a codepoint.

Of course, this bit string manipulation would be much faster and
trivial in hardware (essentially rearranging wires), which makes me
want a LODSUTF8 instruction that operates like LODSB but reads a
whole codepoint instead, and sets CF on error...
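To make the compaction concrete, the same sequence in portable C (a
sketch, not the article's code; the shift table plays the role of
the "slightly modified shiftc[len]", right-aligning each length):

    #include <stdint.h>
    #include <stdio.h>

    /* v holds the four pre-masked bytes, lead byte most significant:
       p0 at bits 24..30, p1 at 16..21, p2 at 8..13, p3 at 0..5. */
    static uint32_t compact(uint32_t v, unsigned len)
    {
        static const unsigned shift[] = { 0, 22, 16, 10, 4 };
        v += 3 * (v & 0x00FF00FF);  /* the add: p1 -> bit 18, p3 -> bit 2 */
        v >>= 2;                    /* now p0@22, p1@16, p2@6, p3@0 */
        v = (v & 0xFFFF0000) | ((v & 0x0000FFFF) << 4);  /* the shl ax, 4 */
        return v >> shift[len];     /* right-align the code point */
    }

    int main(void)
    {
        /* U+20AC (euro sign) is E2 82 AC; masked payloads 02, 02, 2C. */
        uint32_t v = (0x02u << 24) | (0x02u << 16) | (0x2Cu << 8);
        printf("U+%04X\n", (unsigned)compact(v, 3));  /* prints U+20AC */
        return 0;
    }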
revelation - 3 hours ago
Processors never read anything smaller than a cache line, so it
doesn't really matter (as far as memory access goes).
[deleted]
bdonlan - 2 hours ago
One thing you could do to optimize further (on CPUs with the BMI2
instruction set) is to use the PEXT instruction to perform the bit
extract operation (after a BSWAP of course). The whole operation
could be something like (untested):

    ; Parameters:    RDI - pointer to input character
    ; Return value:  RAX - Unicode codepoint, or -1 for error
      lea    r8, [rel encoding_table] ; Load table base using rip-relative addressing
      mov    edx, [rdi]               ; unaligned (over)read of 32-bit utf8 codepoint
      mov    ebx, edx                 ; Copy to EBX to construct our table index
      and    ebx, 0xf8                ; Mask off the (potential) data bits
      ; Now we want to convert the top 5 bits into an index into a table
      ; of three 32-bit entries. Currently EBX = index * 8; we need
      ; index * 12, so we'll divide by two and then multiply by 3.
      shr    ebx, 1
      lea    ebx, [ebx + ebx * 2]
      bswap  edx                      ; Get the codepoint into big-endian representation
      pext   ecx, edx, [r8 + rbx + 4] ; extract padding bits using mask
      pext   eax, edx, [r8 + rbx]     ; extract data bits
      mov    r9d, -1                  ; CMOV has no immediate form
      cmp    ecx, [r8 + rbx + 8]      ; check padding bits
      cmovnz eax, r9d
      retq

    encoding_table:
      ; 00000xxx - 01111xxx
      times 16 dd 0x7F, 0x80, 0x00
      ; 10000xxx - 10111xxx (invalid)
      times 8 dd 0, 0xFFFFFFFF, 0
      ; The padding will never match here, forcing an error return
      ; 11000xxx - 11011xxx (two bytes) - padding value 11010
      times 4 dd 0x1F3F, 0xE0C0, 0x1A
      ; 11100xxx - 11101xxx (three bytes) - padding value 1110 1010
      times 2 dd 0x0F3F3F, 0xF0C0C0, 0xEA
      ; 11110xxx (four bytes) - padding value 111 1010 1010
      dd 0x073F3F3F, 0xF8C0C0C0, 0x7AA
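The same idea as a C sketch using the BMI2 intrinsic (build with
-mbmi2; the table is rebuilt here from the UTF-8 rules, with the
masks anchored at the high byte so they line up with the byte
swap):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <immintrin.h>   /* _pext_u32 */

    /* One row per lead-byte class: payload-bit mask, fixed-bit mask,
       and the value the fixed bits must take. */
    struct row { uint32_t data, fixed, want; };
    static struct row rows[32];   /* indexed by lead byte >> 3 */

    static void init(void)
    {
        for (int i = 0; i < 32; i++)      /* default: always an error */
            rows[i] = (struct row){ 0, 0xFFFFFFFF, 0 };
        for (int i = 0; i < 16; i++)      /* 0xxxxxxx */
            rows[i] = (struct row){ 0x7F000000, 0x80000000, 0x00 };
        for (int i = 24; i < 28; i++)     /* 110xxxxx 10xxxxxx */
            rows[i] = (struct row){ 0x1F3F0000, 0xE0C00000, 0x1A };
        for (int i = 28; i < 30; i++)     /* 1110xxxx + two tails */
            rows[i] = (struct row){ 0x0F3F3F00, 0xF0C0C000, 0xEA };
        rows[30] = (struct row){ 0x073F3F3F, 0xF8C0C0C0, 0x7AA }; /* 11110xxx */
    }

    static int32_t pext_decode(const unsigned char *s) /* needs 4 readable bytes */
    {
        uint32_t v;
        memcpy(&v, s, 4);
        v = __builtin_bswap32(v);            /* lead byte to the top */
        const struct row *r = &rows[s[0] >> 3];
        if (_pext_u32(v, r->fixed) != r->want)
            return -1;                       /* lead/continuation bits wrong */
        return (int32_t)_pext_u32(v, r->data);
    }

    int main(void)
    {
        init();
        printf("U+%04X\n", (unsigned)pext_decode(
            (const unsigned char *)"\xE2\x82\xAC"));  /* prints U+20AC */
        return 0;
    }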
wolf550e - 2 hours ago
You have to use memcpy with length 4 to read that uint32_t from
something that was a char pointer; those casts are not allowed
anymore. On x86 the memcpy becomes a single mov instruction.
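A minimal sketch of the memcpy idiom; mainstream compilers
recognize the fixed-size copy and emit a single unaligned load on
x86:

    #include <stdint.h>
    #include <string.h>

    static inline uint32_t load32(const unsigned char *p)
    {
        uint32_t v;
        memcpy(&v, p, sizeof v);   /* no alignment or aliasing UB */
        return v;                  /* gcc/clang -O2: a single mov */
    }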
Cyph0n - 2 hours ago
Are you sure about this? I did this recently in C++ and it
worked fine: cast a uint8_t* to a uint32_t* and read data as 32-bit
ints. I had to use reinterpret_cast in C++ though, and of course I
had to swap words to get the correct ordering.
hsivonen - 2 hours ago
Pretty sure it's UB if the uint32_t* doesn't point to a
correctly-aligned address. (IIRC, just computing the unaligned
pointer is already UB before you dereference it.) On x86, the code
might appear to work, though, if you don't happen to trigger
autovectorization.

The correct way to do an unaligned read is memcpy.
wolf550e - 2 hours ago
I think even if you happen to know that the address is
always aligned, but the compiler can't tell that it's
guaranteed to be always aligned, the compiler can
miscompile the code. Even without autovectorization.
glandium - 1 hours ago
To give an example: IIRC, on SPARC, unaligned access traps.
And IIRC, on older ARM (before ARMv7), it would read at the
clamped aligned address, and rotate the bits to match the
unaligned address (e.g. if your aligned data is abcdef and
you read 4 bytes at offset 2, you actually read cdab). I
think you can still get this behavior on ARMv7 if you want.
wolf550e - 2 hours ago
Yes, I'm sure. Here is a blog post by one of the leading
professors of undefined behavior about this:
https://blog.regehr.org/archives/959
(edited to a better blog post by the same author)
gsnedders - 1 hours ago
That works fine if and only if uint8_t is an alias of char,
as otherwise it violates the strict aliasing rule.
Dylan16807 - 3 hours ago
> Yes, I know it could be unaligned, but this is x86 where it
would still be faster than 4 separate reads.

You have to worry about memory access violations too.
zAy0LfpBZLC8mAC - 3 hours ago
Where in the C standard are memory access violations specified
as a language feature?!
Dylan16807 - 2 hours ago
If you're worried about the actual C standard then you hit
undefined behavior the moment you dereference a uint32_t*
pointing to data that's actually char.
zAy0LfpBZLC8mAC - 2 hours ago
Which doesn't prevent the compiler from doing a 32-bit
load, does it? Dereferencing a pointer to a location
outside any object is also undefined; therefore, if there
are no conditionals between subsequent byte reads, the
compiler should be perfectly fine with combining them into
a 32-bit load!?
fourthark - 1 hours ago
The issue is that once you hit undefined behavior, the
optimizer is allowed to do whatever it wants - the
proverbial launch nukes, crash, raid your fridge, etc.

More likely, it could just not execute that instruction or
any instruction that depends on it, because that's the
fastest way to preserve undefined behavior.
cwzwarich - 1 hours ago
Actually, this is false. The language specifically allows
aliasing char (both signed and unsigned) with other types
as an exception to type-based aliasing.
wolf550e - 2 hours ago
"memory access violations" are reading beyond allocated
storage for a variable. You're not allowed to dereference or
even reference locations outside storage allocated for
variables, except one past the last item of an array which
can be referenced in a comparison (but cannot be
dereferenced). Stack variables, memory allocated by malloc
and memory allocated by mmap all conceptually behave this
way.
userbinator - 2 hours ago
From the article:"One important consequence of always reading
four bytes is that the caller must zero-pad the buffer to at
least four bytes. In practice, this means padding the entire
buffer with three bytes in case the last character is a single
byte."In other words, both the byte-by-byte and dord-at-a-time
code assume none of the bytes being read will be at an invalid
address.
Dylan16807 - 2 hours ago
Oh, sorry, I was confused about what exactly you were
replacing.
kuwze - 5 hours ago
I'm an idiot in this area. Why would you decode UTF-8?
kutkloon7 - 4 hours ago
To get the Unicode code point. For example, you need this when you
want to render a character, or convert it to another
representation.
fra - 5 hours ago
In order to look up a glyph in your font, you'll typically need a
32-bit Unicode code point. A UTF-8 string encodes those code points
using a variable-number-of-bytes scheme, so you need to decode each
code point before you can render it.
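As a concrete illustration, a naive decode-and-render loop (a
sketch that assumes well-formed input; printf stands in for the
font's glyph lookup):

    #include <stdint.h>
    #include <stdio.h>

    static void draw_string(const unsigned char *s)
    {
        while (*s) {
            uint32_t cp;
            int len;
            if      (*s < 0x80) { cp = *s;        len = 1; }
            else if (*s < 0xE0) { cp = *s & 0x1F; len = 2; }
            else if (*s < 0xF0) { cp = *s & 0x0F; len = 3; }
            else                { cp = *s & 0x07; len = 4; }
            for (int i = 1; i < len; i++)
                cp = (cp << 6) | (s[i] & 0x3F);  /* fold in tail bytes */
            printf("U+%04X\n", (unsigned)cp);    /* glyph lookup goes here */
            s += len;
        }
    }

    int main(void)
    {
        draw_string((const unsigned char *)"h\xC3\xA9llo"); /* h e-acute llo */
        return 0;
    }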
amelius - 5 hours ago
Yes, but this code expands it to a 32-bit array. That seems a
bit silly because you'd be wasting a lot of memory bandwidth
there. I suspect the loss could even be greater than what
you've gained from removing those branches.
amaranth - 4 hours ago
Not quite, it's expanded to a 32-bit integer, one codepoint
at a time.
grzm - 5 hours ago
Perhaps you're misinterpreting what a decoder does. From the
article:

> This week I took a crack at writing a branchless UTF-8 decoder:
a function that decodes a single UTF-8 code point from a byte
stream...
chrisseaton - 5 hours ago
To print it to the screen for example.
smegel - 5 hours ago
Some languages represent unicode strings as objects quite
separate to their binary encoding, e.g. Python.
Sidnicious - 4 hours ago
I ran into the same UTF-8 decoder that you did
(http://bjoern.hoehrmann.de/utf-8/decoder/dfa/) and tried to better
understand why it was fast. In the process, I wrote a naive UTF-8
parser that turned out to be significantly faster. I used a version
of the original benchmark (the current Hindi Wikipedia dump as one
big XML file).

My original benchmark read the file as a stream - no guarantee that
you'll get whole characters in each read - so I modified it a bit
to read the whole file into memory, parse it, and print out the
number of characters decoded and an XOR "hash". Here's what
happened (looking at the "user" line of `time` output):

- Bjoern Hoehrmann's decoder: 2.159s
- This post's decoder: 2.754s (27.6% slower)
- Naive decoder: 1.453s (32.7% faster)

I'm curious whether this test makes sense and, if so, why there's
such a big difference. Currently my best guess is the data cache
pressure mentioned in twoodfin's comment. I guess I should write a
blog post, too.

Naive parser:
https://gist.github.com/s4y/7c95f1ebeb2c069cfb09db3c3251eca3
Benchmarks:
https://gist.github.com/s4y/344a355f8c1f99c6a4cb2347ec4323cc
ori_b - 3 hours ago
Almost all the text is ASCII, so the branch predictor gets things
right. Correctly predicted branches are nearly free, costing
about as much as their decode overhead.

It'd be interesting to see the results where you have randomly
shuffled characters of all lengths, although that's not likely to
be something you see in reality.
ComputerGuru - 3 hours ago
Dealing with situations like this, where you can't guarantee you
read a complete decode sequence, is exactly where memory-mapped
files are gold. Replacing that whole mess - read, test for sequence
end, buffer, drop the buffer if it exceeds a certain length without
a sequence end (but record the sequence start), seek until the end
sequence is found, then process - with memory-mapped files, handled
in a safe, passably-efficient manner, made my cross-platform tac a
breeze to write:
https://neosmart.net/blog/2017/a-high-performance-cross-plat...
ori_b - 3 hours ago
The simplest thing to do there is to issue a read if you have
less than 4 bytes remaining to decode. This guarantees that you
always have at least one well formed character, or an EOF,
coming up.

Mmap doesn't work with pipes, which hugely reduces the utility of
this kind of program.
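A sketch of that refill strategy, assuming the article's
utf8_decode(buf, &c, &e), which returns a pointer just past the
consumed character:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    extern void *utf8_decode(void *buf, uint32_t *c, int *e);

    void decode_stream(FILE *f)
    {
        unsigned char buf[4096 + 3];  /* +3 so the tail can be zero-padded */
        unsigned char *p = buf, *end = buf;
        int eof = 0;

        for (;;) {
            if (!eof && end - p < 4) {              /* refill the buffer */
                size_t keep = (size_t)(end - p);
                memmove(buf, p, keep);              /* slide the tail down */
                size_t n = fread(buf + keep, 1, 4096 - keep, f);
                p = buf;
                end = buf + keep + n;
                if (n == 0) {
                    eof = 1;
                    memset(end, 0, 3);  /* zero padding for the overread */
                }
            }
            if (p >= end)
                break;
            uint32_t c;
            int e;
            p = utf8_decode(p, &c, &e);
            if (e) {
                /* malformed sequence: handle or report */
            }
        }
    }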
ComputerGuru - 2 hours ago
You aren't reading a byte at a time, it's a block. You have
no guarantee that you didn't just end on yet another partial
sequence. It usually makes more sense to just rewind to the
last complete sequence (though with streams that is not
always an option).
ori_b - 2 hours ago
> You have no guarantee that you didn't just end on yet another
partial sequence

Which is why you don't read to the end (until you have an EOF).
You can just sidestep the partial sequence problem by getting
more data before there's potential for a partial sequence to
show up. The longest possible sequence is 4 bytes, so as long as
you ensure you have 4 bytes available, partial sequences are
impossible on well formed input.
jwilk - 54 minutes ago
Your decoder accepts lone surrogates and overlong sequences. This
probably helps a bit for speed.
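For reference, a sketch of those two checks, applied after a code
point has been decoded from a len-byte sequence (the mins[] test
mirrors the article's overlong check):

    #include <stdint.h>

    static int utf8_invalid(uint32_t cp, int len)
    {
        static const uint32_t mins[] = { 0, 0x00, 0x80, 0x800, 0x10000 };
        if (cp < mins[len])               return 1;  /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 1;  /* lone surrogate */
        if (cp > 0x10FFFF)                return 1;  /* beyond Unicode range */
        return 0;
    }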
[deleted]
ot - 5 hours ago
> My primary - and totally arbitrary - goal was to beat the
performance of Bjoern Hoehrmann's DFA-based decoder.

The thing is, that decoder could be already branchless. The core of
the algorithm is:

    *codep = (*state != UTF8_ACCEPT) ?
        (byte & 0x3fu) | (*codep << 6) :
        (0xff >> type) & (byte);

which can be compiled with a conditional move (CMOV) instead of a
branch. In fact:

> With GCC 6.3.0 on an i7-6700, my decoder is about 20% faster
than the DFA decoder in the benchmark. With Clang 3.8.1 it's just
1% faster.

There's a chance that when inlined into the benchmark Clang uses a
CMOV while GCC uses a branch. As he correctly points out, his
benchmark is the worst case for branch prediction:

> The even distribution of lengths greatly favors a branchless
decoder. The random distribution inhibits branch prediction.
tedunangst - 5 hours ago
CMOV isn't necessarily faster than no branches. There are still
data dependencies which can prevent reordering. But it's
considerably faster than a misprediction.
CalChris - 4 hours ago
I dunno what you mean by CMOV vs 'no' branches. However, CMOV is
single-cycle latency on Kaby Lake. Also, I don't know if those
data dependencies wouldn't also cause similar problems for
resolving a conditional branch.

It may be possible to construct a scenario where CMOV is marginally
slower, but I think we can put Linus Torvalds' CMOV rant in the bit
bucket. Intel recommends CMOV, with a proviso:

> Assembly/Compiler Coding Rule 2: Use the SETCC and CMOV
instructions to eliminate unpredictable conditional branches where
possible. Do not do this for predictable branches.

The Intel C Compiler uses it.
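A sketch of the shape that rule targets; when cond is unpredictable
and both values are already computed, gcc and clang at -O2
typically emit a test plus CMOV here instead of a branch:

    /* Select between two computed values without a jump. */
    int pick(int a, int b, int cond)
    {
        return cond ? a : b;
    }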
TwoBit - 3 hours ago
Did Intel disprove Linus?
CalChris - 3 hours ago
Obviate would be a better choice of word.
tedunangst - 3 hours ago
Something like

    X = flag ? X : 0;
    X = test ? X : 0;

vs.

    X |= flag;
    X |= test;

The latter is more obviously branchless. How exactly the two
perform probably requires some testing.

And a somewhat more general point: since we're writing C and not
asm, it depends how much you want to depend on the compiler to use
the right instructions. To backtrack on my point a bit, whether
CMOV gets used is an important consideration.

> There's a chance that when inlined into the benchmark Clang uses
a CMOV

And a chance it won't. :)
ot - 5 hours ago
Sure, but if you look at the implementation in the article,
that's a much longer dependency chain:

    *c  = (uint32_t)(s[0] & masks[len]) << 18;
    *c |= (uint32_t)(s[1] & 0x3f) << 12;
    *c |= (uint32_t)(s[2] & 0x3f) << 6;
    *c |= (uint32_t)(s[3] & 0x3f) << 0;
    *c >>= shiftc[len];

    /* Accumulate the various error conditions. */
    *e  = (*c < mins[len]) << 6;
    *e |= ((*c >> 11) == 0x1b) << 7;  // surrogate half?
    *e |= (s[1] & 0xc0) >> 2;
    *e |= (s[2] & 0xc0) >> 4;
    *e |= (s[3]       ) >> 6;
    *e ^= 0x2a;  // top two bits of each tail byte correct?
    *e >>= shifte[len];

Plus, all those loads are not making it easy for the processor to
pipeline. And

    *e = (*c < mins[len]) << 6;

is really a CMOV in disguise :)
userbinator - 4 hours ago
> is really a CMOV in disguise :)

I don't see how using CMOV would be appropriate there --- I'd
compile it as something more like this:

    ; eax = *c, ebx = mins[len]
    cmp eax, ebx
    sbb eax, eax
    and eax, 64    ; eax = (*c < mins[len]) << 6
d33 - 5 hours ago
That looks way oversimplified. I once wrote a minimal HTTP server
in Lua for the Nmap project, and part of the problem was UTF-8
security. Consider the get_next_char_len function here to see some
of the edge cases:
https://github.com/nmap/nmap/blob/b7a5a6/ncat/scripts/httpd...
loeg - 4 hours ago
The Lua code is just excessively verbose.
d33 - 3 hours ago
What do you mean? How else would you implement a check for
whether the prefix length is valid?
Tuna-Fish - 3 hours ago
UTF-8 decode and encode are basically the only things that could be
done by a single instruction, are common enough for implementing
such an instruction to make sense, yet still don't exist in x86. I
wonder why.
spullara - 2 hours ago
Every time I have worked for a company that had a tight
relationship with Intel, I've gotten the opportunity to meet with
them and ask for chip features. I always ask for encoding/decoding
Unicode. So far I've been ignored for 15 years or so.
ant6n - 5 hours ago
Seems like a fun project! One thing I stumbled over:

> unsigned char *next = s + len + !len;

> Originally this expression was the return value, computed at the
very end of the function. However, after inspecting the compiler's
assembly output, I decided to move it up, and the result was a
solid performance boost. That's because it spreads out dependent
instructions. With the address of the next code point known so
early, the instructions that decode the next code point can get
started early.

If there's no change in the actual logic, I feel like the compiler
should be able to move up expressions in order to make them
available earlier. I wonder whether this was tested/inspected with
some optimizations turned off.
0xbear - 4 hours ago
Except in the case of profile-guided optimization (which few people
use), the compiler has one massive weakness: it doesn't know the
"shape" and distribution of your data. If it did, it would be
able to do quite a bit more, but it mostly does not.

I also find that people have way too much faith in the omnipotence
of compilers. If you care about performance, you _have_ to go down
to the assembly level and ensure that the compiler did the right
thing, at least in the hotspots. Oftentimes it does not, and you
have to change your higher-level code to make it cooperate.
jacobush - 5 hours ago
So optimising code feels like playing chess now?
logicallee - 4 hours ago
did you mean because computers can do it better? I can't see
anything else in their comment that you could have referred to,
but the quoted part shows that the author did it better than
the compiler output...
qznc - 4 hours ago
It is not "optimising" because nobody even hopes for something
"optimal".Now? For a long time x86/amd64 performance is quite
unpredictable because the CPUs are so complex.It is not like
chess. More like Poker because more uncertainty.
thethirdone - 1 hours ago
> It is not "optimising" because nobody even hopes for something
"optimal".

Optimizing doesn't need to reach the optimal, just improve.

> It is not like chess. More like poker, because there is more
uncertainty.

I would say the uncertainty is similar to chess. It isn't feasible
to know the best option in any given situation, but it is
deterministic.
smegel - 5 hours ago
Weird to talk so much about branch prediction for branchless code.
tedunangst - 5 hours ago
1. Understanding how branches work is important to understanding
the motivation for branchless code.

2. It's not entirely branchless. Consistent branching is equally
important.
smegel - 13 minutes ago
You don't need to know anything about prediction to understand
branches or why they are bad news for pipelining.