c++ – Strlen function optimization

pcmpistri seems like the obvious choice for searching within a string. However, while pcmpistri is very general and powerful, it is not very fast. On typical Intel processors it consists of 3 µops that all go to execution port p0 (limiting this loop to at best one iteration every 3 cycles); on AMD Zen (1/2) it’s slightly less bad, coming in at 2 µops and executing once every 2 cycles.

There is a far more primitive way (using just SSE2) based on pcmpeqb and pmovmskb. That leaves you with a mask instead of an index, but for most of the loop that doesn’t matter (all that matters is whether the mask is zero or not), and in the final iteration you can use tzcnt (or similar) to find the actual index of the zero byte within the vector.
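As a rough sketch of that pcmpeqb/pmovmskb loop using intrinsics: the function name strlen_sse2_aligned is made up for illustration, it assumes the string starts at a 16-byte aligned address (so it is not a general replacement for strlen), and it uses the GCC/Clang builtin __builtin_ctz as the "tzcnt or similar" step.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not from the original post.
// Assumption: s is 16-byte aligned, so every aligned load stays within one page.
std::size_t strlen_sse2_aligned(const char* s) {
    assert(reinterpret_cast<std::uintptr_t>(s) % 16 == 0);
    const __m128i zero = _mm_setzero_si128();
    for (const char* p = s;; p += 16) {
        const __m128i block = _mm_load_si128(reinterpret_cast<const __m128i*>(p));
        // pcmpeqb: 0xFF in every byte lane that equals zero.
        // pmovmskb: collect the top bit of each lane into a 16-bit mask.
        const int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(block, zero));
        if (mask != 0) {
            // Only now is the index needed: the lowest set bit marks the terminator.
            // __builtin_ctz is GCC/Clang specific.
            const unsigned index = static_cast<unsigned>(
                __builtin_ctz(static_cast<unsigned>(mask)));
            return static_cast<std::size_t>(p - s) + index;
        }
    }
}
```

Inside the loop the mask is only tested against zero; the bit scan happens once, on the way out.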

That technique also scales to AVX2, which pcmpistri does not. Additionally, you could use some unrolling: combine a few successive 16-byte blocks with pminub (the byte-wise minimum contains a zero exactly when one of the blocks does) to go through the string quicker at first, at the cost of a trickier final iteration and a more complex pre-loop setup (see the next point).
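The unrolling idea might look roughly like this for four blocks per iteration. This is only a sketch: strlen_sse2_unrolled is a hypothetical name, the pre-loop setup is skipped entirely by assuming a 64-byte aligned start, and __builtin_ctzll is a GCC/Clang builtin.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not from the original post.
// Assumption: s is 64-byte aligned, so all four loads of an iteration
// lie in the same 64-byte chunk (and therefore in the same page).
std::size_t strlen_sse2_unrolled(const char* s) {
    assert(reinterpret_cast<std::uintptr_t>(s) % 64 == 0);
    const __m128i zero = _mm_setzero_si128();
    auto zero_mask = [&](__m128i v) {
        return static_cast<std::uint64_t>(
            static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero))));
    };
    for (const char* p = s;; p += 64) {
        const __m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(p));
        const __m128i b = _mm_load_si128(reinterpret_cast<const __m128i*>(p + 16));
        const __m128i c = _mm_load_si128(reinterpret_cast<const __m128i*>(p + 32));
        const __m128i d = _mm_load_si128(reinterpret_cast<const __m128i*>(p + 48));
        // pminub: the unsigned byte-wise minimum has a zero byte
        // if and only if at least one of the four blocks has one.
        const __m128i m = _mm_min_epu8(_mm_min_epu8(a, b), _mm_min_epu8(c, d));
        if (zero_mask(m) != 0) {
            // Trickier final iteration: rebuild per-block masks to locate the byte.
            const std::uint64_t mask = zero_mask(a) | (zero_mask(b) << 16) |
                                       (zero_mask(c) << 32) | (zero_mask(d) << 48);
            const unsigned index = static_cast<unsigned>(
                __builtin_ctzll(static_cast<unsigned long long>(mask)));
            return static_cast<std::size_t>(p - s) + index;
        }
    }
}
```

The common path does one compare and one pmovmskb per 64 bytes; the per-block masks are only computed once a zero is known to be present.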

While an aligned load that contains at least one byte of the string is safe even if the load pulls in some data that is outside the bounds of the string (an aligned load cannot cross a page boundary), that trick is unsafe for unaligned loads. A string that ends near a page boundary could cause this function to fetch into the next page, and possibly trigger an access violation.
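The page-boundary argument can be checked with a bit of address arithmetic. The helper below is purely illustrative: load_crosses_page and kPageSize are made-up names, and the 4 KiB page size is an assumption (typical on x86, but larger pages exist).

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 4 KiB pages.
constexpr std::size_t kPageSize = 4096;

// Hypothetical helper: does a load of `width` bytes starting at `addr`
// touch bytes in two different pages?
constexpr bool load_crosses_page(std::uintptr_t addr, std::size_t width) {
    return (addr % kPageSize) + width > kPageSize;
}

// A 16-byte load at a 16-byte aligned address ends exactly at the page end at worst.
static_assert(!load_crosses_page(4096 - 16, 16), "aligned load stays in its page");
// Shift the same load by one byte and it spills into the next page.
static_assert(load_crosses_page(4096 - 15, 16), "unaligned load can cross");
```

Since the page size is a multiple of 16, an address that is a multiple of 16 is at most 4096 - 16 bytes into its page, which is why the aligned case can never cross.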

There are different ways to fix it. The obvious one is using the usual byte-by-byte loop until a sufficiently aligned address is reached. A more advanced trick is rounding the address down to a multiple of 16 (32 for AVX2) and doing an aligned load. That load may pull in bytes that precede the string, possibly including a zero. Therefore those bytes must be explicitly ignored, for example by shifting the mask that pmovmskb returned to the right by data & 15 (data being the pointer to the start of the string). If you decide to add unrolling, then the address for the main loop should be even more aligned, to guarantee that all the loads in the main loop body are safe.
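Here is a sketch of the round-down variant without unrolling. The name strlen_sse2 is hypothetical, __builtin_ctz is a GCC/Clang builtin, and reading bytes outside the string object is formally outside what the C++ standard guarantees, even though the aligned loads cannot fault.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not from the original post. Works for any alignment of s.
std::size_t strlen_sse2(const char* s) {
    const __m128i zero = _mm_setzero_si128();
    // Offset of s within its 16-byte block (the "data & 15" from the text).
    const unsigned misalign =
        static_cast<unsigned>(reinterpret_cast<std::uintptr_t>(s) & 15);
    const char* p = s - misalign;  // rounded down to a multiple of 16

    // First block: may contain up to 15 bytes that precede the string.
    unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(
        _mm_load_si128(reinterpret_cast<const __m128i*>(p)), zero)));
    mask >>= misalign;  // discard those bytes; bit i now corresponds to s[i]
    if (mask != 0) {
        return static_cast<unsigned>(__builtin_ctz(mask));
    }

    // Main loop: every load is aligned from here on.
    for (p += 16;; p += 16) {
        mask = static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(
            _mm_load_si128(reinterpret_cast<const __m128i*>(p)), zero)));
        if (mask != 0) {
            return static_cast<std::size_t>(p - s) +
                   static_cast<unsigned>(__builtin_ctz(mask));
        }
    }
}
```

With unrolling added, the same idea applies, but the head handling would have to continue until p reaches the larger alignment the main loop needs.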