
How to Quickly Index a Character in Short String

Tags: small web
DATE POSTED: January 23, 2025

This is probably the most useless optimization I've ever made. I can imagine this solution could be used as part of a more complex search algorithm or, in rare cases, to search for a combination in a short string. Anyway, have fun reading this.


:::info Full source code can be found here: https://github.com/Lezh1k/Codewars_c/blob/master/src/trench_assault.c

:::

Problem

In short, there are two groups of letters. Each letter has a 'weight.'

:::info Full problem statement here.

:::

==The left side letters and their power:==

w - 4

p - 3

b - 2

s - 1

==The right side letters and their power:==

m - 4

q - 3

d - 2

z - 1

So, we need a function that returns each symbol's weight and "side" for the input string. The side can be encoded as the sign of the weight (negative for the left side, positive for the right side; relief characters get weight 0).


Solution

Let's define the letters and sides:

```c
static const char left_side_letters[] = {'s', 'b', 'p', 'w', 0};
static const char right_side_letters[] = {'z', 'd', 'q', 'm', 0};
static const char relief_letters[] = {' ', '-', '|', 0};
```


:::warning These arrays will be cast to uint32_t. Don't do this in production code; use a union instead.

:::

Something like this:

```c
typedef union short_str {
  uint32_t val;
  char arr[4];
} short_str_t;
```
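As a sketch of how the union is meant to be used (`pack4` and `byte_at` are illustrative helper names, not from the original code):

```c
#include <stdint.h>
#include <string.h>

// Same union as above: lets us view four chars as one uint32_t
// without the strict-aliasing problems of a pointer cast.
typedef union short_str {
  uint32_t val;
  char arr[4];
} short_str_t;

// Hypothetical helper: pack the first four bytes of s into a uint32_t.
static uint32_t pack4(const char *s) {
  short_str_t u;
  memcpy(u.arr, s, 4);
  return u.val;
}

// Hypothetical helper: read byte i back out of a packed value.
static char byte_at(uint32_t v, int i) {
  short_str_t u;
  u.val = v;
  return u.arr[i];
}
```

Because both helpers go through the same union, the round trip is correct regardless of the host's endianness.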

The most obvious way is a simple switch, but it's easy to miss a case there. So, here is the initial, simplest implementation of the 'weight' function:

```c
int weight_slow(char s) {
  for (const char *pl = relief_letters; *pl; ++pl) {
    if (*pl != s) continue;
    return 0;
  }
  for (const char *pl = left_side_letters; *pl; ++pl) {
    if (*pl != s) continue;
    return -((int)(pl - left_side_letters) + 1);
  }
  for (const char *pl = right_side_letters; *pl; ++pl) {
    if (*pl != s) continue;
    return (int)(pl - right_side_letters) + 1;
  }
  // invalid input, raise error
  exit(1);
}
```

But this didn't feel like the fastest way to find a character in a short string. These 4 bytes fit into one 32-bit integer, so we can build a mask from the searched byte and XOR it with one of the letter sets.

Optimizations

The main idea is pretty simple. For example, the left-side letters can be expressed as the byte sequence 0x73627077 ('s', 'b', 'p', 'w'). If we are looking for the symbol 'p' (0x70), we can XOR each byte with 0x70, and only the matching byte gives zero as the XOR result. In our case, the result is 0x03120007. The only thing left is to find the index of the 0x00 byte in an integer. This is possible. See the weight function below.
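The idea above can be sketched with a plain byte scan before any bit tricks are applied (`xor_find` is an illustrative name; the multiply by 0x01010101 replicates the searched byte into all four byte positions):

```c
#include <stdint.h>
#include <string.h>

// Toy illustration of the XOR trick: replicate the searched byte across
// a 32-bit word, XOR with the packed letters, and the matching position
// becomes the only 0x00 byte.
static int xor_find(const char letters[4], char c) {
  uint32_t word, mask = (uint32_t)(unsigned char)c * 0x01010101u;
  unsigned char bytes[4];
  memcpy(&word, letters, 4);   // pack the four letters into one word
  word ^= mask;                // zero out the matching byte, if any
  memcpy(bytes, &word, 4);     // bytes[i] corresponds to letters[i]
  for (int i = 0; i < 4; ++i)
    if (bytes[i] == 0)
      return i;
  return -1;                   // no match
}
```

The real speedup in the article comes from replacing the final byte scan with the zero-byte bit hack, but the mask-and-XOR step is exactly this.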

MMX Optimization

The first attempt used MMX/SSE, because the CPU has dedicated instructions for comparing bytes and extracting the comparison result as a bitmask.


```asm
.intel_syntax noprefix
.text
.global barr8_char_idx

# Function prototype:
# int barr8_char_idx(const char *array, char input_char);
barr8_char_idx:
    # array (address of the byte array) -> rdi
    # input_char -> esi
    movdqu   xmm0, [rdi]
    # Broadcast input_char across an SSE register: first replicate the
    # byte within a dword, then the dword across the whole register
    imul     esi, esi, 0x01010101
    movd     xmm1, esi
    pshufd   xmm1, xmm1, 0
    # Compare each byte of xmm0 with xmm1
    pcmpeqb  xmm0, xmm1
    # Extract the comparison results into a bitmask
    pmovmskb eax, xmm0
    # Check if the mask is non-zero
    test     eax, eax
    jz       not_found
    bsf      eax, eax        # index of the first matching byte
    ret
not_found:
    mov      eax, -1
    ret
```

This one works slowly (even slower than the first solution): the data is not aligned, and SSE is overkill for such a small problem. But it has ready-made instructions to compare registers and convert the result into a bitmask, so this was just a proof of concept.

Math + bit twiddling hack optimization

So, the main challenge is to find the 0x00 byte in a uint32. This is possible: we can convert all the non-zero bytes in the uint32 to 0xff and the zero byte to 0x7f. Inverting the result then gives all zeros except at the position of the 0x00 byte, which becomes 0x80. The only thing left is to count the trailing zero bits and divide the result by 8 to get the byte index. There are several ways to count trailing zero bits (see https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightLinear). Sometimes, the necessary function can be found among the compiler's built-ins or in the CPU instruction set (BSF on x86_64).
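For reference, here is a portable way to count trailing zero bits without compiler built-ins, a sketch of the parallel method from the Stanford bit hacks page (the name `ctz32` is mine):

```c
#include <stdint.h>

// Count trailing zero bits of a 32-bit value without __builtin_ctz.
// Compilers usually emit BSF/TZCNT for the built-in, so this is only
// useful where the built-in is unavailable.
static int ctz32(uint32_t v) {
  int c = 0;
  if (v == 0) return 32;  // define the all-zero case explicitly
  v &= ~v + 1;            // isolate the lowest set bit (two's complement)
  if (v & 0xffff0000u) c += 16;
  if (v & 0xff00ff00u) c += 8;
  if (v & 0xf0f0f0f0u) c += 4;
  if (v & 0xccccccccu) c += 2;
  if (v & 0xaaaaaaaau) c += 1;
  return c;
}
```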

The second attempt is the fastest implementation at this time:

```c
int zbyte_32(uint32_t x) {
  // for the 0 byte set 0x7f, for other bytes - 0xff
  uint32_t y = (x & 0x7f7f7f7f) + 0x7f7f7f7f;
  // inverting gives 0x80 where the 0 byte was and 0 for other bytes
  y = ~(y | x | 0x7f7f7f7f);
  // This check is necessary because 0 as an argument of __builtin_ctz is
  // undefined behavior; without this check, gcc/clang change the weight
  // function to a plain `return 0;` statement.
  if (y == 0) {
    return -1;
  }
  // find the index of the first non-zero bit in the int32_t
  int n = __builtin_ctz(y);
  // divide this index by 8 to get the byte index (instead of the bit index)
  return n >> 3;
}

int weight(char s) {
  uint32_t s_msk = (uint32_t)s * 0x01010101;
  uint32_t relief_val = *(const uint32_t *)relief_letters;
  uint32_t left_side_val = *(const uint32_t *)left_side_letters;
  uint32_t right_side_val = *(const uint32_t *)right_side_letters;

  int w = zbyte_32(relief_val ^ s_msk);
  if (w != -1)
    return 0;
  w = zbyte_32(right_side_val ^ s_msk);
  if (w != -1)
    return w + 1;
  w = zbyte_32(left_side_val ^ s_msk);
  if (w != -1)
    return -w - 1;
  __builtin_unreachable();
}
```
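A quick sanity check of zbyte_32 in isolation (the function body is reproduced here so the snippet is self-contained; byte index 0 is the least significant byte):

```c
#include <stdint.h>

// Return the index of the first zero byte in x, counting from the least
// significant byte, or -1 if there is none.
static int zbyte_32(uint32_t x) {
  uint32_t y = (x & 0x7f7f7f7fu) + 0x7f7f7f7fu;
  y = ~(y | x | 0x7f7f7f7fu);
  if (y == 0) return -1;        // __builtin_ctz(0) is undefined behavior
  return __builtin_ctz(y) >> 3; // bit index -> byte index
}
```

For example, 0x03120007 has its zero byte at position 1, and 0x01020304 has none.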

I also tried to optimize the zbyte_32 function: the `if (y == 0) return -1;` check seemed excessive. The BSF instruction scans the source operand for the first set bit. If one is found, it clears ZF and loads the destination with the index of that bit; if no bits are set, it sets ZF. So why check y before calling it? I could run BSF, then check the ZF flag and return -1 if it is set.


So I tried this implementation:

```asm
.intel_syntax noprefix
.text
.global zbyte_32_asm

# Function prototype:
# extern int zbyte_32_asm(uint32_t x);
zbyte_32_asm:
    # Input: edi -> input uint32_t
    mov eax, edi
    and eax, 0x7f7f7f7f
    add eax, 0x7f7f7f7f
    or  eax, edi
    or  eax, 0x7f7f7f7f
    not eax
    bsf eax, eax       # ZF is set if eax was zero (no zero byte)
    jz  not_found
    shr eax, 3         # bit index -> byte index
    ret
not_found:
    mov eax, -1
    ret
```

It runs roughly 8-10 times slower than what the compilers generate. See the profiling results.

Profiling results

To profile the functions, I computed the weight of each symbol of the test string `sbpwzdqm -|sbpwzdqm -|sbpwzdqm -|sbpwzdqm -|` 10,000,000 times. Here are the results:


```
slow() took 1.2938900000 seconds to execute
fast() took 0.1967720000 seconds to execute
asm()  took 1.6532540000 seconds to execute
```

In most cases, micro-optimizations are a waste of time: it is usually more useful (and sometimes easier) to reduce algorithmic complexity, add a cache, change the memory allocator, etc. In rare cases, though, micro-optimizations are the only way to reach the required performance.
