The Parallel Lanes Nobody Uses
The Parallel Lanes Nobody Uses SIMD and the Eight-Lane Highway You've Been Driving Solo Reading time: ~13 minutes You ran ripgrep across a 2GB log file and it finished in half a second. grep would ...

Source: DEV Community
The Parallel Lanes Nobody Uses SIMD and the Eight-Lane Highway You've Been Driving Solo Reading time: ~13 minutes You ran ripgrep across a 2GB log file and it finished in half a second. grep would have taken ten. You called np.array * 2 and it finished before the function call overhead had time to register. Here's what actually happened: your CPU has 256-bit registers that can process 8 floats simultaneously. Those tools used all eight lanes of an eight-lane highway. Your Python for-loop uses one. This is what your CPU can actually do. The Fundamental Idea SIMD stands for Single Instruction, Multiple Data. It's not a clever trick. It's a first-class feature of every CPU you've used in the last twenty years. The idea is direct. A normal CPU instruction operates on one value: ADD rax, rbx # add one 64-bit integer to one other 64-bit integer A SIMD instruction operates on a packed vector of values in a single clock: VADDPS ymm0, ymm1, ymm2 # add eight 32-bit floats at once Eight additions