Stage 1 (Contd): Targeting code to optimize

Sukhbeer Dhillon
2 min read · Nov 17, 2019

In my previous blog post, I began exploring the xz utility. From my analysis of the perf report, I found that bt_find_func accounts for the largest share of the samples, almost 40%. Inside the function itself, the hotspot seems to be a cmp instruction that checks whether a register is 0.

Perf annotation for bt_find_func on Aarchie

My first idea is to try using a bitwise XOR instead of the subtraction that cmp performs to set the flags.
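To make the idea concrete, here is a minimal sketch in C of the two ways to test for equality. It is not code from xz, and the function names are hypothetical. An optimizing compiler may well emit the same instructions for both, so any gain would have to be measured rather than assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Equality test by subtraction: this mirrors what cmp does,
 * subtract and check whether the result is zero. */
static bool equal_by_sub(uint32_t a, uint32_t b)
{
    return (uint32_t)(a - b) == 0;
}

/* Equality test by XOR: the result is zero only when every bit matches. */
static bool equal_by_xor(uint32_t a, uint32_t b)
{
    return (a ^ b) == 0;
}
```

Both functions agree for every pair of inputs. The only question is whether the generated instructions differ at all on AArch64.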

I had contacted the project's main developer, Lasse Collin, for ideas on what to optimize. He suggested looking into the code in the memcmplen file. It is optimized for x86, where it uses unaligned 8-byte accesses; other architectures use 4-byte accesses. He advised looking into this function because it has a potential impact on compression.

The file mentioned above provides a function that compares two given buffers and returns the number of matching bytes as a uint32_t. The caller passes in the number of bytes already compared and known to match, along with a limit up to which to compare the buffers, so the returned value always lies between those two numbers.
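The contract described above can be sketched as a plain byte-by-byte loop. This is only a reference version to show the behaviour, not the xz implementation. The function name and parameter names are my own assumptions.

```c
#include <stdint.h>

/* Hypothetical reference version: start at `len` (bytes already known
 * to match) and keep counting matching bytes until `limit` is reached
 * or a mismatch is found. */
static uint32_t memcmplen_bytewise(const uint8_t *buf1, const uint8_t *buf2,
        uint32_t len, uint32_t limit)
{
    while (len < limit && buf1[len] == buf2[len])
        ++len;

    return len;
}
```

For example, comparing "abcdefgh" with "abcdXfgh" from offset 0 with a limit of 8 returns 4, because the first four bytes match and the fifth does not.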

A first glance at the code suggests that intrinsics are used to carry out all sorts of operations such as load, store, and add. My goal is to add an #elif directive that is true on ARM platforms. I will be following the guides mentioned below to understand the existing x86 code and to find equivalent or improved instructions for AArch64 SIMD.
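As a rough sketch of what an AArch64 branch could look like, the code below compares 8 bytes at a time and falls back to a byte loop elsewhere. It is not the xz code: the function name is hypothetical, and it uses plain 64-bit loads rather than SIMD intrinsics. I am assuming a GCC-compatible compiler (for the __aarch64__ macro and __builtin_ctzll) and a little-endian target. In the real header this would be an #elif branch next to the existing ones; here it is a standalone #if so the example compiles on its own.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: count matching bytes starting at `len`,
 * never reading at or beyond `limit`. */
static uint32_t memcmplen_sketch(const uint8_t *buf1, const uint8_t *buf2,
        uint32_t len, uint32_t limit)
{
#if defined(__aarch64__) && defined(__GNUC__) \
        && defined(__BYTE_ORDER__) \
        && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    /* Compare 8 bytes per iteration while a full word fits below limit. */
    while (len <= limit && limit - len >= 8) {
        uint64_t x, y;
        memcpy(&x, buf1 + len, 8); /* unaligned-safe 8-byte load */
        memcpy(&y, buf2 + len, 8);

        /* XOR leaves a set bit wherever the two words differ. */
        const uint64_t diff = x ^ y;
        if (diff != 0) {
            /* On little-endian, the lowest set bit belongs to the
             * first mismatching byte; divide the bit index by 8. */
            return len + (uint32_t)(__builtin_ctzll(diff) >> 3);
        }

        len += 8;
    }
#endif

    /* Portable fallback, also used for the last few bytes. */
    while (len < limit && buf1[len] == buf2[len])
        ++len;

    return len;
}
```

Using memcpy for the loads keeps the unaligned access well defined in C, and compilers typically turn it into a single load instruction. Whether this is actually faster than the existing 4-byte path on AArch64 is something I would need to benchmark.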

So there you have it, my plan. I basically have two ideas. The first seems a bit silly, but it is based on my beginner’s knowledge of processors and assembly code and comes out of my benchmarking and profiling results. The other is to do what the person who knows this code inside out suggested.
