Optimizing xz

Sukhbeer Dhillon
3 min read · Dec 10, 2019

Stage II: Faster compression on AArch64

For Stage I of my Software and Portability course project, I benchmarked the xz utility and planned a few optimizations. In this blog I will discuss my attempts at carrying out those plans.

First, I had planned to use the bitwise xor operator instead of the heavier subtraction operator in one of the hotspot functions.

I started looking for ways to use xor instead of doing depth-- == 0. It is efficient to xor with the zero register: if the result is zero, the value is zero. However, it turned out that it is easier to do that directly in assembly. I faced the dilemma of whether I should add assembly code to a purely C code base. Before going in that direction, I wanted to do some preliminary testing of this path, so I added the following in place of the comparison against integer 0.

(((depth | (~depth + 1)) >> 31) & 1)

This weird-looking expression was found here. It turned out that instead of improving performance, this change took about 40% more time.

//Using default xz
real 23m18.178s
user 22m59.110s
sys 0m10.939s
//Using bitwise operations build
real 32m40.353s
user 32m19.091s
sys 0m12.232s

I kind of knew this could turn into a rabbit hole, so I did not investigate it further. Also, I forgot to update the value of depth by post-decrementing it while comparing against 0.
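The bitwise expression from the experiment can be checked in isolation. The sketch below assumes depth is a 32-bit unsigned integer; the helper names are hypothetical and are not part of xz.

```c
#include <stdint.h>

/* Hypothetical helper (not from xz): returns 1 when depth is nonzero
 * and 0 when depth is zero, using only bitwise operations. */
static inline uint32_t sketch_is_nonzero(uint32_t depth)
{
	/* ~depth + 1 is the two's-complement negation of depth. For any
	 * nonzero value, either depth or its negation has bit 31 set,
	 * so OR-ing them and shifting bit 31 down yields 1. */
	return ((depth | (~depth + 1)) >> 31) & 1;
}

/* The plain comparison that the expression was meant to stand in for. */
static inline uint32_t sketch_is_zero(uint32_t depth)
{
	return depth == 0;
}
```

Note that the expression evaluates to 1 for a nonzero depth, which is the opposite polarity of depth == 0, so it would have to be negated to replace that test directly. It also does not decrement depth.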

Let’s move on to the actual optimization. I had found a function that calculates the number of equal bytes in two buffers.

My first task in understanding what to optimize was understanding what I had in front of me. I began deciphering the various pre-processor directives, builtin and otherwise.

Understanding what unaligned access means.

TUKLIB_FAST_UNALIGNED_ACCESS: This is defined if the system supports fast unaligned access for 16-bit and 32-bit integers.

TUKLIB_GNUC_REQ(3, 4): Checks whether the gcc version is at least 3.4.

_MSC_VER: Defined when compiling with Microsoft Visual Studio; its value is the compiler version.

__INTEL_COMPILER: Defined when compiling with the Intel C++ Compiler.

__x86_64__: Defined when the target processor architecture is x86_64.

_M_X64: Defined by Visual Studio when targeting x64 (AMD64).

Porting the Code

For aarch64, the corresponding predefined macro is __aarch64__. I defined the extra comparison length to be 8 bytes, as was the case for x86. There is a call to unaligned_read16ne(), which is defined in tuklib_integer.h. This function uses a packed struct to read unaligned memory if the gcc version is defined and less than 6, or if the Intel compiler is defined. Otherwise it uses the standard memcpy(). For the time being, I will not edit this.
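The two ways of reading unaligned memory mentioned above can be sketched as follows. This is only an illustration, not the code from tuklib_integer.h: the function and struct names are made up, and the packed-struct variant relies on GCC/Clang attributes.

```c
#include <stdint.h>
#include <string.h>

/* Portable variant: copy the bytes into a properly aligned local.
 * Compilers commonly reduce this to a single load instruction. */
static inline uint16_t sketch_read16ne_memcpy(const uint8_t *buf)
{
	uint16_t v;
	memcpy(&v, buf, sizeof(v));
	return v;
}

#if defined(__GNUC__)
/* GCC/Clang extension: a packed struct has alignment 1, so the
 * compiler must emit a load that is safe for any address.
 * may_alias keeps the cast from uint8_t* valid under strict aliasing. */
struct sketch_packed16 {
	uint16_t v;
} __attribute__((__packed__, __may_alias__));

static inline uint16_t sketch_read16ne_packed(const uint8_t *buf)
{
	return ((const struct sketch_packed16 *)buf)->v;
}
#endif
```

Both variants return the value in native byte order, so the same two bytes read as a different number on little-endian and big-endian machines.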

The next function being used is _BitScanForward64(), a Visual Studio intrinsic that searches the mask passed as the second argument for the lowest set bit (1). If one is found, it stores the position of that bit at the address passed as the first argument. This function is supported on ARM64, so we’re good.

If this is not Windows, then the following builtin is used when the unaligned read returns a difference.

Built-in Function: int __builtin_ctz (unsigned int x)

Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined.
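To show how a count-trailing-zeroes operation fits into counting equal bytes, here is a minimal sketch. It is not the xz implementation: the names are hypothetical, it assumes a little-endian machine, it uses the 64-bit builtin __builtin_ctzll on GCC/Clang, and unlike the real code it never reads past the given limit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64))
#	include <intrin.h>
#endif

/* Index of the lowest set bit in x. x must be nonzero. */
static inline uint32_t sketch_ctz64(uint64_t x)
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64))
	unsigned long pos;
	_BitScanForward64(&pos, x);
	return (uint32_t)pos;
#else
	/* Assumes GCC or Clang. */
	return (uint32_t)__builtin_ctzll(x);
#endif
}

/* Number of leading bytes that are equal in a and b, at most limit.
 * Assumes little-endian byte order. */
static inline size_t sketch_match_len(const uint8_t *a, const uint8_t *b,
		size_t limit)
{
	size_t len = 0;

	/* Compare 8 bytes per iteration. */
	while (len + 8 <= limit) {
		uint64_t x, y;
		memcpy(&x, a + len, 8);
		memcpy(&y, b + len, 8);

		/* xor is zero only where the bytes match. */
		const uint64_t diff = x ^ y;
		if (diff != 0) {
			/* The lowest set bit lies in the first differing
			 * byte; dividing by 8 turns bits into bytes. */
			return len + (sketch_ctz64(diff) >> 3);
		}
		len += 8;
	}

	/* Compare the remaining tail one byte at a time. */
	while (len < limit && a[len] == b[len])
		++len;

	return len;
}
```

For example, two 26-byte buffers that first differ at index 11 would give a result of 11: the first 8-byte word matches, and the xor of the second word has its lowest set bit in byte 3.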

I tried to look for a similar gcc builtin for aarch64. There is nothing matching in the gcc manual pages. I found out that what I need is an equivalent of the Bit Scan Forward or Count Trailing Zeroes instructions. According to this manual page for ARM, __builtin_ctz is supported by ARM.

So what have I changed? I just added an additional condition to the preprocessor directive guarding the 64-bit unaligned access code, so that it is also compiled for aarch64.
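The shape of that change can be sketched like this. The macro and function names are hypothetical, and the real condition in xz also checks the compiler and TUKLIB_FAST_UNALIGNED_ACCESS; this only illustrates adding __aarch64__ to an architecture check.

```c
/* Hypothetical macro name; not the identifier used in xz. */
#if defined(__x86_64__) || defined(_M_X64) || defined(__aarch64__)
	/* 64-bit targets, now including aarch64: compare 8 bytes at a time. */
#	define SKETCH_COMPARE_64BIT 1
#else
	/* Everything else falls back to a narrower comparison. */
#	define SKETCH_COMPARE_64BIT 0
#endif

/* Reports how many bytes one comparison step would cover in this sketch. */
static inline int sketch_compare_width_bytes(void)
{
	return SKETCH_COMPARE_64BIT ? 8 : 1;
}
```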

Results? Not very satisfying. This is for the same 1G file that was used at the start of this blog.

real 32m21.747s
user 31m59.800s
sys 0m12.280s

To ensure I didn’t break anything, I compared the sha256 checksum of the original file with that of the same file after compressing and decompressing it using the binary built from the changed source. They were exactly the same, which makes sense since I didn’t actually change the logic of the code.

This is the end of my attempts at optimizing xz for aarch64 by using unaligned access while comparing two buffers in the LZMA API. During this process, I read a lot about unaligned access, why it is a problem for compilers, and when it can still be used. We will talk about that in detail in the last stage of this project.
