Saturday 18 April 2020

Project Stage - 3

This might be the final blog I am posting, so I will try to make it as descriptive as possible. This post is about stage 3 of the project. If you haven't gone through the other stages of the project, here are the links to them: Stage 1 ( https://asharma206-spo600.blogspot.com/2020/04/project-stage-1.html ), update to stage 1 ( https://asharma206-spo600.blogspot.com/2020/04/update-to-project-stage-1.html ), stage 2 ( https://asharma206-spo600.blogspot.com/2020/04/project-stage-2.html ).
After building, benchmarking, and profiling, the last stage was all about creating a strategy to find a possible optimization approach for the software.
To find an optimization opportunity, we first needed to find the hotspots within the hot function (the function taking the most time) in the software. In other words, we needed to figure out which instructions inside the software took the most time to execute, so that even a small optimization could make a visible difference in the overall run.

As a recap from the previous stage, I made use of perf report, which displays the percentage of running time spent in each function within the software. Going one step further, I pressed Enter on CompressFragment to analyse it more closely, and used the annotate functionality to get a breakdown of the individual instructions within the function. A screen is then displayed with the source code and the corresponding instructions, along with their percentages of the run time. Pressing Shift + 'h' redirected me to the hottest instruction within the function, the one with the highest percentage. The instruction I landed on was ldur x2, [x28, #4], which was taking 8.90% of the total time within the function. My first thought was that since this is only about 9%, I was doubtful whether any optimization here could make a significant difference. The next step was to find out whether any optimizations could be done on it.
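For anyone who wants to retrace the steps, this is roughly the perf workflow I used (the binary name below is just a placeholder; substitute whatever benchmark build you profiled):

    perf record ./your_benchmark      # run the workload and collect samples
    perf report                       # per-function breakdown of run time
    # In the report TUI: press Enter on CompressFragment, choose "Annotate",
    # then press 'H' to jump to the hottest instruction.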





                                                     SIMD and Vectorization
SIMD stands for Single Instruction, Multiple Data. Since all modern machines have SIMD registers that are 128 bits wide, we were introduced to these as a possible approach. These registers can be used as a single 128-bit (or 64-bit, 32-bit, and so on) value, but they can also be divided into lanes: 2 lanes of 64 bits, 4 lanes of 32 bits, or 8 lanes of 16 bits. Those 8 lanes in turn can be used to process up to 8 values at a time.
Vectorization means writing code that processes larger chunks of data at a time rather than a single unit. So basically, we use vectorization with SIMD to process multiple pieces of data at once, which speeds up the program.
For vectorization to kick in automatically, we can set the optimization level to -O3 or use the -ftree-vectorize option while compiling; if the instructions are in an order the compiler can work with, the loops will be vectorized automatically. This is called autovectorization.
Video on SIMD and vectorization: https://youtu.be/EIPbufXhiQs
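As a quick illustration (my own minimal sketch, not code taken from Snappy), a loop like the one below is simple enough for GCC to autovectorize when compiled with -O3 or -ftree-vectorize; adding -fopt-info-vec makes GCC report which loops were actually vectorized:

    // add.cpp -- minimal autovectorization sketch (not Snappy code)
    // Build with: g++ -O3 -fopt-info-vec -c add.cpp
    void add_arrays(const int *a, const int *b, int *c, int n) {
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];   // independent iterations, so the compiler
                                  // can process several elements per step
        }
    }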
But since this might not always be the case, we can use inline assembly instructions or intrinsics instead.
For inline assembly we use the syntax: __asm__("assembly template" : outputs : inputs : clobbers). This can be thought of as a way to inject assembly instructions into a C or C++ program, so vector instructions can be used directly.
More on inline assembly:  https://wiki.cdot.senecacollege.ca/wiki/Inline_Assembly_Language
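As a toy example of that syntax (my own sketch, assuming an AArch64 machine; this is not code from Snappy), here is an add done through GCC extended inline assembly:

    #include <stdint.h>

    // Toy example: add two 64-bit values with an inline AArch64 instruction.
    uint64_t add_inline(uint64_t a, uint64_t b) {
        uint64_t result;
        __asm__("add %0, %1, %2"
                : "=r"(result)       // outputs
                : "r"(a), "r"(b)     // inputs
                : /* no clobbers */);
        return result;
    }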

With intrinsics the idea is the same; the only difference is that intrinsics can be thought of as functions. The two approaches can be used largely interchangeably.
More on intrinsics: https://wiki.cdot.senecacollege.ca/wiki/Compiler_Intrinsics
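For example (again my own minimal sketch, assuming an AArch64 machine with NEON; not code from Snappy), the intrinsics from arm_neon.h below add 8 pairs of 16-bit values in one operation, matching the 8-lane case described above:

    #include <arm_neon.h>
    #include <stdint.h>

    // Add 8 pairs of 16-bit values in a single SIMD operation.
    void add_8_lanes(const uint16_t *a, const uint16_t *b, uint16_t *out) {
        uint16x8_t va = vld1q_u16(a);         // load 8 x 16-bit lanes
        uint16x8_t vb = vld1q_u16(b);
        vst1q_u16(out, vaddq_u16(va, vb));    // 8 additions at once
    }

Each intrinsic here maps more or less directly onto a single NEON instruction, which is why intrinsics behave like inline assembly while still reading like ordinary function calls.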

Now I was ready to apply this knowledge to the hotspots in my function in order to speed them up.
But I soon realised that my hottest instruction, ldur, is "load register (unscaled)". It was not computing part of an algorithm or processing data; it had everything to do with accessing memory. Since the load is most likely just waiting for data to arrive from memory, there was no real opportunity to optimize that instruction.

From there I decided to go one step further and look at the second hottest instruction, to see if there was any opportunity there. The instruction was strh w2, [x21, x1], which stands for "store register halfword". Since this was about storing a value from a register out to memory, it was also bound by memory access, so that instruction was not something I could do anything about either.




Going another step forward, I looked at the third hottest instruction in the pool, and this instruction was, surprisingly, add x28, x20, x3. What was surprising is that add is usually a fast instruction and does not take much time. Digging a bit deeper, I remembered the professor telling us that the instruction the time is attributed to may not actually be the one taking the time; it may just be waiting for some other instruction to finish before it can execute. So I looked at the instructions that preceded the add. As expected, they were strh w5, [x21, x3, lsl #1] (store register halfword) and ldrh w3, [x21, x1] (load register halfword). That made a lot of sense: the add reads x3, which is exactly the register the ldrh loads into, so the add was stalled waiting for those memory instructions to complete. Hence, once again, there was not much that could be done about add, as it was tied to memory access as well.
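To make the dependency easier to see, here is the sequence perf showed, with my own annotations added:

    strh w5, [x21, x3, lsl #1]   // store halfword to memory
    ldrh w3, [x21, x1]           // load halfword from memory into w3/x3
    add  x28, x20, x3            // needs x3, so it stalls until the load finishes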




At this point it was clear that there were no SIMD or vectorization techniques that could be applied to any part of the hottest function of the software. So I decided to look into some other techniques our professor had told us about.

                                               Platform-specific optimizations
The only other aspect of this software to look at was limiting the instructions to a specific architecture, which means having the software run only the instructions specific to the architecture it is being executed on.
Using #ifdefs and #defines, we can specify which code to build when the platform is ARM, x86, etc. That in turn results in running just what is required and ignoring everything else, which can yield improvements. A minimal sketch of the pattern is shown below.
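Here is what that pattern looks like (my own illustration using the standard compiler-defined macros; this is not code taken from Snappy):

    #if defined(__aarch64__)
        // AArch64-only path, e.g. NEON intrinsics
    #elif defined(__x86_64__)
        // x86_64-only path, e.g. SSE/AVX intrinsics
    #else
        // portable fallback for every other architecture
    #endif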

I began searching along those lines to see whether I could make certain changes to the code to make it architecture specific. After going through the files and studying each of them as well as I could, I found that the software already had those conditions specified.






So after a few discussions with the professor, I was able to conclude that the software was as optimized as it could be. And since all of the possible approaches had been tried on the software, the benchmarks I took earlier represent the fastest times for those optimization options.
No wonder Snappy is known for its fast compression while focusing less on the level of compression. Overall, this experience was a fun and exciting ride for me with the things, techniques, and knowledge it has given me, though I was worried at the start since it was something very new for me. But I guess now I have learnt how to approach a piece of software and how to approach optimizing it, and I hope to work on something similar in the future.

This course as a whole was a huge learning curve for me at the beginning. I went through a lot of struggles, spent hours thinking about the labs, and had an exciting time working on the project. With the help from the professor and my own struggles with the things we were made to work on, I feel I am taking a good chunk of useful knowledge with me from this course.
