Saturday 18 April 2020

Project Stage - 3

This might be the final blog I am posting, so I will try to make it as descriptive as possible. This blog is about stage 3 of the project. If you haven't gone through the other stages of the project, here are the links to them: Stage 1 ( https://asharma206-spo600.blogspot.com/2020/04/project-stage-1.html ), update to stage 1 ( https://asharma206-spo600.blogspot.com/2020/04/update-to-project-stage-1.html ), stage 2 ( https://asharma206-spo600.blogspot.com/2020/04/project-stage-2.html ).
After building, benchmarking and profiling, this last stage was all about creating a strategy to find a possible optimization approach for the software.
To find an optimization opportunity, we first needed to find the hotspots within the hot function (the function taking the maximum time) in the software. In other words, we needed to figure out which instructions inside the software took the most time to execute, so that even a small optimization there could make a difference in the overall run. As a recap from the previous stage, I made use of perf report, which displays the percentage of running time spent in the different functions of the software. Going one step further, I pressed enter on CompressFragment and used the annotate functionality to get a breakdown of the individual instructions within that function. A screen is then displayed with the source code and the corresponding instructions, along with the percentage of run time for each. Pressing shift + 'h' jumped me to the hottest instruction within the function, the one with the highest percentage. The instruction I landed on was taking 8.90% of the total time within the function and was ldur x2, [x28, #4]. My first thought was that since this is only about 9%, I was doubtful whether any optimization here could make a significant difference. The next step was to find out whether any optimization could be done on it.





                                                     SIMD and Vectorization
SIMD stands for single instruction, multiple data. With all modern machines having SIMD registers that are 128 bits wide, we were introduced to these as a possible approach. These registers can be used as single 128-bit, 64-bit or 32-bit values and so on, but they can also be divided into lanes: 64 bits (2 lanes), 32 bits (4 lanes) or 16 bits (8 lanes). Those 8 lanes can in turn be used to process up to 8 values at a time.
Vectorization is writing or compiling code so that it processes larger chunks of data at a time rather than a single unit. So basically we use vectorization with SIMD to process multiple pieces of data at once, and that speeds up the program.
For vectorization to kick in automatically we can set the optimization level to -O3, or use the -ftree-vectorize option while compiling, and if the loops are written in a way the compiler can analyse, they will be vectorized automatically. This is called autovectorization.
Video on SIMD and vectorization: https://youtu.be/EIPbufXhiQs
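Just to make autovectorization concrete, here is a small made-up loop (it is not from the Snappy source) that a compiler can typically vectorize on its own when built with -O3 or with -ftree-vectorize. Every iteration is independent, so the compiler is free to process several elements per SIMD instruction:

// vec_add.cpp - a made-up autovectorization example, not Snappy code.
// Build with: g++ -O3 -c vec_add.cpp   (or -O2 -ftree-vectorize)
#include <cstddef>
#include <cstdint>

void add_arrays(const uint16_t *a, const uint16_t *b, uint16_t *out, size_t n)
{
    // Independent element-by-element work: a good autovectorization candidate.
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

Adding -fopt-info-vec to the gcc options prints a note for each loop that actually got vectorized, which is a handy way to check whether autovectorization kicked in.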
But since this might not always be the case, we can use inline assembly instructions or intrinsics instead.
For an inline assembly instruction we use the syntax:  __asm__("assembly template" : outputs : inputs : clobbers). This can be thought of as a way to inject assembly instructions into a C or C++ program, and so vector instructions can be used this way.
More on inline assembly:  https://wiki.cdot.senecacollege.ca/wiki/Inline_Assembly_Language
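To show what that syntax looks like in practice, here is a tiny made-up AArch64 example (not Snappy code) that just adds two numbers with a single add instruction; the point is only to show where the outputs, inputs and clobbers go:

// inline_add.cpp - a made-up inline assembly example (AArch64 only), not Snappy code.
#include <cstdio>

int main()
{
    long a = 5, b = 7, sum;

    // "assembly template" : outputs : inputs : clobbers
    // (nothing extra is clobbered here, so the clobber list is left out)
    __asm__("add %0, %1, %2"
            : "=r"(sum)            // output operand, any general-purpose register
            : "r"(a), "r"(b));     // input operands

    printf("%ld\n", sum);          // prints 12
    return 0;
}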

With intrinsics the idea is the same; the only difference is that intrinsics can be thought of as functions, but the two approaches can be used interchangeably.
More on intrinsics: https://wiki.cdot.senecacollege.ca/wiki/Compiler_Intrinsics
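As a small made-up example of intrinsics (again, not Snappy code), the snippet below uses the AArch64 NEON intrinsics from arm_neon.h to add eight 16-bit values in a single operation, which is exactly the 8-lane case described above:

// neon_add.cpp - a made-up NEON intrinsics example (AArch64 only), not Snappy code.
#include <arm_neon.h>
#include <cstdio>

int main()
{
    uint16_t a[8]   = {1, 2, 3, 4, 5, 6, 7, 8};
    uint16_t b[8]   = {10, 20, 30, 40, 50, 60, 70, 80};
    uint16_t out[8];

    uint16x8_t va = vld1q_u16(a);       // load 8 x 16-bit lanes into a 128-bit register
    uint16x8_t vb = vld1q_u16(b);
    uint16x8_t vr = vaddq_u16(va, vb);  // one operation adds all 8 lanes
    vst1q_u16(out, vr);                 // store the 8 results back to memory

    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);
    printf("\n");
    return 0;
}

Each intrinsic maps more or less directly to one SIMD instruction, so this reads like ordinary C++ but compiles down to much the same vector code you would otherwise write in assembly.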

Now I was ready to apply this knowledge to the hotspots in my function, in order to speed them up.
But soon I realised that my hottest instruction, ldur, is load register unscaled. So it was not working on an algorithm or processing data; it had everything to do with accessing memory. Since the load is essentially just waiting for data to arrive from memory, there was no real opportunity for optimization on that instruction.

From there I went one step further and looked at the second hottest instruction, to see whether there was any opportunity there. The instruction was strh w2, [x21, x1], which is store register halfword, and since this was about writing a value from a register out to memory, it too had everything to do with memory access, so it was also not something I could do anything about.




Going further, I looked at the third hottest instruction in the pool, and this instruction was surprisingly add x28, x20, x3. What was surprising about this is that add is usually a fast instruction and does not take much time. Digging in a bit deeper, I remembered the professor had told us that the instruction charged with the time might not actually be the one taking the time; it might just be waiting for some other instruction to finish so that it can execute. So I looked at the instructions that came just before the add. Those instructions (as expected) were strh w5, [x21, x3, lsl #1] (store register halfword) and ldrh w3, [x21, x1] (load register halfword), which made a lot of sense and made it clear that the add instruction was waiting for them to complete. Hence, once again there was nothing much that could be done about the add, as it was tied to memory access as well.




At this point it was clear that there were no SIMD or vectorization techniques that could be applied to any part of the hottest function of the software. So I decided to look into some other techniques our professor told us about.

                                               Platform specific optimizations
The only other aspect of this software left to look at was limiting the instructions to a specific architecture, which means having the software only run the instructions that are specific to the architecture it is being executed on.
So, using ifdefs and defines, we can specify what options to set when the platform is ARM, x86, etc., which in turn results in running just what is required and ignoring everything else, resulting in improvements.
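A rough sketch of that pattern looks like the following; the function and messages are made up for illustration and are not taken from the Snappy source, but __aarch64__ and __x86_64__ are the real macros gcc predefines for those architectures:

// platform_select.cpp - a made-up sketch of platform-specific ifdefs, not Snappy code.
#include <cstdio>

void copy_block(void)
{
#if defined(__aarch64__)
    // AArch64-only path would go here (for example, NEON intrinsics).
    printf("using the AArch64 code path\n");
#elif defined(__x86_64__)
    // x86_64-only path would go here (for example, SSE intrinsics).
    printf("using the x86_64 code path\n");
#else
    // Portable fallback for every other architecture.
    printf("using the generic code path\n");
#endif
}

int main()
{
    copy_block();
    return 0;
}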

I began searching along those lines, to see whether I could make certain changes to the code to make it architecture-specific. After going through the files and studying each of them as well as I could, I found that the software already had those conditions in place.






So after a few discussions with the professor I was able to conclude that the software was about as optimized as it could be. And since all the possible approaches had been tried on the software, I was able to conclude that the benchmarks I took were the fastest times for those optimization options.
No wonder snappy is known for its fast compression while focusing less on the level of compression. Overall this experience was a fun and exciting ride for me, with all the things, techniques and knowledge it has given me, though I was worried at the start since it was something very new for me. But I guess now I have learnt how to approach a piece of software and how to approach the optimization part of it, and I hope to work on more of this in the future.

This course as a whole was a huge learning curve for me at the beginning. I went through a lot of struggles, spent hours thinking about the labs, and had an exciting time working on the project. With the help from the professor and my own struggles with the stuff we were made to work on, I feel I am taking a good chunk of useful knowledge with me from this course.

Saturday 11 April 2020

Project Stage - 2

This blog is about stage 2 of our project ( Project Stage 1 - https://asharma206-spo600.blogspot.com/2020/04/project-stage-1.html ). After we were done setting the benchmarks, the next step was to do the profiling of the software. In other words, we had to figure out what part of the software was taking the maximum execution time.

We were introduced to two ways to figure out which function in our software was taking the maximum execution time. The two concepts were sampling and instrumentation.

1. Sampling: In this way of profiling, the program is interrupted thousands of times per second, thus collecting thousands of samples per second. Each sample records where the program counter was at the moment of the interruption, so we can figure out what called what and hence what was executing for the majority of the time.

2. Instrumentation: Another way of profiling is instrumentation. Here, extra code is added while building the software so that all the function calls are recorded, and from that we can figure out which part takes the maximum execution time.

Two tools can be used to do the profiling: gprof and perf. I will demonstrate how to work with each of these and then show the results I got from each of them.

gprof:
In order to use this tool I had to set up the build to include the -pg (profile generation) option. The next step was to build the software and then run it as usual. After the run, a new file called gmon.out is generated; it is a binary data file, not something you read directly. The next step was to use the gprof command:
filename: (name of the executable originally used to run the software (not gmon.out))
gprof filename | gprof2dot | dot -T x11

Now, gprof filename on its own gives the statistics for the different functions: their execution times and other related info. But to get more readable and helpful output, we feed the result to gprof2dot, which in turn is fed to a program called dot, which gives a textual form of the execution tree. For a more graphical version we add the -T x11 option, which launches a new window with a visual version of the tree.

result:


Since the original tree was much more complex, this is just a screenshot of the part of the tree containing the function that takes the maximum execution time.
As is clear from this graph, the function CompressFragment was taking more than 74% of the execution time, and although control was getting transferred to multiple functions, the time taken by them was not that significant. Just a note: this tool uses both instrumentation and sampling.



perf:
Though I already knew by now which function was taking the maximum time, our professor suggested using both tools, so I decided to work with perf as well. To use this tool we do not need any special build options, so I just used the -O2 optimization level and rebuilt the software.
Note: This tool only uses the sampling and not the instrumentation.
In order to use this tool, we just run the program under perf:
perf record ( program execution )
This records the samples and creates a file called perf.data.
Then we issue the command: perf report
Since this tool does not use instrumentation, we can't produce an execution tree; instead we just see a listing of the different functions and the share of execution time spent in each.

result:



As expected, it is also clear here that most of the execution time in this program is spent in the function CompressFragment (72%). The percentage is a bit different from the one in gprof, but that is understandable since the two tools use different approaches.


So by this point it was clear that CompressFragment was the function in this software I had to work on and try to optimize, and since it takes more than 70% of the time, even a small change in its execution time would mean a good amount of difference in the overall execution time. I will be spending the next few days figuring out which aspect of the function I can work on, and I hope to make some good progress on this project overall.

Monday 6 April 2020

Update to Project Stage - 1

As I said in the last post (https://asharma206-spo600.blogspot.com/2020/04/project-stage-1.html), I did not have much luck figuring out how to set the optimization level for the build of my software. But with some help from the professor and by reading some documentation about cmake, I finally managed to set the optimization level and decided to update my benchmarks.
I set the CXX_FLAGS to "-O1 -DNDEBUG" for all builds, and surprisingly the execution of the software, which was taking about 6 mins and 44 secs, got reduced significantly, to just about 24% of that time. So I increased the size of one of the files the software was run on, so that the run would take more than 4 mins again.
All I did was append the file to a new file multiple times using a loop:
for ((X=0; X<N; X++)); do cat file1.txt >> newfile.txt; done     (where N is the number of times to append)

Before taking the new benchmarks, I set the optimization level to -O2 (the recommended level) and rebuilt the software.

                                                                       Results

AArchie (AArch64)



Once again, as in the last post, the values are in seconds; they ranged from about 283.35 to 284.42, with a variation of about 0.37%.



Xerxes (x86_64)





The benchmarks for Xerxes ranged from about 243.84 to 244.36 seconds, with a variation of just about 0.21%.


Another small thing I would like to point out, which I think is important: when I was setting optimization options, removing them and experimenting with them, a key step was making sure the old build files are deleted and the software is built again before running the updated version of the code. That is, make clean and then make. This matters whenever we want the program to be recompiled with our updated options.
With that said and my new benchmarks in, I'll be looking forward to doing the profiling of this software, and hopefully I can optimize it by some amount by the end of this course. I will keep you guys updated.