Saturday 18 April 2020

Project Stage - 3

This might be the final blog I post, so I will try to make it as descriptive as possible. This blog is about stage 3 of the project. If you haven't gone through the earlier stages, here are the links to them: Stage 1 ( https://asharma206-spo600.blogspot.com/2020/04/project-stage-1.html ), update to stage 1 ( https://asharma206-spo600.blogspot.com/2020/04/update-to-project-stage-1.html ), stage 2 ( https://asharma206-spo600.blogspot.com/2020/04/project-stage-2.html ).
After building, benchmarking and profiling, this last stage was all about devising a strategy to find a possible optimization approach for the software.
To find an optimization opportunity, we first needed to find the hotspots within the hot function (the function taking the most time) in the software. In other words, we needed to figure out which instructions took the most time to execute, so that a small optimization could make a visible difference in the overall run. As a recap from the previous stage, I used perf report, which displays the percentage of running time spent in the different functions of the software. Going one step further, I pressed Enter on CompressFragment and used the annotate functionality to get a breakdown of the individual instructions within the function. This displays a screen with the source code and the corresponding instructions, along with the percentage of time spent on each. Pressing 'H' (shift + 'h') jumps to the hottest instruction within the function. The instruction I landed on was taking 8.90% of the total time within the function: ldur x2, [x28, #4]. My first thought was that since this is only about 9%, I was doubtful I could do any optimization that would make a significant difference. The next step was to find out whether any optimization could be done on it.





                                                     SIMD and Vectorization
SIMD stands for Single Instruction, Multiple Data. All modern machines have SIMD registers, which are 128 bits wide, and we were introduced to these as a possible approach. These registers can be used as a single 128-, 64- or 32-bit value, but they can also be divided into lanes: 64 bits (2 lanes), 32 bits (4 lanes), or 16 bits (8 lanes). Those 8 lanes in turn can be used to process up to 8 values at a time.
Vectorization is code that processes larger chunks of data at a time rather than a single unit. Basically, we use vectorization with SIMD to process multiple pieces of data at once, which speeds up the program.
For vectorization to kick in automatically, we can set the optimization level to -O3 or use the -ftree-vectorize option while compiling; if the loops are in a form the compiler can handle, they will be vectorized automatically. This is called autovectorization.
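As an illustration (my own toy example, not code from snappy), here is the kind of simple loop GCC can usually auto-vectorize at -O3 or with -ftree-vectorize; the restrict qualifiers promise the compiler the arrays don't overlap, which helps vectorization kick in:

```c
#include <stddef.h>

/* Element-wise addition: a good autovectorization candidate, since each
   iteration is independent and the arrays are marked non-overlapping. */
void add_arrays(int *restrict out, const int *restrict a,
                const int *restrict b, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Compiling with gcc -O3 -fopt-info-vec reports whether the loop was actually vectorized.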
Video on SIMD and vectorization: https://youtu.be/EIPbufXhiQs
But since this might not always be the case, we can use inline assembly instructions or intrinsics instead.
For inline assembly we use the syntax:  __asm__("assembly template" : outputs : inputs : clobbers). This can be thought of as a way to inject assembly instructions into a C or C++ program, and vector instructions can be used in this way.
More on inline assembly:  https://wiki.cdot.senecacollege.ca/wiki/Inline_Assembly_Language
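As a sketch of that syntax (my own toy example, not from snappy), here is how an AArch64 add instruction could be injected into a C function; the fallback keeps it compilable on other architectures:

```c
#include <stdint.h>

/* Toy example: add two 64-bit values with an inline AArch64 "add".
   On non-ARM builds we fall back to plain C so the file still compiles. */
static uint64_t add_u64(uint64_t a, uint64_t b) {
#if defined(__aarch64__)
    uint64_t result;
    __asm__("add %0, %1, %2"       /* assembly template */
            : "=r"(result)         /* outputs */
            : "r"(a), "r"(b));     /* inputs; no clobbers needed */
    return result;
#else
    return a + b;                  /* portable fallback */
#endif
}
```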

With intrinsics the idea is the same; the only difference is that intrinsics can be thought of as functions, and the two approaches can be used interchangeably.
More on intrinsics: https://wiki.cdot.senecacollege.ca/wiki/Compiler_Intrinsics
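To illustrate (again my own example, assuming the ARM NEON header arm_neon.h is available on the target), an 8-lane halfword addition can be written with intrinsics, which look like ordinary function calls; a scalar fallback is included for non-NEON builds:

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Add 8 unsigned 16-bit values at once using a 128-bit SIMD register
   divided into 8 lanes. */
void add_8_lanes(const uint16_t *a, const uint16_t *b, uint16_t *out) {
#if defined(__ARM_NEON)
    uint16x8_t va = vld1q_u16(a);        /* load 8 lanes */
    uint16x8_t vb = vld1q_u16(b);
    vst1q_u16(out, vaddq_u16(va, vb));   /* lane-wise add, then store */
#else
    for (int i = 0; i < 8; i++)          /* scalar fallback */
        out[i] = (uint16_t)(a[i] + b[i]);
#endif
}
```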

Now I was ready to apply this knowledge to hotspots in my functions, in order to speed them up.
But I soon realised that my hottest instruction, ldur, is "load register (unscaled offset)". So it was not working on an algorithm or processing data; it had everything to do with memory access. Since the load is probably just waiting for data to arrive from memory, there was no real optimization opportunity in that instruction.

From there I went one step further and looked at the second hottest instruction to see if there was any opportunity there. The instruction was strh w2, [x21, x1], which stands for "store register halfword". Since this stores a value from a register to memory, it was also bound by memory access, so there was nothing I could do about that instruction either.




Going further, I looked at the third hottest instruction in the pool, which was surprisingly add x28, x20, x3. What was surprising about this is that add is usually a fast instruction and does not take much time. Digging a bit deeper, I remembered the professor telling us that the instruction charged with the time may not actually be the one taking the time; it may be waiting for some other instruction to finish before it can execute. So I looked at the instructions preceding the add. As expected, they were strh w5, [x21, x3, lsl #1] (store register halfword) and ldrh w3, [x21, x1] (load register halfword). That made a lot of sense and made it clear that the add instruction was waiting for those memory instructions to complete. Hence, once again there was not much that could be done about add, as it too was tied to memory access.




At this point it was clear that no SIMD or vectorization techniques could be applied to any part of the hottest function in the software. So I decided to look into some of the other techniques our professor told us about.

                                               Platform specific optimizations
The only other aspect of this software to look at was limiting the instructions to a specific architecture, meaning having the software perform only the instructions specific to the architecture it is being executed on.
So using #ifdefs and #defines, we can specify which options to set when the platform is ARM, x86, etc. That in turn results in running just what is required and ignoring everything else, which can yield improvements.
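A minimal sketch of the idea (hypothetical names, not snappy's actual macros): each platform gets its own compile-time path, with a generic fallback.

```c
/* Select an implementation per platform at compile time. */
const char *platform_name(void) {
#if defined(__aarch64__)
    return "aarch64";       /* ARM 64-bit specific code would go here */
#elif defined(__x86_64__)
    return "x86_64";        /* x86_64 specific code would go here */
#else
    return "generic";       /* portable fallback */
#endif
}
```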

I began searching along that line to see whether I could make certain changes to make the code architecture-specific. After going through the files and studying each of them as well as I could, I found that the software already had those conditions in place.






So after a few discussions with the professor, I was able to conclude that the software was about as optimized as it could be. And since all the possible approaches had been tried on the software, I concluded that the benchmarks I took were the fastest times for those optimization options.
No wonder snappy is known for its fast compression while focusing less on the compression ratio. Overall this experience was a fun and exciting ride for me, with all the techniques and knowledge it has given me, though I was worried at the start since it was something very new to me. But I guess I have now learnt how to approach a piece of software and how to approach optimizing it, and I hope to work on more of this in the future.

This course as a whole was a huge learning curve for me. I went through a lot of struggles, spent hours thinking about the labs, and had an exciting time working on the project. With help from the professor, and through my own struggles with the material we were made to work on, I feel I am taking a good chunk of useful knowledge with me from this course.

Saturday 11 April 2020

Project Stage - 2

This blog is about stage 2 of our project ( Project Stage 1 - https://asharma206-spo600.blogspot.com/2020/04/project-stage-1.html). After we were done setting the benchmarks, the next step was to profile the software; in other words, we had to figure out which part of the software was taking the most execution time.

We were introduced to two ways of figuring out which function in our software was taking the most execution time: sampling and instrumentation.

1. Sampling: In this way of profiling, the program is interrupted many times per second, collecting thousands of samples. Each sample records what the program counter was doing at the moment of the interruption, so we can figure out what called what and hence what was executing for the majority of the time.

2. Instrumentation: In this approach, extra code is added while building the software so that all the function calls are recorded, letting us figure out which part takes the most execution time.

Two tools can be used to do the profiling: gprof and perf. I will demonstrate how to work with each of them and then show the results I got from each.

gprof:
In order to use this tool I had to set up the build to include the -pg (profile generation) option, then build the software and run it as usual. After the run, a new file called gmon.out is generated; it is a binary file. The next step was using the gprof command (where filename is the executable originally used to run the software, not gmon.out):
gprof filename | gprof2dot | dot -T x11

Now, gprof filename on its own gives the status of the different functions, their execution times and other related info. But to get more readable and helpful output, we feed the result to gprof2dot, which is in turn fed to a program called dot, which gives a textual form of the execution tree; the -T x11 option launches a new window with a visual version of the tree instead.

result:


Since the original tree was much more complex, this is just a screenshot of the part containing the function that takes the most execution time.
As is clear from this graph, the function CompressFragment was taking more than 74% of the execution time, and though control was being transferred to multiple other functions, the time taken by them was not that significant. Just a note: this tool uses both instrumentation and sampling.



perf:
Though by now I already knew which function was taking the most time, our professor suggested using both tools, so I decided to work with perf as well. This tool does not need any special build options, so I just used the -O2 optimization level and rebuilt the software.
Note: This tool only uses the sampling and not the instrumentation.
In order to use this tool, we just run the program under perf:
perf record <program and its arguments>
This records the samples and creates a file called perf.data.
Then we issue the command: perf report
Since this tool does not use instrumentation, it can't produce an execution tree; we just see a breakdown of the different functions and their share of the execution time.

result:



As expected, it is again clear that most of the execution time in this program is spent in the function CompressFragment (72%). The percentage is a bit different from the one gprof reported, but that is understandable since the two tools use different approaches.


So by this point it was clear that CompressFragment was the function in this software I had to work on and try to optimize, and since it takes more than 70% of the time, even a small change in its execution time would make a good amount of difference overall. I would be spending the next few days figuring out which aspect of the function I could work on, and I hoped to make some good progress on the project overall.

Monday 6 April 2020

Update to Project Stage - 1

As I said in the last post (https://asharma206-spo600.blogspot.com/2020/04/project-stage-1.html), I did not have much luck figuring out how to set the optimization level for the build of my software. But with some help from the professor and by reading some cmake documentation, I finally managed to set the optimization level and decided to update my benchmarks.
I set CXX_FLAGS to "-O1 -DNDEBUG" for all builds, and surprisingly the execution of the software, which had been taking about 6 mins and 44 secs, was reduced significantly, to just 24% of that. So I increased the size of one of the files the software was run on, so that a run would again take more than 4 mins.
All I did was append the file to a new file multiple times using a loop (where N is the number of times to append):
for ((X=0; X<N; X++)); do cat file1.txt >> newfile.txt; done

Before taking the new benchmarks, I set the optimization level to -O2 (the recommended level), rebuilt, and ran the tests again.

                                                                       Results

AArchie (AArch64)



Once again, as in the last post, the values are in seconds and ranged from about 283.35 to 284.42, with a variation of about 0.37%.



Xerxes (x86_64)





The benchmarks for Xerxes ranged from about 244.36 to 243.84, with a variation of just about 0.21%.


Another small thing I would like to point out, which I think is important: when I was experimenting with setting and removing optimizations, a key step is making sure the build files are deleted and the software is built again before running the updated version of the code. That is, make clean and then make. This matters whenever we want the program to be recompiled with updated options.
With that said and my new benchmarks in, I'll be looking forward to profiling this software, and hopefully I can optimize it by some amount by the end of this course. I will keep you guys updated.

Tuesday 31 March 2020

Project Stage - 1

This blog will be all about Stage 1 of the final project for this course. I was both excited and nervous to work on the project: excited since I had not worked on anything like this before, and worried about what I should choose and what kind of challenges I would face doing so.
The instructions given by the professor were that we needed to pick something that is not too basic; in other words, it had to be resource intensive. There were only a few programming languages to choose from, and since we could not choose Java or web scripts, C and C++ were the only languages left for me, as I did not have experience working with any others.

1. Find an open source software package that includes a CPU-intensive function or method that is implemented in a language that compiles to machine code (such as C, C++, or Assembler). This could be a hash function, compression, a codec, or any other compute-intensive task, and may be in an application or a library.

I started with a project called Soundtouch (https://github.com/rspeyer/soundtouch), which is used for changing the pitch, tempo and playback rate of audio. I began building the software by running its bootstrap and configure scripts, but failed, and soon figured out, after getting help from the professor, that the software had been broken in the last few builds. I was also informed that no tests were included in the module, so I felt it would be really problematic to work on something that would need so much work just for the benchmarks.
There were a lot of different fields of software I could choose from, but I was most interested in working on compression software. After studying and researching several different packages, I finally chose a compression library from Google called Snappy (https://github.com/google/snappy).
It is a compression/decompression library that does not aim for maximum compression but for very high speeds with a reasonable amount of compression.


2. Build the software.

I took the following steps (commands) to retrieve and build the software:

1. git clone https://github.com/google/snappy.git
pulls the repository from GitHub onto your own machine.

2. cd snappy && mkdir build
changes into the snappy directory pulled from GitHub and makes a new directory called build within it.

3. cd build
change directory to build

4. cmake ../ && make
builds the library. A successful make means the library has been built successfully.

3. Next, we had to find a test case that would help us benchmark the software with minimal variation.

Fortunately for me, the repository included a unit test that let me run tests on the package. The unit test, named snappy_unittest, was used to test and benchmark the software.

4. Benchmark results

Benchmarks were supposed to have a significant real run time, so that the small changes made after optimization would show up as significant too and the variations across iterations would average out. I ran the same unit test on both AArch64 (AArchie) and x86_64 (Xerxes) 5 times each, and the results are reflected below:






The values are in seconds, as I thought that was more precise.

x86_64: The average value ranged between 64.21 and 64.66 seconds and had a variation of about 0.60% which is a fairly small value.

AArch64: The average value ranged between 401.51 and 403.67 seconds with a variation of about 0.53% which again was fairly small.

The difference in times between x86_64 and AArch64 was around 84%, which is understandable, as the x86_64 machine is much more powerful than the AArch64 one.

Compiler optimizations:
Since I am using cmake to build the software, I did not have much experience with it, and the demos given by the professor used configure rather than cmake, so I had a hard time figuring out how to set compiler optimizations for this build. But with help from the professor and by reading documentation, I am sure I will be able to do it, and I will give an update in the next post.

Overall, I would say this is another one of the challenges in this course. It has brought a lot of unpredictability, given how new this is to me and the optimizations we will be doing on the software, but I hope I'll be able to overcome this last challenge the course has to offer and become much more comfortable working with different assemblers.

Thursday 26 March 2020

Week 9 - Lecture

This week's lecture started with all of us a bit sad: the whole world was fighting a pandemic known as COVID-19, and it had been announced that there would be no more in-person lectures to stop the disease's spread, so it was probably the last lecture we would take in person.
The lecture was about compiler optimizations, i.e. the small tricks the compiler performs to make our program efficient. In other words, the compiler makes changes to our code that make it faster to run.

                                                        Optimizations:
Strength Reduction:
In this optimization, expensive operations are replaced by cheaper ones. An example: multiplication is considered comparatively expensive, so the code is rewritten so that a multiplication is replaced by an addition or something similar.
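A small sketch of the idea in C (my own example, showing before and after forms of the same computation):

```c
/* Before: a multiplication on every iteration. */
long sum_products(int n, int k) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += (long)i * k;       /* expensive multiply each time */
    return s;
}

/* After strength reduction: the multiply becomes a running addition. */
long sum_products_reduced(int n, int k) {
    long s = 0, ik = 0;         /* ik tracks the value of i * k */
    for (int i = 0; i < n; i++) {
        s += ik;
        ik += k;                /* cheap add replaces the multiply */
    }
    return s;
}
```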


Hoisting:
Loops sometimes contain repetitive calculations, in other words calculations whose results do not change within the loop. Such calculations can be moved outside the loop so that nothing is recomputed on every iteration, making the loop more efficient. Examples: constant calculations, calls to functions whose results don't change, loop bounds involving complex calculations.
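In C (my own example), hoisting looks like this:

```c
/* Before: the conversion factor is recomputed on every iteration. */
void scale_all(double *a, int n, double deg) {
    for (int i = 0; i < n; i++)
        a[i] *= deg * (3.14159265 / 180.0);   /* loop-invariant work */
}

/* After hoisting: the invariant is computed once, outside the loop. */
void scale_all_hoisted(double *a, int n, double deg) {
    double factor = deg * (3.14159265 / 180.0);
    for (int i = 0; i < n; i++)
        a[i] *= factor;
}
```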


Loop unswitching:
Sometimes a check that does not depend on the loop is placed inside it, so the check runs on every iteration. Instead, the check can be moved outside and a copy of the loop placed inside each branch of the check, getting rid of the per-iteration test. This is called loop unswitching.
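A sketch in C (my own example) of what the compiler effectively does:

```c
/* Before: the flag is tested on every iteration even though it never changes. */
void adjust(int *a, int n, int negate) {
    for (int i = 0; i < n; i++) {
        if (negate)
            a[i] = -a[i];
        else
            a[i] = a[i] * 2;
    }
}

/* After unswitching: one test, two specialized loops. */
void adjust_unswitched(int *a, int n, int negate) {
    if (negate) {
        for (int i = 0; i < n; i++)
            a[i] = -a[i];
    } else {
        for (int i = 0; i < n; i++)
            a[i] = a[i] * 2;
    }
}
```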


Loop splitting:
Sometimes two or more parts of a loop's range need to be handled differently based on a check, but instead of doing the same check on every iteration, the loop is split into separate loops over each part, getting rid of the per-iteration check.
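For example (my own sketch), a loop whose behaviour changes at a boundary can be split at that boundary:

```c
/* Before: the boundary test i < mid runs on every iteration.
   Assumes 0 <= mid <= n. */
void fill(int *a, int n, int mid) {
    for (int i = 0; i < n; i++) {
        if (i < mid)
            a[i] = 0;
        else
            a[i] = 1;
    }
}

/* After splitting: two loops, no per-iteration test. */
void fill_split(int *a, int n, int mid) {
    for (int i = 0; i < mid; i++)
        a[i] = 0;
    for (int i = mid; i < n; i++)
        a[i] = 1;
}
```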


Loop unrolling:
In this method, the loop body is duplicated so it contains multiple copies of the same code, reducing the number of times the loop overhead (increment and comparison) is executed. Example: when we know there will be an even number of iterations, the body can be duplicated so the increment happens by two and the check runs only every second element, halving the loop overhead.
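The even-iteration case described above could look like this in C (my own sketch):

```c
/* Before: one element per iteration. */
long sum(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* After unrolling by 2: half the increments and comparisons.
   Assumes n is even, as in the example above. */
long sum_unrolled(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i += 2)
        s += a[i] + a[i + 1];
    return s;
}
```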


Inlining:
With inlining, the place where a function is called is replaced directly by the code inside that function, eliminating the overhead of the call.
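Conceptually (my own example), inlining turns the first form into the second:

```c
static int square(int x) { return x * x; }

/* Before: two function calls. */
int sum_of_squares(int a, int b) {
    return square(a) + square(b);
}

/* After inlining: each call site is replaced by the function body. */
int sum_of_squares_inlined(int a, int b) {
    return a * a + b * b;
}
```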


Common subexpression elimination:
Where the same calculation is performed more than once, the result can be computed once, stored in a variable, and that variable used in place of the repeated calculations.
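A small sketch in C (my own example):

```c
/* Before: 2 * border is computed twice. */
int framed_area(int w, int h, int border) {
    return (w + 2 * border) * (h + 2 * border);
}

/* After elimination: the common subexpression is computed once. */
int framed_area_cse(int w, int h, int border) {
    int b2 = 2 * border;
    return (w + b2) * (h + b2);
}
```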


Jump Threading:
Where all the possibilities are known, a chain of repetitive if conditions can be changed so that the last possibility becomes a plain else, eliminating the evaluation of the final condition.


Short circuit evaluation:
In conditions containing OR or AND, evaluation can stop early: for OR, the second operand is evaluated only if the first is false, and for AND, only if the first is true.


Dead Store elimination:
There are situations where a value is written to a variable more than once but the earlier values are never read; those earlier stores are eliminated.
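A minimal example (my own):

```c
/* The first store to x is never read, so the compiler removes it. */
int dead_store_example(void) {
    int x = 5;      /* dead store: overwritten before any use */
    x = 42;
    return x;
}
```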


So these were the optimizations the compiler does on our behalf to make the program efficient. More on these optimizations can be found here:
https://wiki.cdot.senecacollege.ca/wiki/Compiler_Optimizations
I feel the rest of the course might be a bit difficult, considering there will be no in-person classes and the final project is approaching. But I am confident that I will be able to get the most out of this course and fight through these difficult circumstances.

Wednesday 18 March 2020

Week 9 - Lab 6

This week started with another lab that we had to complete and it was based on algorithm selection.
Digital sound is represented as 16-bit signed integer samples, with two streams for the left and right stereo channels. To change the volume of a sound, each sample can be scaled (multiplied) by a volume factor in the range 0.00 (silence) to 1.00 (full volume).

In order to do the scaling we have three options to choose from:

1. Naive algorithm: In this approach each sample is directly multiplied by the volume factor.

2. Lookup-based algorithm: This approach pre-calculates the results for all 65536 possible 16-bit values and looks the answer up in that table instead of multiplying.

3. Fixed-point algorithm: This approach uses fixed-point math and bit shifting to perform the scaling without using floating-point math.
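As a rough sketch (my own simplification, not the lab's exact code), the naive and fixed-point approaches look something like this, with the volume stored as a Q8 integer (the volume factor times 256):

```c
#include <stdint.h>

/* Naive: a floating-point multiply per sample. */
int16_t scale_naive(int16_t sample, float volume) {
    return (int16_t)(sample * volume);
}

/* Fixed-point: vol_q8 = volume * 256 is precomputed once; each sample
   then needs only an integer multiply and a shift. */
int16_t scale_fixed(int16_t sample, int16_t vol_q8) {
    return (int16_t)((sample * vol_q8) >> 8);
}
```

The conversion vol_q8 = (int16_t)(volume * 256) happens once, before the scaling loop.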

So basically we had these three options coded in three different files, plus a header file defining the number of samples to scale. Each file did the pre-processing required by its algorithm, scaled the samples, and then summed up the results. We had to make sure each algorithm ran for more than 20 seconds in order to get significant results, so we had to adjust the number of samples accordingly. Then we had to find a way to time only the scaling of the samples and exclude everything else.
The way I accomplished this was by timing the whole program, then commenting out the scaling portion and timing it again; the difference between the two gave me the time taken by the scaling.
Once we had that time we had to compare each of the three algorithms using the benchmarks, and also do the same on different machines/architectures.

So I chose to test and compare them on two of my class servers Aarchie (AArch64) and Xerxes (x86_64) and also on matrix server.

                                                         Results:
Fortunately, since testing on different machines so many times and timing each run would have been troublesome, we were made aware of a script that handles all of that:
for X in vol1 vol2 vol3 ; do echo ==== $X ; for Y in 1 2 3 4 ; do echo ---- Trial $Y ; /usr/bin/time ./$X ; done ; done |& tee results.log

The script not only times the three programs for 4 trials each but also stores the results in a file so we can always refer back to them later.


AArchie (AArch64)


I decided to take an average of all the four trials so it is easier to analyse and reach a decision.
Algorithm 3 proved to be the fastest, taking the least time to scale the samples, while algorithm 1 took 46% more time and algorithm 2 took 68% more time than algorithm 3. So overall, on this AArch64 machine, the fixed-point algorithm was the fastest. No significant difference was seen in the maximum resident memory usage of the three approaches, though.


Xerxes (x86_64)




Once again I averaged the 4 trials into a single value for analysing the benchmarks. On this x86_64 machine too, algorithm 3 proved to be the fastest, taking just 1.39 seconds to scale the samples, while algorithm 1 was the slowest at 26% more time and algorithm 2 took 18% more time than algorithm 3. So once again the fixed-point algorithm was the winner.


Matrix Server



Matrix came up with somewhat varying results for the naive algorithm, which might be due to the load on the server and other activity being performed on it, but averaging the trials again made it easier to analyse and compare the results. Surprisingly, this time algorithm 2 took the least time to scale the samples, averaging just 0.45 seconds, while algorithm 1 was the slowest of the three at 88% more time and algorithm 3 took 73% more time compared to algorithm 2. So the lookup-based algorithm was the fastest on the Matrix server.

This was, surprisingly, one of the easiest labs we have done so far. It is clear that which algorithm is most effective can vary with the machine we run on.



Saturday 7 March 2020

Week 8 - Lab 5 continued and Lecture

In this week's lab we were to continue with lab 5, which we had started the week before the study break. Our group had made some progress since the code I showed you in the last blog: we could now print the iteration number beside the word Loop, but only 0-9; after 9 we got the characters that follow '9' in the ASCII table. Before I explain how we solved this and show the final result, I want to show you the code and explain how each part works. Here is the code:


.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 30                         /* loop exits when the index hits this number (loop condition is i<max) */

_start:

    mov     x19, min

loop:

mov     x0, 1        /* file descriptor: 1 is stdout */
adr     x1, msg    /* message location (memory address) */
mov     x2, len    /* message length (bytes) */
mov x22, #10 /* store 10 in x22 */
udiv x20, x19, x22 /* div x19 (raw increment) by x22 (10) store it in x20 */
msub x5, x20, x22, x19   /* store at x5 the remainder */
add x4, x5, 0x30  /* ones digit: the remainder (x5) plus the ASCII offset */
add x5, x20, 0x30 /* tens digit: the quotient (x20) plus the ASCII offset */

cmp x20, 0
b.eq back
strb w5, [x1, 5] /* write the 10s value*/

back:
strb w4, [x1, 6] /* store the ones digit (w4) at msg + 6 */

mov     x8, 64      /* write is syscall #64 */
svc     0           /* invoke syscall */



add     x19, x19, 1 /* add 1 to x19 */
    cmp     x19, max /* compare x19 to the max */
b.ne    loop /* if not equal to max, branch back to loop */


mov     x0, 0            /* status -> 0 */
    mov     x8, 93           /* exit is syscall #93 */
    svc     0                /* invoke syscall */

.data
msg:  .ascii      "Loop   \n"
len=  . - msg


I'll start with how we printed iterations 0-9: we placed the iterator's value + 0x30 (i.e. the number plus hex 30, which gives the corresponding ASCII digit character) into a register and stored that register's byte at the address msg + the position in msg where we want to display the number, i.e. msg + 6: LOOP (4 characters) + a space + 1 more position before the place the digit is stored at. The problem was that when the iterator reached 10, it printed the character after '9' in the ASCII table, which is not "10"; since ASCII has no double-digit characters, we had to split the number to display double-digit values. So we took the double-digit number, got its tens place (divide by 10), added 0x30, and stored that value in a register, and got the ones place from the remainder of the division and stored it the same way. We used udiv and msub to get the quotient and remainder of the division respectively; more on these can be found here:
https://wiki.cdot.senecacollege.ca/wiki/AArch64_Register_and_Instruction_Quick_Start

Now that we had both digits, we just needed to place them in adjacent positions after the word Loop, which we did the same way we had placed the single digit after LOOP earlier. The last challenge was that we had to print a single digit up to 9 and double digits afterwards, but we were producing double digits for every iteration, so we used a branch to skip printing the tens digit whenever it was equal to 0. In this way we accomplished what was required for the lab.
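The digit-splitting logic maps to C quite directly; here is a sketch (my own, with the corresponding assembly instructions noted in the comments):

```c
/* Split a two-digit value into ASCII characters, mirroring the
   udiv / msub / add sequence in the assembly above.
   Assumes 0 <= i <= 99. */
void two_digits(int i, char *tens, char *ones) {
    int q = i / 10;              /* udiv x20, x19, x22 */
    int r = i % 10;              /* msub x5, x20, x22, x19 */
    *tens = (char)('0' + q);     /* add x5, x20, 0x30 */
    *ones = (char)('0' + r);     /* add x4, x5, 0x30 */
}
```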

Overall I found this lab very interesting, as it involved a lot of maths and also some significant logic to think about. I feel I now have an even stronger grip on assembly language.

Lecture:
This week's lecture was strictly about our upcoming project and its different stages. We were given a lot of instructions on what sort of project we should choose, based on it being resource intensive and whether we wanted to optimize for CPU, RAM or energy. Then we had to build and benchmark our chosen package with the following steps:
1. Obtain code
2. Install dependencies
3. Build
4. Get or generate some data
5. Build command
6. Run + time it
and in this way we can compare our code's performance before and after we work on it.

With the end of semester approaching, I am both excited and nervous to work on the project, and I am looking forward to exploring some good optimization and other helpful techniques.