This blog will be all about the Stage 1 for the final project for this course. I was both excited and nervous to work on the project since I had not worked on anything like this before which brought excitement and also I was worried about what I should choose and what kind of challenges I would face doing so.
The instructions I had in mind that was given by the professor were that we need to pick something that is not too basic, in other words it had to be resource intensive. There were few options for the programming languages that had to chosen from, since we could not choose Java or web scripts, C and C++ were the only languages I was left with as I did not have experience working with any other languages.
1. Find an open source software package that includes a CPU-intensive function or method that is implemented in a language that compiles to machine code (such as C, C++, or Assembler). This could be hash function, compression, a codec, or any other compute-intensive task, and may be in an application or a library.
I started by working on a project called Soundtouch (https://github.com/rspeyer/soundtouch) which was used for changing pitch, tempo and playback rate. I started building the software by issuing commands and installing binaries bootstrap and configure, but failed in doing so as I soon figured out after getting help from the professor that the software was broken in the last few builds. I was also informed that no tests were included within the module, so I felt it would be really problematic to work on something that would need so much work on just the benchmarks.
There were a lot of of different fields of software I could choose from but I was a bit more interested in working on compression software. After studying and researching some of the different software packages, I finally chose a compression software from Google called Snappy (https://github.com/google/snappy).
It is a compression, decompression software which does not aim for maximum compression but for higher speeds and a reasonable amount of compression.
2. Build the software.
I took the following steps (commands) to retrieve and build the software:
1. git clone https://github.com/google/snappy.git
pulls the repository from the github into your own machine.
2. cd snappy && mkdir build
changes the directory to snappy pulled from the github and makes new directory called build within that directory.
3. cd build
change directory to build
4. cmake ../ && make
builds the library. Successful make means the library has been successfully build.
3. Next we had to find a test case that would help us benchmark the software with minimal variations.
Fortunately for me, I had got a unit case in the repository that helped me run tests on the package. The unit test named snappy_unittest was used to test the software and benchmark it.
4. Benchmark results
Benchmarks were supposed to be done with real run time being of a significant value so that the small changes done after optimization would reflect as significant as well and the variations in different iterations are averaged out. I used the same unit test to be run on both AArch64(AArchie) and x86_64(xerxes) for 5 times each and the results are reflected below:
The values are in seconds as I thought it was more precise to do so.
x86_64: The average value ranged between 64.21 and 64.66 seconds and had a variation of about 0.60% which is a fairly small value.
AArch64: The average value ranged between 401.51 and 403.67 seconds with a variation of about 0.53% which again was fairly small.
The difference in times for x86_64 and AArch 64 were almost around 84% and it was understood as x86_64 is much more powerful compared to AArch64
Compiler optimizations:
Since I am using cmake to build the software and I did not have much experience doing so and also the demos given by the professor were on configure and not cmake, I had a hard time figuring out how to do compiler optimizations on this. But with the help from professor and reading documentations, I am sure I will be able to do so and I will be giving an update about it on the next post.
Overall, I would say this is one of the another challenges in this course. Though it has brought a lot of unpredictability based on how new this thing is to me and also the optimizations we would be doing on this software, but I hope I'll be able to overcome this last challenge this course has to offer and I would much more comfortable working with different assemblers.
Tuesday, 31 March 2020
Thursday, 26 March 2020
Week 9 - Lecture
This week's lecture started with all of us being a bit sad because the whole world was facing and fighting against a pandemic known as CoVid 19, and it was announced that there won't be any in person lectures to stop the disease's spread so it was probably the last lecture we were taking in person.
The lecture was based on the compiler optimizations i.e. the small tricks compiler does to make our program efficient. In other words compiler does some changes to our code that makes it faster and easier to run.
Optimizations:
Strength Reduction:
In this optimization some expensive operations are replaced by cheaper/inexpensive operations. An example of this would be as: multiplication is considered to be a comparatively expensive operation, so the code is rewritten in a way so that multiplication is replaced by addition or something similar.
Hoisting:
In some situation loops might contain repetitive calculations or in other words there might be some calculations that are not changed within the loop and such calculations are done outside of the loop and nothing is repeated within the loop thus making it more efficient. Example: basic calculations, non changing functions, loop counters with complex calculations. These all can be done outside of the loop so that nothing is repeated.
Loop unswitching:
There are situations when checks irrelevant of the loops are placed inside the loops that way checks are done in every cycle of the loop. But such checks are done outside of the loops and the loops are placed inside of the checks thus getting rid of checks in every loop. This is called loop unswitching.
Loop splitting:
Sometimes two or more parts of the loop are supposed to be handled differently on the basis of the checks done, but instead of doing similar checks in every loop, the loop is split into different parts based on the checks thus getting rid of that extra work in doing checks for the whole loop.
Loop unrolling:
In this method a loop is unrolled to contain multiple copies of the same code, thus decreasing the number of times the calculations are done. Example: A case when we know there are going to be even number of iterations of the loop, instead of checking every time, the increment is done twice, the check is done every second time and thus calculation is also halved throughout the loop.
Inlining:
In a case of inlining, the place where the function is being used is replaced directly by the code inside the function and eliminates the calling of the function.
Common subexpression elimination:
In cases where similar calculations are done more than 1 time the calculated result can be stored inside of a variable and that variable can then be used in place of those calculation and thus calculation is done only once.
Jump Threading:
In cases where we know all the possibilities, the repetitive if conditions can be changed to have an else statement in the last possibility thus eliminating evaluation for the last condition.
Short circuit evaluation:
In conditions containing OR or AND logical statements, conditions can be rewritten to evaluate only 1 condition first in case of OR and then continue to second only if 1st is false and the opposite for AND.
Dead Store elimination:
There are situation when a value is written to a variable more than once but the old values assigned are never used in the program and thus are eliminated.
So these were all the optimizations that compiler does on our behalf and makes the program efficient. More on these optimization can be found here:
https://wiki.cdot.senecacollege.ca/wiki/Compiler_Optimizations
I feel going further in the course might be a bit difficult considering that there would be no in person classes and also with the final project approaching. But I am confident that I would be able to achieve the most from this course and fight out of these difficult situations.
The lecture was based on the compiler optimizations i.e. the small tricks compiler does to make our program efficient. In other words compiler does some changes to our code that makes it faster and easier to run.
Optimizations:
Strength Reduction:
In this optimization some expensive operations are replaced by cheaper/inexpensive operations. An example of this would be as: multiplication is considered to be a comparatively expensive operation, so the code is rewritten in a way so that multiplication is replaced by addition or something similar.
Hoisting:
In some situation loops might contain repetitive calculations or in other words there might be some calculations that are not changed within the loop and such calculations are done outside of the loop and nothing is repeated within the loop thus making it more efficient. Example: basic calculations, non changing functions, loop counters with complex calculations. These all can be done outside of the loop so that nothing is repeated.
Loop unswitching:
There are situations when checks irrelevant of the loops are placed inside the loops that way checks are done in every cycle of the loop. But such checks are done outside of the loops and the loops are placed inside of the checks thus getting rid of checks in every loop. This is called loop unswitching.
Loop splitting:
Sometimes two or more parts of the loop are supposed to be handled differently on the basis of the checks done, but instead of doing similar checks in every loop, the loop is split into different parts based on the checks thus getting rid of that extra work in doing checks for the whole loop.
Loop unrolling:
In this method a loop is unrolled to contain multiple copies of the same code, thus decreasing the number of times the calculations are done. Example: A case when we know there are going to be even number of iterations of the loop, instead of checking every time, the increment is done twice, the check is done every second time and thus calculation is also halved throughout the loop.
Inlining:
In a case of inlining, the place where the function is being used is replaced directly by the code inside the function and eliminates the calling of the function.
Common subexpression elimination:
In cases where similar calculations are done more than 1 time the calculated result can be stored inside of a variable and that variable can then be used in place of those calculation and thus calculation is done only once.
Jump Threading:
In cases where we know all the possibilities, the repetitive if conditions can be changed to have an else statement in the last possibility thus eliminating evaluation for the last condition.
Short circuit evaluation:
In conditions containing OR or AND logical statements, conditions can be rewritten to evaluate only 1 condition first in case of OR and then continue to second only if 1st is false and the opposite for AND.
Dead Store elimination:
There are situation when a value is written to a variable more than once but the old values assigned are never used in the program and thus are eliminated.
So these were all the optimizations that compiler does on our behalf and makes the program efficient. More on these optimization can be found here:
https://wiki.cdot.senecacollege.ca/wiki/Compiler_Optimizations
I feel going further in the course might be a bit difficult considering that there would be no in person classes and also with the final project approaching. But I am confident that I would be able to achieve the most from this course and fight out of these difficult situations.
Wednesday, 18 March 2020
Week 9 - Lab 6
This week started with another lab that we had to complete and it was based on algorithm selection.
Digital sound is represented as 16 bit integer signal samples with two streams for left and right stereos. To change the volume of any sound each signal sample can be scaled (multiplied) by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).
In order to do the scaling we have three options to choose from:
1. Naive algorithm: In this approach each sample is directly multiplied by the volume factor.
2. Lookup based algorithm: This approach pre calculates all the possible 65536 (16 bit values) results and looks up in that pre calculated table instead of multiplying.
3. Fixed point algorithm: This approach uses fixed - point math and bit shifting to perform the multiplication instead of directly multiplying without using floating point math.
So basically we had these three options coded in three different files and a header file which had the number of samples to scale. Each file did some pre processing for the required algorithm, scaled the samples and then summed up the results. We had to make sure each algorithm took more than 20 seconds in order to find some significant results, so we had to adjust the number of samples accordingly. Then we had to find a way to time only the time taken to scale the samples and exclude everything else.
The way I chose to accomplish this was by timing the whole program and then commenting out the scaling portion of it and then timing it again and the difference between the two gave me the time taken for scaling.
Once we had that time we had to compare each of the three algorithms using the benchmarks, and also do the same on different machines/architectures.
So I chose to test and compare them on two of my class servers Aarchie (AArch64) and Xerxes (x86_64) and also on matrix server.
Results:
Fortunately, since testing on difference machines for so many times and timing them would have been a trouble but we were made aware of a script that would handle all that:
for X in vol1 vol2 vol3 ; do echo ==== $X ; for Y in 1 2 3 4 ; do echo ---- Trial $Y ; /usr/bin/time ./$X ; done ; done |& tee results.log
The script not only times the three files for 4 trials each but also stores the results in a file so that we can always reach back to it later.
AArchie (AArch64)
I decided to take an average of all the four trials so it is easier to analyse and reach a decision.
Algorithm 3 proved to be the fastest and took the lowest amount of time in scaling, while algorithm 1 took 46% more time and was slower and algorithm 2 took 68% more time than algorithm 3. So overall on this Aarch_64 machine, Fixed point algorithm proved to be the fastest. No significant change was seen in maximum resident memory usage of all the three approaches though.
Xerxes (x86_64)
Once again I averaged out all the 4 trials and came up with a single value to do the analysis of the benchmarks. On this x86_64 machine also, algorithm 3 proved to be the fastest taking just 1.39 seconds to scale the samples, while algorithm 1 took 26% more time and was slowest while algorithm 2 took 18% more time in comparison to algorithm 3. So once again Fixed point algorithm was seen as winner.
Matrix Server
Matrix came up with a bit varying results for Naive algorithm, which might be due to the load on the server and other activities being performed on it, but averaging them all out again helped me and made it easier to analyse and compare the results. Surprisingly this time, algorithm 2 took the lowest time scaling the samples averaging out to just 0.45, while algorithm 1 took 88% more time and was the slowest of the three, while algorithm 3 took 73% more time compared to algorithm 2. Surprisingly, Lookup based algorithm was the fastest on matrix server.
This was surprisingly one of the easiest labs we had done so far. It is clear that the effectiveness of an algorithm might vary based on the machine we work on.
Digital sound is represented as 16 bit integer signal samples with two streams for left and right stereos. To change the volume of any sound each signal sample can be scaled (multiplied) by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).
In order to do the scaling we have three options to choose from:
1. Naive algorithm: In this approach each sample is directly multiplied by the volume factor.
2. Lookup based algorithm: This approach pre calculates all the possible 65536 (16 bit values) results and looks up in that pre calculated table instead of multiplying.
3. Fixed point algorithm: This approach uses fixed - point math and bit shifting to perform the multiplication instead of directly multiplying without using floating point math.
So basically we had these three options coded in three different files and a header file which had the number of samples to scale. Each file did some pre processing for the required algorithm, scaled the samples and then summed up the results. We had to make sure each algorithm took more than 20 seconds in order to find some significant results, so we had to adjust the number of samples accordingly. Then we had to find a way to time only the time taken to scale the samples and exclude everything else.
The way I chose to accomplish this was by timing the whole program and then commenting out the scaling portion of it and then timing it again and the difference between the two gave me the time taken for scaling.
Once we had that time we had to compare each of the three algorithms using the benchmarks, and also do the same on different machines/architectures.
So I chose to test and compare them on two of my class servers Aarchie (AArch64) and Xerxes (x86_64) and also on matrix server.
Results:
Fortunately, since testing on difference machines for so many times and timing them would have been a trouble but we were made aware of a script that would handle all that:
for X in vol1 vol2 vol3 ; do echo ==== $X ; for Y in 1 2 3 4 ; do echo ---- Trial $Y ; /usr/bin/time ./$X ; done ; done |& tee results.log
The script not only times the three files for 4 trials each but also stores the results in a file so that we can always reach back to it later.
AArchie (AArch64)
I decided to take an average of all the four trials so it is easier to analyse and reach a decision.
Algorithm 3 proved to be the fastest and took the lowest amount of time in scaling, while algorithm 1 took 46% more time and was slower and algorithm 2 took 68% more time than algorithm 3. So overall on this Aarch_64 machine, Fixed point algorithm proved to be the fastest. No significant change was seen in maximum resident memory usage of all the three approaches though.
Xerxes (x86_64)
Once again I averaged out all the 4 trials and came up with a single value to do the analysis of the benchmarks. On this x86_64 machine also, algorithm 3 proved to be the fastest taking just 1.39 seconds to scale the samples, while algorithm 1 took 26% more time and was slowest while algorithm 2 took 18% more time in comparison to algorithm 3. So once again Fixed point algorithm was seen as winner.
Matrix Server
Matrix came up with a bit varying results for Naive algorithm, which might be due to the load on the server and other activities being performed on it, but averaging them all out again helped me and made it easier to analyse and compare the results. Surprisingly this time, algorithm 2 took the lowest time scaling the samples averaging out to just 0.45, while algorithm 1 took 88% more time and was the slowest of the three, while algorithm 3 took 73% more time compared to algorithm 2. Surprisingly, Lookup based algorithm was the fastest on matrix server.
This was surprisingly one of the easiest labs we had done so far. It is clear that the effectiveness of an algorithm might vary based on the machine we work on.
Saturday, 7 March 2020
Week 8 - Lab 5 continued and Lecture
This week's lab we were supposed to continue with the lab 5 we had started the week before the study break and our group had made some improvement from the last code I showed you guys in the last blog, and we were now being able to print the iteration numbers besides the word loop but we were only able to print 0-9 and then we were getting other characters after 9 which proceeded 9 on the ASCII table. Now before I explain what we did to accomplish all of this and the final result I want to show you guys the code and explain how each of the things work. So here is the code:
.text
.globl _start
min = 0 /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 30 /* loop exits when the index hits this number (loop condition is i<max) */
_start:
mov x19, min
loop:
mov x0, 1 /* file descriptor: 1 is stdout */
adr x1, msg /* message location (memory address) */
mov x2, len /* message length (bytes) */
mov x22, #10 /* store 10 in x22 */
udiv x20, x19, x22 /* div x19 (raw increment) by x22 (10) store it in x20 */
msub x5, x20, x22, x19 /* store at x5 the remainder */
add x4, x5, 0x30 /* store (temp) the index (x19) plus the offset for the ascii numbers */
add x5, x20, 0x30 /* store (temp) the 10s (x20) plus the offset for the ascii numbers */
cmp x20, 0
b.eq back
strb w5, [x1, 5] /* write the 10s value*/
back:
strb w4, [x1, 6] /* store address x4 at x1 (beginning of msg) + 5 */
mov x8, 64 /* write is syscall #64 */
svc 0 /* invoke syscall */
add x19, x19, 1 /* add 1 to x19 */
cmp x19, max /* compare x19 to the max */
b.ne loop /* if not equal to max, branch back to loop */
mov x0, 0 /* status -> 0 */
mov x8, 93 /* exit is syscall #93 */
svc 0 /* invoke syscall */
.data
msg: .ascii "Loop \n"
len= . - msg
Now I will first start with how we got 0-9 iterations printed which we did by placing the iterator's value + x30 (i.e. number + hex 30 which would display the character as in ASCII table) inside a register and moving that register's value into address: msg + the position in msg where we want to display that number. i.e. if msg + 6 because LOOP (4 digits) + space (1 digit) + 1 (place to be stored at). But now the problem was when the iterator reached 10 it printed next character after 9 in ASCII table which was not 10 and since we didn't have double digits to display in ASCII we had to split the numbers to display double digit values. So we took the double digit number got its 10's place (divide by 10) added x30 and stored its value in a register and got 1's place by getting its remainder and stored it in the same way. We used udiv and msub to get the quotient and remainder of the division respectively, which can be found here:
https://wiki.cdot.senecacollege.ca/wiki/AArch64_Register_and_Instruction_Quick_Start
Now as we had both the digits we needed we just needed to place them in adjacent places after the word loop which we did in a similar way we had placed digits after LOOP earlier. The last challenge we had was that we had to print singles digit till 9 and double digits afterwards but we had all iterations in double digits so we made use of the subroutine to skip printing 10's digit if our ten's digit was equal to 0. And in this way we were able to accomplish what was required for this lab.
Overall I found this lab very interesting as it involved alot of Maths and also had some significant logic to think about. I feel I now had even stronger grip over the assembly language.
Lecture:
This week's lecture was strictly based on instructions about our upcoming project and its different stages. We were given a lot of instructions on how and what sort of project we should be looking to choose based on it being resource intensive and what we wanted to optimize out of CPU, RAM and energy. Then we had to build and benchmark our chosen package by the following steps:
1. Obtain code
2. Install dependencies
3. Build
4. Get or generate some data
5. Build command
6. Run + time it
and in this way we can compare our code's performance and improvements after we work on it.
With the end of semester approaching I am both excited and nervous at the same time to work on the project and am looking forward to explore some good optimization and other helpful techniques.
.text
.globl _start
min = 0 /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 30 /* loop exits when the index hits this number (loop condition is i<max) */
_start:
mov x19, min
loop:
mov x0, 1 /* file descriptor: 1 is stdout */
adr x1, msg /* message location (memory address) */
mov x2, len /* message length (bytes) */
mov x22, #10 /* store 10 in x22 */
udiv x20, x19, x22 /* div x19 (raw increment) by x22 (10) store it in x20 */
msub x5, x20, x22, x19 /* store at x5 the remainder */
add x4, x5, 0x30 /* store (temp) the index (x19) plus the offset for the ascii numbers */
add x5, x20, 0x30 /* store (temp) the 10s (x20) plus the offset for the ascii numbers */
cmp x20, 0
b.eq back
strb w5, [x1, 5] /* write the 10s value*/
back:
strb w4, [x1, 6] /* store address x4 at x1 (beginning of msg) + 5 */
mov x8, 64 /* write is syscall #64 */
svc 0 /* invoke syscall */
add x19, x19, 1 /* add 1 to x19 */
cmp x19, max /* compare x19 to the max */
b.ne loop /* if not equal to max, branch back to loop */
mov x0, 0 /* status -> 0 */
mov x8, 93 /* exit is syscall #93 */
svc 0 /* invoke syscall */
.data
msg: .ascii "Loop \n"
len= . - msg
Now I will first start with how we got 0-9 iterations printed which we did by placing the iterator's value + x30 (i.e. number + hex 30 which would display the character as in ASCII table) inside a register and moving that register's value into address: msg + the position in msg where we want to display that number. i.e. if msg + 6 because LOOP (4 digits) + space (1 digit) + 1 (place to be stored at). But now the problem was when the iterator reached 10 it printed next character after 9 in ASCII table which was not 10 and since we didn't have double digits to display in ASCII we had to split the numbers to display double digit values. So we took the double digit number got its 10's place (divide by 10) added x30 and stored its value in a register and got 1's place by getting its remainder and stored it in the same way. We used udiv and msub to get the quotient and remainder of the division respectively, which can be found here:
https://wiki.cdot.senecacollege.ca/wiki/AArch64_Register_and_Instruction_Quick_Start
Now as we had both the digits we needed we just needed to place them in adjacent places after the word loop which we did in a similar way we had placed digits after LOOP earlier. The last challenge we had was that we had to print singles digit till 9 and double digits afterwards but we had all iterations in double digits so we made use of the subroutine to skip printing 10's digit if our ten's digit was equal to 0. And in this way we were able to accomplish what was required for this lab.
Overall I found this lab very interesting as it involved alot of Maths and also had some significant logic to think about. I feel I now had even stronger grip over the assembly language.
Lecture:
This week's lecture was strictly based on instructions about our upcoming project and its different stages. We were given a lot of instructions on how and what sort of project we should be looking to choose based on it being resource intensive and what we wanted to optimize out of CPU, RAM and energy. Then we had to build and benchmark our chosen package by the following steps:
1. Obtain code
2. Install dependencies
3. Build
4. Get or generate some data
5. Build command
6. Run + time it
and in this way we can compare our code's performance and improvements after we work on it.
With the end of semester approaching I am both excited and nervous at the same time to work on the project and am looking forward to explore some good optimization and other helpful techniques.
Subscribe to:
Posts (Atom)