Saturday 18 April 2020

Project Stage - 3

This might be the final blog I am posting so I will try to make it as descriptive as possible. This blog will be about the stage 3 of the project. If you haven't gone through the other stages of the project, here are the links to them: Stage 1 ( https://asharma206-spo600.blogspot.com/2020/04/project-stage-1.html ), update to stage 1 ( https://asharma206-spo600.blogspot.com/2020/04/update-to-project-stage-1.html ), stage 2 ( https://asharma206-spo600.blogspot.com/2020/04/project-stage-2.html ).
After, building, benchmarking and profiling, last stage was all about creating a strategy to find possible optimization approach for the software.
To find an optimization opportunity we needed to find the hotspots within the hot (function taking maximum time) function in the software first. In other words we needed to figure out what instructions inside of the software took maximum time to execute, so that with small optimization we could see differences in the overall run. So as a recall from the previous stage, I made use of perf report, which displays the percentage running times within the software of different functions. We would go one step further and press enter on CompressFragment to analyse it one step further. From there we would use its annotate functionality to get a breakdown of different instructions within the function. Then, a screen is displayed with source code and different instructions that take place corresponding to the source code, along with the percentage of run times. With pressing shift + 'h', redirected me to the hottest instruction within the function with most percentage. The instruction, I landed on was taking 8.90 % of the total time within the function and was ldur x2,  [x28, #4]. My first thought on this was since this is only about 9%, I was doubtful if I could do any significant optimization that makes a difference. Now the next steps were to find out if any optimizations could be done on this.





                                                     SIMD and Vectorization
SIMD stands for single instruction multiple data. With all modern machines having SIMD registers which are 128 bits wide, we were introduced to use these as a possible approach. Though these registers could be used as 128, 64, 32 bits and so on, but it could be also divided into lanes with options of being 64 bits (2 lanes), 32 bits (4 lanes), 16 bits (8 lanes). And those 8 lanes in turn can be used to process upto 8 values at a time.
Vectorization is the code that processes larger chunks of data at a time rather than a single unit. So basically we use vectorization with SIMD to process multiple pieces of data at once and that speeds up the program.
For vectorization to kick in automatically we can set the optimization level to -03 or use option -ftree-vectorize while compiling and if the instructions are in the order for the compiler, the loops will be vectorized automatically. This is called autovectorization.
Video on SIMD and vectorization: https://youtu.be/EIPbufXhiQs
 But since this might not always be the case we can use inline assembly instructions or instrinsics in turn.
For inline assembly instruction we use the syntax:  __asm__("assembly template" : outputs : inputs : clobbers). This can be thought of a way assembly instructions can be injected into a C or C++ program and thus vectorization commands can be used in this way.
More on inline assembly:  https://wiki.cdot.senecacollege.ca/wiki/Inline_Assembly_Language

With instrincs the idea is the same, the only difference is that instrinsics can be thought of as functions, but they both can be used interchangably.
More on instrinsics: https://wiki.cdot.senecacollege.ca/wiki/Compiler_Intrinsics

Now I was ready to apply this knowledge to hotspots in my functions, in order to speed them up.
But soon I realised, my hottest instruction ldur was load register unscaled. So it was not working on an algorithm or processing data, but it had everything to do with the accessing of memory. Since load register is probably just waiting to access data from the memory, there was no possible opportunity of optimization on that instruction.

From there I decided to go one step further and decided to look for second hottest instruction, and if there was any opportunity there. The instruction was strh w2, [x21, x1] which stands for store register halfword, and since this was also about storing values into the register and had everything to do with the memory access, so that instruction was also not something I could do anything about.




Going further I went another step forward and looked for third hottest instruction in the pool and this instruction was surprisingly add x28, x20, x3. Now what was surprising about this was that add is usually a fast instruction and does not take much time. Digging in a bit deep, I remembered the professor had told us the instruction with time might actually not be the instruction taking time but waiting for some other instruction to finish so that it can execute. Then I looked at the instructions that were being preceded by add. So the instructions ( as expected) were strh w5, [x21, x3, lsl #1] (store register halfword) and ldrh w3, [x21, x1] (load register halfword). Which made a lot of sense and made it clear that the add instruction was waiting for those instructions to execute.  Hence, once again there was nothing much that could've been done about add as it was linked to accessing memory as well.




At this point it was clear that there were no SIMD and vectorization techniques that could be applied to any part of the hottest function of the software. So I decided to look upon some other techniques our professor told us about.

                                               Platform specific optimizations
The only other aspects to look at of this software were to limit the instructions to a specific architecture. Which means limiting the software to only perform instructions that were specific to the architectures that it was being executed on.
So using ifdefs and defines we can specify what options to set when the platform is arm, x86 etc, that in turn results in running just what is required and ignore everything else, resulting in improvements.

I began searching on that aspect and if I can do certain changes so that code to make it architecture specific. After going through the files, and studying each of them as well as I could, I found that the software already had those conditions specified.






So after a few discussions with the professor I was able to conclude that the software was as optimized as it could be. And as all the possible approaches were tried on the software, I was able to conclude that the benchmarks I took were the fastest times for that optimization options.
No wonder why snappy is known for its fast compression and focuses less on the level of compression. Overall this experience was a fun and exciting ride for me with the things, techniques and knowledge it has given me, though I was worried at the start since it was something very new for me. But I guess, now I have learnt how to approach a software and how to approach the optimization part of it, and I hope to work on some in my future.

This course on a whole was a huge learning curve for me at the beginning. I went through a lot of struggles, spent hours on thinking on the labs, and had an exciting time working on the project. With the help from the professor and my own struggles with the stuff were made to work on, I feel I am taking a good chunk of useful knowledge with me from this course.

Saturday 11 April 2020

Project Stage - 2

This blog is about the stage 2 of our project ( Project Stage 1 - https://asharma206-spo600.blogspot.com/2020/04/project-stage-1.html). So after we were done setting the benchmarks, next step was to do the profiling of the software. In other words we had to figure out what part of the software was taking the maximum execution time.

We were introduced to two ways on how to figure out what function in our software was taking the maximum execution time. The two concepts were of sampling and instrumentation.

1. Sampling: In this way of profiling, the program is interrupted thousands of times per second, thus collecting thousands of samples per second, and we can get information about what the program counter was doing during the interruption and thus we can figure out what called what and hence what was getting executed for the majority of the time.

2.  Instrumentation: Another way of profiling is instrumentation. In this, some code is added during building of the software, so that all the function calls can be recorded and thus we can figure out what part takes the maximum execution.

Two tools that can be used to do the profiling. gprof and perf. I will demonstrate how to work with each of these and then the results I got from each of them.

gprof:
In order to use this tool I had to setup the build to include a -pg (profile generation) option. Next step was to build the software. Next we would run the software as usual. We would see a new executable called gmon.out generated. It would be a binary file. Next step was the usage of gprof command:
filename: (name of the executable originally used to run the software (not gmon.out))
gprof filename | gprof2dot | dot -T x11

Now gprof filename would originally would give the status of different functions and the execution times and other related info. But in order to get a more readable and helpful info we would feed the result to gprof2dot which would further be fed to a program called dot which would give a textual form of graph for the execution tree but for a more graphical version we give -T x11 options which would launch a new window with the visual version of the tree.

result:


Since the original tree was much more complex, this is just a screenshot of the original tree containing the function that takes the maximum execution time.
As it is clear from this graph the function CompressFragment was taking more than 74% of the execution time and though the control was getting transferred to multiple functions, the time taken by them was not that significant. Just a note, this tool uses both instrumentation and sampling



perf:
Though I already knew by now what function was taking the maximum time, but since it was suggested by our professor to use both the tools, I decided to work with perf as well. Now, to use this tool we do not need to use any special options and so I just used the -O2 optimization level and rebuild the software.
Note: This tool only uses the sampling and not the instrumentation.
In order to use this tool, we would just issue the program:
perf record ( program execution)
This would record the samples and would create a file called perf.data.
Then we would issue a command: perf report
Since this tool does not use instrumentation we can't produce an execution tree, we would just see a status of different functions and their execution times.

result:



As expected it is also clear that the most execution time in this program is spent on function CompressFragment (72%). Though the execution time is a bit different from the one in gprof but it is understandable since both of them use different approaches.


So by this time I was clear it was CompressFragment function in this software that I had to work on and try to optimize and since it is taking more than 70%, even a small change in its execution time would mean a good amount of difference in its overall execution time. I would be spending the next few days figuring out what aspect in the function I can work on and I hope to make some good progress on this project overall.

Monday 6 April 2020

Update to Project Stage - 1

As I told in the last post (https://asharma206-spo600.blogspot.com/2020/04/project-stage-1.html), I did not have much luck figuring out how to set the optimization level for the build of my software. But with some help from the professor and reading some documentation about cmake, I finally managed to set the optimization level and decided to update my benchmarks.
I set the CXX_FLAGS to "-O1 -DNDEBUG" for all builds. and surprisingly the execution of the software which was taking about 6 mins and 44 secs, got reduced significantly to just 24%. So I increased the size of one of the files that the execution of the software was taking place on so it would take more than 4 mins.
All I did was appended the file multiple times to a new file using a loop:
{for (X=0; X<no. of times to append; X++) ; do cat file1.txt >> newfile.txt ; done}

Now before I did this, I set the level of optimization to -02 (recommended) and then I rebuild and decided to take the new benchmarks.

                                                                       Results

AArchie (AArch64)



Once again as in the last post, the values are in seconds and ranged from about 284.42 to 283.35, with a variation of about 0.37%.



Xerxes (x86_64)





The benchmarks for Xerxes ranged from about 244.36 to 243.84, with a variation of just about 0.21%.


Another small thing I would like to point out that I think is important here is, when I was working on setting optimizations and removing them and experimenting with them, a very key step is making sure the build files are deleted and then build again before you can run the updated version of the code. So that is make clean and then make. This is important when we want program to recompile with our updated options.
With that being said and my new benchmarks in, I'll be looking forward to do the profiling on this software and hopefully I can optimize it by some bit by the end of this course. I will keep you guys updated.

Tuesday 31 March 2020

Project Stage - 1

This blog will be all about the Stage 1 for the final project for this course. I was both excited and nervous to work on the project since I had not worked on anything like this before which brought excitement and also I was worried about what I should choose and what kind of challenges I would face doing so.
The instructions I had in mind that was given by the professor were that we need to pick something that is not too basic, in other words it had to be resource intensive. There were few options for the programming languages that had to chosen from, since we could not choose Java or web scripts, C and C++ were the only languages I was left with as I did not have experience working with any other languages.

1. Find an open source software package that includes a CPU-intensive function or method that is implemented in a language that compiles to machine code (such as C, C++, or Assembler). This could be hash function, compression, a codec, or any other compute-intensive task, and may be in an application or a library.

I started by working on a project called Soundtouch (https://github.com/rspeyer/soundtouch) which was used for changing pitch, tempo and playback rate. I started building the software by issuing commands and installing binaries bootstrap and configure, but failed in doing so as I soon figured out after getting help from the professor that the software was broken in the last few builds. I was also informed that no tests were included within the module, so I felt it would be really problematic to work on something that would need so much work on just the benchmarks.
There were a lot of of different fields of software I could choose from but I was a bit more interested in working on compression software. After studying and researching some of the different software packages, I finally chose a compression software from Google called Snappy (https://github.com/google/snappy).
It is a compression, decompression software which does not aim for maximum compression but for higher speeds and a reasonable amount of compression.


2. Build the software.

I took the following steps (commands) to retrieve and build the software:

1. git clone https://github.com/google/snappy.git
pulls the repository from the github into your own machine.

2. cd snappy && mkdir build
changes the directory to snappy pulled from the github and makes new directory called build within that directory.

3. cd build
change directory to build

4. cmake ../ && make
builds the library. Successful make means the library has been successfully build.

3. Next we had to find a test case that would help us benchmark the software with minimal variations.

Fortunately for me, I had got a unit case in the repository that helped me run tests on the package. The unit test named snappy_unittest was used to test the software and benchmark it.

4. Benchmark results

Benchmarks were supposed to be done with real run time being of a significant value so that the small changes done after optimization would reflect as significant as well and the variations in different iterations are averaged out. I used the same unit test to be run on both AArch64(AArchie) and x86_64(xerxes) for 5 times each and the results are reflected below:






The values are in seconds as I thought it was more precise to do so.

x86_64: The average value ranged between 64.21 and 64.66 seconds and had a variation of about 0.60% which is a fairly small value.

AArch64: The average value ranged between 401.51 and 403.67 seconds with a variation of about 0.53% which again was fairly small.

The difference in times for x86_64 and AArch 64 were almost around 84% and it was understood as x86_64 is much more powerful compared to AArch64

Compiler optimizations:
Since I am using cmake to build the software and I did not have much experience doing so and also the demos given by the professor were on configure and not cmake, I had a hard time figuring out how to do compiler optimizations on this. But with the help from professor and reading documentations, I am sure I will be able to do so and I will be giving an update about it on the next post.

Overall, I would say this is one of the another challenges in this course. Though it has brought a lot of unpredictability based on how new this thing is to me and also the optimizations we would be doing on this software, but I hope I'll be able to overcome this last challenge this course has to offer and I would much more comfortable working with different assemblers.

Thursday 26 March 2020

Week 9 - Lecture

This week's lecture started with all of us being a bit sad because the whole world was facing and fighting against a pandemic known as CoVid 19, and it was announced that there won't be any in person lectures to stop the disease's spread so it was probably the last lecture we were taking in person.
The lecture was based on the compiler optimizations i.e. the small tricks compiler does to make our program efficient. In other words compiler does some changes to our code that makes it faster and easier to run.

                                                        Optimizations:
Strength Reduction:
In this optimization some expensive operations are replaced by cheaper/inexpensive operations. An example of this would be as: multiplication is considered to be a comparatively expensive operation, so the code is rewritten in a way so that multiplication is replaced by addition or something similar.


Hoisting:
In some situation loops might contain repetitive calculations or in other words there might be some calculations that are not changed within the loop and such calculations are done outside of the loop and nothing is repeated within the loop thus making it more efficient. Example: basic calculations, non changing functions, loop counters with complex calculations. These all can be done outside of the loop so that nothing is repeated.


Loop unswitching:
There are situations when checks irrelevant of the loops are placed inside the loops that way checks are done in every cycle of the loop. But such checks are done outside of the loops and the loops are placed inside of the checks thus getting rid of checks in every loop. This is called loop unswitching.


Loop splitting:
Sometimes two or more parts of the loop are supposed to be handled differently on the basis of the checks done, but instead of doing similar checks in every loop, the loop is split into different parts based on the checks thus getting rid of that extra work in doing checks for the whole loop.


Loop unrolling:
In this method a loop is unrolled to contain multiple copies of the same code, thus decreasing the number of times the calculations are done. Example: A case when we know there are going to be even number of iterations of the loop, instead of checking every time, the increment is done twice, the check is done every second time and thus calculation is also halved throughout the loop.


Inlining:
In a case of inlining, the place where the function is being used is replaced directly by the code inside the function and eliminates the calling of the function.


Common subexpression elimination:
In cases where similar calculations are done more than 1 time the calculated result can be stored inside of a variable and that variable can then be used in place of those calculation and thus calculation is done only once.


Jump Threading:
In cases where we know all the possibilities, the repetitive if conditions can be changed to have an else statement in the last possibility thus eliminating evaluation for the last condition.


Short circuit evaluation:
In conditions containing OR or AND logical statements, conditions can be rewritten to evaluate only 1 condition first in case of OR and then continue to second only if 1st is false and the opposite for AND.


Dead Store elimination:
There are situation when a value is written to a variable more than once but the old values assigned are never used in the program and thus are eliminated.


So these were all the optimizations that compiler does on our behalf and makes the program efficient. More on these optimization can be found here:
https://wiki.cdot.senecacollege.ca/wiki/Compiler_Optimizations
I feel going further in the course might be a bit difficult considering that there would be no in person classes and also with the final project approaching. But I am confident that I would be able to achieve the most from this course and fight out of these difficult situations.

Wednesday 18 March 2020

Week 9 - Lab 6

This week started with another lab that we had to complete and it was based on algorithm selection.
Digital sound is represented as 16 bit integer signal samples with two streams for left and right stereos. To change the volume of any sound each signal sample can be scaled (multiplied) by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).

In order to do the scaling we have three options to choose from:

1. Naive algorithm: In this approach each sample is directly multiplied by the volume factor.

2. Lookup based algorithm: This approach pre calculates all the possible 65536 (16 bit values) results and looks up in that pre calculated table instead of multiplying.

3. Fixed point algorithm: This approach uses fixed - point math and bit shifting to perform the multiplication instead of directly multiplying without using floating point math.

So basically we had these three options coded in three different files and a header file which had the number of samples to scale. Each file did some pre processing for the required algorithm, scaled the samples and then summed up the results. We had to make sure each algorithm took more than 20 seconds in order to find some significant results, so we had to adjust the number of samples accordingly. Then we had to find a way to time only the time taken to scale the samples and exclude everything else.
The way I chose to accomplish this was by timing the whole program and then commenting out the scaling portion of it and then timing it again and the difference between the two gave me the time taken for scaling.
Once we had that time we had to compare each of the three algorithms using the benchmarks, and also do the same on different machines/architectures.

So I chose to test and compare them on two of my class servers Aarchie (AArch64) and Xerxes (x86_64) and also on matrix server.

                                                         Results:
Fortunately, since testing on difference machines for so many times and timing them would have been a trouble but we were made aware of a script that would handle all that:
for X in vol1 vol2 vol3 ; do echo ==== $X ; for Y in 1 2 3 4 ; do echo ---- Trial $Y ; /usr/bin/time ./$X ; done ; done |& tee results.log

The script not only times the three files for 4 trials each but also stores the results in a file so that we can always reach back to it later.


AArchie (AArch64)


I decided to take an average of all the four trials so it is easier to analyse and reach a decision.
Algorithm 3 proved to be the fastest and took the lowest amount of time in scaling, while algorithm 1 took 46% more time and was slower and algorithm 2 took 68% more time than algorithm 3. So overall on this Aarch_64 machine, Fixed point algorithm proved to be the fastest. No significant change was seen in maximum resident memory usage of all the three approaches though.


Xerxes (x86_64)




Once again I averaged out all the 4 trials and came up with a single value to do the analysis of the benchmarks. On this x86_64 machine also, algorithm 3 proved to be the fastest taking just 1.39 seconds to scale the samples, while algorithm 1 took 26% more time and was slowest while algorithm 2 took 18% more time in comparison to algorithm 3. So once again Fixed point algorithm was seen as winner.


Matrix Server



Matrix came up with a bit varying results for Naive algorithm, which might be due to the load on the server and other activities being performed on it, but averaging them all out again helped me and made it easier to analyse and compare the results. Surprisingly this time, algorithm 2 took the lowest time scaling the samples averaging out to just 0.45, while algorithm 1 took 88% more time and was the slowest of the three, while algorithm 3 took 73% more time compared to algorithm 2. Surprisingly, Lookup based algorithm was the fastest on matrix server.

This was surprisingly one of the easiest labs we had done so far. It is clear that the effectiveness of an algorithm might vary based on the machine we work on.



Saturday 7 March 2020

Week 8 - Lab 5 continued and Lecture

This week's lab we were supposed to continue with the lab 5 we had started the week before the study break and our group had made some improvement from the last code I showed you guys in the last blog, and we were now being able to print the iteration numbers besides the word loop but we were only able to print 0-9 and then we were getting other characters after 9 which proceeded 9 on the ASCII table. Now before I explain what we did to accomplish all of this and the final result I want to show you guys the code and explain how each of the things work. So here is the code:


.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 30                         /* loop exits when the index hits this number (loop condition is i<max) */

_start:

    mov     x19, min

loop:

mov     x0, 1        /* file descriptor: 1 is stdout */
adr     x1, msg    /* message location (memory address) */
mov     x2, len    /* message length (bytes) */
mov x22, #10 /* store 10 in x22 */
udiv x20, x19, x22 /* div x19 (raw increment) by x22 (10) store it in x20 */
msub x5, x20, x22, x19   /* store at x5 the remainder */
add x4, x5, 0x30  /* store (temp) the index (x19) plus the offset for the ascii numbers */
add x5, x20, 0x30 /* store (temp) the 10s (x20) plus the offset for the ascii numbers */

cmp x20, 0
b.eq back
strb w5, [x1, 5] /* write the 10s value*/

back:
strb w4, [x1, 6] /* store address x4 at x1 (beginning of msg) + 5 */

mov     x8, 64      /* write is syscall #64 */
svc     0           /* invoke syscall */



add     x19, x19, 1 /* add 1 to x19 */
    cmp     x19, max /* compare x19 to the max */
b.ne    loop /* if not equal to max, branch back to loop */


mov     x0, 0            /* status -> 0 */
    mov     x8, 93           /* exit is syscall #93 */
    svc     0                /* invoke syscall */

.data
msg:  .ascii      "Loop   \n"
len=  . - msg


Now I will first start with how we got 0-9 iterations printed which we did by placing the iterator's value + x30 (i.e. number + hex 30 which would display the character as in ASCII table) inside a register and moving that register's value into address: msg + the position in msg where we want to display that number. i.e. if msg + 6 because LOOP (4 digits) + space (1 digit) + 1 (place to be stored at). But now the problem was when the iterator reached 10 it printed next character after 9 in ASCII table which was not 10 and since we didn't have double digits to display in ASCII we had to split the numbers to display double digit values. So we took the double digit number got its 10's place (divide by 10) added x30 and stored its value in a register and got 1's place by getting its remainder and stored it in the same way. We used udiv and msub to get the quotient and remainder of the division respectively, which can be found here:
https://wiki.cdot.senecacollege.ca/wiki/AArch64_Register_and_Instruction_Quick_Start

Now as we had both the digits we needed we just needed to place them in adjacent places after the word loop which we did in a similar way we had placed digits after LOOP earlier. The last challenge we had was that we had to print singles digit till 9 and double digits afterwards but we had all iterations in double digits so we made use of the subroutine to skip printing 10's digit if our ten's digit was equal to 0. And in this way we were able to accomplish what was required for this lab.

Overall I found this lab very interesting as it involved alot of Maths and also had some significant logic to think about. I feel I now had even stronger grip over the assembly language.

Lecture:
This week's lecture was strictly based on instructions about our upcoming project and its different stages. We were given a lot of instructions on how and what sort of project we should be looking to choose based on it being resource intensive and what we wanted to optimize out of CPU, RAM and energy. Then we had to build and benchmark our chosen package by the following steps:
1. Obtain code
2. Install dependencies
3. Build
4. Get or generate some data
5. Build command
6. Run + time it
and in this way we can compare our code's performance and improvements after we work on it.

With the end of semester approaching I am both excited and nervous at the same time to work on the project and am looking forward to explore some good optimization and other helpful techniques.

Wednesday 26 February 2020

Week - 7 Lecture (Lab 5)

This week there was no lab due to Family day's holiday so the Lab 5 was supposed to be worked on in the lecture period. This were the required steps for Lab 5:

Here is a basic loop in AArch64 assembler - this loops from 0 to 9, using r19 as the index (loop control) counter:
.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 30                         /* loop exits when the index hits this number (loop condition is i<max) */

_start:

    mov     x19, min

loop:

    /* ... body of the loop ... do something useful here ... */

    add     x19, x19, 1
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -> 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */
This code doesn't actually do anything while looping, because the body of the loop is empty. On an AArch64 machine, combine this code with code from the "Hello World" assembley-language example, so that it prints a word each time it loops:
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Then modify the message so that it includes the loop index values, showing each digit from 0 to 9 like this:
Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9

Now we had a list of servers we could choose from so we decided to go with aarchie.
Then after taking reference from both the loop example given and the hello world program we were able to print the iterations of loop without numbers. Here is the code:

.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 30                         /* loop exits when the index hits this number (loop condition is i<max) */

_start:

    mov     x19, min

loop:

        mov     x0, 1           /* file descriptor: 1 is stdout */
adr     x1, msg    /* message location (memory address) */
mov     x2, len    /* message length (bytes) */

mov     x8, 64      /* write is syscall #64 */
svc     0          /* invoke syscall */


    add     x19, x19, 1
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -> 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */

.data
msg: .ascii      "loop \n"
len= . - msg

the text is stored in the msg label and .ascii acts as a dcb. then _start. declares the start of the program and the loop label uses the message and its length to do a syscall to print it to the screen and one gets added to the register used for iteration and this keeps on going until the loop is finished i.e. comparison between max and register value. Well we were trying different ways on how to get the number printed besides the loop but could only managed to go this far until the lecture was over and we were told this lab would be continued until the lab period after study week so we had a lot of time to think over this.
Though the syntax looks a bit scary for now for this architecture it is just a matter of time before we get used to these instructions and then everything should be fine.

Saturday 22 February 2020

Week 6 - Lab 4 continued and Lecture

This week's lab also we were supposed to work on Lab 4 and I will talk about the second task that were supposed to work on, in this blog. Now in case you guys did not check out last post this was our task:

Option 2: Data Input Form[edit]

  1. Create a subroutine which enables the user to enter a string of up to 40 characters. Decide how the data will be returned from the subroutine. The user should be able to use the left/right cursor keys and backspace to edit the text, and the cursor location should be visible to the user.
  2. Write a program which uses this subroutine to collect the user's name, address, city, province, and postal code, and then display this information formatted as a shipping address (for example, as shown below).


Your order will be delivered to: J. Doe 1750 Finch Ave E Toronto ON M2J 2X5

Now the approach used was to store each of the inputs was that each iteration would give a message (stored as constant byte) followed by the user input (stored as empty dcb's of required lengths). Then each of the empty constant bytes will be replaced by user input accordingly and displayed in the last iteration with the help of labels. Now there was a very important concept about logical operations we were taught in week 5 which was XOR or EOR which stands for exclusive OR. Anything EOR with 0 would keep the original value and EOR with 1 would give a flipped value. So basically we used it to flip the bits.
Here is code for lab 4 task 2:

define        SCINIT        $ff81 ; initialize/clear screen
define        CHRIN        $ffcf ; input character from keyboard
define        CHROUT        $ffd2 ; output character to screen
define        SCREEN        $ffed ; get screen size
define        PLOT        $fff0 ; get/set cursor coordinates
define    POINTER        $10    ; ptr: start of row
define    POINTER_H    $11
define    SPOINTER    $12    ; ptr: start of screen
define    SPOINTER_H    $13
define MAX_INPUT    $14


    lda #$f0
    sta SPOINTER_H
    lda #40
    sta MAX_INPUT
   
          jsr SCINIT

; Name
          ldx #<label1    ; loading X with a pointer to the string to be printed
          ldy #>label1    ; loading Y with a pointer to the string to be printed
      jsr prtstr    ;  print string

          ldx #<name    ; loading X with a pointer to the string to be read and store
          ldy #>name
          lda #len1 ; length of string

          jsr getstr    ; go get user input

          jsr SCINIT

;Address

          ldx #<label2
          ldy #>label2
          jsr prtstr

          ldx #<address    ; loading X with a pointer to the string to be read and stored
          ldy #>address    ; loading X with a pointer to the string to be read and stored
          lda #len2 ; length of string

          jsr getstr    ;  get user input

          jsr SCINIT

;City

          ldx #<label3
          ldy #>label3
          jsr prtstr

          ldx #<city    ; loading X with a pointer to the string to be read and stored
          ldy #>city    ; loading Y with a pointer to the string to be read and stored
          lda #len3 ; length of string

          jsr getstr    ; go get user input

          jsr SCINIT

; Printing final message and captured details

          ldx #<label4    ; load XY with a pointer to the string to be printed
          ldy #>label4
      jsr prtstr    ; go print string

          ldx #<name    ; load XY with a pointer to the string to be printed
          ldy #>name
      jsr prtstr    ; go print string

      jsr prtcrlf

          ldx #<address    ; load XY with a pointer to the string to be printed
          ldy #>address
      jsr prtstr    ; go print string

      jsr prtcrlf

          ldx #<city    ; load XY with a pointer to the string to be printed
          ldy #>city
      jsr prtstr    ; go print string

      jsr prtcrlf
         
          brk

prtcrlf:               ;this subroutine moves the code to a newline
      ldx #<crlf    ; there is only one parameter 
      ldy #>crlf    ;prints a newline carriage return character
      jsr prtstr


prtstr:                                  ; this subroutine prints a NULL terminated string
          stx POINTER           ; there is only one parameter
          sty POINTER_H      ;    registers XY point to the string to be printed

          ldy #$00         

prtchar:    
          lda (POINTER),y
          beq prtstrdone

          jsr CHROUT

          iny
          bne prtchar

prtstrdone:
          rts


getstr:                       ;this subroutine takes care of the typing and checking of keys
 stx POINTER           ; there is 2 parameters the pointer and length of string
         sty POINTER_H
    sta SPOINTER
    ldy    #$00
   
check2:
idle:     inx
          cpx #$10
          bne check
          lda (SPOINTER),y
          eor #$80
          sta (SPOINTER),y
check:    lda $ff
          beq idle

          ldx #$00
          stx $ff

          cmp #$08
          bne print

          lda (SPOINTER),y
          and #$7f
          sta (SPOINTER),y

          dey
          jmp next

print: 
    cpy MAX_INPUT
    beq maxed

     cmp #$0D    ; check for Enter key
     beq done

     cmp #$83    ; check for Left arrow key  
     beq go_left

     cmp #$81    ; check for Right arrow key
     beq go_right

    sta (SPOINTER),y
    sta (POINTER),y
          iny

maxed: 
          jmp check2

go_left:
;brk
    dey
    cpy #$ff
    bne not_at_left

    iny

not_at_left:
    jmp idle

go_right:
;brk
    lda (POINTER),y
    cmp #$00
;brk
    beq not_at_right

    iny

not_at_right:
    jmp idle

done: 
 rts

label1:
dcb "p","l","e","a","s","e",32,"e","n","t","e","r",32,"y","o","u","r",32,"n","a","m","e",":",00
define len1 24

label2:
dcb "p","l","e","a","s","e",32,"e","n","t","e","r",32,"y","o","u","r",32,"a","d","d","r","e","s","s",":",00
define len2 27

label3:
dcb "p","l","e","a","s","e",32,"e","n","t","e","r",32,"y","o","u","r",32,"c","i","t","y",32,"a","n","d",32,"p","o","s","t","a","l",32,"c","o","d","e",":",00
define len3 40

label4:
dcb "Y","o","u","r",32,"o","r","d","e","r",32,"w","i","l","l",32,"b","e",32,"d","e","l","i","v","e","r","e","d",32,"t","o",":",$0d,$0d,00

name:
dcb 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,$0d

address:
dcb 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,$0d

city:
dcb 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,$0d

crlf:
dcb $0d,0

So by incorporating previous task's knowledge about working with character display and some additional information we were able to finish this lab.

Lecture:
This week's lecture started with a quiz and it was all about differences between the architecture we already finished learning and the architectures we were in process of learning i.e. 6502, AArch 64 and X86 64. And after the quiz the prof. gave us a summary about the 3.
6502: 8 bit processor, 16 bit address, 3 registers.
X86 64: 64 bit processor, 64 bit address, 16 registers.
AArch 64: 64 bit processor, 64 bit address, 31 registers.

Then we got an insight on how both architectures started from different approaches i.e. RISC (have tiny instructions but small and fast), and CISC (complex instructions) and ended up curving in a way to come on similar grounds. 

Then we were shown a sample hello world program that had some weird looking syntax at first but then made sense once we got to understand it.
Overall we knew we were taking a step forward into the this assembly language and had to be ready for it.

Wednesday 12 February 2020

Week 5 - Lab 4 and Lecture

This week started with us getting another lab that we had to complete as a group. So we had about 4 tasks to choose from. out of which we had to complete 2. Here are the options we had:

Option 1: Adding Calculator[edit]

  1. Create a subroutine which enables the user to enter two numbers of up to two digits. Indicate where the cursor is, and allow the user to use the digit keys (0-9), backspace, and enter keys. Return the user's input value in the accumulator (A) register.
  2. Using this subroutine, write a program which add the two numbers (each of which is in the range 0-99) and print the result.
  • Optional challenge: extend to more than 2 digits and add a running total capability with a clear key.

Option 2: Data Input Form[edit]

  1. Create a subroutine which enables the user to enter a string of up to 40 characters. Decide how the data will be returned from the subroutine. The user should be able to use the left/right cursor keys and backspace to edit the text, and the cursor location should be visible to the user.
  2. Write a program which uses this subroutine to collect the user's name, address, city, province, and postal code, and then display this information formatted as a shipping address (for example, as shown below).
Your order will be delivered to:

  J. Doe
  1750 Finch Ave E
  Toronto ON  M2J 2X5
  • Optional challenge: validate the province/territory and the postal code format.

Option 3: Hexdump[edit]

  1. Create a subroutine which enables the user to enter a 4-digit hexadecimal value. Allow the use of the backspace key for editing, and show the user the cursor location. Decide how the data will be returned by this subroutine.
  2. Using this subroutine, ask the user for a starting address and a length (in bytes), and then output a hexdump of the specified memory locations, formatted to show address and data information in some layout. If the output is longer than the screen, pause for user input (e.g., "Press ENTER to continue" or something similar) every 20 lines.
  • Optional challenge: add a display of ASCII data on the right-hand side, similar to the output of the linux command od -A x -t xz -v filename

Option 4: Screen Colour Selector[edit]

  1. Create a subroutine which displays on the character display a list of colours available on the bitmapped display, one colour per line, with one of the colours names highlighted in reverse video (white on black). The user may use the up/down arrow keys to change the highlighted row. Return the user's selection as a number in the accumulator (A) register when they press Enter.
  2. Using this subroutine, get a colour from the user, then fill the bitmap display with this colour, and allow the user to select a different colour.
  • Optional challenge: update the colour on the bitmapped display as the user scrolls through the colour-selector list.

So out of these options our group decided to go for options 2 and 4 which were data input form and screen colour selector. We decided to work on the screen color selector first. 

Now as we were working on the lab the professor gave us 2 commands that would be useful for the lab which were pha and pla which stand for PUSH Accumulator and PULL accumulator respectively. Pha simply copies whatever is in the accumulator and stores it on the stack while Pha doing the opposite takes 1 byte from the stack and stores it in the accumulator. 

Now we started by displaying the names of the colors by declaring them as constant bytes and then printing them on character display but soon we figured out the old way of displaying things by starting displaying from the character display's start address would not prove very useful and we would have to use ROM routines in order to work on this lab. Now ROM routines are pre defined subroutines in the memory that we can access and use for basic functionalities. Here is a list of them:
; ROM routines
define  SCINIT  $ff81 ; initialize/clear screen
define  CHRIN  $ffcf ; input character from keyboard
define  CHROUT  $ffd2 ; output character to screen
define  SCREEN  $ffed ; get screen size
define  PLOT  $fff0 ; get/set cursor coordinates


Well, after getting to display the names, the next task was to get the cursor moving up and down that list and displaying the color based on the position of cursor. While we had limited time in lab we could only manage to go so far and decided to complete the lab afterwards.

Lecture:
Well this week's lecture introduced us to 2 architectures which were fairly new to us though we use them a lot, x86 64 and AArch 64. We were given a background on both those architectures and how those 2 emerged from being small in memory chips to being so huge memory and powerful now.
Then we got to know the registers on both the architectures. While AArch 64 had 0-30 registers and a register 31 that could be used as zero register or stack pointer based on its functionality. More on these registers and some related instruction set can be found on: 

While on the other hand X86 64 had 16 registers, more of which could be found on:
It was interesting to know how both the architectures had registers that could be used for 64 bit or 32 bit wide access. 

Now going back to the lab, we took from the code from the previous lab on how to scroll using input from the keyboard and we went further to store the color, address cur_pos held inside and we manipulated the cursor to store different color's code according to the input from the keyboard, as they ranged from 0-F. So after managing to change the color code based on keyboard up or down key and making it print the stored code using the filling the bitmapped display's code we were able to come up with the following code:

; ROM routines
define SCINIT $ff81 ; initialize/clear screen
define CHRIN $ffcf ; input character from keyboard
define CHROUT $ffd2 ; output character to screen
define SCREEN $ffed ; get screen size
define PLOT $fff0 ; get/set cursor coordinates

define      cur_pos     $10

          jsr SCINIT
          ldy #$00
          
          ldx #$00
          stx cur_pos

char:     lda color_list,y
          beq main
          
          jsr CHROUT
          iny
          bne char
          
main:
lda $ff          ; get a keystroke
    beq main             ; fix freezing on max speed
    
  ldx #$00          ; clear out the key buffer
  stx $ff

cmp #$82          ;  arrow down for increment (going down)
beq incr

cmp #$80          ; arrow up for decrement (going up)
beq decr

jmp main           ; if no key pressed loop forever
    
clear_cursor:
    lda #$2d             ; character -
    jsr print_cursor
    rts
    
incr:
    jsr clear_cursor
    
    inc cur_pos
    lda cur_pos
    cmp #$10
    beq incr_reset
    
    bne incr_decr
    
incr_reset:
    lda #$00
    sta cur_pos
    jmp incr_decr
        
decr:
    jsr clear_cursor
    
    dec cur_pos
    lda cur_pos
    cmp #$FF
    beq decr_reset
    
    bne incr_decr
    
decr_reset:
    lda #$0F
    sta cur_pos
    jmp incr_decr

incr_decr:
    lda #$93             ; character black box
    jsr print_cursor
jmp paint
    
print_cursor:
    clc                  ; select the SET function for PLOT
    ldx #$00
    ldy cur_pos          ; increment this one
    pha                  ; save the value of A into the stack
    jsr PLOT             ; set the cursor position
    pla                  ; restore the value of A from the stack
    jsr CHROUT           ; print the char
    rts
    
paint:
      ldx #$06           ; max value for $11
      ldy #$00           ; index
      
      lda #$00           ; set pointer at $10 to $0200
      sta $11
      lda #$02
      sta $12
      
      lda cur_pos        ; set the color
      jmp paint_loop
      
paint_loop: 
      sta ($11),y        ; store colour
      iny                ; increment index
      bne paint_loop     ; branch until page done
      
      inc $12            ; increment high byte of pointer
      cpx $12            ; compare with max value
      bne paint_loop     ; continue if not done 
      
      jmp main           ; done - return to main

color_list:
    dcb $93,"B","l","a","c","k",13
    dcb "-","W","h","i","t","e",13
    dcb "-","R","e","d",13
    dcb "-","C","y","a","n",13
    dcb "-","P","u","r","p","l","e",13
    dcb "-","G","r","e","e","n",13
    dcb "-","B","l","u","e",13
    dcb "-","Y","e","l","l","o","w",13
    dcb "-","O","r","a","n","g","e",13
    dcb "-","B","r","o","w","n",13
    dcb "-","L","i","g","h","t",32,"r","e","d",13
    dcb "-","D","a","r","k",32,"g","r","e","y",13
    dcb "-","G","r","e","y",13
    dcb "-","L","i","g","h","t",32,"g","r","e","e","n",13
    dcb "-","L","i","g","h","t",32,"b","l","u","e",13
    dcb "-","L","i","g","h","t",32,"g","r","e","y",00

As you can see we used ROM routines CHROUT and PLOT to print the characters to the display and setting the position of the cursor in the memory respectively. 
Though we managed to pull of this first task we still had one more to go.

Friday 31 January 2020

Week 4 - Lecture

This week's lecture started with how we can display text on the 6502 emulator. I found it fairly easy.
Here is a sample code:

define SCREEN $f000
           ldy #$00
 
 char:     lda text,y
           beq done
           sta SCREEN,y
           iny
           bne char
 
 done:     brk
 
 text:
 dcb "S","P","0","6","0","0",32,"i","s",32,"a","w","e","s","o","m","e",".",00

This was accomplished by having a text as dcb and looping it to load and store it in accumulator.
Spaces were put using number 32 and all characters were put in double quotes.

Next we talked about compiling the code. We compiled a simple C program inside linux and saw its executable was larger than its original file. We also learnt about dynamic and static linking and how each of them have their own pros and cons. While static being larger in size and dynamic was relatively smaller. Then we learnt about different compatibility modes it had to offer and how it could range from O0 (almost no optimizations) to O3 (most optimized) and we were also shown how many flags for optimized content was set to active and how we could manipulate it. Og was used to optimize for debugging. We almost looked upon certain commands like make which could be used to make partial builds inside of whole when small changed are made to source files.

This week's lecture ended with us touching upon something new and we were also told text was the last thing we learnt on 6502 and we wont be working with it anymore.

Thursday 30 January 2020

Week 4 - Lab 3 continued

This week's lab started with us giving a quiz. We were all a bit surprised as we did not expect it but it turned out to be just fine in the end since it was just matching and there were no short or long questions to answer. Rest of the lab was given to us to figure out more about lab 3 and work on it.
We still did not have much to end the lab with so we decided and talked to professor a bit more about it and understanding it in more detail.
After much discussions finally we were able to incorporate everything we know and combined declare constant byte's knowledge to get rid of multiplication, used delay loop that would keep counting as necessary to give us necessary delay to give an effect of moving, incremented rows and columns accordingly to direct the graphic and finally filled the display with black color where the graphic is no longer there. Here is the full code for this lab:

; zero-page variable locations
 define ROW        $20    ; current row
 define    COL        $21    ; current column
 define    POINTER        $10    ; ptr: start of row - upper left corner of object
 define    POINTER_H    $11
 define    WPOINTER    $12    ; ptr: start of row - write pointer
 define    WPOINTER_H    $13
 define BOUNCE_X    $14
 define BOUNCE_Y    $15
 define ROW_COUNT    $16
 define TIMER        $17    ; ptr: delay timer
 define TIMER_H        $18 
 define TIMER_CONTROL    $19 ; contains the base of the high bit of the timer

; constants
define WIDTH         7     ; width of graphic
define HEIGHT         7     ; height of graphic
define GREEN        $05    ; GREEN
define BLUE        $03    ; BLUE
define BLACK        $00    ; BLACK
define RIGHT        $19    ;  defining where the graphic must change direction accounting for its size
define    LEFT        $00    ;
define TOP        $00    ;
define BOTTOM        $19    ;

        lda #$0         ; initialize delay loop counter
        sta TIMER   
        sta TIMER_H
        lda #$40      ; setting the initial speed of the timer
        sta TIMER_CONTROL
        lda #$01        ; assume that the object is moving across the screen
        sta BOUNCE_Y ; contain the rate motion of the box along the Y axis
        lda #$01
        sta BOUNCE_X ; contain the rate motion of the box along the X axis
        lda #$03     ;setting were object starts on screen
        sta ROW
        lda #$06
        sta COL

loop:
    clc
    ldy ROW            ; load POINTER with start-of-row
    lda table_low,y
    adc COL
    sta WPOINTER
    lda table_high,y
    sta WPOINTER_H
    lda #$00            ; number of rows we've drawn
    sta ROW_COUNT         
    ldx #$00            ; index for data
    ldy #$00            ; index for screen column

draw:   ; draws graphic
    lda data,x   ; set up where graphic starts on sreen
    sta (WPOINTER),y
    inx
    iny
    cpy #WIDTH
    bne draw
    inc ROW_COUNT            ; increment row counter
    lda #HEIGHT            ; end of graphic by height
    cmp ROW_COUNT
    beq done            ; exit
    lda WPOINTER           ; load pointer
    clc
    adc #$20            ; add 32 to drop one row
    sta WPOINTER
    lda WPOINTER_H        ; carry to high byte if needed
    adc #$00
    sta WPOINTER_H
    ldy #$00
    beq draw

done:
    lda TIMER_CONTROL ; initialize timer with current speed
    sta TIMER_H

delay:  dec TIMER ; wait
        bne delay
        dec TIMER_H
        bne delay

        clc
        ldy ROW            ; load POINTER with start-of-row
        lda table_low,y
        adc COL
        sta WPOINTER
        lda table_high,y
        sta WPOINTER_H
        lda #$00            ; number of rows we've drawn
        sta ROW_COUNT         
        ldx #$00            ; index for data
        ldy #$00            ; index for screen column

wdraw:   ; remove graphic were it has been draw
    lda #BLACK
    sta (WPOINTER),y
    inx
    iny
    cpy #WIDTH
    bne wdraw

    inc ROW_COUNT            ; increment row counter
    lda #HEIGHT         
    cmp ROW_COUNT
    beq wdone            ; exit

    lda WPOINTER           ; load pointer
    clc
    adc #$20            ; add 32 to drop one row
    sta WPOINTER
    lda WPOINTER_H        ; carry to high byte if needed
    adc #$00
    sta WPOINTER_H
    ldy #$00
    beq wdraw

wdone:  ; check graphic if hitting a sides and if so change direction
    clc
    lda COL
    adc BOUNCE_X
    sta COL
    cmp #RIGHT
    bne no_bounce

    pha
    lda #$ff   ; using the fact that the register overflows adding ff is the same as subtracting 1
    sta BOUNCE_X
    pla
 
no_bounce:
    cmp #LEFT
    bne no_bounce1

    lda #$01
    sta BOUNCE_X

no_bounce1:
    clc
    lda ROW 
    adc BOUNCE_Y
    sta ROW
    cmp #BOTTOM
    bne no_bounce2

    pha
    lda #$ff
    sta BOUNCE_Y
    pla
 
no_bounce2:
    cmp #TOP
    bne no_bounce3

    lda #$01
    sta BOUNCE_Y

no_bounce3:

getkey: 
    lda $ff        ; get a keystroke
    ldx #$00    ; clear out the key buffer
    stx $ff
    cmp #$2d    ; is it a minus
    bne check

    clc   ; slow down timer
    lda TIMER_CONTROL
    adc #$01 ; changing speed of timer by adding to the timer
    bcs ignore

    sta TIMER_CONTROL

ignore:
    jmp move_it

check:
     cmp #$2b    ; is it a plus
     bne check0

    sec  ; speed up timer
    lda TIMER_CONTROL
    sbc #$01 ; changing speed of timer by subtracting from the timer
    beq ignore

    sta TIMER_CONTROL
    jmp move_it

check0:
    cmp #$80    ; check key == up
     bne check1

    sec
    lda ROW        ; ... if yes, decrement ROW
    sbc #$01
    beq ignore

    sta ROW
    jmp move_it

 check1:
    cmp #$81    ; check key == right
     bne check2

    lda COL        ; ... if yes, increment COL
    adc #$01
    cmp #RIGHT
    beq ignore

    sta COL
    jmp move_it

 check2:
    cmp #$82    ; check if key == down
    bne check3

    lda ROW        ; ... if yes, increment ROW
    adc #$01
    cmp #BOTTOM
    beq ignore

    sta ROW
    jmp move_it

 check3:
    cmp #$83    ; check if key == left
    bne move_it

    sec
    lda COL        ; ... if yes, decrement COL
    sbc #$01
    beq ignore
 
    sta COL
    bcc move_it

move_it: 
    jmp loop

data:  ; for the graphic
dcb 00,03,03,03,03,03,00
dcb 03,02,03,03,03,02,03
dcb 03,03,03,03,03,03,03
dcb 03,03,03,03,03,03,03
dcb 03,02,03,03,03,02,03
dcb 03,03,02,02,02,03,03
dcb 00,03,03,03,03,03,00

 ; these two tables contain the high and low bytes
 ; of the addresses of the start of each row

 table_high:
 dcb $02,$02,$02,$02,$02,$02,$02,$02
 dcb $03,$03,$03,$03,$03,$03,$03,$03
 dcb $04,$04,$04,$04,$04,$04,$04,$04
 dcb $05,$05,$05,$05,$05,$05,$05,$05,

 table_low:
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0