SPO600: 2020

Saturday, 18 April 2020

Project Stage - 3

This might be the final blog I am posting so I will try to make it as descriptive as possible. This blog will be about the stage 3 of the project. If you haven't gone through the other stages of the project, here are the links to them: Stage 1 ( https://asharma206-spo600.blogspot.com/2020/04/project-stage-1.html ), update to stage 1 ( https://asharma206-spo600.blogspot.com/2020/04/update-to-project-stage-1.html ), stage 2 ( https://asharma206-spo600.blogspot.com/2020/04/project-stage-2.html ).
After, building, benchmarking and profiling, last stage was all about creating a strategy to find possible optimization approach for the software.
To find an optimization opportunity we needed to find the hotspots within the hot (function taking maximum time) function in the software first. In other words we needed to figure out what instructions inside of the software took maximum time to execute, so that with small optimization we could see differences in the overall run. So as a recall from the previous stage, I made use of perf report, which displays the percentage running times within the software of different functions. We would go one step further and press enter on CompressFragment to analyse it one step further. From there we would use its annotate functionality to get a breakdown of different instructions within the function. Then, a screen is displayed with source code and different instructions that take place corresponding to the source code, along with the percentage of run times. With pressing shift + 'h', redirected me to the hottest instruction within the function with most percentage. The instruction, I landed on was taking 8.90 % of the total time within the function and was ldur x2, [x28, #4]. My first thought on this was since this is only about 9%, I was doubtful if I could do any significant optimization that makes a difference. Now the next steps were to find out if any optimizations could be done on this.

SIMD and Vectorization
SIMD stands for single instruction multiple data. With all modern machines having SIMD registers which are 128 bits wide, we were introduced to use these as a possible approach. Though these registers could be used as 128, 64, 32 bits and so on, but it could be also divided into lanes with options of being 64 bits (2 lanes), 32 bits (4 lanes), 16 bits (8 lanes). And those 8 lanes in turn can be used to process upto 8 values at a time.
Vectorization is the code that processes larger chunks of data at a time rather than a single unit. So basically we use vectorization with SIMD to process multiple pieces of data at once and that speeds up the program.
For vectorization to kick in automatically we can set the optimization level to -03 or use option -ftree-vectorize while compiling and if the instructions are in the order for the compiler, the loops will be vectorized automatically. This is called autovectorization.
Video on SIMD and vectorization: https://youtu.be/EIPbufXhiQs
But since this might not always be the case we can use inline assembly instructions or instrinsics in turn.
For inline assembly instruction we use the syntax: __asm__("assembly template" : outputs : inputs : clobbers). This can be thought of a way assembly instructions can be injected into a C or C++ program and thus vectorization commands can be used in this way.
More on inline assembly: https://wiki.cdot.senecacollege.ca/wiki/Inline_Assembly_Language

With instrincs the idea is the same, the only difference is that instrinsics can be thought of as functions, but they both can be used interchangably.
More on instrinsics: https://wiki.cdot.senecacollege.ca/wiki/Compiler_Intrinsics

Now I was ready to apply this knowledge to hotspots in my functions, in order to speed them up.
But soon I realised, my hottest instruction ldur was load register unscaled. So it was not working on an algorithm or processing data, but it had everything to do with the accessing of memory. Since load register is probably just waiting to access data from the memory, there was no possible opportunity of optimization on that instruction.

From there I decided to go one step further and decided to look for second hottest instruction, and if there was any opportunity there. The instruction was strh w2, [x21, x1] which stands for store register halfword, and since this was also about storing values into the register and had everything to do with the memory access, so that instruction was also not something I could do anything about.

Going further I went another step forward and looked for third hottest instruction in the pool and this instruction was surprisingly add x28, x20, x3. Now what was surprising about this was that add is usually a fast instruction and does not take much time. Digging in a bit deep, I remembered the professor had told us the instruction with time might actually not be the instruction taking time but waiting for some other instruction to finish so that it can execute. Then I looked at the instructions that were being preceded by add. So the instructions ( as expected) were strh w5, [x21, x3, lsl #1] (store register halfword) and ldrh w3, [x21, x1] (load register halfword). Which made a lot of sense and made it clear that the add instruction was waiting for those instructions to execute. Hence, once again there was nothing much that could've been done about add as it was linked to accessing memory as well.

At this point it was clear that there were no SIMD and vectorization techniques that could be applied to any part of the hottest function of the software. So I decided to look upon some other techniques our professor told us about.

Platform specific optimizations
The only other aspects to look at of this software were to limit the instructions to a specific architecture. Which means limiting the software to only perform instructions that were specific to the architectures that it was being executed on.
So using ifdefs and defines we can specify what options to set when the platform is arm, x86 etc, that in turn results in running just what is required and ignore everything else, resulting in improvements.

I began searching on that aspect and if I can do certain changes so that code to make it architecture specific. After going through the files, and studying each of them as well as I could, I found that the software already had those conditions specified.

So after a few discussions with the professor I was able to conclude that the software was as optimized as it could be. And as all the possible approaches were tried on the software, I was able to conclude that the benchmarks I took were the fastest times for that optimization options.
No wonder why snappy is known for its fast compression and focuses less on the level of compression. Overall this experience was a fun and exciting ride for me with the things, techniques and knowledge it has given me, though I was worried at the start since it was something very new for me. But I guess, now I have learnt how to approach a software and how to approach the optimization part of it, and I hope to work on some in my future.

This course on a whole was a huge learning curve for me at the beginning. I went through a lot of struggles, spent hours on thinking on the labs, and had an exciting time working on the project. With the help from the professor and my own struggles with the stuff were made to work on, I feel I am taking a good chunk of useful knowledge with me from this course.

Saturday, 11 April 2020

Project Stage - 2

This blog is about the stage 2 of our project ( Project Stage 1 - https://asharma206-spo600.blogspot.com/2020/04/project-stage-1.html). So after we were done setting the benchmarks, next step was to do the profiling of the software. In other words we had to figure out what part of the software was taking the maximum execution time.

We were introduced to two ways on how to figure out what function in our software was taking the maximum execution time. The two concepts were of sampling and instrumentation.

1. Sampling: In this way of profiling, the program is interrupted thousands of times per second, thus collecting thousands of samples per second, and we can get information about what the program counter was doing during the interruption and thus we can figure out what called what and hence what was getting executed for the majority of the time.

2. Instrumentation: Another way of profiling is instrumentation. In this, some code is added during building of the software, so that all the function calls can be recorded and thus we can figure out what part takes the maximum execution.

Two tools that can be used to do the profiling. gprof and perf. I will demonstrate how to work with each of these and then the results I got from each of them.

gprof:
In order to use this tool I had to setup the build to include a -pg (profile generation) option. Next step was to build the software. Next we would run the software as usual. We would see a new executable called gmon.out generated. It would be a binary file. Next step was the usage of gprof command:
filename: (name of the executable originally used to run the software (not gmon.out))
gprof filename | gprof2dot | dot -T x11

Now gprof filename would originally would give the status of different functions and the execution times and other related info. But in order to get a more readable and helpful info we would feed the result to gprof2dot which would further be fed to a program called dot which would give a textual form of graph for the execution tree but for a more graphical version we give -T x11 options which would launch a new window with the visual version of the tree.

result:

Since the original tree was much more complex, this is just a screenshot of the original tree containing the function that takes the maximum execution time.
As it is clear from this graph the function CompressFragment was taking more than 74% of the execution time and though the control was getting transferred to multiple functions, the time taken by them was not that significant. Just a note, this tool uses both instrumentation and sampling

perf:
Though I already knew by now what function was taking the maximum time, but since it was suggested by our professor to use both the tools, I decided to work with perf as well. Now, to use this tool we do not need to use any special options and so I just used the -O2 optimization level and rebuild the software.
Note: This tool only uses the sampling and not the instrumentation.
In order to use this tool, we would just issue the program:
perf record ( program execution)
This would record the samples and would create a file called perf.data.
Then we would issue a command: perf report
Since this tool does not use instrumentation we can't produce an execution tree, we would just see a status of different functions and their execution times.

result:

As expected it is also clear that the most execution time in this program is spent on function CompressFragment (72%). Though the execution time is a bit different from the one in gprof but it is understandable since both of them use different approaches.

So by this time I was clear it was CompressFragment function in this software that I had to work on and try to optimize and since it is taking more than 70%, even a small change in its execution time would mean a good amount of difference in its overall execution time. I would be spending the next few days figuring out what aspect in the function I can work on and I hope to make some good progress on this project overall.

Monday, 6 April 2020

Update to Project Stage - 1

As I told in the last post (https://asharma206-spo600.blogspot.com/2020/04/project-stage-1.html), I did not have much luck figuring out how to set the optimization level for the build of my software. But with some help from the professor and reading some documentation about cmake, I finally managed to set the optimization level and decided to update my benchmarks.
I set the CXX_FLAGS to "-O1 -DNDEBUG" for all builds. and surprisingly the execution of the software which was taking about 6 mins and 44 secs, got reduced significantly to just 24%. So I increased the size of one of the files that the execution of the software was taking place on so it would take more than 4 mins.
All I did was appended the file multiple times to a new file using a loop:
{for (X=0; X<no. of times to append; X++) ; do cat file1.txt >> newfile.txt ; done}

Now before I did this, I set the level of optimization to -02 (recommended) and then I rebuild and decided to take the new benchmarks.

Results

AArchie (AArch64)

Once again as in the last post, the values are in seconds and ranged from about 284.42 to 283.35, with a variation of about 0.37%.

Xerxes (x86_64)

The benchmarks for Xerxes ranged from about 244.36 to 243.84, with a variation of just about 0.21%.

Another small thing I would like to point out that I think is important here is, when I was working on setting optimizations and removing them and experimenting with them, a very key step is making sure the build files are deleted and then build again before you can run the updated version of the code. So that is make clean and then make. This is important when we want program to recompile with our updated options.
With that being said and my new benchmarks in, I'll be looking forward to do the profiling on this software and hopefully I can optimize it by some bit by the end of this course. I will keep you guys updated.

Tuesday, 31 March 2020

Project Stage - 1

This blog will be all about the Stage 1 for the final project for this course. I was both excited and nervous to work on the project since I had not worked on anything like this before which brought excitement and also I was worried about what I should choose and what kind of challenges I would face doing so.
The instructions I had in mind that was given by the professor were that we need to pick something that is not too basic, in other words it had to be resource intensive. There were few options for the programming languages that had to chosen from, since we could not choose Java or web scripts, C and C++ were the only languages I was left with as I did not have experience working with any other languages.

1. Find an open source software package that includes a CPU-intensive function or method that is implemented in a language that compiles to machine code (such as C, C++, or Assembler). This could be hash function, compression, a codec, or any other compute-intensive task, and may be in an application or a library.

I started by working on a project called Soundtouch (https://github.com/rspeyer/soundtouch) which was used for changing pitch, tempo and playback rate. I started building the software by issuing commands and installing binaries bootstrap and configure, but failed in doing so as I soon figured out after getting help from the professor that the software was broken in the last few builds. I was also informed that no tests were included within the module, so I felt it would be really problematic to work on something that would need so much work on just the benchmarks.
There were a lot of of different fields of software I could choose from but I was a bit more interested in working on compression software. After studying and researching some of the different software packages, I finally chose a compression software from Google called Snappy (https://github.com/google/snappy).
It is a compression, decompression software which does not aim for maximum compression but for higher speeds and a reasonable amount of compression.

2. Build the software.

I took the following steps (commands) to retrieve and build the software:

1. git clone https://github.com/google/snappy.git
pulls the repository from the github into your own machine.

2. cd snappy && mkdir build
changes the directory to snappy pulled from the github and makes new directory called build within that directory.

3. cd build
change directory to build

4. cmake ../ && make
builds the library. Successful make means the library has been successfully build.

3. Next we had to find a test case that would help us benchmark the software with minimal variations.

Fortunately for me, I had got a unit case in the repository that helped me run tests on the package. The unit test named snappy_unittest was used to test the software and benchmark it.

4. Benchmark results

Benchmarks were supposed to be done with real run time being of a significant value so that the small changes done after optimization would reflect as significant as well and the variations in different iterations are averaged out. I used the same unit test to be run on both AArch64(AArchie) and x86_64(xerxes) for 5 times each and the results are reflected below:

The values are in seconds as I thought it was more precise to do so.

x86_64: The average value ranged between 64.21 and 64.66 seconds and had a variation of about 0.60% which is a fairly small value.

AArch64: The average value ranged between 401.51 and 403.67 seconds with a variation of about 0.53% which again was fairly small.

The difference in times for x86_64 and AArch 64 were almost around 84% and it was understood as x86_64 is much more powerful compared to AArch64

Compiler optimizations:
Since I am using cmake to build the software and I did not have much experience doing so and also the demos given by the professor were on configure and not cmake, I had a hard time figuring out how to do compiler optimizations on this. But with the help from professor and reading documentations, I am sure I will be able to do so and I will be giving an update about it on the next post.

Overall, I would say this is one of the another challenges in this course. Though it has brought a lot of unpredictability based on how new this thing is to me and also the optimizations we would be doing on this software, but I hope I'll be able to overcome this last challenge this course has to offer and I would much more comfortable working with different assemblers.

Thursday, 26 March 2020

Week 9 - Lecture

This week's lecture started with all of us being a bit sad because the whole world was facing and fighting against a pandemic known as CoVid 19, and it was announced that there won't be any in person lectures to stop the disease's spread so it was probably the last lecture we were taking in person.
The lecture was based on the compiler optimizations i.e. the small tricks compiler does to make our program efficient. In other words compiler does some changes to our code that makes it faster and easier to run.

Optimizations:
Strength Reduction:
In this optimization some expensive operations are replaced by cheaper/inexpensive operations. An example of this would be as: multiplication is considered to be a comparatively expensive operation, so the code is rewritten in a way so that multiplication is replaced by addition or something similar.

Hoisting:
In some situation loops might contain repetitive calculations or in other words there might be some calculations that are not changed within the loop and such calculations are done outside of the loop and nothing is repeated within the loop thus making it more efficient. Example: basic calculations, non changing functions, loop counters with complex calculations. These all can be done outside of the loop so that nothing is repeated.

Loop unswitching:
There are situations when checks irrelevant of the loops are placed inside the loops that way checks are done in every cycle of the loop. But such checks are done outside of the loops and the loops are placed inside of the checks thus getting rid of checks in every loop. This is called loop unswitching.

Loop splitting:
Sometimes two or more parts of the loop are supposed to be handled differently on the basis of the checks done, but instead of doing similar checks in every loop, the loop is split into different parts based on the checks thus getting rid of that extra work in doing checks for the whole loop.

Loop unrolling:
In this method a loop is unrolled to contain multiple copies of the same code, thus decreasing the number of times the calculations are done. Example: A case when we know there are going to be even number of iterations of the loop, instead of checking every time, the increment is done twice, the check is done every second time and thus calculation is also halved throughout the loop.

Inlining:
In a case of inlining, the place where the function is being used is replaced directly by the code inside the function and eliminates the calling of the function.

Common subexpression elimination:
In cases where similar calculations are done more than 1 time the calculated result can be stored inside of a variable and that variable can then be used in place of those calculation and thus calculation is done only once.

Jump Threading:
In cases where we know all the possibilities, the repetitive if conditions can be changed to have an else statement in the last possibility thus eliminating evaluation for the last condition.

Short circuit evaluation:
In conditions containing OR or AND logical statements, conditions can be rewritten to evaluate only 1 condition first in case of OR and then continue to second only if 1st is false and the opposite for AND.

Dead Store elimination:
There are situation when a value is written to a variable more than once but the old values assigned are never used in the program and thus are eliminated.

So these were all the optimizations that compiler does on our behalf and makes the program efficient. More on these optimization can be found here:
https://wiki.cdot.senecacollege.ca/wiki/Compiler_Optimizations
I feel going further in the course might be a bit difficult considering that there would be no in person classes and also with the final project approaching. But I am confident that I would be able to achieve the most from this course and fight out of these difficult situations.

Wednesday, 18 March 2020

Week 9 - Lab 6

This week started with another lab that we had to complete and it was based on algorithm selection.
Digital sound is represented as 16 bit integer signal samples with two streams for left and right stereos. To change the volume of any sound each signal sample can be scaled (multiplied) by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).

In order to do the scaling we have three options to choose from:

1. Naive algorithm: In this approach each sample is directly multiplied by the volume factor.

2. Lookup based algorithm: This approach pre calculates all the possible 65536 (16 bit values) results and looks up in that pre calculated table instead of multiplying.

3. Fixed point algorithm: This approach uses fixed - point math and bit shifting to perform the multiplication instead of directly multiplying without using floating point math.

So basically we had these three options coded in three different files and a header file which had the number of samples to scale. Each file did some pre processing for the required algorithm, scaled the samples and then summed up the results. We had to make sure each algorithm took more than 20 seconds in order to find some significant results, so we had to adjust the number of samples accordingly. Then we had to find a way to time only the time taken to scale the samples and exclude everything else.
The way I chose to accomplish this was by timing the whole program and then commenting out the scaling portion of it and then timing it again and the difference between the two gave me the time taken for scaling.
Once we had that time we had to compare each of the three algorithms using the benchmarks, and also do the same on different machines/architectures.

So I chose to test and compare them on two of my class servers Aarchie (AArch64) and Xerxes (x86_64) and also on matrix server.

Results:
Fortunately, since testing on difference machines for so many times and timing them would have been a trouble but we were made aware of a script that would handle all that:
for X in vol1 vol2 vol3 ; do echo ==== $X ; for Y in 1 2 3 4 ; do echo ---- Trial $Y ; /usr/bin/time ./$X ; done ; done |& tee results.log

The script not only times the three files for 4 trials each but also stores the results in a file so that we can always reach back to it later.

AArchie (AArch64)

I decided to take an average of all the four trials so it is easier to analyse and reach a decision.
Algorithm 3 proved to be the fastest and took the lowest amount of time in scaling, while algorithm 1 took 46% more time and was slower and algorithm 2 took 68% more time than algorithm 3. So overall on this Aarch_64 machine, Fixed point algorithm proved to be the fastest. No significant change was seen in maximum resident memory usage of all the three approaches though.

Xerxes (x86_64)

Once again I averaged out all the 4 trials and came up with a single value to do the analysis of the benchmarks. On this x86_64 machine also, algorithm 3 proved to be the fastest taking just 1.39 seconds to scale the samples, while algorithm 1 took 26% more time and was slowest while algorithm 2 took 18% more time in comparison to algorithm 3. So once again Fixed point algorithm was seen as winner.

Matrix Server

Matrix came up with a bit varying results for Naive algorithm, which might be due to the load on the server and other activities being performed on it, but averaging them all out again helped me and made it easier to analyse and compare the results. Surprisingly this time, algorithm 2 took the lowest time scaling the samples averaging out to just 0.45, while algorithm 1 took 88% more time and was the slowest of the three, while algorithm 3 took 73% more time compared to algorithm 2. Surprisingly, Lookup based algorithm was the fastest on matrix server.

This was surprisingly one of the easiest labs we had done so far. It is clear that the effectiveness of an algorithm might vary based on the machine we work on.

Saturday, 7 March 2020

Week 8 - Lab 5 continued and Lecture

This week's lab we were supposed to continue with the lab 5 we had started the week before the study break and our group had made some improvement from the last code I showed you guys in the last blog, and we were now being able to print the iteration numbers besides the word loop but we were only able to print 0-9 and then we were getting other characters after 9 which proceeded 9 on the ASCII table. Now before I explain what we did to accomplish all of this and the final result I want to show you guys the code and explain how each of the things work. So here is the code:

.text
.globl _start

min = 0 /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 30 /* loop exits when the index hits this number (loop condition is i<max) */

_start:

mov x19, min

loop:

mov x0, 1 /* file descriptor: 1 is stdout */
adr x1, msg /* message location (memory address) */
mov x2, len /* message length (bytes) */
mov x22, #10 /* store 10 in x22 */
udiv x20, x19, x22 /* div x19 (raw increment) by x22 (10) store it in x20 */
msub x5, x20, x22, x19 /* store at x5 the remainder */
add x4, x5, 0x30 /* store (temp) the index (x19) plus the offset for the ascii numbers */
add x5, x20, 0x30 /* store (temp) the 10s (x20) plus the offset for the ascii numbers */

cmp x20, 0
b.eq back
strb w5, [x1, 5] /* write the 10s value*/

back:
strb w4, [x1, 6] /* store address x4 at x1 (beginning of msg) + 5 */

mov x8, 64 /* write is syscall #64 */
svc 0 /* invoke syscall */

add x19, x19, 1 /* add 1 to x19 */
cmp x19, max /* compare x19 to the max */
b.ne loop /* if not equal to max, branch back to loop */

mov x0, 0 /* status -> 0 */
mov x8, 93 /* exit is syscall #93 */
svc 0 /* invoke syscall */

.data
msg: .ascii "Loop \n"
len= . - msg

Now I will first start with how we got 0-9 iterations printed which we did by placing the iterator's value + x30 (i.e. number + hex 30 which would display the character as in ASCII table) inside a register and moving that register's value into address: msg + the position in msg where we want to display that number. i.e. if msg + 6 because LOOP (4 digits) + space (1 digit) + 1 (place to be stored at). But now the problem was when the iterator reached 10 it printed next character after 9 in ASCII table which was not 10 and since we didn't have double digits to display in ASCII we had to split the numbers to display double digit values. So we took the double digit number got its 10's place (divide by 10) added x30 and stored its value in a register and got 1's place by getting its remainder and stored it in the same way. We used udiv and msub to get the quotient and remainder of the division respectively, which can be found here:
https://wiki.cdot.senecacollege.ca/wiki/AArch64_Register_and_Instruction_Quick_Start

Now as we had both the digits we needed we just needed to place them in adjacent places after the word loop which we did in a similar way we had placed digits after LOOP earlier. The last challenge we had was that we had to print singles digit till 9 and double digits afterwards but we had all iterations in double digits so we made use of the subroutine to skip printing 10's digit if our ten's digit was equal to 0. And in this way we were able to accomplish what was required for this lab.

Overall I found this lab very interesting as it involved alot of Maths and also had some significant logic to think about. I feel I now had even stronger grip over the assembly language.

Lecture:
This week's lecture was strictly based on instructions about our upcoming project and its different stages. We were given a lot of instructions on how and what sort of project we should be looking to choose based on it being resource intensive and what we wanted to optimize out of CPU, RAM and energy. Then we had to build and benchmark our chosen package by the following steps:
1. Obtain code
2. Install dependencies
3. Build
4. Get or generate some data
5. Build command
6. Run + time it
and in this way we can compare our code's performance and improvements after we work on it.

With the end of semester approaching I am both excited and nervous at the same time to work on the project and am looking forward to explore some good optimization and other helpful techniques.

Wednesday, 26 February 2020

Week - 7 Lecture (Lab 5)

This week there was no lab due to Family day's holiday so the Lab 5 was supposed to be worked on in the lecture period. This were the required steps for Lab 5:

Here is a basic loop in AArch64 assembler - this loops from 0 to 9, using r19 as the index (loop control) counter:

.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 30                         /* loop exits when the index hits this number (loop condition is i<max) */

_start:

    mov     x19, min

loop:

    /* ... body of the loop ... do something useful here ... */

    add     x19, x19, 1
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -> 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */

This code doesn't actually do anything while looping, because the body of the loop is empty. On an AArch64 machine, combine this code with code from the "Hello World" assembley-language example, so that it prints a word each time it loops:

Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop

Then modify the message so that it includes the loop index values, showing each digit from 0 to 9 like this:

Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9

Now we had a list of servers we could choose from so we decided to go with aarchie.
Then after taking reference from both the loop example given and the hello world program we were able to print the iterations of loop without numbers. Here is the code:

.text
.globl _start

min = 0 /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 30 /* loop exits when the index hits this number (loop condition is i<max) */

_start:

mov x19, min

loop:

mov x0, 1 /* file descriptor: 1 is stdout */
adr x1, msg /* message location (memory address) */
mov x2, len /* message length (bytes) */

mov x8, 64 /* write is syscall #64 */
svc 0 /* invoke syscall */

add x19, x19, 1
cmp x19, max
b.ne loop

mov x0, 0 /* status -> 0 */
mov x8, 93 /* exit is syscall #93 */
svc 0 /* invoke syscall */

.data
msg: .ascii "loop \n"
len= . - msg

the text is stored in the msg label and .ascii acts as a dcb. then _start. declares the start of the program and the loop label uses the message and its length to do a syscall to print it to the screen and one gets added to the register used for iteration and this keeps on going until the loop is finished i.e. comparison between max and register value. Well we were trying different ways on how to get the number printed besides the loop but could only managed to go this far until the lecture was over and we were told this lab would be continued until the lab period after study week so we had a lot of time to think over this.
Though the syntax looks a bit scary for now for this architecture it is just a matter of time before we get used to these instructions and then everything should be fine.

Saturday, 22 February 2020

Week 6 - Lab 4 continued and Lecture

This week's lab also we were supposed to work on Lab 4 and I will talk about the second task that were supposed to work on, in this blog. Now in case you guys did not check out last post this was our task:

Option 2: Data Input Form[edit]

Create a subroutine which enables the user to enter a string of up to 40 characters. Decide how the data will be returned from the subroutine. The user should be able to use the left/right cursor keys and backspace to edit the text, and the cursor location should be visible to the user.
Write a program which uses this subroutine to collect the user's name, address, city, province, and postal code, and then display this information formatted as a shipping address (for example, as shown below).

Your order will be delivered to: J. Doe 1750 Finch Ave E Toronto ON M2J 2X5

Now the approach used was to store each of the inputs was that each iteration would give a message (stored as constant byte) followed by the user input (stored as empty dcb's of required lengths). Then each of the empty constant bytes will be replaced by user input accordingly and displayed in the last iteration with the help of labels. Now there was a very important concept about logical operations we were taught in week 5 which was XOR or EOR which stands for exclusive OR. Anything EOR with 0 would keep the original value and EOR with 1 would give a flipped value. So basically we used it to flip the bits.

Here is code for lab 4 task 2:

define SCINIT $ff81 ; initialize/clear screen

define CHRIN $ffcf ; input character from keyboard

define CHROUT $ffd2 ; output character to screen

define SCREEN $ffed ; get screen size

define PLOT $fff0 ; get/set cursor coordinates

define POINTER $10 ; ptr: start of row

define POINTER_H $11

define SPOINTER $12 ; ptr: start of screen

define SPOINTER_H $13

define MAX_INPUT $14

lda #$f0

sta SPOINTER_H

lda #40

sta MAX_INPUT

jsr SCINIT

; Name

ldx #<label1 ; loading X with a pointer to the string to be printed

ldy #>label1 ; loading Y with a pointer to the string to be printed

jsr prtstr ; print string

ldx #<name ; loading X with a pointer to the string to be read and store

ldy #>name

lda #len1 ; length of string

jsr getstr ; go get user input

jsr SCINIT

;Address

ldx #<label2

ldy #>label2

jsr prtstr

ldx #<address ; loading X with a pointer to the string to be read and stored

ldy #>address ; loading X with a pointer to the string to be read and stored

lda #len2 ; length of string

jsr getstr ; get user input

jsr SCINIT

;City

ldx #<label3

ldy #>label3

jsr prtstr

ldx #<city ; loading X with a pointer to the string to be read and stored

ldy #>city ; loading Y with a pointer to the string to be read and stored

lda #len3 ; length of string

jsr getstr ; go get user input

jsr SCINIT

; Printing final message and captured details

ldx #<label4 ; load XY with a pointer to the string to be printed

ldy #>label4

jsr prtstr ; go print string

ldx #<name ; load XY with a pointer to the string to be printed

ldy #>name

jsr prtstr ; go print string

jsr prtcrlf

ldx #<address ; load XY with a pointer to the string to be printed

ldy #>address

jsr prtstr ; go print string

jsr prtcrlf

ldx #<city ; load XY with a pointer to the string to be printed

ldy #>city

jsr prtstr ; go print string

jsr prtcrlf

brk

prtcrlf: ;this subroutine moves the code to a newline

ldx #<crlf ; there is only one parameter

ldy #>crlf ;prints a newline carriage return character

jsr prtstr

prtstr: ; this subroutine prints a NULL terminated string

stx POINTER ; there is only one parameter

sty POINTER_H ; registers XY point to the string to be printed

ldy #$00

prtchar:

lda (POINTER),y

beq prtstrdone

jsr CHROUT

iny

bne prtchar

prtstrdone:

rts

getstr: ;this subroutine takes care of the typing and checking of keys

stx POINTER ; there is 2 parameters the pointer and length of string

sty POINTER_H

sta SPOINTER

ldy #$00

check2:

idle: inx

cpx #$10

bne check

lda (SPOINTER),y

eor #$80

sta (SPOINTER),y

check: lda $ff

beq idle

ldx #$00

stx $ff

cmp #$08

bne print

lda (SPOINTER),y

and #$7f

sta (SPOINTER),y

dey

jmp next

print:

cpy MAX_INPUT

beq maxed

cmp #$0D ; check for Enter key

beq done

cmp #$83 ; check for Left arrow key

beq go_left

cmp #$81 ; check for Right arrow key

beq go_right

sta (SPOINTER),y

sta (POINTER),y

iny

maxed:

jmp check2

go_left:

;brk

dey

cpy #$ff

bne not_at_left

iny

not_at_left:

jmp idle

go_right:

;brk

lda (POINTER),y

cmp #$00

;brk

beq not_at_right

iny

not_at_right:

jmp idle

done:

rts

label1:

dcb "p","l","e","a","s","e",32,"e","n","t","e","r",32,"y","o","u","r",32,"n","a","m","e",":",00

define len1 24

label2:

dcb "p","l","e","a","s","e",32,"e","n","t","e","r",32,"y","o","u","r",32,"a","d","d","r","e","s","s",":",00

define len2 27

label3:

dcb "p","l","e","a","s","e",32,"e","n","t","e","r",32,"y","o","u","r",32,"c","i","t","y",32,"a","n","d",32,"p","o","s","t","a","l",32,"c","o","d","e",":",00

define len3 40

label4:

dcb "Y","o","u","r",32,"o","r","d","e","r",32,"w","i","l","l",32,"b","e",32,"d","e","l","i","v","e","r","e","d",32,"t","o",":",$0d,$0d,00

name:

dcb 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,$0d

address:

dcb 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,$0d

city:

dcb 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,$0d

crlf:

dcb $0d,0

So by incorporating previous task's knowledge about working with character display and some additional information we were able to finish this lab.

Lecture:

This week's lecture started with a quiz and it was all about differences between the architecture we already finished learning and the architectures we were in process of learning i.e. 6502, AArch 64 and X86 64. And after the quiz the prof. gave us a summary about the 3.

6502: 8 bit processor, 16 bit address, 3 registers.

X86 64: 64 bit processor, 64 bit address, 16 registers.

AArch 64: 64 bit processor, 64 bit address, 31 registers.

Then we got an insight on how both architectures started from different approaches i.e. RISC (have tiny instructions but small and fast), and CISC (complex instructions) and ended up curving in a way to come on similar grounds.

Then we were shown a sample hello world program that had some weird looking syntax at first but then made sense once we got to understand it.

Overall we knew we were taking a step forward into the this assembly language and had to be ready for it.

Wednesday, 12 February 2020

Week 5 - Lab 4 and Lecture

This week started with us getting another lab that we had to complete as a group. So we had about 4 tasks to choose from. out of which we had to complete 2. Here are the options we had:

Option 1: Adding Calculator[edit]

Create a subroutine which enables the user to enter two numbers of up to two digits. Indicate where the cursor is, and allow the user to use the digit keys (0-9), backspace, and enter keys. Return the user's input value in the accumulator (A) register.
Using this subroutine, write a program which add the two numbers (each of which is in the range 0-99) and print the result.

Optional challenge: extend to more than 2 digits and add a running total capability with a clear key.

Option 2: Data Input Form[edit]

Create a subroutine which enables the user to enter a string of up to 40 characters. Decide how the data will be returned from the subroutine. The user should be able to use the left/right cursor keys and backspace to edit the text, and the cursor location should be visible to the user.
Write a program which uses this subroutine to collect the user's name, address, city, province, and postal code, and then display this information formatted as a shipping address (for example, as shown below).

Your order will be delivered to:

  J. Doe
  1750 Finch Ave E
  Toronto ON  M2J 2X5

Optional challenge: validate the province/territory and the postal code format.

Option 3: Hexdump[edit]

Create a subroutine which enables the user to enter a 4-digit hexadecimal value. Allow the use of the backspace key for editing, and show the user the cursor location. Decide how the data will be returned by this subroutine.
Using this subroutine, ask the user for a starting address and a length (in bytes), and then output a hexdump of the specified memory locations, formatted to show address and data information in some layout. If the output is longer than the screen, pause for user input (e.g., "Press ENTER to continue" or something similar) every 20 lines.

Optional challenge: add a display of ASCII data on the right-hand side, similar to the output of the linux command od -A x -t xz -v filename

Option 4: Screen Colour Selector[edit]

Create a subroutine which displays on the character display a list of colours available on the bitmapped display, one colour per line, with one of the colours names highlighted in reverse video (white on black). The user may use the up/down arrow keys to change the highlighted row. Return the user's selection as a number in the accumulator (A) register when they press Enter.
Using this subroutine, get a colour from the user, then fill the bitmap display with this colour, and allow the user to select a different colour.

Optional challenge: update the colour on the bitmapped display as the user scrolls through the colour-selector list.

So out of these options our group decided to go for options 2 and 4 which were data input form and screen colour selector. We decided to work on the screen color selector first.

Now as we were working on the lab the professor gave us 2 commands that would be useful for the lab which were pha and pla which stand for PUSH Accumulator and PULL accumulator respectively. Pha simply copies whatever is in the accumulator and stores it on the stack while Pha doing the opposite takes 1 byte from the stack and stores it in the accumulator.

Now we started by displaying the names of the colors by declaring them as constant bytes and then printing them on character display but soon we figured out the old way of displaying things by starting displaying from the character display's start address would not prove very useful and we would have to use ROM routines in order to work on this lab. Now ROM routines are pre defined subroutines in the memory that we can access and use for basic functionalities. Here is a list of them:

; ROM routines
define  SCINIT  $ff81 ; initialize/clear screen
define  CHRIN  $ffcf ; input character from keyboard
define  CHROUT  $ffd2 ; output character to screen
define  SCREEN  $ffed ; get screen size
define  PLOT  $fff0 ; get/set cursor coordinates

Well, after getting to display the names, the next task was to get the cursor moving up and down that list and displaying the color based on the position of cursor. While we had limited time in lab we could only manage to go so far and decided to complete the lab afterwards.

Lecture:

Well this week's lecture introduced us to 2 architectures which were fairly new to us though we use them a lot, x86 64 and AArch 64. We were given a background on both those architectures and how those 2 emerged from being small in memory chips to being so huge memory and powerful now.

Then we got to know the registers on both the architectures. While AArch 64 had 0-30 registers and a register 31 that could be used as zero register or stack pointer based on its functionality. More on these registers and some related instruction set can be found on:

https://wiki.cdot.senecacollege.ca/wiki/AArch64_Register_and_Instruction_Quick_Start

While on the other hand X86 64 had 16 registers, more of which could be found on:

https://wiki.cdot.senecacollege.ca/wiki/X86_64_Register_and_Instruction_Quick_Start

It was interesting to know how both the architectures had registers that could be used for 64 bit or 32 bit wide access.

Now going back to the lab, we took from the code from the previous lab on how to scroll using input from the keyboard and we went further to store the color, address cur_pos held inside and we manipulated the cursor to store different color's code according to the input from the keyboard, as they ranged from 0-F. So after managing to change the color code based on keyboard up or down key and making it print the stored code using the filling the bitmapped display's code we were able to come up with the following code:

; ROM routines

define SCINIT $ff81 ; initialize/clear screen

define CHRIN $ffcf ; input character from keyboard

define CHROUT $ffd2 ; output character to screen

define SCREEN $ffed ; get screen size

define PLOT $fff0 ; get/set cursor coordinates

define cur_pos $10

jsr SCINIT

ldy #$00

ldx #$00

stx cur_pos

char: lda color_list,y

beq main

jsr CHROUT

iny

bne char

main:

lda $ff ; get a keystroke

beq main ; fix freezing on max speed

ldx #$00 ; clear out the key buffer

stx $ff

cmp #$82 ; arrow down for increment (going down)

beq incr

cmp #$80 ; arrow up for decrement (going up)

beq decr

jmp main ; if no key pressed loop forever

clear_cursor:

lda #$2d ; character -

jsr print_cursor

rts

incr:

jsr clear_cursor

inc cur_pos

lda cur_pos

cmp #$10

beq incr_reset

bne incr_decr

incr_reset:

lda #$00

sta cur_pos

jmp incr_decr

decr:

jsr clear_cursor

dec cur_pos

lda cur_pos

cmp #$FF

beq decr_reset

bne incr_decr

decr_reset:

lda #$0F

sta cur_pos

jmp incr_decr

incr_decr:

lda #$93 ; character black box

jsr print_cursor

jmp paint

print_cursor:

clc ; select the SET function for PLOT

ldx #$00

ldy cur_pos ; increment this one

pha ; save the value of A into the stack

jsr PLOT ; set the cursor position

pla ; restore the value of A from the stack

jsr CHROUT ; print the char

rts

paint:

ldx #$06 ; max value for $11

ldy #$00 ; index

lda #$00 ; set pointer at $10 to $0200

sta $11

lda #$02

sta $12

lda cur_pos ; set the color

jmp paint_loop

paint_loop:

sta ($11),y ; store colour

iny ; increment index

bne paint_loop ; branch until page done

inc $12 ; increment high byte of pointer

cpx $12 ; compare with max value

bne paint_loop ; continue if not done

jmp main ; done - return to main

color_list:

dcb $93,"B","l","a","c","k",13

dcb "-","W","h","i","t","e",13

dcb "-","R","e","d",13

dcb "-","C","y","a","n",13

dcb "-","P","u","r","p","l","e",13

dcb "-","G","r","e","e","n",13

dcb "-","B","l","u","e",13

dcb "-","Y","e","l","l","o","w",13

dcb "-","O","r","a","n","g","e",13

dcb "-","B","r","o","w","n",13

dcb "-","L","i","g","h","t",32,"r","e","d",13

dcb "-","D","a","r","k",32,"g","r","e","y",13

dcb "-","G","r","e","y",13

dcb "-","L","i","g","h","t",32,"g","r","e","e","n",13

dcb "-","L","i","g","h","t",32,"b","l","u","e",13

dcb "-","L","i","g","h","t",32,"g","r","e","y",00

As you can see we used ROM routines CHROUT and PLOT to print the characters to the display and setting the position of the cursor in the memory respectively.

Though we managed to pull of this first task we still had one more to go.

Friday, 31 January 2020

Week 4 - Lecture

This week's lecture started with how we can display text on the 6502 emulator. I found it fairly easy.
Here is a sample code:

define SCREEN $f000
           ldy #$00
 
 char:     lda text,y
           beq done
           sta SCREEN,y
           iny
           bne char
 
 done:     brk
 
 text:
 dcb "S","P","0","6","0","0",32,"i","s",32,"a","w","e","s","o","m","e",".",00

This was accomplished by having a text as dcb and looping it to load and store it in accumulator.
Spaces were put using number 32 and all characters were put in double quotes.

Next we talked about compiling the code. We compiled a simple C program inside linux and saw its executable was larger than its original file. We also learnt about dynamic and static linking and how each of them have their own pros and cons. While static being larger in size and dynamic was relatively smaller. Then we learnt about different compatibility modes it had to offer and how it could range from O0 (almost no optimizations) to O3 (most optimized) and we were also shown how many flags for optimized content was set to active and how we could manipulate it. Og was used to optimize for debugging. We almost looked upon certain commands like make which could be used to make partial builds inside of whole when small changed are made to source files.

This week's lecture ended with us touching upon something new and we were also told text was the last thing we learnt on 6502 and we wont be working with it anymore.

Thursday, 30 January 2020

Week 4 - Lab 3 continued

This week's lab started with us giving a quiz. We were all a bit surprised as we did not expect it but it turned out to be just fine in the end since it was just matching and there were no short or long questions to answer. Rest of the lab was given to us to figure out more about lab 3 and work on it.
We still did not have much to end the lab with so we decided and talked to professor a bit more about it and understanding it in more detail.
After much discussions finally we were able to incorporate everything we know and combined declare constant byte's knowledge to get rid of multiplication, used delay loop that would keep counting as necessary to give us necessary delay to give an effect of moving, incremented rows and columns accordingly to direct the graphic and finally filled the display with black color where the graphic is no longer there. Here is the full code for this lab:

; zero-page variable locations
define ROW $20 ; current row
define COL $21 ; current column
define POINTER $10 ; ptr: start of row - upper left corner of object
define POINTER_H $11
define WPOINTER $12 ; ptr: start of row - write pointer
define WPOINTER_H $13
define BOUNCE_X $14
define BOUNCE_Y $15
define ROW_COUNT $16
define TIMER $17 ; ptr: delay timer
define TIMER_H $18
define TIMER_CONTROL $19 ; contains the base of the high bit of the timer

; constants
define WIDTH 7 ; width of graphic
define HEIGHT 7 ; height of graphic
define GREEN $05 ; GREEN
define BLUE $03 ; BLUE
define BLACK $00 ; BLACK
define RIGHT $19 ; defining where the graphic must change direction accounting for its size
define LEFT $00 ;
define TOP $00 ;
define BOTTOM $19 ;

lda #$0 ; initialize delay loop counter
sta TIMER
sta TIMER_H
lda #$40 ; setting the initial speed of the timer
sta TIMER_CONTROL
lda #$01 ; assume that the object is moving across the screen
sta BOUNCE_Y ; contain the rate motion of the box along the Y axis
lda #$01
sta BOUNCE_X ; contain the rate motion of the box along the X axis
lda #$03 ;setting were object starts on screen
sta ROW
lda #$06
sta COL

loop:
clc
ldy ROW ; load POINTER with start-of-row
lda table_low,y
adc COL
sta WPOINTER
lda table_high,y
sta WPOINTER_H
lda #$00 ; number of rows we've drawn
sta ROW_COUNT
ldx #$00 ; index for data
ldy #$00 ; index for screen column

draw: ; draws graphic
lda data,x ; set up where graphic starts on sreen
sta (WPOINTER),y
inx
iny
cpy #WIDTH
bne draw
inc ROW_COUNT ; increment row counter
lda #HEIGHT ; end of graphic by height
cmp ROW_COUNT
beq done ; exit
lda WPOINTER ; load pointer
clc
adc #$20 ; add 32 to drop one row
sta WPOINTER
lda WPOINTER_H ; carry to high byte if needed
adc #$00
sta WPOINTER_H
ldy #$00
beq draw

done:
lda TIMER_CONTROL ; initialize timer with current speed
sta TIMER_H

delay: dec TIMER ; wait
bne delay
dec TIMER_H
bne delay

clc
ldy ROW ; load POINTER with start-of-row
lda table_low,y
adc COL
sta WPOINTER
lda table_high,y
sta WPOINTER_H
lda #$00 ; number of rows we've drawn
sta ROW_COUNT
ldx #$00 ; index for data
ldy #$00 ; index for screen column

wdraw: ; remove graphic were it has been draw
lda #BLACK
sta (WPOINTER),y
inx
iny
cpy #WIDTH
bne wdraw

inc ROW_COUNT ; increment row counter
lda #HEIGHT
cmp ROW_COUNT
beq wdone ; exit

lda WPOINTER ; load pointer
clc
adc #$20 ; add 32 to drop one row
sta WPOINTER
lda WPOINTER_H ; carry to high byte if needed
adc #$00
sta WPOINTER_H
ldy #$00
beq wdraw

wdone: ; check graphic if hitting a sides and if so change direction
clc
lda COL
adc BOUNCE_X
sta COL
cmp #RIGHT
bne no_bounce

pha
lda #$ff ; using the fact that the register overflows adding ff is the same as subtracting 1
sta BOUNCE_X
pla

no_bounce:
cmp #LEFT
bne no_bounce1

lda #$01
sta BOUNCE_X

no_bounce1:
clc
lda ROW
adc BOUNCE_Y
sta ROW
cmp #BOTTOM
bne no_bounce2

pha
lda #$ff
sta BOUNCE_Y
pla

no_bounce2:
cmp #TOP
bne no_bounce3

lda #$01
sta BOUNCE_Y

no_bounce3:

getkey:
lda $ff ; get a keystroke
ldx #$00 ; clear out the key buffer
stx $ff
cmp #$2d ; is it a minus
bne check

clc ; slow down timer
lda TIMER_CONTROL
adc #$01 ; changing speed of timer by adding to the timer
bcs ignore

sta TIMER_CONTROL

ignore:
jmp move_it

check:
cmp #$2b ; is it a plus
bne check0

sec ; speed up timer
lda TIMER_CONTROL
sbc #$01 ; changing speed of timer by subtracting from the timer
beq ignore

sta TIMER_CONTROL
jmp move_it

check0:
cmp #$80 ; check key == up
bne check1

sec
lda ROW ; ... if yes, decrement ROW
sbc #$01
beq ignore

sta ROW
jmp move_it

check1:
cmp #$81 ; check key == right
bne check2

lda COL ; ... if yes, increment COL
adc #$01
cmp #RIGHT
beq ignore

sta COL
jmp move_it

check2:
cmp #$82 ; check if key == down
bne check3

lda ROW ; ... if yes, increment ROW
adc #$01
cmp #BOTTOM
beq ignore

sta ROW
jmp move_it

check3:
cmp #$83 ; check if key == left
bne move_it

sec
lda COL ; ... if yes, decrement COL
sbc #$01
beq ignore

sta COL
bcc move_it

move_it:
jmp loop

data: ; for the graphic
dcb 00,03,03,03,03,03,00
dcb 03,02,03,03,03,02,03
dcb 03,03,03,03,03,03,03
dcb 03,03,03,03,03,03,03
dcb 03,02,03,03,03,02,03
dcb 03,03,02,02,02,03,03
dcb 00,03,03,03,03,03,00

; these two tables contain the high and low bytes
; of the addresses of the start of each row

table_high:
dcb $02,$02,$02,$02,$02,$02,$02,$02
dcb $03,$03,$03,$03,$03,$03,$03,$03
dcb $04,$04,$04,$04,$04,$04,$04,$04
dcb $05,$05,$05,$05,$05,$05,$05,$05,

table_low:
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0