Don’t ask me why I started looking at writing basic ARM assembly routines. Perhaps it’s for the thrill of it, or taking a walk down memory lane. My first assembly language program was for an IBM System/360 using WYLBUR in college.
This post is not a tutorial on assembly language itself, or the ARM processor for that matter. If the phrases mnemonic, register, or branch on not equal are foreign to you, have a look here. I just wanted to write some easy routines and pick up some basics.
Editor’s note: All of the code below is available on GitHub.
We’ll be using a Raspberry Pi 4.  You will (obviously) need the GCC toolchain installed, which can be accomplished with sudo apt-get install build-essential.
Let’s save the following in a file named helloworld.s:
helloworld.s:
```asm
.global main

main:
    push {ip, lr}

    ldr r0,=hellostr
    bl  printf

    mov r0,#0

    pop {ip, pc}

.data
hellostr:
    .asciz "Hello, world!\n"
```
Assemble and link the application together with gcc helloworld.s -o helloworld and run it.
It doesn’t get much more straightforward than this, and you’ve learned three new ARM assembly instructions: ldr, mov, and bl. (push and pop are instructions too; we’ll come back to them when we write a procedure.) The remaining text consists of labels and directives to the GNU assembler: .global exports a symbol so the linker can see it, .data starts the data section, and .asciz emits a NUL-terminated string.
The ldr instruction loads some value from memory into a register. This is key with ARM: load instructions load from memory. In the example above we’re loading the address of the beginning of the string into register r0. A technical note: the ldr r0,=label form is actually a pseudo-instruction that the assembler expands for us, but let’s gloss over that.
bl branches to the label indicated (and updates the link register), and in our case, there is this magical printf we’re branching to.  More on that later.
Finally, mov r0,#0 is positioning our program’s return code (zero) into r0.  Check it:
```
pi@raspberrypi:~ $ ./helloworld
Hello, world!
pi@raspberrypi:~ $ echo $?
0
```
What if we change the mov r0,#0 to mov r0,#0xff?  Try it:
```
pi@raspberrypi:~ $ gcc helloworld.s -o helloworld
pi@raspberrypi:~ $ ./helloworld
Hello, world!
pi@raspberrypi:~ $ echo $?
255
```
Okay, now for something interesting. Let’s count down from 10 to 1 and then print Hello, world!.
countdown.s:
```asm
.global main

main:
    push {ip, lr}

    mov r5,#10
do:
    ldr r0,=fmtstr
    mov r1,r5
    bl  printf
    sub r5,#1
    cmp r5,#0
    bne do

    ldr r0,=hellostr
    bl  printf

    mov r0,#0xff

    pop {ip, pc}

.data
fmtstr:
    .asciz "%d\n"
hellostr:
    .asciz "Hello, world!\n"
```
Okay, that escalated quickly! One of the reasons assembly language is so much fun. Let’s take a look at what is going on here and add some comments to our code.
```asm
    mov r5,#10      /* Count down from 10 */
do:
    ldr r0,=fmtstr  /* Load the fmtstr pointer for printf */
    mov r1,r5       /* Copy over our count to r1 for printf */
    bl  printf      /* Call printf */
    sub r5,#1       /* Decrement the counter by one */
    cmp r5,#0       /* Compare the counter to zero */
    bne do          /* Branch back to the do label if not */
```
There are a few things to note here. First, let’s talk about the use of r5 and why that register was deliberately chosen. When calling routines in assembly, you had better not keep live values in registers that the subroutine you’re calling is allowed to trash (r0-r3). printf can use these registers, so we’ll use r5 in our routine; under the ARM procedure call standard, any routine that uses r5 must hand it back unchanged. (Strictly speaking, main should save and restore r5 for its own caller as well; we skip that here for brevity.)
Now, I will confess, I am not an assembly language expert, much less an ARM assembly language expert. Someone may look at the above code and ask why I didn’t use a subtract-and-compare-to-zero instruction. ARM does have one: subs r5,#1 subtracts and updates the condition flags, which would make the separate cmp r5,#0 unnecessary. If there is a better way still to write the above, please let me know!
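For readers who think more easily in C, the countdown loop above corresponds roughly to a do-while loop: the body runs first and the test comes at the bottom. The sketch below is only an illustration; the function name countdown and its return value are hypothetical and do not appear in the original code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of the original post: counts down from
 * start to 1 and returns how many lines were printed. */
int countdown(int start)
{
    /* The assembly assumes a positive starting value; so do we. */
    assert(start >= 1);

    int counter = start;              /* mov r5,#10 */
    int printed = 0;

    do {
        printf("%d\n", counter);      /* ldr r0,=fmtstr / mov r1,r5 / bl printf */
        printed++;
        counter--;                    /* sub r5,#1 */
    } while (counter != 0);           /* cmp r5,#0 / bne do */

    return printed;
}
```

Because the test sits at the bottom, the body always runs at least once, which is exactly how the do label and the trailing bne behave.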
Counting Up
In the above example we counted down. Now let’s count up, and instead of counting from zero to some max, let’s count up from some minimum value to some maximum value. In other words, we’ll step through a sequence of values using an increment of one.
countup.s:
```asm
.global main

main:
    push {ip, lr}

    ldr r5,=min
    ldr r5,[r5]
    ldr r6,=max
    ldr r6,[r6]
do:
    ldr r0,=fmtstr
    mov r1,r5
    bl  printf
    add r5,#1
    cmp r5,r6
    bne do

    mov r0,r5

    pop {ip, pc}

.data
min:
    .int 14
max:
    .int 29
fmtstr:
    .asciz "%d\n"
```
There’s some new syntax here, in particular ldr rx,[rx]. This syntax means “load the value that is pointed to by the address in the register.” It makes sense given the instruction immediately before it, ldr r5,=min, which loads the address identified by the label min. To be clear, the actual value of that label depends on the assembler, your application size, and where it gets loaded into memory. Let’s look at an example of that:
printmem.s:
```asm
.global main

main:
    push {ip, lr}

    ldr r0,=fmtstr
    ldr r1,=x
    bl  printf

    mov r0,#0xff

    pop {ip, pc}

.data
x:
    .int 128
fmtstr:
    .asciz "Address of x is %p\n"
```
Assemble and run this code to see something like Address of x is 0x21028. Then move x to after fmtstr and you will see the address change. What it will change to, again, is highly dependent on a number of factors. Suffice it to say, ldr rx,=label loads the address of the label into a register, not the value stored at that address. Fetching the value is what we use ldr rx,[rx] for.
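The address-versus-value distinction maps closely onto pointers in C. The following sketch is an analogy only; load_address and load_value are hypothetical names chosen for illustration, and the variable x stands in for the x: .int 128 definition in printmem.s.

```c
#include <assert.h>
#include <stddef.h>

static int x = 128;            /* plays the role of "x: .int 128" */

/* Like "ldr r1,=x": yields the address of x, not its contents. */
int *load_address(void)
{
    return &x;
}

/* Like "ldr r1,[r1]": follows the address to fetch the stored value. */
int load_value(const int *address)
{
    assert(address != NULL);
    return *address;
}
```

Calling load_value(load_address()) is the C counterpart of the two-instruction sequence ldr r5,=min followed by ldr r5,[r5] in countup.s.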
Running our countup code indeed counts up from 14 to 28 and if we look at the return code (echo $?) we get 29, the last value that was in r5.
Writing a Procedure
Here is a basic ARM assembly procedure that computes and returns fib(n), the nth element of the Fibonacci sequence.  We chose this specifically to demonstrate the use of the stack with push and pop.
fib.s:
```asm
.global fib

fib:
    push {r4-r5,lr}
    mov r4,r0      // r4 <- n
    cmp r4,#0
    beq fib0       // n == 0? f(0) = 0
    cmp r4,#1
    beq fib1       // n == 1? f(1) = 1

    // f(n-1)
    sub  r4,#1     // r4 <- n-1
    mov  r0,r4
    bl   fib       // r0 <- fib(n-1)
    mov  r5,r0     // r5 <- fib(n-1)

    // f(n-2)
    sub  r4,#1     // r4 <- n-2
    mov  r0,r4
    bl   fib       // r0 <- fib(n-2)
    add  r0,r0,r5  // r0 <- r0 fib(n-2) + r5 fib(n-1)
    b    return

fib1:
    mov r0,#1
    b   return

fib0:
    mov r0,#0
    b   return

return:
    pop {r4-r5,lr}
    bx  lr
```
What should be noticed here is the use of push and the list of registers whose values we’re going to save onto the stack. In ARM assembly, the Procedure Call Standard convention is that a subroutine must preserve registers r4-r11, so save any of them you’re going to work with. In the above example we use r4 and r5 to compute fib(n), so we first push r4 and r5 along with the link register. Before returning we pop the previous values off the stack back into the registers.
To use this routine in C we can write:
fibmain.c:
```c
#include <stdio.h>

int fib(int n);

int main(void) {
  for (int i = 10; i > 0; i--) {
    int f = fib(i);
    printf("fib(%d) = %d\n", i, f);
  }
  return 0;
}
```
Then compile, assemble, and link with gcc fibmain.c fib.s -o fibmain. Recall the procedure call standard convention that the first four arguments to a procedure arrive in r0-r3, which is why the first instruction after the push is mov r4,r0: it captures the n we’re calculating the Fibonacci number of.
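A plain C version of the same recursion can be handy for cross-checking the assembly routine's results. This is a reference sketch under the assumption that n is non-negative (the assembly does not guard against negative input either); fib_ref is a hypothetical name picked so it does not clash with the assembly symbol fib.

```c
#include <assert.h>

/* Reference sketch only; mirrors the control flow of fib.s. */
int fib_ref(int n)
{
    assert(n >= 0);

    if (n == 0)
        return 0;                      /* fib0 */
    if (n == 1)
        return 1;                      /* fib1 */

    /* The first result must survive the second call, which is why the
     * assembly parks it in r5, a register the callee has to preserve. */
    int previous = fib_ref(n - 1);
    return fib_ref(n - 2) + previous;  /* add r0,r0,r5 */
}
```

The local variable previous is the C-level reason the assembly needs push and pop: each level of recursion has its own copy of n and of the partial result, and the stack is where those copies live.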
Taking the Average
Okay, one last routine. We want to take the average of an array of integers. In C that would look something like this:
```c
float average(int* a, unsigned l) {
  float avg = 0.0;
  for (unsigned i = 0; i < l; i++) {
    avg += a[i];
  }
  return avg/l;
}
```
Here’s a go at it in ARM assembly.
average.s:
```asm
.global average

average:
    push {r4-r5,lr}
    mov  r4,#0         // r4 <- 0 (running sum)
    vmov s1,r1         // s1 <- l (array length)
do:
    ldr  r5,[r0],#4    // r5 <- a[i], then r0 <- r0 + 4
    add  r4,r5         // r4 <- r4 + r5
    sub  r1,#1         // r1 <- r1 - 1
    cmp  r1,#0
    bne  do            // if (r1 != 0) do
    vmov s0,r4         // s0 <- sum
    vcvt.f32.s32 s0,s0 // Convert s0 to a fp value
    vcvt.f32.s32 s1,s1 // Convert s1 to a fp value
    vdiv.f32 s0,s0,s1  // s0 <- s0/s1
    pop  {r4-r5,lr}
    bx   lr
```
There are some new instructions, an interesting form of the ldr instruction, and a new type of register.
First, the vmov instruction and register s1.  vmov moves values into registers of the Vector Floating-Point Coprocessor, assuming your processor has one (if it is a Pi it will).  s1 is one of the single-precision floating-point registers.  Note that this is a 32-bit wide register that can store a C float.
Next up is the ldr r5,[r0],#4 instruction.  Recall that ldr ra,[rb] loads the value stored at the address in rb into ra.  The #4 at the end instructs the processor to then increment the value in rb by 4.  In effect we are walking the array of integers whose starting address is in r0.
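The post-indexed load has a familiar C counterpart: dereferencing a pointer and then advancing it, written *a++. The sketch below is illustrative only; sum_array is a hypothetical name, and unlike the assembly loop it tests the length before the first load, so an empty array is handled safely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums length integers starting at a. */
int sum_array(const int *a, unsigned length)
{
    assert(a != NULL || length == 0);

    int sum = 0;                /* mov r4,#0 */
    while (length != 0) {
        sum += *a++;            /* ldr r5,[r0],#4 then add r4,r5 */
        length--;               /* sub r1,#1 */
    }
    return sum;
}
```

On a 32-bit ARM target an int is 4 bytes, so advancing an int pointer by one element is the same as adding 4 to the address, which is what the #4 does.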
Finally, once we add all of the values in the array we have the sum in the register r4.  To divide that sum by the length of the array (which was saved off in the floating-point register s1) we load r4 into s0 and perform one last thing:  vcvt.  vcvt converts between integers and floating-point numbers (which are, after all, an encoding).  So s0 gets converted to a floating-point value, as does s1, and then we perform our division with vdiv.
Just as r0 is the standard register for returning an int from a procedure call, s0 holds our float return value (under the hard-float calling convention, which 32-bit Raspberry Pi OS uses).
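Note that the assembly routine does not follow the earlier C version step for step: it accumulates the sum as an integer and converts to floating point only once at the end, whereas the C version adds into a float on every iteration. A C sketch that matches the assembly more closely might look like the following; average_int_sum is a hypothetical name, and the sketch assumes a non-empty array whose sum fits in an int.

```c
#include <assert.h>

/* Hypothetical helper mirroring average.s: integer sum, then one division. */
float average_int_sum(const int *a, unsigned length)
{
    assert(length > 0);                    /* dividing by zero is not handled */

    int sum = 0;
    for (unsigned i = 0; i < length; i++)
        sum += a[i];

    /* The two casts play the role of the two vcvt instructions. */
    return (float)sum / (float)length;     /* vdiv.f32 s0,s0,s1 */
}
```

For small arrays of modest values the two approaches give the same answer; they can differ once the running sum is too large for a float to represent exactly.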
We can use this function in our main routine.
averagemain.c:
```c
#include <stdio.h>

float average(int* a, unsigned l);

int main(void) {

  int a[] = {82,  98, 90, 88, 87, 75};
  float avg = average(a, sizeof(a)/sizeof(int));
  printf("Average:  %2.2f\n", avg);

}
```
Compile with gcc averagemain.c average.s -o averagemain and run.
```
./averagemain
Average:  86.67
```
Closing Thoughts
This post has been a lot of fun to write because assembly is actually fun to write and serves as a reminder that even the highest-level languages get compiled down to instructions that the underlying CPU can execute. One group of instructions we didn’t touch on is the store instructions. These are used to save (store) the contents of registers to memory. Perhaps next time.
Once again, all of the code in this post can be found in the armassembly repository on GitHub.
 
    