{"id":3981,"date":"2020-05-17T17:01:12","date_gmt":"2020-05-17T22:01:12","guid":{"rendered":"https:\/\/dev.iachieved.it\/iachievedit\/?p=3981"},"modified":"2020-05-17T17:01:12","modified_gmt":"2020-05-17T22:01:12","slug":"working-with-arm-assembly","status":"publish","type":"post","link":"https:\/\/dev.iachieved.it\/iachievedit\/working-with-arm-assembly\/","title":{"rendered":"Working with ARM Assembly"},"content":{"rendered":"<p>Don&#8217;t ask me why I started looking at writing basic ARM assembly routines.  Perhaps it&#8217;s for the thrill of it, or taking a walk down memory lane.  My first assembly language program was for an <a href=\"https:\/\/en.wikipedia.org\/wiki\/IBM_System\/360\">IBM System\/360<\/a> using <a href=\"https:\/\/en.wikipedia.org\/wiki\/ORVYL_and_WYLBUR\">WYLBUR<\/a> in <a href=\"https:\/\/engineering.tamu.edu\/cse\/index.html\">college<\/a>.<\/p>\n<p>This post is not a tutorial on <a href=\"https:\/\/en.wikipedia.org\/wiki\/Assembly_language\">assembly language<\/a> itself, or the ARM processor for that matter.  If the phrases <i>mnemonic<\/i>, <i>register<\/i>, or <i>branch on not equal<\/i> are foreign to you, have a look <a href=\"https:\/\/azeria-labs.com\/writing-arm-assembly-part-1\/\">here<\/a>.  I just wanted to write some easy routines and pick up some basics.<\/p>\n<p><b>Editor&#8217;s note:<\/b>  All of the code below is available on <a href=\"https:\/\/github.com\/iachievedit\/armassembly\">GitHub<\/a>.<\/p>\n<p>We&#8217;ll be using a <a href=\"https:\/\/www.raspberrypi.org\/products\/raspberry-pi-4-model-b\/\">Raspberry Pi 4<\/a>.  
You will need the GCC toolchain installed, which you can get with <code>sudo apt-get install build-essential<\/code>.<\/p>\n<p>Let&#8217;s save the following in a file named <code>helloworld.s<\/code>:<\/p>\n<p><code>helloworld.s<\/code>:<\/p>\n<pre class=\"lang:asm decode:true\">\n.global main\n\nmain:\npush {ip, lr}\n\nldr r0,=hellostr\nbl  printf\n\nmov r0,#0\n\npop {ip, pc}\n\n.data\nhellostr:\n.asciz \"Hello, world!\\n\"\n<\/pre>\n<p>Assemble and link the application with <code>gcc helloworld.s -o helloworld<\/code> and run it.<\/p>\n<p>It doesn&#8217;t get much more straightforward than this, and you&#8217;ve learned three new ARM assembly instructions:  <code>ldr<\/code>, <code>mov<\/code>, and <code>bl<\/code>.  Apart from <code>push<\/code> and <code>pop<\/code>, which come up again later, the remaining lines are labels and directives to the GNU assembler.<\/p>\n<p>The <code>ldr<\/code> instruction <i>loads<\/i> some value from <i>memory<\/i> into a register.  This is key with ARM:  load instructions load from memory.  In the example above we&#8217;re loading the <i>address<\/i> of the beginning of the string into register <code>r0<\/code>.  A technical note:  the <code>ldr rx,=label<\/code> form used here is actually a pseudoinstruction (the assembler typically turns it into a PC-relative load from a literal pool), but let&#8217;s gloss over that.<\/p>\n<p><code>bl<\/code> <i>branches<\/i> to the label indicated (and updates the <i>link register<\/i> with the return address), and in our case, there is this magical <code>printf<\/code> we&#8217;re branching to.  More on that later.<\/p>\n<p>Finally, <code>mov r0,#0<\/code> places our program&#8217;s return code (zero) in <code>r0<\/code>.  Check it:<\/p>\n<pre class=\"lang:sh decode:true\">\npi@raspberrypi:~ $ .\/helloworld\nHello, world!\npi@raspberrypi:~ $ echo $?\n0\n<\/pre>\n<p>What if we change the <code>mov r0,#0<\/code> to <code>mov r0,#0xff<\/code>?  
Try it:<\/p>\n<pre class=\"lang:sh decode:true\">\npi@raspberrypi:~ $ gcc helloworld.s -o helloworld\npi@raspberrypi:~ $ .\/helloworld\nHello, world!\npi@raspberrypi:~ $ echo $?\n255\n<\/pre>\n<p>Okay, now for something interesting.  Let&#8217;s count down from 10 to 1 and then print <i>Hello, world!<\/i>.<\/p>\n<p><code>countdown.s<\/code>:<\/p>\n<pre class=\"lang:asm decode:true\">\n.global main\n\nmain:\npush {ip, lr}\n\nmov r5,#10\ndo:\nldr r0,=fmtstr\nmov r1,r5\nbl  printf\nsub r5,#1\ncmp r5,#0\nbne do\n\nldr r0,=hellostr\nbl  printf\n\nmov r0,#0xff\n\npop {ip, pc}\n\n.data\nfmtstr:\n.asciz \"%d\\n\"\nhellostr:\n.asciz \"Hello, world!\\n\"\n<\/pre>\n<p>Okay, that escalated quickly!  That&#8217;s one of the reasons assembly language is so much fun.  Let&#8217;s take a look at what is going on here and add some comments to our code.<\/p>\n<pre>\nmov r5,#10      \/* Count down from 10 *\/\ndo:\nldr r0,=fmtstr  \/* Load the fmtstr pointer for printf *\/\nmov r1,r5       \/* Copy over our count to r1 for printf *\/\nbl  printf      \/* Call printf *\/\nsub r5,#1       \/* Decrement the counter by one *\/\ncmp r5,#0       \/* Compare the counter to zero *\/\nbne do          \/* Branch back to the do label if not zero *\/\n<\/pre>\n<p>There are a few things to note here.  First, let&#8217;s talk about the use of <code>r5<\/code> and why that register was deliberately chosen.  It turns out that when calling routines in assembly you had better not keep values you still need in registers that the subroutine you&#8217;re calling is allowed to trash (<code>r0-r3<\/code>).  <code>printf<\/code> can overwrite these registers, but it must preserve <code>r4<\/code> and up, so we&#8217;ll use <code>r5<\/code> in our routine.<\/p>\n<p>Now, I will confess, I am not an assembly language expert, much less an ARM assembly language expert.  Someone may look at the above code and ask why I didn&#8217;t use the subtract-and-compare-to-zero instruction (if there is one) or some other technique.  
If there is a better way to write the above, please let me know!<\/p>\n<h2>Counting Up<\/h2>\n<p>In the above example we counted down.  Now let&#8217;s count up, and instead of counting from zero to some max, let&#8217;s count up from some minimum value to some maximum value.  In other words, we&#8217;ll step through a sequence of values using an increment of one.<\/p>\n<p><code>countup.s<\/code>:<\/p>\n<pre class=\"lang:asm decode:true\">\n.global main\n\nmain:\npush {ip, lr}\n\nldr r5,=min\nldr r5,[r5]\nldr r6,=max\nldr r6,[r6]\ndo:\nldr r0,=fmtstr\nmov r1,r5\nbl  printf\nadd r5,#1\ncmp r5,r6\nbne do\n\nmov r0,r5\n\npop {ip, pc}\n\n.data\nmin:\n.int 14\nmax:\n.int 29\nfmtstr:\n.asciz \"%d\\n\"\n<\/pre>\n<p>There&#8217;s some new syntax here, in particular <code>ldr rx,[rx]<\/code>.  This syntax means &#8220;load the value that is pointed to by the address in the register.&#8221;  It makes sense given the instruction immediately before it, <code>ldr r5,=min<\/code>, which loads the address identified by the label <code>min<\/code>.  To be clear, the <i>actual value<\/i> of that label is going to be dependent on the assembler, your application size, and where it gets loaded into memory.  Let&#8217;s look at an example of that:<\/p>\n<p><code>printmem.s<\/code>:<\/p>\n<pre class=\"lang:asm decode:true\">\n.global main\n\nmain:\npush {ip, lr}\n\nldr r0,=fmtstr\nldr r1,=x\nbl  printf\n\nmov r0,#0xff\n\npop {ip, pc}\n\n.data\nx:\n.int 128\nfmtstr:\n.asciz \"Address of x is %p\\n\"\n<\/pre>\n<p>Compile and execute this code to see something like <code>Address of x is 0x21028<\/code>.  Then move <code>x<\/code> to after <code>fmtstr<\/code> and you will see the address change.  What it will change to, again, is highly dependent on a number of factors.  Suffice it to say, using <code>ldr rx,=label<\/code> <i>loads the address<\/i> of the label into a register, <i>not<\/i> the value at the address.  
That is what we use <code>ldr rx,[rx]<\/code> for.<\/p>\n<p>Running our <code>countup<\/code> code indeed counts up from 14 to 28 (the loop exits as soon as <code>r5<\/code> equals <code>max<\/code>, so 29 itself is never printed), and if we look at the return code (<code>echo $?<\/code>) we get 29, the last value that was in r5.<\/p>\n<h2>Writing a Procedure<\/h2>\n<p>Here is a basic ARM assembly procedure that computes and returns <code>fib(n)<\/code>, the nth element of the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Fibonacci_number\">Fibonacci sequence<\/a>.  We chose this specifically to demonstrate the use of the stack with <code>push<\/code> and <code>pop<\/code>.<\/p>\n<p><code>fib.s<\/code>:<\/p>\n<pre>\n.global fib\nfib:\npush {r4-r5,lr}\nmov r4,r0      \/\/ r4 &lt;- n\ncmp r4,#0\nbeq fib0       \/\/ n == 0? f(0) = 0\ncmp r4,#1\nbeq fib1       \/\/ n == 1? f(1) = 1\n\n\/\/ f(n-1)\nsub  r4,#1     \/\/ r4 &lt;- n-1\nmov  r0,r4\nbl   fib       \/\/ r0 &lt;- fib(n-1)\nmov  r5,r0     \/\/ r5 &lt;- fib(n-1)\n\n\/\/ f(n-2)\nsub  r4,#1     \/\/ r4 &lt;- n-2\nmov  r0,r4\nbl   fib       \/\/ r0 &lt;- fib(n-2)\nadd  r0,r0,r5  \/\/ r0 &lt;- fib(n-2) + fib(n-1)\nb    return\n\nfib1:\nmov r0,#1\nb   return\n\nfib0:\nmov r0,#0\nb   return\n\nreturn:\npop {r4-r5,lr}\nbx  lr\n<\/pre>\n<p>Notice the use of <code>push<\/code> and the list of registers whose values we're going to save onto the stack.  In ARM assembly the Procedure Call Standard convention is that a subroutine must preserve registers r4-r11, so you save any of them you're going to work with.  In the above example we use r4 and r5 to compute <code>fib(n)<\/code>, so we first push r4 and r5 along with the <i>link register<\/i>, which each recursive <code>bl fib<\/code> overwrites.  
Before returning we <code>pop<\/code> the previous values off the stack back into the registers.<\/p>\n<p>To use this routine in C we can write:<\/p>\n<p><code>fibmain.c<\/code>:<\/p>\n<pre class=\"lang:c decode:true\">\n#include &lt;stdio.h&gt;\n\nint fib(int n);\n\nint main(void) {\n  for (int i = 10; i > 0; i--) {\n    int f = fib(i);\n    printf(\"fib(%d) = %d\\n\", i, f);\n  }\n  return 0;\n}\n<\/pre>\n<p>Then, compile, assemble, and link with <code>gcc fibmain.c fib.s -o fibmain<\/code>.  The procedure call standard passes the first four arguments to a procedure in <code>r0-r3<\/code>, which is why our first instruction is <code>mov r4,r0<\/code>:  it captures the <code>n<\/code> whose Fibonacci number we're calculating.<\/p>\n<h2>Taking the Average<\/h2>\n<p>Okay, one last routine.  We want to take the average of an array of integers.  In C that would look something like this:<\/p>\n<pre>\nfloat average(int* a, unsigned l) {\n  float avg = 0.0;\n  for (unsigned i = 0; i &lt; l; i++) {\n    avg += a[i];\n  }\n  return avg\/l;\n}\n<\/pre>\n<p>Here's a go at it in ARM assembly.<\/p>\n<p><code>average.s<\/code>:<\/p>\n<pre class=\"lang:asm decode:true\">\n.global average\naverage:\npush {r4-r5,lr}\nmov  r4,#0         \/\/ r4 &lt;- 0 (running sum)\nvmov s1,r1         \/\/ s1 &lt;- l (array length)\ndo:\nldr  r5,[r0],#4    \/\/ r5 &lt;- *a, then a += 4\nadd  r4,r5         \/\/ r4 &lt;- r4 + r5\nsub  r1,#1         \/\/ r1 &lt;- r1 - 1\ncmp  r1,#0\nbne  do            \/\/ if (r1 != 0) do\nvmov s0,r4         \/\/ s0 &lt;- sum\nvcvt.f32.s32 s0,s0 \/\/ Convert s0 to a fp value\nvcvt.f32.s32 s1,s1 \/\/ Convert s1 to a fp value\nvdiv.f32 s0,s0,s1  \/\/ s0 &lt;- s0\/s1\npop  {r4-r5,lr}\nbx   lr\n<\/pre>\n<p>There are some new instructions, an interesting form of the <code>ldr<\/code> instruction, and a new type of register.<\/p>\n<p>First, the <code>vmov<\/code> instruction and register <code>s1<\/code>.  
<code>vmov<\/code> copies a value, bit for bit and without any conversion, into a register of the <b>Vector Floating-Point Coprocessor<\/b>, <i>assuming<\/i> your processor has one (if it is a Pi it will).  <code>s1<\/code> is one of the single-precision floating-point registers.  Note that this is a 32-bit wide register that can store a C <code>float<\/code>.<\/p>\n<p>Next up is the <code>ldr r5,[r0],#4<\/code> instruction.  Recall that <code>ldr ra,[rb]<\/code> loads the value stored at the address in <code>rb<\/code> into <code>ra<\/code>.  The <code>#4<\/code> at the end instructs the processor to then <i>increment<\/i> the value in <code>rb<\/code> by 4, the size in bytes of one <code>int<\/code>.  In effect we are walking the array of integers whose starting address is in <code>r0<\/code>.<\/p>\n<p>Finally, once we add all of the values in the array we have the sum in the register <code>r4<\/code>.  To divide that sum by the length of the array (which was saved off in the floating-point register <code>s1<\/code>) we copy <code>r4<\/code> into <code>s0<\/code> and perform one last step:  <code>vcvt<\/code>.  <code>vcvt<\/code> converts between integers and floating-point numbers (which are, after all, an encoding).  
So <code>s0<\/code> gets converted to a floating-point value, as does <code>s1<\/code>, and then we perform our division with <code>vdiv<\/code>.<\/p>\n<p>Just as <code>r0<\/code> is the standard register for returning an <code>int<\/code> from a procedure call, <code>s0<\/code> holds our returned float value (this is the hard-float calling convention, which is what the Pi&#8217;s standard toolchain uses).<\/p>\n<p>We can use this function in our main routine.<\/p>\n<p><code>averagemain.c<\/code>:<\/p>\n<pre class=\"lang:c decode:true\">\n#include &lt;stdio.h&gt;\n\nfloat average(int* a, unsigned l);\n\nint main(void) {\n  int a[] = {82,  98, 90, 88, 87, 75};\n\n  float avg = average(a, sizeof(a)\/sizeof(int));\n\n  printf(\"Average:  %2.2f\\n\", avg);\n}\n<\/pre>\n<p>Compile with <code>gcc averagemain.c average.s -o averagemain<\/code> and run.<\/p>\n<pre>\n.\/averagemain\nAverage:  86.67\n<\/pre>\n<h2>Closing Thoughts<\/h2>\n<p>This post has been a lot of fun to write because assembly is actually fun to write and serves as a reminder that even the highest-level languages get compiled down to instructions that the underlying CPU can execute.  One group of instructions we didn't touch on is the <b>store<\/b> instructions.  These are used to save (store) the contents of registers to memory.  Perhaps next time.<\/p>\n<p>Once again, all of the code in this post can be found in the <a href=\"https:\/\/github.com\/iachievedit\/armassembly\">armassembly<\/a> repository on GitHub.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Don&#8217;t ask me why I started looking at writing basic ARM assembly routines. Perhaps it&#8217;s for the thrill of it, or taking a walk down memory lane. My first assembly language program was for an IBM System\/360 using WYLBUR in college. 
This post is not a tutorial on assembly language itself, or the ARM processor [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2960,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[99,100,13],"tags":[],"class_list":["post-3981","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-arm","category-assembly","category-raspberry-pi"],"_links":{"self":[{"href":"https:\/\/dev.iachieved.it\/iachievedit\/wp-json\/wp\/v2\/posts\/3981"}],"collection":[{"href":"https:\/\/dev.iachieved.it\/iachievedit\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dev.iachieved.it\/iachievedit\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dev.iachieved.it\/iachievedit\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dev.iachieved.it\/iachievedit\/wp-json\/wp\/v2\/comments?post=3981"}],"version-history":[{"count":18,"href":"https:\/\/dev.iachieved.it\/iachievedit\/wp-json\/wp\/v2\/posts\/3981\/revisions"}],"predecessor-version":[{"id":3999,"href":"https:\/\/dev.iachieved.it\/iachievedit\/wp-json\/wp\/v2\/posts\/3981\/revisions\/3999"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dev.iachieved.it\/iachievedit\/wp-json\/wp\/v2\/media\/2960"}],"wp:attachment":[{"href":"https:\/\/dev.iachieved.it\/iachievedit\/wp-json\/wp\/v2\/media?parent=3981"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dev.iachieved.it\/iachievedit\/wp-json\/wp\/v2\/categories?post=3981"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dev.iachieved.it\/iachievedit\/wp-json\/wp\/v2\/tags?post=3981"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}