Assembly Language Programming with asm-x86

Overview

asm-x86 is an assembler written in Forth, for use from within a Forth environment. Since Forth itself is usually interactive, the asm-x86 assembler may be considered an interactive assembly language programming environment. The version provided with kForth may be used to create words using the 80x86 instruction set. Such words may be executed interactively, or used within the definitions of words written in Forth, just as with any ordinary high-level Forth definitions. Words defined in assembly language provide the advantage of speed and precise control of the processor, and in certain cases, simplicity, when compared with the higher level Forth definitions. Some notes on the development of the asm-x86 assembler are given in the comments at the beginning of asm-x86.4th.

Some basic familiarity with assembly language programming on Intel x86 and compatible processors, and the architecture of these processors, is assumed for this introduction. However, a few of these basics will be covered in the tutorial for using asm-x86.

Quick Start

asm-x86 is loaded just as any other Forth program from kforth:

include asm-x86

Once loaded, the user may type definitions of words written in the asm-x86 assembly language, or include files which contain such definitions.

Assembly language word definitions begin with CODE and end with END-CODE. Between these two statements are the assembly language statements which explicitly define the processor instructions to be performed by the word. The words CODE and END-CODE also introduce some instructions for the purpose of creating an interface between the Forth environment and the assembly code, for example in setting the value of the EBX machine register to point to the top of the stack upon entry into the word. Thus, the structure of a word written for asm-x86 is of the form

CODE name
  assembler statement
    "      "
    "      "
    :      :
END-CODE

where name is the name of the word appearing in the dictionary. The new word may be executed from the Forth environment, just like an ordinary Forth word, and arguments may be passed to it on the Forth data stack. We will refer to such words as "CODE words".

asm-x86 Conventions

An assembler statement consists of zero, one, or more operands, and one machine instruction, i.e. a built-in operation of the 80x86 processor. asm-x86 is a postfix assembler, meaning that the operands are specified before the instruction. This is consistent with the stack-oriented nature of Forth. An operand may be a machine register, a memory reference, or an immediate value. Ultimately an operand is, of course, a number, but within the context of a particular instruction, the associated number indicates which physical source supplies the value of the operand. Much of the work of the assembler is then to translate the specified operands into the appropriate numbers for a given instruction. These numbers form the machine code which is stored in the memory associated with the word, and which is actually executed by the processor when the word is used. An example of a typical assembler statement is

4 [ebx] eax mov,

This particular statement has three operands, followed by the instruction, "MOV,". In asm-x86, the above statement generates machine code to perform the operation of copying 32 bits from a memory location into the EAX register. For the above example, the memory address is specified in a somewhat complex way, known as indirect addressing. The first two operands specify that the base address is stored in the EBX register, and the offset from the base address is 4, i.e., address = 4 + value in EBX register. The assembler translates the above statement into the sequence of bytes, represented in hexadecimal,

8B 43 04

which is the corresponding machine code for the instruction and its operands. When executed, these three bytes tell the processor to compute the address in memory by adding 4 to the value in the EBX register, then retrieve the 32-bit value at that memory location, and finally to copy the value into the EAX register.
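The three bytes can be reproduced by hand from the standard x86 instruction format: an opcode byte, a ModRM byte that packs the addressing mode and the two register numbers, and a one-byte displacement. The following Python sketch illustrates that calculation only. It is not part of asm-x86, the helper name encode_mov_load is hypothetical, and it handles just an 8-bit displacement with the four registers listed.

```python
# Register numbers used in the ModRM byte (standard x86 encoding).
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3}


def encode_mov_load(disp8: int, base: str, dest: str) -> bytes:
    """Encode MOV dest, [base + disp8] for 32-bit registers.

    Hypothetical helper for illustration; it covers only an 8-bit
    displacement and the four registers listed in REG.
    """
    if not -128 <= disp8 <= 127:
        raise ValueError("displacement must fit in a signed byte")
    opcode = 0x8B   # MOV r32, r/m32
    mod = 0b01      # memory operand with an 8-bit displacement
    modrm = (mod << 6) | (REG[dest] << 3) | REG[base]
    return bytes([opcode, modrm, disp8 & 0xFF])


# Equivalent of the asm-x86 statement:  4 [ebx] eax mov,
code = encode_mov_load(4, "ebx", "eax")
print(code.hex(" ").upper())   # prints 8B 43 04
```

The middle byte, 43 hex, is binary 01 000 011: mode 01 (base register plus 8-bit displacement), register field 000 (EAX, the destination), and base field 011 (EBX).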

Notice that in our example of an assembler statement above, the instruction is "MOV," with the comma being part of the instruction. In asm-x86, all instructions have the comma suffix. Also, note the order of our operands. Even though there are three operands, the first two operands specify a source, and the third a destination. In asm-x86, the order of operands is such that the source precedes the destination:

source destination instruction

(other assemblers may have the opposite convention, so one must be careful in comparing assembly code written for different assemblers, even if they target the same processor). Now, consider the meaning of the following statement

eax ebx add,

Clearly the meaning of the above statement is to add the contents of the EAX and EBX machine registers. However, unless the ordering of operands assumed by the assembler is known, one cannot determine whether the result is stored in EAX or in EBX. For the asm-x86 convention, the sum will be stored in EBX.
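The effect of the operand order can be checked with a toy model. The Python sketch below is a hypothetical interpreter, not asm-x86 itself: it understands only register operands and the two instructions "mov," and "add,", and it applies the source-before-destination rule described above.

```python
def run(statement: str, regs: dict) -> None:
    """Execute one 'source destination instruction' statement on regs.

    Hypothetical toy interpreter: register operands only, and only
    the mov, and add, instructions.
    """
    src, dst, instr = statement.split()
    if instr == "mov,":
        regs[dst] = regs[src]
    elif instr == "add,":
        # Result goes to the destination, with 32-bit wraparound.
        regs[dst] = (regs[dst] + regs[src]) & 0xFFFFFFFF
    else:
        raise ValueError(f"unsupported instruction: {instr}")


regs = {"eax": 5, "ebx": 7}
run("eax ebx add,", regs)
print(regs)   # prints {'eax': 5, 'ebx': 12}
```

EAX, the source, is unchanged, and EBX, the destination, receives the sum. Writing the operands in the opposite order would leave the sum in EAX instead.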

Passing Arguments From kForth to CODE Words

As alluded to earlier, the Forth environment must place its stack pointer in a location accessible to the assembler statements, allowing a CODE word access to arguments passed to it on the Forth data stack. Although one might use a variable in which to store the stack pointer, it is much more convenient and faster to place the stack pointer in a CPU register. The particular register is EBX, and the address of the top of the stack (TOS) may be assumed to be stored in this register upon entry into the CODE word:

EBX contains the address of TOS

The CODE word may increment the value of EBX by multiples of the cell size (4 bytes), thereby dropping arguments from the Forth stack as it uses them to perform its computation. Upon reaching the end of the CODE word, the value of the EBX register will specify the new TOS. The word END-CODE accomplishes this reset of the Forth stack pointer, using the value of EBX. END-CODE also introduces a return instruction from the machine code, "RET,", so it is not necessary for the programmer to explicitly write "RET," (although doing so is acceptable).
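The stack-pointer arithmetic can be pictured with a small model. The Python sketch below is an illustration under stated assumptions, not a description of kForth internals: it assumes the data stack grows toward lower addresses (which is why incrementing EBX drops items), and the class name ForthStack and the base address are hypothetical.

```python
CELL = 4  # cell size in bytes, as stated above


class ForthStack:
    """Toy model: a data stack in flat memory, with EBX holding the address of TOS."""

    def __init__(self, base: int = 0x1000):
        self.mem = {}     # address -> 32-bit cell
        self.ebx = base   # assumed starting address for an empty stack

    def push(self, value: int) -> None:
        self.ebx -= CELL  # assumption: the stack grows toward lower addresses
        self.mem[self.ebx] = value & 0xFFFFFFFF

    def tos(self) -> int:
        return self.mem[self.ebx]

    def drop(self, n: int = 1) -> None:
        self.ebx += n * CELL  # incrementing EBX by n cells drops n items


s = ForthStack()
s.push(10)
s.push(20)
s.drop()          # EBX advances by one cell
print(s.tos())    # prints 10
```

Dropping an item does not erase it from memory. It only moves the pointer, so the cell below becomes the new TOS.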

Operands

With the above preliminaries, we are now in a position to illustrate some examples of actual CODE words, and discuss further the specification of operands in asm-x86. Some examples are taken from asm-x86-examples.4th, provided with kForth.

CODE adrop ( n -- | drop an item from the Forth stack using assembly code )
	4 # ebx add,
END-CODE

The above example is equivalent to the Forth word DROP. It consists of a single assembler statement, which adds the immediate value 4 (the result of 1 CELLS) to the EBX register, thereby advancing the stack pointer. The assembler word "#" is used to inform the assembler that "4" is an operand of type immediate value, and its role is to ensure that the operand is not confused with some other type, such as a register or a memory reference.

We may also use ordinary Forth CONSTANTs and VARIABLEs to supply immediate values:

1 CELLS CONSTANT  TCELL

CODE adrop ( n -- )
       TCELL # ebx add,
END-CODE

In the above example, the Forth word TCELL returns its value onto the stack, and "#" marks it as an immediate value.

Now, define a variable "v".

VARIABLE v

Consider the meaning of the following assembler statement:

        v #  edx  mov,

In the above statement, we are moving an immediate value into the destination operand, which is the EDX register. What is this immediate value? It is not the value stored in the variable "v", but rather the address of "v". This is consistent with Forth: execution of a variable name returns the address of the variable, rather than the value stored in it. Now, what if, instead of the address of "v", we wished to copy the value stored in the variable to the EDX register? To do this, we must indicate to the assembler that the source operand is a memory reference, rather than an immediate value. This is accomplished by

        v #@  edx  mov,

where the assembler word "#@" informs the assembler that the item on the stack is to be regarded as a memory reference. The above statement will then generate the machine code to retrieve the value stored at "v" and copy this into the EDX register.
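The difference between the two statements can be modeled in a few lines. The Python sketch below is only an illustration. The address 0x2000 assigned to "v" and the stored value 42 are made-up values, and the two helper functions are hypothetical stand-ins for the machine code that the "#" and "#@" forms generate.

```python
memory = {}          # address -> stored cell
V_ADDR = 0x2000      # hypothetical address assigned to the variable v
memory[V_ADDR] = 42  # as if "42 v !" had been executed in Forth


def mov_immediate(value: int, regs: dict, dst: str) -> None:
    """Model of  value # dst mov,  : the operand itself is copied."""
    regs[dst] = value


def mov_memory(addr: int, regs: dict, dst: str) -> None:
    """Model of  addr #@ dst mov,  : the cell stored at addr is copied."""
    regs[dst] = memory[addr]


regs = {"edx": 0}
mov_immediate(V_ADDR, regs, "edx")   # v #  edx mov,
print(hex(regs["edx"]))              # prints 0x2000, the address of v
mov_memory(V_ADDR, regs, "edx")      # v #@ edx mov,
print(regs["edx"])                   # prints 42, the value stored in v
```

In both cases the number that "v" leaves on the Forth stack at assembly time is the same. Only the word that follows it decides whether the processor treats that number as data or as an address to read from.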