x86 Disassembly/Optimization Examples

From Wikibooks, the open-content textbooks collection

Example: Optimized vs Non-Optimized Code

The following example is adapted from an algorithm presented in Knuth (vol. 1, chapter 1) that finds the greatest common divisor of two integers. Compare the listing file of this function when compiler optimizations are turned off and on.

/*line 1*/
 int EuclidsGCD(int m, int n) /*we want to find the GCD of m and n*/
 {
        int q, r; /*q is the quotient, r is the remainder*/
        while(1)
        {
                q = m / n; /*find q and r*/
                r = m % n;
                if(r == 0) /*if r is 0, return our n value*/
                {
                        return n;
                }
                m = n; /*set m to the current n value*/
                n = r; /*set n to our current remainder value*/
        } /*repeat*/
 }

Compiling with the Microsoft C compiler, we generate a listing file using no optimization:

PUBLIC        _EuclidsGCD
 _TEXT  SEGMENT
 _r$ = -8      ; size = 4
 _q$ = -4      ; size = 4
 _m$ = 8       ; size = 4
 _n$ = 12      ; size = 4
 _EuclidsGCD PROC NEAR
 ; Line 2
        push   ebp
        mov    ebp, esp
        sub    esp, 8
 $L477:
 ; Line 4
        mov    eax, 1
        test   eax, eax
        je     SHORT $L473
 ; Line 6
        mov    eax, DWORD PTR _m$[ebp]
        cdq
        idiv   DWORD PTR _n$[ebp]
        mov    DWORD PTR _q$[ebp], eax
 ; Line 7
        mov    eax, DWORD PTR _m$[ebp]
        cdq
        idiv   DWORD PTR _n$[ebp]
        mov    DWORD PTR _r$[ebp], edx
 ; Line 8
        cmp    DWORD PTR _r$[ebp], 0
        jne    SHORT $L479
 ; Line 10
        mov    eax, DWORD PTR _n$[ebp]
        jmp    SHORT $L473
 $L479:
 ; Line 12
        mov    ecx, DWORD PTR _n$[ebp]
        mov    DWORD PTR _m$[ebp], ecx
 ; Line 13
        mov    edx, DWORD PTR _r$[ebp]
        mov    DWORD PTR _n$[ebp], edx
 ; Line 14
        jmp    SHORT $L477
 $L473:
 ; Line 15
        mov    esp, ebp
        pop    ebp
        ret    0
 _EuclidsGCD ENDP
 _TEXT  ENDS
 END

Notice how there is a very clear correspondence between the lines of C code and the lines of the ASM code. The "; Line x" comments that the compiler adds are very helpful in that respect.

Next, we compile the same function using a series of optimizations to stress speed over size:

cl.exe /Tceuclids.c /Fa /Ogt2

and we produce the following listing:

PUBLIC        _EuclidsGCD
 _TEXT  SEGMENT
 _m$ = 8       ; size = 4
 _n$ = 12       ; size = 4
 _EuclidsGCD PROC NEAR        
 ; Line 7
        mov    eax, DWORD PTR _m$[esp-4]
        push   esi
        mov    esi, DWORD PTR _n$[esp]
        cdq
        idiv   esi
        mov    ecx, edx
 ; Line 8
        test   ecx, ecx
        je     SHORT $L563
 $L547:
 ; Line 12
        mov    eax, esi
        cdq
        idiv   ecx
 ; Line 13
        mov    esi, ecx
        mov    ecx, edx
        test   ecx, ecx
        jne    SHORT $L547
 $L563:
 ; Line 10
        mov    eax, esi
        pop    esi
 ; Line 15
        ret    0
 _EuclidsGCD ENDP
 _TEXT  ENDS
 END

As you can see, the optimized version is significantly shorter than the non-optimized version. Some of the key differences include:

  • The optimized version does not prepare a standard stack frame. This is important to note, because new reversers often assume that functions always start and end with proper stack frames, and this is clearly not the case. EBP isn't used, no stack space is reserved for local variables (they are kept in registers instead, and ESP changes only through the push and pop that preserve ESI), and no subfunctions are called. Dropping the frame setup and teardown cuts 5 instructions.
  • The "mov eax, 1", "test eax, eax", "je" sequence in the non-optimized output, under "; Line 4", is unnecessary. The while-loop is defined by "while(1)", so the loop always continues, and this extra code is safely cut out. Notice also that there is no unconditional jump at the bottom of the loop as might be expected: the "if(r == 0) return n;" statement has become the new loop condition.
  • The structure of the function is altered greatly: the division of m by n to produce q and r is performed in two places, once at the beginning of the function to initialize and once inside the loop. The value of r is also tested in two places, after each division. The compiler is very liberal with how it assigns storage in the function, and readily discards values that are not needed, such as q.

Example: Manual Optimization

The following lines of assembly code are not optimized, but they can be optimized very easily. Can you find a way to optimize these lines?

mov    eax, 1
test   eax, eax
je     SHORT $L473

These lines are the code generated for the "while( 1 )" C statement; to be exact, they represent the loop break condition. Because this is an infinite loop, we can conclude that these lines are unnecessary.

"mov eax, 1" initializes eax.

The test immediately afterwards checks whether eax is nonzero. Because eax is always nonzero (eax = 1) at this point, the conditional jump is never taken, so it can be removed along with the "mov" and the "test".

In effect, the assembly is checking whether the constant 1 is zero. Note also that the C code for an infinite for loop:

for( ; ; )
 {
    ...
 }
typically does not produce this meaningless assembly code in the first place, even though it is logically the same as "while( 1 )".
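
As a small illustration, the two loop forms can be written side by side. This is only a sketch: the function names and the halving task are hypothetical and exist just to give the loops a body. Whether a given compiler emits the redundant test for the while(1) form in an unoptimized build varies by toolchain, so it is worth generating listings for both and comparing them.

```c
/* Hypothetical example: count how many times n can be halved
   before it reaches 0, written with while(1). */
int halvings_while(unsigned int n)
{
    int steps = 0;
    while (1)
    {
        if (n == 0)
        {
            return steps;   /* the only exit from the loop */
        }
        n /= 2;
        steps++;
    }
}

/* The same function written with for(;;). The loop has no
   controlling expression at all, so there is nothing to test. */
int halvings_for(unsigned int n)
{
    int steps = 0;
    for (;;)
    {
        if (n == 0)
        {
            return steps;
        }
        n /= 2;
        steps++;
    }
}
```

Both functions return the same value for every input; only the spelling of the infinite loop differs.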

Example: Trace Variables

Here are the C code and the optimized assembly listing of the EuclidsGCD function from the example above. Can you determine which registers contain the variables r and q?

/*line 1*/
 int EuclidsGCD(int m, int n) /*we want to find the GCD of m and n*/
 {
        int q, r; /*q is the quotient, r is the remainder*/
        while(1)
        {
                q = m / n; /*find q and r*/
                r = m % n;
                if(r == 0) /*if r is 0, return our n value*/
                {
                        return n;
                }
                m = n; /*set m to the current n value*/
                n = r; /*set n to our current remainder value*/
        } /*repeat*/
 }

PUBLIC        _EuclidsGCD
 _TEXT  SEGMENT
 _m$ = 8       ; size = 4
 _n$ = 12       ; size = 4
 _EuclidsGCD PROC NEAR        
 ; Line 7
        mov    eax, DWORD PTR _m$[esp-4]
        push   esi
        mov    esi, DWORD PTR _n$[esp]
        cdq
        idiv   esi
        mov    ecx, edx
 ; Line 8
        test   ecx, ecx
        je     SHORT $L563
 $L547:
 ; Line 12
        mov    eax, esi
        cdq
        idiv   ecx
 ; Line 13
        mov    esi, ecx
        mov    ecx, edx
        test   ecx, ecx
        jne    SHORT $L547
 $L563:
 ; Line 10
        mov    eax, esi
        pop    esi
 ; Line 15
        ret    0
 _EuclidsGCD ENDP
 _TEXT  ENDS
 END

At the beginning of the function, eax contains m, and esi contains n. After the instruction "idiv esi" executes, eax contains the quotient (q), and edx contains the remainder (r). The instruction "mov ecx, edx" moves r into ecx, where it stays for the rest of the loop. The value of q is never read: eax is overwritten by "mov eax, esi" before the next division, so q is simply discarded.
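
The register assignments described above can be sketched in C by naming each variable after the register that holds it. This is an illustrative walk-through, not compiler output: the function name EuclidsGCD_regs is hypothetical, and the assert on n is an added assumption (the listing itself performs no such check, and idiv would fault on a zero divisor).

```c
#include <assert.h>

/* Register-by-register walk-through of the optimized listing.
   Each C variable stands in for the register of the same name. */
int EuclidsGCD_regs(int m, int n)
{
    int eax, edx, esi, ecx;

    assert(n != 0);          /* assumption: caller never passes n == 0 */

    eax = m;                 /* mov eax, _m$     : eax = m             */
    esi = n;                 /* mov esi, _n$     : esi = n             */
    edx = eax % esi;         /* cdq / idiv esi   : edx = r             */
    eax = eax / esi;         /*                    eax = q, never read */
    ecx = edx;               /* mov ecx, edx     : ecx = r             */

    while (ecx != 0)         /* test ecx, ecx with je / jne            */
    {
        eax = esi;           /* mov eax, esi     : dividend = n        */
        edx = eax % ecx;     /* cdq / idiv ecx   : edx = n % r         */
        eax = eax / ecx;     /*                    eax = q, never read */
        esi = ecx;           /* mov esi, ecx     : n = r               */
        ecx = edx;           /* mov ecx, edx     : r = new remainder   */
    }

    return esi;              /* mov eax, esi     : return n            */
}
```

Reading the loop this way makes it clear that r lives in ecx and n lives in esi, while q only ever exists briefly in eax.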

Example: Decompile Optimized Code

Below is the optimized listing file of the EuclidsGCD function presented in the examples above. Can you decompile this assembly code listing into equivalent "optimized" C code? How is the optimized version different in structure from the non-optimized version?

PUBLIC        _EuclidsGCD
 _TEXT  SEGMENT
 _m$ = 8       ; size = 4
 _n$ = 12       ; size = 4
 _EuclidsGCD PROC NEAR        
 ; Line 7
        mov    eax, DWORD PTR _m$[esp-4]
        push   esi
        mov    esi, DWORD PTR _n$[esp]
        cdq
        idiv   esi
        mov    ecx, edx
 ; Line 8
        test   ecx, ecx
        je     SHORT $L563
 $L547:
 ; Line 12
        mov    eax, esi
        cdq
        idiv   ecx
 ; Line 13
        mov    esi, ecx
        mov    ecx, edx
        test   ecx, ecx
        jne    SHORT $L547
 $L563:
 ; Line 10
        mov    eax, esi
        pop    esi
 ; Line 15
        ret    0
 _EuclidsGCD ENDP
 _TEXT  ENDS
 END

Altering the conditions to maintain the same structure gives us:

int EuclidsGCD(int m, int n)
 {
     int r;
     r = m % n; /*first idiv: only the remainder is kept*/
     if(r != 0)
     {
         do
         {
             m = n; /*mov eax, esi*/
             n = r; /*mov esi, ecx*/
             r = m % n; /*idiv ecx: old n divided by old r*/
         }while(r != 0);
     }
     return n;
 }

It is up to the reader to compile this new "optimized" C code, and determine if there is any performance increase. Try compiling this new code without optimizations first, and then with optimizations. Compare the new assembly listings to the previous ones.
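
Before comparing timings or listings, it is worth confirming that a restructured version still returns the same results as the original. The following is a minimal sketch of such a check. The names gcd_original, gcd_restructured, and versions_agree are hypothetical, the restructured function is one reading of the optimized listing, and the unused quotient q is left out of both versions.

```c
#include <assert.h>

/* The straightforward loop from the original source (q omitted). */
int gcd_original(int m, int n)
{
    int r;
    while (1)
    {
        r = m % n;
        if (r == 0)
        {
            return n;
        }
        m = n;
        n = r;
    }
}

/* A restructured loop that mirrors the shape of the optimized
   listing: one division before the loop, one inside it. */
int gcd_restructured(int m, int n)
{
    int r = m % n;
    if (r != 0)
    {
        do
        {
            m = n;
            n = r;
            r = m % n;
        } while (r != 0);
    }
    return n;
}

/* Returns 1 if both versions agree for all 1 <= m, n <= limit,
   and 0 at the first disagreement. */
int versions_agree(int limit)
{
    int m, n;
    assert(limit >= 1);
    for (m = 1; m <= limit; m++)
    {
        for (n = 1; n <= limit; n++)
        {
            if (gcd_original(m, n) != gcd_restructured(m, n))
            {
                return 0;
            }
        }
    }
    return 1;
}
```

Calling versions_agree with a modest limit exercises both the early-return path (n divides m) and the multi-iteration path.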

Example: Instruction Pairings

Q
Why does the dec/jnz combo (jnz and jne are two names for the same instruction) operate faster than the equivalent loopnz?
A
The dec/jnz pair operates faster than loopnz for several reasons. First, dec and jnz can be handled by different units of the NetBurst pipeline, so they can be executed simultaneously. On top of that, dec and jnz each require few cycles to execute, while loopnz (and all the loop instructions, for that matter) takes more cycles to complete. For this reason, loop instructions are rarely seen in the output of good compilers.

Example: Duff's Device

What does the following C code function do? Is it useful? Why or why not?

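void MyFunction(int *arrayA, int *arrayB, int cnt)
{
  if(cnt <= 0) return;   /* nothing to copy */
  switch(cnt % 6)        /* jump into the middle of the loop body */
  {
    while(cnt != 0)      /* tested only after each pass through the body */
    {
      case 0:
        cnt--; arrayA[cnt] = arrayB[cnt];
      case 5:
        cnt--; arrayA[cnt] = arrayB[cnt];
      case 4:
        cnt--; arrayA[cnt] = arrayB[cnt];
      case 3:
        cnt--; arrayA[cnt] = arrayB[cnt];
      case 2:
        cnt--; arrayA[cnt] = arrayB[cnt];
      case 1:
        cnt--; arrayA[cnt] = arrayB[cnt];
    }
  }
}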

This piece of code is known as a Duff's device or "Duff's machine". It is used to partially unroll a loop for efficiency. Notice the strange way that the while() is nested inside the switch statement. Two arrays of integers are passed to the function together with an element count, and each full pass through the while loop body copies 6 elements from arrayB to arrayA, working from the end of the arrays toward the start. The switch statement, since it is outside the while loop, is evaluated only once, at the beginning of the function. It takes cnt modulo 6: if cnt is not evenly divisible by 6, execution jumps into the middle of the loop body, so the first, partial pass copies only the leftover elements. Every later pass copies exactly 6 elements, so the count needs to be tested only once per 6 copies rather than after every element, and the copy still stops exactly at the start of the arrays without overrunning them.

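Duff's Device has historically been considered one of the more efficient general-purpose methods for copying strings, arrays, or data streams. Whether it is still useful depends on the toolchain: modern optimizing compilers can unroll simple loops themselves, so it is worth measuring against a plain loop or memcpy before relying on it.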
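
To see what the device computes, it can be compared against a plain element-by-element loop. The sketch below assumes cnt is the number of elements to copy (valid indices 0 to cnt-1); the names duff_copy, plain_copy, and copies_agree are hypothetical, and the 64-element buffers are an arbitrary choice for the check.

```c
#include <assert.h>

/* Duff-style copy, unrolled by 6: copies cnt ints, last element first. */
void duff_copy(int *dst, const int *src, int cnt)
{
    if (cnt <= 0)
        return;
    switch (cnt % 6)
    {
        do
        {
    case 0: cnt--; dst[cnt] = src[cnt];
    case 5: cnt--; dst[cnt] = src[cnt];
    case 4: cnt--; dst[cnt] = src[cnt];
    case 3: cnt--; dst[cnt] = src[cnt];
    case 2: cnt--; dst[cnt] = src[cnt];
    case 1: cnt--; dst[cnt] = src[cnt];
        } while (cnt > 0);
    }
}

/* The loop the device replaces: one count test per element. */
void plain_copy(int *dst, const int *src, int cnt)
{
    while (cnt > 0)
    {
        cnt--;
        dst[cnt] = src[cnt];
    }
}

/* Returns 1 if both copies produce identical buffers for every
   length from 0 to max_len, including untouched trailing slots. */
int copies_agree(int max_len)
{
    int src[64], a[64], b[64];
    int len, i;

    assert(max_len >= 0 && max_len <= 64);
    for (i = 0; i < 64; i++)
        src[i] = i * 3 + 1;

    for (len = 0; len <= max_len; len++)
    {
        for (i = 0; i < 64; i++)
        {
            a[i] = -1;
            b[i] = -1;
        }
        duff_copy(a, src, len);
        plain_copy(b, src, len);
        for (i = 0; i < 64; i++)
        {
            if (a[i] != b[i])
                return 0;
        }
    }
    return 1;
}
```

Checking every length from 0 upward covers each of the six entry points into the unrolled body, as well as the empty case that the guard handles.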