Jump to content

x86 Disassembly/Optimization Examples

From Wikibooks, open books for an open world

Example: Optimized vs Non-Optimized Code

[edit | edit source]

The following example is adapted from an algorithm presented in Knuth(vol 1, chapt 1) used to find the greatest common denominator of 2 integers. Compare the listing file of this function when compiler optimizations are turned on and off.

 /*line 1*/
 int EuclidsGCD(int m, int n) /*we want to find the GCD of m and n*/
 {
 	int q, r; /*q is the quotient, r is the remainder*/
 	while(1)
 	{
 		q = m / n; /*find q and r*/
 		r = m % n;
 		if(r == 0) /*if r is 0, return our n value*/
 		{
 			return n;
 		}
 		m = n; /*set m to the current n value*/
 		n = r; /*set n to our current remainder value*/
 	} /*repeat*/
 }

Compiling with the Microsoft C compiler, we generate a listing file using no optimization:

 PUBLIC	_EuclidsGCD
 _TEXT	SEGMENT
 _r$ = -8	; size = 4
 _q$ = -4	; size = 4
 _m$ = 8	; size = 4
 _n$ = 12	; size = 4
 _EuclidsGCD PROC NEAR
 ; Line 2
 	push	ebp
 	mov	ebp, esp
 	sub	esp, 8
 $L477:
 ; Line 4
 	mov	eax, 1
 	test	eax, eax
 	je	SHORT $L473
 ; Line 6
 	mov	eax, DWORD PTR _m$[ebp]
 	cdq
 	idiv	DWORD PTR _n$[ebp]
 	mov	DWORD PTR _q$[ebp], eax
 ; Line 7
 	mov	eax, DWORD PTR _m$[ebp]
 	cdq
 	idiv	DWORD PTR _n$[ebp]
 	mov	DWORD PTR _r$[ebp], edx
 ; Line 8
 	cmp	DWORD PTR _r$[ebp], 0
 	jne	SHORT $L479
 ; Line 10
 	mov	eax, DWORD PTR _n$[ebp]
 	jmp	SHORT $L473
 $L479:
 ; Line 12
 	mov	ecx, DWORD PTR _n$[ebp]
 	mov	DWORD PTR _m$[ebp], ecx
 ; Line 13
 	mov	edx, DWORD PTR _r$[ebp]
 	mov	DWORD PTR _n$[ebp], edx
 ; Line 14
 	jmp	SHORT $L477
 $L473:
 ; Line 15
 	mov	esp, ebp
 	pop	ebp
 	ret	0
 _EuclidsGCD ENDP
 _TEXT	ENDS
 END

Notice how there is a very clear correspondence between the lines of C code, and the lines of the ASM code. the addition of the "; line x" directives is very helpful in that respect.

Next, we compile the same function using a series of optimizations to stress speed over size:

cl.exe /Tceuclids.c /Fa /Ogt2

and we produce the following listing:

 PUBLIC	_EuclidsGCD
 _TEXT	SEGMENT
 _m$ = 8	; size = 4
 _n$ = 12       ; size = 4
 _EuclidsGCD PROC NEAR	
 ; Line 7
 	mov	eax, DWORD PTR _m$[esp-4]
 	push	esi
 	mov	esi, DWORD PTR _n$[esp]
 	cdq
 	idiv	esi
 	mov	ecx, edx
 ; Line 8
 	test	ecx, ecx
 	je	SHORT $L563
 $L547:
 ; Line 12
 	mov	eax, esi
 	cdq
 	idiv	ecx
 ; Line 13
 	mov	esi, ecx
 	mov	ecx, edx
 	test	ecx, ecx
 	jne	SHORT $L547
 $L563:
 ; Line 10
 	mov	eax, esi
 	pop	esi
 ; Line 15
 	ret	0
 _EuclidsGCD ENDP
 _TEXT	ENDS
 END

As you can see, the optimized version is significantly shorter than the non-optimized version. Some of the key differences include:

  • The optimized version does not prepare a standard stack frame. This is important to note, because many times new reversers assume that functions always start and end with proper stack frames, and this is clearly not the case. EBP isnt being used, ESP isnt being altered (because the local variables are kept in registers, and not put on the stack), and no subfunctions are called. 5 instructions are cut by this.
  • The "test EAX, EAX" series of instructions in the non-optimized output, under ";line 4" is all unnecessary. The while-loop is defined by "while(1)" and therefore the loop always continues. this extra code is safely cut out. Notice also that there is no unconditional jump in the loop like would be expected: the "if(r == 0) return n;" instruction has become the new loop condition.
  • The structure of the function is altered greatly: the division of m and n to produce q and r is performed in this function twice: once at the beginning of the function to initialize, and once at the end of the loop. Also, the value of r is tested twice, in the same places. The compiler is very liberal with how it assigns storage in the function, and readily discards values that are not needed.

Example: Manual Optimization

[edit | edit source]

The following lines of assembly code are not optimized, but they can be optimized very easily. Can you find a way to optimize these lines?

mov	eax, 1
test	eax, eax
je	SHORT $L473

The code in this line is the code generated for the "while( 1 )" C code, to be exact, it represents the loop break condition. Because this is an infinite loop, we can assume that these lines are unnecessary.

"mov eax, 1" initializes eax.

the test immediately afterwards tests the value of eax to ensure that it is nonzero. because eax will always be nonzero (eax = 1) at this point, the conditional jump can be removed along whith the "mov" and the "test".

The assembly is actually checking whether 1 equals 1. Another fact is, that the C code for an infinite FOR loop:

 for( ; ; )
 {
    ...
 }

would not create such a meaningless assembly code to begin with, and is logically the same as "while( 1 )".

Example: Trace Variables

[edit | edit source]

Here are the C code and the optimized assembly listing from the EuclidGCD function, from the example above. Can you determine which registers contain the variables r and q?

 /*line 1*/
 int EuclidsGCD(int m, int n) /*we want to find the GCD of m and n*/
 {
 	int q, r; /*q is the quotient, r is the remainder*/
 	while(1)
 	{
 		q = m / n; /*find q and r*/
 		r = m % n;
 		if(r == 0) /*if r is 0, return our n value*/
 		{
 			return n;
 		}
 		m = n; /*set m to the current n value*/
 		n = r; /*set n to our current remainder value*/
 	} /*repeat*/
 }
 PUBLIC	_EuclidsGCD
 _TEXT	SEGMENT
 _m$ = 8	; size = 4
 _n$ = 12       ; size = 4
 _EuclidsGCD PROC NEAR	
 ; Line 7
 	mov	eax, DWORD PTR _m$[esp-4]
 	push	esi
 	mov	esi, DWORD PTR _n$[esp]
 	cdq
 	idiv	esi
 	mov	ecx, edx
 ; Line 8
 	test	ecx, ecx
 	je	SHORT $L563
 $L547:
 ; Line 12
 	mov	eax, esi
 	cdq
 	idiv	ecx
 ; Line 13
 	mov	esi, ecx
 	mov	ecx, edx
 	test	ecx, ecx
 	jne	SHORT $L547
 $L563:
 ; Line 10
 	mov	eax, esi
 	pop	esi
 ; Line 15
 	ret	0
 _EuclidsGCD ENDP
 _TEXT	ENDS
 END

At the beginning of the function, eax contains m, and esi contains n. When the instruction "idiv esi" is executed, eax contains the quotient (q), and edx contains the remainder (r). The instruction "mov ecx, edx" moves r into ecx, while q is not used for the rest of the loop, and is therefore discarded.

Example: Decompile Optimized Code

[edit | edit source]

Below is the optimized listing file of the EuclidGCD function, presented in the examples above. Can you decompile this assembly code listing into equivalent "optimized" C code? How is the optimized version different in structure from the non-optimized version?

 PUBLIC	_EuclidsGCD
 _TEXT	SEGMENT
 _m$ = 8	; size = 4
 _n$ = 12       ; size = 4
 _EuclidsGCD PROC NEAR	
 ; Line 7
 	mov	eax, DWORD PTR _m$[esp-4]
 	push	esi
 	mov	esi, DWORD PTR _n$[esp]
 	cdq
 	idiv	esi
 	mov	ecx, edx
 ; Line 8
 	test	ecx, ecx
 	je	SHORT $L563
 $L547:
 ; Line 12
 	mov	eax, esi
 	cdq
 	idiv	ecx
 ; Line 13
 	mov	esi, ecx
 	mov	ecx, edx
 	test	ecx, ecx
 	jne	SHORT $L547
 $L563:
 ; Line 10
 	mov	eax, esi
 	pop	esi
 ; Line 15
 	ret	0
 _EuclidsGCD ENDP
 _TEXT	ENDS
 END

Altering the conditions to maintain the same structure gives us:

 int EuclidsGCD(int m, int n)
 {
     int r;
     r = m % n;
     if(r != 0) 
     {
         do
         {
             m = n;
             r = m % r;
             n = r;
         }while(r != 0)
     }
     return n;
 }

It is up to the reader to compile this new "optimized" C code, and determine if there is any performance increase. Try compiling this new code without optimizations first, and then with optimizations. Compare the new assembly listings to the previous ones.

Example: Instruction Pairings

[edit | edit source]
Q
Why does the dec/jne combo operate faster than the equivalent loopnz?
A
The dec/jnz pair operates faster than a loopnz for several reasons. First, dec and jnz pair up in the different modules of the netburst pipeline, so they can be executed simultaneously. Top that off with the fact that dec and jnz both require few cycles to execute, while the loopnz (and all the loop instructions, for that matter) instruction takes more cycles to complete. Loop instructions are rarely seen output by good compilers.

Example: Avoiding Branches

[edit | edit source]

Below is an assembly version of the expression c ? d : 0. There is no branching in the code, so how does it work?

; ecx = c and edx = d
; eax will contain c ? d : 0 (eax = d if c is not zero, otherwise eax = 0)
neg	ecx
sbb	eax, eax
and	eax, edx
ret

This is an example of using various arithmetic instructions to avoid branching. The neg instruction sets the carry flag if c is not zero; otherwise, it clears the carry flag. The next line depends on this. If the carry flag is set, then sbb results in eax = eax - eax - 1 = 0xffffffff. Otherwise, eax = eax - eax = 0. Finally, performing an and on this result ensures that if ecx was not zero in the first place, eax will contain edx, and zero otherwise.

Example: Duff's Device

[edit | edit source]

What does the following C code function do? Is it useful? Why or why not?

void MyFunction(int *arrayA, int *arrayB, int cnt)
{
  switch(cnt % 6) 
  {
    while(cnt != 0) 
    {
      case 0:
        arrayA[--cnt] = arrayB[cnt];
      case 5:
        arrayA[--cnt] = arrayB[cnt];
      case 4:
        arrayA[--cnt] = arrayB[cnt];
      case 3:
        arrayA[--cnt] = arrayB[cnt];
      case 2:
        arrayA[--cnt] = arrayB[cnt];
      case 1:
        arrayA[--cnt] = arrayB[cnt];
    }
  }
}

This piece of code is known as a Duff's device or "Duff's machine". It is used to partially unwind a loop for efficiency. Notice the strange way that the while() is nested inside the switch statement? Two arrays of integers are passed to the function, and at each iteration of the while loop, 6 consecutive elements are copied from arrayB to arrayA. The switch statement, since it is outside the while loop, only occurs at the beginning of the function. The modulo is taken of the variable cnt with respect to 6. If cnt is not evenly divisible by 6, then the modulo statement is going to start the loop off somewhere in the middle of the rotation, thus preventing the loop from causing a buffer overflow without having to test the current count after each iteration.

Duff's Device is considered one of the more efficient general-purpose methods for copying strings, arrays, or data streams.