x86 Disassembly/Optimization Examples
From Wikibooks, the open-content textbooks collection
Contents |
[edit] Example: Optimized vs Non-Optimized Code
The following example is adapted from an algorithm presented in Knuth(vol 1, chapt 1) used to find the greatest common denominator of 2 integers. Compare the listing file of this function when compiler optimizations are turned on and off.
/*line 1*/ int EuclidsGCD(int m, int n) /*we want to find the GCD of m and n*/ { int q, r; /*q is the quotient, r is the remainder*/ while(1) { q = m / n; /*find q and r*/ r = m % n; if(r == 0) /*if r is 0, return our n value*/ { return n; } m = n; /*set m to the current n value*/ n = r; /*set n to our current remainder value*/ } /*repeat*/ }
Compiling with the Microsoft C compiler, we generate a listing file using no optimization:
PUBLIC _EuclidsGCD _TEXT SEGMENT _r$ = -8 ; size = 4 _q$ = -4 ; size = 4 _m$ = 8 ; size = 4 _n$ = 12 ; size = 4 _EuclidsGCD PROC NEAR ; Line 2 push ebp mov ebp, esp sub esp, 8 $L477: ; Line 4 mov eax, 1 test eax, eax je SHORT $L473 ; Line 6 mov eax, DWORD PTR _m$[ebp] cdq idiv DWORD PTR _n$[ebp] mov DWORD PTR _q$[ebp], eax ; Line 7 mov eax, DWORD PTR _m$[ebp] cdq idiv DWORD PTR _n$[ebp] mov DWORD PTR _r$[ebp], edx ; Line 8 cmp DWORD PTR _r$[ebp], 0 jne SHORT $L479 ; Line 10 mov eax, DWORD PTR _n$[ebp] jmp SHORT $L473 $L479: ; Line 12 mov ecx, DWORD PTR _n$[ebp] mov DWORD PTR _m$[ebp], ecx ; Line 13 mov edx, DWORD PTR _r$[ebp] mov DWORD PTR _n$[ebp], edx ; Line 14 jmp SHORT $L477 $L473: ; Line 15 mov esp, ebp pop ebp ret 0 _EuclidsGCD ENDP _TEXT ENDS END
Notice how there is a very clear correspondence between the lines of C code, and the lines of the ASM code. the addition of the "; line x" directives is very helpful in that respect.
Next, we compile the same function using a series of optimizations to stress speed over size:
cl.exe /Tceuclids.c /Fa /Ogt2
and we produce the following listing:
PUBLIC _EuclidsGCD _TEXT SEGMENT _m$ = 8 ; size = 4 _n$ = 12 ; size = 4 _EuclidsGCD PROC NEAR ; Line 7 mov eax, DWORD PTR _m$[esp-4] push esi mov esi, DWORD PTR _n$[esp] cdq idiv esi mov ecx, edx ; Line 8 test ecx, ecx je SHORT $L563 $L547: ; Line 12 mov eax, esi cdq idiv ecx ; Line 13 mov esi, ecx mov ecx, edx test ecx, ecx jne SHORT $L547 $L563: ; Line 10 mov eax, esi pop esi ; Line 15 ret 0 _EuclidsGCD ENDP _TEXT ENDS END
As you can see, the optimized version is significantly shorter then the non-optimized version. Some of the key differences include:
- The optimized version does not prepare a standard stack frame. This is important to note, because many times new reversers assume that functions always start and end with proper stack frames, and this is clearly not the case. EBP isnt being used, ESP isnt being altered (because the local variables are kept in registers, and not put on the stack), and no subfunctions are called. 5 instructions are cut by this.
- The "test EAX, EAX" series of instructions in the non-optimized output, under ";line 4" is all unnecessary. The while-loop is defined by "while(1)" and therefore the loop always continues. this extra code is safely cut out. Notice also that there is no unconditional jump in the loop like would be expected: the "if(r == 0) return n;" instruction has become the new loop condition.
- The structure of the function is altered greatly: the division of m and n to produce q and r is performed in this function twice: once at the beginning of the function to initialize, and once at the end of the loop. Also, the value of r is tested twice, in the same places. The compiler is very liberal with how it assigns storage in the function, and readily discards values that are not needed.
[edit] Example: Manual Optimization
The following lines of assembly code are not optimized, but they can be optimized very easily. Can you find a way to optimize these lines?
mov eax, 1 test eax, eax je SHORT $L473
The code in this line is the code generated for the "while( 1 )" C code, to be exact, it represents the loop break condition. Because this is an infinite loop, we can assume that these lines are unnecessary.
"mov eax, 1" initializes eax.
the test immediately afterwards tests the value of eax to ensure that it is nonzero. because eax will always be nonzero (eax = 1) at this point, the conditional jump can be removed along whith the "mov" and the "test".
The assembly is actually checking whether 1 equals 1. Another fact is, that the C code for an infinite FOR loop:
for( ; ; ) { ... }
would not create such a meaningless assembly code to begin with, and is logically the same as "while( 1 )".
[edit] Example: Trace Variables
Here are the C code and the optimized assembly listing from the EuclidGCD function, from the example above. Can you determine which registers contain the variables r and q?
/*line 1*/ int EuclidsGCD(int m, int n) /*we want to find the GCD of m and n*/ { int q, r; /*q is the quotient, r is the remainder*/ while(1) { q = m / n; /*find q and r*/ r = m % n; if(r == 0) /*if r is 0, return our n value*/ { return n; } m = n; /*set m to the current n value*/ n = r; /*set n to our current remainder value*/ } /*repeat*/ }
PUBLIC _EuclidsGCD _TEXT SEGMENT _m$ = 8 ; size = 4 _n$ = 12 ; size = 4 _EuclidsGCD PROC NEAR ; Line 7 mov eax, DWORD PTR _m$[esp-4] push esi mov esi, DWORD PTR _n$[esp] cdq idiv esi mov ecx, edx ; Line 8 test ecx, ecx je SHORT $L563 $L547: ; Line 12 mov eax, esi cdq idiv ecx ; Line 13 mov esi, ecx mov ecx, edx test ecx, ecx jne SHORT $L547 $L563: ; Line 10 mov eax, esi pop esi ; Line 15 ret 0 _EuclidsGCD ENDP _TEXT ENDS END
At the beginning of the function, eax contains m, and esi contains n. When the instruction "idiv esi" is executed, eax contains the quotient (q), and edx contains the remainder (r). The instruction "mov ecx, edx" moves r into ecx, while q is not used for the rest of the loop, and is therefore discarded.
[edit] Example: Decompile Optimized Code
Below is the optimized listing file of the EuclidGCD function, presented in the examples above. Can you decompile this assembly code listing into equivalent "optimized" C code? How is the optimized version different in structure from the non-optimized version?
PUBLIC _EuclidsGCD _TEXT SEGMENT _m$ = 8 ; size = 4 _n$ = 12 ; size = 4 _EuclidsGCD PROC NEAR ; Line 7 mov eax, DWORD PTR _m$[esp-4] push esi mov esi, DWORD PTR _n$[esp] cdq idiv esi mov ecx, edx ; Line 8 test ecx, ecx je SHORT $L563 $L547: ; Line 12 mov eax, esi cdq idiv ecx ; Line 13 mov esi, ecx mov ecx, edx test ecx, ecx jne SHORT $L547 $L563: ; Line 10 mov eax, esi pop esi ; Line 15 ret 0 _EuclidsGCD ENDP _TEXT ENDS END
Altering the conditions to maintain the same structure gives us:
int EuclidsGCD(int m, int n) { int r; r = m % n; if(r != 0) { do { m = n; r = m % r; n = r; }while(r != 0) } return n; }
It is up to the reader to compile this new "optimized" C code, and determine if there is any performance increase. Try compiling this new code without optimizations first, and then with optimizations. Compare the new assembly listings to the previous ones.
[edit] Example: Instruction Pairings
- Q
- Why does the dec/jne combo operate faster than the equivalent loopnz?
- A
- The dec/jnz pair operates faster then a loopsz for several reasons. First, dec and jnz pair up in the different modules of the netburst pipeline, so they can be executed simultaneously. Top that off with the fact that dec and jnz both require few cycles to execute, while the loopnz (and all the loop instructions, for that matter) instruction takes more cycles to complete. loop instructions are rarely seen output by good compilers.
[edit] Example: Duff's Device
What does the following C code function do? Is it useful? Why or why not?
void MyFunction(int *arrayA, int *arrayB, cnt) { switch(cnt % 6) { while(cnt ?= 0) { case 0: arrayA[cnt] = arrayB[cnt--]; case 5: arrayA[cnt] = arrayB[cnt--]; case 4: arrayA[cnt] = arrayB[cnt--]; case 3: arrayA[cnt] = arrayB[cnt--]; case 2: arrayA[cnt] = arrayB[cnt--]; case 1: arrayA[cnt] = arrayB[cnt--]; } } }
This piece of code is known as a Duff's device or "Duff's machine". It is used to partially unwind a loop for efficiency. Notice the strange way that the while() is nested inside the switch statement? Two arrays of integers are passed to the function, and at each iteration of the while loop, 6 consecutive elements are copied from arrayB to arrayA. The switch statement, since it is outside the while loop, only occurs at the beginning of the function. The modulo is taken of the variable cnt with respect to 6. If cnt is not evenly divisible by 6, then the modulo statement is going to start the loop off somewhere in the middle of the rotation, thus preventing the loop from causing a buffer overflow without having to test the current count after each iteration.
Duff's Device is considered one of the more efficient general-purpose methods for copying strings, arrays, or data streams.