0x200 Programming

Hacker is a term for both those who write code and those who exploit it. Even though these two groups of hackers have different end goals, both groups use similar problem-solving techniques. An understanding of programming helps those who exploit, and an understanding of exploitation helps those who program. Hacking is really just the act of finding a clever and counterintuitive solution to a problem.

Hacks and Programming Hacks *

The hacks found in program exploits usually use the rules of the computer to bypass security in ways never intended. Programming hacks are similar, but the final goal is efficiency or smaller source code, not necessarily a security compromise.

Programs that accomplish a given task in a small, efficient, and neat way are said to have elegance, and the clever and inventive solutions that tend to lead to this efficiency are called hacks.

In the business world, more importance is placed on churning out functional code than on achieving clever hacks and elegance. [p6]

What are Hackers *

Hackers can be:

Hackers are people who get excited about programming and really appreciate the beauty of an elegant piece of code or the ingenuity of a clever hack.

An understanding of programming is a prerequisite to understanding how programs can be exploited.

What Is Programming?

A program is a series of statements written in a specific language.

Pseudo-code

Programmers have yet another form of programming language called pseudo-code. Pseudo-code is simply English arranged with a general structure similar to a high-level language.

Control Structures

More Fundamental Programming Concepts

Getting Your Hands Dirty

For this book, Linux and an x86-based processor are used exclusively.

The firstprog.c program is a simple piece of C code that will print “Hello, world!” 10 times.

firstprog.c

#include <stdio.h>

int main()
{
  int i;
  for(i=0; i < 10; i++)
  {
    printf("Hello, world!\n");
  }
}

The first line is C syntax that tells the compiler to include the header for the standard input/output (I/O) library, stdio. This header file, located at /usr/include/stdio.h, is added to the program when it is compiled. A function prototype is needed for printf() before it can be used, and this prototype (along with many others) is included in the stdio.h header file. A lot of the power of C comes from its extensibility and libraries.

The GNU Compiler Collection (GCC) provides a free C compiler that translates C into machine language that a processor can understand. The resulting translation is an executable binary file, which is called a.out by default.

$ gcc firstprog.c
$ ls -l a.out
-rwxr-xr-x 1 reader reader 6621 2007-09-06 22:16 a.out
$ ./a.out
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Hello, world!

The Bigger Picture

This has all been material you would learn in an elementary programming class. Most introductory programming classes just teach how to read and write C. Being fluent in C is very useful and is enough to make you a decent programmer, but it’s only a piece of the bigger picture. Most programmers learn the language from the top down and never see the big picture.

Hackers get their edge from knowing how all the pieces interact within this bigger picture. To see the bigger picture in the realm of programming, simply realize that C code is meant to be compiled. The code can’t actually do anything until it’s compiled into an executable binary file. Thinking of C-source as a program is a common misconception that is exploited by hackers every day.

The binary a.out’s instructions are written in machine language. Compilers translate the language of C code into machine language for a variety of processor architectures. In this case, the processor is in a family that uses the x86 architecture. There are also the SPARC processor architecture (used in Sun workstations) and the PowerPC processor architecture (used in pre-Intel Macs). The compiler acts as a middle ground, translating C code into machine language for the target architecture.

The average programmer is only concerned with source code, but a hacker realizes that the compiled program is what actually gets executed out in the real world. With a better understanding of how the CPU operates, a hacker can manipulate the programs that run on it.

objdump *

The GNU development tools include a program called objdump, which can be used to examine compiled binaries. Let’s look at the machine code the main() function was translated into.

$ objdump -D a.out | grep -A20 main.:
08048374 <main>:
 8048374: 55 push %ebp
 8048375: 89 e5 mov %esp,%ebp
 8048377: 83 ec 08 sub $0x8,%esp
 804837a: 83 e4 f0 and $0xfffffff0,%esp
 804837d: b8 00 00 00 00 mov $0x0,%eax
 8048382: 29 c4 sub %eax,%esp
 8048384: c7 45 fc 00 00 00 00 movl $0x0,0xfffffffc(%ebp)
 804838b: 83 7d fc 09 cmpl $0x9,0xfffffffc(%ebp)
 804838f: 7e 02 jle 8048393 <main+0x1f>
 8048391: eb 13 jmp 80483a6 <main+0x32>
 8048393: c7 04 24 84 84 04 08 movl $0x8048484,(%esp)
 804839a: e8 01 ff ff ff call 80482a0 <printf@plt>
 804839f: 8d 45 fc lea 0xfffffffc(%ebp),%eax
 80483a2: ff 00 incl (%eax)
 80483a4: eb e5 jmp 804838b <main+0x17>
 80483a6: c9 leave
 80483a7: c3 ret
 80483a8: 90 nop
 80483a9: 90 nop
 80483aa: 90 nop

Description of output from objdump *

The output is piped into grep with the -A20 command-line option, which displays only the 20 lines after the line matching the regular expression main.:. Each byte is represented in hexadecimal notation, which is a base-16 numbering system. This notation is convenient because a byte contains 8 bits and each hexadecimal digit describes 4 bits, so every byte can be written with exactly 2 hexadecimal digits.

The Assembly Language *

Unlike C and other compiled languages, assembly language instructions have a direct one-to-one relationship with their corresponding machine language instructions. This means that since every processor architecture has different machine language instructions, each also has a different form of assembly language. Assembly is a way to represent the machine language instructions of the processor.

There are two main types of assembly language syntax: AT&T syntax and Intel syntax.

The assembly shown in the output above is AT&T syntax. Nearly all of Linux’s disassembly tools use this syntax by default. It’s easy to recognize AT&T syntax by the cacophony of % and $ symbols prefixing everything. The same code can be shown in Intel syntax by providing an additional command-line option, -M intel, to objdump, as shown in the output below.

$ objdump -M intel -D a.out | grep -A20 main.:
08048374 <main>:
 8048374: 55 push ebp
 8048375: 89 e5 mov ebp,esp
 8048377: 83 ec 08 sub esp,0x8
 804837a: 83 e4 f0 and esp,0xfffffff0
 804837d: b8 00 00 00 00 mov eax,0x0
 8048382: 29 c4 sub esp,eax
 8048384: c7 45 fc 00 00 00 00 mov DWORD PTR [ebp-4],0x0
 804838b: 83 7d fc 09 cmp DWORD PTR [ebp-4],0x9
 804838f: 7e 02 jle 8048393 <main+0x1f>
 8048391: eb 13 jmp 80483a6 <main+0x32>
 8048393: c7 04 24 84 84 04 08 mov DWORD PTR [esp],0x8048484
 804839a: e8 01 ff ff ff call 80482a0 <printf@plt>
 804839f: 8d 45 fc lea eax,[ebp-4]
 80483a2: ff 00 inc DWORD PTR [eax]
 80483a4: eb e5 jmp 804838b <main+0x17>
 80483a6: c9 leave
 80483a7: c3 ret
 80483a8: 90 nop
 80483a9: 90 nop
 80483aa: 90 nop

Intel syntax is much more readable and easier to understand; this book will use this syntax. Regardless of the assembly language representation, the commands a processor understands are quite simple. These instructions consist of an operation and sometimes additional arguments that describe the destination and/or the source for the operation. These operations move memory around, perform some sort of basic math, or interrupt the processor to get it to do something else. In the end, that’s all a computer processor can really do.

Processors also have their own set of special variables called registers. Most of the instructions use these registers to read or write data, so understanding the registers of a processor is essential to understanding the instructions.

The x86 Processor

The x86 processor has several registers, which are like internal variables for the processor. The GNU development tools also include a debugger called GDB. Debuggers are used by programmers to step through compiled programs, examine program memory, and view processor registers. A debugger can view the execution from all angles, pause it, and change anything along the way. [p23]

Below, GDB is used to show the state of the processor registers right before the program starts.

$ gdb -q ./a.out
Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".
(gdb) break main
Breakpoint 1 at 0x804837a
(gdb) run
Starting program: /home/reader/booksrc/a.out
Breakpoint 1, 0x0804837a in main ()
(gdb) info registers
eax 0xbffff894 -1073743724
ecx 0x48e0fe81 1222704769
edx 0x1 1
ebx 0xb7fd6ff4 -1208127500
esp 0xbffff800 0xbffff800
ebp 0xbffff808 0xbffff808
esi 0xb8000ce0 -1207956256
edi 0x0 0
eip 0x804837a 0x804837a <main+6>
eflags 0x286 [ PF SF IF ]
cs 0x73 115
ss 0x7b 123
ds 0x7b 123
es 0x7b 123
fs 0x0 0
gs 0x33 51
(gdb) quit
The program is running. Exit anyway? (y or n) y

A breakpoint is set on the main() function so execution will stop right before our code is executed. Then GDB runs the program, stops at the breakpoint, and is told to display all the processor registers and their current states.

EAX, ECX, EDX, and EBX registers *

The first four registers (EAX, ECX, EDX, and EBX) are known as general-purpose registers. These are called the Accumulator, Counter, Data, and Base registers, respectively. They are used for a variety of purposes, but they mainly act as temporary variables for the CPU when it is executing machine instructions.

ESP, EBP, ESI, and EDI registers *

The second four registers (ESP, EBP, ESI, and EDI) are also general-purpose registers, but they are sometimes known as pointers and indexes. These stand for Stack Pointer, Base Pointer, Source Index, and Destination Index, respectively.

EIP register *

The EIP register is the Instruction Pointer register, which points to the current instruction the processor is reading. This register is quite important and will be used a lot while debugging. Currently, it points to a memory address at 0x804837a.

EFLAGS register *

The EFLAGS register consists of several bit flags that are set by operations such as comparisons; in the output above, [ PF SF IF ] lists the flags that are currently set. The remaining registers (cs, ss, ds, es, fs, and gs) are segment registers: memory is split into several different segments, and these registers keep track of that. For the most part, these registers can be ignored since they rarely need to be accessed directly.

Assembly Language

To use Intel syntax assembly language, our tools must be configured to use this syntax. Inside GDB, the disassembly syntax can be set to Intel by typing set disassembly intel or set dis intel, which are abbreviations of the full command set disassembly-flavor intel. You can configure this setting to run every time GDB starts up by putting the command in the file .gdbinit in your home directory.

$ gdb -q
(gdb) set dis intel
(gdb) quit
$ echo "set dis intel" > ~/.gdbinit

The assembly instructions in Intel syntax generally follow this style:

operation <destination>, <source>

The destination and source values will each be a register, a memory address, or a value. The operations are usually intuitive mnemonics. For example, the instructions below move the value from ESP to EBP and then subtract 8 from ESP (storing the result in ESP).

8048375: 89 e5         mov ebp,esp
8048377: 83 ec 08      sub esp,0x8

There are also operations that are used to control the flow of execution.

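In the example below, the cmp instruction compares the 4-byte value located at EBP minus 4 with the number 9. The jle instruction (jump if less than or equal to) uses the result of that comparison: if the value is less than or equal to 9, execution jumps to the instruction at 0x8048393. Otherwise, execution falls through to the jmp instruction, an unconditional jump to 0x80483a6.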

804838b: 83 7d fc 09   cmp DWORD PTR [ebp-4],0x9
804838f: 7e 02         jle 8048393 <main+0x1f>
8048391: eb 13         jmp 80483a6 <main+0x32>

The -g flag can be used by the GCC compiler to include extra debugging information, which will give GDB access to the source code.

$ gcc -g firstprog.c
$ ls -l a.out
-rwxr-xr-x 1 matrix users 11977 Jul 4 17:29 a.out
$ gdb -q ./a.out
Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) list
1 #include <stdio.h>
2
3 int main()
4 {
5 int i;
6 for(i=0; i < 10; i++)
7 {
8 printf("Hello, world!\n");
9 }
10 }
(gdb) disassemble main
Dump of assembler code for function main():
0x08048384 <main+0>: push ebp
0x08048385 <main+1>: mov ebp,esp
0x08048387 <main+3>: sub esp,0x8
0x0804838a <main+6>: and esp,0xfffffff0
0x0804838d <main+9>: mov eax,0x0
0x08048392 <main+14>: sub esp,eax
0x08048394 <main+16>: mov DWORD PTR [ebp-4],0x0
0x0804839b <main+23>: cmp DWORD PTR [ebp-4],0x9
0x0804839f <main+27>: jle 0x80483a3 <main+31>
0x080483a1 <main+29>: jmp 0x80483b6 <main+50>
0x080483a3 <main+31>: mov DWORD PTR [esp],0x80484d4
0x080483aa <main+38>: call 0x80482a8 <_init+56>
0x080483af <main+43>: lea eax,[ebp-4]
0x080483b2 <main+46>: inc DWORD PTR [eax]
0x080483b4 <main+48>: jmp 0x804839b <main+23>
0x080483b6 <main+50>: leave
0x080483b7 <main+51>: ret
End of assembler dump.
(gdb) break main
Breakpoint 1 at 0x8048394: file firstprog.c, line 6.
(gdb) run
Starting program: /hacking/a.out
Breakpoint 1, main() at firstprog.c:6
6 for(i=0; i < 10; i++)
(gdb) info register eip
eip 0x8048394 0x8048394
(gdb)

  1. The source code is listed and the disassembly of the main() function is displayed.
  2. A breakpoint is set at the start of main(), and the program is run.
    • This breakpoint tells the debugger to pause the execution when it gets to that point. Since the breakpoint has been set at the start of the main() function, the program hits the breakpoint and pauses before actually executing any instructions in main().
  3. The value of EIP (the Instruction Pointer) is displayed.

Notice that EIP contains a memory address (i.e. 0x8048394) that points to an instruction in the main() function’s disassembly. The instructions before this (from 0x08048384 to 0x08048392) are collectively known as the function prologue and are generated by the compiler to set up memory for the rest of the main() function’s local variables. Part of the reason why variables need to be declared in C is to aid the construction of this section of code. The debugger knows this part of the code is automatically generated and is smart enough to skip over it.

Examining memory with GDB's examine command *

GDB provides a direct method to examine memory: the command x, which is short for examine. Examining memory is a critical skill for any hacker. With a debugger like GDB, every aspect of a program’s execution can be deterministically examined, paused, stepped through, and repeated as often as needed. Since a running program is mostly just a processor and segments of memory, examining memory is the first way to look at what’s really going on.

The examine command expects two arguments: the location in memory to examine and how to display that memory.

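The display format also uses a single-letter shorthand, which is optionally preceded by a count of how many items to examine. Some common format letters are as follows:

  • o: Display in octal.
  • x: Display in hexadecimal.
  • u: Display in unsigned, standard base-10 decimal.
  • t: Display in binary.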

In the following example, the address currently held in the EIP register is used. Shorthand commands are often used with GDB, and even info register eip can be shortened to just i r eip.

(gdb) i r eip
eip 0x8048384 0x8048384 <main+16>
(gdb) x/o 0x8048384
0x8048384 <main+16>: 077042707
(gdb) x/x $eip
0x8048384 <main+16>: 0x00fc45c7
(gdb) x/u $eip
0x8048384 <main+16>: 16532935
(gdb) x/t $eip
0x8048384 <main+16>: 00000000111111000100010111000111
(gdb)

Referencing registers *

The memory the EIP register is pointing to can be examined by using the address stored in EIP. With the debugger, you can reference registers directly. $eip is equivalent to the value EIP contains at that moment.

A number can also be prepended to the format of the examine command to examine multiple units at the target address.

(gdb) x/2x $eip
0x8048384 <main+16>: 0x00fc45c7 0x83000000
(gdb) x/12x $eip
0x8048384 <main+16>: 0x00fc45c7 0x83000000 0x7e09fc7d 0xc713eb02
0x8048394 <main+32>: 0x84842404 0x01e80804 0x8dffffff 0x00fffc45
0x80483a4 <main+48>: 0xc3c9e5eb 0x90909090 0x90909090 0x5de58955
(gdb)

Halfwords, Words, and Giants *

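The default size of a single unit is a four-byte unit called a word. The size of the display units for the examine command can be changed by adding a size letter to the end of the format letter. The valid size letters are as follows:

  • b: A single byte.
  • h: A halfword, which is two bytes in size.
  • w: A word, which is four bytes in size.
  • g: A giant, which is eight bytes in size.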

This is slightly confusing, because sometimes the term word also refers to 2-byte values, in which case a double word or DWORD refers to a 4-byte value. Note that in this book, words and DWORDs both refer to 4-byte values, and a 2-byte value is called a short or a halfword.

The following GDB output shows memory displayed in various sizes.

(gdb) x/8xb $eip
0x8048384 <main+16>: 0xc7 0x45 0xfc 0x00 0x00 0x00 0x00 0x83
(gdb) x/8xh $eip
0x8048384 <main+16>: 0x45c7 0x00fc 0x0000 0x8300 0xfc7d 0x7e09 0xeb02 0xc713
(gdb) x/8xw $eip
0x8048384 <main+16>: 0x00fc45c7 0x83000000 0x7e09fc7d 0xc713eb02
0x8048394 <main+32>: 0x84842404 0x01e80804 0x8dffffff 0x00fffc45
(gdb)

Note in the above output that the first examine shows the first two bytes to be 0xc7 and 0x45, but when a halfword is examined at the exact same memory address, the value 0x45c7 is shown, with the bytes reversed. This same byte-reversal effect can be seen when a full four-byte word is shown as 0x00fc45c7, but when the first four bytes are shown byte by byte, they are in the order of 0xc7, 0x45, 0xfc, and 0x00.

This is because the x86 processor stores values in little-endian byte order, which means the least significant byte is stored first. For example, if four bytes are to be interpreted as a single value, the bytes must be used in reverse order. GDB knows how values are stored, so when a word or halfword is examined, it reverses the bytes to display the correct value in hexadecimal. Revisiting these values displayed both as hexadecimal and unsigned decimals might help clear up any confusion.

(gdb) x/4xb $eip
0x8048384 <main+16>: 0xc7 0x45 0xfc 0x00
(gdb) x/4ub $eip
0x8048384 <main+16>: 199 69 252 0
(gdb) x/1xw $eip
0x8048384 <main+16>: 0x00fc45c7
(gdb) x/1uw $eip
0x8048384 <main+16>: 16532935
(gdb) quit
The program is running. Exit anyway? (y or n) y
$ bc -ql
199*(256^3) + 69*(256^2) + 252*(256^1) + 0*(256^0)
3343252480
0*(256^3) + 252*(256^2) + 69*(256^1) + 199*(256^0)
16532935
quit

The first four bytes are shown both in hexadecimal and standard unsigned decimal notation. A command-line calculator program called bc is used to show that if the bytes are interpreted in the incorrect order, the result is 3343252480, which is wrong; interpreting them in little-endian order gives 16532935, matching GDB’s output. The byte order of a given architecture is an important detail. [p30]

In addition to converting byte order, GDB can do other conversions with the examine command. The examine command also accepts the format letter i, short for instruction, to display the memory as disassembled assembly language instructions.

$ gdb -q ./a.out
Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".
(gdb) break main
Breakpoint 1 at 0x8048384: file firstprog.c, line 6.
(gdb) run
Starting program: /home/reader/booksrc/a.out
Breakpoint 1, main () at firstprog.c:6
6 for(i=0; i < 10; i++)
(gdb) i r $eip
eip 0x8048384 0x8048384 <main+16>
(gdb) x/i $eip
0x8048384 <main+16>: mov DWORD PTR [ebp-4],0x0
(gdb) x/3i $eip
0x8048384 <main+16>: mov DWORD PTR [ebp-4],0x0
0x804838b <main+23>: cmp DWORD PTR [ebp-4],0x9
0x804838f <main+27>: jle 0x8048393 <main+31>
(gdb) x/7xb $eip
0x8048384 <main+16>: 0xc7 0x45 0xfc 0x00 0x00 0x00 0x00
(gdb) x/i $eip
0x8048384 <main+16>: mov DWORD PTR [ebp-4],0x0
(gdb)

In the output above, the a.out program is run in GDB, with a breakpoint set at main(). The EIP register is pointing to memory that actually contains machine language instructions, and that memory is disassembled using the x/i command (i stands for instruction). The previous objdump disassembly confirms that the seven bytes EIP is pointing to actually are machine language for the corresponding assembly instruction.

8048384: c7 45 fc 00 00 00 00 mov DWORD PTR [ebp-4],0x0

This assembly instruction will move the value of 0 into memory located at the address stored in the EBP register, minus 4. This is where the C variable i is stored in memory; i was declared as an integer that uses 4 bytes of memory on the x86 processor. This instruction will zero out the variable i for the for loop. If that memory is examined right now, it will contain nothing but random garbage. The memory at this location can be examined several different ways.

(gdb) i r ebp
ebp 0xbffff808 0xbffff808
(gdb) x/4xb $ebp - 4
0xbffff804: 0xc0 0x83 0x04 0x08
(gdb) x/4xb 0xbffff804
0xbffff804: 0xc0 0x83 0x04 0x08
(gdb) print $ebp - 4
$1 = (void *) 0xbffff804
(gdb) x/4xb $1
0xbffff804: 0xc0 0x83 0x04 0x08
(gdb) x/xw $1
0xbffff804: 0x080483c0
(gdb)

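The above example shows that:

  • The EBP register contains the address 0xbffff808, and the assembly instruction writes to the address 4 bytes below it, 0xbffff804.
  • The examine command can examine this memory address directly or by doing the math on the fly ($ebp - 4).
  • The print command can also do simple math; its result is stored in a temporary variable in the debugger, here $1, which can be used later to quickly re-access that location in memory.

All of these methods accomplish the same task: displaying the 4 garbage bytes found in memory that will be zeroed out when the current instruction executes.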

To execute the current instruction, run the command nexti, which is short for next instruction. The processor will read the instruction at EIP, execute it, and advance EIP to the next instruction.

(gdb) nexti
0x0804838b 6 for(i=0; i < 10; i++)
(gdb) x/4xb $1
0xbffff804: 0x00 0x00 0x00 0x00
(gdb) x/dw $1
0xbffff804: 0
(gdb) i r eip
eip 0x804838b 0x804838b <main+23>
(gdb) x/i $eip
0x804838b <main+23>: cmp DWORD PTR [ebp-4],0x9
(gdb)

As predicted, the previous command zeroes out the 4 bytes found at EBP minus 4, which is memory set aside for the C variable i. Then EIP advances to the next instruction.