The program "mandel.c" creates a visualization of the Mandelbrot fractal.
It uses an algorithm with three nested loops, with the inner loop containing
several 32x32=64 multiplies (two 32-bit operands, full 64-bit product). Our
32-bit x86 compiler doesn't know how to generate that kind of multiplication
(x86 supports it, but only with the product in the EDX:EAX register pair), so
it has to call a library function.
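  A minimal sketch of such a kernel, assuming a signed Q8.24 fixed-point
format (8 integer bits, 24 fraction bits). The format and the names fx_mul,
mandel_point, and mandel_image are illustrative choices, not taken from
mandel.c. The widening multiply in fx_mul is the 32x32=64 operation described
above: both operands are 32 bits, and the full 64-bit product is shifted back
down to Q8.24.

```c
#include <stdint.h>

/* Assumed Q8.24 format: range about +/-128, resolution 2^-24. For
   |cx|, |cy| <= 2 the escape test keeps intermediates below about 72. */
#define FRAC_BITS 24
#define FX_ONE    ((int32_t)1 << FRAC_BITS)

/* Convert a double to fixed point (no range checking). */
static int32_t fx_from_double(double v) {
    return (int32_t)(v * FX_ONE);
}

/* The 32x32=64 multiply: widen both 32-bit operands, keep the full
   64-bit product, then shift out the extra 24 fraction bits.
   (>> on a negative int64_t is implementation-defined in C, but it is
   an arithmetic shift on mainstream compilers.) */
static int32_t fx_mul(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}

/* Innermost loop: iterate z = z*z + c from z = 0 and return the number
   of iterations before |z|^2 exceeds 4, capped at max_iter. */
static int mandel_point(int32_t cx, int32_t cy, int max_iter) {
    int32_t x = 0, y = 0;
    int i;
    for (i = 0; i < max_iter; i++) {
        int32_t x2 = fx_mul(x, x);
        int32_t y2 = fx_mul(y, y);
        if (x2 + y2 > 4 * FX_ONE)      /* escaped */
            break;
        y = 2 * fx_mul(x, y) + cy;     /* Im(z*z + c) */
        x = x2 - y2 + cx;              /* Re(z*z + c) */
    }
    return i;
}

/* Outer two loops: one gray level per pixel, starting from the point
   (x0, y0) at pixel (0, 0) and moving `step` per pixel. */
static void mandel_image(unsigned char *pix, int w, int h,
                         double x0, double y0, double step, int max_iter) {
    for (int row = 0; row < h; row++) {
        int32_t cy = fx_from_double(y0 + row * step);
        for (int col = 0; col < w; col++) {
            int32_t cx = fx_from_double(x0 + col * step);
            int n = mandel_point(cx, cy, max_iter);
            pix[row * w + col] = (unsigned char)(n * 255 / max_iter);
        }
    }
}
```

  The caller supplies a w*h byte buffer; each byte is a gray level that could
be written out as a PGM image. Every fx_mul call in mandel_point is a
32x32=64 multiply, which is the operation a 32-bit compiler without native
support has to turn into a library call.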
  On x64, the native word size is 64 bits, so it's easy to do a 32x32=64
multiply. Therefore, we can do the multiplication inline, and it participates
in register allocation, producing better code overall.
  With size optimizations (-OS -Omax), the code for "mandelbrot" takes up
374 bytes (135 instructions) on 32-bit x86, and only 227 bytes (69 instructions)
on x64.
  The difference is even more marked with speed optimizations (-Ospeed -Omax),
because the two biggest optimizations that trade size for speed are function
inlining and loop unrolling. Since the 32-bit "mandelbrot" contains a function
call inside a nested loop, both of these optimizations inflate it heavily. The
x64 version is 792 bytes to x86's 2412 bytes, a factor of three difference in
code size!
We have not run speed benchmarks on x64 code, but this size difference seems
significant.

  A sample command line for producing a pretty picture:
./a.out -0.7005 0.295 0.015 ; animate mandel.pgm

  The idea of using a fixed-point Mandelbrot calculation to demonstrate 64-bit
code optimization comes from this article by Mike Wall on AMD.com:
http://developer.amd.com/articlex.jsp?id=58