<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-GB">
	<id>https://wiki.kram.nz/index.php?action=history&amp;feed=atom&amp;title=SE250%3Alab-1%3Ajham005</id>
	<title>SE250:lab-1:jham005 - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.kram.nz/index.php?action=history&amp;feed=atom&amp;title=SE250%3Alab-1%3Ajham005"/>
	<link rel="alternate" type="text/html" href="https://wiki.kram.nz/index.php?title=SE250:lab-1:jham005&amp;action=history"/>
	<updated>2026-04-15T00:50:27Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.45.3</generator>
	<entry>
		<id>https://wiki.kram.nz/index.php?title=SE250:lab-1:jham005&amp;diff=4335&amp;oldid=prev</id>
		<title>Mark: 3 revision(s)</title>
		<link rel="alternate" type="text/html" href="https://wiki.kram.nz/index.php?title=SE250:lab-1:jham005&amp;diff=4335&amp;oldid=prev"/>
		<updated>2008-11-03T05:18:41Z</updated>

		<summary type="html">&lt;p&gt;3 revision(s)&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;=== John&amp;#039;s lab report ===&lt;br /&gt;
&lt;br /&gt;
I played around with the number of loop iterations until the program took about 1 second to run.&lt;br /&gt;
&lt;br /&gt;
  double f = 0;&lt;br /&gt;
  clock_t t0 = clock( );&lt;br /&gt;
 &lt;br /&gt;
  for( n = 0; n &amp;lt; 100000000; n++ )&lt;br /&gt;
    f = f + 1;&lt;br /&gt;
  printf( &amp;quot;%ld\n&amp;quot;, clock( ) - t0 );&lt;br /&gt;
&lt;br /&gt;
By changing the type of &amp;lt;tt&amp;gt;f&amp;lt;/tt&amp;gt;, I was able to gather data for the different C data types,&lt;br /&gt;
and by commenting out the line &amp;lt;tt&amp;gt;f = f + 1&amp;lt;/tt&amp;gt; I obtained a time for the empty loop.  To&lt;br /&gt;
check the variance in the times, I added another loop around the code to repeat 10 times.&lt;br /&gt;
&lt;br /&gt;
Here are the results from my laptop:&lt;br /&gt;
 double float long int short noop &lt;br /&gt;
 484    469   391  390 360   375    &lt;br /&gt;
 469    469   391  391 359   391    &lt;br /&gt;
 453    453   390  453 359   390    &lt;br /&gt;
 469    453   391  438 360   407    &lt;br /&gt;
 469    469   391  422 343   390    &lt;br /&gt;
 437    453   406  406 360   391    &lt;br /&gt;
 469    453   390  437 359   391    &lt;br /&gt;
 469    469   391  438 360   390    &lt;br /&gt;
 453    453   391  406 343   375    &lt;br /&gt;
 468    469   390  422 375   391    &lt;br /&gt;
 (CLOCKS_PER_SEC is 1,000)&lt;br /&gt;
&lt;br /&gt;
Subtracting the mean time for the noop loop gave the following mean times:&lt;br /&gt;
 double  float  long    int  short&lt;br /&gt;
   74.9   71.9    3.1   31.2  -31.3&lt;br /&gt;
&lt;br /&gt;
Long and int both have similar runtimes, as do double and float.&lt;br /&gt;
&lt;br /&gt;
The result for short is most curious: it appears that a loop in which you increment a short takes&lt;br /&gt;
less time than an empty loop.&lt;br /&gt;
&lt;br /&gt;
It is worth taking a closer look at the generated assembly code, just in case the C compiler is doing something odd.  I ran&lt;br /&gt;
&amp;lt;tt&amp;gt;gcc -S add-times.c&amp;lt;/tt&amp;gt; on the code for short, and again on the code with &amp;lt;tt&amp;gt;f = f + 1&amp;lt;/tt&amp;gt; commented out.  The block on&lt;br /&gt;
the left is the loop with the short increment, and on the right is the noop loop:&lt;br /&gt;
 L5:&lt;br /&gt;
  cmpl   $99999999, -16(%ebp)    cmpl    $99999999, -16(%ebp)&lt;br /&gt;
  jg     L6                      jg      L6&lt;br /&gt;
  movzwl -10(%ebp), %eax&lt;br /&gt;
  incl   %eax&lt;br /&gt;
  movw   %ax, -10(%ebp)&lt;br /&gt;
  leal   -16(%ebp), %eax         leal    -16(%ebp), %eax&lt;br /&gt;
  incl   (%eax)                  incl    (%eax)&lt;br /&gt;
  jmp    L5                      jmp     L5&lt;br /&gt;
 L6:&lt;br /&gt;
&lt;br /&gt;
Nothing strange here -- the instructions are identical apart from the three lines for the increment.  Those three additional&lt;br /&gt;
lines of assembler actually reduce the runtime of the loop.&lt;br /&gt;
&lt;br /&gt;
Repeating the experiment on the Linux server gives:&lt;br /&gt;
 short    int      long      float    double   noop   &lt;br /&gt;
 1100000  1620000  1620000   1170000  1120000  3000000&lt;br /&gt;
 1090000  1630000  1630000   1160000  1130000  3010000&lt;br /&gt;
 1070000  1630000  1630000   1130000  1140000  3010000&lt;br /&gt;
 1100000  1620000  1630000   1140000  1130000  3000000&lt;br /&gt;
 1090000  1640000  1620000   1180000  1120000  3010000&lt;br /&gt;
 1100000  1610000  1630000   1160000  1140000  3020000&lt;br /&gt;
 1100000  1630000  1630000   1200000  1130000  3000000&lt;br /&gt;
 1070000  1630000  1630000   1080000  1140000  3010000&lt;br /&gt;
 1100000  1630000  1630000   1170000  1130000  3010000&lt;br /&gt;
 1100000  1620000  1630000   1190000  1140000  3010000&lt;br /&gt;
 (CLOCKS_PER_SEC is 1,000,000)&lt;br /&gt;
&lt;br /&gt;
The results show similar relative times, but an even stranger time for the empty loop: doing nothing&lt;br /&gt;
takes nearly &amp;#039;&amp;#039;&amp;#039;three times as long&amp;#039;&amp;#039;&amp;#039; as a loop that performs a simple operation.&lt;br /&gt;
Who would have guessed?&lt;br /&gt;
&lt;br /&gt;
It&amp;#039;s also notable that the Linux server is considerably slower than my laptop.&lt;br /&gt;
&lt;br /&gt;
Why is the noop loop so slow?  Modern processors can execute several instructions in parallel, but&lt;br /&gt;
this gets tricky when there is a conditional branch.  Rather than stall until the branch condition has been&lt;br /&gt;
computed, the processor may guess the outcome and speculatively execute the code after the branch, &amp;quot;just in case&amp;quot;&lt;br /&gt;
the guess turns out to be right.  If the guess is wrong, the work must be undone.  In the code above, the&lt;br /&gt;
jg (jump if greater) instruction is taken only when the loop ends, and perhaps the processor always guesses that it is taken.&lt;br /&gt;
In the empty loop case, it would get further ahead and end up with more work to undo when it discovers its mistake.&lt;br /&gt;
&lt;br /&gt;
A completely different set of results can be obtained by taking the difference between a loop with two add operations&lt;br /&gt;
and a loop with just one.  i.e. replace the inner loop with this:&lt;br /&gt;
&lt;br /&gt;
    for( n = 0; n &amp;lt; 100000000; n++ ) {&lt;br /&gt;
      f = f + 1;&lt;br /&gt;
      g = g + 1;&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
Obtaining times with and without the second increment, &amp;lt;tt&amp;gt;g = g + 1&amp;lt;/tt&amp;gt;, gives the following:&lt;br /&gt;
 short2 short1 long2 long1 float2 float1 double2 double1&lt;br /&gt;
 860    359    438   406   515    469    454     469&lt;br /&gt;
 843    360    422   391   500    468    468     453&lt;br /&gt;
 844    359    421   375   500    453    469     468&lt;br /&gt;
 859    359    438   422   500    454    469     454&lt;br /&gt;
 844    344    422   391   500    468    469     468&lt;br /&gt;
 860    375    453   406   485    453    453     453&lt;br /&gt;
 843    359    422   390   484    469    453     454&lt;br /&gt;
 844    360    422   391   500    453    453     468&lt;br /&gt;
 844    359    437   406   500    453    484     453&lt;br /&gt;
 859    344    422   391   500    454    453     454&lt;br /&gt;
&lt;br /&gt;
Taking the mean differences gives very, very different results for short and double:&lt;br /&gt;
 mean( short2 - short1 )   = 492.2&lt;br /&gt;
 mean( long2 - long1 )     =  32.8&lt;br /&gt;
 mean( float2 - float1 )   =  39&lt;br /&gt;
 mean( double2 - double1 ) =   3.1&lt;br /&gt;
&lt;br /&gt;
It now appears that shorts cost the earth, and two double adds are as cheap as one.&lt;br /&gt;
&lt;br /&gt;
The times on the PowerPC64 Linux server are also damning for shorts and floats, but longs&lt;br /&gt;
prove to be much cheaper.&lt;br /&gt;
&lt;br /&gt;
 short2  short1  long2   long1   float2  float1  double2 double1&lt;br /&gt;
 1800000 1080000 1690000 1620000 1890000 1240000 1380000 1120000&lt;br /&gt;
 1810000 1100000 1700000 1630000 1890000 1230000 1390000 1140000&lt;br /&gt;
 1810000 1100000 1700000 1620000 1890000 1250000 1390000 1130000&lt;br /&gt;
 1810000 1100000 1700000 1630000 1890000 1240000 1380000 1130000&lt;br /&gt;
 1810000 1090000 1700000 1630000 1910000 1240000 1390000 1140000&lt;br /&gt;
 1800000 1100000 1710000 1630000 1890000 1250000 1390000 1140000&lt;br /&gt;
 1740000 1100000 1710000 1630000 1900000 1240000 1390000 1150000&lt;br /&gt;
 1730000 1100000 1700000 1630000 1900000 1240000 1390000 1130000&lt;br /&gt;
 1750000 1100000 1700000 1630000 1900000 1250000 1390000 1130000&lt;br /&gt;
 1760000 1100000 1700000 1640000 1900000 1250000 1390000 1120000&lt;br /&gt;
  &lt;br /&gt;
 mean( short2 - short1 )   = 685000&lt;br /&gt;
 mean( long2 - long1 )     =  72000&lt;br /&gt;
 mean( float2 - float1 )   = 653000&lt;br /&gt;
 mean( double2 - double1 ) = 255000&lt;br /&gt;
&lt;br /&gt;
A possible cause here is memory alignment: the PowerPC64 is optimised for fetching a 64-bit quantity&lt;br /&gt;
(the size of a long and a double), and performs poorly when writing a smaller quantity because it has&lt;br /&gt;
to first read a 64-bit block, update it, then write it back.  The 32-bit Intel processor on my Dell&lt;br /&gt;
appears to suffer the same problem when working with a 16-bit short.&lt;br /&gt;
&lt;br /&gt;
As for why a double is faster than a float: many FPUs operate internally at full (double or extended) precision.&lt;br /&gt;
Working on a 32-bit float then means converting it to and from the wider format.  I guess the Intel FPU can&lt;br /&gt;
execute several instructions in parallel, hence the &amp;quot;free&amp;quot; second increment.&lt;/div&gt;</summary>
		<author><name>Mark</name></author>
	</entry>
</feed>