SE250:lab-1:jham005
John's lab report
I played around with the number of loop iterations until the program took about 1 second to run.
    double f = 0;
    clock_t t0 = clock( );
    for( n = 0; n < 100000000; n++ )
        f = f + 1;
    printf( "%ld\n", clock( ) - t0 );
By changing the type of f, I was able to gather data for the different C data types, and by commenting out the line f = f + 1 I obtained a time for the empty loop. To check the variance in the times, I added another loop around the code to repeat the measurement 10 times.
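Putting the pieces together, the complete harness looks roughly like this (a sketch only -- the full add-times.c isn't reproduced in this report, so the surrounding declarations are reconstructed):

    #include <stdio.h>
    #include <time.h>

    int main( void ) {
        int i, n;
        for( i = 0; i < 10; i++ ) {        /* repeat 10 times to check variance */
            double f = 0;                  /* change this type to test the others */
            clock_t t0 = clock( );
            for( n = 0; n < 100000000; n++ )
                f = f + 1;                 /* comment out to time the empty loop */
            printf( "%ld\n", (long)( clock( ) - t0 ) );
        }
        return 0;
    }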
Here are the results from my laptop:
    double   float   long   int    short   noop
    484      469     391    390    360     375
    469      469     391    391    359     391
    453      453     390    453    359     390
    469      453     391    438    360     407
    469      469     391    422    343     390
    437      453     406    406    360     391
    469      453     390    437    359     391
    469      469     391    438    360     390
    453      453     391    406    343     375
    468      469     390    422    375     391

(CLOCKS_PER_SEC is 1,000)
Subtracting the mean time for the noop loop from each column mean gave the following mean times (for example, the double column averages 464.0 and the noop column 389.1, giving 74.9):

    double   float   long   int    short
    74.9     71.9    3.1    31.2   -31.3
Long and int have similar runtimes, as do double and float.
The result for short is most curious: it appears that a loop in which you increment a short takes less time than an empty loop.
It is worth taking a closer look at the generated assembly code, just in case the C compiler is doing something odd. I ran gcc -S add-times.c on the code for short, and again on the code with f = f + 1 commented out. The block on the left is the loop with the short increment, and on the right is the noop loop:
    L5:                                L5:
        cmpl   $99999999, -16(%ebp)        cmpl   $99999999, -16(%ebp)
        jg     L6                          jg     L6
        movzwl -10(%ebp), %eax
        incl   %eax
        movw   %ax, -10(%ebp)
        leal   -16(%ebp), %eax             leal   -16(%ebp), %eax
        incl   (%eax)                      incl   (%eax)
        jmp    L5                          jmp    L5
    L6:                                L6:
Nothing strange here -- the instructions are identical apart from the three lines for the increment. Those three additional lines of assembler actually reduce the runtime of the loop.
Repeating the experiment on the Linux server gives:
    short     int       long      float     double    noop
    1100000   1620000   1620000   1170000   1120000   3000000
    1090000   1630000   1630000   1160000   1130000   3010000
    1070000   1630000   1630000   1130000   1140000   3010000
    1100000   1620000   1630000   1140000   1130000   3000000
    1090000   1640000   1620000   1180000   1120000   3010000
    1100000   1610000   1630000   1160000   1140000   3020000
    1100000   1630000   1630000   1200000   1130000   3000000
    1070000   1630000   1630000   1080000   1140000   3010000
    1100000   1630000   1630000   1170000   1130000   3010000
    1100000   1620000   1630000   1190000   1140000   3010000

(CLOCKS_PER_SEC is 1,000,000)
The results show similar relative times, but an even stranger time for the empty loop: doing nothing takes nearly three times as long as a loop that performs a simple operation. Who would have guessed?
It's also notable that the Linux server is considerably slower than my laptop.
Why is the noop loop so slow? Modern processors are able to execute several instructions in parallel, but this gets tricky when there is a conditional branch. Rather than wait for the branch condition to be calculated, the processor may make a guess and start executing code after the branch, "just in case" the guess turns out to be right. If the guess is wrong, the work needs to be undone. In the code above, the jg (jump if greater) instruction is taken only when the loop ends, and perhaps the processor always guesses it is taken. In the empty loop case, it would get further ahead on the wrong path and end up with more work to undo when it discovers its mistake.
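To make the guess concrete, here is the noop loop again with the branch behaviour annotated (the prediction story is conjecture, not something I have verified):

    L5:
        cmpl  $99999999, -16(%ebp)   # compute the loop condition
        jg    L6                     # taken only on the very last iteration; if the
                                     #   predictor always guesses "taken", every other
                                     #   iteration is a misprediction
        leal  -16(%ebp), %eax
        incl  (%eax)                 # n++
        jmp   L5                     # unconditional, so trivially predicted
    L6: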
A completely different set of results can be obtained by taking the difference between a loop with two add operations and a loop with just one, i.e. by replacing the inner loop with this:
    for( n = 0; n < 100000000; n++ ) {
        f = f + 1;
        g = g + 1;
    }
Obtaining times with and without the second increment, g = g + 1, gives the following; the difference between the two isolates the cost of a single extra add, with the loop overhead cancelled out:
    short2   short1   long2   long1   float2   float1   double2   double1
    860      359      438     406     515      469      454       469
    843      360      422     391     500      468      468       453
    844      359      421     375     500      453      469       468
    859      359      438     422     500      454      469       454
    844      344      422     391     500      468      469       468
    860      375      453     406     485      453      453       453
    843      359      422     390     484      469      453       454
    844      360      422     391     500      453      453       468
    844      359      437     406     500      453      484       453
    859      344      422     391     500      454      453       454
Taking the mean differences gives very, very different results for short and double:
    mean( short2 - short1 )   = 492.2
    mean( long2 - long1 )     = 32.8
    mean( float2 - float1 )   = 39.0
    mean( double2 - double1 ) = 3.1
It now appears that shorts cost the earth, and two double adds are as cheap as one.
The times on the PowerPC64 Linux server are also damning for shorts and floats, but longs prove to be much cheaper.
    short2    short1    long2     long1     float2    float1    double2   double1
    1800000   1080000   1690000   1620000   1890000   1240000   1380000   1120000
    1810000   1100000   1700000   1630000   1890000   1230000   1390000   1140000
    1810000   1100000   1700000   1620000   1890000   1250000   1390000   1130000
    1810000   1100000   1700000   1630000   1890000   1240000   1380000   1130000
    1810000   1090000   1700000   1630000   1910000   1240000   1390000   1140000
    1800000   1100000   1710000   1630000   1890000   1250000   1390000   1140000
    1740000   1100000   1710000   1630000   1900000   1240000   1390000   1150000
    1730000   1100000   1700000   1630000   1900000   1240000   1390000   1130000
    1750000   1100000   1700000   1630000   1900000   1250000   1390000   1130000
    1760000   1100000   1700000   1640000   1900000   1250000   1390000   1120000

    mean( short2 - short1 )   = 685000
    mean( long2 - long1 )     = 72000
    mean( float2 - float1 )   = 653000
    mean( double2 - double1 ) = 255000
A possible cause here is memory alignment: the PowerPC64 is optimised for fetching a 64-bit quantity (the size of a long and a double), and performs poorly when writing a smaller quantity because it has to first read a 64-bit block, update it, then write it back. The 32-bit Intel processor on my Dell appears to suffer the same problem when working with a 16-bit short.
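A sketch in C of the read-modify-write sequence this describes (purely illustrative -- the actual sequence is generated by the hardware, and store_short and its layout are invented for the example):

    #include <stdint.h>

    /* Storing a 16-bit value into a 64-bit word, the way a machine
       without sub-word stores would have to do it. */
    void store_short( uint64_t *block, int offset_bits, uint16_t x ) {
        uint64_t word = *block;                    /* read the whole 64-bit block */
        uint64_t mask = (uint64_t)0xFFFF << offset_bits;
        word = ( word & ~mask )                    /* clear the 16-bit field */
             | ( (uint64_t)x << offset_bits );     /* insert the new value   */
        *block = word;                             /* write the block back   */
    }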
As for why a double is faster than a float: most FPUs operate internally on full-precision floating-point numbers, so working on a 32-bit float means converting it to and from the 64-bit format on every operation. I guess the Intel FPU can execute several instructions in parallel, hence the "free" second increment.
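Under that assumption, each float add implies a widen/narrow pair, roughly:

    float f = 0;
    /* assuming the FPU computes at double precision, "f = f + 1"
       effectively performs: */
    f = (float)( (double)f + 1.0 );   /* widen to 64 bits, add, narrow back */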