SE250:lab-5:zyan057
Task 1
OK... so we need to find a suitable sample size before we can test and rank the hash functions.
Here is the code I wrote to find a suitable sample size:
ent_test( "Buzhash low", low_entropy_src, 10, &rt_add_buzhash );
ent_test( "Buzhash low", low_entropy_src, 100, &rt_add_buzhash );
ent_test( "Buzhash low", low_entropy_src, 1000, &rt_add_buzhash );
ent_test( "Buzhash low", low_entropy_src, 10000, &rt_add_buzhash );
ent_test( "Buzhash low", low_entropy_src, 100000, &rt_add_buzhash );
ent_test( "Buzhash low", low_entropy_src, 1000000, &rt_add_buzhash );
ent_test( "Buzhash low", low_entropy_src, 10000000, &rt_add_buzhash );
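The seven calls differ only in the sample size, which grows by a factor of ten each time. As a rough sketch, the sizes could be generated instead of typed out. The helper name geometric_sizes and its parameters are made up for this illustration and are not part of the lab code; the call to ent_test appears only in a comment because its exact signature is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill out[] with first, first*factor, ... up to and
   including last. Returns how many sizes were written (at most cap). */
size_t geometric_sizes(long first, long last, long factor, long *out, size_t cap)
{
    size_t count = 0;
    long n;

    assert(first > 0 && factor > 1);

    for (n = first; n <= last && count < cap; n *= factor) {
        out[count++] = n;
        /* Stop before the multiplication could step past last (or overflow). */
        if (n > last / factor)
            break;
    }
    return count;
}

/* Possible use with the lab harness (assumed, not verified):
 *
 *   long sizes[16];
 *   size_t i, k = geometric_sizes(10, 10000000, 10, sizes, 16);
 *   for (i = 0; i < k; i++)
 *       ent_test( "Buzhash low", low_entropy_src, sizes[i], &rt_add_buzhash );
 */
```

This keeps the test list in one place, so trying a different range only means changing the first and last values.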
Result
Testing Buzhash low on 10 samples
Entropy = 3.584963 bits per byte.
Optimum compression would reduce the size of this 12 byte file by 55 percent.
Chi square distribution for 12 samples is 244.00, and randomly would exceed this value 50.00 percent of the times.
Arithmetic mean value of data bytes is 126.0833 (127.5 = random).
Monte Carlo value for Pi is 2.000000000 (error 36.34 percent).
Serial correlation coefficient is -0.341362 (totally uncorrelated = 0.0).

Testing Buzhash low on 100 samples
Entropy = 6.248758 bits per byte.
Optimum compression would reduce the size of this 100 byte file by 21 percent.
Chi square distribution for 100 samples is 273.76, and randomly would exceed this value 25.00 percent of the times.
Arithmetic mean value of data bytes is 129.3100 (127.5 = random).
Monte Carlo value for Pi is 3.250000000 (error 3.45 percent).
Serial correlation coefficient is -0.092433 (totally uncorrelated = 0.0).

Testing Buzhash low on 1000 samples
Entropy = 7.847331 bits per byte.
Optimum compression would reduce the size of this 1000 byte file by 1 percent.
Chi square distribution for 1000 samples is 207.81, and randomly would exceed this value 97.50 percent of the times.
Arithmetic mean value of data bytes is 126.7080 (127.5 = random).
Monte Carlo value for Pi is 3.277108434 (error 4.31 percent).
Serial correlation coefficient is 0.007539 (totally uncorrelated = 0.0).

Testing Buzhash low on 10000 samples
Entropy = 7.984998 bits per byte.
Optimum compression would reduce the size of this 10000 byte file by 0 percent.
Chi square distribution for 10000 samples is 206.87, and randomly would exceed this value 97.50 percent of the times.
Arithmetic mean value of data bytes is 126.8134 (127.5 = random).
Monte Carlo value for Pi is 3.157262905 (error 0.50 percent).
Serial correlation coefficient is 0.008094 (totally uncorrelated = 0.0).

Testing Buzhash low on 100000 samples
Entropy = 7.998378 bits per byte.
Optimum compression would reduce the size of this 100000 byte file by 0 percent.
Chi square distribution for 100000 samples is 224.41, and randomly would exceed this value 90.00 percent of the times.
Arithmetic mean value of data bytes is 127.6864 (127.5 = random).
Monte Carlo value for Pi is 3.112444498 (error 0.93 percent).
Serial correlation coefficient is 0.000743 (totally uncorrelated = 0.0).

Testing Buzhash low on 1000000 samples
Entropy = 7.999896 bits per byte.
Optimum compression would reduce the size of this 1000000 byte file by 0 percent.
Chi square distribution for 1000000 samples is 144.94, and randomly would exceed this value 99.99 percent of the times.
Arithmetic mean value of data bytes is 127.4920 (127.5 = random).
Monte Carlo value for Pi is 3.144948580 (error 0.11 percent).
Serial correlation coefficient is -0.000856 (totally uncorrelated = 0.0).

Testing Buzhash low on 10000000 samples
Entropy = 7.999986 bits per byte.
Optimum compression would reduce the size of this 10000000 byte file by 0 percent.
Chi square distribution for 10000000 samples is 191.37, and randomly would exceed this value 99.50 percent of the times.
Arithmetic mean value of data bytes is 127.4860 (127.5 = random).
Monte Carlo value for Pi is 3.140720456 (error 0.03 percent).
Serial correlation coefficient is 0.000051 (totally uncorrelated = 0.0).
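The results suggest that going beyond 100000 samples does not make much difference: entropy is already 7.998378 bits per byte at 100000 samples and rises by less than 0.002 bits at 1000000 and 10000000, while the mean and serial correlation are already close to their ideal values. So 100000 looks like a suitable sample size.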