SE250:lab-5:tlou006
LAB 5
Q1
Testing with
int sample_size = 1000; int n_keys = 1000; int table_size = 1000;
and running rt_add_buzhash:

Testing Buzhash low on 1000 samples
Entropy = 7.843786 bits per byte.
Optimum compression would reduce the size of this 1000 byte file by 1 percent.
Chi square distribution for the 1000 samples is 214.46, and randomly would exceed this value 95.00 percent of the times.
Arithmetic mean value of the data bytes is 128.0860 (127.5 = random).
Monte Carlo value for Pi is 3.132530120 (error 0.29 percent).
Serial correlation coefficient is -0.017268 (totally uncorrelated = 0.0).
Buzhash low 1000/1000: llps = 6, expecting 5.51384
Not sure what these results mean yet.
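The entropy figure, at least, is just the Shannon entropy of the observed byte frequencies, in bits per byte (8.0 would mean all 256 byte values are equally likely). A minimal sketch of the calculation (the function name is mine, not from the lab code):

#include <math.h>
#include <stddef.h>

/* Shannon entropy of a byte stream, in bits per byte.
   8.0 means all 256 byte values are equally likely. */
double entropy_bits_per_byte( const unsigned char* data, size_t n ) {
    long counts[256] = { 0 };
    double h = 0.0;
    size_t i;
    int b;
    if ( n == 0 )
        return 0.0;
    for ( i = 0; i < n; i++ )
        counts[data[i]]++;
    for ( b = 0; b < 256; b++ )
        if ( counts[b] > 0 ) {
            double p = (double)counts[b] / n;
            h -= p * log( p ) / log( 2.0 );
        }
    return h;
}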
After increasing the sample size, first to 100000 and then to 10000000, I observed:
Entropy got closer to 8 bits per byte.
The percentage by which compression would reduce the file size decreased to 0.
The chi square value decreased.
The arithmetic mean value of the data bytes got closer to 127.5.
The Monte Carlo value got closer to pi (see the sketch after this list).
The serial correlation coefficient got closer to 0.
The llps value got closer to the expected value.
All of these results suggest that increasing the sample size makes the hash output look more "random", i.e. closer to the statistics expected of a uniformly random byte stream.
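The Monte Carlo figure estimates pi by treating the byte stream as random points and counting how many land inside a circle; the output format matches the ent program, which builds 24-bit coordinates from groups of bytes. Here is a simplified sketch of the same idea using single byte pairs:

#include <stddef.h>

/* Simplified Monte Carlo estimate of pi: take consecutive byte
   pairs as (x, y) points in the unit square and count the
   fraction that fall inside the quarter circle of radius 1.
   The closer the bytes are to uniformly random, the closer the
   estimate gets to pi as the sample grows. */
double monte_carlo_pi( const unsigned char* data, size_t n ) {
    size_t i, inside = 0, points = 0;
    for ( i = 0; i + 1 < n; i += 2 ) {
        double x = data[i] / 255.0;
        double y = data[i + 1] / 255.0;
        points++;
        if ( x * x + y * y <= 1.0 )
            inside++;
    }
    return points ? 4.0 * (double)inside / points : 0.0;
}

This explains the observation above: a biased hash produces clustered points, so the estimate drifts away from pi, while a well-mixed hash converges towards it.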
Running rt_add_buzhashn with a sample size of 100000 and low-entropy input:
Entropy = 7.998236 bits per byte.
Optimum compression would reduce the size of this 100000 byte file by 0 percent.
Chi square distribution for the 100000 samples is 244.84, and randomly would exceed this value 50.00 percent of the times.
Arithmetic mean value of the data bytes is 127.4936 (127.5 = random).
Monte Carlo value for Pi is 3.137635506 (error 0.13 percent).
Serial correlation coefficient is -0.003092 (totally uncorrelated = 0.0).
Buzhash low 1000/1000: llps = 999 (!!!!!!), expecting 5.51384
llps = 999 suggests that almost all of the values are bunched up in one place: with 1000 keys, nearly every key must have hashed to the same slot. Interestingly, the byte-level statistics above still look random, so the problem only shows up in the collision behaviour.
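Taking llps to be the length of the longest probe sequence, it can be measured by scanning the table's buckets. A minimal sketch, assuming a separate-chaining table (the node type here is a stand-in for whatever structure the lab code actually uses):

typedef struct node node;
struct node {
    node* next;
    /* key/value fields omitted */
};

/* Length of the longest chain in a separate-chaining hash
   table. If every key hashes to the same slot, this approaches
   the total number of keys, as in llps = 999 above. */
int longest_chain( node* table[], int table_size ) {
    int i, max = 0;
    for ( i = 0; i < table_size; i++ ) {
        int len = 0;
        node* p;
        for ( p = table[i]; p != 0; p = p->next )
            len++;
        if ( len > max )
            max = len;
    }
    return max;
}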
Running rt_add_hash_CRC with a sample size of 100000 and low-entropy input:
Entropy = 5.574705 bits per byte.
Optimum compression would reduce the size of this 100000 byte file by 30 percent.
Chi square distribution for the 100000 samples is 1398897.03, and randomly would exceed this value 0.01 percent of the times.
Arithmetic mean value of the data bytes is 95.7235 (127.5 = random).
Monte Carlo value for Pi is 3.747989920 (error 19.30 percent).
Serial correlation coefficient is -0.075371 (totally uncorrelated = 0.0).
Buzhash low 1000/1000: llps = 13, expecting 5.51384
Running rt_add_base256 with a sample size of 100000 and low-entropy input:
Entropy = 0.00000 (!!!) bits per byte.
Optimum compression would reduce the size of this 100000 byte file by 100 percent.
Chi square distribution for the 1000 samples is 25500000.00, and randomly would exceed this value 0.01 percent of the times.
Arithmetic mean value of the data bytes is 97.0000 (127.5 = random).
Monte Carlo value for Pi is 4.0000000 (error 27.32 percent).
Serial correlation coefficient is undefined (totally uncorrelated = 0.0).
Buzhash low 1000/1000: llps = 1000 (!!!!), expecting 5.51384
base256 and the other hash functions produced many unexpected results on low-entropy input. For base256, an entropy of 0 bits per byte means every output byte was identical (the arithmetic mean pins that byte to the value 97), so every key must have hashed to the same value, which is also why llps = 1000. At first I wondered whether the sample size was too large, but the same sample size worked fine for buzhash. This suggests buzhash performs well even at large sample sizes and on low-entropy input, while the weaker hash functions break down.
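One plausible explanation, assuming rt_add_base256 uses the usual base-256 string hash (I have not checked the lab source, so this shape is a guess):

/* Typical base-256 string hash: treat the string as a number
   written in base 256. On a 32-bit unsigned long, each new
   character shifts the earlier ones left by 8 bits, so only
   the last 4 characters of the key survive the overflow. If
   the low-entropy keys all end in the same characters, every
   key collapses to the same hash value -- consistent with the
   entropy of 0 and llps = 1000 above. */
unsigned long base256_hash( const char* key ) {
    unsigned long hash = 0;
    while ( *key != '\0' )
        hash = hash * 256 + (unsigned char)*key++;
    return hash;
}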