SE250:lab-5:llay008
HOW DO HASH FUNCTIONS PERFORM IN THEORY AND PRACTICE?
Task One: How Do the Functions Compare in Theoretical Randomness
This task took quite a long time, largely because of the repetitive compiling involved. It also took a while at the start to understand exactly what to do. The first thing I did was read through the code and try to grasp what it was doing. I noticed that it was divided into the hash functions, the analysis code, and the high- and low-entropy input sources. Once I had worked out which of the functions I had to substitute into the main code, it was simply a matter of modifying that to get the required output.
I chose a sample size of 1000, a key number of 1000 and a table size of 100000, for two reasons. First, when I ran the code these values gave a good entropy value, and altering them did not change the results considerably. Second, I had a misconception about what they meant: it wasn't entirely clear to me what they were, so I asked for help and was given the wrong information, namely that the sample size was the number of inputs into the hash table, when it is actually the number of times the statistical tests are run. When I realised this mistake I didn't change the values, because that would have invalidated the results already collected, but I did run some extra tests to see how the results differed when I changed them.
I kept most of the output on "full" because, although it takes up more space, it is easier to interpret: I didn't have to try to remember which values corresponded to which statistics.
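Since lab5.c itself is not reproduced in this report, the sketch below is just how I picture the harness fitting together. The typedef, the constants and the base256 body are my own guesses (base256 is named after one of the tested functions, but I am inferring its behaviour from its name and results), not the lab's actual code.

/* A sketch of my understanding of the harness: N_SAMPLES is the number
   of times the statistical tests are run (not, as I first thought, the
   number of inputs), N_KEYS is the number of keys hashed and TABLE_SIZE
   is the size of the hash table used for the llps test. */
#include <stdio.h>
#include <string.h>

#define N_SAMPLES  1000
#define N_KEYS     1000
#define TABLE_SIZE 100000

/* assumed common shape of every hash function under test */
typedef unsigned (*hash_fn)(const char *key, int nbytes);

/* hypothetical stand-in for one of the tested functions: a positional,
   base-256 interpretation of the key bytes */
unsigned base256(const char *key, int nbytes) {
    unsigned h = 0;
    int i;
    for (i = 0; i < nbytes; i++)
        h = h * 256 + (unsigned char)key[i];
    return h % TABLE_SIZE;
}

int main(void) {
    /* substituting a different function here was all the modification
       the task needed -- the analysis code stays the same */
    hash_fn f = base256;
    printf("%u\n", f("typical key", (int)strlen("typical key")));
    return 0;
}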
Output
buzhash
low
Testing Buzhash low on 1000 samples
Entropy = 7.843786 bits per byte.
Optimum compression would reduce the size of this 1000 byte file by 1 percent.
Chi square distribution for 1000 samples is 214.46, and randomly would exceed this value 95.00 percent of the times.
Arithmetic mean value of data bytes is 128.0860 (127.5 = random).
Monte Carlo value for Pi is 3.132530120 (error 0.29 percent).
Serial correlation coefficient is -0.017268 (totally uncorrelated = 0.0).
Buzhash low 1000/100000: llps = 2, expecting 2.00948
typical
Testing Buzhash typical on 1000 samples
Entropy = 7.797775 bits per byte.
Optimum compression would reduce the size of this 1000 byte file by 2 percent.
Chi square distribution for 1000 samples is 250.82, and randomly would exceed this value 50.00 percent of the times.
Arithmetic mean value of data bytes is 126.5740 (127.5 = random).
Monte Carlo value for Pi is 3.277108434 (error 4.31 percent).
Serial correlation coefficient is -0.007005 (totally uncorrelated = 0.0).
Buzhash typical 1000/100000: llps = 2, expecting 2.00948
buzhashn
low
Testing Buzhashn low on 1000 samples
Entropy = 7.823873 bits per byte.
Optimum compression would reduce the size of this 1000 byte file by 2 percent.
Chi square distribution for 1000 samples is 220.61, and randomly would exceed this value 90.00 percent of the times.
Arithmetic mean value of data bytes is 127.3730 (127.5 = random).
Monte Carlo value for Pi is 3.108433735 (error 1.06 percent).
Serial correlation coefficient is -0.007118 (totally uncorrelated = 0.0).
Buzhashn low 1000/100000: llps = 2, expecting 2.00948
typical
Java String Hash typical 7.82387 90.00% 127.373 1.06% -0.007118 Java String Hash typical 1000/10000: llps = 999, expecting 2.82556
hash_CRC
low
Testing hash_CRC low on 1000 samples
Entropy = 3.965965 bits per byte.
Optimum compression would reduce the size of this 1000 byte file by 50 percent.
Chi square distribution for 1000 samples is 36163.52, and randomly would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 93.6860 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is -0.380754 (totally uncorrelated = 0.0).
hash_CRC low 1000/100000: llps = 1, expecting 2.00948
typical
Testing hash_CRC typical on 1000 samples
Entropy = 7.202459 bits per byte.
Optimum compression would reduce the size of this 1000 byte file by 9 percent.
Chi square distribution for 1000 samples is 1660.86, and randomly would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 114.9320 (127.5 = random).
Monte Carlo value for Pi is 3.204819277 (error 2.01 percent).
Serial correlation coefficient is -0.032076 (totally uncorrelated = 0.0).
hash_CRC typical 1000/100000: llps = 2, expecting 2.00948
base256
low
Testing base256 low on 1000 samples
Entropy = 0.000000 bits per byte.
Optimum compression would reduce the size of this 1000 byte file by 100 percent.
Chi square distribution for 1000 samples is 255000.00, and randomly would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 97.0000 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is undefined (all values equal!).
base256 low 1000/100000: llps = 1000, expecting 2.00948
typical
Testing base256 typical on 1000 samples
Entropy = 3.919224 bits per byte.
Optimum compression would reduce the size of this 1000 byte file by 51 percent.
Chi square distribution for 1000 samples is 19854.27, and randomly would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 106.4100 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is 0.217294 (totally uncorrelated = 0.0).
base256 typical 1000/100000: llps = 46, expecting 2.00948
Java Integer Hash
low
lab5.c:332: warning: passing arg 4 of `llps' from incompatible pointer type
Testing Java Integer Hash low on 1000 samples
Entropy = 2.791730 bits per byte.
Optimum compression would reduce the size of this 1000 byte file by 65 percent.
Chi square distribution for 1000 samples is 143448.00, and randomly would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 31.1250 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is -0.230200 (totally uncorrelated = 0.0).
Java Integer Hash low 1000/100000: llps = 1, expecting 2.00948
typical
lab5.c:332: warning: passing arg 4 of `llps' from incompatible pointer type
Testing Java Integer Hash typical on 1000 samples
Entropy = 2.791730 bits per byte.
Optimum compression would reduce the size of this 1000 byte file by 65 percent.
Chi square distribution for 1000 samples is 143448.00, and randomly would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 31.1250 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is -0.230200 (totally uncorrelated = 0.0).
Java Integer Hash typical 1000/100000: llps = 91, expecting 2.00948
Java Object Hash
low
Testing Java Object Hash low on 1000 samples
Entropy = 2.000000 bits per byte.
Optimum compression would reduce the size of this 1000 byte file by 75 percent.
Chi square distribution for 1000 samples is 63000.00, and randomly would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 77.0000 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is -0.521556 (totally uncorrelated = 0.0).
Java Object Hash low 1000/100000: llps = 1000, expecting 2.00948
typical
Testing Java Object Hash typical on 1000 samples
Entropy = 4.232015 bits per byte.
Optimum compression would reduce the size of this 1000 byte file by 47 percent.
Chi square distribution for 1000 samples is 33033.66, and randomly would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 88.4960 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is -0.731743 (totally uncorrelated = 0.0).
Java Object Hash typical 1000/100000: llps = 1, expecting 2.00948
Java String Hash
low
Java String Hash low 7.91640 99.99% 129.471 0.48% 0.009052 Java String Hash low 1000/100000: llps = 1, expecting 2.00948
typical
Java String Hash typical 7.37782 0.01% 117.390 8.92% -0.013887 Java String Hash typical 1000/100000: llps = 2, expecting 2.00948
Around this point I realised that I had some misconceptions about the variables, so I tried again using a table size of 1000. The results are below.
Java String Hash typical 7.37782 0.01% 117.390 8.92% -0.013887 Java String Hash typical 1000/1000: llps = 7, expecting 5.51384
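Since llps is the figure that changed here, a note on how I understand it: it appears to be the length of the longest list per slot when the keys are inserted into a chained table of the given size, so the expected value grows as the table shrinks. A minimal sketch of how it could be computed (the tally-array approach is mine; the real harness may build actual chains):

#include <stdio.h>
#include <stdlib.h>

/* Sketch: the longest chain ("llps") that n_keys hash values produce in
   a table of table_size slots, counted with a tally array instead of
   real linked lists. */
int llps(const unsigned *hashes, int n_keys, int table_size) {
    int *count = calloc(table_size, sizeof *count);
    int i, longest = 0;
    for (i = 0; i < n_keys; i++) {
        int slot = hashes[i] % table_size;
        if (++count[slot] > longest)
            longest = count[slot];
    }
    free(count);
    return longest;
}

int main(void) {
    unsigned hashes[1000];
    int i;
    for (i = 0; i < 1000; i++)
        hashes[i] = (unsigned)rand();
    /* 1000 keys into 1000 slots: a good hash should give an llps close
       to the expected 5.51384 reported above */
    printf("llps = %d\n", llps(hashes, 1000, 1000));
    return 0;
}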
Increasing the sample size to 10000 gave these results:
Java String Hash typical 7.89301 0.01% 126.009 1.42% -0.021360 Java String Hash typical 1000/1000: llps = 7, expecting 5.51384
For my own benefit (to see exactly which numbers correspond to which statistics), here is the same run with the full output:
Testing Java String Hash typical on 10000 samples
Entropy = 7.893010 bits per byte.
Optimum compression would reduce the size of this 10000 byte file by 1 percent.
Chi square distribution for 10000 samples is 2047.92, and randomly would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 126.0088 (127.5 = random).
Monte Carlo value for Pi is 3.186074430 (error 1.42 percent).
Serial correlation coefficient is -0.021360 (totally uncorrelated = 0.0).
Java String Hash typical 1000/1000: llps = 7, expecting 5.51384
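To check my reading of the first and fourth statistics, here is a sketch of how I believe they are computed over the stream of hashed bytes, following the standard definitions (I have not verified this against the harness's own code):

#include <stdio.h>
#include <math.h>

/* Sketch: entropy in bits per byte and arithmetic mean of a byte
   stream. A perfectly random stream approaches 8.0 bits per byte and a
   mean of 127.5. */
void byte_stats(const unsigned char *buf, int n) {
    double count[256] = {0.0}, entropy = 0.0, mean = 0.0;
    int i;
    for (i = 0; i < n; i++) {
        count[buf[i]] += 1.0;
        mean += buf[i];
    }
    mean /= n;
    for (i = 0; i < 256; i++)
        if (count[i] > 0.0) {
            double p = count[i] / n;
            entropy -= p * log(p) / log(2.0);   /* -sum p * log2(p) */
        }
    printf("Entropy = %f bits per byte (8.0 = random)\n", entropy);
    printf("Arithmetic mean = %f (127.5 = random)\n", mean);
}

int main(void) {
    unsigned char buf[4] = { 0x00, 0x55, 0xAA, 0xFF };
    byte_stats(buf, 4);   /* four distinct bytes: entropy = 2.0, mean = 127.5 */
    return 0;
}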
rand
low
Java rand low 7.71844 0.01% 110.541 8.92% -0.048389 Java rand low 1000/10000: llps = 2, expecting 2.82556
typical
Java rand typical 7.74840 0.05% 112.891 7.38% -0.081749 Java rand typical 1000/10000: llps = 3, expecting 2.82556
high rand
This gave a compile error:
lab5.c:332: error: `high_rand' undeclared (first use in this function)
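The error just means that high_rand is used at line 332 without a declaration in scope. I don't know what its intended definition was; the fragment below is purely a guess at a plausible one (older rand() implementations have weak low-order bits, so one common trick is to return the high-order bits instead):

#include <stdlib.h>

/* Hypothetical definition only -- the lab's intended high_rand is
   unknown. Declaring (or defining) it before line 332 would fix the
   "undeclared" error. */
int high_rand(void) {
    return rand() >> 8;   /* discard the weaker low-order bits */
}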
Results
(The chi squared column is the percentage of the time a truly random sequence would exceed the reported chi square value, so values near 50% are ideal and values near 0.01% or 99.99% are suspect.)

TEST                        ENTROPY    S.C.C.      CHI SQUARED (%)
Java String hash low        7.91640     0.009052   99.99
buzhashn low                7.823873   -0.007118   90.00
buzhash low                 7.843786   -0.017268   95.00
rand low                    7.71844    -0.048389    0.01
hash_CRC low                3.965965   -0.380754    0.01
Java Object hash low        2.000000   -0.521556    0.01
Java Integer hash low       2.791730   -0.230200    0.01
base256 low                 0.000000   undefined    0.01
buzhash typical             7.797775   -0.007005   50.00
rand typical                7.74840    -0.081749    0.05
buzhashn typical            7.82387    -0.007118   90.00
Java String hash typical    7.37782    -0.013887    0.01
base256 typical             3.919224    0.217294    0.01
hash_CRC typical            7.202459   -0.032076    0.01
Java Object hash typical    4.232015   -0.731743    0.01
Java Integer hash typical   2.791730   -0.230200    0.01
Looking at these results there is a clear difference between the top four functions and the bottom four, especially in the entropy of the low-entropy runs. base256's entropy is actually 0, which means every output byte was the same; the harness confirms this by reporting that the serial correlation coefficient is undefined because all values were equal. The difference also shows in the serial correlation coefficient: for the top four its first significant figure sits in the thousandths or hundredths place, while for the bottom four it is larger, in most cases by a power of ten. With the sole exception of buzhash typical, all of the hash functions performed badly on the chi squared test, scoring either very high (too good to be true) or very low (not very random). In general the low-entropy input gave better results than the typical input.
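Since part of the comparison above rests on the serial correlation coefficient, here is a sketch of the standard formula as I understand it: the correlation between each byte and the one that follows it, wrapping around at the end of the stream (again, not checked against the harness's own implementation):

#include <stdio.h>

/* Sketch: serial correlation coefficient of a byte stream -- 0.0 means
   successive bytes are uncorrelated. */
double scc(const unsigned char *buf, int n) {
    double sum = 0.0, sum_sq = 0.0, sum_prod = 0.0;
    int i;
    for (i = 0; i < n; i++) {
        double x = buf[i];
        double y = buf[(i + 1) % n];   /* wrap around at the end */
        sum += x;
        sum_sq += x * x;
        sum_prod += x * y;
    }
    return (n * sum_prod - sum * sum) / (n * sum_sq - sum * sum);
}

int main(void) {
    unsigned char alt[8] = { 0, 255, 0, 255, 0, 255, 0, 255 };
    /* perfectly alternating bytes are anti-correlated: prints -1.000000 */
    printf("scc = %f\n", scc(alt, 8));
    return 0;
}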