SE250:lab-5:dsan039
Testing (#) low on (sample size) samples
Entropy = (#) bits per byte. Optimum compression would reduce the size of this (#) byte file by (#) percent. Chi square distribution for (sample size) samples is (#), and randomly would exceed this value (#) percent of the times. Arithmetic mean value of data bytes is (#) ((#) = random). Monte Carlo value for Pi is (#) (error (#) percent). Serial correlation coefficient is (#) (totally uncorrelated = (#)). (hash function) (low/high) (sample size / table size): llps = (#), expecting (#)
- Entropy: This is a measure of information content, how random it is. I'm still not too sure how this is measured or how in certain situations a very good result can be misleading. In this context a value as close to 8 as possible is desired.
- Compression: The higher the percentage, the more the hash table can be compressed (the more data that can be eliminated as there are repetitions). So a high percentage would mean low randomness. This was the definition in the maintainers report, but i take it to mean that it could be a measure of the number of empty elements in an array (in which case if your sample size is bigger than the array size, then there should be none). The fact that it can be compressed would imply that values being added to the hash table are being hashed to the same place.
- Chi Square: A statistical calculation that is used to test how well a distribution of a set of observed data is compared to its theoretical probability function. Chi square narrows down cases where the data that is obtained is too regular in distribution according to its theory values, or is it too irregular. So for a good function, the chi value must be such that it would imply that the data is close to being a match with its theory values, but not an exact match. (taken from the maintainers report, but also what we were told by jham in lectures, it seems like a decent definition.)
- Arithmetic mean: The expected mean for a set of random numbers is always the centre between the highest value and the lowest value. For this set of tests,(0-255) 127.5 is the expected mean. (Maintainers report).
- Monte Carlo: This is the name given to a range of experimental techniques. (and example taken from wikipedia) "Draw a square of on the ground, then inscribe a circle within it. Now, scatter some small objects (for example, grains of rice or sand) throughout the square. If the objects are scattered uniformly, then the proportion of objects within the circle should be approximately π/4, which is the ratio of the circle's area to the square's area. Thus, if we count the number of objects in the circle, multiply by four, and divide by the number of objects in the square, we'll get an approximation of π". a good hash function needs a uniform distribution of hashes, so that elements in the array are all used and collisions are less frequent.
- serial correlation: determines whether or not there is a relationship between hash values, 0 indicates no relationship.
Task One
I chose a sample size of 10000 because it gives a reasonably large spread of data. a good hash function should be able to give these values the appearance of randomness, with having to make so many hash values that it begins to repeat or develop a very clear pattern.
order: (entropy) | (chi squared percentage) | (arithmetic mean) | (percentage error from pi, (monte carlo)) | (serial correlation)
Buzhash low 7.98550 99.00% 125.825 1.26% -0.000047 Buzhash high 7.97970 25.00% 127.177 1.95% -0.007153 buzhashn low 4.85247 0.01% 99.261 26.10% -0.285059 buzhashn high 7.77880 0.01% 123.087 1.11% -0.002250 hash_CRC low 4.83879 0.01% 78.156 26.10% -0.612667 hash_CRC high 7.77880 0.01% 123.087 1.11% -0.002250 base256 low 0.00000 0.01% 97.000 27.32% undefined base256 high 3.90309 0.01% 107.987 27.32% 0.079032 Java Int low 4.11606 0.01% 32.385 27.32% -0.217081 Java Int high 4.11606 0.01% 32.385 27.32% -0.217081 Java Object low 2.00000 0.01% 77.000 27.32% -0.521556 Java Objec high 4.93352 0.01% 107.661 10.89% -0.493270 Java String low 7.99389 99.99% 127.178 0.19% -0.011155 Java Strin high 7.89301 0.01% 126.009 1.42% -0.021360 Rand low 7.93533 0.01% 112.054 10.82% -0.066125 Rand high 7.93951 0.01% 111.686 10.82% -0.060458 high rand low 7.98301 75.00% 127.541 0.42% 0.001185 high rand high 7.98338 75.00% 128.211 1.49% -0.014177
most of this makes sense, the poorer results and the good. However, some stand out:
- Buzhash low 7.98550 99.00%
- Buzhash high 7.97970 25.00%
- base256 low 0.00000 0.01% 97.000 27.32% undefined
- high rand low 7.98301 75.00% 127.541 0.42% 0.001185
- high rand high 7.98338 75.00%
hmm will have to explain..
In order of best to worst: