SE250:lab-5:dsan039


The randomness tests produce output in the following form:

Testing (#) low on (sample size) samples

Entropy = (#) bits per byte.

Optimum compression would reduce the size
of this (#) byte file by (#) percent.

Chi square distribution for (sample size) samples is (#), and randomly
would exceed this value (#) percent of the times.

Arithmetic mean value of data bytes is (#) ((#) = random).
Monte Carlo value for Pi is (#) (error (#) percent).
Serial correlation coefficient is (#) (totally uncorrelated = (#)).

(hash function) (low/high) (sample size / table size): llps = (#), expecting (#)


  1. Entropy: A measure of information content, i.e. how random the data is. I'm still not entirely sure how it is measured, or how in certain situations a very good result can be misleading. In this context a value as close to 8 bits per byte as possible is desired (see the entropy sketch after this list).
  2. Compression: The higher the percentage, the more the output can be compressed (the more data that can be eliminated because of repetition), so a high percentage means low randomness. This was the definition in the maintainer's report, but I take it to also be a measure of the number of empty elements in an array (in which case, if the sample size is bigger than the array size, there should be none). The fact that the output can be compressed implies that values added to the hash table are being hashed to the same place.
  3. Chi square: A statistical test of how well the distribution of a set of observed data matches its theoretical probability distribution. Chi square picks out cases where the observed data is too regular compared with its theoretical values, as well as cases where it is too irregular. So for a good function, the chi-square value should imply that the data is close to its theoretical values, but not an exact match. (Taken from the maintainer's report, but also what we were told by jham in lectures; it seems like a decent definition.) A sketch of the statistic is given after this list.
  4. Arithmetic mean: The expected mean of a set of random numbers is the midpoint between the highest and lowest possible values. For this set of tests (byte values 0-255) the expected mean is 127.5. (Maintainer's report.)
  5. Monte Carlo: The name given to a range of experimental techniques. An example taken from Wikipedia: "Draw a square on the ground, then inscribe a circle within it. Now, scatter some small objects (for example, grains of rice or sand) throughout the square. If the objects are scattered uniformly, then the proportion of objects within the circle should be approximately π/4, which is the ratio of the circle's area to the square's area. Thus, if we count the number of objects in the circle, multiply by four, and divide by the number of objects in the square, we'll get an approximation of π." A good hash function needs a uniform distribution of hashes, so that all elements in the array are used and collisions are less frequent (see the sketch after this list).
  6. Serial correlation: Determines whether or not there is a relationship between successive values; 0 indicates no relationship (see the sketch after this list).
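
Point 1 can be pinned down a bit: the entropy figure is the Shannon entropy of the byte frequencies, measured in bits per byte. The sketch below is my own code rather than the lab's test harness, and the name byte_entropy is made up; it just shows how the figure could be computed.

 /* Sketch of the entropy measure: Shannon entropy of the byte frequencies,
    in bits per byte. A perfectly uniform stream scores 8.0, a constant
    stream scores 0.0. Compile with -lm for log2(). */
 #include <stdio.h>
 #include <math.h>
 
 double byte_entropy(const unsigned char *data, size_t n) {
     long counts[256] = {0};
     for (size_t i = 0; i < n; i++)
         counts[data[i]]++;
     double h = 0.0;
     for (int b = 0; b < 256; b++) {
         if (counts[b] == 0)
             continue;
         double p = (double)counts[b] / n;  /* relative frequency of byte value b */
         h -= p * log2(p);                  /* sum of -p * log2(p) */
     }
     return h;                              /* between 0.0 and 8.0 bits per byte */
 }
 
 int main(void) {
     unsigned char uniform[256];
     for (int i = 0; i < 256; i++)
         uniform[i] = (unsigned char)i;     /* every byte value exactly once */
     printf("entropy = %f bits per byte\n", byte_entropy(uniform, 256));  /* 8.000000 */
     return 0;
 }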
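
For point 3, the statistic itself is straightforward; turning it into the "would exceed this value (#) percent of the times" figure needs the chi-square distribution with 255 degrees of freedom, which I haven't reproduced here. A sketch, with a made-up function name:

 /* Sketch of the chi-square statistic: compare the observed count of each
    byte value with the count expected from a uniform distribution (n/256). */
 #include <stdio.h>
 
 double chi_square_bytes(const unsigned char *data, size_t n) {
     long counts[256] = {0};
     for (size_t i = 0; i < n; i++)
         counts[data[i]]++;
     double expected = (double)n / 256.0;   /* uniform expectation per byte value */
     double chi = 0.0;
     for (int b = 0; b < 256; b++) {
         double diff = counts[b] - expected;
         chi += diff * diff / expected;
     }
     return chi;   /* compared against the chi-square distribution, 255 degrees of freedom */
 }
 
 int main(void) {
     unsigned char constant[1000];
     for (int i = 0; i < 1000; i++)
         constant[i] = 'a';                 /* worst case: every byte identical */
     printf("chi-square = %f\n", chi_square_bytes(constant, 1000));
     return 0;
 }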
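
The Monte Carlo pi estimate in point 5 treats pairs of values as (x, y) points in the unit square and counts how many land inside the quarter circle. This sketch uses the standard rand() generator rather than the hashed bytes the actual test consumes, so it only illustrates the idea:

 /* Sketch of the Monte Carlo pi estimate: the fraction of random points in
    the unit square that fall inside the quarter circle approaches pi/4. */
 #include <stdio.h>
 #include <stdlib.h>
 
 double monte_carlo_pi(long samples) {
     long inside = 0;
     for (long i = 0; i < samples; i++) {
         double x = rand() / (RAND_MAX + 1.0);
         double y = rand() / (RAND_MAX + 1.0);
         if (x * x + y * y < 1.0)
             inside++;                      /* point lies inside the quarter circle */
     }
     return 4.0 * inside / samples;         /* area ratio times 4 approximates pi */
 }
 
 int main(void) {
     srand(1);
     printf("pi estimate = %f\n", monte_carlo_pi(1000000));  /* roughly 3.14 */
     return 0;
 }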
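
Point 6 is the correlation between each byte and the byte that follows it. When every byte is identical the denominator is zero and the coefficient is undefined, which is presumably why the base256 low row in the table below reports "undefined" (its entropy of 0 means every sample byte was the same). Another sketch with a made-up function name:

 /* Sketch of the serial correlation coefficient: the correlation between
    each byte and its successor (wrapping the last byte around to the first).
    Values near 0 mean successive bytes are unrelated. Compile with -lm. */
 #include <stdio.h>
 #include <math.h>
 
 double serial_correlation(const unsigned char *data, size_t n) {
     double sx = 0, sy = 0, sxy = 0, sxx = 0, syy = 0;
     for (size_t i = 0; i < n; i++) {
         double x = data[i];
         double y = data[(i + 1) % n];      /* the following byte, wrapping at the end */
         sx += x;  sy += y;
         sxy += x * y;
         sxx += x * x;  syy += y * y;
     }
     double num = n * sxy - sx * sy;
     double den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
     return den == 0.0 ? 0.0 : num / den;   /* den == 0 is the "undefined" case */
 }
 
 int main(void) {
     unsigned char ramp[256];
     for (int i = 0; i < 256; i++)
         ramp[i] = (unsigned char)i;        /* strongly correlated: each byte is previous + 1 */
     printf("serial correlation = %f\n", serial_correlation(ramp, 256));  /* close to 1 */
     return 0;
 }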

Task One

I chose a sample size of 10000 because it gives a reasonably large spread of data. A good hash function should be able to give these values the appearance of randomness, without having to produce so many hash values that it begins to repeat or develop a very clear pattern.

Hash function     Entropy   Chi-square %   Arithmetic mean   Pi error % (Monte Carlo)   Serial correlation
Buzhash low       7.98550   99.00%         125.825           1.26%                      -0.000047
Buzhash high      7.97970   25.00%         127.177           1.95%                      -0.007153
buzhashn low      4.85247   0.01%          99.261            26.10%                     -0.285059
buzhashn high     7.77880   0.01%          123.087           1.11%                      -0.002250
hash_CRC low      4.83879   0.01%          78.156            26.10%                     -0.612667
hash_CRC high     7.77880   0.01%          123.087           1.11%                      -0.002250
base256 low       0.00000   0.01%          97.000            27.32%                     undefined
base256 high      3.90309   0.01%          107.987           27.32%                     0.079032
Java Int low      4.11606   0.01%          32.385            27.32%                     -0.217081
Java Int high     4.11606   0.01%          32.385            27.32%                     -0.217081
Java Object low   2.00000   0.01%          77.000            27.32%                     -0.521556
Java Object high  4.93352   0.01%          107.661           10.89%                     -0.493270
Java String low   7.99389   99.99%         127.178           0.19%                      -0.011155
Java String high  7.89301   0.01%          126.009           1.42%                      -0.021360
Rand low          7.93533   0.01%          112.054           10.82%                     -0.066125
Rand high         7.93951   0.01%          111.686           10.82%                     -0.060458
high rand low     7.98301   75.00%         127.541           0.42%                      0.001185
high rand high    7.98338   75.00%         128.211           1.49%                      -0.014177

Most of this makes sense, both the poorer results and the good ones. However, some results stand out:

  • Buzhash low 7.98550 99.00%
  • Buzhash high 7.97970 25.00%
  • base256 low 0.00000 0.01% 97.000 27.32% undefined
  • high rand low 7.98301 75.00% 127.541 0.42% 0.001185
  • high rand high 7.98338 75.00%

Hmm, these will need explaining.

In order of best to worst: