SE250:lab-5:Maintainers report
Lab 5 Report
Introduction
A hash function is any well-defined procedure or mathematical function for turning some kind of data into a relatively small integer that may serve as an index into an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.
Hash functions are mostly used to speed up table lookup or data comparison tasks --- such as finding items in a database, detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and so on.
A hash table can be loosely defined as an array of linked lists.
More precisely, a hash table is a table of values, each stored under a specific key. The table is populated using a hash function: a mapping that assigns a number to each input key, and that number is used as the array index at which the value is stored. Ideally each distinct input is assigned its own number, although in practice different inputs can collide.
A good hash function is essential for good hash table performance. A bad choice of hash function leads to clustering, which is when keys are assigned the same hash values more often than expected from a random function. In addition, some hash functions are computationally expensive, so the amount of time (and, in some cases, memory) taken to compute the hash may be burdensome.
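To make the "array of linked lists" picture concrete, here is a minimal sketch in C of insertion into a chained hash table. It is not the lab's rt_ code: the node type, the table size and the multiply-by-31 string hash are illustrative choices of my own.

#include <stdlib.h>

#define TABLE_SIZE 101              /* illustrative table size (a prime) */

/* one node in a bucket's linked list */
typedef struct node {
    const char *key;                /* stored by reference; a real table would copy it */
    struct node *next;
} node;

node *table[TABLE_SIZE];            /* the hash table: an array of list heads */

/* a simple illustrative string hash: multiply-and-add over the bytes,
   then reduce to an array index */
unsigned long hash( const char *key ) {
    unsigned long h = 0;
    while ( *key )
        h = h * 31 + (unsigned char)*key++;
    return h % TABLE_SIZE;
}

/* insert a key at the head of the list in its bucket */
void insert( const char *key ) {
    unsigned long i = hash( key );
    node *n = malloc( sizeof *n );
    n->key = key;
    n->next = table[i];
    table[i] = n;
}

int main( ) {
    insert( "one" );
    insert( "two" );
    insert( "three" );              /* collisions simply extend a bucket's list */
    return 0;
}

Collisions (two keys hashing to the same index) simply extend the list in that slot; how often they happen, and how unevenly, is exactly what the tests below measure.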
Methodology
In this week's lab we were expected to perform experiments on hash tables, to show how different hash functions behave.
All the tests were provided; it was up to us students to run them, look at the results and draw conclusions about which hash function would be best to use.
The hash functions that were provided were:
- rt_add_buzhash
- rt_add_buzhashn
- rt_add_hash_CRC
- rt_add_base256
- rt_add_Java_Integer
- rt_add_Java_Object
- rt_add_Java_String
- rt_add_rand
- rt_add_high_rand
I wanted to try to explain what each of these functions does to compute its hash values, but after looking at the code I decided it would not be the best idea, so I left it alone.
Definitions
- Entropy: a measure of how disordered a system is, an idea borrowed from thermodynamics. Here it measures how random the distribution in our hash table is; for this particular test, the ideal result is an entropy as close to 8 (bits per byte) as possible.
- Percentage by which the result could be compressed: the higher the percentage, the more the output can be compressed, i.e. the more repetition there is to eliminate. A high percentage therefore means low randomness.
- Chi Square: a statistical test of how well a distribution of observed data matches its theoretical probability distribution. The chi-square test flags data that is too regular compared with the theoretical values, as well as data that is too irregular. For a good hash function, the chi-square value should indicate that the data is close to its theoretical distribution, but not an exact match.
- Arithmetic mean: the expected mean for a set of random numbers is the midpoint between the highest and the lowest possible values. For this set of tests (byte values 0-255), the expected mean is 127.5.
- Monte Carlo: this method treats successive random numbers as coordinates of points on a Cartesian plane and computes the ratio of the points that lie inside a quarter circle centred at the origin to the total number of points plotted. That ratio is approximately a quarter of pi, so it can be used to estimate pi; the more accurate the estimated value of pi, the more random the numbers.
- Serial Correlation: measures how much the values in the table are related to one another; essentially, it tries to determine whether or not there is a pattern to the ordering of the values. (A sketch of how these statistics might be computed in code follows this list.)
Referenced from the HTTB 2007 and the Internet
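As a rough illustration of what these statistics measure, the following sketch computes four of them (entropy, arithmetic mean, Monte Carlo pi and serial correlation) over a buffer of bytes. It is not the code behind ent_test: the function name byte_stats, the pairing of bytes into Monte Carlo points and the omission of the chi-square and compression figures are simplifications of my own.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of the statistics ent_test reports, computed over a buffer of
   n bytes (n >= 2 assumed).  The real ent program uses more refined
   versions, e.g. 24-bit Monte Carlo coordinates. */
void byte_stats( const unsigned char *buf, size_t n ) {
    size_t count[256] = { 0 };
    double sum = 0.0;
    for ( size_t i = 0; i < n; i++ ) {
        count[buf[i]]++;
        sum += buf[i];
    }

    /* Entropy in bits per byte: -sum p*log2(p); 8.0 means perfectly random */
    double entropy = 0.0;
    for ( int v = 0; v < 256; v++ )
        if ( count[v] > 0 ) {
            double p = (double)count[v] / n;
            entropy -= p * log2( p );
        }

    /* Arithmetic mean: should be close to 127.5 for random bytes */
    double mean = sum / n;

    /* Monte Carlo estimate of pi: treat consecutive byte pairs as (x, y)
       points in a 256x256 square and count those inside the quarter circle */
    size_t inside = 0, points = 0;
    for ( size_t i = 0; i + 1 < n; i += 2 ) {
        double x = buf[i], y = buf[i + 1];
        if ( x * x + y * y <= 255.0 * 255.0 )
            inside++;
        points++;
    }
    double pi_est = 4.0 * inside / points;

    /* Serial correlation: how strongly each byte predicts the next one;
       close to 0 for a random stream */
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for ( size_t i = 0; i + 1 < n; i++ ) {
        double x = buf[i], y = buf[i + 1];
        sx += x; sy += y; sxx += x * x; syy += y * y; sxy += x * y;
    }
    size_t m = n - 1;
    double scc = ( m * sxy - sx * sy ) /
                 sqrt( ( m * sxx - sx * sx ) * ( m * syy - sy * sy ) );

    printf( "entropy %.5f  mean %.3f  pi %.5f  serial corr %.6f\n",
            entropy, mean, pi_est, scc );
}

int main( ) {
    /* quick demo: statistics of 10,000 bytes from the C library's rand() */
    unsigned char buf[10000];
    for ( size_t i = 0; i < sizeof buf; i++ )
        buf[i] = rand() % 256;
    byte_stats( buf, sizeof buf );
    return 0;
}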
Part 1
We used ent_test to compare the randomness of all the hash functions on both the low entropy source and the typical entropy source.
We also compared the results to the output of the Unix random number generators, and then arranged the values obtained from the functions in order from best to worst.
Choosing the sample size
Some students ran experiments to choose an appropriate sample size. Most of them used a single hash function to test different sample sizes, stopping when increasing the sample size made no further difference to the randomness of the hash function.
Example: (taken from hpan027's report)
To determine the sample size, I carried out a series of ent_test runs with different sample sizes.

Test          Sample size   Entropy   Chi-square   Mean      Monte Carlo   Serial correlation
Buzhash low             5   3.00000   50.00%       132.875   100.00%       -0.509682
Buzhash low            10   3.58496   50.00%       107.667    36.34%        0.235256
Buzhash low           100   6.15366    2.50%       130.460     3.45%       -0.088601
Buzhash low          1000   7.84646   97.50%       126.550     2.01%        0.006193
Buzhash low         10000   7.97970   25.00%       127.177     1.95%       -0.007153
Buzhash low        100000   7.99827   50.00%       127.587     0.14%        0.000712
Buzhash low       1000000   7.99989   99.99%       127.501     0.23%       -0.000832

Basically, the conclusion was that somewhere around 10,000 the results stop varying much, and hence 10,000 was chosen as the sample size for the rest of the tests.
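A sweep like this can be produced with a short loop; the following sketch assumes the same ent_test, low_entropy_src and rt_add_buzhash declarations that the Part 1.1 code below uses.

/* Sketch: sweep ent_test over increasing sample sizes for one hash function,
   to see where the statistics settle down. */
int main( ) {
    int sizes[] = { 5, 10, 100, 1000, 10000, 100000, 1000000 };
    int n_sizes = sizeof sizes / sizeof sizes[0];
    for ( int i = 0; i < n_sizes; i++ )
        ent_test( "Buzhash low", low_entropy_src, sizes[i], &rt_add_buzhash );
    return 0;
}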
Most students, however, made up numbers and used them without any justification. I think this is because the relationships between table size, number of keys and sample size are not very clear to the majority of them.
Part 1.1
CODE:
int main( ) {
    int sample_size = 1000;
    int n_keys = 1000;       /* declared but not used by ent_test */
    int table_size = 10000;  /* declared but not used by ent_test */

    ent_test( "Buzhash low",     low_entropy_src,     sample_size, &rt_add_buzhash );
    ent_test( "Buzhash high",    typical_entropy_src, sample_size, &rt_add_buzhash );
    /* Note: the two "buzhashn" tests below actually pass rt_add_hash_CRC
       (apparently a copy-and-paste slip), so the buzhashn results are really
       measuring hash_CRC. */
    ent_test( "buzhashn low",    low_entropy_src,     sample_size, &rt_add_hash_CRC );
    ent_test( "buzhashn high",   typical_entropy_src, sample_size, &rt_add_hash_CRC );
    ent_test( "hash_CRC low",    low_entropy_src,     sample_size, &rt_add_hash_CRC );
    ent_test( "hash_CRC high",   typical_entropy_src, sample_size, &rt_add_hash_CRC );
    ent_test( "base256 low",     low_entropy_src,     sample_size, &rt_add_base256 );
    ent_test( "base256 high",    typical_entropy_src, sample_size, &rt_add_base256 );
    ent_test( "Java Int low",    low_entropy_src,     sample_size, &rt_add_Java_Integer );
    ent_test( "Java Int high",   typical_entropy_src, sample_size, &rt_add_Java_Integer );
    ent_test( "Java Object low", low_entropy_src,     sample_size, &rt_add_Java_Object );
    ent_test( "Java Objec high", typical_entropy_src, sample_size, &rt_add_Java_Object );
    ent_test( "Java String low", low_entropy_src,     sample_size, &rt_add_Java_String );
    ent_test( "Java Strin high", typical_entropy_src, sample_size, &rt_add_Java_String );
    ent_test( "Rand low",        low_entropy_src,     sample_size, &rt_add_rand );
    ent_test( "Rand high",       typical_entropy_src, sample_size, &rt_add_rand );
    ent_test( "high rand low",   low_entropy_src,     sample_size, &rt_add_high_rand );
    ent_test( "high rand high",  typical_entropy_src, sample_size, &rt_add_high_rand );

    return 0;
}
Part 1.2
For: int sample_size = 1000; int n_keys = 1000; int table_size = 10000;

Test              Entropy   Chi-square   Mean      Monte Carlo   Serial correlation
Buzhash low       7.99830   75.00%       127.578    0.71%         0.000327
Buzhash high      7.99783    2.50%       127.374    0.77%        -0.000076
buzhashn low      5.59831    0.01%        81.783   24.53%         0.028545
buzhashn high     7.84046    0.01%       122.862    1.98%         0.018518
hash_CRC low      5.71827    0.01%       112.143   12.43%        -0.204906
hash_CRC high     7.84046    0.01%       122.862    1.98%         0.018518
base256 low       0.00000    0.01%        97.000   27.32%        undefined
base256 high      4.02297    0.01%       107.853   27.32%         0.034082
Java Int low      4.82824    0.01%        43.883   27.32%        -0.092002
Java Int high     4.82824    0.01%        43.883   27.32%        -0.092002
Java Object low   2.00000    0.01%        77.000   27.32%        -0.521556
Java Objec high   5.72621    0.01%       118.129    4.28%        -0.343281
Java String low   7.99958   99.99%       127.474    0.15%         0.001097
Java Strin high   7.94554    0.01%       126.139    0.27%         0.021181
Rand low          7.95308    0.01%       111.441   11.17%        -0.051837
Rand high         7.95272    0.01%       111.395   10.65%        -0.049131
high rand low     7.99828   75.00%       127.441    0.75%        -0.001213
high rand high    7.99807   50.00%       127.406    0.07%        -0.002226
Rearranged from best to worst:
Test              Entropy   Chi-square   Mean      Monte Carlo   Serial correlation
Buzhash low       7.99830   75.00%       127.578    0.71%         0.000327
high rand low     7.99828   75.00%       127.441    0.75%        -0.001213
high rand high    7.99807   50.00%       127.406    0.07%        -0.002226
Java String low   7.99958   99.99%       127.474    0.15%         0.001097
Buzhash high      7.99783    2.50%       127.374    0.77%        -0.000076
Java Strin high   7.94554    0.01%       126.139    0.27%         0.021181
buzhashn high     7.84046    0.01%       122.862    1.98%         0.018518
hash_CRC high     7.84046    0.01%       122.862    1.98%         0.018518
Rand low          7.95308    0.01%       111.441   11.17%        -0.051837
Rand high         7.95272    0.01%       111.395   10.65%        -0.049131
base256 high      4.02297    0.01%       107.853   27.32%         0.034082
Java Objec high   5.72621    0.01%       118.129    4.28%        -0.343281
hash_CRC low      5.71827    0.01%       112.143   12.43%        -0.204906
buzhashn low      5.59831    0.01%        81.783   24.53%         0.028545
Java Int low      4.82824    0.01%        43.883   27.32%        -0.092002
Java Int high     4.82824    0.01%        43.883   27.32%        -0.092002
Java Object low   2.00000    0.01%        77.000   27.32%        -0.521556
base256 low       0.00000    0.01%        97.000   27.32%        undefined
Part 2
We used llps (and llps_i) to measure the actual worst case performance of the supplied hash functions.
Running the code:
int main( ) {
    int sample_size = 100000;   /* declared but not used below */
    int n_keys = 10000;
    int table_size = 1000;

    printf( "Buzhash low %d/%d: llps = %d, expecting %g\n", n_keys, table_size,
            llps( n_keys, low_entropy_src, table_size, buzhash ),
            expected_llps( n_keys, table_size ) );
    printf( "Buzhash high %d/%d: llps = %d, expecting %g\n", n_keys, table_size,
            llps( n_keys, typical_entropy_src, table_size, buzhash ),
            expected_llps( n_keys, table_size ) );
    printf( "Hash CRC low %d/%d: llps = %d, expecting %g\n", n_keys, table_size,
            llps( n_keys, low_entropy_src, table_size, hash_CRC ),
            expected_llps( n_keys, table_size ) );
    printf( "Hash CRC high %d/%d: llps = %d, expecting %g\n", n_keys, table_size,
            llps( n_keys, typical_entropy_src, table_size, hash_CRC ),
            expected_llps( n_keys, table_size ) );
    printf( "Base 256 low %d/%d: llps = %d, expecting %g\n", n_keys, table_size,
            llps( n_keys, low_entropy_src, table_size, base256 ),
            expected_llps( n_keys, table_size ) );
    printf( "Base 256 high %d/%d: llps = %d, expecting %g\n", n_keys, table_size,
            llps( n_keys, typical_entropy_src, table_size, base256 ),
            expected_llps( n_keys, table_size ) );

    return 0;
}
Gives:
Buzhash low   10000/1000: llps = 20,    expecting 21.7096
Buzhash high  10000/1000: llps = 22,    expecting 21.7096
Hash CRC low  10000/1000: llps = 88,    expecting 21.7096
Hash CRC high 10000/1000: llps = 23,    expecting 21.7096
Base 256 low  10000/1000: llps = 10000, expecting 21.7096
Base 256 high 10000/1000: llps = 497,   expecting 21.7096
Rearranged from best to worst:

Buzhash low   10000/1000: llps = 20,    expecting 21.7096
Buzhash high  10000/1000: llps = 22,    expecting 21.7096
Hash CRC high 10000/1000: llps = 23,    expecting 21.7096
Hash CRC low  10000/1000: llps = 88,    expecting 21.7096
Base 256 high 10000/1000: llps = 497,   expecting 21.7096
Base 256 low  10000/1000: llps = 10000, expecting 21.7096
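For reference, llps here appears to be the length of the longest probe sequence, i.e. the longest chain that any single slot accumulates; this reading is consistent with Base 256 low reporting llps = 10000 when every one of the 10,000 keys lands in the same slot and the table degenerates into a single linked list. The supplied llps was treated as a black box, but a sketch of the idea for a chained table might look like the following, where longest_chain, hash_fn and keys are stand-ins of my own, not names from the lab code.

#include <stdlib.h>

/* Sketch: worst-case chain length for n_keys keys hashed into table_size
   slots.  hash_fn and keys stand in for the supplied hash functions and
   key sources. */
int longest_chain( int n_keys, int table_size,
                   const char *keys[], unsigned (*hash_fn)( const char * ) ) {
    int *len = calloc( table_size, sizeof *len );   /* chain length per slot */
    int worst = 0;
    for ( int i = 0; i < n_keys; i++ ) {
        int slot = hash_fn( keys[i] ) % table_size;
        if ( ++len[slot] > worst )
            worst = len[slot];
    }
    free( len );
    return worst;
}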
Conclusion
For Part 1, it was very difficult to determine the order of "randomness" of the functions, because it is hard to weigh the individual statistical tests against one another. The rankings would also likely change with the sample size: functions perform better within a certain range of inputs and so may be disadvantaged by this particular sample size.
The difference between rand and high_rand is that high_rand is generally better, at the cost of slightly more processing power and memory. high_rand tends to 'over'-randomise, whereas rand 'under'-randomises (i.e. it tends to be biased towards lower values rather than higher ones).
Some students ranked Buzhash as a better function than Buzhashn because it performed well overall, but the majority rated Buzhashn as the best function.
I would rank the hash functions from best to worst like this:
Buzhashn, Buzhash, Java String, hash_CRC, Java Object, Java Integer, base256.
For Part 2,
- Buzhash is pretty much consistent with the expected result
- Java_String seems to perform better than expected
- Java_Object seems to be broken with a low entropy source
- hash_CRC tends to have a higher llps than expected
The results in Part 2 roughly match what we expected in theory (the results in Part 1 of the experiment).
The hash functions, ordered from best to worst, are:
- Buzhash
- Hash CRC
- Base 256
Base256 is the worst of all!
Difficulties or Problems of This Lab
After reading all the comments by John Hamer, it was apparent that most students had chosen their values for sample size, n_keys and table size without solid justification. One of the best justifications seen in the lab reports was:
Results for the test were nearly the same for values of about 10,000 and above.
Since this was taken as the sample size, the table size had to be somewhat larger to accommodate it, and n_keys was chosen similarly.
Another problem was that most people simply copied and pasted the output without sufficiently explaining the results, and many failed to do most, if not all, of the testing and tasks involved. In some reports the table was either underloaded or overloaded.
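As a rough guide to what this means: the load factor is simply the number of keys divided by the table size. With the Part 2 values it is 10000 / 1000 = 10, so each slot holds about 10 keys on average (and the expected worst chain, as printed by expected_llps, is about 22); with the Part 1.1 values (n_keys = 1000, table_size = 10000) it is only 0.1, so the table is mostly empty. Choices far outside that sort of range are what "underloaded" and "overloaded" refer to.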