<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-GB">
	<id>https://wiki.kram.nz/index.php?action=history&amp;feed=atom&amp;title=SE250%3Alab-5%3AMaintainers_report</id>
	<title>SE250:lab-5:Maintainers report - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.kram.nz/index.php?action=history&amp;feed=atom&amp;title=SE250%3Alab-5%3AMaintainers_report"/>
	<link rel="alternate" type="text/html" href="https://wiki.kram.nz/index.php?title=SE250:lab-5:Maintainers_report&amp;action=history"/>
	<updated>2026-04-28T12:13:46Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.45.3</generator>
	<entry>
		<id>https://wiki.kram.nz/index.php?title=SE250:lab-5:Maintainers_report&amp;diff=6368&amp;oldid=prev</id>
		<title>Mark: 39 revision(s)</title>
		<link rel="alternate" type="text/html" href="https://wiki.kram.nz/index.php?title=SE250:lab-5:Maintainers_report&amp;diff=6368&amp;oldid=prev"/>
		<updated>2008-11-03T05:19:42Z</updated>

		<summary type="html">&lt;p&gt;39 revision(s)&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;= Lab 5 Report =&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
A hash function is any well-defined procedure or mathematical function for turning some kind of data into a relatively small integer that may serve as an index into an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.&lt;br /&gt;
&lt;br /&gt;
Hash functions are mostly used to speed up table lookup or data comparison tasks, such as finding items in a database, detecting duplicate or similar records in a large file, finding similar stretches in DNA sequences, and so on.&lt;br /&gt;
&lt;br /&gt;
A hash table can also be loosely defined as an array of linked lists.&lt;br /&gt;
&lt;br /&gt;
A hash table is, in effect, a two-dimensional array of values with a specific key attached to each value. The table is populated using a hash function: a mapping device that assigns a number to each piece of input data so that the data can be stored at the array index given by that number. Ideally, each assigned number is unique to the input data.&lt;br /&gt;
&lt;br /&gt;
A good hash function is essential for good hash table performance. A bad choice of hash function will eventually lead to clustering, which is when the probability of keys being assigned the same hash value is higher than expected from a random function. In addition, some hash functions are computationally expensive, so the amount of time (and, in some cases, memory) taken to compute the hash may be burdensome.&lt;br /&gt;
&lt;br /&gt;
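To make the &amp;#039;&amp;#039;array of linked lists&amp;#039;&amp;#039; idea concrete, here is a minimal sketch of a chaining hash table in C. It is an illustration only, not the lab&amp;#039;s actual code: the Node type, the multiply-by-31 hash and the table size are all made-up examples.&lt;br /&gt;
&lt;br /&gt;
 #include &lt;stdio.h&gt;&lt;br /&gt;
 #include &lt;stdlib.h&gt;&lt;br /&gt;
 &lt;br /&gt;
 #define TABLE_SIZE 101&lt;br /&gt;
 &lt;br /&gt;
 typedef struct Node {&lt;br /&gt;
   const char *key;&lt;br /&gt;
   struct Node *next;  /* chain of keys that share this slot */&lt;br /&gt;
 } Node;&lt;br /&gt;
 &lt;br /&gt;
 static Node *table[TABLE_SIZE];&lt;br /&gt;
 &lt;br /&gt;
 /* Illustrative hash function: mixes the bytes of the key, then maps&lt;br /&gt;
    the hash value to an array index with a modulo. */&lt;br /&gt;
 static unsigned hash( const char *key ) {&lt;br /&gt;
   unsigned h = 0;&lt;br /&gt;
   while ( *key )&lt;br /&gt;
     h = h * 31 + (unsigned char)*key++;&lt;br /&gt;
   return h % TABLE_SIZE;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 static void insert( const char *key ) {&lt;br /&gt;
   unsigned i = hash( key );&lt;br /&gt;
   Node *n = malloc( sizeof *n );&lt;br /&gt;
   n-&gt;key = key;&lt;br /&gt;
   n-&gt;next = table[i];  /* prepend to the list for this slot */&lt;br /&gt;
   table[i] = n;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main( ) {&lt;br /&gt;
   insert( &amp;quot;alpha&amp;quot; );&lt;br /&gt;
   insert( &amp;quot;beta&amp;quot; );&lt;br /&gt;
   printf( &amp;quot;alpha hashes to slot %u\n&amp;quot;, hash( &amp;quot;alpha&amp;quot; ) );&lt;br /&gt;
   return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Two keys that hash to the same index simply share the list at that slot, which is why clustering makes lookups slow.&lt;br /&gt;
&lt;br /&gt;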
== Methodology ==&lt;br /&gt;
&lt;br /&gt;
In this week&amp;#039;s lab we were expected to perform experiments on &amp;#039;&amp;#039;&amp;#039;hash tables&amp;#039;&amp;#039;&amp;#039;, to show how different hash functions work.&lt;br /&gt;
&lt;br /&gt;
All the tests were provided, and it was just up to us students to run them, look at the results and come to conclusions about which hash function would be best to use.&lt;br /&gt;
&lt;br /&gt;
The hash functions that were provided were:&lt;br /&gt;
&lt;br /&gt;
*rt_add_buzhash&lt;br /&gt;
*rt_add_buzhashn&lt;br /&gt;
*rt_add_hash_CRC&lt;br /&gt;
*rt_add_base256&lt;br /&gt;
*rt_add_Java_Integer&lt;br /&gt;
*rt_add_Java_Object&lt;br /&gt;
*rt_add_Java_String&lt;br /&gt;
*rt_add_rand&lt;br /&gt;
*rt_add_high_rand&lt;br /&gt;
&lt;br /&gt;
I wanted to try to explain what each of these functions does to get the hash values, but after looking at the code, I decided it would not be the best idea, so I left it alone.&lt;br /&gt;
&lt;br /&gt;
== Definitions ==&lt;br /&gt;
&lt;br /&gt;
*&amp;#039;&amp;#039;Entropy:&amp;#039;&amp;#039; A measure of how disordered a system is; the idea comes from thermodynamics. Basically, this measures whether our hash table output is random or not. The value returned by this test shows how random our distribution is, and for this particular test it is ideal if the entropy is as close to 8 (bits per byte) as possible.&lt;br /&gt;
&lt;br /&gt;
*&amp;#039;&amp;#039;Percentage by which the result could be compressed:&amp;#039;&amp;#039; The higher the percentage, the more the hash table output can be compressed (the more data can be eliminated because of repetition). So a high percentage means low randomness.&lt;br /&gt;
&lt;br /&gt;
*&amp;#039;&amp;#039;Chi Square:&amp;#039;&amp;#039; A statistical calculation used to test how well a distribution of observed data matches its theoretical probability function. Chi square catches cases where the data obtained is either too regular compared with the theoretical values, or too irregular. For a good function, the chi value should imply that the data is close to matching its theoretical values, but not an exact match.&lt;br /&gt;
&lt;br /&gt;
*&amp;#039;&amp;#039;Arithmetic mean:&amp;#039;&amp;#039; The expected mean for a set of random numbers is the centre point between the highest and the lowest possible values. For this set of tests (values 0-255), 127.5 is the expected mean.&lt;br /&gt;
&lt;br /&gt;
*&amp;#039;&amp;#039;Monte Carlo:&amp;#039;&amp;#039; This method treats successive random numbers as the coordinates of points in a unit square, then computes the ratio of the points that fall inside a quarter of a unit circle centred at the origin to the total number of points. This ratio approaches a quarter of pi, so the method can be used to estimate the value of pi. The more accurate the resulting value of pi, the more random the numbers are. (A sketch of this and the other statistics appears after these definitions.)&lt;br /&gt;
&lt;br /&gt;
*&amp;#039;&amp;#039;Serial Correlation:&amp;#039;&amp;#039; Measures how much the values in the hash table are related to each other; basically, it tries to determine whether or not there is a pattern to the ordering of the values in the hash table.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Referenced from the HTTB 2007 and the Internet&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
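To make these statistics more concrete, here is a hedged sketch of how some of them could be computed over a buffer of hash-output bytes. It is not the code behind ent_test; the demo buffer is invented, and the compression percentage is omitted.&lt;br /&gt;
&lt;br /&gt;
 #include &lt;stdio.h&gt;&lt;br /&gt;
 #include &lt;math.h&gt;&lt;br /&gt;
 &lt;br /&gt;
 /* Entropy, chi-square, mean and a Monte Carlo estimate of pi over a&lt;br /&gt;
    buffer of bytes. Compile with -lm for the maths library. */&lt;br /&gt;
 static void byte_stats( const unsigned char *buf, long n ) {&lt;br /&gt;
   long count[256] = { 0 };&lt;br /&gt;
   double sum = 0.0, entropy = 0.0, chisq = 0.0;&lt;br /&gt;
   double expected = n / 256.0;  /* expected count per byte value */&lt;br /&gt;
   long i, pairs = 0, inside = 0;&lt;br /&gt;
 &lt;br /&gt;
   for ( i = 0; i &lt; n; i++ ) {&lt;br /&gt;
     count[buf[i]]++;&lt;br /&gt;
     sum += buf[i];&lt;br /&gt;
   }&lt;br /&gt;
 &lt;br /&gt;
   for ( i = 0; i &lt; 256; i++ ) {&lt;br /&gt;
     double d = count[i] - expected;&lt;br /&gt;
     chisq += d * d / expected;  /* chi-square statistic */&lt;br /&gt;
     if ( count[i] &gt; 0 ) {&lt;br /&gt;
       double p = (double)count[i] / n;&lt;br /&gt;
       entropy -= p * log( p ) / log( 2.0 );  /* Shannon entropy, bits */&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 &lt;br /&gt;
   /* Monte Carlo: consecutive byte pairs become (x, y) points in the&lt;br /&gt;
      unit square; the fraction inside the quarter circle tends to pi/4 */&lt;br /&gt;
   for ( i = 0; i + 1 &lt; n; i += 2 ) {&lt;br /&gt;
     double x = buf[i] / 255.0, y = buf[i + 1] / 255.0;&lt;br /&gt;
     pairs++;&lt;br /&gt;
     if ( x * x + y * y &lt;= 1.0 )&lt;br /&gt;
       inside++;&lt;br /&gt;
   }&lt;br /&gt;
 &lt;br /&gt;
   printf( &amp;quot;entropy = %f bits per byte (ideal 8)\n&amp;quot;, entropy );&lt;br /&gt;
   printf( &amp;quot;chi-square = %f\n&amp;quot;, chisq );&lt;br /&gt;
   printf( &amp;quot;mean = %f (ideal 127.5)\n&amp;quot;, sum / n );&lt;br /&gt;
   printf( &amp;quot;pi estimate = %f\n&amp;quot;, 4.0 * inside / pairs );&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main( ) {&lt;br /&gt;
   unsigned char demo[6] = { 12, 200, 56, 78, 161, 9 };  /* stand-in data */&lt;br /&gt;
   byte_stats( demo, 6 );&lt;br /&gt;
   return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;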
==Part 1==&lt;br /&gt;
&lt;br /&gt;
We used ent_test to compare the randomness of all the hash functions on both the low entropy source and the typical entropy source.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We also compared the results to the output from the Unix random number generators, then arranged the values obtained from the functions in order from best to worst.&lt;br /&gt;
&lt;br /&gt;
=== Choosing the sample size ===&lt;br /&gt;
Some students did experiments to choose an appropriate sample size. The majority of those used one function to test different sample sizes, stopping when increasing the sample size made no difference to the randomness of the hash function.&lt;br /&gt;
&lt;br /&gt;
Example: (taken from hpan027&amp;#039;s report)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 To determine the sample size, I carried out a series of ent_test runs with different sample sizes&lt;br /&gt;
 (columns: sample size, entropy, chi-square probability, mean, Monte Carlo pi error, serial correlation).&lt;br /&gt;
 &lt;br /&gt;
 Buzhash low 5		3.00000	50.00%	132.875	100.00%	-0.509682&lt;br /&gt;
 Buzhash low 10		3.58496	50.00%	107.667	36.34%	0.235256&lt;br /&gt;
 Buzhash low 100		6.15366	2.50%	130.460	3.45%	-0.088601&lt;br /&gt;
 Buzhash low 1000	7.84646	97.50%	126.550	2.01%	0.006193&lt;br /&gt;
 Buzhash low 10000	7.97970	25.00%	127.177	1.95%	-0.007153&lt;br /&gt;
 Buzhash low 100000	7.99827	50.00%	127.587	0.14%	0.000712&lt;br /&gt;
 Buzhash low 1000000	7.99989	99.99%	127.501	0.23%	-0.000832&lt;br /&gt;
 &lt;br /&gt;
 Basically, the conclusion was that somewhere around 10,000 the results stop varying much,&lt;br /&gt;
 and hence 10,000 was chosen as the sample size for the rest of the tests.&lt;br /&gt;
&lt;br /&gt;
Most students made up some numbers and used them without any justification. I think this is because the relationships between table size, number of keys and sample size are not very clear to the majority of them.&lt;br /&gt;
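&lt;br /&gt;
A systematic way to choose the sample size, following hpan027&amp;#039;s approach above, is to sweep it and stop when the statistics settle. A minimal sketch, assuming the same ent_test harness that the Part 1.1 code below uses:&lt;br /&gt;
&lt;br /&gt;
 int main( ) {&lt;br /&gt;
   int sample_size;&lt;br /&gt;
 &lt;br /&gt;
   /* Grow the sample size tenfold each round, as in the table above;&lt;br /&gt;
      stop once the reported statistics no longer change much. */&lt;br /&gt;
   for ( sample_size = 10; sample_size &lt;= 1000000; sample_size *= 10 )&lt;br /&gt;
     ent_test( &amp;quot;Buzhash low&amp;quot;, low_entropy_src, sample_size, &amp;amp;rt_add_buzhash );&lt;br /&gt;
 &lt;br /&gt;
   return 0;&lt;br /&gt;
 }&lt;br /&gt;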
&lt;br /&gt;
=== Part 1.1 ===&lt;br /&gt;
CODE:&lt;br /&gt;
&lt;br /&gt;
 int main( ) {&lt;br /&gt;
   int sample_size = 1000;&lt;br /&gt;
   int n_keys = 1000;&lt;br /&gt;
   int table_size = 10000;&lt;br /&gt;
 &lt;br /&gt;
   ent_test( &amp;quot;Buzhash low&amp;quot;, low_entropy_src, sample_size, &amp;amp;rt_add_buzhash );&lt;br /&gt;
   ent_test( &amp;quot;Buzhash high&amp;quot;, typical_entropy_src, sample_size, &amp;amp;rt_add_buzhash );&lt;br /&gt;
 &lt;br /&gt;
   ent_test( &amp;quot;buzhashn low&amp;quot;, low_entropy_src, sample_size, &amp;amp;rt_add_buzhashn );&lt;br /&gt;
   ent_test( &amp;quot;buzhashn high&amp;quot;, typical_entropy_src, sample_size, &amp;amp;rt_add_buzhashn );&lt;br /&gt;
 &lt;br /&gt;
   ent_test( &amp;quot;hash_CRC low&amp;quot;, low_entropy_src, sample_size, &amp;amp;rt_add_hash_CRC );&lt;br /&gt;
   ent_test( &amp;quot;hash_CRC high&amp;quot;, typical_entropy_src, sample_size, &amp;amp;rt_add_hash_CRC );&lt;br /&gt;
 &lt;br /&gt;
   ent_test( &amp;quot;base256 low&amp;quot;, low_entropy_src, sample_size, &amp;amp;rt_add_base256 );&lt;br /&gt;
   ent_test( &amp;quot;base256 high&amp;quot;, typical_entropy_src, sample_size, &amp;amp;rt_add_base256 );&lt;br /&gt;
 &lt;br /&gt;
   ent_test( &amp;quot;Java Int low&amp;quot;, low_entropy_src, sample_size, &amp;amp;rt_add_Java_Integer );&lt;br /&gt;
   ent_test( &amp;quot;Java Int high&amp;quot;, typical_entropy_src, sample_size, &amp;amp;rt_add_Java_Integer );&lt;br /&gt;
 &lt;br /&gt;
   ent_test( &amp;quot;Java Object low&amp;quot;, low_entropy_src, sample_size, &amp;amp;rt_add_Java_Object );&lt;br /&gt;
   ent_test( &amp;quot;Java Object high&amp;quot;, typical_entropy_src, sample_size, &amp;amp;rt_add_Java_Object );&lt;br /&gt;
 &lt;br /&gt;
   ent_test( &amp;quot;Java String low&amp;quot;, low_entropy_src, sample_size, &amp;amp;rt_add_Java_String );&lt;br /&gt;
   ent_test( &amp;quot;Java String high&amp;quot;, typical_entropy_src, sample_size, &amp;amp;rt_add_Java_String );&lt;br /&gt;
   &lt;br /&gt;
   ent_test( &amp;quot;Rand  low&amp;quot;, low_entropy_src, sample_size, &amp;amp;rt_add_rand );&lt;br /&gt;
   ent_test( &amp;quot;Rand  high&amp;quot;, typical_entropy_src, sample_size, &amp;amp;rt_add_rand );&lt;br /&gt;
 &lt;br /&gt;
   ent_test( &amp;quot;high rand low&amp;quot;, low_entropy_src, sample_size, &amp;amp;rt_add_high_rand );&lt;br /&gt;
   ent_test( &amp;quot;high rand high&amp;quot;, typical_entropy_src, sample_size, &amp;amp;rt_add_high_rand );&lt;br /&gt;
  &lt;br /&gt;
   return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
===Part 1.2===&lt;br /&gt;
&lt;br /&gt;
 For:&lt;br /&gt;
   int sample_size = 1000;&lt;br /&gt;
   int n_keys = 1000;&lt;br /&gt;
   int table_size = 10000;&lt;br /&gt;
 &lt;br /&gt;
 Columns: entropy, chi-square probability, arithmetic mean, Monte Carlo pi error, serial correlation&lt;br /&gt;
 Buzhash low	7.99830	75.00%	127.578	0.71%	0.000327 &lt;br /&gt;
 Buzhash high	7.99783	2.50%	127.374	0.77%	-0.000076&lt;br /&gt;
 buzhashn low	5.59831	0.01%	81.783	24.53%	0.028545 &lt;br /&gt;
 buzhashn high	7.84046	0.01%	122.862	1.98%	0.018518&lt;br /&gt;
 hash_CRC low	5.71827	0.01%	112.143	12.43%	-0.204906&lt;br /&gt;
 hash_CRC high	7.84046	0.01%	122.862	1.98%	0.018518 &lt;br /&gt;
 base256 low	0.00000	0.01%	97.000	27.32%	undefined&lt;br /&gt;
 base256 high	4.02297	0.01%	107.853	27.32%	0.034082&lt;br /&gt;
 Java Int low	4.82824	0.01%	43.883	27.32%	-0.092002&lt;br /&gt;
 Java Int high	4.82824	0.01%	43.883	27.32%	-0.092002&lt;br /&gt;
 Java Object low	2.00000	0.01%	77.000	27.32%	-0.521556&lt;br /&gt;
 Java Object high	5.72621	0.01%	118.129	4.28%	-0.343281&lt;br /&gt;
 Java String low	7.99958	99.99%	127.474	0.15%	0.001097&lt;br /&gt;
 Java String high	7.94554	0.01%	126.139	0.27%	0.021181&lt;br /&gt;
 Rand  low	7.95308	0.01%	111.441	11.17%	-0.051837 &lt;br /&gt;
 Rand  high	7.95272	0.01%	111.395	10.65%	-0.049131&lt;br /&gt;
 high rand low	7.99828	75.00%	127.441	0.75%	-0.001213&lt;br /&gt;
 high rand high	7.99807	50.00%	127.406	0.07%	-0.002226&lt;br /&gt;
 &lt;br /&gt;
 &lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Rearranged from best to worst:&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
 Buzhash low	7.9983	75.00%	127.578	0.71%	0.000327&lt;br /&gt;
 high rand low	7.99828	75.00%	127.441	0.75%	-0.001213&lt;br /&gt;
 high rand high	7.99807	50.00%	127.406	0.07%	-0.002226&lt;br /&gt;
 Java String low	7.99958	99.99%	127.474	0.15%	0.001097 &lt;br /&gt;
 Buzhash high	7.99783	2.50%	127.374	0.77%	-0.000076&lt;br /&gt;
 Java String high	7.94554	0.01%	126.139	0.27%	0.021181&lt;br /&gt;
 buzhashn high	7.84046	0.01%	122.862	1.98%	0.018518&lt;br /&gt;
 hash_CRC high	7.84046	0.01%	122.862	1.98%	0.018518 &lt;br /&gt;
 Rand  low	7.95308	0.01%	111.441	11.17%	-0.051837&lt;br /&gt;
 Rand  high	7.95272	0.01%	111.395	10.65%	-0.049131&lt;br /&gt;
 base256 high	4.02297	0.01%	107.853	27.32%	0.034082&lt;br /&gt;
 Java Object high	5.72621	0.01%	118.129	4.28%	-0.343281&lt;br /&gt;
 hash_CRC low	5.71827	0.01%	112.143	12.43%	-0.204906&lt;br /&gt;
 buzhashn low	5.59831	0.01%	81.783	24.53%	0.028545&lt;br /&gt;
 Java Int low	4.82824	0.01%	43.883	27.32%	-0.092002&lt;br /&gt;
 Java Int high	4.82824	0.01%	43.883	27.32%	-0.092002&lt;br /&gt;
 Java Object low	2	0.01%	77	27.32%	-0.521556 &lt;br /&gt;
 base256 low	0	0.01%	97	27.32%	undefined&lt;br /&gt;
&lt;br /&gt;
==Part 2==&lt;br /&gt;
&lt;br /&gt;
We used llps (and llps_i) to measure the actual worst-case performance of the supplied hash functions.&lt;br /&gt;
&lt;br /&gt;
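Here llps is taken to be the length of the longest list in the table, i.e. the worst case a lookup can hit. As a hedged illustration of what such a measurement involves (llps itself was supplied by the lab), here is a sketch that reuses the made-up Node type from the Introduction sketch:&lt;br /&gt;
&lt;br /&gt;
 /* Illustration only: scan a chained table and return the length of&lt;br /&gt;
    its longest list, which is what llps appears to report. */&lt;br /&gt;
 static int longest_list( Node **table, int table_size ) {&lt;br /&gt;
   int i, max = 0;&lt;br /&gt;
   for ( i = 0; i &lt; table_size; i++ ) {&lt;br /&gt;
     int len = 0;&lt;br /&gt;
     Node *n;&lt;br /&gt;
     for ( n = table[i]; n != NULL; n = n-&gt;next )&lt;br /&gt;
       len++;&lt;br /&gt;
     if ( len &gt; max )&lt;br /&gt;
       max = len;&lt;br /&gt;
   }&lt;br /&gt;
   return max;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The closer this number stays to expected_llps for a given load, the better the hash function spreads the keys.&lt;br /&gt;
&lt;br /&gt;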
Running the code:&lt;br /&gt;
&lt;br /&gt;
 int main( ) {&lt;br /&gt;
   int sample_size = 100000;&lt;br /&gt;
   int n_keys = 10000;&lt;br /&gt;
   int table_size = 1000;&lt;br /&gt;
 &lt;br /&gt;
   printf( &amp;quot;Buzhash low %d/%d: llps = %d, expecting %g\n&amp;quot;, n_keys, table_size, &lt;br /&gt;
   llps( n_keys, low_entropy_src, table_size, buzhash ), expected_llps( n_keys, table_size ));&lt;br /&gt;
   printf( &amp;quot;Buzhash high %d/%d: llps = %d, expecting %g\n&amp;quot;, n_keys, table_size, &lt;br /&gt;
   llps( n_keys, typical_entropy_src, table_size, buzhash ), expected_llps( n_keys, table_size ));&lt;br /&gt;
 &lt;br /&gt;
   printf( &amp;quot;Hash CRC low %d/%d: llps = %d, expecting %g\n&amp;quot;, n_keys, table_size, &lt;br /&gt;
   llps( n_keys, low_entropy_src, table_size, hash_CRC ), expected_llps( n_keys, table_size ));&lt;br /&gt;
   printf( &amp;quot;Hash CRC high %d/%d: llps = %d, expecting %g\n&amp;quot;, n_keys, table_size, &lt;br /&gt;
   llps( n_keys, typical_entropy_src, table_size,  hash_CRC ), expected_llps( n_keys, table_size )); &lt;br /&gt;
 &lt;br /&gt;
   printf( &amp;quot;Base 256 low %d/%d: llps = %d, expecting %g\n&amp;quot;, n_keys, table_size, &lt;br /&gt;
   llps( n_keys, low_entropy_src, table_size, base256 ), expected_llps( n_keys, table_size ));&lt;br /&gt;
   printf( &amp;quot;Base 256 high %d/%d: llps = %d, expecting %g\n&amp;quot;, n_keys, table_size, &lt;br /&gt;
   llps( n_keys, typical_entropy_src, table_size,  base256  ), expected_llps( n_keys, table_size ));&lt;br /&gt;
 &lt;br /&gt;
  return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Gives:&lt;br /&gt;
&lt;br /&gt;
 Buzhash low 10000/1000: llps = 20, expecting 21.7096 &lt;br /&gt;
 Buzhash high 10000/1000: llps = 22, expecting 21.7096&lt;br /&gt;
 Hash CRC low 10000/1000: llps = 88, expecting 21.7096&lt;br /&gt;
 Hash CRC high 10000/1000: llps = 23, expecting 21.7096&lt;br /&gt;
 Base 256 low 10000/1000: llps = 10000, expecting 21.7096&lt;br /&gt;
 Base 256 high 10000/1000: llps = 497, expecting 21.7096&lt;br /&gt;
&lt;br /&gt;
 Rearranged from best to worst: &lt;br /&gt;
 Buzhash low 10000/1000: llps = 20, expecting 21.7096&lt;br /&gt;
 Buzhash high 10000/1000: llps = 22, expecting 21.7096&lt;br /&gt;
 Hash CRC high 10000/1000: llps = 23, expecting 21.7096&lt;br /&gt;
 Hash CRC low 10000/1000: llps = 88, expecting 21.7096&lt;br /&gt;
 Base 256 high 10000/1000: llps = 497, expecting 21.7096&lt;br /&gt;
 Base 256 low 10000/1000: llps = 10000, expecting 21.7096&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
&lt;br /&gt;
For part 1, it was very difficult to determine the order of &amp;quot;randomness&amp;quot; of each function, because it is hard to weigh the statistical tests against each other. The ranks for randomness would likely change depending on sample size: some functions perform better within a certain range of inputs and hence are disadvantaged by this particular sample size.&lt;br /&gt;
&lt;br /&gt;
The difference between rand and high_rand is that high_rand is generally better, but costs slightly more processing power and memory. high_rand tends to &amp;#039;over&amp;#039;-randomise, whereas rand &amp;#039;under&amp;#039;-randomises (i.e. tends to be biased towards lower values rather than higher ones).&lt;br /&gt;
&lt;br /&gt;
Some students ranked Buzhash as a better function than buzhashn, as it performed well overall, but the majority had buzhashn as the best-rated function.&lt;br /&gt;
&lt;br /&gt;
I would rank the hash functions from best to worst like this:&lt;br /&gt;
&lt;br /&gt;
buzhashn, buzhash, Java String, hash_CRC, Java Object, Java Integer, base256.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For Part 2:&lt;br /&gt;
 &lt;br /&gt;
    * Buzhash is pretty much consistent with the expected result&lt;br /&gt;
    * Java_String seems to perform better than expected&lt;br /&gt;
    * Java_Object seems to be broken with a low entropy source&lt;br /&gt;
    * hash_CRC tends to have a higher llps than expected&lt;br /&gt;
&lt;br /&gt;
The results in part 2 roughly match what we expected from theory (i.e. the results in part 1 of the experiment).&lt;br /&gt;
&lt;br /&gt;
The hash functions tested here, ordered from best to worst, are:&lt;br /&gt;
 Buzhash&lt;br /&gt;
 Hash CRC&lt;br /&gt;
 Base 256&lt;br /&gt;
&lt;br /&gt;
Base256 is the worst of all: with the low entropy source, every one of the 10,000 keys ended up in a single list (llps = 10000).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Difficulties or Problems of This Lab ==&lt;br /&gt;
&lt;br /&gt;
After reading all the comments by John Hamer, it was apparent that most of the students had chosen values for sample size, n_keys and table size without solid justification.&lt;br /&gt;
One of the best justifications seen in the lab reports was: &lt;br /&gt;
&lt;br /&gt;
Results for the tests were nearly the same for sample sizes of around 10,000 and above.&lt;br /&gt;
&lt;br /&gt;
And since this was taken as the sample size, the table size had to be somewhat greater to accommodate it, and similarly for n_keys.&lt;br /&gt;
&lt;br /&gt;
Another problem was that most people just copied and pasted the output without sufficient explanation of the results. Many people failed to do most, if not all, of the testing and tasks involved. In some reports, it was noticed that the table was either underloaded or overloaded.&lt;/div&gt;</summary>
		<author><name>Mark</name></author>
	</entry>
</feed>