C Program To Implement Dictionary Using Hashing Functions

Idea of a Hash Table

We will use an array of buckets to store the data, then use a hash function to turn a string into a number in the range 0 to MAXELEMENTS - 1. Each bucket will hold a linked list of strings, so you can retrieve the information again. Insertion and find are typically O(1).

In this problem you will implement a dictionary using a hash table. The idea of a hash table is very simple, and decidedly hackish.

We are going to write a hash table to store strings, but the same idea can be adapted to any other datatype. We write a function (the hash function) that takes a string and 'hashes' it, i.e. returns an integer (possibly large) that is obtained by manipulating the bytes of the input string. For a good hash function all bytes of the input string must affect the output, and the dependence of the output on the strings should be non-obvious. Good hash functions found in actual library code use the fact that a character is interpreted as a small integer, and the same approach carries over to C++ class strings.
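As an illustration of such a function, here is a minimal Python sketch of a multiply-and-add string hash. The name hash_string, the multiplier 31 and the 32-bit mask are assumptions chosen for the example; this is not the library code the text refers to.

```python
def hash_string(s: str) -> int:
    """Multiply-and-add hash over the bytes of s (illustrative only)."""
    h = 0
    for byte in s.encode("utf-8"):
        # Each byte is a small integer; fold it in and keep h within 32 bits.
        h = (h * 31 + byte) & 0xFFFFFFFF
    return h


if __name__ == "__main__":
    # Same letters in a different order give different hashes.
    for word in ("and", "dan", "end"):
        print(word, hash_string(word))
```

Every byte takes part in the result, and reordering the bytes changes it, which are the two properties asked for above.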

With some such hash function, we try the following idea. Declare an array of length N and, for any string str, store the string in slot hash_string(str) % N of the array. Since the results of hashing a string appear random (but, of course, the result for any given string is always the same), this slot number will hopefully be distributed uniformly through the array. If we want to check if a string is already stored in our array, we hash the string and look at the slot determined by the above formula. The best distribution of indices has been observed for particular values of N, such as 12289, 24593 and 49157 (the table sizes suggested in the exercise below).

It may happen that two different strings hash to the same slot number, i.e. collide. If N is large enough, there won't be many such collisions. This difficulty can be resolved as follows: our array will be an array of lists of strings instead of individual strings. These lists are called buckets in some books. Thus the algorithms for adding a string to the hash table and for looking one up both start the same way.
For a string str:

Apply the hash function to str, take mod by N. This is the index in the hash array.
To add str, append it to the linked list (bucket) at that index.
To look up str, check every element of the linked list (bucket) at that index. If str is found, return true, else return false.
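The steps above can be sketched in Python as follows. This is a minimal sketch under some assumptions: Python lists stand in for the linked lists, the class and method names are hypothetical, and the hash is a simple multiply-and-add stand-in rather than any particular library function.

```python
class HashTable:
    """Array of N buckets; each bucket holds the strings that hash to it."""

    def __init__(self, n_buckets: int = 12289):
        self.n = n_buckets
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, s: str) -> int:
        # Stand-in hash function (an assumption): multiply-and-add over bytes.
        h = 0
        for byte in s.encode("utf-8"):
            h = (h * 31 + byte) & 0xFFFFFFFF
        return h % self.n  # hash, then take mod by N

    def add(self, s: str) -> None:
        bucket = self.buckets[self._index(s)]
        if s not in bucket:  # do not store the same string twice
            bucket.append(s)

    def contains(self, s: str) -> bool:
        # Only the one bucket selected by the hash has to be searched.
        return s in self.buckets[self._index(s)]


if __name__ == "__main__":
    table = HashTable(7)
    for word in ("and", "end", "and"):
        table.add(word)
    print(table.contains("end"), table.contains("dne"))
```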

Why use Hash Tables?

Imagine a dictionary implemented as one long linked list (of strings). Finding a word in it may require going all the way to the end of the list, checking every element for equality with the search string. This is very inefficient and slow.

Now look at the same process for the hash table. Hashing a string does not take long (see hash_string), and then we know the right bucket right away. Of course, the bucket is a linked list and has to be searched by comparing every element in it to the search string, but if the index obtained by hashing is roughly uniformly distributed across the array of buckets, the buckets should not be long (on average, a bucket will contain TOTAL_WORDS_STORED / N elements).

Thus a hash table is much better than an array or a linked list, because we zero in on the right bucket by hashing, and then have only a few words that collided at that bucket's index to check. All kinds of serious applications use hash tables, including C++ compilers that keep the names of defined variables, functions and classes in one.

Exercise (Problem D)

On UNIX systems there is a file that contains all frequently used words of the English language, usually /usr/dict/words. Here is one, zipped: words.zip. Write your hash table for strings, and fill it in by reading that file one word at a time. There are 45402 words in my file, so reasonable choices for hash table size are 12289, 24593, and 49157. You can make your class keep statistics of the largest and the average number of collisions (i.e. sizes of buckets).

Then, in the same manner as above, read in a text file and print out all words that are possible spelling errors, i.e. words not found in the hash table. Write a member function bool HashTable::Contains(const string & ) for this check. There is a Unix program, ispell, that does something like that.
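The whole exercise can be sketched in Python as below. The function names are hypothetical, the hash is a simple multiply-and-add stand-in, and Python lists play the role of the linked-list buckets; a real solution would read the word list from a file instead of the small in-memory list used here.

```python
def bucket_index(word, n_buckets):
    # Stand-in hash function (an assumption), followed by mod N.
    h = 0
    for byte in word.encode("utf-8"):
        h = (h * 31 + byte) & 0xFFFFFFFF
    return h % n_buckets


def build_table(words, n_buckets):
    """Fill an array of buckets from an iterable of words."""
    buckets = [[] for _ in range(n_buckets)]
    for word in words:
        bucket = buckets[bucket_index(word, n_buckets)]
        if word not in bucket:
            bucket.append(word)
    return buckets


def contains(buckets, word):
    return word in buckets[bucket_index(word, len(buckets))]


def bucket_stats(buckets):
    """Return (largest bucket size, average bucket size)."""
    sizes = [len(b) for b in buckets]
    return max(sizes), sum(sizes) / len(sizes)


def possible_misspellings(buckets, text):
    # Any run of non-whitespace characters counts as a word.
    return [w for w in text.split() if not contains(buckets, w)]


if __name__ == "__main__":
    table = build_table(["and", "end", "the", "word"], 7)
    print(bucket_stats(table))
    print(possible_misspellings(table, "the end of the wrod"))
```

With the 45402-word file and N = 12289, the average bucket size would be 45402 / 12289, or about 3.7, provided all words in the file are distinct.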

Pesky technical problems

Notice that from the point of view of the program above, any sequence of non-whitespace characters is a word. Whitespace characters (space, newline and tab) are seen as word separators and skipped. Thus you need to remove any punctuation marks (except apostrophes) from the file being spell checked before you can run the program on your text, so that 'end.', 'end!', 'end?' and 'end' are not different words. This can be done by another small program, or by search-and-replace in any good text editor. There is also the issue of capital vs. lowercase letters: 'And' and 'and' should be the same word. This issue, unfortunately, cannot be resolved with any text editor (other than Emacs) that I know of. I can pre-process any text file for you.

Both of these preparatory tasks can be carried out with just two commands of the UNIX operating system, both using tr. The first command changes any character which is not a letter in the ranges A-Z or a-z or a newline (\n) into a space. The second command changes each capital letter to the corresponding lowercase one. See man tr on your UNIX system for more information. There is nothing like that on MS Windows or MacOS.
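For readers without tr at hand, the same two steps can be sketched in Python. The function name normalize is hypothetical. The sketch follows the description literally, so apostrophes are turned into spaces as well; adding ' to the character class would keep them.

```python
import re


def normalize(text: str) -> str:
    # Step 1: every character that is not A-Z, a-z or a newline becomes a space.
    letters_only = re.sub(r"[^A-Za-z\n]", " ", text)
    # Step 2: each capital letter becomes the corresponding lowercase one.
    return letters_only.lower()


if __name__ == "__main__":
    print(normalize("And the end!\nEnd?"))
```

After this step 'end.', 'end!' and 'End?' all yield the word 'end' when the text is split on whitespace.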

This post describes how dictionaries are implemented in the Python language.

Dictionaries are indexed by keys and they can be seen as associative arrays. Key/value pairs are added to a dictionary either in its literal or by assigning to a key, and a value is accessed by indexing the dictionary with its key. If the key 'd' does not exist, accessing it raises a KeyError exception.
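A short example of these three steps follows. The keys 'a', 'b' and 'c' and their values are assumed for illustration.

```python
d = {"a": 1, "b": 2}  # two pairs given in the literal
d["c"] = 3            # a third pair added by assignment

print(d["a"], d["b"], d["c"])  # 1 2 3

try:
    d["d"]  # this key was never added
except KeyError as exc:
    print("KeyError:", exc)  # KeyError: 'd'
```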

Hash tables

Python dictionaries are implemented using hash tables. A hash table is an array whose indexes are obtained by applying a hash function to the keys. The goal of a hash function is to distribute the keys evenly in the array. A good hash function minimizes the number of collisions, i.e. different keys having the same hash. Python does not have this kind of hash function. Its most important hash functions (for strings and ints) are very regular in common cases; for example, small integers hash to themselves.

We are going to assume that we are using strings as keys for the rest of this post. The hash function for strings in Python is the C function string_hash().

If you run hash('a') in Python, it will execute string_hash() and return 12416037344. Here we assume we are using Python 2 on a 64-bit machine; Python 3 randomizes string hashes by default, so the value differs there.
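To make the computation concrete, here is a Python transcription of the string hashing algorithm as I understand it from CPython 2.x, assuming a 64-bit C long and no hash randomization. It is a sketch for experimentation, not the C source itself.

```python
def string_hash(s: str) -> int:
    """Sketch of CPython 2.x string hashing for a 64-bit C long."""
    data = s.encode("latin-1")  # a Python 2 str is a byte string
    if not data:
        return 0
    mask = 0xFFFFFFFFFFFFFFFF
    x = (data[0] << 7) & mask
    for byte in data:
        # Multiply by 1000003 and mix in the next byte, wrapping at 64 bits.
        x = ((1000003 * x) ^ byte) & mask
    x ^= len(data)
    if x >= 1 << 63:  # reinterpret the bits as a signed 64-bit value
        x -= 1 << 64
    return -2 if x == -1 else x  # -1 is reserved as an error code in C


if __name__ == "__main__":
    print(string_hash("a"))  # 12416037344
```

The same function reproduces the other hash values quoted in this post, for example 12544037731 for 'b'.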

If an array of size x is used to store the key/value pairs then we use a mask equal to x-1 to calculate the slot index of the pair in the array. This makes the computation of the slot index fast. The probability of finding an empty slot is high due to the resizing mechanism described below, so such a simple computation makes sense in most cases. If the size of the array is 8, the index for 'a' will be hash('a') & 7 = 0. The index for 'b' is 3, the index for 'c' is 2, and the index for 'z' is 3, which is the same as for 'b': here we have a collision.
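The mask computation can be checked directly with the hash values quoted in this post (Python 2, 64-bit). The variable names are only for illustration.

```python
# Hash values as quoted in this post.
hashes = {
    "a": 12416037344,
    "b": 12544037731,
    "c": 12672038114,
    "z": 15616046971,
}

size = 8         # the table size must be a power of two
mask = size - 1  # 0b111

slots = {key: h & mask for key, h in hashes.items()}
print(slots)  # {'a': 0, 'b': 3, 'c': 2, 'z': 3}

# For a power-of-two size, masking gives the same result as taking the remainder.
assert all(h & mask == h % size for h in hashes.values())
```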

We can see that the Python hash function does a good job when the keys are consecutive which is good because it is quite common to have this type of data to work with. However, once we add the key ‘z’, there is a collision because it is not consecutive enough.

We could use a linked list to store the pairs having the same hash but it would increase the lookup time for colliding keys, since each lookup would have to walk the list. The next section describes the collision resolution method used in the case of Python dictionaries.

Open addressing

Open addressing is a method of collision resolution where probing is used. In case of ‘z’, the slot index 3 is already used in the array so we need to probe for a different index to find one which is not already used. Adding a key/value pair will average O(1) and the lookup operation too.

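A probing sequence driven by a simple recurrence is used to find a free slot (it is often loosely called quadratic probing). Starting from the initial index j, with perturb set to the full hash, each step computes j = 5*j + 1 + perturb, shifts perturb right by PERTURB_SHIFT (5) bits, and uses the low bits of j as the next index.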

Recurring on 5*j+1 quickly magnifies small differences in the bits that didn’t affect the initial index. The variable “perturb” gets the other bits of the hash code into play.

Just out of curiosity, let’s look at the probing sequence when the table size is 32 and j = 3.
3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2…

You can read more about this probing sequence by looking at the source code of dictobject.c. A detailed explanation of the probing mechanism can be found at the top of the file.
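The recurrence can be tried out with a small Python generator. This is a sketch based on my reading of the lookup loop in dictobject.c: perturb starts as the full hash and is shifted after each step. The function name probe_slots is hypothetical.

```python
from itertools import islice

PERTURB_SHIFT = 5


def probe_slots(hash_value, mask):
    """Yield the slot indexes tried for hash_value, starting with hash & mask."""
    perturb = hash_value & 0xFFFFFFFFFFFFFFFF  # treated as unsigned, as in C
    j = hash_value & mask
    yield j
    while True:
        j = (5 * j + 1 + perturb) & 0xFFFFFFFFFFFFFFFF
        yield j & mask
        perturb >>= PERTURB_SHIFT


if __name__ == "__main__":
    # Hash of 'z' as quoted in this post, table of size 8.
    print(list(islice(probe_slots(15616046971, 7), 4)))  # [3, 3, 3, 5]
```

For the hash of 'z' the first probes land on slot 3 again before reaching slot 5, because the low bits of perturb keep steering the index back until enough of the hash has been shifted out.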

Now, let’s look at the Python internal code along with an example.

Dictionary C structures

A C structure, PyDictEntry, is used to store a dictionary entry, i.e. a key/value pair. The hash, key and value are stored. PyObject is the base class of the Python objects.

A second structure represents the dictionary itself. ma_fill is the number of used slots + dummy slots. A slot is marked dummy when a key/value pair is removed. ma_used is the number of used (active) slots. ma_mask is equal to the array’s size minus 1 and is used to calculate the slot index. ma_table is the array and ma_smalltable is the initial array of size 8.

Dictionary initialization

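When you first create a dictionary, the function PyDict_New() is called. Leaving out some of the lines to concentrate on the key concepts, it allocates the dictionary object, clears ma_smalltable, sets ma_used and ma_fill to 0, points ma_table at ma_smalltable, sets ma_mask to 7 (the initial size of 8 minus 1) and selects lookdict_string() as the lookup function.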
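As a rough model of the two structures and of the initial state set up by PyDict_New(), here is a Python sketch. The entry field names me_hash, me_key and me_value are the ones I recall from CPython 2.x and should be treated as an assumption; the Python classes and the function dict_new are only stand-ins for the C code.

```python
from dataclasses import dataclass, field
from typing import Any, List

PyDict_MINSIZE = 8  # size of the initial small table


@dataclass
class Entry:
    me_hash: int = 0
    me_key: Any = None    # None models a NULL key: the slot was never used
    me_value: Any = None


def _empty_table() -> List[Entry]:
    return [Entry() for _ in range(PyDict_MINSIZE)]


@dataclass
class Dict:
    ma_fill: int = 0                    # used slots + dummy slots
    ma_used: int = 0                    # used (active) slots
    ma_mask: int = PyDict_MINSIZE - 1   # table size minus 1
    ma_table: List[Entry] = field(default_factory=_empty_table)


def dict_new() -> Dict:
    """Model of the state of a freshly created dictionary."""
    return Dict()


if __name__ == "__main__":
    d = dict_new()
    print(d.ma_fill, d.ma_used, d.ma_mask, len(d.ma_table))  # 0 0 7 8
```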

Adding items

When a new key/value pair is added, PyDict_SetItem() is called. This function takes a pointer to the dictionary object and the key/value pair. It checks if the key is a string and calculates the hash or reuses the one cached if it exists. insertdict() is called to add the new key/value pair and the dictionary is resized if the number of used slots + dummy slots reaches 2/3 of the array’s size or more.
Why 2/3? It is to make sure the probing sequence can find a free slot fast enough. We will look at the resizing function later.

insertdict() uses the lookup function lookdict_string() to find a free slot. This is the same function used to find a key. lookdict_string() calculates the slot index using the hash and the mask values. If it cannot find the key at slot index = hash & mask, it starts probing using the loop described above, until it finds a free slot. When probing reaches a slot whose key is null, it returns the first dummy slot it passed during the lookup, if there was one, and the null slot otherwise. This gives priority to re-using previously deleted slots.

We want to add the following key/value pairs: {'a': 1, 'b': 2, 'z': 26, 'y': 25, 'c': 3, 'x': 24}. This is what happens:

A dictionary structure is allocated with internal table size of 8.

PyDict_SetItem: key = 'a', value = 1
    hash = hash('a') = 12416037344
    insertdict
        lookdict_string
            slot index = hash & mask = 12416037344 & 7 = 0
            slot 0 is not used so return it
        init entry at index 0 with key, value and hash
        ma_used = 1, ma_fill = 1

PyDict_SetItem: key = 'b', value = 2
    hash = hash('b') = 12544037731
    insertdict
        lookdict_string
            slot index = hash & mask = 12544037731 & 7 = 3
            slot 3 is not used so return it
        init entry at index 3 with key, value and hash
        ma_used = 2, ma_fill = 2

PyDict_SetItem: key = 'z', value = 26
    hash = hash('z') = 15616046971
    insertdict
        lookdict_string
            slot index = hash & mask = 15616046971 & 7 = 3
            slot 3 is used so probe for a different slot: 5 is free
        init entry at index 5 with key, value and hash
        ma_used = 3, ma_fill = 3

PyDict_SetItem: key = 'y', value = 25
    hash = hash('y') = 15488046584
    insertdict
        lookdict_string
            slot index = hash & mask = 15488046584 & 7 = 0
            slot 0 is used so probe for a different slot: 1 is free
        init entry at index 1 with key, value and hash
        ma_used = 4, ma_fill = 4

PyDict_SetItem: key = 'c', value = 3
    hash = hash('c') = 12672038114
    insertdict
        lookdict_string
            slot index = hash & mask = 12672038114 & 7 = 2
            slot 2 is free so return it
        init entry at index 2 with key, value and hash
        ma_used = 5, ma_fill = 5

PyDict_SetItem: key = 'x', value = 24
    hash = hash('x') = 15360046201
    insertdict
        lookdict_string
            slot index = hash & mask = 15360046201 & 7 = 1
            slot 1 is used so probe for a different slot: 7 is free
        init entry at index 7 with key, value and hash
        ma_used = 6, ma_fill = 6
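The walk-through above can be replayed with a small simulation. It is a simplified sketch: it uses the hash values quoted in this post, ignores resizing and dummy slots, and the function name insert_all is hypothetical.

```python
HASHES = {
    "a": 12416037344,
    "b": 12544037731,
    "z": 15616046971,
    "y": 15488046584,
    "c": 12672038114,
    "x": 15360046201,
}


def insert_all(pairs, size=8):
    """Insert (key, value) pairs into an open-addressing table of the given size."""
    mask = size - 1
    table = [None] * size
    for key, value in pairs:
        h = HASHES[key]
        j, perturb = h & mask, h
        # Probe until we find a never-used slot or the slot already holding key.
        while table[j & mask] is not None and table[j & mask][0] != key:
            j = 5 * j + 1 + perturb
            perturb >>= 5
        table[j & mask] = (key, value, h)
    return table


if __name__ == "__main__":
    pairs = [("a", 1), ("b", 2), ("z", 26), ("y", 25), ("c", 3), ("x", 24)]
    for index, entry in enumerate(insert_all(pairs)):
        print(index, entry[0] if entry else None)
```

The simulation ends with the keys in slots 0 ('a'), 1 ('y'), 2 ('c'), 3 ('b'), 5 ('z') and 7 ('x'), matching the trace.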

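This is what we have so far: slot 0 holds 'a', slot 1 'y', slot 2 'c', slot 3 'b', slot 5 'z' and slot 7 'x'; slots 4 and 6 are still unused.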

6 slots out of 8 are used now so we are over 2/3 of the array’s capacity. dictresize() is called to allocate a larger array. This function also takes care of copying the old table entries to the new table.

dictresize() is called with minused = 24 in our case which is 4 * ma_used. 2 * ma_used is used when the number of used slots is very large (greater than 50000). Why 4 times the number of used slots? It reduces the number of resize steps and it increases sparseness.

The new table size needs to be greater than 24 and it is calculated by shifting the current size 1 bit left until it is greater than 24. It ends up being 32, i.e. 8 -> 16 -> 32.

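This is what happens with our table during resizing: a new table of size 32 is allocated. Old table entries are inserted into the new table using the new mask value which is 31. With the hash values above, we end up with 'a' in slot 0, 'c' in slot 2, 'b' in slot 3, 'y' in slot 24, 'x' in slot 25 and 'z' in slot 27, with no collisions.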
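The size calculation and the new slot indexes can be reproduced with a few lines of Python. This sketch follows the description in this post (start from a size of 8 and double until the size exceeds minused); the name resized_size is hypothetical and the hash values are the ones quoted earlier.

```python
def resized_size(ma_used, start_size=8):
    """Return the table size chosen for the given number of used slots."""
    # 4 * ma_used normally, 2 * ma_used for very large dictionaries.
    minused = (2 if ma_used > 50000 else 4) * ma_used
    newsize = start_size
    while newsize <= minused:
        newsize <<= 1  # shift left by one bit, i.e. double the size
    return newsize


HASHES = {
    "a": 12416037344,
    "b": 12544037731,
    "z": 15616046971,
    "y": 15488046584,
    "c": 12672038114,
    "x": 15360046201,
}

new_size = resized_size(6)
new_mask = new_size - 1
new_slots = {key: h & new_mask for key, h in HASHES.items()}
print(new_size, new_slots)
```

All six initial indexes are distinct in the larger table, so no probing is needed when the entries are reinserted.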

Removing items

PyDict_DelItem() is called to remove an entry. The hash for this key is calculated and the lookup function is called to return the entry. The slot is now a dummy slot.

We want to remove the key 'c' from our dictionary. We end up with the same array, except that the slot which held 'c' (slot 2) is now marked as dummy.

Note that the delete item operation doesn’t trigger an array resize, even if the number of used slots is much less than the total number of slots. However, when a key/value pair is added, the need for a resize is based on the number of used slots + dummy slots, so it can shrink the array too.
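Deletion with dummy slots can be sketched as follows. This is a simplified model rather than the C code: entries are (key, value) tuples, None stands for a never-used slot, DUMMY is a hypothetical sentinel object, and the table is assumed to always keep at least one never-used slot so that probing terminates.

```python
DUMMY = object()  # marks a slot whose key/value pair was removed


def find_slot(table, key, h):
    """Return the slot holding key, or the slot where key would be inserted."""
    mask = len(table) - 1
    j, perturb, first_dummy = h & mask, h, None
    while True:
        slot = j & mask
        entry = table[slot]
        if entry is None:  # a never-used slot ends the search
            return slot if first_dummy is None else first_dummy
        if entry is DUMMY:
            if first_dummy is None:
                first_dummy = slot  # remember it so it can be reused
        elif entry[0] == key:
            return slot
        j = 5 * j + 1 + perturb
        perturb >>= 5


def delete(table, key, h):
    slot = find_slot(table, key, h)
    if table[slot] is None or table[slot] is DUMMY:
        raise KeyError(key)
    table[slot] = DUMMY  # ma_used drops by 1, ma_fill stays the same


if __name__ == "__main__":
    table = [None] * 32
    h_c = 12672038114  # hash of 'c' as quoted in this post
    table[find_slot(table, "c", h_c)] = ("c", 3)
    delete(table, "c", h_c)
    print(table[2] is DUMMY)  # True
```

A later insertion whose probe sequence passes over the dummy slot gets that slot back from find_slot, which is how deleted slots are reused.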

That’s it for now. I hope you enjoyed the article. Please write a comment if you have any feedback. If you need help with a project written in Python or with building a new web service, I am available as a freelancer: LinkedIn profile. Follow me on Twitter @laurentluce.