DSA Hash Sets
Hash Sets
A Hash Set is a form of Hash Table data structure that usually holds a large number of elements.
Using a Hash Set we can search, add, and remove elements really fast.
Hash Sets are used for lookup, to check if an element is part of a set.
Hash Set
Hash Code
{{ sumOfAscii }} % 10 = {{ currHashCode }}
{{ resultText }}0
A Hash Set stores unique elements in buckets according to the element's hash code.
- Hash code: A number generated from an element's unique value (key), to determine what bucket that Hash Set element belongs to.
- Unique elements: A Hash Set cannot have more than one element with the same value.
- Bucket: A Hash Set consists of many such buckets, or containers, to store elements. If two elements have the same hash code, they belong to the same bucket. The buckets are therefore often implemented as arrays or linked lists, because a bucket needs to be able to hold more than one element.
Finding The Hash Code
A hash code is generated by a hash function.
The hash function in the animation above takes the name written in the input, and sums up the Unicode code points for every character in that name.
After that, the hash function does a modulo 10 operation (% 10
) on the sum of characters to get the hash code as a number from 0 to 9.
This means that a name is put into one of ten possible buckets in the Hash Set, according to the hash code of that name. The same hash code is generated and used when we want to search for or remove a name from the Hash Set.
The Hash Code gives us instant access as long as there is just one name in the corresponding bucket.
Unicode code point: Everything in our computers are stored as numbers, and the Unicode code point is a unique number that exist for every character. For example, the character A
has Unicode code point 65
. Just try it in the simulation above. See this page for more information about how characters are represented as numbers.
Modulo: A mathematical operation, written as %
in most programming languages (or \(mod\) in mathematics). A modulo operation divides a number with another number, and gives us the resulting reminder. So for example, 7 % 3
will give us the reminder 1
. (Dividing 7 apples between 3 people, means that each person gets 2 apples, with 1 apple to spare.)
Direct Access in Hash Sets
Searching for Peter
in the Hash Set above, means that the hash code 2
is generated (512 % 10
), and that directs us right to the bucket Peter
is in. If that is the only name in that bucket, we will find Peter
right away.
In cases like this we say that the Hash Set has constant time \(O(1)\) for searching, adding, and removing elements, which is really fast.
But, if we search for Jens
, we need to search through the other names in that bucket before we find Jens
. In a worst case scenario, all names end up in the same bucket, and the name we are searching for is the last one. In such a worst case scenario the Hash Set has time complexity \(O(n)\), which is the same time complexity as arrays and linked lists.
To keep Hash Sets fast, it is therefore important to have a hash function that will distribute the elements evenly between the buckets, and to have around as many buckets as Hash Set elements.
Having a lot more buckets than Hash Set elements is a waste of memory, and having a lot less buckets than Hash Set elements is a waste of time.
Hash Set Implementation
Hash Sets in Python are typically done by using Python's own set
data type, but to get a better understanding of how Hash Sets work we will not use that here.
To implement a Hash Set in Python we create a class SimpleHashSet
.
Inside the SimpleHashSet
class we have a method __init__
to initialize the Hash Set, a method hash_function
for the hash function, and methods for the basic Hash Set operations: add
, contains
, and remove
.
We also create a method print_set
to better see how the Hash Set looks like.
Example
class SimpleHashSet:
def __init__(self, size=100):
self.size = size
self.buckets = [[] for _ in range(size)] # A list of buckets, each is a list (to handle collisions)
def hash_function(self, value):
# Simple hash function: sum of character codes modulo the number of buckets
return sum(ord(char) for char in value) % self.size
def add(self, value):
# Add a value if it's not already present
index = self.hash_function(value)
bucket = self.buckets[index]
if value not in bucket:
bucket.append(value)
def contains(self, value):
# Check if a value exists in the set
index = self.hash_function(value)
bucket = self.buckets[index]
return value in bucket
def remove(self, value):
# Remove a value
index = self.hash_function(value)
bucket = self.buckets[index]
if value in bucket:
bucket.remove(value)
def print_set(self):
# Print all elements in the hash set
print("Hash Set Contents:")
for index, bucket in enumerate(self.buckets):
print(f"Bucket {index}: {bucket}")
Using the SimpleHashSet
class we can create the same Hash Set as in the top of this page:
Example
class SimpleHashSet:
def __init__(self, size=100):
self.size = size
self.buckets = [[] for _ in range(size)] # A list of buckets, each is a list (to handle collisions)
def hash_function(self, value):
# Simple hash function: sum of character codes modulo the number of buckets
return sum(ord(char) for char in value) % self.size
def add(self, value):
# Add a value if it's not already present
index = self.hash_function(value)
bucket = self.buckets[index]
if value not in bucket:
bucket.append(value)
def contains(self, value):
# Check if a value exists in the set
index = self.hash_function(value)
bucket = self.buckets[index]
return value in bucket
def remove(self, value):
# Remove a value
index = self.hash_function(value)
bucket = self.buckets[index]
if value in bucket:
bucket.remove(value)
def print_set(self):
# Print all elements in the hash set
print("Hash Set Contents:")
for index, bucket in enumerate(self.buckets):
print(f"Bucket {index}: {bucket}")
# Creating the Hash Set from the simulation
hash_set = SimpleHashSet(size=10)
hash_set.add("Charlotte")
hash_set.add("Thomas")
hash_set.add("Jens")
hash_set.add("Peter")
hash_set.add("Lisa")
hash_set.add("Adele")
hash_set.add("Michaela")
hash_set.add("Bob")
hash_set.print_set()
print("\n'Peter' is in the set:",hash_set.contains('Peter'))
print("Removing 'Peter'")
hash_set.remove('Peter')
print("'Peter' is in the set:",hash_set.contains('Peter'))
print("'Adele' has hash code:",hash_set.hash_function('Adele'))
Run Example ยป