Understanding Hash Tables — Implementation Deep Dive

Hash functions, collision resolution strategies, common operations, and production-ready implementation.

Author: Mr. Oz
Read: 8 mins
Level 2

[Figure: hash function computation and collision resolution strategies]


Now that we understand the magical catalog analogy, let's dive into the actual implementation. We'll explore hash functions, collision resolution, and how to build production-ready hash tables.

The Hash Function

The hash function is the heart of a hash table. It takes a key and converts it into a bucket index.

A good hash function should:

  • Be deterministic: Same key always produces the same hash
  • Distribute evenly: Spread keys across all buckets
  • Be fast: Compute the hash quickly
  • Minimize collisions: Different keys should hash to different values

# Simple hash function for strings
def simple_hash(key, table_size):
    hash_value = 0
    for char in key:
        hash_value = (hash_value * 31 + ord(char)) % table_size
    return hash_value

# Python's built-in hash() then modulo
def better_hash(key, table_size):
    return hash(key) % table_size

// Java example: floorMod avoids a negative index when hashCode() is negative
int hash = Math.floorMod(key.hashCode(), tableSize);
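
The properties above are easy to sanity-check. A quick test of the simple_hash sketch (the table size of 8 is an arbitrary choice for illustration):

```python
def simple_hash(key, table_size):
    # Polynomial rolling hash: multiply by 31 and add each character code
    hash_value = 0
    for char in key:
        hash_value = (hash_value * 31 + ord(char)) % table_size
    return hash_value

# Deterministic: the same key always maps to the same bucket
assert simple_hash("Alice", 8) == simple_hash("Alice", 8)

# The result is always a valid bucket index
for name in ["Alice", "Bob", "Charlie"]:
    assert 0 <= simple_hash(name, 8) < 8
```

Note that even distribution is a statistical property, not a guarantee: with 8 buckets, some of these keys may well collide.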

Collision Resolution

When two keys hash to the same bucket, we have a collision. Here's how we handle it:

1. Chaining (Linked List)

Each bucket contains a linked list of all key-value pairs that hash to that location.

# Visual representation
Bucket 0: []
Bucket 1: [("Alice", "555-1234")] → [("Charlie", "555-9999")]
Bucket 2: [("Bob", "555-5678")]
Bucket 3: []
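
The diagram above maps directly onto a Python structure: one list per bucket, where each bucket holds the (key, value) pairs that hash to it. (Bucket positions here mirror the diagram rather than any particular hash function.)

```python
# One list per bucket; collisions simply share a bucket
table_size = 4
buckets = [[] for _ in range(table_size)]

buckets[1].append(("Alice", "555-1234"))
buckets[1].append(("Charlie", "555-9999"))  # collides with Alice in bucket 1
buckets[2].append(("Bob", "555-5678"))
```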

2. Open Addressing (Probing)

When a collision occurs, find the next empty bucket using a probing strategy.

  • Linear Probing: Check index, index + 1, index + 2, index + 3, ...
  • Quadratic Probing: Check index, index + 1, index + 4, index + 9, ... (offsets 1², 2², 3², ...)
  • Double Hashing: Use a second hash function to determine the step size

# Linear probing example (sketch: assumes the table never fills completely)
def put(key, value):
    index = hash(key) % size
    while table[index] is not None:
        if table[index].key == key:
            table[index].value = value  # Update existing
            return
        index = (index + 1) % size  # Move to next bucket
    table[index] = KeyValue(key, value)
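
The matching lookup follows the same probe sequence until it finds the key or hits an empty slot. A self-contained sketch of both operations (the table size and KeyValue class are illustrative assumptions):

```python
size = 8
table = [None] * size

class KeyValue:
    def __init__(self, key, value):
        self.key = key
        self.value = value

def put(key, value):
    # Sketch: assumes the table never fills completely
    index = hash(key) % size
    while table[index] is not None:
        if table[index].key == key:
            table[index].value = value  # update existing key
            return
        index = (index + 1) % size      # probe the next bucket
    table[index] = KeyValue(key, value)

def get(key):
    index = hash(key) % size
    while table[index] is not None:
        if table[index].key == key:
            return table[index].value
        index = (index + 1) % size
    return None  # hit an empty slot: key is absent

put("Alice", "555-1234")
put("Bob", "555-5678")
```

One subtlety this sketch skips: deletion under open addressing can't simply empty a slot, or lookups would stop probing too early; real implementations mark deleted slots with a "tombstone".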

Common Operations

1. Insert/Update (Put)

def put(key, value):
    index = hash(key) % size
    # With chaining
    for entry in buckets[index]:
        if entry.key == key:
            entry.value = value  # Update
            return
    buckets[index].append(KeyValue(key, value))  # Insert

2. Lookup (Get)

def get(key):
    index = hash(key) % size
    for entry in buckets[index]:
        if entry.key == key:
            return entry.value
    return None  # Not found

3. Delete

def delete(key):
    index = hash(key) % size
    for i, entry in enumerate(buckets[index]):
        if entry.key == key:
            buckets[index].pop(i)  # Remove from chain
            return True
    return False  # Not found
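
The three operations above fit naturally into one small class. A minimal chaining sketch, using plain (key, value) tuples in place of the KeyValue object:

```python
class HashTable:
    """Minimal chaining hash table: put, get, and delete."""

    def __init__(self, size=8):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % self.size

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # update existing key
                return
        bucket.append((key, value))       # insert new key

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None  # not found

    def delete(self, key):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket.pop(i)  # remove from chain
                return True
        return False  # not found
```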

Load Factor

The load factor is the ratio of stored entries to total buckets:

load_factor = number_of_entries / number_of_buckets

When load factor exceeds a threshold (typically 0.7 or 0.75), it's time to resize the hash table!

Why resize? As the table fills up, collisions increase and performance degrades. Resizing rehashes every entry into a larger table.
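
Rehashing is necessary because each entry's index depends on the bucket count. A resize sketch for the chaining layout (the doubling-at-0.75 policy is a common convention, not the only choice):

```python
def resize(buckets, new_size):
    # Rehash every entry: old indices are invalid once the modulus changes
    new_buckets = [[] for _ in range(new_size)]
    for bucket in buckets:
        for key, value in bucket:
            new_buckets[hash(key) % new_size].append((key, value))
    return new_buckets

def maybe_resize(buckets, num_entries):
    # Typical trigger: double the bucket count past a 0.75 load factor
    if num_entries / len(buckets) > 0.75:
        return resize(buckets, len(buckets) * 2)
    return buckets

buckets = [[] for _ in range(2)]
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    buckets[hash(k) % 2].append((k, v))
```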

Built-in Implementations

Most languages provide optimized hash table implementations:

  • Python: dict - Open addressing with a perturbation-based probe sequence
  • Java: HashMap - Chaining, converting long buckets to red-black trees
  • C++: std::unordered_map - Chaining implementation
  • JavaScript: Map and Object - Hash table semantics
  • Go: map - Built-in hash table

Time & Space Complexity Reference

Operation   Average   Worst
---------   -------   -----
Insert      O(1)      O(n)
Lookup      O(1)      O(n)
Delete      O(1)      O(n)
Space       O(n)      O(n)

The worst case occurs when every key hashes to the same bucket, degenerating the structure into a linked list. A good hash function makes this vanishingly unlikely.

Common Pitfalls to Avoid

  • Using mutable keys: Never use mutable objects (like Python lists) as keys
    Wrong: hash_map[[1, 2]] = "value"
  • Assuming order: Classic hash tables don't guarantee iteration order (Java offers LinkedHashMap; Python's dict preserves insertion order since 3.7)
  • Poor hash functions: Can cause clustering and performance degradation
  • Ignoring load factor: Not resizing causes performance to degrade over time
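
The mutable-key pitfall is one Python enforces for you: lists are unhashable, so the "wrong" line above fails immediately, and a tuple of immutable values is the idiomatic fix.

```python
phone_book = {}

# Mutable objects like lists are unhashable, so Python rejects them as keys
try:
    phone_book[[1, 2]] = "value"
except TypeError as exc:
    error = str(exc)  # "unhashable type: 'list'"

# A tuple of immutable values works as a key
phone_book[(1, 2)] = "value"
```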

Production Tips

  • Use built-in implementations: They're highly optimized and battle-tested
  • Choose initial capacity wisely: If you know the approximate size, pre-allocate
  • Consider immutable keys: Strings and integers make great hash keys
  • Profile before optimizing: Built-in hash tables are usually fast enough
  • Handle missing keys gracefully: Use get() with default values
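
The last tip in Python terms: dict.get takes a fallback value, and collections.defaultdict supplies one automatically, which is handy for counters and grouping.

```python
from collections import defaultdict

phone_book = {"Alice": "555-1234"}

# dict.get returns a default instead of raising KeyError on a missing key
number = phone_book.get("Bob", "unknown")

# defaultdict calls its factory (here int, i.e. 0) for missing keys
counts = defaultdict(int)
for word in ["apple", "banana", "apple"]:
    counts[word] += 1
```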

Key Takeaways

  • Hash functions convert keys to bucket indices deterministically
  • Collisions are inevitable; handle them with chaining or probing
  • Chaining uses linked lists; probing finds empty buckets
  • Load factor indicates when to resize (typically at 0.75)
  • Average O(1) for insert, lookup, and delete operations
  • Use built-in implementations for production code