Hashtables

In search of the \( O(1) \) data structure...

Overview

Three of the most common operations on a set of data include:

add(E element)
remove(Object element)
contains(Object element)

We've looked at several data structures that could provide the interface of a Set for these operations. These operations have different asymptotic time complexities depending on the data structure used.

Data Structure	`add()`	`remove()`	`contains()`
`ArrayList`	\( O(n) \)	\( O(n) \)	\( O(n) \)
`LinkedList`	\( O(n) \)	\( O(n) \)	\( O(n) \)
`TreeSet`	\( O(\log n) \)	\( O(\log n) \)	\( O(\log n) \)

Can you explain why add() is \( O(n) \) for the ArrayList and LinkedList?

What would the time complexities be for a SortedArrayList?

What additional constraints do a SortedArrayList and TreeSet put on the data being stored?

Collecting Shells

A \( O(\log n) \) algorithm is way better than \( O(n) \) since for large \( n \), the graph of \( \log n \) almost looks like a horizontal line; however, we'd like to do better. In fact, we'd like to find a data structure where all of these operations can be done in \( O(1) \) time. Of course, this sounds crazy. It's basically saying that I can manage my collection of all but three shells in the world just as fast as my brother can manage his collection of three shells. C.R.A.Z.Y.

If we don't organize your shells, we'll need to exhaustively iterate through the entire collection whenever we want to find or remove a shell¹, making these \( O(n) \) operations.

How would you organize these shells?

How would you organize these fifty shells?

Depending on how you organize these, it may make finding a shell easier or harder. If you are looking for the largest shell, you could find it quickly if the shells were sorted by size. If you are looking for most orange shell, being organized by size, doesn't help. What if you want to remove the shell that your dog brought home for your nineth birthday since you are mad at your dog for destroying your favorite Star Trek mini-figure?

For objects that could be stored lots of different ways, sorting only improves your time complexity when you wish to interact with the set based on the same ordering. Sorting is great when there is one clear way to sort things, but when you have more complicated data, it may not be a reasonable or even feasible option.

Moving Beyond Sorting

Consider how mailboxes are organized.

If we could give each shell a PO Box number, add(), remove(), and contains() could be executed in whatever time it takes to locate and open the mailbox.

The grid of mailboxes approximates random access. Recall the RandomAccess interface indicates that a List implementation supports fast (generally constant time) access to any location in the list. ArrayLists and arrays provide constant time access to any element in the array since the location of any element can be calculated based on the memory location of the beginning of the array, the index, and the number of bytes needed for each element in the array.

Questions to consider

How many seashell mailboxes does my brother need for his collection of three shells?
How many mailboxes do I need for my collection of shells?
When (and how) do we decide the address for each shell?

Determining an Object's "address"

a.k.a., the hash code

We've spent enough time thinking about the real world, let's get comfortable in the digital world.

The Object class defines a method hashCode() that returns an integer. This integer can be thought of as the address for the object in a hash table. Ideally, the integer returned would be:

the same for objects that are equivalent (if equals() returns true),
immutable, and
different for any two objects that are not equivalent.

Which one(s) of these properties is impossible to ensure for all types of objects?

Hash Tables

A hash table is a class containing an array to store the data.

Each index is an "address" for an object that may be stored in the table.
The capacity of a hash table is defined as the length of the array.
The load factor of a hash table is defined as size / capacity.
A collision occurs when two objects map to the same "address"

Questions to consider

What do we do if the "address" of an object produces and IndexOutOfBoundsException?
What should we do if/when a collision occurs?
How do we determine the appropriate capacity?
Why do we care about the load factor?

Since every shell is unique, we don't need to worry about duplicates. As a result, add() can just dump the shell somewhere in our collection (constant time). This is also why our collection is actually a set.

Images generated by ChatGPT. No shells were knowingly harmed in the creation of these images.