Skip Lists
This article was published originally in the April 2001
issue of Java Pro with
the unenlightening title “The Sort-ed
Details.” It was written when JDK 1.3 was the latest
available Java release, years
before ConcurrentSkipListMap
appeared in
JDK 1.6. The example code is instructional and does not implement
all of the Java Collections interfaces required for a
production-use implementation. Even though you will want to
use ConcurrentSkipListMap
and ConcurrentSkipListSet
instead of
implementing your own skip list, the article remains useful for
those who would like an overview of how a skip list works.
Changes. The code examples have been updated to use generics, an
exercise of little value that makes the code harder to read. I
have also added timings for
ConcurrentSkipListMap
to the driver
program so you can see that the performance properties described
in the article are implementation-independent. Depending on the
hardware and JVM used—and the order in which the tests are
run—ConcurrentSkipListMap executes put, get, and remove slower or faster than the
article's code. In all cases, the execution times are the same
base-2 order of magnitude (i.e., less than a factor of two
difference) as the example code, which is two to three times
slower than
TreeMap
for the tests performed.
Sorting and Searching
Sorting
continues to be one of the most common operations performed by
computer programs. Java programs are no exception. The Collections
Framework recognizes the fundamental need for ordering computer
data by providing the Comparable
and Comparator
interfaces. If you
cannot determine the natural ordering between two objects, you
cannot sort a collection of
objects. The Comparable
interface
allows an object to control its ordering by implementing the
compareTo(Object)
method. The Comparator
interface
allows you to implement
a compare(Object,Object)
method that
returns the ordering of an arbitrary pair of objects that may not
have been designed with ordering in mind.
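To make the two interfaces concrete, here is a minimal sketch in the updated (generic) style of the revised examples. The Employee class and its fields are invented for illustration and do not appear in the article's code.

import java.util.Comparator;

// A hypothetical Employee whose natural ordering is by badge number.
public class Employee implements Comparable<Employee> {
    final String name;
    final int badgeNumber;

    Employee(String name, int badgeNumber) {
        this.name = name;
        this.badgeNumber = badgeNumber;
    }

    // Comparable: the object itself controls its ordering.
    public int compareTo(Employee other) {
        return Integer.compare(this.badgeNumber, other.badgeNumber);
    }

    // Comparator: an ordering imposed from the outside, handy when the class
    // was not designed with this particular ordering in mind.
    static final Comparator<Employee> BY_NAME = new Comparator<Employee>() {
        public int compare(Employee a, Employee b) {
            return a.name.compareTo(b.name);
        }
    };
}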
A common approach to ordering data is to store the data in a
container, be it an array or a list of some kind, and sort it as
needed. Arrays are often
difficult to work with when you don't know how many array elements
you will need to store all of your data.
The Vector
and ArrayList
classes—each of which
implements the java.util.List
interface—resolve this difficulty by providing dynamically
growing array-based storage. Array data can be sorted with
java.util.Arrays.sort()
and List
data can be sorted
with java.util.Collections.sort(). Once the data is sorted, you can search it relatively efficiently in O(log2(n)) time using a binary search algorithm, implemented by Arrays.binarySearch and Collections.binarySearch.
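As a quick illustration of those calls (the data and the case-insensitive ordering below are made up for the example):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortSearchDemo {
    public static void main(String[] args) {
        // Sort an array, then binary-search it.
        int[] numbers = { 42, 7, 19, 3, 88 };
        Arrays.sort(numbers);
        System.out.println(Arrays.binarySearch(numbers, 19));     // prints 2

        // Sort a List with a Comparator, then search with the same Comparator;
        // binarySearch requires the list to already be sorted by that ordering.
        List<String> words = new ArrayList<String>(Arrays.asList("pear", "Apple", "banana"));
        Collections.sort(words, String.CASE_INSENSITIVE_ORDER);
        System.out.println(Collections.binarySearch(words, "banana", String.CASE_INSENSITIVE_ORDER));  // prints 1
    }
}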
Searching for data in linear storage containers, such as arrays, is inefficient. In the worst case it requires the examination of every item of data. If the data is sorted, you can take advantage of a binary search, but you still have to pay the cost of sorting the data. For dynamically changing collections of data, that cost is not acceptable. An alternative is to use a data structure that maintains the ordering between its elements as those elements are inserted and removed.
Mappings, such as those represented by
the java.util.Map
interface,
associate a set of keys with a set of values. Sometimes all we
care about is the ability to quickly retrieve a value associated
with a given key. Should we want to access the mapped values
based on the natural ordering of their keys, or define a new
submapping over a range of key values, we have to sort the keys.
In these cases, an ordered mapping, represented by
the java.util.SortedMap
interface,
is more appropriate. The java.util.TreeMap
class implements the SortedMap
interface using a red-black tree, a type of balanced binary search
tree that allows both efficient searching and ordered traversal.
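For example, a TreeMap accessed through the SortedMap interface keeps its keys in order and can hand back a submapping over a range of keys (the keys and values below are made up):

import java.util.SortedMap;
import java.util.TreeMap;

public class SortedMapDemo {
    public static void main(String[] args) {
        SortedMap<String, Integer> ages = new TreeMap<String, Integer>();
        ages.put("carol", 31);
        ages.put("alice", 29);
        ages.put("bob", 40);

        // Iteration follows the natural ordering of the keys.
        System.out.println(ages.keySet());          // [alice, bob, carol]

        // A submapping over a range of key values: all keys strictly less than "c".
        System.out.println(ages.headMap("c"));      // {alice=29, bob=40}
    }
}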
There are many alternative data structures that could have been
used for a default SortedMap
implementation in the Collections Framework. Implementing one of
them may shed some light on why the red-black tree was chosen over
the available alternatives.
Enter the Skip List
Figure 1 contains the source for an implementation of a probabilistically balanced container called a skip list. Skip lists were invented in 1987 by Bill Pugh, a professor at the University of Maryland, who observed that a hierarchy of linked lists is equivalent to a binary tree (see “Skip lists: a probabilistic alternative to balanced trees,” Communications of the ACM, Vol. 33, No. 6, June 1990). You can think of a skip list as a set of linked lists stacked one on top of the other.
List Arrays
It's easier to get a handle on skip lists by studying a special case called a list array. In a list array, the linked list at the lowest level (level 0) contains all the elements of the list, like a normal linear linked list. The list at the second level contains pointers to every other element. The third level contains pointers to every fourth element, and so on. Each level contains pointers to n/2^l elements, where n is the number of elements in the list and l is the level number starting from 0.
Searching for a key in a list array can be done in O(log2(n)) time. Starting at the topmost level, traverse the list until you encounter an element greater than or equal to the search key. If the element is equal to the search key, you're done. If not, go down a level from the previous element and repeat the process. If you reach level 0 without finding the key, it's not in the list and you're done. Assuming a list of n elements, the search requires no more than 2log2(n) comparisons because there are log2(n) lists and you compare no more than two elements before moving to a lower level. Given the structure of the list array, the search is analogous to a binary search, although it incurs twice as many comparisons.
The list array search algorithm is implemented in the
get()
method in Figure 1. Notice that
a dummy list header and tail are used to avoid treating boundary
conditions as special cases. The use of the
MinimumKey
and MaximumKey
classes ensures that the
head of the list will always appear to be the lowest-valued
element and the tail of the list will appear to be the
highest-valued element.
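To make the traversal concrete, here is a self-contained sketch of the search over a hand-built list array of eight keys. The Node class, its forward[] arrays, and the Integer.MIN_VALUE/MAX_VALUE sentinels are inventions of this sketch; the get(), MinimumKey, and MaximumKey code in Figure 1 differs in its details.

import java.util.Arrays;

public class ListArraySearchDemo {

    static final class Node {
        final int key;
        final Node[] forward;                      // forward[l] = next node at level l
        Node(int key, int levels) {
            this.key = key;
            this.forward = new Node[levels];
        }
    }

    // Start at the top level, move right while the next key is smaller than the
    // search key, drop down a level, and check for equality at level 0.
    static boolean contains(Node head, Node tail, int levels, int searchKey) {
        Node node = head;
        for (int level = levels - 1; level >= 0; level--) {
            while (node.forward[level] != tail && node.forward[level].key < searchKey) {
                node = node.forward[level];
            }
        }
        node = node.forward[0];
        return node != tail && node.key == searchKey;
    }

    public static void main(String[] args) {
        int[] keys = { 10, 20, 30, 40, 50, 60, 70, 80 };
        int levels = 4;                                    // log2(8) + 1
        Node head = new Node(Integer.MIN_VALUE, levels);   // dummy header
        Node tail = new Node(Integer.MAX_VALUE, levels);   // dummy tail

        // Level 0 links every element, level 1 every other, level 2 every fourth...
        Node[] last = new Node[levels];
        Arrays.fill(last, head);
        for (int i = 0; i < keys.length; i++) {
            int nodeLevels = Math.min(levels, Integer.numberOfTrailingZeros(i + 1) + 1);
            Node node = new Node(keys[i], nodeLevels);
            for (int l = 0; l < nodeLevels; l++) {
                last[l].forward[l] = node;
                last[l] = node;
            }
        }
        for (int l = 0; l < levels; l++) {
            last[l].forward[l] = tail;                     // close every level with the tail
        }

        System.out.println(contains(head, tail, levels, 60));  // true
        System.out.println(contains(head, tail, levels, 65));  // false
    }
}

At each level the search compares at most two elements before dropping down, which is where the 2log2(n) bound mentioned above comes from.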
Although searching exhibits a running time competitive with tree-based techniques, insertions and deletions do not perform as well. If you were to insert a key at the beginning of the list, you would need to adjust the level of every other element in the list because a list array restricts pointers at a given level to skip exactly one element in the list at the next lowest level.
Probabilistic Balancing
It turns out that you can achieve similar search performance and much better insertion and deletion performance if you only guarantee that the number of elements skipped is one on average, rather than exactly one. A skip list is a list array that provides such a guarantee by randomly generating the skip increment for each level when you insert a key. To insert a key, use the search algorithm to locate where to perform the insertion. If the search succeeds and duplicates are disallowed, simply replace the value associated with the key. If the search fails, insert the key immediately after the last node visited whose key is less than the key being inserted.
While you are searching, keep track of the last node visited at each level so that you can update those nodes' links after
the insertion. Before you insert the key, you have to decide at
which levels to insert it (note that every key must be inserted
at level 0). You do this by generating a random number between 0 and 1. If the value is less than 0.5, stop. If it is 0.5 or greater, insert the key one level higher, generate another random number, and repeat the process.
The random number generation is implemented in __randomLevel() in Figure 1 (which uses a probability of 0.25 instead of 0.5), and the insertion process is implemented in put().
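Here is a minimal insertion sketch under the same invented Node shape as the search sketch above; it stores keys only and uses null rather than a dummy tail, whereas Figure 1's put() and __randomLevel() also handle values and the MinimumKey/MaximumKey sentinels. Like Figure 1, it promotes a key to the next level with probability 0.25.

import java.util.Random;

public class SkipListInsertSketch {

    static final int MAX_LEVEL = 10;       // log base 1/p of 2^20 with p = 1/4 (see below)
    static final double P = 0.25;          // probability of promoting a key one more level
    static final Random RANDOM = new Random();

    static final class Node {
        final int key;
        final Node[] forward;
        Node(int key, int levels) {
            this.key = key;
            this.forward = new Node[levels];
        }
    }

    final Node head = new Node(Integer.MIN_VALUE, MAX_LEVEL);   // dummy header
    int levels = 1;                                             // levels currently in use

    // Every key is inserted at level 0; with probability P it also reaches
    // level 1, with probability P*P level 2, and so on, capped at MAX_LEVEL.
    static int randomLevel() {
        int level = 1;
        while (level < MAX_LEVEL && RANDOM.nextDouble() < P) {
            level++;
        }
        return level;
    }

    void put(int key) {
        Node[] update = new Node[MAX_LEVEL];       // last node visited at each level
        Node node = head;
        for (int level = levels - 1; level >= 0; level--) {
            while (node.forward[level] != null && node.forward[level].key < key) {
                node = node.forward[level];
            }
            update[level] = node;
        }
        node = node.forward[0];
        if (node != null && node.key == key) {
            return;                                // duplicate key: Figure 1 would replace the value here
        }
        int newLevels = randomLevel();
        for (int level = levels; level < newLevels; level++) {
            update[level] = head;                  // brand-new levels hang off the header
        }
        levels = Math.max(levels, newLevels);
        Node inserted = new Node(key, newLevels);
        for (int level = 0; level < newLevels; level++) {
            inserted.forward[level] = update[level].forward[level];
            update[level].forward[level] = inserted;
        }
    }

    public static void main(String[] args) {
        SkipListInsertSketch list = new SkipListInsertSketch();
        for (int key : new int[] { 50, 20, 80, 20, 10 }) {
            list.put(key);
        }
        for (Node n = list.head.forward[0]; n != null; n = n.forward[0]) {
            System.out.print(n.key + " ");         // level 0 links 10 20 50 80 in order
        }
        System.out.println();
    }
}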
Skip lists can become relatively deep, but the probabilistic nature of the balancing makes it unlikely. You can
set a cap on the depth to keep small lists shallow. The maximum
level should be set so that the expected number of elements at
the maximum level is 1. If the probability for level skipping
is p, then the maximum level is
log1/p(n). The example driver in Figure 2 sets n to 2^20 and p to 1/4, making
the maximum level equal to 10. Deletions work in much the same
way as insertions, as shown in
the remove()
method from
Figure 1.
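As a quick check of that arithmetic (the class name is just for this example):

public class MaxLevelCheck {
    public static void main(String[] args) {
        int n = 1 << 20;                   // 2^20 elements, as in the Figure 2 driver
        double p = 0.25;                   // probability of promoting a key one level
        // log base 1/p of n = log(n) / log(1/p) = 20 / 2 = 10
        long maxLevel = Math.round(Math.log(n) / Math.log(1.0 / p));
        System.out.println(maxLevel);      // prints 10
    }
}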
Performance
Figure 2 contains a driver program that tests
the SkipList
class from Figure 1,
comparing its performance to TreeMap. The test is unscientific and should really explore a range of values for both NUM_LEVELS and NUM_ELEMENTS. Nonetheless, it is good
enough to get a rough picture of skip list behavior.
If you run the driver, you will find that for large numbers
of elements, the SkipList
class takes
roughly twice as long as TreeMap
to perform
insertions, deletions, and searches. Insertions take longer
because of the time spent generating random numbers. Both
insertions and deletions suffer because they have to maintain
the __update[] array and index into an array on every call to getNext(). Array indexing
incurs a higher cost than accessing a left or right child in a
tree because of run-time array bounds checking. All of the
operations suffer from the extra comparisons required when
traversing a skip list. Even though both skip list and tree
searches are O(log2(n)), the constant factors associated with skip
list searches in Java are greater.
Straight-up in-order list traversal—the primary reason
for using an ordered mapping—is comparable in both
situations. You would expect it to be more efficient in a skip
list because all you have to do is follow all of the links at the
lowest level, whereas a tree requires you to perform potentially
costly comparisons. Array indexing again appears to be the
culprit. For smaller numbers of
elements, SkipList
is more
competitive.
One of the reasons advocated for using skip lists is that they are easier to implement. I find that they are no easier to implement than red-black trees. They also require more storage, which is generally undesirable. Even so, their expected performance is independent of the data they store. If you insert sorted data in sequence into a plain binary search tree (not a red-black tree), it degenerates into a linear linked list. With a skip list, your performance is protected by the probabilistic nature of the link construction. Some of the alternatives to skip lists and red-black trees are AVL trees, splay trees, and 2-3 trees. The red-black tree was perhaps the best choice for implementing a general-purpose ordered mapping for the core APIs, but Java makes a capable playground for exploring the alternatives.