#include "uthash.h"
This document is geared towards C programmers of UNIX systems. Since you're reading this, chances are that you know a hash is used for looking up items using a key. In scripting languages like Perl, hashes are used all the time. In C, hashes don't exist in the language itself. This software provides a hash for C structures.
This software supports four basic operations on hashes.
add
find
delete
iterate/sort
Add, find and delete are normally constant-time operations. This is influenced by your key domain and the hash function.
This hash aims to be minimalistic and efficient. It's around 500 lines of C. It inlines automatically because it's implemented as macros. It's fast as long as the hash function is properly chosen to suit your keys. This is easy; see Appendix A: Choosing the hash function.
No, it's just a single header file: uthash.h. All you need to do is copy the header file into your project, and:
#include "uthash.h"
You may then utilize the hash macros as explained below.
This software has been tested on Linux, Solaris, OpenBSD, Mac OSX and Cygwin. It probably runs on other Unix or Unix-like platforms too.
This software is made available under the BSD license. It is free and open source.
Please follow the link to download on the uthash website.
You can email the author at thanson@users.sourceforge.net if you have questions not answered by this document.
A hash is comprised of structures. Each structure represents a key-value association. One or more fields act as the key; the structure itself is the value.
#include "uthash.h" struct my_struct { int id; /* key */ char name[10]; UT_hash_handle hh; /* makes this structure hashable */ };
There are no restrictions on the data type or name of the key field(s).
Note
|
Just remember that, as with any hash, the keys have to be unique. |
The UT_hash_handle field must be present in your structure in order to hash it. It is used for the internal bookkeeping that makes the hash work. It does not require initialization. It can be named anything, but you can simplify matters by naming it hh. This allows you to use the easier "convenience" macros to add, find and delete items.
This section introduces the HASH_ADD, HASH_FIND, HASH_DELETE, and HASH_SORT macros. This section demonstrates the macros by example. For a more succinct listing, see Appendix F: Macro Reference.
Your hash must be declared as a NULL-initialized pointer to your structure. Then allocate and initialize your structure and call HASH_ADD. (Here we use the convenience macro HASH_ADD_INT, which is for keys of type int.)
Important
|
The hash must be initialized to NULL! |
struct my_struct *users = NULL; /* important! initialize to NULL */ int add_user(int user_id, char *name) { struct my_struct *s; s = malloc(sizeof(struct my_struct)); s->id = user_id; strcpy(s->name, name); HASH_ADD_INT( users, id, s ); /* id: name of key field */ }
The second parameter to HASH_ADD_INT is the name of the key field. Here, this is id.
Once an item has been added to the hash, do not change the value of its key. Instead, delete the item from the hash, change the key, and then re-add it.
Important
|
Key uniqueness
Your application must enforce key uniqueness. In other words, do not add the same key to the hash more than once. If you do, hash behavior will be undefined. |
You can test whether a key already exists in the hash using HASH_FIND.
To look up an item in a hash, you must have its key. Then call HASH_FIND. (Here we use the convenience macro HASH_FIND_INT for keys of type int).
struct my_struct *find_user(int user_id) { struct my_struct *s; HASH_FIND_INT( users, &user_id, s ); /* s: output pointer */ return s; }
Note
|
The middle argument is a pointer to the key. You can't pass a literal key value to HASH_FIND. Instead assign the literal value to a variable, and pass a pointer to the variable. |
Here, s is the output variable of HASH_FIND_INT. It's a pointer to the sought item, or NULL if the key wasn't found in the hash.
To delete an item from a hash, you must have a pointer to the item.
void delete_user(struct my_struct *user) { HASH_DEL( users, user); /* user: pointer to deletee */ }
Note
|
Deleting an item from a hash just removes it— it does not free it. |
You can loop over the items in the hash by starting from the beginning and following the hh.next pointer.
void print_users() { struct my_struct *s; for(s=users; s != NULL; s=s->hh.next) { printf("user id %d: name %s\n", s->id, s->name); } }
There is also an hh.prev pointer you could use to iterate backwards through the hash, starting from any known item.
The items in the hash are, by default, traversed in the order they were added ("insertion order") when you follow the hh.next pointer. But you can sort the items into a new order using HASH_SORT. E.g.,
HASH_SORT( users, name_sort );
The second argument is a pointer to a comparison function. It must accept two arguments which are pointers to two items to compare. Its return value should be less than zero, zero, or greater than zero, if the first item sorts before, equal to, or after the second item, respectively. (Just like strcmp).
int name_sort(struct my_struct *a, struct my_struct *b) { return strcmp(a->name,b->name); } int id_sort(struct my_struct *a, struct my_struct *b) { return (a->id - b->id); } void sort_by_name() { HASH_SORT(users, name_sort); } void sort_by_id() { HASH_SORT(users, id_sort); }
When the items in the hash are sorted, the first item may also change position; so in the example above, users may point to a different item after calling HASH_SORT.
We'll repeat all the code and embellish it with a main() function to form a working example.
If this code was placed in a file called example.c in the same directory as uthash.h, it could be compiled and run like this:
cc -o example example.c ./example
Follow the prompts to try the program, and type Ctrl-C when done.
#include <stdio.h> /* gets */ #include <stdlib.h> /* atoi, malloc */ #include <string.h> /* strcpy */ #include "uthash.h" struct my_struct { int id; /* key */ char name[10]; UT_hash_handle hh; /* makes this structure hashable */ }; struct my_struct *users = NULL; int add_user(int user_id, char *name) { struct my_struct *s; s = malloc(sizeof(struct my_struct)); s->id = user_id; strcpy(s->name, name); HASH_ADD_INT( users, id, s ); /* id: name of key field */ } struct my_struct *find_user(int user_id) { struct my_struct *s; HASH_FIND_INT( users, &user_id, s ); /* s: output pointer */ return s; } void delete_user(struct my_struct *user) { HASH_DEL( users, user); /* user: pointer to deletee */ } void print_users() { struct my_struct *s; for(s=users; s != NULL; s=s->hh.next) { printf("user id %d: name %s\n", s->id, s->name); } } int name_sort(struct my_struct *a, struct my_struct *b) { return strcmp(a->name,b->name); } int id_sort(struct my_struct *a, struct my_struct *b) { return (a->id - b->id); } void sort_by_name() { HASH_SORT(users, name_sort); } void sort_by_id() { HASH_SORT(users, id_sort); } int main(int argc, char *argv[]) { char in[10]; int id=1; struct my_struct *s; while (1) { printf("1. add user\n"); printf("2. find user\n"); printf("3. delete user\n"); printf("4. sort items by name\n"); printf("5. sort items by id\n"); printf("6. print users\n"); gets(in); switch(atoi(in)) { case 1: printf("name?\n"); add_user(id++, gets(in)); break; case 2: printf("id?\n"); s = find_user(atoi(gets(in))); printf("user: %s\n", s ? s->name : "unknown"); break; case 3: printf("id?\n"); s = find_user(atoi(gets(in))); if (s) delete_user(s); else printf("id unknown\n"); break; case 4: sort_by_name(); break; case 5: sort_by_id(); break; case 6: print_users(); break; } } }
This program is included in the distribution in tests/example.c. You can run make example in that directory to compile it easily.
This is almost the same as using integer keys except the macros are named HASH_ADD_STR and HASH_FIND_STR.
Note
|
char[ ] vs. char*
The string is within the structure in this example— name is a char[10] field. If instead our structure merely pointed to the key (i.e., name was declared char *), we'd use HASH_ADD_KEYPTR, described in Appendix F. |
#include <string.h> /* strcpy */ #include <stdlib.h> /* malloc */ #include <stdio.h> /* printf */ #include "uthash.h" struct my_struct { char name[10]; /* key */ int id; UT_hash_handle hh; /* makes this structure hashable */ }; int main(int argc, char *argv[]) { char **n, *names[] = { "joe", "bob", "betty", NULL }; struct my_struct *s, *users = NULL; int i=0; for (n = names; *n != NULL; n++) { s = malloc(sizeof(struct my_struct)); strcpy(s->name, *n); s->id = i++; HASH_ADD_STR( users, name, s ); } HASH_FIND_STR( users, "betty", s); if (s) printf("betty's id is %d\n", s->id); }
This example is included in the distribution in tests/test15.c. It prints:
betty's id is 2
Note
|
Remember, don't change an item's key after adding it to the hash. (Instead, delete the item from the hash, change the key and then re-add it to the hash). |
Your key field can have any data type, and it can even be an aggregate of multiple contiguous fields.
The generalized version of the macros (e.g., HASH_ADD) must be used. In addition to the usual arguments, the generalized macros require the name of the UT_hash_handle field, and the key length.
#include <stdlib.h> /* malloc */ #include <stddef.h> /* offsetof */ #include <stdio.h> /* printf */ #include <string.h> /* memset */ #include "uthash.h" struct my_event { struct timeval tv; /* key is aggregate of this field */ char event_code; /* and this field. */ int user_id; UT_hash_handle hh; /* makes this structure hashable */ }; int main(int argc, char *argv[]) { struct my_event *e, ev, *events = NULL; int i, keylen; keylen = offsetof(struct my_event, event_code) + sizeof(char) - offsetof(struct my_event, tv); for(i = 0; i < 10; i++) { e = malloc(sizeof(struct my_event)); memset(e,0,sizeof(struct my_event)); e->tv.tv_sec = i * (60*60*24*365); /* i years (sec)*/ e->tv.tv_usec = 0; e->event_code = 'a'+(i%2); /* meaningless */ e->user_id = i; HASH_ADD( hh, events, tv, keylen, e); } /* look for one specific event */ memset(&ev,0,sizeof(struct my_event)); ev.tv.tv_sec = 5 * (60*60*24*365); ev.tv.tv_usec = 0; ev.event_code = 'b'; HASH_FIND( hh, events, &ev.tv, keylen, e ); if (e) printf("found: user %d, unix time %ld\n", e->user_id, e->tv.tv_sec); }
This example is included in the distribution in tests/test16.c.
A structure can be in multiple hashes simultaneously. For example, you might want to hash a structure on both an ID field and on a unique username.
struct my_struct { int id; /* key 1 */ char username[10]; /* key 2 */ UT_hash_handle hh1,hh2; /* makes this structure hashable */ };
In the example above, the structure can now be added to two separate hashes; in one hash, id is its key, while in the other hash, username is the key. (There is no requirement that the two hashes have different key fields. They could both use the same key, such as id).
Note
|
You need to have a separate UT_hash_handle for each hash that your structure will participate in simultaneously. These are hh1 and hh2 in this example. |
struct my_struct *users_by_id = NULL, *users_by_name = NULL, *s; int i; char *name; s = malloc(sizeof(struct my_struct)); s->id = 1; strcpy(s->username, "thanson"); HASH_ADD(hh1, users_by_id, id, sizeof(int), s); HASH_ADD(hh2, users_by_name, username, strlen(s->username), s); /* lookup user by ID */ i=1; HASH_FIND(hh1, users_by_id, &i, sizeof(int), s); if (s) printf("found id %d: %s\n", i, s->username); /* lookup user by username */ name = "thanson"; HASH_FIND(hh2, users_by_name, name, strlen(name), s); if (s) printf("found user %s: %d\n", name, s->id);
When participating in two or more hashes simultaneously, always use the same key field with the same hash handle. E.g., in this example, hh1 is always used with the id key, and hh2 is always used with the username key.
You can use this hash in a threaded program. But, the locking is up to you. You must protect the hash against concurrent modification. For every hash that you create in a threaded program, you should create a lock to protect it.
Concurrent hash lookups (HASH_FIND) are permitted, but modifications (HASH_ADD, HASH_DEL, and HASH_SORT) cannot be concurrent. Therefore, you can use a read/write lock which provides concurrent-read/exclusive-write semantics. Or you can use a mutex-style lock to prevent any concurrent access to the hash, but this is less efficient.
Internally this software uses a particular hash function, such as Bernstein's hash, to transform a key to a bucket number.
#define HASH_BER(key,keylen,num_bkts,bkt,i,j,k) \ bkt = 0; \ while (keylen--) bkt = (bkt * 33) + *key++; \ bkt &= (num_bkts-1);
Several such hash functions are built-in. The default is Jenkin's hash, but your key domain may get better distribution with another function. You can use a specific hash function by compiling your program with -DHASH_FUNCTION=HASH_xyz where xyz is one of the symbolic names listed below. E.g.,
cc -DHASH_FUNCTION=HASH_OAT -o program program.c
Symbol | Name |
---|---|
JEN | Jenkins (default) |
BER | Bernstein |
SAX | Shift-Add-Xor |
OAT | One-at-a-time |
FNV | Fowler/Noll/Vo |
JSW | Julienne Walker |
You can easily determine the best hash function for your key domain. To do so, you'll need to run your program once in a data-collection pass, and then run the collected data through an included analysis utility.
First you must build the analysis utility. From the top-level directory,
cd tests/ make
We'll use test14.c to demonstrate the data-collection and analysis steps (using sh syntax):
cc -DHASH_EMIT_KEYS=3 -I../src -o test14 test14.c ./test14 3>test14.keys ./keystats test14.keys
The numeric value of HASH_EMIT_KEYS is a file descriptor. Any file descriptor that your program doesn't use for its own purposes can be used instead of 3.
Note
|
The data-collection mode enabled by -DHASH_EMIT_KEYS=x should not be used in production code. |
The result of running the keystats command should look like this:
fcn hash_q #items #buckets #dups flags add_usec find_usec --- ---------- ---------- ---------- ---------- ---------- ---------- ---------- BER 1.000000 1219 512 0 ok 1206 550 FNV 1.000000 1219 512 0 ok 1734 660 JEN 1.000000 1219 512 0 ok 2008 845 OAT 0.998897 1219 512 0 ok 1941 833 SAX 0.997514 1219 1024 0 ok 2217 674 JSW 0.994652 1219 512 0 ok 1369 608
Usually, you should just pick the first hash function that is listed. Here, this is BER. This is the function that provides the most even distribution for your keys. If several have the same hash_q, then choose the fastest one according to the find_usec column.
Tip
|
Sometimes you might pick the fastest one even though its hash_q isn't the best. This is ok, particularly if its substantially faster (ideally hash functions have negligible performance overhead but in reality they can vary). |
symbolic name of hash function
a number between 0 and 1 that measures how evenly the items are distributed among the buckets (1 is best)
the number of keys that were read in from the emitted key file
the number of buckets in the hash after all the keys were added
the number of duplicate keys encounted in the emitted key file. Duplicates keys are filtered out to maintain key uniqueness. (Duplicates are normal. For example, if the application adds an item to a hash, deletes it, then re-adds it, the key is written twice to the emitted file.)
this is either ok, or noexpand if the expansion inhibited flag is set, described in Appendix C. It is not recommended to use a hash function that has the noexpand flag set.
the clock time in microseconds required to add all the keys to a hash
the clock time in microseconds required to look up every key in the hash
The hash_q statistic measures how evenly the keys are distributed among the buckets. It is a value between 0 (worst) and 1 (best). You can access it in this way, if you're interested in observing how well your hash keys hold up to the ideal distribution.
/* using the example "users" hash from previous listings */ void print_hashq() { if (users) printf("hash_q: %f\n", users->hh.tbl->hash_q); }
The hash_q statistic is updated whenever bucket expansion occurs (see Appendix C).
A low value of hash_q may lead to reduced lookup (HASH_FIND) performance. You can change the hash function to one that distributes your keys more evenly; see Appendix A: Choosing the hash function.
Internally this hash manages the number of buckets, with the goal of having enough buckets so that each one contains only small number of items.
This hash attempts to keep fewer than 10 items in each bucket. When an item is added that would cause a bucket to exceed this number, the number of buckets in the hash is doubled and the items are redistributed.
Note
|
Bucket expansion occurs automatically and invisibly as needed. There is no need for the application to know when it occurs. |
There is a hook for this event, primarily for uthash development.
#include "uthash.h" #undef uthash_expand_fyi #define uthash_expand_fyi(tbl) printf("expanded to %d buckets\n", tbl->num_buckets) ...
During bucket expansion, if the number of items in a bucket exceeds the ideal number, then the expansion threshold for that bucket will be increased (instead of being 10, it will be a multiple of 10). This prevents excessive bucket expansion for hash functions that tend to over-utilize a few buckets yet have good distribution overall.
Some key domains may not yield a good bucket distribution using the default (or manually-specified) hash function.
If this occurs, statistic hash_q reflects the uneven distribution by tending closer to 0 than 1. This means that items are piling up in some buckets while other buckets are under-utilized. Bucket expansion then takes place as this hash tries to keep a bounded number of items in each bucket.
However, if items continue to be distributed unevenly after repeated bucket expansion, there is a point at which this hash in effect says, "This key domain is really incompatible with this hash function. I could spend all day expanding the buckets but it isn't helping to spread out the items in the hash, so I'm going to cut my losses and stop expanding the buckets."
The decision to inhibit expansion is made by a heuristic rule: if two consecutive bucket expansions yield hash_q values below 0.5, then the bucket expansion inhibited flag is set for this hash.
Because this condition would cause HASH_FIND lookups to exhibit worse than constant-time performance, your application can choose to be made aware if it occurs. (Usually you don't need to worry about this, if you used the keystats utility during development to select a good hash for your keys).
#include "uthash.h" #undef uthash_noexpand_fyi #define uthash_noexpand_fyi printf("warning: bucket expansion inhibited\n"); ...
Once set, the bucket expansion inhibited flag remains in effect as long as the hash has items in it.
By default this hash implementation uses malloc and free to manage memory. If your application uses its own custom allocator, this hash can use them too.
#include "uthash.h" /* undefine the defaults */ #undef uthash_bkt_malloc #undef uthash_bkt_free #undef uthash_tbl_malloc #undef uthash_tbl_free /* re-define, specifying alternate functions */ #define uthash_bkt_malloc(sz) my_malloc(sz) /* for UT_hash_bucket */ #define uthash_bkt_free(ptr) my_free(ptr) #define uthash_tbl_malloc(sz) my_malloc(sz) /* for UT_hash_table */ #define uthash_tbl_free(ptr) my_free(ptr) ...
If memory allocation fails (i.e., the malloc function returned NULL), the default behavior is to terminate the process by calling exit(-1). This can be modified by re-defining the uthash_fatal macro.
#undef uthash_fatal #define uthash_fatal(msg) my_fatal_function(msg);
The fatal function should terminate the process; uthash does not support "returning a failure" if memory cannot be allocated.
The UT_hash_handle data structure includes next, prev, hh_next and hh_prev fields. The former two fields determine the "application" ordering (that is, iteration order that corresponds to the order the items were added). The latter two fields determine the "bucket" order. These link the UT_hash_handles together in a doubly-linked list that defines a bucket.
If a program that uses this hash is compiled with -DHASH_DEBUG=1, a special internal consistency-checking mode is activated. In this mode, the integrity of the whole hash is checked following every add or delete operation.
Note
|
This is for debugging the uthash software only, not for use in production code. |
In this mode, any internal errors in the hash data structure will cause a message to be printed to stderr and the program to exit.
the hash is walked in its entirety twice: once in bucket order and a second time in application order
the total number of items encountered in both walks is checked against the stored number
during the walk in bucket order, each item's hh_prev pointer is compared for equality with the last visited item
during the walk in application order, each item's prev pointer is compared for equality with the last visited item
In the tests/ directory, running make debug will run all the tests in this mode.
These macros add, find, delete and sort the items in a hash.
macro | arguments |
---|---|
HASH_ADD | (hh_name, head, keyfield_name, key_len, item_ptr) |
HASH_ADD_KEYPTR | (hh_name, head, key_ptr, key_len, item_ptr) |
HASH_FIND | (hh_name, head, key_ptr, key_len, item_ptr) |
HASH_DELETE | (hh_name, head, item_ptr) |
HASH_SORT | (head, cmp) |
Note
|
HASH_ADD_KEYPTR is used when the structure contains a pointer to the key, rather than the key itself. |
The convenience macros do the same thing as the generalized macros, but require fewer arguments. They have key and naming restrictions (see below).
macro | arguments |
---|---|
HASH_ADD_INT | (head, keyfield_name, item_ptr) |
HASH_FIND_INT | (head, key_ptr, item_ptr) |
HASH_ADD_STR | (head, keyfield_name, item_ptr) |
HASH_FIND_STR | (head, key_ptr, item_ptr) |
HASH_DEL | (head, item_ptr) |
In order to use the convenience macros,
the structure's UT_hash_handle field must be named hh, and
the key field must be of type int or char[]
HASH_DEL does not require the second condition.
name of the UT_hash_handle field in the structure. Conventionally called hh.
the structure pointer variable which acts as the "head" of the hash. So named because it points to the first item which is added to the hash.
the name of the key field in the structure. (In the case of a multi-field key, this is the first field of the key). If you're new to macros, it might seem strange to pass the name of a field as a parameter. See note.
the length of the key field in bytes. E.g. for an integer key, this is sizeof(int), while for a string key it's strlen(key). (For a multi-field key, see the notes in this guide on calculating key length).
for HASH_FIND, this is a pointer to the key to look up in the hash (since it's a pointer, you can't directly pass a literal value here). For HASH_ADD_KEYPTR, this is the address of the key of the item being added.
pointer to the structure being added, deleted or looked up. This is an input parameter for HASH_ADD and HASH_DELETE macros, and an output parameter for HASH_FIND.
pointer to comparison function which accepts two arguments (pointers to items to compare) and returns an int specifying whether the first item should sort before, equal to, or after the second item (like strcmp).