Hash Storage Structure

Hash is the keyed storage structure that calculates a placement number or address by applying a hashing algorithm to the key data value. A hashing algorithm is a function that does mathematical computations to a piece of data to produce a number. It always produces the same number for the same piece of data.

Hash is the fastest access method for exact match queries (that is, with no pattern matching). A quick calculation is used to determine which pages to search, but there is no additional I/O necessary for index scanning, as there is in an ISAM or B-tree table. However, hash is more limited in the types of queries it can handle, because the hashing algorithm is not useful in looking for ranges of values, handling partial key restrictions, or doing pattern matching. The entire table must be scanned for these queries.

Using the Modify Table Structure dialog or the modify statement, you can change any table to the hash storage structure. When you modify a table to hash, you must specify a key. If you do not specify a key; otherwise, the first column is used.

Modifying a table to hash involves several calculations. Taking the number of rows currently in the table, and calculating how many rows can fit on a 2000-byte page, modify calculates how many main pages are necessary. (Main pages are data pages where the rows are actually stored.)

To help the hashing algorithm distribute the data evenly, as well as to allow plenty of room to add new data, this figure is doubled (referred to as 50% fill factor). This is the number of main pages assigned to the table. The hashing algorithm decides on which main page the row resides by calculating its hashing address.

Structure of a Hash Table

An example illustrates how a hash table is structured and what hashing means:

The example uses an employee table that has 31 rows, 500 bytes each.
The table is modified to hash on the age field, using the Structure of Table dialog. For a full description of modify procedures, see the chapter "Maintaining Storage Structures." You can also use this modify statement:
modify employee to hash on age;

The number of main pages needed is calculated. The number chosen is always at least seven, no matter how small the table is. The number of main pages chosen is approximately twice the number of pages required if the table were a heap. Normally hash uses a 50% fill factor, although if the row width is greater than 1000 bytes, it uses a 100% fill factor.

The calculation used is this:

Main pages = (Rows_in_Table / Rows_per_page) * 2

31 rows_in_table / 4 rows_per_page = 8 (round up)
8 * 2 = 16;

Main pages for employee table = 16

The main pages calculation is checked against the Min Pages and Max Pages values. If these were specified, the result must fall in this range.

When a table is modified to hash, a skeletal table is set up with an appropriate number of main pages. Although 16 pages can actually be used, as shown in the calculation above, for illustration purposes assume 10 main pages are chosen. The table is built by placing each row on the page where its key hashes.

The chart in the following example illustrates what a table looks like after modifying to hash on age. Remember that the actual hashing function is a complex algorithm that supports all of the data types. For simplicity, however, the following examples use the module function as a hashing algorithm.

Here is an example of a hashing function:

Main Page = Key MOD Main_Pages

Ross, Age 50 50 mod 10 = 0; hashes to page 0
McShane, Age 22 22 mod 10= 2; hashes to page 2
.
.
.

After this hashing process is completed for all 31 rows, the table looks like this:

+----------------------------+

Page 0 | 50|Ross | 55000.000|

| 20|Smith | 10000.000|

| 30|Curan | 30000.000|

| 20|Sabel | 21000.000|

|----------------------------|

Page 1 | |

| |

|----------------------------|

Page 2 | 22|McShane | 22000.000|

| 32|Gregori | 31000.000|

| 42|Brodie | 40000.000|

| |

|----------------------------| Overflow Chain for Page 3

Page 3 | 33|Blumberg | 32000.000| |----------------------------|

| 43|Clark | 40000.000|-->| 23|Ramos | 30000.000|

| 23|Ming | 22000.000| | 53|McTigue | 41000.000|

| 43|Kay | 38000.000| |----------------------------|

|----------------------------|

Page 4 | 24|Saxena | 22000.000|

| 34|Curry | 32000.000|

| 44|Stein | 40000.000|

| 64|Robinson | 80000.000|

|----------------------------|

Page 5 | 55|Verducci | 55000.000|

| 35|Huber | 32000.000|

| 25|Kreseski | 24000.000|

| |

|----------------------------|

Page 6 | 26|Zimmerman | 25000.000|

| 46|Mandic | 43000.000|

| 36|Stannich | 33000.000|

| |

|----------------------------|

Page 7 | 37|Cameron | 35000.000|

| 47|Giller | 46000.000|

| 27|Green | 26000.000|

| |

|----------------------------|

Page 8 | 38|Stover | 35000.000|

| 38|Sullivan | 35000.000|

| 28|Gordon | 27000.000|

| |

|----------------------------|

Page 9 | 49|Aitken | 50000.000|

| 29|Shigio | 28000.000|

| |

+----------------------------+

To retrieve the employee data about employees who are 49 years old:

Select * from employee
Where employee.age = 49;

The table is hashed on age, and the qualification has specified an age. The hashing algorithm is used to determine the main page on which rows with ages of 49 are located:

49 mod 10 = 9

The lookup goes directly to Page 9, instead of looking through the entire table, and returns the row requested.

To find all employees who are age 53, the calculation is:

53 mod 10 = 3

These rows are found on main Page 3. However, a search through the page, looking for 53, shows it is not there. There is an overflow page, though, and searching this page finds the row. Overflow pages are associated with a particular main page. They can slow down processing time because searches are required not only on the main page but the overflow chain connected to the main page as well.

Inserts, updates, and deletes work the same way retrievals do. If you want to append a new row, the row is placed on the page the new employee's age hashes to. Therefore, if you add an employee with age 22, 22 mod 10 is 2, so this employee is placed on main Page 2.

To find all employees who are older than 50, there is no way of directly locating these rows using the hashing algorithm; you must hash on every possible value greater than 50. Instead, the table is treated as a heap table and every page is scanned, starting from Page 0 through to Page 9 including all overflow pages, looking for qualifying rows.

To retrieve the row where the employee's name is Shigio, does the age key help? Because the table is not hashed on name, and Shigio's age is unknown, the entire table must be scanned, from Page 0 through Page 9, looking for rows where the employee name is Shigio. This retrieval treats the table like a heap table, scanning every main and overflow page. To retrieve a row you need without scanning the entire table, specify the value of the key for the row.

Retrievals Supported by Hash

The hash storage structure allows multi-column keys, but every column in the key must be specified in a query to take advantage of the hash access method. For instance, to hash the employee table on both age and name, use the Structure of Table dialog.

Alternatively, use the following modify statement:

modify employee to hash on age,name;

The following queries make use of the hash key:

select * from employee
where employee.age = 28
and employee.name = 'Gordon';

select * from employee
  where employee.age = 28
    and employee.name = 'Gordon'
  or employee.age = 29
    and employee.name = 'Quan';

The next queries do not use the hash key, because the entire key has not been specified:

select * from employee
where employee.age = 28;

select * from employee
where employee.name = 'Gordon';

select * from employee
where employee.age = 28
and employee.name like 'Gor%';

select * from employee
where employee.age = 28
or employee.name = 'Gordon';

When to Use Hash

Hash is the fastest structure to use when you specify an exact match of the whole key value. Hash does not efficiently support pattern matching, range searches, or partial key specification with multi-column keys. For these queries the entire table must be scanned.

Hash is a good storage structure to use if you always retrieve the rows based on a known key value, such as order number or employee number.

Hash is a poor storage structure to use in any of these cases:

You use pattern matching.
You retrieve ranges of values.
You specify part of a multi-column key.

Hash Troubleshooting

The following are problems encountered with hash storage structure, and their solutions:

Problem	Solution
Pattern matching and range scans used; performance slow.	Use ISAM or B-tree instead.
Partial key of multi-column key used; performance slow.	Use ISAM or B-tree instead.
Overflow pages occur in table after adding rows.	Remodify.
Overflow pages occur in newly modified table.	If key is repetitive, this is normal but undesirable. If key is unique, hashing algorithm does not distribute data well; try increasing minpages. If column is a character column that only partially varies (for example, AAAAA1 AAAAA2), consider using ISAM instead.