Berkeley DB Reference Guide: Indexing

Berkeley DB Reference Guide:
Berkeley DB XML

Indexing

Berkeley DB XML offers effective and flexible indexing functionality that gives application developers powerful control over query performance. Berkeley DB XML indexing strategies are best illustrated by example. Consider the following XML document:

<book bookID="1234">
<author>Abelson, H</author>
<title>Structure and Interpretation of Computer Programs</title>
<isbn>0-262-51036-7</isbn>
</book>

The following figure illustrates this XML document as a graph:

xml_fig_3

The graph consists of a series of nodes (book; bookID; author; title; ISBN), edges that connect the nodes (book.bookID; book.author; book.title; book.ISBN), and values (1234; Abelson, H; Structure and Interpretation of Computer Programs; 0-262-51036-7).

The indexer maintains index keys based on the nodes, edges, and values of an XML document. Only XML document element and attribute nodes are indexed. Other XML document node types, such as declarations, comments, and processing instructions, are ignored.

When creating index keys from a document, the indexer does not follow external references to document type definitions and external entities. References to internal entities within the document's character data are substituted for their replacement text. References to external entities are removed from the character data.

The indexer concatenates character data mixed with child elements into a single value. For example, the fragment "<a>1<b>2</b>3</a>" has two element type values, element 'a' has value '13', and element 'b' has value '2'.

The indexer expands CDATA sections when indexing a document. For example, the character data <![CDATA[<greeting>Hello, world!</greeting>]]> would be indexed as if the character data were <greeting>Hello, world!</greeting>.

The indexer replaces namespace prefixes with the namespace URI to which they refer. This element type normalization ensures that documents containing the same element types, but different prefixes for the same namespace, will be considered to be identical.

When defining an index within Berkeley DB XML, application developers must specify the name of the node to be indexed and the type of indexing strategy to be maintained for that node. For element nodes the name is the element type, and for attribute nodes the name is the attribute name. All nodes with that name are indexed, regardless of the path to the node.

Applications must specify a value for four index types when defining an index:

Path Type: The index must be created with a path type of either node or edge. Using the book example, an index defined with a path type of node would have five entries; one for each node of the graph, while an index defined with a path type of edge would have four entries; one for each edge of the graph.
Node Type: Node Type refers to whether the graph node represents an element or an attribute in the original XML document. Using the book example, an index with node type of element would have four entries: one each for book, author, ISBN and title. An index with node type of attribute would have one entry for the bookID node, as it is the only attribute in the original XML document.
Key Type: Key Type determines the type of index key that is created, and must be specified as equality, presence or substring. A key type of equality allows the application to test whether a specified node exists with a given value. A key type of presence allows the application to test whether a specified node exists with any value including a missing value. A key type of substring allows the application to test whether a specified node exists and contains a given substring.
Syntax Type: Syntax Type dictates how a value is to be compared, and must be specified as string, number, or none (for a presence index). Berkeley DB XML uses the XPath definition of string and number.

The following table summarizes the Berkeley DB XML index type options:

Index Axis Types	Path Type	Node Type	Key Type	Syntax Type
Possible Values	Edge Node	Element Attribute	Equality Presence Substring	String or Number None String

Index Axis Types

Path Type

Node Type

Key Type

Syntax Type

Possible Values

Edge

Node

Element

Attribute

Equality

Presence

Substring

String or Number

None

String

The following table enumerates the supported Berkeley DB XML indexing strategies:

Index Strategy(Path-Node-Key-Syntax)	Description
none-none-none-none	Disable indexing for a node
node-element-presence	Define an Element Presence Index
node-attribute-presence	Define an Attribute Presence Index
node-element-equality-string	Define an Element Equality Strings Index
node-element-equality-number	Define an Element Equality Numbers Index
node-element-substring-string	Define an Element Substring Index
node-attribute-equality-string	Define an Attribute Equality Strings Index
node-attribute-equality-number	Define an Attribute Equality Numbers Index
node-attribute-substring-string	Define an Attribute Substring Index
edge-element-presence	Define an Element Edge Presence Index
edge-attribute-presence	Define an Attribute Edge Presence Index
edge-element-equality-string	Define an Element Edge Equality Strings Index
edge-element-equality-number	Define an Element Edge Equality Numbers Index
edge-element-substring-string	Define an Element Edge Substring Index
edge-attribute-equality-string	Define an Attribute Edge Equality Strings Index
edge-attribute-equality-number	Define an Attribute Edge Equality Numbers Index
edge-attribute-substring-string	Define an Attribute Edge Substrings Index

The following code example demonstrates how to declare an index:

void example()
{
	// Create and open a container.
	XmlContainer container(0,"test.dbxml");
	container.open(0,DB_CREATE);

	// Declare an index for title elements.
	XmlIndexSpecification index(container.getIndexSpecification(0));
	index.addIndex(
		"",
		"title",
		"node-element-equality-string");
	container.setIndexSpecification(0,index);
	container.close();
}

Index Keys

Indices are maintained for efficient query processing. Each type of index is suitable for a particular form of XPath expression.

Presence Index Keys

An index of type node-element-presence-none contains keys that record the presence of an element type within a document. They are suitable for path navigation expressions such as '/book/author/name/first', as the index records all documents that contain the element types that appear in the specified path. They are also suitable for expressions containing element type value predicates such as '/book[author='john']', as the index records all documents that have elements of type 'book' and 'author'.

An index of type node-attribute-presence-none contains keys that record the presence of an attribute name within a document. They are suitable for expressions that include path navigation to attributes, such as '/book/@bookID', as the index records all documents that contain 'bookID' attributes. For the same reason, they are also suitable for expressions that include attribute value predicates such as '/book[@bookID='a1b2c3']'.

Equality Index Keys

An index of type node-element-equality-string contains keys that record the value of an element type. They are suitable for expressions that include element type value predicates such as '/book[author='john']', as the index records all documents that have elements of type 'author' with the value 'john'. An equality index is always more selective than a presence index, so Berkeley DB XML query processing favors an equality index lookup over a presence index lookup.

An index of type node-attribute-equality-string contains keys that record the value of an attribute. They are suitable for expressions that include attribute value predicates such as '/book[@bookID='a1b2c3']', as the index records all documents that have 'bookID' attributes with the value 'a1b2c3'.

An index with key type equality and syntax type number contains equality keys that are compared numerically rather than lexically. The numeric syntax supports queries such as '/book[(year=>1980 and year<1990) or year=2000]'.

The query processor uses equality indices for satisfying queries that use the XPath function 'starts-with'. For example, '/book[starts-with(title,'Structure')'].

Substring Index Keys

An index with key type substring (for example, node-element-substring-string) contains keys that support substring queries. XPath substring searches are expressed using the XPath function 'contains'. For example the XPath query '/book[contains(title,'Computer')]' searches for documents that contain the word 'Computer' in their title.

The Berkeley DB XML indexer creates one key for each three-character substring of the node value. For example, the indexer creates substring keys {abc, bcd, cde} if given a node value of 'abcde'.

Edge Index Keys

An index of type edge-element-presence-none is a more specific form of the node-element-presence-none index, and is suitable for expressions containing long path navigations. If an XML document is viewed as a graph, with the elements being nodes, then the index keys are formed from the edges between the nodes. For example, the XML fragment '<a><b><c>d</c></b></a>' has element nodes {a, b, c} and edges {a-b, b-c}. A query with a path expression such as '/a/b/c' can be efficiently satisfied with lookups into the index for the edges {a-b, b-c}. An edge index is always more selective than a node index, so the Berkeley DB XML query processor favors an edge index lookup over a node index lookup.

Equality, and substring indices of node type 'edge' are also supported. Each works in the same way as its node counterpart. For example, indexing the above XML fragment with edge-element-equality-string would give us one key, 'b-c=d'.