Jena2 Database Interface - Database Layout

This document provides some details on the Jena2 database schema and the encoding used to store URIs and literal values.

Overview - Denormalized Triple Store

A widely-used scheme for storing RDF statements in a relational database is the triple store. In this approach, each RDF statement is stored as a single row in a three-column 'statement' table. Typically, a fourth column is added to indicate whether the object is a literal or a URI. A common variation of this scheme, which uses much less storage space, is the normalized triple store. It uses a statement table plus a literals table and a resources table.

The literals table stores the literals for all statements and the resources table stores the resources for all statements. The statement table stores the subject, predicate and object, but instead of storing the values directly, it stores references to the values in the resources and literals tables. This normalized scheme uses less space than the standard triple store since a literal value or resource URI is stored only once, regardless of the number of times it occurs in statements. The space saving comes at a cost, however, since retrieving the subject, predicate and object values for a statement requires a three-way join of the statement, literals and resources tables. Jena1 used a normalized triple store approach.

Jena2 stores RDF statements using a denormalized triple store approach, which is a hybrid of the standard triple store and the normalized triple store. This scheme uses a statement table, a literals table and a resources table as before. However, the statement table may contain either the values themselves or references to values in the literals and resources tables. 'Short' literals are stored directly in the statement table and 'long' literals are stored in the literals table. Similarly, short URIs are stored directly in the statement table and long URIs are stored in the resources table. The length threshold for short vs. long is configurable; the default is 256.
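The storage rule above can be illustrated with a small in-memory sketch. This is not Jena2 code: the class name, the ('v', value) / ('r', id) cell representation and the single shared long-value table are all simplifications assumed for illustration, and treating a value as short when its length is strictly below the threshold is also an assumption.

```python
from itertools import count

LONG_OBJECT_LENGTH = 256  # default threshold named in the text


class DenormalizedStore:
    """Toy model of a denormalized triple store (hypothetical, not the Jena2 API)."""

    def __init__(self, threshold=LONG_OBJECT_LENGTH):
        self.threshold = threshold
        self.statements = []    # rows of (subj, prop, obj) cells
        self.long_values = {}   # long value -> id; each long value is stored once
        self._ids = count(1)

    def _cell(self, value):
        # Assumption: values strictly shorter than the threshold count as 'short'.
        if len(value) < self.threshold:
            return ("v", value)             # value stored inline
        if value not in self.long_values:
            self.long_values[value] = next(self._ids)
        return ("r", self.long_values[value])  # reference to the long table

    def add(self, subj, prop, obj):
        self.statements.append(tuple(self._cell(x) for x in (subj, prop, obj)))
```

Short values may be repeated across rows, while a long value that occurs in several statements occupies one entry in the long table and is referenced by identifier.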

The motivation for the Jena2 schema is to combine the space savings of the normalized approach with the retrieval efficiency (i.e., no joins) of the standard triple store. In Jena2, long literal and object values are only stored once. Short values may be stored multiple times but, since they are short in length, the additional storage space required is deemed acceptable.

It is hoped that most Jena2 retrieval operations can be accomplished by accessing just the statement table. The retrieval operation returns all statements that match a value for subject, predicate and object. So long as the value to be matched is a short literal or URI, the retrieval can be performed with a simple SQL select on the statement table. However, if the match value is a long value, then the literals or resources tables must first be searched for the identifier of the matching value. Then, the statement table is searched for the identifier of the matching value rather than the value itself.
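The two retrieval paths described above can be sketched with SQLite. The table names, the toy 'v:' / 'r:' cell prefixes and the tiny threshold are assumptions chosen to keep the example short; they are not the actual Jena2 schema or encoding.

```python
import sqlite3

THRESHOLD = 16  # deliberately tiny so the example contains a 'long' value

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stmt (subj TEXT NOT NULL, prop TEXT NOT NULL, obj TEXT NOT NULL);
    CREATE TABLE long_lit (id INTEGER PRIMARY KEY, value TEXT NOT NULL);
""")

long_text = "a literal that is longer than the threshold"
conn.execute("INSERT INTO long_lit (id, value) VALUES (1, ?)", (long_text,))
conn.executemany("INSERT INTO stmt VALUES (?, ?, ?)", [
    ("v:ex:a", "v:ex:label", "v:short"),   # short object stored inline
    ("v:ex:b", "v:ex:comment", "r:1"),     # long object stored by reference
])


def find_by_object(conn, value):
    """Return (subj, prop) pairs of statements whose object equals value."""
    if len(value) < THRESHOLD:
        key = "v:" + value                 # short: match the value itself
    else:
        row = conn.execute(
            "SELECT id FROM long_lit WHERE value = ?", (value,)
        ).fetchone()
        if row is None:
            return []                      # value never stored, nothing can match
        key = "r:" + str(row[0])           # long: match the identifier instead
    return conn.execute(
        "SELECT subj, prop FROM stmt WHERE obj = ? ORDER BY subj", (key,)
    ).fetchall()
```

A short match value needs one select on the statement table; a long match value costs one extra lookup in the long table before the statement table is searched.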

In terms of a space-time trade-off, the Jena2 approach uses extra storage space in order to save on response time. However, the user can configure the threshold for short vs. long values (see LongObjectLength in Options) and so adjust the space-time trade-off to the needs of the application. More details on the Jena2 storage subsystem are available in the Hewlett-Packard Laboratories Technical Report Efficient RDF Storage and Retrieval in Jena2, HPL-2003-266.

Statement Tables

Jena2 uses two types of statement tables, one type for asserted statements and a second type for reified statements. The benefit of the reified statement table is that it stores reified statements in an optimized form. Recall that a reified statement is expressed in RDF as four individual RDF statements. Storing this would require four rows in a standard triple store. With a reified statement table, a reified statement can be stored as a single row. For applications that use a large number of reified statements, the space savings can be substantial.

Note that, by default, each graph (model) is stored in its own pair of statement tables (one for asserted statements, one for reified statements). However, Jena2 allows models to share statement tables as well (see StoreWithModel in Options). The table layouts for the various Jena2 tables are sketched below. Varchar(n) denotes a variable-length string with a maximum length of n characters (n is configurable). The encoding used for the variable-length character string columns is described later in this document.

Asserted Statement Table (Jena_GiTj_Stmt)

Column	Type	Description
Subj	Varchar(n) not null	Subject of asserted statement (encoded)
Prop	Varchar(n) not null	Predicate of asserted statement (encoded)
Obj	Varchar(n) not null	Object of asserted statement (encoded)
GraphId	Integer	Identifier of graph (model) that contains the asserted statement

Indexes: non-unique index on <Subj,Prop>; non-unique index on <Obj>

These tables hold asserted (non-reified) statements for one or more graphs. The table name is generated and has the form Jena_GiTj_Stmt, where i is a graph identifier and j is a table counter for the graph, e.g., Jena_G1T1_Stmt.

Reified Statement Table (Jena_GiTj_Reif)

Column	Type	Description
Subj	Varchar(n)	Subject of reified statement (encoded)
Prop	Varchar(n)	Predicate of reified statement (encoded)
Obj	Varchar(n)	Object of reified statement (encoded)
GraphId	Integer	Identifier of graph (model) that contains the asserted statement
Stmt	Varchar(n) not null	Identifier (URI) of reified statement (encoded)
HasType	Char(1) not null	'T' if the graph (model) contains the statement (Stmt, rdf:type, rdf:Statement), else ' '

Indexes: unique index on <Stmt,HasType>; non-unique index on <Subj,Prop>; non-unique index on <Obj>

These tables hold reified statements for one or more graphs. The table name is generated and has the form Jena_GiTj_Reif, where i is a graph identifier and j is a table counter for the graph, e.g., Jena_G1T2_Reif. A row of the reified statement table, say
        ex:subj, ex:prop, ex:obj, 1, ex:stmt, 'T'
would represent the following four asserted statements in the graph (model) with identifier 1:
        ex:stmt, rdf:subject, ex:subj .
        ex:stmt, rdf:predicate, ex:prop .
        ex:stmt, rdf:object, ex:obj .
        ex:stmt, rdf:type, rdf:Statement .
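The expansion of a reified row into asserted statements can be sketched as follows, using the standard RDF reification vocabulary (rdf:subject, rdf:predicate, rdf:object, rdf:type). The function name is hypothetical. Since Subj, Prop and Obj are nullable in the table layout, the sketch assumes that a null column simply contributes no statement; that reading is not confirmed by the text.

```python
def expand_reified_row(subj, prop, obj, graph_id, stmt, has_type):
    """Expand one reified-statement row into (graph_id, list of triples)."""
    triples = []
    # Assumption: a null (None) column means that part of the quad is absent.
    if subj is not None:
        triples.append((stmt, "rdf:subject", subj))
    if prop is not None:
        triples.append((stmt, "rdf:predicate", prop))
    if obj is not None:
        triples.append((stmt, "rdf:object", obj))
    # HasType is 'T' only when the graph contains the rdf:Statement typing triple.
    if has_type == "T":
        triples.append((stmt, "rdf:type", "rdf:Statement"))
    return graph_id, triples
```

Called with the example row above, the function yields graph identifier 1 and the four statements listed.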

System Tables

The Jena2 system tables store metadata as well as the long values for literals and resources in the statement tables. As before, Varchar(n) denotes a variable-length character string with a maximum length of n characters. Blob denotes a very long character string whose maximum length depends on the database engine. Typically, database engines do not support indexes on blob columns.

System Statement Table (Jena_Sys_Stmt)

Column	Type	Description
Subj	Varchar(n) not null	Subject of metadata statement (encoded)
Prop	Varchar(n) not null	Predicate of metadata statement (encoded)
Obj	Varchar(n) not null	Object of metadata statement (encoded)
GraphId	Integer	Always zero, representing the system (meta) graph

Indexes: non-unique index on <Subj,Prop>

The system statement table is used to store system metadata for the Jena2 storage subsystem, such as configuration parameters and table names for graphs. The metadata is expressed as RDF statements, so this table looks very much like an asserted statement table.

Long Literals Table (Jena_Long_Lit)

Column	Type	Description
Id	Integer not null	Identifier of long literal, referenced from the statement tables
Head	Varchar(n) not null	First n characters of long literal (encoded)
ChkSum	Integer	Checksum of tail of long literal
Tail	Blob	Remainder of long literal (long literal without the head)

Indexes: unique index on <Head,ChkSum>
Primary Key: Id

The long literals table stores literals that are too long to store directly in a statement table. Each long literal is assigned a unique integer identifier and this identifier is used to reference the literal from a statement table. To support indexing of long literals, the first n characters of the long literal are stored in the Head column. The length of the head is configurable (see IndexKeyLength in Options). The remainder of the long literal is stored as a blob in the Tail column. A checksum is computed over the tail to distinguish long literals with the same Head.
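The head/checksum/tail split can be sketched as below. The function name is hypothetical, and encoding the tail as UTF-8 before hashing is an assumption; the encoding section of this document names CRC32 as the hash computed over the tail. The sketch also ignores the encoding prefix that the real Head column carries.

```python
import zlib


def split_long_literal(literal, head_length):
    """Split a long literal into (head, chksum, tail) for a long-literals row."""
    head = literal[:head_length]    # indexed portion
    tail = literal[head_length:]    # stored as a blob, not indexed
    # Assumption: the tail is hashed as UTF-8 bytes.
    chksum = zlib.crc32(tail.encode("utf-8"))
    return head, chksum, tail
```

Two literals that share the same first head_length characters map to the same Head, so the pair (Head, ChkSum) is what the unique index relies on to tell them apart.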

Long Resources Table (Jena_Long_URI)

Column	Type	Description
Id	Integer not null	Identifier of long URI, referenced from the statement tables
Head	Varchar(n) not null	First n characters of long URI (encoded)
ChkSum	Integer	Checksum of tail of long URI
Tail	Blob	Remainder of long URI (long URI without the head)

Indexes: unique index on <Head,ChkSum>
Primary Key: Id

The long resources table stores URIs that are too long to store directly in a statement table. Each long URI is assigned a unique integer identifier and this identifier is used to reference the URI from a statement table. The columns of the long resources table are similar to those in the long literals table. However, the encoding of URIs is different from that of long literals. In particular, URIs may have prefixes. See the prefixes table and the encoding discussion, below.

Prefixes Table (Jena_Prefix)

Column	Type	Description
Id	Integer not null	Identifier of prefix, referenced from the statement tables
Head	Varchar(n) not null	First n characters of prefix (encoded)
ChkSum	Integer	Checksum of tail of long prefix
Tail	Blob	Remainder of long prefix (long prefix without the head)

Indexes: unique index on <Head,ChkSum>
Primary Key: Id

URIs often have common prefixes (e.g., http://www.example.org/prop1, http://www.example.org/prop2). There can be substantial space savings if prefixes are stored just once. Jena2 will optionally store common URI prefixes in the prefixes table (see DoCompressURI in Options). This table is structurally similar to the long literals and long resources tables.

Graph Table (Jena_Graph)

Column	Type	Description
Id	Integer not null	Unique identifier for graph
Name	Blob	Graph name

Primary Key: Id

The graph table stores the name and unique identifier for each user graph. An unnamed graph appears under the name DEFAULT.

Lock Table (Jena_Mutex)

Column	Type	Description
Dummy	Integer	Unused

The lock table is used internally by Jena2 to implement a critical section. It has no meaningful content. If the table exists, the database is locked for an internal critical section, e.g., creating a model. Normally, the lock is released at the end of the critical section by deleting the table. However, if a Jena application fails while in a critical section, i.e., while holding the lock, users may have to release the lock manually, either by deleting the table or by calling DriverRDB.unlockDB().
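The table-as-lock idea can be sketched with SQLite. This is only an illustration of the general technique, not the Jena2 implementation: the helper names are hypothetical, and any waiting or retry behavior is omitted.

```python
import sqlite3
from contextlib import contextmanager

LOCK_TABLE = "Jena_Mutex"


def is_locked(conn):
    """True if the lock table currently exists."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (LOCK_TABLE,),
    ).fetchone()
    return row is not None


@contextmanager
def db_mutex(conn):
    """Hold the lock for the duration of a with-block."""
    # CREATE TABLE fails with OperationalError if another holder created it first.
    conn.execute(f"CREATE TABLE {LOCK_TABLE} (Dummy INTEGER)")
    try:
        yield
    finally:
        # Releasing the lock means deleting the table.
        conn.execute(f"DROP TABLE {LOCK_TABLE}")
```

If the process dies between the CREATE and the DROP, the table is left behind and the database stays locked, which is the failure case that requires manual cleanup.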

Value Encoding in Statement and Long Tables

The following describes the encoding used for literals, URIs and other values in the statement tables and the system tables. Square brackets delimit optional components of the encoding. Italics are used for variable values.

Literal Encoding in Statement Tables
Short Literal. Lv:[langLen]:[datatypeLen]:[langString][datatypeString]value[:]
Long Literal. Lr:dbid

Literal Encoding in Long Literal Table
Literal. Lv:[langLen]:[datatypeLen]:[langString][datatypeString]head[:] hash tail

Legend
L indicates a literal
v indicates a value
r indicates a reference to another table
: is used as a delimiter. Note that MySQL trims trailing white space for certain Varchar columns so an extra delimiter is appended when necessary for those columns.
dbid an (integer) identifier of an entry in the long literals table
langLen is the length of the language identifier for the literal
langString is the language identifier
datatypeLen is the length of the datatype for the literal
datatypeString is the datatype for the literal
value is the lexical form of the string
head is the initial substring of the literal that is indexed
hash is the CRC32 hash value for the tail
tail is the remainder of the literal that is not indexed
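As an illustration of the short literal format, the following sketch encodes and decodes values according to the grammar above. The function names are hypothetical. It assumes that an omitted langLen or datatypeLen is written as an empty field, and it leaves out the optional trailing delimiter.

```python
def encode_short_literal(value, lang="", datatype=""):
    """Build Lv:[langLen]:[datatypeLen]:[langString][datatypeString]value."""
    # Assumption: an absent language or datatype yields an empty length field.
    lang_len = str(len(lang)) if lang else ""
    datatype_len = str(len(datatype)) if datatype else ""
    return f"Lv:{lang_len}:{datatype_len}:{lang}{datatype}{value}"


def decode_short_literal(encoded):
    """Return (value, lang, datatype) from an encoded short literal."""
    # Split on the first three delimiters only; the value may contain ':'.
    tag, lang_len, datatype_len, rest = encoded.split(":", 3)
    if tag != "Lv":
        raise ValueError("not a short literal: " + encoded)
    n_lang = int(lang_len or 0)
    n_datatype = int(datatype_len or 0)
    lang = rest[:n_lang]
    datatype = rest[n_lang:n_lang + n_datatype]
    value = rest[n_lang + n_datatype:]
    return value, lang, datatype
```

The explicit length fields are what allow the language tag, datatype and lexical form to be concatenated without further delimiters.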

URI Encoding in Statement Tables
Short URI. Uv:[pfx_dbid]:URI[:]
Long URI. Ur:[pfx_dbid]:dbid

URI Encoding in Long URI Table
URI. Uv:head[:] hash tail

Legend
U indicates a URI
pfx_dbid is an (integer) identifier of an entry in the prefixes table. If the prefix is too short, i.e., the length of the prefix is less than URI_COMPRESS_LENGTH (see Options), the URI is not compressed and pfx_dbid is null.
URI is the complete URI
other notation same as for literal encoding

Blank Node Encoding in Statement Tables
Short Blank Node. Bv:[pfx_dbid]:bnid[:]
Long Blank Node. Br:[pfx_dbid]:dbid

Blank Encoding in Long URI Table
Blank Node. Bv:head[:] hash tail

Legend
B indicates a blank node
bnid is the blank node identifier
other notation same as above
Note: currently, blank nodes are always stored uncompressed (pfx_dbid is null).

Variable Node Encoding in Statement Tables
Variable Node. Vv:name

Legend
V indicates a variable node
v indicates a value
name is the variable name
Note: the length must be less than LONG_OBJECT_LENGTH

ANY Node Encoding in Statement Tables
ANY Node. Av:

Prefix Encoding in Prefix Table
Prefix. Pv:val[:] [hash] [tail]

Legend
P indicates a prefix
other notation same as above
hash and tail are only required for long prefixes.
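Every encoding above starts with a two-character tag: a node kind (L, U, B, V, A or P) followed by v for a value or r for a reference. A reader of the tables could dispatch on that tag as sketched below; the function and dictionary names are hypothetical, and the body after the first delimiter is returned unparsed.

```python
NODE_KINDS = {
    "L": "literal",
    "U": "URI",
    "B": "blank node",
    "V": "variable",
    "A": "ANY",
    "P": "prefix",
}
STORAGE = {"v": "value", "r": "reference"}


def classify(encoded):
    """Return (node kind, storage form, remaining body) for an encoded cell."""
    tag, sep, body = encoded.partition(":")
    if (sep != ":" or len(tag) != 2
            or tag[0] not in NODE_KINDS or tag[1] not in STORAGE):
        raise ValueError("unrecognized encoding: " + encoded)
    return NODE_KINDS[tag[0]], STORAGE[tag[1]], body
```

For a reference such as Lr:17 the body is the identifier to look up in the corresponding long table; for a value, the body still has to be parsed according to the format for that node kind.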