This document provides some details on the Jena2 database schema and the encoding used to store URIs and literal values.
A widely-used scheme for storing RDF statements in a relational database is the triple store. In this approach, each RDF statement is stored as a single row in a three-column 'statement' table. Typically, a fourth column is added to indicate if the object is a literal or a URI. A common variation of this scheme which uses much less storage space is the normalized triple store approach. This scheme uses a statement table plus a literals table and a resources table.
The literals table stores the literals for all statements and the resources table stores all the resources from all the statements. The statement table stores the subject, predicate and object but instead of storing the values directly, it stores references to the values in the resources and literals tables. This normalized scheme uses less space than the standard triple store approach since a literal value or resource URI is only stored once, regardless of the number of times it occurs in statements. The space savings come at a cost, however, since retrieving the subject, predicate and object values for a statement requires a 3-way join of the statement, literals and resources tables. Jena1 used a normalized triple store approach.
Jena2 stores RDF statements using a denormalized triple store approach which is a hybrid of the standard triple store and the normalized triple store. This scheme uses a statement table, a literals table and resources table as before. However, the statement table may contain either the values themselves or references to values in the literals and resources tables. 'Short' literals are stored directly in the statement table and 'long' literals are stored in the literals table. Similarly short URIs are stored directly in the statement table and long URIs are stored in the resources table. The length threshold for short vs. long is configurable but the default length is 256.
The motivation for the Jena2 schema is to combine the space savings of the normalized approach with the retrieval efficiency (i.e., no joins) of the standard triple store. In Jena2, long literals and long URIs are only stored once. Short values may be stored multiple times but, since they are short in length, the additional storage space required is deemed acceptable.
It is hoped that most Jena2 retrieval operations can be accomplished by accessing just the statement table. The retrieval operation returns all statements that match a value for subject, predicate and object. So long as the value to be matched is a short literal or URI, the retrieval can be performed with a simple SQL select on the statement table. However, if the match value is a long value, then the literals or resources tables must first be searched for the identifier of the matching value. Then, the statement table is searched for the identifier of the matching value rather than the value itself.
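The two retrieval paths described above can be illustrated with a toy store. This is only a sketch under stated assumptions, not the Jena2 implementation: the table names `stmt` and `long_lit`, the helper names, and the single `val` column are hypothetical simplifications (the real long literals table splits the value into head, checksum and tail), and treating "short" as strictly shorter than the threshold is also an assumption.

```python
import sqlite3

LONG_OBJECT_LENGTH = 256  # default threshold mentioned in the text


def make_store():
    """Create a toy in-memory store (hypothetical, simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stmt (subj TEXT, prop TEXT, obj TEXT)")
    conn.execute("CREATE TABLE long_lit (id INTEGER PRIMARY KEY, val TEXT UNIQUE)")
    return conn


def encode_object(conn, value, create=False):
    """Return the string kept in stmt.obj for a literal, or None if unknown."""
    if len(value) < LONG_OBJECT_LENGTH:
        return "Lv:" + value  # short value: stored inline (simplified encoding)
    row = conn.execute("SELECT id FROM long_lit WHERE val = ?", (value,)).fetchone()
    if row is not None:
        return "Lr:%d" % row[0]  # long value: stored as a reference
    if not create:
        return None
    cur = conn.execute("INSERT INTO long_lit (val) VALUES (?)", (value,))
    return "Lr:%d" % cur.lastrowid


def add(conn, subj, prop, value):
    """Insert one statement, storing long objects by reference."""
    obj = encode_object(conn, value, create=True)
    conn.execute("INSERT INTO stmt VALUES (?, ?, ?)", (subj, prop, obj))


def find_by_object(conn, value):
    """Short values need one select; long values need an id lookup first."""
    key = encode_object(conn, value)
    if key is None:
        return []  # the long value was never stored, so nothing can match
    return conn.execute("SELECT subj, prop FROM stmt WHERE obj = ?", (key,)).fetchall()


conn = make_store()
add(conn, "ex:a", "ex:title", "short")
add(conn, "ex:b", "ex:abstract", "x" * 1000)
```

In this sketch a search for `"short"` touches only the `stmt` table, while a search for the 1000-character value first resolves its identifier in `long_lit` and then searches `stmt` for the reference.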
In terms of a space-time trade-off, the Jena2 approach uses extra storage space in order to save on response time. However, the user can configure the threshold for short vs. long values (see LongObjectLength in Options) and so adjust the space-time trade-off to the needs of the application. More details on the Jena2 storage subsystem are available in the Hewlett-Packard Laboratories Technical Report Efficient RDF Storage and Retrieval in Jena2, HPL-2003-266.
Jena2 uses two types of statement tables, one type for asserted statements and a second type for reified statements. The benefit of the reified statement table is that it stores reified statements in an optimized form. Recall that a reified statement is expressed in RDF as four individual RDF statements. Storing this would require four rows in a standard triple store. With a reified statement table, a reified statement can be stored as a single row. For applications that use a large number of reified statements, the space savings can be substantial.
Note that, by default, each graph (model) is stored in its own pair of statement tables (one for asserted statements, one for reified statements). However, Jena2 allows models to share statement tables as well (see StoreWithModel in Options). The table layouts for the various Jena2 tables are sketched below. Varchar(n) denotes a variable-length string with a maximum length of n characters (n is configurable). The encoding used for the variable-length character string columns is described later in this document.
Asserted Statement Table (Jena_GiTj_Stmt)
Column | Type | Description |
Subj | Varchar(n) not null | Subject of asserted statement (encoded) |
Prop | Varchar(n) not null | Predicate of asserted statement (encoded) |
Obj | Varchar(n) not null | Object of asserted statement (encoded) |
GraphId | Integer | Identifier of graph (model) that contains the asserted statement |
Indexes: non-unique index on <Subj,Prop>; non-unique index on <Obj>
These tables hold asserted (non-reified) statements for one or more graphs. The table name is generated and has the form Jena_GiTj_Stmt where i is a graph identifier and j is a table counter for the graph, e.g., Jena_G1T1_Stmt.
Reified Statement Table (Jena_GiTj_Reif)
Column | Type | Description |
Subj | Varchar(n) | Subject of reified statement (encoded) |
Prop | Varchar(n) | Predicate of reified statement (encoded) |
Obj | Varchar(n) | Object of reified statement (encoded) |
GraphId | Integer | Identifier of graph (model) that contains the reified statement |
Stmt | Varchar(n) not null | Identifier (URI) of reified statement (encoded) |
HasType | Char(1) not null | 'T' if the graph (model) contains the statement (Stmt, rdf:type, rdf:Statement), else ' ' |
Indexes: unique index on <Stmt,HasType>; non-unique index on <Subj,Prop>; non-unique index on <Obj>
These tables hold reified statements for one or more graphs. The table name is generated and has the form Jena_GiTj_Reif where i is a graph identifier and j is a table counter for the graph, e.g., Jena_G1T2_Reif. A row of the reified statement table, say
ex:subj, ex:prop, ex:obj, 1, ex:stmt, 'T'
would represent the following four asserted statements in the graph (model) with identifier 1:
ex:stmt, rdf:subject, ex:subj .
ex:stmt, rdf:predicate, ex:prop .
ex:stmt, rdf:object, ex:obj .
ex:stmt, rdf:type, rdf:Statement .
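The mapping from one reified row to its asserted statements can be sketched as follows. The function name is hypothetical, the property names follow the standard RDF reification vocabulary (rdf:subject, rdf:predicate, rdf:object, rdf:type), and skipping null columns is an assumption based on the Subj, Prop and Obj columns being nullable.

```python
def expand_reified_row(row):
    """Expand a (Subj, Prop, Obj, GraphId, Stmt, HasType) row into triples."""
    subj, prop, obj, graph_id, stmt, has_type = row
    triples = []
    parts = (("rdf:subject", subj), ("rdf:predicate", prop), ("rdf:object", obj))
    for predicate, value in parts:
        if value is not None:  # assumption: a null column means no statement
            triples.append((stmt, predicate, value))
    if has_type == "T":  # 'T' marks the presence of the rdf:type statement
        triples.append((stmt, "rdf:type", "rdf:Statement"))
    return graph_id, triples


graph_id, triples = expand_reified_row(
    ("ex:subj", "ex:prop", "ex:obj", 1, "ex:stmt", "T")
)
```

For the example row this yields four triples for graph 1, which is the single-row saving the reified statement table is designed for.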
The Jena2 system tables store metadata as well as the long values for literals and resources in the statement tables. As before, Varchar(n) denotes a variable-length character string with a maximum length of n characters. Blob denotes a very long character string whose maximum length depends on the database engine. Typically, database engines do not support indexes on blob columns.
System Statement Table (Jena_Sys_Stmt)
Column | Type | Description |
Subj | Varchar(n) not null | Subject of metadata statement (encoded) |
Prop | Varchar(n) not null | Predicate of metadata statement (encoded) |
Obj | Varchar(n) not null | Object of metadata statement (encoded) |
GraphId | Integer | Always zero, representing the system (meta) graph |
Indexes: non-unique index on <Subj,Prop>
The system statement table is used to store system metadata for the Jena2 storage subsystem such as configuration parameters, table names for graphs, etc. The metadata is expressed as RDF statements so this table looks very much like an asserted statement table.
Long Literals Table (Jena_Long_Lit)
Column | Type | Description |
Id | Integer not null | Identifier of long literal, referenced from the statement tables |
Head | Varchar(n) not null | First n characters of long literal (encoded) |
ChkSum | Integer | Checksum of tail of long literal |
Tail | Blob | Remainder of long literal (long literal without the head) |
Indexes: unique index on <Head,ChkSum>
Primary Key: Id
The long literals table stores literals that are too long to store directly in a statement table. Each long literal is assigned a unique integer identifier and this identifier is used to reference the literal from a statement table. To support indexing of long literals, the first n characters of the long literal are stored in the Head column. The length of the head is configurable (see IndexKeyLength in Options). The remainder of the long literal is stored as a blob in the Tail column. A checksum is computed over the tail to distinguish long literals with the same Head.
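The head, checksum and tail split can be sketched as below. The function name and the `index_key_length` parameter are hypothetical stand-ins for the IndexKeyLength option, and computing the CRC32 over the UTF-8 bytes of the tail is an assumption; the text only states that the checksum covers the tail.

```python
import zlib


def split_long_value(encoded, index_key_length):
    """Split an encoded long value into (head, checksum, tail)."""
    head = encoded[:index_key_length]  # indexed part, stored as Varchar
    tail = encoded[index_key_length:]  # unindexed remainder, stored as Blob
    chksum = zlib.crc32(tail.encode("utf-8"))  # assumption: CRC32 of UTF-8 bytes
    return head, chksum, tail


# Two values that share a head but differ in the tail get distinct
# (head, checksum) pairs, which is what the unique index relies on.
first = split_long_value("abcdefghij" + "x", 10)
second = split_long_value("abcdefghij" + "y", 10)
```

Because the tail itself is a blob and normally cannot be indexed, the (Head, ChkSum) pair serves as the searchable key; a lookup would still compare the tail to rule out checksum collisions.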
Long Resources Table (Jena_Long_URI)
Column | Type | Description |
Id | Integer not null | Identifier of long URI, referenced from the statement tables |
Head | Varchar(n) not null | First n characters of long URI (encoded) |
ChkSum | Integer | Checksum of tail of long URI |
Tail | Blob | Remainder of long URI (long URI without the head) |
Indexes: unique index on <Head,ChkSum>
Primary Key: Id
The long resources table stores URIs that are too long to store directly in a statement table. Each long URI is assigned a unique integer identifier and this identifier is used to reference the URI from a statement table. The columns of the long resources table are similar to those in the long literals table. However, the encoding of URIs is different from that of long literals. In particular, URIs may have prefixes. See the prefixes table and the encoding discussion, below.
Prefixes Table (Jena_Prefix)
Column | Type | Description |
Id | Integer not null | Identifier of prefix, referenced from the statement tables |
Head | Varchar(n) not null | First n characters of prefix (encoded) |
ChkSum | Integer | Checksum of tail of long prefix |
Tail | Blob | Remainder of long prefix (long prefix without the head) |
Indexes: unique index on <Head,ChkSum>
Primary Key: Id
URIs often have common prefixes (e.g., http://www.example.org/prop1, http://www.example.org/prop2). There can be substantial space savings if prefixes are stored just once. Jena2 will optionally store common URI prefixes in the prefixes table (see DoCompressURI in Options). This table is structurally similar to the long literals and long resources tables.
Graph Table (Jena_Graph)
Column | Type | Description |
Id | Integer not null | Unique identifier for graph |
Name | Blob | Graph name |
Primary Key: Id
The graph table stores the name and unique identifier for each user graph. An unnamed graph appears under the name DEFAULT.
Lock Table (Jena_Mutex)
Column | Type | Description |
Dummy | Integer | Unused |
The lock table is used internally by Jena2 to implement a critical section. It has no meaningful content. If the table exists, the database is locked for an internal critical section, e.g., creating a model. Normally, the lock is released at the end of the critical section by deleting the table. But if a Jena application fails while in a critical section, i.e., while holding the lock, users may have to release the lock manually, either by deleting the table or by calling DriverRDB.unlockDB().
The following describes the encoding used for literals, URIs and other values in the statement tables and the system tables. Square brackets delimit optional components of the encoding. Italics are used for variable values.
Literal Encoding in Statement Tables
Short Literal. Lv:[langLen]:[datatypeLen]:[langString][datatypeString]value[:]
Long Literal. Lr:dbid
Literal Encoding in Long Literal Table
Literal. Lv:[langLen]:[datatypeLen]:[langString][datatypeString]head[:] hash tail
Legend
L indicates a literal
v indicates a value
r indicates a reference to another table
: is used as a delimiter. Note that MySQL trims trailing white space for certain Varchar columns so an extra delimiter is appended when necessary for those columns.
dbid is an (integer) identifier of an entry in the long literals table
langLen is the length of the language identifier for the literal
langString is the language identifier
datatypeLen is the length of the datatype for the literal
datatypeString is the datatype for the literal
value is the lexical form of the literal
head is the initial substring of the literal that is indexed
hash is the CRC32 hash value for the tail
tail is the remainder of the literal that is not indexed
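A minimal sketch of the short literal encoding follows, together with a decoder for the unpadded form. The function names and the `pad` flag are hypothetical, and writing an omitted length as an empty field (rather than 0) is an assumption drawn from the square brackets in the format.

```python
def encode_short_literal(value, lang="", datatype="", pad=False):
    """Build Lv:[langLen]:[datatypeLen]:[langString][datatypeString]value[:]."""
    lang_len = str(len(lang)) if lang else ""  # assumption: omitted means empty
    dt_len = str(len(datatype)) if datatype else ""
    encoded = "Lv:%s:%s:%s%s%s" % (lang_len, dt_len, lang, datatype, value)
    # pad=True appends the extra delimiter used to protect trailing white space
    return encoded + ":" if pad else encoded


def decode_short_literal(encoded):
    """Decode an unpadded short literal into (value, lang, datatype)."""
    tag, lang_len, rest = encoded.split(":", 2)
    if tag != "Lv":
        raise ValueError("not a short literal: %r" % encoded)
    dt_len, body = rest.split(":", 1)
    n = int(lang_len) if lang_len else 0
    m = int(dt_len) if dt_len else 0
    # The two lengths tell the decoder where lang and datatype end,
    # so the value itself may safely contain ':' characters.
    return body[n + m:], body[:n], body[n:n + m]


french = encode_short_literal("chat", lang="fr")
typed = encode_short_literal("42", datatype="http://www.w3.org/2001/XMLSchema#int")
```

The length fields are what make the concatenated langString, datatypeString and value separable without any further delimiters.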
URI Encoding in Statement Tables
Short URI. Uv:[pfx_dbid]:URI[:]
Long URI. Ur:[pfx_dbid]:dbid
URI Encoding in Long URI Table
URI. Uv:head[:] hash tail
Legend
U indicates a URI
pfx_dbid is an (integer) identifier of an entry in the prefixes table. If the prefix is too short, i.e., the length of the prefix is less than URI_COMPRESS_LENGTH (see Options), the URI is not compressed and pfx_dbid is null.
URI is the complete URI
other notation same as for literal encoding
Blank Node Encoding in Statement Tables
Short Blank Node. Bv:[pfx_dbid]:bnid[:]
Long Blank Node. Br:[pfx_dbid]:dbid
Blank Node Encoding in Long URI Table
Blank Node. Bv:head[:] hash tail
Legend
B indicates a blank node
bnid is the blank node identifier
other notation same as above
Note: currently, blank nodes are always stored uncompressed (pfx_dbid is null).
Variable Node Encoding in Statement Tables
Variable Node. Vv:name
Legend
V indicates a variable node
v indicates a value
name is the variable name
Note: the length must be less than LONG_OBJECT_LENGTH
ANY Node Encoding in Statement Tables
ANY Node. Av:
Prefix Encoding in Prefix Table
Prefix. Pv:val[:] [hash] [tail]
Legend
P indicates a prefix
other notation same as above
hash and tail are only required for long prefixes.
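The two-character tags used by all of the encodings above can be read with a small lookup. This is an illustrative sketch only; the function and dictionary names are hypothetical, and the payload after the tag is returned unparsed.

```python
NODE_KINDS = {
    "L": "literal",
    "U": "URI",
    "B": "blank node",
    "V": "variable",
    "A": "ANY",
    "P": "prefix",
}
STORAGE = {"v": "value", "r": "reference"}


def classify(encoded):
    """Return (node kind, storage kind, payload) for an encoded column value."""
    if len(encoded) < 3 or encoded[2] != ":":
        raise ValueError("malformed encoding: %r" % encoded)
    kind = NODE_KINDS.get(encoded[0])
    storage = STORAGE.get(encoded[1])
    if kind is None or storage is None:
        raise ValueError("unknown tag: %r" % encoded[:2])
    return kind, storage, encoded[3:]


examples = [classify(s) for s in ("Lr:17", "Av:", "Vv:x")]
```

A reader of a statement table row would use the storage kind to decide whether the payload is the value itself or an identifier to resolve in one of the long tables.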