Chapter 8. Backing up and Restoring Berkeley DB Java Edition Applications

Table of Contents

Databases and Log Files
  Log File Overview
  Cleaning the Log Files
  The BTree
  Database Modifications
  Syncs
  Normal Recovery
  Checkpoints
Performing Backups
  Performing a Partial Backup
  Performing a Complete Backup
Performing Catastrophic Recovery
Hot Standby

Fundamentally, you back up your databases by copying JE log files to a safe storage location. To restore your database from a backup, you copy those files to an appropriate directory on disk and reopen your JE application.
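As a rough illustration of the copy step, the following sketch copies every file ending in .jdb from one directory to another using only the Java standard library. The class name LogFileCopier and the method copyLogFiles are hypothetical and are not part of the JE API. The sketch also assumes that nothing is writing to the source directory while the copy runs.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not part of JE): copies *.jdb files between directories. */
final class LogFileCopier {
    private LogFileCopier() {}

    /**
     * Copies all files ending in ".jdb" from envDir to backupDir and returns
     * the number of files copied. Assumption: nothing writes to envDir while
     * this method runs.
     */
    static int copyLogFiles(Path envDir, Path backupDir) throws IOException {
        Files.createDirectories(backupDir);
        int copied = 0;
        try (DirectoryStream<Path> logs = Files.newDirectoryStream(envDir, "*.jdb")) {
            for (Path log : logs) {
                // Overwrite any older copy of the same log file in the backup.
                Files.copy(log, backupDir.resolve(log.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
                copied++;
            }
        }
        return copied;
    }
}
```

Under the same assumptions, a restore is the same copy in the opposite direction, followed by reopening the JE application.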

Beyond these simple activities, there are several backup strategies that you may want to consider. These strategies are described in this chapter.

Databases and Log Files

Before describing JE backup and restore, it is necessary to describe some of JE's internal workings. In particular, a high-level understanding of JE log files and the in-memory cache is required. You also need to understand a little about how JE uses its internal data structures in order to understand why checkpoints and/or syncs are required.

If you are an impatient reader, then you can skip this section so long as you understand that:

  • JE databases are stored in log files contained in your environment directory.

  • Every time a JE environment is opened, normal recovery is run.

  • For transactional applications, checkpoints should be run in order to bound normal recovery time. Checkpoints are normally run by the checkpointer thread. See The Checkpointer Thread for information on managing this thread.

  • For non-transactional applications, environment syncs must be performed if you want to guarantee the persistence of your database modifications. Environment syncs are manually performed by the application developer. See Data Persistence for details.

Log File Overview

Your JE database is stored on disk in a series of log files. JE uses no-overwrite log files, which is to say that JE only ever appends data to the end of a log file. It never deletes or modifies an existing log file record.

JE log files are named NNNNNNNN.jdb where NNNNNNNN is an 8-digit hexadecimal number that increases by 1 (starting from 00000000) for each log file written to disk.
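To make the naming scheme concrete, here is a minimal sketch that converts between a file number and a log file name. The class LogFileNames is hypothetical and is not part of the JE API. The use of lowercase hexadecimal digits is an assumption, because the text above does not specify the case.

```java
/** Hypothetical helper (not part of JE) for the NNNNNNNN.jdb naming scheme. */
final class LogFileNames {
    private LogFileNames() {}

    /** Formats a file number as 8 hexadecimal digits plus ".jdb" (lowercase is an assumption). */
    static String nameFor(long fileNumber) {
        return String.format("%08x.jdb", fileNumber);
    }

    /** Parses the leading 8 hexadecimal digits of a log file name back into a number. */
    static long numberOf(String fileName) {
        return Long.parseLong(fileName.substring(0, 8), 16);
    }
}
```

For example, nameFor(0) yields 00000000.jdb and nameFor(26) yields 0000001a.jdb.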

JE creates a new log file whenever the current log file has reached a pre-configured size (10000000 bytes by default). This size is controlled by the je.log.fileMax property. See The JE Properties File for information on setting JE properties.
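The following sketch shows one way to read this setting from properties-style text with the standard java.util.Properties class, falling back to the default stated above. The class LogFileMaxSetting is hypothetical. JE reads its own properties itself, so this is only an illustration of the key and its default, not of JE's configuration code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper (not part of JE): looks up je.log.fileMax in properties text. */
final class LogFileMaxSetting {
    /** Default maximum log file size in bytes, as stated in the text above. */
    static final long DEFAULT_FILE_MAX = 10_000_000L;

    private LogFileMaxSetting() {}

    /** Returns the configured je.log.fileMax value, or the default if the key is absent. */
    static long fileMax(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty("je.log.fileMax");
        return value == null ? DEFAULT_FILE_MAX : Long.parseLong(value.trim());
    }
}
```

With this sketch, properties text containing the line je.log.fileMax=20000000 yields 20000000, and text without that key yields the default.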

Cleaning the Log Files

Because JE uses no-overwrite log files, the logs must be compacted or cleaned so as to conserve disk space.

JE uses the cleaner background thread to perform this task. When it runs, the cleaner thread picks a log file (generally the earliest active one) and scans each log record in it. If the record is no longer active in the database tree, the cleaner does nothing. If the record is still active in the tree, then the cleaner copies the record forward to a newer log file.

Once a log file is no longer needed (that is, it no longer contains active records), then the cleaner thread deletes the log file for you. Or, optionally, the cleaner thread can simply rename the discarded log file with a del suffix.
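The cleaning steps described above can be pictured with a toy model. The sketch below is not JE's cleaner: the class ToyCleaner, its Rec type, and the in-memory map standing in for log files are all hypothetical. It only shows the idea of copying active records forward to the newest file and then discarding the old file.

```java
import java.util.List;
import java.util.TreeMap;

/** Toy model of log cleaning (hypothetical; not JE's implementation). */
final class ToyCleaner {
    /** A log record: its key and whether the database tree still refers to it. */
    static final class Rec {
        final String key;
        final boolean active;

        Rec(String key, boolean active) {
            this.key = key;
            this.active = active;
        }
    }

    private ToyCleaner() {}

    /**
     * Cleans the lowest-numbered file in the map (file number -> records).
     * Active records are appended to the highest-numbered file; obsolete
     * records are skipped. The cleaned file is then removed.
     * Returns the cleaned file number, or -1 if fewer than two files exist.
     */
    static int cleanOldest(TreeMap<Integer, List<Rec>> files) {
        if (files.size() < 2) {
            return -1; // no newer file to copy records into
        }
        int oldest = files.firstKey();
        List<Rec> newest = files.get(files.lastKey());
        for (Rec rec : files.get(oldest)) {
            if (rec.active) {
                newest.add(rec); // copy the still-active record forward
            }
        }
        files.remove(oldest); // the old file no longer holds active records
        return oldest;
    }
}
```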

JE uses a minimum log utilization property to determine how much cleaning to perform. The log files contain both obsolete and utilized records. Obsolete records are records that are no longer in use, either because they have been modified or because they have been deleted. Utilized records are those records that are currently in use. The je.cleaner.minUtilization property identifies the minimum percentage of log space that must be used by utilized records. If this minimum percentage is not met, then log files are cleaned, and their obsolete records discarded, until the minimum percentage is met.
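The utilization check amounts to simple arithmetic, sketched below. The class LogUtilization and its methods are hypothetical, and JE's own accounting of utilized and obsolete bytes is more involved than this. The threshold is passed in as a parameter rather than assumed.

```java
/** Hypothetical helper (not part of JE) illustrating the utilization check. */
final class LogUtilization {
    private LogUtilization() {}

    /** Percentage of total log bytes occupied by utilized (in-use) records. */
    static int utilizationPercent(long utilizedBytes, long totalBytes) {
        if (totalBytes <= 0) {
            return 100; // an empty log has nothing to clean
        }
        return (int) (utilizedBytes * 100 / totalBytes);
    }

    /** True if utilization is below the configured minimum, so cleaning is warranted. */
    static boolean needsCleaning(long utilizedBytes, long totalBytes, int minUtilizationPercent) {
        return utilizationPercent(utilizedBytes, totalBytes) < minUtilizationPercent;
    }
}
```

For instance, with 40 utilized bytes out of 100 and an illustrative minimum of 50 percent, needsCleaning returns true. With 60 utilized bytes out of 100 it returns false.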

For information on managing the cleaner thread, see The Cleaner Thread.

The BTree

JE databases are internally organized as a BTree. In order to operate, JE requires the complete BTree be available to it.

When database records are created, modified, or deleted, the modifications are represented in the BTree's leaf nodes. Beyond leaf node changes, database record modifications can also cause changes to other BTree nodes and structures.

Database Modifications

When a write operation is performed in JE, the modified data is written to leaf nodes contained in the in-memory cache. If your JE writes are performed without transactions, then the in-memory cache is the only location guaranteed to receive a database modification without further intervention on the part of the application developer.

If your writes are transactionally protected, then every time a transaction is committed the leaf nodes (and only the leaf nodes) modified by that transaction are written to the JE log files on disk.

Syncs

As stated above, database modifications performed without a transaction are only guaranteed to exist in the in-memory cache. For some classes of applications, this is ideal. By not writing these modifications to the on-disk logs, the application can avoid most of the overhead caused by disk I/O.

However, if the application requires its data to persist across process runs, then the developer must manually sync database modifications to the on-disk log files (again, this is only necessary for non-transactional applications). This is done using Environment.sync().

Note that syncing the cache causes JE to write all modified objects in the cache to disk. This is probably the most expensive operation that you can perform in JE. Even so, if your application requires database data to be persistent across application runs, then the cache must be synced at least before the environment is closed.
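The relationship between the cache and the on-disk logs can be illustrated with a toy write-back cache. This sketch does not use the JE API and does not reflect how JE implements Environment.sync(). The class ToyCache is hypothetical, and an in-memory map stands in for the log files.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy write-back cache (hypothetical; not JE's implementation). */
final class ToyCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Set<String> dirty = new HashSet<>();
    private final Map<String, String> disk = new HashMap<>(); // stands in for the log files

    /** A non-transactional write: only the in-memory cache is updated. */
    void put(String key, String value) {
        cache.put(key, value);
        dirty.add(key);
    }

    /** Writes every modified entry to "disk" and returns how many were written. */
    int sync() {
        int written = 0;
        for (String key : dirty) {
            disk.put(key, cache.get(key));
            written++;
        }
        dirty.clear();
        return written;
    }

    /** Returns what would survive a restart, or null if the key was never synced. */
    String onDisk(String key) {
        return disk.get(key);
    }
}
```

In this model, a value written with put() is absent from the simulated disk until sync() is called. This mirrors why a non-transactional application must sync before it closes the environment.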

Normal Recovery

Because of the way that JE organizes and manages its BTrees, all it needs in order to recreate the rest of the BTree is the leaf nodes. Essentially, this is what normal recovery does: it recreates any missing parts of the internal BTree from the leaf node information stored in the log files.

Checkpoints

Recreating the BTree (that is, running normal recovery) can become expensive if, over time, all that is ever written to disk is BTree leaf nodes. So in order to limit the time required for normal recovery, JE runs checkpoints. Checkpoints write to your log files all the internal BTree nodes and structures modified as a part of transactional operations. As a result, your log files contain a complete BTree up to the moment in time when the checkpoint was run, and normal recovery only needs to recreate the portion of the BTree that has been modified since the last checkpoint.
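To see why a checkpoint bounds recovery work, consider a toy model in which the log is a list of entries and a marker string represents a completed checkpoint. The class ToyRecovery and the marker value are hypothetical. Real JE log entries and recovery are considerably more complex than this.

```java
import java.util.List;

/** Toy model of checkpoint-bounded recovery (hypothetical; not JE's implementation). */
final class ToyRecovery {
    /** Marker standing in for a completed checkpoint in the toy log. */
    static final String CHECKPOINT = "CKPT";

    private ToyRecovery() {}

    /**
     * Returns how many log entries recovery must replay: only those written
     * after the last checkpoint marker, or all of them if there is none.
     */
    static int entriesToReplay(List<String> log) {
        int lastCheckpoint = log.lastIndexOf(CHECKPOINT); // -1 when no checkpoint exists
        return log.size() - (lastCheckpoint + 1);
    }
}
```

In this model, a log of three leaf entries with no checkpoint requires replaying all three. If a checkpoint marker precedes the final entry, only that one entry needs to be replayed.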

Checkpoints write more information to disk than do transaction commits, and so they are more expensive from a disk I/O perspective. Therefore, one of the performance tuning activities that you should perform is to determine how frequently to run checkpoints. You have to balance the cost of the checkpoints against the time it will take your application to restart due to the cost of running normal recovery.

Checkpoints are normally performed by the checkpointer background thread. See The Checkpointer Thread for information on managing this thread.