Chapter 5. Storage of Variable-Length Columns

Table of Contents

5.1. Overview
5.2. COMPACT and REDUNDANT Row Formats
5.3. DYNAMIC Row Format
5.4. Specifying a Table’s Row Format

5.1. Overview

All data in InnoDB is stored in database pages comprising a B-tree index (the so-called clustered index or primary key index). The essential idea is that the nodes of the B-tree contain, for each primary key value (whether user-specified or generated or chosen by the system), the values of the remaining columns of the row as well as the key. In some other database systems, a clustered index is called an index-organized table. Secondary indexes in InnoDB are also B-trees, containing pairs of values of the index key and the value of the primary key, which acts as a pointer to the row in the clustered index.

There is an exception to this rule. Variable-length columns (such as BLOB and VARCHAR) that are too long to fit on a B-tree page are stored on separately allocated disk (overflow) pages. We call these off-page columns. The values of such columns are stored on singly-linked lists of overflow pages, and each such column has its own list of one or more overflow pages. In some cases, all or a prefix of the long column values is stored in the B-tree, to avoid wasting storage and eliminating the need to read a separate page.

The new Barracuda file format provides a new option (KEY_BLOCK_SIZE) to control how much column data is stored in the clustered index, and how much is placed on overflow pages.

5.2. COMPACT and REDUNDANT Row Formats

Previous versions of InnoDB used an unnamed file format (now called Antelope) for database files. With that format, tables were defined with ROW_FORMAT=COMPACT (or ROW_FORMAT=REDUNDANT) and InnoDB stored up to the first 768 bytes of variable-length columns (such as BLOB and VARCHAR) in the index record within the B-tree node, with the remainder stored on the overflow page(s).

To preserve compatibility with those prior versions, tables created with the InnoDB Plugin use the prefix format, unless one of ROW_FORMAT=DYNAMIC or ROW_FORMAT=COMPRESSED is specified (or implied) on the CREATE TABLE command.

With the Antelope file format, if the value of a column is not longer than 768 bytes, no overflow page is needed, and some savings in I/O may result, since the value is in the B-tree node. This works well for relatively short BLOBs, but may cause B-tree nodes to fill with data rather than key values, thereby reducing their efficiency. Tables with many BLOB columns could cause B-tree nodes to become too full of data, and contain too few rows, making the entire index less efficient than if the rows were shorter or if the column values were stored off-page.

5.3. DYNAMIC Row Format

When innodb_file_format is set to Barracuda and a table is created with ROW_FORMAT=DYNAMIC or ROW_FORMAT=COMPRESSED, long column values are stored fully off-page, and the clustered index record contains only a 20-byte pointer to the overflow page.

Whether any columns are stored off-page depends on the page size and the total size of the row. When the row is too long, InnoDB chooses the longest columns for off-page storage until the clustered index record fits on the B-tree page.

The DYNAMIC row format maintains the efficiency of storing the entire row in the index node if it fits (as do the COMPACT and REDUNDANT formats), but this new format avoids the problem of filling B-tree nodes with a large number of data bytes of long columns. The DYNAMIC format is predicated on the idea that if a portion of a long data value is stored off-page, it is usually most efficient to store all of the value off-page. With DYNAMIC format, shorter columns are likely to remain in the B-tree node, minimizing the number of overflow pages needed for any given row.

5.4. Specifying a Table’s Row Format

The row format used for a table is specified with the ROW_FORMAT clause of the CREATE TABLE and ALTER TABLE commands. Note that COMPRESSED format implies DYNAMIC format. See Section 3.2, “Enabling Compression for a Table” for more details on the relationship between this clause and other clauses of these commands.