Chapter 1. OVERVIEW & CONCEPTS

Table of Contents

1.1. Introduction
1.1.1. The NDB API
1.1.2. The MGM API
1.2. Terminology
1.3. The NDB Cluster Transaction and Scanning API
1.3.1. Core NDB API Classes
1.3.2. Application Program Basics
1.3.3. Review of MySQL Cluster Concepts
1.3.4. The Adaptive Send Algorithm

Abstract

This chapter provides a general overview of essential MySQL Cluster, NDB API, and MGM API concepts, terminology, and programming constructs.

1.1. Introduction

Abstract

This section introduces the NDB Transaction and Scanning APIs as well as the NDB Management (MGM) API for use in building applications to run on MySQL Cluster. It also discusses the general theory and principles involved in developing such applications.

1.1.1. The NDB API

The NDB API is an object-oriented application programming interface for MySQL Cluster that implements indexes, scans, transactions, and event handling. NDB transactions are ACID-compliant in that they provide a means to group together operations in such a way that they succeed (commit) or fail as a unit (rollback). It is also possible to perform operations in a "no-commit" or deferred mode, to be committed at a later time.

NDB scans are conceptually rather similar to the SQL cursors implemented in MySQL 5.0 and other common enterprise-level database management systems. These allow for high-speed row processing for record retrieval purposes. (MySQL Cluster naturally supports set processing just as does MySQL in its non-Cluster distributions. This can be accomplished via the usual MySQL APIs discussed in the MySQL Manual and elsewhere.) The NDB API supports both table scans and row scans; the latter can be performed using either unique or ordered indexes. Event detection and handling is discussed in Section 3.5, “The NdbEventOperation Class”, as well as Section 6.5, “NDB API Event Handling Example”.

In addition, the NDB API provides object-oriented error-handling facilities in order to provide a means of recovering gracefully from failed operations and other problems. See Section 6.2, “Handling Errors and Retrying Transactions”, for a detailed example.

The NDB API provides a number of classes implementing the functionality described above. The most important of these include the Ndb, Ndb_cluster_connection, NdbTransaction, and NdbOperation classes. These model (respectively) database connections, cluster connections, transactions, and operations. These classes and their subclasses are listed in Chapter 3, NDB API CLASSES. Error conditions in the NDB API are handled using NdbError, a structure which is described in Section 4.1, “The NdbError Structure”.

1.1.2. The MGM API

The MySQL Cluster Management API, also known as the MGM API, is a C-language programming interface intended to provide administrative services for the cluster. These include starting and stopping Cluster nodes, handling Cluster logging, backups, and restoration from backups, as well as various other management tasks. A conceptual overview of MGM and its uses can be found in Chapter 5, THE MGM API.

The MGM API's principal structures model the states of individual nodes (ndb_mgm_node_state), the state of the Cluster as a whole (ndb_mgm_cluster_state), and management server response messages (ndb_mgm_reply). See Section 5.4, “MGM Structures”, for detailed descriptions of these.

1.2. Terminology

Abstract

Provides a glossary of terms which are unique to the NDB and MGM APIs, or have a specialised meaning when applied therein.

The following terms are useful to an understanding of MySQL Cluster and the NDB API; some have a specialised meaning when used in one of these contexts. In addition, you may find the MySQL Manual's Cluster Glossary to be useful as well.

  • Backup: A complete copy of all cluster data, transactions and logs, saved to disk.

  • Restore: Returning the cluster to a previous state as stored in a backup.

  • Checkpoint: Generally speaking, when data is saved to disk, it is said that a checkpoint has been reached. When working with the NDB storage engine, there are two sorts of checkpoints which work together in order to ensure that a consistent view of the cluster's data is maintained:

    • Local Checkpoint (LCP): This is a checkpoint that is specific to a single node; however, LCPs take place for all nodes in the cluster more or less concurrently. An LCP involves saving all of a node's data to disk, and so usually occurs every few minutes, depending upon the amount of data stored by the node.

      More detailed information about LCPs and their behaviour can be found in the MySQL Manual, in the sections Defining MySQL Cluster Data Nodes and Configuring Parameters for Local Checkpoints.

    • Global Checkpoint (GCP): A GCP occurs every few seconds, when transactions for all nodes are synchronised and the REDO log is flushed to disk.

      A related term is GCI, which stands for “Global Check ID”. This marks the point in the REDO log where a GCP took place.

  • Node: A component of MySQL Cluster. Three node types are supported:

    • Management (MGM) nodes

    • SQL nodes (also known as API nodes, particularly when speaking of applications other than mysqld that access cluster data)

    • Data nodes (sometimes referred to as “storage nodes”, although this usage is now discouraged)

    For more information about these node types, please refer to Section 1.3.3, “Review of MySQL Cluster Concepts”, or to Cluster Process Management in the MySQL Manual.

  • Node Failure: MySQL Cluster is not solely dependent upon the functioning of any single node making up the cluster; the cluster can continue to run even when one node fails.

  • Node Restart: The process of restarting a cluster node which has stopped on its own or been stopped deliberately. This can be done for several different reasons, including the following:

    • Restarting a node which has shut down on its own (when this has occurred, it is known as forced shutdown or node failure; the other cases discussed here involve manually shutting down the node and restarting it)

    • To update the node's configuration

    • As part of a software or hardware upgrade

    • In order to defragment the node's DataMemory

  • Initial Node Restart: The process of starting a cluster node with its filesystem removed. This is sometimes used in the course of software upgrades and in other special circumstances.

  • System Crash (or System Failure): This can occur when so many cluster nodes have failed that the cluster's state can no longer be guaranteed.

  • System Restart: The process of restarting the cluster and reinitialising its state from disk logs and checkpoints. This is required after either a planned or an unplanned shutdown of the cluster.

  • Fragment: Contains a portion of a database table; in other words, in the NDB storage engine, a table is broken up into and stored as a number of subsets, usually referred to as fragments. A fragment is sometimes also called a partition.

  • Replica: Under the NDB storage engine, each table fragment has a number of replicas in order to provide redundancy.

  • Transporter: A protocol providing data transfer across a network. The NDB API supports four different types of transporter connections: TCP/IP (local), TCP/IP (remote), SCI, and SHM. TCP/IP is, of course, the familiar network protocol that underlies HTTP, FTP, and so forth, on the Internet. SCI (Scalable Coherent Interface) is a high-speed protocol used in building multiprocessor systems and parallel-processing applications. SHM stands for Unix-style shared memory segments. For an informal introduction to SCI, see the essay at dolphinics.com.

  • NDB: This originally stood for “Network Database”. It now refers to the storage engine used by MySQL AB to enable its MySQL Cluster distributed database.

  • ACC: Access Manager. Handles hash indexes of primary keys providing speedy access to the records.

  • TUP: Tuple Manager. This handles storage of tuples (records) and contains the filtering engine used to filter out records and attributes when performing reads and/or updates.

  • TC: Transaction Coordinator. Handles co-ordination of transactions and timeouts; serves as the interface to the NDB API for indexes and scan operations.

1.3. The NDB Cluster Transaction and Scanning API

Abstract

This section defines and discusses the high-level architecture of the NDB API, and introduces the NDB classes which are of greatest use and interest to the developer. It also covers most important NDB API concepts, including a review of MySQL Cluster Concepts.

1.3.1. Core NDB API Classes

The NDB API is a MySQL Cluster application interface that implements transactions. It consists of the following fundamental classes:

  • Ndb_cluster_connection, which represents a connection to a cluster.

  • Ndb, which represents a connection to a database.

  • NdbTransaction, which represents a transaction.

  • NdbOperation, which represents an operation using a primary key; its subclasses NdbScanOperation, NdbIndexOperation, and NdbIndexScanOperation represent scans and index-based operations.

In addition, the NDB API defines an NdbError structure, which contains the specification for an error.

It is also possible to receive events triggered when data in the database is changed. This is accomplished through the NdbEventOperation class.

Important

The NDB event notification API is not supported prior to MySQL 5.1. (Bug#19719)

For more information about these classes as well as some additional auxiliary classes not listed here, see Chapter 3, NDB API CLASSES.

1.3.2. Application Program Basics

The main structure of an application program is as follows:

  1. Connect to a cluster using the Ndb_cluster_connection object.

  2. Initiate a database connection by constructing and initialising one or more Ndb objects.

  3. Identify the tables, columns, and indexes on which you wish to operate, using NdbDictionary and one or more of its subclasses.

  4. Define and execute transactions using the NdbTransaction class.

  5. Delete Ndb objects.

  6. Terminate the connection to the cluster (terminate an instance of Ndb_cluster_connection).
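The steps just listed can be sketched in outline as follows. This is only a sketch: the connect string, database name, and table and column names are placeholders, and error handling is reduced to early returns.

#include <NdbApi.hpp>

int main()
{
  ndb_init();  // must be called once before any other NDB API function

  // 1. Connect to the cluster (the connect string is a placeholder)
  Ndb_cluster_connection connection("localhost:1186");
  if (connection.connect(4 /* retries */, 5 /* delay */, 1 /* verbose */) != 0)
    return 1;
  if (connection.wait_until_ready(30, 0) < 0)
    return 1;

  // 2. Initiate a database connection by constructing and initialising
  //    an Ndb object
  Ndb ndb(&connection, "TEST_DB");
  if (ndb.init() != 0)
    return 1;

  // 3. Identify the table to operate on using the NdbDictionary interface
  NdbDictionary::Dictionary *dict = ndb.getDictionary();
  const NdbDictionary::Table *table = dict->getTable("MYTABLENAME");
  if (table == NULL)
    return 1;

  // 4. Define and execute a transaction reading a single row
  NdbTransaction *trans = ndb.startTransaction();
  NdbOperation *op = trans->getNdbOperation(table);
  op->readTuple(NdbOperation::LM_Read);
  op->equal("ATTR1", 1);
  NdbRecAttr *recAttr = op->getValue("ATTR2", NULL);
  trans->execute(NdbTransaction::Commit);
  // recAttr->u_32_value() would now yield the value read
  ndb.closeTransaction(trans);

  // 5-6. The Ndb object and the cluster connection are destroyed in
  //      reverse order of construction when they go out of scope
  ndb_end(0);
  return 0;
}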

1.3.2.1. Using Transactions

The procedure for using transactions is as follows:

  1. Start a transaction (instantiate an NdbTransaction object).

  2. Add and define operations associated with the transaction using instances of one or more of the NdbOperation, NdbScanOperation, NdbIndexOperation, and NdbIndexScanOperation classes.

  3. Execute the transaction (call NdbTransaction::execute()).

  4. The operation can be of two different types — Commit or NoCommit:

    • If the operation is of type NoCommit, then the application program requests that the operation portion of a transaction be executed, but without actually committing the transaction. Following the execution of a NoCommit operation, the program can continue to define additional transaction operations for later execution.

      NoCommit operations can also be rolled back by the application.

    • If the operation is of type Commit, then the transaction is immediately committed. The transaction must be closed after it has been committed (even if the commit fails), and no further operations can be added to or defined for this transaction.

    See Section 3.9.1.3, “The NdbTransaction::ExecType Type”.
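The difference between the two types can be sketched as follows. This assumes a transaction trans already started from an Ndb object myNdb and a table object table; the column names and values are placeholders, and error handling is omitted.

// Execute the first operation without committing. Locks are held, and
// further operations can still be added to the same transaction.
NdbOperation *op1 = trans->getNdbOperation(table);
op1->updateTuple();
op1->equal("ATTR1", 1);
op1->setValue("ATTR2", 10);
trans->execute(NdbTransaction::NoCommit);

// Define a second operation on the same transaction...
NdbOperation *op2 = trans->getNdbOperation(table);
op2->updateTuple();
op2->equal("ATTR1", 2);
op2->setValue("ATTR2", 20);

// ...and commit both operations as a unit. The transaction must then
// be closed, and no further operations may be added to it.
trans->execute(NdbTransaction::Commit);
myNdb->closeTransaction(trans);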

1.3.2.2. Synchronous Transactions

Synchronous transactions are defined and executed as follows:

  1. Begin (create) the transaction, which is referenced by an NdbTransaction object typically created using Ndb::startTransaction(). At this point, the transaction is merely being defined; it is not yet sent to the NDB kernel.

  2. Define operations and add them to the transaction, using one or more of the following:

    • NdbTransaction::getNdbOperation()

    • NdbTransaction::getNdbScanOperation()

    • NdbTransaction::getNdbIndexOperation()

    • NdbTransaction::getNdbIndexScanOperation()

    along with the appropriate methods of the respective NdbOperation class (or possibly one or more of its subclasses). Note that, at this point, the transaction still has not been sent to the NDB kernel.

  3. Execute the transaction, using the NdbTransaction::execute() method.

  4. Close the transaction by calling Ndb::closeTransaction().

For an example of this process, see Section 6.1, “Using Synchronous Transactions”.

To execute several synchronous transactions in parallel, you can either use multiple Ndb objects in several threads, or start multiple application programs.

1.3.2.3. Operations

An NdbTransaction consists of a list of operations, each of which is represented by an instance of NdbOperation, NdbScanOperation, NdbIndexOperation, or NdbIndexScanOperation (that is, of NdbOperation or one of its child classes).

1.3.2.3.1. Single-row operations

After the operation is created using NdbTransaction::getNdbOperation() or NdbTransaction::getNdbIndexOperation(), it is defined in the following three steps:

  1. Specify the standard operation type using NdbOperation::readTuple().

  2. Specify search conditions using NdbOperation::equal().

  3. Specify attribute actions using NdbOperation::getValue().

Here are two brief examples illustrating this process. For the sake of brevity, we omit error handling.

This first example uses an NdbOperation:

// 1. Retrieve table object
myTable= myDict->getTable("MYTABLENAME");

// 2. Create an NdbOperation on this table
myOperation= myTransaction->getNdbOperation(myTable);

// 3. Define the operation's type and lock mode
myOperation->readTuple(NdbOperation::LM_Read);

// 4. Specify search conditions
myOperation->equal("ATTR1", i);

// 5. Perform attribute retrieval
myRecAttr= myOperation->getValue("ATTR2", NULL);

For additional examples of this sort, see Section 6.1, “Using Synchronous Transactions”.

The second example uses an NdbIndexOperation:

// 1. Retrieve index object
myIndex= myDict->getIndex("MYINDEX", "MYTABLENAME");

// 2. Create an NdbIndexOperation on this index
myOperation= myTransaction->getNdbIndexOperation(myIndex);

// 3. Define type of operation and lock mode
myOperation->readTuple(NdbOperation::LM_Read);

// 4. Specify search conditions
myOperation->equal("ATTR1", i);

// 5. Perform attribute retrieval
myRecAttr = myOperation->getValue("ATTR2", NULL);

Another example of this second type can be found in Section 6.4, “Using Secondary Indexes in Scans”.

We now discuss in somewhat greater detail each step involved in the creation and use of synchronous transactions.

  1. Define single row operation type.  The following operation types are supported:

    • NdbOperation::insertTuple(): Inserts a non-existing tuple.

    • NdbOperation::writeTuple(): Updates a tuple if one exists, otherwise inserts a new tuple.

    • NdbOperation::updateTuple(): Updates an existing tuple.

    • NdbOperation::deleteTuple(): Deletes an existing tuple.

    • NdbOperation::readTuple(): Reads an existing tuple using the specified lock mode.

    All of these operations operate on the unique tuple key. When NdbIndexOperation is used, then each of these operations operates on a defined unique hash index.

    Note

    If you want to define multiple operations within the same transaction, then you need to call NdbTransaction::getNdbOperation() or NdbTransaction::getNdbIndexOperation() for each operation.

  2. Specify Search Conditions.  The search condition is used to select tuples. Search conditions are set using NdbOperation::equal().

  3. Specify Attribute Actions.  Next, it is necessary to determine which attributes should be read or updated. It is important to remember that:

    • Deletes can neither read nor set values, but only delete them.

    • Reads can only read values.

    • Updates can only set values. Normally the attribute is identified by name, but it is also possible to use the attribute's identity to determine the attribute.

    NdbOperation::getValue() returns an NdbRecAttr object containing the value as read. To obtain the actual value, one of two methods can be used; the application can either

    • Use its own memory, passed to NdbOperation::getValue() through the aValue pointer, or

    • receive the attribute value in an NdbRecAttr object allocated by the NDB API.

    The NdbRecAttr object is released when Ndb::closeTransaction() is called. For this reason, the application cannot reference this object following any subsequent call to Ndb::closeTransaction(). Attempting to read data from an NdbRecAttr object before calling NdbTransaction::execute() yields an undefined result.
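These two alternatives can be sketched as shown here; op and trans stand for an already-defined operation and transaction, the column names are placeholders, and error handling is omitted.

// Both getValue() calls are made while the operation is being defined,
// before the transaction is executed.

// Alternative 1: the NDB API allocates the storage; the value is read
// from the returned NdbRecAttr object after execution.
NdbRecAttr *recAttr = op->getValue("ATTR2", NULL);

// Alternative 2: the application supplies its own memory through the
// aValue pointer; the value is written directly into this buffer when
// the transaction executes.
Uint32 myBuffer;
op->getValue("ATTR3", (char *) &myBuffer);

trans->execute(NdbTransaction::Commit);

// After execution, and before Ndb::closeTransaction() is called, the
// NdbRecAttr from alternative 1 can safely be read:
Uint32 value = recAttr->u_32_value();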

1.3.2.3.2. Scan Operations

Scans are roughly the equivalent of SQL cursors, providing a means to perform high-speed row processing. A scan can be performed on either a table (using an NdbScanOperation) or an ordered index (by means of an NdbIndexScanOperation).

Scan operations have the following characteristics:

  • They can perform read operations which may be shared, exclusive, or dirty.

  • They can potentially work with multiple rows.

  • They can be used to update or delete multiple rows.

  • They can operate on several nodes in parallel.

After the operation is created using NdbTransaction::getNdbScanOperation() or NdbTransaction::getNdbIndexScanOperation(), it is carried out as follows:

  1. Define the standard operation type, using NdbScanOperation::readTuples().

  2. Specify search conditions, using NdbScanFilter, NdbIndexScanOperation::setBound(), or both.

  3. Specify attribute actions using NdbOperation::getValue().

  4. Execute the transaction using NdbTransaction::execute().

  5. Traverse the result set by means of successive calls to NdbScanOperation::nextResult().

Here are two brief examples illustrating this process. Once again, in order to keep things relatively short and simple, we forego any error handling.

This first example performs a table scan using an NdbScanOperation:

// 1. Retrieve a table object
myTable= myDict->getTable("MYTABLENAME");

// 2. Create a scan operation (NdbScanOperation) on this table
myOperation= myTransaction->getNdbScanOperation(myTable);

// 3. Define the operation's type and lock mode
myOperation->readTuples(NdbOperation::LM_Read);

// 4. Specify search conditions
NdbScanFilter sf(myOperation);
sf.begin(NdbScanFilter::OR);
sf.eq(0, i);   // Return rows with column 0 equal to i or
sf.eq(1, i+1); // column 1 equal to (i+1)
sf.end();

// 5. Retrieve attributes
myRecAttr= myOperation->getValue("ATTR2", NULL);

The second example uses an NdbIndexScanOperation to perform an index scan:

// 1. Retrieve index object
myIndex= myDict->getIndex("MYORDEREDINDEX", "MYTABLENAME");

// 2. Create an operation (NdbIndexScanOperation object)
myOperation= myTransaction->getNdbIndexScanOperation(myIndex);

// 3. Define type of operation and lock mode
myOperation->readTuples(NdbOperation::LM_Read);

// 4. Specify search conditions
// All rows with ATTR1 between i and (i+1)
myOperation->setBound("ATTR1", NdbIndexScanOperation::BoundGE, i);
myOperation->setBound("ATTR1", NdbIndexScanOperation::BoundLE, i+1);

// 5. Retrieve attributes
myRecAttr = myOperation->getValue("ATTR2", NULL);
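With either scan defined as shown above, the transaction is then executed and the result set traversed. A sketch follows (myNdb stands for the application's Ndb object; error handling is omitted):

// Start the scan. NoCommit sends the scan definition to the NDB kernel
// without committing anything.
myTransaction->execute(NdbTransaction::NoCommit);

// Fetch rows batch by batch; nextResult(true) permits new batches to be
// fetched from the data nodes.
while (myOperation->nextResult(true) == 0)
{
  // myRecAttr was bound with getValue() during scan definition and now
  // refers to the current row's value.
  Uint32 attr2 = myRecAttr->u_32_value();
  // ... process attr2 ...
}

myNdb->closeTransaction(myTransaction);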

Some additional discussion of each step required to perform a scan follows:

  1. Define Scan Operation Type.  It is important to remember that only a single operation is supported for each scan operation (NdbScanOperation::readTuples() or NdbIndexScanOperation::readTuples()).

    Note

    If you want to define multiple scan operations within the same transaction, then you need to call NdbTransaction::getNdbScanOperation() or NdbTransaction::getNdbIndexScanOperation() separately for each operation.

  2. Specify Search Conditions.  The search condition is used to select tuples. If no search condition is specified, the scan will return all rows in the table. The search condition can be an NdbScanFilter (which can be used on both NdbScanOperation and NdbIndexScanOperation) or bounds (which can be used only on index scans - see NdbIndexScanOperation::setBound()). An index scan can use both NdbScanFilter and bounds.

    Note

    When NdbScanFilter is used, each row is examined, whether or not it is actually returned. However, when using bounds, only rows within the bounds will be examined.

  3. Specify Attribute Actions.  Next, it is necessary to define which attributes should be read. As with transaction attributes, scan attributes are defined by name, but it is also possible to use the attributes' identities to define attributes as well. As discussed elsewhere in this document (see Section 1.3.2.2, “Synchronous Transactions”), the value read is returned by the NdbOperation::getValue() method as an NdbRecAttr object.

1.3.2.3.3. Using Scans to Update or Delete Rows

Scanning can also be used to update or delete rows. This is performed by

  1. Scanning with exclusive locks using NdbOperation::LM_Exclusive.

  2. (When iterating through the result set:) For each row, optionally calling either NdbScanOperation::updateCurrentTuple() or NdbScanOperation::deleteCurrentTuple().

  3. (If performing NdbScanOperation::updateCurrentTuple():) Setting new values for records simply by using NdbOperation::setValue(). NdbOperation::equal() should not be called in such cases, as the primary key is retrieved from the scan.

Important

The update or delete is not actually performed until the next call to NdbTransaction::execute() is made, just as with single row operations. NdbTransaction::execute() also must be called before any locks are released; for more information, see Section 1.3.2.3.4, “Lock Handling with Scans”.
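Putting the three steps together, a scan update might be sketched as follows (myNdb and myTable stand for the application's Ndb and table objects; the column name and value are placeholders, and error handling is omitted):

NdbTransaction *trans = myNdb->startTransaction();
NdbScanOperation *scanOp = trans->getNdbScanOperation(myTable);

// 1. Scan with exclusive locks
scanOp->readTuples(NdbOperation::LM_Exclusive);
trans->execute(NdbTransaction::NoCommit);

// 2. Iterate over the result set, batch by batch
while (scanOp->nextResult(true) == 0)
{
  do
  {
    // 3. Take over the current row for update. The primary key comes
    //    from the scan, so equal() must not be called here.
    NdbOperation *updOp = scanOp->updateCurrentTuple();
    updOp->setValue("ATTR2", 42);
  }
  while (scanOp->nextResult(false) == 0);

  // The updates defined for this batch are performed here
  trans->execute(NdbTransaction::NoCommit);
}

trans->execute(NdbTransaction::Commit);
myNdb->closeTransaction(trans);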

Features Specific to Index Scans.  When performing an index scan, it is possible to scan only a subset of a table using NdbIndexScanOperation::setBound(). In addition, result sets can be sorted in either ascending or descending order, using NdbIndexScanOperation::readTuples(). Note that rows are returned unordered by default unless sorted is set to true. It is also important to note that, when using the BoundEQ bound on a partition key, only the fragments containing matching rows are actually scanned. Finally, when performing a sorted scan, any value passed as the NdbIndexScanOperation::readTuples() method's parallel argument is ignored, and maximum parallelism is used instead; in other words, all fragments which it is possible to scan are scanned simultaneously and in parallel in such cases.

1.3.2.3.4. Lock Handling with Scans

Performing scans on either a table or an index has the potential to return a great many records; however, Ndb locks only a predetermined number of rows per fragment at a time. The number of rows locked per fragment is controlled by the batch parameter passed to NdbScanOperation::readTuples().

In order to allow the application to handle how locks are released, NdbScanOperation::nextResult() has a Boolean parameter fetchAllowed. If NdbScanOperation::nextResult() is called with fetchAllowed equal to false, then no locks may be released as a result of the function call. Otherwise, the locks for the current batch may be released.

This next example shows a scan delete that handles locks in an efficient manner. For the sake of brevity, we omit error handling.

int check;

// Outer loop for each batch of rows
while((check = MyScanOperation->nextResult(true)) == 0)
{
  do
  {
    // Inner loop for each row within the batch
    MyScanOperation->deleteCurrentTuple();
  } 
  while((check = MyScanOperation->nextResult(false)) == 0);

  // When there are no more rows in the batch, execute all defined deletes       
  MyTransaction->execute(NoCommit);
}

For a more complete example of a scan, see Section 6.3, “Basic Scanning Example”.

1.3.2.3.5. Error Handling

Errors can occur either when operations making up a transaction are being defined, or when the transaction is actually being executed. Catching and handling either sort of error requires testing the value returned by NdbTransaction::execute(), and then, if an error is indicated (that is, if this value is equal to -1), using the following two methods in order to identify the error's type and location:

  • NdbTransaction::getNdbErrorOperation() returns a reference to the operation causing the most recent error.

  • NdbTransaction::getNdbErrorLine() yields the method number of the erroneous method in the operation, starting with 1.

This short example illustrates how to detect an error and to use these two methods to identify it:

theTransaction = theNdb->startTransaction();
theOperation = theTransaction->getNdbOperation("TEST_TABLE");
if(theOperation == NULL)
  goto error;

theOperation->readTuple(NdbOperation::LM_Read);
theOperation->setValue("ATTR_1", at1);
theOperation->setValue("ATTR_2", at1);  //  Error occurs here
theOperation->setValue("ATTR_3", at1);
theOperation->setValue("ATTR_4", at1);

if(theTransaction->execute(Commit) == -1) 
{
  errorLine = theTransaction->getNdbErrorLine();
  errorOperation = theTransaction->getNdbErrorOperation();
}

Here, errorLine is 3, as the error occurred in the third method called on the NdbOperation object (in this case, theOperation). If the result of NdbTransaction::getNdbErrorLine() is 0, then the error occurred when the operations were executed. In this example, errorOperation is a pointer to the object theOperation. The NdbTransaction::getNdbError() method returns an NdbError object providing information about the error.

Note

Transactions are not automatically closed when an error occurs. You must call Ndb::closeTransaction() or NdbTransaction::close() to close the transaction.

See Section 3.1.1.8, “Ndb::closeTransaction()”, and Section 3.9.2.7, “NdbTransaction::close()”.

One recommended way to handle a transaction failure (that is, when an error is reported) is as shown here:

  1. Roll back the transaction by calling NdbTransaction::execute() with a special ExecType value for the type parameter.

    See Section 3.9.2.5, “NdbTransaction::execute()”, and Section 3.9.1.3, “The NdbTransaction::ExecType Type”, for more information about how this is done.

  2. Close the transaction by calling Ndb::closeTransaction() or NdbTransaction::close().

  3. If the error was temporary, attempt to restart the transaction.
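Taken together, these three steps suggest a retry loop along the following lines. This is a sketch only: defineOperations() is a hypothetical helper standing in for the application's own operation definitions, and myNdb is the application's Ndb object.

const int MAX_RETRIES = 5;

for (int retry = 0; retry < MAX_RETRIES; retry++)
{
  NdbTransaction *trans = myNdb->startTransaction();
  defineOperations(trans);              // hypothetical helper

  if (trans->execute(NdbTransaction::Commit) == 0)
  {
    myNdb->closeTransaction(trans);     // success
    break;
  }

  // Capture the error before rolling back and closing
  NdbError err = trans->getNdbError();

  // 1. Roll back the transaction
  trans->execute(NdbTransaction::Rollback);

  // 2. Close the transaction
  myNdb->closeTransaction(trans);

  // 3. Retry only if the error was temporary
  if (err.status != NdbError::TemporaryError)
    break;
}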

Several errors can occur when a transaction contains multiple operations which are simultaneously executed. In this case the application must go through all operations and query each of their NdbError objects to find out what really happened.

Important

Errors can occur even when a commit is reported as successful. In order to handle such situations, the NDB API provides an additional NdbTransaction::commitStatus() method to check the transaction's commit status.

See Section 3.9.2.10, “NdbTransaction::commitStatus()”.

1.3.3. Review of MySQL Cluster Concepts

Abstract

This section covers the NDB Kernel, and discusses MySQL Cluster transaction handling and transaction coordinators. It also describes NDB record structures and concurrency issues.

The NDB Kernel is the collection of data nodes belonging to a MySQL Cluster. The application programmer can, for most purposes, view this set of data nodes as a single entity. Each data node is made up of three main components:

  • TC: The transaction coordinator.

  • ACC: The index storage component.

  • TUP: The data storage component.

When an application executes a transaction, it connects to one transaction coordinator on one data node. Usually, the programmer does not need to specify which TC should be used, but in some cases where performance is important, the programmer can provide “hints” to use a certain TC. (If the node with the desired transaction coordinator is down, then another TC will automatically take its place.)

Each data node has an ACC and a TUP which store the indexes and data portions of the database table fragment. Even though a single TC is responsible for the transaction, several ACCs and TUPs on other data nodes might be involved in that transaction's execution.

1.3.3.1. Selecting a Transaction Coordinator

The default method is to select the transaction coordinator (TC) determined to be the "nearest" data node, using a heuristic for proximity based on the type of transporter connection. In order of nearest to most distant, these are:

  1. SCI

  2. SHM

  3. TCP/IP (localhost)

  4. TCP/IP (remote host)

If there are several connections available with the same proximity, one is selected for each transaction in a round-robin fashion. Optionally, you may set the method for TC selection to round-robin mode, where each new set of transactions is placed on the next data node. The pool of connections from which this selection is made consists of all available connections.

As noted in Section 1.3.3, “Review of MySQL Cluster Concepts”, the application programmer can provide hints to the NDB API as to which transaction coordinator should be used. This is done by providing a table and a partition key (usually the primary key). If the primary key is also the partition key, then the transaction is placed on the node where the primary replica of that record resides. Note that this is only a hint; the system can be reconfigured at any time, in which case the NDB API chooses a transaction coordinator without using the hint. For more information, see Section 3.4.2.2.16, “Column::getPartitionKey()”, and Section 3.1.1.7, “Ndb::startTransaction()”. The application programmer can specify the partition key from SQL by using this construct:

CREATE TABLE ... ENGINE=NDB PARTITION BY KEY (attribute_list);

For additional information, see Partitioning and in particular KEY Partitioning in the MySQL Manual.

1.3.3.2. NDB Record Structure

The NDB Cluster storage engine used by MySQL Cluster is a relational database engine storing records in tables as with other relational database systems. Table rows represent records as tuples of relational data. When a new table is created, its attribute schema is specified for the table as a whole, and thus each table row has the same structure. Again, this is typical of relational databases, and NDB is no different in this regard.

Primary Keys.  Each record has a primary key made up of from one up to 32 of the table's attributes.

Transactions.  Transactions are committed first to main memory, and then to disk, after a global checkpoint (GCP) is issued. Since all data are (in most NDB Cluster configurations) synchronously replicated and stored on multiple data nodes, the system can handle processor failures without loss of data. However, in the case of a system-wide failure, all transactions (committed or not) occurring since the most recent GCP are lost.

Concurrency Control.  NDB Cluster uses pessimistic concurrency control based on locking. If a requested lock (implicit and depending on database operation) cannot be attained within a specified time, then a timeout error results.

Concurrent transactions as requested by parallel application programs and thread-based applications can sometimes deadlock when they try to access the same information simultaneously. Thus, applications need to be written in a manner such that timeout errors occurring due to such deadlocks are handled gracefully. This generally means that the transaction encountering a timeout should be rolled back and restarted.

Hints and Performance.  Placing the transaction coordinator in close proximity to the actual data used in the transaction can in many cases improve performance significantly. This is particularly true for systems using TCP/IP. For example, a Solaris system using a single 500 MHz processor has a cost model for TCP/IP communication which can be represented by the formula

[30 microseconds] + ([100 nanoseconds] * [number of bytes])

This means that if we can ensure that we use “popular” links we increase buffering and thus drastically reduce the costs of communication. The same system using SCI has a different cost model:

[5 microseconds] + ([10 nanoseconds] * [number of bytes])

This means that the efficiency of an SCI system is much less dependent on selection of transaction coordinators. Typically, TCP/IP systems spend 30 to 60% of their working time on communication, whereas for SCI systems this figure is in the range of 5 to 10%. Thus, employing SCI for data transport means that less effort from the NDB API programmer is required and greater scalability can be achieved, even for applications using data from many different parts of the database.

A simple example would be an application that uses many simple updates where a transaction needs to update one record. This record has a 32-bit primary key which also serves as the partitioning key. Then the keyData is used as the address of the integer of the primary key and keyLen is 4.
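This corresponds to the hint-taking form of Ndb::startTransaction(), which accepts a table object together with the partition key data. A sketch (myNdb and myTable stand for the application's Ndb and table objects; the key value is a placeholder):

// The record's 32-bit primary key also serves as the partitioning key.
Uint32 primaryKey = 1;

// Hint the NDB API to place the transaction on the node holding the
// primary replica of this record: keyData is the address of the key
// value, and keyLen is its length in bytes (4 for a 32-bit key).
NdbTransaction *trans =
    myNdb->startTransaction(myTable,                     // table object
                            (const char *) &primaryKey,  // keyData
                            sizeof(primaryKey));         // keyLen == 4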

1.3.4. The Adaptive Send Algorithm

Abstract

Discusses the mechanics of transaction handling and transmission in MySQL Cluster and the NDB API, and the objects used to implement these.

When transactions are sent using NdbTransaction::execute(), they are not immediately transferred to the NDB Kernel. Instead, transactions are kept in a special send list (buffer) in the Ndb object to which they belong. The adaptive send algorithm decides when transactions should actually be transferred to the NDB kernel.

The NDB API is designed as a multi-threaded interface, and so it is often desirable to transfer database operations from more than one thread at a time. The NDB API keeps track of which Ndb objects are active in transferring information to the NDB kernel and the expected number of threads to interact with the NDB kernel. Note that a given instance of Ndb should be used in at most one thread; different threads should not share the same Ndb object.

There are four conditions leading to the transfer of database operations from Ndb object buffers to the NDB kernel:

  1. The NDB Transporter (TCP/IP, SCI, or shared memory) decides that a buffer is full and sends it off. The buffer size is implementation-dependent and may change between MySQL Cluster releases. When TCP/IP is the transporter, the buffer size is usually around 64 KB. Since each Ndb object provides a single buffer per data node, the notion of a “full” buffer is local to each data node.

  2. The accumulation of statistical data on transferred information may force sending of buffers to all data nodes (that is, when all the buffers become full).

  3. Every 10 ms, a special transmission thread checks whether or not any send activity has occurred. If not, then the thread will force transmission to all nodes. This means that 20 ms is the maximum amount of time that database operations are kept waiting before being dispatched. A 10-millisecond limit is likely in future releases of MySQL Cluster; checks more frequent than this require additional support from the operating system.

  4. For methods that are affected by the adaptive send algorithm (such as NdbTransaction::execute()), there is a force parameter that overrides its default behaviour in this regard and forces immediate transmission to all nodes. See the individual NDB API class listings for more information.

Note

The conditions listed above are subject to change in future releases of MySQL Cluster.