Transient errors

Some distributed database clusters make use of transient errors. A transient error is a temporary error that is likely to disappear soon. By definition, it is safe for a client to ignore a transient error and retry the failed operation on the same database server. The retry is free of side effects. Clients are not forced to abort their work or to fail over to another database server immediately. They may enter a retry loop to wait for the error to disappear before giving up on the database server. Transient errors can be seen, for example, when using MySQL Cluster, but they are not bound to any specific clustering solution.

PECL/mysqlnd_ms can perform an automatic retry loop in case of a transient error. This increases distribution transparency and thus makes it easier to migrate an application running on a single database server to run on a cluster of database servers without having to change the source of the application.

The automatic retry loop repeats the requested operation up to a user-configurable number of times and pauses between attempts for a configurable amount of time. If the error disappears during the loop, the application never sees it. If not, the error is forwarded to the application for handling.
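The behaviour described above can be approximated in plain PHP. The following is a minimal sketch, not the plugin's actual implementation: the function name retry_transient() and the convention that the operation returns an error code (0 for success) are assumptions made for illustration only. Note that PHP's usleep() sleeps for microseconds; how the plugin interprets its own sleep setting is not modelled here.

```php
<?php
/**
 * Illustrative helper (hypothetical, not part of PECL/mysqlnd_ms).
 * Runs $operation and repeats it while it reports an error code that
 * is listed in $transientCodes, up to $maxRetries additional attempts.
 * $operation must return an integer error code, 0 meaning success.
 */
function retry_transient(callable $operation, array $transientCodes, int $maxRetries, int $usleepRetry): array
{
    $retries = 0;
    $errno = $operation();
    while ($errno !== 0
        && in_array($errno, $transientCodes, true)
        && $retries < $maxRetries) {
        usleep($usleepRetry); // pause before the next attempt
        $retries++;
        $errno = $operation();
    }
    // The last error code (if any) is handed back to the caller.
    return ['errno' => $errno, 'retries' => $retries];
}

// Hypothetical operation: fails twice with error 1297, then succeeds.
$attempts = 0;
$flaky = function () use (&$attempts): int {
    $attempts++;
    return $attempts <= 2 ? 1297 : 0;
};

$result = retry_transient($flaky, [1297], 3, 100);
printf("errno=%d retries=%d\n", $result['errno'], $result['retries']);
// prints "errno=0 retries=2"
```

If the error never disappears, the loop stops after the configured number of retries and the caller sees the last error code, which mirrors how the plugin forwards the error to the application.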

In the example below a duplicate key error is provoked to make the plugin retry the failing query two times before the error is passed to the application. Between the two attempts the plugin sleeps for 100 milliseconds.

Example #1 Provoking a transient error

mysqlnd_ms.enable=1
mysqlnd_ms.collect_statistics=1
{
    "myapp": {
        "master": {
            "master_0": {
                "host": "localhost"
            }
        },
        "slave": {
            "slave_0": {
                "host": "192.168.78.136",
                "port": "3306"
            }
        },
        "transient_error": {
            "mysql_error_codes": [
                1062
            ],
            "max_retries": 2,
            "usleep_retry": 100
        }
    }
}

Example #2 Transient error retry loop

<?php
$mysqli = new mysqli("myapp", "username", "password", "database");
if (mysqli_connect_errno()) {
    /* Of course, your error handling is nicer... */
    die(sprintf("[%d] %s\n", mysqli_connect_errno(), mysqli_connect_error()));
}

/* Create the test table and insert the row that will later collide */
if (!$mysqli->query("DROP TABLE IF EXISTS test") ||
    !$mysqli->query("CREATE TABLE test(id INT PRIMARY KEY)") ||
    !$mysqli->query("INSERT INTO test(id) VALUES (1)")) {
    printf("[%d] %s\n", $mysqli->errno, $mysqli->error);
}

/* Retry loop is completely transparent. Checking statistics is
 the only way to know about implicit retries */
$stats = mysqlnd_ms_get_stats();
printf("Transient error retries before error: %d\n", $stats['transient_error_retries']);

/* Provoking duplicate key error to see statistics change */
if (!$mysqli->query("INSERT INTO test(id) VALUES (1)")) {
    printf("[%d] %s\n", $mysqli->errno, $mysqli->error);
}

$stats = mysqlnd_ms_get_stats();
printf("Transient error retries after error: %d\n", $stats['transient_error_retries']);

$mysqli->close();
?>

The above example will output something similar to:

Transient error retries before error: 0
[1062] Duplicate entry '1' for key 'PRIMARY'
Transient error retries after error: 2

Because the execution of the retry loop is transparent from a user's point of view, the example checks the statistics provided by the plugin to learn about it.

As the example shows, the plugin can be instructed to consider any error transient, regardless of the database server's error semantics. The only error that a stock MySQL server considers temporary has the error code 1297. When configuring error codes other than 1297, make sure your configuration reflects the semantics of your cluster's error codes.
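For a setup that should only retry on the error a stock MySQL server treats as temporary, the configuration can list 1297 alone. The sketch below builds such a configuration in PHP and prints it as JSON. The host values and the retry settings are simply copied from the example above and are placeholders, not recommendations.

```php
<?php
// Sketch: plugin configuration that treats only error 1297 as transient.
// Section and key names follow the configuration example above;
// hosts and retry settings are placeholders.
$config = [
    'myapp' => [
        'master' => [
            'master_0' => ['host' => 'localhost'],
        ],
        'slave' => [
            'slave_0' => ['host' => '192.168.78.136', 'port' => '3306'],
        ],
        'transient_error' => [
            'mysql_error_codes' => [1297],
            'max_retries' => 2,
            'usleep_retry' => 100,
        ],
    ],
];

// Pretty-print so the result can be saved as the plugin configuration file.
$json = json_encode($config, JSON_PRETTY_PRINT);
echo $json, "\n";
```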

The following mysqlnd C API calls are monitored by the plugin to check for transient errors: query(), change_user(), select_db(), set_charset(), set_server_option(), prepare(), execute(), set_autocommit(), tx_begin(), tx_commit(), tx_rollback(), tx_commit_or_rollback(). The corresponding user API calls have similar names.

The maximum time the plugin may sleep during the retry loop depends on the function in question. A retry loop for query(), prepare() or execute() will sleep for up to max_retries * usleep_retry milliseconds.

However, functions that control connection state are dispatched to all connections. The retry loop settings are applied to every connection on which the command is to be run. Thus, such a function may interrupt program execution for longer than a function that is run on one server only. For example, set_autocommit() is dispatched to all connections and may sleep for up to (max_retries * usleep_retry) * number_of_open_connections milliseconds. Keep this in mind when setting long sleep times and large retry numbers. With the default settings of max_retries=1, usleep_retry=100 and lazy_connections=1, it is unlikely that you will ever see a delay of more than 1 second.
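The upper bound on the sleep time is simple arithmetic, and a small helper can make it easy to check a configuration before deploying it. The function name max_retry_sleep() is hypothetical, and the result is in whatever unit usleep_retry is given in.

```php
<?php
/**
 * Hypothetical helper: worst-case total sleep of a retry loop.
 * For calls that run on one server, leave $connections at 1.
 * For calls dispatched to all connections, pass the number of
 * open connections.
 */
function max_retry_sleep(int $maxRetries, int $usleepRetry, int $connections = 1): int
{
    return $maxRetries * $usleepRetry * $connections;
}

// Single-server call such as query() with the default settings.
printf("%d\n", max_retry_sleep(1, 100));    // prints "100"

// A dispatched call such as set_autocommit() with max_retries=2,
// usleep_retry=100 and three open connections.
printf("%d\n", max_retry_sleep(2, 100, 3)); // prints "600"
```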