 Recover cloud after disaster

Use the following procedures to manage your cloud after a disaster, and to easily back up its persistent storage volumes. Backups are mandatory, even outside of disaster scenarios.

For a DRP definition, see http://en.wikipedia.org/wiki/Disaster_Recovery_Plan.

 Disaster recovery example

A disaster could happen to several components of your architecture (for example, a disk crash, a network loss, or a power cut). In this example, the following components are configured:

  1. A cloud controller (nova-api, nova-objectstore, nova-network)

  2. A compute node (nova-compute)

  3. A Storage Area Network (SAN) used by OpenStack Block Storage (cinder-volumes)

The worst disaster for a cloud is a power loss, which applies to all three components. Before a power loss:

  • From the SAN to the cloud controller, we have an active iSCSI session (used for the "cinder-volumes" LVM volume group).

  • From the cloud controller to the compute node, we also have active iSCSI sessions (managed by cinder-volume).

  • For every volume, an iSCSI session is created (so 14 EBS volumes equal 14 sessions).

  • From the cloud controller to the compute node, we also have iptables/ebtables rules which allow access from the cloud controller to the running instance.

  • Finally, the database on the cloud controller holds the current state of the instances (in this case "running") and their volume attachments (mount point, volume ID, volume status, and so on).

After the power loss occurs and all hardware components restart:

  • From the SAN to the cloud, the iSCSI session no longer exists.

  • From the cloud controller to the compute node, the iSCSI sessions no longer exist.

  • From the cloud controller to the compute node, the iptables and ebtables are recreated, since at boot, nova-network reapplies configurations.

  • From the cloud controller, instances are in a shutdown state (because they are no longer running).

  • In the database, data was not updated at all, since Compute could not have anticipated the crash.

Before going further, and to prevent the administrator from making fatal mistakes, be aware that the instances are not lost: because no "destroy" or "terminate" command was invoked, the files for the instances remain on the compute node.

Perform these tasks in the following order.

[Warning]Warning

Do not add any extra steps at this stage.

  1. Get the current relation from a volume to its instance, so that you can recreate the attachment.

  2. Update the database to clean the stalled state. (After that, you cannot perform the first step).

  3. Restart the instances. In other words, go from a shutdown to running state.

  4. After the restart, reattach the volumes to their respective instances (optional).

  5. SSH into the instances to reboot them.

 Recover after a disaster
 

Procedure 4.10. To perform disaster recovery

  1. Get the instance-to-volume relationship

    You must determine the current relationship from a volume to its instance, because you will re-create the attachment.

    You can find this relationship by running nova volume-list. Note that the nova client includes the ability to get volume information from OpenStack Block Storage.
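
    A minimal sketch of saving that relationship for later reuse is shown below; the file names (volume-list-before-recovery.txt, volumes.txt) are only examples, and the exact column layout of the nova volume-list output depends on your client version:

    # Keep a raw copy of the volume listing taken before the database is cleaned.
    nova volume-list | tee volume-list-before-recovery.txt

    # From that listing, build a volumes.txt file with one
    # "<volume_id> <instance_id> <mount_point>" line per attached volume;
    # it is reused by the reattach script later in this procedure.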

  2. Update the database

    Update the database to clean the stalled state. You must restore the state of every volume, using these queries to clean up the database:

    mysql> use cinder;
    mysql> update volumes set mountpoint=NULL;
    mysql> update volumes set status="available" where status <>"error_deleting";
    mysql> update volumes set attach_status="detached";
    mysql> update volumes set instance_id=0;

    You can then run nova volume-list commands to list all volumes.
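
    If you want to keep a record of the stalled state, it is also worth dumping the cinder database before running the queries above; a minimal sketch, assuming the MySQL root credentials are at hand:

    # Back up the cinder database before cleaning it.
    mysqldump -u root -p cinder > cinder-before-recovery.sql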

  3. Restart instances

    Restart the instances using the nova reboot $instance command.

    At this stage, depending on your image, some instances completely reboot and become reachable, while others stop at the "plymouth" stage.
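
    If you have many instances, a small loop saves typing; instances.txt here is an assumption, holding one instance ID per line (building it by hand from the nova list output avoids parsing surprises):

    # Reboot every instance listed in instances.txt.
    while read -r instance; do
        nova reboot "$instance"
        sleep 2
    done < instances.txt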

  4. DO NOT reboot a second time

    Do not reboot instances that are stopped at this point. Instance state at this stage depends on whether you added an /etc/fstab entry for that volume: images built with the cloud-init package remain in a pending state, while others skip the missing volume and start. The purpose of this stage is only to ask Compute to reboot every instance, so that the stored state is preserved. For more information about cloud-init, see help.ubuntu.com/community/CloudInit.

  5. Reattach volumes

    After the restart, once Compute has restored the right status, you can reattach the volumes to their respective instances by using the nova volume-attach command. The following snippet uses a file of listed volumes to reattach them:

    #!/bin/bash

    # File listing one "<volume> <instance> <mount_point>" triple per line;
    # adjust the path to wherever you saved the relationships in step 1.
    volumes_tmp_file=volumes.txt

    while read line; do
        volume=`echo $line | cut -f 1 -d " "`
        instance=`echo $line | cut -f 2 -d " "`
        mount_point=`echo $line | cut -f 3 -d " "`
        echo "ATTACHING VOLUME FOR INSTANCE - $instance"
        nova volume-attach $instance $volume $mount_point
        sleep 2
    done < $volumes_tmp_file

    At this stage, instances that were pending on the boot sequence (plymouth) automatically continue their boot and start normally, while the ones that had already booted now see the volume.
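
    To confirm that the attachments are back, you can run nova volume-list again and check that each volume reports an "in-use" status and the expected instance:

    nova volume-list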

  6. SSH into instances

    If some services depend on the volume, or if a volume has an entry in fstab, you should now restart the instance. This restart needs to be made from the instance itself, not through nova.

    SSH into the instance and perform a reboot:

    # shutdown -r now

By completing this procedure, you can successfully recover your cloud.

[Note]Note

Follow these guidelines:

  • Use the errors=remount-ro mount option in the fstab file, which prevents data corruption.

    If the system detects an I/O error, this option remounts the file system read-only and blocks any further writes to the disk. This option should be set on the cinder-volume server (the one which performs the iSCSI connection to the SAN), and also in the instances' fstab files.
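
    As an illustration only, an fstab entry for such a volume might look like the following; the device name and mount point are assumptions that depend on your setup:

    /dev/vdb  /mnt/volume  ext4  defaults,errors=remount-ro  0  2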

  • Do not add the entry for the SAN's disks to the cinder-volume's fstab file.

    Some systems hang on that step, which means you could lose access to your cloud controller. To re-establish the iSCSI session manually, run the following commands before performing the mount:

    # iscsiadm -m discovery -t st -p $SAN_IP
    # iscsiadm -m node --targetname $IQN -p $SAN_IP -l

  • For your instances, if the whole /home/ directory is on the attached disk, leave a user's directory with the user's bash files and the authorized_keys file on the root disk (instead of emptying /home and mapping the disk onto it).

    This enables you to connect to the instance, even without the volume attached, if you allow only connections through public keys.

 Script the DRP

You can download a bash script that performs the following steps (a minimal skeleton is sketched after this list):

  1. An array is created for instances and their attached volumes.

  2. The MySQL database is updated.

  3. Using euca2ools, all instances are restarted.

  4. The volume attachment is made.

  5. An SSH connection is performed into every instance using Compute credentials.
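
As a rough sketch only, such a script might look like the following; every file name and credential here is an assumption, the MySQL statements mirror the ones shown earlier in this section, and nova reboot stands in for the euca2ools call used by the downloadable script:

#!/bin/bash
# Rough skeleton of a DRP script; adapt paths, credentials, and error
# handling before relying on it.

volumes_tmp_file=volumes.txt   # "<volume> <instance> <mount_point>" per line

# 1. Record the instance/volume relationships before touching the database.
nova volume-list | tee volume-list-before-recovery.txt

# 2. Clean the stalled state in the cinder database (assumes MySQL
#    credentials are available, for example through ~/.my.cnf).
mysql cinder <<'SQL'
update volumes set mountpoint=NULL;
update volumes set status="available" where status <> "error_deleting";
update volumes set attach_status="detached";
update volumes set instance_id=0;
SQL

# 3. Restart every instance once (the published script uses euca2ools here).
cut -d ' ' -f 2 "$volumes_tmp_file" | sort -u | while read -r instance; do
    nova reboot "$instance"
done

# 4. Reattach the volumes.
while read -r volume instance mount_point; do
    nova volume-attach "$instance" "$volume" "$mount_point"
    sleep 2
done < "$volumes_tmp_file"

# 5. Reboot each instance from the inside (key, user, and address are assumptions).
# ssh -i key.pem ubuntu@$instance_ip 'sudo shutdown -r now'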

The "test mode" allows you to perform that whole sequence for only one instance.

To reproduce the power loss, connect to the compute node that runs that same instance and close the iSCSI session. Do not detach the volume with the nova volume-detach command; instead, close the iSCSI session manually. The following example command closes iSCSI session number 15:

# iscsiadm -m session -u -r 15

Do not forget the -r flag. Otherwise, you close ALL sessions.
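
To find the session number to pass to the -r flag, list the active sessions first; each line of the output includes the session ID:

# iscsiadm -m session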
