What Not to Do for IBM i Disaster Recovery

When you recover information from tape media or virtual tape, it is critical to have a complete backup strategy to ensure total recovery of your data. Whether you’re testing your recovery strategy or performing a real disaster recovery, avoiding common mistakes helps your recovery go as smoothly as possible. To that end, here are the top 10 mistakes I see and how to avoid them.

10. Using Only One Set of Tape Media

You should always have at least two sets of tape in case of media errors or incomplete saves. The best option is to make duplicate copies of your backup media and send your most current backups, along with a set of your previous backups, to your recovery site.

9. Labeling Tape Media Incorrectly

Ensure labels are correct and in the recommended format. Improvised labels such as sticky notes may damage the tape and the tape device. Also, make sure to position the label correctly. Replace any damaged or missing labels, and remove the existing label before adding a new one. A virtual tape library (VTL) removes the manual task of labeling because everything is done electronically.

8. Not Using Tape Management

Without tape management, you can’t know what data resides on your tape media or its location, making disaster recovery a real nightmare. Use a tape management system along with your backup to locate, track, and rotate your media according to a defined set of policies.
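As a rough illustration of what "locate, track, and rotate" means in practice, here is a minimal Python sketch of a media catalog. It is not how any particular tape management product works; the `Volume` fields, the location names, and the retention rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Volume:
    volser: str      # volume serial printed on the label (hypothetical values)
    location: str    # e.g. "onsite" or "vault" (assumed location names)
    saved_on: date   # date of the save written to this volume
    contents: tuple  # names of the libraries or save sets on the volume


def find_volumes(catalog, item):
    """Locate: return the volumes that contain `item`, newest first."""
    hits = [v for v in catalog if item in v.contents]
    return sorted(hits, key=lambda v: v.saved_on, reverse=True)


def expired(catalog, today, retention_days):
    """Rotate: return volumes older than the retention period."""
    cutoff = today - timedelta(days=retention_days)
    return [v for v in catalog if v.saved_on < cutoff]


# Illustrative data only.
catalog = [
    Volume("VOL001", "vault", date(2024, 1, 1), ("PAYROLL", "ORDERS")),
    Volume("VOL002", "onsite", date(2024, 1, 8), ("PAYROLL",)),
    Volume("VOL003", "vault", date(2024, 1, 15), ("ORDERS",)),
]
```

With a catalog like this, `find_volumes(catalog, "PAYROLL")` answers which tapes to pull and where they are, and `expired(...)` lists the volumes a retention policy would release for reuse.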

7. Shipping Tape Media Improperly

Media damaged in transit because of improper packaging may be unreadable just when you need it for a recovery. Always ship tape media in original (or better) packaging: keep each cartridge in its jewel case and use recommended shipping containers that hold the cartridge securely during transportation. Never ship tape media in a commercial envelope; always place it in a box or package. If you ship the cartridge in a cardboard or other sturdy box, ensure the following:

  • Place the cartridge snugly in polyethylene plastic wrap or bags to protect it from dust, moisture, and other contaminants.
  • Double-box the cartridge (place it inside a box, then put that box inside the shipping box) and add padding between the two boxes.

6. Storing Tape Media Onsite

How often you send your backup media offsite depends on your recovery point objective (RPO). If the maximum amount of data you can afford to lose is 24 hours’ worth, you should move your tape media offsite at least every 24 hours. Do not wait to send a tape offsite until it is full! If you haven’t shipped a tape for a week and it’s destroyed in a fire or flood, you’ll lose a lot more than just a tape.
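The arithmetic behind that rule can be sketched in a few lines of Python. This is a simplified model, not a formal RPO calculation: it assumes media leaves the site at a fixed interval and that data is protected only once the tape is offsite. The function and parameter names are made up for the example.

```python
def worst_case_loss_hours(ship_interval_h, backup_age_at_pickup_h=0.0):
    """Worst-case data loss if the site is destroyed just before a pickup.

    ship_interval_h: hours between offsite shipments.
    backup_age_at_pickup_h: how old the newest backup already is when it
    leaves the site (0 means the save finishes right at pickup time).
    """
    return ship_interval_h + backup_age_at_pickup_h


def meets_rpo(rpo_h, ship_interval_h, backup_age_at_pickup_h=0.0):
    """True if the shipping schedule keeps worst-case loss within the RPO."""
    return worst_case_loss_hours(ship_interval_h, backup_age_at_pickup_h) <= rpo_h


print(meets_rpo(24, 24))    # daily shipping against a 24-hour RPO
print(meets_rpo(24, 168))   # weekly shipping against a 24-hour RPO
```

Under these assumptions, daily shipping satisfies a 24-hour RPO while weekly shipping exposes up to 168 hours of data, which is the gap the paragraph above warns about.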

During many disasters, you will have enough warning to perform a full system backup before the disaster occurs. However, during hurricanes, many businesses have discovered that once their area is declared a mandatory evacuation zone, tape vendors will not come to pick up their tape media. Making frequent backups can minimize the amount you lose when catastrophe hits.

As another measure against natural disasters, always store your tape media offsite, out of reach of potential disaster risks. Also, make sure the recovery media is close to your recovery site, because system recovery cannot start until the tapes arrive there. If you are in an area that is prone to natural disasters, you should consider high availability (HA) solutions. With this technology, you always have two exact copies of your data, separated by enough distance to minimize your risk.

5. Not Following Recovery Procedures

Having well-documented recovery procedures is vital for any IBM i shop. Depending on your operating system level, you might adopt the procedures in IBM’s “Systems Management: Recovering Your System”, IBM recovery scripts provided by IBM Business Continuity and Resiliency Services (BCRS), or your own custom recovery scripts.

Most important is that everyone involved in performing the recovery read and follow the procedures completely. This may seem like common sense, but recoveries fail when staff members don’t thoroughly read and follow the steps, or if they rush through them. If you’re unsure how to proceed, ask for the appropriate technical assistance.

4. Failing to Perform a Complete Recovery Test

Testing the recovery of your systems does not ensure you can recover your business in the event of a disaster. And being able to restore your data does not mean that you can recover your IT environment and resume business activity in the required time frame.

Recovering IBM i systems is the easy part. Harder is reestablishing your network connectivity and validating the applications and data integrity. A complete test also includes procedures for alert management, declaration, chain of command, and reporting.

One of the most difficult steps of disaster recovery is deciding when to declare a disaster. Some disasters have no obvious point at which to make that call. You do not want an unqualified person to declare a disaster at the first sign of trouble, nor do you want any delays in starting the recovery because no one knows what to do or when to start.

3. Executing Incomplete Saves

Only data that has been saved can be recovered. Data missing from the save strategy or objects with exclusive locks during the save processing will result in incomplete recoveries. Do not be fooled into thinking the Save System (SAVSYS) command saves everything on your system. SAVSYS saves only the Licensed Internal Code (LIC), operating system, user profiles, and device configuration objects. Save menu option 21 saves your entire system by executing the following:

  • Ending all subsystems
  • Saving the Licensed Internal Code
  • Saving the operating system
  • Saving security data
  • Saving device configuration
  • Saving all libraries
  • Saving all documents and folders
  • Saving all directories

The advantage of using an option 21 save is that it completely saves everything on your system. The disadvantage is that the system is unavailable to users during the entire save process.
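One way to make the gap between SAVSYS and a full-system save concrete is to compare what each covers. The Python sketch below uses only the categories listed above; treating "user profiles" as part of "security data" is an assumption made for the comparison, and the set names are illustrative.

```python
# Categories saved by option 21, as listed above.
FULL_SYSTEM = {
    "Licensed Internal Code",
    "operating system",
    "security data",
    "device configuration",
    "libraries",
    "documents and folders",
    "directories",
}

# Categories saved by SAVSYS alone (user profiles counted as security data,
# which is an assumption for this sketch).
SAVSYS_ONLY = {
    "Licensed Internal Code",
    "operating system",
    "security data",
    "device configuration",
}


def missing_from(saved):
    """Return the full-system categories that a given save does not cover."""
    return sorted(FULL_SYSTEM - set(saved))


print(missing_from(SAVSYS_ONLY))
```

Running the same check against the categories your own save strategy covers is a quick way to spot data that would be absent from a recovery.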

If you have a very short backup window that requires a more complex backup strategy, you can use one (or a combination of) the following methods:

  • Save system information in a non-restricted state
  • Save data concurrently using multiple tape devices
  • Save data in parallel using multiple tape devices
  • Use the Save While Active process

Before you use any of those methods, you must have a complete backup of your entire system. Now, let’s take a closer look at each of these techniques:

Save system information in a non-restricted state.

The Save System Information (SAVSYSINF) command performs a cumulative save of a subset of system data and objects saved by SAVSYS without requiring the system to be in a restricted state. SAVSYSINF is not a replacement for SAVSYS and is not for use in system upgrades or migrations.

After you perform a base SAVSYS, SAVSYSINF saves the following:

  • System objects: job descriptions, job queues, subsystem descriptions, and commands that have changed
  • System reply lists, service attributes, environment variables, system values required for system recovery, and network attributes
  • Operating system PTFs that are copied into *SERVICE
    • Use the Change Service Attributes (CHGSRVA) command to modify your service attributes to automatically copy the PTF save files to *SERVICE when loading PTFs

For system recovery, recover the LIC and operating system from your SAVSYS media. Then, use your SAVSYSINF media and the Restore System Information (RSTSYSINF) command to restore the saved changes to system objects and PTFs.

Save data concurrently using multiple tape devices.

To reduce downtime, perform save operations on more than one tape device at a time. For example, you can save libraries to one tape device, folders to a second tape device, and directories to a third tape device. Or you can save different sets of libraries, objects, folders, or directories to different tape devices.
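When you split a save across devices, the total elapsed time is set by the device that finishes last, so it helps to balance the work. The Python sketch below shows one simple approach, a greedy "largest item first" assignment. It assumes data size is a reasonable proxy for save time, which may not hold on a real system; the library names, sizes, and device names are hypothetical.

```python
import heapq


def assign_to_devices(sizes_gb, devices):
    """Assign each save item to whichever device currently has the least data."""
    heap = [(0.0, dev) for dev in devices]  # (GB assigned so far, device name)
    heapq.heapify(heap)
    plan = {dev: [] for dev in devices}
    # Place the largest items first so small ones can even out the totals.
    for name, size in sorted(sizes_gb.items(), key=lambda kv: kv[1], reverse=True):
        load, dev = heapq.heappop(heap)
        plan[dev].append(name)
        heapq.heappush(heap, (load + size, dev))
    return plan


# Illustrative sizes only.
sizes = {"ORDERS": 400, "PAYROLL": 300, "HISTORY": 250, "IFS": 150}
plan = assign_to_devices(sizes, ["TAP01", "TAP02"])
print(plan)
```

In this example each device ends up with 550 GB, so neither tape drive sits idle while the other finishes.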

Save data in parallel using multiple tape devices.

A parallel save is intended for very large objects, libraries, or directories. With this method, the system “spreads” the data in the object, library, or directory across multiple tape devices.

Use the Save While Active process.

Save While Active (SWA) can significantly reduce the amount of time your applications are unavailable and increase user access to applications and data. With SWA, users can resume activity after the save processing reaches a synchronization checkpoint.

The simplest way to use the SWA feature is to prevent user access to applications and data until the SWA checkpoint is reached. At this point, any exclusive locks are released, and users can resume their normal activity while the system continues to perform the save. The time needed to reach the SWA checkpoint depends on the number of objects, not their size, so with large files in particular it takes significantly less time to reach the checkpoint than to actually save the objects.

Starting with IBM i 6.1, the SWA function offers a single Save While Active checkpoint for multiple saves. The Start Save Synchronization (STRSAVSYNC) command ensures a single, consistent checkpoint for your library and IFS saves or for multiple concurrent library saves. If you use SWA, make sure you understand the process and monitor for any synchronization checkpoints before making your objects available for use.

2. Using Only “Special” Backups to Test Recovery

Failing to use regular backup tapes is a huge mistake. If you perform an option 21 save only for the recovery test, all you prove is that this one special backup can be restored. Performing a special backup to test your recovery is asking for trouble: if your normal monthly, weekly, and daily backups have problems and you can’t rely on them for a test, you will not be able to recover with them in a real disaster.

1. Not Testing Recovery Strategy

Without testing, you can’t know if you can recover your systems, and limited recovery tests don’t tell you what happens in reality.

Even if you think your backups are good, are you confident that your organization could completely recover its systems right now? If not, why? No matter how comprehensive you believe your backup is, you will never know if it works unless you actually test it. To truly verify your backup/HA, you must test your recovery.
