RD Controls Software Release Note 109.8
Tape Change Procedure

Deb Baddorf, Cyndee Chopp and John DeVoy

February 24, 1999



All user and system disks on WARNER, DISNEY, RDIV, HUEY, DEWEY, LOUIE, WEBBY, DONALD, HYDRA, and CRYBAK are backed up to tape nightly in the wee hours of the morning. The tapes used for the nightly backups are changed so that a fresh tape is used for the next backup. (Please note: Saturday's backup is overwritten by Sunday's, and Sunday's is overwritten by Monday's.)

Tapes will be pulled semi-quarterly by the BD Controls Group, to be placed in the Feynman Center vault for semi-permanent storage. Also, WARNER and RDIV tapes will be pulled once a month for less-permanent, long-term storage.

This document is intended as a guide to changing the backup tapes in the event that Deb Baddorf (x2289) is unavailable.

Preliminaries

Is the backup job finished?

A backup job may not be finished when you're ready to change the tape. The WARNER backups usually take the longest; they typically finish by 07:00. Symptoms of an unfinished backup include:

If the backup job is making progress then wait until it is finished. The best way to determine if the job is progressing is to periodically examine the backup log, cmn$manager:auto_backup*.log or ocmn$manager:auto_backup*.log. The job will print a message as it starts and finishes backing up each disk. There may also be periodic error messages as the job is backing up an active disk.

Operator assistance (OPCOM)

If the backup job is waiting for operator assistance then the last line of the log file will be the following:

        %BACKUP-I-OPERASSIST, operator assistance has been requested

If this line is followed by a line that says ``no operator is available to handle the request'' then the job has decided not to wait for assistance, and the procedure described in the rest of this section does not apply.

Otherwise, in order to ``assist'' you must ``reply'' to the request for assistance. First type the command ``reply/enable/temp'' into a DECterm window logged on to the node making the request. (If it is a WARNER node then any node in the cluster will do; the same applies for DISNEY and RDIV.) You need OPER privilege to do this. Be sure that broadcast is enabled for the window; you should immediately see two messages telling you that the window has been enabled to receive OPCOM messages.
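The check described above can be sketched in a few lines of Python. This is only an illustration of the decision, not part of the backup procedure itself: the function name `operator_state` and its return values are made up for this example, and it assumes the log has already been read into a list of lines.

```python
def operator_state(log_lines):
    """Classify the tail of a backup log (hypothetical helper).

    Returns "waiting" when the last message is the OPERASSIST request,
    "gave_up" when that request is followed by the "no operator is
    available" message, and "other" in every other case.
    """
    # Drop blank lines so trailing whitespace does not hide the tail.
    lines = [ln.strip() for ln in log_lines if ln.strip()]
    if not lines:
        return "other"
    if "%BACKUP-I-OPERASSIST" in lines[-1]:
        return "waiting"
    if (len(lines) >= 2
            and "%BACKUP-I-OPERASSIST" in lines[-2]
            and "no operator is available" in lines[-1].lower()):
        return "gave_up"
    return "other"
```

Only the "waiting" result calls for the reply procedure; "gave_up" means the job did not wait for assistance.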

Next you must find out the number of the request to which you are going to reply. If it is a WARNER or RDIV backup job then log on to CNS11 and run CONSOLE C3. If it is a DISNEY backup, or a front-end backup, log on to HYDRA and use the command ``vcsmon'' to start VCS. Look for a set of lines similar to the following:

        %%%%%%%%%%%  OPCOM  16-DEC-1992 02:06:11.26  %%%%%%%%%%% 
        Request 7, from user SYSMANAGER on DAFFY
        %BACKUP-I-OPERSPEC, specify option (QUIT or CONTINUE)

Note that not all of the core nodes in a cluster send messages of this type to CNS11 or HYDRA. Press the ``select'' key until you are viewing one that does. For example, the above message was output by ELMER even though it was the backup job on DAFFY that failed.

On RODRNR or HYDRA, just watch the DECterm window; eventually you will receive an OPCOM message telling you the number of the request. OPCOM will repeat these messages about every 20 minutes or so.

Now that you know the number of the request, use the command ``reply/to=# quit'', in a DECterm window, where # is the number of the request. Follow this with ``reply/disable''. The backup job will then quit, retry the backup, or try to restart itself on another node. Watch the log file to see what is going on. Keep in mind that the system flushes output to the log file only about every two minutes; you may have to wait a bit before you see anything happen. Also, when it is retrying a backup, the program has certain delays built in that are intended to allow the tape drive to settle down.
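As a rough illustration of the bookkeeping involved, the following Python sketch pulls the request number out of an OPCOM message like the sample shown earlier and lists the three commands in the order given above. The function name `reply_commands` is hypothetical, and the pattern assumes the ``Request N, from user ... on ...'' layout of that sample; it only builds the command strings and does not issue them.

```python
import re

# Matches lines such as "Request 7, from user SYSMANAGER on DAFFY".
# The layout is assumed from the sample OPCOM message in this note.
REQUEST_RE = re.compile(r"^\s*Request\s+(\d+),\s+from user\s+(\S+)\s+on\s+(\S+)")


def reply_commands(opcom_text):
    """Return the commands to type, in order, for the first request found.

    Returns an empty list when no request line is present.
    """
    for line in opcom_text.splitlines():
        match = REQUEST_RE.match(line)
        if match:
            number = int(match.group(1))
            return [
                "reply/enable/temp",
                f"reply/to={number} quit",
                "reply/disable",
            ]
    return []
```

For the sample message (request 7 on DAFFY), the middle command comes out as ``reply/to=7 quit''.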

If all else fails ...

If the backup job is not progressing and is not asking for assistance (i.e. it seems to be just ``hung'') then you should delete the job from the queue. First type the command ``show queue *****'', where ***** is the name of the queue where the backup is running. Then type the following command to delete the job: ``delete/entry=#'', where # is the entry number given by the show queue command. You also need OPER privilege to do this.

Important: never use ``stop/id=#'' on a backup job; it puts the system into a state that only a reboot will cure. Deleting the job from the queue is permissible, but should be done only as a last resort.

Checking the log file

When the backup job is finished, you should check the backup log file. Ordinarily, if the job finishes unsuccessfully it will mail a copy of its log file to a predetermined list of users (currently Deb and John). We usually assume that a particular backup was successful if we do not get mail. This is not always the case: if a backup job is waiting for assistance, for example, it cannot mail any messages. Also, a job that is deleted from the queue never reaches the code that does the mailing.

The backup job can also be set up to mail a copy of its log file even when it completes successfully. If this is the case, and your name is on the mailing list, then checking the log file is easy - just read your mail.

If you are not on the mailing list then you can check the log files manually. The log file is written to a file named ``auto_backup*.log'' which will be located in ``cmn$manager:'' or ``ocmn$manager:''. You may need to check more than one version to find the file corresponding to the node you are interested in.

Log files are also collected every morning into ``warner::usr$disk1:[baddorf]checkbackup.log''. If the backup job tried to restart itself on another node then the name of the log file for the restarted job will be ``failover_backup*.log''. Deb's checkbackup procedure does not look at these.

Possible errors

Possible error messages to watch for are: parity error, fatal error, ECC error, job aborted, volume not software enabled, unable to mount, backup failed, etc. In general, any message that begins with the string ``%BACKUP-F'' signals an error that will cause the backup to fail. Some of these errors can be caused by an open door on the tape drive, so be sure to check this. Any message that only applies to a specific file can be ignored. Examples of these follow:

        %BACKUP-E-VERIFYERR, verification error for block 3 of USR$DISK1:[USER]MAIL.MAI
        %BACKUP-E-EOFMISMATCH, end of file position mismatch for USR$DISK1:[HOLD]JUNK.X
        %BACKUP-E-OPENIN, error opening USR$DISK2:[BOX]STORAGE.FIL

Any disk errors that occur during the backup will cause a warning message to be mailed. The log file is not included in this message; it just serves as a warning that the disk may be going bad.
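The ``%BACKUP-F'' rule lends itself to a mechanical first pass over a log. The Python sketch below is one way to do that, relying on the standard VMS message layout ``%FACILITY-S-IDENT, text'' in which S is the severity letter (S, I, W, E or F). The function name `fatal_backup_messages` is made up for this example, and the sketch only picks out the fatal BACKUP messages; the other errors listed above still need to be checked by eye.

```python
import re

# VMS messages look like "%FACILITY-S-IDENT, text"; S is the severity
# letter: S(uccess), I(nformational), W(arning), E(rror) or F(atal).
MESSAGE_RE = re.compile(r"^\s*%([A-Z0-9$_]+)-([SIWEF])-([A-Z0-9$_]+),")


def fatal_backup_messages(log_lines):
    """Return the log lines that begin with %BACKUP-F (hypothetical helper)."""
    fatal = []
    for line in log_lines:
        match = MESSAGE_RE.match(line)
        if match and match.group(1) == "BACKUP" and match.group(2) == "F":
            fatal.append(line.strip())
    return fatal
```

Run against the three per-file ``%BACKUP-E'' examples above, the function returns an empty list, which matches the advice that those messages can be ignored.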

Restarting a backup job

If the backup job did not complete successfully, then you may want to restart it. Use the command ``@[.backups]redo_backup'' from the sysmanager account. You will need to have sysmanager privileges to restart backups. On a standalone node this command will start a backup job immediately; no questions will be asked.

On a cluster you need to specify which backup is to be restarted. You will be shown a list of parameter files which can be used on the node you've logged into. See the comments at the end of the file ``sys$common:[sysmgr.backups]redo_backup.com'' for more details.

The log file for a restarted job will be ``redo_backup.log''.

Tape Change Procedures

Before changing tapes, read the log file to be sure the backup was successful.

Tapes are stored in a ``FIFO queue'' arrangement. Today's backup goes in one end, and the oldest tape is removed at the other end. The dates on the tape boxes will show you which direction the queue is going.

RDIV and WARNER tapes are stored below the computers, in the BD computer room. Tapes for CRYBAK are there too. DISNEY and front-end tapes are stored in a drawer-unit in the Op Center computer room. The drawers are to the left of HYDRA and the rack of DISNEY computers.

Get the oldest tape (the one with the oldest date) from one end of the ``queue'' of physical tapes. Swap it with the tape in the tape drive, and put the tape you just removed from the drive into the now-empty case. Put today's date on the case. (Or, if it is not today's backup, write the date that the log file indicates is on that tape.) Put this tape at the other end of the ``queue,'' the end with the newest dates.
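For anyone who finds the ``FIFO queue'' easier to follow as code, here is a small Python model of one tape change. It is purely illustrative: the function name `change_tape`, the tape identifiers, and the choice to model the shelf as a deque of (date, tape) pairs are all assumptions made for this sketch.

```python
from collections import deque


def change_tape(shelf, tape_in_drive, backup_date):
    """Model one tape change (hypothetical helper).

    shelf is a deque of (label_date, tape_id) pairs with the oldest
    tape on the left. The oldest tape is taken off the shelf, the tape
    removed from the drive is labeled with backup_date and placed at
    the newest end, and the id of the tape to load is returned.
    """
    _, oldest_tape = shelf.popleft()
    shelf.append((backup_date, tape_in_drive))
    return oldest_tape
```

Starting from a shelf holding tapes dated 1999-02-01 and 1999-02-02, changing the tape on 1999-02-24 loads the 02-01 tape into the drive and leaves the shelf holding the 02-02 tape followed by the newly dated one.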

Before leaving the computer room you should survey the tape drives to be sure that all drive doors are closed, and the green lights are lit on the 8mm drives.

Special considerations

If a backup job fails repeatedly, try using a cleaning tape and/or a new tape. If cleaning and a new tape do not work then the tape drive may be failing. At this point, one should notify Deb Baddorf (x2289). A power cycle and/or new tape drive may be in order.

Failovers

When a backup job on ELMER or DAFFY fails, it tries to restart itself, or fail over, on a different tape drive. When this occurs do not change the tape that failed. Check the log file; there will be a message near the end indicating failure and a subsequent FAILOVER_BACKUP job being submitted. Read the log file for the failover job to see if it was successful. The log file for a failover job will be in either ``cmn$manager:failover_backup.log'' or ``ocmn$manager:failover_backup.log''.

If the failover job was successful, remove the tape in the designated failover drive instead of the normal drive. Put a new tape in the failover drive.

Meanwhile, the tape that failed is still in the normal tape drive. One should restart the backup to see if it fails again. If it does fail, replace the tape and retry. If it does not fail, keep track of it over the next few days. If the same tape fails again several times, then consider replacing it and/or cleaning the drive.

Special Monday procedures

The Monday backup tapes for most nodes are kept in a special rotation. People on temporary tape changing duty can ignore this special rotation since Deb can implement it retroactively (unless she is gone for more than four weeks). Monday's tapes are not put back into their normal places after having been removed from their tape drives. Instead, they are put into the cabinet in Deb's office. The oldest tapes in this cabinet are then rotated back into the normal rotation. This means that Monday's tapes are effectively on a longer rotation (as opposed to a four-week rotation for the rest of the tapes).

The tape corresponding to the first Monday of a month is permanently removed from the rotation and replaced with a new one. These tapes are write-protected and clearly dated. They are kept in the same cabinet in Deb's office.
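The Monday rules can be summarized with a short date calculation. The Python sketch below is an illustration only: the function names and the three destination labels are invented for this example, and it assumes the date passed in is the date of the backup on the tape. It also ignores the fact that the special rotation applies to most nodes rather than all of them.

```python
from datetime import date


def is_first_monday(d):
    """True when d is the first Monday of its month."""
    # weekday() == 0 is Monday; the first Monday always falls on day 1-7.
    return d.weekday() == 0 and d.day <= 7


def tape_destination(d):
    """Where a tape holding the backup for date d goes (hypothetical labels)."""
    if is_first_monday(d):
        return "permanent"        # write-protect, date, keep in the cabinet
    if d.weekday() == 0:
        return "monday_rotation"  # cabinet, later rotated back in
    return "normal_rotation"      # back of the regular queue
```

For example, 1 March 1999 is the first Monday of the month, 8 March 1999 is an ordinary Monday, and 24 February 1999 is a Wednesday.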

Anyone trying to find a Monday tape (e.g. for a file restore) should check this cabinet first.

Special quarterly procedures

Roughly every quarter of the year all tapes should be pulled to be placed in storage in Feynman Center. People on temporary tape changing duty can ignore this rotation.

When all tapes are collected for Feynman, they must be appropriately labeled for storage. Each tape must get a new ID number (see warner::usr$disk1:[baddorf.tapes]tape_archive.txt) and a form located in Deb's cabinet must be filled out for Feynman Center's records. Once relabelled with the new ID number and the date, they may be taken to Feynman Center for permanent storage.

Keywords: RDCS, EPICURE, controls, backup, save, tape, restore, procedure.
Distribution: minimal

baddorf@fnal.gov
