EPICURE Software Release Note 91.0
WARNER Cluster
Power Recovery Reboot Procedure
Deb Baddorf
The desired general plan for WARNER power-recovery reboots is that all nodes reboot automatically when the power returns, with no user intervention. At present, however, this scheme has problems: all the nodes contend for the same files and the same batch queues, and the resulting bottleneck can stretch the reboot of the entire cluster to as long as 3 hours. In fact, the bottleneck is severe enough that many workstation nodes will not manage to reboot by themselves; they will either give up after many tries or get stuck in the middle of the reboot. Until this problem is resolved, the following manual intervention procedure will reboot WARNER faster, allowing logins on selected nodes in less than half an hour. The remaining nodes are then rebooted manually at spaced intervals until the whole cluster is online.
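The idea of rebooting the remaining nodes at spaced intervals can be sketched as a small schedule generator. This is only an illustration under stated assumptions: the function name `staggered_schedule`, the 5-minute spacing, the start time, and the node order are all hypothetical, and this note does not prescribe a specific interval.

```python
from datetime import datetime, timedelta


def staggered_schedule(nodes, start, spacing_minutes):
    """Return (node, boot_time) pairs, one node per interval.

    spacing_minutes is a hypothetical value chosen by the operator;
    it is not taken from this release note.
    """
    spacing = timedelta(minutes=spacing_minutes)
    return [(node, start + i * spacing) for i, node in enumerate(nodes)]


if __name__ == "__main__":
    # Node names appear in this note; the order and start time are placeholders.
    start = datetime(2000, 1, 1, 9, 0)
    for node, when in staggered_schedule(["CHLDRS", "WRNLS", "WRNGRG"], start, 5):
        print(f"{when:%H:%M}  boot {node}")
```

Spacing the boots out keeps only a few nodes contending for the shared files and batch queues at any one time, which is the point of the manual procedure.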
CHIP may take half an hour longer to boot, but will probably get there eventually. CHIP has a tape in its MUA0 drive, so it defaults to searching the entire tape for a system boot file (which isn't there) before getting around to asking for a boot over the network. Newer node types can be programmed to look directly at the Ethernet and skip local disks and tapes, but the MicroVAX IIs cannot.
The other nodes (currently there are other nodes) will power up, perform self tests, and try to reboot too. Many will time out and fail (VS2000s don't have much patience); some may get halfway and fail due to contention; some may succeed. Unfortunately, since they have to wait until the core nodes are far enough along to provide the needed data, the ``timeout and stop trying'' mode is the most common.
At this point you should also be able to log into DAFFY or ELMER, using a terminal on the terminal server. These nodes reboot faster and will already be up by the time ARDYH is up.
NOTE: If for some reason ARDYH is dead and will not boot, estimate 20 minutes from power-up and then try to log in to DAFFY and ELMER from the terminal server. If ARDYH is okay, its own reboot time is a good estimate of how long DAFFY and ELMER need to finish the basic reboot and start the EPICURE processes.
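The timing rule in the note above can be written down as a small helper. This is a minimal sketch, not part of the procedure itself: the function name `earliest_login_attempt` and its parameters are hypothetical, and only the 20-minute figure comes from this note.

```python
from datetime import datetime, timedelta

# From the note: estimate 20 minutes from power-up when ARDYH will not boot.
FALLBACK_WAIT = timedelta(minutes=20)


def earliest_login_attempt(power_up, ardyh_up_at=None):
    """Return the earliest time worth trying to log in to DAFFY or ELMER.

    power_up     -- when power returned (datetime)
    ardyh_up_at  -- when ARDYH finished rebooting, or None if it is dead
    """
    if ardyh_up_at is not None:
        # ARDYH's own reboot time serves as the estimate for the other two.
        return ardyh_up_at
    return power_up + FALLBACK_WAIT
```

For example, with power restored at 09:00 and ARDYH dead, the helper returns 09:20.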
Command: VIEW nodename
If the display shows that the node is waiting at the >>> prompt, then perform the following steps to boot it. If the display shows activity on the node, it is probably in the process of booting; leave it alone. To tell the node to boot:
Command: OUTPUT nodename BOOT !for most nodes
Command: OUTPUT nodename BOOT XQ !for CHIP or DALE
Use the name of the node you want to boot in place of the word nodename and press carriage return at the end of the line. This should be all you need to know about VCS for rebooting, but more information about using the VCS monitor program is available in EPICURE Software Release Note 14.
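The choice between the two OUTPUT forms can be captured in a few lines. The sketch below only builds the command text to type at the VCS monitor; it does not talk to VCS. The function name `vcs_boot_command` and the node name NODE1 in the usage line are hypothetical; the command strings and the CHIP and DALE exception come from this note.

```python
# Nodes that this note says need the "BOOT XQ" form.
XQ_NODES = {"CHIP", "DALE"}


def vcs_boot_command(nodename):
    """Return the OUTPUT command line for booting the given node."""
    name = nodename.strip().upper()
    if not name:
        raise ValueError("node name must not be empty")
    suffix = " XQ" if name in XQ_NODES else ""
    return f"OUTPUT {name} BOOT{suffix}"


if __name__ == "__main__":
    for node in ("CHIP", "DALE", "NODE1"):  # NODE1 is a placeholder name
        print(vcs_boot_command(node))
```

Keeping the exceptions in one set means a new node that needs the XQ form requires a change in only one place.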
If ARDYH is not available, call the system manager (see call-in list). You will need to do the following (with telephone prompting) for each node:
If you have gone around and switched them all off, then reboot them by switching them on. They should continue by themselves (unless too many nodes are booting and they time out waiting for a response).
If they were not powered off, or if they have timed out and given up trying, the main screen should have a >>> prompt symbol. If there is no >>> prompt, hit the tiny recessed HALT button to get the prompt. At the prompt, type B or BOOT and then carriage-return. Workstation nodes do not use ARDYH; their prompts go to the main workstation screen itself.
Walk around and check all nodes again when you think they are all up. Make sure you can pop up a login box (the box with the ``Create Window'' choices) on each node. You don't need to actually log in. Nodes that have gotten into a strange mode will look as if the window system is up, but they won't let you pop up a login box. Nodes CHLDRS (Childress), WRNLS (Stutte), and WRNGRG (Gutierrez) are particularly susceptible to this failure mode.
A node which is in this state must be rebooted again. If you are in the controls group (Watts' and Larwill's people) the preferred method to reboot these nodes is: log in on another node, as yourself. SET HOST to the troubled node; log in as SHUTDOWN. Choose the REBOOT option, answer any questions, and let it proceed. Other people will have to use the brute force methods: hit the recessed HALT button and type B, or power cycle the node if you can't find the button.
Keywords: EPICURE, WARNER, computer, reboot, power fail
Distribution:
normal
RD Site Operations