System administrator's guide
Revision as of 13:22, 19 March 2014
Once the docking lab has been set up, it must be maintained. This guide covers the likely, and some unlikely, events that may occur after you have a computational pharmacology lab up and running. The guide for the initial setup is called "So you want to set up a lab".
For security reasons, documents pertaining to authentication and access are kept in Google Docs.
There are two kinds of maintenance: reactive and pro-active. Pro-active maintenance is classified temporally; reactive maintenance is always in the present.
Pro-active
Reactive
- Create a new user
- Retire a user
- RAID disk failure
System down/hung/crashed/offline
This section has two parts. In the first, Diagnosis, we enumerate the possible problems and what the symptoms might look like. In the second part, we rehearse scenarios of how to proceed. There are so many different kinds of failure that it is difficult to anticipate every one. Still, we have tried to write down the most common failure modes and sensible ways to proceed.
Diagnosis
- system up but df hangs -> disk is off, hung, or unmounted. Solution ->
- cannot ping head node.
- no home directory
- web server down or does not respond
- jobs don't start in queuing system
- disk full
- kernel panic
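A few of the symptoms above (df hanging, head node not answering ping, disk full, no home directory) can be probed from a script. The following is only a minimal sketch in Python, not the lab's actual tooling: the head node name, the list of paths, and the 95% threshold are placeholder assumptions, and the function names are made up for illustration.

```python
import shutil
import subprocess

HEAD_NODE = "headnode.example.org"  # hypothetical name; substitute the real head node


def finishes_in_time(command, timeout_s=10.0):
    """Return True if command exits with status 0 within timeout_s seconds.

    Returns False on timeout, on a non-zero exit, or if the command
    cannot be started at all.
    """
    try:
        proc = subprocess.Popen(
            command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        return False
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck in uninterruptible I/O
        # (for example on a hung mount) may not react to the kill.
        proc.kill()
        return False


def used_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


def diagnose(head_node=HEAD_NODE, paths=("/", "/home"), full_threshold=0.95):
    """Return a list of human-readable findings; an empty list means no symptom seen."""
    findings = []
    df_ok = finishes_in_time(["df", "-P"])
    if not df_ok:
        findings.append("df hung or failed: a disk may be off, hung, or unmounted")
    if not finishes_in_time(["ping", "-c", "1", head_node], timeout_s=5.0):
        findings.append("cannot ping head node " + head_node)
    if df_ok:
        # Only touch the filesystems if df returned, to avoid hanging here too.
        for path in paths:
            try:
                if used_fraction(path) >= full_threshold:
                    findings.append("disk nearly full: " + path)
            except OSError:
                findings.append("cannot read " + path + " (missing or unmounted?)")
    return findings
```

Calling `diagnose()` on a workstation returns the list of symptoms it noticed. Web server, queuing system, and kernel panic checks are deliberately left out, since they depend on site-specific details not described here.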
Scenarios
- Install new software by request
After power failure
- check that mailman came back up properly
- Cluster 0 - check that XML-RPC services came back up properly
- check that the Pipeline Pilot server came back up correctly
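One rough way to confirm that the services listed above are at least listening again is to attempt a TCP connection to each. The sketch below assumes nothing about the real hosts: every host name and port number in `SERVICES` is a placeholder to be replaced with the lab's actual values, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Hypothetical entries; replace hosts and ports with the real ones.
SERVICES = {
    "mailman (SMTP)": ("mail.example.org", 25),
    "XML-RPC services": ("cluster0.example.org", 8080),
    "Pipeline Pilot": ("pilot.example.org", 9944),
}


def port_is_open(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def check_services(services, timeout_s=3.0):
    """Map each service name to True (port answers) or False (it does not)."""
    return {
        name: port_is_open(host, port, timeout_s)
        for name, (host, port) in services.items()
    }
```

For example, `check_services(SERVICES)` returns a dictionary such as `{"mailman (SMTP)": True, ...}`; any `False` entry is a service to inspect by hand.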
When someone leaves the lab
- back up their data or move it to proust as appropriate
- reduce disk footprint as much as possible
- offer them portable USB disks for backups
- Add new hardware to the cluster
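For the step of reducing a departing user's disk footprint, it helps to see first where the space goes. The following Python sketch is one possible approach, not an established lab procedure: it sums file sizes under a directory and lists the largest top-level entries. The function names are illustrative, and the sizes are apparent file sizes, which can differ from what `du` reports for sparse files or hard links.

```python
import os


def tree_size_bytes(root):
    """Sum the sizes of all files under root, without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # File vanished or is unreadable; skip it.
                pass
    return total


def largest_children(root, count=10):
    """Return up to count (size_in_bytes, name) pairs for root's entries, largest first."""
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                size = tree_size_bytes(entry.path)
            else:
                size = entry.stat(follow_symlinks=False).st_size
            sizes.append((size, entry.name))
    sizes.sort(reverse=True)
    return sizes[:count]
```

Running `largest_children` on the user's home directory (the path depends on the site) gives a short list of candidates to archive, compress, or delete after the backup is done.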