To contact the sysadmins, look here.

These pages are for docking.org sysadmins; they are also useful for anyone who wants to install, run, and manage a docking.org site. Sysadmin pages are *only* relevant if you have sudo on a docking.org-type cluster. For other roles, see here. Once the docking lab has been set up, it must be maintained. This guide covers everything it takes to be a docking.org sysadmin.

For security reasons, documents pertaining to security are kept in Google Docs. For access, contact the sysadmins. These documents cover: 1) software licenses, 2) our computers, 3) sysadmin security secrets.

Pro-active maintenance

There are two kinds of maintenance: reactive and pro-active. Pro-active maintenance is classified temporally; reactive maintenance is always in the present.

Periodic system maintenance
manage public DNS at https://www.cgl.ucsf.edu/dns_dhcp/
edit the host alias file to define machine names for private addresses. Use alpha:/opt/bks/bin/add-host-alias. This script freezes and unfreezes the dynamic zones using rndc freeze <zone> and rndc thaw <zone>, where <zone> is, for example, cluster.ucsf.bkslab.org
NB: CNAMEs must be terminated with a dot (.)
NB: bkslab.org is managed by aaa1, but uoft.bkslab.org is delegated to spinaltap and ucsf.bkslab.org to alpha
use joker.com to manage the top-level bkslab.org domain
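
The freeze/edit/thaw pattern mentioned above can be sketched as follows. This is a minimal illustration, not the contents of add-host-alias: the helper name frozen_zone and the injectable run parameter are assumptions made for the example. Only the rndc freeze and rndc thaw commands come from the text.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def frozen_zone(zone, run=subprocess.run):
    """Freeze a dynamic zone for manual edits and always thaw it afterwards.

    `frozen_zone` and `run` are hypothetical names for this sketch;
    `run` defaults to subprocess.run and can be swapped for a dry-run stub.
    """
    run(["rndc", "freeze", zone], check=True)
    try:
        yield  # edit the zone file while dynamic updates are suspended
    finally:
        # Thaw even if the edit raised, so the zone is not left frozen.
        run(["rndc", "thaw", zone], check=True)
```

Usage would look like `with frozen_zone("cluster.ucsf.bkslab.org"): ...`, with the zone file edit inside the block. The try/finally is the point of the sketch: a failed edit should not leave the zone frozen.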

Conventions

When we create a desktop, we create the user account l_<USER> (l as in lion or local). This allows the user to use the desktop if LDAP or the network is down.
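
As a small illustration of the naming convention, the sketch below derives the local account name and builds a useradd command for it. The function names are hypothetical, and the choice of useradd with --create-home is an assumption; the lab's actual account-creation procedure may differ.

```python
def local_account_name(user):
    """Return the local fallback account name for a user, e.g. 'l_jdoe'."""
    if not user or not user.isalnum():
        raise ValueError(f"unexpected user name: {user!r}")
    return f"l_{user}"


def local_useradd_command(user):
    """Build (but do not run) a useradd command for the local account.

    Assumption: a plain local account with a home directory is enough.
    """
    return ["useradd", "--create-home", local_account_name(user)]
```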

Reactive maintenance

Create a new user
Retire a user
RAID disk failure
disk full
security breach

Policies

we have an elaborate scheme for private addresses that is possibly more trouble than it is worth
if a machine does not have to be on the public network, it should not be on the public network
use iptables aggressively to suppress nearly all public services outside the lab
use VMs
document all machines in Google Docs
document everything that is not security related on the wiki
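
One way to read the iptables policy above is "accept loopback, established connections, and the lab's own addresses; drop everything else". The sketch below only builds such a rule list and does not apply it. The function name is hypothetical, and the address range is a caller-supplied placeholder, not the lab's real network.

```python
import ipaddress


def lab_only_input_rules(lab_cidr):
    """Return iptables commands that restrict inbound traffic to one network.

    `lab_cidr` is a placeholder supplied by the caller; real lab ranges
    are deliberately not shown here.
    """
    net = ipaddress.ip_network(lab_cidr)  # raises ValueError if malformed
    return [
        ["iptables", "-A", "INPUT", "-i", "lo", "-j", "ACCEPT"],
        ["iptables", "-A", "INPUT", "-m", "conntrack",
         "--ctstate", "ESTABLISHED,RELATED", "-j", "ACCEPT"],
        ["iptables", "-A", "INPUT", "-s", str(net), "-j", "ACCEPT"],
        # Set the default policy last so existing sessions are not cut off
        # before the ACCEPT rules are in place.
        ["iptables", "-P", "INPUT", "DROP"],
    ]
```

For example, `lab_only_input_rules("192.0.2.0/24")` uses a documentation-only range. Review the generated commands before running any of them on a real host.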

System down/hung/crashed/offline

This section has two parts. In the first, Diagnosis, we enumerate the possible problems and what the symptoms might look like. In the second part, we rehearse scenarios of how to proceed. There are so many different kinds of failure that it is difficult to anticipate every one. Still, we have tried to write down the most common failure modes and sensible ways to proceed.

Diagnosis

system up but df hangs -> disk is off, hung, or unmounted. Solution ->
cannot ping head node.
no home directory
web server down or does not respond
jobs don't start in queuing system
disk full
kernel panic
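
The first symptom in the list, df hanging, can be probed without hanging your own shell by running df against one mount point with a timeout. This is a minimal sketch: the function name, the 5-second default, and the overridable cmd parameter are assumptions for illustration.

```python
import subprocess


def mount_responds(path, timeout_s=5.0, cmd=("df", "-P")):
    """Return True if `df` on `path` finishes successfully within the timeout."""
    proc = subprocess.Popen(
        [*cmd, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked in uninterruptible I/O may
        # ignore the kill, so we do not wait for it to exit.
        proc.kill()
        return False
```

Calling this once per mount point narrows down which disk is off, hung, or unmounted, instead of waiting on a bare df that never returns.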

Scenarios

Install new software by request

After power failure

check that mailman came back up properly
Cluster 0 - check that XML RPC services came back up properly
check that the Pipeline Pilot server came back up correctly.
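
A quick first pass over the checks above is to see whether each service accepts TCP connections. The sketch below assumes you supply the host names and ports yourself; the names in the usage example are placeholders, not the lab's real machines. An open port does not prove the service is healthy, only that something is listening.

```python
import socket


def port_open(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def check_services(services, timeout_s=3.0):
    """Map each service name to whether its (host, port) accepts connections."""
    return {
        name: port_open(host, port, timeout_s)
        for name, (host, port) in services.items()
    }
```

A call such as `check_services({"web": ("www.example.org", 80)})` returns a dict of booleans, and any False entry is the place to start looking.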

When someone leaves the lab

back up their data or move it to proust, as appropriate
reduce disk footprint as much as possible
offer them portable USB disks for backups

Add new hardware to the cluster

Procedures

How to run backups
How to restore
How to set up a new computer
Monthly tasks
Security

Updating Software

Delphi
AMSOL
DOCK
dockenv
mol2db
molinspiration
OpenEye
Cactvs
Daylight
Marvin/JChem

Troubleshooting Services

MySQL
Perl
Apache, mod_perl
Python
Mailman
condor
sendmail

Subcategories

This category has only the following subcategory.

S

Sysadmin‎ (1 C, 155 P)

Pages in category "Sysadmin"

The following 155 pages are in this category, out of 155 total.

Question marks
Zfs
