Category:Sysadmin

Latest revision as of 20:57, 27 December 2018

To contact the sysadmins, see the sysadmin page.

These are pages for docking.org sysadmins, also useful for anyone who wants to install, run and manage a docking.org site. Sysadmin pages are *only* relevant if you have sudo on a docking.org-type cluster. For other roles, see Category:Roles. Once the docking lab has been set up, it must be maintained. This guide covers everything it takes to be a docking.org sysadmin.

For security reasons, documents pertaining to security are kept in Google Docs. For access, contact the sysadmins. These documents include:

  1) Software Licenses
  2) Our Computers
  3) Sysadmin security secrets

Pro-active maintenance

There are two kinds of maintenance: reactive and pro-active. Pro-active maintenance is organized by schedule; reactive maintenance always deals with a problem in the present.

  • Periodic system maintenance
  • manage public DNS at https://www.cgl.ucsf.edu/dns_dhcp/
  • edit the host alias file to define machine names for private addresses. Use alpha:/opt/bks/bin/add-host-alias. This script freezes and thaws the dynamic zones using rndc freeze <zone> and rndc thaw <zone>, e.g. for the zone cluster.ucsf.bkslab.org
  • NB: CNAMEs must be terminated with a dot (.)
  • NB: bkslab.org is managed by aaa1, but uoft.bkslab.org is delegated to spinaltap and ucsf.bkslab.org to alpha
  • use joker.com to manage the top-level bkslab.org domain
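
The freeze, edit, thaw pattern and the trailing-dot rule for CNAMEs noted above can be sketched in Python. This is not the add-host-alias script itself, whose contents are not shown on this page; the function names are hypothetical and the only external commands assumed are rndc freeze <zone> and rndc thaw <zone>.

```python
import subprocess


def freeze_thaw_commands(zone):
    """Return the two rndc commands that bracket a manual edit of a dynamic zone."""
    return [["rndc", "freeze", zone], ["rndc", "thaw", zone]]


def edit_zone(zone, edit, run=subprocess.check_call):
    """Freeze the zone, call edit(), and thaw again even if edit() fails.

    `run` is injectable so the sequence can be rehearsed without touching
    a real name server (hypothetical helper, not part of add-host-alias).
    """
    freeze, thaw = freeze_thaw_commands(zone)
    run(freeze)          # if the freeze fails, nothing else is attempted
    try:
        edit()
    finally:
        run(thaw)        # never leave the zone frozen


def unterminated_cnames(zone_text):
    """Return the CNAME lines whose target does not end with a dot."""
    bad = []
    for line in zone_text.splitlines():
        fields = line.split(";", 1)[0].split()   # ignore trailing comments
        if "CNAME" in fields:
            idx = fields.index("CNAME")
            if idx + 1 < len(fields) and not fields[idx + 1].endswith("."):
                bad.append(line.strip())
    return bad
```

Passing a recording function as run gives a dry run of the command order before anything is executed against the live server.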

Conventions

  • When we create a desktop, we create the user account l_<USER> (l as in lion or local). This allows the user to use the desktop if LDAP or the network is down.
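
As a small illustration of the naming convention above, the helper below derives the local account name from a user name. The function names and the check on what counts as a valid user name are assumptions made for this sketch, not lab policy.

```python
import re

LOCAL_PREFIX = "l_"  # "l" as in lion or local


def local_account_name(user):
    """Return the desktop-local account name l_<USER> for a lab user."""
    # Assumed pattern for a plain Unix-style login name (hypothetical check).
    if not re.fullmatch(r"[a-z_][a-z0-9_-]*", user):
        raise ValueError(f"unexpected user name: {user!r}")
    return LOCAL_PREFIX + user


def is_local_account(account):
    """Return True if an account name follows the l_<USER> convention."""
    return account.startswith(LOCAL_PREFIX) and len(account) > len(LOCAL_PREFIX)
```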

Reactive maintenance

  • Create a new user
  • Retire a user
  • RAID disk failure
  • disk full
  • security breach

Policies

  • we have an elaborate scheme for private addresses that is possibly more trouble than it is worth
  • if a machine does not have to be on the public network, it should not be on the public network
  • use iptables aggressively to suppress nearly all public services outside the lab
  • use VMs
  • document all machines in Google Docs
  • document everything that is not security related on the wiki
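
One way to review the public-network policy above is to scan a machine inventory for hosts that carry a public address. The sketch below assumes the inventory has already been exported as a mapping of host name to IP address; the host names and addresses used with it are placeholders, not real lab machines.

```python
import ipaddress


def publicly_addressed(inventory):
    """Return the sorted names of hosts whose address is not private.

    `inventory` maps host name -> IP address string. Note that
    ipaddress treats loopback and link-local ranges as private too.
    """
    return sorted(
        name
        for name, addr in inventory.items()
        if not ipaddress.ip_address(addr).is_private
    )
```

Each host in the result is a candidate for the question of whether it really has to be on the public network.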

System down/hung/crashed/offline

This section has two parts. In the first, Diagnosis, we enumerate the possible problems and what the symptoms might look like. In the second part, we rehearse scenarios of how to proceed. There are so many different kinds of failure that it is difficult to anticipate every one. Still, we have tried to write down the most common failure modes and sensible ways to proceed.

Diagnosis

  • system up but df hangs -> disk is off, hung, or unmounted. Solution ->
  • cannot ping head node.
  • no home directory
  • web server down or does not respond
  • jobs don't start in queuing system
  • disk full
  • kernel panic
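
Two of the symptoms above, a hanging df and a full disk, can be probed from a script. This is a minimal sketch using only the Python standard library; the function names and the default timeout are assumptions chosen for illustration.

```python
import shutil
import subprocess


def command_responds(cmd, timeout=10.0):
    """Return True if `cmd` exits within `timeout` seconds.

    Running this with ["df"] gives a hang check: a timeout suggests
    a disk that is off, hung, or unmounted.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck in uninterruptible I/O may
        # linger, so we do not wait for it after the kill.
        proc.kill()
        return False
    return True


def disk_fraction_used(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0
```

For example, command_responds(["df"]) returning False while the machine still answers on the network matches the first symptom in the list.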


Scenarios

  • Install new software by request

After power failure

  • check that mailman came back up properly
  • Cluster 0 - check that the XML-RPC services came back up properly
  • check that the Pipeline Pilot server came back up correctly
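
A post-outage checklist like the one above can be partly automated by testing whether each service accepts TCP connections. The sketch below is an assumption about how one might do this; this page does not give the hosts or ports of the real services, so the caller must supply them.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def down_services(services, timeout=3.0):
    """Return the names of services that do not accept connections.

    `services` maps a label to a (host, port) pair, all supplied by
    the caller (no real lab hosts or ports are assumed here).
    """
    return [
        name
        for name, (host, port) in services.items()
        if not port_open(host, port, timeout)
    ]
```

An open port only shows that something is listening; each service still needs a functional check of its own.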

When someone leaves the lab

  • back up their data or move to proust as appropriate
  • reduce disk footprint as much as possible
  • offer them portable USB disks for backups


  • Add new hardware to the cluster


Procedures

  • How to run backups
  • How to restore
  • How to set up a new computer
  • Monthly tasks
  • Security


Updating Software

  • Delphi
  • AMSOL
  • DOCK
  • dockenv
  • mol2db
  • molinspiration
  • OpenEye
  • Cactvs
  • Daylight
  • Marvin/JChem


Troubleshooting Services

  • MySQL
  • Perl
  • Apache, mod_perl
  • Python
  • Mailman
  • condor
  • sendmail
