Cluster 2

This is the default lab cluster. It is hosted at [[UCSF]]; the physical equipment in [[Cluster 0]] will be subsumed into this cluster once it replicates all the functions of the original.
 
{{TOCright}}


= Priorities and Policies =
* [[Lab Security Policy]]
* [[Disk space policy]]
* [[Backups]] policy
* [[Portal system]] for off-site ssh cluster access
* Get a [[Cluster 2 account]] and get started


= Special machines =
Normally, you will just ssh to sgehead (aka gimel) from portal.ucsf.bkslab.org, where you can do almost anything, including job management. A few things require licensing and must be done on special machines.
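For example, logging in from off campus usually looks like the sketch below. Here <your_id> stands for your cluster account name, and the single-command form assumes an OpenSSH client new enough to support the -J (ProxyJump) option.

<pre>
# two hops: the public portal first, then sgehead (gimel) inside the cluster
ssh <your_id>@portal.ucsf.bkslab.org
ssh gimel

# or as a single command, jumping through the portal
ssh -J <your_id>@portal.ucsf.bkslab.org <your_id>@gimel
</pre>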


Hypervisor 'he' hosts:
* alpha - critical; runs Foreman, DNS, DHCP, and other important services
* beta - runs LDAP authentication
* epsilon - portal.ucsf.bkslab.org, the cluster gateway from the public internet
* gamma - Sun Grid Engine qmaster
* phi - mysqld / excipients
* psi - for using the PG Fortran compiler
* tau - the web server for ZINC
* zeta - PSICQUIC / Pipeline Pilot; ppilot is at http://zeta:9944/ (you must be on the Cluster 2 private network to use it)
* sigma - can definitely go off and stay off; it was planned as a fingerprinting server, which was never done
* no other special machines


Hypervisor 'aleph2' hosts:
* alpha7 - the future core-infrastructure VM of the cluster (DNS/DHCP/Puppet/Foreman/Ansible), running CentOS 7
* kappa - licensing; ask me ("I have no clue what this licenses. Turned off." - ben)
* rho - hosts this wiki and also bkslab.org

= Getting started =
Welcome to the lab. Here is what you need to know to get started.
# Your account: get it from your system administrator, Therese Demers (or John Irwin).
# Your home directory is /nfs/home/<your_id>/. This area is backed up and is for important persistent files.
# Run docking jobs and other intense calculations in /nfs/work/<your_id>/.
# Keep static data (e.g. crystallography data, results of published papers) in /nfs/store/<your_id>/.
# Lab guests get 100 GB in each of these areas, and lab members get 500 GB. You may request more - just ask!
# If you go over your limit, you get emails for 2 weeks; after that we impose a hard limit if you have not resolved your overage.
# You can choose bash or tcsh as your default shell. We don't care; everything should work equally well with both.
# There is a special kind of static data, databases, for which you may request space. These go in /nfs/db/<db_name>/, e.g. /nfs/db/zinc/, /nfs/db/dude/, /nfs/db/pdb/, and so on.
# Run large docking jobs on /nfs/work, not on /nfs/store or /nfs/home. When you publish a paper, delete what you can, compress the rest, and move it to /nfs/store/. Do not leave it on /nfs/work/ if you are no longer using it actively.
# Set up your account so that you can log in across the cluster without a password: ssh-keygen; cd ~/.ssh; cp id_rsa.pub authorized_keys; chmod 600 authorized_keys (see the sketch after this list).
# Software lives in /nfs/software/. All our machines run 64-bit CentOS 6.3 unless otherwise indicated.
# Python 2.7 and 3.0 are installed. We currently recommend 2.7 because of library availability, but that may change soon. (Aug 2012)
# If you use tcsh, copy .login and .cshrc from ~jji/; if you use bash, copy .bash_profile from ~jji/.
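A minimal sketch of items 10 and 13, assuming the default RSA key path, an empty passphrase, and that /nfs/home is shared by every node (tcsh users copy .login and .cshrc instead of .bash_profile):

<pre>
# item 10: passwordless ssh within the cluster
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa    # skip if ~/.ssh/id_rsa already exists
cd ~/.ssh
cp id_rsa.pub authorized_keys               # authorize your own public key
chmod 700 ~/.ssh
chmod 600 authorized_keys                   # sshd ignores keys with loose permissions
ssh gimel hostname                          # test: should not prompt for a password

# item 13: starter dotfiles for bash users
cp ~jji/.bash_profile ~/
</pre>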


= Notes =
* To check code out of SVN, use the svn+ssh protocol.
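For example, a checkout over svn+ssh looks like the line below; the host and repository path are placeholders, not the lab's actual repository location.

<pre>
# hypothetical repository URL - substitute the real host and path
svn checkout svn+ssh://<your_id>@sgehead/svn/repos/mycode mycode
</pre>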


= Hardware and physical location =
* 1856 cpu-cores for queued jobs
* 128 cpu-cores for infrastructure, databases, management, and ad hoc jobs
* 788 TB of high-quality NFS-available disk
* Our policy is to have 4 GB RAM per cpu-core unless otherwise specified.
* Machines older than 3 years may have 2 GB/core; 6-year-old machines may have 1 GB/core.
* Cluster 2 is currently housed entirely in Rack 0, which is in Row 0, Position 4 of BH101 at 1700 4th St (Byers Hall).
* Central services are on he, aleph2, and bet.
* CPU
** 3 Silicon Mechanics Rackform nServ A4412.v4 units, each comprising 4 computers of 32 cpu-cores, for a total of 384 cpu-cores.
** 1 Dell C6145 with 128 cores.
** An HP DL165 G7 (24-way), which is sgehead.
** More computers to come from Cluster 0 when Cluster 2 is fully ready.
* DISK
** HP disks - 40 TB RAID6 SAS (new in 2014)
** Silicon Mechanics NAS - 77 TB RAID6 SAS (new in 2014)
** An HP DL160 G5 and an MSA60 with 12 TB SAS (disks new in 2014)


= Naming convention =
* The Hebrew alphabet is used for physical machines.
* Greek letters are used for VMs.
* Functions (e.g. sgehead) are aliases (CNAMEs).
* Both the compbio.ucsf.edu and ucsf.bkslab.org domains are supported.
* '''sgehead''' - access to the cluster from within the lab
** pgf Fortran compiler
** submit jobs to the queue (see the sketch after this list)
* '''portal''' - access to the cluster from off campus
* ppilot - our Pipeline Pilot license will be transferred here
* www - static web server VM
* dock - DOCK licensing VM
* drupal -
* wordpress -
* public - runs the public services: ZINC, DOCK Blaster, SEA, DUDE
* happy - postgres production server
* ark - internal psql, like raiders in yyz
* nfs1 - disk server 1
* nfs2 - disk server 2
* nfs3 - disk server 3
* fprint - fingerprinting server
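Job submission happens from sgehead (gamma runs the Sun Grid Engine qmaster). A minimal submission sketch follows; hello.sh is a hypothetical job script, and the options shown are generic SGE flags rather than a lab standard.

<pre>
# from sgehead (gimel): submit a job script to the queue and check on it
qsub -cwd -j y hello.sh   # -cwd: run in the submission directory; -j y: merge stdout and stderr
qstat -u $USER            # list your pending and running jobs
qdel <job_id>             # remove a job you no longer want
</pre>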


= Disk organization =
* shin aka nas1, mounted as /nfs/db/ = 72 TB SAS RAID6. NOTE: on band, run sudo /usr/local/RAID\ Web\ Console\ 2/startupui.sh to interact with the RAID controller (username: raid, password: c2 pass).
* bet aka happy, internal: /nfs/store and psql (temp), 10 TB SATA RAID10
* elated on happy: /nfs/work only, 36 TB SAS RAID6
* dalet exports /nfs/home & /nfs/home2
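To check which server and export back each of these areas on a given node, a read-only query such as the following should work (output varies by machine):

<pre>
# show filesystem type and size for each NFS area
df -hT /nfs/db /nfs/store /nfs/work /nfs/home
# or list the active NFS mounts and their servers
mount | grep /nfs
</pre>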


= Special purpose machines - all .ucsf.bkslab.org =
* sgehead aka gimel.cluster - nearly the only machine you'll need.
* psi.cluster - PG Fortran compiler (a machine with only a .cluster address has no public address).
* portal aka epsilon - secure access from off campus.
* zeta.cluster - Pipeline Pilot.
* mysql1.cluster - general purpose mysql server (like the former scratch).
* pg1.cluster - general purpose postgres server.
* fprint.cluster - fingerprinting server.
* shin, bet, and dalet are the three NFS servers. You should not need to log in to them.

To reach the RAID console for shin: on the teague desktop, run /usr/local/RAID\ Web\ Console\ 2/startupui.sh, connect to shin on the public network, and log in as raid / C2.
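The same RAID-console steps as a short sketch (the path and credentials are exactly as quoted above; the connection step happens in the GUI):

<pre>
# on the teague desktop: start the RAID Web Console
/usr/local/RAID\ Web\ Console\ 2/startupui.sh
# in the console: connect to shin over the public network and log in as "raid" (C2 password)
</pre>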


== SEA server ==
* fawlty
* The mysql server is on msqlserver aka inception.
* The fingerprint server is on fingerprint aka darkcrystal.
 
 
= By rack =
== Rack 0 - 10.20.0.* ==
Location: BH101, column 7, row 5
* aleph
* bet
* happy

== Rack 1 - 10.20.10.* ==
Location: BH101, column 1, row 0
*
*

== Rack 2 - 10.20.30.* ==
Location: BH
 
 
= How to administer DHCP / DNS in BH101 =

https://www.cgl.ucsf.edu/dns_dhcp/


[[About our cluster]]
