Cluster 2
= Introduction =
Cluster 2 is the most modern cluster the Irwin Lab maintains.
{{TOCright}}
 
(Edited May 6, 2024)


= Priorities and Policies =  
* [[Lab Security Policy]]
* [[Disk space policy]]
* [[Backups]] policy.
* [[Portal system]] for off-site ssh cluster access.
* Get a [[Cluster 2 account]] and get started
 
= How to Login =
=== Off Site ===
* Off-site access requires an SSH key; contact the sysadmins for help. A sample SSH client configuration appears after the On Site command below.
<source>ssh <user>@portal3.compbio.ucsf.edu</source>


=== On Site ===
<source>ssh -o HostKeyAlgorithms=+ssh-rsa <user>@gimel.compbio.ucsf.edu</source>
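If you connect often, you can capture both settings in your SSH client configuration. The sketch below is only a convenience example: <user> is a placeholder for your cluster account, and hopping through '''portal3''' when off site is an assumption based on the portal access described above.
<source>
# ~/.ssh/config -- minimal sketch; <user> is a placeholder for your account name
Host portal3
    HostName portal3.compbio.ucsf.edu
    User <user>

Host gimel
    HostName gimel.compbio.ucsf.edu
    User <user>
    # gimel still offers an ssh-rsa host key, so allow that algorithm
    HostKeyAlgorithms +ssh-rsa
    # Off site only (assumption): hop through portal3 first.
    # Comment this out when you are on the campus network.
    ProxyJump portal3
</source>
With that in place, "ssh gimel" should work the same way from either location.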


= Where to Submit Jobs and How (SGE/SLURM) =
=== SGE ===
Submit SGE jobs on the machine called '''gimel.compbio.ucsf.edu''', aka '''gimel'''.
* Refer to the pages below for basic commands and examples; a minimal submission sketch follows this list.
** [[SGE Cluster Docking]], replace '''sgehead.compbio.ucsf.edu''' with '''gimel'''.
** [[SGE idioms]]
** [[Using SGE cluster]]
* For sysadmins
** [[SGE_notes]]
** [[Sun Grid Engine (SGE)]]
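A minimal submission sketch (the script name, job name, and output file are placeholders; see the docking pages above for real workloads):
<source>
#!/bin/bash
# hello_sge.sh -- minimal SGE job script; names are placeholders
#$ -S /bin/bash                 # run the job under bash
#$ -cwd                         # start in the directory you submit from
#$ -N hello_sge                 # job name
#$ -j y                         # merge stderr into stdout
#$ -o hello_sge.$JOB_ID.out     # stdout file, tagged with the job id

hostname                        # trivial payload: report which node ran the job
</source>
Submit and monitor it from gimel:
<source>
qsub hello_sge.sh     # submit to the default queue
qstat -u $USER        # list your pending and running jobs
</source>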


=== SLURM ===
Submit SLURM jobs on '''gimel2'''.
* Refer to the pages below for basic guides; a minimal submission sketch follows this list.
** [[Slurm]]
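A minimal submission sketch for gimel2 (job name, output file, and resource requests are placeholders; no partition is specified, so the default partition is used):
<source>
#!/bin/bash
# hello_slurm.sh -- minimal SLURM job script; names are placeholders
#SBATCH --job-name=hello_slurm
#SBATCH --output=hello_slurm.%j.out   # %j expands to the job id
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G                      # in line with the 4 GB per core policy noted below
#SBATCH --time=01:00:00

hostname                              # trivial payload: report which node ran the job
</source>
Submit and monitor it:
<source>
sbatch hello_slurm.sh    # submit from gimel2
squeue -u $USER          # list your pending and running jobs
</source>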


= Special machines =  
Normally, you will just ssh to sgehead aka gimel from portal.ucsf.bkslab.org, where you can do almost anything, including job management. A few things require licensing and must be done on special machines.


hypervisor 'he' hosts:
* alpha - critical; runs Foreman, DNS, DHCP, and other important services
* beta - runs LDAP authentication
* epsilon - portal.ucsf.bkslab.org, the cluster gateway from the public internet
* gamma - Sun Grid Engine qmaster
* phi - mysqld/excipients
* psi - PG Fortran compiler
* ppilot - at http://zeta:9944/; you must be on the Cluster 2 private network to use it
* tau - web server for ZINC
* zeta - PSICQUIC/Pipeline Pilot
* sigma - planned as a fingerprinting server, never done; it can definitely go off and stay off
* no other special machines


hypervisor 'aleph2' hosts:
* alpha7 - the future architecture VM of the cluster (DNS/DHCP/Puppet/Foreman/Ansible); CentOS 7
* kappa - licensing; ask me ("I have no clue what this licenses. Turned off." - Ben)
* rho - hosts this wiki and bkslab.org


= Notes =
* To get code from SVN, use the svn+ssh protocol, as in the sketch below.
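For example, a checkout over svn+ssh looks like this; the repository host and path are placeholders, since the real repository location is not listed on this page.
<source>
# <svn-host> and the path are placeholders -- substitute the real repository location
svn checkout svn+ssh://<user>@<svn-host>/path/to/repo my-working-copy
</source>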


= Hardware and physical location =
* 1856 cpu-cores for queued jobs
* 128 cpu-cores for infrastructure, databases, management and ad hoc jobs.
* 788 TB of high quality NFS-available disk
* Our policy is to have 4 GB RAM per cpu-core unless otherwise specified.
* Machines older than 3 years may have 2 GB/core, and machines older than 6 years may have 1 GB/core.
* Cluster 2 is currently housed entirely in Rack 0, which is in Row 0, Position 4 of BH101 at 1700 4th St (Byers Hall).
* Central services are on he, aleph2, and bet.
* CPU
** 3 Silicon Mechanics Rackform nServ A4412.v4 units, each comprising 4 computers with 32 cpu-cores, for a total of 384 cpu-cores.
** 1 Dell C6145 with 128 cores.
** An HP DL165G7 (24-way) is sgehead
** More computers to come from Cluster 0 when Cluster 2 is fully ready.
* DISK
** HP disks - 40 TB RAID6 SAS (new in 2014)
** Silicon Mechanics NAS - 77 TB RAID6 SAS (new in 2014)
** An HP DL160G5 and an MSA60 with 12 TB SAS (disks new in 2014)


= Naming convention =
* The Hebrew alphabet is used for physical machines
* Greek letters for VMs.
* Functions (e.g. sgehead) are aliases (CNAMEs); a lookup sketch follows this list.
* compbio.ucsf.edu and ucsf.bkslab.org domains both supported.
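To see where a functional alias currently points, query DNS from any machine; for example (targets will change as services move):
<source>
host sgehead.compbio.ucsf.edu    # CNAME: currently points at gimel
host portal.ucsf.bkslab.org      # the same convention in the other domain
</source>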


= Disk organization =  
* shin aka nas1, mounted as /nfs/db/ = 72 TB SAS RAID6. NOTE: on band, run "sudo /usr/local/RAID\ Web\ Console\ 2/startupui.sh" to interact with the RAID controller (username: raid, password: the c2 pass).
* bet aka happy, internal: /nfs/store and psql (temporary) as 10 TB SATA RAID10
* elated on happy: /nfs/work only, as 36 TB SAS RAID6
* dalet exports /nfs/home & /nfs/home2 (a quick mount-check sketch follows this list)
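To check which of these NFS volumes a node has mounted, and how full they are, a quick sketch:
<source>
df -h /nfs/db /nfs/store /nfs/work /nfs/home    # size and usage of the NFS mounts
mount | grep ' /nfs/'                           # which server exports each mount
</source>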


= Special purpose machines - all .ucsf.bkslab.org =
* sgehead aka gimel.cluster - nearly the only machine you'll need.  
* psi.cluster - PG Fortran compiler (a machine with only a .cluster address has no public address)
* portal aka epsilon - secure access
* zeta.cluster - Pipeline Pilot
* shin, bet, and dalet are the three NFS servers. You should not need to log in to them.


* To reach shin's RAID console: on the teague desktop, run /usr/local/RAID Web Console 2/startupui.sh, connect to shin on the public network, and log in as raid / C2.


* mysql1.cluster - general purpose mysql server (like the former scratch); a connection sketch for both database servers follows this list
* pg1.cluster - general purpose postgres server
* fprint.cluster - fingerprinting server
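Connecting to the general-purpose database servers from inside the cluster looks like the sketch below; the database name is a placeholder and access depends on the grants you have been given.
<source>
psql  -h pg1.cluster    -U $USER -d <database>    # PostgreSQL on pg1
mysql -h mysql1.cluster -u $USER -p <database>    # MySQL on mysql1 (prompts for a password)
</source>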


= Table of Server Information =
=== SLURM ===
{| class="wikitable"
|-
!Server Name
!Operating System
!Functions
|-
| epyc || Rocky 8 || Apache/HTTPD Webserver + Proxy
|-
| epyc2 || Rocky 8 ||
|-
| epyc-A40 || Rocky 8 ||
|-
| n-1-101 || Centos 7 ||
|-
| n-1-105 || Centos 7 ||
|-
| n-1-124 || Centos 7 ||
|-
| n-1-126 || Centos 7 ||
|-
| n-1-141 || Centos 7 ||
|-
| n-1-16 || Centos 7 ||
|-
| n-1-17 || Centos 7 ||
|-
| n-1-18 || Centos 7 ||
|-
| n-1-19 || Centos 7 ||
|-
| n-1-20 || Centos 7 ||
|-
| n-1-21 || Centos 7 ||
|-
| n-1-28 || Centos 7 ||
|-
| n-1-38 || Centos 7 ||
|-
| n-5-13 || Centos 7 ||
|-
| n-5-14 || Centos 7 ||
|-
| n-5-15 || Centos 7 ||
|-
| n-5-32 || Centos 7 ||
|-
| n-5-33 || Centos 7 ||
|-
| n-5-34 || Centos 7 ||
|-
| n-5-35 || Centos 7 ||
|-
| n-9-19 || Centos 7 ||
|-
| n-9-20 || Centos 7 ||
|-
| n-9-21 || Centos 7 ||
|-
| n-9-22 || Centos 7 ||
|-
| n-9-34 || Centos 7 ||
|-
| n-9-36 || Centos 7 ||
|-
| n-9-38 || Centos 7 ||
|-
| qof || Centos 7 ||
|-
| shin || Centos 7 ||
|-
|}


=== SGE ===
{| class="wikitable"
|-
!Server Name
!Operating System
!Functions
|-
| gimel || Centos 6 || In-person Login Node
|-
| he || Centos 6 || Hosts vital VMs that Cluster 2 needs to function
|-
| het || Centos 6 ||
|-
| n-0-129 || Centos 6 ||
|-
| n-0-136 || Centos 6 ||
|-
| n-0-139 || Centos 6 ||
|-
| n-0-30 || Centos 6 ||
|-
| n-0-37 || Centos 6 ||
|-
| n-0-39 || Centos 6 ||
|-
| n-8-27 || Centos 6 ||
|-
| n-9-23 || Centos 6 ||
|-
|}




[[Category:Internal]]
[[Category:UCSF]]
[[Category:Hardware]]
