Cluster 2: Difference between revisions

From DISI
(24 intermediate revisions by 3 users not shown)
This is the default lab cluster.
= Introduction =
Cluster 2 is the most modern cluster Irwin Lab maintains.


{{TOCright}}
(Edited on May 06, 2024)


= Priorities and Policies =  
* [[Portal system]] for off-site ssh cluster access.
* Get a [[Cluster 2 account]] and get started
= How to Login =
=== Off Site ===
* Off site access requires an SSH key. Contact sysadmins for help.
<source>ssh <user>@portal3.compbio.ucsf.edu</source>
=== On Site ===
<source>ssh -o HostKeyAlgorithms=+ssh-rsa <user>@gimel.compbio.ucsf.edu</source>
= Where to Submit Jobs and How (SGE/SLURM) =
=== SGE ===
Submit SGE jobs on the machine called '''gimel.compbio.ucsf.edu''' aka '''gimel'''.
* Refer to the pages below for the basic commands/examples.
** [[SGE Cluster Docking]], replace '''sgehead.compbio.ucsf.edu''' with '''gimel'''.
** [[SGE idioms]]
** [[Using SGE cluster]]
* For sysadmins
** [[SGE_notes]]
** [[Sun Grid Engine (SGE)]]
=== SLURM ===
Submit SLURM jobs on '''gimel2'''.
* Refer to the pages below for basic guides
** [[Slurm]]


= Special machines =  
Normally, you will just ssh to sgehead aka gimel from portal.ucsf.bkslab.org where you can do almost anything, including job management.  A few things require licensing and must be done on special machines.  


hypervisor 'he' hosts:
* alpha  - which is critical and runs foreman, DNS, DHCP, and other important services
* beta - which runs LDAP authentication
* epsilon - portal.ucsf.bkslab.org - cluster gateway from public internet
* gamma - sun grid engine qmaster
* phi - mysqld/excipients
* psi for using the PG fortran compiler
* ppilot is at  http://zeta:9944/ - you must be on the Cluster 2 private network to use it
* Tau is the web server for ZINC.
* no other specia
* zeta - Psicquic/pipeline pilot
* Sigma can definitely go off and stay off. It was planned for a fingerprinting server, never done.
* kappa is licensing. ask me.
* Rho is the wiki. stays on
* Psi is fortran and stays on
* Tau is the web server, and will move to he
* he also hosts:
* alpha  - which is critical and runs foreman, DNS, and other important services
* beta - which runs LDAP and is important
* gamma -=


* psi for using the PG fortran compiler
hypervisor 'aleph2' hosts:
* ppilot is at http://zeta:9944/ - you must be on the Cluster 2 private network to use it
* alpha7 - This is to be the future architecture VM of the cluster (DNS/DHCP/Puppet/Foreman/Ansible).  CentOS7. 
* no other special machines
* kappa is licensing. ask me.  ("i have no clue what this licenses. Turned off." - ben)
* rho contains this wiki and also bkslab.org


= Notes =  


= Hardware and physical location =
* 1232 cpu-cores for queued jobs
* 1856 cpu-cores for queued jobs
* 128 cpu-cores for infrastructure, databases, management and ad hoc jobs.
* 128 TB of high quality NFS-available disk
* 788 TB of high quality NFS-available disk
* 32 TB of other disk
* We expect this to grow to over 1500 cpu-cores and 200 TB in late 2016 once Cluster 0 is merged with Cluster 2
* Our policy is to have 4 GB RAM per cpu-core unless otherwise specified.
* Machines older than 3 years may have 2 GB/core, and those older than 6 years may have 1 GB/core.
* Cluster 2 is currently stored entirely in Rack 0 which is in Row 0, Position 4 of BH101 at 1700 4th St (Byers Hall).
* '''More racks will be added (from cluster 0) in summer 2016.'''
* Central services are on he, aleph2, and bet
* Central services are on aleph, an HP DL160G5 and bet, an HP xxxx.
* CPU
** 3 Silicon Mechanics Rackform nServ A4412.v4 s, each comprising 4 computers of 32 cpu-cores for a total of 384 cpu-cores.


= Disk organization =  
* shin aka nas1 mounted as /nfs/db/ =  72 TB SAS RAID6
* shin aka nas1 mounted as /nfs/db/ =  72 TB SAS RAID6.  NOTE: ON BAND:  $ sudo /usr/local/RAID\ Web\ Console\ 2/startupui.sh to interact with raid controller.  username: raid.  pw: c2 pass
* bet aka happy, internal: /nfs/store and psql (temp) as 10 TB SATA RAID10
* elated on happy: /nfs/work only as 36 TB SAS RAID6
* het (43) aka  former vmware2 MSA 60  exports /nfs/home and /nfs/soft
* dalet exports /nfs/home & /nfs/home2


= Special purpose machines - all .ucsf.bkslab.org =   
* zeta.cluster - Pipeline Pilot
* shin, bet, and dalet are the three NFS servers. You should not need to log in to them.
on teague desktop, /usr/local/RAID Web Console 2/startupui.sh
connect to shin on public network
raid /  C2 on shin
* mysql1.cluster - general purpose mysql server (like former scratch)
* pg1.cluster - general purpose postgres server  
* fprint.cluster - fingerprinting server
= Table of Server Information =
=== SLURM ===
{| class="wikitable"
|-
!Server Name
!Operating System
!Functions
|-
| epyc || Rocky 8 || Apache/HTTPD Webserver + Proxy
|-
| epyc2 || Rocky 8 ||
|-
| epyc-A40 || Rocky 8 ||
|-
| n-1-101 || Centos 7 ||
|-
| n-1-105 || Centos 7 ||
|-
| n-1-124 || Centos 7 ||
|-
| n-1-126 || Centos 7 ||
|-
| n-1-141 || Centos 7 ||
|-
| n-1-16 || Centos 7 ||
|-
| n-1-17 || Centos 7 ||
|-
| n-1-18 || Centos 7 ||
|-
| n-1-19 || Centos 7 ||
|-
| n-1-20 || Centos 7 ||
|-
| n-1-21 || Centos 7 ||
|-
| n-1-28 || Centos 7 ||
|-
| n-1-38 || Centos 7 ||
|-
| n-5-13 || Centos 7 ||
|-
| n-5-14 || Centos 7 ||
|-
| n-5-15 || Centos 7 ||
|-
| n-5-32 || Centos 7 ||
|-
| n-5-33 || Centos 7 ||
|-
| n-5-34 || Centos 7 ||
|-
| n-5-35 || Centos 7 ||
|-
| n-9-19 || Centos 7 ||
|-
| n-9-20 || Centos 7 ||
|-
| n-9-21 || Centos 7 ||
|-
| n-9-22 || Centos 7 ||
|-
| n-9-34 || Centos 7 ||
|-
| n-9-36 || Centos 7 ||
|-
| n-9-38 || Centos 7 ||
|-
| qof || Centos 7 ||
|-
| shin || Centos 7 ||
|-
|}
=== SGE ===
{| class="wikitable"
|-
!Server Name
!Operating System
!Functions
|-
| gimel || Centos 6 || In-person Login Node
|-
| he || Centos 6 || Hosts VMs vital to the operation of Cluster 2.
|-
| het || Centos 6 ||
|-
| n-0-129 || Centos 6 ||
|-
| n-0-136 || Centos 6 ||
|-
| n-0-139 || Centos 6 ||
|-
| n-0-30 || Centos 6 ||
|-
| n-0-37 || Centos 6 ||
|-
| n-0-39 || Centos 6 ||
|-
| n-8-27 || Centos 6 ||
|-
| n-9-23 || Centos 6 ||
|-
|}


[[About our cluster]]
[[Category:Internal]]
[[Category:UCSF]]
[[Category:Hardware]]

Latest revision as of 21:53, 27 June 2024

Introduction

Cluster 2 is the most modern cluster Irwin Lab maintains.

(Edited on May 06, 2024)

Priorities and Policies

How to Login

Off Site

  • Off site access requires an SSH key. Contact sysadmins for help.
ssh <user>@portal3.compbio.ucsf.edu

On Site

ssh -o HostKeyAlgorithms=+ssh-rsa <user>@gimel.compbio.ucsf.edu
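
For off-site work, the two logins above can in principle be chained into a single command with OpenSSH's `-J` (ProxyJump) option. The following is only a sketch: it assumes that portal3 permits forwarding to gimel and that the local OpenSSH client is version 7.3 or newer. The helper name `c2_ssh_cmd` and the variable `C2_USER` are made up for illustration. The function prints the command rather than running it, so it can be inspected first.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) a one-hop login command.
# The hostnames are the ones given above; the jump-host arrangement
# itself is an assumption and may not be permitted by the portal.
c2_ssh_cmd() {
    user="$1"
    printf 'ssh -J %s@portal3.compbio.ucsf.edu -o HostKeyAlgorithms=+ssh-rsa %s@gimel.compbio.ucsf.edu\n' "$user" "$user"
}

# C2_USER is a placeholder environment variable for your cluster login.
c2_ssh_cmd "${C2_USER:-youruser}"
```

Options given on the command line apply to the destination host, not the jump host, so the `HostKeyAlgorithms` setting here affects only the connection to gimel.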

Where to Submit Jobs and How (SGE/SLURM)

SGE

Submit SGE jobs on the machine called gimel.compbio.ucsf.edu aka gimel.
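
As a rough illustration, a minimal SGE job script could look like the one below. The job name `hello_sge` is a placeholder, and no queue or resource options are shown because this page does not list them. The `#$` lines are standard SGE directives that the shell treats as comments, so the script also runs by hand.

```shell
#!/bin/sh
# SGE directives: interpreter, run in the submission directory,
# job name (placeholder), and merge stderr into stdout.
#$ -S /bin/sh
#$ -cwd
#$ -N hello_sge
#$ -j y

# JOB_ID is set by SGE; fall back to "local" when run by hand.
msg="job ${JOB_ID:-local} running on $(hostname)"
echo "$msg"
```

Saved as `hello_sge.sh` (again a placeholder name), it would be submitted from gimel with `qsub hello_sge.sh` and monitored with `qstat`.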

SLURM

Submit SLURM jobs on gimel2.

  • Refer to the pages below for basic guides
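
A minimal SLURM batch script might look like the following sketch. The job name, time limit and output file name are placeholders. The memory request of 4G per CPU simply mirrors the 4 GB per core policy stated under Hardware and is an assumption, not a required setting. No partition is named because this page does not list any.

```shell
#!/bin/bash
# Standard sbatch directives; all values here are illustrative placeholders.
#SBATCH --job-name=hello_slurm
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=00:05:00
#SBATCH --output=hello_slurm.%j.out

# SLURM_JOB_ID is set by SLURM; fall back to "local" when run by hand.
msg="job ${SLURM_JOB_ID:-local} running on $(hostname)"
echo "$msg"
```

Saved as `hello_slurm.sh`, it would be submitted from gimel2 with `sbatch hello_slurm.sh`, and `squeue -u $USER` would show its state.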

Special machines

Normally, you will just ssh to sgehead aka gimel from portal.ucsf.bkslab.org where you can do almost anything, including job management. A few things require licensing and must be done on special machines.

hypervisor 'he' hosts:

  • alpha - which is critical and runs foreman, DNS, DHCP, and other important services
  • beta - which runs LDAP authentication
  • epsilon - portal.ucsf.bkslab.org - cluster gateway from public internet
  • gamma - sun grid engine qmaster
  • phi - mysqld/excipients
  • psi for using the PG fortran compiler
  • ppilot is at http://zeta:9944/ - you must be on the Cluster 2 private network to use it
  • Tau is the web server for ZINC.
  • no other specia
  • zeta - Psicquic/pipeline pilot
  • Sigma can definitely go off and stay off. It was planned for a fingerprinting server, never done.

hypervisor 'aleph2' hosts:

  • alpha7 - This is to be the future architecture VM of the cluster (DNS/DHCP/Puppet/Foreman/Ansible). CentOS7.
  • kappa is licensing. ask me. ("i have no clue what this licenses. Turned off." - ben)
  • rho contains this wiki and also bkslab.org

Notes

  • to get from SVN, use svn ssh+svn

Hardware and physical location

  • 1856 cpu-cores for queued jobs
  • 128 cpu-cores for infrastructure, databases, management and ad hoc jobs.
  • 788 TB of high quality NFS-available disk
  • Our policy is to have 4 GB RAM per cpu-core unless otherwise specified.
  • Machines older than 3 years may have 2 GB/core, and those older than 6 years may have 1 GB/core.
  • Cluster 2 is currently stored entirely in Rack 0 which is in Row 0, Position 4 of BH101 at 1700 4th St (Byers Hall).
  • Central services are on he, aleph2, and bet
  • CPU
    • 3 Silicon Mechanics Rackform nServ A4412.v4 s, each comprising 4 computers of 32 cpu-cores for a total of 384 cpu-cores.
    • 1 Dell C6145 with 128 cores.
    • An HP DL165G7 (24-way) is sgehead
    • more computers to come from Cluster 0, when Cluster 2 is fully ready.
  • DISK
    • HP disks - 40 TB RAID6 SAS (new in 2014)
    • Silicon Mechanics NAS - 77 TB RAID6 SAS (new in 2014)
    • An HP DL160G5 and an MSA60 with 12 TB SAS (disks new in 2014)

Naming convention

  • The Hebrew alphabet is used for physical machines
  • Greek letters for VMs.
  • Functions (e.g. sgehead) are aliases (CNAMEs).
  • compbio.ucsf.edu and ucsf.bkslab.org domains both supported.

Disk organization

  • shin aka nas1 mounted as /nfs/db/ = 72 TB SAS RAID6. NOTE: ON BAND: $ sudo /usr/local/RAID\ Web\ Console\ 2/startupui.sh to interact with raid controller. username: raid. pw: c2 pass
  • bet aka happy, internal: /nfs/store and psql (temp) as 10 TB SATA RAID10
  • elated on happy: /nfs/work only as 36 TB SAS RAID6
  • dalet exports /nfs/home & /nfs/home2
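
To see which of these NFS paths are visible from the machine you are logged in to, a small check along the following lines may help. It is only a sketch: the function name `check_mounts` is made up, and the test confirms that each path exists as a directory, not that it is actually an NFS mount.

```shell
#!/bin/sh
# Hypothetical helper: reports whether each given path exists as a directory.
# The paths passed below are the mount points listed above; whether a given
# node has them is an assumption, so missing paths are reported, not fatal.
check_mounts() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            echo "ok      $dir"
        else
            echo "missing $dir"
        fi
    done
}

check_mounts /nfs/db /nfs/store /nfs/work /nfs/home /nfs/home2
```

For sizes and free space, `df -h` on the same paths gives more detail where they are present.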

Special purpose machines - all .ucsf.bkslab.org

  • sgehead aka gimel.cluster - nearly the only machine you'll need.
  • psi.cluster - PG fortran compiler (a machine with only a .cluster address has no public address)
  • portal aka epsilon - secure access
  • zeta.cluster - Pipeline Pilot
  • shin, bet, and dalet are the three NFS servers. You should not need to log in to them.
on teague desktop, /usr/local/RAID Web Console 2/startupui.sh 
connect to shin on public network
raid /  C2 on shin
  • mysql1.cluster - general purpose mysql server (like former scratch)
  • pg1.cluster - general purpose postgres server
  • fprint.cluster - fingerprinting server

Table of Server Information

SLURM

Server Name   Operating System   Functions
epyc          Rocky 8            Apache/HTTPD Webserver + Proxy
epyc2         Rocky 8
epyc-A40      Rocky 8
n-1-101       Centos 7
n-1-105       Centos 7
n-1-124       Centos 7
n-1-126       Centos 7
n-1-141       Centos 7
n-1-16        Centos 7
n-1-17        Centos 7
n-1-18        Centos 7
n-1-19        Centos 7
n-1-20        Centos 7
n-1-21        Centos 7
n-1-28        Centos 7
n-1-38        Centos 7
n-5-13        Centos 7
n-5-14        Centos 7
n-5-15        Centos 7
n-5-32        Centos 7
n-5-33        Centos 7
n-5-34        Centos 7
n-5-35        Centos 7
n-9-19        Centos 7
n-9-20        Centos 7
n-9-21        Centos 7
n-9-22        Centos 7
n-9-34        Centos 7
n-9-36        Centos 7
n-9-38        Centos 7
qof           Centos 7
shin          Centos 7

SGE

Server Name   Operating System   Functions
gimel         Centos 6           In-person Login Node
he            Centos 6           Hosts VMs vital to the operation of Cluster 2.
het           Centos 6
n-0-129       Centos 6
n-0-136       Centos 6
n-0-139       Centos 6
n-0-30        Centos 6
n-0-37        Centos 6
n-0-39        Centos 6
n-8-27        Centos 6
n-9-23        Centos 6


About our cluster