= Introduction =
Cluster 2 is the most modern cluster that the Irwin Lab maintains.
{{TOCright}}
 
(Edited May 6, 2024)


= Priorities and Policies =
* [[Disk space policy]]
* [[Backups]] policy.
* [[Portal system]] for off-site ssh cluster access.
* Get a [[Cluster 2 account]] and get started


= How to Login =
=== Off Site ===
* Off-site access requires an SSH key; contact the sysadmins for help. A key-generation sketch follows the login command below.
<source>ssh <user>@portal3.compbio.ucsf.edu</source>
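If you do not yet have a key, here is a minimal sketch of generating one and sending the public half to the sysadmins. An RSA key is shown for compatibility with the older CentOS nodes; the key type and comment are up to you.
<source>
# Generate a key pair; the private key stays on your machine
ssh-keygen -t rsa -b 4096 -C "<user>@ucsf.edu"

# Send only the *public* key to the sysadmins
cat ~/.ssh/id_rsa.pub
</source>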
 
=== On Site ===
<source>ssh -o HostKeyAlgorithms=+ssh-rsa <user>@gimel.compbio.ucsf.edu</source>
 
= Where to Submit Jobs and How (SGE/SLURM) =
=== SGE ===
Submit SGE jobs on the machine called '''gimel.compbio.ucsf.edu''', aka '''gimel'''.
* Refer to the pages below for the basic commands and examples; a minimal submission sketch follows this list.
** [[SGE Cluster Docking]] (replace '''sgehead.compbio.ucsf.edu''' with '''gimel''').
** [[SGE idioms]]
** [[Using SGE cluster]]
* For sysadmins
** [[SGE_notes]]
** [[Sun Grid Engine (SGE)]]
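A minimal submission sketch (the script contents and runtime request are illustrative, not lab policy):
<source>
#!/bin/bash
# myjob.sh - minimal SGE job script (illustrative only)
#$ -S /bin/bash             # interpret the job with bash
#$ -cwd                     # start in the submission directory
#$ -o myjob.out             # stdout file
#$ -e myjob.err             # stderr file
#$ -l h_rt=01:00:00         # hard runtime limit (adjust as needed)

echo "Running on $(hostname)"
</source>
Submit and monitor from gimel:
<source>
qsub myjob.sh
qstat -u $USER
</source>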
 
=== SLURM ===
Submit SLURM jobs on '''gimel2'''.
* Refer to the pages below for basic guides; a minimal submission sketch follows this list.
** [[Slurm]]
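A minimal submission sketch (no partition is specified because the local partition layout is not documented on this page):
<source>
#!/bin/bash
# myjob.slurm - minimal SLURM job script (illustrative only)
#SBATCH --job-name=myjob
#SBATCH --output=myjob.%j.out
#SBATCH --time=01:00:00
#SBATCH --ntasks=1

echo "Running on $(hostname)"
</source>
Submit and monitor from gimel2:
<source>
sbatch myjob.slurm
squeue -u $USER
</source>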
 
= Special machines =
Normally, you will just ssh to sgehead (aka gimel) from portal.ucsf.bkslab.org, where you can do almost anything, including job management. A few things require licensing and must be done on special machines.
 
Hypervisor 'he' hosts:
* alpha - critical; runs Foreman, DNS, DHCP, and other important services
* beta - runs LDAP authentication
* epsilon - portal.ucsf.bkslab.org - cluster gateway from the public internet
* gamma - Sun Grid Engine qmaster
* phi - mysqld/excipients
* psi - for using the PG Fortran compiler
* ppilot is at http://zeta:9944/ - you must be on the Cluster 2 private network to use it
* tau - web server for ZINC; no other special services
* zeta - PSICQUIC / Pipeline Pilot
* sigma - can go off and stay off; it was planned as a fingerprinting server but never set up
 
Hypervisor 'aleph2' hosts:
* alpha7 - intended to be the future architecture VM of the cluster (DNS/DHCP/Puppet/Foreman/Ansible); runs CentOS 7.
* kappa - licensing; ask me. ("I have no clue what this licenses. Turned off." - Ben)
* rho - hosts this wiki and bkslab.org
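If the hypervisors run libvirt/KVM (an assumption; the virtualization stack is not stated on this page), sysadmins can list the guest VMs on a hypervisor with:
<source>
# On he or aleph2, assuming libvirt/KVM is in use
virsh list --all
</source>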
 
= Notes =
* To check out code from SVN, use the svn+ssh protocol, for example:
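In the sketch below, REPO_HOST and /path/to/repo are placeholders; substitute the real repository location.
<source>
# Check out a working copy over svn+ssh
svn checkout svn+ssh://<user>@REPO_HOST/path/to/repo
</source>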
 
= Hardware and physical location =
* 1856 cpu-cores for queued jobs
* 128 cpu-cores for infrastructure, databases, management and ad hoc jobs.
* 788 TB of high quality NFS-available disk
* Our policy is to have 4 GB RAM per cpu-core unless otherwise specified.
* Machines older than 3 years may have 2 GB/core; machines older than 6 years may have 1 GB/core.
* Cluster 2 is currently stored entirely in Rack 0, which is in Row 0, Position 4 of BH101 at 1700 4th St (Byers Hall).
* Central services are on he, aleph2, and bet.
* CPU
** 3 Silicon Mechanics Rackform nServ A4412.v4 systems, each comprising 4 computers of 32 cpu-cores, for a total of 384 cpu-cores.
** 1 Dell C6145 with 128 cores.
** An HP DL165G7 (24-way) is sgehead.
** More computers to come from Cluster 0 when Cluster 2 is fully ready.
* DISK
** HP disks - 40 TB RAID6 SAS (new in 2014)
** Silicon Mechanics NAS - 77 TB RAID6 SAS (new in 2014)
** An HP DL160G5 and an MSA60 with 12 TB SAS (disks new in 2014)
 
= Naming convention =
* The Hebrew alphabet is used for physical machines.
* Greek letters are used for VMs.
* Functions (e.g. sgehead) are aliases (CNAMEs).
* Both the compbio.ucsf.edu and ucsf.bkslab.org domains are supported.
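To check which physical machine a functional alias currently points to, you can look up the CNAME; a quick sketch (assuming you are on a network that resolves these names):
<source>
# Either of these should show the CNAME target, e.g. sgehead -> gimel
host sgehead.compbio.ucsf.edu
dig +short CNAME sgehead.compbio.ucsf.edu
</source>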


= Disk organization =
* shin aka nas1, mounted as /nfs/db/ = 72 TB SAS RAID6. NOTE: ON BAND: run sudo /usr/local/RAID\ Web\ Console\ 2/startupui.sh to interact with the RAID controller (username: raid, password: c2 pass).
* bet aka happy, internal: /nfs/store and psql (temp) as 10 TB SATA RAID10
* elated on happy: /nfs/work only, as 36 TB SAS RAID6
* dalet exports /nfs/home & /nfs/home2
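To check which of these exports are mounted on a given node and how full they are, a quick sketch (paths taken from the list above):
<source>
df -h /nfs/db /nfs/store /nfs/work /nfs/home
# list all NFS mounts on this machine
mount -t nfs,nfs4
</source>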


= Special purpose machines - all .ucsf.bkslab.org =

* sgehead aka gimel.cluster - nearly the only machine you'll need.
* psi.cluster - PG Fortran compiler (a .cluster-only address means the machine has no public address).
* portal aka epsilon - secure access.
* zeta.cluster - Pipeline Pilot.
* shin, bet, and dalet are the three NFS servers. You should not need to log in to them.
** To reach the RAID console for shin: on the Teague desktop, run /usr/local/RAID Web Console 2/startupui.sh, connect to shin on the public network, and log in as raid / C2.
* mysql1.cluster - general purpose MySQL server (like the former scratch); see the connection sketch below.
* pg1.cluster - general purpose PostgreSQL server.
* fprint.cluster - fingerprinting server.
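A minimal sketch of connecting to the general-purpose database servers from inside the cluster; the database name and your credentials are placeholders that depend on what has been set up for you:
<source>
# MySQL on mysql1.cluster (credentials and database are placeholders)
mysql -h mysql1.cluster -u $USER -p

# PostgreSQL on pg1.cluster (database name is a placeholder)
psql -h pg1.cluster -U $USER -d <database>
</source>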
 
= Table of Server Information =
=== SLURM ===
{| class="wikitable"
|-
!Server Name
!Operating System
!Functions
|-
| epyc || Rocky 8 || Apache/HTTPD Webserver + Proxy
|-
| epyc2 || Rocky 8 ||
|-
| epyc-A40 || Rocky 8 ||
|-
| n-1-101 || Centos 7 ||
|-
| n-1-105 || Centos 7 ||
|-
| n-1-124 || Centos 7 ||
|-
| n-1-126 || Centos 7 ||
|-
| n-1-141 || Centos 7 ||
|-
| n-1-16 || Centos 7 ||
|-
| n-1-17 || Centos 7 ||
|-
| n-1-18 || Centos 7 ||
|-
| n-1-19 || Centos 7 ||
|-
| n-1-20 || Centos 7 ||
|-
| n-1-21 || Centos 7 ||
|-
| n-1-28 || Centos 7 ||
|-
| n-1-38 || Centos 7 ||
|-
| n-5-13 || Centos 7 ||
|-
| n-5-14 || Centos 7 ||
|-
| n-5-15 || Centos 7 ||
|-
| n-5-32 || Centos 7 ||
|-
| n-5-33 || Centos 7 ||
|-
| n-5-34 || Centos 7 ||
|-
| n-5-35 || Centos 7 ||
|-
| n-9-19 || Centos 7 ||
|-
| n-9-20 || Centos 7 ||
|-
| n-9-21 || Centos 7 ||
|-
| n-9-22 || Centos 7 ||
|-
| n-9-34 || Centos 7 ||
|-
| n-9-36 || Centos 7 ||
|-
| n-9-38 || Centos 7 ||
|-
| qof || Centos 7 ||
|-
| shin || Centos 7 ||
|-
|}
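The table above is a static snapshot; to see the live SLURM node list and states, query the scheduler from gimel2, e.g.:
<source>
sinfo -N -l                 # node-oriented view with state, CPUs, and memory
scontrol show node n-1-16   # full details for one node (name taken from the table)
</source>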
 
=== SGE ===
{| class="wikitable"
|-
!Server Name
!Operating System
!Functions
|-
| gimel || Centos 6 || On-site login node
|-
| he || Centos 6 || Hosts vital VMs needed for Cluster 2 to function
|-
| het || Centos 6 ||
|-
| n-0-129 || Centos 6 ||
|-
| n-0-136 || Centos 6 ||
|-
| n-0-139 || Centos 6 ||
|-
| n-0-30 || Centos 6 ||
|-
| n-0-37 || Centos 6 ||
|-
| n-0-39 || Centos 6 ||
|-
| n-8-27 || Centos 6 ||
|-
| n-9-23 || Centos 6 ||
|-
|}
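Likewise, the live SGE host inventory can be checked from gimel, e.g.:
<source>
qhost             # list execution hosts known to SGE, with load and memory
qhost -h n-0-30   # details for a single host (name taken from the table)
</source>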
 


[[About our cluster]]

[[Category:Internal]]
[[Category:UCSF]]
[[Category:Hardware]]
