Cluster 2

Latest revision as of 21:53, 27 June 2024

Introduction

Cluster 2 is the most modern cluster the Irwin Lab maintains.

(Edited on May 6, 2024)

Priorities and Policies

  • Portal system for off-site ssh cluster access.
  • Get a Cluster 2 account and get started.

How to Login

Off Site

  • Off-site access requires an SSH key. Contact the sysadmins for help.
ssh <user>@portal3.compbio.ucsf.edu

On Site

ssh -o HostKeyAlgorithms=+ssh-rsa <user>@gimel.compbio.ucsf.edu
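
The two login commands above differ only in host and options, so they can be wrapped in a small helper. This is a minimal sketch, not an official lab script: the function name login_cmd and the example user alice are hypothetical, and the function only prints the command rather than running it.

```shell
# Hypothetical helper: print the ssh command for a user and location.
# The hosts and the HostKeyAlgorithms option are the ones listed above.
login_cmd() {
    user="$1"
    where="$2"
    case "$where" in
        offsite) echo "ssh ${user}@portal3.compbio.ucsf.edu" ;;
        onsite)  echo "ssh -o HostKeyAlgorithms=+ssh-rsa ${user}@gimel.compbio.ucsf.edu" ;;
        *)       echo "usage: login_cmd <user> offsite|onsite" >&2; return 1 ;;
    esac
}

# Example with a placeholder user name.
login_cmd alice onsite
```

Off-site logins still need an SSH key registered with the sysadmins; the helper does not handle keys.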

Where to Submit Jobs and How (SGE/SLURM)

SGE

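Submit SGE jobs on gimel.compbio.ucsf.edu (aka gimel).

  • Refer to the pages below for the basic commands and examples.
    • SGE Cluster Docking (replace sgehead.compbio.ucsf.edu with gimel)
    • SGE idioms
    • Using SGE cluster
  • For sysadmins
    • SGE_notes
    • Sun Grid Engine (SGE)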

SLURM

Submit SLURM jobs on gimel2.

  • Refer to the page below for a basic guide:
    • Slurm
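
Since the scheduler depends on which login node you are on, a script can pick the submit command from the host name. The sketch below assumes the standard submit commands for each scheduler (qsub for SGE, sbatch for SLURM); the function name submit_cmd is hypothetical and the function only prints the command name.

```shell
# Hypothetical helper: map a login node name to its scheduler's submit command.
# gimel2 must be matched before gimel because the gimel* pattern also matches gimel2.
submit_cmd() {
    case "$1" in
        gimel2*) echo "sbatch" ;;   # SLURM
        gimel*)  echo "qsub" ;;     # SGE
        *)       echo "unknown host: $1" >&2; return 1 ;;
    esac
}

submit_cmd gimel.compbio.ucsf.edu
submit_cmd gimel2
```

Job script options differ between the two schedulers, so the same script will generally not work unchanged on both.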

Special machines

Normally, you will just ssh to sgehead (aka gimel) from portal.ucsf.bkslab.org, where you can do almost anything, including job management. A few things require licensing and must be done on special machines.

hypervisor 'he' hosts:

  • alpha - which is critical and runs foreman, DNS, DHCP, and other important services
  • beta - which runs LDAP authentication
  • epsilon - portal.ucsf.bkslab.org - cluster gateway from public internet
  • gamma - sun grid engine qmaster
  • phi - mysqld/excipients
  • psi for using the PG fortran compiler
  • ppilot is at http://zeta:9944/ - you must be on the Cluster 2 private network to use it
  • Tau is the web server for ZINC,
  • no other specia
  • zeta - Psicquic/pipeline pilot
  • Sigma can definitely go off and stay off. It was planned for a fingerprinting server, never done.

hypervisor 'aleph2' hosts:

  • alpha7 - This is to be the future architecture VM of the cluster (DNS/DHCP/Puppet/Foreman/Ansible). CentOS7.
  • kappa is licensing. Ask me. ("I have no clue what this licenses. Turned off." - ben)
  • rho contains this wiki and also bkslab.org

Notes

  • To check out from SVN, use the svn+ssh:// URL scheme.

Hardware and physical location

  • 1856 cpu-cores for queued jobs
  • 128 cpu-cores for infrastructure, databases, management and ad hoc jobs.
  • 788 TB of high quality NFS-available disk
  • Our policy is to have 4 GB RAM per cpu-core unless otherwise specified.
  • Machines older than 3 years may have 2 GB/core, and machines older than 6 years may have 1 GB/core.
  • Cluster 2 is currently stored entirely in Rack 0 which is in Row 0, Position 4 of BH101 at 1700 4th St (Byers Hall).
  • Central services are on he, aleph2, and bet.
  • CPU
    • 3 Silicon Mechanics Rackform nServ A4412.v4s, each comprising 4 computers of 32 cpu-cores, for a total of 384 cpu-cores.
    • 1 Dell C6145 with 128 cores.
    • An HP DL165G7 (24-way) is sgehead
    • more computers to come from Cluster 0, when Cluster 2 is fully ready.
  • DISK
    • HP disks - 40 TB RAID6 SAS (new in 2014)
    • Silicon Mechanics NAS - 77 TB RAID6 SAS (new in 2014)
    • A HP DL160G5 and an MSA60 with 12 TB SAS (disks new in 2014)
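
The core count quoted for the Silicon Mechanics machines follows directly from the figures in the list, and the RAM policy gives the memory expected per computer. The sketch below is only a worked check of that arithmetic; it assumes these computers follow the default 4 GB/core policy, which the list does not state explicitly.

```shell
# Figures taken from the hardware list above.
chassis=3
computers_per_chassis=4
cores_per_computer=32
gb_per_core=4   # default RAM policy; assumed to apply to these machines

silmech_cores=$((chassis * computers_per_chassis * cores_per_computer))
ram_per_computer_gb=$((cores_per_computer * gb_per_core))

echo "Silicon Mechanics cores: ${silmech_cores}"                   # 384
echo "RAM per 32-core computer at policy: ${ram_per_computer_gb} GB"   # 128
```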

Naming convention

  • The Hebrew alphabet is used for physical machines
  • Greek letters for VMs.
  • Functions (e.g. sgehead) are aliases (CNAMEs).
  • compbio.ucsf.edu and ucsf.bkslab.org domains both supported.
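
The convention makes it possible to tell a physical machine from a VM by name alone. The sketch below illustrates this with the host names that appear on this page; the function name host_kind is hypothetical, the name lists are not exhaustive, and names are assumed to be lowercase.

```shell
# Hypothetical helper: classify a host name by the naming convention.
host_kind() {
    # Strip a trailing numeric suffix, such as the "2" in aleph2 or the "7" in alpha7.
    base=$(printf '%s' "$1" | sed 's/[0-9]*$//')
    case "$base" in
        # Hebrew letters: physical machines
        aleph|bet|gimel|dalet|he|het|qof|shin) echo "physical" ;;
        # Greek letters: VMs
        alpha|beta|gamma|epsilon|zeta|kappa|rho|sigma|tau|phi|psi) echo "vm" ;;
        *) echo "unknown" ;;
    esac
}

host_kind aleph2   # physical
host_kind alpha7   # vm
```

Function aliases such as sgehead are CNAMEs and fall through to "unknown" here; resolving them would need a DNS lookup.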

Disk organization

  • shin aka nas1 mounted as /nfs/db/ = 72 TB SAS RAID6. NOTE: ON BAND: $ sudo /usr/local/RAID\ Web\ Console\ 2/startupui.sh to interact with raid controller. username: raid. pw: c2 pass
  • bet aka happy, internal: /nfs/store and psql (temp) as 10 TB SATA RAID10
  • elated on happy: /nfs/work only as 36 TB SAS RAID6
  • dalet exports /nfs/home & /nfs/home2
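
A quick way to see whether the NFS paths listed above are visible on a given machine is to loop over them. This is a minimal sketch: the function name check_mount is hypothetical, and the test only checks that the directory exists, not that the NFS export is actually mounted.

```shell
# Hypothetical helper: report whether a directory exists on this machine.
check_mount() {
    if [ -d "$1" ]; then
        echo "$1: present"
    else
        echo "$1: missing"
    fi
}

# Paths taken from the disk organization list above.
for m in /nfs/db /nfs/store /nfs/work /nfs/home /nfs/home2; do
    check_mount "$m"
done
```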

Special purpose machines - all .ucsf.bkslab.org

  • sgehead aka gimel.cluster - nearly the only machine you'll need.
  • psi.cluster - PG fortran compiler (a machine with only a .cluster address has no public address)
  • portal aka epsilon - secure access
  • zeta.cluster - Pipeline Pilot
  • shin, bet, and dalet are the three NFS servers. You should not need to log in to them.
    • On teague desktop, run /usr/local/RAID Web Console 2/startupui.sh
    • Connect to shin on the public network
    • raid / C2 on shin
  • mysql1.cluster - general purpose mysql server (like former scratch)
  • pg1.cluster - general purpose postgres server
  • fprint.cluster - fingerprinting server

Table of Server Information

SLURM

Server Name | Operating System | Functions
epyc        | Rocky 8          | Apache/HTTPD Webserver + Proxy
epyc2       | Rocky 8          |
epyc-A40    | Rocky 8          |
n-1-101     | Centos 7         |
n-1-105     | Centos 7         |
n-1-124     | Centos 7         |
n-1-126     | Centos 7         |
n-1-141     | Centos 7         |
n-1-16      | Centos 7         |
n-1-17      | Centos 7         |
n-1-18      | Centos 7         |
n-1-19      | Centos 7         |
n-1-20      | Centos 7         |
n-1-21      | Centos 7         |
n-1-28      | Centos 7         |
n-1-38      | Centos 7         |
n-5-13      | Centos 7         |
n-5-14      | Centos 7         |
n-5-15      | Centos 7         |
n-5-32      | Centos 7         |
n-5-33      | Centos 7         |
n-5-34      | Centos 7         |
n-5-35      | Centos 7         |
n-9-19      | Centos 7         |
n-9-20      | Centos 7         |
n-9-21      | Centos 7         |
n-9-22      | Centos 7         |
n-9-34      | Centos 7         |
n-9-36      | Centos 7         |
n-9-38      | Centos 7         |
qof         | Centos 7         |
shin        | Centos 7         |

SGE

Server Name | Operating System | Functions
gimel       | Centos 6         | In-person Login Node
he          | Centos 6         | Hosts VMs vital to Cluster 2's operation
het         | Centos 6         |
n-0-129     | Centos 6         |
n-0-136     | Centos 6         |
n-0-139     | Centos 6         |
n-0-30      | Centos 6         |
n-0-37      | Centos 6         |
n-0-39      | Centos 6         |
n-8-27      | Centos 6         |
n-9-23      | Centos 6         |


About our cluster