Cluster 2: Difference between revisions
(asdf) |
Jgutierrez6 (talk | contribs) m (→SGE) |
||
(26 intermediate revisions by 4 users not shown) | |||
Line 1: | Line 1: | ||
= Introduction = | |||
Cluster 2 is the most modern cluster Irwin Lab maintains. | |||
(Edited on May 06, 2024) | |||
= Priorities and Policies = | = Priorities and Policies = | ||
Line 9: | Line 10: | ||
* [[Portal system]] for off-site ssh cluster access. | * [[Portal system]] for off-site ssh cluster access. | ||
* Get a [[Cluster 2 account]] and get started | * Get a [[Cluster 2 account]] and get started | ||
= How to Login = | |||
=== Off Site === | |||
* Off site access requires an SSH key. Contact sysadmins for help. | |||
<source>ssh <user>@portal3.compbio.ucsf.edu</source> | |||
=== On Site === | |||
<source>ssh -o HostKeyAlgorithms=+ssh-rsa <user>@gimel.compbio.ucsf.edu</source> | |||
= Where to Submit Jobs and How (SGE/SLURM) = | |||
=== SGE === | |||
Submit SGE jobs on the machine '''gimel.compbio.ucsf.edu''' (aka '''gimel'''). | |||
* Refer to the pages below for the basic commands/examples. | |||
** [[SGE Cluster Docking]], replace '''sgehead.compbio.ucsf.edu''' with '''gimel'''. | |||
** [[SGE idioms]] | |||
** [[Using SGE cluster]] | |||
* For sysadmins | |||
** [[SGE_notes]] | |||
** [[Sun Grid Engine (SGE)]] | |||
=== SLURM === | |||
Submit SLURM jobs on '''gimel2'''. | |||
* Refer to the pages below for basic guides | |||
** [[Slurm]] | |||
= Special machines = | = Special machines = | ||
Normally, you will just ssh to sgehead aka gimel from portal.ucsf.bkslab.org where you can do almost anything, including job management. A few things require licensing and must be done on special machines. | Normally, you will just ssh to sgehead aka gimel from portal.ucsf.bkslab.org where you can do almost anything, including job management. A few things require licensing and must be done on special machines. | ||
hypervisor 'he' hosts: | |||
* alpha - which is critical and runs foreman, DNS, DHCP, and other important services | |||
* beta - which runs LDAP authentication | |||
* epsilon - portal.ucsf.bkslab.org - cluster gateway from public internet | |||
* gamma - sun grid engine qmaster | |||
* phi - mysqld/excipients | |||
* psi for using the PG fortran compiler | * psi for using the PG fortran compiler | ||
* ppilot is at http://zeta:9944/ - you must be on the Cluster 2 private network to use it | * ppilot is at http://zeta:9944/ - you must be on the Cluster 2 private network to use it | ||
* no other | * Tau is the web server for ZINC, | ||
* no other specia | |||
* zeta - Psicquic/pipeline pilot | |||
* Sigma can definitely go off and stay off. It was planned for a fingerprinting server, never done. | |||
hypervisor 'aleph2' hosts: | |||
* alpha7 - This is to be the future architecture VM of the cluster (DNS/DHCP/Puppet/Foreman/Ansible). CentOS7. | |||
* kappa is licensing. ask me. ("i have no clue what this licenses. Turned off." - ben) | |||
* rho contains this wiki and also bkslab.org | |||
= Notes = | = Notes = | ||
Line 21: | Line 60: | ||
= Hardware and physical location = | = Hardware and physical location = | ||
* | * 1856 cpu-cores for queued jobs | ||
* 128 cpu-cores for infrastructure, databases, management and ad hoc jobs. | * 128 cpu-cores for infrastructure, databases, management and ad hoc jobs. | ||
* | * 788 TB of high quality NFS-available disk | ||
* Our policy is to have 4 GB RAM per cpu-core unless otherwise specified. | * Our policy is to have 4 GB RAM per cpu-core unless otherwise specified. | ||
* Machines older than 3 years may have 2GB/core and 6 years old have 1GB/core. | * Machines older than 3 years may have 2GB/core and 6 years old have 1GB/core. | ||
* Cluster 2 is currently stored entirely in Rack 0 which is in Row 0, Position 4 of BH101 at 1700 4th St (Byers Hall). | * Cluster 2 is currently stored entirely in Rack 0 which is in Row 0, Position 4 of BH101 at 1700 4th St (Byers Hall). | ||
* Central services are on he, aleph2, and bet | |||
* Central services are on | |||
* CPU | * CPU | ||
** 3 Silicon Mechanics Rackform nServ A4412.v4 s, each comprising 4 computers of 32 cpu-cores for a total of 384 cpu-cores. | ** 3 Silicon Mechanics Rackform nServ A4412.v4 s, each comprising 4 computers of 32 cpu-cores for a total of 384 cpu-cores. | ||
Line 48: | Line 84: | ||
= Disk organization = | = Disk organization = | ||
* shin aka nas1 mounted as /nfs/db/ = 72 TB SAS RAID6 | * shin aka nas1 mounted as /nfs/db/ = 72 TB SAS RAID6. NOTE: ON BAND: $ sudo /usr/local/RAID\ Web\ Console\ 2/startupui.sh to interact with raid controller. username: raid. pw: c2 pass | ||
* bet aka happy, internal: /nfs/store and psql (temp) as 10 TB SATA RAID10 | * bet aka happy, internal: /nfs/store and psql (temp) as 10 TB SATA RAID10 | ||
* elated on happy: /nfs/work only as 36 TB SAS RAID6 | * elated on happy: /nfs/work only as 36 TB SAS RAID6 | ||
* | * dalet exports /nfs/home & /nfs/home2 | ||
= Special purpose machines - all .ucsf.bkslab.org = | = Special purpose machines - all .ucsf.bkslab.org = | ||
Line 60: | Line 96: | ||
* zeta.cluster - Pipeline Pilot | * zeta.cluster - Pipeline Pilot | ||
* shin, bet, and dalet are the three NFS servers. You should not need to log in to them. | * shin, bet, and dalet are the three NFS servers. You should not need to log in to them. | ||
on teague desktop, /usr/local/RAID Web Console 2/startupui.sh | |||
connect to shin on public network | |||
raid / C2 on shin | |||
* mysql1.cluster - general purpose mysql server (like former scratch) | * mysql1.cluster - general purpose mysql server (like former scratch) | ||
* pg1.cluster - general purpose postgres server | * pg1.cluster - general purpose postgres server | ||
* fprint.cluster - fingerprinting server | * fprint.cluster - fingerprinting server | ||
= Table of Server Information = | |||
=== SLURM === | |||
{| class="wikitable" | |||
|- | |||
!Server Name | |||
!Operating System | |||
!Functions | |||
|- | |||
| epyc || Rocky 8 || Apache/HTTPD Webserver + Proxy | |||
|- | |||
| epyc2 || Rocky 8 || | |||
|- | |||
| epyc-A40 || Rocky 8 || | |||
|- | |||
| n-1-101 || Centos 7 || | |||
|- | |||
| n-1-105 || Centos 7 || | |||
|- | |||
| n-1-124 || Centos 7 || | |||
|- | |||
| n-1-126 || Centos 7 || | |||
|- | |||
| n-1-141 || Centos 7 || | |||
|- | |||
| n-1-16 || Centos 7 || | |||
|- | |||
| n-1-17 || Centos 7 || | |||
|- | |||
| n-1-18 || Centos 7 || | |||
|- | |||
| n-1-19 || Centos 7 || | |||
|- | |||
| n-1-20 || Centos 7 || | |||
|- | |||
| n-1-21 || Centos 7 || | |||
|- | |||
| n-1-28 || Centos 7 || | |||
|- | |||
| n-1-38 || Centos 7 || | |||
|- | |||
| n-5-13 || Centos 7 || | |||
|- | |||
| n-5-14 || Centos 7 || | |||
|- | |||
| n-5-15 || Centos 7 || | |||
|- | |||
| n-5-32 || Centos 7 || | |||
|- | |||
| n-5-33 || Centos 7 || | |||
|- | |||
| n-5-34 || Centos 7 || | |||
|- | |||
| n-5-35 || Centos 7 || | |||
|- | |||
| n-9-19 || Centos 7 || | |||
|- | |||
| n-9-20 || Centos 7 || | |||
|- | |||
| n-9-21 || Centos 7 || | |||
|- | |||
| n-9-22 || Centos 7 || | |||
|- | |||
| n-9-34 || Centos 7 || | |||
|- | |||
| n-9-36 || Centos 7 || | |||
|- | |||
| n-9-38 || Centos 7 || | |||
|- | |||
| qof || Centos 7 || | |||
|- | |||
| shin || Centos 7 || | |||
|- | |||
|} | |||
=== SGE === | |||
{| class="wikitable" | |||
|- | |||
!Server Name | |||
!Operating System | |||
!Functions | |||
|- | |||
| gimel || Centos 6 || In-person Login Node | |||
|- | |||
| he || Centos 6 || Hosts VMs vital to the functioning of Cluster 2. | |||
|- | |||
| het || Centos 6 || | |||
|- | |||
| n-0-129 || Centos 6 || | |||
|- | |||
| n-0-136 || Centos 6 || | |||
|- | |||
| n-0-139 || Centos 6 || | |||
|- | |||
| n-0-30 || Centos 6 || | |||
|- | |||
| n-0-37 || Centos 6 || | |||
|- | |||
| n-0-39 || Centos 6 || | |||
|- | |||
| n-8-27 || Centos 6 || | |||
|- | |||
| n-9-23 || Centos 6 || | |||
|- | |||
|} | |||
[[About our cluster]] | [[About our cluster]] | ||
Line 69: | Line 216: | ||
[[Category:Internal]] | [[Category:Internal]] | ||
[[Category:UCSF]] | [[Category:UCSF]] | ||
[[Category:Hardware]] |
Latest revision as of 21:53, 27 June 2024
Introduction
Cluster 2 is the most modern cluster Irwin Lab maintains.
(Edited on May 06, 2024)
Priorities and Policies
- Lab Security Policy
- Disk space policy
- Backups policy.
- Portal system for off-site ssh cluster access.
- Get a Cluster 2 account and get started
How to Login
Off Site
- Off site access requires an SSH key. Contact sysadmins for help.
ssh <user>@portal3.compbio.ucsf.edu
On Site
ssh -o HostKeyAlgorithms=+ssh-rsa <user>@gimel.compbio.ucsf.edu
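The two login commands above can be combined into a single hop with OpenSSH's `-J` (ProxyJump) option. The following is a minimal sketch, not a confirmed site procedure: it assumes that portal3 accepts being used as a jump host, that your username is the same on both machines, and that your client is OpenSSH 7.3 or newer. The helper name `cluster2_ssh_cmd` is hypothetical, and it only prints the command so you can inspect it before running it.

```shell
# Hypothetical helper: print an ssh command that reaches gimel through portal3.
# Assumption: the same username is valid on both hosts.
cluster2_ssh_cmd() {
    user="$1"
    # -J routes the connection through the portal; the -o option on the
    # command line applies to the destination host (gimel), not the jump host.
    printf 'ssh -J %s@portal3.compbio.ucsf.edu -o HostKeyAlgorithms=+ssh-rsa %s@gimel.compbio.ucsf.edu\n' \
        "$user" "$user"
}
```

Calling `cluster2_ssh_cmd yourname` prints the full command; off-site access still requires the SSH key mentioned above.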
Where to Submit Jobs and How (SGE/SLURM)
SGE
Submit SGE jobs on the machine gimel.compbio.ucsf.edu (aka gimel).
- Refer to the pages below for the basic commands/examples.
- SGE Cluster Docking, replace sgehead.compbio.ucsf.edu with gimel.
- SGE idioms
- Using SGE cluster
- For sysadmins
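For orientation, an SGE job script might look like the sketch below. It is an illustration only: the job name and the array range are made up, and the linked pages remain the reference for the queues and resource requests this cluster actually expects.

```shell
#!/bin/bash
#$ -S /bin/bash        # run the job under bash
#$ -cwd                # start in the directory the job was submitted from
#$ -N c2_hello         # hypothetical job name
#$ -j y                # merge stderr into stdout
#$ -t 1-3              # array job with tasks 1 to 3 (illustrative range)

# SGE sets SGE_TASK_ID for array tasks; fall back to 1 when run by hand.
task="${SGE_TASK_ID:-1}"
echo "task ${task} on $(uname -n)"
```

Saved as, say, `hello_sge.sh`, it would be submitted from gimel with `qsub hello_sge.sh` and monitored with `qstat`.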
SLURM
Submit SLURM jobs on gimel2.
- Refer to the pages below for basic guides
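As a rough starting point, a SLURM batch script could look like the following sketch. The job name, time limit, and output file name are assumptions chosen for illustration; no partition is specified because this page does not list any. The memory request mirrors the 4 GB per cpu-core policy described on this page.

```shell
#!/bin/bash
#SBATCH --job-name=c2-hello        # hypothetical job name
#SBATCH --ntasks=1                 # one task
#SBATCH --cpus-per-task=1          # one cpu-core for that task
#SBATCH --mem-per-cpu=4G           # mirrors the 4 GB per cpu-core policy
#SBATCH --time=00:10:00            # ten-minute wall-clock limit (illustrative)
#SBATCH --output=c2-hello.%j.out   # %j expands to the job ID

# SLURM sets these variables inside a job; fall back to defaults when run by hand.
job_id="${SLURM_JOB_ID:-local}"
cores="${SLURM_CPUS_PER_TASK:-1}"
echo "job ${job_id} on $(uname -n) with ${cores} core(s)"
```

Saved as, say, `hello_slurm.sh`, it would be submitted from gimel2 with `sbatch hello_slurm.sh` and monitored with `squeue`.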
Special machines
Normally, you will just ssh to sgehead aka gimel from portal.ucsf.bkslab.org where you can do almost anything, including job management. A few things require licensing and must be done on special machines.
hypervisor 'he' hosts:
- alpha - which is critical and runs foreman, DNS, DHCP, and other important services
- beta - which runs LDAP authentication
- epsilon - portal.ucsf.bkslab.org - cluster gateway from public internet
- gamma - sun grid engine qmaster
- phi - mysqld/excipients
- psi for using the PG fortran compiler
- ppilot is at http://zeta:9944/ - you must be on the Cluster 2 private network to use it
- Tau is the web server for ZINC,
- no other specia
- zeta - Psicquic/pipeline pilot
- Sigma can definitely go off and stay off. It was planned for a fingerprinting server, never done.
hypervisor 'aleph2' hosts:
- alpha7 - This is to be the future architecture VM of the cluster (DNS/DHCP/Puppet/Foreman/Ansible). CentOS7.
- kappa is licensing. ask me. ("i have no clue what this licenses. Turned off." - ben)
- rho contains this wiki and also bkslab.org
Notes
- To check out from SVN, use svn with an svn+ssh:// URL.
Hardware and physical location
- 1856 cpu-cores for queued jobs
- 128 cpu-cores for infrastructure, databases, management and ad hoc jobs.
- 788 TB of high quality NFS-available disk
- Our policy is to have 4 GB RAM per cpu-core unless otherwise specified.
- Machines older than 3 years may have 2 GB/core, and 6-year-old machines have 1 GB/core.
- Cluster 2 is currently stored entirely in Rack 0 which is in Row 0, Position 4 of BH101 at 1700 4th St (Byers Hall).
- Central services are on he, aleph2, and bet
- CPU
- 3 Silicon Mechanics Rackform nServ A4412.v4 s, each comprising 4 computers of 32 cpu-cores for a total of 384 cpu-cores.
- 1 Dell C6145 with 128 cores.
- An HP DL165G7 (24-way) is sgehead
- more computers to come from Cluster 0, when Cluster 2 is fully ready.
- DISK
- HP disks - 40 TB RAID6 SAS (new in 2014)
- Silicon Mechanics NAS - new in 2014 - 77 TB RAID6 SAS (new in 2014)
- A HP DL160G5 and an MSA60 with 12 TB SAS (disks new in 2014)
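The core count quoted for the Silicon Mechanics machines can be checked with shell arithmetic. The second figure is only an upper bound on RAM for the queued cores, assuming every core met the 4 GB policy; older machines have less, as noted above. The variable names are illustrative.

```shell
# Silicon Mechanics: 3 chassis, each with 4 computers of 32 cpu-cores.
chassis=3
nodes_per_chassis=4
cores_per_node=32
sm_cores=$((chassis * nodes_per_chassis * cores_per_node))
echo "Silicon Mechanics cores: ${sm_cores}"   # 384

# Upper bound on RAM if all 1856 queued cores had the policy 4 GB each.
queued_cores=1856
gb_per_core=4
max_ram_gb=$((queued_cores * gb_per_core))
echo "RAM upper bound: ${max_ram_gb} GB"      # 7424
```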
Naming convention
- The Hebrew alphabet is used for physical machines
- Greek letters for VMs.
- Functions (e.g. sgehead) are aliases (CNAMEs).
- compbio.ucsf.edu and ucsf.bkslab.org domains both supported.
Disk organization
- shin aka nas1 mounted as /nfs/db/ = 72 TB SAS RAID6. NOTE: ON BAND: $ sudo /usr/local/RAID\ Web\ Console\ 2/startupui.sh to interact with raid controller. username: raid. pw: c2 pass
- bet aka happy, internal: /nfs/store and psql (temp) as 10 TB SATA RAID10
- elated on happy: /nfs/work only as 36 TB SAS RAID6
- dalet exports /nfs/home & /nfs/home2
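To see whether these NFS paths are visible from the machine you are logged in to, a small check like the one below may help. It is a sketch: the function name `check_nfs_paths` is hypothetical, and it only tests that each directory exists, not that the export is healthy.

```shell
# Report whether each expected NFS path is present on this machine.
check_nfs_paths() {
    # Default to the paths listed above when no arguments are given.
    if [ "$#" -eq 0 ]; then
        set -- /nfs/db /nfs/store /nfs/work /nfs/home /nfs/home2
    fi
    for path in "$@"; do
        if [ -d "$path" ]; then
            echo "$path: present"
        else
            echo "$path: missing"
        fi
    done
}
```

Run `check_nfs_paths` with no arguments to test the default list, or pass specific paths to check only those.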
Special purpose machines - all .ucsf.bkslab.org
- sgehead aka gimel.cluster - nearly the only machine you'll need.
- psi.cluster - PG fortran compiler (a machine with only a .cluster address has no public address)
- portal aka epsilon - secure access
- zeta.cluster - Pipeline Pilot
- shin, bet, and dalet are the three NFS servers. You should not need to log in to them.
To reach the RAID controller: on the teague desktop, run /usr/local/RAID Web Console 2/startupui.sh, connect to shin on the public network, and log in as raid / C2.
- mysql1.cluster - general purpose mysql server (like former scratch)
- pg1.cluster - general purpose postgres server
- fprint.cluster - fingerprinting server
Table of Server Information
SLURM
Server Name | Operating System | Functions |
---|---|---|
epyc | Rocky 8 | Apache/HTTPD Webserver + Proxy |
epyc2 | Rocky 8 | |
epyc-A40 | Rocky 8 | |
n-1-101 | Centos 7 | |
n-1-105 | Centos 7 | |
n-1-124 | Centos 7 | |
n-1-126 | Centos 7 | |
n-1-141 | Centos 7 | |
n-1-16 | Centos 7 | |
n-1-17 | Centos 7 | |
n-1-18 | Centos 7 | |
n-1-19 | Centos 7 | |
n-1-20 | Centos 7 | |
n-1-21 | Centos 7 | |
n-1-28 | Centos 7 | |
n-1-38 | Centos 7 | |
n-5-13 | Centos 7 | |
n-5-14 | Centos 7 | |
n-5-15 | Centos 7 | |
n-5-32 | Centos 7 | |
n-5-33 | Centos 7 | |
n-5-34 | Centos 7 | |
n-5-35 | Centos 7 | |
n-9-19 | Centos 7 | |
n-9-20 | Centos 7 | |
n-9-21 | Centos 7 | |
n-9-22 | Centos 7 | |
n-9-34 | Centos 7 | |
n-9-36 | Centos 7 | |
n-9-38 | Centos 7 | |
qof | Centos 7 | |
shin | Centos 7 | |
SGE
Server Name | Operating System | Functions |
---|---|---|
gimel | Centos 6 | In-person Login Node |
he | Centos 6 | Hosts VMs vital to the functioning of Cluster 2. |
het | Centos 6 | |
n-0-129 | Centos 6 | |
n-0-136 | Centos 6 | |
n-0-139 | Centos 6 | |
n-0-30 | Centos 6 | |
n-0-37 | Centos 6 | |
n-0-39 | Centos 6 | |
n-8-27 | Centos 6 | |
n-9-23 | Centos 6 | |