OCI Slurm Autoscaling

  • Create a compartment for your autoscaling resources. This guide names it:
SlurmAutoScaling
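If you prefer the CLI to the console, the compartment can be created like this (a minimal sketch, assuming the OCI CLI is configured; the tenancy OCID is a placeholder):
oci iam compartment create \
    --compartment-id ocid1.tenancy.oc1... \
    --name SlurmAutoScaling \
    --description "Compartment for Slurm autoscaling resources"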
  • Create a dynamic group for autoscaling instances. Add the following as a matching rule:
Any {instance.compartment.id = 'ocid1.compartment.oc1...'}
  • Replace the OCID with your compartment's OCID. This ensures that every instance launched in the SlurmAutoScaling compartment is automatically included in the dynamic group.
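The dynamic group itself can also be created from the CLI (a sketch; substitute your compartment's OCID in the matching rule):
oci iam dynamic-group create \
    --name SlurmDynamicGroup \
    --description "All instances in the SlurmAutoScaling compartment" \
    --matching-rule "Any {instance.compartment.id = 'ocid1.compartment.oc1...'}"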
  • Create a policy for your autoscaling dynamic group. This policy needs to be created in the parent compartment of your SlurmAutoScaling compartment:
Allow dynamic-group SlurmDynamicGroup to read app-catalog-listing in tenancy
Allow dynamic-group SlurmDynamicGroup to use tag-namespace in tenancy
Allow dynamic-group SlurmDynamicGroup to manage compute-management-family in compartment SlurmAutoScaling
Allow dynamic-group SlurmDynamicGroup to manage instance-family in compartment SlurmAutoScaling
Allow dynamic-group SlurmDynamicGroup to use virtual-network-family in compartment SlurmAutoScaling
Allow dynamic-group SlurmDynamicGroup to use volumes in compartment SlurmAutoScaling
Allow dynamic-group SlurmDynamicGroup to use buckets in compartment btingle
Allow dynamic-group SlurmDynamicGroup to manage objects in compartment btingle where all {target.bucket.name='btingletestbucket'}
  • Modify the last two statements to cover all private buckets you want to access from within the autoscaling group. A CLI sketch for creating the policy follows below.
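The policy can likewise be created from the CLI by passing the statements as a JSON array (a sketch; only two statements are shown for brevity, and the policy name and parent compartment OCID are placeholders):
oci iam policy create \
    --compartment-id ocid1.compartment.oc1... \
    --name SlurmAutoScalingPolicy \
    --description "Permissions for the Slurm autoscaling dynamic group" \
    --statements '["Allow dynamic-group SlurmDynamicGroup to read app-catalog-listing in tenancy",
                   "Allow dynamic-group SlurmDynamicGroup to manage instance-family in compartment SlurmAutoScaling"]'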
  • Go to Marketplace -> All Applications and search for the HPC Cluster stack from Oracle. It should be free.
  • Configure the stack and select the options to install Slurm. Make sure both "Instance Principal" and "Scheduler Autoscaling" are enabled.
  • Under the compute node options, pick an availability domain and node shape. Which shapes are available depends on your subscription's service limits and the availability domain.
  • To figure out which instance shapes are available, search for "Limits, Quotas, and Usage" in the console search bar and open it. Check each scope: if the service limit for a shape is above 0 and not n/a, you can allocate that shape within that availability domain. It is a bit of a headache; a CLI sketch for querying limits follows below.
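Service limits can also be queried from the CLI instead of clicking through each scope. A sketch, assuming the tenancy OCID placeholder below; the filter just looks for limit names mentioning E2, since exact limit names vary by shape:
oci limits value list \
    --compartment-id ocid1.tenancy.oc1... \
    --service-name compute \
    --availability-domain US-ASHBURN-AD-2 \
    --all \
    --query "data[?contains(name, 'e2')]"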
  • Finally, hit "Apply" for the stack and wait for the apply job to go through. If the apply job fails, you can destroy the stack and try applying again with a different configuration. The failures I encountered were due to instance shape problems; in the end I used US-ASHBURN-AD-2 for the availability domain and VM.Standard.E2.1 for the instance shape.
  • Once the apply job has finished, your head node should be online and you can log in to it. The head node provides a standard Slurm interface; when jobs are submitted, compute instances are spun up or torn down as necessary to keep up with demand. This behavior can be configured in /opt/oci-hpc/conf on the head node. See the usage sketch below.
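A quick way to watch the autoscaling in action from the head node (a sketch; partition and node names depend on your stack configuration):
sinfo                            # list partitions and whatever compute nodes currently exist
sbatch -N 2 --wrap "hostname"    # submit a 2-node job; autoscaling provisions instances to run it
squeue                           # the job waits in the queue while new nodes are provisioned
sinfo                            # new nodes appear once provisioning completes; idle nodes are
                                 # torn down later according to the settings in /opt/oci-hpc/conf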
  • One cool thing: apparently, when you "yum install" something on the head node, it is automatically installed on all compute nodes as well.