AWS Auto Scaling


A step-by-step guide on how to create a Slurm cluster on AWS with auto-scaling.


a) upgrade your pip: pip install --upgrade pip

b) install the AWS CLI: pip install awscli

c) install AWS ParallelCluster: pip install aws-parallelcluster
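
As a quick sanity check (assuming the installations went through), you can print the versions of both tools:

 aws --version
 pcluster version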


d) prepare your aws_access_key_id and aws_secret_access_key.
These can be found in the "My Security Credentials -> Access keys (access key ID and secret access key)" section.
If you do not have one yet, press "Create New Access Key" and follow the instructions.

 aws configure
 Access Key ID [None]: _YOUR_ACCESS_KEY_ID_
 AWS Secret Access Key [None]: _YOUR_SECRET_ACCESS_KEY_
 these will be stored in ~/.aws/credentials
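
For reference, ~/.aws/credentials will then contain roughly the following (placeholder values shown):

 [default]
 aws_access_key_id = _YOUR_ACCESS_KEY_ID_
 aws_secret_access_key = _YOUR_SECRET_ACCESS_KEY_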

e) parallel cluster configuration. NB: we use the AWS Region us-east-1, which corresponds to N. Virginia.
You are welcome to reconsider this choice.

 pcluster configure
 Allowed values for AWS Region ID:
 1. ap-northeast-1
 2. ap-northeast-2
 3. ap-south-1
 4. ap-southeast-1
 5. ap-southeast-2
 6. ca-central-1
 7. eu-central-1
 8. eu-north-1
 9. eu-west-1
 10. eu-west-2
 11. eu-west-3
 12. sa-east-1
 13. us-east-1
 14. us-east-2
 15. us-west-1
 16. us-west-2
 AWS Region ID [us-east-1]:

If you do not have an EC2 key pair yet, create one in EC2 -> Network & Security -> Key Pairs -> Create Key Pair.
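
Alternatively, a key pair can be created from the command line; a minimal sketch (the name EC2_v1 simply matches the example below):

 aws ec2 create-key-pair --key-name EC2_v1 --query 'KeyMaterial' --output text > EC2_v1.pem
 chmod 400 EC2_v1.pem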

 Allowed values for EC2 Key Pair Name:
 1. EC2_v1
 EC2 Key Pair Name [EC2_v1]:
 Allowed values for Scheduler:
 1. sge
 2. torque
 3. slurm
 4. awsbatch
 Scheduler [slurm]:
 Minimum cluster size (instances) [0]:   <------- THIS CAN BE CHANGED LATER
 Maximum cluster size (instances) [10]:  <------- THIS CAN BE CHANGED LATER
 Master instance type [t2.micro]:        <------- THIS CAN BE CHANGED LATER
 Compute instance type [t2.micro]:       <------- THIS CAN BE CHANGED LATER
 Automate VPC creation? (y/n) [n]: 
 Allowed values for VPC ID:
 1. vpc-579d8e2d | 0 subnets inside
 VPC ID [vpc-579d8e2d]: 
 Allowed values for Network Configuration:
 1. Master in a public subnet and compute fleet in a private subnet
 2. Master and compute fleet in the same public subnet
 Network Configuration [Master in a public subnet and compute fleet in a private subnet]: 1


The config file is ready and stored in ~/.parallelcluster/config.
You may review and edit it, if needed.
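
For orientation, a config generated from the answers above looks roughly like this (a sketch for ParallelCluster 2.x; exact sections and fields depend on the version, and the subnet IDs are placeholders):

 [aws]
 aws_region_name = us-east-1
 
 [global]
 cluster_template = default
 update_check = true
 sanity_check = true
 
 [cluster default]
 key_name = EC2_v1
 scheduler = slurm
 master_instance_type = t2.micro
 compute_instance_type = t2.micro
 initial_queue_size = 0
 max_queue_size = 10
 vpc_settings = default
 
 [vpc default]
 vpc_id = vpc-579d8e2d
 master_subnet_id = subnet-xxxxxxxx
 compute_subnet_id = subnet-yyyyyyyy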

To create a cluster on AWS, run: pcluster create -c ~/.parallelcluster/config UCSFbeta

 Beginning cluster creation for cluster: UCSFbeta
 Creating stack named: parallelcluster-UCSFbeta
 Status: ComputeFleet - CREATE_COMPLETE                                          
 Status: parallelcluster-UCSFbeta - CREATE_COMPLETE                              
 ClusterUser: centos
 MasterPrivateIP: 172.31.0.25
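
A few pcluster commands are handy from here on (ParallelCluster 2.x syntax):

 pcluster list                # list the clusters in this region
 pcluster status UCSFbeta     # show the CloudFormation stack status
 pcluster delete UCSFbeta     # tear the cluster down when it is no longer needed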

Your cluster is ready to go! In EC2->Instances->Instances you will see your master node awaiting jobs.
You now have two launch templates: one for the master node, the other for the compute nodes.
You can modify them here: EC2->Instances->Launch Templates


In EC2->Auto Scaling->Auto Scaling Groups you can modify your cluster shape parameters,
i.e., min size, max size, desired size, default cooldown (when to start terminating idle compute nodes), etc.
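
The same limits can also be changed from the client side (assuming ParallelCluster 2.x): edit initial_queue_size / max_queue_size in ~/.parallelcluster/config and push the change to the running cluster:

 pcluster update UCSFbeta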


To connect to your master node via SSH, use something like: ssh -i "YOUR_PRIVATE_KEY.pem" centos@ec2-54-89-150-98.compute-1.amazonaws.com
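
If you prefer not to look up the public DNS name by hand, pcluster can resolve it for you (extra arguments are passed through to ssh):

 pcluster ssh UCSFbeta -i YOUR_PRIVATE_KEY.pem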

As can be seen via sinfo -lNe, there are initially no compute resources available (smart saving mode).
In order to bring the compute nodes up, it is sufficient to submit even a trivial job: srun -n4 hostname
Answer ---> srun: Required node not available (down, drained or reserved)

What happens then: within about a minute, jobwatcher will notice that there are jobs in the queue.
The system will bring up extra resources (up to the max size parameter) and the queue will start computing.
Should the compute nodes become idle, the system will terminate them (only those idle longer than the "cooldown time").
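
The same scale-up can be triggered with a batch job; here is a minimal sketch (the file name test_job.sh is just an example):

 cat > test_job.sh << 'EOF'
 #!/bin/bash
 #SBATCH --job-name=scale_test
 #SBATCH --ntasks=4
 srun hostname
 EOF
 sbatch test_job.sh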


When the nodes are brought up, one will see them in the list of available resources: sinfo -lNe

 Tue Jun  2 14:20:55 2020
 NODELIST          NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
 ip-172-31-16-148      1  compute*        idle    1    1:1:1      1        0      1   (null) none                
 ip-172-31-18-22       1  compute*        idle    1    1:1:1      1        0      1   (null) none                
 ip-172-31-22-52       1  compute*        idle    1    1:1:1      1        0      1   (null) none                
 ip-172-31-29-18       1  compute*        idle    1    1:1:1      1        0      1   (null) none

Running a job on 4 CPUs:

 [centos@ip-172-31-0-25 ~]$ srun -n4 hostname
 ip-172-31-25-53
 ip-172-31-18-252
 ip-172-31-23-39
 ip-172-31-16-128



Useful links: