Set up a Server
Latest revision as of 01:24, 24 May 2024
This page describes how to install CentOS 7 or Rocky Linux 8 and how to set up and troubleshoot Puppet.
Getting a Bootable USB stick
You can borrow one from the sysadmin or make your own (4.4 GB or more of storage) by following the instructions here
- Download the ISO
Rocky Linux Minimal : https://rockylinux.org/download/
Install CentOS 7/Rocky Linux 8
Change Boot Order
1. Insert the USB stick and connect the monitor to the machine
2. Reboot the machine
3. Get to Boot Menu, there are a few ways:
a. Bring up the BIOS Menu by pressing Del button while the machine is booting. If that doesn't work, try F2 or F10
- In Boot, change the boot order so that the USB stick is booted first
- Save changes and reboot
b. Press F11 and pick the USB drive
Configurations
Adapted from this guide: https://phoenixnap.com/kb/how-to-install-centos-7
Select Test this media and install <OS>
Step 1 : Choose Keyboard and Language
Step 2 : Network Configuration
Select NETWORK & HOSTNAME
1. Switch on the Ethernet
2. Change Host name at the bottom
3. Select Configure
Select IPv4 Settings
DNS Servers: [alpha private ip address]
Search domains: cluster.ucsf.bkslab.org, ucsf.bkslab.org, bkslab.org, compbio.ucsf.edu, ucsf.edu
Check "Require IPv4 addressing for this connection to complete".
Save.
Step 3: Set Date and Time
Turn on Network Time and Select the local timezone.
Step 4: Partitioning
Select INSTALLATION DESTINATION.
Option 1: Automatic Partitioning
Under the Other Storage Options heading, select the Automatically configure partitioning checkbox. This ensures the selected destination storage disk will automatically partition with the /(root), /home and swap partitions. It will automatically create an LVM logical volume in the XFS file system.
If you do not have enough free space, you can reclaim disk space and instruct the system to delete files.
When finished, click the Done button.
Option 2: Manual Partitioning
Select the I will configure partitioning checkbox and choose Done.
Use this option if you want other file systems (such as ext4 or vfat) or a non-LVM partitioning scheme (such as btrfs). It opens a configuration pop-up where you can set up your partitioning manually.
Step 5: Software Selection
Select Compute Node on the left menu, then select Add-Ons on the right menu.
Step 6: Enable KDUMP
Double-check if KDUMP is enabled.
Step 7: Start installation Process
Hit Begin Installation
Step 8: Setup Root Password & User
During installation, you will see two items at the top
Root Password
The usual one
User Creation
Create a local administrator account
User name : survival
Check "Make this user administrator"
Check "Require a password for this account"
Password : [Hint it starts with G and has t somewhere in the middle]
REBOOT when Installation is completed
(CentOS 7 Only) Install Puppet and Create Puppet Certificate
The Puppetmaster on alpha is old and only works with Puppet 3.x from the CentOS 7 repo. On Rocky Linux we can only install Puppet 7, which is incompatible with this Puppetmaster. Therefore, on Rocky Linux we install and configure the necessary packages manually.
Packages Installation
Login as root user
- Install EPEL release. EPEL is a repository of extra packages for enterprise Linux releases.
yum install epel-release -y
This enables access to the public EPEL repository. A GPG key is provided to verify that package transactions are valid.
- Update centos packages
yum update -y
- Install Puppet
yum install puppet -y
- Install sssd
yum install sssd -y
- Install perl libraries
yum install perl-DBD-Pg -y
- Install nss-pam-ldapd
yum install nss-pam-ldapd -y
- Install oddjob-mkhomedir, then start and enable oddjobd
yum install oddjob-mkhomedir -y
systemctl start oddjobd
systemctl enable oddjobd
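The installation steps above can be gathered into one small script. This is only a minimal sketch: the function name print_install_commands is an assumption made for illustration, while the package list and the oddjobd commands are taken from the steps above. It prints the commands instead of running them, so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: print the install commands from the steps above, in order.
# Nothing is executed here; pipe the output to `sh` as root to run it.
# print_install_commands is a hypothetical name, not an existing tool.
print_install_commands() {
    # epel-release goes first so the EPEL repo is available before the update
    echo "yum install epel-release -y"
    echo "yum update -y"
    for pkg in puppet sssd perl-DBD-Pg nss-pam-ldapd oddjob-mkhomedir; do
        echo "yum install $pkg -y"
    done
    echo "systemctl start oddjobd"
    echo "systemctl enable oddjobd"
}

print_install_commands
```

Printing first keeps the sketch safe to run on any machine; whether to run the result unattended is a judgment call for the person setting up the host.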
Edit Puppet configuration on foreman.ucsf.bkslab.org
- Search for the host to see whether it already exists.
- Edit Puppet setting
- If the machine is brand new, click on 'New Host', choose 'Testing' as Host Group and replicate the other existing desktop settings.
- In Parameters, click "Override" in "variant" and assign "cluster" as variable at the bottom.
- In Puppet class, choose:
* nfs-mounts.*
* ssd*
Issue new Puppet Certificate
In a second terminal, log in as root
- Log into alpha to create a new Puppet certificate for the new computer. First list all current Puppet certificates and check whether one already exists for this machine:
$ sudo puppet cert list -a | grep <hostname>.cluster.ucsf.bkslab.org
- To clean out existing certificate
$ sudo puppet cert clean <hostname>.cluster.ucsf.bkslab.org
BEFORE PROCEEDING TO THE NEXT STEP, MAKE SURE that you have 2 terminals open: one logged in as root on the new computer (client) and the other logged in as s_ on alpha (server)
1. On the client side:
$ puppet agent --test --waitforcert=10
"puppet agent --test" performs the initial integration of a new computer with Puppet, or reintegrates an existing one. Without this command, the machine will not have access to /mnt/nfs, /nfs/* and /nfs/soft.
"--waitforcert=10" makes the agent check every 10 seconds whether its certificate has been signed on the server.
2. On server (alpha) side:
Sign the certificate
$ sudo puppet cert sign <hostname>.cluster.ucsf.bkslab.org
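Every server-side command above uses the same certificate name, built from the short hostname. The sketch below is a hedged illustration of that naming: cert_name and server_cert_commands are hypothetical helper names, and the domain suffix is the one used in the commands above. It only prints the commands to run on alpha.

```shell
#!/bin/sh
# Hypothetical helper: build the certificate name from a short hostname.
cert_name() {
    # $1 is the short hostname, for example n-5-34
    printf '%s.cluster.ucsf.bkslab.org\n' "$1"
}

# Print the server-side (alpha) commands for one host.
# The sign step only succeeds after the client has run
# `puppet agent --test --waitforcert=10` and requested a certificate.
server_cert_commands() {
    name=$(cert_name "$1")
    echo "sudo puppet cert list -a | grep $name"
    echo "sudo puppet cert clean $name"
    echo "sudo puppet cert sign $name"
}

server_cert_commands n-5-34
```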
Testing puppet
$ id <user_name>
If it fails, run these commands and try again:
$ systemctl restart sssd
$ systemctl enable sssd
$ authconfig-tui
This opens the authconfig-tui screen. Use the space bar to change a setting.
1. Uncheck "Use Shadow Password".
2. Uncheck "Use Fingerprint reader" so that it does not raise fingerprint errors later. Click "Next" afterwards.
3. Under "LDAP Settings", make sure it says:
[*] Use TLS
Server: ldaps://ds.ucsf.bkslab.org/
Base DN: dc=bkslab, dc=org
$ systemctl start oddjobd
$ systemctl enable oddjobd
Troubleshooting
Puppet SSL issue
- Datetime mismatch
https://wiki.docking.org/index.php/Troubleshooting_-_Puppet_Failed_to_generate_additional_resources_using_%27eval_generate:_SSL_connect_returned%3D1%27
These are some issues from n-5-34/5 and the proposed solutions
- Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid tag "" on node
This error happens because Puppet uses a cached version of the node instead of creating a new one. You must clean all traces of the node on alpha before reissuing a certificate
[root@alpha tmp]# puppet node clean samekh.cluster.ucsf.bkslab.org
- To reissue the Puppet certificate for a machine:
- Revoke the Puppet certificate on alpha
$ sudo puppet cert clean <hostname>.cluster.ucsf.bkslab.org
- Remove this directory
$ rm -rf /var/lib/puppet/ssl
Other Issues
1. Network configuration (/etc/resolv.conf)
Issue 1 : DNS and nameserver are empty (Ethernet connection was not configured during installation)
What I did:
$ nmtui
- nmtui is the NetworkManager text user interface. Edit the connection by following the example from n-1-136
Issue 2: nameserver 127.0.0.1
What I did:
- Commented out all items in the [main] section of /etc/NetworkManager/NetworkManager.conf
- Changed nameserver to 10.20.1.1
$ systemctl restart NetworkManager.service
$ systemctl restart network
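The two resolv.conf issues above can be told apart with a quick check. This is a rough sketch rather than a tool that exists on the cluster: check_resolv_conf is a hypothetical function name, and the three labels it prints are chosen here for illustration.

```shell
#!/bin/sh
# Sketch: classify a resolv.conf file according to the two issues above.
# Prints "empty" (Issue 1), "loopback" (Issue 2), or "ok".
# check_resolv_conf is a hypothetical name used only for this example.
check_resolv_conf() {
    file=${1:-/etc/resolv.conf}
    if ! grep -q '^nameserver[[:space:]]' "$file" 2>/dev/null; then
        echo "empty"       # no nameserver line at all
    elif grep -q '^nameserver[[:space:]][[:space:]]*127\.0\.0\.1[[:space:]]*$' "$file"; then
        echo "loopback"    # nameserver points at 127.0.0.1
    else
        echo "ok"
    fi
}

check_resolv_conf /etc/resolv.conf
```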
2. Yum not working (https://yum/centos/7/contrib/x86_64/repodata/repomd.xml: [Errno 14] HTTP Error 404 - Not Found)
Issue: Puppet overwrote the existing Centos-Base.repo (Centos-7) with a Centos 6's Centos-Base.repo file
What I did:
- Overwrote /etc/yum.repos.d/CentOS-Base.repo with a copy of the correct version from n-1-136
3. Machine not recognizing users
Issue 1: sssd was not installed
What I did:
$ yum install sssd
$ systemctl start sssd
$ systemctl enable sssd
Issue 2:
$ id s_khtang
uid=xxxx(s_khtang) gid=1000(n-5-34) groups=1000(n-5-34)
This means the machine mistakes the sysadmin group (gid 1000) for n-5-34
What I did:
$ vim /etc/group
Change n-5-34:x:1000:n-5-34 to sysadmin:x:1000:n-5-34
$ authconfig-tui
Uncheck 'Shadow Password'
(For Rocky Linux) Install and Configure Packages
I wrote a collection of scripts to automate this process. You can copy them to the machine with scp from /nfs/home/khtang/code/sysadmin-stuff
LDAP/SSSD
$ sh setup-sssd.sh
Mount Filesystems
$ sh mount-nfs.sh
GPU
Nouveau is the open-source driver that is enabled by default. For the NVIDIA driver to work, nouveau must be disabled
How to check whether nouveau is loaded
$ lsmod | grep nouveau
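The grep above prints nothing when nouveau is not loaded, which is easy to misread. A minimal sketch that reports the state explicitly follows; nouveau_status is a hypothetical function name, and it expects lsmod output on standard input.

```shell
#!/bin/sh
# Sketch: report explicitly whether the nouveau module appears in lsmod output.
# nouveau_status is a hypothetical helper that reads lsmod output on stdin.
nouveau_status() {
    if grep -q '^nouveau[[:space:]]'; then
        echo "nouveau is loaded"
    else
        echo "nouveau is not loaded"
    fi
}

# Only call lsmod where it exists, so the sketch also runs on other systems.
if command -v lsmod >/dev/null 2>&1; then
    lsmod | nouveau_status
fi
```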
How to disable nouveau
$ vim /etc/default/grub
Append 'rd.driver.blacklist=nouveau nouveau.modeset=0' to the end of GRUB_CMDLINE_LINUX
$ mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
$ echo "blacklist nouveau" > /etc/modprobe.d/nouveau-blacklist.conf
$ dracut /boot/initramfs-$(uname -r).img $(uname -r)
$ reboot
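The manual edit of GRUB_CMDLINE_LINUX described above can be previewed with sed. This is a sketch under assumptions, not the procedure used on the cluster: append_grub_cmdline is a hypothetical helper, it assumes the value is on one line wrapped in double quotes, and it prints the edited file to standard output without changing /etc/default/grub.

```shell
#!/bin/sh
# Sketch: append kernel arguments inside the quotes of GRUB_CMDLINE_LINUX.
# append_grub_cmdline is a hypothetical helper.
# $1 = path to a grub defaults file, $2 = arguments to append.
# Assumes a single-line, double-quoted value; output goes to stdout only.
append_grub_cmdline() {
    sed "s/^\(GRUB_CMDLINE_LINUX=\".*\)\"\$/\1 $2\"/" "$1"
}

# Preview the result for the real file, if it is readable on this machine.
if [ -r /etc/default/grub ]; then
    append_grub_cmdline /etc/default/grub "rd.driver.blacklist=nouveau nouveau.modeset=0"
fi
```

Changes to /etc/default/grub normally take effect only after the GRUB configuration is regenerated (for example with grub2-mkconfig). That step is not listed above, and the output path depends on how the machine boots, so treat it as something to verify per host.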
Install CUDA and Nvidia Driver
Please make sure the nouveau driver is turned off
CentOS 7
Follow 'Update CUDA 11 and Nvidia-driver' in https://wiki.docking.org/index.php/Gpus
Rocky Linux
Add cuda repo and install cuda. It will also install the latest nvidia-driver compatible with the cuda version
$ dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
$ dnf --enablerepo=epel -y install cuda-11-0
Install a specific version of Nvidia-driver
Download the driver compatible with your GPUs
https://www.nvidia.com/Download/index.aspx?lang=en-us
The driver must be installed without the graphical interface running. You can switch to single-user mode with
$ sudo init 1
or to the non-graphical multi-user target with
$ systemctl isolate multi-user.target
Then run the installer:
$ chmod +x NVIDIA-Linux-x86_64-<version>.sh
$ bash NVIDIA-Linux-x86_64-<version>.sh
Test Nvidia driver
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro K600         Off  | 00000000:01:00.0 Off |                  N/A |
| 27%   55C    P0    N/A /  N/A |      0MiB /   981MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
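When checking several machines, it can help to pull just the driver version out of the nvidia-smi header. The sketch below is an illustration only: driver_version is a hypothetical helper that reads nvidia-smi output on standard input and relies on the "Driver Version:" label shown in the output above.

```shell
#!/bin/sh
# Sketch: extract the driver version from nvidia-smi output read on stdin.
# driver_version is a hypothetical helper; it relies on the
# "Driver Version: <number>" label in the header line.
driver_version() {
    sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p' | head -n 1
}

# Example with the header line from the output above.
echo '| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |' | driver_version
# prints 470.129.06

# On a machine with the driver installed, the same helper can read live output.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi | driver_version
fi
```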