Set up a Server

Latest revision as of 01:24, 24 May 2024

This page describes how to install CentOS 7 or Rocky Linux 8, and how to set up and troubleshoot Puppet.

Getting a Bootable USB stick

You can borrow one from the Sysadmin or make your own (4.4GB+ storage) by following the instructions at https://www.linux.com/tutorials/how-burn-iso-usb-drive/

  • Download the ISO
Rocky Linux Minimal : https://rockylinux.org/download/

Install CentOS 7/Rocky Linux 8

Change Boot Order

1. Insert the USB stick and connect the monitor to the machine

2. Reboot the machine

3. Get to the Boot Menu. There are a few ways:

a. Bring up the BIOS menu by pressing the Del key while the machine is booting. If that doesn't work, try F2 or F10

- In Boot, change the boot order so that the USB stick boots first

- Save changes and reboot

b. Press F11 and pick the USB drive

Configurations

Adapted from this guide -> https://phoenixnap.com/kb/how-to-install-centos-7

Select Test this media and install <OS>

Step 1 : Choose Keyboard and Language

Step 2 : Network Configuration

Select NETWORK & HOSTNAME

1. Switch on the Ethernet

2. Change Host name at the bottom

3. Select Configure

Select IPv4 Settings
DNS Servers:
  [alpha private ip address]
Search domains:
  cluster.ucsf.bkslab.org, ucsf.bkslab.org, bkslab.org, compbio.ucsf.edu, ucsf.edu
Check "Require IPv4 addressing for this connection to complete".
Save.
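
If the network was not configured in the installer, the same DNS settings can probably be applied later from the command line. The sketch below only prints an nmcli command for review; the function name and the connection name you pass in are assumptions, and the DNS server argument stands in for alpha's private IP, which this page does not spell out.

```shell
#!/bin/sh
# Search domains copied from the installer step above.
SEARCH_DOMAINS="cluster.ucsf.bkslab.org,ucsf.bkslab.org,bkslab.org,compbio.ucsf.edu,ucsf.edu"

# Hypothetical helper: print (do not run) the nmcli command that would set
# the DNS server and search domains on one connection.
#   $1 = NetworkManager connection name (assumption, check `nmcli connection show`)
#   $2 = DNS server IP (alpha's private IP)
nmcli_dns_command() {
    printf 'nmcli connection modify %s ipv4.dns %s ipv4.dns-search %s\n' \
        "$1" "$2" "$SEARCH_DOMAINS"
}
```

Call it as nmcli_dns_command <connection> <dns-ip>, check the printed line, then run it as root and bring the connection up again so the change takes effect.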

Step 3: Set Date and Time

Turn on Network Time and Select the local timezone.

Step 4: Partitioning

Select INSTALLATION DESTINATION.

Option 1: Automatic Partitioning

Under the Other Storage Options heading, select the Automatically configure partitioning checkbox. The installer then partitions the selected destination disk automatically with /(root), /home and swap partitions, creating LVM logical volumes formatted with the XFS file system.

If you do not have enough free space, you can reclaim disk space and instruct the system to delete files.

When finished, click the Done button.

Option 2: Manual Partitioning

Select the I will configure partitioning checkbox and choose Done.

Use this option if you want other file systems (such as ext4 or vfat) or a non-LVM partitioning scheme (such as btrfs). It opens a configuration pop-up where you can set up your partitioning manually.

Step 5: Software Selection

Select Compute Node on the left menu, then select Add-Ons on the right menu.

Step 6: Enable KDUMP

Double-check if KDUMP is enabled.

Step 7: Start installation Process

Hit Begin Installation

Step 8: Setup Root Password & User

During installation, you will see 2 items at the top

Root Password

The usual one

User Creation

Create a local administrator account

User name : survival
Check "Make this user administrator"
Check "Require a password for this account"
Password : [Hint it starts with G and has t somewhere in the middle]

REBOOT when Installation is completed


(CentOS 7 Only) Install Puppet and Create Puppet Certificate

The Puppet master on alpha is old and only works with Puppet 3.x from the CentOS 7 repo. Rocky Linux can only install Puppet 7, which is incompatible with this Puppet master. Therefore, on Rocky Linux we install and configure the necessary packages manually.

Packages Installation

Log in as the root user

  • Install the EPEL release package. EPEL (Extra Packages for Enterprise Linux) is an add-on repository for enterprise releases. Learn more: https://www.tecmint.com/how-to-enable-epel-repository-for-rhel-centos-6-5/
yum install epel-release -y
This enables access to the public EPEL repository. A GPG key is provided to verify that each transaction is valid
  • Update CentOS packages
yum update -y
  • Install Puppet
yum install puppet -y
  • Install sssd
yum install sssd -y
  • Install perl libraries
yum install perl-DBD-Pg -y
  • Install nss-pam-ldapd
yum install nss-pam-ldapd -y
  • Install oddjob-mkhomedir, then start and enable oddjobd
yum install oddjob-mkhomedir -y
systemctl start oddjobd
systemctl enable oddjobd
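
The package steps above can be collected into one script. This is a minimal sketch, not the script used on the cluster: the function names and the DRY_RUN switch are assumptions added for illustration, and with the default DRY_RUN=1 the commands are only printed.

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the yum/systemctl commands listed above.
install_base_packages() {
    # epel-release goes first so that later packages can resolve from EPEL
    run yum install -y epel-release
    run yum update -y
    for pkg in puppet sssd perl-DBD-Pg nss-pam-ldapd oddjob-mkhomedir; do
        run yum install -y "$pkg"
    done
    run systemctl start oddjobd
    run systemctl enable oddjobd
}
```

Run install_base_packages once to review the printed commands, then run it again as root with DRY_RUN=0 to execute them.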

Edit Puppet configuration on foreman.ucsf.bkslab.org

  1. Search for the host to check whether it already exists.
  2. Edit the Puppet settings
    1. If the machine is brand new, click 'New Host', choose 'Testing' as the Host Group, and replicate the settings of an existing desktop.
    2. In Parameters, click "Override" for "variant" and assign "cluster" as its value at the bottom.
    3. In Puppet class, Choose :
           * nfs-mounts.*
           * ssd*
           

Issue new Puppet Certificate

In a second terminal, log in as root

  • Log into alpha to create a new Puppet certificate for the new computer. First list the current certificates and check whether one already exists for this machine
$ sudo puppet cert list -a | grep <hostname>.cluster.ucsf.bkslab.org
  • To clean out an existing certificate
$ sudo puppet cert clean <hostname>.cluster.ucsf.bkslab.org

BEFORE PROCEEDING TO THE NEXT STEP, MAKE SURE that you have 2 terminals open: one logged in as root on the new computer (client) and the other logged in as s_ on alpha (server)

1. On the client side:

$ puppet agent --test --waitforcert=10
"puppet agent --test" performs the initial integration with Puppet for a new computer, or reintegrates an existing one. Without this command, the machine will not have access to /mnt/nfs, /nfs/* and /nfs/soft
"--waitforcert=10" makes the agent check every 10 seconds whether the server has signed its certificate, so it keeps waiting until you sign it in the next step

2. On server (alpha) side:

Sign the certificate
$ sudo puppet cert sign <hostname>.cluster.ucsf.bkslab.org
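
Because the certificate commands alternate between alpha and the client, it can help to print them for one host before starting. The helper below is a sketch only; the function names are assumptions, and the domain is taken from the commands above.

```shell
#!/bin/sh
DOMAIN="cluster.ucsf.bkslab.org"   # domain used in the commands above

# Build the fully qualified name from a short hostname.
fqdn_for() {
    printf '%s.%s\n' "$1" "$DOMAIN"
}

# Print the certificate steps in order, prefixed by where each one runs.
cert_steps() {
    host="$(fqdn_for "$1")"
    # the clean step only matters when an old certificate exists
    echo "alpha:  sudo puppet cert clean $host"
    echo "client: puppet agent --test --waitforcert=10"
    echo "alpha:  sudo puppet cert sign $host"
}
```

For example, cert_steps n-5-34 prints the three commands with n-5-34.cluster.ucsf.bkslab.org filled in.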


Testing puppet

$ id <user_name>

If it fails, run these commands and try again:

$ systemctl restart sssd && systemctl enable sssd

$ authconfig-tui
This opens the authconfig-tui screen. Use the space bar to change a setting.
1. Uncheck "Use Shadow Password".
2. Uncheck "Use Fingerprint reader" so that it does not raise fingerprint errors later. Click "Next" afterwards.
3. Under "LDAP Settings", make sure it says:
   [*] Use TLS
   Server: ldaps://ds.ucsf.bkslab.org/
   Base DN: dc=bkslab, dc=org
$ systemctl start oddjobd
$ systemctl enable oddjobd

Troubleshooting

Puppet SSL issue
  • Datetime mismatch
https://wiki.docking.org/index.php/Troubleshooting_-_Puppet_Failed_to_generate_additional_resources_using_%27eval_generate:_SSL_connect_returned%3D1%27

These are some issues from n-5-34/5 and the proposed solutions

  • Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid tag "" on node

This error happens because Puppet uses a cached version of the node instead of creating a new one. You must clean all traces of the node on alpha before issuing a new certificate

[root@alpha tmp]# puppet node clean samekh.cluster.ucsf.bkslab.org

  • To reissue the Puppet certificate for a machine:
- Revoke the Puppet certificate on alpha
 $ sudo puppet cert clean <hostname>.cluster.ucsf.bkslab.org
- Remove this directory on the machine itself (the client)
 $ rm -rf /var/lib/puppet/ssl

Other Issues

1. Network configuration (/etc/resolv.conf)

Issue 1 : DNS and nameserver are empty (Ethernet connection was not configured during installation)

What I did:

$ nmtui (NetworkManager tui)
-Edit the connection by following the example from n-1-136 

Issue 2: nameserver 127.0.0.1

What I did:

- Commented out all items in the [main] section of /etc/NetworkManager/NetworkManager.conf
- Changed the nameserver in /etc/resolv.conf to 10.20.1.1
$ systemctl restart NetworkManager.service
$ systemctl restart network
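
After restarting the services, it is worth confirming that /etc/resolv.conf really points at the expected nameserver. A small sketch, assuming hypothetical helper names and a standard resolv.conf layout:

```shell
#!/bin/sh
# Print the first nameserver entry of a resolv.conf-style file.
first_nameserver() {
    awk '$1 == "nameserver" { print $2; exit }' "$1"
}

# Compare the first nameserver with the expected one.
#   $1 = file to check (normally /etc/resolv.conf), $2 = expected IP
check_nameserver() {
    actual="$(first_nameserver "$1")"
    if [ "$actual" = "$2" ]; then
        echo "ok: nameserver is $actual"
    else
        echo "mismatch: found '${actual:-none}', expected '$2'"
        return 1
    fi
}
```

On the machine you would run check_nameserver /etc/resolv.conf 10.20.1.1; a non-zero exit status means the file still needs fixing.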

2. Yum not working (https://yum/centos/7/contrib/x86_64/repodata/repomd.xml: [Errno 14] HTTP Error 404 - Not Found)

Issue: Puppet overwrote the existing CentOS-Base.repo (CentOS 7) with the CentOS 6 version of the file

What I did:

- Overwrote /etc/yum.repos.d/CentOS-Base.repo with a copy of the correct version from n-1-136

3. Machine not recognizing users

Issue 1: sssd was not installed

What I did:

$ yum install sssd
$ systemctl start sssd
$ systemctl enable sssd

Issue 2:

$ id s_khtang
uid=xxxx(s_khtang) gid=1000(n-5-34) groups=1000(n-5-34)

This means the machine maps GID 1000, which should be the sysadmin group, to a local group named n-5-34

What I did:

$ vim /etc/group
Change n-5-34:x:1000:n-5-34 to sysadmin:x:1000:n-5-34
$ authconfig-tui
Uncheck 'Shadow Password'
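
To see which local group currently owns GID 1000 before and after editing /etc/group, a lookup like the following may help. The function name is an assumption; it reads only the local group file, whereas getent group 1000 would also consult LDAP/sssd.

```shell
#!/bin/sh
# Print the name of the group with the given numeric GID.
#   $1 = group file (normally /etc/group), $2 = GID
group_name_for_gid() {
    awk -F: -v gid="$2" '$3 == gid { print $1; exit }' "$1"
}
```

Before the fix, group_name_for_gid /etc/group 1000 should print n-5-34 on the affected machine; after the edit it should print sysadmin.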

(For Rocky Linux) Install and Configure Packages

I wrote a collection of scripts to automate this process. You can scp /nfs/home/khtang/code/sysadmin-stuff to the machine

LDAP/SSSD

$ sh setup-sssd.sh

Mount Filesystems

$ sh mount-nfs.sh

GPU

Nouveau is the open-source NVIDIA driver that is enabled by default. For the proprietary NVIDIA driver to work, nouveau must be disabled

How to check whether nouveau is loaded

$ lsmod | grep nouveau

How to disable nouveau

$ vim /etc/default/grub
Append 'rd.driver.blacklist=nouveau nouveau.modeset=0' to the end of the GRUB_CMDLINE_LINUX value
$ mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
$ echo "blacklist nouveau" > /etc/modprobe.d/nouveau-blacklist.conf 
$ dracut /boot/initramfs-$(uname -r).img $(uname -r)
$ reboot
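
The manual vim edit of GRUB_CMDLINE_LINUX can also be scripted. The sketch below is an assumption-laden helper (the function name is made up) that appends the two kernel arguments once and leaves the file alone if they are already there. Note that a change to /etc/default/grub normally takes effect only after the GRUB configuration is regenerated, for example with grub2-mkconfig; the output path differs between BIOS and UEFI systems, so check yours before running it.

```shell
#!/bin/sh
# Append the nouveau blacklist arguments to GRUB_CMDLINE_LINUX.
#   $1 = grub defaults file (normally /etc/default/grub)
add_nouveau_blacklist() {
    file="$1"
    args="rd.driver.blacklist=nouveau nouveau.modeset=0"
    # nothing to do if the arguments are already present
    if grep -q 'rd.driver.blacklist=nouveau' "$file"; then
        return 0
    fi
    tmp="$(mktemp)"
    # insert the arguments just before the closing quote of the value
    sed "s/^\(GRUB_CMDLINE_LINUX=\"[^\"]*\)\"/\1 $args\"/" "$file" > "$tmp" \
        && cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Try it on a copy first, for example cp /etc/default/grub /tmp/grub.test and add_nouveau_blacklist /tmp/grub.test, and compare the result with the original.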

Install CUDA and Nvidia Driver

Please make sure the nouveau driver is turned off

CentOS 7

Follow 'Update CUDA 11 and Nvidia-driver' in https://wiki.docking.org/index.php/Gpus

Rocky Linux

Add the CUDA repo and install CUDA. This also installs the latest NVIDIA driver compatible with that CUDA version

$ dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

$ dnf --enablerepo=epel -y install cuda-11-0

Install a specific version of Nvidia-driver

Download the driver compatible with your GPUs

https://www.nvidia.com/Download/index.aspx?lang=en-us

The driver must be installed without the graphical interface running. You can switch to single-user mode with

$ sudo init 1 
or to multi-user text mode (no GUI) with
$ systemctl isolate multi-user.target

Then make the installer executable and run it
$ chmod +x NVIDIA-Linux-x86_64-<version>.sh
$ bash NVIDIA-Linux-x86_64-<version>.sh

Test Nvidia driver

$ nvidia-smi
 +-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro K600         Off  | 00000000:01:00.0 Off |                  N/A |
| 27%   55C    P0    N/A /  N/A |      0MiB /   981MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+ 
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
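
To check the installed driver version from a script rather than by eye, the header line of the table can be parsed. This is a sketch with a hypothetical function name that assumes the header format shown above; nvidia-smi --query-gpu=driver_version --format=csv,noheader should give the same value directly.

```shell
#!/bin/sh
# Read nvidia-smi table output on stdin and print the driver version.
driver_version() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

Usage on a GPU machine: nvidia-smi | driver_version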