Set up a Server

Latest revision as of 01:24, 24 May 2024

This page describes how to install CentOS 7 or Rocky Linux 8 and how to set up and troubleshoot Puppet.

Getting a Bootable USB stick

You can borrow one from the sysadmin or make your own (4.4 GB or more of storage) by following the instructions here

  • Download the ISO
Rocky Linux Minimal : https://rockylinux.org/download/

Install CentOS 7/Rocky Linux 8

Change Boot Order

1. Insert the USB stick and connect the monitor to the machine

2. Reboot the machine

3. Open the boot menu. There are a few ways:

a. Bring up the BIOS menu by pressing the Del key while the machine is booting. If that doesn't work, try F2 or F10

- In Boot, change the boot order so that the USB stick boots first

- Save changes and reboot

b. Press F11 and pick the USB drive

Configurations

Adapted from this guide -> https://phoenixnap.com/kb/how-to-install-centos-7

Select Test this media and install <OS>

Step 1 : Choose Keyboard and Language

Step 2 : Network Configuration

Select NETWORK & HOSTNAME

1. Switch on the Ethernet

2. Change Host name at the bottom

3. Select Configure

Select IPv4 Settings
DNS Servers:
  [alpha private ip address]
Search domains:
  cluster.ucsf.bkslab.org, ucsf.bkslab.org, bkslab.org, compbio.ucsf.edu, ucsf.edu
Check "Require IPv4 addressing for this connection to complete".
Save.
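
On a machine that is already installed, the same IPv4 settings can also be applied from the command line. The following is a minimal sketch, assuming NetworkManager's nmcli is available. The connection name and the DNS address are values you must supply, and the helper name print_nmcli_dns_cmd is made up for illustration. It only prints the command so that you can review it before running it as root.

```shell
#!/bin/sh
# Search domains from the installer step above.
SEARCH_DOMAINS="cluster.ucsf.bkslab.org,ucsf.bkslab.org,bkslab.org,compbio.ucsf.edu,ucsf.edu"

# Hypothetical helper: print (do not run) an nmcli command that mirrors
# the installer settings.
#   $1 = connection name (as listed by `nmcli connection show`)
#   $2 = DNS server address (alpha's private IP address)
print_nmcli_dns_cmd() {
    conn=$1
    dns=$2
    # "ipv4.may-fail no" corresponds to the checkbox
    # "Require IPv4 addressing for this connection to complete".
    echo "nmcli connection modify \"$conn\" ipv4.dns \"$dns\" ipv4.dns-search \"$SEARCH_DOMAINS\" ipv4.may-fail no"
}
```

Usage: print_nmcli_dns_cmd <connection name> <alpha private ip address>, then run the printed command and reactivate the connection with nmcli connection up.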

Step 3: Set Date and Time

Turn on Network Time and Select the local timezone.

Step 4: Partitioning

Select INSTALLATION DESTINATION.

Option 1: Automatic Partitioning

Under the Other Storage Options heading, select the Automatically configure partitioning checkbox. This ensures the selected destination disk is automatically partitioned with / (root), /home and swap partitions. It also creates an LVM logical volume formatted with the XFS file system.

If you do not have enough free space, you can reclaim disk space and instruct the system to delete files.

When finished, click the Done button.

Option 2: Manual Partitioning

Choose this option if you want to use other file systems (such as ext4 or vfat) or a non-LVM partitioning scheme, such as btrfs.

Select the I will configure partitioning checkbox and choose Done. This opens a configuration pop-up where you can set up your partitioning manually.

Step 5: Software Selection

Select Compute Node on the left menu, then select Add-Ons on the right menu.

Step 6: Enable KDUMP

Double-check if KDUMP is enabled.

Step 7: Start installation Process

Hit Begin Installation

Step 8: Setup Root Password & User

During installation, you will see two items at the top:

Root Password

The usual one

User Creation

Create a local administrator account

User name : survival
Check "Make this user administrator"
Check "Require a password for this account"
Password : [Hint it starts with G and has t somewhere in the middle]

REBOOT when Installation is completed


(CentOS 7 Only) Install Puppet and Create Puppet Certificate

The Puppetmaster on alpha is old and only works with Puppet 3.x from the CentOS 7 repo. On Rocky Linux we can only install Puppet 7, which is incompatible with the Puppetmaster. Therefore, on Rocky Linux we install and configure the necessary packages manually.

Packages Installation

Log in as the root user

  • Install the EPEL release package. EPEL (Extra Packages for Enterprise Linux) is a repository of additional packages for enterprise Linux releases.
yum install epel-release -y
This enables access to the public EPEL repository. A GPG key is provided so that package transactions can be verified.
  • Update CentOS packages
yum update -y
  • Install Puppet
yum install puppet -y
  • Install sssd
yum install sssd -y
  • Install perl libraries
yum install perl-DBD-Pg -y
  • Install nss-pam-ldapd
yum install nss-pam-ldapd -y
yum install oddjob-mkhomedir -y
systemctl start oddjobd
systemctl enable oddjobd
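
The installation steps above can be collected into one script. This is a minimal sketch, not a tested provisioning tool: the helper names run and install_packages are made up for illustration, and by default the script only prints the commands (dry run). Set RUN=1 and run it as root to execute them.

```shell
#!/bin/sh
# Packages from the steps above, in the order they are installed after EPEL.
PACKAGES="puppet sssd perl-DBD-Pg nss-pam-ldapd oddjob-mkhomedir"

# Run the given command only when RUN=1; otherwise print it (dry run).
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Hypothetical wrapper that replays the manual steps in order and
# stops at the first command that fails.
install_packages() {
    run yum install -y epel-release || return 1
    run yum update -y || return 1
    for pkg in $PACKAGES; do
        run yum install -y "$pkg" || return 1
    done
    run systemctl start oddjobd || return 1
    run systemctl enable oddjobd
}
```

Calling install_packages without RUN=1 lists the nine commands in order, which is a quick way to check the sequence before touching the machine.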

Edit Puppet configuration on foreman.ucsf.bkslab.org

  1. Search for the host to check whether it already exists.
  2. Edit the Puppet settings
    1. If the machine is brand new, click on 'New Host', choose 'Testing' as Host Group and replicate the settings of the other existing desktops.
    2. In Parameters, click "Override" in "variant" and assign "cluster" as the value at the bottom.
    3. In Puppet classes, choose:
           * nfs-mounts.*
           * ssd*
           

Issue new Puppet Certificate

In a second terminal, log in as root

  • Log in to alpha to create a new Puppet certificate for the new computer. First list all current Puppet certificates and check whether one already exists for this machine:
$ sudo puppet cert list -a | grep <hostname>.cluster.ucsf.bkslab.org
  • To clean out an existing certificate:
$ sudo puppet cert clean <hostname>.cluster.ucsf.bkslab.org

BEFORE PROCEEDING TO THE NEXT STEP, MAKE SURE that you have 2 terminals open: one logged in as root on the new computer (client) and the other logged in as s_ on alpha (server).

1. On the client side:

$ puppet agent --test --waitforcert=10
The "puppet agent --test" command performs the initial integration with Puppet for a new computer, or reintegrates an existing one. Without it, the machine will not have access to /mnt/nfs, /nfs/* and /nfs/soft.
"--waitforcert=10" tells the agent to check back with the Puppet master every 10 seconds until its certificate request has been signed.

2. On server (alpha) side:

Sign the certificate
$ sudo puppet cert sign <hostname>.cluster.ucsf.bkslab.org
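
Because the certificate steps alternate between alpha and the client, it can help to print them as a checklist for one host. This is a small sketch; the helper names fqdn_for and print_cert_steps are made up for illustration, and nothing is executed.

```shell
#!/bin/sh
# Domain used for cluster hosts on this page.
DOMAIN="cluster.ucsf.bkslab.org"

# Build the fully qualified name used in the puppet cert commands.
fqdn_for() {
    echo "$1.$DOMAIN"
}

# Print the commands in the order described above,
# prefixed with the machine they are run on.
print_cert_steps() {
    fqdn=$(fqdn_for "$1")
    echo "alpha:  sudo puppet cert list -a | grep $fqdn"
    echo "alpha:  sudo puppet cert clean $fqdn    # only if a certificate already exists"
    echo "client: puppet agent --test --waitforcert=10"
    echo "alpha:  sudo puppet cert sign $fqdn"
}
```

Usage: print_cert_steps <hostname>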


Testing puppet

$ id <user_name>

If it fails, run these commands and try again:

$ systemctl restart sssd
$ systemctl enable sssd

$ authconfig-tui
This opens the authconfig-tui screen. Use the space bar to change a setting.
1. Uncheck "Use Shadow Password".
2. Uncheck "Use Fingerprint reader" so that it does not raise a fingerprint error later. Click "Next" afterwards.
3. Under "LDAP Settings", make sure it says:
   [*] Use TLS
   Server: ldaps://ds.ucsf.bkslab.org/
   Base DN: dc=bkslab, dc=org
$ systemctl start oddjobd
$ systemctl enable oddjobd

Troubleshooting

Puppet SSL issue
  • Datetime mismatch
https://wiki.docking.org/index.php/Troubleshooting_-_Puppet_Failed_to_generate_additional_resources_using_%27eval_generate:_SSL_connect_returned%3D1%27

These are some issues from n-5-34/5 and the proposed solutions

  • Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid tag "" on node

This error happens because Puppet uses a cached version of the node instead of creating a new one. You must clean all traces of the node on alpha before issuing a new certificate.

[root@alpha tmp]# puppet node clean samekh.cluster.ucsf.bkslab.org

  • To reissue the Puppet certificate for a machine:
- Revoke the Puppet certificate on alpha
 $ sudo puppet cert clean <hostname>.cluster.ucsf.bkslab.org
- Remove this directory on the machine
 $ rm -rf /var/lib/puppet/ssl

Other Issues

1. Network configuration (/etc/resolv.conf)

Issue 1 : DNS and nameserver are empty (Ethernet connection was not configured during installation)

What I did:

$ nmtui    # NetworkManager text user interface
- Edit the connection by following the example from n-1-136

Issue 2: nameserver 127.0.0.1

What I did:

- Commented out all items in the [main] section of /etc/NetworkManager/NetworkManager.conf
- Changed the nameserver to 10.20.1.1
$ systemctl restart NetworkManager.service
$ systemctl restart network
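
The nameserver change can also be scripted. The sketch below is one possible way to do it, not the procedure that was actually used on n-5-34. The helper name set_nameserver is made up for illustration. It replaces every existing nameserver line in the given file with a single one and leaves other lines, such as search, untouched.

```shell
#!/bin/sh
# Hypothetical helper: keep exactly one nameserver entry in a
# resolv.conf-style file.
#   $1 = file to edit (for example /etc/resolv.conf)
#   $2 = nameserver address (for example 10.20.1.1)
set_nameserver() {
    file=$1
    ip=$2
    tmp=$(mktemp) || return 1
    # Copy everything except existing nameserver lines.
    # grep exits non-zero when nothing is left to print, which is fine here.
    grep -v '^[[:space:]]*nameserver[[:space:]]' "$file" > "$tmp" || true
    echo "nameserver $ip" >> "$tmp"
    # Write back through cat so the original file keeps its owner and mode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Usage, as root: set_nameserver /etc/resolv.conf 10.20.1.1, followed by the two restart commands above.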

2. Yum not working (https://yum/centos/7/contrib/x86_64/repodata/repomd.xml: [Errno 14] HTTP Error 404 - Not Found)

Issue: Puppet overwrote the existing CentOS-Base.repo (CentOS 7) with the CentOS 6 version of CentOS-Base.repo

What I did:

- Overwrote /etc/yum.repos.d/CentOS-Base.repo with a copy of the correct version from n-1-136

3. Machine not recognizing users

Issue 1: sssd was not installed

What I did:

$ yum install sssd
$ systemctl start sssd
$ systemctl enable sssd

Issue 2:

$ id s_khtang
uid=xxxx(s_khtang) gid=1000(n-5-34) groups=1000(n-5-34)

This means the machine mistakes the sysadmin group (gid 1000) for n-5-34.

What I did:

$ vim /etc/group
Change n-5-34:x:1000:n-5-34 to sysadmin:x:1000:n-5-34
$ authconfig-tui
Uncheck 'Shadow Password'
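
The /etc/group edit can be done without opening an editor. This is a minimal sketch, assuming the only change needed is the group name in the first field. The helper name rename_group is made up for illustration. Back up /etc/group before using it on a real machine.

```shell
#!/bin/sh
# Hypothetical helper: rename a group in a group(5)-style file,
# leaving the password, gid and member fields unchanged.
#   $1 = file (for example /etc/group)
#   $2 = current group name
#   $3 = new group name
rename_group() {
    file=$1
    old=$2
    new=$3
    tmp=$(mktemp) || return 1
    # Only the first colon-separated field is compared and replaced.
    awk -F: -v OFS=: -v old="$old" -v new="$new" \
        '$1 == old { $1 = new } { print }' "$file" > "$tmp" || return 1
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Usage, as root: rename_group /etc/group n-5-34 sysadmin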

(For Rocky Linux) Install and Configure Packages

I wrote a collection of scripts to automate this process. You can scp /nfs/home/khtang/code/sysadmin-stuff to the machine.

LDAP/SSSD

$ sh setup-sssd.sh

Mount Filesystems

$ sh mount-nfs.sh

GPU

Nouveau is the open-source NVIDIA driver that is enabled by default. For the proprietary NVIDIA driver to work, nouveau must be disabled.

How to check whether nouveau is loaded

$ lsmod | grep nouveau

How to disable nouveau

$ vim /etc/default/grub
Append 'rd.driver.blacklist=nouveau nouveau.modeset=0' to the end of the GRUB_CMDLINE_LINUX value
$ mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
$ echo "blacklist nouveau" > /etc/modprobe.d/nouveau-blacklist.conf 
$ dracut /boot/initramfs-$(uname -r).img $(uname -r)
$ reboot
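
The check and the GRUB edit above can be sketched as two small shell functions. The helper names nouveau_loaded and append_nouveau_args are made up for illustration, and the sketch assumes GRUB_CMDLINE_LINUX is a single double-quoted line, as in a default /etc/default/grub. It does not rebuild the initramfs or reboot.

```shell
#!/bin/sh
# Succeeds if the lsmod-style text on stdin lists the nouveau module.
nouveau_loaded() {
    grep -q '^nouveau[[:space:]]'
}

# Hypothetical helper: append the two kernel arguments to the
# GRUB_CMDLINE_LINUX value in the given file (for example /etc/default/grub).
append_nouveau_args() {
    file=$1
    args='rd.driver.blacklist=nouveau nouveau.modeset=0'
    # Do nothing if the arguments were already added.
    if grep -q '^GRUB_CMDLINE_LINUX=.*nouveau\.modeset=0' "$file"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # Insert the arguments just before the closing double quote.
    sed "s/^\(GRUB_CMDLINE_LINUX=\"[^\"]*\)\"/\1 $args\"/" "$file" > "$tmp" || return 1
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Usage: lsmod | nouveau_loaded && echo "nouveau is loaded", and as root append_nouveau_args /etc/default/grub. The steps above do not mention regenerating the GRUB configuration. On RHEL-family systems a change to /etc/default/grub normally takes effect only after grub2-mkconfig has been run, so check whether that is needed on your machine.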

Install CUDA and Nvidia Driver

Please make sure the nouveau driver is turned off

CentOS 7

Follow 'Update CUDA 11 and Nvidia-driver' in https://wiki.docking.org/index.php/Gpus

Rocky Linux

Add the CUDA repo and install CUDA. This also installs the latest NVIDIA driver compatible with that CUDA version.

$ dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

$ dnf --enablerepo=epel -y install cuda-11-0

Install a specific version of Nvidia-driver

Download the driver compatible with your GPUs:

https://www.nvidia.com/Download/index.aspx?lang=en-us

The driver must be installed without the graphical interface running. You can switch to single-user mode or to the non-graphical multi-user target with

$ sudo init 1
or
$ systemctl isolate multi-user.target

Then make the installer executable and run it:

$ chmod +x NVIDIA-Linux-x86_64-<version>.sh
$ bash NVIDIA-Linux-x86_64-<version>.sh

Test Nvidia driver

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro K600         Off  | 00000000:01:00.0 Off |                  N/A |
| 27%   55C    P0    N/A /  N/A |      0MiB /   981MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+ 
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
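
To compare the installed driver against the version you intended to install, the version can be pulled out of the header shown above. This is a small sketch that assumes the header keeps the "Driver Version: x.y.z" form. The helper name driver_version_from_smi is made up for illustration.

```shell
#!/bin/sh
# Read nvidia-smi output on stdin and print the driver version
# found in the header line.
driver_version_from_smi() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

Usage: nvidia-smi | driver_version_from_smi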