## Getting started with High Performance Computing in Research

_Richard Polzin_

###### Last Updated: 27. Nov 2024

---

### Table of Contents

1. Introduction
2. HPC.NRW Linux Basics for HPC
3. Tips and Tricks
4. SLURM Job Manager
5. Compute Time Application

---

# Introduction

### Objectives

- Understand what HPC is
- Learn why HPC is essential for modern research
- Discover available HPC resources
- Walk through a project application process

## What is HPC?

- **H**igh **P**erformance **C**omputing involves the use of supercomputers and parallel processing techniques to solve complex computational problems
- Combines computing resources for higher performance
- HPC systems consist of clusters with interconnected nodes
- Enable processing of larger datasets and complex simulations

## What is HPC?

![](claix23.png)

## Why is HPC Essential?

- **Speed:** Reduces time to results
- **Capacity:** Handles large-scale data and simulations
- **Complexity:** Solves problems too complex for standard computers

## HPC Resources

- **State of the art** HPC facilities are available through the National High Performance Computing (NHR) network
- Including compute clusters, large-scale storage, and advanced networking infrastructure

## Access Eligibility and Requirements

- Researchers, students, and collaborators from German universities
- University credentials (TIM-ID) are required
- Compliance with university and HPC usage policies is expected
- Resources are to be used responsibly

## HPC Resources

![](nhr.png)

## The RWTH Compute Cluster

Per Node:

- 2x Intel Xeon 8468 (48-core CPU)
- 1.5 TB local SSD storage
- 256 GB - 1,024 GB of RAM
- 632 such nodes available

## The RWTH Compute Cluster

Per Node (ML):

- 2x Intel Xeon 8468
- 695 GB local SSD storage
- 512 GB of RAM
- 4x NVIDIA H100 GPUs (96 GB HBM2e)
- 52 such nodes available

### Account Creation

Use Selfservice to create your account (https://idm.rwth-aachen.de/selfservice/)

1. Accounts and Passwords
2. Account Overview
3. Create HPC Account

### Login

```zsh
ssh ab1234@login23-1.hpc.itc.rwth-aachen.de
```

- Login with secure shell (ssh): **ssh tim@host**
- Needs VPN if not connected to eduroam
- *Different nodes are available for different use cases*

### Cluster Access Nodes

Use the appropriate node for your task. Be considerate of shared resources.

- Login Nodes - lightweight tasks (script editing, testing)
- Copy Nodes - optimized for data transfer

![overview of different login nodes](login-nodes.png)

### The File Systems

- **$HOME:** Small quota, backed up. For scripts and small files.
- **$WORK:** Large quota, not backed up. For working with many small files.
- **$HPCWORK:** Largest quota, not backed up. For I/O-intensive jobs and large files.

![](filesystems.png)

### Data Transfer Methods

- Use copy nodes for data transfer
- Consider data security and encryption
- Terminal: *scp* or *rsync*
- Mount as folder: *sshfs*

### Project Management

- Groups are created for every project
- Every group consists of owners (PC/PI), managers, and members
- Granted computation time and storage space are shared through groups
- In Aachen, project storage is deleted **8 months** after a project's conclusion. Make sure you migrate the data by then!

### Project Management

Users can be added to and removed from groups/projects using their TIM-ID.

```sh
member add --name
member delete --name
member finger # view group affiliations
```

- append **--manager** to any command above to assign or revoke the manager role

### Summary of this Chapter

- HPC enables advanced computational research
- NHR provides access to free HPC resources
- Account creation and cluster access
- File systems and project management

---

> The following slides are adapted from the HPC.NRW Competency Network under the CC-BY-SA license. Check out their work at https://hpc-wiki.info/hpc/HPC_Wiki

---

## Background and History

![](hpc.nrw/01_0002.png)
![](hpc.nrw/01_0003.png)
![](hpc.nrw/01_0004.png)
![](hpc.nrw/01_0005.png)

---

## The Command Line

![](hpc.nrw/02_0002.png)
![](hpc.nrw/02_0003.png)
![](hpc.nrw/02_0004.png)
![](hpc.nrw/02_0005.png)
![](hpc.nrw/02_0006.png)
![](hpc.nrw/02_0007.png)
![](hpc.nrw/02_0008.png)
![](hpc.nrw/02_0009.png)

---

## Directory Structure

![](hpc.nrw/03_0002.png)
![](hpc.nrw/03_0003.png)
![](hpc.nrw/03_0004.png)
![](hpc.nrw/03_0005.png)
![](hpc.nrw/03_0006.png)
![](hpc.nrw/03_0007.png)
![](hpc.nrw/03_0008.png)

---

## Files

![](hpc.nrw/04_0002.png)
![](hpc.nrw/04_0003.png)
![](hpc.nrw/04_0004.png)
![](hpc.nrw/04_0005.png)
![](hpc.nrw/04_0006.png)
![](hpc.nrw/04_0007.png)

---

## Text Display and Search

![](hpc.nrw/05_0002.png)
![](hpc.nrw/05_0003.png)
![](hpc.nrw/05_0004.png)
![](hpc.nrw/05_0005.png)
![](hpc.nrw/05_0006.png)
![](hpc.nrw/05_0007.png)

---

## Users and Permissions

![](hpc.nrw/06_0002.png)
![](hpc.nrw/06_0003.png)
![](hpc.nrw/06_0004.png)
![](hpc.nrw/06_0005.png)
![](hpc.nrw/06_0006.png)
![](hpc.nrw/06_0007.png)

---

## Processes

![](hpc.nrw/07_0002.png)
![](hpc.nrw/07_0003.png)
![](hpc.nrw/07_0004.png)
![](hpc.nrw/07_0005.png)

---

## The vim Text Editor

![](hpc.nrw/08_0002.png)
![](hpc.nrw/08_0003.png)
![](hpc.nrw/08_0004.png)
![](hpc.nrw/08_0005.png)
![](hpc.nrw/08_0006.png)
![](hpc.nrw/08_0007.png)
![](hpc.nrw/08_0008.png)
![](hpc.nrw/08_0009.png)

---

## Shell Scripts

![](hpc.nrw/09_0002.png)
![](hpc.nrw/09_0003.png)
![](hpc.nrw/09_0004.png)
![](hpc.nrw/09_0005.png)
![](hpc.nrw/09_0006.png)
![](hpc.nrw/09_0007.png)
![](hpc.nrw/09_0008.png)

---

## Environment Variables

![](hpc.nrw/10_0002.png)
![](hpc.nrw/10_0003.png)
![](hpc.nrw/10_0004.png)
![](hpc.nrw/10_0005.png)
![](hpc.nrw/10_0006.png)

---

## System Configuration

![](hpc.nrw/11_0002.png)
![](hpc.nrw/11_0003.png)
![](hpc.nrw/11_0004.png)
![](hpc.nrw/11_0005.png)
![](hpc.nrw/11_0006.png)
![](hpc.nrw/11_0007.png)

---

## SSH Connections

![](hpc.nrw/12_0002.png)
![](hpc.nrw/12_0003.png)
![](hpc.nrw/12_0004.png)
![](hpc.nrw/12_0005.png)
![](hpc.nrw/12_0006.png)
![](hpc.nrw/12_0007.png)
![](hpc.nrw/12_0008.png)
![](hpc.nrw/12_0009.png)

---

## Tips and Tricks

This is a collection of tips and tricks that make working with the cluster easier and more convenient.

## Tips and Tricks

1. You can mount the remote cluster file system to your local machine

```zsh
sshfs ab1234@copy23node:/home/ab1234 /mnt/clusterhome
```

## Tips and Tricks

1. You can mount the remote cluster file system to your local machine

```zsh
sshfs ab1234@copy23node:/home/ab1234 /mnt/clusterhome
```

unmount with:

```zsh
sudo umount -l /mnt/clusterhome
```

## Tips and Tricks

2. If you are outside of eduroam, you can connect through VPN and access the cluster *everywhere*

Go to http://help.itc.rwth-aachen.de and search for "VPN"
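## Tips and Tricks

3. Files and directories can also be copied from the terminal with *scp* or *rsync* (see "Data Transfer Methods" above). A minimal sketch; the file and directory names are placeholders, and the shorthand copy-node host from tip 1 is reused:

```zsh
# copy a single file to the cluster home directory
scp results.csv ab1234@copy23node:/home/ab1234/

# synchronize a local directory with the cluster, showing progress
rsync -avz --progress ./data/ ab1234@copy23node:/home/ab1234/data/
```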
---

## SLURM Job Manager

**S**imple **L**inux **U**tility for **R**esource **M**anagement

- Job scheduler often used in supercomputers and compute clusters
- Provides many advantages for utilizing HPC hardware with many users, such as ...
- ... Accounting, Containerization, Priorities, Chain- and Array-Jobs, ...

## SLURM Job Manager

- Users can interact with SLURM from login nodes
- Users may request cores, memory, and time, then send their programs to be queued
- SLURM reserves these resources and waits until they are available
- Once available, the code will then be run

## SLURM Job Manager

![](slurm.png)

## SLURM Job Manager

SLURM is fed **jobscripts**, which contain all information the scheduler needs to run a program.

These jobscripts consist of three parts:

1. Shebang
2. Job Parameters
3. Actual Job Code

## SLURM Job Manager

#### Example

## SLURM Job Manager

#### Example

```sh
#!/usr/bin/zsh
```

## SLURM Job Manager

#### Example

```sh
#!/usr/bin/zsh

### Job Parameters
#SBATCH --cpus-per-task=8
#SBATCH --time=00:15:00
#SBATCH --job-name=example_job
#SBATCH --output=stdout.txt
#SBATCH --account=<project-id>
```
## SLURM Job Manager

#### Example

```sh
#!/usr/bin/zsh

### Job Parameters
#SBATCH --cpus-per-task=8
#SBATCH --time=00:15:00
#SBATCH --job-name=example_job
#SBATCH --output=stdout.txt
#SBATCH --account=<project-id>

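### Note: lines beginning with "#SBATCH" are ordinary shell comments; SLURM parses them as job parameters
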
### Program Code
echo "Hello SLURM"
```

## SLURM Job Manager

#### Example

- Save the file
- Submit the job

```sh
> sbatch testjob.sh
```

- Check its state

```sh
> squeue --me
JOBID     PARTITION  NAME         USER      ST  TIME  NODES
12345678  c23ms      example_job  AB123456  R   0:02  1
```

## SLURM Job Manager

#### Common Parameters

- Number of cores: **-c / --cpus-per-task \<n\>**
- Memory: **--mem=\<n\>G**
- Human-readable job name: **-J / --job-name**
- Reporting file: **-o / --output=\<file\>**
- Runtime: **-t / --time=d-hh:mm:ss**
- Account: **-A / --account=\<project\>**
- GPUs: **--gres=gpu:\<type\>:\<n\>** (a sketch combining several of these parameters follows below)
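## SLURM Job Manager

#### Common Parameters

A minimal sketch of a GPU jobscript combining several of the parameters above. The account is a placeholder and the requested sizes are illustrative only; loading modules and starting your actual program are omitted:

```sh
#!/usr/bin/zsh

### Job Parameters (values are placeholders)
#SBATCH --job-name=gpu_example
#SBATCH --output=gpu_stdout.txt
#SBATCH --cpus-per-task=24
#SBATCH --mem=64G
#SBATCH --gres=gpu:1
#SBATCH --time=0-02:00:00
#SBATCH --account=<project-id>

### Program Code
nvidia-smi   # report the allocated GPU
echo "GPU job finished"
```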
## SLURM Job Manager

#### Commands in a Nutshell

Submit jobs

```sh
> sbatch <jobscript> [ADDITIONAL ARGUMENTS]
```

## SLURM Job Manager

#### Commands in a Nutshell

Display Job Queue

```sh
> squeue [OPTIONS]
```

- **--me** shows only your jobs
- **--start** shows the estimated start time
- **--format** customizes the displayed columns, e.g. to show which jobs request GPUs
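## SLURM Job Manager

#### Commands in a Nutshell

The chain and array jobs listed among SLURM's features earlier are also submitted with `sbatch`. A minimal sketch, reusing the `testjob.sh` script from above; the job ID is a placeholder:

```sh
# Array job: run testjob.sh ten times, with SLURM_ARRAY_TASK_ID set to 0..9
> sbatch --array=0-9 testjob.sh

# Chain job: start only after job <jobid> has finished successfully
> sbatch --dependency=afterok:<jobid> testjob.sh
```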
## SLURM Job Manager

#### Commands in a Nutshell

Cancel Jobs

```sh
> scancel [OPTIONS] <jobid>
```

- **--me** cancels all of your jobs
- **-v** provides details of the cancellation process

## SLURM Job Manager

#### Commands in a Nutshell

Request an interactive job

```sh
> salloc [OPTIONS]
```

Job parameters are given just like in the jobscript. The shell is redirected to the head node; the job ends when the shell terminates.

```sh
salloc --gres=gpu:1 -n 24 -t 1:00:00 # 24 cores + 1 GPU for 1 hour
```

## SLURM Job Manager

#### Commands in a Nutshell

Display Accounting Information

```sh
> sacct [OPTIONS]
```

Prints details about pending, running, or past jobs.

```sh
> sacct -S $(date -I --date="yesterday")
```

would show all jobs submitted since yesterday.

## SLURM Job Manager

#### Commands in a Nutshell

```sh
> r_wlm_usage
```

RWTH-only command that displays available and used resources against the account/group quota.

## SLURM Job Manager

Monitoring tools are improving constantly, with *perfmon* currently being established across compute clusters

## SLURM Job Manager

![](grafana.png)

---

## Compute Time Application

- Resources are measured in core hours (core-h)
- A MacBook Pro has roughly 70k core-h/year
- The smallest project already guarantees 360k core-h
- Estimate your needs based on previous work (see the worked example below)
- Write and submit your application!
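As a rough illustration (the job sizes below are hypothetical): a job using 2 nodes with 96 cores each for 12 hours consumes 2 × 96 × 12 = 2,304 core-h; one such job per week over a year adds up to about 2,304 × 52 ≈ 120k core-h, comfortably within the smallest project category.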
## Compute Time Application

HPC resources in Germany are arranged hierarchically in the HPC Performance Pyramid.

- **Tier-0**: PRACE and EuroHPC
- **Tier-1**: Gauss Centre for Supercomputing (GCS: JSC, HLRS, LRZ)
- **Tier-2**: HPC centres with supra-regional tasks and thematically dedicated HPC centres
- **Tier-3**: Regional HPC centres

## Compute Time Application

![](image-2.png)

## Compute Time Application

![](computetime.jpg)

![](nhr.png)

![](nhrprocess.jpg)

### Acknowledgement (RWTH/JARA)

*Computations were performed with computing resources granted by RWTH Aachen University under project \<project-id\>.*

### Acknowledgement (RWTH/JARA)

*The authors gratefully acknowledge the computing time provided to them at the NHR Center NHR4CES at RWTH Aachen University (project number \<project-id\>). This is funded by the Federal Ministry of Education and Research, and the state governments participating on the basis of the resolutions of the GWK for national high performance computing at universities (www.nhr-verein.de/unsere-partner).*
---

### Any Questions?