diff --git a/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/.gitignore b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/.gitignore new file mode 100644 index 00000000..b6d26282 --- /dev/null +++ b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/.gitignore @@ -0,0 +1,3 @@ +slurm*/ + +*.tgz \ No newline at end of file diff --git a/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/Docker-Build-README.md b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/Docker-Build-README.md new file mode 100644 index 00000000..a0b54dec --- /dev/null +++ b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/Docker-Build-README.md @@ -0,0 +1,154 @@ +# Docker Build for the Slurmd Deep Learning Container + +This build includes Python 3.12.8 + PyTorch 2.6.0 + CUDA 12.6 + NCCL 2.23.4 + EFA Installer 1.38.0 (bundled with OFI NCCL plugin) + +Clone the AWSome Distributed Training repo: +``` +https://github.com/aws-samples/awsome-distributed-training.git +cd awsome-distributed-training/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/ + +``` + +Build the container image: + +``` + +# Authenticate to DLC repo (Account 763104351884 is publicly known) +aws ecr get-login-password --region us-east-1 \ +| docker login --username AWS \ +--password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com + +# on a Mac +docker buildx build --platform linux/amd64 -t dlc-slurmd:24.11.4-ubuntu24.04 -f dlc-slurmd.Dockerfile . + +# on Linux +# docker build -t dlc-slurmd:24.11.4-ubuntu24.04 -f dlc-slurmd.Dockerfile . + +``` + +Test locally: + +Verify Python 3.12.8 + PyTorch 2.6.0 + CUDA 12.6 + NCCL 2.23.4 + +``` + +docker run --platform linux/amd64 -it --entrypoint=/bin/bash dlc-slurmd:24.11.4-ubuntu24.04 + +python3 --version +# Python 3.12.8 + +which python3 +# /usr/local/bin/python3 + +nvcc --version +# nvcc: NVIDIA (R) Cuda compiler driver +# Copyright (c) 2005-2024 NVIDIA Corporation +# Built on Tue_Oct_29_23:50:19_PDT_2024 +# Cuda compilation tools, release 12.6, V12.6.85 +# Build cuda_12.6.r12.6/compiler.35059454_0 + +python3 -c "import torch; print(torch.__version__)" +# 2.6.0+cu126 + +python3 -c "import torch; print(torch.cuda.nccl.version())" +# (2, 23, 4) + +ls -l /usr/local/lib/libnccl* +# -rwxr-xr-x 1 root root 263726576 Mar 6 23:36 /usr/local/lib/libnccl.so +# -rwxr-xr-x 1 root root 263726576 Mar 6 23:36 /usr/local/lib/libnccl.so.2 +# -rwxr-xr-x 1 root root 263726576 Mar 6 23:36 /usr/local/lib/libnccl.so.2.23.4 +# -rw-r--r-- 1 root root 277972056 Mar 6 23:36 /usr/local/lib/libnccl_static.a + +cat /etc/nccl.conf +# NCCL_DEBUG=INFO +# NCCL_SOCKET_IFNAME=^docker0 + +``` + +Create a private ECR repo: + +``` + +aws ecr create-repository --repository-name dlc-slurmd + +``` + +Authenticate to the repo: + +``` +export AWS_ACCOUNT_ID= +export AWS_REGION= + +aws ecr get-login-password --region $AWS_REGION \ + | docker login --username AWS \ + --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com + +``` + +Tag the image: + +``` + +docker tag dlc-slurmd:24.11.4-ubuntu24.04 \ + ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/dlc-slurmd:24.11.4-ubuntu24.04 + +``` + +Push the image to an ECR repo: + +``` + +docker push ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/dlc-slurmd:24.11.4-ubuntu24.04 + +``` + +Test ECR access: + +``` + +kubectl run test-pod \ + --image=${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/dlc-slurmd:24.11.4-ubuntu24.04 \ + --restart=Never \ + --image-pull-policy=Always + +# verify slurm version +kubectl exec -it test-pod -- slurmd -V + +kubectl 
describe pod test-pod
+
+# verify additional requirements
+kubectl exec -it test-pod -- ls /usr/local/lib/python3.12/site-packages/ \
+ | egrep "datasets|fsspec|numpy|torch|torchaudio|torchvision|transformers"
+
+kubectl delete pod test-pod
+
+```
+
+(Optional) Update the container image used by the Slinky NodeSet:
+
+Note: this step is not required if you specify the image repository and tag in the [values.yaml](./values.yaml) file, but it is useful if you want to test a new image build without redeploying the entire Slurm cluster.
+
+```
+export NODESET_NAME=$(kubectl get nodeset -n slurm -o custom-columns=NAME:metadata.name --no-headers)
+
+kubectl -n slurm patch nodeset.slinky.slurm.net \
+  $NODESET_NAME \
+  --type='json' \
+  -p="[
+    {\"op\": \"replace\", \"path\": \"/spec/template/spec/containers/0/image\", \"value\":\"${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/dlc-slurmd:24.11.4-ubuntu24.04\"},
+    {\"op\": \"replace\", \"path\": \"/spec/template/spec/containers/0/imagePullPolicy\", \"value\":\"Always\"}
+  ]"
+
+```
+
+Scale the Slinky NodeSet down and back up to trigger replacement:
+
+```
+
+kubectl -n slurm scale nodeset/$NODESET_NAME --replicas=0
+
+kubectl -n slurm scale nodeset/$NODESET_NAME --replicas=4
+
+```
+
diff --git a/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/README.md b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/README.md
new file mode 100644
index 00000000..4a6e8f5c
--- /dev/null
+++ b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/README.md
@@ -0,0 +1,857 @@
+# Running Slurm on HyperPod EKS with Slinky
+
+### What is the Slinky Project?
+
+The [Slinky Project](https://github.com/SlinkyProject/slurm-operator/tree/main) is an open-source solution maintained by SchedMD (the main developer of Slurm) that deploys Slurm on Kubernetes. When paired with HyperPod EKS, the Slinky Project unlocks the ability for enterprises who have standardized infrastructure management on Kubernetes to deliver a Slurm-based experience to their ML scientists. It also enables training, experimentation, and inference to happen on the same cluster of accelerated nodes with the built-in resiliency provided by HyperPod.
+
+---
+
+### Slinky on HyperPod EKS Architecture
+![Image Description](./slinky-slurm-hp-eks.png)
+
+The diagram above depicts the resulting proof-of-concept deployment outlined in this guide. An Amazon EKS cluster acts as an orchestration layer, while a HyperPod cluster delivers a resilient instance group of GPU accelerated compute nodes. The Slinky Slurm operator is installed to extend Kubernetes with custom resources and actions, and a containerized Slurm cluster is deployed as Kubernetes pods via a Helm chart. This Slurm cluster includes the following components:
+| Component | Description |
+|-----------|-------------|
+| Controller (slurmctld) | The central management daemon that monitors resources, accepts jobs, and assigns work to compute nodes. |
+| Accounting (slurmdbd) | Handles job accounting and user/project management through a MariaDB database backend. |
+| Compute (slurmd) | The worker nodes that execute jobs, organized into NodeSets which can be grouped into different partitions. |
+| Login | Provides SSH access points for users to interact with the Slurm cluster and submit jobs. |
+| REST API (slurmrestd) | Offers HTTP-based API access to Slurm functionality for programmatic interaction with the cluster. |
+| Authentication (sackd) | Manages credential authentication for secure access to Slurm services. |
+| MariaDB | The database backend used by the accounting service to store job, user, and project information. |
+| Slurm Exporter | Collects and exports Slurm metrics for monitoring purposes. |
+
+The login LoadBalancer type service is annotated to dynamically create an AWS Network Load Balancer using the [AWS Load Balancer Controller](https://github.com/kubernetes-sigs/aws-load-balancer-controller), allowing ML scientists to SSH into their login pods without interfacing with the Kubernetes API server via kubectl.
+
+The login and compute node pods also have FSx for Lustre and FSx for OpenZFS shared filesystems mounted. Having containerized compute node pods allows many dependencies that would traditionally be installed manually using Conda or a Python virtual environment to be baked into the container image, but shared filesystems are still beneficial for storing training artifacts, data, logs, and checkpoints. If Conda environments are still required, FSx for OpenZFS has proven optimal to avoid IOPS saturation with many small files.
+
+---
+
+### Release Notes
+
+The following was tested in two infrastructure scenarios for hosting the compute NodeSet pods:
+1. On 4 `g5.8xlarge` instances (1 A10G Tensor Core GPU each)
+2. On 2 `p5.48xlarge` instances (8 H100 Tensor Core GPUs each) with EFAv2
+
+For simplicity, 2 `m5.2xlarge` instances were also allocated for separately hosting other components like the Controller and Login pods. You can adjust the number and type of instances associated with your HyperPod cluster, as well as the component affinity rules in the respective [g5-values.yaml](./g5/g5-values.yaml) or [p5-values.yaml](./p5/p5-values.yaml) files to modify how they are spread across your nodes.
+
+Testing used [Slurm Operator v0.2.1](https://github.com/slinkyproject/slurm-operator/pkgs/container/slurm-operator) (pulled as OCI artifacts from the Slinky container registry) and [Slurm Cluster v0.3.0](https://github.com/SlinkyProject/slurm-operator/tree/main/helm/slurm) (packaged and deployed locally using the main branch of the Slinky git repository) in order to include the NodeSet volume mount and Login Pod features. These features are expected to be included in the official Slurm Cluster v0.3.0 release when it becomes available, along with a new version of the Slurm Operator with corresponding validating webhooks.
+
+Note that the [Slinky Project](https://github.com/SlinkyProject) is under active development and could introduce breaking changes that may require modified deployment and configuration steps.
+
+Worker pods were built with Python 3.12.8 + PyTorch 2.6.0 + CUDA 12.6 + NCCL 2.23.4 + EFA Installer 1.38.0 (bundled with OFI NCCL plugin) pre-installed in the container image. See the [Docker Build for the Slurmd Deep Learning Container](./Docker-Build-README.md) for details.
+
+* * *
+
+### Set Up the HyperPod Cluster:
+
+Follow the [Prerequisites](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/00-setup) and [Cluster Configuration](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/01-cluster) steps of the [HyperPod EKS Workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US).
+
+Be sure to modify the Accelerated and General Purpose instance groups as needed to deploy the desired instance type and number of nodes.
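+
+(Optional) To confirm that the HyperPod instance groups came up with the expected instance types and node counts, you can describe the cluster with the AWS CLI. This is a minimal sketch; the cluster name below is an assumption, so substitute the name you gave your HyperPod cluster:
+
+```
+export HP_CLUSTER_NAME=ml-cluster   # assumed name; use your own HyperPod cluster name
+
+# Overall cluster status (expect "InService")
+aws sagemaker describe-cluster \
+  --cluster-name $HP_CLUSTER_NAME \
+  --query 'ClusterStatus'
+
+# Instance group, instance type, and status of each node
+aws sagemaker list-cluster-nodes \
+  --cluster-name $HP_CLUSTER_NAME \
+  --query 'ClusterNodeSummaries[].{Group:InstanceGroupName,Type:InstanceType,Status:InstanceStatus.Status}' \
+  --output table
+```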
+ +(Optional) Add an access entry (if needed): + +``` +export AWS_ACCOUNT_ID= + +export EKS_CLUSTER_NAME=sagemaker-hyperpod-eks-cluster + +export ROLE_ARN=arn:aws:iam::$AWS_ACCOUNT_ID:role/ + +export PLCY_ARN=arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy + +export AWS_REGION=us-west-2 + +aws eks create-access-entry \ + --cluster-name $EKS_CLUSTER_NAME \ + --principal-arn $ROLE_ARN \ + --type STANDARD \ + --region $AWS_REGION + +aws eks associate-access-policy \ + --cluster-name $EKS_CLUSTER_NAME \ + --principal-arn $ROLE_ARN \ + --policy-arn $PLCY_ARN \ + --access-scope type=cluster \ + --region $AWS_REGION +``` + +Update your kubectl context: + +``` +aws eks update-kubeconfig --name $EKS_CLUSTER_NAME + +kubectl get nodes +``` + +* * * + +### Create an FSx for Lustre Storage Class: + +Follow the [Setup FSx for Lustre File System](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/01-cluster/06-fsx-for-lustre) of the [HyperPod EKS Workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US). + +Verify` fsx-sc` Storage Class: + +``` +kubectl get storageclass fsx-sc -oyaml +``` + +* * * + +### Create an FSx for OpenZFS Storage Class: + +Install the [OpenZFS CSI driver](https://github.com/kubernetes-sigs/aws-fsx-openzfs-csi-driver) following the steps provided below: + +``` +eksctl create iamserviceaccount \ + --name fsx-openzfs-csi-controller-sa \ + --namespace kube-system \ + --cluster $EKS_CLUSTER_NAME \ + --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \ + --approve \ + --role-name FSXOCSI-${EKS_CLUSTER_NAME}-${AWS_REGION} \ + --region $AWS_REGION + +helm repo add aws-fsx-openzfs-csi-driver \ + https://kubernetes-sigs.github.io/aws-fsx-openzfs-csi-driver + +helm repo update + +helm upgrade --install aws-fsx-openzfs-csi-driver \ + --namespace kube-system \ + --set controller.serviceAccount.create=false \ + aws-fsx-openzfs-csi-driver/aws-fsx-openzfs-csi-driver + +kubectl get pods -n kube-system \ + -l app.kubernetes.io/part-of=aws-fsx-openzfs-csi-driver +``` + +Follow the [Dynamic Provisioning](https://github.com/kubernetes-sigs/aws-fsx-openzfs-csi-driver/tree/main/examples/kubernetes/dynamic-provisioning) guide to create an FSx for OpenZFS Storage Class: + +``` +export PRIVATE_SUBNET_ID= +export SECURITY_GROUP_ID= + +kubectl apply -f openzfs-storageclass.yaml + +kubectl get sc openzfs-sc -oyaml +``` + +* * * + +### Install the AWS Load Balancer Controller: + +Following the instructions below, which are a consolidation of the full [Install with Helm](https://docs.aws.amazon.com/eks/latest/userguide/lbc-helm.html) instructions found in the Amazon EKS documentation: + +``` +export EKS_CLUSTER_NAME=sagemaker-hyperpod-eks-cluster +export VPC_ID= +export AWS_REGION=us-west-2 +export AWS_ACCOUNT_ID= + +# manually update crds +kubectl apply -k "github.com/aws/eks-charts/stable/aws-load-balancer-controller/crds?ref=master" + +curl -O https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.12.0/docs/install/iam_policy.json + +aws iam create-policy \ + --policy-name AWSLoadBalancerControllerIAMPolicy-v2.12.0 \ + --policy-document file://iam_policy.json + +eksctl create iamserviceaccount \ + --cluster=$EKS_CLUSTER_NAME \ + --namespace=kube-system \ + --name=aws-load-balancer-controller \ + --attach-policy-arn=arn:aws:iam::$AWS_ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy-v2.12.0 \ + --override-existing-serviceaccounts \ + --region $AWS_REGION \ + --approve + +helm repo add eks https://aws.github.io/eks-charts +helm 
repo update

+helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
+  -n kube-system \
+  --set clusterName=$EKS_CLUSTER_NAME \
+  --set serviceAccount.create=false \
+  --set serviceAccount.name=aws-load-balancer-controller \
+  --set region=$AWS_REGION \
+  --set vpcId=$VPC_ID
+
+kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
+
+kubectl get sa aws-load-balancer-controller -n kube-system -oyaml
+```
+
+* * *
+
+### Install Slinky Prerequisites (Cert Manager and Prometheus):
+
+Follow the steps included in the [Slinky QuickStart Guide | Pre-Requisites](https://github.com/SlinkyProject/slurm-operator/blob/main/docs/quickstart.md#pre-requisites) section to install Cert Manager and Prometheus.
+
+Verify the Pre-Requisites installation:
+
+```
+kubectl get all -n cert-manager
+kubectl get all -n prometheus
+```
+
+* * *
+
+### Install the Slurm Operator:
+
+For the [Slurm Operator](https://github.com/SlinkyProject/slurm-operator/blob/main/docs/quickstart.md#pre-requisites) installation, we'll install release v0.2.1, which is the latest release available at the time of testing.
+
+Note: We will locally build and deploy a pre-release v0.3.0 of the [Slurm Cluster](https://github.com/SlinkyProject/slurm-operator/tree/main/helm/slurm) from the main branch of the Slinky Project repository. The project is being actively developed, so there is a risk of pulling down breaking changes, but it includes the features to [add additional volume mounts to compute NodeSets](https://github.com/SlinkyProject/slurm-operator/commit/b0e111b0a8434e38b5fb37a2051e7525d5679319) and [deploy Login Pods](https://github.com/SlinkyProject/slurm-operator/commit/37f020f041556164b9c935f799b51df65d22aefe).
+
+```
+curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm-operator/values.yaml \
+  -o values-operator-0.2.1.yaml
+
+# Delete any stale CRDs (if you deployed an older version)
+kubectl delete crd clusters.slinky.slurm.net
+kubectl delete crd nodesets.slinky.slurm.net
+
+helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
+  --values=values-operator-0.2.1.yaml --version=0.2.1 --namespace=slinky --create-namespace
+```
+
+Verify the Slurm Operator installation:
+
+```
+kubectl get all -n slinky
+```
+
+* * *
+
+### Install the Slurm Cluster:
+
+To deploy the Slurm cluster, we first need to make some modifications to the [values.yaml](https://github.com/SlinkyProject/slurm-operator/blob/dd65faba359702a8eda6cce9484b702f2fd2ae2e/helm/slurm/values.yaml) file. After that, in order to test the latest changes in release v0.3.0, we'll locally package and deploy the Helm chart from the main branch of the cloned repo.
+
+For your convenience, we've provided [g5-values.yaml](./g5/g5-values.yaml) and [p5-values.yaml](./p5/p5-values.yaml) files with most of the configuration changes mentioned below already implemented, so you'll only need to make additional changes as needed to further customize your deployment.
+
+At a minimum, you must modify the container image that the Slurm compute nodes use ([instructions here](#build-and-set-the-compute-node-container-image)) and the root SSH key used for accessing the login node ([instructions here](#login-access)).
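+
+(Optional) Before customizing the values file, you can confirm that the Slinky custom resource definitions registered by the operator are present. The CRD names below are the same ones referenced in the operator installation step above:
+
+```
+kubectl get crd clusters.slinky.slurm.net nodesets.slinky.slurm.net
+
+kubectl api-resources --api-group=slinky.slurm.net
+```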
+ +--- + +#### Clone the Repos +Clone the Slurm Operator repository, which also contains the Helm chart artifacts for the Slurm Cluster: +``` +git clone https://github.com/SlinkyProject/slurm-operator.git +``` + +Clone the AWSome Distributed Training repo to use the [values.yaml](./values.yaml) file we've provided: +``` +git clone https://github.com/aws-samples/awsome-distributed-training.git + +cd awsome-distributed-training/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm +``` + +(Optional) If you wish to start from scratch, open the [values.yaml](https://github.com/SlinkyProject/slurm-operator/blob/dd65faba359702a8eda6cce9484b702f2fd2ae2e/helm/slurm/values.yaml) file associated with the Slurm Cluster Helm Chart: +``` +code slurm-operator/helm/slurm/values.yaml +``` +--- + +#### Component Affinity: + +Verify the existence of the instance type label for non-compute component affinity: + +``` +export GEN_INSTANCE_TYPE=ml.m5.2xlarge + +kubectl get nodes -l node.kubernetes.io/instance-type=$GEN_INSTANCE_TYPE +``` +For each non-compute component, we apply both a Node Affinity and a Pod Anti-affinity in [values.yaml](./values.yaml) to ensure they are hosted only on the 2 `m5.2xlarge` instances while also being evenly spread between the hosts. + +``` +# Inter-pod anti-affinity and node affinity for non-compute components +commonAffinity: &commonAffinity + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: "node.kubernetes.io/instance-type" + operator: In + values: + - "ml.m5.2xlarge" + podAntiAffinity: + preferredDuringSchedulingIgnoredDuringExecution: + - weight: 100 + podAffinityTerm: + labelSelector: + matchExpressions: + - key: "app.kubernetes.io/name" + operator: In + values: ["slurmdbd", "slurmctld", "slurm-exporter", "login", "mariadb", "slurmrestd"] + topologyKey: "kubernetes.io/hostname" +``` +You can modify this common affinity setting, or apply unique affinity settings for individual components for further customization. + +--- + +#### Compute Node Selector: + +Verify the existence of the instance type label for compute node selector: + +``` +# for g5 instances +ACCEL_INSTANCE_TYPE=ml.g5.8xlarge + +# for p5 instances +ACCEL_INSTANCE_TYPE=ml.p5.48xlarge + + kubectl get nodes -l node.kubernetes.io/instance-type=$ACCEL_INSTANCE_TYPE +``` + +The instance type label is used as a node selector to ensure the compute pods only run on either the `ml.g5.8xlarge` or `ml.p5.48xlarge` GPU accelerated instances: + +``` +# for g5 instances +compute: +... + nodeSets: + - name: hp-node + ... + replicas: 4 + ... + nodeSelector: + kubernetes.io/os: linux + node.kubernetes.io/instance-type: ml.g5.8xlarge +... + +# for p5 instances +compute: +... + nodeSets: + - name: hp-node + ... + replicas: 4 + ... + nodeSelector: + kubernetes.io/os: linux + node.kubernetes.io/instance-type: ml.p5.48xlarge +... +``` +--- + +#### Create an FSx for Lustre Persistent Volume Claim (PVC) in the slurm namespace: + +Create the slurm namespace: + +``` +kubectl create ns slurm +``` + +This is needed to reference for node volume mounts later. 
+ +``` +kubectl apply -f lustre-pvc-slurm.yaml +``` + +Verify FSx for Lustre PVC creation: + +``` +kubectl get pvc -n slurm + +# check for a bound state +kubectl get pvc fsx-claim -n slurm -ojson \ + | jq -r .status.phase + +# get the the volume ID +kubectl get pv $(kubectl get pvc fsx-claim -n slurm -ojson \ + | jq -r .spec.volumeName) -ojson \ + | jq -r .spec.csi.volumeHandle +``` +--- + +#### Create an FSx for OpenZFS PVC in the slurm namespace: + +``` +kubectl apply -f openzfs-pvc-slurm.yaml +``` + +Verify FSx for OpenZFS PVC creation: + +``` +kubectl get pvc -n slurm + +# check for a bound state +kubectl get pvc openzfs-claim -n slurm -ojson \ + | jq -r .status.phase + +# get the volume ID +kubectl get pv $(kubectl get pvc openzfs-claim -n slurm -ojson \ + | jq -r .spec.volumeName) -ojson \ + | jq -r .spec.csi.volumeHandle +``` + +FSx for Lustre and OpenZFS PVCs are added to the list of `extraVolumeMounts` and `extraVolumes` for both the login service and compute nodes: + +``` +login: + ... + extraVolumeMounts: + - name: fsx-lustre + mountPath: /fsx + - name: fsx-openzfs + mountPath: /home + ... + extraVolumes: + - name: fsx-lustre + persistentVolumeClaim: + claimName: fsx-claim + - name: fsx-openzfs + persistentVolumeClaim: + claimName: openzfs-claim + +compute: + nodesets: + - name: hp-node + ... + extraVolumeMounts: + - name: fsx-lustre + mountPath: /fsx + - name: fsx-openzfs + mountPath: /home + - name: shmem + mountPath: /dev/shm + ... + extraVolumes: + - name: fsx-lustre + persistentVolumeClaim: + claimName: fsx-claim + - name: fsx-openzfs + persistentVolumeClaim: + claimName: openzfs-claim + - name: shmem + hostPath: + path: /dev/shm +``` + +Note that for the compute nodes we've also added `/dev/shm` to provide access to the EC2 host's shared memory segment. This shared memory is used to for inter-process communication. + +--- + +#### Configure Compute Node Resources: + +Note: limits are required, otherwise the compute nodes will not deploy. + +``` +# for g5 instances +compute: + nodesets: + - name: hp-node + ... + resources: + limit: + nvidia.com/gpu: "1" + requests: + nvidia.com/gpu: "1" + ... + +# for p5 instances +compute: + nodesets: + - name: hp-node + ... + resources: + limits: + nvidia.com/gpu: 4 + vpc.amazonaws.com/efa: 16 + requests: + nvidia.com/gpu: 4 + vpc.amazonaws.com/efa: 16 + ... +``` +Note that for p5 capacity, we are allocating half the available GPU and EFA network interfaces to each pod so that two pods can run on one instances. This can be adjusted to accomodate other pod topologies. + +--- + +#### Build and Set the Compute Node Container Image: + +Use the provided [dlc-slurmd.Dockerfile](./dlc-slurmd.Dockerfile) to build a [Slurmd Deep Learning Container](./Docker-Build-README.md) (Slurmd DLC), following [the instructions here](./Docker-Build-README.md). + +then modify the compute node container image to use your Slurmd DLC build: + +``` +compute: + nodesets: + - name: compute-node + ... + # Set the image to use. + image: + # + # -- (string) + # Set the image repository to use. + repository: ".dkr.ecr..amazonaws.com/dlc-slurmd" + # + # -- (string) + # Set the image tag to use. + tag: "24.11.4-ubuntu24.04" + ... +``` +The Slurm DLC has Python 3.12.8 + PyTorch 2.6.0 + CUDA 12.6 + NCCL 2.23.4 + EFA Installer 1.38.0 (bundled with OFI NCCL plugin) pre-installed in the container image, but you can modify the [dlc-slurmd.Dockerfile](./dlc-slurmd.Dockerfile) for further customization. 
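+
+(Optional) Before wiring the image into the values file, you can confirm that the tag exists in your private ECR repository. This assumes the image was already pushed as described in the [Docker build guide](./Docker-Build-README.md):
+
+```
+aws ecr describe-images \
+  --repository-name dlc-slurmd \
+  --region $AWS_REGION \
+  --image-ids imageTag=24.11.4-ubuntu24.04 \
+  --query 'imageDetails[0].[imageTags,imagePushedAt]'
+```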
+ +--- + +#### Login Access: + +Access to the login service can be configured through several authentication and networking mechanisms. The login service can be exposed either as a `LoadBalancer` (default) or `NodePort` type service, with the external port configurable via `servicePort` (default 22) or `serviceNodePort` (default 32222) respectively. Authentication can be integrated with LDAP through SSSD configuration, where users and groups can be managed via the `sssdConf` settings that define LDAP URIs, search bases, and domain configurations. SSH access can be customized through both `sshdConfig` and `rootSshAuthorizedKeys` parameters, allowing for specific SSH daemon configurations and authorized key management. Additionally, the name service switch configuration (`nsswitchConf`) can be customized to control how various databases like passwd, group, and hosts are resolved, with support for multiple sources including files, SSS, and database lookups. + +For simplicity of demonstration, we'll use SSH key authentication for root access. + +Generate an SSH key for root authorization: + +``` +export EMAIL_ADDR= + +ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_slurm -C "${EMAIL_ADDR}" + +cat ~/.ssh/id_ed25519_slurm.pub + +# ssh-ed25519 janedoe@example.com +``` + +Specify the root SSH authorized key in `values.yaml`: + +``` +login: + ... + rootSshAuthorizedKeys: + - "ssh-ed25519 janedoe@example.com" + ... +``` +--- + +#### Deploy the Slurm Cluster: + +Locally package and deploy the slurm cluster using the modified `values.yaml` file (either [g5-values.yaml](./g5/g5-values.yaml) or [p5-values.yaml](./p5/p5-values.yaml)): + +Assuming you are still sitting in the `slinky-slurm` directory of the AWSome Distributed Training repo that we cloned and navigated into earlier, and assuming you cloned the Slinky repo into your home directory (adjust the path as needed), copy the Helm chart artifacts in for packaging: +``` +cp -r ~/slurm-operator/helm/slurm . 
+``` + +Locally package the Slurm cluster Helm chart v0.3.0: + +``` +helm dependency update slurm + +helm package slurm +``` +**Option 1**: Deploy the Slurm cluster on `ml.g5.8xlarge` instances: +``` +# Dry run +helm install --dry-run slurm slurm-0.3.0.tgz \ +-f g5/g5-values.yaml \ +-n slurm + +helm install slurm slurm-0.3.0.tgz \ +-f g5/g5-values.yaml \ +-n slurm +``` +**Option 2**: Deploy the Slurm cluster on `ml.p5.48xlarge` instances: +``` +# Dry run +helm install --dry-run slurm slurm-0.3.0.tgz \ +-f p5/p5-values.yaml \ +-n slurm + +helm install slurm slurm-0.3.0.tgz \ +-f p5/p5-values.yaml \ +-n slurm +``` + +Watch the deployment status of the Slurm cluster: + +``` +kubectl -n slurm get pods -l app.kubernetes.io/instance=slurm --watch +``` + +Verify the deployment status of all components: + +``` +kubectl get all -n slurm +``` + +--- + +#### Configure Login Network Load Balancer provisioning using the AWS Load Balancer Controller: + +Manually add annotation to the `slurm-login` service: + +``` +export PUBLIC_SUBNET_ID_1= +export PUBLIC_SUBNET_ID_2= + +kubectl annotate service slurm-login -n slurm \ + service.beta.kubernetes.io/aws-load-balancer-type="nlb" \ + service.beta.kubernetes.io/aws-load-balancer-scheme="internet-facing" \ + service.beta.kubernetes.io/aws-load-balancer-nlb-target-type="ip" \ + service.beta.kubernetes.io/aws-load-balancer-subnets="$PUBLIC_SUBNET_ID_1,$PUBLIC_SUBNET_ID_2" \ + service.beta.kubernetes.io/aws-load-balancer-healthcheck-port="22" \ + --overwrite + +kubectl describe service slurm-login -n slurm +``` + +Any annotations added to the slurm cluster `values.yaml` file for the slurm-login service are currently ignored, but AWS Load Balancer Controller actively watches for and implements annotation changes. It Automatically adds inbound rules to the node security group to allow traffic from the NLB security group on the target port (22 in this case). + +--- + +### Basic Tests: + +SSH into the login node as root from the NLB endpoint: + +``` +SLURM_LOGIN_HOSTNAME="$(kubectl get services -n slurm -l app.kubernetes.io/instance=slurm,app.kubernetes.io/name=login -o jsonpath="{.items[0].status.loadBalancer.ingress[0].hostname}")" +ssh -i ~/.ssh/id_ed25519_slurm -p 22 root@$SLURM_LOGIN_HOSTNAME +``` +--- + +Check the available nodes: + +``` +sinfo + +PARTITION AVAIL TIMELIMIT NODES STATE NODELIST +hp-node up infinite 4 idle hp-node-[0-3] +all* up infinite 4 idle hp-node-[0-3] +``` +Note that in both scenarios (using 4 `ml.g5.8xlarge` instances or 2 `ml.p5.48xlarge` instances) we should see the same number of slurm compute nodes. When running on 4 `ml.g5.8xlarge` instances, each slurm compute node is mapped to 1 available A10G GPU, whereas when running on 2 `ml.p5.48xlarge` instances, each slurm compute node is mapped to 4 available H100 GPUs and 16 EFA network interfaces. 
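+
+---
+
+(Optional) From the same login session, you can run a quick job across all Slurm nodes to confirm that each node sees the GPUs it was allocated. This is a minimal sketch assuming `nvidia-smi` is available on the compute nodes, as used in the checks below:
+
+```
+# One task per node across all 4 Slurm nodes; print the hostname and its visible GPUs
+srun -N 4 --ntasks-per-node=1 bash -c 'echo "$(hostname): $(nvidia-smi -L)"'
+```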
+ +--- + +Verify FSx for Lustre and OpenZFS filesystem mounts on the login pod: + +``` +df -h + +# Filesystem Size Used Avail Use% Mounted on +# overlay 500G 30G 471G 6% / +# tmpfs 64M 0 64M 0% /dev +# tmpfs 63G 0 63G 0% /sys/fs/cgroup +# 10.1.12.93@tcp:/7c5dpb4v 1.2T 7.8M 1.2T 1% /fsx +# fs-03221b7c7d3767607.fsx.us-west-2.amazonaws.com:/fsx 64G 0 64G 0% /home +# tmpfs 115G 4.0K 115G 1% /etc/slurm +# /dev/nvme0n1p1 100G 23G 78G 23% /run +# /dev/nvme1n1 500G 30G 471G 6% /etc/hostname +# shm 64M 0 64M 0% /dev/shm +# tmpfs 115G 4.0K 115G 1% /etc/sssd/sssd.conf +# tmpfs 115G 12K 115G 1% /etc/ssh/ssh_host_rsa_key +# tmpfs 63G 0 63G 0% /proc/acpi +# tmpfs 63G 0 63G 0% /sys/firmware + +exit +``` +--- + +Verify FSx for Lustre and OpenZFS filesystem mounts on the compute node pods: + +``` +kubectl -n slurm exec -it pod/slurm-compute-hp-node-0 -- bash --login + +df -h + +# Filesystem Size Used Avail Use% Mounted on +# overlay 500G 31G 470G 7% / +# tmpfs 64M 0 64M 0% /dev +# tmpfs 63G 0 63G 0% /sys/fs/cgroup +# 10.1.12.93@tcp:/7c5dpb4v 1.2T 7.5M 1.2T 1% /fsx +# fs-03221b7c7d3767607.fsx.us-west-2.amazonaws.com:/fsx 64G 0 64G 0% /home +# tmpfs 115G 4.0K 115G 1% /etc/slurm +# /dev/nvme0n1p1 100G 23G 78G 23% /run +# /dev/nvme1n1 500G 31G 470G 7% /etc/hostname +# shm 64M 0 64M 0% /dev/shm +# tmpfs 115G 0 115G 0% /var/log/slurm +``` +--- + +Check the installed CUDA compiler version on compute node pods: + +``` +nvcc --version + +# nvcc: NVIDIA (R) Cuda compiler driver +# Copyright (c) 2005-2024 NVIDIA Corporation +# Built on Tue_Oct_29_23:50:19_PDT_2024 +# Cuda compilation tools, release 12.6, V12.6.85 +# Build cuda_12.6.r12.6/compiler.35059454_0 +``` +--- + +Check the NCCL version on compute node pods: + +``` +ldconfig -v | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//' + +# 2.23.4 +``` +--- + +Confirm NCCL headers are installed worker node pods: + +``` +find /usr/local/lib/ -name "nccl.h" 2>/dev/null + +# /usr/local/lib/python3.12/site-packages/torch/include/torch/csrc/cuda/nccl.h +``` +--- + +For p5 capacity, check EFA availability: +``` +ls /sys/class/infiniband/ +fi_info -p efa +``` +Check that the EFA libraries are properly mounted +``` +ls /opt/amazon/efa/lib +ls /opt/amazon/ofi-nccl/lib/x86_64-linux-gnu +``` +Verify EFA device allocation: +``` +ls -l /dev/infiniband/ +``` +Verify intra-node GPU topology: +``` +nvidia-smi topo -m +``` +The GPU topology should show all GPUs are connected via NVLink (NV18 indicates 18 NVLink connections). +The GPUs are split across two NUMA nodes (0-3 on NUMA 0, 4-7 on NUMA 1). 
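+
+---
+
+(Optional) You can also sanity check the PyTorch stack from inside a compute pod before launching a training job; a minimal sketch, run from the `bash --login` shell opened above:
+
+```
+python3 - <<'EOF'
+import torch
+
+# Confirm the GPU count and the NCCL version PyTorch was built against
+print("torch:", torch.__version__)
+print("cuda available:", torch.cuda.is_available())
+print("gpu count:", torch.cuda.device_count())
+print("nccl version:", torch.cuda.nccl.version())
+EOF
+```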
+ +--- + +### FSDP Test + +SSH into the login pod as root, clone the repo, and create a checkpoints directory: + +``` +SLURM_LOGIN_HOSTNAME="$(kubectl get services -n slurm -l app.kubernetes.io/instance=slurm,app.kubernetes.io/name=login -o jsonpath="{.items[0].status.loadBalancer.ingress[0].hostname}")" +ssh -i ~/.ssh/id_ed25519_slurm -p 22 root@$SLURM_LOGIN_HOSTNAME + +# install git +apt update +apt install -y git +git --version + +# install vim (optional) +apt install -y vim +vim --version + +cd /fsx +git clone https://github.com/aws-samples/awsome-distributed-training/ +cd awsome-distributed-training/3.test_cases/pytorch/FSDP/slurm + +mkdir -p checkpoints +``` +--- +Copy the modified sbatch file: +``` +export SLINKY_PATH=/fsx/awsome-distributed-training/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm + +# for g5 instances +cp ${SLINKY_PATH}/g5/g5-llama2_7b-training.sbatch ./llama2_7b-training.sbatch + +# for p5 instances +cp ${SLINKY_PATH}/p5/p5-llama2_7b-training.sbatch ./llama2_7b-training.sbatch +``` +--- +Add your Hugging Face token to stream the [allenai/c4](https://huggingface.co/datasets/allenai/c4) dataset without throttling: +``` +NEW_TOKEN="your_new_token_here" +sed -i "s/export HF_TOKEN=.*$/export HF_TOKEN=$NEW_TOKEN/" llama2_7b-training.sbatch +``` + +--- +Kick-off the training job: +``` +sbatch llama2_7b-training.sbatch +``` +--- + +Watch the output logs from the login pod: + +``` +export JOB_ID=$(squeue -h -u root -o "%i" | head -1) + +tail -f logs/llama2_7b-FSDP_${JOB_ID}.out +``` +--- + +Watch the error logs from `slurm-compute-hp-node-0`: + +``` +# from a new terminal window +kubectl -n slurm exec -it pod/slurm-compute-hp-node-0 -- bash --login + +cd /fsx/awsome-distributed-training/3.test_cases/pytorch/FSDP/slurm +export JOB_ID=$(squeue -h -u root -o "%i" | head -1) + +watch "grep 'Batch.*Loss' logs/llama2_7b-FSDP_${JOB_ID}.err" + +# or + +tail -f logs/llama2_7b-FSDP_${JOB_ID}.err | grep --line-buffered 'Batch.*Loss' +``` + +Watch squeue from `slurm-compute-hp-node-1`: + +``` +# from a new terminal window +kubectl -n slurm exec -it pod/slurm-compute-hp-node-1 -- bash --login + +# 1 second updates +watch -n 1 squeue +``` + +Watch checkpoints from `slurm-compute-hp-node-2`: + +``` +# from a new terminal window +kubectl -n slurm exec -it pod/slurm-compute-hp-node-2 -- bash --login + +cd /fsx/awsome-distributed-training/3.test_cases/pytorch/FSDP/slurm + +# highlight changes, show timestamps, 5 second updates +watch -n 5 -d "ls -lh checkpoints" +``` + +* * * + +### Clean Up: + +``` +rm -rf checkpoints/* + +rm -rf logs/* + +helm uninstall slurm -n slurm +helm uninstall slurm-operator -n slinky + +helm uninstall prometheus -n prometheus +helm uninstall cert-manager -n cert-manager + +kubectl delete pvc fsx-claim -n slurm +kubectl delete pvc openzfs-claim + +helm uninstall aws-fsx-csi-driver -n kube-system +helm uninstall aws-fsx-openzfs-csi-driver -n kube-system + +helm uninstall aws-load-balancer-controller -n kube-system + +eksctl delete iamserviceaccount \ + --name fsx-csi-controller-sa \ + --namespace kube-system \ + --cluster $EKS_CLUSTER_NAME + +eksctl delete iamserviceaccount \ + --name fsx-openzfs-csi-controller-sa \ + --namespace kube-system \ + --cluster $EKS_CLUSTER_NAME + +eksctl delete iamserviceaccount \ + --name aws-load-balancer-controller \ + --namespace kube-system \ + --cluster $EKS_CLUSTER_NAME +``` diff --git a/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/dlc-slurmd.Dockerfile 
b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/dlc-slurmd.Dockerfile new file mode 100644 index 00000000..d3181bf6 --- /dev/null +++ b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/dlc-slurmd.Dockerfile @@ -0,0 +1,112 @@ +# First stage - DLC PyTorch 2.6, Python 3.12, CUDA 12.6, Ubuntu 22.04 +FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.6.0-gpu-py312-cu126-ubuntu22.04-ec2 AS dlc + +# Second stage - Slurm compute node +FROM ghcr.io/slinkyproject/slurmd:24.11.4-ubuntu24.04 + +ARG PYTHON_SHORT_VERSION=3.12 + +# Create required directory +RUN mkdir -p /var/spool/slurmd + +# Environment variables from DLC +ENV CUDA_HOME="/usr/local/cuda" \ + EFA_PATH="/opt/amazon/efa" \ + OPEN_MPI_PATH="/opt/amazon/openmpi" + +ENV LD_LIBRARY_PATH="lib:${EFA_PATH}/lib:${OPEN_MPI_PATH}/lib:${CUDA_HOME}/lib64:/usr/local/lib:/lib/x86_64-linux-gnu:/opt/nccl/build/lib:/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu:/usr/local/nvidia/lib" \ + PATH="${EFA_PATH}/bin:${OPEN_MPI_PATH}/bin:${CUDA_HOME}/bin:${PATH}" \ + NCCL_DEBUG=INFO \ + NCCL_SOCKET_IFNAME=^docker0 \ + PYTHONDONTWRITEBYTECODE=1 \ + PYTHONUNBUFFERED=1 \ + PYTHONIOENCODING=UTF-8 \ + LANG=C.UTF-8 \ + LC_ALL=C.UTF-8 \ + NVTE_FRAMEWORK=pytorch + +# Install critical system dependencies missing in base Slurm image +ENV DEBIAN_FRONTEND=noninteractive +RUN apt-get update && \ + apt-get install -y --no-install-recommends \ + build-essential \ + ca-certificates \ + cmake \ + curl \ + git \ + libcurl4-openssl-dev \ + libssl-dev \ + libnuma1 \ + libnuma-dev \ + libibverbs-dev \ + libtool \ + autoconf \ + pkg-config \ + libglib2.0-0 \ + libsm6 \ + libxext6 \ + libxrender-dev \ + && rm -rf /var/lib/apt/lists/* \ + && apt-get clean + +# Copy CUDA stack from DLC +COPY --from=dlc /usr/local/cuda /usr/local/cuda + +# Copy EFA stack from DLC +COPY --from=dlc /opt/amazon/efa /opt/amazon/efa +COPY --from=dlc /opt/amazon/openmpi /opt/amazon/openmpi +COPY --from=dlc /opt/amazon/ofi-nccl /opt/amazon/ofi-nccl + +# Copy NCCL configuration +COPY --from=dlc /usr/local/lib/libnccl* /usr/local/lib/ +COPY --from=dlc /etc/nccl.conf /etc/nccl.conf + +# Configure OpenMPI +RUN mv ${OPEN_MPI_PATH}/bin/mpirun ${OPEN_MPI_PATH}/bin/mpirun.real \ + && echo '#!/bin/bash' > ${OPEN_MPI_PATH}/bin/mpirun \ + && echo "${OPEN_MPI_PATH}/bin/mpirun.real --allow-run-as-root \"\$@\"" >> ${OPEN_MPI_PATH}/bin/mpirun \ + && chmod a+x ${OPEN_MPI_PATH}/bin/mpirun \ + && echo "hwloc_base_binding_policy = none" >> ${OPEN_MPI_PATH}/etc/openmpi-mca-params.conf \ + && echo "rmaps_base_mapping_policy = slot" >> ${OPEN_MPI_PATH}/etc/openmpi-mca-params.conf + +# Copy Python installation +COPY --from=dlc /usr/local/bin/python${PYTHON_SHORT_VERSION}* /usr/local/bin/ +COPY --from=dlc /usr/local/lib/python${PYTHON_SHORT_VERSION} /usr/local/lib/python${PYTHON_SHORT_VERSION} +COPY --from=dlc /usr/local/lib/libpython${PYTHON_SHORT_VERSION}* /usr/local/lib/ +COPY --from=dlc /usr/local/include/python${PYTHON_SHORT_VERSION}* /usr/local/include/ + +# Fix Python symlinks +RUN rm -f /usr/local/bin/python3 && \ + rm -f /usr/local/bin/python && \ + ln -s /usr/local/bin/python${PYTHON_SHORT_VERSION} /usr/local/bin/python3 && \ + ln -s /usr/local/bin/python${PYTHON_SHORT_VERSION} /usr/local/bin/python + +# Additional requirements +RUN /usr/local/bin/python3 -m pip install --no-cache-dir \ + transformers==4.37.2 \ + datasets==2.17.1 + +# Remove problematic typing.py to avoid conflicts +RUN rm -f /usr/local/lib/python${PYTHON_SHORT_VERSION}/site-packages/typing.py + +# Install OpenSSH, allow OpenSSH to 
talk to containers without asking for confirmation +RUN apt-get update \ + && apt-get install -y --no-install-recommends openssh-client openssh-server \ + && mkdir -p /var/run/sshd \ + && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \ + && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \ + && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \ + && rm -rf /var/lib/apt/lists/* \ + && apt-get clean + +# Configure OpenSSH so that nodes can communicate with each other +RUN mkdir -p /var/run/sshd \ + && sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd + +RUN rm -rf /root/.ssh/ \ + && mkdir -p /root/.ssh/ \ + && ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa \ + && cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \ + && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config + +WORKDIR /home diff --git a/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/g5/g5-llama2_7b-training.sbatch b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/g5/g5-llama2_7b-training.sbatch new file mode 100644 index 00000000..9635ff16 --- /dev/null +++ b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/g5/g5-llama2_7b-training.sbatch @@ -0,0 +1,123 @@ +#!/bin/bash +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 + +#SBATCH --nodes=4 # number of nodes to use +#SBATCH --job-name=llama2_7b-FSDP # name of your job +#SBATCH --output=logs/%x_%j.out # logfile for stdout +#SBATCH --error=logs/%x_%j.err # logfile for stderr, remove it to merge both outputs +#SBATCH --exclusive # job has exclusive use of the resource, no sharing +#SBATCH --ntasks-per-node=1 # one task per node +#SBATCH --cpus-per-task=32 # match the number of CPUs per node +set -ex; + +########################### +###### User Variables ##### +########################### + +GPUS_PER_NODE=1 + +########################### +## Environment Variables ## +########################### + +export CUDA_HOME="/usr/local/cuda" +export EFA_PATH="/opt/amazon/efa" +export OPEN_MPI_PATH="/opt/amazon/openmpi" +export OFI_NCCL_PATH="/opt/amazon/ofi-nccl" +export LD_LIBRARY_PATH="lib:${EFA_PATH}/lib:${OPEN_MPI_PATH}/lib:${CUDA_HOME}/lib64:/usr/local/lib:/lib/x86_64-linux-gnu:/opt/nccl/build/lib:${OFI_NCCL_PATH}/lib/x86_64-linux-gnu:/usr/local/nvidia/lib" + +# LD_PRELOAD is required for PyTorch to find the NCCL library +export LD_PRELOAD="/usr/local/lib/libnccl.so.2" + +export CUDA_VISIBLE_DEVICES=0 # Restrict PyTorch to only use the first GPU (GPU 0) +export NVIDIA_VISIBLE_DEVICES=all # Make all GPUs visible to NVIDIA container runtime + +# Debug settings +export NCCL_DEBUG=INFO # Set NCCL debug level for troubleshooting +export NCCL_DEBUG_SUBSYS=ALL # Enable detailed debugging output for all NCCL subsystems + +# Timeout settings +export NCCL_TIMEOUT=1800 # Set overall NCCL operation timeout to 30 minutes (in seconds) +export NCCL_SOCKET_TIMEOUT=300 # Allow 5 minutes for TCP socket connections between nodes +export NCCL_ASYNC_ERROR_HANDLING=1 # Enable asynchronous error handling for better fault tolerance + +# Buffer settings +export NCCL_BUFFSIZE=2097152 # Set NCCL communication buffer size to 2MB for larger transfers + +# TCP connection settings +export TORCH_DISTRIBUTED_DETAILED_LOGGING=1 # Enable verbose logging for PyTorch distributed operations +export GLOO_SOCKET_IFNAME=eth0 # Use eth0 network interface for Gloo collective operations +export TP_SOCKET_IFNAME=eth0 # Use eth0 for tensor parallelism 
communication +export NCCL_SOCKET_IFNAME=eth0 # Use eth0 (primary EC2 network interface) for NCCL communication + +# TCP Store timeout settings +export TORCHELASTIC_MAX_CALLTIME=3600 # Set maximum call time for TorchElastic operations to 1 hour +export PYTORCH_TIMEOUT=3600 # Set PyTorch RPC timeout to 1 hour +export TORCH_DISTRIBUTED_TIMEOUT=3600 # Set PyTorch distributed timeout to 1 hour + +# PyTorch specific settings +export TORCH_DISTRIBUTED_DEBUG=DETAIL # Enable detailed debugging for distributed operations +export TORCH_CPP_LOG_LEVEL=INFO # Set C++ frontend logging level to INFO +export CUDA_LAUNCH_BLOCKING=0 # Allow asynchronous CUDA kernel launches (0=async, 1=sync) + +# HuggingFace settings +export HF_HUB_ETAG_TIMEOUT=60 # Metadata timeout (in seconds) for large clusters +export HF_TOKEN= # Token used to avoid throttling for data streaming + +########################### +####### Torch Dist ####### +########################### + +# Debug Slurm environment +echo "=== Slurm Environment ===" +echo "SLURM_JOB_ID: $SLURM_JOB_ID" +echo "SLURM_JOB_NUM_NODES: $SLURM_JOB_NUM_NODES" +echo "SLURM_NODELIST: $SLURM_NODELIST" +echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST" +echo "LD_LIBRARY_PATH: $LD_LIBRARY_PATH" +echo "=======================" + +declare -a TORCHRUN_ARGS=( + --nproc_per_node=$GPUS_PER_NODE + --nnodes=$SLURM_JOB_NUM_NODES + --rdzv_id=$SLURM_JOB_ID + --rdzv_backend=c10d + --rdzv_endpoint=$(hostname) +) + +export PATH="/usr/local/bin:$PATH" +export TRAIN_SCRIPT="/fsx/awsome-distributed-training/3.test_cases/pytorch/FSDP/src/train.py" +export PYTHONPATH="/usr/local/lib/python3.12/site-packages:$PYTHONPATH" +export TORCHRUN="/usr/local/bin/python3 -m torch.distributed.run" + +export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True + +############################ +# llama2_7b Training Params ## +############################ +declare -a TRAINING_ARGS=( + --max_context_width=512 + --num_key_value_heads=8 + --intermediate_size=2048 + --hidden_width=1024 + --num_layers=8 + --num_heads=16 + --model_type=llama_v2 + --tokenizer="hf-internal-testing/llama-tokenizer" + --checkpoint_freq=100 + --validation_freq=100 + --max_steps=1000 + --checkpoint_dir=./checkpoints + --dataset='allenai/c4' + --dataset_config_name='en' + --resume_from_checkpoint=./checkpoints + --train_batch_size=1 + --val_batch_size=1 + --gradient_checkpointing=True + --mixed_precision=bf16 + --sharding_strategy="full" # https://pytorch.org/docs/stable/fsdp.html + --offload_activations=1 +) + +srun --export=ALL -l ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" \ No newline at end of file diff --git a/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/g5/g5-values.yaml b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/g5/g5-values.yaml new file mode 100644 index 00000000..32654cd0 --- /dev/null +++ b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/g5/g5-values.yaml @@ -0,0 +1,956 @@ +# SPDX-FileCopyrightText: Copyright (C) SchedMD LLC. 
+# SPDX-License-Identifier: Apache-2.0 +# Inter-pod anti-affinity and node affinity for non-compute components +commonAffinity: &commonAffinity + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: "node.kubernetes.io/instance-type" + operator: In + values: + - "ml.m5.2xlarge" + podAntiAffinity: + preferredDuringSchedulingIgnoredDuringExecution: + - weight: 100 + podAffinityTerm: + labelSelector: + matchExpressions: + - key: "app.kubernetes.io/name" + operator: In + values: ["slurmdbd", "slurmctld", "slurm-exporter", "login", "mariadb", "slurmrestd"] + topologyKey: "kubernetes.io/hostname" + +# +# Debug configuration. +# @ignored +debug: + # + # -- (bool) + # Enables debug configuration. + enabled: false + # + # -- (bool) + # Allow a locally running operator to communicate with slurm cluster via port-forward. + # NOTE: use when running the operator in a local debugger. + localOperator: true + +# +# -- (string) +# Overrides the name of the release. +nameOverride: "" + +# +# -- (string) +# Overrides the full name of the release. +fullnameOverride: "" + +# +# -- (string) +# Overrides the namespace of the release. +namespaceOverride: "" + +# +# -- (list) +# Set the secrets for image pull. +# Ref: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/ +imagePullSecrets: [] + # - name: regcred + +# +# -- (string) +# Set the image pull policy. +imagePullPolicy: IfNotPresent + +# +# -- (string) +# Set the priority class to use. +# Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass +priorityClassName: "" + +# +# Slurm JWT authentication. +jwt: + # + # JWT hs256 configurations. + hs256: + # + # -- (string) + # The existing secret to use otherwise one will be generated. + existingSecret: "" + +# +# Slurm configurations. +slurm: + # + # Slurm authentication configurations. + auth: + # + # -- (string) + # The existing secret to use otherwise one will be generated. + existingSecret: "" + # + # -- (map[string]string | map[string][]string) + # Extra slurmdbd configuration lines to append to `slurmdbd.conf`. + # WARNING: Values can override existing ones. + # Ref: https://slurm.schedmd.com/slurmdbd.conf.html + extraSlurmdbdConf: {} + # CommitDelay: 1 + ### LOGGING ### + # DebugLevel: debug2 + # DebugFlags: [] + ### PURGE ### + # PurgeEventAfter: 12month + # PurgeJobAfter: 12month + # PurgeResvAfter: 2month + # PurgeStepAfter: 2month + # PurgeSuspendAfter: 1month + # PurgeTXNAfter: 12month + # PurgeUsageAfter: 12month + # + # -- (map[string]string | map[string][]string) + # Extra slurm configuration lines to append to `slurm.conf`, represetned as a string or a map. + # WARNING: Values can override existing ones. + # Ref: https://slurm.schedmd.com/slurm.conf.html + extraSlurmConf: {} + # MinJobAge: 2 + # MaxNodeCount: 1024 + ### LOGGING ### + # SlurmctldDebug: debug2 + # SlurmSchedLogLevel: 1 + # SlurmdDebug: debug2 + # DebugFlags: [] + ### PLUGINS & PARAMETERS ### + # SchedulerParameters: + # - defer_batch + # + # -- (map[string]string) + # Optional raw Slurm configuration files, as a map. + # The map key represents the config file by name; the map value represents config file contents as a string. 
+ # Ref: https://slurm.schedmd.com/man_index.html#configuration_files + configFiles: + # acct_gather.conf: | + # # Ref: https://slurm.schedmd.com/acct_gather.conf.html + # burst_buffer.conf: | + # # Ref: https://slurm.schedmd.com/burst_buffer.conf.html + # gres.conf: | + # # Ref: https://slurm.schedmd.com/gres.conf.html + # helpers.conf: | + # # Ref: https://slurm.schedmd.com/helpers.conf.html + # job_container.conf: | + # # Ref: https://slurm.schedmd.com/job_container.conf.html + # mpi.conf: | + # # Ref: https://slurm.schedmd.com/mpi.conf.html + # oci.conf: | + # # Ref: https://slurm.schedmd.com/oci.conf.html + # plugstack.conf: | + # # Ref: https://slurm.schedmd.com/plugstack.conf.html + # topology.conf: | + # # Ref: https://slurm.schedmd.com/topology.conf.html + # + # -- (map[string]string) + # The Prolog scripts for compute nodesets, as a map. + # The map key represents the filename; the map value represents the script contents. + # WARNING: The script must include a shebang (!) so it can be executed correctly by Slurm. + # Ref: https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog + # Ref: https://slurm.schedmd.com/prolog_epilog.html + # Ref: https://en.wikipedia.org/wiki/Shebang_(Unix) + prologScripts: {} + # 00-empty.sh: | + # #!/usr/bin/env bash + # set -euo pipefail + # exit 0 + # + # -- (map[string]string) + # The Epilog scripts for compute nodesets, as a map. + # The map key represents the filename; the map value represents the script contents. + # WARNING: The script must include a shebang (!) so it can be executed correctly by Slurm. + # Ref: https://slurm.schedmd.com/slurm.conf.html#OPT_Epilog + # Ref: https://slurm.schedmd.com/prolog_epilog.html + # Ref: https://en.wikipedia.org/wiki/Shebang_(Unix) + epilogScripts: {} + # 00-empty.sh: | + # #!/usr/bin/env bash + # set -euo pipefail + # exit 0 + +# Slurm authcred (sackd) configurations. +authcred: + # + # -- (string) + # Set the image pull policy. + imagePullPolicy: IfNotPresent + # + # Set the image to use. + image: + # + # -- (string) + # Set the image repository to use. + repository: ghcr.io/slinkyproject/sackd + # + # -- (string) + # Set the image tag to use. + tag: 24.11-ubuntu24.04 + # + # -- (object) + # Set container resource requests and limits for Kubernetes Pod scheduling. + # Ref: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container + resources: {} + # requests: + # cpu: 1 + # memory: 1Gi + # limits: + # cpu: 2 + # memory: 4Gi + +# +# Slurm controller (slurmctld) configurations. +controller: + # + # -- (string) + # Set the image pull policy. + imagePullPolicy: IfNotPresent + # + # Set the image to use. + image: + # + # -- (string) + # Set the image repository to use. + repository: ghcr.io/slinkyproject/slurmctld + # + # -- (string) + # Set the image tag to use. + tag: 24.11-ubuntu24.04 + # + # -- (object) + # The controller service configuration. + # Ref: https://kubernetes.io/docs/concepts/services-networking/service/ + service: {} + # type: LoadBalancer + # externalIPs: [] + # externalName: my.slurmctld.example.com + # + # -- (integer) + # The external service port number. + servicePort: 6817 + # + # -- (integer) + # The external service node port number. + # Ignored unless `service.type == NodePort`. + serviceNodePort: 36817 + # + # -- (string) + # Set the priority class to use. 
+ # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass + priorityClassName: "" + # + # -- (object) + # Set affinity for Kubernetes Pod scheduling. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity + affinity: *commonAffinity + # + # -- (list) + # Configure pod tolerations. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ + tolerations: [] + # + # -- (object) + # Set container resource requests and limits for Kubernetes Pod scheduling. + # Ref: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container + resources: {} + # requests: + # cpu: 1 + # memory: 1Gi + # limits: + # cpu: 2 + # memory: 4Gi + # + # Define a persistent volume for the slurm controller to store its save-state. + # Used to recover from system failures or from pod upgrades. + persistence: + # + # -- (bool) + # Enables save-state persistence. + enabled: false + # + # -- (string) + # Name of an existing `PersistentVolumeClaim` to use instead of creating one from definition. + # NOTE: When not empty, the other persistence fields will be ignored. + existingClaim: "" + # + # -- (object) + # Create a `PersistentVolumeClaim` with these annotations. + annotations: {} + # + # -- (object) + # Create a `PersistentVolumeClaim` with these labels. + labels: {} + # + # -- (string) + # Create a `PersistentVolumeClaim` with this storage class. + storageClass: standard + # + # -- (list) + # Create a `PersistentVolumeClaim` with these access modes. + accessModes: + - ReadWriteOnce + # + # -- (string) + # Create a `PersistentVolumeClaim` with this storage size. + size: 4Gi + # + # -- (object) + # Selector to match an existing `PersistentVolume`. + selector: {} + # matchLabels: + # app: foo + +# +# Login node configurations. +login: + # + # -- (bool) + # Enables login nodes. + enabled: true + # + # -- (integer) + # Set the number of replicas to deploy. + replicas: 1 + # + # -- (string) + # Set the image pull policy. + imagePullPolicy: IfNotPresent + # + # Set the image to use. + image: + # + # -- (string) + # Set the image repository to use. + repository: ghcr.io/slinkyproject/login + # + # -- (string) + # Set the image tag to use. + tag: 24.11-ubuntu24.04 + # + # -- (object) + # The login service configuration. + # Ref: https://kubernetes.io/docs/concepts/services-networking/service/ + service: + type: LoadBalancer + # externalIPs: [] + # externalName: my.login.example.com + # + # -- (integer) + # The external service port number. + servicePort: 22 + # + # -- (integer) + # The external service node port number. + # Ignored unless `service.type == NodePort`. + serviceNodePort: 32222 + # + # -- (list) + # The `/root/.ssh/authorized_keys` file to write, represented as a list. + rootSshAuthorizedKeys: + - "" + # + # -- (map) + # The `/etc/ssh/sshd_config` file to use, represented as a map. 
+ # Ref: https://man.openbsd.org/sshd_config + sshdConfig: + # LogLevel: DEBUG3 + # Include: "/etc/ssh/sshd_config.d/*.conf" + # X11Forwarding: "yes" + # UsePAM: "yes" + # Subsystem: sftp /usr/libexec/openssh/sftp-server + AcceptEnv: "LANG LC_*" + AuthorizedKeysFile: "/root/.ssh/authorized_keys" + ChallengeResponseAuthentication: "no" + ClientAliveCountMax: "3" + ClientAliveInterval: "60" + LogLevel: "INFO" + PasswordAuthentication: "no" + PermitRootLogin: "yes" + Port: "22" + PrintMotd: "no" + Protocol: "2" + PubkeyAuthentication: "yes" + Subsystem: "sftp internal-sftp" + TCPKeepAlive: "yes" + UseDNS: "no" + UsePAM: "no" + X11Forwarding: "no" + # + # The `/etc/sssd/sssd.conf` represented by as a map. + sssdConf: + # + # -- (map) + # The `/etc/sssd/sssd.conf` [sssd] section, represented as a map. + # Ref: https://man.archlinux.org/man/sssd.conf.5#The_%5Bsssd%5D_section + sssd: + # debug_level: 9 + config_file_version: 2 + services: nss, pam + domains: DEFAULT + # + # -- (map[map]) + # The `/etc/sssd/sssd.conf` [domain/$DOMAIN] sections, represented as a map of map. + # Ref: https://man.archlinux.org/man/sssd.conf.5#DOMAIN_SECTIONS + domains: + DEFAULT: + # debug_level: 9 + auth_provider: ldap + id_provider: ldap + ldap_uri: ldap://ldap.example.com + ldap_search_base: dc=example,dc=com + ldap_user_search_base: ou=Users,dc=example,dc=com + ldap_group_search_base: ou=Groups,dc=example,dc=com + # + # -- (map) + # The `/etc/sssd/sssd.conf` [nss] section, represented as a map. + # Ref: https://man.archlinux.org/man/sssd.conf.5#NSS_configuration_options + nss: + # debug_level: 9 + filter_groups: root,slurm + filter_users: root,slurm + # + # -- (map) + # The `/etc/sssd/sssd.conf` [pam] section, represented as a map. + # Ref: https://man.archlinux.org/man/sssd.conf.5#PAM_configuration_options + pam: {} + # debug_level: 9 + # + # --(list) + # List of volume mounts. + # Ref: https://kubernetes.io/docs/concepts/storage/volumes/ + extraVolumeMounts: + - name: fsx-lustre + mountPath: /fsx + - name: fsx-openzfs + mountPath: /home + # - name: nfs-home + # mountPath: /home + # - name: nfs-data + # mountPath: /mnt/data + # + # --(list) + # Define list of pod volumes. + # Ref: https://kubernetes.io/docs/concepts/storage/volumes/ + extraVolumes: + - name: fsx-lustre + persistentVolumeClaim: + claimName: fsx-claim + - name: fsx-openzfs + persistentVolumeClaim: + claimName: openzfs-claim + # - name: nfs-home + # persistentVolumeClaim: + # claimName: nfs-home + # - name: nfs-data + # persistentVolumeClaim: + # - name: nfs-home + # nfs: + # server: nfs-server.example.com + # path: /exports/home/ + # - name: nfs-data + # persistentVolumeClaim: + # claimName: nfs-data + # + # -- (string) + # Set the priority class to use. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass + priorityClassName: "" + # + # -- (object) + # Set affinity for Kubernetes Pod scheduling. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity + affinity: *commonAffinity + # + # -- (list) + # Configure pod tolerations. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ + tolerations: [] + # + # -- (object) + # Set container resource requests and limits for Kubernetes Pod scheduling. 
+ # Ref: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container + resources: {} + # requests: + # cpu: 1 + # memory: 1Gi + # limits: + # cpu: 2 + # memory: 4Gi + +# +# Slurm compute (slurmd) configurations. +compute: + # + # -- (string) + # Set the image pull policy. + imagePullPolicy: IfNotPresent + # + # Default image for the nodeset pod (slurmd) + # Each nodeset may override this setting. + image: + # + # -- (string) + # Set the image repository to use. + repository: ghcr.io/slinkyproject/slurmd + # + # -- (string) + # Set the image tag to use. + tag: 24.11-ubuntu24.04 + # + # -- (list) + # Slurm NodeSets by object list. + nodesets: + # + # -- (string) + # Name of NodeSet. Must be unique. + - name: hp-node + # + # -- (bool) + # Enables the NodeSet in Slurm. + enabled: true + # + # -- (integer) + # Set the number of replicas to deploy. + replicas: 4 + # + # -- (string) + # Set the image pull policy. + imagePullPolicy: Always + # + # Set the image to use. + image: + # + # -- (string) + # Set the image repository to use. + repository: ".dkr.ecr..amazonaws.com/dlc-slurmd" + # + # -- (string) + # Set the image tag to use. + tag: "24.11.4-ubuntu24.04" + # + # -- (string) + # Set the priority class to use. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass + priorityClassName: "" + # + # -- (object) + # Set container resource requests and limits for Kubernetes Pod scheduling. + # Ref: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container + resources: + limits: + nvidia.com/gpu: "1" + requests: + nvidia.com/gpu: "1" + # + # -- (map) + # Selector which must match a node's labels for the pod to be scheduled on that node. + nodeSelector: + kubernetes.io/os: linux + node.kubernetes.io/instance-type: ml.g5.8xlarge + # + # -- (object) + # Set affinity for Kubernetes Pod scheduling. + affinity: {} + # nodeAffinity: + # requiredDuringSchedulingIgnoredDuringExecution: + # nodeSelectorTerms: + # - matchExpressions: + # - key: "kubernetes.io/os" + # operator: In + # values: + # - linux + # podAntiAffinity: + # requiredDuringSchedulingIgnoredDuringExecution: + # - topologyKey: "kubernetes.io/hostname" + # labelSelector: + # matchExpressions: + # - key: "app.kubernetes.io/name" + # operator: In + # values: + # - slurmctld + # - slurmdbd + # - slurmrestd + # + # -- (list) + # Configure pod tolerations. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ + tolerations: [] + # + # -- (object) + # Set the update strategy configuration. + updateStrategy: + # + # -- (string) + # Set the update strategy type. + # Can be either: "RollingUpdate"; "OnDelete". + type: RollingUpdate + # + # -- (object) + # Define the rolling update policy. + # Only used when "updateStrategy.type=RollingUpdate". + rollingUpdate: + # + # -- (string) + # The maximum number of pods that can be unavailable during the update. + # Value can be an absolute number (ex: 5) or a percentage of desired + # pods (ex: 10%). Absolute number is calculated from percentage by + # rounding up. This can not be 0. Defaults to 1. + maxUnavailable: 20% + # + # -- (object) + # The policy used for PVCs created from the NodeSet VolumeClaimTemplates. + persistentVolumeClaimRetentionPolicy: + # + # -- (string) + # WhenDeleted specifies what happens to PVCs created from NodeSet + # VolumeClaimTemplates when the NodeSet is deleted. 
The default policy + # of `Retain` causes PVCs to not be affected by NodeSet deletion. The + # `Delete` policy causes those PVCs to be deleted. + whenDeleted: Retain + # + # -- (string) + # WhenScaled specifies what happens to PVCs created from NodeSet + # VolumeClaimTemplates when the NodeSet is scaled down. The default + # policy of `Retain` causes PVCs to not be affected by a scale-in. The + # `Delete` policy causes the associated PVCs for any excess pods to be + # deleted. + whenScaled: Retain + # + # -- (list) + # List of PVCs to be created from template and mounted on each NodeSet pod. + # PVCs are given a unique identity relative to each NodeSet pod. + # Ref: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#volume-claim-templates + volumeClaimTemplates: [] + # - metadata: + # name: scratch + # spec: + # mountPath: /mnt/scratch + # storageClassName: standard + # accessModes: + # - ReadWriteOnce + # resources: + # requests: + # storage: 1Gi + # + # --(list) + # List of volume mounts. + # Ref: https://kubernetes.io/docs/concepts/storage/volumes/ + extraVolumeMounts: + - name: fsx-lustre + mountPath: /fsx + - name: fsx-openzfs + mountPath: /home + - name: shmem + mountPath: /dev/shm + # - name: nfs-home + # mountPath: /home + # - name: nfs-data + # mountPath: /mnt/data + # + # --(list) + # Define list of pod volumes. + # Ref: https://kubernetes.io/docs/concepts/storage/volumes/ + extraVolumes: + - name: fsx-lustre + persistentVolumeClaim: + claimName: fsx-claim + - name: fsx-openzfs + persistentVolumeClaim: + claimName: openzfs-claim + - name: shmem + hostPath: + path: /dev/shm + # - name: nfs-home + # nfs: + # server: nfs-server.example.com + # path: /exports/home/ + # - name: nfs-data + # persistentVolumeClaim: + # claimName: nfs-data + # + # -- (object) + # Partition describes the partition created specifically for this NodeSet to be added. + partition: + # + # -- (bool) + # Enables this NodeSet's partition line to be added in Slurm. + enabled: true + # + # -- (map[string]string | map[string][]string) + # Extra Slurm partition configuration appended onto the partition line. + # Ref: https://slurm.schedmd.com/slurm.conf.html#lbAI + config: + State: UP + MaxTime: UNLIMITED + # + # -- (string) + # Set Slurm node GRES. + # Ref: https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1 + nodeGres: "" + # + # -- (list) + # Set Slurm node Features as a list(string). + # Ref: https://slurm.schedmd.com/slurm.conf.html#OPT_Features + nodeFeatures: [] + # + # -- (string) + # Set Slurm node weight for Slurm scheduling. + # Ref: https://slurm.schedmd.com/slurm.conf.html#OPT_Weight + nodeWeight: 1 + # + # -- (list) + # Slurm Partitions by object list. + partitions: + # + # -- (string) + # Name of Partition. Must be unique. + - name: all + # + # -- (bool) + # Enables the partition in Slurm. + enabled: true + # + # -- (list) + # NodeSets to put into this Partition by name/key. + # NOTE: 'ALL' is a Slurm meta value to mean all nodes in the system. + nodesets: + - ALL + # + # -- (map[string]string | map[string][]string) + # Extra Slurm partition configuration appended onto the partition line. + # Ref: https://slurm.schedmd.com/slurm.conf.html#lbAI + config: + State: UP + Default: "YES" + MaxTime: UNLIMITED + +# +# Slurm accounting (slurmdbd) configurations. +accounting: + # + # -- (bool) + # Enables accounting services. + enabled: true + # + # -- (string) + # Set the image pull policy. + imagePullPolicy: IfNotPresent + # + # Set the image to use. 
+  image:
+    #
+    # -- (string)
+    # Set the image repository to use.
+    repository: ghcr.io/slinkyproject/slurmdbd
+    #
+    # -- (string)
+    # Set the image tag to use.
+    tag: 24.11-ubuntu24.04
+  #
+  # -- (object)
+  # Set affinity for Kubernetes Pod scheduling.
+  # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity
+  affinity: *commonAffinity
+  #
+  # -- (list)
+  # Configure pod tolerations.
+  # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
+  tolerations: []
+  #
+  # -- (object)
+  # Set container resource requests and limits for Kubernetes Pod scheduling.
+  # Ref: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container
+  resources: {}
+    # requests:
+    #   cpu: 1
+    #   memory: 1Gi
+    # limits:
+    #   cpu: 2
+    #   memory: 4Gi
+  #
+  # Configuration for an external accounting instance (slurmdbd).
+  external:
+    #
+    # -- (bool)
+    # Use an external accounting instance (slurmdbd) instead of deploying one.
+    enabled: false
+    #
+    # -- (string)
+    # The external accounting instance (slurmdbd) host.
+    host: ""
+    #
+    # -- (integer)
+    # The external accounting instance (slurmdbd) port.
+    port: 6819
+
+#
+# `bitnami/mariadb` subchart configurations.
+# Ref: https://github.com/bitnami/charts/blob/main/bitnami/mariadb/values.yaml
+mariadb:
+  enabled: true
+  auth:
+    username: slurm
+    database: slurm_acct_db
+  tls:
+    enabled: false
+  tde:
+    enabled: false
+  primary:
+    # NOTE: https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build
+    configuration: |-
+      [mysqld]
+      skip-name-resolve
+      explicit_defaults_for_timestamp
+      basedir=/opt/bitnami/mariadb
+      datadir=/bitnami/mariadb/data
+      plugin_dir=/opt/bitnami/mariadb/plugin
+      port={{ .Values.primary.containerPorts.mysql }}
+      socket=/opt/bitnami/mariadb/tmp/mysql.sock
+      tmpdir=/opt/bitnami/mariadb/tmp
+      innodb_buffer_pool_size=4096M
+      innodb_lock_wait_timeout=900
+      innodb_log_file_size=1024M
+      max_allowed_packet=16M
+      bind-address=*
+      pid-file=/opt/bitnami/mariadb/tmp/mysqld.pid
+      log-error=/opt/bitnami/mariadb/logs/mysqld.log
+      character-set-server=UTF8
+      collation-server=utf8_general_ci
+      slow_query_log=0
+      long_query_time=10.0
+      binlog_expire_logs_seconds=2592000
+      {{- if .Values.tls.enabled }}
+      ssl_cert=/opt/bitnami/mariadb/certs/{{ .Values.tls.certFilename }}
+      ssl_key=/opt/bitnami/mariadb/certs/{{ .Values.tls.certKeyFilename }}
+      {{- if (include "mariadb.tlsCACert" .) }}
+      ssl_ca={{ include "mariadb.tlsCACert" .
}} + {{- end }} + {{- end }} + {{- if .Values.tde.enabled }} + plugin_load_add=file_key_management + file_key_management_filename=/opt/bitnami/mariadb/tde/{{ .Values.tde.encryptedKeyFilename }} + file_key_management_filekey=FILE:/opt/bitnami/mariadb/tde/{{ .Values.tde.randomKeyFilename }} + file_key_management_encryption_algorithm={{ .Values.tde.fileKeyManagementEncryptionAlgorithm }} + innodb_encrypt_tables={{ .Values.tde.innodbEncryptTables }} + innodb_encrypt_log={{ .Values.tde.innodbEncryptLog }} + innodb_encrypt_temporary_tables={{ .Values.tde.innodbEncryptTemporaryTables }} + innodb_encryption_threads={{ .Values.tde.innodbEncryptionThreads }} + encrypt_tmp_disk_tables={{ .Values.tde.encryptTmpDiskTables }} + encrypt_tmp_files={{ .Values.tde.encryptTmpTiles }} + encrypt_binlog={{ .Values.tde.encryptBINLOG }} + aria_encrypt_tables={{ .Values.tde.ariaEncryptTables }} + {{- end }} + + [client] + port=3306 + socket=/opt/bitnami/mariadb/tmp/mysql.sock + default-character-set=UTF8 + plugin_dir=/opt/bitnami/mariadb/plugin + + [manager] + port=3306 + socket=/opt/bitnami/mariadb/tmp/mysql.sock + pid-file=/opt/bitnami/mariadb/tmp/mysqld.pid + persistence: + enabled: false + existingClaim: "" + storageClass: standard + labels: {} + annotations: {} + accessModes: + - ReadWriteOnce + size: 8Gi + selector: {} + priorityClassName: "" + tolerations: [] + affinity: *commonAffinity + metrics: + enabled: false + serviceMonitor: + enabled: false + resources: {} + +# +# Slurm REST API (slurmrestd) configurations. +restapi: + # + # -- (bool) + # Enables restapi services. + enabled: true + # + # -- (integer) + # Set the number of replicas to deploy. + replicas: 1 + # + # -- (string) + # Set the image pull policy. + imagePullPolicy: IfNotPresent + # + # Set the image to use. + image: + # + # -- (string) + # Set the image repository to use. + repository: ghcr.io/slinkyproject/slurmrestd + # + # -- (string) + # Set the image tag to use. + tag: 24.11-ubuntu24.04 + # + # -- (object) + # The restapi service configuration. + # Ref: https://kubernetes.io/docs/concepts/services-networking/service/ + service: {} + # type: LoadBalancer + # externalIPs: [] + # externalName: my.slurmrestd.example.com + # + # -- (integer) + # The external service port number. + servicePort: 6820 + # + # -- (integer) + # The external service node port number. + # Ignored unless `service.type == NodePort`. + serviceNodePort: 36820 + # + # -- (string) + # Set the priority class to use. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass + priorityClassName: "" + # + # -- (object) + # Set affinity for Kubernetes Pod scheduling. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity + affinity: *commonAffinity + # + # -- (list) + # Configure pod tolerations. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ + tolerations: [] + # + # -- (object) + # Set container resource requests and limits for Kubernetes Pod scheduling. + # Ref: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container + resources: {} + # requests: + # cpu: 1 + # memory: 1Gi + # limits: + # cpu: 2 + # memory: 4Gi + +# +# `slurm-exporter` subchart configurations. 
+# Ref: https://github.com/SlinkyProject/slurm-exporter/-/blob/main/helm/slurm-exporter/values.yaml +slurm-exporter: + enabled: true + exporter: + enabled: true + secretName: "slurm-token-exporter" + affinity: *commonAffinity \ No newline at end of file diff --git a/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/lustre-pvc-slurm.yaml b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/lustre-pvc-slurm.yaml new file mode 100644 index 00000000..c3e6fc1e --- /dev/null +++ b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/lustre-pvc-slurm.yaml @@ -0,0 +1,12 @@ +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: fsx-claim + namespace: slurm +spec: + accessModes: + - ReadWriteMany + storageClassName: fsx-sc + resources: + requests: + storage: 1200Gi \ No newline at end of file diff --git a/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/openzfs-pvc-slurm.yaml b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/openzfs-pvc-slurm.yaml new file mode 100644 index 00000000..51218f34 --- /dev/null +++ b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/openzfs-pvc-slurm.yaml @@ -0,0 +1,12 @@ +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: openzfs-claim + namespace: slurm +spec: + accessModes: + - ReadWriteMany + storageClassName: openzfs-sc + resources: + requests: + storage: 64Gi \ No newline at end of file diff --git a/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/openzfs-storageclass.yaml b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/openzfs-storageclass.yaml new file mode 100644 index 00000000..5b84aae1 --- /dev/null +++ b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/openzfs-storageclass.yaml @@ -0,0 +1,19 @@ +kind: StorageClass +apiVersion: storage.k8s.io/v1 +metadata: + name: openzfs-sc +provisioner: fsx.openzfs.csi.aws.com +parameters: + ResourceType: "filesystem" #REQUIRED + DeploymentType: '"SINGLE_AZ_HA_2"' #REQUIRED + ThroughputCapacity: '160' #REQUIRED + SubnetIds: '["${PRIVATE_SUBNET_ID}"]' #REQUIRED + SkipFinalBackupOnDeletion: 'true' #REQUIRED + AutomaticBackupRetentionDays: '30' + SecurityGroupIds: '["${SECURITY_GROUP_ID}"]' + CopyTagsToBackups: 'true' + CopyTagsToVolumes: 'true' + DailyAutomaticBackupStartTime: '"19:00"' + OptionsOnDeletion: '["DELETE_CHILD_VOLUMES_AND_SNAPSHOTS"]' + RootVolumeConfiguration: '{"DataCompressionType": "NONE", "NfsExports": [{"ClientConfigurations": [{"Clients": "*", "Options": ["rw", "no_root_squash", "crossmnt"]}]}]}' + WeeklyMaintenanceStartTime: '"1:04:00"' \ No newline at end of file diff --git a/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/p5/p5-llama2_7b-training.sbatch b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/p5/p5-llama2_7b-training.sbatch new file mode 100644 index 00000000..52897370 --- /dev/null +++ b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/p5/p5-llama2_7b-training.sbatch @@ -0,0 +1,100 @@ +#!/bin/bash + +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
+# SPDX-License-Identifier: MIT-0 + +#SBATCH --nodes=4 # number of nodes to use +#SBATCH --job-name=llama2_7b-FSDP # name of your job +#SBATCH --output=logs/%x_%j.out # logfile for stdout +#SBATCH --error=logs/%x_%j.err # logfile for stderr, remove it to merge both outputs +#SBATCH --exclusive # job has exclusive use of the resource, no sharing + +set -ex; + +########################### +###### User Variables ##### +########################### + +GPUS_PER_NODE=4 + +########################### +## Environment Variables ## +########################### + +export CUDA_HOME="/usr/local/cuda" +export EFA_PATH="/opt/amazon/efa" +export OPEN_MPI_PATH="/opt/amazon/openmpi" +export OFI_NCCL_PATH="/opt/amazon/ofi-nccl" +export LD_LIBRARY_PATH="lib:${EFA_PATH}/lib:${OPEN_MPI_PATH}/lib:${CUDA_HOME}/lib64:/usr/local/lib:/lib/x86_64-linux-gnu:/opt/nccl/build/lib:${OFI_NCCL_PATH}/lib/x86_64-linux-gnu:/usr/local/nvidia/lib" + +# LD_PRELOAD is required for PyTorch to find the NCCL library +export LD_PRELOAD="/usr/local/lib/libnccl.so.2" + +# NCCL settings for EFA +export NCCL_PROTO=simple # Use a simpler communication protocol, often more reliable for EFA +export NCCL_ALGO=ring # Use ring algorithm for collective operations, typically best for EFA +export NCCL_NET_GDR_LEVEL=5 # Enable GPUDirect RDMA for direct GPU-to-network transfers +export NCCL_DEBUG=INFO # Set NCCL debug level for troubleshooting +export NCCL_DEBUG_SUBSYS=ALL # Enable detailed debugging output for all NCCL subsystems +export NCCL_SOCKET_IFNAME=^lo # Exclude loopback interface +export NCCL_NET_MAX_REQUESTS=8 # Maximum number of concurrent network requests, optimized for EFA +export NCCL_MIN_NCHANNELS=8 # Minimum number of channels for NCCL communications, increases parallelism +export NCCL_NSOCKS_PERTHREAD=8 # Number of sockets per thread for network operations + +# HuggingFace settings +export HF_HUB_ETAG_TIMEOUT=60 # Metadata timeout (in seconds) for large clusters +export HF_TOKEN= # Token used to avoid throttling for data streaming + +########################### +####### Torch Dist ####### +########################### + +# Debug Slurm environment +echo "=== Slurm Environment ===" +echo "SLURM_JOB_ID: $SLURM_JOB_ID" +echo "SLURM_JOB_NUM_NODES: $SLURM_JOB_NUM_NODES" +echo "SLURM_NODELIST: $SLURM_NODELIST" +echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST" +echo "LD_LIBRARY_PATH: $LD_LIBRARY_PATH" +echo "=======================" + + +declare -a TORCHRUN_ARGS=( + --nproc_per_node=$GPUS_PER_NODE + --nnodes=$SLURM_JOB_NUM_NODES + --rdzv_id=$SLURM_JOB_ID + --rdzv_backend=c10d + --rdzv_endpoint=$(hostname) +) + +export PATH="/usr/local/bin:$PATH" +export TRAIN_SCRIPT="/fsx/awsome-distributed-training/3.test_cases/pytorch/FSDP/src/train.py" +export PYTHONPATH="/usr/local/lib/python3.12/site-packages:$PYTHONPATH" +export TORCHRUN="/usr/local/bin/python3 -m torch.distributed.run" + +############################ +# llama2_7b Training Params ## +############################ +declare -a TRAINING_ARGS=( + --max_context_width=4096 + --num_key_value_heads=32 + --intermediate_size=11008 + --hidden_width=4096 + --num_layers=32 + --num_heads=32 + --model_type=llama_v2 + --tokenizer="hf-internal-testing/llama-tokenizer" + --checkpoint_freq=5000 + --validation_freq=500 + --max_steps=5000 + --checkpoint_dir=./checkpoints + --dataset='allenai/c4' + --dataset_config_name='en' + --resume_from_checkpoint=./checkpoints + --train_batch_size=1 + --val_batch_size=1 + --sharding_strategy="full" # https://pytorch.org/docs/stable/fsdp.html + --offload_activations=1 +) + 
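+# Launch: with this allocation (--nodes=4 and no explicit task count), srun starts
+# one torchrun launcher per node. Each launcher spawns $GPUS_PER_NODE local worker
+# processes and joins the c10d rendezvous hosted on the node running this batch
+# script, for nnodes x nproc_per_node training ranks in total; -l prefixes output
+# lines with the originating task ID.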
+srun -l ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}"
\ No newline at end of file
diff --git a/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/p5/p5-values.yaml b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/p5/p5-values.yaml
new file mode 100644
index 00000000..54715c27
--- /dev/null
+++ b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/p5/p5-values.yaml
@@ -0,0 +1,954 @@
+# SPDX-FileCopyrightText: Copyright (C) SchedMD LLC.
+# SPDX-License-Identifier: Apache-2.0
+# Inter-pod anti-affinity and node affinity for non-compute components
+commonAffinity: &commonAffinity
+  nodeAffinity:
+    requiredDuringSchedulingIgnoredDuringExecution:
+      nodeSelectorTerms:
+        - matchExpressions:
+            - key: "node.kubernetes.io/instance-type"
+              operator: In
+              values:
+                - "ml.m5.2xlarge"
+  podAntiAffinity:
+    preferredDuringSchedulingIgnoredDuringExecution:
+      - weight: 100
+        podAffinityTerm:
+          labelSelector:
+            matchExpressions:
+              - key: "app.kubernetes.io/name"
+                operator: In
+                values: ["slurmdbd", "slurmctld", "slurm-exporter", "login", "mariadb", "slurmrestd"]
+          topologyKey: "kubernetes.io/hostname"
+
+#
+# Debug configuration.
+# @ignored
+debug:
+  #
+  # -- (bool)
+  # Enables debug configuration.
+  enabled: false
+  #
+  # -- (bool)
+  # Allow a locally running operator to communicate with slurm cluster via port-forward.
+  # NOTE: use when running the operator in a local debugger.
+  localOperator: true
+
+#
+# -- (string)
+# Overrides the name of the release.
+nameOverride: ""
+
+#
+# -- (string)
+# Overrides the full name of the release.
+fullnameOverride: ""
+
+#
+# -- (string)
+# Overrides the namespace of the release.
+namespaceOverride: ""
+
+#
+# -- (list)
+# Set the secrets for image pull.
+# Ref: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
+imagePullSecrets: []
+  # - name: regcred
+
+#
+# -- (string)
+# Set the image pull policy.
+imagePullPolicy: IfNotPresent
+
+#
+# -- (string)
+# Set the priority class to use.
+# Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass
+priorityClassName: ""
+
+#
+# Slurm JWT authentication.
+jwt:
+  #
+  # JWT hs256 configurations.
+  hs256:
+    #
+    # -- (string)
+    # The existing secret to use otherwise one will be generated.
+    existingSecret: ""
+
+#
+# Slurm configurations.
+slurm:
+  #
+  # Slurm authentication configurations.
+  auth:
+    #
+    # -- (string)
+    # The existing secret to use otherwise one will be generated.
+    existingSecret: ""
+  #
+  # -- (map[string]string | map[string][]string)
+  # Extra slurmdbd configuration lines to append to `slurmdbd.conf`.
+  # WARNING: Values can override existing ones.
+  # Ref: https://slurm.schedmd.com/slurmdbd.conf.html
+  extraSlurmdbdConf: {}
+    # CommitDelay: 1
+    ### LOGGING ###
+    # DebugLevel: debug2
+    # DebugFlags: []
+    ### PURGE ###
+    # PurgeEventAfter: 12month
+    # PurgeJobAfter: 12month
+    # PurgeResvAfter: 2month
+    # PurgeStepAfter: 2month
+    # PurgeSuspendAfter: 1month
+    # PurgeTXNAfter: 12month
+    # PurgeUsageAfter: 12month
+  #
+  # -- (map[string]string | map[string][]string)
+  # Extra slurm configuration lines to append to `slurm.conf`, represented as a string or a map.
+  # WARNING: Values can override existing ones.
+ # Ref: https://slurm.schedmd.com/slurm.conf.html + extraSlurmConf: {} + # MinJobAge: 2 + # MaxNodeCount: 1024 + ### LOGGING ### + # SlurmctldDebug: debug2 + # SlurmSchedLogLevel: 1 + # SlurmdDebug: debug2 + # DebugFlags: [] + ### PLUGINS & PARAMETERS ### + # SchedulerParameters: + # - defer_batch + # + # -- (map[string]string) + # Optional raw Slurm configuration files, as a map. + # The map key represents the config file by name; the map value represents config file contents as a string. + # Ref: https://slurm.schedmd.com/man_index.html#configuration_files + configFiles: + # acct_gather.conf: | + # # Ref: https://slurm.schedmd.com/acct_gather.conf.html + # burst_buffer.conf: | + # # Ref: https://slurm.schedmd.com/burst_buffer.conf.html + # gres.conf: | + # # Ref: https://slurm.schedmd.com/gres.conf.html + # helpers.conf: | + # # Ref: https://slurm.schedmd.com/helpers.conf.html + # job_container.conf: | + # # Ref: https://slurm.schedmd.com/job_container.conf.html + # mpi.conf: | + # # Ref: https://slurm.schedmd.com/mpi.conf.html + # oci.conf: | + # # Ref: https://slurm.schedmd.com/oci.conf.html + # plugstack.conf: | + # # Ref: https://slurm.schedmd.com/plugstack.conf.html + # topology.conf: | + # # Ref: https://slurm.schedmd.com/topology.conf.html + # + # -- (map[string]string) + # The Prolog scripts for compute nodesets, as a map. + # The map key represents the filename; the map value represents the script contents. + # WARNING: The script must include a shebang (!) so it can be executed correctly by Slurm. + # Ref: https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog + # Ref: https://slurm.schedmd.com/prolog_epilog.html + # Ref: https://en.wikipedia.org/wiki/Shebang_(Unix) + prologScripts: {} + # 00-empty.sh: | + # #!/usr/bin/env bash + # set -euo pipefail + # exit 0 + # + # -- (map[string]string) + # The Epilog scripts for compute nodesets, as a map. + # The map key represents the filename; the map value represents the script contents. + # WARNING: The script must include a shebang (!) so it can be executed correctly by Slurm. + # Ref: https://slurm.schedmd.com/slurm.conf.html#OPT_Epilog + # Ref: https://slurm.schedmd.com/prolog_epilog.html + # Ref: https://en.wikipedia.org/wiki/Shebang_(Unix) + epilogScripts: {} + # 00-empty.sh: | + # #!/usr/bin/env bash + # set -euo pipefail + # exit 0 + +# Slurm authcred (sackd) configurations. +authcred: + # + # -- (string) + # Set the image pull policy. + imagePullPolicy: IfNotPresent + # + # Set the image to use. + image: + # + # -- (string) + # Set the image repository to use. + repository: ghcr.io/slinkyproject/sackd + # + # -- (string) + # Set the image tag to use. + tag: 24.11-ubuntu24.04 + # + # -- (object) + # Set container resource requests and limits for Kubernetes Pod scheduling. + # Ref: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container + resources: {} + # requests: + # cpu: 1 + # memory: 1Gi + # limits: + # cpu: 2 + # memory: 4Gi + +# +# Slurm controller (slurmctld) configurations. +controller: + # + # -- (string) + # Set the image pull policy. + imagePullPolicy: IfNotPresent + # + # Set the image to use. + image: + # + # -- (string) + # Set the image repository to use. + repository: ghcr.io/slinkyproject/slurmctld + # + # -- (string) + # Set the image tag to use. + tag: 24.11-ubuntu24.04 + # + # -- (object) + # The controller service configuration. 
+ # Ref: https://kubernetes.io/docs/concepts/services-networking/service/ + service: {} + # type: LoadBalancer + # externalIPs: [] + # externalName: my.slurmctld.example.com + # + # -- (integer) + # The external service port number. + servicePort: 6817 + # + # -- (integer) + # The external service node port number. + # Ignored unless `service.type == NodePort`. + serviceNodePort: 36817 + # + # -- (string) + # Set the priority class to use. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass + priorityClassName: "" + # + # -- (object) + # Set affinity for Kubernetes Pod scheduling. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity + affinity: *commonAffinity + # + # -- (list) + # Configure pod tolerations. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ + tolerations: [] + # + # -- (object) + # Set container resource requests and limits for Kubernetes Pod scheduling. + # Ref: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container + resources: {} + # requests: + # cpu: 1 + # memory: 1Gi + # limits: + # cpu: 2 + # memory: 4Gi + # + # Define a persistent volume for the slurm controller to store its save-state. + # Used to recover from system failures or from pod upgrades. + persistence: + # + # -- (bool) + # Enables save-state persistence. + enabled: false + # + # -- (string) + # Name of an existing `PersistentVolumeClaim` to use instead of creating one from definition. + # NOTE: When not empty, the other persistence fields will be ignored. + existingClaim: "" + # + # -- (object) + # Create a `PersistentVolumeClaim` with these annotations. + annotations: {} + # + # -- (object) + # Create a `PersistentVolumeClaim` with these labels. + labels: {} + # + # -- (string) + # Create a `PersistentVolumeClaim` with this storage class. + storageClass: standard + # + # -- (list) + # Create a `PersistentVolumeClaim` with these access modes. + accessModes: + - ReadWriteOnce + # + # -- (string) + # Create a `PersistentVolumeClaim` with this storage size. + size: 4Gi + # + # -- (object) + # Selector to match an existing `PersistentVolume`. + selector: {} + # matchLabels: + # app: foo + +# +# Login node configurations. +login: + # + # -- (bool) + # Enables login nodes. + enabled: true + # + # -- (integer) + # Set the number of replicas to deploy. + replicas: 1 + # + # -- (string) + # Set the image pull policy. + imagePullPolicy: IfNotPresent + # + # Set the image to use. + image: + # + # -- (string) + # Set the image repository to use. + repository: ghcr.io/slinkyproject/login + # + # -- (string) + # Set the image tag to use. + tag: 24.11-ubuntu24.04 + # + # -- (object) + # The login service configuration. + # Ref: https://kubernetes.io/docs/concepts/services-networking/service/ + service: + type: LoadBalancer + # externalIPs: [] + # externalName: my.login.example.com + # + # -- (integer) + # The external service port number. + servicePort: 22 + # + # -- (integer) + # The external service node port number. + # Ignored unless `service.type == NodePort`. + serviceNodePort: 32222 + # + # -- (list) + # The `/root/.ssh/authorized_keys` file to write, represented as a list. + rootSshAuthorizedKeys: + - "" + # + # -- (map) + # The `/etc/ssh/sshd_config` file to use, represented as a map. 
+  # Ref: https://man.openbsd.org/sshd_config
+  sshdConfig:
+    # LogLevel: DEBUG3
+    # Include: "/etc/ssh/sshd_config.d/*.conf"
+    # X11Forwarding: "yes"
+    # UsePAM: "yes"
+    # Subsystem: sftp /usr/libexec/openssh/sftp-server
+    AcceptEnv: "LANG LC_*"
+    AuthorizedKeysFile: "/root/.ssh/authorized_keys"
+    ChallengeResponseAuthentication: "no"
+    ClientAliveCountMax: "3"
+    ClientAliveInterval: "60"
+    LogLevel: "INFO"
+    PasswordAuthentication: "no"
+    PermitRootLogin: "yes"
+    Port: "22"
+    PrintMotd: "no"
+    Protocol: "2"
+    PubkeyAuthentication: "yes"
+    Subsystem: "sftp internal-sftp"
+    TCPKeepAlive: "yes"
+    UseDNS: "no"
+    UsePAM: "no"
+    X11Forwarding: "no"
+  #
+  # The `/etc/sssd/sssd.conf` file, represented as a map.
+  sssdConf:
+    #
+    # -- (map)
+    # The `/etc/sssd/sssd.conf` [sssd] section, represented as a map.
+    # Ref: https://man.archlinux.org/man/sssd.conf.5#The_%5Bsssd%5D_section
+    sssd:
+      # debug_level: 9
+      config_file_version: 2
+      services: nss, pam
+      domains: DEFAULT
+    #
+    # -- (map[map])
+    # The `/etc/sssd/sssd.conf` [domain/$DOMAIN] sections, represented as a map of maps.
+    # Ref: https://man.archlinux.org/man/sssd.conf.5#DOMAIN_SECTIONS
+    domains:
+      DEFAULT:
+        # debug_level: 9
+        auth_provider: ldap
+        id_provider: ldap
+        ldap_uri: ldap://ldap.example.com
+        ldap_search_base: dc=example,dc=com
+        ldap_user_search_base: ou=Users,dc=example,dc=com
+        ldap_group_search_base: ou=Groups,dc=example,dc=com
+    #
+    # -- (map)
+    # The `/etc/sssd/sssd.conf` [nss] section, represented as a map.
+    # Ref: https://man.archlinux.org/man/sssd.conf.5#NSS_configuration_options
+    nss:
+      # debug_level: 9
+      filter_groups: root,slurm
+      filter_users: root,slurm
+    #
+    # -- (map)
+    # The `/etc/sssd/sssd.conf` [pam] section, represented as a map.
+    # Ref: https://man.archlinux.org/man/sssd.conf.5#PAM_configuration_options
+    pam: {}
+    # debug_level: 9
+  #
+  # --(list)
+  # List of volume mounts.
+  # Ref: https://kubernetes.io/docs/concepts/storage/volumes/
+  extraVolumeMounts:
+    - name: fsx-lustre
+      mountPath: /fsx
+    - name: fsx-openzfs
+      mountPath: /home
+    # - name: nfs-home
+    #   mountPath: /home
+    # - name: nfs-data
+    #   mountPath: /mnt/data
+  #
+  # --(list)
+  # Define list of pod volumes.
+  # Ref: https://kubernetes.io/docs/concepts/storage/volumes/
+  extraVolumes:
+    - name: fsx-lustre
+      persistentVolumeClaim:
+        claimName: fsx-claim
+    - name: fsx-openzfs
+      persistentVolumeClaim:
+        claimName: openzfs-claim
+    # - name: nfs-home
+    #   nfs:
+    #     server: nfs-server.example.com
+    #     path: /exports/home/
+    # - name: nfs-data
+    #   persistentVolumeClaim:
+    #     claimName: nfs-data
+  #
+  # -- (string)
+  # Set the priority class to use.
+  # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass
+  priorityClassName: ""
+  #
+  # -- (object)
+  # Set affinity for Kubernetes Pod scheduling.
+  # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity
+  affinity: *commonAffinity
+  #
+  # -- (list)
+  # Configure pod tolerations.
+  # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
+  tolerations: []
+  #
+  # -- (object)
+  # Set container resource requests and limits for Kubernetes Pod scheduling.
+  # Ref: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container
+  resources: {}
+    # requests:
+    #   cpu: 1
+    #   memory: 1Gi
+    # limits:
+    #   cpu: 2
+    #   memory: 4Gi
+
+#
+# Slurm compute (slurmd) configurations.
+compute:
+  #
+  # -- (string)
+  # Set the image pull policy.
+ imagePullPolicy: IfNotPresent + # + # Default image for the nodeset pod (slurmd) + # Each nodeset may override this setting. + image: + # + # -- (string) + # Set the image repository to use. + repository: ghcr.io/slinkyproject/slurmd + # + # -- (string) + # Set the image tag to use. + tag: 24.11-ubuntu24.04 + # + # -- (list) + # Slurm NodeSets by object list. + nodesets: + # + # -- (string) + # Name of NodeSet. Must be unique. + - name: hp-node + # + # -- (bool) + # Enables the NodeSet in Slurm. + enabled: true + # + # -- (integer) + # Set the number of replicas to deploy. + replicas: 4 + # + # -- (string) + # Set the image pull policy. + imagePullPolicy: Always + # + # Set the image to use. + image: + # + # -- (string) + # Set the image repository to use. + repository: ".dkr.ecr..amazonaws.com/dlc-slurmd" + # + # -- (string) + # Set the image tag to use. + tag: "24.11.4-ubuntu24.04" + # + # -- (string) + # Set the priority class to use. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass + priorityClassName: "" + # + # -- (object) + # Set container resource requests and limits for Kubernetes Pod scheduling. + # Ref: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container + resources: + limits: + nvidia.com/gpu: 4 + vpc.amazonaws.com/efa: 16 + requests: + nvidia.com/gpu: 4 + vpc.amazonaws.com/efa: 16 + # + # -- (map) + # Selector which must match a node's labels for the pod to be scheduled on that node. + nodeSelector: + kubernetes.io/os: linux + node.kubernetes.io/instance-type: ml.p5.48xlarge + # + # -- (object) + # Set affinity for Kubernetes Pod scheduling. + affinity: {} + # nodeAffinity: + # requiredDuringSchedulingIgnoredDuringExecution: + # nodeSelectorTerms: + # - matchExpressions: + # - key: "kubernetes.io/os" + # operator: In + # values: + # - linux + # podAntiAffinity: + # requiredDuringSchedulingIgnoredDuringExecution: + # - topologyKey: "kubernetes.io/hostname" + # labelSelector: + # matchExpressions: + # - key: "app.kubernetes.io/name" + # operator: In + # values: + # - slurmctld + # - slurmdbd + # - slurmrestd + # + # -- (list) + # Configure pod tolerations. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ + tolerations: [] + # + # -- (object) + # Set the update strategy configuration. + updateStrategy: + # + # -- (string) + # Set the update strategy type. + # Can be either: "RollingUpdate"; "OnDelete". + type: RollingUpdate + # + # -- (object) + # Define the rolling update policy. + # Only used when "updateStrategy.type=RollingUpdate". + rollingUpdate: + # + # -- (string) + # The maximum number of pods that can be unavailable during the update. + # Value can be an absolute number (ex: 5) or a percentage of desired + # pods (ex: 10%). Absolute number is calculated from percentage by + # rounding up. This can not be 0. Defaults to 1. + maxUnavailable: 20% + # + # -- (object) + # The policy used for PVCs created from the NodeSet VolumeClaimTemplates. + persistentVolumeClaimRetentionPolicy: + # + # -- (string) + # WhenDeleted specifies what happens to PVCs created from NodeSet + # VolumeClaimTemplates when the NodeSet is deleted. The default policy + # of `Retain` causes PVCs to not be affected by NodeSet deletion. The + # `Delete` policy causes those PVCs to be deleted. 
+ whenDeleted: Retain + # + # -- (string) + # WhenScaled specifies what happens to PVCs created from NodeSet + # VolumeClaimTemplates when the NodeSet is scaled down. The default + # policy of `Retain` causes PVCs to not be affected by a scale-in. The + # `Delete` policy causes the associated PVCs for any excess pods to be + # deleted. + whenScaled: Retain + # + # -- (list) + # List of PVCs to be created from template and mounted on each NodeSet pod. + # PVCs are given a unique identity relative to each NodeSet pod. + # Ref: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#volume-claim-templates + volumeClaimTemplates: [] + # - metadata: + # name: scratch + # spec: + # mountPath: /mnt/scratch + # storageClassName: standard + # accessModes: + # - ReadWriteOnce + # resources: + # requests: + # storage: 1Gi + # + # --(list) + # List of volume mounts. + # Ref: https://kubernetes.io/docs/concepts/storage/volumes/ + extraVolumeMounts: + - name: fsx-lustre + mountPath: /fsx + - name: fsx-openzfs + mountPath: /home + - name: shmem + mountPath: /dev/shm + # - name: nfs-home + # mountPath: /home + # - name: nfs-data + # mountPath: /mnt/data + # + # --(list) + # Define list of pod volumes. + # Ref: https://kubernetes.io/docs/concepts/storage/volumes/ + extraVolumes: + - name: fsx-lustre + persistentVolumeClaim: + claimName: fsx-claim + - name: fsx-openzfs + persistentVolumeClaim: + claimName: openzfs-claim + - name: shmem + hostPath: + path: /dev/shm + # - name: nfs-home + # nfs: + # server: nfs-server.example.com + # path: /exports/home/ + # - name: nfs-data + # persistentVolumeClaim: + # claimName: nfs-data + # + # -- (object) + # Partition describes the partition created specifically for this NodeSet to be added. + partition: + # + # -- (bool) + # Enables this NodeSet's partition line to be added in Slurm. + enabled: true + # + # -- (map[string]string | map[string][]string) + # Extra Slurm partition configuration appended onto the partition line. + # Ref: https://slurm.schedmd.com/slurm.conf.html#lbAI + config: + State: UP + MaxTime: UNLIMITED + # + # -- (string) + # Set Slurm node GRES. + # Ref: https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1 + nodeGres: "" + # + # -- (list) + # Set Slurm node Features as a list(string). + # Ref: https://slurm.schedmd.com/slurm.conf.html#OPT_Features + nodeFeatures: [] + # + # -- (string) + # Set Slurm node weight for Slurm scheduling. + # Ref: https://slurm.schedmd.com/slurm.conf.html#OPT_Weight + nodeWeight: 1 + # + # -- (list) + # Slurm Partitions by object list. + partitions: + # + # -- (string) + # Name of Partition. Must be unique. + - name: all + # + # -- (bool) + # Enables the partition in Slurm. + enabled: true + # + # -- (list) + # NodeSets to put into this Partition by name/key. + # NOTE: 'ALL' is a Slurm meta value to mean all nodes in the system. + nodesets: + - ALL + # + # -- (map[string]string | map[string][]string) + # Extra Slurm partition configuration appended onto the partition line. + # Ref: https://slurm.schedmd.com/slurm.conf.html#lbAI + config: + State: UP + Default: "YES" + MaxTime: UNLIMITED + +# +# Slurm accounting (slurmdbd) configurations. +accounting: + # + # -- (bool) + # Enables accounting services. + enabled: true + # + # -- (string) + # Set the image pull policy. + imagePullPolicy: IfNotPresent + # + # Set the image to use. + image: + # + # -- (string) + # Set the image repository to use. + repository: ghcr.io/slinkyproject/slurmdbd + # + # -- (string) + # Set the image tag to use. 
+    tag: 24.11-ubuntu24.04
+  #
+  # -- (object)
+  # Set affinity for Kubernetes Pod scheduling.
+  # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity
+  affinity: *commonAffinity
+  #
+  # -- (list)
+  # Configure pod tolerations.
+  # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
+  tolerations: []
+  #
+  # -- (object)
+  # Set container resource requests and limits for Kubernetes Pod scheduling.
+  # Ref: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container
+  resources: {}
+    # requests:
+    #   cpu: 1
+    #   memory: 1Gi
+    # limits:
+    #   cpu: 2
+    #   memory: 4Gi
+  #
+  # Configuration for an external accounting instance (slurmdbd).
+  external:
+    #
+    # -- (bool)
+    # Use an external accounting instance (slurmdbd) instead of deploying one.
+    enabled: false
+    #
+    # -- (string)
+    # The external accounting instance (slurmdbd) host.
+    host: ""
+    #
+    # -- (integer)
+    # The external accounting instance (slurmdbd) port.
+    port: 6819
+
+#
+# `bitnami/mariadb` subchart configurations.
+# Ref: https://github.com/bitnami/charts/blob/main/bitnami/mariadb/values.yaml
+mariadb:
+  enabled: true
+  auth:
+    username: slurm
+    database: slurm_acct_db
+  tls:
+    enabled: false
+  tde:
+    enabled: false
+  primary:
+    # NOTE: https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build
+    configuration: |-
+      [mysqld]
+      skip-name-resolve
+      explicit_defaults_for_timestamp
+      basedir=/opt/bitnami/mariadb
+      datadir=/bitnami/mariadb/data
+      plugin_dir=/opt/bitnami/mariadb/plugin
+      port={{ .Values.primary.containerPorts.mysql }}
+      socket=/opt/bitnami/mariadb/tmp/mysql.sock
+      tmpdir=/opt/bitnami/mariadb/tmp
+      innodb_buffer_pool_size=4096M
+      innodb_lock_wait_timeout=900
+      innodb_log_file_size=1024M
+      max_allowed_packet=16M
+      bind-address=*
+      pid-file=/opt/bitnami/mariadb/tmp/mysqld.pid
+      log-error=/opt/bitnami/mariadb/logs/mysqld.log
+      character-set-server=UTF8
+      collation-server=utf8_general_ci
+      slow_query_log=0
+      long_query_time=10.0
+      binlog_expire_logs_seconds=2592000
+      {{- if .Values.tls.enabled }}
+      ssl_cert=/opt/bitnami/mariadb/certs/{{ .Values.tls.certFilename }}
+      ssl_key=/opt/bitnami/mariadb/certs/{{ .Values.tls.certKeyFilename }}
+      {{- if (include "mariadb.tlsCACert" .) }}
+      ssl_ca={{ include "mariadb.tlsCACert" .
}} + {{- end }} + {{- end }} + {{- if .Values.tde.enabled }} + plugin_load_add=file_key_management + file_key_management_filename=/opt/bitnami/mariadb/tde/{{ .Values.tde.encryptedKeyFilename }} + file_key_management_filekey=FILE:/opt/bitnami/mariadb/tde/{{ .Values.tde.randomKeyFilename }} + file_key_management_encryption_algorithm={{ .Values.tde.fileKeyManagementEncryptionAlgorithm }} + innodb_encrypt_tables={{ .Values.tde.innodbEncryptTables }} + innodb_encrypt_log={{ .Values.tde.innodbEncryptLog }} + innodb_encrypt_temporary_tables={{ .Values.tde.innodbEncryptTemporaryTables }} + innodb_encryption_threads={{ .Values.tde.innodbEncryptionThreads }} + encrypt_tmp_disk_tables={{ .Values.tde.encryptTmpDiskTables }} + encrypt_tmp_files={{ .Values.tde.encryptTmpTiles }} + encrypt_binlog={{ .Values.tde.encryptBINLOG }} + aria_encrypt_tables={{ .Values.tde.ariaEncryptTables }} + {{- end }} + + [client] + port=3306 + socket=/opt/bitnami/mariadb/tmp/mysql.sock + default-character-set=UTF8 + plugin_dir=/opt/bitnami/mariadb/plugin + + [manager] + port=3306 + socket=/opt/bitnami/mariadb/tmp/mysql.sock + pid-file=/opt/bitnami/mariadb/tmp/mysqld.pid + persistence: + enabled: false + existingClaim: "" + storageClass: standard + labels: {} + annotations: {} + accessModes: + - ReadWriteOnce + size: 8Gi + selector: {} + priorityClassName: "" + tolerations: [] + affinity: *commonAffinity + metrics: + enabled: false + serviceMonitor: + enabled: false + affinity: {} + resources: {} + +# +# Slurm REST API (slurmrestd) configurations. +restapi: + # + # -- (bool) + # Enables restapi services. + enabled: true + # + # -- (integer) + # Set the number of replicas to deploy. + replicas: 1 + # + # -- (string) + # Set the image pull policy. + imagePullPolicy: IfNotPresent + # + # Set the image to use. + image: + # + # -- (string) + # Set the image repository to use. + repository: ghcr.io/slinkyproject/slurmrestd + # + # -- (string) + # Set the image tag to use. + tag: 24.11-ubuntu24.04 + # + # -- (object) + # The restapi service configuration. + # Ref: https://kubernetes.io/docs/concepts/services-networking/service/ + service: {} + # type: LoadBalancer + # externalIPs: [] + # externalName: my.slurmrestd.example.com + # + # -- (integer) + # The external service port number. + servicePort: 6820 + # + # -- (integer) + # The external service node port number. + # Ignored unless `service.type == NodePort`. + serviceNodePort: 36820 + # + # -- (string) + # Set the priority class to use. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass + priorityClassName: "" + # + # -- (object) + # Set affinity for Kubernetes Pod scheduling. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity + affinity: *commonAffinity + # + # -- (list) + # Configure pod tolerations. + # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ + tolerations: [] + # + # -- (object) + # Set container resource requests and limits for Kubernetes Pod scheduling. + # Ref: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container + resources: {} + # requests: + # cpu: 1 + # memory: 1Gi + # limits: + # cpu: 2 + # memory: 4Gi + +# +# `slurm-exporter` subchart configurations. 
+# Ref: https://github.com/SlinkyProject/slurm-exporter/-/blob/main/helm/slurm-exporter/values.yaml +slurm-exporter: + enabled: true + exporter: + enabled: true + secretName: "slurm-token-exporter" + affinity: *commonAffinity \ No newline at end of file diff --git a/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/slinky-slurm-hp-eks.png b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/slinky-slurm-hp-eks.png new file mode 100644 index 00000000..d63ff8f4 Binary files /dev/null and b/1.architectures/7.sagemaker-hyperpod-eks/slinky-slurm/slinky-slurm-hp-eks.png differ