re-adding deepspeed #659

Merged, 3 commits, May 7, 2025
123 changes: 123 additions & 0 deletions 3.test_cases/pytorch/deepspeed/0.deepspeed.dockerfile
@@ -0,0 +1,123 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

FROM nvcr.io/nvidia/pytorch:25.03-py3

ARG GDRCOPY_VERSION=v2.4.1
ARG EFA_INSTALLER_VERSION=1.37.0
ARG AWS_OFI_NCCL_VERSION=v1.13.2-aws
ARG TRANSFORMERS_VERSION=4.44.2
ARG MEGATRON_LM_VERSION=core_r0.8.0

ARG OPEN_MPI_PATH=/opt/amazon/openmpi

######################
# Update and remove the IB libverbs
######################
RUN apt-get update -y && apt-get upgrade -y
RUN apt-get remove -y --allow-change-held-packages \
ibverbs-utils \
libibverbs-dev \
libibverbs1 \
libmlx5-1

RUN rm -rf /opt/hpcx/ompi \
&& rm -rf /usr/local/mpi \
&& rm -rf /usr/local/ucx \
&& ldconfig

RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
apt-utils \
autoconf \
automake \
build-essential \
cmake \
curl \
gcc \
gdb \
git \
kmod \
libtool \
openssh-client \
openssh-server \
vim \
&& apt-get autoremove -y

RUN mkdir -p /var/run/sshd && \
sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config && \
echo " UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config && \
sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd

RUN rm -rf /root/.ssh/ \
&& mkdir -p /root/.ssh/ \
&& ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa \
&& cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
&& printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config

ENV LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:$LD_LIBRARY_PATH
ENV PATH=/opt/amazon/openmpi/bin/:/opt/amazon/efa/bin:/usr/bin:/usr/local/bin:$PATH

#################################################
## Install NVIDIA GDRCopy
##
## NOTE: if `nccl-tests` or `/opt/gdrcopy/bin/sanity -v` crashes with incompatible version, ensure
## that the cuda-compat-xx-x package is the latest.
RUN git clone -b ${GDRCOPY_VERSION} https://github.com/NVIDIA/gdrcopy.git /tmp/gdrcopy \
&& cd /tmp/gdrcopy \
&& make prefix=/opt/gdrcopy install

ENV LD_LIBRARY_PATH=/opt/gdrcopy/lib:/usr/local/cuda/compat:$LD_LIBRARY_PATH
ENV LIBRARY_PATH=/opt/gdrcopy/lib:/usr/local/cuda/compat/:$LIBRARY_PATH
ENV CPATH=/opt/gdrcopy/include:$CPATH
ENV PATH=/opt/gdrcopy/bin:$PATH

#################################################
## Install EFA installer
RUN cd $HOME \
&& curl -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz \
&& tar -xf $HOME/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz \
&& cd aws-efa-installer \
&& ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify \
&& rm -rf $HOME/aws-efa-installer


###################################################
## Install AWS-OFI-NCCL plugin
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y libhwloc-dev
# Switch from sh to bash to allow parameter expansion (used below to strip the leading 'v' from ${AWS_OFI_NCCL_VERSION})
SHELL ["/bin/bash", "-c"]
RUN curl -OL https://github.com/aws/aws-ofi-nccl/releases/download/${AWS_OFI_NCCL_VERSION}/aws-ofi-nccl-${AWS_OFI_NCCL_VERSION//v}.tar.gz \
&& tar -xf aws-ofi-nccl-${AWS_OFI_NCCL_VERSION//v}.tar.gz \
&& cd aws-ofi-nccl-${AWS_OFI_NCCL_VERSION//v} \
&& ./configure --prefix=/opt/aws-ofi-nccl/install \
--with-mpi=/opt/amazon/openmpi \
--with-libfabric=/opt/amazon/efa \
--with-cuda=/usr/local/cuda \
--enable-platform-aws \
&& make -j $(nproc) \
&& make install \
&& cd .. \
&& rm -rf aws-ofi-nccl-${AWS_OFI_NCCL_VERSION//v} \
&& rm aws-ofi-nccl-${AWS_OFI_NCCL_VERSION//v}.tar.gz

SHELL ["/bin/sh", "-c"]

###################################################
RUN rm -rf /var/lib/apt/lists/*

RUN echo "hwloc_base_binding_policy = none" >> /opt/amazon/openmpi/etc/openmpi-mca-params.conf \
&& echo "rmaps_base_mapping_policy = slot" >> /opt/amazon/openmpi/etc/openmpi-mca-params.conf

RUN pip3 install awscli pynvml

# Wrap mpirun: rename the real binary and install a thin shim that forwards all
# arguments, so default flags can be injected later without changing callers
RUN mv $OPEN_MPI_PATH/bin/mpirun $OPEN_MPI_PATH/bin/mpirun.real \
&& echo '#!/bin/bash' > $OPEN_MPI_PATH/bin/mpirun \
&& echo '/opt/amazon/openmpi/bin/mpirun.real "$@"' >> $OPEN_MPI_PATH/bin/mpirun \
&& chmod a+x $OPEN_MPI_PATH/bin/mpirun

######################
# DeepSpeed dependencies
######################
RUN pip install transformers==${TRANSFORMERS_VERSION} sentencepiece python-etcd deepspeed accelerate

24 changes: 24 additions & 0 deletions 3.test_cases/pytorch/deepspeed/1.build-image.sbatch
@@ -0,0 +1,24 @@
#!/bin/bash

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

#SBATCH -N 1 # number of nodes to use
#SBATCH --job-name=build-deepspeed-image # name of your job
#SBATCH --output=logs/%x_%j.out # logfile for stdout
#SBATCH --error=logs/%x_%j.err # logfile for stderr, remove it to merge both outputs

set -euxo pipefail

# Default variables for Enroot; values already defined in the environment take precedence
: "${APPS_PATH:=/fsx/apps}"
: "${IMAGE:=$APPS_PATH/deepspeed.sqsh}"

ENROOT_IMAGE=deepspeed
docker build -t ${ENROOT_IMAGE} -f 0.deepspeed.dockerfile .
# Remove old sqsh file if exists
if [ -f ${ENROOT_IMAGE}.sqsh ] ; then
rm ${ENROOT_IMAGE}.sqsh
fi
enroot import -o ${ENROOT_IMAGE}.sqsh dockerd://${ENROOT_IMAGE}:latest
mv ${ENROOT_IMAGE}.sqsh ${IMAGE}
12 changes: 12 additions & 0 deletions 3.test_cases/pytorch/deepspeed/Makefile
@@ -0,0 +1,12 @@
ENROOT_IMAGE=deepspeed

all: build clean import

build:
docker build -t ${ENROOT_IMAGE} -f 0.deepspeed.dockerfile .

clean:
-rm ${ENROOT_IMAGE}.sqsh

import:
enroot import -o ${ENROOT_IMAGE}.sqsh dockerd://${ENROOT_IMAGE}:latest
87 changes: 87 additions & 0 deletions 3.test_cases/pytorch/deepspeed/README.md
@@ -0,0 +1,87 @@
# DeepSpeed Test Cases <!-- omit in toc -->

[DeepSpeed](https://github.com/microsoft/DeepSpeed) enables some of the world's most powerful language models, such as MT-530B and BLOOM. It is an easy-to-use deep learning optimization software suite that powers unprecedented scale and speed for both training and inference. This directory illustrates several example test cases for DeepSpeed training on AWS.

## 1. Preparation

This guide assumes that you have the following:

* A functional Slurm cluster on AWS.
* Docker, [Pyxis](https://github.com/NVIDIA/pyxis) and [Enroot](https://github.com/NVIDIA/enroot) installed.
* An FSx for Lustre filesystem mounted on `/fsx`.

We recommend that you set up a Slurm cluster using the templates in the [architectures directory](../../1.architectures). You need to set the following environment variables to run these test cases:

```bash
export APPS_PATH=/fsx/apps
export ENROOT_IMAGE=$APPS_PATH/deepspeed.sqsh
export FSX_PATH=/fsx
export MODEL_PATH=$FSX_PATH/deepspeed
export TEST_CASE_PATH=${HOME}/18.deepspeed # where you copy the test case or set to your test case path
cd $TEST_CASE_PATH # Note that we assume that you are here during the following command executions
```
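
If you have not yet copied the test case onto the cluster, one way to do so is sketched below; the repository URL and layout are assumptions inferred from the file paths in this PR:

```bash
# fetch the repository and copy this test case into place (paths assumed)
git clone https://github.com/aws-samples/awsome-distributed-training.git
mkdir -p ${TEST_CASE_PATH}
cp -r awsome-distributed-training/3.test_cases/pytorch/deepspeed/* ${TEST_CASE_PATH}/
```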



## 2. Build the container

Before running training jobs, you need to build the Docker container image. [Enroot](https://github.com/NVIDIA/enroot) will be used to turn the image into an unprivileged sandbox for Slurm. The build step may exceed the storage available on the head node, so we recommend building the image on a compute node, following the instructions below (option 2).

### Option 1: build image on a head node

Below are the steps you need to follow:


1. Build the Docker image with the command below in this directory.

```bash
docker build -t deepspeed -f 0.deepspeed.dockerfile .
```


2. Once the Docker image is built, you can check if it is present with `docker images`. You should see an output similar to this one:

```bash
REPOSITORY TAG IMAGE ID CREATED SIZE
deepspeed latest b6c49033c424 9 minutes ago 23.3GB
...
```

3. Convert the Docker image to a squash file with the command below.

```bash
enroot import -o ${ENROOT_IMAGE} dockerd://deepspeed:latest
```

The file will be stored at the location set by `${ENROOT_IMAGE}` (`/apps/deepspeed.sqsh` in the sample output below). The output should look like below.

```bash
[INFO] Fetching image

36a8c752c28a2db543d2a632a3fc1fcbd5789a6f3d45b9d3a24632420dedcfa8

[INFO] Extracting image content...
[INFO] Creating squashfs filesystem...

Parallel mksquashfs: Using 32 processors
Creating 4.0 filesystem on /apps/deepspeed.sqsh, block size 131072.
[========================================================================================================================================================================================================================-] 291068/291068 100%

Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072
uncompressed data, uncompressed metadata, uncompressed fragments, uncompressed xattrs
duplicates are not removed
...
```
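
Before moving on, you can confirm the squash file landed where the test cases expect it (a quick check, assuming the variables from section 1):

```bash
# the import target; e.g. /fsx/apps/deepspeed.sqsh with the defaults above
ls -lh ${ENROOT_IMAGE}
```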

Once done, proceed to the next stage.

### Option 2: Build image on a compute node

In this option, you will use a compute node to build the image. Submit the job as:

```bash
sbatch 1.build-image.sbatch
```
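
Note that Slurm will not create the `logs/` output directory for you, so create it with `mkdir -p logs` before submitting. You can then follow the build through the Slurm log file; the name below assumes the `#SBATCH --output` pattern and job name set in `1.build-image.sbatch`:

```bash
# stream the build log as it is written
tail -f logs/build-deepspeed-image_*.out
```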


Once the image is prepared, you can proceed to the `examples_*` directories for the various DeepSpeed test cases.
@@ -0,0 +1 @@
Megatron-DeepSpeed
@@ -0,0 +1,23 @@
# Megatron-DeepSpeed Test Cases <!-- omit in toc -->
[The DeepSpeed version of NVIDIA's Megatron-LM](https://github.com/microsoft/Megatron-DeepSpeed/tree/main) extends the [DeepSpeed](https://github.com/microsoft/DeepSpeed) framework with support for several features such as MoE model training, curriculum learning, and 3D parallelism. The `examples_deepspeed` directory includes example scripts for the features supported by DeepSpeed.

## 1. Preparation

You need to follow the steps in `../README.md` to prepare the AWS-optimized DeepSpeed container. Also, set the following environment variables to run the test cases:

```bash
export APPS_PATH=/fsx/apps
export ENROOT_IMAGE=$APPS_PATH/deepspeed.sqsh
export FSX_PATH=/fsx
export MODEL_PATH=$FSX_PATH/deepspeed
export TEST_CASE_PATH=${HOME}/18.deepspeed # where you copy the test case or set to your test case path
cd $TEST_CASE_PATH # Note that we assume that you are here during the following command executions
```

Then clone the project repository:

```bash
git clone https://github.com/microsoft/Megatron-DeepSpeed
```

Proceed to each example sub-directory once the setup has completed.
@@ -0,0 +1 @@
ds_config.json
@@ -0,0 +1,22 @@
#!/bin/bash
#SBATCH --exclusive
#SBATCH --job-name=convert-llama-weights
#SBATCH --output=logs/%x_%j.out # logfile for stdout/stderr
#SBATCH --nodes 1

: "${APPS_PATH:=/fsx/apps}"
: "${IMAGE:=$APPS_PATH/deepspeed.sqsh}"
: "${FSX_PATH:=/fsx}"
: "${DATASET:=c4_subset}"
: "${DATA_PATH:=$FSX_PATH/$DATASET}"
: "${MODEL_PATH:=$FSX_PATH/deepspeed}"
: "${CONTAINER_MOUNT:=$FSX_PATH:$FSX_PATH}"


declare -a ARGS=(
--container-image ${IMAGE}
--container-mounts ${CONTAINER_MOUNT}
)

srun -l "${ARGS[@]}" python3 ${PWD}/src/convert_llama_weights_to_hf.py \
--input_dir ${MODEL_PATH}/Llama2-meta --model_size 7B --output_dir ${MODEL_PATH}/Llama2-7b-hf
@@ -0,0 +1,3 @@
#!/bin/bash

sbatch --nodes=1 --job-name=cvtw-mgtds scripts/finetune_llama.sbatch convert
@@ -0,0 +1,3 @@
#!/bin/bash

sbatch --nodes=1 --job-name=finetune-llama scripts/finetune_llama.sbatch finetune
@@ -0,0 +1,74 @@
# Finetuning Llama from Hugging Face Weights

This test case showcases how to finetune the Llama 2 model from Hugging Face weights using Megatron-DeepSpeed.

## 1. Preparation
Set the following environment variables to run the test cases:

```bash
export APPS_PATH=/fsx/apps
export ENROOT_IMAGE=$APPS_PATH/deepspeed.sqsh
export FSX_PATH=/fsx
export MODEL_PATH=$FSX_PATH/deepspeed
export DATA_PATH=$FSX_PATH/alpaca
```
In this step, we prepare the Llama 2 dataset and pretrained weights.

This tutorial uses the [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca) dataset. Download the dataset with the command below:

```bash
mkdir -p ${DATA_PATH}
wget https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json -O ${DATA_PATH}/alpaca_data.json
```

The Llama 2 model is governed by the Meta license and must be downloaded and converted to the standard [Hugging Face](https://huggingface.co/) format prior to running this sample.
You can submit an access request [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/); make sure "Llama 2 & Llama Chat" is checked. Then use the [download.sh](https://github.com/facebookresearch/llama/blob/main/download.sh) script from the official repository. You will be asked to input a URL from the email you receive from Meta.
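
For example, one way to run the official download script is sketched below; the script prompts interactively for the presigned URL, and the exact output directory names may differ by script version:

```bash
# clone the official repository and run its interactive download script
git clone https://github.com/facebookresearch/llama.git
cd llama
./download.sh        # select the 7B model and paste the presigned URL when prompted
cd ..
# arrange the weights and tokenizer into the layout expected below
mkdir -p ${MODEL_PATH}/Llama2-meta
mv llama/llama-2-7b ${MODEL_PATH}/Llama2-meta/7B              # directory name may differ
mv llama/tokenizer.model llama/tokenizer_checklist.chk ${MODEL_PATH}/Llama2-meta/
```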

We assume that you have placed the model and tokenizer as follows on your cluster:

```
${MODEL_PATH}/Llama2-meta/
├── 7B/
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ └── params.json
├── tokenizer.model
└── tokenizer_checklist.chk
```

Convert the model weights into HF format:

```bash
sbatch 1.convert-weights-to-hf.sbatch
```

`convert_llama_weights_to_hf.py` transforms the original weights into the Hugging Face format, as in:

```
${MODEL_PATH}/Llama2-7b-hf
├── config.json
├── generation_config.json
├── pytorch_model-00001-of-00003.bin
├── pytorch_model-00002-of-00003.bin
├── pytorch_model-00003-of-00003.bin
├── pytorch_model.bin.index.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json
```
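
As a quick sanity check that the conversion produced a loadable checkpoint, you can list the output and try loading the tokenizer (assuming `transformers` is installed where you run this):

```bash
# the weight shards and tokenizer files should all be present
ls -lh ${MODEL_PATH}/Llama2-7b-hf
# load the tokenizer from the converted directory
python3 -c "import sys; from transformers import AutoTokenizer; print(AutoTokenizer.from_pretrained(sys.argv[1]))" ${MODEL_PATH}/Llama2-7b-hf
```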

Finally, transform the checkpoint into the Megatron-DeepSpeed format:

```bash
bash 2.convert-weights-to-mega-ds.sh
```


## 2. Finetuning

The finetuning job can be submitted as follows:

```bash
bash 3.finetune-llama.sh
```
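
You can monitor the run with `squeue` and the Slurm log files; the log name below assumes the `logs/%x_%j.out` output pattern used by the sbatch scripts in this test case and the job name set in `3.finetune-llama.sh`:

```bash
squeue -u $USER                     # check that the job is pending or running
tail -f logs/finetune-llama_*.out   # stream the training log
```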