
Commit 3a2f273

Merge branch 'vllm-project:main' into addMoreTorchNightlyTest0429

2 parents: 5db2937 + 1144a8e

166 files changed (+4575, -1284 lines)

.buildkite/release-pipeline.yaml (+6, -5)

@@ -1,20 +1,20 @@
 steps:
-- label: "Build wheel - CUDA 12.4"
+- label: "Build wheel - CUDA 12.8"
   agents:
     queue: cpu_queue_postmerge
   commands:
-  - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.4.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
+  - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
   - "mkdir artifacts"
   - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
   - "bash .buildkite/scripts/upload-wheels.sh"
   env:
     DOCKER_BUILDKIT: "1"

-- label: "Build wheel - CUDA 12.1"
+- label: "Build wheel - CUDA 12.6"
   agents:
     queue: cpu_queue_postmerge
   commands:
-  - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
+  - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.6.3 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
   - "mkdir artifacts"
   - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
   - "bash .buildkite/scripts/upload-wheels.sh"

@@ -48,7 +48,7 @@ steps:
     queue: cpu_queue_postmerge
   commands:
   - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
-  - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.4.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain -f docker/Dockerfile ."
+  - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain -f docker/Dockerfile ."
   - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"

 - label: "Build and publish TPU release image"

@@ -57,6 +57,7 @@ steps:
   agents:
     queue: tpu_queue_postmerge
   commands:
+  - "git fetch --all"
   - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --tag vllm/vllm-tpu:nightly --tag vllm/vllm-tpu:$BUILDKITE_COMMIT --progress plain -f docker/Dockerfile.tpu ."
   - "docker push vllm/vllm-tpu:nightly"
   - "docker push vllm/vllm-tpu:$BUILDKITE_COMMIT"

.buildkite/scripts/upload-wheels.sh (+9, -9)

@@ -50,11 +50,11 @@ aws s3 cp "$normal_wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"
 if [[ $normal_wheel == *"cu118"* ]]; then
     # if $normal_wheel matches cu118, do not upload the index.html
     echo "Skipping index files for cu118 wheels"
-elif [[ $normal_wheel == *"cu121"* ]]; then
-    # if $normal_wheel matches cu121, do not upload the index.html
-    echo "Skipping index files for cu121 wheels"
+elif [[ $normal_wheel == *"cu126"* ]]; then
+    # if $normal_wheel matches cu126, do not upload the index.html
+    echo "Skipping index files for cu126 wheels"
 else
-    # only upload index.html for cu124 wheels (default wheels)
+    # only upload index.html for cu128 wheels (default wheels)
     aws s3 cp index.html "s3://vllm-wheels/$BUILDKITE_COMMIT/vllm/index.html"
     aws s3 cp "s3://vllm-wheels/nightly/index.html" "s3://vllm-wheels/$BUILDKITE_COMMIT/index.html"
 fi

@@ -66,12 +66,12 @@ aws s3 cp "$normal_wheel" "s3://vllm-wheels/nightly/"
 if [[ $normal_wheel == *"cu118"* ]]; then
     # if $normal_wheel matches cu118, do not upload the index.html
     echo "Skipping index files for cu118 wheels"
-elif [[ $normal_wheel == *"cu121"* ]]; then
-    # if $normal_wheel matches cu121, do not upload the index.html
-    echo "Skipping index files for cu121 wheels"
+elif [[ $normal_wheel == *"cu126"* ]]; then
+    # if $normal_wheel matches cu126, do not upload the index.html
+    echo "Skipping index files for cu126 wheels"
 else
-    # only upload index.html for cu124 wheels (default wheels)
+    # only upload index.html for cu128 wheels (default wheels)
     aws s3 cp index.html "s3://vllm-wheels/nightly/vllm/index.html"
 fi

-aws s3 cp "$wheel" "s3://vllm-wheels/$version/"
+aws s3 cp "$wheel" "s3://vllm-wheels/$version/"

.buildkite/test-pipeline.yaml (+1, -1)

@@ -460,7 +460,7 @@ steps:
   - tests/models/encoder_decoder/language
   commands:
     # Install causal-conv1d for plamo2 models here, as it is not compatible with pip-compile.
-    - pip install causal-conv1d
+    - pip install 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8'
     - pytest -v -s models/decoder_only/language -m 'core_model or quant_model'
     - pytest -v -s models/embedding/language -m core_model

.github/ISSUE_TEMPLATE/400-bug-report.yml (+2, -2)

@@ -21,12 +21,12 @@ body:
       It is suggested to download and execute the latest script, as vllm might frequently update the diagnosis information needed for accurately and quickly responding to issues.
     value: |
       <details>
-      <summary>The output of `python collect_env.py`</summary>
+      <summary>The output of <code>python collect_env.py</code></summary>

       ```text
       Your output of `python collect_env.py` here
       ```
-
+
       </details>
   validations:
     required: true

.github/workflows/lint-and-deploy.yaml (+2, -2)

@@ -66,7 +66,7 @@ jobs:
           export AWS_SECRET_ACCESS_KEY=minioadmin
           sleep 30 && kubectl -n ns-vllm logs -f "$(kubectl -n ns-vllm get pods | awk '/deployment/ {print $1;exit}')" &
           helm install --wait --wait-for-jobs --timeout 5m0s --debug --create-namespace --namespace=ns-vllm test-vllm examples/online_serving/chart-helm -f examples/online_serving/chart-helm/values.yaml --set secrets.s3endpoint=http://minio:9000 --set secrets.s3bucketname=testbucket --set secrets.s3accesskeyid=$AWS_ACCESS_KEY_ID --set secrets.s3accesskey=$AWS_SECRET_ACCESS_KEY --set resources.requests.cpu=1 --set resources.requests.memory=4Gi --set resources.limits.cpu=2 --set resources.limits.memory=5Gi --set image.env[0].name=VLLM_CPU_KVCACHE_SPACE --set image.env[1].name=VLLM_LOGGING_LEVEL --set-string image.env[0].value="1" --set-string image.env[1].value="DEBUG" --set-string extraInit.s3modelpath="opt-125m/" --set-string 'resources.limits.nvidia\.com/gpu=0' --set-string 'resources.requests.nvidia\.com/gpu=0' --set-string image.repository="vllm-cpu-env"
-
+
       - name: curl test
         run: |
           kubectl -n ns-vllm port-forward service/test-vllm-service 8001:80 &

@@ -79,4 +79,4 @@ jobs:
             "max_tokens": 7,
             "temperature": 0
           }'):$CODE"
-        echo "$CODE"
+        echo "$CODE"

.pre-commit-config.yaml (+1, -1)

@@ -46,7 +46,7 @@ repos:
   rev: 0.6.17
   hooks:
   - id: pip-compile
-    args: [requirements/test.in, -o, requirements/test.txt]
+    args: [requirements/test.in, -o, requirements/test.txt, --index-strategy, unsafe-best-match]
     files: ^requirements/test\.(in|txt)$
 - repo: local
   hooks:

CMakeLists.txt (+2, -2)

@@ -46,8 +46,8 @@ set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1
 # requirements.txt files and should be kept consistent. The ROCm torch
 # versions are derived from docker/Dockerfile.rocm
 #
-set(TORCH_SUPPORTED_VERSION_CUDA "2.6.0")
-set(TORCH_SUPPORTED_VERSION_ROCM "2.6.0")
+set(TORCH_SUPPORTED_VERSION_CUDA "2.7.0")
+set(TORCH_SUPPORTED_VERSION_ROCM "2.7.0")

 #
 # Try to find python package with an executable that exactly matches

benchmarks/backend_request_func.py (+1)

@@ -260,6 +260,7 @@ async def async_request_openai_completions(
             if request_func_input.model_name else request_func_input.model,
             "prompt": request_func_input.prompt,
             "temperature": 0.0,
+            "repetition_penalty": 1.0,
             "max_tokens": request_func_input.output_len,
             "logprobs": request_func_input.logprobs,
             "stream": True,

benchmarks/benchmark_serving_structured_output.py (+6, -3)

@@ -123,6 +123,8 @@ def sample_requests(tokenizer: PreTrainedTokenizerBase,
             copy.deepcopy(schema) for _ in range(args.num_prompts)
         ]
         for i in range(len(json_schemas)):
+            if "properties" not in json_schemas[i]:
+                json_schemas[i]["properties"] = {}
             json_schemas[i]["properties"][
                 f"__optional_field_{uuid.uuid4()}"] = {
                     "type":

@@ -134,7 +136,7 @@ def sample_requests(tokenizer: PreTrainedTokenizerBase,
         json_schemas = [schema] * args.num_prompts

     def gen_prompt(index: int):
-        return f"Generate an example of a user profile given the following schema: {json.dumps(get_schema(index))}"  # noqa: E501
+        return f"Generate an example of a brief user profile given the following schema: {json.dumps(get_schema(index))}"  # noqa: E501

     def get_schema(index: int):
         return json_schemas[index % len(json_schemas)]

@@ -231,7 +233,8 @@ def _filter_func(item):
         idx -= len_dataset
         schema = dataset["schema"][idx]
         prompt = tokenizer.apply_chat_template(dataset["prompt"][idx],
-                                               tokenize=False)
+                                               tokenize=False,
+                                               add_generation_prompt=True)
         input_len = len(tokenizer(prompt).input_ids)
         completion = dataset["completion"][idx]

@@ -849,7 +852,7 @@ def main(args: argparse.Namespace):
                             'json', 'json-unique', 'grammar', 'regex',
                             'choice', 'xgrammar_bench'
                         ])
-    parser.add_argument("--json_schema_path",
+    parser.add_argument("--json-schema-path",
                         type=str,
                         default=None,
                         help="Path to json schema.")

docker/Dockerfile (+32, -14)

@@ -5,11 +5,11 @@
 # docs/source/contributing/dockerfile/dockerfile.md and
 # docs/source/assets/contributing/dockerfile-stages-dependency.png

-ARG CUDA_VERSION=12.4.1
+ARG CUDA_VERSION=12.8.1
 #################### BASE BUILD IMAGE ####################
 # prepare basic build environment
 FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS base
-ARG CUDA_VERSION=12.4.1
+ARG CUDA_VERSION=12.8.1
 ARG PYTHON_VERSION=3.12
 ARG TARGETPLATFORM
 ENV DEBIAN_FRONTEND=noninteractive

@@ -37,6 +37,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
 # This timeout (in seconds) is necessary when installing some dependencies via uv since it's likely to time out
 # Reference: https://github.com/astral-sh/uv/pull/1694
 ENV UV_HTTP_TIMEOUT=500
+ENV UV_INDEX_STRATEGY="unsafe-best-match"

 # Upgrade to GCC 10 to avoid https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92519
 # as it was causing spam when compiling the CUTLASS kernels

@@ -69,7 +70,8 @@ RUN --mount=type=cache,target=/root/.cache/uv \
 COPY requirements/common.txt requirements/common.txt
 COPY requirements/cuda.txt requirements/cuda.txt
 RUN --mount=type=cache,target=/root/.cache/uv \
-    uv pip install --system -r requirements/cuda.txt
+    uv pip install --system -r requirements/cuda.txt \
+    --extra-index-url https://download.pytorch.org/whl/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')

 # cuda arch list used by torch
 # can be useful for both `dev` and `test`

@@ -92,9 +94,11 @@ COPY requirements/build.txt requirements/build.txt
 # This timeout (in seconds) is necessary when installing some dependencies via uv since it's likely to time out
 # Reference: https://github.com/astral-sh/uv/pull/1694
 ENV UV_HTTP_TIMEOUT=500
+ENV UV_INDEX_STRATEGY="unsafe-best-match"

 RUN --mount=type=cache,target=/root/.cache/uv \
-    uv pip install --system -r requirements/build.txt
+    uv pip install --system -r requirements/build.txt \
+    --extra-index-url https://download.pytorch.org/whl/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')

 COPY . .
 ARG GIT_REPO_CHECK=0

@@ -161,22 +165,25 @@ FROM base as dev
 # This timeout (in seconds) is necessary when installing some dependencies via uv since it's likely to time out
 # Reference: https://github.com/astral-sh/uv/pull/1694
 ENV UV_HTTP_TIMEOUT=500
+ENV UV_INDEX_STRATEGY="unsafe-best-match"
+
+# Workaround for #17068
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"

 COPY requirements/lint.txt requirements/lint.txt
 COPY requirements/test.txt requirements/test.txt
 COPY requirements/dev.txt requirements/dev.txt
-# Workaround for #17068
-RUN --mount=type=cache,target=/root/.cache/uv \
-    uv pip install --system mamba-ssm==2.2.4 --no-build-isolation
 RUN --mount=type=cache,target=/root/.cache/uv \
-    uv pip install --system -r requirements/dev.txt
+    uv pip install --system -r requirements/dev.txt \
+    --extra-index-url https://download.pytorch.org/whl/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')
 #################### DEV IMAGE ####################

 #################### vLLM installation IMAGE ####################
 # image with vLLM installed
 # TODO: Restore to base image after FlashInfer AOT wheel fixed
 FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04 AS vllm-base
-ARG CUDA_VERSION=12.4.1
+ARG CUDA_VERSION=12.8.1
 ARG PYTHON_VERSION=3.12
 WORKDIR /vllm-workspace
 ENV DEBIAN_FRONTEND=noninteractive

@@ -209,6 +216,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
 # This timeout (in seconds) is necessary when installing some dependencies via uv since it's likely to time out
 # Reference: https://github.com/astral-sh/uv/pull/1694
 ENV UV_HTTP_TIMEOUT=500
+ENV UV_INDEX_STRATEGY="unsafe-best-match"

 # Workaround for https://github.com/openai/triton/issues/2507 and
 # https://github.com/pytorch/pytorch/issues/107960 -- hopefully

@@ -229,7 +237,8 @@ RUN --mount=type=cache,target=/root/.cache/uv \
 # Install vllm wheel first, so that torch etc will be installed.
 RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist \
     --mount=type=cache,target=/root/.cache/uv \
-    uv pip install --system dist/*.whl --verbose
+    uv pip install --system dist/*.whl --verbose \
+    --extra-index-url https://download.pytorch.org/whl/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')

 # If we need to build FlashInfer wheel before its release:
 # $ export FLASHINFER_ENABLE_AOT=1

@@ -246,19 +255,26 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist
 RUN --mount=type=cache,target=/root/.cache/uv \
     . /etc/environment && \
     if [ "$TARGETPLATFORM" != "linux/arm64" ]; then \
-        uv pip install --system https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.1.post2/flashinfer_python-0.2.1.post2+cu124torch2.6-cp38-abi3-linux_x86_64.whl ; \
+        # TESTING: install FlashInfer from source to test 2.7.0 final RC
+        FLASHINFER_ENABLE_AOT=1 TORCH_CUDA_ARCH_LIST='7.5 8.0 8.6 8.9 9.0+PTX' \
+        uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.2.post1" ; \
     fi
 COPY examples examples
 COPY benchmarks benchmarks
 COPY ./vllm/collect_env.py .

+RUN --mount=type=cache,target=/root/.cache/uv \
+    . /etc/environment && \
+    uv pip list
+
 # Although we build Flashinfer with AOT mode, there's still
 # some issues w.r.t. JIT compilation. Therefore we need to
 # install build dependencies for JIT compilation.
 # TODO: Remove this once FlashInfer AOT wheel is fixed
 COPY requirements/build.txt requirements/build.txt
 RUN --mount=type=cache,target=/root/.cache/uv \
-    uv pip install --system -r requirements/build.txt
+    uv pip install --system -r requirements/build.txt \
+    --extra-index-url https://download.pytorch.org/whl/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')

 #################### vLLM installation IMAGE ####################

@@ -272,11 +288,13 @@ ADD . /vllm-workspace/
 # This timeout (in seconds) is necessary when installing some dependencies via uv since it's likely to time out
 # Reference: https://github.com/astral-sh/uv/pull/1694
 ENV UV_HTTP_TIMEOUT=500
+ENV UV_INDEX_STRATEGY="unsafe-best-match"

-# install development dependencies (for testing)
 # Workaround for #17068
 RUN --mount=type=cache,target=/root/.cache/uv \
-    uv pip install --system mamba-ssm==2.2.4 --no-build-isolation
+    uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
+
+# install development dependencies (for testing)
 RUN --mount=type=cache,target=/root/.cache/uv \
     uv pip install --system -r requirements/dev.txt
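
The `--extra-index-url` added throughout this Dockerfile points uv at the PyTorch wheel index matching the image's CUDA version: the shell fragment `cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')` keeps the major.minor digits, so `12.8.1` becomes `cu128`. A small Python restatement of that mapping (the helper name is illustrative):

```python
def torch_cuda_index_url(cuda_version: str) -> str:
    """Python equivalent of the shell pipeline above:
    keep the major.minor components and drop the dot, so 12.8.1 -> cu128."""
    major, minor = cuda_version.split(".")[:2]
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


assert torch_cuda_index_url("12.8.1") == "https://download.pytorch.org/whl/cu128"
assert torch_cuda_index_url("12.6.3") == "https://download.pytorch.org/whl/cu126"
```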

docker/Dockerfile.xpu (-6)

@@ -40,12 +40,6 @@ RUN --mount=type=cache,target=/root/.cache/pip \
     --mount=type=bind,source=.git,target=.git \
     python3 setup.py install

-# Please refer xpu doc, we need manually install intel-extension-for-pytorch 2.6.10+xpu due to there are some conflict dependencies with torch 2.6.0+xpu
-# FIXME: This will be fix in ipex 2.7. just leave this here for awareness.
-RUN --mount=type=cache,target=/root/.cache/pip \
-    pip install intel-extension-for-pytorch==2.6.10+xpu \
-    --extra-index-url=https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
 CMD ["/bin/bash"]

 FROM vllm-base AS vllm-openai

docs/source/contributing/overview.md (+4)

@@ -40,6 +40,10 @@ pre-commit install --hook-type pre-commit --hook-type commit-msg
 # You can manually run pre-commit with
 pre-commit run --all-files

+# To manually run something from CI that does not run
+# locally by default, you can run:
+pre-commit run mypy-3.9 --hook-stage manual --all-files
+
 # Unit tests
 pytest tests/
 ```

docs/source/design/v1/prefix_caching.md (+19, -1)

@@ -16,7 +16,7 @@ In the example above, the KV cache in the first block can be uniquely identified

 * Parent hash value: The hash value of the parent hash block.
 * Block tokens: A tuple of tokens in this block. The reason to include the exact tokens is to reduce potential hash value collision.
-* Extra hashes: Other values required to make this block unique, such as LoRA IDs and multi-modality input hashes (see the example below).
+* Extra hashes: Other values required to make this block unique, such as LoRA IDs, multi-modality input hashes (see the example below), and cache salts to isolate caches in multi-tenant environments.

 > **Note 1:** We only cache full blocks.

@@ -76,6 +76,24 @@

 In the rest of this document, we first introduce the data structure used for prefix caching in vLLM v1, followed by the prefix caching workflow of major KV cache operators (e.g., allocate, append, free, eviction). Finally, we use an example to illustrate the end to end prefix caching workflow.

+**Cache Isolation for Security**
+To improve privacy in shared environments, vLLM supports isolating prefix cache reuse through optional per-request salting. By including a `cache_salt` in the request, this value is injected into the hash of the first block, ensuring that only requests with the same salt can reuse cached KV blocks. This prevents timing-based attacks where an adversary could infer cached content by observing latency differences. This offers protection without compromising performance.
+
+```json
+{
+  "messages": [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "Here is a document with details about the world series: ..."},
+    {"role": "user", "content": "Who won the world series in 2020?"}
+  ],
+  "cache_salt": "Z3V2bmV3aGxza3ZubGFoZ3Zud3V3ZWZ2bmd0b3V2bnZmc2xpZ3RoZ2x2aQ=="
+}
+```
+
+With this setup, cache sharing is limited to users or requests that explicitly agree on a common salt, enabling cache reuse within a trust group while isolating others.
+
+> **Note:** Cache isolation is not supported in engine V0.
+
 ## Data Structure

 The prefix caching in vLLM v1 is implemented in the KV cache manager. The basic building block is the “Block” data class (simplified):
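
As an aside to the salting description added above, a minimal self-contained sketch of the idea: mix the salt into the hash of the first block only, so identical token prefixes submitted under different salts map to different cache entries. The `block_hash` helper and its hashing scheme are illustrative assumptions, not vLLM's actual implementation:

```python
from typing import Optional, Tuple


def block_hash(parent_hash: Optional[int],
               block_tokens: Tuple[int, ...],
               cache_salt: Optional[str] = None) -> int:
    """Illustrative only: extra terms (LoRA ID, multi-modal hashes, cache salt)
    need to be mixed into the first block only, since every later block already
    chains on its parent's hash."""
    extra = (cache_salt,) if parent_hash is None and cache_salt else ()
    return hash((parent_hash, block_tokens) + extra)


tokens = tuple(range(16))
h_a = block_hash(None, tokens, cache_salt="tenant-a")
h_b = block_hash(None, tokens, cache_salt="tenant-b")
assert h_a != h_b  # same token prefix, different salts -> no shared cache entry
```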
