.github/ISSUE_TEMPLATE/400-bug-report.yml
+2-2
@@ -21,12 +21,12 @@ body:
 It is suggested to download and execute the latest script, as vllm might frequently update the diagnosis information needed for accurately and quickly responding to issues.
 value: |
 <details>
-<summary>The output of `python collect_env.py`</summary>
+<summary>The output of <code>python collect_env.py</code></summary>
docker/Dockerfile.xpu
-6
@@ -40,12 +40,6 @@ RUN --mount=type=cache,target=/root/.cache/pip \
 --mount=type=bind,source=.git,target=.git \
 python3 setup.py install

-# Please refer xpu doc, we need manually install intel-extension-for-pytorch 2.6.10+xpu due to there are some conflict dependencies with torch 2.6.0+xpu
-# FIXME: This will be fix in ipex 2.7. just leave this here for awareness.
docs/source/design/v1/prefix_caching.md
+19-1
@@ -16,7 +16,7 @@ In the example above, the KV cache in the first block can be uniquely identified

 * Parent hash value: The hash value of the parent hash block.
 * Block tokens: A tuple of tokens in this block. The reason to include the exact tokens is to reduce potential hash value collision.
-* Extra hashes: Other values required to make this block unique, such as LoRA IDs and multi-modality input hashes (see the example below).
+* Extra hashes: Other values required to make this block unique, such as LoRA IDs, multi-modality input hashes (see the example below), and cache salts to isolate caches in multi-tenant environments.

 > **Note 1:** We only cache full blocks.
@@ -76,6 +76,24 @@ Block 3

 In the rest of this document, we first introduce the data structure used for prefix caching in vLLM v1, followed by the prefix caching workflow of major KV cache operators (e.g., allocate, append, free, eviction). Finally, we use an example to illustrate the end to end prefix caching workflow.

+**Cache Isolation for Security**
+To improve privacy in shared environments, vLLM supports isolating prefix cache reuse through optional per-request salting. By including a `cache_salt` in the request, this value is injected into the hash of the first block, ensuring that only requests with the same salt can reuse cached KV blocks. This prevents timing-based attacks where an adversary could infer cached content by observing latency differences. This offers protection without compromising performance.
+
+```json
+{
+"messages": [
+{"role": "system", "content": "You are a helpful assistant."},
+{"role": "user", "content": "Here is a document with details about the world series: ..."},
+{"role": "user", "content": "Who won the world series in 2020?"}
+With this setup, cache sharing is limited to users or requests that explicitly agree on a common salt, enabling cache reuse within a trust group while isolating others.
+
+> **Note:** Cache isolation is not supported in engine V0.
+
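The salting described in the added section can be illustrated with a minimal sketch. It is not vLLM's implementation: the function names `hash_block` and `hash_full_blocks`, the use of SHA-256 over a `repr` string, and the block size are all assumptions made for illustration. It shows only the idea stated in the diff, namely that each block hash covers the parent hash, the block tokens, and extra keys, and that the salt enters the first block's hash so every later block in the chain inherits it.

```python
import hashlib
from typing import List, Optional, Sequence, Tuple


def hash_block(parent_hash: Optional[str],
               tokens: Sequence[int],
               extra_keys: Tuple = ()) -> str:
    """Hash one full block from (parent hash, block tokens, extra keys).

    Hypothetical helper: SHA-256 over a repr string is an illustrative
    choice, not the hash function vLLM uses.
    """
    payload = repr((parent_hash, tuple(tokens), tuple(extra_keys)))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def hash_full_blocks(token_ids: Sequence[int],
                     block_size: int,
                     cache_salt: Optional[str] = None) -> List[str]:
    """Return chained hashes for every full block of a request.

    Only full blocks are hashed (partial trailing blocks are skipped).
    The salt, if given, is added to the extra keys of the first block only;
    later blocks depend on it through the parent hash.
    """
    hashes: List[str] = []
    parent: Optional[str] = None
    num_full_blocks = len(token_ids) // block_size
    for i in range(num_full_blocks):
        block = token_ids[i * block_size:(i + 1) * block_size]
        extra = (cache_salt,) if (i == 0 and cache_salt is not None) else ()
        parent = hash_block(parent, block, extra)
        hashes.append(parent)
    return hashes
```

Under these assumptions, two requests with identical tokens but different salts produce different hashes for every block, so neither can hit the other's cached blocks, while requests sharing a salt still match block for block.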
 ## Data Structure

 The prefix caching in vLLM v1 is implemented in the KV cache manager. The basic building block is the “Block” data class (simplified):