vLLM is an inference and serving engine for large language models (LLMs). From 0.18.0 to before 0.20.0, the extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty). A single request with a penalty parameter (e.g., "repetition_penalty": 1.1) is sufficient to crash the server. This vulnerability is fixed in 0.20.0.
vLLM: extract_hidden_states speculative decoding crashes server on any request with penalty parameters
Problem type
CWE-131: Incorrect Calculation of Buffer Size
CWE-704: Incorrect Type Conversion or Cast
Affected products
vllm-project/vllm
>= 0.18.0, < 0.20.0 - AFFECTED
References
https://github.com/vllm-project/vllm/security/advisories/GHSA-83vm-p52w-f9pw
https://github.com/vllm-project/vllm/pull/38610
GitHub Security Advisories
GHSA-83vm-p52w-f9pw
vLLM: extract_hidden_states speculative decoding crashes server on any request with penalty parameters
https://github.com/advisories/GHSA-83vm-p52w-f9pw
Summary
The extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty).
A single request with a penalty parameter (e.g., "repetition_penalty": 1.1) is sufficient to crash the server. The crash is deterministic and immediate — no concurrency, race condition, or special workload is required.
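For illustration, the following trigger sketch (host, port, and model name are placeholders, not taken from the advisory) shows how little is needed to crash an affected deployment running the extract_hidden_states proposer:
# Minimal trigger sketch, assuming a vLLM OpenAI-compatible server is
# listening on localhost:8000 with extract_hidden_states speculative
# decoding configured. The model name is a placeholder.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "placeholder-model",
        "prompt": "Hello",
        "max_tokens": 16,
        "repetition_penalty": 1.1,  # any penalty parameter suffices
    },
)
print(resp.status_code)
On affected versions the EngineCore process dies after the first decode step, so this request and every subsequent one fails.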
Details
In vLLM v0.17.0, the extract_hidden_states proposer's propose() method returned sampled_token_ids.unsqueeze(-1), producing a tensor of shape (batch_size, 1).
In PR #37013 (first released in v0.18.0), the KV connector interface was refactored out of propose(). The return type changed from tuple[Tensor, KVConnectorOutput | None] to Tensor, and the .unsqueeze(-1) call was removed along with the KV connector output:
# Before (v0.17.0):
return sampled_token_ids.unsqueeze(-1), kv_connector_output # shape (batch_size, 1)
# After (v0.18.0+):
return sampled_token_ids # shape (batch_size, 2) after first decode step
The refactor missed that sampled_token_ids changed semantics between the first and subsequent decode steps. After the first decode step, the rejection sampler allocates its output as (batch_size, max_spec_len + 1). With num_speculative_tokens=1, this produces shape (batch_size, 2) instead of the expected (batch_size, 1), causing a broadcast shape mismatch during penalty application.
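The failure mode can be reproduced in isolation with a few lines of PyTorch. This is a minimal sketch of the broadcasting problem, not vLLM's actual penalty code: downstream logic that assumes a (batch_size, 1) token column broadcasts cleanly against (batch_size, vocab_size) logits, while a (batch_size, 2) tensor cannot.
import torch

batch_size, vocab_size = 4, 32000
logits = torch.randn(batch_size, vocab_size)

# First decode step: one token column per request, as the penalty code expects.
good_mask = torch.zeros(batch_size, 1, dtype=torch.bool)
# Later steps: rejection sampler output is (batch_size, max_spec_len + 1),
# i.e. (batch_size, 2) with num_speculative_tokens=1.
bad_mask = torch.zeros(batch_size, 2, dtype=torch.bool)

torch.where(good_mask, logits / 1.1, logits)  # (4, 1) broadcasts against (4, 32000)
torch.where(bad_mask, logits / 1.1, logits)   # RuntimeError: sizes 2 and 32000 clash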
Impact
Any vLLM deployment between v0.18.0 and v0.19.1 (inclusive) configured with extract_hidden_states speculative decoding is affected. A single API request containing any penalty parameter immediately and permanently crashes the EngineCore process, resulting in complete loss of service availability.
Patches
Fixed in PR #38610, first included in vLLM v0.20.0. The fix slices the return value to sampled_token_ids[:, :1], ensuring the correct (batch_size, 1) shape regardless of the rejection sampler's output dimensions.
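The effect of the slice is easy to verify in isolation (a sketch of the shape normalization, not the patched vLLM code itself):
import torch

# Rejection sampler output shapes on the first and later decode steps,
# with num_speculative_tokens=1.
first_step = torch.zeros(4, 1, dtype=torch.long)
later_steps = torch.zeros(4, 2, dtype=torch.long)

# Slicing to the first column normalizes both cases to (batch_size, 1).
assert first_step[:, :1].shape == (4, 1)
assert later_steps[:, :1].shape == (4, 1)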
Workarounds
- Upgrade to vLLM v0.20.0 or later.
- If upgrading is not possible, avoid using extract_hidden_states as the speculative decoding method on affected versions.
- Alternatively, reject or strip penalty parameters (repetition_penalty, frequency_penalty, presence_penalty) from incoming requests at an API gateway before they reach vLLM; a minimal sketch follows.
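As a rough sketch of the gateway-side mitigation (a hypothetical helper, not part of vLLM or any particular gateway), the filter only needs to drop three keys from the JSON request body before proxying:
# Hypothetical gateway-side filter: remove penalty parameters from a
# parsed JSON request body before forwarding it to vLLM.
PENALTY_PARAMS = {"repetition_penalty", "frequency_penalty", "presence_penalty"}

def strip_penalty_params(body: dict) -> dict:
    """Return a copy of a JSON request body without penalty parameters."""
    return {k: v for k, v in body.items() if k not in PENALTY_PARAMS}

# Example: the offending parameter is removed, the rest passes through.
safe = strip_penalty_params({"prompt": "Hi", "repetition_penalty": 1.1})
assert safe == {"prompt": "Hi"}
Note that stripping frequency_penalty and presence_penalty silently changes sampling behavior for clients that rely on them; rejecting such requests with an error is the more transparent variant.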
JSON source
https://cveawg.mitre.org/api/cve/CVE-2026-44223
{
"dataType": "CVE_RECORD",
"dataVersion": "5.2",
"cveMetadata": {
"cveId": "CVE-2026-44223",
"assignerOrgId": "a0819718-46f1-4df5-94e2-005712e83aaa",
"assignerShortName": "GitHub_M",
"dateUpdated": "2026-05-12T19:58:40.862Z",
"dateReserved": "2026-05-05T15:42:40.518Z",
"datePublished": "2026-05-12T19:58:40.862Z",
"state": "PUBLISHED"
},
"containers": {
"cna": {
"providerMetadata": {
"orgId": "a0819718-46f1-4df5-94e2-005712e83aaa",
"shortName": "GitHub_M",
"dateUpdated": "2026-05-12T19:58:40.862Z"
},
"title": "vLLM: extract_hidden_states speculative decoding crashes server on any request with penalty parameters",
"descriptions": [
{
"lang": "en",
"value": "vLLM is an inference and serving engine for large language models (LLMs). From to before 0.20.0, the extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty). A single request with a penalty parameter (e.g., \"repetition_penalty\": 1.1) is sufficient to crash the server. This vulnerability is fixed in 0.20.0."
}
],
"affected": [
{
"vendor": "vllm-project",
"product": "vllm",
"versions": [
{
"version": ">= 0.18.0, < 0.20.0",
"status": "affected"
}
]
}
],
"problemTypes": [
{
"descriptions": [
{
"lang": "en",
"description": "CWE-131: Incorrect Calculation of Buffer Size",
"cweId": "CWE-131",
"type": "CWE"
}
]
},
{
"descriptions": [
{
"lang": "en",
"description": "CWE-704: Incorrect Type Conversion or Cast",
"cweId": "CWE-704",
"type": "CWE"
}
]
}
],
"references": [
{
"url": "https://github.com/vllm-project/vllm/security/advisories/GHSA-83vm-p52w-f9pw",
"name": "https://github.com/vllm-project/vllm/security/advisories/GHSA-83vm-p52w-f9pw",
"tags": [
"x_refsource_CONFIRM"
]
},
{
"url": "https://github.com/vllm-project/vllm/pull/38610",
"name": "https://github.com/vllm-project/vllm/pull/38610",
"tags": [
"x_refsource_MISC"
]
}
],
"metrics": [
{
"cvssV3_1": {
"version": "3.1",
"vectorString": "CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H",
"attackVector": "NETWORK",
"attackComplexity": "LOW",
"privilegesRequired": "LOW",
"userInteraction": "NONE",
"scope": "UNCHANGED",
"confidentialityImpact": "NONE",
"integrityImpact": "NONE",
"availabilityImpact": "HIGH",
"baseScore": 6.5,
"baseSeverity": "MEDIUM"
}
}
]
}
}
}