vLLM is an inference and serving engine for large language models (LLMs). From 0.18.0 to before 0.20.0, the extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty). A single request with a penalty parameter (e.g., "repetition_penalty": 1.1) is sufficient to crash the server. This vulnerability is fixed in 0.20.0.
vLLM: extract_hidden_states speculative decoding crashes server on any request with penalty parameters
Problem type
CWE-131: Incorrect Calculation of Buffer Size
CWE-704: Incorrect Type Conversion or Cast
Affected products
vllm-project/vllm
>= 0.18.0, < 0.20.0 - AFFECTED
References
https://github.com/vllm-project/vllm/security/advisories/GHSA-83vm-p52w-f9pw
https://github.com/vllm-project/vllm/pull/38610
GitHub Security Advisories
GHSA-83vm-p52w-f9pw
vLLM: extract_hidden_states speculative decoding crashes server on any request with penalty parameters
https://github.com/advisories/GHSA-83vm-p52w-f9pw
Summary
The extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty).
A single request with a penalty parameter (e.g., "repetition_penalty": 1.1) is sufficient to crash the server. The crash is deterministic and immediate — no concurrency, race condition, or special workload is required.
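For illustration, the following trigger sketch (host, port, and model name are placeholders, not taken from the advisory) shows how little is needed to crash an affected deployment running the extract_hidden_states proposer:
# Minimal trigger sketch, assuming a vLLM OpenAI-compatible server is
# listening on localhost:8000 with extract_hidden_states speculative
# decoding configured. The model name is a placeholder.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "placeholder-model",
        "prompt": "Hello",
        "max_tokens": 16,
        "repetition_penalty": 1.1,  # any penalty parameter suffices
    },
)
print(resp.status_code)
On affected versions the EngineCore process dies after the first decode step, so this request and every subsequent one fails.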
Details
In vLLM v0.17.0, the extract_hidden_states proposer's propose() method returned sampled_token_ids.unsqueeze(-1), producing a tensor of shape (batch_size, 1).
In PR #37013 (first released in v0.18.0), the KV connector interface was refactored out of propose(). The return type changed from tuple[Tensor, KVConnectorOutput | None] to Tensor, and the .unsqueeze(-1) call was removed along with the KV connector output:
# Before (v0.17.0):
return sampled_token_ids.unsqueeze(-1), kv_connector_output # shape (batch_size, 1)
# After (v0.18.0+):
return sampled_token_ids # shape (batch_size, 2) after first decode step
The refactor missed that sampled_token_ids changed semantics between the first and subsequent decode steps. After the first decode step, the rejection sampler allocates its output as (batch_size, max_spec_len + 1). With num_speculative_tokens=1, this produces shape (batch_size, 2) instead of the expected (batch_size, 1), causing a broadcast shape mismatch during penalty application.
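The failure mode can be reproduced in isolation with a few lines of PyTorch. This is a minimal sketch of the broadcasting problem, not vLLM's actual penalty code: downstream logic that assumes a (batch_size, 1) token column broadcasts cleanly against (batch_size, vocab_size) logits, while a (batch_size, 2) tensor cannot.
import torch

batch_size, vocab_size = 4, 32000
logits = torch.randn(batch_size, vocab_size)

# First decode step: one token column per request, as the penalty code expects.
good_mask = torch.zeros(batch_size, 1, dtype=torch.bool)
# Later steps: rejection sampler output is (batch_size, max_spec_len + 1),
# i.e. (batch_size, 2) with num_speculative_tokens=1.
bad_mask = torch.zeros(batch_size, 2, dtype=torch.bool)

torch.where(good_mask, logits / 1.1, logits)  # (4, 1) broadcasts against (4, 32000)
torch.where(bad_mask, logits / 1.1, logits)   # RuntimeError: sizes 2 and 32000 clash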
Impact
Any vLLM deployment between v0.18.0 and v0.19.1 (inclusive) configured with extract_hidden_states speculative decoding is affected. A single API request containing any penalty parameter immediately and permanently crashes the EngineCore process, resulting in complete loss of service availability.
Patches
Fixed in PR #38610, first included in vLLM v0.20.0. The fix slices the return value to sampled_token_ids[:, :1], ensuring the correct (batch_size, 1) shape regardless of the rejection sampler's output dimensions.
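The effect of the slice is easy to verify in isolation (a sketch of the shape normalization, not the patched vLLM code itself):
import torch

# Rejection sampler output shapes on the first and later decode steps,
# with num_speculative_tokens=1.
first_step = torch.zeros(4, 1, dtype=torch.long)
later_steps = torch.zeros(4, 2, dtype=torch.long)

# Slicing to the first column normalizes both cases to (batch_size, 1).
assert first_step[:, :1].shape == (4, 1)
assert later_steps[:, :1].shape == (4, 1)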
Workarounds
- Upgrade to vLLM v0.20.0 or later.
- If upgrading is not possible, avoid using extract_hidden_states as the speculative decoding method on affected versions.
- Alternatively, reject or strip penalty parameters (repetition_penalty, frequency_penalty, presence_penalty) from incoming requests at an API gateway before they reach vLLM; a minimal sketch follows.
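As a rough sketch of the gateway-side mitigation (a hypothetical helper, not part of vLLM or any particular gateway), the filter only needs to drop three keys from the JSON request body before proxying:
# Hypothetical gateway-side filter: remove penalty parameters from a
# parsed JSON request body before forwarding it to vLLM.
PENALTY_PARAMS = {"repetition_penalty", "frequency_penalty", "presence_penalty"}

def strip_penalty_params(body: dict) -> dict:
    """Return a copy of a JSON request body without penalty parameters."""
    return {k: v for k, v in body.items() if k not in PENALTY_PARAMS}

# Example: the offending parameter is removed, the rest passes through.
safe = strip_penalty_params({"prompt": "Hi", "repetition_penalty": 1.1})
assert safe == {"prompt": "Hi"}
Note that stripping frequency_penalty and presence_penalty silently changes sampling behavior for clients that rely on them; rejecting such requests with an error is the more transparent variant.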
JSON source
https://cveawg.mitre.org/api/cve/CVE-2026-44223
{
"dataType": "CVE_RECORD",
"dataVersion": "5.2",
"cveMetadata": {
"cveId": "CVE-2026-44223",
"assignerOrgId": "a0819718-46f1-4df5-94e2-005712e83aaa",
"assignerShortName": "GitHub_M",
"dateUpdated": "2026-05-12T19:58:40.862Z",
"dateReserved": "2026-05-05T15:42:40.518Z",
"datePublished": "2026-05-12T19:58:40.862Z",
"state": "PUBLISHED"
},
"containers": {
"cna": {
"providerMetadata": {
"orgId": "a0819718-46f1-4df5-94e2-005712e83aaa",
"shortName": "GitHub_M",
"dateUpdated": "2026-05-12T19:58:40.862Z"
},
"title": "vLLM: extract_hidden_states speculative decoding crashes server on any request with penalty parameters",
"descriptions": [
{
"lang": "en",
"value": "vLLM is an inference and serving engine for large language models (LLMs). From to before 0.20.0, the extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty). A single request with a penalty parameter (e.g., \"repetition_penalty\": 1.1) is sufficient to crash the server. This vulnerability is fixed in 0.20.0."
}
],
"affected": [
{
"vendor": "vllm-project",
"product": "vllm",
"versions": [
{
"version": ">= 0.18.0, < 0.20.0",
"status": "affected"
}
]
}
],
"problemTypes": [
{
"descriptions": [
{
"lang": "en",
"description": "CWE-131: Incorrect Calculation of Buffer Size",
"cweId": "CWE-131",
"type": "CWE"
}
]
},
{
"descriptions": [
{
"lang": "en",
"description": "CWE-704: Incorrect Type Conversion or Cast",
"cweId": "CWE-704",
"type": "CWE"
}
]
}
],
"references": [
{
"url": "https://github.com/vllm-project/vllm/security/advisories/GHSA-83vm-p52w-f9pw",
"name": "https://github.com/vllm-project/vllm/security/advisories/GHSA-83vm-p52w-f9pw",
"tags": [
"x_refsource_CONFIRM"
]
},
{
"url": "https://github.com/vllm-project/vllm/pull/38610",
"name": "https://github.com/vllm-project/vllm/pull/38610",
"tags": [
"x_refsource_MISC"
]
}
],
"metrics": [
{
"cvssV3_1": {
"version": "3.1",
"vectorString": "CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H",
"attackVector": "NETWORK",
"attackComplexity": "LOW",
"privilegesRequired": "LOW",
"userInteraction": "NONE",
"scope": "UNCHANGED",
"confidentialityImpact": "NONE",
"integrityImpact": "NONE",
"availabilityImpact": "HIGH",
"baseScore": 6.5,
"baseSeverity": "MEDIUM"
}
}
]
}
}
}