Jan 11, 2009 - 12:00 a.m.

Java Runtime UTF-8 Decoder Smuggling Vector


Due to a misconfiguration of mailing lists, it was just pointed out that
this is already public. Apologies to those vendors who have not reacted to
Sun's announcements of December 2nd in a timely manner.

Mitre ID: CVE-2008-2938

Initial title: Java Runtime UTF-8 Decoding Flaw

Actual title: Java Runtime UTF-8 Decoder Smuggling Vector

Discovered by: William A. Rowe, Jr. <[email protected]>
Sr. Software Engineer, SpringSource, Inc.
Security Team member, Apache Software Foundation

Based on Tomcat Path Traversal Flaw reported by OuTian[1] and Simon Ryeo[2].

Thanks go to the members of the Apache Security Team for their energy and
endless efforts to triage and research potential vulnerabilities, separating
signal from noise; notably Remy Maucherat, Mark Thomas, Tim Ellison, and
Joe Orton for their various contributions to triaging this specific flaw.

Sun's Resolution

Sun released Java 6u11, 1.5.0_17, and 1.4.2_19 addressing this flaw. [3]

IBM's Resolution

IBM suffered a more limited vector which is addressed in J2SE 5.0 SR9, and
one would assume will be addressed by J2SE 1.4.2 SR13 and Java SE 6 SR4
but no further information was provided by IBM.

Disclosure History

Initial disclosures to the Java Runtime author community:
17 Jul - Apache Harmony Project
18 Jul - OpenJDK Project
21 Jul - Sun Microsystems, Inc.
28 Jul - HP
31 Jul - Apple, Inc.

Apache projects across the board, Spring, IBM, BEA, Red Hat, etc. were also
notified at various points along the way.


On July 15 OuTian reported a vulnerability in Apache Tomcat[1] whereby
overlong byte sequences in UTF-8 could bypass both Apache Tomcat access
control restrictions and path decoding logic.

On July 17 Simon Ryeo reported[2] a variation of the same vulnerability in
Apache httpd server when proxying content generated from Tomcat.

Remy Maucherat wrote a patch to address this particular expression of the
vector for Tomcat 6.0.x[4] which also mitigates against any similar but as
yet undiscovered decoding vulnerabilities. This patch has also been ported
to 5.5.x[5] and 4.1.x[6]. On July 31st the Apache Software Foundation
published a mitigation to this vulnerability as Apache Tomcat release
6.0.18 [7] and added this vulnerability to the Apache Tomcat security
pages[8]. Releases for 5.5.x and 4.1.x will follow shortly. The Tomcat
vulnerability had been announced by Ryeo [9] but the full implications
remained undisclosed.

During the course of research, the Glassfish implementation was determined
not to be vulnerable to the specific exploit identified and reported by
OuTian/Ryeo. However, all implementations which accept overlong paths,
including Glassfish, remain vulnerable insofar as any access control is
implemented at the proxy or gateway layer of an http service. Apache Tomcat
release 6.0.18 is no longer vulnerable with respect to its URI path, as
6.0.18 rejects all requests where the decoded value changes the path
representation, but is still exposed due to this vector in other

That said, the underlying vector for this vulnerability identified by Rowe
is actually within the UTF-8 charset implementation of the
java.nio.charset.CharsetDecoder. The onMalformedInput CodingErrorAction is
not triggered by the presence of overlong UTF-8 octet sequences in a number
of vulnerable Java runtime implementations, including Sun's JRE, OpenJDK,
HP's RTE, BEA's JRockit, IBM's SDK, Apple's SDK and Apache Harmony. Other
implementations were not tested.
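
As a minimal sketch of the behaviour described above, the following
configures a CharsetDecoder to report malformed input rather than replace
it. The class name StrictUtf8 and its methods are hypothetical helpers
introduced only for illustration. On a corrected runtime the overlong
sequence 0xC0 0xAF is expected to raise a CharacterCodingException; on an
affected runtime the same call may succeed and yield "/".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any JDK or Tomcat API.
final class StrictUtf8 {
    private StrictUtf8() {}

    /** Decodes UTF-8 and asks the decoder to fail on malformed input. */
    static String decode(byte[] input) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(input)).toString();
    }

    /**
     * True if the runtime's decoder accepts the bytes. A result of true for
     * an overlong sequence indicates a runtime with the flaw described here.
     */
    static boolean isAccepted(byte[] input) {
        try {
            decode(input);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Calling StrictUtf8.isAccepted(new byte[] {(byte) 0xC0, (byte) 0xAF}) at
service start-up is one way to probe whether the deployed runtime rejects
overlong input.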

On July 18th, Rowe and Maucherat confirmed this flaw in Apache Harmony,
Sun's JRE and OpenJDK, and began distributing this information to affected
Java Runtime authors to allow all to prepare appropriate fixes.

On August 13th, this information was made available to various framework
authors such as Spring, BEA, IBM, etc. and other affected developers as
identified by US-CERT to address their specific exposure and potential
vulnerabilities. It is the desire of the author that this announcement
in limited form coincide with Sun's Synchronized Security Release[1] of
the Java platform in October, with parallel releases by HP, Apple, OpenJDK,
Apache Harmony, etc. within that time frame.

Actual Vulnerability

In RFC 3629 "UTF-8, a transformation format of ISO 10646" [10], and as
early as section 6, "Security Considerations", of the preceding RFC 2279
[11] in January 1998, F. Yergeau clearly identified the impact of overlong
byte sequences and declared them invalid. Such Security Considerations
were not discussed in the earlier RFC 2044 [12], published October 1996.

Limiting consideration for the moment to the original vulnerability report
and the HTTP/1.1 URI syntax, it becomes immediately clear that HTTP/1.1
does not specify an encoding for the URI (RFC 2616 [13] and RFC 2396 [14])
and treats it as an octet stream known to the client and origin server, and
otherwise transparent to intervening proxies. Specific characters in the
HTTP URI are significant, all of them within the US-ASCII character set
(which is a deliberate subset of UTF-8 and the first 128 code points of
Unicode). Many implementers and applications use UTF-8 encoding for their
URI patterns as permitted (but not required) by HTTP/1.1.

However, high octets have no specific meaning within RFC 2616 or RFC 2396.
Their presence, mapping two or more high octet bytes into a US-ASCII code
point, must be ignored by proxies, as such bytes are entirely appropriate
in other character sets and HTTP/1.1 does not attribute any UTF-8 properties
to this string. Non-conforming implementations which treat the entire URI
as UTF-8, and which suffer from decoding overlong octet sequences into the
US-ASCII range, will behave differently than their conforming cousins.
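
The difference in behavior can be sketched as follows. Both methods are
hypothetical: proxySeesDotDot stands in for a conforming intermediary that
inspects raw octets only, and lenientDecode is a deliberately flawed
decoder, written for illustration, that combines two-byte sequences
without rejecting results below U+0080. It is an assumption about the
general shape of the flaw, not the code of any affected runtime.

```java
// Hypothetical illustration of a proxy and an endpoint disagreeing
// about the same octets.
final class DecoderMismatch {
    private DecoderMismatch() {}

    /** Octet-level check, as a conforming proxy might perform it. */
    static boolean proxySeesDotDot(byte[] uri) {
        for (int i = 0; i + 1 < uri.length; i++) {
            if (uri[i] == '.' && uri[i + 1] == '.') {
                return true;
            }
        }
        return false;
    }

    /**
     * Deliberately flawed decoder: handles only one- and two-byte forms and
     * never checks that a two-byte sequence encodes a value >= 0x80.
     */
    static String lenientDecode(byte[] in) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length; i++) {
            int b = in[i] & 0xFF;
            boolean twoByteLead = (b & 0xE0) == 0xC0;
            if (twoByteLead && i + 1 < in.length && (in[i + 1] & 0xC0) == 0x80) {
                out.append((char) (((b & 0x1F) << 6) | (in[i + 1] & 0x3F)));
                i++; // consume the continuation octet
            } else {
                out.append((char) b);
            }
        }
        return out.toString();
    }
}
```

For the octets 2F C0 AE C0 AE 2F followed by "secret", the octet-level
check finds no ".." while the flawed decoder produces "/../secret".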

This mismatch of behavior results yet again in the same class of vectors
that were identified three years ago by Linhart, Klein, Heled and Orrin.
The essential premise of their HTTP Request Smuggling whitepaper [15] holds
that the subtle differences in request parsing yield surprisingly
disastrous results. The same is true where a CR-LF line termination,
delimiter, etc. can be tunneled through proxy layers which are conforming
across into a nonconforming endpoint.

The risks of this vector are not limited in any manner to the http
request line, however. Any multi-tier service may be at risk provided
that 1) the end point accepts invalid UTF-8 sequences, 2) an intermediate
transport layer performs no UTF-8 decoding, and 3) the intermediate
transport layer performs decoding, routing, or access control functions
based on US-ASCII assumptions about such invalid strings. Such services
might be external interfaces, or firewalled interfaces such as SQL query
strings and similar.

The authors of this note point out that the vulnerability is not to be
confused with the issue of normative canonical forms for string comparison.
As no sequence of octets > 127 should map to a code point in the range
0…127, any code point in that range should be available for parsing
without an awareness that the resulting string will be UTF-8, provided all
UTF-8 high-bit octets are passed unmodified in the same sequence. Full
string comparisons for access
control containing code points > 127 require a normative form common to the
input and reference strings, and authors must take this into consideration
when implementing any access control based on UTF-8 where non-normative
forms can be passed through any intermediate access control, but are
accepted and then transformed by the endpoint into another representation.

Mitigating Abuse

There are a number of layers which a service author must be concerned with.
At the simplest, if the request is read in UTF-8 for http or similar request
protocols, yet the protocol does not define the request stream as UTF-8,
or is handled as essentially ASCII for transport purposes, embedded CR-LF
line delimiters may be abused for smuggling attacks.

Any delimiters within the input must then be considered. For example,
the colon of a header line may be rendered invisible, permitting headers
that would otherwise be rejected, or the various comma and similar
delimiters between fields may be hidden rendering multiple tokens into
a single apparent value.
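
To make the delimiter cases concrete, the invalid two-byte form of any
US-ASCII code point c is the lead octet 0xC0 | (c >> 6) followed by the
continuation octet 0x80 | (c & 0x3F). The sketch below computes these
forms so they can be used as negative test inputs; the class and method
names are hypothetical.

```java
// Hypothetical helper for generating negative test inputs.
final class OverlongForms {
    private OverlongForms() {}

    /** Returns the invalid two-byte form of a US-ASCII code point (0..127). */
    static byte[] twoByteOverlong(int asciiCodePoint) {
        if (asciiCodePoint < 0 || asciiCodePoint > 0x7F) {
            throw new IllegalArgumentException("not US-ASCII: " + asciiCodePoint);
        }
        return new byte[] {
            (byte) (0xC0 | (asciiCodePoint >> 6)),   // lead octet: 0xC0 or 0xC1
            (byte) (0x80 | (asciiCodePoint & 0x3F))  // continuation octet
        };
    }

    /** Formats octets as upper-case hexadecimal, two digits per octet. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}
```

Under this arithmetic CR becomes C0 8D, LF becomes C0 8A, the colon
becomes C0 BA and the comma becomes C0 AC. None of these contains an octet
that an ASCII-based parser would treat as a delimiter.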

Finally, the text itself may be encoded with apparently unknown values.
In the case of http, unrecognized headers must be passed on as end-to-end
headers rather than hop-by-hop (connection level) headers, and otherwise
ignored. So some field such as Transfer-Encoding: chunked or
Content-Length: value can be passed without a proxy or service provider
recognizing them for what they are (a disallowed combination). The impact
upon the HTTP URI was already clearly disclosed; however, it is not
difficult to identify other nefarious effects which this can have.

If the application cannot be migrated to a corrected Java VM, the author
should examine the conversions to UTF-8 component by component, and
be very cautious to reject and terminate any connection where overlong
UTF-8 sequences are identified. It's necessary to probe for these
explicitly if the VM will not reject them. Invalid patterns begin with
the octets 0xC0 or 0xC1, with 0xE0 followed by a value < 0xA0, or with
0xF0 followed by a value < 0x90. Since five and six byte values cannot be
represented by UTF-16, the values 0xF5 and higher should be rejected out
of hand.
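
The patterns listed above can be probed for on the raw octets before any
decoding takes place. The following is a minimal sketch under that
assumption; OverlongScanner and firstSuspectIndex are hypothetical names,
and the scan covers only the patterns named in this note. It is not a
complete UTF-8 validator (for example, it does not flag truncated
sequences or stray continuation octets).

```java
// Hypothetical pre-decoding scan for the octet patterns named in this note.
final class OverlongScanner {
    private OverlongScanner() {}

    /**
     * Returns the index of the first octet that begins one of the listed
     * invalid patterns, or -1 if none is found.
     */
    static int firstSuspectIndex(byte[] in) {
        for (int i = 0; i < in.length; i++) {
            int b = in[i] & 0xFF;
            // -1 marks "no following octet"; such truncation is not flagged here.
            int next = (i + 1 < in.length) ? (in[i + 1] & 0xFF) : -1;

            if (b == 0xC0 || b == 0xC1) {
                return i; // two-byte form of a US-ASCII code point
            }
            if (b >= 0xF5) {
                return i; // beyond what UTF-16 can represent
            }
            if (b == 0xE0 && next >= 0 && next < 0xA0) {
                return i; // three-byte form of a value below U+0800
            }
            if (b == 0xF0 && next >= 0 && next < 0x90) {
                return i; // four-byte form of a value below U+10000
            }
        }
        return -1;
    }
}
```

A caller would reject the request and close the connection whenever the
result is not -1, and only then hand the octets to the decoder.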

Finally, if these overlong sequences are not explicitly parsed for, in
any sort of application beyond http, note the following statement of fact
from RFC 3629:

   o  US-ASCII octet values do not appear otherwise in a UTF-8 encoded
      character stream. This provides compatibility with file systems
      or other software (e.g., the printf() function in C libraries)
      that parse based on US-ASCII values but are transparent to other
      values.

and contrast this to the case of an errant implementation such as those
found in the affected JVMs; this assumption must be turned on its head.
Multiply the cases affected by this error both into and out of the
filesystem and other resources from a given Java-based service. It becomes
critical that all evaluation occurs after that translation, and none before
the string becomes Unicode.


[1] OuTian, "Tomcat - Unicode decoding directory traversal vulnerability"

[2] Ryeo, S., "Directory Traversal Vulnerability"

[3] Sun Microsystems, Java SE 6 Update 11 Release Notes

[4] Maucherat, R., "Additional normalization check"

[5] Thomas, M., "Additional normalization check"

[6] Thomas, M., "Additional normalization check"

[7] Maucherat, R., "[ANN] Apache Tomcat 6.0.18 released" […]

[8] "Tomcat Security Pages"

[9] Ryeo, S., "Apache Tomcat Directory Traversal Vulnerability"

[10] Yergeau, F., "UTF-8, a transformation format of ISO 10646"

[11] Yergeau, F., "UTF-8, a transformation format of ISO 10646"

[12] Yergeau, F., "UTF-8, a transformation format of ISO 10646"

[13] Fielding, R., et al., "HTTP/1.1"

[14] Berners-Lee, T., R. Fielding, L. Masinter "URI Generic Syntax"

[15] Linhart, C., A. Klein, R. Heled, S. Orrin "HTTP Request Smuggling"