Matroska libebml EbmlUnicodeString Heap Information Leak

January 28, 2016
A specially crafted Unicode string can cause an off-by-few read on the heap in the Unicode string parsing code in libebml. This issue can potentially be used for information leaks.

  • libmatroska master branch

An off-by-few read on the heap occurs when parsing Unicode strings in UTFstring::UpdateFromUTF8 (EbmlUnicodeString.cpp). The string is parsed in a for loop, but when a lead byte announces a four-byte character, no check is made that the following bytes lie inside the allocated buffer. The vulnerable code is:

for (j=0, i=0; i<UTF8string.length(); j++) {
  uint8 lead = static_cast<uint8>(UTF8string[i]);
  if (lead < 0x80) {
    _Data[j] = lead;
    i++;
  } else if ((lead >> 5) == 0x6) {
    _Data[j] = ((lead & 0x1F) << 6) + (UTF8string[i+1] & 0x3F);
    i += 2;
  } else if ((lead >> 4) == 0xe) {
    _Data[j] = ((lead & 0x0F) << 12) + ((UTF8string[i+1] & 0x3F) << 6) + (UTF8string[i+2] & 0x3F);
    i += 3;
  } else if ((lead >> 3) == 0x1e) {
    // Debug output: prints the current index and the highest index read below
    printf("i is now %d and the highest accessed byte is %d\n", i, i+3);
    _Data[j] = ((lead & 0x07) << 18) + ((UTF8string[i+1] & 0x3F) << 12) + ((UTF8string[i+2] & 0x3F) << 6) + (UTF8string[i+3] & 0x3F);
    i += 4;
  } else
    // Invalid char?
    break;
}


If the last byte of the string being parsed satisfies the else if ((lead >> 3) == 0x1e) condition, for example 0xf2, the loop reads UTF8string[i+1] through UTF8string[i+3]. Up to 3 bytes past the end of the buffer are read, causing an out-of-bounds read on the heap.
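
The missing check can be illustrated with a minimal sketch of a bounds-checked decoder. This is not the libebml code or its upstream fix: SequenceLength, DecodeChecked and kLeadMask are hypothetical names introduced here, and the sketch returns a std::u32string instead of filling _Data. Like the loop above, it does not validate that continuation bytes fall in the 0x80..0xBF range; it only shows that a sequence is decoded when all of its bytes fit inside the input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of libebml): number of bytes a UTF-8
// sequence occupies according to its lead byte, or 0 if the lead is invalid.
static std::size_t SequenceLength(std::uint8_t lead) {
  if (lead < 0x80) return 1;
  if ((lead >> 5) == 0x6) return 2;
  if ((lead >> 4) == 0xe) return 3;
  if ((lead >> 3) == 0x1e) return 4;
  return 0;
}

// Payload bits carried by the lead byte, indexed by sequence length.
static const std::uint8_t kLeadMask[5] = {0x00, 0x7F, 0x1F, 0x0F, 0x07};

// Hypothetical decoder: converts UTF-8 to code points and stops at the first
// invalid or truncated sequence instead of reading past the end of the input.
static std::u32string DecodeChecked(const std::string &utf8) {
  std::u32string out;
  std::size_t i = 0;
  while (i < utf8.size()) {
    const std::uint8_t lead = static_cast<std::uint8_t>(utf8[i]);
    const std::size_t len = SequenceLength(lead);
    // The check the vulnerable loop lacks: the whole sequence must fit
    // in the bytes that remain.
    if (len == 0 || len > utf8.size() - i) break;
    char32_t cp = lead & kLeadMask[len];
    for (std::size_t k = 1; k < len; ++k)
      cp = (cp << 6) | (static_cast<std::uint8_t>(utf8[i + k]) & 0x3F);
    out.push_back(cp);
    i += len;
  }
  return out;
}
```

With this check, an input such as "A" followed by a lone 0xf2 byte decodes to the single character "A", and the truncated four-byte sequence is dropped rather than completed from adjacent heap memory.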


Richard Johnson and Aleksandar Nikolic

