author Michael Stahl <mstahl@redhat.com> 2017-09-07 23:01:26 +0200
committer Andras Timar <andras.timar@collabora.com> 2017-09-18 17:50:29 +0200
commit fb7de575d2a308e9656bc83828045263dad87f9f
tree 9d00ed818a6ba21d6e3eb94cfbe5c671001377dd /svtools
parent e56850ce7c66aed7e3b6b4b5b140e70e7becbb1c
svtools: HTML import: don't put lone surrogates in OUString

The bytes "ed b3 b5" in fdo67610-1.doc (which, as the name indicates, is an HTML file) are converted to the lone UTF-16 surrogate "dcf5", which is inserted into SwTextNode and causes asserts later on.

The actual encoding of the HTML document is probably GBK (at least VIM doesn't display any missing characters with that), but because it doesn't contain any indication of its encoding it's apparently imported as UTF-8; the ImplConvertUtf8ToUnicode() thinking a surrogate code point is valid even if the Java-compatible mode RTL_TEXTENCODING_JAVA_UTF8 is not specified is a bit of a surprise.

[note: the master commit says "JSON-compatible mode" but i was confusing different text encoding perversions there]

Change-Id: Idd788d9d461fed150171dd907439166f3075a834
(cherry picked from commit fc670f637d4271246691904fd649358ce2e7be59)
Reviewed-on: https://gerrit.libreoffice.org/42101
Tested-by: Jenkins <ci@libreoffice.org>
Reviewed-by: Caolán McNamara <caolanm@redhat.com>
Tested-by: Caolán McNamara <caolanm@redhat.com>
(cherry picked from commit 756949c06b8bf933bcd13a226f449b8909cbf3ae)
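
For illustration only, here is a minimal standalone sketch of the arithmetic described above: how a lenient three-byte UTF-8 decode of "ed b3 b5" yields the lone low surrogate U+DCF5, and how a check shaped like the one in this patch replaces it with '?'. The helpers below are hypothetical stand-ins written so the example compiles without LibreOffice headers; they are not the rtl:: functions or ImplConvertUtf8ToUnicode() themselves, and `decodeThreeBytesLenient`, `sanitize` and `checkExample` are names invented for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins (assumed behaviour) for rtl::isUnicodeCodePoint,
// rtl::isHighSurrogate and rtl::isLowSurrogate.
constexpr bool isUnicodeCodePoint(std::uint32_t c) { return c <= 0x10FFFF; }
constexpr bool isHighSurrogate(std::uint32_t c) { return c >= 0xD800 && c <= 0xDBFF; }
constexpr bool isLowSurrogate(std::uint32_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Hypothetical lenient decoder for one 3-byte UTF-8 sequence
// (1110xxxx 10xxxxxx 10xxxxxx). It only combines the payload bits and,
// like the behaviour the commit message complains about, does not
// reject surrogate code points.
constexpr std::uint32_t decodeThreeBytesLenient(std::uint8_t b0, std::uint8_t b1,
                                                std::uint8_t b2)
{
    return (std::uint32_t(b0 & 0x0F) << 12)
         | (std::uint32_t(b1 & 0x3F) << 6)
         |  std::uint32_t(b2 & 0x3F);
}

// Same shape as the patched condition: anything that is not a Unicode
// scalar value is replaced by '?'.
constexpr std::uint32_t sanitize(std::uint32_t c)
{
    return (!isUnicodeCodePoint(c) || isHighSurrogate(c) || isLowSurrogate(c))
               ? std::uint32_t('?')
               : c;
}

// Walks through the byte sequence named in the commit message.
inline void checkExample()
{
    const std::uint32_t c = decodeThreeBytesLenient(0xED, 0xB3, 0xB5);
    assert(c == 0xDCF5);                 // 0xD000 | 0x0CC0 | 0x35
    assert(isLowSurrogate(c));           // lone low surrogate
    assert(sanitize(c) == '?');          // replaced by the new check
    assert(sanitize(0x4E2D) == 0x4E2D);  // ordinary BMP character is kept
}
```

Calling `checkExample()` from a test or `main` exercises the case above. Filtering after conversion, rather than changing the converter, mirrors the approach this patch takes in SvParser::GetNextChar().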
Diffstat (limited to 'svtools')
-rw-r--r-- svtools/source/svrtf/svparser.cxx 3
1 files changed, 2 insertions, 1 deletions
diff --git a/svtools/source/svrtf/svparser.cxx b/svtools/source/svrtf/svparser.cxx
index 0540e172be10..ca4f389b83b5 100644
--- a/svtools/source/svrtf/svparser.cxx
+++ b/svtools/source/svrtf/svparser.cxx
@@ -390,7 +390,8 @@ sal_uInt32 SvParser::GetNextChar()
while( 0 == nChars && !bErr );
}
- if ( ! rtl::isUnicodeCodePoint( c ) )
+ // Note: ImplConvertUtf8ToUnicode() may produce a surrogate!
+ if (!rtl::isUnicodeCodePoint(c) || rtl::isHighSurrogate(c) || rtl::isLowSurrogate(c))
c = (sal_uInt32) '?' ;
if( bErr )