summaryrefslogtreecommitdiff
path: root/sal/textenc
AgeCommit message (Collapse)AuthorFilesLines
2020-07-02Upcoming improved loplugin:staticanonymous -> redundantstatic: salStephan Bergmann41-2061/+2061
Change-Id: I022f5ed37d25f2c8a8870033bab32ff59d4d8da6 Reviewed-on: https://gerrit.libreoffice.org/c/core/+/97648 Tested-by: Jenkins Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
2020-05-10compact namespace in sal..svgioNoel Grandin4-8/+8
Change-Id: I7e70614ea5a1cb1a1dc0ef8e9fb6fd48e85c3562 Reviewed-on: https://gerrit.libreoffice.org/c/core/+/93904 Tested-by: Jenkins Reviewed-by: Noel Grandin <noel.grandin@collabora.co.uk>
2020-03-12Revert "loplugin:constfields in reportdesign,sal,sax"Noel Grandin6-66/+66
This reverts commit d4d37662b090cb237585156a47cd8e1f1cbe2656. Now that we know that making fields has negative side effects like disabling assignment operator generation. Change-Id: Idef4937b89a83d2efbfaf0ab87d059a0143c0164 Reviewed-on: https://gerrit.libreoffice.org/c/core/+/90364 Tested-by: Jenkins Reviewed-by: Noel Grandin <noel.grandin@collabora.co.uk>
2020-01-28New loplugin:unsignedcompareStephan Bergmann1-1/+2
"Find explicit casts from signed to unsigned integer in comparison against unsigned integer, where the cast is presumably used to avoid warnings about signed vs. unsigned comparisons, and could thus be replaced with o3tl::make_unsigned for clairty." (compilerplugins/clang/unsignedcompare.cxx) o3tl::make_unsigned requires its argument to be non-negative, and there is a chance that some original code like static_cast<sal_uInt32>(n) >= c used the explicit cast to actually force a (potentially negative) value of sal_Int32 to be interpreted as an unsigned sal_uInt32, rather than using the cast to avoid a false "signed vs. unsigned comparison" warning in a case where n is known to be non-negative. It appears that restricting this plugin to non- equality comparisons (<, >, <=, >=) and excluding equality comparisons (==, !=) is a useful heuristic to avoid such false positives. The only remainging false positive I found was 0288c8ffecff4956a52b9147d441979941e8b87f "Rephrase cast from sal_Int32 to sal_uInt32". But which of course does not mean that there were no further false positivies that I missed. So this commit may accidentally introduce some false hits of the assert in o3tl::make_unsigned. At least, it passed a full (Linux ASan+UBSan --enable-dbgutil) `make check && make screenshot`. It is by design that o3tl::make_unsigned only accepts signed integer parameter types (and is not defined as a nop for unsigned ones), to avoid unnecessary uses which would in general be suspicious. But the STATIC_ARRAY_SELECT macro in include/oox/helper/helper.hxx is used with both signed and unsigned types, so needs a little oox::detail::make_unsigned helper function for now. (The ultimate fix being to get rid of the macro in the first place.) Change-Id: Ia4adc9f44c70ad1dfd608784cac39ee922c32175 Reviewed-on: https://gerrit.libreoffice.org/c/core/+/87556 Tested-by: Jenkins Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
2020-01-26tdf#124176: Use pragma once instead of include guardsBurak Bala1-4/+1
Change-Id: Ib2465f040f12413560b2cec1c742cf3558461309 Reviewed-on: https://gerrit.libreoffice.org/c/core/+/87404 Tested-by: Jenkins Reviewed-by: Muhammet Kara <muhammet.kara@collabora.com>
2020-01-25tdf#124176: Use pragma once instead of include guardsiakarsu1-5/+1
Change-Id: Ibf31d5b97017f875e62b609beef0ecdebd559502 Reviewed-on: https://gerrit.libreoffice.org/c/core/+/87391 Tested-by: Jenkins Reviewed-by: Muhammet Kara <muhammet.kara@collabora.com>
2019-12-19sal_Char->char in remotebridges..saxNoel Grandin5-18/+18
Change-Id: I6d32942960a5e997f16eb1301c45495661cd4cea Reviewed-on: https://gerrit.libreoffice.org/85514 Tested-by: Jenkins Reviewed-by: Noel Grandin <noel.grandin@collabora.co.uk>
2019-12-11Fix typoAndrea Gelmini1-1/+1
Change-Id: Idbcf73ea3034b62e283537e052c17a9fb3988a8b Reviewed-on: https://gerrit.libreoffice.org/84918 Tested-by: Jenkins Reviewed-by: Julien Nabet <serval2412@yahoo.fr>
2019-11-22Extend loplugin:external to warn about classesStephan Bergmann4-0/+20
...following up on 314f15bff08b76bf96acf99141776ef64d2f1355 "Extend loplugin:external to warn about enums". Cases where free functions were moved into an unnamed namespace along with a class, to not break ADL, are in: filter/source/svg/svgexport.cxx sc/source/filter/excel/xelink.cxx sc/source/filter/excel/xilink.cxx svx/source/sdr/contact/viewobjectcontactofunocontrol.cxx All other free functions mentioning moved classes appear to be harmless and not give rise to (silent, even) ADL breakage. (One remaining TODO in compilerplugins/clang/external.cxx is that derived classes are not covered by computeAffectedTypes, even though they could also be affected by ADL-breakage--- but don't seem to be in any acutal case across the code base.) For friend declarations using elaborate type specifiers, like class C1 {}; class C2 { friend class C1; }; * If C2 (but not C1) is moved into an unnamed namespace, the friend declaration must be changed to not use an elaborate type specifier (i.e., "friend C1;"; see C++17 [namespace.memdef]/3: "If the name in a friend declaration is neither qualified nor a template-id and the declaration is a function or an elaborated-type-specifier, the lookup to determine whether the entity has been previously declared shall not consider any scopes outside the innermost enclosing namespace.") * If C1 (but not C2) is moved into an unnamed namespace, the friend declaration must be changed too, see <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71882> "elaborated-type-specifier friend not looked up in unnamed namespace". Apart from that, to keep changes simple and mostly mechanical (which should help avoid regressions), out-of-line definitions of class members have been left in the enclosing (named) namespace. But explicit specializations of class templates had to be moved into the unnamed namespace to appease <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92598> "explicit specialization of template from unnamed namespace using unqualified-id in enclosing namespace". Also, accompanying declarations (of e.g. typedefs or static variables) that could arguably be moved into the unnamed namespace too have been left alone. And in some cases, mention of affected types in blacklists in other loplugins needed to be adapted. And sc/qa/unit/mark_test.cxx uses a hack of including other .cxx, one of which is sc/source/core/data/segmenttree.cxx where e.g. ScFlatUInt16SegmentsImpl is not moved into an unnamed namespace (because it is declared in sc/inc/segmenttree.hxx), but its base ScFlatSegmentsImpl is. GCC warns about such combinations with enabled-by-default -Wsubobject-linkage, but "The compiler doesn’t give this warning for types defined in the main .C file, as those are unlikely to have multiple definitions." (<https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gcc/Warning-Options.html>) The warned-about classes also don't have multiple definitions in the given test, so disable the warning when including the .cxx. Change-Id: Ib694094c0d8168be68f8fe90dfd0acbb66a3f1e4 Reviewed-on: https://gerrit.libreoffice.org/83239 Tested-by: Jenkins Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
2019-11-20ofz#19010 wrong start of rangeCaolán McNamara1-1/+1
Change-Id: Ibf97a830932d3f153b99031abc8c4a00b54cedab Reviewed-on: https://gerrit.libreoffice.org/83265 Reviewed-by: Stephan Bergmann <sbergman@redhat.com> Tested-by: Jenkins
2019-09-11Fix Unicode to Shift JIS/MS932 conversion dataStephan Bergmann1-2/+2
These are MS932 extensions, and per <https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT> ("Table version: 2.01", "Date: 04/15/98"), U+4F92 is a mapping for 0xFA6F (and also for 0xED53, which is also an MS932 extension, and "loses" here), and U+4F9A is a mapping for 0xFA71 (and also for 0xED55, which is also an MS932 extension, and "loses" here). (And neither U+4F92 nor U+4F9A appear as mappings in <https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT>, "Table version: 2.0", "Date: 2011 October 14 (header updated: 2015 December 02)".) This appears to be a typo dating back to 9399c662f36c385b0c705eb34e636a9aec450282 "initial import". Change-Id: I0c699675355d839e62d6e4082355a2d67472533e Reviewed-on: https://gerrit.libreoffice.org/78720 Tested-by: Jenkins Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
2019-09-06Fix typo in comment (ASCII 0x42 is "B")Stephan Bergmann1-1/+1
Change-Id: Iba8411cede4dc47aaa1d9d433de2606c0d66e0bf Reviewed-on: https://gerrit.libreoffice.org/78692 Tested-by: Jenkins Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
2019-09-05Fix conversion of U+0000 in ImplUnicodeToDBCSStephan Bergmann1-1/+1
...which appears to have been broken when 13824735057ef25075af8fd0ddb8f14e34c7eeb6 "#81346# - Fix for unconverted characters for DBCS encodings" moved that "if" out of surrounding "if" block. (And, for consistency, write the "if" check in the same way as the preceding one is written since 739cb04c36524c5a1bbf768dfe93624a1b2ec8b4 "#97705# Fixed mapping of Big5 EUDC points.") Change-Id: I4324197c4eba671ab6313fb89f988da102b8ffa5 Reviewed-on: https://gerrit.libreoffice.org/78627 Tested-by: Jenkins Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
2019-09-04Do not exclude Unicode noncharacters from rtl_convertUnicodeToTextStephan Bergmann10-50/+80
For one, that broke round-tripping with e.g. UTF-8 (see the test case added to Test::testComplex in sal/qa/rtl/textenc/rtl_textcvt.cxx) which did not treat noncharacters as invalid. For another, <https://unicode.org/faq/private_use.html#nonchar7> is meanwhile quite clear on the matter: "Q: Are noncharacters prohibited in interchange? "A: This question has led to some controversy, because the Unicode Standard has been somewhat ambiguous about the status of noncharacters. The formal wording of the definition of 'noncharacter' in the standard has always indicated that noncharacters 'should never be interchanged.' That led some people to assume that the definition actually meant 'shall not be interchanged' and that therefore the presence of a noncharacter in any Unicode string immediately rendered that string malformed according to the standard. But the intended use of noncharacters requires the ability to exchange them in a limited context, at least across APIs and even through data files and other means of 'interchange', so that they can be processed as intended. The choice of the word 'should' in the original definition was deliberate, and indicated that one should not try to interchange noncharacters precisely because their interpretation is strictly internal to whatever implementation uses them, so they have no publicly interchangeable semantics. But other informative wording in the text of the core specification and in the character names list was differently and more strongly worded, leading to contradictory interpretations. "Given this ambiguity of intent, in 2013 the UTC issued Corrigendum #9, which deleted the phrase 'and that should never be interchanged' from the definition of noncharacters, to make it clear that prohibition from interchange is not part of the formal definition of noncharacters. Corrigendum #9 has been incorporated into the core specification for Unicode 7.0. "Q: Are noncharacters invalid in Unicode strings and UTFs? "A: Absolutely not. Noncharacters do not cause a Unicode string to be ill-formed in any UTF. This can be seen explicitly in the table above, where every noncharacter code point has a well-formed representation in UTF-32, in UTF-16, and in UTF-8. An implementation which converts noncharacter code points between one UTF representation and another must preserve these values correctly. The fact that they are called 'noncharacters' and are not intended for open interchange does not mean that they are somehow illegal or invalid code points which make strings containing them invalid." Change-Id: I4fcc0156e3d2fd305a7c7bb0c7b3dbef846c9e64 Reviewed-on: https://gerrit.libreoffice.org/78598 Tested-by: Jenkins Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
2019-09-04[API CHANGE] rtl_convertTextToUnicode behavior upon erroneous inputStephan Bergmann12-31/+194
<http://udk.openoffice.org/cpp/man/spec/textconversion.html> specifies that FLAGS_UNDEFINED_ERROR, FLAGS_MBUNDEFINED_ERROR, and FLAGS_INVALID_ERROR: "Read past the [erroneous] code in the input buffer [...]" But actual behavior of rtl_convertTextToUnicode for the various rtl_TextEncoding values has been inconsistent. Some erroneous input (mostly single-byte UNDEFINED and INVALID ones) has not been consumed at all, some (multi-byte MBUNDEFINED and INVALID) has been consumed partly, and some has been consumed fully as required. However, at least since 8dd4265b9ddbd7786b6237676909eae5b540da0e "CWS-TOOLING: integrate CWS hb18", Custom8BitToUnicode in sw/source/filter/ww8/ww8par.cxx appears to rely on the broken behavior of not consuming erroneous input. (It reads the chunk of valid input with e.g. some RTL_TEXTENCODING_MS_125x that happens to exhibit the broken behavior of not consuming erroneous input, then wants to try to re-read the erroneous input with RTL_TEXTENCODING_MS_1252. For example, opening sw/qa/core/data/ww8/pass/forcepoint50-grfanchor-1.doc triggers that code. For whatever reason, the am_faksas.dot attached to <https://bz.apache.org/ooo/show_bug.cgi?id=9240#c1> "Do not show lithuanian letter 'Š'" appears to not, or at least no longer, trigger that code.) Therefore, it would be useful to have a mode in which rtl_convertTextToUnicode does not consume erroneous input. (And I plan on doing changes in sal/osl/unx/file* that would benefit from that behavior, too.) But changing rtl_convertTextToUnicode to generally not consume erroneous input would not be feasible: If calls do not set RTL_TEXTTOUNICODE_FLAGS_FLUSH, part of an erroneous input can already have been consumed by a previous call, so the current call cannot undo that. But a change that looks like it can work is to change the behavior only if RTL_TEXTTOUNICODE_FLAGS_FLUSH is set. In that case we can at least not consume the part of an erroneous input that has not yet been consumed by a previous call (which would necessarily have been done with RTL_TEXTTOUNICODE_FLAGS_FLUSH unset). The expecation is that code that relies on the don't-consume behavior will do only single calls with RTL_TEXTTOUNICODE_FLAGS_FLUSH set (so reliably not consume the complete erroneous input), while other code (which might do calls in a loop) will not care whether erroneous input has been consumed, anyway. This can be considered a mild form of behavioral API CHANGE (but note that the old implementation didn't exhibit the requested behavior anyway). So all implementations of rtl_convertTextToUnicode for the various rtl_TextEncoding values have been adapted to the new behavior. The only exceptions are ImplDummyToUnicode (sal/textenc/textcvt.cxx), which is a special case anyway used by RTL_TEXTENCODING_DONTKNOW, and two out of three places (marked with a "TODO" each) in ImplUTF7ToUnicode (sal/textenc/tcvtutf7.cxx), where it is hard to retrofit the expected behaivor, and RTL_TEXTENCODING_UTF7 is probably not relevant for the use cases relying on the don't-consume--behavior, anyway. Whether a similar change should be done for rtl_convertUnicodeToText can be examined later. Change-Id: I1ac2c4cfd99e2a0eca219f9a3855ef110b254855 Reviewed-on: https://gerrit.libreoffice.org/78584 Tested-by: Jenkins Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
2019-09-03Fix handling of invalid bytes >= 0x80 in ImplUTF7ToUnicodeStephan Bergmann1-7/+10
Change-Id: I08838f9ae34a31712d7269ddaaee3fe59ece2178 Reviewed-on: https://gerrit.libreoffice.org/78562 Tested-by: Jenkins Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
2019-08-02Fix typosAndrea Gelmini6-20/+20
Change-Id: Ie183c445bf8a545f59aac7b0e29f72ab679a6cf3 Reviewed-on: https://gerrit.libreoffice.org/76852 Tested-by: Jenkins Reviewed-by: Andrea Gelmini <andrea.gelmini@gelma.net>
2019-03-05re-land "new loplugin typedefparam""Noel Grandin2-6/+6
This reverts commit c9bb48386bad7d2a40e6958883328145ae439cad, and adds a bunch more fixes. Change-Id: Ib584d302a73125528eba85fa1e722cb6fc41538a Reviewed-on: https://gerrit.libreoffice.org/68680 Tested-by: Jenkins Reviewed-by: Noel Grandin <noel.grandin@collabora.co.uk>
2019-03-04Revert "new loplugin typedefparam"Noel Grandin2-6/+6
This reverts commit 9865440d217d975206a3f91612f0666312bc8fd8. This is not ready to land yet, seems like the latest update of the logic reveals a bunch more places I need to fix before it can land.
2019-03-04new loplugin typedefparamNoel Grandin2-6/+6
verify that parameters use the exact same typedef-names (if any) in definition and declaration Change-Id: I55d2817f599b0253904dce2d35a1a93967e15a77 Reviewed-on: https://gerrit.libreoffice.org/68439 Tested-by: Jenkins Reviewed-by: Noel Grandin <noel.grandin@collabora.co.uk>
2018-12-02-Werror,-Wc++98-compat-extra-semiStephan Bergmann1-1/+1
Change-Id: I15d67108b4a80a4788982ad6bea65e32fd941a35
2018-12-01tdf#39468 Translate German commentsJens Carl1-1/+1
Change-Id: I27e5e4604cd999d778eb84976b3bea0ef35122ee Reviewed-on: https://gerrit.libreoffice.org/64353 Tested-by: Jenkins Reviewed-by: Eike Rathke <erack@redhat.com>
2018-11-19Fix typosAndrea Gelmini1-10/+10
Change-Id: I6d51e4eb4a49a30193b904b2c7d62df1e16ea3d9 Reviewed-on: https://gerrit.libreoffice.org/63475 Tested-by: Jenkins Reviewed-by: Julien Nabet <serval2412@yahoo.fr>
2018-10-31Translate German commentsJohnny_M3-11/+11
Change-Id: I94cdb753d01dfd0d5b8f78ede1819b281b840ab2 Reviewed-on: https://gerrit.libreoffice.org/62669 Tested-by: Jenkins Reviewed-by: Adolfo Jayme Barrientos <fitojb@ubuntu.com>
2018-10-28tdf#120703 PVS: V547 Expression is always true/falseMike Kaganski1-5/+3
Change-Id: Ic92cc594979cac2edac04a085957398672a5dfcc Reviewed-on: https://gerrit.libreoffice.org/62450 Tested-by: Jenkins Reviewed-by: Mike Kaganski <mike.kaganski@collabora.com>
2018-10-12loplugin:constfields in reportdesign,sal,saxNoel Grandin6-66/+66
and improve the rewriter so I spend less time fixing formatting Change-Id: Ic2a6e5e31a5a202d2d02a47d77c484a57a5ec514 Reviewed-on: https://gerrit.libreoffice.org/61676 Reviewed-by: Noel Grandin <noel.grandin@collabora.co.uk> Tested-by: Noel Grandin <noel.grandin@collabora.co.uk>
2018-08-19Fix typosAndrea Gelmini1-2/+2
Change-Id: Ib717870185bdf4ac43af8fcd3a7233b051e23d30 Reviewed-on: https://gerrit.libreoffice.org/58888 Tested-by: Jenkins Reviewed-by: Regina Henschel <rb.henschel@t-online.de>
2018-07-23ofz#9507 wrong start point for Johab block 59Caolán McNamara1-1/+1
Change-Id: I011f4cbb10324c4a7d4e1be3ab1355291f79730b Reviewed-on: https://gerrit.libreoffice.org/57838 Tested-by: Jenkins Reviewed-by: Caolán McNamara <caolanm@redhat.com> Tested-by: Caolán McNamara <caolanm@redhat.com>
2018-02-08ofz#6112 wrong start off sets for korean KSC5601 tableCaolán McNamara1-2/+2
Change-Id: If986352478f34f54015f1969c97c26e2ef05c06c Reviewed-on: https://gerrit.libreoffice.org/49444 Tested-by: Jenkins <ci@libreoffice.org> Reviewed-by: Caolán McNamara <caolanm@redhat.com> Tested-by: Caolán McNamara <caolanm@redhat.com>
2018-01-12More loplugin:cstylecast: salStephan Bergmann14-67/+63
auto-rewrite with <https://gerrit.libreoffice.org/#/c/47798/> "Enable loplugin:cstylecast for some more cases" plus solenv/clang-format/reformat-formatted-files Change-Id: I7d89b011464ba5d2dd12e04d5fc9f65cb4daebde
2017-12-27sal: fix typo in tcvtmb.cxx and remove comment cruftChris Sherlock1-13/+1
Change-Id: I1617900cd2df096d46a2cba75ef2fe1373c0ab63 Reviewed-on: https://gerrit.libreoffice.org/46948 Tested-by: Jenkins <ci@libreoffice.org> Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
2017-12-11loplugin:salcall fix functionsNoel Grandin1-1/+1
since cdecl is the default calling convention on Windows for such functions, the annotation is redundant. Change-Id: I1a85fa27e5ac65ce0e04a19bde74c90800ffaa2d Reviewed-on: https://gerrit.libreoffice.org/46164 Tested-by: Jenkins <ci@libreoffice.org> Reviewed-by: Noel Grandin <noel.grandin@collabora.co.uk>
2017-11-13Fix typosAndrea Gelmini1-2/+2
Change-Id: Ia544298334364ece3b3963a4adc00c5e01189b91 Reviewed-on: https://gerrit.libreoffice.org/44654 Tested-by: Jenkins <ci@libreoffice.org> Reviewed-by: Mark Page <aptitude@btconnect.com>
2017-11-01-I$(dir $(3)) in gb_CObject__command_pattern is no longer neededStephan Bergmann1-1/+1
...at least in com_GCC_class.mk (com_MSC_class.mk will be addressed in a follow- up commit), after the recent loplugin:includeform clean-up. Two static libraries built from external sources needed adjustment, two compilerplugin tests needed adjustment (which wasn't found by loplugin:includeform, by design), and one more adjustment in sal/textenc/generate/. Change-Id: Idad5ae355a02ae130369a9a45b5f5925ab48ffef Reviewed-on: https://gerrit.libreoffice.org/44174 Tested-by: Jenkins <ci@libreoffice.org> Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
2017-10-23loplugin:includeform: salStephan Bergmann55-131/+131
Change-Id: I539ca8b9dee5edc5fc2282a2b9b0ffd78bad8b11
2017-10-20loplugin:constmethod in codemaker,registry,storeNoel Grandin1-1/+1
Change-Id: Ie75875974f054ff79bd64b1c261e79e2b78eb7fc Reviewed-on: https://gerrit.libreoffice.org/43540 Tested-by: Jenkins <ci@libreoffice.org> Reviewed-by: Noel Grandin <noel.grandin@collabora.co.uk>
2017-09-28Warn about missing text converterStephan Bergmann1-0/+3
(probably because the encoding is RTL_TEXTENCODING_DONTKNOW; that may still happen in various places, so can't use anything stronger than SAL_WARN here) Change-Id: I75993c9ad629104055c171389ed7708649747d9b
2017-09-26ofz#3186: wrong starting offset for JOHAB 0x6D blockCaolán McNamara1-1/+1
Change-Id: I4de6d9d781b2f2313d8fd338b34dcb31434efe91 Reviewed-on: https://gerrit.libreoffice.org/41638 Tested-by: Jenkins <ci@libreoffice.org> Reviewed-by: Caolán McNamara <caolanm@redhat.com> Tested-by: Caolán McNamara <caolanm@redhat.com>
2017-09-24Map Windows code page 42 to RTL_TEXTENCODING_SYMBOLStephan Bergmann1-0/+2
<https://msdn.microsoft.com/en-us/library/windows/desktop/ dd374130(v=vs.85).aspx> "WideCharToMultiByte function" suggests that there now is CP_SYMBOL, "Windows 2000: Symbol code page (42)." And a little test program on Windows indicates that our RTL_TEXTENCODING_SYMBOL is working the same way as CP_SYMBOL, where MultiByteToWideChar maps 00..1F to U+0000..1F and 20..FF to U+F020..F0FF. At least CppunitTest_writerfilter_rtftok, when testing writerfilter/qa/cppunittests/rtftok/data/pass/EDB-18940-1.rtf, goes into case RTF_FCHARSET in RTFDocumentImpl::dispatchValue (writerfilter/source/rtftok/rtfdispatchvalue.cxx) with nParam matching aRTFEncodings[2] (i.e., a mapping from charset 2 to codepage 42, see writerfilter/source/rtftok/rtfcharsets.cxx), then passes 42 into rtl_getTextEncodingFromWindowsCodePage and obtains an unhelpful RTL_TEXTENCODING_DONTKNOW. testFdo72031 (sw/qa/extras/rtfexport/rtfexport2.cxx, CppunitTest_sw_rtfexport2) needed to be adapted, as the circled plus from the Symbol font is now internally represented as U+F0C5, not (somewhat bogusly) as U+00C5 (aka LATIN CAPTIAL LETTER A WITH RING ABOVE). But, when displayed with the Symobl font, the glyph that is actually shown remains the circled plus. Turns out changing rtl_getTextEncodingFromWindowsCodePage would start to make CppunitTest_sw_rtfimport fail: Sep 20 15:49:24 <sberg> vmiklos, with <https://gerrit.libreoffice.org/#/c/42477/>, testN823675 (sw/qa/extras/rtfimport/rtfimport.cxx) fails, the aFont.Name is not "Symbol"; sw/qa/extras/rtfimport/data/n823675.rtf contains a \fonttbl that specifies \f3 to have \fcharset2 (i.e., symbol font) and fontname "Symbol". However, RTFDocumentImpl::checkUnicode (writerfilter/source/rtftok/rtfdocumentimpl.cxx) converts m_aHexBuffer (containing "Symbol;") with nCurrentEncoding apparently being the encoding specified by \fcharset2 (i.e., now RTL_TEXTENCODING_SYMBOL instead of old RTL_TEXTENCODING_DONTKNOW), so the resulting OUString is garbage (instead of the byte-for-byte conversion to Unicode "Symbol;" that RTL_TEXTENCODING_DONTKNOW would do there); do you know whether such \fonttbl fontnames should actually be interpreted in the given \fcharset? Sep 20 15:49:24 <IZBot> gerrit: »Map Windows code page 42 to RTL_TEXTENCODING_SYMBOL« by Stephan Bergmann for master [NEW] Sep 20 15:51:15 <vmiklos> sberg: let me check if the spec covers that Sep 20 15:54:29 <mst_> sberg: i think the name is typically encoded in the font's encoding but probably they have to make a (likely undocumented) exception for symbol encoding Sep 20 15:57:46 <vmiklos> sberg: the spec only says that \fcharset is about the encoding of the content using that font, i don't see it described what would be the encoding of the font name itself Sep 20 15:58:51 <vmiklos> sberg: i'm not sure about if that encoding should or should not affect the encoding of the font name in general, but indeed at least for 2 (symbol encoding) you're right, Word doesn't encoding the font name with that encoding, either. Sep 20 15:59:30 <sberg> vmiklos, mst_, at the top of page 14 of Word2007RTFSpec9.docx I see "Note that runs of text marked with a particular font index (see \fN in the Font Table section) use the codepage for that font as given by \cpgN or implied by \fcharsetN, unless they use Unicode RTF described in the following section." Would that match what mst_ says? Sep 20 15:59:33 <vmiklos> so if it helps you case to handle at as e.g. ascii, just for that encoding, i think there would be no problem with that. Sep 20 16:00:07 <vmiklos> sberg: that still talks about the content using the font, not the strings (font names) in the font table itself, i think. Sep 20 16:01:17 <sberg> vmiklos, what's the control word to select such a font, also \fN? I don't see any such in n823675.rtf Sep 20 16:02:16 <mikekaganski> loircbot: e.g. \af3 Sep 20 16:02:31 <mikekaganski> sberg: ^ Sep 20 16:02:47 <mst_> 04d5a280beeeb6e056df68395dc9c3b3a674361b Sep 20 16:02:50 <IZBot> core - related: fdo#77979: writerfilter RTF import: read encoded font name - http://cgit.freedesktop.org/libreoffice/core/commit/?id=04d5a280beeeb6e056df68395dc9c3b3a674361b Sep 20 16:02:52 <mst_> sberg: ^ Sep 20 16:04:05 <sberg> mst_, thanks; so there's likely an (implicit?) exception for \fcharset2, as you say Sep 20 16:04:33 <mst_> that's most plausible, our own font code is full of exceptions for "symbol fonts" too Sep 20 16:05:19 <sberg> mikekaganski, ENOCONTEXT Sep 20 16:05:36 <mikekaganski> sberg: [17:01:16] sberg: vmiklos, what's the control word to select such a font, also \fN? I don't see any such in n823675.rtf Sep 20 16:06:32 <sberg> mikekaganski, so you say selection is done with \af3 instead of \f3? Sep 20 16:06:40 <mikekaganski> sberg: yes, in that case Sep 20 16:07:34 <mst_> i think there are several different keywords that apply fonts, but can't remember the whole list Sep 20 16:08:10 <mst_> \fN shoudl be one of them though Sep 20 16:22:18 <sberg> vmiklos, so who generated that sw/qa/extras/rtfimport/data/n823675.rtf, was it manually created and lacks a \cpgN before "Symbol"? Sep 20 16:29:17 <sberg> vmiklos, (after further reading of the RTF spec): disregard the "and lacks a \cpgN before 'Symbol'" part of my above question Sep 20 16:30:27 <mst_> sberg: i suggest not reading too much about encoding in RTF, it gets pretty lovecraftian pretty fast... Sep 20 16:32:58 <vmiklos> sberg: given how short that bugdoc is, i'm pretty sure i cut it down manually to something readable from a multi-MB real bugdoc Sep 20 16:33:07 <sberg> mst_, do you have a recommendation how I could get that "don't use symbol font encoding to read a symbol font's name" into writerfilter/source/rtftok/rtfdocumentimpl.cxx? RTFDocumentImpl::checkUnicode lacks the context to tell whether it is using m_aStates.top().nCurrentEncoding to convert a fontname, and the caller of checkUnicode (at the end of RTFDocumentImpl::resolveChars in this case) appears to lack the context, too Sep 20 16:33:12 <mst_> various Old Ones from The Time Before Unicode and their Backward Compatibility Tentacles etc. Sep 20 16:34:59 <sberg> vmiklos, anyway, that "so there's likely an (implicit?) exception for \fcharset2" hypothesis sounds sane, so we should probably implement it (if only you or mst_ can give me a good hint how...) Sep 20 16:35:13 <vmiklos> sberg: looking for a code pointer Sep 20 16:36:05 <mst_> sberg: m_aStates.top().eDestination == Destination::FONTENTRY should be the relevant check? Sep 20 16:36:17 <vmiklos> sberg: RTFDocumentImpl::text() is where the text is taken, Destination::FONTENTRY is the state on the parser stack which is a font entry in the font table. so to detect "your case" during decoding a byte array into a string, m_aStates.top().eDestination == Destination::FONTENTRY is what you want Sep 20 16:36:35 <vmiklos> ah good, two independent matching hints are promising ;) Sep 20 16:37:35 <sberg> mst_, vmiklos, ah; but what also looks dodgy is that checkUnicode operates there on "Symbol;" including the closing ";" of the full <fontinfo>, not just the <fontname> part of the <fontinfo> Sep 20 16:39:24 <vmiklos> sberg: i think we already assume that the only "token" in the font entry destination that is not bound to a control world (\foo) is the font name Sep 20 16:40:52 <vmiklos> sberg: writerfilter/source/rtftok/rtfdocumentimpl.cxx:1237 is where we simply strip away the trailing semicolon, there is no further separation between the font name and other character content inside the destination (apart from the control words and their arguments) Sep 20 16:42:18 <sberg> vmiklos, OK, thanks; I'll just pretend I haven't seen those dodgy details :) ...so I'm switching to (somewhat arbitrarily) RTL_TEXTENCODING_MS_1252 there now Change-Id: Iebd1bcecb7fa71c489798154d3356062b052775e Reviewed-on: https://gerrit.libreoffice.org/42477 Reviewed-by: Stephan Bergmann <sbergman@redhat.com> Tested-by: Stephan Bergmann <sbergman@redhat.com>
2017-09-15Assert flags passed to rtl_convertTextToUnicode/UnicodeToText are validStephan Bergmann1-0/+41
...so that at least some typos of using OUSTRING_TO_OSTRING_CVTFLAGS (0x566) instead of OSTRING_TO_OUSTRING_CVTFLAGS (0x333) can be found. (Unfortunately, in the other direction, 0x333 is a valid combination of RTL_UNICODETOTEXT_FLAGS_*.) Change-Id: I7cfb3677b103ae90de88833cc93b0a5384607e15 Reviewed-on: https://gerrit.libreoffice.org/42288 Tested-by: Jenkins <ci@libreoffice.org> Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
2017-09-13New rtl::isUnicodeScalarValue, rtl::isSurrogateStephan Bergmann1-4/+1
There are apparently various places that want to check for a Unicode scalar value rather than for a Unicode code point. Changed those uses of rtl::isUnicodeCodePoint where that was obvious. (For changing svtools/source/svrtf/svparser.cxx see 8e0fb74dc01927b60d8b868548ef8fe1d7a80ce3 "Revert 'svtools: HTML import: don't put lone surrogates in OUString'".) Other uses of rtl::isUnicodeCodePoint might also want to use rtl::isUnicodeScalarValue instead. As a side effect, this change also introduces rtl::isSurrogate, which is useful in a few places as well. Change-Id: I9245f4f98b83877145a4d392f0ddb7c5d824a535
2017-09-13Silence warning C4701: potentially uninitialized local variableStephan Bergmann1-1/+1
Change-Id: Ia37347108f9fe7094f055a5c6f2ec9511c3aff1d
2017-09-13Make reading UTF-8 strictStephan Bergmann1-24/+49
Consider non-shortest forms, surrogates, and representations of values larger than 0x10FFFF (which can even cover five or six bytes, for historical reasons) as "invalid" (they used to be considered as "undefined" instead). This is in response to fc670f637d4271246691904fd649358ce2e7be59 "svtools: HTML import: don't put lone surrogates in OUString" (which can now be reverted again in a follow-up commit). My fear would have been that some places in the code rely on the original, relaxed handling, but at least 'make check' still succeeded for me. Change-Id: I017e6c04ed3c577c3694b417167f853987a1d1ce
2017-08-25ofz#2852 korean table entries start at 0xF not 0x7Caolán McNamara1-2/+2
Change-Id: Iaf3ed48d0eb0e5a57770af057c565a7310bb96d4 Reviewed-on: https://gerrit.libreoffice.org/40761 Tested-by: Jenkins <ci@libreoffice.org> Reviewed-by: Caolán McNamara <caolanm@redhat.com> Tested-by: Caolán McNamara <caolanm@redhat.com>
2017-08-18Fix typosAndrea Gelmini1-2/+2
Change-Id: I795059109e23800987cda6f04c58ab18c488ad07 Reviewed-on: https://gerrit.libreoffice.org/41242 Tested-by: Jenkins <ci@libreoffice.org> Reviewed-by: Julien Nabet <serval2412@yahoo.fr>
2017-08-17Fix typosAndrea Gelmini2-9/+9
Change-Id: Iaa9c0aea3ea1a239e378bd714ba335f91bb1faf3 Reviewed-on: https://gerrit.libreoffice.org/41194 Reviewed-by: Michael Stahl <mstahl@redhat.com> Tested-by: Michael Stahl <mstahl@redhat.com>
2017-07-17RTL_UNICODETOTEXT_INFO_{DEST|SCR}BUFFERTOSMALL should use TOO, not TOChris Sherlock14-48/+48
I have kept the old mispelled constant for backwards compatibility Change-Id: I128a2eec76d00cc5ef058cd6a0c35a7474d2411e Reviewed-on: https://gerrit.libreoffice.org/39995 Reviewed-by: Chris Sherlock <chris.sherlock79@gmail.com> Tested-by: Chris Sherlock <chris.sherlock79@gmail.com>
2017-07-10teach unnecessaryparen loplugin about identifiersNoel Grandin1-1/+1
Change-Id: I5710b51e53779c222cec0bf08cd34bda330fec4b Reviewed-on: https://gerrit.libreoffice.org/39737 Reviewed-by: Noel Grandin <noel.grandin@collabora.co.uk> Tested-by: Noel Grandin <noel.grandin@collabora.co.uk>
2017-07-05new loplugin unnecessaryparenNoel Grandin1-1/+1
Change-Id: Ic883a07b30069ca6342d7521c8ad890f4326f0ec Reviewed-on: https://gerrit.libreoffice.org/39549 Tested-by: Jenkins <ci@libreoffice.org> Reviewed-by: Noel Grandin <noel.grandin@collabora.co.uk>
2017-07-03Rather translate "Sonderzeichen" as "special characters"Stephan Bergmann1-7/+7
...as had been done in sal/textenc/tcvtlat1.tab, in 5034e8217c9844293dc94e5dff0bdc865ad7a91a "Translate German comments and debug strings (leftovers in dirs sal to sc)". What those comments actually want to say is that, despite being the ASCII control character region, those encodings contain some graphic characters there. Change-Id: Iefad9c45341891c3e91e6b9fb265d3070d1220ad