tdf#129353, tdf#129402: fix node creation on index import

ToC, bibliography, and index sections import code changed to closely follow what Word does, make sure that pre-rendered entries don't get imported as standalone paragraphs outside of the index sections, and paragraph count is accurate (no missing or added paragraphs as much as possible). In Word, an index may start and end in the middle of a paragraph: <w:p> <w:r> <w:t>Some text before index</w:t> </w:r> <w:r> <w:fldChar w:fldCharType="begin"/> </w:r> <w:r> <w:instrText> TOC ...</w:instrText> </w:r> <w:r> <w:fldChar w:fldCharType="separate"/> </w:r> <w:r> <w:t>First pre-rendered index entry</w:t> </w:r> </w:p> ... <w:p> <w:r> <w:t>Last pre-rendered index entry</w:t> </w:r> <w:r> <w:fldChar w:fldCharType="end"/> </w:r> <w:r> <w:t>Some text after index</w:t> </w:r> </w:p> However, normally it looks like either no runs precedig index, or no runs of pre-rendered contents will be present. When no Std elements are used, the typical situation is that there's a normal paragraph (possibly with some user text), which ends with index start marker, without any pre-rendered contents in the same paragraph; and all pre- rendered contents goes in following paragraphs. Such index normally ends with index end marker in the *first* run of a paragraph, which then might have normal text runs. When Stds are used, then no leading/trailing out-of-index runs in paragraphs with marks are usually present; and in this case, when paragraphs with index marks don't contain pre-rendered entries, they still are treated as part of the index. In Writer, indexes are node sections (and so cannot be inline with other paragraph contents). When there was some paragraph content already before the start-of-index mark, the paragraph is assumed to end before the index; in this case, when current <w:p> element ends, importer decides if a separate starting paragraph is needed or not, depending on if there was some runs after the mark. When there was no text runs before the starting mark, then the paragraph is treated as leading paragraph of the index. This allows to not miss empty paragraphs before index; and not have two paragraphs where there was one in Word. Only in cases when user had manually typed text both in and outside of the index in the same paragraph in Word, we would have the paragraph split into two in Writer. For end marks, the behaviour depends on whether it's inside Std. When inside, the ending paragraph starting with index end mark is considered part of the index. For out-of-Std case, it's considered normal paragraph (and measures are taken to make sure it's not dropped even if empty, because sometimes such paragraphs don't have other content, and have section settings, which is usually treated by Writer as "drop this paragraph" sign). A special problem is multi-column index. It's wrapped into a continuous section by Word; and in Writer, we also wrap it into a section. It would be possibly useful to detect somehow if this section is part of index definition, and in this case, drop the section and put its properties into the Writer's index section. That would avoid an explicit section in the imported document. This is TODO, for someone who figures how to detect reliably if the section belongs to index definition. See comment in DomainMapper_Impl::appendTextSectionAfter. By the way, current export code is wrong, producing an index that is single-column in Word; this change doesn't touch that. Several existing tests needed to be fixed, which used to test wrong results. Reviewed-on: https://gerrit.libreoffice.org/85089 Tested-by: Jenkins Reviewed-by: Mike Kaganski <mike.kaganski@collabora.com> (cherry picked from commit 5cdb14345842c07eb1a466897753da910e9488f8) Change-Id: I9597c8ab13f31ded9abcc24054d3478d3e3a3b40 Reviewed-on: https://gerrit.libreoffice.org/c/core/+/85289 Tested-by: Jenkins Reviewed-by: Andras Timar <andras.timar@collabora.com>
author: Mike Kaganski <mike.kaganski@collabora.com> 2019-12-13 09:36:39 +0300
committer: Andras Timar <andras.timar@collabora.com> 2020-01-09 08:19:01 +0100
commit: e76fcb8f169acac3d81bdaf9b126cc4c98e2eb8e (patch)
tree: 2327d7636cd5efcf78cb04294fbd5fb9ce582459 /sw
parent: 75265fad67c38827aec074e99e010109f9219079 (diff)
5 files changed, 66 insertions, 10 deletions
diff --git a/sw/qa/extras/ooxmlexport/data/tdf129353.docx b/sw/qa/extras/ooxmlexport/data/tdf129353.docx
new file mode 100644
index 000000000000..c5cf8865eda6
--- /dev/null
+++ b/sw/qa/extras/ooxmlexport/data/tdf129353.docx
diff --git a/sw/qa/extras/ooxmlexport/ooxmlexport14.cxx b/sw/qa/extras/ooxmlexport/ooxmlexport14.cxx
index 2271aa413dd6..062831503404 100644
--- a/sw/qa/extras/ooxmlexport/ooxmlexport14.cxx
+++ b/sw/qa/extras/ooxmlexport/ooxmlexport14.cxx
@@ -14,8 +14,10 @@
 
 #include <editsh.hxx>
 #include <frmatr.hxx>
+#include <tools/lineend.hxx>
 #include <com/sun/star/text/TableColumnSeparator.hpp>
 #include <com/sun/star/text/RelOrientation.hpp>
+#include <com/sun/star/text/XDocumentIndex.hpp>
 
 class Test : public SwModelTestBase
 {
@@ -207,6 +209,33 @@ DECLARE_OOXMLEXPORT_EXPORTONLY_TEST(testTdf128889, "tdf128889.fodt")
     assertXPath(pXml, "/w:document/w:body/w:p[1]/w:r[2]/w:br", "type", "page");
 }
 
+DECLARE_OOXMLEXPORT_TEST(testTdf129353, "tdf129353.docx")
+{
+    CPPUNIT_ASSERT_EQUAL(8, getParagraphs());
+    getParagraph(1, "(Verne, 1870)");
+    getParagraph(2, "Bibliography");
+    getParagraph(4, "Christie, A. (1922). The Secret Adversary. ");
+    CPPUNIT_ASSERT_EQUAL(OUString(), getParagraph(8)->getString());
+
+    uno::Reference<text::XDocumentIndexesSupplier> xIndexSupplier(mxComponent, uno::UNO_QUERY);
+    uno::Reference<container::XIndexAccess> xIndexes = xIndexSupplier->getDocumentIndexes();
+    uno::Reference<text::XDocumentIndex> xIndex(xIndexes->getByIndex(0), uno::UNO_QUERY);
+    uno::Reference<text::XTextRange> xTextRange = xIndex->getAnchor();
+    uno::Reference<text::XText> xText = xTextRange->getText();
+    uno::Reference<text::XTextCursor> xTextCursor = xText->createTextCursor();
+    xTextCursor->gotoRange(xTextRange->getStart(), false);
+    xTextCursor->gotoRange(xTextRange->getEnd(), true);
+    OUString aIndexString(convertLineEnd(xTextCursor->getString(), LineEnd::LINEEND_LF));
+
+    // Check that all the pre-rendered entries are correct, including trailing spaces
+    CPPUNIT_ASSERT_EQUAL(OUString("\n" // starting with an empty paragraph
+                                  "Christie, A. (1922). The Secret Adversary. \n"
+                                  "\n"
+                                  "Verne, J. G. (1870). Twenty Thousand Leagues Under the Sea. \n"
+                                  ""), // ending with an empty paragraph
+                         aIndexString);
+}
+
 CPPUNIT_PLUGIN_IMPLEMENT();
 
 /* vim:set shiftwidth=4 softtabstop=4 expandtab: */
diff --git a/sw/qa/extras/ooxmlexport/ooxmlexport5.cxx b/sw/qa/extras/ooxmlexport/ooxmlexport5.cxx
index 2560cf89a506..1c2560320cdc 100644
--- a/sw/qa/extras/ooxmlexport/ooxmlexport5.cxx
+++ b/sw/qa/extras/ooxmlexport/ooxmlexport5.cxx
@@ -9,6 +9,8 @@
 
 #include <swmodeltestbase.hxx>
 
+#include <com/sun/star/text/XDocumentIndex.hpp>
+#include <com/sun/star/text/XDocumentIndexesSupplier.hpp>
 #include <com/sun/star/text/XFootnote.hpp>
 #include <com/sun/star/text/XTextTable.hpp>
 #include <com/sun/star/style/LineSpacing.hpp>
@@ -694,6 +696,33 @@ DECLARE_OOXMLEXPORT_TEST(testFdo77129, "fdo77129.docx")
     assertXPathContent(pXmlDoc, "/w:document/w:body/w:p[4]/w:r[1]/w:t", "Abstract");
 }
 
+// Test the same testdoc used for testFdo77129.
+DECLARE_OOXMLEXPORT_TEST(testTdf129402, "fdo77129.docx")
+{
+    // tdf#129402: ToC title must be "Contents", not "Content"; the index field must include
+    // pre-rendered element.
+
+    // Currently export drops empty paragraph after ToC, so skip getParagraphs test for now
+//    CPPUNIT_ASSERT_EQUAL(5, getParagraphs());
+    CPPUNIT_ASSERT_EQUAL(OUString("owners."), getParagraph(1)->getString());
+    CPPUNIT_ASSERT_EQUAL(OUString("Contents"), getParagraph(2)->getString());
+    CPPUNIT_ASSERT_EQUAL(OUString("How\t2"), getParagraph(3)->getString());
+//    CPPUNIT_ASSERT_EQUAL(OUString(), getParagraph(4)->getString());
+
+    uno::Reference<text::XDocumentIndexesSupplier> xIndexSupplier(mxComponent, uno::UNO_QUERY);
+    uno::Reference<container::XIndexAccess> xIndexes = xIndexSupplier->getDocumentIndexes();
+    uno::Reference<text::XDocumentIndex> xIndex(xIndexes->getByIndex(0), uno::UNO_QUERY);
+    uno::Reference<text::XTextRange> xTextRange = xIndex->getAnchor();
+    uno::Reference<text::XText> xText = xTextRange->getText();
+    uno::Reference<text::XTextCursor> xTextCursor = xText->createTextCursor();
+    xTextCursor->gotoRange(xTextRange->getStart(), false);
+    xTextCursor->gotoRange(xTextRange->getEnd(), true);
+    OUString aTocString(xTextCursor->getString());
+
+    // Check that the pre-rendered entry is inside the index
+    CPPUNIT_ASSERT_EQUAL(OUString("How\t2"), aTocString);
+}
+
 DECLARE_OOXMLEXPORT_TEST(testfdo79969_xlsm, "fdo79969_xlsm.docx")
 {
     // This UT for DOCX embedded with excel work sheet.
diff --git a/sw/qa/extras/ooxmlexport/ooxmlfieldexport.cxx b/sw/qa/extras/ooxmlexport/ooxmlfieldexport.cxx
index 15390ccd574c..d20d8a90938f 100644
--- a/sw/qa/extras/ooxmlexport/ooxmlfieldexport.cxx
+++ b/sw/qa/extras/ooxmlexport/ooxmlfieldexport.cxx
@@ -282,9 +282,9 @@ DECLARE_OOXMLEXPORT_TEST(testAlphabeticalIndex_MultipleColumns,"alphabeticalInde
 
     // check for section breaks after and before the Index Section
     assertXPath(pXmlDoc, "/w:document/w:body/w:p[2]/w:pPr/w:sectPr/w:type","val","continuous");
-    assertXPath(pXmlDoc, "/w:document/w:body/w:p[9]/w:pPr/w:sectPr/w:type","val","continuous");
+    assertXPath(pXmlDoc, "/w:document/w:body/w:p[8]/w:pPr/w:sectPr/w:type","val","continuous");
     // check for "w:space" attribute for the columns in Section Properties
-    assertXPath(pXmlDoc, "/w:document/w:body/w:p[9]/w:pPr/w:sectPr/w:cols/w:col[1]","space","720");
+    assertXPath(pXmlDoc, "/w:document/w:body/w:p[8]/w:pPr/w:sectPr/w:cols/w:col[1]","space","720");
 }
 
 DECLARE_OOXMLEXPORT_TEST(testPageref, "testPageref.docx")
@@ -353,15 +353,13 @@ DECLARE_OOXMLEXPORT_TEST(testGenericTextField, "Unsupportedtextfields.docx")
     xmlXPathFreeObject(pXmlObj);
 }
 
-DECLARE_OOXMLEXPORT_TEST(test_FieldType, "99_Fields.docx")
+DECLARE_OOXMLEXPORT_EXPORTONLY_TEST(test_FieldType, "99_Fields.docx")
 {
     xmlDocPtr pXmlDoc = parseExport("word/document.xml");
-    if (!pXmlDoc)
-        return;
     // Checking for three field types (BIBLIOGRAPHY, BIDIOUTLINE, CITATION) in sequence
-    assertXPath(pXmlDoc, "/w:document[1]/w:body[1]/w:p[2]/w:r[2]/w:instrText[1]",1);
-    assertXPath(pXmlDoc, "/w:document[1]/w:body[1]/w:p[5]/w:r[2]/w:instrText[1]",1);
-    assertXPath(pXmlDoc, "/w:document[1]/w:body[1]/w:p/w:sdt/w:sdtContent/w:r[2]/w:instrText[1]",1);
+    assertXPath(pXmlDoc, "/w:document/w:body/w:p[2]/w:r[2]/w:instrText");
+    assertXPath(pXmlDoc, "/w:document/w:body/w:p[3]/w:r[2]/w:instrText");
+    assertXPath(pXmlDoc, "/w:document/w:body/w:p[4]/w:sdt/w:sdtContent/w:r[2]/w:instrText");
 }
 
 DECLARE_OOXMLEXPORT_TEST(testCitation,"FDO74775.docx")
@@ -411,7 +409,7 @@ DECLARE_OOXMLEXPORT_TEST(testFDO78654 , "fdo78654.docx")
         return;
     // In case of two "Hyperlink" tags in one paragraph and one of them
     // contains "PAGEREF" field then field end tag was missing from hyperlink.
-    assertXPath ( pXmlDoc, "/w:document/w:body/w:p[2]/w:hyperlink[3]/w:r[5]/w:fldChar", "fldCharType", "end" );
+    assertXPath ( pXmlDoc, "/w:document/w:body/w:sdt/w:sdtContent/w:p[2]/w:hyperlink[3]/w:r[5]/w:fldChar", "fldCharType", "end" );
 }
 
 
diff --git a/sw/qa/extras/rtfimport/rtfimport.cxx b/sw/qa/extras/rtfimport/rtfimport.cxx
index a88dddf2f086..14b822553d85 100644
--- a/sw/qa/extras/rtfimport/rtfimport.cxx
+++ b/sw/qa/extras/rtfimport/rtfimport.cxx
@@ -971,7 +971,7 @@ CPPUNIT_TEST_FIXTURE(Test, testFdo82071)
 {
     load(mpTestDocumentPath, "fdo82071.rtf");
     // The problem was that in TOC, chapter names were underlined, but they should not be.
-    uno::Reference<text::XTextRange> xRun = getRun(getParagraph(2), 1);
+    uno::Reference<text::XTextRange> xRun = getRun(getParagraph(1), 1);
     // Make sure we test the right text portion.
     CPPUNIT_ASSERT_EQUAL(OUString("Chapter 1"), xRun->getString());
     // This was awt::FontUnderline::SINGLE.
author	Mike Kaganski <mike.kaganski@collabora.com>	2019-12-13 09:36:39 +0300
committer	Andras Timar <andras.timar@collabora.com>	2020-01-09 08:19:01 +0100
commit	e76fcb8f169acac3d81bdaf9b126cc4c98e2eb8e (patch)
tree	2327d7636cd5efcf78cb04294fbd5fb9ce582459 /sw
parent	75265fad67c38827aec074e99e010109f9219079 (diff)