path: root/poppler/TextOutputDev.cc
Age  Commit message  Author  Files  Lines
12 days  Use std::span for some data, size combos  Sune Vuorela  1  -2/+2
With C++20, we have std::span, which is a nice wrapper around a pointer and a length. Use that rather than carrying the two around separately. A std::span is also created transparently from vectors and similar containers.
2024-04-20  Update (C)  Albert Astals Cid  1  -1/+1
2024-04-20  cpp: Fix crash extracting text and font in some files  Albert Astals Cid  1  -1/+1
Issue reported and patch suggested by Samad Koita and Aviral Agarwal. Fixes issue #1477.
2024-03-31  Update (C)  Albert Astals Cid  1  -1/+2
2024-03-30  Reduce worst case algorithmic complexity of TextBlock::coalesce  Stefan Brüns  1  -49/+60
The old algorithm restarts the inner loop for the RHS word from the beginning on each match, i.e. the worst case complexity approaches O(N^3), while O(N^2) is sufficient for a pairwise comparison of all words. Fortunately, even O(N^2) is hardly ever reached, as the inner N is limited by a) the maxBaseIdx and b) removing duplicates from the set. For some pathological cases this changes the runtime from minutes to seconds. See poppler#1173.
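The difference between restarting and continuing the inner loop can be illustrated with a toy example. This is only a sketch under stated assumptions: it removes duplicate integers from a vector and counts comparisons, it is not the TextBlock::coalesce code, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Variant A: after every match the inner scan restarts at i + 1, so elements
// that were already compared against v[i] are compared again.
std::size_t dedupeRestart(std::vector<int> &v)
{
    std::size_t comparisons = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        std::size_t j = i + 1;
        while (j < v.size()) {
            ++comparisons;
            if (v[j] == v[i]) {
                v.erase(v.begin() + static_cast<std::ptrdiff_t>(j));
                j = i + 1; // restart the inner loop
            } else {
                ++j;
            }
        }
    }
    return comparisons;
}

// Variant B: after a match the scan continues at the same index, so every
// pair is compared at most once.
std::size_t dedupeContinue(std::vector<int> &v)
{
    std::size_t comparisons = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        std::size_t j = i + 1;
        while (j < v.size()) {
            ++comparisons;
            if (v[j] == v[i]) {
                v.erase(v.begin() + static_cast<std::ptrdiff_t>(j));
                // j now already points at the next unchecked element
            } else {
                ++j;
            }
        }
    }
    return comparisons;
}

// Both variants must produce the same result; only the work differs.
bool continueIsNoWorse(std::vector<int> input)
{
    std::vector<int> a = input;
    std::vector<int> b = input;
    const std::size_t restartCount = dedupeRestart(a);
    const std::size_t continueCount = dedupeContinue(b);
    assert(a == b);
    return continueCount <= restartCount;
}
```

The sketch counts comparisons only; the cost of the vector erase itself is ignored here.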
2024-03-30  Reduce TextWord space and allocation overhead  Stefan Brüns  1  -204/+174
Currently, the word characters are allocated as a struct of arrays, e.g. text and charcode are allocated separately. This causes some space overhead (6 pointers, 6 malloc chunk management words (size_t/flags), alignment, ...) and runtime overhead (6 allocs/frees per word). Changing this to an array of structs reduces this overhead. It also allows allocations to be more conservative, as resizing is less costly, i.e. starting with a single-character allocation instead of 16. It is also more efficient, as most accesses affect multiple or all attributes, i.e. values in the same or neighboring CPU cache lines. Using a std::vector instead of separate raw arrays also reduces code and manual data management. The "charPos end index" and trailing "edge" attributes are no longer stored as an additional entry in the array, but as dedicated data members, `charPosEnd` and `edgeEnd`. The memory saving is most notable for short words, but even for words with 16 characters there are small savings, and still fewer allocations (1 + 4 allocations instead of 6). Growing is fairly cheap, as the CharInfo struct is trivially copyable. See poppler#1173.
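A rough sketch of the two layouts described above follows. Only the names CharInfo, charPosEnd and edgeEnd come from the commit message; the member lists, types and the addChar helper are assumptions made for illustration and do not mirror the real TextWord class.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Struct of arrays: every attribute lives in its own allocation.
struct WordSoA
{
    std::vector<unsigned> text;
    std::vector<int> charPos;
    std::vector<double> edge;
};

// Array of structs: all attributes of one character sit next to each other.
struct CharInfo
{
    unsigned text; // assumed field
    int charPos;   // assumed field
    double edge;   // assumed field
};
static_assert(std::is_trivially_copyable_v<CharInfo>, "growing the vector stays cheap");

struct WordAoS
{
    std::vector<CharInfo> chars; // a single allocation for all per-character data
    int charPosEnd = 0;          // end index, kept outside the array
    double edgeEnd = 0.0;        // trailing edge, kept outside the array

    // Hypothetical helper: append one character and update the end markers.
    void addChar(unsigned u, int pos, int len, double leftEdge, double rightEdge)
    {
        assert(len > 0);
        chars.push_back({ u, pos, leftEdge });
        charPosEnd = pos + len;
        edgeEnd = rightEdge;
    }

    std::size_t length() const { return chars.size(); }
};
```

With this layout a one-character word needs one small allocation, whereas the struct-of-arrays form needs one allocation per attribute.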
2024-03-30  Fix text search across lines between paragraphs  Nelson Benítez León  1  -24/+36
This commit fixes the "across lines" text search feature of TextPage::findText() when the match happens from the last line of a paragraph to the first line of the next paragraph. Includes tests for this bug. Fixes #1475. Fixes https://gitlab.gnome.org/GNOME/evince/-/issues/2001
2024-03-30  Fix regression on issue #157  Nelson Benítez León  1  -12/+14
Redo the fix for issue #157, which is about doing transparent selection for glyphless documents (e.g. tesseract scanned documents), because it stopped working after commit 29f32a47.
2024-02-01  Update (C)  Albert Astals Cid  1  -0/+1
2024-02-01  More unicode vectors; fewer raw pointers  Sune Vuorela  1  -6/+2
2024-01-23  Update (C)  Albert Astals Cid  1  -0/+1
2024-01-18  TextPage::takeText: reset actualText for the new page  Adam Sampson  1  -0/+2
actualText has an internal pointer to the TextPage it's writing to, so if you called takeText and then continued to output more pages to the TextOutputDev, their text would be written to the page you'd taken rather than the new one.
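The bug described here can be sketched with simplified stand-in classes. All class and member names below are hypothetical, and ownership is modelled with std::unique_ptr purely for brevity; this is not how the poppler classes are written.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Stand-in for a page that collects text.
struct Page
{
    std::string text;
};

// Stand-in for a helper that keeps a raw pointer to the page it writes to.
struct Writer
{
    explicit Writer(Page *p) : target(p) { }
    void write(const std::string &s)
    {
        assert(target != nullptr);
        target->text += s;
    }
    Page *target;
};

class OutputDev
{
public:
    void addText(const std::string &s) { writer.write(s); }

    // Hand the current page to the caller and start a fresh one.
    std::unique_ptr<Page> takePage()
    {
        std::unique_ptr<Page> taken = std::move(page);
        page = std::make_unique<Page>();
        // Without this reset, the writer would keep pointing at the page that
        // was just handed out, and later text would land there.
        writer = Writer(page.get());
        return taken;
    }

    const std::string &currentText() const { return page->text; }

private:
    std::unique_ptr<Page> page = std::make_unique<Page>(); // declared before writer
    Writer writer { page.get() };
};
```

After takePage(), text added to the device goes to the new page, and the page that was taken keeps only what it had at the time of the call.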
2022-08-19  We can use isnan now  Albert Astals Cid  1  -2/+1
2022-05-13  TextPage::coalesce: Fix crash on broken files  Albert Astals Cid  1  -2/+3
oss-fuzz/47350
2022-04-30  Update (C)  Albert Astals Cid  1  -1/+1
2022-04-26  fix multiline find_text() bug in two column docs  Nelson Benítez León  1  -0/+6
Fix for a bug in double column documents where some single line matches are wrongly returned as being multiline matches. Includes test case for the bug.
2022-04-26  fix bug in multiline find_text()  Nelson Benítez León  1  -1/+2
which caused some false positives to be returned. Includes test case for the bug. See original comment about this bug: https://gitlab.gnome.org/GNOME/evince/-/merge_requests/159#note_1431380
2022-03-30  Change GfxFont name into an optional std::string  Albert Astals Cid  1  -1/+1
2022-03-11  Add readability-braces-around-statements  Albert Astals Cid  1  -101/+194
2022-03-10  Update (C) of previous commit  Albert Astals Cid  1  -1/+1
2022-03-09  Replace hand-coded reference counting in GfxFont by std::shared_ptr  Oliver Sander  1  -9/+2
2021-12-07  TextOutputDev: require more spacing between columns  Nelson Benítez León  1  -3/+11
Require more spacing for adjacent text to be considered a separate column of text. We do that by increasing the 'minColSpacing1' parameter, which marks the distance within which an adjacent word will be pulled into the current block. We provide a way to tweak the default value: double getMinColSpacing1(); void setMinColSpacing1(double val); Fixes issue #1093.
2021-11-01  TextOutputDev improvements  Albert Astals Cid  1  -65/+22
Vectors don't need to be a pointer and they can contain unique_ptr too. Make pools be an array of unique_ptr too. Makes for easier memory management.
2021-10-30  Make makeWordList return a unique_ptr  Albert Astals Cid  1  -3/+3
2021-10-29  Port a few functions from GooString to std::string  Albert Astals Cid  1  -1/+1
2021-10-11  Update (C)  Albert Astals Cid  1  -1/+1
2021-10-11  TextOutputDev: Respect orientation when selecting words  Marek Kasik  1  -33/+142
Take rotation of text lines into account when visiting selection. This works for text rotated by multiples of 90 degrees. Issue #499
2021-08-29  Update (C)  Albert Astals Cid  1  -0/+1
2021-08-27  Fix up setmode calls  Peter Williams  1  -4/+4
To compile and work correctly on both Cygwin and MSVC, we should always call the function `_setmode` and check for either `_WIN32` or `__CYGWIN__` being defined. This fixes the MSVC build and corrects some behavior handling output to stdout on Cygwin.
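The guard described in this message might look roughly like the sketch below. The helper name is hypothetical, and the exact headers and the use of fileno and O_BINARY are assumptions about a typical Windows or Cygwin toolchain rather than a copy of the poppler code.

```cpp
#include <cstdio>

#if defined(_WIN32) || defined(__CYGWIN__)
#    include <fcntl.h> // O_BINARY
#    include <io.h> // _setmode
#endif

// Hypothetical helper: switch stdout to binary mode where that concept exists.
// Returns true if the mode was set or if nothing had to be done.
bool setStdoutBinary()
{
#if defined(_WIN32) || defined(__CYGWIN__)
    // Checking both macros covers MSVC/MinGW as well as Cygwin builds.
    return _setmode(fileno(stdout), O_BINARY) != -1;
#else
    // On other platforms there is no text/binary distinction for stdout.
    return true;
#endif
}
```

On platforms where neither macro is defined the function compiles to a no-op, so callers do not need their own preprocessor checks.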
2021-08-27  CI: Enable google-explicit-constructor  Albert Astals Cid  1  -2/+2
I was doing some refactoring before and was hit by one of the constructors being magically called when I didn't want that. Since we don't really rely on it (it was just used in some of the explicit type conversions), I think it makes sense to enable it. Also includes 2 small qt6 clang-tidy fixes, because we don't have qt6 on the clang-tidy CI yet. There are 2 potentially source-incompatible changes in the qt frontend, but I really, really hope no one was using the constructors that way.
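The kind of surprise this check guards against can be shown with two toy types. The types and functions below are hypothetical and not taken from poppler.

```cpp
#include <cassert>
#include <type_traits>

// A single-argument constructor without 'explicit' doubles as an implicit
// conversion from int.
struct ImplicitId
{
    ImplicitId(int n) : num(n) { }
    int num;
};

// Marking the constructor 'explicit' forces callers to spell out the conversion.
struct ExplicitId
{
    explicit ExplicitId(int n) : num(n) { }
    int num;
};

int numOfImplicit(ImplicitId id)
{
    return id.num;
}

int numOfExplicit(ExplicitId id)
{
    return id.num;
}

// numOfImplicit(7) compiles because the int is silently converted.
static_assert(std::is_convertible_v<int, ImplicitId>);
// numOfExplicit(7) would not compile; numOfExplicit(ExplicitId(7)) is required.
static_assert(!std::is_convertible_v<int, ExplicitId>);
static_assert(std::is_constructible_v<ExplicitId, int>);
```

The static assertions document the difference at compile time: both types can be constructed from an int, but only the first one is constructed without the caller asking for it.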
2021-04-25  find, glib: Enhance find to support multi-line matching  Nelson Benítez León  1  -33/+149
On the backend side, adds 3 new parameters to TextPage::findText(): one bool to enable the feature, one out PDFRectangle to store the part of the match that falls on the next line, and one out bool to inform whether a hyphen was present and ignored at the end of the previous match part.
For the glib binding, this extends the public PopplerRectangle struct by new members to hold additional information about whether the rectangle belongs to a group of rectangles for the same match, and whether a hyphen was ignored at the end of the line. Since PopplerRectangle is public ABI, this is done by making the public PopplerRectangle API return the enlarged struct, and internally casting to the new struct when required; the new members are accessible only via accessor functions.
For the Qt5 and Qt6 bindings, this commit only implements the new flag Poppler::Page::AcrossLines (but no new function and no new return data type). If this flag is passed, the returned list of rectangles will also include rectangles for the second part of across-line matches. This minimal Qt binding still allows for the creation of tests for this feature (using the Qt test framework), which this commit *does include*. A more complete binding (with a new return type that includes 'matchContinued' and 'ignoredHypen' boolean fields) is left to the qt backend maintainers if they want to use this feature in e.g. Okular.
So, as mentioned, this commit incorporates tests for the implemented across-line matching feature, and the tests also check two aspects of this feature:
- A hyphen character is ignored while matching when 1) it is the last character of the line and 2) its corresponding character in the search term is not a hyphen too.
- Any whitespace character in the search term is allowed to match at the logical position where the lines split (i.e. what would normally be the newline character in a text file, but PDF text does not include newline characters between lines).
Regarding the enhancement to findText() which implements matching across lines, just two more notes:
- It won't match text spanning more than two lines, i.e. it only matches text spanning from the end of one line to the start of the next line.
- It does not support searching backwards: if findText() receives both the <backward> and <matchAcrossLines> parameters as true, it will ignore the <matchAcrossLines> parameter. Implementing <matchAcrossLines> in the backwards direction is possible, but it would make an already complex function like findText() even more complex, for little gain, as e.g. Evince does not even use the <backward> parameter of findText().
Fixes poppler issues #744 and #755. Related Evince issue: https://gitlab.gnome.org/GNOME/evince/issues/333
2021-04-07  TextOutputDev: Fix crash in malformed file  Albert Astals Cid  1  -1/+1
oss-fuzz/32952
2021-03-12  TextSelectionDumper: fix word order for RTL text  Nelson Benítez León  1  -2/+6
This is used by glib backend (Evince). Fixes issue #53
2021-02-26  Make TextSelectionSizer a bit easier to understand standalone  Albert Astals Cid  1  -4/+9
Nothing really changes because it's only used in one place and that place called getRegion so there's no leak but looking at the class standalone one could think that one would get a leak if getRegion was not called.
2021-02-14  Update (C)  Albert Astals Cid  1  -1/+1
2021-02-14  TextSelectionDumper: Fix getText() for space after word  Nelson Benítez León  1  -1/+1
Fix TextSelectionDumper::getText() (which is currently only used by the glib frontend) so that it does not add a space after a word by default when the word is explicitly marked as not carrying that space by means of the 'spaceAfter' TextWord field. Fixes issue #1042.
2020-11-28  Fix crash when searching things of length 0  Albert Astals Cid  1  -0/+4
2020-10-29  clang: Warn about weak-vtables  Albert Astals Cid  1  -1/+3
2020-08-27  Update (C)  Albert Astals Cid  1  -1/+1
2020-08-26  TextSelectionPainter: support glyphless fonts  Nelson Benítez León  1  -5/+30
in text selections, by:
- Not drawing characters that use such a font.
- Painting the selection's background as transparent.
Fixes issue #157. Based on initial work by Nelson Benitez and changed to be not tesseract specific by Julian Andres Klode.
2020-07-03  Run clang-format  Albert Astals Cid  1  -4890/+4596
find . \( -name "*.cpp" -or -name "*.h" -or -name "*.c" -or -name "*.cc" \) -exec clang-format -i {} \; If you reached this file doing a git blame, please see README.contributors (instructions added 2 commits in the future to this one)
2020-05-19  Update (C)  Albert Astals Cid  1  -1/+1
2020-05-19  [TextOutputDev] simplify TextFontInfo::matches(const Ref *ref)  Albert Astals Cid  1  -1/+1
2020-05-19  [cpp] Add the font infos to the text_box object.  suzuki toshiya  1  -0/+4
2020-01-07  Mark some static arrays as const  Albert Astals Cid  1  -1/+1
2020-01-05  Update last commit (C)  Albert Astals Cid  1  -1/+1
2020-01-05  Remove UnicodeMap reference counting  Albert Astals Cid  1  -16/+5
And make the cache just be "infinite"; it's not like we support that many maps or that so many are used in a given session, and if they are, well, it's good we cached them. All the unicode maps we support use about 2MB of memory, but PSOutputDev is the only one that loads "random" unicodeMaps, so to load them all you'd have to print lots of different documents with fonts with lots of different font encodings. That seems like a not very likely situation, and the code gets simplified a bit.
2019-12-05  Move textEOL and textPageBreaks out of GlobalParams to TextOutputDev  Albert Astals Cid  1  -8/+10
2019-12-03  Enable modernize-loop-convert  Albert Astals Cid  1  -11/+5
2019-12-02  enable modernize-redundant-void-arg  Albert Astals Cid  1  -2/+2
No copyright, it's a mechanical change