Age | Commit message (Collapse) | Author | Files | Lines |
|
With c++20, we have std::span which is a nice wrapper around a pointer
and a length. Use that rather than carry them around by themselves.
We also get std::span created transparently from vectors and similar
containers.
|
|
|
|
Issue reported and patch suggestion by Samad Koita and Aviral Agarwal
Fixes issue #1477
|
|
|
|
The old algorithm restarts the inner loop for the RHS word from the
beginning on each match, i.e. the worst case complexity approaches
O(N^3), while O(N^2) is obviously sufficient for a pairwise compare of
all words. Fortunately, O(N^2) is hardly ever happening, as the inner N
is limited by a) the maxBaseIdx, b) removing duplicates from the set.
For some pathological cases this changes the runtime from minutes to
seconds.
See poppler#1173.
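As a rough illustration only (this is not the poppler code, and `isDuplicate` is a hypothetical stand-in for the real comparison), a pairwise duplicate scan that keeps going from the next candidate after a match, rather than restarting the inner loop, stays within O(N^2) comparisons:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the real "is this word a duplicate" test.
bool isDuplicate(const std::string &a, const std::string &b)
{
    return a == b;
}

// Removes later duplicates of each word and returns how many were removed.
std::size_t removeDuplicates(std::vector<std::string> &words)
{
    std::vector<bool> dup(words.size(), false);
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (dup[i]) {
            continue;
        }
        // On a match, mark the word and continue with j + 1; no restart.
        for (std::size_t j = i + 1; j < words.size(); ++j) {
            if (!dup[j] && isDuplicate(words[i], words[j])) {
                dup[j] = true;
            }
        }
    }
    // Compact the survivors in a single pass.
    std::size_t out = 0;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (!dup[i]) {
            if (out != i) {
                words[out] = std::move(words[i]);
            }
            ++out;
        }
    }
    const std::size_t removed = words.size() - out;
    words.resize(out);
    return removed;
}
```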
|
|
Currently, the word characters are allocated as a struct of arrays,
e.g. text and charcode are allocated separately.
This causes some space (6 pointers, 6 malloc chunk management
words (size_t/flags), alignment, ...) and runtime overhead (6 allocs/
frees per word).
Changing this to an array of structs reduces this overhead. It also allows
allocations to be more conservative, as resizing is less costly, e.g.
starting with a single-character allocation instead of 16. It is also more
efficient, as most accesses affect multiple or all attributes, i.e.
values in the same or neighboring CPU cache lines.
Using a std::vector instead of separate raw arrays also reduces code
and manual data management.
The "charPos end index" and trailing "edge" attributes are no
longer stored as an additional entry in the array, but as dedicated
data members, `charPosEnd` and `edgeEnd`.
The memory saving is most notable for short words, but even for words
with 16 characters there are small savings, and still fewer allocations
(1 + 4 allocations instead of 6). Growing is fairly cheap, as the CharInfo
struct is trivially copyable.
See poppler#1173.
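A sketch of the array-of-structs layout described above. `CharInfo`, `charPosEnd` and `edgeEnd` are named in the message; the field types, the `Word` class and `addChar` are assumptions made for illustration.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// One entry per character; all attributes sit next to each other in memory.
// Field types are assumptions, not taken from the source.
struct CharInfo
{
    unsigned int text; // Unicode value
    unsigned int charcode; // character code in the font
    int charPos; // start position of the character
    double edge; // leading edge coordinate
};

// Trivially copyable, so growing the vector is a cheap block copy.
static_assert(std::is_trivially_copyable_v<CharInfo>);

// Hypothetical word type: one vector instead of several raw arrays.
class Word
{
public:
    void addChar(unsigned int text, unsigned int charcode, int charPos, int charLen, double edgeStart, double edgeStop)
    {
        chars.push_back(CharInfo{text, charcode, charPos, edgeStart});
        // The trailing values live in dedicated members, not in an extra entry.
        charPosEnd = charPos + charLen;
        edgeEnd = edgeStop;
    }

    std::vector<CharInfo> chars;
    int charPosEnd = 0;
    double edgeEnd = 0.0;
};
```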
|
|
This commit fixes the "across lines" text
search feature of TextPage::findText() when
the match happens from the last line of a
paragraph to the first line of next paragraph.
Includes tests for this bug.
Fixes #1475
Fixes https://gitlab.gnome.org/GNOME/evince/-/issues/2001
|
|
Redo the fix for issue #157, which is about doing
transparent selection for glyphless documents (e.g.
tesseract scanned documents), because it stopped
working after commit 29f32a47
|
|
|
|
|
|
|
|
actualText has an internal pointer to the TextPage it's writing to, so
if you called takeText and then continued to output more pages to the
TextOutputDev, their text would be written to the page you'd taken
rather than the new one.
|
|
|
|
oss-fuzz/47350
|
|
|
|
Fix for a bug in double column documents where some
single line matches are wrongly returned as being
multiline matches.
Includes test case for the bug.
|
|
which caused some false positives being returned.
Includes test case for the bug.
See original comment about this bug:
https://gitlab.gnome.org/GNOME/evince/-/merge_requests/159#note_1431380
|
|
|
|
|
|
|
|
|
|
Require more spacing for adjacent text to be
considered a separate column of text.
We do that by increasing the 'minColSpacing1' parameter,
which sets the distance within which an adjacent
word will be pulled into the current block.
We provide a way to tweak the default value:
double getMinColSpacing1();
void setMinColSpacing1(double val);
Fixes issue #1093
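A hedged sketch of how such a tunable threshold could be used. Only the two accessor names come from the message; the `ColumnSettings` class, the default value, and the assumption that the threshold scales with the font size are illustrative.

```cpp
// Hypothetical holder for the setting; not the real poppler class.
class ColumnSettings
{
public:
    double getMinColSpacing1() const { return minColSpacing1; }
    void setMinColSpacing1(double val) { minColSpacing1 = val; }

    // Assumption: a word is pulled into the current block when the
    // horizontal gap is below minColSpacing1 times the font size.
    bool pullsIntoBlock(double gap, double fontSize) const
    {
        return gap < minColSpacing1 * fontSize;
    }

private:
    double minColSpacing1 = 1.0; // illustrative default, not from the source
};
```

Raising the value makes more adjacent words join the current block, so more spacing is needed before text counts as a separate column.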
|
|
Vectors don't need to be a pointer
and they can contain unique_ptr too
Make pools be an array of unique_ptr too
Makes for easier memory management
|
|
|
|
|
|
|
|
Take rotation of text lines into account when visiting
selection. This works for text rotated by multiples of
90 degrees.
Issue #499
|
|
|
|
To compile and work correctly on both Cygwin and MSVC, we should always
call the function `_setmode` and check for either `_WIN32` or
`__CYGWIN__` being defined. This fixes the MSVC build and corrects some
behavior handling output to stdout on Cygwin.
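The guard described above might look roughly like this. The helper name `setStdoutBinary` is hypothetical, and the sketch assumes that `<io.h>` provides `_setmode` and `<fcntl.h>` provides `O_BINARY` on both toolchains.

```cpp
#include <cstdio>

#if defined(_WIN32) || defined(__CYGWIN__)
#    include <fcntl.h> // for O_BINARY
#    include <io.h> // for _setmode
#endif

// Switches stdout to binary mode where the platform distinguishes text and
// binary streams. Returns true if the mode was changed, false otherwise.
bool setStdoutBinary()
{
#if defined(_WIN32) || defined(__CYGWIN__)
    // Keeps the runtime from rewriting end-of-line characters.
    return _setmode(fileno(stdout), O_BINARY) != -1;
#else
    return false;
#endif
}
```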
|
|
I was doing some refactoring before and was hit by one of the
constructors being magically called when I didn't want that.
Since we don't rely on it (it was just used in some of the explicit type
conversions) I think it makes sense to enable
And 2 small qt6 clang-tidy fixes because we don't have qt6 on
the clang-tidy CI yet
There are 2 potentially source-incompatible changes in the qt frontend,
but I really, really hope no one was using the constructors that way
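To illustrate the kind of surprise described above, here is a sketch with hypothetical types: a single-argument constructor that is not marked `explicit` gets called silently for an implicit conversion, while the `explicit` one has to be spelled out.

```cpp
#include <type_traits>

struct ImplicitSquare
{
    ImplicitSquare(int size) : w(size), h(size) { } // allows int -> ImplicitSquare
    int w, h;
};

struct ExplicitSquare
{
    explicit ExplicitSquare(int size) : w(size), h(size) { }
    int w, h;
};

int areaImplicit(const ImplicitSquare &s)
{
    return s.w * s.h;
}

int areaExplicit(const ExplicitSquare &s)
{
    return s.w * s.h;
}

// areaImplicit(3) compiles and quietly builds a temporary ImplicitSquare.
// areaExplicit(3) does not compile; the caller must write ExplicitSquare(3).
static_assert(std::is_convertible_v<int, ImplicitSquare>);
static_assert(!std::is_convertible_v<int, ExplicitSquare>);
```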
|
|
On the backend side, adds 3 new parameters to TextPage::findText(),
one bool to enable the feature, one out PDFRectangle to store
the part of the match that falls on the next line, and one out
bool to inform whether hyphen was present and ignored at end of
the previous match part.
For the glib binding, this extends the public PopplerRectangle
struct by new members to hold additional information about
whether the rectangle belongs to a group of rectangles for the
same match, and whether a hyphen was ignored at the end of the
line. Since PopplerRectangle is public ABI, this is done by making
the public PopplerRectangle API return the enlarged struct, and
internally casting to the new struct when required, the new
members are accessible only via accessor functions.
For the Qt5 and Qt6 bindings, this commit only implements the new flag
Poppler::Page::AcrossLines (but no new function and no new
return data type) and if this flag is passed, the returned
list of rectangles will also include rectangles for the
second part of across-line matches.
This minimal Qt binding still allows for the creation of
tests for this feature (using the Qt test framework), which
this commit *does include*. But a more complete binding (with
a new return type that includes 'matchContinued' and 'ignoredHypen'
boolean fields) is left for the qt backend maintainers to do
if they want to use this feature in e.g. Okular.
So, as mentioned, this commit incorporates tests for the
implemented across-line matching feature, and the tests do
also check for two included aspects of this feature, which are:
- Ignoring the hyphen character while matching when 1) it's the
last character of the line and 2) its corresponding matching
character in the search term is not a hyphen too.
- Any whitespace characters in the search term will be allowed
to match on the logical position where the lines split (i.e. what
would normally be the newline character in a text file, but
PDF text does not include newline characters between lines).
Regarding the enhancement to findText() function which implements
matching across lines, just two more notes:
- It won't match on text spanning more than two lines, i.e. it
only matches text spanning from end of one line to start of
next line.
- It does not support finding backwards. If findText() receives
both <backward> and <matchAcrossLines> parameters as true, it
will ignore the <matchAcrossLines> parameter. Implementing
<matchAcrossLines> with backwards direction is possible, but
it would make an already complex function like findText()
even more complex, for little gain, as e.g. Evince does not even
use the <backward> parameter of findText().
Fixes poppler issues #744 and #755
Related Evince issue https://gitlab.gnome.org/GNOME/evince/issues/333
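The matching rules above can be sketched on plain strings. This is a simplified illustration, not the findText() implementation: `findAcrossLines` and `AcrossLineMatch` are hypothetical names, the two lines are passed in directly, and only a forward match spanning exactly two lines is considered.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

struct AcrossLineMatch
{
    bool found = false;
    bool ignoredHyphen = false;
};

// True if `rest` (after one optional leading whitespace character, which is
// allowed to match the line break) is a non-empty prefix of `nextLine`.
static bool continuesOnNextLine(std::string_view rest, std::string_view nextLine)
{
    if (!rest.empty() && std::isspace(static_cast<unsigned char>(rest.front()))) {
        rest.remove_prefix(1);
    }
    return !rest.empty() && nextLine.substr(0, rest.size()) == rest;
}

AcrossLineMatch findAcrossLines(std::string_view line, std::string_view nextLine, std::string_view term)
{
    for (std::size_t start = 0; start < line.size(); ++start) {
        const std::size_t n = line.size() - start; // characters left on this line

        // Plain split: the first n characters of the term end the line.
        if (term.size() > n && term.substr(0, n) == line.substr(start) && continuesOnNextLine(term.substr(n), nextLine)) {
            return { true, false };
        }

        // Hyphen split: the line ends in '-' and the term has no hyphen there.
        if (n >= 2 && line.back() == '-') {
            const std::size_t m = n - 1; // matched characters before the hyphen
            if (term.size() > m && term.substr(0, m) == line.substr(start, m) && term[m] != '-'
                && continuesOnNextLine(term.substr(m), nextLine)) {
                return { true, true };
            }
        }
    }
    return {};
}
```

Whether whitespace may also match the break after an ignored hyphen is a choice made in this sketch; the message does not say.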
|
|
oss-fuzz/32952
|
|
This is used by glib backend (Evince).
Fixes issue #53
|
|
Nothing really changes because it's only used in one place and that
place called getRegion so there's no leak but looking at the class
standalone one could think that one would get a leak if getRegion was
not called.
|
|
|
|
Fix TextSelectionDumper::getText() (which is
currently only used by the glib frontend) so
that it no longer adds a space after a word by
default when the word is explicitly set to not
carry that space by means of the 'spaceAfter'
TextWord field.
Fixes issue #1042
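A simplified sketch of the behaviour described above. The `Word` struct and `joinWords` are hypothetical; only the `spaceAfter` field name comes from the message.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Word
{
    std::string text;
    bool spaceAfter; // whether this word carries a trailing space
};

// Concatenates words, inserting a space only where the word asks for one.
std::string joinWords(const std::vector<Word> &words)
{
    std::string out;
    for (std::size_t i = 0; i < words.size(); ++i) {
        out += words[i].text;
        if (i + 1 < words.size() && words[i].spaceAfter) {
            out += ' ';
        }
    }
    return out;
}
```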
|
|
|
|
|
|
|
|
in text selections, by:
- Not drawing characters with it.
- Painting the selection's background as transparent.
Fixes issue #157
Based on initial work by Nelson Benitez and changed
to be not tesseract specific by Julian Andres Klode.
|
|
find . \( -name "*.cpp" -or -name "*.h" -or -name "*.c" -or -name "*.cc" \) -exec clang-format -i {} \;
If you reached this file doing a git blame, please see README.contributors (instructions added 2 commits in the future to this one)
|
|
|
|
|
|
|
|
|
|
|
|
And make the cache just be "infinite", it's not like we support
that many maps or that there's so many used in a given session,
and if they are, well it's good we cached them
All the unicode maps we support use about 2MB of memory, but PSOutputDev
is the only one that loads "random" unicodeMaps, so to load them all
you'd have to print lots of different documents with fonts with lots of
different font encodings. That seems like a not very likely situation,
and the code gets simplified a bit
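A sketch of what an unbounded cache can look like, using hypothetical `UnicodeMap` and `UnicodeMapCache` types: entries are created on first use and never evicted, so there is no size limit or replacement policy to maintain.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

struct UnicodeMap
{
    std::string encodingName;
};

class UnicodeMapCache
{
public:
    // Returns the cached map, loading it on first request. Nothing is evicted.
    const UnicodeMap *get(const std::string &name)
    {
        auto it = maps.find(name);
        if (it == maps.end()) {
            ++loads; // stands in for the real (expensive) load
            it = maps.emplace(name, std::make_unique<UnicodeMap>(UnicodeMap{name})).first;
        }
        return it->second.get();
    }

    std::size_t loadCount() const { return loads; }

private:
    std::unordered_map<std::string, std::unique_ptr<UnicodeMap>> maps;
    std::size_t loads = 0;
};
```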
|
|
|
|
|
|
No copyright, it's a mechanical change
|