Discussion:
[gs-bugs] [Bug 698521] - MuPDF - Bad recognition of letters as #xfffdl in -Ftxt and -Fstext mode
b***@artifex.com
2017-09-12 21:51:09 UTC
http://bugs.ghostscript.com/show_bug.cgi?id=698521

Bug ID: 698521
Summary: Bad recognition of letters as #xfffdl in -Ftxt and
-Fstext mode
Product: MuPDF
Version: master
Hardware: PC
OS: Windows 8
Status: UNCONFIRMED
Severity: major
Priority: P4
Component: apps
Assignee: mupdf-***@artifex.com
Reporter: ***@onet.pl
QA Contact: gs-***@ghostscript.com
Word Size: ---

Created attachment 14239
--> http://bugs.ghostscript.com/attachment.cgi?id=14239&action=edit
bad recognition of characters without details about character. �

The document:
www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf

pages: 1013, 1014, 1015.

If mutool cannot recognize the characters in these fonts, would it be possible
for -Fstext mode to produce output like this:

<line bbox="67.6284 186.59923 238.7639 198.42279">
<span bbox="67.6284 186.59923 75.209697 198.42279" font="Symbol"
size="10.50045">
<char bbox="67.6284 186.59923 75.209697 198.42279" x="67.6284" y="195.28314"
c="&#xfffd;" alt="&#x....;"/>
</span>
<span bbox="103.63427 186.75675 117.274288 197.90827" font="Minion-Regular"
size="10.50045">
<char bbox="103.63427 186.75675 109.45149 197.90827" x="103.63427"
y="195.28314" c="E"/>

The alt attribute would appear only for characters that could not be converted.
Its value should match what other programs produce when the text is copied and
pasted into a text editor.
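
To see which characters are affected, the stext output can be scanned for
<char> elements whose c attribute is U+FFFD. The following is a minimal Python
sketch, not part of mutool. It assumes the stext output is well-formed XML
shaped like the fragment above; the closing </span> and </line> tags in the
sample are added here for illustration, the proposed alt attribute is left
out, and the function name find_unmapped_chars is hypothetical.

```python
import xml.etree.ElementTree as ET

REPLACEMENT = "\ufffd"  # Unicode replacement character


def find_unmapped_chars(stext_xml):
    """Return (font, bbox) for every <char> whose c attribute is U+FFFD."""
    root = ET.fromstring(stext_xml)
    hits = []
    for span in root.iter("span"):
        for char in span.iter("char"):
            if char.get("c") == REPLACEMENT:
                bbox = tuple(float(v) for v in char.get("bbox").split())
                hits.append((span.get("font"), bbox))
    return hits


# Sample based on the fragment in the report; closing tags are an assumption.
SAMPLE = """<line bbox="67.6284 186.59923 238.7639 198.42279">
<span bbox="67.6284 186.59923 75.209697 198.42279" font="Symbol" size="10.50045">
<char bbox="67.6284 186.59923 75.209697 198.42279" x="67.6284" y="195.28314" c="&#xfffd;"/>
</span>
<span bbox="103.63427 186.75675 117.274288 197.90827" font="Minion-Regular" size="10.50045">
<char bbox="103.63427 186.75675 109.45149 197.90827" x="103.63427" y="195.28314" c="E"/>
</span>
</line>"""

if __name__ == "__main__":
    for font, bbox in find_unmapped_chars(SAMPLE):
        print(font, bbox)
```

Only the Symbol character is reported; the "E" in Minion-Regular has a usable
value in c and is skipped.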
--
You are receiving this mail because:
You are the QA Contact for the bug.
b***@artifex.com
2017-09-13 10:45:42 UTC

Tor Andersson <***@artifex.com> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |***@artifex.com

--- Comment #1 from Tor Andersson <***@artifex.com> ---
If we had any useful information to put in the 'alt' attribute as you wish,
we'd already be putting it in the 'c'. The font in question has NO encoding
information for the glyphs you mention.

If you use "mutool draw -Ftrace" you can see the raw font information before
it is cooked into structured text. This for example is for the 'chi' character:

<span font="EKGGBK+Symbol" wmode="0" trm="10.5004 0 0 10.5005">
<g unicode="U+fffd" glyph="114" x="295.63325" y="93.71336" />
</span>

Font EKGGBK+Symbol is object number 9016 in the file, if you want to look for
yourself. Glyph number 114 has no information about what unicode character it
is supposed to represent in that font. The embedded font file has no glyph name
for glyph 114. The PDF font object's /Encoding has a /Differences array which
does not list anything for glyph 114, nor does the ToUnicode CMap stream define
a mapping for glyph 114.
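
The lookup described above can be sketched as a chain of fallbacks. This is an
illustration in Python only, not MuPDF's actual code: the function name, the
dictionary arguments, the order in which the sources are consulted, and the
sample data are all assumptions made for the example.

```python
REPLACEMENT = 0xFFFD  # Unicode replacement character

# Tiny stand-in for a glyph-name-to-Unicode list (hypothetical subset).
NAME_TO_UNICODE = {"alpha": 0x03B1, "chi": 0x03C7}


def resolve_unicode(glyph, to_unicode, differences, font_glyph_names):
    """Try each source of encoding information in turn for one glyph.

    to_unicode:       glyph -> code point, as a ToUnicode CMap would give
    differences:      glyph -> glyph name, from /Encoding /Differences
    font_glyph_names: glyph -> glyph name, from the embedded font file
    """
    if glyph in to_unicode:
        return to_unicode[glyph]
    name = differences.get(glyph) or font_glyph_names.get(glyph)
    if name in NAME_TO_UNICODE:
        return NAME_TO_UNICODE[name]
    # Nothing is known about this glyph, so report that honestly.
    return REPLACEMENT


if __name__ == "__main__":
    # Glyph 114 with all three sources empty, as in the font discussed here.
    print(hex(resolve_unicode(114, {}, {}, {})))
    # A hypothetical glyph 99 that does have a /Differences entry.
    print(hex(resolve_unicode(99, {}, {99: "chi"}, {})))
```

When every source comes back empty, the only honest result is the replacement
character, which is the situation for glyph 114 in this font.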

As you can see, we have NO reliable information as to what the character is
supposed to be.

It's (IMAO) more useful to tell the user that the text they've tried to copy is
garbage by using the unicode replacement character (U+FFFD) than handing them
an essentially random character.
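
For anyone who wants to list the affected glyphs per font, the trace output can
be post-processed. The sketch below is Python and sits outside mutool. It
assumes the trace output is well-formed XML with <span> and <g> elements like
the snippet above; the exact format may differ between MuPDF versions, the
<page> wrapper in the sample is hypothetical, and the function name is made up
for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def unmapped_glyphs_by_font(trace_xml):
    """Map each font name to the sorted glyph ids traced as U+FFFD."""
    root = ET.fromstring(trace_xml)
    result = defaultdict(set)
    for span in root.iter("span"):
        for g in span.iter("g"):
            if g.get("unicode", "").lower() == "u+fffd":
                result[span.get("font")].add(int(g.get("glyph")))
    return {font: sorted(ids) for font, ids in result.items()}


# Sample taken from the snippet above; the <page> wrapper is an assumption.
SAMPLE = """<page>
<span font="EKGGBK+Symbol" wmode="0" trm="10.5004 0 0 10.5005">
<g unicode="U+fffd" glyph="114" x="295.63325" y="93.71336" />
</span>
</page>"""

if __name__ == "__main__":
    print(unmapped_glyphs_by_font(SAMPLE))
```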
--
b***@artifex.com
2017-09-20 08:37:44 UTC

--- Comment #2 from Diana <***@onet.pl> ---
BTW, in many documents the bbox attribute in stext mode is set incorrectly for
these glyphs (fffd). The "trace" output has no bbox attribute for characters,
so adding bbox to the trace output, or adding a "glyph"/"g"/"alt" attribute to
<char> in stext mode, would be helpful (when c="&#xfffd;").
--
b***@artifex.com
2017-09-20 11:11:49 UTC

--- Comment #3 from Tor Andersson <***@artifex.com> ---
Do you have an example PDF where the bbox for U+FFFD characters is wrong? In
the pdfref17.pdf on the pages you mention the bbox looks correct to me.

The "trace" mode prints the raw drawing operations, which only have a glyph
origin coordinate, not a bbox. The stext device calculates an appropriate bbox
during its text analysis step.
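
As a rough illustration of how a bbox can be derived from a glyph origin, here
is a Python sketch. It is not MuPDF's algorithm. It assumes unrotated
horizontal text in a y-down coordinate system, and the ascender, descender and
advance values are illustrative numbers back-computed from the <char> sample
in the original report, not read from the font.

```python
def estimate_bbox(x, y, size, advance, ascender, descender):
    """Estimate (x0, y0, x1, y1) from a glyph origin on the baseline.

    advance, ascender and descender are in em units (descender is negative).
    Assumes y grows downwards and the text is not rotated or skewed.
    """
    x0 = x
    x1 = x + advance * size
    y0 = y - ascender * size   # top edge, above the baseline
    y1 = y - descender * size  # bottom edge, below the baseline
    return (x0, y0, x1, y1)


if __name__ == "__main__":
    # Origin and size from the Symbol <char> in the report; the metrics
    # (0.722, 0.827, -0.299) are illustrative, back-computed values.
    print(estimate_bbox(67.6284, 195.28314, 10.50045, 0.722, 0.827, -0.299))
```

With these assumed metrics the result is close to the bbox
"67.6284 186.59923 75.209697 198.42279" shown for that character in the report.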
--
b***@artifex.com
2017-11-08 15:38:06 UTC

Robin Watts <***@artifex.com> changed:

What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |RESOLVED
Resolution|--- |INVALID
CC| |***@artifex.com

--- Comment #4 from Robin Watts <***@artifex.com> ---
Closing due to lack of response from original reporter to the question in
comment #3. If you can answer the question, please reopen this bug with the
answer.
--