Add support for more accents in mathtext #23189

Draft: oscargus wants to merge 2 commits into main

oscargus (Member) commented Jun 2, 2022

PR Summary

Add support for \check #7738 and the brief forms in https://en.wikibooks.org/wiki/LaTeX/Special_Characters (double acute is new, the others just use the standard single-letter names).

In addition, this replaces a character + combining accent with a single precomposed character when one is available, as mentioned in #4561 (comment). This means that e.g. \" i now works and is properly replaced with ï.

  • Check how this works with cmr10
  • Currently it does not check whether the combined single character exists in the font; no idea how to do that efficiently (maybe add a kwarg and/or rcParam so that this can be turned off?)
  • Add tests
  • Add release note

PR Checklist

Tests and Styling

  • Has pytest style unit tests (and pytest passes).
  • Is Flake 8 compliant (install flake8-docstrings and run flake8 --docstring-convention=all).

Documentation

  • New features are documented, with examples if plot related.
  • New features have an entry in doc/users/next_whats_new/ (follow instructions in README.rst there).
  • API changes documented in doc/api/next_api_changes/ (follow instructions in README.rst there).
  • Documentation is sphinx and numpydoc compliant (the docs should build without error).

@oscargus
Member Author

oscargus commented Jun 2, 2022

The remaining errors after removing the single letter cases above (keeping H) are:

Now: [image: mathtext_cm_26]

Earlier: [image: mathtext_cm_26-expected]

So this is a consequence of the actual characters being used.

Now: [image: mathtext_cm_77]

Earlier: [image: mathtext_cm_77-expected]

Adding \check causes the check mark to be used here. I do not really understand this test, nor the use of _accentprefixed.

Now: [image: mathtext_cm_00]

Earlier: [image: mathtext_cm_00-expected]

So this is a consequence of the combined character not being in the font.

For the first, and I assume the second, case, the right thing would be to update the images.

For the final case, there should be a check for whether the glyph exists in the font being used.

@oscargus oscargus force-pushed the moreaccentsabove branch 2 times, most recently from 897487e to df3add6 Compare June 6, 2022 12:32
@anntzer
Contributor

anntzer commented Jun 11, 2022

accentprefixed is being handled (removed) at #22950.

@oscargus
Member Author

@anntzer Do you know if #22950 will enable using single character accents (that is also a starting character of another LaTeX symbol)?

Also, do you have any idea how one can detect if a glyph actually exists as in the ṡ turning into ¤ in the image above? (I do not think it is Matplotlib that does that substitution?)

@anntzer
Contributor

anntzer commented Jun 12, 2022

first point: yes, I think that should work.
second point: I think the relevant parts are around

    glyphindex = font.get_char_index(uniindex)
    if glyphindex != 0:
        found_symbol = True
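
The check quoted above can be wrapped in a small helper. The following is only a sketch: `StubFont` and `has_glyph` are hypothetical names used for illustration, and the stub merely imitates the one behaviour the snippet relies on, namely that `get_char_index` returns 0 for a code point the font does not cover.

```python
class StubFont:
    """Hypothetical stand-in for a font object; not part of Matplotlib."""

    def __init__(self, codepoints):
        # Map each covered code point to a non-zero glyph index.
        self._cmap = {cp: i + 1 for i, cp in enumerate(sorted(codepoints))}

    def get_char_index(self, codepoint):
        # Glyph index 0 is the "missing glyph" slot.
        return self._cmap.get(codepoint, 0)


def has_glyph(font, char):
    """Return True if `font` maps `char` to a real (non-zero) glyph index."""
    return font.get_char_index(ord(char)) != 0


font = StubFont({ord("s"), ord("\u00ef")})  # covers s and ï only
print(has_glyph(font, "\u00ef"))  # True: ï is covered
print(has_glyph(font, "\u1e61"))  # False: ṡ is not covered
```

With Matplotlib installed, an object obtained through `matplotlib.font_manager.get_font(matplotlib.font_manager.findfont("DejaVu Sans"))` should be usable in place of the stub, since it offers the same `get_char_index` method.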

@oscargus
Member Author

Thanks! Ahh, I knew I had seen that somewhere! Grepped for ¤ though...

@oscargus
Member Author

oscargus commented Jun 12, 2022

I'm wondering if one should introduce some rcParam for the replacement. If I understand it correctly, it may not be possible for the parser to actually know the exact font being used? (Only like 'rm')

Edit: Inkscape was not in the path due to a reinstall...

Also, it seems like the svg output actually handles ṡ, but not the pdf or png output. Checking the source, it seems like something converts the combined character back into a combining accent and a character. Not sure what though.

Anyway, I am wondering if one should possibly try to decompose the characters once the _get_glyph operation fails?

Example: (not relevant anymore, but may still be of interest)

import unicodedata

# U+0307 COMBINING DOT ABOVE
accent = chr(775)
with_combining_accent = 's' + accent
print(with_combining_accent, len(with_combining_accent))
# NFC composes the pair into one precomposed code point when one exists.
combined = unicodedata.normalize('NFC', with_combining_accent)
print(combined, len(combined))
print(ord(combined))

This shows that it correctly finds https://www.codetable.net/decimal/7777

One can do unicodedata.normalize('NFD', chr(7777)) to get the two characters back again.
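
As a small illustration of that reverse direction, the sketch below decomposes code point 7777 with the standard library and checks that recomposing gives the original character back. The variable names are chosen for this example only.

```python
import unicodedata

precomposed = chr(7777)  # U+1E61, s with dot above

# NFD splits the precomposed character into base letter + combining accent.
decomposed = unicodedata.normalize("NFD", precomposed)
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0073', 'U+0307']

# NFC puts the two code points back together.
roundtrip = unicodedata.normalize("NFC", decomposed)
print(roundtrip == precomposed)  # True

# The raw decomposition mapping is also available as a string of hex values.
print(unicodedata.decomposition(precomposed))  # 0073 0307
```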

However, in the svg output

@oscargus
Member Author

I also replaced some of the accents with the "proper" combining accent. This breaks another test, but avoids having to resize \circ.

@@ -999,9 +999,14 @@
'combiningdiaeresis' : 776,
'combiningtilde' : 771,
'combiningrightarrowabove' : 8407,
'combiningleftarrowabove' : 8406,
Member Author

A bit of aligning required here and a few lines down.

@anntzer
Contributor

anntzer commented Jun 12, 2022

Perhaps split out the addition of new accents as a separate PR, which should be fairly uncontroversial?

I suspect that general handling of combining characters would basically require harfbuzz (which knows how to position an accent by itself, e.g. the classic "zalgo" text h̷̡̦͚́͛̅̔̅̊͘ě̶͚̣̭́̉͜ļ̴͚͙̝̑̒l̸̛̙̹ͅơ̵͎̻͔̯̊ ̶̨̨͖̥̺͓̽̋̒͝w̶̨̗̻̥̜͍̮̏͛͒͝o̷̟͆̍̓̚ŗ̵̢͔̦̑͗̑̑̃l̸̲̥̲̹͖̔̇̾̏͆d̴͍̲̓̄̑̉̌̇͜) + switching from bakoma to lm-math, to have access to the combining characters... Still,

> If I understand it correctly, it may not be possible for the parser to actually know the exact font being used?

I think that's actually possible? For example, Char._update_metrics does self._metrics = self.font_output.get_metrics(self.font, self.font_class, self.c, self.fontsize, self.dpi), which loads the metrics of a glyph in the current concrete font, so that can certainly check whether the glyph exists in that font.

@oscargus oscargus force-pushed the moreaccentsabove branch 2 times, most recently from ce1fa41 to e671b41 Compare June 12, 2022 14:38
@oscargus
Member Author

You are correct that it was possible. I couldn't follow the order of things happening properly.

I think that the zalgo support is actually not that much affected by this. It is just that when proper glyphs are available these will be used; if not, it will be as before (which I guess supported zalgo to some extent). See for example the test with r'$\mathring{A} \AA$', where both characters now render identically (Å). (This is a rather good test for this feature, possibly including a few more Unicode characters.)
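
The "use the proper glyph when available, otherwise behave as before" logic can be sketched as follows. This is not the code in the PR: `compose_if_available` is a hypothetical helper, and the `has_glyph` callable stands in for whatever font lookup is actually used (here a plain set of characters).

```python
import unicodedata


def compose_if_available(base, combining, has_glyph):
    """Return the precomposed character for base + combining accent,
    or None if Unicode has no such character or the font lacks it."""
    candidate = unicodedata.normalize("NFC", base + combining)
    if len(candidate) == 1 and has_glyph(candidate):
        return candidate
    # Caller falls back to stacking the accent on the base character.
    return None


# Assumed glyph coverage for this example: the font has Å and ï, but not ṡ.
available = {"\u00c5", "\u00ef"}
ring_above = "\u030a"
dot_above = "\u0307"

print(compose_if_available("A", ring_above, available.__contains__))  # Å
print(compose_if_available("s", dot_above, available.__contains__))   # None
print(compose_if_available("q", ring_above, available.__contains__))  # None
```

The second call returns None because the glyph is missing from the assumed font, the third because Unicode defines no precomposed q with ring above, so normalization leaves two code points.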

One may even consider checking if a Unicode character can be split.

Anyway, this should really wait until #22950 is merged so that more accents can be added. One could also consider adding support for other combining accents, like cedilla and ogonek, which at least should work when combined characters are available. Maybe one should have two separate groups of accents: the current ones, where it is possible to "create" decent-looking combinations, and those like cedilla and ogonek, which may have a valid combined glyph. For the latter group, one could raise an error if the accent and character do not combine or the glyph is not available.

(I tried to get combining accents below working, but I had some issues aligning them correctly, especially since cedilla and ogonek should attach without a gap, and I didn't get that to work for e.g. p, which probably no one wants, but still...)

There are now some more things changed:

  • macron and overline are different
  • if possible, a dotless i is used (as LaTeX does nowadays)
  • there are a number of new test images, primarily for illustration, as I expect them to change (note that \check is not working)

@@ -2050,10 +2060,27 @@ def accent(self, s, loc, toks):
accent_box = AutoWidthChar(
'\\' + accent, sym.width, state, char_class=Accent)
else:
# Check if accent and character can be combined
Member Author

One can possibly consider splitting the accents into those that may have precomposed characters and those that may not.
https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode

Possibly one should also check that the character is one of the standard Latin characters, although that may mean that characters precomposed with two accents do not work (it should be checked whether they even work to start with...).

@oscargus
Member Author

Turns out that for some characters the caron (\check) is written like this: https://www.compart.com/en/unicode/U+0165
