[DomCrawler] Failing test for Crawler emojis handling regression #46212

Closed
wants to merge 1 commit into from

Contributor

@ogizanagi ogizanagi commented Apr 29, 2022

| Q             | A
| ------------- | ---
| Branch?       | 4.4
| Bug fix?      | yes
| New feature?  | no
| Deprecations? | no
| Tickets       | N/A
| License       | MIT
| Doc PR        | N/A

This test has been failing since 4.4.39, 5.4.6 and 6.0.6.

This is the minimal reproducer for an issue we encountered on our blog while upgrading to 5.4.6 (this probably goes beyond emojis issues).
Our content is parsed from Markdown files, converted to HTML, then read using the Crawler to perform some checks and modifications. But since the upgrade, emojis included in the original content have been wrongly encoded:

-'<body><p>Hey 👋</p></body>'
+'<body><p>Hey �</p></body>'

These changes are the culprit: https://github.com/symfony/symfony/pull/45532/files#diff-940b51ffa31dedac17aa49cd8e04e44aa8b3747782c7f9f27457b0510587c05d

However, I'm unsure whether this is an actual regression or somehow a misuse of the Crawler, since adding:

<head>
    <meta charset="UTF-8"/>
</head>

in the test case would fix it?

Member

stof commented Apr 29, 2022

The default charset in the HTML spec is not UTF-8 but ISO-8859-1. So you indeed need to specify the charset as UTF-8 if you use UTF-8 content (and you cannot use emojis in ISO-8859-1).
Your code was working before by using the Symfony bug as a feature...

It was not a misuse of the Crawler. It was a misuse of HTML.
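
To illustrate the charset point above, here is a minimal sketch, assuming the mbstring extension is available. It does not use the Crawler at all; it only shows what happens to the emoji's bytes when they are labelled as ISO-8859-1 instead of UTF-8. The variable names are illustrative only.

```php
<?php
// The waving-hand emoji is a single code point (U+1F44B),
// but it takes four bytes once encoded as UTF-8.
$emoji = "\u{1F44B}";
$byteLength = strlen($emoji);   // counts bytes, not characters
$hexBytes = bin2hex($emoji);    // hexadecimal dump of the UTF-8 bytes

// Treat those same four bytes as ISO-8859-1 and re-encode them as UTF-8:
// every byte becomes its own, unrelated character.
$misread = mb_convert_encoding($emoji, 'UTF-8', 'ISO-8859-1');
$misreadLength = mb_strlen($misread, 'UTF-8');

// The original bytes are not lost, only mislabelled: converting back
// to ISO-8859-1 restores the exact byte sequence of the emoji.
$restored = mb_convert_encoding($misread, 'ISO-8859-1', 'UTF-8');

var_dump($byteLength, $hexBytes, $misreadLength, $restored === $emoji);
```

ISO-8859-1 is a single-byte encoding, so it has no way to represent the emoji as one character. This is why content containing emojis needs an explicit UTF-8 declaration whenever the parser would otherwise fall back to ISO-8859-1.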

Contributor Author

ogizanagi commented Apr 29, 2022

Still, I'm double-asking considering:

/**
 * Adds HTML/XML content.
 *
 * If the charset is not set via the content type, it is assumed to be UTF-8,
 * or ISO-8859-1 as a fallback, which is the default charset defined by the
 * HTTP 1.1 specification.
 */
public function addContent(string $content, string $type = null)

Note:

$crawler = new Crawler();
$crawler->addHtmlContent('<body><p>Hey 👋</p></body>', 'UTF-8');

would not work either.

Member

nicolas-grekas commented Apr 30, 2022

Good catch, thanks!
Fixed in #46221

nicolas-grekas added a commit that referenced this pull request Apr 30, 2022
…grekas)

This PR was merged into the 4.4 branch.

Discussion
----------

[DomCrawler][VarDumper] Fix html-encoding emojis

| Q             | A
| ------------- | ---
| Branch?       | 4.4
| Bug fix?      | yes
| New feature?  | no
| Deprecations? | no
| Tickets       | Fix #46212
| License       | MIT
| Doc PR        | -

Commits
-------

26fbc96 [DomCrawler][VarDumper] Fix html-encoding emojis
@ogizanagi ogizanagi deleted the bug-crawler-emojis branch April 30, 2022 21:04