-
-
Notifications
You must be signed in to change notification settings - Fork 40
Open
Description
I've been running crux over several sites and noticed the following bug.
Problem
Here is an example URL that displays the problem: https://www.bbc.com/news/world-europe-61691816
Test based off the README example to verify the problem:
@Test
fun broken() {
val crux = Crux()
val httpUrl = "https://www.bbc.com/news/world-europe-61691816".toHttpUrl()
val document = Jsoup.connect(httpUrl.toString()).get()
val resource = runBlocking {
crux.extractFrom(httpUrl, document)
}
assertEquals("Ukraine anger as Macron says 'Don't humiliate Russia'", resource.fields[Fields.TITLE])
}The sequence of events is:
- HtmlMetadataExtractor correctly extracts the right title "Ukraine anger as Macron says 'Don't humiliate Russia' - BBC News"
- WebAppManifestParser extracts the title "BBC"
- The fold operation in
Crux.extractFromusesResource.plusto merge the resources overwriting the title with "BBC"
fields = if (anotherResource?.fields == null) fields else fields + anotherResource.fields,
Possible solutions
If you update Crux.createDefaultPlugins to place WebAppManifestParser before HtmlMetadataExtractor like this:
public fun createDefaultPlugins(okHttpClient: OkHttpClient): List<Plugin> = listOf(
// Static redirectors go first, to avoid getting stuck into CAPTCHAs.
GoogleUrlRewriter(),
FacebookUrlRewriter(),
// Remove any tracking parameters remaining.
TrackingParameterRemover(),
// Prefer canonical URLs over AMP URLs.
AmpRedirector(refetchContentFromCanonicalUrl = true, okHttpClient),
// Fetches and parses the Web Manifest. May replace existing favicon URL with one from the manifest.json.
WebAppManifestParser(okHttpClient),
// Parses many standard HTML metadata attributes.
HtmlMetadataExtractor(okHttpClient),
// Extracts the best possible favicon from all the markup available on the page itself.
FaviconExtractor(),
// Parses the content of the page to remove ads, navigation, and all the other fluff.
ArticleExtractor(okHttpClient),
)It will produce the correct results.
This is the simplest way we can resolve it. Is there a specific reason to have WebAppManifestParser after HtmlMetadataExtractor or can we reorder it?
If that is not possible then we might need to consider a new way to handle merging the fields.
chimbori
Metadata
Metadata
Assignees
Labels
No labels