@anchitshrivastava
Contributor

Please find the traceback attached below.

  article_t = g.extract(url=url)
  File "/usr/local/lib/python3.10/dist-packages/goose3/__init__.py", line 125, in extract
    return self.__crawl(crawl_candidate)
  File "/usr/local/lib/python3.10/dist-packages/goose3/__init__.py", line 153, in __crawl
    return crawler_wrapper(self.config.parser_class, parsers, crawl_candidate)
  File "/usr/local/lib/python3.10/dist-packages/goose3/__init__.py", line 141, in crawler_wrapper
    article = crawler.crawl(crawl_candidate)
  File "/usr/local/lib/python3.10/dist-packages/goose3/crawler.py", line 135, in crawl
    return self.process(raw_html, parse_candidate.url, parse_candidate.link_hash)
  File "/usr/local/lib/python3.10/dist-packages/goose3/crawler.py", line 183, in process
    self.article._authors = self.authors_extractor.extract()
  File "/usr/local/lib/python3.10/dist-packages/goose3/extractors/authors.py", line 27, in extract
    authors_from_schema = self.__get_authors_from_schema()
  File "/usr/local/lib/python3.10/dist-packages/goose3/extractors/authors.py", line 73, in __get_authors_from_schema
    authors.append(author["name"])
KeyError: 'name'

@barrust merged commit 92a2698 into goose3:master on Jun 16, 2023
@erikvullings
Contributor

erikvullings commented Jun 26, 2023

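Although this fixes the uncaught KeyError, it sets the author name to "" when the schema author has no name key. The list of schema authors then contains an empty string, so it is non-empty and therefore truthy, which means extract() returns it and ignores the author name from meta: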

    def extract(self):
        authors_from_schema = self.__get_authors_from_schema()
        authors_from_meta = self.__get_authors_from_meta()
        if authors_from_schema:
            return authors_from_schema
        return authors_from_meta
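
To make the concern concrete, here is a minimal standalone sketch of the same selection logic. The function name `pick_authors` and the sample values are hypothetical and not part of goose3; the only point is that a list holding an empty string still counts as truthy in Python.

```python
def pick_authors(authors_from_schema, authors_from_meta):
    # Same selection rule as extract() above: prefer schema authors when truthy.
    if authors_from_schema:
        return authors_from_schema
    return authors_from_meta


# [""] is non-empty, so the meta fallback is never reached.
print(pick_authors([""], ["Nia Williams"]))  # prints ['']

# An empty list is falsy, so the meta authors are used.
print(pick_authors([], ["Nia Williams"]))  # prints ['Nia Williams']
```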

Instead, you could use something like the following:

    def __get_authors_from_schema(self):
        authors = []
        if self.article.schema and "author" in self.article.schema:
            schema_authors = self.article.schema["author"]
            # A single author may be given as one object rather than a list
            if isinstance(schema_authors, dict):
                schema_authors = [schema_authors]
            for author in schema_authors:
                if isinstance(author, dict):
                    # Skip author objects that have no usable "name" key
                    name = author.get("name")
                    if name:
                        authors.append(name)
                else:
                    authors.append(author)
        return authors

I received this error on a page prepared by Reuters, where the author in the schema was an object "author":{"@type":"Person","byline":"Nia Williams"}, which has no name key.
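
As a rough check of the suggestion against that Reuters-style object, the sketch below restates the logic as a free function that can be run without goose3. The name `authors_from_schema` and the second sample input are hypothetical. It also differs from the method above in two ways that are assumptions of this sketch: a bare string author is treated as a single author, and empty non-dict entries are skipped.

```python
def authors_from_schema(schema):
    """Collect author names from a schema dict, skipping entries without a name."""
    authors = []
    if not schema or "author" not in schema:
        return authors
    schema_authors = schema["author"]
    # Assumption of this sketch: a lone dict or a lone string is a single author.
    if isinstance(schema_authors, (dict, str)):
        schema_authors = [schema_authors]
    for author in schema_authors:
        name = author.get("name") if isinstance(author, dict) else author
        if name:
            authors.append(name)
    return authors


# The author object quoted above has "byline" but no "name".
reuters_like = {"author": {"@type": "Person", "byline": "Nia Williams"}}
print(authors_from_schema(reuters_like))  # prints []

# Hypothetical mixed input: one author object and one plain string.
print(authors_from_schema({"author": [{"name": "A"}, "B"]}))  # prints ['A', 'B']
```

Because the first call returns an empty list, a caller using the extract() logic shown earlier would fall back to the authors from meta.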
