Read all columns available in each data source for URLs #6

@deamonpog

Although Brandwatch claims that the "Expanded URLs" column contains all URLs, it sometimes misses URLs that appear in Content. Several other columns may also contain valid URLs or references to websites. The suggestion is therefore to check every possible column. This task may be expanded to all data sources, including Reddit and 4chan.

Requirements:

  1. Detect all URLs
  2. Expand all URLs
  3. Detect all unique URLs
  4. Put them in "article_urls"
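
A minimal sketch of the detection and de-duplication steps (1 and 3), assuming each record is available as a plain dict keyed by column name. The regular expression, the `find_urls` helper, and the sample row are illustrative assumptions and not part of the existing codebase; only the column names come from this issue.

```python
import re

# Assumed pattern: http(s) URLs up to whitespace or a common closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def find_urls(row, columns):
    """Return the URLs found in the given columns, de-duplicated in first-seen order."""
    seen = {}  # dict keys keep insertion order, so this doubles as an ordered set
    for column in columns:
        text = row.get(column) or ""  # tolerate missing columns and None values
        for match in URL_PATTERN.findall(str(text)):
            url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
            seen.setdefault(url, None)
    return list(seen)


# Hypothetical Brandwatch-style row, for illustration only.
row = {
    "Expanded URLs": "https://example.com/a",
    "Title": "Read this: https://example.com/b.",
    "Full Text": "See https://example.com/a and https://example.org/c, thanks",
}
row["search_article_urls"] = find_urls(row, ["Expanded URLs", "Title", "Full Text"])
```

Scanning every candidate column with the same helper means a URL missing from "Expanded URLs" is still picked up if it appears in the title or the full text.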

Subtask list:

  • Detect all URLs and add them to the search_article_urls column
    • Brandwatch
      • Search Brandwatch URL columns, Title column, and Full Text column for URLs and put them in the "search_article_urls"
    • 4chan
      • Search in 4chan for URLs in the required columns and put them in the search_article_urls column
      • Do full search in 4chan for the search_article_urls column
    • Reddit
      • Search in Reddit for URLs in the required columns and put them in the search_article_urls column
      • Do full search in Reddit for the search_article_urls column
      • For RedditComments, use the respective parent RedditSubmission post for URLs? (Do we then do the same for Twitter?)
  • Expand all URLs
  • Detect all unique URLs
  • Put them in "article_urls"
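
The last three subtasks (expand, de-duplicate, store) could be sketched as follows. This is only an illustration under stated assumptions: `resolve_redirects` and `expand_and_dedupe` are hypothetical helper names, expansion is assumed to mean "follow HTTP redirects to the final URL", and the short links in the example are made up. The resolver is passed in as a parameter so the logic can be tried without network access.

```python
import http.client
import urllib.request


def resolve_redirects(url, timeout=10):
    """Follow HTTP redirects and return the final URL; return the input unchanged on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.geturl()
    except (OSError, ValueError, http.client.HTTPException):
        # Network errors, HTTP errors, and malformed URLs all fall back to the original.
        return url


def expand_and_dedupe(urls, resolver=resolve_redirects):
    """Expand each URL with `resolver`, then keep unique results in first-seen order."""
    unique = {}
    for url in urls:
        unique.setdefault(resolver(url), None)
    return list(unique)


# Offline example with a stub resolver (the mapping below is invented for illustration).
stub_table = {
    "https://sho.rt/1": "https://example.com/a",
    "https://sho.rt/2": "https://example.com/a",
}
record = {
    "search_article_urls": [
        "https://sho.rt/1",
        "https://sho.rt/2",
        "https://example.com/b",
    ]
}
record["article_urls"] = expand_and_dedupe(
    record["search_article_urls"],
    resolver=lambda u: stub_table.get(u, u),
)
```

De-duplicating after expansion rather than before matters here, because two different short links can resolve to the same article.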

Labels

enhancement: New feature or request