NOTE FOR DEVELOPERS: Feel free to add any new content. Discuss your findings and insights on the Matrix chat server.

Table Of Contents

TODO

Ideas

TODO

Considerations

Hosting on our dedicated hardware

  • Read operations on SSDs and NVMe SSDs can be assumed not to wear out the drives.
  • SSDs have high sequential read speeds, and NVMe SSDs in particular have high random IOPS, which matters for random index lookups.
  • We are entering 2022 during a chip shortage, when prices are high. It is worth considering what prices might look like two years down the line, when SSD prices may be comparable to today's HDD prices.

Common Crawl

  • They create WARC archive dumps once every few months. Their data consists of ~250GB of metadata and ~100TB of archives (see the sketch below for a rough way to check the metadata size of a dump). If this is going to be our primary source of data (unlikely in the long term), then we can assume at least one month between dumps to process and index the data.
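  • As a rough way to gauge the metadata volume of one dump, a plain S3 listing can sum the sizes of its index parquet files. The snippet below is a minimal sketch: it assumes anonymous access to the public commoncrawl bucket, and the crawl label used in the prefix is only an example placeholder.

        # Sum the sizes of the cc-index parquet files for a single dump.
        # Assumptions: anonymous access to the public "commoncrawl" bucket;
        # the crawl label below is only an example placeholder.
        import boto3
        from botocore import UNSIGNED
        from botocore.config import Config

        s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
        paginator = s3.get_paginator("list_objects_v2")

        prefix = "cc-index/table/cc-main/warc/crawl=CC-MAIN-2021-49/subset=warc/"

        total_bytes = 0
        for page in paginator.paginate(Bucket="commoncrawl", Prefix=prefix):
            for obj in page.get("Contents", []):
                total_bytes += obj["Size"]

        print(f"{prefix}: {total_bytes / 1e9:.1f} GB of parquet metadata")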

Experiments

Optimizing Common Crawl Metadata Filtering

Ideally, in the long term, all Common Crawl data will be used and there will be no need to filter out a subset. However, we don't need to index all the data to get meaningful insight into the quality of results. The code in mwmbl.indexer.extract.py filters the metadata in s3://commoncrawl/cc-index/table/cc-main/warc/CC-xx-xx/subset=warc down to a subset of a few domains. There are at least three viable strategies for filtering subsets:

  1. Run the code close to S3, within AWS, so that the network I/O involved in downloading the S3 files is minimized.
  2. Pre-download all of the S3 metadata onto a local disk (preferably attached SSDs) and then run the filtering locally.
  3. Leverage the fact that the metadata files are in Parquet format: filtering with column projection and predicate pushdown can be run on them directly, without downloading the entire file.
  • Experiment 1
    • Hardware: NVMe SSD on x4 PCIe 3.0, Xeon E5-1620 (4 cores / 8 threads, 3.5GHz), 32GB DDR4-2133 RAM
    • Pre-downloaded all of the S3 metadata onto a 120GB local NVMe disk (only the required columns were downloaded)
    • Using pyarrow + parquet, we were able to filter down to the top-HN domains (~8000 domains) in under 6 minutes (sketched in the bullet below).
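    • A minimal sketch of this kind of pyarrow + parquet filtering is below, assuming the metadata parquet files have already been downloaded to a local directory. The directory path, column names and domain allow-list are illustrative assumptions, not the actual mwmbl extraction code.

        # Filter pre-downloaded cc-index parquet metadata with pyarrow.
        # The path, column names and allow-list below are illustrative
        # assumptions, not the actual mwmbl extraction code.
        import pyarrow.dataset as ds

        # Hypothetical allow-list, e.g. domains taken from top Hacker News posts.
        allowed_domains = ["example.com", "python.org"]

        # Point a pyarrow dataset at the locally downloaded parquet files.
        dataset = ds.dataset("cc-index-local/", format="parquet")

        # Read only the columns needed downstream and push the domain filter
        # down, so unused columns and non-matching row groups are skipped.
        table = dataset.to_table(
            columns=[
                "url_host_registered_domain",
                "url",
                "warc_filename",
                "warc_record_offset",
                "warc_record_length",
            ],
            filter=ds.field("url_host_registered_domain").isin(allowed_domains),
        )

        print(f"Kept {table.num_rows} index rows for {len(allowed_domains)} domains")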

Optimizing Fetching Common Crawl Archive Data

TODO
