Optimizations
NOTE FOR DEVELOPERS: Feel free to add any new content. Discuss your findings and insights in the Matrix chat server.
TODO
TODO
- Read operations on SSDs and NVMe SSDs can be assumed not to wear out the drives.
- SSDs have high sequential read speeds, and NVMe SSDs in particular have high random IOPS, which matters for random index lookups.
- We are entering 2022 during a chip shortage, with prices correspondingly high. It is worth considering where prices might be two years down the line, when SSDs could cost roughly what HDDs cost today.
- Common Crawl creates WARC archives/dumps once every few months. The data consists of ~250 GB of metadata and ~100 TB of archives. If this is going to be our primary source of data (unlikely in the long term), then we can assume we have at least one month between dumps in which to process or index the data.
Ideally, in the long term, all Common Crawl data will be used and there will be no need to filter a subset of it. However, we don't need to index all the data to get meaningful insights into the quality of results. The code in mwmbl.indexer.extract.py filters the metadata in s3://commoncrawl/cc-index/table/cc-main/warc/CC-xx-xx/subset=warc down to a subset of a few domains. There are at least two viable strategies for filtering subsets:
- Run the code closer to S3, within AWS, so that the network I/O involved in downloading S3 files is minimized.
- Pre-download all the S3 metadata onto a local disk (preferably attached SSDs) and then run the filtering locally.
- Leverage the fact that the metadata files are in Parquet format, so filtering can be run on them directly without having to download the entire file (see the sketch after this list).
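The Parquet approach can be sketched roughly as follows. This is a minimal illustration, not the actual extract code: the dataset path, the column names (url, url_host_registered_domain, warc_filename, warc_record_offset, warc_record_length) and the example domains are assumptions based on the public cc-index columnar schema.

```python
# Minimal sketch: filter cc-index Parquet metadata down to a set of domains.
# Path, columns and domains are assumptions, not the code used in mwmbl.
import pyarrow.dataset as ds

WANTED_DOMAINS = ["news.ycombinator.com", "python.org"]  # e.g. the top-HN domains

# Point at locally pre-downloaded metadata (strategy 2) or at an s3:// URI (strategy 1).
dataset = ds.dataset("/data/cc-index/CC-MAIN-2021-49/subset=warc/", format="parquet")

table = dataset.to_table(
    columns=[
        "url",
        "url_host_registered_domain",
        "warc_filename",
        "warc_record_offset",
        "warc_record_length",
    ],
    # Predicate pushdown: row groups whose statistics rule out these domains
    # are skipped entirely, so the files are never read in full.
    filter=ds.field("url_host_registered_domain").isin(WANTED_DOMAINS),
)
print(table.num_rows)
```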
- Experiment 1
- Hardware: NVMe SSD on x4 PCIe 3.0, Xeon E5-1620 (4 cores / 8 threads, 3.5 GHz), 32 GB DDR4-2133 RAM
- Pre-downloaded all S3 metadata onto a local 120 GB NVMe disk (only the required columns were downloaded).
- Using pyarrow + Parquet, we were able to filter the metadata down to the top-HN domains (~8,000 domains) in under 6 minutes.
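The pre-download step can be sketched with pyarrow's S3 filesystem support, fetching only the required columns of each cc-index Parquet file and writing a much smaller local copy. The bucket prefix, column list and local directory below are illustrative assumptions, not the exact code used in the experiment.

```python
# Illustrative sketch of the pre-download step: copy only the required columns
# of each cc-index Parquet file from S3 to a local NVMe disk. The prefix,
# column list and local directory are assumptions, not the exact experiment code.
import os
import pyarrow.parquet as pq
from pyarrow import fs

REQUIRED_COLUMNS = [
    "url",
    "url_host_registered_domain",
    "warc_filename",
    "warc_record_offset",
    "warc_record_length",
]

s3 = fs.S3FileSystem(anonymous=True, region="us-east-1")  # commoncrawl is a public bucket
prefix = "commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2021-49/subset=warc/"
local_dir = "/nvme/cc-index/CC-MAIN-2021-49/"
os.makedirs(local_dir, exist_ok=True)

for info in s3.get_file_info(fs.FileSelector(prefix)):
    if not info.path.endswith(".parquet"):
        continue
    # Column projection: only the byte ranges for the requested columns are fetched.
    table = pq.read_table(info.path, columns=REQUIRED_COLUMNS, filesystem=s3)
    pq.write_table(table, os.path.join(local_dir, os.path.basename(info.path)))
```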
TODO