Open
Description
See #58 (comment) and #58 (comment).
Repeating the traceback here as well:
Traceback (most recent call last):
  File "./bin/triage_links", line 34, in get_url_parts
    link = urljoin(record.url, record.href)
  File "scurl/cgurl.pyx", line 308, in scurl.cgurl.urljoin
  File "scurl/cgurl.pyx", line 353, in scurl.cgurl.urljoin
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 503: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./bin/triage_links", line 102, in <module>
    main()
  File "./bin/triage_links", line 13, in main
    CSVPipeline(callback=process).execute()
  File "./bin/../crawler/utils/csvpipeline.py", line 42, in execute
    self.save_csv()
  File "./bin/../crawler/utils/csvpipeline.py", line 96, in save_csv
    df = df.compute()
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 175, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 446, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/threaded.py", line 82, in get
Segmentation fault: 11
To reproduce, run a broad crawl on this dataset and extract all links:
https://www.kaggle.com/cheedcheed/top1m
Then call urljoin() and urlsplit() on each extracted link.
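A minimal sketch of a harness for narrowing this down, assuming the extracted links are available as (base URL, href) pairs. The function name `find_failing_links` and its parameters are hypothetical and not part of triage_links or scurl. It defaults to the standard library's `urllib.parse` functions so it runs anywhere; the assumption is that scurl's `urljoin` and `urlsplit` can be passed in through `join=` and `split=` to compare behaviour.

```python
from urllib.parse import urljoin, urlsplit


def find_failing_links(pairs, join=urljoin, split=urlsplit):
    """Return (base, href, error) for every pair that raises while joining or splitting.

    `pairs` is an iterable of (base_url, href) tuples. `join` and `split`
    default to the stdlib functions; pass another implementation's
    functions to test it instead (hypothetical usage, not verified here).
    """
    failures = []
    for base, href in pairs:
        try:
            link = join(base, href)
            split(link)
        except (UnicodeError, ValueError) as exc:
            # UnicodeDecodeError is a subclass of UnicodeError, so the
            # decode failure from the traceback would be recorded here.
            failures.append((base, href, repr(exc)))
    return failures


if __name__ == "__main__":
    sample = [("http://example.com/a/", "b?x=1")]
    for base, href, err in find_failing_links(sample):
        print(base, href, err)
```

This only records Python-level exceptions. A segmentation fault kills the interpreter and cannot be caught with try/except, so isolating the crashing input would need each batch to run in a separate process.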