Segfault or encoding error when parsing a URL

See https://github.com/scrapy/scurl/issues/58#issuecomment-513520254 and https://github.com/scrapy/scurl/issues/58#issuecomment-513583355

The traceback is repeated here:
```
Traceback (most recent call last):
  File "./bin/triage_links", line 34, in get_url_parts
    link = urljoin(record.url, record.href)
  File "scurl/cgurl.pyx", line 308, in scurl.cgurl.urljoin
  File "scurl/cgurl.pyx", line 353, in scurl.cgurl.urljoin
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 503: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./bin/triage_links", line 102, in <module>
    main()
  File "./bin/triage_links", line 13, in main
    CSVPipeline(callback=process).execute()
  File "./bin/../crawler/utils/csvpipeline.py", line 42, in execute
    self.save_csv()
  File "./bin/../crawler/utils/csvpipeline.py", line 96, in save_csv
    df = df.compute()
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 175, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 446, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/threaded.py", line 82, in get
Segmentation fault: 11
```

To reproduce, run a broad crawl on this dataset and extract all links:

https://www.kaggle.com/cheedcheed/top1m

Then call ``urljoin()`` and ``urlsplit()`` on each extracted link.
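A minimal sketch of a harness for narrowing down the offending inputs, assuming the crawl output is available as `(base_url, href)` pairs. The helper name `find_failing_links` and the sample data are hypothetical. The `scurl.cgurl` import path is taken from the traceback above. The standard-library fallback is only there so the sketch runs where scurl is not installed.

```python
from urllib.parse import urljoin as std_urljoin, urlsplit as std_urlsplit

try:
    # Import path as it appears in the traceback (scurl/cgurl.pyx).
    from scurl.cgurl import urljoin, urlsplit
except ImportError:
    # Fallback so the sketch still runs without scurl installed.
    urljoin, urlsplit = std_urljoin, std_urlsplit


def find_failing_links(records, join=urljoin, split=urlsplit):
    """Return (base, href, error) for every pair that raises an exception."""
    failures = []
    for base, href in records:
        try:
            split(join(base, href))
        except Exception as exc:  # e.g. UnicodeDecodeError
            failures.append((base, href, repr(exc)))
    return failures


# Hypothetical sample input; real input would come from the crawl.
sample = [
    ("https://example.com/a/", "b.html"),
    ("https://example.com/", "/caf\u00e9?q=\U0001f600"),
]
failures = find_failing_links(sample)
for base, href, error in failures:
    print(base, href, error)
```

This only catches Python-level exceptions such as the `UnicodeDecodeError`. A segmentation fault kills the process, so for that case it may help to log each pair before calling `urljoin()`, or to enable the standard-library `faulthandler` module, so the last input before the crash is visible.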
