Fix #1517: Feature Request: Better support for archiving URLs behind HTT #1766
danielalanbates wants to merge 1 commit into ArchiveBox:dev
6 issues found across 4 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="archivebox/misc/util.py">
<violation number="1" location="archivebox/misc/util.py:44">
P2: Rebuilding netloc from hostname/port strips IPv6 brackets, so IPv6 URLs like http://[::1]:80 become http://::1:80, which is invalid and will break parsing/deduping.</violation>
</file>
<file name="archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py">
<violation number="1" location="archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py:99">
P2: Strict port equality prevents auth propagation between same-origin URLs when one uses an explicit default port (None vs 80/443).</violation>
<violation number="2" location="archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py:161">
P2: Source URL can be re-added after auth injection because the skip check is performed before `inject_url_auth`, allowing the authenticated root URL to be reinserted into `urls_found`.</violation>
</file>
<file name="archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py">
<violation number="1" location="archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py:59">
P2: Reinserting `urlparse`'s decoded username/password directly into the netloc can corrupt URLs when credentials contain reserved characters (e.g., `@`, `:`), changing the userinfo/host boundary and breaking authentication. The new `inject_url_auth` should re-encode userinfo before rebuilding the URL.</violation>
<violation number="2" location="archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py:61">
P2: Reconstructed netloc omits required brackets for IPv6 hosts, producing invalid URLs when injecting credentials (e.g., user:pass@::1 instead of user:pass@[::1]).</violation>
</file>
<file name="archivebox/plugins/parse_html_urls/on_Snapshot__70_parse_html_urls.py">
<violation number="1" location="archivebox/plugins/parse_html_urls/on_Snapshot__70_parse_html_urls.py:114">
P2: Auth injection can be skipped for the same endpoint when one URL includes an explicit default port and the other omits it, because urlparse().port is None when no port is specified.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
 path = lambda url: urlparse(url).path
 basename = lambda url: urlparse(url).path.rsplit('/', 1)[-1]
-domain = lambda url: urlparse(url).netloc
+domain = lambda url: (lambda p: f'{p.hostname}:{p.port}' if p.port else (p.hostname or p.netloc))(urlparse(url))  # returns host:port without HTTP basic auth credentials
P2: Rebuilding netloc from hostname/port strips IPv6 brackets, so IPv6 URLs like http://[::1]:80 become http://::1:80, which is invalid and will break parsing/deduping.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At archivebox/misc/util.py, line 44:
<comment>Rebuilding netloc from hostname/port strips IPv6 brackets, so IPv6 URLs like http://[::1]:80 become http://::1:80, which is invalid and will break parsing/deduping.</comment>
<file context>
@@ -41,11 +41,12 @@
path = lambda url: urlparse(url).path
basename = lambda url: urlparse(url).path.rsplit('/', 1)[-1]
-domain = lambda url: urlparse(url).netloc
+domain = lambda url: (lambda p: f'{p.hostname}:{p.port}' if p.port else (p.hostname or p.netloc))(urlparse(url)) # returns host:port without HTTP basic auth credentials
query = lambda url: urlparse(url).query
fragment = lambda url: urlparse(url).fragment
</file context>
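One way to address the bracket loss, sketched here under assumptions: instead of rebuilding the host from `hostname` and `port`, drop only the userinfo from the raw `netloc`. The helper name `domain_without_auth` is hypothetical and not part of the PR. Unlike `hostname`, this keeps the host's original letter case, which may or may not be desirable for deduplication.

```python
from urllib.parse import urlparse


def domain_without_auth(url: str) -> str:
    """Return host[:port] with any userinfo removed (hypothetical helper).

    The raw netloc is split at the last '@', so IPv6 brackets and an
    explicit port are left exactly as they appeared in the URL.
    """
    netloc = urlparse(url).netloc
    # rpartition returns ('', '', netloc) when no '@' is present,
    # so index 2 is always the host[:port] part.
    return netloc.rpartition('@')[2]
```

For example, `domain_without_auth('http://user:pass@[::1]:80/x')` keeps the brackets around `::1`, whereas formatting `hostname` and `port` back together would not.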
+    if root_parsed.hostname != child_parsed.hostname or root_parsed.port != child_parsed.port:
+        return url
P2: Strict port equality prevents auth propagation between same-origin URLs when one uses an explicit default port (None vs 80/443).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py, line 99:
<comment>Strict port equality prevents auth propagation between same-origin URLs when one uses an explicit default port (None vs 80/443).</comment>
<file context>
@@ -79,6 +79,38 @@ def find_all_urls(text: str):
+ return url # Child URL already has credentials
+
+ # Only inject if same host (hostname and port match)
+ if root_parsed.hostname != child_parsed.hostname or root_parsed.port != child_parsed.port:
+ return url
+
</file context>
-    if root_parsed.hostname != child_parsed.hostname or root_parsed.port != child_parsed.port:
-        return url
+    root_port = root_parsed.port or (443 if root_parsed.scheme == 'https' else 80 if root_parsed.scheme == 'http' else None)
+    child_port = child_parsed.port or (443 if child_parsed.scheme == 'https' else 80 if child_parsed.scheme == 'http' else None)
+    if root_parsed.hostname != child_parsed.hostname or root_port != child_port:
+        return url
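The port normalization above could also be factored into a small helper, so the same comparison is available to each parser plugin. This is only a sketch: `DEFAULT_PORTS`, `effective_port`, and `same_host_and_port` are hypothetical names, and, like the suggestion, the comparison ignores the scheme itself.

```python
from urllib.parse import urlparse

# Default ports for the schemes the suggestion above handles (assumption:
# other schemes have no default and compare as None).
DEFAULT_PORTS = {'http': 80, 'https': 443}


def effective_port(url: str):
    """Return the explicit port, or the scheme's default when none is given."""
    parsed = urlparse(url)
    if parsed.port is not None:
        return parsed.port
    return DEFAULT_PORTS.get(parsed.scheme)


def same_host_and_port(url_a: str, url_b: str) -> bool:
    """Compare hostnames and effective ports of two URLs."""
    host_a = urlparse(url_a).hostname
    host_b = urlparse(url_b).hostname
    return host_a == host_b and effective_port(url_a) == effective_port(url_b)
```

With this, `https://example.com/a` and `https://example.com:443/b` compare as the same endpoint even though `urlparse().port` is `None` for the first.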
 # Skip the source URL itself
 if cleaned_url != url:
+    # Propagate HTTP basic auth credentials to discovered URLs on the same host
+    cleaned_url = inject_url_auth(cleaned_url, url)
P2: Source URL can be re-added after auth injection because the skip check is performed before inject_url_auth, allowing the authenticated root URL to be reinserted into urls_found.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py, line 161:
<comment>Source URL can be re-added after auth injection because the skip check is performed before `inject_url_auth`, allowing the authenticated root URL to be reinserted into `urls_found`.</comment>
<file context>
@@ -125,6 +157,8 @@ def main(url: str, snapshot_id: str = None, crawl_id: str = None, depth: int = 0
# Skip the source URL itself
if cleaned_url != url:
+ # Propagate HTTP basic auth credentials to discovered URLs on the same host
+ cleaned_url = inject_url_auth(cleaned_url, url)
urls_found.add(cleaned_url)
</file context>
-cleaned_url = inject_url_auth(cleaned_url, url)
+cleaned_url = inject_url_auth(cleaned_url, url)
+if cleaned_url == url:
+    continue
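The ordering problem can be shown in isolation. The sketch below assumes a loop shaped like the one in the hunk; `collect_child_urls` is a hypothetical name, and the auth-injection function is passed in as a parameter so the example does not depend on the PR's `inject_url_auth`.

```python
def collect_child_urls(candidates, source_url, inject_auth):
    """Collect discovered URLs, skipping the source URL (hypothetical sketch).

    The skip check runs both before and after credential injection, so a
    credential-less copy of the source URL cannot be turned back into the
    source URL and re-added.
    """
    urls_found = set()
    for candidate in candidates:
        if candidate == source_url:
            continue
        rewritten = inject_auth(candidate, source_url)
        if rewritten == source_url:
            # Injection reproduced the authenticated source URL; skip it.
            continue
        urls_found.add(rewritten)
    return urls_found
```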
+        new_netloc = f'{auth}@{child_parsed.hostname}:{child_parsed.port}'
+    else:
+        new_netloc = f'{auth}@{child_parsed.hostname}'
P2: Reconstructed netloc omits required brackets for IPv6 hosts, producing invalid URLs when injecting credentials (e.g., user:pass@::1 instead of user:pass@[::1]).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py, line 61:
<comment>Reconstructed netloc omits required brackets for IPv6 hosts, producing invalid URLs when injecting credentials (e.g., user:pass@::1 instead of user:pass@[::1]).</comment>
<file context>
@@ -33,6 +33,38 @@
+ if root_parsed.password:
+ auth = f'{root_parsed.username}:{root_parsed.password}'
+ if child_parsed.port:
+ new_netloc = f'{auth}@{child_parsed.hostname}:{child_parsed.port}'
+ else:
+ new_netloc = f'{auth}@{child_parsed.hostname}'
</file context>
-    new_netloc = f'{auth}@{child_parsed.hostname}:{child_parsed.port}'
-else:
-    new_netloc = f'{auth}@{child_parsed.hostname}'
+host = child_parsed.hostname
+if host and ':' in host:
+    host = f'[{host}]'
+if child_parsed.port:
+    new_netloc = f'{auth}@{host}:{child_parsed.port}'
+else:
+    new_netloc = f'{auth}@{host}'
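The bracket handling in the suggestion above can be isolated into one function, which makes it easy to check against IPv6 and non-IPv6 inputs. `host_with_port` is a hypothetical name; the check `':' in host` relies on the fact that only IPv6 literals contain a colon once `urlparse` has separated the port.

```python
from urllib.parse import urlparse


def host_with_port(parsed) -> str:
    """Format hostname[:port] from a urlparse result, re-adding IPv6 brackets."""
    host = parsed.hostname or ''
    if ':' in host:
        # urlparse strips the brackets from IPv6 literals; restore them.
        host = f'[{host}]'
    if parsed.port is not None:
        return f'{host}:{parsed.port}'
    return host
```

Usage: `host_with_port(urlparse('http://[::1]:8080/x'))` yields a bracketed host, which can then be prefixed with the userinfo when rebuilding the netloc.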
+    # Rebuild netloc with credentials: user:pass@host or user:pass@host:port
+    auth = root_parsed.username
+    if root_parsed.password:
+        auth = f'{root_parsed.username}:{root_parsed.password}'
P2: Reinserting urlparse's decoded username/password directly into the netloc can corrupt URLs when credentials contain reserved characters (e.g., @, :), changing the userinfo/host boundary and breaking authentication. The new inject_url_auth should re-encode userinfo before rebuilding the URL.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py, line 59:
<comment>Reinserting `urlparse`'s decoded username/password directly into the netloc can corrupt URLs when credentials contain reserved characters (e.g., `@`, `:`), changing the userinfo/host boundary and breaking authentication. The new `inject_url_auth` should re-encode userinfo before rebuilding the URL.</comment>
<file context>
@@ -33,6 +33,38 @@
+ # Rebuild netloc with credentials: user:pass@host or user:pass@host:port
+ auth = root_parsed.username
+ if root_parsed.password:
+ auth = f'{root_parsed.username}:{root_parsed.password}'
+ if child_parsed.port:
+ new_netloc = f'{auth}@{child_parsed.hostname}:{child_parsed.port}'
</file context>
+        return url  # Child URL already has credentials
+
+    # Only inject if same host (hostname and port match)
+    if root_parsed.hostname != child_parsed.hostname or root_parsed.port != child_parsed.port:
P2: Auth injection can be skipped for the same endpoint when one URL includes an explicit default port and the other omits it, because urlparse().port is None when no port is specified.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At archivebox/plugins/parse_html_urls/on_Snapshot__70_parse_html_urls.py, line 114:
<comment>Auth injection can be skipped for the same endpoint when one URL includes an explicit default port and the other omits it, because urlparse().port is None when no port is specified.</comment>
<file context>
@@ -94,6 +94,38 @@ def fix_urljoin_bug(url: str, nesting_limit=5) -> str:
+ return url # Child URL already has credentials
+
+ # Only inject if same host (hostname and port match)
+ if root_parsed.hostname != child_parsed.hostname or root_parsed.port != child_parsed.port:
+ return url
+
</file context>
Good idea but needs fixes for all the cubic comments. 👍
Fixes #1517
Summary
This PR fixes: Feature Request: Better support for archiving URLs behind HTTP basic auth
Changes
Testing
Please review the changes carefully. The fix was verified against the existing test suite.
This PR was created with the assistance of Claude Sonnet 4.6 by Anthropic | effort: high. Happy to make any adjustments!
Summary by cubic
Improves archiving behind HTTP Basic Auth by propagating credentials to same-host links and stripping creds from URL dedupe helpers. Fixes depth > 0 crawls that failed auth and prevents credentialed URLs from breaking deduplication. Addresses #1517.
Written for commit 231fb4f. Summary will update on new commits.