danielalanbates · 2026-02-21T05:17:48Z

Summary

This PR fixes: Feature Request: Better support for archiving URLs behind HTTP basic auth

Changes

archivebox/misc/util.py                            |  5 +--
 .../on_Snapshot__70_parse_html_urls.py             | 41 +++++++++++++++++++++-
 .../on_Snapshot__72_parse_rss_urls.py              | 37 ++++++++++++++++++-
 .../on_Snapshot__71_parse_txt_urls.py              | 34 ++++++++++++++++++
 4 files changed, 113 insertions(+), 4 deletions(-)

Testing

Please review the changes carefully. The fix was verified against the existing test suite.

This PR was created with the assistance of Claude Sonnet 4.6 by Anthropic | effort: high. Happy to make any adjustments!

Summary by cubic

Improves archiving behind HTTP Basic Auth by propagating credentials to same-host links and stripping creds from URL dedupe helpers. Fixes depth > 0 crawls that failed auth and prevents credentialed URLs from breaking deduplication. Addresses #1517.

Bug Fixes
- HTML/RSS/TXT parsers: inject user:pass from the root URL into discovered URLs when hostname and port match; skip if child already has creds.
- URL utils: domain() now excludes credentials and includes port; added without_auth(); base_url() now ignores credentials so user:pass@host and host dedupe correctly.

Written for commit 231fb4f. Summary will update on new commits.

cubic-dev-ai

6 issues found across 4 files

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="archivebox/misc/util.py">

<violation number="1" location="archivebox/misc/util.py:44">
P2: Rebuilding netloc from hostname/port strips IPv6 brackets, so IPv6 URLs like http://[::1]:80 become http://::1:80, which is invalid and will break parsing/deduping.</violation>
</file>

<file name="archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py">

<violation number="1" location="archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py:99">
P2: Strict port equality prevents auth propagation between same-origin URLs when one uses an explicit default port (None vs 80/443).</violation>

<violation number="2" location="archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py:161">
P2: Source URL can be re-added after auth injection because the skip check is performed before `inject_url_auth`, allowing the authenticated root URL to be reinserted into `urls_found`.</violation>
</file>

<file name="archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py">

<violation number="1" location="archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py:59">
P2: Reinserting `urlparse`'s decoded username/password directly into the netloc can corrupt URLs when credentials contain reserved characters (e.g., `@`, `:`), changing the userinfo/host boundary and breaking authentication. The new `inject_url_auth` should re-encode userinfo before rebuilding the URL.</violation>

<violation number="2" location="archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py:61">
P2: Reconstructed netloc omits required brackets for IPv6 hosts, producing invalid URLs when injecting credentials (e.g., user:pass@::1 instead of user:pass@[::1]).</violation>
</file>

<file name="archivebox/plugins/parse_html_urls/on_Snapshot__70_parse_html_urls.py">

<violation number="1" location="archivebox/plugins/parse_html_urls/on_Snapshot__70_parse_html_urls.py:114">
P2: Auth injection can be skipped for the same endpoint when one URL includes an explicit default port and the other omits it, because urlparse().port is None when no port is specified.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

cubic-dev-ai · 2026-02-21T05:21:11Z

archivebox/misc/util.py

 path = lambda url: urlparse(url).path
 basename = lambda url: urlparse(url).path.rsplit('/', 1)[-1]
-domain = lambda url: urlparse(url).netloc
+domain = lambda url: (lambda p: f'{p.hostname}:{p.port}' if p.port else (p.hostname or p.netloc))(urlparse(url))  # returns host:port without HTTP basic auth credentials


P2: Rebuilding netloc from hostname/port strips IPv6 brackets, so IPv6 URLs like http://[::1]:80 become http://::1:80, which is invalid and will break parsing/deduping.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At archivebox/misc/util.py, line 44:
<comment>Rebuilding netloc from hostname/port strips IPv6 brackets, so IPv6 URLs like http://[::1]:80 become http://::1:80, which is invalid and will break parsing/deduping.</comment>
<file context>
@@ -41,11 +41,12 @@
 path = lambda url: urlparse(url).path
 basename = lambda url: urlparse(url).path.rsplit('/', 1)[-1]
-domain = lambda url: urlparse(url).netloc
+domain = lambda url: (lambda p: f'{p.hostname}:{p.port}' if p.port else (p.hostname or p.netloc))(urlparse(url))  # returns host:port without HTTP basic auth credentials
 query = lambda url: urlparse(url).query
 fragment = lambda url: urlparse(url).fragment
</file context>
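One way to avoid the bracket loss described above is to slice the credentials off the original netloc rather than rebuilding it from `hostname` and `port`. A minimal sketch, assuming a hypothetical helper name `domain_without_auth` that is not part of this PR:

```python
from urllib.parse import urlparse


def domain_without_auth(url: str) -> str:
    """Return host[:port] with any userinfo removed, keeping IPv6 brackets."""
    netloc = urlparse(url).netloc
    # Userinfo ends at the last '@'; everything after it is host[:port].
    # With no '@' present, rpartition leaves the whole netloc in the last slot.
    return netloc.rpartition('@')[2]


print(domain_without_auth('http://user:pass@[::1]:80/x'))
print(domain_without_auth('https://example.com/a'))
```

Unlike `urlparse(url).hostname`, this keeps the host exactly as written (it is not lowercased), so a dedupe helper built on it may still want to normalize case separately.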

cubic-dev-ai · 2026-02-21T05:21:11Z

archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py

+    if root_parsed.hostname != child_parsed.hostname or root_parsed.port != child_parsed.port:
+        return url


P2: Strict port equality prevents auth propagation between same-origin URLs when one uses an explicit default port (None vs 80/443).

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py, line 99:
<comment>Strict port equality prevents auth propagation between same-origin URLs when one uses an explicit default port (None vs 80/443).</comment>
<file context>
@@ -79,6 +79,38 @@ def find_all_urls(text: str):
+        return url  # Child URL already has credentials
+
+    # Only inject if same host (hostname and port match)
+    if root_parsed.hostname != child_parsed.hostname or root_parsed.port != child_parsed.port:
+        return url
+
</file context>

Suggested change

-    if root_parsed.hostname != child_parsed.hostname or root_parsed.port != child_parsed.port:
-        return url
+    root_port = root_parsed.port or (443 if root_parsed.scheme == 'https' else 80 if root_parsed.scheme == 'http' else None)
+    child_port = child_parsed.port or (443 if child_parsed.scheme == 'https' else 80 if child_parsed.scheme == 'http' else None)
+    if root_parsed.hostname != child_parsed.hostname or root_port != child_port:
+        return url

cubic-dev-ai · 2026-02-21T05:21:11Z

archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py

        # Skip the source URL itself
        if cleaned_url != url:
+            # Propagate HTTP basic auth credentials to discovered URLs on the same host
+            cleaned_url = inject_url_auth(cleaned_url, url)


P2: Source URL can be re-added after auth injection because the skip check is performed before inject_url_auth, allowing the authenticated root URL to be reinserted into urls_found.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py, line 161:
<comment>Source URL can be re-added after auth injection because the skip check is performed before `inject_url_auth`, allowing the authenticated root URL to be reinserted into `urls_found`.</comment>
<file context>
@@ -125,6 +157,8 @@ def main(url: str, snapshot_id: str = None, crawl_id: str = None, depth: int = 0
         # Skip the source URL itself
         if cleaned_url != url:
+            # Propagate HTTP basic auth credentials to discovered URLs on the same host
+            cleaned_url = inject_url_auth(cleaned_url, url)
             urls_found.add(cleaned_url)
</file context>

Suggested change

-            cleaned_url = inject_url_auth(cleaned_url, url)
+            cleaned_url = inject_url_auth(cleaned_url, url)
+            if cleaned_url == url:
+                continue
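The ordering problem can be illustrated in isolation. The sketch below is an assumption-laden toy, not the plugin's code: `collect_child_urls` and `naive_inject` are hypothetical names, and `naive_inject` is only a stand-in for `inject_url_auth` that copies the root's `user:pass@` prefix without any host check.

```python
def naive_inject(child_url: str, root_url: str) -> str:
    """Stand-in for inject_url_auth: copy the root's userinfo onto the child."""
    if '@' in child_url or '@' not in root_url:
        return child_url
    _, root_rest = root_url.split('://', 1)
    userinfo = root_rest.split('@', 1)[0]
    child_scheme, child_rest = child_url.split('://', 1)
    return f'{child_scheme}://{userinfo}@{child_rest}'


def collect_child_urls(candidates, source_url, inject=naive_inject):
    """Inject credentials first, then drop anything equal to the source URL."""
    found = set()
    for candidate in candidates:
        candidate = inject(candidate, source_url)
        if candidate == source_url:
            continue  # the page linked back to itself without credentials
        found.add(candidate)
    return found


root = 'http://user:pass@example.com/'
print(collect_child_urls(['http://example.com/', 'http://example.com/page'], root))
```

Comparing after injection means a bare self-link such as `http://example.com/` no longer slips past the skip check and re-enters the result set with credentials attached.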

cubic-dev-ai · 2026-02-21T05:21:12Z

archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py

+        new_netloc = f'{auth}@{child_parsed.hostname}:{child_parsed.port}'
+    else:
+        new_netloc = f'{auth}@{child_parsed.hostname}'
+


P2: Reconstructed netloc omits required brackets for IPv6 hosts, producing invalid URLs when injecting credentials (e.g., user:pass@::1 instead of user:pass@[::1]).

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py, line 61:
<comment>Reconstructed netloc omits required brackets for IPv6 hosts, producing invalid URLs when injecting credentials (e.g., user:pass@::1 instead of user:pass@[::1]).</comment>
<file context>
@@ -33,6 +33,38 @@
+    if root_parsed.password:
+        auth = f'{root_parsed.username}:{root_parsed.password}'
+    if child_parsed.port:
+        new_netloc = f'{auth}@{child_parsed.hostname}:{child_parsed.port}'
+    else:
+        new_netloc = f'{auth}@{child_parsed.hostname}'
</file context>

Suggested change

-        new_netloc = f'{auth}@{child_parsed.hostname}:{child_parsed.port}'
-    else:
-        new_netloc = f'{auth}@{child_parsed.hostname}'
+    host = child_parsed.hostname
+    if host and ':' in host:
+        host = f'[{host}]'
+    if child_parsed.port:
+        new_netloc = f'{auth}@{host}:{child_parsed.port}'
+    else:
+        new_netloc = f'{auth}@{host}'

cubic-dev-ai · 2026-02-21T05:21:12Z

archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py

+    # Rebuild netloc with credentials: user:pass@host or user:pass@host:port
+    auth = root_parsed.username
+    if root_parsed.password:
+        auth = f'{root_parsed.username}:{root_parsed.password}'


P2: Reinserting urlparse's decoded username/password directly into the netloc can corrupt URLs when credentials contain reserved characters (e.g., @, :), changing the userinfo/host boundary and breaking authentication. The new inject_url_auth should re-encode userinfo before rebuilding the URL.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py, line 59:
<comment>Reinserting `urlparse`'s decoded username/password directly into the netloc can corrupt URLs when credentials contain reserved characters (e.g., `@`, `:`), changing the userinfo/host boundary and breaking authentication. The new `inject_url_auth` should re-encode userinfo before rebuilding the URL.</comment>
<file context>
@@ -33,6 +33,38 @@
+    # Rebuild netloc with credentials: user:pass@host or user:pass@host:port
+    auth = root_parsed.username
+    if root_parsed.password:
+        auth = f'{root_parsed.username}:{root_parsed.password}'
+    if child_parsed.port:
+        new_netloc = f'{auth}@{child_parsed.hostname}:{child_parsed.port}'
</file context>
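A minimal sketch of the re-encoding step, assuming the credentials are available as plain (already decoded) strings. `inject_encoded_auth` is a hypothetical name, not the PR's `inject_url_auth`. Whether the values coming out of `urlparse` are decoded is worth verifying first: if they are still percent-encoded, quoting them again would double-encode them, so they would need `urllib.parse.unquote` beforehand.

```python
from typing import Optional
from urllib.parse import quote, urlparse, urlunparse


def inject_encoded_auth(child_url: str, username: str, password: Optional[str] = None) -> str:
    """Prefix child_url's host with percent-encoded credentials."""
    child = urlparse(child_url)
    # safe='' also encodes '/', so '@', ':' and '/' inside the credentials
    # cannot be mistaken for URL delimiters.
    userinfo = quote(username, safe='')
    if password is not None:
        userinfo += ':' + quote(password, safe='')
    # Reuse the child's own host[:port] text so IPv6 brackets are preserved.
    host_port = child.netloc.rpartition('@')[2]
    return urlunparse(child._replace(netloc=f'{userinfo}@{host_port}'))


print(inject_encoded_auth('http://[::1]:8080/feed.xml', 'us@er', 'p:ss'))
```

Reusing the child's original `host[:port]` substring also sidesteps the IPv6 bracket issue raised in the neighbouring comment, since the host is never rebuilt from `hostname`.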

cubic-dev-ai · 2026-02-21T05:21:12Z

archivebox/plugins/parse_html_urls/on_Snapshot__70_parse_html_urls.py

+        return url  # Child URL already has credentials
+
+    # Only inject if same host (hostname and port match)
+    if root_parsed.hostname != child_parsed.hostname or root_parsed.port != child_parsed.port:


P2: Auth injection can be skipped for the same endpoint when one URL includes an explicit default port and the other omits it, because urlparse().port is None when no port is specified.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At archivebox/plugins/parse_html_urls/on_Snapshot__70_parse_html_urls.py, line 114:
<comment>Auth injection can be skipped for the same endpoint when one URL includes an explicit default port and the other omits it, because urlparse().port is None when no port is specified.</comment>
<file context>
@@ -94,6 +94,38 @@ def fix_urljoin_bug(url: str, nesting_limit=5) -> str:
+        return url  # Child URL already has credentials
+
+    # Only inject if same host (hostname and port match)
+    if root_parsed.hostname != child_parsed.hostname or root_parsed.port != child_parsed.port:
+        return url
+
</file context>
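The comparison could be made port-aware by normalizing a missing port to the scheme's default before checking equality. A sketch under that assumption, with hypothetical helper names `effective_port` and `same_endpoint` (neither exists in the PR):

```python
from urllib.parse import urlparse

# Only the two schemes the parsers are expected to meet; others get None.
DEFAULT_PORTS = {'http': 80, 'https': 443}


def effective_port(parsed):
    """Explicit port if present, otherwise the scheme's default (or None)."""
    return parsed.port or DEFAULT_PORTS.get(parsed.scheme)


def same_endpoint(url_a: str, url_b: str) -> bool:
    """True when both URLs resolve to the same hostname and effective port."""
    a, b = urlparse(url_a), urlparse(url_b)
    return a.hostname == b.hostname and effective_port(a) == effective_port(b)


print(same_endpoint('http://example.com/a', 'http://example.com:80/b'))
print(same_endpoint('https://example.com/', 'https://example.com:8443/'))
```

This compares host and port only, so `http://example.com/` and `https://example.com/` count as different endpoints (80 vs 443), which keeps credentials from crossing schemes by default.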

pirate · 2026-02-24T23:38:50Z

Good idea but needs fixes for all the cubic comments. 👍

Fix ArchiveBox#1517: Feature Request: Better support for archiving URLs beind HTT

231fb4f

Fix #1517: Feature Request: Better support for archiving URLs beind HTT #1766

danielalanbates wants to merge 1 commit into ArchiveBox:dev from danielalanbates:fix/issue-1517