Fix #1517: Feature Request: Better support for archiving URLs behind HTT… #1766

Open
danielalanbates wants to merge 1 commit into ArchiveBox:dev from danielalanbates:fix/issue-1517

@danielalanbates danielalanbates commented Feb 21, 2026

Fixes #1517

Summary

This PR fixes: Feature Request: Better support for archiving URLs behind HTTP basic auth

Changes

archivebox/misc/util.py                            |  5 +--
 .../on_Snapshot__70_parse_html_urls.py             | 41 +++++++++++++++++++++-
 .../on_Snapshot__72_parse_rss_urls.py              | 37 ++++++++++++++++++-
 .../on_Snapshot__71_parse_txt_urls.py              | 34 ++++++++++++++++++
 4 files changed, 113 insertions(+), 4 deletions(-)

Testing

Please review the changes carefully. The fix was verified against the existing test suite.


This PR was created with the assistance of Claude Sonnet 4.6 by Anthropic | effort: high. Happy to make any adjustments!


Summary by cubic

Improves archiving behind HTTP Basic Auth by propagating credentials to same-host links and stripping creds from URL dedupe helpers. Fixes depth > 0 crawls that failed auth and prevents credentialed URLs from breaking deduplication. Addresses #1517.

  • Bug Fixes
    • HTML/RSS/TXT parsers: inject user:pass from the root URL into discovered URLs when hostname and port match; skip if child already has creds.
    • URL utils: domain() now excludes credentials and includes port; added without_auth(); base_url() now ignores credentials so user:pass@host and host dedupe correctly.

Written for commit 231fb4f. Summary will update on new commits.

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

6 issues found across 4 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="archivebox/misc/util.py">

<violation number="1" location="archivebox/misc/util.py:44">
P2: Rebuilding netloc from hostname/port strips IPv6 brackets, so IPv6 URLs like http://[::1]:80 become http://::1:80, which is invalid and will break parsing/deduping.</violation>
</file>

<file name="archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py">

<violation number="1" location="archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py:99">
P2: Strict port equality prevents auth propagation between same-origin URLs when one uses an explicit default port (None vs 80/443).</violation>

<violation number="2" location="archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py:161">
P2: Source URL can be re-added after auth injection because the skip check is performed before `inject_url_auth`, allowing the authenticated root URL to be reinserted into `urls_found`.</violation>
</file>

<file name="archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py">

<violation number="1" location="archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py:59">
P2: Reinserting `urlparse`'s decoded username/password directly into the netloc can corrupt URLs when credentials contain reserved characters (e.g., `@`, `:`), changing the userinfo/host boundary and breaking authentication. The new `inject_url_auth` should re-encode userinfo before rebuilding the URL.</violation>

<violation number="2" location="archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py:61">
P2: Reconstructed netloc omits required brackets for IPv6 hosts, producing invalid URLs when injecting credentials (e.g., user:pass@::1 instead of user:pass@[::1]).</violation>
</file>

<file name="archivebox/plugins/parse_html_urls/on_Snapshot__70_parse_html_urls.py">

<violation number="1" location="archivebox/plugins/parse_html_urls/on_Snapshot__70_parse_html_urls.py:114">
P2: Auth injection can be skipped for the same endpoint when one URL includes an explicit default port and the other omits it, because urlparse().port is None when no port is specified.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

 path = lambda url: urlparse(url).path
 basename = lambda url: urlparse(url).path.rsplit('/', 1)[-1]
-domain = lambda url: urlparse(url).netloc
+domain = lambda url: (lambda p: f'{p.hostname}:{p.port}' if p.port else (p.hostname or p.netloc))(urlparse(url))  # returns host:port without HTTP basic auth credentials

Contributor

@cubic-dev-ai cubic-dev-ai bot Feb 21, 2026

P2: Rebuilding netloc from hostname/port strips IPv6 brackets, so IPv6 URLs like http://[::1]:80 become http://::1:80, which is invalid and will break parsing/deduping.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At archivebox/misc/util.py, line 44:

<comment>Rebuilding netloc from hostname/port strips IPv6 brackets, so IPv6 URLs like http://[::1]:80 become http://::1:80, which is invalid and will break parsing/deduping.</comment>

<file context>
@@ -41,11 +41,12 @@
 path = lambda url: urlparse(url).path
 basename = lambda url: urlparse(url).path.rsplit('/', 1)[-1]
-domain = lambda url: urlparse(url).netloc
+domain = lambda url: (lambda p: f'{p.hostname}:{p.port}' if p.port else (p.hostname or p.netloc))(urlparse(url))  # returns host:port without HTTP basic auth credentials
 query = lambda url: urlparse(url).query
 fragment = lambda url: urlparse(url).fragment
</file context>
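
One possible way to address this, sketched under the assumption that `domain()` only needs to drop the `user:pass@` prefix: slice the userinfo off `netloc` instead of rebuilding it from `hostname` and `port`, so the brackets around an IPv6 literal are never lost. The standalone `domain` function below is illustrative and is not the code in this PR.

```python
from urllib.parse import urlsplit


def domain(url: str) -> str:
    """Return host[:port] without userinfo (illustrative sketch).

    netloc has the shape "[userinfo@]host[:port]", so everything after the
    last "@" is the host and port exactly as written, brackets included.
    """
    return urlsplit(url).netloc.rpartition('@')[2]


print(domain('http://user:pass@[::1]:80/feed'))   # [::1]:80
print(domain('https://user:pass@example.com/'))   # example.com
print(domain('https://example.com:8443/'))        # example.com:8443
```

Unlike the `p.hostname`-based version in the diff, this keeps the host's original letter case; lowercasing the result would be a separate decision.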

Comment on lines +99 to +100
    if root_parsed.hostname != child_parsed.hostname or root_parsed.port != child_parsed.port:
        return url

Contributor

@cubic-dev-ai cubic-dev-ai bot Feb 21, 2026

P2: Strict port equality prevents auth propagation between same-origin URLs when one uses an explicit default port (None vs 80/443).

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py, line 99:

<comment>Strict port equality prevents auth propagation between same-origin URLs when one uses an explicit default port (None vs 80/443).</comment>

<file context>
@@ -79,6 +79,38 @@ def find_all_urls(text: str):
+        return url  # Child URL already has credentials
+
+    # Only inject if same host (hostname and port match)
+    if root_parsed.hostname != child_parsed.hostname or root_parsed.port != child_parsed.port:
+        return url
+
</file context>
Suggested change
-    if root_parsed.hostname != child_parsed.hostname or root_parsed.port != child_parsed.port:
-        return url
+    root_port = root_parsed.port or (443 if root_parsed.scheme == 'https' else 80 if root_parsed.scheme == 'http' else None)
+    child_port = child_parsed.port or (443 if child_parsed.scheme == 'https' else 80 if child_parsed.scheme == 'http' else None)
+    if root_parsed.hostname != child_parsed.hostname or root_port != child_port:
+        return url
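
The same normalization could be factored into a small helper shared by the three parsers rather than repeated inline. A minimal sketch, assuming only `http` and `https` need default ports; `DEFAULT_PORTS`, `effective_port`, and `same_host` are hypothetical names that do not appear in the PR.

```python
from urllib.parse import SplitResult, urlsplit

# Assumption: only these two schemes need a default port here.
DEFAULT_PORTS = {'http': 80, 'https': 443}


def effective_port(parts: SplitResult):
    """Explicit port if one was given, else the scheme default (None if unknown)."""
    return parts.port or DEFAULT_PORTS.get(parts.scheme)


def same_host(root_url: str, child_url: str) -> bool:
    """True when both URLs point at the same hostname and effective port."""
    root, child = urlsplit(root_url), urlsplit(child_url)
    return (
        root.hostname == child.hostname
        and effective_port(root) == effective_port(child)
    )


print(same_host('https://user:pass@example.com/', 'https://example.com:443/page'))  # True
print(same_host('https://example.com/', 'https://example.com:8443/'))               # False
```

Whether the scheme itself should also have to match before credentials are propagated is left open here.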

        # Skip the source URL itself
        if cleaned_url != url:
            # Propagate HTTP basic auth credentials to discovered URLs on the same host
            cleaned_url = inject_url_auth(cleaned_url, url)

Contributor

@cubic-dev-ai cubic-dev-ai bot Feb 21, 2026

P2: Source URL can be re-added after auth injection because the skip check is performed before inject_url_auth, allowing the authenticated root URL to be reinserted into urls_found.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py, line 161:

<comment>Source URL can be re-added after auth injection because the skip check is performed before `inject_url_auth`, allowing the authenticated root URL to be reinserted into `urls_found`.</comment>

<file context>
@@ -125,6 +157,8 @@ def main(url: str, snapshot_id: str = None, crawl_id: str = None, depth: int = 0
         # Skip the source URL itself
         if cleaned_url != url:
+            # Propagate HTTP basic auth credentials to discovered URLs on the same host
+            cleaned_url = inject_url_auth(cleaned_url, url)
             urls_found.add(cleaned_url)
 
</file context>
Suggested change
-            cleaned_url = inject_url_auth(cleaned_url, url)
+            cleaned_url = inject_url_auth(cleaned_url, url)
+            if cleaned_url == url:
+                continue
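
To make the ordering problem concrete, here is a small self-contained sketch that injects first and compares afterwards. The `inject_url_auth` below is a simplified stand-in (hostname match only) and `collect_urls` is a hypothetical wrapper; neither is the PR's actual code.

```python
from urllib.parse import urlsplit, urlunsplit


def inject_url_auth(child_url: str, root_url: str) -> str:
    """Simplified stand-in: copy the root's userinfo onto a same-host child."""
    root, child = urlsplit(root_url), urlsplit(child_url)
    userinfo, has_auth, _ = root.netloc.rpartition('@')
    if not has_auth or '@' in child.netloc or root.hostname != child.hostname:
        return child_url
    return urlunsplit(child._replace(netloc=f'{userinfo}@{child.netloc}'))


def collect_urls(source_url: str, candidates) -> set:
    """Inject credentials first, then skip anything equal to the source URL."""
    urls_found = set()
    for candidate in candidates:
        candidate = inject_url_auth(candidate, source_url)
        if candidate == source_url:
            continue  # the page linked to itself without credentials
        urls_found.add(candidate)
    return urls_found


found = collect_urls(
    'http://user:pass@example.com/feed',
    ['http://example.com/feed', 'http://example.com/post/1', 'http://other.org/x'],
)
print(sorted(found))
```

With the check placed before injection, the first candidate would pass the `!=` test and then be rewritten into the source URL itself.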

Comment on lines +61 to +64
        new_netloc = f'{auth}@{child_parsed.hostname}:{child_parsed.port}'
    else:
        new_netloc = f'{auth}@{child_parsed.hostname}'

Contributor

@cubic-dev-ai cubic-dev-ai bot Feb 21, 2026

P2: Reconstructed netloc omits required brackets for IPv6 hosts, producing invalid URLs when injecting credentials (e.g., user:pass@::1 instead of user:pass@[::1]).

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py, line 61:

<comment>Reconstructed netloc omits required brackets for IPv6 hosts, producing invalid URLs when injecting credentials (e.g., user:pass@::1 instead of user:pass@[::1]).</comment>

<file context>
@@ -33,6 +33,38 @@
+    if root_parsed.password:
+        auth = f'{root_parsed.username}:{root_parsed.password}'
+    if child_parsed.port:
+        new_netloc = f'{auth}@{child_parsed.hostname}:{child_parsed.port}'
+    else:
+        new_netloc = f'{auth}@{child_parsed.hostname}'
</file context>
Suggested change
-    new_netloc = f'{auth}@{child_parsed.hostname}:{child_parsed.port}'
-else:
-    new_netloc = f'{auth}@{child_parsed.hostname}'
+host = child_parsed.hostname
+if host and ':' in host:
+    host = f'[{host}]'
+if child_parsed.port:
+    new_netloc = f'{auth}@{host}:{child_parsed.port}'
+else:
+    new_netloc = f'{auth}@{host}'

    # Rebuild netloc with credentials: user:pass@host or user:pass@host:port
    auth = root_parsed.username
    if root_parsed.password:
        auth = f'{root_parsed.username}:{root_parsed.password}'

Contributor

@cubic-dev-ai cubic-dev-ai bot Feb 21, 2026

P2: Reinserting urlparse's decoded username/password directly into the netloc can corrupt URLs when credentials contain reserved characters (e.g., @, :), changing the userinfo/host boundary and breaking authentication. The new inject_url_auth should re-encode userinfo before rebuilding the URL.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py, line 59:

<comment>Reinserting `urlparse`'s decoded username/password directly into the netloc can corrupt URLs when credentials contain reserved characters (e.g., `@`, `:`), changing the userinfo/host boundary and breaking authentication. The new `inject_url_auth` should re-encode userinfo before rebuilding the URL.</comment>

<file context>
@@ -33,6 +33,38 @@
+    # Rebuild netloc with credentials: user:pass@host or user:pass@host:port
+    auth = root_parsed.username
+    if root_parsed.password:
+        auth = f'{root_parsed.username}:{root_parsed.password}'
+    if child_parsed.port:
+        new_netloc = f'{auth}@{child_parsed.hostname}:{child_parsed.port}'
</file context>
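
A hedged sketch of one way to rebuild the netloc safely. It percent-encodes the credentials and reuses the child's existing `host[:port]` text, which also sidesteps the IPv6 bracket problem. `encode_userinfo` and `with_auth` are hypothetical names. The sketch assumes credentials arrive either raw or already percent-encoded (a raw password containing a literal `%41`-style sequence would be misread). As far as I can tell, Python's `urllib.parse` returns `username` and `password` as written in the URL, without decoding, so the unquote-then-quote step is there to normalize both forms.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def encode_userinfo(username: str, password=None) -> str:
    """Percent-encode credentials so '@', ':' and '/' cannot shift the host boundary."""
    # unquote first so input that is already encoded is not double-encoded
    auth = quote(unquote(username), safe='')
    if password is not None:
        auth += ':' + quote(unquote(password), safe='')
    return auth


def with_auth(url: str, username: str, password=None) -> str:
    """Return url with the given credentials placed in front of its host[:port]."""
    parts = urlsplit(url)
    # Everything after the last "@" is host[:port] as written, brackets included.
    host_port = parts.netloc.rpartition('@')[2]
    netloc = f'{encode_userinfo(username, password)}@{host_port}'
    return urlunsplit(parts._replace(netloc=netloc))


result = with_auth('http://[::1]:8080/feed.xml', 'us@er', 'p:ss/word')
print(result)                      # http://us%40er:p%3Ass%2Fword@[::1]:8080/feed.xml
print(urlsplit(result).hostname)   # ::1
```

Parsing the result again still yields the intended host and port, and `unquote()` on the parsed username recovers the original value.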

        return url  # Child URL already has credentials

    # Only inject if same host (hostname and port match)
    if root_parsed.hostname != child_parsed.hostname or root_parsed.port != child_parsed.port:

Contributor

@cubic-dev-ai cubic-dev-ai bot Feb 21, 2026

P2: Auth injection can be skipped for the same endpoint when one URL includes an explicit default port and the other omits it, because urlparse().port is None when no port is specified.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At archivebox/plugins/parse_html_urls/on_Snapshot__70_parse_html_urls.py, line 114:

<comment>Auth injection can be skipped for the same endpoint when one URL includes an explicit default port and the other omits it, because urlparse().port is None when no port is specified.</comment>

<file context>
@@ -94,6 +94,38 @@ def fix_urljoin_bug(url: str, nesting_limit=5) -> str:
+        return url  # Child URL already has credentials
+
+    # Only inject if same host (hostname and port match)
+    if root_parsed.hostname != child_parsed.hostname or root_parsed.port != child_parsed.port:
+        return url
+
</file context>

@pirate
Member

pirate commented Feb 24, 2026

Good idea but needs fixes for all the cubic comments. 👍
