@fblgit fblgit commented May 27, 2025

Description

  1. Fixes indexing of large Slack workspaces, where both channel listing and history fetching hit rate limits.
  2. Includes GitHub organization repositories in the repository listing.

Motivation and Context

The project is nice, and this change allows large Slack workspaces to be indexed properly by taking rate limits and retries into account.
It also adds support for organization-based GitHub repositories.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance improvement (non-breaking change which enhances performance)
  • Documentation update
  • Breaking change (fix or feature that would cause existing functionality to change)

Testing

  • I have tested these changes locally
  • I have added/updated unit tests
  • I have added/updated integration tests

Checklist:

  • My code follows the code style of this project
  • My change requires documentation updates
  • I have updated the documentation accordingly
  • My change requires dependency updates
  • I have updated the dependencies accordingly
  • My code builds clean without any errors or warnings
  • All new and existing tests passed

Summary by CodeRabbit

  • New Features

    • Expanded repository listing to include all accessible GitHub repositories, not just owned ones.
  • Bug Fixes

    • Improved Slack integration with robust handling of API rate limits and errors.
    • Enhanced Slack channel listing to provide detailed channel information and better error handling.
    • Simplified Slack channel membership checks, reducing unnecessary API calls.
  • Tests

    • Added comprehensive unit tests for both GitHub and Slack connectors, covering edge cases and error scenarios.

The `get_all_channels` method in `slack_history.py` was making paginated
requests to `conversations.list` without any delay, leading to HTTP 429
errors when fetching channels from large Slack workspaces.

This commit introduces the following changes:
- Adds a 3-second delay between paginated calls to `conversations.list`
  to comply with Slack's Tier 2 rate limits (approx. 20 requests/minute).
- Implements handling for the `Retry-After` header when a 429 error is
  received. The system will wait for the specified duration before
  retrying. If the header is missing or invalid, a default of 60 seconds
  is used.
- Adds comprehensive unit tests to verify the new delay and retry logic,
  covering scenarios with and without the `Retry-After` header, as well
  as other API errors.
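
A minimal sketch of the delay and retry behaviour described above, not the code from `slack_history.py`. The `RateLimited` exception, the `fetch_page` callable, and the flat `{"channels": [...], "next_cursor": ...}` page shape are assumptions made for illustration. The 3-second figure follows from the stated limit of roughly 20 requests per minute (60 / 20 = 3).

```python
import time
from typing import Any, Callable, Dict, List, Optional


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 error carrying response headers."""

    def __init__(self, headers: Dict[str, str]):
        super().__init__("rate limited")
        self.headers = headers


def delay_for(requests_per_minute: int) -> float:
    """Seconds to wait between calls to stay within a per-minute budget."""
    return 60.0 / requests_per_minute


def parse_retry_after(headers: Dict[str, str], default: int = 60) -> int:
    """Return Retry-After in seconds, or `default` if it is missing or invalid."""
    value = headers.get("Retry-After", "")
    return int(value) if value.isdigit() else default


def fetch_all_pages(
    fetch_page: Callable[[Optional[str]], Dict[str, Any]],
    page_delay: float = delay_for(20),
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, Any]]:
    """Collect every page, pausing between pages and honouring Retry-After."""
    items: List[Dict[str, Any]] = []
    cursor: Optional[str] = None
    while True:
        try:
            page = fetch_page(cursor)
        except RateLimited as exc:
            # Wait for the server-specified duration, then retry the same cursor.
            sleep(parse_retry_after(exc.headers))
            continue
        items.extend(page.get("channels", []))
        cursor = page.get("next_cursor") or None
        if not cursor:
            return items
        sleep(page_delay)  # proactive delay before requesting the next page
```

Passing `sleep` as a parameter is a design choice for this sketch only: it lets a test record the requested delays instead of actually waiting.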
This commit includes two main improvements:

1. Slack Connector (`slack_history.py`):
   - Addresses API rate limiting for `conversations.list` by introducing a 3-second delay between paginated calls.
   - Implements handling for the `Retry-After` header when HTTP 429 errors occur.
   - Fixes a `SyntaxError` caused by a non-printable character accidentally introduced in a previous modification.
   - Adds comprehensive unit tests for the rate limiting and retry logic in `test_slack_history.py`.

2. GitHub Connector (`github_connector.py`):
   - Modifies `get_user_repositories` to fetch all repositories accessible to the authenticated user (including organization repositories) by changing the API call parameter from `type='owner'` to `type='all'`.
   - Adds unit tests in `test_github_connector.py` to verify this change and other connector functionalities.
Here's a rundown of what I did:

Fix: Robust Slack rate limiting, error handling & GitHub org repos

This update delivers comprehensive improvements to Slack connector stability and enhances the GitHub connector.

**Slack Connector (`slack_history.py`, `connectors_indexing_tasks.py`):**
- I've implemented proactive delays (1.2s for `conversations.history`, 3s for `conversations.list` pagination) and `Retry-After` header handling for 429 rate limit errors across `conversations.list`, `conversations.history`, and `users.info` API calls.
- I now gracefully handle `not_in_channel` errors when fetching conversation history by logging a warning and skipping the channel.
- I've refactored channel info fetching: `get_all_channels` now returns richer channel data (including `is_member`, `is_private`).
- I've removed direct calls to `conversations.info` from `connectors_indexing_tasks.py`, using the richer data from `get_all_channels` instead, to prevent associated rate limits.
- I corrected a `SyntaxError` (non-printable character) in `slack_history.py`.
- I've enhanced logging for rate limit actions, delays, and errors.
- I've updated unit tests in `test_slack_history.py` to cover all new logic.
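
The `not_in_channel` handling from the list above could look roughly like this. It is a sketch under assumptions: `ApiError` is a hypothetical stand-in for the SDK's error type, and `fetch_history` is a hypothetical callable that returns a channel's messages.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)


class ApiError(Exception):
    """Hypothetical stand-in for an API error exposing a response payload."""

    def __init__(self, response: Dict[str, Any]):
        super().__init__(response.get("error", "unknown_error"))
        self.response = response


def history_or_skip(
    fetch_history: Callable[[str], List[Dict[str, Any]]], channel_id: str
) -> List[Dict[str, Any]]:
    """Return the channel history, or [] when the bot is not in the channel."""
    try:
        return fetch_history(channel_id)
    except ApiError as exc:
        if exc.response.get("error") == "not_in_channel":
            logger.warning("Not a member of channel %s; skipping.", channel_id)
            return []
        raise  # any other API error still propagates to the caller
```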

**GitHub Connector (`github_connector.py`):**
- I've modified `get_user_repositories` to fetch all repositories accessible to the authenticated user (owned, collaborated, organization) by changing the API call parameter from `type='owner'` to `type='all'`.
- I've included unit tests in `test_github_connector.py` for this change.
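
To illustrate the `type='owner'` to `type='all'` change, here is a hedged sketch rather than the connector's actual code. The `client` is assumed to be any object exposing a `repositories(type=..., sort=...)` method, and the returned field names (`full_name`, `description`, `updated_at`) are illustrative choices.

```python
from typing import Any, Dict, List


def list_accessible_repositories(client: Any) -> List[Dict[str, Any]]:
    """Return a summary dict for every repository the token can access."""
    repos = []
    # type="all" covers owned, collaborator, and organization repositories;
    # type="owner" would return only repositories owned by the user.
    for repo in client.repositories(type="all", sort="updated"):
        repos.append(
            {
                "full_name": repo.full_name,
                # A missing description becomes an empty string.
                "description": repo.description or "",
                # A missing timestamp is kept as None.
                "updated_at": repo.updated_at.isoformat() if repo.updated_at else None,
            }
        )
    return repos
```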
This commit addresses recurring `SyntaxError: invalid non-printable character U+001B`
errors in `surfsense_backend/app/connectors/slack_history.py`.

The file was cleaned to remove all occurrences of the
U+001B (ESCAPE) character. This ensures that previously introduced
problematic control characters are fully removed, allowing the application
to parse and load the module correctly.
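
A cleanup like the one described can be sketched as follows. The helper names are hypothetical, and this is not necessarily how the file was actually cleaned.

```python
from pathlib import Path

ESC = "\x1b"  # U+001B, the ESCAPE control character


def strip_escape_chars(source: str) -> str:
    """Remove every U+001B character from the given text."""
    return source.replace(ESC, "")


def clean_file(path: Path) -> int:
    """Rewrite the file without ESC characters; return how many were removed."""
    text = path.read_text(encoding="utf-8")
    count = text.count(ESC)
    if count:
        path.write_text(strip_escape_chars(text), encoding="utf-8")
    return count
```

Note that this removes only the ESC byte itself. If the character arrived as part of a terminal colour sequence, the trailing characters of that sequence (for example `[0m`) remain in the file and still need manual review.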

vercel bot commented May 27, 2025

@google-labs-jules[bot] is attempting to deploy a commit to the Rohan Verma's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai bot commented May 27, 2025

Walkthrough

The updates broaden GitHub repository fetching to include all accessible repositories, enhance Slack API integration with robust rate limit and error handling, and update Slack channel data structures. Associated test suites for both connectors are added, and Slack channel iteration is refactored to use richer channel objects and simplified membership checks.

Changes

| File(s) | Change Summary |
|---|---|
| surfsense_backend/app/connectors/github_connector.py | Broadened get_user_repositories to fetch all accessible repositories (type='all') instead of only owned ones. |
| surfsense_backend/app/connectors/slack_history.py | Enhanced Slack API methods for rate limit handling, error management, and richer channel data structures. |
| surfsense_backend/app/connectors/test_github_connector.py | Added comprehensive tests for GitHubConnector, covering repository retrieval and error handling. |
| surfsense_backend/app/connectors/test_slack_history.py | Added comprehensive tests for SlackHistory, focusing on pagination, rate limits, and error handling. |
| surfsense_backend/app/tasks/connectors_indexing_tasks.py | Refactored Slack channel iteration to use channel objects and simplified membership checks. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant GitHubConnector
    participant GitHubAPI

    User->>GitHubConnector: get_user_repositories()
    GitHubConnector->>GitHubAPI: List repositories (type='all', sort='updated')
    GitHubAPI-->>GitHubConnector: Return all accessible repositories
    GitHubConnector-->>User: Return list of repository dicts
sequenceDiagram
    participant SlackHistory
    participant SlackAPI
    participant Logger

    SlackHistory->>SlackAPI: conversations.list (with cursor)
    alt Rate limit (429)
        SlackAPI-->>SlackHistory: 429 error with Retry-After
        SlackHistory->>Logger: Log rate limit
        SlackHistory->>SlackAPI: Retry after delay
    else Success
        SlackAPI-->>SlackHistory: Channel data page
        SlackHistory->>SlackAPI: (repeat if next_cursor)
    end
    SlackHistory-->>Caller: List of channel dicts
sequenceDiagram
    participant SlackHistory
    participant SlackAPI
    participant Logger

    SlackHistory->>SlackAPI: conversations.history (with cursor)
    alt Rate limit (429)
        SlackAPI-->>SlackHistory: 429 error with Retry-After
        SlackHistory->>Logger: Log rate limit
        SlackHistory->>SlackAPI: Retry after delay
    else Not in channel
        SlackAPI-->>SlackHistory: not_in_channel error
        SlackHistory->>Logger: Log warning
        SlackHistory-->>Caller: []
    else Success
        SlackAPI-->>SlackHistory: Message data page
        SlackHistory->>SlackAPI: (repeat if next_cursor)
    end
    SlackHistory-->>Caller: List of messages

Poem

In the meadow of code where connectors play,
GitHub and Slack now fetch in a broader way.
With tests that hop and logging that sings,
Rate limits and errors are tamed with new wings.
Channels and repos, all gathered with care—
A rabbit’s delight in springtime air!
🐇✨

recurseml bot commented May 27, 2025

⚠️ Only 5 files will be analyzed due to processing limits.

kwargs["cursor"] = next_cursor

result = self.client.conversations_history(**kwargs)
current_api_call_successful = False

The result variable is explicitly set to None on line 169 after current_api_call_successful is set to False, but the code attempts to access result['messages'] on line 192 without checking if result is None, which could cause a NoneType error.




# Process each channel
-for channel_name, channel_id in channels.items():
+for channel_obj in channels: # Modified loop to iterate over list of channel objects

The code lacks validation for the expected dictionary keys ('id', 'name', 'is_private', 'is_member') in the channel_obj. If any of these keys are missing, the code will raise KeyError exceptions. The previous version didn't require these checks as it was using a simpler key-value structure. Should add validation or use .get() with default values.
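
One way to apply the `.get()` suggestion is sketched below. The choice of defaults is an assumption made here, not something the review specifies: an object without an `id` is skipped, a missing `name` falls back to the id, and the two flags default to `False`.

```python
from typing import Any, Dict, Optional


def normalize_channel(channel_obj: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a channel dict with all expected keys, or None if it has no id."""
    channel_id = channel_obj.get("id")
    if not channel_id:
        return None  # a channel without an id cannot be indexed
    return {
        "id": channel_id,
        "name": channel_obj.get("name", channel_id),
        "is_private": bool(channel_obj.get("is_private", False)),
        "is_member": bool(channel_obj.get("is_member", False)),
    }
```

A loop over `channels` could then call `normalize_channel(channel_obj)` first and `continue` when it returns `None`, so a malformed entry is skipped instead of raising `KeyError`.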



recurseml bot commented May 27, 2025

😱 Found 2 issues. Time to roll up your sleeves! 😱

fblgit commented May 27, 2025

@MODSetter will this fix be a good fit to the codebase?

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (6)
surfsense_backend/app/connectors/test_github_connector.py (2)

2-2: Remove unused import.

The call import from unittest.mock is not used anywhere in the test file.

Apply this diff to remove the unused import:

-from unittest.mock import patch, Mock, call
+from unittest.mock import patch, Mock
🧰 Tools
🪛 Ruff (0.11.9)

2-2: unittest.mock.call imported but unused

Remove unused import: unittest.mock.call

(F401)


7-7: Fix typo in comment.

There's a typo in the comment: "surfsend_backend" should be "surfsense_backend".

Apply this diff to fix the typo:

-# Assuming surfsend_backend/app/connectors/test_github_connector.py
+# Assuming surfsense_backend/app/connectors/test_github_connector.py
surfsense_backend/app/connectors/slack_history.py (3)

13-14: Remove unused imports.

The imports timedelta and Union are not used in this file.

-from datetime import datetime, timedelta
+from datetime import datetime
-from typing import Dict, List, Optional, Tuple, Any, Union
+from typing import Dict, List, Optional, Tuple, Any
🧰 Tools
🪛 Ruff (0.11.9)

13-13: datetime.timedelta imported but unused

Remove unused import: datetime.timedelta

(F401)


14-14: typing.Union imported but unused

Remove unused import: typing.Union

(F401)


153-189: Good rate limit handling, but consider simplifying the nested try-except structure.

The proactive delay and rate limit handling are well-implemented. However, the nested try-except blocks make the code harder to follow.

Consider extracting the API call with retry logic into a separate helper method to improve readability:

def _call_conversations_history_with_retry(self, **kwargs):
    """Helper method to call conversations.history with rate limit retry."""
    while True:
        time.sleep(1.2)  # Proactive delay for Tier 3
        try:
            return self.client.conversations_history(**kwargs)
        except SlackApiError as e:
            if e.response and e.response.status_code == 429:
                retry_after = e.response.headers.get('Retry-After', '60')
                wait_time = int(retry_after) if retry_after.isdigit() else 60
                logger.warning(f"Rate limited on conversations.history. Retrying after {wait_time} seconds.")
                time.sleep(wait_time)
                continue
            raise

115-115: Use exception chaining for better error traceability.

When re-raising exceptions within except blocks, use raise ... from to maintain the exception chain.

For line 115:

-raise SlackApiError(f"Error retrieving channels: {e}", e.response)
+raise SlackApiError(f"Error retrieving channels: {e}", e.response) from e

For line 119:

-raise RuntimeError(f"An unexpected error occurred during channel fetching: {general_error}")
+raise RuntimeError(f"An unexpected error occurred during channel fetching: {general_error}") from general_error

For line 211:

-raise SlackApiError(f"Error retrieving history for channel {channel_id}: {e}", e.response)
+raise SlackApiError(f"Error retrieving history for channel {channel_id}: {e}", e.response) from e

For line 316:

-raise SlackApiError(f"Error retrieving user info for {user_id}: {e_user_info}", e_user_info.response)
+raise SlackApiError(f"Error retrieving user info for {user_id}: {e_user_info}", e_user_info.response) from e_user_info

Also applies to: 119-119, 211-211, 316-316

🧰 Tools
🪛 Ruff (0.11.9)

115-115: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

surfsense_backend/app/connectors/test_slack_history.py (1)

2-2: Remove unused import.

The time module is imported but not used in this test file.

-import time # Imported to be available for patching target module
🧰 Tools
🪛 Ruff (0.11.9)

2-2: time imported but unused

Remove unused import: time

(F401)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fef58b and 379694c.

📒 Files selected for processing (5)
  • surfsense_backend/app/connectors/github_connector.py (1 hunks)
  • surfsense_backend/app/connectors/slack_history.py (4 hunks)
  • surfsense_backend/app/connectors/test_github_connector.py (1 hunks)
  • surfsense_backend/app/connectors/test_slack_history.py (1 hunks)
  • surfsense_backend/app/tasks/connectors_indexing_tasks.py (1 hunks)
🔇 Additional comments (9)
surfsense_backend/app/connectors/github_connector.py (1)

83-83:

✅ Verification successful

Verify the impact of expanding repository scope.

The change from type='owner' to type='all' is a significant functional modification that will now include collaborative repositories in addition to owned ones. This could potentially return many more repositories and impact performance or existing functionality.

Please run the following script to verify how this method is used across the codebase:


🏁 Script executed:

#!/bin/bash
# Description: Find all usages of get_user_repositories to assess impact of scope change

# Search for direct method calls
rg -A 5 -B 2 "get_user_repositories"

# Search for any code that might assume only owned repositories
rg -A 3 -B 3 "(owner.*repo|owned.*repo|user.*own)"

Length of output: 24541


Change to type='all' validated—no further action required.

  • get_user_repositories() is only consumed in surfsense_backend/app/routes/search_source_connectors_routes.py, where the full list is returned without owner-specific assumptions.
  • Tests in surfsense_backend/app/connectors/test_github_connector.py already assert type='all', confirming the intended behavior.
surfsense_backend/app/connectors/test_github_connector.py (4)

14-65: Excellent test coverage for the main functionality.

This test method comprehensively verifies that the get_user_repositories method correctly calls the GitHub API with type='all' and sort='updated' parameters, and properly structures the returned data. The test properly mocks the GitHub client and validates both the API call parameters and the returned repository structure.


68-95: Good edge case handling verification.

This test properly verifies that the method handles edge cases where repository descriptions are None and updated_at is None, ensuring they are converted to empty strings and preserved as None respectively. This is important for robust error handling.


98-132: Comprehensive authentication failure testing.

These test methods thoroughly verify that the connector properly handles authentication failures with both ForbiddenError and AuthenticationFailed exceptions, ensuring that meaningful error messages are provided to users. The mock setup is appropriate and the assertions validate the expected behavior.


135-150: Good API exception handling verification.

This test ensures that when the GitHub API throws an exception during repository fetching, the method gracefully handles it by logging the error and returning an empty list instead of crashing. This is good defensive programming practice.

surfsense_backend/app/tasks/connectors_indexing_tasks.py (1)

117-131: LGTM! Simplified channel membership verification.

The refactoring to use channel objects with membership flags directly is a good improvement. It eliminates redundant API calls and simplifies the logic while maintaining the same functionality.

surfsense_backend/app/connectors/slack_history.py (2)

40-121: Excellent rate limit handling implementation!

The refactored get_all_channels method now provides:

  • Proper pagination with delays between requests
  • Robust rate limit handling with Retry-After header support
  • Richer channel data structure with membership information
  • Comprehensive error handling and logging

This implementation aligns well with Slack API best practices.

🧰 Tools
🪛 Ruff (0.11.9)

115-115: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


119-119: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


296-319: Well-implemented rate limit handling for user info retrieval.

The retry logic with Retry-After header support is consistent with the other methods and follows best practices.

🧰 Tools
🪛 Ruff (0.11.9)

316-316: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

surfsense_backend/app/connectors/test_slack_history.py (1)

9-421: Excellent test coverage for rate limit handling!

The test suite provides comprehensive coverage of:

  • Pagination with delays
  • Rate limit handling with various Retry-After scenarios
  • Error propagation for different error types
  • Logging behavior verification
  • Edge cases like malformed channel data

This thorough testing ensures the robustness of the rate limiting implementation.
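
As an illustration of how "pagination with delays" can be tested without real waiting, here is a small sketch that patches `time.sleep`. The `collect_pages` function and the `list_page` client method are hypothetical and are not taken from `test_slack_history.py`.

```python
import time
import unittest
from unittest.mock import Mock, call, patch


def collect_pages(client, delay=3.0):
    """Hypothetical paginator: sleeps `delay` seconds between pages."""
    items, cursor = [], None
    while True:
        page = client.list_page(cursor)
        items.extend(page["items"])
        cursor = page.get("next_cursor")
        if not cursor:
            return items
        time.sleep(delay)


class CollectPagesTest(unittest.TestCase):
    @patch("time.sleep")
    def test_sleeps_between_pages(self, mock_sleep):
        client = Mock()
        client.list_page.side_effect = [
            {"items": [1], "next_cursor": "a"},
            {"items": [2], "next_cursor": "b"},
            {"items": [3], "next_cursor": ""},
        ]
        self.assertEqual(collect_pages(client), [1, 2, 3])
        # Three pages mean two page boundaries, so two delays are expected.
        self.assertEqual(mock_sleep.call_args_list, [call(3.0), call(3.0)])
        # Each call passes the cursor returned by the previous page.
        self.assertEqual(
            client.list_page.call_args_list, [call(None), call("a"), call("b")]
        )
```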

@fblgit fblgit mentioned this pull request May 27, 2025
@fblgit fblgit changed the title Fix/slack rate limiting Fix/slack rate limiting & Github Repos Filtering May 27, 2025
@fblgit fblgit changed the title Fix/slack rate limiting & Github Repos Filtering Fix/slack rate limiting & Github Repos ORG Filtering May 27, 2025
@MODSetter (Owner)

> @MODSetter will this fix be a good fit to the codebase?

Hey @fblgit thanks for this. Will review & merge by EOD 👍

@MODSetter (Owner)

Looks good to me. Thanks @fblgit

@MODSetter MODSetter merged commit fd6da4c into MODSetter:main May 28, 2025
1 of 3 checks passed
AbdullahAlMousawi pushed a commit to AbdullahAlMousawi/SurfSense that referenced this pull request Jul 14, 2025
Fix/slack rate limiting & Github Repos ORG Filtering
CREDO23 pushed a commit to CREDO23/SurfSense that referenced this pull request Jul 25, 2025
Fix/slack rate limiting & Github Repos ORG Filtering