
Dagster development #243

Open

kazewong wants to merge 11 commits into main from dagster_development

Conversation

@kazewong
Owner

@kazewong kazewong commented Jul 24, 2025

This PR aims to integrate a Dagster instance into a local K8s environment.

Summary by CodeRabbit

  • Chores
    • Updated container images to expose port 80 and configure environment settings for improved runtime access.
    • Renamed a workflow step for clarity in the continuous deployment process.
    • Added new optional dependencies for Kubernetes and Postgres support.
  • New Features
    • Introduced Minio object storage integration with configurable resource support.
    • Updated data pipeline to store event lists, raw data, and plots in Minio instead of local disk.
    • Added pipeline configuration to load assets and resources for Minio storage.

@coderabbitai

coderabbitai bot commented Jul 24, 2025

Walkthrough

The changes update two Dockerfiles to set the PATH environment variable and expose port 80, and adjust the GitHub Actions workflow by renaming a Docker image build step for CUDA. New Dagster pipeline definitions and assets are added to integrate Minio object storage as a resource. The pyproject.toml adds optional dependencies for Dagster Kubernetes, Postgres, and Minio. Asset functions are updated to use Minio for storage instead of local files, and a new MinioResource class is introduced for Minio interactions.

Changes

  • containers/Containerfile.cpu, containers/Containerfile.cuda: Changed the git branch to dagster_development, extended the PATH env variable, and exposed port 80.
  • .github/workflows/CD.yml: Renamed a workflow step to "Build and push Docker image for cuda" in the publish_docker job.
  • pipeline/dagster/RealDataCatalog/definitions.py: Added Dagster definitions loading assets and configuring MinioResource with env-var-based configuration.
  • pipeline/dagster/RealDataCatalog/minio_resource.py: Added a new MinioResource class providing Minio client initialization and object-storage interaction methods.
  • pipeline/dagster/RealDataCatalog/assets.py: Added multiple Dagster assets for gravitational-wave data processing, all using MinioResource for storage and retrieval.
  • pyproject.toml: Added optional dependencies dagster-k8s, dagster-postgres, and minio with specified minimum versions.
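
For orientation, a minimal sketch of how definitions.py might wire the assets and the MinioResource together; this is a hedged reconstruction based on the summary above, and the module layout, resource field names, and environment variable names are assumptions rather than the actual file contents.

from dagster import Definitions, EnvVar, load_assets_from_modules

from . import assets  # assumed relative import of the new assets module
from .minio_resource import MinioResource

defs = Definitions(
    # Collect every asset defined in the assets module
    assets=load_assets_from_modules([assets]),
    resources={
        # Resource key "minio" matches the parameter name used by the asset functions
        "minio": MinioResource(
            endpoint=EnvVar("MINIO_ENDPOINT"),      # hypothetical env var names
            port=EnvVar("MINIO_PORT"),
            access_key=EnvVar("MINIO_ACCESS_KEY"),
            secret_key=EnvVar("MINIO_SECRET_KEY"),
            bucket_name=EnvVar("MINIO_BUCKET"),
        ),
    },
)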

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • CICD update #242: Introduces the .github/workflows/CD.yml workflow with the publish_docker job, which is directly modified in this PR.

Poem

🐇
A Docker hop, a port exposed,
New paths set where Dagster goes.
Minio joins to store with grace,
Assets travel to a cloud-based place.
Workflow steps renamed with care,
Dependencies added to the lair.
🥕

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (1)
containers/Containerfile.cuda (1)

2-2: Build is failing – base image requires authentication
nvcr.io/nvidia/cuda:12.9.1-base-ubuntu20.04 returns 401 Unauthorized in CI.
Options:

  1. Switch to the public Docker Hub mirror: nvidia/cuda:12.9.1-base-ubuntu20.04.
  2. Add a docker/login-action step with an NGC API token (and set an NVCR_TOKEN secret).

Until one of these is done, the pipeline will remain red.
🧹 Nitpick comments (6)
containers/Containerfile.cpu (2)

15-20: Combine build-time steps and set VIRTUAL_ENV for cleaner image layers

  1. RUN uv sync, ENV PATH=…, and EXPOSE 80 can be squashed into a single RUN layer to reduce final image size.
  2. Several Python tools rely on the VIRTUAL_ENV env-var. Declaring it avoids subtle path issues when invoking Dagster CLI inside the container.
-# Run uv sync
-RUN uv sync --extra dagster
-
-ENV PATH="/home/jim/.venv/bin:$PATH"
-# Expose the port that your Dagster instance will run on
-EXPOSE 80
+# Install deps & expose Dagster
+RUN uv sync --extra dagster \
+ && echo 'export VIRTUAL_ENV=/home/jim/.venv' >> /etc/profile.d/virtual_env.sh
+
+ENV VIRTUAL_ENV=/home/jim/.venv \
+    PATH="/home/jim/.venv/bin:$PATH"
+
+EXPOSE 80

9-12: Shallow-clone to speed up builds
Use --depth 1 --branch jim-dev to avoid fetching the full commit history and an extra checkout step.

-RUN git clone https://github.com/kazewong/jim.git
-
-WORKDIR /home/jim
-RUN git checkout jim-dev
+RUN git clone --depth 1 --branch jim-dev https://github.com/kazewong/jim.git /home/jim
+
+WORKDIR /home/jim
containers/Containerfile.cuda (2)

15-20: Mirror CPU-image improvements & set VIRTUAL_ENV
Apply the same layer consolidation and VIRTUAL_ENV export as suggested for the CPU image to keep both images consistent and slimmer.


1-1: Minor Hadolint warning – keyword casing
FROM ghcr.io/astral-sh/uv:python3.12-bookworm as uv-source → use AS (uppercase) to match FROM and silence DL4000.

-FROM ghcr.io/astral-sh/uv:python3.12-bookworm as uv-source
+FROM ghcr.io/astral-sh/uv:python3.12-bookworm AS uv-source
.github/workflows/CD.yml (2)

77-84: Consider building a CPU image as well
Renaming clarifies the CUDA build, but users on non-GPU nodes (e.g. local Kind clusters) will need a CPU-only image.
Add a second docker/build-push-action step (or a matrix) building containers/Containerfile.cpu and tagging it appropriately.


77-79: Step-id mismatch
id: push is now ambiguous because the step name mentions CUDA specifically. Renaming the id to push_cuda keeps logs self-describing and avoids future shadowing if a CPU step is added.

-      - name: Build and push Docker image for cuda
-        id: push
+      - name: Build and push CUDA Docker image
+        id: push_cuda
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d01168 and 407e325.

📒 Files selected for processing (3)
  • .github/workflows/CD.yml (1 hunks)
  • containers/Containerfile.cpu (1 hunks)
  • containers/Containerfile.cuda (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: CD
containers/Containerfile.cuda

[warning] 1-2: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 1)


[error] Build failed: failed to resolve source metadata for nvcr.io/nvidia/cuda:12.9.1-base-ubuntu20.04 due to 401 Unauthorized error when accessing https://nvcr.io/v2/nvidia/cuda/manifests/12.9.1-base-ubuntu20.04

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c51289 and 69e7a17.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • pyproject.toml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: publish_docker

Comment on lines 35 to 42
dagster = [
"dagster>=1.11.1",
"dagster-dg-cli>=1.11.1",
"dagster-k8s>=0.27.3",
"dagster-pipes>=1.11.1",
"dagster-postgres>=0.27.3",
"dagster-webserver>=1.11.1",
]

⚠️ Potential issue

Version mismatch: dagster-k8s / dagster-postgres pinned to 0.27.x will not install alongside core Dagster 1.11.x

The core package is declared at dagster>=1.11.1, but both plugins are constrained to >=0.27.3.
Dagster packages must share the same major (and usually minor) version; otherwise:

  • pip/uv will fail to resolve, because dagster-k8s 0.27.3 requires dagster~=0.27.3.
  • Runtime APIs differ drastically between 0.27 and 1.x, so even if force-installed, things will break.

Patch suggestion:

-    "dagster-k8s>=0.27.3",
+    # Keep plugin version aligned with core Dagster
+    "dagster-k8s>=1.11.1",
...
-    "dagster-postgres>=0.27.3",
+    "dagster-postgres>=1.11.1",

Or omit explicit versions and rely on the resolver:

-    "dagster-k8s>=0.27.3",
-    "dagster-postgres>=0.27.3",
+    "dagster-k8s",
+    "dagster-postgres",

Please adjust before merging to avoid broken images / CI.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
dagster = [
"dagster>=1.11.1",
"dagster-dg-cli>=1.11.1",
"dagster-k8s>=0.27.3",
"dagster-pipes>=1.11.1",
"dagster-postgres>=0.27.3",
"dagster-webserver>=1.11.1",
]
dagster = [
"dagster>=1.11.1",
"dagster-dg-cli>=1.11.1",
# Keep plugin version aligned with core Dagster
"dagster-k8s>=1.11.1",
"dagster-pipes>=1.11.1",
"dagster-postgres>=1.11.1",
"dagster-webserver>=1.11.1",
]
🤖 Prompt for AI Agents
In pyproject.toml lines 35 to 42, the versions of dagster-k8s and
dagster-postgres are pinned to 0.27.x while the core dagster package is at
1.11.x, causing version conflicts. Update the versions of dagster-k8s and
dagster-postgres to match the major and minor version of the core dagster
package (1.11.x) or remove explicit version pins to let the resolver handle
compatible versions automatically.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (2)
pipeline/dagster/RealDataCatalog/assets.py (2)

176-180: Critical inconsistency: Function still uses local filesystem after Minio migration.

The config_file function reads from the local "data/event_list.txt" file, but this data is now stored in Minio object storage. This will cause runtime failures since the local file no longer exists.

Update the function to read from Minio storage:

-    with open("data/event_list.txt", "r") as f:
-        lines = f.readlines()
-        event_dict = dict(line.strip().split() for line in lines)
+    # Download event list from Minio and read
+    import tempfile, os
+    with tempfile.NamedTemporaryFile(mode='w+', delete=False) as temp_file:
+        # Note: This function needs MinioResource parameter added to signature
+        minio.download_object("event_list.txt", temp_file.name)
+        with open(temp_file.name, 'r') as f:
+            lines = f.readlines()
+            event_dict = dict(line.strip().split() for line in lines)
+    os.unlink(temp_file.name)

Also, the function signature needs to be updated to include the MinioResource parameter:

-def config_file(context: AssetExecutionContext):
+def config_file(context: AssetExecutionContext, minio: MinioResource):

59-94: Suggest essential refactor: Migrate raw_data function to use Minio storage.

The raw_data function should be updated to store processed data in Minio object storage to maintain consistency with the storage strategy migration. Currently, it still saves data to local filesystem directories.

Consider updating the function to:

  1. Store strain and PSD data in Minio with organized object naming (e.g., {event_name}/raw/{ifo}_data, {event_name}/raw/{ifo}_psd)
  2. Use the existing Minio resource methods for object storage
  3. Maintain the same error handling logic but adapt file operations to Minio operations

This will ensure the entire pipeline uses consistent object storage and works properly in containerized/K8s environments where local filesystem persistence is not guaranteed.
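
A minimal sketch of that upload pattern follows; it is hedged, assuming a put_object(object_name, data, size, content_type) method on the Minio resource, and the helper name, object layout, and array fields (td, dt, epoch) are illustrative only.

import io

import numpy as np


def upload_npz(minio, object_name: str, **arrays) -> None:
    # Serialize the arrays to an in-memory .npz archive instead of a local file
    buf = io.BytesIO()
    np.savez(buf, **arrays)
    payload = buf.getvalue()
    minio.put_object(
        object_name=object_name,
        data=io.BytesIO(payload),
        size=len(payload),
        content_type="application/octet-stream",
    )

# e.g. upload_npz(minio, f"{event_name}/raw/{ifo}_data.npz",
#                 td=data.td, dt=data.dt, epoch=data.epoch)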

🧹 Nitpick comments (1)
pipeline/dagster/RealDataCatalog/minio_resource.py (1)

27-30: Minor improvement: Add error handling and validation.

The put_object method works correctly but could benefit from parameter validation and error handling.

Consider adding validation:

 def put_object(self, object_name: str, data, size: int, content_type: str):
+    if not object_name or not object_name.strip():
+        raise ValueError("Object name cannot be empty")
+    if size < 0:
+        raise ValueError("Size cannot be negative")
     client = self.get_client()
-    
-    return client.put_object(self.bucket_name, object_name, data, size, content_type)
+    try:
+        return client.put_object(self.bucket_name, object_name, data, size, content_type)
+    except Exception as e:
+        raise RuntimeError(f"Failed to upload object {object_name}: {str(e)}") from e
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 69e7a17 and 9e2f920.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • pipeline/dagster/RealDataCatalog/assets.py (4 hunks)
  • pipeline/dagster/RealDataCatalog/definitions.py (1 hunks)
  • pipeline/dagster/RealDataCatalog/minio_resource.py (1 hunks)
  • pyproject.toml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pyproject.toml
  • pipeline/dagster/RealDataCatalog/definitions.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
pipeline/dagster/RealDataCatalog/assets.py (2)
pipeline/dagster/RealDataCatalog/minio_resource.py (2)
  • MinioResource (4-34)
  • put_object (27-30)
jim_dagster/RealDataCatalog/assets.py (1)
  • event_list (21-37)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: publish_docker
🔇 Additional comments (8)
pipeline/dagster/RealDataCatalog/assets.py (4)

4-4: LGTM: Import addition for string buffer operations.

The io import is correctly added to support the in-memory string buffer operations used for Minio integration.


7-7: LGTM: MinioResource import for object storage integration.

The import of MinioResource is correctly added to enable Minio object storage functionality in the asset functions.


23-23: LGTM: Function signature updated for Minio integration.

The function signature correctly adds the MinioResource parameter for dependency injection.


35-45: LGTM: Proper implementation of Minio object upload.

The implementation correctly:

  • Uses io.StringIO() for in-memory string operations
  • Properly encodes string data to bytes
  • Uses io.BytesIO() for the data stream
  • Specifies appropriate content type and size

The migration from local file writing to Minio object storage is well-implemented.

pipeline/dagster/RealDataCatalog/minio_resource.py (4)

4-10: LGTM: Well-structured resource configuration.

The MinioResource class properly extends ConfigurableResource and defines all necessary configuration parameters with appropriate type hints. The configuration covers all essential Minio connection parameters.


19-21: LGTM: Presigned URL generation implemented correctly.

The method properly uses the Minio client to generate presigned URLs for object access.


23-25: LGTM: Object listing with proper filtering.

The method correctly implements object listing with prefix filtering and recursive search, returning a proper list.


32-34: LGTM: File download method implemented correctly.

The method properly uses the Minio client's fget_object method to download objects to local file paths.
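
Putting the reviewed pieces together, the resource likely looks roughly like the following. This is a hedged reconstruction from the review notes only, not the actual file; method names and signatures are assumptions (the review itself refers to both get_object and download_object across revisions).

from dagster import ConfigurableResource
from minio import Minio


class MinioResource(ConfigurableResource):
    endpoint: str
    port: str
    access_key: str
    secret_key: str
    bucket_name: str

    def get_client(self) -> Minio:
        # Plain HTTP client; the bucket is created on first use if missing
        client = Minio(f"{self.endpoint}:{self.port}", self.access_key,
                       self.secret_key, secure=False)
        if not client.bucket_exists(self.bucket_name):
            client.make_bucket(self.bucket_name)
        return client

    def get_presigned_url(self, object_name: str) -> str:
        return self.get_client().presigned_get_object(self.bucket_name, object_name)

    def list_objects(self, prefix: str = "") -> list:
        return list(self.get_client().list_objects(
            self.bucket_name, prefix=prefix, recursive=True))

    def put_object(self, object_name: str, data, size: int, content_type: str):
        return self.get_client().put_object(
            self.bucket_name, object_name, data, size, content_type)

    def download_object(self, object_name: str, file_path: str) -> None:
        # fget_object writes the object directly to a local path
        self.get_client().fget_object(self.bucket_name, object_name, file_path)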

Comment on lines +12 to +17
def get_client(self):
client = Minio(self.endpoint + ":" + self.port, self.access_key, self.secret_key, secure=False)
found = client.bucket_exists(self.bucket_name)
if not found:
client.make_bucket(self.bucket_name)
return client

🛠️ Refactor suggestion

Security concern: Insecure connection and potential performance issue.

Two issues with the get_client method:

  1. Security: secure=False disables TLS encryption, which is acceptable for local development but should be configurable for production environments.

  2. Performance: The method creates a new client instance and checks bucket existence on every call, which is inefficient for frequent operations.

Consider these improvements:

+    _client: Minio = None
+    secure: bool = False  # Add to class configuration
+
     def get_client(self):
+        if self._client is None:
-            client = Minio(self.endpoint + ":" + self.port, self.access_key, self.secret_key, secure=False)
+            self._client = Minio(self.endpoint + ":" + self.port, self.access_key, self.secret_key, secure=self.secure)
+            self._ensure_bucket_exists()
+        return self._client
+    
+    def _ensure_bucket_exists(self):
-        found = client.bucket_exists(self.bucket_name)
+        found = self._client.bucket_exists(self.bucket_name)
         if not found:
-            client.make_bucket(self.bucket_name)
-        return client
+            self._client.make_bucket(self.bucket_name)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/minio_resource.py around lines 12 to 17, the
get_client method disables TLS by setting secure=False and creates a new Minio
client and checks bucket existence on every call, causing security and
performance issues. Modify the method to accept a configurable parameter or
environment variable to toggle secure mode for production use. Refactor the code
to instantiate the Minio client once (e.g., during initialization) and reuse it,
avoiding repeated bucket existence checks and client creation on every call.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (3)
pipeline/dagster/RealDataCatalog/assets.py (3)

201-218: Config file function has multiple filesystem dependencies.

This function has several issues:

  1. Reads event_list.txt from local filesystem (line 203) instead of Minio
  2. Checks for data/PSD files in local filesystem (lines 211-213) that are now in Minio

Update to use Minio:

-    with open("data/event_list.txt", "r") as f:
-        lines = f.readlines()
-        event_dict = dict(line.strip().split() for line in lines)
+    # Fetch from Minio
+    event_list_obj = minio.get_object("event_list.txt")
+    lines = event_list_obj.read().decode("utf-8").splitlines()
+    event_dict = dict(line.strip().split() for line in lines)

For checking file availability, you'll need to use Minio's object listing or head_object methods instead of os.path.exists.
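
For instance, a minimal availability check could be built on the client's stat_object call (a HEAD-style lookup). The helper below is a hedged sketch that assumes direct access to the underlying client via get_client(); the helper name is hypothetical.

from minio.error import S3Error


def object_exists(minio, object_name: str) -> bool:
    client = minio.get_client()
    try:
        client.stat_object(minio.bucket_name, object_name)  # raises if the object is missing
        return True
    except S3Error:
        return False

# e.g. available_ifos = [ifo for ifo in ("H1", "L1", "V1")
#                        if object_exists(minio, f"{event_name}/raw/{ifo}_data.npz")]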


311-668: All diagnostic plotting functions need Minio integration.

All the remaining diagnostic functions (loss_plot, production_chains_corner_plot, etc.) read from local results.npz files that may not exist if the pipeline is using Minio for storage.

Consider creating a helper function to handle Minio downloads for all diagnostic functions:

def download_results_from_minio(context: AssetExecutionContext, minio: MinioResource, event_name: str):
    """Download results.npz from Minio to a temporary location."""
    import tempfile
    temp_file = tempfile.NamedTemporaryFile(suffix='.npz', delete=False)
    try:
        minio.download_object(f"{event_name}/results.npz", temp_file.name)
        return temp_file.name
    except Exception as e:
        os.unlink(temp_file.name)
        raise FileNotFoundError(f"Results file not found in Minio: {e}")

This would simplify updating all diagnostic functions and ensure consistent error handling.


127-155: Update diagnostic assets to fetch data from Minio, not the local data/ directory

The plotting (raw_data_plot, psd_plot, etc.) and configuration (config_file) functions still assume files live under data/<event_name>/…, but upstream you’ve moved raw and PSD data into Minio. These functions will fail at runtime unless they first pull the required files from Minio.

Please update each diagnostic asset to:

  • Inject the Minio resource via context.resources.minio.
  • Download the needed file(s) (e.g. .npz, event_list.txt) into a temporary directory (or stream them) before loading.
  • Clean up or close temporary files when done.

Affected locations:

  • pipeline/dagster/RealDataCatalog/assets.py:
    • raw_data_plot (lines 127–155)
    • psd_plot (lines 156–191)
    • config_file (lines 201–263)
    • All other diagnostic plotting functions (lines 311–668)

Example diff for raw_data_plot:

 def raw_data_plot(context: AssetExecutionContext):
     """
     Plot the raw strain data for each IFO for the event.
     """
-    import matplotlib.pyplot as plt
+    import matplotlib.pyplot as plt
     import tempfile
     import os
     import numpy as np

     event_name = context.partition_key
-    event_dir = os.path.join("data", event_name, "raw")
+    # Download raw data from Minio
+    temp_dir = tempfile.mkdtemp()
+    minio = context.resources.minio
+
+    plots_dir = os.path.join("data", event_name, "plots")
     os.makedirs(plots_dir, exist_ok=True)

     ifos = ["H1", "L1", "V1"]
     plot_paths = []
     for ifo in ifos:
-        data_file = os.path.join(event_dir, f"{ifo}_data.npz")
+        local_npz = os.path.join(temp_dir, f"{ifo}_data.npz")
+        # pull from Minio bucket `<event_name>/raw/`
+        minio.download_object(
+            bucket_name=event_name,
+            object_name=f"raw/{ifo}_data.npz",
+            file_path=local_npz,
+        )
+        data_file = local_npz

         if os.path.exists(data_file):
             data = np.load(data_file)
             t = data["epoch"] + np.arange(data["td"].shape[0]) * data["dt"]
             td = data["td"]
             if t is not None and td is not None:
                 plt.figure()
                 plt.plot(t, td)
                 plt.xlabel("Time (s)")
                 plt.ylabel("Strain")
                 plt.title(f"{ifo} Strain for {event_name}")
                 plot_path = os.path.join(plots_dir, f"{ifo}_strain.png")
                 plt.savefig(plot_path)
                 plt.close()
                 plot_paths.append(plot_path)
     return plot_paths

Apply analogous changes to psd_plot, config_file, and all other diagnostics so that every file read is preceded by a Minio download.

♻️ Duplicate comments (2)
pipeline/dagster/RealDataCatalog/assets.py (2)

59-65: Good fix! The function now correctly reads from Minio.

The implementation properly addresses the previous inconsistency by fetching the event list from Minio storage instead of the local filesystem.


164-191: PSD plot function also needs Minio integration.

Similar to raw_data_plot, this function expects local files but should download from Minio.

🧹 Nitpick comments (1)
pipeline/dagster/RealDataCatalog/assets.py (1)

23-45: Well-implemented transition to Minio storage.

The function correctly uploads the event list to Minio instead of writing to local filesystem. The implementation properly handles string-to-bytes conversion and sets the appropriate content type.

Minor optimization: The buffer.seek(0) on line 38 is unnecessary since buffer.getvalue() returns the entire buffer contents regardless of the current position.
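
A minimal sketch of the same upload without the extra buffering step; this is hedged, with `minio` standing in for the injected MinioResource and the object name taken from the PR's layout.

import io


def upload_event_list(minio, events: dict) -> None:
    text = "".join(f"{name} {gps}\n" for name, gps in events.items())
    payload = text.encode("utf-8")      # no seek needed; join/encode give the full string
    minio.put_object(
        object_name="event_list.txt",
        data=io.BytesIO(payload),       # Minio expects a readable byte stream
        size=len(payload),
        content_type="text/plain",
    )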

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e2f920 and 5608cde.

📒 Files selected for processing (2)
  • pipeline/dagster/RealDataCatalog/assets.py (5 hunks)
  • pipeline/dagster/RealDataCatalog/minio_resource.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pipeline/dagster/RealDataCatalog/minio_resource.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: publish_docker
🔇 Additional comments (1)
pipeline/dagster/RealDataCatalog/assets.py (1)

4-7: LGTM! Import statements are appropriate for Minio integration.

The addition of io module and MinioResource import aligns well with the transition from local filesystem to object storage.

Comment on lines 70 to 71
event_dir = os.path.join('tmp', event_name, "raw")
os.makedirs(event_dir, exist_ok=True)

🛠️ Refactor suggestion

Use Python's tempfile module for temporary directory management.

Instead of hardcoding 'tmp' directory, use Python's tempfile.mkdtemp() for better cross-platform compatibility and automatic cleanup.

-    # Use a temp directory, but keep event_name and "raw" part
-    event_dir = os.path.join('tmp', event_name, "raw")
+    import tempfile
+    temp_dir = tempfile.mkdtemp()
+    event_dir = os.path.join(temp_dir, event_name, "raw")

Remember to clean up the entire temporary directory at the end of the function or use a context manager.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py around lines 70 to 71, replace the
hardcoded 'tmp' directory with a temporary directory created using Python's
tempfile.mkdtemp() for better cross-platform compatibility. Update event_dir to
be inside this temporary directory. Also, ensure to clean up the entire
temporary directory at the end of the function or use a context manager to
handle automatic cleanup.

Comment on lines 114 to 118

🛠️ Refactor suggestion

⚠️ Potential issue

Incomplete cleanup and broad exception handling.

Two issues to address:

  1. The cleanup only removes .npz files but leaves the directory structure
  2. The exception handling is too broad and doesn't properly clean up on error

For proper cleanup, consider:

import shutil
# At the end of the function (outside the try block)
if 'temp_dir' in locals():
    shutil.rmtree(temp_dir)

For better error handling:

-        except Exception as e:
-            print(f"Error fetching data for {ifo} during {event_name}: {e}")
-            continue
+        except gwosc.api.APIError as e:
+            context.log.error(f"GWOSC API error for {ifo} during {event_name}: {e}")
+            continue
+        except Exception as e:
+            context.log.error(f"Unexpected error for {ifo} during {event_name}: {e}")
+            # Clean up any partial files
+            for f in [data_file_path + '.npz', psd_file_path + '.npz']:
+                if os.path.exists(f):
+                    os.remove(f)
+            raise
🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py around lines 114 to 118, the
current cleanup only removes .npz files but leaves the temporary directory
intact, and the exception handling is too broad without proper cleanup on
errors. To fix this, add code after the try-except block to check if the
temporary directory variable exists and remove the entire directory using
shutil.rmtree to ensure complete cleanup. Also, narrow the exception handling to
catch specific exceptions if possible, and ensure that cleanup code runs
regardless of success or failure, for example by using a finally block or by
placing cleanup after the try-except.

kazewong added 2 commits July 26, 2025 11:37

  • Move Data and IMRPhenomPv2StandardCBCRunDefinition imports into functions
  • Update raw_data, raw_data_plot, and psd_plot to read/write via Minio
  • Store plots in Minio instead of local disk using temporary files
  • Clean up temporary files after upload
  • Add MinioResource argument to plot asset functions
  • Enable t_c_range in config_file asset

  • Replace local filesystem access with Minio for reading and writing data, config files, and plots
  • Update asset signatures to accept MinioResource where needed
  • Remove redundant imports and local directory creation
  • Use temporary files for serialization before uploading to Minio

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
pipeline/dagster/RealDataCatalog/assets.py (2)

70-71: Consider using tempfile module for better cross-platform compatibility.

The hardcoded 'tmp' directory could cause issues on different platforms. Consider using Python's tempfile.mkdtemp() for better portability.

This issue was previously flagged and remains unaddressed. The current implementation still uses hardcoded paths which may not work consistently across different environments.


117-119: Improve exception handling and cleanup strategy.

The current approach has two issues:

  1. Exception handling is too broad and may mask specific errors
  2. Cleanup only removes .npz files but leaves directory structure

This issue was previously flagged and remains unaddressed. Consider implementing proper cleanup with shutil.rmtree() and more specific exception handling.

🧹 Nitpick comments (3)
pipeline/dagster/RealDataCatalog/assets.py (3)

11-11: Fix typo in comment.

There's a typo in the comment: "configuration0" should be "configuration".

-# Create asset group for run and configuration0
+# Create asset group for run and configuration

317-352: Placeholder asset - implementation needed.

The run multi-asset is properly structured with correct dependencies and partitions, but currently unimplemented. This appears to be intentional for the current PR scope.

Would you like me to help implement the logic for loading results from the external run execution?


355-770: Consider refactoring to reduce code duplication across diagnostic assets.

The diagnostic assets follow very similar patterns for:

  • Loading data from Minio
  • Creating plots with matplotlib/corner
  • Saving to temporary files
  • Uploading to Minio
  • Cleaning up temporary files

Consider creating helper functions to reduce duplication:

def create_and_upload_plot(event_name: str, minio: MinioResource, plot_func, plot_name: str, **kwargs):
    """Helper function to create plots and upload to Minio"""
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmpfile:
        fig = plot_func(**kwargs)
        fig.savefig(tmpfile.name)
        plt.close(fig)
        tmpfile.flush()
        tmpfile.seek(0)
        tmpfile_size = os.path.getsize(tmpfile.name)
        minio_plot_path = f"{event_name}/plots/{plot_name}.png"
        with open(tmpfile.name, "rb") as plotfile:
            minio.put_object(
                object_name=minio_plot_path,
                data=plotfile,
                size=tmpfile_size,
                content_type="image/png",
            )
    os.remove(tmpfile.name)
    return minio_plot_path

def load_results_from_minio(event_name: str, minio: MinioResource):
    """Helper function to load results.npz from Minio"""
    results_obj = minio.get_object(f"{event_name}/results.npz")
    return np.load(io.BytesIO(results_obj.read()), allow_pickle=True)

This would significantly reduce code duplication and improve maintainability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0107686 and 5d07a78.

📒 Files selected for processing (1)
  • pipeline/dagster/RealDataCatalog/assets.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: publish_docker
🔇 Additional comments (14)
pipeline/dagster/RealDataCatalog/assets.py (14)

16-46: LGTM! Well-structured event fetching and partition registration.

The asset correctly fetches events from multiple GWTC catalogs, properly buffers the data, uploads to Minio, and registers dynamic partitions. The implementation is clean and follows best practices.


57-66: LGTM! Critical inconsistency resolved.

The function now correctly reads the event list from Minio instead of the local filesystem, addressing the previous critical issue. The Minio integration is properly implemented.


122-168: LGTM! Well-implemented plotting asset with proper resource management.

The raw data plotting asset demonstrates good practices:

  • Proper use of tempfile for temporary files
  • Correct Minio integration for both reading and writing
  • Good error handling per IFO
  • Proper cleanup of temporary files
  • Clear return value

171-217: LGTM! Consistent implementation with the strain plotting asset.

The PSD plotting asset follows the same good patterns as the raw data plot asset, with proper resource management and error handling.


220-315: LGTM! Comprehensive configuration asset with robust IFO availability checking.

The config file asset is well-implemented:

  • Proper Minio integration for reading event list
  • Robust availability checking for IFOs by attempting to fetch both data and PSD files
  • Comprehensive parameter configuration for IMRPhenomPv2StandardCBCRunDefinition
  • Correct use of tempfile for serialization
  • Proper cleanup of temporary files

355-393: LGTM! Well-structured diagnostic asset with proper error handling.

The loss plot asset demonstrates good practices with proper Minio integration, error handling for missing data, and correct tempfile usage.


396-445: LGTM! Consistent corner plot implementation.

The production chains corner plot follows the same good patterns as other diagnostic assets. The parameter key selection is comprehensive for gravitational wave analysis.


448-497: LGTM! Normalizing flow samples corner plot well-implemented.

The asset correctly handles NF samples data and maintains consistency with other corner plot assets.


500-549: LGTM! Prior samples corner plot maintains consistency.

The implementation is consistent with other corner plot assets and properly handles prior samples data.


552-610: LGTM! Comprehensive trace plot implementation.

The trace plot asset correctly handles multiple parameters and creates a well-structured multi-subplot figure. The subplot layout logic is appropriate for the number of parameters.


613-650: LGTM! Log probability distribution plot well-implemented.

The asset correctly creates a histogram of log probabilities with proper error handling.


653-690: LGTM! Log probability evolution plot maintains consistency.

The asset follows the same good patterns as other diagnostic assets and correctly plots the evolution of log probabilities.


693-730: LGTM! Local acceptance rate plot properly implemented.

The asset correctly accesses nested data structure and handles missing keys appropriately.


733-770: LGTM! Global acceptance rate plot completes the diagnostic suite.

The final diagnostic asset maintains consistency with the local acceptance plot and properly handles the global acceptance data.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (2)
pipeline/dagster/RealDataCatalog/assets.py (2)

70-71: Use Python's tempfile module for temporary directory management.

Instead of hardcoding 'tmp' directory, use Python's tempfile.mkdtemp() for better cross-platform compatibility and automatic cleanup.

-    # Use a temp directory, but keep event_name and "raw" part
-    event_dir = os.path.join('tmp', event_name, "raw")
+    import tempfile
+    temp_dir = tempfile.mkdtemp()
+    event_dir = os.path.join(temp_dir, event_name, "raw")

Remember to clean up the entire temporary directory at the end of the function or use a context manager.


115-119: Incomplete cleanup and broad exception handling.

Two issues to address:

  1. The cleanup only removes .npz files but leaves the directory structure
  2. The exception handling is too broad and doesn't properly clean up on error

For proper cleanup, consider:

import shutil
# At the end of the function (outside the try block)
if 'temp_dir' in locals():
    shutil.rmtree(temp_dir)

For better error handling:

-        except Exception as e:
-            print(f"Error fetching data for {ifo} during {event_name}: {e}")
-            continue
+        except gwosc.api.APIError as e:
+            context.log.error(f"GWOSC API error for {ifo} during {event_name}: {e}")
+            continue
+        except Exception as e:
+            context.log.error(f"Unexpected error for {ifo} during {event_name}: {e}")
+            # Clean up any partial files
+            for f in [data_file_path + '.npz', psd_file_path + '.npz']:
+                if os.path.exists(f):
+                    os.remove(f)
+            raise
🧹 Nitpick comments (3)
pipeline/dagster/RealDataCatalog/assets.py (3)

11-11: Fix typo in comment.

There's a typo "configuration0" instead of "configuration".

-# Create asset group for run and configuration0
+# Create asset group for run and configuration

348-352: Implement the run asset to load results.

The run asset is currently a placeholder. This asset should load the results from execute_single_run.py and yield the specified assets.

Would you like me to help implement this asset to load results from the results.npz file and yield each asset appropriately?


412-427: Consider extracting parameter key lists to constants.

The parameter key lists are duplicated across multiple diagnostic assets. Consider extracting them to module-level constants for better maintainability.

# At module level
PARAMETER_KEYS = [
    "M_c", "q", "s1_mag", "s1_theta", "s1_phi",
    "s2_mag", "s2_theta", "s2_phi", "iota", "d_L",
    "phase_c", "psi", "ra", "dec"
]

Then use PARAMETER_KEYS in place of the hardcoded lists.

Also applies to: 464-479, 516-531, 567-582

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d07a78 and dbc5c90.

📒 Files selected for processing (1)
  • pipeline/dagster/RealDataCatalog/assets.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
pipeline/dagster/RealDataCatalog/assets.py (4)
pipeline/dagster/RealDataCatalog/minio_resource.py (3)
  • MinioResource (4-34)
  • put_object (27-30)
  • get_object (32-34)
src/jimgw/core/single_event/data.py (2)
  • from_gwosc (247-276)
  • to_psd (232-244)
src/jimgw/run/library/IMRPhenomPv2_standard_cbc.py (1)
  • IMRPhenomPv2StandardCBCRunDefinition (33-333)
jim_dagster/RealDataCatalog/assets.py (4)
  • event_list (21-37)
  • run (262-266)
  • raw_data (49-83)
  • config_file (166-228)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: publish_docker
🔇 Additional comments (4)
pipeline/dagster/RealDataCatalog/assets.py (4)

62-66: LGTM! Minio integration correctly implemented.

The function now correctly reads the event list from Minio instead of local filesystem, addressing the previous inconsistency issue.


232-236: LGTM! Proper Minio integration for event list reading.

The asset correctly reads the event list from Minio, maintaining consistency with the storage architecture.


243-249: Efficient approach to check IFO data availability.

Using try-catch blocks to check object existence in Minio is a reasonable approach for determining available IFOs.


1-771: Excellent overall architecture and Minio integration.

The asset pipeline is well-structured with:

  • Consistent patterns across all assets
  • Proper Minio integration replacing local filesystem operations
  • Good use of dynamic partitioning for event-based processing
  • Clear separation of concerns with asset groups (prerun, run, diagnostics)
  • Appropriate dependency management between assets

The migration from local filesystem to object storage is comprehensive and maintains data flow integrity.

Comment on lines +26 to +32
for catalog in catalogs:
event_list = fetch_catalog_json(catalog)["events"]
for event in event_list.values():
name = event["commonName"]
gps_time = event["GPS"]
result.append((name, gps_time))
event_names.append(name)

🛠️ Refactor suggestion

Add error handling for GWOSC API calls.

The catalog fetching could fail due to network issues or API problems. Consider adding try-catch blocks around the API calls to handle potential failures gracefully.

 for catalog in catalogs:
-    event_list = fetch_catalog_json(catalog)["events"]
-    for event in event_list.values():
-        name = event["commonName"]
-        gps_time = event["GPS"]
-        result.append((name, gps_time))
-        event_names.append(name)
+    try:
+        event_list = fetch_catalog_json(catalog)["events"]
+        for event in event_list.values():
+            name = event["commonName"]
+            gps_time = event["GPS"]
+            result.append((name, gps_time))
+            event_names.append(name)
+    except Exception as e:
+        context.log.error(f"Failed to fetch catalog {catalog}: {e}")
+        continue
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for catalog in catalogs:
event_list = fetch_catalog_json(catalog)["events"]
for event in event_list.values():
name = event["commonName"]
gps_time = event["GPS"]
result.append((name, gps_time))
event_names.append(name)
for catalog in catalogs:
try:
event_list = fetch_catalog_json(catalog)["events"]
for event in event_list.values():
name = event["commonName"]
gps_time = event["GPS"]
result.append((name, gps_time))
event_names.append(name)
except Exception as e:
context.log.error(f"Failed to fetch catalog {catalog}: {e}")
continue
🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py around lines 26 to 32, the code
calls fetch_catalog_json without error handling, which can cause unhandled
exceptions if the API call fails. Wrap the fetch_catalog_json call and
subsequent processing in a try-except block to catch exceptions like network
errors, log or handle the error appropriately, and ensure the program continues
or fails gracefully.

plot_paths.append(minio_plot_path)
os.remove(tmpfile.name)
except Exception as e:
print(f"Error processing {ifo} for {event_name}: {e}")

🛠️ Refactor suggestion

Use Dagster logging instead of print statements.

For better integration with Dagster's logging system, use context.log.error() instead of print() for error messages.

-            print(f"Error processing {ifo} for {event_name}: {e}")
+            context.log.error(f"Error processing {ifo} for {event_name}: {e}")

Also applies to: 215-215

🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py at lines 166 and 215, replace the
print statements used for error messages with Dagster's logging system by using
context.log.error(). This involves passing the Dagster context object to the
function if not already available, and then calling context.log.error() with the
error message instead of print(), ensuring proper integration with Dagster's
logging.

Comment on lines +370 to +372
loss = results["loss_data"]
if loss is None:
raise ValueError("No 'loss' key found in loss_data.")

⚠️ Potential issue

Inconsistent variable name in error message.

The error message references 'loss' key but the actual key being accessed is 'loss_data'.

-    if loss is None:
-        raise ValueError("No 'loss' key found in loss_data.")
+    if loss is None:
+        raise ValueError("No 'loss_data' key found in results.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
loss = results["loss_data"]
if loss is None:
raise ValueError("No 'loss' key found in loss_data.")
loss = results["loss_data"]
if loss is None:
raise ValueError("No 'loss_data' key found in results.")
🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py around lines 370 to 372, the error
message incorrectly references the key 'loss' while the code accesses
'loss_data'. Update the error message to correctly mention 'loss_data' to
maintain consistency and clarity.

Comment on lines +628 to +629
if log_prob is None:
raise ValueError("No 'log_prob' key found in loss_data.")

⚠️ Potential issue

Inconsistent error messages reference wrong data structure.

The error messages reference 'loss_data' but the actual key being accessed is from the main results array.

-        raise ValueError("No 'log_prob' key found in loss_data.")
+        raise ValueError("No 'log_probs' key found in results.")

Also applies to: 668-669

🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py around lines 628-629 and 668-669,
the error messages incorrectly reference 'loss_data' when the key 'log_prob' is
actually being accessed from the main results array. Update the error messages
to correctly mention the main results array instead of 'loss_data' to maintain
consistency and clarity.
