
Dagster development #243

Open

kazewong wants to merge 11 commits into main from dagster_development

Conversation

@kazewong
Owner

@kazewong kazewong commented Jul 24, 2025

This PR aims to integrate a Dagster instance into a local K8s environment.

Summary by CodeRabbit

  • Chores
    • Updated container images to expose port 80 and configure environment settings for improved runtime access.
    • Renamed a workflow step for clarity in the continuous deployment process.
    • Added new optional dependencies for Kubernetes and Postgres support.
  • New Features
    • Introduced Minio object storage integration with configurable resource support.
    • Updated data pipeline to store event lists, raw data, and plots in Minio instead of local disk.
    • Added pipeline configuration to load assets and resources for Minio storage.

@coderabbitai

coderabbitai bot commented Jul 24, 2025

Walkthrough

The changes update two Dockerfiles to set the PATH environment variable and expose port 80, and adjust the GitHub Actions workflow by renaming a Docker image build step for CUDA. New Dagster pipeline definitions and assets are added to integrate Minio object storage as a resource. The pyproject.toml adds optional dependencies for Dagster Kubernetes, Postgres, and Minio. Asset functions are updated to use Minio for storage instead of local files, and a new MinioResource class is introduced for Minio interactions.

Changes

  • containers/Containerfile.cpu, containers/Containerfile.cuda: Changed the git branch to dagster_development, extended the PATH env variable, and exposed port 80.
  • .github/workflows/CD.yml: Renamed a workflow step to "Build and push Docker image for cuda" in the publish_docker job.
  • pipeline/dagster/RealDataCatalog/definitions.py: Added Dagster definitions loading assets and configuring MinioResource with env-var-based configuration.
  • pipeline/dagster/RealDataCatalog/minio_resource.py: Added a new MinioResource class providing Minio client initialization and object-storage interaction methods.
  • pipeline/dagster/RealDataCatalog/assets.py: Added multiple Dagster assets for gravitational-wave data processing, all using MinioResource for storage and retrieval.
  • pyproject.toml: Added optional dependencies dagster-k8s, dagster-postgres, and minio with specified minimum versions.
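
For orientation, a minimal sketch of how definitions.py might wire the assets and the MinioResource together; this is a hedged reconstruction based on the summary above, and the module layout, resource field names, and environment variable names are assumptions rather than the actual file contents.

from dagster import Definitions, EnvVar, load_assets_from_modules

from . import assets  # assumed relative import of the new assets module
from .minio_resource import MinioResource

defs = Definitions(
    # Collect every asset defined in the assets module
    assets=load_assets_from_modules([assets]),
    resources={
        # Resource key "minio" matches the parameter name used by the asset functions
        "minio": MinioResource(
            endpoint=EnvVar("MINIO_ENDPOINT"),      # hypothetical env var names
            port=EnvVar("MINIO_PORT"),
            access_key=EnvVar("MINIO_ACCESS_KEY"),
            secret_key=EnvVar("MINIO_SECRET_KEY"),
            bucket_name=EnvVar("MINIO_BUCKET"),
        ),
    },
)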

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • CICD update #242: Introduces the .github/workflows/CD.yml workflow with the publish_docker job, which is directly modified in this PR.

Poem

🐇
A Docker hop, a port exposed,
New paths set where Dagster goes.
Minio joins to store with grace,
Assets travel to a cloud-based place.
Workflow steps renamed with care,
Dependencies added to the lair.
🥕

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (1)
containers/Containerfile.cuda (1)

2-2: Build is failing – base image requires authentication
nvcr.io/nvidia/cuda:12.9.1-base-ubuntu20.04 returns 401 Unauthorized in CI.
Options:

  1. Switch to the public Docker Hub mirror: nvidia/cuda:12.9.1-base-ubuntu20.04.
  2. Add a docker/login-action step with an NGC API token (and set an NVCR_TOKEN secret).

Until one of these is done, the pipeline will remain red.
🧹 Nitpick comments (6)
containers/Containerfile.cpu (2)

15-20: Combine build-time steps and set VIRTUAL_ENV for cleaner image layers

  1. RUN uv sync, ENV PATH=…, and EXPOSE 80 can be squashed into a single RUN layer to reduce final image size.
  2. Several Python tools rely on the VIRTUAL_ENV env-var. Declaring it avoids subtle path issues when invoking Dagster CLI inside the container.
-# Run uv sync
-RUN uv sync --extra dagster
-
-ENV PATH="/home/jim/.venv/bin:$PATH"
-# Expose the port that your Dagster instance will run on
-EXPOSE 80
+# Install deps & expose Dagster
+RUN uv sync --extra dagster \
+ && echo 'export VIRTUAL_ENV=/home/jim/.venv' >> /etc/profile.d/virtual_env.sh
+
+ENV VIRTUAL_ENV=/home/jim/.venv \
+    PATH="/home/jim/.venv/bin:$PATH"
+
+EXPOSE 80

9-12: Shallow-clone to speed up builds
Use --depth 1 --branch jim-dev to avoid fetching the full commit history and an extra checkout step.

-RUN git clone https://github.com/kazewong/jim.git
-
-WORKDIR /home/jim
-RUN git checkout jim-dev
+RUN git clone --depth 1 --branch jim-dev https://github.com/kazewong/jim.git /home/jim
+
+WORKDIR /home/jim
containers/Containerfile.cuda (2)

15-20: Mirror CPU-image improvements & set VIRTUAL_ENV
Apply the same layer consolidation and VIRTUAL_ENV export as suggested for the CPU image to keep both images consistent and slimmer.


1-1: Minor Hadolint warning – keyword casing
FROM ghcr.io/astral-sh/uv:python3.12-bookworm as uv-source → use AS (uppercase) to match FROM and silence DL4000.

-FROM ghcr.io/astral-sh/uv:python3.12-bookworm as uv-source
+FROM ghcr.io/astral-sh/uv:python3.12-bookworm AS uv-source
.github/workflows/CD.yml (2)

77-84: Consider building a CPU image as well
Renaming clarifies the CUDA build, but users on non-GPU nodes (e.g. local Kind clusters) will need a CPU-only image.
Add a second docker/build-push-action step (or a matrix) building containers/Containerfile.cpu and tagging it appropriately.


77-79: Step-id mismatch
id: push is now ambiguous because the step name mentions CUDA specifically. Renaming the id to push_cuda keeps logs self-describing and avoids future shadowing if a CPU step is added.

-      - name: Build and push Docker image for cuda
-        id: push
+      - name: Build and push CUDA Docker image
+        id: push_cuda
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d01168 and 407e325.

📒 Files selected for processing (3)
  • .github/workflows/CD.yml (1 hunks)
  • containers/Containerfile.cpu (1 hunks)
  • containers/Containerfile.cuda (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: CD
containers/Containerfile.cuda

[warning] 1-2: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 1)


[error] Build failed: failed to resolve source metadata for nvcr.io/nvidia/cuda:12.9.1-base-ubuntu20.04 due to 401 Unauthorized error when accessing https://nvcr.io/v2/nvidia/cuda/manifests/12.9.1-base-ubuntu20.04

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c51289 and 69e7a17.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • pyproject.toml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: publish_docker

Comment on lines 35 to 42
dagster = [
"dagster>=1.11.1",
"dagster-dg-cli>=1.11.1",
"dagster-k8s>=0.27.3",
"dagster-pipes>=1.11.1",
"dagster-postgres>=0.27.3",
"dagster-webserver>=1.11.1",
]

⚠️ Potential issue

Version mismatch: dagster-k8s / dagster-postgres pinned to 0.27.x will not install alongside core Dagster 1.11.x

The core package is declared at dagster>=1.11.1, but both plugins are constrained to >=0.27.3.
Dagster packages must share the same major (and usually minor) version; otherwise:

  • pip/uv will fail to resolve, because dagster-k8s 0.27.3 requires dagster~=0.27.3.
  • Runtime APIs differ drastically between 0.27 and 1.x, so even if force-installed, things will break.

Patch suggestion:

-    "dagster-k8s>=0.27.3",
+    # Keep plugin version aligned with core Dagster
+    "dagster-k8s>=1.11.1",
...
-    "dagster-postgres>=0.27.3",
+    "dagster-postgres>=1.11.1",

Or omit explicit versions and rely on the resolver:

-    "dagster-k8s>=0.27.3",
-    "dagster-postgres>=0.27.3",
+    "dagster-k8s",
+    "dagster-postgres",

Please adjust before merging to avoid broken images / CI.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
dagster = [
"dagster>=1.11.1",
"dagster-dg-cli>=1.11.1",
"dagster-k8s>=0.27.3",
"dagster-pipes>=1.11.1",
"dagster-postgres>=0.27.3",
"dagster-webserver>=1.11.1",
]
dagster = [
"dagster>=1.11.1",
"dagster-dg-cli>=1.11.1",
# Keep plugin version aligned with core Dagster
"dagster-k8s>=1.11.1",
"dagster-pipes>=1.11.1",
"dagster-postgres>=1.11.1",
"dagster-webserver>=1.11.1",
]
🤖 Prompt for AI Agents
In pyproject.toml lines 35 to 42, the versions of dagster-k8s and
dagster-postgres are pinned to 0.27.x while the core dagster package is at
1.11.x, causing version conflicts. Update the versions of dagster-k8s and
dagster-postgres to match the major and minor version of the core dagster
package (1.11.x) or remove explicit version pins to let the resolver handle
compatible versions automatically.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (2)
pipeline/dagster/RealDataCatalog/assets.py (2)

176-180: Critical inconsistency: Function still uses local filesystem after Minio migration.

The config_file function reads from the local "data/event_list.txt" file, but this data is now stored in Minio object storage. This will cause runtime failures since the local file no longer exists.

Update the function to read from Minio storage:

-    with open("data/event_list.txt", "r") as f:
-        lines = f.readlines()
-        event_dict = dict(line.strip().split() for line in lines)
+    # Download event list from Minio and read
+    import tempfile, os
+    with tempfile.NamedTemporaryFile(mode='w+', delete=False) as temp_file:
+        # Note: This function needs MinioResource parameter added to signature
+        minio.download_object("event_list.txt", temp_file.name)
+        with open(temp_file.name, 'r') as f:
+            lines = f.readlines()
+            event_dict = dict(line.strip().split() for line in lines)
+    os.unlink(temp_file.name)

Also, the function signature needs to be updated to include the MinioResource parameter:

-def config_file(context: AssetExecutionContext):
+def config_file(context: AssetExecutionContext, minio: MinioResource):

59-94: Suggest essential refactor: Migrate raw_data function to use Minio storage.

The raw_data function should be updated to store processed data in Minio object storage to maintain consistency with the storage strategy migration. Currently, it still saves data to local filesystem directories.

Consider updating the function to:

  1. Store strain and PSD data in Minio with organized object naming (e.g., {event_name}/raw/{ifo}_data, {event_name}/raw/{ifo}_psd)
  2. Use the existing Minio resource methods for object storage
  3. Maintain the same error handling logic but adapt file operations to Minio operations

This will ensure the entire pipeline uses consistent object storage and works properly in containerized/K8s environments where local filesystem persistence is not guaranteed.
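
A minimal sketch of that upload pattern follows; it is hedged, assuming a put_object(object_name, data, size, content_type) method on the Minio resource, and the helper name, object layout, and array fields (td, dt, epoch) are illustrative only.

import io

import numpy as np


def upload_npz(minio, object_name: str, **arrays) -> None:
    # Serialize the arrays to an in-memory .npz archive instead of a local file
    buf = io.BytesIO()
    np.savez(buf, **arrays)
    payload = buf.getvalue()
    minio.put_object(
        object_name=object_name,
        data=io.BytesIO(payload),
        size=len(payload),
        content_type="application/octet-stream",
    )

# e.g. upload_npz(minio, f"{event_name}/raw/{ifo}_data.npz",
#                 td=data.td, dt=data.dt, epoch=data.epoch)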

🧹 Nitpick comments (1)
pipeline/dagster/RealDataCatalog/minio_resource.py (1)

27-30: Minor improvement: Add error handling and validation.

The put_object method works correctly but could benefit from parameter validation and error handling.

Consider adding validation:

 def put_object(self, object_name: str, data, size: int, content_type: str):
+    if not object_name or not object_name.strip():
+        raise ValueError("Object name cannot be empty")
+    if size < 0:
+        raise ValueError("Size cannot be negative")
     client = self.get_client()
-    
-    return client.put_object(self.bucket_name, object_name, data, size, content_type)
+    try:
+        return client.put_object(self.bucket_name, object_name, data, size, content_type)
+    except Exception as e:
+        raise RuntimeError(f"Failed to upload object {object_name}: {str(e)}") from e
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 69e7a17 and 9e2f920.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • pipeline/dagster/RealDataCatalog/assets.py (4 hunks)
  • pipeline/dagster/RealDataCatalog/definitions.py (1 hunks)
  • pipeline/dagster/RealDataCatalog/minio_resource.py (1 hunks)
  • pyproject.toml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pyproject.toml
  • pipeline/dagster/RealDataCatalog/definitions.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
pipeline/dagster/RealDataCatalog/assets.py (2)
pipeline/dagster/RealDataCatalog/minio_resource.py (2)
  • MinioResource (4-34)
  • put_object (27-30)
jim_dagster/RealDataCatalog/assets.py (1)
  • event_list (21-37)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: publish_docker
🔇 Additional comments (8)
pipeline/dagster/RealDataCatalog/assets.py (4)

4-4: LGTM: Import addition for string buffer operations.

The io import is correctly added to support the in-memory string buffer operations used for Minio integration.


7-7: LGTM: MinioResource import for object storage integration.

The import of MinioResource is correctly added to enable Minio object storage functionality in the asset functions.


23-23: LGTM: Function signature updated for Minio integration.

The function signature correctly adds the MinioResource parameter for dependency injection.


35-45: LGTM: Proper implementation of Minio object upload.

The implementation correctly:

  • Uses io.StringIO() for in-memory string operations
  • Properly encodes string data to bytes
  • Uses io.BytesIO() for the data stream
  • Specifies appropriate content type and size

The migration from local file writing to Minio object storage is well-implemented.

pipeline/dagster/RealDataCatalog/minio_resource.py (4)

4-10: LGTM: Well-structured resource configuration.

The MinioResource class properly extends ConfigurableResource and defines all necessary configuration parameters with appropriate type hints. The configuration covers all essential Minio connection parameters.


19-21: LGTM: Presigned URL generation implemented correctly.

The method properly uses the Minio client to generate presigned URLs for object access.


23-25: LGTM: Object listing with proper filtering.

The method correctly implements object listing with prefix filtering and recursive search, returning a proper list.


32-34: LGTM: File download method implemented correctly.

The method properly uses the Minio client's fget_object method to download objects to local file paths.
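
Putting the reviewed pieces together, the resource likely looks roughly like the following. This is a hedged reconstruction from the review notes only, not the actual file; method names and signatures are assumptions (the review itself refers to both get_object and download_object across revisions).

from dagster import ConfigurableResource
from minio import Minio


class MinioResource(ConfigurableResource):
    endpoint: str
    port: str
    access_key: str
    secret_key: str
    bucket_name: str

    def get_client(self) -> Minio:
        # Plain HTTP client; the bucket is created on first use if missing
        client = Minio(f"{self.endpoint}:{self.port}", self.access_key,
                       self.secret_key, secure=False)
        if not client.bucket_exists(self.bucket_name):
            client.make_bucket(self.bucket_name)
        return client

    def get_presigned_url(self, object_name: str) -> str:
        return self.get_client().presigned_get_object(self.bucket_name, object_name)

    def list_objects(self, prefix: str = "") -> list:
        return list(self.get_client().list_objects(
            self.bucket_name, prefix=prefix, recursive=True))

    def put_object(self, object_name: str, data, size: int, content_type: str):
        return self.get_client().put_object(
            self.bucket_name, object_name, data, size, content_type)

    def download_object(self, object_name: str, file_path: str) -> None:
        # fget_object writes the object directly to a local path
        self.get_client().fget_object(self.bucket_name, object_name, file_path)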

Comment on lines +12 to +17
def get_client(self):
client = Minio(self.endpoint + ":" + self.port, self.access_key, self.secret_key, secure=False)
found = client.bucket_exists(self.bucket_name)
if not found:
client.make_bucket(self.bucket_name)
return client

🛠️ Refactor suggestion

Security concern: Insecure connection and potential performance issue.

Two issues with the get_client method:

  1. Security: secure=False disables TLS encryption, which is acceptable for local development but should be configurable for production environments.

  2. Performance: The method creates a new client instance and checks bucket existence on every call, which is inefficient for frequent operations.

Consider these improvements:

+    _client: Minio = None
+    secure: bool = False  # Add to class configuration
+
     def get_client(self):
+        if self._client is None:
-            client = Minio(self.endpoint + ":" + self.port, self.access_key, self.secret_key, secure=False)
+            self._client = Minio(self.endpoint + ":" + self.port, self.access_key, self.secret_key, secure=self.secure)
+            self._ensure_bucket_exists()
+        return self._client
+    
+    def _ensure_bucket_exists(self):
-        found = client.bucket_exists(self.bucket_name)
+        found = self._client.bucket_exists(self.bucket_name)
         if not found:
-            client.make_bucket(self.bucket_name)
-        return client
+            self._client.make_bucket(self.bucket_name)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/minio_resource.py around lines 12 to 17, the
get_client method disables TLS by setting secure=False and creates a new Minio
client and checks bucket existence on every call, causing security and
performance issues. Modify the method to accept a configurable parameter or
environment variable to toggle secure mode for production use. Refactor the code
to instantiate the Minio client once (e.g., during initialization) and reuse it,
avoiding repeated bucket existence checks and client creation on every call.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (3)
pipeline/dagster/RealDataCatalog/assets.py (3)

201-218: Config file function has multiple filesystem dependencies.

This function has several issues:

  1. Reads event_list.txt from local filesystem (line 203) instead of Minio
  2. Checks for data/PSD files in local filesystem (lines 211-213) that are now in Minio

Update to use Minio:

-    with open("data/event_list.txt", "r") as f:
-        lines = f.readlines()
-        event_dict = dict(line.strip().split() for line in lines)
+    # Fetch from Minio
+    event_list_obj = minio.get_object("event_list.txt")
+    lines = event_list_obj.read().decode("utf-8").splitlines()
+    event_dict = dict(line.strip().split() for line in lines)

For checking file availability, you'll need to use Minio's object listing or head_object methods instead of os.path.exists.
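
For instance, a minimal availability check could be built on the client's stat_object call (a HEAD-style lookup). The helper below is a hedged sketch that assumes direct access to the underlying client via get_client(); the helper name is hypothetical.

from minio.error import S3Error


def object_exists(minio, object_name: str) -> bool:
    client = minio.get_client()
    try:
        client.stat_object(minio.bucket_name, object_name)  # raises if the object is missing
        return True
    except S3Error:
        return False

# e.g. available_ifos = [ifo for ifo in ("H1", "L1", "V1")
#                        if object_exists(minio, f"{event_name}/raw/{ifo}_data.npz")]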


311-668: All diagnostic plotting functions need Minio integration.

All the remaining diagnostic functions (loss_plot, production_chains_corner_plot, etc.) read from local results.npz files that may not exist if the pipeline is using Minio for storage.

Consider creating a helper function to handle Minio downloads for all diagnostic functions:

def download_results_from_minio(context: AssetExecutionContext, minio: MinioResource, event_name: str):
    """Download results.npz from Minio to a temporary location."""
    import tempfile
    temp_file = tempfile.NamedTemporaryFile(suffix='.npz', delete=False)
    try:
        minio.download_object(f"{event_name}/results.npz", temp_file.name)
        return temp_file.name
    except Exception as e:
        os.unlink(temp_file.name)
        raise FileNotFoundError(f"Results file not found in Minio: {e}")

This would simplify updating all diagnostic functions and ensure consistent error handling.


127-155: Update diagnostic assets to fetch data from Minio, not the local data/ directory

The plotting (raw_data_plot, psd_plot, etc.) and configuration (config_file) functions still assume files live under data/<event_name>/…, but upstream you’ve moved raw and PSD data into Minio. These functions will fail at runtime unless they first pull the required files from Minio.

Please update each diagnostic asset to:

  • Inject the Minio resource via context.resources.minio.
  • Download the needed file(s) (e.g. .npz, event_list.txt) into a temporary directory (or stream them) before loading.
  • Clean up or close temporary files when done.

Affected locations:

  • pipeline/dagster/RealDataCatalog/assets.py:
    • raw_data_plot (lines 127–155)
    • psd_plot (lines 156–191)
    • config_file (lines 201–263)
    • All other diagnostic plotting functions (lines 311–668)

Example diff for raw_data_plot:

 def raw_data_plot(context: AssetExecutionContext):
     """
     Plot the raw strain data for each IFO for the event.
     """
-    import matplotlib.pyplot as plt
+    import matplotlib.pyplot as plt
     import tempfile
     import os
     import numpy as np

     event_name = context.partition_key
-    event_dir = os.path.join("data", event_name, "raw")
+    # Download raw data from Minio
+    temp_dir = tempfile.mkdtemp()
+    minio = context.resources.minio
+
+    plots_dir = os.path.join("data", event_name, "plots")
     os.makedirs(plots_dir, exist_ok=True)

     ifos = ["H1", "L1", "V1"]
     plot_paths = []
     for ifo in ifos:
-        data_file = os.path.join(event_dir, f"{ifo}_data.npz")
+        local_npz = os.path.join(temp_dir, f"{ifo}_data.npz")
+        # pull from Minio bucket `<event_name>/raw/`
+        minio.download_object(
+            bucket_name=event_name,
+            object_name=f"raw/{ifo}_data.npz",
+            file_path=local_npz,
+        )
+        data_file = local_npz

         if os.path.exists(data_file):
             data = np.load(data_file)
             t = data["epoch"] + np.arange(data["td"].shape[0]) * data["dt"]
             td = data["td"]
             if t is not None and td is not None:
                 plt.figure()
                 plt.plot(t, td)
                 plt.xlabel("Time (s)")
                 plt.ylabel("Strain")
                 plt.title(f"{ifo} Strain for {event_name}")
                 plot_path = os.path.join(plots_dir, f"{ifo}_strain.png")
                 plt.savefig(plot_path)
                 plt.close()
                 plot_paths.append(plot_path)
     return plot_paths

Apply analogous changes to psd_plot, config_file, and all other diagnostics so that every file read is preceded by a Minio download.

♻️ Duplicate comments (2)
pipeline/dagster/RealDataCatalog/assets.py (2)

59-65: Good fix! The function now correctly reads from Minio.

The implementation properly addresses the previous inconsistency by fetching the event list from Minio storage instead of the local filesystem.


164-191: PSD plot function also needs Minio integration.

Similar to raw_data_plot, this function expects local files but should download from Minio.

🧹 Nitpick comments (1)
pipeline/dagster/RealDataCatalog/assets.py (1)

23-45: Well-implemented transition to Minio storage.

The function correctly uploads the event list to Minio instead of writing to local filesystem. The implementation properly handles string-to-bytes conversion and sets the appropriate content type.

Minor optimization: The buffer.seek(0) on line 38 is unnecessary since buffer.getvalue() returns the entire buffer contents regardless of the current position.
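
A minimal sketch of the same upload without the extra buffering step; this is hedged, with `minio` standing in for the injected MinioResource and the object name taken from the PR's layout.

import io


def upload_event_list(minio, events: dict) -> None:
    text = "".join(f"{name} {gps}\n" for name, gps in events.items())
    payload = text.encode("utf-8")      # no seek needed; join/encode give the full string
    minio.put_object(
        object_name="event_list.txt",
        data=io.BytesIO(payload),       # Minio expects a readable byte stream
        size=len(payload),
        content_type="text/plain",
    )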

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e2f920 and 5608cde.

📒 Files selected for processing (2)
  • pipeline/dagster/RealDataCatalog/assets.py (5 hunks)
  • pipeline/dagster/RealDataCatalog/minio_resource.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pipeline/dagster/RealDataCatalog/minio_resource.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: publish_docker
🔇 Additional comments (1)
pipeline/dagster/RealDataCatalog/assets.py (1)

4-7: LGTM! Import statements are appropriate for Minio integration.

The addition of io module and MinioResource import aligns well with the transition from local filesystem to object storage.

Comment on lines 70 to 71
event_dir = os.path.join('tmp', event_name, "raw")
os.makedirs(event_dir, exist_ok=True)

🛠️ Refactor suggestion

Use Python's tempfile module for temporary directory management.

Instead of hardcoding 'tmp' directory, use Python's tempfile.mkdtemp() for better cross-platform compatibility and automatic cleanup.

-    # Use a temp directory, but keep event_name and "raw" part
-    event_dir = os.path.join('tmp', event_name, "raw")
+    import tempfile
+    temp_dir = tempfile.mkdtemp()
+    event_dir = os.path.join(temp_dir, event_name, "raw")

Remember to clean up the entire temporary directory at the end of the function or use a context manager.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py around lines 70 to 71, replace the
hardcoded 'tmp' directory with a temporary directory created using Python's
tempfile.mkdtemp() for better cross-platform compatibility. Update event_dir to
be inside this temporary directory. Also, ensure to clean up the entire
temporary directory at the end of the function or use a context manager to
handle automatic cleanup.

Comment on lines 114 to 118

🛠️ Refactor suggestion

⚠️ Potential issue

Incomplete cleanup and broad exception handling.

Two issues to address:

  1. The cleanup only removes .npz files but leaves the directory structure
  2. The exception handling is too broad and doesn't properly clean up on error

For proper cleanup, consider:

import shutil
# At the end of the function (outside the try block)
if 'temp_dir' in locals():
    shutil.rmtree(temp_dir)

For better error handling:

-        except Exception as e:
-            print(f"Error fetching data for {ifo} during {event_name}: {e}")
-            continue
+        except gwosc.api.APIError as e:
+            context.log.error(f"GWOSC API error for {ifo} during {event_name}: {e}")
+            continue
+        except Exception as e:
+            context.log.error(f"Unexpected error for {ifo} during {event_name}: {e}")
+            # Clean up any partial files
+            for f in [data_file_path + '.npz', psd_file_path + '.npz']:
+                if os.path.exists(f):
+                    os.remove(f)
+            raise
🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py around lines 114 to 118, the
current cleanup only removes .npz files but leaves the temporary directory
intact, and the exception handling is too broad without proper cleanup on
errors. To fix this, add code after the try-except block to check if the
temporary directory variable exists and remove the entire directory using
shutil.rmtree to ensure complete cleanup. Also, narrow the exception handling to
catch specific exceptions if possible, and ensure that cleanup code runs
regardless of success or failure, for example by using a finally block or by
placing cleanup after the try-except.

kazewong added 2 commits July 26, 2025 11:37

  • Move Data and IMRPhenomPv2StandardCBCRunDefinition imports into functions
  • Update raw_data, raw_data_plot, and psd_plot to read/write via Minio
  • Store plots in Minio instead of local disk using temporary files
  • Clean up temporary files after upload
  • Add MinioResource argument to plot asset functions
  • Enable t_c_range in config_file asset

  • Replace local filesystem access with Minio for reading and writing data, config files, and plots
  • Update asset signatures to accept MinioResource where needed
  • Remove redundant imports and local directory creation
  • Use temporary files for serialization before uploading to Minio

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
pipeline/dagster/RealDataCatalog/assets.py (2)

70-71: Consider using tempfile module for better cross-platform compatibility.

The hardcoded 'tmp' directory could cause issues on different platforms. Consider using Python's tempfile.mkdtemp() for better portability.

This issue was previously flagged and remains unaddressed. The current implementation still uses hardcoded paths which may not work consistently across different environments.


117-119: Improve exception handling and cleanup strategy.

The current approach has two issues:

  1. Exception handling is too broad and may mask specific errors
  2. Cleanup only removes .npz files but leaves directory structure

This issue was previously flagged and remains unaddressed. Consider implementing proper cleanup with shutil.rmtree() and more specific exception handling.

🧹 Nitpick comments (3)
pipeline/dagster/RealDataCatalog/assets.py (3)

11-11: Fix typo in comment.

There's a typo in the comment: "configuration0" should be "configuration".

-# Create asset group for run and configuration0
+# Create asset group for run and configuration

317-352: Placeholder asset - implementation needed.

The run multi-asset is properly structured with correct dependencies and partitions, but currently unimplemented. This appears to be intentional for the current PR scope.

Would you like me to help implement the logic for loading results from the external run execution?


355-770: Consider refactoring to reduce code duplication across diagnostic assets.

The diagnostic assets follow very similar patterns for:

  • Loading data from Minio
  • Creating plots with matplotlib/corner
  • Saving to temporary files
  • Uploading to Minio
  • Cleaning up temporary files

Consider creating helper functions to reduce duplication:

def create_and_upload_plot(event_name: str, minio: MinioResource, plot_func, plot_name: str, **kwargs):
    """Helper function to create plots and upload to Minio"""
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmpfile:
        fig = plot_func(**kwargs)
        fig.savefig(tmpfile.name)
        plt.close(fig)
        tmpfile.flush()
        tmpfile.seek(0)
        tmpfile_size = os.path.getsize(tmpfile.name)
        minio_plot_path = f"{event_name}/plots/{plot_name}.png"
        with open(tmpfile.name, "rb") as plotfile:
            minio.put_object(
                object_name=minio_plot_path,
                data=plotfile,
                size=tmpfile_size,
                content_type="image/png",
            )
    os.remove(tmpfile.name)
    return minio_plot_path

def load_results_from_minio(event_name: str, minio: MinioResource):
    """Helper function to load results.npz from Minio"""
    results_obj = minio.get_object(f"{event_name}/results.npz")
    return np.load(io.BytesIO(results_obj.read()), allow_pickle=True)

This would significantly reduce code duplication and improve maintainability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0107686 and 5d07a78.

📒 Files selected for processing (1)
  • pipeline/dagster/RealDataCatalog/assets.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: publish_docker
🔇 Additional comments (14)
pipeline/dagster/RealDataCatalog/assets.py (14)

16-46: LGTM! Well-structured event fetching and partition registration.

The asset correctly fetches events from multiple GWTC catalogs, properly buffers the data, uploads to Minio, and registers dynamic partitions. The implementation is clean and follows best practices.


57-66: LGTM! Critical inconsistency resolved.

The function now correctly reads the event list from Minio instead of the local filesystem, addressing the previous critical issue. The Minio integration is properly implemented.


122-168: LGTM! Well-implemented plotting asset with proper resource management.

The raw data plotting asset demonstrates good practices:

  • Proper use of tempfile for temporary files
  • Correct Minio integration for both reading and writing
  • Good error handling per IFO
  • Proper cleanup of temporary files
  • Clear return value

171-217: LGTM! Consistent implementation with the strain plotting asset.

The PSD plotting asset follows the same good patterns as the raw data plot asset, with proper resource management and error handling.


220-315: LGTM! Comprehensive configuration asset with robust IFO availability checking.

The config file asset is well-implemented:

  • Proper Minio integration for reading event list
  • Robust availability checking for IFOs by attempting to fetch both data and PSD files
  • Comprehensive parameter configuration for IMRPhenomPv2StandardCBCRunDefinition
  • Correct use of tempfile for serialization
  • Proper cleanup of temporary files

355-393: LGTM! Well-structured diagnostic asset with proper error handling.

The loss plot asset demonstrates good practices with proper Minio integration, error handling for missing data, and correct tempfile usage.


396-445: LGTM! Consistent corner plot implementation.

The production chains corner plot follows the same good patterns as other diagnostic assets. The parameter key selection is comprehensive for gravitational wave analysis.


448-497: LGTM! Normalizing flow samples corner plot well-implemented.

The asset correctly handles NF samples data and maintains consistency with other corner plot assets.


500-549: LGTM! Prior samples corner plot maintains consistency.

The implementation is consistent with other corner plot assets and properly handles prior samples data.


552-610: LGTM! Comprehensive trace plot implementation.

The trace plot asset correctly handles multiple parameters and creates a well-structured multi-subplot figure. The subplot layout logic is appropriate for the number of parameters.


613-650: LGTM! Log probability distribution plot well-implemented.

The asset correctly creates a histogram of log probabilities with proper error handling.


653-690: LGTM! Log probability evolution plot maintains consistency.

The asset follows the same good patterns as other diagnostic assets and correctly plots the evolution of log probabilities.


693-730: LGTM! Local acceptance rate plot properly implemented.

The asset correctly accesses nested data structure and handles missing keys appropriately.


733-770: LGTM! Global acceptance rate plot completes the diagnostic suite.

The final diagnostic asset maintains consistency with the local acceptance plot and properly handles the global acceptance data.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (2)
pipeline/dagster/RealDataCatalog/assets.py (2)

70-71: Use Python's tempfile module for temporary directory management.

Instead of hardcoding 'tmp' directory, use Python's tempfile.mkdtemp() for better cross-platform compatibility and automatic cleanup.

-    # Use a temp directory, but keep event_name and "raw" part
-    event_dir = os.path.join('tmp', event_name, "raw")
+    import tempfile
+    temp_dir = tempfile.mkdtemp()
+    event_dir = os.path.join(temp_dir, event_name, "raw")

Remember to clean up the entire temporary directory at the end of the function or use a context manager.


115-119: Incomplete cleanup and broad exception handling.

Two issues to address:

  1. The cleanup only removes .npz files but leaves the directory structure
  2. The exception handling is too broad and doesn't properly clean up on error

For proper cleanup, consider:

import shutil
# At the end of the function (outside the try block)
if 'temp_dir' in locals():
    shutil.rmtree(temp_dir)

For better error handling:

-        except Exception as e:
-            print(f"Error fetching data for {ifo} during {event_name}: {e}")
-            continue
+        except gwosc.api.APIError as e:
+            context.log.error(f"GWOSC API error for {ifo} during {event_name}: {e}")
+            continue
+        except Exception as e:
+            context.log.error(f"Unexpected error for {ifo} during {event_name}: {e}")
+            # Clean up any partial files
+            for f in [data_file_path + '.npz', psd_file_path + '.npz']:
+                if os.path.exists(f):
+                    os.remove(f)
+            raise
🧹 Nitpick comments (3)
pipeline/dagster/RealDataCatalog/assets.py (3)

11-11: Fix typo in comment.

There's a typo "configuration0" instead of "configuration".

-# Create asset group for run and configuration0
+# Create asset group for run and configuration

348-352: Implement the run asset to load results.

The run asset is currently a placeholder. This asset should load the results from execute_single_run.py and yield the specified assets.

Would you like me to help implement this asset to load results from the results.npz file and yield each asset appropriately?


412-427: Consider extracting parameter key lists to constants.

The parameter key lists are duplicated across multiple diagnostic assets. Consider extracting them to module-level constants for better maintainability.

# At module level
PARAMETER_KEYS = [
    "M_c", "q", "s1_mag", "s1_theta", "s1_phi",
    "s2_mag", "s2_theta", "s2_phi", "iota", "d_L",
    "phase_c", "psi", "ra", "dec"
]

Then use PARAMETER_KEYS in place of the hardcoded lists.

Also applies to: 464-479, 516-531, 567-582

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d07a78 and dbc5c90.

📒 Files selected for processing (1)
  • pipeline/dagster/RealDataCatalog/assets.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
pipeline/dagster/RealDataCatalog/assets.py (4)
pipeline/dagster/RealDataCatalog/minio_resource.py (3)
  • MinioResource (4-34)
  • put_object (27-30)
  • get_object (32-34)
src/jimgw/core/single_event/data.py (2)
  • from_gwosc (247-276)
  • to_psd (232-244)
src/jimgw/run/library/IMRPhenomPv2_standard_cbc.py (1)
  • IMRPhenomPv2StandardCBCRunDefinition (33-333)
jim_dagster/RealDataCatalog/assets.py (4)
  • event_list (21-37)
  • run (262-266)
  • raw_data (49-83)
  • config_file (166-228)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: publish_docker
🔇 Additional comments (4)
pipeline/dagster/RealDataCatalog/assets.py (4)

62-66: LGTM! Minio integration correctly implemented.

The function now correctly reads the event list from Minio instead of local filesystem, addressing the previous inconsistency issue.


232-236: LGTM! Proper Minio integration for event list reading.

The asset correctly reads the event list from Minio, maintaining consistency with the storage architecture.


243-249: Efficient approach to check IFO data availability.

Using try-catch blocks to check object existence in Minio is a reasonable approach for determining available IFOs.


1-771: Excellent overall architecture and Minio integration.

The asset pipeline is well-structured with:

  • Consistent patterns across all assets
  • Proper Minio integration replacing local filesystem operations
  • Good use of dynamic partitioning for event-based processing
  • Clear separation of concerns with asset groups (prerun, run, diagnostics)
  • Appropriate dependency management between assets

The migration from local filesystem to object storage is comprehensive and maintains data flow integrity.

Comment on lines +26 to +32
for catalog in catalogs:
event_list = fetch_catalog_json(catalog)["events"]
for event in event_list.values():
name = event["commonName"]
gps_time = event["GPS"]
result.append((name, gps_time))
event_names.append(name)

🛠️ Refactor suggestion

Add error handling for GWOSC API calls.

The catalog fetching could fail due to network issues or API problems. Consider adding try-catch blocks around the API calls to handle potential failures gracefully.

 for catalog in catalogs:
-    event_list = fetch_catalog_json(catalog)["events"]
-    for event in event_list.values():
-        name = event["commonName"]
-        gps_time = event["GPS"]
-        result.append((name, gps_time))
-        event_names.append(name)
+    try:
+        event_list = fetch_catalog_json(catalog)["events"]
+        for event in event_list.values():
+            name = event["commonName"]
+            gps_time = event["GPS"]
+            result.append((name, gps_time))
+            event_names.append(name)
+    except Exception as e:
+        context.log.error(f"Failed to fetch catalog {catalog}: {e}")
+        continue
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for catalog in catalogs:
event_list = fetch_catalog_json(catalog)["events"]
for event in event_list.values():
name = event["commonName"]
gps_time = event["GPS"]
result.append((name, gps_time))
event_names.append(name)
for catalog in catalogs:
try:
event_list = fetch_catalog_json(catalog)["events"]
for event in event_list.values():
name = event["commonName"]
gps_time = event["GPS"]
result.append((name, gps_time))
event_names.append(name)
except Exception as e:
context.log.error(f"Failed to fetch catalog {catalog}: {e}")
continue
🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py around lines 26 to 32, the code
calls fetch_catalog_json without error handling, which can cause unhandled
exceptions if the API call fails. Wrap the fetch_catalog_json call and
subsequent processing in a try-except block to catch exceptions like network
errors, log or handle the error appropriately, and ensure the program continues
or fails gracefully.

plot_paths.append(minio_plot_path)
os.remove(tmpfile.name)
except Exception as e:
print(f"Error processing {ifo} for {event_name}: {e}")

🛠️ Refactor suggestion

Use Dagster logging instead of print statements.

For better integration with Dagster's logging system, use context.log.error() instead of print() for error messages.

-            print(f"Error processing {ifo} for {event_name}: {e}")
+            context.log.error(f"Error processing {ifo} for {event_name}: {e}")

Also applies to: 215-215

🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py at lines 166 and 215, replace the
print statements used for error messages with Dagster's logging system by using
context.log.error(). This involves passing the Dagster context object to the
function if not already available, and then calling context.log.error() with the
error message instead of print(), ensuring proper integration with Dagster's
logging.

Comment on lines +370 to +372
loss = results["loss_data"]
if loss is None:
raise ValueError("No 'loss' key found in loss_data.")

⚠️ Potential issue

Inconsistent variable name in error message.

The error message references 'loss' key but the actual key being accessed is 'loss_data'.

-    if loss is None:
-        raise ValueError("No 'loss' key found in loss_data.")
+    if loss is None:
+        raise ValueError("No 'loss_data' key found in results.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
loss = results["loss_data"]
if loss is None:
raise ValueError("No 'loss' key found in loss_data.")
loss = results["loss_data"]
if loss is None:
raise ValueError("No 'loss_data' key found in results.")
🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py around lines 370 to 372, the error
message incorrectly references the key 'loss' while the code accesses
'loss_data'. Update the error message to correctly mention 'loss_data' to
maintain consistency and clarity.

Comment on lines +628 to +629
if log_prob is None:
raise ValueError("No 'log_prob' key found in loss_data.")

⚠️ Potential issue

Inconsistent error messages reference wrong data structure.

The error messages reference 'loss_data' but the actual key being accessed is from the main results array.

-        raise ValueError("No 'log_prob' key found in loss_data.")
+        raise ValueError("No 'log_probs' key found in results.")

Also applies to: 668-669

🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py around lines 628-629 and 668-669,
the error messages incorrectly reference 'loss_data' when the key 'log_prob' is
actually being accessed from the main results array. Update the error messages
to correctly mention the main results array instead of 'loss_data' to maintain
consistency and clarity.
