Conversation
Walkthrough
The changes update two Dockerfiles to set the PATH environment variable and expose port 80, and adjust the GitHub Actions workflow by renaming a Docker image build step for CUDA. New Dagster pipeline definitions and assets are added to integrate Minio object storage as a resource.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 0
🔭 Outside diff range comments (1)
containers/Containerfile.cuda (1)
2-2: Build is failing – base image requires authentication
`nvcr.io/nvidia/cuda:12.9.1-base-ubuntu20.04` returns `401 Unauthorized` in CI.
Options:
- Switch to the public Docker Hub mirror: `nvidia/cuda:12.9.1-base-ubuntu20.04`.
- Add a `docker/login-action` step with an NGC API token (and set an `NVCR_TOKEN` secret).
Until one of these is done, the pipeline will remain red.
🧹 Nitpick comments (6)
containers/Containerfile.cpu (2)
15-20: Combine build-time steps and set VIRTUAL_ENV for cleaner image layers

- `RUN uv sync`, `ENV PATH=…`, and `EXPOSE 80` can be squashed into a single `RUN` layer to reduce final image size.
- Several Python tools rely on the `VIRTUAL_ENV` env var. Declaring it avoids subtle path issues when invoking the Dagster CLI inside the container.

-# Run uv sync
-RUN uv sync --extra dagster
-
-ENV PATH="/home/jim/.venv/bin:$PATH"
-# Expose the port that your Dagster instance will run on
-EXPOSE 80
+# Install deps & expose Dagster
+RUN uv sync --extra dagster \
+    && echo 'export VIRTUAL_ENV=/home/jim/.venv' >> /etc/profile.d/virtual_env.sh
+
+ENV VIRTUAL_ENV=/home/jim/.venv \
+    PATH="/home/jim/.venv/bin:$PATH"
+
+EXPOSE 80
9-12: Shallow-clone to speed up builds
Use `--depth 1 --branch jim-dev` to avoid fetching the full commit history and an extra checkout step.
-RUN git clone https://github.com/kazewong/jim.git
-
-WORKDIR /home/jim
-RUN git checkout jim-dev
+RUN git clone --depth 1 --branch jim-dev https://github.com/kazewong/jim.git /home/jim
+
+WORKDIR /home/jim

containers/Containerfile.cuda (2)
15-20: Mirror CPU-image improvements & set VIRTUAL_ENV
Apply the same layer consolidation and `VIRTUAL_ENV` export as suggested for the CPU image to keep both images consistent and slimmer.
1-1: Minor Hadolint warning – keyword casing
`FROM ghcr.io/astral-sh/uv:python3.12-bookworm as uv-source` → use `AS` (uppercase) to match `FROM` and silence DL4000.
-FROM ghcr.io/astral-sh/uv:python3.12-bookworm as uv-source
+FROM ghcr.io/astral-sh/uv:python3.12-bookworm AS uv-source

.github/workflows/CD.yml (2)
77-84: Consider building a CPU image as well
Renaming clarifies the CUDA build, but users on non-GPU nodes (e.g. local Kind clusters) will need a CPU-only image.
Add a second `docker/build-push-action` step (or a matrix) building `containers/Containerfile.cpu` and tagging it appropriately.
77-79: Step-id mismatch
`id: push` is now ambiguous because the step name mentions CUDA specifically. Renaming the id to `push_cuda` keeps logs self-describing and avoids future shadowing if a CPU step is added.
-      - name: Build and push Docker image for cuda
-        id: push
+      - name: Build and push CUDA Docker image
+        id: push_cuda
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .github/workflows/CD.yml (1 hunks)
- containers/Containerfile.cpu (1 hunks)
- containers/Containerfile.cuda (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: CD
containers/Containerfile.cuda
[warning] 1-2: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 1)
[error] Build failed: failed to resolve source metadata for nvcr.io/nvidia/cuda:12.9.1-base-ubuntu20.04 due to 401 Unauthorized error when accessing https://nvcr.io/v2/nvidia/cuda/manifests/12.9.1-base-ubuntu20.04
Update Dagster-related packages to 1.11.3 and add dependencies for Kubernetes and Postgres support.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (1)
- pyproject.toml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: publish_docker
dagster = [
    "dagster>=1.11.1",
    "dagster-dg-cli>=1.11.1",
    "dagster-k8s>=0.27.3",
    "dagster-pipes>=1.11.1",
    "dagster-postgres>=0.27.3",
    "dagster-webserver>=1.11.1",
]
Version mismatch: dagster-k8s / dagster-postgres pinned to 0.27.x will not install alongside core Dagster 1.11.x
The core package is declared at dagster>=1.11.1, but both plugins are constrained to >=0.27.3.
Dagster packages must share the same major (and usually minor) version; otherwise:
• pip/uv will fail to resolve because dagster-k8s 0.27.3 requires dagster~=0.27.3
• Runtime APIs differ drastically between 0.27 and 1.x, so even if force-installed, things will break.
Patch suggestion:
- "dagster-k8s>=0.27.3",
+ # Keep plugin version aligned with core Dagster
+ "dagster-k8s>=1.11.1",
...
- "dagster-postgres>=0.27.3",
+ "dagster-postgres>=1.11.1",Or omit explicit versions and rely on the resolver:
- "dagster-k8s>=0.27.3",
- "dagster-postgres>=0.27.3",
+ "dagster-k8s",
+ "dagster-postgres",Please adjust before merging to avoid broken images / CI.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Current:
dagster = [
    "dagster>=1.11.1",
    "dagster-dg-cli>=1.11.1",
    "dagster-k8s>=0.27.3",
    "dagster-pipes>=1.11.1",
    "dagster-postgres>=0.27.3",
    "dagster-webserver>=1.11.1",
]

Suggested:
dagster = [
    "dagster>=1.11.1",
    "dagster-dg-cli>=1.11.1",
    # Keep plugin version aligned with core Dagster
    "dagster-k8s>=1.11.1",
    "dagster-pipes>=1.11.1",
    "dagster-postgres>=1.11.1",
    "dagster-webserver>=1.11.1",
]
🤖 Prompt for AI Agents
In pyproject.toml lines 35 to 42, the versions of dagster-k8s and
dagster-postgres are pinned to 0.27.x while the core dagster package is at
1.11.x, causing version conflicts. Update the versions of dagster-k8s and
dagster-postgres to match the major and minor version of the core dagster
package (1.11.x) or remove explicit version pins to let the resolver handle
compatible versions automatically.
Actionable comments posted: 2
🔭 Outside diff range comments (2)
pipeline/dagster/RealDataCatalog/assets.py (2)
176-180: Critical inconsistency: Function still uses local filesystem after Minio migration.

The `config_file` function reads from the local "data/event_list.txt" file, but this data is now stored in Minio object storage. This will cause runtime failures since the local file no longer exists.

Update the function to read from Minio storage:
-    with open("data/event_list.txt", "r") as f:
-        lines = f.readlines()
-    event_dict = dict(line.strip().split() for line in lines)
+    # Download event list from Minio and read
+    import tempfile, os
+    with tempfile.NamedTemporaryFile(mode='w+', delete=False) as temp_file:
+        # Note: This function needs MinioResource parameter added to signature
+        minio.download_object("event_list.txt", temp_file.name)
+    with open(temp_file.name, 'r') as f:
+        lines = f.readlines()
+    event_dict = dict(line.strip().split() for line in lines)
+    os.unlink(temp_file.name)

Also, the function signature needs to be updated to include the `MinioResource` parameter:
-def config_file(context: AssetExecutionContext):
+def config_file(context: AssetExecutionContext, minio: MinioResource):
59-94: Suggest essential refactor: Migrate raw_data function to use Minio storage.

The `raw_data` function should be updated to store processed data in Minio object storage to maintain consistency with the storage strategy migration. Currently, it still saves data to local filesystem directories.

Consider updating the function to:
- Store strain and PSD data in Minio with organized object naming (e.g., `{event_name}/raw/{ifo}_data`, `{event_name}/raw/{ifo}_psd`)
- Use the existing Minio resource methods for object storage
- Maintain the same error handling logic but adapt file operations to Minio operations
This will ensure the entire pipeline uses consistent object storage and works properly in containerized/K8s environments where local filesystem persistence is not guaranteed.
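For concreteness, here is a minimal sketch of what the Minio-backed write could look like, reusing the `put_object` wrapper reviewed below; the helper name and the exact object keys are illustrative, not part of the PR:

import io
import numpy as np

def upload_npz_to_minio(minio, object_name: str, **arrays):
    # Serialize the arrays into an in-memory .npz archive instead of a local file
    buffer = io.BytesIO()
    np.savez(buffer, **arrays)
    buffer.seek(0)
    size = buffer.getbuffer().nbytes
    # Push the bytes to the configured bucket via the resource's wrapper
    return minio.put_object(
        object_name=object_name,
        data=buffer,
        size=size,
        content_type="application/octet-stream",
    )

# e.g. upload_npz_to_minio(minio, f"{event_name}/raw/{ifo}_data.npz", td=td, dt=dt, epoch=epoch)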
🧹 Nitpick comments (1)
pipeline/dagster/RealDataCatalog/minio_resource.py (1)
27-30: Minor improvement: Add error handling and validation.

The `put_object` method works correctly but could benefit from parameter validation and error handling.

Consider adding validation:
 def put_object(self, object_name: str, data, size: int, content_type: str):
+    if not object_name or not object_name.strip():
+        raise ValueError("Object name cannot be empty")
+    if size < 0:
+        raise ValueError("Size cannot be negative")
     client = self.get_client()
-
-    return client.put_object(self.bucket_name, object_name, data, size, content_type)
+    try:
+        return client.put_object(self.bucket_name, object_name, data, size, content_type)
+    except Exception as e:
+        raise RuntimeError(f"Failed to upload object {object_name}: {str(e)}") from e
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (4)
- pipeline/dagster/RealDataCatalog/assets.py (4 hunks)
- pipeline/dagster/RealDataCatalog/definitions.py (1 hunks)
- pipeline/dagster/RealDataCatalog/minio_resource.py (1 hunks)
- pyproject.toml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pyproject.toml
- pipeline/dagster/RealDataCatalog/definitions.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
pipeline/dagster/RealDataCatalog/assets.py (2)
pipeline/dagster/RealDataCatalog/minio_resource.py (2)
- MinioResource (4-34)
- put_object (27-30)
jim_dagster/RealDataCatalog/assets.py (1)
- event_list (21-37)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: publish_docker
🔇 Additional comments (8)
pipeline/dagster/RealDataCatalog/assets.py (4)
4-4: LGTM: Import addition for string buffer operations.

The `io` import is correctly added to support the in-memory string buffer operations used for Minio integration.
7-7: LGTM: MinioResource import for object storage integration.

The import of `MinioResource` is correctly added to enable Minio object storage functionality in the asset functions.
23-23: LGTM: Function signature updated for Minio integration.

The function signature correctly adds the `MinioResource` parameter for dependency injection.
35-45: LGTM: Proper implementation of Minio object upload.

The implementation correctly:
- Uses `io.StringIO()` for in-memory string operations
- Properly encodes string data to bytes
- Uses `io.BytesIO()` for the data stream
- Specifies appropriate content type and size
The migration from local file writing to Minio object storage is well-implemented.
pipeline/dagster/RealDataCatalog/minio_resource.py (4)
4-10: LGTM: Well-structured resource configuration.

The `MinioResource` class properly extends `ConfigurableResource` and defines all necessary configuration parameters with appropriate type hints. The configuration covers all essential Minio connection parameters.
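For readers without the file open, the configuration surface looks roughly like the following; this is reconstructed from the fields used in `get_client` (quoted later in this thread), so treat the exact types and defaults as assumptions:

from dagster import ConfigurableResource

class MinioResource(ConfigurableResource):
    # Connection settings supplied through Dagster resource configuration
    endpoint: str
    port: str
    access_key: str
    secret_key: str
    bucket_name: str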
19-21: LGTM: Presigned URL generation implemented correctly.

The method properly uses the Minio client to generate presigned URLs for object access.
23-25: LGTM: Object listing with proper filtering.

The method correctly implements object listing with prefix filtering and recursive search, returning a proper list.
32-34: LGTM: File download method implemented correctly.

The method properly uses the Minio client's `fget_object` method to download objects to local file paths.
def get_client(self):
    client = Minio(self.endpoint + ":" + self.port, self.access_key, self.secret_key, secure=False)
    found = client.bucket_exists(self.bucket_name)
    if not found:
        client.make_bucket(self.bucket_name)
    return client
🛠️ Refactor suggestion
Security concern: Insecure connection and potential performance issue.
Two issues with the get_client method:
- Security: `secure=False` disables TLS encryption, which is acceptable for local development but should be configurable for production environments.
- Performance: The method creates a new client instance and checks bucket existence on every call, which is inefficient for frequent operations.
Consider these improvements:
+    _client: Minio = None
+    secure: bool = False  # Add to class configuration
+
     def get_client(self):
+        if self._client is None:
-        client = Minio(self.endpoint + ":" + self.port, self.access_key, self.secret_key, secure=False)
+            self._client = Minio(self.endpoint + ":" + self.port, self.access_key, self.secret_key, secure=self.secure)
+            self._ensure_bucket_exists()
+        return self._client
+
+    def _ensure_bucket_exists(self):
-        found = client.bucket_exists(self.bucket_name)
+        found = self._client.bucket_exists(self.bucket_name)
         if not found:
-            client.make_bucket(self.bucket_name)
-        return client
+            self._client.make_bucket(self.bucket_name)

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/minio_resource.py around lines 12 to 17, the
get_client method disables TLS by setting secure=False and creates a new Minio
client and checks bucket existence on every call, causing security and
performance issues. Modify the method to accept a configurable parameter or
environment variable to toggle secure mode for production use. Refactor the code
to instantiate the Minio client once (e.g., during initialization) and reuse it,
avoiding repeated bucket existence checks and client creation on every call.
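A fuller sketch of the suggested refactor, assuming Dagster's `ConfigurableResource` is Pydantic-based (so mutable state goes in a `PrivateAttr`); this is one possible shape, not the project's actual implementation:

from typing import Optional

from dagster import ConfigurableResource
from minio import Minio
from pydantic import PrivateAttr

class MinioResource(ConfigurableResource):
    endpoint: str
    port: str
    access_key: str
    secret_key: str
    bucket_name: str
    secure: bool = False  # flip to True when TLS terminates in front of Minio

    _client: Optional[Minio] = PrivateAttr(default=None)

    def get_client(self) -> Minio:
        # Build the client once; the bucket-existence check then runs only on first use
        if self._client is None:
            self._client = Minio(
                f"{self.endpoint}:{self.port}",
                access_key=self.access_key,
                secret_key=self.secret_key,
                secure=self.secure,
            )
            if not self._client.bucket_exists(self.bucket_name):
                self._client.make_bucket(self.bucket_name)
        return self._client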
Actionable comments posted: 2
🔭 Outside diff range comments (3)
pipeline/dagster/RealDataCatalog/assets.py (3)
201-218: Config file function has multiple filesystem dependencies.

This function has several issues:
- Reads event_list.txt from the local filesystem (line 203) instead of Minio
- Checks for data/PSD files in the local filesystem (lines 211-213) that are now in Minio

Update to use Minio:
-    with open("data/event_list.txt", "r") as f:
-        lines = f.readlines()
-    event_dict = dict(line.strip().split() for line in lines)
+    # Fetch from Minio
+    event_list_obj = minio.get_object("event_list.txt")
+    lines = event_list_obj.read().decode("utf-8").splitlines()
+    event_dict = dict(line.strip().split() for line in lines)
- with open("data/event_list.txt", "r") as f: - lines = f.readlines() - event_dict = dict(line.strip().split() for line in lines) + # Fetch from Minio + event_list_obj = minio.get_object("event_list.txt") + lines = event_list_obj.read().decode("utf-8").splitlines() + event_dict = dict(line.strip().split() for line in lines)For checking file availability, you'll need to use Minio's object listing or head_object methods instead of os.path.exists.
311-668: All diagnostic plotting functions need Minio integration.

All the remaining diagnostic functions (`loss_plot`, `production_chains_corner_plot`, etc.) read from local `results.npz` files that may not exist if the pipeline is using Minio for storage.

Consider creating a helper function to handle Minio downloads for all diagnostic functions:

def download_results_from_minio(context: AssetExecutionContext, minio: MinioResource, event_name: str):
    """Download results.npz from Minio to a temporary location."""
    import tempfile
    temp_file = tempfile.NamedTemporaryFile(suffix='.npz', delete=False)
    try:
        minio.download_object(f"{event_name}/results.npz", temp_file.name)
        return temp_file.name
    except Exception as e:
        os.unlink(temp_file.name)
        raise FileNotFoundError(f"Results file not found in Minio: {e}")

This would simplify updating all diagnostic functions and ensure consistent error handling.
127-155: Update diagnostic assets to fetch data from Minio, not the local data/ directory

The plotting (`raw_data_plot`, `psd_plot`, etc.) and configuration (`config_file`) functions still assume files live under `data/<event_name>/…`, but upstream you've moved raw and PSD data into Minio. These functions will fail at runtime unless they first pull the required files from Minio.

Please update each diagnostic asset to:
- Inject the Minio resource via `context.resources.minio`.
- Download the needed file(s) (e.g. `.npz`, `event_list.txt`) into a temporary directory (or stream them) before loading.
- Clean up or close temporary files when done.

Affected locations:
- pipeline/dagster/RealDataCatalog/assets.py:
  - raw_data_plot (lines 127–155)
  - psd_plot (lines 156–191)
  - config_file (lines 201–263)
  - All other diagnostic plotting functions (lines 311–668)

Example diff for raw_data_plot:
 def raw_data_plot(context: AssetExecutionContext):
     """
     Plot the raw strain data for each IFO for the event.
     """
-    import matplotlib.pyplot as plt
+    import matplotlib.pyplot as plt
+    import tempfile
     import os
     import numpy as np
     event_name = context.partition_key
-    event_dir = os.path.join("data", event_name, "raw")
+    # Download raw data from Minio
+    temp_dir = tempfile.mkdtemp()
+    minio = context.resources.minio
+
+    plots_dir = os.path.join("data", event_name, "plots")
     os.makedirs(plots_dir, exist_ok=True)
     ifos = ["H1", "L1", "V1"]
     plot_paths = []
     for ifo in ifos:
-        data_file = os.path.join(event_dir, f"{ifo}_data.npz")
+        local_npz = os.path.join(temp_dir, f"{ifo}_data.npz")
+        # pull from Minio bucket `<event_name>/raw/`
+        minio.download_object(
+            bucket_name=event_name,
+            object_name=f"raw/{ifo}_data.npz",
+            file_path=local_npz,
+        )
+        data_file = local_npz
         if os.path.exists(data_file):
             data = np.load(data_file)
             t = data["epoch"] + np.arange(data["td"].shape[0]) * data["dt"]
             td = data["td"]
             if t is not None and td is not None:
                 plt.figure()
                 plt.plot(t, td)
                 plt.xlabel("Time (s)")
                 plt.ylabel("Strain")
                 plt.title(f"{ifo} Strain for {event_name}")
                 plot_path = os.path.join(plots_dir, f"{ifo}_strain.png")
                 plt.savefig(plot_path)
                 plt.close()
                 plot_paths.append(plot_path)
     return plot_paths

Apply analogous changes to `psd_plot`, `config_file`, and all other diagnostics so that every file read is preceded by a Minio download.
♻️ Duplicate comments (2)
pipeline/dagster/RealDataCatalog/assets.py (2)
59-65: Good fix! The function now correctly reads from Minio.

The implementation properly addresses the previous inconsistency by fetching the event list from Minio storage instead of the local filesystem.

164-191: PSD plot function also needs Minio integration.

Similar to `raw_data_plot`, this function expects local files but should download from Minio.
🧹 Nitpick comments (1)
pipeline/dagster/RealDataCatalog/assets.py (1)
23-45: Well-implemented transition to Minio storage.

The function correctly uploads the event list to Minio instead of writing to the local filesystem. The implementation properly handles string-to-bytes conversion and sets the appropriate content type.

Minor optimization: The `buffer.seek(0)` on line 38 is unnecessary since `buffer.getvalue()` returns the entire buffer contents regardless of the current position.
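A quick illustration of why the seek is redundant (the buffer contents here are made up):

import io

buffer = io.StringIO()
buffer.write("GW150914 1126259462.4\n")
# getvalue() returns the full contents regardless of the current stream position
payload = buffer.getvalue().encode("utf-8")
assert payload.decode("utf-8").startswith("GW150914")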
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- pipeline/dagster/RealDataCatalog/assets.py (5 hunks)
- pipeline/dagster/RealDataCatalog/minio_resource.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pipeline/dagster/RealDataCatalog/minio_resource.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: publish_docker
🔇 Additional comments (1)
pipeline/dagster/RealDataCatalog/assets.py (1)
4-7: LGTM! Import statements are appropriate for Minio integration.

The addition of the `io` module and `MinioResource` import aligns well with the transition from local filesystem to object storage.
event_dir = os.path.join('tmp', event_name, "raw")
os.makedirs(event_dir, exist_ok=True)
🛠️ Refactor suggestion
Use Python's tempfile module for temporary directory management.
Instead of hardcoding 'tmp' directory, use Python's tempfile.mkdtemp() for better cross-platform compatibility and automatic cleanup.
- # Use a temp directory, but keep event_name and "raw" part
- event_dir = os.path.join('tmp', event_name, "raw")
+ import tempfile
+ temp_dir = tempfile.mkdtemp()
+ event_dir = os.path.join(temp_dir, event_name, "raw")

Remember to clean up the entire temporary directory at the end of the function or use a context manager.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py around lines 70 to 71, replace the
hardcoded 'tmp' directory with a temporary directory created using Python's
tempfile.mkdtemp() for better cross-platform compatibility. Update event_dir to
be inside this temporary directory. Also, ensure to clean up the entire
temporary directory at the end of the function or use a context manager to
handle automatic cleanup.
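A minimal sketch of the context-manager variant mentioned above; the processing body is elided, and `event_name` is assumed to come from `context.partition_key` as in the asset:

import os
import tempfile

event_name = context.partition_key
with tempfile.TemporaryDirectory() as temp_dir:
    event_dir = os.path.join(temp_dir, event_name, "raw")
    os.makedirs(event_dir, exist_ok=True)
    ...  # fetch strain/PSD data, write the .npz files, upload them to Minio
# temp_dir and everything under it is removed here, even if an exception was raised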
🛠️ Refactor suggestion
Incomplete cleanup and broad exception handling.
Two issues to address:
- The cleanup only removes .npz files but leaves the directory structure
- The exception handling is too broad and doesn't properly clean up on error
For proper cleanup, consider:
import shutil
# At the end of the function (outside the try block)
if 'temp_dir' in locals():
    shutil.rmtree(temp_dir)

For better error handling:
-    except Exception as e:
-        print(f"Error fetching data for {ifo} during {event_name}: {e}")
-        continue
+    except gwosc.api.APIError as e:
+        context.log.error(f"GWOSC API error for {ifo} during {event_name}: {e}")
+        continue
+    except Exception as e:
+        context.log.error(f"Unexpected error for {ifo} during {event_name}: {e}")
+        # Clean up any partial files
+        for f in [data_file_path + '.npz', psd_file_path + '.npz']:
+            if os.path.exists(f):
+                os.remove(f)
+        raise

🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py around lines 114 to 118, the
current cleanup only removes .npz files but leaves the temporary directory
intact, and the exception handling is too broad without proper cleanup on
errors. To fix this, add code after the try-except block to check if the
temporary directory variable exists and remove the entire directory using
shutil.rmtree to ensure complete cleanup. Also, narrow the exception handling to
catch specific exceptions if possible, and ensure that cleanup code runs
regardless of success or failure, for example by using a finally block or by
placing cleanup after the try-except.
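One way to guarantee that cleanup regardless of outcome is a try/finally around the per-event work; a sketch, assuming `temp_dir` comes from `tempfile.mkdtemp()` as suggested above:

import shutil
import tempfile

temp_dir = tempfile.mkdtemp()
try:
    ...  # fetch data, write .npz files under temp_dir, upload them to Minio
finally:
    # Runs whether the body succeeded or raised an exception
    shutil.rmtree(temp_dir, ignore_errors=True)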
- Move Data and IMRPhenomPv2StandardCBCRunDefinition imports into functions
- Update raw_data, raw_data_plot, and psd_plot to read/write via Minio
- Store plots in Minio instead of local disk using temporary files
- Clean up temporary files after upload
- Add MinioResource argument to plot asset functions
- Enable t_c_range in config_file asset
- Replace local filesystem access with Minio for reading and writing data, config files, and plots
- Update asset signatures to accept MinioResource where needed
- Remove redundant imports and local directory creation
- Use temporary files for serialization before uploading to Minio
Actionable comments posted: 0
♻️ Duplicate comments (2)
pipeline/dagster/RealDataCatalog/assets.py (2)
70-71: Consider using tempfile module for better cross-platform compatibility.

The hardcoded 'tmp' directory could cause issues on different platforms. Consider using Python's `tempfile.mkdtemp()` for better portability.

This issue was previously flagged and remains unaddressed. The current implementation still uses hardcoded paths which may not work consistently across different environments.
117-119: Improve exception handling and cleanup strategy.

The current approach has two issues:
- Exception handling is too broad and may mask specific errors
- Cleanup only removes .npz files but leaves directory structure
This issue was previously flagged and remains unaddressed. Consider implementing proper cleanup with `shutil.rmtree()` and more specific exception handling.
🧹 Nitpick comments (3)
pipeline/dagster/RealDataCatalog/assets.py (3)
11-11: Fix typo in comment.

There's a typo in the comment: "configuration0" should be "configuration".
-# Create asset group for run and configuration0
+# Create asset group for run and configuration
317-352: Placeholder asset - implementation needed.

The run multi-asset is properly structured with correct dependencies and partitions, but currently unimplemented. This appears to be intentional for the current PR scope.
Would you like me to help implement the logic for loading results from the external run execution?
355-770: Consider refactoring to reduce code duplication across diagnostic assets.

The diagnostic assets follow very similar patterns for:
- Loading data from Minio
- Creating plots with matplotlib/corner
- Saving to temporary files
- Uploading to Minio
- Cleaning up temporary files
Consider creating helper functions to reduce duplication:
def create_and_upload_plot(event_name: str, minio: MinioResource, plot_func, plot_name: str, **kwargs):
    """Helper function to create plots and upload to Minio"""
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmpfile:
        fig = plot_func(**kwargs)
        fig.savefig(tmpfile.name)
        plt.close(fig)
        tmpfile.flush()
        tmpfile.seek(0)
        tmpfile_size = os.path.getsize(tmpfile.name)
        minio_plot_path = f"{event_name}/plots/{plot_name}.png"
        with open(tmpfile.name, "rb") as plotfile:
            minio.put_object(
                object_name=minio_plot_path,
                data=plotfile,
                size=tmpfile_size,
                content_type="image/png",
            )
    os.remove(tmpfile.name)
    return minio_plot_path

def load_results_from_minio(event_name: str, minio: MinioResource):
    """Helper function to load results.npz from Minio"""
    results_obj = minio.get_object(f"{event_name}/results.npz")
    return np.load(io.BytesIO(results_obj.read()), allow_pickle=True)

This would significantly reduce code duplication and improve maintainability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- pipeline/dagster/RealDataCatalog/assets.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: publish_docker
🔇 Additional comments (14)
pipeline/dagster/RealDataCatalog/assets.py (14)
16-46: LGTM! Well-structured event fetching and partition registration.

The asset correctly fetches events from multiple GWTC catalogs, properly buffers the data, uploads to Minio, and registers dynamic partitions. The implementation is clean and follows best practices.

57-66: LGTM! Critical inconsistency resolved.

The function now correctly reads the event list from Minio instead of the local filesystem, addressing the previous critical issue. The Minio integration is properly implemented.

122-168: LGTM! Well-implemented plotting asset with proper resource management.

The raw data plotting asset demonstrates good practices:
- Proper use of tempfile for temporary files
- Correct Minio integration for both reading and writing
- Good error handling per IFO
- Proper cleanup of temporary files
- Clear return value
171-217: LGTM! Consistent implementation with the strain plotting asset.

The PSD plotting asset follows the same good patterns as the raw data plot asset, with proper resource management and error handling.

220-315: LGTM! Comprehensive configuration asset with robust IFO availability checking.

The config file asset is well-implemented:
- Proper Minio integration for reading event list
- Robust availability checking for IFOs by attempting to fetch both data and PSD files
- Comprehensive parameter configuration for IMRPhenomPv2StandardCBCRunDefinition
- Correct use of tempfile for serialization
- Proper cleanup of temporary files
355-393: LGTM! Well-structured diagnostic asset with proper error handling.

The loss plot asset demonstrates good practices with proper Minio integration, error handling for missing data, and correct tempfile usage.

396-445: LGTM! Consistent corner plot implementation.

The production chains corner plot follows the same good patterns as other diagnostic assets. The parameter key selection is comprehensive for gravitational wave analysis.

448-497: LGTM! Normalizing flow samples corner plot well-implemented.

The asset correctly handles NF samples data and maintains consistency with other corner plot assets.

500-549: LGTM! Prior samples corner plot maintains consistency.

The implementation is consistent with other corner plot assets and properly handles prior samples data.

552-610: LGTM! Comprehensive trace plot implementation.

The trace plot asset correctly handles multiple parameters and creates a well-structured multi-subplot figure. The subplot layout logic is appropriate for the number of parameters.

613-650: LGTM! Log probability distribution plot well-implemented.

The asset correctly creates a histogram of log probabilities with proper error handling.

653-690: LGTM! Log probability evolution plot maintains consistency.

The asset follows the same good patterns as other diagnostic assets and correctly plots the evolution of log probabilities.

693-730: LGTM! Local acceptance rate plot properly implemented.

The asset correctly accesses nested data structure and handles missing keys appropriately.

733-770: LGTM! Global acceptance rate plot completes the diagnostic suite.

The final diagnostic asset maintains consistency with the local acceptance plot and properly handles the global acceptance data.
Actionable comments posted: 4
♻️ Duplicate comments (2)
pipeline/dagster/RealDataCatalog/assets.py (2)
70-71: Use Python's tempfile module for temporary directory management.

Instead of hardcoding a 'tmp' directory, use Python's `tempfile.mkdtemp()` for better cross-platform compatibility and automatic cleanup.
-    # Use a temp directory, but keep event_name and "raw" part
-    event_dir = os.path.join('tmp', event_name, "raw")
+    import tempfile
+    temp_dir = tempfile.mkdtemp()
+    event_dir = os.path.join(temp_dir, event_name, "raw")

Remember to clean up the entire temporary directory at the end of the function or use a context manager.
115-119: Incomplete cleanup and broad exception handling.

Two issues to address:
- The cleanup only removes .npz files but leaves the directory structure
- The exception handling is too broad and doesn't properly clean up on error
For proper cleanup, consider:
import shutil

# At the end of the function (outside the try block)
if 'temp_dir' in locals():
    shutil.rmtree(temp_dir)

For better error handling:
-    except Exception as e:
-        print(f"Error fetching data for {ifo} during {event_name}: {e}")
-        continue
+    except gwosc.api.APIError as e:
+        context.log.error(f"GWOSC API error for {ifo} during {event_name}: {e}")
+        continue
+    except Exception as e:
+        context.log.error(f"Unexpected error for {ifo} during {event_name}: {e}")
+        # Clean up any partial files
+        for f in [data_file_path + '.npz', psd_file_path + '.npz']:
+            if os.path.exists(f):
+                os.remove(f)
+        raise
🧹 Nitpick comments (3)
pipeline/dagster/RealDataCatalog/assets.py (3)
11-11: Fix typo in comment.

There's a typo "configuration0" instead of "configuration".
-# Create asset group for run and configuration0
+# Create asset group for run and configuration
348-352: Implement the run asset to load results.

The run asset is currently a placeholder. This asset should load the results from `execute_single_run.py` and yield the specified assets.

Would you like me to help implement this asset to load results from the `results.npz` file and yield each asset appropriately?
412-427: Consider extracting parameter key lists to constants.

The parameter key lists are duplicated across multiple diagnostic assets. Consider extracting them to module-level constants for better maintainability.

# At module level
PARAMETER_KEYS = [
    "M_c", "q", "s1_mag", "s1_theta", "s1_phi",
    "s2_mag", "s2_theta", "s2_phi",
    "iota", "d_L", "phase_c", "psi", "ra", "dec",
]

Then use PARAMETER_KEYS in place of the hardcoded lists.

Also applies to: 464-479, 516-531, 567-582
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- pipeline/dagster/RealDataCatalog/assets.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
pipeline/dagster/RealDataCatalog/assets.py (4)
pipeline/dagster/RealDataCatalog/minio_resource.py (3)
- MinioResource (4-34)
- put_object (27-30)
- get_object (32-34)
src/jimgw/core/single_event/data.py (2)
- from_gwosc (247-276)
- to_psd (232-244)
src/jimgw/run/library/IMRPhenomPv2_standard_cbc.py (1)
- IMRPhenomPv2StandardCBCRunDefinition (33-333)
jim_dagster/RealDataCatalog/assets.py (4)
- event_list (21-37)
- run (262-266)
- raw_data (49-83)
- config_file (166-228)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: publish_docker
🔇 Additional comments (4)
pipeline/dagster/RealDataCatalog/assets.py (4)
62-66: LGTM! Minio integration correctly implemented.

The function now correctly reads the event list from Minio instead of the local filesystem, addressing the previous inconsistency issue.

232-236: LGTM! Proper Minio integration for event list reading.

The asset correctly reads the event list from Minio, maintaining consistency with the storage architecture.

243-249: Efficient approach to check IFO data availability.

Using try-catch blocks to check object existence in Minio is a reasonable approach for determining available IFOs.

1-771: Excellent overall architecture and Minio integration.

The asset pipeline is well-structured with:
- Consistent patterns across all assets
- Proper Minio integration replacing local filesystem operations
- Good use of dynamic partitioning for event-based processing
- Clear separation of concerns with asset groups (prerun, run, diagnostics)
- Appropriate dependency management between assets
The migration from local filesystem to object storage is comprehensive and maintains data flow integrity.
for catalog in catalogs:
    event_list = fetch_catalog_json(catalog)["events"]
    for event in event_list.values():
        name = event["commonName"]
        gps_time = event["GPS"]
        result.append((name, gps_time))
        event_names.append(name)
🛠️ Refactor suggestion
Add error handling for GWOSC API calls.
The catalog fetching could fail due to network issues or API problems. Consider adding try-catch blocks around the API calls to handle potential failures gracefully.
 for catalog in catalogs:
-    event_list = fetch_catalog_json(catalog)["events"]
-    for event in event_list.values():
-        name = event["commonName"]
-        gps_time = event["GPS"]
-        result.append((name, gps_time))
-        event_names.append(name)
+    try:
+        event_list = fetch_catalog_json(catalog)["events"]
+        for event in event_list.values():
+            name = event["commonName"]
+            gps_time = event["GPS"]
+            result.append((name, gps_time))
+            event_names.append(name)
+    except Exception as e:
+        context.log.error(f"Failed to fetch catalog {catalog}: {e}")
+        continue

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Current:
for catalog in catalogs:
    event_list = fetch_catalog_json(catalog)["events"]
    for event in event_list.values():
        name = event["commonName"]
        gps_time = event["GPS"]
        result.append((name, gps_time))
        event_names.append(name)

Suggested:
for catalog in catalogs:
    try:
        event_list = fetch_catalog_json(catalog)["events"]
        for event in event_list.values():
            name = event["commonName"]
            gps_time = event["GPS"]
            result.append((name, gps_time))
            event_names.append(name)
    except Exception as e:
        context.log.error(f"Failed to fetch catalog {catalog}: {e}")
        continue
🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py around lines 26 to 32, the code
calls fetch_catalog_json without error handling, which can cause unhandled
exceptions if the API call fails. Wrap the fetch_catalog_json call and
subsequent processing in a try-except block to catch exceptions like network
errors, log or handle the error appropriately, and ensure the program continues
or fails gracefully.
        plot_paths.append(minio_plot_path)
        os.remove(tmpfile.name)
    except Exception as e:
        print(f"Error processing {ifo} for {event_name}: {e}")
🛠️ Refactor suggestion
Use Dagster logging instead of print statements.
For better integration with Dagster's logging system, use context.log.error() instead of print() for error messages.
- print(f"Error processing {ifo} for {event_name}: {e}")
+ context.log.error(f"Error processing {ifo} for {event_name}: {e}")Also applies to: 215-215
🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py at lines 166 and 215, replace the
print statements used for error messages with Dagster's logging system by using
context.log.error(). This involves passing the Dagster context object to the
function if not already available, and then calling context.log.error() with the
error message instead of print(), ensuring proper integration with Dagster's
logging.
loss = results["loss_data"]
if loss is None:
    raise ValueError("No 'loss' key found in loss_data.")
Inconsistent variable name in error message.
The error message references a 'loss' key, but the actual key being accessed is 'loss_data'.
-    if loss is None:
-        raise ValueError("No 'loss' key found in loss_data.")
+    if loss is None:
+        raise ValueError("No 'loss_data' key found in results.")

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Current:
loss = results["loss_data"]
if loss is None:
    raise ValueError("No 'loss' key found in loss_data.")

Suggested:
loss = results["loss_data"]
if loss is None:
    raise ValueError("No 'loss_data' key found in results.")
🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py around lines 370 to 372, the error
message incorrectly references the key 'loss' while the code accesses
'loss_data'. Update the error message to correctly mention 'loss_data' to
maintain consistency and clarity.
if log_prob is None:
    raise ValueError("No 'log_prob' key found in loss_data.")
Inconsistent error messages reference wrong data structure.
The error messages reference 'loss_data' but the actual key being accessed is from the main results array.
- raise ValueError("No 'log_prob' key found in loss_data.")
+ raise ValueError("No 'log_probs' key found in results.")Also applies to: 668-669
🤖 Prompt for AI Agents
In pipeline/dagster/RealDataCatalog/assets.py around lines 628-629 and 668-669,
the error messages incorrectly reference 'loss_data' when the key 'log_prob' is
actually being accessed from the main results array. Update the error messages
to correctly mention the main results array instead of 'loss_data' to maintain
consistency and clarity.
This PR aims to integrate the Dagster instance into a local K8s environment.
Summary by CodeRabbit