
Conversation


@ecchilds-oss ecchilds-oss commented Feb 6, 2025

This RFC documents the code changes for the import/export feature, which has been requested by the DataHub team.

Summary by CodeRabbit

  • New Features
    • Introduced an Import/Export feature that lets users export datasets to CSV files and import them back via the UI.
    • Added options to export individual datasets or all datasets within a container, including a modal for entering CSV file details.
  • Documentation
    • Updated guides provide detailed explanations on the CSV file structure and how to use the new export and import functionalities.


coderabbitai bot commented Feb 6, 2025

Walkthrough

This update introduces an Import/Export feature for DataHub. Users can export datasets in CSV format and import CSV files through the UI. The feature adds options in the SearchExtendedMenu dropdown to export individual datasets or all datasets within a container. An export modal collects CSV file details, and metadata is retrieved via GraphQL queries. For importing, the papaparse library processes CSV files and a new GraphQL mutation (upsertDataset) along with related input types manages dataset upsertion. Documentation has been updated accordingly.

Changes

| File / Component | Change Summary |
| --- | --- |
| active/.../README.md | Documents the Import/Export feature, detailing CSV formats, export/import modal flows, and UI options in the SearchExtendedMenu. |
| GraphQL API (Mutations & Inputs) | Adds a new mutation (upsertDataset) and introduces input types: DatasetUpsertInput, SchemaMetadataInput, and SchemaFieldInput to support CSV import. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant UI
    participant GraphQL
    participant CSV_Module

    User->>UI: Clicks "Export" option in SearchExtendedMenu
    UI->>User: Displays export modal for CSV file details
    User->>UI: Submits CSV file name and export options
    UI->>GraphQL: Fetches dataset metadata via GraphQL queries
    GraphQL-->>UI: Returns dataset metadata
    UI->>CSV_Module: Generates CSV file with required columns
    CSV_Module-->>User: Provides CSV file download
```
```mermaid
sequenceDiagram
    participant User
    participant UI
    participant Papaparse
    participant GraphQL

    User->>UI: Selects CSV file for import
    UI->>Papaparse: Parses CSV file using papaparse library
    Papaparse-->>UI: Returns parsed dataset information
    UI->>GraphQL: Sends upsertDataset mutation(s) with dataset details
    GraphQL-->>UI: Confirms upsertion of datasets
    UI-->>User: Displays import result
```

Poem

I'm a bunny on a coding spree,
Hopping through changes with glee.
CSVs and GraphQL now take flight,
Import and export shine so bright.
With a twitch of my nose and a joyful beat,
I celebrate our code so neat!
🐇💻


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (11)
active/0000-import-export-feature/README.md (11)

1-4: Metadata and RFC Details Update

The header section includes placeholder metadata (start date, RFC PR link, and implementation PR references). Ensure these fields are updated with the actual dates and links once available.


36-46: CSV Columns Explanation – Typo Correction

The explanation of CSV columns is thorough. Note: In line 39, the word “assset” should be corrected to “asset.”

- - asset_type: What type of assset is contained in the row. This is either a dataset or schema field.
+ - asset_type: What type of asset is contained in the row. This is either a dataset or schema field.

48-52: Export Button Description – Capitalization

In line 50, the phrase “using a react effect” should capitalize “React” as it refers to the framework.

- This is done using a react effect, which greys out the button unless the URL of the current page contains the word "container".
+ This is done using a React effect, which greys out the button unless the URL of the current page contains the word "container".
🧰 Tools
🪛 LanguageTool

[grammar] ~50-~50: “React” is a proper noun and needs to be capitalized.
Context: ...cannot be pressed. This is done using a react effect, which greys out the button unle...

(A_GOOGLE)


53-57: Modal Fields Clarity

The description of fields in the export modal is clear. Consider adding a note regarding the assumptions about the container structure and potential limitations for data sources that do not follow it.


70-73: GraphQL Queries Header Consistency

Ensure consistent capitalization by using “GraphQL” instead of “GraphQl” (e.g., in line 70).

- #### GraphQl queries
+ #### GraphQL queries

254-261: File Input for Import

The JSX snippet for the file upload input is concise. For improved accessibility, consider adding attributes like aria-label or a visually hidden label.
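For example, a sketch of that suggestion (reusing the RFC's `changeHandler`; the label text and `accept` hint are illustrative additions, not part of the RFC):

```tsx
{/* Same hidden input as in the RFC, with an accessible name and a CSV type hint added. */}
<input
  id="file"
  type="file"
  accept=".csv"
  aria-label="CSV file to import"
  onChange={changeHandler}
  style={{ opacity: 0 }}
/>
```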


293-301: Import Field Mappings – Typo Corrections

The defaults for fields on import are clearly listed. Important: In lines 295 and 298, the CSV field is referred to as resrource which should be corrected to resource.

- - `name`: The name is extracted from the dataset URN stored in the `resrource` CSV field.
+ - `name`: The name is extracted from the dataset URN stored in the `resource` CSV field.
- - `platformUrn`: The platform URN is extracted from the dataset URN stored in the `resrource` CSV field.
+ - `platformUrn`: The platform URN is extracted from the dataset URN stored in the `resource` CSV field.
🧰 Tools
🪛 LanguageTool

[uncategorized] ~295-~295: Loose punctuation mark.
Context: ...d with these values on import: - name: The name is extracted from the dataset ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~298-~298: Loose punctuation mark.
Context: ...aName: An empty string. - platformUrn`: The platform URN is extracted from the ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~299-~299: Loose punctuation mark.
Context: ...d in the resrource CSV field. - type: SchemaFieldDataType.Null - `nativeDat...

(UNLIKELY_OPENING_PUNCTUATION)


320-323: Clarification on External Extension

In line 322, change “a extension” to “an extension” since “extension” begins with a vowel sound.

- It's also notable that a extension for DataHub does exist which adds very similar functionality...
+ It's also notable that an extension for DataHub does exist which adds very similar functionality...
🧰 Tools
🪛 LanguageTool

[misspelling] ~322-~322: Use “an” instead of ‘a’ if the following word starts with a vowel sound, e.g. ‘an article’, ‘an hour’.
Context: ...th remediating. It's also notable that a extension for DataHub does exist which ...

(EN_A_VS_AN)


328-333: Future Work and CSV Enhancements – Typo Correction

In the future work section, specifically line 330, “uased” should be corrected to “used.” Additionally, consider clarifying how new CSV columns will impact the UX and data integrity.

- ... it could also be uased to store the sub types of datasets.
+ ... it could also be used to store the sub types of datasets.
🧰 Tools
🪛 LanguageTool

[typographical] ~330-~330: The conjunction “so that” does not have a comma in front.
Context: ... corresponding columns to the CSV schema, so that we can populate those fields. Notably, ...

(SO_THAT_UNNECESSARY_COMMA)


[misspelling] ~330-~330: This word is normally spelled as one.
Context: ...mn, it could also be uased to store the sub types of datasets. Additionally, the `glossa...

(EN_COMPOUNDS_SUB_TYPES)


334-335: Dataset-level Export Refactoring

There is a spelling mistake in line 334 where “laters” should be “layers.” Additionally, consider varying the phrasing here versus other similar statements in the document to reduce repetition.

- ... designed to only work with data sources with two laters of containers in DataHub.
+ ... designed to only work with data sources with two layers of containers in DataHub.
🧰 Tools
🪛 LanguageTool

[style] ~334-~334: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...low, the dataset-level export will also need to be refactored to be more flexible, as a...

(REP_NEED_TO_VB)


336-340: Unresolved Questions – Style Improvement

The unresolved questions section raises important points about flexibility and side effects. To improve readability, consider rephrasing some repetitive language (e.g., reducing repeated uses of “as such” and similar constructs).

🧰 Tools
🪛 LanguageTool

[style] ~338-~338: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...and as such, this component will likely need to be refactored to be more flexible. We w...

(REP_NEED_TO_VB)


[style] ~338-~338: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...refactored to be more flexible. We will need to determine what shape the component shou...

(REP_NEED_TO_VB)


[style] ~340-~340: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...rough GraphQL as a side effect. It will need to be evaluated whether this is an accepta...

(REP_NEED_TO_VB)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8852b24 and b43b351.

📒 Files selected for processing (1)
  • active/0000-import-export-feature/README.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
active/0000-import-export-feature/README.md

[grammar] ~50-~50: “React” is a proper noun and needs to be capitalized.
Context: ...cannot be pressed. This is done using a react effect, which greys out the button unle...

(A_GOOGLE)


[uncategorized] ~295-~295: Loose punctuation mark.
Context: ...d with these values on import: - name: The name is extracted from the dataset ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~298-~298: Loose punctuation mark.
Context: ...aName: An empty string. - platformUrn`: The platform URN is extracted from the ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~299-~299: Loose punctuation mark.
Context: ...d in the resrource CSV field. - type: SchemaFieldDataType.Null - `nativeDat...

(UNLIKELY_OPENING_PUNCTUATION)


[misspelling] ~322-~322: Use “an” instead of ‘a’ if the following word starts with a vowel sound, e.g. ‘an article’, ‘an hour’.
Context: ...th remediating. It's also notable that a extension for DataHub does exist which ...

(EN_A_VS_AN)


[typographical] ~330-~330: The conjunction “so that” does not have a comma in front.
Context: ... corresponding columns to the CSV schema, so that we can populate those fields. Notably, ...

(SO_THAT_UNNECESSARY_COMMA)


[misspelling] ~330-~330: This word is normally spelled as one.
Context: ...mn, it could also be uased to store the sub types of datasets. Additionally, the `glossa...

(EN_COMPOUNDS_SUB_TYPES)


[style] ~334-~334: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...low, the dataset-level export will also need to be refactored to be more flexible, as a...

(REP_NEED_TO_VB)


[style] ~338-~338: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...and as such, this component will likely need to be refactored to be more flexible. We w...

(REP_NEED_TO_VB)


[style] ~338-~338: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...refactored to be more flexible. We will need to determine what shape the component shou...

(REP_NEED_TO_VB)


[style] ~340-~340: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...rough GraphQL as a side effect. It will need to be evaluated whether this is an accepta...

(REP_NEED_TO_VB)

🔇 Additional comments (14)
active/0000-import-export-feature/README.md (14)

5-9: Clear Feature Summary

The feature title and summary clearly state the goal (import/export datasets via CSV) and the current status. Ensure that any further refinements or updates to the implementation are documented in this section.


11-14: Motivation Section Clarity

The motivation section explains the rationale behind mimicking Collibra’s functionality and how it benefits enterprise users. Consider adding details on expected user benefits or quantitative metrics if available.


15-21: Requirements Section Verification

The requirements are explicitly listed for export and import functionality. This clarity is helpful. Ensure that any limitations (for example container-based assumptions) are clearly communicated in both documentation and UI, as noted later.


22-25: Non-Requirements Section

The non-requirements section is succinct and clearly specifies what is out of scope (i.e. REST API). No changes needed.


26-28: Detailed Design Introduction

The introduction to the detailed design outlines the overall approach for adding three options in the dropdown. This sets the stage well for the detailed sections that follow.


30-34: CSV Columns Definition

The CSV block listing the column names is clear. Verify that the list of columns exactly matches the expected fields during both export and import.


58-68: Detailed Export Process

The step-by-step outline of the export process is detailed and easy to follow. Ensure that error handling (e.g., when no datasets are found) is robust in the implementation.


74-159: GraphQL Query: getDatasetByUrn

The query is comprehensive and correctly structured to fetch dataset metadata. Verify that client-side code consuming this query handles pagination (50 datasets per query) as described.
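One hedged way the client could honor that 50-per-query limit is a simple paging loop; the callback below stands in for whichever query document the export modal actually uses, since its exact variables are not shown in this review:

```ts
const PAGE_SIZE = 50; // the RFC fetches 50 datasets per query

// `fetchPage` abstracts the GraphQL call (getDatasetByUrn / getTable in the RFC);
// it should return one page of results for the given offset.
export async function fetchAllPages<T>(
  fetchPage: (start: number, count: number) => Promise<T[]>,
): Promise<T[]> {
  const results: T[] = [];
  for (let start = 0; ; start += PAGE_SIZE) {
    const page = await fetchPage(start, PAGE_SIZE);
    results.push(...page);
    if (page.length < PAGE_SIZE) break; // a short page means the container is exhausted
  }
  return results;
}
```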


160-252: GraphQL Query: getTable

This complementary query for fetching dataset details looks consistent with the first query. Ensure both queries are maintained together as updates in one may require changes in the other.


262-291: New GraphQL Mutation and Input Types

The new upsertDataset mutation and its corresponding input types are well defined. Check that the trailing comma in globalTagUrns: [String], (line 273) is acceptable per your GraphQL schema style guidelines.
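For illustration, invoking the new mutation from the React client might look like the sketch below; the mutation and input type names come from the RFC, while the return selection and hook wiring are assumptions:

```ts
import { gql, useMutation } from '@apollo/client';

// Client-side document for the RFC's new upsertDataset mutation.
// The bare field (no selection set) assumes a scalar return; adjust it to match
// whatever the schema actually exposes.
const UPSERT_DATASET = gql`
  mutation upsertDataset($input: DatasetUpsertInput!) {
    upsertDataset(input: $input)
  }
`;

export function useUpsertDataset() {
  // Callers invoke the returned mutate function with { variables: { input } }.
  return useMutation(UPSERT_DATASET);
}
```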


304-309: User Guide Section (“How we teach this”)

This section outlines the essentials for a user guide effectively. Consider detailing common pitfalls or visual guides for using the feature.


310-313: GraphQL Documentation Update Reminder

The note regarding updating GraphQL documentation is insightful. Keep the documentation in sync with schema changes to avoid confusion.


314-319: Drawbacks and Alternatives Discussion

The discussion on drawbacks and alternatives is comprehensive. Consider ensuring consistent styling for “React” across the document when referring to the UI framework.


324-327: Rollout/Adoption Strategy Clarity

The rollout strategy is concise and confirms that no migration is needed. This clarity is beneficial.

@PinkMonstera

OSS Review Recommendation: Proceed


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (7)
active/0009-import-export-feature/README.md (7)

1-4: Review Header and PR Details:
The header clearly lists the start date, RFC PR URL, and the placeholder for implementation PR(s). Please consider formatting the bare URL on line 2 as a proper markdown link (e.g., using angle brackets or the link text syntax) to adhere to markdownlint guidelines. Also, if an implementation PR exists, including its link would improve clarity.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

2-2: Bare URL used
null

(MD034, no-bare-urls)


58-69: Export Process Steps Overview:
The step-by-step description of the export process is well laid out and easy to follow. It might be beneficial to include notes on error handling or fallback behavior if a step fails (for example, if the GraphQL query returns an unexpected result).


74-253: GraphQL Queries Implementation:
The two GraphQL queries (getDatasetByUrn and getTable) are well-structured and clearly presented. One point to consider is the performance handling for containers with more than 50 datasets—monitoring and possibly refactoring the pagination and multiple CSV downloads might be useful in the future.


254-261: File Input for Import:
The hidden file input element is implemented correctly. However, using inline styles (i.e., setting opacity directly) might limit maintainability. Consider moving these styles to an external CSS class, which would promote consistency and easier updates in the future.
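A small sketch of that refactor; the stylesheet name and class are placeholders:

```tsx
/* In a stylesheet such as importExport.css (hypothetical):
   .hidden-file-input { opacity: 0; }
*/

// Behavior is unchanged; the presentation rule now lives with the other styles.
<input id="file" type="file" onChange={changeHandler} className="hidden-file-input" />
```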


293-301: Field Defaults for Import Details:
The explanation of how missing CSV fields are supplemented with default values (e.g., version: 0, schemaName: "", etc.) is useful. Please review the punctuation and formatting in this section (specifically around lines 295–299) to remove any loose punctuation and ensure consistency.
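A sketch of how those defaults could be derived; the URN pattern shown is DataHub's standard dataset URN, and the helper name is illustrative:

```ts
// DataHub dataset URNs follow the pattern:
//   urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
// Per the RFC, `name` and `platformUrn` come out of the `resource` column, and the
// remaining fields fall back to fixed defaults such as version 0 and an empty schemaName.
export function importDefaultsFromResource(resource: string) {
  const inner = resource.replace(/^urn:li:dataset:\(/, '').replace(/\)$/, '');
  const [platformUrn, name] = inner.split(',');
  return { name, platformUrn, version: 0, schemaName: '' };
}
```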

🧰 Tools
🪛 LanguageTool

[uncategorized] ~295-~295: Loose punctuation mark.
Context: ...d with these values on import: - name: The name is extracted from the dataset ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~298-~298: Loose punctuation mark.
Context: ...aName: An empty string. - platformUrn`: The platform URN is extracted from the ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~299-~299: Loose punctuation mark.
Context: ...ed in the resource CSV field. - type: SchemaFieldDataType.Null - `nativeDat...

(UNLIKELY_OPENING_PUNCTUATION)


330-330: Typographical Correction Needed:
There appears to be a typographical error on line 330 where "uased" is used instead of "used." A simple fix is needed here.

🧰 Tools
🪛 LanguageTool

[typographical] ~330-~330: The conjunction “so that” does not have a comma in front.
Context: ... corresponding columns to the CSV schema, so that we can populate those fields. Notably, ...

(SO_THAT_UNNECESSARY_COMMA)


[misspelling] ~330-~330: This word is normally spelled as one.
Context: ...mn, it could also be uased to store the sub types of datasets. Additionally, the `glossa...

(EN_COMPOUNDS_SUB_TYPES)


334-340: Repetitive Phrasing in Future Work and Unresolved Questions:
The phrases such as “need to be refactored to be more flexible” are repeated across these sections. Consider rephrasing these areas to add variety and enhance readability.

🧰 Tools
🪛 LanguageTool

[style] ~334-~334: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...low, the dataset-level export will also need to be refactored to be more flexible, as a...

(REP_NEED_TO_VB)


[style] ~338-~338: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...and as such, this component will likely need to be refactored to be more flexible. We w...

(REP_NEED_TO_VB)


[style] ~338-~338: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...refactored to be more flexible. We will need to determine what shape the component shou...

(REP_NEED_TO_VB)


[style] ~340-~340: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...rough GraphQL as a side effect. It will need to be evaluated whether this is an accepta...

(REP_NEED_TO_VB)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b43b351 and 562d1a9.

📒 Files selected for processing (1)
  • active/0009-import-export-feature/README.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
active/0009-import-export-feature/README.md

[uncategorized] ~295-~295: Loose punctuation mark.
Context: ...d with these values on import: - name: The name is extracted from the dataset ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~298-~298: Loose punctuation mark.
Context: ...aName: An empty string. - platformUrn`: The platform URN is extracted from the ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~299-~299: Loose punctuation mark.
Context: ...ed in the resource CSV field. - type: SchemaFieldDataType.Null - `nativeDat...

(UNLIKELY_OPENING_PUNCTUATION)


[typographical] ~330-~330: The conjunction “so that” does not have a comma in front.
Context: ... corresponding columns to the CSV schema, so that we can populate those fields. Notably, ...

(SO_THAT_UNNECESSARY_COMMA)


[misspelling] ~330-~330: This word is normally spelled as one.
Context: ...mn, it could also be uased to store the sub types of datasets. Additionally, the `glossa...

(EN_COMPOUNDS_SUB_TYPES)


[style] ~334-~334: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...low, the dataset-level export will also need to be refactored to be more flexible, as a...

(REP_NEED_TO_VB)


[style] ~338-~338: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...and as such, this component will likely need to be refactored to be more flexible. We w...

(REP_NEED_TO_VB)


[style] ~338-~338: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...refactored to be more flexible. We will need to determine what shape the component shou...

(REP_NEED_TO_VB)


[style] ~340-~340: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...rough GraphQL as a side effect. It will need to be evaluated whether this is an accepta...

(REP_NEED_TO_VB)

🪛 markdownlint-cli2 (0.17.2)
active/0009-import-export-feature/README.md

2-2: Bare URL used
null

(MD034, no-bare-urls)

🔇 Additional comments (6)
active/0009-import-export-feature/README.md (6)

32-34: CSV Columns Listing is Clear:
The CSV header block clearly lists all the column names required for the import/export feature. The format is concise and self-explanatory.


36-47: Detailed CSV Columns Usage Explanation:
This section thoroughly explains the purpose and formatting of each CSV column. It provides valuable context for both datasets and schema fields. Consider adding examples in future revisions if more clarity is needed for edge cases.


50-57: Export Modal Usage Details:
The explanation for when and how the export modal is activated is comprehensive. Please ensure that the assumptions about the container hierarchy (e.g., the expectation of a specific number of containers) are validated at runtime to avoid potential issues in diverse environments.


70-73: GraphQL Queries Section Introduction:
This brief introduction to the GraphQL queries sets the stage nicely for the detailed queries that follow.


262-292: GraphQL Mutation and Input Types Definitions:
The new GraphQL mutation (upsertDataset) along with its input types (DatasetUpsertInput, SchemaMetadataInput, and SchemaFieldInput) are clearly defined and documented. This helps ensure that the integration with GMS is well understood.


1-340: Overall Documentation Quality:
This README provides a comprehensive overview of the new Import/Export feature for DataHub, covering motivation, design, detailed implementation, and future considerations. The level of detail is excellent; just ensure minor editorial fixes and consistency improvements are applied.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~295-~295: Loose punctuation mark.
Context: ...d with these values on import: - name: The name is extracted from the dataset ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~298-~298: Loose punctuation mark.
Context: ...aName: An empty string. - platformUrn`: The platform URN is extracted from the ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~299-~299: Loose punctuation mark.
Context: ...ed in the resource CSV field. - type: SchemaFieldDataType.Null - `nativeDat...

(UNLIKELY_OPENING_PUNCTUATION)


[typographical] ~330-~330: The conjunction “so that” does not have a comma in front.
Context: ... corresponding columns to the CSV schema, so that we can populate those fields. Notably, ...

(SO_THAT_UNNECESSARY_COMMA)


[misspelling] ~330-~330: This word is normally spelled as one.
Context: ...mn, it could also be uased to store the sub types of datasets. Additionally, the `glossa...

(EN_COMPOUNDS_SUB_TYPES)


[style] ~334-~334: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...low, the dataset-level export will also need to be refactored to be more flexible, as a...

(REP_NEED_TO_VB)


[style] ~338-~338: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...and as such, this component will likely need to be refactored to be more flexible. We w...

(REP_NEED_TO_VB)


[style] ~338-~338: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...refactored to be more flexible. We will need to determine what shape the component shou...

(REP_NEED_TO_VB)


[style] ~340-~340: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...rough GraphQL as a side effect. It will need to be evaluated whether this is an accepta...

(REP_NEED_TO_VB)

🪛 markdownlint-cli2 (0.17.2)

2-2: Bare URL used
null

(MD034, no-bare-urls)

```jsx
<input id="file" type="file" onChange={changeHandler} style={{ opacity: 0 }} />
```

The `papaparse` library is used to parse the CSV file and iterate over each row present within it. The data is then fed into GraphQL mutations to create datasets. Notably, a new GraphQL mutation had to be created to allow the upserting of schema metadata. Here is the specification for that new mutation:
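A hedged sketch of that import path follows; papaparse's `parse` API is real, while the row shape and the way the mutation is invoked are assumptions based on the description above:

```ts
import Papa from 'papaparse';

// Row shape inferred from the RFC's CSV: `resource` holds the dataset URN and
// `asset_type` says whether the row describes a dataset or a schema field.
interface CsvRow {
  resource: string;
  asset_type: string;
  [column: string]: string;
}

// `runUpsertDataset` stands in for the Apollo mutate function bound to the RFC's
// new upsertDataset mutation; its input shape here is illustrative.
export function importCsv(file: File, runUpsertDataset: (input: object) => Promise<unknown>) {
  Papa.parse<CsvRow>(file, {
    header: true,
    skipEmptyLines: true,
    complete: async ({ data }) => {
      for (const row of data) {
        if (row.asset_type !== 'dataset') continue; // schema-field rows are folded in separately
        await runUpsertDataset({ urn: row.resource }); // minimal input; the real mapping covers more columns
      }
    },
  });
}
```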


What sort of scale do we want to advertise for this feature?
How much have we tested up to?
Should the UI reject the import if there are too many datasets?
How should progress be displayed to the user?


@jayacryl jayacryl Mar 13, 2025


I had previously tested a prototype that broke requests down even at the row-level. I.e., one request for glossary terms, one request per schema column changed, etc.
Furthermore, it cached key fields of the exported file locally. This allowed for only submitting the diffs to graphql.

This approach helped us quickly identify if a specific cell in the csv failed to apply, while still succeeding with all the rest. This was easily presented in a final upload report at the end.
It seemed to work well with up to 100k items changed.


Author


At present, the implementation we've written does not display a progress bar, nor does it reject the user if too many datasets are present for import or export. However, importing a CSV file can take a great deal of time if there are a lot of datasets present in the file. I believe we've tested up to 36 datasets at once.


## Detailed design

This feature will add three new options to the existing `SearchExtendedMenu` dropdown. One to export all datasets within a container, one to export individual datasets, and one to import previously exported data into DataHub. The export options create CSV files from data existing in DataHub, while the import option adds new data to DataHub from CSV files.
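For a rough sense of shape, the three entries might read something like the sketch below, assuming the dropdown is an antd `Menu` as elsewhere in DataHub's React app; labels and handlers are placeholders:

```tsx
import React from 'react';
import { Menu } from 'antd';

// Placeholder handlers; the real ones open the export modal or the hidden file input.
const noop = () => {};

export const importExportMenuItems = (
  <>
    <Menu.Item key="export-container" onClick={noop}>Export container datasets</Menu.Item>
    <Menu.Item key="export-dataset" onClick={noop}>Export dataset</Menu.Item>
    <Menu.Item key="import-csv" onClick={noop}>Import datasets from CSV</Menu.Item>
  </>
);
```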


A screenshot of the proposed look and feel here would be great.


This feature will add the ability to both export datasets to CSV files and import them back into DataHub from those CSV files, using the UI. Code is already implemented for this feature, though further work may need to be done. This RFC details the implementation in its current state.

## Motivation


A section on the User Journey and Motivations would be useful here.

e.g.
Why is the user exporting to CSV? Do they intend to make some changes to metadata in bulk and then import it back?

Are there scenarios where users are trying to import CSVs containing metadata that has been handwritten or sourced from "non-DataHub" catalogs? In those scenarios, how will users provide the "urn" field, which represents the main identity of the dataset on DataHub?


## Non-Requirements

This feature is not intended to add a REST API for import/export like that of Collibra. It is only intended for use through the UI.


Maybe this is where we declare that we are not intending to support a from-scratch import of csv metadata which has not been sourced from DataHub


@jayacryl jayacryl Mar 13, 2025


Yes agreed. For now, if users want to achieve this they'll need to closely mimic our format and have the assets pre-created in DataHub through other means.


In contrast, for bulk import glossary terms we'll need something a bit simpler to let users transform externally sourced terms & term groups into a csv for import. Also will need to support upserts, not just updates.


### Export

Within the `SearchExtendedMenu` dropdown, the container-level export option is only available when a container is being viewed. At all other times, it is grayed out and cannot be pressed. This is done using a React effect, which greys out the button unless the URL of the current page contains the word "container".
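A minimal sketch of that check, assuming react-router's `useLocation` hook; the RFC's actual component may wire it differently:

```tsx
import { useEffect, useState } from 'react';
import { useLocation } from 'react-router-dom';

// Enables the container-level export option only while a container page is open,
// mirroring the URL check described above.
export function useContainerExportEnabled(): boolean {
  const { pathname } = useLocation();
  const [enabled, setEnabled] = useState(false);

  useEffect(() => {
    setEnabled(pathname.includes('container'));
  }, [pathname]);

  return enabled;
}
```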


screenshot here would be great


Here's the dataset one! (screenshot omitted)


Author


I'll be updating the RFC with screenshots from the current implementation we have.


Within the `SearchExtendedMenu` dropdown, the container-level export option is only available when a container is being viewed. At all other times, it is grayed out and cannot be pressed. This is done using a React effect, which greys out the button unless the URL of the current page contains the word "container".

When either export option is selected, it opens a modal which prompts the user to enter the name of the CSV file to be created. For dataset-level export, the user is also prompted to enter the data source, database, schema, and table name of the dataset to be exported. Notably, these fields assume a specific number of containers to be present, which may not be the case for every data source. As such, this modal may need to be altered. This is what the fields presently refer to:


For dataset-level export, the user is also prompted to enter the data source, database, schema, and table name of the dataset to be exported.

Isn't this something that can be inferred from the dataset urn by making a graphql call? Not sure why the user needs to provide this information for the dataset.


@jayacryl jayacryl Mar 13, 2025


Agreed - we can auto-infer it. Users will still be able to rename the file in the system download popup.

Author


Since this option appears in the search results, we cannot derive a URN from context. We would have to modify the form to ask for a dataset URN instead of the current fields. I can modify the RFC to reflect that if you'd like.


Upon entry, the following steps occur:

1. The modal is made invisible, but continues executing code for the export process. A notification is created to inform the user that the export process is ongoing.


screenshot / design mock would be great here


Are we assuming a progress bar?


@jayacryl jayacryl Mar 13, 2025


Perhaps something like this
(screenshot omitted)

Author


I'll be updating the RFC doc with a screenshot of the notification in the current implementation.


This feature as it is currently implemented is only intended to support:
- Export to CSV of individual datasets.
- Export to CSV of all datasets within a container.

@jayacryl jayacryl Mar 13, 2025


Or we could simply say: all datasets that match a search predicate?

Author


The feature as it is currently implemented is not designed to support that.

## Requirements

This feature as it is currently implemented is only intended to support:
- Export to CSV of individual datasets.

@jayacryl jayacryl Mar 13, 2025


and their schemas!

Author


In the current implementation, schemas are not present within the CSV files. During schema-level export, only the datasets within a given schema are exported.






## Drawbacks and Alternatives

As mentioned before, this feature is only intended for use within the UI. As the code has currently been written, it would not be possible to extend the import and export functionality to a different API (i.e., REST), as all the code is written in React.

@jayacryl jayacryl Mar 13, 2025


We could convert it to Python without too much effort - given all the core logic is in .ts, with no JSX stuff involved.

Author


Yes, we could do such a thing, but most of the existing code we've written would need to be scrapped and rewritten in another language (Python, in this case).


It's also notable that because the format of the CSV files is so different from those produced by the existing functionality of downloading search results, existing CSV files cannot be used to import datasets. This may cause confusion among users, and may be worth remediating.

It's also notable that an extension for DataHub does exist which adds very similar functionality ([link](https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/csv/)). This has not been investigated in detail, but if this is a duplicate feature, it may not be worth integrating into DataHub.

@jayacryl jayacryl Mar 13, 2025


I believe folks usually use this as a one-off rather than on a loop, so we could sunset the source once there is a mechanism for it in the UI.
And especially if we copy the React code to Python as mentioned above, then developers can leverage the Python tool to do this programmatically.


It's notable that the dataset-level export component of this feature was designed specifically for data sources with two layers of containers in DataHub. This is unlikely to always be the case, and as such, this component will likely need to be refactored to be more flexible. We will need to determine what shape the component should take before performing this refactoring.

Additionally, this feature would end up adding the ability to create Datasets through GraphQL as a side effect. It will need to be evaluated whether this is an acceptable outcome, or if it is acceptable, whether it should be made accessible through the GraphiQL interface.


I wonder what the implications of creating and deleting datasets via graphql would be for the overall DataHub platform?

* rfc: Update document with requested changes

* Fix typo on line 28