rfc: Add import/export feature #9
Conversation
Walkthrough
This update introduces an Import/Export feature for DataHub. Users can export datasets in CSV format and import CSV files through the UI. The feature adds options in the SearchExtendedMenu dropdown to export individual datasets or all datasets within a container. An export modal collects CSV file details, and metadata is retrieved via GraphQL queries. For importing, the papaparse library processes CSV files and a new GraphQL mutation (upsertDataset) along with related input types manages dataset upsertion. Documentation has been updated accordingly.
Sequence Diagram(s)
sequenceDiagram
participant User
participant UI
participant GraphQL
participant CSV_Module
User->>UI: Clicks "Export" option in SearchExtendedMenu
UI->>User: Displays export modal for CSV file details
User->>UI: Submits CSV file name and export options
UI->>GraphQL: Fetches dataset metadata via GraphQL queries
GraphQL-->>UI: Returns dataset metadata
UI->>CSV_Module: Generates CSV file with required columns
CSV_Module-->>User: Provides CSV file download
sequenceDiagram
participant User
participant UI
participant Papaparse
participant GraphQL
User->>UI: Selects CSV file for import
UI->>Papaparse: Parses CSV file using papaparse library
Papaparse-->>UI: Returns parsed dataset information
UI->>GraphQL: Sends upsertDataset mutation(s) with dataset details
GraphQL-->>UI: Confirms upsertion of datasets
UI-->>User: Displays import result
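For reference, a minimal sketch of the last step of the export flow (turning rows into a CSV download in the browser) is shown below. This is illustrative only; the column list and the `downloadCsv` function name are assumptions rather than the RFC's actual code.

```typescript
// Minimal sketch of turning exported rows into a CSV download in the browser.
// Column names here are illustrative; the RFC defines the real CSV schema.
function downloadCsv(rows: Array<Record<string, string>>, fileName: string): void {
    const columns = ['resource', 'asset_type', 'description'];
    const escapeCell = (value: string | undefined) => `"${(value ?? '').replace(/"/g, '""')}"`;
    const lines = [
        columns.join(','),
        ...rows.map((row) => columns.map((col) => escapeCell(row[col])).join(',')),
    ];
    const blob = new Blob([lines.join('\n')], { type: 'text/csv;charset=utf-8' });
    const url = URL.createObjectURL(blob);
    const anchor = document.createElement('a');
    anchor.href = url;
    anchor.download = fileName.endsWith('.csv') ? fileName : `${fileName}.csv`;
    anchor.click();
    URL.revokeObjectURL(url);
}
```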
Actionable comments posted: 0
🧹 Nitpick comments (11)
active/0000-import-export-feature/README.md (11)
1-4: Metadata and RFC Details Update
The header section includes placeholder metadata (start date, RFC PR link, and implementation PR references). Ensure these fields are updated with the actual dates and links once available.
36-46: CSV Columns Explanation – Typo Correction
The explanation of CSV columns is thorough. Note: In line 39, the word “assset” should be corrected to “asset.”
- - asset_type: What type of assset is contained in the row. This is either a dataset or schema field.
+ - asset_type: What type of asset is contained in the row. This is either a dataset or schema field.
48-52: Export Button Description – Capitalization
In line 50, the phrase “using a react effect” should capitalize “React” as it refers to the framework.
- This is done using a react effect, which greys out the button unless the URL of the current page contains the word "container".
+ This is done using a React effect, which greys out the button unless the URL of the current page contains the word "container".
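For context on the mechanism this comment refers to, here is a rough sketch of a React effect that enables the container-level export option only when a container page is being viewed. The hook name is invented and it assumes react-router's `useLocation` is available; this is not the actual `SearchExtendedMenu` code.

```tsx
import { useEffect, useState } from 'react';
import { useLocation } from 'react-router-dom';

// Hypothetical hook: enables the container-level export option only when the
// current URL looks like a container page, mirroring the behaviour the README describes.
function useContainerExportEnabled(): boolean {
    const location = useLocation();
    const [enabled, setEnabled] = useState(false);

    useEffect(() => {
        setEnabled(location.pathname.includes('container'));
    }, [location.pathname]);

    return enabled;
}
```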
53-57: Modal Fields Clarity
The description of fields in the export modal is clear. Consider adding a note regarding the assumptions about the container structure and potential limitations for data sources that do not follow it.
70-73: GraphQL Queries Header Consistency
Ensure consistent capitalization by using “GraphQL” instead of “GraphQl” (e.g., in line 70).
- #### GraphQl queries
+ #### GraphQL queries
254-261: File Input for Import
The JSX snippet for the file upload input is concise. For improved accessibility, consider adding attributes like `aria-label` or a visually hidden label.
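For illustration, one possible way to apply the suggested accessibility attributes to the hidden input. The component name and label text are arbitrary; this is a sketch, not the RFC's code.

```tsx
import React from 'react';

// Sketch: the same hidden input as in the RFC, with an aria-label and a visually
// hidden label so assistive technology announces the control.
export function ImportFileInput({ changeHandler }: { changeHandler: React.ChangeEventHandler<HTMLInputElement> }) {
    return (
        <>
            <label htmlFor="file" style={{ position: 'absolute', left: '-9999px' }}>
                Choose a CSV file to import
            </label>
            <input
                id="file"
                type="file"
                accept=".csv"
                aria-label="Choose a CSV file to import"
                onChange={changeHandler}
                style={{ opacity: 0 }}
            />
        </>
    );
}
```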
293-301: Import Field Mappings – Typo Corrections
The defaults for fields on import are clearly listed. Important: In lines 295 and 298, the CSV field is referred to as `resrource`, which should be corrected to `resource`.
- - `name`: The name is extracted from the dataset URN stored in the `resrource` CSV field.
+ - `name`: The name is extracted from the dataset URN stored in the `resource` CSV field.
- - `platformUrn`: The platform URN is extracted from the dataset URN stored in the `resrource` CSV field.
+ - `platformUrn`: The platform URN is extracted from the dataset URN stored in the `resource` CSV field.
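To make the URN-derived defaults concrete, here is an illustrative sketch of extracting the name and platform URN from a standard DataHub dataset URN stored in the `resource` column. The function name and error handling are assumptions.

```typescript
// Example input: urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)
function parseDatasetUrn(resource: string): { name: string; platformUrn: string } {
    const match = resource.match(/^urn:li:dataset:\((urn:li:dataPlatform:[^,]+),(.+),([^,]+)\)$/);
    if (!match) {
        throw new Error(`Unrecognized dataset URN: ${resource}`);
    }
    // match[1] is the platform URN, match[2] is the dataset name, match[3] is the fabric/environment.
    return { platformUrn: match[1], name: match[2] };
}
```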
320-323: Clarification on External Extension
In line 322, change “a extension” to “an extension” since “extension” begins with a vowel sound.
- It's also notable that a extension for DataHub does exist which adds very similar functionality...
+ It's also notable that an extension for DataHub does exist which adds very similar functionality...
328-333: Future Work and CSV Enhancements – Typo Correction
In the future work section, specifically line 330, “uased” should be corrected to “used.” Additionally, consider clarifying how new CSV columns will impact the UX and data integrity.
- ... it could also be uased to store the sub types of datasets.
+ ... it could also be used to store the sub types of datasets.
334-335: Dataset-level Export Refactoring
There is a spelling mistake in line 334 where “laters” should be “layers.” Additionally, consider varying the phrasing here versus other similar statements in the document to reduce repetition.
- ... designed to only work with data sources with two laters of containers in DataHub.
+ ... designed to only work with data sources with two layers of containers in DataHub.
336-340: Unresolved Questions – Style Improvement
The unresolved questions section raises important points about flexibility and side effects. To improve readability, consider rephrasing some repetitive language (e.g., reducing repeated uses of “as such” and similar constructs).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
active/0000-import-export-feature/README.md
(1 hunks)
🧰 Additional context used
🪛 LanguageTool
active/0000-import-export-feature/README.md
[grammar] ~50-~50: “React” is a proper noun and needs to be capitalized.
Context: ...cannot be pressed. This is done using a react effect, which greys out the button unle...
(A_GOOGLE)
[uncategorized] ~295-~295: Loose punctuation mark.
Context: ...d with these values on import: - name
: The name is extracted from the dataset ...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~298-~298: Loose punctuation mark.
Context: ...aName: An empty string. -
platformUrn`: The platform URN is extracted from the ...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~299-~299: Loose punctuation mark.
Context: ...d in the resrource
CSV field. - type
: SchemaFieldDataType.Null
- `nativeDat...
(UNLIKELY_OPENING_PUNCTUATION)
[misspelling] ~322-~322: Use “an” instead of ‘a’ if the following word starts with a vowel sound, e.g. ‘an article’, ‘an hour’.
Context: ...th remediating. It's also notable that a extension for DataHub does exist which ...
(EN_A_VS_AN)
[typographical] ~330-~330: The conjunction “so that” does not have a comma in front.
Context: ... corresponding columns to the CSV schema, so that we can populate those fields. Notably, ...
(SO_THAT_UNNECESSARY_COMMA)
[misspelling] ~330-~330: This word is normally spelled as one.
Context: ...mn, it could also be uased to store the sub types of datasets. Additionally, the `glossa...
(EN_COMPOUNDS_SUB_TYPES)
[style] ~334-~334: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...low, the dataset-level export will also need to be refactored to be more flexible, as a...
(REP_NEED_TO_VB)
[style] ~338-~338: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...and as such, this component will likely need to be refactored to be more flexible. We w...
(REP_NEED_TO_VB)
[style] ~338-~338: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...refactored to be more flexible. We will need to determine what shape the component shou...
(REP_NEED_TO_VB)
[style] ~340-~340: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...rough GraphQL as a side effect. It will need to be evaluated whether this is an accepta...
(REP_NEED_TO_VB)
🔇 Additional comments (14)
active/0000-import-export-feature/README.md (14)
5-9: Clear Feature Summary
The feature title and summary clearly state the goal (import/export datasets via CSV) and the current status. Ensure that any further refinements or updates to the implementation are documented in this section.
11-14: Motivation Section Clarity
The motivation section explains the rationale behind mimicking Collibra’s functionality and how it benefits enterprise users. Consider adding details on expected user benefits or quantitative metrics if available.
15-21: Requirements Section Verification
The requirements are explicitly listed for export and import functionality. This clarity is helpful. Ensure that any limitations (for example, container-based assumptions) are clearly communicated in both documentation and UI, as noted later.
22-25: Non-Requirements Section
The non-requirements section is succinct and clearly specifies what is out of scope (i.e. REST API). No changes needed.
26-28: Detailed Design Introduction
The introduction to the detailed design outlines the overall approach for adding three options in the dropdown. This sets the stage well for the detailed sections that follow.
30-34: CSV Columns Definition
The CSV block listing the column names is clear. Verify that the list of columns exactly matches the expected fields during both export and import.
58-68: Detailed Export Process
The step-by-step outline of the export process is detailed and easy to follow. Ensure that error handling (e.g., when no datasets are found) is robust in the implementation.
74-159: GraphQL Query: getDatasetByUrn
The query is comprehensive and correctly structured to fetch dataset metadata. Verify that client-side code consuming this query handles pagination (50 datasets per query) as described.
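As an illustration of the pagination concern, a hedged sketch of a client-side paging loop follows. The query variables and the response shape (`searchResults` under the container) are assumptions, not the RFC's actual generated hooks.

```typescript
import { ApolloClient, NormalizedCacheObject } from '@apollo/client';
import type { DocumentNode } from 'graphql';

const PAGE_SIZE = 50; // matches the 50-datasets-per-query behaviour described above

// Hypothetical helper: keeps requesting pages of datasets for a container until
// a page comes back short, then returns everything collected.
async function fetchAllDatasets(
    client: ApolloClient<NormalizedCacheObject>,
    query: DocumentNode,
    containerUrn: string,
): Promise<unknown[]> {
    const results: unknown[] = [];
    let start = 0;
    for (;;) {
        const { data } = await client.query({
            query,
            variables: { urn: containerUrn, start, count: PAGE_SIZE },
        });
        const page: unknown[] = data?.container?.entities?.searchResults ?? [];
        results.push(...page);
        if (page.length < PAGE_SIZE) break;
        start += PAGE_SIZE;
    }
    return results;
}
```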
160-252: GraphQL Query: getTable
This complementary query for fetching dataset details looks consistent with the first query. Ensure both queries are maintained together, as updates in one may require changes in the other.
262-291: New GraphQL Mutation and Input Types
The new `upsertDataset` mutation and its corresponding input types are well defined. Check that the trailing comma in `globalTagUrns: [String],` (line 273) is acceptable per your GraphQL schema style guidelines.
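For reference, a sketch of how the UI side might invoke this mutation with Apollo. The input fields shown (`urn`, `description`, `globalTagUrns`) and the assumption that the mutation returns a scalar are inferred from this review, not taken from the actual schema.

```typescript
import { gql, useMutation } from '@apollo/client';

// Assumes upsertDataset returns a scalar (e.g. the dataset URN), so no selection set is needed.
const UPSERT_DATASET = gql`
    mutation upsertDataset($input: DatasetUpsertInput!) {
        upsertDataset(input: $input)
    }
`;

export function useUpsertDataset() {
    const [upsertDataset] = useMutation(UPSERT_DATASET);
    // Field names here are illustrative; DatasetUpsertInput's real fields are defined in the RFC.
    return (input: { urn: string; description?: string; globalTagUrns?: string[] }) =>
        upsertDataset({ variables: { input } });
}
```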
304-309: User Guide Section (“How we teach this”)
This section outlines the essentials for a user guide effectively. Consider detailing common pitfalls or visual guides for using the feature.
310-313: GraphQL Documentation Update Reminder
The note regarding updating GraphQL documentation is insightful. Keep the documentation in sync with schema changes to avoid confusion.
314-319: Drawbacks and Alternatives Discussion
The discussion on drawbacks and alternatives is comprehensive. Consider ensuring consistent styling for “React” across the document when referring to the UI framework.
324-327: Rollout/Adoption Strategy Clarity
The rollout strategy is concise and confirms that no migration is needed. This clarity is beneficial.
OSS Review Recommendation: Proceed
rfc: rename file and add PR link
Actionable comments posted: 0
🧹 Nitpick comments (7)
active/0009-import-export-feature/README.md (7)
1-4: Review Header and PR Details
The header clearly lists the start date, RFC PR URL, and the placeholder for implementation PR(s). Please consider formatting the bare URL on line 2 as a proper markdown link (e.g., using angle brackets or the link text syntax) to adhere to markdownlint guidelines. Also, if an implementation PR exists, including its link would improve clarity.
58-69: Export Process Steps Overview
The step-by-step description of the export process is well laid out and easy to follow. It might be beneficial to include notes on error handling or fallback behavior if a step fails (for example, if the GraphQL query returns an unexpected result).
74-253: GraphQL Queries Implementation
The two GraphQL queries (getDatasetByUrn and getTable) are well-structured and clearly presented. One point to consider is the performance handling for containers with more than 50 datasets; monitoring and possibly refactoring the pagination and multiple CSV downloads might be useful in the future.
254-261: File Input for Import
The hidden file input element is implemented correctly. However, using inline styles (i.e., setting opacity directly) might limit maintainability. Consider moving these styles to an external CSS class, which would promote consistency and easier updates in the future.
293-301: Field Defaults for Import Details
The explanation of how missing CSV fields are supplemented with default values (e.g., `version: 0`, `schemaName: ""`, etc.) is useful. Please review the punctuation and formatting in this section (specifically around lines 295–299) to remove any loose punctuation and ensure consistency.
330-330: Typographical Correction Needed
There appears to be a typographical error on line 330 where "uased" is used instead of "used." A simple fix is needed here.
334-340: Repetitive Phrasing in Future Work and Unresolved Questions
Phrases such as “need to be refactored to be more flexible” are repeated across these sections. Consider rephrasing these areas to add variety and enhance readability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
active/0009-import-export-feature/README.md
(1 hunks)
🧰 Additional context used
🪛 LanguageTool
active/0009-import-export-feature/README.md
[uncategorized] ~295-~295: Loose punctuation mark.
Context: ...d with these values on import: - name
: The name is extracted from the dataset ...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~298-~298: Loose punctuation mark.
Context: ...aName: An empty string. -
platformUrn`: The platform URN is extracted from the ...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~299-~299: Loose punctuation mark.
Context: ...ed in the resource
CSV field. - type
: SchemaFieldDataType.Null
- `nativeDat...
(UNLIKELY_OPENING_PUNCTUATION)
[typographical] ~330-~330: The conjunction “so that” does not have a comma in front.
Context: ... corresponding columns to the CSV schema, so that we can populate those fields. Notably, ...
(SO_THAT_UNNECESSARY_COMMA)
[misspelling] ~330-~330: This word is normally spelled as one.
Context: ...mn, it could also be uased to store the sub types of datasets. Additionally, the `glossa...
(EN_COMPOUNDS_SUB_TYPES)
[style] ~334-~334: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...low, the dataset-level export will also need to be refactored to be more flexible, as a...
(REP_NEED_TO_VB)
[style] ~338-~338: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...and as such, this component will likely need to be refactored to be more flexible. We w...
(REP_NEED_TO_VB)
[style] ~338-~338: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...refactored to be more flexible. We will need to determine what shape the component shou...
(REP_NEED_TO_VB)
[style] ~340-~340: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...rough GraphQL as a side effect. It will need to be evaluated whether this is an accepta...
(REP_NEED_TO_VB)
🪛 markdownlint-cli2 (0.17.2)
active/0009-import-export-feature/README.md
2-2: Bare URL used
null
(MD034, no-bare-urls)
🔇 Additional comments (6)
active/0009-import-export-feature/README.md (6)
32-34: CSV Columns Listing is Clear
The CSV header block clearly lists all the column names required for the import/export feature. The format is concise and self-explanatory.
36-47: Detailed CSV Columns Usage Explanation
This section thoroughly explains the purpose and formatting of each CSV column. It provides valuable context for both datasets and schema fields. Consider adding examples in future revisions if more clarity is needed for edge cases.
50-57: Export Modal Usage Details
The explanation for when and how the export modal is activated is comprehensive. Please ensure that the assumptions about the container hierarchy (e.g., the expectation of a specific number of containers) are validated at runtime to avoid potential issues in diverse environments.
70-73: GraphQL Queries Section Introduction
This brief introduction to the GraphQL queries sets the stage nicely for the detailed queries that follow.
262-292: GraphQL Mutation and Input Types Definitions
The new GraphQL mutation (upsertDataset) along with its input types (DatasetUpsertInput, SchemaMetadataInput, and SchemaFieldInput) are clearly defined and documented. This helps ensure that the integration with GMS is well understood.
1-340: Overall Documentation Quality
This README provides a comprehensive overview of the new Import/Export feature for DataHub, covering motivation, design, detailed implementation, and future considerations. The level of detail is excellent; just ensure minor editorial fixes and consistency improvements are applied.
```
<input id="file" type="file" onChange={changeHandler} style={{ opacity: 0 }} />
```

The `papaparse` library is used to parse the CSV file and iterate over each row present within it. The data is then fed into GraphQL mutations to create datasets. Notably, a new GraphQL mutation had to be created to allow the upserting of schema metadata. Here is the specification for that new mutation:
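To illustrate the parse-and-upsert loop described above, here is a hedged sketch using papaparse. The row field names and the `upsertDataset` callback are assumptions for illustration; the real mutation input is defined by the RFC's `DatasetUpsertInput`.

```typescript
import Papa from 'papaparse';

// Parses the selected CSV file and feeds each row into the upsert mutation.
// `upsertDataset` stands in for the GraphQL mutation call; its input shape is illustrative.
function handleImport(
    file: File,
    upsertDataset: (input: Record<string, unknown>) => Promise<unknown>,
): void {
    Papa.parse<Record<string, string>>(file, {
        header: true, // the first CSV row holds the column names
        skipEmptyLines: true,
        complete: async (results) => {
            for (const row of results.data) {
                // Fields missing from the CSV (e.g. version) get defaults, as described later in the RFC.
                await upsertDataset({ urn: row.resource, description: row.description, version: 0 });
            }
        },
    });
}
```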
What sort of scale do we want to advertise for this feature?
How much have we tested up to?
Should the UI reject the import if there are too many datasets?
How should progress be displayed to the user?
I had previously tested a prototype that broke requests down even at the row level, i.e., one request for glossary terms, one request per schema column changed, etc.
Furthermore, it cached key fields of the exported file locally. This allowed for only submitting the diffs to GraphQL.
This approach helped us quickly identify if a specific cell in the CSV failed to apply, while still succeeding with all the rest. This was easily presented in a final upload report at the end.
It seemed to work well with up to 100k items changed.


At present, the implementation we've written does not display a progress bar, nor does it reject the user if too many datasets are present for import or export. However, importing a CSV file can take a great deal of time if there are a lot of datasets present in the file. I believe we've tested up to 36 datasets at once.
## Detailed design

This feature will add three new options to the existing `SearchExtendedMenu` dropdown. One to export all datasets within a container, one to export individual datasets, and one to import previously exported data into DataHub. The export options create CSV files from data existing in DataHub, while the import option adds new data to DataHub from CSV files.
A screenshot of the proposed look and feel here would be great.
This feature will add the ability to both export datasets to CSV files and import them back into DataHub from those CSV files, using the UI. Code is already implemented for this feature, though further work may need to be done. This RFC details the implementation in its current state.

## Motivation
A section on the User Journey and Motivations would be useful here.
e.g.
Why is the User exporting to CSV -> Do they intend to make some changes to metadata in bulk and then import it back?
Are there scenarios where Users are trying to import CSVs containing metadata that has been handwritten or sourced from "non DataHub" catalogs? In those scenarios - how will users provide the "urn" field which represents the main identity of the dataset on DataHub?
## Non-Requirements

This feature is not intended to add a REST API for import/export like that of Collibra. It is only intended for use through the UI.
Maybe this is where we declare that we are not intending to support a from-scratch import of csv metadata which has not been sourced from DataHub
Yes agreed. For now, if users want to achieve this they'll need to closely mimic our format and have the assets pre-created in DataHub through other means.
In contrast, for bulk import glossary terms we'll need something a bit simpler to let users transform externally sourced terms & term groups into a csv for import. Also will need to support upserts, not just updates.
### Export

Within the `SearchExtendedMenu` dropdown, the container-level export option is only available when a container is being viewed. At all other times, it is grayed out and cannot be pressed. This is done using a React effect, which greys out the button unless the URL of the current page contains the word "container".
screenshot here would be great
I'll be updating the RFC with screenshots from the current implementation we have.
When either export option is selected, it opens a modal which prompts the user to enter the name of the CSV file to be created. For dataset-level export, the user is also prompted to enter the data source, database, schema, and table name of the dataset to be exported. Notably, these fields assume a specific number of containers to be present, which may not be the case for every data source. As such, this modal may need to be altered. This is what the fields presently refer to:
For dataset-level export, the user is also prompted to enter the data source, database, schema, and table name of the dataset to be exported.
Isn't this something that can be inferred from the dataset urn by making a graphql call? Not sure why the user needs to provide this information for the dataset.
Agreed - we can auto-infer it. Users will still be able to rename the file in the system download popup.
Since this option appears in the search results, we cannot derive a URN from context. We would have to modify the form to ask for a dataset URN instead of the current fields. I can modify the RFC to reflect that if you'd like.
Upon entry, the following steps occur:

1. The modal is made invisible, but continues executing code for the export process. A notification is created to inform the user that the export process is ongoing.
screenshot / design mock would be great here
Are we assuming a progress bar?
I'll be updating the RFC doc with a screenshot of the notification in the current implementation.
This feature as it is currently implemented is only intended to support:
- Export to CSV of individual datasets.
- Export to CSV of all datasets within a container.
Or we could simply say: all datasets that match a search predicate?
The feature as it is currently implemented is not designed to support that.
## Requirements

This feature as it is currently implemented is only intended to support:
- Export to CSV of individual datasets.
and their schemas!
In the current implementation, schemas are not present within the CSV files. During schema-level export, only the datasets within a given schema are exported.
## Drawbacks and Alternatives

As mentioned before, this feature is only intended for use within the UI. As the code has currently been written, it would not be possible to extend the import and export functionality to a different API (i.e., REST), as all the code is written in React.
We could convert it to python without too much effort - given all the core logic is in .ts, with no jsx stuff involved.
Yes, we could do such a thing, but most of the existing code we've written would need to be scrapped and rewritten in another language (Python, in this case).
It's also notable that because the format of the CSV files is so different from those produced by the existing functionality of downloading search results, existing CSV files cannot be used to import datasets. This may cause confusion among users, and may be worth remediating.

It's also notable that an extension for DataHub does exist which adds very similar functionality ([link](https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/csv/)). This has not been investigated in detail, but if this is a duplicate feature, it may not be worth integrating into DataHub.
I believe folks usually use this as a one-off rather than on a loop, so we could sunset the source once there is a mechanism for it in the UI.
And especially if we copy the React code to Python as mentioned above, then developers can leverage the Python tool to do this programmatically.
It's notable that the dataset-level export component of this feature was designed specifically for data sources with two layers of containers in DataHub. This is unlikely to always be the case, and as such, this component will likely need to be refactored to be more flexible. We will need to determine what shape the component should take before performing this refactoring.

Additionally, this feature would end up adding the ability to create Datasets through GraphQL as a side effect. It will need to be evaluated whether this is an acceptable outcome, or if it is acceptable, whether it should be made accessible through the GraphiQL interface.
I wonder what the implications of creating and deleting datasets via GraphQL would be for the overall DataHub platform?
* rfc: Update document with requested changes
* Fix typo on line 28
This RFC documents the code changes for the import/export feature, which has been requested by the DataHub team.