Add ExtractTask to luigi.contrib.BigQuery module #2134
@dmohns, thanks for your PR! By analyzing the history of the files in this pull request, we identified @mikekap, @ulzha and @mbruggmann to be potential reviewers. |
1235681 to 7af7dba
- Fill job configuration according to https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs for extract jobs.
- Add assertions for the jobs to be sane.
- Add basic unit test for extracts.
- Add integration test with basic configuration.
7af7dba to 6899c95
|
This would be great to have as we're after something similar. Is it worth checking if the |
luigi/contrib/bigquery.py
Outdated
        'tableId': input.table.table_id
    },
    'destinationUris': destination_uris,
    'printHeader': self.print_header,
Does BigQuery ignore printHeader and fieldDelimiter if the destinationFormat is not CSV?
Good catch! I have never used BigQuery extracts with anything other than CSV and am therefore not aware of the behavior. I know from other cases that superfluous job parameters are ignored, so I would assume the same is true for printHeader and fieldDelimiter.
Regardless, it would be more consistent to set printHeader and fieldDelimiter conditionally. I will add this as soon as I find some time.
Yeah - I'm not sure if it's still an issue but I only noticed because Airflow has a conditional check on this.
|
The way the Task is currently designed I think it would be unnecessary overhead to add a dedicated |
- Non-CSV extracts are incompatible with fieldDelimiter and printHeader. These job configuration attributes can therefore only be set conditionally.
- Added tests for non-CSV extracts.
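For readers following the thread, the conditional handling described in this commit could look roughly like the sketch below. It is not the code from the PR: the function name build_extract_config and its parameters are hypothetical and chosen only for illustration. The dictionary keys (sourceTable, destinationUris, destinationFormat, printHeader, fieldDelimiter) follow the extract job configuration in the BigQuery REST v2 jobs reference linked above.

```python
def build_extract_config(project_id, dataset_id, table_id, destination_uris,
                         destination_format="CSV", print_header=True,
                         field_delimiter=","):
    """Build the request body for a BigQuery extract job.

    Hypothetical helper for illustration only; not part of luigi.
    """
    extract = {
        "sourceTable": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
        },
        "destinationUris": list(destination_uris),
        "destinationFormat": destination_format,
    }
    # printHeader and fieldDelimiter only apply to CSV output. The thread
    # reports that non-CSV extracts fail when these are supplied, so they
    # are added only for CSV.
    if destination_format == "CSV":
        extract["printHeader"] = print_header
        extract["fieldDelimiter"] = field_delimiter
    return {"configuration": {"extract": extract}}


if __name__ == "__main__":
    # Placeholder project, dataset, table and bucket names.
    body = build_extract_config(
        "my-project", "my_dataset", "my_table",
        ["gs://my-bucket/export-*.json"],
        destination_format="NEWLINE_DELIMITED_JSON",
    )
    print(sorted(body["configuration"]["extract"]))
```

With a non-CSV destinationFormat the CSV-only keys are simply left out of the request, which is the behavior the added tests for non-CSV extracts would be expected to cover.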
|
@miike You were right: non-CSV extracts indeed fail when provided with |
|
@dlstadther From my point of view this can be merged. I am waiting for potential feedback and/or reviews. |
|
@dmohns Thanks for contributing! Sorry again for the long delay between submission and merge |
The extract job is currently not implemented in the contrib.bigquery module. The scope of this PR is to implement it according to https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs.
Motivation and Context
Without the extract job we can only assemble pipelines into BigQuery, but not from BigQuery to elsewhere.
Description
Have you tested this? If so, how?
Questions
Being relatively new to the luigi ecosystem, I am not 100% certain I correctly followed the luigi style everywhere. Happy to receive feedback.