Conversation

@dmohns
Contributor

@dmohns dmohns commented May 26, 2017

The extract job is currently not implemented in the contrib.bigquery module. The scope of this PR is to implement it according to https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs.
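
For context, the configuration of an extract job per that reference looks roughly like the sketch below (placeholder values only, not the exact code in this PR):

```python
# Rough sketch of a BigQuery extract job configuration, following the v2 jobs
# reference; all values here are placeholders, not the exact code in this PR.
job = {
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': 'my-project',
                'datasetId': 'my_dataset',
                'tableId': 'my_table',
            },
            'destinationUris': ['gs://my-bucket/my_table-*.csv'],
            'destinationFormat': 'CSV',
            'compression': 'NONE',
            'printHeader': True,
            'fieldDelimiter': ',',
        }
    }
}
```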

Motivation and Context

Without an extract task we can only assemble pipelines that load data into BigQuery, but not pipelines that move data from BigQuery elsewhere.

Description

Have you tested this? If so, how?

  • Added basic unit and integration tests
  • A version of this runs in our production environment.

Questions

Being relatively new to the luigi ecosystem, I am not 100% certain I have correctly followed the luigi style everywhere. Happy to receive feedback.

@mention-bot

@dmohns, thanks for your PR! By analyzing the history of the files in this pull request, we identified @mikekap, @ulzha and @mbruggmann to be potential reviewers.

@dmohns dmohns force-pushed the bq-add-unload-task branch from 1235681 to 7af7dba on May 26, 2017 13:57
- Fill job configuration according to https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs for extract jobs.
- Add assertions that the jobs are sane.
- Add basic unit tests for extracts.
- Add integration test with basic configuration.
@dmohns dmohns force-pushed the bq-add-unload-task branch from 7af7dba to 6899c95 on May 26, 2017 14:24
@miike

miike commented Jul 10, 2017

This would be great to have, as we're after something similar. Is it worth checking table_exists in run?

```python
'tableId': input.table.table_id
},
'destinationUris': destination_uris,
'printHeader': self.print_header,
```

Does BigQuery ignore printHeader and fieldDelimiter if the destinationFormat is not CSV?

Contributor Author

@dmohns dmohns Jul 11, 2017


Good catch! I have never used BigQuery with anything other than CSV, so I am not aware of the behavior. I know from other cases that superfluous job parameters are ignored, so I would assume the same is true for printHeader and fieldDelimiter.

Regardless, it would be more consistent to set printHeader and fieldDelimiter conditionally. I will add that as soon as I find some time.
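
Roughly something like the following sketch (the helper name and the self attributes are placeholders, not necessarily what will land in this PR):

```python
def _get_extract_config(self):
    # Sketch only: attach the CSV-specific options conditionally. The attribute
    # names (destination_format, print_header, field_delimiter) are placeholders.
    bq_table = self.input().table
    config = {
        'sourceTable': {
            'projectId': bq_table.project_id,
            'datasetId': bq_table.dataset_id,
            'tableId': bq_table.table_id,
        },
        'destinationUris': self.destination_uris,
        'destinationFormat': self.destination_format,
    }
    if self.destination_format == 'CSV':
        config['printHeader'] = self.print_header
        config['fieldDelimiter'] = self.field_delimiter
    return config
```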


Yeah - I'm not sure if it's still an issue but I only noticed because Airflow has a conditional check on this.

@dmohns
Contributor Author

dmohns commented Jul 11, 2017

The way the task is currently designed, I think a dedicated exists check would be unnecessary overhead. The ExtractTask asserts that its requirement is a single BigQueryTarget (whose exists method is essentially table_exists). The main reason behind this is separation of responsibility: the ExtractTask should take care of unloading the table, while it is luigi's job to take care of dependency checking and resolution.
If the table does not exist at run time, the result will be a BigQuery exception, which I think is appropriate behavior.
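
To illustrate the intended usage, here is a rough sketch (the class names, the bucket path, and the destination_uris property are illustrative, not necessarily the final API):

```python
import luigi
from luigi.contrib import bigquery

# Usage sketch: the upstream task's output is the BigQueryTarget, so luigi's
# dependency resolution checks table existence before the extract task runs.
# ProduceTable, ExtractToGCS and the bucket path are illustrative names only.
class ProduceTable(luigi.ExternalTask):
    def output(self):
        return bigquery.BigQueryTarget('my-project', 'my_dataset', 'my_table')


class ExtractToGCS(bigquery.BigQueryExtractTask):
    def requires(self):
        return ProduceTable()

    @property
    def destination_uris(self):
        return ['gs://my-bucket/my_table-*.csv']
```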

- non-CSV extracts are incompatible with FieldDelimiter and PrintHeader. These job configuration attributes can therefore only be set conditionally.
- Added tests for non-CSV extracts
@dmohns
Contributor Author

dmohns commented Jul 18, 2017

@miike You were right: non-CSV extracts indeed fail when provided with the FieldDelimiter or PrintHeader options. They are now set conditionally as suggested.

@dlstadther
Collaborator

@dmohns @miike Is this PR approved and ready to merge?

@dmohns
Contributor Author

dmohns commented Sep 19, 2017

@dlstadther From my point of view this can be merged; I am just waiting for any potential feedback and/or reviews.

@dlstadther dlstadther merged commit 4bc00ca into spotify:master Sep 20, 2017
@dlstadther
Collaborator

@dmohns Thanks for contributing! Sorry again for the long delay between submission and merge.
