Add a hard upper limit to datastore_search(_sql) rows returned #4562

Merged
davidread merged 20 commits into master from 4561-limit-datastore_search
Jan 27, 2019

@davidread davidread commented Nov 23, 2018

Fixes #4561

Proposed fixes:

ckan.datastore.search.rows_max is the new config option which limits datastore_search and datastore_search_sql.

When datastore_search_sql returns results that have hit this configured limit, then it includes records_truncated: True in the response.

It was too complicated to do the same for datastore_search. However, you can tell when the limit you requested is above ckan.datastore.search.rows_max, because the limit value in the response is changed from what you specified in the request to the ckan.datastore.search.rows_max value.
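
As a rough illustration of that client-side check, here is a minimal sketch. The helper name `limit_was_lowered` and the sample response dict are hypothetical; the only thing taken from this PR is that the response carries a `limit` value reflecting what was actually applied.

```python
def limit_was_lowered(requested_limit, response_result):
    """Return True if the server applied a smaller limit than requested.

    ``response_result`` is assumed to be the ``result`` dict of a
    datastore_search response, which echoes back the effective ``limit``.
    """
    effective = response_result.get("limit")
    return effective is not None and effective < requested_limit


# Hypothetical response for a request with limit=50000 on a site where
# ckan.datastore.search.rows_max happens to be 32000.
sample_result = {"limit": 32000, "records": []}
print(limit_was_lowered(50000, sample_result))
```

A caller that sees the limit lowered would then need to page through the remaining rows with `offset`.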

This PR includes some work to ensure 'datastore dump' still works (and remains not limited by ckan.datastore.search.rows_max). I've updated and modernized the related tests, and that work is also found in a separate PR: #4581 if you want to merge that separately (before this one).

Features:

  • includes tests covering changes
  • includes updated documentation
  • includes user-visible changes
  • includes API changes
  • includes bugfix for possible backport

Please [X] all the boxes above that apply

@davidread davidread force-pushed the 4561-limit-datastore_search branch from 67f3d8c to 2dc6bd1 Compare November 23, 2018 15:37
@davidread davidread force-pushed the 4561-limit-datastore_search branch from 2dc6bd1 to e6d6361 Compare November 23, 2018 15:44
@davidread davidread changed the title from "Makes the datastore_search default limit configurable" to "Makes the datastore_search default limit configurable [WIP]" Nov 23, 2018
@davidread davidread force-pushed the 4561-limit-datastore_search branch from dfa7def to e647bbd Compare November 23, 2018 22:26
@davidread davidread changed the title from "Makes the datastore_search default limit configurable [WIP]" to "Add a hard upper limit to datastore_search rows returned [WIP]" Nov 30, 2018
@davidread davidread changed the title from "Add a hard upper limit to datastore_search rows returned [WIP]" to "Add a hard upper limit to datastore_search rows returned" Nov 30, 2018
@davidread davidread requested a review from wardi November 30, 2018 18:27
@davidread davidread changed the title from "Add a hard upper limit to datastore_search rows returned" to "Add a hard upper limit to datastore_search(_sql) rows returned" Nov 30, 2018
Comment thread ckanext/datastore/backend/postgres.py Outdated
@wardi
Contributor

wardi commented Nov 30, 2018

ckan.datastore.search.rows_max is likely going to cause a problem with https://github.com/ckan/ckan/blob/master/ckanext/datastore/controller.py#L46

@davidread
Author

I guess the datastore dump should be exempt from the rows_max. It's not flexible with lots of options like datastore_search, so it is very cacheable. And it would be a bit odd to have an API for getting a straight dump of a file and have it turn up truncated.

If so, dump_to could pass a context variable that tells datastore_search not to impose the rows_max.

@wardi
Contributor

wardi commented Nov 30, 2018

-1 on special behaviours based on context variables, just need to update dump to check for the results_truncated return value. If we try to be too clever we won't be able to support custom validation rules because we end up repeating them in other parts of the code.

Also for choosing a default datastore_search limit set it at least as large as the dump PAGINATE_BY value.

David Read added 4 commits December 7, 2018 12:48
@wardi wardi assigned wardi, smotornyuk and tino097 and unassigned wardi Dec 7, 2018
@davidread
Author

Ok makes sense.

choosing a default datastore_search limit set it at least as large as the dump PAGINATE_BY value.

Rather than disallow rows_max to be less than 32000, I've written some code that allows it to be lower. In that case it simply reduces the PAGINATE_BY value, so no rows are missed by the dumper. test_dump_with_low_rows_max is the test (in test_dump.py).
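
The paging idea can be sketched roughly as follows. This is not the code from the PR: `dump_all` and `make_fake_search` are hypothetical names, and the fake search stands in for datastore_search by clamping `limit` to `rows_max` the way the new option is described.

```python
PAGINATE_BY = 32000  # page size mentioned in this thread


def make_fake_search(rows, rows_max):
    """Build a stand-in for datastore_search over an in-memory list."""
    def search(offset, limit):
        limit = min(limit, rows_max)  # the hard upper limit
        return rows[offset:offset + limit]
    return search


def dump_all(search, rows_max):
    """Yield every row, paging with a size that never exceeds rows_max."""
    page_size = min(PAGINATE_BY, rows_max)
    offset = 0
    while True:
        records = search(offset=offset, limit=page_size)
        for record in records:
            yield record
        # A short (or empty) page means there is nothing left to fetch.
        if len(records) < page_size:
            break
        offset += page_size


rows = list(range(10))
print(list(dump_all(make_fake_search(rows, rows_max=3), rows_max=3)))
```

If the page size stayed at PAGINATE_BY while the search silently returned only `rows_max` rows, advancing `offset` by the full page size would skip rows, which is why the page size is reduced.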

Ready for rereview @wardi

Comment thread ckanext/datastore/controller.py Outdated
@davidread
Author

@wardi Adding a 'records_truncated' to datastore_search seems too complicated. As discussed for this PR, for datastore_search_sql we return records_truncated to indicate if you get back fewer records than you would have done without rows_max. This is done by querying for rows_max+1 records and discarding the last one. However, in datastore_search you can't remove the last record, because the result is returned from postgres as a CSV/TSV/JSON/XML string. So you have to query for exactly rows_max records, and if you want to know whether you are truncating, you need to do a second query for rows_max+1 and count the results. Yuk.
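
For illustration, the "query for rows_max+1 and discard the last one" trick looks roughly like this. It is a sketch only: `fetch_with_truncation_flag` and `run_query` are hypothetical names, and `run_query` stands in for whatever executes the SQL and returns a list of rows.

```python
def fetch_with_truncation_flag(run_query, rows_max):
    """Fetch at most rows_max rows and flag whether more were available."""
    # Ask for one extra row purely to detect that the limit was hit.
    rows = run_query(limit=rows_max + 1)
    result = {"records": rows[:rows_max]}
    if len(rows) > rows_max:
        # Only set when the extra row came back, as in the PR description.
        result["records_truncated"] = True
    return result


def run_query(limit):
    """Hypothetical query runner over ten in-memory rows."""
    return list(range(10))[:limit]


print(fetch_with_truncation_flag(run_query, rows_max=5))
```

This only works when the rows come back as a list that can be sliced, which is the difficulty with the pre-serialised string that datastore_search gets from postgres.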

So I think the best we could do for datastore_search is say if the limit specified by the user has been lowered to rows_max. This is slightly useful, because the user doesn't know what rows_max is configured to be. I've not coded this yet, but it's simple.
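
A minimal sketch of that server-side lowering, assuming the request arrives as a dict with an optional `limit` key. `clamp_limit` is a hypothetical helper, not the function in the PR, and how a missing `limit` is defaulted is deliberately left out.

```python
def clamp_limit(data_dict, rows_max):
    """Return a copy of data_dict whose limit does not exceed rows_max."""
    requested = data_dict.get("limit")
    if requested is None:
        # Default handling is out of scope for this sketch.
        return dict(data_dict)
    return dict(data_dict, limit=min(requested, rows_max))


# The lowered value would then be echoed back in the response, so the
# caller can see that it differs from what was requested.
print(clamp_limit({"resource_id": "abc", "limit": 50000}, rows_max=32000))
```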

For consistency the 'Datastore dump' call should also warn the user if the records are curtailed. Because it uses datastore_search it too cannot say if you are over the rows_max, without that annoying extra query. But we can tell the user if we have returned rows up to the rows_max, even with the pagination going on - it's complicated but I've managed to implement it with the help of a bunch of tests. I'm not sure it is worth it - let me know what you think.

@wardi
Contributor

wardi commented Dec 7, 2018

If we're going to limit the number of records that the dump controller returns that should be a separate option that defaults to unlimited.

The whole point of the dump controller is to dump all the data requested so that users don't need to paginate with the API themselves. The controller uses a constant amount of memory, it just takes more time if there are more records.

@davidread
Author

Ok, I wasn't quite sure which way you were nudging, but that's clear now and helps!

@davidread
Author

Backport for 2.7 is on branch 4561-limit-datastore_search-2.7-backport

@davidread
Author

This is now ready for further comments @tino097 @smotornyuk

@davidread
Author

@tino097 @smotornyuk Just a friendly ping to remind you about this PR :)

Comment thread ckanext/datastore/backend/postgres.py Outdated
@davidread
Author

Any more comments or is this good to merge @wardi @smotornyuk ?

Comment thread ckanext/datastore/logic/schema.py Outdated

A number of parameters from :meth:`~ckanext.datastore.logic.action.datastore_search` can be used:
``offset``, ``limit``, ``filters``, ``q``, ``distinct``, ``plain``, ``language``, ``fields``, ``sort``

Contributor

nice, thanks

@kmbn kmbn added the "Awaiting tech team feedback" label Jan 24, 2019
@davidread davidread force-pushed the 4561-limit-datastore_search branch from 03cb7d8 to 51337ff Compare January 25, 2019 11:07
Comment thread ckanext/datastore/logic/action.py
@wardi
Contributor

wardi commented Jan 25, 2019

this looks good to me

@davidread
Author

Thanks @wardi.

@tino097 @smotornyuk did either of you want to look at this any more before I merge?

@tino097
Member

tino097 commented Jan 26, 2019

@davidread It looks great, click the button

@davidread davidread merged commit 4171ce9 into master Jan 27, 2019
@davidread davidread deleted the 4561-limit-datastore_search branch January 27, 2019 21:34
@davidread davidread removed the "Awaiting tech team feedback" label Jan 27, 2019

Development

Successfully merging this pull request may close these issues.

Add a hard upper limit to datastore_search and datasearch_sql rows returned

5 participants