Publishing in batches #1983
I did manual testing on this by using the GFF3 importer on the Tripalus databasica test GFF3.
Publishing mRNA with any batch size worked great.
Publishing gene failed, however, because the base table could not be found, and there was no check to make sure it was found...
Specifically, I created an organism and analysis, ran the GFF3 importer with the file /var/www/drupal/web/modules/contrib/tripal/tripal_chado/tests/fixtures/gff3_loader/TripalusDatabasicaChr1Genes.gff3 and ran the job. I also imported the "Genomic" content type collection. Then I changed the batch size to 5 and published mRNA. It ran without error and created beautiful pages. I then tried to publish Genes and it failed on the command line with an error regarding the base table. The failure for genes was not reproducible in another docker, so I'm not sure what the difference was...
That said, code review also turns up that we are using the chado base table third-party setting directly, which makes TripalPublish only work for chado 🙈 In PR #1991 we are looking at making a chado-specific publish and creating a plugin infrastructure so each backend can have its own publish, but in the meantime we are keeping Tripal Publish and don't want to break it for other data backends (if it works for them... this hasn't actually been tested).
Upon further thought, I do think we need to be careful not to assume chado storage here... I think this small change removes that assumption (I couldn't use a GitHub suggestion since it covers lines of code you hadn't originally changed):
```diff
diff --git a/tripal/src/Services/TripalPublish.php b/tripal/src/Services/TripalPublish.php
index 4d608c401..1c6c734cf 100644
--- a/tripal/src/Services/TripalPublish.php
+++ b/tripal/src/Services/TripalPublish.php
@@ -215,7 +215,6 @@ class TripalPublish {
       throw new \Exception(t($error_msg, ['%bundle' => $bundle]));
     }
     $this->entity_type = $entity_type;
-    $this->base_table = $entity_type->getThirdPartySetting('tripal', 'chado_base_table');
 
     // Get the storage plugin used to publish.
     /** @var \Drupal\tripal\TripalStorage\PluginManager\TripalStorageManager $storage_manager **/
@@ -228,6 +227,21 @@ class TripalPublish {
 
     $this->setFieldInfo();
 
+    // We need a way to get all the record ids for a bundle.
+    // If this is the chado storage backend then we do this using the chado table.
+    if ($datastore == 'chado_storage') {
+      $this->base_table = $entity_type->getThirdPartySetting('tripal', 'chado_base_table');
+    }
+    // But if this is not chado storage then the backend needs to provide the base
+    // table for a bundle.
+    else {
+      $this->base_table = $this->storage->getBaseTable($bundle);
+    }
+    if (empty($this->base_table)) {
+      $error_msg = 'Could not find the base table for the %bundle entity type.';
+      throw new \Exception(t($error_msg, ['%bundle' => $bundle]));
+    }
+
     // Get the required field properties that will uniquely identify an entity.
     // We only need to search on those properties.
     $this->required_types = $this->storage->getStoredTypes();
```
Essentially, the above now checks for the base table in the init, where you set it:
- If the backend is chado_storage, then I use the same code you did.
- If not, then I expect the backend to have a method to retrieve the base table based on the bundle.
- I also now throw an exception if we are not able to determine the base table, whereas before you just grabbed the third-party setting and assumed it was present.
The rest of your code then remains untouched.
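To make the branching easier to reason about outside of Drupal, here is a minimal standalone sketch of the same decision. The `BaseTableProviderInterface` interface and the `resolveBaseTable()` function are hypothetical names used only for illustration; the `getBaseTable()` method is the one the suggestion above expects a non-chado backend to provide, and is not confirmed to exist on any current storage plugin.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical interface: a non-chado backend that can name the base table
 * for a bundle. This is an assumption, not part of Tripal.
 */
interface BaseTableProviderInterface {
  public function getBaseTable(string $bundle): ?string;
}

/**
 * Resolves the base table for a bundle, or throws if none can be found.
 *
 * $chado_base_table stands in for the value of the 'chado_base_table'
 * third-party setting, which may be missing (NULL).
 */
function resolveBaseTable(string $datastore, string $bundle, ?string $chado_base_table, ?BaseTableProviderInterface $storage): string {
  if ($datastore === 'chado_storage') {
    // Chado keeps the base table in the third-party setting.
    $base_table = $chado_base_table;
  }
  else {
    // Any other backend has to report its own base table.
    $base_table = $storage !== NULL ? $storage->getBaseTable($bundle) : NULL;
  }
  if (empty($base_table)) {
    // Fail loudly instead of carrying an empty table name forward.
    throw new \RuntimeException("Could not find the base table for the $bundle entity type.");
  }
  return $base_table;
}
```

With this shape, a missing third-party setting surfaces as an exception at initialization rather than as a later query failure.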
…tripal into tv4g1-issue1910-partial-publish
✅ Code review now looks good 👍 There are faster ways to restrict the query to a batch that we can explore in the future, but for now this is a huge improvement.
✅ Manual testing of importing both genes and mRNA when there are fewer than the amount in a batch plus new records, more than the amount plus new records, and the same two combinations when there are more than the batch size. Worked great in all these cases and nice and fast too :-)
Thanks for all this work on publish @dsenalik! This is ready to merge!
New Feature
Closes #1910
Closes #1959
Depends on #1977 (merged)
Tripal Version: 4.x
Description
Suppose you want to migrate your Tripal 3 site with several genomes, each having tens of thousands of "gene", "mRNA", etc. records. Right now, publish may fail when it runs out of memory.
This PR implements publishing in batches of at most 1000 records at a time.
We might want to put the actual batch size on the publish form, but for now it is hard-coded.
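The general idea of batching can be sketched as follows. This is a simplified illustration and not the code in this PR: `publishInBatches()` and the `$publish_batch` callback are hypothetical names, and the sketch assumes the full list of record IDs is cheap to hold in memory while the per-record publishing work is what needs to be bounded.

```php
<?php
declare(strict_types=1);

/**
 * Publishes records in batches of at most $batch_size IDs at a time.
 *
 * $publish_batch is a hypothetical callback that receives one batch of
 * record IDs and returns how many of them it published.
 */
function publishInBatches(array $record_ids, int $batch_size, callable $publish_batch): int {
  if ($batch_size < 1) {
    throw new \InvalidArgumentException('Batch size must be at least 1.');
  }
  $total = 0;
  // array_chunk() splits the ID list; the last chunk may be smaller.
  foreach (array_chunk($record_ids, $batch_size) as $batch) {
    // Only the data for this batch needs to be loaded by the callback.
    $total += $publish_batch($batch);
  }
  return $total;
}

// Usage: 12 records with a batch size of 5.
$sizes = [];
$total = publishInBatches(range(1, 12), 5, function (array $batch) use (&$sizes): int {
  $sizes[] = count($batch);
  return count($batch);
});
// $sizes is now [5, 5, 2] and $total is 12.
```

Keeping the batch size as a parameter, rather than a constant inside the loop, would also make it straightforward to expose on the publish form later.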
As for log messages, to be more succinct, I removed this field message,
and only display the other when at least one field value was published
If there are only a few records, the output will be the same as before; however, if enough records exist to trigger batching, a batch prefix is included. For example
Testing?
Either you migrate a site with thousands of records of a particular type, or for a fast test you can temporarily change the batch size in
tripal/src/Services/TripalPublish.php