Publishing in batches #1983

dsenalik · 2024-09-27T23:09:05Z

New Feature

Closes #1910

Closes #1959

Depends on #1977 (merged)

Tripal Version: 4.x

Description

Suppose you want to migrate your Tripal 3 site with several genomes, each having tens of thousands of "gene", "mRNA", etc. records. Right now, publish may fail when it runs out of memory.
This PR implements publishing in batches of at most 1000 records at a time.
We might want to put the actual batch size on the publish form, but for now it is hard-coded.
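
As a rough illustration of the batching idea only (not the code in this PR), the candidate record IDs can be split into chunks of at most the batch size and processed one chunk at a time. The function name `split_into_batches()` and the sample ID range below are hypothetical; the real logic lives in TripalPublish and may differ.

```php
<?php
/**
 * Hypothetical helper (not part of TripalPublish): split a list of candidate
 * record IDs into batches of at most $batch_size IDs each.
 */
function split_into_batches(array $record_ids, int $batch_size = 1000): array {
  if ($batch_size < 1) {
    throw new InvalidArgumentException('Batch size must be at least 1.');
  }
  // array_chunk() keeps the original order; only the last chunk may be smaller.
  return array_chunk($record_ids, $batch_size);
}

// Assumed sample data: 1,919 record IDs, as in the example log output.
$record_ids = range(1, 1919);
$batches = split_into_batches($record_ids);

foreach ($batches as $index => $batch) {
  // Only one batch of IDs is handled per iteration, which is what
  // keeps memory use bounded when there are many thousands of records.
  printf("Batch %d of %d: %s records\n", $index + 1, count($batches), number_format(count($batch)));
}
```

With the default batch size of 1000, the 1,919 sample IDs end up in two batches of 1,000 and 919 records.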

To keep the log messages succinct, I removed this per-field message:

        $this->logger->notice("  Checking for published items for the field: $field_name...");

and now only display the following message when at least one field value was published:

        if ($num_inserted) {
          $this->logger->notice("  Published " . number_format($num_inserted) . " items for field: $field_name...");
        }

If there are only a few records, the output will be the same as before. However, if enough records exist to trigger batching, a batch prefix is included. For example:

Finding candidate record IDs...
Batch 1 of 2, Step 1 of 6: Find matching records...
Batch 1 of 2, Step 2 of 6: Generate page titles...
Batch 1 of 2, Step 3 of 6: Find existing published entities...
Batch 1 of 2, Step 4 of 6: Publishing 1,000 new entities...
Batch 1 of 2, Step 5 of 6: Find IDs of entities...
Batch 1 of 2, Step 6 of 6: Add field items to published entities...
  Published 1,000 items for field: organism_abbreviation...
  Published 1 items for field: organism_comment...
  Published 3 items for field: organism_common_name...
  Published 1,462 items for field: organism_dbxref...
  Published 1,000 items for field: organism_genus...
  Published 95 items for field: organism_infraspecific_name...
  Published 1,000 items for field: organism_infraspecific_type...
  Published 1,000 items for field: organism_species...
Batch 2 of 2, Step 1 of 6: Find matching records...
Batch 2 of 2, Step 2 of 6: Generate page titles...
Batch 2 of 2, Step 3 of 6: Find existing published entities...
Batch 2 of 2, Step 4 of 6: Publishing 919 new entities...
Batch 2 of 2, Step 5 of 6: Find IDs of entities...
Batch 2 of 2, Step 6 of 6: Add field items to published entities...
  Published 919 items for field: organism_abbreviation...
  Published 4 items for field: organism_comment...
  Published 1 items for field: organism_common_name...
  Published 1,096 items for field: organism_dbxref...
  Published 919 items for field: organism_genus...
  Published 68 items for field: organism_infraspecific_name...
  Published 916 items for field: organism_infraspecific_type...
  Published 4 items for field: organism_pub...
  Published 919 items for field: organism_species...
Publish completed. Published 1,919 new entities, checked 0 existing entities, and added 10,407 field values.
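
A minimal sketch of how such a conditional prefix could be built, assuming the un-batched message has the form "Step N of M: ...". The function `step_message()` and its parameters are hypothetical and are not taken from TripalPublish.

```php
<?php
/**
 * Hypothetical helper: build a progress message, adding the
 * "Batch X of Y, " prefix only when more than one batch is needed.
 */
function step_message(string $text, int $step, int $total_steps, int $batch, int $total_batches): string {
  // With a single batch the output is left unprefixed.
  $prefix = ($total_batches > 1) ? "Batch $batch of $total_batches, " : '';
  return $prefix . "Step $step of $total_steps: $text...";
}

echo step_message('Find matching records', 1, 6, 1, 2), "\n";
echo step_message('Find matching records', 1, 6, 1, 1), "\n";
```

The first call prints the prefixed form seen in the log above; the second prints the message without a batch prefix.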

Testing?

To test, either migrate a site with thousands of records of a particular type or, for a fast test, temporarily change the batch size in tripal/src/Services/TripalPublish.php:

  /**
   * Specifies the maximum number of records to publish at one time.
   * This limits memory consumption if there are many thousands of
   * records, for example gene records in the feature table.
   * @todo We might want to add this as an option on the publish form.
   * 
   * @var integer $batch_size
   */
  private $batch_size = 1000;

laceysanderson

I did manual testing on this by using the GFF3 importer on the Tripalus databasica test gff3.

Publishing mRNA with any size of batch worked great.
Publishing gene failed, however, because the base table could not be found, and there was no check to make sure it was found...

Specifically, I created an organism and an analysis, then ran the GFF3 importer with the file /var/www/drupal/web/modules/contrib/tripal/tripal_chado/tests/fixtures/gff3_loader/TripalusDatabasicaChr1Genes.gff3 and ran the job. I also imported the "Genomic" content type collection. Then I changed the batch size to 5 and published mRNA. It ran without error and created beautiful pages. I then tried to publish Genes and it failed on the command line with an error regarding the base table. The failure for genes was not reproducible in another Docker container, so I'm not sure what the difference was...

That said, code review also turns up that we are using the Chado base table third-party setting directly, which makes TripalPublish only work for Chado 🙈 In PR #1991 we are looking at making a Chado-specific publish and creating a plugin infrastructure so each backend can have its own publish, but in the meantime we are keeping Tripal Publish and don't want to break it for other data backends (if it works for them... this hasn't actually been tested).

laceysanderson

Upon further thought, I do think we need to be careful not to assume Chado storage here... I think this small change takes out that assumption (I couldn't use a GitHub suggestion since it covers lines of code you hadn't originally changed):

diff --git a/tripal/src/Services/TripalPublish.php b/tripal/src/Services/TripalPublish.php
index 4d608c401..1c6c734cf 100644
--- a/tripal/src/Services/TripalPublish.php
+++ b/tripal/src/Services/TripalPublish.php
@@ -215,7 +215,6 @@ class TripalPublish {
       throw new \Exception(t($error_msg, ['%bundle' => $bundle]));
     }
     $this->entity_type = $entity_type;
-    $this->base_table = $entity_type->getThirdPartySetting('tripal', 'chado_base_table');
 
     // Get the storage plugin used to publish.
     /** @var \Drupal\tripal\TripalStorage\PluginManager\TripalStorageManager $storage_manager **/
@@ -228,6 +227,21 @@ class TripalPublish {
 
     $this->setFieldInfo();
 
+    // We need a way to get all the record ids for a bundle.
+    // If this is the chado storage backend then we do this using the chado table.
+    if ($datastore == 'chado_storage') {
+      $this->base_table = $entity_type->getThirdPartySetting('tripal', 'chado_base_table');
+    }
+    // But if this is not chado storage then the backend needs to provide the base
+    // table for a bundle.
+    else {
+      $this->base_table = $this->storage->getBaseTable($bundle);
+    }
+    if (empty($this->base_table)) {
+      $error_msg = 'Could not find the base table for the %bundle entity type.';
+      throw new \Exception(t($error_msg, ['%bundle' => $bundle]));
+    }
+
     // Get the required field properties that will uniquely identify an entity.
     // We only need to search on those properties.
     $this->required_types = $this->storage->getStoredTypes();

Essentially, the above now checks the base table in the init, where you set it:

- If the backend is chado_storage, then I use the same code you did.
- If not, then I expect the backend to have a method to retrieve the base table based on the bundle.
- I also throw an exception now if we are not able to determine the base table.

Whereas before you just grabbed the third-party setting and assumed it was present.

The rest of your code then remains untouched.

laceysanderson

✅ Code review now looks good 👍 There are faster ways to restrict the query to a batch that we can explore in the future, but for now this is a huge improvement.
✅ Manual testing of importing both genes and mRNA when there are fewer records than the batch size plus new records, more than the batch size plus new records, and the same two combinations when there are more than the batch size. Worked great in all these cases, and nice and fast too :-)

Thanks for all this work on publish @dsenalik! This is ready to merge!

first draft of publishing in batches

80c1882

dsenalik marked this pull request as draft September 27, 2024 23:09

dsenalik mentioned this pull request Sep 27, 2024

Allow publishing in batches #1910

Closed

dsenalik added 4 commits September 28, 2024 06:26

more code inside loop

4f62804

new vs. updated

6b70ba1

update test for absent infraspecies

dc1ca1a

now dependent on PR #1977

f566205

dsenalik mentioned this pull request Sep 28, 2024

Publish new vs. update existing entities #1959

Closed

laceysanderson mentioned this pull request Oct 3, 2024

TripalPublish: finding unpublished records may need a redesign #1986

Closed

dsenalik marked this pull request as ready for review October 4, 2024 16:46

laceysanderson self-requested a review October 4, 2024 17:10

laceysanderson and others added 6 commits October 5, 2024 17:10

Merge branch '4.x' into tv4g1-issue1910-partial-publish

a1211e6

Merge branch '4.x' into tv4g1-issue1910-partial-publish

02d0cdc

Merge branch '4.x' into tv4g1-issue1910-partial-publish

abc121b

skip empty batches

31adbdd

remove log file

555c245

Merge branch '4.x' into tv4g1-issue1910-partial-publish

b68578e

laceysanderson reviewed Oct 8, 2024

laceysanderson requested changes Oct 8, 2024

dsenalik added 2 commits October 8, 2024 18:44

base table change from Lacey's code review

07bc06c

Merge branch 'tv4g1-issue1910-partial-publish' of github.com:/tripal/…

b5cac3b

…tripal into tv4g1-issue1910-partial-publish

laceysanderson approved these changes Oct 9, 2024

laceysanderson merged commit 3140f35 into 4.x Oct 9, 2024
15 checks passed

dsenalik deleted the tv4g1-issue1910-partial-publish branch October 9, 2024 19:15

laceysanderson mentioned this pull request Oct 3, 2025

Create the 4.0.0-alpha3 release. #2307

Closed
