
Conversation

@dsenalik
Contributor

@dsenalik dsenalik commented Sep 27, 2024

New Feature

Closes #1910

Closes #1959

Depends on #1977 (merged)

Tripal Version: 4.x

Description

Suppose you want to migrate your Tripal 3 site with several genomes, each having tens of thousands of "gene", "mRNA", etc. records. Right now, publish may fail when it runs out of memory.
This PR implements publishing in batches of at most 1000 records at a time.
We might want to put the actual batch size on the publish form, but for now it is hard-coded.
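As a rough illustration of the batching idea only (not the code in this PR), the sketch below splits a list of candidate record IDs into batches of at most 1000 using PHP's `array_chunk()`. The helper name `split_into_batches()` and the sample IDs are hypothetical.

```php
<?php
/**
 * Splits candidate record IDs into batches of at most $batch_size.
 *
 * Hypothetical helper for illustration; TripalPublish may do this differently.
 */
function split_into_batches(array $record_ids, int $batch_size = 1000): array {
  if ($batch_size < 1) {
    throw new \InvalidArgumentException('Batch size must be at least 1.');
  }
  // array_chunk() returns consecutive slices; the last one may be shorter.
  return array_chunk($record_ids, $batch_size);
}

// Made-up record IDs: 1,919 of them, so two batches are expected.
$record_ids = range(1, 1919);
$batches = split_into_batches($record_ids);

foreach ($batches as $i => $batch) {
  // Each batch would be published on its own, so only one batch worth of
  // records needs to be held in memory at a time.
  printf("Batch %d of %d: %s records\n", $i + 1, count($batches), number_format(count($batch)));
}
```

With 1,919 IDs this yields one batch of 1,000 and one of 919.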

As for log messages, to be more succinct I removed this per-field message:

        $this->logger->notice("  Checking for published items for the field: $field_name...");

and now display the other message only when at least one field value was published:

        if ($num_inserted) {
          $this->logger->notice("  Published " . number_format($num_inserted) . " items for field: $field_name...");
        }

If there are only a few records, the output will be the same as before. However, if enough records exist to trigger batching, a batch prefix is included. For example:

Finding candidate record IDs...
Batch 1 of 2, Step 1 of 6: Find matching records...
Batch 1 of 2, Step 2 of 6: Generate page titles...
Batch 1 of 2, Step 3 of 6: Find existing published entities...
Batch 1 of 2, Step 4 of 6: Publishing 1,000 new entities...
Batch 1 of 2, Step 5 of 6: Find IDs of entities...
Batch 1 of 2, Step 6 of 6: Add field items to published entities...
  Published 1,000 items for field: organism_abbreviation...
  Published 1 items for field: organism_comment...
  Published 3 items for field: organism_common_name...
  Published 1,462 items for field: organism_dbxref...
  Published 1,000 items for field: organism_genus...
  Published 95 items for field: organism_infraspecific_name...
  Published 1,000 items for field: organism_infraspecific_type...
  Published 1,000 items for field: organism_species...
Batch 2 of 2, Step 1 of 6: Find matching records...
Batch 2 of 2, Step 2 of 6: Generate page titles...
Batch 2 of 2, Step 3 of 6: Find existing published entities...
Batch 2 of 2, Step 4 of 6: Publishing 919 new entities...
Batch 2 of 2, Step 5 of 6: Find IDs of entities...
Batch 2 of 2, Step 6 of 6: Add field items to published entities...
  Published 919 items for field: organism_abbreviation...
  Published 4 items for field: organism_comment...
  Published 1 items for field: organism_common_name...
  Published 1,096 items for field: organism_dbxref...
  Published 919 items for field: organism_genus...
  Published 68 items for field: organism_infraspecific_name...
  Published 916 items for field: organism_infraspecific_type...
  Published 4 items for field: organism_pub...
  Published 919 items for field: organism_species...
Publish completed. Published 1,919 new entities, checked 0 existing entities, and added 10,407 field values.
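One way to get the conditional prefix shown in the log above is sketched here. It assumes the unbatched message begins at "Step"; the function name `progress_message()` and its parameters are hypothetical and not taken from TripalPublish.

```php
<?php
/**
 * Builds a progress message, adding a batch prefix only when batching is used.
 *
 * Hypothetical helper for illustration only.
 */
function progress_message(int $batch, int $total_batches, int $step, int $total_steps, string $text): string {
  // With a single batch the prefix is omitted, so output matches the old format.
  $prefix = ($total_batches > 1) ? "Batch $batch of $total_batches, " : '';
  return $prefix . "Step $step of $total_steps: $text...";
}

echo progress_message(1, 2, 4, 6, 'Publishing ' . number_format(1000) . ' new entities'), "\n";
echo progress_message(1, 1, 1, 6, 'Find matching records'), "\n";
```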

Testing?

Either migrate a site with thousands of records of a particular type or, for a fast test, temporarily change the batch size in tripal/src/Services/TripalPublish.php:

  /**
   * Specifies the maximum number of records to publish at one time.
   * This limits memory consumption if there are many thousands of
   * records, for example gene records in the feature table.
   * @todo We might want to add this as an option on the publish form.
   * 
   * @var integer $batch_size
   */
  private $batch_size = 1000;

@dsenalik added the labels "Group 1 - Tripal Content Types | Terms | Fields", "Group 10 - Performance", and "Priority - Medium" on Sep 27, 2024
@dsenalik dsenalik marked this pull request as draft September 27, 2024 23:09
Member

@laceysanderson laceysanderson left a comment

I did manual testing on this by using the GFF3 importer on the Tripalus databasica test gff3.

Publishing mRNA with any size of batch worked great.
Publishing gene failed, however, because the base table could not be found, and there was no check to make sure it was found...

Specifically, I created an organism and an analysis, then ran the GFF3 importer with the file /var/www/drupal/web/modules/contrib/tripal/tripal_chado/tests/fixtures/gff3_loader/TripalusDatabasicaChr1Genes.gff3 and ran the job. I also imported the "Genomic" content type collection. Then I changed the batch size to 5 and published mRNA. It ran without error and created beautiful pages. I then tried to publish Genes and it failed on the command line with an error regarding the base table. The failure for genes was not reproducible in another docker, so I'm not sure what the difference was...

That said, code review also turns up that we are using the chado base table third-party setting directly, which makes TripalPublish only work for chado 🙈 In PR #1991 we are looking at making a chado-specific publish and creating a plugin infrastructure so each backend can have its own publish, but in the meantime we are keeping Tripal Publish and don't want to break it for other data backends (if it works for them... this hasn't actually been tested).

Member

@laceysanderson laceysanderson left a comment

Upon further thought, I do think we need to be careful not to assume chado storage here... I think this small change removes that assumption (I couldn't use a GitHub suggestion since it covers lines of code you hadn't originally changed):

diff --git a/tripal/src/Services/TripalPublish.php b/tripal/src/Services/TripalPublish.php
index 4d608c401..1c6c734cf 100644
--- a/tripal/src/Services/TripalPublish.php
+++ b/tripal/src/Services/TripalPublish.php
@@ -215,7 +215,6 @@ class TripalPublish {
       throw new \Exception(t($error_msg, ['%bundle' => $bundle]));
     }
     $this->entity_type = $entity_type;
-    $this->base_table = $entity_type->getThirdPartySetting('tripal', 'chado_base_table');
 
     // Get the storage plugin used to publish.
     /** @var \Drupal\tripal\TripalStorage\PluginManager\TripalStorageManager $storage_manager **/
@@ -228,6 +227,21 @@ class TripalPublish {
 
     $this->setFieldInfo();
 
+    // We need a way to get all the record ids for a bundle.
+    // If this is the chado storage backend then we do this using the chado table.
+    if ($datastore == 'chado_storage') {
+      $this->base_table = $entity_type->getThirdPartySetting('tripal', 'chado_base_table');
+    }
+    // But if this is not chado storage then the backend needs to provide the base
+    // table for a bundle.
+    else {
+      $this->base_table = $this->storage->getBaseTable($bundle);
+    }
+    if (empty($this->base_table)) {
+      $error_msg = 'Could not find the base table for the %bundle entity type.';
+      throw new \Exception(t($error_msg, ['%bundle' => $bundle]));
+    }
+
     // Get the required field properties that will uniquely identify an entity.
     // We only need to search on those properties.
     $this->required_types = $this->storage->getStoredTypes();

Essentially, the above now checks the following in the init, where you set the base table:

  1. if the backend is chado_storage then I use the same code you did
  2. if not then I expect the backend to have a method to retrieve the base table based on the bundle
  3. I also throw an exception now if we are not able to determine the base table.

Whereas before you just grabbed the third party setting and assumed it was present.

The rest of your code then remains untouched.
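The same three-step logic can be tried outside Drupal with the self-contained sketch below. Everything in it other than the `'chado_storage'` identifier and the `getBaseTable()` method name from the diff is a stand-in: the `BaseTableProvider` interface, the `resolve_base_table()` function, and the bundle-to-table map are hypothetical, and no existing storage backend is confirmed here to implement `getBaseTable()`.

```php
<?php
// Hypothetical interface: the diff above assumes that non-chado storage
// backends expose a getBaseTable() method.
interface BaseTableProvider {
  public function getBaseTable(string $bundle): ?string;
}

/**
 * Resolves the base table for a bundle, or throws if none can be found.
 */
function resolve_base_table(string $datastore, ?string $chado_setting, BaseTableProvider $storage, string $bundle): string {
  // 1. Chado storage: use the third-party setting value passed in.
  // 2. Any other backend: ask the backend itself.
  $base_table = ($datastore === 'chado_storage')
    ? $chado_setting
    : $storage->getBaseTable($bundle);

  // 3. Fail loudly instead of continuing with an empty table name.
  if (empty($base_table)) {
    throw new \RuntimeException("Could not find the base table for the $bundle entity type.");
  }
  return $base_table;
}

// Stand-in backend with a made-up bundle-to-table map.
$backend = new class implements BaseTableProvider {
  public function getBaseTable(string $bundle): ?string {
    $map = ['gene' => 'feature'];
    return $map[$bundle] ?? NULL;
  }
};

echo resolve_base_table('chado_storage', 'feature', $backend, 'gene'), "\n";

try {
  resolve_base_table('other_storage', NULL, $backend, 'mrna');
}
catch (\RuntimeException $e) {
  echo $e->getMessage(), "\n";
}
```

The point of step 3 is that a missing base table surfaces at init time rather than as a harder-to-diagnose failure later in publishing.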

Member

@laceysanderson laceysanderson left a comment

✅ Code review now looks good 👍 There are faster ways to restrict the query to a batch that we can explore in the future, but for now this is a huge improvement.
✅ Manual testing of importing both genes and mRNA when there are fewer than the amount in a batch + new records, greater than the amount and new records, and the same two combinations when there are more than the batch size. Worked great in all these cases and nice and fast too :-)

Thanks for all this work on publish @dsenalik! This is ready to merge!

@laceysanderson laceysanderson merged commit 3140f35 into 4.x Oct 9, 2024
15 checks passed
@dsenalik dsenalik deleted the tv4g1-issue1910-partial-publish branch October 9, 2024 19:15

Labels

Group 1 - Tripal Content Types | Terms | Fields: Any issue relating to Tripal Content including types, terms, and fields.
Group 10 - Performance: Any issue related to performance concerns and/or ideas for improvement.
Priority - Medium: Any issue/PR which has a minor impact on system usability or is not often encountered.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Publish new vs. update existing entities
Allow publishing in batches

2 participants