
Conversation

@tysonlt
Contributor

@tysonlt tysonlt commented Jul 13, 2021

References

Description

Add support for a 'relationships' manifest file in the Simple Archive Format. This allows items to be linked to new or existing entities during import. Items can be linked to other items in this import set by referring to their import subfolder name.

Each line in the file contains a relationship type key and an item identifier in the following format:

relation.<relation_key> <handle|uuid|folderName:import_item_folder|schema.element[.qualifier]:value>

The import_item_folder should refer to the folder name of another item in this import batch. Example:

relation.isAuthorOfPublication 5dace143-1238-4b4f-affb-ed559f9254bb
relation.isAuthorOfPublication 123456789/1123
relation.isOrgUnitOfPublication folderName:item_001
relation.isProjectOfPublication project.identifier.id:123
relation.isProjectOfPublication project.identifier.name:A Name with Spaces
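The line format above can be illustrated with a small parser. This is only a sketch under stated assumptions, not the code in this PR: the class `RelationshipLine`, its `TargetKind` enum, and the rule that any target without a colon that is not a UUID is treated as a handle are all hypothetical choices made for the example.

```java
import java.util.regex.Pattern;

/** Hypothetical value object for one parsed line of a 'relationships' manifest. */
final class RelationshipLine {
    enum TargetKind { UUID, HANDLE, FOLDER_NAME, METADATA }

    private static final String KEY_PREFIX = "relation.";
    private static final String FOLDER_PREFIX = "folderName:";
    private static final Pattern UUID_PATTERN = Pattern.compile(
        "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");

    final String relationKey;  // e.g. "isAuthorOfPublication"
    final TargetKind kind;     // how the target item is identified
    final String field;        // metadata field for METADATA targets, otherwise null
    final String value;        // UUID, handle, folder name, or metadata value

    private RelationshipLine(String relationKey, TargetKind kind, String field, String value) {
        this.relationKey = relationKey;
        this.kind = kind;
        this.field = field;
        this.value = value;
    }

    /** Splits on the first space only, so metadata values may contain spaces. */
    static RelationshipLine parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (!trimmed.startsWith(KEY_PREFIX) || space < 0) {
            throw new IllegalArgumentException("Malformed relationship line: " + line);
        }
        String key = trimmed.substring(KEY_PREFIX.length(), space);
        String target = trimmed.substring(space + 1).trim();
        if (key.isEmpty() || target.isEmpty()) {
            throw new IllegalArgumentException("Malformed relationship line: " + line);
        }
        if (target.startsWith(FOLDER_PREFIX)) {
            String folder = target.substring(FOLDER_PREFIX.length());
            if (folder.isEmpty()) {
                throw new IllegalArgumentException("Empty folder name: " + line);
            }
            return new RelationshipLine(key, TargetKind.FOLDER_NAME, null, folder);
        }
        if (UUID_PATTERN.matcher(target).matches()) {
            return new RelationshipLine(key, TargetKind.UUID, null, target);
        }
        int colon = target.indexOf(':');
        if (colon > 0) {
            // schema.element[.qualifier]:value
            return new RelationshipLine(key, TargetKind.METADATA,
                target.substring(0, colon), target.substring(colon + 1));
        }
        // Assumption for this sketch: anything else is a handle such as 123456789/1123.
        return new RelationshipLine(key, TargetKind.HANDLE, null, target);
    }
}
```

With this sketch, `RelationshipLine.parse("relation.isProjectOfPublication project.identifier.name:A Name with Spaces")` yields a `METADATA` target whose field is `project.identifier.name` and whose value keeps its spaces.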

During initial import, new items are stored in a map keyed by the item folder name. Once the initial import is complete, a second pass checks for a 'relationships' manifest file in each folder and creates a relationship of the specified type to the specified item.
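The reason for the second pass can be shown with a minimal in-memory sketch. It is not the importer's actual code: `TwoPassImportSketch`, `Link`, `registerItem`, and `linkFolder` are hypothetical names, item identifiers are plain strings, and only `folderName:` targets are resolved (any other target is passed through unchanged).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical two-pass linker: register all items first, then resolve manifests. */
final class TwoPassImportSketch {

    /** Hypothetical link record: source item, relationship key, target item. */
    static final class Link {
        final String sourceId;
        final String relationKey;
        final String targetId;

        Link(String sourceId, String relationKey, String targetId) {
            this.sourceId = sourceId;
            this.relationKey = relationKey;
            this.targetId = targetId;
        }
    }

    // Map of newly created items, keyed by import folder name.
    private final Map<String, String> itemIdByFolder = new LinkedHashMap<>();

    /** Pass 1: remember each created item under its import folder name. */
    void registerItem(String folderName, String itemId) {
        itemIdByFolder.put(folderName, itemId);
    }

    /** Pass 2: turn the manifest lines of one folder into resolved links. */
    List<Link> linkFolder(String folderName, List<String> manifestLines) {
        String sourceId = itemIdByFolder.get(folderName);
        if (sourceId == null) {
            throw new IllegalStateException("Folder was not imported: " + folderName);
        }
        List<Link> links = new ArrayList<>();
        for (String line : manifestLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate blank lines in the manifest
            }
            int space = trimmed.indexOf(' ');
            if (space < 0) {
                throw new IllegalArgumentException("Malformed line: " + line);
            }
            String key = trimmed.substring(0, space);
            String target = trimmed.substring(space + 1).trim();
            String targetId = target; // handle, UUID, and metadata lookups are out of scope here
            if (target.startsWith("folderName:")) {
                String folder = target.substring("folderName:".length());
                targetId = itemIdByFolder.get(folder);
                if (targetId == null) {
                    throw new IllegalArgumentException("Unknown import folder: " + folder);
                }
            }
            links.add(new Link(sourceId, key, targetId));
        }
        return links;
    }
}
```

Because every folder is registered before any manifest is read, a folder that sorts first can still refer to one that is created later in the same batch. The same structure also shows why a resumed import is fragile: folders skipped in pass 1 never reach the map.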

NOTE: this is only implemented for 'add' mode at this stage.

Instructions for Reviewers

  • Create a Simple Archive Format import, with entities of various types (I used Publication, Person, and Org Unit)
  • Create appropriate relationship manifest files in each import subfolder, referencing the folder name of the item to link to, or handle/UUID of existing item.
  • Add a "-l" or "--relationships" flag to the import command line (disabled by default). Runs automatically.

Checklist

I have tested this on my own import dataset, but haven't created any unit/integration tests as I was unable to find any for the import utility.

  • My PR is small in size (e.g. less than 1,000 lines of code, not including comments & integration tests). Exceptions may be made if previously agreed upon.
  • My PR passes Checkstyle validation based on the Code Style Guide.
  • My PR includes Javadoc for all new (or modified) public methods and classes. It also includes Javadoc for large or complex private methods.
  • My PR passes all tests and includes new/updated Unit or Integration Tests based on the Code Testing Guide.
  • If my PR includes new, third-party dependencies (in any pom.xml), I've made sure their licenses align with the DSpace BSD License based on the Licensing of Contributions documentation.
  • If my PR modifies the REST API, I've linked to the REST Contract page (or open PR) related to this change.

@tysonlt
Contributor Author

tysonlt commented Jul 13, 2021

Currently a bit fragile with respect to resuming imports. It has to run after the main import to know how to look up items by import folder name, and the lookup map won't be complete for resumed imports. Could use the skipItems map to help with this.

@tdonohue
Member

Thanks @tysonlt for getting this started. I've linked this PR up to a placeholder ticket #2883 that we had for this exact feature (as you've discovered, it doesn't exist yet). Once you feel this is ready for review by others, I can also find a volunteer or two to give it a try on their end & provide feedback.

@tdonohue tdonohue changed the title from "Import relationships" to "Import relationships via SAF (Simple Archive Format)" Jul 13, 2021
@tdonohue tdonohue added the "component: configurable entities" (Related to Configurable Entities feature) and "tools: import" (Related to import of data into the system) labels Jul 13, 2021
@tysonlt
Contributor Author

tysonlt commented Jul 14, 2021

Updated to match MetadataImport behaviour:

  • Specify relationship by key (eg. relation.isAuthorOfPublication) instead of relationship database id
  • Prefix folder name with "folderName:" similar to "rowName:" in MetadataImport
  • Allow item lookup by unique metadata value in relationships manifest

NOTE: Added org.dspace.app.util.RelationshipUtils with method matchRelationshipType() copied from org.dspace.app.bulkedit.MetadataImport. This method could now be changed to defer to RelationshipUtils.

@tysonlt tysonlt marked this pull request as ready for review July 14, 2021 03:29
Member

@tdonohue tdonohue left a comment


@tysonlt : Thanks for this contribution! Apologies for the delay in review, but I finally had a chance to look at this today. I was able to successfully get this working and import a Person Entity with two Publications.

The manner in which Entities must be created is a tad complex in SAF, but it works. For other reviewers, here's the directory structure I had to create for testing. This creates two Publication Entities, both linked to the same Person Entity.

  • item_dir = root directory for all item directories
    • item_000 = First Publication
      • contents (created an empty file, but you can add actual file references)
      • dublin_core.xml - specify main dc fields like 'dc.title', dates, etc
      • metadata_dspace.xml - add a "dspace.entity.type" field with value of "Publication"
      • relationships - add one line... relation.isAuthorOfPublication folderName:item_001
    • item_001 = Person
      • contents (created an empty file, but you can add actual file references)
      • dublin_core.xml - specify main dc fields (if any)
      • metadata_dspace.xml - add a "dspace.entity.type" field with value of "Person"
      • metadata_person.xml - add person.familyName and person.givenName.
      • (No "relationships" file... as I chose to link Publications back to this Person)
    • item_002 = Second Publication
      • contents (created an empty file, but you can add actual file references)
      • dublin_core.xml - specify main dc fields like 'dc.title', dates, etc
      • metadata_dspace.xml - add a "dspace.entity.type" field with value of "Publication"
      • relationships - add one line... relation.isAuthorOfPublication folderName:item_001
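For reviewers who would rather generate a test layout like this than create it by hand, here is a rough sketch. Everything in it is an assumption made for illustration: the class `SafFixtureSketch`, its method names, the titles, and the three folder names are hypothetical; titles are written without XML escaping; and the Person's metadata_person.xml file is left out for brevity.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that writes a tiny SAF batch with a 'relationships' manifest. */
final class SafFixtureSketch {

    /** Writes one item folder; pass null as relationshipLine to skip the manifest. */
    static Path writeItem(Path root, String folder, String entityType,
                          String title, String relationshipLine) {
        try {
            Path dir = Files.createDirectories(root.resolve(folder));
            // Empty 'contents' file: no bitstreams are attached in this sketch.
            write(dir.resolve("contents"), "");
            write(dir.resolve("dublin_core.xml"),
                "<dublin_core>\n"
                + "  <dcvalue element=\"title\" qualifier=\"none\">" + title + "</dcvalue>\n"
                + "</dublin_core>\n");
            write(dir.resolve("metadata_dspace.xml"),
                "<dublin_core schema=\"dspace\">\n"
                + "  <dcvalue element=\"entity\" qualifier=\"type\">" + entityType + "</dcvalue>\n"
                + "</dublin_core>\n");
            if (relationshipLine != null) {
                write(dir.resolve("relationships"), relationshipLine + "\n");
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Two Publications that both point at the same Person via folderName. */
    static void writeExample(Path root) {
        String link = "relation.isAuthorOfPublication folderName:item_001";
        writeItem(root, "item_000", "Publication", "First Publication", link);
        writeItem(root, "item_001", "Person", "Example Person", null);
        writeItem(root, "item_002", "Publication", "Second Publication", link);
    }

    private static void write(Path file, String content) throws IOException {
        Files.write(file, content.getBytes(StandardCharsets.UTF_8));
    }
}
```

Calling `SafFixtureSketch.writeExample(Paths.get("item_dir"))` would create the three folders under `item_dir`, which can then be passed to the import command with `-s`.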

After creating that structure, I was able to run the SAF script using "add" mode and the new "-l" ("--relationships") flag like this...

./dspace import -a -s ~/item_dir/ -e [user-email] -c [collection-handle] -m ~/mapfile.txt

Overall, I approve of this PR. I do have a few minor requests inline below. I'd also appreciate @benbosman 's feedback on the approach you took. The new relationships file approach seems reasonable to me, but I'd like a second set of eyes.

Thanks overall & hopefully we can get this moved forward quickly!

@tdonohue tdonohue added this to the 7.1 milestone Aug 24, 2021
@tdonohue tdonohue self-requested a review August 26, 2021 15:03
@tdonohue
Member

tdonohue commented Sep 1, 2021

@tysonlt : This is failing unit tests as it has minor checkstyle violations. It looks like lines 22 and 27 in the RelationshipUtils.java class both have trailing whitespace (which is not allowed). Once you fix those, this should pass.

Member

@tdonohue tdonohue left a comment


@tysonlt : See my comment above. This looks great overall & I think it's nearly ready. It just has small Checkstyle issues that need fixing so that the unit tests can pass. Thanks!

@tysonlt
Contributor Author

tysonlt commented Sep 3, 2021

Thanks @tdonohue, I've committed those changes now. I've also updated MetadataImport.java to use RelationshipUtils.java, since that's where I got that code from.

@tdonohue tdonohue self-requested a review September 3, 2021 14:19
Member

@tdonohue tdonohue left a comment


@tysonlt : Looks good now! I retested this and everything is still working perfectly. I only see one more small improvement we may want to make...I think we should verify that, when folderName is used, a user cannot select a folder that is outside of the import directory. So, they couldn't do something like ../../../some/other/folder/on/filesystem. It doesn't look to me like it's "insecure" as-is, but likely better safe than sorry. See my small suggestion inline below.

Beyond that, I'm +1 this change. Thanks again for your hard work on this!
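The folder check requested above could look roughly like the following. This is a sketch of the general technique (normalize, then test containment), not the inline suggestion from the review; `FolderNameGuard` and `resolveInside` are hypothetical names, and symbolic links are not followed or checked here.

```java
import java.nio.file.Path;

/** Hypothetical guard that keeps a folderName reference inside the import directory. */
final class FolderNameGuard {

    /**
     * Resolves folderName against importRoot and rejects anything that ends up
     * outside of it (for example "../../some/other/folder" or an absolute path).
     */
    static Path resolveInside(Path importRoot, String folderName) {
        Path root = importRoot.toAbsolutePath().normalize();
        // normalize() collapses "." and ".." segments before the containment test.
        Path candidate = root.resolve(folderName).normalize();
        if (candidate.equals(root) || !candidate.startsWith(root)) {
            throw new IllegalArgumentException(
                "Folder name must stay inside the import directory: " + folderName);
        }
        return candidate;
    }
}
```

`Path.startsWith` compares whole path elements, so a sibling directory such as `item_dir_other` is not mistaken for a child of `item_dir`, which a plain string prefix comparison would get wrong.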

Member

@benbosman benbosman left a comment


Thanks for this PR,
I've done a partial initial review.

A few things I noticed here:

  • The solution here assumes a plain-text relationships file. If you export an item with relationships, you'll notice the metadata_relation.xml file contains the current relationships. It may make more sense to use that, in order to support future updates using the SAF.
  • The customizations here don't support the --test parameter. That feature is currently broken because the solution assumes the items are really created
  • The --collection with a UUID is broken, but I noticed you didn't change that, so it may be broken in main

@benbosman
Member

Thanks for the update @tysonlt

I think the current relationships file can also work, but in order to support updates with the file, the export should eventually support the relationships file as well.

@tdonohue what do you think about this:

  • Would it be best to work with the existing xml files, and parse it as if it's metadata (similar to the CSV solution)
  • Or would it be best to use the relationships file, which can in a future PR also support updates if the SAF export also supports it

I think both solutions can also be extended to support mixing entities and plain text in a future PR.

@tdonohue
Member

@benbosman and @tysonlt : In my opinion, this approach of having a plain text relationships file for creation of new relationships is reasonable. I realize it's a bit odd that we have a metadata_relation.xml on export (because relationships are mapped to relation.* fields). But, I don't see an easy way to reuse the metadata_relation.xml file for re-import/modification, as it'd require us to process that metadata_relation.xml file differently from all the other metadata_*.xml files.

So, in my opinion, we may want to consider treating the metadata_relation.xml as "simple, flat metadata" (just like every other metadata_*.xml file), while the relationships file is used to add/update/modify relationships. We'll obviously have to document that clearly though, as I do see that it could get confusing.

Does that sound reasonable to both of you, or am I missing a clear reason why the metadata_relation.xml file would be a better approach here?

@tysonlt
Contributor Author

tysonlt commented Sep 20, 2021

@tdonohue @benbosman I also would prefer to keep the xml as flat, static mappings to literal values, with no pre-processing just on that file, and not the others. The relationships manifest solves the very specific problem of not knowing what to put in the rel*.xml file, as the target item doesn't have a UUID yet. In this way it is similar to the other manifest files, like the collections or handle files, in that it provides instructions on how to process the item, rather than providing metadata directly.

@tdonohue
Member

tdonohue commented Sep 30, 2021

@tysonlt : I talked with @benbosman about this today, and we both agree that your current approach of using the relationships text file seems good enough. We just may want to improve upon it later (in a future PR) by also ensuring the export process can export relationships into that same relationships text file, so that it's easier to export & reimport from the same SAF directory.

All that means this PR can move forward as-is...we'd just need you to finish addressing any outstanding feedback, mainly the other comments from @benbosman in his review: #3322 (review) Then @benbosman and I can give it a final review & test and hopefully get this merged soon. Thanks!

@tdonohue tdonohue requested review from benbosman and tdonohue October 4, 2021 14:33
@tysonlt
Contributor Author

tysonlt commented Oct 5, 2021

I'm fixing those checkstyle errors, but I'm occasionally seeing this exception when creating relationships:

java.lang.IllegalArgumentException: Comparison method violates its general contract!

It is coming from org.dspace.content.DSpaceObjectServiceImpl.update(DSpaceObjectServiceImpl.java:623).

It looks like that comparator function has a missing edge case, but I can't see it!
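For context, this exception is raised by the JDK's sort implementation when it detects that a comparator gives inconsistent answers. The sketch below is a generic illustration of one common cause (asymmetric null handling), not the comparator in DSpaceObjectServiceImpl, whose actual defect is not identified in this thread; `ComparatorContractSketch` and its members are hypothetical.

```java
import java.util.Comparator;

/** Hypothetical illustration of a comparator that breaks the Comparator contract. */
final class ComparatorContractSketch {

    // Broken: if either side is null it answers "less than", so
    // compare(null, 1) and compare(1, null) are both negative.
    static final Comparator<Integer> BROKEN = (a, b) -> {
        if (a == null || b == null) {
            return -1;
        }
        return Integer.compare(a, b);
    };

    // Fixed: nulls sort first, and swapping the arguments flips the sign.
    static final Comparator<Integer> FIXED = (a, b) -> {
        if (a == null && b == null) {
            return 0;
        }
        if (a == null) {
            return -1;
        }
        if (b == null) {
            return 1;
        }
        return Integer.compare(a, b);
    };

    /** Checks sgn(compare(a, b)) == -sgn(compare(b, a)) for one pair of values. */
    static boolean isAntisymmetric(Comparator<Integer> c, Integer a, Integer b) {
        return Integer.signum(c.compare(a, b)) == -Integer.signum(c.compare(b, a));
    }
}
```

A pairwise check like `isAntisymmetric` over representative values, including nulls and equal values, is one inexpensive way to hunt for the missing edge case. Because the sort only notices the inconsistency for some input orders, the exception tends to appear intermittently, as described above.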

Member

@benbosman benbosman left a comment


Thanks @tysonlt

This is looking good to me.
I did notice that incorrect data in the relationships is not noticed with the --test parameter, but that can probably be handled in a future PR

I would want some automated test to prove it keeps working though.
This can be similar to the ITs for the CSV import with relationships

@tdonohue
Member

tdonohue commented Oct 6, 2021

@benbosman : While I understand the desire for ITs/tests in this area, unfortunately we don't yet have test infrastructure (at all) for Simple Archive Format imports/exports. This is a definite flaw (and should be fixed), but I don't think we should require this small PR to add/create that missing infrastructure.

So, my recommendation here would be that, if we want to see this in 7.1, we should accept it as-is (i.e. with no automated tests) & log a ticket about the lack of ITs for SAF import/export. Otherwise, I don't see any way this code will make it into 7.1....building this infrastructure for SAF testing is unfortunately not a small task.

Member

@benbosman benbosman left a comment


Thanks @tdonohue for that update

That would indeed increase the scope of the task too much now, but it should be fixed at a later time

Member

@tdonohue tdonohue left a comment


👍 Re-tested and this works well. I also tested that the -c command-line option now accepts a Collection's UUID (it works too). Thanks @tysonlt for your hard work on this new feature!

@tdonohue
Member

tdonohue commented Oct 6, 2021

@tysonlt : This is at +2, so I'm merging this for release in 7.1. I also just copied over your basic instructions into our Documentation at: https://wiki.lyrasis.org/display/DSDOC7x/Importing+and+Exporting+Items+via+Simple+Archive+Format#ImportingandExportingItemsviaSimpleArchiveFormat-relationshipsfile

Could you double check the docs look good to you as well? Ping me on Slack if you'd like access to edit those docs (you'd need to setup a wiki account)

@tdonohue
Member

tdonohue commented Oct 6, 2021

After merging, I've created two tickets based on discussions in this PR:

@tysonlt
Contributor Author

tysonlt commented Oct 7, 2021

Great, thanks @tdonohue!

One small change for the wiki: "Entities can be linked to other entitiess in this import set..."

@tysonlt
Contributor Author

tysonlt commented Oct 7, 2021

I also wonder whether it's worth mentioning the following:

"If you already know the UUID of an existing item that you want to relate to, you can create a metadata_dspace.xml file and specify the relationships there. The relationships file is primarily for linking to items in the same import batch, although of course you can specify relationships to existing items using this file instead of metadata_dspace.xml if you prefer."

Regarding exporting a relationships file during export, I'm not sure of the utility of this. Since by definition all the related entities will already exist, it will export a metadata_dspace.xml file to model those relationships. In this case the relationships file is redundant.

If the relationships file became the primary or preferred/recommended way to specify relationships, then yes it should include it in the export.

@tysonlt tysonlt deleted the import_relationships branch October 7, 2021 01:18
@tdonohue
Member

tdonohue commented Oct 7, 2021

@tysonlt : Good point in both comments. I fixed the typo. I also added a yellow note to describe your last comment (which I believe was about the metadata_relation.xml and not metadata_dspace.xml). Please give it a look...it's at the bottom of this section now: https://wiki.lyrasis.org/display/DSDOC7x/Importing+and+Exporting+Items+via+Simple+Archive+Format#ImportingandExportingItemsviaSimpleArchiveFormat-relationshipsfile

@tysonlt
Contributor Author

tysonlt commented Oct 8, 2021

Looks great, I think that’s pretty clear and comprehensive. Only one small change at the end of the first notice box: “if you with to”

@tdonohue
Member

tdonohue commented Oct 8, 2021

@tysonlt : Thanks for catching all my silly typos. 🥇 Fixed!


Labels

component: configurable entities (Related to Configurable Entities feature)
tools: import (Related to import of data into the system)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Creating new Relationships/Entities via SAF Import

3 participants