Import relationships via SAF (Simple Archive Format) #3322
|
Currently a bit fragile with respect to resuming imports. It has to run after the main import to know how to look up items by import folder name, and the lookup map won't be complete for resumed imports. Could use the skipItems map to help with this. |
|
Thanks @tysonlt for getting this started. I've linked this PR up to a placeholder ticket #2883 that we had for this exact feature (as you've discovered it doesn't exist yet). Once you feel this is ready for review by others, I can also find a volunteer or two to give it a try on their end & provide feedback. |
|
Updated to match MetadataImport behaviour:
NOTE: Added org.dspace.app.util.RelationshipUtils with method matchRelationshipType() copied from org.dspace.app.bulkedit.MetadataImport. This method could now be changed to defer to RelationshipUtils. |
@tysonlt : Thanks for this contribution! Apologies for the delay in review, but I finally had a chance to look at this today. I was able to successfully get this working and import a Person Entity with two Publications.
The manner in which Entities must be created is a tad complex in SAF, but it works. For other reviewers, here's the directory structure I had to create for testing. This creates two Publication Entities, both linked to the same Person Entity.
- item_dir = root directory for all item directories
  - item_000 = First Publication
    - contents (created an empty file, but you can add actual file references)
    - dublin_core.xml - specify main dc fields like 'dc.title', dates, etc
    - metadata_dspace.xml - add a "dspace.entity.type" field with value of "Publication"
    - relationships - add one line... relation.isAuthorOfPublication folderName:item_001
  - item_001 = Person
    - contents (created an empty file, but you can add actual file references)
    - dublin_core.xml - specify main dc fields (if any)
    - metadata_dspace.xml - add a "dspace.entity.type" field with value of "Person"
    - metadata_person.xml - add person.familyName and person.givenName.
    - (No "relationships" file... as I chose to link Publications back to this Person)
  - item_0002 = Second Publication
    - contents (created an empty file, but you can add actual file references)
    - dublin_core.xml - specify main dc fields like 'dc.title', dates, etc
    - metadata_dspace.xml - add a "dspace.entity.type" field with value of "Publication"
    - relationships - add one line... relation.isAuthorOfPublication folderName:item_001
After creating that structure, I was able to run the SAF script using "add" mode and the new "-l" ("--relationships") flag like this...
./dspace import -a -s ~/item_dir/ -e [user-email] -c [collection-handle] -m ~/mapfile.txt
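For anyone who wants to script that test layout rather than create it by hand, here is a minimal Java sketch. It is an illustration only and not part of this PR or of DSpace: SafFixtureBuilder and writeItem are hypothetical names, the XML bodies assume the usual SAF dcvalue layout, values are not XML-escaped, and metadata_person.xml is left out for brevity.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical test helper (not part of DSpace): writes SAF item folders like those listed above. */
final class SafFixtureBuilder {

    /**
     * Writes one item folder under root. Pass null as relationshipLine to skip
     * the "relationships" file. Values are not XML-escaped in this sketch.
     */
    static Path writeItem(Path root, String folder, String entityType,
                          String title, String relationshipLine) {
        try {
            Path dir = Files.createDirectories(root.resolve(folder));
            // Empty "contents" file, i.e. no bitstreams for this item.
            Files.write(dir.resolve("contents"), new byte[0]);
            writeText(dir.resolve("dublin_core.xml"),
                "<dublin_core>\n"
                + "  <dcvalue element=\"title\" qualifier=\"none\">" + title + "</dcvalue>\n"
                + "</dublin_core>\n");
            writeText(dir.resolve("metadata_dspace.xml"),
                "<dublin_core schema=\"dspace\">\n"
                + "  <dcvalue element=\"entity\" qualifier=\"type\">" + entityType + "</dcvalue>\n"
                + "</dublin_core>\n");
            if (relationshipLine != null) {
                // One "relation.<key> <identifier>" line per relationship.
                writeText(dir.resolve("relationships"), relationshipLine + "\n");
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void writeText(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

Calling writeItem(root, "item_000", "Publication", "First Publication", "relation.isAuthorOfPublication folderName:item_001") followed by writeItem(root, "item_001", "Person", "A Person", null) would reproduce the first two folders of the structure above, minus the Person-specific metadata file.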
Overall, I approve of this PR. I do have a few minor requests inline below. I'd also appreciate @benbosman 's feedback on the approach you took. The new relationships file approach seems reasonable to me, but I'd like a second set of eyes.
Thanks overall & hopefully we can get this moved forward quickly!
dspace-api/src/main/java/org/dspace/app/itemimport/ItemImportCLITool.java
|
@tysonlt : This is failing unit tests as it has minor checkstyle violations. It looks like lines 22 and 27 in the RelationshipUtils.java class both have trailing whitespace (which is not allowed). Once you fix those, this should pass. |
@tysonlt : See my comment above. This looks great overall & I think it's nearly ready. It just has small Checkstyle issues that need fixing so that the unit tests can pass. Thanks!
|
Thanks @tdonohue, I've committed those changes now. I've also updated MetadataImport.java to use RelationshipUtils.java, since that's where I got that code from. |
@tysonlt : Looks good now! I retested this and everything is still working perfectly. I only see one more small improvement we may want to make...I think we should verify that, when folderName is used, a user cannot select a folder that is outside of the import directory. So, they couldn't do something like ../../../some/other/folder/on/filesystem. It doesn't look to me like it's "insecure" as-is, but likely better safe than sorry. See my small suggestion inline below.
Beyond that, I'm +1 this change. Thanks again for your hard work on this!
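The folderName check described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code in this PR: FolderNameGuard and resolveInside are hypothetical names, and the check is purely lexical (it does not follow symbolic links).

```java
import java.nio.file.Path;

/** Hypothetical guard (not part of DSpace) for folderName values read from a relationships file. */
final class FolderNameGuard {

    /**
     * Resolves folderName against importRoot and rejects anything that would
     * end up outside the import directory (or that names the root itself).
     */
    static Path resolveInside(Path importRoot, String folderName) {
        Path root = importRoot.toAbsolutePath().normalize();
        // normalize() collapses "." and ".." segments before the containment check.
        Path candidate = root.resolve(folderName).normalize();
        // Path.startsWith compares whole path elements, not string prefixes.
        if (!candidate.startsWith(root) || candidate.equals(root)) {
            throw new IllegalArgumentException(
                "folderName must refer to a folder inside the import directory: " + folderName);
        }
        return candidate;
    }
}
```

With this sketch, "item_001" resolves to a path under the import directory, while "../../../some/other/folder/on/filesystem" or an absolute path is rejected with an IllegalArgumentException.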
dspace-api/src/main/java/org/dspace/app/itemimport/ItemImportServiceImpl.java
Thanks for this PR,
I've done a partial initial review.
A few things I noticed here:
- The solution here assumes a plain-text relationships file. If you'd export an item with relationships, you'll notice the metadata_relation.xml file will contain the current relationships. It may make more sense to use that, in order to support future updates using the SAF
- The customizations here don't support the --test parameter. That feature is currently broken because the solution assumes the items are really created
- The --collection with a UUID is broken, but I noticed you didn't change that, so it may be broken in main
dspace-api/src/main/java/org/dspace/app/itemimport/ItemImportServiceImpl.java
|
Thanks for the update @tysonlt I think the current @tdonohue what do you think about this:
I think both solutions can also be extended to support mixing entities and plain text in a future PR |
|
@benbosman and @tysonlt : In my opinion, this approach of having a plain text So, in my opinion, we may want to consider treating the Does that sound reasonable to both of you, or am I missing a clear reason why the |
|
@tdonohue @benbosman I also would prefer to keep the xml as flat, static mappings to literal values, with no pre-processing just on that file, and not the others. The relationships manifest solves the very specific problem of not knowing what to put in the rel*.xml file, as the target item doesn't have a UUID yet. In this way it is similar to the other manifest files, like the collections or handle files, in that it provides instructions on how to process the item, rather than providing metadata directly. |
|
@tysonlt : I talked with @benbosman about this today, and we both agree that your current approach of using the All that means this PR can move forward as-is...we'd just need you to finish addressing any outstanding feedback, mainly the other comments from @benbosman in his review: #3322 (review) Then @benbosman and I can give it a final review & test and hopefully get this merged soon. Thanks! |
|
I'm fixing those checkstyle errors, but I'm occasionally seeing this exception when creating relationships:
It is coming from org.dspace.content.DSpaceObjectServiceImpl.update(DSpaceObjectServiceImpl.java:623). It looks like that comparator function has a missing edge case, but I can't see it! |
Thanks @tysonlt
This is looking good to me.
I did notice that incorrect data in the relationships file is not detected with the --test parameter, but that can probably be handled in a future PR.
I would want some automated tests to prove it keeps working, though.
These can be similar to the ITs for the CSV import with relationships.
|
@benbosman : While I understand the desire for ITs/tests in this area, unfortunately we don't yet have test infrastructure (at all) for Simple Archive Format imports/exports. This is a definite flaw (and should be fixed), but I don't think we should require this small PR to add/create that missing infrastructure. So, my recommendation here would be that, if we want to see this in 7.1, we should accept it as-is (i.e. with no automated tests) & log a ticket about the lack of ITs for SAF import/export. Otherwise, I don't see any way this code will make it into 7.1....building this infrastructure for SAF testing is unfortunately not a small task. |
Thanks @tdonohue for that update
That would indeed increase the scope of the task too much now, but it should be fixed at a later time
👍 Re-tested and this works well. I also verified that the -c command-line option now accepts a Collection's UUID (it works too). Thanks @tysonlt for your hard work on this new feature!
|
@tysonlt : This is at +2, so I'm merging this for release in 7.1. I also just copied over your basic instructions into our Documentation at: https://wiki.lyrasis.org/display/DSDOC7x/Importing+and+Exporting+Items+via+Simple+Archive+Format#ImportingandExportingItemsviaSimpleArchiveFormat-relationshipsfile Could you double check the docs look good to you as well? Ping me on Slack if you'd like access to edit those docs (you'd need to setup a wiki account) |
|
After merging, I've created two tickets based on discussions in this PR:
|
|
Great, thanks @tdonohue! One small change for the wiki: "Entities can be linked to other entitiess in this import set..." |
|
I also wonder whether it's worth mentioning the following: "If you already know the UUID of an existing item that you want to relate to, you can create a metadata_dspace.xml file and specify the relationships there. The relationships file is primarily for linking to items in the same import batch, although of course you can specify relationships to existing items using this file instead of metadata_dspace.xml if you prefer." Regarding exporting a relationships file during export, I'm not sure of the utility of this. Since by definition all the related entities will already exist, it will export a metadata_dspace.xml file to model those relationships. In this case the relationships file is redundant. If the relationships file became the primary or preferred/recommended way to specify relationships, then yes it should include it in the export. |
|
@tysonlt : Good point in both comments. I fixed the typo. I also added a yellow note to describe your last comment (which I believe was about the |
|
Looks great, I think that’s pretty clear and comprehensive. Only one small change at the end of the first notice box: “if you with to” |
|
@tysonlt : Thanks for catching all my silly typos. 🥇 Fixed! |
References
Description
Add support for a 'relationships' manifest file in the Simple Archive Format. This allows items to be linked to new or existing entities during import. Items can be linked to other items in this import set by referring to their import subfolder name.
Each line in the file contains a relationship type key and an item identifier in the following format:
relation.<relation_key> <handle|uuid|folderName:import_item_folder|schema.element[.qualifier]:value>
The import_item_folder should refer to the folder name of another item in this import batch. Example:
During initial import, new items are stored in a map keyed by the item folder name. Once the initial import is complete, a second pass checks for a 'relationships' manifest file in each folder and creates a relationship of the specified type to the specified item.
NOTE: this is only implemented for 'add' mode at this stage.
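To make the line format and the second pass concrete, here is a minimal Java sketch. It is not the implementation in this PR: RelationshipManifestSketch, Entry, parseLine and resolveTarget are hypothetical names, item identifiers are modelled as plain strings, and only folderName targets are resolved (handles, UUIDs and metadata lookups are passed through untouched).

```java
import java.util.Map;

/** Hypothetical sketch (not DSpace code) of parsing and resolving relationships manifest lines. */
final class RelationshipManifestSketch {

    static final String KEY_PREFIX = "relation.";
    static final String FOLDER_PREFIX = "folderName:";

    /** One parsed line: the relationship type key and the raw target identifier. */
    static final class Entry {
        final String typeKey; // e.g. "isAuthorOfPublication"
        final String target;  // handle, UUID, "folderName:..." or "schema.element:value"

        Entry(String typeKey, String target) {
            this.typeKey = typeKey;
            this.target = target;
        }
    }

    /** Parses "relation.<relation_key> <identifier>"; the identifier may itself contain spaces. */
    static Entry parseLine(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length != 2 || !parts[0].startsWith(KEY_PREFIX)
                || parts[0].length() == KEY_PREFIX.length()) {
            throw new IllegalArgumentException("Malformed relationships line: " + line);
        }
        return new Entry(parts[0].substring(KEY_PREFIX.length()), parts[1]);
    }

    /**
     * Second pass: folderName targets are looked up in the map that the first
     * pass filled (import folder name -> identifier of the newly created item).
     */
    static String resolveTarget(Entry entry, Map<String, String> itemIdByFolder) {
        if (!entry.target.startsWith(FOLDER_PREFIX)) {
            return entry.target; // left for other lookup strategies in this sketch
        }
        String folder = entry.target.substring(FOLDER_PREFIX.length());
        String itemId = itemIdByFolder.get(folder);
        if (itemId == null) {
            // Happens when the folder was not imported in this run, e.g. on a resumed import.
            throw new IllegalStateException("No item was imported from folder: " + folder);
        }
        return itemId;
    }
}
```

Keeping the parse step separate from the lookup means the folder map only has to be complete by the time the second pass runs, which is why the relationships are processed after the main import.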
Instructions for Reviewers
Add a "-l" or "--relationships" flag to the import command line (disabled by default)
Runs automatically
Checklist
I have tested this on my own import dataset, but haven't created any unit/integration tests as I was unable to find any for the import utility.
pom.xml), I've made sure their licenses align with the DSpace BSD License based on the Licensing of Contributions documentation.