This PR refactors the Token and Sentence positional properties inherited from the DataPoint so that they replace the self.start_pos and self.end_pos attributes.
For the Sentence, the two ways of accessing positional information behaved differently, giving inconsistent results where they should have been aliases. These are the inconsistencies:
1. Initializing the Sentence with str, i.e. untokenized input.
The user can pass an offset start_position to the init, but the start_position property does not respect it and always returns zero. In contrast, start_pos does include the start position offset. Suggestion: Only use and expose the property inherited from the DataPoint. Having multiple attributes that do the same thing under different names can get confusing.
2. Initializing the Sentence with List[str], i.e. pre-tokenized input.
Same concern as in (1).
In addition to (1), end_pos and end_position do not actually behave the same:
```python
from flair.data import Sentence

s = Sentence(['This', 'is', 'an', 'example', '.'])
print(s.end_position)  # Prints 17 -> Corresponding to the character-level end position
print(s.end_pos)  # Prints 5 -> Corresponding to the token-level end position
```
Suggestion: Always use the character-level end position since the token-level end position is accessible with len(s).
Suggestion: Do not use two separate methods to construct the tokens. Instead, reduce the case of initializing the Sentence with List[str] to the case of initializing it with str.
Please review the commits in isolation; each one corresponds to one of the suggestions.
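The suggestions above can be illustrated with a small standalone sketch. It does not use flair: `MiniSentence` is a hypothetical class, and both the single-space join for pre-tokenized input and the whitespace tokenization are assumptions made only for illustration, so the numbers it produces need not match flair's.

```python
from typing import List, Union


class MiniSentence:
    """Hypothetical stand-in for a sentence type; not flair's implementation."""

    def __init__(self, text: Union[str, List[str]], start_position: int = 0) -> None:
        # Assumption: pre-tokenized input is reduced to the plain-string case
        # by joining the tokens with single spaces.
        if not isinstance(text, str):
            text = " ".join(text)
        self._text = text
        self._start_position = start_position
        # Naive whitespace tokenization, for illustration only.
        self.tokens = text.split()

    @property
    def start_position(self) -> int:
        # The offset passed to __init__ is respected instead of returning zero.
        return self._start_position

    @property
    def end_position(self) -> int:
        # Always character-level; there is no second attribute with another meaning.
        return self._start_position + len(self._text)

    def __len__(self) -> int:
        # The token-level length stays available through len().
        return len(self.tokens)


s = MiniSentence(["This", "is", "an", "example", "."], start_position=10)
print(s.start_position, s.end_position, len(s))
```

With a single pair of properties there is only one answer to "where does this sentence end", and the token count is read from `len(s)` rather than from a second, similarly named attribute.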
@dobbersc thanks for fixing this! I see you removed the try-catch block in the token offset calculation. I actually don't remember why we needed this, and we have no unit test for a problem case, so removing it is fine.
From my debugging, I found that the try-catch was used only for the initialization with List[str]. Since the current_offset is calculated over the character lengths, the indices did not align with the words in the given list. The handling of this error caused the token start and end positions to be incorrect. Since now we join the words from the list to a single string, this try-catch is no longer needed.
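To make this kind of misalignment concrete, here is a rough sketch that is independent of flair and is not its actual code. It compares offsets accumulated purely from token lengths with offsets located in the joined string. Both function names are hypothetical, and joining with single spaces is an assumption.

```python
from typing import List, Tuple


def offsets_from_lengths(tokens: List[str]) -> List[Tuple[int, int]]:
    """Accumulate offsets from token lengths only, ignoring separators."""
    spans, current_offset = [], 0
    for token in tokens:
        spans.append((current_offset, current_offset + len(token)))
        current_offset += len(token)
    return spans


def offsets_from_joined_text(tokens: List[str]) -> List[Tuple[int, int]]:
    """Locate each token in the space-joined text, so spans index into that text."""
    text = " ".join(tokens)
    spans, current_offset = [], 0
    for token in tokens:
        # Search from the end of the previous token to skip earlier matches.
        start = text.index(token, current_offset)
        spans.append((start, start + len(token)))
        current_offset = start + len(token)
    return spans


tokens = ["This", "is", "an", "example", "."]
text = " ".join(tokens)
naive = offsets_from_lengths(tokens)
aligned = offsets_from_joined_text(tokens)

# Slicing the joined text with each span shows which variant lines up.
print([text[a:b] for a, b in naive])
print([text[a:b] for a, b in aligned])
```

Only the spans computed against the joined string slice back to the original words, which is why a fallback for misaligned indices becomes unnecessary once the list is joined first.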