Thanks to visit codestin.com
Credit goes to github.com

Skip to content

The cost is not optimal #89

@aeft

Description

@aeft
from_tree = json.build_tree(["de", 1])
to_tree = json.build_tree([2])

print(from_tree.diff(to_tree).edited_cost())

for edit in from_tree.get_all_edits(to_tree):
    print(edit)

output:

4                                                                                                                                                                                                       
Insert(to_insert=IntegerNode(2), insert_into=ListNode((StringNode('de'), IntegerNode(1))))                                                                                                              
Remove(StringNode('de'), remove_from=ListNode((StringNode('de'), IntegerNode(1))))
Remove(IntegerNode(1), remove_from=ListNode((StringNode('de'), IntegerNode(1))))

But the minimal cost should be 3.
output:

3                                                                                                                                                                                                       
Remove(StringNode('de'), remove_from=ListNode((StringNode('de'), IntegerNode(1))))                                                                                                                      
Match(match_from=IntegerNode(1), match_to=IntegerNode(2), cost=1)

I found the problem is in EditDistance._best_match:

dcost = (self.costs[row - 1][col - 1], self.path_costs[row - 1][col - 1])
lcost = (self.costs[row][col - 1], self.path_costs[row][col - 1])
ucost = (self.costs[row - 1][col], self.path_costs[row - 1][col])
diag_is_best = dcost <= lcost and dcost <= ucost
if diag_is_best:
    make_distinct(self.edit_matrix[row][col], self.edit_matrix[row][0], self.edit_matrix[0][col])
if diag_is_best and \ 
        self.edit_matrix[row][col].bounds() < self.edit_matrix[row][0].bounds() and \
        self.edit_matrix[row][col].bounds() < self.edit_matrix[0][col].bounds():  # condition1
    brow, bcol, edit = row - 1, col - 1, self.edit_matrix[row][col]
elif ucost <= dcost:  # condition2
    brow, bcol, edit = row - 1, col, self.edit_matrix[row][0]
else: # condition3
    brow, bcol, edit = row, col - 1, self.edit_matrix[0][col]
self.path_costs[row][col] = self.path_costs[brow][bcol] + 1
self.costs[row][col] = self.costs[brow][bcol] + edit.bounds().upper_bound
# In the above example, when we call _best_match(1, 1).
# value information:
# self.edit_matrix[row][col].bounds() = (2,2)
# self.edit_matrix[0][col].bounds() = (2,2)
# ucost = (2, 1)
# dcost = (0, 0)
# lcost = (1, 1)

brow = row, bcol = col-1 (condition3 success) because self.edit_matrix[row][col].bounds() == self.edit_matrix[0][col].bounds() (condition1 fail) and ucost > dcost (condition2 fail).
But actually self.costs[row][col-1] + self.edit_matrix[0][col].bounds().upper_bound > self.costs[row-1][col-1] + self.edit_matrix[row][col].bounds().upper_bound. The optimal self.costs[row][col] should be self.costs[row-1][col-1] + self.edit_matrix[row][col].bounds().upper_bound.

Here is a fix (success in the above example, may need further testing):

make_distinct(self.edit_matrix[row][col], self.edit_matrix[row][0], self.edit_matrix[0][col])
dcost = (self.costs[row - 1][col - 1] + self.edit_matrix[row][col].bounds().upper_bound, self.path_costs[row - 1][col - 1])
lcost = (self.costs[row][col - 1] + self.edit_matrix[0][col].bounds().upper_bound, self.path_costs[row][col - 1])
ucost = (self.costs[row - 1][col] + self.edit_matrix[row][0].bounds().upper_bound, self.path_costs[row - 1][col])
if dcost <= lcost and dcost <= ucost:
    brow, bcol, edit = row - 1, col - 1, self.edit_matrix[row][col]
elif ucost <= dcost and ucost <= lcost:
    brow, bcol, edit = row - 1, col, self.edit_matrix[row][0]
else:
    brow, bcol, edit = row, col - 1, self.edit_matrix[0][col]
self.path_costs[row][col] = self.path_costs[brow][bcol] + 1
self.costs[row][col] = self.costs[brow][bcol] + edit.bounds().upper_bound

In summary, we should use the final real cost to guide transition. Like the canonical implementation of the Levenshtein distance metric:

dist[row][col] = min(dist[row - 1][col] + 1,
                                 dist[row][col - 1] + 1,
                                 dist[row - 1][col - 1] + cost)

If this is a bug, I would be happy to provide a PR.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions