Pairwise string comparison #51
Merged
new Splink comparison
we cannot directly use dialect properties to adjust the SQL, so we need to make our own version
94a513e to 06e53fe
Following Splink#2517 and Splink#2546, add Clickhouse-specific versions of `PairwiseStringDistanceFunctionLevel` and `PairwiseStringDistanceFunctionAtThresholds`.
These could not work using just the dialect markers introduced in the latter PR, as the SQL in Splink assumes that the lambda is passed as the second argument to functions. In Clickhouse, however, it must be passed as the first argument, and the parser fails if it appears anywhere else. This rules out options such as defining a UDF with the arguments switched, as the parser would still fail. Rather than resorting to more involved string manipulation, we simply re-implement the SQL in a way that suits the Clickhouse dialect.
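To illustrate the argument-order problem, here is a minimal sketch (not the SQL this PR or Splink actually generates) that builds a "minimum distance over all pairs" expression in both styles. The helper `pairwise_min_distance_sql` and the column names are hypothetical. The SQL function names (`arrayMap`, `arrayMin`, `editDistance` for Clickhouse; `list_transform`, `list_min`, `levenshtein` for a DuckDB-style dialect) are used as I understand them from the respective docs, and the generated strings have not been run against a live server.

```python
def pairwise_min_distance_sql(
    col_l: str, col_r: str, distance_fn: str, lambda_first: bool
) -> str:
    """Build SQL for the minimum distance over all pairs of two array columns.

    lambda_first=True  -> Clickhouse style: arrayMap(lambda, array)
    lambda_first=False -> DuckDB style:     list_transform(array, lambda)
    """

    def transform(array_expr: str, var: str, body: str) -> str:
        lam = f"{var} -> {body}"
        if lambda_first:
            # Clickhouse requires the lambda as the FIRST argument.
            return f"arrayMap({lam}, {array_expr})"
        # Other dialects take the lambda as the SECOND argument.
        return f"list_transform({array_expr}, {lam})"

    min_fn = "arrayMin" if lambda_first else "list_min"

    # Inner step: for a fixed x, the smallest distance to any y in col_r.
    distance_call = f"{distance_fn}(x, y)"
    inner = f"{min_fn}({transform(col_r, 'y', distance_call)})"

    # Outer step: the smallest of those per-x minima over col_l.
    return f"{min_fn}({transform(col_l, 'x', inner)})"


clickhouse_sql = pairwise_min_distance_sql(
    "name_l", "name_r", "editDistance", lambda_first=True
)
duckdb_sql = pairwise_min_distance_sql(
    "name_l", "name_r", "levenshtein", lambda_first=False
)
```

Taking a minimum of per-element minima also avoids flattening a nested array, which may or may not match the workaround used in this PR.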
This turned out to be doubly useful, as it is also not directly possible to unnest lists by a single level in Clickhouse, which requires a workaround (although, to be fair, this could still have been achieved with a UDF).
Mainly to facilitate testing, this also allows `ClickhouseAPI` to correctly register pandas columns that are arrays of strings. For now, any other array columns will be coerced to arrays of strings (I think, though I haven't tested this).
For `chdb` this is currently less straightforward, as we rely on its native `SELECT * FROM Python(input)` rather than doing any manual processing, and this does not presently recognise array types.
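As a rough illustration of the array handling described above, the sketch below shows one way to detect an array-valued column and coerce its elements to strings before registering it as `Array(String)`. It is not the code in `ClickhouseAPI`: the helper names are hypothetical, the input is assumed to be a plain list of cell values (for example from `series.tolist()`), and cells are assumed to be Python lists or tuples, not NumPy arrays.

```python
from typing import Any, List, Optional, Sequence


def is_array_column(values: Sequence[Any]) -> bool:
    """True if every non-null cell is a list or tuple (and at least one exists)."""
    non_null = [v for v in values if v is not None]
    return bool(non_null) and all(isinstance(v, (list, tuple)) for v in non_null)


def to_string_array(cell: Optional[Sequence[Any]]) -> List[str]:
    """Coerce one cell to a list of strings; a null cell becomes an empty array."""
    if cell is None:
        return []
    return [str(item) for item in cell]


def clickhouse_column_type(values: Sequence[Any]) -> Optional[str]:
    """Return 'Array(String)' for array columns, else None (caller decides)."""
    return "Array(String)" if is_array_column(values) else None


names = [["ann", "anne"], None, ("bob",)]
codes = [[1, 2], [3]]  # non-string arrays are coerced to strings

names_type = clickhouse_column_type(names)
names_rows = [to_string_array(cell) for cell in names]
codes_rows = [to_string_array(cell) for cell in codes]
scalar_type = clickhouse_column_type(["ann", "bob"])
```

Mapping a null cell to an empty array is a choice made for this sketch; a real implementation might prefer to keep nulls distinct.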