ADBond · 2024-12-12T17:46:59Z

Following Splink#2517 and Splink#2546, add Clickhouse-specific versions of PairwiseStringDistanceFunctionLevel and PairwiseStringDistanceFunctionAtThresholds.

These could not be made to work using only the dialect markers introduced in the latter PR, because the SQL in Splink assumes that the lambda is passed as the second argument to functions. In Clickhouse it must be passed as the first argument, and the parser fails if it appears anywhere else. This rules out options such as defining a UDF with the arguments switched, as the parser would still fail. Rather than resorting to more involved string manipulation, we simply re-implement the SQL in a way that suits the Clickhouse dialect.

This turned out to be doubly useful, as it is also not directly possible to unnest lists by a single level in Clickhouse, which requires a workaround (although, to be fair, this could still have been achieved with a UDF).
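To make the argument-order problem concrete, here is a minimal sketch, not the SQL actually added in this PR. The helper names `map_sql` and `clickhouse_min_pairwise_distance_sql` are hypothetical, and the nested-lambda approach is just one possible way to get a pairwise minimum without unnesting. It relies only on `arrayMap`, `arrayMin` and `damerauLevenshteinDistance` in Clickhouse and `list_transform` in DuckDB.

```python
def map_sql(dialect: str, lambda_sql: str, array_sql: str) -> str:
    """Render an 'apply lambda to each element' call for one dialect.

    Hypothetical helper for illustration only; it is not part of Splink.
    """
    if dialect == "clickhouse":
        # Clickhouse: the lambda must come first
        return f"arrayMap({lambda_sql}, {array_sql})"
    if dialect == "duckdb":
        # DuckDB: the list comes first, the lambda second
        return f"list_transform({array_sql}, {lambda_sql})"
    raise ValueError(f"unsupported dialect: {dialect}")


def clickhouse_min_pairwise_distance_sql(
    col_l: str,
    col_r: str,
    distance_fn: str = "damerauLevenshteinDistance",
) -> str:
    """Smallest distance over all (left, right) element pairs.

    Uses nested lambdas, so no single-level flatten is needed.
    Empty arrays and NULLs are not handled in this sketch.
    """
    inner = map_sql("clickhouse", f"y -> {distance_fn}(x, y)", col_r)
    outer = map_sql("clickhouse", f"x -> arrayMin({inner})", col_l)
    return f"arrayMin({outer})"


print(map_sql("duckdb", "x -> upper(x)", "names"))
print(map_sql("clickhouse", "x -> upper(x)", "names"))
print(clickhouse_min_pairwise_distance_sql("name_l", "name_r"))
```

The two `map_sql` calls differ only in where the lambda sits, and that is exactly the difference a dialect marker on the function name alone cannot express.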

Mainly to facilitate testing, this also allows ClickhouseAPI to correctly register pandas columns that are arrays of strings. For now, any other array columns will be coerced to arrays of strings (I think, though I haven't tested this).
For chdb this is currently less straightforward, as we rely on its native SELECT * FROM Python(input) rather than doing any manual processing, and this does not presently recognise array types.
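As a rough illustration of the registration behaviour described above, the following sketch shows one way array columns could be detected and coerced before a table is created. It is an assumption-laden example, not the code in this PR: `clickhouse_array_type` and `coerce_to_strings` are hypothetical names, and only `None` is treated as missing (a real pandas column may also hold `NaN` or numpy arrays).

```python
from typing import Any, Iterable, List, Optional


def clickhouse_array_type(values: Iterable[Any]) -> Optional[str]:
    """Return "Array(String)" if every non-null value is a list or tuple.

    Returns None for non-array columns, or if there are no non-null values.
    Hypothetical helper; accepts any iterable, e.g. a pandas Series.
    """
    saw_array = False
    for value in values:
        if value is None:
            continue
        if not isinstance(value, (list, tuple)):
            return None
        saw_array = True
    return "Array(String)" if saw_array else None


def coerce_to_strings(values: Iterable[Any]) -> List[Optional[List[str]]]:
    """Coerce every element of every array to str, keeping None as-is."""
    return [None if v is None else [str(x) for x in v] for v in values]


aliases = [["jon", "john"], None, ["jo"]]
print(clickhouse_array_type(aliases))
print(coerce_to_strings([[1, 2], None]))
```

Coercing every element with `str` mirrors the "any other array columns become arrays of strings" behaviour noted above, at the cost of losing the original element type.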


ADBond changed the title ~~Pairwse string comparison~~ Pairwise string comparison Dec 12, 2024

ADBond added the enhancement and comparisons labels Dec 12, 2024

ADBond added 9 commits December 16, 2024 09:43

Clickhouse server - register pandas arrays of strings

ed7110d

array of strings in test fixture frame

4bf2822

test PairwiseStringDistanceFunctionAtThresholds

d5b7c40

new Splink comparison

test -> clickhouse only

633cb71

dialect array functions

a24ca35

create a shadow version of comparison (level)

c1b1303

we cannot directly use dialect properties to adjust the SQL, so we need to make our own version

move test to cl_ch module, now that we are implementing our own version

7245737

pairwise bits to changelog

ea2e889

another change that happened

06e53fe

ADBond force-pushed the feature/pariwise-string-comparison branch from 94a513e to 06e53fe Compare December 16, 2024 09:44

ADBond merged commit e8de8cd into main Dec 16, 2024

ADBond deleted the feature/pariwise-string-comparison branch December 16, 2024 09:59

Pairwise string comparison #51