Set the schema case sensitivity to false for Hive#234
Set the schema case sensitivity to false for Hive#234ljfgem merged 2 commits intolinkedin:masterfrom
Conversation
|
cc @wmoustafa |
wmoustafa
left a comment
There was a problem hiding this comment.
Could you update the PR description with an example query (preferably simplest one), and the behavior before and after the patch?
Can we add a Hive to Rel conversion test?
| run(driver, "create table table_same_column_name_a (some_id string)"); | ||
| run(driver, "create table table_same_column_name_b (some_id string)"); |
There was a problem hiding this comment.
Could you please follow the same casing convention as the rest of the test cases? (I see some are lowercase, but these seem to have slipped).
| run(driver, "create table table_same_column_name_a (some_id string)"); | ||
| run(driver, "create table table_same_column_name_b (some_id string)"); | ||
| run(driver, | ||
| "create view if not exists view_column_name_case_insensitive as\n" + " select a.some_id\n" |
There was a problem hiding this comment.
Could you simplify the view to the minimal one that satisfies the test case?
There was a problem hiding this comment.
create table hive.test.table_a (
some_id string
);
create table hive.test.table_b (
some_id string
);
These 2 views will have the same issue:
-- Hive view version 1
create or replace view test.view_ab as
select a.some_id
from test.table_a a
left join
(
select trim(some_id) AS SOME_ID
from test.table_b
) b
on a.some_id = b.some_id
where a.some_id != ''
-- Hive view version 2
create or replace view test.view_ab2 as
select a.some_id
from test.table_a a
left join
(
select some_id AS SOME_ID
from test.table_b
) b
on a.some_id = trim(b.some_id)
where a.some_id != ''
The WHERE clause is needed in order to enforce the usage of the inner select. I don't have at hand a sample view easier to grasp than the ones above.
| run(driver, String.join("\n", "", "CREATE VIEW IF NOT EXISTS named_struct_view", "AS", | ||
| "SELECT named_struct('abc', 123, 'def', 'xyz') AS named_struc", "FROM bar")); | ||
|
|
||
| run(driver, "create table table_same_column_name_a (some_id string)"); |
There was a problem hiding this comment.
table_same_column_name_a -> duplicate_column_name.tableA (same for the other table)
There was a problem hiding this comment.
I did the update for the spark tests.
However the other tests follow a slightly different naming scheme for the test tables. Do you want me to adjust the naming for the other tests (hive/trino) ?
66733bd to
1baab42
Compare
ljfgem
left a comment
There was a problem hiding this comment.
@findinpath Thanks for this PR!
I have kicked off the i-test and will update once it finishes.
Could you please run ./gradlew spotlessApply to fix the code format issue? Please ensure that ./gradlew clean build succeeds before pushing.
1baab42 to
f1527ef
Compare
ljfgem
left a comment
There was a problem hiding this comment.
LGTM and all the i-tests passed.
Thanks @findinpath @findepi !
@wmoustafa @funcheetah do you want to take another look?
In the translation of views, when joining tables that have the namesake column names ,irrespective of their case, make sure that the resulting relation is using names of the columns which have unique names in order to avoid the situation where the SQL statement created contains ambiguous column names.
f1527ef to
6ca047f
Compare
funcheetah
left a comment
There was a problem hiding this comment.
Thanks @findinpath for the PR. LGTM.
Problem description
Assume that we have two tables which contain a column with the same name:
Assume also that we have a view doing a select from the result of these table joined together and the name of the common column is used in a different case (some_id, SOME_ID):
When translating in Coral from Hive, the following result is being obtained:
As it can be seen from the query above this leads to having an ambiguous column, because there is both "some_id" and "SOME_ID" which in Hive point to the same column.
Solution description
Pointing out to coral that Hive is a case insensitive system, solves the problem by creating unique names for the columns in the selection.
When joining tables that have the same column name
,irrespective of their case, make sure that the
names of the columns have unique names to avoid
the situation where the SQL statement created
contains ambiguous column names.