[Feature][Connector-V2] Hive sink support SchemaSaveMode and DataSaveMode #9743
base: dev
Conversation
cc @liunaijie
Please add a test case in HiveIT and update the docs.
```java
}

private void createTableUsingTemplate() throws TException {
    processCreateTemplate();
```
I see a SQL template here, but the result is not used. Perhaps you meant to parse the SQL result into a Table?
docs/en/connector-v2/sink/Hive.md
Outdated
### table_format [string]
Let's introduce save_mode_create_template instead of table_format and partition_fields.
Please refer to https://github.com/apache/seatunnel/blob/dev/docs/en/connector-v2/sink/Doris.md#save_mode_create_template
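Following the Doris convention linked above, a Hive version of the option could look like the sketch below. The placeholder names (`${database}`, `${table_name}`, `${rowtype_fields}`) mirror the Doris connector's template and are assumptions here, not confirmed Hive-connector syntax.

```hocon
sink {
  Hive {
    metastore_uri = "thrift://metastore:9083"
    table_name = "warehouse.user_data"
    schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
    # Placeholder names below are borrowed from the Doris connector docs
    # and would need to be confirmed against the final Hive implementation.
    save_mode_create_template = """
    CREATE TABLE IF NOT EXISTS `${database}`.`${table_name}` (
        ${rowtype_fields}
    )
    STORED AS ORC
    """
  }
}
```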
```xml
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>${hive.exec.version}</version>
<scope>provided</scope>
```
Please keep hive-exec as provided, because users may run a different version.
```xml
<exclusion>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop-bundle</artifactId>
    <artifactId>*</artifactId>
</exclusion>
<exclusion>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
</exclusion>
<!-- Exclude unnecessary dependencies -->
```
Where is the other version dependency?
```xml
</exclusions>
</dependency>
<!-- Hive Common dependency - contains HiveConf class -->
<dependency>
```
ditto
```xml
</dependency>

<!-- Hive MetaStore dependency - contains HiveMetaStoreClient, AlreadyExistsException and other metastore API classes -->
<dependency>
```
ditto
### save_mode_create_template [string] |
What's the default value of save_mode_create_template?
Flag to decide whether to use overwrite mode when inserting data into Hive. If set to true, for non-partitioned tables, the existing data in the table will be deleted before inserting new data. For partitioned tables, the data in the relevant partition will be deleted before inserting new data.
### schema_save_mode [enum] |
default value?
```java
    }
}

public void createDatabaseIfNotExists(String db) throws TException {
```
How about letting HiveMetaStoreProxy implement the Catalog interface? That way we don't need to override most methods of HiveSaveModeHandler.
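The suggestion above could look roughly like the sketch below. The `Catalog` interface here is a trimmed-down, self-contained stand-in for SeaTunnel's real `org.apache.seatunnel.api.table.catalog.Catalog` (method names are assumptions), and the in-memory maps stand in for real `IMetaStoreClient` calls; it only illustrates the shape, not the actual implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for SeaTunnel's Catalog interface (assumption).
interface Catalog {
    boolean databaseExists(String databaseName);
    void createDatabase(String databaseName, boolean ignoreIfExists);
    boolean tableExists(String databaseName, String tableName);
}

// If HiveMetaStoreProxy implements Catalog, the generic save-mode machinery
// can call it directly instead of HiveSaveModeHandler re-implementing
// each catalog operation.
class HiveMetaStoreProxy implements Catalog {
    // In-memory stand-ins for real metastore state, for illustration only.
    private final Set<String> databases = new HashSet<>();
    private final Map<String, Set<String>> tables = new HashMap<>();

    @Override
    public boolean databaseExists(String databaseName) {
        return databases.contains(databaseName);
    }

    @Override
    public void createDatabase(String databaseName, boolean ignoreIfExists) {
        if (databases.contains(databaseName) && !ignoreIfExists) {
            throw new IllegalStateException("Database exists: " + databaseName);
        }
        databases.add(databaseName);
        tables.putIfAbsent(databaseName, new HashSet<>());
    }

    @Override
    public boolean tableExists(String databaseName, String tableName) {
        return tables.getOrDefault(databaseName, new HashSet<>()).contains(tableName);
    }
}

public class Main {
    public static void main(String[] args) {
        Catalog catalog = new HiveMetaStoreProxy();
        catalog.createDatabase("warehouse", true);
        System.out.println(catalog.databaseExists("warehouse")); // prints "true"
        System.out.println(catalog.tableExists("warehouse", "user_data")); // prints "false"
    }
}
```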
```java
private final List<String> partitionFields;

private HiveMetaStoreProxy hiveMetaStoreProxy;
private Catalog optionalCatalog; // 可选的Catalog支持 (optional Catalog support)
```
Please remove the Chinese comment.
```java
@Slf4j
public class HiveMetaStoreProxy implements Closeable, Serializable {
public class HiveMetaStoreProxy implements Catalog, Closeable, Serializable {
```
Suggested change:
```java
public class HiveMetaStoreCatalog implements Catalog, Closeable, Serializable {
```
```java
table.setTableName(tableName);
table.setOwner(System.getProperty("user.name", "seatunnel"));
table.setCreateTime((int) (System.currentTimeMillis() / 1000));
table.setTableType("MANAGED_TABLE");
```
For customTemplate, should we still set the type to MANAGED_TABLE?
```java
sd.setCols(cols);

// Set table location
String tableLocation = HiveTableTemplateUtils.getDefaultTableLocation(dbName, tableName);
```
The location should also be extracted from the template.
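One possible way to honor a user-supplied location, sketched under the assumption that the template is a plain CREATE TABLE string with an optional standard `LOCATION '...'` clause (class and method names here are illustrative, not from the PR):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull the LOCATION clause out of a user-supplied CREATE TABLE
// template so the StorageDescriptor does not always fall back to the
// default warehouse path.
public class TemplateLocation {
    private static final Pattern LOCATION_PATTERN =
            Pattern.compile("LOCATION\\s+'([^']+)'", Pattern.CASE_INSENSITIVE);

    static Optional<String> extractLocation(String createTemplate) {
        Matcher m = LOCATION_PATTERN.matcher(createTemplate);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String template =
                "CREATE TABLE `${database}`.`${table_name}` (${rowtype_fields}) "
                        + "STORED AS ORC LOCATION 'hdfs://nn:8020/warehouse/custom'";
        // prints "hdfs://nn:8020/warehouse/custom"
        System.out.println(extractLocation(template).orElse("<default>"));
    }
}
```

If the template has no LOCATION clause, the caller would fall back to `getDefaultTableLocation` as before.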
```java
executeJob(
        container,
        "/auto_table_creation/fake_to_hive_create_when_not_exist.conf",
        "/auto_table_creation/hive_auto_create_to_assert.conf");
```
Why use a new SeaTunnel job to verify the table? Maybe we can just verify the new table with the Hive client.
```java
// For Hive, data save mode is handled by the existing OVERWRITE parameter
// No additional data handling is needed here
log.info(
        "Data save mode handling is managed by existing OVERWRITE parameter for table {}.{}",
        dbName,
        tableName);
```
Let's merge the overwrite logic with data save mode. We can set overwrite=true or datasavemode=TRUNCATE to do the same thing.
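The merge could be as small as one resolution step that funnels both options onto the same code path. The enum values below follow SeaTunnel's DataSaveMode naming (DROP_DATA carrying the truncate-then-write semantics the reviewer calls TRUNCATE); the resolution rule itself is an assumption about how the merge might work, not the PR's actual code.

```java
// Simplified stand-in for org.apache.seatunnel.api.sink.DataSaveMode.
enum DataSaveMode { DROP_DATA, APPEND_DATA, ERROR_WHEN_DATA_EXISTS }

public class Main {
    // overwrite=true is treated as an alias for DROP_DATA, so both knobs
    // land on one effective behavior instead of two separate branches.
    static DataSaveMode resolve(boolean overwrite, DataSaveMode configured) {
        if (overwrite) {
            return DataSaveMode.DROP_DATA;
        }
        return configured == null ? DataSaveMode.APPEND_DATA : configured;
    }

    public static void main(String[] args) {
        System.out.println(resolve(true, DataSaveMode.APPEND_DATA)); // prints "DROP_DATA"
        System.out.println(resolve(false, null));                    // prints "APPEND_DATA"
    }
}
```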
```java
// Initialize partition fields from template if available
this.partitionFields = extractPartitionFieldsFromConfig();
```
It looks like this field is only used by test cases. We should verify partition fields another way in the test case, not in runtime code.
```java
    this.partitionFields = extractPartitionFieldsFromConfig();
}

public HiveSaveModeHandler(
```
useless?
```java
private final List<String> partitionFields;

private HiveMetaStoreCatalog hiveCatalog;
private Catalog optionalCatalog;
```
Why not use hiveCatalog directly instead of optionalCatalog?
```java
import java.util.List;

@Slf4j
public class HiveSaveModeHandler implements SaveModeHandler, AutoCloseable {
```
Why not extend DefaultSaveModeHandler? A lot of this code is duplicated.
```java
public static final Option<String> SAVE_MODE_CREATE_TEMPLATE =
        Options.key("save_mode_create_template")
                .stringType()
                .noDefaultValue()
```
default value?
```hocon
FakeSource {
  schema = {
    fields {
      pk_id = bigint
```
Could you help create a check covering all field types?
```java
case TIMESTAMP:
    return "timestamp";
case ROW:
    return "struct";
```
Is there an error here? A struct type also needs its fields mapped.
```java
case ROW:
    return "struct";
case ARRAY:
    return "array";
```
ditto
```java
case ARRAY:
    return "array";
case MAP:
    return "map";
```
ditto
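The concern in these three comments is that Hive's complex type names carry their element and field types, so returning bare `struct`, `array`, or `map` produces invalid DDL. A minimal sketch of the recursive shape (a simplified stand-in using plain strings instead of the real SeaTunnelDataType classes):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: build full Hive type names for complex types. Real code would
// recurse over SeaTunnel's ArrayType/MapType/SeaTunnelRowType instead of
// taking pre-converted strings.
public class HiveTypeNames {
    static String arrayType(String elementType) {
        return "array<" + elementType + ">";
    }

    static String mapType(String keyType, String valueType) {
        return "map<" + keyType + "," + valueType + ">";
    }

    static String structType(Map<String, String> fieldTypes) {
        // Hive struct syntax: struct<fieldName:fieldType,...>
        return fieldTypes.entrySet().stream()
                .map(e -> e.getKey() + ":" + e.getValue())
                .collect(Collectors.joining(",", "struct<", ">"));
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", "string");
        fields.put("scores", arrayType("int"));
        // prints "struct<name:string,scores:array<int>>"
        System.out.println(structType(fields));
        System.out.println(mapType("string", "bigint")); // prints "map<string,bigint>"
    }
}
```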
@yzeng1618 Hi, I have a small suggestion: you can use this identifier or "done" to mark that a problem has been solved.
Purpose of this pull request
This pull request implements a comprehensive Hive Sink connector for Apache SeaTunnel with advanced auto-create table capabilities. The connector enables seamless data writing to Apache Hive tables with support for automatic schema management, partitioning, multiple storage formats, and cloud storage integration.
Key features include:
Auto-create table functionality with configurable schema save modes
Advanced partitioning support through partition_fields configuration
Multiple storage formats (PARQUET, ORC, TEXTFILE)
Does this PR introduce any user-facing change?
Yes, this PR introduces a new Hive Sink connector with the following user-facing changes:
New Configuration Options:
schema_save_mode: Controls table creation behavior (CREATE_SCHEMA_WHEN_NOT_EXIST, RECREATE_SCHEMA, ERROR_WHEN_SCHEMA_NOT_EXIST, IGNORE)
table_format: Specifies storage format (PARQUET, ORC, TEXTFILE)
partition_fields: List of partition columns for table partitioning
save_mode_create_template: Customizable table creation template
Example Configuration:
```hocon
sink {
  Hive {
    metastore_uri = "thrift://metastore:9083"
    table_name = "warehouse.user_data"
    schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
    table_format = "PARQUET"
    partition_fields = ["dt"]
  }
}
```
How was this patch tested?
Unit Tests:
HiveSinkOptionsTest: Tests configuration option parsing and validation
HiveSaveModeHandlerTest: Tests table creation and schema management logic
HiveTypeConvertorTest: Tests data type conversion between source and Hive types
Local Tests
Check list
New License Guide