yzeng1618 (Contributor)

Purpose of this pull request

This pull request implements a Hive Sink connector for Apache SeaTunnel that can create the target table automatically. The connector writes data to Apache Hive tables and supports automatic schema management, partitioning, multiple storage formats, and cloud storage integration.
Key features include:
Auto-create table functionality with configurable schema save modes
Advanced partitioning support through partition_fields configuration
Multiple storage formats (PARQUET, ORC, TEXTFILE)

Does this PR introduce any user-facing change?

Yes, this PR introduces a new Hive Sink connector with the following user-facing changes:
New Configuration Options:
schema_save_mode: Controls table creation behavior (CREATE_SCHEMA_WHEN_NOT_EXIST, RECREATE_SCHEMA, ERROR_WHEN_SCHEMA_NOT_EXIST, IGNORE)
table_format: Specifies storage format (PARQUET, ORC, TEXTFILE)
partition_fields: List of partition columns for table partitioning
save_mode_create_template: Customizable table creation template
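
The four schema_save_mode values above can be read as a small decision table. The following is a minimal sketch of that reading, not the connector's actual code: the enum values come from the list above, while `SchemaAction`, `SchemaSaveModeSketch`, and `decide` are hypothetical names, and the behavior of each mode is an assumption inferred from its name.

```java
// Sketch only: enum values are taken from the PR description;
// everything else here is a hypothetical illustration.
enum SchemaSaveMode {
    CREATE_SCHEMA_WHEN_NOT_EXIST,
    RECREATE_SCHEMA,
    ERROR_WHEN_SCHEMA_NOT_EXIST,
    IGNORE
}

// Hypothetical result type describing what the sink would do before writing.
enum SchemaAction { NONE, CREATE, DROP_AND_CREATE }

final class SchemaSaveModeSketch {
    // Assumed semantics: maps (mode, tableExists) to an action.
    static SchemaAction decide(SchemaSaveMode mode, boolean tableExists) {
        switch (mode) {
            case CREATE_SCHEMA_WHEN_NOT_EXIST:
                return tableExists ? SchemaAction.NONE : SchemaAction.CREATE;
            case RECREATE_SCHEMA:
                return tableExists ? SchemaAction.DROP_AND_CREATE : SchemaAction.CREATE;
            case ERROR_WHEN_SCHEMA_NOT_EXIST:
                if (!tableExists) {
                    throw new IllegalStateException("Target table does not exist");
                }
                return SchemaAction.NONE;
            case IGNORE:
                return SchemaAction.NONE;
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```

For example, `decide(SchemaSaveMode.RECREATE_SCHEMA, true)` yields `DROP_AND_CREATE` under these assumptions.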

Example Configuration:
sink {
  Hive {
    metastore_uri = "thrift://metastore:9083"
    table_name = "warehouse.user_data"
    schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
    table_format = "PARQUET"
    partition_fields = ["dt"]
  }
}

How was this patch tested?

  1. Unit Tests:
    HiveSinkOptionsTest: Tests configuration option parsing and validation
    HiveSaveModeHandlerTest: Tests table creation and schema management logic
    HiveTypeConvertorTest: Tests data type conversion between source and Hive types

  2. Local Tests

Check list

github-actions bot removed the e2e label on Aug 21, 2025
Hisoka-X (Member)

cc @liunaijie

Hisoka-X (Member) left a comment

Please add a test case in HiveIT and update the docs.

}

private void createTableUsingTemplate() throws TException {
    processCreateTemplate();
Member

I see a SQL template being processed here, but the result is not used.
Perhaps you meant to parse the SQL result into a Table?




### table_format [string]
Member

Let's introduce save_mode_create_template instead of table_format and partition_fields.
Please refer to https://github.com/apache/seatunnel/blob/dev/docs/en/connector-v2/sink/Doris.md#save_mode_create_template

<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>${hive.exec.version}</version>
<scope>provided</scope>
Member

Please keep hive-exec as provided, because users may run different versions.

Comment on lines 101 to 109
<exclusion>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-hadoop-bundle</artifactId>
<artifactId>*</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
</exclusion>
<!-- Exclude unnecessary dependencies -->
Member

Where is the other version of this dependency?

</exclusions>
</dependency>
<!-- Hive Common dependency - contains HiveConf class -->
<dependency>
Member

ditto

</dependency>

<!-- Hive MetaStore dependency - contains HiveMetaStoreClient, AlreadyExistsException and other metastore API classes -->
<dependency>
Member

ditto




### save_mode_create_template [string]
Member

What's the default value of save_mode_create_template?


Flag to decide whether to use overwrite mode when inserting data into Hive. If set to true, for non-partitioned tables, the existing data in the table will be deleted before inserting new data. For partitioned tables, the data in the relevant partition will be deleted before inserting new data.

### schema_save_mode [enum]
Member

default value?

}
}

public void createDatabaseIfNotExists(String db) throws TException {
Member

How about letting HiveMetaStoreProxy implement the Catalog interface? Then we would not need to override most methods of HiveSaveModeHandler.

private final List<String> partitionFields;

private HiveMetaStoreProxy hiveMetaStoreProxy;
private Catalog optionalCatalog; // 可选的Catalog支持
Member

Please remove the Chinese comment.


@Slf4j
public class HiveMetaStoreProxy implements Closeable, Serializable {
public class HiveMetaStoreProxy implements Catalog, Closeable, Serializable {
Member

Suggested change
public class HiveMetaStoreProxy implements Catalog, Closeable, Serializable {
public class HiveMetaStoreCatalog implements Catalog, Closeable, Serializable {

table.setTableName(tableName);
table.setOwner(System.getProperty("user.name", "seatunnel"));
table.setCreateTime((int) (System.currentTimeMillis() / 1000));
table.setTableType("MANAGED_TABLE");
Member

For a custom template, should we still set the type to MANAGED_TABLE?

sd.setCols(cols);

// Set table location
String tableLocation = HiveTableTemplateUtils.getDefaultTableLocation(dbName, tableName);
Member

The location should also be extracted from the template.
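
One way to read this suggestion is to look for a `LOCATION '...'` clause in the CREATE TABLE template and fall back to the default location only when none is present. The sketch below is an assumption-laden illustration, not the PR's implementation: `TemplateLocationSketch`, `extractLocation`, and `resolveLocation` are hypothetical names, and a regex is used in place of a real SQL parser, so it would miss quoted strings containing escaped quotes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the LOCATION clause out of a CREATE TABLE template.
final class TemplateLocationSketch {
    // Matches: LOCATION 'some/path' (case-insensitive, single-quoted path).
    private static final Pattern LOCATION =
            Pattern.compile("\\bLOCATION\\s+'([^']*)'", Pattern.CASE_INSENSITIVE);

    // Returns the path from the template, or empty if the clause is absent.
    static Optional<String> extractLocation(String createTableSql) {
        Matcher matcher = LOCATION.matcher(createTableSql);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    // Prefers the template's location; otherwise uses the supplied default.
    static String resolveLocation(String createTableSql, String defaultLocation) {
        return extractLocation(createTableSql).orElse(defaultLocation);
    }
}
```

With this approach, the value returned by `HiveTableTemplateUtils.getDefaultTableLocation(dbName, tableName)` would only be passed in as the `defaultLocation` fallback.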

yzeng1618 requested a review from liunaijie on September 11, 2025 03:45

executeJob(
        container,
        "/auto_table_creation/fake_to_hive_create_when_not_exist.conf",
        "/auto_table_creation/hive_auto_create_to_assert.conf");
Member

Why use a new SeaTunnel job to verify the table? Maybe we can just verify the new table with the Hive client.

Comment on lines 164 to 169
// For Hive, data save mode is handled by the existing OVERWRITE parameter
// No additional data handling is needed here
log.info(
        "Data save mode handling is managed by existing OVERWRITE parameter for table {}.{}",
        dbName,
        tableName);
Member

Let's merge the overwrite logic with datasavemode. Setting overwrite=true or datasavemode=TRUNCATE should do the same thing.
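
A minimal sketch of what such a merge could look like, assuming the two settings are simply OR-ed together. The enum below is hypothetical and only mirrors the TRUNCATE name used in this comment; the actual data save mode values in SeaTunnel may be named differently.

```java
// Hypothetical enum: TRUNCATE mirrors the name used in the review comment.
enum DataSaveModeSketch { APPEND, TRUNCATE }

final class OverwriteResolver {
    // Existing data is cleared if either setting asks for it.
    static boolean shouldOverwrite(boolean overwriteFlag, DataSaveModeSketch mode) {
        return overwriteFlag || mode == DataSaveModeSketch.TRUNCATE;
    }
}
```

Under this reading, `overwrite = true` with an append mode and `overwrite = false` with TRUNCATE both lead to the same write path.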

Comment on lines 72 to 73
// Initialize partition fields from template if available
this.partitionFields = extractPartitionFieldsFromConfig();
Member

It looks like this field is only used by the test case. We should verify partition fields in the test case another way, not in runtime code.

this.partitionFields = extractPartitionFieldsFromConfig();
}

public HiveSaveModeHandler(
Member

useless?

private final List<String> partitionFields;

private HiveMetaStoreCatalog hiveCatalog;
private Catalog optionalCatalog;
Member

Why not use hiveCatalog directly instead of optionalCatalog?

import java.util.List;

@Slf4j
public class HiveSaveModeHandler implements SaveModeHandler, AutoCloseable {
Member

Why not implement DefaultSaveModeHandler? A lot of the code is duplicated.

public static final Option<String> SAVE_MODE_CREATE_TEMPLATE =
        Options.key("save_mode_create_template")
                .stringType()
                .noDefaultValue()
Member

default value?

FakeSource {
schema = {
fields {
pk_id = bigint
Member

Could you add a test that covers all field types?

case TIMESTAMP:
return "timestamp";
case ROW:
return "struct";
Member

Isn't this a problem when a struct needs its fields mapped?

case ROW:
return "struct";
case ARRAY:
return "array";
Member

ditto

case ARRAY:
return "array";
case MAP:
return "map";
Member

ditto
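
The concern in these three comments is that Hive complex types carry their element and field types, written as `struct<name:type,...>`, `array<type>`, and `map<keyType,valueType>`, so returning the bare keyword loses information. A recursive conversion is one way to address that. The sketch below uses a hypothetical, self-contained type model (`Primitive`, `ArrayOf`, `MapOf`, `Row`) rather than SeaTunnel's own data type classes, so treat it as an illustration of the recursion only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical type model; not SeaTunnel's SeaTunnelDataType hierarchy.
final class HiveTypeSketch {
    interface Type {}

    static final class Primitive implements Type {
        final String hiveName;
        Primitive(String hiveName) { this.hiveName = hiveName; }
    }

    static final class ArrayOf implements Type {
        final Type element;
        ArrayOf(Type element) { this.element = element; }
    }

    static final class MapOf implements Type {
        final Type key;
        final Type value;
        MapOf(Type key, Type value) { this.key = key; this.value = value; }
    }

    static final class Row implements Type {
        // LinkedHashMap keeps the declared field order.
        final Map<String, Type> fields = new LinkedHashMap<>();
        Row field(String name, Type type) { fields.put(name, type); return this; }
    }

    // Recursively renders a type as a Hive DDL type string.
    static String toHiveType(Type type) {
        if (type instanceof Primitive) {
            return ((Primitive) type).hiveName;
        }
        if (type instanceof ArrayOf) {
            return "array<" + toHiveType(((ArrayOf) type).element) + ">";
        }
        if (type instanceof MapOf) {
            MapOf map = (MapOf) type;
            return "map<" + toHiveType(map.key) + "," + toHiveType(map.value) + ">";
        }
        if (type instanceof Row) {
            List<String> parts = new ArrayList<>();
            for (Map.Entry<String, Type> entry : ((Row) type).fields.entrySet()) {
                parts.add(entry.getKey() + ":" + toHiveType(entry.getValue()));
            }
            return "struct<" + String.join(",", parts) + ">";
        }
        throw new IllegalArgumentException("Unsupported type: " + type);
    }
}
```

For a row with a `bigint` field `id` and an array-of-string field `tags`, this renders `struct<id:bigint,tags:array<string>>`.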

yzeng1618 requested a review from Hisoka-X on September 16, 2025 11:44
Hisoka-X changed the title from "[Feature][Connector-Hive] Hive Sink Connector with Auto-Create Table Support" to "[Feature][Connector-V2] Hive sink support SchemaSaveMode and DataSaveMode" on Sep 17, 2025
Carl-Zhou-CN (Member)

@yzeng1618 Hi, I have a small suggestion: you can use this identifier or "done" to mark that a problem has been solved.
[image]
