@wtybxqm wtybxqm commented Aug 10, 2025

Purpose of this pull request

This PR addresses the problem that sensitive information, such as data source usernames and passwords, is written directly into task scripts.
Closes #9687

Does this PR introduce any user-facing change?

yes
To enable Metalake, users need to configure three critical settings in seatunnel.sh: first set 'metalake_enable=true', then specify their 'metalake_url' and 'metalake_type'. Once activated, users can simply add a 'sourceId' field in their task script's source/sink configurations and replace sensitive credentials with secure placeholders like '${password}'.

How was this patch tested?

Integration testing has passed.

Check list

wtybxqm added 4 commits July 20, 2025 18:57
…nel-env.cmd files; modify EnvCommonOptions class to include metalake configuration option
…ntroduce factory for type-based MetalakeClient creation; Fetch Metalake data by sourceId and replace Config placeholders
@Hisoka-X
Member

cc @liugddx

@liugddx
Member

liugddx commented Aug 11, 2025

Please add test cases and add documentation.

@github-actions github-actions bot added the e2e label Aug 25, 2025
Comment on lines 45 to 47
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.13</version>
Member

Is this change necessary?

Author

We need to send HTTP requests to fetch meta info from Gravitino, so an HTTP client is necessary.

}

sink {
Console {
Member

Please use the Assert connector.

Author

I now use the Assert connector instead of Console.

Comment on lines 184 to 191
boolean metalakeEnabled =
        Boolean.parseBoolean(System.getenv().getOrDefault("METALAKE_ENABLED", "false"));
if (metalakeEnabled) {
    this.seaTunnelJobConfig =
            getMetalakeConfig(ConfigBuilder.of(Paths.get(jobDefineFilePath), variables));
} else {
    this.seaTunnelJobConfig = ConfigBuilder.of(Paths.get(jobDefineFilePath), variables);
}
Member

We need to ensure that Flink/Spark/Zeta all support this feature simultaneously.

Therefore, it needs to be adapted in both the FlinkTaskExecuteCommand and SparkTaskExecuteCommand classes.

Author

I added this feature to the Spark and Flink engines, and the tests pass.

</dependency>
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>mysql</artifactId>
Member

E2E test cases need to support Flink/Spark/Zeta, so they should be placed in seatunnel-connector-v2-e2e.

Author

I moved MetalakeIT to seatunnel-connector-v2-e2e.

@github-actions github-actions bot added the core SeaTunnel core module label Sep 8, 2025
Member

Place the document in the concept directory.

Author

OK, I moved the docs to the concept directory.

Member

The same as above

Member

E2e cases will execute all engines, so there is no need to create separate test cases for Flink and Spark.

Author

I have removed the test cases

Member

The same as above

@github-actions github-actions bot added the dependencies Pull requests that update a dependency file label Sep 11, 2025
return new ChangeStreamTableSourceCheckpoint(coordinatorState, subtaskState);
}

private Config getMetalakeConfig(Config jobConfigTmp) {
Member

Could you refactor this? I see this method three times.

Author

I have refactored this method.

Comment on lines 874 to 883
if (strValue.startsWith("${") && strValue.endsWith("}")) {
    String placeholder = strValue.substring(2, strValue.length() - 1);

    if (metalakeJson.has(placeholder)) {
        String replaced = metalakeJson.get(placeholder).asText();
        tmp =
                tmp.withValue(
                        subKey,
                        ConfigValueFactory.fromAnyRef(replaced));
    }
Member

Please reuse PlaceholderUtils.

Author

I reused PlaceholderUtils here

import static org.apache.seatunnel.e2e.common.util.ContainerUtil.PROJECT_ROOT_PATH;
import static org.awaitility.Awaitility.given;

public class MetalakeIT extends SeaTunnelContainer {
Member

I think we don't need such a complex test case. Just verify that the job config has the right password/username after being parsed by metalake; there is no need to run it. Also we can mock the environment by

Author

This annotation sets env variables in the test environment, but we need to set them in the test container.

@@ -0,0 +1,69 @@
# METALAKE

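When seatunnel executes a task, private information such as database usernames and passwords has to be written in plaintext in the script, which may lead to information leakage; it is also hard to maintain, because changes to data source information may require manual updates.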
Member

Suggested change
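When seatunnel executes a task, private information such as database usernames and passwords has to be written in plaintext in the script, which may lead to information leakage; it is also hard to maintain, because changes to data source information may require manual updates.
When SeaTunnel executes a task, private information such as database usernames and passwords has to be written in plaintext in the script, which may lead to information leakage; it is also hard to maintain, because changes to data source information may require manual updates.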

Author

It is fixed


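When seatunnel executes a task, private information such as database usernames and passwords has to be written in plaintext in the script, which may lead to information leakage; it is also hard to maintain, because changes to data source information may require manual updates.

Therefore metalake is introduced: data source information is stored in a metalake such as Apache Gravitino, and the task script uses `sourId` and placeholders in place of the original username, password, and similar information. At runtime, seatunnel-engine fetches the information from the metalake via HTTP requests and substitutes it according to the placeholders.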
Member

Suggested change
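Therefore metalake is introduced: data source information is stored in a metalake such as Apache Gravitino, and the task script uses `sourId` and placeholders in place of the original username, password, and similar information. At runtime, seatunnel-engine fetches the information from the metalake via HTTP requests and substitutes it according to the placeholders.
Therefore metalake is introduced: data source information is stored in a metalake such as Apache Gravitino, and the task script uses `sourceId` and placeholders in place of the original username, password, and similar information. At runtime, seatunnel-engine fetches the information from the metalake via HTTP requests and substitutes it according to the placeholders.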

Author

It is fixed

url = "jdbc:mysql://mysql-e2e:3306/seatunnel?useSSL=false&serverTimezone=UTC&allowPublicKeyRetrieval=true"
driver = "${jdbc-driver}"
connection_check_timeout_sec = 100
sourceId = "test_catalog"
Member

Why use sourceId rather than source_name?

Author

source_name is the id of the catalog in Apache Gravitino, but source_name may not be used in other metalake types.

Member

@Hisoka-X Hisoka-X left a comment

Thanks @wtybxqm for the update! It looks great!

if (entity == null) {
throw new RuntimeException("No response entity");
}
ObjectMapper mapper = new ObjectMapper();
Member

Please reuse JsonUtils

Author

I added a readTree method to JsonUtils and reused it here.

Comment on lines 46 to 47
String metalakeType = System.getenv("METALAKE_TYPE");
String metalakeUrl = System.getenv("METALAKE_URL");
Member

Maybe user can overwrite it on job config env part is better. cc @liugddx

Member

Maybe user can overwrite it on job config env part is better. cc @liugddx

Yes, users can read the variables from the env block of the job config and, if they are not present there, fall back to the system environment: Config envConfig = config.getConfig("env");
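
For readers following the thread, the precedence described in this comment could be sketched roughly as below. This is only an illustration under stated assumptions: plain `Map`s stand in for the job config's `env` block and for the system environment, and the class name `MetalakeSettingSketch`, the method `resolve`, and the sample value `gravitino` are hypothetical, not part of the PR. The upper-casing of the key mirrors the `metalake_type` / `METALAKE_TYPE` pairing seen in the reviewed snippets.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class MetalakeSettingSketch {

    /**
     * Returns the value from the job-level env map if present; otherwise falls back
     * to the system environment under the upper-cased key. Returns null if neither
     * location defines the setting.
     */
    public static String resolve(
            Map<String, String> jobEnv, Map<String, String> systemEnv, String key) {
        if (jobEnv.containsKey(key)) {
            return jobEnv.get(key);
        }
        return systemEnv.get(key.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Hypothetical job-level override; "gravitino" is an assumed sample value.
        Map<String, String> jobEnv = new HashMap<>();
        jobEnv.put("metalake_type", "gravitino");

        System.out.println(resolve(jobEnv, System.getenv(), "metalake_type"));
        // Not set in the job config, so this consults METALAKE_URL in the environment.
        System.out.println(resolve(jobEnv, System.getenv(), "metalake_url"));
    }
}
```

Passing the system environment in as a parameter, rather than calling `System.getenv` inside the method, keeps the lookup easy to test without touching the process environment.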

Comment on lines 65 to 69
if (value.valueType() == ConfigValueType.STRING) {
    String strValue = (String) value.unwrapped();
    Matcher matcher = pattern.matcher(strValue);
    if (matcher.find()) {
        String placeholder = matcher.group(1);
Member

How about moving this part into PlaceholderUtils as a new method?

Author

I have moved this into PlaceholderUtils
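
As an illustration of the kind of helper being discussed, here is a minimal sketch of regex-based `${name}` substitution. It is not SeaTunnel's actual `PlaceholderUtils`: the class name `PlaceholderSketch` is hypothetical, a plain `Map` stands in for the `JsonNode` returned by the metalake, and leaving unknown placeholders untouched is an assumed policy.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderSketch {

    // Matches ${name}; group(1) captures the name between the braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every ${name} in input whose name is a key of values. */
    public static String replacePlaceholders(String input, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String name = matcher.group(1);
            // Assumed policy: placeholders with no matching key are kept verbatim.
            String replacement =
                    values.containsKey(name) ? values.get(name) : matcher.group(0);
            // quoteReplacement keeps '$' and '\' in the value from being read as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("password", "secret");
        System.out.println(replacePlaceholders("${password}", values));
        System.out.println(replacePlaceholders("${unknown}", values));
    }
}
```

Quoting the replacement matters for credentials in particular, since passwords often contain `$` or backslashes.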

Comment on lines 98 to 113
if (sinkObj.containsKey("sourceId")) {
    ConfigObject tmp = sinkObj;
    String sourceId = sinkObj.toConfig().getString("sourceId");
    JsonNode metalakeJson = metalakeClient.getMetaInfo(sourceId);
    for (Map.Entry<String, ConfigValue> entry : sinkObj.entrySet()) {
        String subKey = entry.getKey();
        ConfigValue value = entry.getValue();

        if (value.valueType() == ConfigValueType.STRING) {
            String strValue = (String) value.unwrapped();
            Matcher matcher = pattern.matcher(strValue);
            if (matcher.find()) {
                String placeholder = matcher.group(1);

                if (metalakeJson.has(placeholder)) {
                    String replaced = metalakeJson.get(placeholder).asText();
Member

Parsing source and sink uses the same code. Let's refactor it again!

Author

I have refactored the duplicated code.

Comment on lines 12 to 14
httpcore-4.4.13.jar
httpcore-4.4.16.jar
httpcore-4.4.4.jar
Member

It's weird that we introduce two new versions of httpcore. Can you analyze the dependencies and use only httpcore-4.4.16.jar?

Author

It is indeed weird. I only introduced httpclient-4.5.13, which is already used in other parts of the project, but the Dependency License test told me to add the other two versions of httpcore.

Member

We should find the root cause before merging it. cc @liugddx

Member

httpcore has been introduced in multiple versions. Can we unify the version?

Member

Can we unify the version?

Sure.

Author

I have unified the version of httpcore to 4.4.16 and removed the extra version in known-dependency.txt

@github-actions github-actions bot removed the dependencies Pull requests that update a dependency file label Sep 17, 2025
@Slf4j
public class MetalakeConfigUtils {

public static Config getMetalakeConfig(Config jobConfigTmp) {
Member

We should support transform too. For example, OpenAI key in LLM/Embedding etc.

Author

I have supported transform here.

Comment on lines 176 to 189
Config envConfig =
        ConfigBuilder.of(Paths.get(jobDefineFilePath), variables).getConfig("env");
boolean metalakeEnabled =
        envConfig.hasPath("metalake_enabled")
                ? envConfig.getBoolean("metalake_enabled")
                : Boolean.parseBoolean(
                        System.getenv().getOrDefault("METALAKE_ENABLED", "false"));
if (metalakeEnabled) {
    this.seaTunnelJobConfig =
            MetalakeConfigUtils.getMetalakeConfig(
                    ConfigBuilder.of(Paths.get(jobDefineFilePath), variables));
} else {
    this.seaTunnelJobConfig = ConfigBuilder.of(Paths.get(jobDefineFilePath), variables);
}
Member

Can this part be refactored too?

Author

I have refactored this part

Member

@Hisoka-X Hisoka-X left a comment

Thanks @wtybxqm. We are almost finished.

return result.toString();
}

public static String replacePlaceholders(String input, JsonNode json) {
Member

Suggested change
public static String replacePlaceholders(String input, JsonNode json) {
public static String replacePlaceholders(String input, JsonNode supportedValues) {

Comment on lines 49 to 81
Config update = jobConfigTmp;
try {
    ConfigList sourceList = jobConfigTmp.getList("source");
    update =
            update.withValue(
                    "source",
                    ConfigValueFactory.fromIterable(
                            replaceConfigList(sourceList, envConfig)));
} catch (IOException e) {
    log.error("Fail to get MetaInfo", e);
}

try {
    ConfigList sinkList = jobConfigTmp.getList("sink");
    update =
            update.withValue(
                    "sink",
                    ConfigValueFactory.fromIterable(
                            replaceConfigList(sinkList, envConfig)));
} catch (IOException e) {
    log.error("Fail to get MetaInfo", e);
}

try {
    ConfigList sinkList = jobConfigTmp.getList("transform");
    update =
            update.withValue(
                    "transform",
                    ConfigValueFactory.fromIterable(
                            replaceConfigList(sinkList, envConfig)));
} catch (IOException e) {
    log.error("Fail to get MetaInfo", e);
}
Member

I think this part can be refactored too.

Comment on lines 57 to 66
Config envConfig = updateConfig.getConfig("env");
String metalakeType =
        envConfig.hasPath("metalake_type")
                ? envConfig.getString("metalake_type")
                : System.getenv("METALAKE_TYPE");
String metalakeUrl =
        envConfig.hasPath("metalake_url")
                ? envConfig.getString("metalake_url")
                : System.getenv("METALAKE_URL");
MetalakeClient metalakeClient = MetalakeClientFactory.create(metalakeType, metalakeUrl);
Member

We should move this part outside replaceConfigList. The metalakeClient should only be created once.
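
One possible shape for the last two review requests (looping over the sections instead of repeating the try block, and building the client once and passing it in) is sketched below. It is a simplified illustration, not the PR's code: nested `Map`s and `List`s stand in for the `Config` types, `MetaClient` is a hypothetical stand-in for `MetalakeClient`, and only whole-value `${name}` placeholders are handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetalakeRewriteSketch {

    /** Hypothetical stand-in for the PR's MetalakeClient. */
    public interface MetaClient {
        Map<String, String> getMetaInfo(String sourceId);
    }

    private static final List<String> SECTIONS = Arrays.asList("source", "sink", "transform");

    /** Rewrites all three sections with a single, caller-supplied client. */
    public static Map<String, List<Map<String, String>>> rewrite(
            Map<String, List<Map<String, String>>> job, MetaClient client) {
        Map<String, List<Map<String, String>>> updated = new LinkedHashMap<>(job);
        for (String section : SECTIONS) {
            List<Map<String, String>> plugins = job.get(section);
            if (plugins == null) {
                continue; // section not present in this job, nothing to rewrite
            }
            List<Map<String, String>> rewritten = new ArrayList<>();
            for (Map<String, String> plugin : plugins) {
                rewritten.add(rewritePlugin(plugin, client));
            }
            updated.put(section, rewritten);
        }
        return updated;
    }

    /** Resolves ${name} values of one plugin block that declares a sourceId. */
    public static Map<String, String> rewritePlugin(Map<String, String> plugin, MetaClient client) {
        String sourceId = plugin.get("sourceId");
        if (sourceId == null) {
            return plugin; // no metalake lookup requested for this block
        }
        Map<String, String> meta = client.getMetaInfo(sourceId);
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : plugin.entrySet()) {
            String value = entry.getValue();
            if (value != null && value.startsWith("${") && value.endsWith("}")) {
                String name = value.substring(2, value.length() - 1);
                // Unknown placeholders are left as-is in this sketch.
                out.put(entry.getKey(), meta.getOrDefault(name, value));
            } else {
                out.put(entry.getKey(), value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> plugin = new LinkedHashMap<>();
        plugin.put("sourceId", "test_catalog");
        plugin.put("password", "${password}");
        Map<String, List<Map<String, String>>> job = new LinkedHashMap<>();
        job.put("source", Arrays.asList(plugin));

        // The client is built once here and shared by every section.
        MetaClient client = id -> java.util.Collections.singletonMap("password", "secret");
        System.out.println(rewrite(job, client));
    }
}
```

Because the client is a parameter of `rewrite`, the caller decides when it is constructed, which makes the "created only once" property explicit rather than incidental.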

Development

Successfully merging this pull request may close these issues.

[Feature][Engine] Metalake support for data source information storage and management
3 participants