[Feature][Engine]Metalake support for data source information storage and management #9688
base: dev
…nel-env.cmd files; modify EnvCommonOptions class to include metalake configuration option
…metalake_enable=true
…ntroduce factory for type-based MetalakeClient creation; Fetch Metalake data by sourceId and replace Config placeholders
cc @liugddx
Please add test cases and documentation.
seatunnel-api/pom.xml
Outdated
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.13</version>
Is this change necessary?
We need to send HTTP requests to fetch metadata from Gravitino, so an HTTP client is necessary.
}

sink {
  Console {
Please use the Assert connector.
I now use the Assert connector instead of Console.
boolean metalakeEnabled =
        Boolean.parseBoolean(System.getenv().getOrDefault("METALAKE_ENABLED", "false"));
if (metalakeEnabled) {
    this.seaTunnelJobConfig =
            getMetalakeConfig(ConfigBuilder.of(Paths.get(jobDefineFilePath), variables));
} else {
    this.seaTunnelJobConfig = ConfigBuilder.of(Paths.get(jobDefineFilePath), variables);
}
We need to ensure that Flink, Spark, and Zeta all support this feature simultaneously. Therefore, it needs to be adapted in both the FlinkTaskExecuteCommand and SparkTaskExecuteCommand classes.
I added this feature to the Spark and Flink engines, and the tests have passed.
</dependency>
<dependency>
    <groupId>org.testcontainers</groupId>
    <artifactId>mysql</artifactId>
E2E test cases need to support Flink/Spark/Zeta, so they should be placed in seatunnel-connector-v2-e2e.
I moved MetalakeIT to seatunnel-connector-v2-e2e.
…or instead of console in the conf file for test
Place the document in the concept directory.
OK, I moved the docs to the concept directory.
The same as above
E2e cases will execute all engines, so there is no need to create separate test cases for Flink and Spark.
I have removed the test cases
The same as above
    return new ChangeStreamTableSourceCheckpoint(coordinatorState, subtaskState);
}

private Config getMetalakeConfig(Config jobConfigTmp) {
Could you refactor this? I see this method three times.
I have refactored this method.
if (strValue.startsWith("${") && strValue.endsWith("}")) {
    String placeholder = strValue.substring(2, strValue.length() - 1);

    if (metalakeJson.has(placeholder)) {
        String replaced = metalakeJson.get(placeholder).asText();
        tmp =
                tmp.withValue(
                        subKey,
                        ConfigValueFactory.fromAnyRef(replaced));
    }
Please reuse PlaceholderUtils.
I reused PlaceholderUtils here
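For readers following the placeholder discussion, here is a minimal sketch of `${...}` substitution that handles placeholders anywhere in a string rather than only when the whole value is one placeholder. It is not the project's PlaceholderUtils; the class name `PlaceholderSketch`, the `Map` standing in for the metalake JSON response, and the choice to leave unknown placeholders untouched are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the real PlaceholderUtils from SeaTunnel.
final class PlaceholderSketch {
    // Matches ${name}; group 1 captures the text between the braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // supportedValues stands in for the metadata fetched from the metalake.
    static String replacePlaceholders(String input, Map<String, String> supportedValues) {
        Matcher matcher = PLACEHOLDER.matcher(input);
        StringBuilder result = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            // Copy the literal text between the previous match and this one.
            result.append(input, last, matcher.start());
            String value = supportedValues.get(matcher.group(1));
            // Assumption: unknown placeholders are kept as written.
            result.append(value != null ? value : matcher.group());
            last = matcher.end();
        }
        result.append(input, last, input.length());
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metaInfo = Map.of("jdbc-user", "root", "jdbc-password", "secret");
        System.out.println(replacePlaceholders("user=${jdbc-user};pw=${jdbc-password}", metaInfo));
    }
}
```

Appending the replacement text directly, instead of going through `Matcher.appendReplacement`, avoids having to escape `$` and `\` characters that may occur in passwords.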
import static org.apache.seatunnel.e2e.common.util.ContainerUtil.PROJECT_ROOT_PATH;
import static org.awaitility.Awaitility.given;

public class MetalakeIT extends SeaTunnelContainer {
I don't think we need such a complex test case. Just verify that the job config has the right username/password after it is parsed by metalake; there is no need to run the job. We can also mock the environment with
Line 758 in dac8fda
@SetEnvironmentVariable( |
This annotation sets environment variables in the test environment, but we need to set them in the test container.
docs/zh/concept/metalake.md
Outdated
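@@ -0,0 +1,69 @@
# METALAKE

When seatunnel executes a job, private information such as database usernames and passwords has to be written in plain text in the script, which may leak information; it is also hard to maintain, because changes to data source information may have to be applied manually.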
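- When seatunnel executes a job, private information such as database usernames and passwords has to be written in plain text in the script, which may leak information; it is also hard to maintain, because changes to data source information may have to be applied manually.
+ When SeaTunnel executes a job, private information such as database usernames and passwords has to be written in plain text in the script, which may leak information; it is also hard to maintain, because changes to data source information may have to be applied manually.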
It is fixed
docs/zh/concept/metalake.md
Outdated
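
When seatunnel executes a job, private information such as database usernames and passwords has to be written in plain text in the script, which may leak information; it is also hard to maintain, because changes to data source information may have to be applied manually.

Therefore metalake is introduced: data source information is stored in a metalake such as Apache Gravitino, the job script uses `sourId` and placeholders in place of the original username, password, and similar information, and at runtime seatunnel-engine fetches the information from the metalake over HTTP and substitutes it for the placeholders.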
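- Therefore metalake is introduced: data source information is stored in a metalake such as Apache Gravitino, the job script uses `sourId` and placeholders in place of the original username, password, and similar information, and at runtime seatunnel-engine fetches the information from the metalake over HTTP and substitutes it for the placeholders.
+ Therefore metalake is introduced: data source information is stored in a metalake such as Apache Gravitino, the job script uses `sourceId` and placeholders in place of the original username, password, and similar information, and at runtime seatunnel-engine fetches the information from the metalake over HTTP and substitutes it for the placeholders.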
It is fixed
url = "jdbc:mysql://mysql-e2e:3306/seatunnel?useSSL=false&serverTimezone=UTC&allowPublicKeyRetrieval=true"
driver = "${jdbc-driver}"
connection_check_timeout_sec = 100
sourceId = "test_catalog"
Why use sourceId instead of source_name?
source_name is the ID of the catalog in Apache Gravitino, but other metalake types may not use source_name.
Thanks @wtybxqm for the update! It looks great!
if (entity == null) {
    throw new RuntimeException("No response entity");
}
ObjectMapper mapper = new ObjectMapper();
Please reuse JsonUtils
I added a readTree method to JsonUtils and reused it here.
String metalakeType = System.getenv("METALAKE_TYPE");
String metalakeUrl = System.getenv("METALAKE_URL");
It might be better to let users override this in the env part of the job config. cc @liugddx
> It might be better to let users override this in the env part of the job config. cc @liugddx

Yes. Users can set these variables in the env part of the job config; if they are not present there, they are read from the system environment. Config envConfig = config.getConfig("env");
if (value.valueType() == ConfigValueType.STRING) {
    String strValue = (String) value.unwrapped();
    Matcher matcher = pattern.matcher(strValue);
    if (matcher.find()) {
        String placeholder = matcher.group(1);
How about moving this part into PlaceholderUtils as a new method?
I have moved this into PlaceholderUtils
if (sinkObj.containsKey("sourceId")) {
    ConfigObject tmp = sinkObj;
    String sourceId = sinkObj.toConfig().getString("sourceId");
    JsonNode metalakeJson = metalakeClient.getMetaInfo(sourceId);
    for (Map.Entry<String, ConfigValue> entry : sinkObj.entrySet()) {
        String subKey = entry.getKey();
        ConfigValue value = entry.getValue();

        if (value.valueType() == ConfigValueType.STRING) {
            String strValue = (String) value.unwrapped();
            Matcher matcher = pattern.matcher(strValue);
            if (matcher.find()) {
                String placeholder = matcher.group(1);

                if (metalakeJson.has(placeholder)) {
                    String replaced = metalakeJson.get(placeholder).asText();
The source and sink parsing share the same code. Let's refactor it again!
I have refactored the duplicated code.
httpcore-4.4.13.jar
httpcore-4.4.16.jar
httpcore-4.4.4.jar
It's weird that we introduce two new versions of httpcore. Can you analyze the dependencies and use only httpcore-4.4.16.jar?
It is indeed weird. I only introduced httpclient-4.5.13, which is already used in other parts of the project, but the Dependency License check asked me to add the other two versions of httpcore.
We should find the root cause before merging it. cc @liugddx
Can we unify the version?
Sure.
I have unified the version of httpcore to 4.4.16 and removed the extra version in known-dependency.txt
@Slf4j
public class MetalakeConfigUtils {

    public static Config getMetalakeConfig(Config jobConfigTmp) {
We should support transform too, for example the OpenAI key in LLM/Embedding, etc.
I have added transform support here.
Config envConfig =
        ConfigBuilder.of(Paths.get(jobDefineFilePath), variables).getConfig("env");
boolean metalakeEnabled =
        envConfig.hasPath("metalake_enabled")
                ? envConfig.getBoolean("metalake_enabled")
                : Boolean.parseBoolean(
                        System.getenv().getOrDefault("METALAKE_ENABLED", "false"));
if (metalakeEnabled) {
    this.seaTunnelJobConfig =
            MetalakeConfigUtils.getMetalakeConfig(
                    ConfigBuilder.of(Paths.get(jobDefineFilePath), variables));
} else {
    this.seaTunnelJobConfig = ConfigBuilder.of(Paths.get(jobDefineFilePath), variables);
}
Can this part be refactored too?
I have refactored this part
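One possible shape for that refactor, sketched with plain maps instead of the Typesafe `Config` type so it runs on its own: the flag is resolved in one place (job-level `env` first, then the process environment), and the already-parsed config is passed through a resolver only when the flag is on, so the job file is parsed once. `MetalakeToggleSketch`, `isMetalakeEnabled`, and `applyIfEnabled` are hypothetical names, not the ones used in the PR.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch; the real code works on com.typesafe.config.Config.
final class MetalakeToggleSketch {
    // The job-level env setting wins; the process environment is the fallback.
    static boolean isMetalakeEnabled(Map<String, String> jobEnv, Map<String, String> systemEnv) {
        String raw = jobEnv.containsKey("metalake_enabled")
                ? jobEnv.get("metalake_enabled")
                : systemEnv.getOrDefault("METALAKE_ENABLED", "false");
        return Boolean.parseBoolean(raw);
    }

    // Takes an already-parsed config so the job file is read only once.
    static <C> C applyIfEnabled(C parsedConfig, boolean enabled, UnaryOperator<C> metalakeResolver) {
        return enabled ? metalakeResolver.apply(parsedConfig) : parsedConfig;
    }

    public static void main(String[] args) {
        Map<String, String> jobEnv = Map.of("metalake_enabled", "true");
        boolean enabled = isMetalakeEnabled(jobEnv, System.getenv());
        // A String stands in for the parsed job config in this demo.
        String config = applyIfEnabled("password=${password}", enabled,
                c -> c.replace("${password}", "secret"));
        System.out.println(config);
    }
}
```

Because the helper is generic in the config type, the same method could be called from the Zeta, Flink, and Spark entry points without duplicating the branch.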
Thanks @wtybxqm. We are almost finished with this.
    return result.toString();
}

public static String replacePlaceholders(String input, JsonNode json) {
- public static String replacePlaceholders(String input, JsonNode json) {
+ public static String replacePlaceholders(String input, JsonNode supportedValues) {
Config update = jobConfigTmp;
try {
    ConfigList sourceList = jobConfigTmp.getList("source");
    update =
            update.withValue(
                    "source",
                    ConfigValueFactory.fromIterable(
                            replaceConfigList(sourceList, envConfig)));
} catch (IOException e) {
    log.error("Fail to get MetaInfo", e);
}

try {
    ConfigList sinkList = jobConfigTmp.getList("sink");
    update =
            update.withValue(
                    "sink",
                    ConfigValueFactory.fromIterable(
                            replaceConfigList(sinkList, envConfig)));
} catch (IOException e) {
    log.error("Fail to get MetaInfo", e);
}

try {
    ConfigList sinkList = jobConfigTmp.getList("transform");
    update =
            update.withValue(
                    "transform",
                    ConfigValueFactory.fromIterable(
                            replaceConfigList(sinkList, envConfig)));
} catch (IOException e) {
    log.error("Fail to get MetaInfo", e);
}
I found that this part can be refactored too.
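The three near-identical blocks could collapse into one loop over the section names. The sketch below illustrates that idea with plain `Map`/`List` structures in place of `Config` and `ConfigList`, and a `Function` standing in for the metalake lookup; `SectionLoopSketch`, `resolveAll`, and `resolvePlugin` are hypothetical names, and skipping a missing section is an assumption rather than the PR's behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; the real code operates on Typesafe Config objects.
final class SectionLoopSketch {
    private static final List<String> SECTIONS = List.of("source", "transform", "sink");

    // metaLookup stands in for a call such as metalakeClient.getMetaInfo(sourceId).
    static Map<String, List<Map<String, String>>> resolveAll(
            Map<String, List<Map<String, String>>> job,
            Function<String, Map<String, String>> metaLookup) {
        Map<String, List<Map<String, String>>> updated = new LinkedHashMap<>(job);
        for (String section : SECTIONS) {
            List<Map<String, String>> plugins = job.get(section);
            if (plugins == null) {
                continue; // assumption: a missing section is simply skipped
            }
            List<Map<String, String>> resolved = new ArrayList<>();
            for (Map<String, String> plugin : plugins) {
                resolved.add(resolvePlugin(plugin, metaLookup));
            }
            updated.put(section, resolved);
        }
        return updated;
    }

    // Only plugins that declare a sourceId have their placeholders replaced.
    static Map<String, String> resolvePlugin(
            Map<String, String> plugin, Function<String, Map<String, String>> metaLookup) {
        String sourceId = plugin.get("sourceId");
        if (sourceId == null) {
            return plugin;
        }
        Map<String, String> meta = metaLookup.apply(sourceId);
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> option : plugin.entrySet()) {
            String value = option.getValue();
            for (Map.Entry<String, String> m : meta.entrySet()) {
                value = value.replace("${" + m.getKey() + "}", m.getValue());
            }
            out.put(option.getKey(), value);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Map<String, String>>> job = Map.of(
                "source", List.of(Map.of("sourceId", "test_catalog", "password", "${password}")));
        System.out.println(resolveAll(job, id -> Map.of("password", "secret")));
    }
}
```

With this shape, supporting another section means adding one string to the list instead of another try/catch block.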
…etMetalakeConfig method
Config envConfig = updateConfig.getConfig("env");
String metalakeType =
        envConfig.hasPath("metalake_type")
                ? envConfig.getString("metalake_type")
                : System.getenv("METALAKE_TYPE");
String metalakeUrl =
        envConfig.hasPath("metalake_url")
                ? envConfig.getString("metalake_url")
                : System.getenv("METALAKE_URL");
MetalakeClient metalakeClient = MetalakeClientFactory.create(metalakeType, metalakeUrl);
We should move this part outside replaceConfigList. The metalakeClient should be created only once.
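A rough sketch of what "created only once" could look like: the caller builds the client through a type-based factory, then hands the same instance to every per-section call. All names here (`SketchMetalakeClient`, `SketchMetalakeClientFactory`, `CreateOnceSketch`) are made up for illustration, the stub returns fixed data instead of calling a real metalake, and the URL in the demo is a placeholder.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the PR's MetalakeClient interface.
interface SketchMetalakeClient {
    Map<String, String> getMetaInfo(String sourceId);
}

// Hypothetical type-based factory; counts creations so the demo can show "once".
final class SketchMetalakeClientFactory {
    static final AtomicInteger CREATED = new AtomicInteger();

    static SketchMetalakeClient create(String type, String url) {
        if (type == null || url == null) {
            throw new IllegalArgumentException("metalake type and url are required");
        }
        switch (type.toLowerCase(Locale.ROOT)) {
            case "gravitino":
                CREATED.incrementAndGet();
                // A real client would send an HTTP request to url; this stub echoes its inputs.
                return sourceId -> Map.of("sourceId", sourceId, "endpoint", url);
            default:
                throw new IllegalArgumentException("Unsupported metalake type: " + type);
        }
    }
}

final class CreateOnceSketch {
    // The client is built once here and shared by all sections.
    static int resolveSections(List<String> sections, String type, String url) {
        SketchMetalakeClient client = SketchMetalakeClientFactory.create(type, url);
        int resolved = 0;
        for (String section : sections) {
            resolved += resolveSection(section, client);
        }
        return resolved;
    }

    // Receives the shared client instead of constructing its own.
    static int resolveSection(String section, SketchMetalakeClient client) {
        return client.getMetaInfo(section + "-id").isEmpty() ? 0 : 1;
    }

    public static void main(String[] args) {
        int n = resolveSections(List.of("source", "transform", "sink"), "gravitino", "http://localhost:8090");
        System.out.println(n + " sections resolved, clients created: "
                + SketchMetalakeClientFactory.CREATED.get());
    }
}
```

Passing the client as a parameter also makes the per-section method easy to test with a stub, since it no longer reads environment variables itself.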
Purpose of this pull request
This PR addresses the problem that sensitive information, such as data source usernames and passwords, is written directly into task scripts.
close #9687
Does this PR introduce any user-facing change?
yes
To enable Metalake, users need to configure three critical settings in seatunnel.sh: first set 'metalake_enable=true', then specify their 'metalake_url' and 'metalake_type'. Once activated, users can simply add a 'sourceId' field in their task script's source/sink configurations and replace sensitive credentials with secure placeholders like '${password}'.
How was this patch tested?
Integration testing has passed.