authordoufenghu <[email protected]>2023-12-31 00:43:48 +0800
committerdoufenghu <[email protected]>2023-12-31 00:43:48 +0800
commitfdd7119689ec54c3a5446062b71e759b5fed4b9f (patch)
treeb7f873904608e8183b944105b045c0f43ac71546 /docs
parent92e5b06386419a550099598eaaab3abcc4584e74 (diff)
[Fix][Bootstrap] Fix the issue where the parallelism set in the job configuration file does not take effect, and clean up how RuntimeEnvironment loads the env config so its boundaries are clearer. Add the env-config document, describing the priority among the different ways of specifying the job name, and the priority and override logic of parallelism specified at different levels.
Diffstat (limited to 'docs')
-rw-r--r-- docs/env-config.md | 29
-rw-r--r-- docs/user-guide.md | 36
2 files changed, 55 insertions, 10 deletions
diff --git a/docs/env-config.md b/docs/env-config.md
new file mode 100644
index 0000000..72ddb52
--- /dev/null
+++ b/docs/env-config.md
@@ -0,0 +1,29 @@
+# The Job Environment Configuration
+The env configuration includes basic parameters and engine parameters.
+## Basic Parameters
+
+### name
+This parameter is used to define the name of the job. The job name can also be specified when submitting to the Flink cluster with the `flink run` command. If it is not specified at all, the default name `groot-stream-job` is used.
+Among these three ways of specifying the job name, the priority is: `flink run` > `name` in the configuration file > default name.
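+
+For instance, a minimal sketch of setting the name in the job configuration file, following the `application`/`env` layout shown in the user guide (the value is illustrative):
+```yaml
+application:
+  env:
+    name: my-groot-job  # overrides the default groot-stream-job, but is itself overridden by `flink run`
+```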
+
+### parallelism
+An execution environment defines a default parallelism for all processors, filters, data sources, and data sinks it executes. In addition, the parallelism of a job can be specified at different levels, with priority `Operator Level` > `Execution Environment Level` > `Client Level` > `System Level`:
+- Operator Level: The parallelism of a processor, filter, data source, or data sink can be specified in the configuration file.
+- Execution Environment Level: The parallelism of a job can be specified in the env configuration file.
+- Client Level: The parallelism of a job can be specified by using the `flink run -p` command.
+- System Level: The parallelism of a job can be specified in the `flink-conf.yaml` file (`parallelism.default`).
+Note: The environment-level parallelism is overridden wherever a processor, filter, data source, or data sink configures its own parallelism explicitly in the configuration file, as in the sketch below.
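+
+A minimal sketch of the two file-based levels, following the `application` layout from the user guide (node names and values are illustrative):
+```yaml
+application:
+  env:
+    parallelism: 2        # Execution Environment Level: default for every operator
+  topology:
+    - name: inline_source
+      parallelism: 1      # Operator Level: overrides the env-level default for this node
+      downstream: [print_sink]
+    - name: print_sink
+      downstream: []      # inherits parallelism 2 from the env level
+```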
+
+### pipeline.object-reuse
+This parameter is used to enable/disable object reuse for the execution of the job; when enabled, Flink may reuse record objects between operators to reduce allocation, so user functions must not cache input records. If it is not specified, the default value is `false`.
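+
+For example (illustrative):
+```yaml
+application:
+  env:
+    pipeline.object-reuse: true
+```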
+
+### jars
+
+### pipeline.jars
+
+### pipeline.classpaths
+
+
+## Engine Parameters
+
+Flink engine parameters can be passed through by prefixing the corresponding Flink parameter name with `flink.`, such as `flink.execution.checkpointing.mode`, `flink.execution.checkpointing.timeout`, etc. For more details, please refer to the official [Flink documentation](https://flink.apache.org/).
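+
+For example, a sketch of passing checkpointing options through the prefix (values are illustrative):
+```yaml
+application:
+  env:
+    flink.execution.checkpointing.mode: EXACTLY_ONCE
+    flink.execution.checkpointing.timeout: 10min
+```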
diff --git a/docs/user-guide.md b/docs/user-guide.md
index 41af047..0be29f6 100644
--- a/docs/user-guide.md
+++ b/docs/user-guide.md
@@ -53,11 +53,11 @@ processing_pipelines:
key: value
functions: # [array of object] Function List
- function: DROP
- lookup_fields: [ '' ]
- output_fields: [ '' ]
+ lookup_fields: []
+ output_fields: []
filter: event.client_ip == '192.168.10.100'
- function: SNOWFLAKE_ID
- lookup_fields: [ '' ]
+ lookup_fields: []
output_fields: [ log_id ]
sinks:
@@ -66,14 +66,13 @@ sinks:
properties:
format: json
-
application: # [object] Application Configuration
- name: inline-to-console-job # [string] Application Name
env: # [object] Environment Variables
- execution.parallelism: 1 # [number] Job-Level Parallelism
+ name: inline-to-console-job # [string] Job Name
+ parallelism: 1 # [number] Job-Level Parallelism
topology: # [array of object] Node List. It will be used to build the data flow of the job DAG graph.
- name: inline_source # [string] Node Name, must be unique. It will be used as the name of the corresponding Flink operator, e.g. kafka_source when the processor type is SOURCE.
- parallelism: 1 # [number] Operator-Level Parallelism.
+ #parallelism: 1 # [number] Operator-Level Parallelism.
downstream: [filter]
- name: filter
downstream: [transform_processor]
@@ -82,7 +81,6 @@ application: # [object] Application Configuration
- name: session_record_processor
downstream: [print_sink]
- name: print_sink
- parallelism: 1
downstream: []
```
@@ -128,10 +126,28 @@ Source is used to define where GrootStream needs to ingest data. Multiple source
## Sinks
## Application
-
+Used to define common parameters of the job and the job topology, such as the job name, the job parallelism, etc. The following configuration parameters are supported.
### ENV
+Used to define the job environment configuration. For more details, refer to [JobEnvConfig](./env-config.md).
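+
+For example, an `env` block mixing basic and `flink.`-prefixed keys (the keys shown are illustrative):
+```yaml
+application:
+  env:
+    name: inline-to-console-job
+    parallelism: 1
+    flink.execution.checkpointing.mode: EXACTLY_ONCE
+```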
+
+
+# Command
+## Run a job by CLI
+```bash
+Usage: start.sh [options]
+Options:
+ --check Whether check config (default: false)
+ -c, --config <config file> Config file path, must be specified
+ -e, --deploy-mode <deploy mode> Deploy mode, only support [run] (default: run)
+ --target <target> Submitted target type, support [local, remote, yarn-session, yarn-per-job]
+ -n, --name <name> Job name (default: groot-stream-job)
+ -i, --variable <variable> User-defined parameters, eg. -i key=value (default: [])
+ -h, --help Show help message
+ -v, --version Show version message
+
+```
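+
+For example, a hypothetical invocation (the script location and config path are illustrative):
+```bash
+sh bin/start.sh --config conf/inline-to-console.yaml --target local --name my-demo-job -i key=value
+```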
-
+```