| author | doufenghu <[email protected]> | 2024-03-16 19:32:42 +0800 |
|---|---|---|
| committer | doufenghu <[email protected]> | 2024-03-16 19:32:42 +0800 |
| commit | 25994fade7720a43021b25004ade13c71f941e88 (patch) | |
| tree | 9c91a57d526f0579f241add718d8cb114ff04468 /docs | |
| parent | 9ff68b2c631606cf06a7001036ff16475c52371c (diff) | |
[Improve][Docs] Add some help information for connector schema and knowledge base files.
Diffstat (limited to 'docs')
| Mode | File | Lines changed |
|---|---|---|
| -rw-r--r-- | docs/connector/connector.md | 38 |
| -rw-r--r-- | docs/grootstream-config.md | 19 |
| -rw-r--r-- | docs/user-guide.md | 55 |
3 files changed, 61 insertions, 51 deletions
diff --git a/docs/connector/connector.md b/docs/connector/connector.md
index 6bcc878..e36214c 100644
--- a/docs/connector/connector.md
+++ b/docs/connector/connector.md
@@ -7,13 +7,13 @@ Source Connector contains some common core features, and each source connector s
 sources:
   ${source_name}:
     type: ${source_connector_type}
-    # source table schema, config through fields or local_file or url
-    schema:
-      fields:
+    # Source schema, config through fields or local_file or url. if not set schema, all fields(Map<String, Object>) will be output.
+    schema:
+      fields:
         - name: ${field_name}
           type: ${field_type}
-      # local_file: "schema path"
-      # url: "schema http url"
+      # local_file: "/path/to/schema.json"
+      # url: "https://localhost:8080/schema.json"
     properties:
       ${prop_key}: ${prop_value}
 ```
@@ -29,12 +29,13 @@ The source connector supports reading only specified fields from the data source
 The Schema Structure refer to [Schema Structure](../user-guide.md#schema-structure).
 
 ## Schema Config
-Schema can config through fields or local_file or url.
+Schema can be configured through fields or local_file or url. If not set schema, all fields(Map<String, Object>) will be output. And local_file and url only support Avro schema format. More details see the [Avro Schema](https://avro.apache.org/docs/1.11.1/specification/).
 
-### fields
+### Fields
+It can be configured through array or sql style. It is recommended to use array style, which is more readable.
 ```yaml
 schema:
-  # by array
+  # array style
   fields:
     - name: ${field_name}
       type: ${field_type}
@@ -42,31 +43,28 @@ schema:
 
 ```yaml
 schema:
-  # by sql
+  # sql style
   fields: "struct<field_name:field_type, ...>"
   # can also without outer struct<>
   # fields: "field_name:field_type, ..."
 ```
 
-### local_file
-
+### Local File
+To retrieve the schema from a local file using its absolute path.
 ```yaml
 schema:
   # by array
   fields:
-  local_file: "schema path"
+  local_file: "/path/to/schema.json"
 ```
 
-### url
-Retrieve updated schema from URL for cycle, support dynamic schema. Not all connector support dynamic schema.
-
-The connectors that currently support dynamic schema include: clickHouse sink.
-
+### URL
+Some connectors support periodically fetching and updating the schema from a URL, such as the `ClickHouse Sink`.
 ```yaml
 schema:
   # by array
   fields:
-  url: "schema http url"
+  url: "https://localhost:8080/schema.json"
 ```
 
 # Sink Connector
@@ -81,8 +79,8 @@ sinks:
     # sink table schema, config through fields or local_file or url. if not set schema, all fields(Map<String, Object>) will be output.
     schema:
       fields: "struct<field_name:field_type, ...>"
-      # local_file: "schema path"
-      # url: "schema url"
+      # local_file: "/path/to/schema.json"
+      # url: "https://localhost:8080/schema.json"
     properties:
       ${prop_key}: ${prop_value}
 ```
diff --git a/docs/grootstream-config.md b/docs/grootstream-config.md
index a359c39..479f4a7 100644
--- a/docs/grootstream-config.md
+++ b/docs/grootstream-config.md
@@ -5,16 +5,25 @@ The purpose of this file is to provide a global configuration for the groot-stre
 
 ```yaml
 grootstream:
-  knowledge_base: # Define the knowledge base list.
-    - name: ${knowledge_base_name} # Define the name of the knowledge base, used to kb function.
-      fs_type: ${file_system_type} # Define the type of the file system.Support: local,hdfs,http.
-      fs_path: ${file_system_path} # Define the path of the file system.
+  knowledge_base: # Define the libraries
+    - name: ${knowledge_base_name}
+      fs_type: ${file_system_type}
+      fs_path: ${file_system_path}
       files:
         - ${file_name} # Define the file name of the knowledge base.
   properties: # Custom parameters.
     hos.path: ${hos_path}
     hos.bucket.name.traffic_file: ${traffic_file_bucket}
     hos.bucket.name.troubleshooting_file: ${troubleshooting_file_bucket}
-    scheduler.knowledge_base.update.interval.minutes: ${knowledge_base_update_interval_minutes}
+    scheduler.knowledge_base.update.interval.minutes: ${knowledge_base_update_interval_minutes} # Define the interval of the knowledge base file update.
 ```
+### Knowledge Base
+The knowledge base is a collection of libraries that can be used in the groot-stream job's UDFs. File system type can be specified `local` or `http` mode. If the value is `http`, must be `KB Repository` URL. The library will be dynamically updated according to the `scheduler.knowledge_base.update.interval.minutes` configuration.
+
+| Name | Type | Required | Default | Description |
+|:---------|:--------|:---------|:--------|:---------------------------------------------------------------------------|
+| name | String | Yes | - | The name of the knowledge base, used to [UDF](processor/udf.md) |
+| fs_type | String | Yes | - | The type of the file system. Enum: local and http. |
+| fs_path | String | Yes | - | The path of the file system. It can be file directory or http restful api. |
+| files | Array | No | - | The file list of the knowledge base object. |
diff --git a/docs/user-guide.md b/docs/user-guide.md
index fa05547..a8f5067 100644
--- a/docs/user-guide.md
+++ b/docs/user-guide.md
@@ -8,19 +8,20 @@ The main format of the config template file is `yaml`, for more details of this
 sources:
   inline_source:
     type: inline
-    fields:
-      - name: log_id
-        type: bigint
-      - name: recv_time
-        type: bigint
-      - name: fqdn_string
-        type: string
-      - name: client_ip
-        type: string
-      - name: server_ip
-        type: string
-      - name: decoded_as
-        type: string
+    schema:
+      fields:
+        - name: log_id
+          type: bigint
+        - name: recv_time
+          type: bigint
+        - name: fqdn_string
+          type: string
+        - name: client_ip
+          type: string
+        - name: server_ip
+          type: string
+        - name: decoded_as
+          type: string
     properties:
       data: '{"log_id": 1, "recv_time":"111","fqdn_string":"baidu.com", "client_ip":"192.168.0.1","server_ip":"120.233.20.242","decoded_as":"BASE", "dup_traffic_flag":1}'
       format: json
@@ -92,19 +93,20 @@ application:
 ## Schema Structure
 Some sources are not strongly limited schema, so you need use `fields` to define the field name and type. The source can customize the schema. Like `Kafka` `Inline` source etc.
 ```yaml
-fields:
-  - name: log_id
-    type: bigint
-  - name: recv_time
-    type: bigint
-  - name: fqdn_string
-    type: string
-  - name: client_ip
-    type: string
-  - name: server_ip
-    type: string
-  - name: decoded_as
-    type: string
+Schema:
+  fields:
+    - name: log_id
+      type: bigint
+    - name: recv_time
+      type: bigint
+    - name: fqdn_string
+      type: string
+    - name: client_ip
+      type: string
+    - name: server_ip
+      type: string
+    - name: decoded_as
+      type: string
 ```
 `name` The name of the field.
 `type` The data type of the field.
@@ -136,6 +138,7 @@ Sink is used to define where GrootStream needs to output data. Multiple sinks ca
 
 ## Application
 Used to define some common parameters of the job and the topology of the job. such as the name of the job, the parallelism of the job, etc. The following configuration parameters are supported.
+
 ### ENV
 Used to define job environment configuration information. For more details, you can refer to the documentation [JobEnvConfig](./env-config.md).
 
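
For readers comparing the two `fields` styles this commit documents, the following is a minimal sketch of how the inline source from `docs/user-guide.md` might look in sql style. It assumes the sql-style string accepts the same field names and type names as the array style, which the diff implies but does not state outright; it has not been checked against a running GrootStream job.

```yaml
sources:
  inline_source:
    type: inline
    schema:
      # sql style; assumed equivalent to the array-style field list in docs/user-guide.md
      fields: "struct<log_id:bigint, recv_time:bigint, fqdn_string:string, client_ip:string, server_ip:string, decoded_as:string>"
    properties:
      format: json
```

The commit itself recommends the array style as more readable, so the sql style is probably best kept for short schemas.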
