author doufenghu <[email protected]> 2024-03-16 19:32:42 +0800
committer doufenghu <[email protected]> 2024-03-16 19:32:42 +0800
commit 25994fade7720a43021b25004ade13c71f941e88
tree 9c91a57d526f0579f241add718d8cb114ff04468 /docs
parent 9ff68b2c631606cf06a7001036ff16475c52371c
[Improve][Docs] Add some help information for connector schema and knowledge base files.
Diffstat (limited to 'docs')
-rw-r--r-- docs/connector/connector.md | 38
-rw-r--r-- docs/grootstream-config.md | 19
-rw-r--r-- docs/user-guide.md | 55
3 files changed, 61 insertions, 51 deletions
diff --git a/docs/connector/connector.md b/docs/connector/connector.md
index 6bcc878..e36214c 100644
--- a/docs/connector/connector.md
+++ b/docs/connector/connector.md
@@ -7,13 +7,13 @@ Source Connector contains some common core features, and each source connector s
sources:
${source_name}:
type: ${source_connector_type}
- # source table schema, config through fields or local_file or url
- schema:
- fields:
+ # Source schema, configured through fields, local_file, or url. If no schema is set, all fields (Map<String, Object>) are output.
+ schema:
+ fields:
- name: ${field_name}
type: ${field_type}
- # local_file: "schema path"
- # url: "schema http url"
+ # local_file: "/path/to/schema.json"
+ # url: "https://localhost:8080/schema.json"
properties:
${prop_key}: ${prop_value}
```
@@ -29,12 +29,13 @@ The source connector supports reading only specified fields from the data source
For the schema structure, refer to [Schema Structure](../user-guide.md#schema-structure).
## Schema Config
-Schema can config through fields or local_file or url.
+A schema can be configured through `fields`, `local_file`, or `url`. If no schema is set, all fields (Map<String, Object>) are output. `local_file` and `url` support only the Avro schema format; for details, see the [Avro Schema](https://avro.apache.org/docs/1.11.1/specification/).
-### fields
+### Fields
+Fields can be configured in array style or SQL style. Array style is recommended because it is more readable.
```yaml
schema:
- # by array
+ # array style
fields:
- name: ${field_name}
type: ${field_type}
@@ -42,31 +43,28 @@ schema:
```yaml
schema:
- # by sql
+ # sql style
fields: "struct<field_name:field_type, ...>"
# the outer struct<> can also be omitted
# fields: "field_name:field_type, ..."
```
-### local_file
-
+### Local File
+Retrieves the schema from a local file, specified by its absolute path.
```yaml
schema:
# by array
fields:
- local_file: "schema path"
+ local_file: "/path/to/schema.json"
```
-### url
-Retrieve updated schema from URL for cycle, support dynamic schema. Not all connector support dynamic schema.
-
-The connectors that currently support dynamic schema include: clickHouse sink.
-
+### URL
+Some connectors, such as the `ClickHouse Sink`, support periodically fetching and updating the schema from a URL.
```yaml
schema:
# by array
fields:
- url: "schema http url"
+ url: "https://localhost:8080/schema.json"
```
# Sink Connector
@@ -81,8 +79,8 @@ sinks:
# Sink table schema, configured through fields, local_file, or url. If no schema is set, all fields (Map<String, Object>) are output.
schema:
fields: "struct<field_name:field_type, ...>"
- # local_file: "schema path"
- # url: "schema url"
+ # local_file: "/path/to/schema.json"
+ # url: "https://localhost:8080/schema.json"
properties:
${prop_key}: ${prop_value}
```
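Reviewer note: the array style and SQL style shown in `connector.md` describe the same field list. The sketch below illustrates that equivalence by converting a flat SQL-style string into array-style entries. The function name `parse_sql_style_fields` is hypothetical and is not part of groot-stream; the sketch assumes flat `name:type` pairs and does not handle nested types that contain commas.

```python
def parse_sql_style_fields(spec: str) -> list[dict[str, str]]:
    """Convert a flat SQL-style field string into array-style entries.

    Hypothetical helper for illustration only; not a groot-stream API.
    """
    text = spec.strip()
    # The outer struct<> wrapper is optional, so strip it when present.
    if text.lower().startswith("struct<") and text.endswith(">"):
        text = text[len("struct<"):-1]

    fields = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        # Each entry is expected to look like "field_name:field_type".
        name, sep, field_type = part.partition(":")
        name, field_type = name.strip(), field_type.strip()
        if not sep or not name or not field_type:
            raise ValueError(f"expected 'name:type', got {part!r}")
        fields.append({"name": name, "type": field_type})
    return fields


if __name__ == "__main__":
    print(parse_sql_style_fields("struct<log_id:bigint, client_ip:string>"))
```

Each returned dictionary corresponds to one `- name: ... / type: ...` entry in the array style.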
diff --git a/docs/grootstream-config.md b/docs/grootstream-config.md
index a359c39..479f4a7 100644
--- a/docs/grootstream-config.md
+++ b/docs/grootstream-config.md
@@ -5,16 +5,25 @@ The purpose of this file is to provide a global configuration for the groot-stre
```yaml
grootstream:
- knowledge_base: # Define the knowledge base list.
- - name: ${knowledge_base_name} # Define the name of the knowledge base, used to kb function.
- fs_type: ${file_system_type} # Define the type of the file system.Support: local,hdfs,http.
- fs_path: ${file_system_path} # Define the path of the file system.
+ knowledge_base: # Define the libraries
+ - name: ${knowledge_base_name}
+ fs_type: ${file_system_type}
+ fs_path: ${file_system_path}
files:
- ${file_name} # Define the file name of the knowledge base.
properties: # Custom parameters.
hos.path: ${hos_path}
hos.bucket.name.traffic_file: ${traffic_file_bucket}
hos.bucket.name.troubleshooting_file: ${troubleshooting_file_bucket}
- scheduler.knowledge_base.update.interval.minutes: ${knowledge_base_update_interval_minutes}
+ scheduler.knowledge_base.update.interval.minutes: ${knowledge_base_update_interval_minutes} # Interval, in minutes, between knowledge base file updates.
```
+### Knowledge Base
+The knowledge base is a collection of libraries that can be used in the UDFs of a groot-stream job. The file system type can be `local` or `http`. If it is `http`, `fs_path` must be a `KB Repository` URL. The libraries are updated dynamically at the interval set by `scheduler.knowledge_base.update.interval.minutes`.
+
+| Name    | Type   | Required | Default | Description                                                                     |
+|:--------|:-------|:---------|:--------|:--------------------------------------------------------------------------------|
+| name    | String | Yes      | -       | The name of the knowledge base, referenced by [UDF](processor/udf.md) functions. |
+| fs_type | String | Yes      | -       | The type of the file system. Allowed values: `local`, `http`.                   |
+| fs_path | String | Yes      | -       | The path of the file system. It can be a file directory or an HTTP RESTful API. |
+| files   | Array  | No       | -       | The list of files in the knowledge base.                                        |
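Reviewer note: the table added to `grootstream-config.md` implies a few simple checks on each `knowledge_base` entry. The sketch below applies them to an entry that has already been loaded into a Python dictionary. The function `validate_knowledge_base` and the sample values are hypothetical and are not part of groot-stream; the allowed `fs_type` values follow the table (`local` and `http`).

```python
REQUIRED_KEYS = ("name", "fs_type", "fs_path")
ALLOWED_FS_TYPES = {"local", "http"}  # taken from the table in the diff


def validate_knowledge_base(entry: dict) -> list[str]:
    """Return a list of problems found in one knowledge_base entry.

    Hypothetical helper for illustration only; not a groot-stream API.
    """
    errors = []
    # name, fs_type and fs_path are marked as required strings.
    for key in REQUIRED_KEYS:
        value = entry.get(key)
        if not isinstance(value, str) or not value:
            errors.append(f"'{key}' is required and must be a non-empty string")

    fs_type = entry.get("fs_type")
    if isinstance(fs_type, str) and fs_type and fs_type not in ALLOWED_FS_TYPES:
        errors.append(f"'fs_type' must be one of {sorted(ALLOWED_FS_TYPES)}")

    # files is optional, but must be a list when given.
    files = entry.get("files")
    if files is not None and not isinstance(files, list):
        errors.append("'files' must be a list when present")
    return errors


if __name__ == "__main__":
    sample = {"name": "example_kb", "fs_type": "local", "fs_path": "/path/to/kb"}
    print(validate_knowledge_base(sample))
```

An empty result means the entry satisfies the constraints listed in the table; it does not check that the path or URL is reachable.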
diff --git a/docs/user-guide.md b/docs/user-guide.md
index fa05547..a8f5067 100644
--- a/docs/user-guide.md
+++ b/docs/user-guide.md
@@ -8,19 +8,20 @@ The main format of the config template file is `yaml`, for more details of this
sources:
inline_source:
type: inline
- fields:
- - name: log_id
- type: bigint
- - name: recv_time
- type: bigint
- - name: fqdn_string
- type: string
- - name: client_ip
- type: string
- - name: server_ip
- type: string
- - name: decoded_as
- type: string
+ schema:
+ fields:
+ - name: log_id
+ type: bigint
+ - name: recv_time
+ type: bigint
+ - name: fqdn_string
+ type: string
+ - name: client_ip
+ type: string
+ - name: server_ip
+ type: string
+ - name: decoded_as
+ type: string
properties:
data: '{"log_id": 1, "recv_time":"111","fqdn_string":"baidu.com", "client_ip":"192.168.0.1","server_ip":"120.233.20.242","decoded_as":"BASE", "dup_traffic_flag":1}'
format: json
@@ -92,19 +93,20 @@ application:
## Schema Structure
Some sources, such as the `Kafka` and `Inline` sources, do not enforce a schema, so you need to use `fields` to define the name and type of each field.
```yaml
-fields:
- - name: log_id
- type: bigint
- - name: recv_time
- type: bigint
- - name: fqdn_string
- type: string
- - name: client_ip
- type: string
- - name: server_ip
- type: string
- - name: decoded_as
- type: string
+schema:
+ fields:
+ - name: log_id
+ type: bigint
+ - name: recv_time
+ type: bigint
+ - name: fqdn_string
+ type: string
+ - name: client_ip
+ type: string
+ - name: server_ip
+ type: string
+ - name: decoded_as
+ type: string
```
`name`: the name of the field. `type`: the data type of the field.
@@ -136,6 +138,7 @@ Sink is used to define where GrootStream needs to output data. Multiple sinks ca
## Application
Defines common parameters of the job, such as its name and parallelism, and the topology of the job. The following configuration parameters are supported.
+
### ENV
Defines the job environment configuration. For more details, refer to [JobEnvConfig](./env-config.md).