[Improve][Docs] Add some help information for connector schema and knowledge base files.

author: doufenghu <[email protected]> 2024-03-16 19:32:42 +0800
committer: doufenghu <[email protected]> 2024-03-16 19:32:42 +0800
commit: 25994fade7720a43021b25004ade13c71f941e88 (patch)
tree: 9c91a57d526f0579f241add718d8cb114ff04468 /docs
parent: 9ff68b2c631606cf06a7001036ff16475c52371c (diff)
3 files changed, 61 insertions, 51 deletions
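
As an illustration of what this commit documents, here is a minimal sketch of the two configurations it touches. The field names are borrowed from `docs/user-guide.md`; the knowledge base name, path, file name, and interval are hypothetical placeholders, not values taken from the project.

```yaml
# Source schema in SQL style. Per docs/connector/connector.md this should
# describe the same fields as the equivalent array-style list.
sources:
  inline_source:
    type: inline
    schema:
      fields: "struct<log_id:bigint, recv_time:bigint, fqdn_string:string, client_ip:string>"
    properties:
      data: '{"log_id": 1, "recv_time": 111, "fqdn_string": "baidu.com", "client_ip": "192.168.0.1"}'
      format: json
```

```yaml
# Global knowledge base entry (docs/grootstream-config.md). All values are hypothetical.
grootstream:
  knowledge_base:
    - name: example_kb            # hypothetical name, referenced from UDFs
      fs_type: local              # documented values: local, http
      fs_path: /path/to/kb/       # hypothetical directory
      files:
        - example_kb_file.dat     # hypothetical file name
  properties:
    scheduler.knowledge_base.update.interval.minutes: 5   # hypothetical interval
```

If the `schema` block is omitted, the documentation in this commit states that all fields (Map<String, Object>) are output.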
diff --git a/docs/connector/connector.md b/docs/connector/connector.md
index 6bcc878..e36214c 100644
--- a/docs/connector/connector.md
+++ b/docs/connector/connector.md
@@ -7,13 +7,13 @@ Source Connector contains some common core features, and each source connector s
 sources:
   ${source_name}:
     type: ${source_connector_type}
-    # source table schema, config through fields or local_file or url
-    schema:
-      fields: 
+    # Source schema, configured through fields, local_file, or url. If no schema is set, all fields (Map<String, Object>) are output.
+    schema: 
+      fields:
         - name: ${field_name}
           type: ${field_type}
-      # local_file: "schema path"
-      # url: "schema http url"
+      # local_file: "/path/to/schema.json" 
+      # url: "https://localhost:8080/schema.json"
     properties:
       ${prop_key}: ${prop_value}
 ```
@@ -29,12 +29,13 @@ The source connector supports reading only specified fields from the data source
 The Schema Structure refer to [Schema Structure](../user-guide.md#schema-structure).
 
 ## Schema Config
-Schema can config through fields or local_file or url.
+Schema can be configured through fields, local_file, or url. If no schema is set, all fields (Map<String, Object>) are output. local_file and url support only the Avro schema format. For more details, see the [Avro Schema](https://avro.apache.org/docs/1.11.1/specification/).
 
-### fields
+### Fields
+Fields can be configured in array style or SQL style. Array style is recommended because it is more readable.
 ```yaml
 schema:
-  # by array
+  # array style
   fields:
     - name: ${field_name}
       type: ${field_type}
@@ -42,31 +43,28 @@ schema:
 
 ```yaml
 schema:
-  # by sql
+  # sql style
   fields: "struct<field_name:field_type, ...>"
   # can also without outer struct<>
   # fields: "field_name:field_type, ..."
 ```
 
-### local_file
-
+### Local File
+Retrieve the schema from a local file, specified by its absolute path.
 ```yaml
 schema:
   # by array
   fields:
-    local_file: "schema path"
+    local_file: "/path/to/schema.json"
 ```
 
-### url
-Retrieve updated schema from URL for cycle, support dynamic schema. Not all connector support dynamic schema.
-
-The connectors that currently support dynamic schema include: clickHouse sink.
-
+### URL
+Some connectors support periodically fetching and updating the schema from a URL, such as the `ClickHouse Sink`.
 ```yaml
 schema:
   # by array
   fields:
-    url: "schema http url"
+    url: "https://localhost:8080/schema.json"
 ```
 
 # Sink Connector
@@ -81,8 +79,8 @@ sinks:
     # sink table schema, config through fields or local_file or url. if not set schema, all fields(Map<String, Object>) will be output.
     schema:
       fields: "struct<field_name:field_type, ...>"
-      # local_file: "schema path"
-      # url: "schema url"
+      # local_file: "/path/to/schema.json"
+      # url: "https://localhost:8080/schema.json"
     properties:
       ${prop_key}: ${prop_value}
 ```
diff --git a/docs/grootstream-config.md b/docs/grootstream-config.md
index a359c39..479f4a7 100644
--- a/docs/grootstream-config.md
+++ b/docs/grootstream-config.md
@@ -5,16 +5,25 @@ The purpose of this file is to provide a global configuration for the groot-stre
 
 ```yaml
 grootstream: 
-  knowledge_base: # Define the knowledge base list.
-    - name: ${knowledge_base_name} # Define the name of the knowledge base, used to kb function.
-      fs_type: ${file_system_type} # Define the type of the file system.Support: local,hdfs,http.
-      fs_path: ${file_system_path} # Define the path of the file system.
+  knowledge_base: # Define the libraries
+    - name: ${knowledge_base_name}
+      fs_type: ${file_system_type} 
+      fs_path: ${file_system_path}
       files:
         - ${file_name} # Define the file name of the knowledge base.
   properties: # Custom parameters.
     hos.path: ${hos_path}
     hos.bucket.name.traffic_file: ${traffic_file_bucket}
     hos.bucket.name.troubleshooting_file: ${troubleshooting_file_bucket}
-    scheduler.knowledge_base.update.interval.minutes: ${knowledge_base_update_interval_minutes}
+    scheduler.knowledge_base.update.interval.minutes: ${knowledge_base_update_interval_minutes} # Define the interval of the knowledge base file update.
 
 ```
+### Knowledge Base
+The knowledge base is a collection of libraries that can be used in the groot-stream job's UDFs. The file system type can be `local` or `http`. If it is `http`, `fs_path` must be a `KB Repository` URL. The libraries are updated dynamically at the interval set by `scheduler.knowledge_base.update.interval.minutes`.
+
+| Name     | Type    | Required | Default | Description                                                                |
+|:---------|:--------|:---------|:--------|:---------------------------------------------------------------------------|
+| name     | String  | Yes      | -       | The name of the knowledge base, used by [UDFs](processor/udf.md)           |
+| fs_type  | String  | Yes      | -       | The type of the file system: `local` or `http`.                            |
+| fs_path  | String  | Yes      | -       | The path of the file system: a file directory or an HTTP RESTful API.      |
+| files    | Array   | No       | -       | The file list of the knowledge base object.                               |
diff --git a/docs/user-guide.md b/docs/user-guide.md
index fa05547..a8f5067 100644
--- a/docs/user-guide.md
+++ b/docs/user-guide.md
@@ -8,19 +8,20 @@ The main format of the config template file is `yaml`, for more details of this
 sources:
   inline_source:
     type: inline
-    fields:
-      - name: log_id
-        type: bigint
-      - name: recv_time
-        type: bigint
-      - name: fqdn_string
-        type: string
-      - name: client_ip
-        type: string
-      - name: server_ip
-        type: string
-      - name: decoded_as
-        type: string
+    schema:
+        fields:
+          - name: log_id
+            type: bigint
+          - name: recv_time
+            type: bigint
+          - name: fqdn_string
+            type: string
+          - name: client_ip
+            type: string
+          - name: server_ip
+            type: string
+          - name: decoded_as
+            type: string
     properties:
       data: '{"log_id": 1, "recv_time":"111","fqdn_string":"baidu.com", "client_ip":"192.168.0.1","server_ip":"120.233.20.242","decoded_as":"BASE", "dup_traffic_flag":1}'
       format: json
@@ -92,19 +93,20 @@ application:
 ## Schema Structure
 Some sources do not enforce a strict schema, so you need to use `fields` to define each field's name and type. Sources such as `Kafka` and `Inline` let you customize the schema this way.
 ```yaml
-fields:   
-      - name: log_id
-        type: bigint
-      - name: recv_time
-        type: bigint
-      - name: fqdn_string
-        type: string
-      - name: client_ip
-        type: string
-      - name: server_ip
-        type: string
-      - name: decoded_as
-        type: string
+schema:
+    fields:   
+          - name: log_id
+            type: bigint
+          - name: recv_time
+            type: bigint
+          - name: fqdn_string
+            type: string
+          - name: client_ip
+            type: string
+          - name: server_ip
+            type: string
+          - name: decoded_as
+            type: string
 ```
 `name` The name of the field. `type` The data type of the field.
 
@@ -136,6 +138,7 @@ Sink is used to define where GrootStream needs to output data. Multiple sinks ca
 
 ## Application
 Used to define common parameters of the job, such as its name and parallelism, and the topology of the job. The following configuration parameters are supported.
+
 ### ENV
 Used to define job environment configuration information. For more details, you can refer to the documentation [JobEnvConfig](./env-config.md).