| author | 窦凤虎 <[email protected]> | 2024-08-19 11:08:46 +0000 |
|---|---|---|
| committer | 窦凤虎 <[email protected]> | 2024-08-19 11:08:46 +0000 |
| commit | 56b21d494bfa07012b1cc4e43dcb4ccdb6257d12 (patch) | |
| tree | 0fa1a094dbb4f4703ecbf013c678b3bb485b385b /docs | |
| parent | 07332297c1306aa0dac649c7d15bf131e8edbc7e (diff) | |
| parent | 6564a5e9a43ecd88f5497e2b75a219a8a54101bb (diff) | |
Merge branch 'test/e2e-test-clickhouse' into 'develop' (release/1.5.0-SNAPSHOT)
Test/e2e test clickhouse
See merge request galaxy/platform/groot-stream!94
Diffstat (limited to 'docs')
| Mode | File | Changes |
|---|---|---|
| -rw-r--r-- | docs/connector/connector.md | 94 |
| -rw-r--r-- | docs/images/groot_stream_architecture.jpg | bin, 5054004 -> 5263679 bytes |
| -rw-r--r-- | docs/processor/aggregate-processor.md | 2 |
| -rw-r--r-- | docs/processor/table-processor.md | 61 |
| -rw-r--r-- | docs/processor/udaf.md | 180 |
| -rw-r--r-- | docs/processor/udtf.md | 66 |
6 files changed, 365 insertions, 38 deletions
diff --git a/docs/connector/connector.md b/docs/connector/connector.md index 1123385..766b73e 100644 --- a/docs/connector/connector.md +++ b/docs/connector/connector.md @@ -85,41 +85,49 @@ schema: The mock data type is used to define the template of the mock data. -| Mock Type | Parameter | Result Type | Default | Description | -|-----------------------------------------|-------------|-----------------------|---------------------|---------------------------------------------------------------------------------------------------------------------------------------------| -| **[Number](#Number)** | - | **int/bigint/double** | - | **Randomly generate a number.** | -| - | min | number | 0 | The minimum value (include). | -| - | max | number | int32.max | The maximum value (exclusive). | -| - | options | array of number | (none) | The optional values. If set, the random value will be selected from the options and `start` and `end` will be ignored. | -| - | random | boolean | true | Default is random mode. If set to false, the value will be generated in order. | -| **[Sequence](#Sequence)** | - | **bigint** | - | **Generate a sequence number based on a specific step value .** | -| - | start | bigint | 0 | The first number in the sequence (include). | -| - | step | bigint | 1 | The number to add to each subsequent value. | -| **[UniqueSequence](#UniqueSequence)** | - | **bigint** | - | **Generate a global unique sequence number.** | -| - | start | bigint | 0 | The first number in the sequence (include). | -| **[String](#String)** | - | string | - | **Randomly generate a string.** | -| - | regex | string | [a-zA-Z]{0,5} | The regular expression. | -| - | options | array of string | (none) | The optional values. If set, the random value will be selected from the options and `regex` will be ignored. | -| - | random | boolean | true | Default is random mode. If set to false, the options value will be generated in order. 
| -| **[Timestamp](#Timestamp)** | - | **bigint** | - | **Generate a unix timestamp in milliseconds or seconds.** | -| - | unit | string | second | The unit of the timestamp. The optional values are `second`, `millis`. | -| **[FormatTimestamp](#FormatTimestamp)** | - | **string** | - | **Generate a formatted timestamp.** | -| - | format | string | yyyy-MM-dd HH:mm:ss | The format to output. | -| - | utc | boolean | false | Default is local time. If set to true, the time will be converted to UTC time. | -| **[IPv4](#IPv4)** | - | **string** | - | **Randomly generate a IPv4 address.** | -| - | start | string | 0.0.0.0 | The minimum value of the IPv4 address(include). | -| - | end | string | 255.255.255.255 | The maximum value of the IPv4 address(include). | -| **[Expression](#Expression)** | - | string | - | **Use library [Datafaker](https://www.datafaker.net/documentation/expressions/) expressions to generate fake data.** | -| - | expression | string | (none) | The datafaker expression used #{expression}. | -| **[Eval](#Eval)** | - | **string** | - | **Use AviatorScript value expression to generate data.** | -| - | expression | string | (none) | Support basic arithmetic operations and function calls. More details sess [AviatorScript](https://www.yuque.com/boyan-avfmj/aviatorscript). | -| **[Object](#Object)** | - | **struct/object** | - | **Generate a object data structure. It used to define the nested structure of the mock data.** | -| - | fields | array of object | (none) | The fields of the object. | -| **[Union](#Union)** | - | - | - | **Generate a union data structure with multiple mock data type fields.** | -| - | unionFields | array of object | (none) | The fields of the object. | -| - | - fields | - array of object | (none) | | -| - | - weight | - int | 0 | The weight of the generated object. | -| | random | boolean | true | Default is random mode. If set to false, the options value will be generated in order. 
| +| Mock Type | Parameter | Result Type | Default | Description | +|-----------------------------------------|---------------------------------|-----------------------|---------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------| +| **[Number](#Number)** | - | **int/bigint/double** | - | **Randomly generate a number.** | +| | min | number | 0 | The minimum value (inclusive). | +| | max | number | int32.max | The maximum value (exclusive). | +| | options | array of number | (none) | The optional values. If set, the random value will be selected from the options and `min` and `max` will be ignored. | +| | random | boolean | true | Default is random mode. If set to false, the value will be generated in order. | +| **[Sequence](#Sequence)** | - | **bigint** | - | **Generate a sequence number based on a specific step value.** | +| | start | bigint | 0 | The first number in the sequence (inclusive). | +| | step | bigint | 1 | The number to add to each subsequent value. | +| **[UniqueSequence](#UniqueSequence)** | - | **bigint** | - | **Generate a globally unique sequence number.** | +| | start | bigint | 0 | The first number in the sequence (inclusive). | +| **[String](#String)** | - | string | - | **Randomly generate a string.** | +| | regex | string | [a-zA-Z]{0,5} | The regular expression used to generate the string. | +| | options | array of string | (none) | The optional values. If set, the random value will be selected from the options and `regex` will be ignored. | +| | random | boolean | true | Default is random mode. If set to false, the options value will be generated in order. | +| **[Timestamp](#Timestamp)** | - | **bigint** | - | **Generate a Unix timestamp in milliseconds or seconds.** | +| | unit | string | second | The unit of the timestamp. Options are `second` or `millis`. 
| +| **[FormatTimestamp](#FormatTimestamp)** | - | **string** | - | **Generate a formatted timestamp.** | +| | format | string | yyyy-MM-dd HH:mm:ss | The format to output the timestamp in. | +| | utc | boolean | false | Default is local time. If set to true, the time will be converted to UTC time. | +| **[IPv4](#IPv4)** | - | **string** | - | **Randomly generate an IPv4 address.** | +| | start | string | 0.0.0.0 | The minimum value of the IPv4 address (inclusive). | +| | end | string | 255.255.255.255 | The maximum value of the IPv4 address (inclusive). | +| **[Expression](#Expression)** | - | string | - | **Use library [Datafaker](https://www.datafaker.net/documentation/expressions/) expressions to generate fake data.** | +| | expression | string | (none) | The Datafaker expression to use, in the format `#{expression}`. | +| **[Hlld](#HLLD)** | - | **string** | - | **Generate a IP Address HyperLogLog data structure and store it as a base64 string. Use library [HLLD](https://github.com/armon/hlld).** | +| | itemCount | bigint | 1000000 | The total number of items. | +| | batchCount | int | 10000 | The number of items in each batch. | +| | precision | int | 12 | The precision of the HyperLogLog data structure. Allowed range is [4, 18]. | +| **[HdrHistogram](#HdrHistogram)** | - | **string** | - | **Generate a Latency HdrHistogram data structure and store it as a base64 string. Use library [HdrHistogram](https://github.com/HdrHistogram/HdrHistogram).** | +| | max | bigint | 100000 | The maximum value of the histogram. | +| | batchCount | int | 1000 | The random number of items in each batch. | +| | numberOfSignificantValueDigits | int | 1 | The precision of the histogram data structure. Allowed range is [1, 5]. | +| **[Eval](#Eval)** | - | **string** | - | **Use AviatorScript value expression to generate data.** | +| | expression | string | (none) | Support basic arithmetic operations and function calls. 
More details in [AviatorScript](https://www.yuque.com/boyan-avfmj/aviatorscript). | +| **[Object](#Object)** | - | **struct/object** | - | **Generate an object data structure. Used to define the nested structure of the mock data.** | +| | fields | array of object | (none) | The fields of the object. | +| **[Union](#Union)** | - | - | - | **Generate a union data structure with multiple mock data type fields.** | +| | unionFields | array of object | (none) | The fields of the union. | +| | weight | int | 0 | The weight of the generated object. | +| | random | boolean | true | Default is random mode. If set to false, the options value will be generated in order. | + ### Common Parameters @@ -250,6 +258,22 @@ Mock data type supports some common parameters. {"name":"phoneNumber","type":"Expression","expression":"#{phoneNumber.phoneNumber}"} ``` +### HLLD + +- Generate a IP Address HyperLogLog data structure, stored as a base64 string. At most 1000 IP addresses are generated in each batch. + +```json +{"name":"hll","type":"Hlld","itemCount":1000000,"batchCount":1000,"precision":12} +``` + +### HdrHistogram + +- Generate a Latency HdrHistogram data structure, stored as a base64 string. The maximum value of the histogram is 100000, and at most 1000 items are generated in each batch. + +```json +{"name":"distribution","type":"HdrHistogram","max":100000,"batchCount":1000,"numberOfSignificantValueDigits":1} +``` + ### Eval - Generate a value by using AviatorScript expression. Commonly used for arithmetic operations. 
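As a rough guide to the `precision` range [4, 18] documented for the `Hlld` mock type in the table above, the following sketch computes the register count and the theoretical relative standard error of a HyperLogLog sketch, using the standard estimate 1.04 / sqrt(m) with m = 2^precision. This is general HyperLogLog arithmetic, not code from Groot Stream or hlld; the function names are hypothetical and exist only for illustration.

```python
import math


def hll_register_count(precision: int) -> int:
    """Return the number of registers m = 2**precision."""
    # The [4, 18] bound mirrors the range documented for the Hlld mock type.
    if not 4 <= precision <= 18:
        raise ValueError("precision must be in [4, 18]")
    return 1 << precision


def hll_relative_error(precision: int) -> float:
    """Return the theoretical relative standard error, about 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(hll_register_count(precision))


if __name__ == "__main__":
    # Print register count and expected error for the smallest, default,
    # and largest documented precision values.
    for p in (4, 12, 18):
        print(p, hll_register_count(p), f"{hll_relative_error(p):.4%}")
```

With the documented default of 12, the sketch uses 4096 registers and the expected relative error is about 1.6%. Raising the precision lowers the error at the cost of a larger serialized sketch.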
diff --git a/docs/images/groot_stream_architecture.jpg b/docs/images/groot_stream_architecture.jpg
index 1fff0e5..d8f1d4b 100644
--- a/docs/images/groot_stream_architecture.jpg
+++ b/docs/images/groot_stream_architecture.jpg
Binary files differ
diff --git a/docs/processor/aggregate-processor.md b/docs/processor/aggregate-processor.md
index af82d4e..5ab0ae0 100644
--- a/docs/processor/aggregate-processor.md
+++ b/docs/processor/aggregate-processor.md
@@ -12,7 +12,7 @@ Note:Default will output internal fields `__window_start_timestamp` and `__win
 | name | type | required | default value |
 |--------------------------|--------|----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| type | String | Yes | The type of the processor, now only support `com.geedgenetworks.core.processor.projection.AggregateProcessor` |
+| type | String | Yes | The type of the processor, now only support `com.geedgenetworks.core.processor.aggregate.AggregateProcessor` |
 | output_fields | Array | No | Array of String. The list of fields that need to be kept. Fields not in the list will be removed. |
 | remove_fields | Array | No | Array of String. The list of fields that need to be removed. |
 | group_by_fields | Array | yes | Array of String. The list of fields that need to be grouped. |
diff --git a/docs/processor/table-processor.md b/docs/processor/table-processor.md
new file mode 100644
index 0000000..7b3066c
--- /dev/null
+++ b/docs/processor/table-processor.md
@@ -0,0 +1,61 @@
+# Table Processor
+
+> Processing pipelines for table processors using UDTFs
+
+## Description
+
+Table processor is used to process the data from source to sink. It is a part of the processing pipeline. It can be used in the pre-processing, processing, and post-processing pipeline. Each processor can assemble UDTFs(User-defined Table functions) into a pipeline. Within the pipeline, events are processed by each Function in order, top‑>down. More details can be found in user-defined table functions [(UDTFs)](udtf.md).
+
+## Options
+
+| name | type | required | default value |
+|-----------------|--------|----------|------------------------------------------------------------------------------------------------------|
+| type | String | Yes | The type of the processor, now only support `com.geedgenetworks.core.processor.table.TableProcessor` |
+| output_fields | Array | No | Array of String. The list of fields that need to be kept. Fields not in the list will be removed. |
+| remove_fields | Array | No | Array of String. The list of fields that need to be removed. |
+| functions | Array | No | Array of Object. The list of functions that need to be applied to the data. |
+
+## Usage Example
+This example uses a table processor to unroll the encapsulation field, converting one row into multiple rows.
+ +```yaml +sources: + inline_source: + type: inline + properties: + data: '[{"tcp_rtt_ms":128,"decoded_as":"HTTP","http_version":"http1","http_request_line":"GET / HTTP/1.1","http_host":"www.ct.cn","http_url":"www.ct.cn/","http_user_agent":"curl/8.0.1","http_status_code":200,"http_response_line":"HTTP/1.1 200 OK","http_response_content_type":"text/html; charset=UTF-8","http_response_latency_ms":31,"http_session_duration_ms":5451,"in_src_mac":"ba:bb:a7:3c:67:1c","in_dest_mac":"86:dd:7a:8f:ae:e2","out_src_mac":"86:dd:7a:8f:ae:e2","out_dest_mac":"ba:bb:a7:3c:67:1c","tcp_client_isn":678677906,"tcp_server_isn":1006700307,"address_type":4,"client_ip":"192.11.22.22","server_ip":"8.8.8.8","client_port":42751,"server_port":80,"in_link_id":65535,"out_link_id":65535,"start_timestamp_ms":1703646546127,"end_timestamp_ms":1703646551702,"duration_ms":5575,"sent_pkts":97,"sent_bytes":5892,"received_pkts":250,"received_bytes":333931,"encapsulation":"[{\"tunnels_schema_type\":\"MULTIPATH_ETHERNET\",\"c2s_source_mac\":\"48:73:97:96:38:27\",\"c2s_destination_mac\":\"58:b3:8f:fa:3b:11\",\"s2c_source_mac\":\"58:b3:8f:fa:3b:11\",\"s2c_destination_mac\":\"48:73:97:96:38:27\"}]"},{"tcp_rtt_ms":256,"decoded_as":"HTTP","http_version":"http1","http_request_line":"GET / HTTP/1.1","http_host":"www.abc.cn","http_url":"www.cabc.cn/","http_user_agent":"curl/8.0.1","http_status_code":200,"http_response_line":"HTTP/1.1 200 OK","http_response_content_type":"text/html; 
charset=UTF-8","http_response_latency_ms":31,"http_session_duration_ms":5451,"in_src_mac":"ba:bb:a7:3c:67:1c","in_dest_mac":"86:dd:7a:8f:ae:e2","out_src_mac":"86:dd:7a:8f:ae:e2","out_dest_mac":"ba:bb:a7:3c:67:1c","tcp_client_isn":678677906,"tcp_server_isn":1006700307,"address_type":4,"client_ip":"192.168.10.198","server_ip":"4.4.4.4","client_port":42751,"server_port":80,"in_link_id":65535,"out_link_id":65535,"start_timestamp_ms":1703646546127,"end_timestamp_ms":1703646551702,"duration_ms":2575,"sent_pkts":197,"sent_bytes":5892,"received_pkts":350,"received_bytes":533931,"device_tag":"{\"tags\":[{\"tag\":\"data_center\",\"value\":\"center-xxg-tsgx\"},{\"tag\":\"device_group\",\"value\":\"group-xxg-tsgx\"}]}"}]' + format: json + json.ignore.parse.errors: false + +processing_pipelines: + table_processor: + type: table + functions: + - function: JSON_UNROLL + lookup_fields: [ encapsulation] + output_fields: [ encapsulation ] + +sinks: + print_sink: + type: print + properties: + format: json + mode: log_warn + +application: + env: + name: example-inline-to-print-use-udtf + parallelism: 3 + pipeline: + object-reuse: true + topology: + - name: inline_source + downstream: [table_processor] + - name: table_processor + downstream: [ print_sink ] + - name: print_sink + downstream: [] + +``` + + diff --git a/docs/processor/udaf.md b/docs/processor/udaf.md index e22846f..dd1dd70 100644 --- a/docs/processor/udaf.md +++ b/docs/processor/udaf.md @@ -11,7 +11,11 @@ - [Long Count](#Long-Count) - [MEAN](#Mean) - [Number SUM](#Number-SUM) - +- [HLLD](#HLLD) +- [Approx Count Distinct HLLD](#Approx-Count-Distinct-HLLD) +- [HDR Histogram](#HDR-Histogram) +- [Approx Quantile HDR](#APPROX_QUANTILE_HDR) +- [Approx Quantiles HDR](#APPROX_QUANTILES_HDR) ## Description @@ -146,4 +150,176 @@ NUMBER_SUM is used to sum the value of the field in the group of events. The loo - function: NUMBER_SUM lookup_fields: [received_bytes] output_fields: [received_bytes_sum] -```
\ No newline at end of file
+```
+
+### HLLD
+hlld is a high-performance C server which is used to expose HyperLogLog sets and operations over them to networked clients. More details can be found in [hlld](https://github.com/armon/hlld).
+
+```HLLD(filter, lookup_fields, output_fields[, parameters])```
+- filter: optional
+- lookup_fields: required.
+- output_fields: required.
+- parameters: optional.
+  - input_type: `<String>` optional. input field type can be `regular` or `sketch`. Default is `sketch`. regular field data type includes `string`, `int`, `long`, `float`, `double` etc.
+  - precision: `<Integer>` optional. The precision of the hlld value. Default is 12.
+  - output_format: `<String>` optional. The output format can be either `base64(encoded string)` or `binary(byte[])`. The default is `base64`.
+
+### Example
+ Merge multiple string field into a HyperLogLog data structure.
+```yaml
+ - function: HLLD
+   lookup_fields: [client_ip]
+   output_fields: [client_ip_hlld]
+   parameters:
+     input_type: regular
+
+```
+ Merge multiple `unique_count ` metric type fields into a HyperLogLog data structure
+```yaml
+ - function: HLLD
+   lookup_fields: [client_ip_hlld]
+   output_fields: [client_ip_hlld]
+   parameters:
+     input_type: sketch
+```
+
+### Approx Count Distinct HLLD
+Approx Count Distinct HLLD is used to count the approximate number of distinct values in the group of events.
+
+```APPROX_COUNT_DISTINCT_HLLD(filter, lookup_fields, output_fields[, parameters])```
+- filter: optional
+- lookup_fields: required.
+- output_fields: required.
+- parameters: optional.
+  - input_type: `<String>` optional. Refer to `HLLD` function.
+  - precision: `<Integer>` optional. Refer to `HLLD` function.
+
+### Example
+
+```yaml
+- function: APPROX_COUNT_DISTINCT_HLLD
+  lookup_fields: [client_ip]
+  output_fields: [unique_client_ip]
+  parameters:
+    input_type: regular
+```
+
+```yaml
+- function: APPROX_COUNT_DISTINCT_HLLD
+  lookup_fields: [client_ip_hlld]
+  output_fields: [unique_client_ip]
+  parameters:
+    input_type: sketch
+```
+
+### HDR Histogram
+
+A High Dynamic Range (HDR) Histogram. More details can be found in [HDR Histogram](https://github.com/HdrHistogram/HdrHistogram).
+
+```HDR_HISTOGRAM(filter, lookup_fields, output_fields[, parameters])```
+- filter: optional
+- lookup_fields: required.
+- output_fields: required.
+- parameters: optional.
+  - input_type: `<String>` optional. input field type can be `regular` or `sketch`. Default is `sketch`. regular field is a number.
+  - lowestDiscernibleValue: `<Integer>` optional. The lowest trackable value. Default is 1.
+  - highestTrackableValue: `<Integer>` optional. The highest trackable value. Default is 2.
+  - numberOfSignificantValueDigits: `<Integer>` optional. The number of significant value digits. Default is 1. The range is 1 to 5.
+  - autoResize: `<Boolean>` optional. If true, the highestTrackableValue will auto-resize. Default is true.
+  - output_format: `<String>` optional. The output format can be either `base64(encoded string)` or `binary(byte[])`. The default is `base64`.
+
+### Example
+
+ ```yaml
+ - function: HDR_HISTOGRAM
+   lookup_fields: [latency_ms]
+   output_fields: [latency_ms_histogram]
+   parameters:
+     input_type: regular
+     lowestDiscernibleValue: 1
+     highestTrackableValue: 3600000
+     numberOfSignificantValueDigits: 3
+ ```
+ ```yaml
+ - function: HDR_HISTOGRAM
+   lookup_fields: [latency_ms_histogram]
+   output_fields: [latency_ms_histogram]
+   parameters:
+     input_type: sketch
+ ```
+
+### Approx Quantile HDR
+
+Approx Quantile HDR is used to calculate the approximate quantile value of the field in the group of events.
+
+```APPROX_QUANTILE_HDR(filter, lookup_fields, output_fields, quantile[, parameters])```
+- filter: optional
+- lookup_fields: required.
+- output_fields: required.
+- parameters: optional.
+  - input_type: `<String>` optional. Refer to `HDR_HISTOGRAM` function.
+  - lowestDiscernibleValue: `<Integer>` optional. Refer to `HDR_HISTOGRAM` function.
+  - highestTrackableValue: `<Integer>` required. Refer to `HDR_HISTOGRAM` function.
+  - numberOfSignificantValueDigits: `<Integer>` optional. Refer to `HDR_HISTOGRAM` function.
+  - autoResize: `<Boolean>` optional. Refer to `HDR_HISTOGRAM` function.
+  - probability: `<Double>` optional. The probability of the quantile. Default is 0.5.
+
+### Example
+
+ ```yaml
+ - function: APPROX_QUANTILE_HDR
+   lookup_fields: [latency_ms]
+   output_fields: [latency_ms_p95]
+   parameters:
+     input_type: regular
+     probability: 0.95
+ ```
+
+ ```yaml
+ - function: APPROX_QUANTILE_HDR
+   lookup_fields: [latency_ms_HDR]
+   output_fields: [latency_ms_p95]
+   parameters:
+     input_type: sketch
+     probability: 0.95
+
+ ```
+
+### Approx Quantiles HDR
+
+Approx Quantiles HDR is used to calculate the approximate quantile values of the field in the group of events.
+
+```APPROX_QUANTILES_HDR(filter, lookup_fields, output_fields, quantiles[, parameters])```
+- filter: optional
+- lookup_fields: required.
+- output_fields: required.
+- parameters: optional.
+  - input_type: `<String>` optional. Refer to `HDR_HISTOGRAM` function.
+  - lowestDiscernibleValue: `<Integer>` optional. Refer to `HDR_HISTOGRAM` function.
+  - highestTrackableValue: `<Integer>` required. Refer to `HDR_HISTOGRAM` function.
+  - numberOfSignificantValueDigits: `<Integer>` optional. Refer to `HDR_HISTOGRAM` function.
+  - autoResize: `<Boolean>` optional. Refer to `HDR_HISTOGRAM` function.
+  - probabilities: `<Array<Double>>` required. The list of probabilities of the quantiles. Range is 0 to 1.
+
+### Example
+
+```yaml
+- function: APPROX_QUANTILES_HDR
+  lookup_fields: [latency_ms]
+  output_fields: [latency_ms_quantiles]
+  parameters:
+    input_type: regular
+    probabilities: [0.5, 0.95, 0.99]
+```
+
+```yaml
+- function: APPROX_QUANTILES_HDR
+  lookup_fields: [latency_ms_HDR]
+  output_fields: [latency_ms_quantiles]
+  parameters:
+    input_type: sketch
+    probabilities: [0.5, 0.95, 0.99]
+```
+
+
+
diff --git a/docs/processor/udtf.md b/docs/processor/udtf.md
new file mode 100644
index 0000000..a6e8444
--- /dev/null
+++ b/docs/processor/udtf.md
@@ -0,0 +1,66 @@
+# UDTF
+
+> The functions for table processors.
+
+## Function of content
+
+- [UNROLL](#unroll)
+- [JSON_UNROLL](#json_unroll)
+
+## Description
+
+The UDTFs(user-defined table functions) are used to process the data from source to sink. It is a part of the processing pipeline. It can be used in the pre-processing, processing, and post-processing pipeline. Each processor can assemble UDTFs into a pipeline. Within the pipeline, events are processed by each Function in order, top‑>down.
+Unlike scalar functions, which return a single value, UDTFs are particularly useful when you need to explode or unroll data, transforming a single input row into multiple output rows.
+
+## UDTF Definition
+
+ The UDTFs and UDFs share similar input and context structures, please refer to [UDF](udf.md).
+
+## Functions
+
+### UNROLL
+
+The Unroll Function handles an array field, or an expression evaluating to an array, and unrolls it into individual events.
+
+```UNROLL(filter, lookup_fields, output_fields[, parameters])```
+- filter: optional
+- lookup_fields: required
+- output_fields: required
+- parameters: optional
+  - regex: `<String>` optional. If lookup_fields is a string, the regex parameter is used to split the string into an array. The default value is a comma.
+
+#### Example
+
+```yaml
+functions:
+  - function: UNROLL
+    lookup_fields: [ monitor_rule_list ]
+    output_fields: [ monitor_rule ]
+```
+
+### JSON_UNROLL
+
+The JSON Unroll Function handles a JSON object, unrolls/explodes an array of objects therein into individual events, while also inheriting top level fields.
+
+```JSON_UNROLL(filter, lookup_fields, output_fields[, parameters])```
+- filter: optional
+- lookup_fields: required
+- output_fields: required
+- parameters: optional
+  - path: `<String>` optional. Path to array to unroll, default is the root of the JSON object.
+  - new_path: `<String>` optional. Rename path to new_path, default is the same as path.
+
+#### Example
+
+```yaml
+functions:
+  - function: JSON_UNROLL
+    lookup_fields: [ device_tag ]
+    output_fields: [ device_tag ]
+    parameters:
+      - path: tags
+      - new_path: tag
+```
+
+
+
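To make the row-multiplying behaviour of `UNROLL` described in `udtf.md` concrete, here is a minimal Python sketch of the documented semantics: an array field yields one event per element, and a string field is first split with `regex` (a comma by default). It is not the Groot Stream implementation. The function name `unroll`, the choice to drop the lookup field from the output, and the pass-through of events with no unrollable value are all assumptions made for illustration.

```python
import re
from typing import Any, Dict, Iterator


def unroll(event: Dict[str, Any], lookup_field: str, output_field: str,
           regex: str = ",") -> Iterator[Dict[str, Any]]:
    """Yield one event per element of event[lookup_field]."""
    value = event.get(lookup_field)
    if isinstance(value, str):
        # Documented behaviour: a string is split into an array with `regex`.
        items = re.split(regex, value)
    elif isinstance(value, list):
        items = value
    else:
        # Assumption: events without an unrollable value pass through unchanged.
        yield dict(event)
        return
    for item in items:
        out = dict(event)            # inherit every other field
        out.pop(lookup_field, None)  # assumption: the source field is replaced
        out[output_field] = item
        yield out


if __name__ == "__main__":
    source = {"id": 1, "monitor_rule_list": "a,b,c"}
    for row in unroll(source, "monitor_rule_list", "monitor_rule"):
        print(row)
```

Each output row keeps the remaining fields of the input event, which is what distinguishes a table function from a scalar function that returns a single value per row.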
