author    doufenghu <[email protected]>  2024-08-19 18:54:33 +0800
committer doufenghu <[email protected]>  2024-08-19 18:54:33 +0800
commit    a1f2fd8385f418af0a55867e12747967e7b836f9 (patch)
tree      c88fe41ff5eb62b73e6269940f726e968ab396ba /docs/processor
parent    43bc690b73d2df56cb29cea7235c650d60af82d6 (diff)
[docs][table processor] add table-processor and udtf description.
Diffstat (limited to 'docs/processor')
-rw-r--r--  docs/processor/aggregate-processor.md  |   2
-rw-r--r--  docs/processor/table-processor.md      |  61
-rw-r--r--  docs/processor/udaf.md                 | 180
-rw-r--r--  docs/processor/udtf.md                 |  66
4 files changed, 306 insertions, 3 deletions
diff --git a/docs/processor/aggregate-processor.md b/docs/processor/aggregate-processor.md
index af82d4e..5ab0ae0 100644
--- a/docs/processor/aggregate-processor.md
+++ b/docs/processor/aggregate-processor.md
@@ -12,7 +12,7 @@ Note:Default will output internal fields `__window_start_timestamp` and `__win
| name | type | required | default value |
|--------------------------|--------|----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| type | String | Yes | The type of the processor, now only support `com.geedgenetworks.core.processor.projection.AggregateProcessor` |
+| type | String | Yes | The type of the processor, now only support `com.geedgenetworks.core.processor.aggregate.AggregateProcessor` |
| output_fields | Array | No | Array of String. The list of fields that need to be kept. Fields not in the list will be removed. |
| remove_fields | Array | No | Array of String. The list of fields that need to be removed. |
| group_by_fields | Array | yes | Array of String. The list of fields that need to be grouped. |
diff --git a/docs/processor/table-processor.md b/docs/processor/table-processor.md
new file mode 100644
index 0000000..7b3066c
--- /dev/null
+++ b/docs/processor/table-processor.md
@@ -0,0 +1,61 @@
+# Table Processor
+
+> Processing pipelines for table processors using UDTFs
+
+## Description
+
+Table processor transforms data on its way from source to sink as part of the processing pipeline. It can be used in the pre-processing, processing, and post-processing pipelines. Each processor assembles UDTFs (user-defined table functions) into a pipeline; within the pipeline, events are processed by each function in order, from top to bottom. More details can be found in user-defined table functions [(UDTFs)](udtf.md).
+
+## Options
+
+| name | type | required | default value |
+|-----------------|--------|----------|------------------------------------------------------------------------------------------------------|
+| type | String | Yes | The type of the processor, now only support `com.geedgenetworks.core.processor.table.TableProcessor` |
+| output_fields   | Array  | No       | Array of String. The list of fields that need to be kept. Fields not in the list will be removed.      |
+| remove_fields | Array | No | Array of String. The list of fields that need to be removed. |
+| functions | Array | No | Array of Object. The list of functions that need to be applied to the data. |
+
+## Usage Example
+This example uses a table processor to unroll the encapsulation field, converting one row into multiple rows.
+
+```yaml
+sources:
+ inline_source:
+ type: inline
+ properties:
+      data: '[{"tcp_rtt_ms":128,"decoded_as":"HTTP","http_version":"http1","http_request_line":"GET / HTTP/1.1","http_host":"www.ct.cn","http_url":"www.ct.cn/","http_user_agent":"curl/8.0.1","http_status_code":200,"http_response_line":"HTTP/1.1 200 OK","http_response_content_type":"text/html; charset=UTF-8","http_response_latency_ms":31,"http_session_duration_ms":5451,"in_src_mac":"ba:bb:a7:3c:67:1c","in_dest_mac":"86:dd:7a:8f:ae:e2","out_src_mac":"86:dd:7a:8f:ae:e2","out_dest_mac":"ba:bb:a7:3c:67:1c","tcp_client_isn":678677906,"tcp_server_isn":1006700307,"address_type":4,"client_ip":"192.11.22.22","server_ip":"8.8.8.8","client_port":42751,"server_port":80,"in_link_id":65535,"out_link_id":65535,"start_timestamp_ms":1703646546127,"end_timestamp_ms":1703646551702,"duration_ms":5575,"sent_pkts":97,"sent_bytes":5892,"received_pkts":250,"received_bytes":333931,"encapsulation":"[{\"tunnels_schema_type\":\"MULTIPATH_ETHERNET\",\"c2s_source_mac\":\"48:73:97:96:38:27\",\"c2s_destination_mac\":\"58:b3:8f:fa:3b:11\",\"s2c_source_mac\":\"58:b3:8f:fa:3b:11\",\"s2c_destination_mac\":\"48:73:97:96:38:27\"}]"},{"tcp_rtt_ms":256,"decoded_as":"HTTP","http_version":"http1","http_request_line":"GET / HTTP/1.1","http_host":"www.abc.cn","http_url":"www.cabc.cn/","http_user_agent":"curl/8.0.1","http_status_code":200,"http_response_line":"HTTP/1.1 200 OK","http_response_content_type":"text/html; charset=UTF-8","http_response_latency_ms":31,"http_session_duration_ms":5451,"in_src_mac":"ba:bb:a7:3c:67:1c","in_dest_mac":"86:dd:7a:8f:ae:e2","out_src_mac":"86:dd:7a:8f:ae:e2","out_dest_mac":"ba:bb:a7:3c:67:1c","tcp_client_isn":678677906,"tcp_server_isn":1006700307,"address_type":4,"client_ip":"192.168.10.198","server_ip":"4.4.4.4","client_port":42751,"server_port":80,"in_link_id":65535,"out_link_id":65535,"start_timestamp_ms":1703646546127,"end_timestamp_ms":1703646551702,"duration_ms":2575,"sent_pkts":197,"sent_bytes":5892,"received_pkts":350,"received_bytes":533931,"device_tag":"{\"tags\":[{\"tag\":\"data_center\",\"value\":\"center-xxg-tsgx\"},{\"tag\":\"device_group\",\"value\":\"group-xxg-tsgx\"}]}"}]'
+ format: json
+ json.ignore.parse.errors: false
+
+processing_pipelines:
+ table_processor:
+ type: table
+ functions:
+ - function: JSON_UNROLL
+        lookup_fields: [ encapsulation ]
+ output_fields: [ encapsulation ]
+
+sinks:
+ print_sink:
+ type: print
+ properties:
+ format: json
+ mode: log_warn
+
+application:
+ env:
+ name: example-inline-to-print-use-udtf
+ parallelism: 3
+ pipeline:
+ object-reuse: true
+ topology:
+ - name: inline_source
+ downstream: [table_processor]
+ - name: table_processor
+ downstream: [ print_sink ]
+ - name: print_sink
+ downstream: []
+
+```
+
+
diff --git a/docs/processor/udaf.md b/docs/processor/udaf.md
index e22846f..dd1dd70 100644
--- a/docs/processor/udaf.md
+++ b/docs/processor/udaf.md
@@ -11,7 +11,11 @@
- [Long Count](#Long-Count)
- [MEAN](#Mean)
- [Number SUM](#Number-SUM)
-
+- [HLLD](#HLLD)
+- [Approx Count Distinct HLLD](#Approx-Count-Distinct-HLLD)
+- [HDR Histogram](#HDR-Histogram)
+- [Approx Quantile HDR](#Approx-Quantile-HDR)
+- [Approx Quantiles HDR](#Approx-Quantiles-HDR)
## Description
@@ -146,4 +150,176 @@ NUMBER_SUM is used to sum the value of the field in the group of events. The loo
- function: NUMBER_SUM
lookup_fields: [received_bytes]
output_fields: [received_bytes_sum]
-``` \ No newline at end of file
+```
+
+### HLLD
+The HLLD function merges field values into a HyperLogLog sketch, a compact data structure for approximate distinct counting. It is named after [hlld](https://github.com/armon/hlld), a high-performance C server that exposes HyperLogLog sets and operations over them to networked clients.
+
+```HLLD(filter, lookup_fields, output_fields[, parameters])```
+- filter: optional
+- lookup_fields: required.
+- output_fields: required.
+- parameters: optional.
+  - input_type: `<String>` optional. The input field type, either `regular` or `sketch`. Default is `sketch`. A regular field may be of any scalar data type, such as `string`, `int`, `long`, `float`, or `double`.
+ - precision: `<Integer>` optional. The precision of the hlld value. Default is 12.
+  - output_format: `<String>` optional. The output format, either `base64` (encoded string) or `binary` (`byte[]`). Default is `base64`.
+
+### Example
+Merge string field values from multiple events into a HyperLogLog data structure.
+```yaml
+ - function: HLLD
+ lookup_fields: [client_ip]
+ output_fields: [client_ip_hlld]
+ parameters:
+ input_type: regular
+
+```
+Merge multiple `unique_count` metric-type fields into a HyperLogLog data structure.
+```yaml
+ - function: HLLD
+ lookup_fields: [client_ip_hlld]
+ output_fields: [client_ip_hlld]
+ parameters:
+ input_type: sketch
+```
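Conceptually, a HyperLogLog sketch is an array of `2^precision` registers, and merging two sketches (the `sketch` input type above) is a per-register maximum. The sketch below is a minimal Python illustration of those semantics, not the processor's actual implementation; the hash choice and register layout are assumptions.

```python
import hashlib

P = 12        # precision, matching the default above
M = 1 << P    # number of registers

def add(registers, value: str):
    # Hash the value; the low P bits pick a register, the remaining bits
    # supply the rank (position of the first set bit)
    h = int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")
    idx = h & (M - 1)
    rest = h >> P
    rank = 1
    while rest & 1 == 0 and rank < 64 - P:
        rest >>= 1
        rank += 1
    registers[idx] = max(registers[idx], rank)

def merge(a, b):
    # Merging sketches keeps the per-register maximum
    return [max(x, y) for x, y in zip(a, b)]
```

Because merge is a per-register max, it is idempotent and order-independent, which is why partial sketches can be combined across groups or time windows.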
+
+### Approx Count Distinct HLLD
+Approx Count Distinct HLLD is used to count the approximate number of distinct values in the group of events.
+
+```APPROX_COUNT_DISTINCT_HLLD(filter, lookup_fields, output_fields[, parameters])```
+- filter: optional
+- lookup_fields: required.
+- output_fields: required.
+- parameters: optional.
+ - input_type: `<String>` optional. Refer to `HLLD` function.
+ - precision: `<Integer>` optional. Refer to `HLLD` function.
+
+### Example
+
+```yaml
+- function: APPROX_COUNT_DISTINCT_HLLD
+ lookup_fields: [client_ip]
+ output_fields: [unique_client_ip]
+ parameters:
+ input_type: regular
+```
+
+```yaml
+- function: APPROX_COUNT_DISTINCT_HLLD
+ lookup_fields: [client_ip_hlld]
+ output_fields: [unique_client_ip]
+ parameters:
+ input_type: sketch
+```
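For reference, the distinct count is recovered from a sketch's registers with the standard HyperLogLog harmonic-mean estimator. This is a hedged sketch of the math, not this function's exact code; the constants are the textbook bias corrections.

```python
import math

def approx_count_distinct(registers):
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)   # bias correction (valid for m >= 128)
    # Harmonic mean of 2^register over all registers
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting
        return round(m * math.log(m / zeros))
    return round(raw)
```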
+
+### HDR Histogram
+
+A High Dynamic Range (HDR) Histogram. More details can be found in [HDR Histogram](https://github.com/HdrHistogram/HdrHistogram).
+
+```HDR_HISTOGRAM(filter, lookup_fields, output_fields[, parameters])```
+- filter: optional
+- lookup_fields: required.
+- output_fields: required.
+- parameters: optional.
+  - input_type: `<String>` optional. The input field type, either `regular` or `sketch`. Default is `sketch`. A regular field is a number.
+ - lowestDiscernibleValue: `<Integer>` optional. The lowest trackable value. Default is 1.
+ - highestTrackableValue: `<Integer>` optional. The highest trackable value. Default is 2.
+ - numberOfSignificantValueDigits: `<Integer>` optional. The number of significant value digits. Default is 1. The range is 1 to 5.
+ - autoResize: `<Boolean>` optional. If true, the highestTrackableValue will auto-resize. Default is true.
+  - output_format: `<String>` optional. The output format, either `base64` (encoded string) or `binary` (`byte[]`). Default is `base64`.
+
+### Example
+
+ ```yaml
+ - function: HDR_HISTOGRAM
+ lookup_fields: [latency_ms]
+ output_fields: [latency_ms_histogram]
+ parameters:
+ input_type: regular
+ lowestDiscernibleValue: 1
+ highestTrackableValue: 3600000
+ numberOfSignificantValueDigits: 3
+ ```
+ ```yaml
+ - function: HDR_HISTOGRAM
+ lookup_fields: [latency_ms_histogram]
+ output_fields: [latency_ms_histogram]
+ parameters:
+ input_type: sketch
+ ```
+
+### Approx Quantile HDR
+
+Approx Quantile HDR is used to calculate the approximate quantile value of the field in the group of events.
+
+```APPROX_QUANTILE_HDR(filter, lookup_fields, output_fields[, parameters])```
+- filter: optional
+- lookup_fields: required.
+- output_fields: required.
+- parameters: optional.
+ - input_type: `<String>` optional. Refer to `HDR_HISTOGRAM` function.
+ - lowestDiscernibleValue: `<Integer>` optional. Refer to `HDR_HISTOGRAM` function.
+ - highestTrackableValue: `<Integer>` required. Refer to `HDR_HISTOGRAM` function.
+ - numberOfSignificantValueDigits: `<Integer>` optional. Refer to `HDR_HISTOGRAM` function.
+ - autoResize: `<Boolean>` optional. Refer to `HDR_HISTOGRAM` function.
+  - probability: `<Double>` optional. The quantile probability to compute, in the range 0 to 1. Default is 0.5.
+
+### Example
+
+ ```yaml
+ - function: APPROX_QUANTILE_HDR
+ lookup_fields: [latency_ms]
+ output_fields: [latency_ms_p95]
+ parameters:
+ input_type: regular
+ probability: 0.95
+ ```
+
+ ```yaml
+ - function: APPROX_QUANTILE_HDR
+   lookup_fields: [latency_ms_histogram]
+ output_fields: [latency_ms_p95]
+ parameters:
+ input_type: sketch
+ probability: 0.95
+
+ ```
+
+### Approx Quantiles HDR
+
+Approx Quantiles HDR is used to calculate the approximate quantile values of the field in the group of events.
+
+```APPROX_QUANTILES_HDR(filter, lookup_fields, output_fields[, parameters])```
+- filter: optional
+- lookup_fields: required.
+- output_fields: required.
+- parameters: optional.
+ - input_type: `<String>` optional. Refer to `HDR_HISTOGRAM` function.
+ - lowestDiscernibleValue: `<Integer>` optional. Refer to `HDR_HISTOGRAM` function.
+ - highestTrackableValue: `<Integer>` required. Refer to `HDR_HISTOGRAM` function.
+ - numberOfSignificantValueDigits: `<Integer>` optional. Refer to `HDR_HISTOGRAM` function.
+ - autoResize: `<Boolean>` optional. Refer to `HDR_HISTOGRAM` function.
+ - probabilities: `<Array<Double>>` required. The list of probabilities of the quantiles. Range is 0 to 1.
+
+### Example
+
+```yaml
+- function: APPROX_QUANTILES_HDR
+ lookup_fields: [latency_ms]
+ output_fields: [latency_ms_quantiles]
+ parameters:
+ input_type: regular
+ probabilities: [0.5, 0.95, 0.99]
+```
+
+```yaml
+- function: APPROX_QUANTILES_HDR
+  lookup_fields: [latency_ms_histogram]
+ output_fields: [latency_ms_quantiles]
+ parameters:
+ input_type: sketch
+ probabilities: [0.5, 0.95, 0.99]
+```
+
+
+
diff --git a/docs/processor/udtf.md b/docs/processor/udtf.md
new file mode 100644
index 0000000..a6e8444
--- /dev/null
+++ b/docs/processor/udtf.md
@@ -0,0 +1,66 @@
+# UDTF
+
+> The functions for table processors.
+
+## Table of contents
+
+- [UNROLL](#unroll)
+- [JSON_UNROLL](#json_unroll)
+
+## Description
+
+UDTFs (user-defined table functions) transform data on its way from source to sink as part of the processing pipeline. They can be used in the pre-processing, processing, and post-processing pipelines. Each processor assembles UDTFs into a pipeline; within the pipeline, events are processed by each function in order, from top to bottom.
+Unlike scalar functions, which return a single value, UDTFs are particularly useful when you need to explode or unroll data, transforming a single input row into multiple output rows.
+
+## UDTF Definition
+
+UDTFs and UDFs share similar input and context structures; please refer to [UDF](udf.md).
+
+## Functions
+
+### UNROLL
+
+The Unroll Function takes an array field, or an expression evaluating to an array, and unrolls it into individual events.
+
+```UNROLL(filter, lookup_fields, output_fields[, parameters])```
+- filter: optional
+- lookup_fields: required
+- output_fields: required
+- parameters: optional
+  - regex: `<String>` optional. If the lookup field is a string, this regex is used to split it into an array. Default is a comma.
+
+#### Example
+
+```yaml
+functions:
+ - function: UNROLL
+ lookup_fields: [ monitor_rule_list ]
+ output_fields: [ monitor_rule ]
+```
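The semantics of the example above can be sketched in Python. The `unroll` helper here is hypothetical, written only to illustrate the described behavior: one output event per array element, with a string lookup field first split by the regex.

```python
import re

def unroll(event: dict, lookup_field: str, output_field: str, regex: str = ","):
    value = event[lookup_field]
    # A string lookup field is split into an array using the regex
    items = re.split(regex, value) if isinstance(value, str) else value
    out = []
    for item in items:
        child = dict(event)          # each output event copies the input event
        child[output_field] = item
        out.append(child)
    return out
```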
+
+### JSON_UNROLL
+
+The JSON Unroll Function takes a JSON object and unrolls an array of objects within it into individual events, each inheriting the top-level fields.
+
+```JSON_UNROLL(filter, lookup_fields, output_fields[, parameters])```
+- filter: optional
+- lookup_fields: required
+- output_fields: required
+- parameters: optional
+ - path: `<String>` optional. Path to array to unroll, default is the root of the JSON object.
+ - new_path: `<String>` optional. Rename path to new_path, default is the same as path.
+
+#### Example
+
+```yaml
+functions:
+ - function: JSON_UNROLL
+ lookup_fields: [ device_tag ]
+ output_fields: [ device_tag ]
+    parameters:
+      path: tags
+      new_path: tag
+```
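A Python sketch of the assumed semantics: the array at `path` is exploded, each element is re-attached under `new_path`, and the remaining top-level keys of the JSON object are inherited by every output event. The `json_unroll` helper is hypothetical, for illustration only.

```python
import json

def json_unroll(event: dict, field: str, path: str, new_path: str):
    obj = json.loads(event[field])
    items = obj.pop(path, [])        # the array to explode
    out = []
    for item in items:
        child = dict(obj)            # inherit remaining top-level fields
        child[new_path] = item
        new_event = dict(event)
        new_event[field] = json.dumps(child)
        out.append(new_event)
    return out
```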
+
+
+