# Groot Stream Platform
Groot Stream Platform is a distributed data integration and synchronization tool that processes netflow data - logs, metrics, etc. - in real time with high reliability and high performance.
## Table of contents
- [Features](#features)
- [Groot Stream Workflow](#groot-stream-workflow)
- [Supported Connectors & Processors & Functions](#supported-connectors--processors--functions)
- [Minimum Requirements](#minimum-requirements)
- [Getting Started](#getting-started)
- [Documentation](#documentation)
## Features
Groot Stream is designed to simplify ETL (Extract, Transform, Load) operations. It efficiently collects data from multiple sources, then processes and enriches it.
- **Real-time data processing**: Using Flink as the execution engine, it can provide high throughput and low-latency processing capabilities for large-scale data streams.
- **Designed for extension**: Plugin-based management with support for user-defined Functions, Sources, and Sinks.
- **Highly Configurable**: Customize data flows through YAML templates to swiftly fulfill ETL requirements without development.
- **Out-of-the-box Functions**: Built-in functions for data processing, including data type conversion, data filtering, data aggregation, and data enrichment.
## Groot Stream Workflow

To configure a job, you set up Sources, Filters, a Processing Pipeline, and Sinks, assembling built-in functions into the Processing Pipeline. The job is then deployed to a Flink cluster for execution.
- **Source**: The data source of the job, which can be a Kafka topic, an IPFIX Collector, or a file.
- **Filter**: Filters data based on specified conditions.
- **Types of Pipelines**: The fundamental unit of data stream processing is the processor, categorized by functionality into stateless and stateful processors. Each processor can assemble `UDFs` (user-defined functions) or `UDAFs` (user-defined aggregation functions) into a pipeline. There are three types of pipelines, used at different stages of data processing:
- **Pre-processing Pipeline**: Optional. Attached to a source to normalize events before they enter the processing pipeline.
- **Processing Pipeline**: The main event-processing pipeline.
- **Post-processing Pipeline**: Optional. Attached to a sink to normalize events before they are written to the sink.
- **Sink**: The data sink of the job, which can be a Kafka topic, a ClickHouse table, or a file.
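The workflow above maps naturally onto a job template. The sketch below is purely illustrative: the key names (`sources`, `filters`, `pipelines`, `sinks`) and function names are assumptions, not the actual schema — see the real template in `config/template/grootstream_job_template.yaml` and the [User Guide](docs/user-guide.md) for the authoritative structure.

```yaml
# Hypothetical job layout illustrating the Source -> Filter -> Pipeline -> Sink flow.
# Key and function names are placeholders; consult docs/user-guide.md for real ones.
sources:
  netflow_in:
    type: kafka            # could also be an IPFIX Collector or a file
    topic: netflow-raw
filters:
  - expression: "bytes > 0"   # drop empty flow records
pipelines:
  processing:
    - udf: parse_netflow      # stateless processor
    - udaf: sum_bytes_per_ip  # stateful aggregation
sinks:
  flows_out:
    type: clickhouse          # could also be a Kafka topic or a file
    table: netflow_agg
```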
## Supported Connectors & Processors & Functions
- [Source Connectors](docs/connector/source)
- [Sink Connectors](docs/connector/sink)
- [Processor](docs/processor)
- [Functions](docs/processor/udf.md)
## Minimum Requirements
- Git installed
- Java (JDK/JRE 11 required) installed and `JAVA_HOME` set
- Maven 3.5.4
- Scala 2.12
- Flink 1.13.1
## Getting Started
### Building
Run the following Maven command to build the project modules using parallel threads:
```shell
./mvnw clean install -T2C
```
Run the following Maven command to build the project modules while skipping tests:
```shell
./mvnw clean install -DskipTests
```
### Deploying
#### 1. Download the release package
Download the latest release package from the [Releases](https://git.mesalab.cn/galaxy/platform/groot-stream/-/releases) page.
Copy the `groot-release/target/groot-stream-${version}-bin.tar.gz` file to the target machine and extract it:
```shell
tar -zxvf groot-stream-${version}-bin.tar.gz
ls -lh groot-stream-${version}
```
#### 2. Configure the environment
Configure the Flink engine environment variables in the `config/grootstream-env.sh` file. By default, system environment variables are used; if a variable is not set, the following defaults apply:
```shell
FLINK_HOME=${FLINK_HOME:-/opt/flink}
FLINK_JOB_MANAGER_ADDRESS=${FLINK_JOB_MANAGER_ADDRESS:-localhost:8081}
YARN_ADDRESS=${YARN_ADDRESS:-yarn-cluster}
```
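The defaults above rely on Bash's `${VAR:-default}` parameter expansion, which keeps an existing value and falls back to the default only when the variable is unset or empty. A quick illustration:

```shell
# Variable unset: the fallback after ':-' is used
unset FLINK_HOME
echo "${FLINK_HOME:-/opt/flink}"    # prints /opt/flink

# Variable set: the existing value wins
FLINK_HOME=/usr/local/flink
echo "${FLINK_HOME:-/opt/flink}"    # prints /usr/local/flink
```

This is why exporting `FLINK_HOME` (or the other variables) before running the scripts overrides the packaged defaults without editing `grootstream-env.sh`.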
#### 3. Configure the groot-stream job
Configure the groot-stream job in the `config/template/grootstream_job_template.yaml` file. For more information about the configuration, see the [config concept](docs/user-guide.md).
#### 4. Submit a job to the Flink engine
The job can be run as a daemon by passing `-d`:
```shell
./bin/start.sh -c *.yaml -d
```
### Starting
#### Running job in your IDE
1. Set the `groot-bootstrap` module's pom.xml scope to `compile`.
2. Open the `Run/Debug Configurations` window.
3. Choose `groot-bootstrap` as the classpath (`-cp`) module.
4. Choose Main Class `com.geedgenetworks.bootstrap.main.GrootStreamServer`.
5. Add VM options `--target local -c /...../groot-stream/config/inline_to_print_template.yaml`.
6. Click the `Run` button.
#### Running the CLI
- Run the following command to start the groot-stream server in Standalone Mode:
```shell
cd "groot-stream-${version}"
./bin/start.sh -c ./config/grootstream_job_example.yaml --target remote -n inline-to-print-job -d
```
- Run the following command to start the groot-stream server in Yarn Session Mode:
```shell
cd "groot-stream-${version}"
./bin/start.sh -c ./config/grootstream_job_example.yaml --target yarn-session -Dyarn.application.id=application_XXXX_YY -n inline-to-print-job -d
```
- Run the following command to start the groot-stream server in Yarn Per-job Mode:
```shell
cd "groot-stream-${version}"
./bin/start.sh -c ./config/grootstream_job_example.yaml --target yarn-per-job -Dyarn.application.name="inline-to-print-job" -n inline-to-print-job -d
```
### Configuring
The [User Guide](docs/user-guide.md) provides detailed information on how to configure a job.
## Documentation
See the [Groot Stream Documentation](docs) for more information.
## Contributors
See the list of contributors [here](https://git.mesalab.cn/galaxy/platform/groot-stream/-/graphs/develop?ref_type=heads).