# Groot Stream Platform

Groot Stream Platform is a distributed data integration and synchronization tool that processes netflow data - logs, metrics, etc. - in real time, with high reliability and high performance.

## Table of Contents

- [Features](#features)
- [Groot Stream Workflow](#groot-stream-workflow)
- [Supported Connectors & Processors & Functions](#supported-connectors--processors--functions)
- [Minimum Requirements](#minimum-requirements)
- [Getting Started](#getting-started)
- [Documentation](#documentation)

## Features

Groot Stream is designed to simplify ETL (Extract, Transform, Load) operations. It efficiently collects data from multiple sources, then processes and enriches it.

- **Real-time data processing**: Uses Flink as the execution engine to provide high-throughput, low-latency processing for large-scale data streams.
- **Designed for extension**: Plugin-based management with support for user-defined Functions, Sources, and Sinks.
- **Highly configurable**: Customize data flows through YAML templates to fulfill ETL requirements quickly, without development work.
- **Out-of-the-box functions**: Built-in functions for data processing, including data type conversion, filtering, aggregation, and enrichment.

## Groot Stream Workflow

![Groot Stream Workflow](docs/images/groot_stream_architecture.jpg)

When configuring a job, you set up Sources, Filters, Processing Pipelines, and Sinks, and assemble built-in functions into a Processing Pipeline. The job is then deployed to a Flink cluster for execution.

- **Source**: The data source of the job, which can be a Kafka topic, an IPFIX collector, or a file.
- **Filter**: Filters data based on specified conditions.
- **Types of Pipelines**: The fundamental unit of data stream processing is the processor, categorized by functionality into stateless and stateful processors. Each processor can assemble `UDFs` (user-defined functions) into a pipeline.
  There are three types of pipelines, used at different stages of data processing:
  - **Pre-processing Pipeline**: Optional. Attached to a source to normalize events before they enter the processing pipeline.
  - **Processing Pipeline**: The main event processing pipeline.
  - **Post-processing Pipeline**: Optional. Attached to a sink to normalize events before they are written to the sink.
- **Sink**: The data sink of the job, which can be a Kafka topic, a ClickHouse table, or a file.

## Supported Connectors & Processors & Functions

- [Source Connectors](docs/connector/source)
- [Sink Connectors](docs/connector/sink)
- [Processors](docs/processor)
- [Functions](docs/processor/udf.md)

## Minimum Requirements

- Git installed
- Java (JDK/JRE 11 required) installed and `JAVA_HOME` set
- Maven 3.5.4
- Scala 2.12
- Flink 1.13.1

## Getting Started

### Building

Run the following Maven command to build the project modules using parallel threads:

```shell
./mvnw clean install -T2C
```

Run the following Maven command to build the project modules and skip tests:

```shell
./mvnw clean install -DskipTests
```

### Deploying

#### 1. Download the release package

Download the latest release package from the [Releases](https://git.mesalab.cn/galaxy/platform/groot-stream/-/releases) page. Copy the `groot-release/target/groot-stream-${version}-bin.tar.gz` file to the target machine and extract it:

```shell
tar -zxvf groot-stream-${version}-bin.tar.gz
ls -lh groot-stream-${version}
```

#### 2. Configure the environment

Configure the Flink engine environment variables in the `config/grootstream-env.sh` file. By default, system environment variables are used; if a variable is not set, the following defaults apply:

```shell
FLINK_HOME=${FLINK_HOME:-/opt/flink}
FLINK_JOB_MANAGER_ADDRESS=${FLINK_JOB_MANAGER_ADDRESS:-localhost:8081}
YARN_ADDRESS=${YARN_ADDRESS:-yarn-cluster}
```

#### 3. Configure the job
You need to configure the groot-stream job in the `config/grootstream_job_example.yaml` file. For more information about the configuration, see the [config concept](docs/user-guide.md).

#### 4. Submit a job to the Flink engine

The job can be started as a daemon with `-d`:

```shell
./bin/start.sh -c *.yaml -d
```

### Starting

#### Running a job in your IDE

1. Set the `groot-bootstrap` module's pom.xml scope to `compile`.
2. Open the `Run/Debug Configurations` window.
3. Choose `groot-bootstrap` as the classpath (`-cp`) module.
4. Choose `com.geedgenetworks.bootstrap.main.GrootStreamServer` as the Main Class.
5. Add the VM options `--target local -c /...../groot-stream/config/grootstream_job_example.yaml`.
6. Click the `Run` button.

#### Running the CLI

- Run the following command to start the groot-stream server in Standalone Mode:

  ```shell
  cd "groot-stream-${version}"
  ./bin/start.sh -c ./config/grootstream_job_example.yaml --target remote -n inline-to-print-job -d
  ```

- Run the following commands to start the groot-stream server in Yarn Session Mode:

  ```shell
  # First, create a yarn session cluster
  yarn-session.sh -d
  # Then start the groot-stream server in Yarn Session Mode
  cd "groot-stream-${version}"
  ./bin/start.sh -c ./config/grootstream_job_example.yaml --target yarn-session -Dyarn.application.id=application_XXXX_YY -n inline-to-print-job -d
  ```

- Run the following command to start the groot-stream server in Yarn Per-job Mode:

  ```shell
  cd "groot-stream-${version}"
  ./bin/start.sh -c ./config/grootstream_job_example.yaml --target yarn-per-job -Dyarn.application.name="inline-to-print-job" -Djobmanager.memory.process.size=1024m -Dtaskmanager.memory.process.size=2048m -Dtaskmanager.numberOfTaskSlots=3 -p 6 -n inline-to-print-job -d
  ```

### Configuring

The [User Guide](docs/user-guide.md) provides detailed information on how to configure a job.

## Documentation

See the [Groot Stream Documentation](docs) for more information.
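## Example Job Configuration

As a quick illustration of the Source, Filter, Pipeline, and Sink concepts, a job template might look like the following sketch. The field names and structure here are assumptions for illustration only, not the actual schema; see `config/grootstream_job_example.yaml` and the [User Guide](docs/user-guide.md) for the real format.

```yaml
# Illustrative sketch only - field names are assumptions, not the actual schema.
# See config/grootstream_job_example.yaml for a real example.
sources:
  - type: kafka
    topic: netflow-raw
    pre_processing_pipeline:       # optional: normalize events before processing
      - udf: json_decode
filters:
  - condition: "event.bytes > 0"   # drop empty flow records
processing_pipeline:
  - udf: type_convert              # built-in data type conversion
  - udf: enrich_geoip              # hypothetical enrichment UDF
sinks:
  - type: clickhouse
    table: netflow_enriched
```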
## Contributors

See the full list of contributors [here](https://git.mesalab.cn/galaxy/platform/groot-stream/-/graphs/develop?ref_type=heads).