| author | qidaijie <[email protected]> | 2021-11-20 16:22:28 +0300 |
|---|---|---|
| committer | qidaijie <[email protected]> | 2021-11-20 16:22:28 +0300 |
| commit | 74ebd80fab69f3480278b58e6ecc08f0f0fa5837 (patch) | |
| tree | 984bfe57a91a305fd8bb5545ad9ccd2da300ea63 | |
| parent | d6d19565c5081cb5f3c7890bc01711b3d0728e58 (diff) | |
| parent | cc0a895d79733a1eb0191ad13b41bc20adb2373e (diff) | |
Merge branch 'E21' of https://git.mesalab.cn/galaxy/online-config into E21
| -rw-r--r-- | hbase/DC/docker-compose.yml | 2 |
| -rw-r--r-- | olap_metrics | 103 |
| -rw-r--r-- | 现场情况记录 | 20 |
3 files changed, 124 insertions, 1 deletions
diff --git a/hbase/DC/docker-compose.yml b/hbase/DC/docker-compose.yml
index 9794510..523c310 100644
--- a/hbase/DC/docker-compose.yml
+++ b/hbase/DC/docker-compose.yml
@@ -30,7 +30,7 @@ services:
     deploy:
       resources:
         limits:
-          memory: 15G
+          memory: 25G
 networks:
   galaxy:
     external: true
diff --git a/olap_metrics b/olap_metrics
new file mode 100644
index 0000000..1c24c6a
--- /dev/null
+++ b/olap_metrics
@@ -0,0 +1,103 @@
+services (common):
+http_server_requests_seconds_count
+process_uptime_seconds
+http_server_requests_seconds_max
+jvm_memory_used_bytes
+http_server_requests_seconds_sum
+
+job:
+jobLogSuccessCount
+jobLogCount
+triggerCountRunningTotal
+triggerDayCountSucList
+triggerDayCountFailList
+system_cpu_usage
+logback_events_total
+triggerCountSucTotal
+triggerCountFailTotal
+
+report:
+system_cpu_usage
+report_success_count_total
+report_fail_count_total
+
+hos:
+process_cpu_usage
+dashInfo
+
+Flink:
+flink_taskmanager_job_task_backPressuredTimeMsPerSecond
+flink_taskmanager_job_task_numBytesInPerSecond
+flink_taskmanager_job_task_numBytesOutPerSecond
+flink_taskmanager_Status_JVM_CPU_Load
+flink_taskmanager_Status_JVM_Memory_Heap_Used
+flink_taskmanager_Status_JVM_Memory_Heap_Committed
+flink_jobmanager_job_numRestarts
+flink_jobmanager_taskSlotsTotal
+flink_jobmanager_numRunningJobsa
+Also filter the task.* and job_id labels.
+
+Nginx:
+nginx_vts_start_time_seconds
+nginx_vts_server_requests_total
+nginx_vts_upstream_requests_total
+nginx_vts_upstream_response_seconds_total
+
+Kafka:
+kafka_consumergroup_lag_sum
+kafka_server_BrokerTopicMetrics_OneMinuteRate
+kafka_server_BrokerTopicMetrics_OneMinuteRate
+kafka_server_socket_server_metrics_request_rate
+kafka_network_RequestMetrics_Errors_total
+
+Clickhouse:
+bad_requests_total
+clickhouse_inserted_bytes_total
+clickhouse_inserted_rows_total
+clickhouse_merge_total
+clickhouse_slow_read_total
+process_cpu_seconds_total
+process_virtual_memory_bytes
+request_sum_total
+
+Druid:
+coordinator_segment_count
+coordinator_segment_size
+sys_swap_page_in
+ingest_kafka_lag
+node_cpu_seconds_total
+sys_mem_used
+
+Hadoop:
+Hadoop_DataNode_HeartbeatsNumOps
+Hadoop_HBase_numMasterWALs
+Hadoop_HBase_numRegionServers
+Hadoop_NameNode_Total
+Hadoop_NameNode_PercentUsed
+Hadoop_NameNode_NumDeadDataNodes
+Hadoop_NameNode_NumberOfMissingBlocks
+
+HBase:
+java_lang_OperatingSystem_ProcessCpuLoad
+jvm_memory_bytes_used
+Hadoop_HBase_numDeadRegionServers
+Hadoop_HBase_regionCount
+Hadoop_HBase_ritCount
+Hadoop_HBase_slowGetCount
+Hadoop_HBase_slowPutCount
+Hadoop_HBase_slowAppendCount
+
+Nacos:
+http_server_requests_seconds_count
+jvm_memory_used_bytes
+system_cpu_usage
+
+Zookeeper:
+zookeeper_connections
+zookeeper_latency_avg_ms
+zookeeper_latency_max_ms
+zookeeper_leader
+zookeeper_outstanding_requests
+zookeeper_packets_received
+zookeeper_packets_sent
+zookeeper_znode_count
diff --git a/现场情况记录 b/现场情况记录
new file mode 100644
--- /dev/null
+++ b/现场情况记录
@@ -0,0 +1,20 @@
+2021-11-03
+1: When no DoS threshold is configured in the UI, the dos-detection program frequently throws exceptions in the background. (The dos-detection program has been updated.)
+2: Updated the generate-baselines program.
+3: The topN calculation cannot produce a result when the data volume is small.
+
+2021-11-04
+1: Cluster installation package:
+1.1: In the Flink set_flink_env.sh.j2 script, "chkconfig keepflinkjob" is wrong; it should be: chkconfig keepflinkjob on
+1.2: The iplearning startup script uses the wrong command to check whether the program is already running.
+1.3: The dos-baseline startup script puts the configuration file in the wrong location inside the jar.
+1.4: The dos-baseline scheduled task runs at the wrong interval.
+1.5: The HBase-region monitoring configuration file is faulty; configuration entries are missing.
+
+2021-11-05
+Changing the hostnames at the sub-center last night exposed the following problems:
+1: After the hostname change, HBase stayed abnormal whether the container was restarted or recreated, or the data directory and the zk nodes were deleted and the service rebuilt; it returned to normal only after all directories were deleted and HBase was reinstalled.
+2: Flink jobs are lost after a full machine reboot.
+------------------------
+1: After the taskmanager on a single machine is restarted, two taskmanagers appear; both are in an abnormal state (no resource information is visible in the UI, every field shows -), and the jobs cannot be recovered either.
+2: Remove data_center parsing from the schema raw logs and from livecharts.
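The olap_metrics file added above groups metric names under per-service headers and asks for the task.* and job_id labels to be filtered. A minimal sketch of how such a list might be consumed follows. It assumes the names are Prometheus metric names, that the list is meant as a keep-list, and that "filter" means dropping those labels; none of this is stated in the commit. `SAMPLE`, `parse_sections`, `keep_regex`, and `drop_labels` are hypothetical names introduced here for illustration.

```python
import re

# Hypothetical excerpt in the same layout as olap_metrics: a "Name:" header,
# one metric name per line, optional free-text notes, blank-line separators.
SAMPLE = """\
Flink:
flink_jobmanager_job_numRestarts
flink_jobmanager_taskSlotsTotal
Also filter the task.* and job_id labels.

Nginx:
nginx_vts_start_time_seconds
nginx_vts_server_requests_total
"""

# Character set Prometheus allows in metric names.
METRIC_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:]*")


def parse_sections(text):
    """Group metric names under their section header, skipping note lines."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            current = line[:-1]
            sections[current] = []
        elif current is not None and METRIC_NAME.fullmatch(line):
            sections[current].append(line)
    return sections


def keep_regex(names):
    """Build an alternation that matches exactly the given metric names."""
    return "|".join(re.escape(n) for n in sorted(set(names)))


def drop_labels(labels, patterns=(r"task.*", r"job_id")):
    """Return a copy of labels without keys that fully match any pattern.

    Assumption: "filter" in the note means these labels are removed.
    """
    return {
        key: value
        for key, value in labels.items()
        if not any(re.fullmatch(p, key) for p in patterns)
    }


if __name__ == "__main__":
    for service, names in parse_sections(SAMPLE).items():
        print(service, keep_regex(names))
```

The alternation returned by `keep_regex` could, under the same assumptions, be pasted into the `regex` field of a relabel rule with `action: keep` on `__name__`. Deduplicating with `set` also absorbs repeated entries such as the duplicated Kafka line in the file.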
