author qidaijie <[email protected]> 2021-11-20 16:22:28 +0300
committer qidaijie <[email protected]> 2021-11-20 16:22:28 +0300
commit 74ebd80fab69f3480278b58e6ecc08f0f0fa5837
tree 984bfe57a91a305fd8bb5545ad9ccd2da300ea63
parent d6d19565c5081cb5f3c7890bc01711b3d0728e58
parent cc0a895d79733a1eb0191ad13b41bc20adb2373e
Merge branch 'E21' of https://git.mesalab.cn/galaxy/online-config into E21
-rw-r--r-- hbase/DC/docker-compose.yml 2
-rw-r--r-- olap_metrics 103
-rw-r--r-- 现场情况记录 20
3 files changed, 124 insertions, 1 deletions
diff --git a/hbase/DC/docker-compose.yml b/hbase/DC/docker-compose.yml
index 9794510..523c310 100644
--- a/hbase/DC/docker-compose.yml
+++ b/hbase/DC/docker-compose.yml
@@ -30,7 +30,7 @@ services:
     deploy:
       resources:
         limits:
-          memory: 15G
+          memory: 25G
 networks:
   galaxy:
     external: true
diff --git a/olap_metrics b/olap_metrics
new file mode 100644
index 0000000..1c24c6a
--- /dev/null
+++ b/olap_metrics
@@ -0,0 +1,103 @@
+services (common):
+http_server_requests_seconds_count
+process_uptime_seconds
+http_server_requests_seconds_max
+jvm_memory_used_bytes
+http_server_requests_seconds_sum
+
+job:
+jobLogSuccessCount
+jobLogCount
+triggerCountRunningTotal
+triggerDayCountSucList
+triggerDayCountFailList
+system_cpu_usage
+logback_events_total
+triggerCountSucTotal
+triggerCountFailTotal
+
+report:
+system_cpu_usage
+report_success_count_total
+report_fail_count_total
+
+hos:
+process_cpu_usage
+dashInfo
+
+Flink:
+flink_taskmanager_job_task_backPressuredTimeMsPerSecond
+flink_taskmanager_job_task_numBytesInPerSecond
+flink_taskmanager_job_task_numBytesOutPerSecond
+flink_taskmanager_Status_JVM_CPU_Load
+flink_taskmanager_Status_JVM_Memory_Heap_Used
+flink_taskmanager_Status_JVM_Memory_Heap_Committed
+flink_jobmanager_job_numRestarts
+flink_jobmanager_taskSlotsTotal
+flink_jobmanager_numRunningJobs
+Also filter the task.* and job_id labels.
+
+Nginx:
+nginx_vts_start_time_seconds
+nginx_vts_server_requests_total
+nginx_vts_upstream_requests_total
+nginx_vts_upstream_response_seconds_total
+
+Kafka:
+kafka_consumergroup_lag_sum
+kafka_server_BrokerTopicMetrics_OneMinuteRate
+kafka_server_BrokerTopicMetrics_OneMinuteRate
+kafka_server_socket_server_metrics_request_rate
+kafka_network_RequestMetrics_Errors_total
+
+Clickhouse:
+bad_requests_total
+clickhouse_inserted_bytes_total
+clickhouse_inserted_rows_total
+clickhouse_merge_total
+clickhouse_slow_read_total
+process_cpu_seconds_total
+process_virtual_memory_bytes
+request_sum_total
+
+Druid:
+coordinator_segment_count
+coordinator_segment_size
+sys_swap_page_in
+ingest_kafka_lag
+node_cpu_seconds_total
+sys_mem_used
+
+Hadoop:
+Hadoop_DataNode_HeartbeatsNumOps
+Hadoop_HBase_numMasterWALs
+Hadoop_HBase_numRegionServers
+Hadoop_NameNode_Total
+Hadoop_NameNode_PercentUsed
+Hadoop_NameNode_NumDeadDataNodes
+Hadoop_NameNode_NumberOfMissingBlocks
+
+HBase:
+java_lang_OperatingSystem_ProcessCpuLoad
+jvm_memory_bytes_used
+Hadoop_HBase_numDeadRegionServers
+Hadoop_HBase_regionCount
+Hadoop_HBase_ritCount
+Hadoop_HBase_slowGetCount
+Hadoop_HBase_slowPutCount
+Hadoop_HBase_slowAppendCount
+
+Nacos:
+http_server_requests_seconds_count
+jvm_memory_used_bytes
+system_cpu_usage
+
+Zookeeper:
+zookeeper_connections
+zookeeper_latency_avg_ms
+zookeeper_latency_max_ms
+zookeeper_leader
+zookeeper_outstanding_requests
+zookeeper_packets_received
+zookeeper_packets_sent
+zookeeper_znode_count
diff --git a/现场情况记录 b/现场情况记录
new file mode 100644
index 0000000..1fd9639
--- /dev/null
+++ b/现场情况记录
@@ -0,0 +1,20 @@
+2021-11-03
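+1: When no DoS threshold is configured in the UI, the dos-detection program frequently throws exceptions in the background. (The dos-detection program has been updated.)
+2: Updated the generate-baselines program.
+3: The topN calculation fails to produce results when the data volume is small.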
+
+2021-11-04
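+1: Cluster installation package:
+1.1: In the Flink set_flink_env.sh.j2 script, "chkconfig keepflinkjob" is wrong; it should be: chkconfig keepflinkjob on
+1.2: The iplearning startup script uses the wrong command to check whether the service is already running.
+1.3: In the dos-baseline startup script, the configuration file is packed into the jar at the wrong location.
+1.4: The schedule period of the dos-baseline scheduled task is wrong.
+1.5: The HBase-region monitoring configuration file is faulty: settings are missing.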
+
+2021-11-05
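+The following problems were found after changing the hostnames of the sub-center hosts last night:
+1: After the hostname change, the HBase service stayed abnormal after restarting or recreating the container, and after deleting the data directory or the nodes in zk and rebuilding; it returned to normal only after all directories were deleted and the service was reinstalled.
+2: Flink jobs were lost after a full machine reboot.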
+------------------------
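+1: After the taskmanager on a single machine is restarted, two taskmanagers appear; both are in an abnormal state (no resource information is visible in the UI, every field shows -), and the jobs cannot be recovered.
+2: Remove data_center parsing from the schema raw logs and livecharts.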
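
The olap_metrics file added in this commit groups metric names under section headers that end with a colon (job:, Flink:, Kafka:, and so on), with an occasional free-text note in between. As a rough illustration of how such a list could be consumed, here is a minimal Python sketch that parses this layout into a per-service mapping and joins the names into a regular-expression allowlist. The function names (parse_metrics, allowlist_regex) and the SAMPLE text are hypothetical and not part of the repository; the commit does not say how the file is actually used.

```python
import re

# A line counts as a metric name only if it looks like an identifier.
METRIC_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

# Hypothetical, shortened sample in the same layout as olap_metrics.
SAMPLE = """\
Kafka:
kafka_consumergroup_lag_sum
kafka_server_BrokerTopicMetrics_OneMinuteRate
kafka_server_BrokerTopicMetrics_OneMinuteRate

Flink:
flink_jobmanager_job_numRestarts
Also filter the task.* and job_id labels.
"""


def parse_metrics(text):
    """Group metric names by the section header that precedes them."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Section header such as "Kafka:".
            current = line[:-1]
            sections.setdefault(current, [])
        elif current is not None and METRIC_NAME.match(line):
            # Keep the first occurrence only; drop repeated names.
            if line not in sections[current]:
                sections[current].append(line)
        # Any other line is treated as a free-text note and skipped.
    return sections


def allowlist_regex(names):
    """Join metric names into one alternation pattern."""
    return "|".join(sorted(set(names)))


if __name__ == "__main__":
    for service, names in parse_metrics(SAMPLE).items():
        print(service, "->", allowlist_regex(names))
```

A pattern built this way could, for example, serve as the regex of a Prometheus metric_relabel_configs rule with action: keep on the __name__ label; whether the listed metrics are meant to be used that way is an assumption.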