| author | zy <[email protected]> | 2023-11-10 02:04:11 -0500 |
|---|---|---|
| committer | zy <[email protected]> | 2023-11-10 02:04:11 -0500 |
| commit | 075d759e62d50ca17a4d6a2b419917ec9973db36 (patch) | |
| tree | b4dc5c5dfcf59577632bc8d164961fe97495d0b6 | |
| parent | b03de1505382c0391da0981e42e081b3a2a0f3bd (diff) | |
| -rw-r--r-- | README.md | 6 |
| -rw-r--r-- | update_5.17.15-1.el8.x86_64.md | 449 |
2 files changed, 455 insertions, 0 deletions
@@ -1,5 +1,11 @@
 # diagnose-tools

+---
+
+Commits after 2023-10-18 adapt the code to the 5.17.15-1.el8.x86_64 kernel: it compiles and load-monitor works; nothing else has been tested.
+
+---
+
 1. Quick start

 Running the experiments on CentOS 7.5/7.6 is recommended.
diff --git a/update_5.17.15-1.el8.x86_64.md b/update_5.17.15-1.el8.x86_64.md
new file mode 100644
index 0000000..1aa4ec4
--- /dev/null
+++ b/update_5.17.15-1.el8.x86_64.md
@@ -0,0 +1,449 @@
+# update_5.17.15-1.el8.x86_64
+
+[diagnose-tools](https://github.com/alibaba/diagnose-tools)
+- Alibaba's open-source kernel diagnostics tool.[^1] It supports up to CentOS 7 and needs to be ported to CentOS 8.
+
+This is the log of that work.
+
+~~WSL has no CentOS image, so I installed one with [CentOS-WSL](https://github.com/mishamosher/CentOS-WSL/releases?page=2) (CentOS 7.8).~~
+
+18:32 The WSL kernel is specially patched; to rule that out as a factor, use a virtual machine instead.
+
+## centos7
+
+hyper-v centos 7.6.1810
+- /etc/sysconfig/network-scripts/ifcfg-eth0
+
+```ini
+BOOTPROTO="static"
+IPADDR=172.22.63.1
+NETMASK=255.255.255.0
+GATEWAY=172.22.48.1
+DNS1=172.22.48.1
+```
+
+```bash
+[root@localhost diagnose-tools]# make
+cd SOURCE/module; make --jobs=4
+make[1]: Entering directory `/root/diagnose-tools/SOURCE/module'
+make CFLAGS_MODULE="-DMODULE -DCENTOS_7U" -C /lib/modules/3.10.0-1160.99.1.el7.x86_64/build M=/root/diagnose-tools/SOURCE/module modules
+make: Entering an unknow
+```
+
+The kernel source tree is missing.
+
+```bash
+yum install kernel-devel-3.10.0-1160.99.1.el7.x86_64
+```
+
+Mismatched function signatures??
+- <https://github.com/alibaba/diagnose-tools/issues/164>
+
+```bash
+/root/diagnose-tools/SOURCE/module/pmu/entry.c:49:5: error: conflicting types for ‘diag_pmu_exit’
+ int diag_pmu_exit(void)
+ ^
+In file included from /root/diagnose-tools/SOURCE/module/pmu/entry.c:17:0:
+/root/diagnose-tools/SOURCE/module/internal.h:902:6: note: previous declaration of ‘diag_pmu_exit’ was here
+ void diag_pmu_exit(void);
+```
+
+Changing `int diag_pmu_exit(void)` to `void diag_pmu_exit(void)` fixes it.
+- It sits in a conditionally compiled branch and seems to serve no purpose…
+
+## rocky 8
+
+5.17.15-1.el8.x86_64
+kernel-ml-5.17.15-1.el8.x86_64 -> <https://elrepo.org/tiki/kernel-ml>
+- Download all of the 5.17.15-1 rpm packages and install them with yum.
+
+<https://rockylinux.org/download/> -> download the ISO here
+- Hyper-V may have a [[2023-10-18#^a3yfga|pitfall]]
+
+```bash
+yum --enablerepo=elrepo-kernel install kernel-ml-5.17.15-1.el8.x86_64
+dnf --enablerepo="elrepo-kernel" install kernel-ml-5.17.15-1.el8.x86_64
+```
+
+```bash
+grubby --set-default=/boot/vmlinuz-5.17.15-1.el8.x86_64
+grub2-set-default 0
+dracut -f
+grub2-mkconfig -o /boot/grub2/grub.cfg
+```
+
+### `make devel`
+
+```bash
+yum install -y libstdc++-static
+Last metadata expiration check: 0:46:17 ago on Wed Oct 18 02:45:32 2023.
+No match for argument: libstdc++-static
+Error: Unable to find a match: libstdc++-static
+make: *** [Makefile:39: devel] Error 1
+```
+
+- `dnf --enablerepo=powertools install libstdc++-static`[^2]
+  - `yum install dnf-plugins-core` installs the dnf plugins (these provide `config-manager`)
+
+```bash
+No match for argument: glibc-static
+Error: Unable to find a match: glibc-static
+```
+
+- Likewise, `dnf --enablerepo=powertools install glibc-static`
+
+A less tedious way is to enable the repository once:
+
+```bash
+yum install dnf-plugins-core
+yum config-manager --set-enabled powertools
+```
+
+And there is more… `Error: Unable to find a match: libunwind`
+- Red Hat removed libunwind in CentOS 8 [^3]
+- `yum -y install epel-release` [^4] enables the [EPEL repository](https://docs.fedoraproject.org/en-US/epel/)
+
+PowerTools and EPEL are separate repositories, so both have to be enabled:
+
+```bash
+dnf config-manager --set-enabled powertools
+dnf install epel-release
+```
+
+And still more… `Error: Unable to find a match: openssl-static`..
+- It was likewise removed in CentOS 8.. [^3]
+- Apparently static libraries are no longer shipped, for security reasons…
+- Installing [openssl11-static-1.1.1k-5.el7.x86_64.rpm](https://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/o/openssl11-static-1.1.1k-5.el7.x86_64.rpm) did not work either, so the only option is to build it myself.
+
+```bash
+mkdir -p /root/rpmbuild/{BUILD,BUILDROOT,RPMS,SOURCES,SPECS,SRPMS}
+# build dependencies
+dnf -y --nogpgcheck install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm wget zip unzip bzip2 gcc gcc-c++ rpm-build make krb5-devel perl-interpreter zlib-devel lksctp-tools-devel perl-podlators perl-Test-Harness perl-Math-BigInt perl-Module-Load-Conditional perl-File-Temp perl-Time-HiRes perl-CPAN perl-Test-Simple
+
+cd /root/rpmbuild/
+dnf download --source openssl
+rpmbuild --rebuild xxx.rpm
+# when the build finishes, the packages are under /root/rpmbuild/RPMS/x86_64/
+```
+
+- Reference -> <https://gist.github.com/Kungergely/4a55c5668f1f0942625f67e012215502>
+
+Install
+
+```bash
+yum install openssl-libs-1.1.1k-9.el8.x86_64.rpm --allowerasing
+yum install openssl-devel-1.1.1k-9.el8.x86_64.rpm
+yum install openssl-static-1.1.1k-9.el8.x86_64.rpm
+```
+
+### make deps
+
+No error output, apparently.
+
+### make module
+
+```bash
+/root/diagnose-tools/SOURCE/module/pub/trace_file.c: In function ‘to_trace_file’:
+/root/diagnose-tools/SOURCE/module/pub/trace_file.c:70:9: error: implicit declaration of function ‘PDE_DATA’; did you mean ‘NODE_DATA’? [-Werror=implicit-function-declaration]
+ return PDE_DATA(file->f_inode);
+ ^~~~~~~~
+ NODE_DATA
+/root/diagnose-tools/SOURCE/module/pub/trace_file.c:70:9: error: returning ‘int’ from a function with return type ‘struct diag_trace_file *’ makes pointer from integer without a cast [-Werror=int-conversion]
+ return PDE_DATA(file->f_inode);
+```
+
+- PDE_DATA is declared in `linux/proc_fs.h`, but from 5.17 on [PDE_DATA() replaced by pde_data()](https://github.com/openzfs/zfs/issues/13004)
+- At SOURCE/module/pub/trace_file.c:70, add a conditional compile: on 5.17 and later kernels, return pde_data.
+
+```bash
+/root/diagnose-tools/SOURCE/module/pub/uprobe.c:47:9: error: implicit declaration of function ‘fcheck_files’; did you mean ‘unshare_files’? [-Werror=implicit-function-declaration]
+ file = fcheck_files(files, fd);
+ ^~~~~~~~~~~~
+ unshare_files
+/root/diagnose-tools/SOURCE/module/pub/uprobe.c:47:7: error: assignment to ‘struct file *’ from ‘int’ makes pointer from integer without a cast [-Werror=int-conversion]
+ file = fcheck_files(files, fd);
+ ^
+cc1: all warnings being treated as errors
+make[3]: *** [scripts/Makefile.build:288: /root/diagnose-tools/SOURCE/module/pub/uprobe.o] Error 1
+make[3]: *** Waiting for unfinished jobs….
+/root/diagnose-tools/SOURCE/module/pub/fs_utils.c: In function ‘for_each_files_task’:
+/root/diagnose-tools/SOURCE/module/pub/fs_utils.c:161:10: error: implicit declaration of function ‘fcheck_files’; did you mean ‘unshare_files’? [-Werror=implicit-function-declaration]
+ file = fcheck_files(files, fd);
+ ^~~~~~~~~~~~
+ unshare_files
+/root/diagnose-tools/SOURCE/module/pub/fs_utils.c:161:8: error: assignment to ‘struct file *’ from ‘int’ makes pointer from integer without a cast [-Werror=int-conversion]
+ file = fcheck_files(files, fd);
+```
+
+- [[PATCH v2 09/24] file: Replace fcheck_files with files_lookup_fd_rcu](https://lore.kernel.org/lkml/[email protected]/#r)
+- The kernel version that matches this patch is less certain… going by the date, roughly 5.10…
+- Add a conditional compile
+
+```bash
+/root/diagnose-tools/SOURCE/module/misc.c: In function ‘diag_task_brief’:
+/root/diagnose-tools/SOURCE/module/misc.c:593:23: error: ‘struct task_struct’ has no member named ‘state’; did you mean ‘stats’?
+ detail->state = tsk->state;
+ ^~~~~
+ stats
+```
+
+- [[PATCH v2 7/7] sched: Change task_struct::state](https://lore.kernel.org/all/[email protected]/#r)
+- `volatile long state;` became `unsigned int __state;` <mark style="background: #FFB86CA6;">this is a potential hazard</mark>
+- Roughly 5.13-rc1 to rc2
+
+```bash
+/root/diagnose-tools/SOURCE/module/kernel/mutex.c: In function ‘__activate_mutex_monitor’:
+/root/diagnose-tools/SOURCE/module/kernel/mutex.c:434:2: error: implicit declaration of function ‘get_online_cpus’; did you mean ‘get_online_mems’? [-Werror=implicit-function-declaration]
+ get_online_cpus();
+ ^~~~~~~~~~~~~~~
+ get_online_mems
+```
+
+- ~~[[tip:sched/core] sched: Remove get_online_cpus() usage](https://lore.kernel.org/lkml/[email protected]/#r)~~
+- ~~Already removed… well///~~ Wrong patch: that one only drops an unnecessary get_online_cpus() call in the scheduler.
+- [[PATCH 24/38] cgroup: Replace deprecated CPU-hotplug functions.](https://lore.kernel.org/lkml/[email protected]/#r) This patch mentions that get_online_cpus() was replaced by `cpus_read_lock()`.
+- Going by the date, it was merged around v5.14.. The matching put_online_cpus() was replaced at the same time (by `cpus_read_unlock()`).
+- <mark style="background: #ADCCFFA6;">The source contains many `get_online_cpus()` / `put_online_cpus()` calls; take great care</mark>
+
+```bash
+/root/diagnose-tools/SOURCE/module/kernel/exit.c:176:2: error: implicit declaration of function ‘profile_event_register’; did you mean ‘kprobe_event_delete’? [-Werror=implicit-function-declaration]
+ profile_event_register(PROFILE_TASK_EXIT, &task_exit_nb);
+ ^~~~~~~~~~~~~~~~~~~~~~
+ kprobe_event_delete
+```
+
+- [[PATCH 01/17] exit: Remove profile_task_exit & profile_munmap](https://lore.kernel.org/lkml/[email protected]/#r)
+- Removed outright… with no replacement function… this corresponds roughly to the 5.17 kernel..[^5]
+  - The matching `profile_event_unregister` was removed as well.
+- According to [捕获内核的异常事件 - smilingsusu](https://www.cnblogs.com/smilingsusu/articles/14596364.html) ("Capturing kernel exception events"), profile_event_register is used here for the process-exit event, which can be handled with the kprobe mechanism instead.
+  - For more detail on the change see <https://tinylab.org/linux-kprobes/>
+- This needs a lot of changes; <mark style="background: #FF5582A6;">could be a pitfall</mark>
+
+```bash
+/root/diagnose-tools/SOURCE/module/kernel/load.c: In function ‘diag_load_timer’:
+/root/diagnose-tools/SOURCE/module/kernel/load.c:147:8: error: implicit declaration of function ‘task_contributes_to_load’; did you mean ‘task_state_to_char’? [-Werror=implicit-function-declaration]
+ if (task_contributes_to_load(p))
+ ^~~~~~~~~~~~~~~~~~~~~~~~
+ task_state_to_char
+```
+
+- This one is tricky,,, ~~I could not find any record of its removal, yet the 5.18 source indeed lacks the function.~~ I finally found it in [torvalds/linux](https://github.com/torvalds/linux) on GitHub.
+- [sched: Fix loadavg accounting race](https://www.spinics.net/lists/kernel/msg3582022.html)
+- The function was indeed removed, but its original macro definition can still be found. There is no replacement, so a locally defined `task_contributes_to_load2` stands in for now.
+
+```bash
+/root/diagnose-tools/SOURCE/module/misc.c:93:14: error: ‘bdevt_str’ defined but not used [-Werror=unused-function]
+```
+
+- Nothing in the surrounding code calls bdevt_str; its only call site is guarded so that kernels newer than 5.8 skip it. Add a conditional compile.
+
+---
+
+Now the kernel build configuration turns out to be missing.
+
+```bash
+[root@localhost diagnose-tools]# make module
+cd SOURCE/module; make --jobs=4
+make[1]: Entering directory '/root/diagnose-tools/SOURCE/module'
+make CFLAGS_MODULE="-DMODULE -DCENTOS_8U" -C /lib/modules/5.17.15-1.el8.x86_64/build M=/root/diagnose-tools/SOURCE/module modules
+make[2]: Entering directory '/usr/src/kernels/5.17.15-1.el8.x86_64'
+
+ ERROR: Kernel configuration is invalid.
+ include/generated/autoconf.h or include/config/auto.conf are missing.
+ Run 'make oldconfig && make prepare' on kernel src to fix it.
+```
+
+- Copy the file over with `scp [email protected]:/usr/src/kernels/5.17.15-1.el8.x86_64/include/generated/autoconf.h ./`
+
+---
+
+```bash
+/root/diagnose-tools/SOURCE/module/mm/memcg_stats.c:249:8: error: implicit declaration of function ‘mem_cgroup_nodeinfo’; did you mean ‘mem_cgroup_online’? [-Werror=implicit-function-declaration]
+ mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
+ ^~~~~~~~~~~~~~~~~~~
+ mem_cgroup_online
+```
+
+- [mm: memcontrol: kill mem_cgroup_nodeinfo()](https://github.com/torvalds/linux/commit/a3747b53b1771a787fea71d86a2fc39aea337685) removed mem_cgroup_nodeinfo; judging by the source, it was just a simple struct member access.
+- v5.13-rc1
+
+```bash
+/root/diagnose-tools/SOURCE/module/net/tcp_retrans.c:94:35: error: ‘tcp_retrans_variant_buffer’ defined but not used [-Werror=unused-variable]
+ static struct diag_variant_buffer tcp_retrans_variant_buffer;
+```
+
+- Not a Linux source problem: further down, `tcp_retrans_variant_buffer` is only used on kernels below 5.3.0. Add a conditional compile.
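
Most of the fixes above come down to gating code on the kernel version. As a rough aid, the sketch below mirrors the arithmetic of the kernel's `KERNEL_VERSION(a,b,c)` macro in bash and lists which of the APIs mentioned in these notes are already gone for a given `uname -r` string. The helper names `kver_code` and `changed_apis` are made up for illustration, and the version thresholds are the estimates from these notes, not values checked against the kernel tree.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of diagnose-tools.

# Convert a release string such as "5.17.15-1.el8.x86_64" into the integer
# that KERNEL_VERSION(5,17,15) yields in C: (a << 16) + (b << 8) + c.
# Newer kernel headers clamp c to 255, so the same clamp is applied here.
kver_code() {
    local v=${1%%-*}            # drop the release suffix, e.g. "-1.el8.x86_64"
    local a b c
    IFS=. read -r a b c <<< "$v"
    b=${b:-0}
    c=${c:-0}
    if (( c > 255 )); then c=255; fi
    echo $(( (a << 16) + (b << 8) + c ))
}

# Print every API from the table whose threshold is at or below the given
# kernel release. The thresholds are the estimates recorded in these notes.
changed_apis() {
    local running since api
    running=$(kver_code "$1")
    while read -r since api; do
        if (( running >= $(kver_code "$since") )); then
            echo "$api"
        fi
    done <<'EOF'
5.13.0 mem_cgroup_nodeinfo() removed
5.14.0 get_online_cpus() -> cpus_read_lock()
5.17.0 PDE_DATA() -> pde_data()
5.17.0 profile_event_register() removed
EOF
}
```

Calling `changed_apis "$(uname -r)"` on the 5.17.15-1.el8.x86_64 kernel used here prints all four entries, while the CentOS 7 kernel 3.10.0-1160.99.1.el7.x86_64 prints nothing. The same thresholds would go into `#if LINUX_VERSION_CODE >= KERNEL_VERSION(...)` guards in the module source.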
+
+---
+
+```bash
+ MODPOST /root/diagnose-tools/SOURCE/module/Module.symvers
+ERROR: modpost: "cpuacct_cgroup_walk_tree" [/root/diagnose-tools/SOURCE/module/diagnose.ko] undefined!
+make[3]: *** [scripts/Makefile.modpost:134: /root/diagnose-tools/SOURCE/module/Module.symvers] Error 1
+make[3]: *** Deleting file '/root/diagnose-tools/SOURCE/module/Module.symvers'
+make[2]: *** [Makefile:1746: modules] Error 2
+make[2]: Leaving directory '/usr/src/kernels/5.17.15-1.el8.x86_64'
+make[1]: *** [Makefile:251: default] Error 2
+make[1]: Leaving directory '/root/diagnose-tools/SOURCE/module'
+make: *** [Makefile:74: module] Error 2
+```
+
+- This comes from the conditional-compile guard in `diagnose-tools/SOURCE/module/pub/cgroup.c`
+
+```c
+#if LINUX_VERSION_CODE < KERNEL_VERSION(3,10,0) || LINUX_VERSION_CODE > KERNEL_VERSION(5,16,0)
+```
+
+- This guard is odd.. ~~changing it to 5.18 lets the build pass~~… <mark style="background: #FF5582A6;">this is going to be a huge pitfall..</mark>
+- This is the pmu-related module; commenting it out lets the build pass, but the `Invalid module format` problem is still there.
+
+```bash
+# `use open qw(:std :utf8);` caused a problem during make test
+yum install perl-open.noarch
+```
+
+---
+
+```bash
+[root@localhost diagnose-tools]# insmod -f SOURCE/module/diagnose.ko
+insmod: ERROR: could not insert module SOURCE/module/diagnose.ko: Invalid module format
+
+or
+
+[root@localhost diagnose-tools]# insmod SOURCE/module/diagnose.ko
+insmod: ERROR: could not insert module SOURCE/module/diagnose.ko: Invalid parameters
+```
+
+- ~~This is a nasty one… the only way is to rebuild the kernel with log output enabled and look for the actual cause.~~
+
+Side notes:
+- Two commits related to `Invalid module format`: [1](https://github.com/alibaba/diagnose-tools/pull/21) [2](https://github.com/alibaba/diagnose-tools/pull/21/files)
+- After changing `orig___mutex_unlock_slowpath = (void *)diag_kallsyms_lookup_name("__mutex_unlock_slowpath.isra");`, this bug seems to have been sidestepped.
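
`Invalid parameters` is how insmod reports an `EINVAL` from the kernel, and that can come from the module's own init function rather than from the loader. One quick check, sketched here under that assumption, is to compare the symbols the module resolves by name against a kallsyms listing. `missing_symbols` is a hypothetical helper, and the list of symbol names passed to it has to be collected by hand from the module source.

```bash
#!/usr/bin/env bash
# Hypothetical helper for illustration; not part of diagnose-tools.

# Usage: missing_symbols TABLE SYMBOL...
# TABLE is a file in /proc/kallsyms format: "address type name [module]".
# Prints, one per line, every SYMBOL that has no exact match in column 3.
missing_symbols() {
    local table=$1
    shift
    local sym
    for sym in "$@"; do
        # awk exits 0 only if some line carries exactly this symbol name
        if ! awk -v s="$sym" '$3 == s { found = 1; exit } END { exit !found }' "$table"; then
            echo "$sym"
        fi
    done
}
```

A typical call would be `missing_symbols /proc/kallsyms text_mutex tasklist_lock disk_name`. The match is exact, so compiler-suffixed names such as `__mutex_unlock_slowpath.isra` variants have to be passed in the form they actually appear in the listing.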
+
+Add prints at the entry point to find out where it goes wrong
+- `entry.c -> static int __init diagnosis_init(void)` is the entry point
+- `ret = alidiagnose_symbols_init();` fails
+- which leads to `diagnose-tools/SOURCE/module/symbol.c -> int alidiagnose_symbols_init(void)`
+- and from there to `ret = lookup_syms();`
+
+```c
+static int lookup_syms(void)
+{
+    LOOKUP_SYMS(text_mutex);
+    LOOKUP_SYMS(tasklist_lock);
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 2, 0) || defined(CENTOS_4_18_193)
+    LOOKUP_SYMS(stack_trace_save_tsk);
+#ifdef CONFIG_USER_STACKTRACE_SUPPORT
+    LOOKUP_SYMS(stack_trace_save_user);
+#endif
+xxx
+    LOOKUP_SYMS(disk_name); // this is where it fails
+```
+
+`diagnose-tools/SOURCE/module/internal.h` -> `LOOKUP_SYMS`
+
+```c
+#define LOOKUP_SYMS(name) do { \
+        orig_##name = (void *)diag_kallsyms_lookup_name(#name); \
+        if (!orig_##name) { \
+            pr_err("kallsyms_lookup_name: %s\n", #name); \
+            return -EINVAL; \
+        } \
+    } while (0)
+```
+
+- This matches the log: `[ 6354.070181] kallsyms_lookup_name: disk_name`
+- LOOKUP_SYMS is the macro that looks up a function by name; the name it cannot find is `disk_name` ^z0gbfx
+- [block: remove disk_name()](https://github.com/torvalds/linux/commit/abd2864a3e46368a58f3718491521779099bfc14): this commit removed the disk_name function.. v5.15-rc1..
+- From the commit: disk_name was the lower-level helper underneath bdevname, and this commit folded it in.. nothing else seems to call it…, so add a conditional compile,..
+- Looks like this will be another string of errors…
+
+```bash
+[ 9374.055981] kallsyms_lookup_name: get_files_struct
+```
+
+- [file: Remove get_files_struct](https://github.com/torvalds/linux/commit/fa67bf885e5211c7dce9514ef2877212c0a5e09e), 5.11-rc1, no replacement function
+- get_files_struct does seem to be actually used…<mark style="background: #FF5582A6;"> there will be a pitfall here..</mark>
+- Related: `for_each_files_task` `hook_uprobe`
+
+```bash
+[11394.047582] kallsyms_lookup_name: get_task_type
+[11394.068295] kallsyms_lookup_name: cpuacct_subsys
+[11394.078570] kallsyms_lookup_name: css_get_next
+```
+
+- `get_task_type` and `cpuacct_subsys` cannot be found; `css_get_next` was removed long ago …[^6] Go through them one at a time.
+- `get_task_type` <- `orig_get_task_type` <- `diag_get_task_type` is still in use
+- `cpuacct_subsys` is no longer used; neither is `css_get_next`.
+- Related to [module: use LOOKUP_SYMS_NORET and LOOKUP_SYMS to find symbol address](https://github.com/alibaba/diagnose-tools/commit/1877444c32257540517b5e9bec044c33f6656d2a); CentOS 8 apparently produces the same output.
+
+    ```c
+    int diag_get_task_type(struct task_struct *tsk)
+    {
+        if (orig_get_task_type)
+            return orig_get_task_type(&tsk->se);
+
+        return 0;
+    }
+    ```
+
+  - Judging by the source, it seems fine for get_task_type to be NULL
+
+```bash
+[11394.224686] kallsyms_lookup_name: avenrun_r
+[11394.245031] kallsyms_lookup_name: _cond_resched
+[11394.245054] diag_sys_delay_init failed, ret=-22
+[11394.266447] diag_kernel_init failed.
+```
+
+- `_cond_resched` is what causes `diag_kernel_init failed.`, yet the function still exists, so why… 11:55 stuck
+  - A kernel macro that checks whether the current task should give up the CPU. Seems related to yield.
+- ~~13:55 try loading on the el8 kernel…~~ The el8 kernel gives the same error..
+- 16:37 No fix found; with `_cond_resched` commented out it compiles and load-monitor runs… <mark style="background: #FFB86CA6;">may be a pitfall</mark>
+- Checked `cat /proc/kallsyms`: `_cond_resched` is indeed absent..
+  - Everything is a file//👽//
+- <mark style="background: #FF5582A6;">Stalled here for now.</mark>
+
+### ~~Self-built kernel~~
+
+```bash
+[root@localhost 5.17.15-1.el8.x86_64]# make olddefconfig
+lib/Kconfig.debug:2690: can't open file "Documentation/Kconfig"
+make[1]: *** [scripts/kconfig/Makefile:77: olddefconfig] Error 1
+```
+
+- So download the source, unpack it, and copy things over.
+- While troubleshooting, dmesg / dmesg -C printed nothing. --> the only option was to build the kernel myself and turn logging on
+
+```bash
+yum install gcc ncurses-devel elfutils-libelf-devel bc rpm-build
+wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.15.tar.xz
+tar xf linux-5.17.15.tar.xz
+cp /boot/config-`uname -r` linux-5.17.15/.config
+
+# scripts/config edits .config in the current directory, so enter the source tree first
+cd linux-5.17.15
+scripts/config --enable CONFIG_DYNAMIC_DEBUG
+scripts/config --enable CONFIG_FTRACE
+scripts/config --enable CONFIG_LOG_BUF_KERNEL
+scripts/config --enable CONFIG_MODULE_SIG_FORCE
+```
+
+```bash
+make -j $(nproc)
+make modules -j $(nproc)
+
+make modules_install
+make install
+
+grub2-mkconfig -o /boot/grub2/grub.cfg
+dracut -f
+```
+
+dmesg now produces output
+
+[^1]: <https://github.com/alibaba/diagnose-tools>
+[^2]: <https://stackoverflow.com/questions/73242249/how-to-install-static-libraries-eg-libstdc-libm-libc-on-aws-official-rocky>
+[^3]: <https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/considerations_in_adopting_rhel_8/index>
+[^4]: <https://github.com/dotnet/docs/issues/20437>
+[^5]: <https://community.intel.com/t5/Analyzers/VTune-installation-failure-on-Fedora-35/td-p/1390304> this also mentions the 5.17 kernel
+[^6]: <https://github.com/torvalds/linux/commit/6d2488f64a240191f0733c1f32d73607916b01b7>
