Diffstat (limited to 'update_5.17.15-1.el8.x86_64.md')
-rw-r--r-- update_5.17.15-1.el8.x86_64.md | 449
1 files changed, 449 insertions, 0 deletions
diff --git a/update_5.17.15-1.el8.x86_64.md b/update_5.17.15-1.el8.x86_64.md
new file mode 100644
index 0000000..1aa4ec4
--- /dev/null
+++ b/update_5.17.15-1.el8.x86_64.md
@@ -0,0 +1,449 @@
+# update_5.17.15-1.el8.x86_64
+
+[diagnose-tools](https://github.com/alibaba/diagnose-tools)
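+- Alibaba's open-source kernel diagnostic tool.[^1] It supports up to CentOS 7 and needs to be ported to CentOS 8.
+
+This is the log of that work.
+
+~~WSL has no CentOS distribution, so I installed CentOS 7.8 via [CentOS-WSL](https://github.com/mishamosher/CentOS-WSL/releases?page=2).~~
+
+18:32 The WSL kernel is specially modified; to rule that out as a factor, use a virtual machine instead.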
+
+## centos7
+
+hyper-v centos 7.6.1810
+- /etc/sysconfig/network-scripts/ifcfg-eth0
+
+```ini
+BOOTPROTO="static"
+IPADDR=172.22.63.1
+NETMASK=255.255.255.0
+GATEWAY=172.22.48.1
+DNS1=172.22.48.1
+```
+
+```bash
+[root@localhost diagnose-tools]# make
+cd SOURCE/module; make --jobs=4
+make[1]: Entering directory `/root/diagnose-tools/SOURCE/module'
+make CFLAGS_MODULE="-DMODULE -DCENTOS_7U" -C /lib/modules/3.10.0-1160.99.1.el7.x86_64/build M=/root/diagnose-tools/SOURCE/module modules
+make: Entering an unknow
+```
+
+The kernel source tree is missing.
+
+```bash
+yum install kernel-devel-3.10.0-1160.99.1.el7.x86_64
+```
+
+Mismatched function signatures?
+- <https://github.com/alibaba/diagnose-tools/issues/164>
+
+```bash
+/root/diagnose-tools/SOURCE/module/pmu/entry.c:49:5: error: conflicting types for ‘diag_pmu_exit’
+ int diag_pmu_exit(void)
+ ^
+In file included from /root/diagnose-tools/SOURCE/module/pmu/entry.c:17:0:
+/root/diagnose-tools/SOURCE/module/internal.h:902:6: note: previous declaration of ‘diag_pmu_exit’ was here
+ void diag_pmu_exit(void);
+```
+
+Changing `int diag_pmu_exit(void)` to `void diag_pmu_exit(void)` fixes it.
+- This code sits behind a conditional compile and appears to be unused anyway.
+
+## rocky 8
+
+5.17.15-1.el8.x86_64
+kernel-ml-5.17.15-1.el8.x86_64 -> <https://elrepo.org/tiki/kernel-ml>
+- Download all the 5.17.15-1 rpm packages and install them with yum.
+
+<https://rockylinux.org/download/> -> download the ISO here
+- Hyper-V may have a [[2023-10-18#^a3yfga|pitfall]]
+
+```bash
+yum --enablerepo=elrepo-kernel install kernel-ml-5.17.15-1.el8.x86_64
+dnf --enablerepo="elrepo-kernel" install kernel-ml-5.17.15-1.el8.x86_64
+```
+
+```bash
+grubby --set-default=/boot/vmlinuz-5.17.15-1.el8.x86_64
+grub2-set-default 0
+dracut -f
+grub2-mkconfig -o /boot/grub2/grub.cfg
+```
+
+### `make devel`
+
+```bash
+yum install -y libstdc++-static
+Last metadata expiration check: 0:46:17 ago on Wed Oct 18 02:45:32 2023.
+No match for argument: libstdc++-static
+Error: Unable to find a match: libstdc++-static
+make: *** [Makefile:39: devel] Error 1
+```
+
+- `dnf --enablerepo=powertools install libstdc++-static`[^2]
+ - `yum install dnf-plugins-core` installs the dnf plugins (which provide `config-manager`)
+
+```bash
+No match for argument: glibc-static
+Error: Unable to find a match: glibc-static
+```
+
+- Likewise: `dnf --enablerepo=powertools install glibc-static`
+
+A simpler way is to enable the repository permanently:
+
+```bash
+yum install dnf-plugins-core
+yum config-manager --set-enabled powertools
+```
+
+But there is more: `Error: Unable to find a match: libunwind`
+- Red Hat removed libunwind in CentOS 8 [^3]
+- `yum -y install epel-release` [^4] enables the [EPEL repository](https://docs.fedoraproject.org/en-US/epel/)
+
+powertools and epel-release appear to go together (EPEL packages commonly depend on PowerTools), so enable both:
+
+```bash
+dnf config-manager --set-enabled powertools
+dnf install epel-release
+```
+
+And yet another: `Error: Unable to find a match: openssl-static`
+- This was also removed in CentOS 8. [^3]
+- Apparently static libraries are no longer shipped for security reasons.
+- Installing [openssl11-static-1.1.1k-5.el7.x86_64.rpm](https://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/o/openssl11-static-1.1.1k-5.el7.x86_64.rpm) did not work either, so the only option was to build it myself.
+
+```
+mkdir -p /root/rpmbuild/{BUILD,BUILDROOT,RPMS,SOURCES,SPECS,SRPMS}
+# build dependencies
+dnf -y --nogpgcheck install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm wget zip unzip bzip2 gcc gcc-c++ rpm-build make krb5-devel perl-interpreter zlib-devel lksctp-tools-devel perl-podlators perl-Test-Harness perl-Math-BigInt perl-Module-Load-Conditional perl-File-Temp perl-Time-HiRes perl-CPAN perl-Test-Simple
+
+cd /root/rpmbuild/
+dnf download --source openssl
+rpmbuild --rebuild xxx.rpm
+# the built packages end up in /root/rpmbuild/RPMS/x86_64/
+```
+
+- Reference: <https://gist.github.com/Kungergely/4a55c5668f1f0942625f67e012215502>
+
+Install:
+
+```bash
+yum install openssl-libs-1.1.1k-9.el8.x86_64.rpm --allowerasing
+yum install openssl-devel-1.1.1k-9.el8.x86_64.rpm
+yum install openssl-static-1.1.1k-9.el8.x86_64.rpm
+```
+
+### make deps
+
+No error output, apparently.
+
+### make module
+
+```bash
+/root/diagnose-tools/SOURCE/module/pub/trace_file.c: In function ‘to_trace_file’:
+/root/diagnose-tools/SOURCE/module/pub/trace_file.c:70:9: error: implicit declaration of function ‘PDE_DATA’; did you mean ‘NODE_DATA’? [-Werror=implicit-function-declaration]
+ return PDE_DATA(file->f_inode);
+ ^~~~~~~~
+ NODE_DATA
+/root/diagnose-tools/SOURCE/module/pub/trace_file.c:70:9: error: returning ‘int’ from a function with return type ‘struct diag_trace_file *’ makes pointer from integer without a cast [-Werror=int-conversion]
+ return PDE_DATA(file->f_inode);
+```
+
+- `PDE_DATA` is declared in `linux/proc_fs.h`, but since 5.17 [PDE_DATA() replaced by pde_data()](https://github.com/openzfs/zfs/issues/13004)
+- At SOURCE/module/pub/trace_file.c:70, add a conditional compile: on kernels 5.17 and later, return `pde_data` instead.
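+
+Before adding a version guard for a renamed API, it can help to confirm what the installed kernel headers actually declare. A minimal sketch, assuming the headers live under `/lib/modules/$(uname -r)/build`; the helper name `header_has_symbol` is hypothetical and not part of diagnose-tools:
+
+```bash
+# header_has_symbol FILE SYMBOL
+# Succeeds if SYMBOL appears as a whole word anywhere in header FILE.
+# The match is case-sensitive, so PDE_DATA and pde_data are told apart.
+header_has_symbol() {
+    grep -qw -- "$2" "$1"
+}
+
+# Example (the header path is an assumption about where kernel-devel installs it):
+# header_has_symbol "/lib/modules/$(uname -r)/build/include/linux/proc_fs.h" pde_data \
+#     && echo "use pde_data()" || echo "use PDE_DATA()"
+```
+
+This only tells you whether the name occurs in the header; it does not prove the symbol is exported to modules.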
+
+```bash
+/root/diagnose-tools/SOURCE/module/pub/uprobe.c:47:9: error: implicit declaration of function ‘fcheck_files’; did you mean ‘unshare_files’? [-Werror=implicit-function-declaration]
+ file = fcheck_files(files, fd);
+ ^~~~~~~~~~~~
+ unshare_files
+/root/diagnose-tools/SOURCE/module/pub/uprobe.c:47:7: error: assignment to ‘struct file *’ from ‘int’ makes pointer from integer without a cast [-Werror=int-conversion]
+ file = fcheck_files(files, fd);
+ ^
+cc1: all warnings being treated as errors
+make[3]: *** [scripts/Makefile.build:288: /root/diagnose-tools/SOURCE/module/pub/uprobe.o] Error 1
+make[3]: *** Waiting for unfinished jobs….
+/root/diagnose-tools/SOURCE/module/pub/fs_utils.c: In function ‘for_each_files_task’:
+/root/diagnose-tools/SOURCE/module/pub/fs_utils.c:161:10: error: implicit declaration of function ‘fcheck_files’; did you mean ‘unshare_files’? [-Werror=implicit-function-declaration]
+ file = fcheck_files(files, fd);
+ ^~~~~~~~~~~~
+ unshare_files
+/root/diagnose-tools/SOURCE/module/pub/fs_utils.c:161:8: error: assignment to ‘struct file *’ from ‘int’ makes pointer from integer without a cast [-Werror=int-conversion]
+ file = fcheck_files(files, fd);
+```
+
+- [[PATCH v2 09/24] file: Replace fcheck_files with files_lookup_fd_rcu](https://lore.kernel.org/lkml/[email protected]/#r)
+- The exact kernel version for this patch is somewhat uncertain; judging by the date it is roughly 5.10.
+- Add a conditional compile.
+
+```bash
+/root/diagnose-tools/SOURCE/module/misc.c: In function ‘diag_task_brief’:
+/root/diagnose-tools/SOURCE/module/misc.c:593:23: error: ‘struct task_struct’ has no member named ‘state’; did you mean ‘stats’?
+ detail->state = tsk->state;
+ ^~~~~
+ stats
+```
+
+- [[PATCH v2 7/7] sched: Change task_struct::state](https://lore.kernel.org/all/[email protected]/#r)
+- `volatile long state;` became `unsigned int __state;` <mark style="background: #FFB86CA6;">this is a potential hazard</mark>
+- Roughly 5.13-rc1 to rc2, judging by the patch date
+
+```bash
+/root/diagnose-tools/SOURCE/module/kernel/mutex.c: In function ‘__activate_mutex_monitor’:
+/root/diagnose-tools/SOURCE/module/kernel/mutex.c:434:2: error: implicit declaration of function ‘get_online_cpus’; did you mean ‘get_online_mems’? [-Werror=implicit-function-declaration]
+ get_online_cpus();
+ ^~~~~~~~~~~~~~~
+ get_online_mems
+```
+
+- ~~[[tip:sched/core] sched: Remove get_online_cpus() usage](https://lore.kernel.org/lkml/[email protected]/#r)~~
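+- ~~Already removed…~~ Wrong patch: this one only removes unnecessary get_online_cpus() calls from the scheduler.
+- [[PATCH 24/38] cgroup: Replace deprecated CPU-hotplug functions.](https://lore.kernel.org/lkml/[email protected]/#r) This patch mentions that get_online_cpus() was replaced by `cpus_read_lock()`.
+- Judging by the date, it was merged around v5.14. The companion function put_online_cpus() was replaced at the same time (by `cpus_read_unlock()`).
+- <mark style="background: #ADCCFFA6;">The source contains many `get_online_cpus()` / `put_online_cpus()` calls; take great care.</mark>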
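+
+Since these calls are spread across the module, listing every call site first makes the replacement less error-prone. A minimal sketch; the helper name `list_hotplug_calls` is hypothetical, and the `SOURCE/module` path in the example is the directory seen in the build output:
+
+```bash
+# list_hotplug_calls DIR
+# Prints file:line:text for each get_online_cpus()/put_online_cpus() call
+# found in .c and .h files under DIR.
+list_hotplug_calls() {
+    grep -rnE --include='*.c' --include='*.h' \
+        '(^|[^A-Za-z0-9_])(get|put)_online_cpus[[:space:]]*\(' "$1"
+}
+
+# Example:
+# list_hotplug_calls SOURCE/module
+```
+
+The leading character class keeps longer identifiers that merely end in `get_online_cpus` from matching.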
+
+```bash
+/root/diagnose-tools/SOURCE/module/kernel/exit.c:176:2: error: implicit declaration of function ‘profile_event_register’; did you mean ‘kprobe_event_delete’? [-Werror=implicit-function-declaration]
+ profile_event_register(PROFILE_TASK_EXIT, &task_exit_nb);
+ ^~~~~~~~~~~~~~~~~~~~~~
+ kprobe_event_delete
+```
+
+- [[PATCH 01/17] exit: Remove profile_task_exit & profile_munmap](https://lore.kernel.org/lkml/[email protected]/#r)
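+- It was removed outright, with no replacement function. This corresponds to roughly the 5.17 kernel.[^5]
+ - The matching `profile_event_unregister` was removed as well.
+- Per [捕获内核的异常事件 - smilingsusu](https://www.cnblogs.com/smilingsusu/articles/14596364.html) (on capturing kernel exception events), profile_event_register here hooks the process-exit event; the kprobe mechanism can be used as a replacement.
+ - For details on the changes needed, see <https://tinylab.org/linux-kprobes/>
+- This needs fairly extensive changes; <mark style="background: #FF5582A6;">likely a pitfall</mark>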
+
+```bash
+/root/diagnose-tools/SOURCE/module/kernel/load.c: In function ‘diag_load_timer’:
+/root/diagnose-tools/SOURCE/module/kernel/load.c:147:8: error: implicit declaration of function ‘task_contributes_to_load’; did you mean ‘task_state_to_char’? [-Werror=implicit-function-declaration]
+ if (task_contributes_to_load(p))
+ ^~~~~~~~~~~~~~~~~~~~~~~~
+ task_state_to_char
+```
+
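+- This one was tricky. ~~I could not find any record of its removal; the 5.18 kernel source indeed lacks this function.~~ I finally found it in [torvalds/linux](https://github.com/torvalds/linux) on GitHub.
+- [sched: Fix loadavg accounting race](https://www.spinics.net/lists/kernel/msg3582022.html)
+- The function was indeed removed, but its original macro definition can still be found. There is no replacement, so for now a locally defined `task_contributes_to_load2` stands in for it.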
+
+```bash
+/root/diagnose-tools/SOURCE/module/misc.c:93:14: error: ‘bdevt_str’ defined but not used [-Werror=unused-function]
+```
+
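+- Nothing in the surrounding code calls bdevt_str on this kernel: its only call site is guarded so that kernels newer than 5.8 skip it. Wrap the definition in a matching conditional compile.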
+
+---
+
+Now the kernel build configuration is missing.
+
+```bash
+[root@localhost diagnose-tools]# make module
+cd SOURCE/module; make --jobs=4
+make[1]: Entering directory '/root/diagnose-tools/SOURCE/module'
+make CFLAGS_MODULE="-DMODULE -DCENTOS_8U" -C /lib/modules/5.17.15-1.el8.x86_64/build M=/root/diagnose-tools/SOURCE/module modules
+make[2]: Entering directory '/usr/src/kernels/5.17.15-1.el8.x86_64'
+
+ ERROR: Kernel configuration is invalid.
+ include/generated/autoconf.h or include/config/auto.conf are missing.
+ Run 'make oldconfig && make prepare' on kernel src to fix it.
+```
+
+- Copied this file over from another machine: `scp [email protected]:/usr/src/kernels/5.17.15-1.el8.x86_64/include/generated/autoconf.h ./`
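+
+The error message names two files that Kbuild expects in the build tree. A minimal sketch of a pre-flight check for both, assuming the build tree is the `/lib/modules/<release>/build` directory seen in the make output; `build_tree_ready` is a hypothetical helper name:
+
+```bash
+# build_tree_ready DIR
+# Succeeds only if DIR contains both generated config files named in the error.
+build_tree_ready() {
+    [ -f "$1/include/generated/autoconf.h" ] && [ -f "$1/include/config/auto.conf" ]
+}
+
+# Example:
+# build_tree_ready "/lib/modules/$(uname -r)/build" || echo "build tree not prepared"
+```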
+
+---
+
+```bash
+/root/diagnose-tools/SOURCE/module/mm/memcg_stats.c:249:8: error: implicit declaration of function ‘mem_cgroup_nodeinfo’; did you mean ‘mem_cgroup_online’? [-Werror=implicit-function-declaration]
+ mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
+ ^~~~~~~~~~~~~~~~~~~
+ mem_cgroup_online
+```
+
+- [mm: memcontrol: kill mem_cgroup_nodeinfo()](https://github.com/torvalds/linux/commit/a3747b53b1771a787fea71d86a2fc39aea337685) removed mem_cgroup_nodeinfo; from the source it was just a simple struct member access.
+- v5.13-rc1
+
+```bash
+/root/diagnose-tools/SOURCE/module/net/tcp_retrans.c:94:35: error: ‘tcp_retrans_variant_buffer’ defined but not used [-Werror=unused-variable]
+ static struct diag_variant_buffer tcp_retrans_variant_buffer;
+```
+
+- Not a Linux source issue: further down, `tcp_retrans_variant_buffer` is only used on kernels below 5.3.0. Add a conditional compile.
+
+---
+
+```bash
+ MODPOST /root/diagnose-tools/SOURCE/module/Module.symvers
+ERROR: modpost: "cpuacct_cgroup_walk_tree" [/root/diagnose-tools/SOURCE/module/diagnose.ko] undefined!
+make[3]: *** [scripts/Makefile.modpost:134: /root/diagnose-tools/SOURCE/module/Module.symvers] Error 1
+make[3]: *** Deleting file '/root/diagnose-tools/SOURCE/module/Module.symvers'
+make[2]: *** [Makefile:1746: modules] Error 2
+make[2]: Leaving directory '/usr/src/kernels/5.17.15-1.el8.x86_64'
+make[1]: *** [Makefile:251: default] Error 2
+make[1]: Leaving directory '/root/diagnose-tools/SOURCE/module'
+make: *** [Makefile:74: module] Error 2
+```
+
+- For this problem, see the conditional compile in `diagnose-tools/SOURCE/module/pub/cgroup.c`:
+
+```c
+#if LINUX_VERSION_CODE < KERNEL_VERSION(3,10,0) || LINUX_VERSION_CODE > KERNEL_VERSION(5,16,0)
+```
+
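+- This condition looks odd. ~~Changing it to 5.18 lets the build pass~~… <mark style="background: #FF5582A6;">This is going to be a huge pitfall.</mark>
+- This is the PMU-related module; commenting it out lets the build pass, but the `Invalid module format` problem remains.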
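+
+The guard above compares packed integers. A small worked example of the arithmetic, assuming the classic definition `KERNEL_VERSION(a,b,c) = (a << 16) + (b << 8) + c` (newer kernels additionally clamp `c` at 255, which this sketch ignores); `kver` is a hypothetical helper:
+
+```bash
+# kver A B C
+# Mimics the classic KERNEL_VERSION(a,b,c) macro from linux/version.h.
+kver() {
+    echo $(( ($1 << 16) + ($2 << 8) + $3 ))
+}
+
+kver 5 17 15   # 332047
+kver 5 16 0    # 331776
+kver 3 10 0    # 199168
+```
+
+Because 332047 > 331776, the `> KERNEL_VERSION(5,16,0)` half of the condition is true on 5.17.15.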
+
+```bash
+# `make test` failed on `use open qw(:std :utf8);`
+yum install perl-open.noarch
+```
+
+---
+
+```bash
+[root@localhost diagnose-tools]# insmod -f SOURCE/module/diagnose.ko
+insmod: ERROR: could not insert module SOURCE/module/diagnose.ko: Invalid module format
+
+or
+
+[root@localhost diagnose-tools]# insmod SOURCE/module/diagnose.ko
+insmod: ERROR: could not insert module SOURCE/module/diagnose.ko: Invalid parameters
+```
+
+- ~~This one is very tricky… the only option is to rebuild the kernel with log output enabled to find the actual cause.~~
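+
+One common cause of `Invalid module format` is a vermagic mismatch between the module and the running kernel, so it is cheap to rule out first; it is not necessarily the cause in this log. A minimal sketch that compares only the release part of the vermagic string (the flags after it are ignored); `vermagic_matches` is a hypothetical helper:
+
+```bash
+# vermagic_matches VERMAGIC RELEASE
+# The first word of a module's vermagic string is the kernel release it was
+# built for; succeed if it equals RELEASE.
+vermagic_matches() {
+    [ "${1%% *}" = "$2" ]
+}
+
+# Example:
+# vermagic_matches "$(modinfo -F vermagic SOURCE/module/diagnose.ko)" "$(uname -r)" \
+#     || echo "module was built for a different kernel release"
+```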
+
+An aside:
+- Two links for a change related to `Invalid module format`: [1](https://github.com/alibaba/diagnose-tools/pull/21) [2](https://github.com/alibaba/diagnose-tools/pull/21/files)
+- It modifies `orig___mutex_unlock_slowpath = (void *)diag_kallsyms_lookup_name("__mutex_unlock_slowpath.isra");`, so this bug has probably been sidestepped already.
+
+Add prints at the entry point to track down where it fails:
+- `entry.c -> static int __init diagnosis_init(void)` is the entry point
+- `ret = alidiagnose_symbols_init();` fails
+- which leads to `diagnose-tools/SOURCE/module/symbol.c -> int alidiagnose_symbols_init(void)`
+- and there to `ret = lookup_syms();`
+
+```c
+static int lookup_syms(void)
+{
+ LOOKUP_SYMS(text_mutex);
+ LOOKUP_SYMS(tasklist_lock);
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 2, 0) || defined(CENTOS_4_18_193)
+ LOOKUP_SYMS(stack_trace_save_tsk);
+#ifdef CONFIG_USER_STACKTRACE_SUPPORT
+ LOOKUP_SYMS(stack_trace_save_user);
+#endif
+xxx
+ LOOKUP_SYMS(disk_name); // fails here
+```
+
+`diagnose-tools/SOURCE/module/internal.h` -> `LOOKUP_SYMS`
+
+```c
+#define LOOKUP_SYMS(name) do { \
+ orig_##name = (void *)diag_kallsyms_lookup_name(#name); \
+ if (!orig_##name) { \
+ pr_err("kallsyms_lookup_name: %s\n", #name); \
+ return -EINVAL; \
+ } \
+ } while (0)
+```
+
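+- Matches the log: `[ 6354.070181] kallsyms_lookup_name: disk_name`
+- LOOKUP_SYMS is a macro that looks up a symbol by name; the name it could not find is `disk_name` ^z0gbfx
+- [block: remove disk_name()](https://github.com/torvalds/linux/commit/abd2864a3e46368a58f3718491521779099bfc14) is the commit that removed the disk_name function, in v5.15-rc1.
+- Per the commit, disk_name was the lower-level helper beneath bdevname, and this commit folds it in. Nothing else seems to call it, so add a conditional compile.
+- It looks like this will be the first of a series of such errors…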
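+
+Because each failed lookup aborts module init, the failures surface one at a time. Listing every name passed to `LOOKUP_SYMS` up front gives the full set to review. A minimal sketch, assuming each call sits on its own line as in the `lookup_syms()` excerpt; `lookup_syms_names` is a hypothetical helper:
+
+```bash
+# lookup_syms_names FILE
+# Prints each NAME used as LOOKUP_SYMS(NAME); in FILE, one per line.
+# Lines such as "#define LOOKUP_SYMS(name) ..." are skipped because they do
+# not begin with the call itself.
+lookup_syms_names() {
+    sed -n 's/^[[:space:]]*LOOKUP_SYMS(\([A-Za-z0-9_]*\));.*/\1/p' "$1"
+}
+
+# Example:
+# lookup_syms_names SOURCE/module/symbol.c
+```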
+
+```bash
+[ 9374.055981] kallsyms_lookup_name: get_files_struct
+```
+
+- [file: Remove get_files_struct](https://github.com/torvalds/linux/commit/fa67bf885e5211c7dce9514ef2877212c0a5e09e), 5.11-rc1, no replacement function
+- get_files_struct seems to be actually used… <mark style="background: #FF5582A6;">there will be pitfalls here.</mark>
+- Related: `for_each_files_task` `hook_uprobe`
+
+```bash
+[11394.047582] kallsyms_lookup_name: get_task_type
+[11394.068295] kallsyms_lookup_name: cpuacct_subsys
+[11394.078570] kallsyms_lookup_name: css_get_next
+```
+
+- `get_task_type` and `cpuacct_subsys` cannot be found; `css_get_next` was removed long ago.[^6] Take them one at a time.
+- `get_task_type` <- `orig_get_task_type` <- `diag_get_task_type` is still in use.
+- `cpuacct_subsys` is no longer used; neither is `css_get_next`.
+- Related: [module: use LOOKUP_SYMS_NORET and LOOKUP_SYMS to find symbol address](https://github.com/alibaba/diagnose-tools/commit/1877444c32257540517b5e9bec044c33f6656d2a); CentOS 8 seems to produce the same output.
+
+ ```c
+ int diag_get_task_type(struct task_struct *tsk)
+ {
+ if (orig_get_task_type)
+ return orig_get_task_type(&tsk->se);
+
+ return 0;
+ }
+ ```
+
+ - From the source, it seems fine for get_task_type to be NULL.
+
+```bash
+[11394.224686] kallsyms_lookup_name: avenrun_r
+[11394.245031] kallsyms_lookup_name: _cond_resched
+[11394.245054] diag_sys_delay_init failed, ret=-22
+[11394.266447] diag_kernel_init failed.
+```
+
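+- `_cond_resched` causes `diag_kernel_init failed.`, yet the function still exists. Why? 11:55 stuck.
+ - It is a kernel macro that checks whether the current process should give up the CPU; it seems related to yield.
+- ~~13:55 Try loading on the el8 kernel…~~ The el8 kernel gives the same error.
+- 16:37 No fix found. Commenting out `_cond_resched` lets the build pass, and load-monitor now runs. <mark style="background: #FFB86CA6;">may be a pitfall</mark>
+- `cat /proc/kallsyms` confirms there is no `_cond_resched`.
+ - Everything is a file //👽//
+- <mark style="background: #FF5582A6;">Stalled here for now.</mark>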
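+
+The manual `/proc/kallsyms` check can be done for several names at once. A minimal sketch, assuming the usual kallsyms line layout of `address type name [module]`; `missing_ksyms` is a hypothetical helper:
+
+```bash
+# missing_ksyms KALLSYMS_FILE
+# Reads symbol names from stdin (one per line) and prints those that do not
+# appear in the name column of KALLSYMS_FILE.
+missing_ksyms() {
+    awk 'NR == FNR { seen[$3] = 1; next }
+         NF && !($1 in seen) { print $1 }' "$1" -
+}
+
+# Example:
+# printf '%s\n' _cond_resched disk_name get_files_struct | missing_ksyms /proc/kallsyms
+```
+
+The match is on the exact name, so a symbol that survives only under a different spelling is reported as missing.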
+
+### ~~Building a custom kernel~~
+
+```bash
+[root@localhost 5.17.15-1.el8.x86_64]# make olddefconfig
+lib/Kconfig.debug:2690: can't open file "Documentation/Kconfig"
+make[1]: *** [scripts/kconfig/Makefile:77: olddefconfig] Error 1
+```
+
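+- So download the source, unpack it, and copy the missing files over.
+- While troubleshooting, dmesg / dmesg -C printed nothing, so the only option was to build a custom kernel with logging turned on.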
+
+```bash
+yum install gcc ncurses-devel elfutils-libelf-devel bc rpm-build
+wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.15.tar.xz
+tar xf linux-5.17.15.tar.xz
+cp /boot/config-`uname -r` linux-5.17.15/.config
+cd linux-5.17.15
+
+scripts/config --enable CONFIG_DYNAMIC_DEBUG
+scripts/config --enable CONFIG_FTRACE
+scripts/config --enable CONFIG_LOG_BUF_KERNEL
+scripts/config --enable CONFIG_MODULE_SIG_FORCE
+```
+
+```bash
+make -j $(nproc)
+make modules -j $(nproc)
+
+make modules_install
+make install
+
+grub2-mkconfig -o /boot/grub2/grub.cfg
+dracut -f
+```
+
+dmesg now produces output.
+
+[^1]: <https://github.com/alibaba/diagnose-tools>
+[^2]: <https://stackoverflow.com/questions/73242249/how-to-install-static-libraries-eg-libstdc-libm-libc-on-aws-official-rocky>
+[^3]: <https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/considerations_in_adopting_rhel_8/index>
+[^4]: <https://github.com/dotnet/docs/issues/20437>
+[^5]: <https://community.intel.com/t5/Analyzers/VTune-installation-failure-on-Fedora-35/td-p/1390304> also mentions the 5.17 kernel
+[^6]: <https://github.com/torvalds/linux/commit/6d2488f64a240191f0733c1f32d73607916b01b7>