path: root/mm/memory_hotplug.c
2021-09-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: "147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f. Subsystems affected by this patch series: mm (memory-hotplug, rmap, ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan), alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib, checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig, selftests, ipc, and scripts" * emailed patches from Andrew Morton <[email protected]>: (94 commits) scripts: check_extable: fix typo in user error message mm/workingset: correct kernel-doc notations ipc: replace costly bailout check in sysvipc_find_ipc() selftests/memfd: remove unused variable Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH configs: remove the obsolete CONFIG_INPUT_POLLDEV prctl: allow to setup brk for et_dyn executables pid: cleanup the stale comment mentioning pidmap_init(). kernel/fork.c: unexport get_{mm,task}_exe_file coredump: fix memleak in dump_vma_snapshot() fs/coredump.c: log if a core dump is aborted due to changed file permissions nilfs2: use refcount_dec_and_lock() to fix potential UAF nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group nilfs2: fix NULL pointer in nilfs_##name##_attr_release nilfs2: fix memory leak in nilfs_sysfs_create_device_group trap: cleanup trap_init() init: move usermodehelper_enable() to populate_rootfs() ...
2021-09-08mm/memory_hotplug: use helper zone_is_zone_device() to simplify the codeMiaohe Lin
Patch series "Cleanup and fixups for memory hotplug". This series contains cleanup to use helper function to simplify the code. Also we fix some potential bugs. More details can be found in the respective changelogs. This patch (of 3): Use helper zone_is_zone_device() to simplify the code and remove some explicit CONFIG_ZONE_DEVICE codes. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Chris Goldsworthy <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/memory_hotplug: improved dynamic memory group aware "auto-movable" online policyDavid Hildenbrand
Currently, the "auto-movable" online policy does not allow for hotplugged KERNEL (ZONE_NORMAL) memory to increase the amount of MOVABLE memory we can have, primarily, because there is no coordination across memory devices and we don't want to create zone-imbalances accidentally when unplugging memory. However, within a single memory device it's different. Let's allow for KERNEL memory within a dynamic memory group to allow for more MOVABLE within the same memory group. The only thing we have to take care of is that the managing driver avoids zone imbalances by unplugging MOVABLE memory first, otherwise there can be corner cases where unplug of memory could result in (accidental) zone imbalances. virtio-mem is the only user of dynamic memory groups and recently added support for prioritizing unplug of ZONE_MOVABLE over ZONE_NORMAL, so we don't need a new toggle to enable it for dynamic memory groups. We limit this handling to dynamic memory groups, because: * We want to keep the runtime overhead for collecting stats when onlining a single memory block small. We tend to have only a handful of dynamic memory groups, but we can have quite some static memory groups (e.g., 256 DIMMs). * It doesn't make too much sense for static memory groups, as we try onlining all applicable memory blocks either completely to ZONE_MOVABLE or not. In ordinary operation, we won't have a mixture of zones within a static memory group. When adding memory to a dynamic memory group, we'll first online memory to ZONE_MOVABLE as long as early KERNEL memory allows for it. Then, we'll online the next unit(s) to ZONE_NORMAL, until we can online the next unit(s) to ZONE_MOVABLE. For a simple virtio-mem device with a MOVABLE:KERNEL ratio of 3:1, it will result in a layout like:

 [M][M][M][M][M][M][M][M][N][M][M][M][N][M][M][M]...
  ^ movable memory due to early kernel memory
                           ^ allows for more movable memory ...
                             ^-----^ ... here
                                       ^ allows for more movable memory ...
                                         ^-----^ ... here

While the created layout is sub-optimal when it comes to contiguous zones, it gives us the maximum flexibility when dynamically growing/shrinking a device; we can grow small VMs really big in small steps, and still shrink reliably to e.g., 1/4 of the maximum VM size in this example, removing full memory blocks along with metadata more reliably. Mark dynamic memory groups in the xarray such that we can efficiently iterate over them when collecting stats. In usual setups, we have one virtio-mem device per NUMA node, and usually only a small number of NUMA nodes. Note: for now, there seems to be no compelling reason to make this behavior configurable. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Jason Wang <[email protected]> Cc: Len Brown <[email protected]> Cc: Marek Kedzierski <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/memory_hotplug: memory group aware "auto-movable" online policyDavid Hildenbrand
Use memory groups to improve our "auto-movable" onlining policy: 1. For static memory groups (e.g., a DIMM), online a memory block MOVABLE only if all other memory blocks in the group are either MOVABLE or could be onlined MOVABLE. A DIMM will either be MOVABLE or not, not a mixture. 2. For dynamic memory groups (e.g., a virtio-mem device), online a memory block MOVABLE only if all other memory blocks inside the current unit are either MOVABLE or could be onlined MOVABLE. For a virtio-mem device with a device block size of 512 MiB, all 128 MiB memory blocks within a 512 MiB unit will either be MOVABLE or not, not a mixture. We have to pass the memory group to zone_for_pfn_range() to take the memory group into account. Note: for now, there seems to be no compelling reason to make this behavior configurable. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Jason Wang <[email protected]> Cc: Len Brown <[email protected]> Cc: Marek Kedzierski <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
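The resulting lookup interface looks roughly like this (a hedged sketch; parameter names follow the series' conventions):

    struct zone *zone_for_pfn_range(int online_type, int nid,
                                    struct memory_group *group,
                                    unsigned long start_pfn,
                                    unsigned long nr_pages);

Passing the group (which may be NULL for memory that belongs to no group) is what lets the policy reason about all blocks of a device instead of a single block in isolation.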
2021-09-08mm/memory_hotplug: track present pages in memory groupsDavid Hildenbrand
Let's track all present pages in each memory group. In particular, track memory present in ZONE_MOVABLE and memory present in one of the kernel zones (which really only is ZONE_NORMAL right now as memory groups only apply to hotplugged memory) separately within a memory group, to prepare for making smart auto-online decisions for individual memory blocks within a memory group based on group statistics. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Jason Wang <[email protected]> Cc: Len Brown <[email protected]> Cc: Marek Kedzierski <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08drivers/base/memory: introduce "memory groups" to logically group memory blocksDavid Hildenbrand
In our "auto-movable" memory onlining policy, we want to make decisions across memory blocks of a single memory device. Examples of memory devices include ACPI memory devices (in the simplest case a single DIMM) and virtio-mem. For now, we don't have a connection between a single memory block device and the real memory device. Each memory device consists of 1..X memory block devices. Let's logically group memory blocks belonging to the same memory device in "memory groups". Memory groups can span multiple physical ranges and a memory group itself does not contain any information regarding physical ranges, only properties (e.g., "max_pages") necessary for improved memory onlining. Introduce two memory group types: 1) Static memory group: E.g., a single ACPI memory device, consisting of 1..X memory resources. A memory group consists of 1..Y memory blocks. The whole group is added/removed in one go. If any part cannot get offlined, the whole group cannot be removed. 2) Dynamic memory group: E.g., a single virtio-mem device. Memory is dynamically added/removed in a fixed granularity, called a "unit", consisting of 1..X memory blocks. A unit is added/removed in one go. If any part of a unit cannot get offlined, the whole unit cannot be removed. In case of 1) we usually want either all memory managed by ZONE_MOVABLE or none. In case of 2) we usually want to have as many units as possible managed by ZONE_MOVABLE. We want a single unit to be of the same type. For now, memory groups are an internal concept that is not exposed to user space; we might want to change that in the future, though. add_memory() users can specify a mgid instead of a nid when passing the MHP_NID_IS_MGID flag. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Jason Wang <[email protected]> Cc: Len Brown <[email protected]> Cc: Marek Kedzierski <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/memory_hotplug: introduce "auto-movable" online policyDavid Hildenbrand
When onlining without specifying a zone (using "online" instead of "online_kernel" or "online_movable"), we currently select a zone such that existing zones are kept contiguous. This online policy made sense in the past, where contiguous zones were required. We'd like to implement smarter policies, however: * User space has little insight. As one example, it has no idea which memory blocks logically belong together (e.g., to a DIMM or to a virtio-mem device). * Drivers that add memory in separate memory blocks, especially virtio-mem, want memory to get onlined right from the kernel when adding. So we really want to have onlining to differing zones managed in the kernel, configured by user space. We see more and more cases where we might eventually hotplug a lot of memory in the future (e.g., eventually grow a 2 GiB VM to 64 GiB), however: * Resizing happens dynamically, in smaller steps in both directions (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...) * We still want as much flexibility as possible, especially, hotunplugging as much memory as possible later. We can really only use "online_movable" if we know upfront the amount of memory we are going to hotplug, and we know that it won't result in a zone imbalance. So in our example, a 2 GiB VM that could grow to 64 GiB could currently not use "online_movable", and instead, "online_kernel" would have to be used, resulting in worse (no) memory hotunplug reliability. Let's add a new "auto-movable" online policy that considers the current zone ratios (global, per-node) to determine whether a memory block can be onlined to ZONE_MOVABLE, based on the ratio MOVABLE : KERNEL. However, internally we'll only consider the following ratio for now: MOVABLE : KERNEL_EARLY. For now, we don't allow for hotplugged KERNEL memory to allow for more MOVABLE memory, because there is no coordination across memory devices. In follow-up patches, we will allow for more KERNEL memory within a memory device to allow for more MOVABLE memory within the same memory device -- which only makes sense for special memory device types. We base our calculation on "present pages", see the code comments for details. Hotplugged memory will get onlined to ZONE_MOVABLE if the configured ratio allows for it. Depending on the setup, this can result in fragmented zones, which can make compaction slower and dynamic allocation of gigantic pages when not using CMA less reliable (... which is already pretty unreliable). The old policy will be the default and called "contig-zones". In follow-up patches, our new policy will use additional information, such as memory groups, to make even smarter decisions across memory blocks. Configuration: * memory_hotplug.online_policy is used to switch between both policies and defaults to "contig-zones". * memory_hotplug.auto_movable_ratio defines the maximum ratio in percent and defaults to "301" -- allowing e.g., most 8 GiB machines to grow to 32 GiB and have all hotplugged memory in ZONE_MOVABLE. The additional percent accounts for a handful of lost present pages (e.g., firmware allocations). User space is expected to adjust this ratio when enabling the new "auto-movable" policy, though. * memory_hotplug.auto_movable_numa_aware considers numa node stats in addition to global stats, and defaults to "true". Note: just like the old policy, the new policy won't take things like unmovable huge pages or memory ballooning that doesn't support balloon compaction into account. User space has to configure onlining accordingly. 
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Jason Wang <[email protected]> Cc: Len Brown <[email protected]> Cc: Marek Kedzierski <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
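In essence, the ratio check reduces to something like the following (a simplified sketch of the policy, not the exact kernel code):

    /* memory_hotplug.auto_movable_ratio, in percent (default 301) */
    static unsigned int auto_movable_ratio = 301;

    /*
     * Allow onlining one more memory block to ZONE_MOVABLE only while
     * MOVABLE : KERNEL_EARLY stays within the configured ratio.
     */
    static bool auto_movable_ratio_allows(unsigned long movable_pages,
                                          unsigned long kernel_early_pages)
    {
            return movable_pages <= kernel_early_pages *
                                    auto_movable_ratio / 100;
    }

With the default of 301, roughly three pages may sit in ZONE_MOVABLE for every early kernel page, plus one percent of slack for present pages lost to firmware allocations.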
2021-09-08mm: track present early pages per zoneDavid Hildenbrand
Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3. I. Goal The goal of this series is improving in-kernel auto-online support. It tackles the fundamental problems that: 1) We can create zone imbalances when onlining all memory blindly to ZONE_MOVABLE, in the worst case crashing the system. We have to know upfront how much memory we are going to hotplug such that we can safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE via "online_movable". This is far from practical and only applicable in limited setups -- like inside VMs under the RHV/oVirt hypervisor which will never hotplug more than 3 times the boot memory (and the limitation is only in place due to the Linux limitation). 2) We see more setups that implement dynamic VM resizing, hot(un)plugging memory to resize VM memory. In these setups, we might hotplug a lot of memory, but it might happen in various small steps in both directions (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the primary driver of this upstream right now, performing such dynamic resizing NUMA-aware via multiple virtio-mem devices. Onlining all hotplugged memory to ZONE_NORMAL means we basically have no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can easily run into zone imbalances when growing a VM. We want a mixture, and we want as much memory as reasonable/configured in ZONE_MOVABLE. Details regarding zone imbalances can be found at [1]. 3) Memory devices consist of 1..X memory block devices, however, the kernel doesn't really track the relationship. Consequently, also user space has no idea. We want to make per-device decisions. As one example, for memory hotunplug it doesn't make sense to use a mixture of zones within a single DIMM: we want all MOVABLE if possible, otherwise all !MOVABLE, because any !MOVABLE part will easily block the whole DIMM from getting hotunplugged. As another example, virtio-mem operates on individual units that span 1..X memory blocks. Similar to a DIMM, we want a unit to either be all MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however, all units of a virtio-mem device logically belong together and are managed (added/removed) by a single driver. We want as much memory of a virtio-mem device to be MOVABLE as possible. 4) We want memory onlining to be done right from the kernel while adding memory, not triggered by user space via udev rules; for example, this is reqired for fast memory hotplug for drivers that add individual memory blocks, like virito-mem. We want a way to configure a policy in the kernel and avoid implementing advanced policies in user space. The auto-onlining support we have in the kernel is not sufficient. All we have is a) online everything MOVABLE (online_movable) b) online everything !MOVABLE (online_kernel) c) keep zones contiguous (online). This series allows configuring c) to mean instead "online movable if possible according to the coniguration, driven by a maximum MOVABLE:KERNEL ratio" -- a new onlining policy. II. Approach This series does 3 things: 1) Introduces the "auto-movable" online policy that initially operates on individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio to make a decision whether a memory block will be onlined to ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL memory does not allow for more MOVABLE memory (details in the patches). CMA memory is treated like MOVABLE memory. 
2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory groups and uses group information to make decisions in the "auto-movable" online policy across memory blocks of a single memory device (modeled as memory group). More details can be found in patch #3 or in the DIMM example below. 3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by allowing ZONE_NORMAL memory within a dynamic memory group to allow for more ZONE_MOVABLE memory within the same memory group. The target use case is dynamic VM resizing using virtio-mem. See the virtio-mem example below. I remember that the basic idea of using a ratio to implement a policy in the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I lost the pointer to that discussion). For me, the main use case is using it along with virtio-mem (and DIMMs / ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the amount of memory we can hotunplug reliably again if we might eventually hotplug a lot of memory to a VM. III. Target Usage The target usage will be: 1) Linux boots with "mhp_default_online_type=offline" 2) User space (e.g., systemd unit) configures memory onlining (according to a config file and system properties), for example: * Setting memory_hotplug.online_policy=auto-movable * Setting memory_hotplug.auto_movable_ratio=301 * Setting memory_hotplug.auto_movable_numa_aware=true 3) User space enables auto onlining via "echo online > /sys/devices/system/memory/auto_online_blocks" 4) User space triggers manual onlining of all already-offline memory blocks (go over offline memory blocks and set them to "online") IV. Example For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of 301% results in the following layout:

 Memory block 0-15:    DMA32   (early)
 Memory block 32-47:   Normal  (early)
 Memory block 48-79:   Movable (DIMM 0)
 Memory block 80-111:  Movable (DIMM 1)
 Memory block 112-143: Movable (DIMM 2)
 Memory block 144-175: Normal  (DIMM 3)
 Memory block 176-207: Normal  (DIMM 4)
 ... all Normal
 (-> hotplugged Normal memory does not allow for more Movable memory)

For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM will result in the following layout:

 Memory block 0-15:    DMA32   (early)
 Memory block 32-47:   Normal  (early)
 Memory block 48-143:  Movable (virtio-mem, first 12 GiB)
 Memory block 144:     Normal  (virtio-mem, next 128 MiB)
 Memory block 145-147: Movable (virtio-mem, next 384 MiB)
 Memory block 148:     Normal  (virtio-mem, next 128 MiB)
 Memory block 149-151: Movable (virtio-mem, next 384 MiB)
 ... Normal/Movable mixture as above
 (-> hotplugged Normal memory allows for more Movable memory within the same device)

Which gives us maximum flexibility when dynamically growing/shrinking a VM in smaller steps. V. Doc Update I'll update the memory-hotplug.rst documentation, once the overhaul [1] is upstream. Until then, details can be found in patch #2. VI. Future Work 1) Use memory groups for ppc64 dlpar 2) Being able to specify a portion of (early) kernel memory that will be excluded from the ratio, e.g., "128 MiB globally/per node". This might be helpful when starting VMs with an extremely small memory footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting the first hotplugged units getting onlined to ZONE_MOVABLE. One alternative would be a trigger to not consider ZONE_DMA memory in the ratio. We'll have to see if this is really required. 
3) Indicate to user space that MOVABLE might be a bad idea -- especially relevant when memory ballooning without support for balloon compaction is active. This patch (of 9): For implementing a new memory onlining policy, which determines when to online memory blocks to ZONE_MOVABLE semi-automatically, we need the number of present early (boot) pages -- present pages excluding hotplugged pages. Let's track these pages per zone. Pass a page instead of the zone to adjust_present_page_count(), similar to adjust_managed_page_count(), and derive the zone from the page. It's worth noting that a memory block to be offlined/onlined is either completely "early" or "not early". add_memory() and friends can only add complete memory blocks and we only online/offline complete (individual) memory blocks. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Jason Wang <[email protected]> Cc: Marek Kedzierski <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Wei Yang <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Dan Williams <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Len Brown <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
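A condensed sketch of the reworked helper described here (simplified; locking and the real kernel code differ in detail):

    void adjust_present_page_count(struct page *page, long nr_pages)
    {
            struct zone *zone = page_zone(page);

            /*
             * Memory in early sections was present during boot; track it
             * separately so policies can compute MOVABLE : KERNEL_EARLY.
             */
            if (early_section(__pfn_to_section(page_to_pfn(page))))
                    zone->present_early_pages += nr_pages;
            zone->present_pages += nr_pages;
            zone->zone_pgdat->node_present_pages += nr_pages;
    }

Because whole memory blocks are always either entirely early or entirely hotplugged, checking a single page of the block is sufficient.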
2021-09-08mm/memory_hotplug: remove nid parameter from remove_memory() and friendsDavid Hildenbrand
There is only a single user remaining. We can simply lookup the nid only used for node offlining purposes when walking our memory blocks. We don't expect to remove multi-nid ranges; and if we'd ever do, we most probably don't care about removing multi-nid ranges that actually result in empty nodes. If ever required, we can detect the "multi-nid" scenario and simply try offlining all online nodes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Michael Ellerman <[email protected]> (powerpc) Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Len Brown <[email protected]> Cc: Dan Williams <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Dave Jiang <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Jason Wang <[email protected]> Cc: Nathan Lynch <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Scott Cheloha <[email protected]> Cc: Anton Blanchard <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jia He <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Pierre Morel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Rich Felker <[email protected]> Cc: Sergei Trofimovich <[email protected]> Cc: Thiago Jung Bauermann <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
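The resulting interface change, roughly (hedged):

    /* Before: callers had to pass the node id themselves. */
    int remove_memory(int nid, u64 start, u64 size);

    /* After: the nid is looked up from the memory blocks being removed. */
    int remove_memory(u64 start, u64 size);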
2021-09-08mm/memory_hotplug: remove nid parameter from arch_remove_memory()David Hildenbrand
The parameter is unused, let's remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Catalin Marinas <[email protected]> Acked-by: Michael Ellerman <[email protected]> [powerpc] Acked-by: Heiko Carstens <[email protected]> [s390] Reviewed-by: Pankaj Gupta <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Rich Felker <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Baoquan He <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Sergei Trofimovich <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Thiago Jung Bauermann <[email protected]> Cc: Joe Perches <[email protected]> Cc: Pierre Morel <[email protected]> Cc: Jia He <[email protected]> Cc: Anton Blanchard <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Jason Wang <[email protected]> Cc: Len Brown <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nathan Lynch <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Scott Cheloha <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()David Hildenbrand
Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory" These are all cleanups and one fix previously sent as part of [1]: [PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory groups. These patches make sense even without the other series, therefore I pulled them out to make the other series easier to digest. [1] https://lkml.kernel.org/r/[email protected] This patch (of 4): Checkpatch complained on a follow-up patch that we are using "unsigned" here, which defaults to "unsigned int" and checkpatch is correct. As we will search for a fitting zone using the wrong pfn, we might end up onlining memory to one of the special kernel zones, such as ZONE_DMA, which can end badly as the onlined memory does not satisfy properties of these zones. Use "unsigned long" instead, just as we do in other places when handling PFNs. This can bite us once we have physical addresses in the range of multiple TB. Link: https://lkml.kernel.org/r/[email protected] Fixes: e5e689302633 ("mm, memory_hotplug: display allowed zones in the preferred ordering") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Pankaj Gupta <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Jason Wang <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Wei Yang <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Dan Williams <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Len Brown <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: [email protected] Cc: Andy Lutomirski <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Anton Blanchard <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Baoquan He <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Jiang <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jia He <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Nathan Lynch <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Pierre Morel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Rich Felker <[email protected]> Cc: Scott Cheloha <[email protected]> Cc: Sergei Trofimovich <[email protected]> Cc: Thiago Jung Bauermann <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm: memory_hotplug: cleanup after removal of pfn_valid_within()Mike Rapoport
When test_pages_in_a_zone() used pfn_valid_within(), it had some logic surrounding the pfn_valid_within() checks. Since pfn_valid_within() is gone, this logic can be removed. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONEMike Rapoport
Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE". After recent updates to freeing unused parts of the memory map, no architecture can have holes in the memory map within a pageblock. This makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration option redundant. The first patch removes them both in a mechanical way and the second patch simplifies memory_hotplug::test_pages_in_a_zone() that had pfn_valid_within() surrounded by more logic than simple if. This patch (of 2): After recent changes in freeing of the unused parts of the memory map and rework of pfn_valid() in arm and arm64 there are no architectures that can have holes in the memory map within a pageblock and so nothing can enable CONFIG_HOLES_IN_ZONE which guards non trivial implementation of pfn_valid_within(). With that, pfn_valid_within() is always hardwired to 1 and can be completely removed. Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm/migrate: enable returning precise migrate_pages() success countYang Shi
Under normal circumstances, migrate_pages() returns the number of pages migrated. In error conditions, it returns an error code. When returning an error code, there is no way to know how many pages were migrated or not migrated. Make migrate_pages() return how many pages are demoted successfully for all cases, including when encountering errors. Page reclaim behavior will depend on this in subsequent patches. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Signed-off-by: "Huang, Ying" <[email protected]> Suggested-by: Oscar Salvador <[email protected]> [optional parameter] Reviewed-by: Yang Shi <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wei Xu <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Keith Busch <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
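A sketch of the extended call (hedged; the helper and flag names below follow the memory-hotplug offlining path, and the new out-parameter may be NULL for callers that don't care):

    unsigned int nr_succeeded = 0;
    int err;

    /* Even when err < 0, nr_succeeded reports pages already migrated. */
    err = migrate_pages(&source_pages, alloc_migration_target, NULL,
                        (unsigned long)&mtc, MIGRATE_SYNC,
                        MR_MEMORY_HOTPLUG, &nr_succeeded);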
2021-08-25mm/memory_hotplug: fix potential permanent lru cache disableMiaohe Lin
If offline_pages failed after lru_cache_disable(), it forgot to do lru_cache_enable() in error path. So we would have lru cache disabled permanently in this case. Link: https://lkml.kernel.org/r/[email protected] Fixes: d479960e44f2 ("mm: disable LRU pagevec during the migration temporarily") Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Chris Goldsworthy <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
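The fix is essentially a one-liner in the unwind path; schematically (a hedged reconstruction of the offline_pages() error path):

    failed_removal_isolated:
            /* pushback to free area */
            undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
            memory_notify(MEM_CANCEL_OFFLINE, &arg);
    failed_removal_pcplists_disabled:
            lru_cache_enable();     /* was missing: pairs with lru_cache_disable() */
            zone_pcp_enable(zone);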
2021-07-01mm/memory_hotplug: fix kerneldoc comment for __remove_memoryMel Gorman
make W=1 generates the following warning for __remove_memory mm/memory_hotplug.c:2044: warning: expecting prototype for remove_memory(). Prototype was for __remove_memory() instead Commit eca499ab3749 ("mm/hotplug: make remove_memory() interface usable") introduced the kerneldoc comment and function but the kerneldoc name and function name did not match. Link: https://lkml.kernel.org/r/[email protected] Fixes: eca499ab3749 ("mm/hotplug: make remove_memory() interface usable") Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Yang Shi <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-07-01mm/memory_hotplug: fix kerneldoc comment for __try_online_nodeMel Gorman
make W=1 generates the following warning for try_online_node mm/memory_hotplug.c:1087: warning: expecting prototype for try_online_node(). Prototype was for __try_online_node() instead Commit b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use __try_online_node") renamed the function but did not update the associated kerneldoc. The function is static and somewhat specialised in nature so it's not clear it warrants being kerneldoc by moving the comment to try_online_node. Hence, leave the comment of the internal helper in place but leave it out of kerneldoc and correct the function name in the comment. Link: https://lkml.kernel.org/r/[email protected] Fixes: b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use __try_online_node") Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Yang Shi <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
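Illustratively, the kerneldoc name must match the function directly below it, or make W=1 warns; a hedged, simplified example (the real function also takes the hotplug lock):

    /**
     * try_online_node - try to online the given node
     * @nid: the node to online
     *
     * The name before the dash must match the function below, otherwise
     * kernel-doc reports an "expecting prototype" mismatch under make W=1.
     */
    int try_online_node(int nid)
    {
            return __try_online_node(nid, true);
    }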
2021-07-01mm: fix spelling mistakesZhen Lei
Fix some spelling mistakes in comments:

 each having differents usage ==> each has a different usage
 statments ==> statements
 adresses ==> addresses
 aggresive ==> aggressive
 datas ==> data
 posion ==> poison
 higer ==> higher
 precisly ==> precisely
 wont ==> won't
 We moves tha ==> We move the
 endianess ==> endianness

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zhen Lei <[email protected]> Reviewed-by: Souptick Joarder <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm,memory_hotplug: drop unneeded lockingOscar Salvador
Currently, memory-hotplug code takes zone's span_writelock and pgdat's resize_lock when resizing the node/zone's spanned pages via {move_pfn_range_to_zone(),remove_pfn_range_from_zone()} and when resizing node and zone's present pages via adjust_present_page_count(). These locks are also taken during the initialization of the system at boot time, where it protects parallel struct page initialization, but they should not really be needed in memory-hotplug where all operations are a) synchronized on device level and b) serialized by the mem_hotplug_lock lock. [[email protected]: remove now-unused locals] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Pavel Tatashin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm/memory_hotplug: rate limit page migration warningsLiam Mark
When offlining memory the system can attempt to migrate a lot of pages; if there are problems with migration, this can flood the logs. Printing all the data hogs the CPU and causes some RT threads to run for a long time, which may have some bad consequences. Rate limit the page migration warnings in order to avoid this. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam Mark <[email protected]> Signed-off-by: Georgi Djakov <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
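Schematically, the warnings get wrapped with the kernel's ratelimit helper (a simplified sketch):

    static DEFINE_RATELIMIT_STATE(migrate_rs, DEFAULT_RATELIMIT_INTERVAL,
                                  DEFAULT_RATELIMIT_BURST);

    if (__ratelimit(&migrate_rs)) {
            pr_warn("migrating pfn %lx failed ret:%d\n",
                    page_to_pfn(page), ret);
            dump_page(page, "migration failure");
    }

The default ratelimit state allows a burst of messages per interval and silently drops the rest, so a pathological offline no longer monopolizes the CPU with printk traffic.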
2021-06-30mm: sparsemem: use huge PMD mapping for vmemmap pagesMuchun Song
The preparation of splitting huge PMD mapping of vmemmap pages is ready, so switch the mapping from PTE to PMD. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Chen Huang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm: memory_hotplug: disable memmap_on_memory when hugetlb_free_vmemmap enabledMuchun Song
The parameter of memory_hotplug.memmap_on_memory is not compatible with hugetlb_free_vmemmap. So disable it when hugetlb_free_vmemmap is enabled. [[email protected]: remove unneeded include, per Oscar] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Mike Kravetz <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Barry Song <[email protected]> Cc: Bodeddula Balasubramaniam <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chen Huang <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joao Martins <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Neukum <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Pawan Gupta <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm: memory_hotplug: factor out bootmem core functions to bootmem_info.cMuchun Song
Patch series "Free some vmemmap pages of HugeTLB page", v23. This patch series will free some vmemmap pages(struct page structures) associated with each HugeTLB page when preallocated to save memory. In order to reduce the difficulty of the first version of code review. In this version, we disable PMD/huge page mapping of vmemmap if this feature was enabled. This acutely eliminates a bunch of the complex code doing page table manipulation. When this patch series is solid, we cam add the code of vmemmap page table manipulation in the future. The struct page structures (page structs) are used to describe a physical page frame. By default, there is an one-to-one mapping from a page frame to it's corresponding page struct. The HugeTLB pages consist of multiple base page size pages and is supported by many architectures. See hugetlbpage.rst in the Documentation directory for more details. On the x86 architecture, HugeTLB pages of size 2MB and 1GB are currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of 4096 base pages. For each base page, there is a corresponding page struct. Within the HugeTLB subsystem, only the first 4 page structs are used to contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER provides this upper limit. The only 'useful' information in the remaining page structs is the compound_head field, and this field is the same for all tail pages. By removing redundant page structs for HugeTLB pages, memory can returned to the buddy allocator for other uses. When the system boot up, every 2M HugeTLB has 512 struct page structs which size is 8 pages(sizeof(struct page) * 512 / PAGE_SIZE). HugeTLB struct pages(8 pages) page frame(8 pages) +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ | | | 0 | -------------> | 0 | | | +-----------+ +-----------+ | | | 1 | -------------> | 1 | | | +-----------+ +-----------+ | | | 2 | -------------> | 2 | | | +-----------+ +-----------+ | | | 3 | -------------> | 3 | | | +-----------+ +-----------+ | | | 4 | -------------> | 4 | | 2MB | +-----------+ +-----------+ | | | 5 | -------------> | 5 | | | +-----------+ +-----------+ | | | 6 | -------------> | 6 | | | +-----------+ +-----------+ | | | 7 | -------------> | 7 | | | +-----------+ +-----------+ | | | | | | +-----------+ The value of page->compound_head is the same for all tail pages. The first page of page structs (page 0) associated with the HugeTLB page contains the 4 page structs necessary to describe the HugeTLB. The only use of the remaining pages of page structs (page 1 to page 7) is to point to page->compound_head. Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs will be used for each HugeTLB page. This will allow us to free the remaining 6 pages to the buddy allocator. Here is how things look after remapping. 
    HugeTLB                  struct pages(8 pages)        page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
 |           |                     +-----------+                   | | | | |
 |           |                     |     3     | ------------------+ | | | |
 |           |                     +-----------+                     | | | |
 |           |                     |     4     | --------------------+ | | |
 |    2MB    |                     +-----------+                       | | |
 |           |                     |     5     | ----------------------+ | |
 |           |                     +-----------+                         | |
 |           |                     |     6     | ------------------------+ |
 |           |                     +-----------+                           |
 |           |                     |     7     | --------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+

When a HugeTLB is freed to the buddy system, we should allocate 6 pages for vmemmap pages and restore the previous mapping relationship. Apart from the 2MB HugeTLB page, we also have the 1GB HugeTLB page. It is similar to the 2MB HugeTLB page. We can also use this approach to free the vmemmap pages. In this case, for the 1GB HugeTLB page, we can save 4094 pages. This is a very substantial gain. On our server, we run some SPDK/QEMU applications which will use 1024GB of HugeTLB pages. With this feature enabled, we can save ~16GB (1G hugepage)/~12GB (2MB hugepage) memory. Because there is vmemmap page table reconstruction on the freeing/allocating path, it adds some overhead. Here is some overhead analysis.

1) Allocating 10240 2MB HugeTLB pages.

   a) With this patch series applied:
      # time echo 10240 > /proc/sys/vm/nr_hugepages
      real     0m0.166s
      user     0m0.000s
      sys      0m0.166s

      # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
      Attaching 2 probes...
      @latency:
      [8K, 16K)     5476 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
      [16K, 32K)    4760 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
      [32K, 64K)       4 |                                                     |

   b) Without this patch series:
      # time echo 10240 > /proc/sys/vm/nr_hugepages
      real     0m0.067s
      user     0m0.000s
      sys      0m0.067s

      # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
      Attaching 2 probes...
      @latency:
      [4K, 8K)     10147 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
      [8K, 16K)       93 |                                                     |

   Summary: this feature is about ~2x slower than before.

2) Freeing 10240 2MB HugeTLB pages.

   a) With this patch series applied:
      # time echo 0 > /proc/sys/vm/nr_hugepages
      real     0m0.213s
      user     0m0.000s
      sys      0m0.213s

      # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; } kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
      Attaching 2 probes...
      @latency:
      [8K, 16K)        6 |                                                     |
      [16K, 32K)   10227 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
      [32K, 64K)       7 |                                                     |

   b) Without this patch series:
      # time echo 0 > /proc/sys/vm/nr_hugepages
      real     0m0.081s
      user     0m0.000s
      sys      0m0.081s

      # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; } kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
      Attaching 2 probes...
      @latency:
      [4K, 8K)      6805 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
      [8K, 16K)     3427 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
      [16K, 32K)       8 |                                                     |

   Summary: the overhead of __free_hugepage is about ~2-3x slower than before.

Although the overhead has increased, the overhead is not significant. Like Mike said, "However, remember that the majority of use cases create HugeTLB pages at or shortly after boot time and add them to the pool. 
So, additional overhead is at pool creation time. There is no change to 'normal run time' operations of getting a page from or returning a page to the pool (think page fault/unmap)". Despite the overhead, and in addition to the memory gains from this series, the following data was obtained by Joao Martins. Many thanks for his effort. There's an additional benefit: page (un)pinners will see an improvement, presumably (per Joao) because there are fewer memmap pages and thus the tail/head pages stay in cache more often. Out of the box Joao saw (when comparing linux-next against linux-next + this series) with gup_test and pinning a 16G HugeTLB file (with 1G pages):

 get_user_pages(): ~32k -> ~9k
 unpin_user_pages(): ~75k -> ~70k

Usually any tight loop fetching compound_head(), or reading tail pages data (e.g. compound_head) benefits a lot. There are some unpinning inefficiencies Joao was fixing [2], but with those fixes added in it shows even more:

 unpin_user_pages(): ~27k -> ~3.8k

[1] https://lore.kernel.org/linux-mm/[email protected]/ [2] https://lore.kernel.org/linux-mm/[email protected]/ This patch (of 9): Move the common bootmem info registration API into its own file, bootmem_info.c. We will use {get,put}_page_bootmem() to initialize the page structs for the vmemmap pages, or to free the vmemmap pages back to the buddy allocator, in a later patch; so move them out of CONFIG_MEMORY_HOTPLUG_SPARSE. This is just code movement without any functional change. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Mike Kravetz <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Tested-by: Chen Huang <[email protected]> Tested-by: Bodeddula Balasubramaniam <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: [email protected] Cc: "H. Peter Anvin" <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Pawan Gupta <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Oliver Neukum <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Mina Almasry <[email protected]> Cc: David Rientjes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Barry Song <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: Joao Martins <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-29mm/page_alloc: disassociate the pcp->high from pcp->batchMel Gorman
The pcp high watermark is based on the batch size but there is no relationship between them other than it is convenient to use early in boot. This patch takes the first step and bases pcp->high on the zone low watermark split across the number of CPUs local to a zone while the batch size remains the same to avoid increasing allocation latencies. The intent behind the default pcp->high is "set the number of PCP pages such that if they are all full that background reclaim is not started prematurely". Note that in this patch the pcp->high values are adjusted after memory hotplug events, min_free_kbytes adjustments and watermark scale factor adjustments, but not CPU hotplug events, which are handled later in the series. On a test KVM instance:

 Before:
   grep -E "high:|batch" /proc/zoneinfo | tail -2
     high:  378
     batch: 63

 After:
   grep -E "high:|batch" /proc/zoneinfo | tail -2
     high:  649
     batch: 63

[[email protected]: fix __setup_per_zone_wmarks for parallel memory hotplug] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
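The new default boils down to splitting the zone's low watermark across the CPUs local to the zone (a simplified sketch of the sizing logic; the floor of four batches is an assumption for illustration):

    static int zone_highsize(struct zone *zone, int batch)
    {
            /*
             * Base pcp->high on the zone's low watermark split across the
             * CPUs local to the zone, so that full PCP lists do not start
             * background reclaim prematurely.
             */
            int nr_local_cpus = max(1U,
                    cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
            int high = low_wmark_pages(zone) / nr_local_cpus;

            /* never let high drop below a few batches (assumption) */
            return max(high, 4 * batch);
    }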
2021-05-05mm/mempool: minor coding style tweaksZhiyuan Dai
Various coding style tweaks to various files under mm/ [[email protected]: mm/swapfile: minor coding style tweaks] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: mm/sparse: minor coding style tweaks] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: mm/vmscan: minor coding style tweaks] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: mm/compaction: minor coding style tweaks] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: mm/oom_kill: minor coding style tweaks] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: mm/shmem: minor coding style tweaks] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: mm/page_alloc: minor coding style tweaks] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: mm/filemap: minor coding style tweaks] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: mm/mlock: minor coding style tweaks] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: mm/frontswap: minor coding style tweaks] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: mm/vmalloc: minor coding style tweaks] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: mm/memory_hotplug: minor coding style tweaks] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: mm/mempolicy: minor coding style tweaks] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zhiyuan Dai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05mm,memory_hotplug: add kernel boot option to enable memmap_on_memoryOscar Salvador
Self-stored memmap leads to a sparse memory situation, which is unsuitable for workloads that require large contiguous memory chunks, so make this an opt-in which needs to be explicitly enabled. To control this, let memory_hotplug have its own memory space, as suggested by David, so we can add the memory_hotplug.memmap_on_memory parameter. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
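For illustration, enabling it should then be a matter of booting with the new parameter on the kernel command line (a boolean module parameter, so e.g.):

  memory_hotplug.memmap_on_memory=1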
2021-05-05mm,memory_hotplug: allocate memmap from the added memory rangeOscar Salvador
Physical memory hotadd has to allocate a memmap (struct page array) for the newly added memory section. Currently, alloc_pages_node() is used for those allocations. This has some disadvantages:

a) existing memory is consumed for that purpose (eg: ~2MB per 128MB memory section on x86_64); this can even lead to extreme cases where the system goes OOM because the physically hotplugged memory depletes the available memory before it is onlined.
b) if the whole node is movable then we have off-node struct pages, which has performance drawbacks.
c) it might be that there are no PMD_ALIGNED chunks, so the memmap array gets populated with base pages.

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled. Vmemmap page tables can map arbitrary memory. That means that we can reserve a part of the physically hotadded memory to back vmemmap page tables. This implementation uses the beginning of the hotplugged memory for that purpose. There are some non-obvious things to consider though. Vmemmap pages are allocated/freed during the memory hotplug events (add_memory_resource(), try_remove_memory()) when the memory is added/removed. This means that the reserved physical range is not online although it is used. The most obvious side effect is that pfn_to_online_page() returns NULL for those pfns. The current design expects that this should be OK, as the hotplugged memory is considered garbage until it is onlined. For example, hibernation wouldn't save the content of those vmemmaps into the image, so it wouldn't be restored on resume, but this should be OK as there is no real content to recover anyway, while the metadata is reachable from other data structures (e.g. vmemmap page tables). The reserved space is therefore (de)initialized during the {on,off}line events (mhp_{de}init_memmap_on_memory). That is done by extracting page allocator independent initialization from the regular onlining path. The primary reason to handle the reserved space outside of {on,off}line_pages is to make each initialization specific to the purpose rather than special-casing them in a single function. As per above, the functions that are introduced are:

- mhp_init_memmap_on_memory: initializes vmemmap pages by calling move_pfn_range_to_zone(), calls kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages fully span.
- mhp_deinit_memmap_on_memory: offlines as many sections as vmemmap pages fully span, removes the range from the zone by remove_pfn_range_from_zone(), and calls kasan_remove_zero_shadow() for the range.

The new function memory_block_online() calls mhp_init_memmap_on_memory() before doing the actual online_pages(). Should online_pages() fail, we clean up by calling mhp_deinit_memmap_on_memory(). Adjusting of present_pages is done at the end, once we know that online_pages() succeeded. On offline, memory_block_offline() needs to unaccount vmemmap pages from present_pages before calling offline_pages(). This is necessary because offline_pages() tears down some structures based on whether the node or the zone become empty. If offline_pages() fails, we account the vmemmap pages back. If it succeeds, we call mhp_deinit_memmap_on_memory(). Hot-remove: We need to be careful when removing memory, as adding and removing memory needs to be done with the same granularity. To check that this assumption is not violated, we check the memory range we want to remove, and if a) any memory block has vmemmap pages and b) the range spans more than a single memory block, we scream out loud and refuse to proceed.
If all is good and the range was using memmap on memory (aka vmemmap pages), we construct an altmap structure so free_hugepage_table does the right thing and calls vmem_altmap_free instead of free_pagetable. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
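For illustration, a sketch (assumptions: the vmem_altmap fields base_pfn/free, and mhp_params carrying the altmap down to arch_add_memory()) of how the hot-add path can reserve the start of the new range for its own memmap:

  /* sketch: back the memmap with the beginning of the hot-added range */
  struct vmem_altmap mhp_altmap = {
          .base_pfn = PHYS_PFN(start),    /* first pfn of the new range */
          .free     = PHYS_PFN(size),     /* pages usable for the memmap */
  };
  struct mhp_params params = { .altmap = &mhp_altmap, .pgprot = PAGE_KERNEL };

With such an altmap in place, vmemmap population draws its pages from the hotplugged range instead of alloc_pages_node(), and tear-down hands them back via vmem_altmap_free() as described above.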
2021-05-05mm,memory_hotplug: factor out adjusting present pages into ↵David Hildenbrand
adjust_present_page_count() Let's have a single place (inspired by adjust_managed_page_count()) where we adjust present pages. In contrast to adjust_managed_page_count(), only memory onlining or offlining is allowed to modify the number of present pages. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
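A minimal sketch of such a helper, following the description above (the pgdat resize lock guards node_present_pages):

  static void adjust_present_page_count(struct zone *zone, long nr_pages)
  {
          unsigned long flags;

          zone->present_pages += nr_pages;        /* may be negative on offlining */
          pgdat_resize_lock(zone->zone_pgdat, &flags);
          zone->zone_pgdat->node_present_pages += nr_pages;
          pgdat_resize_unlock(zone->zone_pgdat, &flags);
  }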
2021-05-05mm,memory_hotplug: relax fully spanned sections checkOscar Salvador
We want {online,offline}_pages to operate on whole memblocks, but memmap_on_memory will poke pageblock_nr_pages aligned holes in the beginning, which is a special case we want to allow. Relax the check to account for that case. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
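Conceptually, the relaxed check only requires pageblock alignment at the start, while the end must still fall on a section boundary; a sketch:

  /* sketch: allow a pageblock-aligned hole (the vmemmap) at the start */
  if (WARN_ON_ONCE(!nr_pages ||
                   !IS_ALIGNED(pfn, pageblock_nr_pages) ||
                   !IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION)))
          return -EINVAL;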
2021-05-05mm: disable LRU pagevec during the migration temporarilyMinchan Kim
LRU pagevecs hold a refcount on pages until the pagevec is drained. This can prevent migration, since the refcount of the page is then greater than what the migration logic expects. To mitigate the issue, callers of migrate_pages drain the LRU pagevec via migrate_prep or lru_add_drain_all before the migrate_pages call. However, that is not enough, because pages coming into the pagevec after the draining call can still stay in the pagevec and keep preventing page migration. Since some callers of migrate_pages have retrial logic with LRU draining, the page would migrate on the next trial, but it is still fragile in that it doesn't close the fundamental race between upcoming LRU pages entering the pagevec and migration, so the migration failure could cause a contiguous memory allocation failure in the end. To close the race, this patch disables LRU caches (i.e., pagevec) during ongoing migration until migration is done. Since it's really hard to reproduce, I measured how many times migrate_pages retried with force mode (that is, a fallback to sync migration) with the debug code below:

  int migrate_pages(struct list_head *from, new_page_t get_new_page,
          ..
          ..
          if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
                  printk(KERN_ERR "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
                  dump_page(page, "fail to migrate");
          }

The test was repeatedly launching Android apps with CMA allocation in the background every five seconds. The total CMA allocation count was about 500 during the testing. With this patch, the dump_page count was reduced from 400 to 30. The new interface is also useful for memory hotplug, which currently drains LRU pcp caches after each migration failure. This is rather suboptimal as it has to disrupt others running during the operation. With the new interface the operation happens only once. This is also in line with the pcp allocator cache, which is disabled for the offlining as well. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Reviewed-by: Chris Goldsworthy <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: John Dias <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Oliver Sang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
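A usage sketch of the resulting interface (lru_cache_disable()/lru_cache_enable(), per the description above) around a contiguous-range migration; mtc and the source list are placeholders:

  lru_cache_disable();    /* drain once, keep pagevecs bypassed */
  ret = migrate_pages(&source, alloc_migration_target, NULL,
                      (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE);
  lru_cache_enable();     /* back to normal batching */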
2021-03-10arm64: mte: Map hotplugged memory as Normal TaggedCatalin Marinas
In a system supporting MTE, the linear map must allow reading/writing allocation tags by setting the memory type as Normal Tagged. Currently, this is only handled for memory present at boot. Hotplugged memory uses Normal non-Tagged memory. Introduce pgprot_mhp() for hotplugged memory and use it in add_memory_resource(). The arm64 code maps pgprot_mhp() to pgprot_tagged(). Note that ZONE_DEVICE memory should not be mapped as Tagged and therefore setting the memory type in arch_add_memory() is not feasible. Signed-off-by: Catalin Marinas <[email protected]> Fixes: 0178dc761368 ("arm64: mte: Use Normal Tagged attributes for the linear map") Reported-by: Patrick Daly <[email protected]> Tested-by: Patrick Daly <[email protected]> Link: https://lore.kernel.org/r/[email protected] Cc: <[email protected]> # 5.10.x Cc: Will Deacon <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: David Hildenbrand <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Vincenzo Frascino <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
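Per the description above, the arm64 side boils down to a one-line mapping, with a no-op fallback for architectures that do not define it (a sketch):

  /* generic fallback: leave the protection bits unchanged */
  #ifndef pgprot_mhp
  #define pgprot_mhp(prot)        (prot)
  #endif

and on arm64:

  /* arm64: hotplugged memory becomes Normal Tagged */
  #define pgprot_mhp(prot)        pgprot_tagged(prot)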
2021-02-26mm/memory_hotplug: prevalidate the address range being added with platformAnshuman Khandual
Patch series "mm/memory_hotplug: Pre-validate the address range with platform", v5. This series adds a mechanism allowing platforms to weigh in and prevalidate incoming address range before proceeding further with the memory hotplug. This helps prevent potential platform errors for the given address range, down the hotplug call chain, which inevitably fails the hotplug itself. This mechanism was suggested by David Hildenbrand during another discussion with respect to a memory hotplug fix on arm64 platform. https://lore.kernel.org/linux-arm-kernel/[email protected]/ This mechanism focuses on the addressibility aspect and not [sub] section alignment aspect. Hence check_hotplug_memory_range() and check_pfn_span() have been left unchanged. This patch (of 4): This introduces mhp_range_allowed() which can be called in various memory hotplug paths to prevalidate the address range which is being added, with the platform. Then mhp_range_allowed() calls mhp_get_pluggable_range() which provides applicable address range depending on whether linear mapping is required or not. For ranges that require linear mapping, it calls a new arch callback arch_get_mappable_range() which the platform can override. So the new callback, in turn provides the platform an opportunity to configure acceptable memory hotplug address ranges in case there are constraints. This mechanism will help prevent platform specific errors deep down during hotplug calls. This drops now redundant check_hotplug_memory_addressable() check in __add_pages() but instead adds a VM_BUG_ON() check which would ensure that the range has been validated with mhp_range_allowed() earlier in the call chain. Besides mhp_get_pluggable_range() also can be used by potential memory hotplug callers to avail the allowed physical range which would go through on a given platform. This does not really add any new range check in generic memory hotplug but instead compensates for lost checks in arch_add_memory() where applicable and check_hotplug_memory_addressable(), with unified mhp_range_allowed(). [[email protected]: make pagemap_range() return -EINVAL when mhp_range_allowed() fails] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Vasily Gorbik <[email protected]> # s390 Cc: Will Deacon <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jonathan Cameron <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: teawater <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/memory_hotplug: use helper function zone_end_pfn() to get end_pfnMiaohe Lin
Commit 108bcc96ef70 ("mm: add & use zone_end_pfn() and zone_spans_pfn()") introduced the helper zone_end_pfn() to calculate the zone end pfn. But update_pgdat_span() forgot to use it. Use this helper and rename local variable zone_end_pfn to end_pfn to avoid a naming conflict with the existing zone_end_pfn(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
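For reference, the helper in question is a one-liner in include/linux/mmzone.h:

  /* returns the pfn one past the last pfn the zone spans */
  static inline unsigned long zone_end_pfn(const struct zone *zone)
  {
          return zone->zone_start_pfn + zone->spanned_pages;
  }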
2021-02-26mm/memory_hotplug: MEMHP_MERGE_RESOURCE -> MHP_MERGE_RESOURCEDavid Hildenbrand
Let's make "MEMHP_MERGE_RESOURCE" consistent with "MHP_NONE", "mhp_t" and "mhp_flags". As discussed recently [1], "mhp" is our internal acronym for memory hotplug now. [1] https://lore.kernel.org/linux-mm/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Acked-by: Wei Liu <[email protected]> Reviewed-by: Pankaj Gupta <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Jason Wang <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/memory_hotplug: rename all existing 'memhp' into 'mhp'Anshuman Khandual
This renames all 'memhp' instances to 'mhp', except for memhp_default_state, which is a kernel command line option. This is just a clean up and should not cause a functional change. Let's make it consistent rather than mixing the two prefixes, in preparation for more users of the 'mhp' terminology. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm: teach pfn_to_online_page() about ZONE_DEVICE section collisionsDan Williams
While pfn_to_online_page() is able to determine pfn_valid() at subsection granularity, it is not able to reliably determine if a given pfn is also online if the section mixes ZONE_{NORMAL,MOVABLE} with ZONE_DEVICE. This means that pfn_to_online_page() may return invalid @page objects. For example, with a memory map like:

  100000000-1fbffffff : System RAM
    142000000-143002e16 : Kernel code
    143200000-143713fff : Kernel rodata
    143800000-143b15b7f : Kernel data
    144227000-144ffffff : Kernel bss
  1fc000000-2fbffffff : Persistent Memory (legacy)
    1fc000000-2fbffffff : namespace0.0

This command:

  echo 0x1fc000000 > /sys/devices/system/memory/soft_offline_page

...succeeds when it should fail. When it succeeds it touches an uninitialized page and may crash or cause other damage (see dissolve_free_huge_page()). While the memory map above is contrived via the memmap=ss!nn kernel command line option, the collision happens in practice on shipping platforms. The memory controller resources that decode spans of physical address space are a limited resource. One technique platform firmware uses to conserve those resources is to share a decoder across 2 devices to keep the address range contiguous. Unfortunately the unit of operation of a decoder is 64MiB while the Linux section size is 128MiB. This results in situations where, without subsection hotplug, memory mappings with different lifetimes collide into one object that can only express one lifetime. Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a section that mixes ZONE_DEVICE pfns with other online pfns. With SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall back to a slow-path check for ZONE_DEVICE pfns in an online section. In the fast path, online_section() for a full ZONE_DEVICE section returns false. Because the collision case is rare, and for simplicity, the SECTION_TAINT_ZONE_DEVICE flag is never cleared once set. [[email protected]: fix CONFIG_ZONE_DEVICE=n build] Link: https://lkml.kernel.org/r/CAPcyv4iX+7LAgAeSqx7Zw-Zd=ZV9gBv8Bo7oTbwCOOqJoZ3+Yg@mail.gmail.com Link: https://lkml.kernel.org/r/161058500675.1840162.7887862152161279354.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") Signed-off-by: Dan Williams <[email protected]> Reported-by: Michal Hocko <[email protected]> Acked-by: Michal Hocko <[email protected]> Reported-by: David Hildenbrand <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
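A sketch of the resulting slow path (simplified from the description above; a present dev_pagemap means the pfn is ZONE_DEVICE and therefore not online):

  if (!online_device_section(ms))
          return pfn_to_page(pfn);        /* fast path: no ZONE_DEVICE mix */

  pgmap = get_dev_pagemap(pfn, NULL);     /* slow path for tainted sections */
  put_dev_pagemap(pgmap);
  if (pgmap)
          return NULL;                    /* ZONE_DEVICE pfn: not online */
  return pfn_to_page(pfn);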
2021-02-26mm: teach pfn_to_online_page() to consider subsection validityDan Williams
pfn_to_online_page() is primarily used to filter out offline or fully uninitialized pages. pfn_valid() resp. online_section_nr() have a coarse, per-memory-section granularity. If a section is shared with partially offline memory (e.g. part of it is ZONE_DEVICE), then pfn_to_online_page() would lead to a false positive on some pfns. Fix this by adding a pfn_section_valid() check, which is subsection aware. [[email protected]: changelog rewrite] Link: https://lkml.kernel.org/r/161058500148.1840162.4365921007820501696.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: b13bc35193d9 ("mm/hotplug: invalid PFNs from pfn_to_online_page()") Signed-off-by: Dan Williams <[email protected]> Reported-by: David Hildenbrand <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Qian Cai <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
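The subsection-aware check in question is conceptually (a sketch following the subsection map introduced by subsection hotplug):

  static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
  {
          int idx = subsection_map_index(pfn);    /* which subsection within the section */

          return test_bit(idx, ms->usage->subsection_map);
  }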
2021-02-26mm: move pfn_to_online_page() out of lineDan Williams
Patch series "mm: Fix pfn_to_online_page() with respect to ZONE_DEVICE", v4. A pfn-walker that uses pfn_to_online_page() may inadvertently translate a pfn as online and in the page allocator, when it is offline managed by a ZONE_DEVICE mapping (details in Patch 3: ("mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions")). The 2 proposals under consideration are teach pfn_to_online_page() to be precise in the presence of mixed-zone sections, or teach the memory-add code to drop the System RAM associated with ZONE_DEVICE collisions. In order to not regress memory capacity by a few 10s to 100s of MiB the approach taken in this set is to add precision to pfn_to_online_page(). In the course of validating pfn_to_online_page() a couple other fixes fell out: 1/ soft_offline_page() fails to drop the reference taken in the madvise(..., MADV_SOFT_OFFLINE) case. 2/ memory_failure() uses get_dev_pagemap() to lookup ZONE_DEVICE pages, however that mapping may contain data pages and metadata raw pfns. Introduce pgmap_pfn_valid() to delineate the 2 types and fail the handling of raw metadata pfns. This patch (of 4); pfn_to_online_page() is already too large to be a macro or an inline function. In anticipation of further logic changes / growth, move it out of line. No functional change, just code movement. Link: https://lkml.kernel.org/r/161058499000.1840162.702316708443239771.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/161058499608.1840162.10165648147615238793.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <[email protected]> Reported-by: Michal Hocko <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24hugetlb: convert page_huge_active() HPageMigratable flagMike Kravetz
Use the new hugetlb page specific flag HPageMigratable to replace the page_huge_active interfaces. By its name, page_huge_active implied that a huge page was on the active list. However, that is not really what code checking the flag wanted to know; it really wanted to determine if the huge page could be migrated. This happens when the page is actually added to the page cache and/or task page table. This is the reasoning behind the name change. The VM_BUG_ON_PAGE() calls in the *_huge_active() interfaces are not really necessary, as we KNOW the page is a hugetlb page. Therefore, they are removed. The routine page_huge_active checked for PageHeadHuge before testing the active bit. This is unnecessary in the case where we hold a reference or lock and know it is a hugetlb head page. page_huge_active is also called without holding a reference or lock (scan_movable_pages), and can race with code freeing the page. The extra check in page_huge_active shortened the race window, but did not prevent the race. Offline code calling scan_movable_pages already deals with these races, so removing the check is acceptable. Add comment to racy code. [[email protected]: remove set_page_huge_active() declaration from include/linux/hugetlb.h] Link: https://lkml.kernel.org/r/CAMZfGtUda+KoAZscU0718TN61cSFwp4zy=y2oZ=+6Z2TAZZwng@mail.gmail.com Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24mm: rename memmap_init() and memmap_init_zone()Baoquan He
The current memmap_init_zone() only handles a memory region inside one zone, while memmap_init() actually does the memmap init of one zone. So rename both of them accordingly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-29mm: memmap defer init doesn't work as expectedBaoquan He
VMware observed a performance regression during memmap init on their platform, and bisected it to commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions rather that check each PFN").

Before the commit:

  [0.033176] Normal zone: 1445888 pages used for memmap
  [0.033176] Normal zone: 89391104 pages, LIFO batch:63
  [0.035851] ACPI: PM-Timer IO Port: 0x448

With the commit:

  [0.026874] Normal zone: 1445888 pages used for memmap
  [0.026875] Normal zone: 89391104 pages, LIFO batch:63
  [2.028450] ACPI: PM-Timer IO Port: 0x448

The root cause is that the current memmap defer init doesn't work as expected. Before, memmap_init_zone() was used to do the memmap init of one whole zone: initialize all low zones of one numa node, but defer the memmap init of the last zone in that numa node. However, since commit 73a6e474cb376, memmap_init() is adapted to iterate over memblock regions inside one zone, then call memmap_init_zone() to do the memmap init for each region. E.g., on VMware's system, the memory layout is as below; there are two memory regions in node 2. The current code will mistakenly initialize the whole 1st region [mem 0xab00000000-0xfcffffffff], then do the deferred memmap init on the 2nd region [mem 0x10000000000-0x1033fffffff], initializing only one memory section there. In fact, we only expect one memory section's memmap to be initialized up front. That's why more time is spent.

  [ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
  [ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
  [ 0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
  [ 0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
  [ 0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
  [ 0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]

Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass down the real zone end pfn, so that defer_init() can use it to judge whether deferred init should be applied zone-wide. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions rather that check each PFN") Signed-off-by: Baoquan He <[email protected]> Reported-by: Rahul Gopakumar <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-24Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin:

 - vdpa sim refactoring

 - virtio mem: Big Block Mode support

 - misc cleanups, fixes

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (61 commits)
  vdpa: Use simpler version of ida allocation
  vdpa: Add missing comment for virtqueue count
  uapi: virtio_ids: add missing device type IDs from OASIS spec
  uapi: virtio_ids.h: consistent indentions
  vhost scsi: fix error return code in vhost_scsi_set_endpoint()
  virtio_ring: Fix two use after free bugs
  virtio_net: Fix error code in probe()
  virtio_ring: Cut and paste bugs in vring_create_virtqueue_packed()
  tools/virtio: add barrier for aarch64
  tools/virtio: add krealloc_array
  tools/virtio: include asm/bug.h
  vdpa/mlx5: Use write memory barrier after updating CQ index
  vdpa: split vdpasim to core and net modules
  vdpa_sim: split vdpasim_virtqueue's iov field in out_iov and in_iov
  vdpa_sim: make vdpasim->buffer size configurable
  vdpa_sim: use kvmalloc to allocate vdpasim->buffer
  vdpa_sim: set vringh notify callback
  vdpa_sim: add set_config callback in vdpasim_dev_attr
  vdpa_sim: add get_config callback in vdpasim_dev_attr
  vdpa_sim: make 'config' generic and usable for any device type
  ...
2020-12-18mm/memory_hotplug: extend offline_and_remove_memory() to handle more than ↵David Hildenbrand
one memory block virtio-mem soon wants to use offline_and_remove_memory() on memory that exceeds a single Linux memory block (memory_block_size_bytes()). Let's remove that restriction. Let's remember the old state and try to restore it if anything goes wrong. While re-onlining can, in general, fail, it's highly unlikely to happen (usually only when a notifier fails to allocate memory, and these are rather rare). This will be used by virtio-mem to offline+remove memory ranges that are bigger than a single memory block - for example, with a device block size of 1 GiB (e.g., gigantic pages in the hypervisor) and a Linux memory block size of 128MB. While we could compress the state into 2 bits, using 8 bits is much easier. This handling is similar, but different to acpi_scan_try_to_offline(): a) We don't try to offline twice. I am not sure if this CONFIG_MEMCG optimization is still relevant - it should only apply to ZONE_NORMAL (where we have no guarantees). If relevant, we can always add it. b) acpi_scan_try_to_offline() simply onlines all memory in case something goes wrong. It doesn't restore the previous online type. Let's do that, so we won't overwrite what, e.g., user space configured. Reviewed-by: Wei Yang <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Jason Wang <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Wei Yang <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: David Hildenbrand <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Michael S. Tsirkin <[email protected]> Acked-by: Andrew Morton <[email protected]>
2020-12-15mm/memory_hotplug: quieting offline operationLaurent Dufour
On PowerPC, when dynamically removing memory from a system, we can see a lot of messages like this in the console:

  [ 186.575389] Offlined Pages 4096

This message is displayed for each LMB (256MB) removed, which means that when removing 1TB of memory, it is displayed 4096 times. Move it to DEBUG so as not to flood the console. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Laurent Dufour <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Nathan Lynch <[email protected]> Cc: Scott Cheloha <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm, page_alloc: disable pcplists during memory offlineVlastimil Babka
Memory offlining relies on page isolation to guarantee forward progress, because pages cannot be reused while they are isolated. But page isolation itself doesn't prevent races while freed pages are stored on pcp lists and can thus be reused. This can be worked around by repeated draining of pcplists, as done by commit 968318261221 ("mm/memory_hotplug: drain per-cpu pages again during memory offline"). David and Michal would prefer that this race be closed in a way that callers of page isolation who need stronger guarantees don't need to repeatedly drain. David suggested disabling pcplists usage completely during page isolation, instead of repeatedly draining them. To achieve this without adding special cases in the alloc/free fastpath, we can use the same approach as boot pagesets - when pcp->high is 0, any pcplist addition is immediately flushed. The race can thus be closed by setting pcp->high to 0 and draining pcplists once, before calling start_isolate_page_range(). The draining will serialize against processes that have already disabled interrupts and read the old value of pcp->high in free_unref_page_commit(); processes that have not yet disabled interrupts will observe pcp->high == 0 when they are rescheduled, and skip pcplists. This guarantees no stray pages on pcplists in zones where isolation happens. This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions that page isolation users can call before start_isolate_page_range() and after unisolating (or offlining) the isolated pages. Also, drain_all_pages() is optimized to only execute on cpus where pcplists are not empty. The check can however race with a free to pcplist that has not yet increased the pcp->count from 0 to 1. Thus make the drain optionally skip the racy check and drain on all cpus, and use this option in zone_pcp_disable(). As we have to avoid external updates to high and batch while pcplists are disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it in zone_pcp_enable(). This also synchronizes multiple users of zone_pcp_disable()/enable(). Currently the only user of this functionality is offline_pages(). [[email protected]: add comment, per David] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Suggested-by: Michal Hocko <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
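A usage sketch for a page isolation user, per the description above (the pfn arguments are placeholders):

  zone_pcp_disable(zone);         /* pcp->high = 0, drain once, hold pcp_batch_high_lock */
  ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_ISOLATE, 0);
  /* ... migrate/offline the isolated range ... */
  undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_ISOLATE);
  zone_pcp_enable(zone);          /* restore high/batch, release the lock */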
2020-12-15mm, page_alloc: move draining pcplists to page isolation usersVlastimil Babka
Currently, pcplists are drained during set_migratetype_isolate(), which means once per pageblock processed by start_isolate_page_range(). This is somewhat wasteful. Moreover, the callers might need different guarantees, and the draining is currently prone to races and does not guarantee that no page from an isolated pageblock will end up on the pcplist after the drain. Better guarantees are added by later patches and require explicit actions by page isolation users that need them. Thus it makes sense to move the current imperfect draining to the callers as well, as a preparation step. Link: https://lkml.kernel.org/r/[email protected] Suggested-by: David Hildenbrand <[email protected]> Suggested-by: Pavel Tatashin <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm: introduce debug_pagealloc_{map,unmap}_pages() helpersMike Rapoport
Patch series "arch, mm: improve robustness of direct map manipulation", v7. During recent discussion about KVM protected memory, David raised a concern about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC scope [1]. Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is possible that __kernel_map_pages() would fail, but since this function is void, the failure will go unnoticed. Moreover, there's lack of consistency of __kernel_map_pages() semantics across architectures as some guard this function with #ifdef DEBUG_PAGEALLOC, some refuse to update the direct map if page allocation debugging is disabled at run time and some allow modifying the direct map regardless of DEBUG_PAGEALLOC settings. This set straightens this out by restoring dependency of __kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites accordingly. Since currently the only user of __kernel_map_pages() outside DEBUG_PAGEALLOC is hibernation, it is updated to make direct map accesses there more explicit. [1] https://lore.kernel.org/lkml/[email protected] This patch (of 4): When CONFIG_DEBUG_PAGEALLOC is enabled, it unmaps pages from the kernel direct mapping after free_pages(). The pages than need to be mapped back before they could be used. Theese mapping operations use __kernel_map_pages() guarded with with debug_pagealloc_enabled(). The only place that calls __kernel_map_pages() without checking whether DEBUG_PAGEALLOC is enabled is the hibernation code that presumes availability of this function when ARCH_HAS_SET_DIRECT_MAP is set. Still, on arm64, __kernel_map_pages() will bail out when DEBUG_PAGEALLOC is not enabled but set_direct_map_invalid_noflush() may render some pages not present in the direct map and hibernation code won't be able to save such pages. To make page allocation debugging and hibernation interaction more robust, the dependency on DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP has to be made more explicit. Start with combining the guard condition and the call to __kernel_map_pages() into debug_pagealloc_map_pages() and debug_pagealloc_unmap_pages() functions to emphasize that __kernel_map_pages() should not be called without DEBUG_PAGEALLOC and use these new functions to map/unmap pages when page allocation debugging is enabled. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Albert Ou <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Edgecombe, Rick P" <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Len Brown <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: "Rafael J. 
Wysocki" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm/rmap: always do TTU_IGNORE_ACCESSShakeel Butt
Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2"), the code to check the secondary MMU's page table access bit is broken for !(TTU_IGNORE_ACCESS), because the page is unmapped from the secondary MMU's page table before the check. More specifically, this affects those secondary MMUs which unmap the memory in mmu_notifier_invalidate_range_start(), like KVM. However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS) or the absence of TTU_IGNORE_ACCESS, and it explicitly performs the page table access check before trying to unmap the page. So, at worst, the reclaim will miss accesses in a very short window if we remove the page table access check from the unmapping code. There is an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg reclaim. From memcg reclaim, page_referenced() only accounts the accesses from the processes which are in the same memcg as the target page, but the unmapping code considers accesses from all processes, thus decreasing the effectiveness of memcg reclaim. The simplest solution is to always assume TTU_IGNORE_ACCESS in the unmapping code. Link: https://lkml.kernel.org/r/[email protected] Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2") Signed-off-by: Shakeel Butt <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-11-22mm: fix phys_to_target_node() and memory_add_physaddr_to_nid() exportsDan Williams
The core-mm has a default __weak implementation of phys_to_target_node() to mirror the weak definition of memory_add_physaddr_to_nid(). That symbol is exported for modules. However, while the export in mm/memory_hotplug.c exported the symbol in the configuration cases of:

  CONFIG_NUMA_KEEP_MEMINFO=y
  CONFIG_MEMORY_HOTPLUG=y

...and:

  CONFIG_NUMA_KEEP_MEMINFO=n
  CONFIG_MEMORY_HOTPLUG=y

...it failed to export the symbol in the case of:

  CONFIG_NUMA_KEEP_MEMINFO=y
  CONFIG_MEMORY_HOTPLUG=n

Not only is that broken, but Christoph points out that the kernel should not be exporting any __weak symbol, which means that the memory_add_physaddr_to_nid() example that phys_to_target_node() copied is broken too. Rework the definitions of phys_to_target_node() and memory_add_physaddr_to_nid() to not require weak symbols. Move to the common arch override design pattern of an asm header defining a symbol to replace the default implementation. The only common header that all memory_add_physaddr_to_nid() producing architectures implement is asm/sparsemem.h. In fact, powerpc already defines its memory_add_physaddr_to_nid() helper in sparsemem.h. Double down on that observation and define phys_to_target_node() where necessary in asm/sparsemem.h. An alternate consideration that was discarded was to put this override in asm/numa.h, but that entangles with the definition of MAX_NUMNODES relative to the inclusion of linux/nodemask.h, and requires powerpc to grow a new header. The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid now that the symbol is properly exported / stubbed in all combinations of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG. [[email protected]: v4] Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com [[email protected]: powerpc: fix create_section_mapping compile warning] Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: a035b6bf863e ("mm/memory_hotplug: introduce default phys_to_target_node() implementation") Reported-by: Randy Dunlap <[email protected]> Reported-by: Thomas Gleixner <[email protected]> Reported-by: kernel test robot <[email protected]> Reported-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Randy Dunlap <[email protected]> Tested-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Joao Martins <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Stephen Rothwell <[email protected]> Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds <[email protected]>
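A sketch of the override pattern described (the generic header provides the stub only when the architecture did not define the symbol in its asm/sparsemem.h; the exact stub body is illustrative):

  /* generic fallback: only used when the arch did not override it */
  #ifndef phys_to_target_node
  static inline int phys_to_target_node(u64 start)
  {
          return 0;       /* no target-node information available */
  }
  #define phys_to_target_node phys_to_target_node
  #endif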
2020-10-18mm/memory_hotplug: remove a wrapper for alloc_migration_target()Joonsoo Kim
To calculate the correct node to migrate the page to for hotplug, we need to check the node id of the page. A wrapper for alloc_migration_target() exists for this purpose. However, Vlastimil informs that all migration source pages come from a single node. In this case, we don't need to check the node id for each page, and we don't need to re-set the target nodemask for each page by using the wrapper. Set up the migration_target_control once and use it for all pages. Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Roman Gushchin <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
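A sketch of the simplification (per the description above; nid and the source list are placeholders, built once per hotplug operation rather than per page):

  /* sketch: one migration target control for the whole source list */
  struct migration_target_control mtc = {
          .nid = nid,     /* node of the memory being offlined */
          .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
  };

  ret = migrate_pages(&source, alloc_migration_target, NULL,
                      (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG);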