summaryrefslogtreecommitdiff
path: root/mm/userfaultfd.c
AgeCommit message (Collapse)Author
2021-09-03userfaultfd: change mmap_changing to atomicNadav Amit
Patch series "userfaultfd: minor bug fixes". Three unrelated bug fixes. The first two addresses possible issues (not too theoretical ones), but I did not encounter them in practice. The third patch addresses a test bug that causes the test to fail on my system. It has been sent before as part of a bigger RFC. This patch (of 3): mmap_changing is currently a boolean variable, which is set and cleared without any lock that protects against concurrent modifications. mmap_changing is supposed to mark whether userfaultfd page-faults handling should be retried since mappings are undergoing a change. However, concurrent calls, for instance to madvise(MADV_DONTNEED), might cause mmap_changing to be false, although the remove event was still not read (hence acknowledged) by the user. Change mmap_changing to atomic_t and increase/decrease appropriately. Add a debug assertion to see whether mmap_changing is negative. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: df2cc96e77011 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races") Signed-off-by: Nadav Amit <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Peter Xu <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte()Axel Rasmussen
In a previous commit, we added the mfill_atomic_install_pte() helper. This helper does the job of setting up PTEs for an existing page, to map it into a given VMA. It deals with both the anon and shmem cases, as well as the shared and private cases. In other words, shmem_mfill_atomic_pte() duplicates a case it already handles. So, expose it, and let shmem_mfill_atomic_pte() use it directly, to reduce code duplication. This requires that we refactor shmem_mfill_atomic_pte() a bit: Instead of doing accounting (shmem_recalc_inode() et al) part-way through the PTE setup, do it afterward. This frees up mfill_atomic_install_pte() from having to care about this accounting, and means we don't need to e.g. shmem_uncharge() in the error path. A side effect is this switches shmem_mfill_atomic_pte() to use lru_cache_add_inactive_or_unevictable() instead of just lru_cache_add(). This wrapper does some extra accounting in an exceptional case, if appropriate, so it's actually the more correct thing to use. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Reviewed-by: Peter Xu <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30userfaultfd/shmem: support UFFDIO_CONTINUE for shmemAxel Rasmussen
With this change, userspace can resolve a minor fault within a shmem-backed area with a UFFDIO_CONTINUE ioctl. The semantics for this match those for hugetlbfs - we look up the existing page in the page cache, and install a PTE for it. This commit introduces a new helper: mfill_atomic_install_pte. Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in shmem.c? The existing userfault implementation only relies on shmem.c for VM_SHARED VMAs. However, minor fault handling / CONTINUE work just fine for !VM_SHARED VMAs as well. We'd prefer to handle CONTINUE for shmem in one place, regardless of shared/private (to reduce code duplication). Why add a new mfill_atomic_install_pte helper? A problem we have with continue is that shmem_mfill_atomic_pte() and mcopy_atomic_pte() are *close* to what we want, but not exactly. We do want to setup the PTEs in a CONTINUE operation, but we don't want to e.g. allocate a new page, charge it (e.g. to the shmem inode), manipulate various flags, etc. Also we have the problem stated above: shmem_mfill_atomic_pte() and mcopy_atomic_pte() both handle one-half of the problem (shared / private) continue cares about. So, introduce mcontinue_atomic_pte(), to handle all of the shmem continue cases. Introduce the helper so it doesn't duplicate code with mcopy_atomic_pte(). In a future commit, shmem_mfill_atomic_pte() will also be modified to use this new helper. However, since this is a bigger refactor, it seems most clear to do it as a separate change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Acked-by: Hugh Dickins <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pteAxel Rasmussen
Patch series "userfaultfd: add minor fault handling for shmem", v6. Overview ======== See the series which added minor faults for hugetlbfs [3] for a detailed overview of minor fault handling in general. This series adds the same support for shmem-backed areas. This series is structured as follows: - Commits 1 and 2 are cleanups. - Commits 3 and 4 implement the new feature (minor fault handling for shmem). - Commit 5 advertises that the feature is now available since at this point it's fully implemented. - Commit 6 is a final cleanup, modifying an existing code path to re-use a new helper we've introduced. - Commits 7, 8, 9, 10 update the userfaultfd selftest to exercise the feature. Use Case ======== In some cases it is useful to have VM memory backed by tmpfs instead of hugetlbfs. So, this feature will be used to support the same VM live migration use case described in my original series. Additionally, Android folks (Lokesh Gidra <[email protected]>) hope to optimize the Android Runtime garbage collector using this feature: "The plan is to use userfaultfd for concurrently compacting the heap. With this feature, the heap can be shared-mapped at another location where the GC-thread(s) could continue the compaction operation without the need to invoke userfault ioctl(UFFDIO_COPY) each time. OTOH, if and when Java threads get faults on the heap, UFFDIO_CONTINUE can be used to resume execution. Furthermore, this feature enables updating references in the 'non-moving' portion of the heap efficiently. Without this feature, uneccessary page copying (ioctl(UFFDIO_COPY)) would be required." [1] https://lore.kernel.org/patchwork/cover/1388144/ [2] https://lore.kernel.org/patchwork/patch/1408161/ [3] https://lore.kernel.org/linux-fsdevel/[email protected]/T/#t This patch (of 9): Previously, we did a dance where we had one calling path in userfaultfd.c (mfill_atomic_pte), but then we split it into two in shmem_fs.h (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined into a single shared function in shmem.c (shmem_mfill_atomic_pte). This is all a bit overly complex. Just call the single combined shmem function directly, allowing us to clean up various branches, boilerplate, etc. While we're touching this function, two other small cleanup changes: - offset is equivalent to pgoff, so we can get rid of offset entirely. - Split two VM_BUG_ON cases into two statements. This means the line number reported when the BUG is hit specifies exactly which condition was true. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Reviewed-by: Peter Xu <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPYMina Almasry
On UFFDIO_COPY, if we fail to copy the page contents while holding the hugetlb_fault_mutex, we will drop the mutex and return to the caller after allocating a page that consumed a reservation. In this case there may be a fault that double consumes the reservation. To handle this, we free the allocated page, fix the reservations, and allocate a temporary hugetlb page and return that to the caller. When the caller does the copy outside of the lock, we again check the cache, and allocate a page consuming the reservation, and copy over the contents. Test: Hacked the code locally such that resv_huge_pages underflows produce a warning and the copy_huge_page_from_user() always fails, then: ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success ./tools/testing/selftests/vm/userfaultfd hugetlb 10 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success Both tests succeed and produce no warnings. After the test runs number of free/resv hugepages is correct. [[email protected]: remove set but not used variable 'vm_alloc_shared'] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix allocation error check and copy func name] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mina Almasry <[email protected]> Signed-off-by: YueHaibing <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Peter Xu <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-22userfaultfd: hugetlbfs: fix new flag usage in error pathMike Kravetz
In commit d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags") the use of PagePrivate to indicate a reservation count should be restored at free time was changed to the hugetlb specific flag HPageRestoreReserve. Changes to a userfaultfd error path as well as a VM_BUG_ON() in remove_inode_hugepages() were overlooked. Users could see incorrect hugetlb reserve counts if they experience an error with a UFFDIO_COPY operation. Specifically, this would be the result of an unlikely copy_huge_page_from_user error. There is not an increased chance of hitting the VM_BUG_ON. Link: https://lkml.kernel.org/r/[email protected] Fixes: d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags") Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Mina Almasry <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05userfaultfd: add UFFDIO_CONTINUE ioctlAxel Rasmussen
This ioctl is how userspace ought to resolve "minor" userfaults. The idea is, userspace is notified that a minor fault has occurred. It might change the contents of the page using its second non-UFFD mapping, or not. Then, it calls UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are correct, carry on setting up the mapping". Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for MINOR registered VMAs. ZEROPAGE maps the VMA to the zero page; but in the minor fault case, we already have some pre-existing underlying page. Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping. We'd just use memcpy() or similar instead. It turns out hugetlb_mcopy_atomic_pte() already does very close to what we want, if an existing page is provided via `struct page **pagep`. We already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so just extend that design: add an enum for the three modes of operation, and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE case. (Basically, look up the existing page, and avoid adding the existing page to the page cache or calling set_page_huge_active() on it.) Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Adam Ruprecht <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Cannon Matthews <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chinwen Chang <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Huang Ying <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: "Michal Koutn" <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shawn Anastasio <[email protected]> Cc: Steven Price <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05hugetlb: pass vma into huge_pte_alloc() and huge_pmd_share()Peter Xu
Patch series "hugetlb: Disable huge pmd unshare for uffd-wp", v4. This series tries to disable huge pmd unshare of hugetlbfs backed memory for uffd-wp. Although uffd-wp of hugetlbfs is still during rfc stage, the idea of this series may be needed for multiple tasks (Axel's uffd minor fault series, and Mike's soft dirty series), so I picked it out from the larger series. This patch (of 4): It is a preparation work to be able to behave differently in the per architecture huge_pte_alloc() according to different VMA attributes. Pass it deeper into huge_pmd_share() so that we can avoid the find_vma() call. [[email protected]: build fix] Link: https://lkml.kernel.org/r/20210304164653.GB397383@xz-x1Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Suggested-by: Mike Kravetz <[email protected]> Cc: Adam Ruprecht <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Cannon Matthews <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chinwen Chang <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Huang Ying <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: "Michal Koutn" <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shawn Anastasio <[email protected]> Cc: Steven Price <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-08-12mm/vmscan: protect the workingset on anonymous LRUJoonsoo Kim
In current implementation, newly created or swap-in anonymous page is started on active list. Growing active list results in rebalancing active/inactive list so old pages on active list are demoted to inactive list. Hence, the page on active list isn't protected at all. Following is an example of this situation. Assume that 50 hot pages on active list. Numbers denote the number of pages on active/inactive list (active | inactive). 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(uo) | 50(h) 3. workload: another 50 newly created (used-once) pages 50(uo) | 50(uo), swap-out 50(h) This patch tries to fix this issue. Like as file LRU, newly created or swap-in anonymous pages will be inserted to the inactive list. They are promoted to active list if enough reference happens. This simple modification changes the above example as following. 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(h) | 50(uo) 3. workload: another 50 newly created (used-once) pages 50(h) | 50(uo), swap-out 50(uo) As you can see, hot pages on active list would be protected. Note that, this implementation has a drawback that the page cannot be promoted and will be swapped-out if re-access interval is greater than the size of inactive list but less than the size of total(active+inactive). To solve this potential issue, following patch will apply workingset detection similar to the one that's already applied to file LRU. Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-09mmap locking API: convert mmap_sem commentsMichel Lespinasse
Convert comments that reference mmap_sem to reference mmap_lock instead. [[email protected]: fix up linux-next leftovers] [[email protected]: s/lockaphore/lock/, per Vlastimil] [[email protected]: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ying Han <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-09mmap locking API: use coccinelle to convert mmap_sem rwsem call sitesMichel Lespinasse
This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Reviewed-by: Laurent Dufour <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ying Han <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: memcontrol: delete unused lrucare handlingJohannes Weiner
Swapin faults were the last event to charge pages after they had already been put on the LRU list. Now that we charge directly on swapin, the lrucare portion of the charge code is unused. Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Joonsoo Kim <[email protected]> Cc: Alex Shi <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Shakeel Butt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: memcontrol: convert anon and file-thp to new mem_cgroup_charge() APIJohannes Weiner
With the page->mapping requirement gone from memcg, we can charge anon and file-thp pages in one single step, right after they're allocated. This removes two out of three API calls - especially the tricky commit step that needed to happen at just the right time between when the page is "set up" and when it's "published" - somewhat vague and fluid concepts that varied by page type. All we need is a freshly allocated page and a memcg context to charge. v2: prevent double charges on pre-allocated hugepages in khugepaged [[email protected]: Fix crash - *hpage could be ERR_PTR instead of NULL] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Joonsoo Kim <[email protected]> Cc: Alex Shi <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Qian Cai <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: memcontrol: switch to native NR_ANON_MAPPED counterJohannes Weiner
Memcg maintains a private MEMCG_RSS counter. This divergence from the generic VM accounting means unnecessary code overhead, and creates a dependency for memcg that page->mapping is set up at the time of charging, so that page types can be told apart. Convert the generic accounting sites to mod_lruvec_page_state and friends to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the same way we do for NR_FILE_MAPPED. With the previous patch removing MEMCG_CACHE and the private NR_SHMEM counter, this patch finally eliminates the need to have page->mapping set up at charge time. However, we need to have page->mem_cgroup set up by the time rmap runs and does the accounting, so switch the commit and the rmap callbacks around. v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo) Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Alex Shi <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Balbir Singh <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: memcontrol: drop @compound parameter from memcg charging APIJohannes Weiner
The memcg charging API carries a boolean @compound parameter that tells whether the page we're dealing with is a hugepage. mem_cgroup_commit_charge() has another boolean @lrucare that indicates whether the page needs LRU locking or not while charging. The majority of callsites know those parameters at compile time, which results in a lot of naked "false, false" argument lists. This makes for cryptic code and is a breeding ground for subtle mistakes. Thankfully, the huge page state can be inferred from the page itself and doesn't need to be passed along. This is safe because charging completes before the page is published and somebody may split it. Simplify the callsites by removing @compound, and let memcg infer the state by using hpage_nr_pages() unconditionally. That function does PageTransHuge() to identify huge pages, which also helpfully asserts that nobody passes in tail pages by accident. The following patches will introduce a new charging API, best not to carry over unnecessary weight. Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Alex Shi <[email protected]> Reviewed-by: Joonsoo Kim <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Balbir Singh <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-07userfaultfd: wp: support write protection for userfault vma rangeShaohua Li
Add API to enable/disable writeprotect a vma range. Unlike mprotect, this doesn't split/merge vmas. [[email protected]: - use the helper to find VMA; - return -ENOENT if not found to match mcopy case; - use the new MM_CP_UFFD_WP* flags for change_protection - check against mmap_changing for failures - replace find_dst_vma with vma_find_uffd] Signed-off-by: Shaohua Li <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Jerome Glisse <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Bobby Powers <[email protected]> Cc: Brian Geffon <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Denis Plotnikov <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Martin Cracauer <[email protected]> Cc: Marty McFadden <[email protected]> Cc: Maya Gokhale <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Pavel Emelyanov <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-07userfaultfd: wp: apply _PAGE_UFFD_WP bitPeter Xu
Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for change_protection() when used with uffd-wp and make sure the two new flags are exclusively used. Then, - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW when a range of memory is write protected by uffd - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover _PAGE_RW when write protection is resolved from userspace And use this new interface in mwriteprotect_range() to replace the old MM_CP_DIRTY_ACCT. Do this change for both PTEs and huge PMDs. Then we can start to identify which PTE/PMD is write protected by general (e.g., COW or soft dirty tracking), and which is for userfaultfd-wp. Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it into _PAGE_CHG_MASK as well. Meanwhile, since we have this new bit, we can be even more strict when detecting uffd-wp page faults in either do_wp_page() or wp_huge_pmd(). After we're with _PAGE_UFFD_WP, a special case is when a page is both protected by the general COW logic and also userfault-wp. Here the userfault-wp will have higher priority and will be handled first. Only after the uffd-wp bit is cleared on the PTE/PMD will we continue to handle the general COW. These are the steps on what will happen with such a page: 1. CPU accesses write protected shared page (so both protected by general COW and uffd-wp), blocked by uffd-wp first because in do_wp_page we'll handle uffd-wp first, so it has higher priority than general COW. 2. Uffd service thread receives the request, do UFFDIO_WRITEPROTECT to remove the uffd-wp bit upon the PTE/PMD. However here we still keep the write bit cleared. Notify the blocked CPU. 3. The blocked CPU resumes the page fault process with a fault retry, during retry it'll notice it was not with the uffd-wp bit this time but it is still write protected by general COW, then it'll go though the COW path in the fault handler, copy the page, apply write bit where necessary, and retry again. 4. The CPU will be able to access this page with write bit set. Suggested-by: Andrea Arcangeli <[email protected]> Signed-off-by: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Brian Geffon <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Martin Cracauer <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Bobby Powers <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Maya Gokhale <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Marty McFadden <[email protected]> Cc: Denis Plotnikov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Shaohua Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-07userfaultfd: wp: add UFFDIO_COPY_MODE_WPAndrea Arcangeli
This allows UFFDIO_COPY to map pages write-protected. [[email protected]: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and commit messages] Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Jerome Glisse <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Cc: Bobby Powers <[email protected]> Cc: Brian Geffon <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Denis Plotnikov <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Martin Cracauer <[email protected]> Cc: Marty McFadden <[email protected]> Cc: Maya Gokhale <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Shaohua Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronizationMike Kravetz
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2. While discussing the issue with huge_pte_offset [1], I remembered that there were more outstanding hugetlb races. These issues are: 1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become invalid via a call to huge_pmd_unshare by another thread. 2) hugetlbfs page faults can race with truncation causing invalid global reserve counts and state. A previous attempt was made to use i_mmap_rwsem in this manner as described at [2]. However, those patches were reverted starting with [3] due to locking issues. To effectively use i_mmap_rwsem to address the above issues it needs to be held (in read mode) during page fault processing. However, during fault processing we need to lock the page we will be adding. Lock ordering requires we take page lock before i_mmap_rwsem. Waiting until after taking the page lock is too late in the fault process for the synchronization we want to do. To address this lock ordering issue, the following patches change the lock ordering for hugetlb pages. This is not too invasive as hugetlbfs processing is done separate from core mm in many places. However, I don't really like this idea. Much ugliness is contained in the new routine hugetlb_page_mapping_lock_write() of patch 1. The only other way I can think of to address these issues is by catching all the races. After catching a race, cleanup, backout, retry ... etc, as needed. This can get really ugly, especially for huge page reservations. At one time, I started writing some of the reservation backout code for page faults and it got so ugly and complicated I went down the path of adding synchronization to avoid the races. Any other suggestions would be welcome. [1] https://lore.kernel.org/linux-mm/[email protected]/ [2] https://lore.kernel.org/linux-mm/[email protected]/ [3] https://lore.kernel.org/linux-mm/[email protected] [4] https://lore.kernel.org/linux-mm/[email protected]/ [5] https://lore.kernel.org/lkml/[email protected]/ This patch (of 2): While looking at BUGs associated with invalid huge page map counts, it was discovered and observed that a huge pte pointer could become 'invalid' and point to another task's page table. Consider the following: A task takes a page fault on a shared hugetlbfs file and calls huge_pte_alloc to get a ptep. Suppose the returned ptep points to a shared pmd. Now, another task truncates the hugetlbfs file. As part of truncation, it unmaps everyone who has the file mapped. If the range being truncated is covered by a shared pmd, huge_pmd_unshare will be called. For all but the last user of the shared pmd, huge_pmd_unshare will clear the pud pointing to the pmd. If the task in the middle of the page fault is not the last user, the ptep returned by huge_pte_alloc now points to another task's page table or worse. This leads to bad things such as incorrect page map/reference counts or invalid memory references. To fix, expand the use of i_mmap_rwsem as follows: - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called. huge_pmd_share is only called via huge_pte_alloc, so callers of huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers of huge_pte_alloc continue to hold the semaphore until finished with the ptep. - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called. One problem with this scheme is that it requires taking i_mmap_rwsem before taking the page lock during page faults. This is not the order specified in the rest of mm code. Handling of hugetlbfs pages is mostly isolated today. Therefore, we use this alternative locking order for PageHuge() pages. mapping->i_mmap_rwsem hugetlb_fault_mutex (hugetlbfs specific page fault mutex) page->flags PG_locked (lock_page) To help with lock ordering issues, hugetlb_page_mapping_lock_write() is introduced to write lock the i_mmap_rwsem associated with a page. In most cases it is easy to get address_space via vma->vm_file->f_mapping. However, in the case of migration or memory errors for anon pages we do not have an associated vma. A new routine _get_hugetlb_page_mapping() will use anon_vma to get address_space in these cases. Signed-off-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Prakash Sangappa <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01mm: fix typos in comments when calling __SetPageUptodate()Wei Yang
There are several places emphasise the effect of __SetPageUptodate(), while the comment seems to have a typo in two places. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01userfaultfd: wrap the common dst_vma check into an inlined functionWei Yang
When doing UFFDIO_COPY, it is necessary to find the correct destination vma and make sure fault range is in it. Since there are two places need to do the same task, just wrap those common check into an inlined function. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01userfaultfd: remove unnecessary WARN_ON() in __mcopy_atomic_hugetlb()Wei Yang
These warning here is to make sure address(dst_addr) and length(len - copied) are huge page size aligned. While this is ensured by: dst_start and len is huge page size aligned dst_addr equals to dst_start and increase huge page size each time copied increase huge page size each time This means these warnings will never be triggered. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01userfaultfd: use vma_pagesize for all huge page size calculationWei Yang
In __mcopy_atomic_hugetlb() we use two variables to deal with huge page size: vma_hpagesize and huge_page_size. Since they are the same, it is not necessary to use two different mechanism. This patch makes it consistent by all using vma_hpagesize. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01hugetlb: remove unused hstate in hugetlb_fault_mutex_hash()Wei Yang
The first parameter hstate in function hugetlb_fault_mutex_hash() is not used anymore. This patch removes it. [[email protected]: various build fixes] [[email protected]: fix a GCC compilation warning] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Qian Cai <[email protected]> Suggested-by: Andrew Morton <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01hugetlbfs: hugetlb_fault_mutex_hash() cleanupMike Kravetz
A new clang diagnostic (-Wsizeof-array-div) warns about the calculation to determine the number of u32's in an array of unsigned longs. Suppress warning by adding parentheses. While looking at the above issue, noticed that the 'address' parameter to hugetlb_fault_mutex_hash is no longer used. So, remove it from the definition and all callers. No functional change. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Reported-by: Nathan Chancellor <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Ilie Halip <[email protected]> Cc: David Bolvansky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499Thomas Gleixner
Based on 1 normalized pattern(s): this work is licensed under the terms of the gnu gpl version 2 see the copying file in the top level directory extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 35 file(s). Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Kate Stewart <[email protected]> Reviewed-by: Enrico Weigelt <[email protected]> Reviewed-by: Allison Randal <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2019-05-14hugetlb: use same fault hash key for shared and private mappingsMike Kravetz
hugetlb uses a fault mutex hash table to prevent page faults of the same pages concurrently. The key for shared and private mappings is different. Shared keys off address_space and file index. Private keys off mm and virtual address. Consider a private mappings of a populated hugetlbfs file. A fault will map the page from the file and if needed do a COW to map a writable page. Hugetlbfs hole punch uses the fault mutex to prevent mappings of file pages. It uses the address_space file index key. However, private mappings will use a different key and could race with this code to map the file page. This causes problems (BUG) for the page cache remove code as it expects the page to be unmapped. A sample stack is: page dumped because: VM_BUG_ON_PAGE(page_mapped(page)) kernel BUG at mm/filemap.c:169! ... RIP: 0010:unaccount_page_cache_page+0x1b8/0x200 ... Call Trace: __delete_from_page_cache+0x39/0x220 delete_from_page_cache+0x45/0x70 remove_inode_hugepages+0x13c/0x380 ? __add_to_page_cache_locked+0x162/0x380 hugetlbfs_fallocate+0x403/0x540 ? _cond_resched+0x15/0x30 ? __inode_security_revalidate+0x5d/0x70 ? selinux_file_permission+0x100/0x130 vfs_fallocate+0x13f/0x270 ksys_fallocate+0x3c/0x80 __x64_sys_fallocate+0x1a/0x20 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 There seems to be another potential COW issue/race with this approach of different private and shared keys as noted in commit 8382d914ebf7 ("mm, hugetlb: improve page-fault scalability"). Since every hugetlb mapping (even anon and private) is actually a file mapping, just use the address_space index key for all mappings. This results in potentially more hash collisions. However, this should not be the common case. Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/20190412165235.t4sscoujczfhuiyt@linux-r8p5 Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages") Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-01-08hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization"Mike Kravetz
This reverts b43a9990055958e70347c56f90ea2ae32c67334c The reverted commit caused issues with migration and poisoning of anon huge pages. The LTP move_pages12 test will cause an "unable to handle kernel NULL pointer" BUG would occur with stack similar to: RIP: 0010:down_write+0x1b/0x40 Call Trace: migrate_pages+0x81f/0xb90 __ia32_compat_sys_migrate_pages+0x190/0x190 do_move_pages_to_node.isra.53.part.54+0x2a/0x50 kernel_move_pages+0x566/0x7b0 __x64_sys_move_pages+0x24/0x30 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The purpose of the reverted patch was to fix some long existing races with huge pmd sharing. It used i_mmap_rwsem for this purpose with the idea that this could also be used to address truncate/page fault races with another patch. Further analysis has determined that i_mmap_rwsem can not be used to address all these hugetlbfs synchronization issues. Therefore, revert this patch while working an another approach to the underlying issues. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Reported-by: Jan Stancek <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Prakash Sangappa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-01-04mm: treewide: remove unused address argument from pte_alloc functionsJoel Fernandes (Google)
Patch series "Add support for fast mremap". This series speeds up the mremap(2) syscall by copying page tables at the PMD level even for non-THP systems. There is concern that the extra 'address' argument that mremap passes to pte_alloc may do something subtle architecture related in the future that may make the scheme not work. Also we find that there is no point in passing the 'address' to pte_alloc since its unused. This patch therefore removes this argument tree-wide resulting in a nice negative diff as well. Also ensuring along the way that the enabled architectures do not do anything funky with the 'address' argument that goes unnoticed by the optimization. Build and boot tested on x86-64. Build tested on arm64. The config enablement patch for arm64 will be posted in the future after more testing. The changes were obtained by applying the following Coccinelle script. (thanks Julia for answering all Coccinelle questions!). Following fix ups were done manually: * Removal of address argument from pte_fragment_alloc * Removal of pte_alloc_one_fast definitions from m68k and microblaze. // Options: --include-headers --no-includes // Note: I split the 'identifier fn' line, so if you are manually // running it, please unsplit it so it runs for you. virtual patch @pte_alloc_func_def depends on patch exists@ identifier E2; identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$"; type T2; @@ fn(... - , T2 E2 ) { ... } @pte_alloc_func_proto_noarg depends on patch exists@ type T1, T2, T3, T4; identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$"; @@ ( - T3 fn(T1, T2); + T3 fn(T1); | - T3 fn(T1, T2, T4); + T3 fn(T1, T2); ) @pte_alloc_func_proto depends on patch exists@ identifier E1, E2, E4; type T1, T2, T3, T4; identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$"; @@ ( - T3 fn(T1 E1, T2 E2); + T3 fn(T1 E1); | - T3 fn(T1 E1, T2 E2, T4 E4); + T3 fn(T1 E1, T2 E2); ) @pte_alloc_func_call depends on patch exists@ expression E2; identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$"; @@ fn(... -, E2 ) @pte_alloc_macro depends on patch exists@ identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$"; identifier a, b, c; expression e; position p; @@ ( - #define fn(a, b, c) e + #define fn(a, b) e | - #define fn(a, b) e + #define fn(a) e ) Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Joel Fernandes (Google) <[email protected]> Suggested-by: Kirill A. Shutemov <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Julia Lawall <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: William Kucharski <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-12-28hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronizationMike Kravetz
While looking at BUGs associated with invalid huge page map counts, it was discovered and observed that a huge pte pointer could become 'invalid' and point to another task's page table. Consider the following: A task takes a page fault on a shared hugetlbfs file and calls huge_pte_alloc to get a ptep. Suppose the returned ptep points to a shared pmd. Now, another task truncates the hugetlbfs file. As part of truncation, it unmaps everyone who has the file mapped. If the range being truncated is covered by a shared pmd, huge_pmd_unshare will be called. For all but the last user of the shared pmd, huge_pmd_unshare will clear the pud pointing to the pmd. If the task in the middle of the page fault is not the last user, the ptep returned by huge_pte_alloc now points to another task's page table or worse. This leads to bad things such as incorrect page map/reference counts or invalid memory references. To fix, expand the use of i_mmap_rwsem as follows: - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called. huge_pmd_share is only called via huge_pte_alloc, so callers of huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers of huge_pte_alloc continue to hold the semaphore until finished with the ptep. - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called. [[email protected]: add explicit check for mapping != null] Link: http://lkml.kernel.org/r/[email protected] Fixes: 39dde65c9940 ("shared page table for hugetlb page") Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Prakash Sangappa <[email protected]> Cc: Colin Ian King <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-11-30userfaultfd: shmem: add i_size checksAndrea Arcangeli
With MAP_SHARED: recheck the i_size after taking the PT lock, to serialize against truncate with the PT lock. Delete the page from the pagecache if the i_size_read check fails. With MAP_PRIVATE: check the i_size after the PT lock before mapping anonymous memory or zeropages into the MAP_PRIVATE shmem mapping. A mostly irrelevant cleanup: like we do the delete_from_page_cache() pagecache removal after dropping the PT lock, the PT lock is a spinlock so drop it before the sleepable page lock. Link: http://lkml.kernel.org/r/[email protected] Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Signed-off-by: Andrea Arcangeli <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Reviewed-by: Hugh Dickins <[email protected]> Reported-by: Jann Horn <[email protected]> Cc: <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Peter Xu <[email protected]> Cc: [email protected] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-11-30userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmasAndrea Arcangeli
After the VMA to register the uffd onto is found, check that it has VM_MAYWRITE set before allowing registration. This way we inherit all common code checks before allowing to fill file holes in shmem and hugetlbfs with UFFDIO_COPY. The userfaultfd memory model is not applicable for readonly files unless it's a MAP_PRIVATE. Link: http://lkml.kernel.org/r/[email protected] Fixes: ff62a3421044 ("hugetlb: implement memfd sealing") Signed-off-by: Andrea Arcangeli <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Reviewed-by: Hugh Dickins <[email protected]> Reported-by: Jann Horn <[email protected]> Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Cc: <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Peter Xu <[email protected]> Cc: [email protected] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-11-30userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmemAndrea Arcangeli
Userfaultfd did not create private memory when UFFDIO_COPY was invoked on a MAP_PRIVATE shmem mapping. Instead it wrote to the shmem file, even when that had not been opened for writing. Though, fortunately, that could only happen where there was a hole in the file. Fix the shmem-backed implementation of UFFDIO_COPY to create private memory for MAP_PRIVATE mappings. The hugetlbfs-backed implementation was already correct. This change is visible to userland, if userfaultfd has been used in unintended ways: so it introduces a small risk of incompatibility, but is necessary in order to respect file permissions. An app that uses UFFDIO_COPY for anything like postcopy live migration won't notice the difference, and in fact it'll run faster because there will be no copy-on-write and memory waste in the tmpfs pagecache anymore. Userfaults on MAP_PRIVATE shmem keep triggering only on file holes like before. The real zeropage can also be built on a MAP_PRIVATE shmem mapping through UFFDIO_ZEROPAGE and that's safe because the zeropage pte is never dirty, in turn even an mprotect upgrading the vma permission from PROT_READ to PROT_READ|PROT_WRITE won't make the zeropage pte writable. Link: http://lkml.kernel.org/r/[email protected] Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Signed-off-by: Andrea Arcangeli <[email protected]> Reported-by: Mike Rapoport <[email protected]> Reviewed-by: Hugh Dickins <[email protected]> Cc: <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Jann Horn <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Peter Xu <[email protected]> Cc: [email protected] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-11-30userfaultfd: use ENOENT instead of EFAULT if the atomic copy user failsAndrea Arcangeli
Patch series "userfaultfd shmem updates". Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the lack of the VM_MAYWRITE check and the lack of i_size checks. Then looking into the above we also fixed the MAP_PRIVATE case. Hugh by source review also found a data loss source if UFFDIO_COPY is used on shmem MAP_SHARED PROT_READ mappings (the production usages incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't happen in those production usages like with QEMU). The whole patchset is marked for stable. We verified QEMU postcopy live migration with guest running on shmem MAP_PRIVATE run as well as before after the fix of shmem MAP_PRIVATE. Regardless if it's shmem or hugetlbfs or MAP_PRIVATE or MAP_SHARED, QEMU unconditionally invokes a punch hole if the guest mapping is filebacked and a MADV_DONTNEED too (needed to get rid of the MAP_PRIVATE COWs and for the anon backend). This patch (of 5): We internally used EFAULT to communicate with the caller, switch to ENOENT, so EFAULT can be used as a non internal retval. Link: http://lkml.kernel.org/r/[email protected] Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Signed-off-by: Andrea Arcangeli <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Reviewed-by: Hugh Dickins <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Jann Horn <[email protected]> Cc: Peter Xu <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: <[email protected]> Cc: [email protected] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07userfaultfd: prevent non-cooperative events vs mcopy_atomic racesMike Rapoport
If a process monitored with userfaultfd changes it's memory mappings or forks() at the same time as uffd monitor fills the process memory with UFFDIO_COPY, the actual creation of page table entries and copying of the data in mcopy_atomic may happen either before of after the memory mapping modifications and there is no way for the uffd monitor to maintain consistent view of the process memory layout. For instance, let's consider fork() running in parallel with userfaultfd_copy(): process | uffd monitor ---------------------------------+------------------------------ fork() | userfaultfd_copy() ... | ... dup_mmap() | down_read(mmap_sem) down_write(mmap_sem) | /* create PTEs, copy data */ dup_uffd() | up_read(mmap_sem) copy_page_range() | up_write(mmap_sem) | dup_uffd_complete() | /* notify monitor */ | If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will be present by the time copy_page_range() is called and they will appear in the child's memory mappings. However, if the fork() is the first to take the mmap_sem, the new pages won't be mapped in the child's address space. If the pages are not present and child tries to access them, the monitor will get page fault notification and everything is fine. However, if the pages *are present*, the child can access them without uffd noticing. And if we copy them into child it'll see the wrong data. Since we are talking about background copy, we'd need to decide whether the pages should be copied or not regardless #PF notifications. Since userfaultfd monitor has no way to determine what was the order, let's disallow userfaultfd_copy in parallel with the non-cooperative events. In such case we return -EAGAIN and the uffd monitor can understand that userfaultfd_copy() clashed with a non-cooperative event and take an appropriate action. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: Pavel Emelyanov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Andrei Vagin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-02-06mm/userfaultfd.c: remove duplicate includePravin Shedge
These duplicate includes have been found with scripts/checkincludes.pl but they have been removed manually to avoid removing false positives. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pravin Shedge <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06userfaultfd: shmem: wire up shmem_mfill_zeropage_pteMike Rapoport
For shmem VMAs we can use shmem_mfill_zeropage_pte for UFFDIO_ZEROPAGE Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06userfaultfd: mcopy_atomic: introduce mfill_atomic_pte helperMike Rapoport
Shuffle the code a bit to improve readability. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-03-09mm: convert generic code to 5-level pagingKirill A. Shutemov
Convert all non-architecture-specific code to 5-level paging. It's mostly mechanical adding handling one more page table level in places where we deal with pud_t. Signed-off-by: Kirill A. Shutemov <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-03-02sched/headers: Prepare to move signal wakeup & sigpending methods from ↵Ingo Molnar
<linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-02-24userfaultfd: mcopy_atomic: return -ENOENT when no compatible VMA foundMike Rapoport
The memory mapping of a process may change between #PF event and the call to mcopy_atomic that comes to resolve the page fault. In such case, there will be no VMA covering the range passed to mcopy_atomic or the VMA will not have userfaultfd context. To allow uffd monitor to distinguish those case from other errors, let's return -ENOENT instead of -EINVAL. Note, that despite availability of UFFD_EVENT_UNMAP there still might be race between the processing of UFFD_EVENT_UNMAP and outstanding mcopy_atomic in case of non-cooperative uffd usage. [[email protected]: update cases returning -ENOENT] Link: http://lkml.kernel.org/r/20170207150249.GA6709@rapoport-lnx [[email protected]: merge fix] [[email protected]: fix the merge fix] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappingsMike Kravetz
When userfaultfd hugetlbfs support was originally added, it followed the pattern of anon mappings and did not support any vmas marked VM_SHARED. As such, support was only added for private mappings. Remove this limitation and support shared mappings. The primary functional change required is adding pages to the page cache. More subtle changes are required for huge page reservation handling in error paths. A lengthy comment in the code describes the reservation handling. [[email protected]: update] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Andrea Arcangeli <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22userfaultfd: shmem: use shmem_mcopy_atomic_pte for shared memoryMike Rapoport
The shmem_mcopy_atomic_pte implements low lever part of UFFDIO_COPY operation for shared memory VMAs. It's based on mcopy_atomic_pte with adjustments necessary for shared memory pages. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22userfaultfd: hugetlbfs: reserve count on error in __mcopy_atomic_hugetlbMike Kravetz
If __mcopy_atomic_hugetlb exits with an error, put_page will be called if a huge page was allocated and needs to be freed. If a reservation was associated with the huge page, the PagePrivate flag will be set. Clear PagePrivate before calling put_page/free_huge_page so that the global reservation count is not incremented. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22userfaultfd: hugetlbfs: fix __mcopy_atomic_hugetlb retry/error processingMike Kravetz
The new routine copy_huge_page_from_user() uses kmap_atomic() to map PAGE_SIZE pages. However, this prevents page faults in the subsequent call to copy_from_user(). This is OK in the case where the routine is copied with mmap_sema held. However, in another case we want to allow page faults. So, add a new argument allow_pagefault to indicate if the routine should allow page faults. [[email protected]: unmap the correct pointer] Link: http://lkml.kernel.org/r/20170113082608.GA3548@mwanda [[email protected]: kunmap() takes a page*, per Hugh] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Dan Carpenter <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPYMike Kravetz
__mcopy_atomic_hugetlb performs the UFFDIO_COPY operation for huge pages. It is based on the existing __mcopy_atomic routine for normal pages. Unlike normal pages, there is no huge page support for the UFFDIO_ZEROPAGE operation. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22userfaultfd: use vma_is_anonymousAndrea Arcangeli
Cleanup the vma->vm_ops usage. Side note: it would be more robust if vma_is_anonymous() would also check that vm_flags hasn't VM_PFNMAP set. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-17mm: cleanup *pte_alloc* interfacesKirill A. Shutemov
There are few things about *pte_alloc*() helpers worth cleaning up: - 'vma' argument is unused, let's drop it; - most __pte_alloc() callers do speculative check for pmd_none(), before taking ptl: let's introduce pte_alloc() macro which does the check. The only direct user of __pte_alloc left is userfaultfd, which has different expectation about atomicity wrt pmd. - pte_alloc_map() and pte_alloc_map_lock() are redefined using pte_alloc(). [[email protected]: fix build for arm64 hugetlbpage] [[email protected]: fix arch/arm/mm/mmu.c some more] Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Sudeep Holla <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-15memcg: adjust to support new THP refcountingKirill A. Shutemov
As with rmap, with new refcounting we cannot rely on PageTransHuge() to check if we need to charge size of huge page form the cgroup. We need to get information from caller to know whether it was mapped with PMD or PTE. We do uncharge when last reference on the page gone. At that point if we see PageTransHuge() it means we need to unchange whole huge page. The tricky part is partial unmap -- when we try to unmap part of huge page. We don't do a special handing of this situation, meaning we don't uncharge the part of huge page unless last user is gone or split_huge_page() is triggered. In case of cgroup memory pressure happens the partial unmapped page will be split through shrinker. This should be good enough. Signed-off-by: Kirill A. Shutemov <[email protected]> Tested-by: Sasha Levin <[email protected]> Tested-by: Aneesh Kumar K.V <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Jerome Marchand <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Steve Capper <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>