mirror of
https://github.com/torvalds/linux.git
synced 2025-12-01 15:35:58 +07:00
master
2297 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
5bebe8de19 |
mm/huge_memory: Fix initialization of huge zero folio
The recent fix to properly initialize the tags of the huge zero folio
had an unfortunate not-so-subtle side effect: it caused the actual
*contents* of the huge zero folio to not be initialized at all when the
hardware didn't support the memory tagging.
The reason was the unfortunate semantics of tag_clear_highpage(): on
hardware that didn't do the tagging, it would silently just not do
anything at all. And since this is done only on arm64 with MTE support,
that basically meant most hardware.
It wasn't necessarily immediately obvious since the huge zero page isn't
necessarily very heavily used - or because it might already be zero
because all-zeroes is the most common pattern. But it ends up causing
random odd user space failures when you do hit it.
The unfortunate semantics have been around for a while, but became a
real bug only when we started actively using __GFP_ZEROTAGS in the
generic get_huge_zero_folio() function - before that, it had only ever
been used in code that checked that the hardware supported it.
Fix this by simply changing the semantics of tag_clear_highpage() to
return whether it actually successfully did something or not. While at
it, also make it initialize multiple pages in one go, since that's
actually what the only caller wants it to do and it simplifies the whole
logic.
Fixes:
|
||
|
|
8804d970fa |
Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "mm, swap: improve cluster scan strategy" from Kairui Song improves
performance and reduces the failure rate of swap cluster allocation
- "support large align and nid in Rust allocators" from Vitaly Wool
permits Rust allocators to set NUMA node and large alignment when
perforning slub and vmalloc reallocs
- "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
DAMOS_STAT's handling of the DAMON operations sets for virtual
address spaces for ops-level DAMOS filters
- "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps
- "mm/mincore: minor clean up for swap cache checking" from Kairui Song
performs some cleanup in the swap code
- "mm: vm_normal_page*() improvements" from David Hildenbrand provides
code cleanup in the pagemap code
- "add persistent huge zero folio support" from Pankaj Raghav provides
a block layer speedup by optionalls making the
huge_zero_pagepersistent, instead of releasing it when its refcount
falls to zero
- "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
the recently added Kexec Handover feature
- "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
Stoakes turns mm_struct.flags into a bitmap. To end the constant
struggle with space shortage on 32-bit conflicting with 64-bit's
needs
- "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
code
- "selftests/mm: Fix false positives and skip unsupported tests" from
Donet Tom fixes a few things in our selftests code
- "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
from David Hildenbrand "allows individual processes to opt-out of
THP=always into THP=madvise, without affecting other workloads on the
system".
It's a long story - the [1/N] changelog spells out the considerations
- "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc
- "Tiny optimization for large read operations" from Chi Zhiling
improves the efficiency of the pagecache read path
- "Better split_huge_page_test result check" from Zi Yan improves our
folio splitting selftest code
- "test that rmap behaves as expected" from Wei Yang adds some rmap
selftests
- "remove write_cache_pages()" from Christoph Hellwig removes that
function and converts its two remaining callers
- "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
selftests issues
- "introduce kernel file mapped folios" from Boris Burkov introduces
the concept of "kernel file pages". Using these permits btrfs to
account its metadata pages to the root cgroup, rather than to the
cgroups of random inappropriate tasks
- "mm/pageblock: improve readability of some pageblock handling" from
Wei Yang provides some readability improvements to the page allocator
code
- "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
to understand arm32 highmem
- "tools: testing: Use existing atomic.h for vma/maple tests" from
Brendan Jackman performs some code cleanups and deduplication under
tools/testing/
- "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
a couple of 32-bit issues in tools/testing/radix-tree.c
- "kasan: unify kasan_enabled() and remove arch-specific
implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
initialization code into a common arch-neutral implementation
- "mm: remove zpool" from Johannes Weiner removes zspool - an
indirection layer which now only redirects to a single thing
(zsmalloc)
- "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
couple of cleanups in the fork code
- "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
adjustments at various nth_page() callsites, eventually permitting
the removal of that undesirable helper function
- "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
creates a KASAN read-only mode for ARM, using that architecture's
memory tagging feature. It is felt that a read-only mode KASAN is
suitable for use in production systems rather than debug-only
- "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
some tidying in the hugetlb folio allocation code
- "mm: establish const-correctness for pointer parameters" from Max
Kellermann makes quite a number of the MM API functions more accurate
about the constness of their arguments. This was getting in the way
of subsystems (in this case CEPH) when they attempt to improving
their own const/non-const accuracy
- "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
code sites which were confused over when to use free_pages() vs
__free_pages()
- "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
mapletree code accessible to Rust. Required by nouveau and by its
forthcoming successor: the new Rust Nova driver
- "selftests/mm: split_huge_page_test: split_pte_mapped_thp
improvements" from David Hildenbrand adds a fix and some cleanups to
the thp selftesting code
- "mm, swap: introduce swap table as swap cache (phase I)" from Chris
Li and Kairui Song is the first step along the path to implementing
"swap tables" - a new approach to swap allocation and state tracking
which is expected to yield speed and space improvements. This
patchset itself yields a 5-20% performance benefit in some situations
- "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
layer to clean up the ptdesc code a little
- "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
issues in our 5-level pagetable selftesting code
- "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
addresses a couple of minor issues in relatively new memory
allocation profiling feature
- "Small cleanups" from Matthew Wilcox has a few cleanups in
preparation for more memdesc work
- "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
Quanmin Yan makes some changes to DAMON in furtherance of supporting
arm highmem
- "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
Anjum adds that compiler check to selftests code and fixes the
fallout, by removing dead code
- "Improvements to Victim Process Thawing and OOM Reaper Traversal
Order" from zhongjinji makes a number of improvements in the OOM
killer: mainly thawing a more appropriate group of victim threads so
they can release resources
- "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
is a bunch of small and unrelated fixups for DAMON
- "mm/damon: define and use DAMON initialization check function" from
SeongJae Park implement reliability and maintainability improvements
to a recently-added bug fix
- "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
SeongJae Park provides additional transparency to userspace clients
of the DAMON_STAT information
- "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
some constraints on khubepaged's collapsing of anon VMAs. It also
increases the success rate of MADV_COLLAPSE against an anon vma
- "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
from Lorenzo Stoakes moves us further towards removal of
file_operations.mmap(). This patchset concentrates upon clearing up
the treatment of stacked filesystems
- "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
provides some fixes and improvements to mlock's tracking of large
folios. /proc/meminfo's "Mlocked" field became more accurate
- "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
Donet Tom fixes several user-visible KSM stats inaccuracies across
forks and adds selftest code to verify these counters
- "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
some potential but presently benign issues in KSM's mm_slot handling
* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
mm: swap: check for stable address space before operating on the VMA
mm: convert folio_page() back to a macro
mm/khugepaged: use start_addr/addr for improved readability
hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
alloc_tag: fix boot failure due to NULL pointer dereference
mm: silence data-race in update_hiwater_rss
mm/memory-failure: don't select MEMORY_ISOLATION
mm/khugepaged: remove definition of struct khugepaged_mm_slot
mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
hugetlb: increase number of reserving hugepages via cmdline
selftests/mm: add fork inheritance test for ksm_merging_pages counter
mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
drivers/base/node: fix double free in register_one_node()
mm: remove PMD alignment constraint in execmem_vmalloc()
mm/memory_hotplug: fix typo 'esecially' -> 'especially'
mm/rmap: improve mlock tracking for large folios
mm/filemap: map entire large folio faultaround
mm/fault: try to map the entire file folio in finish_fault()
mm/rmap: mlock large folios in try_to_unmap_one()
mm/rmap: fix a mlock race condition in folio_referenced_one()
...
|
||
|
|
d7242af864 |
mm: Introduce alloc_frozen_pages_nolock()
Split alloc_pages_nolock() and introduce alloc_frozen_pages_nolock() to be used by alloc_slab_page(). Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
99253de51f |
mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock().
Change alloc_pages_nolock() to default to __GFP_COMP when allocating pages, since upcoming reentrant alloc_slab_page() needs __GFP_COMP. Also allow __GFP_ACCOUNT flag to be specified, since most of BPF infra needs __GFP_ACCOUNT except BPF streams. Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
19c5fb83f2 |
mm: page_alloc: avoid kswapd thrashing due to NUMA restrictions
On NUMA systems without bindings, allocations check all nodes for free space, then wake up the kswapds on all nodes and retry. This ensures all available space is evenly used before reclaim begins. However, when one process or certain allocations have node restrictions, they can cause kswapds on only a subset of nodes to be woken up. Since kswapd hysteresis targets watermarks that are *higher* than needed for allocation, even *unrestricted* allocations can now get suckered onto such nodes that are already pressured. This ends up concentrating all allocations on them, even when there are idle nodes available for the unrestricted requests. This was observed with two numa nodes, where node0 is normal and node1 is ZONE_MOVABLE to facilitate hotplugging: a kernel allocation wakes kswapd on node0 only (since node1 is not eligible); once kswapd0 is active, the watermarks hover between low and high, and then even the movable allocations end up on node0, only to be kicked out again; meanwhile node1 is empty and idle. Similar behavior is possible when a process with NUMA bindings is causing selective kswapd wakeups. To fix this, on NUMA systems augment the (misleading) watermark test with a check for whether kswapd is already active during the first iteration through the zonelist. If this fails to place the request, kswapd must be running everywhere already, and the watermark test is good enough to decide placement. With this patch, unrestricted requests successfully make use of node1, even while kswapd is reclaiming node0 for restricted allocations. [gourry@gourry.net: don't retry if no kswapds were active] Link: https://lkml.kernel.org/r/20250919162134.1098208-1-hannes@cmpxchg.org Signed-off-by: Gregory Price <gourry@gourry.net> Tested-by: Joshua Hahn <joshua.hahnjy@gmail.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
9d003dec97 |
mm: remove page->order
We already use page->private for storing the order of a page while it's in the buddy allocator system; extend that to also storing the order while it's in the pcp_llist. Link: https://lkml.kernel.org/r/20250910142923.2465470-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
7ffc923e25 |
alloc_tag: avoid warnings when freeing non-compound "tail" pages
When freeing "tail" pages of a non-compount high-order page, we properly subtract the allocation tag counters, however later when these pages are released, alloc_tag_sub() will issue warnings because tags for these pages are NULL. This issue was originally anticipated by Vlastimil in his review [1] and then recently reported by David. Prevent warnings by marking the tags empty. Link: https://lkml.kernel.org/r/20250915212756.3998938-4-surenb@google.com Link: https://lore.kernel.org/all/6db0f0c8-81cb-4d04-9560-ba73d63db4b8@suse.cz/ [1] Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: David Wang <00107082@163.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Usama Arif <usamaarif642@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e7a5f249e6 |
mm: re-enable kswapd when memory pressure subsides or demotion is toggled
If kswapd fails to reclaim pages from a node MAX_RECLAIM_RETRIES in a
row, kswapd on that node gets disabled. That is, the system won't wakeup
kswapd for that node until page reclamation is observed at least once.
That reclamation is mostly done by direct reclaim, which in turn enables
kswapd back.
However, on systems with CXL memory nodes, workloads with high anon page
usage can disable kswapd indefinitely, without triggering direct
reclaim. This can be reproduced with following steps:
numa node 0 (32GB memory, 48 CPUs)
numa node 2~5 (512GB CXL memory, 128GB each)
(numa node 1 is disabled)
swap space 8GB
1) Set /sys/kernel/mm/demotion_enabled to 0.
2) Set /proc/sys/kernel/numa_balancing to 0.
3) Run a process that allocates and random accesses 500GB of anon
pages.
4) Let the process exit normally.
During 3), free memory on node 0 gets lower than low watermark, and
kswapd runs and depletes swap space. Then, kswapd fails consecutively
and gets disabled. Allocation afterwards happens on CXL memory, so node
0 never gains more memory pressure to trigger direct reclaim.
After 4), kswapd on node 0 remains disabled, and tasks running on that
node are unable to swap. If you turn on NUMA_BALANCING_MEMORY_TIERING
and demotion now, it won't work properly since kswapd is disabled.
To mitigate this problem, reset kswapd_failures to 0 on following
conditions:
a) ZONE_BELOW_HIGH bit of a zone in hopeless node with a fallback
memory node gets cleared.
b) demotion_enabled is changed from false to true.
Rationale for a):
ZONE_BELOW_HIGH bit being cleared might be a sign that the node may
be reclaimable afterwards. This won't help much if the memory-hungry
process keeps running without freeing anything, but at least the node
will go back to reclaimable state when the process exits.
Rationale for b):
When demotion_enabled is false, kswapd can only reclaim anon pages by
swapping them out to swap space. If demotion_enabled is turned on,
kswapd can demote anon pages to another node for reclaiming. So, the
original failure count for determining reclaimability is no longer
valid.
Since kswapd_failures resets may be missed by ++ operation, it is
changed from int to atomic_t.
[akpm@linux-foundation.org: tweak whitespace]
Link: https://lkml.kernel.org/r/aL6qGi69jWXfPc4D@pcw-MS-7D22
Signed-off-by: Chanwon Park <flyinrm@gmail.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
4805ef3707 |
mm/page_alloc: check the correct buddy if it is a starting block
find_large_buddy() search buddy based on start_pfn, which maybe different
from page's pfn, e.g. when page is not pageblock aligned, because
prep_move_freepages_block() always align start_pfn to pageblock.
This means when we found a starting block at start_pfn, it may check on
the wrong page theoretically. And not split the free page as it is
supposed to, causing a freelist migratetype mismatch.
The good news is the page passed to __move_freepages_block_isolate() has
only two possible cases:
* page is pageblock aligned
* page is __first_valid_page() of this block
So it is safe for the first case, and it won't get a buddy larger than
pageblock for the second case.
To fix the issue, check the returned pfn of find_large_buddy() to decide
whether to split the free page:
1. if it is not a PageBuddy pfn, no split;
2. if it is a PageBuddy pfn but order <= pageblock_order, no split;
3. if it is a PageBuddy pfn with order > pageblock_order, start_pfn is
either in the starting block or tail block, split the PageBuddy at
pageblock_order level.
Link: https://lkml.kernel.org/r/20250905140358.28849-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
162f6c69ea |
mm/page_alloc: add kernel-docs for free_pages()
Patch series "Cleanup free_pages() misuse", v3. free_pages() is supposed to be called when we only have a virtual address. __free_pages() is supposed to be called when we have a page. There are a number of callers that use page_address() to get a page's virtual address then call free_pages() on it when they should just call __free_pages() directly. Add kernel-docs for free_pages() to help callers better understand which function they should be calling, and replace the obvious cases of misuse. This patch (of 7): Add kernel-docs to free_pages(). This will help callers understand when to use it instead of __free_pages(). Link: https://lkml.kernel.org/r/20250903185921.1785167-1-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20250903185921.1785167-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: SeongJae Park <sj@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Justin Sanders <justin@coraid.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
0bf2edf041 |
mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof()
Let's reject them early, which in turn makes folio_alloc_gigantic() reject them properly. To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER and calculate MAX_FOLIO_NR_PAGES based on that. While at it, let's just make the order a "const unsigned order". Link: https://lkml.kernel.org/r/20250901150359.867252-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
204dfefe03 |
mm/page_alloc: find_large_buddy() from start_pfn aligned order
We iterate pfn from order 0 to MAX_PAGE_ORDER aligned to find large buddy. While if the order is less than start_pfn aligned order, we would get the same pfn and do the same check again. Iterate from start_pfn aligned order to reduce duplicated work. [richard.weiyang@gmail.com: add comment on assignment of order] Link: https://lkml.kernel.org/r/20250828091618.7869-1-richard.weiyang@gmail.com Link: https://lkml.kernel.org/r/20250902025807.11467-1-richard.weiyang@gmail.com Link: https://lkml.kernel.org/r/20250828091618.7869-1-richard.weiyang@gmail.com Link: https://lkml.kernel.org/r/20250902025807.11467-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
98c94f1035 |
mm/pageblock-flags: remove PB_migratetype_bits/PB_migrate_end
enum pageblock_bits defines the meaning of pageblock bits. Currently PB_migratetype_bits says the lowest 3 bits represents migratetype and PB_migrate_end/MIGRATETYPE_MASK's definition rely on it with magical computation. Remove the definition of PB_migratetype_bits/PB_migrate_end. Use PB_migrate_[0|1|2] to represent lowest bits for migratetype. Then we can simplify related definition. Also, MIGRATETYPE_AND_ISO_MASK is MIGRATETYPE_MASK add isolation bit. Use MIGRATETYPE_MASK in the definition of MIGRATETYPE_AND_ISO_MASK looks cleaner. No functional change intended. Link: https://lkml.kernel.org/r/20250827070105.16864-3-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
dd3b304b94 |
mm/page_alloc: use xxx_pageblock_isolate() for better reading
Patch series "mm/pageblock: improve readability of some pageblock
handling", v3.
During code reading, found two possible points to improve the readability
of pageblock handling.
Patch 1: isolate bit is standalone and there are dedicated helpers.
Instead of check the bit directly, we could use the helper to do it.
Patch 2: remove PB_migratetype_bits and PB_migrate_end to reduce magical
computation.
This patch (of 2):
Since commit
|
||
|
|
6c3826173e |
mm/page_alloc: harmonize should_compact_retry() type
Currently order is signed in one version of the function and unsigned in the other. Tidy that up. In page_alloc.c, order is unsigned in the vast majority of cases. But, there is a cluster of exceptions in compaction-related code (probably stemming from the fact that compact_control.order is signed). So, prefer local consistency and make this one signed too. Link: https://lkml.kernel.org/r/20250826-cleanup-should_compact_retry-v1-1-d2ca89727fcf@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
ce32123b9b |
mm: remove is_migrate_highatomic()
There are 3 potential reasons for is_migrate_*() helpers:
1. They represent higher-level attributes of migratetypes, like
is_migrate_movable()
2. They are ifdef'd, like is_migrate_isolate().
3. For consistency with an is_migrate_*_page() helper, also like
is_migrate_isolate().
It looks like is_migrate_highatomic() was for case 3, but that was
removed in commit
|
||
|
|
53fbef56e0 |
mm: introduce memdesc_flags_t
Patch series "Add and use memdesc_flags_t". At some point struct page will be separated from struct slab and struct folio. This is a step towards that by introducing a type for the 'flags' word of all three structures. This gives us a certain amount of type safety by establishing that some of these unsigned longs are different from other unsigned longs in that they contain things like node ID, section number and zone number in the upper bits. That lets us have functions that can be easily called by anyone who has a slab, folio or page (but not easily by anyone else) to get the node or zone. There's going to be some unusual merge problems with this as some odd bits of the kernel decide they want to print out the flags value or something similar by writing page->flags and now they'll need to write page->flags.f instead. That's most of the churn here. Maybe we should be removing these things from the debug output? This patch (of 11): Wrap the unsigned long flags in a typedef. In upcoming patches, this will provide a strong hint that you can't just pass a random unsigned long to functions which take this as an argument. [willy@infradead.org: s/flags/flags.f/ in several architectures] Link: https://lkml.kernel.org/r/aKMgPRLD-WnkPxYm@casper.infradead.org [nicola.vetrini@gmail.com: mips: fix compilation error] Link: https://lore.kernel.org/lkml/CA+G9fYvkpmqGr6wjBNHY=dRp71PLCoi2341JxOudi60yqaeUdg@mail.gmail.com/ Link: https://lkml.kernel.org/r/20250825214245.1838158-1-nicola.vetrini@gmail.com Link: https://lkml.kernel.org/r/20250805172307.1302730-1-willy@infradead.org Link: https://lkml.kernel.org/r/20250805172307.1302730-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
5922deb3ec |
mm/page_alloc: remove redundant pcp->free_count initialization in per_cpu_pages_init()
In per_cpu_pages_init(), pcp->free_count is explicitly initialized to 0, but this is redundant because the entire struct is already zeroed by memset(pcp, 0, sizeof(*pcp)). Link: https://lkml.kernel.org/r/20250814071828.12036-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Reviewed-by: Brendan Jackman <jackmanb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
0c04015d45 |
mm/page_alloc: simplify lowmem_reserve max calculation
Use max() to find the maximum lowmem_reserve value and min_t() to cap it to managed_pages in calculate_totalreserve_pages(), instead of open-coding the comparisons. No functional change. [liuye@kylinos.cn: fix layout, use min_t] Link: https://lkml.kernel.org/r/20250815024509.37900-1-ye.liu@linux.dev Link: https://lkml.kernel.org/r/20250814090053.22241-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
6a204d4b14 |
mm/page_alloc: only set ALLOC_HIGHATOMIC for __GPF_HIGH allocations
Commit |
||
|
|
48e6561b66 |
mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
The trace event has not recorded the right data since it was introduced at
commit
|
||
|
|
5799c0ed0a |
mm/page-alloc: remove PageMappingFlags()
As PageMappingFlags() now only indicates anon (incl. KSM) folios, we can now simply check for PageAnon() and remove PageMappingFlags(). ... and while at it, use the folio instead and operate on folio->mapping. Link: https://lkml.kernel.org/r/20250704102524.326966-24-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
d4fb4587bd |
mm: rename __PageMovable() to page_has_movable_ops()
Let's make it clearer that we are talking about movable_ops pages. While at it, convert a VM_BUG_ON to a VM_WARN_ON_ONCE_PAGE. Link: https://lkml.kernel.org/r/20250704102524.326966-17-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
2dfcd1608f |
mm/page_alloc: let page freeing clear any set page type
Currently, any user of page types must clear that type before freeing a page back to the buddy, otherwise we'll run into mapcount related sanity checks (because the page type currently overlays the page mapcount). Let's allow for not clearing the page type by page type users by letting the buddy handle it instead. We'll focus on having a page type set on the first page of a larger allocation only. With this change, we can reliably identify typed folios even though they might be in the process of getting freed, which will come in handy in migration code (at least in the transition phase). In the future we might want to warn on some page types. Instead of having an "allow list", let's rather wait until we know about once that should go on such a "disallow list". Link: https://lkml.kernel.org/r/20250704102524.326966-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
d1554fb630 |
mm/page_isolation: remove migratetype parameter from more functions
migratetype is no longer overwritten during pageblock isolation, start_isolate_page_range(), has_unmovable_pages(), and set_migratetype_isolate() no longer need which migratetype to restore during isolation failure. For has_unmoable_pages(), it needs to know if the isolation is for CMA allocation, so adding PB_ISOLATE_MODE_CMA_ALLOC provide the information. At the same time change isolation flags to enum pb_isolate_mode (PB_ISOLATE_MODE_MEM_OFFLINE, PB_ISOLATE_MODE_CMA_ALLOC, PB_ISOLATE_MODE_OTHER). Remove REPORT_FAILURE and check PB_ISOLATE_MODE_MEM_OFFLINE, since only PB_ISOLATE_MODE_MEM_OFFLINE reports isolation failures. alloc_contig_range() no longer needs migratetype. Replace it with a newly defined acr_flags_t to tell if an allocation is for CMA. So does __alloc_contig_migrate_range(). Add ACR_FLAGS_NONE (set to 0) to indicate ordinary allocations. Link: https://lkml.kernel.org/r/20250617021115.2331563-7-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
7a3324eb66 |
mm/page_isolation: remove migratetype from undo_isolate_page_range()
Since migratetype is no longer overwritten during pageblock isolation, undoing pageblock isolation no longer needs which migratetype to restore. Link: https://lkml.kernel.org/r/20250617021115.2331563-6-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b1df9c5713 |
mm/page_isolation: remove migratetype from move_freepages_block_isolate()
Since migratetype is no longer overwritten during pageblock isolation, moving a pageblock out of MIGRATE_ISOLATE no longer needs a new migratetype. Add pageblock_isolate_and_move_free_pages() and pageblock_unisolate_and_move_free_pages() to be explicit about the page isolation operations. Both share the common code in __move_freepages_block_isolate(), which is renamed from move_freepages_block_isolate(). Add toggle_pageblock_isolate() to flip pageblock isolation bit in __move_freepages_block_isolate(). Make set_pageblock_migratetype() only accept non MIGRATE_ISOLATE types, so that one should use set_pageblock_isolate() to isolate pageblocks. As a result, move pageblock migratetype code out of __move_freepages_block(). Link: https://lkml.kernel.org/r/20250617021115.2331563-5-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
1bc3587a88 |
mm/page_alloc: add support for initializing pageblock as isolated
MIGRATE_ISOLATE is a standalone bit, so a pageblock cannot be initialized to just MIGRATE_ISOLATE. Add init_pageblock_migratetype() to enable initialize a pageblock with a migratetype and isolated. Link: https://lkml.kernel.org/r/20250617021115.2331563-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e904bce2d9 |
mm/page_isolation: make page isolation a standalone bit
During page isolation, the original migratetype is overwritten, since MIGRATE_* are enums and stored in pageblock bitmaps. Change MIGRATE_ISOLATE to be stored a standalone bit, PB_migrate_isolate, like PB_compact_skip, so that migratetype is not lost during pageblock isolation. Link: https://lkml.kernel.org/r/20250617021115.2331563-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
42f46ed99a |
mm/page_alloc: pageblock flags functions clean up
Patch series "Make MIGRATE_ISOLATE a standalone bit", v10.
This patchset moves MIGRATE_ISOLATE to a standalone bit to avoid being
overwritten during pageblock isolation process. Currently,
MIGRATE_ISOLATE is part of enum migratetype (in include/linux/mmzone.h),
thus, setting a pageblock to MIGRATE_ISOLATE overwrites its original
migratetype. This causes pageblock migratetype loss during
alloc_contig_range() and memory offline, especially when the process fails
due to a failed pageblock isolation and the code tries to undo the
finished pageblock isolations.
In terms of performance for changing pageblock types, no performance
change is observed:
1. I used perf to collect stats of offlining and onlining all memory
of a 40GB VM 10 times and see that get_pfnblock_flags_mask() and
set_pfnblock_flags_mask() take about 0.12% and 0.02% of the whole
process respectively with and without this patchset across 3 runs.
2. I used perf to collect stats of dd from /dev/random to a 40GB tmpfs
file and find get_pfnblock_flags_mask() takes about 0.05% of the
process with and without this patchset across 3 runs.
This patch (of 6):
No functional change is intended.
1. Add __NR_PAGEBLOCK_BITS for the number of pageblock flag bits and use
roundup_pow_of_two(__NR_PAGEBLOCK_BITS) as NR_PAGEBLOCK_BITS to take
right amount of bits for pageblock flags.
2. Rename PB_migrate_skip to PB_compact_skip.
3. Add {get,set,clear}_pfnblock_bit() to operate one a standalone bit,
like PB_compact_skip.
3. Make {get,set}_pfnblock_flags_mask() internal functions and use
{get,set}_pfnblock_migratetype() for pageblock migratetype operations.
4. Move pageblock flags common code to get_pfnblock_bitmap_bitidx().
3. Use MIGRATETYPE_MASK to get the migratetype of a pageblock from its
flags.
4. Use PB_migrate_end in the definition of MIGRATETYPE_MASK instead of
PB_migrate_bits.
5. Add a comment on is_migrate_cma_folio() to prevent one from changing it
to use get_pageblock_migratetype() and causing issues.
Link: https://lkml.kernel.org/r/20250617021115.2331563-1-ziy@nvidia.com
Link: https://lkml.kernel.org/r/20250617021115.2331563-2-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Richard Chang <richardycc@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
3091b61505 |
mm: restore documentation for __free_pages()
The documentation was converted to be for ___free_pages(), which doesn't
need documentation as it's static.
Link: https://lkml.kernel.org/r/20250604190327.814086-1-willy@infradead.org
Fixes:
|
||
|
|
00c010e130 |
Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton: - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architecture must implement to provide this. - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work. - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size. - "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invokation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic. - "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleaups and a small efficiency improvement to this part of our swap handling code. - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscalls arguments. At this time we can alter only "system call information that are used by strace system call tampering, namely, syscall number, syscall arguments, and syscall return value. This series should have been incorporated into mm.git's "non-MM" branch, but I goofed. - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions. - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error. - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. Remove a bunch of hand-rolled implementation in favor of using more current facilities. - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM. - "Always call constructor for kernel page tables" from Kevin Brodsky ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecesary work, but this is fixed up where it is anticipated to occur. - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures. - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years. - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation. - "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramaticelly reduced. - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code. - ""Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". Fixes things on error paths which we are unlikely to hit. - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering. - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection. - "vmscan: enforce mems_effective during demotion" from Gregory Price changes reclaim to respect cpuset.mems_effective during demotion when possible. because presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently. - "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in in the huge page splitting and migrating code. - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization. - "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios. - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved. - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range. By skipping ranges of invalid pfns. - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned a single NUMA mode. Dramatic performance benefits were seen in some real world cases. - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS. - "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c. - "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function. - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that. - "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest. - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging. - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement. - "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents. - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement. - "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code. * tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits) mm: pcp: increase pcp->free_count threshold to trigger free_high mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range() mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page mm/hugetlb: pass folio instead of page to unmap_ref_private() memcg: objcg stock trylock without irq disabling memcg: no stock lock for cpu hot-unplug memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs memcg: make count_memcg_events re-entrant safe against irqs memcg: make mod_memcg_state re-entrant safe against irqs memcg: move preempt disable to callers of memcg_rstat_updated memcg: memcg_rstat_updated re-entrant safe against irqs mm: khugepaged: decouple SHMEM and file folios' collapse selftests/eventfd: correct test name and improve messages alloc_tag: check mem_profiling_support in alloc_tag_init Docs/damon: update titles and brief introductions to explain DAMOS selftests/damon/_damon_sysfs: read tried regions directories in order mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject() mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat() mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs ... |
||
|
|
1b98f357da |
Merge tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Implement the Device Memory TCP transmit path, allowing zero-copy
data transmission on top of TCP from e.g. GPU memory to the wire.
- Move all the IPv6 routing tables management outside the RTNL scope,
under its own lock and RCU. The route control path is now 3x times
faster.
- Convert queue related netlink ops to instance lock, reducing again
the scope of the RTNL lock. This improves the control plane
scalability.
- Refactor the software crc32c implementation, removing unneeded
abstraction layers and improving significantly the related
micro-benchmarks.
- Optimize the GRO engine for UDP-tunneled traffic, for a 10%
performance improvement in related stream tests.
- Cover more per-CPU storage with local nested BH locking; this is a
prep work to remove the current per-CPU lock in local_bh_disable()
on PREMPT_RT.
- Introduce and use nlmsg_payload helper, combining buffer bounds
verification with accessing payload carried by netlink messages.
Netfilter:
- Rewrite the procfs conntrack table implementation, improving
considerably the dump performance. A lot of user-space tools still
use this interface.
- Implement support for wildcard netdevice in netdev basechain and
flowtables.
- Integrate conntrack information into nft trace infrastructure.
- Export set count and backend name to userspace, for better
introspection.
BPF:
- BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
programs and can be controlled in similar way to traditional qdiscs
using the "tc qdisc" command.
- Refactor the UDP socket iterator, addressing long standing issues
WRT duplicate hits or missed sockets.
Protocols:
- Improve TCP receive buffer auto-tuning and increase the default
upper bound for the receive buffer; overall this improves the
single flow maximum thoughput on 200Gbs link by over 60%.
- Add AFS GSSAPI security class to AF_RXRPC; it provides transport
security for connections to the AFS fileserver and VL server.
- Improve TCP multipath routing, so that the sources address always
matches the nexthop device.
- Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
and thus preventing DoS caused by passing around problematic FDs.
- Retire DCCP socket. DCCP only receives updates for bugs, and major
distros disable it by default. Its removal allows for better
organisation of TCP fields to reduce the number of cache lines hit
in the fast path.
- Extend TCP drop-reason support to cover PAWS checks.
Driver API:
- Reorganize PTP ioctl flag support to require an explicit opt-in for
the drivers, avoiding the problem of drivers not rejecting new
unsupported flags.
- Converted several device drivers to timestamping APIs.
- Introduce per-PHY ethtool dump helpers, improving the support for
dump operations targeting PHYs.
Tests and tooling:
- Add support for classic netlink in user space C codegen, so that
ynl-c can now read, create and modify links, routes addresses and
qdisc layer configuration.
- Add ynl sub-types for binary attributes, allowing ynl-c to output
known struct instead of raw binary data, clarifying the classic
netlink output.
- Extend MPTCP selftests to improve the code-coverage.
- Add tests for XDP tail adjustment in AF_XDP.
New hardware / drivers:
- OpenVPN virtual driver: offload OpenVPN data channels processing to
the kernel-space, increasing the data transfer throughput WRT the
user-space implementation.
- Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.
- Broadcom asp-v3.0 ethernet driver.
- AMD Renoir ethernet device.
- ReakTek MT9888 2.5G ethernet PHY driver.
- Aeonsemi 10G C45 PHYs driver.
Drivers:
- Ethernet high-speed NICs:
- nVidia/Mellanox (mlx5):
- refactor the steering table handling to significantly
reduce the amount of memory used
- add support for complex matches in H/W flow steering
- improve flow streeing error handling
- convert to netdev instance locking
- Intel (100G, ice, igb, ixgbe, idpf):
- ice: add switchdev support for LLDP traffic over VF
- ixgbe: add firmware manipulation and regions devlink support
- igb: introduce support for frame transmission premption
- igb: adds persistent NAPI configuration
- idpf: introduce RDMA support
- idpf: add initial PTP support
- Meta (fbnic):
- extend hardware stats coverage
- add devlink dev flash support
- Broadcom (bnxt):
- add support for RX-side device memory TCP
- Wangxun (txgbe):
- implement support for udp tunnel offload
- complete PTP and SRIOV support for AML 25G/10G devices
- Ethernet NICs embedded and virtual:
- Google (gve):
- add device memory TCP TX support
- Amazon (ena):
- support persistent per-NAPI config
- Airoha:
- add H/W support for L2 traffic offload
- add per flow stats for flow offloading
- RealTek (rtl8211): add support for WoL magic packet
- Synopsys (stmmac):
- dwmac-socfpga 1000BaseX support
- add Loongson-2K3000 support
- introduce support for hardware-accelerated VLAN stripping
- Broadcom (bcmgenet):
- expose more H/W stats
- Freescale (enetc, dpaa2-eth):
- enetc: add MAC filter, VLAN filter RSS and loopback support
- dpaa2-eth: convert to H/W timestamping APIs
- vxlan: convert FDB table to rhashtable, for better scalabilty
- veth: apply qdisc backpressure on full ring to reduce TX drops
- Ethernet switches:
- Microchip (kzZ88x3): add ETS scheduler support
- Ethernet PHYs:
- RealTek (rtl8211):
- add support for WoL magic packet
- add support for PHY LEDs
- CAN:
- Adds RZ/G3E CANFD support to the rcar_canfd driver.
- Preparatory work for CAN-XL support.
- Add self-tests framework with support for CAN physical interfaces.
- WiFi:
- mac80211:
- scan improvements with multi-link operation (MLO)
- Qualcomm (ath12k):
- enable AHB support for IPQ5332
- add monitor interface support to QCN9274
- add multi-link operation support to WCN7850
- add 802.11d scan offload support to WCN7850
- monitor mode for WCN7850, better 6 GHz regulatory
- Qualcomm (ath11k):
- restore hibernation support
- MediaTek (mt76):
- WiFi-7 improvements
- implement support for mt7990
- Intel (iwlwifi):
- enhanced multi-link single-radio (EMLSR) support on 5 GHz links
- rework device configuration
- RealTek (rtw88):
- improve throughput for RTL8814AU
- RealTek (rtw89):
- add multi-link operation support
- STA/P2P concurrency improvements
- support different SAR configs by antenna
- Bluetooth:
- introduce HCI Driver protocol
- btintel_pcie: do not generate coredump for diagnostic events
- btusb: add HCI Drv commands for configuring altsetting
- btusb: add RTL8851BE device 0x0bda:0xb850
- btusb: add new VID/PID 13d3/3584 for MT7922
- btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
- btnxpuart: implement host-wakeup feature"
* tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits)
selftests/bpf: Fix bpf selftest build warning
selftests: netfilter: Fix skip of wildcard interface test
net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames
net: openvswitch: Fix the dead loop of MPLS parse
calipso: Don't call calipso functions for AF_INET sk.
selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem
net_sched: hfsc: Address reentrant enqueue adding class to eltree twice
octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback
octeontx2-pf: QOS: Perform cache sync on send queue teardown
net: mana: Add support for Multi Vports on Bare metal
net: devmem: ncdevmem: remove unused variable
net: devmem: ksft: upgrade rx test to send 1K data
net: devmem: ksft: add 5 tuple FS support
net: devmem: ksft: add exit_wait to make rx test pass
net: devmem: ksft: add ipv4 support
net: devmem: preserve sockc_err
page_pool: fix ugly page_pool formatting
net: devmem: move list_add to net_devmem_bind_dmabuf.
selftests: netfilter: nft_queue.sh: include file transfer duration in log message
net: phy: mscc: Fix memory leak when using one step timestamping
...
|
||
|
|
c544a952ba |
mm: pcp: increase pcp->free_count threshold to trigger free_high
In old pcp design, pcp->free_factor gets incremented in nr_pcp_free() which is invoked by free_pcppages_bulk(). So, it used to increase free_factor by 1 only when we try to reduce the size of pcp list and free_high used to trigger only for order > 0 and order < costly_order and pcp->free_factor > 0. For iperf3 I noticed that with older design in kernel v6.6, pcp list was drained mostly when pcp->count > high (more often when count goes above 530). and most of the time pcp->free_factor was 0, triggering very few high order flushes. But this is changed in the current design, introduced in commit |
||
|
|
2aad4edf6e |
mm: rename try_alloc_pages() to alloc_pages_nolock()
The "try_" prefix is confusing, since it made people believe that try_alloc_pages() is analogous to spin_trylock() and NULL return means EAGAIN. This is not the case. If it returns NULL there is no reason to call it again. It will most likely return NULL again. Hence rename it to alloc_pages_nolock() to make it symmetrical to free_pages_nolock() and document that NULL means ENOMEM. Link: https://lkml.kernel.org/r/20250517003446.60260-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Harry Yoo <harry.yoo@oracle.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
33e1b1b399 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc8). Conflicts: |
||
|
|
e05741fb10 |
mm/page_alloc.c: avoid infinite retries caused by cpuset race
__alloc_pages_slowpath has no change detection for ac->nodemask in the
part of retry path, while cpuset can modify it in parallel. For some
processes that set mempolicy as MPOL_BIND, this results ac->nodemask
changes, and then the should_reclaim_retry will judge based on the latest
nodemask and jump to retry, while the get_page_from_freelist only
traverses the zonelist from ac->preferred_zoneref, which selected by a
expired nodemask and may cause infinite retries in some cases
cpu 64:
__alloc_pages_slowpath {
/* ..... */
retry:
/* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
if (alloc_flags & ALLOC_KSWAPD)
wake_all_kswapds(order, gfp_mask, ac);
/* cpu 1:
cpuset_write_resmask
update_nodemask
update_nodemasks_hier
update_tasks_nodemask
mpol_rebind_task
mpol_rebind_policy
mpol_rebind_nodemask
// mempolicy->nodes has been modified,
// which ac->nodemask point to
*/
/* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
did_some_progress > 0, &no_progress_loops))
goto retry;
}
Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
this issue on a multi node server when the maximum memory pressure is
reached and the swap is enabled
Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
Fixes:
|
||
|
|
8adce08577 |
cpuset: rename cpuset_node_allowed to cpuset_current_node_allowed
Patch series "vmscan: enforce mems_effective during demotion", v5. Change reclaim to respect cpuset.mems_effective during demotion when possible. Presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. Implement cpuset_node_allowed() to check the cpuset.mems_effective associated wih the mem_cgroup of the lruvec being scanned. This only applies to cgroup/cpuset v2, as cpuset exists in a different hierarchy than mem_cgroup in v1. This requires renaming the existing cpuset_node_allowed() to be cpuset_current_now_allowed() - which is more descriptive anyway - to implement the new cpuset_node_allowed() which takes a target cgroup. This patch (of 2): Rename cpuset_node_allowed to reflect that the function checks the current task's cpuset.mems. This allows us to make a new cpuset_node_allowed function that checks a target cgroup's cpuset.mems. Link: https://lkml.kernel.org/r/20250424202806.52632-1-gourry@gourry.net Link: https://lkml.kernel.org/r/20250424202806.52632-2-gourry@gourry.net Signed-off-by: Gregory Price <gourry@gourry.net> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
ee414bd97b |
mm: page_alloc: tighten up find_suitable_fallback()
find_suitable_fallback() is not as efficient as it could be, and somewhat
difficult to follow.
1. should_try_claim_block() is a loop invariant. There is no point in
checking fallback areas if the caller is interested in claimable
blocks but the order and the migratetype don't allow for that.
2. __rmqueue_steal() doesn't care about claimability, so it shouldn't
have to run those tests.
Different callers want different things from this helper:
1. __compact_finished() scans orders up until it finds a claimable block
2. __rmqueue_claim() scans orders down as long as blocks are claimable
3. __rmqueue_steal() doesn't care about claimability at all
Move should_try_claim_block() out of the loop. Only test it for the
two callers who care in the first place. Distinguish "no blocks" from
"order + mt are not claimable" in the return value; __rmqueue_claim()
can stop once order becomes unclaimable, __compact_finished() can keep
advancing until order becomes claimable.
Before:
Performance counter stats for './run case-lru-file-mmap-read' (5 runs):
85,294.85 msec task-clock # 5.644 CPUs utilized ( +- 0.32% )
15,968 context-switches # 187.209 /sec ( +- 3.81% )
153 cpu-migrations # 1.794 /sec ( +- 3.29% )
801,808 page-faults # 9.400 K/sec ( +- 0.10% )
733,358,331,786 instructions # 1.87 insn per cycle ( +- 0.20% ) (64.94%)
392,622,904,199 cycles # 4.603 GHz ( +- 0.31% ) (64.84%)
148,563,488,531 branches # 1.742 G/sec ( +- 0.18% ) (63.86%)
152,143,228 branch-misses # 0.10% of all branches ( +- 1.19% ) (62.82%)
15.1128 +- 0.0637 seconds time elapsed ( +- 0.42% )
After:
Performance counter stats for './run case-lru-file-mmap-read' (5 runs):
84,380.21 msec task-clock # 5.664 CPUs utilized ( +- 0.21% )
16,656 context-switches # 197.392 /sec ( +- 3.27% )
151 cpu-migrations # 1.790 /sec ( +- 3.28% )
801,703 page-faults # 9.501 K/sec ( +- 0.09% )
731,914,183,060 instructions # 1.88 insn per cycle ( +- 0.38% ) (64.90%)
388,673,535,116 cycles # 4.606 GHz ( +- 0.24% ) (65.06%)
148,251,482,143 branches # 1.757 G/sec ( +- 0.37% ) (63.92%)
149,766,550 branch-misses # 0.10% of all branches ( +- 1.22% ) (62.88%)
14.8968 +- 0.0486 seconds time elapsed ( +- 0.33% )
Link: https://lkml.kernel.org/r/20250407180154.63348-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Carlos Song <carlos.song@nxp.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
3a531a9939 |
mm/page_alloc: simplify free_page_is_bad by removing free_page_is_bad_report
Refactor free_page_is_bad() to call bad_page() directly, removing the intermediate free_page_is_bad_report(). This reduces unnecessary indirection, improving code clarity and maintainability without changing functionality. Link: https://lkml.kernel.org/r/20250328012031.1204993-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
cd348c5e6a |
mm: page_alloc: remove redundant READ_ONCE
In the current code, batch is a local variable, and it cannot be
concurrently modified. It's unnecessary to use READ_ONCE here, so remove
it.
Link: https://lkml.kernel.org/r/CAA=HWd1kn01ym8YuVFuAqK2Ggq3itEGkqX8T6eCXs_C7tiv-Jw@mail.gmail.com
Fixes:
|
||
|
|
fefc075182 |
mm/page_alloc: fix race condition in unaccepted memory handling
The page allocator tracks the number of zones that have unaccepted memory
using static_branch_enc/dec() and uses that static branch in hot paths to
determine if it needs to deal with unaccepted memory.
Borislav and Thomas pointed out that the tracking is racy: operations on
static_branch are not serialized against adding/removing unaccepted pages
to/from the zone.
Sanity checks inside static_branch machinery detects it:
WARNING: CPU: 0 PID: 10 at kernel/jump_label.c:276 __static_key_slow_dec_cpuslocked+0x8e/0xa0
The comment around the WARN() explains the problem:
/*
* Warn about the '-1' case though; since that means a
* decrement is concurrent with a first (0->1) increment. IOW
* people are trying to disable something that wasn't yet fully
* enabled. This suggests an ordering problem on the user side.
*/
The effect of this static_branch optimization is only visible on
microbenchmark.
Instead of adding more complexity around it, remove it altogether.
Link: https://lkml.kernel.org/r/20250506133207.1009676-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes:
|
||
|
|
23fa022a07 |
mm/page_alloc: ensure try_alloc_pages() plays well with unaccepted memory
try_alloc_pages() will not attempt to allocate memory if the system has
*any* unaccepted memory. Memory is accepted as needed and can remain in
the system indefinitely, causing the interface to always fail.
Rather than immediately giving up, attempt to use already accepted memory
on free lists.
Pass 'alloc_flags' to cond_accept_memory() and do not accept new memory
for ALLOC_TRYLOCK requests.
Found via code inspection - only BPF uses this at present and the
runtime effects are unclear.
Link: https://lkml.kernel.org/r/20250506112509.905147-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes:
|
||
|
|
0ae0227fa3 |
mm/codetag: move tag retrieval back upfront in __free_pages()
Commit |
||
|
|
5565acd1e6 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc4). This pull includes wireless and a fix to vxlan which isn't in Linus's tree just yet. The latter creates with a silent conflict / build breakage, so merging it now to avoid causing problems. drivers/net/vxlan/vxlan_vnifilter.c |
||
|
|
3844818145 |
mm: vmscan: restore high-cpu watermark safety in kswapd
Vlastimil points out that commit |
||
|
|
4067196a52 |
mm/page_alloc: fix deadlock on cpu_hotplug_lock in __accept_page()
When the last page in the zone is accepted, __accept_page() calls
static_branch_dec(). This function takes cpu_hotplug_lock, which can lead
to a deadlock if the allocation occurs during CPU bringup path as
_cpu_up() also takes the lock.
To prevent this deadlock, defer static_branch_dec() to a workqueue.
Call static_branch_dec() only when the workqueue is not yet initialized.
Workqueues are initialized before CPU bring up, so this will not conflict
with the first scenario.
Link: https://lkml.kernel.org/r/20250329171030.3942298-1-kirill.shutemov@linux.intel.com
Fixes:
|
||
|
|
240ce924d2 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc3). No conflicts. Adjacent changes: tools/net/ynl/pyynl/ynl_gen_c.py |
||
|
|
cd3c93167d |
page_pool: Move pp_magic check into helper functions
Since we are about to stash some more information into the pp_magic field, let's move the magic signature checks into a pair of helper functions so it can be changed in one place. Reviewed-by: Mina Almasry <almasrymina@google.com> Tested-by: Yonglong Liu <liuyonglong@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-1-6a9ef2e0cba8@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
90abee6d78 |
mm: page_alloc: speed up fallbacks in rmqueue_bulk()
The test robot identified |