Pull dma-mapping fixes from Marek Szyprowski:
"Two last minute fixes for the recently modified DMA API infrastructure:
- proper handling of DMA_ATTR_MMIO in dma_iova_unlink() function (me)
- regression fix for the code refactoring related to P2PDMA (Pranjal
Shrivastava)"
* tag 'dma-mapping-6.18-2025-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-direct: Fix missing sg_dma_len assignment in P2PDMA bus mappings
iommu/dma: add missing support for DMA_ATTR_MMIO for dma_iova_unlink()
Pull ring-buffer fix from Steven Rostedt:
- Do not allow mmapped ring buffer to be split
When the ring buffer VMA is split by a partial munmap or a MAP_FIXED,
the kernel calls vm_ops->close() on each portion. This causes the
ring_buffer_unmap() to be called multiple times. This causes
subsequent calls to return -ENODEV and triggers a warning.
There's no reason to allow user space to split up memory mapping of
the ring buffer. Have it return -EINVAL when that happens.
* tag 'trace-ringbuffer-v6.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix WARN_ON in tracing_buffers_mmap_close for split VMAs
Prior to commit a25e7962db ("PCI/P2PDMA: Refactor the p2pdma mapping
helpers"), P2P segments were mapped using the pci_p2pdma_map_segment()
helper. This helper was responsible for populating sg->dma_address,
marking the bus address, and also setting sg_dma_len(sg).
The refactor[1] removed this helper and moved the mapping logic directly
into the callers. While iommu_dma_map_sg() was correctly updated to set
the length in the new flow, it was missed in dma_direct_map_sg().
Thus, in dma_direct_map_sg(), the PCI_P2PDMA_MAP_BUS_ADDR case sets the
dma_address and marks the segment, but immediately executes 'continue',
which causes the loop to skip the standard assignment logic at the end:
sg_dma_len(sg) = sg->length;
As a result, when CONFIG_NEED_SG_DMA_LENGTH is enabled, the dma_length
field remains uninitialized (zero) for P2P bus address mappings. This
breaks upper-layer drivers (for e.g. RDMA/IB) that rely on sg_dma_len()
to determine the transfer size.
Fix this by explicitly setting the DMA length in the
PCI_P2PDMA_MAP_BUS_ADDR case before continuing to the next scatterlist
entry.
Fixes: a25e7962db ("PCI/P2PDMA: Refactor the p2pdma mapping helpers")
Reported-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Pranjal Shrivastava <praan@google.com>
[1]
https://lore.kernel.org/all/ac14a0e94355bf898de65d023ccf8a2ad22a3ece.1746424934.git.leon@kernel.org/
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shivaji Kant <shivajikant@google.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20251126114112.3694469-1-praan@google.com
Pull timer fixes from Ingo Molnar:
- Fix a race in timer->function clearing in timer_shutdown_sync()
- Fix a timekeeper sysfs-setup resource leak in error paths
- Fix the NOHZ report_idle_softirq() syslog rate-limiting
logic to have no side effects on the return value
* tag 'timers-urgent-2025-11-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix NULL function pointer race in timer_shutdown_sync()
timekeeping: Fix resource leak in tk_aux_sysfs_init() error paths
tick/sched: Fix bogus condition in report_idle_softirq()
There is a race condition between timer_shutdown_sync() and timer
expiration that can lead to hitting a WARN_ON in expire_timers().
The issue occurs when timer_shutdown_sync() clears the timer function
to NULL while the timer is still running on another CPU. The race
scenario looks like this:
CPU0 CPU1
<SOFTIRQ>
lock_timer_base()
expire_timers()
base->running_timer = timer;
unlock_timer_base()
[call_timer_fn enter]
mod_timer()
...
timer_shutdown_sync()
lock_timer_base()
// For now, will not detach the timer but only clear its function to NULL
if (base->running_timer != timer)
ret = detach_if_pending(timer, base, true);
if (shutdown)
timer->function = NULL;
unlock_timer_base()
[call_timer_fn exit]
lock_timer_base()
base->running_timer = NULL;
unlock_timer_base()
...
// Now timer is pending while its function set to NULL.
// next timer trigger
<SOFTIRQ>
expire_timers()
WARN_ON_ONCE(!fn) // hit
...
lock_timer_base()
// Now timer will detach
if (base->running_timer != timer)
ret = detach_if_pending(timer, base, true);
if (shutdown)
timer->function = NULL;
unlock_timer_base()
The problem is that timer_shutdown_sync() clears the timer function
regardless of whether the timer is currently running. This can leave a
pending timer with a NULL function pointer, which triggers the
WARN_ON_ONCE(!fn) check in expire_timers().
Fix this by only clearing the timer function when actually detaching the
timer. If the timer is running, leave the function pointer intact, which is
safe because the timer will be properly detached when it finishes running.
Fixes: 0cc04e8045 ("timers: Add shutdown mechanism to the internal functions")
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251122093942.301559-1-zouyipeng@huawei.com
Pull sched_ext fix from Tejun Heo:
"One low risk and obvious fix: scx_enable() was dereferencing an error
pointer on helper kthread creation failure. Fixed"
* tag 'sched_ext-for-6.18-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix scx_enable() crash on helper kthread creation failure
A crash was observed when the sched_ext selftests runner was
terminated with Ctrl+\ while test 15 was running:
NIP [c00000000028fa58] scx_enable.constprop.0+0x358/0x12b0
LR [c00000000028fa2c] scx_enable.constprop.0+0x32c/0x12b0
Call Trace:
scx_enable.constprop.0+0x32c/0x12b0 (unreliable)
bpf_struct_ops_link_create+0x18c/0x22c
__sys_bpf+0x23f8/0x3044
sys_bpf+0x2c/0x6c
system_call_exception+0x124/0x320
system_call_vectored_common+0x15c/0x2ec
kthread_run_worker() returns an ERR_PTR() on failure rather than NULL,
but the current code in scx_alloc_and_add_sched() only checks for a NULL
helper. Incase of failure on SIGQUIT, the error is not handled in
scx_alloc_and_add_sched() and scx_enable() ends up dereferencing an
error pointer.
Error handling is fixed in scx_alloc_and_add_sched() to propagate
PTR_ERR() into ret, so that scx_enable() jumps to the existing error
path, avoiding random dereference on failure.
Fixes: bff3b5aec1 ("sched_ext: Move disable machinery into scx_sched")
Cc: stable@vger.kernel.org # v6.16+
Reported-and-tested-by: Samir Mulani <samir@linux.ibm.com>
Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
tk_aux_sysfs_init() returns immediately on error during the auxiliary clock
initialization loop without cleaning up previously allocated kobjects and
sysfs groups.
If kobject_create_and_add() or sysfs_create_group() fails during loop
iteration, the parent kobjects (tko and auxo) and any previously created
child kobjects are leaked.
Fix this by adding proper error handling with goto labels to ensure all
allocated resources are cleaned up on failure. kobject_put() on the
parent kobjects will handle cleanup of their children.
Fixes: 7b95663a3d ("timekeeping: Provide interface to control auxiliary clocks")
Signed-off-by: Malaya Kumar Rout <mrout@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251120150213.246777-1-mrout@redhat.com
Currently cpu-clock event always returns 0 count, e.g.,
perf stat -e cpu-clock -- sleep 1
Performance counter stats for 'sleep 1':
0 cpu-clock # 0.000 CPUs utilized
1.002308394 seconds time elapsed
The root cause is the commit 'bc4394e5e79c ("perf: Fix the throttle
error of some clock events")' adds PERF_EF_UPDATE flag check before
calling cpu_clock_event_update() to update the count, however the
PERF_EF_UPDATE flag is never set when the cpu-clock event is stopped in
counting mode (pmu->dev() -> cpu_clock_event_del() ->
cpu_clock_event_stop()). This leads to the cpu-clock event count is
never updated.
To fix this issue, force to set PERF_EF_UPDATE flag for cpu-clock event
just like what task-clock does.
Fixes: bc4394e5e7 ("perf: Fix the throttle error of some clock events")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://patch.msgid.link/20251112080526.3971392-1-dapeng1.mi@linux.intel.com
In commit 0345691b24 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") the
new function report_idle_softirq() was created by breaking code out of the
existing can_stop_idle_tick() for kernels v5.18 and newer.
In doing so, the code essentially went from this form:
if (A) {
static int ratelimit;
if (ratelimit < 10 && !C && A&D) {
pr_warn("NOHZ tick-stop error: ...");
ratelimit++;
}
return false;
}
to a new function:
static bool report_idle_softirq(void)
{
static int ratelimit;
if (likely(!A))
return false;
if (ratelimit < 10)
return false;
...
pr_warn("NOHZ tick-stop error: local softirq work is pending, handler #%02x!!!\n",
pending);
ratelimit++;
return true;
}
commit a7e282c777 ("tick/rcu: Fix bogus ratelimit condition") realized
ratelimit was essentially set to zero instead of ten, and hence *no*
softirq pending messages would ever be issued, but "fixed" it as:
- if (ratelimit < 10)
+ if (ratelimit >= 10)
return false;
However, this fix introduced another issue:
When ratelimit is greater than or equal 10, even if A is true, it will
directly return false. While ratelimit in the original code was only used
to control printing and will not affect the return value.
Restore the original logic and restrict ratelimit to control the printk and
not the return value.
Fixes: 0345691b24 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle")
Fixes: a7e282c777 ("tick/rcu: Fix bogus ratelimit condition")
Signed-off-by: Wen Yang <wen.yang@linux.dev>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251119174525.29470-1-wen.yang@linux.dev
Pull vfs fixes from Christian Brauner:
- Fix unitialized variable in statmount_string()
- Fix hostfs mounting when passing host root during boot
- Fix dynamic lookup to fail on cell lookup failure
- Fix missing file type when reading bfs inodes from disk
- Enforce checking of sb_min_blocksize() calls and update all callers
accordingly
- Restore write access before closing files opened by open_exec() in
binfmt_misc
- Always freeze efivarfs during suspend/hibernate cycles
- Fix statmount()'s and listmount()'s grab_requested_mnt_ns() helper to
actually allow mount namespace file descriptor in addition to mount
namespace ids
- Fix tmpfs remount when noswap is specified
- Switch Landlock to iput_not_last() to remove false-positives from
might_sleep() annotations in iput()
- Remove dead node_to_mnt_ns() code
- Ensure that per-queue kobjects are successfully created
* tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
landlock: fix splats from iput() after it started calling might_sleep()
fs: add iput_not_last()
shmem: fix tmpfs reconfiguration (remount) when noswap is set
fs/namespace: correctly handle errors returned by grab_requested_mnt_ns
power: always freeze efivarfs
binfmt_misc: restore write access before closing files opened by open_exec()
block: add __must_check attribute to sb_min_blocksize()
virtio-fs: fix incorrect check for fsvq->kobj
xfs: check the return value of sb_min_blocksize() in xfs_fs_fill_super
isofs: check the return value of sb_min_blocksize() in isofs_fill_super
exfat: check return value of sb_min_blocksize in exfat_read_boot_sector
vfat: fix missing sb_min_blocksize() return value checks
mnt: Remove dead code which might prevent from building
bfs: Reconstruct file type when loading from disk
afs: Fix dynamic lookup to fail on cell lookup failure
hostfs: Fix only passing host root in boot stage with new mount
fs: Fix uninitialized 'offp' in statmount_string()
Pull sched_ext fixes from Tejun Heo:
"Five fixes addressing PREEMPT_RT compatibility and locking issues.
Three commits fix potential deadlocks and sleeps in atomic contexts on
RT kernels by converting locks to raw spinlocks and ensuring IRQ work
runs in hard-irq context. The remaining two fix unsafe locking in the
debug dump path and a variable dereference typo"
* tag 'sched_ext-for-6.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Use IRQ_WORK_INIT_HARD() to initialize rq->scx.kick_cpus_irq_work
sched_ext: Fix possible deadlock in the deferred_irq_workfn()
sched/ext: convert scx_tasks_lock to raw spinlock
sched_ext: Fix unsafe locking in the scx_dump_state()
sched_ext: Fix use of uninitialized variable in scx_bpf_cpuperf_set()
For PREEMPT_RT kernels, the kick_cpus_irq_workfn() be invoked in
the per-cpu irq_work/* task context and there is no rcu-read critical
section to protect. this commit therefore use IRQ_WORK_INIT_HARD() to
initialize the per-cpu rq->scx.kick_cpus_irq_work in the
init_sched_ext_class().
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull misc fixes from Andrew Morton:
"7 hotfixes. 5 are cc:stable, 4 are against mm/
All are singletons - please see the respective changelogs for details"
* tag 'mm-hotfixes-stable-2025-11-16-10-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm, swap: fix potential UAF issue for VMA readahead
selftests/user_events: fix type cast for write_index packed member in perf_test
lib/test_kho: check if KHO is enabled
mm/huge_memory: fix folio split check for anon folios in swapcache
MAINTAINERS: update David Hildenbrand's email address
crash: fix crashkernel resource shrink
mm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlb
When crashkernel is configured with a high reservation, shrinking its
value below the low crashkernel reservation causes two issues:
1. Invalid crashkernel resource objects
2. Kernel crash if crashkernel shrinking is done twice
For example, with crashkernel=200M,high, the kernel reserves 200MB of high
memory and some default low memory (say 256MB). The reservation appears
as:
cat /proc/iomem | grep -i crash
af000000-beffffff : Crash kernel
433000000-43f7fffff : Crash kernel
If crashkernel is then shrunk to 50MB (echo 52428800 >
/sys/kernel/kexec_crash_size), /proc/iomem still shows 256MB reserved:
af000000-beffffff : Crash kernel
Instead, it should show 50MB:
af000000-b21fffff : Crash kernel
Further shrinking crashkernel to 40MB causes a kernel crash with the
following trace (x86):
BUG: kernel NULL pointer dereference, address: 0000000000000038
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
<snip...>
Call Trace: <TASK>
? __die_body.cold+0x19/0x27
? page_fault_oops+0x15a/0x2f0
? search_module_extables+0x19/0x60
? search_bpf_extables+0x5f/0x80
? exc_page_fault+0x7e/0x180
? asm_exc_page_fault+0x26/0x30
? __release_resource+0xd/0xb0
release_resource+0x26/0x40
__crash_shrink_memory+0xe5/0x110
crash_shrink_memory+0x12a/0x190
kexec_crash_size_store+0x41/0x80
kernfs_fop_write_iter+0x141/0x1f0
vfs_write+0x294/0x460
ksys_write+0x6d/0xf0
<snip...>
This happens because __crash_shrink_memory()/kernel/crash_core.c
incorrectly updates the crashk_res resource object even when
crashk_low_res should be updated.
Fix this by ensuring the correct crashkernel resource object is updated
when shrinking crashkernel memory.
Link: https://lkml.kernel.org/r/20251101193741.289252-1-sourabhjain@linux.ibm.com
Fixes: 16c6006af4 ("kexec: enable kexec_crash_size to support two crash kernel regions")
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull timer fix from Ingo Molnar:
"Fix a memory leak in the posix timer creation logic"
* tag 'timers-urgent-2025-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-timers: Plug potential memory leak in do_timer_create()
Pull bpf fixes from Alexei Starovoitov:
- Fix interaction between livepatch and BPF fexit programs (Song Liu)
With Steven and Masami acks.
- Fix stack ORC unwind from BPF kprobe_multi (Jiri Olsa)
With Steven and Masami acks.
- Fix out of bounds access in widen_imprecise_scalars() in the verifier
(Eduard Zingerman)
- Fix conflicts between MPTCP and BPF sockmap (Jiayuan Chen)
- Fix net_sched storage collision with BPF data_meta/data_end (Eric
Dumazet)
- Add _impl suffix to BPF kfuncs with implicit args to avoid breaking
them in bpf-next when KF_IMPLICIT_ARGS is added (Mykyta Yatsenko)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Test widen_imprecise_scalars() with different stack depth
bpf: account for current allocated stack depth in widen_imprecise_scalars()
bpf: Add bpf_prog_run_data_pointers()
selftests/bpf: Add mptcp test with sockmap
mptcp: Fix proto fallback detection with BPF
mptcp: Disallow MPTCP subflows from sockmap
selftests/bpf: Add stacktrace ips test for raw_tp
selftests/bpf: Add stacktrace ips test for kprobe_multi/kretprobe_multi
x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe
Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()"
bpf: add _impl suffix for bpf_stream_vprintk() kfunc
bpf:add _impl suffix for bpf_task_work_schedule* kfuncs
selftests/bpf: Add tests for livepatch + bpf trampoline
ftrace: bpf: Fix IPMODIFY + DIRECT in modify_ftrace_direct()
ftrace: Fix BPF fexit with livepatch
The usage pattern for widen_imprecise_scalars() looks as follows:
prev_st = find_prev_entry(env, ...);
queued_st = push_stack(...);
widen_imprecise_scalars(env, prev_st, queued_st);
Where prev_st is an ancestor of the queued_st in the explored states
tree. This ancestor is not guaranteed to have same allocated stack
depth as queued_st. E.g. in the following case:
def main():
for i in 1..2:
foo(i) // same callsite, differnt param
def foo(i):
if i == 1:
use 128 bytes of stack
iterator based loop
Here, for a second 'foo' call prev_st->allocated_stack is 128,
while queued_st->allocated_stack is much smaller.
widen_imprecise_scalars() needs to take this into account and avoid
accessing bpf_verifier_state->frame[*]->stack out of bounds.
Fixes: 2793a8b015 ("bpf: exact states comparison for iterator convergence checks")
Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251114025730.772723-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull power management fixes from Rafael Wysocki:
"These fix issues related to the handling of compressed hibernation
images and a recent intel_pstate driver regression:
- Fix issues related to using inadequate data types and incorrect use
of atomic variables in the compressed hibernation images handling
code that were introduced during the 6.9 development cycle (Mario
Limonciello)
- Move a X86_FEATURE_IDA check from turbo_is_disabled() to the places
where a new value for MSR_IA32_PERF_CTL is computed in intel_pstate
to address a regression preventing users from enabling turbo
frequencies post-boot (Srinivas Pandruvada)"
* tag 'pm-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: intel_pstate: Check IDA only before MSR_IA32_PERF_CTL writes
PM: hibernate: Fix style issues in save_compressed_image()
PM: hibernate: Use atomic64_t for compressed_size variable
PM: hibernate: Emit an error when image writing fails
Merge fixes for issues related to the handling of compressed hibernation
images that were introduced during the 6.9 development cycle.
* pm-sleep:
PM: hibernate: Fix style issues in save_compressed_image()
PM: hibernate: Use atomic64_t for compressed_size variable
PM: hibernate: Emit an error when image writing fails
For PREEMPT_RT=y kernels, the deferred_irq_workfn() is executed in
the per-cpu irq_work/* task context and not disable-irq, if the rq
returned by container_of() is current CPU's rq, the following scenarios
may occur:
lock(&rq->__lock);
<Interrupt>
lock(&rq->__lock);
This commit use IRQ_WORK_INIT_HARD() to replace init_irq_work() to
initialize rq->scx.deferred_irq_work, make the deferred_irq_workfn()
is always invoked in hard-irq context.
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
Update scx_task_locks so that it's safe to lock/unlock in a
non-sleepable context in PREEMPT_RT kernels. scx_task_locks is
(non-raw) spinlock used to protect the list of tasks under SCX.
This list is updated during from finish_task_switch(), which
cannot sleep. Regular spinlocks can be locked in such a context
in non-RT kernels, but are sleepable under when CONFIG_PREEMPT_RT=y.
Convert scx_task_locks into a raw spinlock, which is not sleepable
even on RT kernels.
Sample backtrace:
<TASK>
dump_stack_lvl+0x83/0xa0
__might_resched+0x14a/0x200
rt_spin_lock+0x61/0x1c0
? sched_ext_dead+0x2d/0xf0
? lock_release+0xc6/0x280
sched_ext_dead+0x2d/0xf0
? srso_alias_return_thunk+0x5/0xfbef5
finish_task_switch.isra.0+0x254/0x360
__schedule+0x584/0x11d0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? tick_nohz_idle_exit+0x7e/0x120
schedule_idle+0x23/0x40
cpu_startup_entry+0x29/0x30
start_secondary+0xf8/0x100
common_startup_64+0x13e/0x148
</TASK>
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
kho_vmalloc_unpreserve_chunk() calls __kho_unpreserve() with end_pfn as
pfn + 1. This happens to work for 0-order pages, but leaks higher order
pages.
For example, say order 2 pages back the allocation. During preservation,
they get preserved in the order 2 bitmaps, but
kho_vmalloc_unpreserve_chunk() would try to unpreserve them from the order
0 bitmaps, which should not have these bits set anyway, leaving the order
2 bitmaps untouched. This results in the pages being carried over to the
next kernel. Nothing will free those pages in the next boot, leaking
them.
Fix this by taking the order into account when calculating the end PFN for
__kho_unpreserve().
Link: https://lkml.kernel.org/r/20251103180235.71409-2-pratyush@kernel.org
Fixes: a667300bd5 ("kho: add support for preserving vmalloc allocations")
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The list of pages in a vmalloc chunk is NULL-terminated. So when looping
through the pages in a vmalloc chunk, both kho_restore_vmalloc() and
kho_vmalloc_unpreserve_chunk() rightly make sure to stop when encountering
a NULL page. But when the chunk is full, the loops do not stop and go
past the bounds of chunk->phys, resulting in out-of-bounds memory access,
and possibly the restoration or unpreservation of an invalid page.
Fix this by making sure the processing of chunk stops at the end of the
array.
Link: https://lkml.kernel.org/r/20251103110159.8399-1-pratyush@kernel.org
Fixes: a667300bd5 ("kho: add support for preserving vmalloc allocations")
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
KHO allocates metadata for its preserved memory map using the slab
allocator via kzalloc(). This metadata is temporary and is used by the
next kernel during early boot to find preserved memory.
A problem arises when KFENCE is enabled. kzalloc() calls can be randomly
intercepted by kfence_alloc(), which services the allocation from a
dedicated KFENCE memory pool. This pool is allocated early in boot via
memblock.
When booting via KHO, the memblock allocator is restricted to a "scratch
area", forcing the KFENCE pool to be allocated within it. This creates a
conflict, as the scratch area is expected to be ephemeral and
overwriteable by a subsequent kexec. If KHO metadata is placed in this
KFENCE pool, it leads to memory corruption when the next kernel is loaded.
To fix this, modify KHO to allocate its metadata directly from the buddy
allocator instead of slab.
Link: https://lkml.kernel.org/r/20251021000852.2924827-4-pasha.tatashin@soleen.com
Fixes: fc33e4b44b ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: David Matlack <dmatlack@google.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
KHO memory preservation metadata is preserved in 512 byte chunks which
requires their allocation from slab allocator. Slabs are not safe to be
used with KHO because of kfence, and because partial slabs may lead leaks
to the next kernel. Change the size to be PAGE_SIZE.
The kfence specifically may cause memory corruption, where it randomly
provides slab objects that can be within the scratch area. The reason for
that is that kfence allocates its objects prior to KHO scratch is marked
as CMA region.
While this change could potentially increase metadata overhead on systems
with sparsely preserved memory, this is being mitigated by ongoing work to
reduce sparseness during preservation via 1G guest pages. Furthermore,
this change aligns with future work on a stateless KHO, which will also
use page-sized bitmaps for its radix tree metadata.
Link: https://lkml.kernel.org/r/20251021000852.2924827-3-pasha.tatashin@soleen.com
Fixes: fc33e4b44b ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "KHO: kfence + KHO memory corruption fix", v3.
This series fixes a memory corruption bug in KHO that occurs when KFENCE
is enabled.
The root cause is that KHO metadata, allocated via kzalloc(), can be
randomly serviced by kfence_alloc(). When a kernel boots via KHO, the
early memblock allocator is restricted to a "scratch area". This forces
the KFENCE pool to be allocated within this scratch area, creating a
conflict. If KHO metadata is subsequently placed in this pool, it gets
corrupted during the next kexec operation.
Google is using KHO and have had obscure crashes due to this memory
corruption, with stacks all over the place. I would prefer this fix to be
properly backported to stable so we can also automatically consume it once
we switch to the upstream KHO.
Patch 1/3 introduces a debug-only feature (CONFIG_KEXEC_HANDOVER_DEBUG)
that adds checks to detect and fail any operation that attempts to place
KHO metadata or preserved memory within the scratch area. This serves as
a validation and diagnostic tool to confirm the problem without affecting
production builds.
Patch 2/3 Increases bitmap to PAGE_SIZE, so buddy allocator can be used.
Patch 3/3 Provides the fix by modifying KHO to allocate its metadata
directly from the buddy allocator instead of slab. This bypasses the
KFENCE interception entirely.
This patch (of 3):
It is invalid for KHO metadata or preserved memory regions to be located
within the KHO scratch area, as this area is overwritten when the next
kernel is loaded, and used early in boot by the next kernel. This can
lead to memory corruption.
Add checks to kho_preserve_* and KHO's internal metadata allocators
(xa_load_or_alloc, new_chunk) to verify that the physical address of the
memory does not overlap with any defined scratch region. If an overlap is
detected, the operation will fail and a WARN_ON is triggered. To avoid
performance overhead in production kernels, these checks are enabled only
when CONFIG_KEXEC_HANDOVER_DEBUG is selected.
[rppt@kernel.org: fix KEXEC_HANDOVER_DEBUG Kconfig dependency]
Link: https://lkml.kernel.org/r/aQHUyyFtiNZhx8jo@kernel.org
[pasha.tatashin@soleen.com: build fix]
Link: https://lkml.kernel.org/r/CA+CK2bBnorfsTymKtv4rKvqGBHs=y=MjEMMRg_tE-RME6n-zUw@mail.gmail.com
Link: https://lkml.kernel.org/r/20251021000852.2924827-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20251021000852.2924827-2-pasha.tatashin@soleen.com
Fixes: fc33e4b44b ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Mike Rapoport <rppt@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull scheduler fix from Ingo Molnar:
"Fix a group-throttling bug in the fair scheduler"
* tag 'sched-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
Pull perf event fix from Ingo Molnar:
"Fix a system hang caused by cpu-clock events deadlock"
* tag 'perf-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix system hang caused by cpu-clock usage
Pull locking fix from Ingo Molnar:
"Fix (well, cut in half) a futex performance regression on PowerPC"
* tag 'locking-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Optimize per-cpu reference counting
Pull tracing fixes from Steven Rostedt:
- Check for reader catching up in ring_buffer_map_get_reader()
If the reader catches up to the writer in the memory mapped ring
buffer then calling rb_get_reader_page() will return NULL as there's
no pages left. But this isn't checked for before calling
rb_get_reader_page() and the return of NULL causes a warning.
If it is detected that the reader caught up to the writer, then
simply exit the routine
- Fix memory leak in histogram create_field_var()
The couple of the error paths in create_field_var() did not properly
clean up what was allocated. Make sure everything is freed properly
on error
- Fix help message of tools latency_collector
The help message incorrectly stated that "-t" was the same as
"--threads" whereas "--threads" is actually represented by "-e"
* tag 'trace-v6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/tools: Fix incorrcet short option in usage text for --threads
tracing: Fix memory leaks in create_field_var()
ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches up
The function create_field_var() allocates memory for 'val' through
create_hist_field() inside parse_atom(), and for 'var' through
create_var(), which in turn allocates var->type and var->var.name
internally. Simply calling kfree() to release these structures will
result in memory leaks.
Use destroy_hist_field() to properly free 'val', and explicitly release
the memory of var->type and var->var.name before freeing 'var' itself.
Link: https://patch.msgid.link/20251106120132.3639920-1-zilin@seu.edu.cn
Fixes: 02205a6752 ("tracing: Add support for 'field variables'")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Since __tracepoint_user_init() calls tracepoint_user_register() without
initializing tuser->tpoint with given tracpoint, it does not register
tracepoint stub function as callback correctly, and tprobe does not work.
Initializing tuser->tpoint correctly before tracepoint_user_register()
so that it sets up tracepoint callback.
I confirmed below example works fine again.
echo "t sched_switch preempt prev_pid=prev->pid next_pid=next->pid" > /sys/kernel/tracing/dynamic_events
echo 1 > /sys/kernel/tracing/events/tracepoints/sched_switch/enable
cat /sys/kernel/tracing/trace_pipe
Link: https://lore.kernel.org/all/176244793514.155515.6466348656998627773.stgit@devnote2/
Fixes: 2867495dea ("tracing: tprobe-events: Register tracepoint when enable tprobe event")
Reported-by: Beau Belgrave <beaub@linux.microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Tested-by: Beau Belgrave <beaub@linux.microsoft.com>
Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com>
Shrikanth noted that the per-cpu reference counter was still some 10%
slower than the old immutable option (which removes the reference
counting entirely).
Further optimize the per-cpu reference counter by:
- switching from RCU to preempt;
- using __this_cpu_*() since we now have preempt disabled;
- switching from smp_load_acquire() to READ_ONCE().
This is all safe because disabling preemption inhibits the RCU grace
period exactly like rcu_read_lock().
Having preemption disabled allows using __this_cpu_*() provided the
only access to the variable is in task context -- which is the case
here.
Furthermore, since we know changing fph->state to FR_ATOMIC demands a
full RCU grace period we can rely on the implied smp_mb() from that to
replace the acquire barrier().
This is very similar to the percpu_down_read_internal() fast-path.
The reason this is significant for PowerPC is that it uses the generic
this_cpu_*() implementation which relies on local_irq_disable() (the
x86 implementation relies on it being a single memop instruction to be
IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids
this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE
barrier, not having to use explicit barriers safes a bunch.
Combined this reduces the performance gap by half, down to some 5%.
Fixes: 760e6f7bef ("futex: Remove support for IMMUTABLE")
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20251106092929.GR4067720@noisy.programming.kicks-ass.net
When a cfs_rq is to be throttled, its limbo list should be empty and
that's why there is a warn in tg_throttle_down() for non empty
cfs_rq->throttled_limbo_list.
When running a test with the following hierarchy:
root
/ \
A* ...
/ | \ ...
B
/ \
C*
where both A and C have quota settings, that warn on non empty limbo list
is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu
part of the cfs_rq for the sake of simpler representation).
Debug showed it happened like this:
Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
equals to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
*multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
these throttled tasks are directly placed into cfs_rq_c's limbo list by
enqueue_throttled_task().
Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
can't get more runtime and enters tg_throttle_down(), where the warning
is hit due to remaining tasks in the limbo list.
I think it's a chaos to trigger throttle on unthrottle path, the status
of a being unthrottled cfs_rq can be in a mixed state in the end, so fix
this by granting 1ns to cfs_rq in tg_set_cfs_bandwidth(). This ensures
cfs_rq_c has a positive runtime_remaining when initialized as unthrottled
and cannot enter tg_unthrottle_up() with zero runtime_remaining.
Also, update outdated comments in tg_throttle_down() since
unthrottle_cfs_rq() is no longer called with zero runtime_remaining.
While at it, remove a redundant assignment to se in tg_throttle_down().
Fixes: e1fad12dcb ("sched/fair: Switch to task based throttle model")
Reviewed-By: Benjamin Segall <bsegall@google.com>
Suggested-by: Benjamin Segall <bsegall@google.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Hao Jia <jiahao1@lixiang.com>
Link: https://patch.msgid.link/20251030032755.560-1-ziqianlu@bytedance.com
Rename bpf_stream_vprintk() to bpf_stream_vprintk_impl().
This makes bpf_stream_vprintk() follow the already established "_impl"
suffix-based naming convention for kfuncs with the bpf_prog_aux
argument provided by the verifier implicitly. This convention will be
taken advantage of with the upcoming KF_IMPLICIT_ARGS feature to
preserve backwards compatibility to BPF programs.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20251104-implv2-v3-2-4772b9ae0e06@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Rename:
bpf_task_work_schedule_resume()->bpf_task_work_schedule_resume_impl()
bpf_task_work_schedule_signal()->bpf_task_work_schedule_signal_impl()
This aligns task work scheduling kfuncs with the established naming
scheme for kfuncs with the bpf_prog_aux argument provided by the
verifier implicitly. This convention will be taken advantage of with the
upcoming KF_IMPLICIT_ARGS feature to preserve backwards compatibility to
BPF programs.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20251104-implv2-v3-1-4772b9ae0e06@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
ftrace_hash_ipmodify_enable() checks IPMODIFY and DIRECT ftrace_ops on
the same kernel function. When needed, ftrace_hash_ipmodify_enable()
calls ops->ops_func() to prepare the direct ftrace (BPF trampoline) to
share the same function as the IPMODIFY ftrace (livepatch).
ftrace_hash_ipmodify_enable() is called in register_ftrace_direct() path,
but not called in modify_ftrace_direct() path. As a result, the following
operations will break livepatch:
1. Load livepatch to a kernel function;
2. Attach fentry program to the kernel function;
3. Attach fexit program to the kernel function.
After 3, the kernel function being used will not be the livepatched
version, but the original version.
Fix this by adding __ftrace_hash_update_ipmodify() to
__modify_ftrace_direct() and adjust some logic around the call.
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20251027175023.1521602-3-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>