2739 Commits

Author SHA1 Message Date
Sean Christopherson
ae431059e7 KVM: guest_memfd: Remove bindings on memslot deletion when gmem is dying
When unbinding a memslot from a guest_memfd instance, remove the bindings
even if the guest_memfd file is dying, i.e. even if its file refcount has
gone to zero.  If the memslot is freed before the file is fully released,
nullifying the memslot side of the binding in kvm_gmem_release() will
write to freed memory, as detected by syzbot+KASAN:

  ==================================================================
  BUG: KASAN: slab-use-after-free in kvm_gmem_release+0x176/0x440 virt/kvm/guest_memfd.c:353
  Write of size 8 at addr ffff88807befa508 by task syz.0.17/6022

  CPU: 0 UID: 0 PID: 6022 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025
  Call Trace:
   <TASK>
   dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
   print_address_description mm/kasan/report.c:378 [inline]
   print_report+0xca/0x240 mm/kasan/report.c:482
   kasan_report+0x118/0x150 mm/kasan/report.c:595
   kvm_gmem_release+0x176/0x440 virt/kvm/guest_memfd.c:353
   __fput+0x44c/0xa70 fs/file_table.c:468
   task_work_run+0x1d4/0x260 kernel/task_work.c:227
   resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
   exit_to_user_mode_loop+0xe9/0x130 kernel/entry/common.c:43
   exit_to_user_mode_prepare include/linux/irq-entry-common.h:225 [inline]
   syscall_exit_to_user_mode_work include/linux/entry-common.h:175 [inline]
   syscall_exit_to_user_mode include/linux/entry-common.h:210 [inline]
   do_syscall_64+0x2bd/0xfa0 arch/x86/entry/syscall_64.c:100
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7fbeeff8efc9
   </TASK>

  Allocated by task 6023:
   kasan_save_stack mm/kasan/common.c:56 [inline]
   kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
   poison_kmalloc_redzone mm/kasan/common.c:397 [inline]
   __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:414
   kasan_kmalloc include/linux/kasan.h:262 [inline]
   __kmalloc_cache_noprof+0x3e2/0x700 mm/slub.c:5758
   kmalloc_noprof include/linux/slab.h:957 [inline]
   kzalloc_noprof include/linux/slab.h:1094 [inline]
   kvm_set_memory_region+0x747/0xb90 virt/kvm/kvm_main.c:2104
   kvm_vm_ioctl_set_memory_region+0x6f/0xd0 virt/kvm/kvm_main.c:2154
   kvm_vm_ioctl+0x957/0xc60 virt/kvm/kvm_main.c:5201
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:597 [inline]
   __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

  Freed by task 6023:
   kasan_save_stack mm/kasan/common.c:56 [inline]
   kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
   kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584
   poison_slab_object mm/kasan/common.c:252 [inline]
   __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:284
   kasan_slab_free include/linux/kasan.h:234 [inline]
   slab_free_hook mm/slub.c:2533 [inline]
   slab_free mm/slub.c:6622 [inline]
   kfree+0x19a/0x6d0 mm/slub.c:6829
   kvm_set_memory_region+0x9c4/0xb90 virt/kvm/kvm_main.c:2130
   kvm_vm_ioctl_set_memory_region+0x6f/0xd0 virt/kvm/kvm_main.c:2154
   kvm_vm_ioctl+0x957/0xc60 virt/kvm/kvm_main.c:5201
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:597 [inline]
   __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

Deliberately don't acquire filemap invalid lock when the file is dying as
the lifecycle of f_mapping is outside the purview of KVM.  Dereferencing
the mapping is *probably* fine, but there's no need to invalidate anything
as memslot deletion is responsible for zapping SPTEs, and the only code
that can access the dying file is kvm_gmem_release(), whose core code is
mutually exclusive with unbinding.

Note, the mutual exclusivity is also what makes it safe to access the
bindings on a dying gmem instance.  Unbinding either runs with slots_lock
held, or after the last reference to the owning "struct kvm" is put, and
kvm_gmem_release() nullifies the slot pointer under slots_lock, and puts
its reference to the VM after that is done.

Reported-by: syzbot+2479e53d0db9b32ae2aa@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68fa7a22.a70a0220.3bf6c6.008b.GAE@google.com
Tested-by: syzbot+2479e53d0db9b32ae2aa@syzkaller.appspotmail.com
Fixes: a7800aa80e ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Cc: stable@vger.kernel.org
Cc: Hillf Danton <hdanton@sina.com>
Reviewed-By: Vishal Annapurve <vannapurve@google.com>
Link: https://patch.msgid.link/20251104011205.3853541-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04 09:16:53 -08:00
Paolo Bonzini
4361f5aa8b Merge tag 'kvm-x86-fixes-6.18-rc2' of https://github.com/kvm-x86/linux into HEAD
KVM x86 fixes for 6.18:

 - Expand the KVM_PRE_FAULT_MEMORY selftest to add a regression test for the
   bug fixed by commit 3ccbf6f470 ("KVM: x86/mmu: Return -EAGAIN if userspace
   deletes/moves memslot during prefault")

 - Don't try to get PMU capabbilities from perf when running a CPU with hybrid
   CPUs/PMUs, as perf will rightly WARN.

 - Rework KVM_CAP_GUEST_MEMFD_MMAP (newly introduced in 6.18) into a more
   generic KVM_CAP_GUEST_MEMFD_FLAGS

 - Add a guest_memfd INIT_SHARED flag and require userspace to explicitly set
   said flag to initialize memory as SHARED, irrespective of MMAP.  The
   behavior merged in 6.18 is that enabling mmap() implicitly initializes
   memory as SHARED, which would result in an ABI collision for x86 CoCo VMs
   as their memory is currently always initialized PRIVATE.

 - Allow mmap() on guest_memfd for x86 CoCo VMs, i.e. on VMs with private
   memory, to enable testing such setups, i.e. to hopefully flush out any
   other lurking ABI issues before 6.18 is officially released.

 - Add testcases to the guest_memfd selftest to cover guest_memfd without MMAP,
   and host userspace accesses to mmap()'d private memory.
2025-10-18 10:25:43 +02:00
Sean Christopherson
44c6cb9fe9 KVM: guest_memfd: Allow mmap() on guest_memfd for x86 VMs with private memory
Allow mmap() on guest_memfd instances for x86 VMs with private memory as
the need to track private vs. shared state in the guest_memfd instance is
only pertinent to INIT_SHARED.  Doing mmap() on private memory isn't
terrible useful (yet!), but it's now possible, and will be desirable when
guest_memfd gains support for other VMA-based syscalls, e.g. mbind() to
set NUMA policy.

Lift the restriction now, before MMAP support is officially released, so
that KVM doesn't need to add another capability to enumerate support for
mmap() on private memory.

Fixes: 3d3a04fad2 ("KVM: Allow and advertise support for host mmap() on guest_memfd files")
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20251003232606.4070510-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-10 14:25:25 -07:00
Sean Christopherson
9aef71c892 KVM: Explicitly mark KVM_GUEST_MEMFD as depending on KVM_GENERIC_MMU_NOTIFIER
Add KVM_GENERIC_MMU_NOTIFIER as a dependency for selecting KVM_GUEST_MEMFD,
as guest_memfd relies on kvm_mmu_invalidate_{begin,end}(), which are
defined if and only if the generic mmu_notifier implementation is enabled.

The missing dependency is currently benign as s390 is the only KVM arch
that doesn't utilize the generic mmu_notifier infrastructure, and s390
doesn't currently support guest_memfd.

Fixes: a7800aa80e ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20251003232606.4070510-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-10 14:25:24 -07:00
Sean Christopherson
5d3341d684 KVM: guest_memfd: Invalidate SHARED GPAs if gmem supports INIT_SHARED
When invalidating gmem ranges, e.g. in response to PUNCH_HOLE, process all
possible range types (PRIVATE vs. SHARED) for the gmem instance.  Since
since guest_memfd doesn't yet support in-place conversions, simply pivot
on INIT_SHARED as a gmem instance can currently only have private or shared
memory, not both.

Failure to mark shared GPAs for invalidation is benign in the current code
base, as only x86's TDX consumes KVM_FILTER_{PRIVATE,SHARED}, and TDX
doesn't yet support INIT_SHARED with guest_memfd.  However, invalidating
only private GPAs is conceptually wrong and a lurking bug, e.g. could
result in missed invalidations if ARM starts filtering invalidations based
on attributes.

Fixes: 3d3a04fad2 ("KVM: Allow and advertise support for host mmap() on guest_memfd files")
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20251003232606.4070510-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-10 14:25:24 -07:00
Sean Christopherson
fe2bf6234e KVM: guest_memfd: Add INIT_SHARED flag, reject user page faults if not set
Add a guest_memfd flag to allow userspace to state that the underlying
memory should be configured to be initialized as shared, and reject user
page faults if the guest_memfd instance's memory isn't shared.  Because
KVM doesn't yet support in-place private<=>shared conversions, all
guest_memfd memory effectively follows the initial state.

Alternatively, KVM could deduce the initial state based on MMAP, which for
all intents and purposes is what KVM currently does.  However, implicitly
deriving the default state based on MMAP will result in a messy ABI when
support for in-place conversions is added.

For x86 CoCo VMs, which don't yet support MMAP, memory is currently private
by default (otherwise the memory would be unusable).  If MMAP implies
memory is shared by default, then the default state for CoCo VMs will vary
based on MMAP, and from userspace's perspective, will change when in-place
conversion support is added.  I.e. to maintain guest<=>host ABI, userspace
would need to immediately convert all memory from shared=>private, which
is both ugly and inefficient.  The inefficiency could be avoided by adding
a flag to state that memory is _private_ by default, irrespective of MMAP,
but that would lead to an equally messy and hard to document ABI.

Bite the bullet and immediately add a flag to control the default state so
that the effective behavior is explicit and straightforward.

Fixes: 3d3a04fad2 ("KVM: Allow and advertise support for host mmap() on guest_memfd files")
Cc: David Hildenbrand <david@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20251003232606.4070510-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-10 14:25:23 -07:00
Sean Christopherson
d2042d8f96 KVM: Rework KVM_CAP_GUEST_MEMFD_MMAP into KVM_CAP_GUEST_MEMFD_FLAGS
Rework the not-yet-released KVM_CAP_GUEST_MEMFD_MMAP into a more generic
KVM_CAP_GUEST_MEMFD_FLAGS capability so that adding new flags doesn't
require a new capability, and so that developers aren't tempted to bundle
multiple flags into a single capability.

Note, kvm_vm_ioctl_check_extension_generic() can only return a 32-bit
value, but that limitation can be easily circumvented by adding e.g.
KVM_CAP_GUEST_MEMFD_FLAGS2 in the unlikely event guest_memfd supports more
than 32 flags.

Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20251003232606.4070510-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-10 14:25:22 -07:00
Linus Torvalds
2215336295 Merge tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:

 - Unify guest entry code for KVM and MSHV (Sean Christopherson)

 - Switch Hyper-V MSI domain to use msi_create_parent_irq_domain()
   (Nam Cao)

 - Add CONFIG_HYPERV_VMBUS and limit the semantics of CONFIG_HYPERV
   (Mukesh Rathor)

 - Add kexec/kdump support on Azure CVMs (Vitaly Kuznetsov)

 - Deprecate hyperv_fb in favor of Hyper-V DRM driver (Prasanna
   Kumar T S M)

 - Miscellaneous enhancements, fixes and cleanups (Abhishek Tiwari,
   Alok Tiwari, Nuno Das Neves, Wei Liu, Roman Kisel, Michael Kelley)

* tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  hyperv: Remove the spurious null directive line
  MAINTAINERS: Mark hyperv_fb driver Obsolete
  fbdev/hyperv_fb: deprecate this in favor of Hyper-V DRM driver
  Drivers: hv: Make CONFIG_HYPERV bool
  Drivers: hv: Add CONFIG_HYPERV_VMBUS option
  Drivers: hv: vmbus: Fix typos in vmbus_drv.c
  Drivers: hv: vmbus: Fix sysfs output format for ring buffer index
  Drivers: hv: vmbus: Clean up sscanf format specifier in target_cpu_store()
  x86/hyperv: Switch to msi_create_parent_irq_domain()
  mshv: Use common "entry virt" APIs to do work in root before running guest
  entry: Rename "kvm" entry code assets to "virt" to genericize APIs
  entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper
  mshv: Handle NEED_RESCHED_LAZY before transferring to guest
  x86/hyperv: Add kexec/kdump support on Azure CVMs
  Drivers: hv: Simplify data structures for VMBus channel close message
  Drivers: hv: util: Cosmetic changes for hv_utils_transport.c
  mshv: Add support for a new parent partition configuration
  clocksource: hyper-v: Skip unnecessary checks for the root partition
  hyperv: Add missing field to hv_output_map_device_interrupt
2025-10-07 08:40:15 -07:00
Sean Christopherson
9be7e1e320 entry: Rename "kvm" entry code assets to "virt" to genericize APIs
Rename the "kvm" entry code files and Kconfigs to use generic "virt"
nomenclature so that the code can be reused by other hypervisors (or
rather, their root/dom0 partition drivers), without incorrectly suggesting
the code somehow relies on and/or involves KVM.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30 22:50:18 +00:00
Sean Christopherson
20c4892058 KVM: Export KVM-internal symbols for sub-modules only
Rework the vast majority of KVM's exports to expose symbols only to KVM
submodules, i.e. to x86's kvm-{amd,intel}.ko and PPC's kvm-{pr,hv}.ko.
With few exceptions, KVM's exported APIs are intended (and safe) for KVM-
internal usage only.

Keep kvm_get_kvm(), kvm_get_kvm_safe(), and kvm_put_kvm() as normal
exports, as they are needed by VFIO, and are generally safe for external
usage (though ideally even the get/put APIs would be KVM-internal, and
VFIO would pin a VM by grabbing a reference to its associated file).

Implement a framework in kvm_types.h in anticipation of providing a macro
to restrict KVM-specific kernel exports, i.e. to provide symbol exports
for KVM if and only if KVM is built as one or more modules.

Link: https://lore.kernel.org/r/20250919003303.1355064-3-seanjc@google.com
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-09-30 13:40:02 -04:00
Paolo Bonzini
a104e0a305 Merge tag 'kvm-x86-svm-6.18' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.18

 - Require a minimum GHCB version of 2 when starting SEV-SNP guests via
   KVM_SEV_INIT2 so that invalid GHCB versions result in immediate errors
   instead of latent guest failures.

 - Add support for Secure TSC for SEV-SNP guests, which prevents the untrusted
   host from tampering with the guest's TSC frequency, while still allowing the
   the VMM to configure the guest's TSC frequency prior to launch.

 - Mitigate the potential for TOCTOU bugs when accessing GHCB fields by
   wrapping all accesses via READ_ONCE().

 - Validate the XCR0 provided by the guest (via the GHCB) to avoid tracking a
   bogous XCR0 value in KVM's software model.

 - Save an SEV guest's policy if and only if LAUNCH_START fully succeeds to
   avoid leaving behind stale state (thankfully not consumed in KVM).

 - Explicitly reject non-positive effective lengths during SNP's LAUNCH_UPDATE
   instead of subtly relying on guest_memfd to do the "heavy" lifting.

 - Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the host's
   desired TSC_AUX, to fix a bug where KVM could clobber a different vCPU's
   TSC_AUX due to hardware not matching the value cached in the user-return MSR
   infrastructure.

 - Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is supported,
   and clean up the AVIC initialization code along the way.
2025-09-30 13:34:12 -04:00
Paolo Bonzini
5b0d0d8542 Merge tag 'kvm-x86-mmu-6.18' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.18

 - Recover possible NX huge pages within the TDP MMU under read lock to
   reduce guest jitter when restoring NX huge pages.

 - Return -EAGAIN during prefault if userspace concurrently deletes/moves the
   relevant memslot to fix an issue where prefaulting could deadlock with the
   memslot update.

 - Don't retry in TDX's anti-zero-step mitigation if the target memslot is
   invalid, i.e. is being deleted or moved, to fix a deadlock scenario similar
   to the aforementioned prefaulting case.
2025-09-30 13:32:27 -04:00
Paolo Bonzini
99cab80208 Merge tag 'kvm-x86-generic-6.18' of https://github.com/kvm-x86/linux into HEAD
KVM common changes for 6.18

Remove a redundant __GFP_NOWARN from kvm_setup_async_pf() as __GFP_NOWARN is
now included in GFP_NOWAIT.
2025-09-30 13:27:59 -04:00
Paolo Bonzini
924ebaefce Merge tag 'kvmarm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.18

- Add support for FF-A 1.2 as the secure memory conduit for pKVM,
  allowing more registers to be used as part of the message payload.

- Change the way pKVM allocates its VM handles, making sure that the
  privileged hypervisor is never tricked into using uninitialised
  data.

- Speed up MMIO range registration by avoiding unnecessary RCU
  synchronisation, which results in VMs starting much quicker.

- Add the dump of the instruction stream when panic-ing in the EL2
  payload, just like the rest of the kernel has always done. This will
  hopefully help debugging non-VHE setups.

- Add 52bit PA support to the stage-1 page-table walker, and make use
  of it to populate the fault level reported to the guest on failing
  to translate a stage-1 walk.

- Add NV support to the GICv3-on-GICv5 emulation code, ensuring
  feature parity for guests, irrespective of the host platform.

- Fix some really ugly architecture problems when dealing with debug
  in a nested VM. This has some bad performance impacts, but is at
  least correct.

- Add enough infrastructure to be able to disable EL2 features and
  give effective values to the EL2 control registers. This then allows
  a bunch of features to be turned off, which helps cross-host
  migration.

- Large rework of the selftest infrastructure to allow most tests to
  transparently run at EL2. This is the first step towards enabling
  NV testing.

- Various fixes and improvements all over the map, including one BE
  fix, just in time for the removal of the feature.
2025-09-30 13:23:28 -04:00
Sean Christopherson
5b66e335ea KVM: SEV: Reject non-positive effective lengths during LAUNCH_UPDATE
Check for an invalid length during LAUNCH_UPDATE at the start of
snp_launch_update() instead of subtly relying on kvm_gmem_populate() to
detect the bad state.  Code that directly handles userspace input
absolutely should sanitize those inputs; failure to do so is asking for
bugs where KVM consumes an invalid "npages".

Keep the check in gmem, but wrap it in a WARN to flag any bad usage by
the caller.

Note, this is technically an ABI change as KVM would previously allow a
length of '0'.  But allowing a length of '0' is nonsensical and creates
pointless conundrums in KVM.  E.g. an empty range is arguably neither
private nor shared, but LAUNCH_UPDATE will fail if the starting gpa can't
be made private.  In practice, no known or well-behaved VMM passes a
length of '0'.

Note #2, the PAGE_ALIGNED(params.len) check ensures that lengths between
1 and 4095 (inclusive) are also rejected, i.e. that KVM won't end up with
npages=0 when doing "npages = params.len / PAGE_SIZE".

Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Michael Roth <michael.roth@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250919211649.1575654-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 08:55:20 -07:00
Keir Fraser
7d9a0273c4 KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
Device MMIO registration may happen quite frequently during VM boot,
and the SRCU synchronization each time has a measurable effect
on VM startup time. In our experiments it can account for around 25%
of a VM's startup time.

Replace the synchronization with a deferred free of the old kvm_io_bus
structure.

Tested-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Keir Fraser <keirf@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-09-15 10:55:23 +01:00
Keir Fraser
7788255aba KVM: Implement barriers before accessing kvm->buses[] on SRCU read paths
This ensures that, if a VCPU has "observed" that an IO registration has
occurred, the instruction currently being trapped or emulated will also
observe the IO registration.

At the same time, enforce that kvm_get_bus() is used only on the
update side, ensuring that a long-term reference cannot be obtained by
an SRCU reader.

Signed-off-by: Keir Fraser <keirf@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-09-15 10:55:23 +01:00
Sean Christopherson
2bc2694fe2 KVM: TDX: Do not retry locally when the retry is caused by invalid memslot
Avoid local retries within the TDX EPT violation handler if a retry is
triggered by faulting in an invalid memslot, indicating that the memslot
is undergoing a removal process.  Faulting in a GPA from an invalid
memslot will never succeed, and holding SRCU prevents memslot deletion
from succeeding, i.e. retrying when the memslot is being actively deleted
will lead to (breakable) deadlock.

Opportunistically export kvm_vcpu_gfn_to_memslot() to allow for a per-vCPU
lookup (which, strictly speaking, is unnecessary since TDX doesn't support
SMM, but aligns the TDX code with the MMU code).

Fixes: b0327bb2e7 ("KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY")
Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Closes: https://lore.kernel.org/all/20250519023737.30360-1-yan.y.zhao@intel.com
[Yan: Wrote patch log, comment, fixed a minor error, function export]
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20250822070523.26495-1-yan.y.zhao@intel.com
[sean: massage changelog, relocate and tweak comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-10 12:06:35 -07:00
Fuad Tabba
3d3a04fad2 KVM: Allow and advertise support for host mmap() on guest_memfd files
Now that all the x86 and arm64 plumbing for mmap() on guest_memfd is in
place, allow userspace to set GUEST_MEMFD_FLAG_MMAP and advertise support
via a new capability, KVM_CAP_GUEST_MEMFD_MMAP.

The availability of this capability is determined per architecture, and
its enablement for a specific guest_memfd instance is controlled by the
GUEST_MEMFD_FLAG_MMAP flag at creation time.

Update the KVM API documentation to detail the KVM_CAP_GUEST_MEMFD_MMAP
capability, the associated GUEST_MEMFD_FLAG_MMAP, and provide essential
information regarding support for mmap in guest_memfd.

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:37:03 -04:00
Fuad Tabba
576d035e2a KVM: guest_memfd: Track guest_memfd mmap support in memslot
Add a new internal flag, KVM_MEMSLOT_GMEM_ONLY, to the top half of
memslot->flags (which makes it strictly for KVM's internal use). This
flag tracks when a guest_memfd-backed memory slot supports host
userspace mmap operations, which implies that all memory, not just
private memory for CoCo VMs, is consumed through guest_memfd: "gmem
only".

This optimization avoids repeatedly checking the underlying guest_memfd
file for mmap support, which would otherwise require taking and
releasing a reference on the file for each check. By caching this
information directly in the memslot, we reduce overhead and simplify the
logic involved in handling guest_memfd-backed pages for host mappings.

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Fuad Tabba
a12578e147 KVM: guest_memfd: Add plumbing to host to map guest_memfd pages
Introduce the core infrastructure to enable host userspace to mmap()
guest_memfd-backed memory. This is needed for several evolving KVM use
cases:

* Non-CoCo VM backing: Allows VMMs like Firecracker to run guests
  entirely backed by guest_memfd, even for non-CoCo VMs [1]. This
  provides a unified memory management model and simplifies guest memory
  handling.

* Direct map removal for enhanced security: This is an important step
  for direct map removal of guest memory [2]. By allowing host userspace
  to fault in guest_memfd pages directly, we can avoid maintaining host
  kernel direct maps of guest memory. This provides additional hardening
  against Spectre-like transient execution attacks by removing a
  potential attack surface within the kernel.

* Future guest_memfd features: This also lays the groundwork for future
  enhancements to guest_memfd, such as supporting huge pages and
  enabling in-place sharing of guest memory with the host for CoCo
  platforms that permit it [3].

Enable the basic mmap and fault handling logic within guest_memfd, but
hold off on allow userspace to actually do mmap() until the architecture
support is also in place.

[1] https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
[2] https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@amazon.com
[3] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com/T/#u

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Fuad Tabba
d1e54dd08f KVM: x86: Enable KVM_GUEST_MEMFD for all 64-bit builds
Enable KVM_GUEST_MEMFD for all KVM x86 64-bit builds, i.e. for "default"
VM types when running on 64-bit KVM.  This will allow using guest_memfd
to back non-private memory for all VM shapes, by supporting mmap() on
guest_memfd.

Opportunistically clean up various conditionals that become tautologies
once x86 selects KVM_GUEST_MEMFD more broadly.  Specifically, because
SW protected VMs, SEV, and TDX are all 64-bit only, private memory no
longer needs to take explicit dependencies on KVM_GUEST_MEMFD, because
it is effectively a prerequisite.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Fuad Tabba
69116e01f6 KVM: Fix comments that refer to slots_lock
Fix comments so that they refer to slots_lock instead of slots_locks
(remove trailing s).

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Fuad Tabba
923310be23 KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem()
Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() to improve
clarity and accurately reflect its purpose.

The function kvm_slot_can_be_private() was previously used to check if a
given kvm_memory_slot is backed by guest_memfd. However, its name
implied that the memory in such a slot was exclusively "private".

As guest_memfd support expands to include non-private memory (e.g.,
shared host mappings), it's important to remove this association. The
new name, kvm_slot_has_gmem(), states that the slot is backed by
guest_memfd without making assumptions about the memory's privacy
attributes.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Fuad Tabba
36cf63bb5d KVM: Rename CONFIG_KVM_GENERIC_PRIVATE_MEM to CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE
The original name was vague regarding its functionality. This Kconfig
option specifically enables and gates the kvm_gmem_populate() function,
which is responsible for populating a GPA range with guest data.

The new name, HAVE_KVM_ARCH_GMEM_POPULATE, describes the purpose of the
option: to enable arch-specific guest_memfd population mechanisms. It
also follows the same pattern as the other HAVE_KVM_ARCH_* configuration
options.

This improves clarity for developers and ensures the name accurately
reflects the functionality it controls, especially as guest_memfd
support expands beyond purely "private" memory scenarios.

Temporarily keep KVM_GENERIC_PRIVATE_MEM as an x86-only config so as to
minimize churn, and to hopefully make it easier to see what features
require HAVE_KVM_ARCH_GMEM_POPULATE.  On that note, omit GMEM_POPULATE
for KVM_X86_SW_PROTECTED_VM, as regular ol' memset() suffices for
software-protected VMs.

As for KVM_GENERIC_PRIVATE_MEM, a future change will select KVM_GUEST_MEMFD
for all 64-bit KVM builds, at which point the intermediate config will
become obsolete and can/will be dropped.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Fuad Tabba
19a9a1ab5c KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GUEST_MEMFD
Rename the Kconfig option CONFIG_KVM_PRIVATE_MEM to
CONFIG_KVM_GUEST_MEMFD. The original name implied that the feature only
supported "private" memory. However, CONFIG_KVM_PRIVATE_MEM enables
guest_memfd in general, which is not exclusively for private memory.
Subsequent patches in this series will add guest_memfd support for
non-CoCo VMs, whose memory is not private.

Renaming the Kconfig option to CONFIG_KVM_GUEST_MEMFD more accurately
reflects its broader scope as the main Kconfig option for all
guest_memfd-backed memory. This provides clearer semantics for the
option and avoids confusion as new features are introduced.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:34:59 -04:00
Qianfeng Rong
cf6a8401b6 KVM: remove redundant __GFP_NOWARN
Commit 16f5dfbc85 ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made
GFP_NOWAIT implicitly include __GFP_NOWARN.

Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g.,
`GFP_NOWAIT | __GFP_NOWARN`) is now redundant.  Let's clean up these
redundant flags across subsystems.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Link: https://lore.kernel.org/r/20250807144943.581663-1-rongqianfeng@vivo.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19 11:51:13 -07:00
Paolo Bonzini
9de13951d5 Merge tag 'kvm-x86-no_assignment-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM VFIO device assignment cleanups for 6.17

Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking now
that KVM no longer uses assigned_device_count as a bad heuristic for "VM has
an irqbypass producer" or for "VM has access to host MMIO".
2025-07-29 08:36:42 -04:00
Paolo Bonzini
cc5a1021aa Merge tag 'kvm-x86-dirty_ring-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM Dirty Ring changes for 6.17

Fix issues with dirty ring harvesting where KVM doesn't bound the processing
of entries in any way, which allows userspace to keep KVM in a tight loop
indefinitely.  Clean up code and comments along the way.
2025-07-29 08:36:41 -04:00
Paolo Bonzini
d284562862 Merge tag 'kvm-x86-generic-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM generic changes for 6.17

 - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues related
   to private <=> shared memory conversions.

 - Drop guest_memfd's .getattr() implementation as the VFS layer will call
   generic_fillattr() if inode_operations.getattr is NULL.
2025-07-29 08:36:41 -04:00
Paolo Bonzini
f02b1bcc73 Merge tag 'kvm-x86-irqs-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM IRQ changes for 6.17

 - Rework irqbypass to track/match producers and consumers via an xarray
   instead of a linked list.  Using a linked list leads to O(n^2) insertion
   times, which is hugely problematic for use cases that create large numbers
   of VMs.  Such use cases typically don't actually use irqbypass, but
   eliminating the pointless registration is a future problem to solve as it
   likely requires new uAPI.

 - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *",
   to avoid making a simple concept unnecessarily difficult to understand.

 - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC,
   and PIT emulation at compile time.

 - Drop x86's irq_comm.c, and move a pile of IRQ related code into irq.c.

 - Fix a variety of flaws and bugs in the AVIC device posted IRQ code.

 - Inhibited AVIC if a vCPU's ID is too big (relative to what hardware
   supports) instead of rejecting vCPU creation.

 - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning
   clear in the vCPU's physical ID table entry.

 - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by
   erratum #1235, to allow (safely) enabling AVIC on such CPUs.

 - Dedup x86's device posted IRQ code, as the vast majority of functionality
   can be shared verbatime between SVM and VMX.

 - Harden the device posted IRQ code against bugs and runtime errors.

 - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1)
   instead of O(n).

 - Generate GA Log interrupts if and only if the target vCPU is blocking, i.e.
   only if KVM needs a notification in order to wake the vCPU.

 - Decouple device posted IRQs from VFIO device assignment, as binding a VM to
   a VFIO group is not a requirement for enabling device posted IRQs.

 - Clean up and document/comment the irqfd assignment code.

 - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e.
   ensure an eventfd is bound to at most one irqfd through the entire host,
   and add a selftest to verify eventfd:irqfd bindings are globally unique.
2025-07-29 08:35:46 -04:00
Shivank Garg
87d4fbf4a3 KVM: guest_memfd: Remove redundant kvm_gmem_getattr implementation
Remove the redundant kvm_gmem_getattr() implementation that simply calls
generic_fillattr() without any special handling. The VFS layer
(vfs_getattr_nosec()) will call generic_fillattr() by default when no
custom getattr operation is provided in the inode_operations structure.

This is a cleanup with no functional change.

Signed-off-by: Shivank Garg <shivankg@amd.com>
Reviewed-By: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250602172317.10601-1-shivankg@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 13:42:33 -07:00
Sean Christopherson
bbc13ae593 VFIO: KVM: x86: Drop kvm_arch_{start,end}_assignment()
Drop kvm_arch_{start,end}_assignment() and all associated code now that
KVM x86 no longer consumes assigned_device_count.  Tracking whether or not
a VFIO-assigned device is formally associated with a VM is fundamentally
flawed, as such an association is optional for general usage, i.e. is prone
to false negatives.  E.g. prior to commit 2edd9cb79f ("kvm: detect
assigned device via irqbypass manager"), device passthrough via VFIO would
fail to enable IRQ bypass if userspace omitted the formal VFIO<=>KVM
binding.

And device drivers that *need* the VFIO<=>KVM connection, e.g. KVM-GT,
shouldn't be relying on generic x86 tracking infrastructure.

Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250523011756.3243624-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 09:51:33 -07:00
Liam Merwick
47bb584237 KVM: Allow CPU to reschedule while setting per-page memory attributes
When running an SEV-SNP guest with a sufficiently large amount of memory (1TB+),
the host can experience CPU soft lockups when running an operation in
kvm_vm_set_mem_attributes() to set memory attributes on the whole
range of guest memory.

watchdog: BUG: soft lockup - CPU#8 stuck for 26s! [qemu-kvm:6372]
CPU: 8 UID: 0 PID: 6372 Comm: qemu-kvm Kdump: loaded Not tainted 6.15.0-rc7.20250520.el9uek.rc1.x86_64 #1 PREEMPT(voluntary)
Hardware name: Oracle Corporation ORACLE SERVER E4-2c/Asm,MB Tray,2U,E4-2c, BIOS 78016600 11/13/2024
RIP: 0010:xas_create+0x78/0x1f0
Code: 00 00 00 41 80 fc 01 0f 84 82 00 00 00 ba 06 00 00 00 bd 06 00 00 00 49 8b 45 08 4d 8d 65 08 41 39 d6 73 20 83 ed 06 48 85 c0 <74> 67 48 89 c2 83 e2 03 48 83 fa 02 75 0c 48 3d 00 10 00 00 0f 87
RSP: 0018:ffffad890a34b940 EFLAGS: 00000286
RAX: ffff96f30b261daa RBX: ffffad890a34b9c8 RCX: 0000000000000000
RDX: 000000000000001e RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffad890a356868
R13: ffffad890a356860 R14: 0000000000000000 R15: ffffad890a356868
FS:  00007f5578a2a400(0000) GS:ffff97ed317e1000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f015c70fb18 CR3: 00000001109fd006 CR4: 0000000000f70ef0
PKRU: 55555554
Call Trace:
 <TASK>
 xas_store+0x58/0x630
 __xa_store+0xa5/0x130
 xa_store+0x2c/0x50
 kvm_vm_set_mem_attributes+0x343/0x710 [kvm]
 kvm_vm_ioctl+0x796/0xab0 [kvm]
 __x64_sys_ioctl+0xa3/0xd0
 do_syscall_64+0x8c/0x7a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f5578d031bb
Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d 4c 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe0a742b88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000000004020aed2 RCX: 00007f5578d031bb
RDX: 00007ffe0a742c80 RSI: 000000004020aed2 RDI: 000000000000000b
RBP: 0000010000000000 R08: 0000010000000000 R09: 0000017680000000
R10: 0000000000000080 R11: 0000000000000246 R12: 00005575e5f95120
R13: 00007ffe0a742c80 R14: 0000000000000008 R15: 00005575e5f961e0

While looping through the range of memory setting the attributes,
call cond_resched() to give the scheduler a chance to run a higher
priority task on the runqueue if necessary and avoid staying in
kernel mode long enough to trigger the lockup.

Fixes: 5a475554db ("KVM: Introduce per-page memory attributes")
Cc: stable@vger.kernel.org # 6.12.x
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250609091121.2497429-2-liam.merwick@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 12:20:17 -07:00
Sean Christopherson
b599d44a71 KVM: Drop sanity check that per-VM list of irqfds is unique
Now that the eventfd's waitqueue ensures it has at most one priority
waiter, i.e. prevents KVM from binding multiple irqfds to one eventfd,
drop KVM's sanity check that eventfds are unique for a single VM.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:59 -07:00
Sean Christopherson
2cdd64cbf9 KVM: Disallow binding multiple irqfds to an eventfd with a priority waiter
Disallow binding an irqfd to an eventfd that already has a priority waiter,
i.e. to an eventfd that already has an attached irqfd.  KVM always
operates in exclusive mode for EPOLL_IN (unconditionally returns '1'),
i.e. only the first waiter will be notified.

KVM already disallows binding multiple irqfds to an eventfd in a single
VM, but doesn't guard against multiple VMs binding to an eventfd.  Adding
the extra protection reduces the pain of a userspace VMM bug, e.g. if
userspace fails to de-assign before re-assigning when transferring state
for intra-host migration, then the migration will explicitly fail as
opposed to dropping IRQs on the destination VM.

Temporarily keep KVM's manual check on irqfds.items, but add a WARN, e.g.
to allow sanity checking the waitqueue enforcement.

Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: David Matlack <dmatlack@google.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:59 -07:00
Sean Christopherson
867347bb21 sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority()
Drop the setting of WQ_FLAG_EXCLUSIVE from add_wait_queue_priority() and
instead have callers manually add the flag prior to adding their structure
to the queue.  Blindly setting WQ_FLAG_EXCLUSIVE is flawed, as the nature
of exclusive, priority waiters means that only the first waiter added will
ever receive notifications.

Pushing the flawed behavior to callers will allow fixing the problem one
hypervisor at a time (KVM added the flawed API, and then KVM's code was
copy+pasted nearly verbatim by Xen and Hyper-V), and will also allow for
adding an API that provides true exclusivity, i.e. that guarantees at most
one priority waiter is in the queue.

Opportunistically add a comment in Hyper-V to call out the mess.  Xen
privcmd's irqfd_wakefup() doesn't actually operate in exclusive mode, i.e.
can be "fixed" simply by dropping WQ_FLAG_EXCLUSIVE.  And KVM is primed to
switch to the aforementioned fully exclusive API, i.e. won't be carrying
the flawed code for long.

No functional change intended.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:57 -07:00
Sean Christopherson
86e00cd162 KVM: Add irqfd to eventfd's waitqueue while holding irqfds.lock
Add an irqfd to its target eventfd's waitqueue while holding irqfds.lock,
which is mildly terrifying but functionally safe.  irqfds.lock is taken
inside the waitqueue's lock, but if and only if the eventfd is being
released, i.e. that path is mutually exclusive with registration as KVM
holds a reference to the eventfd (and obviously must do so to avoid UAF).

This will allow using the eventfd's waitqueue to enforce KVM's requirement
that eventfd is assigned to at most one irqfd, without introducing races.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:56 -07:00
Sean Christopherson
5f8ca05ea9 KVM: Add irqfd to KVM's list via the vfs_poll() callback
Add the irqfd structure to KVM's list of irqfds in kvm_irqfd_register(),
i.e. via the vfs_poll() callback.  This will allow taking irqfds.lock
across the entire registration sequence (add to waitqueue, add to list),
and more importantly will allow inserting into KVM's list if and only if
adding to the waitqueue succeeds (spoiler alert), without needing to
juggle return codes in weird ways.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:55 -07:00
Sean Christopherson
b5c543518a KVM: Initialize irqfd waitqueue callback when adding to the queue
Initialize the irqfd waitqueue callback immediately prior to inserting the
irqfd into the eventfd's waitqueue.  Pre-initializing the state in a
completely different context is all kinds of confusing, and incorrectly
suggests that the waitqueue function needs to be initialize prior to
vfs_poll().

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:55 -07:00
Sean Christopherson
140768a7bf KVM: Acquire SCRU lock outside of irqfds.lock during assignment
Acquire SRCU outside of irqfds.lock so that the locking is symmetrical,
and add a comment explaining why on earth KVM holds SRCU for so long.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:54 -07:00
Sean Christopherson
283ed5001d KVM: Use a local struct to do the initial vfs_poll() on an irqfd
Use a function-local struct for the poll_table passed to vfs_poll(), as
nothing in the vfs_poll() callchain grabs a long-term reference to the
structure, i.e. its lifetime doesn't need to be tied to the irqfd.  Using
a local structure will also allow propagating failures out of the polling
callback without further polluting kvm_kernel_irqfd.

Opportunstically rename irqfd_ptable_queue_proc() to kvm_irqfd_register()
to capture what it actually does.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:53 -07:00
Sean Christopherson
77bb184ab8 KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing()
Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing().
Calling arch code to know whether or not to call arch code is absurd.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250611224604.313496-35-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:33 -07:00
Sean Christopherson
b33252b9d1 KVM: Don't WARN if updating IRQ bypass route fails
Don't bother WARNing if updating an IRTE route fails now that vendor code
provides much more precise WARNs.  The generic WARN doesn't provide enough
information to actually debug the problem, and has obviously done nothing
to surface the myriad bugs in KVM x86's implementation.

Drop all of the associated return code plumbing that existed just so that
common KVM could WARN.

Link: https://lore.kernel.org/r/20250611224604.313496-34-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:32 -07:00
Liam Merwick
aa006b2e51 KVM: fix typo in kvm_vm_set_mem_attributes() comment
It should be 'has' in the sentence and not 'as'.

Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250609091121.2497429-4-liam.merwick@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:55:12 -07:00
Liam Merwick
741e595f02 KVM: Add trace_kvm_vm_set_mem_attributes()
Add a tracing function that, for a guest memory range, displays
the start and end addresses plus the per-page attributes being set.

Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250609091121.2497429-3-liam.merwick@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:55:12 -07:00
Sean Christopherson
cb21073767 KVM: Pass new routing entries and irqfd when updating IRTEs
When updating IRTEs in response to a GSI routing or IRQ bypass change,
pass the new/current routing information along with the associated irqfd.
This will allow KVM x86 to harden, simplify, and deduplicate its code.

Since adding/removing a bypass producer is now conveniently protected with
irqfds.lock, i.e. can't run concurrently with kvm_irq_routing_update(),
use the routing information cached in the irqfd instead of looking up
the information in the current GSI routing tables.

Opportunistically convert an existing printk() to pr_info() and put its
string onto a single line (old code that strictly adhered to 80 chars).

Link: https://lore.kernel.org/r/20250611224604.313496-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:53 -07:00
Sean Christopherson
e295d2e7fb KVM: x86: Trigger I/O APIC route rescan in kvm_arch_irq_routing_update()
Trigger the I/O APIC route rescan that's performed for a split IRQ chip
after userspace updates IRQ routes in kvm_arch_irq_routing_update(), i.e.
before dropping kvm->irq_lock.  Calling kvm_make_all_cpus_request() under
a mutex is perfectly safe, and the smp_wmb()+smp_mb__after_atomic() pair
in __kvm_make_request()+kvm_check_request() ensures the new routing is
visible to vCPUs prior to the request being visible to vCPUs.

In all likelihood, commit b053b2aef2 ("KVM: x86: Add EOI exit bitmap
inference") somewhat arbitrarily made the request outside of irq_lock to
avoid holding irq_lock any longer than is strictly necessary.  And then
commit abdb080f7a ("kvm/irqchip: kvm_arch_irq_routing_update renaming
split") took the easy route of adding another arch hook instead of risking
a functional change.

Note, the call to synchronize_srcu_expedited() does NOT provide ordering
guarantees with respect to vCPUs scanning the new routing; as above, the
request infrastructure provides the necessary ordering.  I.e. there's no
need to wait for kvm_scan_ioapic_routes() to complete if it's actively
running, because regardless of whether it grabs the old or new table, the
vCPU will have another KVM_REQ_SCAN_IOAPIC pending, i.e. will rescan again
and see the new mappings.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:41 -07:00
Sean Christopherson
23b54381ce irqbypass: Require producers to pass in Linux IRQ number during registration
Pass in the Linux IRQ associated with an IRQ bypass producer instead of
relying on the caller to set the field prior to registration, as there's
no benefit to relying on callers to do the right thing.

Take care to set producer->irq before __connect(), as KVM expects the IRQ
to be valid as soon as a connection is possible.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:41 -07:00
Sean Christopherson
8394b32fae irqbypass: Use xarray to track producers and consumers
Track IRQ bypass producers and consumers using an xarray to avoid the O(2n)
insertion time associated with walking a list to check for duplicate
entries, and to search for an partner.

At low (tens or few hundreds) total producer/consumer counts, using a list
is faster due to the need to allocate backing storage for xarray.  But as
count creeps into the thousands, xarray wins easily, and can provide
several orders of magnitude better latency at high counts.  E.g. hundreds
of nanoseconds vs. hundreds of milliseconds.

Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: David Matlack <dmatlack@google.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Reported-by: Yong He <alexyonghe@tencent.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217379
Link: https://lore.kernel.org/all/20230801115646.33990-1-likexu@tencent.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:40 -07:00