Pull kvm fixes from Paolo Bonzini:
"Arm:
- Fix trapping regression when no in-kernel irqchip is present
- Check host-provided, untrusted ranges and offsets in pKVM
- Fix regression restoring the ID_PFR1_EL1 register
- Fix vgic ITS locking issues when LPIs are not directly injected
Arm selftests:
- Correct target CPU programming in vgic_lpi_stress selftest
- Fix exposure of SCTLR2_EL2 and ZCR_EL2 in get-reg-list selftest
RISC-V:
- Fix check for local interrupts on riscv32
- Read HGEIP CSR on the correct cpu when checking for IMSIC
interrupts
- Remove automatic I/O mapping from kvm_arch_prepare_memory_region()
x86:
- Inject #UD if the guest attempts to execute SEAMCALL or TDCALL as
KVM doesn't support virtualization the instructions, but the
instructions are gated only by VMXON. That is, they will VM-Exit
instead of taking a #UD and until now this resulted in KVM exiting
to userspace with an emulation error.
- Unload the "FPU" when emulating INIT of XSTATE features if and only
if the FPU is actually loaded, instead of trying to predict when
KVM will emulate an INIT (CET support missed the MP_STATE path).
Add sanity checks to detect and harden against similar bugs in the
future.
- Unregister KVM's GALog notifier (for AVIC) when kvm-amd.ko is
unloaded.
- Use a raw spinlock for svm->ir_list_lock as the lock is taken
during schedule(), and "normal" spinlocks are sleepable locks when
PREEMPT_RT=y.
- Remove guest_memfd bindings on memslot deletion when a gmem file is
dying to fix a use-after-free race found by syzkaller.
- Fix a goof in the EPT Violation handler where KVM checks the wrong
variable when determining if the reported GVA is valid.
- Fix and simplify the handling of LBR virtualization on AMD, which
was made buggy and unnecessarily complicated by nested VM support
Misc:
- Update Oliver's email address"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits)
KVM: nSVM: Fix and simplify LBR virtualization handling with nested
KVM: nSVM: Always recalculate LBR MSR intercepts in svm_update_lbrv()
KVM: SVM: Mark VMCB_LBR dirty when MSR_IA32_DEBUGCTLMSR is updated
MAINTAINERS: Switch myself to using kernel.org address
KVM: arm64: vgic-v3: Release reserved slot outside of lpi_xa's lock
KVM: arm64: vgic-v3: Reinstate IRQ lock ordering for LPI xarray
KVM: arm64: Limit clearing of ID_{AA64PFR0,PFR1}_EL1.GIC to userspace irqchip
KVM: arm64: Set ID_{AA64PFR0,PFR1}_EL1.GIC when GICv3 is configured
KVM: arm64: Make all 32bit ID registers fully writable
KVM: VMX: Fix check for valid GVA on an EPT violation
KVM: guest_memfd: Remove bindings on memslot deletion when gmem is dying
KVM: SVM: switch to raw spinlock for svm->ir_list_lock
KVM: SVM: Make avic_ga_log_notifier() local to avic.c
KVM: SVM: Unregister KVM's GALog notifier on kvm-amd.ko exit
KVM: SVM: Initialize per-CPU svm_data at the end of hardware setup
KVM: x86: Call out MSR_IA32_S_CET is not handled by XSAVES
KVM: x86: Harden KVM against imbalanced load/put of guest FPU state
KVM: x86: Unload "FPU" state on INIT if and only if its currently in-use
KVM: arm64: Check the untrusted offset in FF-A memory share
KVM: arm64: Check range args for pKVM mem transitions
...
The current scheme for handling LBRV when nested is used is very
complicated, especially when L1 does not enable LBRV (i.e. does not set
LBR_CTL_ENABLE_MASK).
To avoid copying LBRs between VMCB01 and VMCB02 on every nested
transition, the current implementation switches between using VMCB01 or
VMCB02 as the source of truth for the LBRs while L2 is running. If L2
enables LBR, VMCB02 is used as the source of truth. When L2 disables
LBR, the LBRs are copied to VMCB01 and VMCB01 is used as the source of
truth. This introduces significant complexity, and incorrect behavior in
some cases.
For example, on a nested #VMEXIT, the LBRs are only copied from VMCB02
to VMCB01 if LBRV is enabled in VMCB01. This is because L2's writes to
MSR_IA32_DEBUGCTLMSR to enable LBR are intercepted and propagated to
VMCB01 instead of VMCB02. However, LBRV is only enabled in VMCB02 when
L2 is running.
This means that if L2 enables LBR and exits to L1, the LBRs will not be
propagated from VMCB02 to VMCB01, because LBRV is disabled in VMCB01.
There is no meaningful difference in CPUID rate in L2 when copying LBRs
on every nested transition vs. the current approach, so do the simple
and correct thing and always copy LBRs between VMCB01 and VMCB02 on
nested transitions (when LBRV is disabled by L1). Drop the conditional
LBRs copying in __svm_{enable/disable}_lbrv() as it is now unnecessary.
VMCB02 becomes the only source of truth for LBRs when L2 is running,
regardless of LBRV being enabled by L1, drop svm_get_lbr_vmcb() and use
svm->vmcb directly in its place.
Fixes: 1d5a1b5860 ("KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251108004524.1600006-4-yosry.ahmed@linux.dev
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
svm_update_lbrv() is called when MSR_IA32_DEBUGCTLMSR is updated, and on
nested transitions where LBRV is used. It checks whether LBRV enablement
needs to be changed in the current VMCB, and if it does, it also
recalculate intercepts to LBR MSRs.
However, there are cases where intercepts need to be updated even when
LBRV enablement doesn't. Example scenario:
- L1 has MSR_IA32_DEBUGCTLMSR cleared.
- L1 runs L2 without LBR_CTL_ENABLE (no LBRV).
- L2 sets DEBUGCTLMSR_LBR in MSR_IA32_DEBUGCTLMSR, svm_update_lbrv()
sets LBR_CTL_ENABLE in VMCB02 and disables intercepts to LBR MSRs.
- L2 exits to L1, svm_update_lbrv() is not called on this transition.
- L1 clears MSR_IA32_DEBUGCTLMSR, svm_update_lbrv() finds that
LBR_CTL_ENABLE is already cleared in VMCB01 and does nothing.
- Intercepts remain disabled, L1 reads to LBR MSRs read the host MSRs.
Fix it by always recalculating intercepts in svm_update_lbrv().
Fixes: 1d5a1b5860 ("KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251108004524.1600006-3-yosry.ahmed@linux.dev
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The APM lists the DbgCtlMsr field as being tracked by the VMCB_LBR clean
bit. Always clear the bit when MSR_IA32_DEBUGCTLMSR is updated.
The history is complicated, it was correctly cleared for L1 before
commit 1d5a1b5860 ("KVM: x86: nSVM: correctly virtualize LBR msrs when
L2 is running"). At that point svm_set_msr() started to rely on
svm_update_lbrv() to clear the bit, but when nested virtualization
is enabled the latter does not always clear it even if MSR_IA32_DEBUGCTLMSR
changed. Go back to clearing it directly in svm_set_msr().
Fixes: 1d5a1b5860 ("KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running")
Reported-by: Matteo Rizzo <matteorizzo@google.com>
Reported-by: evn@google.com
Co-developed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251108004524.1600006-2-yosry.ahmed@linux.dev
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM x86 fixes for 6.18:
- Inject #UD if the guest attempts to execute SEAMCALL or TDCALL as KVM
doesn't support virtualization the instructions, but the instructions
are gated only by VMXON, i.e. will VM-Exit instead of taking a #UD and
thus result in KVM exiting to userspace with an emulation error.
- Unload the "FPU" when emulating INIT of XSTATE features if and only if
the FPU is actually loaded, instead of trying to predict when KVM will
emulate an INIT (CET support missed the MP_STATE path). Add sanity
checks to detect and harden against similar bugs in the future.
- Unregister KVM's GALog notifier (for AVIC) when kvm-amd.ko is unloaded.
- Use a raw spinlock for svm->ir_list_lock as the lock is taken during
schedule(), and "normal" spinlocks are sleepable locks when PREEMPT_RT=y.
- Remove guest_memfd bindings on memslot deletion when a gmem file is dying
to fix a use-after-free race found by syzkaller.
- Fix a goof in the EPT Violation handler where KVM checks the wrong
variable when determining if the reported GVA is valid.
KVM/riscv fixes for 6.18, take #2
- Fix check for local interrupts on riscv32
- Read HGEIP CSR on the correct cpu when checking for IMSIC interrupts
- Remove automatic I/O mapping from kvm_arch_prepare_memory_region()
Zenghui reports that running a KVM guest with an assigned device and
lockdep enabled produces an unfriendly splat due to an inconsistent irq
context when taking the lpi_xa's spinlock.
This is no good as in rare cases the last reference to an LPI can get
dropped after injection of a cached LPI translation. In this case,
vgic_put_irq() will release the IRQ struct and take the lpi_xa's
spinlock to erase it from the xarray.
Reinstate the IRQ ordering and update the lockdep hint accordingly. Note
that there is no irqsave equivalent of might_lock(), so just explictly
grab and release the spinlock on lockdep kernels.
Reported-by: Zenghui Yu <yuzenghui@huawei.com>
Closes: https://lore.kernel.org/kvmarm/b4d7cb0f-f007-0b81-46d1-998b15cc14bc@huawei.com/
Fixes: 982f31bbb5 ("KVM: arm64: vgic-v3: Don't require IRQs be disabled for LPI xarray lock")
Signed-off-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20251107184847.1784820-2-oupton@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Now that the idreg's GIC field is in sync with the irqchip, limit
the runtime clearing of these fields to the pathological case where
we do not have an in-kernel GIC.
While we're at it, use the existing API instead of open-coded
accessors to access the ID regs.
Fixes: 5cb57a1aff ("KVM: arm64: Zero ID_AA64PFR0_EL1.GIC when no GICv3 is presented to the guest")
Reviewed-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20251030122707.2033690-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
32bit ID registers aren't getting much love these days, and are
often missed in updates. One of these updates broke restoring
a GICv2 guest on a GICv3 machine.
Instead of performing a piecemeal fix, just bite the bullet
and make all 32bit ID regs fully writable. KVM itself never
relies on them for anything, and if the VMM wants to mess up
the guest, so be it.
Fixes: 5cb57a1aff ("KVM: arm64: Zero ID_AA64PFR0_EL1.GIC when no GICv3 is presented to the guest")
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Cc: stable@vger.kernel.org
Reviewed-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20251030122707.2033690-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
When unbinding a memslot from a guest_memfd instance, remove the bindings
even if the guest_memfd file is dying, i.e. even if its file refcount has
gone to zero. If the memslot is freed before the file is fully released,
nullifying the memslot side of the binding in kvm_gmem_release() will
write to freed memory, as detected by syzbot+KASAN:
==================================================================
BUG: KASAN: slab-use-after-free in kvm_gmem_release+0x176/0x440 virt/kvm/guest_memfd.c:353
Write of size 8 at addr ffff88807befa508 by task syz.0.17/6022
CPU: 0 UID: 0 PID: 6022 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xca/0x240 mm/kasan/report.c:482
kasan_report+0x118/0x150 mm/kasan/report.c:595
kvm_gmem_release+0x176/0x440 virt/kvm/guest_memfd.c:353
__fput+0x44c/0xa70 fs/file_table.c:468
task_work_run+0x1d4/0x260 kernel/task_work.c:227
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
exit_to_user_mode_loop+0xe9/0x130 kernel/entry/common.c:43
exit_to_user_mode_prepare include/linux/irq-entry-common.h:225 [inline]
syscall_exit_to_user_mode_work include/linux/entry-common.h:175 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:210 [inline]
do_syscall_64+0x2bd/0xfa0 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fbeeff8efc9
</TASK>
Allocated by task 6023:
kasan_save_stack mm/kasan/common.c:56 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
poison_kmalloc_redzone mm/kasan/common.c:397 [inline]
__kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:414
kasan_kmalloc include/linux/kasan.h:262 [inline]
__kmalloc_cache_noprof+0x3e2/0x700 mm/slub.c:5758
kmalloc_noprof include/linux/slab.h:957 [inline]
kzalloc_noprof include/linux/slab.h:1094 [inline]
kvm_set_memory_region+0x747/0xb90 virt/kvm/kvm_main.c:2104
kvm_vm_ioctl_set_memory_region+0x6f/0xd0 virt/kvm/kvm_main.c:2154
kvm_vm_ioctl+0x957/0xc60 virt/kvm/kvm_main.c:5201
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 6023:
kasan_save_stack mm/kasan/common.c:56 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584
poison_slab_object mm/kasan/common.c:252 [inline]
__kasan_slab_free+0x5c/0x80 mm/kasan/common.c:284
kasan_slab_free include/linux/kasan.h:234 [inline]
slab_free_hook mm/slub.c:2533 [inline]
slab_free mm/slub.c:6622 [inline]
kfree+0x19a/0x6d0 mm/slub.c:6829
kvm_set_memory_region+0x9c4/0xb90 virt/kvm/kvm_main.c:2130
kvm_vm_ioctl_set_memory_region+0x6f/0xd0 virt/kvm/kvm_main.c:2154
kvm_vm_ioctl+0x957/0xc60 virt/kvm/kvm_main.c:5201
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Deliberately don't acquire filemap invalid lock when the file is dying as
the lifecycle of f_mapping is outside the purview of KVM. Dereferencing
the mapping is *probably* fine, but there's no need to invalidate anything
as memslot deletion is responsible for zapping SPTEs, and the only code
that can access the dying file is kvm_gmem_release(), whose core code is
mutually exclusive with unbinding.
Note, the mutual exclusivity is also what makes it safe to access the
bindings on a dying gmem instance. Unbinding either runs with slots_lock
held, or after the last reference to the owning "struct kvm" is put, and
kvm_gmem_release() nullifies the slot pointer under slots_lock, and puts
its reference to the VM after that is done.
Reported-by: syzbot+2479e53d0db9b32ae2aa@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68fa7a22.a70a0220.3bf6c6.008b.GAE@google.com
Tested-by: syzbot+2479e53d0db9b32ae2aa@syzkaller.appspotmail.com
Fixes: a7800aa80e ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Cc: stable@vger.kernel.org
Cc: Hillf Danton <hdanton@sina.com>
Reviewed-By: Vishal Annapurve <vannapurve@google.com>
Link: https://patch.msgid.link/20251104011205.3853541-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use a raw spinlock for vcpu_svm.ir_list_lock as the lock can be taken
during schedule() via kvm_sched_out() => __avic_vcpu_put(), and "normal"
spinlocks are sleepable locks when PREEMPT_RT=y.
This fixes the following lockdep warning:
=============================
[ BUG: Invalid wait context ]
6.12.0-146.1640_2124176644.el10.x86_64+debug #1 Not tainted
-----------------------------
qemu-kvm/38299 is trying to lock:
ff11000239725600 (&svm->ir_list_lock){....}-{3:3}, at: __avic_vcpu_put+0xfd/0x300 [kvm_amd]
other info that might help us debug this:
context-{5:5}
2 locks held by qemu-kvm/38299:
#0: ff11000239723ba8 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x240/0xe00 [kvm]
#1: ff11000b906056d8 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2e/0x130
stack backtrace:
CPU: 1 UID: 0 PID: 38299 Comm: qemu-kvm Kdump: loaded Not tainted 6.12.0-146.1640_2124176644.el10.x86_64+debug #1 PREEMPT(voluntary)
Hardware name: AMD Corporation QUARTZ/QUARTZ, BIOS RQZ100AB 09/14/2023
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xb0
__lock_acquire+0x921/0xb80
lock_acquire.part.0+0xbe/0x270
_raw_spin_lock_irqsave+0x46/0x90
__avic_vcpu_put+0xfd/0x300 [kvm_amd]
svm_vcpu_put+0xfa/0x130 [kvm_amd]
kvm_arch_vcpu_put+0x48c/0x790 [kvm]
kvm_sched_out+0x161/0x1c0 [kvm]
prepare_task_switch+0x36b/0xf60
__schedule+0x4f7/0x1890
schedule+0xd4/0x260
xfer_to_guest_mode_handle_work+0x54/0xc0
vcpu_run+0x69a/0xa70 [kvm]
kvm_arch_vcpu_ioctl_run+0xdc0/0x17e0 [kvm]
kvm_vcpu_ioctl+0x39f/0xe00 [kvm]
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://patch.msgid.link/20251030194130.307900-1-mlevitsk@redhat.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Setup the per-CPU SVM data structures at the very end of hardware setup so
that svm_hardware_unsetup() can be used in svm_hardware_setup() to unwind
AVIC setup (for the GALog notifier). Alternatively, the error path could
do an explicit, manual unwind, e.g. by adding a helper to free the per-CPU
structures. But the per-CPU allocations have no interactions or
dependencies, i.e. can comfortably live at the end, and so converting to
a manual unwind would introduce churn and code without providing any
immediate advantage.
Link: https://patch.msgid.link/20251016190643.80529-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Update the comment above is_xstate_managed_msr() to note that
MSR_IA32_S_CET isn't saved/restored by XSAVES/XRSTORS.
MSR_IA32_S_CET isn't part of CET_U/S state as the SDM states:
The register state used by Control-Flow Enforcement Technology (CET)
comprises the two 64-bit MSRs (IA32_U_CET and IA32_PL3_SSP) that manage
CET when CPL = 3 (CET_U state); and the three 64-bit MSRs
(IA32_PL0_SSP–IA32_PL2_SSP) that manage CET when CPL < 3 (CET_S state).
Opportunistically shift the snippet about the safety of loading certain
MSRs to the function comment for kvm_access_xstate_msr(), which is where
the MSRs are actually loaded into hardware.
Fixes: e44eb58334 ("KVM: x86: Load guest FPU state when access XSAVE-managed MSRs")
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://patch.msgid.link/20251028060142.29830-1-chao.gao@intel.com
[sean: shift snippet about safety to kvm_access_xstate_msr()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Assert, via KVM_BUG_ON(), that guest FPU state isn't/is in use when
loading/putting the FPU to help detect KVM bugs without needing an assist
from KASAN. If an imbalanced load/put is detected, skip the redundant
load/put to avoid clobbering guest state and/or crashing the host.
Note, kvm_access_xstate_msr() already provides a similar assertion.
Reviewed-by: Yao Yuan <yaoyuan@linux.alibaba.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://patch.msgid.link/20251030185802.3375059-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Replace the hack added by commit f958bd2314 ("KVM: x86: Fix potential
put_fpu() w/o load_fpu() on MPX platform") with a more robust approach of
unloading+reloading guest FPU state based on whether or not the vCPU's FPU
is currently in-use, i.e. currently loaded. This fixes a bug on hosts
that support CET but not MPX, where kvm_arch_vcpu_ioctl_get_mpstate()
neglects to load FPU state (it only checks for MPX support) and leads to
KVM attempting to put FPU state due to kvm_apic_accept_events() triggering
INIT emulation. E.g. on a host with CET but not MPX, syzkaller+KASAN
generates:
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000004: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027]
CPU: 211 UID: 0 PID: 20451 Comm: syz.9.26 Tainted: G S 6.18.0-smp-DEV #7 NONE
Tainted: [S]=CPU_OUT_OF_SPEC
Hardware name: Google Izumi/izumi, BIOS 0.20250729.1-0 07/29/2025
RIP: 0010:fpu_swap_kvm_fpstate+0x3ce/0x610 ../arch/x86/kernel/fpu/core.c:377
RSP: 0018:ff1100410c167cc0 EFLAGS: 00010202
RAX: 0000000000000004 RBX: 0000000000000020 RCX: 00000000000001aa
RDX: 00000000000001ab RSI: ffffffff817bb960 RDI: 0000000022600000
RBP: dffffc0000000000 R08: ff110040d23c8007 R09: 1fe220081a479000
R10: dffffc0000000000 R11: ffe21c081a479001 R12: ff110040d23c8d98
R13: 00000000fffdc578 R14: 0000000000000000 R15: ff110040d23c8d90
FS: 00007f86dd1876c0(0000) GS:ff11007fc969b000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f86dd186fa8 CR3: 00000040d1dfa003 CR4: 0000000000f73ef0
PKRU: 80000000
Call Trace:
<TASK>
kvm_vcpu_reset+0x80d/0x12c0 ../arch/x86/kvm/x86.c:11818
kvm_apic_accept_events+0x1cb/0x500 ../arch/x86/kvm/lapic.c:3489
kvm_arch_vcpu_ioctl_get_mpstate+0xd0/0x4e0 ../arch/x86/kvm/x86.c:12145
kvm_vcpu_ioctl+0x5e2/0xed0 ../virt/kvm/kvm_main.c:4539
__se_sys_ioctl+0x11d/0x1b0 ../fs/ioctl.c:51
do_syscall_x64 ../arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x6e/0x940 ../arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f86de71d9c9
</TASK>
with a very simple reproducer:
r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000000), 0x80b00, 0x0)
r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
ioctl$KVM_CREATE_IRQCHIP(r1, 0xae60)
r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0)
ioctl$KVM_SET_IRQCHIP(r1, 0x8208ae63, ...)
ioctl$KVM_GET_MP_STATE(r2, 0x8004ae98, &(0x7f00000000c0))
Alternatively, the MPX hack in GET_MP_STATE could be extended to cover CET,
but from a "don't break existing functionality" perspective, that isn't any
less risky than peeking at the state of in_use, and it's far less robust
for a long term solution (as evidenced by this bug).
Reported-by: Alexander Potapenko <glider@google.com>
Fixes: 69cc3e8865 ("KVM: x86: Add XSS support for CET_KERNEL and CET_USER")
Reviewed-by: Yao Yuan <yaoyuan@linux.alibaba.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://patch.msgid.link/20251030185802.3375059-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
There's currently no verification for host issued ranges in most of the
pKVM memory transitions. The end boundary might therefore be subject to
overflow and later checks could be evaded.
Close this loophole with an additional pfn_range_is_valid() check on a
per public function basis. Once this check has passed, it is safe to
convert pfn and nr_pages into a phys_addr_t and a size.
host_unshare_guest transition is already protected via
__check_host_shared_guest(), while assert_host_shared_guest() callers
are already ignoring host checks.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://patch.msgid.link/20251016164541.3771235-1-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
get-reg-list includes ZCR_EL2 in the list of EL2 registers that it looks
for when NV is enabled but does not have any feature gate for this register,
meaning that testing any combination of features that includes EL2 but does
not include SVE will result in a test failure due to a missing register
being reported:
| The following lines are missing registers:
|
| ARM64_SYS_REG(3, 4, 1, 2, 0),
Add ZCR_EL2 to feat_id_regs so that the test knows not to expect to see it
without SVE being enabled.
Fixes: 3a90b6f279 ("KVM: arm64: selftests: get-reg-list: Add base EL2 registers")
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://patch.msgid.link/20251024-kvm-arm64-get-reg-list-zcr-el2-v1-1-0cd0ff75e22f@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Since GITS_TYPER.PTA == 0, the ITS MAPC command demands a CPU ID,
rather than a physical redistributor address, for its RDbase
command argument.
As such, when MAPC-ing guest ITS collections, vgic_lpi_stress iterates
over CPU IDs in the range [0, nr_cpus), passing them as the RDbase
vcpu_id argument to its_send_mapc_cmd().
However, its_encode_target() in the its_send_mapc_cmd() selftest
handler expects RDbase arguments to be formatted with a 16 bit
offset, as shown by the 16-bit target_addr right shift its implementation:
its_mask_encode(&cmd->raw_cmd[2], target_addr >> 16, 51, 16)
At the moment, all CPU IDs passed into its_send_mapc_cmd() have no
offset, therefore becoming 0x0 after the bit shift. Thus, when
vgic_its_cmd_handle_mapc() receives the ITS command in vgic-its.c,
it always interprets the RDbase target CPU as CPU 0. All interrupts
sent to collections will be processed by vCPU 0, which defeats the
purpose of this multi-vCPU test.
Fix by creating procnum_to_rdbase() helper function, which left-shifts
the vCPU parameter received by its_send_mapc_cmd 16 bits before passing
it to its_encode_target for encoding.
Signed-off-by: Maximilian Dittgen <mdittgen@amazon.de>
Link: https://patch.msgid.link/20251020145946.48288-1-mdittgen@amazon.de
Signed-off-by: Marc Zyngier <maz@kernel.org>
Add VMX exit handlers for SEAMCALL and TDCALL to inject a #UD if a non-TD
guest attempts to execute SEAMCALL or TDCALL. Neither SEAMCALL nor TDCALL
is gated by any software enablement other than VMXON, and so will generate
a VM-Exit instead of e.g. a native #UD when executed from the guest kernel.
Note! No unprivileged DoS of the L1 kernel is possible as TDCALL and
SEAMCALL #GP at CPL > 0, and the CPL check is performed prior to the VMX
non-root (VM-Exit) check, i.e. userspace can't crash the VM. And for a
nested guest, KVM forwards unknown exits to L1, i.e. an L2 kernel can
crash itself, but not L1.
Note #2! The Intel® Trust Domain CPU Architectural Extensions spec's
pseudocode shows the CPL > 0 check for SEAMCALL coming _after_ the VM-Exit,
but that appears to be a documentation bug (likely because the CPL > 0
check was incorrectly bundled with other lower-priority #GP checks).
Testing on SPR and EMR shows that the CPL > 0 check is performed before
the VMX non-root check, i.e. SEAMCALL #GPs when executed in usermode.
Note #3! The aforementioned Trust Domain spec uses confusing pseudocode
that says that SEAMCALL will #UD if executed "inSEAM", but "inSEAM"
specifically means in SEAM Root Mode, i.e. in the TDX-Module. The long-
form description explicitly states that SEAMCALL generates an exit when
executed in "SEAM VMX non-root operation". But that's a moot point as the
TDX-Module injects #UD if the guest attempts to execute SEAMCALL, as
documented in the "Unconditionally Blocked Instructions" section of the
TDX-Module base specification.
Cc: stable@vger.kernel.org
Cc: Kai Huang <kai.huang@intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20251016182148.69085-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When executing kvm_riscv_vcpu_aia_has_interrupts, the vCPU may have
migrated and the IMSIC VS-file have not been updated yet, currently
the HGEIP CSR should be read from the imsic->vsfile_cpu ( the pCPU
before migration ) via on_each_cpu_mask, but this will trigger an
IPI call and repeated IPI within a period of time is expensive in
a many-core systems.
Just let the vCPU execute and update the correct IMSIC VS-file via
kvm_riscv_vcpu_aia_imsic_update may be a simple solution.
Fixes: 4cec89db80 ("RISC-V: KVM: Move HGEI[E|P] CSR access to IMSIC virtualization")
Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
Reviewed-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Anup Patel <anup@brainfault.org>
Tested-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20251016012659.82998-1-fangyu.yu@linux.alibaba.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.