summaryrefslogtreecommitdiff
path: root/arch/x86/include
AgeCommit message (Collapse)Author
2023-07-18add system call uintr_event(int fd)temp2Your Name
2023-07-17add UINTR_EVENT_VECTORtemp1Your Name
2023-07-14sys_uintr_wait 添加超时时间支持.qemu 下建议 2575usuintr_wait_fixYour Name
2021-09-12x86/uintr: Introduce uintr_wait() syscallSohil Mehta
Add a new system call to allow applications to block in the kernel and wait for user interrupts. <The current implementation doesn't support waking up from other blocking system calls like sleep(), read(), epoll(), etc. uintr_wait() is a placeholder syscall while we decide on that behaviour.> When the application makes this syscall the notification vector is switched to a new kernel vector. Any new SENDUIPI will invoke the kernel interrupt which is then used to wake up the process. Currently, the task wait list is global one. To make the implementation scalable there is a need to move to a distributed per-cpu wait list. Signed-off-by: Sohil Mehta <[email protected]>
2021-09-12x86/uintr: Introduce user IPI sender syscallsSohil Mehta
Add a registration syscall for a task to register itself as a user interrupt sender using the uintr_fd generated by the receiver. A task can register multiple uintr_fds. Each unique successful connection creates a new entry in the User Interrupt Target Table (UITT). Each entry in the UITT table is referred by the UITT index (uipi_index). The uipi_index returned during the registration syscall lets a sender generate a user IPI using the 'SENDUIPI <uipi_index>' instruction. Also, add a sender unregister syscall to unregister a particular task from the uintr_fd. Calling close on the uintr_fd will disconnect all threads in a sender process from that FD. Currently, the UITT size is arbitrarily chosen as 256 entries corresponding to a 4KB page. Based on feedback and usage data this can either be increased/decreased or made dynamic later. Architecturally, the UITT table can be unique for each thread or shared across threads of the same thread group. The current implementation keeps the UITT as unique for the each thread. This makes the kernel implementation relatively simple and only threads that use uintr get setup with the related structures. However, this means that the uipi_index for each thread would be inconsistent wrt to other threads. (Executing 'SENDUIPI 2' on threads of the same process could generate different user interrupts.) Alternatively, the benefit of sharing the UITT table is that all threads would see the same view of the UITT table. Also the kernel UITT memory allocation would be more efficient if multiple threads connect to the same uintr_fd. However, this would mean the kernel needs to keep the UITT table size MISC_MSR[] in sync across these threads. Also the UPID/UITT teardown flows might need additional consideration. Signed-off-by: Sohil Mehta <[email protected]>
2021-09-12x86/uintr: Introduce vector registration and uintr_fd syscallSohil Mehta
Each receiving task has its own interrupt vector space of 64 vectors. For each vector registered by a task create a uintr_fd. Only tasks that have previously registered a user interrupt handler can register a vector. The sender for the user interrupt could be another userspace application, kernel or an external source (like a device). Any sender that wants to generate a user interrupt needs access to receiver's vector number and UPID. uintr_fd abstracts that information and allows a sender with access to uintr_fd to connect and generate a user interrupt. Upon interrupt delivery, the interrupt handler would be invoked with the associated vector number pushed onto the stack. Using an FD abstraction automatically provides a secure mechanism to connect with a receiver. It also makes the tracking and management of the interrupt vector resource easier for userspace. uintr_fd can be useful in some of the usages where eventfd is used for userspace event notifications. Though uintr_fd is nowhere close to a drop-in replacement, the semantics are meant to be somewhat similar to an eventfd or the write end of a pipe. Access to uintr_fd can be achieved in the following ways: - direct access if the task is part of the same thread group (process) - inherited by a child process. - explicitly shared using any of the FD sharing mechanisms. If the sender is another userspace task, it can use the uintr_fd to send user IPIs to the receiver. This works in conjunction with the SENDUIPI instruction. The details related to this are covered later. The exact APIs for the sender being a kernel or another external source are still being worked upon. The general idea is that the receiver would pass the uintr_fd to the kernel by extending some existing API (like io_uring). The vector associated with uintr_fd can be unregistered by closing all references to the uintr_fd. Signed-off-by: Sohil Mehta <[email protected]>
2021-09-12x86/process/64: Clean up uintr task fork and exit pathsSohil Mehta
The user interrupt MSRs and the user interrupt state is task specific. During task fork and exit clear the task state, clear the MSRs and dereference the shared resources. Some of the memory resources like the UPID are referenced in the file descriptor and could be in use while the uintr_fd is still valid. Instead of freeing up the UPID just dereference it. Eventually when every user releases the reference the memory resource will be freed up. Signed-off-by: Jacob Pan <[email protected]> Signed-off-by: Sohil Mehta <[email protected]>
2021-09-12x86/process/64: Add uintr task context switch supportSohil Mehta
User interrupt state is saved and restored using xstate supervisor feature support. This includes the MSR state and the User Interrupt Flag (UIF) value. During context switch update the UPID for a uintr task to reflect the current state of the task; namely whether the task should receive interrupt notifications and which cpu the task is currently running on. XSAVES clears the notification vector (UINV) in the MISC MSR to prevent interrupts from being recognized in the UIRR MSR while the task is being context switched. The UINV is restored back when the kernel does an XRSTORS. However, this conflicts with the kernel's lazy restore optimization which skips an XRSTORS if the kernel is scheduling the same user task back and the underlying MSR state hasn't been modified. Special handling is needed for a uintr task in the context switch path to keep using this optimization. Signed-off-by: Jacob Pan <[email protected]> Signed-off-by: Sohil Mehta <[email protected]>
2021-09-12x86/uintr: Introduce uintr receiver syscallsSohil Mehta
Any application that wants to receive a user interrupt needs to register an interrupt handler with the kernel. Add a registration syscall that sets up the interrupt handler and the related kernel structures for the task that makes this syscall. Only one interrupt handler per task can be registered with the kernel/hardware. Each task has its private interrupt vector space of 64 vectors. The vector registration and the related FD management is covered later. Also add an unregister syscall to let a task unregister the interrupt handler. The UPID for each receiver task needs to be updated whenever a task gets context switched or it moves from one cpu to another. This will also be covered later. The system calls haven't been wired up yet so no real harm is done if we don't update the UPID right now. <Code typically in the x86/kernel directory doesn't deal with file descriptor management. I have kept uintr_fd.c separate to make it easier to move it somewhere else if needed.> Signed-off-by: Jacob Pan <[email protected]> Signed-off-by: Sohil Mehta <[email protected]>
2021-09-12x86/irq: Reserve a user IPI notification vectorSohil Mehta
A user interrupt notification vector is used on the receiver's cpu to identify an interrupt as a user interrupt (and not a kernel interrupt). Hardware uses the same notification vector to generate an IPI from a sender's cpu core when the SENDUIPI instruction is executed. Typically, the kernel shouldn't receive an interrupt with this vector. However, it is possible that the kernel might receive this vector. Scenario that can cause the spurious interrupt: Step cpu 0 (receiver task) cpu 1 (sender task) ---- --------------------- ------------------- 1 task is running 2 executes SENDUIPI 3 IPI sent 4 context switched out 5 IPI delivered (kernel interrupt detected) A kernel interrupt can be detected, if a receiver task gets scheduled out after the SENDUIPI-based IPI was sent but before the IPI was delivered. The kernel doesn't need to do anything in this case other than receiving the interrupt and clearing the local APIC. The user interrupt is always stored in the receiver's UPID before the IPI is generated. When the receiver gets scheduled back the interrupt would be delivered based on its UPID. Signed-off-by: Jacob Pan <[email protected]> Signed-off-by: Sohil Mehta <[email protected]>
2021-09-12x86/fpu/xstate: Enumerate User Interrupts supervisor stateSohil Mehta
Enable xstate supervisor support for User Interrupts by default. The user interrupt state for a task consists of the MSR state and the User Interrupt Flag (UIF) value. XSAVES and XRSTORS handle saving and restoring both of these states. <The supervisor XSTATE code might be reworked based on issues reported in the past. The Uintr context switching code would also need rework and additional testing in that regard.> Signed-off-by: Sohil Mehta <[email protected]>
2021-09-12x86/cpu: Enumerate User Interrupts supportSohil Mehta
User Interrupts support including user IPIs is enumerated through cpuid. The 'uintr' flag in /proc/cpuinfo can be used to identify it. The recommended mechanism for user applications to detect support is calling the uintr related syscalls. Use CONFIG_X86_USER_INTERRUPTS to compile with User Interrupts support. The feature can be disabled at boot time using the 'nouintr' kernel parameter. SENDUIPI is a special ring-3 instruction that makes a supervisor mode memory access to the UPID and UITT memory. Currently, KPTI needs to be off for User IPIs to work. Processors that support user interrupts are not affected by Meltdown so the auto mode of KPTI will default to off. Users who want to force enable KPTI will need to wait for a later version of this patch series that is compatible with KPTI. We need to allocate the UPID and UITT structures from a special memory region that has supervisor access but it is mapped into userspace. The plan is to implement a mechanism similar to LDT. Signed-off-by: Jacob Pan <[email protected]> Signed-off-by: Sohil Mehta <[email protected]>
2021-09-08arch: remove compat_alloc_user_spaceArnd Bergmann
All users of compat_alloc_user_space() and copy_in_user() have been removed from the kernel, only a few functions in sparc remain that can be changed to calling arch_copy_in_user() instead. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Al Viro <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Eric Biederman <[email protected]> Cc: Feng Tang <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-07Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "ARM: - Page ownership tracking between host EL1 and EL2 - Rely on userspace page tables to create large stage-2 mappings - Fix incompatibility between pKVM and kmemleak - Fix the PMU reset state, and improve the performance of the virtual PMU - Move over to the generic KVM entry code - Address PSCI reset issues w.r.t. save/restore - Preliminary rework for the upcoming pKVM fixed feature - A bunch of MM cleanups - a vGIC fix for timer spurious interrupts - Various cleanups s390: - enable interpretation of specification exceptions - fix a vcpu_idx vs vcpu_id mixup x86: - fast (lockless) page fault support for the new MMU - new MMU now the default - increased maximum allowed VCPU count - allow inhibit IRQs on KVM_RUN while debugging guests - let Hyper-V-enabled guests run with virtualized LAPIC as long as they do not enable the Hyper-V "AutoEOI" feature - fixes and optimizations for the toggling of AMD AVIC (virtualized LAPIC) - tuning for the case when two-dimensional paging (EPT/NPT) is disabled - bugfixes and cleanups, especially with respect to vCPU reset and choosing a paging mode based on CR0/CR4/EFER - support for 5-level page table on AMD processors Generic: - MMU notifier invalidation callbacks do not take mmu_lock unless necessary - improved caching of LRU kvm_memory_slot - support for histogram statistics - add statistics for halt polling and remote TLB flush requests" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (210 commits) KVM: Drop unused kvm_dirty_gfn_invalid() KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted KVM: MMU: mark role_regs and role accessors as maybe unused KVM: MIPS: Remove a "set but not used" variable x86/kvm: Don't enable IRQ when IRQ enabled in kvm_wait KVM: stats: Add VM stat for remote tlb flush requests KVM: Remove unnecessary export of kvm_{inc,dec}_notifier_count() KVM: x86/mmu: Move lpage_disallowed_link further "down" in kvm_mmu_page KVM: x86/mmu: Relocate kvm_mmu_page.tdp_mmu_page for better cache locality Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()" KVM: x86/mmu: Remove unused field mmio_cached in struct kvm_mmu_page kvm: x86: Increase KVM_SOFT_MAX_VCPUS to 710 kvm: x86: Increase MAX_VCPUS to 1024 kvm: x86: Set KVM_MAX_VCPU_ID to 4*KVM_MAX_VCPUS KVM: VMX: avoid running vmx_handle_exit_irqoff in case of emulation KVM: x86/mmu: Don't freak out if pml5_root is NULL on 4-level host KVM: s390: index kvm->arch.idle_mask by vcpu_idx KVM: s390: Enable specification exception interpretation KVM: arm64: Trim guest debug exception handling KVM: SVM: Add 5-level page table support for SVM ...
2021-09-06kvm: x86: Increase KVM_SOFT_MAX_VCPUS to 710Eduardo Habkost
Support for 710 VCPUs was tested by Red Hat since RHEL-8.4, so increase KVM_SOFT_MAX_VCPUS to 710. Signed-off-by: Eduardo Habkost <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-09-06kvm: x86: Increase MAX_VCPUS to 1024Eduardo Habkost
Increase KVM_MAX_VCPUS to 1024, so we can test larger VMs. I'm not changing KVM_SOFT_MAX_VCPUS yet because I'm afraid it might involve complicated questions around the meaning of "supported" and "recommended" in the upstream tree. KVM_SOFT_MAX_VCPUS will be changed in a separate patch. For reference, visible effects of this change are: - KVM_CAP_MAX_VCPUS will now return 1024 (of course) - Default value for CPUID[HYPERV_CPUID_IMPLEMENT_LIMITS (00x40000005)].EAX will now be 1024 - KVM_MAX_VCPU_ID will change from 1151 to 4096 - Size of struct kvm will increase from 19328 to 22272 bytes (in x86_64) - Size of struct kvm_ioapic will increase from 1780 to 5084 bytes (in x86_64) - Bitmap stack variables that will grow: - At kvm_hv_flush_tlb() kvm_hv_send_ipi(), vp_bitmap[] and vcpu_bitmap[] will now be 128 bytes long - vcpu_bitmap at bioapic_write_indirect() will be 128 bytes long once patch "KVM: x86: Fix stack-out-of-bounds memory access from ioapic_write_indirect()" is applied Signed-off-by: Eduardo Habkost <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-09-06kvm: x86: Set KVM_MAX_VCPU_ID to 4*KVM_MAX_VCPUSEduardo Habkost
Instead of requiring KVM_MAX_VCPU_ID to be manually increased every time we increase KVM_MAX_VCPUS, set it to 4*KVM_MAX_VCPUS. This should be enough for CPU topologies where Cores-per-Package and Packages-per-Socket are not powers of 2. In practice, this increases KVM_MAX_VCPU_ID from 1023 to 1152. The only side effect of this change is making some fields in struct kvm_ioapic larger, increasing the struct size from 1628 to 1780 bytes (in x86_64). Signed-off-by: Eduardo Habkost <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-09-01Merge tag 'hyperv-next-signed-20210831' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - make Hyper-V code arch-agnostic (Michael Kelley) - fix sched_clock behaviour on Hyper-V (Ani Sinha) - fix a fault when Linux runs as the root partition on MSHV (Praveen Kumar) - fix VSS driver (Vitaly Kuznetsov) - cleanup (Sonia Sharma) * tag 'hyperv-next-signed-20210831' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: hv_utils: Set the maximum packet size for VSS driver to the length of the receive buffer Drivers: hv: Enable Hyper-V code to be built on ARM64 arm64: efi: Export screen_info arm64: hyperv: Initialize hypervisor on boot arm64: hyperv: Add panic handler arm64: hyperv: Add Hyper-V hypercall and register access utilities x86/hyperv: fix root partition faults when writing to VP assist page MSR hv: hyperv.h: Remove unused inline functions drivers: hv: Decouple Hyper-V clock/timer code from VMbus drivers x86/hyperv: add comment describing TSC_INVARIANT_CONTROL MSR setting bit 0 Drivers: hv: Move Hyper-V misc functionality to arch-neutral code Drivers: hv: Add arch independent default functions for some Hyper-V handlers Drivers: hv: Make portions of Hyper-V init code be arch neutral x86/hyperv: fix for unwanted manipulation of sched_clock when TSC marked unstable asm-generic/hyperv: Add missing #include of nmi.h
2021-09-01Merge tag 'drm-next-2021-08-31-1' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull drm updates from Dave Airlie: "Highlights: - i915 has seen a lot of refactoring and uAPI cleanups due to a change in the upstream direction going forward This has all been audited with known userspace, but there may be some pitfalls that were missed. - i915 now uses common TTM to enable discrete memory on DG1/2 GPUs - i915 enables Jasper and Elkhart Lake by default and has preliminary XeHP/DG2 support - amdgpu adds support for Cyan Skillfish - lots of implicit fencing rules documented and fixed up in drivers - msm now uses the core scheduler - the irq midlayer has been removed for non-legacy drivers - the sysfb code now works on more than x86. Otherwise the usual smattering of stuff everywhere, panels, bridges, refactorings. Detailed summary: core: - extract i915 eDP backlight into core - DP aux bus support - drm_device.irq_enabled removed - port drivers to native irq interfaces - export gem shadow plane handling for vgem - print proper driver name in framebuffer registration - driver fixes for implicit fencing rules - ARM fixed rate compression modifier added - updated fb damage handling - rmfb ioctl logging/docs - drop drm_gem_object_put_locked - define DRM_FORMAT_MAX_PLANES - add gem fb vmap/vunmap helpers - add lockdep_assert(once) helpers - mark drm irq midlayer as legacy - use offset adjusted bo mapping conversion vgaarb: - cleanups fbdev: - extend efifb handling to all arches - div by 0 fixes for multiple drivers udmabuf: - add hugepage mapping support dma-buf: - non-dynamic exporter fixups - document implicit fencing rules amdgpu: - Initial Cyan Skillfish support - switch virtual DCE over to vkms based atomic - VCN/JPEG power down fixes - NAVI PCIE link handling fixes - AMD HDMI freesync fixes - Yellow Carp + Beige Goby fixes - Clockgating/S0ix/SMU/EEPROM fixes - embed hw fence in job - rework dma-resv handling - ensure eviction to system ram amdkfd: - uapi: SVM address range query added - sysfs leak fix - GPUVM TLB optimizations - vmfault/migration counters i915: - Enable JSL and EHL by default - preliminary XeHP/DG2 support - remove all CNL support (never shipped) - move to TTM for discrete memory support - allow mixed object mmap handling - GEM uAPI spring cleaning - add I915_MMAP_OBJECT_FIXED - reinstate ADL-P mmap ioctls - drop a bunch of unused by userspace features - disable and remove GPU relocations - revert some i915 misfeatures - major refactoring of GuC for Gen11+ - execbuffer object locking separate step - reject caching/set-domain on discrete - Enable pipe DMC loading on XE-LPD and ADL-P - add PSF GV point support - Refactor and fix DDI buffer translations - Clean up FBC CFB allocation code - Finish INTEL_GEN() and friends macro conversions nouveau: - add eDP backlight support - implicit fence fix msm: - a680/7c3 support - drm/scheduler conversion panfrost: - rework GPU reset virtio: - fix fencing for planes ast: - add detect support bochs: - move to tiny GPU driver vc4: - use hotplug irqs - HDMI codec support vmwgfx: - use internal vmware device headers ingenic: - demidlayering irq rcar-du: - shutdown fixes - convert to bridge connector helpers zynqmp-dsub: - misc fixes mgag200: - convert PLL handling to atomic mediatek: - MT8133 AAL support - gem mmap object support - MT8167 support etnaviv: - NXP Layerscape LS1028A SoC support - GEM mmap cleanups tegra: - new user API exynos: - missing unlock fix - build warning fix - use refcount_t" * tag 'drm-next-2021-08-31-1' of git://anongit.freedesktop.org/drm/drm: (1318 commits) drm/amd/display: Move AllowDRAMSelfRefreshOrDRAMClockChangeInVblank to bounding box drm/amd/display: Remove duplicate dml init drm/amd/display: Update bounding box states (v2) drm/amd/display: Update number of DCN3 clock states drm/amdgpu: disable GFX CGCG in aldebaran drm/amdgpu: Clear RAS interrupt status on aldebaran drm/amdgpu: Add support for RAS XGMI err query drm/amdkfd: Account for SH/SE count when setting up cu masks. drm/amdgpu: rename amdgpu_bo_get_preferred_pin_domain drm/amdgpu: drop redundant cancel_delayed_work_sync call drm/amdgpu: add missing cleanups for more ASICs on UVD/VCE suspend drm/amdgpu: add missing cleanups for Polaris12 UVD/VCE on suspend drm/amdkfd: map SVM range with correct access permission drm/amdkfd: check access permisson to restore retry fault drm/amdgpu: Update RAS XGMI Error Query drm/amdgpu: Add driver infrastructure for MCA RAS drm/amd/display: Add Logging for HDMI color depth information drm/amd/amdgpu: consolidate PSP TA init shared buf functions drm/amd/amdgpu: add name field back to ras_common_if drm/amdgpu: Fix build with missing pm_suspend_target_state module export ...
2021-08-31Merge tag 'net-next-5.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Enable memcg accounting for various networking objects. BPF: - Introduce bpf timers. - Add perf link and opaque bpf_cookie which the program can read out again, to be used in libbpf-based USDT library. - Add bpf_task_pt_regs() helper to access user space pt_regs in kprobes, to help user space stack unwinding. - Add support for UNIX sockets for BPF sockmap. - Extend BPF iterator support for UNIX domain sockets. - Allow BPF TCP congestion control progs and bpf iterators to call bpf_setsockopt(), e.g. to switch to another congestion control algorithm. Protocols: - Support IOAM Pre-allocated Trace with IPv6. - Support Management Component Transport Protocol. - bridge: multicast: add vlan support. - netfilter: add hooks for the SRv6 lightweight tunnel driver. - tcp: - enable mid-stream window clamping (by user space or BPF) - allow data-less, empty-cookie SYN with TFO_SERVER_COOKIE_NOT_REQD - more accurate DSACK processing for RACK-TLP - mptcp: - add full mesh path manager option - add partial support for MP_FAIL - improve use of backup subflows - optimize option processing - af_unix: add OOB notification support. - ipv6: add IFLA_INET6_RA_MTU to expose MTU value advertised by the router. - mac80211: Target Wake Time support in AP mode. - can: j1939: extend UAPI to notify about RX status. Driver APIs: - Add page frag support in page pool API. - Many improvements to the DSA (distributed switch) APIs. - ethtool: extend IRQ coalesce uAPI with timer reset modes. - devlink: control which auxiliary devices are created. - Support CAN PHYs via the generic PHY subsystem. - Proper cross-chip support for tag_8021q. - Allow TX forwarding for the software bridge data path to be offloaded to capable devices. Drivers: - veth: more flexible channels number configuration. - openvswitch: introduce per-cpu upcall dispatch. - Add internet mix (IMIX) mode to pktgen. - Transparently handle XDP operations in the bonding driver. - Add LiteETH network driver. - Renesas (ravb): - support Gigabit Ethernet IP - NXP Ethernet switch (sja1105): - fast aging support - support for "H" switch topologies - traffic termination for ports under VLAN-aware bridge - Intel 1G Ethernet - support getcrosststamp() with PCIe PTM (Precision Time Measurement) for better time sync - support Credit-Based Shaper (CBS) offload, enabling HW traffic prioritization and bandwidth reservation - Broadcom Ethernet (bnxt) - support pulse-per-second output - support larger Rx rings - Mellanox Ethernet (mlx5) - support ethtool RSS contexts and MQPRIO channel mode - support LAG offload with bridging - support devlink rate limit API - support packet sampling on tunnels - Huawei Ethernet (hns3): - basic devlink support - add extended IRQ coalescing support - report extended link state - Netronome Ethernet (nfp): - add conntrack offload support - Broadcom WiFi (brcmfmac): - add WPA3 Personal with FT to supported cipher suites - support 43752 SDIO device - Intel WiFi (iwlwifi): - support scanning hidden 6GHz networks - support for a new hardware family (Bz) - Xen pv driver: - harden netfront against malicious backends - Qualcomm mobile - ipa: refactor power management and enable automatic suspend - mhi: move MBIM to WWAN subsystem interfaces Refactor: - Ambient BPF run context and cgroup storage cleanup. - Compat rework for ndo_ioctl. Old code removal: - prism54 remove the obsoleted driver, deprecated by the p54 driver. - wan: remove sbni/granch driver" * tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1715 commits) net: Add depends on OF_NET for LiteX's LiteETH ipv6: seg6: remove duplicated include net: hns3: remove unnecessary spaces net: hns3: add some required spaces net: hns3: clean up a type mismatch warning net: hns3: refine function hns3_set_default_feature() ipv6: remove duplicated 'net/lwtunnel.h' include net: w5100: check return value after calling platform_get_resource() net/mlxbf_gige: Make use of devm_platform_ioremap_resourcexxx() net: mdio: mscc-miim: Make use of the helper function devm_platform_ioremap_resource() net: mdio-ipq4019: Make use of devm_platform_ioremap_resource() fou: remove sparse errors ipv4: fix endianness issue in inet_rtm_getroute_build_skb() octeontx2-af: Set proper errorcode for IPv4 checksum errors octeontx2-af: Fix static code analyzer reported issues octeontx2-af: Fix mailbox errors in nix_rss_flowkey_cfg octeontx2-af: Fix loop in free and unmap counter af_unix: fix potential NULL deref in unix_dgram_connect() dpaa2-eth: Replace strlcpy with strscpy octeontx2-af: Use NDC TX for transmit packet data ...
2021-08-30Merge tag 'x86-irq-2021-08-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 PIRQ updates from Thomas Gleixner: "A set of updates to support port 0x22/0x23 based PCI configuration space which can be found on various ALi chipsets and is also available on older Intel systems which expose a PIRQ router. While the Intel support is more or less nostalgia, the ALi chips are still in use on popular embedded boards used for routers" * tag 'x86-irq-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Fix typo s/ECLR/ELCR/ for the PIC register x86: Avoid magic number with ELCR register accesses x86/PCI: Add support for the Intel 82426EX PIRQ router x86/PCI: Add support for the Intel 82374EB/82374SB (ESC) PIRQ router x86/PCI: Add support for the ALi M1487 (IBC) PIRQ router x86: Add support for 0x22/0x23 port I/O configuration space
2021-08-30Merge tag 'x86-cpu-2021-08-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cache flush updates from Thomas Gleixner: "A reworked version of the opt-in L1D flush mechanism. This is a stop gap for potential future speculation related hardware vulnerabilities and a mechanism for truly security paranoid applications. It allows a task to request that the L1D cache is flushed when the kernel switches to a different mm. This can be requested via prctl(). Changes vs the previous versions: - Get rid of the software flush fallback - Make the handling consistent with other mitigations - Kill the task when it ends up on a SMT enabled core which defeats the purpose of L1D flushing obviously" * tag 'x86-cpu-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation: Add L1D flushing Documentation x86, prctl: Hook L1D flushing in via prctl x86/mm: Prepare for opt-in based L1D flush in switch_mm() x86/process: Make room for TIF_SPEC_L1D_FLUSH sched: Add task_work callback for paranoid L1D flush x86/mm: Refactor cond_ibpb() to support other use cases x86/smp: Add a per-cpu view of SMT state
2021-08-30Merge tag 'perf-core-2021-08-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 perf event updates from Ingo Molnar: - Add support for Intel Sapphire Rapids server CPU uncore events - Allow the AMD uncore driver to be built as a module - Misc cleanups and fixes * tag 'perf-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) perf/x86/amd/ibs: Add bitfield definitions in new <asm/amd-ibs.h> header perf/amd/uncore: Allow the driver to be built as a module x86/cpu: Add get_llc_id() helper function perf/amd/uncore: Clean up header use, use <linux/ include paths instead of <asm/ perf/amd/uncore: Simplify code, use free_percpu()'s built-in check for NULL perf/hw_breakpoint: Replace deprecated CPU-hotplug functions perf/x86/intel: Replace deprecated CPU-hotplug functions perf/x86: Remove unused assignment to pointer 'e' perf/x86/intel/uncore: Fix IIO cleanup mapping procedure for SNR/ICX perf/x86/intel/uncore: Support IMC free-running counters on Sapphire Rapids server perf/x86/intel/uncore: Support IIO free-running counters on Sapphire Rapids server perf/x86/intel/uncore: Factor out snr_uncore_mmio_map() perf/x86/intel/uncore: Add alias PMU name perf/x86/intel/uncore: Add Sapphire Rapids server MDF support perf/x86/intel/uncore: Add Sapphire Rapids server M3UPI support perf/x86/intel/uncore: Add Sapphire Rapids server UPI support perf/x86/intel/uncore: Add Sapphire Rapids server M2M support perf/x86/intel/uncore: Add Sapphire Rapids server IMC support perf/x86/intel/uncore: Add Sapphire Rapids server PCU support perf/x86/intel/uncore: Add Sapphire Rapids server M2PCIe support ...
2021-08-30Merge tag 'ras_core_for_v5.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS update from Borislav Petkov: "A single RAS change for 5.15: - Do not start processing MCEs logged early because the decoding chain is not up yet - delay that processing until everything is ready" * tag 'ras_core_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Defer processing of early errors
2021-08-30Merge tag 's390-5.15-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Heiko Carstens: - Improve ftrace code patching so that stop_machine is not required anymore. This requires a small common code patch acked by Steven Rostedt: https://lore.kernel.org/linux-s390/[email protected]/ - Enable KCSAN for s390. This comes with a small common code change to fix a compile warning. Acked by Marco Elver: https://lore.kernel.org/r/[email protected] - Add KFENCE support for s390. This also comes with a minimal x86 patch from Marco Elver who said also this can be carried via the s390 tree: https://lore.kernel.org/linux-s390/[email protected]/ - More changes to prepare the decompressor for relocation. - Enable DAT also for CPU restart path. - Final set of register asm removal patches; leaving only three locations where needed and sane. - Add NNPA, Vector-Packed-Decimal-Enhancement Facility 2, PCI MIO support to hwcaps flags. - Cleanup hwcaps implementation. - Add new instructions to in-kernel disassembler. - Various QDIO cleanups. - Add SCLP debug feature. - Various other cleanups and improvements all over the place. * tag 's390-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (105 commits) s390: remove SCHED_CORE from defconfigs s390/smp: do not use nodat_stack for secondary CPU start s390/smp: enable DAT before CPU restart callback is called s390: update defconfigs s390/ap: fix state machine hang after failure to enable irq KVM: s390: generate kvm hypercall functions s390/sclp: add tracing of SCLP interactions s390/debug: add early tracing support s390/debug: fix debug area life cycle s390/debug: keep debug data on resize s390/diag: make restart_part2 a local label s390/mm,pageattr: fix walk_pte_level() early exit s390: fix typo in linker script s390: remove do_signal() prototype and do_notify_resume() function s390/crypto: fix all kernel-doc warnings in vfio_ap_ops.c s390/pci: improve DMA translation init and exit s390/pci: simplify CLP List PCI handling s390/pci: handle FH state mismatch only on disable s390/pci: fix misleading rc in clp_set_pci_fn() s390/boot: factor out offset_vmlinux_info() function ...
2021-08-26perf/x86/amd/ibs: Add bitfield definitions in new <asm/amd-ibs.h> headerKim Phillips
Add <asm/amd-ibs.h> with bitfield definitions for IBS MSRs, and demonstrate usage within the driver. Also move 'struct perf_ibs_data' where it can be shared with the perf tool that will soon be using it. No functional changes. Signed-off-by: Kim Phillips <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-26x86/cpu: Add get_llc_id() helper functionKim Phillips
Factor out a helper function rather than export cpu_llc_id, which is needed in order to be able to build the AMD uncore driver as a module. Signed-off-by: Kim Phillips <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-24x86/mce: Defer processing of early errorsBorislav Petkov
When a fatal machine check results in a system reset, Linux does not clear the error(s) from machine check bank(s) - hardware preserves the machine check banks across a warm reset. During initialization of the kernel after the reboot, Linux reads, logs, and clears all machine check banks. But there is a problem. In: 5de97c9f6d85 ("x86/mce: Factor out and deprecate the /dev/mcelog driver") the call to mce_register_decode_chain() moved later in the boot sequence. This means that /dev/mcelog doesn't see those early error logs. This was partially fixed by: cd9c57cad3fe ("x86/MCE: Dump MCE to dmesg if no consumers") which made sure that the logs were not lost completely by printing to the console. But parsing console logs is error prone. Users of /dev/mcelog should expect to find any early errors logged to standard places. Add a new flag MCP_QUEUE_LOG to machine_check_poll() to be used in early machine check initialization to indicate that any errors found should just be queued to genpool. When mcheck_late_init() is called it will call mce_schedule_work() to actually log and flush any errors queued in the genpool. [ Based on an original patch, commit message by and completely productized by Tony Luck. ] Fixes: 5de97c9f6d85 ("x86/mce: Factor out and deprecate the /dev/mcelog driver") Reported-by: Sumanth Kamatala <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-08-20KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in hostWei Huang
When the 5-level page table CPU flag is set in the host, but the guest has CR4.LA57=0 (including the case of a 32-bit guest), the top level of the shadow NPT page tables will be fixed, consisting of one pointer to a lower-level table and 511 non-present entries. Extend the existing code that creates the fixed PML4 or PDP table, to provide a fixed PML5 table if needed. This is not needed on EPT because the number of layers in the tables is specified in the EPTP instead of depending on the host CR4. Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Wei Huang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-08-20KVM: x86: Allow CPU to force vendor-specific TDP levelWei Huang
AMD future CPUs will require a 5-level NPT if host CR4.LA57 is set. To prevent kvm_mmu_get_tdp_level() from incorrectly changing NPT level on behalf of CPUs, add a new parameter in kvm_configure_mmu() to force a fixed TDP level. Signed-off-by: Wei Huang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-08-20KVM: x86: implement KVM_GUESTDBG_BLOCKIRQMaxim Levitsky
KVM_GUESTDBG_BLOCKIRQ will allow KVM to block all interrupts while running. This change is mostly intended for more robust single stepping of the guest and it has the following benefits when enabled: * Resuming from a breakpoint is much more reliable. When resuming execution from a breakpoint, with interrupts enabled, more often than not, KVM would inject an interrupt and make the CPU jump immediately to the interrupt handler and eventually return to the breakpoint, to trigger it again. From the user point of view it looks like the CPU never executed a single instruction and in some cases that can even prevent forward progress, for example, when the breakpoint is placed by an automated script (e.g lx-symbols), which does something in response to the breakpoint and then continues the guest automatically. If the script execution takes enough time for another interrupt to arrive, the guest will be stuck on the same breakpoint RIP forever. * Normal single stepping is much more predictable, since it won't land the debugger into an interrupt handler. * RFLAGS.TF has less chance to be leaked to the guest: We set that flag behind the guest's back to do single stepping but if single step lands us into an interrupt/exception handler it will be leaked to the guest in the form of being pushed to the stack. This doesn't completely eliminate this problem as exceptions can still happen, but at least this reduces the chances of this happening. Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-08-20KVM: x86/mmu: Add detailed page size statsMingwei Zhang
Existing KVM code tracks the number of large pages regardless of their sizes. Therefore, when large page of 1GB (or larger) is adopted, the information becomes less useful because lpages counts a mix of 1G and 2M pages. So remove the lpages since it is easy for user space to aggregate the info. Instead, provide a comprehensive page stats of all sizes from 4K to 512G. Suggested-by: Ben Gardon <[email protected]> Reviewed-by: David Matlack <[email protected]> Reviewed-by: Ben Gardon <[email protected]> Signed-off-by: Mingwei Zhang <[email protected]> Cc: Jing Zhang <[email protected]> Cc: David Matlack <[email protected]> Cc: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-08-20KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in useVitaly Kuznetsov
APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is, however, possible to track whether the feature was actually used by the guest and only inhibit APICv/AVIC when needed. TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let Windows know that AutoEOI feature should be avoided. While it's up to KVM userspace to set the flag, KVM can help a bit by exposing global APICv/AVIC enablement. Maxim: - always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid, since this feature can be used regardless of AVIC Paolo: - use arch.apicv_update_lock to protect the hv->synic_auto_eoi_used instead of atomic ops Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-08-20KVM: x86: APICv: fix race in kvm_request_apicv_update on SVMMaxim Levitsky
Currently on SVM, the kvm_request_apicv_update toggles the APICv memslot without doing any synchronization. If there is a mismatch between that memslot state and the AVIC state, on one of the vCPUs, an APIC mmio access can be lost: For example: VCPU0: enable the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT VCPU1: access an APIC mmio register. Since AVIC is still disabled on VCPU1, the access will not be intercepted by it, and neither will it cause MMIO fault, but rather it will just be read/written from/to the dummy page mapped into the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT. Fix that by adding a lock guarding the AVIC state changes, and carefully order the operations of kvm_request_apicv_update to avoid this race: 1. Take the lock 2. Send KVM_REQ_APICV_UPDATE 3. Update the apic inhibit reason 4. Release the lock This ensures that at (2) all vCPUs are kicked out of the guest mode, but don't yet see the new avic state. Then only after (4) all other vCPUs can update their AVIC state and resume. Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-08-20KVM: x86: don't disable APICv memslot when inhibitedMaxim Levitsky
Thanks to the former patches, it is now possible to keep the APICv memslot always enabled, and it will be invisible to the guest when it is inhibited This code is based on a suggestion from Sean Christopherson: https://lkml.org/lkml/2021/7/19/2970 Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-08-20KVM: X86: Introduce kvm_mmu_slot_lpages() helpersPeter Xu
Introduce kvm_mmu_slot_lpages() to calculcate lpage_info and rmap array size. The other __kvm_mmu_slot_lpages() can take an extra parameter of npages rather than fetching from the memslot pointer. Start to use the latter one in kvm_alloc_memslot_metadata(). Signed-off-by: Peter Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-08-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
drivers/ptp/Kconfig: 55c8fca1dae1 ("ptp_pch: Restore dependency on PCI") e5f31552674e ("ethernet: fix PTP_1588_CLOCK dependencies") Signed-off-by: Jakub Kicinski <[email protected]>
2021-08-16KVM: nSVM: avoid picking up unsupported bits from L2 in int_ctl (CVE-2021-3653)Maxim Levitsky
* Invert the mask of bits that we pick from L2 in nested_vmcb02_prepare_control * Invert and explicitly use VIRQ related bits bitmask in svm_clear_vintr This fixes a security issue that allowed a malicious L1 to run L2 with AVIC enabled, which allowed the L2 to exploit the uninitialized and enabled AVIC to read/write the host physical memory at some offsets. Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") Signed-off-by: Maxim Levitsky <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-08-13KVM: x86: Move declaration of kvm_spurious_fault() to x86.hUros Bizjak
Move the declaration of kvm_spurious_fault() to KVM's "private" x86.h, it should never be called by anything other than low level KVM code. Cc: Paolo Bonzini <[email protected]> Cc: Sean Christopherson <[email protected]> Signed-off-by: Uros Bizjak <[email protected]> [sean: rebased to a series without __ex()/__kvm_handle_fault_on_reboot()] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-08-13KVM: x86: Kill off __ex() and __kvm_handle_fault_on_reboot()Sean Christopherson
Remove the __kvm_handle_fault_on_reboot() and __ex() macros now that all VMX and SVM instructions use asm goto to handle the fault (or in the case of VMREAD, completely custom logic). Drop kvm_spurious_fault()'s asmlinkage annotation as __kvm_handle_fault_on_reboot() was the only flow that invoked it from assembly code. Cc: Uros Bizjak <[email protected]> Cc: Like Xu <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-08-13KVM: X86: Remove unneeded KVM_DEBUGREG_RELOADLai Jiangshan
Commit ae561edeb421 ("KVM: x86: DR0-DR3 are not clear on reset") added code to ensure eff_db are updated when they're modified through non-standard paths. But there is no reason to also update hardware DRs unless hardware breakpoints are active or DR exiting is disabled, and in those cases updating hardware is handled by KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_BP_ENABLED. KVM_DEBUGREG_RELOAD just causes unnecesarry load of hardware DRs and is better to be removed. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Lai Jiangshan <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-08-13Merge branch 'kvm-tdpmmu-fixes' into HEADPaolo Bonzini
Merge topic branch with fixes for 5.14-rc6 and 5.15 merge window.
2021-08-13KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlockSean Christopherson
Add yet another spinlock for the TDP MMU and take it when marking indirect shadow pages unsync. When using the TDP MMU and L1 is running L2(s) with nested TDP, KVM may encounter shadow pages for the TDP entries managed by L1 (controlling L2) when handling a TDP MMU page fault. The unsync logic is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and misbehaves when a shadow page is marked unsync via a TDP MMU page fault, which runs with mmu_lock held for read, not write. Lack of a critical section manifests most visibly as an underflow of unsync_children in clear_unsync_child_bit() due to unsync_children being corrupted when multiple CPUs write it without a critical section and without atomic operations. But underflow is the best case scenario. The worst case scenario is that unsync_children prematurely hits '0' and leads to guest memory corruption due to KVM neglecting to properly sync shadow pages. Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock would functionally be ok. Usurping the lock could degrade performance when building upper level page tables on different vCPUs, especially since the unsync flow could hold the lock for a comparatively long time depending on the number of indirect shadow pages and the depth of the paging tree. For simplicity, take the lock for all MMUs, even though KVM could fairly easily know that mmu_lock is held for write. If mmu_lock is held for write, there cannot be contention for the inner spinlock, and marking shadow pages unsync across multiple vCPUs will be slow enough that bouncing the kvm_arch cacheline should be in the noise. Note, even though L2 could theoretically be given access to its own EPT entries, a nested MMU must hold mmu_lock for write and thus cannot race against a TDP MMU page fault. I.e. the additional spinlock only _needs_ to be taken by the TDP MMU, as opposed to being taken by any MMU for a VM that is running with the TDP MMU enabled. Holding mmu_lock for read also prevents the indirect shadow page from being freed. But as above, keep it simple and always take the lock. Alternative #1, the TDP MMU could simply pass "false" for can_unsync and effectively disable unsync behavior for nested TDP. Write protecting leaf shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such VMMs typically don't modify TDP entries, but the same may not hold true for non-standard use cases and/or VMMs that are migrating physical pages (from L1's perspective). Alternative #2, the unsync logic could be made thread safe. In theory, simply converting all relevant kvm_mmu_page fields to atomics and using atomic bitops for the bitmap would suffice. However, (a) an in-depth audit would be required, (b) the code churn would be substantial, and (c) legacy shadow paging would incur additional atomic operations in performance sensitive paths for no benefit (to legacy shadow paging). Fixes: a2855afc7ee8 ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU") Cc: [email protected] Cc: Ben Gardon <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-08-10x86: Avoid magic number with ELCR register accessesMaciej W. Rozycki
Define PIC_ELCR1 and PIC_ELCR2 macros for accesses to the ELCR registers implemented by many chipsets in their embedded 8259A PIC cores, avoiding magic numbers that are difficult to handle, and complementing the macros we already have for registers originally defined with discrete 8259A PIC implementations. No functional change. Signed-off-by: Maciej W. Rozycki <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-10x86: Add support for 0x22/0x23 port I/O configuration spaceMaciej W. Rozycki
Define macros and accessors for the configuration space addressed indirectly with an index register and a data register at the port I/O locations of 0x22 and 0x23 respectively. This space is defined by the Intel MultiProcessor Specification for the IMCR register used to switch between the PIC and the APIC mode[1], by Cyrix processors for their configuration[2][3], and also some chipsets. Given the lack of atomicity with the indirect addressing a spinlock is required to protect accesses, although for Cyrix processors it is enough if accesses are executed with interrupts locally disabled, because the registers are local to the accessing CPU, and IMCR is only ever poked at by the BSP and early enough for interrupts not to have been configured yet. Therefore existing code does not have to change or use the new spinlock and neither it does. Put the spinlock in a library file then, so that it does not get pulled unnecessarily for configurations that do not refer it. Convert Cyrix accessors to wrappers so as to retain the brevity and clarity of the `getCx86' and `setCx86' calls. References: [1] "MultiProcessor Specification", Version 1.4, Intel Corporation, Order Number: 242016-006, May 1997, Section 3.6.2.1 "PIC Mode", pp. 3-7, 3-8 [2] "5x86 Microprocessor", Cyrix Corporation, Order Number: 94192-00, July 1995, Section 2.3.2.4 "Configuration Registers", p. 2-23 [3] "6x86 Processor", Cyrix Corporation, Order Number: 94175-01, March 1996, Section 2.4.4 "6x86 Configuration Registers", p. 2-23 Signed-off-by: Maciej W. Rozycki <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-05KVM: xen: do not use struct gfn_to_hva_cachePaolo Bonzini
gfn_to_hva_cache is not thread-safe, so it is usually used only within a vCPU (whose code is protected by vcpu->mutex). The Xen interface implementation has such a cache in kvm->arch, but it is not really used except to store the location of the shared info page. Replace shinfo_set and shinfo_cache with just the value that is passed via KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the initialization value is not zero anymore and therefore kvm_xen_init_vm needs to be introduced. Signed-off-by: Paolo Bonzini <[email protected]>
2021-08-04x86/hyperv: fix root partition faults when writing to VP assist page MSRPraveen Kumar
For root partition the VP assist pages are pre-determined by the hypervisor. The root kernel is not allowed to change them to different locations. And thus, we are getting below stack as in current implementation root is trying to perform write to specific MSR. [ 2.778197] unchecked MSR access error: WRMSR to 0x40000073 (tried to write 0x0000000145ac5001) at rIP: 0xffffffff810c1084 (native_write_msr+0x4/0x30) [ 2.784867] Call Trace: [ 2.791507] hv_cpu_init+0xf1/0x1c0 [ 2.798144] ? hyperv_report_panic+0xd0/0xd0 [ 2.804806] cpuhp_invoke_callback+0x11a/0x440 [ 2.811465] ? hv_resume+0x90/0x90 [ 2.818137] cpuhp_issue_call+0x126/0x130 [ 2.824782] __cpuhp_setup_state_cpuslocked+0x102/0x2b0 [ 2.831427] ? hyperv_report_panic+0xd0/0xd0 [ 2.838075] ? hyperv_report_panic+0xd0/0xd0 [ 2.844723] ? hv_resume+0x90/0x90 [ 2.851375] __cpuhp_setup_state+0x3d/0x90 [ 2.858030] hyperv_init+0x14e/0x410 [ 2.864689] ? enable_IR_x2apic+0x190/0x1a0 [ 2.871349] apic_intr_mode_init+0x8b/0x100 [ 2.878017] x86_late_time_init+0x20/0x30 [ 2.884675] start_kernel+0x459/0x4fb [ 2.891329] secondary_startup_64_no_verify+0xb0/0xbb Since the hypervisor already provides the VP assist pages for root partition, we need to memremap the memory from hypervisor for root kernel to use. The mapping is done in hv_cpu_init during bringup and is unmapped in hv_cpu_die during teardown. Signed-off-by: Praveen Kumar <[email protected]> Reviewed-by: Sunil Muthuswamy <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Wei Liu <[email protected]>
2021-08-04KVM: x86/pmu: Introduce pmc->is_paused to reduce the call time of perf ↵Like Xu
interfaces Based on our observations, after any vm-exit associated with vPMU, there are at least two or more perf interfaces to be called for guest counter emulation, such as perf_event_{pause, read_value, period}(), and each one will {lock, unlock} the same perf_event_ctx. The frequency of calls becomes more severe when guest use counters in a multiplexed manner. Holding a lock once and completing the KVM request operations in the perf context would introduce a set of impractical new interfaces. So we can further optimize the vPMU implementation by avoiding repeated calls to these interfaces in the KVM context for at least one pattern: After we call perf_event_pause() once, the event will be disabled and its internal count will be reset to 0. So there is no need to pause it again or read its value. Once the event is paused, event period will not be updated until the next time it's resumed or reprogrammed. And there is also no need to call perf_event_period twice for a non-running counter, considering the perf_event for a running counter is never paused. Based on this implementation, for the following common usage of sampling 4 events using perf on a 4u8g guest: echo 0 > /proc/sys/kernel/watchdog echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent for i in `seq 1 1 10` do taskset -c 0 perf record \ -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \ /root/br_instr a done the average latency of the guest NMI handler is reduced from 37646.7 ns to 32929.3 ns (~1.14x speed up) on the Intel ICX server. Also, in addition to collecting more samples, no loss of sampling accuracy was observed compared to before the optimization. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Acked-by: Peter Zijlstra <[email protected]>
2021-08-03KVM: const-ify all relevant uses of struct kvm_memory_slotHamza Mahfooz
As alluded to in commit f36f3f2846b5 ("KVM: add "new" argument to kvm_arch_commit_memory_region"), a bunch of other places where struct kvm_memory_slot is used, needs to be refactored to preserve the "const"ness of struct kvm_memory_slot across-the-board. Signed-off-by: Hamza Mahfooz <[email protected]> Message-Id: <[email protected]> [Do not touch body of slot_rmap_walk_init. - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
2021-08-02KVM: x86: Move EDX initialization at vCPU RESET to common codeSean Christopherson
Move the EDX initialization at vCPU RESET, which is now identical between VMX and SVM, into common code. No functional change intended. Reviewed-by: Reiji Watanabe <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>