diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index c21b5823f126..2cf5bae62036 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -32,6 +32,7 @@ the Linux memory management. idle_page_tracking ksm memory-hotplug + multigen_lru nommu-mmap numa_memory_policy numaperf diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst new file mode 100644 index 000000000000..6355f2b5019d --- /dev/null +++ b/Documentation/admin-guide/mm/multigen_lru.rst @@ -0,0 +1,156 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Multi-Gen LRU +============= +The multi-gen LRU is an alternative LRU implementation that optimizes +page reclaim and improves performance under memory pressure. Page +reclaim decides the kernel's caching policy and ability to overcommit +memory. It directly impacts the kswapd CPU usage and RAM efficiency. + +Quick start +=========== +Build the kernel with the following configurations. + +* ``CONFIG_LRU_GEN=y`` +* ``CONFIG_LRU_GEN_ENABLED=y`` + +All set! + +Runtime options +=============== +``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the +following subsections. + +Kill switch +----------- +``enabled`` accepts different values to enable or disable the +following components. Its default value depends on +``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled +unless some of them have unforeseen side effects. Writing to +``enabled`` has no effect when a component is not supported by the +hardware, and valid values will be accepted even when the main switch +is off. + +====== =============================================================== +Values Components +====== =============================================================== +0x0001 The main switch for the multi-gen LRU. +0x0002 Clearing the accessed bit in leaf page table entries in large + batches, when MMU sets it (e.g., on x86). This behavior can + theoretically worsen lock contention (mmap_lock). If it is + disabled, the multi-gen LRU will suffer a minor performance + degradation for workloads that contiguously map hot pages, + whose accessed bits can be otherwise cleared by fewer larger + batches. +0x0004 Clearing the accessed bit in non-leaf page table entries as + well, when MMU sets it (e.g., on x86). This behavior was not + verified on x86 varieties other than Intel and AMD. If it is + disabled, the multi-gen LRU will suffer a negligible + performance degradation. +[yYnN] Apply to all the components above. +====== =============================================================== + +E.g., +:: + + echo y >/sys/kernel/mm/lru_gen/enabled + cat /sys/kernel/mm/lru_gen/enabled + 0x0007 + echo 5 >/sys/kernel/mm/lru_gen/enabled + cat /sys/kernel/mm/lru_gen/enabled + 0x0005 + +Thrashing prevention +-------------------- +Personal computers are more sensitive to thrashing because it can +cause janks (lags when rendering UI) and negatively impact user +experience. The multi-gen LRU offers thrashing prevention to the +majority of laptop and desktop users who do not have ``oomd``. + +Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of +``N`` milliseconds from getting evicted. The OOM killer is triggered +if this working set cannot be kept in memory. In other words, this +option works as an adjustable pressure relief valve, and when open, it +terminates applications that are hopefully not being used. 
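+
+E.g., to ask the kernel to keep the working set of the last second
+resident (see the guidance on values just below):
+::
+
+    echo 1000 >/sys/kernel/mm/lru_gen/min_ttl_ms
+    cat /sys/kernel/mm/lru_gen/min_ttl_ms
+    1000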
+ +Based on the average human detectable lag (~100ms), ``N=1000`` usually +eliminates intolerable janks due to thrashing. Larger values like +``N=3000`` make janks less noticeable at the risk of premature OOM +kills. + +The default value ``0`` means disabled. + +Experimental features +===================== +``/sys/kernel/debug/lru_gen`` accepts commands described in the +following subsections. Multiple command lines are supported, as is +concatenation with the delimiters ``,`` and ``;``. + +``/sys/kernel/debug/lru_gen_full`` provides additional stats for +debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from +evicted generations in this file. + +Working set estimation +---------------------- +Working set estimation measures how much memory an application needs +in a given time interval, and it is usually done with little impact on +the performance of the application. E.g., data centers want to +optimize job scheduling (bin packing) to improve memory utilization. +When a new job comes in, the job scheduler needs to find out whether +each server it manages can allocate a certain amount of memory for +this new job before it can pick a candidate. To do so, the job +scheduler needs to estimate the working sets of the existing jobs. + +When it is read, ``lru_gen`` returns a histogram of numbers of pages +accessed over different time intervals for each memcg and node. +``MAX_NR_GENS`` decides the number of bins for each histogram. The +histograms are noncumulative. +:: + + memcg memcg_id memcg_path + node node_id + min_gen_nr age_in_ms nr_anon_pages nr_file_pages + ... + max_gen_nr age_in_ms nr_anon_pages nr_file_pages + +Each bin contains an estimated number of pages that have been accessed +within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages +and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of +the former is the largest and that of the latter is the smallest. + +Users can write ``+ memcg_id node_id max_gen_nr +[can_swap [force_scan]]`` to ``lru_gen`` to create a new generation +``max_gen_nr+1``. ``can_swap`` defaults to the swap setting and, if it +is set to ``1``, it forces the scan of anon pages when swap is off, +and vice versa. ``force_scan`` defaults to ``1`` and, if it is set to +``0``, it employs heuristics to reduce the overhead, which is likely +to reduce the coverage as well. + +A typical use case is that a job scheduler writes to ``lru_gen`` at a +certain time interval to create new generations, and it ranks the +servers it manages based on the sizes of their cold pages defined by +this time interval. + +Proactive reclaim +----------------- +Proactive reclaim induces page reclaim when there is no memory +pressure. It usually targets cold pages only. E.g., when a new job +comes in, the job scheduler wants to proactively reclaim cold pages on +the server it selected to improve the chance of successfully landing +this new job. + +Users can write ``- memcg_id node_id min_gen_nr [swappiness +[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or +equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than +``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully +aged and therefore cannot be evicted. ``swappiness`` overrides the +default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits +the number of pages to evict. + +A typical use case is that a job scheduler writes to ``lru_gen`` +before it tries to land a new job on a server. If it fails to +materialize enough cold pages because of the overestimation, it +retries on the next server according to the ranking result obtained +from the working set estimation step. This less forceful approach +limits the impacts on the existing jobs.
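+
+Putting the two interfaces together, a single pass of a hypothetical
+job scheduler over memcg ``29`` on node ``0`` might look as follows.
+All IDs, generation numbers and counts below are illustrative, and
+the first write assumes the current ``max_gen_nr`` is ``8``.
+::
+
+    # create generation 9, scanning anon pages even when swap is off
+    echo '+ 29 0 8 1 1' >/sys/kernel/debug/lru_gen
+    # after the chosen interval, read the per-generation histogram
+    cat /sys/kernel/debug/lru_gen
+    # evict up to 1000 pages from generations 6 and older,
+    # overriding swappiness with 10
+    echo '- 29 0 6 10 1000' >/sys/kernel/debug/lru_gen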
diff --git a/Documentation/devicetree/bindings/hwmon/nuvoton,nct6775.yaml b/Documentation/devicetree/bindings/hwmon/nuvoton,nct6775.yaml new file mode 100644 index 000000000000..358b262431fc --- /dev/null +++ b/Documentation/devicetree/bindings/hwmon/nuvoton,nct6775.yaml @@ -0,0 +1,57 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- + +$id: http://devicetree.org/schemas/hwmon/nuvoton,nct6775.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Nuvoton NCT6775 and compatible Super I/O chips + +maintainers: + - Zev Weiss + +properties: + compatible: + enum: + - nuvoton,nct6106 + - nuvoton,nct6116 + - nuvoton,nct6775 + - nuvoton,nct6776 + - nuvoton,nct6779 + - nuvoton,nct6791 + - nuvoton,nct6792 + - nuvoton,nct6793 + - nuvoton,nct6795 + - nuvoton,nct6796 + - nuvoton,nct6797 + - nuvoton,nct6798 + + reg: + maxItems: 1 + + nuvoton,tsi-channel-mask: + description: + Bitmask indicating which TSI temperature sensor channels are + active. LSB is TSI0, bit 1 is TSI1, etc. + $ref: /schemas/types.yaml#/definitions/uint32 + maximum: 0xff + default: 0 + +required: + - compatible + - reg + +additionalProperties: false + +examples: + - | + i2c { + #address-cells = <1>; + #size-cells = <0>; + + superio@4d { + compatible = "nuvoton,nct6779"; + reg = <0x4d>; + nuvoton,tsi-channel-mask = <0x03>; + }; + }; diff --git a/Documentation/hwmon/asus_ec_sensors.rst b/Documentation/hwmon/asus_ec_sensors.rst index e7e8f1640f45..78ca69eda877 100644 --- a/Documentation/hwmon/asus_ec_sensors.rst +++ b/Documentation/hwmon/asus_ec_sensors.rst @@ -4,17 +4,20 @@ Kernel driver asus_ec_sensors ================================= Supported boards: - * PRIME X570-PRO, - * Pro WS X570-ACE, - * ROG CROSSHAIR VIII DARK HERO, + * PRIME X470-PRO + * PRIME X570-PRO + * Pro WS X570-ACE + * ProArt X570-CREATOR WIFI + * ROG CROSSHAIR VIII DARK HERO * ROG CROSSHAIR VIII HERO (WI-FI) - * ROG CROSSHAIR VIII FORMULA, - * ROG CROSSHAIR VIII HERO, - * ROG CROSSHAIR VIII IMPACT, - * ROG STRIX B550-E GAMING, - * ROG STRIX B550-I GAMING, - * ROG STRIX X570-E GAMING, - * ROG STRIX X570-F GAMING, + * ROG CROSSHAIR VIII FORMULA + * ROG CROSSHAIR VIII HERO + * ROG CROSSHAIR VIII IMPACT + * ROG STRIX B550-E GAMING + * ROG STRIX B550-I GAMING + * ROG STRIX X570-E GAMING + * ROG STRIX X570-E GAMING WIFI II + * ROG STRIX X570-F GAMING * ROG STRIX X570-I GAMING Authors: @@ -52,3 +55,5 @@ Module Parameters the path is mostly identical for them). If ASUS changes this path in a future BIOS update, this parameter can be used to override the value stored in the driver until it gets updated. + A special string ":GLOBAL_LOCK" can be passed to use the ACPI + global lock instead of a dedicated mutex. diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index 44365c4574a3..b48434300226 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -25,6 +25,7 @@ algorithms. If you are looking for advice on simply allocating memory, see the ksm memory-model mmu_notifier + multigen_lru numa overcommit-accounting page_migration diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst new file mode 100644 index 000000000000..bc8eaf1b956c --- /dev/null +++ b/Documentation/vm/multigen_lru.rst @@ -0,0 +1,159 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +============= +Multi-Gen LRU +============= +The multi-gen LRU is an alternative LRU implementation that optimizes +page reclaim and improves performance under memory pressure. Page +reclaim decides the kernel's caching policy and ability to overcommit +memory. It directly impacts the kswapd CPU usage and RAM efficiency. + +Design overview +=============== +Objectives +---------- +The design objectives are: + +* Good representation of access recency +* Try to profit from spatial locality +* Fast paths to make obvious choices +* Simple self-correcting heuristics + +The representation of access recency is at the core of all LRU +implementations. In the multi-gen LRU, each generation represents a +group of pages with similar access recency. Generations establish a +(time-based) common frame of reference and therefore help make better +choices, e.g., between different memcgs on a computer or different +computers in a data center (for job scheduling). + +Exploiting spatial locality improves efficiency when gathering the +accessed bit. A rmap walk targets a single page and does not try to +profit from discovering a young PTE. A page table walk can sweep all +the young PTEs in an address space, but the address space can be too +sparse to make a profit. The key is to optimize both methods and use +them in combination. + +Fast paths reduce code complexity and runtime overhead. Unmapped pages +do not require TLB flushes; clean pages do not require writeback. +These facts are only helpful when other conditions, e.g., access +recency, are similar. With generations as a common frame of reference, +additional factors stand out. But obvious choices might not be good +choices; thus self-correction is necessary. + +The benefits of simple self-correcting heuristics are self-evident. +Again, with generations as a common frame of reference, this becomes +attainable. Specifically, pages in the same generation can be +categorized based on additional factors, and a feedback loop can +statistically compare the refault percentages across those categories +and infer which of them are better choices. + +Assumptions +----------- +The protection of hot pages and the selection of cold pages are based +on page access channels and patterns. There are two access channels: + +* Accesses through page tables +* Accesses through file descriptors + +The protection of the former channel is by design stronger because: + +1. The uncertainty in determining the access patterns of the former + channel is higher due to the approximation of the accessed bit. +2. The cost of evicting the former channel is higher due to the TLB + flushes required and the likelihood of encountering the dirty bit. +3. The penalty of underprotecting the former channel is higher because + applications usually do not prepare themselves for major page + faults like they do for blocked I/O. E.g., GUI applications + commonly use dedicated I/O threads to avoid blocking rendering + threads. + +There are also two access patterns: + +* Accesses exhibiting temporal locality +* Accesses not exhibiting temporal locality + +For the reasons listed above, the former channel is assumed to follow +the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is +present, and the latter channel is assumed to follow the latter +pattern unless outlying refaults have been observed. + +Workflow overview +================= +Evictable pages are divided into multiple generations for each +``lruvec``. 
The youngest generation number is stored in +``lrugen->max_seq`` for both anon and file types as they are aged on +an equal footing. The oldest generation numbers are stored in +``lrugen->min_seq[]`` separately for anon and file types as clean file +pages can be evicted regardless of swap constraints. These three +variables are monotonically increasing. + +Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` +bits in order to fit into the gen counter in ``folio->flags``. Each +truncated generation number is an index to ``lrugen->lists[]``. The +sliding window technique is used to track at least ``MIN_NR_GENS`` and +at most ``MAX_NR_GENS`` generations. The gen counter stores a value +within ``[1, MAX_NR_GENS]`` while a page is on one of +``lrugen->lists[]``; otherwise it stores zero. + +Each generation is divided into multiple tiers. Tiers represent +different ranges of numbers of accesses through file descriptors. A +page accessed ``N`` times through file descriptors is in tier +``order_base_2(N)``. In contrast to moving across generations, which +requires the LRU lock, moving across tiers only involves atomic +operations on ``folio->flags`` and therefore has a negligible cost. A +feedback loop modeled after the PID controller monitors refaults over +all the tiers from anon and file types and decides which tiers from +which types to evict or protect. + +There are two conceptually independent procedures: the aging and the +eviction. They form a closed-loop system, i.e., the page reclaim. + +Aging +----- +The aging produces young generations. Given an ``lruvec``, it +increments ``max_seq`` when ``max_seq-min_seq+1`` approaches +``MIN_NR_GENS``. The aging promotes hot pages to the youngest +generation when it finds them accessed through page tables; the +demotion of cold pages happens consequently when it increments +``max_seq``. The aging uses page table walks and rmap walks to find +young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list`` +and calls ``walk_page_range()`` with each ``mm_struct`` on this list +to scan PTEs, and after each iteration, it increments ``max_seq``. For +the latter, when the eviction walks the rmap and finds a young PTE, +the aging scans the adjacent PTEs. For both, on finding a young PTE, +the aging clears the accessed bit and updates the gen counter of the +page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``. + +Eviction +-------- +The eviction consumes old generations. Given an ``lruvec``, it +increments ``min_seq`` when ``lrugen->lists[]`` indexed by +``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to +evict from, it first compares ``min_seq[]`` to select the older type. +If both types are equally old, it selects the one whose first tier has +a lower refault percentage. The first tier contains single-use +unmapped clean pages, which are the best bet. The eviction sorts a +page according to its gen counter if the aging has found this page +accessed through page tables and updated its gen counter. It also +moves a page to the next generation, i.e., ``min_seq+1``, if this page +was accessed multiple times through file descriptors and the feedback +loop has detected outlying refaults from the tier this page is in. To +do this, the feedback loop uses the first tier as the baseline, for +the reason stated earlier. 
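+
+As a minimal userspace sketch of this bookkeeping (``MIN_NR_GENS``,
+``MAX_NR_GENS`` and ``lru_gen_from_seq()`` mirror the kernel's names,
+but the program below is illustrative rather than kernel code):
+::
+
+    #include <stdio.h>
+
+    #define MIN_NR_GENS 2
+    #define MAX_NR_GENS 4
+
+    /* truncated generation number: an index into lrugen->lists[] */
+    static unsigned long lru_gen_from_seq(unsigned long seq)
+    {
+        return seq % MAX_NR_GENS;
+    }
+
+    int main(void)
+    {
+        unsigned long min_seq = 4, max_seq = 7;
+
+        /* the eviction consumes the oldest generations first */
+        while (max_seq - min_seq + 1 > MIN_NR_GENS) {
+            printf("evict generation %lu from lists[%lu]\n",
+                   min_seq, lru_gen_from_seq(min_seq));
+            min_seq++;
+        }
+
+        /* the aging then opens a new youngest generation; note how
+           it reuses the slot vacated by generation 4 */
+        max_seq++;
+        printf("age to generation %lu in lists[%lu]\n",
+               max_seq, lru_gen_from_seq(max_seq));
+        return 0;
+    }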
+ +Summary +------- +The multi-gen LRU can be disassembled into the following parts: + +* Generations +* Page table walks +* Rmap walks +* Bloom filters +* PID controller + +The aging and the eviction form a producer-consumer model; +specifically, the latter drives the former by the sliding window over +generations. Within the aging, rmap walks drive page table walks by +inserting hot densely populated page tables to the Bloom filters. +Within the eviction, the PID controller uses refaults as the feedback +to select types to evict and tiers to protect. diff --git a/MAINTAINERS b/MAINTAINERS index f468864fd268..51a6ba822792 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13536,12 +13536,21 @@ M: Samuel Mendoza-Jonas S: Maintained F: net/ncsi/ -NCT6775 HARDWARE MONITOR DRIVER +NCT6775 HARDWARE MONITOR DRIVER - CORE & PLATFORM DRIVER M: Guenter Roeck L: linux-hwmon@vger.kernel.org S: Maintained F: Documentation/hwmon/nct6775.rst -F: drivers/hwmon/nct6775.c +F: drivers/hwmon/nct6775-core.c +F: drivers/hwmon/nct6775-platform.c +F: drivers/hwmon/nct6775.h + +NCT6775 HARDWARE MONITOR DRIVER - I2C DRIVER +M: Zev Weiss +L: linux-hwmon@vger.kernel.org +S: Maintained +F: Documentation/devicetree/bindings/hwmon/nuvoton,nct6775.yaml +F: drivers/hwmon/nct6775-i2c.c NETDEVSIM M: Jakub Kicinski diff --git a/Makefile b/Makefile index 7d5b0bfe7960..7aa4dd254296 100644 --- a/Makefile +++ b/Makefile @@ -2,7 +2,7 @@ VERSION = 5 PATCHLEVEL = 18 SUBLEVEL = 0 -EXTRAVERSION = +EXTRAVERSION = -pf1 NAME = Superb Owl # *DOCUMENTATION* diff --git a/arch/Kconfig b/arch/Kconfig index 31c4fdc4a4ba..8a2cb732b09e 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -1376,6 +1376,14 @@ config DYNAMIC_SIGFRAME config HAVE_ARCH_NODE_DEV_GROUP bool +config ARCH_HAS_NONLEAF_PMD_YOUNG + bool + help + Architectures that select this option are capable of setting the + accessed bit in non-leaf PMD entries when using them as part of linear + address translations. Page table walkers that clear the accessed bit + may use this capability to reduce their search space. + source "kernel/gcov/Kconfig" source "scripts/gcc-plugins/Kconfig" diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index 3515bc4f16a4..00ff721da300 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -490,3 +490,4 @@ 558 common process_mrelease sys_process_mrelease 559 common futex_waitv sys_futex_waitv 560 common set_mempolicy_home_node sys_ni_syscall +561 common pmadv_ksm sys_pmadv_ksm diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index ac964612d8b0..90933eabe115 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -464,3 +464,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common pmadv_ksm sys_pmadv_ksm diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index dff2b483ea50..8e9c2c837678 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -999,23 +999,13 @@ static inline void update_mmu_cache(struct vm_area_struct *vma, * page after fork() + CoW for pfn mappings. We don't always have a * hardware-managed access flag on arm64. 
*/ -static inline bool arch_faults_on_old_pte(void) -{ - WARN_ON(preemptible()); - - return !cpu_has_hw_af(); -} -#define arch_faults_on_old_pte arch_faults_on_old_pte +#define arch_has_hw_pte_young cpu_has_hw_af /* * Experimentally, it's cheap to set the access flag in hardware and we * benefit from prefaulting mappings as 'old' to start with. */ -static inline bool arch_wants_old_prefaulted_pte(void) -{ - return !arch_faults_on_old_pte(); -} -#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte +#define arch_wants_old_prefaulted_pte cpu_has_hw_af static inline bool pud_sect_supported(void) { diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h index 4e65da3445c7..14c95844f6c8 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -38,7 +38,7 @@ #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800) -#define __NR_compat_syscalls 451 +#define __NR_compat_syscalls 452 #endif #define __ARCH_WANT_SYS_CLONE diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index 604a2053d006..91f2bb7199af 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -907,6 +907,8 @@ __SYSCALL(__NR_process_mrelease, sys_process_mrelease) __SYSCALL(__NR_futex_waitv, sys_futex_waitv) #define __NR_set_mempolicy_home_node 450 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) +#define __NR_pmadv_ksm 451 +__SYSCALL(__NR_pmadv_ksm, sys_pmadv_ksm) /* * Please add new compat syscalls above this comment and update diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl index 78b1d03e86e1..79ad5a5682b3 100644 --- a/arch/ia64/kernel/syscalls/syscall.tbl +++ b/arch/ia64/kernel/syscalls/syscall.tbl @@ -371,3 +371,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common pmadv_ksm sys_pmadv_ksm diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index b1f3940bc298..5ccf925567da 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -450,3 +450,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common pmadv_ksm sys_pmadv_ksm diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index 820145e47350..6b76208597f3 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -456,3 +456,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common pmadv_ksm sys_pmadv_ksm diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index 253ff994ed2e..e4aeedb17c38 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -389,3 +389,4 @@ 448 n32 process_mrelease sys_process_mrelease 449 n32 futex_waitv sys_futex_waitv 450 n32 set_mempolicy_home_node sys_set_mempolicy_home_node +451 n32 pmadv_ksm sys_pmadv_ksm diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl index 3f1886ad9d80..fe88db51efa0 100644 --- a/arch/mips/kernel/syscalls/syscall_n64.tbl +++ 
b/arch/mips/kernel/syscalls/syscall_n64.tbl @@ -365,3 +365,4 @@ 448 n64 process_mrelease sys_process_mrelease 449 n64 futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 n64 pmadv_ksm sys_pmadv_ksm diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl index 8f243e35a7b2..674cb940bd15 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -438,3 +438,4 @@ 448 o32 process_mrelease sys_process_mrelease 449 o32 futex_waitv sys_futex_waitv 450 o32 set_mempolicy_home_node sys_set_mempolicy_home_node +451 o32 pmadv_ksm sys_pmadv_ksm diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index 68b46fe2f17c..e19e8deb4c32 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -448,3 +448,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common pmadv_ksm sys_pmadv_ksm diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index 2600b4237292..bb2f71a36941 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -530,3 +530,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 nospu set_mempolicy_home_node sys_set_mempolicy_home_node +451 common pmadv_ksm sys_pmadv_ksm diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index 799147658dee..1cd523748bd2 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -453,3 +453,4 @@ 448 common process_mrelease sys_process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node sys_set_mempolicy_home_node +451 common pmadv_ksm sys_pmadv_ksm sys_pmadv_ksm diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index 2de85c977f54..cfc75fa43eae 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -453,3 +453,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common pmadv_ksm sys_pmadv_ksm diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index 4398cc6fb68d..d2c0a6426f6b 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -496,3 +496,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common pmadv_ksm sys_pmadv_ksm diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 4bed3abf444d..fce1d9f41cc3 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -86,6 +86,7 @@ config X86 select ARCH_HAS_PMEM_API if X86_64 select ARCH_HAS_PTE_DEVMAP if X86_64 select ARCH_HAS_PTE_SPECIAL + select ARCH_HAS_NONLEAF_PMD_YOUNG if PGTABLE_LEVELS > 2 select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64 select ARCH_HAS_COPY_MC if X86_64 select ARCH_HAS_SET_MEMORY diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu index 542377cd419d..22b919cdb6d1 100644 --- a/arch/x86/Kconfig.cpu +++ b/arch/x86/Kconfig.cpu @@ -157,7 +157,7 @@ config MPENTIUM4 config MK6 - bool "K6/K6-II/K6-III" + bool 
"AMD K6/K6-II/K6-III" depends on X86_32 help Select this for an AMD K6-family processor. Enables use of @@ -165,7 +165,7 @@ config MK6 flags to GCC. config MK7 - bool "Athlon/Duron/K7" + bool "AMD Athlon/Duron/K7" depends on X86_32 help Select this for an AMD Athlon K7-family processor. Enables use of @@ -173,12 +173,98 @@ config MK7 flags to GCC. config MK8 - bool "Opteron/Athlon64/Hammer/K8" + bool "AMD Opteron/Athlon64/Hammer/K8" help Select this for an AMD Opteron or Athlon64 Hammer-family processor. Enables use of some extended instructions, and passes appropriate optimization flags to GCC. +config MK8SSE3 + bool "AMD Opteron/Athlon64/Hammer/K8 with SSE3" + help + Select this for improved AMD Opteron or Athlon64 Hammer-family processors. + Enables use of some extended instructions, and passes appropriate + optimization flags to GCC. + +config MK10 + bool "AMD 61xx/7x50/PhenomX3/X4/II/K10" + help + Select this for an AMD 61xx Eight-Core Magny-Cours, Athlon X2 7x50, + Phenom X3/X4/II, Athlon II X2/X3/X4, or Turion II-family processor. + Enables use of some extended instructions, and passes appropriate + optimization flags to GCC. + +config MBARCELONA + bool "AMD Barcelona" + help + Select this for AMD Family 10h Barcelona processors. + + Enables -march=barcelona + +config MBOBCAT + bool "AMD Bobcat" + help + Select this for AMD Family 14h Bobcat processors. + + Enables -march=btver1 + +config MJAGUAR + bool "AMD Jaguar" + help + Select this for AMD Family 16h Jaguar processors. + + Enables -march=btver2 + +config MBULLDOZER + bool "AMD Bulldozer" + help + Select this for AMD Family 15h Bulldozer processors. + + Enables -march=bdver1 + +config MPILEDRIVER + bool "AMD Piledriver" + help + Select this for AMD Family 15h Piledriver processors. + + Enables -march=bdver2 + +config MSTEAMROLLER + bool "AMD Steamroller" + help + Select this for AMD Family 15h Steamroller processors. + + Enables -march=bdver3 + +config MEXCAVATOR + bool "AMD Excavator" + help + Select this for AMD Family 15h Excavator processors. + + Enables -march=bdver4 + +config MZEN + bool "AMD Zen" + help + Select this for AMD Family 17h Zen processors. + + Enables -march=znver1 + +config MZEN2 + bool "AMD Zen 2" + help + Select this for AMD Family 17h Zen 2 processors. + + Enables -march=znver2 + +config MZEN3 + bool "AMD Zen 3" + depends on (CC_IS_GCC && GCC_VERSION >= 100300) || (CC_IS_CLANG && CLANG_VERSION >= 120000) + help + Select this for AMD Family 19h Zen 3 processors. + + Enables -march=znver3 + config MCRUSOE bool "Crusoe" depends on X86_32 @@ -270,7 +356,7 @@ config MPSC in /proc/cpuinfo. Family 15 is an older Xeon, Family 6 a newer one. config MCORE2 - bool "Core 2/newer Xeon" + bool "Intel Core 2" help Select this for Intel Core 2 and newer Core 2 Xeons (Xeon 51xx and @@ -278,6 +364,8 @@ config MCORE2 family in /proc/cpuinfo. Newer ones have 6 and older ones 15 (not a typo) + Enables -march=core2 + config MATOM bool "Intel Atom" help @@ -287,6 +375,182 @@ config MATOM accordingly optimized code. Use a recent GCC with specific Atom support in order to fully benefit from selecting this option. +config MNEHALEM + bool "Intel Nehalem" + select X86_P6_NOP + help + + Select this for 1st Gen Core processors in the Nehalem family. + + Enables -march=nehalem + +config MWESTMERE + bool "Intel Westmere" + select X86_P6_NOP + help + + Select this for the Intel Westmere formerly Nehalem-C family. 
+ + Enables -march=westmere + +config MSILVERMONT + bool "Intel Silvermont" + select X86_P6_NOP + help + + Select this for the Intel Silvermont platform. + + Enables -march=silvermont + +config MGOLDMONT + bool "Intel Goldmont" + select X86_P6_NOP + help + + Select this for the Intel Goldmont platform including Apollo Lake and Denverton. + + Enables -march=goldmont + +config MGOLDMONTPLUS + bool "Intel Goldmont Plus" + select X86_P6_NOP + help + + Select this for the Intel Goldmont Plus platform including Gemini Lake. + + Enables -march=goldmont-plus + +config MSANDYBRIDGE + bool "Intel Sandy Bridge" + select X86_P6_NOP + help + + Select this for 2nd Gen Core processors in the Sandy Bridge family. + + Enables -march=sandybridge + +config MIVYBRIDGE + bool "Intel Ivy Bridge" + select X86_P6_NOP + help + + Select this for 3rd Gen Core processors in the Ivy Bridge family. + + Enables -march=ivybridge + +config MHASWELL + bool "Intel Haswell" + select X86_P6_NOP + help + + Select this for 4th Gen Core processors in the Haswell family. + + Enables -march=haswell + +config MBROADWELL + bool "Intel Broadwell" + select X86_P6_NOP + help + + Select this for 5th Gen Core processors in the Broadwell family. + + Enables -march=broadwell + +config MSKYLAKE + bool "Intel Skylake" + select X86_P6_NOP + help + + Select this for 6th Gen Core processors in the Skylake family. + + Enables -march=skylake + +config MSKYLAKEX + bool "Intel Skylake X" + select X86_P6_NOP + help + + Select this for 6th Gen Core processors in the Skylake X family. + + Enables -march=skylake-avx512 + +config MCANNONLAKE + bool "Intel Cannon Lake" + select X86_P6_NOP + help + + Select this for 8th Gen Core processors + + Enables -march=cannonlake + +config MICELAKE + bool "Intel Ice Lake" + select X86_P6_NOP + help + + Select this for 10th Gen Core processors in the Ice Lake family. + + Enables -march=icelake-client + +config MCASCADELAKE + bool "Intel Cascade Lake" + select X86_P6_NOP + help + + Select this for Xeon processors in the Cascade Lake family. + + Enables -march=cascadelake + +config MCOOPERLAKE + bool "Intel Cooper Lake" + depends on (CC_IS_GCC && GCC_VERSION > 100100) || (CC_IS_CLANG && CLANG_VERSION >= 100000) + select X86_P6_NOP + help + + Select this for Xeon processors in the Cooper Lake family. + + Enables -march=cooperlake + +config MTIGERLAKE + bool "Intel Tiger Lake" + depends on (CC_IS_GCC && GCC_VERSION > 100100) || (CC_IS_CLANG && CLANG_VERSION >= 100000) + select X86_P6_NOP + help + + Select this for third-generation 10 nm process processors in the Tiger Lake family. + + Enables -march=tigerlake + +config MSAPPHIRERAPIDS + bool "Intel Sapphire Rapids" + depends on (CC_IS_GCC && GCC_VERSION > 110000) || (CC_IS_CLANG && CLANG_VERSION >= 120000) + select X86_P6_NOP + help + + Select this for third-generation 10 nm process processors in the Sapphire Rapids family. + + Enables -march=sapphirerapids + +config MROCKETLAKE + bool "Intel Rocket Lake" + depends on (CC_IS_GCC && GCC_VERSION > 110000) || (CC_IS_CLANG && CLANG_VERSION >= 120000) + select X86_P6_NOP + help + + Select this for eleventh-generation processors in the Rocket Lake family. + + Enables -march=rocketlake + +config MALDERLAKE + bool "Intel Alder Lake" + depends on (CC_IS_GCC && GCC_VERSION > 110000) || (CC_IS_CLANG && CLANG_VERSION >= 120000) + select X86_P6_NOP + help + + Select this for twelfth-generation processors in the Alder Lake family. 
+ + Enables -march=alderlake + config GENERIC_CPU bool "Generic-x86-64" depends on X86_64 @@ -294,6 +558,50 @@ config GENERIC_CPU Generic x86-64 CPU. Run equally well on all x86-64 CPUs. +config GENERIC_CPU2 + bool "Generic-x86-64-v2" + depends on (CC_IS_GCC && GCC_VERSION > 110000) || (CC_IS_CLANG && CLANG_VERSION >= 120000) + depends on X86_64 + help + Generic x86-64 CPU. + Run equally well on all x86-64 CPUs with min support of x86-64-v2. + +config GENERIC_CPU3 + bool "Generic-x86-64-v3" + depends on (CC_IS_GCC && GCC_VERSION > 110000) || (CC_IS_CLANG && CLANG_VERSION >= 120000) + depends on X86_64 + help + Generic x86-64-v3 CPU with v3 instructions. + Run equally well on all x86-64 CPUs with min support of x86-64-v3. + +config GENERIC_CPU4 + bool "Generic-x86-64-v4" + depends on (CC_IS_GCC && GCC_VERSION > 110000) || (CC_IS_CLANG && CLANG_VERSION >= 120000) + depends on X86_64 + help + Generic x86-64 CPU with v4 instructions. + Run equally well on all x86-64 CPUs with min support of x86-64-v4. + +config MNATIVE_INTEL + bool "Intel-Native optimizations autodetected by the compiler" + help + + Clang 3.8, GCC 4.2 and above support -march=native, which automatically detects + the optimum settings to use based on your processor. Do NOT use this + for AMD CPUs. Intel Only! + + Enables -march=native + +config MNATIVE_AMD + bool "AMD-Native optimizations autodetected by the compiler" + help + + Clang 3.8, GCC 4.2 and above support -march=native, which automatically detects + the optimum settings to use based on your processor. Do NOT use this + for Intel CPUs. AMD Only! + + Enables -march=native + endchoice config X86_GENERIC @@ -318,7 +626,7 @@ config X86_INTERNODE_CACHE_SHIFT config X86_L1_CACHE_SHIFT int default "7" if MPENTIUM4 || MPSC - default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MATOM || MVIAC7 || X86_GENERIC || GENERIC_CPU + default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MATOM || MVIAC7 || MK8SSE3 || MK10 || MBARCELONA || MBOBCAT || MJAGUAR || MBULLDOZER || MPILEDRIVER || MSTEAMROLLER || MEXCAVATOR || MZEN || MZEN2 || MZEN3 || MNEHALEM || MWESTMERE || MSILVERMONT || MGOLDMONT || MGOLDMONTPLUS || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MSKYLAKEX || MCANNONLAKE || MICELAKE || MCASCADELAKE || MCOOPERLAKE || MTIGERLAKE || MSAPPHIRERAPIDS || MROCKETLAKE || MALDERLAKE || MNATIVE_INTEL || MNATIVE_AMD || X86_GENERIC || GENERIC_CPU || GENERIC_CPU2 || GENERIC_CPU3 || GENERIC_CPU4 default "4" if MELAN || M486SX || M486 || MGEODEGX1 default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX @@ -336,11 +644,11 @@ config X86_ALIGNMENT_16 config X86_INTEL_USERCOPY def_bool y - depends on MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M586MMX || X86_GENERIC || MK8 || MK7 || MEFFICEON || MCORE2 + depends on MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M586MMX || X86_GENERIC || MK8 || MK7 || MEFFICEON || MCORE2 || MNEHALEM || MWESTMERE || MSILVERMONT || MGOLDMONT || MGOLDMONTPLUS || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MSKYLAKEX || MCANNONLAKE || MICELAKE || MCASCADELAKE || MCOOPERLAKE || MTIGERLAKE || MSAPPHIRERAPIDS || MROCKETLAKE || MALDERLAKE || MNATIVE_INTEL config X86_USE_PPRO_CHECKSUM def_bool y - depends on MWINCHIP3D || MWINCHIPC6 || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MK8 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MCORE2 || MATOM 
+ depends on MWINCHIP3D || MWINCHIPC6 || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MK8 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MCORE2 || MATOM || MK8SSE3 || MK10 || MBARCELONA || MBOBCAT || MJAGUAR || MBULLDOZER || MPILEDRIVER || MSTEAMROLLER || MEXCAVATOR || MZEN || MZEN2 || MZEN3 || MNEHALEM || MWESTMERE || MSILVERMONT || MGOLDMONT || MGOLDMONTPLUS || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MSKYLAKEX || MCANNONLAKE || MICELAKE || MCASCADELAKE || MCOOPERLAKE || MTIGERLAKE || MSAPPHIRERAPIDS || MROCKETLAKE || MALDERLAKE || MNATIVE_INTEL || MNATIVE_AMD # # P6_NOPs are a relatively minor optimization that require a family >= @@ -356,26 +664,26 @@ config X86_USE_PPRO_CHECKSUM config X86_P6_NOP def_bool y depends on X86_64 - depends on (MCORE2 || MPENTIUM4 || MPSC) + depends on (MCORE2 || MPENTIUM4 || MPSC || MNEHALEM || MWESTMERE || MSILVERMONT || MGOLDMONT || MGOLDMONTPLUS || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MSKYLAKEX || MCANNONLAKE || MICELAKE || MCASCADELAKE || MCOOPERLAKE || MTIGERLAKE || MSAPPHIRERAPIDS || MROCKETLAKE || MALDERLAKE || MNATIVE_INTEL) config X86_TSC def_bool y - depends on (MWINCHIP3D || MCRUSOE || MEFFICEON || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MK8 || MVIAC3_2 || MVIAC7 || MGEODEGX1 || MGEODE_LX || MCORE2 || MATOM) || X86_64 + depends on (MWINCHIP3D || MCRUSOE || MEFFICEON || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MK8 || MVIAC3_2 || MVIAC7 || MGEODEGX1 || MGEODE_LX || MCORE2 || MATOM || MK8SSE3 || MK10 || MBARCELONA || MBOBCAT || MJAGUAR || MBULLDOZER || MPILEDRIVER || MSTEAMROLLER || MEXCAVATOR || MZEN || MZEN2 || MZEN3 || MNEHALEM || MWESTMERE || MSILVERMONT || MGOLDMONT || MGOLDMONTPLUS || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MSKYLAKEX || MCANNONLAKE || MICELAKE || MCASCADELAKE || MCOOPERLAKE || MTIGERLAKE || MSAPPHIRERAPIDS || MROCKETLAKE || MALDERLAKE || MNATIVE_INTEL || MNATIVE_AMD) || X86_64 config X86_CMPXCHG64 def_bool y - depends on X86_PAE || X86_64 || MCORE2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586TSC || M586MMX || MATOM || MGEODE_LX || MGEODEGX1 || MK6 || MK7 || MK8 + depends on X86_PAE || X86_64 || MCORE2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586TSC || M586MMX || MATOM || MGEODE_LX || MGEODEGX1 || MK6 || MK7 || MK8 || MK8SSE3 || MK10 || MBARCELONA || MBOBCAT || MJAGUAR || MBULLDOZER || MPILEDRIVER || MSTEAMROLLER || MEXCAVATOR || MZEN || MZEN2 || MZEN3 || MNEHALEM || MWESTMERE || MSILVERMONT || MGOLDMONT || MGOLDMONTPLUS || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MSKYLAKEX || MCANNONLAKE || MICELAKE || MCASCADELAKE || MCOOPERLAKE || MTIGERLAKE || MSAPPHIRERAPIDS || MROCKETLAKE || MALDERLAKE || MNATIVE_INTEL || MNATIVE_AMD # this should be set for all -march=.. options where the compiler # generates cmov. 
config X86_CMOV def_bool y - depends on (MK8 || MK7 || MCORE2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || X86_64 || MATOM || MGEODE_LX) + depends on (MK8 || MK7 || MCORE2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || X86_64 || MATOM || MGEODE_LX || MK8SSE3 || MK10 || MBARCELONA || MBOBCAT || MJAGUAR || MBULLDOZER || MPILEDRIVER || MSTEAMROLLER || MEXCAVATOR || MZEN || MZEN2 || MZEN3 || MNEHALEM || MWESTMERE || MSILVERMONT || MGOLDMONT || MGOLDMONTPLUS || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MSKYLAKEX || MCANNONLAKE || MICELAKE || MCASCADELAKE || MCOOPERLAKE || MTIGERLAKE || MSAPPHIRERAPIDS || MROCKETLAKE || MALDERLAKE || MNATIVE_INTEL || MNATIVE_AMD) config X86_MINIMUM_CPU_FAMILY int default "64" if X86_64 - default "6" if X86_32 && (MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MATOM || MCRUSOE || MCORE2 || MK7 || MK8) + default "6" if X86_32 && (MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MATOM || MCRUSOE || MCORE2 || MK7 || MK8 || MK8SSE3 || MK10 || MBARCELONA || MBOBCAT || MJAGUAR || MBULLDOZER || MPILEDRIVER || MSTEAMROLLER || MEXCAVATOR || MZEN || MZEN2 || MZEN3 || MNEHALEM || MWESTMERE || MSILVERMONT || MGOLDMONT || MGOLDMONTPLUS || MSANDYBRIDGE || MIVYBRIDGE || MHASWELL || MBROADWELL || MSKYLAKE || MSKYLAKEX || MCANNONLAKE || MICELAKE || MCASCADELAKE || MCOOPERLAKE || MTIGERLAKE || MSAPPHIRERAPIDS || MROCKETLAKE || MALDERLAKE || MNATIVE_INTEL || MNATIVE_AMD) default "5" if X86_32 && X86_CMPXCHG64 default "4" diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 63d50f65b828..8cd235d38634 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -143,8 +143,44 @@ else # FIXME - should be integrated in Makefile.cpu (Makefile_32.cpu) cflags-$(CONFIG_MK8) += -march=k8 cflags-$(CONFIG_MPSC) += -march=nocona - cflags-$(CONFIG_MCORE2) += -march=core2 - cflags-$(CONFIG_MATOM) += -march=atom + cflags-$(CONFIG_MK8SSE3) += -march=k8-sse3 + cflags-$(CONFIG_MK10) += -march=amdfam10 + cflags-$(CONFIG_MBARCELONA) += -march=barcelona + cflags-$(CONFIG_MBOBCAT) += -march=btver1 + cflags-$(CONFIG_MJAGUAR) += -march=btver2 + cflags-$(CONFIG_MBULLDOZER) += -march=bdver1 + cflags-$(CONFIG_MPILEDRIVER) += -march=bdver2 -mno-tbm + cflags-$(CONFIG_MSTEAMROLLER) += -march=bdver3 -mno-tbm + cflags-$(CONFIG_MEXCAVATOR) += -march=bdver4 -mno-tbm + cflags-$(CONFIG_MZEN) += -march=znver1 + cflags-$(CONFIG_MZEN2) += -march=znver2 + cflags-$(CONFIG_MZEN3) += -march=znver3 + cflags-$(CONFIG_MNATIVE_INTEL) += -march=native + cflags-$(CONFIG_MNATIVE_AMD) += -march=native + cflags-$(CONFIG_MATOM) += -march=bonnell + cflags-$(CONFIG_MCORE2) += -march=core2 + cflags-$(CONFIG_MNEHALEM) += -march=nehalem + cflags-$(CONFIG_MWESTMERE) += -march=westmere + cflags-$(CONFIG_MSILVERMONT) += -march=silvermont + cflags-$(CONFIG_MGOLDMONT) += -march=goldmont + cflags-$(CONFIG_MGOLDMONTPLUS) += -march=goldmont-plus + cflags-$(CONFIG_MSANDYBRIDGE) += -march=sandybridge + cflags-$(CONFIG_MIVYBRIDGE) += -march=ivybridge + cflags-$(CONFIG_MHASWELL) += -march=haswell + cflags-$(CONFIG_MBROADWELL) += -march=broadwell + cflags-$(CONFIG_MSKYLAKE) += -march=skylake + cflags-$(CONFIG_MSKYLAKEX) += -march=skylake-avx512 + cflags-$(CONFIG_MCANNONLAKE) += -march=cannonlake + cflags-$(CONFIG_MICELAKE) += -march=icelake-client + cflags-$(CONFIG_MCASCADELAKE) += 
-march=cascadelake + cflags-$(CONFIG_MCOOPERLAKE) += -march=cooperlake + cflags-$(CONFIG_MTIGERLAKE) += -march=tigerlake + cflags-$(CONFIG_MSAPPHIRERAPIDS) += -march=sapphirerapids + cflags-$(CONFIG_MROCKETLAKE) += -march=rocketlake + cflags-$(CONFIG_MALDERLAKE) += -march=alderlake + cflags-$(CONFIG_GENERIC_CPU2) += -march=x86-64-v2 + cflags-$(CONFIG_GENERIC_CPU3) += -march=x86-64-v3 + cflags-$(CONFIG_GENERIC_CPU4) += -march=x86-64-v4 cflags-$(CONFIG_GENERIC_CPU) += -mtune=generic KBUILD_CFLAGS += $(cflags-y) diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 320480a8db4f..331aaf1a782f 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -455,3 +455,4 @@ 448 i386 process_mrelease sys_process_mrelease 449 i386 futex_waitv sys_futex_waitv 450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node +451 i386 pmadv_ksm sys_pmadv_ksm diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index c84d12608cd2..14902db4c01f 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -372,6 +372,7 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common pmadv_ksm sys_pmadv_ksm # # Due to a historical design error, certain syscalls are numbered differently diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 62ab07e24aef..9cb3cf4cf6dd 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -820,7 +820,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd) static inline int pmd_bad(pmd_t pmd) { - return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE; + return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) != + (_KERNPG_TABLE & ~_PAGE_ACCESSED); } static inline unsigned long pages_to_mb(unsigned long npg) @@ -1424,10 +1425,10 @@ static inline bool arch_has_pfn_modify_check(void) return boot_cpu_has_bug(X86_BUG_L1TF); } -#define arch_faults_on_old_pte arch_faults_on_old_pte -static inline bool arch_faults_on_old_pte(void) +#define arch_has_hw_pte_young arch_has_hw_pte_young +static inline bool arch_has_hw_pte_young(void) { - return false; + return true; } #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/include/asm/vermagic.h b/arch/x86/include/asm/vermagic.h index 75884d2cdec3..4e6a08d4c7e5 100644 --- a/arch/x86/include/asm/vermagic.h +++ b/arch/x86/include/asm/vermagic.h @@ -17,6 +17,48 @@ #define MODULE_PROC_FAMILY "586MMX " #elif defined CONFIG_MCORE2 #define MODULE_PROC_FAMILY "CORE2 " +#elif defined CONFIG_MNATIVE_INTEL +#define MODULE_PROC_FAMILY "NATIVE_INTEL " +#elif defined CONFIG_MNATIVE_AMD +#define MODULE_PROC_FAMILY "NATIVE_AMD " +#elif defined CONFIG_MNEHALEM +#define MODULE_PROC_FAMILY "NEHALEM " +#elif defined CONFIG_MWESTMERE +#define MODULE_PROC_FAMILY "WESTMERE " +#elif defined CONFIG_MSILVERMONT +#define MODULE_PROC_FAMILY "SILVERMONT " +#elif defined CONFIG_MGOLDMONT +#define MODULE_PROC_FAMILY "GOLDMONT " +#elif defined CONFIG_MGOLDMONTPLUS +#define MODULE_PROC_FAMILY "GOLDMONTPLUS " +#elif defined CONFIG_MSANDYBRIDGE +#define MODULE_PROC_FAMILY "SANDYBRIDGE " +#elif defined CONFIG_MIVYBRIDGE +#define MODULE_PROC_FAMILY "IVYBRIDGE " +#elif defined CONFIG_MHASWELL +#define MODULE_PROC_FAMILY "HASWELL " +#elif defined CONFIG_MBROADWELL +#define MODULE_PROC_FAMILY "BROADWELL " +#elif defined CONFIG_MSKYLAKE +#define MODULE_PROC_FAMILY 
"SKYLAKE " +#elif defined CONFIG_MSKYLAKEX +#define MODULE_PROC_FAMILY "SKYLAKEX " +#elif defined CONFIG_MCANNONLAKE +#define MODULE_PROC_FAMILY "CANNONLAKE " +#elif defined CONFIG_MICELAKE +#define MODULE_PROC_FAMILY "ICELAKE " +#elif defined CONFIG_MCASCADELAKE +#define MODULE_PROC_FAMILY "CASCADELAKE " +#elif defined CONFIG_MCOOPERLAKE +#define MODULE_PROC_FAMILY "COOPERLAKE " +#elif defined CONFIG_MTIGERLAKE +#define MODULE_PROC_FAMILY "TIGERLAKE " +#elif defined CONFIG_MSAPPHIRERAPIDS +#define MODULE_PROC_FAMILY "SAPPHIRERAPIDS " +#elif defined CONFIG_ROCKETLAKE +#define MODULE_PROC_FAMILY "ROCKETLAKE " +#elif defined CONFIG_MALDERLAKE +#define MODULE_PROC_FAMILY "ALDERLAKE " #elif defined CONFIG_MATOM #define MODULE_PROC_FAMILY "ATOM " #elif defined CONFIG_M686 @@ -35,6 +77,30 @@ #define MODULE_PROC_FAMILY "K7 " #elif defined CONFIG_MK8 #define MODULE_PROC_FAMILY "K8 " +#elif defined CONFIG_MK8SSE3 +#define MODULE_PROC_FAMILY "K8SSE3 " +#elif defined CONFIG_MK10 +#define MODULE_PROC_FAMILY "K10 " +#elif defined CONFIG_MBARCELONA +#define MODULE_PROC_FAMILY "BARCELONA " +#elif defined CONFIG_MBOBCAT +#define MODULE_PROC_FAMILY "BOBCAT " +#elif defined CONFIG_MBULLDOZER +#define MODULE_PROC_FAMILY "BULLDOZER " +#elif defined CONFIG_MPILEDRIVER +#define MODULE_PROC_FAMILY "PILEDRIVER " +#elif defined CONFIG_MSTEAMROLLER +#define MODULE_PROC_FAMILY "STEAMROLLER " +#elif defined CONFIG_MJAGUAR +#define MODULE_PROC_FAMILY "JAGUAR " +#elif defined CONFIG_MEXCAVATOR +#define MODULE_PROC_FAMILY "EXCAVATOR " +#elif defined CONFIG_MZEN +#define MODULE_PROC_FAMILY "ZEN " +#elif defined CONFIG_MZEN2 +#define MODULE_PROC_FAMILY "ZEN2 " +#elif defined CONFIG_MZEN3 +#define MODULE_PROC_FAMILY "ZEN3 " #elif defined CONFIG_MELAN #define MODULE_PROC_FAMILY "ELAN " #elif defined CONFIG_MCRUSOE diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 3481b35cb4ec..a224193d84bf 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma, return ret; } -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) int pmdp_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp) { @@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma, return ret; } +#endif + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE int pudp_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pud_t *pudp) { diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index 52c94ab5c205..1518e261d882 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -421,3 +421,4 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common pmadv_ksm sys_pmadv_ksm diff --git a/crypto/drbg.c b/crypto/drbg.c index 177983b6ae38..ea79a31014a4 100644 --- a/crypto/drbg.c +++ b/crypto/drbg.c @@ -115,7 +115,7 @@ * the SHA256 / AES 256 over other ciphers. Thus, the favored * DRBGs are the latest entries in this array. 
*/ -static const struct drbg_core drbg_cores[] = { +const struct drbg_core drbg_cores[] = { #ifdef CONFIG_CRYPTO_DRBG_CTR { .flags = DRBG_CTR | DRBG_STRENGTH128, @@ -192,6 +192,7 @@ static const struct drbg_core drbg_cores[] = { }, #endif /* CONFIG_CRYPTO_DRBG_HMAC */ }; +EXPORT_SYMBOL(drbg_cores); static int drbg_uninstantiate(struct drbg_state *drbg); @@ -207,7 +208,7 @@ static int drbg_uninstantiate(struct drbg_state *drbg); * Return: normalized strength in *bytes* value or 32 as default * to counter programming errors */ -static inline unsigned short drbg_sec_strength(drbg_flag_t flags) +unsigned short drbg_sec_strength(drbg_flag_t flags) { switch (flags & DRBG_STRENGTH_MASK) { case DRBG_STRENGTH128: @@ -220,6 +221,7 @@ static inline unsigned short drbg_sec_strength(drbg_flag_t flags) return 32; } } +EXPORT_SYMBOL(drbg_sec_strength); /* * FIPS 140-2 continuous self test for the noise source @@ -1252,7 +1254,7 @@ static int drbg_seed(struct drbg_state *drbg, struct drbg_string *pers, } /* Free all substructures in a DRBG state without the DRBG state structure */ -static inline void drbg_dealloc_state(struct drbg_state *drbg) +void drbg_dealloc_state(struct drbg_state *drbg) { if (!drbg) return; @@ -1273,12 +1275,13 @@ static inline void drbg_dealloc_state(struct drbg_state *drbg) drbg->fips_primed = false; } } +EXPORT_SYMBOL(drbg_dealloc_state); /* * Allocate all sub-structures for a DRBG state. * The DRBG state structure must already be allocated. */ -static inline int drbg_alloc_state(struct drbg_state *drbg) +int drbg_alloc_state(struct drbg_state *drbg) { int ret = -ENOMEM; unsigned int sb_size = 0; @@ -1359,6 +1362,7 @@ static inline int drbg_alloc_state(struct drbg_state *drbg) drbg_dealloc_state(drbg); return ret; } +EXPORT_SYMBOL(drbg_alloc_state); /************************************************************************* * DRBG interface functions @@ -1895,8 +1899,7 @@ static int drbg_kcapi_sym_ctr(struct drbg_state *drbg, * * return: flags */ -static inline void drbg_convert_tfm_core(const char *cra_driver_name, - int *coreref, bool *pr) +void drbg_convert_tfm_core(const char *cra_driver_name, int *coreref, bool *pr) { int i = 0; size_t start = 0; @@ -1923,6 +1926,7 @@ static inline void drbg_convert_tfm_core(const char *cra_driver_name, } } } +EXPORT_SYMBOL(drbg_convert_tfm_core); static int drbg_kcapi_init(struct crypto_tfm *tfm) { diff --git a/crypto/jitterentropy-kcapi.c b/crypto/jitterentropy-kcapi.c index 2d115bec15ae..e7dac734d237 100644 --- a/crypto/jitterentropy-kcapi.c +++ b/crypto/jitterentropy-kcapi.c @@ -42,8 +42,7 @@ #include #include #include - -#include "jitterentropy.h" +#include /*************************************************************************** * Helper function diff --git a/crypto/jitterentropy.c b/crypto/jitterentropy.c index 93bff3213823..81b80a4d3d3a 100644 --- a/crypto/jitterentropy.c +++ b/crypto/jitterentropy.c @@ -133,7 +133,7 @@ struct rand_data { #define JENT_ENTROPY_SAFETY_FACTOR 64 #include -#include "jitterentropy.h" +#include /*************************************************************************** * Adaptive Proportion Test diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index 55f48375e3fe..93d52a419470 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -452,4 +452,6 @@ config RANDOM_TRUST_BOOTLOADER only mixes the entropy pool. This can also be configured at boot with "random.trust_bootloader=on/off". 
+source "drivers/char/lrng/Kconfig" + endmenu diff --git a/drivers/char/Makefile b/drivers/char/Makefile index 264eb398fdd4..c2676f5d5c96 100644 --- a/drivers/char/Makefile +++ b/drivers/char/Makefile @@ -3,7 +3,8 @@ # Makefile for the kernel character device drivers. # -obj-y += mem.o random.o +obj-y += mem.o +obj-$(CONFIG_RANDOM_DEFAULT_IMPL) += random.o obj-$(CONFIG_TTY_PRINTK) += ttyprintk.o obj-y += misc.o obj-$(CONFIG_ATARI_DSP56K) += dsp56k.o @@ -46,3 +47,5 @@ obj-$(CONFIG_PS3_FLASH) += ps3flash.o obj-$(CONFIG_XILLYBUS_CLASS) += xillybus/ obj-$(CONFIG_POWERNV_OP_PANEL) += powernv-op-panel.o obj-$(CONFIG_ADI) += adi.o + +obj-$(CONFIG_LRNG) += lrng/ diff --git a/drivers/char/lrng/Kconfig b/drivers/char/lrng/Kconfig new file mode 100644 index 000000000000..bfd6b3ea4f1b --- /dev/null +++ b/drivers/char/lrng/Kconfig @@ -0,0 +1,907 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Linux Random Number Generator configuration +# + +config RANDOM_DEFAULT_IMPL + bool "Kernel RNG Default Implementation" + default y + help + The default random number generator as provided with + drivers/char/random.c is selected with this option. + +config LRNG + bool "Linux Random Number Generator" + default n + select CRYPTO_LIB_SHA256 if CRYPTO + help + The Linux Random Number Generator (LRNG) generates entropy + from different entropy sources. Each entropy source can + be enabled and configured independently. The interrupt + entropy source can be configured to be SP800-90B compliant. + The entire LRNG can be configured to be SP800-90C compliant. + Runtime-switchable cryptographic support is available. + The LRNG delivers significant entropy during boot. + + The LRNG also provides compliance to SP800-90A/B/C. + +menu "Linux Random Number Generator Configuration" + depends on LRNG + +if LRNG + +config LRNG_SHA256 + bool + default y if CRYPTO_LIB_SHA256 + +config LRNG_SHA1 + bool + default y if !CRYPTO_LIB_SHA256 + +config LRNG_COMMON_DEV_IF + bool + +config LRNG_DRNG_ATOMIC + bool + select LRNG_DRNG_CHACHA20 + +config LRNG_SYSCTL + bool + depends on SYSCTL + +config LRNG_RANDOM_IF + bool + default n if RANDOM_DEFAULT_IMPL + default y if !RANDOM_DEFAULT_IMPL + select LRNG_COMMON_DEV_IF + select LRNG_DRNG_ATOMIC + select LRNG_SYSCTL + +menu "LRNG Interfaces" + +config LRNG_KCAPI_IF + tristate "Interface with Kernel Crypto API" + depends on CRYPTO_RNG + help + The LRNG can be registered with the kernel crypto API's + random number generator framework. This offers a random + number generator with the name "lrng" and a priority that + is intended to be higher than the existing RNG + implementations. + +config LRNG_HWRAND_IF + tristate "Interface with Hardware Random Number Generator Framework" + depends on HW_RANDOM + select LRNG_DRNG_ATOMIC + help + The LRNG can be registered with the hardware random number + generator framework. This offers a random number generator + with the name "lrng" that is accessible via the framework. + For example it allows pulling data from the LRNG via the + /dev/hwrng file. + +config LRNG_DEV_IF + bool "Character device file interface" + select LRNG_COMMON_DEV_IF + help + The LRNG can create a character device file that operates + identically to /dev/random including IOCTL, read and write + operations. 
+
+endmenu # "LRNG Interfaces"
+
+menu "Entropy Source Configuration"
+
+config LRNG_RUNTIME_ES_CONFIG
+	bool "Enable runtime configuration of entropy sources"
+	help
+	  When enabling this option, the LRNG provides a mechanism
+	  allowing the entropy rate of each entropy source to be
+	  altered at boot time and at runtime.
+
+	  Each entropy source allows its entropy rate to be changed
+	  with a kernel command line option. When no option is
+	  provided, the default specified during kernel compilation
+	  is applied.
+
+comment "Common Timer-based Entropy Source Configuration"
+
+config LRNG_IRQ_DFLT_TIMER_ES
+	bool
+
+config LRNG_SCHED_DFLT_TIMER_ES
+	bool
+
+config LRNG_TIMER_COMMON
+	bool
+
+choice
+	prompt "Default Timer-based Entropy Source"
+	default LRNG_IRQ_DFLT_TIMER_ES
+	depends on LRNG_TIMER_COMMON
+	help
+	  Select the timer-based entropy source that is credited
+	  with entropy. The other timer-based entropy sources may
+	  be operational and provide data, but are credited with no
+	  entropy.
+
+	config LRNG_IRQ_DFLT_TIMER_ES
+		bool "Interrupt Entropy Source"
+		depends on LRNG_IRQ
+		help
+		  The interrupt entropy source is selected as the
+		  timer-based entropy source that provides entropy.
+
+	config LRNG_SCHED_DFLT_TIMER_ES
+		bool "Scheduler Entropy Source"
+		depends on LRNG_SCHED
+		help
+		  The scheduler entropy source is selected as the
+		  timer-based entropy source that provides entropy.
+endchoice
+
+choice
+	prompt "LRNG Entropy Collection Pool Size"
+	default LRNG_COLLECTION_SIZE_1024
+	depends on LRNG_TIMER_COMMON
+	help
+	  Select the size of the LRNG entropy collection pool
+	  storing data for the interrupt as well as the scheduler
+	  entropy sources without performing a compression
+	  operation. The larger the collection size is, the faster
+	  the average interrupt handling will be. The collection
+	  size represents the number of bytes of the per-CPU memory
+	  used to batch up entropy event data.
+
+	  The default value is good for regular operations. Choose
+	  larger sizes for servers that have no memory limitations.
+	  If runtime memory is precious, choose a smaller size.
+
+	  The collection size is unrelated to the entropy rate
+	  or the amount of entropy the LRNG can process.
+
+	config LRNG_COLLECTION_SIZE_32
+		depends on LRNG_CONTINUOUS_COMPRESSION_ENABLED
+		depends on !LRNG_SWITCHABLE_CONTINUOUS_COMPRESSION
+		depends on !LRNG_OVERSAMPLE_ENTROPY_SOURCES
+		bool "32 interrupt events"
+
+	config LRNG_COLLECTION_SIZE_256
+		depends on !LRNG_OVERSAMPLE_ENTROPY_SOURCES
+		bool "256 interrupt events"
+
+	config LRNG_COLLECTION_SIZE_512
+		bool "512 interrupt events"
+
+	config LRNG_COLLECTION_SIZE_1024
+		bool "1024 interrupt events (default)"
+
+	config LRNG_COLLECTION_SIZE_2048
+		bool "2048 interrupt events"
+
+	config LRNG_COLLECTION_SIZE_4096
+		bool "4096 interrupt events"
+
+	config LRNG_COLLECTION_SIZE_8192
+		bool "8192 interrupt events"
+
+endchoice
+
+config LRNG_COLLECTION_SIZE
+	int
+	default 32 if LRNG_COLLECTION_SIZE_32
+	default 256 if LRNG_COLLECTION_SIZE_256
+	default 512 if LRNG_COLLECTION_SIZE_512
+	default 1024 if LRNG_COLLECTION_SIZE_1024
+	default 2048 if LRNG_COLLECTION_SIZE_2048
+	default 4096 if LRNG_COLLECTION_SIZE_4096
+	default 8192 if LRNG_COLLECTION_SIZE_8192
+
+config LRNG_HEALTH_TESTS
+	bool "Enable internal entropy source online health tests"
+	depends on LRNG_TIMER_COMMON
+	help
+	  The online health tests are applied to the interrupt
+	  entropy source and to the scheduler entropy source to
+	  validate the noise source at runtime for fatal errors.
+	  These tests include SP800-90B compliant tests which are
+	  invoked if the system is booted with fips=1. In case of
+	  fatal errors during active SP800-90B tests, the issue is
+	  logged and the noise data is discarded. These tests are
+	  required for full compliance of the interrupt entropy
+	  source with SP800-90B.
+
+	  If both the scheduler and the interrupt entropy sources
+	  are enabled, the health tests are applied to each of them
+	  independently.
+
+	  If unsure, say Y.
+
+config LRNG_RCT_BROKEN
+	bool "SP800-90B RCT with dangerous low cutoff value"
+	depends on LRNG_HEALTH_TESTS
+	depends on BROKEN
+	default n
+	help
+	  This option enables a dangerously low SP800-90B repetitive
+	  count test (RCT) cutoff value which makes it very likely
+	  that the RCT is triggered to raise a self test failure.
+
+	  This option is ONLY intended for developers wanting to
+	  test the effectiveness of the SP800-90B RCT health test.
+
+	  If unsure, say N.
+
+config LRNG_APT_BROKEN
+	bool "SP800-90B APT with dangerous low cutoff value"
+	depends on LRNG_HEALTH_TESTS
+	depends on BROKEN
+	default n
+	help
+	  This option enables a dangerously low SP800-90B adaptive
+	  proportion test (APT) cutoff value which makes it very
+	  likely that the APT is triggered to raise a self test
+	  failure.
+
+	  This option is ONLY intended for developers wanting to
+	  test the effectiveness of the SP800-90B APT health test.
+
+	  If unsure, say N.
+
+# Default taken from SP800-90B sec 4.4.1 - significance level 2^-30
+config LRNG_RCT_CUTOFF
+	int
+	default 31 if !LRNG_RCT_BROKEN
+	default 1 if LRNG_RCT_BROKEN
+
+# Default taken from SP800-90B sec 4.4.2 - significance level 2^-30
+config LRNG_APT_CUTOFF
+	int
+	default 325 if !LRNG_APT_BROKEN
+	default 32 if LRNG_APT_BROKEN
+
+comment "Interrupt Entropy Source"
+
+config LRNG_IRQ
+	bool "Enable Interrupt Entropy Source as LRNG Seed Source"
+	default y
+	depends on !RANDOM_DEFAULT_IMPL
+	select LRNG_TIMER_COMMON
+	help
+	  The LRNG models an entropy source based on the timing of
+	  the occurrence of interrupts. Enable this option to enable
+	  this IRQ entropy source.
+
+	  The IRQ entropy source is triggered every time an interrupt
+	  arrives and thus causes the interrupt handler to execute
+	  slightly longer. Disabling the IRQ entropy source implies
+	  that the performance penalty on the interrupt handler added
+	  by the LRNG is eliminated. Yet, this entropy source is
+	  considered to be an internal entropy source of the LRNG.
+	  Thus, only disable it if you have ensured that other entropy
+	  sources are available that supply the LRNG with entropy.
+
+	  If you disable the IRQ entropy source, you MUST ensure
+	  that one or more entropy sources collectively have the
+	  capability to deliver sufficient entropy with one invocation
+	  at a rate compliant to the security strength of the DRNG
+	  (usually 256 bits of entropy). In addition, if those
+	  entropy sources do not deliver sufficient entropy during
+	  the first request, the reseed must be triggered from user
+	  space or kernel space when sufficient entropy is considered
+	  to be present.
+
+	  If unsure, say Y.
+
+choice
+	prompt "Continuous entropy compression boot time setting"
+	default LRNG_CONTINUOUS_COMPRESSION_ENABLED
+	depends on LRNG_IRQ
+	help
+	  Select the default behavior of the interrupt entropy source
+	  continuous compression operation.
+
+	  The LRNG IRQ ES collects entropy data during each interrupt.
+	  For performance reasons, an amount of entropy data defined by
+	  the LRNG entropy collection pool size is concatenated into
+	  an array.
+	  When that array is filled up, a hash is calculated to
+	  compress the entropy. That hash is calculated in
+	  interrupt context.
+
+	  In case such a hash calculation in interrupt context is
+	  deemed too time-consuming, the continuous compression
+	  operation can be disabled. If disabled, the collection of
+	  entropy will not trigger a hash compression operation in
+	  interrupt context. The compression happens only when the
+	  DRNG is reseeded, which is in process context. This implies
+	  that old entropy data collected after the last DRNG reseed
+	  is overwritten with newer entropy data once the collection
+	  pool is full instead of retaining its entropy with the
+	  compression operation.
+
+	config LRNG_CONTINUOUS_COMPRESSION_ENABLED
+		bool "Enable continuous compression (default)"
+
+	config LRNG_CONTINUOUS_COMPRESSION_DISABLED
+		bool "Disable continuous compression"
+
+endchoice
+
+config LRNG_ENABLE_CONTINUOUS_COMPRESSION
+	bool
+	default y if LRNG_CONTINUOUS_COMPRESSION_ENABLED
+	default n if LRNG_CONTINUOUS_COMPRESSION_DISABLED
+
+config LRNG_SWITCHABLE_CONTINUOUS_COMPRESSION
+	bool "Runtime-switchable continuous entropy compression"
+	depends on LRNG_IRQ
+	help
+	  By default, the interrupt entropy source continuous
+	  compression operation behavior is hard-wired into the
+	  kernel. Enable this option to allow it to be configured
+	  at boot time.
+
+	  To modify the default behavior of the continuous
+	  compression operation, use the kernel command line option
+	  lrng_sw_noise.lrng_pcpu_continuous_compression.
+
+	  If unsure, say N.
+
+config LRNG_IRQ_ENTROPY_RATE
+	int "Interrupt Entropy Source Entropy Rate"
+	depends on LRNG_IRQ
+	range 256 4294967295 if LRNG_IRQ_DFLT_TIMER_ES
+	range 4294967295 4294967295 if !LRNG_IRQ_DFLT_TIMER_ES
+	default 256 if LRNG_IRQ_DFLT_TIMER_ES
+	default 4294967295 if !LRNG_IRQ_DFLT_TIMER_ES
+	help
+	  The LRNG will collect the configured number of interrupts to
+	  obtain 256 bits of entropy. This value can be set to any
+	  value between 256 and 4294967295. The LRNG guarantees that
+	  this value is not lower than 256. This lower limit implies
+	  that one interrupt event is credited with one bit of entropy.
+	  This value is subject to an increase by the oversampling
+	  factor if no high-resolution timer is found.
+
+	  In order to effectively disable the interrupt entropy source,
+	  the option has to be set to 4294967295. In this case, the
+	  interrupt entropy source will still deliver data but without
+	  being credited with entropy.
+
+comment "Jitter RNG Entropy Source"
+
+config LRNG_JENT
+	bool "Enable Jitter RNG as LRNG Seed Source"
+	depends on CRYPTO
+	select CRYPTO_JITTERENTROPY
+	help
+	  The LRNG may use the Jitter RNG as an entropy source.
+	  Enabling this option enables the use of the Jitter RNG. Its
+	  default entropy level is 16 bits of entropy per 256 data
+	  bits delivered by the Jitter RNG. This entropy level can be
+	  changed at boot time or at runtime with the
+	  lrng_base.jitterrng configuration variable.
+
+config LRNG_JENT_ENTROPY_RATE
+	int "Jitter RNG Entropy Source Entropy Rate"
+	depends on LRNG_JENT
+	range 0 256
+	default 16
+	help
+	  The option defines the amount of entropy the LRNG applies to
+	  256 bits of data obtained from the Jitter RNG entropy source.
+	  The LRNG enforces the limit that this value must be in the
+	  range between 0 and 256.
+
+	  When configuring this value to 0, the Jitter RNG entropy
+	  source will provide 256 bits of data without being credited
+	  to contain entropy.
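The entropy rates of the Jitter RNG and CPU entropy sources share one crediting arithmetic: the configured rate is the number of entropy bits credited per 256 bits of delivered data. A minimal sketch of this conversion (the helper name is hypothetical; the LRNG's own version is lrng_data_to_entropy() in lrng_definitions.h later in this patch)::

	/*
	 * Illustrative sketch only. With the default
	 * LRNG_JENT_ENTROPY_RATE of 16, 256 data bits are credited
	 * with 16 bits of entropy; a rate of 0 credits nothing.
	 */
	static inline u32 example_credited_bits(u32 data_bits, u32 rate)
	{
		return data_bits * rate / 256;	/* 256 = DRNG strength */
	}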
+ +comment "CPU Entropy Source" + +config LRNG_CPU + bool "Enable CPU Entropy Source as LRNG Seed Source" + default y + help + Current CPUs commonly contain entropy sources which can be + used to seed the LRNG. For example, the Intel RDSEED + instruction, or the POWER DARN instruction will be sourced + to seed the LRNG if this option is enabled. + + Note, if this option is enabled and the underlying CPU + does not offer such entropy source, the LRNG will automatically + detect this and ignore the hardware. + +config LRNG_CPU_FULL_ENT_MULTIPLIER + int + default 1 if !LRNG_TEST_CPU_ES_COMPRESSION + default 123 if LRNG_TEST_CPU_ES_COMPRESSION + +config LRNG_CPU_ENTROPY_RATE + int "CPU Entropy Source Entropy Rate" + depends on LRNG_CPU + range 0 256 + default 8 + help + The option defines the amount of entropy the LRNG applies to 256 + bits of data obtained from the CPU entropy source. The LRNG + enforces the limit that this value must be in the range between + 0 and 256. + + When configuring this value to 0, the CPU entropy source will + provide 256 bits of data without being credited to contain + entropy. + + Note, this option is overwritten when the option + CONFIG_RANDOM_TRUST_CPU is set. + +comment "Scheduler Entropy Source" + +config LRNG_SCHED + bool "Enable Scheduer Entropy Source as LRNG Seed Source" + select LRNG_TIMER_COMMON + help + The LRNG models an entropy source based on the timing of the + occurrence of scheduler-triggered context switches. Enable + this option to enable this scheduler entropy source. + + The scheduler entropy source is triggered every time a + context switch is triggered thus causes the scheduler to + execute slightly longer. Disabling the scheduler entropy + source implies that the performance penalty on the scheduler + added by the LRNG is eliminated. Yet, this entropy source is + considered to be an internal entropy source of the LRNG. + Thus, only disable it if you ensured that other entropy + sources are available that supply the LRNG with entropy. + + If you disable the scheduler entropy source, you MUST + ensure one or more entropy sources collectively have the + capability to deliver sufficient entropy with one invocation + at a rate compliant to the security strength of the DRNG + (usually 256 bits of entropy). In addition, if those + entropy sources do not deliver sufficient entropy during + first request, the reseed must be triggered from user + space or kernel space when sufficient entropy is considered + to be present. + + If unsure, say Y. + +config LRNG_SCHED_ENTROPY_RATE + int "Scheduler Entropy Source Entropy Rate" + depends on LRNG_SCHED + range 256 4294967295 if LRNG_SCHED_DFLT_TIMER_ES + range 4294967295 4294967295 if !LRNG_SCHED_DFLT_TIMER_ES + default 256 if LRNG_SCHED_DFLT_TIMER_ES + default 4294967295 if !LRNG_SCHED_DFLT_TIMER_ES + help + The LRNG will collect the configured number of context switches + triggered by the scheduler to obtain 256 bits of entropy. This + value can be set to any between 256 and 4294967295. The LRNG + guarantees that this value is not lower than 256. This lower + limit implies that one interrupt event is credited with one bit + of entropy. This value is subject to the increase by the + oversampling factor, if no high-resolution timer is found. + + In order to effectively disable the scheduler entropy source, + the option has to be set to 4294967295. In this case, the + scheduler entropy source will still deliver data but without + being credited with entropy. 
+ +comment "Kernel RNG Entropy Source" + +config LRNG_KERNEL_RNG + bool "Enable Kernel RNG as LRNG Seed Source" + depends on RANDOM_DEFAULT_IMPL + help + The LRNG may use the kernel RNG (random.c) as entropy + source. + +config LRNG_KERNEL_RNG_ENTROPY_RATE + int "Kernel RNG Entropy Source Entropy Rate" + depends on LRNG_KERNEL_RNG + range 0 256 + default 256 + help + The option defines the amount of entropy the LRNG applies to 256 + bits of data obtained from the kernel RNG entropy source. The + LRNG enforces the limit that this value must be in the range + between 0 and 256. + + When configuring this value to 0, the kernel RNG entropy source + will provide 256 bits of data without being credited to contain + entropy. + + Note: This value is set to 0 automatically when booting the + kernel in FIPS mode (with fips=1 kernel command line option). + This is due to the fact that random.c is not SP800-90B + compliant. + +endmenu # "Entropy Source Configuration" + +config LRNG_DRNG_CHACHA20 + tristate + +config LRNG_DRBG + tristate + depends on CRYPTO + select CRYPTO_DRBG_MENU + +config LRNG_DRNG_KCAPI + tristate + depends on CRYPTO + select CRYPTO_RNG + +config LRNG_SWITCH + bool + +menuconfig LRNG_SWITCH_HASH + bool "Support conditioning hash runtime switching" + select LRNG_SWITCH + help + The LRNG uses a default message digest. With this + configuration option other message digests can be selected + and loaded at runtime. + +if LRNG_SWITCH_HASH + +config LRNG_HASH_KCAPI + tristate "Kernel crypto API hashing support for LRNG" + select CRYPTO_HASH + select CRYPTO_SHA512 + help + Enable the kernel crypto API support for entropy compression + and conditioning functions. + +endif # LRNG_SWITCH_HASH + +menuconfig LRNG_SWITCH_DRNG + bool "Support DRNG runtime switching" + select LRNG_SWITCH + help + The LRNG uses a default DRNG With this configuration + option other DRNGs or message digests can be selected and + loaded at runtime. + +if LRNG_SWITCH_DRNG + +config LRNG_SWITCH_DRNG_CHACHA20 + tristate "ChaCha20-based DRNG support for LRNG" + depends on !LRNG_DFLT_DRNG_CHACHA20 + select LRNG_DRNG_CHACHA20 + help + Enable the ChaCha20-based DRNG. This DRNG implementation + does not depend on the kernel crypto API presence. + +config LRNG_SWITCH_DRBG + tristate "SP800-90A support for the LRNG" + depends on !LRNG_DFLT_DRNG_DRBG + select LRNG_DRBG + help + Enable the SP800-90A DRBG support for the LRNG. Once the + module is loaded, output from /dev/random, /dev/urandom, + getrandom(2), or get_random_bytes_full is provided by a DRBG. + +config LRNG_SWITCH_DRNG_KCAPI + tristate "Kernel Crypto API support for the LRNG" + depends on !LRNG_DFLT_DRNG_KCAPI + depends on !LRNG_SWITCH_DRBG + select LRNG_DRNG_KCAPI + help + Enable the support for generic pseudo-random number + generators offered by the kernel crypto API with the + LRNG. Once the module is loaded, output from /dev/random, + /dev/urandom, getrandom(2), or get_random_bytes is + provided by the selected kernel crypto API RNG. + +endif # LRNG_SWITCH_DRNG + +choice + prompt "LRNG Default DRNG" + default LRNG_DFLT_DRNG_CHACHA20 + help + Select the default deterministic random number generator + that is used by the LRNG. When enabling the switchable + cryptographic mechanism support, this DRNG can be + replaced at runtime. 
+ + config LRNG_DFLT_DRNG_CHACHA20 + bool "ChaCha20-based DRNG" + select LRNG_DRNG_CHACHA20 + + config LRNG_DFLT_DRNG_DRBG + bool "SP800-90A DRBG" + select LRNG_DRBG + + config LRNG_DFLT_DRNG_KCAPI + bool "Kernel Crypto API DRNG" + select LRNG_DRNG_KCAPI +endchoice + +menuconfig LRNG_TESTING_MENU + bool "LRNG testing interfaces" + depends on DEBUG_FS + help + Enable one or more of the following test interfaces. + + If unsure, say N. + +if LRNG_TESTING_MENU + +config LRNG_TESTING + bool + +config LRNG_TESTING_RECORDING + bool + +comment "Interrupt Entropy Source Test Interfaces" + +config LRNG_RAW_HIRES_ENTROPY + bool "Interface to obtain raw unprocessed IRQ noise source data" + default y + depends on LRNG_IRQ + select LRNG_TESTING + select LRNG_TESTING_RECORDING + help + The test interface allows a privileged process to capture + the raw unconditioned high resolution time stamp noise that + is collected by the LRNG for statistical analysis. Extracted + noise data is not used to seed the LRNG. + + The raw noise data can be obtained using the lrng_raw_hires + debugfs file. Using the option lrng_testing.boot_raw_hires_test=1 + the raw noise of the first 1000 entropy events since boot + can be sampled. + +config LRNG_RAW_JIFFIES_ENTROPY + bool "Entropy test interface to Jiffies of IRQ noise source" + depends on LRNG_IRQ + select LRNG_TESTING + select LRNG_TESTING_RECORDING + help + The test interface allows a privileged process to capture + the raw unconditioned Jiffies that is collected by + the LRNG for statistical analysis. This data is used for + seeding the LRNG if a high-resolution time stamp is not + available. If a high-resolution time stamp is detected, + the Jiffies value is not collected by the LRNG and no + data is provided via the test interface. Extracted noise + data is not used to seed the random number generator. + + The raw noise data can be obtained using the lrng_raw_jiffies + debugfs file. Using the option lrng_testing.boot_raw_jiffies_test=1 + the raw noise of the first 1000 entropy events since boot + can be sampled. + +config LRNG_RAW_IRQ_ENTROPY + bool "Entropy test interface to IRQ number noise source" + depends on LRNG_IRQ + select LRNG_TESTING + select LRNG_TESTING_RECORDING + help + The test interface allows a privileged process to capture + the raw unconditioned interrupt number that is collected by + the LRNG for statistical analysis. Extracted noise data is + not used to seed the random number generator. + + The raw noise data can be obtained using the lrng_raw_irq + debugfs file. Using the option lrng_testing.boot_raw_irq_test=1 + the raw noise of the first 1000 entropy events since boot + can be sampled. + +config LRNG_RAW_RETIP_ENTROPY + bool "Entropy test interface to RETIP value of IRQ noise source" + depends on LRNG_IRQ + select LRNG_TESTING + select LRNG_TESTING_RECORDING + help + The test interface allows a privileged process to capture + the raw unconditioned return instruction pointer value + that is collected by the LRNG for statistical analysis. + Extracted noise data is not used to seed the random number + generator. + + The raw noise data can be obtained using the lrng_raw_retip + debugfs file. Using the option lrng_testing.boot_raw_retip_test=1 + the raw noise of the first 1000 entropy events since boot + can be sampled. 
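All of the raw-entropy debugfs files above are read the same way. A minimal user-space sketch (illustrative only; the assumption that samples are exported as 32-bit words is made for the example) for the lrng_raw_hires file::

	/* Illustrative only; run as root with debugfs mounted. */
	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		uint32_t sample;
		int fd = open("/sys/kernel/debug/lrng_raw_hires", O_RDONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* Each read returns raw, unconditioned time stamp samples. */
		while (read(fd, &sample, sizeof(sample)) == sizeof(sample))
			printf("%u\n", sample);
		close(fd);
		return 0;
	}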
+ +config LRNG_RAW_REGS_ENTROPY + bool "Entropy test interface to IRQ register value noise source" + depends on LRNG_IRQ + select LRNG_TESTING + select LRNG_TESTING_RECORDING + help + The test interface allows a privileged process to capture + the raw unconditioned interrupt register value that is + collected by the LRNG for statistical analysis. Extracted noise + data is not used to seed the random number generator. + + The raw noise data can be obtained using the lrng_raw_regs + debugfs file. Using the option lrng_testing.boot_raw_regs_test=1 + the raw noise of the first 1000 entropy events since boot + can be sampled. + +config LRNG_RAW_ARRAY + bool "Test interface to LRNG raw entropy IRQ storage array" + depends on LRNG_IRQ + select LRNG_TESTING + select LRNG_TESTING_RECORDING + help + The test interface allows a privileged process to capture + the raw noise data that is collected by the LRNG + in the per-CPU array for statistical analysis. The purpose + of this interface is to verify that the array handling code + truly only concatenates data and provides the same entropy + rate as the raw unconditioned noise source when assessing + the collected data byte-wise. + + The data can be obtained using the lrng_raw_array debugfs + file. Using the option lrng_testing.boot_raw_array=1 + the raw noise of the first 1000 entropy events since boot + can be sampled. + +config LRNG_IRQ_PERF + bool "LRNG interrupt entropy source performance monitor" + depends on LRNG_IRQ + select LRNG_TESTING + select LRNG_TESTING_RECORDING + help + With this option, the performance monitor of the LRNG + interrupt handling code is enabled. The file provides + the execution time of the interrupt handler in + cycles. + + The interrupt performance data can be obtained using + the lrng_irq_perf debugfs file. Using the option + lrng_testing.boot_irq_perf=1 the performance data of + the first 1000 entropy events since boot can be sampled. + +comment "Scheduler Entropy Source Test Interfaces" + +config LRNG_RAW_SCHED_HIRES_ENTROPY + bool "Interface to obtain raw unprocessed scheduler noise source data" + depends on LRNG_SCHED + select LRNG_TESTING + select LRNG_TESTING_RECORDING + help + The test interface allows a privileged process to capture + the raw unconditioned high resolution time stamp noise that + is collected by the LRNG for the Scheduler-based noise source + for statistical analysis. Extracted noise data is not used to + seed the LRNG. + + The raw noise data can be obtained using the lrng_raw_sched_hires + debugfs file. Using the option + lrng_testing.boot_raw_sched_hires_test=1 the raw noise of the + first 1000 entropy events since boot can be sampled. + +config LRNG_RAW_SCHED_PID_ENTROPY + bool "Entropy test interface to PID value" + depends on LRNG_SCHED + select LRNG_TESTING + select LRNG_TESTING_RECORDING + help + The test interface allows a privileged process to capture + the raw unconditioned PID value that is collected by the + LRNG for statistical analysis. Extracted noise + data is not used to seed the random number generator. + + The raw noise data can be obtained using the + lrng_raw_sched_pid debugfs file. Using the option + lrng_testing.boot_raw_sched_pid_test=1 + the raw noise of the first 1000 entropy events since boot + can be sampled. 
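The lrng_raw_array interface above exists to confirm that batching only concatenates event data. A minimal sketch of such a per-CPU collection array (the names and the fixed size are assumptions for illustration, not the LRNG code; u32 and bool come from linux/types.h)::

	/* Illustrative sketch of per-CPU entropy batching. */
	#define EXAMPLE_COLLECTION_SIZE	1024	/* cf. LRNG_COLLECTION_SIZE */

	struct example_es_array {
		u32 data[EXAMPLE_COLLECTION_SIZE];
		u32 idx;
	};

	/* Concatenate one 32-bit event datum; report when the batch is full. */
	static bool example_es_collect(struct example_es_array *a, u32 event)
	{
		a->data[a->idx++] = event;
		if (a->idx < EXAMPLE_COLLECTION_SIZE)
			return false;

		a->idx = 0;
		return true;	/* caller hashes (compresses) the full batch */
	}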
+
+config LRNG_RAW_SCHED_START_TIME_ENTROPY
+	bool "Entropy test interface to task start time value"
+	depends on LRNG_SCHED
+	select LRNG_TESTING
+	select LRNG_TESTING_RECORDING
+	help
+	  The test interface allows a privileged process to capture
+	  the raw unconditioned task start time value that is collected
+	  by the LRNG for statistical analysis. Extracted noise
+	  data is not used to seed the random number generator.
+
+	  The raw noise data can be obtained using the
+	  lrng_raw_sched_starttime debugfs file. Using the option
+	  lrng_testing.boot_raw_sched_starttime_test=1
+	  the raw noise of the first 1000 entropy events since boot
+	  can be sampled.
+
+config LRNG_RAW_SCHED_NVCSW_ENTROPY
+	bool "Entropy test interface to task context switch numbers"
+	depends on LRNG_SCHED
+	select LRNG_TESTING
+	select LRNG_TESTING_RECORDING
+	help
+	  The test interface allows a privileged process to capture
+	  the raw unconditioned numbers of task context switches that
+	  are collected by the LRNG for statistical analysis. Extracted
+	  noise data is not used to seed the random number generator.
+
+	  The raw noise data can be obtained using the
+	  lrng_raw_sched_nvcsw debugfs file. Using the option
+	  lrng_testing.boot_raw_sched_nvcsw_test=1
+	  the raw noise of the first 1000 entropy events since boot
+	  can be sampled.
+
+config LRNG_SCHED_PERF
+	bool "LRNG scheduler entropy source performance monitor"
+	depends on LRNG_SCHED
+	select LRNG_TESTING
+	select LRNG_TESTING_RECORDING
+	help
+	  With this option, the performance monitor of the LRNG
+	  scheduler event handling code is enabled. The file provides
+	  the execution time of the scheduler event handler in cycles.
+
+	  The scheduler performance data can be obtained using
+	  the lrng_sched_perf debugfs file. Using the option
+	  lrng_testing.boot_sched_perf=1 the performance data of
+	  the first 1000 entropy events since boot can be sampled.
+
+comment "Auxiliary Test Interfaces"
+
+config LRNG_ACVT_HASH
+	bool "Enable LRNG ACVT Hash interface"
+	select LRNG_TESTING
+	help
+	  With this option, the LRNG built-in hash function used for
+	  auxiliary pool management and prior to switching the
+	  cryptographic backends is made available for ACVT. The
+	  interface allows writing of the data to be hashed
+	  into the interface. The read operation triggers the hash
+	  operation to generate a message digest.
+
+	  The ACVT interface is available with the lrng_acvt_hash
+	  debugfs file.
+
+config LRNG_RUNTIME_MAX_WO_RESEED_CONFIG
+	bool "Enable runtime configuration of max reseed threshold"
+	help
+	  When enabling this option, the LRNG provides an interface
+	  allowing the setting of the maximum number of DRNG generate
+	  operations without a reseed that has full entropy. The
+	  interface is lrng_drng.max_wo_reseed.
+
+config LRNG_TEST_CPU_ES_COMPRESSION
+	bool "Force CPU ES compression operation"
+	help
+	  When enabling this option, the CPU ES compression operation
+	  is forced by setting an arbitrary value > 1 for the data
+	  multiplier even when the CPU ES would deliver full entropy.
+	  This allows testing of the compression operation. It
+	  therefore forces pulling more data from the CPU ES than
+	  may be required.
+
+endif # LRNG_TESTING_MENU
+
+config LRNG_SELFTEST
+	bool "Enable power-on and on-demand self-tests"
+	help
+	  The power-on self-tests are executed during boot time
+	  covering the ChaCha20 DRNG, the hash operation used for
+	  processing the entropy pools and the auxiliary pool, and
+	  the time stamp management of the LRNG.
+ + The on-demand self-tests are triggered by writing any + value into the SysFS file selftest_status. At the same + time, when reading this file, the test status is + returned. A zero indicates that all tests were executed + successfully. + + If unsure, say Y. + +if LRNG_SELFTEST + +config LRNG_SELFTEST_PANIC + bool "Panic the kernel upon self-test failure" + help + If the option is enabled, the kernel is terminated if an + LRNG power-on self-test failure is detected. + +endif # LRNG_SELFTEST + +endif # LRNG + +endmenu # LRNG diff --git a/drivers/char/lrng/Makefile b/drivers/char/lrng/Makefile new file mode 100644 index 000000000000..f61fc40f4620 --- /dev/null +++ b/drivers/char/lrng/Makefile @@ -0,0 +1,39 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Makefile for the Entropy Source and DRNG Manager. +# + +obj-y += lrng_es_mgr.o lrng_drng_mgr.o \ + lrng_es_aux.o +obj-$(CONFIG_LRNG_SHA256) += lrng_sha256.o +obj-$(CONFIG_LRNG_SHA1) += lrng_sha1.o + +obj-$(CONFIG_SYSCTL) += lrng_proc.o +obj-$(CONFIG_LRNG_SYSCTL) += lrng_sysctl.o +obj-$(CONFIG_NUMA) += lrng_numa.o + +obj-$(CONFIG_LRNG_SWITCH) += lrng_switch.o +obj-$(CONFIG_LRNG_HASH_KCAPI) += lrng_hash_kcapi.o +obj-$(CONFIG_LRNG_DRNG_CHACHA20) += lrng_drng_chacha20.o +obj-$(CONFIG_LRNG_DRBG) += lrng_drng_drbg.o +obj-$(CONFIG_LRNG_DRNG_KCAPI) += lrng_drng_kcapi.o +obj-$(CONFIG_LRNG_DRNG_ATOMIC) += lrng_drng_atomic.o + +obj-$(CONFIG_LRNG_TIMER_COMMON) += lrng_es_timer_common.o +obj-$(CONFIG_LRNG_IRQ) += lrng_es_irq.o +obj-$(CONFIG_LRNG_KERNEL_RNG) += lrng_es_krng.o +obj-$(CONFIG_LRNG_SCHED) += lrng_es_sched.o +obj-$(CONFIG_LRNG_CPU) += lrng_es_cpu.o +obj-$(CONFIG_LRNG_JENT) += lrng_es_jent.o + +obj-$(CONFIG_LRNG_HEALTH_TESTS) += lrng_health.o +obj-$(CONFIG_LRNG_TESTING) += lrng_testing.o +obj-$(CONFIG_LRNG_SELFTEST) += lrng_selftest.o + +obj-$(CONFIG_LRNG_COMMON_DEV_IF) += lrng_interface_dev_common.o +obj-$(CONFIG_LRNG_RANDOM_IF) += lrng_interface_random_user.o \ + lrng_interface_random_kernel.o \ + lrng_interface_aux.o +obj-$(CONFIG_LRNG_KCAPI_IF) += lrng_interface_kcapi.o +obj-$(CONFIG_LRNG_DEV_IF) += lrng_interface_dev.o +obj-$(CONFIG_LRNG_HWRAND_IF) += lrng_interface_hwrand.o diff --git a/drivers/char/lrng/lrng_definitions.h b/drivers/char/lrng/lrng_definitions.h new file mode 100644 index 000000000000..50f87c40f701 --- /dev/null +++ b/drivers/char/lrng/lrng_definitions.h @@ -0,0 +1,151 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_DEFINITIONS_H +#define _LRNG_DEFINITIONS_H + +#include +#include +#include + +/*************************** General LRNG parameter ***************************/ + +/* + * Specific settings for different use cases + */ +#ifdef CONFIG_CRYPTO_FIPS +# define LRNG_OVERSAMPLE_ES_BITS 64 +# define LRNG_SEED_BUFFER_INIT_ADD_BITS 128 +#else /* CONFIG_CRYPTO_FIPS */ +# define LRNG_OVERSAMPLE_ES_BITS 0 +# define LRNG_SEED_BUFFER_INIT_ADD_BITS 0 +#endif /* CONFIG_CRYPTO_FIPS */ + +/* Security strength of LRNG -- this must match DRNG security strength */ +#define LRNG_DRNG_SECURITY_STRENGTH_BYTES 32 +#define LRNG_DRNG_SECURITY_STRENGTH_BITS (LRNG_DRNG_SECURITY_STRENGTH_BYTES * 8) +#define LRNG_DRNG_INIT_SEED_SIZE_BITS \ + (LRNG_DRNG_SECURITY_STRENGTH_BITS + LRNG_SEED_BUFFER_INIT_ADD_BITS) +#define LRNG_DRNG_INIT_SEED_SIZE_BYTES (LRNG_DRNG_INIT_SEED_SIZE_BITS >> 3) + +/* + * SP800-90A defines a maximum request size of 1<<16 bytes. The given value is + * considered a safer margin. + * + * This value is allowed to be changed. 
+ */
+#define LRNG_DRNG_MAX_REQSIZE		(1<<12)
+
+/*
+ * SP800-90A defines a maximum number of requests between reseeds of 2^48.
+ * The given value is considered a much safer margin, balancing requests for
+ * frequent reseeds with the need to conserve entropy. This value MUST NOT be
+ * larger than INT_MAX because it is used in an atomic_t.
+ *
+ * This value is allowed to be changed.
+ */
+#define LRNG_DRNG_RESEED_THRESH		(1<<20)
+
+/*
+ * Maximum DRNG generation operations without reseed having full entropy
+ * This value defines the absolute maximum value of DRNG generation operations
+ * without a reseed holding full entropy. LRNG_DRNG_RESEED_THRESH is the
+ * threshold when a new reseed is attempted. But it is possible that this fails
+ * to deliver full entropy. In this case the DRNG will continue to provide data
+ * even though it was not reseeded with full entropy. To avoid the extreme case
+ * in which no reseed is performed for too long, this threshold is enforced.
+ * If that absolute maximum is reached, the LRNG is marked as not operational.
+ *
+ * This value is allowed to be changed.
+ */
+#define LRNG_DRNG_MAX_WITHOUT_RESEED	(1<<30)
+
+/*
+ * Min required seed entropy is 128 bits covering the minimum entropy
+ * requirement of SP800-131A and the German BSI's TR02102.
+ *
+ * This value is allowed to be changed.
+ */
+#define LRNG_FULL_SEED_ENTROPY_BITS	LRNG_DRNG_SECURITY_STRENGTH_BITS
+#define LRNG_MIN_SEED_ENTROPY_BITS	128
+#define LRNG_INIT_ENTROPY_BITS		32
+
+/*
+ * Wakeup value
+ *
+ * This value is allowed to be changed but must not be larger than the
+ * digest size of the hash operation used to update the aux_pool.
+ */
+#ifdef CONFIG_LRNG_SHA256
+# define LRNG_ATOMIC_DIGEST_SIZE	SHA256_DIGEST_SIZE
+#else
+# define LRNG_ATOMIC_DIGEST_SIZE	SHA1_DIGEST_SIZE
+#endif
+#define LRNG_WRITE_WAKEUP_ENTROPY	LRNG_ATOMIC_DIGEST_SIZE
+
+/*
+ * If the switching support is configured, we must provide support up to
+ * the largest digest size. Without switching support, we know it is only
+ * the built-in digest size.
+ */
+#ifdef CONFIG_LRNG_SWITCH
+# define LRNG_MAX_DIGESTSIZE		64
+#else
+# define LRNG_MAX_DIGESTSIZE		LRNG_ATOMIC_DIGEST_SIZE
+#endif
+
+/*
+ * Oversampling factor of timer-based events to obtain
+ * LRNG_DRNG_SECURITY_STRENGTH_BYTES. This factor is used when a
+ * high-resolution time stamp is not available. In this case, jiffies and
+ * register contents are used to fill the entropy pool. These noise sources
+ * are much less entropic than the high-resolution timer. The entropy content
+ * is the entropy content assumed with LRNG_[IRQ|SCHED]_ENTROPY_BITS divided
+ * by LRNG_ES_OVERSAMPLING_FACTOR.
+ *
+ * This value is allowed to be changed.
+ */
+#define LRNG_ES_OVERSAMPLING_FACTOR	10
+
+/* Alignmask that is intended to be identical to CRYPTO_MINALIGN */
+#define LRNG_KCAPI_ALIGN		ARCH_KMALLOC_MINALIGN
+
+/*
+ * This definition must provide a buffer that is equal to SHASH_DESC_ON_STACK
+ * as it will be cast into a struct shash_desc.
+ */
+#define LRNG_POOL_SIZE	(sizeof(struct shash_desc) + HASH_MAX_DESCSIZE)
+
+/****************************** Helper code ***********************************/
+
+static inline u32 lrng_fast_noise_entropylevel(u32 ent_bits, u32 requested_bits)
+{
+	/* Obtain entropy statement */
+	ent_bits = ent_bits * requested_bits / LRNG_DRNG_SECURITY_STRENGTH_BITS;
+	/* Cap entropy to buffer size in bits */
+	ent_bits = min_t(u32, ent_bits, requested_bits);
+	return ent_bits;
+}
+
+/*
+ * Convert entropy in bits into the number of events with the same entropy
+ * content.
+ */
+static inline u32 lrng_entropy_to_data(u32 entropy_bits, u32 entropy_rate)
+{
+	return ((entropy_bits * entropy_rate) /
+		LRNG_DRNG_SECURITY_STRENGTH_BITS);
+}
+
+/* Convert number of events into entropy value. */
+static inline u32 lrng_data_to_entropy(u32 num, u32 entropy_rate)
+{
+	return ((num * LRNG_DRNG_SECURITY_STRENGTH_BITS) /
+		entropy_rate);
+}
+
+static inline u32 atomic_read_u32(atomic_t *v)
+{
+	return (u32)atomic_read(v);
+}
+
+#endif /* _LRNG_DEFINITIONS_H */
diff --git a/drivers/char/lrng/lrng_drng_atomic.c b/drivers/char/lrng/lrng_drng_atomic.c
new file mode 100644
index 000000000000..e3a8fece9935
--- /dev/null
+++ b/drivers/char/lrng/lrng_drng_atomic.c
@@ -0,0 +1,170 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+/*
+ * LRNG DRNG for atomic contexts
+ *
+ * Copyright (C) 2022, Stephan Mueller
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include
+
+#include "lrng_drng_atomic.h"
+#include "lrng_drng_chacha20.h"
+#include "lrng_es_aux.h"
+#include "lrng_es_mgr.h"
+#include "lrng_sha.h"
+
+static struct chacha20_state chacha20_atomic = {
+	LRNG_CC20_INIT_RFC7539(.block)
+};
+
+/*
+ * DRNG usable in atomic context. This DRNG will always use the ChaCha20
+ * DRNG. It will never benefit from a DRNG switch like the "regular" DRNG.
+ * If there was no DRNG switch, the atomic DRNG is identical to the
+ * "regular" DRNG.
+ *
+ * The reason for having this DRNG is that DRNGs other than the ChaCha20
+ * DRNG may sleep.
+ */
+static struct lrng_drng lrng_drng_atomic = {
+	LRNG_DRNG_STATE_INIT(lrng_drng_atomic,
+			     &chacha20_atomic, NULL,
+			     &lrng_cc20_drng_cb, &lrng_sha_hash_cb),
+	.spin_lock = __SPIN_LOCK_UNLOCKED(lrng_drng_atomic.spin_lock)
+};
+
+void lrng_drng_atomic_reset(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&lrng_drng_atomic.spin_lock, flags);
+	lrng_drng_reset(&lrng_drng_atomic);
+	spin_unlock_irqrestore(&lrng_drng_atomic.spin_lock, flags);
+}
+
+void lrng_drng_atomic_force_reseed(void)
+{
+	lrng_drng_atomic.force_reseed = lrng_drng_atomic.fully_seeded;
+}
+
+static bool lrng_drng_atomic_must_reseed(struct lrng_drng *drng)
+{
+	return (!drng->fully_seeded ||
+		atomic_read(&lrng_drng_atomic.requests) == 0 ||
+		drng->force_reseed ||
+		time_after(jiffies,
+			   drng->last_seeded + lrng_drng_reseed_max_time * HZ));
+}
+
+void lrng_drng_atomic_seed_drng(struct lrng_drng *regular_drng)
+{
+	u8 seedbuf[LRNG_DRNG_SECURITY_STRENGTH_BYTES]
+						__aligned(LRNG_KCAPI_ALIGN);
+	int ret;
+
+	if (!lrng_drng_atomic_must_reseed(&lrng_drng_atomic))
+		return;
+
+	/*
+	 * Reseed the atomic DRNG from the regular DRNG while the regular
+	 * DRNG is reseeded. Therefore, this code operates in non-atomic
+	 * context and thus can use the lrng_drng_get function to get random
+	 * numbers from the just freshly seeded DRNG.
+ */ + ret = lrng_drng_get(regular_drng, seedbuf, sizeof(seedbuf)); + + if (ret < 0) { + pr_warn("Error generating random numbers for atomic DRNG: %d\n", + ret); + } else { + unsigned long flags; + + spin_lock_irqsave(&lrng_drng_atomic.spin_lock, flags); + lrng_drng_inject(&lrng_drng_atomic, seedbuf, ret, + regular_drng->fully_seeded, "atomic"); + spin_unlock_irqrestore(&lrng_drng_atomic.spin_lock, flags); + } + memzero_explicit(&seedbuf, sizeof(seedbuf)); +} + +void lrng_drng_atomic_seed_es(void) +{ + struct entropy_buf seedbuf __aligned(LRNG_KCAPI_ALIGN); + struct lrng_drng *drng = &lrng_drng_atomic; + unsigned long flags; + u32 requested_bits = LRNG_MIN_SEED_ENTROPY_BITS; + + if (lrng_drng_atomic.fully_seeded) + return; + + /* + * When the LRNG reached the minimally seeded level, leave 256 bits of + * entropy for the regular DRNG. In addition ensure that additional + * 256 bits are available for the atomic DRNG to get to the fully + * seeded stage of the LRNG. + */ + if (lrng_state_min_seeded()) { + u32 avail_bits = lrng_avail_entropy(); + + requested_bits = + (avail_bits >= 2 * LRNG_FULL_SEED_ENTROPY_BITS) ? + LRNG_FULL_SEED_ENTROPY_BITS : 0; + + if (!requested_bits) + return; + } + + pr_debug("atomic DRNG seeding attempt to pull %u bits of entropy directly from entropy sources\n", + requested_bits); + + lrng_fill_seed_buffer(&seedbuf, requested_bits); + spin_lock_irqsave(&drng->spin_lock, flags); + lrng_drng_inject(&lrng_drng_atomic, (u8 *)&seedbuf, sizeof(seedbuf), + lrng_fully_seeded_eb(drng->fully_seeded, &seedbuf), + "atomic"); + spin_unlock_irqrestore(&drng->spin_lock, flags); + lrng_init_ops(&seedbuf); + + /* + * Do not call lrng_init_ops(seedbuf) here as the atomic DRNG does not + * serve common users. + */ + memzero_explicit(&seedbuf, sizeof(seedbuf)); +} + +static void lrng_drng_atomic_get(u8 *outbuf, u32 outbuflen) +{ + struct lrng_drng *drng = &lrng_drng_atomic; + unsigned long flags; + + if (!outbuf || !outbuflen) + return; + + outbuflen = min_t(size_t, outbuflen, INT_MAX); + + while (outbuflen) { + u32 todo = min_t(u32, outbuflen, LRNG_DRNG_MAX_REQSIZE); + int ret; + + atomic_dec(&drng->requests); + + spin_lock_irqsave(&drng->spin_lock, flags); + ret = drng->drng_cb->drng_generate(drng->drng, outbuf, todo); + spin_unlock_irqrestore(&drng->spin_lock, flags); + if (ret <= 0) { + pr_warn("getting random data from DRNG failed (%d)\n", + ret); + return; + } + outbuflen -= ret; + outbuf += ret; + } +} + +void lrng_get_random_bytes(void *buf, int nbytes) +{ + lrng_drng_atomic_get((u8 *)buf, (u32)nbytes); + lrng_debug_report_seedlevel("lrng_get_random_bytes"); +} +EXPORT_SYMBOL(lrng_get_random_bytes); diff --git a/drivers/char/lrng/lrng_drng_atomic.h b/drivers/char/lrng/lrng_drng_atomic.h new file mode 100644 index 000000000000..d5d24fc94f2e --- /dev/null +++ b/drivers/char/lrng/lrng_drng_atomic.h @@ -0,0 +1,23 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_DRNG_ATOMIC_H +#define _LRNG_DRNG_ATOMIC_H + +#include "lrng_drng_mgr.h" + +#ifdef CONFIG_LRNG_DRNG_ATOMIC +void lrng_drng_atomic_reset(void); +void lrng_drng_atomic_seed_drng(struct lrng_drng *drng); +void lrng_drng_atomic_seed_es(void); +void lrng_drng_atomic_force_reseed(void); +#else /* CONFIG_LRNG_DRNG_ATOMIC */ +static inline void lrng_drng_atomic_reset(void) { } +static inline void lrng_drng_atomic_seed_drng(struct lrng_drng *drng) { } +static inline void lrng_drng_atomic_seed_es(void) { } +static inline void 
lrng_drng_atomic_force_reseed(void) { }
+#endif /* CONFIG_LRNG_DRNG_ATOMIC */
+
+#endif /* _LRNG_DRNG_ATOMIC_H */
diff --git a/drivers/char/lrng/lrng_drng_chacha20.c b/drivers/char/lrng/lrng_drng_chacha20.c
new file mode 100644
index 000000000000..31be102e3007
--- /dev/null
+++ b/drivers/char/lrng/lrng_drng_chacha20.c
@@ -0,0 +1,195 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+/*
+ * Backend for the LRNG providing the cryptographic primitives using
+ * ChaCha20 cipher implementations.
+ *
+ * Copyright (C) 2022, Stephan Mueller
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include
+#include
+#include
+#include
+
+#include "lrng_drng_chacha20.h"
+
+/******************************* ChaCha20 DRNG *******************************/
+
+#define CHACHA_BLOCK_WORDS	(CHACHA_BLOCK_SIZE / sizeof(u32))
+
+/*
+ * Update of the ChaCha20 state by either using an unused buffer part or by
+ * generating one ChaCha20 block, which is half of the ChaCha20 state.
+ * The block is XORed into the key part of the state. This shall ensure
+ * backtracking resistance as well as a proper mix of the ChaCha20 state once
+ * the key is injected.
+ */
+static void lrng_chacha20_update(struct chacha20_state *chacha20_state,
+				 __le32 *buf, u32 used_words)
+{
+	struct chacha20_block *chacha20 = &chacha20_state->block;
+	u32 i;
+	__le32 tmp[CHACHA_BLOCK_WORDS];
+
+	BUILD_BUG_ON(sizeof(struct chacha20_block) != CHACHA_BLOCK_SIZE);
+	BUILD_BUG_ON(CHACHA_BLOCK_SIZE != 2 * CHACHA_KEY_SIZE);
+
+	if (used_words > CHACHA_KEY_SIZE_WORDS) {
+		chacha20_block(&chacha20->constants[0], (u8 *)tmp);
+		for (i = 0; i < CHACHA_KEY_SIZE_WORDS; i++)
+			chacha20->key.u[i] ^= le32_to_cpu(tmp[i]);
+		memzero_explicit(tmp, sizeof(tmp));
+	} else {
+		for (i = 0; i < CHACHA_KEY_SIZE_WORDS; i++)
+			chacha20->key.u[i] ^= le32_to_cpu(buf[i + used_words]);
+	}
+
+	/* Deterministic increment of nonce as required in RFC 7539 chapter 4 */
+	chacha20->nonce[0]++;
+	if (chacha20->nonce[0] == 0) {
+		chacha20->nonce[1]++;
+		if (chacha20->nonce[1] == 0)
+			chacha20->nonce[2]++;
+	}
+
+	/* Leave the counter untouched as its start value is undefined in the RFC */
+}
+
+/*
+ * Seed the ChaCha20 DRNG by injecting the input data into the key part of
+ * the ChaCha20 state. If the input data is longer than the ChaCha20 key size,
+ * perform a ChaCha20 operation after processing of key size input data.
+ * This operation shall spread out the entropy into the ChaCha20 state before
+ * new entropy is injected into the key part.
+ */
+static int lrng_cc20_drng_seed_helper(void *drng, const u8 *inbuf, u32 inbuflen)
+{
+	struct chacha20_state *chacha20_state = (struct chacha20_state *)drng;
+	struct chacha20_block *chacha20 = &chacha20_state->block;
+
+	while (inbuflen) {
+		u32 i, todo = min_t(u32, inbuflen, CHACHA_KEY_SIZE);
+
+		for (i = 0; i < todo; i++)
+			chacha20->key.b[i] ^= inbuf[i];
+
+		/* Break potential dependencies between the inbuf key blocks */
+		lrng_chacha20_update(chacha20_state, NULL,
+				     CHACHA_BLOCK_WORDS);
+		inbuf += todo;
+		inbuflen -= todo;
+	}
+
+	return 0;
+}
+
+/*
+ * ChaCha20 DRNG generation of random numbers: the stream output of ChaCha20
+ * is the random number. After the completion of the generation of the
+ * stream, the entire ChaCha20 state is updated.
+ *
+ * Note, as the ChaCha20 implements a 32 bit counter, we must ensure
+ * that this function is only invoked for at most 2^32 - 1 ChaCha20 blocks
+ * before a reseed or an update happens.
This is ensured by the variable + * outbuflen which is a 32 bit integer defining the number of bytes to be + * generated by the ChaCha20 DRNG. At the end of this function, an update + * operation is invoked which implies that the 32 bit counter will never be + * overflown in this implementation. + */ +static int lrng_cc20_drng_generate_helper(void *drng, u8 *outbuf, u32 outbuflen) +{ + struct chacha20_state *chacha20_state = (struct chacha20_state *)drng; + struct chacha20_block *chacha20 = &chacha20_state->block; + __le32 aligned_buf[CHACHA_BLOCK_WORDS]; + u32 ret = outbuflen, used = CHACHA_BLOCK_WORDS; + int zeroize_buf = 0; + + while (outbuflen >= CHACHA_BLOCK_SIZE) { + chacha20_block(&chacha20->constants[0], outbuf); + outbuf += CHACHA_BLOCK_SIZE; + outbuflen -= CHACHA_BLOCK_SIZE; + } + + if (outbuflen) { + chacha20_block(&chacha20->constants[0], (u8 *)aligned_buf); + memcpy(outbuf, aligned_buf, outbuflen); + used = ((outbuflen + sizeof(aligned_buf[0]) - 1) / + sizeof(aligned_buf[0])); + zeroize_buf = 1; + } + + lrng_chacha20_update(chacha20_state, aligned_buf, used); + + if (zeroize_buf) + memzero_explicit(aligned_buf, sizeof(aligned_buf)); + + return ret; +} + +/* + * Allocation of the DRNG state + */ +static void *lrng_cc20_drng_alloc(u32 sec_strength) +{ + struct chacha20_state *state = NULL; + + if (sec_strength > CHACHA_KEY_SIZE) { + pr_err("Security strength of ChaCha20 DRNG (%u bits) lower than requested by LRNG (%u bits)\n", + CHACHA_KEY_SIZE * 8, sec_strength * 8); + return ERR_PTR(-EINVAL); + } + if (sec_strength < CHACHA_KEY_SIZE) + pr_warn("Security strength of ChaCha20 DRNG (%u bits) higher than requested by LRNG (%u bits)\n", + CHACHA_KEY_SIZE * 8, sec_strength * 8); + + state = kmalloc(sizeof(struct chacha20_state), GFP_KERNEL); + if (!state) + return ERR_PTR(-ENOMEM); + pr_debug("memory for ChaCha20 core allocated\n"); + + lrng_cc20_init_rfc7539(&state->block); + + return state; +} + +static void lrng_cc20_drng_dealloc(void *drng) +{ + struct chacha20_state *chacha20_state = (struct chacha20_state *)drng; + + pr_debug("ChaCha20 core zeroized and freed\n"); + kfree_sensitive(chacha20_state); +} + +static const char *lrng_cc20_drng_name(void) +{ + return "ChaCha20 DRNG"; +} + +const struct lrng_drng_cb lrng_cc20_drng_cb = { + .drng_name = lrng_cc20_drng_name, + .drng_alloc = lrng_cc20_drng_alloc, + .drng_dealloc = lrng_cc20_drng_dealloc, + .drng_seed = lrng_cc20_drng_seed_helper, + .drng_generate = lrng_cc20_drng_generate_helper, +}; + +#if !defined(CONFIG_LRNG_DFLT_DRNG_CHACHA20) && \ + !defined(CONFIG_LRNG_DRNG_ATOMIC) +static int __init lrng_cc20_drng_init(void) +{ + return lrng_set_drng_cb(&lrng_cc20_drng_cb); +} + +static void __exit lrng_cc20_drng_exit(void) +{ + lrng_set_drng_cb(NULL); +} + +late_initcall(lrng_cc20_drng_init); +module_exit(lrng_cc20_drng_exit); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_AUTHOR("Stephan Mueller "); +MODULE_DESCRIPTION("Entropy Source and DRNG Manager - ChaCha20-based DRNG backend"); +#endif /* CONFIG_LRNG_DFLT_DRNG_CHACHA20 */ diff --git a/drivers/char/lrng/lrng_drng_chacha20.h b/drivers/char/lrng/lrng_drng_chacha20.h new file mode 100644 index 000000000000..fee6571281b6 --- /dev/null +++ b/drivers/char/lrng/lrng_drng_chacha20.h @@ -0,0 +1,42 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * LRNG ChaCha20 definitions + * + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_CHACHA20_H +#define _LRNG_CHACHA20_H + +#include + +/* State according to RFC 7539 section 2.3 */ +struct chacha20_block { + u32 
constants[4]; + union { +#define CHACHA_KEY_SIZE_WORDS (CHACHA_KEY_SIZE / sizeof(u32)) + u32 u[CHACHA_KEY_SIZE_WORDS]; + u8 b[CHACHA_KEY_SIZE]; + } key; + u32 counter; + u32 nonce[3]; +}; + +struct chacha20_state { + struct chacha20_block block; +}; + +static inline void lrng_cc20_init_rfc7539(struct chacha20_block *chacha20) +{ + chacha_init_consts(chacha20->constants); +} + +#define LRNG_CC20_INIT_RFC7539(x) \ + x.constants[0] = 0x61707865, \ + x.constants[1] = 0x3320646e, \ + x.constants[2] = 0x79622d32, \ + x.constants[3] = 0x6b206574, + +extern const struct lrng_drng_cb lrng_cc20_drng_cb; + +#endif /* _LRNG_CHACHA20_H */ diff --git a/drivers/char/lrng/lrng_drng_drbg.c b/drivers/char/lrng/lrng_drng_drbg.c new file mode 100644 index 000000000000..1feec2c095a5 --- /dev/null +++ b/drivers/char/lrng/lrng_drng_drbg.c @@ -0,0 +1,179 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * Backend for the LRNG providing the cryptographic primitives using the + * kernel crypto API and its DRBG. + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include + +#include "lrng_drng_drbg.h" + +/* + * Define a DRBG plus a hash / MAC used to extract data from the entropy pool. + * For LRNG_HASH_NAME you can use a hash or a MAC (HMAC or CMAC) of your choice + * (Note, you should use the suggested selections below -- using SHA-1 or MD5 + * is not wise). The idea is that the used cipher primitive can be selected to + * be the same as used for the DRBG. I.e. the LRNG only uses one cipher + * primitive using the same cipher implementation with the options offered in + * the following. This means, if the CTR DRBG is selected and AES-NI is present, + * both the CTR DRBG and the selected cmac(aes) use AES-NI. + * + * The security strengths of the DRBGs are all 256 bits according to + * SP800-57 section 5.6.1. + * + * This definition is allowed to be changed. + */ +#ifdef CONFIG_CRYPTO_DRBG_CTR +static unsigned int lrng_drbg_type = 0; +#elif defined CONFIG_CRYPTO_DRBG_HMAC +static unsigned int lrng_drbg_type = 1; +#elif defined CONFIG_CRYPTO_DRBG_HASH +static unsigned int lrng_drbg_type = 2; +#else +#error "Unknown DRBG in use" +#endif + +/* The parameter must be r/o in sysfs as otherwise races appear. 
*/ +module_param(lrng_drbg_type, uint, 0444); +MODULE_PARM_DESC(lrng_drbg_type, "DRBG type used for LRNG (0->CTR_DRBG, 1->HMAC_DRBG, 2->Hash_DRBG)"); + +struct lrng_drbg { + const char *hash_name; + const char *drbg_core; +}; + +static const struct lrng_drbg lrng_drbg_types[] = { + { /* CTR_DRBG with AES-256 using derivation function */ + .drbg_core = "drbg_nopr_ctr_aes256", + }, { /* HMAC_DRBG with SHA-512 */ + .drbg_core = "drbg_nopr_hmac_sha512", + }, { /* Hash_DRBG with SHA-512 using derivation function */ + .drbg_core = "drbg_nopr_sha512" + } +}; + +static int lrng_drbg_drng_seed_helper(void *drng, const u8 *inbuf, u32 inbuflen) +{ + struct drbg_state *drbg = (struct drbg_state *)drng; + LIST_HEAD(seedlist); + struct drbg_string data; + int ret; + + drbg_string_fill(&data, inbuf, inbuflen); + list_add_tail(&data.list, &seedlist); + ret = drbg->d_ops->update(drbg, &seedlist, drbg->seeded); + + if (ret >= 0) + drbg->seeded = true; + + return ret; +} + +static int lrng_drbg_drng_generate_helper(void *drng, u8 *outbuf, u32 outbuflen) +{ + struct drbg_state *drbg = (struct drbg_state *)drng; + + return drbg->d_ops->generate(drbg, outbuf, outbuflen, NULL); +} + +static void *lrng_drbg_drng_alloc(u32 sec_strength) +{ + struct drbg_state *drbg; + int coreref = -1; + bool pr = false; + int ret; + + drbg_convert_tfm_core(lrng_drbg_types[lrng_drbg_type].drbg_core, + &coreref, &pr); + if (coreref < 0) + return ERR_PTR(-EFAULT); + + drbg = kzalloc(sizeof(struct drbg_state), GFP_KERNEL); + if (!drbg) + return ERR_PTR(-ENOMEM); + + drbg->core = &drbg_cores[coreref]; + drbg->seeded = false; + ret = drbg_alloc_state(drbg); + if (ret) + goto err; + + if (sec_strength > drbg_sec_strength(drbg->core->flags)) { + pr_err("Security strength of DRBG (%u bits) lower than requested by LRNG (%u bits)\n", + drbg_sec_strength(drbg->core->flags) * 8, + sec_strength * 8); + goto dealloc; + } + + if (sec_strength < drbg_sec_strength(drbg->core->flags)) + pr_warn("Security strength of DRBG (%u bits) higher than requested by LRNG (%u bits)\n", + drbg_sec_strength(drbg->core->flags) * 8, + sec_strength * 8); + + pr_info("DRBG with %s core allocated\n", drbg->core->backend_cra_name); + + return drbg; + +dealloc: + if (drbg->d_ops) + drbg->d_ops->crypto_fini(drbg); + drbg_dealloc_state(drbg); +err: + kfree(drbg); + return ERR_PTR(-EINVAL); +} + +static void lrng_drbg_drng_dealloc(void *drng) +{ + struct drbg_state *drbg = (struct drbg_state *)drng; + + if (drbg && drbg->d_ops) + drbg->d_ops->crypto_fini(drbg); + drbg_dealloc_state(drbg); + kfree_sensitive(drbg); + pr_info("DRBG deallocated\n"); +} + +static const char *lrng_drbg_name(void) +{ + return lrng_drbg_types[lrng_drbg_type].drbg_core; +} + +const struct lrng_drng_cb lrng_drbg_cb = { + .drng_name = lrng_drbg_name, + .drng_alloc = lrng_drbg_drng_alloc, + .drng_dealloc = lrng_drbg_drng_dealloc, + .drng_seed = lrng_drbg_drng_seed_helper, + .drng_generate = lrng_drbg_drng_generate_helper, +}; + +#ifndef CONFIG_LRNG_DFLT_DRNG_DRBG +static int __init lrng_drbg_init(void) +{ + if (lrng_drbg_type >= ARRAY_SIZE(lrng_drbg_types)) { + pr_err("lrng_drbg_type parameter too large (given %u - max: %lu)", + lrng_drbg_type, + (unsigned long)ARRAY_SIZE(lrng_drbg_types) - 1); + return -EAGAIN; + } + return lrng_set_drng_cb(&lrng_drbg_cb); +} + +static void __exit lrng_drbg_exit(void) +{ + lrng_set_drng_cb(NULL); +} + +late_initcall(lrng_drbg_init); +module_exit(lrng_drbg_exit); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_AUTHOR("Stephan Mueller "); +MODULE_DESCRIPTION("Entropy Source 
and DRNG Manager - SP800-90A DRBG backend"); +#endif /* CONFIG_LRNG_DFLT_DRNG_DRBG */ diff --git a/drivers/char/lrng/lrng_drng_drbg.h b/drivers/char/lrng/lrng_drng_drbg.h new file mode 100644 index 000000000000..b3d556caf294 --- /dev/null +++ b/drivers/char/lrng/lrng_drng_drbg.h @@ -0,0 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * LRNG SP800-90A definitions + * + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_DRBG_H +#define _LRNG_DRBG_H + +extern const struct lrng_drng_cb lrng_drbg_cb; + +#endif /* _LRNG_DRBG_H */ diff --git a/drivers/char/lrng/lrng_drng_kcapi.c b/drivers/char/lrng/lrng_drng_kcapi.c new file mode 100644 index 000000000000..a204bcf52a9a --- /dev/null +++ b/drivers/char/lrng/lrng_drng_kcapi.c @@ -0,0 +1,208 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * Backend for the LRNG providing the cryptographic primitives using the + * kernel crypto API. + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include + +#include "lrng_drng_kcapi.h" + +static char *drng_name = NULL; +module_param(drng_name, charp, 0444); +MODULE_PARM_DESC(drng_name, "Kernel crypto API name of DRNG"); + +static char *seed_hash = NULL; +module_param(seed_hash, charp, 0444); +MODULE_PARM_DESC(seed_hash, + "Kernel crypto API name of hash with output size equal to seedsize of DRNG to bring seed string to the size required by the DRNG"); + +struct lrng_drng_info { + struct crypto_rng *kcapi_rng; + struct crypto_shash *hash_tfm; +}; + +static int lrng_kcapi_drng_seed_helper(void *drng, const u8 *inbuf, + u32 inbuflen) +{ + struct lrng_drng_info *lrng_drng_info = (struct lrng_drng_info *)drng; + struct crypto_rng *kcapi_rng = lrng_drng_info->kcapi_rng; + struct crypto_shash *hash_tfm = lrng_drng_info->hash_tfm; + SHASH_DESC_ON_STACK(shash, hash_tfm); + u32 digestsize; + u8 digest[HASH_MAX_DIGESTSIZE] __aligned(8); + int ret; + + if (!hash_tfm) + return crypto_rng_reset(kcapi_rng, inbuf, inbuflen); + + shash->tfm = hash_tfm; + digestsize = crypto_shash_digestsize(hash_tfm); + + ret = crypto_shash_digest(shash, inbuf, inbuflen, digest); + shash_desc_zero(shash); + if (ret) + return ret; + + ret = crypto_rng_reset(kcapi_rng, digest, digestsize); + if (ret) + return ret; + + memzero_explicit(digest, digestsize); + return 0; +} + +static int lrng_kcapi_drng_generate_helper(void *drng, u8 *outbuf, + u32 outbuflen) +{ + struct lrng_drng_info *lrng_drng_info = (struct lrng_drng_info *)drng; + struct crypto_rng *kcapi_rng = lrng_drng_info->kcapi_rng; + int ret = crypto_rng_get_bytes(kcapi_rng, outbuf, outbuflen); + + if (ret < 0) + return ret; + + return outbuflen; +} + +static void *lrng_kcapi_drng_alloc(u32 sec_strength) +{ + struct lrng_drng_info *lrng_drng_info; + struct crypto_rng *kcapi_rng; + u32 time = random_get_entropy(); + int seedsize, rv; + void *ret = ERR_PTR(-ENOMEM); + + if (!drng_name) { + pr_err("DRNG name missing\n"); + return ERR_PTR(-EINVAL); + } + + if (!memcmp(drng_name, "stdrng", 6) || + !memcmp(drng_name, "lrng", 4) || + !memcmp(drng_name, "drbg", 4) || + !memcmp(drng_name, "jitterentropy_rng", 17)) { + pr_err("Refusing to load the requested random number generator\n"); + return ERR_PTR(-EINVAL); + } + + lrng_drng_info = kzalloc(sizeof(*lrng_drng_info), GFP_KERNEL); + if (!lrng_drng_info) + return ERR_PTR(-ENOMEM); + + kcapi_rng = crypto_alloc_rng(drng_name, 0, 0); + if (IS_ERR(kcapi_rng)) { + pr_err("DRNG %s cannot be allocated\n", drng_name); + ret = 
ERR_CAST(kcapi_rng); + goto free; + } + + lrng_drng_info->kcapi_rng = kcapi_rng; + + seedsize = crypto_rng_seedsize(kcapi_rng); + if (seedsize) { + struct crypto_shash *hash_tfm; + + if (!seed_hash) { + switch (seedsize) { + case 32: + seed_hash = "sha256"; + break; + case 48: + seed_hash = "sha384"; + break; + case 64: + seed_hash = "sha512"; + break; + default: + pr_err("Seed size %d cannot be processed\n", + seedsize); + goto dealloc; + } + } + + hash_tfm = crypto_alloc_shash(seed_hash, 0, 0); + if (IS_ERR(hash_tfm)) { + ret = ERR_CAST(hash_tfm); + goto dealloc; + } + + if (seedsize != crypto_shash_digestsize(hash_tfm)) { + pr_err("Seed hash output size not equal to DRNG seed size\n"); + crypto_free_shash(hash_tfm); + ret = ERR_PTR(-EINVAL); + goto dealloc; + } + + lrng_drng_info->hash_tfm = hash_tfm; + + pr_info("Seed hash %s allocated\n", seed_hash); + } + + rv = lrng_kcapi_drng_seed_helper(lrng_drng_info, (u8 *)(&time), + sizeof(time)); + if (rv) { + ret = ERR_PTR(rv); + goto dealloc; + } + + pr_info("Kernel crypto API DRNG %s allocated\n", drng_name); + + return lrng_drng_info; + +dealloc: + crypto_free_rng(kcapi_rng); +free: + kfree(lrng_drng_info); + return ret; +} + +static void lrng_kcapi_drng_dealloc(void *drng) +{ + struct lrng_drng_info *lrng_drng_info = (struct lrng_drng_info *)drng; + struct crypto_rng *kcapi_rng = lrng_drng_info->kcapi_rng; + + crypto_free_rng(kcapi_rng); + if (lrng_drng_info->hash_tfm) + crypto_free_shash(lrng_drng_info->hash_tfm); + kfree(lrng_drng_info); + pr_info("DRNG %s deallocated\n", drng_name); +} + +static const char *lrng_kcapi_drng_name(void) +{ + return drng_name; +} + +const struct lrng_drng_cb lrng_kcapi_drng_cb = { + .drng_name = lrng_kcapi_drng_name, + .drng_alloc = lrng_kcapi_drng_alloc, + .drng_dealloc = lrng_kcapi_drng_dealloc, + .drng_seed = lrng_kcapi_drng_seed_helper, + .drng_generate = lrng_kcapi_drng_generate_helper, +}; + +#ifndef CONFIG_LRNG_DFLT_DRNG_KCAPI +static int __init lrng_kcapi_init(void) +{ + return lrng_set_drng_cb(&lrng_kcapi_drng_cb); +} +static void __exit lrng_kcapi_exit(void) +{ + lrng_set_drng_cb(NULL); +} + +late_initcall(lrng_kcapi_init); +module_exit(lrng_kcapi_exit); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_AUTHOR("Stephan Mueller "); +MODULE_DESCRIPTION("Entropy Source and DRNG Manager - kernel crypto API DRNG backend"); +#endif /* CONFIG_LRNG_DFLT_DRNG_KCAPI */ diff --git a/drivers/char/lrng/lrng_drng_kcapi.h b/drivers/char/lrng/lrng_drng_kcapi.h new file mode 100644 index 000000000000..5db25aaf830c --- /dev/null +++ b/drivers/char/lrng/lrng_drng_kcapi.h @@ -0,0 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * LRNG kernel crypto API DRNG definition + * + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_KCAPI_DRNG_H +#define _LRNG_KCAPI_DRNG_H + +extern const struct lrng_drng_cb lrng_kcapi_drng_cb; + +#endif /* _LRNG_KCAPI_DRNG_H */ diff --git a/drivers/char/lrng/lrng_drng_mgr.c b/drivers/char/lrng/lrng_drng_mgr.c new file mode 100644 index 000000000000..4c31cbfc79b6 --- /dev/null +++ b/drivers/char/lrng/lrng_drng_mgr.c @@ -0,0 +1,485 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG DRNG management + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include + +#include "lrng_drng_atomic.h" +#include "lrng_drng_chacha20.h" +#include "lrng_drng_drbg.h" +#include "lrng_drng_kcapi.h" +#include "lrng_drng_mgr.h" +#include "lrng_es_aux.h" +#include "lrng_es_mgr.h" +#include 
"lrng_interface_random_kernel.h" +#include "lrng_numa.h" +#include "lrng_sha.h" + +/* + * Maximum number of seconds between DRNG reseed intervals of the DRNG. Note, + * this is enforced with the next request of random numbers from the + * DRNG. Setting this value to zero implies a reseeding attempt before every + * generated random number. + */ +int lrng_drng_reseed_max_time = 600; + +/* + * Is LRNG for general-purpose use (i.e. is at least the lrng_drng_init + * fully allocated)? + */ +static atomic_t lrng_avail = ATOMIC_INIT(0); + +/* Guard protecting all crypto callback update operation of all DRNGs. */ +DEFINE_MUTEX(lrng_crypto_cb_update); + +/* + * Default hash callback that provides the crypto primitive right from the + * kernel start. It must not perform any memory allocation operation, but + * simply perform the hash calculation. + */ +const struct lrng_hash_cb *lrng_default_hash_cb = &lrng_sha_hash_cb; + +/* + * Default DRNG callback that provides the crypto primitive which is + * allocated either during late kernel boot stage. So, it is permissible for + * the callback to perform memory allocation operations. + */ +const struct lrng_drng_cb *lrng_default_drng_cb = +#if defined(CONFIG_LRNG_DFLT_DRNG_CHACHA20) + &lrng_cc20_drng_cb; +#elif defined(CONFIG_LRNG_DFLT_DRNG_DRBG) + &lrng_drbg_cb; +#elif defined(CONFIG_LRNG_DFLT_DRNG_KCAPI) + &lrng_kcapi_drng_cb; +#else +#error "Unknown default DRNG selected" +#endif + +/* DRNG for non-atomic use cases */ +static struct lrng_drng lrng_drng_init = { + LRNG_DRNG_STATE_INIT(lrng_drng_init, NULL, NULL, NULL, + &lrng_sha_hash_cb), + .lock = __MUTEX_INITIALIZER(lrng_drng_init.lock), +}; + +static u32 max_wo_reseed = LRNG_DRNG_MAX_WITHOUT_RESEED; +#ifdef CONFIG_LRNG_RUNTIME_MAX_WO_RESEED_CONFIG +module_param(max_wo_reseed, uint, 0444); +MODULE_PARM_DESC(max_wo_reseed, + "Maximum number of DRNG generate operation without full reseed\n"); +#endif + +/* Wait queue to wait until the LRNG is initialized - can freely be used */ +DECLARE_WAIT_QUEUE_HEAD(lrng_init_wait); + +/********************************** Helper ************************************/ + +bool lrng_get_available(void) +{ + return likely(atomic_read(&lrng_avail)); +} + +struct lrng_drng *lrng_drng_init_instance(void) +{ + return &lrng_drng_init; +} + +struct lrng_drng *lrng_drng_node_instance(void) +{ + struct lrng_drng **lrng_drng = lrng_drng_instances(); + int node = numa_node_id(); + + if (lrng_drng && lrng_drng[node]) + return lrng_drng[node]; + + return lrng_drng_init_instance(); +} + +void lrng_drng_reset(struct lrng_drng *drng) +{ + atomic_set(&drng->requests, LRNG_DRNG_RESEED_THRESH); + atomic_set(&drng->requests_since_fully_seeded, 0); + drng->last_seeded = jiffies; + drng->fully_seeded = false; + drng->force_reseed = true; + pr_debug("reset DRNG\n"); +} + +/* Initialize the DRNG, except the mutex lock */ +int lrng_drng_alloc_common(struct lrng_drng *drng, + const struct lrng_drng_cb *drng_cb) +{ + if (!drng || !drng_cb) + return -EINVAL; + + drng->drng_cb = drng_cb; + drng->drng = drng_cb->drng_alloc(LRNG_DRNG_SECURITY_STRENGTH_BYTES); + if (IS_ERR(drng->drng)) + return PTR_ERR(drng->drng); + + lrng_drng_reset(drng); + return 0; +} + +/* Initialize the default DRNG during boot and perform its seeding */ +int lrng_drng_initalize(void) +{ + int ret; + + if (lrng_get_available()) + return 0; + + /* Catch programming error */ + WARN_ON(lrng_drng_init.hash_cb != lrng_default_hash_cb); + + mutex_lock(&lrng_drng_init.lock); + if (lrng_get_available()) { + 
mutex_unlock(&lrng_drng_init.lock); + return 0; + } + + ret = lrng_drng_alloc_common(&lrng_drng_init, lrng_default_drng_cb); + mutex_unlock(&lrng_drng_init.lock); + if (ret) + return ret; + + pr_debug("LRNG for general use is available\n"); + atomic_set(&lrng_avail, 1); + + /* Seed the DRNG with any entropy available */ + if (!lrng_pool_trylock()) { + pr_info("Initial DRNG initialized triggering first seeding\n"); + lrng_drng_seed_work(NULL); + } else { + pr_info("Initial DRNG initialized without seeding\n"); + } + + return 0; +} + +static int __init lrng_drng_make_available(void) +{ + return lrng_drng_initalize(); +} +late_initcall(lrng_drng_make_available); + +bool lrng_sp80090c_compliant(void) +{ + /* SP800-90C compliant oversampling is only requested in FIPS mode */ + return fips_enabled; +} + +/************************* Random Number Generation ***************************/ + +/* Inject a data buffer into the DRNG - caller must hold its lock */ +void lrng_drng_inject(struct lrng_drng *drng, const u8 *inbuf, u32 inbuflen, + bool fully_seeded, const char *drng_type) +{ + BUILD_BUG_ON(LRNG_DRNG_RESEED_THRESH > INT_MAX); + pr_debug("seeding %s DRNG with %u bytes\n", drng_type, inbuflen); + if (drng->drng_cb->drng_seed(drng->drng, inbuf, inbuflen) < 0) { + pr_warn("seeding of %s DRNG failed\n", drng_type); + drng->force_reseed = true; + } else { + int gc = LRNG_DRNG_RESEED_THRESH - atomic_read(&drng->requests); + + pr_debug("%s DRNG stats since last seeding: %lu secs; generate calls: %d\n", + drng_type, + (time_after(jiffies, drng->last_seeded) ? + (jiffies - drng->last_seeded) : 0) / HZ, gc); + + /* Count the numbers of generate ops since last fully seeded */ + if (fully_seeded) + atomic_set(&drng->requests_since_fully_seeded, 0); + else + atomic_add(gc, &drng->requests_since_fully_seeded); + + drng->last_seeded = jiffies; + atomic_set(&drng->requests, LRNG_DRNG_RESEED_THRESH); + drng->force_reseed = false; + + if (!drng->fully_seeded) { + drng->fully_seeded = fully_seeded; + if (drng->fully_seeded) + pr_debug("%s DRNG fully seeded\n", drng_type); + } + } +} + +/* Perform the seeding of the DRNG with data from entropy source */ +static void lrng_drng_seed_es(struct lrng_drng *drng) +{ + struct entropy_buf seedbuf __aligned(LRNG_KCAPI_ALIGN); + + lrng_fill_seed_buffer(&seedbuf, + lrng_get_seed_entropy_osr(drng->fully_seeded)); + + mutex_lock(&drng->lock); + lrng_drng_inject(drng, (u8 *)&seedbuf, sizeof(seedbuf), + lrng_fully_seeded_eb(drng->fully_seeded, &seedbuf), + "regular"); + mutex_unlock(&drng->lock); + + /* Set the seeding state of the LRNG */ + lrng_init_ops(&seedbuf); + + memzero_explicit(&seedbuf, sizeof(seedbuf)); +} + +static void lrng_drng_seed(struct lrng_drng *drng) +{ + BUILD_BUG_ON(LRNG_MIN_SEED_ENTROPY_BITS > + LRNG_DRNG_SECURITY_STRENGTH_BITS); + + if (lrng_get_available()) { + /* (Re-)Seed DRNG */ + lrng_drng_seed_es(drng); + /* (Re-)Seed atomic DRNG from regular DRNG */ + lrng_drng_atomic_seed_drng(drng); + } else { + /* + * If no-one is waiting for the DRNG, seed the atomic DRNG + * directly from the entropy sources. 
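+		 * Otherwise merely record the seeding state via
+		 * lrng_init_ops(); no entropy is spent on the atomic DRNG in
+		 * this case.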
+ */ + if (!wq_has_sleeper(&lrng_init_wait) && + !lrng_ready_chain_has_sleeper()) + lrng_drng_atomic_seed_es(); + else + lrng_init_ops(NULL); + } +} + +static void _lrng_drng_seed_work(struct lrng_drng *drng, u32 node) +{ + pr_debug("reseed triggered by system events for DRNG on NUMA node %d\n", + node); + lrng_drng_seed(drng); + if (drng->fully_seeded) { + /* Prevent reseed storm */ + drng->last_seeded += node * 100 * HZ; + } +} + +/* + * DRNG reseed trigger: Kernel thread handler triggered by the schedule_work() + */ +void lrng_drng_seed_work(struct work_struct *dummy) +{ + struct lrng_drng **lrng_drng = lrng_drng_instances(); + u32 node; + + if (lrng_drng) { + for_each_online_node(node) { + struct lrng_drng *drng = lrng_drng[node]; + + if (drng && !drng->fully_seeded) { + _lrng_drng_seed_work(drng, node); + goto out; + } + } + } else { + if (!lrng_drng_init.fully_seeded) { + _lrng_drng_seed_work(&lrng_drng_init, 0); + goto out; + } + } + + lrng_pool_all_numa_nodes_seeded(true); + +out: + /* Allow the seeding operation to be called again */ + lrng_pool_unlock(); +} + +/* Force all DRNGs to reseed before next generation */ +void lrng_drng_force_reseed(void) +{ + struct lrng_drng **lrng_drng = lrng_drng_instances(); + u32 node; + + /* + * If the initial DRNG is over the reseed threshold, allow a forced + * reseed only for the initial DRNG as this is the fallback for all. It + * must be kept seeded before all others to keep the LRNG operational. + */ + if (!lrng_drng || + (atomic_read_u32(&lrng_drng_init.requests_since_fully_seeded) > + LRNG_DRNG_RESEED_THRESH)) { + lrng_drng_init.force_reseed = lrng_drng_init.fully_seeded; + pr_debug("force reseed of initial DRNG\n"); + return; + } + for_each_online_node(node) { + struct lrng_drng *drng = lrng_drng[node]; + + if (!drng) + continue; + + drng->force_reseed = drng->fully_seeded; + pr_debug("force reseed of DRNG on node %u\n", node); + } + lrng_drng_atomic_force_reseed(); +} +EXPORT_SYMBOL(lrng_drng_force_reseed); + +static bool lrng_drng_must_reseed(struct lrng_drng *drng) +{ + return (atomic_dec_and_test(&drng->requests) || + drng->force_reseed || + time_after(jiffies, + drng->last_seeded + lrng_drng_reseed_max_time * HZ)); +} + +/* + * lrng_drng_get() - Get random data out of the DRNG which is reseeded + * frequently. 
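+ *
+ * Minimal usage sketch (illustrative only - lrng_drng_get_sleep() below is
+ * one in-tree caller):
+ *
+ *	u8 buf[32];
+ *	int ret = lrng_drng_get(drng, buf, sizeof(buf));
+ *
+ *	if (ret != sizeof(buf))
+ *		pr_warn("DRNG request failed or was truncated: %d\n", ret);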
+ * + * @drng: DRNG instance + * @outbuf: buffer for storing random data + * @outbuflen: length of outbuf + * + * Return: + * * < 0 in error case (DRNG generation or update failed) + * * >=0 returning the returned number of bytes + */ +int lrng_drng_get(struct lrng_drng *drng, u8 *outbuf, u32 outbuflen) +{ + u32 processed = 0; + + if (!outbuf || !outbuflen) + return 0; + + if (!lrng_get_available()) + return -EOPNOTSUPP; + + outbuflen = min_t(size_t, outbuflen, INT_MAX); + + /* If DRNG operated without proper reseed for too long, block LRNG */ + BUILD_BUG_ON(LRNG_DRNG_MAX_WITHOUT_RESEED < LRNG_DRNG_RESEED_THRESH); + if (atomic_read_u32(&drng->requests_since_fully_seeded) > max_wo_reseed) + lrng_unset_fully_seeded(drng); + + while (outbuflen) { + u32 todo = min_t(u32, outbuflen, LRNG_DRNG_MAX_REQSIZE); + int ret; + + if (lrng_drng_must_reseed(drng)) { + if (lrng_pool_trylock()) { + drng->force_reseed = true; + } else { + lrng_drng_seed(drng); + lrng_pool_unlock(); + } + } + + mutex_lock(&drng->lock); + ret = drng->drng_cb->drng_generate(drng->drng, + outbuf + processed, todo); + mutex_unlock(&drng->lock); + if (ret <= 0) { + pr_warn("getting random data from DRNG failed (%d)\n", + ret); + return -EFAULT; + } + processed += ret; + outbuflen -= ret; + } + + return processed; +} + +int lrng_drng_get_sleep(u8 *outbuf, u32 outbuflen) +{ + struct lrng_drng **lrng_drng = lrng_drng_instances(); + struct lrng_drng *drng = &lrng_drng_init; + int ret, node = numa_node_id(); + + might_sleep(); + + if (lrng_drng && lrng_drng[node] && lrng_drng[node]->fully_seeded) + drng = lrng_drng[node]; + + ret = lrng_drng_initalize(); + if (ret) + return ret; + + return lrng_drng_get(drng, outbuf, outbuflen); +} + +/* Reset LRNG such that all existing entropy is gone */ +static void _lrng_reset(struct work_struct *work) +{ + struct lrng_drng **lrng_drng = lrng_drng_instances(); + + if (!lrng_drng) { + mutex_lock(&lrng_drng_init.lock); + lrng_drng_reset(&lrng_drng_init); + mutex_unlock(&lrng_drng_init.lock); + } else { + u32 node; + + for_each_online_node(node) { + struct lrng_drng *drng = lrng_drng[node]; + + if (!drng) + continue; + mutex_lock(&drng->lock); + lrng_drng_reset(drng); + mutex_unlock(&drng->lock); + } + } + lrng_drng_atomic_reset(); + lrng_set_entropy_thresh(LRNG_INIT_ENTROPY_BITS); + + lrng_reset_state(); +} + +static DECLARE_WORK(lrng_reset_work, _lrng_reset); + +void lrng_reset(void) +{ + schedule_work(&lrng_reset_work); +} + +/******************* Generic LRNG kernel output interfaces ********************/ + +int lrng_drng_sleep_while_nonoperational(int nonblock) +{ + if (likely(lrng_state_operational())) + return 0; + if (nonblock) + return -EAGAIN; + return wait_event_interruptible(lrng_init_wait, + lrng_state_operational()); +} + +int lrng_drng_sleep_while_non_min_seeded(void) +{ + if (likely(lrng_state_min_seeded())) + return 0; + return wait_event_interruptible(lrng_init_wait, + lrng_state_min_seeded()); +} + +void lrng_get_random_bytes_full(void *buf, int nbytes) +{ + lrng_drng_sleep_while_nonoperational(0); + lrng_drng_get_sleep((u8 *)buf, (u32)nbytes); +} +EXPORT_SYMBOL(lrng_get_random_bytes_full); + +void lrng_get_random_bytes_min(void *buf, int nbytes) +{ + lrng_drng_sleep_while_non_min_seeded(); + lrng_drng_get_sleep((u8 *)buf, (u32)nbytes); +} +EXPORT_SYMBOL(lrng_get_random_bytes_min); diff --git a/drivers/char/lrng/lrng_drng_mgr.h b/drivers/char/lrng/lrng_drng_mgr.h new file mode 100644 index 000000000000..34ca5913dbd2 --- /dev/null +++ b/drivers/char/lrng/lrng_drng_mgr.h @@ -0,0 +1,84 
@@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_DRNG_H +#define _LRNG_DRNG_H + +#include +#include +#include + +#include "lrng_definitions.h" + +extern struct wait_queue_head lrng_init_wait; +extern int lrng_drng_reseed_max_time; +extern struct mutex lrng_crypto_cb_update; +extern const struct lrng_drng_cb *lrng_default_drng_cb; +extern const struct lrng_hash_cb *lrng_default_hash_cb; + +/* DRNG state handle */ +struct lrng_drng { + void *drng; /* DRNG handle */ + void *hash; /* Hash handle */ + const struct lrng_drng_cb *drng_cb; /* DRNG callbacks */ + const struct lrng_hash_cb *hash_cb; /* Hash callbacks */ + atomic_t requests; /* Number of DRNG requests */ + atomic_t requests_since_fully_seeded; /* Number DRNG requests since + * last fully seeded + */ + unsigned long last_seeded; /* Last time it was seeded */ + bool fully_seeded; /* Is DRNG fully seeded? */ + bool force_reseed; /* Force a reseed */ + + rwlock_t hash_lock; /* Lock hash_cb replacement */ + /* Lock write operations on DRNG state, DRNG replacement of drng_cb */ + struct mutex lock; /* Non-atomic DRNG operation */ + spinlock_t spin_lock; /* Atomic DRNG operation */ +}; + +#define LRNG_DRNG_STATE_INIT(x, d, h, d_cb, h_cb) \ + .drng = d, \ + .hash = h, \ + .drng_cb = d_cb, \ + .hash_cb = h_cb, \ + .requests = ATOMIC_INIT(LRNG_DRNG_RESEED_THRESH),\ + .requests_since_fully_seeded = ATOMIC_INIT(0), \ + .last_seeded = 0, \ + .fully_seeded = false, \ + .force_reseed = true, \ + .hash_lock = __RW_LOCK_UNLOCKED(x.hash_lock) + +struct lrng_drng *lrng_drng_init_instance(void); +struct lrng_drng *lrng_drng_node_instance(void); + +void lrng_reset(void); +int lrng_drng_alloc_common(struct lrng_drng *drng, + const struct lrng_drng_cb *crypto_cb); +int lrng_drng_initalize(void); +bool lrng_sp80090c_compliant(void); +bool lrng_get_available(void); +void lrng_drng_reset(struct lrng_drng *drng); +void lrng_drng_inject(struct lrng_drng *drng, const u8 *inbuf, u32 inbuflen, + bool fully_seeded, const char *drng_type); +int lrng_drng_get(struct lrng_drng *drng, u8 *outbuf, u32 outbuflen); +int lrng_drng_sleep_while_nonoperational(int nonblock); +int lrng_drng_sleep_while_non_min_seeded(void); +int lrng_drng_get_sleep(u8 *outbuf, u32 outbuflen); +void lrng_drng_seed_work(struct work_struct *dummy); +void lrng_drng_force_reseed(void); + +static inline u32 lrng_compress_osr(void) +{ + return lrng_sp80090c_compliant() ? LRNG_OVERSAMPLE_ES_BITS : 0; +} + +static inline u32 lrng_reduce_by_osr(u32 entropy_bits) +{ + u32 osr_bits = lrng_compress_osr(); + + return (entropy_bits >= osr_bits) ? (entropy_bits - osr_bits) : 0; +} + +#endif /* _LRNG_DRNG_H */ diff --git a/drivers/char/lrng/lrng_es_aux.c b/drivers/char/lrng/lrng_es_aux.c new file mode 100644 index 000000000000..245bc829998b --- /dev/null +++ b/drivers/char/lrng/lrng_es_aux.c @@ -0,0 +1,335 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG Slow Entropy Source: Auxiliary entropy pool + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include + +#include "lrng_es_aux.h" +#include "lrng_es_mgr.h" +#include "lrng_sysctl.h" + +/* + * This is the auxiliary pool + * + * The aux pool array is aligned to 8 bytes to comfort the kernel crypto API + * cipher implementations of the hash functions used to read the pool: for some + * accelerated implementations, we need an alignment to avoid a realignment + * which involves memcpy(). 
The alignment to 8 bytes should satisfy all crypto + * implementations. + */ +struct lrng_pool { + u8 aux_pool[LRNG_POOL_SIZE]; /* Aux pool: digest state */ + atomic_t aux_entropy_bits; + atomic_t digestsize; /* Digest size of used hash */ + bool initialized; /* Aux pool initialized? */ + + /* Serialize read of entropy pool and update of aux pool */ + spinlock_t lock; +}; + +static struct lrng_pool lrng_pool __aligned(LRNG_KCAPI_ALIGN) = { + .aux_entropy_bits = ATOMIC_INIT(0), + .digestsize = ATOMIC_INIT(LRNG_ATOMIC_DIGEST_SIZE), + .initialized = false, + .lock = __SPIN_LOCK_UNLOCKED(lrng_pool.lock) +}; + +/********************************** Helper ***********************************/ + +/* Entropy in bits present in aux pool */ +static u32 lrng_aux_avail_entropy(u32 __unused) +{ + /* Cap available entropy with max entropy */ + u32 avail_bits = min_t(u32, lrng_get_digestsize(), + atomic_read_u32(&lrng_pool.aux_entropy_bits)); + + /* Consider oversampling rate due to aux pool conditioning */ + return lrng_reduce_by_osr(avail_bits); +} + +/* Set the digest size of the used hash in bytes */ +static void lrng_set_digestsize(u32 digestsize) +{ + struct lrng_pool *pool = &lrng_pool; + u32 ent_bits = atomic_xchg_relaxed(&pool->aux_entropy_bits, 0), + old_digestsize = lrng_get_digestsize(); + + atomic_set(&lrng_pool.digestsize, digestsize); + + /* + * Update the write wakeup threshold which must not be larger + * than the digest size of the current conditioning hash. + */ + digestsize = lrng_reduce_by_osr(digestsize << 3); + lrng_sysctl_update_max_write_thresh(digestsize); + lrng_write_wakeup_bits = digestsize; + + /* + * In case the new digest is larger than the old one, cap the available + * entropy to the old message digest used to process the existing data. + */ + ent_bits = min_t(u32, ent_bits, old_digestsize); + atomic_add(ent_bits, &pool->aux_entropy_bits); +} + +static int __init lrng_init_wakeup_bits(void) +{ + u32 digestsize = lrng_reduce_by_osr(lrng_get_digestsize()); + + lrng_sysctl_update_max_write_thresh(digestsize); + lrng_write_wakeup_bits = digestsize; + return 0; +} +core_initcall(lrng_init_wakeup_bits); + +/* Obtain the digest size provided by the used hash in bits */ +u32 lrng_get_digestsize(void) +{ + return atomic_read_u32(&lrng_pool.digestsize) << 3; +} + +/* Set entropy content in user-space controllable aux pool */ +void lrng_pool_set_entropy(u32 entropy_bits) +{ + atomic_set(&lrng_pool.aux_entropy_bits, entropy_bits); +} + +static void lrng_aux_reset(void) +{ + lrng_pool_set_entropy(0); +} + +/* + * Replace old with new hash for auxiliary pool handling + * + * Assumption: the caller must guarantee that the new_cb is available during the + * entire operation (e.g. it must hold the write lock against pointer updating). + */ +static int +lrng_aux_switch_hash(struct lrng_drng *drng, int __unused, + const struct lrng_hash_cb *new_cb, void *new_hash, + const struct lrng_hash_cb *old_cb) +{ + struct lrng_drng *init_drng = lrng_drng_init_instance(); + struct lrng_pool *pool = &lrng_pool; + struct shash_desc *shash = (struct shash_desc *)pool->aux_pool; + u8 digest[LRNG_MAX_DIGESTSIZE]; + int ret; + + if (!IS_ENABLED(CONFIG_LRNG_SWITCH)) + return -EOPNOTSUPP; + + if (unlikely(!pool->initialized)) + return 0; + + /* We only switch if the processed DRNG is the initial DRNG. */ + if (init_drng != drng) + return 0; + + /* Get the aux pool hash with old digest ... */ + ret = old_cb->hash_final(shash, digest) ?: + /* ... re-initialize the hash with the new digest ... 
*/ + new_cb->hash_init(shash, new_hash) ?: + /* + * ... feed the old hash into the new state. We may feed + * uninitialized memory into the new state, but this is + * considered no issue and even good as we have some more + * uncertainty here. + */ + new_cb->hash_update(shash, digest, sizeof(digest)); + if (!ret) { + lrng_set_digestsize(new_cb->hash_digestsize(new_hash)); + pr_debug("Re-initialize aux entropy pool with hash %s\n", + new_cb->hash_name()); + } + + memzero_explicit(digest, sizeof(digest)); + return ret; +} + +/* Insert data into auxiliary pool by using the hash update function. */ +static int +lrng_aux_pool_insert_locked(const u8 *inbuf, u32 inbuflen, u32 entropy_bits) +{ + struct lrng_pool *pool = &lrng_pool; + struct shash_desc *shash = (struct shash_desc *)pool->aux_pool; + struct lrng_drng *drng = lrng_drng_init_instance(); + const struct lrng_hash_cb *hash_cb; + unsigned long flags; + void *hash; + int ret; + + entropy_bits = min_t(u32, entropy_bits, inbuflen << 3); + + read_lock_irqsave(&drng->hash_lock, flags); + hash_cb = drng->hash_cb; + hash = drng->hash; + + if (unlikely(!pool->initialized)) { + ret = hash_cb->hash_init(shash, hash); + if (ret) + goto out; + pool->initialized = true; + } + + ret = hash_cb->hash_update(shash, inbuf, inbuflen); + if (ret) + goto out; + + /* + * Cap the available entropy to the hash output size compliant to + * SP800-90B section 3.1.5.1 table 1. + */ + entropy_bits += atomic_read_u32(&pool->aux_entropy_bits); + atomic_set(&pool->aux_entropy_bits, + min_t(u32, entropy_bits, + hash_cb->hash_digestsize(hash) << 3)); + +out: + read_unlock_irqrestore(&drng->hash_lock, flags); + return ret; +} + +int lrng_pool_insert_aux(const u8 *inbuf, u32 inbuflen, u32 entropy_bits) +{ + struct lrng_pool *pool = &lrng_pool; + unsigned long flags; + int ret; + + spin_lock_irqsave(&pool->lock, flags); + ret = lrng_aux_pool_insert_locked(inbuf, inbuflen, entropy_bits); + spin_unlock_irqrestore(&pool->lock, flags); + + lrng_es_add_entropy(); + + return ret; +} +EXPORT_SYMBOL(lrng_pool_insert_aux); + +/************************* Get data from entropy pool *************************/ + +/* + * Get auxiliary entropy pool and its entropy content for seed buffer. + * Caller must hold lrng_pool.pool->lock. + * @outbuf: buffer to store data in with size requested_bits + * @requested_bits: Requested amount of entropy + * @return: amount of entropy in outbuf in bits. 
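+ *
+ * Worked example (illustrative): with a SHA-512 conditioning hash
+ * (digestsize_bits == 512), 256 requested bits and no SP800-90C
+ * oversampling, at most 256 bits of credited entropy are consumed and any
+ * surplus credit is put back into aux_entropy_bits.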
+ */ +static u32 lrng_aux_get_pool(u8 *outbuf, u32 requested_bits) +{ + struct lrng_pool *pool = &lrng_pool; + struct shash_desc *shash = (struct shash_desc *)pool->aux_pool; + struct lrng_drng *drng = lrng_drng_init_instance(); + const struct lrng_hash_cb *hash_cb; + unsigned long flags; + void *hash; + u32 collected_ent_bits, returned_ent_bits, unused_bits = 0, + digestsize, digestsize_bits, requested_bits_osr; + u8 aux_output[LRNG_MAX_DIGESTSIZE]; + + if (unlikely(!pool->initialized)) + return 0; + + read_lock_irqsave(&drng->hash_lock, flags); + + hash_cb = drng->hash_cb; + hash = drng->hash; + digestsize = hash_cb->hash_digestsize(hash); + digestsize_bits = digestsize << 3; + + /* Cap to maximum entropy that can ever be generated with given hash */ + lrng_cap_requested(digestsize_bits, requested_bits); + + /* Ensure that no more than the size of aux_pool can be requested */ + requested_bits = min_t(u32, requested_bits, (LRNG_MAX_DIGESTSIZE << 3)); + requested_bits_osr = requested_bits + lrng_compress_osr(); + + /* Cap entropy with entropy counter from aux pool and the used digest */ + collected_ent_bits = min_t(u32, digestsize_bits, + atomic_xchg_relaxed(&pool->aux_entropy_bits, 0)); + + /* We collected too much entropy and put the overflow back */ + if (collected_ent_bits > requested_bits_osr) { + /* Amount of bits we collected too much */ + unused_bits = collected_ent_bits - requested_bits_osr; + /* Put entropy back */ + atomic_add(unused_bits, &pool->aux_entropy_bits); + /* Fix collected entropy */ + collected_ent_bits = requested_bits_osr; + } + + /* Apply oversampling: discount requested oversampling rate */ + returned_ent_bits = lrng_reduce_by_osr(collected_ent_bits); + + pr_debug("obtained %u bits by collecting %u bits of entropy from aux pool, %u bits of entropy remaining\n", + returned_ent_bits, collected_ent_bits, unused_bits); + + /* Get the digest for the aux pool to be returned to the caller ... */ + if (hash_cb->hash_final(shash, aux_output) || + /* + * ... and re-initialize the aux state. Do not add the aux pool + * digest for backward secrecy as it will be added with the + * insertion of the complete seed buffer after it has been filled. + */ + hash_cb->hash_init(shash, hash)) { + returned_ent_bits = 0; + } else { + /* + * Do not truncate the output size exactly to collected_ent_bits + * as the aux pool may contain data that is not credited with + * entropy, but we want to use them to stir the DRNG state. 
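+		 * Hence requested_bits >> 3 bytes are copied out while only
+		 * returned_ent_bits are claimed as entropy for them.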
+ */ + memcpy(outbuf, aux_output, requested_bits >> 3); + } + + read_unlock_irqrestore(&drng->hash_lock, flags); + memzero_explicit(aux_output, digestsize); + return returned_ent_bits; +} + +static void lrng_aux_get_backtrack(struct entropy_buf *eb, u32 requested_bits, + bool __unused) +{ + struct lrng_pool *pool = &lrng_pool; + unsigned long flags; + + /* Ensure aux pool extraction and backtracking op are atomic */ + spin_lock_irqsave(&pool->lock, flags); + + eb->e_bits[lrng_ext_es_aux] = lrng_aux_get_pool(eb->e[lrng_ext_es_aux], + requested_bits); + + /* Mix the extracted data back into pool for backtracking resistance */ + if (lrng_aux_pool_insert_locked((u8 *)eb, + sizeof(struct entropy_buf), 0)) + pr_warn("Backtracking resistance operation failed\n"); + + spin_unlock_irqrestore(&pool->lock, flags); +} + +static void lrng_aux_es_state(unsigned char *buf, size_t buflen) +{ + const struct lrng_drng *lrng_drng_init = lrng_drng_init_instance(); + + /* Assume the lrng_drng_init lock is taken by caller */ + snprintf(buf, buflen, + " Hash for operating entropy pool: %s\n" + " Available entropy: %u\n", + lrng_drng_init->hash_cb->hash_name(), + lrng_aux_avail_entropy(0)); +} + +struct lrng_es_cb lrng_es_aux = { + .name = "Auxiliary", + .get_ent = lrng_aux_get_backtrack, + .curr_entropy = lrng_aux_avail_entropy, + .max_entropy = lrng_get_digestsize, + .state = lrng_aux_es_state, + .reset = lrng_aux_reset, + .switch_hash = lrng_aux_switch_hash, +}; diff --git a/drivers/char/lrng/lrng_es_aux.h b/drivers/char/lrng/lrng_es_aux.h new file mode 100644 index 000000000000..bc41e6474aad --- /dev/null +++ b/drivers/char/lrng/lrng_es_aux.h @@ -0,0 +1,44 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_ES_AUX_H +#define _LRNG_ES_AUX_H + +#include "lrng_drng_mgr.h" +#include "lrng_es_mgr_cb.h" + +u32 lrng_get_digestsize(void); +void lrng_pool_set_entropy(u32 entropy_bits); +int lrng_pool_insert_aux(const u8 *inbuf, u32 inbuflen, u32 entropy_bits); + +extern struct lrng_es_cb lrng_es_aux; + +/****************************** Helper code ***********************************/ + +/* Obtain the security strength of the LRNG in bits */ +static inline u32 lrng_security_strength(void) +{ + /* + * We use a hash to read the entropy in the entropy pool. According to + * SP800-90B table 1, the entropy can be at most the digest size. + * Considering this together with the last sentence in section 3.1.5.1.2 + * the security strength of a (approved) hash is equal to its output + * size. On the other hand the entropy cannot be larger than the + * security strength of the used DRBG. 
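+	 *
+	 * E.g. a SHA-256 conditioning hash yields
+	 * min(LRNG_FULL_SEED_ENTROPY_BITS, 256) bits.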
+ */ + return min_t(u32, LRNG_FULL_SEED_ENTROPY_BITS, lrng_get_digestsize()); +} + +static inline u32 lrng_get_seed_entropy_osr(bool fully_seeded) +{ + u32 requested_bits = lrng_security_strength(); + + /* Apply oversampling during initialization according to SP800-90C */ + if (lrng_sp80090c_compliant() && !fully_seeded) + requested_bits += LRNG_SEED_BUFFER_INIT_ADD_BITS; + return requested_bits; +} + +#endif /* _LRNG_ES_AUX_H */ diff --git a/drivers/char/lrng/lrng_es_cpu.c b/drivers/char/lrng/lrng_es_cpu.c new file mode 100644 index 000000000000..3910108cfd08 --- /dev/null +++ b/drivers/char/lrng/lrng_es_cpu.c @@ -0,0 +1,275 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG Fast Entropy Source: CPU-based entropy source + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include + +#include "lrng_definitions.h" +#include "lrng_es_aux.h" +#include "lrng_es_cpu.h" + +/* + * Estimated entropy of data is a 32th of LRNG_DRNG_SECURITY_STRENGTH_BITS. + * As we have no ability to review the implementation of those noise sources, + * it is prudent to have a conservative estimate here. + */ +#define LRNG_ARCHRANDOM_DEFAULT_STRENGTH CONFIG_LRNG_CPU_ENTROPY_RATE +#define LRNG_ARCHRANDOM_TRUST_CPU_STRENGTH LRNG_DRNG_SECURITY_STRENGTH_BITS +#ifdef CONFIG_RANDOM_TRUST_CPU +static u32 cpu_entropy = LRNG_ARCHRANDOM_TRUST_CPU_STRENGTH; +#else +static u32 cpu_entropy = LRNG_ARCHRANDOM_DEFAULT_STRENGTH; +#endif +#ifdef CONFIG_LRNG_RUNTIME_ES_CONFIG +module_param(cpu_entropy, uint, 0644); +MODULE_PARM_DESC(cpu_entropy, "Entropy in bits of 256 data bits from CPU noise source (e.g. RDSEED)"); +#endif + +static int __init lrng_parse_trust_cpu(char *arg) +{ + int ret; + bool trust_cpu = false; + + ret = kstrtobool(arg, &trust_cpu); + if (ret) + return ret; + + if (trust_cpu) { + cpu_entropy = LRNG_ARCHRANDOM_TRUST_CPU_STRENGTH; + lrng_es_add_entropy(); + } else { + cpu_entropy = LRNG_ARCHRANDOM_DEFAULT_STRENGTH; + } + + return 0; +} +early_param("random.trust_cpu", lrng_parse_trust_cpu); + +static u32 lrng_cpu_entropylevel(u32 requested_bits) +{ + return lrng_fast_noise_entropylevel(cpu_entropy, requested_bits); +} + +static u32 lrng_cpu_poolsize(void) +{ + return lrng_cpu_entropylevel(lrng_security_strength()); +} + +static u32 lrng_get_cpu_data(u8 *outbuf, u32 requested_bits) +{ + u32 i; + + /* operate on full blocks */ + BUILD_BUG_ON(LRNG_DRNG_SECURITY_STRENGTH_BYTES % sizeof(unsigned long)); + BUILD_BUG_ON(LRNG_SEED_BUFFER_INIT_ADD_BITS % sizeof(unsigned long)); + /* ensure we have aligned buffers */ + BUILD_BUG_ON(LRNG_KCAPI_ALIGN % sizeof(unsigned long)); + + for (i = 0; i < (requested_bits >> 3); + i += sizeof(unsigned long)) { + if (!arch_get_random_seed_long((unsigned long *)(outbuf + i)) && + !arch_get_random_long((unsigned long *)(outbuf + i))) { + cpu_entropy = 0; + return 0; + } + } + + return requested_bits; +} + +static u32 lrng_get_cpu_data_compress(u8 *outbuf, u32 requested_bits, + u32 data_multiplier) +{ + SHASH_DESC_ON_STACK(shash, NULL); + const struct lrng_hash_cb *hash_cb; + struct lrng_drng *drng = lrng_drng_node_instance(); + unsigned long flags; + u32 ent_bits = 0, i, partial_bits = 0, digestsize, digestsize_bits, + full_bits; + void *hash; + + read_lock_irqsave(&drng->hash_lock, flags); + hash_cb = drng->hash_cb; + hash = drng->hash; + + digestsize = hash_cb->hash_digestsize(hash); + digestsize_bits = digestsize << 3; + /* Cap to maximum entropy that can ever be generated with given hash */ + 
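+	/*
+	 * Illustrative sizing (non-FIPS case): on x86 without RDSEED,
+	 * data_multiplier is 512 (see lrng_cpu_multiplier() below), so a
+	 * 256-bit request pulls 512 * 256 bits from RDRAND before the hash
+	 * compresses them into one digest.
+	 */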
lrng_cap_requested(digestsize_bits, requested_bits);
+	full_bits = requested_bits * data_multiplier;
+
+	/* Calculate oversampling for SP800-90C */
+	if (lrng_sp80090c_compliant()) {
+		/* Complete amount of bits to be pulled */
+		full_bits += LRNG_OVERSAMPLE_ES_BITS * data_multiplier;
+		/* Full blocks that will be pulled */
+		data_multiplier = full_bits / requested_bits;
+		/* Partial block in bits to be pulled */
+		partial_bits = full_bits - (data_multiplier * requested_bits);
+	}
+
+	if (hash_cb->hash_init(shash, hash))
+		goto out;
+
+	/* Hash all data from the CPU entropy source */
+	for (i = 0; i < data_multiplier; i++) {
+		ent_bits = lrng_get_cpu_data(outbuf, requested_bits);
+		if (!ent_bits)
+			goto out;
+
+		if (hash_cb->hash_update(shash, outbuf, ent_bits >> 3))
+			goto err;
+	}
+
+	/* Hash partial block, if applicable */
+	ent_bits = lrng_get_cpu_data(outbuf, partial_bits);
+	if (ent_bits &&
+	    hash_cb->hash_update(shash, outbuf, ent_bits >> 3))
+		goto err;
+
+	pr_debug("pulled %u bits from CPU RNG entropy source\n", full_bits);
+	ent_bits = requested_bits;
+
+	/* Generate the compressed data to be returned to the caller */
+	if (requested_bits < digestsize_bits) {
+		u8 digest[LRNG_MAX_DIGESTSIZE];
+
+		if (hash_cb->hash_final(shash, digest))
+			goto err;
+
+		/* Truncate output data to requested size */
+		memcpy(outbuf, digest, requested_bits >> 3);
+		memzero_explicit(digest, digestsize);
+	} else {
+		if (hash_cb->hash_final(shash, outbuf))
+			goto err;
+	}
+
+out:
+	hash_cb->hash_desc_zero(shash);
+	read_unlock_irqrestore(&drng->hash_lock, flags);
+	return ent_bits;
+
+err:
+	ent_bits = 0;
+	goto out;
+}
+
+/*
+ * If the CPU entropy source does not return full entropy, return the
+ * multiplier of how much data shall be sampled from it.
+ */
+static u32 lrng_cpu_multiplier(void)
+{
+	static u32 data_multiplier = 0;
+	unsigned long v;
+
+	if (data_multiplier > 0)
+		return data_multiplier;
+
+	if (IS_ENABLED(CONFIG_X86) && !arch_get_random_seed_long(&v)) {
+		/*
+		 * Intel SPEC: pulling 512 blocks from RDRAND ensures
+		 * one reseed, making it logically equivalent to RDSEED.
+		 */
+		data_multiplier = 512;
+	} else if (IS_ENABLED(CONFIG_PPC)) {
+		/*
+		 * PowerISA defines DARN to deliver at least 0.5 bits of
+		 * entropy per data bit.
+		 */
+		data_multiplier = 2;
+	} else if (IS_ENABLED(CONFIG_RISCV)) {
+		/*
+		 * riscv-crypto-spec-scalar-1.0.0-rc6.pdf section 4.2 defines
+		 * this requirement.
+		 */
+		data_multiplier = 2;
+	} else {
+		/* CPU provides full entropy */
+		data_multiplier = CONFIG_LRNG_CPU_FULL_ENT_MULTIPLIER;
+	}
+	return data_multiplier;
+}
+
+static int
+lrng_cpu_switch_hash(struct lrng_drng *drng, int node,
+		     const struct lrng_hash_cb *new_cb, void *new_hash,
+		     const struct lrng_hash_cb *old_cb)
+{
+	u32 digestsize, multiplier;
+
+	if (!IS_ENABLED(CONFIG_LRNG_SWITCH))
+		return -EOPNOTSUPP;
+
+	digestsize = lrng_get_digestsize();
+	multiplier = lrng_cpu_multiplier();
+
+	/*
+	 * It would be a security violation if the new digestsize is smaller
+	 * than the set CPU entropy rate.
+ */ + WARN_ON(multiplier > 1 && digestsize < cpu_entropy); + cpu_entropy = min_t(u32, digestsize, cpu_entropy); + return 0; +} + +/* + * lrng_get_arch() - Get CPU entropy source entropy + * + * @eb: entropy buffer to store entropy + * @requested_bits: requested entropy in bits + */ +static void lrng_cpu_get(struct entropy_buf *eb, u32 requested_bits, + bool __unused) +{ + u32 ent_bits, data_multiplier = lrng_cpu_multiplier(); + + if (data_multiplier <= 1) { + ent_bits = lrng_get_cpu_data(eb->e[lrng_ext_es_cpu], + requested_bits); + } else { + ent_bits = lrng_get_cpu_data_compress(eb->e[lrng_ext_es_cpu], + requested_bits, + data_multiplier); + } + + ent_bits = lrng_cpu_entropylevel(ent_bits); + pr_debug("obtained %u bits of entropy from CPU RNG entropy source\n", + ent_bits); + eb->e_bits[lrng_ext_es_cpu] = ent_bits; +} + +static void lrng_cpu_es_state(unsigned char *buf, size_t buflen) +{ + const struct lrng_drng *lrng_drng_init = lrng_drng_init_instance(); + u32 data_multiplier = lrng_cpu_multiplier(); + + /* Assume the lrng_drng_init lock is taken by caller */ + snprintf(buf, buflen, + " Hash for compressing data: %s\n" + " Available entropy: %u\n" + " Data multiplier: %u\n", + (data_multiplier <= 1) ? + "N/A" : lrng_drng_init->hash_cb->hash_name(), + lrng_cpu_poolsize(), + data_multiplier); +} + +struct lrng_es_cb lrng_es_cpu = { + .name = "CPU", + .get_ent = lrng_cpu_get, + .curr_entropy = lrng_cpu_entropylevel, + .max_entropy = lrng_cpu_poolsize, + .state = lrng_cpu_es_state, + .reset = NULL, + .switch_hash = lrng_cpu_switch_hash, +}; diff --git a/drivers/char/lrng/lrng_es_cpu.h b/drivers/char/lrng/lrng_es_cpu.h new file mode 100644 index 000000000000..8dbb4d9a2926 --- /dev/null +++ b/drivers/char/lrng/lrng_es_cpu.h @@ -0,0 +1,17 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_ES_CPU_H +#define _LRNG_ES_CPU_H + +#include "lrng_es_mgr_cb.h" + +#ifdef CONFIG_LRNG_CPU + +extern struct lrng_es_cb lrng_es_cpu; + +#endif /* CONFIG_LRNG_CPU */ + +#endif /* _LRNG_ES_CPU_H */ diff --git a/drivers/char/lrng/lrng_es_irq.c b/drivers/char/lrng/lrng_es_irq.c new file mode 100644 index 000000000000..9336d0045e06 --- /dev/null +++ b/drivers/char/lrng/lrng_es_irq.c @@ -0,0 +1,728 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG Slow Entropy Source: Interrupt data collection + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include + +#include "lrng_es_aux.h" +#include "lrng_es_irq.h" +#include "lrng_es_timer_common.h" +#include "lrng_health.h" +#include "lrng_numa.h" +#include "lrng_testing.h" + +/* + * Number of interrupts to be recorded to assume that DRNG security strength + * bits of entropy are received. + * Note: a value below the DRNG security strength should not be defined as this + * may imply the DRNG can never be fully seeded in case other noise + * sources are unavailable. 
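+ *
+ * E.g. with CONFIG_LRNG_IRQ_ENTROPY_RATE == 256, 256 health-tested
+ * interrupts are credited with 256 bits, i.e. one bit of entropy per
+ * interrupt.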
+ */ +#define LRNG_IRQ_ENTROPY_BITS LRNG_UINT32_C(CONFIG_LRNG_IRQ_ENTROPY_RATE) + + +/* Number of interrupts required for LRNG_DRNG_SECURITY_STRENGTH_BITS entropy */ +static u32 lrng_irq_entropy_bits = LRNG_IRQ_ENTROPY_BITS; + +static u32 irq_entropy __read_mostly = LRNG_IRQ_ENTROPY_BITS; +#ifdef CONFIG_LRNG_RUNTIME_ES_CONFIG +module_param(irq_entropy, uint, 0444); +MODULE_PARM_DESC(irq_entropy, + "How many interrupts must be collected for obtaining 256 bits of entropy\n"); +#endif + +/* Per-CPU array holding concatenated IRQ entropy events */ +static DEFINE_PER_CPU(u32 [LRNG_DATA_ARRAY_SIZE], lrng_irq_array) + __aligned(LRNG_KCAPI_ALIGN); +static DEFINE_PER_CPU(u32, lrng_irq_array_ptr) = 0; +static DEFINE_PER_CPU(atomic_t, lrng_irq_array_irqs) = ATOMIC_INIT(0); + +/* + * The entropy collection is performed by executing the following steps: + * 1. fill up the per-CPU array holding the time stamps + * 2. once the per-CPU array is full, a compression of the data into + * the entropy pool is performed - this happens in interrupt context + * + * If step 2 is not desired in interrupt context, the following boolean + * needs to be set to false. This implies that old entropy data in the + * per-CPU array collected since the last DRNG reseed is overwritten with + * new entropy data instead of retaining the entropy with the compression + * operation. + * + * Impact on entropy: + * + * If continuous compression is enabled, the maximum entropy that is collected + * per CPU between DRNG reseeds is equal to the digest size of the used hash. + * + * If continuous compression is disabled, the maximum number of entropy events + * that can be collected per CPU is equal to LRNG_DATA_ARRAY_SIZE. This amount + * of events is converted into an entropy statement which then represents the + * maximum amount of entropy collectible per CPU between DRNG reseeds. + */ +static bool lrng_irq_continuous_compression __read_mostly = + IS_ENABLED(CONFIG_LRNG_ENABLE_CONTINUOUS_COMPRESSION); + +#ifdef CONFIG_LRNG_SWITCHABLE_CONTINUOUS_COMPRESSION +module_param(lrng_irq_continuous_compression, bool, 0444); +MODULE_PARM_DESC(lrng_irq_continuous_compression, + "Perform entropy compression if per-CPU entropy data array is full\n"); +#endif + +/* + * Per-CPU entropy pool with compressed entropy event + * + * The per-CPU entropy pool is defined as the hash state. New data is simply + * inserted into the entropy pool by performing a hash update operation. + * To read the entropy pool, a hash final must be invoked. However, before + * the entropy pool is released again after a hash final, the hash init must + * be performed. + */ +static DEFINE_PER_CPU(u8 [LRNG_POOL_SIZE], lrng_irq_pool) + __aligned(LRNG_KCAPI_ALIGN); +/* + * Lock to allow other CPUs to read the pool - as this is only done during + * reseed which is infrequent, this lock is hardly contended. 
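+ * The local CPU takes it in lrng_irq_array_compress() when folding its
+ * collection array into the pool.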
+ */ +static DEFINE_PER_CPU(spinlock_t, lrng_irq_lock); +static DEFINE_PER_CPU(bool, lrng_irq_lock_init) = false; + +static bool lrng_irq_pool_online(int cpu) +{ + return per_cpu(lrng_irq_lock_init, cpu); +} + +static void __init lrng_irq_check_compression_state(void) +{ + /* One pool must hold sufficient entropy for disabled compression */ + if (!lrng_irq_continuous_compression) { + u32 max_ent = min_t(u32, lrng_get_digestsize(), + lrng_data_to_entropy(LRNG_DATA_NUM_VALUES, + lrng_irq_entropy_bits)); + if (max_ent < lrng_security_strength()) { + pr_warn("Force continuous compression operation to ensure LRNG can hold enough entropy\n"); + lrng_irq_continuous_compression = true; + } + } +} + +void __init lrng_irq_es_init(bool highres_timer) +{ + /* Set a minimum number of interrupts that must be collected */ + irq_entropy = max_t(u32, LRNG_IRQ_ENTROPY_BITS, irq_entropy); + + if (highres_timer) { + lrng_irq_entropy_bits = irq_entropy; + } else { + u32 new_entropy = irq_entropy * LRNG_ES_OVERSAMPLING_FACTOR; + + lrng_irq_entropy_bits = (irq_entropy < new_entropy) ? + new_entropy : irq_entropy; + pr_warn("operating without high-resolution timer and applying IRQ oversampling factor %u\n", + LRNG_ES_OVERSAMPLING_FACTOR); + } + + lrng_irq_check_compression_state(); +} + +/* + * Reset all per-CPU pools - reset entropy estimator but leave the pool data + * that may or may not have entropy unchanged. + */ +static void lrng_irq_reset(void) +{ + int cpu; + + /* Trigger GCD calculation anew. */ + lrng_gcd_set(0); + + for_each_online_cpu(cpu) + atomic_set(per_cpu_ptr(&lrng_irq_array_irqs, cpu), 0); +} + +static u32 lrng_irq_avail_pool_size(void) +{ + u32 max_size = 0, max_pool = lrng_get_digestsize(); + int cpu; + + if (!lrng_irq_continuous_compression) + max_pool = min_t(u32, max_pool, LRNG_DATA_NUM_VALUES); + + for_each_online_cpu(cpu) { + if (lrng_irq_pool_online(cpu)) + max_size += max_pool; + } + + return max_size; +} + +/* Return entropy of unused IRQs present in all per-CPU pools. */ +static u32 lrng_irq_avail_entropy(u32 __unused) +{ + u32 digestsize_irqs, irq = 0; + int cpu; + + /* Only deliver entropy when SP800-90B self test is completed */ + if (!lrng_sp80090b_startup_complete_es(lrng_int_es_irq)) + return 0; + + /* Obtain the cap of maximum numbers of IRQs we count */ + digestsize_irqs = lrng_entropy_to_data(lrng_get_digestsize(), + lrng_irq_entropy_bits); + if (!lrng_irq_continuous_compression) { + /* Cap to max. number of IRQs the array can hold */ + digestsize_irqs = min_t(u32, digestsize_irqs, + LRNG_DATA_NUM_VALUES); + } + + for_each_online_cpu(cpu) { + if (!lrng_irq_pool_online(cpu)) + continue; + irq += min_t(u32, digestsize_irqs, + atomic_read_u32(per_cpu_ptr(&lrng_irq_array_irqs, + cpu))); + } + + /* Consider oversampling rate */ + return lrng_reduce_by_osr(lrng_data_to_entropy(irq, + lrng_irq_entropy_bits)); +} + +/* + * Trigger a switch of the hash implementation for the per-CPU pool. + * + * For each per-CPU pool, obtain the message digest with the old hash + * implementation, initialize the per-CPU pool again with the new hash + * implementation and inject the message digest into the new state. + * + * Assumption: the caller must guarantee that the new_cb is available during the + * entire operation (e.g. it must hold the lock against pointer updating). 
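+ *
+ * The auxiliary pool applies the same hash_final/hash_init/hash_update
+ * sequence in lrng_aux_switch_hash().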
+ */ +static int +lrng_irq_switch_hash(struct lrng_drng *drng, int node, + const struct lrng_hash_cb *new_cb, void *new_hash, + const struct lrng_hash_cb *old_cb) +{ + u8 digest[LRNG_MAX_DIGESTSIZE]; + u32 digestsize_irqs, found_irqs; + int ret = 0, cpu; + + if (!IS_ENABLED(CONFIG_LRNG_SWITCH)) + return -EOPNOTSUPP; + + for_each_online_cpu(cpu) { + struct shash_desc *pcpu_shash; + + /* + * Only switch the per-CPU pools for the current node because + * the hash_cb only applies NUMA-node-wide. + */ + if (cpu_to_node(cpu) != node || !lrng_irq_pool_online(cpu)) + continue; + + pcpu_shash = (struct shash_desc *)per_cpu_ptr(lrng_irq_pool, + cpu); + + digestsize_irqs = old_cb->hash_digestsize(pcpu_shash); + digestsize_irqs = lrng_entropy_to_data(digestsize_irqs << 3, + lrng_irq_entropy_bits); + + if (pcpu_shash->tfm == new_hash) + continue; + + /* Get the per-CPU pool hash with old digest ... */ + ret = old_cb->hash_final(pcpu_shash, digest) ?: + /* ... re-initialize the hash with the new digest ... */ + new_cb->hash_init(pcpu_shash, new_hash) ?: + /* + * ... feed the old hash into the new state. We may feed + * uninitialized memory into the new state, but this is + * considered no issue and even good as we have some more + * uncertainty here. + */ + new_cb->hash_update(pcpu_shash, digest, + sizeof(digest)); + if (ret) + goto out; + + /* + * In case the new digest is larger than the old one, cap + * the available entropy to the old message digest used to + * process the existing data. + */ + found_irqs = atomic_xchg_relaxed( + per_cpu_ptr(&lrng_irq_array_irqs, cpu), 0); + found_irqs = min_t(u32, found_irqs, digestsize_irqs); + atomic_add_return_relaxed(found_irqs, + per_cpu_ptr(&lrng_irq_array_irqs, cpu)); + + pr_debug("Re-initialize per-CPU entropy pool for CPU %d on NUMA node %d with hash %s\n", + cpu, node, new_cb->hash_name()); + } + +out: + memzero_explicit(digest, sizeof(digest)); + return ret; +} + +/* + * When reading the per-CPU message digest, make sure we use the crypto + * callbacks defined for the NUMA node the per-CPU pool is defined for because + * the LRNG crypto switch support is only atomic per NUMA node. + */ +static u32 +lrng_irq_pool_hash_one(const struct lrng_hash_cb *pcpu_hash_cb, + void *pcpu_hash, int cpu, u8 *digest, u32 *digestsize) +{ + struct shash_desc *pcpu_shash = + (struct shash_desc *)per_cpu_ptr(lrng_irq_pool, cpu); + spinlock_t *lock = per_cpu_ptr(&lrng_irq_lock, cpu); + unsigned long flags; + u32 digestsize_irqs, found_irqs; + + /* Lock guarding against reading / writing to per-CPU pool */ + spin_lock_irqsave(lock, flags); + + *digestsize = pcpu_hash_cb->hash_digestsize(pcpu_hash); + digestsize_irqs = lrng_entropy_to_data(*digestsize << 3, + lrng_irq_entropy_bits); + + /* Obtain entropy statement like for the entropy pool */ + found_irqs = atomic_xchg_relaxed( + per_cpu_ptr(&lrng_irq_array_irqs, cpu), 0); + /* Cap to maximum amount of data we can hold in hash */ + found_irqs = min_t(u32, found_irqs, digestsize_irqs); + + /* Cap to maximum amount of data we can hold in array */ + if (!lrng_irq_continuous_compression) + found_irqs = min_t(u32, found_irqs, LRNG_DATA_NUM_VALUES); + + /* Store all not-yet compressed data in data array into hash, ... */ + if (pcpu_hash_cb->hash_update(pcpu_shash, + (u8 *)per_cpu_ptr(lrng_irq_array, cpu), + LRNG_DATA_ARRAY_SIZE * sizeof(u32)) ?: + /* ... get the per-CPU pool digest, ... */ + pcpu_hash_cb->hash_final(pcpu_shash, digest) ?: + /* ... re-initialize the hash, ... 
*/
+	    pcpu_hash_cb->hash_init(pcpu_shash, pcpu_hash) ?:
+	    /* ... feed the old hash into the new state. */
+	    pcpu_hash_cb->hash_update(pcpu_shash, digest, *digestsize))
+		found_irqs = 0;
+
+	spin_unlock_irqrestore(lock, flags);
+	return found_irqs;
+}
+
+/*
+ * Hash all per-CPU pools and return the digest to be used as seed data for
+ * seeding a DRNG. The caller must guarantee backtracking resistance.
+ * The function will only copy as much data as entropy is available into the
+ * caller-provided output buffer.
+ *
+ * This function handles the translation from the number of received interrupts
+ * into an entropy statement. The conversion depends on LRNG_IRQ_ENTROPY_BITS
+ * which defines how many interrupts must be received to obtain 256 bits of
+ * entropy. With this value, the function lrng_data_to_entropy converts a given
+ * data size (received interrupts, requested amount of data, etc.) into an
+ * entropy statement. lrng_entropy_to_data does the reverse.
+ *
+ * @eb: entropy buffer to store entropy
+ * @requested_bits: Requested amount of entropy
+ * @fully_seeded: indicator whether LRNG is fully seeded
+ */
+static void lrng_irq_pool_hash(struct entropy_buf *eb, u32 requested_bits,
+			       bool fully_seeded)
+{
+	SHASH_DESC_ON_STACK(shash, NULL);
+	const struct lrng_hash_cb *hash_cb;
+	struct lrng_drng **lrng_drng = lrng_drng_instances();
+	struct lrng_drng *drng = lrng_drng_init_instance();
+	u8 digest[LRNG_MAX_DIGESTSIZE];
+	unsigned long flags, flags2;
+	u32 found_irqs, collected_irqs = 0, collected_ent_bits, requested_irqs,
+	    returned_ent_bits;
+	int ret, cpu;
+	void *hash;
+
+	/* Only deliver entropy when SP800-90B self test is completed */
+	if (!lrng_sp80090b_startup_complete_es(lrng_int_es_irq)) {
+		eb->e_bits[lrng_int_es_irq] = 0;
+		return;
+	}
+
+	/* Lock guarding replacement of per-NUMA hash */
+	read_lock_irqsave(&drng->hash_lock, flags);
+
+	hash_cb = drng->hash_cb;
+	hash = drng->hash;
+
+	/* The hash state is filled with all per-CPU pool hashes. */
+	ret = hash_cb->hash_init(shash, hash);
+	if (ret)
+		goto err;
+
+	/* Cap to maximum entropy that can ever be generated with given hash */
+	lrng_cap_requested(hash_cb->hash_digestsize(hash) << 3, requested_bits);
+	requested_irqs = lrng_entropy_to_data(requested_bits +
+					      lrng_compress_osr(),
+					      lrng_irq_entropy_bits);
+
+	/*
+	 * Harvest entropy from each per-CPU hash state - even though we may
+	 * have collected sufficient entropy, we will hash all per-CPU pools.
+	 */
+	for_each_online_cpu(cpu) {
+		struct lrng_drng *pcpu_drng = drng;
+		u32 digestsize, pcpu_unused_irqs = 0;
+		int node = cpu_to_node(cpu);
+
+		/* If pool is not online, then no entropy is present.
*/ + if (!lrng_irq_pool_online(cpu)) + continue; + + if (lrng_drng && lrng_drng[node]) + pcpu_drng = lrng_drng[node]; + + if (pcpu_drng == drng) { + found_irqs = lrng_irq_pool_hash_one(hash_cb, hash, + cpu, digest, + &digestsize); + } else { + read_lock_irqsave(&pcpu_drng->hash_lock, flags2); + found_irqs = + lrng_irq_pool_hash_one(pcpu_drng->hash_cb, + pcpu_drng->hash, cpu, + digest, &digestsize); + read_unlock_irqrestore(&pcpu_drng->hash_lock, flags2); + } + + /* Inject the digest into the state of all per-CPU pools */ + ret = hash_cb->hash_update(shash, digest, digestsize); + if (ret) + goto err; + + collected_irqs += found_irqs; + if (collected_irqs > requested_irqs) { + pcpu_unused_irqs = collected_irqs - requested_irqs; + atomic_add_return_relaxed(pcpu_unused_irqs, + per_cpu_ptr(&lrng_irq_array_irqs, cpu)); + collected_irqs = requested_irqs; + } + pr_debug("%u interrupts used from entropy pool of CPU %d, %u interrupts remain unused\n", + found_irqs - pcpu_unused_irqs, cpu, pcpu_unused_irqs); + } + + ret = hash_cb->hash_final(shash, digest); + if (ret) + goto err; + + collected_ent_bits = lrng_data_to_entropy(collected_irqs, + lrng_irq_entropy_bits); + /* Apply oversampling: discount requested oversampling rate */ + returned_ent_bits = lrng_reduce_by_osr(collected_ent_bits); + + pr_debug("obtained %u bits by collecting %u bits of entropy from entropy pool noise source\n", + returned_ent_bits, collected_ent_bits); + + /* + * Truncate to available entropy as implicitly allowed by SP800-90B + * section 3.1.5.1.1 table 1 which awards truncated hashes full + * entropy. + * + * During boot time, we read requested_bits data with + * returned_ent_bits entropy. In case our conservative entropy + * estimate underestimates the available entropy we can transport as + * much available entropy as possible. + */ + memcpy(eb->e[lrng_int_es_irq], digest, + fully_seeded ? returned_ent_bits >> 3 : requested_bits >> 3); + eb->e_bits[lrng_int_es_irq] = returned_ent_bits; + +out: + hash_cb->hash_desc_zero(shash); + read_unlock_irqrestore(&drng->hash_lock, flags); + memzero_explicit(digest, sizeof(digest)); + return; + +err: + eb->e_bits[lrng_int_es_irq] = 0; + goto out; +} + +/* Compress the lrng_irq_array array into lrng_irq_pool */ +static void lrng_irq_array_compress(void) +{ + struct shash_desc *shash = + (struct shash_desc *)this_cpu_ptr(lrng_irq_pool); + struct lrng_drng *drng = lrng_drng_node_instance(); + const struct lrng_hash_cb *hash_cb; + spinlock_t *lock = this_cpu_ptr(&lrng_irq_lock); + unsigned long flags, flags2; + void *hash; + bool init = false; + + read_lock_irqsave(&drng->hash_lock, flags); + hash_cb = drng->hash_cb; + hash = drng->hash; + + if (unlikely(!this_cpu_read(lrng_irq_lock_init))) { + init = true; + spin_lock_init(lock); + this_cpu_write(lrng_irq_lock_init, true); + pr_debug("Initializing per-CPU entropy pool for CPU %d on NUMA node %d with hash %s\n", + raw_smp_processor_id(), numa_node_id(), + hash_cb->hash_name()); + } + + spin_lock_irqsave(lock, flags2); + + if (unlikely(init) && hash_cb->hash_init(shash, hash)) { + this_cpu_write(lrng_irq_lock_init, false); + pr_warn("Initialization of hash failed\n"); + } else if (lrng_irq_continuous_compression) { + /* Add entire per-CPU data array content into entropy pool. 
*/ + if (hash_cb->hash_update(shash, + (u8 *)this_cpu_ptr(lrng_irq_array), + LRNG_DATA_ARRAY_SIZE * sizeof(u32))) + pr_warn_ratelimited("Hashing of entropy data failed\n"); + } + + spin_unlock_irqrestore(lock, flags2); + read_unlock_irqrestore(&drng->hash_lock, flags); +} + +/* Compress data array into hash */ +static void lrng_irq_array_to_hash(u32 ptr) +{ + u32 *array = this_cpu_ptr(lrng_irq_array); + + /* + * During boot time the hash operation is triggered more often than + * during regular operation. + */ + if (unlikely(!lrng_state_fully_seeded())) { + if ((ptr & 31) && (ptr < LRNG_DATA_WORD_MASK)) + return; + } else if (ptr < LRNG_DATA_WORD_MASK) { + return; + } + + if (lrng_raw_array_entropy_store(*array)) { + u32 i; + + /* + * If we fed even a part of the array to external analysis, we + * mark that the entire array and the per-CPU pool to have no + * entropy. This is due to the non-IID property of the data as + * we do not fully know whether the existing dependencies + * diminish the entropy beyond to what we expect it has. + */ + atomic_set(this_cpu_ptr(&lrng_irq_array_irqs), 0); + + for (i = 1; i < LRNG_DATA_ARRAY_SIZE; i++) + lrng_raw_array_entropy_store(*(array + i)); + } else { + lrng_irq_array_compress(); + /* Ping pool handler about received entropy */ + if (lrng_sp80090b_startup_complete_es(lrng_int_es_irq)) + lrng_es_add_entropy(); + } +} + +/* + * Concatenate full 32 bit word at the end of time array even when current + * ptr is not aligned to sizeof(data). + */ +static void _lrng_irq_array_add_u32(u32 data) +{ + /* Increment pointer by number of slots taken for input value */ + u32 pre_ptr, mask, ptr = this_cpu_add_return(lrng_irq_array_ptr, + LRNG_DATA_SLOTS_PER_UINT); + unsigned int pre_array; + + /* + * This function injects a unit into the array - guarantee that + * array unit size is equal to data type of input data. + */ + BUILD_BUG_ON(LRNG_DATA_ARRAY_MEMBER_BITS != (sizeof(data) << 3)); + + /* + * The following logic requires at least two units holding + * the data as otherwise the pointer would immediately wrap when + * injection an u32 word. + */ + BUILD_BUG_ON(LRNG_DATA_NUM_VALUES <= LRNG_DATA_SLOTS_PER_UINT); + + lrng_data_split_u32(&ptr, &pre_ptr, &mask); + + /* MSB of data go into previous unit */ + pre_array = lrng_data_idx2array(pre_ptr); + /* zeroization of slot to ensure the following OR adds the data */ + this_cpu_and(lrng_irq_array[pre_array], ~(0xffffffff & ~mask)); + this_cpu_or(lrng_irq_array[pre_array], data & ~mask); + + /* Invoke compression as we just filled data array completely */ + if (unlikely(pre_ptr > ptr)) + lrng_irq_array_to_hash(LRNG_DATA_WORD_MASK); + + /* LSB of data go into current unit */ + this_cpu_write(lrng_irq_array[lrng_data_idx2array(ptr)], + data & mask); + + if (likely(pre_ptr <= ptr)) + lrng_irq_array_to_hash(ptr); +} + +/* Concatenate a 32-bit word at the end of the per-CPU array */ +void lrng_irq_array_add_u32(u32 data) +{ + /* + * Disregard entropy-less data without continuous compression to + * avoid it overwriting data with entropy when array ptr wraps. 
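+	 * Without continuous compression, the per-CPU data array itself is
+	 * the entropy record until the next reseed, so externally supplied
+	 * words must not clobber it.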
+ */ + if (lrng_irq_continuous_compression) + _lrng_irq_array_add_u32(data); +} + +/* Concatenate data of max LRNG_DATA_SLOTSIZE_MASK at the end of time array */ +static void lrng_irq_array_add_slot(u32 data) +{ + /* Get slot */ + u32 ptr = this_cpu_inc_return(lrng_irq_array_ptr) & + LRNG_DATA_WORD_MASK; + unsigned int array = lrng_data_idx2array(ptr); + unsigned int slot = lrng_data_idx2slot(ptr); + + BUILD_BUG_ON(LRNG_DATA_ARRAY_MEMBER_BITS % LRNG_DATA_SLOTSIZE_BITS); + /* Ensure consistency of values */ + BUILD_BUG_ON(LRNG_DATA_ARRAY_MEMBER_BITS != + sizeof(lrng_irq_array[0]) << 3); + + /* zeroization of slot to ensure the following OR adds the data */ + this_cpu_and(lrng_irq_array[array], + ~(lrng_data_slot_val(0xffffffff & LRNG_DATA_SLOTSIZE_MASK, + slot))); + /* Store data into slot */ + this_cpu_or(lrng_irq_array[array], lrng_data_slot_val(data, slot)); + + lrng_irq_array_to_hash(ptr); +} + +static void +lrng_time_process_common(u32 time, void(*add_time)(u32 data)) +{ + enum lrng_health_res health_test; + + if (lrng_raw_hires_entropy_store(time)) + return; + + health_test = lrng_health_test(time, lrng_int_es_irq); + if (health_test > lrng_health_fail_use) + return; + + if (health_test == lrng_health_pass) + atomic_inc_return(this_cpu_ptr(&lrng_irq_array_irqs)); + + add_time(time); +} + +/* + * Batching up of entropy in per-CPU array before injecting into entropy pool. + */ +static void lrng_time_process(void) +{ + u32 now_time = random_get_entropy(); + + if (unlikely(!lrng_gcd_tested())) { + /* When GCD is unknown, we process the full time stamp */ + lrng_time_process_common(now_time, _lrng_irq_array_add_u32); + lrng_gcd_add_value(now_time); + } else { + /* GCD is known and applied */ + lrng_time_process_common((now_time / lrng_gcd_get()) & + LRNG_DATA_SLOTSIZE_MASK, + lrng_irq_array_add_slot); + } + + lrng_perf_time(now_time); +} + +/* Hot code path - Callback for interrupt handler */ +void add_interrupt_randomness(int irq) +{ + if (lrng_highres_timer()) { + lrng_time_process(); + } else { + struct pt_regs *regs = get_irq_regs(); + static atomic_t reg_idx = ATOMIC_INIT(0); + u64 ip; + u32 tmp; + + if (regs) { + u32 *ptr = (u32 *)regs; + int reg_ptr = atomic_add_return_relaxed(1, ®_idx); + size_t n = (sizeof(struct pt_regs) / sizeof(u32)); + + ip = instruction_pointer(regs); + tmp = *(ptr + (reg_ptr % n)); + tmp = lrng_raw_regs_entropy_store(tmp) ? 0 : tmp; + _lrng_irq_array_add_u32(tmp); + } else { + ip = _RET_IP_; + } + + lrng_time_process(); + + /* + * The XOR operation combining the different values is not + * considered to destroy entropy since the entirety of all + * processed values delivers the entropy (and not each + * value separately of the other values). + */ + tmp = lrng_raw_jiffies_entropy_store(jiffies) ? 0 : jiffies; + tmp ^= lrng_raw_irq_entropy_store(irq) ? 0 : irq; + tmp ^= lrng_raw_retip_entropy_store(ip) ? 
0 : ip; + tmp ^= ip >> 32; + _lrng_irq_array_add_u32(tmp); + } +} +EXPORT_SYMBOL(add_interrupt_randomness); + +static void lrng_irq_es_state(unsigned char *buf, size_t buflen) +{ + const struct lrng_drng *lrng_drng_init = lrng_drng_init_instance(); + + /* Assume the lrng_drng_init lock is taken by caller */ + snprintf(buf, buflen, + " Hash for operating entropy pool: %s\n" + " Available entropy: %u\n" + " per-CPU interrupt collection size: %u\n" + " Standards compliance: %s\n" + " High-resolution timer: %s\n" + " Continuous compression: %s\n", + lrng_drng_init->hash_cb->hash_name(), + lrng_irq_avail_entropy(0), + LRNG_DATA_NUM_VALUES, + lrng_sp80090b_compliant(lrng_int_es_irq) ? "SP800-90B " : "", + lrng_highres_timer() ? "true" : "false", + lrng_irq_continuous_compression ? "true" : "false"); +} + +struct lrng_es_cb lrng_es_irq = { + .name = "IRQ", + .get_ent = lrng_irq_pool_hash, + .curr_entropy = lrng_irq_avail_entropy, + .max_entropy = lrng_irq_avail_pool_size, + .state = lrng_irq_es_state, + .reset = lrng_irq_reset, + .switch_hash = lrng_irq_switch_hash, +}; diff --git a/drivers/char/lrng/lrng_es_irq.h b/drivers/char/lrng/lrng_es_irq.h new file mode 100644 index 000000000000..2cd746611cf0 --- /dev/null +++ b/drivers/char/lrng/lrng_es_irq.h @@ -0,0 +1,24 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_ES_IRQ_H +#define _LRNG_ES_IRQ_H + +#include + +#include "lrng_es_mgr_cb.h" + +#ifdef CONFIG_LRNG_IRQ +void lrng_irq_es_init(bool highres_timer); +void lrng_irq_array_add_u32(u32 data); + +extern struct lrng_es_cb lrng_es_irq; + +#else /* CONFIG_LRNG_IRQ */ +static inline void lrng_irq_es_init(bool highres_timer) { } +static inline void lrng_irq_array_add_u32(u32 data) { } +#endif /* CONFIG_LRNG_IRQ */ + +#endif /* _LRNG_ES_IRQ_H */ diff --git a/drivers/char/lrng/lrng_es_jent.c b/drivers/char/lrng/lrng_es_jent.c new file mode 100644 index 000000000000..dbb2fce591f0 --- /dev/null +++ b/drivers/char/lrng/lrng_es_jent.c @@ -0,0 +1,136 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG Fast Entropy Source: Jitter RNG + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include + +#include "lrng_definitions.h" +#include "lrng_es_aux.h" +#include "lrng_es_jent.h" + +/* + * Estimated entropy of data is a 16th of LRNG_DRNG_SECURITY_STRENGTH_BITS. + * Albeit a full entropy assessment is provided for the noise source indicating + * that it provides high entropy rates and considering that it deactivates + * when it detects insufficient hardware, the chosen under estimation of + * entropy is considered to be acceptable to all reviewers. + */ +static u32 jent_entropy = CONFIG_LRNG_JENT_ENTROPY_RATE; +#ifdef CONFIG_LRNG_RUNTIME_ES_CONFIG +module_param(jent_entropy, uint, 0644); +MODULE_PARM_DESC(jent_entropy, "Entropy in bits of 256 data bits from Jitter RNG noise source"); +#endif + +static bool lrng_jent_initialized = false; +static struct rand_data *lrng_jent_state; + +static int __init lrng_jent_initialize(void) +{ + /* Initialize the Jitter RNG after the clocksources are initialized. 
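+	 * Note: the device_initcall() below is assumed to run late enough
+	 * in the boot sequence that jent_entropy_init() observes the final
+	 * high-resolution clocksource.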
*/ + if (jent_entropy_init() || + (lrng_jent_state = jent_entropy_collector_alloc(1, 0)) == NULL) { + jent_entropy = 0; + pr_info("Jitter RNG unusable on current system\n"); + return 0; + } + lrng_jent_initialized = true; + pr_debug("Jitter RNG working on current system\n"); + + /* + * In FIPS mode, the Jitter RNG is defined to have full of entropy + * unless a different value has been specified at the command line + * (i.e. the user overrides the default), and the default value is + * larger than zero (if it is zero, it is assumed that an RBG2(P) or + * RBG2(NP) construction is attempted that intends to exclude the + * Jitter RNG). + */ + if (fips_enabled && + CONFIG_LRNG_JENT_ENTROPY_RATE > 0 && + jent_entropy == CONFIG_LRNG_JENT_ENTROPY_RATE) + jent_entropy = LRNG_DRNG_SECURITY_STRENGTH_BITS; + + lrng_drng_force_reseed(); + if (jent_entropy) + lrng_es_add_entropy(); + + return 0; +} +device_initcall(lrng_jent_initialize); + +static u32 lrng_jent_entropylevel(u32 requested_bits) +{ + return lrng_fast_noise_entropylevel(lrng_jent_initialized ? + jent_entropy : 0, requested_bits); +} + +static u32 lrng_jent_poolsize(void) +{ + return lrng_jent_entropylevel(lrng_security_strength()); +} + +/* + * lrng_get_jent() - Get Jitter RNG entropy + * + * @eb: entropy buffer to store entropy + * @requested_bits: requested entropy in bits + */ +static void lrng_jent_get(struct entropy_buf *eb, u32 requested_bits, + bool __unused) +{ + int ret; + u32 ent_bits = lrng_jent_entropylevel(requested_bits); + unsigned long flags; + static DEFINE_SPINLOCK(lrng_jent_lock); + + spin_lock_irqsave(&lrng_jent_lock, flags); + + if (!lrng_jent_initialized) { + spin_unlock_irqrestore(&lrng_jent_lock, flags); + goto err; + } + + ret = jent_read_entropy(lrng_jent_state, eb->e[lrng_ext_es_jitter], + requested_bits >> 3); + spin_unlock_irqrestore(&lrng_jent_lock, flags); + + if (ret) { + pr_debug("Jitter RNG failed with %d\n", ret); + goto err; + } + + pr_debug("obtained %u bits of entropy from Jitter RNG noise source\n", + ent_bits); + + eb->e_bits[lrng_ext_es_jitter] = ent_bits; + return; + +err: + eb->e_bits[lrng_ext_es_jitter] = 0; +} + +static void lrng_jent_es_state(unsigned char *buf, size_t buflen) +{ + snprintf(buf, buflen, + " Available entropy: %u\n" + " Enabled: %s\n", + lrng_jent_poolsize(), + lrng_jent_initialized ? 
"true" : "false"); +} + +struct lrng_es_cb lrng_es_jent = { + .name = "JitterRNG", + .get_ent = lrng_jent_get, + .curr_entropy = lrng_jent_entropylevel, + .max_entropy = lrng_jent_poolsize, + .state = lrng_jent_es_state, + .reset = NULL, + .switch_hash = NULL, +}; diff --git a/drivers/char/lrng/lrng_es_jent.h b/drivers/char/lrng/lrng_es_jent.h new file mode 100644 index 000000000000..32882d4bdf99 --- /dev/null +++ b/drivers/char/lrng/lrng_es_jent.h @@ -0,0 +1,17 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_ES_JENT_H +#define _LRNG_ES_JENT_H + +#include "lrng_es_mgr_cb.h" + +#ifdef CONFIG_LRNG_JENT + +extern struct lrng_es_cb lrng_es_jent; + +#endif /* CONFIG_LRNG_JENT */ + +#endif /* _LRNG_ES_JENT_H */ diff --git a/drivers/char/lrng/lrng_es_krng.c b/drivers/char/lrng/lrng_es_krng.c new file mode 100644 index 000000000000..0ef88e6c8d56 --- /dev/null +++ b/drivers/char/lrng/lrng_es_krng.c @@ -0,0 +1,121 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG Fast Entropy Source: Linux kernel RNG (random.c) + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include + +#include "lrng_es_aux.h" +#include "lrng_es_krng.h" + +static u32 krng_entropy = CONFIG_LRNG_KERNEL_RNG_ENTROPY_RATE; +#ifdef CONFIG_LRNG_RUNTIME_ES_CONFIG +module_param(krng_entropy, uint, 0644); +MODULE_PARM_DESC(krng_entropy, "Entropy in bits of 256 data bits from the kernel RNG noise source"); +#endif + +static atomic_t lrng_krng_initial_rate = ATOMIC_INIT(0); + +static struct notifier_block lrng_krng_ready = { + .notifier_call = NULL, +}; + +static u32 lrng_krng_fips_entropylevel(u32 entropylevel) +{ + return fips_enabled ? 
0 : entropylevel; +} + +static int lrng_krng_adjust_entropy(struct notifier_block *nb, + unsigned long action, void *data) +{ + u32 entropylevel; + + krng_entropy = atomic_read_u32(&lrng_krng_initial_rate); + + entropylevel = lrng_krng_fips_entropylevel(krng_entropy); + pr_debug("Kernel RNG is fully seeded, setting entropy rate to %u bits of entropy\n", + entropylevel); + lrng_drng_force_reseed(); + if (entropylevel) + lrng_es_add_entropy(); + return 0; +} + +static u32 lrng_krng_entropylevel(u32 requested_bits) +{ + if (lrng_krng_ready.notifier_call == NULL) { + int err; + + lrng_krng_ready.notifier_call = lrng_krng_adjust_entropy; + + err = register_random_ready_notifier(&lrng_krng_ready); + switch (err) { + case 0: + atomic_set(&lrng_krng_initial_rate, krng_entropy); + krng_entropy = 0; + pr_debug("Kernel RNG is not yet seeded, setting entropy rate to 0 bits of entropy\n"); + break; + + case -EALREADY: + pr_debug("Kernel RNG is fully seeded, setting entropy rate to %u bits of entropy\n", + lrng_krng_fips_entropylevel(krng_entropy)); + break; + default: + lrng_krng_ready.notifier_call = NULL; + return 0; + } + } + + return lrng_fast_noise_entropylevel( + lrng_krng_fips_entropylevel(krng_entropy), requested_bits); +} + +static u32 lrng_krng_poolsize(void) +{ + return lrng_krng_entropylevel(lrng_security_strength()); +} + +/* + * lrng_krng_get() - Get kernel RNG entropy + * + * @eb: entropy buffer to store entropy + * @requested_bits: requested entropy in bits + */ +static void lrng_krng_get(struct entropy_buf *eb, u32 requested_bits, + bool __unused) +{ + u32 ent_bits = lrng_krng_entropylevel(requested_bits); + + get_random_bytes(eb->e[lrng_ext_es_krng], requested_bits >> 3); + + pr_debug("obtained %u bits of entropy from kernel RNG noise source\n", + ent_bits); + + eb->e_bits[lrng_ext_es_krng] = ent_bits; +} + +static void lrng_krng_es_state(unsigned char *buf, size_t buflen) +{ + snprintf(buf, buflen, + " Available entropy: %u\n" + " Entropy Rate per 256 data bits: %u\n", + lrng_krng_poolsize(), + lrng_krng_entropylevel(256)); +} + +struct lrng_es_cb lrng_es_krng = { + .name = "KernelRNG", + .get_ent = lrng_krng_get, + .curr_entropy = lrng_krng_entropylevel, + .max_entropy = lrng_krng_poolsize, + .state = lrng_krng_es_state, + .reset = NULL, + .switch_hash = NULL, +}; diff --git a/drivers/char/lrng/lrng_es_krng.h b/drivers/char/lrng/lrng_es_krng.h new file mode 100644 index 000000000000..cf982b9eea05 --- /dev/null +++ b/drivers/char/lrng/lrng_es_krng.h @@ -0,0 +1,17 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_ES_RANDOM_H +#define _LRNG_ES_RANDOM_H + +#include "lrng_es_mgr_cb.h" + +#ifdef CONFIG_LRNG_KERNEL_RNG + +extern struct lrng_es_cb lrng_es_krng; + +#endif /* CONFIG_LRNG_KERNEL_RNG */ + +#endif /* _LRNG_ES_RANDOM_H */ diff --git a/drivers/char/lrng/lrng_es_mgr.c b/drivers/char/lrng/lrng_es_mgr.c new file mode 100644 index 000000000000..050f96bfafab --- /dev/null +++ b/drivers/char/lrng/lrng_es_mgr.c @@ -0,0 +1,429 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG Entropy sources management + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include + +#include "lrng_drng_mgr.h" +#include "lrng_es_aux.h" +#include "lrng_es_cpu.h" +#include "lrng_es_irq.h" +#include "lrng_es_jent.h" +#include "lrng_es_krng.h" +#include "lrng_es_mgr.h" +#include "lrng_es_sched.h" +#include "lrng_interface_dev_common.h" +#include 
"lrng_interface_random_kernel.h" + +struct lrng_state { + bool can_invalidate; /* Can invalidate batched entropy? */ + bool perform_seedwork; /* Can seed work be performed? */ + bool lrng_operational; /* Is DRNG operational? */ + bool lrng_fully_seeded; /* Is DRNG fully seeded? */ + bool lrng_min_seeded; /* Is DRNG minimally seeded? */ + bool all_online_numa_node_seeded;/* All NUMA DRNGs seeded? */ + + /* + * To ensure that external entropy providers cannot dominate the + * internal noise sources but yet cannot be dominated by internal + * noise sources, the following booleans are intended to allow + * external to provide seed once when a DRNG reseed occurs. This + * triggering of external noise source is performed even when the + * entropy pool has sufficient entropy. + */ + + atomic_t boot_entropy_thresh; /* Reseed threshold */ + atomic_t reseed_in_progress; /* Flag for on executing reseed */ + struct work_struct lrng_seed_work; /* (re)seed work queue */ +}; + +static struct lrng_state lrng_state = { + false, false, false, false, false, false, + .boot_entropy_thresh = ATOMIC_INIT(LRNG_INIT_ENTROPY_BITS), + .reseed_in_progress = ATOMIC_INIT(0), +}; + +/* + * If the entropy count falls under this number of bits, then we + * should wake up processes which are selecting or polling on write + * access to /dev/random. + */ +u32 lrng_write_wakeup_bits = (LRNG_WRITE_WAKEUP_ENTROPY << 3); + +/* + * The entries must be in the same order as defined by enum lrng_internal_es and + * enum lrng_external_es + */ +struct lrng_es_cb *lrng_es[] = { +#ifdef CONFIG_LRNG_IRQ + &lrng_es_irq, +#endif +#ifdef CONFIG_LRNG_SCHED + &lrng_es_sched, +#endif +#ifdef CONFIG_LRNG_JENT + &lrng_es_jent, +#endif +#ifdef CONFIG_LRNG_CPU + &lrng_es_cpu, +#endif +#ifdef CONFIG_LRNG_KERNEL_RNG + &lrng_es_krng, +#endif + &lrng_es_aux +}; + +/********************************** Helper ***********************************/ + +void lrng_debug_report_seedlevel(const char *name) +{ +#ifdef CONFIG_WARN_ALL_UNSEEDED_RANDOM + static void *previous = NULL; + void *caller = (void *) _RET_IP_; + + if (READ_ONCE(previous) == caller) + return; + + if (!lrng_state_min_seeded()) + pr_notice("%pS %s called without reaching minimally seeded level (available entropy %u)\n", + caller, name, lrng_avail_entropy()); + + WRITE_ONCE(previous, caller); +#endif +} + +/* + * Reading of the LRNG pool is only allowed by one caller. The reading is + * only performed to (re)seed DRNGs. Thus, if this "lock" is already taken, + * the reseeding operation is in progress. The caller is not intended to wait + * but continue with its other operation. + */ +int lrng_pool_trylock(void) +{ + return atomic_cmpxchg(&lrng_state.reseed_in_progress, 0, 1); +} + +void lrng_pool_unlock(void) +{ + atomic_set(&lrng_state.reseed_in_progress, 0); +} + +/* Set new entropy threshold for reseeding during boot */ +void lrng_set_entropy_thresh(u32 new_entropy_bits) +{ + atomic_set(&lrng_state.boot_entropy_thresh, new_entropy_bits); +} + +/* + * Reset LRNG state - the entropy counters are reset, but the data that may + * or may not have entropy remains in the pools as this data will not hurt. 
+ */ +void lrng_reset_state(void) +{ + u32 i; + + for_each_lrng_es(i) { + if (lrng_es[i]->reset) + lrng_es[i]->reset(); + } + lrng_state.lrng_operational = false; + lrng_state.lrng_fully_seeded = false; + lrng_state.lrng_min_seeded = false; + lrng_state.all_online_numa_node_seeded = false; + pr_debug("reset LRNG\n"); +} + +/* Set flag that all DRNGs are fully seeded */ +void lrng_pool_all_numa_nodes_seeded(bool set) +{ + lrng_state.all_online_numa_node_seeded = set; +} + +/* Return boolean whether LRNG reached minimally seed level */ +bool lrng_state_min_seeded(void) +{ + return lrng_state.lrng_min_seeded; +} + +/* Return boolean whether LRNG reached fully seed level */ +bool lrng_state_fully_seeded(void) +{ + return lrng_state.lrng_fully_seeded; +} + +/* Return boolean whether LRNG is considered fully operational */ +bool lrng_state_operational(void) +{ + return lrng_state.lrng_operational; +} + +static void lrng_init_wakeup(void) +{ + wake_up_all(&lrng_init_wait); + lrng_init_wakeup_dev(); +} + +static bool lrng_fully_seeded(bool fully_seeded, u32 collected_entropy) +{ + return (collected_entropy >= lrng_get_seed_entropy_osr(fully_seeded)); +} + +/* Policy to check whether entropy buffer contains full seeded entropy */ +bool lrng_fully_seeded_eb(bool fully_seeded, struct entropy_buf *eb) +{ + u32 i, collected_entropy = 0; + + for_each_lrng_es(i) + collected_entropy += eb->e_bits[i]; + + return lrng_fully_seeded(fully_seeded, collected_entropy); +} + +/* Mark one DRNG as not fully seeded */ +void lrng_unset_fully_seeded(struct lrng_drng *drng) +{ + drng->fully_seeded = false; + lrng_pool_all_numa_nodes_seeded(false); + + /* + * The init DRNG instance must always be fully seeded as this instance + * is the fall-back if any of the per-NUMA node DRNG instances is + * insufficiently seeded. Thus, we mark the entire LRNG as + * non-operational if the initial DRNG becomes not fully seeded. + */ + if (drng == lrng_drng_init_instance() && lrng_state_operational()) { + pr_debug("LRNG set to non-operational\n"); + lrng_state.lrng_operational = false; + lrng_state.lrng_fully_seeded = false; + + /* If sufficient entropy is available, reseed now. */ + lrng_es_add_entropy(); + } +} + +/* Policy to enable LRNG operational mode */ +static void lrng_set_operational(void) +{ + /* + * LRNG is operational if the initial DRNG is fully seeded. This state + * can only occur if either the external entropy sources provided + * sufficient entropy, or the SP800-90B startup test completed for + * the internal ES to supply also entropy data. + */ + if (lrng_state.lrng_fully_seeded) { + lrng_state.lrng_operational = true; + lrng_process_ready_list(); + lrng_init_wakeup(); + pr_info("LRNG fully operational\n"); + } +} + +static u32 lrng_avail_entropy_thresh(void) +{ + u32 ent_thresh = lrng_security_strength(); + + /* + * Apply oversampling during initialization according to SP800-90C as + * we request a larger buffer from the ES. + */ + if (lrng_sp80090c_compliant() && + !lrng_state.all_online_numa_node_seeded) + ent_thresh += LRNG_SEED_BUFFER_INIT_ADD_BITS; + + return ent_thresh; +} + +/* Available entropy in the entire LRNG considering all entropy sources */ +u32 lrng_avail_entropy(void) +{ + u32 i, ent = 0, ent_thresh = lrng_avail_entropy_thresh(); + + BUILD_BUG_ON(ARRAY_SIZE(lrng_es) != lrng_ext_es_last); + for_each_lrng_es(i) + ent += lrng_es[i]->curr_entropy(ent_thresh); + return ent; +} + +/* + * lrng_init_ops() - Set seed stages of LRNG + * + * Set the slow noise source reseed trigger threshold. 
The initial threshold + * is set to the minimum data size that can be read from the pool: a word. Upon + * reaching this value, the next seed threshold of 128 bits is set followed + * by 256 bits. + * + * @eb: buffer containing the size of entropy currently injected into DRNG - if + * NULL, the function obtains the available entropy from the ES. + */ +void lrng_init_ops(struct entropy_buf *eb) +{ + struct lrng_state *state = &lrng_state; + u32 i, requested_bits, seed_bits = 0; + + if (state->lrng_operational) + return; + + requested_bits = lrng_get_seed_entropy_osr( + state->all_online_numa_node_seeded); + + if (eb) { + for_each_lrng_es(i) + seed_bits += eb->e_bits[i]; + } else { + u32 ent_thresh = lrng_avail_entropy_thresh(); + + for_each_lrng_es(i) + seed_bits += lrng_es[i]->curr_entropy(ent_thresh); + } + + /* DRNG is seeded with full security strength */ + if (state->lrng_fully_seeded) { + lrng_set_operational(); + lrng_set_entropy_thresh(requested_bits); + } else if (lrng_fully_seeded(state->all_online_numa_node_seeded, + seed_bits)) { + if (state->can_invalidate) + invalidate_batched_entropy(); + + state->lrng_fully_seeded = true; + lrng_set_operational(); + state->lrng_min_seeded = true; + pr_info("LRNG fully seeded with %u bits of entropy\n", + seed_bits); + lrng_set_entropy_thresh(requested_bits); + } else if (!state->lrng_min_seeded) { + + /* DRNG is seeded with at least 128 bits of entropy */ + if (seed_bits >= LRNG_MIN_SEED_ENTROPY_BITS) { + if (state->can_invalidate) + invalidate_batched_entropy(); + + state->lrng_min_seeded = true; + pr_info("LRNG minimally seeded with %u bits of entropy\n", + seed_bits); + lrng_set_entropy_thresh(requested_bits); + lrng_init_wakeup(); + + /* DRNG is seeded with at least LRNG_INIT_ENTROPY_BITS bits */ + } else if (seed_bits >= LRNG_INIT_ENTROPY_BITS) { + pr_info("LRNG initial entropy level %u bits of entropy\n", + seed_bits); + lrng_set_entropy_thresh(LRNG_MIN_SEED_ENTROPY_BITS); + } + } +} + +int __init lrng_rand_initialize(void) +{ + struct seed { + ktime_t time; + unsigned long data[(LRNG_MAX_DIGESTSIZE / + sizeof(unsigned long))]; + struct new_utsname utsname; + } seed __aligned(LRNG_KCAPI_ALIGN); + unsigned int i; + + BUILD_BUG_ON(LRNG_MAX_DIGESTSIZE % sizeof(unsigned long)); + + seed.time = ktime_get_real(); + + for (i = 0; i < ARRAY_SIZE(seed.data); i++) { +#ifdef CONFIG_LRNG_RANDOM_IF + if (!arch_get_random_seed_long_early(&(seed.data[i])) && + !arch_get_random_long_early(&seed.data[i])) +#else + if (!arch_get_random_seed_long(&(seed.data[i])) && + !arch_get_random_long(&seed.data[i])) +#endif + seed.data[i] = random_get_entropy(); + } + memcpy(&seed.utsname, utsname(), sizeof(*(utsname()))); + + lrng_pool_insert_aux((u8 *)&seed, sizeof(seed), 0); + memzero_explicit(&seed, sizeof(seed)); + + /* Initialize the seed work queue */ + INIT_WORK(&lrng_state.lrng_seed_work, lrng_drng_seed_work); + lrng_state.perform_seedwork = true; + + invalidate_batched_entropy(); + + lrng_state.can_invalidate = true; + + return 0; +} + +#ifndef CONFIG_LRNG_RANDOM_IF +early_initcall(lrng_rand_initialize); +#endif + +/* Interface requesting a reseed of the DRNG */ +void lrng_es_add_entropy(void) +{ + /* + * Once all DRNGs are fully seeded, the system-triggered arrival of + * entropy will not cause any reseeding any more. + */ + if (likely(lrng_state.all_online_numa_node_seeded)) + return; + + /* Only trigger the DRNG reseed if we have collected entropy. 
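+	 * The boot_entropy_thresh value is raised step by step by
+	 * lrng_init_ops() as the LRNG progresses from initial to minimal
+	 * to full seeding.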
*/ + if (lrng_avail_entropy() < + atomic_read_u32(&lrng_state.boot_entropy_thresh)) + return; + + /* Ensure that the seeding only occurs once at any given time. */ + if (lrng_pool_trylock()) + return; + + /* Seed the DRNG with any available noise. */ + if (lrng_state.perform_seedwork) + schedule_work(&lrng_state.lrng_seed_work); + else + lrng_drng_seed_work(NULL); +} + +/* Fill the seed buffer with data from the noise sources */ +void lrng_fill_seed_buffer(struct entropy_buf *eb, u32 requested_bits) +{ + struct lrng_state *state = &lrng_state; + u32 i, req_ent = lrng_sp80090c_compliant() ? + lrng_security_strength() : LRNG_MIN_SEED_ENTROPY_BITS; + + /* Guarantee that requested bits is a multiple of bytes */ + BUILD_BUG_ON(LRNG_DRNG_SECURITY_STRENGTH_BITS % 8); + + /* always reseed the DRNG with the current time stamp */ + eb->now = random_get_entropy(); + + /* + * Require at least 128 bits of entropy for any reseed. If the LRNG is + * operated SP800-90C compliant we want to comply with SP800-90A section + * 9.2 mandating that DRNG is reseeded with the security strength. + */ + if (state->lrng_fully_seeded && (lrng_avail_entropy() < req_ent)) { + for_each_lrng_es(i) + eb->e_bits[i] = 0; + + goto wakeup; + } + + /* Concatenate the output of the entropy sources. */ + for_each_lrng_es(i) { + lrng_es[i]->get_ent(eb, requested_bits, + state->lrng_fully_seeded); + } + + /* allow external entropy provider to provide seed */ + lrng_state_exseed_allow_all(); + +wakeup: + lrng_writer_wakeup(); +} diff --git a/drivers/char/lrng/lrng_es_mgr.h b/drivers/char/lrng/lrng_es_mgr.h new file mode 100644 index 000000000000..e24f101de5a0 --- /dev/null +++ b/drivers/char/lrng/lrng_es_mgr.h @@ -0,0 +1,47 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_ES_MGR_H +#define _LRNG_ES_MGR_H + +#include "lrng_es_mgr_cb.h" + +/*************************** General LRNG parameter ***************************/ + +#define LRNG_DRNG_BLOCKSIZE 64 /* Maximum of DRNG block sizes */ + +/* Helper to concatenate a macro with an integer type */ +#define LRNG_PASTER(x, y) x ## y +#define LRNG_UINT32_C(x) LRNG_PASTER(x, U) + +/************************* Entropy sources management *************************/ + +extern struct lrng_es_cb *lrng_es[]; + +#define for_each_lrng_es(ctr) \ + for ((ctr) = 0; (ctr) < lrng_ext_es_last; (ctr)++) + +bool lrng_state_min_seeded(void); +void lrng_debug_report_seedlevel(const char *name); +int lrng_rand_initialize(void); +bool lrng_state_operational(void); + +extern u32 lrng_write_wakeup_bits; +void lrng_set_entropy_thresh(u32 new); +u32 lrng_avail_entropy(void); +void lrng_reset_state(void); + +bool lrng_state_fully_seeded(void); + +int lrng_pool_trylock(void); +void lrng_pool_unlock(void); +void lrng_pool_all_numa_nodes_seeded(bool set); + +bool lrng_fully_seeded_eb(bool fully_seeded, struct entropy_buf *eb); +void lrng_unset_fully_seeded(struct lrng_drng *drng); +void lrng_fill_seed_buffer(struct entropy_buf *eb, u32 requested_bits); +void lrng_init_ops(struct entropy_buf *eb); + +#endif /* _LRNG_ES_MGR_H */ diff --git a/drivers/char/lrng/lrng_es_mgr_cb.h b/drivers/char/lrng/lrng_es_mgr_cb.h new file mode 100644 index 000000000000..08b24e1b7766 --- /dev/null +++ b/drivers/char/lrng/lrng_es_mgr_cb.h @@ -0,0 +1,87 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + * + * Definition of an entropy source. 
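+ *
+ * Each entropy source registers one struct lrng_es_cb instance; the ES
+ * manager iterates over these callbacks in the order fixed by the enums
+ * below.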
+ */ + +#ifndef _LRNG_ES_MGR_CB_H +#define _LRNG_ES_MGR_CB_H + +#include + +#include "lrng_definitions.h" +#include "lrng_drng_mgr.h" + +enum lrng_internal_es { +#ifdef CONFIG_LRNG_IRQ + lrng_int_es_irq, /* IRQ-based entropy source */ +#endif +#ifdef CONFIG_LRNG_SCHED + lrng_int_es_sched, /* Scheduler entropy source */ +#endif + lrng_int_es_last, /* MUST be the last entry */ +}; + +enum lrng_external_es { + lrng_ext_link = lrng_int_es_last - 1, /* Link entry */ +#ifdef CONFIG_LRNG_JENT + lrng_ext_es_jitter, /* Jitter RNG */ +#endif +#ifdef CONFIG_LRNG_CPU + lrng_ext_es_cpu, /* CPU-based, e.g. RDSEED */ +#endif +#ifdef CONFIG_LRNG_KERNEL_RNG + lrng_ext_es_krng, /* random.c */ +#endif + lrng_ext_es_aux, /* MUST BE LAST ES! */ + lrng_ext_es_last /* MUST be the last entry */ +}; + +struct entropy_buf { + u8 e[lrng_ext_es_last][LRNG_DRNG_INIT_SEED_SIZE_BYTES]; + u32 now, e_bits[lrng_ext_es_last]; +}; + +/* + * struct lrng_es_cb - callback defining an entropy source + * @name: Name of the entropy source. + * @get_ent: Fetch entropy into the entropy_buf. The ES shall only deliver + * data if its internal initialization is complete, including any + * SP800-90B startup testing or similar. + * @curr_entropy: Return amount of currently available entropy. + * @max_entropy: Maximum amount of entropy the entropy source is able to + * maintain. + * @state: Buffer with human-readable ES state. + * @reset: Reset entropy source (drop all entropy and reinitialize). + * This callback may be NULL. + * @switch_hash: callback to switch from an old hash callback definition to + * a new one. This callback may be NULL. + */ +struct lrng_es_cb { + const char *name; + void (*get_ent)(struct entropy_buf *eb, u32 requested_bits, + bool fully_seeded); + u32 (*curr_entropy)(u32 requested_bits); + u32 (*max_entropy)(void); + void (*state)(unsigned char *buf, size_t buflen); + void (*reset)(void); + int (*switch_hash)(struct lrng_drng *drng, int node, + const struct lrng_hash_cb *new_cb, void *new_hash, + const struct lrng_hash_cb *old_cb); +}; + +/* Allow entropy sources to tell the ES manager that new entropy is there */ +void lrng_es_add_entropy(void); + +/* Cap to maximum entropy that can ever be generated with given hash */ +#define lrng_cap_requested(__digestsize_bits, __requested_bits) \ + do { \ + if (__digestsize_bits < __requested_bits) { \ + pr_debug("Cannot satisfy requested entropy %u due to insufficient hash size %u\n",\ + __requested_bits, __digestsize_bits); \ + __requested_bits = __digestsize_bits; \ + } \ + } while (0) + +#endif /* _LRNG_ES_MGR_CB_H */ diff --git a/drivers/char/lrng/lrng_es_sched.c b/drivers/char/lrng/lrng_es_sched.c new file mode 100644 index 000000000000..f7e85c381e22 --- /dev/null +++ b/drivers/char/lrng/lrng_es_sched.c @@ -0,0 +1,406 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG Slow Entropy Source: Scheduler-based data collection + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include + +#include "lrng_es_aux.h" +#include "lrng_es_sched.h" +#include "lrng_es_timer_common.h" +#include "lrng_health.h" +#include "lrng_numa.h" +#include "lrng_testing.h" + +/* + * Number of scheduler-based context switches to be recorded to assume that + * DRNG security strength bits of entropy are received. 
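+ * This is the scheduler-ES counterpart to lrng_irq_entropy_bits of the
+ * interrupt ES: a larger value credits less entropy to each event.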
+ * Note: a value below the DRNG security strength should not be defined as this + * may imply the DRNG can never be fully seeded in case other noise + * sources are unavailable. + */ +#define LRNG_SCHED_ENTROPY_BITS \ + LRNG_UINT32_C(CONFIG_LRNG_SCHED_ENTROPY_RATE) + +/* Number of events required for LRNG_DRNG_SECURITY_STRENGTH_BITS entropy */ +static u32 lrng_sched_entropy_bits = LRNG_SCHED_ENTROPY_BITS; + +static u32 sched_entropy __read_mostly = LRNG_SCHED_ENTROPY_BITS; +#ifdef CONFIG_LRNG_RUNTIME_ES_CONFIG +module_param(sched_entropy, uint, 0444); +MODULE_PARM_DESC(sched_entropy, + "How many scheduler-based context switches must be collected for obtaining 256 bits of entropy\n"); +#endif + +/* Per-CPU array holding concatenated entropy events */ +static DEFINE_PER_CPU(u32 [LRNG_DATA_ARRAY_SIZE], lrng_sched_array) + __aligned(LRNG_KCAPI_ALIGN); +static DEFINE_PER_CPU(u32, lrng_sched_array_ptr) = 0; +static DEFINE_PER_CPU(atomic_t, lrng_sched_array_events) = ATOMIC_INIT(0); + +static void __init lrng_sched_check_compression_state(void) +{ + /* One pool should hold sufficient entropy for disabled compression */ + u32 max_ent = min_t(u32, lrng_get_digestsize(), + lrng_data_to_entropy(LRNG_DATA_NUM_VALUES, + lrng_sched_entropy_bits)); + if (max_ent < lrng_security_strength()) { + pr_devel("Scheduler entropy source will never provide %u bits of entropy required for fully seeding the DRNG all by itself\n", + lrng_security_strength()); + } +} + +void __init lrng_sched_es_init(bool highres_timer) +{ + /* Set a minimum number of scheduler events that must be collected */ + sched_entropy = max_t(u32, LRNG_SCHED_ENTROPY_BITS, sched_entropy); + + if (highres_timer) { + lrng_sched_entropy_bits = sched_entropy; + } else { + u32 new_entropy = sched_entropy * LRNG_ES_OVERSAMPLING_FACTOR; + + lrng_sched_entropy_bits = (sched_entropy < new_entropy) ? + new_entropy : sched_entropy; + pr_warn("operating without high-resolution timer and applying oversampling factor %u\n", + LRNG_ES_OVERSAMPLING_FACTOR); + } + + lrng_sched_check_compression_state(); +} + +static u32 lrng_sched_avail_pool_size(void) +{ + u32 max_pool = lrng_get_digestsize(), + max_size = min_t(u32, max_pool, LRNG_DATA_NUM_VALUES); + int cpu; + + for_each_online_cpu(cpu) + max_size += max_pool; + + return max_size; +} + +/* Return entropy of unused scheduler events present in all per-CPU pools. */ +static u32 lrng_sched_avail_entropy(u32 __unused) +{ + u32 digestsize_events, events = 0; + int cpu; + + /* Only deliver entropy when SP800-90B self test is completed */ + if (!lrng_sp80090b_startup_complete_es(lrng_int_es_sched)) + return 0; + + /* Obtain the cap of maximum numbers of scheduler events we count */ + digestsize_events = lrng_entropy_to_data(lrng_get_digestsize(), + lrng_sched_entropy_bits); + /* Cap to max. number of scheduler events the array can hold */ + digestsize_events = min_t(u32, digestsize_events, LRNG_DATA_NUM_VALUES); + + for_each_online_cpu(cpu) { + events += min_t(u32, digestsize_events, + atomic_read_u32(per_cpu_ptr(&lrng_sched_array_events, + cpu))); + } + + /* Consider oversampling rate */ + return lrng_reduce_by_osr( + lrng_data_to_entropy(events, lrng_sched_entropy_bits)); +} + +/* + * Reset all per-CPU pools - reset entropy estimator but leave the pool data + * that may or may not have entropy unchanged. + */ +static void lrng_sched_reset(void) +{ + int cpu; + + /* Trigger GCD calculation anew. 
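+	 * Setting the GCD to 0 marks it as untested, so the next
+	 * LRNG_GCD_WINDOW_SIZE time stamps are collected for analysis
+	 * again.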
*/
+	lrng_gcd_set(0);
+
+	for_each_online_cpu(cpu)
+		atomic_set(per_cpu_ptr(&lrng_sched_array_events, cpu), 0);
+}
+
+/*
+ * Hash all per-CPU arrays and return the digest to be used as seed data for
+ * seeding a DRNG. The caller must guarantee backtracking resistance.
+ * The function copies only as much data into the caller-provided output
+ * buffer as entropy is available.
+ *
+ * This function handles the translation from the number of received scheduler
+ * events into an entropy statement. The conversion depends on
+ * LRNG_SCHED_ENTROPY_BITS which defines how many scheduler events must be
+ * received to obtain 256 bits of entropy. With this value, the function
+ * lrng_data_to_entropy converts a given data size (received scheduler events,
+ * requested amount of data, etc.) into an entropy statement.
+ * lrng_entropy_to_data does the reverse.
+ *
+ * @eb: entropy buffer to store entropy
+ * @requested_bits: Requested amount of entropy
+ * @fully_seeded: indicator whether LRNG is fully seeded
+ */
+static void lrng_sched_pool_hash(struct entropy_buf *eb, u32 requested_bits,
+				 bool fully_seeded)
+{
+	SHASH_DESC_ON_STACK(shash, NULL);
+	const struct lrng_hash_cb *hash_cb;
+	struct lrng_drng *drng = lrng_drng_node_instance();
+	u8 digest[LRNG_MAX_DIGESTSIZE];
+	unsigned long flags;
+	u32 found_events, collected_events = 0, collected_ent_bits,
+	    requested_events, returned_ent_bits, digestsize, digestsize_events;
+	int ret, cpu;
+	void *hash;
+
+	/* Only deliver entropy when SP800-90B self test is completed */
+	if (!lrng_sp80090b_startup_complete_es(lrng_int_es_sched)) {
+		eb->e_bits[lrng_int_es_sched] = 0;
+		return;
+	}
+
+	/* Lock guarding replacement of per-NUMA hash */
+	read_lock_irqsave(&drng->hash_lock, flags);
+	hash_cb = drng->hash_cb;
+	hash = drng->hash;
+
+	/* The hash state is filled with all per-CPU pool hashes. */
+	ret = hash_cb->hash_init(shash, hash);
+	if (ret)
+		goto err;
+
+	digestsize = hash_cb->hash_digestsize(shash);
+
+	/* Cap to maximum entropy that can ever be generated with given hash */
+	lrng_cap_requested(digestsize << 3, requested_bits);
+	requested_events = lrng_entropy_to_data(requested_bits +
+						lrng_compress_osr(),
+						lrng_sched_entropy_bits);
+	digestsize_events = lrng_entropy_to_data(digestsize << 3,
+						 lrng_sched_entropy_bits);
+
+	/*
+	 * Harvest entropy from each per-CPU data array - even though we may
+	 * have collected sufficient entropy, we will hash all per-CPU arrays.
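+	 * Events exceeding the request are credited back to the per-CPU
+	 * event counters below, so hashing everything loses no collected
+	 * entropy.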
+	 */
+	for_each_online_cpu(cpu) {
+		u32 unused_events = 0;
+
+		/* Store all not-yet compressed data in data array into hash */
+		ret = hash_cb->hash_update(shash,
+				(u8 *)per_cpu_ptr(lrng_sched_array, cpu),
+				LRNG_DATA_ARRAY_SIZE * sizeof(u32));
+		if (ret)
+			goto err;
+
+		/* Obtain entropy statement like for the entropy pool */
+		found_events = atomic_xchg_relaxed(
+			per_cpu_ptr(&lrng_sched_array_events, cpu), 0);
+		/* Cap to maximum amount of data we can hold in hash */
+		found_events = min_t(u32, found_events, digestsize_events);
+		/* Cap to maximum amount of data we can hold in array */
+		found_events = min_t(u32, found_events, LRNG_DATA_NUM_VALUES);
+
+		collected_events += found_events;
+		if (collected_events > requested_events) {
+			unused_events = collected_events - requested_events;
+			atomic_add_return_relaxed(unused_events,
+				per_cpu_ptr(&lrng_sched_array_events, cpu));
+			collected_events = requested_events;
+		}
+		pr_debug("%u scheduler-based events used from entropy array of CPU %d, %u scheduler-based events remain unused\n",
+			 found_events - unused_events, cpu, unused_events);
+	}
+
+	ret = hash_cb->hash_final(shash, digest);
+	if (ret)
+		goto err;
+
+	collected_ent_bits = lrng_data_to_entropy(collected_events,
+						  lrng_sched_entropy_bits);
+	/* Apply oversampling: discount requested oversampling rate */
+	returned_ent_bits = lrng_reduce_by_osr(collected_ent_bits);
+
+	pr_debug("obtained %u bits by collecting %u bits of entropy from scheduler-based noise source\n",
+		 returned_ent_bits, collected_ent_bits);
+
+	/*
+	 * Truncate to available entropy as implicitly allowed by SP800-90B
+	 * section 3.1.5.1.1 table 1 which awards truncated hashes full
+	 * entropy.
+	 *
+	 * During boot time, we read requested_bits data with
+	 * returned_ent_bits entropy. In case our conservative entropy
+	 * estimate underestimates the available entropy we can transport as
+	 * much available entropy as possible.
+	 */
+	memcpy(eb->e[lrng_int_es_sched], digest,
+	       fully_seeded ? returned_ent_bits >> 3 : requested_bits >> 3);
+	eb->e_bits[lrng_int_es_sched] = returned_ent_bits;
+
+out:
+	hash_cb->hash_desc_zero(shash);
+	read_unlock_irqrestore(&drng->hash_lock, flags);
+	memzero_explicit(digest, sizeof(digest));
+	return;
+
+err:
+	eb->e_bits[lrng_int_es_sched] = 0;
+	goto out;
+}
+
+/*
+ * Concatenate a full 32 bit word at the end of the time array even when the
+ * current ptr is not aligned to sizeof(data).
+ */
+static void lrng_sched_array_add_u32(u32 data)
+{
+	/* Increment pointer by number of slots taken for input value */
+	u32 pre_ptr, mask, ptr = this_cpu_add_return(lrng_sched_array_ptr,
+						     LRNG_DATA_SLOTS_PER_UINT);
+	unsigned int pre_array;
+
+	lrng_data_split_u32(&ptr, &pre_ptr, &mask);
+
+	/* MSBs of data go into the previous unit */
+	pre_array = lrng_data_idx2array(pre_ptr);
+	/* zeroization of slot to ensure the following OR adds the data */
+	this_cpu_and(lrng_sched_array[pre_array], ~(0xffffffff & ~mask));
+	this_cpu_or(lrng_sched_array[pre_array], data & ~mask);
+
+	/*
+	 * Continuous compression is not allowed for the scheduler noise
+	 * source, so do not call lrng_sched_array_to_hash here.
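+	 * The per-CPU array is only compressed when lrng_sched_pool_hash()
+	 * reads it, which keeps hash operations out of scheduler context.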
+ */ + + /* LSB of data go into current unit */ + this_cpu_write(lrng_sched_array[lrng_data_idx2array(ptr)], + data & mask); +} + +/* Concatenate data of max LRNG_DATA_SLOTSIZE_MASK at the end of time array */ +static void lrng_sched_array_add_slot(u32 data) +{ + /* Get slot */ + u32 ptr = this_cpu_inc_return(lrng_sched_array_ptr) & + LRNG_DATA_WORD_MASK; + unsigned int array = lrng_data_idx2array(ptr); + unsigned int slot = lrng_data_idx2slot(ptr); + + /* zeroization of slot to ensure the following OR adds the data */ + this_cpu_and(lrng_sched_array[array], + ~(lrng_data_slot_val(0xffffffff & LRNG_DATA_SLOTSIZE_MASK, + slot))); + /* Store data into slot */ + this_cpu_or(lrng_sched_array[array], lrng_data_slot_val(data, slot)); + + /* + * Continuous compression is not allowed for scheduler noise source, + * so do not call lrng_sched_array_to_hash here. + */ +} + +static void +lrng_time_process_common(u32 time, void(*add_time)(u32 data)) +{ + enum lrng_health_res health_test; + + if (lrng_raw_sched_hires_entropy_store(time)) + return; + + health_test = lrng_health_test(time, lrng_int_es_sched); + if (health_test > lrng_health_fail_use) + return; + + if (health_test == lrng_health_pass) + atomic_inc_return(this_cpu_ptr(&lrng_sched_array_events)); + + add_time(time); + + /* + * We cannot call lrng_es_add_entropy() as this would call a schedule + * operation that is not permissible in scheduler context. + * As the scheduler ES provides a high bandwidth of entropy, we assume + * that other reseed triggers happen to pick up the scheduler ES + * entropy in due time. + */ +} + +/* Batching up of entropy in per-CPU array */ +static void lrng_sched_time_process(void) +{ + u32 now_time = random_get_entropy(); + + if (unlikely(!lrng_gcd_tested())) { + /* When GCD is unknown, we process the full time stamp */ + lrng_time_process_common(now_time, lrng_sched_array_add_u32); + lrng_gcd_add_value(now_time); + } else { + /* GCD is known and applied */ + lrng_time_process_common((now_time / lrng_gcd_get()) & + LRNG_DATA_SLOTSIZE_MASK, + lrng_sched_array_add_slot); + } + + lrng_sched_perf_time(now_time); +} + +void add_sched_randomness(const struct task_struct *p, int cpu) +{ + if (lrng_highres_timer()) { + lrng_sched_time_process(); + } else { + u32 tmp = cpu; + + tmp ^= lrng_raw_sched_pid_entropy_store(p->pid) ? + 0 : (u32)p->pid; + tmp ^= lrng_raw_sched_starttime_entropy_store(p->start_time) ? + 0 : (u32)p->start_time; + tmp ^= lrng_raw_sched_nvcsw_entropy_store(p->nvcsw) ? + 0 : (u32)p->nvcsw; + + lrng_sched_time_process(); + lrng_sched_array_add_u32(tmp); + } +} +EXPORT_SYMBOL(add_sched_randomness); + +static void lrng_sched_es_state(unsigned char *buf, size_t buflen) +{ + const struct lrng_drng *lrng_drng_init = lrng_drng_init_instance(); + + /* Assume the lrng_drng_init lock is taken by caller */ + snprintf(buf, buflen, + " Hash for operating entropy pool: %s\n" + " Available entropy: %u\n" + " per-CPU scheduler event collection size: %u\n" + " Standards compliance: %s\n" + " High-resolution timer: %s\n", + lrng_drng_init->hash_cb->hash_name(), + lrng_sched_avail_entropy(0), + LRNG_DATA_NUM_VALUES, + lrng_sp80090b_compliant(lrng_int_es_sched) ? "SP800-90B " : "", + lrng_highres_timer() ? 
"true" : "false"); +} + +struct lrng_es_cb lrng_es_sched = { + .name = "Scheduler", + .get_ent = lrng_sched_pool_hash, + .curr_entropy = lrng_sched_avail_entropy, + .max_entropy = lrng_sched_avail_pool_size, + .state = lrng_sched_es_state, + .reset = lrng_sched_reset, + .switch_hash = NULL, +}; diff --git a/drivers/char/lrng/lrng_es_sched.h b/drivers/char/lrng/lrng_es_sched.h new file mode 100644 index 000000000000..f1e596dd89d9 --- /dev/null +++ b/drivers/char/lrng/lrng_es_sched.h @@ -0,0 +1,20 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_ES_SCHED_H +#define _LRNG_ES_SCHED_H + +#include "lrng_es_mgr_cb.h" + +#ifdef CONFIG_LRNG_SCHED +void lrng_sched_es_init(bool highres_timer); + +extern struct lrng_es_cb lrng_es_sched; + +#else /* CONFIG_LRNG_SCHED */ +static inline void lrng_sched_es_init(bool highres_timer) { } +#endif /* CONFIG_LRNG_SCHED */ + +#endif /* _LRNG_ES_SCHED_H */ diff --git a/drivers/char/lrng/lrng_es_timer_common.c b/drivers/char/lrng/lrng_es_timer_common.c new file mode 100644 index 000000000000..70f3ff074fe6 --- /dev/null +++ b/drivers/char/lrng/lrng_es_timer_common.c @@ -0,0 +1,144 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG Slow Entropy Source: Interrupt data collection + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include + +#include "lrng_es_irq.h" +#include "lrng_es_sched.h" +#include "lrng_es_timer_common.h" +#include "lrng_health.h" + +/* Is high-resolution timer present? */ +static bool lrng_highres_timer_val = false; + +/* Number of time stamps analyzed to calculate a GCD */ +#define LRNG_GCD_WINDOW_SIZE 100 +static u32 lrng_gcd_history[LRNG_GCD_WINDOW_SIZE]; +static atomic_t lrng_gcd_history_ptr = ATOMIC_INIT(-1); + +/* The common divisor for all timestamps */ +static u32 lrng_gcd_timer = 0; + +bool lrng_gcd_tested(void) +{ + return (lrng_gcd_timer != 0); +} + +u32 lrng_gcd_get(void) +{ + return lrng_gcd_timer; +} + +/* Set the GCD for use in IRQ ES - if 0, the GCD calculation is restarted. */ +void lrng_gcd_set(u32 running_gcd) +{ + lrng_gcd_timer = running_gcd; + /* Ensure that update to global variable lrng_gcd_timer is visible */ + mb(); +} + +static void lrng_gcd_set_check(u32 running_gcd) +{ + if (!lrng_gcd_tested()) { + lrng_gcd_set(running_gcd); + pr_debug("Setting GCD to %u\n", running_gcd); + } +} + +u32 lrng_gcd_analyze(u32 *history, size_t nelem) +{ + u32 running_gcd = 0; + size_t i; + + /* Now perform the analysis on the accumulated time data. */ + for (i = 0; i < nelem; i++) { + /* + * NOTE: this would be the place to add more analysis on the + * appropriateness of the timer like checking the presence + * of sufficient variations in the timer. + */ + + /* + * This calculates the gcd of all the time values. that is + * gcd(time_1, time_2, ..., time_nelem) + * + * Some timers increment by a fixed (non-1) amount each step. + * This code checks for such increments, and allows the library + * to output the number of such changes have occurred. 
+ */ + running_gcd = (u32)gcd(history[i], running_gcd); + + /* Zeroize data */ + history[i] = 0; + } + + return running_gcd; +} + +void lrng_gcd_add_value(u32 time) +{ + u32 ptr = (u32)atomic_inc_return_relaxed(&lrng_gcd_history_ptr); + + if (ptr < LRNG_GCD_WINDOW_SIZE) { + lrng_gcd_history[ptr] = time; + } else if (ptr == LRNG_GCD_WINDOW_SIZE) { + u32 gcd = lrng_gcd_analyze(lrng_gcd_history, + LRNG_GCD_WINDOW_SIZE); + + if (!gcd) + gcd = 1; + + /* + * Ensure that we have variations in the time stamp below the + * given value. This is just a safety measure to prevent the GCD + * becoming too large. + */ + if (gcd >= 1000) { + pr_warn("calculated GCD is larger than expected: %u\n", + gcd); + gcd = 1000; + } + + /* Adjust all deltas by the observed (small) common factor. */ + lrng_gcd_set_check(gcd); + atomic_set(&lrng_gcd_history_ptr, 0); + } +} + +/* Return boolean whether LRNG identified presence of high-resolution timer */ +bool lrng_highres_timer(void) +{ + return lrng_highres_timer_val; +} + +static int __init lrng_init_time_source(void) +{ + if ((random_get_entropy() & LRNG_DATA_SLOTSIZE_MASK) || + (random_get_entropy() & LRNG_DATA_SLOTSIZE_MASK)) { + /* + * As the highres timer is identified here, previous interrupts + * obtained during boot time are treated like a lowres-timer + * would have been present. + */ + lrng_highres_timer_val = true; + } else { + lrng_health_disable(); + lrng_highres_timer_val = false; + } + + lrng_irq_es_init(lrng_highres_timer_val); + lrng_sched_es_init(lrng_highres_timer_val); + + /* Ensure that changes to global variables are visible */ + mb(); + + return 0; +} +core_initcall(lrng_init_time_source); diff --git a/drivers/char/lrng/lrng_es_timer_common.h b/drivers/char/lrng/lrng_es_timer_common.h new file mode 100644 index 000000000000..b45b9f683dea --- /dev/null +++ b/drivers/char/lrng/lrng_es_timer_common.h @@ -0,0 +1,83 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * LRNG Slow Noise Source: Time stamp array handling + * + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_ES_TIMER_COMMON_H +#define _LRNG_ES_TIMER_COMMON_H + +bool lrng_gcd_tested(void); +void lrng_gcd_set(u32 running_gcd); +u32 lrng_gcd_get(void); +u32 lrng_gcd_analyze(u32 *history, size_t nelem); +void lrng_gcd_add_value(u32 time); +bool lrng_highres_timer(void); + +/* + * To limit the impact on the interrupt handling, the LRNG concatenates + * entropic LSB parts of the time stamps in a per-CPU array and only + * injects them into the entropy pool when the array is full. + */ + +/* Store multiple integers in one u32 */ +#define LRNG_DATA_SLOTSIZE_BITS (8) +#define LRNG_DATA_SLOTSIZE_MASK ((1 << LRNG_DATA_SLOTSIZE_BITS) - 1) +#define LRNG_DATA_ARRAY_MEMBER_BITS (4 << 3) /* ((sizeof(u32)) << 3) */ +#define LRNG_DATA_SLOTS_PER_UINT (LRNG_DATA_ARRAY_MEMBER_BITS / \ + LRNG_DATA_SLOTSIZE_BITS) + +/* + * Number of time values to store in the array - in small environments + * only one atomic_t variable per CPU is used. 
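+ *
+ * Example with an assumed CONFIG_LRNG_COLLECTION_SIZE of 1024: with 8-bit
+ * slots, 4 slots fit into each u32, so the per-CPU array spans
+ * 1024 / 4 = 256 u32 words.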
+ */ +#define LRNG_DATA_NUM_VALUES (CONFIG_LRNG_COLLECTION_SIZE) +/* Mask of LSB of time stamp to store */ +#define LRNG_DATA_WORD_MASK (LRNG_DATA_NUM_VALUES - 1) + +#define LRNG_DATA_SLOTS_MASK (LRNG_DATA_SLOTS_PER_UINT - 1) +#define LRNG_DATA_ARRAY_SIZE (LRNG_DATA_NUM_VALUES / \ + LRNG_DATA_SLOTS_PER_UINT) + +/* Starting bit index of slot */ +static inline unsigned int lrng_data_slot2bitindex(unsigned int slot) +{ + return (LRNG_DATA_SLOTSIZE_BITS * slot); +} + +/* Convert index into the array index */ +static inline unsigned int lrng_data_idx2array(unsigned int idx) +{ + return idx / LRNG_DATA_SLOTS_PER_UINT; +} + +/* Convert index into the slot of a given array index */ +static inline unsigned int lrng_data_idx2slot(unsigned int idx) +{ + return idx & LRNG_DATA_SLOTS_MASK; +} + +/* Convert value into slot value */ +static inline unsigned int lrng_data_slot_val(unsigned int val, + unsigned int slot) +{ + return val << lrng_data_slot2bitindex(slot); +} + +/* + * Return the pointers for the previous and current units to inject a u32 into. + * Also return the mask which the u32 word is to be processed. + */ +static inline void lrng_data_split_u32(u32 *ptr, u32 *pre_ptr, u32 *mask) +{ + /* ptr to previous unit */ + *pre_ptr = (*ptr - LRNG_DATA_SLOTS_PER_UINT) & LRNG_DATA_WORD_MASK; + *ptr &= LRNG_DATA_WORD_MASK; + + /* mask to split data into the two parts for the two units */ + *mask = ((1 << (*pre_ptr & (LRNG_DATA_SLOTS_PER_UINT - 1)) * + LRNG_DATA_SLOTSIZE_BITS)) - 1; +} + +#endif /* _LRNG_ES_TIMER_COMMON_H */ \ No newline at end of file diff --git a/drivers/char/lrng/lrng_hash_kcapi.c b/drivers/char/lrng/lrng_hash_kcapi.c new file mode 100644 index 000000000000..13e62db9b6c8 --- /dev/null +++ b/drivers/char/lrng/lrng_hash_kcapi.c @@ -0,0 +1,140 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * Backend for providing the hash primitive using the kernel crypto API. + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include + + +static char *lrng_hash_name = "sha512"; + +/* The parameter must be r/o in sysfs as otherwise races appear. 
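+ * A hash change at runtime goes through the ->switch_hash callbacks
+ * instead, where users of the old hash are serialized via the
+ * drng->hash_lock.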
*/ +module_param(lrng_hash_name, charp, 0444); +MODULE_PARM_DESC(lrng_hash_name, "Kernel crypto API hash name"); + +struct lrng_hash_info { + struct crypto_shash *tfm; +}; + +static const char *lrng_kcapi_hash_name(void) +{ + return lrng_hash_name; +} + +static void _lrng_kcapi_hash_free(struct lrng_hash_info *lrng_hash) +{ + struct crypto_shash *tfm = lrng_hash->tfm; + + crypto_free_shash(tfm); + kfree(lrng_hash); +} + +static void *lrng_kcapi_hash_alloc(const char *name) +{ + struct lrng_hash_info *lrng_hash; + struct crypto_shash *tfm; + int ret; + + if (!name) { + pr_err("Hash name missing\n"); + return ERR_PTR(-EINVAL); + } + + tfm = crypto_alloc_shash(name, 0, 0); + if (IS_ERR(tfm)) { + pr_err("could not allocate hash %s\n", name); + return ERR_CAST(tfm); + } + + ret = sizeof(struct lrng_hash_info); + lrng_hash = kmalloc(ret, GFP_KERNEL); + if (!lrng_hash) { + crypto_free_shash(tfm); + return ERR_PTR(-ENOMEM); + } + + lrng_hash->tfm = tfm; + + pr_info("Hash %s allocated\n", name); + + return lrng_hash; +} + +static void *lrng_kcapi_hash_name_alloc(void) +{ + return lrng_kcapi_hash_alloc(lrng_kcapi_hash_name()); +} + +static u32 lrng_kcapi_hash_digestsize(void *hash) +{ + struct lrng_hash_info *lrng_hash = (struct lrng_hash_info *)hash; + struct crypto_shash *tfm = lrng_hash->tfm; + + return crypto_shash_digestsize(tfm); +} + +static void lrng_kcapi_hash_dealloc(void *hash) +{ + struct lrng_hash_info *lrng_hash = (struct lrng_hash_info *)hash; + + _lrng_kcapi_hash_free(lrng_hash); + pr_info("Hash deallocated\n"); +} + +static int lrng_kcapi_hash_init(struct shash_desc *shash, void *hash) +{ + struct lrng_hash_info *lrng_hash = (struct lrng_hash_info *)hash; + struct crypto_shash *tfm = lrng_hash->tfm; + + shash->tfm = tfm; + return crypto_shash_init(shash); +} + +static int lrng_kcapi_hash_update(struct shash_desc *shash, const u8 *inbuf, + u32 inbuflen) +{ + return crypto_shash_update(shash, inbuf, inbuflen); +} + +static int lrng_kcapi_hash_final(struct shash_desc *shash, u8 *digest) +{ + return crypto_shash_final(shash, digest); +} + +static void lrng_kcapi_hash_zero(struct shash_desc *shash) +{ + shash_desc_zero(shash); +} + +static const struct lrng_hash_cb lrng_kcapi_hash_cb = { + .hash_name = lrng_kcapi_hash_name, + .hash_alloc = lrng_kcapi_hash_name_alloc, + .hash_dealloc = lrng_kcapi_hash_dealloc, + .hash_digestsize = lrng_kcapi_hash_digestsize, + .hash_init = lrng_kcapi_hash_init, + .hash_update = lrng_kcapi_hash_update, + .hash_final = lrng_kcapi_hash_final, + .hash_desc_zero = lrng_kcapi_hash_zero, +}; + +static int __init lrng_kcapi_init(void) +{ + return lrng_set_hash_cb(&lrng_kcapi_hash_cb); +} + +static void __exit lrng_kcapi_exit(void) +{ + lrng_set_hash_cb(NULL); +} + +late_initcall(lrng_kcapi_init); +module_exit(lrng_kcapi_exit); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_AUTHOR("Stephan Mueller "); +MODULE_DESCRIPTION("Entropy Source and DRNG Manager - Kernel crypto API hash backend"); diff --git a/drivers/char/lrng/lrng_health.c b/drivers/char/lrng/lrng_health.c new file mode 100644 index 000000000000..a60c8378eea5 --- /dev/null +++ b/drivers/char/lrng/lrng_health.c @@ -0,0 +1,420 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * Entropy Source and DRNG Manager (LRNG) Health Testing + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include + +#include "lrng_es_mgr.h" +#include "lrng_health.h" + +/* Stuck Test */ +struct lrng_stuck_test { + u32 last_time; /* Stuck test: time of previous IRQ */ 
+ u32 last_delta; /* Stuck test: delta of previous IRQ */ + u32 last_delta2; /* Stuck test: 2. time derivation of prev IRQ */ +}; + +/* Repetition Count Test */ +struct lrng_rct { + atomic_t rct_count; /* Number of stuck values */ +}; + +/* Adaptive Proportion Test */ +struct lrng_apt { + /* Data window size */ +#define LRNG_APT_WINDOW_SIZE 512 + /* LSB of time stamp to process */ +#define LRNG_APT_LSB 16 +#define LRNG_APT_WORD_MASK (LRNG_APT_LSB - 1) + atomic_t apt_count; /* APT counter */ + atomic_t apt_base; /* APT base reference */ + + atomic_t apt_trigger; + bool apt_base_set; /* Is APT base set? */ +}; + +/* Health data collected for one entropy source */ +struct lrng_health_es_state { + struct lrng_rct rct; + struct lrng_apt apt; + + /* SP800-90B startup health tests */ +#define LRNG_SP80090B_STARTUP_SAMPLES 1024 +#define LRNG_SP80090B_STARTUP_BLOCKS ((LRNG_SP80090B_STARTUP_SAMPLES + \ + LRNG_APT_WINDOW_SIZE - 1) / \ + LRNG_APT_WINDOW_SIZE) + bool sp80090b_startup_done; + atomic_t sp80090b_startup_blocks; +}; + +#define LRNG_HEALTH_ES_INIT(x) \ + x.rct.rct_count = ATOMIC_INIT(0), \ + x.apt.apt_count = ATOMIC_INIT(0), \ + x.apt.apt_base = ATOMIC_INIT(-1), \ + x.apt.apt_trigger = ATOMIC_INIT(LRNG_APT_WINDOW_SIZE), \ + x.apt.apt_base_set = false, \ + x.sp80090b_startup_blocks = ATOMIC_INIT(LRNG_SP80090B_STARTUP_BLOCKS), \ + x.sp80090b_startup_done = false, + +/* The health test code must operate lock-less */ +struct lrng_health { + bool health_test_enabled; + struct lrng_health_es_state es_state[lrng_int_es_last]; +}; + +static struct lrng_health lrng_health = { + .health_test_enabled = true, + +#ifdef CONFIG_LRNG_IRQ + LRNG_HEALTH_ES_INIT(.es_state[lrng_int_es_irq]) +#endif +#ifdef CONFIG_LRNG_SCHED + LRNG_HEALTH_ES_INIT(.es_state[lrng_int_es_sched]) +#endif +}; + +static DEFINE_PER_CPU(struct lrng_stuck_test[lrng_int_es_last], + lrng_stuck_test_array); + +static bool lrng_sp80090b_health_requested(void) +{ + /* Health tests are only requested in FIPS mode */ + return fips_enabled; +} + +static bool lrng_sp80090b_health_enabled(void) +{ + struct lrng_health *health = &lrng_health; + + return lrng_sp80090b_health_requested() && health->health_test_enabled; +} + +/*************************************************************************** + * SP800-90B Compliance + * + * If the Linux-RNG is booted into FIPS mode, the following interfaces + * provide an SP800-90B compliant noise source: + * + * * /dev/random + * * getrandom(2) + * * get_random_bytes_full + * + * All other interfaces, including /dev/urandom or get_random_bytes without + * the add_random_ready_callback cannot claim to use an SP800-90B compliant + * noise source. + ***************************************************************************/ + +/* + * Perform SP800-90B startup testing + */ +static void lrng_sp80090b_startup(struct lrng_health *health, + enum lrng_internal_es es) +{ + struct lrng_health_es_state *es_state = &health->es_state[es]; + + if (!es_state->sp80090b_startup_done && + atomic_dec_and_test(&es_state->sp80090b_startup_blocks)) { + es_state->sp80090b_startup_done = true; + pr_info("SP800-90B startup health tests for internal entropy source %u completed\n", + es); + lrng_drng_force_reseed(); + + /* + * We cannot call lrng_es_add_entropy() as this may cause a + * schedule operation while in scheduler context for the + * scheduler ES. 
+	 */
+	}
+}
+
+/*
+ * Handle failure of SP800-90B startup testing
+ */
+static void lrng_sp80090b_startup_failure(struct lrng_health *health,
+					  enum lrng_internal_es es)
+{
+	struct lrng_health_es_state *es_state = &health->es_state[es];
+
+	/* Reset of LRNG and its entropy - NOTE: we are in atomic context */
+	lrng_reset();
+
+	/*
+	 * Reset the SP800-90B startup test.
+	 *
+	 * NOTE SP800-90B section 4.3 bullet 4 does not specify what
+	 * exactly is to be done in case of failure! Thus, we do what
+	 * makes sense, i.e. restarting the health test and thus gating
+	 * the output function of /dev/random and getrandom(2).
+	 */
+	atomic_set(&es_state->sp80090b_startup_blocks,
+		   LRNG_SP80090B_STARTUP_BLOCKS);
+}
+
+/*
+ * Handle failure of SP800-90B runtime testing
+ */
+static void lrng_sp80090b_runtime_failure(struct lrng_health *health,
+					  enum lrng_internal_es es)
+{
+	struct lrng_health_es_state *es_state = &health->es_state[es];
+
+	lrng_sp80090b_startup_failure(health, es);
+	es_state->sp80090b_startup_done = false;
+}
+
+static void lrng_sp80090b_failure(struct lrng_health *health,
+				  enum lrng_internal_es es)
+{
+	struct lrng_health_es_state *es_state = &health->es_state[es];
+
+	if (es_state->sp80090b_startup_done) {
+		pr_err("SP800-90B runtime health test failure for internal entropy source %u - invalidating all existing entropy and initiating SP800-90B startup\n", es);
+		lrng_sp80090b_runtime_failure(health, es);
+	} else {
+		pr_err("SP800-90B startup test failure for internal entropy source %u - resetting\n", es);
+		lrng_sp80090b_startup_failure(health, es);
+	}
+}
+
+bool lrng_sp80090b_startup_complete_es(enum lrng_internal_es es)
+{
+	struct lrng_health *health = &lrng_health;
+	struct lrng_health_es_state *es_state = &health->es_state[es];
+
+	if (!lrng_sp80090b_health_enabled())
+		return true;
+
+	return es_state->sp80090b_startup_done;
+}
+
+bool lrng_sp80090b_compliant(enum lrng_internal_es es)
+{
+	struct lrng_health *health = &lrng_health;
+	struct lrng_health_es_state *es_state = &health->es_state[es];
+
+	return lrng_sp80090b_health_enabled() &&
+	       es_state->sp80090b_startup_done;
+}
+
+/***************************************************************************
+ * Adaptive Proportion Test
+ *
+ * This test complies with SP800-90B section 4.4.2.
+ ***************************************************************************/
+
+/*
+ * Reset the APT counter
+ *
+ * @apt [in] Reference to APT state
+ */
+static void lrng_apt_reset(struct lrng_apt *apt, unsigned int time_masked)
+{
+	/* Reset APT */
+	atomic_set(&apt->apt_count, 0);
+	atomic_set(&apt->apt_base, time_masked);
+}
+
+static void lrng_apt_restart(struct lrng_apt *apt)
+{
+	atomic_set(&apt->apt_trigger, LRNG_APT_WINDOW_SIZE);
+}
+
+/*
+ * Insert a new entropy event into APT
+ *
+ * This function is void because it does not decide the fate of a time
+ * stamp. An APT failure can only happen at the same time as a stuck test
+ * failure. Thus, the stuck failure will already decide how the time stamp
+ * is handled.
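+ *
+ * For orientation: with LRNG_APT_WINDOW_SIZE == 512, a window that
+ * contains CONFIG_LRNG_APT_CUTOFF or more samples equal to apt_base
+ * is treated as a health test failure.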
+ *
+ * @health [in] Reference to health state
+ * @now_time [in] Time stamp to process
+ * @es [in] Entropy source the time stamp belongs to
+ */
+static void lrng_apt_insert(struct lrng_health *health,
+			    unsigned int now_time, enum lrng_internal_es es)
+{
+	struct lrng_health_es_state *es_state = &health->es_state[es];
+	struct lrng_apt *apt = &es_state->apt;
+
+	if (!lrng_sp80090b_health_requested())
+		return;
+
+	now_time &= LRNG_APT_WORD_MASK;
+
+	/* Initialization of APT */
+	if (!apt->apt_base_set) {
+		atomic_set(&apt->apt_base, now_time);
+		apt->apt_base_set = true;
+		return;
+	}
+
+	if (now_time == (unsigned int)atomic_read(&apt->apt_base)) {
+		u32 apt_val = (u32)atomic_inc_return_relaxed(&apt->apt_count);
+
+		if (apt_val >= CONFIG_LRNG_APT_CUTOFF)
+			lrng_sp80090b_failure(health, es);
+	}
+
+	if (atomic_dec_and_test(&apt->apt_trigger)) {
+		lrng_apt_restart(apt);
+		lrng_apt_reset(apt, now_time);
+		lrng_sp80090b_startup(health, es);
+	}
+}
+
+/***************************************************************************
+ * Repetition Count Test
+ *
+ * The LRNG uses an enhanced version of the Repetition Count Test
+ * (RCT) specified in SP800-90B section 4.4.1. Instead of counting identical
+ * back-to-back values, the input to the RCT is the count of stuck values
+ * detected while filling the entropy pool.
+ *
+ * The RCT is applied with an alpha of 2^-30 compliant with FIPS 140-2 IG 9.8.
+ *
+ * During the counting operation, the LRNG always compares the count with
+ * the RCT cut-off value C. If the count reaches the cut-off value, the
+ * LRNG will invalidate all entropy of the entropy pool, which implies
+ * that no data can be extracted from the entropy pool until new entropy
+ * is received.
+ ***************************************************************************/
+
+/*
+ * Hot code path - Insert data for Repetition Count Test
+ *
+ * @health: Reference to health information
+ * @stuck: Decision of stuck test
+ */
+static void lrng_rct(struct lrng_health *health, enum lrng_internal_es es,
+		     int stuck)
+{
+	struct lrng_health_es_state *es_state = &health->es_state[es];
+	struct lrng_rct *rct = &es_state->rct;
+
+	if (!lrng_sp80090b_health_requested())
+		return;
+
+	if (stuck) {
+		u32 rct_count = atomic_add_return_relaxed(1, &rct->rct_count);
+
+		/*
+		 * The cutoff value is based on the following consideration:
+		 * alpha = 2^-30 as recommended in FIPS 140-2 IG 9.8.
+		 * In addition, we imply an entropy value H of 1 bit as this
+		 * is the minimum entropy required to provide full entropy.
+		 *
+		 * Note, rct_count (which corresponds to value B in the
+		 * pseudo code of SP800-90B section 4.4.1) starts with zero.
+		 * Hence we need to subtract one from the cutoff value as
+		 * calculated following SP800-90B.
+		 */
+		if (rct_count >= CONFIG_LRNG_RCT_CUTOFF) {
+			atomic_set(&rct->rct_count, 0);
+
+			/*
+			 * APT must start anew as we consider all previously
+			 * recorded data to contain no entropy.
+			 */
+			lrng_apt_restart(&es_state->apt);
+
+			lrng_sp80090b_failure(health, es);
+		}
+	} else {
+		atomic_set(&rct->rct_count, 0);
+	}
+}
+
+/***************************************************************************
+ * Stuck Test
+ *
+ * Checking the:
+ *      1st derivative of the event occurrence (time delta)
+ *      2nd derivative of the event occurrence (delta of time deltas)
+ *      3rd derivative of the event occurrence (delta of delta of time deltas)
+ *
+ * All values must always be non-zero. The stuck test is only valid if
+ * high-resolution time stamps are identified after initialization; it is
+ * disabled otherwise.
+ ***************************************************************************/ + +static u32 lrng_delta(u32 prev, u32 next) +{ + /* + * Note that this (unsigned) subtraction does yield the correct value + * in the wraparound-case, i.e. when next < prev. + */ + return (next - prev); +} + +/* + * Hot code path + * + * @health: Reference to health information + * @now: Event time + * @return: 0 event occurrence not stuck (good time stamp) + * != 0 event occurrence stuck (reject time stamp) + */ +static int lrng_irq_stuck(enum lrng_internal_es es, u32 now_time) +{ + struct lrng_stuck_test *stuck = this_cpu_ptr(lrng_stuck_test_array); + u32 delta = lrng_delta(stuck[es].last_time, now_time); + u32 delta2 = lrng_delta(stuck[es].last_delta, delta); + u32 delta3 = lrng_delta(stuck[es].last_delta2, delta2); + + stuck[es].last_time = now_time; + stuck[es].last_delta = delta; + stuck[es].last_delta2 = delta2; + + if (!delta || !delta2 || !delta3) + return 1; + + return 0; +} + +/*************************************************************************** + * Health test interfaces + ***************************************************************************/ + +/* + * Disable all health tests + */ +void lrng_health_disable(void) +{ + struct lrng_health *health = &lrng_health; + + health->health_test_enabled = false; + + if (lrng_sp80090b_health_requested()) + pr_warn("SP800-90B compliance requested but the Linux RNG is NOT SP800-90B compliant\n"); +} + +/* + * Hot code path - Perform health test on time stamp received from an event + * + * @now_time Time stamp + */ +enum lrng_health_res lrng_health_test(u32 now_time, enum lrng_internal_es es) +{ + struct lrng_health *health = &lrng_health; + int stuck; + + if (!health->health_test_enabled) + return lrng_health_pass; + + lrng_apt_insert(health, now_time, es); + + stuck = lrng_irq_stuck(es, now_time); + lrng_rct(health, es, stuck); + if (stuck) { + /* SP800-90B disallows using a failing health test time stamp */ + return lrng_sp80090b_health_requested() ? 
+ lrng_health_fail_drop : lrng_health_fail_use; + } + + return lrng_health_pass; +} diff --git a/drivers/char/lrng/lrng_health.h b/drivers/char/lrng/lrng_health.h new file mode 100644 index 000000000000..4f9f5033fc30 --- /dev/null +++ b/drivers/char/lrng/lrng_health.h @@ -0,0 +1,42 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_HEALTH_H +#define _LRNG_HEALTH_H + +#include "lrng_es_mgr.h" + +enum lrng_health_res { + lrng_health_pass, /* Health test passes on time stamp */ + lrng_health_fail_use, /* Time stamp unhealthy, but mix in */ + lrng_health_fail_drop /* Time stamp unhealthy, drop it */ +}; + +#ifdef CONFIG_LRNG_HEALTH_TESTS +bool lrng_sp80090b_startup_complete_es(enum lrng_internal_es es); +bool lrng_sp80090b_compliant(enum lrng_internal_es es); + +enum lrng_health_res lrng_health_test(u32 now_time, enum lrng_internal_es es); +void lrng_health_disable(void); +#else /* CONFIG_LRNG_HEALTH_TESTS */ +static inline bool lrng_sp80090b_startup_complete_es(enum lrng_internal_es es) +{ + return true; +} + +static inline bool lrng_sp80090b_compliant(enum lrng_internal_es es) +{ + return false; +} + +static inline enum lrng_health_res +lrng_health_test(u32 now_time, enum lrng_internal_es es) +{ + return lrng_health_pass; +} +static inline void lrng_health_disable(void) { } +#endif /* CONFIG_LRNG_HEALTH_TESTS */ + +#endif /* _LRNG_HEALTH_H */ diff --git a/drivers/char/lrng/lrng_interface_aux.c b/drivers/char/lrng/lrng_interface_aux.c new file mode 100644 index 000000000000..a2236ff3b133 --- /dev/null +++ b/drivers/char/lrng/lrng_interface_aux.c @@ -0,0 +1,158 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG auxiliary interfaces + * + * Copyright (C) 2022 Stephan Mueller + * Copyright (C) 2017 Jason A. Donenfeld . All + * Rights Reserved. + * Copyright (C) 2016 Jason Cooper + */ + +#include +#include +#include + +#include "lrng_es_mgr.h" +#include "lrng_interface_random_kernel.h" + +struct batched_entropy { + union { + u64 entropy_u64[LRNG_DRNG_BLOCKSIZE / sizeof(u64)]; + u32 entropy_u32[LRNG_DRNG_BLOCKSIZE / sizeof(u32)]; + }; + unsigned int position; + spinlock_t batch_lock; +}; + +/* + * Get a random word for internal kernel use only. The quality of the random + * number is as good as /dev/urandom, but there is no backtrack protection, + * with the goal of being quite fast and not depleting entropy. 
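+ *
+ * Illustrative use (hypothetical caller, not part of this patch); the
+ * values are served from the per-CPU batch below, which is refilled
+ * from the DRNG once drained:
+ *
+ *	u32 token = get_random_u32();
+ *	u64 nonce = get_random_u64();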
+ */ +static DEFINE_PER_CPU(struct batched_entropy, batched_entropy_u64) = { + .batch_lock = __SPIN_LOCK_UNLOCKED(batched_entropy_u64.lock), +}; + +u64 get_random_u64(void) +{ + u64 ret; + unsigned long flags; + struct batched_entropy *batch; + + lrng_debug_report_seedlevel("get_random_u64"); + + batch = raw_cpu_ptr(&batched_entropy_u64); + spin_lock_irqsave(&batch->batch_lock, flags); + if (batch->position % ARRAY_SIZE(batch->entropy_u64) == 0) { + lrng_get_random_bytes(batch->entropy_u64, LRNG_DRNG_BLOCKSIZE); + batch->position = 0; + } + ret = batch->entropy_u64[batch->position++]; + spin_unlock_irqrestore(&batch->batch_lock, flags); + return ret; +} +EXPORT_SYMBOL(get_random_u64); + +static DEFINE_PER_CPU(struct batched_entropy, batched_entropy_u32) = { + .batch_lock = __SPIN_LOCK_UNLOCKED(batched_entropy_u32.lock), +}; + +u32 get_random_u32(void) +{ + u32 ret; + unsigned long flags; + struct batched_entropy *batch; + + lrng_debug_report_seedlevel("get_random_u32"); + + batch = raw_cpu_ptr(&batched_entropy_u32); + spin_lock_irqsave(&batch->batch_lock, flags); + if (batch->position % ARRAY_SIZE(batch->entropy_u32) == 0) { + lrng_get_random_bytes(batch->entropy_u32, LRNG_DRNG_BLOCKSIZE); + batch->position = 0; + } + ret = batch->entropy_u32[batch->position++]; + spin_unlock_irqrestore(&batch->batch_lock, flags); + return ret; +} +EXPORT_SYMBOL(get_random_u32); + +#ifdef CONFIG_SMP +/* + * This function is called when the CPU is coming up, with entry + * CPUHP_RANDOM_PREPARE, which comes before CPUHP_WORKQUEUE_PREP. + */ +int random_prepare_cpu(unsigned int cpu) +{ + /* + * When the cpu comes back online, immediately invalidate all batches, + * so that we serve fresh randomness. + */ + per_cpu_ptr(&batched_entropy_u32, cpu)->position = 0; + per_cpu_ptr(&batched_entropy_u64, cpu)->position = 0; + return 0; +} + +int random_online_cpu(unsigned int cpu) +{ + return 0; +} +#endif + +/* + * It's important to invalidate all potential batched entropy that might + * be stored before the crng is initialized, which we can do lazily by + * simply resetting the counter to zero so that it's re-extracted on the + * next usage. + */ +void invalidate_batched_entropy(void) +{ + int cpu; + unsigned long flags; + + for_each_possible_cpu(cpu) { + struct batched_entropy *batched_entropy; + + batched_entropy = per_cpu_ptr(&batched_entropy_u32, cpu); + spin_lock_irqsave(&batched_entropy->batch_lock, flags); + batched_entropy->position = 0; + spin_unlock(&batched_entropy->batch_lock); + + batched_entropy = per_cpu_ptr(&batched_entropy_u64, cpu); + spin_lock(&batched_entropy->batch_lock); + batched_entropy->position = 0; + spin_unlock_irqrestore(&batched_entropy->batch_lock, flags); + } +} + +/* + * randomize_page - Generate a random, page aligned address + * @start: The smallest acceptable address the caller will take. + * @range: The size of the area, starting at @start, within which the + * random address must fall. + * + * If @start + @range would overflow, @range is capped. + * + * NOTE: Historical use of randomize_range, which this replaces, presumed that + * @start was already page aligned. We now align it regardless. + * + * Return: A page aligned address within [start, start + range). On error, + * @start is returned. 
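+ *
+ * Usage sketch (hypothetical values): pick a page-aligned address
+ * within a 1 GiB window above base:
+ *
+ *	unsigned long addr = randomize_page(base, SZ_1G);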
+ */ +unsigned long randomize_page(unsigned long start, unsigned long range) +{ + if (!PAGE_ALIGNED(start)) { + range -= PAGE_ALIGN(start) - start; + start = PAGE_ALIGN(start); + } + + if (start > ULONG_MAX - range) + range = ULONG_MAX - start; + + range >>= PAGE_SHIFT; + + if (range == 0) + return start; + + return start + (get_random_long() % range << PAGE_SHIFT); +} diff --git a/drivers/char/lrng/lrng_interface_dev.c b/drivers/char/lrng/lrng_interface_dev.c new file mode 100644 index 000000000000..e60060d402b3 --- /dev/null +++ b/drivers/char/lrng/lrng_interface_dev.c @@ -0,0 +1,35 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG user space device file interface + * + * Copyright (C) 2022, Stephan Mueller + */ + +#include +#include + +#include "lrng_interface_dev_common.h" + +static const struct file_operations lrng_fops = { + .read = lrng_drng_read_block, + .write = lrng_drng_write, + .poll = lrng_random_poll, + .unlocked_ioctl = lrng_ioctl, + .compat_ioctl = compat_ptr_ioctl, + .fasync = lrng_fasync, + .llseek = noop_llseek, +}; + +static struct miscdevice lrng_miscdev = { + .minor = MISC_DYNAMIC_MINOR, + .name = "lrng", + .nodename = "lrng", + .fops = &lrng_fops, + .mode = 0666 +}; + +static int __init lrng_dev_if_mod_init(void) +{ + return misc_register(&lrng_miscdev); +} +device_initcall(lrng_dev_if_mod_init); diff --git a/drivers/char/lrng/lrng_interface_dev_common.c b/drivers/char/lrng/lrng_interface_dev_common.c new file mode 100644 index 000000000000..8e80d5bd9773 --- /dev/null +++ b/drivers/char/lrng/lrng_interface_dev_common.c @@ -0,0 +1,295 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG User and kernel space interfaces + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include + +#include "lrng_drng_mgr.h" +#include "lrng_es_aux.h" +#include "lrng_es_mgr.h" +#include "lrng_interface_dev_common.h" + +DECLARE_WAIT_QUEUE_HEAD(lrng_write_wait); +static struct fasync_struct *fasync; + +static bool lrng_seed_hw = true; /* Allow HW to provide seed */ +static bool lrng_seed_user = true; /* Allow user space to provide seed */ + +/********************************** Helper ***********************************/ + +static u32 lrng_get_aux_ent(void) +{ + return lrng_es[lrng_ext_es_aux]->curr_entropy(0); +} + +/* Is the DRNG seed level too low? */ +bool lrng_need_entropy(void) +{ + return (lrng_get_aux_ent() < lrng_write_wakeup_bits); +} + +void lrng_writer_wakeup(void) +{ + if (lrng_need_entropy() && wq_has_sleeper(&lrng_write_wait)) { + wake_up_interruptible(&lrng_write_wait); + kill_fasync(&fasync, SIGIO, POLL_OUT); + } +} + +void lrng_init_wakeup_dev(void) +{ + kill_fasync(&fasync, SIGIO, POLL_IN); +} + +/* External entropy provider is allowed to provide seed data */ +bool lrng_state_exseed_allow(enum lrng_external_noise_source source) +{ + if (source == lrng_noise_source_hw) + return lrng_seed_hw; + return lrng_seed_user; +} + +/* Enable / disable external entropy provider to furnish seed */ +void lrng_state_exseed_set(enum lrng_external_noise_source source, bool type) +{ + /* + * If the LRNG is not yet operational, allow all entropy sources + * to deliver data unconditionally to get fully seeded asap. 
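+	 *
+	 * E.g. lrng_state_exseed_set(lrng_noise_source_user, false)
+	 * blocks further user space seed data until it is re-enabled,
+	 * for instance via lrng_state_exseed_allow_all().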
+	 */
+	if (!lrng_state_operational())
+		return;
+
+	if (source == lrng_noise_source_hw)
+		lrng_seed_hw = type;
+	else
+		lrng_seed_user = type;
+}
+
+void lrng_state_exseed_allow_all(void)
+{
+	lrng_state_exseed_set(lrng_noise_source_hw, true);
+	lrng_state_exseed_set(lrng_noise_source_user, true);
+}
+
+/************************ LRNG user output interfaces *************************/
+
+ssize_t lrng_read_common(char __user *buf, size_t nbytes)
+{
+	ssize_t ret = 0;
+	u8 tmpbuf[LRNG_DRNG_BLOCKSIZE] __aligned(LRNG_KCAPI_ALIGN);
+	u8 *tmp_large = NULL, *tmp = tmpbuf;
+	u32 tmplen = sizeof(tmpbuf);
+
+	if (nbytes == 0)
+		return 0;
+
+	/*
+	 * Satisfy large read requests -- as the common case is smaller
+	 * request sizes, such as 16 or 32 bytes, avoid a kmalloc overhead for
+	 * those by using the stack variable of tmpbuf.
+	 */
+	if (!CONFIG_BASE_SMALL && (nbytes > sizeof(tmpbuf))) {
+		tmplen = min_t(u32, nbytes, LRNG_DRNG_MAX_REQSIZE);
+		tmp_large = kmalloc(tmplen + LRNG_KCAPI_ALIGN, GFP_KERNEL);
+		if (!tmp_large)
+			tmplen = sizeof(tmpbuf);
+		else
+			tmp = PTR_ALIGN(tmp_large, LRNG_KCAPI_ALIGN);
+	}
+
+	while (nbytes) {
+		u32 todo = min_t(u32, nbytes, tmplen);
+		int rc = 0;
+
+		/* Reschedule if we received a large request. */
+		if ((tmp_large) && need_resched()) {
+			if (signal_pending(current)) {
+				if (ret == 0)
+					ret = -ERESTARTSYS;
+				break;
+			}
+			schedule();
+		}
+
+		rc = lrng_drng_get_sleep(tmp, todo);
+		if (rc <= 0) {
+			if (rc < 0)
+				ret = rc;
+			break;
+		}
+		if (copy_to_user(buf, tmp, rc)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		nbytes -= rc;
+		buf += rc;
+		ret += rc;
+	}
+
+	/* Wipe data just returned from memory */
+	if (tmp_large)
+		kfree_sensitive(tmp_large);
+	else
+		memzero_explicit(tmpbuf, sizeof(tmpbuf));
+
+	return ret;
+}
+
+ssize_t lrng_read_common_block(int nonblock, char __user *buf, size_t nbytes)
+{
+	int ret;
+
+	if (nbytes == 0)
+		return 0;
+
+	ret = lrng_drng_sleep_while_nonoperational(nonblock);
+	if (ret)
+		return ret;
+
+	return lrng_read_common(buf, nbytes);
+}
+
+ssize_t lrng_drng_read_block(struct file *file, char __user *buf, size_t nbytes,
+			     loff_t *ppos)
+{
+	return lrng_read_common_block(file->f_flags & O_NONBLOCK, buf, nbytes);
+}
+
+__poll_t lrng_random_poll(struct file *file, poll_table *wait)
+{
+	__poll_t mask;
+
+	poll_wait(file, &lrng_init_wait, wait);
+	poll_wait(file, &lrng_write_wait, wait);
+	mask = 0;
+	if (lrng_state_operational())
+		mask |= EPOLLIN | EPOLLRDNORM;
+	if (lrng_need_entropy() ||
+	    lrng_state_exseed_allow(lrng_noise_source_user)) {
+		lrng_state_exseed_set(lrng_noise_source_user, false);
+		mask |= EPOLLOUT | EPOLLWRNORM;
+	}
+	return mask;
+}
+
+ssize_t lrng_drng_write_common(const char __user *buffer, size_t count,
+			       u32 entropy_bits)
+{
+	ssize_t ret = 0;
+	u8 buf[64] __aligned(LRNG_KCAPI_ALIGN);
+	const char __user *p = buffer;
+	u32 orig_entropy_bits = entropy_bits;
+
+	if (!lrng_get_available()) {
+		ret = lrng_drng_initalize();
+		/* Bail out if the DRNG cannot be initialized */
+		if (ret)
+			return ret;
+	}
+
+	count = min_t(size_t, count, INT_MAX);
+	while (count > 0) {
+		size_t bytes = min_t(size_t, count, sizeof(buf));
+		u32 ent = min_t(u32, bytes<<3, entropy_bits);
+
+		if (copy_from_user(buf, p, bytes))
+			return -EFAULT;
+		/* Inject data into entropy pool */
+		lrng_pool_insert_aux(buf, bytes, ent);
+
+		count -= bytes;
+		p += bytes;
+		ret += bytes;
+		entropy_bits -= ent;
+
+		cond_resched();
+	}
+
+	/* Force reseed of DRNG during next data request.
*/ + if (!orig_entropy_bits) + lrng_drng_force_reseed(); + + return ret; +} + +ssize_t lrng_drng_write(struct file *file, const char __user *buffer, + size_t count, loff_t *ppos) +{ + return lrng_drng_write_common(buffer, count, 0); +} + +long lrng_ioctl(struct file *f, unsigned int cmd, unsigned long arg) +{ + u32 digestsize_bits; + int size, ent_count_bits; + int __user *p = (int __user *)arg; + + switch (cmd) { + case RNDGETENTCNT: + ent_count_bits = lrng_avail_entropy(); + if (put_user(ent_count_bits, p)) + return -EFAULT; + return 0; + case RNDADDTOENTCNT: + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + if (get_user(ent_count_bits, p)) + return -EFAULT; + ent_count_bits = (int)lrng_get_aux_ent() + ent_count_bits; + if (ent_count_bits < 0) + ent_count_bits = 0; + digestsize_bits = lrng_get_digestsize(); + if (ent_count_bits > digestsize_bits) + ent_count_bits = digestsize_bits; + lrng_pool_set_entropy(ent_count_bits); + return 0; + case RNDADDENTROPY: + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + if (get_user(ent_count_bits, p++)) + return -EFAULT; + if (ent_count_bits < 0) + return -EINVAL; + if (get_user(size, p++)) + return -EFAULT; + if (size < 0) + return -EINVAL; + /* there cannot be more entropy than data */ + ent_count_bits = min(ent_count_bits, size<<3); + return lrng_drng_write_common((const char __user *)p, size, + ent_count_bits); + case RNDZAPENTCNT: + case RNDCLEARPOOL: + /* Clear the entropy pool counter. */ + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + lrng_pool_set_entropy(0); + return 0; + case RNDRESEEDCRNG: + /* + * We leave the capability check here since it is present + * in the upstream's RNG implementation. Yet, user space + * can trigger a reseed as easy as writing into /dev/random + * or /dev/urandom where no privilege is needed. 
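+		 *
+		 * User space can trigger the same reseed roughly as
+		 * follows (sketch, assumes the LRNG serves /dev/random):
+		 *
+		 *	int fd = open("/dev/random", O_WRONLY);
+		 *
+		 *	if (fd >= 0)
+		 *		ioctl(fd, RNDRESEEDCRNG);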
+		 */
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+		/* Force a reseed of all DRNGs */
+		lrng_drng_force_reseed();
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+EXPORT_SYMBOL(lrng_ioctl);
+
+int lrng_fasync(int fd, struct file *filp, int on)
+{
+	return fasync_helper(fd, filp, on, &fasync);
+}
diff --git a/drivers/char/lrng/lrng_interface_dev_common.h b/drivers/char/lrng/lrng_interface_dev_common.h
new file mode 100644
index 000000000000..87cf504647d3
--- /dev/null
+++ b/drivers/char/lrng/lrng_interface_dev_common.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
+/*
+ * Copyright (C) 2022, Stephan Mueller
+ */
+
+#ifndef _LRNG_INTERFACE_DEV_COMMON_H
+#define _LRNG_INTERFACE_DEV_COMMON_H
+
+#include
+#include
+
+/******************* Upstream functions hooked into the LRNG ******************/
+enum lrng_external_noise_source {
+	lrng_noise_source_hw,
+	lrng_noise_source_user
+};
+
+#ifdef CONFIG_LRNG_COMMON_DEV_IF
+void lrng_writer_wakeup(void);
+void lrng_init_wakeup_dev(void);
+void lrng_state_exseed_set(enum lrng_external_noise_source source, bool type);
+void lrng_state_exseed_allow_all(void);
+#else /* CONFIG_LRNG_COMMON_DEV_IF */
+static inline void lrng_writer_wakeup(void) { }
+static inline void lrng_init_wakeup_dev(void) { }
+static inline void
+lrng_state_exseed_set(enum lrng_external_noise_source source, bool type) { }
+static inline void lrng_state_exseed_allow_all(void) { }
+#endif /* CONFIG_LRNG_COMMON_DEV_IF */
+
+/****** Downstream service functions to actual interface implementations ******/
+
+bool lrng_state_exseed_allow(enum lrng_external_noise_source source);
+int lrng_fasync(int fd, struct file *filp, int on);
+long lrng_ioctl(struct file *f, unsigned int cmd, unsigned long arg);
+ssize_t lrng_drng_write(struct file *file, const char __user *buffer,
+			size_t count, loff_t *ppos);
+ssize_t lrng_drng_write_common(const char __user *buffer, size_t count,
+			       u32 entropy_bits);
+__poll_t lrng_random_poll(struct file *file, poll_table *wait);
+ssize_t lrng_read_common_block(int nonblock, char __user *buf, size_t nbytes);
+ssize_t lrng_drng_read_block(struct file *file, char __user *buf, size_t nbytes,
+			     loff_t *ppos);
+ssize_t lrng_read_common(char __user *buf, size_t nbytes);
+bool lrng_need_entropy(void);
+
+extern struct wait_queue_head lrng_write_wait;
+extern struct wait_queue_head lrng_init_wait;
+
+#endif /* _LRNG_INTERFACE_DEV_COMMON_H */
diff --git a/drivers/char/lrng/lrng_interface_hwrand.c b/drivers/char/lrng/lrng_interface_hwrand.c
new file mode 100644
index 000000000000..e841eea13348
--- /dev/null
+++ b/drivers/char/lrng/lrng_interface_hwrand.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+/*
+ * LRNG interface with the HW-Random framework
+ *
+ * Copyright (C) 2022, Stephan Mueller
+ */
+
+#include
+#include
+#include
+
+static int lrng_hwrand_if_random(struct hwrng *rng, void *buf, size_t max,
+				 bool wait)
+{
+	/*
+	 * lrng_get_random_bytes_full not called as we cannot block.
+	 *
+	 * Note: We should either adjust .quality below depending on
+	 * rng_is_initialized() or block here, but neither is supported by
+	 * the hw_rand framework.
+	 */
+	lrng_get_random_bytes(buf, max);
+	return (int)max;
+}
+
+static struct hwrng lrng_hwrand = {
+	.name		= "lrng",
+	.init		= NULL,
+	.cleanup	= NULL,
+	.read		= lrng_hwrand_if_random,
+
+	/*
+	 * We set .quality only in case the LRNG does not provide the common
+	 * interfaces or does not use the legacy RNG as entropy source.
This + * shall avoid that the LRNG automatically spawns the hw_rand + * framework's hwrng kernel thread to feed data into + * add_hwgenerator_randomness. When the LRNG implements the common + * interfaces, this function feeds the data directly into the LRNG. + * If the LRNG uses the legacy RNG as entropy source, + * add_hwgenerator_randomness is implemented by the legacy RNG, but + * still eventually feeds the data into the LRNG. We should avoid such + * circular loops. + * + * We can specify full entropy here, because the LRNG is designed + * to provide full entropy. + */ +#if !defined(CONFIG_LRNG_RANDOM_IF) && \ + !defined(CONFIG_LRNG_KERNEL_RNG) + .quality = 1024, +#endif +}; + +static int __init lrng_hwrand_if_mod_init(void) +{ + return hwrng_register(&lrng_hwrand); +} + +static void __exit lrng_hwrand_if_mod_exit(void) +{ + hwrng_unregister(&lrng_hwrand); +} + +module_init(lrng_hwrand_if_mod_init); +module_exit(lrng_hwrand_if_mod_exit); + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_AUTHOR("Stephan Mueller "); +MODULE_DESCRIPTION("Entropy Source and DRNG Manager HW-Random Interface"); diff --git a/drivers/char/lrng/lrng_interface_kcapi.c b/drivers/char/lrng/lrng_interface_kcapi.c new file mode 100644 index 000000000000..4cb511f8088e --- /dev/null +++ b/drivers/char/lrng/lrng_interface_kcapi.c @@ -0,0 +1,129 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG interface with the RNG framework of the kernel crypto API + * + * Copyright (C) 2022, Stephan Mueller + */ + +#include +#include +#include + +#include "lrng_drng_mgr.h" +#include "lrng_es_aux.h" + +static int lrng_kcapi_if_init(struct crypto_tfm *tfm) +{ + return 0; +} + +static void lrng_kcapi_if_cleanup(struct crypto_tfm *tfm) { } + +static int lrng_kcapi_if_reseed(const u8 *src, unsigned int slen) +{ + int ret; + + if (!slen) + return 0; + + /* Insert caller-provided data without crediting entropy */ + ret = lrng_pool_insert_aux((u8 *)src, slen, 0); + if (ret) + return ret; + + /* Make sure the new data is immediately available to DRNG */ + lrng_drng_force_reseed(); + + return 0; +} + +static int lrng_kcapi_if_random(struct crypto_rng *tfm, + const u8 *src, unsigned int slen, + u8 *rdata, unsigned int dlen) +{ + int ret = lrng_kcapi_if_reseed(src, slen); + + if (!ret) + lrng_get_random_bytes_full(rdata, dlen); + + return ret; +} + +static int lrng_kcapi_if_reset(struct crypto_rng *tfm, + const u8 *seed, unsigned int slen) +{ + return lrng_kcapi_if_reseed(seed, slen); +} + +static struct rng_alg lrng_alg = { + .generate = lrng_kcapi_if_random, + .seed = lrng_kcapi_if_reset, + .seedsize = 0, + .base = { + .cra_name = "stdrng", + .cra_driver_name = "lrng", + .cra_priority = 500, + .cra_ctxsize = 0, + .cra_module = THIS_MODULE, + .cra_init = lrng_kcapi_if_init, + .cra_exit = lrng_kcapi_if_cleanup, + + } +}; + +#ifdef CONFIG_LRNG_DRNG_ATOMIC +static int lrng_kcapi_if_random_atomic(struct crypto_rng *tfm, + const u8 *src, unsigned int slen, + u8 *rdata, unsigned int dlen) +{ + int ret = lrng_kcapi_if_reseed(src, slen); + + if (!ret) + lrng_get_random_bytes(rdata, dlen); + + return ret; +} + +static struct rng_alg lrng_alg_atomic = { + .generate = lrng_kcapi_if_random_atomic, + .seed = lrng_kcapi_if_reset, + .seedsize = 0, + .base = { + .cra_name = "lrng_atomic", + .cra_driver_name = "lrng_atomic", + .cra_priority = 100, + .cra_ctxsize = 0, + .cra_module = THIS_MODULE, + .cra_init = lrng_kcapi_if_init, + .cra_exit = lrng_kcapi_if_cleanup, + + } +}; +#endif /* CONFIG_LRNG_DRNG_ATOMIC */ + +static int __init 
lrng_kcapi_if_mod_init(void) +{ + return +#ifdef CONFIG_LRNG_DRNG_ATOMIC + crypto_register_rng(&lrng_alg_atomic) ?: +#endif + crypto_register_rng(&lrng_alg); +} + +static void __exit lrng_kcapi_if_mod_exit(void) +{ + crypto_unregister_rng(&lrng_alg); +#ifdef CONFIG_LRNG_DRNG_ATOMIC + crypto_unregister_rng(&lrng_alg_atomic); +#endif +} + +module_init(lrng_kcapi_if_mod_init); +module_exit(lrng_kcapi_if_mod_exit); + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_AUTHOR("Stephan Mueller "); +MODULE_DESCRIPTION("Entropy Source and DRNG Manager kernel crypto API RNG framework interface"); +MODULE_ALIAS_CRYPTO("lrng"); +MODULE_ALIAS_CRYPTO("lrng_atomic"); +MODULE_ALIAS_CRYPTO("stdrng"); diff --git a/drivers/char/lrng/lrng_interface_random_kernel.c b/drivers/char/lrng/lrng_interface_random_kernel.c new file mode 100644 index 000000000000..4dae08e06500 --- /dev/null +++ b/drivers/char/lrng/lrng_interface_random_kernel.c @@ -0,0 +1,325 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG Kernel space interfaces API/ABI compliant to linux/random.h + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include + +#include "lrng_es_aux.h" +#include "lrng_es_irq.h" +#include "lrng_es_mgr.h" +#include "lrng_interface_dev_common.h" +#include "lrng_interface_random_kernel.h" + +static RAW_NOTIFIER_HEAD(lrng_ready_chain); +static DEFINE_SPINLOCK(lrng_ready_chain_lock); +static unsigned int lrng_ready_chain_used = 0; + +/********************************** Helper ***********************************/ + +int __init rand_initialize(void) +{ + return lrng_rand_initialize(); +} + +bool lrng_ready_chain_has_sleeper(void) +{ + return !!lrng_ready_chain_used; +} + +/* + * lrng_process_ready_list() - Ping all kernel internal callers waiting until + * the DRNG is completely initialized to inform that the DRNG reached that + * seed level. + * + * When the SP800-90B testing is enabled, the ping only happens if the SP800-90B + * startup health tests are completed. This implies that kernel internal + * callers always have an SP800-90B compliant noise source when being + * pinged. + */ +void lrng_process_ready_list(void) +{ + unsigned long flags; + + if (!lrng_state_operational()) + return; + + spin_lock_irqsave(&lrng_ready_chain_lock, flags); + raw_notifier_call_chain(&lrng_ready_chain, 0, NULL); + spin_unlock_irqrestore(&lrng_ready_chain_lock, flags); +} + +/************************ LRNG kernel input interfaces ************************/ + +/* + * add_hwgenerator_randomness() - Interface for in-kernel drivers of true + * hardware RNGs. + * + * Those devices may produce endless random bits and will be throttled + * when our pool is full. + * + * @buffer: buffer holding the entropic data from HW noise sources to be used to + * insert into entropy pool. + * @count: length of buffer + * @entropy_bits: amount of entropy in buffer (value is in bits) + */ +void add_hwgenerator_randomness(const void *buffer, size_t count, + size_t entropy_bits) +{ + /* + * Suspend writing if we are fully loaded with entropy. + * We'll be woken up again once below lrng_write_wakeup_thresh, + * or when the calling thread is about to terminate. 
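+	 *
+	 * A driver typically feeds this path as follows (sketch,
+	 * my_hw_read() is a hypothetical driver helper; full entropy is
+	 * claimed here only for illustration):
+	 *
+	 *	u8 seed[32];
+	 *
+	 *	my_hw_read(seed, sizeof(seed));
+	 *	add_hwgenerator_randomness(seed, sizeof(seed),
+	 *				   sizeof(seed) * 8);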
+ */ + wait_event_interruptible(lrng_write_wait, + lrng_need_entropy() || + lrng_state_exseed_allow(lrng_noise_source_hw) || + kthread_should_stop()); + lrng_state_exseed_set(lrng_noise_source_hw, false); + lrng_pool_insert_aux(buffer, count, entropy_bits); +} +EXPORT_SYMBOL_GPL(add_hwgenerator_randomness); + +/* + * add_bootloader_randomness() - Handle random seed passed by bootloader. + * + * If the seed is trustworthy, it would be regarded as hardware RNGs. Otherwise + * it would be regarded as device data. + * The decision is controlled by CONFIG_RANDOM_TRUST_BOOTLOADER. + * + * @buf: buffer holding the entropic data from HW noise sources to be used to + * insert into entropy pool. + * @size: length of buffer + */ +void add_bootloader_randomness(const void *buf, size_t size) +{ + lrng_pool_insert_aux(buf, size, + IS_ENABLED(CONFIG_RANDOM_TRUST_BOOTLOADER) ? + size * 8 : 0); +} +EXPORT_SYMBOL_GPL(add_bootloader_randomness); + +/* + * Callback for HID layer -- use the HID event values to stir the entropy pool + */ +void add_input_randomness(unsigned int type, unsigned int code, + unsigned int value) +{ + static unsigned char last_value; + + /* ignore autorepeat and the like */ + if (value == last_value) + return; + + last_value = value; + + lrng_irq_array_add_u32((type << 4) ^ code ^ (code >> 4) ^ value); +} +EXPORT_SYMBOL_GPL(add_input_randomness); + +/* + * add_device_randomness() - Add device- or boot-specific data to the entropy + * pool to help initialize it. + * + * None of this adds any entropy; it is meant to avoid the problem of + * the entropy pool having similar initial state across largely + * identical devices. + * + * @buf: buffer holding the entropic data from HW noise sources to be used to + * insert into entropy pool. + * @size: length of buffer + */ +void add_device_randomness(const void *buf, size_t size) +{ + lrng_pool_insert_aux((u8 *)buf, size, 0); +} +EXPORT_SYMBOL(add_device_randomness); + +#ifdef CONFIG_BLOCK +void rand_initialize_disk(struct gendisk *disk) { } +void add_disk_randomness(struct gendisk *disk) { } +EXPORT_SYMBOL(add_disk_randomness); +#endif + +#ifndef CONFIG_LRNG_IRQ +void add_interrupt_randomness(int irq) { } +EXPORT_SYMBOL(add_interrupt_randomness); +#endif + +/* + * unregister_random_ready_notifier() - Delete a previously registered readiness + * callback function. + * + * @nb: callback definition that was registered initially + */ +int unregister_random_ready_notifier(struct notifier_block *nb) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&lrng_ready_chain_lock, flags); + ret = raw_notifier_chain_unregister(&lrng_ready_chain, nb); + spin_unlock_irqrestore(&lrng_ready_chain_lock, flags); + + if (!ret && lrng_ready_chain_used) + lrng_ready_chain_used--; + + return ret; +} +EXPORT_SYMBOL(unregister_random_ready_notifier); + +/* + * register_random_ready_notifier() - Add a callback function that will be + * invoked when the DRNG is fully initialized and seeded. 
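+ *
+ * Usage sketch (hypothetical callback):
+ *
+ *	static int seeded_cb(struct notifier_block *nb, unsigned long action,
+ *			     void *data)
+ *	{
+ *		pr_info("LRNG fully seeded\n");
+ *		return 0;
+ *	}
+ *
+ *	static struct notifier_block seeded_nb = {
+ *		.notifier_call = seeded_cb,
+ *	};
+ *
+ *	register_random_ready_notifier(&seeded_nb);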
+ * + * @nb: callback definition to be invoked when the LRNG is seeded + * + * Return: + * * 0 if callback is successfully added + * * -EALREADY if pool is already initialised (callback not called) + */ +int register_random_ready_notifier(struct notifier_block *nb) +{ + unsigned long flags; + int err = -EALREADY; + + if (likely(lrng_state_operational())) + return err; + + spin_lock_irqsave(&lrng_ready_chain_lock, flags); + if (!lrng_state_operational()) + err = raw_notifier_chain_register(&lrng_ready_chain, nb); + spin_unlock_irqrestore(&lrng_ready_chain_lock, flags); + + if (!err) + lrng_ready_chain_used++; + + return err; +} +EXPORT_SYMBOL(register_random_ready_notifier); + +#if IS_ENABLED(CONFIG_VMGENID) +static BLOCKING_NOTIFIER_HEAD(lrng_vmfork_chain); + +/* + * Handle a new unique VM ID, which is unique, not secret, so we + * don't credit it, but we do immediately force a reseed after so + * that it's used by the crng posthaste. + */ +void add_vmfork_randomness(const void *unique_vm_id, size_t size) +{ + add_device_randomness(unique_vm_id, size); + if (lrng_state_operational()) + lrng_drng_force_reseed(); + blocking_notifier_call_chain(&lrng_vmfork_chain, 0, NULL); +} +#if IS_MODULE(CONFIG_VMGENID) +EXPORT_SYMBOL_GPL(add_vmfork_randomness); +#endif + +int register_random_vmfork_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_register(&lrng_vmfork_chain, nb); +} +EXPORT_SYMBOL_GPL(register_random_vmfork_notifier); + +int unregister_random_vmfork_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_unregister(&lrng_vmfork_chain, nb); +} +EXPORT_SYMBOL_GPL(unregister_random_vmfork_notifier); +#endif + +/*********************** LRNG kernel output interfaces ************************/ + +/* + * get_random_bytes() - Provider of cryptographic strong random numbers for + * kernel-internal usage. + * + * This function is appropriate for all in-kernel use cases. However, + * it will always use the ChaCha20 DRNG. + * + * @buf: buffer to store the random bytes + * @nbytes: size of the buffer + */ +void get_random_bytes(void *buf, size_t nbytes) +{ + lrng_get_random_bytes(buf, nbytes); +} +EXPORT_SYMBOL(get_random_bytes); + +/* + * wait_for_random_bytes() - Wait for the LRNG to be seeded and thus + * guaranteed to supply cryptographically secure random numbers. + * + * This applies to: the /dev/urandom device, the get_random_bytes function, + * and the get_random_{u32,u64,int,long} family of functions. Using any of + * these functions without first calling this function forfeits the guarantee + * of security. + * + * Return: + * * 0 if the LRNG has been seeded. + * * -ERESTARTSYS if the function was interrupted by a signal. + */ +int wait_for_random_bytes(void) +{ + return lrng_drng_sleep_while_non_min_seeded(); +} +EXPORT_SYMBOL(wait_for_random_bytes); + +/* + * get_random_bytes_arch() - This function will use the architecture-specific + * hardware random number generator if it is available. + * + * The arch-specific hw RNG will almost certainly be faster than what we can + * do in software, but it is impossible to verify that it is implemented + * securely (as opposed, to, say, the AES encryption of a sequence number using + * a key known by the NSA). So it's useful if we need the speed, but only if + * we're willing to trust the hardware manufacturer not to have put in a back + * door. + * + * @buf: buffer allocated by caller to store the random data in + * @nbytes: length of outbuf + * + * Return: number of bytes filled in. 
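+ *
+ * Callers must handle short reads, e.g. (sketch):
+ *
+ *	u8 seed[32];
+ *	size_t got = get_random_bytes_arch(seed, sizeof(seed));
+ *
+ *	if (got < sizeof(seed))
+ *		get_random_bytes(seed + got, sizeof(seed) - got);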
+ */ +size_t __must_check get_random_bytes_arch(void *buf, size_t nbytes) +{ + size_t left = nbytes; + u8 *p = buf; + + while (left) { + unsigned long v; + size_t chunk = min_t(size_t, left, sizeof(unsigned long)); + + if (!arch_get_random_long(&v)) + break; + + memcpy(p, &v, chunk); + p += chunk; + left -= chunk; + } + + return nbytes - left; +} +EXPORT_SYMBOL(get_random_bytes_arch); + +/* + * Returns whether or not the LRNG has been seeded. + * + * Returns: true if the urandom pool has been seeded. + * false if the urandom pool has not been seeded. + */ +bool rng_is_initialized(void) +{ + return lrng_state_operational(); +} +EXPORT_SYMBOL(rng_is_initialized); diff --git a/drivers/char/lrng/lrng_interface_random_kernel.h b/drivers/char/lrng/lrng_interface_random_kernel.h new file mode 100644 index 000000000000..763ec1a633ff --- /dev/null +++ b/drivers/char/lrng/lrng_interface_random_kernel.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_INTERFACE_RANDOM_H +#define _LRNG_INTERFACE_RANDOM_H + +#ifdef CONFIG_LRNG_RANDOM_IF +void lrng_process_ready_list(void); +bool lrng_ready_chain_has_sleeper(void); +void invalidate_batched_entropy(void); +#else /* CONFIG_LRNG_RANDOM_IF */ +static inline bool lrng_ready_chain_has_sleeper(void) { return false; } +static inline void lrng_process_ready_list(void) { } +static inline void invalidate_batched_entropy(void) { } +#endif /* CONFIG_LRNG_RANDOM_IF */ + +#endif /* _LRNG_INTERFACE_RANDOM_H */ diff --git a/drivers/char/lrng/lrng_interface_random_user.c b/drivers/char/lrng/lrng_interface_random_user.c new file mode 100644 index 000000000000..7e45f41b7a4d --- /dev/null +++ b/drivers/char/lrng/lrng_interface_random_user.c @@ -0,0 +1,70 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG Common user space interfaces compliant to random(4), random(7) and + * getrandom(2) man pages. + * + * Copyright (C) 2022, Stephan Mueller + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include + +#include "lrng_es_mgr.h" +#include "lrng_interface_dev_common.h" + +static ssize_t lrng_drng_read(struct file *file, char __user *buf, + size_t nbytes, loff_t *ppos) +{ + if (!lrng_state_min_seeded()) + pr_notice_ratelimited("%s - use of insufficiently seeded DRNG (%zu bytes read)\n", + current->comm, nbytes); + else if (!lrng_state_operational()) + pr_debug_ratelimited("%s - use of not fully seeded DRNG (%zu bytes read)\n", + current->comm, nbytes); + + return lrng_read_common(buf, nbytes); +} + +const struct file_operations random_fops = { + .read = lrng_drng_read_block, + .write = lrng_drng_write, + .poll = lrng_random_poll, + .unlocked_ioctl = lrng_ioctl, + .compat_ioctl = compat_ptr_ioctl, + .fasync = lrng_fasync, + .llseek = noop_llseek, +}; + +const struct file_operations urandom_fops = { + .read = lrng_drng_read, + .write = lrng_drng_write, + .unlocked_ioctl = lrng_ioctl, + .compat_ioctl = compat_ptr_ioctl, + .fasync = lrng_fasync, + .llseek = noop_llseek, +}; + +SYSCALL_DEFINE3(getrandom, char __user *, buf, size_t, count, + unsigned int, flags) +{ + if (flags & ~(GRND_NONBLOCK|GRND_RANDOM|GRND_INSECURE)) + return -EINVAL; + + /* + * Requesting insecure and blocking randomness at the same time makes + * no sense. 
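+	 *
+	 * User space sketch:
+	 *
+	 *	char buf[16];
+	 *	ssize_t n = getrandom(buf, sizeof(buf), 0);
+	 *
+	 * blocks until the DRNG is fully seeded, while GRND_INSECURE
+	 * returns possibly unseeded data immediately.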
+	 */
+	if ((flags &
+	     (GRND_INSECURE|GRND_RANDOM)) == (GRND_INSECURE|GRND_RANDOM))
+		return -EINVAL;
+
+	if (count > INT_MAX)
+		count = INT_MAX;
+
+	if (flags & GRND_INSECURE)
+		return lrng_drng_read(NULL, buf, count, NULL);
+
+	return lrng_read_common_block(flags & GRND_NONBLOCK, buf, count);
+}
diff --git a/drivers/char/lrng/lrng_numa.c b/drivers/char/lrng/lrng_numa.c
new file mode 100644
index 000000000000..d74dd8df2843
--- /dev/null
+++ b/drivers/char/lrng/lrng_numa.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+/*
+ * LRNG NUMA support
+ *
+ * Copyright (C) 2022, Stephan Mueller
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include
+#include
+
+#include "lrng_drng_mgr.h"
+#include "lrng_es_irq.h"
+#include "lrng_es_mgr.h"
+#include "lrng_numa.h"
+#include "lrng_proc.h"
+
+static struct lrng_drng **lrng_drng __read_mostly = NULL;
+
+struct lrng_drng **lrng_drng_instances(void)
+{
+	/* counterpart to cmpxchg_release in _lrng_drngs_numa_alloc */
+	return READ_ONCE(lrng_drng);
+}
+
+/* Allocate the data structures for the per-NUMA node DRNGs */
+static void _lrng_drngs_numa_alloc(struct work_struct *work)
+{
+	struct lrng_drng **drngs;
+	struct lrng_drng *lrng_drng_init = lrng_drng_init_instance();
+	u32 node;
+	bool init_drng_used = false;
+
+	mutex_lock(&lrng_crypto_cb_update);
+
+	/* per-NUMA-node DRNGs are already present */
+	if (lrng_drng)
+		goto unlock;
+
+	/*
+	 * Make sure the initial DRNG is initialized and its drng_cb is set.
+	 * Nothing is allocated yet at this point, so bail out via unlock.
+	 */
+	if (lrng_drng_initalize())
+		goto unlock;
+
+	drngs = kcalloc(nr_node_ids, sizeof(void *), GFP_KERNEL|__GFP_NOFAIL);
+	for_each_online_node(node) {
+		struct lrng_drng *drng;
+
+		if (!init_drng_used) {
+			drngs[node] = lrng_drng_init;
+			init_drng_used = true;
+			continue;
+		}
+
+		drng = kmalloc_node(sizeof(struct lrng_drng),
+				    GFP_KERNEL|__GFP_NOFAIL, node);
+		memset(drng, 0, sizeof(*drng));
+
+		if (lrng_drng_alloc_common(drng, lrng_drng_init->drng_cb)) {
+			kfree(drng);
+			goto err;
+		}
+
+		drng->hash_cb = lrng_drng_init->hash_cb;
+		drng->hash = lrng_drng_init->hash_cb->hash_alloc();
+		if (IS_ERR(drng->hash)) {
+			lrng_drng_init->drng_cb->drng_dealloc(drng->drng);
+			kfree(drng);
+			goto err;
+		}
+
+		mutex_init(&drng->lock);
+		rwlock_init(&drng->hash_lock);
+
+		/*
+		 * No reseeding of NUMA DRNGs from previous DRNGs as this
+		 * would complicate the code. Let it simply reseed.
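+		 *
+		 * The new instance only becomes visible to readers once
+		 * the cmpxchg_release() below publishes the drngs array.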
+ */ + drngs[node] = drng; + + lrng_pool_inc_numa_node(); + pr_info("DRNG and entropy pool read hash for NUMA node %d allocated\n", + node); + } + + /* counterpart to READ_ONCE in lrng_drng_instances */ + if (!cmpxchg_release(&lrng_drng, NULL, drngs)) { + lrng_pool_all_numa_nodes_seeded(false); + goto unlock; + } + +err: + for_each_online_node(node) { + struct lrng_drng *drng = drngs[node]; + + if (drng == lrng_drng_init) + continue; + + if (drng) { + drng->hash_cb->hash_dealloc(drng->hash); + drng->drng_cb->drng_dealloc(drng->drng); + kfree(drng); + } + } + kfree(drngs); + +unlock: + mutex_unlock(&lrng_crypto_cb_update); +} + +static DECLARE_WORK(lrng_drngs_numa_alloc_work, _lrng_drngs_numa_alloc); + +static void lrng_drngs_numa_alloc(void) +{ + schedule_work(&lrng_drngs_numa_alloc_work); +} + +static int __init lrng_numa_init(void) +{ + lrng_drngs_numa_alloc(); + return 0; +} + +late_initcall(lrng_numa_init); diff --git a/drivers/char/lrng/lrng_numa.h b/drivers/char/lrng/lrng_numa.h new file mode 100644 index 000000000000..dc8dff9816ee --- /dev/null +++ b/drivers/char/lrng/lrng_numa.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_NUMA_H +#define _LRNG_NUMA_H + +#ifdef CONFIG_NUMA +struct lrng_drng **lrng_drng_instances(void); +#else /* CONFIG_NUMA */ +static inline struct lrng_drng **lrng_drng_instances(void) { return NULL; } +#endif /* CONFIG_NUMA */ + +#endif /* _LRNG_NUMA_H */ diff --git a/drivers/char/lrng/lrng_proc.c b/drivers/char/lrng/lrng_proc.c new file mode 100644 index 000000000000..6e3e0699a840 --- /dev/null +++ b/drivers/char/lrng/lrng_proc.c @@ -0,0 +1,71 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG proc interfaces + * + * Copyright (C) 2022, Stephan Mueller + */ + +#include +#include +#include +#include + +#include "lrng_drng_mgr.h" +#include "lrng_es_aux.h" +#include "lrng_es_mgr.h" +#include "lrng_proc.h" + +/* Number of online DRNGs */ +static u32 numa_drngs = 1; + +void lrng_pool_inc_numa_node(void) +{ + numa_drngs++; +} + +static int lrng_proc_type_show(struct seq_file *m, void *v) +{ + struct lrng_drng *lrng_drng_init = lrng_drng_init_instance(); + unsigned char buf[250]; + u32 i; + + mutex_lock(&lrng_drng_init->lock); + snprintf(buf, sizeof(buf), + "DRNG name: %s\n" + "LRNG security strength in bits: %d\n" + "Number of DRNG instances: %u\n" + "Standards compliance: %s\n" + "LRNG minimally seeded: %s\n" + "LRNG fully seeded: %s\n", + lrng_drng_init->drng_cb->drng_name(), + lrng_security_strength(), + numa_drngs, + lrng_sp80090c_compliant() ? "SP800-90C " : "", + lrng_state_min_seeded() ? "true" : "false", + lrng_state_fully_seeded() ? 
"true" : "false"); + seq_write(m, buf, strlen(buf)); + + for_each_lrng_es(i) { + snprintf(buf, sizeof(buf), + "Entropy Source %u properties:\n" + " Name: %s\n", + i, lrng_es[i]->name); + seq_write(m, buf, strlen(buf)); + + buf[0] = '\0'; + lrng_es[i]->state(buf, sizeof(buf)); + seq_write(m, buf, strlen(buf)); + } + + mutex_unlock(&lrng_drng_init->lock); + + return 0; +} + +static int __init lrng_proc_type_init(void) +{ + proc_create_single("lrng_type", 0444, NULL, &lrng_proc_type_show); + return 0; +} + +module_init(lrng_proc_type_init); diff --git a/drivers/char/lrng/lrng_proc.h b/drivers/char/lrng/lrng_proc.h new file mode 100644 index 000000000000..c653274f1954 --- /dev/null +++ b/drivers/char/lrng/lrng_proc.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_PROC_H +#define _LRNG_PROC_H + +#ifdef CONFIG_SYSCTL +void lrng_pool_inc_numa_node(void); +#else +static inline void lrng_pool_inc_numa_node(void) { } +#endif + +#endif /* _LRNG_PROC_H */ diff --git a/drivers/char/lrng/lrng_selftest.c b/drivers/char/lrng/lrng_selftest.c new file mode 100644 index 000000000000..15f1e4a2a719 --- /dev/null +++ b/drivers/char/lrng/lrng_selftest.c @@ -0,0 +1,397 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause +/* + * LRNG power-on and on-demand self-test + * + * Copyright (C) 2022, Stephan Mueller + */ + +/* + * In addition to the self-tests below, the following LRNG components + * are covered with self-tests during regular operation: + * + * * power-on self-test: SP800-90A DRBG provided by the Linux kernel crypto API + * * power-on self-test: PRNG provided by the Linux kernel crypto API + * * runtime test: Raw noise source data testing including SP800-90B compliant + * tests when enabling CONFIG_LRNG_HEALTH_TESTS + * + * Additional developer tests present with LRNG code: + * * SP800-90B APT and RCT test enforcement validation when enabling + * CONFIG_LRNG_APT_BROKEN or CONFIG_LRNG_RCT_BROKEN. + * * Collection of raw entropy from the interrupt noise source when enabling + * CONFIG_LRNG_TESTING and pulling the data from the kernel with the provided + * interface. 
+ */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include + +#include "lrng_drng_chacha20.h" +#include "lrng_sha.h" + +#define LRNG_SELFTEST_PASSED 0 +#define LRNG_SEFLTEST_ERROR_TIME (1 << 0) +#define LRNG_SEFLTEST_ERROR_CHACHA20 (1 << 1) +#define LRNG_SEFLTEST_ERROR_HASH (1 << 2) +#define LRNG_SEFLTEST_ERROR_GCD (1 << 3) +#define LRNG_SELFTEST_NOT_EXECUTED 0xffffffff + +#ifdef CONFIG_LRNG_TIMER_COMMON + +#include "lrng_es_timer_common.h" + +static u32 lrng_data_selftest_ptr = 0; +static u32 lrng_data_selftest[LRNG_DATA_ARRAY_SIZE]; + +static void lrng_data_process_selftest_insert(u32 time) +{ + u32 ptr = lrng_data_selftest_ptr++ & LRNG_DATA_WORD_MASK; + unsigned int array = lrng_data_idx2array(ptr); + unsigned int slot = lrng_data_idx2slot(ptr); + + /* zeroization of slot to ensure the following OR adds the data */ + lrng_data_selftest[array] &= + ~(lrng_data_slot_val(0xffffffff & LRNG_DATA_SLOTSIZE_MASK, + slot)); + lrng_data_selftest[array] |= + lrng_data_slot_val(time & LRNG_DATA_SLOTSIZE_MASK, slot); +} + +static void lrng_data_process_selftest_u32(u32 data) +{ + u32 pre_ptr, ptr, mask; + unsigned int pre_array; + + /* Increment pointer by number of slots taken for input value */ + lrng_data_selftest_ptr += LRNG_DATA_SLOTS_PER_UINT; + + /* ptr to current unit */ + ptr = lrng_data_selftest_ptr; + + lrng_data_split_u32(&ptr, &pre_ptr, &mask); + + /* MSB of data go into previous unit */ + pre_array = lrng_data_idx2array(pre_ptr); + /* zeroization of slot to ensure the following OR adds the data */ + lrng_data_selftest[pre_array] &= ~(0xffffffff & ~mask); + lrng_data_selftest[pre_array] |= data & ~mask; + + /* LSB of data go into current unit */ + lrng_data_selftest[lrng_data_idx2array(ptr)] = data & mask; +} + +static unsigned int lrng_data_process_selftest(void) +{ + u32 time; + u32 idx_zero_compare = (0 << 0) | (1 << 8) | (2 << 16) | (3 << 24); + u32 idx_one_compare = (4 << 0) | (5 << 8) | (6 << 16) | (7 << 24); + u32 idx_last_compare = + (((LRNG_DATA_NUM_VALUES - 4) & LRNG_DATA_SLOTSIZE_MASK) << 0) | + (((LRNG_DATA_NUM_VALUES - 3) & LRNG_DATA_SLOTSIZE_MASK) << 8) | + (((LRNG_DATA_NUM_VALUES - 2) & LRNG_DATA_SLOTSIZE_MASK) << 16) | + (((LRNG_DATA_NUM_VALUES - 1) & LRNG_DATA_SLOTSIZE_MASK) << 24); + + (void)idx_one_compare; + + /* "poison" the array to verify the operation of the zeroization */ + lrng_data_selftest[0] = 0xffffffff; + lrng_data_selftest[1] = 0xffffffff; + + lrng_data_process_selftest_insert(0); + /* + * Note, when using lrng_data_process_u32() on unaligned ptr, + * the first slots will go into next word, and the last slots go + * into the previous word. 
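+	 *
+	 * E.g. processing (4 << 0) | (1 << 8) | (2 << 16) | (3 << 24) at
+	 * an offset of one 8-bit slot leaves the values 1, 2 and 3 in
+	 * slots 1-3 of word 0 and carries the 4 into slot 0 of word 1,
+	 * which the comparisons against idx_zero_compare and
+	 * idx_one_compare below verify.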
+ */ + lrng_data_process_selftest_u32((4 << 0) | (1 << 8) | (2 << 16) | + (3 << 24)); + lrng_data_process_selftest_insert(5); + lrng_data_process_selftest_insert(6); + lrng_data_process_selftest_insert(7); + + if ((lrng_data_selftest[0] != idx_zero_compare) || + (lrng_data_selftest[1] != idx_one_compare)) + goto err; + + /* Reset for next test */ + lrng_data_selftest[0] = 0; + lrng_data_selftest[1] = 0; + lrng_data_selftest_ptr = 0; + + for (time = 0; time < LRNG_DATA_NUM_VALUES; time++) + lrng_data_process_selftest_insert(time); + + if ((lrng_data_selftest[0] != idx_zero_compare) || + (lrng_data_selftest[1] != idx_one_compare) || + (lrng_data_selftest[LRNG_DATA_ARRAY_SIZE - 1] != idx_last_compare)) + goto err; + + return LRNG_SELFTEST_PASSED; + +err: + pr_err("LRNG data array self-test FAILED\n"); + return LRNG_SEFLTEST_ERROR_TIME; +} + +static unsigned int lrng_gcd_selftest(void) +{ + u32 history[10]; + unsigned int i; + +#define LRNG_GCD_SELFTEST 3 + for (i = 0; i < ARRAY_SIZE(history); i++) + history[i] = i * LRNG_GCD_SELFTEST; + + if (lrng_gcd_analyze(history, ARRAY_SIZE(history)) == LRNG_GCD_SELFTEST) + return LRNG_SELFTEST_PASSED; + + pr_err("LRNG GCD self-test FAILED\n"); + return LRNG_SEFLTEST_ERROR_GCD; +} + +#else /* CONFIG_LRNG_TIMER_COMMON */ + +static unsigned int lrng_data_process_selftest(void) +{ + return LRNG_SELFTEST_PASSED; +} + +static unsigned int lrng_gcd_selftest(void) +{ + return LRNG_SELFTEST_PASSED; +} + +#endif /* CONFIG_LRNG_TIMER_COMMON */ + +/* The test vectors are taken from crypto/testmgr.h */ +static unsigned int lrng_hash_selftest(void) +{ + SHASH_DESC_ON_STACK(shash, NULL); + const struct lrng_hash_cb *hash_cb = &lrng_sha_hash_cb; + static const u8 lrng_hash_selftest_result[] = +#ifdef CONFIG_CRYPTO_LIB_SHA256 + { 0xba, 0x78, 0x16, 0xbf, 0x8f, 0x01, 0xcf, 0xea, + 0x41, 0x41, 0x40, 0xde, 0x5d, 0xae, 0x22, 0x23, + 0xb0, 0x03, 0x61, 0xa3, 0x96, 0x17, 0x7a, 0x9c, + 0xb4, 0x10, 0xff, 0x61, 0xf2, 0x00, 0x15, 0xad }; +#else /* CONFIG_CRYPTO_LIB_SHA256 */ + { 0xa9, 0x99, 0x3e, 0x36, 0x47, 0x06, 0x81, 0x6a, 0xba, 0x3e, + 0x25, 0x71, 0x78, 0x50, 0xc2, 0x6c, 0x9c, 0xd0, 0xd8, 0x9d }; +#endif /* CONFIG_CRYPTO_LIB_SHA256 */ + static const u8 hash_input[] = { 0x61, 0x62, 0x63 }; /* "abc" */ + u8 digest[sizeof(lrng_hash_selftest_result)] __aligned(sizeof(u32)); + + if (sizeof(digest) != hash_cb->hash_digestsize(NULL)) + return LRNG_SEFLTEST_ERROR_HASH; + + if (!hash_cb->hash_init(shash, NULL) && + !hash_cb->hash_update(shash, hash_input, + sizeof(hash_input)) && + !hash_cb->hash_final(shash, digest) && + !memcmp(digest, lrng_hash_selftest_result, sizeof(digest))) + return 0; + + pr_err("LRNG %s Hash self-test FAILED\n", hash_cb->hash_name()); + return LRNG_SEFLTEST_ERROR_HASH; +} + +#ifdef CONFIG_LRNG_DRNG_CHACHA20 + +static void lrng_selftest_bswap32(u32 *ptr, u32 words) +{ + u32 i; + + /* Byte-swap data which is an LE representation */ + for (i = 0; i < words; i++) { + __le32 *p = (__le32 *)ptr; + + *p = cpu_to_le32(*ptr); + ptr++; + } +} + +/* + * The test vectors were generated using the ChaCha20 DRNG from + * https://www.chronox.de/chacha20.html + */ +static unsigned int lrng_chacha20_drng_selftest(void) +{ + const struct lrng_drng_cb *drng_cb = &lrng_cc20_drng_cb; + u8 seed[CHACHA_KEY_SIZE * 2] = { + 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, + 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, + 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, + 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f, + 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, + 0x28, 0x29, 0x2a, 
0x2b, 0x2c, 0x2d, 0x2e, 0x2f, + 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, + 0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f, + }; + struct chacha20_block chacha20; + int ret; + u8 outbuf[CHACHA_KEY_SIZE * 2] __aligned(sizeof(u32)); + + /* + * Expected result when ChaCha20 DRNG state is zero: + * * constants are set to "expand 32-byte k" + * * remaining state is 0 + * and pulling one half ChaCha20 DRNG block. + */ + static const u8 expected_halfblock[CHACHA_KEY_SIZE] = { + 0x76, 0xb8, 0xe0, 0xad, 0xa0, 0xf1, 0x3d, 0x90, + 0x40, 0x5d, 0x6a, 0xe5, 0x53, 0x86, 0xbd, 0x28, + 0xbd, 0xd2, 0x19, 0xb8, 0xa0, 0x8d, 0xed, 0x1a, + 0xa8, 0x36, 0xef, 0xcc, 0x8b, 0x77, 0x0d, 0xc7 }; + + /* + * Expected result when ChaCha20 DRNG state is zero: + * * constants are set to "expand 32-byte k" + * * remaining state is 0 + * followed by a reseed with two keyblocks + * 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, + * 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, + * 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, + * 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f, + * 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, + * 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f, + * 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, + * 0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f + * and pulling one ChaCha20 DRNG block. + */ + static const u8 expected_oneblock[CHACHA_KEY_SIZE * 2] = { + 0xe3, 0xb0, 0x8a, 0xcc, 0x34, 0xc3, 0x17, 0x0e, + 0xc3, 0xd8, 0xc3, 0x40, 0xe7, 0x73, 0xe9, 0x0d, + 0xd1, 0x62, 0xa3, 0x5d, 0x7d, 0xf2, 0xf1, 0x4a, + 0x24, 0x42, 0xb7, 0x1e, 0xb0, 0x05, 0x17, 0x07, + 0xb9, 0x35, 0x10, 0x69, 0x8b, 0x46, 0xfb, 0x51, + 0xe9, 0x91, 0x3f, 0x46, 0xf2, 0x4d, 0xea, 0xd0, + 0x81, 0xc1, 0x1b, 0xa9, 0x5d, 0x52, 0x91, 0x5f, + 0xcd, 0xdc, 0xc6, 0xd6, 0xc3, 0x7c, 0x50, 0x23 }; + + /* + * Expected result when ChaCha20 DRNG state is zero: + * * constants are set to "expand 32-byte k" + * * remaining state is 0 + * followed by a reseed with one key block plus one byte + * 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, + * 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, + * 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, + * 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f, + * 0x20 + * and pulling less than one ChaCha20 DRNG block. 
+	 */
+	static const u8 expected_block_nonaligned[CHACHA_KEY_SIZE + 4] = {
+		0x9c, 0xfc, 0x5e, 0x31, 0x21, 0x62, 0x11, 0x85,
+		0xd3, 0x77, 0xd3, 0x69, 0x0f, 0xa8, 0x16, 0x55,
+		0xb4, 0x4c, 0xf6, 0x52, 0xf3, 0xa8, 0x37, 0x99,
+		0x38, 0x76, 0xa0, 0x66, 0xec, 0xbb, 0xce, 0xa9,
+		0x9c, 0x95, 0xa1, 0xfd };
+
+	BUILD_BUG_ON(sizeof(seed) % sizeof(u32));
+
+	memset(&chacha20, 0, sizeof(chacha20));
+	lrng_cc20_init_rfc7539(&chacha20);
+	lrng_selftest_bswap32((u32 *)seed, sizeof(seed) / sizeof(u32));
+
+	/* Generate with zero state */
+	ret = drng_cb->drng_generate(&chacha20, outbuf,
+				     sizeof(expected_halfblock));
+	if (ret != sizeof(expected_halfblock))
+		goto err;
+	if (memcmp(outbuf, expected_halfblock, sizeof(expected_halfblock)))
+		goto err;
+
+	/* Clear state of DRNG */
+	memset(&chacha20.key.u[0], 0, 48);
+
+	/* Reseed with 2 key blocks */
+	ret = drng_cb->drng_seed(&chacha20, seed, sizeof(expected_oneblock));
+	if (ret < 0)
+		goto err;
+	ret = drng_cb->drng_generate(&chacha20, outbuf,
+				     sizeof(expected_oneblock));
+	if (ret != sizeof(expected_oneblock))
+		goto err;
+	if (memcmp(outbuf, expected_oneblock, sizeof(expected_oneblock)))
+		goto err;
+
+	/* Clear state of DRNG */
+	memset(&chacha20.key.u[0], 0, 48);
+
+	/* Reseed with 1 key block and one byte */
+	ret = drng_cb->drng_seed(&chacha20, seed,
+				 sizeof(expected_block_nonaligned));
+	if (ret < 0)
+		goto err;
+	ret = drng_cb->drng_generate(&chacha20, outbuf,
+				     sizeof(expected_block_nonaligned));
+	if (ret != sizeof(expected_block_nonaligned))
+		goto err;
+	if (memcmp(outbuf, expected_block_nonaligned,
+		   sizeof(expected_block_nonaligned)))
+		goto err;
+
+	return LRNG_SELFTEST_PASSED;
+
+err:
+	pr_err("LRNG ChaCha20 DRNG self-test FAILED\n");
+	return LRNG_SEFLTEST_ERROR_CHACHA20;
+}
+
+#else /* CONFIG_LRNG_DRNG_CHACHA20 */
+
+static unsigned int lrng_chacha20_drng_selftest(void)
+{
+	return LRNG_SELFTEST_PASSED;
+}
+
+#endif /* CONFIG_LRNG_DRNG_CHACHA20 */
+
+static unsigned int lrng_selftest_status = LRNG_SELFTEST_NOT_EXECUTED;
+
+static int lrng_selftest(void)
+{
+	unsigned int ret = lrng_data_process_selftest();
+
+	ret |= lrng_chacha20_drng_selftest();
+	ret |= lrng_hash_selftest();
+	ret |= lrng_gcd_selftest();
+
+	if (ret) {
+		if (IS_ENABLED(CONFIG_LRNG_SELFTEST_PANIC))
+			panic("LRNG self-tests failed: %u\n", ret);
+	} else {
+		pr_info("LRNG self-tests passed\n");
+	}
+
+	lrng_selftest_status = ret;
+
+	if (lrng_selftest_status)
+		return -EFAULT;
+	return 0;
+}
+
+#ifdef CONFIG_SYSFS
+/* Re-perform self-test when any value is written to the sysfs file. */
+static int lrng_selftest_sysfs_set(const char *val,
+				   const struct kernel_param *kp)
+{
+	return lrng_selftest();
+}
+
+static const struct kernel_param_ops lrng_selftest_sysfs = {
+	.set = lrng_selftest_sysfs_set,
+	.get = param_get_uint,
+};
+module_param_cb(selftest_status, &lrng_selftest_sysfs, &lrng_selftest_status,
+		0644);
+#endif /* CONFIG_SYSFS */
+
+static int __init lrng_selftest_init(void)
+{
+	return lrng_selftest();
+}
+
+module_init(lrng_selftest_init);
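The individual self-tests above are ORed into a single status word, so several failures can be reported at once through the `selftest_status` module parameter. A minimal user-space sketch of decoding that word follows; the LRNG_SEFLTEST_ERROR_* constants are defined earlier in lrng_selftest.c and the bit positions used below are illustrative assumptions, not the driver's actual values:

	/*
	 * Illustrative decoder for the combined self-test status word.
	 * The bit assignments are assumptions for the sake of the example;
	 * the real values come from the constants in lrng_selftest.c.
	 */
	#include <stdio.h>

	#define ERR_TIME     (1U << 0)	/* assumed: data array self-test */
	#define ERR_CHACHA20 (1U << 1)	/* assumed: ChaCha20 DRNG self-test */
	#define ERR_HASH     (1U << 2)	/* assumed: hash self-test */
	#define ERR_GCD      (1U << 3)	/* assumed: GCD self-test */

	static void decode_status(unsigned int status)
	{
		if (!status) {
			printf("all LRNG self-tests passed\n");
			return;
		}
		if (status & ERR_TIME)
			printf("data array self-test failed\n");
		if (status & ERR_CHACHA20)
			printf("ChaCha20 DRNG self-test failed\n");
		if (status & ERR_HASH)
			printf("hash self-test failed\n");
		if (status & ERR_GCD)
			printf("GCD self-test failed\n");
	}

Because the sysfs handler re-runs lrng_selftest() on any write, a failed status can be re-checked after writing an arbitrary value to the parameter file.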
diff --git a/drivers/char/lrng/lrng_sha.h b/drivers/char/lrng/lrng_sha.h
new file mode 100644
index 000000000000..d2f134f54773
--- /dev/null
+++ b/drivers/char/lrng/lrng_sha.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
+/*
+ * LRNG SHA definition usable in atomic contexts right from the start of the
+ * kernel.
+ *
+ * Copyright (C) 2022, Stephan Mueller
+ */
+
+#ifndef _LRNG_SHA_H
+#define _LRNG_SHA_H
+
+extern const struct lrng_hash_cb lrng_sha_hash_cb;
+
+#endif /* _LRNG_SHA_H */
diff --git a/drivers/char/lrng/lrng_sha1.c b/drivers/char/lrng/lrng_sha1.c
new file mode 100644
index 000000000000..9cbc7a6fee49
--- /dev/null
+++ b/drivers/char/lrng/lrng_sha1.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+/*
+ * Backend for the LRNG providing the SHA-1 implementation that can be used
+ * without the kernel crypto API available including during early boot and in
+ * atomic contexts.
+ *
+ * Copyright (C) 2022, Stephan Mueller
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <crypto/sha1.h>
+#include <crypto/sha1_base.h>
+#include <linux/lrng.h>
+
+#include "lrng_sha.h"
+
+/*
+ * If the SHA-256 support is not compiled, we fall back to SHA-1, which is
+ * always compiled and present in the kernel.
+ */
+static u32 lrng_sha1_hash_digestsize(void *hash)
+{
+	return SHA1_DIGEST_SIZE;
+}
+
+static void lrng_sha1_block_fn(struct sha1_state *sctx, const u8 *src,
+			       int blocks)
+{
+	u32 temp[SHA1_WORKSPACE_WORDS];
+
+	while (blocks--) {
+		sha1_transform(sctx->state, src, temp);
+		src += SHA1_BLOCK_SIZE;
+	}
+	memzero_explicit(temp, sizeof(temp));
+}
+
+static int lrng_sha1_hash_init(struct shash_desc *shash, void *hash)
+{
+	/*
+	 * We do not need a TFM - we only need sufficient space for
+	 * struct sha1_state on the stack.
+	 */
+	sha1_base_init(shash);
+	return 0;
+}
+
+static int lrng_sha1_hash_update(struct shash_desc *shash,
+				 const u8 *inbuf, u32 inbuflen)
+{
+	return sha1_base_do_update(shash, inbuf, inbuflen, lrng_sha1_block_fn);
+}
+
+static int lrng_sha1_hash_final(struct shash_desc *shash, u8 *digest)
+{
+	return sha1_base_do_finalize(shash, lrng_sha1_block_fn) ?:
+	       sha1_base_finish(shash, digest);
+}
+
+static const char *lrng_sha1_hash_name(void)
+{
+	return "SHA-1";
+}
+
+static void lrng_sha1_hash_desc_zero(struct shash_desc *shash)
+{
+	memzero_explicit(shash_desc_ctx(shash), sizeof(struct sha1_state));
+}
+
+static void *lrng_sha1_hash_alloc(void)
+{
+	pr_info("Hash %s allocated\n", lrng_sha1_hash_name());
+	return NULL;
+}
+
+static void lrng_sha1_hash_dealloc(void *hash) { }
+
+const struct lrng_hash_cb lrng_sha_hash_cb = {
+	.hash_name = lrng_sha1_hash_name,
+	.hash_alloc = lrng_sha1_hash_alloc,
+	.hash_dealloc = lrng_sha1_hash_dealloc,
+	.hash_digestsize = lrng_sha1_hash_digestsize,
+	.hash_init = lrng_sha1_hash_init,
+	.hash_update = lrng_sha1_hash_update,
+	.hash_final = lrng_sha1_hash_final,
+	.hash_desc_zero = lrng_sha1_hash_desc_zero,
+};
diff --git a/drivers/char/lrng/lrng_sha256.c b/drivers/char/lrng/lrng_sha256.c
new file mode 100644
index 000000000000..50705351a71c
--- /dev/null
+++ b/drivers/char/lrng/lrng_sha256.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+/*
+ * Backend for the LRNG providing the SHA-256 implementation that can be used
+ * without the kernel crypto API available including during early boot and in
+ * atomic contexts.
+ *
+ * Copyright (C) 2022, Stephan Mueller
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <crypto/sha2.h>
+#include <linux/lrng.h>
+
+#include "lrng_sha.h"
+
+static u32 lrng_sha256_hash_digestsize(void *hash)
+{
+	return SHA256_DIGEST_SIZE;
+}
+
+static int lrng_sha256_hash_init(struct shash_desc *shash, void *hash)
+{
+	/*
+	 * We do not need a TFM - we only need sufficient space for
+	 * struct sha256_state on the stack.
+	 */
+	sha256_init(shash_desc_ctx(shash));
+	return 0;
+}
+
+static int lrng_sha256_hash_update(struct shash_desc *shash,
+				   const u8 *inbuf, u32 inbuflen)
+{
+	sha256_update(shash_desc_ctx(shash), inbuf, inbuflen);
+	return 0;
+}
+
+static int lrng_sha256_hash_final(struct shash_desc *shash, u8 *digest)
+{
+	sha256_final(shash_desc_ctx(shash), digest);
+	return 0;
+}
+
+static const char *lrng_sha256_hash_name(void)
+{
+	return "SHA-256";
+}
+
+static void lrng_sha256_hash_desc_zero(struct shash_desc *shash)
+{
+	memzero_explicit(shash_desc_ctx(shash), sizeof(struct sha256_state));
+}
+
+static void *lrng_sha256_hash_alloc(void)
+{
+	pr_info("Hash %s allocated\n", lrng_sha256_hash_name());
+	return NULL;
+}
+
+static void lrng_sha256_hash_dealloc(void *hash) { }
+
+const struct lrng_hash_cb lrng_sha_hash_cb = {
+	.hash_name = lrng_sha256_hash_name,
+	.hash_alloc = lrng_sha256_hash_alloc,
+	.hash_dealloc = lrng_sha256_hash_dealloc,
+	.hash_digestsize = lrng_sha256_hash_digestsize,
+	.hash_init = lrng_sha256_hash_init,
+	.hash_update = lrng_sha256_hash_update,
+	.hash_final = lrng_sha256_hash_final,
+	.hash_desc_zero = lrng_sha256_hash_desc_zero,
+};
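Both backends above export the identical symbol `lrng_sha_hash_cb`, so exactly one of lrng_sha1.c and lrng_sha256.c is built into a given kernel; the SHA-1 file is the fallback when CONFIG_CRYPTO_LIB_SHA256 is absent. Consumers never reference a backend directly but drive the callback table, as in the following kernel-context sketch; example_hash_buf() is a hypothetical helper, while the callback usage mirrors lrng_hash_selftest() earlier in this series:

	/*
	 * Sketch: hashing a buffer through the backend-agnostic callback
	 * table. example_hash_buf() is hypothetical; the ?: chaining and
	 * SHASH_DESC_ON_STACK(shash, NULL) usage follow the self-test code.
	 */
	static int example_hash_buf(const u8 *inbuf, u32 inbuflen, u8 *digest)
	{
		SHASH_DESC_ON_STACK(shash, NULL);
		const struct lrng_hash_cb *cb = &lrng_sha_hash_cb;
		int ret;

		if (cb->hash_digestsize(NULL) > LRNG_MAX_DIGESTSIZE)
			return -EINVAL;

		ret = cb->hash_init(shash, NULL) ?:
		      cb->hash_update(shash, inbuf, inbuflen) ?:
		      cb->hash_final(shash, digest);

		/* Wipe the hash state regardless of the outcome */
		cb->hash_desc_zero(shash);
		return ret;
	}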
diff --git a/drivers/char/lrng/lrng_switch.c b/drivers/char/lrng/lrng_switch.c
new file mode 100644
index 000000000000..5ffc0286a4a4
--- /dev/null
+++ b/drivers/char/lrng/lrng_switch.c
@@ -0,0 +1,271 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+/*
+ * LRNG DRNG switching support
+ *
+ * Copyright (C) 2022, Stephan Mueller
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/lrng.h>
+
+#include "lrng_es_aux.h"
+#include "lrng_es_mgr.h"
+#include "lrng_numa.h"
+
+static int __maybe_unused
+lrng_hash_switch(struct lrng_drng *drng_store, const void *cb, int node)
+{
+	const struct lrng_hash_cb *new_cb = (const struct lrng_hash_cb *)cb;
+	const struct lrng_hash_cb *old_cb = drng_store->hash_cb;
+	unsigned long flags;
+	u32 i;
+	void *new_hash = new_cb->hash_alloc();
+	void *old_hash = drng_store->hash;
+	int ret;
+
+	if (IS_ERR(new_hash)) {
+		pr_warn("could not allocate new LRNG pool hash (%ld)\n",
+			PTR_ERR(new_hash));
+		return PTR_ERR(new_hash);
+	}
+
+	if (new_cb->hash_digestsize(new_hash) > LRNG_MAX_DIGESTSIZE) {
+		pr_warn("digest size of newly requested hash too large\n");
+		new_cb->hash_dealloc(new_hash);
+		return -EINVAL;
+	}
+
+	write_lock_irqsave(&drng_store->hash_lock, flags);
+
+	/* Trigger the switch for each entropy source */
+	for_each_lrng_es(i) {
+		if (!lrng_es[i]->switch_hash)
+			continue;
+		ret = lrng_es[i]->switch_hash(drng_store, node, new_cb,
+					      new_hash, old_cb);
+		if (ret) {
+			u32 j;
+
+			/* Revert all already executed operations */
+			for (j = 0; j < i; j++) {
+				if (!lrng_es[j]->switch_hash)
+					continue;
+				WARN_ON(lrng_es[j]->switch_hash(drng_store,
+								node, old_cb,
+								old_hash,
+								new_cb));
+			}
+			goto err;
+		}
+	}
+
+	drng_store->hash = new_hash;
+	drng_store->hash_cb = new_cb;
+	old_cb->hash_dealloc(old_hash);
+	pr_info("Conditioning function allocated for DRNG for NUMA node %d\n",
+		node);
+
+err:
+	write_unlock_irqrestore(&drng_store->hash_lock, flags);
+	return ret;
+}
+
+static int __maybe_unused
+lrng_drng_switch(struct lrng_drng *drng_store, const void *cb, int node)
+{
+	const struct lrng_drng_cb *new_cb = (const struct lrng_drng_cb *)cb;
+	const struct lrng_drng_cb *old_cb = drng_store->drng_cb;
+	int ret;
+	u8 seed[LRNG_DRNG_SECURITY_STRENGTH_BYTES];
+	void *new_drng = new_cb->drng_alloc(LRNG_DRNG_SECURITY_STRENGTH_BYTES);
+	void *old_drng = drng_store->drng;
+	u32 current_security_strength;
+	bool reset_drng = !lrng_get_available();
+
+	if (IS_ERR(new_drng)) {
+		pr_warn("could not allocate new DRNG for NUMA node %d (%ld)\n",
+			node, PTR_ERR(new_drng));
+		return PTR_ERR(new_drng);
+	}
+
+	current_security_strength = lrng_security_strength();
+	mutex_lock(&drng_store->lock);
+
+	/*
+	 * Pull from existing DRNG to seed new DRNG regardless of seed status
+	 * of old DRNG -- the entropy state for the DRNG is left unchanged,
+	 * which implies that the new DRNG is also reseeded when deemed
+	 * necessary. This seeding of the new DRNG shall only ensure that the
+	 * new DRNG has the same entropy as the old DRNG.
+	 */
+	ret = old_cb->drng_generate(old_drng, seed, sizeof(seed));
+	mutex_unlock(&drng_store->lock);
+
+	if (ret < 0) {
+		reset_drng = true;
+		pr_warn("getting random data from DRNG failed for NUMA node %d (%d)\n",
+			node, ret);
+	} else {
+		/* seed new DRNG with data */
+		ret = new_cb->drng_seed(new_drng, seed, ret);
+		memzero_explicit(seed, sizeof(seed));
+		if (ret < 0) {
+			reset_drng = true;
+			pr_warn("seeding of new DRNG failed for NUMA node %d (%d)\n",
+				node, ret);
+		} else {
+			pr_debug("seeded new DRNG of NUMA node %d instance from old DRNG instance\n",
+				 node);
+		}
+	}
+
+	mutex_lock(&drng_store->lock);
+
+	if (reset_drng)
+		lrng_drng_reset(drng_store);
+
+	drng_store->drng = new_drng;
+	drng_store->drng_cb = new_cb;
+
+	/* Reseed if previous LRNG security strength was insufficient */
+	if (current_security_strength < lrng_security_strength())
+		drng_store->force_reseed = true;
+
+	/* Force oversampling seeding as we initialize DRNG */
+	if (IS_ENABLED(CONFIG_CRYPTO_FIPS))
+		lrng_unset_fully_seeded(drng_store);
+
+	if (lrng_state_min_seeded())
+		lrng_set_entropy_thresh(lrng_get_seed_entropy_osr(
+						drng_store->fully_seeded));
+
+	old_cb->drng_dealloc(old_drng);
+
+	pr_info("DRNG of NUMA node %d switched\n", node);
+
+	mutex_unlock(&drng_store->lock);
+	return ret;
+}
+
+/*
+ * Switch the existing DRNG and hash instances with new ones using the new
+ * crypto callbacks. The caller must hold the lrng_crypto_cb_update lock.
+ */
+static int lrng_switch(const void *cb,
+		       int (*switcher)(struct lrng_drng *drng_store,
+				       const void *cb, int node))
+{
+	struct lrng_drng **lrng_drng = lrng_drng_instances();
+	struct lrng_drng *lrng_drng_init = lrng_drng_init_instance();
+	int ret = 0;
+
+	if (lrng_drng) {
+		u32 node;
+
+		for_each_online_node(node) {
+			if (lrng_drng[node])
+				ret = switcher(lrng_drng[node], cb, node);
+		}
+	} else {
+		ret = switcher(lrng_drng_init, cb, 0);
+	}
+
+	return ret;
+}
+
+/*
+ * lrng_set_drng_cb - Register new cryptographic callback functions for DRNG
+ * The registering implies that all old DRNG states are replaced with new
+ * DRNG states.
+ *
+ * @drng_cb: Callback functions to be registered -- if NULL, use the default
+ *	     callbacks defined at compile time.
+ *
+ * Return:
+ * * 0 on success
+ * * < 0 on error
+ */
+int lrng_set_drng_cb(const struct lrng_drng_cb *drng_cb)
+{
+	struct lrng_drng *lrng_drng_init = lrng_drng_init_instance();
+	int ret;
+
+	if (!IS_ENABLED(CONFIG_LRNG_SWITCH_DRNG))
+		return -EOPNOTSUPP;
+
+	if (!drng_cb)
+		drng_cb = lrng_default_drng_cb;
+
+	mutex_lock(&lrng_crypto_cb_update);
+
+	/*
+	 * If a callback other than the default is set, allow it only to be
+	 * set back to the default callback. This prevents multiple different
+	 * callbacks from being registered at the same time. If a callback
+	 * different from the current callback and the default callback shall
+	 * be set, the current callback must be deregistered first (e.g. the
+	 * kernel module providing it must be unloaded) before the new
+	 * implementation can be registered.
+	 */
+	if ((drng_cb != lrng_default_drng_cb) &&
+	    (lrng_drng_init->drng_cb != lrng_default_drng_cb)) {
+		pr_warn("disallow setting new DRNG callbacks, unload the old callbacks first!\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = lrng_switch(drng_cb, lrng_drng_switch);
+	/* The switch may imply new entropy due to larger DRNG sec strength. */
+	if (!ret)
+		lrng_es_add_entropy();
+
+out:
+	mutex_unlock(&lrng_crypto_cb_update);
+	return ret;
+}
+EXPORT_SYMBOL(lrng_set_drng_cb);
+
+/*
+ * lrng_set_hash_cb - Register new cryptographic callback functions for hash
+ * The registering implies that all old hash states are replaced with new
+ * hash states.
+ *
+ * @hash_cb: Callback functions to be registered -- if NULL, use the default
+ *	     callbacks defined at compile time.
+ *
+ * Return:
+ * * 0 on success
+ * * < 0 on error
+ */
+int lrng_set_hash_cb(const struct lrng_hash_cb *hash_cb)
+{
+	struct lrng_drng *lrng_drng_init = lrng_drng_init_instance();
+	int ret;
+
+	if (!IS_ENABLED(CONFIG_LRNG_SWITCH_HASH))
+		return -EOPNOTSUPP;
+
+	if (!hash_cb)
+		hash_cb = lrng_default_hash_cb;
+
+	mutex_lock(&lrng_crypto_cb_update);
+
+	/* Comment from lrng_set_drng_cb applies. */
+	if ((hash_cb != lrng_default_hash_cb) &&
+	    (lrng_drng_init->hash_cb != lrng_default_hash_cb)) {
+		pr_warn("disallow setting new hash callbacks, unload the old callbacks first!\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = lrng_switch(hash_cb, lrng_hash_switch);
+	/* The switch may imply new entropy due to larger digest size. */
+	if (!ret)
+		lrng_es_add_entropy();
+
+out:
+	mutex_unlock(&lrng_crypto_cb_update);
+	return ret;
+}
+EXPORT_SYMBOL(lrng_set_hash_cb);
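The two exported setters above are the hook that a crypto provider module uses at load time. A minimal sketch of such a provider follows; `my_drng_cb` is a hypothetical callback table, while lrng_set_drng_cb() and its NULL-restores-default semantics come from the code above:

	/*
	 * Sketch of a DRNG provider module. my_drng_cb is a hypothetical
	 * struct lrng_drng_cb instance; passing NULL on unload restores
	 * the compile-time default callbacks as documented above.
	 */
	#include <linux/lrng.h>
	#include <linux/module.h>

	extern const struct lrng_drng_cb my_drng_cb;	/* assumed provider */

	static int __init my_provider_init(void)
	{
		return lrng_set_drng_cb(&my_drng_cb);
	}

	static void __exit my_provider_exit(void)
	{
		WARN_ON(lrng_set_drng_cb(NULL));
	}

	module_init(my_provider_init);
	module_exit(my_provider_exit);
	MODULE_LICENSE("GPL");

Because only default-to-custom and custom-to-default transitions are accepted, the exit handler must run before a different provider can register.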
diff --git a/drivers/char/lrng/lrng_sysctl.c b/drivers/char/lrng/lrng_sysctl.c
new file mode 100644
index 000000000000..0f657d01bf62
--- /dev/null
+++ b/drivers/char/lrng/lrng_sysctl.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+/*
+ * LRNG sysctl interfaces
+ *
+ * Copyright (C) 2022, Stephan Mueller
+ */
+
+#include <linux/lrng.h>
+#include <linux/proc_fs.h>
+#include <linux/sysctl.h>
+#include <linux/uuid.h>
+
+#include "lrng_drng_mgr.h"
+#include "lrng_es_mgr.h"
+#include "lrng_sysctl.h"
+
+/*
+ * This function is used to return both the boot-id UUID and a random
+ * UUID. The difference is in whether table->data is NULL; if it is,
+ * then a new UUID is generated and returned to the user.
+ *
+ * If the user accesses this via the proc interface, the UUID will be
+ * returned as an ASCII string in the standard UUID format; if via the
+ * sysctl system call, as 16 bytes of binary data.
+ */ +static int lrng_sysctl_do_uuid(struct ctl_table *table, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + struct ctl_table fake_table; + unsigned char buf[64], tmp_uuid[16], *uuid; + + uuid = table->data; + if (!uuid) { + uuid = tmp_uuid; + generate_random_uuid(uuid); + } else { + static DEFINE_SPINLOCK(bootid_spinlock); + + spin_lock(&bootid_spinlock); + if (!uuid[8]) + generate_random_uuid(uuid); + spin_unlock(&bootid_spinlock); + } + + sprintf(buf, "%pU", uuid); + + fake_table.data = buf; + fake_table.maxlen = sizeof(buf); + + return proc_dostring(&fake_table, write, buffer, lenp, ppos); +} + +static int lrng_sysctl_do_entropy(struct ctl_table *table, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + struct ctl_table fake_table; + int entropy_count; + + entropy_count = lrng_avail_entropy(); + + fake_table.data = &entropy_count; + fake_table.maxlen = sizeof(entropy_count); + + return proc_dointvec(&fake_table, write, buffer, lenp, ppos); +} + +static int lrng_sysctl_do_poolsize(struct ctl_table *table, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + struct ctl_table fake_table; + u32 i, entropy_count = 0; + + for_each_lrng_es(i) + entropy_count += lrng_es[i]->max_entropy(); + + fake_table.data = &entropy_count; + fake_table.maxlen = sizeof(entropy_count); + + return proc_dointvec(&fake_table, write, buffer, lenp, ppos); +} + +static int lrng_min_write_thresh; +static int lrng_max_write_thresh = (LRNG_WRITE_WAKEUP_ENTROPY << 3); +static char lrng_sysctl_bootid[16]; +static int lrng_drng_reseed_max_min; + +void lrng_sysctl_update_max_write_thresh(u32 new_digestsize) +{ + lrng_max_write_thresh = (int)new_digestsize; + /* Ensure that changes to the global variable are visible */ + mb(); +} + +static struct ctl_table random_table[] = { + { + .procname = "poolsize", + .maxlen = sizeof(int), + .mode = 0444, + .proc_handler = lrng_sysctl_do_poolsize, + }, + { + .procname = "entropy_avail", + .maxlen = sizeof(int), + .mode = 0444, + .proc_handler = lrng_sysctl_do_entropy, + }, + { + .procname = "write_wakeup_threshold", + .data = &lrng_write_wakeup_bits, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &lrng_min_write_thresh, + .extra2 = &lrng_max_write_thresh, + }, + { + .procname = "boot_id", + .data = &lrng_sysctl_bootid, + .maxlen = 16, + .mode = 0444, + .proc_handler = lrng_sysctl_do_uuid, + }, + { + .procname = "uuid", + .maxlen = 16, + .mode = 0444, + .proc_handler = lrng_sysctl_do_uuid, + }, + { + .procname = "urandom_min_reseed_secs", + .data = &lrng_drng_reseed_max_time, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + .extra1 = &lrng_drng_reseed_max_min, + }, + { } +}; + +static int __init random_sysctls_init(void) +{ + register_sysctl_init("kernel/random", random_table); + return 0; +} +device_initcall(random_sysctls_init); diff --git a/drivers/char/lrng/lrng_sysctl.h b/drivers/char/lrng/lrng_sysctl.h new file mode 100644 index 000000000000..4b487e5077ed --- /dev/null +++ b/drivers/char/lrng/lrng_sysctl.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_SYSCTL_H +#define _LRNG_SYSCTL_H + +#ifdef CONFIG_LRNG_SYSCTL +void lrng_sysctl_update_max_write_thresh(u32 new_digestsize); +#else +static inline void lrng_sysctl_update_max_write_thresh(u32 new_digestsize) { } +#endif + +#endif /* _LRNG_SYSCTL_H */ diff --git a/drivers/char/lrng/lrng_testing.c b/drivers/char/lrng/lrng_testing.c new file 
mode 100644
index 000000000000..101140085d81
--- /dev/null
+++ b/drivers/char/lrng/lrng_testing.c
@@ -0,0 +1,901 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+/*
+ * LRNG testing interfaces to obtain raw entropy
+ *
+ * Copyright (C) 2022, Stephan Mueller
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/atomic.h>
+#include <linux/bug.h>
+#include <linux/debugfs.h>
+#include <linux/lrng.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/workqueue.h>
+#include <asm/errno.h>
+
+#include "lrng_definitions.h"
+#include "lrng_drng_chacha20.h"
+#include "lrng_sha.h"
+#include "lrng_testing.h"
+
+#if defined(CONFIG_LRNG_RAW_SCHED_HIRES_ENTROPY) || \
+    defined(CONFIG_LRNG_RAW_SCHED_PID_ENTROPY) || \
+    defined(CONFIG_LRNG_RAW_SCHED_START_TIME_ENTROPY) || \
+    defined(CONFIG_LRNG_RAW_SCHED_NVCSW_ENTROPY) || \
+    defined(CONFIG_LRNG_SCHED_PERF)
+#define LRNG_TESTING_USE_BUSYLOOP
+#endif
+
+#ifdef CONFIG_LRNG_TESTING_RECORDING
+
+#define LRNG_TESTING_RINGBUFFER_SIZE	1024
+#define LRNG_TESTING_RINGBUFFER_MASK	(LRNG_TESTING_RINGBUFFER_SIZE - 1)
+
+struct lrng_testing {
+	u32 lrng_testing_rb[LRNG_TESTING_RINGBUFFER_SIZE];
+	u32 rb_reader;
+	atomic_t rb_writer;
+	atomic_t lrng_testing_enabled;
+	spinlock_t lock;
+	wait_queue_head_t read_wait;
+};
+
+/*************************** Generic Data Handling ****************************/
+
+/*
+ * boot variable:
+ * 0 ==> No boot test, gathering of runtime data allowed
+ * 1 ==> Boot test enabled and ready for collecting data, gathering runtime
+ *	 data is disabled
+ * 2 ==> Boot test completed and disabled, gathering of runtime data is
+ *	 disabled
+ */
+
+static void lrng_testing_reset(struct lrng_testing *data)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&data->lock, flags);
+	data->rb_reader = 0;
+	atomic_set(&data->rb_writer, 0);
+	spin_unlock_irqrestore(&data->lock, flags);
+}
+
+static void lrng_testing_init(struct lrng_testing *data, u32 boot)
+{
+	/*
+	 * The boot time testing implies we have a running test. If the
+	 * caller wants to clear it, he has to unset the boot_test flag
+	 * at runtime via sysfs to enable regular runtime testing.
+	 */
+	if (boot)
+		return;
+
+	lrng_testing_reset(data);
+	atomic_set(&data->lrng_testing_enabled, 1);
+	pr_warn("Enabling data collection\n");
+}
+
+static void lrng_testing_fini(struct lrng_testing *data, u32 boot)
+{
+	/* If we have boot data, we do not reset yet to allow data to be read */
+	if (boot)
+		return;
+
+	atomic_set(&data->lrng_testing_enabled, 0);
+	lrng_testing_reset(data);
+	pr_warn("Disabling data collection\n");
+}
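The recording machinery keeps samples in a power-of-two ring buffer so that indexing can wrap with a simple AND instead of a modulo: the writer is a free-running counter and only its masked value indexes the array. A condensed, self-contained sketch of that scheme (names are illustrative, not the driver's):

	/*
	 * Condensed sketch of the masked ring-buffer indexing used by
	 * lrng_testing_store()/lrng_testing_reader() below.
	 */
	#include <stdbool.h>
	#include <stdint.h>

	#define RB_SIZE	1024			/* must be a power of two */
	#define RB_MASK	(RB_SIZE - 1)

	static uint32_t rb[RB_SIZE];

	static void rb_put(uint32_t *writer, uint32_t value)
	{
		/* The counter runs freely; the mask performs the wrap */
		rb[*writer & RB_MASK] = value;
		(*writer)++;
	}

	static bool rb_has_data(uint32_t writer, uint32_t reader)
	{
		return (writer & RB_MASK) != (reader & RB_MASK);
	}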
+static bool lrng_testing_store(struct lrng_testing *data, u32 value,
+			       u32 *boot)
+{
+	unsigned long flags;
+
+	if (!atomic_read(&data->lrng_testing_enabled) && (*boot != 1))
+		return false;
+
+	spin_lock_irqsave(&data->lock, flags);
+
+	/*
+	 * Disable entropy testing for boot time testing after ring buffer
+	 * is filled.
+	 */
+	if (*boot) {
+		if (((u32)atomic_read(&data->rb_writer)) >
+		    LRNG_TESTING_RINGBUFFER_SIZE) {
+			*boot = 2;
+			pr_warn_once("One time data collection test disabled\n");
+			spin_unlock_irqrestore(&data->lock, flags);
+			return false;
+		}
+
+		if (atomic_read(&data->rb_writer) == 1)
+			pr_warn("One time data collection test enabled\n");
+	}
+
+	data->lrng_testing_rb[((u32)atomic_read(&data->rb_writer)) &
+			      LRNG_TESTING_RINGBUFFER_MASK] = value;
+	atomic_inc(&data->rb_writer);
+
+	spin_unlock_irqrestore(&data->lock, flags);
+
+#ifndef LRNG_TESTING_USE_BUSYLOOP
+	if (wq_has_sleeper(&data->read_wait))
+		wake_up_interruptible(&data->read_wait);
+#endif
+
+	return true;
+}
+
+static bool lrng_testing_have_data(struct lrng_testing *data)
+{
+	return ((((u32)atomic_read(&data->rb_writer)) &
+		 LRNG_TESTING_RINGBUFFER_MASK) !=
+		(data->rb_reader & LRNG_TESTING_RINGBUFFER_MASK));
+}
+
+static int lrng_testing_reader(struct lrng_testing *data, u32 *boot,
+			       u8 *outbuf, u32 outbuflen)
+{
+	unsigned long flags;
+	int collected_data = 0;
+
+	lrng_testing_init(data, *boot);
+
+	while (outbuflen) {
+		u32 writer = (u32)atomic_read(&data->rb_writer);
+
+		spin_lock_irqsave(&data->lock, flags);
+
+		/* We have no data or reached the writer. */
+		if (!writer || (writer == data->rb_reader)) {
+
+			spin_unlock_irqrestore(&data->lock, flags);
+
+			/*
+			 * Now that all boot data has been gathered, enable
+			 * regular data collection.
+			 */
+			if (*boot) {
+				*boot = 0;
+				goto out;
+			}
+
+#ifdef LRNG_TESTING_USE_BUSYLOOP
+			while (!lrng_testing_have_data(data))
+				;
+#else
+			wait_event_interruptible(data->read_wait,
+						 lrng_testing_have_data(data));
+#endif
+			if (signal_pending(current)) {
+				collected_data = -ERESTARTSYS;
+				goto out;
+			}
+
+			continue;
+		}
+
+		/* We copy out word-wise */
+		if (outbuflen < sizeof(u32)) {
+			spin_unlock_irqrestore(&data->lock, flags);
+			goto out;
+		}
+
+		memcpy(outbuf, &data->lrng_testing_rb[data->rb_reader],
+		       sizeof(u32));
+		data->rb_reader++;
+
+		spin_unlock_irqrestore(&data->lock, flags);
+
+		outbuf += sizeof(u32);
+		outbuflen -= sizeof(u32);
+		collected_data += sizeof(u32);
+	}
+
+out:
+	lrng_testing_fini(data, *boot);
+	return collected_data;
+}
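User-space read requests arrive through lrng_testing_extract_user() below and are served in u32-sized samples. A user-space sketch for draining one of the raw-entropy debugfs files created at the end of this file might look as follows; the debugfs path is an assumption (it depends on KBUILD_MODNAME and the debugfs mount point):

	/*
	 * Userspace sketch: drain raw entropy samples for SP800-90B
	 * analysis. The path below is an assumption.
	 */
	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		uint32_t sample[1024];
		ssize_t n, i;
		int fd = open("/sys/kernel/debug/lrng_testing/lrng_raw_hires",
			      O_RDONLY);

		if (fd < 0)
			return 1;

		n = read(fd, sample, sizeof(sample));	/* multiple of 4 bytes */
		for (i = 0; i < n / 4; i++)
			printf("%u\n", sample[i]);

		close(fd);
		return 0;
	}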
+static int lrng_testing_extract_user(struct file *file, char __user *buf,
+				     size_t nbytes, loff_t *ppos,
+				     int (*reader)(u8 *outbuf, u32 outbuflen))
+{
+	u8 *tmp, *tmp_aligned;
+	int ret = 0, large_request = (nbytes > 256);
+
+	if (!nbytes)
+		return 0;
+
+	/*
+	 * The intention of this interface is for collecting at least
+	 * 1000 samples due to the SP800-90B requirements. So, we make no
+	 * effort to avoid allocating more memory than actually needed by
+	 * the user. Hence, we allocate sufficient memory to always hold
+	 * that amount of data.
+	 */
+	tmp = kmalloc(LRNG_TESTING_RINGBUFFER_SIZE + sizeof(u32), GFP_KERNEL);
+	if (!tmp)
+		return -ENOMEM;
+
+	tmp_aligned = PTR_ALIGN(tmp, sizeof(u32));
+
+	while (nbytes) {
+		int i;
+
+		if (large_request && need_resched()) {
+			if (signal_pending(current)) {
+				if (ret == 0)
+					ret = -ERESTARTSYS;
+				break;
+			}
+			schedule();
+		}
+
+		i = min_t(int, nbytes, LRNG_TESTING_RINGBUFFER_SIZE);
+		i = reader(tmp_aligned, i);
+		if (i <= 0) {
+			if (i < 0)
+				ret = i;
+			break;
+		}
+		if (copy_to_user(buf, tmp_aligned, i)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		nbytes -= i;
+		buf += i;
+		ret += i;
+	}
+
+	kfree_sensitive(tmp);
+
+	if (ret > 0)
+		*ppos += ret;
+
+	return ret;
+}
+
+#endif /* CONFIG_LRNG_TESTING_RECORDING */
+
+/************* Raw High-Resolution IRQ Timer Entropy Data Handling ************/
+
+#ifdef CONFIG_LRNG_RAW_HIRES_ENTROPY
+
+static u32 boot_raw_hires_test = 0;
+module_param(boot_raw_hires_test, uint, 0644);
+MODULE_PARM_DESC(boot_raw_hires_test, "Enable gathering boot time high resolution timer entropy of the first IRQ entropy events");
+
+static struct lrng_testing lrng_raw_hires = {
+	.rb_reader = 0,
+	.rb_writer = ATOMIC_INIT(0),
+	.lock = __SPIN_LOCK_UNLOCKED(lrng_raw_hires.lock),
+	.read_wait = __WAIT_QUEUE_HEAD_INITIALIZER(lrng_raw_hires.read_wait)
+};
+
+bool lrng_raw_hires_entropy_store(u32 value)
+{
+	return lrng_testing_store(&lrng_raw_hires, value, &boot_raw_hires_test);
+}
+
+static int lrng_raw_hires_entropy_reader(u8 *outbuf, u32 outbuflen)
+{
+	return lrng_testing_reader(&lrng_raw_hires, &boot_raw_hires_test,
+				   outbuf, outbuflen);
+}
+
+static ssize_t lrng_raw_hires_read(struct file *file, char __user *to,
+				   size_t count, loff_t *ppos)
+{
+	return lrng_testing_extract_user(file, to, count, ppos,
+					 lrng_raw_hires_entropy_reader);
+}
+
+static const struct file_operations lrng_raw_hires_fops = {
+	.owner = THIS_MODULE,
+	.read = lrng_raw_hires_read,
+};
+
+#endif /* CONFIG_LRNG_RAW_HIRES_ENTROPY */
+
+/********************* Raw Jiffies Entropy Data Handling **********************/
+
+#ifdef CONFIG_LRNG_RAW_JIFFIES_ENTROPY
+
+static u32 boot_raw_jiffies_test = 0;
+module_param(boot_raw_jiffies_test, uint, 0644);
+MODULE_PARM_DESC(boot_raw_jiffies_test, "Enable gathering boot time jiffies entropy of the first entropy events");
+
+static struct lrng_testing lrng_raw_jiffies = {
+	.rb_reader = 0,
+	.rb_writer = ATOMIC_INIT(0),
+	.lock = __SPIN_LOCK_UNLOCKED(lrng_raw_jiffies.lock),
+	.read_wait = __WAIT_QUEUE_HEAD_INITIALIZER(lrng_raw_jiffies.read_wait)
+};
+
+bool lrng_raw_jiffies_entropy_store(u32 value)
+{
+	return lrng_testing_store(&lrng_raw_jiffies, value,
+				  &boot_raw_jiffies_test);
+}
+
+static int lrng_raw_jiffies_entropy_reader(u8 *outbuf, u32 outbuflen)
+{
+	return lrng_testing_reader(&lrng_raw_jiffies, &boot_raw_jiffies_test,
+				   outbuf, outbuflen);
+}
+
+static ssize_t lrng_raw_jiffies_read(struct file *file, char __user *to,
+				     size_t count, loff_t *ppos)
+{
+	return lrng_testing_extract_user(file, to, count, ppos,
+					 lrng_raw_jiffies_entropy_reader);
+}
+
+static const struct file_operations lrng_raw_jiffies_fops = {
+	.owner = THIS_MODULE,
+	.read = lrng_raw_jiffies_read,
+};
+
+#endif /* CONFIG_LRNG_RAW_JIFFIES_ENTROPY */
+
+/************************** Raw IRQ Data Handling ****************************/
+
+#ifdef CONFIG_LRNG_RAW_IRQ_ENTROPY
+
+static u32 boot_raw_irq_test = 0;
+module_param(boot_raw_irq_test, uint, 0644);
+MODULE_PARM_DESC(boot_raw_irq_test, "Enable gathering boot time entropy of the first IRQ entropy events");
+
+static struct
lrng_testing lrng_raw_irq = { + .rb_reader = 0, + .rb_writer = ATOMIC_INIT(0), + .lock = __SPIN_LOCK_UNLOCKED(lrng_raw_irq.lock), + .read_wait = __WAIT_QUEUE_HEAD_INITIALIZER(lrng_raw_irq.read_wait) +}; + +bool lrng_raw_irq_entropy_store(u32 value) +{ + return lrng_testing_store(&lrng_raw_irq, value, &boot_raw_irq_test); +} + +static int lrng_raw_irq_entropy_reader(u8 *outbuf, u32 outbuflen) +{ + return lrng_testing_reader(&lrng_raw_irq, &boot_raw_irq_test, outbuf, + outbuflen); +} + +static ssize_t lrng_raw_irq_read(struct file *file, char __user *to, + size_t count, loff_t *ppos) +{ + return lrng_testing_extract_user(file, to, count, ppos, + lrng_raw_irq_entropy_reader); +} + +static const struct file_operations lrng_raw_irq_fops = { + .owner = THIS_MODULE, + .read = lrng_raw_irq_read, +}; + +#endif /* CONFIG_LRNG_RAW_IRQ_ENTROPY */ + +/************************ Raw _RET_IP_ Data Handling **************************/ + +#ifdef CONFIG_LRNG_RAW_RETIP_ENTROPY + +static u32 boot_raw_retip_test = 0; +module_param(boot_raw_retip_test, uint, 0644); +MODULE_PARM_DESC(boot_raw_retip_test, "Enable gathering boot time entropy of the first return instruction pointer entropy events"); + +static struct lrng_testing lrng_raw_retip = { + .rb_reader = 0, + .rb_writer = ATOMIC_INIT(0), + .lock = __SPIN_LOCK_UNLOCKED(lrng_raw_retip.lock), + .read_wait = __WAIT_QUEUE_HEAD_INITIALIZER(lrng_raw_retip.read_wait) +}; + +bool lrng_raw_retip_entropy_store(u32 value) +{ + return lrng_testing_store(&lrng_raw_retip, value, &boot_raw_retip_test); +} + +static int lrng_raw_retip_entropy_reader(u8 *outbuf, u32 outbuflen) +{ + return lrng_testing_reader(&lrng_raw_retip, &boot_raw_retip_test, + outbuf, outbuflen); +} + +static ssize_t lrng_raw_retip_read(struct file *file, char __user *to, + size_t count, loff_t *ppos) +{ + return lrng_testing_extract_user(file, to, count, ppos, + lrng_raw_retip_entropy_reader); +} + +static const struct file_operations lrng_raw_retip_fops = { + .owner = THIS_MODULE, + .read = lrng_raw_retip_read, +}; + +#endif /* CONFIG_LRNG_RAW_RETIP_ENTROPY */ + +/********************** Raw IRQ register Data Handling ************************/ + +#ifdef CONFIG_LRNG_RAW_REGS_ENTROPY + +static u32 boot_raw_regs_test = 0; +module_param(boot_raw_regs_test, uint, 0644); +MODULE_PARM_DESC(boot_raw_regs_test, "Enable gathering boot time entropy of the first interrupt register entropy events"); + +static struct lrng_testing lrng_raw_regs = { + .rb_reader = 0, + .rb_writer = ATOMIC_INIT(0), + .lock = __SPIN_LOCK_UNLOCKED(lrng_raw_regs.lock), + .read_wait = __WAIT_QUEUE_HEAD_INITIALIZER(lrng_raw_regs.read_wait) +}; + +bool lrng_raw_regs_entropy_store(u32 value) +{ + return lrng_testing_store(&lrng_raw_regs, value, &boot_raw_regs_test); +} + +static int lrng_raw_regs_entropy_reader(u8 *outbuf, u32 outbuflen) +{ + return lrng_testing_reader(&lrng_raw_regs, &boot_raw_regs_test, + outbuf, outbuflen); +} + +static ssize_t lrng_raw_regs_read(struct file *file, char __user *to, + size_t count, loff_t *ppos) +{ + return lrng_testing_extract_user(file, to, count, ppos, + lrng_raw_regs_entropy_reader); +} + +static const struct file_operations lrng_raw_regs_fops = { + .owner = THIS_MODULE, + .read = lrng_raw_regs_read, +}; + +#endif /* CONFIG_LRNG_RAW_REGS_ENTROPY */ + +/********************** Raw Entropy Array Data Handling ***********************/ + +#ifdef CONFIG_LRNG_RAW_ARRAY + +static u32 boot_raw_array = 0; +module_param(boot_raw_array, uint, 0644); +MODULE_PARM_DESC(boot_raw_array, "Enable gathering boot time raw 
noise array data of the first entropy events"); + +static struct lrng_testing lrng_raw_array = { + .rb_reader = 0, + .rb_writer = ATOMIC_INIT(0), + .lock = __SPIN_LOCK_UNLOCKED(lrng_raw_array.lock), + .read_wait = __WAIT_QUEUE_HEAD_INITIALIZER(lrng_raw_array.read_wait) +}; + +bool lrng_raw_array_entropy_store(u32 value) +{ + return lrng_testing_store(&lrng_raw_array, value, &boot_raw_array); +} + +static int lrng_raw_array_entropy_reader(u8 *outbuf, u32 outbuflen) +{ + return lrng_testing_reader(&lrng_raw_array, &boot_raw_array, outbuf, + outbuflen); +} + +static ssize_t lrng_raw_array_read(struct file *file, char __user *to, + size_t count, loff_t *ppos) +{ + return lrng_testing_extract_user(file, to, count, ppos, + lrng_raw_array_entropy_reader); +} + +static const struct file_operations lrng_raw_array_fops = { + .owner = THIS_MODULE, + .read = lrng_raw_array_read, +}; + +#endif /* CONFIG_LRNG_RAW_ARRAY */ + +/******************** Interrupt Performance Data Handling *********************/ + +#ifdef CONFIG_LRNG_IRQ_PERF + +static u32 boot_irq_perf = 0; +module_param(boot_irq_perf, uint, 0644); +MODULE_PARM_DESC(boot_irq_perf, "Enable gathering interrupt entropy source performance data"); + +static struct lrng_testing lrng_irq_perf = { + .rb_reader = 0, + .rb_writer = ATOMIC_INIT(0), + .lock = __SPIN_LOCK_UNLOCKED(lrng_irq_perf.lock), + .read_wait = __WAIT_QUEUE_HEAD_INITIALIZER(lrng_irq_perf.read_wait) +}; + +bool lrng_perf_time(u32 start) +{ + return lrng_testing_store(&lrng_irq_perf, random_get_entropy() - start, + &boot_irq_perf); +} + +static int lrng_irq_perf_reader(u8 *outbuf, u32 outbuflen) +{ + return lrng_testing_reader(&lrng_irq_perf, &boot_irq_perf, outbuf, + outbuflen); +} + +static ssize_t lrng_irq_perf_read(struct file *file, char __user *to, + size_t count, loff_t *ppos) +{ + return lrng_testing_extract_user(file, to, count, ppos, + lrng_irq_perf_reader); +} + +static const struct file_operations lrng_irq_perf_fops = { + .owner = THIS_MODULE, + .read = lrng_irq_perf_read, +}; + +#endif /* CONFIG_LRNG_IRQ_PERF */ + +/****** Raw High-Resolution Scheduler-based Timer Entropy Data Handling *******/ + +#ifdef CONFIG_LRNG_RAW_SCHED_HIRES_ENTROPY + +static u32 boot_raw_sched_hires_test = 0; +module_param(boot_raw_sched_hires_test, uint, 0644); +MODULE_PARM_DESC(boot_raw_sched_hires_test, "Enable gathering boot time high resolution timer entropy of the first Scheduler-based entropy events"); + +static struct lrng_testing lrng_raw_sched_hires = { + .rb_reader = 0, + .rb_writer = ATOMIC_INIT(0), + .lock = __SPIN_LOCK_UNLOCKED(lrng_raw_sched_hires.lock), + .read_wait = + __WAIT_QUEUE_HEAD_INITIALIZER(lrng_raw_sched_hires.read_wait) +}; + +bool lrng_raw_sched_hires_entropy_store(u32 value) +{ + return lrng_testing_store(&lrng_raw_sched_hires, value, + &boot_raw_sched_hires_test); +} + +static int lrng_raw_sched_hires_entropy_reader(u8 *outbuf, u32 outbuflen) +{ + return lrng_testing_reader(&lrng_raw_sched_hires, + &boot_raw_sched_hires_test, + outbuf, outbuflen); +} + +static ssize_t lrng_raw_sched_hires_read(struct file *file, char __user *to, + size_t count, loff_t *ppos) +{ + return lrng_testing_extract_user(file, to, count, ppos, + lrng_raw_sched_hires_entropy_reader); +} + +static const struct file_operations lrng_raw_sched_hires_fops = { + .owner = THIS_MODULE, + .read = lrng_raw_sched_hires_read, +}; + +#endif /* CONFIG_LRNG_RAW_SCHED_HIRES_ENTROPY */ + +/******************** Interrupt Performance Data Handling *********************/ + +#ifdef CONFIG_LRNG_SCHED_PERF + +static 
u32 boot_sched_perf = 0; +module_param(boot_sched_perf, uint, 0644); +MODULE_PARM_DESC(boot_sched_perf, "Enable gathering scheduler-based entropy source performance data"); + +static struct lrng_testing lrng_sched_perf = { + .rb_reader = 0, + .rb_writer = ATOMIC_INIT(0), + .lock = __SPIN_LOCK_UNLOCKED(lrng_sched_perf.lock), + .read_wait = __WAIT_QUEUE_HEAD_INITIALIZER(lrng_sched_perf.read_wait) +}; + +bool lrng_sched_perf_time(u32 start) +{ + return lrng_testing_store(&lrng_sched_perf, random_get_entropy() - start, + &boot_sched_perf); +} + +static int lrng_sched_perf_reader(u8 *outbuf, u32 outbuflen) +{ + return lrng_testing_reader(&lrng_sched_perf, &boot_sched_perf, outbuf, + outbuflen); +} + +static ssize_t lrng_sched_perf_read(struct file *file, char __user *to, + size_t count, loff_t *ppos) +{ + return lrng_testing_extract_user(file, to, count, ppos, + lrng_sched_perf_reader); +} + +static const struct file_operations lrng_sched_perf_fops = { + .owner = THIS_MODULE, + .read = lrng_sched_perf_read, +}; + +#endif /* CONFIG_LRNG_SCHED_PERF */ + +/*************** Raw Scheduler task_struct->pid Data Handling *****************/ + +#ifdef CONFIG_LRNG_RAW_SCHED_PID_ENTROPY + +static u32 boot_raw_sched_pid_test = 0; +module_param(boot_raw_sched_pid_test, uint, 0644); +MODULE_PARM_DESC(boot_raw_sched_pid_test, "Enable gathering boot time entropy of the first PIDs collected by the scheduler entropy source"); + +static struct lrng_testing lrng_raw_sched_pid = { + .rb_reader = 0, + .rb_writer = ATOMIC_INIT(0), + .lock = __SPIN_LOCK_UNLOCKED(lrng_raw_sched_pid.lock), + .read_wait = __WAIT_QUEUE_HEAD_INITIALIZER(lrng_raw_sched_pid.read_wait) +}; + +bool lrng_raw_sched_pid_entropy_store(u32 value) +{ + return lrng_testing_store(&lrng_raw_sched_pid, value, + &boot_raw_sched_pid_test); +} + +static int lrng_raw_sched_pid_entropy_reader(u8 *outbuf, u32 outbuflen) +{ + return lrng_testing_reader(&lrng_raw_sched_pid, + &boot_raw_sched_pid_test, outbuf, outbuflen); +} + +static ssize_t lrng_raw_sched_pid_read(struct file *file, char __user *to, + size_t count, loff_t *ppos) +{ + return lrng_testing_extract_user(file, to, count, ppos, + lrng_raw_sched_pid_entropy_reader); +} + +static const struct file_operations lrng_raw_sched_pid_fops = { + .owner = THIS_MODULE, + .read = lrng_raw_sched_pid_read, +}; + +#endif /* CONFIG_LRNG_RAW_SCHED_PID_ENTROPY */ + + +/*********** Raw Scheduler task_struct->start_time Data Handling **************/ + +#ifdef CONFIG_LRNG_RAW_SCHED_START_TIME_ENTROPY + +static u32 boot_raw_sched_starttime_test = 0; +module_param(boot_raw_sched_starttime_test, uint, 0644); +MODULE_PARM_DESC(boot_raw_sched_starttime_test, "Enable gathering boot time entropy of the first task start times collected by the scheduler entropy source"); + +static struct lrng_testing lrng_raw_sched_starttime = { + .rb_reader = 0, + .rb_writer = ATOMIC_INIT(0), + .lock = __SPIN_LOCK_UNLOCKED(lrng_raw_sched_starttime.lock), + .read_wait = __WAIT_QUEUE_HEAD_INITIALIZER(lrng_raw_sched_starttime.read_wait) +}; + +bool lrng_raw_sched_starttime_entropy_store(u32 value) +{ + return lrng_testing_store(&lrng_raw_sched_starttime, value, + &boot_raw_sched_starttime_test); +} + +static int lrng_raw_sched_starttime_entropy_reader(u8 *outbuf, u32 outbuflen) +{ + return lrng_testing_reader(&lrng_raw_sched_starttime, + &boot_raw_sched_starttime_test, outbuf, outbuflen); +} + +static ssize_t lrng_raw_sched_starttime_read(struct file *file, char __user *to, + size_t count, loff_t *ppos) +{ + return lrng_testing_extract_user(file, 
to, count, ppos,
+					 lrng_raw_sched_starttime_entropy_reader);
+}
+
+static const struct file_operations lrng_raw_sched_starttime_fops = {
+	.owner = THIS_MODULE,
+	.read = lrng_raw_sched_starttime_read,
+};
+
+#endif /* CONFIG_LRNG_RAW_SCHED_START_TIME_ENTROPY */
+
+/************** Raw Scheduler task_struct->nvcsw Data Handling ****************/
+
+#ifdef CONFIG_LRNG_RAW_SCHED_NVCSW_ENTROPY
+
+static u32 boot_raw_sched_nvcsw_test = 0;
+module_param(boot_raw_sched_nvcsw_test, uint, 0644);
+MODULE_PARM_DESC(boot_raw_sched_nvcsw_test, "Enable gathering boot time entropy of the first task context switch numbers collected by the scheduler entropy source");
+
+static struct lrng_testing lrng_raw_sched_nvcsw = {
+	.rb_reader = 0,
+	.rb_writer = ATOMIC_INIT(0),
+	.lock = __SPIN_LOCK_UNLOCKED(lrng_raw_sched_nvcsw.lock),
+	.read_wait = __WAIT_QUEUE_HEAD_INITIALIZER(lrng_raw_sched_nvcsw.read_wait)
+};
+
+bool lrng_raw_sched_nvcsw_entropy_store(u32 value)
+{
+	return lrng_testing_store(&lrng_raw_sched_nvcsw, value,
+				  &boot_raw_sched_nvcsw_test);
+}
+
+static int lrng_raw_sched_nvcsw_entropy_reader(u8 *outbuf, u32 outbuflen)
+{
+	return lrng_testing_reader(&lrng_raw_sched_nvcsw,
+				   &boot_raw_sched_nvcsw_test, outbuf, outbuflen);
+}
+
+static ssize_t lrng_raw_sched_nvcsw_read(struct file *file, char __user *to,
+					 size_t count, loff_t *ppos)
+{
+	return lrng_testing_extract_user(file, to, count, ppos,
+					 lrng_raw_sched_nvcsw_entropy_reader);
+}
+
+static const struct file_operations lrng_raw_sched_nvcsw_fops = {
+	.owner = THIS_MODULE,
+	.read = lrng_raw_sched_nvcsw_read,
+};
+
+#endif /* CONFIG_LRNG_RAW_SCHED_NVCSW_ENTROPY */
+
+/*********************************** ACVT ************************************/
+
+#ifdef CONFIG_LRNG_ACVT_HASH
+
+/* maximum amount of data to be hashed as defined by ACVP */
+#define LRNG_ACVT_MAX_SHA_MSG	(65536 >> 3)
+
+/*
+ * As we use static variables to store the data, it is clear that the
+ * test interface can only handle single-threaded testing. This is
+ * considered sufficient for testing. If the ACVT test interface were
+ * used by multiple threads concurrently, the caller would get garbage,
+ * but kernel operation would be unaffected by this.
+ */
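The ACVT interface below is a plain write-message/read-digest exchange over one debugfs file: the message is written, then the digest of that message is read back. A hedged user-space sketch follows; the file path is an assumption based on the debugfs setup at the end of this file, and the digest length depends on the compiled hash backend (32 bytes for SHA-256, 20 for SHA-1):

	/*
	 * Userspace sketch of the ACVT exchange. The path is an
	 * assumption; lseek() works because the fops use default_llseek.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		unsigned char digest[32];
		ssize_t n, i;
		int fd = open("/sys/kernel/debug/lrng_testing/lrng_acvt_hash",
			      O_RDWR);

		if (fd < 0)
			return 1;

		if (write(fd, "abc", 3) != 3)
			return 1;
		lseek(fd, 0, SEEK_SET);		/* rewind before reading */
		n = read(fd, digest, sizeof(digest));
		for (i = 0; i < n; i++)
			printf("%02x", digest[i]);
		printf("\n");

		close(fd);
		return 0;
	}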
+static u8 lrng_acvt_hash_data[LRNG_ACVT_MAX_SHA_MSG]
+						__aligned(LRNG_KCAPI_ALIGN);
+static atomic_t lrng_acvt_hash_data_size = ATOMIC_INIT(0);
+static u8 lrng_acvt_hash_digest[LRNG_ATOMIC_DIGEST_SIZE];
+
+static ssize_t lrng_acvt_hash_write(struct file *file, const char __user *buf,
+				    size_t nbytes, loff_t *ppos)
+{
+	if (nbytes > LRNG_ACVT_MAX_SHA_MSG)
+		return -EINVAL;
+
+	atomic_set(&lrng_acvt_hash_data_size, (int)nbytes);
+
+	return simple_write_to_buffer(lrng_acvt_hash_data,
+				      LRNG_ACVT_MAX_SHA_MSG, ppos, buf, nbytes);
+}
+
+static ssize_t lrng_acvt_hash_read(struct file *file, char __user *to,
+				   size_t count, loff_t *ppos)
+{
+	SHASH_DESC_ON_STACK(shash, NULL);
+	const struct lrng_hash_cb *hash_cb = &lrng_sha_hash_cb;
+	ssize_t ret;
+
+	if (count > LRNG_ATOMIC_DIGEST_SIZE)
+		return -EINVAL;
+
+	ret = hash_cb->hash_init(shash, NULL) ?:
+	      hash_cb->hash_update(shash, lrng_acvt_hash_data,
+				   atomic_read_u32(&lrng_acvt_hash_data_size)) ?:
+	      hash_cb->hash_final(shash, lrng_acvt_hash_digest);
+	if (ret)
+		return ret;
+
+	return simple_read_from_buffer(to, count, ppos, lrng_acvt_hash_digest,
+				       sizeof(lrng_acvt_hash_digest));
+}
+
+static const struct file_operations lrng_acvt_hash_fops = {
+	.owner = THIS_MODULE,
+	.open = simple_open,
+	.llseek = default_llseek,
+	.read = lrng_acvt_hash_read,
+	.write = lrng_acvt_hash_write,
+};
+
+#endif /* CONFIG_LRNG_ACVT_HASH */
+
+/**************************************************************************
+ * Debugfs interface
+ **************************************************************************/
+
+static int __init lrng_raw_init(void)
+{
+	struct dentry *lrng_raw_debugfs_root;
+
+	lrng_raw_debugfs_root = debugfs_create_dir(KBUILD_MODNAME, NULL);
+
+#ifdef CONFIG_LRNG_RAW_HIRES_ENTROPY
+	debugfs_create_file_unsafe("lrng_raw_hires", 0400,
+				   lrng_raw_debugfs_root, NULL,
+				   &lrng_raw_hires_fops);
+#endif
+#ifdef CONFIG_LRNG_RAW_JIFFIES_ENTROPY
+	debugfs_create_file_unsafe("lrng_raw_jiffies", 0400,
+				   lrng_raw_debugfs_root, NULL,
+				   &lrng_raw_jiffies_fops);
+#endif
+#ifdef CONFIG_LRNG_RAW_IRQ_ENTROPY
+	debugfs_create_file_unsafe("lrng_raw_irq", 0400, lrng_raw_debugfs_root,
+				   NULL, &lrng_raw_irq_fops);
+#endif
+#ifdef CONFIG_LRNG_RAW_RETIP_ENTROPY
+	debugfs_create_file_unsafe("lrng_raw_retip", 0400,
+				   lrng_raw_debugfs_root, NULL,
+				   &lrng_raw_retip_fops);
+#endif
+#ifdef CONFIG_LRNG_RAW_REGS_ENTROPY
+	debugfs_create_file_unsafe("lrng_raw_regs", 0400,
+				   lrng_raw_debugfs_root, NULL,
+				   &lrng_raw_regs_fops);
+#endif
+#ifdef CONFIG_LRNG_RAW_ARRAY
+	debugfs_create_file_unsafe("lrng_raw_array", 0400,
+				   lrng_raw_debugfs_root, NULL,
+				   &lrng_raw_array_fops);
+#endif
+#ifdef CONFIG_LRNG_IRQ_PERF
+	debugfs_create_file_unsafe("lrng_irq_perf", 0400, lrng_raw_debugfs_root,
+				   NULL, &lrng_irq_perf_fops);
+#endif
+#ifdef CONFIG_LRNG_RAW_SCHED_HIRES_ENTROPY
+	debugfs_create_file_unsafe("lrng_raw_sched_hires", 0400,
+				   lrng_raw_debugfs_root,
+				   NULL, &lrng_raw_sched_hires_fops);
+#endif
+#ifdef CONFIG_LRNG_RAW_SCHED_PID_ENTROPY
+	debugfs_create_file_unsafe("lrng_raw_sched_pid", 0400,
+				   lrng_raw_debugfs_root, NULL,
+				   &lrng_raw_sched_pid_fops);
+#endif
+#ifdef CONFIG_LRNG_RAW_SCHED_START_TIME_ENTROPY
+	debugfs_create_file_unsafe("lrng_raw_sched_starttime", 0400,
+				   lrng_raw_debugfs_root, NULL,
+				   &lrng_raw_sched_starttime_fops);
+#endif
+#ifdef CONFIG_LRNG_RAW_SCHED_NVCSW_ENTROPY
+	debugfs_create_file_unsafe("lrng_raw_sched_nvcsw", 0400,
+				   lrng_raw_debugfs_root, NULL,
+				   &lrng_raw_sched_nvcsw_fops);
+#endif
+#ifdef CONFIG_LRNG_SCHED_PERF
+
debugfs_create_file_unsafe("lrng_sched_perf", 0400, + lrng_raw_debugfs_root, NULL, + &lrng_sched_perf_fops); +#endif +#ifdef CONFIG_LRNG_ACVT_HASH + debugfs_create_file_unsafe("lrng_acvt_hash", 0600, + lrng_raw_debugfs_root, NULL, + &lrng_acvt_hash_fops); +#endif + + return 0; +} + +module_init(lrng_raw_init); diff --git a/drivers/char/lrng/lrng_testing.h b/drivers/char/lrng/lrng_testing.h new file mode 100644 index 000000000000..672a7fca4595 --- /dev/null +++ b/drivers/char/lrng/lrng_testing.h @@ -0,0 +1,85 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_TESTING_H +#define _LRNG_TESTING_H + +#ifdef CONFIG_LRNG_RAW_HIRES_ENTROPY +bool lrng_raw_hires_entropy_store(u32 value); +#else /* CONFIG_LRNG_RAW_HIRES_ENTROPY */ +static inline bool lrng_raw_hires_entropy_store(u32 value) { return false; } +#endif /* CONFIG_LRNG_RAW_HIRES_ENTROPY */ + +#ifdef CONFIG_LRNG_RAW_JIFFIES_ENTROPY +bool lrng_raw_jiffies_entropy_store(u32 value); +#else /* CONFIG_LRNG_RAW_JIFFIES_ENTROPY */ +static inline bool lrng_raw_jiffies_entropy_store(u32 value) { return false; } +#endif /* CONFIG_LRNG_RAW_JIFFIES_ENTROPY */ + +#ifdef CONFIG_LRNG_RAW_IRQ_ENTROPY +bool lrng_raw_irq_entropy_store(u32 value); +#else /* CONFIG_LRNG_RAW_IRQ_ENTROPY */ +static inline bool lrng_raw_irq_entropy_store(u32 value) { return false; } +#endif /* CONFIG_LRNG_RAW_IRQ_ENTROPY */ + +#ifdef CONFIG_LRNG_RAW_RETIP_ENTROPY +bool lrng_raw_retip_entropy_store(u32 value); +#else /* CONFIG_LRNG_RAW_RETIP_ENTROPY */ +static inline bool lrng_raw_retip_entropy_store(u32 value) { return false; } +#endif /* CONFIG_LRNG_RAW_RETIP_ENTROPY */ + +#ifdef CONFIG_LRNG_RAW_REGS_ENTROPY +bool lrng_raw_regs_entropy_store(u32 value); +#else /* CONFIG_LRNG_RAW_REGS_ENTROPY */ +static inline bool lrng_raw_regs_entropy_store(u32 value) { return false; } +#endif /* CONFIG_LRNG_RAW_REGS_ENTROPY */ + +#ifdef CONFIG_LRNG_RAW_ARRAY +bool lrng_raw_array_entropy_store(u32 value); +#else /* CONFIG_LRNG_RAW_ARRAY */ +static inline bool lrng_raw_array_entropy_store(u32 value) { return false; } +#endif /* CONFIG_LRNG_RAW_ARRAY */ + +#ifdef CONFIG_LRNG_IRQ_PERF +bool lrng_perf_time(u32 start); +#else /* CONFIG_LRNG_IRQ_PERF */ +static inline bool lrng_perf_time(u32 start) { return false; } +#endif /*CONFIG_LRNG_IRQ_PERF */ + +#ifdef CONFIG_LRNG_RAW_SCHED_HIRES_ENTROPY +bool lrng_raw_sched_hires_entropy_store(u32 value); +#else /* CONFIG_LRNG_RAW_SCHED_HIRES_ENTROPY */ +static inline bool +lrng_raw_sched_hires_entropy_store(u32 value) { return false; } +#endif /* CONFIG_LRNG_RAW_SCHED_HIRES_ENTROPY */ + +#ifdef CONFIG_LRNG_RAW_SCHED_PID_ENTROPY +bool lrng_raw_sched_pid_entropy_store(u32 value); +#else /* CONFIG_LRNG_RAW_SCHED_PID_ENTROPY */ +static inline bool +lrng_raw_sched_pid_entropy_store(u32 value) { return false; } +#endif /* CONFIG_LRNG_RAW_SCHED_PID_ENTROPY */ + +#ifdef CONFIG_LRNG_RAW_SCHED_START_TIME_ENTROPY +bool lrng_raw_sched_starttime_entropy_store(u32 value); +#else /* CONFIG_LRNG_RAW_SCHED_START_TIME_ENTROPY */ +static inline bool +lrng_raw_sched_starttime_entropy_store(u32 value) { return false; } +#endif /* CONFIG_LRNG_RAW_SCHED_START_TIME_ENTROPY */ + +#ifdef CONFIG_LRNG_RAW_SCHED_NVCSW_ENTROPY +bool lrng_raw_sched_nvcsw_entropy_store(u32 value); +#else /* CONFIG_LRNG_RAW_SCHED_NVCSW_ENTROPY */ +static inline bool +lrng_raw_sched_nvcsw_entropy_store(u32 value) { return false; } +#endif /* CONFIG_LRNG_RAW_SCHED_NVCSW_ENTROPY */ + +#ifdef CONFIG_LRNG_SCHED_PERF +bool 
lrng_sched_perf_time(u32 start); +#else /* CONFIG_LRNG_SCHED_PERF */ +static inline bool lrng_sched_perf_time(u32 start) { return false; } +#endif /*CONFIG_LRNG_SCHED_PERF */ + +#endif /* _LRNG_TESTING_H */ diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig index f2b038fa3b84..924dcc53bfab 100644 --- a/drivers/hwmon/Kconfig +++ b/drivers/hwmon/Kconfig @@ -1457,11 +1457,23 @@ config SENSORS_NCT6683 This driver can also be built as a module. If so, the module will be called nct6683. +config SENSORS_NCT6775_CORE + tristate + select REGMAP + help + This module contains common code shared by the platform and + i2c versions of the nct6775 driver; it is not useful on its + own. + + If built as a module, the module will be called + nct6775-core. + config SENSORS_NCT6775 - tristate "Nuvoton NCT6775F and compatibles" + tristate "Platform driver for Nuvoton NCT6775F and compatibles" depends on !PPC depends on ACPI_WMI || ACPI_WMI=n select HWMON_VID + select SENSORS_NCT6775_CORE help If you say yes here you get support for the hardware monitoring functionality of the Nuvoton NCT6106D, NCT6775F, NCT6776F, NCT6779D, @@ -1472,6 +1484,23 @@ config SENSORS_NCT6775 This driver can also be built as a module. If so, the module will be called nct6775. +config SENSORS_NCT6775_I2C + tristate "I2C driver for Nuvoton NCT6775F and compatibles" + depends on I2C + select REGMAP_I2C + select SENSORS_NCT6775_CORE + help + If you say yes here you get support for the hardware monitoring + functionality of the Nuvoton NCT6106D, NCT6775F, NCT6776F, NCT6779D, + NCT6791D, NCT6792D, NCT6793D, NCT6795D, NCT6796D, and compatible + Super-I/O chips via their I2C interface. + + If you're not building a kernel for a BMC, this is probably + not the driver you want (see CONFIG_SENSORS_NCT6775). + + This driver can also be built as a module. If so, the module + will be called nct6775-i2c. + config SENSORS_NCT7802 tristate "Nuvoton NCT7802Y" depends on I2C diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile index 8a03289e2aa4..004ab87c9b80 100644 --- a/drivers/hwmon/Makefile +++ b/drivers/hwmon/Makefile @@ -154,7 +154,10 @@ obj-$(CONFIG_SENSORS_MLXREG_FAN) += mlxreg-fan.o obj-$(CONFIG_SENSORS_MENF21BMC_HWMON) += menf21bmc_hwmon.o obj-$(CONFIG_SENSORS_MR75203) += mr75203.o obj-$(CONFIG_SENSORS_NCT6683) += nct6683.o +obj-$(CONFIG_SENSORS_NCT6775_CORE) += nct6775-core.o +nct6775-objs := nct6775-platform.o obj-$(CONFIG_SENSORS_NCT6775) += nct6775.o +obj-$(CONFIG_SENSORS_NCT6775_I2C) += nct6775-i2c.o obj-$(CONFIG_SENSORS_NCT7802) += nct7802.o obj-$(CONFIG_SENSORS_NCT7904) += nct7904.o obj-$(CONFIG_SENSORS_NPCM7XX) += npcm750-pwm-fan.o diff --git a/drivers/hwmon/asus-ec-sensors.c b/drivers/hwmon/asus-ec-sensors.c index b5cf0136360c..57e11b2bab74 100644 --- a/drivers/hwmon/asus-ec-sensors.c +++ b/drivers/hwmon/asus-ec-sensors.c @@ -54,8 +54,10 @@ static char *mutex_path_override; /* ACPI mutex for locking access to the EC for the firmware */ #define ASUS_HW_ACCESS_MUTEX_ASMX "\\AMW0.ASMX" -/* There are two variants of the vendor spelling */ -#define VENDOR_ASUS_UPPER_CASE "ASUSTeK COMPUTER INC." 
+#define MAX_IDENTICAL_BOARD_VARIATIONS 3 + +/* Moniker for the ACPI global lock (':' is not allowed in ASL identifiers) */ +#define ACPI_GLOBAL_LOCK_PSEUDO_PATH ":GLOBAL_LOCK" typedef union { u32 value; @@ -133,8 +135,44 @@ enum ec_sensors { #define SENSOR_TEMP_WATER_IN BIT(ec_sensor_temp_water_in) #define SENSOR_TEMP_WATER_OUT BIT(ec_sensor_temp_water_out) +enum board_family { + family_unknown, + family_amd_400_series, + family_amd_500_series, +}; + /* All the known sensors for ASUS EC controllers */ -static const struct ec_sensor_info known_ec_sensors[] = { +static const struct ec_sensor_info sensors_family_amd_400[] = { + [ec_sensor_temp_chipset] = + EC_SENSOR("Chipset", hwmon_temp, 1, 0x00, 0x3a), + [ec_sensor_temp_cpu] = + EC_SENSOR("CPU", hwmon_temp, 1, 0x00, 0x3b), + [ec_sensor_temp_mb] = + EC_SENSOR("Motherboard", hwmon_temp, 1, 0x00, 0x3c), + [ec_sensor_temp_t_sensor] = + EC_SENSOR("T_Sensor", hwmon_temp, 1, 0x00, 0x3d), + [ec_sensor_temp_vrm] = + EC_SENSOR("VRM", hwmon_temp, 1, 0x00, 0x3e), + [ec_sensor_in_cpu_core] = + EC_SENSOR("CPU Core", hwmon_in, 2, 0x00, 0xa2), + [ec_sensor_fan_cpu_opt] = + EC_SENSOR("CPU_Opt", hwmon_fan, 2, 0x00, 0xbc), + [ec_sensor_fan_vrm_hs] = + EC_SENSOR("VRM HS", hwmon_fan, 2, 0x00, 0xb2), + [ec_sensor_fan_chipset] = + /* no chipset fans in this generation */ + EC_SENSOR("Chipset", hwmon_fan, 0, 0x00, 0x00), + [ec_sensor_fan_water_flow] = + EC_SENSOR("Water_Flow", hwmon_fan, 2, 0x00, 0xb4), + [ec_sensor_curr_cpu] = + EC_SENSOR("CPU", hwmon_curr, 1, 0x00, 0xf4), + [ec_sensor_temp_water_in] = + EC_SENSOR("Water_In", hwmon_temp, 1, 0x01, 0x0d), + [ec_sensor_temp_water_out] = + EC_SENSOR("Water_Out", hwmon_temp, 1, 0x01, 0x0b), +}; + +static const struct ec_sensor_info sensors_family_amd_500[] = { [ec_sensor_temp_chipset] = EC_SENSOR("Chipset", hwmon_temp, 1, 0x00, 0x3a), [ec_sensor_temp_cpu] = EC_SENSOR("CPU", hwmon_temp, 1, 0x00, 0x3b), @@ -164,68 +202,134 @@ static const struct ec_sensor_info known_ec_sensors[] = { (SENSOR_TEMP_CHIPSET | SENSOR_TEMP_CPU | SENSOR_TEMP_MB) #define SENSOR_SET_TEMP_WATER (SENSOR_TEMP_WATER_IN | SENSOR_TEMP_WATER_OUT) -#define DMI_EXACT_MATCH_BOARD(vendor, name, sensors) { \ - .matches = { \ - DMI_EXACT_MATCH(DMI_BOARD_VENDOR, vendor), \ - DMI_EXACT_MATCH(DMI_BOARD_NAME, name), \ - }, \ - .driver_data = (void *)(sensors), \ -} +struct ec_board_info { + const char *board_names[MAX_IDENTICAL_BOARD_VARIATIONS]; + unsigned long sensors; + /* + * Defines which mutex to use for guarding access to the state and the + * hardware. Can be either a full path to an AML mutex or the + * pseudo-path ACPI_GLOBAL_LOCK_PSEUDO_PATH to use the global ACPI lock, + * or left empty to use a regular mutex object, in which case access to + * the hardware is not guarded. 
+	 */
+	const char *mutex_path;
+	enum board_family family;
+};
 
-static const struct dmi_system_id asus_ec_dmi_table[] __initconst = {
-	DMI_EXACT_MATCH_BOARD(VENDOR_ASUS_UPPER_CASE, "PRIME X570-PRO",
-		SENSOR_SET_TEMP_CHIPSET_CPU_MB | SENSOR_TEMP_VRM |
-		SENSOR_TEMP_T_SENSOR | SENSOR_FAN_CHIPSET),
-	DMI_EXACT_MATCH_BOARD(VENDOR_ASUS_UPPER_CASE, "Pro WS X570-ACE",
-		SENSOR_SET_TEMP_CHIPSET_CPU_MB | SENSOR_TEMP_VRM |
-		SENSOR_FAN_CHIPSET | SENSOR_CURR_CPU | SENSOR_IN_CPU_CORE),
-	DMI_EXACT_MATCH_BOARD(VENDOR_ASUS_UPPER_CASE,
-		"ROG CROSSHAIR VIII DARK HERO",
-		SENSOR_SET_TEMP_CHIPSET_CPU_MB | SENSOR_TEMP_T_SENSOR |
-		SENSOR_TEMP_VRM | SENSOR_SET_TEMP_WATER |
-		SENSOR_FAN_CPU_OPT | SENSOR_FAN_WATER_FLOW |
-		SENSOR_CURR_CPU | SENSOR_IN_CPU_CORE),
-	DMI_EXACT_MATCH_BOARD(VENDOR_ASUS_UPPER_CASE,
-		"ROG CROSSHAIR VIII FORMULA",
-		SENSOR_SET_TEMP_CHIPSET_CPU_MB | SENSOR_TEMP_T_SENSOR |
-		SENSOR_TEMP_VRM | SENSOR_FAN_CPU_OPT | SENSOR_FAN_CHIPSET |
-		SENSOR_CURR_CPU | SENSOR_IN_CPU_CORE),
-	DMI_EXACT_MATCH_BOARD(VENDOR_ASUS_UPPER_CASE, "ROG CROSSHAIR VIII HERO",
-		SENSOR_SET_TEMP_CHIPSET_CPU_MB | SENSOR_TEMP_T_SENSOR |
-		SENSOR_TEMP_VRM | SENSOR_SET_TEMP_WATER |
-		SENSOR_FAN_CPU_OPT | SENSOR_FAN_CHIPSET |
-		SENSOR_FAN_WATER_FLOW | SENSOR_CURR_CPU | SENSOR_IN_CPU_CORE),
-	DMI_EXACT_MATCH_BOARD(VENDOR_ASUS_UPPER_CASE,
-		"ROG CROSSHAIR VIII HERO (WI-FI)",
-		SENSOR_SET_TEMP_CHIPSET_CPU_MB | SENSOR_TEMP_T_SENSOR |
-		SENSOR_TEMP_VRM | SENSOR_SET_TEMP_WATER |
-		SENSOR_FAN_CPU_OPT | SENSOR_FAN_CHIPSET |
-		SENSOR_FAN_WATER_FLOW | SENSOR_CURR_CPU | SENSOR_IN_CPU_CORE),
-	DMI_EXACT_MATCH_BOARD(VENDOR_ASUS_UPPER_CASE,
-		"ROG CROSSHAIR VIII IMPACT",
-		SENSOR_SET_TEMP_CHIPSET_CPU_MB | SENSOR_TEMP_T_SENSOR |
-		SENSOR_TEMP_VRM | SENSOR_FAN_CHIPSET |
-		SENSOR_CURR_CPU | SENSOR_IN_CPU_CORE),
-	DMI_EXACT_MATCH_BOARD(VENDOR_ASUS_UPPER_CASE, "ROG STRIX B550-E GAMING",
-		SENSOR_SET_TEMP_CHIPSET_CPU_MB |
-		SENSOR_TEMP_T_SENSOR |
-		SENSOR_TEMP_VRM | SENSOR_FAN_CPU_OPT),
-	DMI_EXACT_MATCH_BOARD(VENDOR_ASUS_UPPER_CASE, "ROG STRIX B550-I GAMING",
-		SENSOR_SET_TEMP_CHIPSET_CPU_MB |
-		SENSOR_TEMP_T_SENSOR |
-		SENSOR_TEMP_VRM | SENSOR_FAN_VRM_HS |
-		SENSOR_CURR_CPU | SENSOR_IN_CPU_CORE),
-	DMI_EXACT_MATCH_BOARD(VENDOR_ASUS_UPPER_CASE, "ROG STRIX X570-E GAMING",
-		SENSOR_SET_TEMP_CHIPSET_CPU_MB |
-		SENSOR_TEMP_T_SENSOR |
-		SENSOR_TEMP_VRM | SENSOR_FAN_CHIPSET |
-		SENSOR_CURR_CPU | SENSOR_IN_CPU_CORE),
-	DMI_EXACT_MATCH_BOARD(VENDOR_ASUS_UPPER_CASE, "ROG STRIX X570-F GAMING",
-		SENSOR_SET_TEMP_CHIPSET_CPU_MB |
-		SENSOR_TEMP_T_SENSOR | SENSOR_FAN_CHIPSET),
-	DMI_EXACT_MATCH_BOARD(VENDOR_ASUS_UPPER_CASE, "ROG STRIX X570-I GAMING",
-		SENSOR_TEMP_T_SENSOR | SENSOR_FAN_VRM_HS |
-		SENSOR_FAN_CHIPSET | SENSOR_CURR_CPU | SENSOR_IN_CPU_CORE),
+static const struct ec_board_info board_info[] = {
+	{
+		.board_names = {"PRIME X470-PRO"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR | SENSOR_TEMP_VRM |
+			SENSOR_FAN_CPU_OPT |
+			SENSOR_CURR_CPU | SENSOR_IN_CPU_CORE,
+		.mutex_path = ACPI_GLOBAL_LOCK_PSEUDO_PATH,
+		.family = family_amd_400_series,
+	},
+	{
+		.board_names = {"PRIME X570-PRO"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB | SENSOR_TEMP_VRM |
+			SENSOR_TEMP_T_SENSOR | SENSOR_FAN_CHIPSET,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ProArt X570-CREATOR WIFI"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB | SENSOR_TEMP_VRM |
+			SENSOR_TEMP_T_SENSOR | SENSOR_FAN_CPU_OPT |
+			SENSOR_CURR_CPU | SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"Pro WS X570-ACE"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB | SENSOR_TEMP_VRM |
+			SENSOR_TEMP_T_SENSOR | SENSOR_FAN_CHIPSET |
+			SENSOR_CURR_CPU | SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG CROSSHAIR VIII DARK HERO"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR |
+			SENSOR_TEMP_VRM | SENSOR_SET_TEMP_WATER |
+			SENSOR_FAN_CPU_OPT | SENSOR_FAN_WATER_FLOW |
+			SENSOR_CURR_CPU | SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {
+			"ROG CROSSHAIR VIII FORMULA",
+			"ROG CROSSHAIR VIII HERO",
+			"ROG CROSSHAIR VIII HERO (WI-FI)",
+		},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR |
+			SENSOR_TEMP_VRM | SENSOR_SET_TEMP_WATER |
+			SENSOR_FAN_CPU_OPT | SENSOR_FAN_CHIPSET |
+			SENSOR_FAN_WATER_FLOW | SENSOR_CURR_CPU |
+			SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG CROSSHAIR VIII IMPACT"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR | SENSOR_TEMP_VRM |
+			SENSOR_FAN_CHIPSET | SENSOR_CURR_CPU |
+			SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG STRIX B550-E GAMING"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR | SENSOR_TEMP_VRM |
+			SENSOR_FAN_CPU_OPT,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG STRIX B550-I GAMING"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR | SENSOR_TEMP_VRM |
+			SENSOR_FAN_VRM_HS | SENSOR_CURR_CPU |
+			SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG STRIX X570-E GAMING"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR | SENSOR_TEMP_VRM |
+			SENSOR_FAN_CHIPSET | SENSOR_CURR_CPU |
+			SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG STRIX X570-E GAMING WIFI II"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR | SENSOR_CURR_CPU |
+			SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG STRIX X570-F GAMING"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR | SENSOR_FAN_CHIPSET,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG STRIX X570-I GAMING"},
+		.sensors = SENSOR_TEMP_T_SENSOR | SENSOR_FAN_VRM_HS |
+			SENSOR_FAN_CHIPSET | SENSOR_CURR_CPU |
+			SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
 	{}
 };
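With the DMI match table replaced by this plain array, the probe code (outside this hunk) has to map the DMI board name onto an ec_board_info entry by walking board_names[]. The driver's real lookup is not part of this excerpt; a hypothetical equivalent of that search is sketched below:

	/*
	 * Hypothetical sketch of the board lookup over board_info[]; the
	 * terminating {} entry has no board names and ends the scan.
	 */
	static const struct ec_board_info *find_board(const char *dmi_name)
	{
		const struct ec_board_info *board;
		unsigned int i;

		for (board = board_info; board->board_names[0]; board++) {
			for (i = 0; i < MAX_IDENTICAL_BOARD_VARIATIONS &&
				    board->board_names[i]; i++) {
				if (!strcmp(board->board_names[i], dmi_name))
					return board;
			}
		}
		return NULL;
	}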
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB | SENSOR_TEMP_VRM |
+			SENSOR_TEMP_T_SENSOR | SENSOR_FAN_CHIPSET |
+			SENSOR_CURR_CPU | SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG CROSSHAIR VIII DARK HERO"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR |
+			SENSOR_TEMP_VRM | SENSOR_SET_TEMP_WATER |
+			SENSOR_FAN_CPU_OPT | SENSOR_FAN_WATER_FLOW |
+			SENSOR_CURR_CPU | SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {
+			"ROG CROSSHAIR VIII FORMULA",
+			"ROG CROSSHAIR VIII HERO",
+			"ROG CROSSHAIR VIII HERO (WI-FI)",
+		},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR |
+			SENSOR_TEMP_VRM | SENSOR_SET_TEMP_WATER |
+			SENSOR_FAN_CPU_OPT | SENSOR_FAN_CHIPSET |
+			SENSOR_FAN_WATER_FLOW | SENSOR_CURR_CPU |
+			SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG CROSSHAIR VIII IMPACT"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR | SENSOR_TEMP_VRM |
+			SENSOR_FAN_CHIPSET | SENSOR_CURR_CPU |
+			SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG STRIX B550-E GAMING"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR | SENSOR_TEMP_VRM |
+			SENSOR_FAN_CPU_OPT,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG STRIX B550-I GAMING"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR | SENSOR_TEMP_VRM |
+			SENSOR_FAN_VRM_HS | SENSOR_CURR_CPU |
+			SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG STRIX X570-E GAMING"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR | SENSOR_TEMP_VRM |
+			SENSOR_FAN_CHIPSET | SENSOR_CURR_CPU |
+			SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG STRIX X570-E GAMING WIFI II"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR | SENSOR_CURR_CPU |
+			SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG STRIX X570-F GAMING"},
+		.sensors = SENSOR_SET_TEMP_CHIPSET_CPU_MB |
+			SENSOR_TEMP_T_SENSOR | SENSOR_FAN_CHIPSET,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{
+		.board_names = {"ROG STRIX X570-I GAMING"},
+		.sensors = SENSOR_TEMP_T_SENSOR | SENSOR_FAN_VRM_HS |
+			SENSOR_FAN_CHIPSET | SENSOR_CURR_CPU |
+			SENSOR_IN_CPU_CORE,
+		.mutex_path = ASUS_HW_ACCESS_MUTEX_ASMX,
+		.family = family_amd_500_series,
+	},
+	{}
 };
@@ -234,8 +338,49 @@ struct ec_sensor {
 	s32 cached_value;
 };
 
+struct lock_data {
+	union {
+		acpi_handle aml;
+		/* global lock handle */
+		u32 glk;
+	} mutex;
+	bool (*lock)(struct lock_data *data);
+	bool (*unlock)(struct lock_data *data);
+};
+
+/*
+ * The next function pairs implement the options for locking access to
+ * the state and the EC
+ */
+static bool lock_via_acpi_mutex(struct lock_data *data)
+{
+	/*
+	 * ASUS DSDT does not specify that access to the EC has to be guarded,
+	 * but firmware does access it via ACPI
+	 */
+	return ACPI_SUCCESS(acpi_acquire_mutex(data->mutex.aml,
+					       NULL, ACPI_LOCK_DELAY_MS));
+}
+
+static bool unlock_acpi_mutex(struct lock_data *data)
+{
+	return ACPI_SUCCESS(acpi_release_mutex(data->mutex.aml, NULL));
+}
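+/*
+ * Illustration (hypothetical caller; update_ec_sensors() below is the
+ * real one): both strategies are only ever driven through the callbacks:
+ *
+ *	if (!state->lock_data.lock(&state->lock_data))
+ *		return -EBUSY;
+ *	...guarded EC access...
+ *	if (!state->lock_data.unlock(&state->lock_data))
+ *		dev_err(dev, "Failed to release mutex");
+ */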
+
+static bool lock_via_global_acpi_lock(struct lock_data *data)
+{
+	return ACPI_SUCCESS(acpi_acquire_global_lock(ACPI_LOCK_DELAY_MS,
+						     &data->mutex.glk));
+}
+
+static bool unlock_global_acpi_lock(struct lock_data *data)
+{
+	return ACPI_SUCCESS(acpi_release_global_lock(data->mutex.glk));
+}
+
 struct ec_sensors_data {
-	unsigned long board_sensors;
+	const struct ec_board_info *board_info;
+	const struct ec_sensor_info *sensors_info;
 	struct ec_sensor *sensors;
 	/* EC registers to read from */
 	u16 *registers;
@@ -244,7 +389,7 @@ struct ec_sensors_data {
 	u8 banks[ASUS_EC_MAX_BANK + 1];
 	/* in jiffies */
 	unsigned long last_updated;
-	acpi_handle aml_mutex;
+	struct lock_data lock_data;
 	/* number of board EC sensors */
 	u8 nr_sensors;
 	/*
@@ -278,7 +423,7 @@ static bool is_sensor_data_signed(const struct ec_sensor_info *si)
 static const struct ec_sensor_info *
 get_sensor_info(const struct ec_sensors_data *state, int index)
 {
-	return &known_ec_sensors[state->sensors[index].info_index];
+	return state->sensors_info + state->sensors[index].info_index;
 }
 
 static int find_ec_sensor_index(const struct ec_sensors_data *ec,
@@ -301,11 +446,6 @@ static int __init bank_compare(const void *a, const void *b)
 	return *((const s8 *)a) - *((const s8 *)b);
 }
 
-static int __init board_sensors_count(unsigned long sensors)
-{
-	return hweight_long(sensors);
-}
-
 static void __init setup_sensor_data(struct ec_sensors_data *ec)
 {
 	struct ec_sensor *s = ec->sensors;
@@ -316,14 +456,14 @@ static void __init setup_sensor_data(struct ec_sensors_data *ec)
 	ec->nr_banks = 0;
 	ec->nr_registers = 0;
 
-	for_each_set_bit(i, &ec->board_sensors,
-			 BITS_PER_TYPE(ec->board_sensors)) {
+	for_each_set_bit(i, &ec->board_info->sensors,
+			 BITS_PER_TYPE(ec->board_info->sensors)) {
 		s->info_index = i;
 		s->cached_value = 0;
 		ec->nr_registers +=
-			known_ec_sensors[s->info_index].addr.components.size;
+			ec->sensors_info[s->info_index].addr.components.size;
 		bank_found = false;
-		bank = known_ec_sensors[s->info_index].addr.components.bank;
+		bank = ec->sensors_info[s->info_index].addr.components.bank;
 		for (j = 0; j < ec->nr_banks; j++) {
 			if (ec->banks[j] == bank) {
 				bank_found = true;
@@ -353,23 +493,36 @@ static void __init fill_ec_registers(struct ec_sensors_data *ec)
 	}
 }
 
-static acpi_handle __init asus_hw_access_mutex(struct device *dev)
+static int __init setup_lock_data(struct device *dev)
 {
 	const char *mutex_path;
-	acpi_handle res;
 	int status;
+	struct ec_sensors_data *state = dev_get_drvdata(dev);
 
 	mutex_path = mutex_path_override ?
-		mutex_path_override : ASUS_HW_ACCESS_MUTEX_ASMX;
+		mutex_path_override : state->board_info->mutex_path;
 
-	status = acpi_get_handle(NULL, (acpi_string)mutex_path, &res);
-	if (ACPI_FAILURE(status)) {
-		dev_err(dev,
-			"Could not get hardware access guard mutex '%s': error %d",
-			mutex_path, status);
-		return NULL;
+	if (!mutex_path || !strlen(mutex_path)) {
+		dev_err(dev, "Hardware access guard mutex name is empty");
+		return -EINVAL;
 	}
-	return res;
+	if (!strcmp(mutex_path, ACPI_GLOBAL_LOCK_PSEUDO_PATH)) {
+		state->lock_data.mutex.glk = 0;
+		state->lock_data.lock = lock_via_global_acpi_lock;
+		state->lock_data.unlock = unlock_global_acpi_lock;
+	} else {
+		status = acpi_get_handle(NULL, (acpi_string)mutex_path,
+					 &state->lock_data.mutex.aml);
+		if (ACPI_FAILURE(status)) {
+			dev_err(dev,
+				"Failed to get hardware access guard AML mutex '%s': error %d",
+				mutex_path, status);
+			return -ENOENT;
+		}
+		state->lock_data.lock = lock_via_acpi_mutex;
+		state->lock_data.unlock = unlock_acpi_mutex;
+	}
+	return 0;
 }
 
 static int asus_ec_bank_switch(u8 bank, u8 *old)
@@ -457,10 +610,11 @@ static inline s32 get_sensor_value(const struct ec_sensor_info *si, u8 *data)
 static void update_sensor_values(struct ec_sensors_data *ec, u8 *data)
 {
 	const struct ec_sensor_info *si;
-	struct ec_sensor *s;
+	struct ec_sensor *s, *sensor_end;
 
-	for (s = ec->sensors; s != ec->sensors + ec->nr_sensors; s++) {
-		si = &known_ec_sensors[s->info_index];
+	sensor_end = ec->sensors + ec->nr_sensors;
+	for (s = ec->sensors; s != sensor_end; s++) {
+		si = ec->sensors_info + s->info_index;
 		s->cached_value = get_sensor_value(si, data);
 		data += si->addr.components.size;
 	}
@@ -471,15 +625,9 @@ static int update_ec_sensors(const struct device *dev,
 {
 	int status;
 
-	/*
-	 * ASUS DSDT does not specify that access to the EC has to be guarded,
-	 * but firmware does access it via ACPI
-	 */
-	if (ACPI_FAILURE(acpi_acquire_mutex(ec->aml_mutex, NULL,
-					    ACPI_LOCK_DELAY_MS))) {
-		dev_err(dev, "Failed to acquire AML mutex");
-		status = -EBUSY;
-		goto cleanup;
+	if (!ec->lock_data.lock(&ec->lock_data)) {
+		dev_warn(dev, "Failed to acquire mutex");
+		return -EBUSY;
 	}
 
 	status = asus_ec_block_read(dev, ec);
@@ -487,10 +635,10 @@ static int update_ec_sensors(const struct device *dev,
 	if (!status) {
 		update_sensor_values(ec, ec->read_buffer);
 	}
-	if (ACPI_FAILURE(acpi_release_mutex(ec->aml_mutex, NULL))) {
-		dev_err(dev, "Failed to release AML mutex");
-	}
-cleanup:
+
+	if (!ec->lock_data.unlock(&ec->lock_data))
+		dev_err(dev, "Failed to release mutex");
+
 	return status;
 }
 
@@ -597,12 +745,24 @@ static struct hwmon_chip_info asus_ec_chip_info = {
 	.ops = &asus_ec_hwmon_ops,
 };
 
-static unsigned long __init get_board_sensors(void)
+static const struct ec_board_info * __init get_board_info(void)
 {
-	const struct dmi_system_id *dmi_entry =
-		dmi_first_match(asus_ec_dmi_table);
+	const char *dmi_board_vendor = dmi_get_system_info(DMI_BOARD_VENDOR);
+	const char *dmi_board_name = dmi_get_system_info(DMI_BOARD_NAME);
+	const struct ec_board_info *board;
 
-	return dmi_entry ? (unsigned long)dmi_entry->driver_data : 0;
+	if (!dmi_board_vendor || !dmi_board_name ||
+	    strcasecmp(dmi_board_vendor, "ASUSTeK COMPUTER INC."))
+		return NULL;
+
+	for (board = board_info; board->sensors; board++) {
+		if (match_string(board->board_names,
+				 MAX_IDENTICAL_BOARD_VARIATIONS,
+				 dmi_board_name) >= 0)
+			return board;
+	}
+
+	return NULL;
 }
 
 static int __init asus_ec_probe(struct platform_device *pdev)
@@ -610,17 +770,18 @@ static int __init asus_ec_probe(struct platform_device *pdev)
 	const struct hwmon_channel_info **ptr_asus_ec_ci;
 	int nr_count[hwmon_max] = { 0 }, nr_types = 0;
 	struct hwmon_channel_info *asus_ec_hwmon_chan;
+	const struct ec_board_info *pboard_info;
 	const struct hwmon_chip_info *chip_info;
 	struct device *dev = &pdev->dev;
 	struct ec_sensors_data *ec_data;
 	const struct ec_sensor_info *si;
 	enum hwmon_sensor_types type;
-	unsigned long board_sensors;
 	struct device *hwdev;
 	unsigned int i;
+	int status;
 
-	board_sensors = get_board_sensors();
-	if (!board_sensors)
+	pboard_info = get_board_info();
+	if (!pboard_info)
 		return -ENODEV;
 
 	ec_data = devm_kzalloc(dev, sizeof(struct ec_sensors_data),
@@ -629,11 +790,31 @@ static int __init asus_ec_probe(struct platform_device *pdev)
 		return -ENOMEM;
 
 	dev_set_drvdata(dev, ec_data);
-	ec_data->board_sensors = board_sensors;
-	ec_data->nr_sensors = board_sensors_count(ec_data->board_sensors);
+	ec_data->board_info = pboard_info;
+
+	switch (ec_data->board_info->family) {
+	case family_amd_400_series:
+		ec_data->sensors_info = sensors_family_amd_400;
+		break;
+	case family_amd_500_series:
+		ec_data->sensors_info = sensors_family_amd_500;
+		break;
+	default:
+		dev_err(dev, "Unknown board family: %d",
+			ec_data->board_info->family);
+		return -EINVAL;
+	}
+
+	ec_data->nr_sensors = hweight_long(ec_data->board_info->sensors);
 	ec_data->sensors = devm_kcalloc(dev, ec_data->nr_sensors,
 					sizeof(struct ec_sensor), GFP_KERNEL);
 
+	status = setup_lock_data(dev);
+	if (status) {
+		dev_err(dev, "Failed to setup state/EC locking: %d", status);
+		return status;
+	}
+
 	setup_sensor_data(ec_data);
 
 	ec_data->registers = devm_kcalloc(dev, ec_data->nr_registers,
					  sizeof(u16), GFP_KERNEL);
@@ -645,8 +826,6 @@
 	fill_ec_registers(ec_data);
 
-	ec_data->aml_mutex = asus_hw_access_mutex(dev);
-
 	for (i = 0; i < ec_data->nr_sensors; ++i) {
 		si = get_sensor_info(ec_data, i);
 		if (!nr_count[si->type])
@@ -703,7 +882,14 @@ static struct platform_driver asus_ec_sensors_platform_driver = {
 	},
 };
 
-MODULE_DEVICE_TABLE(dmi, asus_ec_dmi_table);
+MODULE_DEVICE_TABLE(acpi, acpi_ec_ids);
+/*
+ * we use module_platform_driver_probe() rather than module_platform_driver()
+ * because the probe function (and its dependants) are marked with __init, which
+ * means we can't put it into the .probe member of the platform_driver struct
+ * above, and we can't mark the asus_ec_sensors_platform_driver object as __init
+ * because the object is referenced from the module exit code.
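+ *
+ * For illustration only (this is the pattern the above rules out, not
+ * code from this driver):
+ *
+ *	static struct platform_driver drv = {
+ *		.probe = asus_ec_probe,	// __init function: not allowed here
+ *	};
+ *	module_platform_driver(drv);
+ *
+ * would leave a callable pointer to freed __init code after boot.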
+ */
 module_platform_driver_probe(asus_ec_sensors_platform_driver, asus_ec_probe);
 
 module_param_named(mutex_path, mutex_path_override, charp, 0);
diff --git a/drivers/hwmon/nct6775.c b/drivers/hwmon/nct6775-core.c
similarity index 66%
rename from drivers/hwmon/nct6775.c
rename to drivers/hwmon/nct6775-core.c
index 2b91f7e05126..446964cbae4c 100644
--- a/drivers/hwmon/nct6775.c
+++ b/drivers/hwmon/nct6775-core.c
@@ -44,24 +44,20 @@
 #include
 #include
 #include
-#include
 #include
 #include
-#include
 #include
 #include
-#include
 #include
-#include
-#include
 #include
-#include
+#include
 #include "lm75.h"
+#include "nct6775.h"
 
-#define USE_ALTERNATE
+#undef DEFAULT_SYMBOL_NAMESPACE
+#define DEFAULT_SYMBOL_NAMESPACE HWMON_NCT6775
 
-enum kinds { nct6106, nct6116, nct6775, nct6776, nct6779, nct6791, nct6792,
-	     nct6793, nct6795, nct6796, nct6797, nct6798 };
+#define USE_ALTERNATE
 
 /* used to set data->name = nct6775_device_names[data->sio_kind] */
 static const char * const nct6775_device_names[] = {
@@ -79,242 +75,6 @@ static const char * const nct6775_device_names[] = {
 	"nct6798",
 };
 
-static const char * const nct6775_sio_names[] __initconst = {
-	"NCT6106D",
-	"NCT6116D",
-	"NCT6775F",
-	"NCT6776D/F",
-	"NCT6779D",
-	"NCT6791D",
-	"NCT6792D",
-	"NCT6793D",
-	"NCT6795D",
-	"NCT6796D",
-	"NCT6797D",
-	"NCT6798D",
-};
-
-static unsigned short force_id;
-module_param(force_id, ushort, 0);
-MODULE_PARM_DESC(force_id, "Override the detected device ID");
-
-static unsigned short fan_debounce;
-module_param(fan_debounce, ushort, 0);
-MODULE_PARM_DESC(fan_debounce, "Enable debouncing for fan RPM signal");
-
-#define DRVNAME "nct6775"
-
-/*
- * Super-I/O constants and functions
- */
-
-#define NCT6775_LD_ACPI		0x0a
-#define NCT6775_LD_HWM		0x0b
-#define NCT6775_LD_VID		0x0d
-#define NCT6775_LD_12		0x12
-
-#define SIO_REG_LDSEL		0x07	/* Logical device select */
-#define SIO_REG_DEVID		0x20	/* Device ID (2 bytes) */
-#define SIO_REG_ENABLE		0x30	/* Logical device enable */
-#define SIO_REG_ADDR		0x60	/* Logical device address (2 bytes) */
-
-#define SIO_NCT6106_ID		0xc450
-#define SIO_NCT6116_ID		0xd280
-#define SIO_NCT6775_ID		0xb470
-#define SIO_NCT6776_ID		0xc330
-#define SIO_NCT6779_ID		0xc560
-#define SIO_NCT6791_ID		0xc800
-#define SIO_NCT6792_ID		0xc910
-#define SIO_NCT6793_ID		0xd120
-#define SIO_NCT6795_ID		0xd350
-#define SIO_NCT6796_ID		0xd420
-#define SIO_NCT6797_ID		0xd450
-#define SIO_NCT6798_ID		0xd428
-#define SIO_ID_MASK		0xFFF8
-
-enum pwm_enable { off, manual, thermal_cruise, speed_cruise, sf3, sf4 };
-enum sensor_access { access_direct, access_asuswmi };
-
-struct nct6775_sio_data {
-	int sioreg;
-	int ld;
-	enum kinds kind;
-	enum sensor_access access;
-
-	/* superio_() callbacks */
-	void (*sio_outb)(struct nct6775_sio_data *sio_data, int reg, int val);
-	int (*sio_inb)(struct nct6775_sio_data *sio_data, int reg);
-	void (*sio_select)(struct nct6775_sio_data *sio_data, int ld);
-	int (*sio_enter)(struct nct6775_sio_data *sio_data);
-	void (*sio_exit)(struct nct6775_sio_data *sio_data);
-};
-
-#define ASUSWMI_MONITORING_GUID		"466747A0-70EC-11DE-8A39-0800200C9A66"
-#define ASUSWMI_METHODID_RSIO		0x5253494F
-#define ASUSWMI_METHODID_WSIO		0x5753494F
-#define ASUSWMI_METHODID_RHWM		0x5248574D
-#define ASUSWMI_METHODID_WHWM		0x5748574D
-#define ASUSWMI_UNSUPPORTED_METHOD	0xFFFFFFFE
-
-static int nct6775_asuswmi_evaluate_method(u32 method_id, u8 bank, u8 reg, u8 val, u32 *retval)
-{
-#if IS_ENABLED(CONFIG_ACPI_WMI)
-	u32 args = bank | (reg << 8) | (val << 16);
-	struct acpi_buffer input = { (acpi_size) sizeof(args), &args };
-	struct acpi_buffer output = { ACPI_ALLOCATE_BUFFER, NULL };
-	acpi_status status;
-	union acpi_object *obj;
-	u32 tmp = ASUSWMI_UNSUPPORTED_METHOD;
-
-	status = wmi_evaluate_method(ASUSWMI_MONITORING_GUID, 0,
-				     method_id, &input, &output);
-
-	if (ACPI_FAILURE(status))
-		return -EIO;
-
-	obj = output.pointer;
-	if (obj && obj->type == ACPI_TYPE_INTEGER)
-		tmp = obj->integer.value;
-
-	if (retval)
-		*retval = tmp;
-
-	kfree(obj);
-
-	if (tmp == ASUSWMI_UNSUPPORTED_METHOD)
-		return -ENODEV;
-	return 0;
-#else
-	return -EOPNOTSUPP;
-#endif
-}
-
-static inline int nct6775_asuswmi_write(u8 bank, u8 reg, u8 val)
-{
-	return nct6775_asuswmi_evaluate_method(ASUSWMI_METHODID_WHWM, bank,
-					       reg, val, NULL);
-}
-
-static inline int nct6775_asuswmi_read(u8 bank, u8 reg, u8 *val)
-{
-	u32 ret, tmp = 0;
-
-	ret = nct6775_asuswmi_evaluate_method(ASUSWMI_METHODID_RHWM, bank,
-					      reg, 0, &tmp);
-	*val = tmp;
-	return ret;
-}
-
-static int superio_wmi_inb(struct nct6775_sio_data *sio_data, int reg)
-{
-	int tmp = 0;
-
-	nct6775_asuswmi_evaluate_method(ASUSWMI_METHODID_RSIO, sio_data->ld,
-					reg, 0, &tmp);
-	return tmp;
-}
-
-static void superio_wmi_outb(struct nct6775_sio_data *sio_data, int reg, int val)
-{
-	nct6775_asuswmi_evaluate_method(ASUSWMI_METHODID_WSIO, sio_data->ld,
-					reg, val, NULL);
-}
-
-static void superio_wmi_select(struct nct6775_sio_data *sio_data, int ld)
-{
-	sio_data->ld = ld;
-}
-
-static int superio_wmi_enter(struct nct6775_sio_data *sio_data)
-{
-	return 0;
-}
-
-static void superio_wmi_exit(struct nct6775_sio_data *sio_data)
-{
-}
-
-static void superio_outb(struct nct6775_sio_data *sio_data, int reg, int val)
-{
-	int ioreg = sio_data->sioreg;
-
-	outb(reg, ioreg);
-	outb(val, ioreg + 1);
-}
-
-static int superio_inb(struct nct6775_sio_data *sio_data, int reg)
-{
-	int ioreg = sio_data->sioreg;
-
-	outb(reg, ioreg);
-	return inb(ioreg + 1);
-}
-
-static void superio_select(struct nct6775_sio_data *sio_data, int ld)
-{
-	int ioreg = sio_data->sioreg;
-
-	outb(SIO_REG_LDSEL, ioreg);
-	outb(ld, ioreg + 1);
-}
-
-static int superio_enter(struct nct6775_sio_data *sio_data)
-{
-	int ioreg = sio_data->sioreg;
-
-	/*
-	 * Try to reserve <ioreg> and <ioreg + 1> for exclusive access.
-	 */
-	if (!request_muxed_region(ioreg, 2, DRVNAME))
-		return -EBUSY;
-
-	outb(0x87, ioreg);
-	outb(0x87, ioreg);
-
-	return 0;
-}
-
-static void superio_exit(struct nct6775_sio_data *sio_data)
-{
-	int ioreg = sio_data->sioreg;
-
-	outb(0xaa, ioreg);
-	outb(0x02, ioreg);
-	outb(0x02, ioreg + 1);
-	release_region(ioreg, 2);
-}
-
-/*
- * ISA constants
- */
-
-#define IOREGION_ALIGNMENT	(~7)
-#define IOREGION_OFFSET		5
-#define IOREGION_LENGTH		2
-#define ADDR_REG_OFFSET		0
-#define DATA_REG_OFFSET		1
-
-#define NCT6775_REG_BANK	0x4E
-#define NCT6775_REG_CONFIG	0x40
-#define NCT6775_PORT_CHIPID	0x58
-
-/*
- * Not currently used:
- * REG_MAN_ID has the value 0x5ca3 for all supported chips.
- * REG_CHIP_ID == 0x88/0xa1/0xc1 depending on chip model.
- * REG_MAN_ID is at port 0x4f
- * REG_CHIP_ID is at port 0x58
- */
-
-#define NUM_TEMP	10	/* Max number of temp attribute sets w/ limits*/
-#define NUM_TEMP_FIXED	6	/* Max number of fixed temp attribute sets */
-#define NUM_TSI_TEMP	8	/* Max number of TSI temp register pairs */
-
-#define NUM_REG_ALARM	7	/* Max number of alarm registers */
-#define NUM_REG_BEEP	5	/* Max number of beep registers */
-
-#define NUM_FAN		7
-
 /* Common and NCT6775 specific data */
 
 /* Voltage min/max registers for nr=7..14 are in bank 5 */
@@ -333,11 +93,6 @@ static const u16 NCT6775_REG_IN[] = {
 #define NCT6775_REG_DIODE		0x5E
 #define NCT6775_DIODE_MASK		0x02
 
-#define NCT6775_REG_FANDIV1		0x506
-#define NCT6775_REG_FANDIV2		0x507
-
-#define NCT6775_REG_CR_FAN_DEBOUNCE	0xf0
-
 static const u16 NCT6775_REG_ALARM[NUM_REG_ALARM] = { 0x459, 0x45A, 0x45B };
 
 /* 0..15 voltages, 16..23 fans, 24..29 temperatures, 30..31 intrusion */
@@ -351,10 +106,6 @@ static const s8 NCT6775_ALARM_BITS[] = {
 	4, 5, 13, -1, -1, -1,	/* temp1..temp6 */
 	12, -1 };		/* intrusion0, intrusion1 */
 
-#define FAN_ALARM_BASE		16
-#define TEMP_ALARM_BASE		24
-#define INTRUSION_ALARM_BASE	30
-
 static const u16 NCT6775_REG_BEEP[NUM_REG_BEEP] = { 0x56, 0x57, 0x453, 0x4e };
 
 /*
@@ -370,11 +121,6 @@ static const s8 NCT6775_BEEP_BITS[] = {
 	4, 5, 13, -1, -1, -1,	/* temp1..temp6 */
 	12, -1 };		/* intrusion0, intrusion1 */
 
-#define BEEP_ENABLE_BASE		15
-
-static const u8 NCT6775_REG_CR_CASEOPEN_CLR[] = { 0xe6, 0xee };
-static const u8 NCT6775_CR_CASEOPEN_CLR_MASK[] = { 0x20, 0x01 };
-
 /* DC or PWM output fan configuration */
 static const u8 NCT6775_REG_PWM_MODE[] = { 0x04, 0x04, 0x12 };
 static const u8 NCT6775_PWM_MODE_MASK[] = { 0x01, 0x02, 0x01 };
@@ -690,8 +436,6 @@ static const u16 NCT6779_REG_TEMP_CRIT[32] = {
 
 /* NCT6791 specific data */
 
-#define NCT6791_REG_HM_IO_SPACE_LOCK_ENABLE	0x28
-
 static const u16 NCT6791_REG_WEIGHT_TEMP_SEL[NUM_FAN] = { 0, 0x239 };
 static const u16 NCT6791_REG_WEIGHT_TEMP_STEP[NUM_FAN] = { 0, 0x23a };
 static const u16 NCT6791_REG_WEIGHT_TEMP_STEP_TOL[NUM_FAN] = { 0, 0x23b };
@@ -1191,165 +935,6 @@ static inline unsigned int tsi_temp_from_reg(unsigned int reg)
  * Data structures and manipulation thereof
  */
 
-struct nct6775_data {
-	int addr;	/* IO base of hw monitor block */
-	struct nct6775_sio_data *sio_data;
-	enum kinds kind;
-	const char *name;
-
-	const struct attribute_group *groups[7];
-
-	u16 reg_temp[5][NUM_TEMP]; /* 0=temp, 1=temp_over, 2=temp_hyst,
-				    * 3=temp_crit, 4=temp_lcrit
-				    */
-	u8 temp_src[NUM_TEMP];
-	u16 reg_temp_config[NUM_TEMP];
-	const char * const *temp_label;
-	u32 temp_mask;
-	u32 virt_temp_mask;
-
-	u16 REG_CONFIG;
-	u16 REG_VBAT;
-	u16 REG_DIODE;
-	u8 DIODE_MASK;
-
-	const s8 *ALARM_BITS;
-	const s8 *BEEP_BITS;
-
-	const u16 *REG_VIN;
-	const u16 *REG_IN_MINMAX[2];
-
-	const u16 *REG_TARGET;
-	const u16 *REG_FAN;
-	const u16 *REG_FAN_MODE;
-	const u16 *REG_FAN_MIN;
-	const u16 *REG_FAN_PULSES;
-	const u16 *FAN_PULSE_SHIFT;
-	const u16 *REG_FAN_TIME[3];
-
-	const u16 *REG_TOLERANCE_H;
-
-	const u8 *REG_PWM_MODE;
-	const u8 *PWM_MODE_MASK;
-
-	const u16 *REG_PWM[7];	/* [0]=pwm, [1]=pwm_start, [2]=pwm_floor,
-				 * [3]=pwm_max, [4]=pwm_step,
-				 * [5]=weight_duty_step, [6]=weight_duty_base
-				 */
-	const u16 *REG_PWM_READ;
-
-	const u16 *REG_CRITICAL_PWM_ENABLE;
-	u8 CRITICAL_PWM_ENABLE_MASK;
-	const u16 *REG_CRITICAL_PWM;
-
-	const u16 *REG_AUTO_TEMP;
-	const u16 *REG_AUTO_PWM;
-
-	const u16 *REG_CRITICAL_TEMP;
-	const u16 *REG_CRITICAL_TEMP_TOLERANCE;
-
-	const u16 *REG_TEMP_SOURCE;	/* temp register sources */
-	const u16 *REG_TEMP_SEL;
-	const u16 *REG_WEIGHT_TEMP_SEL;
-	const u16 *REG_WEIGHT_TEMP[3];	/* 0=base, 1=tolerance, 2=step */
-
-	const u16 *REG_TEMP_OFFSET;
-
-	const u16 *REG_ALARM;
-	const u16 *REG_BEEP;
-
-	const u16 *REG_TSI_TEMP;
-
-	unsigned int (*fan_from_reg)(u16 reg, unsigned int divreg);
-	unsigned int (*fan_from_reg_min)(u16 reg, unsigned int divreg);
-
-	struct mutex update_lock;
-	bool valid;		/* true if following fields are valid */
-	unsigned long last_updated;	/* In jiffies */
-
-	/* Register values */
-	u8 bank;		/* current register bank */
-	u8 in_num;		/* number of in inputs we have */
-	u8 in[15][3];		/* [0]=in, [1]=in_max, [2]=in_min */
-	unsigned int rpm[NUM_FAN];
-	u16 fan_min[NUM_FAN];
-	u8 fan_pulses[NUM_FAN];
-	u8 fan_div[NUM_FAN];
-	u8 has_pwm;
-	u8 has_fan;		/* some fan inputs can be disabled */
-	u8 has_fan_min;		/* some fans don't have min register */
-	bool has_fan_div;
-
-	u8 num_temp_alarms;	/* 2, 3, or 6 */
-	u8 num_temp_beeps;	/* 2, 3, or 6 */
-	u8 temp_fixed_num;	/* 3 or 6 */
-	u8 temp_type[NUM_TEMP_FIXED];
-	s8 temp_offset[NUM_TEMP_FIXED];
-	s16 temp[5][NUM_TEMP]; /* 0=temp, 1=temp_over, 2=temp_hyst,
-				* 3=temp_crit, 4=temp_lcrit */
-	s16 tsi_temp[NUM_TSI_TEMP];
-	u64 alarms;
-	u64 beeps;
-
-	u8 pwm_num;	/* number of pwm */
-	u8 pwm_mode[NUM_FAN];	/* 0->DC variable voltage,
-				 * 1->PWM variable duty cycle
-				 */
-	enum pwm_enable pwm_enable[NUM_FAN];
-			/* 0->off
-			 * 1->manual
-			 * 2->thermal cruise mode (also called SmartFan I)
-			 * 3->fan speed cruise mode
-			 * 4->SmartFan III
-			 * 5->enhanced variable thermal cruise (SmartFan IV)
-			 */
-	u8 pwm[7][NUM_FAN];	/* [0]=pwm, [1]=pwm_start, [2]=pwm_floor,
-				 * [3]=pwm_max, [4]=pwm_step,
-				 * [5]=weight_duty_step, [6]=weight_duty_base
-				 */
-
-	u8 target_temp[NUM_FAN];
-	u8 target_temp_mask;
-	u32 target_speed[NUM_FAN];
-	u32 target_speed_tolerance[NUM_FAN];
-	u8 speed_tolerance_limit;
-
-	u8 temp_tolerance[2][NUM_FAN];
-	u8 tolerance_mask;
-
-	u8 fan_time[3][NUM_FAN]; /* 0 = stop_time, 1 = step_up, 2 = step_down */
-
-	/* Automatic fan speed control registers */
-	int auto_pwm_num;
-	u8 auto_pwm[NUM_FAN][7];
-	u8 auto_temp[NUM_FAN][7];
-	u8 pwm_temp_sel[NUM_FAN];
-	u8 pwm_weight_temp_sel[NUM_FAN];
-	u8 weight_temp[3][NUM_FAN];	/* 0->temp_step, 1->temp_step_tol,
-					 * 2->temp_base
-					 */
-
-	u8 vid;
-	u8 vrm;
-
-	bool have_vid;
-
-	u16 have_temp;
-	u16 have_temp_fixed;
-	u16 have_tsi_temp;
-	u16 have_in;
-
-	/* Remember extra register values over suspend/resume */
-	u8 vbat;
-	u8 fandiv1;
-	u8 fandiv2;
-	u8 sio_reg_enable;
-
-	/* nct6775_*() callbacks */
-	u16 (*read_value)(struct nct6775_data *data, u16 reg);
-	int (*write_value)(struct nct6775_data *data, u16 reg, u16 value);
-};
-
 struct sensor_device_template {
 	struct device_attribute dev_attr;
 	union {
@@ -1405,10 +990,8 @@ struct sensor_template_group {
 	int base;
 };
 
-static struct attribute_group *
-nct6775_create_attr_group(struct device *dev,
-			  const struct sensor_template_group *tg,
-			  int repeat)
+static int nct6775_add_template_attr_group(struct device *dev, struct nct6775_data *data,
+					   const struct sensor_template_group *tg, int repeat)
 {
 	struct attribute_group *group;
 	struct sensor_device_attr_u *su;
@@ -1419,28 +1002,28 @@
 	int i, count;
 
 	if (repeat <= 0)
-		return ERR_PTR(-EINVAL);
+		return -EINVAL;
 
 	t = tg->templates;
 	for (count = 0; *t; t++, count++)
 		;
 
 	if (count == 0)
-		return ERR_PTR(-EINVAL);
+		return -EINVAL;
 
 	group = devm_kzalloc(dev, sizeof(*group), GFP_KERNEL);
 	if (group == NULL)
-		return ERR_PTR(-ENOMEM);
+		return -ENOMEM;
 
 	attrs = devm_kcalloc(dev, repeat * count + 1, sizeof(*attrs),
 			     GFP_KERNEL);
 	if (attrs == NULL)
-		return ERR_PTR(-ENOMEM);
+		return -ENOMEM;
 
 	su = devm_kzalloc(dev, array3_size(repeat, count, sizeof(*su)),
 			  GFP_KERNEL);
 	if (su == NULL)
-		return ERR_PTR(-ENOMEM);
+		return -ENOMEM;
 
 	group->attrs = attrs;
 	group->is_visible = tg->is_visible;
@@ -1478,10 +1061,10 @@
 		}
 	}
 
-	return group;
+	return nct6775_add_attr_group(data, group);
 }
 
-static bool is_word_sized(struct nct6775_data *data, u16 reg)
+bool nct6775_reg_is_word_sized(struct nct6775_data *data, u16 reg)
 {
 	switch (data->kind) {
 	case nct6106:
@@ -1538,180 +1121,81 @@ static bool is_word_sized(struct nct6775_data *data, u16 reg)
 	}
 	return false;
 }
+EXPORT_SYMBOL_GPL(nct6775_reg_is_word_sized);
 
-static inline void nct6775_wmi_set_bank(struct nct6775_data *data, u16 reg)
-{
-	u8 bank = reg >> 8;
-
-	data->bank = bank;
-}
-
-static u16 nct6775_wmi_read_value(struct nct6775_data *data, u16 reg)
+/* We left-align 8-bit temperature values to make the code simpler */
+static int nct6775_read_temp(struct nct6775_data *data, u16 reg, u16 *val)
 {
-	int res, err, word_sized = is_word_sized(data, reg);
-	u8 tmp = 0;
-
-	nct6775_wmi_set_bank(data, reg);
+	int err;
 
-	err = nct6775_asuswmi_read(data->bank, reg & 0xff, &tmp);
+	err = nct6775_read_value(data, reg, val);
 	if (err)
-		return 0;
-
-	res = tmp;
-	if (word_sized) {
-		err = nct6775_asuswmi_read(data->bank, (reg & 0xff) + 1, &tmp);
-		if (err)
-			return 0;
-
-		res = (res << 8) + tmp;
-	}
-	return res;
-}
-
-static int nct6775_wmi_write_value(struct nct6775_data *data, u16 reg, u16 value)
-{
-	int res, word_sized = is_word_sized(data, reg);
-
-	nct6775_wmi_set_bank(data, reg);
-
-	if (word_sized) {
-		res = nct6775_asuswmi_write(data->bank, reg & 0xff, value >> 8);
-		if (res)
-			return res;
-
-		res = nct6775_asuswmi_write(data->bank, (reg & 0xff) + 1, value);
-	} else {
-		res = nct6775_asuswmi_write(data->bank, reg & 0xff, value);
-	}
-
-	return res;
-}
-
-/*
- * On older chips, only registers 0x50-0x5f are banked.
- * On more recent chips, all registers are banked.
- * Assume that is the case and set the bank number for each access.
- * Cache the bank number so it only needs to be set if it changes.
- */
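The bank/offset split that the removed comment describes is plain byte
arithmetic on the 16-bit register address; a worked example (value taken
from NCT6775_REG_ALARM[0] above):

	u8 bank = reg >> 8;	/* reg 0x459 -> bank   0x04 */
	u8 offset = reg & 0xff;	/* reg 0x459 -> offset 0x59 */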
-static inline void nct6775_set_bank(struct nct6775_data *data, u16 reg)
-{
-	u8 bank = reg >> 8;
-
-	if (data->bank != bank) {
-		outb_p(NCT6775_REG_BANK, data->addr + ADDR_REG_OFFSET);
-		outb_p(bank, data->addr + DATA_REG_OFFSET);
-		data->bank = bank;
-	}
-}
+		return err;
 
-static u16 nct6775_read_value(struct nct6775_data *data, u16 reg)
-{
-	int res, word_sized = is_word_sized(data, reg);
-
-	nct6775_set_bank(data, reg);
-	outb_p(reg & 0xff, data->addr + ADDR_REG_OFFSET);
-	res = inb_p(data->addr + DATA_REG_OFFSET);
-	if (word_sized) {
-		outb_p((reg & 0xff) + 1,
-		       data->addr + ADDR_REG_OFFSET);
-		res = (res << 8) + inb_p(data->addr + DATA_REG_OFFSET);
-	}
-	return res;
-}
+	if (!nct6775_reg_is_word_sized(data, reg))
+		*val <<= 8;
 
-static int nct6775_write_value(struct nct6775_data *data, u16 reg, u16 value)
-{
-	int word_sized = is_word_sized(data, reg);
-
-	nct6775_set_bank(data, reg);
-	outb_p(reg & 0xff, data->addr + ADDR_REG_OFFSET);
-	if (word_sized) {
-		outb_p(value >> 8, data->addr + DATA_REG_OFFSET);
-		outb_p((reg & 0xff) + 1,
-		       data->addr + ADDR_REG_OFFSET);
-	}
-	outb_p(value & 0xff, data->addr + DATA_REG_OFFSET);
 	return 0;
 }
 
-/* We left-align 8-bit temperature values to make the code simpler */
-static u16 nct6775_read_temp(struct nct6775_data *data, u16 reg)
-{
-	u16 res;
-
-	res = data->read_value(data, reg);
-	if (!is_word_sized(data, reg))
-		res <<= 8;
-
-	return res;
-}
-
-static int nct6775_write_temp(struct nct6775_data *data, u16 reg, u16 value)
-{
-	if (!is_word_sized(data, reg))
-		value >>= 8;
-	return data->write_value(data, reg, value);
-}
-
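A worked example of the left-alignment that nct6775_read_temp() preserves
across the rewrite, using the LM75 helper from lm75.h (values illustrative):

	/* 8-bit temp register holding 0x2A (42 degC): */
	*val = 0x2A;
	*val <<= 8;			/* left-aligned: 0x2A00 */
	LM75_TEMP_FROM_REG(0x2A00);	/* == 42000 millidegrees C */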
 /* This function assumes that the caller holds data->update_lock */
-static void nct6775_write_fan_div(struct nct6775_data *data, int nr)
+static int nct6775_write_fan_div(struct nct6775_data *data, int nr)
 {
-	u8 reg;
+	u16 reg;
+	int err;
+	u16 fandiv_reg = nr < 2 ? NCT6775_REG_FANDIV1 : NCT6775_REG_FANDIV2;
+	unsigned int oddshift = (nr & 1) * 4; /* masks shift by four if nr is odd */
 
-	switch (nr) {
-	case 0:
-		reg = (data->read_value(data, NCT6775_REG_FANDIV1) & 0x70)
-		    | (data->fan_div[0] & 0x7);
-		data->write_value(data, NCT6775_REG_FANDIV1, reg);
-		break;
-	case 1:
-		reg = (data->read_value(data, NCT6775_REG_FANDIV1) & 0x7)
-		    | ((data->fan_div[1] << 4) & 0x70);
-		data->write_value(data, NCT6775_REG_FANDIV1, reg);
-		break;
-	case 2:
-		reg = (data->read_value(data, NCT6775_REG_FANDIV2) & 0x70)
-		    | (data->fan_div[2] & 0x7);
-		data->write_value(data, NCT6775_REG_FANDIV2, reg);
-		break;
-	case 3:
-		reg = (data->read_value(data, NCT6775_REG_FANDIV2) & 0x7)
-		    | ((data->fan_div[3] << 4) & 0x70);
-		data->write_value(data, NCT6775_REG_FANDIV2, reg);
-		break;
-	}
+	err = nct6775_read_value(data, fandiv_reg, &reg);
+	if (err)
+		return err;
+	reg &= 0x70 >> oddshift;
+	reg |= (data->fan_div[nr] & 0x7) << oddshift;
+	return nct6775_write_value(data, fandiv_reg, reg);
 }
 
-static void nct6775_write_fan_div_common(struct nct6775_data *data, int nr)
+static int nct6775_write_fan_div_common(struct nct6775_data *data, int nr)
 {
 	if (data->kind == nct6775)
-		nct6775_write_fan_div(data, nr);
+		return nct6775_write_fan_div(data, nr);
+	return 0;
 }
 
-static void nct6775_update_fan_div(struct nct6775_data *data)
+static int nct6775_update_fan_div(struct nct6775_data *data)
 {
-	u8 i;
+	int err;
+	u16 i;
 
-	i = data->read_value(data, NCT6775_REG_FANDIV1);
+	err = nct6775_read_value(data, NCT6775_REG_FANDIV1, &i);
+	if (err)
+		return err;
 	data->fan_div[0] = i & 0x7;
 	data->fan_div[1] = (i & 0x70) >> 4;
-	i = data->read_value(data, NCT6775_REG_FANDIV2);
+	err = nct6775_read_value(data, NCT6775_REG_FANDIV2, &i);
+	if (err)
+		return err;
 	data->fan_div[2] = i & 0x7;
 	if (data->has_fan & BIT(3))
 		data->fan_div[3] = (i & 0x70) >> 4;
+
+	return 0;
 }
 
-static void nct6775_update_fan_div_common(struct nct6775_data *data)
+static int nct6775_update_fan_div_common(struct nct6775_data *data)
 {
 	if (data->kind == nct6775)
-		nct6775_update_fan_div(data);
+		return nct6775_update_fan_div(data);
+	return 0;
 }
 
-static void nct6775_init_fan_div(struct nct6775_data *data)
+static int nct6775_init_fan_div(struct nct6775_data *data)
 {
-	int i;
+	int i, err;
+
+	err = nct6775_update_fan_div_common(data);
+	if (err)
+		return err;
 
-	nct6775_update_fan_div_common(data);
 	/*
 	 * For all fans, start with highest divider value if the divider
 	 * register is not initialized. This ensures that we get a
@@ -1723,19 +1207,26 @@ static void nct6775_init_fan_div(struct nct6775_data *data)
 			continue;
 		if (data->fan_div[i] == 0) {
 			data->fan_div[i] = 7;
-			nct6775_write_fan_div_common(data, i);
+			err = nct6775_write_fan_div_common(data, i);
+			if (err)
+				return err;
 		}
 	}
+
+	return 0;
 }
 
-static void nct6775_init_fan_common(struct device *dev,
-				    struct nct6775_data *data)
+static int nct6775_init_fan_common(struct device *dev,
+				   struct nct6775_data *data)
 {
-	int i;
-	u8 reg;
+	int i, err;
+	u16 reg;
 
-	if (data->has_fan_div)
-		nct6775_init_fan_div(data);
+	if (data->has_fan_div) {
+		err = nct6775_init_fan_div(data);
+		if (err)
+			return err;
+	}
 
 	/*
	 * If fan_min is not set (0), set it to 0xff to disable it. This
@@ -1743,23 +1234,30 @@ static int nct6775_init_fan_common(struct device *dev,
	 */
 	for (i = 0; i < ARRAY_SIZE(data->fan_min); i++) {
 		if (data->has_fan_min & BIT(i)) {
-			reg = data->read_value(data, data->REG_FAN_MIN[i]);
-			if (!reg)
-				data->write_value(data, data->REG_FAN_MIN[i],
-						  data->has_fan_div ? 0xff
-						  : 0xff1f);
+			err = nct6775_read_value(data, data->REG_FAN_MIN[i], &reg);
+			if (err)
+				return err;
+			if (!reg) {
+				err = nct6775_write_value(data, data->REG_FAN_MIN[i],
							  data->has_fan_div ? 0xff : 0xff1f);
+				if (err)
+					return err;
+			}
 		}
 	}
+
+	return 0;
 }
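The nibble arithmetic in the rewritten nct6775_write_fan_div() above,
worked for nr = 1 (illustrative):

	/* nr = 1: fandiv_reg = NCT6775_REG_FANDIV1, oddshift = 4 */
	reg &= 0x70 >> 4;			/* == 0x07: keep fan0's divider */
	reg |= (data->fan_div[1] & 0x7) << 4;	/* fan1's divider into bits 6:4 */

For even nr the shift is 0, so the same two lines keep the high nibble and
rewrite the low one.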
 
-static void nct6775_select_fan_div(struct device *dev,
-				   struct nct6775_data *data, int nr, u16 reg)
+static int nct6775_select_fan_div(struct device *dev,
+				  struct nct6775_data *data, int nr, u16 reg)
 {
+	int err;
 	u8 fan_div = data->fan_div[nr];
 	u16 fan_min;
 
 	if (!data->has_fan_div)
-		return;
+		return 0;
 
 	/*
	 * If we failed to measure the fan speed, or the reported value is not
@@ -1791,36 +1289,46 @@ static int nct6775_select_fan_div(struct device *dev,
 		}
 		if (fan_min != data->fan_min[nr]) {
 			data->fan_min[nr] = fan_min;
-			data->write_value(data, data->REG_FAN_MIN[nr],
-					  fan_min);
+			err = nct6775_write_value(data, data->REG_FAN_MIN[nr], fan_min);
+			if (err)
+				return err;
 		}
 	}
 		data->fan_div[nr] = fan_div;
-		nct6775_write_fan_div_common(data, nr);
+		err = nct6775_write_fan_div_common(data, nr);
+		if (err)
+			return err;
 	}
+
+	return 0;
 }
 
-static void nct6775_update_pwm(struct device *dev)
+static int nct6775_update_pwm(struct device *dev)
 {
 	struct nct6775_data *data = dev_get_drvdata(dev);
-	int i, j;
-	int fanmodecfg, reg;
+	int i, j, err;
+	u16 fanmodecfg, reg;
 	bool duty_is_dc;
 
 	for (i = 0; i < data->pwm_num; i++) {
 		if (!(data->has_pwm & BIT(i)))
 			continue;
 
-		duty_is_dc = data->REG_PWM_MODE[i] &&
-			(data->read_value(data, data->REG_PWM_MODE[i])
-			 & data->PWM_MODE_MASK[i]);
+		err = nct6775_read_value(data, data->REG_PWM_MODE[i], &reg);
+		if (err)
+			return err;
+		duty_is_dc = data->REG_PWM_MODE[i] && (reg & data->PWM_MODE_MASK[i]);
 		data->pwm_mode[i] = !duty_is_dc;
 
-		fanmodecfg = data->read_value(data, data->REG_FAN_MODE[i]);
+		err = nct6775_read_value(data, data->REG_FAN_MODE[i], &fanmodecfg);
+		if (err)
+			return err;
 		for (j = 0; j < ARRAY_SIZE(data->REG_PWM); j++) {
 			if (data->REG_PWM[j] && data->REG_PWM[j][i]) {
-				data->pwm[j][i] = data->read_value(data,
-								   data->REG_PWM[j][i]);
+				err = nct6775_read_value(data, data->REG_PWM[j][i], &reg);
+				if (err)
+					return err;
+				data->pwm[j][i] = reg;
 			}
 		}
@@ -1835,17 +1343,22 @@ static int nct6775_update_pwm(struct device *dev)
 			u8 t = fanmodecfg & 0x0f;
 
 			if (data->REG_TOLERANCE_H) {
-				t |= (data->read_value(data,
-				      data->REG_TOLERANCE_H[i]) & 0x70) >> 1;
+				err = nct6775_read_value(data, data->REG_TOLERANCE_H[i], &reg);
+				if (err)
+					return err;
+				t |= (reg & 0x70) >> 1;
 			}
 			data->target_speed_tolerance[i] = t;
 		}
 
-		data->temp_tolerance[1][i] =
-			data->read_value(data,
-					 data->REG_CRITICAL_TEMP_TOLERANCE[i]);
+		err = nct6775_read_value(data, data->REG_CRITICAL_TEMP_TOLERANCE[i], &reg);
+		if (err)
+			return err;
+		data->temp_tolerance[1][i] = reg;
 
-		reg = data->read_value(data, data->REG_TEMP_SEL[i]);
+		err = nct6775_read_value(data, data->REG_TEMP_SEL[i], &reg);
+		if (err)
+			return err;
 		data->pwm_temp_sel[i] = reg & 0x1f;
 		/* If fan can stop, report floor as 0 */
 		if (reg & 0x80)
@@ -1854,7 +1367,9 @@ static int nct6775_update_pwm(struct device *dev)
 		if (!data->REG_WEIGHT_TEMP_SEL[i])
 			continue;
 
-		reg = data->read_value(data, data->REG_WEIGHT_TEMP_SEL[i]);
+		err = nct6775_read_value(data, data->REG_WEIGHT_TEMP_SEL[i], &reg);
+		if (err)
+			return err;
 		data->pwm_weight_temp_sel[i] = reg & 0x1f;
 		/* If weight is disabled, report weight source as 0 */
 		if (!(reg & 0x80))
@@ -1862,29 +1377,37 @@ static int nct6775_update_pwm(struct device *dev)
 
 		/* Weight temp data */
 		for (j = 0; j < ARRAY_SIZE(data->weight_temp); j++) {
-			data->weight_temp[j][i] = data->read_value(data,
-								   data->REG_WEIGHT_TEMP[j][i]);
+			err = nct6775_read_value(data, data->REG_WEIGHT_TEMP[j][i], &reg);
+			if (err)
+				return err;
+			data->weight_temp[j][i] = reg;
 		}
 	}
+
+	return 0;
 }
 
-static void nct6775_update_pwm_limits(struct device *dev)
+static int nct6775_update_pwm_limits(struct device *dev)
 {
 	struct nct6775_data *data = dev_get_drvdata(dev);
-	int i, j;
-	u8 reg;
-	u16 reg_t;
+	int i, j, err;
+	u16 reg, reg_t;
 
 	for (i = 0; i < data->pwm_num; i++) {
 		if (!(data->has_pwm & BIT(i)))
 			continue;
 
 		for (j = 0; j < ARRAY_SIZE(data->fan_time); j++) {
-			data->fan_time[j][i] =
-				data->read_value(data, data->REG_FAN_TIME[j][i]);
+			err = nct6775_read_value(data, data->REG_FAN_TIME[j][i], &reg);
+			if (err)
+				return err;
+			data->fan_time[j][i] = reg;
 		}
 
-		reg_t = data->read_value(data, data->REG_TARGET[i]);
+		err = nct6775_read_value(data, data->REG_TARGET[i], &reg_t);
+		if (err)
+			return err;
+
 		/* Update only in matching mode or if never updated */
 		if (!data->target_temp[i] ||
 		    data->pwm_enable[i] == thermal_cruise)
@@ -1892,29 +1415,37 @@ static int nct6775_update_pwm_limits(struct device *dev)
 		if (!data->target_speed[i] ||
 		    data->pwm_enable[i] == speed_cruise) {
 			if (data->REG_TOLERANCE_H) {
-				reg_t |= (data->read_value(data,
-					data->REG_TOLERANCE_H[i]) & 0x0f) << 8;
+				err = nct6775_read_value(data, data->REG_TOLERANCE_H[i], &reg);
+				if (err)
+					return err;
+				reg_t |= (reg & 0x0f) << 8;
 			}
 			data->target_speed[i] = reg_t;
 		}
 
 		for (j = 0; j < data->auto_pwm_num; j++) {
-			data->auto_pwm[i][j] =
-				data->read_value(data,
-						 NCT6775_AUTO_PWM(data, i, j));
-			data->auto_temp[i][j] =
-				data->read_value(data,
-						 NCT6775_AUTO_TEMP(data, i, j));
+			err = nct6775_read_value(data, NCT6775_AUTO_PWM(data, i, j), &reg);
+			if (err)
+				return err;
+			data->auto_pwm[i][j] = reg;
+
+			err = nct6775_read_value(data, NCT6775_AUTO_TEMP(data, i, j), &reg);
+			if (err)
+				return err;
+			data->auto_temp[i][j] = reg;
 		}
 
 		/* critical auto_pwm temperature data */
-		data->auto_temp[i][data->auto_pwm_num] =
-			data->read_value(data, data->REG_CRITICAL_TEMP[i]);
+		err = nct6775_read_value(data, data->REG_CRITICAL_TEMP[i], &reg);
+		if (err)
+			return err;
+		data->auto_temp[i][data->auto_pwm_num] = reg;
 
 		switch (data->kind) {
 		case nct6775:
-			reg = data->read_value(data,
-					       NCT6775_REG_CRITICAL_ENAB[i]);
+			err = nct6775_read_value(data, NCT6775_REG_CRITICAL_ENAB[i], &reg);
+			if (err)
+				return err;
 			data->auto_pwm[i][data->auto_pwm_num] =
 				(reg & 0x02) ? 0xff : 0x00;
 			break;
@@ -1931,120 +1462,158 @@ static int nct6775_update_pwm_limits(struct device *dev)
 		case nct6796:
 		case nct6797:
 		case nct6798:
-			reg = data->read_value(data,
-					       data->REG_CRITICAL_PWM_ENABLE[i]);
-			if (reg & data->CRITICAL_PWM_ENABLE_MASK)
-				reg = data->read_value(data,
-						       data->REG_CRITICAL_PWM[i]);
-			else
+			err = nct6775_read_value(data, data->REG_CRITICAL_PWM_ENABLE[i], &reg);
+			if (err)
+				return err;
+			if (reg & data->CRITICAL_PWM_ENABLE_MASK) {
+				err = nct6775_read_value(data, data->REG_CRITICAL_PWM[i], &reg);
+				if (err)
+					return err;
+			} else {
 				reg = 0xff;
+			}
 			data->auto_pwm[i][data->auto_pwm_num] = reg;
 			break;
 		}
 	}
+
+	return 0;
 }
 
 static struct nct6775_data *nct6775_update_device(struct device *dev)
 {
 	struct nct6775_data *data = dev_get_drvdata(dev);
-	int i, j;
+	int i, j, err = 0;
+	u16 reg;
 
 	mutex_lock(&data->update_lock);
 
 	if (time_after(jiffies, data->last_updated + HZ + HZ / 2)
 	    || !data->valid) {
 		/* Fan clock dividers */
-		nct6775_update_fan_div_common(data);
+		err = nct6775_update_fan_div_common(data);
+		if (err)
+			goto out;
 
 		/* Measured voltages and limits */
 		for (i = 0; i < data->in_num; i++) {
 			if (!(data->have_in & BIT(i)))
 				continue;
 
-			data->in[i][0] = data->read_value(data,
-							  data->REG_VIN[i]);
-			data->in[i][1] = data->read_value(data,
-							  data->REG_IN_MINMAX[0][i]);
-			data->in[i][2] = data->read_value(data,
-							  data->REG_IN_MINMAX[1][i]);
+			err = nct6775_read_value(data, data->REG_VIN[i], &reg);
+			if (err)
+				goto out;
+			data->in[i][0] = reg;
+
+			err = nct6775_read_value(data, data->REG_IN_MINMAX[0][i], &reg);
+			if (err)
+				goto out;
+			data->in[i][1] = reg;
+
+			err = nct6775_read_value(data, data->REG_IN_MINMAX[1][i], &reg);
+			if (err)
+				goto out;
+			data->in[i][2] = reg;
 		}
 
 		/* Measured fan speeds and limits */
 		for (i = 0; i < ARRAY_SIZE(data->rpm); i++) {
-			u16 reg;
-
 			if (!(data->has_fan & BIT(i)))
 				continue;
 
-			reg = data->read_value(data, data->REG_FAN[i]);
+			err = nct6775_read_value(data, data->REG_FAN[i], &reg);
+			if (err)
+				goto out;
 			data->rpm[i] = data->fan_from_reg(reg,
 							  data->fan_div[i]);
 
-			if (data->has_fan_min & BIT(i))
-				data->fan_min[i] = data->read_value(data,
-					data->REG_FAN_MIN[i]);
+			if (data->has_fan_min & BIT(i)) {
+				err = nct6775_read_value(data, data->REG_FAN_MIN[i], &reg);
+				if (err)
+					goto out;
+				data->fan_min[i] = reg;
+			}
 
 			if (data->REG_FAN_PULSES[i]) {
-				data->fan_pulses[i] =
-					(data->read_value(data,
-							  data->REG_FAN_PULSES[i])
-					 >> data->FAN_PULSE_SHIFT[i]) & 0x03;
+				err = nct6775_read_value(data, data->REG_FAN_PULSES[i], &reg);
+				if (err)
+					goto out;
+				data->fan_pulses[i] = (reg >> data->FAN_PULSE_SHIFT[i]) & 0x03;
 			}
 
-			nct6775_select_fan_div(dev, data, i, reg);
+			err = nct6775_select_fan_div(dev, data, i, reg);
+			if (err)
+				goto out;
 		}
 
-		nct6775_update_pwm(dev);
-		nct6775_update_pwm_limits(dev);
+		err = nct6775_update_pwm(dev);
+		if (err)
+			goto out;
+
+		err = nct6775_update_pwm_limits(dev);
+		if (err)
+			goto out;
 
 		/* Measured temperatures and limits */
 		for (i = 0; i < NUM_TEMP; i++) {
 			if (!(data->have_temp & BIT(i)))
 				continue;
 			for (j = 0; j < ARRAY_SIZE(data->reg_temp); j++) {
-				if (data->reg_temp[j][i])
-					data->temp[j][i] = nct6775_read_temp(data,
-									     data->reg_temp[j][i]);
+				if (data->reg_temp[j][i]) {
+					err = nct6775_read_temp(data, data->reg_temp[j][i], &reg);
+					if (err)
+						goto out;
+					data->temp[j][i] = reg;
+				}
 			}
 			if (i >= NUM_TEMP_FIXED ||
 			    !(data->have_temp_fixed & BIT(i)))
 				continue;
-			data->temp_offset[i] = data->read_value(data,
-								data->REG_TEMP_OFFSET[i]);
+			err = nct6775_read_value(data, data->REG_TEMP_OFFSET[i], &reg);
+			if (err)
+				goto out;
+			data->temp_offset[i] = reg;
 		}
 
 		for (i = 0; i < NUM_TSI_TEMP; i++) {
 			if (!(data->have_tsi_temp & BIT(i)))
 				continue;
-			data->tsi_temp[i] = data->read_value(data, data->REG_TSI_TEMP[i]);
+			err = nct6775_read_value(data, data->REG_TSI_TEMP[i], &reg);
+			if (err)
+				goto out;
+			data->tsi_temp[i] = reg;
 		}
 
 		data->alarms = 0;
 		for (i = 0; i < NUM_REG_ALARM; i++) {
-			u8 alarm;
+			u16 alarm;
 
 			if (!data->REG_ALARM[i])
 				continue;
-			alarm = data->read_value(data, data->REG_ALARM[i]);
+			err = nct6775_read_value(data, data->REG_ALARM[i], &alarm);
+			if (err)
+				goto out;
 			data->alarms |= ((u64)alarm) << (i << 3);
 		}
 
 		data->beeps = 0;
 		for (i = 0; i < NUM_REG_BEEP; i++) {
-			u8 beep;
+			u16 beep;
 
 			if (!data->REG_BEEP[i])
 				continue;
-			beep = data->read_value(data, data->REG_BEEP[i]);
+			err = nct6775_read_value(data, data->REG_BEEP[i], &beep);
+			if (err)
+				goto out;
 			data->beeps |= ((u64)beep) << (i << 3);
 		}
 
 		data->last_updated = jiffies;
 		data->valid = true;
 	}
-
+out:
 	mutex_unlock(&data->update_lock);
-	return data;
+	return err ? ERR_PTR(err) : data;
 }
 
 /*
@@ -2058,6 +1627,9 @@ show_in_reg(struct device *dev, struct device_attribute *attr, char *buf)
 	int index = sattr->index;
 	int nr = sattr->nr;
 
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
 	return sprintf(buf, "%ld\n", in_from_reg(data->in[nr][index], nr));
 }
 
@@ -2077,34 +1649,39 @@ store_in_reg(struct device *dev, struct device_attribute *attr, const char *buf,
 		return err;
 	mutex_lock(&data->update_lock);
 	data->in[nr][index] = in_to_reg(val, nr);
-	data->write_value(data, data->REG_IN_MINMAX[index - 1][nr],
-			  data->in[nr][index]);
+	err = nct6775_write_value(data, data->REG_IN_MINMAX[index - 1][nr], data->in[nr][index]);
 	mutex_unlock(&data->update_lock);
-	return count;
+	return err ? : count;
 }
 
-static ssize_t
-show_alarm(struct device *dev, struct device_attribute *attr, char *buf)
+ssize_t
+nct6775_show_alarm(struct device *dev, struct device_attribute *attr, char *buf)
 {
 	struct nct6775_data *data = nct6775_update_device(dev);
 	struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr);
-	int nr = data->ALARM_BITS[sattr->index];
+	int nr;
 
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
+	nr = data->ALARM_BITS[sattr->index];
 	return sprintf(buf, "%u\n",
		       (unsigned int)((data->alarms >> nr) & 0x01));
 }
+EXPORT_SYMBOL_GPL(nct6775_show_alarm);
 
 static int find_temp_source(struct nct6775_data *data, int index, int count)
 {
 	int source = data->temp_src[index];
-	int nr;
+	int nr, err;
 
 	for (nr = 0; nr < count; nr++) {
-		int src;
+		u16 src;
 
-		src = data->read_value(data,
-				       data->REG_TEMP_SOURCE[nr]) & 0x1f;
-		if (src == source)
+		err = nct6775_read_value(data, data->REG_TEMP_SOURCE[nr], &src);
+		if (err)
+			return err;
+		if ((src & 0x1f) == source)
 			return nr;
 	}
 	return -ENODEV;
@@ -2118,6 +1695,9 @@ show_temp_alarm(struct device *dev, struct device_attribute *attr, char *buf)
 	unsigned int alarm = 0;
 	int nr;
 
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
 	/*
	 * For temperatures, there is no fixed mapping from registers to alarm
	 * bits. Alarm bits are determined by the temperature source mapping.
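nct6775_update_device() now reports failed register access through the
returned pointer itself, so every sysfs show path in the hunks around here
opens with the same guard; the general shape is:

	struct nct6775_data *data = nct6775_update_device(dev);

	if (IS_ERR(data))
		return PTR_ERR(data);	/* the errno packed by ERR_PTR(err) */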
@@ -2131,20 +1711,25 @@ show_temp_alarm(struct device *dev, struct device_attribute *attr, char *buf)
 	return sprintf(buf, "%u\n", alarm);
 }
 
-static ssize_t
-show_beep(struct device *dev, struct device_attribute *attr, char *buf)
+ssize_t
+nct6775_show_beep(struct device *dev, struct device_attribute *attr, char *buf)
 {
 	struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr);
 	struct nct6775_data *data = nct6775_update_device(dev);
-	int nr = data->BEEP_BITS[sattr->index];
+	int nr;
+
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
+	nr = data->BEEP_BITS[sattr->index];
 
 	return sprintf(buf, "%u\n",
		       (unsigned int)((data->beeps >> nr) & 0x01));
 }
+EXPORT_SYMBOL_GPL(nct6775_show_beep);
 
-static ssize_t
-store_beep(struct device *dev, struct device_attribute *attr, const char *buf,
-	   size_t count)
+ssize_t
+nct6775_store_beep(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
 {
 	struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr);
 	struct nct6775_data *data = dev_get_drvdata(dev);
@@ -2164,11 +1749,12 @@ nct6775_store_beep(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
 		data->beeps |= (1ULL << nr);
 	else
 		data->beeps &= ~(1ULL << nr);
-	data->write_value(data, data->REG_BEEP[regindex],
-			  (data->beeps >> (regindex << 3)) & 0xff);
+	err = nct6775_write_value(data, data->REG_BEEP[regindex],
+				  (data->beeps >> (regindex << 3)) & 0xff);
 	mutex_unlock(&data->update_lock);
 
-	return count;
+	return err ? : count;
 }
+EXPORT_SYMBOL_GPL(nct6775_store_beep);
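These exports land in the HWMON_NCT6775 symbol namespace selected via
DEFAULT_SYMBOL_NAMESPACE at the top of the file; a consumer module (e.g.
the platform half split out of this file; file name assumed) has to import
the namespace explicitly:

	/* in the consuming module */
	MODULE_IMPORT_NS(HWMON_NCT6775);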
 
 static ssize_t
 show_temp_beep(struct device *dev, struct device_attribute *attr, char *buf)
@@ -2178,6 +1764,9 @@ show_temp_beep(struct device *dev, struct device_attribute *attr, char *buf)
 	unsigned int beep = 0;
 	int nr;
 
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
 	/*
	 * For temperatures, there is no fixed mapping from registers to beep
	 * enable bits. Beep enable bits are determined by the temperature
@@ -2220,11 +1809,11 @@ store_temp_beep(struct device *dev, struct device_attribute *attr,
 		data->beeps |= (1ULL << bit);
 	else
 		data->beeps &= ~(1ULL << bit);
-	data->write_value(data, data->REG_BEEP[regindex],
-			  (data->beeps >> (regindex << 3)) & 0xff);
+	err = nct6775_write_value(data, data->REG_BEEP[regindex],
+				  (data->beeps >> (regindex << 3)) & 0xff);
 	mutex_unlock(&data->update_lock);
 
-	return count;
+	return err ? : count;
 }
 
 static umode_t nct6775_in_is_visible(struct kobject *kobj,
@@ -2237,17 +1826,14 @@ static umode_t nct6775_in_is_visible(struct kobject *kobj,
 	if (!(data->have_in & BIT(in)))
 		return 0;
 
-	return attr->mode;
+	return nct6775_attr_mode(data, attr);
 }
 
-SENSOR_TEMPLATE_2(in_input, "in%d_input", S_IRUGO, show_in_reg, NULL, 0, 0);
-SENSOR_TEMPLATE(in_alarm, "in%d_alarm", S_IRUGO, show_alarm, NULL, 0);
-SENSOR_TEMPLATE(in_beep, "in%d_beep", S_IWUSR | S_IRUGO, show_beep, store_beep,
-		0);
-SENSOR_TEMPLATE_2(in_min, "in%d_min", S_IWUSR | S_IRUGO, show_in_reg,
-		  store_in_reg, 0, 1);
-SENSOR_TEMPLATE_2(in_max, "in%d_max", S_IWUSR | S_IRUGO, show_in_reg,
-		  store_in_reg, 0, 2);
+SENSOR_TEMPLATE_2(in_input, "in%d_input", 0444, show_in_reg, NULL, 0, 0);
+SENSOR_TEMPLATE(in_alarm, "in%d_alarm", 0444, nct6775_show_alarm, NULL, 0);
+SENSOR_TEMPLATE(in_beep, "in%d_beep", 0644, nct6775_show_beep, nct6775_store_beep, 0);
+SENSOR_TEMPLATE_2(in_min, "in%d_min", 0644, show_in_reg, store_in_reg, 0, 1);
+SENSOR_TEMPLATE_2(in_max, "in%d_max", 0644, show_in_reg, store_in_reg, 0, 2);
 
 /*
  * nct6775_in_is_visible uses the index into the following array
@@ -2275,6 +1861,9 @@ show_fan(struct device *dev, struct device_attribute *attr, char *buf)
 	struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr);
 	int nr = sattr->index;
 
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
 	return sprintf(buf, "%d\n", data->rpm[nr]);
 }
 
@@ -2285,6 +1874,9 @@ show_fan_min(struct device *dev, struct device_attribute *attr, char *buf)
 	struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr);
 	int nr = sattr->index;
 
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
 	return sprintf(buf, "%d\n",
		       data->fan_from_reg_min(data->fan_min[nr],
					      data->fan_div[nr]));
 }
 
@@ -2297,6 +1889,9 @@ show_fan_div(struct device *dev, struct device_attribute *attr, char *buf)
 	struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr);
 	int nr = sattr->index;
 
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
 	return sprintf(buf, "%u\n", div_from_reg(data->fan_div[nr]));
 }
 
@@ -2382,16 +1977,18 @@ store_fan_min(struct device *dev, struct device_attribute *attr,
 			nr + 1, div_from_reg(data->fan_div[nr]),
 			div_from_reg(new_div));
 		data->fan_div[nr] = new_div;
-		nct6775_write_fan_div_common(data, nr);
+		err = nct6775_write_fan_div_common(data, nr);
+		if (err)
+			goto write_min;
 		/* Give the chip time to sample a new speed value */
 		data->last_updated = jiffies;
 	}
 
 write_min:
-	data->write_value(data, data->REG_FAN_MIN[nr], data->fan_min[nr]);
+	err = nct6775_write_value(data, data->REG_FAN_MIN[nr], data->fan_min[nr]);
 	mutex_unlock(&data->update_lock);
 
-	return count;
+	return err ? : count;
 }
 
 static ssize_t
@@ -2399,8 +1996,12 @@ show_fan_pulses(struct device *dev, struct device_attribute *attr, char *buf)
 {
 	struct nct6775_data *data = nct6775_update_device(dev);
 	struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr);
-	int p = data->fan_pulses[sattr->index];
+	int p;
 
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
+	p = data->fan_pulses[sattr->index];
 	return sprintf(buf, "%d\n", p ? : 4);
 }
 
@@ -2413,7 +2014,7 @@ store_fan_pulses(struct device *dev, struct device_attribute *attr,
 	int nr = sattr->index;
 	unsigned long val;
 	int err;
-	u8 reg;
+	u16 reg;
 
 	err = kstrtoul(buf, 10, &val);
 	if (err < 0)
@@ -2424,13 +2025,16 @@ store_fan_pulses(struct device *dev, struct device_attribute *attr,
 
 	mutex_lock(&data->update_lock);
 	data->fan_pulses[nr] = val & 3;
-	reg = data->read_value(data, data->REG_FAN_PULSES[nr]);
+	err = nct6775_read_value(data, data->REG_FAN_PULSES[nr], &reg);
+	if (err)
+		goto out;
 	reg &= ~(0x03 << data->FAN_PULSE_SHIFT[nr]);
 	reg |= (val & 3) << data->FAN_PULSE_SHIFT[nr];
-	data->write_value(data, data->REG_FAN_PULSES[nr], reg);
+	err = nct6775_write_value(data, data->REG_FAN_PULSES[nr], reg);
+out:
 	mutex_unlock(&data->update_lock);
 
-	return count;
+	return err ? : count;
 }
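The "return err ? : count;" pattern used throughout these store paths
relies on the GNU ?: extension: with the middle operand omitted, the
condition's own value is reused, so it is shorthand for

	return err ? err : count;	/* negative errno, or bytes consumed */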
 
 static umode_t nct6775_fan_is_visible(struct kobject *kobj,
@@ -2455,19 +2059,16 @@ static umode_t nct6775_fan_is_visible(struct kobject *kobj,
 	if (nr == 5 && data->kind != nct6775)
 		return 0;
 
-	return attr->mode;
+	return nct6775_attr_mode(data, attr);
 }
 
-SENSOR_TEMPLATE(fan_input, "fan%d_input", S_IRUGO, show_fan, NULL, 0);
-SENSOR_TEMPLATE(fan_alarm, "fan%d_alarm", S_IRUGO, show_alarm, NULL,
-		FAN_ALARM_BASE);
-SENSOR_TEMPLATE(fan_beep, "fan%d_beep", S_IWUSR | S_IRUGO, show_beep,
-		store_beep, FAN_ALARM_BASE);
-SENSOR_TEMPLATE(fan_pulses, "fan%d_pulses", S_IWUSR | S_IRUGO, show_fan_pulses,
-		store_fan_pulses, 0);
-SENSOR_TEMPLATE(fan_min, "fan%d_min", S_IWUSR | S_IRUGO, show_fan_min,
-		store_fan_min, 0);
-SENSOR_TEMPLATE(fan_div, "fan%d_div", S_IRUGO, show_fan_div, NULL, 0);
+SENSOR_TEMPLATE(fan_input, "fan%d_input", 0444, show_fan, NULL, 0);
+SENSOR_TEMPLATE(fan_alarm, "fan%d_alarm", 0444, nct6775_show_alarm, NULL, FAN_ALARM_BASE);
+SENSOR_TEMPLATE(fan_beep, "fan%d_beep", 0644, nct6775_show_beep,
+		nct6775_store_beep, FAN_ALARM_BASE);
+SENSOR_TEMPLATE(fan_pulses, "fan%d_pulses", 0644, show_fan_pulses, store_fan_pulses, 0);
+SENSOR_TEMPLATE(fan_min, "fan%d_min", 0644, show_fan_min, store_fan_min, 0);
+SENSOR_TEMPLATE(fan_div, "fan%d_div", 0444, show_fan_div, NULL, 0);
 
 /*
  * nct6775_fan_is_visible uses the index into the following array
@@ -2497,6 +2098,9 @@ show_temp_label(struct device *dev, struct device_attribute *attr, char *buf)
 	struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr);
 	int nr = sattr->index;
 
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
 	return sprintf(buf, "%s\n", data->temp_label[data->temp_src[nr]]);
 }
 
@@ -2508,6 +2112,9 @@ show_temp(struct device *dev, struct device_attribute *attr, char *buf)
 	int nr = sattr->nr;
 	int index = sattr->index;
 
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
 	return sprintf(buf, "%d\n", LM75_TEMP_FROM_REG(data->temp[index][nr]));
 }
 
@@ -2528,10 +2135,9 @@ store_temp(struct device *dev, struct device_attribute *attr, const char *buf,
 
 	mutex_lock(&data->update_lock);
 	data->temp[index][nr] = LM75_TEMP_TO_REG(val);
-	nct6775_write_temp(data, data->reg_temp[index][nr],
-			   data->temp[index][nr]);
+	err = nct6775_write_temp(data, data->reg_temp[index][nr], data->temp[index][nr]);
 	mutex_unlock(&data->update_lock);
-	return count;
+	return err ? : count;
 }
 
 static ssize_t
@@ -2540,6 +2146,9 @@ show_temp_offset(struct device *dev, struct device_attribute *attr, char *buf)
 	struct nct6775_data *data = nct6775_update_device(dev);
 	struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr);
 
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
 	return sprintf(buf, "%d\n", data->temp_offset[sattr->index] * 1000);
 }
 
@@ -2561,10 +2170,10 @@ store_temp_offset(struct device *dev, struct device_attribute *attr,
 
 	mutex_lock(&data->update_lock);
 	data->temp_offset[nr] = val;
-	data->write_value(data, data->REG_TEMP_OFFSET[nr], val);
+	err = nct6775_write_value(data, data->REG_TEMP_OFFSET[nr], val);
 	mutex_unlock(&data->update_lock);
 
-	return count;
+	return err ? : count;
 }
 
 static ssize_t
@@ -2574,6 +2183,9 @@ show_temp_type(struct device *dev, struct device_attribute *attr, char *buf)
 	struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr);
 	int nr = sattr->index;
 
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
 	return sprintf(buf, "%d\n", (int)data->temp_type[nr]);
 }
 
@@ -2586,7 +2198,11 @@ store_temp_type(struct device *dev, struct device_attribute *attr,
 	int nr = sattr->index;
 	unsigned long val;
 	int err;
-	u8 vbat, diode, vbit, dbit;
+	u8 vbit, dbit;
+	u16 vbat, diode;
+
+	if (IS_ERR(data))
+		return PTR_ERR(data);
 
 	err = kstrtoul(buf, 10, &val);
 	if (err < 0)
@@ -2600,12 +2216,21 @@ store_temp_type(struct device *dev, struct device_attribute *attr,
 	data->temp_type[nr] = val;
 	vbit = 0x02 << nr;
 	dbit = data->DIODE_MASK << nr;
-	vbat = data->read_value(data, data->REG_VBAT) & ~vbit;
-	diode = data->read_value(data, data->REG_DIODE) & ~dbit;
-	switch (val) {
-	case 1:	/* CPU diode (diode, current mode) */
-		vbat |= vbit;
-		diode |= dbit;
+
+	err = nct6775_read_value(data, data->REG_VBAT, &vbat);
+	if (err)
+		goto out;
+	vbat &= ~vbit;
+
+	err = nct6775_read_value(data, data->REG_DIODE, &diode);
+	if (err)
+		goto out;
+	diode &= ~dbit;
+
+	switch (val) {
+	case 1:	/* CPU diode (diode, current mode) */
+		vbat |= vbit;
+		diode |= dbit;
 		break;
 	case 3: /* diode, voltage mode */
 		vbat |= dbit;
@@ -2613,11 +2238,13 @@ store_temp_type(struct device *dev, struct device_attribute *attr,
 	case 4:	/* thermistor */
 		break;
 	}
-	data->write_value(data, data->REG_VBAT, vbat);
-	data->write_value(data, data->REG_DIODE, diode);
-
+	err = nct6775_write_value(data, data->REG_VBAT, vbat);
+	if (err)
+		goto out;
+	err = nct6775_write_value(data, data->REG_DIODE, diode);
+out:
 	mutex_unlock(&data->update_lock);
-	return count;
+	return err ? : count;
 }
: count; } static umode_t nct6775_temp_is_visible(struct kobject *kobj, @@ -2656,26 +2283,19 @@ static umode_t nct6775_temp_is_visible(struct kobject *kobj, if (nr > 7 && !(data->have_temp_fixed & BIT(temp))) return 0; - return attr->mode; + return nct6775_attr_mode(data, attr); } -SENSOR_TEMPLATE_2(temp_input, "temp%d_input", S_IRUGO, show_temp, NULL, 0, 0); -SENSOR_TEMPLATE(temp_label, "temp%d_label", S_IRUGO, show_temp_label, NULL, 0); -SENSOR_TEMPLATE_2(temp_max, "temp%d_max", S_IRUGO | S_IWUSR, show_temp, - store_temp, 0, 1); -SENSOR_TEMPLATE_2(temp_max_hyst, "temp%d_max_hyst", S_IRUGO | S_IWUSR, - show_temp, store_temp, 0, 2); -SENSOR_TEMPLATE_2(temp_crit, "temp%d_crit", S_IRUGO | S_IWUSR, show_temp, - store_temp, 0, 3); -SENSOR_TEMPLATE_2(temp_lcrit, "temp%d_lcrit", S_IRUGO | S_IWUSR, show_temp, - store_temp, 0, 4); -SENSOR_TEMPLATE(temp_offset, "temp%d_offset", S_IRUGO | S_IWUSR, - show_temp_offset, store_temp_offset, 0); -SENSOR_TEMPLATE(temp_type, "temp%d_type", S_IRUGO | S_IWUSR, show_temp_type, - store_temp_type, 0); -SENSOR_TEMPLATE(temp_alarm, "temp%d_alarm", S_IRUGO, show_temp_alarm, NULL, 0); -SENSOR_TEMPLATE(temp_beep, "temp%d_beep", S_IRUGO | S_IWUSR, show_temp_beep, - store_temp_beep, 0); +SENSOR_TEMPLATE_2(temp_input, "temp%d_input", 0444, show_temp, NULL, 0, 0); +SENSOR_TEMPLATE(temp_label, "temp%d_label", 0444, show_temp_label, NULL, 0); +SENSOR_TEMPLATE_2(temp_max, "temp%d_max", 0644, show_temp, store_temp, 0, 1); +SENSOR_TEMPLATE_2(temp_max_hyst, "temp%d_max_hyst", 0644, show_temp, store_temp, 0, 2); +SENSOR_TEMPLATE_2(temp_crit, "temp%d_crit", 0644, show_temp, store_temp, 0, 3); +SENSOR_TEMPLATE_2(temp_lcrit, "temp%d_lcrit", 0644, show_temp, store_temp, 0, 4); +SENSOR_TEMPLATE(temp_offset, "temp%d_offset", 0644, show_temp_offset, store_temp_offset, 0); +SENSOR_TEMPLATE(temp_type, "temp%d_type", 0644, show_temp_type, store_temp_type, 0); +SENSOR_TEMPLATE(temp_alarm, "temp%d_alarm", 0444, show_temp_alarm, NULL, 0); +SENSOR_TEMPLATE(temp_beep, "temp%d_beep", 0644, show_temp_beep, store_temp_beep, 0); /* * nct6775_temp_is_visible uses the index into the following array @@ -2707,6 +2327,9 @@ static ssize_t show_tsi_temp(struct device *dev, struct device_attribute *attr, struct nct6775_data *data = nct6775_update_device(dev); struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr); + if (IS_ERR(data)) + return PTR_ERR(data); + return sysfs_emit(buf, "%u\n", tsi_temp_from_reg(data->tsi_temp[sattr->index])); } @@ -2727,7 +2350,7 @@ static umode_t nct6775_tsi_temp_is_visible(struct kobject *kobj, struct attribut struct nct6775_data *data = dev_get_drvdata(dev); int temp = index / 2; - return (data->have_tsi_temp & BIT(temp)) ? attr->mode : 0; + return (data->have_tsi_temp & BIT(temp)) ? 
nct6775_attr_mode(data, attr) : 0; } /* @@ -2746,6 +2369,9 @@ show_pwm_mode(struct device *dev, struct device_attribute *attr, char *buf) struct nct6775_data *data = nct6775_update_device(dev); struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr); + if (IS_ERR(data)) + return PTR_ERR(data); + return sprintf(buf, "%d\n", data->pwm_mode[sattr->index]); } @@ -2758,7 +2384,7 @@ store_pwm_mode(struct device *dev, struct device_attribute *attr, int nr = sattr->index; unsigned long val; int err; - u8 reg; + u16 reg; err = kstrtoul(buf, 10, &val); if (err < 0) @@ -2776,13 +2402,16 @@ store_pwm_mode(struct device *dev, struct device_attribute *attr, mutex_lock(&data->update_lock); data->pwm_mode[nr] = val; - reg = data->read_value(data, data->REG_PWM_MODE[nr]); + err = nct6775_read_value(data, data->REG_PWM_MODE[nr], &reg); + if (err) + goto out; reg &= ~data->PWM_MODE_MASK[nr]; if (!val) reg |= data->PWM_MODE_MASK[nr]; - data->write_value(data, data->REG_PWM_MODE[nr], reg); + err = nct6775_write_value(data, data->REG_PWM_MODE[nr], reg); +out: mutex_unlock(&data->update_lock); - return count; + return err ? : count; } static ssize_t @@ -2792,16 +2421,23 @@ show_pwm(struct device *dev, struct device_attribute *attr, char *buf) struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); int nr = sattr->nr; int index = sattr->index; - int pwm; + int err; + u16 pwm; + + if (IS_ERR(data)) + return PTR_ERR(data); /* * For automatic fan control modes, show current pwm readings. * Otherwise, show the configured value. */ - if (index == 0 && data->pwm_enable[nr] > manual) - pwm = data->read_value(data, data->REG_PWM_READ[nr]); - else + if (index == 0 && data->pwm_enable[nr] > manual) { + err = nct6775_read_value(data, data->REG_PWM_READ[nr], &pwm); + if (err) + return err; + } else { + pwm = data->pwm[index][nr]; + } return sprintf(buf, "%d\n", pwm); } @@ -2819,7 +2455,7 @@ store_pwm(struct device *dev, struct device_attribute *attr, const char *buf, int maxval[7] = { 255, 255, data->pwm[3][nr] ? : 255, 255, 255, 255, 255 }; int err; - u8 reg; + u16 reg; err = kstrtoul(buf, 10, &val); if (err < 0) @@ -2828,16 +2464,21 @@ store_pwm(struct device *dev, struct device_attribute *attr, const char *buf, mutex_lock(&data->update_lock); data->pwm[index][nr] = val; - data->write_value(data, data->REG_PWM[index][nr], val); + err = nct6775_write_value(data, data->REG_PWM[index][nr], val); + if (err) + goto out; if (index == 2) { /* floor: disable if val == 0 */ - reg = data->read_value(data, data->REG_TEMP_SEL[nr]); + err = nct6775_read_value(data, data->REG_TEMP_SEL[nr], &reg); + if (err) + goto out; reg &= 0x7f; if (val) reg |= 0x80; - data->write_value(data, data->REG_TEMP_SEL[nr], reg); + err = nct6775_write_value(data, data->REG_TEMP_SEL[nr], reg); } +out: mutex_unlock(&data->update_lock); - return count; + return err ? 
: count; } /* Returns 0 if OK, -EINVAL otherwise */ @@ -2864,40 +2505,54 @@ static int check_trip_points(struct nct6775_data *data, int nr) return 0; } -static void pwm_update_registers(struct nct6775_data *data, int nr) +static int pwm_update_registers(struct nct6775_data *data, int nr) { - u8 reg; + u16 reg; + int err; switch (data->pwm_enable[nr]) { case off: case manual: break; case speed_cruise: - reg = data->read_value(data, data->REG_FAN_MODE[nr]); + err = nct6775_read_value(data, data->REG_FAN_MODE[nr], &reg); + if (err) + return err; reg = (reg & ~data->tolerance_mask) | (data->target_speed_tolerance[nr] & data->tolerance_mask); - data->write_value(data, data->REG_FAN_MODE[nr], reg); - data->write_value(data, data->REG_TARGET[nr], - data->target_speed[nr] & 0xff); + err = nct6775_write_value(data, data->REG_FAN_MODE[nr], reg); + if (err) + return err; + err = nct6775_write_value(data, data->REG_TARGET[nr], + data->target_speed[nr] & 0xff); + if (err) + return err; if (data->REG_TOLERANCE_H) { reg = (data->target_speed[nr] >> 8) & 0x0f; reg |= (data->target_speed_tolerance[nr] & 0x38) << 1; - data->write_value(data, - data->REG_TOLERANCE_H[nr], - reg); + err = nct6775_write_value(data, data->REG_TOLERANCE_H[nr], reg); + if (err) + return err; } break; case thermal_cruise: - data->write_value(data, data->REG_TARGET[nr], - data->target_temp[nr]); + err = nct6775_write_value(data, data->REG_TARGET[nr], data->target_temp[nr]); + if (err) + return err; fallthrough; default: - reg = data->read_value(data, data->REG_FAN_MODE[nr]); + err = nct6775_read_value(data, data->REG_FAN_MODE[nr], &reg); + if (err) + return err; reg = (reg & ~data->tolerance_mask) | data->temp_tolerance[0][nr]; - data->write_value(data, data->REG_FAN_MODE[nr], reg); + err = nct6775_write_value(data, data->REG_FAN_MODE[nr], reg); + if (err) + return err; break; } + + return 0; } static ssize_t @@ -2906,6 +2561,9 @@ show_pwm_enable(struct device *dev, struct device_attribute *attr, char *buf) struct nct6775_data *data = nct6775_update_device(dev); struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr); + if (IS_ERR(data)) + return PTR_ERR(data); + return sprintf(buf, "%d\n", data->pwm_enable[sattr->index]); } @@ -2943,15 +2601,22 @@ store_pwm_enable(struct device *dev, struct device_attribute *attr, * turn off pwm control: select manual mode, set pwm to maximum */ data->pwm[0][nr] = 255; - data->write_value(data, data->REG_PWM[0][nr], 255); + err = nct6775_write_value(data, data->REG_PWM[0][nr], 255); + if (err) + goto out; } - pwm_update_registers(data, nr); - reg = data->read_value(data, data->REG_FAN_MODE[nr]); + err = pwm_update_registers(data, nr); + if (err) + goto out; + err = nct6775_read_value(data, data->REG_FAN_MODE[nr], &reg); + if (err) + goto out; reg &= 0x0f; reg |= pwm_enable_to_reg(val) << 4; - data->write_value(data, data->REG_FAN_MODE[nr], reg); + err = nct6775_write_value(data, data->REG_FAN_MODE[nr], reg); +out: mutex_unlock(&data->update_lock); - return count; + return err ? 
: count; } static ssize_t @@ -2978,6 +2643,9 @@ show_pwm_temp_sel(struct device *dev, struct device_attribute *attr, char *buf) struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr); int index = sattr->index; + if (IS_ERR(data)) + return PTR_ERR(data); + return show_pwm_temp_sel_common(data, buf, data->pwm_temp_sel[index]); } @@ -2989,7 +2657,11 @@ store_pwm_temp_sel(struct device *dev, struct device_attribute *attr, struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr); int nr = sattr->index; unsigned long val; - int err, reg, src; + int err, src; + u16 reg; + + if (IS_ERR(data)) + return PTR_ERR(data); err = kstrtoul(buf, 10, &val); if (err < 0) @@ -3002,13 +2674,16 @@ store_pwm_temp_sel(struct device *dev, struct device_attribute *attr, mutex_lock(&data->update_lock); src = data->temp_src[val - 1]; data->pwm_temp_sel[nr] = src; - reg = data->read_value(data, data->REG_TEMP_SEL[nr]); + err = nct6775_read_value(data, data->REG_TEMP_SEL[nr], &reg); + if (err) + goto out; reg &= 0xe0; reg |= src; - data->write_value(data, data->REG_TEMP_SEL[nr], reg); + err = nct6775_write_value(data, data->REG_TEMP_SEL[nr], reg); +out: mutex_unlock(&data->update_lock); - return count; + return err ? : count; } static ssize_t @@ -3019,6 +2694,9 @@ show_pwm_weight_temp_sel(struct device *dev, struct device_attribute *attr, struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr); int index = sattr->index; + if (IS_ERR(data)) + return PTR_ERR(data); + return show_pwm_temp_sel_common(data, buf, data->pwm_weight_temp_sel[index]); } @@ -3031,7 +2709,11 @@ store_pwm_weight_temp_sel(struct device *dev, struct device_attribute *attr, struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr); int nr = sattr->index; unsigned long val; - int err, reg, src; + int err, src; + u16 reg; + + if (IS_ERR(data)) + return PTR_ERR(data); err = kstrtoul(buf, 10, &val); if (err < 0) @@ -3047,19 +2729,24 @@ store_pwm_weight_temp_sel(struct device *dev, struct device_attribute *attr, if (val) { src = data->temp_src[val - 1]; data->pwm_weight_temp_sel[nr] = src; - reg = data->read_value(data, data->REG_WEIGHT_TEMP_SEL[nr]); + err = nct6775_read_value(data, data->REG_WEIGHT_TEMP_SEL[nr], &reg); + if (err) + goto out; reg &= 0xe0; reg |= (src | 0x80); - data->write_value(data, data->REG_WEIGHT_TEMP_SEL[nr], reg); + err = nct6775_write_value(data, data->REG_WEIGHT_TEMP_SEL[nr], reg); } else { data->pwm_weight_temp_sel[nr] = 0; - reg = data->read_value(data, data->REG_WEIGHT_TEMP_SEL[nr]); + err = nct6775_read_value(data, data->REG_WEIGHT_TEMP_SEL[nr], &reg); + if (err) + goto out; reg &= 0x7f; - data->write_value(data, data->REG_WEIGHT_TEMP_SEL[nr], reg); + err = nct6775_write_value(data, data->REG_WEIGHT_TEMP_SEL[nr], reg); } +out: mutex_unlock(&data->update_lock); - return count; + return err ? : count; } static ssize_t @@ -3068,6 +2755,9 @@ show_target_temp(struct device *dev, struct device_attribute *attr, char *buf) struct nct6775_data *data = nct6775_update_device(dev); struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr); + if (IS_ERR(data)) + return PTR_ERR(data); + return sprintf(buf, "%d\n", data->target_temp[sattr->index] * 1000); } @@ -3090,9 +2780,9 @@ store_target_temp(struct device *dev, struct device_attribute *attr, mutex_lock(&data->update_lock); data->target_temp[nr] = val; - pwm_update_registers(data, nr); + err = pwm_update_registers(data, nr); mutex_unlock(&data->update_lock); - return count; + return err ? 
: count; } static ssize_t @@ -3102,6 +2792,9 @@ show_target_speed(struct device *dev, struct device_attribute *attr, char *buf) struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr); int nr = sattr->index; + if (IS_ERR(data)) + return PTR_ERR(data); + return sprintf(buf, "%d\n", fan_from_reg16(data->target_speed[nr], data->fan_div[nr])); @@ -3127,9 +2820,9 @@ store_target_speed(struct device *dev, struct device_attribute *attr, mutex_lock(&data->update_lock); data->target_speed[nr] = speed; - pwm_update_registers(data, nr); + err = pwm_update_registers(data, nr); mutex_unlock(&data->update_lock); - return count; + return err ? : count; } static ssize_t @@ -3141,6 +2834,9 @@ show_temp_tolerance(struct device *dev, struct device_attribute *attr, int nr = sattr->nr; int index = sattr->index; + if (IS_ERR(data)) + return PTR_ERR(data); + return sprintf(buf, "%d\n", data->temp_tolerance[index][nr] * 1000); } @@ -3165,13 +2861,11 @@ store_temp_tolerance(struct device *dev, struct device_attribute *attr, mutex_lock(&data->update_lock); data->temp_tolerance[index][nr] = val; if (index) - pwm_update_registers(data, nr); + err = pwm_update_registers(data, nr); else - data->write_value(data, - data->REG_CRITICAL_TEMP_TOLERANCE[nr], - val); + err = nct6775_write_value(data, data->REG_CRITICAL_TEMP_TOLERANCE[nr], val); mutex_unlock(&data->update_lock); - return count; + return err ? : count; } /* @@ -3188,8 +2882,12 @@ show_speed_tolerance(struct device *dev, struct device_attribute *attr, struct nct6775_data *data = nct6775_update_device(dev); struct sensor_device_attribute *sattr = to_sensor_dev_attr(attr); int nr = sattr->index; - int target = data->target_speed[nr]; - int tolerance = 0; + int target, tolerance = 0; + + if (IS_ERR(data)) + return PTR_ERR(data); + + target = data->target_speed[nr]; if (target) { int low = target - data->target_speed_tolerance[nr]; @@ -3239,24 +2937,19 @@ store_speed_tolerance(struct device *dev, struct device_attribute *attr, mutex_lock(&data->update_lock); data->target_speed_tolerance[nr] = val; - pwm_update_registers(data, nr); + err = pwm_update_registers(data, nr); mutex_unlock(&data->update_lock); - return count; + return err ? 
: count; } -SENSOR_TEMPLATE_2(pwm, "pwm%d", S_IWUSR | S_IRUGO, show_pwm, store_pwm, 0, 0); -SENSOR_TEMPLATE(pwm_mode, "pwm%d_mode", S_IWUSR | S_IRUGO, show_pwm_mode, - store_pwm_mode, 0); -SENSOR_TEMPLATE(pwm_enable, "pwm%d_enable", S_IWUSR | S_IRUGO, show_pwm_enable, - store_pwm_enable, 0); -SENSOR_TEMPLATE(pwm_temp_sel, "pwm%d_temp_sel", S_IWUSR | S_IRUGO, - show_pwm_temp_sel, store_pwm_temp_sel, 0); -SENSOR_TEMPLATE(pwm_target_temp, "pwm%d_target_temp", S_IWUSR | S_IRUGO, - show_target_temp, store_target_temp, 0); -SENSOR_TEMPLATE(fan_target, "fan%d_target", S_IWUSR | S_IRUGO, - show_target_speed, store_target_speed, 0); -SENSOR_TEMPLATE(fan_tolerance, "fan%d_tolerance", S_IWUSR | S_IRUGO, - show_speed_tolerance, store_speed_tolerance, 0); +SENSOR_TEMPLATE_2(pwm, "pwm%d", 0644, show_pwm, store_pwm, 0, 0); +SENSOR_TEMPLATE(pwm_mode, "pwm%d_mode", 0644, show_pwm_mode, store_pwm_mode, 0); +SENSOR_TEMPLATE(pwm_enable, "pwm%d_enable", 0644, show_pwm_enable, store_pwm_enable, 0); +SENSOR_TEMPLATE(pwm_temp_sel, "pwm%d_temp_sel", 0644, show_pwm_temp_sel, store_pwm_temp_sel, 0); +SENSOR_TEMPLATE(pwm_target_temp, "pwm%d_target_temp", 0644, show_target_temp, store_target_temp, 0); +SENSOR_TEMPLATE(fan_target, "fan%d_target", 0644, show_target_speed, store_target_speed, 0); +SENSOR_TEMPLATE(fan_tolerance, "fan%d_tolerance", 0644, show_speed_tolerance, + store_speed_tolerance, 0); /* Smart Fan registers */ @@ -3268,6 +2961,9 @@ show_weight_temp(struct device *dev, struct device_attribute *attr, char *buf) int nr = sattr->nr; int index = sattr->index; + if (IS_ERR(data)) + return PTR_ERR(data); + return sprintf(buf, "%d\n", data->weight_temp[index][nr] * 1000); } @@ -3290,23 +2986,21 @@ store_weight_temp(struct device *dev, struct device_attribute *attr, mutex_lock(&data->update_lock); data->weight_temp[index][nr] = val; - data->write_value(data, data->REG_WEIGHT_TEMP[index][nr], val); + err = nct6775_write_value(data, data->REG_WEIGHT_TEMP[index][nr], val); mutex_unlock(&data->update_lock); - return count; + return err ? 
: count; } -SENSOR_TEMPLATE(pwm_weight_temp_sel, "pwm%d_weight_temp_sel", S_IWUSR | S_IRUGO, - show_pwm_weight_temp_sel, store_pwm_weight_temp_sel, 0); +SENSOR_TEMPLATE(pwm_weight_temp_sel, "pwm%d_weight_temp_sel", 0644, + show_pwm_weight_temp_sel, store_pwm_weight_temp_sel, 0); SENSOR_TEMPLATE_2(pwm_weight_temp_step, "pwm%d_weight_temp_step", - S_IWUSR | S_IRUGO, show_weight_temp, store_weight_temp, 0, 0); + 0644, show_weight_temp, store_weight_temp, 0, 0); SENSOR_TEMPLATE_2(pwm_weight_temp_step_tol, "pwm%d_weight_temp_step_tol", - S_IWUSR | S_IRUGO, show_weight_temp, store_weight_temp, 0, 1); + 0644, show_weight_temp, store_weight_temp, 0, 1); SENSOR_TEMPLATE_2(pwm_weight_temp_step_base, "pwm%d_weight_temp_step_base", - S_IWUSR | S_IRUGO, show_weight_temp, store_weight_temp, 0, 2); -SENSOR_TEMPLATE_2(pwm_weight_duty_step, "pwm%d_weight_duty_step", - S_IWUSR | S_IRUGO, show_pwm, store_pwm, 0, 5); -SENSOR_TEMPLATE_2(pwm_weight_duty_base, "pwm%d_weight_duty_base", - S_IWUSR | S_IRUGO, show_pwm, store_pwm, 0, 6); + 0644, show_weight_temp, store_weight_temp, 0, 2); +SENSOR_TEMPLATE_2(pwm_weight_duty_step, "pwm%d_weight_duty_step", 0644, show_pwm, store_pwm, 0, 5); +SENSOR_TEMPLATE_2(pwm_weight_duty_base, "pwm%d_weight_duty_base", 0644, show_pwm, store_pwm, 0, 6); static ssize_t show_fan_time(struct device *dev, struct device_attribute *attr, char *buf) @@ -3316,6 +3010,9 @@ show_fan_time(struct device *dev, struct device_attribute *attr, char *buf) int nr = sattr->nr; int index = sattr->index; + if (IS_ERR(data)) + return PTR_ERR(data); + return sprintf(buf, "%d\n", step_time_from_reg(data->fan_time[index][nr], data->pwm_mode[nr])); @@ -3339,9 +3036,9 @@ store_fan_time(struct device *dev, struct device_attribute *attr, val = step_time_to_reg(val, data->pwm_mode[nr]); mutex_lock(&data->update_lock); data->fan_time[index][nr] = val; - data->write_value(data, data->REG_FAN_TIME[index][nr], val); + err = nct6775_write_value(data, data->REG_FAN_TIME[index][nr], val); mutex_unlock(&data->update_lock); - return count; + return err ? 
: count; } static ssize_t @@ -3350,6 +3047,9 @@ show_auto_pwm(struct device *dev, struct device_attribute *attr, char *buf) struct nct6775_data *data = nct6775_update_device(dev); struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + if (IS_ERR(data)) + return PTR_ERR(data); + return sprintf(buf, "%d\n", data->auto_pwm[sattr->nr][sattr->index]); } @@ -3363,7 +3063,7 @@ store_auto_pwm(struct device *dev, struct device_attribute *attr, int point = sattr->index; unsigned long val; int err; - u8 reg; + u16 reg; err = kstrtoul(buf, 10, &val); if (err < 0) @@ -3381,21 +3081,20 @@ store_auto_pwm(struct device *dev, struct device_attribute *attr, mutex_lock(&data->update_lock); data->auto_pwm[nr][point] = val; if (point < data->auto_pwm_num) { - data->write_value(data, - NCT6775_AUTO_PWM(data, nr, point), - data->auto_pwm[nr][point]); + err = nct6775_write_value(data, NCT6775_AUTO_PWM(data, nr, point), + data->auto_pwm[nr][point]); } else { switch (data->kind) { case nct6775: /* disable if needed (pwm == 0) */ - reg = data->read_value(data, - NCT6775_REG_CRITICAL_ENAB[nr]); + err = nct6775_read_value(data, NCT6775_REG_CRITICAL_ENAB[nr], &reg); + if (err) + break; if (val) reg |= 0x02; else reg &= ~0x02; - data->write_value(data, NCT6775_REG_CRITICAL_ENAB[nr], - reg); + err = nct6775_write_value(data, NCT6775_REG_CRITICAL_ENAB[nr], reg); break; case nct6776: break; /* always enabled, nothing to do */ @@ -3409,22 +3108,22 @@ store_auto_pwm(struct device *dev, struct device_attribute *attr, case nct6796: case nct6797: case nct6798: - data->write_value(data, data->REG_CRITICAL_PWM[nr], - val); - reg = data->read_value(data, - data->REG_CRITICAL_PWM_ENABLE[nr]); + err = nct6775_write_value(data, data->REG_CRITICAL_PWM[nr], val); + if (err) + break; + err = nct6775_read_value(data, data->REG_CRITICAL_PWM_ENABLE[nr], &reg); + if (err) + break; if (val == 255) reg &= ~data->CRITICAL_PWM_ENABLE_MASK; else reg |= data->CRITICAL_PWM_ENABLE_MASK; - data->write_value(data, - data->REG_CRITICAL_PWM_ENABLE[nr], - reg); + err = nct6775_write_value(data, data->REG_CRITICAL_PWM_ENABLE[nr], reg); break; } } mutex_unlock(&data->update_lock); - return count; + return err ? : count; } static ssize_t @@ -3435,6 +3134,9 @@ show_auto_temp(struct device *dev, struct device_attribute *attr, char *buf) struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); int nr = sattr->nr; int point = sattr->index; + if (IS_ERR(data)) + return PTR_ERR(data); + /* * We don't know for sure if the temperature is signed or unsigned. * Assume it is unsigned. @@ -3462,15 +3164,14 @@ store_auto_temp(struct device *dev, struct device_attribute *attr, mutex_lock(&data->update_lock); data->auto_temp[nr][point] = DIV_ROUND_CLOSEST(val, 1000); if (point < data->auto_pwm_num) { - data->write_value(data, - NCT6775_AUTO_TEMP(data, nr, point), - data->auto_temp[nr][point]); + err = nct6775_write_value(data, NCT6775_AUTO_TEMP(data, nr, point), + data->auto_temp[nr][point]); } else { - data->write_value(data, data->REG_CRITICAL_TEMP[nr], - data->auto_temp[nr][point]); + err = nct6775_write_value(data, data->REG_CRITICAL_TEMP[nr], + data->auto_temp[nr][point]); } mutex_unlock(&data->update_lock); - return count; + return err ? 
: count; } static umode_t nct6775_pwm_is_visible(struct kobject *kobj, @@ -3500,65 +3201,59 @@ static umode_t nct6775_pwm_is_visible(struct kobject *kobj, if (api > data->auto_pwm_num) return 0; } - return attr->mode; + return nct6775_attr_mode(data, attr); } -SENSOR_TEMPLATE_2(pwm_stop_time, "pwm%d_stop_time", S_IWUSR | S_IRUGO, - show_fan_time, store_fan_time, 0, 0); -SENSOR_TEMPLATE_2(pwm_step_up_time, "pwm%d_step_up_time", S_IWUSR | S_IRUGO, +SENSOR_TEMPLATE_2(pwm_stop_time, "pwm%d_stop_time", 0644, show_fan_time, store_fan_time, 0, 0); +SENSOR_TEMPLATE_2(pwm_step_up_time, "pwm%d_step_up_time", 0644, show_fan_time, store_fan_time, 0, 1); -SENSOR_TEMPLATE_2(pwm_step_down_time, "pwm%d_step_down_time", S_IWUSR | S_IRUGO, +SENSOR_TEMPLATE_2(pwm_step_down_time, "pwm%d_step_down_time", 0644, show_fan_time, store_fan_time, 0, 2); -SENSOR_TEMPLATE_2(pwm_start, "pwm%d_start", S_IWUSR | S_IRUGO, show_pwm, - store_pwm, 0, 1); -SENSOR_TEMPLATE_2(pwm_floor, "pwm%d_floor", S_IWUSR | S_IRUGO, show_pwm, - store_pwm, 0, 2); -SENSOR_TEMPLATE_2(pwm_temp_tolerance, "pwm%d_temp_tolerance", S_IWUSR | S_IRUGO, +SENSOR_TEMPLATE_2(pwm_start, "pwm%d_start", 0644, show_pwm, store_pwm, 0, 1); +SENSOR_TEMPLATE_2(pwm_floor, "pwm%d_floor", 0644, show_pwm, store_pwm, 0, 2); +SENSOR_TEMPLATE_2(pwm_temp_tolerance, "pwm%d_temp_tolerance", 0644, show_temp_tolerance, store_temp_tolerance, 0, 0); SENSOR_TEMPLATE_2(pwm_crit_temp_tolerance, "pwm%d_crit_temp_tolerance", - S_IWUSR | S_IRUGO, show_temp_tolerance, store_temp_tolerance, - 0, 1); + 0644, show_temp_tolerance, store_temp_tolerance, 0, 1); -SENSOR_TEMPLATE_2(pwm_max, "pwm%d_max", S_IWUSR | S_IRUGO, show_pwm, store_pwm, - 0, 3); +SENSOR_TEMPLATE_2(pwm_max, "pwm%d_max", 0644, show_pwm, store_pwm, 0, 3); -SENSOR_TEMPLATE_2(pwm_step, "pwm%d_step", S_IWUSR | S_IRUGO, show_pwm, - store_pwm, 0, 4); +SENSOR_TEMPLATE_2(pwm_step, "pwm%d_step", 0644, show_pwm, store_pwm, 0, 4); SENSOR_TEMPLATE_2(pwm_auto_point1_pwm, "pwm%d_auto_point1_pwm", - S_IWUSR | S_IRUGO, show_auto_pwm, store_auto_pwm, 0, 0); + 0644, show_auto_pwm, store_auto_pwm, 0, 0); SENSOR_TEMPLATE_2(pwm_auto_point1_temp, "pwm%d_auto_point1_temp", - S_IWUSR | S_IRUGO, show_auto_temp, store_auto_temp, 0, 0); + 0644, show_auto_temp, store_auto_temp, 0, 0); SENSOR_TEMPLATE_2(pwm_auto_point2_pwm, "pwm%d_auto_point2_pwm", - S_IWUSR | S_IRUGO, show_auto_pwm, store_auto_pwm, 0, 1); + 0644, show_auto_pwm, store_auto_pwm, 0, 1); SENSOR_TEMPLATE_2(pwm_auto_point2_temp, "pwm%d_auto_point2_temp", - S_IWUSR | S_IRUGO, show_auto_temp, store_auto_temp, 0, 1); + 0644, show_auto_temp, store_auto_temp, 0, 1); SENSOR_TEMPLATE_2(pwm_auto_point3_pwm, "pwm%d_auto_point3_pwm", - S_IWUSR | S_IRUGO, show_auto_pwm, store_auto_pwm, 0, 2); + 0644, show_auto_pwm, store_auto_pwm, 0, 2); SENSOR_TEMPLATE_2(pwm_auto_point3_temp, "pwm%d_auto_point3_temp", - S_IWUSR | S_IRUGO, show_auto_temp, store_auto_temp, 0, 2); + 0644, show_auto_temp, store_auto_temp, 0, 2); SENSOR_TEMPLATE_2(pwm_auto_point4_pwm, "pwm%d_auto_point4_pwm", - S_IWUSR | S_IRUGO, show_auto_pwm, store_auto_pwm, 0, 3); + 0644, show_auto_pwm, store_auto_pwm, 0, 3); SENSOR_TEMPLATE_2(pwm_auto_point4_temp, "pwm%d_auto_point4_temp", - S_IWUSR | S_IRUGO, show_auto_temp, store_auto_temp, 0, 3); + 0644, show_auto_temp, store_auto_temp, 0, 3); SENSOR_TEMPLATE_2(pwm_auto_point5_pwm, "pwm%d_auto_point5_pwm", - S_IWUSR | S_IRUGO, show_auto_pwm, store_auto_pwm, 0, 4); + 0644, show_auto_pwm, store_auto_pwm, 0, 4); SENSOR_TEMPLATE_2(pwm_auto_point5_temp, "pwm%d_auto_point5_temp", - S_IWUSR | S_IRUGO, 
show_auto_temp, store_auto_temp, 0, 4); + 0644, show_auto_temp, store_auto_temp, 0, 4); SENSOR_TEMPLATE_2(pwm_auto_point6_pwm, "pwm%d_auto_point6_pwm", - S_IWUSR | S_IRUGO, show_auto_pwm, store_auto_pwm, 0, 5); + 0644, show_auto_pwm, store_auto_pwm, 0, 5); SENSOR_TEMPLATE_2(pwm_auto_point6_temp, "pwm%d_auto_point6_temp", - S_IWUSR | S_IRUGO, show_auto_temp, store_auto_temp, 0, 5); + 0644, show_auto_temp, store_auto_temp, 0, 5); SENSOR_TEMPLATE_2(pwm_auto_point7_pwm, "pwm%d_auto_point7_pwm", - S_IWUSR | S_IRUGO, show_auto_pwm, store_auto_pwm, 0, 6); + 0644, show_auto_pwm, store_auto_pwm, 0, 6); SENSOR_TEMPLATE_2(pwm_auto_point7_temp, "pwm%d_auto_point7_temp", - S_IWUSR | S_IRUGO, show_auto_temp, store_auto_temp, 0, 6); + 0644, show_auto_temp, store_auto_temp, 0, 6); /* * nct6775_pwm_is_visible uses the index into the following array @@ -3612,123 +3307,21 @@ static const struct sensor_template_group nct6775_pwm_template_group = { .base = 1, }; -static ssize_t -cpu0_vid_show(struct device *dev, struct device_attribute *attr, char *buf) -{ - struct nct6775_data *data = dev_get_drvdata(dev); - - return sprintf(buf, "%d\n", vid_from_reg(data->vid, data->vrm)); -} - -static DEVICE_ATTR_RO(cpu0_vid); - -/* Case open detection */ - -static ssize_t -clear_caseopen(struct device *dev, struct device_attribute *attr, - const char *buf, size_t count) -{ - struct nct6775_data *data = dev_get_drvdata(dev); - struct nct6775_sio_data *sio_data = data->sio_data; - int nr = to_sensor_dev_attr(attr)->index - INTRUSION_ALARM_BASE; - unsigned long val; - u8 reg; - int ret; - - if (kstrtoul(buf, 10, &val) || val != 0) - return -EINVAL; - - mutex_lock(&data->update_lock); - - /* - * Use CR registers to clear caseopen status. - * The CR registers are the same for all chips, and not all chips - * support clearing the caseopen status through "regular" registers. 
- */ - ret = sio_data->sio_enter(sio_data); - if (ret) { - count = ret; - goto error; - } - - sio_data->sio_select(sio_data, NCT6775_LD_ACPI); - reg = sio_data->sio_inb(sio_data, NCT6775_REG_CR_CASEOPEN_CLR[nr]); - reg |= NCT6775_CR_CASEOPEN_CLR_MASK[nr]; - sio_data->sio_outb(sio_data, NCT6775_REG_CR_CASEOPEN_CLR[nr], reg); - reg &= ~NCT6775_CR_CASEOPEN_CLR_MASK[nr]; - sio_data->sio_outb(sio_data, NCT6775_REG_CR_CASEOPEN_CLR[nr], reg); - sio_data->sio_exit(sio_data); - - data->valid = false; /* Force cache refresh */ -error: - mutex_unlock(&data->update_lock); - return count; -} - -static SENSOR_DEVICE_ATTR(intrusion0_alarm, S_IWUSR | S_IRUGO, show_alarm, - clear_caseopen, INTRUSION_ALARM_BASE); -static SENSOR_DEVICE_ATTR(intrusion1_alarm, S_IWUSR | S_IRUGO, show_alarm, - clear_caseopen, INTRUSION_ALARM_BASE + 1); -static SENSOR_DEVICE_ATTR(intrusion0_beep, S_IWUSR | S_IRUGO, show_beep, - store_beep, INTRUSION_ALARM_BASE); -static SENSOR_DEVICE_ATTR(intrusion1_beep, S_IWUSR | S_IRUGO, show_beep, - store_beep, INTRUSION_ALARM_BASE + 1); -static SENSOR_DEVICE_ATTR(beep_enable, S_IWUSR | S_IRUGO, show_beep, - store_beep, BEEP_ENABLE_BASE); - -static umode_t nct6775_other_is_visible(struct kobject *kobj, - struct attribute *attr, int index) -{ - struct device *dev = kobj_to_dev(kobj); - struct nct6775_data *data = dev_get_drvdata(dev); - - if (index == 0 && !data->have_vid) - return 0; - - if (index == 1 || index == 2) { - if (data->ALARM_BITS[INTRUSION_ALARM_BASE + index - 1] < 0) - return 0; - } - - if (index == 3 || index == 4) { - if (data->BEEP_BITS[INTRUSION_ALARM_BASE + index - 3] < 0) - return 0; - } - - return attr->mode; -} - -/* - * nct6775_other_is_visible uses the index into the following array - * to determine if attributes should be created or not. - * Any change in order or content must be matched. 
- */ -static struct attribute *nct6775_attributes_other[] = { - &dev_attr_cpu0_vid.attr, /* 0 */ - &sensor_dev_attr_intrusion0_alarm.dev_attr.attr, /* 1 */ - &sensor_dev_attr_intrusion1_alarm.dev_attr.attr, /* 2 */ - &sensor_dev_attr_intrusion0_beep.dev_attr.attr, /* 3 */ - &sensor_dev_attr_intrusion1_beep.dev_attr.attr, /* 4 */ - &sensor_dev_attr_beep_enable.dev_attr.attr, /* 5 */ - - NULL -}; - -static const struct attribute_group nct6775_group_other = { - .attrs = nct6775_attributes_other, - .is_visible = nct6775_other_is_visible, -}; - -static inline void nct6775_init_device(struct nct6775_data *data) +static inline int nct6775_init_device(struct nct6775_data *data) { - int i; - u8 tmp, diode; + int i, err; + u16 tmp, diode; /* Start monitoring if needed */ if (data->REG_CONFIG) { - tmp = data->read_value(data, data->REG_CONFIG); - if (!(tmp & 0x01)) - data->write_value(data, data->REG_CONFIG, tmp | 0x01); + err = nct6775_read_value(data, data->REG_CONFIG, &tmp); + if (err) + return err; + if (!(tmp & 0x01)) { + err = nct6775_write_value(data, data->REG_CONFIG, tmp | 0x01); + if (err) + return err; + } } /* Enable temperature sensors if needed */ @@ -3737,18 +3330,29 @@ static inline void nct6775_init_device(struct nct6775_data *data) continue; if (!data->reg_temp_config[i]) continue; - tmp = data->read_value(data, data->reg_temp_config[i]); - if (tmp & 0x01) - data->write_value(data, data->reg_temp_config[i], - tmp & 0xfe); + err = nct6775_read_value(data, data->reg_temp_config[i], &tmp); + if (err) + return err; + if (tmp & 0x01) { + err = nct6775_write_value(data, data->reg_temp_config[i], tmp & 0xfe); + if (err) + return err; + } } /* Enable VBAT monitoring if needed */ - tmp = data->read_value(data, data->REG_VBAT); - if (!(tmp & 0x01)) - data->write_value(data, data->REG_VBAT, tmp | 0x01); + err = nct6775_read_value(data, data->REG_VBAT, &tmp); + if (err) + return err; + if (!(tmp & 0x01)) { + err = nct6775_write_value(data, data->REG_VBAT, tmp | 0x01); + if (err) + return err; + } - diode = data->read_value(data, data->REG_DIODE); + err = nct6775_read_value(data, data->REG_DIODE, &diode); + if (err) + return err; for (i = 0; i < data->temp_fixed_num; i++) { if (!(data->have_temp_fixed & BIT(i))) @@ -3759,241 +3363,24 @@ static inline void nct6775_init_device(struct nct6775_data *data) else /* thermistor */ data->temp_type[i] = 4; } -} - -static void -nct6775_check_fan_inputs(struct nct6775_data *data, struct nct6775_sio_data *sio_data) -{ - bool fan3pin = false, fan4pin = false, fan4min = false; - bool fan5pin = false, fan6pin = false, fan7pin = false; - bool pwm3pin = false, pwm4pin = false, pwm5pin = false; - bool pwm6pin = false, pwm7pin = false; - - /* Store SIO_REG_ENABLE for use during resume */ - sio_data->sio_select(sio_data, NCT6775_LD_HWM); - data->sio_reg_enable = sio_data->sio_inb(sio_data, SIO_REG_ENABLE); - - /* fan4 and fan5 share some pins with the GPIO and serial flash */ - if (data->kind == nct6775) { - int cr2c = sio_data->sio_inb(sio_data, 0x2c); - - fan3pin = cr2c & BIT(6); - pwm3pin = cr2c & BIT(7); - - /* On NCT6775, fan4 shares pins with the fdc interface */ - fan4pin = !(sio_data->sio_inb(sio_data, 0x2A) & 0x80); - } else if (data->kind == nct6776) { - bool gpok = sio_data->sio_inb(sio_data, 0x27) & 0x80; - const char *board_vendor, *board_name; - - board_vendor = dmi_get_system_info(DMI_BOARD_VENDOR); - board_name = dmi_get_system_info(DMI_BOARD_NAME); - - if (board_name && board_vendor && - !strcmp(board_vendor, "ASRock")) { - /* - * Auxiliary fan 
monitoring is not enabled on ASRock - * Z77 Pro4-M if booted in UEFI Ultra-FastBoot mode. - * Observed with BIOS version 2.00. - */ - if (!strcmp(board_name, "Z77 Pro4-M")) { - if ((data->sio_reg_enable & 0xe0) != 0xe0) { - data->sio_reg_enable |= 0xe0; - sio_data->sio_outb(sio_data, SIO_REG_ENABLE, - data->sio_reg_enable); - } - } - } - - if (data->sio_reg_enable & 0x80) - fan3pin = gpok; - else - fan3pin = !(sio_data->sio_inb(sio_data, 0x24) & 0x40); - - if (data->sio_reg_enable & 0x40) - fan4pin = gpok; - else - fan4pin = sio_data->sio_inb(sio_data, 0x1C) & 0x01; - - if (data->sio_reg_enable & 0x20) - fan5pin = gpok; - else - fan5pin = sio_data->sio_inb(sio_data, 0x1C) & 0x02; - - fan4min = fan4pin; - pwm3pin = fan3pin; - } else if (data->kind == nct6106) { - int cr24 = sio_data->sio_inb(sio_data, 0x24); - - fan3pin = !(cr24 & 0x80); - pwm3pin = cr24 & 0x08; - } else if (data->kind == nct6116) { - int cr1a = sio_data->sio_inb(sio_data, 0x1a); - int cr1b = sio_data->sio_inb(sio_data, 0x1b); - int cr24 = sio_data->sio_inb(sio_data, 0x24); - int cr2a = sio_data->sio_inb(sio_data, 0x2a); - int cr2b = sio_data->sio_inb(sio_data, 0x2b); - int cr2f = sio_data->sio_inb(sio_data, 0x2f); - - fan3pin = !(cr2b & 0x10); - fan4pin = (cr2b & 0x80) || // pin 1(2) - (!(cr2f & 0x10) && (cr1a & 0x04)); // pin 65(66) - fan5pin = (cr2b & 0x80) || // pin 126(127) - (!(cr1b & 0x03) && (cr2a & 0x02)); // pin 94(96) - - pwm3pin = fan3pin && (cr24 & 0x08); - pwm4pin = fan4pin; - pwm5pin = fan5pin; - } else { - /* - * NCT6779D, NCT6791D, NCT6792D, NCT6793D, NCT6795D, NCT6796D, - * NCT6797D, NCT6798D - */ - int cr1a = sio_data->sio_inb(sio_data, 0x1a); - int cr1b = sio_data->sio_inb(sio_data, 0x1b); - int cr1c = sio_data->sio_inb(sio_data, 0x1c); - int cr1d = sio_data->sio_inb(sio_data, 0x1d); - int cr2a = sio_data->sio_inb(sio_data, 0x2a); - int cr2b = sio_data->sio_inb(sio_data, 0x2b); - int cr2d = sio_data->sio_inb(sio_data, 0x2d); - int cr2f = sio_data->sio_inb(sio_data, 0x2f); - bool dsw_en = cr2f & BIT(3); - bool ddr4_en = cr2f & BIT(4); - int cre0; - int creb; - int cred; - - sio_data->sio_select(sio_data, NCT6775_LD_12); - cre0 = sio_data->sio_inb(sio_data, 0xe0); - creb = sio_data->sio_inb(sio_data, 0xeb); - cred = sio_data->sio_inb(sio_data, 0xed); - - fan3pin = !(cr1c & BIT(5)); - fan4pin = !(cr1c & BIT(6)); - fan5pin = !(cr1c & BIT(7)); - - pwm3pin = !(cr1c & BIT(0)); - pwm4pin = !(cr1c & BIT(1)); - pwm5pin = !(cr1c & BIT(2)); - - switch (data->kind) { - case nct6791: - fan6pin = cr2d & BIT(1); - pwm6pin = cr2d & BIT(0); - break; - case nct6792: - fan6pin = !dsw_en && (cr2d & BIT(1)); - pwm6pin = !dsw_en && (cr2d & BIT(0)); - break; - case nct6793: - fan5pin |= cr1b & BIT(5); - fan5pin |= creb & BIT(5); - - fan6pin = !dsw_en && (cr2d & BIT(1)); - fan6pin |= creb & BIT(3); - - pwm5pin |= cr2d & BIT(7); - pwm5pin |= (creb & BIT(4)) && !(cr2a & BIT(0)); - - pwm6pin = !dsw_en && (cr2d & BIT(0)); - pwm6pin |= creb & BIT(2); - break; - case nct6795: - fan5pin |= cr1b & BIT(5); - fan5pin |= creb & BIT(5); - - fan6pin = (cr2a & BIT(4)) && - (!dsw_en || (cred & BIT(4))); - fan6pin |= creb & BIT(3); - - pwm5pin |= cr2d & BIT(7); - pwm5pin |= (creb & BIT(4)) && !(cr2a & BIT(0)); - - pwm6pin = (cr2a & BIT(3)) && (cred & BIT(2)); - pwm6pin |= creb & BIT(2); - break; - case nct6796: - fan5pin |= cr1b & BIT(5); - fan5pin |= (cre0 & BIT(3)) && !(cr1b & BIT(0)); - fan5pin |= creb & BIT(5); - - fan6pin = (cr2a & BIT(4)) && - (!dsw_en || (cred & BIT(4))); - fan6pin |= creb & BIT(3); - - fan7pin = !(cr2b & BIT(2)); - - 
pwm5pin |= cr2d & BIT(7); - pwm5pin |= (cre0 & BIT(4)) && !(cr1b & BIT(0)); - pwm5pin |= (creb & BIT(4)) && !(cr2a & BIT(0)); - - pwm6pin = (cr2a & BIT(3)) && (cred & BIT(2)); - pwm6pin |= creb & BIT(2); - - pwm7pin = !(cr1d & (BIT(2) | BIT(3))); - break; - case nct6797: - fan5pin |= !ddr4_en && (cr1b & BIT(5)); - fan5pin |= creb & BIT(5); - - fan6pin = cr2a & BIT(4); - fan6pin |= creb & BIT(3); - - fan7pin = cr1a & BIT(1); - - pwm5pin |= (creb & BIT(4)) && !(cr2a & BIT(0)); - pwm5pin |= !ddr4_en && (cr2d & BIT(7)); - - pwm6pin = creb & BIT(2); - pwm6pin |= cred & BIT(2); - - pwm7pin = cr1d & BIT(4); - break; - case nct6798: - fan6pin = !(cr1b & BIT(0)) && (cre0 & BIT(3)); - fan6pin |= cr2a & BIT(4); - fan6pin |= creb & BIT(5); - - fan7pin = cr1b & BIT(5); - fan7pin |= !(cr2b & BIT(2)); - fan7pin |= creb & BIT(3); - - pwm6pin = !(cr1b & BIT(0)) && (cre0 & BIT(4)); - pwm6pin |= !(cred & BIT(2)) && (cr2a & BIT(3)); - pwm6pin |= (creb & BIT(4)) && !(cr2a & BIT(0)); - - pwm7pin = !(cr1d & (BIT(2) | BIT(3))); - pwm7pin |= cr2d & BIT(7); - pwm7pin |= creb & BIT(2); - break; - default: /* NCT6779D */ - break; - } - - fan4min = fan4pin; - } - /* fan 1 and 2 (0x03) are always present */ - data->has_fan = 0x03 | (fan3pin << 2) | (fan4pin << 3) | - (fan5pin << 4) | (fan6pin << 5) | (fan7pin << 6); - data->has_fan_min = 0x03 | (fan3pin << 2) | (fan4min << 3) | - (fan5pin << 4) | (fan6pin << 5) | (fan7pin << 6); - data->has_pwm = 0x03 | (pwm3pin << 2) | (pwm4pin << 3) | - (pwm5pin << 4) | (pwm6pin << 5) | (pwm7pin << 6); + return 0; } -static void add_temp_sensors(struct nct6775_data *data, const u16 *regp, - int *available, int *mask) +static int add_temp_sensors(struct nct6775_data *data, const u16 *regp, + int *available, int *mask) { - int i; - u8 src; + int i, err; + u16 src; for (i = 0; i < data->pwm_num && *available; i++) { int index; if (!regp[i]) continue; - src = data->read_value(data, regp[i]); + err = nct6775_read_value(data, regp[i], &src); + if (err) + return err; src &= 0x1f; if (!src || (*mask & BIT(src))) continue; @@ -4001,58 +3388,36 @@ static void add_temp_sensors(struct nct6775_data *data, const u16 *regp, continue; index = __ffs(*available); - data->write_value(data, data->REG_TEMP_SOURCE[index], src); + err = nct6775_write_value(data, data->REG_TEMP_SOURCE[index], src); + if (err) + return err; *available &= ~BIT(index); *mask |= BIT(src); } + + return 0; } -static int nct6775_probe(struct platform_device *pdev) +int nct6775_probe(struct device *dev, struct nct6775_data *data, + const struct regmap_config *regmapcfg) { - struct device *dev = &pdev->dev; - struct nct6775_sio_data *sio_data = dev_get_platdata(dev); - struct nct6775_data *data; - struct resource *res; int i, s, err = 0; - int src, mask, available; + int mask, available; + u16 src; const u16 *reg_temp, *reg_temp_over, *reg_temp_hyst, *reg_temp_config; const u16 *reg_temp_mon, *reg_temp_alternate, *reg_temp_crit; const u16 *reg_temp_crit_l = NULL, *reg_temp_crit_h = NULL; int num_reg_temp, num_reg_temp_mon, num_reg_tsi_temp; - u8 cr2a; - struct attribute_group *group; struct device *hwmon_dev; struct sensor_template_group tsi_temp_tg; - int num_attr_groups = 0; - if (sio_data->access == access_direct) { - res = platform_get_resource(pdev, IORESOURCE_IO, 0); - if (!devm_request_region(&pdev->dev, res->start, IOREGION_LENGTH, - DRVNAME)) - return -EBUSY; - } - - data = devm_kzalloc(&pdev->dev, sizeof(struct nct6775_data), - GFP_KERNEL); - if (!data) - return -ENOMEM; - - data->kind = sio_data->kind; - data->sio_data = 
sio_data; - - if (sio_data->access == access_direct) { - data->addr = res->start; - data->read_value = nct6775_read_value; - data->write_value = nct6775_write_value; - } else { - data->read_value = nct6775_wmi_read_value; - data->write_value = nct6775_wmi_write_value; - } + data->regmap = devm_regmap_init(dev, NULL, data, regmapcfg); + if (IS_ERR(data->regmap)) + return PTR_ERR(data->regmap); mutex_init(&data->update_lock); data->name = nct6775_device_names[data->kind]; data->bank = 0xff; /* Force initial bank selection */ - platform_set_drvdata(pdev, data); switch (data->kind) { case nct6106: @@ -4596,7 +3961,10 @@ static int nct6775_probe(struct platform_device *pdev) if (reg_temp[i] == 0) continue; - src = data->read_value(data, data->REG_TEMP_SOURCE[i]) & 0x1f; + err = nct6775_read_value(data, data->REG_TEMP_SOURCE[i], &src); + if (err) + return err; + src &= 0x1f; if (!src || (mask & BIT(src))) available |= BIT(i); @@ -4607,8 +3975,12 @@ static int nct6775_probe(struct platform_device *pdev) * Now find unmonitored temperature registers and enable monitoring * if additional monitoring registers are available. */ - add_temp_sensors(data, data->REG_TEMP_SEL, &available, &mask); - add_temp_sensors(data, data->REG_WEIGHT_TEMP_SEL, &available, &mask); + err = add_temp_sensors(data, data->REG_TEMP_SEL, &available, &mask); + if (err) + return err; + err = add_temp_sensors(data, data->REG_WEIGHT_TEMP_SEL, &available, &mask); + if (err) + return err; mask = 0; s = NUM_TEMP_FIXED; /* First dynamic temperature attribute */ @@ -4616,7 +3988,10 @@ static int nct6775_probe(struct platform_device *pdev) if (reg_temp[i] == 0) continue; - src = data->read_value(data, data->REG_TEMP_SOURCE[i]) & 0x1f; + err = nct6775_read_value(data, data->REG_TEMP_SOURCE[i], &src); + if (err) + return err; + src &= 0x1f; if (!src || (mask & BIT(src))) continue; @@ -4676,7 +4051,10 @@ static int nct6775_probe(struct platform_device *pdev) if (reg_temp_mon[i] == 0) continue; - src = data->read_value(data, data->REG_TEMP_SEL[i]) & 0x1f; + err = nct6775_read_value(data, data->REG_TEMP_SEL[i], &src); + if (err) + return err; + src &= 0x1f; if (!src) continue; @@ -4760,525 +4138,68 @@ static int nct6775_probe(struct platform_device *pdev) /* Check which TSIx_TEMP registers are active */ for (i = 0; i < num_reg_tsi_temp; i++) { - if (data->read_value(data, data->REG_TSI_TEMP[i])) + u16 tmp; + + err = nct6775_read_value(data, data->REG_TSI_TEMP[i], &tmp); + if (err) + return err; + if (tmp) data->have_tsi_temp |= BIT(i); } /* Initialize the chip */ - nct6775_init_device(data); - - err = sio_data->sio_enter(sio_data); + err = nct6775_init_device(data); if (err) return err; - cr2a = sio_data->sio_inb(sio_data, 0x2a); - switch (data->kind) { - case nct6775: - data->have_vid = (cr2a & 0x40); - break; - case nct6776: - data->have_vid = (cr2a & 0x60) == 0x40; - break; - case nct6106: - case nct6116: - case nct6779: - case nct6791: - case nct6792: - case nct6793: - case nct6795: - case nct6796: - case nct6797: - case nct6798: - break; - } - - /* - * Read VID value - * We can get the VID input values directly at logical device D 0xe3. 
- */ - if (data->have_vid) { - sio_data->sio_select(sio_data, NCT6775_LD_VID); - data->vid = sio_data->sio_inb(sio_data, 0xe3); - data->vrm = vid_which_vrm(); - } - - if (fan_debounce) { - u8 tmp; - - sio_data->sio_select(sio_data, NCT6775_LD_HWM); - tmp = sio_data->sio_inb(sio_data, - NCT6775_REG_CR_FAN_DEBOUNCE); - switch (data->kind) { - case nct6106: - case nct6116: - tmp |= 0xe0; - break; - case nct6775: - tmp |= 0x1e; - break; - case nct6776: - case nct6779: - tmp |= 0x3e; - break; - case nct6791: - case nct6792: - case nct6793: - case nct6795: - case nct6796: - case nct6797: - case nct6798: - tmp |= 0x7e; - break; - } - sio_data->sio_outb(sio_data, NCT6775_REG_CR_FAN_DEBOUNCE, - tmp); - dev_info(&pdev->dev, "Enabled fan debounce for chip %s\n", - data->name); + if (data->driver_init) { + err = data->driver_init(data); + if (err) + return err; } - nct6775_check_fan_inputs(data, sio_data); - - sio_data->sio_exit(sio_data); - /* Read fan clock dividers immediately */ - nct6775_init_fan_common(dev, data); + err = nct6775_init_fan_common(dev, data); + if (err) + return err; /* Register sysfs hooks */ - group = nct6775_create_attr_group(dev, &nct6775_pwm_template_group, - data->pwm_num); - if (IS_ERR(group)) - return PTR_ERR(group); - - data->groups[num_attr_groups++] = group; - - group = nct6775_create_attr_group(dev, &nct6775_in_template_group, - fls(data->have_in)); - if (IS_ERR(group)) - return PTR_ERR(group); - - data->groups[num_attr_groups++] = group; - - group = nct6775_create_attr_group(dev, &nct6775_fan_template_group, - fls(data->has_fan)); - if (IS_ERR(group)) - return PTR_ERR(group); + err = nct6775_add_template_attr_group(dev, data, &nct6775_pwm_template_group, + data->pwm_num); + if (err) + return err; - data->groups[num_attr_groups++] = group; + err = nct6775_add_template_attr_group(dev, data, &nct6775_in_template_group, + fls(data->have_in)); + if (err) + return err; - group = nct6775_create_attr_group(dev, &nct6775_temp_template_group, - fls(data->have_temp)); - if (IS_ERR(group)) - return PTR_ERR(group); + err = nct6775_add_template_attr_group(dev, data, &nct6775_fan_template_group, + fls(data->has_fan)); + if (err) + return err; - data->groups[num_attr_groups++] = group; + err = nct6775_add_template_attr_group(dev, data, &nct6775_temp_template_group, + fls(data->have_temp)); + if (err) + return err; if (data->have_tsi_temp) { tsi_temp_tg.templates = nct6775_tsi_temp_template; tsi_temp_tg.is_visible = nct6775_tsi_temp_is_visible; tsi_temp_tg.base = fls(data->have_temp) + 1; - group = nct6775_create_attr_group(dev, &tsi_temp_tg, fls(data->have_tsi_temp)); - if (IS_ERR(group)) - return PTR_ERR(group); - - data->groups[num_attr_groups++] = group; + err = nct6775_add_template_attr_group(dev, data, &tsi_temp_tg, + fls(data->have_tsi_temp)); + if (err) + return err; } - data->groups[num_attr_groups++] = &nct6775_group_other; - hwmon_dev = devm_hwmon_device_register_with_groups(dev, data->name, data, data->groups); return PTR_ERR_OR_ZERO(hwmon_dev); } - -static void nct6791_enable_io_mapping(struct nct6775_sio_data *sio_data) -{ - int val; - - val = sio_data->sio_inb(sio_data, NCT6791_REG_HM_IO_SPACE_LOCK_ENABLE); - if (val & 0x10) { - pr_info("Enabling hardware monitor logical device mappings.\n"); - sio_data->sio_outb(sio_data, NCT6791_REG_HM_IO_SPACE_LOCK_ENABLE, - val & ~0x10); - } -} - -static int __maybe_unused nct6775_suspend(struct device *dev) -{ - struct nct6775_data *data = nct6775_update_device(dev); - - mutex_lock(&data->update_lock); - data->vbat = 
data->read_value(data, data->REG_VBAT); - if (data->kind == nct6775) { - data->fandiv1 = data->read_value(data, NCT6775_REG_FANDIV1); - data->fandiv2 = data->read_value(data, NCT6775_REG_FANDIV2); - } - mutex_unlock(&data->update_lock); - - return 0; -} - -static int __maybe_unused nct6775_resume(struct device *dev) -{ - struct nct6775_data *data = dev_get_drvdata(dev); - struct nct6775_sio_data *sio_data = dev_get_platdata(dev); - int i, j, err = 0; - u8 reg; - - mutex_lock(&data->update_lock); - data->bank = 0xff; /* Force initial bank selection */ - - err = sio_data->sio_enter(sio_data); - if (err) - goto abort; - - sio_data->sio_select(sio_data, NCT6775_LD_HWM); - reg = sio_data->sio_inb(sio_data, SIO_REG_ENABLE); - if (reg != data->sio_reg_enable) - sio_data->sio_outb(sio_data, SIO_REG_ENABLE, data->sio_reg_enable); - - if (data->kind == nct6791 || data->kind == nct6792 || - data->kind == nct6793 || data->kind == nct6795 || - data->kind == nct6796 || data->kind == nct6797 || - data->kind == nct6798) - nct6791_enable_io_mapping(sio_data); - - sio_data->sio_exit(sio_data); - - /* Restore limits */ - for (i = 0; i < data->in_num; i++) { - if (!(data->have_in & BIT(i))) - continue; - - data->write_value(data, data->REG_IN_MINMAX[0][i], - data->in[i][1]); - data->write_value(data, data->REG_IN_MINMAX[1][i], - data->in[i][2]); - } - - for (i = 0; i < ARRAY_SIZE(data->fan_min); i++) { - if (!(data->has_fan_min & BIT(i))) - continue; - - data->write_value(data, data->REG_FAN_MIN[i], - data->fan_min[i]); - } - - for (i = 0; i < NUM_TEMP; i++) { - if (!(data->have_temp & BIT(i))) - continue; - - for (j = 1; j < ARRAY_SIZE(data->reg_temp); j++) - if (data->reg_temp[j][i]) - nct6775_write_temp(data, data->reg_temp[j][i], - data->temp[j][i]); - } - - /* Restore other settings */ - data->write_value(data, data->REG_VBAT, data->vbat); - if (data->kind == nct6775) { - data->write_value(data, NCT6775_REG_FANDIV1, data->fandiv1); - data->write_value(data, NCT6775_REG_FANDIV2, data->fandiv2); - } - -abort: - /* Force re-reading all values */ - data->valid = false; - mutex_unlock(&data->update_lock); - - return err; -} - -static SIMPLE_DEV_PM_OPS(nct6775_dev_pm_ops, nct6775_suspend, nct6775_resume); - -static struct platform_driver nct6775_driver = { - .driver = { - .name = DRVNAME, - .pm = &nct6775_dev_pm_ops, - }, - .probe = nct6775_probe, -}; - -/* nct6775_find() looks for a '627 in the Super-I/O config space */ -static int __init nct6775_find(int sioaddr, struct nct6775_sio_data *sio_data) -{ - u16 val; - int err; - int addr; - - sio_data->access = access_direct; - sio_data->sioreg = sioaddr; - - err = sio_data->sio_enter(sio_data); - if (err) - return err; - - val = (sio_data->sio_inb(sio_data, SIO_REG_DEVID) << 8) | - sio_data->sio_inb(sio_data, SIO_REG_DEVID + 1); - if (force_id && val != 0xffff) - val = force_id; - - switch (val & SIO_ID_MASK) { - case SIO_NCT6106_ID: - sio_data->kind = nct6106; - break; - case SIO_NCT6116_ID: - sio_data->kind = nct6116; - break; - case SIO_NCT6775_ID: - sio_data->kind = nct6775; - break; - case SIO_NCT6776_ID: - sio_data->kind = nct6776; - break; - case SIO_NCT6779_ID: - sio_data->kind = nct6779; - break; - case SIO_NCT6791_ID: - sio_data->kind = nct6791; - break; - case SIO_NCT6792_ID: - sio_data->kind = nct6792; - break; - case SIO_NCT6793_ID: - sio_data->kind = nct6793; - break; - case SIO_NCT6795_ID: - sio_data->kind = nct6795; - break; - case SIO_NCT6796_ID: - sio_data->kind = nct6796; - break; - case SIO_NCT6797_ID: - sio_data->kind = nct6797; - break; - 
case SIO_NCT6798_ID: - sio_data->kind = nct6798; - break; - default: - if (val != 0xffff) - pr_debug("unsupported chip ID: 0x%04x\n", val); - sio_data->sio_exit(sio_data); - return -ENODEV; - } - - /* We have a known chip, find the HWM I/O address */ - sio_data->sio_select(sio_data, NCT6775_LD_HWM); - val = (sio_data->sio_inb(sio_data, SIO_REG_ADDR) << 8) - | sio_data->sio_inb(sio_data, SIO_REG_ADDR + 1); - addr = val & IOREGION_ALIGNMENT; - if (addr == 0) { - pr_err("Refusing to enable a Super-I/O device with a base I/O port 0\n"); - sio_data->sio_exit(sio_data); - return -ENODEV; - } - - /* Activate logical device if needed */ - val = sio_data->sio_inb(sio_data, SIO_REG_ENABLE); - if (!(val & 0x01)) { - pr_warn("Forcibly enabling Super-I/O. Sensor is probably unusable.\n"); - sio_data->sio_outb(sio_data, SIO_REG_ENABLE, val | 0x01); - } - - if (sio_data->kind == nct6791 || sio_data->kind == nct6792 || - sio_data->kind == nct6793 || sio_data->kind == nct6795 || - sio_data->kind == nct6796 || sio_data->kind == nct6797 || - sio_data->kind == nct6798) - nct6791_enable_io_mapping(sio_data); - - sio_data->sio_exit(sio_data); - pr_info("Found %s or compatible chip at %#x:%#x\n", - nct6775_sio_names[sio_data->kind], sioaddr, addr); - - return addr; -} - -/* - * when Super-I/O functions move to a separate file, the Super-I/O - * bus will manage the lifetime of the device and this module will only keep - * track of the nct6775 driver. But since we use platform_device_alloc(), we - * must keep track of the device - */ -static struct platform_device *pdev[2]; - -static const char * const asus_wmi_boards[] = { - "ProArt X570-CREATOR WIFI", - "Pro B550M-C", - "Pro WS X570-ACE", - "PRIME B360-PLUS", - "PRIME B460-PLUS", - "PRIME B550-PLUS", - "PRIME B550M-A", - "PRIME B550M-A (WI-FI)", - "PRIME X570-P", - "PRIME X570-PRO", - "ROG CROSSHAIR VIII DARK HERO", - "ROG CROSSHAIR VIII FORMULA", - "ROG CROSSHAIR VIII HERO", - "ROG CROSSHAIR VIII IMPACT", - "ROG STRIX B550-A GAMING", - "ROG STRIX B550-E GAMING", - "ROG STRIX B550-F GAMING", - "ROG STRIX B550-F GAMING (WI-FI)", - "ROG STRIX B550-F GAMING WIFI II", - "ROG STRIX B550-I GAMING", - "ROG STRIX B550-XE GAMING (WI-FI)", - "ROG STRIX X570-E GAMING", - "ROG STRIX X570-F GAMING", - "ROG STRIX X570-I GAMING", - "ROG STRIX Z390-E GAMING", - "ROG STRIX Z390-F GAMING", - "ROG STRIX Z390-H GAMING", - "ROG STRIX Z390-I GAMING", - "ROG STRIX Z490-A GAMING", - "ROG STRIX Z490-E GAMING", - "ROG STRIX Z490-F GAMING", - "ROG STRIX Z490-G GAMING", - "ROG STRIX Z490-G GAMING (WI-FI)", - "ROG STRIX Z490-H GAMING", - "ROG STRIX Z490-I GAMING", - "TUF GAMING B550M-PLUS", - "TUF GAMING B550M-PLUS (WI-FI)", - "TUF GAMING B550-PLUS", - "TUF GAMING B550-PRO", - "TUF GAMING X570-PLUS", - "TUF GAMING X570-PLUS (WI-FI)", - "TUF GAMING X570-PRO (WI-FI)", - "TUF GAMING Z490-PLUS", - "TUF GAMING Z490-PLUS (WI-FI)", -}; - -static int __init sensors_nct6775_init(void) -{ - int i, err; - bool found = false; - int address; - struct resource res; - struct nct6775_sio_data sio_data; - int sioaddr[2] = { 0x2e, 0x4e }; - enum sensor_access access = access_direct; - const char *board_vendor, *board_name; - u8 tmp; - - err = platform_driver_register(&nct6775_driver); - if (err) - return err; - - board_vendor = dmi_get_system_info(DMI_BOARD_VENDOR); - board_name = dmi_get_system_info(DMI_BOARD_NAME); - - if (board_name && board_vendor && - !strcmp(board_vendor, "ASUSTeK COMPUTER INC.")) { - err = match_string(asus_wmi_boards, ARRAY_SIZE(asus_wmi_boards), - board_name); - if (err >= 0) { - /* 
if reading chip id via WMI succeeds, use WMI */ - if (!nct6775_asuswmi_read(0, NCT6775_PORT_CHIPID, &tmp) && tmp) { - pr_info("Using Asus WMI to access %#x chip.\n", tmp); - access = access_asuswmi; - } else { - pr_err("Can't read ChipID by Asus WMI.\n"); - } - } - } - - /* - * initialize sio_data->kind and sio_data->sioreg. - * - * when Super-I/O functions move to a separate file, the Super-I/O - * driver will probe 0x2e and 0x4e and auto-detect the presence of a - * nct6775 hardware monitor, and call probe() - */ - for (i = 0; i < ARRAY_SIZE(pdev); i++) { - sio_data.sio_outb = superio_outb; - sio_data.sio_inb = superio_inb; - sio_data.sio_select = superio_select; - sio_data.sio_enter = superio_enter; - sio_data.sio_exit = superio_exit; - - address = nct6775_find(sioaddr[i], &sio_data); - if (address <= 0) - continue; - - found = true; - - sio_data.access = access; - - if (access == access_asuswmi) { - sio_data.sio_outb = superio_wmi_outb; - sio_data.sio_inb = superio_wmi_inb; - sio_data.sio_select = superio_wmi_select; - sio_data.sio_enter = superio_wmi_enter; - sio_data.sio_exit = superio_wmi_exit; - } - - pdev[i] = platform_device_alloc(DRVNAME, address); - if (!pdev[i]) { - err = -ENOMEM; - goto exit_device_unregister; - } - - err = platform_device_add_data(pdev[i], &sio_data, - sizeof(struct nct6775_sio_data)); - if (err) - goto exit_device_put; - - if (sio_data.access == access_direct) { - memset(&res, 0, sizeof(res)); - res.name = DRVNAME; - res.start = address + IOREGION_OFFSET; - res.end = address + IOREGION_OFFSET + IOREGION_LENGTH - 1; - res.flags = IORESOURCE_IO; - - err = acpi_check_resource_conflict(&res); - if (err) { - platform_device_put(pdev[i]); - pdev[i] = NULL; - continue; - } - - err = platform_device_add_resources(pdev[i], &res, 1); - if (err) - goto exit_device_put; - } - - /* platform_device_add calls probe() */ - err = platform_device_add(pdev[i]); - if (err) - goto exit_device_put; - } - if (!found) { - err = -ENODEV; - goto exit_unregister; - } - - return 0; - -exit_device_put: - platform_device_put(pdev[i]); -exit_device_unregister: - while (--i >= 0) { - if (pdev[i]) - platform_device_unregister(pdev[i]); - } -exit_unregister: - platform_driver_unregister(&nct6775_driver); - return err; -} - -static void __exit sensors_nct6775_exit(void) -{ - int i; - - for (i = 0; i < ARRAY_SIZE(pdev); i++) { - if (pdev[i]) - platform_device_unregister(pdev[i]); - } - platform_driver_unregister(&nct6775_driver); -} +EXPORT_SYMBOL_GPL(nct6775_probe); MODULE_AUTHOR("Guenter Roeck <linux@roeck-us.net>"); -MODULE_DESCRIPTION("Driver for NCT6775F and compatible chips"); +MODULE_DESCRIPTION("Core driver for NCT6775F and compatible chips"); MODULE_LICENSE("GPL"); - -module_init(sensors_nct6775_init); -module_exit(sensors_nct6775_exit); diff --git a/drivers/hwmon/nct6775-i2c.c b/drivers/hwmon/nct6775-i2c.c new file mode 100644 index 000000000000..e1bcd1146191 --- /dev/null +++ b/drivers/hwmon/nct6775-i2c.c @@ -0,0 +1,195 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * nct6775-i2c - I2C driver for the hardware monitoring functionality of + * Nuvoton NCT677x Super-I/O chips + * + * Copyright (C) 2022 Zev Weiss <zev@bewilderbeest.net> + * + * This driver interacts with the chip via its "back door" i2c interface, as + * is often exposed to a BMC. Because the host may still be operating the + * chip via the ("front door") LPC interface, this driver cannot assume that + * it actually has full control of the chip, and in particular must avoid + * making any changes that could confuse the host's LPC usage of it. 
It thus + * operates in a strictly read-only fashion, with the only exception being the + * bank-select register (which seems, thankfully, to be replicated for the i2c + * interface so it doesn't affect the LPC interface). + */ + +#include <linux/module.h> +#include <linux/init.h> +#include <linux/i2c.h> +#include <linux/hwmon.h> +#include <linux/hwmon-sysfs.h> +#include <linux/err.h> +#include <linux/of_device.h> +#include <linux/regmap.h> +#include "nct6775.h" + +static int nct6775_i2c_read(void *ctx, unsigned int reg, unsigned int *val) +{ + int ret; + u32 tmp; + u8 bank = reg >> 8; + struct nct6775_data *data = ctx; + struct i2c_client *client = data->driver_data; + + if (bank != data->bank) { + ret = i2c_smbus_write_byte_data(client, NCT6775_REG_BANK, bank); + if (ret) + return ret; + data->bank = bank; + } + + ret = i2c_smbus_read_byte_data(client, reg & 0xff); + if (ret < 0) + return ret; + tmp = ret; + + if (nct6775_reg_is_word_sized(data, reg)) { + ret = i2c_smbus_read_byte_data(client, (reg & 0xff) + 1); + if (ret < 0) + return ret; + tmp = (tmp << 8) | ret; + } + + *val = tmp; + return 0; +} + +/* + * The write operation is a dummy so as not to disturb anything being done + * with the chip via LPC. + */ +static int nct6775_i2c_write(void *ctx, unsigned int reg, unsigned int value) +{ + struct nct6775_data *data = ctx; + struct i2c_client *client = data->driver_data; + + dev_dbg(&client->dev, "skipping attempted write: %02x -> %03x\n", value, reg); + + /* + * This is a lie, but writing anything but the bank-select register is + * something this driver shouldn't be doing. + */ + return 0; +} + +static const struct of_device_id __maybe_unused nct6775_i2c_of_match[] = { + { .compatible = "nuvoton,nct6106", .data = (void *)nct6106, }, + { .compatible = "nuvoton,nct6116", .data = (void *)nct6116, }, + { .compatible = "nuvoton,nct6775", .data = (void *)nct6775, }, + { .compatible = "nuvoton,nct6776", .data = (void *)nct6776, }, + { .compatible = "nuvoton,nct6779", .data = (void *)nct6779, }, + { .compatible = "nuvoton,nct6791", .data = (void *)nct6791, }, + { .compatible = "nuvoton,nct6792", .data = (void *)nct6792, }, + { .compatible = "nuvoton,nct6793", .data = (void *)nct6793, }, + { .compatible = "nuvoton,nct6795", .data = (void *)nct6795, }, + { .compatible = "nuvoton,nct6796", .data = (void *)nct6796, }, + { .compatible = "nuvoton,nct6797", .data = (void *)nct6797, }, + { .compatible = "nuvoton,nct6798", .data = (void *)nct6798, }, + { }, +}; +MODULE_DEVICE_TABLE(of, nct6775_i2c_of_match); + +static const struct i2c_device_id nct6775_i2c_id[] = { + { "nct6106", nct6106 }, + { "nct6116", nct6116 }, + { "nct6775", nct6775 }, + { "nct6776", nct6776 }, + { "nct6779", nct6779 }, + { "nct6791", nct6791 }, + { "nct6792", nct6792 }, + { "nct6793", nct6793 }, + { "nct6795", nct6795 }, + { "nct6796", nct6796 }, + { "nct6797", nct6797 }, + { "nct6798", nct6798 }, + { } +}; +MODULE_DEVICE_TABLE(i2c, nct6775_i2c_id); + +static int nct6775_i2c_probe_init(struct nct6775_data *data) +{ + u32 tsi_channel_mask; + struct i2c_client *client = data->driver_data; + + /* + * The i2c interface doesn't provide access to the control registers + * needed to determine the presence of other fans, but fans 1 and 2 + * are (in principle) always there. + * + * In practice this is perhaps a little silly, because the system + * using this driver is most likely a BMC, and hence probably has + * totally separate fan tachs & pwms of its own that are actually + * controlling/monitoring the fans -- these are thus unlikely to be + * doing anything actually useful. 
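+ * + * (0x03 below is BIT(0) | BIT(1), i.e. only fan/pwm channels 1 and 2 are marked present in the has_fan, has_fan_min and has_pwm masks.)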
+	 */
+	data->has_fan = 0x03;
+	data->has_fan_min = 0x03;
+	data->has_pwm = 0x03;
+
+	/*
+	 * Because on a BMC this driver may be bound very shortly after power
+	 * is first applied to the device, the automatic TSI channel detection
+	 * in nct6775_probe() (which has already been run at this point) may
+	 * not find anything if a channel hasn't yet produced a temperature
+	 * reading.  Augment whatever was found via autodetection (if
+	 * anything) with the channels DT says should be active.
+	 */
+	if (!of_property_read_u32(client->dev.of_node, "nuvoton,tsi-channel-mask",
+				  &tsi_channel_mask))
+		data->have_tsi_temp |= tsi_channel_mask & GENMASK(NUM_TSI_TEMP - 1, 0);
+
+	return 0;
+}
+
+static const struct regmap_config nct6775_i2c_regmap_config = {
+	.reg_bits = 16,
+	.val_bits = 16,
+	.reg_read = nct6775_i2c_read,
+	.reg_write = nct6775_i2c_write,
+};
+
+static int nct6775_i2c_probe(struct i2c_client *client)
+{
+	struct nct6775_data *data;
+	const struct of_device_id *of_id;
+	const struct i2c_device_id *i2c_id;
+	struct device *dev = &client->dev;
+
+	of_id = of_match_device(nct6775_i2c_of_match, dev);
+	i2c_id = i2c_match_id(nct6775_i2c_id, client);
+
+	if (of_id && (unsigned long)of_id->data != i2c_id->driver_data)
+		dev_notice(dev, "Device mismatch: %s in device tree, %s detected\n",
+			   of_id->name, i2c_id->name);
+
+	data = devm_kzalloc(&client->dev, sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	data->kind = i2c_id->driver_data;
+
+	data->read_only = true;
+	data->driver_data = client;
+	data->driver_init = nct6775_i2c_probe_init;
+
+	return nct6775_probe(dev, data, &nct6775_i2c_regmap_config);
+}
+
+static struct i2c_driver nct6775_i2c_driver = {
+	.class = I2C_CLASS_HWMON,
+	.driver = {
+		.name = "nct6775-i2c",
+		.of_match_table = of_match_ptr(nct6775_i2c_of_match),
+	},
+	.probe_new = nct6775_i2c_probe,
+	.id_table = nct6775_i2c_id,
+};
+
+module_i2c_driver(nct6775_i2c_driver);
+
+MODULE_AUTHOR("Zev Weiss ");
+MODULE_DESCRIPTION("I2C driver for NCT6775F and compatible chips");
+MODULE_LICENSE("GPL");
+MODULE_IMPORT_NS(HWMON_NCT6775);
diff --git a/drivers/hwmon/nct6775-platform.c b/drivers/hwmon/nct6775-platform.c
new file mode 100644
index 000000000000..6d46c9401898
--- /dev/null
+++ b/drivers/hwmon/nct6775-platform.c
@@ -0,0 +1,1229 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * nct6775 - Platform driver for the hardware monitoring
+ *	     functionality of Nuvoton NCT677x Super-I/O chips
+ *
+ * Copyright (C) 2012 Guenter Roeck 
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/acpi.h>
+#include <linux/dmi.h>
+#include <linux/hwmon-sysfs.h>
+#include <linux/hwmon-vid.h>
+#include <linux/init.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/regmap.h>
+#include <linux/wmi.h>
+
+#include "nct6775.h"
+
+enum sensor_access { access_direct, access_asuswmi };
+
+static const char * const nct6775_sio_names[] __initconst = {
+	"NCT6106D",
+	"NCT6116D",
+	"NCT6775F",
+	"NCT6776D/F",
+	"NCT6779D",
+	"NCT6791D",
+	"NCT6792D",
+	"NCT6793D",
+	"NCT6795D",
+	"NCT6796D",
+	"NCT6797D",
+	"NCT6798D",
+};
+
+static unsigned short force_id;
+module_param(force_id, ushort, 0);
+MODULE_PARM_DESC(force_id, "Override the detected device ID");
+
+static unsigned short fan_debounce;
+module_param(fan_debounce, ushort, 0);
+MODULE_PARM_DESC(fan_debounce, "Enable debouncing for fan RPM signal");
+
+#define DRVNAME "nct6775"
+
+#define NCT6775_PORT_CHIPID	0x58
+
+/*
+ * ISA constants
+ */
+
+#define IOREGION_ALIGNMENT	(~7)
+#define IOREGION_OFFSET		5
+#define IOREGION_LENGTH		2
+#define ADDR_REG_OFFSET		0
+#define DATA_REG_OFFSET		1
+
+/*
+ * Super-I/O constants and functions
+ */
+
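+/*
+ * Orientation note: the Super-I/O configuration space is reached through
+ * an index/data port pair at 0x2e/0x2f or 0x4e/0x4f.  Writing a register
+ * number to the index port selects a register; the data port (base + 1)
+ * then holds its value.  Reading the 16-bit chip ID, for example, amounts
+ * to the following sketch:
+ *
+ *	outb(SIO_REG_DEVID, ioreg);		// select high byte
+ *	id = inb(ioreg + 1) << 8;
+ *	outb(SIO_REG_DEVID + 1, ioreg);		// select low byte
+ *	id |= inb(ioreg + 1);
+ *
+ * which is what nct6775_find() below does via superio_inb().
+ */
+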
+#define NCT6775_LD_ACPI		0x0a
+#define NCT6775_LD_HWM		0x0b
+#define NCT6775_LD_VID		0x0d
+#define NCT6775_LD_12		0x12
+
+#define SIO_REG_LDSEL		0x07	/* Logical device select */
+#define SIO_REG_DEVID		0x20	/* Device ID (2 bytes) */
+#define SIO_REG_ENABLE		0x30	/* Logical device enable */
+#define SIO_REG_ADDR		0x60	/* Logical device address (2 bytes) */
+
+#define SIO_NCT6106_ID		0xc450
+#define SIO_NCT6116_ID		0xd280
+#define SIO_NCT6775_ID		0xb470
+#define SIO_NCT6776_ID		0xc330
+#define SIO_NCT6779_ID		0xc560
+#define SIO_NCT6791_ID		0xc800
+#define SIO_NCT6792_ID		0xc910
+#define SIO_NCT6793_ID		0xd120
+#define SIO_NCT6795_ID		0xd350
+#define SIO_NCT6796_ID		0xd420
+#define SIO_NCT6797_ID		0xd450
+#define SIO_NCT6798_ID		0xd428
+#define SIO_ID_MASK		0xFFF8
+
+/*
+ * Control registers
+ */
+#define NCT6775_REG_CR_FAN_DEBOUNCE	0xf0
+
+struct nct6775_sio_data {
+	int sioreg;
+	int ld;
+	enum kinds kind;
+	enum sensor_access access;
+
+	/* superio_() callbacks */
+	void (*sio_outb)(struct nct6775_sio_data *sio_data, int reg, int val);
+	int (*sio_inb)(struct nct6775_sio_data *sio_data, int reg);
+	void (*sio_select)(struct nct6775_sio_data *sio_data, int ld);
+	int (*sio_enter)(struct nct6775_sio_data *sio_data);
+	void (*sio_exit)(struct nct6775_sio_data *sio_data);
+};
+
+#define ASUSWMI_MONITORING_GUID		"466747A0-70EC-11DE-8A39-0800200C9A66"
+#define ASUSWMI_METHODID_RSIO		0x5253494F
+#define ASUSWMI_METHODID_WSIO		0x5753494F
+#define ASUSWMI_METHODID_RHWM		0x5248574D
+#define ASUSWMI_METHODID_WHWM		0x5748574D
+#define ASUSWMI_UNSUPPORTED_METHOD	0xFFFFFFFE
+
+static int nct6775_asuswmi_evaluate_method(u32 method_id, u8 bank, u8 reg, u8 val, u32 *retval)
+{
+#if IS_ENABLED(CONFIG_ACPI_WMI)
+	u32 args = bank | (reg << 8) | (val << 16);
+	struct acpi_buffer input = { (acpi_size) sizeof(args), &args };
+	struct acpi_buffer output = { ACPI_ALLOCATE_BUFFER, NULL };
+	acpi_status status;
+	union acpi_object *obj;
+	u32 tmp = ASUSWMI_UNSUPPORTED_METHOD;
+
+	status = wmi_evaluate_method(ASUSWMI_MONITORING_GUID, 0,
+				     method_id, &input, &output);
+
+	if (ACPI_FAILURE(status))
+		return -EIO;
+
+	obj = output.pointer;
+	if (obj && obj->type == ACPI_TYPE_INTEGER)
+		tmp = obj->integer.value;
+
+	if (retval)
+		*retval = tmp;
+
+	kfree(obj);
+
+	if (tmp == ASUSWMI_UNSUPPORTED_METHOD)
+		return -ENODEV;
+	return 0;
+#else
+	return -EOPNOTSUPP;
+#endif
+}
+
+static inline int nct6775_asuswmi_write(u8 bank, u8 reg, u8 val)
+{
+	return nct6775_asuswmi_evaluate_method(ASUSWMI_METHODID_WHWM, bank,
+					       reg, val, NULL);
+}
+
+static inline int nct6775_asuswmi_read(u8 bank, u8 reg, u8 *val)
+{
+	u32 ret, tmp = 0;
+
+	ret = nct6775_asuswmi_evaluate_method(ASUSWMI_METHODID_RHWM, bank,
+					      reg, 0, &tmp);
+	*val = tmp;
+	return ret;
+}
+
+static int superio_wmi_inb(struct nct6775_sio_data *sio_data, int reg)
+{
+	int tmp = 0;
+
+	nct6775_asuswmi_evaluate_method(ASUSWMI_METHODID_RSIO, sio_data->ld,
+					reg, 0, &tmp);
+	return tmp;
+}
+
+static void superio_wmi_outb(struct nct6775_sio_data *sio_data, int reg, int val)
+{
+	nct6775_asuswmi_evaluate_method(ASUSWMI_METHODID_WSIO, sio_data->ld,
+					reg, val, NULL);
+}
+
+static void superio_wmi_select(struct nct6775_sio_data *sio_data, int ld)
+{
+	sio_data->ld = ld;
+}
+
+static int superio_wmi_enter(struct nct6775_sio_data *sio_data)
+{
+	return 0;
+}
+
+static void superio_wmi_exit(struct nct6775_sio_data *sio_data)
+{
+}
+
+static void superio_outb(struct nct6775_sio_data *sio_data, int reg, int val)
+{
+	int ioreg = sio_data->sioreg;
+
+	outb(reg, ioreg);
+	outb(val, ioreg + 1);
+}
+
+static int superio_inb(struct nct6775_sio_data *sio_data, int reg)
+{
+	int ioreg = sio_data->sioreg;
+
+	outb(reg, ioreg);
+	return inb(ioreg + 1);
+}
+
+static void superio_select(struct nct6775_sio_data *sio_data, int ld)
+{
+	int ioreg = sio_data->sioreg;
+
+	outb(SIO_REG_LDSEL, ioreg);
+	outb(ld, ioreg + 1);
+}
+
+static int superio_enter(struct nct6775_sio_data *sio_data)
+{
+	int ioreg = sio_data->sioreg;
+
+	/*
+	 * Try to reserve ioreg and ioreg + 1 for exclusive access.
+	 */
+	if (!request_muxed_region(ioreg, 2, DRVNAME))
+		return -EBUSY;
+
+	outb(0x87, ioreg);
+	outb(0x87, ioreg);
+
+	return 0;
+}
+
+static void superio_exit(struct nct6775_sio_data *sio_data)
+{
+	int ioreg = sio_data->sioreg;
+
+	outb(0xaa, ioreg);
+	outb(0x02, ioreg);
+	outb(0x02, ioreg + 1);
+	release_region(ioreg, 2);
+}
+
+static inline void nct6775_wmi_set_bank(struct nct6775_data *data, u16 reg)
+{
+	u8 bank = reg >> 8;
+
+	data->bank = bank;
+}
+
+static int nct6775_wmi_reg_read(void *ctx, unsigned int reg, unsigned int *val)
+{
+	struct nct6775_data *data = ctx;
+	int err, word_sized = nct6775_reg_is_word_sized(data, reg);
+	u8 tmp = 0;
+	u16 res;
+
+	nct6775_wmi_set_bank(data, reg);
+
+	err = nct6775_asuswmi_read(data->bank, reg & 0xff, &tmp);
+	if (err)
+		return err;
+
+	res = tmp;
+	if (word_sized) {
+		err = nct6775_asuswmi_read(data->bank, (reg & 0xff) + 1, &tmp);
+		if (err)
+			return err;
+
+		res = (res << 8) + tmp;
+	}
+	*val = res;
+	return 0;
+}
+
+static int nct6775_wmi_reg_write(void *ctx, unsigned int reg, unsigned int value)
+{
+	struct nct6775_data *data = ctx;
+	int res, word_sized = nct6775_reg_is_word_sized(data, reg);
+
+	nct6775_wmi_set_bank(data, reg);
+
+	if (word_sized) {
+		res = nct6775_asuswmi_write(data->bank, reg & 0xff, value >> 8);
+		if (res)
+			return res;
+
+		res = nct6775_asuswmi_write(data->bank, (reg & 0xff) + 1, value);
+	} else {
+		res = nct6775_asuswmi_write(data->bank, reg & 0xff, value);
+	}
+
+	return res;
+}
+
+/*
+ * On older chips, only registers 0x50-0x5f are banked.
+ * On more recent chips, all registers are banked.
+ * Assume that is the case and set the bank number for each access.
+ * Cache the bank number so it only needs to be set if it changes.
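+ *
+ * E.g. NCT6775_REG_FANDIV1 (0x506) encodes bank 0x05 (reg >> 8) and
+ * offset 0x06 (reg & 0xff); the bank byte is written through
+ * NCT6775_REG_BANK before the offset is put on the address register.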
+ */ +static inline void nct6775_set_bank(struct nct6775_data *data, u16 reg) +{ + u8 bank = reg >> 8; + + if (data->bank != bank) { + outb_p(NCT6775_REG_BANK, data->addr + ADDR_REG_OFFSET); + outb_p(bank, data->addr + DATA_REG_OFFSET); + data->bank = bank; + } +} + +static int nct6775_reg_read(void *ctx, unsigned int reg, unsigned int *val) +{ + struct nct6775_data *data = ctx; + int word_sized = nct6775_reg_is_word_sized(data, reg); + + nct6775_set_bank(data, reg); + outb_p(reg & 0xff, data->addr + ADDR_REG_OFFSET); + *val = inb_p(data->addr + DATA_REG_OFFSET); + if (word_sized) { + outb_p((reg & 0xff) + 1, + data->addr + ADDR_REG_OFFSET); + *val = (*val << 8) + inb_p(data->addr + DATA_REG_OFFSET); + } + return 0; +} + +static int nct6775_reg_write(void *ctx, unsigned int reg, unsigned int value) +{ + struct nct6775_data *data = ctx; + int word_sized = nct6775_reg_is_word_sized(data, reg); + + nct6775_set_bank(data, reg); + outb_p(reg & 0xff, data->addr + ADDR_REG_OFFSET); + if (word_sized) { + outb_p(value >> 8, data->addr + DATA_REG_OFFSET); + outb_p((reg & 0xff) + 1, + data->addr + ADDR_REG_OFFSET); + } + outb_p(value & 0xff, data->addr + DATA_REG_OFFSET); + return 0; +} + +static void nct6791_enable_io_mapping(struct nct6775_sio_data *sio_data) +{ + int val; + + val = sio_data->sio_inb(sio_data, NCT6791_REG_HM_IO_SPACE_LOCK_ENABLE); + if (val & 0x10) { + pr_info("Enabling hardware monitor logical device mappings.\n"); + sio_data->sio_outb(sio_data, NCT6791_REG_HM_IO_SPACE_LOCK_ENABLE, + val & ~0x10); + } +} + +static int __maybe_unused nct6775_suspend(struct device *dev) +{ + int err; + u16 tmp; + struct nct6775_data *data = dev_get_drvdata(dev); + + if (IS_ERR(data)) + return PTR_ERR(data); + + mutex_lock(&data->update_lock); + err = nct6775_read_value(data, data->REG_VBAT, &tmp); + if (err) + goto out; + data->vbat = tmp; + if (data->kind == nct6775) { + err = nct6775_read_value(data, NCT6775_REG_FANDIV1, &tmp); + if (err) + goto out; + data->fandiv1 = tmp; + + err = nct6775_read_value(data, NCT6775_REG_FANDIV2, &tmp); + if (err) + goto out; + data->fandiv2 = tmp; + } +out: + mutex_unlock(&data->update_lock); + + return err; +} + +static int __maybe_unused nct6775_resume(struct device *dev) +{ + struct nct6775_data *data = dev_get_drvdata(dev); + struct nct6775_sio_data *sio_data = dev_get_platdata(dev); + int i, j, err = 0; + u8 reg; + + mutex_lock(&data->update_lock); + data->bank = 0xff; /* Force initial bank selection */ + + err = sio_data->sio_enter(sio_data); + if (err) + goto abort; + + sio_data->sio_select(sio_data, NCT6775_LD_HWM); + reg = sio_data->sio_inb(sio_data, SIO_REG_ENABLE); + if (reg != data->sio_reg_enable) + sio_data->sio_outb(sio_data, SIO_REG_ENABLE, data->sio_reg_enable); + + if (data->kind == nct6791 || data->kind == nct6792 || + data->kind == nct6793 || data->kind == nct6795 || + data->kind == nct6796 || data->kind == nct6797 || + data->kind == nct6798) + nct6791_enable_io_mapping(sio_data); + + sio_data->sio_exit(sio_data); + + /* Restore limits */ + for (i = 0; i < data->in_num; i++) { + if (!(data->have_in & BIT(i))) + continue; + + err = nct6775_write_value(data, data->REG_IN_MINMAX[0][i], data->in[i][1]); + if (err) + goto abort; + err = nct6775_write_value(data, data->REG_IN_MINMAX[1][i], data->in[i][2]); + if (err) + goto abort; + } + + for (i = 0; i < ARRAY_SIZE(data->fan_min); i++) { + if (!(data->has_fan_min & BIT(i))) + continue; + + err = nct6775_write_value(data, data->REG_FAN_MIN[i], data->fan_min[i]); + if (err) + goto abort; + } + + for (i 
= 0; i < NUM_TEMP; i++) { + if (!(data->have_temp & BIT(i))) + continue; + + for (j = 1; j < ARRAY_SIZE(data->reg_temp); j++) + if (data->reg_temp[j][i]) { + err = nct6775_write_temp(data, data->reg_temp[j][i], + data->temp[j][i]); + if (err) + goto abort; + } + } + + /* Restore other settings */ + err = nct6775_write_value(data, data->REG_VBAT, data->vbat); + if (err) + goto abort; + if (data->kind == nct6775) { + err = nct6775_write_value(data, NCT6775_REG_FANDIV1, data->fandiv1); + if (err) + goto abort; + err = nct6775_write_value(data, NCT6775_REG_FANDIV2, data->fandiv2); + } + +abort: + /* Force re-reading all values */ + data->valid = false; + mutex_unlock(&data->update_lock); + + return err; +} + +static SIMPLE_DEV_PM_OPS(nct6775_dev_pm_ops, nct6775_suspend, nct6775_resume); + +static void +nct6775_check_fan_inputs(struct nct6775_data *data, struct nct6775_sio_data *sio_data) +{ + bool fan3pin = false, fan4pin = false, fan4min = false; + bool fan5pin = false, fan6pin = false, fan7pin = false; + bool pwm3pin = false, pwm4pin = false, pwm5pin = false; + bool pwm6pin = false, pwm7pin = false; + + /* Store SIO_REG_ENABLE for use during resume */ + sio_data->sio_select(sio_data, NCT6775_LD_HWM); + data->sio_reg_enable = sio_data->sio_inb(sio_data, SIO_REG_ENABLE); + + /* fan4 and fan5 share some pins with the GPIO and serial flash */ + if (data->kind == nct6775) { + int cr2c = sio_data->sio_inb(sio_data, 0x2c); + + fan3pin = cr2c & BIT(6); + pwm3pin = cr2c & BIT(7); + + /* On NCT6775, fan4 shares pins with the fdc interface */ + fan4pin = !(sio_data->sio_inb(sio_data, 0x2A) & 0x80); + } else if (data->kind == nct6776) { + bool gpok = sio_data->sio_inb(sio_data, 0x27) & 0x80; + const char *board_vendor, *board_name; + + board_vendor = dmi_get_system_info(DMI_BOARD_VENDOR); + board_name = dmi_get_system_info(DMI_BOARD_NAME); + + if (board_name && board_vendor && + !strcmp(board_vendor, "ASRock")) { + /* + * Auxiliary fan monitoring is not enabled on ASRock + * Z77 Pro4-M if booted in UEFI Ultra-FastBoot mode. + * Observed with BIOS version 2.00. 
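+			 * The write below restores the expected enable
+			 * bits (0xe0) in SIO_REG_ENABLE so that auxiliary
+			 * fan monitoring works regardless of the boot mode.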
+ */ + if (!strcmp(board_name, "Z77 Pro4-M")) { + if ((data->sio_reg_enable & 0xe0) != 0xe0) { + data->sio_reg_enable |= 0xe0; + sio_data->sio_outb(sio_data, SIO_REG_ENABLE, + data->sio_reg_enable); + } + } + } + + if (data->sio_reg_enable & 0x80) + fan3pin = gpok; + else + fan3pin = !(sio_data->sio_inb(sio_data, 0x24) & 0x40); + + if (data->sio_reg_enable & 0x40) + fan4pin = gpok; + else + fan4pin = sio_data->sio_inb(sio_data, 0x1C) & 0x01; + + if (data->sio_reg_enable & 0x20) + fan5pin = gpok; + else + fan5pin = sio_data->sio_inb(sio_data, 0x1C) & 0x02; + + fan4min = fan4pin; + pwm3pin = fan3pin; + } else if (data->kind == nct6106) { + int cr24 = sio_data->sio_inb(sio_data, 0x24); + + fan3pin = !(cr24 & 0x80); + pwm3pin = cr24 & 0x08; + } else if (data->kind == nct6116) { + int cr1a = sio_data->sio_inb(sio_data, 0x1a); + int cr1b = sio_data->sio_inb(sio_data, 0x1b); + int cr24 = sio_data->sio_inb(sio_data, 0x24); + int cr2a = sio_data->sio_inb(sio_data, 0x2a); + int cr2b = sio_data->sio_inb(sio_data, 0x2b); + int cr2f = sio_data->sio_inb(sio_data, 0x2f); + + fan3pin = !(cr2b & 0x10); + fan4pin = (cr2b & 0x80) || // pin 1(2) + (!(cr2f & 0x10) && (cr1a & 0x04)); // pin 65(66) + fan5pin = (cr2b & 0x80) || // pin 126(127) + (!(cr1b & 0x03) && (cr2a & 0x02)); // pin 94(96) + + pwm3pin = fan3pin && (cr24 & 0x08); + pwm4pin = fan4pin; + pwm5pin = fan5pin; + } else { + /* + * NCT6779D, NCT6791D, NCT6792D, NCT6793D, NCT6795D, NCT6796D, + * NCT6797D, NCT6798D + */ + int cr1a = sio_data->sio_inb(sio_data, 0x1a); + int cr1b = sio_data->sio_inb(sio_data, 0x1b); + int cr1c = sio_data->sio_inb(sio_data, 0x1c); + int cr1d = sio_data->sio_inb(sio_data, 0x1d); + int cr2a = sio_data->sio_inb(sio_data, 0x2a); + int cr2b = sio_data->sio_inb(sio_data, 0x2b); + int cr2d = sio_data->sio_inb(sio_data, 0x2d); + int cr2f = sio_data->sio_inb(sio_data, 0x2f); + bool dsw_en = cr2f & BIT(3); + bool ddr4_en = cr2f & BIT(4); + int cre0; + int creb; + int cred; + + sio_data->sio_select(sio_data, NCT6775_LD_12); + cre0 = sio_data->sio_inb(sio_data, 0xe0); + creb = sio_data->sio_inb(sio_data, 0xeb); + cred = sio_data->sio_inb(sio_data, 0xed); + + fan3pin = !(cr1c & BIT(5)); + fan4pin = !(cr1c & BIT(6)); + fan5pin = !(cr1c & BIT(7)); + + pwm3pin = !(cr1c & BIT(0)); + pwm4pin = !(cr1c & BIT(1)); + pwm5pin = !(cr1c & BIT(2)); + + switch (data->kind) { + case nct6791: + fan6pin = cr2d & BIT(1); + pwm6pin = cr2d & BIT(0); + break; + case nct6792: + fan6pin = !dsw_en && (cr2d & BIT(1)); + pwm6pin = !dsw_en && (cr2d & BIT(0)); + break; + case nct6793: + fan5pin |= cr1b & BIT(5); + fan5pin |= creb & BIT(5); + + fan6pin = !dsw_en && (cr2d & BIT(1)); + fan6pin |= creb & BIT(3); + + pwm5pin |= cr2d & BIT(7); + pwm5pin |= (creb & BIT(4)) && !(cr2a & BIT(0)); + + pwm6pin = !dsw_en && (cr2d & BIT(0)); + pwm6pin |= creb & BIT(2); + break; + case nct6795: + fan5pin |= cr1b & BIT(5); + fan5pin |= creb & BIT(5); + + fan6pin = (cr2a & BIT(4)) && + (!dsw_en || (cred & BIT(4))); + fan6pin |= creb & BIT(3); + + pwm5pin |= cr2d & BIT(7); + pwm5pin |= (creb & BIT(4)) && !(cr2a & BIT(0)); + + pwm6pin = (cr2a & BIT(3)) && (cred & BIT(2)); + pwm6pin |= creb & BIT(2); + break; + case nct6796: + fan5pin |= cr1b & BIT(5); + fan5pin |= (cre0 & BIT(3)) && !(cr1b & BIT(0)); + fan5pin |= creb & BIT(5); + + fan6pin = (cr2a & BIT(4)) && + (!dsw_en || (cred & BIT(4))); + fan6pin |= creb & BIT(3); + + fan7pin = !(cr2b & BIT(2)); + + pwm5pin |= cr2d & BIT(7); + pwm5pin |= (cre0 & BIT(4)) && !(cr1b & BIT(0)); + pwm5pin |= (creb & BIT(4)) && !(cr2a & BIT(0)); + + 
pwm6pin = (cr2a & BIT(3)) && (cred & BIT(2)); + pwm6pin |= creb & BIT(2); + + pwm7pin = !(cr1d & (BIT(2) | BIT(3))); + break; + case nct6797: + fan5pin |= !ddr4_en && (cr1b & BIT(5)); + fan5pin |= creb & BIT(5); + + fan6pin = cr2a & BIT(4); + fan6pin |= creb & BIT(3); + + fan7pin = cr1a & BIT(1); + + pwm5pin |= (creb & BIT(4)) && !(cr2a & BIT(0)); + pwm5pin |= !ddr4_en && (cr2d & BIT(7)); + + pwm6pin = creb & BIT(2); + pwm6pin |= cred & BIT(2); + + pwm7pin = cr1d & BIT(4); + break; + case nct6798: + fan6pin = !(cr1b & BIT(0)) && (cre0 & BIT(3)); + fan6pin |= cr2a & BIT(4); + fan6pin |= creb & BIT(5); + + fan7pin = cr1b & BIT(5); + fan7pin |= !(cr2b & BIT(2)); + fan7pin |= creb & BIT(3); + + pwm6pin = !(cr1b & BIT(0)) && (cre0 & BIT(4)); + pwm6pin |= !(cred & BIT(2)) && (cr2a & BIT(3)); + pwm6pin |= (creb & BIT(4)) && !(cr2a & BIT(0)); + + pwm7pin = !(cr1d & (BIT(2) | BIT(3))); + pwm7pin |= cr2d & BIT(7); + pwm7pin |= creb & BIT(2); + break; + default: /* NCT6779D */ + break; + } + + fan4min = fan4pin; + } + + /* fan 1 and 2 (0x03) are always present */ + data->has_fan = 0x03 | (fan3pin << 2) | (fan4pin << 3) | + (fan5pin << 4) | (fan6pin << 5) | (fan7pin << 6); + data->has_fan_min = 0x03 | (fan3pin << 2) | (fan4min << 3) | + (fan5pin << 4) | (fan6pin << 5) | (fan7pin << 6); + data->has_pwm = 0x03 | (pwm3pin << 2) | (pwm4pin << 3) | + (pwm5pin << 4) | (pwm6pin << 5) | (pwm7pin << 6); +} + +static ssize_t +cpu0_vid_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + struct nct6775_data *data = dev_get_drvdata(dev); + + return sprintf(buf, "%d\n", vid_from_reg(data->vid, data->vrm)); +} + +static DEVICE_ATTR_RO(cpu0_vid); + +/* Case open detection */ + +static const u8 NCT6775_REG_CR_CASEOPEN_CLR[] = { 0xe6, 0xee }; +static const u8 NCT6775_CR_CASEOPEN_CLR_MASK[] = { 0x20, 0x01 }; + +static ssize_t +clear_caseopen(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) +{ + struct nct6775_data *data = dev_get_drvdata(dev); + struct nct6775_sio_data *sio_data = data->driver_data; + int nr = to_sensor_dev_attr(attr)->index - INTRUSION_ALARM_BASE; + unsigned long val; + u8 reg; + int ret; + + if (kstrtoul(buf, 10, &val) || val != 0) + return -EINVAL; + + mutex_lock(&data->update_lock); + + /* + * Use CR registers to clear caseopen status. + * The CR registers are the same for all chips, and not all chips + * support clearing the caseopen status through "regular" registers. 
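+	 * The writes below pulse the clear bit for the selected channel:
+	 * it is set and then cleared again, which resets the latched
+	 * intrusion alarm.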
+ */ + ret = sio_data->sio_enter(sio_data); + if (ret) { + count = ret; + goto error; + } + + sio_data->sio_select(sio_data, NCT6775_LD_ACPI); + reg = sio_data->sio_inb(sio_data, NCT6775_REG_CR_CASEOPEN_CLR[nr]); + reg |= NCT6775_CR_CASEOPEN_CLR_MASK[nr]; + sio_data->sio_outb(sio_data, NCT6775_REG_CR_CASEOPEN_CLR[nr], reg); + reg &= ~NCT6775_CR_CASEOPEN_CLR_MASK[nr]; + sio_data->sio_outb(sio_data, NCT6775_REG_CR_CASEOPEN_CLR[nr], reg); + sio_data->sio_exit(sio_data); + + data->valid = false; /* Force cache refresh */ +error: + mutex_unlock(&data->update_lock); + return count; +} + +static SENSOR_DEVICE_ATTR(intrusion0_alarm, 0644, nct6775_show_alarm, + clear_caseopen, INTRUSION_ALARM_BASE); +static SENSOR_DEVICE_ATTR(intrusion1_alarm, 0644, nct6775_show_alarm, + clear_caseopen, INTRUSION_ALARM_BASE + 1); +static SENSOR_DEVICE_ATTR(intrusion0_beep, 0644, nct6775_show_beep, + nct6775_store_beep, INTRUSION_ALARM_BASE); +static SENSOR_DEVICE_ATTR(intrusion1_beep, 0644, nct6775_show_beep, + nct6775_store_beep, INTRUSION_ALARM_BASE + 1); +static SENSOR_DEVICE_ATTR(beep_enable, 0644, nct6775_show_beep, + nct6775_store_beep, BEEP_ENABLE_BASE); + +static umode_t nct6775_other_is_visible(struct kobject *kobj, + struct attribute *attr, int index) +{ + struct device *dev = kobj_to_dev(kobj); + struct nct6775_data *data = dev_get_drvdata(dev); + + if (index == 0 && !data->have_vid) + return 0; + + if (index == 1 || index == 2) { + if (data->ALARM_BITS[INTRUSION_ALARM_BASE + index - 1] < 0) + return 0; + } + + if (index == 3 || index == 4) { + if (data->BEEP_BITS[INTRUSION_ALARM_BASE + index - 3] < 0) + return 0; + } + + return nct6775_attr_mode(data, attr); +} + +/* + * nct6775_other_is_visible uses the index into the following array + * to determine if attributes should be created or not. + * Any change in order or content must be matched. + */ +static struct attribute *nct6775_attributes_other[] = { + &dev_attr_cpu0_vid.attr, /* 0 */ + &sensor_dev_attr_intrusion0_alarm.dev_attr.attr, /* 1 */ + &sensor_dev_attr_intrusion1_alarm.dev_attr.attr, /* 2 */ + &sensor_dev_attr_intrusion0_beep.dev_attr.attr, /* 3 */ + &sensor_dev_attr_intrusion1_beep.dev_attr.attr, /* 4 */ + &sensor_dev_attr_beep_enable.dev_attr.attr, /* 5 */ + + NULL +}; + +static const struct attribute_group nct6775_group_other = { + .attrs = nct6775_attributes_other, + .is_visible = nct6775_other_is_visible, +}; + +static int nct6775_platform_probe_init(struct nct6775_data *data) +{ + int err; + u8 cr2a; + struct nct6775_sio_data *sio_data = data->driver_data; + + err = sio_data->sio_enter(sio_data); + if (err) + return err; + + cr2a = sio_data->sio_inb(sio_data, 0x2a); + switch (data->kind) { + case nct6775: + data->have_vid = (cr2a & 0x40); + break; + case nct6776: + data->have_vid = (cr2a & 0x60) == 0x40; + break; + case nct6106: + case nct6116: + case nct6779: + case nct6791: + case nct6792: + case nct6793: + case nct6795: + case nct6796: + case nct6797: + case nct6798: + break; + } + + /* + * Read VID value + * We can get the VID input values directly at logical device D 0xe3. 
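+	 * vid_which_vrm() derives the VRM version from the running CPU,
+	 * and vid_from_reg() later turns the raw VID byte into a voltage
+	 * for the cpu0_vid sysfs attribute.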
+ */ + if (data->have_vid) { + sio_data->sio_select(sio_data, NCT6775_LD_VID); + data->vid = sio_data->sio_inb(sio_data, 0xe3); + data->vrm = vid_which_vrm(); + } + + if (fan_debounce) { + u8 tmp; + + sio_data->sio_select(sio_data, NCT6775_LD_HWM); + tmp = sio_data->sio_inb(sio_data, + NCT6775_REG_CR_FAN_DEBOUNCE); + switch (data->kind) { + case nct6106: + case nct6116: + tmp |= 0xe0; + break; + case nct6775: + tmp |= 0x1e; + break; + case nct6776: + case nct6779: + tmp |= 0x3e; + break; + case nct6791: + case nct6792: + case nct6793: + case nct6795: + case nct6796: + case nct6797: + case nct6798: + tmp |= 0x7e; + break; + } + sio_data->sio_outb(sio_data, NCT6775_REG_CR_FAN_DEBOUNCE, + tmp); + pr_info("Enabled fan debounce for chip %s\n", data->name); + } + + nct6775_check_fan_inputs(data, sio_data); + + sio_data->sio_exit(sio_data); + + return nct6775_add_attr_group(data, &nct6775_group_other); +} + +static const struct regmap_config nct6775_regmap_config = { + .reg_bits = 16, + .val_bits = 16, + .reg_read = nct6775_reg_read, + .reg_write = nct6775_reg_write, +}; + +static const struct regmap_config nct6775_wmi_regmap_config = { + .reg_bits = 16, + .val_bits = 16, + .reg_read = nct6775_wmi_reg_read, + .reg_write = nct6775_wmi_reg_write, +}; + +static int nct6775_platform_probe(struct platform_device *pdev) +{ + struct device *dev = &pdev->dev; + struct nct6775_sio_data *sio_data = dev_get_platdata(dev); + struct nct6775_data *data; + struct resource *res; + const struct regmap_config *regmapcfg; + + if (sio_data->access == access_direct) { + res = platform_get_resource(pdev, IORESOURCE_IO, 0); + if (!devm_request_region(&pdev->dev, res->start, IOREGION_LENGTH, DRVNAME)) + return -EBUSY; + } + + data = devm_kzalloc(&pdev->dev, sizeof(*data), GFP_KERNEL); + if (!data) + return -ENOMEM; + + data->kind = sio_data->kind; + data->sioreg = sio_data->sioreg; + + if (sio_data->access == access_direct) { + data->addr = res->start; + regmapcfg = &nct6775_regmap_config; + } else { + regmapcfg = &nct6775_wmi_regmap_config; + } + + platform_set_drvdata(pdev, data); + + data->driver_data = sio_data; + data->driver_init = nct6775_platform_probe_init; + + return nct6775_probe(&pdev->dev, data, regmapcfg); +} + +static struct platform_driver nct6775_driver = { + .driver = { + .name = DRVNAME, + .pm = &nct6775_dev_pm_ops, + }, + .probe = nct6775_platform_probe, +}; + +/* nct6775_find() looks for a '627 in the Super-I/O config space */ +static int __init nct6775_find(int sioaddr, struct nct6775_sio_data *sio_data) +{ + u16 val; + int err; + int addr; + + sio_data->access = access_direct; + sio_data->sioreg = sioaddr; + + err = sio_data->sio_enter(sio_data); + if (err) + return err; + + val = (sio_data->sio_inb(sio_data, SIO_REG_DEVID) << 8) | + sio_data->sio_inb(sio_data, SIO_REG_DEVID + 1); + if (force_id && val != 0xffff) + val = force_id; + + switch (val & SIO_ID_MASK) { + case SIO_NCT6106_ID: + sio_data->kind = nct6106; + break; + case SIO_NCT6116_ID: + sio_data->kind = nct6116; + break; + case SIO_NCT6775_ID: + sio_data->kind = nct6775; + break; + case SIO_NCT6776_ID: + sio_data->kind = nct6776; + break; + case SIO_NCT6779_ID: + sio_data->kind = nct6779; + break; + case SIO_NCT6791_ID: + sio_data->kind = nct6791; + break; + case SIO_NCT6792_ID: + sio_data->kind = nct6792; + break; + case SIO_NCT6793_ID: + sio_data->kind = nct6793; + break; + case SIO_NCT6795_ID: + sio_data->kind = nct6795; + break; + case SIO_NCT6796_ID: + sio_data->kind = nct6796; + break; + case SIO_NCT6797_ID: + sio_data->kind = 
nct6797; + break; + case SIO_NCT6798_ID: + sio_data->kind = nct6798; + break; + default: + if (val != 0xffff) + pr_debug("unsupported chip ID: 0x%04x\n", val); + sio_data->sio_exit(sio_data); + return -ENODEV; + } + + /* We have a known chip, find the HWM I/O address */ + sio_data->sio_select(sio_data, NCT6775_LD_HWM); + val = (sio_data->sio_inb(sio_data, SIO_REG_ADDR) << 8) + | sio_data->sio_inb(sio_data, SIO_REG_ADDR + 1); + addr = val & IOREGION_ALIGNMENT; + if (addr == 0) { + pr_err("Refusing to enable a Super-I/O device with a base I/O port 0\n"); + sio_data->sio_exit(sio_data); + return -ENODEV; + } + + /* Activate logical device if needed */ + val = sio_data->sio_inb(sio_data, SIO_REG_ENABLE); + if (!(val & 0x01)) { + pr_warn("Forcibly enabling Super-I/O. Sensor is probably unusable.\n"); + sio_data->sio_outb(sio_data, SIO_REG_ENABLE, val | 0x01); + } + + if (sio_data->kind == nct6791 || sio_data->kind == nct6792 || + sio_data->kind == nct6793 || sio_data->kind == nct6795 || + sio_data->kind == nct6796 || sio_data->kind == nct6797 || + sio_data->kind == nct6798) + nct6791_enable_io_mapping(sio_data); + + sio_data->sio_exit(sio_data); + pr_info("Found %s or compatible chip at %#x:%#x\n", + nct6775_sio_names[sio_data->kind], sioaddr, addr); + + return addr; +} + +/* + * when Super-I/O functions move to a separate file, the Super-I/O + * bus will manage the lifetime of the device and this module will only keep + * track of the nct6775 driver. But since we use platform_device_alloc(), we + * must keep track of the device + */ +static struct platform_device *pdev[2]; + +static const char * const asus_wmi_boards[] = { + "PRO H410T", + "ProArt X570-CREATOR WIFI", + "Pro B550M-C", + "Pro WS X570-ACE", + "PRIME B360-PLUS", + "PRIME B460-PLUS", + "PRIME B550-PLUS", + "PRIME B550M-A", + "PRIME B550M-A (WI-FI)", + "PRIME H410M-R", + "PRIME X570-P", + "PRIME X570-PRO", + "ROG CROSSHAIR VIII DARK HERO", + "ROG CROSSHAIR VIII FORMULA", + "ROG CROSSHAIR VIII HERO", + "ROG CROSSHAIR VIII IMPACT", + "ROG STRIX B550-A GAMING", + "ROG STRIX B550-E GAMING", + "ROG STRIX B550-F GAMING", + "ROG STRIX B550-F GAMING (WI-FI)", + "ROG STRIX B550-F GAMING WIFI II", + "ROG STRIX B550-I GAMING", + "ROG STRIX B550-XE GAMING (WI-FI)", + "ROG STRIX X570-E GAMING", + "ROG STRIX X570-E GAMING WIFI II", + "ROG STRIX X570-F GAMING", + "ROG STRIX X570-I GAMING", + "ROG STRIX Z390-E GAMING", + "ROG STRIX Z390-F GAMING", + "ROG STRIX Z390-H GAMING", + "ROG STRIX Z390-I GAMING", + "ROG STRIX Z490-A GAMING", + "ROG STRIX Z490-E GAMING", + "ROG STRIX Z490-F GAMING", + "ROG STRIX Z490-G GAMING", + "ROG STRIX Z490-G GAMING (WI-FI)", + "ROG STRIX Z490-H GAMING", + "ROG STRIX Z490-I GAMING", + "TUF GAMING B550M-PLUS", + "TUF GAMING B550M-PLUS (WI-FI)", + "TUF GAMING B550-PLUS", + "TUF GAMING B550-PRO", + "TUF GAMING X570-PLUS", + "TUF GAMING X570-PLUS (WI-FI)", + "TUF GAMING X570-PRO (WI-FI)", + "TUF GAMING Z490-PLUS", + "TUF GAMING Z490-PLUS (WI-FI)", +}; + +static int __init sensors_nct6775_platform_init(void) +{ + int i, err; + bool found = false; + int address; + struct resource res; + struct nct6775_sio_data sio_data; + int sioaddr[2] = { 0x2e, 0x4e }; + enum sensor_access access = access_direct; + const char *board_vendor, *board_name; + u8 tmp; + + err = platform_driver_register(&nct6775_driver); + if (err) + return err; + + board_vendor = dmi_get_system_info(DMI_BOARD_VENDOR); + board_name = dmi_get_system_info(DMI_BOARD_NAME); + + if (board_name && board_vendor && + !strcmp(board_vendor, "ASUSTeK COMPUTER INC.")) { + 
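+		/*
+		 * WMI register access is only attempted on boards known to
+		 * implement the Asus monitoring GUID; everything else stays
+		 * on direct Super-I/O port access.
+		 */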
err = match_string(asus_wmi_boards, ARRAY_SIZE(asus_wmi_boards), + board_name); + if (err >= 0) { + /* if reading chip id via WMI succeeds, use WMI */ + if (!nct6775_asuswmi_read(0, NCT6775_PORT_CHIPID, &tmp) && tmp) { + pr_info("Using Asus WMI to access %#x chip.\n", tmp); + access = access_asuswmi; + } else { + pr_err("Can't read ChipID by Asus WMI.\n"); + } + } + } + + /* + * initialize sio_data->kind and sio_data->sioreg. + * + * when Super-I/O functions move to a separate file, the Super-I/O + * driver will probe 0x2e and 0x4e and auto-detect the presence of a + * nct6775 hardware monitor, and call probe() + */ + for (i = 0; i < ARRAY_SIZE(pdev); i++) { + sio_data.sio_outb = superio_outb; + sio_data.sio_inb = superio_inb; + sio_data.sio_select = superio_select; + sio_data.sio_enter = superio_enter; + sio_data.sio_exit = superio_exit; + + address = nct6775_find(sioaddr[i], &sio_data); + if (address <= 0) + continue; + + found = true; + + sio_data.access = access; + + if (access == access_asuswmi) { + sio_data.sio_outb = superio_wmi_outb; + sio_data.sio_inb = superio_wmi_inb; + sio_data.sio_select = superio_wmi_select; + sio_data.sio_enter = superio_wmi_enter; + sio_data.sio_exit = superio_wmi_exit; + } + + pdev[i] = platform_device_alloc(DRVNAME, address); + if (!pdev[i]) { + err = -ENOMEM; + goto exit_device_unregister; + } + + err = platform_device_add_data(pdev[i], &sio_data, + sizeof(struct nct6775_sio_data)); + if (err) + goto exit_device_put; + + if (sio_data.access == access_direct) { + memset(&res, 0, sizeof(res)); + res.name = DRVNAME; + res.start = address + IOREGION_OFFSET; + res.end = address + IOREGION_OFFSET + IOREGION_LENGTH - 1; + res.flags = IORESOURCE_IO; + + err = acpi_check_resource_conflict(&res); + if (err) { + platform_device_put(pdev[i]); + pdev[i] = NULL; + continue; + } + + err = platform_device_add_resources(pdev[i], &res, 1); + if (err) + goto exit_device_put; + } + + /* platform_device_add calls probe() */ + err = platform_device_add(pdev[i]); + if (err) + goto exit_device_put; + } + if (!found) { + err = -ENODEV; + goto exit_unregister; + } + + return 0; + +exit_device_put: + platform_device_put(pdev[i]); +exit_device_unregister: + while (--i >= 0) { + if (pdev[i]) + platform_device_unregister(pdev[i]); + } +exit_unregister: + platform_driver_unregister(&nct6775_driver); + return err; +} + +static void __exit sensors_nct6775_platform_exit(void) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(pdev); i++) { + if (pdev[i]) + platform_device_unregister(pdev[i]); + } + platform_driver_unregister(&nct6775_driver); +} + +MODULE_AUTHOR("Guenter Roeck "); +MODULE_DESCRIPTION("Platform driver for NCT6775F and compatible chips"); +MODULE_LICENSE("GPL"); +MODULE_IMPORT_NS(HWMON_NCT6775); + +module_init(sensors_nct6775_platform_init); +module_exit(sensors_nct6775_platform_exit); diff --git a/drivers/hwmon/nct6775.h b/drivers/hwmon/nct6775.h new file mode 100644 index 000000000000..93f708148e65 --- /dev/null +++ b/drivers/hwmon/nct6775.h @@ -0,0 +1,252 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef __HWMON_NCT6775_H__ +#define __HWMON_NCT6775_H__ + +#include + +enum kinds { nct6106, nct6116, nct6775, nct6776, nct6779, nct6791, nct6792, + nct6793, nct6795, nct6796, nct6797, nct6798 }; +enum pwm_enable { off, manual, thermal_cruise, speed_cruise, sf3, sf4 }; + +#define NUM_TEMP 10 /* Max number of temp attribute sets w/ limits*/ +#define NUM_TEMP_FIXED 6 /* Max number of fixed temp attribute sets */ +#define NUM_TSI_TEMP 8 /* Max number of TSI temp register pairs 
*/ + +#define NUM_REG_ALARM 7 /* Max number of alarm registers */ +#define NUM_REG_BEEP 5 /* Max number of beep registers */ + +#define NUM_FAN 7 + +struct nct6775_data { + int addr; /* IO base of hw monitor block */ + int sioreg; /* SIO register address */ + enum kinds kind; + const char *name; + + const struct attribute_group *groups[7]; + u8 num_groups; + + u16 reg_temp[5][NUM_TEMP]; /* 0=temp, 1=temp_over, 2=temp_hyst, + * 3=temp_crit, 4=temp_lcrit + */ + u8 temp_src[NUM_TEMP]; + u16 reg_temp_config[NUM_TEMP]; + const char * const *temp_label; + u32 temp_mask; + u32 virt_temp_mask; + + u16 REG_CONFIG; + u16 REG_VBAT; + u16 REG_DIODE; + u8 DIODE_MASK; + + const s8 *ALARM_BITS; + const s8 *BEEP_BITS; + + const u16 *REG_VIN; + const u16 *REG_IN_MINMAX[2]; + + const u16 *REG_TARGET; + const u16 *REG_FAN; + const u16 *REG_FAN_MODE; + const u16 *REG_FAN_MIN; + const u16 *REG_FAN_PULSES; + const u16 *FAN_PULSE_SHIFT; + const u16 *REG_FAN_TIME[3]; + + const u16 *REG_TOLERANCE_H; + + const u8 *REG_PWM_MODE; + const u8 *PWM_MODE_MASK; + + const u16 *REG_PWM[7]; /* [0]=pwm, [1]=pwm_start, [2]=pwm_floor, + * [3]=pwm_max, [4]=pwm_step, + * [5]=weight_duty_step, [6]=weight_duty_base + */ + const u16 *REG_PWM_READ; + + const u16 *REG_CRITICAL_PWM_ENABLE; + u8 CRITICAL_PWM_ENABLE_MASK; + const u16 *REG_CRITICAL_PWM; + + const u16 *REG_AUTO_TEMP; + const u16 *REG_AUTO_PWM; + + const u16 *REG_CRITICAL_TEMP; + const u16 *REG_CRITICAL_TEMP_TOLERANCE; + + const u16 *REG_TEMP_SOURCE; /* temp register sources */ + const u16 *REG_TEMP_SEL; + const u16 *REG_WEIGHT_TEMP_SEL; + const u16 *REG_WEIGHT_TEMP[3]; /* 0=base, 1=tolerance, 2=step */ + + const u16 *REG_TEMP_OFFSET; + + const u16 *REG_ALARM; + const u16 *REG_BEEP; + + const u16 *REG_TSI_TEMP; + + unsigned int (*fan_from_reg)(u16 reg, unsigned int divreg); + unsigned int (*fan_from_reg_min)(u16 reg, unsigned int divreg); + + struct mutex update_lock; + bool valid; /* true if following fields are valid */ + unsigned long last_updated; /* In jiffies */ + + /* Register values */ + u8 bank; /* current register bank */ + u8 in_num; /* number of in inputs we have */ + u8 in[15][3]; /* [0]=in, [1]=in_max, [2]=in_min */ + unsigned int rpm[NUM_FAN]; + u16 fan_min[NUM_FAN]; + u8 fan_pulses[NUM_FAN]; + u8 fan_div[NUM_FAN]; + u8 has_pwm; + u8 has_fan; /* some fan inputs can be disabled */ + u8 has_fan_min; /* some fans don't have min register */ + bool has_fan_div; + + u8 num_temp_alarms; /* 2, 3, or 6 */ + u8 num_temp_beeps; /* 2, 3, or 6 */ + u8 temp_fixed_num; /* 3 or 6 */ + u8 temp_type[NUM_TEMP_FIXED]; + s8 temp_offset[NUM_TEMP_FIXED]; + s16 temp[5][NUM_TEMP]; /* 0=temp, 1=temp_over, 2=temp_hyst, + * 3=temp_crit, 4=temp_lcrit + */ + s16 tsi_temp[NUM_TSI_TEMP]; + u64 alarms; + u64 beeps; + + u8 pwm_num; /* number of pwm */ + u8 pwm_mode[NUM_FAN]; /* 0->DC variable voltage, + * 1->PWM variable duty cycle + */ + enum pwm_enable pwm_enable[NUM_FAN]; + /* 0->off + * 1->manual + * 2->thermal cruise mode (also called SmartFan I) + * 3->fan speed cruise mode + * 4->SmartFan III + * 5->enhanced variable thermal cruise (SmartFan IV) + */ + u8 pwm[7][NUM_FAN]; /* [0]=pwm, [1]=pwm_start, [2]=pwm_floor, + * [3]=pwm_max, [4]=pwm_step, + * [5]=weight_duty_step, [6]=weight_duty_base + */ + + u8 target_temp[NUM_FAN]; + u8 target_temp_mask; + u32 target_speed[NUM_FAN]; + u32 target_speed_tolerance[NUM_FAN]; + u8 speed_tolerance_limit; + + u8 temp_tolerance[2][NUM_FAN]; + u8 tolerance_mask; + + u8 fan_time[3][NUM_FAN]; /* 0 = stop_time, 1 = step_up, 2 = step_down */ + + /* 
Automatic fan speed control registers */ + int auto_pwm_num; + u8 auto_pwm[NUM_FAN][7]; + u8 auto_temp[NUM_FAN][7]; + u8 pwm_temp_sel[NUM_FAN]; + u8 pwm_weight_temp_sel[NUM_FAN]; + u8 weight_temp[3][NUM_FAN]; /* 0->temp_step, 1->temp_step_tol, + * 2->temp_base + */ + + u8 vid; + u8 vrm; + + bool have_vid; + + u16 have_temp; + u16 have_temp_fixed; + u16 have_tsi_temp; + u16 have_in; + + /* Remember extra register values over suspend/resume */ + u8 vbat; + u8 fandiv1; + u8 fandiv2; + u8 sio_reg_enable; + + struct regmap *regmap; + bool read_only; + + /* driver-specific (platform, i2c) initialization hook and data */ + int (*driver_init)(struct nct6775_data *data); + void *driver_data; +}; + +static inline int nct6775_read_value(struct nct6775_data *data, u16 reg, u16 *value) +{ + unsigned int tmp; + int ret = regmap_read(data->regmap, reg, &tmp); + + if (!ret) + *value = tmp; + return ret; +} + +static inline int nct6775_write_value(struct nct6775_data *data, u16 reg, u16 value) +{ + return regmap_write(data->regmap, reg, value); +} + +bool nct6775_reg_is_word_sized(struct nct6775_data *data, u16 reg); +int nct6775_probe(struct device *dev, struct nct6775_data *data, + const struct regmap_config *regmapcfg); + +ssize_t nct6775_show_alarm(struct device *dev, struct device_attribute *attr, char *buf); +ssize_t nct6775_show_beep(struct device *dev, struct device_attribute *attr, char *buf); +ssize_t nct6775_store_beep(struct device *dev, struct device_attribute *attr, const char *buf, + size_t count); + +static inline int nct6775_write_temp(struct nct6775_data *data, u16 reg, u16 value) +{ + if (!nct6775_reg_is_word_sized(data, reg)) + value >>= 8; + return nct6775_write_value(data, reg, value); +} + +static inline umode_t nct6775_attr_mode(struct nct6775_data *data, struct attribute *attr) +{ + return data->read_only ? (attr->mode & ~0222) : attr->mode; +} + +static inline int +nct6775_add_attr_group(struct nct6775_data *data, const struct attribute_group *group) +{ + /* Need to leave a NULL terminator at the end of data->groups */ + if (data->num_groups == ARRAY_SIZE(data->groups) - 1) + return -ENOBUFS; + + data->groups[data->num_groups++] = group; + return 0; +} + +#define NCT6775_REG_BANK 0x4E +#define NCT6775_REG_CONFIG 0x40 + +#define NCT6775_REG_FANDIV1 0x506 +#define NCT6775_REG_FANDIV2 0x507 + +#define NCT6791_REG_HM_IO_SPACE_LOCK_ENABLE 0x28 + +#define FAN_ALARM_BASE 16 +#define TEMP_ALARM_BASE 24 +#define INTRUSION_ALARM_BASE 30 +#define BEEP_ENABLE_BASE 15 + +/* + * Not currently used: + * REG_MAN_ID has the value 0x5ca3 for all supported chips. + * REG_CHIP_ID == 0x88/0xa1/0xc1 depending on chip model. 
+ * REG_MAN_ID is at port 0x4f + * REG_CHIP_ID is at port 0x58 + */ + +#endif /* __HWMON_NCT6775_H__ */ diff --git a/drivers/media/v4l2-core/Kconfig b/drivers/media/v4l2-core/Kconfig index 1be9a2cc947a..cf792fbf52bc 100644 --- a/drivers/media/v4l2-core/Kconfig +++ b/drivers/media/v4l2-core/Kconfig @@ -40,6 +40,11 @@ config VIDEO_TUNER config V4L2_JPEG_HELPER tristate +config V4L2_LOOPBACK + tristate "V4L2 loopback device" + help + V4L2 loopback device + # Used by drivers that need v4l2-h264.ko config V4L2_H264 tristate diff --git a/drivers/media/v4l2-core/Makefile b/drivers/media/v4l2-core/Makefile index 41d91bd10cf2..4de37a844f95 100644 --- a/drivers/media/v4l2-core/Makefile +++ b/drivers/media/v4l2-core/Makefile @@ -32,6 +32,8 @@ obj-$(CONFIG_V4L2_JPEG_HELPER) += v4l2-jpeg.o obj-$(CONFIG_V4L2_MEM2MEM_DEV) += v4l2-mem2mem.o obj-$(CONFIG_V4L2_VP9) += v4l2-vp9.o +obj-$(CONFIG_V4L2_LOOPBACK) += v4l2loopback.o + obj-$(CONFIG_VIDEOBUF_DMA_CONTIG) += videobuf-dma-contig.o obj-$(CONFIG_VIDEOBUF_DMA_SG) += videobuf-dma-sg.o obj-$(CONFIG_VIDEOBUF_GEN) += videobuf-core.o diff --git a/drivers/media/v4l2-core/v4l2loopback.c b/drivers/media/v4l2-core/v4l2loopback.c new file mode 100644 index 000000000000..b1c747f7d13c --- /dev/null +++ b/drivers/media/v4l2-core/v4l2loopback.c @@ -0,0 +1,2914 @@ +/* -*- c-file-style: "linux" -*- */ +/* + * v4l2loopback.c -- video4linux2 loopback driver + * + * Copyright (C) 2005-2009 Vasily Levin (vasaka@gmail.com) + * Copyright (C) 2010-2019 IOhannes m zmoelnig (zmoelnig@iem.at) + * Copyright (C) 2011 Stefan Diewald (stefan.diewald@mytum.de) + * Copyright (C) 2012 Anton Novikov (random.plant@gmail.com) + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 29) +#define HAVE__V4L2_DEVICE +#include +#endif +#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 36) +#define HAVE__V4L2_CTRLS +#include +#endif +#include + +#include +#include "v4l2loopback.h" + +#if LINUX_VERSION_CODE < KERNEL_VERSION(3, 6, 1) +#define kstrtoul strict_strtoul +#endif + +#if defined(timer_setup) && defined(from_timer) +#define HAVE_TIMER_SETUP +#endif + +#if LINUX_VERSION_CODE < KERNEL_VERSION(5, 7, 0) +#define VFL_TYPE_VIDEO VFL_TYPE_GRABBER +#endif + +#define V4L2LOOPBACK_VERSION_CODE \ + KERNEL_VERSION(V4L2LOOPBACK_VERSION_MAJOR, V4L2LOOPBACK_VERSION_MINOR, \ + V4L2LOOPBACK_VERSION_BUGFIX) + +MODULE_DESCRIPTION("V4L2 loopback video device"); +MODULE_AUTHOR("Vasily Levin, " + "IOhannes m zmoelnig ," + "Stefan Diewald," + "Anton Novikov" + "et al."); +MODULE_VERSION("0.12.5"); +MODULE_LICENSE("GPL"); + +/* + * helpers + */ +#define STRINGIFY(s) #s +#define STRINGIFY2(s) STRINGIFY(s) + +#define dprintk(fmt, args...) \ + do { \ + if (debug > 0) { \ + printk(KERN_INFO "v4l2-loopback[" STRINGIFY2( \ + __LINE__) "]: " fmt, \ + ##args); \ + } \ + } while (0) + +#define MARK() \ + do { \ + if (debug > 1) { \ + printk(KERN_INFO "%s:%d[%s]\n", __FILE__, __LINE__, \ + __func__); \ + } \ + } while (0) + +#define dprintkrw(fmt, args...) 
\
+	do { \
+		if (debug > 2) { \
+			printk(KERN_INFO "v4l2-loopback[" STRINGIFY2( \
+				__LINE__) "]: " fmt, \
+			       ##args); \
+		} \
+	} while (0)
+
+/*
+ * compatibility hacks
+ */
+
+#ifndef HAVE__V4L2_CTRLS
+struct v4l2_ctrl_handler {
+	int error;
+};
+struct v4l2_ctrl_config {
+	void *ops;
+	u32 id;
+	const char *name;
+	int type;
+	s32 min;
+	s32 max;
+	u32 step;
+	s32 def;
+};
+int v4l2_ctrl_handler_init(struct v4l2_ctrl_handler *hdl,
+			   unsigned nr_of_controls_hint)
+{
+	hdl->error = 0;
+	return 0;
+}
+void v4l2_ctrl_handler_free(struct v4l2_ctrl_handler *hdl)
+{
+}
+void *v4l2_ctrl_new_custom(struct v4l2_ctrl_handler *hdl,
+			   const struct v4l2_ctrl_config *conf, void *priv)
+{
+	return NULL;
+}
+#endif /* HAVE__V4L2_CTRLS */
+
+#ifndef HAVE__V4L2_DEVICE
+/* dummy v4l2_device struct/functions */
+#define V4L2_DEVICE_NAME_SIZE (20 + 16)
+struct v4l2_device {
+	char name[V4L2_DEVICE_NAME_SIZE];
+	struct v4l2_ctrl_handler *ctrl_handler;
+};
+static inline int v4l2_device_register(void *dev, void *v4l2_dev)
+{
+	return 0;
+}
+static inline void v4l2_device_unregister(struct v4l2_device *v4l2_dev)
+{
+	return;
+}
+#endif /* HAVE__V4L2_DEVICE */
+
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 29)
+#define v4l2_file_operations file_operations
+#endif
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 37)
+void *v4l2l_vzalloc(unsigned long size)
+{
+	void *data = vmalloc(size);
+
+	/* vmalloc() can fail; don't memset a NULL pointer */
+	if (data)
+		memset(data, 0, size);
+	return data;
+}
+#else
+#define v4l2l_vzalloc vzalloc
+#endif
+
+static inline void v4l2l_get_timestamp(struct v4l2_buffer *b)
+{
+	/* ktime_get_ts is considered deprecated, so use ktime_get_ts64 if possible */
+#if LINUX_VERSION_CODE < KERNEL_VERSION(3, 17, 0)
+	struct timespec ts;
+	ktime_get_ts(&ts);
+#else
+	struct timespec64 ts;
+	ktime_get_ts64(&ts);
+#endif
+
+	b->timestamp.tv_sec = ts.tv_sec;
+	b->timestamp.tv_usec = (ts.tv_nsec / NSEC_PER_USEC);
+}
+
+#if !defined(__poll_t)
+typedef unsigned __poll_t;
+#endif
+
+/* module constants
+ * can be overridden during the build process using something like
+ * make KCPPFLAGS="-DMAX_DEVICES=100"
+ */
+
+/* maximum number of v4l2loopback devices that can be created */
+#ifndef MAX_DEVICES
+#define MAX_DEVICES 8
+#endif
+
+/* whether the default is to announce capabilities exclusively or not */
+#ifndef V4L2LOOPBACK_DEFAULT_EXCLUSIVECAPS
+#define V4L2LOOPBACK_DEFAULT_EXCLUSIVECAPS 0
+#endif
+
+/* when a producer is considered to have gone stale */
+#ifndef MAX_TIMEOUT
+#define MAX_TIMEOUT (100 * 1000) /* in msecs */
+#endif
+
+/* max buffers that can be mapped, actually they
+ * are all mapped to max_buffers buffers */
+#ifndef MAX_BUFFERS
+#define MAX_BUFFERS 32
+#endif
+
+/* module parameters */
+static int debug = 0;
+module_param(debug, int, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(debug, "debugging level (higher values == more verbose)");
+
+#define V4L2LOOPBACK_DEFAULT_MAX_BUFFERS 2
+static int max_buffers = V4L2LOOPBACK_DEFAULT_MAX_BUFFERS;
+module_param(max_buffers, int, S_IRUGO);
+MODULE_PARM_DESC(max_buffers,
+		 "how many buffers should be allocated [DEFAULT: " STRINGIFY2(
+			 V4L2LOOPBACK_DEFAULT_MAX_BUFFERS) "]");
+
+/* how many times a device can be opened
+ * the per-module default value can be overridden on a per-device basis using
+ * the /sys/devices interface
+ *
+ * note that max_openers should be at least 2 in order to get a working system:
+ * one opener for the producer and one opener for the consumer
+ * however, we leave that to the user
+ */
+#define V4L2LOOPBACK_DEFAULT_MAX_OPENERS 10
+static int max_openers =
V4L2LOOPBACK_DEFAULT_MAX_OPENERS; +module_param(max_openers, int, S_IRUGO | S_IWUSR); +MODULE_PARM_DESC( + max_openers, + "how many users can open the loopback device [DEFAULT: " STRINGIFY2( + V4L2LOOPBACK_DEFAULT_MAX_OPENERS) "]"); + +static int devices = -1; +module_param(devices, int, 0); +MODULE_PARM_DESC(devices, "how many devices should be created"); + +static int video_nr[MAX_DEVICES] = { [0 ...(MAX_DEVICES - 1)] = -1 }; +module_param_array(video_nr, int, NULL, 0444); +MODULE_PARM_DESC(video_nr, + "video device numbers (-1=auto, 0=/dev/video0, etc.)"); + +static char *card_label[MAX_DEVICES]; +module_param_array(card_label, charp, NULL, 0000); +MODULE_PARM_DESC(card_label, "card labels for each device"); + +static bool exclusive_caps[MAX_DEVICES] = { + [0 ...(MAX_DEVICES - 1)] = V4L2LOOPBACK_DEFAULT_EXCLUSIVECAPS +}; +module_param_array(exclusive_caps, bool, NULL, 0444); +/* FIXXME: wording */ +MODULE_PARM_DESC( + exclusive_caps, + "whether to announce OUTPUT/CAPTURE capabilities exclusively or not [DEFAULT: " STRINGIFY2( + V4L2LOOPBACK_DEFAULT_EXCLUSIVECAPS) "]"); + +/* format specifications */ +#define V4L2LOOPBACK_SIZE_MIN_WIDTH 48 +#define V4L2LOOPBACK_SIZE_MIN_HEIGHT 32 +#define V4L2LOOPBACK_SIZE_DEFAULT_MAX_WIDTH 8192 +#define V4L2LOOPBACK_SIZE_DEFAULT_MAX_HEIGHT 8192 + +#define V4L2LOOPBACK_SIZE_DEFAULT_WIDTH 640 +#define V4L2LOOPBACK_SIZE_DEFAULT_HEIGHT 480 + +static int max_width = V4L2LOOPBACK_SIZE_DEFAULT_MAX_WIDTH; +module_param(max_width, int, S_IRUGO); +MODULE_PARM_DESC(max_width, "maximum allowed frame width [DEFAULT: " STRINGIFY2( + V4L2LOOPBACK_SIZE_DEFAULT_MAX_WIDTH) "]"); +static int max_height = V4L2LOOPBACK_SIZE_DEFAULT_MAX_HEIGHT; +module_param(max_height, int, S_IRUGO); +MODULE_PARM_DESC(max_height, + "maximum allowed frame height [DEFAULT: " STRINGIFY2( + V4L2LOOPBACK_SIZE_DEFAULT_MAX_HEIGHT) "]"); + +static DEFINE_IDR(v4l2loopback_index_idr); +static DEFINE_MUTEX(v4l2loopback_ctl_mutex); + +/* control IDs */ +#ifndef HAVE__V4L2_CTRLS +#define V4L2LOOPBACK_CID_BASE (V4L2_CID_PRIVATE_BASE) +#else +#define V4L2LOOPBACK_CID_BASE (V4L2_CID_USER_BASE | 0xf000) +#endif +#define CID_KEEP_FORMAT (V4L2LOOPBACK_CID_BASE + 0) +#define CID_SUSTAIN_FRAMERATE (V4L2LOOPBACK_CID_BASE + 1) +#define CID_TIMEOUT (V4L2LOOPBACK_CID_BASE + 2) +#define CID_TIMEOUT_IMAGE_IO (V4L2LOOPBACK_CID_BASE + 3) + +static int v4l2loopback_s_ctrl(struct v4l2_ctrl *ctrl); +static const struct v4l2_ctrl_ops v4l2loopback_ctrl_ops = { + .s_ctrl = v4l2loopback_s_ctrl, +}; +static const struct v4l2_ctrl_config v4l2loopback_ctrl_keepformat = { + // clang-format off + .ops = &v4l2loopback_ctrl_ops, + .id = CID_KEEP_FORMAT, + .name = "keep_format", + .type = V4L2_CTRL_TYPE_BOOLEAN, + .min = 0, + .max = 1, + .step = 1, + .def = 0, + // clang-format on +}; +static const struct v4l2_ctrl_config v4l2loopback_ctrl_sustainframerate = { + // clang-format off + .ops = &v4l2loopback_ctrl_ops, + .id = CID_SUSTAIN_FRAMERATE, + .name = "sustain_framerate", + .type = V4L2_CTRL_TYPE_BOOLEAN, + .min = 0, + .max = 1, + .step = 1, + .def = 0, + // clang-format on +}; +static const struct v4l2_ctrl_config v4l2loopback_ctrl_timeout = { + // clang-format off + .ops = &v4l2loopback_ctrl_ops, + .id = CID_TIMEOUT, + .name = "timeout", + .type = V4L2_CTRL_TYPE_INTEGER, + .min = 0, + .max = MAX_TIMEOUT, + .step = 1, + .def = 0, + // clang-format on +}; +static const struct v4l2_ctrl_config v4l2loopback_ctrl_timeoutimageio = { + // clang-format off + .ops = &v4l2loopback_ctrl_ops, + .id = CID_TIMEOUT_IMAGE_IO, + .name = 
"timeout_image_io", + .type = V4L2_CTRL_TYPE_BOOLEAN, + .min = 0, + .max = 1, + .step = 1, + .def = 0, + // clang-format on +}; + +/* module structures */ +struct v4l2loopback_private { + int device_nr; +}; + +/* TODO(vasaka) use typenames which are common to kernel, but first find out if + * it is needed */ +/* struct keeping state and settings of loopback device */ + +struct v4l2l_buffer { + struct v4l2_buffer buffer; + struct list_head list_head; + int use_count; +}; + +struct v4l2_loopback_device { + struct v4l2_device v4l2_dev; + struct v4l2_ctrl_handler ctrl_handler; + struct video_device *vdev; + /* pixel and stream format */ + struct v4l2_pix_format pix_format; + struct v4l2_captureparm capture_param; + unsigned long frame_jiffies; + + /* ctrls */ + int keep_format; /* CID_KEEP_FORMAT; stay ready_for_capture even when all + openers close() the device */ + int sustain_framerate; /* CID_SUSTAIN_FRAMERATE; duplicate frames to maintain + (close to) nominal framerate */ + + /* buffers stuff */ + u8 *image; /* pointer to actual buffers data */ + unsigned long int imagesize; /* size of buffers data */ + int buffers_number; /* should not be big, 4 is a good choice */ + struct v4l2l_buffer buffers[MAX_BUFFERS]; /* inner driver buffers */ + int used_buffers; /* number of the actually used buffers */ + int max_openers; /* how many times can this device be opened */ + + int write_position; /* number of last written frame + 1 */ + struct list_head outbufs_list; /* buffers in output DQBUF order */ + int bufpos2index + [MAX_BUFFERS]; /* mapping of (read/write_position % used_buffers) + * to inner buffer index */ + long buffer_size; + + /* sustain_framerate stuff */ + struct timer_list sustain_timer; + unsigned int reread_count; + + /* timeout stuff */ + unsigned long timeout_jiffies; /* CID_TIMEOUT; 0 means disabled */ + int timeout_image_io; /* CID_TIMEOUT_IMAGE_IO; next opener will + * read/write to timeout_image */ + u8 *timeout_image; /* copy of it will be captured when timeout passes */ + struct v4l2l_buffer timeout_image_buffer; + struct timer_list timeout_timer; + int timeout_happened; + + /* sync stuff */ + atomic_t open_count; + + int ready_for_capture; /* set to the number of writers that opened the + * device and negotiated format. */ + int ready_for_output; /* set to true when no writer is currently attached + * this differs slightly from !ready_for_capture, + * e.g. when using fallback images */ + int announce_all_caps; /* set to false, if device caps (OUTPUT/CAPTURE) + * should only be announced if the resp. 
"ready" + * flag is set; default=TRUE */ + + int max_width; + int max_height; + + char card_label[32]; + + wait_queue_head_t read_event; + spinlock_t lock; +}; + +/* types of opener shows what opener wants to do with loopback */ +enum opener_type { + // clang-format off + UNNEGOTIATED = 0, + READER = 1, + WRITER = 2, + // clang-format on +}; + +/* struct keeping state and type of opener */ +struct v4l2_loopback_opener { + enum opener_type type; + int vidioc_enum_frameintervals_calls; + int read_position; /* number of last processed frame + 1 or + * write_position - 1 if reader went out of sync */ + unsigned int reread_count; + struct v4l2_buffer *buffers; + int buffers_number; /* should not be big, 4 is a good choice */ + int timeout_image_io; + + struct v4l2_fh fh; +}; + +#define fh_to_opener(ptr) container_of((ptr), struct v4l2_loopback_opener, fh) + +/* this is heavily inspired by the bttv driver found in the linux kernel */ +struct v4l2l_format { + char *name; + int fourcc; /* video4linux 2 */ + int depth; /* bit/pixel */ + int flags; +}; +/* set the v4l2l_format.flags to PLANAR for non-packed formats */ +#define FORMAT_FLAGS_PLANAR 0x01 +#define FORMAT_FLAGS_COMPRESSED 0x02 + +#include "v4l2loopback_formats.h" + +static const unsigned int FORMATS = ARRAY_SIZE(formats); + +static char *fourcc2str(unsigned int fourcc, char buf[4]) +{ + buf[0] = (fourcc >> 0) & 0xFF; + buf[1] = (fourcc >> 8) & 0xFF; + buf[2] = (fourcc >> 16) & 0xFF; + buf[3] = (fourcc >> 24) & 0xFF; + + return buf; +} + +static const struct v4l2l_format *format_by_fourcc(int fourcc) +{ + unsigned int i; + + for (i = 0; i < FORMATS; i++) { + if (formats[i].fourcc == fourcc) + return formats + i; + } + + dprintk("unsupported format '%c%c%c%c'\n", (fourcc >> 0) & 0xFF, + (fourcc >> 8) & 0xFF, (fourcc >> 16) & 0xFF, + (fourcc >> 24) & 0xFF); + return NULL; +} + +static void pix_format_set_size(struct v4l2_pix_format *f, + const struct v4l2l_format *fmt, + unsigned int width, unsigned int height) +{ + f->width = width; + f->height = height; + + if (fmt->flags & FORMAT_FLAGS_PLANAR) { + f->bytesperline = width; /* Y plane */ + f->sizeimage = (width * height * fmt->depth) >> 3; + } else if (fmt->flags & FORMAT_FLAGS_COMPRESSED) { + /* doesn't make sense for compressed formats */ + f->bytesperline = 0; + f->sizeimage = (width * height * fmt->depth) >> 3; + } else { + f->bytesperline = (width * fmt->depth) >> 3; + f->sizeimage = height * f->bytesperline; + } +} + +static int set_timeperframe(struct v4l2_loopback_device *dev, + struct v4l2_fract *tpf) +{ + if ((tpf->denominator < 1) || (tpf->numerator < 1)) { + return -EINVAL; + } + dev->capture_param.timeperframe = *tpf; + dev->frame_jiffies = max(1UL, msecs_to_jiffies(1000) * tpf->numerator / + tpf->denominator); + return 0; +} + +static struct v4l2_loopback_device *v4l2loopback_cd2dev(struct device *cd); + +/* device attributes */ +/* available via sysfs: /sys/devices/virtual/video4linux/video* */ + +static ssize_t attr_show_format(struct device *cd, + struct device_attribute *attr, char *buf) +{ + /* gets the current format as "FOURCC:WxH@f/s", e.g. 
"YUYV:320x240@1000/30" */ + struct v4l2_loopback_device *dev = v4l2loopback_cd2dev(cd); + const struct v4l2_fract *tpf; + char buf4cc[5], buf_fps[32]; + + if (!dev || !dev->ready_for_capture) + return 0; + tpf = &dev->capture_param.timeperframe; + + fourcc2str(dev->pix_format.pixelformat, buf4cc); + buf4cc[4] = 0; + if (tpf->numerator == 1) + snprintf(buf_fps, sizeof(buf_fps), "%d", tpf->denominator); + else + snprintf(buf_fps, sizeof(buf_fps), "%d/%d", tpf->denominator, + tpf->numerator); + return sprintf(buf, "%4s:%dx%d@%s\n", buf4cc, dev->pix_format.width, + dev->pix_format.height, buf_fps); +} + +static ssize_t attr_store_format(struct device *cd, + struct device_attribute *attr, const char *buf, + size_t len) +{ + struct v4l2_loopback_device *dev = v4l2loopback_cd2dev(cd); + int fps_num = 0, fps_den = 1; + + if (!dev) + return -ENODEV; + + /* only fps changing is supported */ + if (sscanf(buf, "@%d/%d", &fps_num, &fps_den) > 0) { + struct v4l2_fract f = { .numerator = fps_den, + .denominator = fps_num }; + int err = 0; + if ((err = set_timeperframe(dev, &f)) < 0) + return err; + return len; + } + return -EINVAL; +} + +static DEVICE_ATTR(format, S_IRUGO | S_IWUSR, attr_show_format, + attr_store_format); + +static ssize_t attr_show_buffers(struct device *cd, + struct device_attribute *attr, char *buf) +{ + struct v4l2_loopback_device *dev = v4l2loopback_cd2dev(cd); + + if (!dev) + return -ENODEV; + + return sprintf(buf, "%d\n", dev->used_buffers); +} + +static DEVICE_ATTR(buffers, S_IRUGO, attr_show_buffers, NULL); + +static ssize_t attr_show_maxopeners(struct device *cd, + struct device_attribute *attr, char *buf) +{ + struct v4l2_loopback_device *dev = v4l2loopback_cd2dev(cd); + + return sprintf(buf, "%d\n", dev->max_openers); +} + +static ssize_t attr_store_maxopeners(struct device *cd, + struct device_attribute *attr, + const char *buf, size_t len) +{ + struct v4l2_loopback_device *dev = NULL; + unsigned long curr = 0; + + if (kstrtoul(buf, 0, &curr)) + return -EINVAL; + + dev = v4l2loopback_cd2dev(cd); + + if (dev->max_openers == curr) + return len; + + if (dev->open_count.counter > curr) { + /* request to limit to less openers as are currently attached to us */ + return -EINVAL; + } + + dev->max_openers = (int)curr; + + return len; +} + +static DEVICE_ATTR(max_openers, S_IRUGO | S_IWUSR, attr_show_maxopeners, + attr_store_maxopeners); + +static void v4l2loopback_remove_sysfs(struct video_device *vdev) +{ +#define V4L2_SYSFS_DESTROY(x) device_remove_file(&vdev->dev, &dev_attr_##x) + + if (vdev) { + V4L2_SYSFS_DESTROY(format); + V4L2_SYSFS_DESTROY(buffers); + V4L2_SYSFS_DESTROY(max_openers); + /* ... */ + } +} + +static void v4l2loopback_create_sysfs(struct video_device *vdev) +{ + int res = 0; + +#define V4L2_SYSFS_CREATE(x) \ + res = device_create_file(&vdev->dev, &dev_attr_##x); \ + if (res < 0) \ + break + if (!vdev) + return; + do { + V4L2_SYSFS_CREATE(format); + V4L2_SYSFS_CREATE(buffers); + V4L2_SYSFS_CREATE(max_openers); + /* ... */ + } while (0); + + if (res >= 0) + return; + dev_err(&vdev->dev, "%s error: %d\n", __func__, res); +} + +/* global module data */ +/* find a device based on it's device-number (e.g. 
'3' for /dev/video3) */ +struct v4l2loopback_lookup_cb_data { + int device_nr; + struct v4l2_loopback_device *device; +}; +static int v4l2loopback_lookup_cb(int id, void *ptr, void *data) +{ + struct v4l2_loopback_device *device = ptr; + struct v4l2loopback_lookup_cb_data *cbdata = data; + if (cbdata && device && device->vdev) { + if (device->vdev->num == cbdata->device_nr) { + cbdata->device = device; + cbdata->device_nr = id; + return 1; + } + } + return 0; +} +static int v4l2loopback_lookup(int device_nr, + struct v4l2_loopback_device **device) +{ + struct v4l2loopback_lookup_cb_data data = { + .device_nr = device_nr, + .device = NULL, + }; + int err = idr_for_each(&v4l2loopback_index_idr, &v4l2loopback_lookup_cb, + &data); + if (1 == err) { + if (device) + *device = data.device; + return data.device_nr; + } + return -ENODEV; +} +static struct v4l2_loopback_device *v4l2loopback_cd2dev(struct device *cd) +{ + struct video_device *loopdev = to_video_device(cd); + struct v4l2loopback_private *ptr = + (struct v4l2loopback_private *)video_get_drvdata(loopdev); + int nr = ptr->device_nr; + + return idr_find(&v4l2loopback_index_idr, nr); +} + +static struct v4l2_loopback_device *v4l2loopback_getdevice(struct file *f) +{ + struct video_device *loopdev = video_devdata(f); + struct v4l2loopback_private *ptr = + (struct v4l2loopback_private *)video_get_drvdata(loopdev); + int nr = ptr->device_nr; + + return idr_find(&v4l2loopback_index_idr, nr); +} + +/* forward declarations */ +static void init_buffers(struct v4l2_loopback_device *dev); +static int allocate_buffers(struct v4l2_loopback_device *dev); +static int free_buffers(struct v4l2_loopback_device *dev); +static void try_free_buffers(struct v4l2_loopback_device *dev); +static int allocate_timeout_image(struct v4l2_loopback_device *dev); +static void check_timers(struct v4l2_loopback_device *dev); +static const struct v4l2_file_operations v4l2_loopback_fops; +static const struct v4l2_ioctl_ops v4l2_loopback_ioctl_ops; + +/* Queue helpers */ +/* next functions sets buffer flags and adjusts counters accordingly */ +static inline void set_done(struct v4l2l_buffer *buffer) +{ + buffer->buffer.flags &= ~V4L2_BUF_FLAG_QUEUED; + buffer->buffer.flags |= V4L2_BUF_FLAG_DONE; +} + +static inline void set_queued(struct v4l2l_buffer *buffer) +{ + buffer->buffer.flags &= ~V4L2_BUF_FLAG_DONE; + buffer->buffer.flags |= V4L2_BUF_FLAG_QUEUED; +} + +static inline void unset_flags(struct v4l2l_buffer *buffer) +{ + buffer->buffer.flags &= ~V4L2_BUF_FLAG_QUEUED; + buffer->buffer.flags &= ~V4L2_BUF_FLAG_DONE; +} + +/* V4L2 ioctl caps and params calls */ +/* returns device capabilities + * called on VIDIOC_QUERYCAP + */ +static int vidioc_querycap(struct file *file, void *priv, + struct v4l2_capability *cap) +{ + struct v4l2_loopback_device *dev = v4l2loopback_getdevice(file); + int labellen = (sizeof(cap->card) < sizeof(dev->card_label)) ? 
+ sizeof(cap->card) : + sizeof(dev->card_label); + int device_nr = + ((struct v4l2loopback_private *)video_get_drvdata(dev->vdev)) + ->device_nr; + __u32 capabilities = V4L2_CAP_STREAMING | V4L2_CAP_READWRITE; + + strlcpy(cap->driver, "v4l2 loopback", sizeof(cap->driver)); + snprintf(cap->card, labellen, dev->card_label); + snprintf(cap->bus_info, sizeof(cap->bus_info), + "platform:v4l2loopback-%03d", device_nr); + +#if LINUX_VERSION_CODE < KERNEL_VERSION(3, 1, 0) + /* since 3.1.0, the v4l2-core system is supposed to set the version */ + cap->version = V4L2LOOPBACK_VERSION_CODE; +#endif + +#ifdef V4L2_CAP_VIDEO_M2M + capabilities |= V4L2_CAP_VIDEO_M2M; +#endif /* V4L2_CAP_VIDEO_M2M */ + + if (dev->announce_all_caps) { + capabilities |= V4L2_CAP_VIDEO_CAPTURE | V4L2_CAP_VIDEO_OUTPUT; + } else { + if (dev->ready_for_capture) { + capabilities |= V4L2_CAP_VIDEO_CAPTURE; + } + if (dev->ready_for_output) { + capabilities |= V4L2_CAP_VIDEO_OUTPUT; + } + } + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 7, 0) + dev->vdev->device_caps = +#endif /* >=linux-4.7.0 */ + cap->device_caps = cap->capabilities = capabilities; + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 3, 0) + cap->capabilities |= V4L2_CAP_DEVICE_CAPS; +#endif + + memset(cap->reserved, 0, sizeof(cap->reserved)); + return 0; +} + +static int vidioc_enum_framesizes(struct file *file, void *fh, + struct v4l2_frmsizeenum *argp) +{ + struct v4l2_loopback_device *dev; + + /* LATER: what does the index really mean? + * if it's about enumerating formats, we can safely ignore it + * (CHECK) + */ + + /* there can be only one... */ + if (argp->index) + return -EINVAL; + + dev = v4l2loopback_getdevice(file); + if (dev->ready_for_capture) { + /* format has already been negotiated + * cannot change during runtime + */ + argp->type = V4L2_FRMSIZE_TYPE_DISCRETE; + + argp->discrete.width = dev->pix_format.width; + argp->discrete.height = dev->pix_format.height; + } else { + /* if the format has not been negotiated yet, we accept anything + */ + argp->type = V4L2_FRMSIZE_TYPE_CONTINUOUS; + + argp->stepwise.min_width = V4L2LOOPBACK_SIZE_MIN_WIDTH; + argp->stepwise.min_height = V4L2LOOPBACK_SIZE_MIN_HEIGHT; + + argp->stepwise.max_width = dev->max_width; + argp->stepwise.max_height = dev->max_height; + + argp->stepwise.step_width = 1; + argp->stepwise.step_height = 1; + } + return 0; +} + +/* returns frameinterval (fps) for the set resolution + * called on VIDIOC_ENUM_FRAMEINTERVALS + */ +static int vidioc_enum_frameintervals(struct file *file, void *fh, + struct v4l2_frmivalenum *argp) +{ + struct v4l2_loopback_device *dev = v4l2loopback_getdevice(file); + struct v4l2_loopback_opener *opener = fh_to_opener(fh); + + if (dev->ready_for_capture) { + if (opener->vidioc_enum_frameintervals_calls > 0) + return -EINVAL; + if (argp->width == dev->pix_format.width && + argp->height == dev->pix_format.height) { + argp->type = V4L2_FRMIVAL_TYPE_DISCRETE; + argp->discrete = dev->capture_param.timeperframe; + opener->vidioc_enum_frameintervals_calls++; + return 0; + } + return -EINVAL; + } + return 0; +} + +/* ------------------ CAPTURE ----------------------- */ + +/* returns device formats + * called on VIDIOC_ENUM_FMT, with v4l2_buf_type set to V4L2_BUF_TYPE_VIDEO_CAPTURE + */ +static int vidioc_enum_fmt_cap(struct file *file, void *fh, + struct v4l2_fmtdesc *f) +{ + struct v4l2_loopback_device *dev; + MARK(); + + dev = v4l2loopback_getdevice(file); + + if (f->index) + return -EINVAL; + if (dev->ready_for_capture) { + const __u32 format = dev->pix_format.pixelformat; 
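+ /*
+ * fourccs are packed little-endian, so shifting out the low byte
+ * first recovers the characters in order; e.g. V4L2_PIX_FMT_YUYV
+ * (0x56595559) yields "[YUYV]" below.
+ */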
+ + snprintf(f->description, sizeof(f->description), "[%c%c%c%c]", + (format >> 0) & 0xFF, (format >> 8) & 0xFF, + (format >> 16) & 0xFF, (format >> 24) & 0xFF); + + f->pixelformat = dev->pix_format.pixelformat; + } else { + return -EINVAL; + } + f->flags = 0; + MARK(); + return 0; +} + +/* returns current video format + * called on VIDIOC_G_FMT, with v4l2_buf_type set to V4L2_BUF_TYPE_VIDEO_CAPTURE + */ +static int vidioc_g_fmt_cap(struct file *file, void *priv, + struct v4l2_format *fmt) +{ + struct v4l2_loopback_device *dev; + MARK(); + + dev = v4l2loopback_getdevice(file); + + if (!dev->ready_for_capture) + return -EINVAL; + + fmt->fmt.pix = dev->pix_format; + MARK(); + return 0; +} + +/* checks if it is OK to change to format fmt; + * actual check is done by inner_try_fmt_cap + * just checking that pixelformat is OK and set other parameters, app should + * obey this decision + * called on VIDIOC_TRY_FMT, with v4l2_buf_type set to V4L2_BUF_TYPE_VIDEO_CAPTURE + */ +static int vidioc_try_fmt_cap(struct file *file, void *priv, + struct v4l2_format *fmt) +{ + struct v4l2_loopback_device *dev; + char buf[5]; + + dev = v4l2loopback_getdevice(file); + + if (0 == dev->ready_for_capture) { + dprintk("setting fmt_cap not possible yet\n"); + return -EBUSY; + } + + if (fmt->fmt.pix.pixelformat != dev->pix_format.pixelformat) + return -EINVAL; + + fmt->fmt.pix = dev->pix_format; + + buf[4] = 0; + dprintk("capFOURCC=%s\n", fourcc2str(dev->pix_format.pixelformat, buf)); + return 0; +} + +/* sets new output format, if possible + * actually format is set by input and we even do not check it, just return + * current one, but it is possible to set subregions of input TODO(vasaka) + * called on VIDIOC_S_FMT, with v4l2_buf_type set to V4L2_BUF_TYPE_VIDEO_CAPTURE + */ +static int vidioc_s_fmt_cap(struct file *file, void *priv, + struct v4l2_format *fmt) +{ + return vidioc_try_fmt_cap(file, priv, fmt); +} + +/* ------------------ OUTPUT ----------------------- */ + +/* returns device formats; + * LATER: allow all formats + * called on VIDIOC_ENUM_FMT, with v4l2_buf_type set to V4L2_BUF_TYPE_VIDEO_OUTPUT + */ +static int vidioc_enum_fmt_out(struct file *file, void *fh, + struct v4l2_fmtdesc *f) +{ + struct v4l2_loopback_device *dev; + const struct v4l2l_format *fmt; + + dev = v4l2loopback_getdevice(file); + + if (dev->ready_for_capture) { + const __u32 format = dev->pix_format.pixelformat; + + /* format has been fixed by the writer, so only one single format is supported */ + if (f->index) + return -EINVAL; + + fmt = format_by_fourcc(format); + if (NULL == fmt) + return -EINVAL; + + f->type = V4L2_BUF_TYPE_VIDEO_CAPTURE; + /* f->flags = ??; */ + snprintf(f->description, sizeof(f->description), "%s", + fmt->name); + + f->pixelformat = dev->pix_format.pixelformat; + } else { + /* fill in a dummy format */ + /* coverity[unsigned_compare] */ + if (f->index < 0 || f->index >= FORMATS) + return -EINVAL; + + fmt = &formats[f->index]; + + f->pixelformat = fmt->fourcc; + snprintf(f->description, sizeof(f->description), "%s", + fmt->name); + } + f->flags = 0; + + return 0; +} + +/* returns current video format format fmt */ +/* NOTE: this is called from the producer + * so if format has not been negotiated yet, + * it should return ALL of available formats, + * called on VIDIOC_G_FMT, with v4l2_buf_type set to V4L2_BUF_TYPE_VIDEO_OUTPUT + */ +static int vidioc_g_fmt_out(struct file *file, void *priv, + struct v4l2_format *fmt) +{ + struct v4l2_loopback_device *dev; + MARK(); + + dev = v4l2loopback_getdevice(file); + + 
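+ /*
+ * Illustrative userspace sequence (example only, not driver code;
+ * the device path is an assumption): a producer typically queries
+ * and sets its format on the OUTPUT side before writing frames:
+ *
+ *   int fd = open("/dev/video0", O_RDWR);
+ *   struct v4l2_format f = { .type = V4L2_BUF_TYPE_VIDEO_OUTPUT };
+ *   ioctl(fd, VIDIOC_G_FMT, &f);           // query the defaults
+ *   f.fmt.pix.width = 640;
+ *   f.fmt.pix.height = 480;
+ *   f.fmt.pix.pixelformat = V4L2_PIX_FMT_YUYV;
+ *   ioctl(fd, VIDIOC_S_FMT, &f);           // then write() frames
+ */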
/* + * LATER: this should return the currently valid format + * gstreamer doesn't like it, if this returns -EINVAL, as it + * then concludes that there is _no_ valid format + * CHECK whether this assumption is wrong, + * or whether we have to always provide a valid format + */ + + fmt->fmt.pix = dev->pix_format; + return 0; +} + +/* checks if it is OK to change to format fmt; + * if format is negotiated do not change it + * called on VIDIOC_TRY_FMT with v4l2_buf_type set to V4L2_BUF_TYPE_VIDEO_OUTPUT + */ +static int vidioc_try_fmt_out(struct file *file, void *priv, + struct v4l2_format *fmt) +{ + struct v4l2_loopback_device *dev; + MARK(); + + dev = v4l2loopback_getdevice(file); + + /* TODO(vasaka) loopback does not care about formats writer want to set, + * maybe it is a good idea to restrict format somehow */ + if (dev->ready_for_capture) { + fmt->fmt.pix = dev->pix_format; + } else { + __u32 w = fmt->fmt.pix.width; + __u32 h = fmt->fmt.pix.height; + __u32 pixfmt = fmt->fmt.pix.pixelformat; + const struct v4l2l_format *format = format_by_fourcc(pixfmt); + + if (w > dev->max_width) + w = dev->max_width; + if (h > dev->max_height) + h = dev->max_height; + + dprintk("trying image %dx%d\n", w, h); + + if (w < 1) + w = V4L2LOOPBACK_SIZE_DEFAULT_WIDTH; + + if (h < 1) + h = V4L2LOOPBACK_SIZE_DEFAULT_HEIGHT; + + if (NULL == format) + format = &formats[0]; + + pix_format_set_size(&fmt->fmt.pix, format, w, h); + + fmt->fmt.pix.pixelformat = format->fourcc; + + if ((fmt->fmt.pix.colorspace == V4L2_COLORSPACE_DEFAULT) || + (fmt->fmt.pix.colorspace > V4L2_COLORSPACE_DCI_P3)) + fmt->fmt.pix.colorspace = V4L2_COLORSPACE_SRGB; + + if (V4L2_FIELD_ANY == fmt->fmt.pix.field) + fmt->fmt.pix.field = V4L2_FIELD_NONE; + + /* FIXXME: try_fmt should never modify the device-state */ + dev->pix_format = fmt->fmt.pix; + } + return 0; +} + +/* sets new output format, if possible; + * allocate data here because we do not know if it will be streaming or + * read/write IO + * called on VIDIOC_S_FMT with v4l2_buf_type set to V4L2_BUF_TYPE_VIDEO_OUTPUT + */ +static int vidioc_s_fmt_out(struct file *file, void *priv, + struct v4l2_format *fmt) +{ + struct v4l2_loopback_device *dev; + char buf[5]; + int ret; + MARK(); + + dev = v4l2loopback_getdevice(file); + ret = vidioc_try_fmt_out(file, priv, fmt); + + dprintk("s_fmt_out(%d) %d...%d\n", ret, dev->ready_for_capture, + dev->pix_format.sizeimage); + + buf[4] = 0; + dprintk("outFOURCC=%s\n", fourcc2str(dev->pix_format.pixelformat, buf)); + + if (ret < 0) + return ret; + + if (!dev->ready_for_capture) { + dev->buffer_size = PAGE_ALIGN(dev->pix_format.sizeimage); + fmt->fmt.pix.sizeimage = dev->buffer_size; + allocate_buffers(dev); + } + return ret; +} + +// #define V4L2L_OVERLAY +#ifdef V4L2L_OVERLAY +/* ------------------ OVERLAY ----------------------- */ +/* currently unsupported */ +/* GSTreamer's v4l2sink is buggy, as it requires the overlay to work + * while it should only require it, if overlay is requested + * once the gstreamer element is fixed, remove the overlay dummies + */ +#warning OVERLAY dummies +static int vidioc_g_fmt_overlay(struct file *file, void *priv, + struct v4l2_format *fmt) +{ + return 0; +} + +static int vidioc_s_fmt_overlay(struct file *file, void *priv, + struct v4l2_format *fmt) +{ + return 0; +} +#endif /* V4L2L_OVERLAY */ + +/* ------------------ PARAMs ----------------------- */ + +/* get some data flow parameters, only capability, fps and readbuffers has + * effect on this driver + * called on VIDIOC_G_PARM + */ +static int 
vidioc_g_parm(struct file *file, void *priv, + struct v4l2_streamparm *parm) +{ + /* do not care about type of opener, hope these enums would always be + * compatible */ + struct v4l2_loopback_device *dev; + MARK(); + + dev = v4l2loopback_getdevice(file); + parm->parm.capture = dev->capture_param; + return 0; +} + +/* get some data flow parameters, only capability, fps and readbuffers has + * effect on this driver + * called on VIDIOC_S_PARM + */ +static int vidioc_s_parm(struct file *file, void *priv, + struct v4l2_streamparm *parm) +{ + struct v4l2_loopback_device *dev; + int err = 0; + MARK(); + + dev = v4l2loopback_getdevice(file); + dprintk("vidioc_s_parm called frate=%d/%d\n", + parm->parm.capture.timeperframe.numerator, + parm->parm.capture.timeperframe.denominator); + + switch (parm->type) { + case V4L2_BUF_TYPE_VIDEO_CAPTURE: + if ((err = set_timeperframe( + dev, &parm->parm.capture.timeperframe)) < 0) + return err; + break; + case V4L2_BUF_TYPE_VIDEO_OUTPUT: + if ((err = set_timeperframe( + dev, &parm->parm.capture.timeperframe)) < 0) + return err; + break; + default: + return -1; + } + + parm->parm.capture = dev->capture_param; + return 0; +} + +#ifdef V4L2LOOPBACK_WITH_STD +/* sets a tv standard, actually we do not need to handle this any special way + * added to support effecttv + * called on VIDIOC_S_STD + */ +static int vidioc_s_std(struct file *file, void *fh, v4l2_std_id *_std) +{ + v4l2_std_id req_std = 0, supported_std = 0; + const v4l2_std_id all_std = V4L2_STD_ALL, no_std = 0; + + if (_std) { + req_std = *_std; + *_std = all_std; + } + + /* we support everything in V4L2_STD_ALL, but not more... */ + supported_std = (all_std & req_std); + if (no_std == supported_std) + return -EINVAL; + + return 0; +} + +/* gets a fake video standard + * called on VIDIOC_G_STD + */ +static int vidioc_g_std(struct file *file, void *fh, v4l2_std_id *norm) +{ + if (norm) + *norm = V4L2_STD_ALL; + return 0; +} +/* gets a fake video standard + * called on VIDIOC_QUERYSTD + */ +static int vidioc_querystd(struct file *file, void *fh, v4l2_std_id *norm) +{ + if (norm) + *norm = V4L2_STD_ALL; + return 0; +} +#endif /* V4L2LOOPBACK_WITH_STD */ + +/* get ctrls info + * called on VIDIOC_QUERYCTRL + */ +static int vidioc_queryctrl(struct file *file, void *fh, + struct v4l2_queryctrl *q) +{ + const struct v4l2_ctrl_config *cnf = 0; + switch (q->id) { + case CID_KEEP_FORMAT: + cnf = &v4l2loopback_ctrl_keepformat; + break; + case CID_SUSTAIN_FRAMERATE: + cnf = &v4l2loopback_ctrl_sustainframerate; + break; + case CID_TIMEOUT: + cnf = &v4l2loopback_ctrl_timeout; + break; + case CID_TIMEOUT_IMAGE_IO: + cnf = &v4l2loopback_ctrl_timeoutimageio; + break; + default: + return -EINVAL; + } + if (!cnf) + BUG(); + + strcpy(q->name, cnf->name); + q->default_value = cnf->def; + q->type = cnf->type; + q->minimum = cnf->min; + q->maximum = cnf->max; + q->step = cnf->step; + + memset(q->reserved, 0, sizeof(q->reserved)); + return 0; +} + +static int vidioc_g_ctrl(struct file *file, void *fh, struct v4l2_control *c) +{ + struct v4l2_loopback_device *dev = v4l2loopback_getdevice(file); + + switch (c->id) { + case CID_KEEP_FORMAT: + c->value = dev->keep_format; + break; + case CID_SUSTAIN_FRAMERATE: + c->value = dev->sustain_framerate; + break; + case CID_TIMEOUT: + c->value = jiffies_to_msecs(dev->timeout_jiffies); + break; + case CID_TIMEOUT_IMAGE_IO: + c->value = dev->timeout_image_io; + break; + default: + return -EINVAL; + } + + return 0; +} + +static int v4l2loopback_set_ctrl(struct v4l2_loopback_device *dev, u32 
id, + s64 val) +{ + switch (id) { + case CID_KEEP_FORMAT: + if (val < 0 || val > 1) + return -EINVAL; + dev->keep_format = val; + try_free_buffers( + dev); /* will only free buffers if !keep_format */ + break; + case CID_SUSTAIN_FRAMERATE: + if (val < 0 || val > 1) + return -EINVAL; + spin_lock_bh(&dev->lock); + dev->sustain_framerate = val; + check_timers(dev); + spin_unlock_bh(&dev->lock); + break; + case CID_TIMEOUT: + if (val < 0 || val > MAX_TIMEOUT) + return -EINVAL; + spin_lock_bh(&dev->lock); + dev->timeout_jiffies = msecs_to_jiffies(val); + check_timers(dev); + spin_unlock_bh(&dev->lock); + allocate_timeout_image(dev); + break; + case CID_TIMEOUT_IMAGE_IO: + if (val < 0 || val > 1) + return -EINVAL; + dev->timeout_image_io = val; + break; + default: + return -EINVAL; + } + return 0; +} + +static int v4l2loopback_s_ctrl(struct v4l2_ctrl *ctrl) +{ + struct v4l2_loopback_device *dev = container_of( + ctrl->handler, struct v4l2_loopback_device, ctrl_handler); + return v4l2loopback_set_ctrl(dev, ctrl->id, ctrl->val); +} +static int vidioc_s_ctrl(struct file *file, void *fh, struct v4l2_control *c) +{ + struct v4l2_loopback_device *dev = v4l2loopback_getdevice(file); + return v4l2loopback_set_ctrl(dev, c->id, c->value); +} + +/* returns set of device outputs, in our case there is only one + * called on VIDIOC_ENUMOUTPUT + */ +static int vidioc_enum_output(struct file *file, void *fh, + struct v4l2_output *outp) +{ + __u32 index = outp->index; + struct v4l2_loopback_device *dev = v4l2loopback_getdevice(file); + MARK(); + + if (!dev->announce_all_caps && !dev->ready_for_output) + return -ENOTTY; + + if (0 != index) + return -EINVAL; + + /* clear all data (including the reserved fields) */ + memset(outp, 0, sizeof(*outp)); + + outp->index = index; + strlcpy(outp->name, "loopback in", sizeof(outp->name)); + outp->type = V4L2_OUTPUT_TYPE_ANALOG; + outp->audioset = 0; + outp->modulator = 0; +#ifdef V4L2LOOPBACK_WITH_STD + outp->std = V4L2_STD_ALL; +#ifdef V4L2_OUT_CAP_STD + outp->capabilities |= V4L2_OUT_CAP_STD; +#endif /* V4L2_OUT_CAP_STD */ +#endif /* V4L2LOOPBACK_WITH_STD */ + + return 0; +} + +/* which output is currently active, + * called on VIDIOC_G_OUTPUT + */ +static int vidioc_g_output(struct file *file, void *fh, unsigned int *i) +{ + struct v4l2_loopback_device *dev = v4l2loopback_getdevice(file); + if (!dev->announce_all_caps && !dev->ready_for_output) + return -ENOTTY; + if (i) + *i = 0; + return 0; +} + +/* set output, can make sense if we have more than one video src, + * called on VIDIOC_S_OUTPUT + */ +static int vidioc_s_output(struct file *file, void *fh, unsigned int i) +{ + struct v4l2_loopback_device *dev = v4l2loopback_getdevice(file); + if (!dev->announce_all_caps && !dev->ready_for_output) + return -ENOTTY; + + if (i) + return -EINVAL; + + return 0; +} + +/* returns set of device inputs, in our case there is only one, + * but later I may add more + * called on VIDIOC_ENUMINPUT + */ +static int vidioc_enum_input(struct file *file, void *fh, + struct v4l2_input *inp) +{ + __u32 index = inp->index; + MARK(); + + if (0 != index) + return -EINVAL; + + /* clear all data (including the reserved fields) */ + memset(inp, 0, sizeof(*inp)); + + inp->index = index; + strlcpy(inp->name, "loopback", sizeof(inp->name)); + inp->type = V4L2_INPUT_TYPE_CAMERA; + inp->audioset = 0; + inp->tuner = 0; + inp->status = 0; + +#ifdef V4L2LOOPBACK_WITH_STD + inp->std = V4L2_STD_ALL; +#ifdef V4L2_IN_CAP_STD + inp->capabilities |= V4L2_IN_CAP_STD; +#endif +#endif /* V4L2LOOPBACK_WITH_STD */ + + 
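+ /*
+ * Illustrative userspace usage (example only, not driver code):
+ *
+ *   struct v4l2_input in = { .index = 0 };
+ *   while (ioctl(fd, VIDIOC_ENUMINPUT, &in) == 0) {
+ *           printf("input %u: %s\n", in.index, in.name);
+ *           in.index++;
+ *   }
+ *
+ * for this driver the loop terminates after the single "loopback"
+ * input enumerated above
+ */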
return 0; +} + +/* which input is currently active, + * called on VIDIOC_G_INPUT + */ +static int vidioc_g_input(struct file *file, void *fh, unsigned int *i) +{ + struct v4l2_loopback_device *dev = v4l2loopback_getdevice(file); + if (!dev->announce_all_caps && !dev->ready_for_capture) + return -ENOTTY; + if (i) + *i = 0; + return 0; +} + +/* set input, can make sense if we have more than one video src, + * called on VIDIOC_S_INPUT + */ +static int vidioc_s_input(struct file *file, void *fh, unsigned int i) +{ + struct v4l2_loopback_device *dev = v4l2loopback_getdevice(file); + if (!dev->announce_all_caps && !dev->ready_for_capture) + return -ENOTTY; + if (i == 0) + return 0; + return -EINVAL; +} + +/* --------------- V4L2 ioctl buffer related calls ----------------- */ + +/* negotiate buffer type + * only mmap streaming supported + * called on VIDIOC_REQBUFS + */ +static int vidioc_reqbufs(struct file *file, void *fh, + struct v4l2_requestbuffers *b) +{ + struct v4l2_loopback_device *dev; + struct v4l2_loopback_opener *opener; + int i; + MARK(); + + dev = v4l2loopback_getdevice(file); + opener = fh_to_opener(fh); + + dprintk("reqbufs: %d\t%d=%d\n", b->memory, b->count, + dev->buffers_number); + if (opener->timeout_image_io) { + if (b->memory != V4L2_MEMORY_MMAP) + return -EINVAL; + b->count = 1; + return 0; + } + + init_buffers(dev); + switch (b->memory) { + case V4L2_MEMORY_MMAP: + /* do nothing here, buffers are always allocated */ + if (b->count < 1 || dev->buffers_number < 1) + return 0; + + if (b->count > dev->buffers_number) + b->count = dev->buffers_number; + + /* make sure that outbufs_list contains buffers from 0 to used_buffers-1 + * actually, it will have been already populated via v4l2_loopback_init() + * at this point */ + if (list_empty(&dev->outbufs_list)) { + for (i = 0; i < dev->used_buffers; ++i) + list_add_tail(&dev->buffers[i].list_head, + &dev->outbufs_list); + } + + /* also, if dev->used_buffers is going to be decreased, we should remove + * out-of-range buffers from outbufs_list, and fix bufpos2index mapping */ + if (b->count < dev->used_buffers) { + struct v4l2l_buffer *pos, *n; + + list_for_each_entry_safe (pos, n, &dev->outbufs_list, + list_head) { + if (pos->buffer.index >= b->count) + list_del(&pos->list_head); + } + + /* after we update dev->used_buffers, buffers in outbufs_list will + * correspond to dev->write_position + [0;b->count-1] range */ + i = dev->write_position; + list_for_each_entry (pos, &dev->outbufs_list, + list_head) { + dev->bufpos2index[i % b->count] = + pos->buffer.index; + ++i; + } + } + + opener->buffers_number = b->count; + if (opener->buffers_number < dev->used_buffers) + dev->used_buffers = opener->buffers_number; + return 0; + default: + return -EINVAL; + } +} + +/* returns buffer asked for; + * give app as many buffers as it wants, if it less than MAX, + * but map them in our inner buffers + * called on VIDIOC_QUERYBUF + */ +static int vidioc_querybuf(struct file *file, void *fh, struct v4l2_buffer *b) +{ + enum v4l2_buf_type type; + int index; + struct v4l2_loopback_device *dev; + struct v4l2_loopback_opener *opener; + + MARK(); + + type = b->type; + index = b->index; + dev = v4l2loopback_getdevice(file); + opener = fh_to_opener(fh); + + if ((b->type != V4L2_BUF_TYPE_VIDEO_CAPTURE) && + (b->type != V4L2_BUF_TYPE_VIDEO_OUTPUT)) { + return -EINVAL; + } + if (b->index > max_buffers) + return -EINVAL; + + if (opener->timeout_image_io) + *b = dev->timeout_image_buffer.buffer; + else + *b = dev->buffers[b->index % 
dev->used_buffers].buffer; + + b->type = type; + b->index = index; + dprintkrw("buffer type: %d (of %d with size=%ld)\n", b->memory, + dev->buffers_number, dev->buffer_size); + + /* Hopefully fix 'DQBUF return bad index if queue bigger then 2 for capture' + https://github.com/umlaeute/v4l2loopback/issues/60 */ + b->flags &= ~V4L2_BUF_FLAG_DONE; + b->flags |= V4L2_BUF_FLAG_QUEUED; + + return 0; +} + +static void buffer_written(struct v4l2_loopback_device *dev, + struct v4l2l_buffer *buf) +{ + del_timer_sync(&dev->sustain_timer); + del_timer_sync(&dev->timeout_timer); + spin_lock_bh(&dev->lock); + + dev->bufpos2index[dev->write_position % dev->used_buffers] = + buf->buffer.index; + list_move_tail(&buf->list_head, &dev->outbufs_list); + ++dev->write_position; + dev->reread_count = 0; + + check_timers(dev); + spin_unlock_bh(&dev->lock); +} + +/* put buffer to queue + * called on VIDIOC_QBUF + */ +static int vidioc_qbuf(struct file *file, void *fh, struct v4l2_buffer *buf) +{ + struct v4l2_loopback_device *dev; + struct v4l2_loopback_opener *opener; + struct v4l2l_buffer *b; + int index; + + dev = v4l2loopback_getdevice(file); + opener = fh_to_opener(fh); + + if (buf->index > max_buffers) + return -EINVAL; + if (opener->timeout_image_io) + return 0; + + index = buf->index % dev->used_buffers; + b = &dev->buffers[index]; + + switch (buf->type) { + case V4L2_BUF_TYPE_VIDEO_CAPTURE: + dprintkrw("capture QBUF index: %d\n", index); + set_queued(b); + return 0; + case V4L2_BUF_TYPE_VIDEO_OUTPUT: + dprintkrw("output QBUF pos: %d index: %d\n", + dev->write_position, index); + if (buf->timestamp.tv_sec == 0 && buf->timestamp.tv_usec == 0) + v4l2l_get_timestamp(&b->buffer); + else + b->buffer.timestamp = buf->timestamp; + b->buffer.bytesused = buf->bytesused; + set_done(b); + buffer_written(dev, b); + + /* Hopefully fix 'DQBUF return bad index if queue bigger then 2 for capture' + https://github.com/umlaeute/v4l2loopback/issues/60 */ + buf->flags &= ~V4L2_BUF_FLAG_DONE; + buf->flags |= V4L2_BUF_FLAG_QUEUED; + + wake_up_all(&dev->read_event); + return 0; + default: + return -EINVAL; + } +} + +static int can_read(struct v4l2_loopback_device *dev, + struct v4l2_loopback_opener *opener) +{ + int ret; + + spin_lock_bh(&dev->lock); + check_timers(dev); + ret = dev->write_position > opener->read_position || + dev->reread_count > opener->reread_count || dev->timeout_happened; + spin_unlock_bh(&dev->lock); + return ret; +} + +static int get_capture_buffer(struct file *file) +{ + struct v4l2_loopback_device *dev = v4l2loopback_getdevice(file); + struct v4l2_loopback_opener *opener = fh_to_opener(file->private_data); + int pos, ret; + int timeout_happened; + + if ((file->f_flags & O_NONBLOCK) && + (dev->write_position <= opener->read_position && + dev->reread_count <= opener->reread_count && + !dev->timeout_happened)) + return -EAGAIN; + wait_event_interruptible(dev->read_event, can_read(dev, opener)); + + spin_lock_bh(&dev->lock); + if (dev->write_position == opener->read_position) { + if (dev->reread_count > opener->reread_count + 2) + opener->reread_count = dev->reread_count - 1; + ++opener->reread_count; + pos = (opener->read_position + dev->used_buffers - 1) % + dev->used_buffers; + } else { + opener->reread_count = 0; + if (dev->write_position > + opener->read_position + dev->used_buffers) + opener->read_position = dev->write_position - 1; + pos = opener->read_position % dev->used_buffers; + ++opener->read_position; + } + timeout_happened = dev->timeout_happened; + dev->timeout_happened = 0; + 
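+ /* latch and clear the flag while still holding the lock, so each
+ * timeout is serviced by exactly one reader; the actual copy from
+ * timeout_image happens below, after the lock is dropped */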
spin_unlock_bh(&dev->lock); + + ret = dev->bufpos2index[pos]; + if (timeout_happened) { + /* although allocated on-demand, timeout_image is freed only + * in free_buffers(), so we don't need to worry about it being + * deallocated suddenly */ + memcpy(dev->image + dev->buffers[ret].buffer.m.offset, + dev->timeout_image, dev->buffer_size); + } + return ret; +} + +/* put buffer to dequeue + * called on VIDIOC_DQBUF + */ +static int vidioc_dqbuf(struct file *file, void *fh, struct v4l2_buffer *buf) +{ + struct v4l2_loopback_device *dev; + struct v4l2_loopback_opener *opener; + int index; + struct v4l2l_buffer *b; + + dev = v4l2loopback_getdevice(file); + opener = fh_to_opener(fh); + if (opener->timeout_image_io) { + *buf = dev->timeout_image_buffer.buffer; + return 0; + } + + switch (buf->type) { + case V4L2_BUF_TYPE_VIDEO_CAPTURE: + index = get_capture_buffer(file); + if (index < 0) + return index; + dprintkrw("capture DQBUF pos: %d index: %d\n", + opener->read_position - 1, index); + if (!(dev->buffers[index].buffer.flags & + V4L2_BUF_FLAG_MAPPED)) { + dprintk("trying to return not mapped buf[%d]\n", index); + return -EINVAL; + } + unset_flags(&dev->buffers[index]); + *buf = dev->buffers[index].buffer; + return 0; + case V4L2_BUF_TYPE_VIDEO_OUTPUT: + b = list_entry(dev->outbufs_list.prev, struct v4l2l_buffer, + list_head); + list_move_tail(&b->list_head, &dev->outbufs_list); + dprintkrw("output DQBUF index: %d\n", b->buffer.index); + unset_flags(b); + *buf = b->buffer; + buf->type = V4L2_BUF_TYPE_VIDEO_OUTPUT; + return 0; + default: + return -EINVAL; + } +} + +/* ------------- STREAMING ------------------- */ + +/* start streaming + * called on VIDIOC_STREAMON + */ +static int vidioc_streamon(struct file *file, void *fh, enum v4l2_buf_type type) +{ + struct v4l2_loopback_device *dev; + struct v4l2_loopback_opener *opener; + MARK(); + + dev = v4l2loopback_getdevice(file); + opener = fh_to_opener(fh); + + switch (type) { + case V4L2_BUF_TYPE_VIDEO_OUTPUT: + opener->type = WRITER; + dev->ready_for_output = 0; + if (!dev->ready_for_capture) { + int ret = allocate_buffers(dev); + if (ret < 0) + return ret; + } + dev->ready_for_capture++; + return 0; + case V4L2_BUF_TYPE_VIDEO_CAPTURE: + opener->type = READER; + if (!dev->ready_for_capture) + return -EIO; + return 0; + default: + return -EINVAL; + } + return -EINVAL; +} + +/* stop streaming + * called on VIDIOC_STREAMOFF + */ +static int vidioc_streamoff(struct file *file, void *fh, + enum v4l2_buf_type type) +{ + struct v4l2_loopback_device *dev; + MARK(); + dprintk("%d\n", type); + + dev = v4l2loopback_getdevice(file); + + switch (type) { + case V4L2_BUF_TYPE_VIDEO_OUTPUT: + if (dev->ready_for_capture > 0) + dev->ready_for_capture--; + return 0; + case V4L2_BUF_TYPE_VIDEO_CAPTURE: + return 0; + default: + return -EINVAL; + } + return -EINVAL; +} + +#ifdef CONFIG_VIDEO_V4L1_COMPAT +static int vidiocgmbuf(struct file *file, void *fh, struct video_mbuf *p) +{ + struct v4l2_loopback_device *dev; + MARK(); + + dev = v4l2loopback_getdevice(file); + p->frames = dev->buffers_number; + p->offsets[0] = 0; + p->offsets[1] = 0; + p->size = dev->buffer_size; + return 0; +} +#endif + +static int vidioc_subscribe_event(struct v4l2_fh *fh, + const struct v4l2_event_subscription *sub) +{ + switch (sub->type) { + case V4L2_EVENT_CTRL: + return v4l2_ctrl_subscribe_event(fh, sub); + } + + return -EINVAL; +} + +/* file operations */ +static void vm_open(struct vm_area_struct *vma) +{ + struct v4l2l_buffer *buf; + MARK(); + + buf = vma->vm_private_data; + 
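+ /* the mapping is being duplicated (mmap, fork, split, ...):
+ * count one more userspace reference to this driver buffer */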
buf->use_count++; +} + +static void vm_close(struct vm_area_struct *vma) +{ + struct v4l2l_buffer *buf; + MARK(); + + buf = vma->vm_private_data; + buf->use_count--; +} + +static struct vm_operations_struct vm_ops = { + .open = vm_open, + .close = vm_close, +}; + +static int v4l2_loopback_mmap(struct file *file, struct vm_area_struct *vma) +{ + unsigned long addr; + unsigned long start; + unsigned long size; + struct v4l2_loopback_device *dev; + struct v4l2_loopback_opener *opener; + struct v4l2l_buffer *buffer = NULL; + MARK(); + + start = (unsigned long)vma->vm_start; + size = (unsigned long)(vma->vm_end - vma->vm_start); + + dev = v4l2loopback_getdevice(file); + opener = fh_to_opener(file->private_data); + + if (size > dev->buffer_size) { + dprintk("userspace tries to mmap too much, fail\n"); + return -EINVAL; + } + if (opener->timeout_image_io) { + /* we are going to map the timeout_image_buffer */ + if ((vma->vm_pgoff << PAGE_SHIFT) != + dev->buffer_size * MAX_BUFFERS) { + dprintk("invalid mmap offset for timeout_image_io mode\n"); + return -EINVAL; + } + } else if ((vma->vm_pgoff << PAGE_SHIFT) > + dev->buffer_size * (dev->buffers_number - 1)) { + dprintk("userspace tries to mmap too far, fail\n"); + return -EINVAL; + } + + /* FIXXXXXME: allocation should not happen here! */ + if (NULL == dev->image) + if (allocate_buffers(dev) < 0) + return -EINVAL; + + if (opener->timeout_image_io) { + buffer = &dev->timeout_image_buffer; + addr = (unsigned long)dev->timeout_image; + } else { + int i; + for (i = 0; i < dev->buffers_number; ++i) { + buffer = &dev->buffers[i]; + if ((buffer->buffer.m.offset >> PAGE_SHIFT) == + vma->vm_pgoff) + break; + } + + if (NULL == buffer) + return -EINVAL; + + addr = (unsigned long)dev->image + + (vma->vm_pgoff << PAGE_SHIFT); + } + + while (size > 0) { + struct page *page; + + page = (void *)vmalloc_to_page((void *)addr); + + if (vm_insert_page(vma, start, page) < 0) + return -EAGAIN; + + start += PAGE_SIZE; + addr += PAGE_SIZE; + size -= PAGE_SIZE; + } + + vma->vm_ops = &vm_ops; + vma->vm_private_data = buffer; + buffer->buffer.flags |= V4L2_BUF_FLAG_MAPPED; + + vm_open(vma); + + MARK(); + return 0; +} + +static unsigned int v4l2_loopback_poll(struct file *file, + struct poll_table_struct *pts) +{ + struct v4l2_loopback_opener *opener; + struct v4l2_loopback_device *dev; + __poll_t req_events = poll_requested_events(pts); + int ret_mask = 0; + MARK(); + + opener = fh_to_opener(file->private_data); + dev = v4l2loopback_getdevice(file); + + if (req_events & POLLPRI) { + if (!v4l2_event_pending(&opener->fh)) + poll_wait(file, &opener->fh.wait, pts); + if (v4l2_event_pending(&opener->fh)) { + ret_mask |= POLLPRI; + if (!(req_events & DEFAULT_POLLMASK)) + return ret_mask; + } + } + + switch (opener->type) { + case WRITER: + ret_mask |= POLLOUT | POLLWRNORM; + break; + case READER: + if (!can_read(dev, opener)) { + if (ret_mask) + return ret_mask; + poll_wait(file, &dev->read_event, pts); + } + if (can_read(dev, opener)) + ret_mask |= POLLIN | POLLRDNORM; + if (v4l2_event_pending(&opener->fh)) + ret_mask |= POLLPRI; + break; + default: + break; + } + + MARK(); + return ret_mask; +} + +/* do not want to limit device opens, it can be as many readers as user want, + * writers are limited by means of setting writer field */ +static int v4l2_loopback_open(struct file *file) +{ + struct v4l2_loopback_device *dev; + struct v4l2_loopback_opener *opener; + MARK(); + dev = v4l2loopback_getdevice(file); + if (dev->open_count.counter >= dev->max_openers) + return -EBUSY; + /* 
kfree on close */ + opener = kzalloc(sizeof(*opener), GFP_KERNEL); + if (opener == NULL) + return -ENOMEM; + + v4l2_fh_init(&opener->fh, video_devdata(file)); + file->private_data = &opener->fh; + atomic_inc(&dev->open_count); + + opener->timeout_image_io = dev->timeout_image_io; + dev->timeout_image_io = 0; + + if (opener->timeout_image_io) { + int r = allocate_timeout_image(dev); + + if (r < 0) { + dprintk("timeout image allocation failed\n"); + return r; + } + } + + v4l2_fh_add(&opener->fh); + dprintk("opened dev:%p with image:%p\n", dev, dev ? dev->image : NULL); + MARK(); + return 0; +} + +static int v4l2_loopback_close(struct file *file) +{ + struct v4l2_loopback_opener *opener; + struct v4l2_loopback_device *dev; + int iswriter = 0; + MARK(); + + opener = fh_to_opener(file->private_data); + dev = v4l2loopback_getdevice(file); + + if (WRITER == opener->type) + iswriter = 1; + + atomic_dec(&dev->open_count); + if (dev->open_count.counter == 0) { + del_timer_sync(&dev->sustain_timer); + del_timer_sync(&dev->timeout_timer); + } + try_free_buffers(dev); + + v4l2_fh_del(&opener->fh); + v4l2_fh_exit(&opener->fh); + + kfree(opener); + if (iswriter) { + dev->ready_for_output = 1; + } + MARK(); + return 0; +} + +static ssize_t v4l2_loopback_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + int read_index; + struct v4l2_loopback_device *dev; + struct v4l2_buffer *b; + MARK(); + + dev = v4l2loopback_getdevice(file); + + read_index = get_capture_buffer(file); + if (read_index < 0) + return read_index; + if (count > dev->buffer_size) + count = dev->buffer_size; + b = &dev->buffers[read_index].buffer; + if (count > b->bytesused) + count = b->bytesused; + if (copy_to_user((void *)buf, (void *)(dev->image + b->m.offset), + count)) { + printk(KERN_ERR + "v4l2-loopback: failed copy_to_user() in read buf\n"); + return -EFAULT; + } + dprintkrw("leave v4l2_loopback_read()\n"); + return count; +} + +static ssize_t v4l2_loopback_write(struct file *file, const char __user *buf, + size_t count, loff_t *ppos) +{ + struct v4l2_loopback_device *dev; + int write_index; + struct v4l2_buffer *b; + MARK(); + + dev = v4l2loopback_getdevice(file); + + /* there's at least one writer, so don't stop announcing output capabilities */ + dev->ready_for_output = 0; + + if (!dev->ready_for_capture) { + int ret = allocate_buffers(dev); + if (ret < 0) + return ret; + dev->ready_for_capture = 1; + } + dprintkrw("v4l2_loopback_write() trying to write %zu bytes\n", count); + if (count > dev->buffer_size) + count = dev->buffer_size; + + write_index = dev->write_position % dev->used_buffers; + b = &dev->buffers[write_index].buffer; + + if (copy_from_user((void *)(dev->image + b->m.offset), (void *)buf, + count)) { + printk(KERN_ERR + "v4l2-loopback: failed copy_from_user() in write buf, could not write %zu\n", + count); + return -EFAULT; + } + v4l2l_get_timestamp(b); + b->bytesused = count; + b->sequence = dev->write_position; + buffer_written(dev, &dev->buffers[write_index]); + wake_up_all(&dev->read_event); + dprintkrw("leave v4l2_loopback_write()\n"); + return count; +} + +/* init functions */ +/* frees buffers, if already allocated */ +static int free_buffers(struct v4l2_loopback_device *dev) +{ + MARK(); + dprintk("freeing image@%p for dev:%p\n", dev ? 
dev->image : NULL, dev); + if (dev->image) { + vfree(dev->image); + dev->image = NULL; + } + if (dev->timeout_image) { + vfree(dev->timeout_image); + dev->timeout_image = NULL; + } + dev->imagesize = 0; + + return 0; +} +/* frees buffers, if they are no longer needed */ +static void try_free_buffers(struct v4l2_loopback_device *dev) +{ + MARK(); + if (0 == dev->open_count.counter && !dev->keep_format) { + free_buffers(dev); + dev->ready_for_capture = 0; + dev->buffer_size = 0; + dev->write_position = 0; + } +} +/* allocates buffers, if buffer_size is set */ +static int allocate_buffers(struct v4l2_loopback_device *dev) +{ + MARK(); + /* vfree on close file operation in case no open handles left */ + if (0 == dev->buffer_size) + return -EINVAL; + + if (dev->image) { + dprintk("allocating buffers again: %ld %ld\n", + dev->buffer_size * dev->buffers_number, dev->imagesize); + /* FIXME: prevent double allocation more intelligently! */ + if (dev->buffer_size * dev->buffers_number == dev->imagesize) + return 0; + + /* if there is only one writer, no problem should occur */ + if (dev->open_count.counter == 1) + free_buffers(dev); + else + return -EINVAL; + } + + dev->imagesize = dev->buffer_size * dev->buffers_number; + + dprintk("allocating %ld = %ldx%d\n", dev->imagesize, dev->buffer_size, + dev->buffers_number); + + dev->image = vmalloc(dev->imagesize); + if (dev->timeout_jiffies > 0) + allocate_timeout_image(dev); + + if (dev->image == NULL) + return -ENOMEM; + dprintk("vmallocated %ld bytes\n", dev->imagesize); + MARK(); + init_buffers(dev); + return 0; +} + +/* init inner buffers, they are capture mode and flags are set as + * for capture mod buffers */ +static void init_buffers(struct v4l2_loopback_device *dev) +{ + int i; + int buffer_size; + int bytesused; + MARK(); + + buffer_size = dev->buffer_size; + bytesused = dev->pix_format.sizeimage; + + for (i = 0; i < dev->buffers_number; ++i) { + struct v4l2_buffer *b = &dev->buffers[i].buffer; + b->index = i; + b->bytesused = bytesused; + b->length = buffer_size; + b->field = V4L2_FIELD_NONE; + b->flags = 0; +#if LINUX_VERSION_CODE < KERNEL_VERSION(3, 6, 1) + b->input = 0; +#endif + b->m.offset = i * buffer_size; + b->memory = V4L2_MEMORY_MMAP; + b->sequence = 0; + b->timestamp.tv_sec = 0; + b->timestamp.tv_usec = 0; + b->type = V4L2_BUF_TYPE_VIDEO_CAPTURE; + + v4l2l_get_timestamp(b); + } + dev->timeout_image_buffer = dev->buffers[0]; + dev->timeout_image_buffer.buffer.m.offset = MAX_BUFFERS * buffer_size; + MARK(); +} + +static int allocate_timeout_image(struct v4l2_loopback_device *dev) +{ + MARK(); + if (dev->buffer_size <= 0) + return -EINVAL; + + if (dev->timeout_image == NULL) { + dev->timeout_image = v4l2l_vzalloc(dev->buffer_size); + if (dev->timeout_image == NULL) + return -ENOMEM; + } + return 0; +} + +/* fills and register video device */ +static void init_vdev(struct video_device *vdev, int nr) +{ + MARK(); + +#ifdef V4L2LOOPBACK_WITH_STD + vdev->tvnorms = V4L2_STD_ALL; +#endif /* V4L2LOOPBACK_WITH_STD */ + + vdev->vfl_type = VFL_TYPE_VIDEO; + vdev->fops = &v4l2_loopback_fops; + vdev->ioctl_ops = &v4l2_loopback_ioctl_ops; + vdev->release = &video_device_release; + vdev->minor = -1; +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 7, 0) + vdev->device_caps = V4L2_CAP_VIDEO_CAPTURE | V4L2_CAP_VIDEO_OUTPUT | + V4L2_CAP_READWRITE | V4L2_CAP_STREAMING; +#ifdef V4L2_CAP_VIDEO_M2M + vdev->device_caps |= V4L2_CAP_VIDEO_M2M; +#endif +#endif /* >=linux-4.7.0 */ + + if (debug > 1) +#if LINUX_VERSION_CODE < KERNEL_VERSION(3, 20, 0) + vdev->debug 
= V4L2_DEBUG_IOCTL | V4L2_DEBUG_IOCTL_ARG; +#else + vdev->dev_debug = + V4L2_DEV_DEBUG_IOCTL | V4L2_DEV_DEBUG_IOCTL_ARG; +#endif + + /* since kernel-3.7, there is a new field 'vfl_dir' that has to be + * set to VFL_DIR_M2M for bidirectional devices */ +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 7, 0) + vdev->vfl_dir = VFL_DIR_M2M; +#endif + + MARK(); +} + +/* init default capture parameters, only fps may be changed in future */ +static void init_capture_param(struct v4l2_captureparm *capture_param) +{ + MARK(); + capture_param->capability = 0; + capture_param->capturemode = 0; + capture_param->extendedmode = 0; + capture_param->readbuffers = max_buffers; + capture_param->timeperframe.numerator = 1; + capture_param->timeperframe.denominator = 30; +} + +static void check_timers(struct v4l2_loopback_device *dev) +{ + if (!dev->ready_for_capture) + return; + + if (dev->timeout_jiffies > 0 && !timer_pending(&dev->timeout_timer)) + mod_timer(&dev->timeout_timer, jiffies + dev->timeout_jiffies); + if (dev->sustain_framerate && !timer_pending(&dev->sustain_timer)) + mod_timer(&dev->sustain_timer, + jiffies + dev->frame_jiffies * 3 / 2); +} +#ifdef HAVE_TIMER_SETUP +static void sustain_timer_clb(struct timer_list *t) +{ + struct v4l2_loopback_device *dev = from_timer(dev, t, sustain_timer); +#else +static void sustain_timer_clb(unsigned long nr) +{ + struct v4l2_loopback_device *dev = + idr_find(&v4l2loopback_index_idr, nr); +#endif + spin_lock(&dev->lock); + if (dev->sustain_framerate) { + dev->reread_count++; + dprintkrw("reread: %d %d\n", dev->write_position, + dev->reread_count); + if (dev->reread_count == 1) + mod_timer(&dev->sustain_timer, + jiffies + max(1UL, dev->frame_jiffies / 2)); + else + mod_timer(&dev->sustain_timer, + jiffies + dev->frame_jiffies); + wake_up_all(&dev->read_event); + } + spin_unlock(&dev->lock); +} +#ifdef HAVE_TIMER_SETUP +static void timeout_timer_clb(struct timer_list *t) +{ + struct v4l2_loopback_device *dev = from_timer(dev, t, timeout_timer); +#else +static void timeout_timer_clb(unsigned long nr) +{ + struct v4l2_loopback_device *dev = + idr_find(&v4l2loopback_index_idr, nr); +#endif + spin_lock(&dev->lock); + if (dev->timeout_jiffies > 0) { + dev->timeout_happened = 1; + mod_timer(&dev->timeout_timer, jiffies + dev->timeout_jiffies); + wake_up_all(&dev->read_event); + } + spin_unlock(&dev->lock); +} + +/* init loopback main structure */ +#define DEFAULT_FROM_CONF(confmember, default_condition, default_value) \ + ((conf) ? \ + ((conf->confmember default_condition) ? (default_value) : \ + (conf->confmember)) : \ + default_value) + +static int v4l2_loopback_add(struct v4l2_loopback_config *conf, int *ret_nr) +{ + struct v4l2_loopback_device *dev; + struct v4l2_ctrl_handler *hdl; + + int err = -ENOMEM; + + int _max_width = DEFAULT_FROM_CONF( + max_width, <= V4L2LOOPBACK_SIZE_MIN_WIDTH, max_width); + int _max_height = DEFAULT_FROM_CONF( + max_height, <= V4L2LOOPBACK_SIZE_MIN_HEIGHT, max_height); + bool _announce_all_caps = (conf && conf->announce_all_caps >= 0) ? 
+ (conf->announce_all_caps) : + V4L2LOOPBACK_DEFAULT_EXCLUSIVECAPS; + + int _max_buffers = DEFAULT_FROM_CONF(max_buffers, <= 0, max_buffers); + int _max_openers = DEFAULT_FROM_CONF(max_openers, <= 0, max_openers); + + int nr = -1; + if (conf) { + if (conf->capture_nr >= 0 && + conf->output_nr == conf->capture_nr) { + nr = conf->capture_nr; + } else if (conf->capture_nr < 0 && conf->output_nr < 0) { + nr = -1; + } else if (conf->capture_nr < 0) { + nr = conf->output_nr; + } else if (conf->output_nr < 0) { + nr = conf->capture_nr; + } else { + printk(KERN_ERR + "split OUTPUT and CAPTURE devices not yet supported."); + printk(KERN_INFO + "both devices must have the same number (%d != %d).", + conf->output_nr, conf->capture_nr); + return -EINVAL; + } + } + + if (idr_find(&v4l2loopback_index_idr, nr)) + return -EEXIST; + + dprintk("creating v4l2loopback-device #%d\n", nr); + dev = kzalloc(sizeof(*dev), GFP_KERNEL); + if (!dev) + return -ENOMEM; + + /* allocate id, if @id >= 0, we're requesting that specific id */ + if (nr >= 0) { + err = idr_alloc(&v4l2loopback_index_idr, dev, nr, nr + 1, + GFP_KERNEL); + if (err == -ENOSPC) + err = -EEXIST; + } else { + err = idr_alloc(&v4l2loopback_index_idr, dev, 0, 0, GFP_KERNEL); + } + if (err < 0) + goto out_free_dev; + nr = err; + err = -ENOMEM; + + if (conf && conf->card_label && *(conf->card_label)) { + snprintf(dev->card_label, sizeof(dev->card_label), "%s", + conf->card_label); + } else { + snprintf(dev->card_label, sizeof(dev->card_label), + "Dummy video device (0x%04X)", nr); + } + snprintf(dev->v4l2_dev.name, sizeof(dev->v4l2_dev.name), + "v4l2loopback-%03d", nr); + + err = v4l2_device_register(NULL, &dev->v4l2_dev); + if (err) + goto out_free_idr; + MARK(); + + dev->vdev = video_device_alloc(); + if (dev->vdev == NULL) { + err = -ENOMEM; + goto out_unregister; + } + video_set_drvdata(dev->vdev, + kzalloc(sizeof(struct v4l2loopback_private), + GFP_KERNEL)); + if (video_get_drvdata(dev->vdev) == NULL) { + err = -ENOMEM; + goto out_unregister; + } + + MARK(); + snprintf(dev->vdev->name, sizeof(dev->vdev->name), dev->card_label); + + ((struct v4l2loopback_private *)video_get_drvdata(dev->vdev)) + ->device_nr = nr; + + init_vdev(dev->vdev, nr); + dev->vdev->v4l2_dev = &dev->v4l2_dev; + init_capture_param(&dev->capture_param); + err = set_timeperframe(dev, &dev->capture_param.timeperframe); + if (err) + goto out_unregister; + dev->keep_format = 0; + dev->sustain_framerate = 0; + + dev->announce_all_caps = _announce_all_caps; + dev->max_width = _max_width; + dev->max_height = _max_height; + dev->max_openers = _max_openers; + dev->buffers_number = dev->used_buffers = _max_buffers; + + dev->write_position = 0; + + MARK(); + spin_lock_init(&dev->lock); + INIT_LIST_HEAD(&dev->outbufs_list); + if (list_empty(&dev->outbufs_list)) { + int i; + + for (i = 0; i < dev->used_buffers; ++i) + list_add_tail(&dev->buffers[i].list_head, + &dev->outbufs_list); + } + memset(dev->bufpos2index, 0, sizeof(dev->bufpos2index)); + atomic_set(&dev->open_count, 0); + dev->ready_for_capture = 0; + dev->ready_for_output = 1; + + dev->buffer_size = 0; + dev->image = NULL; + dev->imagesize = 0; +#ifdef HAVE_TIMER_SETUP + timer_setup(&dev->sustain_timer, sustain_timer_clb, 0); + timer_setup(&dev->timeout_timer, timeout_timer_clb, 0); +#else + setup_timer(&dev->sustain_timer, sustain_timer_clb, nr); + setup_timer(&dev->timeout_timer, timeout_timer_clb, nr); +#endif + dev->reread_count = 0; + dev->timeout_jiffies = 0; + dev->timeout_image = NULL; + dev->timeout_happened = 0; + + hdl = 
&dev->ctrl_handler; + err = v4l2_ctrl_handler_init(hdl, 4); + if (err) + goto out_unregister; + v4l2_ctrl_new_custom(hdl, &v4l2loopback_ctrl_keepformat, NULL); + v4l2_ctrl_new_custom(hdl, &v4l2loopback_ctrl_sustainframerate, NULL); + v4l2_ctrl_new_custom(hdl, &v4l2loopback_ctrl_timeout, NULL); + v4l2_ctrl_new_custom(hdl, &v4l2loopback_ctrl_timeoutimageio, NULL); + if (hdl->error) { + err = hdl->error; + goto out_free_handler; + } + dev->v4l2_dev.ctrl_handler = hdl; + + err = v4l2_ctrl_handler_setup(hdl); + if (err) + goto out_free_handler; + + /* FIXME set buffers to 0 */ + + /* Set initial format */ + dev->pix_format.width = 0; /* V4L2LOOPBACK_SIZE_DEFAULT_WIDTH; */ + dev->pix_format.height = 0; /* V4L2LOOPBACK_SIZE_DEFAULT_HEIGHT; */ + dev->pix_format.pixelformat = formats[0].fourcc; + dev->pix_format.colorspace = + V4L2_COLORSPACE_SRGB; /* do we need to set this ? */ + dev->pix_format.field = V4L2_FIELD_NONE; + + dev->buffer_size = PAGE_ALIGN(dev->pix_format.sizeimage); + dprintk("buffer_size = %ld (=%d)\n", dev->buffer_size, + dev->pix_format.sizeimage); + err = allocate_buffers(dev); + if (err && dev->buffer_size) + goto out_free_handler; + + init_waitqueue_head(&dev->read_event); + + /* register the device -> it creates /dev/video* */ + if (video_register_device(dev->vdev, VFL_TYPE_VIDEO, nr) < 0) { + printk(KERN_ERR + "v4l2loopback: failed video_register_device()\n"); + err = -EFAULT; + goto out_free_device; + } + v4l2loopback_create_sysfs(dev->vdev); + + MARK(); + if (ret_nr) + *ret_nr = dev->vdev->num; + return 0; + +out_free_device: + video_device_release(dev->vdev); +out_free_handler: + v4l2_ctrl_handler_free(&dev->ctrl_handler); +out_unregister: + v4l2_device_unregister(&dev->v4l2_dev); +out_free_idr: + idr_remove(&v4l2loopback_index_idr, nr); +out_free_dev: + kfree(dev); + return err; +} + +static void v4l2_loopback_remove(struct v4l2_loopback_device *dev) +{ + free_buffers(dev); + v4l2loopback_remove_sysfs(dev->vdev); + kfree(video_get_drvdata(dev->vdev)); + video_unregister_device(dev->vdev); + v4l2_device_unregister(&dev->v4l2_dev); + v4l2_ctrl_handler_free(&dev->ctrl_handler); + kfree(dev); +} + +static long v4l2loopback_control_ioctl(struct file *file, unsigned int cmd, + unsigned long parm) +{ + struct v4l2_loopback_device *dev; + struct v4l2_loopback_config conf; + struct v4l2_loopback_config *confptr = &conf; + int device_nr; + int ret; + + ret = mutex_lock_killable(&v4l2loopback_ctl_mutex); + if (ret) + return ret; + + ret = -EINVAL; + switch (cmd) { + default: + ret = -ENOSYS; + break; + /* add a v4l2loopback device (pair), based on the user-provided specs */ + case V4L2LOOPBACK_CTL_ADD: + if (parm) { + if ((ret = copy_from_user(&conf, (void *)parm, + sizeof(conf))) < 0) + break; + } else + confptr = NULL; + ret = v4l2_loopback_add(confptr, &device_nr); + if (ret >= 0) + ret = device_nr; + break; + /* remove a v4l2loopback device (both capture and output) */ + case V4L2LOOPBACK_CTL_REMOVE: + ret = v4l2loopback_lookup((int)parm, &dev); + if (ret >= 0 && dev) { + int nr = ret; + ret = -EBUSY; + if (dev->open_count.counter > 0) + break; + idr_remove(&v4l2loopback_index_idr, nr); + v4l2_loopback_remove(dev); + ret = 0; + }; + break; + /* get information for a loopback device. + * this is mostly about limits (which cannot be queried directly with VIDIOC_G_FMT and friends + */ + case V4L2LOOPBACK_CTL_QUERY: + if (!parm) + break; + if ((ret = copy_from_user(&conf, (void *)parm, sizeof(conf))) < + 0) + break; + device_nr = + (conf.output_nr < 0) ? 
conf.capture_nr : conf.output_nr; + MARK(); + /* get the device from either capture_nr or output_nr (whatever is valid) */ + if ((ret = v4l2loopback_lookup(device_nr, &dev)) < 0) + break; + MARK(); + /* if we got the device from output_nr and there is a valid capture_nr, + * make sure that both refer to the same device (or bail out) + */ + if ((device_nr != conf.capture_nr) && (conf.capture_nr >= 0) && + (ret != v4l2loopback_lookup(conf.capture_nr, 0))) + break; + MARK(); + /* if otoh, we got the device from capture_nr and there is a valid output_nr, + * make sure that both refer to the same device (or bail out) + */ + if ((device_nr != conf.output_nr) && (conf.output_nr >= 0) && + (ret != v4l2loopback_lookup(conf.output_nr, 0))) + break; + MARK(); + + /* v4l2_loopback_config identified a single device, so fetch the data */ + snprintf(conf.card_label, sizeof(conf.card_label), "%s", + dev->card_label); + MARK(); + conf.output_nr = conf.capture_nr = dev->vdev->num; + conf.max_width = dev->max_width; + conf.max_height = dev->max_height; + conf.announce_all_caps = dev->announce_all_caps; + conf.max_buffers = dev->buffers_number; + conf.max_openers = dev->max_openers; + conf.debug = debug; + MARK(); + if (copy_to_user((void *)parm, &conf, sizeof(conf))) { + ret = -EFAULT; + break; + } + MARK(); + ret = 0; + ; + break; + } + + MARK(); + mutex_unlock(&v4l2loopback_ctl_mutex); + MARK(); + return ret; +} + +/* LINUX KERNEL */ + +static const struct file_operations v4l2loopback_ctl_fops = { + // clang-format off + .owner = THIS_MODULE, + .open = nonseekable_open, + .unlocked_ioctl = v4l2loopback_control_ioctl, + .compat_ioctl = v4l2loopback_control_ioctl, + .llseek = noop_llseek, + // clang-format on +}; + +static struct miscdevice v4l2loopback_misc = { + // clang-format off + .minor = MISC_DYNAMIC_MINOR, + .name = "v4l2loopback", + .fops = &v4l2loopback_ctl_fops, + // clang-format on +}; + +static const struct v4l2_file_operations v4l2_loopback_fops = { + // clang-format off + .owner = THIS_MODULE, + .open = v4l2_loopback_open, + .release = v4l2_loopback_close, + .read = v4l2_loopback_read, + .write = v4l2_loopback_write, + .poll = v4l2_loopback_poll, + .mmap = v4l2_loopback_mmap, + .unlocked_ioctl = video_ioctl2, + // clang-format on +}; + +static const struct v4l2_ioctl_ops v4l2_loopback_ioctl_ops = { + // clang-format off + .vidioc_querycap = &vidioc_querycap, +#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 29) + .vidioc_enum_framesizes = &vidioc_enum_framesizes, + .vidioc_enum_frameintervals = &vidioc_enum_frameintervals, +#endif + +#ifndef HAVE__V4L2_CTRLS + .vidioc_queryctrl = &vidioc_queryctrl, + .vidioc_g_ctrl = &vidioc_g_ctrl, + .vidioc_s_ctrl = &vidioc_s_ctrl, +#endif /* HAVE__V4L2_CTRLS */ + + .vidioc_enum_output = &vidioc_enum_output, + .vidioc_g_output = &vidioc_g_output, + .vidioc_s_output = &vidioc_s_output, + + .vidioc_enum_input = &vidioc_enum_input, + .vidioc_g_input = &vidioc_g_input, + .vidioc_s_input = &vidioc_s_input, + + .vidioc_enum_fmt_vid_cap = &vidioc_enum_fmt_cap, + .vidioc_g_fmt_vid_cap = &vidioc_g_fmt_cap, + .vidioc_s_fmt_vid_cap = &vidioc_s_fmt_cap, + .vidioc_try_fmt_vid_cap = &vidioc_try_fmt_cap, + + .vidioc_enum_fmt_vid_out = &vidioc_enum_fmt_out, + .vidioc_s_fmt_vid_out = &vidioc_s_fmt_out, + .vidioc_g_fmt_vid_out = &vidioc_g_fmt_out, + .vidioc_try_fmt_vid_out = &vidioc_try_fmt_out, + +#ifdef V4L2L_OVERLAY + .vidioc_s_fmt_vid_overlay = &vidioc_s_fmt_overlay, + .vidioc_g_fmt_vid_overlay = &vidioc_g_fmt_overlay, +#endif + +#ifdef V4L2LOOPBACK_WITH_STD + .vidioc_s_std 
= &vidioc_s_std, + .vidioc_g_std = &vidioc_g_std, + .vidioc_querystd = &vidioc_querystd, +#endif /* V4L2LOOPBACK_WITH_STD */ + + .vidioc_g_parm = &vidioc_g_parm, + .vidioc_s_parm = &vidioc_s_parm, + + .vidioc_reqbufs = &vidioc_reqbufs, + .vidioc_querybuf = &vidioc_querybuf, + .vidioc_qbuf = &vidioc_qbuf, + .vidioc_dqbuf = &vidioc_dqbuf, + + .vidioc_streamon = &vidioc_streamon, + .vidioc_streamoff = &vidioc_streamoff, + +#ifdef CONFIG_VIDEO_V4L1_COMPAT + .vidiocgmbuf = &vidiocgmbuf, +#endif + + .vidioc_subscribe_event = &vidioc_subscribe_event, + .vidioc_unsubscribe_event = &v4l2_event_unsubscribe, + // clang-format on +}; + +static int free_device_cb(int id, void *ptr, void *data) +{ + struct v4l2_loopback_device *dev = ptr; + v4l2_loopback_remove(dev); + return 0; +} +static void free_devices(void) +{ + idr_for_each(&v4l2loopback_index_idr, &free_device_cb, NULL); + idr_destroy(&v4l2loopback_index_idr); +} + +static int __init v4l2loopback_init_module(void) +{ + int err; + int i; + MARK(); + + err = misc_register(&v4l2loopback_misc); + if (err < 0) + return err; + + if (devices < 0) { + devices = 1; + + /* try guessing the devices from the "video_nr" parameter */ + for (i = MAX_DEVICES - 1; i >= 0; i--) { + if (video_nr[i] >= 0) { + devices = i + 1; + break; + } + } + } + + if (devices > MAX_DEVICES) { + devices = MAX_DEVICES; + printk(KERN_INFO + "v4l2loopback: number of initial devices is limited to: %d\n", + MAX_DEVICES); + } + + if (max_buffers > MAX_BUFFERS) { + max_buffers = MAX_BUFFERS; + printk(KERN_INFO + "v4l2loopback: number of buffers is limited to: %d\n", + MAX_BUFFERS); + } + + if (max_openers < 0) { + printk(KERN_INFO + "v4l2loopback: allowing %d openers rather than %d\n", + 2, max_openers); + max_openers = 2; + } + + if (max_width < 1) { + max_width = V4L2LOOPBACK_SIZE_DEFAULT_MAX_WIDTH; + printk(KERN_INFO "v4l2loopback: using max_width %d\n", + max_width); + } + if (max_height < 1) { + max_height = V4L2LOOPBACK_SIZE_DEFAULT_MAX_HEIGHT; + printk(KERN_INFO "v4l2loopback: using max_height %d\n", + max_height); + } + + /* kfree on module release */ + for (i = 0; i < devices; i++) { + struct v4l2_loopback_config cfg = { + // clang-format off + .output_nr = video_nr[i], + .capture_nr = video_nr[i], + .max_width = max_width, + .max_height = max_height, + .announce_all_caps = (!exclusive_caps[i]), + .max_buffers = max_buffers, + .max_openers = max_openers, + .debug = debug, + // clang-format on + }; + cfg.card_label[0] = 0; + if (card_label[i]) + snprintf(cfg.card_label, sizeof(cfg.card_label), "%s", + card_label[i]); + err = v4l2_loopback_add(&cfg, 0); + if (err) { + free_devices(); + goto error; + } + } + + dprintk("module installed\n"); + + printk(KERN_INFO "v4l2loopback driver version %d.%d.%d%s loaded\n", + // clang-format off + (V4L2LOOPBACK_VERSION_CODE >> 16) & 0xff, + (V4L2LOOPBACK_VERSION_CODE >> 8) & 0xff, + (V4L2LOOPBACK_VERSION_CODE ) & 0xff, +#ifdef SNAPSHOT_VERSION + " (" STRINGIFY2(SNAPSHOT_VERSION) ")" +#else + "" +#endif + ); + // clang-format on + + return 0; +error: + misc_deregister(&v4l2loopback_misc); + return err; +} + +#ifdef MODULE +static void v4l2loopback_cleanup_module(void) +{ + MARK(); + /* unregister the device -> it deletes /dev/video* */ + free_devices(); + /* and get rid of /dev/v4l2loopback */ + misc_deregister(&v4l2loopback_misc); + dprintk("module removed\n"); +} +#endif + +MODULE_ALIAS_MISCDEV(MISC_DYNAMIC_MINOR); + +module_init(v4l2loopback_init_module); +module_exit(v4l2loopback_cleanup_module); + +/* + * fake usage of unused functions + 
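* (the static vidioc_*_ctrl definitions earlier in this file are unreferenced + * when HAVE__V4L2_CTRLS is defined; re-declaring them __attribute__((unused)) + * suppresses the resulting -Wunused-function warnings)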
*/ +#ifdef HAVE__V4L2_CTRLS +static int vidioc_queryctrl(struct file *file, void *fh, + struct v4l2_queryctrl *q) __attribute__((unused)); +static int vidioc_g_ctrl(struct file *file, void *fh, struct v4l2_control *c) + __attribute__((unused)); +static int vidioc_s_ctrl(struct file *file, void *fh, struct v4l2_control *c) + __attribute__((unused)); +#endif /* HAVE__V4L2_CTRLS */ diff --git a/drivers/media/v4l2-core/v4l2loopback.h b/drivers/media/v4l2-core/v4l2loopback.h new file mode 100644 index 000000000000..fb7180802c3f --- /dev/null +++ b/drivers/media/v4l2-core/v4l2loopback.h @@ -0,0 +1,92 @@ +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */ +/* + * v4l2loopback.h + * + * Written by IOhannes m zmölnig, 7/1/20. + * + * Copyright 2020 by IOhannes m zmölnig. Redistribution of this file is + * permitted under the GNU General Public License. + */ +#ifndef _V4L2LOOPBACK_H +#define _V4L2LOOPBACK_H + +#define V4L2LOOPBACK_VERSION_MAJOR 0 +#define V4L2LOOPBACK_VERSION_MINOR 12 +#define V4L2LOOPBACK_VERSION_BUGFIX 5 + +/* /dev/v4l2loopback interface */ + +struct v4l2_loopback_config { + /** + * the device-number (/dev/video) + * V4L2LOOPBACK_CTL_ADD: + * setting this to a value < 0 will allocate an available one + * if nr>=0 and the device already exists, the ioctl will fail with EEXIST + * if output_nr and capture_nr are the same, only a single device will be created + * + * V4L2LOOPBACK_CTL_QUERY: + * either both output_nr and capture_nr must refer to the same loopback, + * or one (and only one) of them must be -1 + * + */ + int output_nr; + int capture_nr; + + /** + * a nice name for your device + * if (*card_label)==0, an automatic name is assigned + */ + char card_label[32]; + + /** + * maximum allowed frame size + * if too low, default values are used + */ + int max_width; + int max_height; + + /** + * whether to announce OUTPUT/CAPTURE capabilities exclusively + * for this device or not + * (!exclusive_caps) + * FIXXME: this ought to be removed (if superseded by output_nr vs capture_nr) + */ + int announce_all_caps; + + /** + * number of buffers to allocate for the queue + * if set to <=0, default values are used + */ + int max_buffers; + + /** + * how many consumers are allowed to open this device concurrently + * if set to <=0, default values are used + */ + int max_openers; + + /** + * set the debugging level for this device + */ + int debug; +}; + +/* a pointer to a (struct v4l2_loopback_config) that has all values you wish to impose on the + * to-be-created device already set. + * if the ptr is NULL, a new device is created with default values at the driver's discretion.
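+ * + * Illustrative use from userspace (a hypothetical sketch; the open() flags and + * error handling are assumptions, not part of this ABI): + * + * struct v4l2_loopback_config cfg = { .output_nr = -1, .capture_nr = -1 }; + * int ctl = open("/dev/v4l2loopback", 0); + * int nr = ioctl(ctl, V4L2LOOPBACK_CTL_ADD, &cfg);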
+ * + * returns the device_nr of the OUTPUT device (which can be used with V4L2LOOPBACK_CTL_QUERY + * to get more information on the device) + */ +#define V4L2LOOPBACK_CTL_ADD 0x4C80 + +/* a pointer to a (struct v4l2_loopback_config) that has output_nr and/or capture_nr set + * (the two values must either refer to video-devices associated with the same loopback device + * or exactly one of them must be <0) + */ +#define V4L2LOOPBACK_CTL_QUERY 0x4C82 + +/* the device-number (either CAPTURE or OUTPUT) associated with the loopback-device */ +#define V4L2LOOPBACK_CTL_REMOVE 0x4C81 + +#endif /* _V4L2LOOPBACK_H */ diff --git a/drivers/media/v4l2-core/v4l2loopback_formats.h b/drivers/media/v4l2-core/v4l2loopback_formats.h new file mode 100644 index 000000000000..0a4e458526b3 --- /dev/null +++ b/drivers/media/v4l2-core/v4l2loopback_formats.h @@ -0,0 +1,437 @@ +static const struct v4l2l_format formats[] = { +#ifndef V4L2_PIX_FMT_VP9 +#define V4L2_PIX_FMT_VP9 v4l2_fourcc('V', 'P', '9', '0') +#endif +#ifndef V4L2_PIX_FMT_HEVC +#define V4L2_PIX_FMT_HEVC v4l2_fourcc('H', 'E', 'V', 'C') +#endif + + /* here come the packed formats */ + { + .name = "32 bpp RGB, le", + .fourcc = V4L2_PIX_FMT_BGR32, + .depth = 32, + .flags = 0, + }, + { + .name = "32 bpp RGB, be", + .fourcc = V4L2_PIX_FMT_RGB32, + .depth = 32, + .flags = 0, + }, + { + .name = "24 bpp RGB, le", + .fourcc = V4L2_PIX_FMT_BGR24, + .depth = 24, + .flags = 0, + }, + { + .name = "24 bpp RGB, be", + .fourcc = V4L2_PIX_FMT_RGB24, + .depth = 24, + .flags = 0, + }, +#ifdef V4L2_PIX_FMT_RGBA32 + { + .name = "32 bpp RGBA", + .fourcc = V4L2_PIX_FMT_RGBA32, + .depth = 32, + .flags = 0, + }, +#endif +#ifdef V4L2_PIX_FMT_RGB332 + { + .name = "8 bpp RGB-3-3-2", + .fourcc = V4L2_PIX_FMT_RGB332, + .depth = 8, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_RGB332 */ +#ifdef V4L2_PIX_FMT_RGB444 + { + .name = "16 bpp RGB (xxxxrrrr ggggbbbb)", + .fourcc = V4L2_PIX_FMT_RGB444, + .depth = 16, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_RGB444 */ +#ifdef V4L2_PIX_FMT_RGB555 + { + .name = "16 bpp RGB-5-5-5", + .fourcc = V4L2_PIX_FMT_RGB555, + .depth = 16, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_RGB555 */ +#ifdef V4L2_PIX_FMT_RGB565 + { + .name = "16 bpp RGB-5-6-5", + .fourcc = V4L2_PIX_FMT_RGB565, + .depth = 16, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_RGB565 */ +#ifdef V4L2_PIX_FMT_RGB555X + { + .name = "16 bpp RGB-5-5-5 BE", + .fourcc = V4L2_PIX_FMT_RGB555X, + .depth = 16, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_RGB555X */ +#ifdef V4L2_PIX_FMT_RGB565X + { + .name = "16 bpp RGB-5-6-5 BE", + .fourcc = V4L2_PIX_FMT_RGB565X, + .depth = 16, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_RGB565X */ +#ifdef V4L2_PIX_FMT_BGR666 + { + .name = "18 bpp BGR-6-6-6", + .fourcc = V4L2_PIX_FMT_BGR666, + .depth = 18, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_BGR666 */ + { + .name = "4:2:2, packed, YUYV", + .fourcc = V4L2_PIX_FMT_YUYV, + .depth = 16, + .flags = 0, + }, + { + .name = "4:2:2, packed, UYVY", + .fourcc = V4L2_PIX_FMT_UYVY, + .depth = 16, + .flags = 0, + }, +#ifdef V4L2_PIX_FMT_YVYU + { + .name = "4:2:2, packed YVYU", + .fourcc = V4L2_PIX_FMT_YVYU, + .depth = 16, + .flags = 0, + }, +#endif +#ifdef V4L2_PIX_FMT_VYUY + { + .name = "4:2:2, packed VYUY", + .fourcc = V4L2_PIX_FMT_VYUY, + .depth = 16, + .flags = 0, + }, +#endif + { + .name = "4:2:2, packed YYUV", + .fourcc = V4L2_PIX_FMT_YYUV, + .depth = 16, + .flags = 0, + }, + { + .name = "YUV-8-8-8-8", + .fourcc = V4L2_PIX_FMT_YUV32, + .depth = 32, + .flags = 0, + }, + { + .name = "8 bpp, Greyscale", + .fourcc = 
V4L2_PIX_FMT_GREY, + .depth = 8, + .flags = 0, + }, +#ifdef V4L2_PIX_FMT_Y4 + { + .name = "4 bpp Greyscale", + .fourcc = V4L2_PIX_FMT_Y4, + .depth = 4, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_Y4 */ +#ifdef V4L2_PIX_FMT_Y6 + { + .name = "6 bpp Greyscale", + .fourcc = V4L2_PIX_FMT_Y6, + .depth = 6, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_Y6 */ +#ifdef V4L2_PIX_FMT_Y10 + { + .name = "10 bpp Greyscale", + .fourcc = V4L2_PIX_FMT_Y10, + .depth = 10, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_Y10 */ +#ifdef V4L2_PIX_FMT_Y12 + { + .name = "12 bpp Greyscale", + .fourcc = V4L2_PIX_FMT_Y12, + .depth = 12, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_Y12 */ + { + .name = "16 bpp, Greyscale", + .fourcc = V4L2_PIX_FMT_Y16, + .depth = 16, + .flags = 0, + }, +#ifdef V4L2_PIX_FMT_YUV444 + { + .name = "16 bpp xxxxyyyy uuuuvvvv", + .fourcc = V4L2_PIX_FMT_YUV444, + .depth = 16, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_YUV444 */ +#ifdef V4L2_PIX_FMT_YUV555 + { + .name = "16 bpp YUV-5-5-5", + .fourcc = V4L2_PIX_FMT_YUV555, + .depth = 16, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_YUV555 */ +#ifdef V4L2_PIX_FMT_YUV565 + { + .name = "16 bpp YUV-5-6-5", + .fourcc = V4L2_PIX_FMT_YUV565, + .depth = 16, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_YUV565 */ + +/* bayer formats */ +#ifdef V4L2_PIX_FMT_SRGGB8 + { + .name = "Bayer RGGB 8bit", + .fourcc = V4L2_PIX_FMT_SRGGB8, + .depth = 8, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_SRGGB8 */ +#ifdef V4L2_PIX_FMT_SGRBG8 + { + .name = "Bayer GRBG 8bit", + .fourcc = V4L2_PIX_FMT_SGRBG8, + .depth = 8, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_SGRBG8 */ +#ifdef V4L2_PIX_FMT_SGBRG8 + { + .name = "Bayer GBRG 8bit", + .fourcc = V4L2_PIX_FMT_SGBRG8, + .depth = 8, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_SGBRG8 */ +#ifdef V4L2_PIX_FMT_SBGGR8 + { + .name = "Bayer BA81 8bit", + .fourcc = V4L2_PIX_FMT_SBGGR8, + .depth = 8, + .flags = 0, + }, +#endif /* V4L2_PIX_FMT_SBGGR8 */ + + /* here come the planar formats */ + { + .name = "4:1:0, planar, Y-Cr-Cb", + .fourcc = V4L2_PIX_FMT_YVU410, + .depth = 9, + .flags = FORMAT_FLAGS_PLANAR, + }, + { + .name = "4:2:0, planar, Y-Cr-Cb", + .fourcc = V4L2_PIX_FMT_YVU420, + .depth = 12, + .flags = FORMAT_FLAGS_PLANAR, + }, + { + .name = "4:1:0, planar, Y-Cb-Cr", + .fourcc = V4L2_PIX_FMT_YUV410, + .depth = 9, + .flags = FORMAT_FLAGS_PLANAR, + }, + { + .name = "4:2:0, planar, Y-Cb-Cr", + .fourcc = V4L2_PIX_FMT_YUV420, + .depth = 12, + .flags = FORMAT_FLAGS_PLANAR, + }, +#ifdef V4L2_PIX_FMT_YUV422P + { + .name = "16 bpp YVU422 planar", + .fourcc = V4L2_PIX_FMT_YUV422P, + .depth = 16, + .flags = FORMAT_FLAGS_PLANAR, + }, +#endif /* V4L2_PIX_FMT_YUV422P */ +#ifdef V4L2_PIX_FMT_YUV411P + { + .name = "16 bpp YVU411 planar", + .fourcc = V4L2_PIX_FMT_YUV411P, + .depth = 16, + .flags = FORMAT_FLAGS_PLANAR, + }, +#endif /* V4L2_PIX_FMT_YUV411P */ +#ifdef V4L2_PIX_FMT_Y41P + { + .name = "12 bpp YUV 4:1:1", + .fourcc = V4L2_PIX_FMT_Y41P, + .depth = 12, + .flags = FORMAT_FLAGS_PLANAR, + }, +#endif /* V4L2_PIX_FMT_Y41P */ +#ifdef V4L2_PIX_FMT_NV12 + { + .name = "12 bpp Y/CbCr 4:2:0 ", + .fourcc = V4L2_PIX_FMT_NV12, + .depth = 12, + .flags = FORMAT_FLAGS_PLANAR, + }, +#endif /* V4L2_PIX_FMT_NV12 */ + +/* here come the compressed formats */ + +#ifdef V4L2_PIX_FMT_MJPEG + { + .name = "Motion-JPEG", + .fourcc = V4L2_PIX_FMT_MJPEG, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_MJPEG */ +#ifdef V4L2_PIX_FMT_JPEG + { + .name = "JFIF JPEG", + .fourcc = V4L2_PIX_FMT_JPEG, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, 
+#endif /* V4L2_PIX_FMT_JPEG */ +#ifdef V4L2_PIX_FMT_DV + { + .name = "DV1394", + .fourcc = V4L2_PIX_FMT_DV, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_DV */ +#ifdef V4L2_PIX_FMT_MPEG + { + .name = "MPEG-1/2/4 Multiplexed", + .fourcc = V4L2_PIX_FMT_MPEG, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_MPEG */ +#ifdef V4L2_PIX_FMT_H264 + { + .name = "H264 with start codes", + .fourcc = V4L2_PIX_FMT_H264, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_H264 */ +#ifdef V4L2_PIX_FMT_H264_NO_SC + { + .name = "H264 without start codes", + .fourcc = V4L2_PIX_FMT_H264_NO_SC, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_H264_NO_SC */ +#ifdef V4L2_PIX_FMT_H264_MVC + { + .name = "H264 MVC", + .fourcc = V4L2_PIX_FMT_H264_MVC, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_H264_MVC */ +#ifdef V4L2_PIX_FMT_H263 + { + .name = "H263", + .fourcc = V4L2_PIX_FMT_H263, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_H263 */ +#ifdef V4L2_PIX_FMT_MPEG1 + { + .name = "MPEG-1 ES", + .fourcc = V4L2_PIX_FMT_MPEG1, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_MPEG1 */ +#ifdef V4L2_PIX_FMT_MPEG2 + { + .name = "MPEG-2 ES", + .fourcc = V4L2_PIX_FMT_MPEG2, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_MPEG2 */ +#ifdef V4L2_PIX_FMT_MPEG4 + { + .name = "MPEG-4 part 2 ES", + .fourcc = V4L2_PIX_FMT_MPEG4, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_MPEG4 */ +#ifdef V4L2_PIX_FMT_XVID + { + .name = "Xvid", + .fourcc = V4L2_PIX_FMT_XVID, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_XVID */ +#ifdef V4L2_PIX_FMT_VC1_ANNEX_G + { + .name = "SMPTE 421M Annex G compliant stream", + .fourcc = V4L2_PIX_FMT_VC1_ANNEX_G, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_VC1_ANNEX_G */ +#ifdef V4L2_PIX_FMT_VC1_ANNEX_L + { + .name = "SMPTE 421M Annex L compliant stream", + .fourcc = V4L2_PIX_FMT_VC1_ANNEX_L, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_VC1_ANNEX_L */ +#ifdef V4L2_PIX_FMT_VP8 + { + .name = "VP8", + .fourcc = V4L2_PIX_FMT_VP8, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_VP8 */ +#ifdef V4L2_PIX_FMT_VP9 + { + .name = "VP9", + .fourcc = V4L2_PIX_FMT_VP9, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_VP9 */ +#ifdef V4L2_PIX_FMT_HEVC + { + .name = "HEVC", + .fourcc = V4L2_PIX_FMT_HEVC, + .depth = 32, + .flags = FORMAT_FLAGS_COMPRESSED, + }, +#endif /* V4L2_PIX_FMT_HEVC */ +}; diff --git a/fs/exec.c b/fs/exec.c index e3e55d5e0be1..bba8fc44926f 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1011,6 +1011,7 @@ static int exec_mmap(struct mm_struct *mm) active_mm = tsk->active_mm; tsk->active_mm = mm; tsk->mm = mm; + lru_gen_add_mm(mm); /* * This prevents preemption while active_mm is being loaded and * it and mm are being updated, which could cause problems for @@ -1023,6 +1024,7 @@ static int exec_mmap(struct mm_struct *mm) activate_mm(active_mm, mm); if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM)) local_irq_enable(); + lru_gen_use_mm(mm); tsk->mm->vmacache_seqnum = 0; vmacache_flush(tsk); task_unlock(tsk); diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 0e537e580dc1..5d36015071d2 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -777,7 +777,8 @@ static int 
fuse_check_page(struct page *page) 1 << PG_active | 1 << PG_workingset | 1 << PG_reclaim | - 1 << PG_waiters))) { + 1 << PG_waiters | + LRU_GEN_MASK | LRU_REFS_MASK))) { dump_page(page, "fuse: trying to steal weird page"); return 1; } diff --git a/include/crypto/drbg.h b/include/crypto/drbg.h index af5ad51d3eef..b12ae9bdebf4 100644 --- a/include/crypto/drbg.h +++ b/include/crypto/drbg.h @@ -283,4 +283,11 @@ enum drbg_prefixes { DRBG_PREFIX3 }; +extern int drbg_alloc_state(struct drbg_state *drbg); +extern void drbg_dealloc_state(struct drbg_state *drbg); +extern void drbg_convert_tfm_core(const char *cra_driver_name, int *coreref, + bool *pr); +extern const struct drbg_core drbg_cores[]; +extern unsigned short drbg_sec_strength(drbg_flag_t flags); + #endif /* _DRBG_H */ diff --git a/crypto/jitterentropy.h b/include/crypto/internal/jitterentropy.h similarity index 100% rename from crypto/jitterentropy.h rename to include/crypto/internal/jitterentropy.h diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 0d1ada8968d7..1bc0cabf993f 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp) css_put(&cgrp->self); } +extern struct mutex cgroup_mutex; + +static inline void cgroup_lock(void) +{ + mutex_lock(&cgroup_mutex); +} + +static inline void cgroup_unlock(void) +{ + mutex_unlock(&cgroup_mutex); +} + /** * task_css_set_check - obtain a task's css_set with extra access conditions * @task: the task to obtain css_set for @@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp) * as locks used during the cgroup_subsys::attach() methods. */ #ifdef CONFIG_PROVE_RCU -extern struct mutex cgroup_mutex; extern spinlock_t css_set_lock; #define task_css_set_check(task, __c) \ rcu_dereference_check((task)->cgroups, \ @@ -708,6 +719,8 @@ struct cgroup; static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; } static inline void css_get(struct cgroup_subsys_state *css) {} static inline void css_put(struct cgroup_subsys_state *css) {} +static inline void cgroup_lock(void) {} +static inline void cgroup_unlock(void) {} static inline int cgroup_attach_task_all(struct task_struct *from, struct task_struct *t) { return 0; } static inline int cgroupstats_build(struct cgroupstats *stats, diff --git a/include/linux/ksm.h b/include/linux/ksm.h index 0630e545f4cb..a3234efb5064 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -19,6 +19,10 @@ struct stable_node; struct mem_cgroup; #ifdef CONFIG_KSM +int ksm_madvise_merge(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long *vm_flags); +int ksm_madvise_unmerge(struct vm_area_struct *vma, unsigned long start, + unsigned long end, unsigned long *vm_flags); int ksm_madvise(struct vm_area_struct *vma, unsigned long start, unsigned long end, int advice, unsigned long *vm_flags); int __ksm_enter(struct mm_struct *mm); diff --git a/include/linux/lrng.h b/include/linux/lrng.h new file mode 100644 index 000000000000..32873352c082 --- /dev/null +++ b/include/linux/lrng.h @@ -0,0 +1,150 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ +/* + * Copyright (C) 2022, Stephan Mueller + */ + +#ifndef _LRNG_H +#define _LRNG_H + +#include +#include +#include +#include + +/* + * struct lrng_drng_cb - cryptographic callback functions defining a DRNG + * @drng_name Name of DRNG + * @drng_alloc: Allocate DRNG -- the provided integer should be used for + * sanity checks. 
+ * return: allocated data structure or PTR_ERR on error + * @drng_dealloc: Deallocate DRNG + * @drng_seed: Seed the DRNG with data of arbitrary length + * drng: is pointer to data structure allocated with drng_alloc + * return: >= 0 on success, < 0 on error + * @drng_generate: Generate random numbers from the DRNG with arbitrary + * length + */ +struct lrng_drng_cb { + const char *(*drng_name)(void); + void *(*drng_alloc)(u32 sec_strength); + void (*drng_dealloc)(void *drng); + int (*drng_seed)(void *drng, const u8 *inbuf, u32 inbuflen); + int (*drng_generate)(void *drng, u8 *outbuf, u32 outbuflen); +}; + +/* + * struct lrng_hash_cb - cryptographic callback functions defining a hash + * @hash_name Name of hash used for reading the entropy pool + * @hash_alloc: Allocate the hash for reading the entropy pool + * return: allocated data structure (NULL is success too) + * or ERR_PTR on error + * @hash_dealloc: Deallocate Hash + * @hash_digestsize: Return the digestsize for the used hash to read out + * entropy pool + * hash: is pointer to data structure allocated with + * hash_alloc + * return: size of digest of hash in bytes + * @hash_init: Initialize hash + * hash: is pointer to data structure allocated with + * hash_alloc + * return: 0 on success, < 0 on error + * @hash_update: Update hash operation + * hash: is pointer to data structure allocated with + * hash_alloc + * return: 0 on success, < 0 on error + * @hash_final Final hash operation + * hash: is pointer to data structure allocated with + * hash_alloc + * return: 0 on success, < 0 on error + * @hash_desc_zero Zeroization of hash state buffer + * + * Assumptions: + * + * 1. Hash operation will not sleep + * 2. The hash's volatile state information is provided with *shash by caller. + */ +struct lrng_hash_cb { + const char *(*hash_name)(void); + void *(*hash_alloc)(void); + void (*hash_dealloc)(void *hash); + u32 (*hash_digestsize)(void *hash); + int (*hash_init)(struct shash_desc *shash, void *hash); + int (*hash_update)(struct shash_desc *shash, const u8 *inbuf, + u32 inbuflen); + int (*hash_final)(struct shash_desc *shash, u8 *digest); + void (*hash_desc_zero)(struct shash_desc *shash); +}; + +/* Register cryptographic backend */ +#ifdef CONFIG_LRNG_SWITCH +int lrng_set_drng_cb(const struct lrng_drng_cb *cb); +int lrng_set_hash_cb(const struct lrng_hash_cb *cb); +#else /* CONFIG_LRNG_SWITCH */ +static inline int +lrng_set_drng_cb(const struct lrng_drng_cb *cb) { return -EOPNOTSUPP; } +static inline int +lrng_set_hash_cb(const struct lrng_hash_cb *cb) { return -EOPNOTSUPP; } +#endif /* CONFIG_LRNG_SWITCH */ + +/* Callback to feed events to the scheduler entropy source */ +#ifdef CONFIG_LRNG_SCHED +extern void add_sched_randomness(const struct task_struct *p, int cpu); +#else +static inline void +add_sched_randomness(const struct task_struct *p, int cpu) { } +#endif + +/* + * lrng_get_random_bytes() - Provider of cryptographically strong random numbers + * for kernel-internal usage. + * + * This function is appropriate for in-kernel use cases operating in atomic + * contexts. It will always use the ChaCha20 DRNG, which may not be fully + * seeded yet when used. + * + * @buf: buffer to store the random bytes + * @nbytes: size of the buffer + */ +#ifdef CONFIG_LRNG_DRNG_ATOMIC +void lrng_get_random_bytes(void *buf, int nbytes); +#endif + +/* + * lrng_get_random_bytes_full() - Provider of cryptographically strong + * random numbers for kernel-internal usage from a fully initialized LRNG.
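+ * + * Illustrative call (a sketch, assuming CONFIG_LRNG=y and process context): + * + * u8 seed[32]; + * lrng_get_random_bytes_full(seed, sizeof(seed));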
+ * + * This function will always return random numbers from a fully seeded and + * fully initialized LRNG. + * + * This function is appropriate only for non-atomic use cases as this + * function may sleep. It provides access to the full functionality of LRNG + * including the switchable DRNG support, which may support other DRNGs such + * as the SP800-90A DRBG. + * + * @buf: buffer to store the random bytes + * @nbytes: size of the buffer + */ +#ifdef CONFIG_LRNG +void lrng_get_random_bytes_full(void *buf, int nbytes); +#endif + +/* + * lrng_get_random_bytes_min() - Provider of cryptographically strong + * random numbers for kernel-internal usage from at least a minimally seeded + * LRNG, which is not necessarily fully initialized yet (e.g., SP800-90C + * oversampling in FIPS mode may not have been applied yet). + * + * This function is appropriate only for non-atomic use cases as this + * function may sleep. It provides access to the full functionality of LRNG + * including the switchable DRNG support, which may support other DRNGs such + * as the SP800-90A DRBG. + * + * @buf: buffer to store the random bytes + * @nbytes: size of the buffer + */ +#ifdef CONFIG_LRNG +void lrng_get_random_bytes_min(void *buf, int nbytes); +#endif + +#endif /* _LRNG_H */ diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 89b14729d59f..cc16ba262464 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -344,6 +344,11 @@ struct mem_cgroup { struct deferred_split deferred_split_queue; #endif +#ifdef CONFIG_LRU_GEN + /* per-memcg mm_struct list */ + struct lru_gen_mm_list mm_list; +#endif + struct mem_cgroup_per_node *nodeinfo[]; }; @@ -438,6 +443,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio) * - LRU isolation * - lock_page_memcg() * - exclusive reference + * - mem_cgroup_trylock_pages() * * For a kmem folio a caller should hold an rcu read lock to protect memcg * associated with a kmem folio from being released. @@ -499,6 +505,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio) * - LRU isolation * - lock_page_memcg() * - exclusive reference + * - mem_cgroup_trylock_pages() * * For a kmem page a caller should hold an rcu read lock to protect memcg * associated with a kmem page from being released.
@@ -948,6 +955,23 @@ void unlock_page_memcg(struct page *page); void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val); +/* try to stabilize folio_memcg() for all the pages in a memcg */ +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg) +{ + rcu_read_lock(); + + if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account)) + return true; + + rcu_read_unlock(); + return false; +} + +static inline void mem_cgroup_unlock_pages(void) +{ + rcu_read_unlock(); +} + /* idx can be of type enum memcg_stat_item or node_stat_item */ static inline void mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) @@ -1386,6 +1410,18 @@ static inline void folio_memcg_unlock(struct folio *folio) { } +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg) +{ + /* to match folio_memcg_rcu() */ + rcu_read_lock(); + return true; +} + +static inline void mem_cgroup_unlock_pages(void) +{ + rcu_read_unlock(); +} + static inline void mem_cgroup_handle_over_high(void) { } diff --git a/include/linux/mm.h b/include/linux/mm.h index 9f44254af8ce..4e8ab4ad4473 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1060,6 +1060,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH) #define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH) #define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH) +#define LRU_GEN_PGOFF (KASAN_TAG_PGOFF - LRU_GEN_WIDTH) +#define LRU_REFS_PGOFF (LRU_GEN_PGOFF - LRU_REFS_WIDTH) /* * Define the bit shifts to access each section. For non-existent @@ -1521,6 +1523,11 @@ static inline unsigned long folio_pfn(struct folio *folio) return page_to_pfn(&folio->page); } +static inline struct folio *pfn_folio(unsigned long pfn) +{ + return page_folio(pfn_to_page(pfn)); +} + static inline atomic_t *folio_pincount_ptr(struct folio *folio) { return &folio_page(folio, 1)->compound_pincount; diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index ac32125745ab..d3d7b7bb297d 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -32,15 +32,25 @@ static inline int page_is_file_lru(struct page *page) return folio_is_file_lru(page_folio(page)); } -static __always_inline void update_lru_size(struct lruvec *lruvec, +static __always_inline void __update_lru_size(struct lruvec *lruvec, enum lru_list lru, enum zone_type zid, long nr_pages) { struct pglist_data *pgdat = lruvec_pgdat(lruvec); + lockdep_assert_held(&lruvec->lru_lock); + WARN_ON_ONCE(nr_pages != (int)nr_pages); + __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages); __mod_zone_page_state(&pgdat->node_zones[zid], NR_ZONE_LRU_BASE + lru, nr_pages); +} + +static __always_inline void update_lru_size(struct lruvec *lruvec, + enum lru_list lru, enum zone_type zid, + int nr_pages) +{ + __update_lru_size(lruvec, lru, zid, nr_pages); #ifdef CONFIG_MEMCG mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages); #endif @@ -92,11 +102,226 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio) return lru; } +#ifdef CONFIG_LRU_GEN + +static inline bool lru_gen_enabled(void) +{ +#ifdef CONFIG_LRU_GEN_ENABLED + DECLARE_STATIC_KEY_TRUE(lru_gen_caps[NR_LRU_GEN_CAPS]); + + return static_branch_likely(&lru_gen_caps[LRU_GEN_CORE]); +#else + DECLARE_STATIC_KEY_FALSE(lru_gen_caps[NR_LRU_GEN_CAPS]); + + return static_branch_unlikely(&lru_gen_caps[LRU_GEN_CORE]); +#endif +} + +static inline bool lru_gen_in_fault(void) +{ + return current->in_lru_fault; +} + +static inline int lru_gen_from_seq(unsigned long
seq) +{ + return seq % MAX_NR_GENS; +} + +static inline int lru_hist_from_seq(unsigned long seq) +{ + return seq % NR_HIST_GENS; +} + +static inline int lru_tier_from_refs(int refs) +{ + VM_WARN_ON_ONCE(refs > BIT(LRU_REFS_WIDTH)); + + /* see the comment in folio_lru_refs() */ + return order_base_2(refs + 1); +} + +static inline int folio_lru_refs(struct folio *folio) +{ + unsigned long flags = READ_ONCE(folio->flags); + bool workingset = flags & BIT(PG_workingset); + + /* + * Return the number of accesses beyond PG_referenced, i.e., N-1 if the + * total number of accesses is N>1, since N=0,1 both map to the first + * tier. lru_tier_from_refs() will account for this off-by-one. Also see + * the comment on MAX_NR_TIERS. + */ + return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + workingset; +} + +static inline int folio_lru_gen(struct folio *folio) +{ + unsigned long flags = READ_ONCE(folio->flags); + + return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; +} + +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen) +{ + unsigned long max_seq = lruvec->lrugen.max_seq; + + VM_WARN_ON_ONCE(gen >= MAX_NR_GENS); + + /* see the comment on MIN_NR_GENS */ + return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1); +} + +static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *folio, + int old_gen, int new_gen) +{ + int type = folio_is_file_lru(folio); + int zone = folio_zonenum(folio); + int delta = folio_nr_pages(folio); + enum lru_list lru = type * LRU_INACTIVE_FILE; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + VM_WARN_ON_ONCE(old_gen != -1 && old_gen >= MAX_NR_GENS); + VM_WARN_ON_ONCE(new_gen != -1 && new_gen >= MAX_NR_GENS); + VM_WARN_ON_ONCE(old_gen == -1 && new_gen == -1); + + if (old_gen >= 0) + WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone], + lrugen->nr_pages[old_gen][type][zone] - delta); + if (new_gen >= 0) + WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone], + lrugen->nr_pages[new_gen][type][zone] + delta); + + /* addition */ + if (old_gen < 0) { + if (lru_gen_is_active(lruvec, new_gen)) + lru += LRU_ACTIVE; + __update_lru_size(lruvec, lru, zone, delta); + return; + } + + /* deletion */ + if (new_gen < 0) { + if (lru_gen_is_active(lruvec, old_gen)) + lru += LRU_ACTIVE; + __update_lru_size(lruvec, lru, zone, -delta); + return; + } + + /* promotion */ + if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) { + __update_lru_size(lruvec, lru, zone, -delta); + __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta); + } + + /* demotion requires isolation, e.g., lru_deactivate_fn() */ + VM_WARN_ON_ONCE(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen)); +} + +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming) +{ + unsigned long mask, flags; + int gen = folio_lru_gen(folio); + int type = folio_is_file_lru(folio); + int zone = folio_zonenum(folio); + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + VM_WARN_ON_ONCE_FOLIO(gen != -1, folio); + + if (folio_test_unevictable(folio) || !lrugen->enabled) + return false; + /* + * There are three common cases for this page: + * 1. If it's hot, e.g., freshly faulted in or previously hot and + * migrated, add it to the youngest generation. + * 2. If it's cold but can't be evicted immediately, i.e., an anon page + * not in swapcache or a dirty page pending writeback, add it to the + * second oldest generation. + * 3. Everything else (clean, cold) is added to the oldest generation. 
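+ * + * For illustration (numbers assumed): with MAX_NR_GENS=4, max_seq=7 and + * min_seq[type]=4, case 1 lands in gen 7%4=3, case 2 in gen (4+1)%4=1 and + * case 3 in gen 4%4=0, per lru_gen_from_seq() above.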
+ */ + if (folio_test_active(folio)) + gen = lru_gen_from_seq(lrugen->max_seq); + else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) || + (folio_test_reclaim(folio) && + (folio_test_dirty(folio) || folio_test_writeback(folio)))) + gen = lru_gen_from_seq(lrugen->min_seq[type] + 1); + else + gen = lru_gen_from_seq(lrugen->min_seq[type]); + + /* see the comment on MIN_NR_GENS */ + mask = LRU_GEN_MASK | BIT(PG_active); + flags = (gen + 1UL) << LRU_GEN_PGOFF; + set_mask_bits(&folio->flags, mask, flags); + + lru_gen_update_size(lruvec, folio, -1, gen); + /* for folio_rotate_reclaimable() */ + if (reclaiming) + list_add_tail(&folio->lru, &lrugen->lists[gen][type][zone]); + else + list_add(&folio->lru, &lrugen->lists[gen][type][zone]); + + return true; +} + +static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming) +{ + unsigned long mask, flags; + int gen = folio_lru_gen(folio); + + if (gen < 0) + return false; + + VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio); + + mask = LRU_GEN_MASK; + flags = 0; + /* for shrink_page_list() or folio_migrate_flags() */ + if (reclaiming) + mask |= BIT(PG_referenced) | BIT(PG_reclaim); + else if (lru_gen_is_active(lruvec, gen)) + flags |= BIT(PG_active); + + flags = set_mask_bits(&folio->flags, mask, flags); + gen = ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; + + lru_gen_update_size(lruvec, folio, gen, -1); + list_del(&folio->lru); + + return true; +} + +#else + +static inline bool lru_gen_enabled(void) +{ + return false; +} + +static inline bool lru_gen_in_fault(void) +{ + return false; +} + +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming) +{ + return false; +} + +static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming) +{ + return false; +} + +#endif /* CONFIG_LRU_GEN */ + static __always_inline void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio) { enum lru_list lru = folio_lru_list(folio); + if (lru_gen_add_folio(lruvec, folio, false)) + return; + update_lru_size(lruvec, lru, folio_zonenum(folio), folio_nr_pages(folio)); if (lru != LRU_UNEVICTABLE) @@ -114,6 +339,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio) { enum lru_list lru = folio_lru_list(folio); + if (lru_gen_add_folio(lruvec, folio, true)) + return; + update_lru_size(lruvec, lru, folio_zonenum(folio), folio_nr_pages(folio)); /* This is not expected to be used on LRU_UNEVICTABLE */ @@ -131,6 +359,9 @@ void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio) { enum lru_list lru = folio_lru_list(folio); + if (lru_gen_del_folio(lruvec, folio, false)) + return; + if (lru != LRU_UNEVICTABLE) list_del(&folio->lru); update_lru_size(lruvec, lru, folio_zonenum(folio), diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 8834e38c06a4..ef46d5b3da9b 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -3,6 +3,7 @@ #define _LINUX_MM_TYPES_H #include +#include #include #include @@ -17,6 +18,7 @@ #include #include #include +#include #include @@ -655,6 +657,22 @@ struct mm_struct { #ifdef CONFIG_IOMMU_SVA u32 pasid; #endif +#ifdef CONFIG_LRU_GEN + struct { + /* this mm_struct is on lru_gen_mm_list */ + struct list_head list; +#ifdef CONFIG_MEMCG + /* points to the memcg of "owner" above */ + struct mem_cgroup *memcg; +#endif + /* + * Set when switching to this mm_struct, as a hint of + * whether it has been used 
since the last time per-node + * page table walkers cleared the corresponding bits. + */ + unsigned long bitmap; + } lru_gen; +#endif /* CONFIG_LRU_GEN */ } __randomize_layout; /* @@ -681,6 +699,65 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) return (struct cpumask *)&mm->cpu_bitmap; } +#ifdef CONFIG_LRU_GEN + +struct lru_gen_mm_list { + /* mm_struct list for page table walkers */ + struct list_head fifo; + /* protects the list above */ + spinlock_t lock; +}; + +void lru_gen_add_mm(struct mm_struct *mm); +void lru_gen_del_mm(struct mm_struct *mm); +#ifdef CONFIG_MEMCG +void lru_gen_migrate_mm(struct mm_struct *mm); +#endif + +static inline void lru_gen_init_mm(struct mm_struct *mm) +{ + INIT_LIST_HEAD(&mm->lru_gen.list); +#ifdef CONFIG_MEMCG + mm->lru_gen.memcg = NULL; +#endif + mm->lru_gen.bitmap = 0; +} + +static inline void lru_gen_use_mm(struct mm_struct *mm) +{ + /* unlikely but not a bug when racing with lru_gen_migrate_mm() */ + VM_WARN_ON_ONCE(list_empty(&mm->lru_gen.list)); + + if (!(current->flags & PF_KTHREAD)) + WRITE_ONCE(mm->lru_gen.bitmap, -1); +} + +#else /* !CONFIG_LRU_GEN */ + +static inline void lru_gen_add_mm(struct mm_struct *mm) +{ +} + +static inline void lru_gen_del_mm(struct mm_struct *mm) +{ +} + +#ifdef CONFIG_MEMCG +static inline void lru_gen_migrate_mm(struct mm_struct *mm) +{ +} +#endif + +static inline void lru_gen_init_mm(struct mm_struct *mm) +{ +} + +static inline void lru_gen_use_mm(struct mm_struct *mm) +{ +} + +#endif /* CONFIG_LRU_GEN */ + struct mmu_gather; extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm); extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 46ffab808f03..b2e8869be7cf 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -317,6 +317,211 @@ enum lruvec_flags { */ }; +#endif /* !__GENERATING_BOUNDS_H */ + +/* + * Evictable pages are divided into multiple generations. The youngest and the + * oldest generation numbers, max_seq and min_seq, are monotonically increasing. + * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An + * offset within MAX_NR_GENS, i.e., gen, indexes the LRU list of the + * corresponding generation. The gen counter in folio->flags stores gen+1 while + * a page is on one of lrugen->lists[]. Otherwise it stores 0. + * + * A page is added to the youngest generation on faulting. The aging needs to + * check the accessed bit at least twice before handing this page over to the + * eviction. The first check takes care of the accessed bit set on the initial + * fault; the second check makes sure this page hasn't been used since then. + * This process, AKA second chance, requires a minimum of two generations, + * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive + * LRU, e.g., /proc/vmstat, these two generations are considered active; the + * rest of generations, if they exist, are considered inactive. See + * lru_gen_is_active(). PG_active is always cleared while a page is on one of + * lrugen->lists[] so that the aging does not need to worry about it. And it's + * set again when a page considered active is isolated for non-reclaiming + * purposes, e.g., migration. See lru_gen_add_folio() and lru_gen_del_folio(). + * + * MAX_NR_GENS is set to 4 so that the multi-gen LRU can support twice the + * number of categories of the active/inactive LRU when keeping track of + * accesses through page tables.
It requires order_base_2(MAX_NR_GENS+1) bits in + * folio->flags (LRU_GEN_MASK). + */ +#define MIN_NR_GENS 2U +#define MAX_NR_GENS 4U + +/* + * Each generation is divided into multiple tiers. Tiers represent different + * ranges of numbers of accesses through file descriptors. A page accessed N + * times through file descriptors is in tier order_base_2(N). A page in the + * first tier (N=0,1) is marked by PG_referenced unless it was faulted in + * through page tables or read ahead. A page in any other tier (N>1) is marked + * by PG_referenced and PG_workingset. This implies a minimum of two tiers is + * supported without using additional bits in folio->flags. + * + * In contrast to moving across generations, which requires the LRU lock, + * moving across tiers only involves atomic operations on folio->flags and + * therefore has a negligible cost in the buffered access path. In the eviction + * path, comparisons of refaulted/(evicted+protected) from the first tier and + * the rest infer whether pages accessed multiple times through file + * descriptors are statistically hot and thus worth protecting. + * + * MAX_NR_TIERS is set to 4 so that the multi-gen LRU can support twice the + * number of categories of the active/inactive LRU when keeping track of + * accesses through file descriptors. It uses MAX_NR_TIERS-2 spare bits in + * folio->flags (LRU_REFS_MASK). + */ +#define MAX_NR_TIERS 4U + +#ifndef __GENERATING_BOUNDS_H + +struct lruvec; +struct page_vma_mapped_walk; + +#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF) +#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF) + +#ifdef CONFIG_LRU_GEN + +enum { + LRU_GEN_ANON, + LRU_GEN_FILE, +}; + +enum { + LRU_GEN_CORE, + LRU_GEN_MM_WALK, + LRU_GEN_NONLEAF_YOUNG, + NR_LRU_GEN_CAPS +}; + +#define MIN_LRU_BATCH BITS_PER_LONG +#define MAX_LRU_BATCH (MIN_LRU_BATCH * 128) + +/* whether to keep historical stats from evicted generations */ +#ifdef CONFIG_LRU_GEN_STATS +#define NR_HIST_GENS MAX_NR_GENS +#else +#define NR_HIST_GENS 1U +#endif + +/* + * The youngest generation number is stored in max_seq for both anon and file + * types as they are aged on an equal footing. The oldest generation numbers are + * stored in min_seq[] separately for anon and file types as clean file pages + * can be evicted regardless of swap constraints. + * + * Normally anon and file min_seq are in sync. But if swapping is constrained, + * e.g., out of swap space, file min_seq is allowed to advance and leave anon + * min_seq behind. + * + * nr_pages[] are eventually consistent and therefore can be transiently + * negative when reset_batch_size() is pending.
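+ * + * An illustrative snapshot (numbers assumed): with max_seq=7, anon min_seq=4 + * and file min_seq=6, anon pages span generations 4-7 while file pages span + * only 6-7, i.e., constrained swap has held anon min_seq back.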
+ */ +struct lru_gen_struct { + /* the aging increments the youngest generation number */ + unsigned long max_seq; + /* the eviction increments the oldest generation numbers */ + unsigned long min_seq[ANON_AND_FILE]; + /* the birth time of each generation in jiffies */ + unsigned long timestamps[MAX_NR_GENS]; + /* the multi-gen LRU lists */ + struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; + /* the sizes of the above lists */ + long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; + /* the exponential moving average of refaulted */ + unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS]; + /* the exponential moving average of evicted+protected */ + unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS]; + /* the first tier doesn't need protection, hence the minus one */ + unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1]; + /* can be modified without holding the LRU lock */ + atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; + atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; + /* whether the multi-gen LRU is enabled */ + bool enabled; +}; + +enum { + MM_PTE_TOTAL, /* total leaf entries */ + MM_PTE_OLD, /* old leaf entries */ + MM_PTE_YOUNG, /* young leaf entries */ + MM_PMD_TOTAL, /* total non-leaf entries */ + MM_PMD_FOUND, /* non-leaf entries found in Bloom filters */ + MM_PMD_ADDED, /* non-leaf entries added to Bloom filters */ + NR_MM_STATS +}; + +/* mnemonic codes for the mm stats above */ +#define MM_STAT_CODES "toydfa" + +/* double-buffering Bloom filters */ +#define NR_BLOOM_FILTERS 2 + +struct lru_gen_mm_state { + /* set to max_seq after each iteration */ + unsigned long seq; + /* where the current iteration starts (inclusive) */ + struct list_head *head; + /* where the last iteration ends (exclusive) */ + struct list_head *tail; + /* to wait for the last page table walker to finish */ + struct wait_queue_head wait; + /* Bloom filters flip after each iteration */ + unsigned long *filters[NR_BLOOM_FILTERS]; + /* the mm stats for debugging */ + unsigned long stats[NR_HIST_GENS][NR_MM_STATS]; + /* the number of concurrent page table walkers */ + int nr_walkers; +}; + +struct lru_gen_mm_walk { + /* the lruvec under reclaim */ + struct lruvec *lruvec; + /* unstable max_seq from lru_gen_struct */ + unsigned long max_seq; + /* the next address within an mm to scan */ + unsigned long next_addr; + /* to batch page table entries */ + unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)]; + /* to batch promoted pages */ + int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; + /* to batch the mm stats */ + int mm_stats[NR_MM_STATS]; + /* total batched items */ + int batched; + bool can_swap; + bool force_scan; +}; + +void lru_gen_init_lruvec(struct lruvec *lruvec); +void lru_gen_look_around(struct page_vma_mapped_walk *pvmw); + +#ifdef CONFIG_MEMCG +void lru_gen_init_memcg(struct mem_cgroup *memcg); +void lru_gen_exit_memcg(struct mem_cgroup *memcg); +#endif + +#else /* !CONFIG_LRU_GEN */ + +static inline void lru_gen_init_lruvec(struct lruvec *lruvec) +{ +} + +static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) +{ +} + +#ifdef CONFIG_MEMCG +static inline void lru_gen_init_memcg(struct mem_cgroup *memcg) +{ +} + +static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg) +{ +} +#endif + +#endif /* CONFIG_LRU_GEN */ + struct lruvec { struct list_head lists[NR_LRU_LISTS]; /* per lruvec lru_lock for memcg */ @@ -334,6 +539,12 @@ struct lruvec { unsigned long refaults[ANON_AND_FILE]; /* 
Various lruvec state flags (enum lruvec_flags) */ unsigned long flags; +#ifdef CONFIG_LRU_GEN + /* evictable pages divided into generations */ + struct lru_gen_struct lrugen; + /* to concurrently iterate lru_gen_mm_list */ + struct lru_gen_mm_state mm_state; +#endif #ifdef CONFIG_MEMCG struct pglist_data *pgdat; #endif @@ -926,6 +1137,11 @@ typedef struct pglist_data { unsigned long flags; +#ifdef CONFIG_LRU_GEN + /* kswapd mm walk data */ + struct lru_gen_mm_walk mm_walk; +#endif + ZONE_PADDING(_pad2_) /* Per-node vmstats */ diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h index 567c3ddba2c4..90840c459abc 100644 --- a/include/linux/nodemask.h +++ b/include/linux/nodemask.h @@ -486,6 +486,7 @@ static inline int num_node_state(enum node_states state) #define first_online_node 0 #define first_memory_node 0 #define next_online_node(nid) (MAX_NUMNODES) +#define next_memory_node(nid) (MAX_NUMNODES) #define nr_node_ids 1U #define nr_online_nodes 1U diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h index ef1e3e736e14..7d79818dc065 100644 --- a/include/linux/page-flags-layout.h +++ b/include/linux/page-flags-layout.h @@ -55,7 +55,8 @@ #define SECTIONS_WIDTH 0 #endif -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS +#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \ + <= BITS_PER_LONG - NR_PAGEFLAGS #define NODES_WIDTH NODES_SHIFT #elif defined(CONFIG_SPARSEMEM_VMEMMAP) #error "Vmemmap: No space for nodes field in page flags" @@ -89,8 +90,8 @@ #define LAST_CPUPID_SHIFT 0 #endif -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \ - <= BITS_PER_LONG - NR_PAGEFLAGS +#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \ + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT #else #define LAST_CPUPID_WIDTH 0 @@ -100,10 +101,15 @@ #define LAST_CPUPID_NOT_IN_PAGE_FLAGS #endif -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \ - > BITS_PER_LONG - NR_PAGEFLAGS +#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \ + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS #error "Not enough bits in page flags" #endif +/* see the comment on MAX_NR_TIERS */ +#define LRU_REFS_WIDTH min(__LRU_REFS_WIDTH, BITS_PER_LONG - NR_PAGEFLAGS - \ + ZONES_WIDTH - LRU_GEN_WIDTH - SECTIONS_WIDTH - \ + NODES_WIDTH - KASAN_TAG_WIDTH - LAST_CPUPID_WIDTH) + #endif #endif /* _LINUX_PAGE_FLAGS_LAYOUT */ diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 9d8eeaa67d05..5cbde013ce66 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -1017,7 +1017,7 @@ PAGEFLAG(Isolated, isolated, PF_ANY); 1UL << PG_private | 1UL << PG_private_2 | \ 1UL << PG_writeback | 1UL << PG_reserved | \ 1UL << PG_slab | 1UL << PG_active | \ - 1UL << PG_unevictable | __PG_MLOCKED) + 1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK) /* * Flags checked when a page is prepped for return by the page allocator. @@ -1028,7 +1028,7 @@ PAGEFLAG(Isolated, isolated, PF_ANY); * alloc-free cycle to prevent from reusing the page.
*/ #define PAGE_FLAGS_CHECK_AT_PREP \ - (PAGEFLAGS_MASK & ~__PG_HWPOISON) + ((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK) #define PAGE_FLAGS_PRIVATE \ (1UL << PG_private | 1UL << PG_private_2) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index f4f4077b97aa..743e7fc4afda 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -212,7 +212,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, #endif #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) @@ -233,7 +233,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, BUILD_BUG(); return 0; } -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */ #endif #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH @@ -259,6 +259,19 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma, #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif +#ifndef arch_has_hw_pte_young +/* + * Return whether the accessed bit is supported on the local CPU. + * + * This stub assumes accessing through an old PTE triggers a page fault. + * Architectures that automatically set the access bit should override it. + */ +static inline bool arch_has_hw_pte_young(void) +{ + return false; +} +#endif + #ifndef __HAVE_ARCH_PTEP_CLEAR static inline void ptep_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep) diff --git a/include/linux/sched.h b/include/linux/sched.h index a8911b1f35aa..448e75a5acc5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -914,6 +914,10 @@ struct task_struct { #ifdef CONFIG_MEMCG unsigned in_user_fault:1; #endif +#ifdef CONFIG_LRU_GEN + /* whether the LRU algorithm may apply to this access */ + unsigned in_lru_fault:1; +#endif #ifdef CONFIG_COMPAT_BRK unsigned brk_randomized:1; #endif diff --git a/include/linux/swap.h b/include/linux/swap.h index 27093b477c5f..7bdd7bcb135d 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -137,6 +137,10 @@ union swap_header { */ struct reclaim_state { unsigned long reclaimed_slab; +#ifdef CONFIG_LRU_GEN + /* per-thread mm walk data */ + struct lru_gen_mm_walk *mm_walk; +#endif }; #ifdef __KERNEL__ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index a34b0f9a9972..82afad91de9f 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -917,6 +917,7 @@ asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior); asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec, size_t vlen, int behavior, unsigned int flags); asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags); +asmlinkage long sys_pmadv_ksm(int pidfd, int behavior, unsigned int flags); asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, unsigned long prot, unsigned long pgoff, unsigned long flags); diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 1168302b7927..138c1e6f161a 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -225,7 +225,8 @@ struct tcp_sock { u8 compressed_ack; u8 dup_ack_counter:2, tlp_retrans:1, /* TLP is a retransmission */ - unused:5; + fast_ack_mode:2, /* which fast ack mode? 
*/ + unused:3; u32 chrono_start; /* Start time in jiffies of a TCP chrono */ u32 chrono_stat[3]; /* Time in jiffies for chrono_stat stats */ u8 chrono_type:2, /* current chronograph type */ diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index 33a4240e6a6f..82213f9c4c17 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -139,6 +139,8 @@ static inline void set_rlimit_ucount_max(struct user_namespace *ns, #ifdef CONFIG_USER_NS +extern int unprivileged_userns_clone; + static inline struct user_namespace *get_user_ns(struct user_namespace *ns) { if (ns) @@ -172,6 +174,8 @@ extern bool current_in_userns(const struct user_namespace *target_ns); struct ns_common *ns_get_owner(struct ns_common *ns); #else +#define unprivileged_userns_clone 0 + static inline struct user_namespace *get_user_ns(struct user_namespace *ns) { return &init_user_ns; diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 3908296d103f..41ed0ac26bd3 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -134,7 +134,8 @@ struct inet_connection_sock { u32 icsk_probes_tstamp; u32 icsk_user_timeout; - u64 icsk_ca_priv[104 / sizeof(u64)]; +/* XXX inflated by temporary internal debugging info */ + u64 icsk_ca_priv[216 / sizeof(u64)]; #define ICSK_CA_PRIV_SIZE sizeof_field(struct inet_connection_sock, icsk_ca_priv) }; diff --git a/include/net/tcp.h b/include/net/tcp.h index cc1295037533..d63369ab0209 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -370,6 +370,7 @@ static inline void tcp_dec_quickack_mode(struct sock *sk, #define TCP_ECN_QUEUE_CWR 2 #define TCP_ECN_DEMAND_CWR 4 #define TCP_ECN_SEEN 8 +#define TCP_ECN_ECT_PERMANENT 16 enum tcp_tw_status { TCP_TW_SUCCESS = 0, @@ -810,6 +811,11 @@ static inline u32 tcp_stamp_us_delta(u64 t1, u64 t0) return max_t(s64, t1 - t0, 0); } +static inline u32 tcp_stamp32_us_delta(u32 t1, u32 t0) +{ + return max_t(s32, t1 - t0, 0); +} + static inline u32 tcp_skb_timestamp(const struct sk_buff *skb) { return tcp_ns_to_ts(skb->skb_mstamp_ns); @@ -885,9 +891,14 @@ struct tcp_skb_cb { /* pkts S/ACKed so far upon tx of skb, incl retrans: */ __u32 delivered; /* start of send pipeline phase */ - u64 first_tx_mstamp; + u32 first_tx_mstamp; /* when we reached the "delivered" count */ - u64 delivered_mstamp; + u32 delivered_mstamp; +#define TCPCB_IN_FLIGHT_BITS 20 +#define TCPCB_IN_FLIGHT_MAX ((1U << TCPCB_IN_FLIGHT_BITS) - 1) + u32 in_flight:20, /* packets in flight at transmit */ + unused2:12; + u32 lost; /* packets lost so far upon tx of skb */ } tx; /* only used for outgoing skbs */ union { struct inet_skb_parm h4; @@ -1013,7 +1024,11 @@ enum tcp_ca_ack_event_flags { #define TCP_CONG_NON_RESTRICTED 0x1 /* Requires ECN/ECT set on all packets */ #define TCP_CONG_NEEDS_ECN 0x2 -#define TCP_CONG_MASK (TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN) +/* Wants notification of CE events (CA_EVENT_ECN_IS_CE, CA_EVENT_ECN_NO_CE). 
*/ +#define TCP_CONG_WANTS_CE_EVENTS 0x4 +#define TCP_CONG_MASK (TCP_CONG_NON_RESTRICTED | \ + TCP_CONG_NEEDS_ECN | \ + TCP_CONG_WANTS_CE_EVENTS) union tcp_cc_info; @@ -1033,8 +1048,11 @@ struct ack_sample { */ struct rate_sample { u64 prior_mstamp; /* starting timestamp for interval */ + u32 prior_lost; /* tp->lost at "prior_mstamp" */ u32 prior_delivered; /* tp->delivered at "prior_mstamp" */ u32 prior_delivered_ce;/* tp->delivered_ce at "prior_mstamp" */ + u32 tx_in_flight; /* packets in flight at starting timestamp */ + s32 lost; /* number of packets lost over interval */ s32 delivered; /* number of packets delivered over interval */ s32 delivered_ce; /* number of packets delivered w/ CE marks*/ long interval_us; /* time for tp->delivered to incr "delivered" */ @@ -1048,6 +1066,7 @@ struct rate_sample { bool is_app_limited; /* is sample from packet with bubble in pipe? */ bool is_retrans; /* is sample from retransmission? */ bool is_ack_delayed; /* is this (likely) a delayed ACK? */ + bool is_ece; /* did this ACK have ECN marked? */ }; struct tcp_congestion_ops { @@ -1071,8 +1090,11 @@ struct tcp_congestion_ops { /* hook for packet ack accounting (optional) */ void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample); - /* override sysctl_tcp_min_tso_segs */ - u32 (*min_tso_segs)(struct sock *sk); + /* pick target number of segments per TSO/GSO skb (optional): */ + u32 (*tso_segs)(struct sock *sk, unsigned int mss_now); + + /* react to a specific lost skb (optional) */ + void (*skb_marked_lost)(struct sock *sk, const struct sk_buff *skb); /* call when packets are delivered to update cwnd and pacing rate, * after all the ca_state processing. (optional) @@ -1135,6 +1157,14 @@ static inline char *tcp_ca_get_name_by_key(u32 key, char *buffer) } #endif +static inline bool tcp_ca_wants_ce_events(const struct sock *sk) +{ + const struct inet_connection_sock *icsk = inet_csk(sk); + + return icsk->icsk_ca_ops->flags & (TCP_CONG_NEEDS_ECN | + TCP_CONG_WANTS_CE_EVENTS); +} + static inline bool tcp_ca_needs_ecn(const struct sock *sk) { const struct inet_connection_sock *icsk = inet_csk(sk); @@ -1160,6 +1190,7 @@ static inline void tcp_ca_event(struct sock *sk, const enum tcp_ca_event event) } /* From tcp_rate.c */ +void tcp_set_tx_in_flight(struct sock *sk, struct sk_buff *skb); void tcp_rate_skb_sent(struct sock *sk, struct sk_buff *skb); void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb, struct rate_sample *rs); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 1c48b0ae3ba3..c780129ab742 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) #define __NR_set_mempolicy_home_node 450 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) +#define __NR_pmadv_ksm 451 +__SYSCALL(__NR_pmadv_ksm, sys_pmadv_ksm) + #undef __NR_syscalls -#define __NR_syscalls 451 +#define __NR_syscalls 452 /* * 32 bit systems traditionally used different diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h index 20ee93f0f876..96d52dd9c48a 100644 --- a/include/uapi/linux/inet_diag.h +++ b/include/uapi/linux/inet_diag.h @@ -231,9 +231,42 @@ struct tcp_bbr_info { __u32 bbr_cwnd_gain; /* cwnd gain shifted left 8 bits */ }; +/* Phase as reported in netlink/ss stats. 
*/ +enum tcp_bbr2_phase { + BBR2_PHASE_INVALID = 0, + BBR2_PHASE_STARTUP = 1, + BBR2_PHASE_DRAIN = 2, + BBR2_PHASE_PROBE_RTT = 3, + BBR2_PHASE_PROBE_BW_UP = 4, + BBR2_PHASE_PROBE_BW_DOWN = 5, + BBR2_PHASE_PROBE_BW_CRUISE = 6, + BBR2_PHASE_PROBE_BW_REFILL = 7 +}; + +struct tcp_bbr2_info { + /* u64 bw: bandwidth (app throughput) estimate in Byte per sec: */ + __u32 bbr_bw_lsb; /* lower 32 bits of bw */ + __u32 bbr_bw_msb; /* upper 32 bits of bw */ + __u32 bbr_min_rtt; /* min-filtered RTT in uSec */ + __u32 bbr_pacing_gain; /* pacing gain shifted left 8 bits */ + __u32 bbr_cwnd_gain; /* cwnd gain shifted left 8 bits */ + __u32 bbr_bw_hi_lsb; /* lower 32 bits of bw_hi */ + __u32 bbr_bw_hi_msb; /* upper 32 bits of bw_hi */ + __u32 bbr_bw_lo_lsb; /* lower 32 bits of bw_lo */ + __u32 bbr_bw_lo_msb; /* upper 32 bits of bw_lo */ + __u8 bbr_mode; /* current bbr_mode in state machine */ + __u8 bbr_phase; /* current state machine phase */ + __u8 unused1; /* alignment padding; not used yet */ + __u8 bbr_version; /* MUST be at this offset in struct */ + __u32 bbr_inflight_lo; /* lower/short-term data volume bound */ + __u32 bbr_inflight_hi; /* higher/long-term data volume bound */ + __u32 bbr_extra_acked; /* max excess packets ACKed in epoch */ +}; + union tcp_cc_info { struct tcpvegas_info vegas; struct tcp_dctcp_info dctcp; struct tcp_bbr_info bbr; + struct tcp_bbr2_info bbr2; }; #endif /* _UAPI_INET_DIAG_H_ */ diff --git a/init/Kconfig b/init/Kconfig index ddcbefe535e9..4fa2a49b23fa 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1235,6 +1235,22 @@ config USER_NS If unsure, say N. +config USER_NS_UNPRIVILEGED + bool "Allow unprivileged users to create namespaces" + default y + depends on USER_NS + help + When disabled, unprivileged users will not be able to create + new namespaces. Allowing users to create their own namespaces + has been part of several recent local privilege escalation + exploits, so if you need user namespaces but are + paranoid^Wsecurity-conscious you want to disable this. + + This setting can be overridden at runtime via the + kernel.unprivileged_userns_clone sysctl. + + If unsure, say Y. + config PID_NS bool "PID Namespaces" default y @@ -1374,7 +1390,6 @@ config CC_OPTIMIZE_FOR_PERFORMANCE config CC_OPTIMIZE_FOR_PERFORMANCE_O3 bool "Optimize more for performance (-O3)" - depends on ARC help Choosing this option will pass "-O3" to your compiler to optimize the kernel yet more for performance. 
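A minimal userspace sketch (not part of the patch) to observe the new knob: with kernel.unprivileged_userns_clone set to 0, the copy_process() and ksys_unshare() checks added below make unshare(CLONE_NEWUSER) fail with EPERM unless the caller has CAP_SYS_ADMIN. The file name userns_check.c is hypothetical.

    #define _GNU_SOURCE
    #include <errno.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            /* build with: cc -o userns_check userns_check.c; run unprivileged */
            if (unshare(CLONE_NEWUSER) == 0)
                    puts("user namespace created: knob is 1 or caller has CAP_SYS_ADMIN");
            else
                    printf("unshare(CLONE_NEWUSER) failed: %s\n", strerror(errno));
            return 0;
    }

The knob itself is toggled at runtime with sysctl -w kernel.unprivileged_userns_clone={0,1}; its default comes from CONFIG_USER_NS_UNPRIVILEGED above.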
diff --git a/kernel/bounds.c b/kernel/bounds.c index 9795d75b09b2..b529182e8b04 100644 --- a/kernel/bounds.c +++ b/kernel/bounds.c @@ -22,6 +22,13 @@ int main(void) DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS)); #endif DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t)); +#ifdef CONFIG_LRU_GEN + DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1)); + DEFINE(__LRU_REFS_WIDTH, MAX_NR_TIERS - 2); +#else + DEFINE(LRU_GEN_WIDTH, 0); + DEFINE(__LRU_REFS_WIDTH, 0); +#endif /* End of constants */ return 0; diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index 6e36e854b512..929ed3bf1a7c 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -165,7 +165,6 @@ struct cgroup_mgctx { #define DEFINE_CGROUP_MGCTX(name) \ struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name) -extern struct mutex cgroup_mutex; extern spinlock_t css_set_lock; extern struct cgroup_subsys *cgroup_subsys[]; extern struct list_head cgroup_roots; diff --git a/kernel/exit.c b/kernel/exit.c index f072959fcab7..f2d4d48ea790 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -466,6 +466,7 @@ void mm_update_next_owner(struct mm_struct *mm) goto retry; } WRITE_ONCE(mm->owner, c); + lru_gen_migrate_mm(mm); task_unlock(c); put_task_struct(c); } diff --git a/kernel/fork.c b/kernel/fork.c index 35a3beff140b..cf07934f56da 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -99,6 +99,10 @@ #include #include +#ifdef CONFIG_USER_NS +#include <linux/user_namespace.h> +#endif + #include #include #include @@ -1149,6 +1153,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, goto fail_nocontext; mm->user_ns = get_user_ns(user_ns); + lru_gen_init_mm(mm); return mm; fail_nocontext: @@ -1191,6 +1196,7 @@ static inline void __mmput(struct mm_struct *mm) } if (mm->binfmt) module_put(mm->binfmt->module); + lru_gen_del_mm(mm); mmdrop(mm); } @@ -1992,6 +1998,10 @@ static __latent_entropy struct task_struct *copy_process( if ((clone_flags & (CLONE_NEWUSER|CLONE_FS)) == (CLONE_NEWUSER|CLONE_FS)) return ERR_PTR(-EINVAL); + if ((clone_flags & CLONE_NEWUSER) && !unprivileged_userns_clone) + if (!capable(CAP_SYS_ADMIN)) + return ERR_PTR(-EPERM); + /* * Thread groups must share signals as well, and detached threads * can only be started up within the thread group. @@ -2660,6 +2670,13 @@ pid_t kernel_clone(struct kernel_clone_args *args) get_task_struct(p); } + if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) { + /* lock the task to synchronize with memcg migration */ + task_lock(p); + lru_gen_add_mm(p->mm); + task_unlock(p); + } + wake_up_new_task(p); /* forking complete and child started to run, tell ptracer */ @@ -3110,6 +3127,12 @@ int ksys_unshare(unsigned long unshare_flags) if (unshare_flags & CLONE_NEWNS) unshare_flags |= CLONE_FS; + if ((unshare_flags & CLONE_NEWUSER) && !unprivileged_userns_clone) { + err = -EPERM; + if (!capable(CAP_SYS_ADMIN)) + goto bad_unshare_out; + } + err = check_unshare_flags(unshare_flags); if (err) goto bad_unshare_out; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d58c0389eb23..407eb7dfa477 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6,6 +6,7 @@ * * Copyright (C) 1991-2002 Linus Torvalds */ +#include #include #include #include @@ -3583,6 +3584,8 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags) { struct rq *rq; + add_sched_randomness(p, cpu); + if (!schedstat_enabled()) return; @@ -5057,6 +5060,7 @@ context_switch(struct rq *rq, struct task_struct *prev, * finish_task_switch()'s mmdrop(). 
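+ * (note: lru_gen_use_mm() below flags next->mm as recently used, so the
+ * multi-gen LRU page table walkers will not skip it; see should_skip_mm()
+ * in mm/vmscan.c)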
*/ switch_mm_irqs_off(prev->active_mm, next->mm, next); + lru_gen_use_mm(next->mm); if (!prev->mm) { // from kernel /* will mmdrop() in finish_task_switch(). */ diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index a492f159624f..dc765f3eff16 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -291,6 +291,7 @@ COND_SYSCALL(mincore); COND_SYSCALL(madvise); COND_SYSCALL(process_madvise); COND_SYSCALL(process_mrelease); +COND_SYSCALL(pmadv_ksm); COND_SYSCALL(remap_file_pages); COND_SYSCALL(mbind); COND_SYSCALL(get_mempolicy); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 830aaf8ca08e..af4c0806bd8e 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -91,6 +91,9 @@ #if defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_LOCK_STAT) #include #endif +#ifdef CONFIG_USER_NS +#include <linux/user_namespace.h> +#endif #if defined(CONFIG_SYSCTL) @@ -1803,6 +1806,15 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, +#ifdef CONFIG_USER_NS + { + .procname = "unprivileged_userns_clone", + .data = &unprivileged_userns_clone, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, +#endif #ifdef CONFIG_PROC_SYSCTL { .procname = "tainted", diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 5481ba44a8d6..423ab2563ad7 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -21,6 +21,13 @@ #include #include +/* sysctl */ +#ifdef CONFIG_USER_NS_UNPRIVILEGED +int unprivileged_userns_clone = 1; +#else +int unprivileged_userns_clone; +#endif + static struct kmem_cache *user_ns_cachep __read_mostly; static DEFINE_MUTEX(userns_state_mutex); diff --git a/mm/Kconfig b/mm/Kconfig index 034d87953600..05291697055a 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -909,6 +909,32 @@ config ANON_VMA_NAME area from being merged with adjacent virtual memory areas due to the difference in their name. +# multi-gen LRU { +config LRU_GEN + bool "Multi-Gen LRU" + depends on MMU + # make sure folio->flags has enough spare bits + depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP + help + A high performance LRU implementation to overcommit memory. See + Documentation/admin-guide/mm/multigen_lru.rst for details. + +config LRU_GEN_ENABLED + bool "Enable by default" + depends on LRU_GEN + help + This option enables the multi-gen LRU by default. + +config LRU_GEN_STATS + bool "Full stats for debugging" + depends on LRU_GEN + help + Do not enable this option unless you plan to look at historical stats + from evicted generations for debugging purposes. + + This option has a per-memcg and per-node memory overhead. 
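For orientation, the LRU_GEN_WIDTH and __LRU_REFS_WIDTH constants generated by the kernel/bounds.c hunk earlier are what the "spare bits" dependency above protects: the generation and tier counters live in folio->flags. A standalone sketch of the arithmetic, assuming MAX_NR_GENS = 4 and MAX_NR_TIERS = 4 (their usual values in this series):

    #include <stdio.h>

    /* mirrors include/linux/log2.h order_base_2() for small positive n */
    static int order_base_2(int n)
    {
            int order = 0;

            while ((1 << order) < n)
                    order++;
            return order;
    }

    int main(void)
    {
            int max_nr_gens = 4;    /* assumption: MAX_NR_GENS */
            int max_nr_tiers = 4;   /* assumption: MAX_NR_TIERS */

            /* 3 bits: generation numbers 0..4 need ceil(log2(5)) */
            printf("LRU_GEN_WIDTH = %d\n", order_base_2(max_nr_gens + 1));
            /* 2 bits: tiers beyond PG_referenced/PG_workingset */
            printf("__LRU_REFS_WIDTH = %d\n", max_nr_tiers - 2);
            return 0;
    }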
+# } + source "mm/damon/Kconfig" endmenu diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 910a138e9859..a090514f2bf3 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2320,7 +2320,8 @@ static void __split_huge_page_tail(struct page *head, int tail, #ifdef CONFIG_64BIT (1L << PG_arch_2) | #endif - (1L << PG_dirty))); + (1L << PG_dirty) | + LRU_GEN_MASK | LRU_REFS_MASK)); /* ->mapping in first tail page is compound_mapcount */ VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, diff --git a/mm/internal.h b/mm/internal.h index cf16280ce132..59d2422b647d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -68,6 +68,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf); void folio_rotate_reclaimable(struct folio *folio); bool __folio_end_writeback(struct folio *folio); void deactivate_file_folio(struct folio *folio); +void folio_activate(struct folio *folio); void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, unsigned long floor, unsigned long ceiling); diff --git a/mm/ksm.c b/mm/ksm.c index 063a48eeb5ee..c4fe9f20b11d 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -2419,54 +2419,78 @@ static int ksm_scan_thread(void *nothing) return 0; } -int ksm_madvise(struct vm_area_struct *vma, unsigned long start, - unsigned long end, int advice, unsigned long *vm_flags) +int ksm_madvise_merge(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long *vm_flags) { - struct mm_struct *mm = vma->vm_mm; int err; - switch (advice) { - case MADV_MERGEABLE: - /* - * Be somewhat over-protective for now! - */ - if (*vm_flags & (VM_MERGEABLE | VM_SHARED | VM_MAYSHARE | - VM_PFNMAP | VM_IO | VM_DONTEXPAND | - VM_HUGETLB | VM_MIXEDMAP)) - return 0; /* just ignore the advice */ + /* + * Be somewhat over-protective for now! + */ + if (*vm_flags & (VM_MERGEABLE | VM_SHARED | VM_MAYSHARE | + VM_PFNMAP | VM_IO | VM_DONTEXPAND | + VM_HUGETLB | VM_MIXEDMAP)) + return 0; /* just ignore the advice */ - if (vma_is_dax(vma)) - return 0; + if (vma_is_dax(vma)) + return 0; #ifdef VM_SAO if (*vm_flags & VM_SAO) return 0; #endif #ifdef VM_SPARC_ADI - if (*vm_flags & VM_SPARC_ADI) - return 0; + if (*vm_flags & VM_SPARC_ADI) + return 0; #endif - if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) { - err = __ksm_enter(mm); - if (err) - return err; - } + if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) { + err = __ksm_enter(mm); + if (err) + return err; + } - *vm_flags |= VM_MERGEABLE; - break; + *vm_flags |= VM_MERGEABLE; - case MADV_UNMERGEABLE: - if (!(*vm_flags & VM_MERGEABLE)) - return 0; /* just ignore the advice */ + return 0; +} - if (vma->anon_vma) { - err = unmerge_ksm_pages(vma, start, end); - if (err) - return err; - } +int ksm_madvise_unmerge(struct vm_area_struct *vma, unsigned long start, + unsigned long end, unsigned long *vm_flags) +{ + int err; + + if (!(*vm_flags & VM_MERGEABLE)) + return 0; /* just ignore the advice */ + + if (vma->anon_vma) { + err = unmerge_ksm_pages(vma, start, end); + if (err) + return err; + } - *vm_flags &= ~VM_MERGEABLE; + *vm_flags &= ~VM_MERGEABLE; + + return 0; +} + +int ksm_madvise(struct vm_area_struct *vma, unsigned long start, + unsigned long end, int advice, unsigned long *vm_flags) +{ + struct mm_struct *mm = vma->vm_mm; + int err; + + switch (advice) { + case MADV_MERGEABLE: + err = ksm_madvise_merge(mm, vma, vm_flags); + if (err) + return err; + break; + + case MADV_UNMERGEABLE: + err = ksm_madvise_unmerge(vma, start, end, vm_flags); + if (err) + return err; break; } diff --git a/mm/madvise.c b/mm/madvise.c index 1873616a37d2..a356b9f3b4c5 100644 
--- a/mm/madvise.c +++ b/mm/madvise.c @@ -1482,3 +1482,114 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, out: return ret; } + +SYSCALL_DEFINE3(pmadv_ksm, int, pidfd, int, behaviour, unsigned int, flags) +{ +#ifdef CONFIG_KSM + ssize_t ret; + struct pid *pid; + struct task_struct *task; + struct mm_struct *mm; + unsigned int f_flags; + struct vm_area_struct *vma; + + if (flags != 0) { + ret = -EINVAL; + goto out; + } + + switch (behaviour) { + case MADV_MERGEABLE: + case MADV_UNMERGEABLE: + break; + default: + ret = -EINVAL; + goto out; + break; + } + + pid = pidfd_get_pid(pidfd, &f_flags); + if (IS_ERR(pid)) { + ret = PTR_ERR(pid); + goto out; + } + + task = get_pid_task(pid, PIDTYPE_PID); + if (!task) { + ret = -ESRCH; + goto put_pid; + } + + /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */ + mm = mm_access(task, PTRACE_MODE_READ_FSCREDS); + if (IS_ERR_OR_NULL(mm)) { + ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; + goto release_task; + } + + /* Require CAP_SYS_NICE for influencing process performance. */ + if (!capable(CAP_SYS_NICE)) { + ret = -EPERM; + goto release_mm; + } + + if (mmap_write_lock_killable(mm)) { + ret = -EINTR; + goto release_mm; + } + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + switch (behaviour) { + case MADV_MERGEABLE: + ret = ksm_madvise_merge(vma->vm_mm, vma, &vma->vm_flags); + break; + case MADV_UNMERGEABLE: + ret = ksm_madvise_unmerge(vma, vma->vm_start, vma->vm_end, &vma->vm_flags); + break; + default: + /* look, ma, no brain */ + break; + } + if (ret) + break; + } + + mmap_write_unlock(mm); + +release_mm: + mmput(mm); +release_task: + put_task_struct(task); +put_pid: + put_pid(pid); +out: + return ret; +#else /* CONFIG_KSM */ + return -ENOSYS; +#endif /* CONFIG_KSM */ +} + +#ifdef CONFIG_KSM +static ssize_t ksm_show(struct kobject *kobj, struct kobj_attribute *attr, + char *buf) +{ + return sprintf(buf, "%u\n", __NR_pmadv_ksm); +} +static struct kobj_attribute pmadv_ksm_attr = __ATTR_RO(ksm); + +static struct attribute *pmadv_sysfs_attrs[] = { + &pmadv_ksm_attr.attr, + NULL, +}; + +static const struct attribute_group pmadv_sysfs_attr_group = { + .attrs = pmadv_sysfs_attrs, + .name = "pmadv", +}; + +static int __init pmadv_sysfs_init(void) +{ + return sysfs_create_group(kernel_kobj, &pmadv_sysfs_attr_group); +} +subsys_initcall(pmadv_sysfs_init); +#endif /* CONFIG_KSM */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 598fece89e2b..ff2b0a09e6c8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2769,6 +2769,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg) * - LRU isolation * - lock_page_memcg() * - exclusive reference + * - mem_cgroup_trylock_pages() */ folio->memcg_data = (unsigned long)memcg; } @@ -5072,6 +5073,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) static void mem_cgroup_free(struct mem_cgroup *memcg) { + lru_gen_exit_memcg(memcg); memcg_wb_domain_exit(memcg); __mem_cgroup_free(memcg); } @@ -5130,6 +5132,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void) memcg->deferred_split_queue.split_queue_len = 0; #endif idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); + lru_gen_init_memcg(memcg); return memcg; fail: mem_cgroup_id_remove(memcg); @@ -6090,6 +6093,30 @@ static void mem_cgroup_move_task(void) } #endif +#ifdef CONFIG_LRU_GEN +static void mem_cgroup_attach(struct cgroup_taskset *tset) +{ + struct task_struct *task; + struct cgroup_subsys_state *css; + + /* find the first leader if there is any */ + cgroup_taskset_for_each_leader(task, css, 
tset) + break; + + if (!task) + return; + + task_lock(task); + if (task->mm && task->mm->owner == task) + lru_gen_migrate_mm(task->mm); + task_unlock(task); +} +#else +static void mem_cgroup_attach(struct cgroup_taskset *tset) +{ +} +#endif /* CONFIG_LRU_GEN */ + static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value) { if (value == PAGE_COUNTER_MAX) @@ -6435,6 +6462,7 @@ struct cgroup_subsys memory_cgrp_subsys = { .css_reset = mem_cgroup_css_reset, .css_rstat_flush = mem_cgroup_css_rstat_flush, .can_attach = mem_cgroup_can_attach, + .attach = mem_cgroup_attach, .cancel_attach = mem_cgroup_cancel_attach, .post_attach = mem_cgroup_move_task, .dfl_cftypes = memory_files, diff --git a/mm/memory.c b/mm/memory.c index 76e3af9639d9..6df27b84c5aa 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -122,18 +122,6 @@ int randomize_va_space __read_mostly = 2; #endif -#ifndef arch_faults_on_old_pte -static inline bool arch_faults_on_old_pte(void) -{ - /* - * Those arches which don't have hw access flag feature need to - * implement their own helper. By default, "true" means pagefault - * will be hit on old pte. - */ - return true; -} -#endif - #ifndef arch_wants_old_prefaulted_pte static inline bool arch_wants_old_prefaulted_pte(void) { @@ -2784,7 +2772,7 @@ static inline bool cow_user_page(struct page *dst, struct page *src, * On architectures with software "accessed" bits, we would * take a double page fault, so mark it accessed here. */ - if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { + if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) { pte_t entry; vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl); @@ -4824,6 +4812,27 @@ static inline void mm_account_fault(struct pt_regs *regs, perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); } +#ifdef CONFIG_LRU_GEN +static void lru_gen_enter_fault(struct vm_area_struct *vma) +{ + /* the LRU algorithm doesn't apply to sequential or random reads */ + current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ)); +} + +static void lru_gen_exit_fault(void) +{ + current->in_lru_fault = false; +} +#else +static void lru_gen_enter_fault(struct vm_area_struct *vma) +{ +} + +static void lru_gen_exit_fault(void) +{ +} +#endif /* CONFIG_LRU_GEN */ + /* * By the time we get here, we already hold the mm semaphore * @@ -4855,11 +4864,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, if (flags & FAULT_FLAG_USER) mem_cgroup_enter_user_fault(); + lru_gen_enter_fault(vma); + if (unlikely(is_vm_hugetlb_page(vma))) ret = hugetlb_fault(vma->vm_mm, vma, address, flags); else ret = __handle_mm_fault(vma, address, flags); + lru_gen_exit_fault(); + if (flags & FAULT_FLAG_USER) { mem_cgroup_exit_user_fault(); /* diff --git a/mm/mm_init.c b/mm/mm_init.c index 9ddaf0e1b0ab..0d7b2bd2454a 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void) shift = 8 * sizeof(unsigned long); width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH; + - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH; mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths", - "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n", + "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n", SECTIONS_WIDTH, NODES_WIDTH, ZONES_WIDTH, LAST_CPUPID_WIDTH, KASAN_TAG_WIDTH, + LRU_GEN_WIDTH, + LRU_REFS_WIDTH, NR_PAGEFLAGS); mminit_dprintk(MMINIT_TRACE, 
"pageflags_layout_shifts", "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n", diff --git a/mm/mmzone.c b/mm/mmzone.c index 0ae7571e35ab..68e1511be12d 100644 --- a/mm/mmzone.c +++ b/mm/mmzone.c @@ -88,6 +88,8 @@ void lruvec_init(struct lruvec *lruvec) * Poison its list head, so that any operations on it would crash. */ list_del(&lruvec->lists[LRU_UNEVICTABLE]); + + lru_gen_init_lruvec(lruvec); } #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) diff --git a/mm/rmap.c b/mm/rmap.c index fedb82371efe..7cb7ef29088a 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -73,6 +73,7 @@ #include #include #include +#include #include @@ -821,6 +822,12 @@ static bool folio_referenced_one(struct folio *folio, } if (pvmw.pte) { + if (lru_gen_enabled() && pte_young(*pvmw.pte) && + !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) { + lru_gen_look_around(&pvmw); + referenced++; + } + if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) { /* diff --git a/mm/swap.c b/mm/swap.c index 7e320ec08c6a..0aa1d0b33d42 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -342,7 +342,7 @@ static bool need_activate_page_drain(int cpu) return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0; } -static void folio_activate(struct folio *folio) +void folio_activate(struct folio *folio) { if (folio_test_lru(folio) && !folio_test_active(folio) && !folio_test_unevictable(folio)) { @@ -362,7 +362,7 @@ static inline void activate_page_drain(int cpu) { } -static void folio_activate(struct folio *folio) +void folio_activate(struct folio *folio) { struct lruvec *lruvec; @@ -405,6 +405,40 @@ static void __lru_cache_activate_folio(struct folio *folio) local_unlock(&lru_pvecs.lock); } +#ifdef CONFIG_LRU_GEN +static void folio_inc_refs(struct folio *folio) +{ + unsigned long new_flags, old_flags = READ_ONCE(folio->flags); + + if (folio_test_unevictable(folio)) + return; + + if (!folio_test_referenced(folio)) { + folio_set_referenced(folio); + return; + } + + if (!folio_test_workingset(folio)) { + folio_set_workingset(folio); + return; + } + + /* see the comment on MAX_NR_TIERS */ + do { + new_flags = old_flags & LRU_REFS_MASK; + if (new_flags == LRU_REFS_MASK) + break; + + new_flags += BIT(LRU_REFS_PGOFF); + new_flags |= old_flags & ~LRU_REFS_MASK; + } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); +} +#else +static void folio_inc_refs(struct folio *folio) +{ +} +#endif /* CONFIG_LRU_GEN */ + /* * Mark a page as having seen activity. 
* @@ -417,6 +451,11 @@ static void __lru_cache_activate_folio(struct folio *folio) */ void folio_mark_accessed(struct folio *folio) { + if (lru_gen_enabled()) { + folio_inc_refs(folio); + return; + } + if (!folio_test_referenced(folio)) { folio_set_referenced(folio); } else if (folio_test_unevictable(folio)) { @@ -460,6 +499,11 @@ void folio_add_lru(struct folio *folio) VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio); VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); + /* see the comment in lru_gen_add_folio() */ + if (lru_gen_enabled() && !folio_test_unevictable(folio) && + lru_gen_in_fault() && !(current->flags & PF_MEMALLOC)) + folio_set_active(folio); + folio_get(folio); local_lock(&lru_pvecs.lock); pvec = this_cpu_ptr(&lru_pvecs.lru_add); @@ -551,7 +595,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec) static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec) { - if (PageActive(page) && !PageUnevictable(page)) { + if (!PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) { int nr_pages = thp_nr_pages(page); del_page_from_lru_list(page, lruvec); @@ -666,7 +710,7 @@ void deactivate_file_folio(struct folio *folio) */ void deactivate_page(struct page *page) { - if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) { + if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) { struct pagevec *pvec; local_lock(&lru_pvecs.lock); diff --git a/mm/vmscan.c b/mm/vmscan.c index 1678802e03e7..aef71c45c424 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -50,6 +50,10 @@ #include #include #include +#include <linux/pagewalk.h> +#include <linux/shmem_fs.h> +#include <linux/ctype.h> +#include <linux/debugfs.h> #include #include @@ -126,6 +130,13 @@ struct scan_control { /* Always discard instead of demoting to lower tier memory */ unsigned int no_demotion:1; +#ifdef CONFIG_LRU_GEN + /* help make better choices when multiple memcgs are available */ + unsigned int memcgs_need_aging:1; + unsigned int memcgs_need_swapping:1; + unsigned int memcgs_avoid_swapping:1; +#endif + /* Allocation order */ s8 order; @@ -1275,9 +1286,11 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio, if (folio_test_swapcache(folio)) { swp_entry_t swap = folio_swap_entry(folio); - mem_cgroup_swapout(folio, swap); + + /* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */ if (reclaimed && !mapping_exiting(mapping)) shadow = workingset_eviction(folio, target_memcg); + mem_cgroup_swapout(folio, swap); __delete_from_swap_cache(&folio->page, swap, shadow); xa_unlock_irq(&mapping->i_pages); put_swap_page(&folio->page, swap); @@ -1552,6 +1565,11 @@ static unsigned int shrink_page_list(struct list_head *page_list, if (!sc->may_unmap && page_mapped(page)) goto keep_locked; + /* folio_update_gen() tried to promote this page? */ + if (lru_gen_enabled() && !ignore_references && + page_mapped(page) && PageReferenced(page)) + goto keep_locked; + may_enter_fs = (sc->gfp_mask & __GFP_FS) || (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); @@ -2644,6 +2662,112 @@ enum scan_balance { SCAN_FILE, }; +static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc) +{ + unsigned long file; + struct lruvec *target_lruvec; + + if (lru_gen_enabled()) + return; + + target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); + + /* + * Flush the memory cgroup stats, so that we read accurate per-memcg + * lruvec stats for heuristics. + */ + mem_cgroup_flush_stats(); + + /* + * Determine the scan balance between anon and file LRUs. 
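+ * (this helper is essentially the scan-balance logic factored out of
+ * shrink_node(); the lru_gen_enabled() early return above bypasses all of
+ * it when the multi-gen LRU is in use)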
+ */ + spin_lock_irq(&target_lruvec->lru_lock); + sc->anon_cost = target_lruvec->anon_cost; + sc->file_cost = target_lruvec->file_cost; + spin_unlock_irq(&target_lruvec->lru_lock); + + /* + * Target desirable inactive:active list ratios for the anon + * and file LRU lists. + */ + if (!sc->force_deactivate) { + unsigned long refaults; + + refaults = lruvec_page_state(target_lruvec, + WORKINGSET_ACTIVATE_ANON); + if (refaults != target_lruvec->refaults[0] || + inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) + sc->may_deactivate |= DEACTIVATE_ANON; + else + sc->may_deactivate &= ~DEACTIVATE_ANON; + + /* + * When refaults are being observed, it means a new + * workingset is being established. Deactivate to get + * rid of any stale active pages quickly. + */ + refaults = lruvec_page_state(target_lruvec, + WORKINGSET_ACTIVATE_FILE); + if (refaults != target_lruvec->refaults[1] || + inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) + sc->may_deactivate |= DEACTIVATE_FILE; + else + sc->may_deactivate &= ~DEACTIVATE_FILE; + } else + sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; + + /* + * If we have plenty of inactive file pages that aren't + * thrashing, try to reclaim those first before touching + * anonymous pages. + */ + file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); + if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) + sc->cache_trim_mode = 1; + else + sc->cache_trim_mode = 0; + + /* + * Prevent the reclaimer from falling into the cache trap: as + * cache pages start out inactive, every cache fault will tip + * the scan balance towards the file LRU. And as the file LRU + * shrinks, so does the window for rotation from references. + * This means we have a runaway feedback loop where a tiny + * thrashing file LRU becomes infinitely more attractive than + * anon pages. Try to detect this based on file LRU size. + */ + if (!cgroup_reclaim(sc)) { + unsigned long total_high_wmark = 0; + unsigned long free, anon; + int z; + + free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); + file = node_page_state(pgdat, NR_ACTIVE_FILE) + + node_page_state(pgdat, NR_INACTIVE_FILE); + + for (z = 0; z < MAX_NR_ZONES; z++) { + struct zone *zone = &pgdat->node_zones[z]; + + if (!managed_zone(zone)) + continue; + + total_high_wmark += high_wmark_pages(zone); + } + + /* + * Consider anon: if that's low too, this isn't a + * runaway file reclaim problem, but rather just + * extreme pressure. Reclaim as per usual then. + */ + anon = node_page_state(pgdat, NR_INACTIVE_ANON); + + sc->file_is_tiny = + file + free <= total_high_wmark && + !(sc->may_deactivate & DEACTIVATE_ANON) && + anon >> sc->priority; + } +} + /* * Determine how aggressively the anon and file LRU lists should be * scanned. 
The relative value of each set of LRU lists is determined @@ -2865,153 +2989,2801 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, return can_demote(pgdat->node_id, sc); } -static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) -{ - unsigned long nr[NR_LRU_LISTS]; - unsigned long targets[NR_LRU_LISTS]; - unsigned long nr_to_scan; - enum lru_list lru; - unsigned long nr_reclaimed = 0; - unsigned long nr_to_reclaim = sc->nr_to_reclaim; - struct blk_plug plug; - bool scan_adjusted; - - get_scan_count(lruvec, sc, nr); +#ifdef CONFIG_LRU_GEN - /* Record the original scan target for proportional adjustments later */ - memcpy(targets, nr, sizeof(nr)); +#ifdef CONFIG_LRU_GEN_ENABLED +DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS); +#else +DEFINE_STATIC_KEY_ARRAY_FALSE(lru_gen_caps, NR_LRU_GEN_CAPS); +#endif - /* - * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal - * event that can occur when there is little memory pressure e.g. - * multiple streaming readers/writers. Hence, we do not abort scanning - * when the requested number of pages are reclaimed when scanning at - * DEF_PRIORITY on the assumption that the fact we are direct - * reclaiming implies that kswapd is not keeping up and it is best to - * do a batch of work at once. For memcg reclaim one check is made to - * abort proportional reclaim if either the file or anon lru has already - * dropped to zero at the first pass. - */ - scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() && - sc->priority == DEF_PRIORITY); +/****************************************************************************** + * shorthand helpers + ******************************************************************************/ - blk_start_plug(&plug); - while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || - nr[LRU_INACTIVE_FILE]) { - unsigned long nr_anon, nr_file, percentage; - unsigned long nr_scanned; +#define LRU_REFS_FLAGS (BIT(PG_referenced) | BIT(PG_workingset)) - for_each_evictable_lru(lru) { - if (nr[lru]) { - nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX); - nr[lru] -= nr_to_scan; +#define DEFINE_MAX_SEQ(lruvec) \ + unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq) - nr_reclaimed += shrink_list(lru, nr_to_scan, - lruvec, sc); - } - } +#define DEFINE_MIN_SEQ(lruvec) \ + unsigned long min_seq[ANON_AND_FILE] = { \ + READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_ANON]), \ + READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]), \ + } - cond_resched(); +#define for_each_gen_type_zone(gen, type, zone) \ + for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \ + for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \ + for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++) - if (nr_reclaimed < nr_to_reclaim || scan_adjusted) - continue; +static bool get_cap(int cap) +{ +#ifdef CONFIG_LRU_GEN_ENABLED + return static_branch_likely(&lru_gen_caps[cap]); +#else + return static_branch_unlikely(&lru_gen_caps[cap]); +#endif +} - /* - * For kswapd and memcg, reclaim at least the number of pages - * requested. Ensure that the anon and file LRUs are scanned - * proportionally what was requested by get_scan_count(). We - * stop reclaiming one LRU and reduce the amount scanning - * proportional to the original scan target. 
- */ - nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE]; - nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON]; +static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid) +{ + struct pglist_data *pgdat = NODE_DATA(nid); - /* - * It's just vindictive to attack the larger once the smaller - * has gone to zero. And given the way we stop scanning the - * smaller below, this makes sure that we only make one nudge - * towards proportionality once we've got nr_to_reclaim. - */ - if (!nr_file || !nr_anon) - break; +#ifdef CONFIG_MEMCG + if (memcg) { + struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec; - if (nr_file > nr_anon) { - unsigned long scan_target = targets[LRU_INACTIVE_ANON] + - targets[LRU_ACTIVE_ANON] + 1; - lru = LRU_BASE; - percentage = nr_anon * 100 / scan_target; - } else { - unsigned long scan_target = targets[LRU_INACTIVE_FILE] + - targets[LRU_ACTIVE_FILE] + 1; - lru = LRU_FILE; - percentage = nr_file * 100 / scan_target; - } + /* for hotadd_new_pgdat() */ + if (!lruvec->pgdat) + lruvec->pgdat = pgdat; - /* Stop scanning the smaller of the LRU */ - nr[lru] = 0; - nr[lru + LRU_ACTIVE] = 0; + return lruvec; + } +#endif + VM_WARN_ON_ONCE(!mem_cgroup_disabled()); - /* - * Recalculate the other LRU scan count based on its original - * scan target and the percentage scanning already complete - */ - lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE; - nr_scanned = targets[lru] - nr[lru]; - nr[lru] = targets[lru] * (100 - percentage) / 100; - nr[lru] -= min(nr[lru], nr_scanned); + return pgdat ? &pgdat->__lruvec : NULL; +} - lru += LRU_ACTIVE; - nr_scanned = targets[lru] - nr[lru]; - nr[lru] = targets[lru] * (100 - percentage) / 100; - nr[lru] -= min(nr[lru], nr_scanned); +static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) +{ + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + struct pglist_data *pgdat = lruvec_pgdat(lruvec); - scan_adjusted = true; - } - blk_finish_plug(&plug); - sc->nr_reclaimed += nr_reclaimed; + if (!can_demote(pgdat->node_id, sc) && + mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) + return 0; - /* - * Even if we did not try to evict anon pages at all, we want to - * rebalance the anon lru active/inactive ratio. - */ - if (can_age_anon_pages(lruvec_pgdat(lruvec), sc) && - inactive_is_low(lruvec, LRU_INACTIVE_ANON)) - shrink_active_list(SWAP_CLUSTER_MAX, lruvec, - sc, LRU_ACTIVE_ANON); + return mem_cgroup_swappiness(memcg); } -/* Use reclaim/compaction for costly allocs or under memory pressure */ -static bool in_reclaim_compaction(struct scan_control *sc) +static int get_nr_gens(struct lruvec *lruvec, int type) { - if (IS_ENABLED(CONFIG_COMPACTION) && sc->order && - (sc->order > PAGE_ALLOC_COSTLY_ORDER || - sc->priority < DEF_PRIORITY - 2)) - return true; + return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1; +} - return false; +static bool __maybe_unused seq_is_valid(struct lruvec *lruvec) +{ + /* see the comment on lru_gen_struct */ + return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS && + get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) && + get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS; } -/* - * Reclaim/compaction is used for high-order allocation requests. It reclaims - * order-0 pages before compacting the zone. should_continue_reclaim() returns - * true if more pages should be reclaimed such that when the page allocator - * calls try_to_compact_pages() that it will have enough free pages to succeed. 
- * It will give up earlier than that if there is difficulty reclaiming pages. - */ -static inline bool should_continue_reclaim(struct pglist_data *pgdat, - unsigned long nr_reclaimed, - struct scan_control *sc) +/****************************************************************************** + * mm_struct list + ******************************************************************************/ + +static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg) { - unsigned long pages_for_compaction; - unsigned long inactive_lru_pages; - int z; + static struct lru_gen_mm_list mm_list = { + .fifo = LIST_HEAD_INIT(mm_list.fifo), + .lock = __SPIN_LOCK_UNLOCKED(mm_list.lock), + }; - /* If not in reclaim/compaction mode, stop */ - if (!in_reclaim_compaction(sc)) - return false; +#ifdef CONFIG_MEMCG + if (memcg) + return &memcg->mm_list; +#endif + VM_WARN_ON_ONCE(!mem_cgroup_disabled()); - /* - * Stop if we failed to reclaim any pages from the last SWAP_CLUSTER_MAX + return &mm_list; +} + +void lru_gen_add_mm(struct mm_struct *mm) +{ + int nid; + struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm); + struct lru_gen_mm_list *mm_list = get_mm_list(memcg); + + VM_WARN_ON_ONCE(!list_empty(&mm->lru_gen.list)); +#ifdef CONFIG_MEMCG + VM_WARN_ON_ONCE(mm->lru_gen.memcg); + mm->lru_gen.memcg = memcg; +#endif + spin_lock(&mm_list->lock); + + for_each_node_state(nid, N_MEMORY) { + struct lruvec *lruvec = get_lruvec(memcg, nid); + + if (!lruvec) + continue; + + if (lruvec->mm_state.tail == &mm_list->fifo) + lruvec->mm_state.tail = &mm->lru_gen.list; + } + + list_add_tail(&mm->lru_gen.list, &mm_list->fifo); + + spin_unlock(&mm_list->lock); +} + +void lru_gen_del_mm(struct mm_struct *mm) +{ + int nid; + struct lru_gen_mm_list *mm_list; + struct mem_cgroup *memcg = NULL; + + if (list_empty(&mm->lru_gen.list)) + return; + +#ifdef CONFIG_MEMCG + memcg = mm->lru_gen.memcg; +#endif + mm_list = get_mm_list(memcg); + + spin_lock(&mm_list->lock); + + for_each_node(nid) { + struct lruvec *lruvec = get_lruvec(memcg, nid); + + if (!lruvec) + continue; + + if (lruvec->mm_state.tail == &mm->lru_gen.list) + lruvec->mm_state.tail = lruvec->mm_state.tail->next; + + if (lruvec->mm_state.head != &mm->lru_gen.list) + continue; + + lruvec->mm_state.head = lruvec->mm_state.head->next; + if (lruvec->mm_state.head == &mm_list->fifo) + WRITE_ONCE(lruvec->mm_state.seq, lruvec->mm_state.seq + 1); + } + + list_del_init(&mm->lru_gen.list); + + spin_unlock(&mm_list->lock); + +#ifdef CONFIG_MEMCG + mem_cgroup_put(mm->lru_gen.memcg); + mm->lru_gen.memcg = NULL; +#endif +} + +#ifdef CONFIG_MEMCG +void lru_gen_migrate_mm(struct mm_struct *mm) +{ + struct mem_cgroup *memcg; + + lockdep_assert_held(&mm->owner->alloc_lock); + + /* for mm_update_next_owner() */ + if (mem_cgroup_disabled()) + return; + + rcu_read_lock(); + memcg = mem_cgroup_from_task(rcu_dereference(mm->owner)); + rcu_read_unlock(); + if (memcg == mm->lru_gen.memcg) + return; + + VM_WARN_ON_ONCE(!mm->lru_gen.memcg); + VM_WARN_ON_ONCE(list_empty(&mm->lru_gen.list)); + + lru_gen_del_mm(mm); + lru_gen_add_mm(mm); +} +#endif + +/* + * Bloom filters with m=1<<15, k=2 and the false positive rates of ~1/5 when + * n=10,000 and ~1/2 when n=20,000, where, conventionally, m is the number of + * bits in a bitmap, k is the number of hash functions and n is the number of + * inserted items. + * + * Page table walkers use one of the two filters to reduce their search space. 
+ * To get rid of non-leaf entries that no longer have enough leaf entries, the + * aging uses the double-buffering technique to flip to the other filter each + * time it produces a new generation. For non-leaf entries that have enough + * leaf entries, the aging carries them over to the next generation in + * walk_pmd_range(); the eviction also report them when walking the rmap + * in lru_gen_look_around(). + * + * For future optimizations: + * 1. It's not necessary to keep both filters all the time. The spare one can be + * freed after the RCU grace period and reallocated if needed again. + * 2. And when reallocating, it's worth scaling its size according to the number + * of inserted entries in the other filter, to reduce the memory overhead on + * small systems and false positives on large systems. + * 3. Jenkins' hash function is an alternative to Knuth's. + */ +#define BLOOM_FILTER_SHIFT 15 + +static inline int filter_gen_from_seq(unsigned long seq) +{ + return seq % NR_BLOOM_FILTERS; +} + +static void get_item_key(void *item, int *key) +{ + u32 hash = hash_ptr(item, BLOOM_FILTER_SHIFT * 2); + + BUILD_BUG_ON(BLOOM_FILTER_SHIFT * 2 > BITS_PER_TYPE(u32)); + + key[0] = hash & (BIT(BLOOM_FILTER_SHIFT) - 1); + key[1] = hash >> BLOOM_FILTER_SHIFT; +} + +static void reset_bloom_filter(struct lruvec *lruvec, unsigned long seq) +{ + unsigned long *filter; + int gen = filter_gen_from_seq(seq); + + lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock); + + filter = lruvec->mm_state.filters[gen]; + if (filter) { + bitmap_clear(filter, 0, BIT(BLOOM_FILTER_SHIFT)); + return; + } + + filter = bitmap_zalloc(BIT(BLOOM_FILTER_SHIFT), GFP_ATOMIC); + WRITE_ONCE(lruvec->mm_state.filters[gen], filter); +} + +static void update_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item) +{ + int key[2]; + unsigned long *filter; + int gen = filter_gen_from_seq(seq); + + filter = READ_ONCE(lruvec->mm_state.filters[gen]); + if (!filter) + return; + + get_item_key(item, key); + + if (!test_bit(key[0], filter)) + set_bit(key[0], filter); + if (!test_bit(key[1], filter)) + set_bit(key[1], filter); +} + +static bool test_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item) +{ + int key[2]; + unsigned long *filter; + int gen = filter_gen_from_seq(seq); + + filter = READ_ONCE(lruvec->mm_state.filters[gen]); + if (!filter) + return true; + + get_item_key(item, key); + + return test_bit(key[0], filter) && test_bit(key[1], filter); +} + +static void reset_mm_stats(struct lruvec *lruvec, struct lru_gen_mm_walk *walk, bool last) +{ + int i; + int hist; + + lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock); + + if (walk) { + hist = lru_hist_from_seq(walk->max_seq); + + for (i = 0; i < NR_MM_STATS; i++) { + WRITE_ONCE(lruvec->mm_state.stats[hist][i], + lruvec->mm_state.stats[hist][i] + walk->mm_stats[i]); + walk->mm_stats[i] = 0; + } + } + + if (NR_HIST_GENS > 1 && last) { + hist = lru_hist_from_seq(lruvec->mm_state.seq + 1); + + for (i = 0; i < NR_MM_STATS; i++) + WRITE_ONCE(lruvec->mm_state.stats[hist][i], 0); + } +} + +static bool should_skip_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk) +{ + int type; + unsigned long size = 0; + struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec); + int key = pgdat->node_id % BITS_PER_TYPE(mm->lru_gen.bitmap); + + if (!walk->force_scan && cpumask_empty(mm_cpumask(mm)) && + !test_bit(key, &mm->lru_gen.bitmap)) + return true; + + clear_bit(key, &mm->lru_gen.bitmap); + + for (type = !walk->can_swap; type < ANON_AND_FILE; type++) { + 
size += type ? get_mm_counter(mm, MM_FILEPAGES) : + get_mm_counter(mm, MM_ANONPAGES) + + get_mm_counter(mm, MM_SHMEMPAGES); + } + + if (size < MIN_LRU_BATCH) + return true; + + if (mm_is_oom_victim(mm)) + return true; + + return !mmget_not_zero(mm); +} + +static bool iterate_mm_list(struct lruvec *lruvec, struct lru_gen_mm_walk *walk, + struct mm_struct **iter) +{ + bool first = false; + bool last = true; + struct mm_struct *mm = NULL; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + struct lru_gen_mm_list *mm_list = get_mm_list(memcg); + struct lru_gen_mm_state *mm_state = &lruvec->mm_state; + + /* + * There are four interesting cases for this page table walker: + * 1. It tries to start a new iteration of mm_list with a stale max_seq; + * there is nothing to be done. + * 2. It's the first of the current generation, and it needs to reset + * the Bloom filter for the next generation. + * 3. It reaches the end of mm_list, and it needs to increment + * mm_state->seq; the iteration is done. + * 4. It's the last of the current generation, and it needs to reset the + * mm stats counters for the next generation. + */ + if (*iter) + mmput_async(*iter); + else if (walk->max_seq <= READ_ONCE(mm_state->seq)) + return false; + + spin_lock(&mm_list->lock); + + VM_WARN_ON_ONCE(mm_state->seq + 1 < walk->max_seq); + VM_WARN_ON_ONCE(*iter && mm_state->seq > walk->max_seq); + VM_WARN_ON_ONCE(*iter && !mm_state->nr_walkers); + + if (walk->max_seq <= mm_state->seq) { + if (!*iter) + last = false; + goto done; + } + + if (!mm_state->nr_walkers) { + VM_WARN_ON_ONCE(mm_state->head && mm_state->head != &mm_list->fifo); + + mm_state->head = mm_list->fifo.next; + first = true; + } + + while (!mm && mm_state->head != &mm_list->fifo) { + mm = list_entry(mm_state->head, struct mm_struct, lru_gen.list); + + mm_state->head = mm_state->head->next; + + /* force scan for those added after the last iteration */ + if (!mm_state->tail || mm_state->tail == &mm->lru_gen.list) { + mm_state->tail = mm_state->head; + walk->force_scan = true; + } + + if (should_skip_mm(mm, walk)) + mm = NULL; + } + + if (mm_state->head == &mm_list->fifo) + WRITE_ONCE(mm_state->seq, mm_state->seq + 1); +done: + if (*iter && !mm) + mm_state->nr_walkers--; + if (!*iter && mm) + mm_state->nr_walkers++; + + if (mm_state->nr_walkers) + last = false; + + if (mm && first) + reset_bloom_filter(lruvec, walk->max_seq + 1); + + if (*iter || last) + reset_mm_stats(lruvec, walk, last); + + spin_unlock(&mm_list->lock); + + *iter = mm; + + return last; +} + +static bool iterate_mm_list_nowalk(struct lruvec *lruvec, unsigned long max_seq) +{ + bool success = false; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + struct lru_gen_mm_list *mm_list = get_mm_list(memcg); + struct lru_gen_mm_state *mm_state = &lruvec->mm_state; + + if (max_seq <= READ_ONCE(mm_state->seq)) + return false; + + spin_lock(&mm_list->lock); + + VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq); + + if (max_seq > mm_state->seq && !mm_state->nr_walkers) { + VM_WARN_ON_ONCE(mm_state->head && mm_state->head != &mm_list->fifo); + + WRITE_ONCE(mm_state->seq, mm_state->seq + 1); + reset_mm_stats(lruvec, NULL, true); + success = true; + } + + spin_unlock(&mm_list->lock); + + return success; +} + +/****************************************************************************** + * refault feedback loop + ******************************************************************************/ + +/* + * A feedback loop based on Proportional-Integral-Derivative (PID) controller. 
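+ * (terms are defined below; as a worked example, a tier with 100 pages
+ * evicted plus protected in the generation being evicted, 10 of which
+ * refaulted, has P = 10/100 = 0.1, and with smoothing factor 1/2 the running
+ * average becomes roughly (old + new) / 2, cf. reset_ctrl_pos())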
+ * + * The P term is refaulted/(evicted+protected) from a tier in the generation + * currently being evicted; the I term is the exponential moving average of the + * P term over the generations previously evicted, using the smoothing factor + * 1/2; the D term isn't supported. + * + * The setpoint (SP) is always the first tier of one type; the process variable + * (PV) is either any tier of the other type or any other tier of the same + * type. + * + * The error is the difference between the SP and the PV; the correction is + * to turn off protection when SP>PV or turn on protection when SP<PV. + */ + +struct ctrl_pos { + unsigned long refaulted; + unsigned long total; + int gain; +}; + +static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain, + struct ctrl_pos *pos) +{ + struct lru_gen_struct *lrugen = &lruvec->lrugen; + int hist = lru_hist_from_seq(lrugen->min_seq[type]); + + pos->refaulted = lrugen->avg_refaulted[type][tier] + + atomic_long_read(&lrugen->refaulted[hist][type][tier]); + pos->total = lrugen->avg_total[type][tier] + + atomic_long_read(&lrugen->evicted[hist][type][tier]); + if (tier) + pos->total += lrugen->protected[hist][type][tier - 1]; + pos->gain = gain; +} + +static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover) +{ + int hist, tier; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1; + unsigned long seq = carryover ? lrugen->min_seq[type] : lrugen->max_seq + 1; + + lockdep_assert_held(&lruvec->lru_lock); + + if (!carryover && !clear) + return; + + hist = lru_hist_from_seq(seq); + + for (tier = 0; tier < MAX_NR_TIERS; tier++) { + if (carryover) { + unsigned long sum; + + sum = lrugen->avg_refaulted[type][tier] + + atomic_long_read(&lrugen->refaulted[hist][type][tier]); + WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2); + + sum = lrugen->avg_total[type][tier] + + atomic_long_read(&lrugen->evicted[hist][type][tier]); + if (tier) + sum += lrugen->protected[hist][type][tier - 1]; + WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2); + } + + if (clear) { + atomic_long_set(&lrugen->refaulted[hist][type][tier], 0); + atomic_long_set(&lrugen->evicted[hist][type][tier], 0); + if (tier) + WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0); + } + } +} + +static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv) +{ + /* + * Return true if the PV has a limited number of refaults or a lower + * refaulted/total than the SP. + */ + return pv->refaulted < MIN_LRU_BATCH || + pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <= + (sp->refaulted + 1) * pv->total * pv->gain; +} + +/****************************************************************************** + * the aging + ******************************************************************************/ + +static int folio_update_gen(struct folio *folio, int gen) +{ + unsigned long new_flags, old_flags = READ_ONCE(folio->flags); + + VM_WARN_ON_ONCE(gen >= MAX_NR_GENS); + VM_WARN_ON_ONCE(!rcu_read_lock_held()); + + do { + /* lru_gen_del_folio() has isolated this page? 
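+ * (cleared gen bits mean the folio is off the MGLRU lists, e.g. already
+ * isolated for reclaim; setting PG_referenced lets shrink_page_list()
+ * keep a page that was just re-accessed instead of evicting it)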
*/ + if (!(old_flags & LRU_GEN_MASK)) { + /* for shrink_page_list() */ + new_flags = old_flags | BIT(PG_referenced); + continue; + } + + new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS); + new_flags |= (gen + 1UL) << LRU_GEN_PGOFF; + } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); + + return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; +} + +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming) +{ + int type = folio_is_file_lru(folio); + struct lru_gen_struct *lrugen = &lruvec->lrugen; + int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); + unsigned long new_flags, old_flags = READ_ONCE(folio->flags); + + VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio); + + do { + new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; + /* folio_update_gen() has promoted this page? */ + if (new_gen >= 0 && new_gen != old_gen) + return new_gen; + + new_gen = (old_gen + 1) % MAX_NR_GENS; + + new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS); + new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF; + /* for folio_end_writeback() */ + if (reclaiming) + new_flags |= BIT(PG_reclaim); + } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); + + lru_gen_update_size(lruvec, folio, old_gen, new_gen); + + return new_gen; +} + +static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio, + int old_gen, int new_gen) +{ + int type = folio_is_file_lru(folio); + int zone = folio_zonenum(folio); + int delta = folio_nr_pages(folio); + + VM_WARN_ON_ONCE(old_gen >= MAX_NR_GENS); + VM_WARN_ON_ONCE(new_gen >= MAX_NR_GENS); + + walk->batched++; + + walk->nr_pages[old_gen][type][zone] -= delta; + walk->nr_pages[new_gen][type][zone] += delta; +} + +static void reset_batch_size(struct lruvec *lruvec, struct lru_gen_mm_walk *walk) +{ + int gen, type, zone; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + walk->batched = 0; + + for_each_gen_type_zone(gen, type, zone) { + enum lru_list lru = type * LRU_INACTIVE_FILE; + int delta = walk->nr_pages[gen][type][zone]; + + if (!delta) + continue; + + walk->nr_pages[gen][type][zone] = 0; + WRITE_ONCE(lrugen->nr_pages[gen][type][zone], + lrugen->nr_pages[gen][type][zone] + delta); + + if (lru_gen_is_active(lruvec, gen)) + lru += LRU_ACTIVE; + __update_lru_size(lruvec, lru, zone, delta); + } +} + +static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *args) +{ + struct address_space *mapping; + struct vm_area_struct *vma = args->vma; + struct lru_gen_mm_walk *walk = args->private; + + if (!vma_is_accessible(vma) || is_vm_hugetlb_page(vma) || + (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ | VM_RAND_READ)) || + vma == get_gate_vma(vma->vm_mm)) + return true; + + if (vma_is_anonymous(vma)) + return !walk->can_swap; + + if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping)) + return true; + + mapping = vma->vm_file->f_mapping; + if (mapping_unevictable(mapping)) + return true; + + /* check readpage to exclude special mappings like dax, etc. */ + return shmem_mapping(mapping) ? !walk->can_swap : !mapping->a_ops->readpage; +} + +/* + * Some userspace memory allocators map many single-page VMAs. Instead of + * returning back to the PGD table for each of such VMAs, finish an entire PMD + * table to reduce zigzags and improve cache performance. 
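+ * (with 4KB pages, one PMD table on x86_64 maps 512 PTEs covering 2MB, so
+ * finishing it in one go amortizes the walk across many small VMAs)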
+ */ +static bool get_next_vma(unsigned long mask, unsigned long size, struct mm_walk *args, + unsigned long *start, unsigned long *end) +{ + unsigned long next = round_up(*end, size); + + VM_WARN_ON_ONCE(mask & size); + VM_WARN_ON_ONCE(*start >= *end); + VM_WARN_ON_ONCE((next & mask) != (*start & mask)); + + while (args->vma) { + if (next >= args->vma->vm_end) { + args->vma = args->vma->vm_next; + continue; + } + + if ((next & mask) != (args->vma->vm_start & mask)) + return false; + + if (should_skip_vma(args->vma->vm_start, args->vma->vm_end, args)) { + args->vma = args->vma->vm_next; + continue; + } + + *start = max(next, args->vma->vm_start); + next = (next | ~mask) + 1; + /* rounded-up boundaries can wrap to 0 */ + *end = next && next < args->vma->vm_end ? next : args->vma->vm_end; + + return true; + } + + return false; +} + +static bool suitable_to_scan(int total, int young) +{ + int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8); + + /* suitable if the average number of young PTEs per cacheline is >=1 */ + return young * n >= total; +} + +static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, + struct mm_walk *args) +{ + int i; + pte_t *pte; + spinlock_t *ptl; + unsigned long addr; + int total = 0; + int young = 0; + struct lru_gen_mm_walk *walk = args->private; + struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec); + struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec); + int old_gen, new_gen = lru_gen_from_seq(walk->max_seq); + + VM_WARN_ON_ONCE(pmd_leaf(*pmd)); + + ptl = pte_lockptr(args->mm, pmd); + if (!spin_trylock(ptl)) + return false; + + arch_enter_lazy_mmu_mode(); + + pte = pte_offset_map(pmd, start & PMD_MASK); +restart: + for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) { + struct folio *folio; + unsigned long pfn = pte_pfn(pte[i]); + + VM_WARN_ON_ONCE(addr < args->vma->vm_start || addr >= args->vma->vm_end); + + total++; + walk->mm_stats[MM_PTE_TOTAL]++; + + if (!pte_present(pte[i]) || is_zero_pfn(pfn)) + continue; + + if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i]))) + continue; + + if (!pte_young(pte[i])) { + walk->mm_stats[MM_PTE_OLD]++; + continue; + } + + VM_WARN_ON_ONCE(!pfn_valid(pfn)); + if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) + continue; + + folio = pfn_folio(pfn); + if (folio_nid(folio) != pgdat->node_id) + continue; + + if (folio_memcg_rcu(folio) != memcg) + continue; + + if (!folio_is_file_lru(folio) && !walk->can_swap) + continue; + + if (!ptep_test_and_clear_young(args->vma, addr, pte + i)) + continue; + + young++; + walk->mm_stats[MM_PTE_YOUNG]++; + + if (pte_dirty(pte[i]) && !folio_test_dirty(folio) && + !(folio_test_anon(folio) && folio_test_swapbacked(folio) && + !folio_test_swapcache(folio))) + folio_mark_dirty(folio); + + old_gen = folio_update_gen(folio, new_gen); + if (old_gen >= 0 && old_gen != new_gen) + update_batch_size(walk, folio, old_gen, new_gen); + } + + if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end)) + goto restart; + + pte_unmap(pte); + + arch_leave_lazy_mmu_mode(); + spin_unlock(ptl); + + return suitable_to_scan(total, young); +} + +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) +static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma, + struct mm_walk *args, unsigned long *start) +{ + int i; + pmd_t *pmd; + spinlock_t *ptl; + struct lru_gen_mm_walk *walk = args->private; + struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec); + 
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec); + int old_gen, new_gen = lru_gen_from_seq(walk->max_seq); + + VM_WARN_ON_ONCE(pud_leaf(*pud)); + + /* try to batch at most 1+MIN_LRU_BATCH+1 entries */ + if (*start == -1) { + *start = next; + return; + } + + i = next == -1 ? 0 : pmd_index(next) - pmd_index(*start); + if (i && i <= MIN_LRU_BATCH) { + __set_bit(i - 1, walk->bitmap); + return; + } + + pmd = pmd_offset(pud, *start); + + ptl = pmd_lockptr(args->mm, pmd); + if (!spin_trylock(ptl)) + goto done; + + arch_enter_lazy_mmu_mode(); + + do { + struct folio *folio; + unsigned long pfn = pmd_pfn(pmd[i]); + unsigned long addr = i ? (*start & PMD_MASK) + i * PMD_SIZE : *start; + + VM_WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end); + + if (!pmd_present(pmd[i]) || is_huge_zero_pmd(pmd[i])) + goto next; + + if (WARN_ON_ONCE(pmd_devmap(pmd[i]))) + goto next; + + if (!pmd_trans_huge(pmd[i])) { + if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && + get_cap(LRU_GEN_NONLEAF_YOUNG)) + pmdp_test_and_clear_young(vma, addr, pmd + i); + goto next; + } + + VM_WARN_ON_ONCE(!pfn_valid(pfn)); + if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) + goto next; + + folio = pfn_folio(pfn); + if (folio_nid(folio) != pgdat->node_id) + goto next; + + if (folio_memcg_rcu(folio) != memcg) + goto next; + + if (!folio_is_file_lru(folio) && !walk->can_swap) + goto next; + + if (!pmdp_test_and_clear_young(vma, addr, pmd + i)) + goto next; + + walk->mm_stats[MM_PTE_YOUNG]++; + + if (pmd_dirty(pmd[i]) && !folio_test_dirty(folio) && + !(folio_test_anon(folio) && folio_test_swapbacked(folio) && + !folio_test_swapcache(folio))) + folio_mark_dirty(folio); + + old_gen = folio_update_gen(folio, new_gen); + if (old_gen >= 0 && old_gen != new_gen) + update_batch_size(walk, folio, old_gen, new_gen); +next: + i = i > MIN_LRU_BATCH ? 0 : + find_next_bit(walk->bitmap, MIN_LRU_BATCH, i) + 1; + } while (i <= MIN_LRU_BATCH); + + arch_leave_lazy_mmu_mode(); + spin_unlock(ptl); +done: + *start = -1; + bitmap_zero(walk->bitmap, MIN_LRU_BATCH); +} +#else +static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma, + struct mm_walk *args, unsigned long *start) +{ +} +#endif + +static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end, + struct mm_walk *args) +{ + int i; + pmd_t *pmd; + unsigned long next; + unsigned long addr; + struct vm_area_struct *vma; + unsigned long pos = -1; + struct lru_gen_mm_walk *walk = args->private; + + VM_WARN_ON_ONCE(pud_leaf(*pud)); + + /* + * Finish an entire PMD in two passes: the first only reaches to PTE + * tables to avoid taking the PMD lock; the second, if necessary, takes + * the PMD lock to clear the accessed bit in PMD entries. 
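+ * (the second pass is walk_pmd_range_locked() above, which buffers up to
+ * MIN_LRU_BATCH PMD entries in walk->bitmap and clears their accessed bits
+ * under a single lock acquisition)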
+ */ + pmd = pmd_offset(pud, start & PUD_MASK); +restart: + /* walk_pte_range() may call get_next_vma() */ + vma = args->vma; + for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) { + pmd_t val = pmd_read_atomic(pmd + i); + + /* for pmd_read_atomic() */ + barrier(); + + next = pmd_addr_end(addr, end); + + if (!pmd_present(val)) { + walk->mm_stats[MM_PTE_TOTAL]++; + continue; + } + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + if (pmd_trans_huge(val)) { + unsigned long pfn = pmd_pfn(val); + struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec); + + walk->mm_stats[MM_PTE_TOTAL]++; + + if (is_huge_zero_pmd(val)) + continue; + + if (!pmd_young(val)) { + walk->mm_stats[MM_PTE_OLD]++; + continue; + } + + if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) + continue; + + walk_pmd_range_locked(pud, addr, vma, args, &pos); + continue; + } +#endif + walk->mm_stats[MM_PMD_TOTAL]++; + +#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG + if (get_cap(LRU_GEN_NONLEAF_YOUNG)) { + if (!pmd_young(val)) + continue; + + walk_pmd_range_locked(pud, addr, vma, args, &pos); + } +#endif + if (!walk->force_scan && !test_bloom_filter(walk->lruvec, walk->max_seq, pmd + i)) + continue; + + walk->mm_stats[MM_PMD_FOUND]++; + + if (!walk_pte_range(&val, addr, next, args)) + continue; + + walk->mm_stats[MM_PMD_ADDED]++; + + /* carry over to the next generation */ + update_bloom_filter(walk->lruvec, walk->max_seq + 1, pmd + i); + } + + walk_pmd_range_locked(pud, -1, vma, args, &pos); + + if (i < PTRS_PER_PMD && get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end)) + goto restart; +} + +static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end, + struct mm_walk *args) +{ + int i; + pud_t *pud; + unsigned long addr; + unsigned long next; + struct lru_gen_mm_walk *walk = args->private; + + VM_WARN_ON_ONCE(p4d_leaf(*p4d)); + + pud = pud_offset(p4d, start & P4D_MASK); +restart: + for (i = pud_index(start), addr = start; addr != end; i++, addr = next) { + pud_t val = READ_ONCE(pud[i]); + + next = pud_addr_end(addr, end); + + if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val))) + continue; + + walk_pmd_range(&val, addr, next, args); + + if (walk->batched >= MAX_LRU_BATCH) { + end = (addr | ~PUD_MASK) + 1; + goto done; + } + } + + if (i < PTRS_PER_PUD && get_next_vma(P4D_MASK, PUD_SIZE, args, &start, &end)) + goto restart; + + end = round_up(end, P4D_SIZE); +done: + /* rounded-up boundaries can wrap to 0 */ + walk->next_addr = end && args->vma ? 
max(end, args->vma->vm_start) : 0; + + return -EAGAIN; +} + +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk) +{ + static const struct mm_walk_ops mm_walk_ops = { + .test_walk = should_skip_vma, + .p4d_entry = walk_pud_range, + }; + + int err; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + + walk->next_addr = FIRST_USER_ADDRESS; + + do { + err = -EBUSY; + + /* folio_update_gen() requires stable folio_memcg() */ + if (!mem_cgroup_trylock_pages(memcg)) + break; + + /* the caller might be holding the lock for write */ + if (mmap_read_trylock(mm)) { + err = walk_page_range(mm, walk->next_addr, ULONG_MAX, &mm_walk_ops, walk); + + mmap_read_unlock(mm); + + if (walk->batched) { + spin_lock_irq(&lruvec->lru_lock); + reset_batch_size(lruvec, walk); + spin_unlock_irq(&lruvec->lru_lock); + } + } + + mem_cgroup_unlock_pages(); + + cond_resched(); + } while (err == -EAGAIN && walk->next_addr && !mm_is_oom_victim(mm)); +} + +static struct lru_gen_mm_walk *alloc_mm_walk(void) +{ + if (current->reclaim_state && current->reclaim_state->mm_walk) + return current->reclaim_state->mm_walk; + + return kzalloc(sizeof(struct lru_gen_mm_walk), + __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN); +} + +static void free_mm_walk(struct lru_gen_mm_walk *walk) +{ + if (!current->reclaim_state || !current->reclaim_state->mm_walk) + kfree(walk); +} + +static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) +{ + int zone; + int remaining = MAX_LRU_BATCH; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); + + if (type == LRU_GEN_ANON && !can_swap) + goto done; + + for (zone = 0; zone < MAX_NR_ZONES; zone++) { + struct list_head *head = &lrugen->lists[old_gen][type][zone]; + + while (!list_empty(head)) { + struct folio *folio = lru_to_folio(head); + + VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio); + VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio); + + new_gen = folio_inc_gen(lruvec, folio, false); + list_move_tail(&folio->lru, &lrugen->lists[new_gen][type][zone]); + + if (!--remaining) + return false; + } + } +done: + reset_ctrl_pos(lruvec, type, true); + WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1); + + return true; +} + +static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap) +{ + int gen, type, zone; + bool success = false; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + DEFINE_MIN_SEQ(lruvec); + + VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); + + /* find the oldest populated generation */ + for (type = !can_swap; type < ANON_AND_FILE; type++) { + while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) { + gen = lru_gen_from_seq(min_seq[type]); + + for (zone = 0; zone < MAX_NR_ZONES; zone++) { + if (!list_empty(&lrugen->lists[gen][type][zone])) + goto next; + } + + min_seq[type]++; + } +next: + ; + } + + /* see the comment on lru_gen_struct */ + if (can_swap) { + min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]); + min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]); + } + + for (type = !can_swap; type < ANON_AND_FILE; type++) { + if (min_seq[type] == lrugen->min_seq[type]) + continue; + + reset_ctrl_pos(lruvec, type, true); + WRITE_ONCE(lrugen->min_seq[type], min_seq[type]); + success = true; + } + + return success; +} + +static void inc_max_seq(struct lruvec 
*lruvec, bool can_swap, bool force_scan)
+{
+	int prev, next;
+	int type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+			continue;
+
+		VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap));
+
+		while (!inc_min_seq(lruvec, type, can_swap)) {
+			spin_unlock_irq(&lruvec->lru_lock);
+			cond_resched();
+			spin_lock_irq(&lruvec->lru_lock);
+		}
+	}
+
+	/*
+	 * Update the active/inactive LRU sizes for compatibility. Both sides of
+	 * the current max_seq need to be covered, since max_seq+1 can overlap
+	 * with min_seq[LRU_GEN_ANON] if swapping is constrained. And if they do
+	 * overlap, cold/hot inversion happens.
+	 */
+	prev = lru_gen_from_seq(lrugen->max_seq - 1);
+	next = lru_gen_from_seq(lrugen->max_seq + 1);
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			enum lru_list lru = type * LRU_INACTIVE_FILE;
+			long delta = lrugen->nr_pages[prev][type][zone] -
+				     lrugen->nr_pages[next][type][zone];
+
+			if (!delta)
+				continue;
+
+			__update_lru_size(lruvec, lru, zone, delta);
+			__update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
+		}
+	}
+
+	for (type = 0; type < ANON_AND_FILE; type++)
+		reset_ctrl_pos(lruvec, type, false);
+
+	WRITE_ONCE(lrugen->timestamps[next], jiffies);
+	/* make sure preceding modifications appear */
+	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
+
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+			       struct scan_control *sc, bool can_swap, bool force_scan)
+{
+	bool success;
+	struct lru_gen_mm_walk *walk;
+	struct mm_struct *mm = NULL;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_WARN_ON_ONCE(max_seq > READ_ONCE(lrugen->max_seq));
+
+	/*
+	 * If the hardware doesn't automatically set the accessed bit, fall back
+	 * to lru_gen_look_around(), which only clears the accessed bit in a
+	 * handful of PTEs. Spreading the work out over a period of time is
+	 * usually less efficient, but it avoids bursty page faults.
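+	 * The page table walk can also fall back to iterate_mm_list_nowalk()
+	 * below when the lru_gen_mm_walk object cannot be allocated.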
+ */ + if (!force_scan && !(arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))) { + success = iterate_mm_list_nowalk(lruvec, max_seq); + goto done; + } + + walk = alloc_mm_walk(); + if (!walk) { + success = iterate_mm_list_nowalk(lruvec, max_seq); + goto done; + } + + walk->lruvec = lruvec; + walk->max_seq = max_seq; + walk->can_swap = can_swap; + walk->force_scan = force_scan; + + do { + success = iterate_mm_list(lruvec, walk, &mm); + if (mm) + walk_mm(lruvec, mm, walk); + + cond_resched(); + } while (mm); + + free_mm_walk(walk); +done: + if (!success) { + if (!current_is_kswapd() && !sc->priority) + wait_event_killable(lruvec->mm_state.wait, + max_seq < READ_ONCE(lrugen->max_seq)); + + return max_seq < READ_ONCE(lrugen->max_seq); + } + + VM_WARN_ON_ONCE(max_seq != READ_ONCE(lrugen->max_seq)); + + inc_max_seq(lruvec, can_swap, force_scan); + /* either this sees any waiters or they will see updated max_seq */ + if (wq_has_sleeper(&lruvec->mm_state.wait)) + wake_up_all(&lruvec->mm_state.wait); + + wakeup_flusher_threads(WB_REASON_VMSCAN); + + return true; +} + +static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq, + unsigned long *min_seq, bool can_swap, bool *need_aging) +{ + int gen, type, zone; + long old = 0; + long young = 0; + long total = 0; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + for (type = !can_swap; type < ANON_AND_FILE; type++) { + unsigned long seq; + + for (seq = min_seq[type]; seq <= max_seq; seq++) { + long size = 0; + + gen = lru_gen_from_seq(seq); + + for (zone = 0; zone < MAX_NR_ZONES; zone++) + size += READ_ONCE(lrugen->nr_pages[gen][type][zone]); + + total += size; + if (seq == max_seq) + young += size; + if (seq + MIN_NR_GENS == max_seq) + old += size; + } + } + + /* + * The aging tries to be lazy to reduce the overhead. On the other hand, + * the eviction stalls when the number of generations reaches + * MIN_NR_GENS. So ideally, there should be MIN_NR_GENS+1 generations, + * hence the first two if's. + * + * Also it's ideal to spread pages out evenly, meaning 1/(MIN_NR_GENS+1) + * of the total number of pages for each generation. A reasonable range + * for this average portion is [1/MIN_NR_GENS, 1/(MIN_NR_GENS+2)]. The + * eviction cares about the lower bound of cold pages, whereas the aging + * cares about the upper bound of hot pages. + */ + if (min_seq[!can_swap] + MIN_NR_GENS > max_seq) + *need_aging = true; + else if (min_seq[!can_swap] + MIN_NR_GENS < max_seq) + *need_aging = false; + else if (young * MIN_NR_GENS > total) + *need_aging = true; + else if (old * (MIN_NR_GENS + 2) < total) + *need_aging = true; + else + *need_aging = false; + + return total > 0 ? 
total : 0; +} + +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, + unsigned long min_ttl) +{ + bool need_aging; + long nr_to_scan; + int swappiness = get_swappiness(lruvec, sc); + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + DEFINE_MAX_SEQ(lruvec); + DEFINE_MIN_SEQ(lruvec); + + VM_WARN_ON_ONCE(sc->memcg_low_reclaim); + + if (min_ttl) { + int gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]); + unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]); + + if (time_is_after_jiffies(birth + min_ttl)) + return false; + } + + mem_cgroup_calculate_protection(NULL, memcg); + + if (mem_cgroup_below_min(memcg)) + return false; + + nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, swappiness, &need_aging); + if (!nr_to_scan) + return false; + + nr_to_scan >>= sc->priority; + + if (!mem_cgroup_online(memcg)) + nr_to_scan++; + + if (nr_to_scan && need_aging) + try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false); + + return true; +} + +/* to protect the working set of the last N jiffies */ +static unsigned long lru_gen_min_ttl __read_mostly; + +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) +{ + struct mem_cgroup *memcg; + bool success = false; + unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl); + + VM_WARN_ON_ONCE(!current_is_kswapd()); + + /* + * To reduce the chance of going into the aging path or swapping, which + * can be costly, optimistically skip them unless their corresponding + * flags were cleared in the eviction path. This improves the overall + * performance when multiple memcgs are available. + */ + if (!sc->memcgs_need_aging) { + sc->memcgs_need_aging = true; + sc->memcgs_avoid_swapping = !sc->memcgs_need_swapping; + sc->memcgs_need_swapping = true; + return; + } + + sc->memcgs_need_swapping = true; + sc->memcgs_avoid_swapping = true; + + current->reclaim_state->mm_walk = &pgdat->mm_walk; + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); + + if (age_lruvec(lruvec, sc, min_ttl)) + success = true; + + cond_resched(); + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); + + current->reclaim_state->mm_walk = NULL; + + /* + * The main goal is to OOM kill if every generation from all memcgs is + * younger than min_ttl. However, another theoretical possibility is all + * memcgs are either below min or empty. + */ + if (!success && !sc->order && mutex_trylock(&oom_lock)) { + struct oom_control oc = { + .gfp_mask = sc->gfp_mask, + }; + + out_of_memory(&oc); + + mutex_unlock(&oom_lock); + } +} + +/* + * This function exploits spatial locality when shrink_page_list() walks the + * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If + * the scan was done cacheline efficiently, it adds the PMD entry pointing to + * the PTE table to the Bloom filter. This forms a feedback loop between the + * eviction and the aging. 
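+ * The scan is bounded to at most MIN_LRU_BATCH PTEs around the faulting
+ * PTE, so the extra work per rmap walk stays constant.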
+ */ +void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) +{ + int i; + pte_t *pte; + unsigned long start; + unsigned long end; + unsigned long addr; + struct lru_gen_mm_walk *walk; + int young = 0; + unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {}; + struct folio *folio = pfn_folio(pvmw->pfn); + struct mem_cgroup *memcg = folio_memcg(folio); + struct pglist_data *pgdat = folio_pgdat(folio); + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); + DEFINE_MAX_SEQ(lruvec); + int old_gen, new_gen = lru_gen_from_seq(max_seq); + + lockdep_assert_held(pvmw->ptl); + VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio); + + if (spin_is_contended(pvmw->ptl)) + return; + + start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start); + end = pmd_addr_end(pvmw->address, pvmw->vma->vm_end); + + if (end - start > MIN_LRU_BATCH * PAGE_SIZE) { + if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2) + end = start + MIN_LRU_BATCH * PAGE_SIZE; + else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2) + start = end - MIN_LRU_BATCH * PAGE_SIZE; + else { + start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2; + end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2; + } + } + + pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE; + + rcu_read_lock(); + arch_enter_lazy_mmu_mode(); + + for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) { + unsigned long pfn = pte_pfn(pte[i]); + + VM_WARN_ON_ONCE(addr < pvmw->vma->vm_start || addr >= pvmw->vma->vm_end); + + if (!pte_present(pte[i]) || is_zero_pfn(pfn)) + continue; + + if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i]))) + continue; + + if (!pte_young(pte[i])) + continue; + + VM_WARN_ON_ONCE(!pfn_valid(pfn)); + if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) + continue; + + folio = pfn_folio(pfn); + if (folio_nid(folio) != pgdat->node_id) + continue; + + if (folio_memcg_rcu(folio) != memcg) + continue; + + if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) + continue; + + young++; + + if (pte_dirty(pte[i]) && !folio_test_dirty(folio) && + !(folio_test_anon(folio) && folio_test_swapbacked(folio) && + !folio_test_swapcache(folio))) + folio_mark_dirty(folio); + + old_gen = folio_lru_gen(folio); + if (old_gen < 0) + folio_set_referenced(folio); + else if (old_gen != new_gen) + __set_bit(i, bitmap); + } + + arch_leave_lazy_mmu_mode(); + rcu_read_unlock(); + + /* feedback from rmap walkers to page table walkers */ + if (suitable_to_scan(i, young)) + update_bloom_filter(lruvec, max_seq, pvmw->pmd); + + walk = current->reclaim_state ? 
current->reclaim_state->mm_walk : NULL; + + if (!walk && bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) { + for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { + folio = pfn_folio(pte_pfn(pte[i])); + folio_activate(folio); + } + return; + } + + /* folio_update_gen() requires stable folio_memcg() */ + if (!mem_cgroup_trylock_pages(memcg)) + return; + + if (!walk) { + spin_lock_irq(&lruvec->lru_lock); + new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq); + } + + for_each_set_bit(i, bitmap, MIN_LRU_BATCH) { + folio = pfn_folio(pte_pfn(pte[i])); + if (folio_memcg_rcu(folio) != memcg) + continue; + + old_gen = folio_update_gen(folio, new_gen); + if (old_gen < 0 || old_gen == new_gen) + continue; + + if (walk) + update_batch_size(walk, folio, old_gen, new_gen); + else + lru_gen_update_size(lruvec, folio, old_gen, new_gen); + } + + if (!walk) + spin_unlock_irq(&lruvec->lru_lock); + + mem_cgroup_unlock_pages(); +} + +/****************************************************************************** + * the eviction + ******************************************************************************/ + +static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx) +{ + bool success; + int gen = folio_lru_gen(folio); + int type = folio_is_file_lru(folio); + int zone = folio_zonenum(folio); + int delta = folio_nr_pages(folio); + int refs = folio_lru_refs(folio); + int tier = lru_tier_from_refs(refs); + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio); + + /* unevictable */ + if (!folio_evictable(folio)) { + success = lru_gen_del_folio(lruvec, folio, true); + VM_WARN_ON_ONCE_FOLIO(!success, folio); + folio_set_unevictable(folio); + lruvec_add_folio(lruvec, folio); + __count_vm_events(UNEVICTABLE_PGCULLED, delta); + return true; + } + + /* dirtied lazyfree */ + if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) { + success = lru_gen_del_folio(lruvec, folio, true); + VM_WARN_ON_ONCE_FOLIO(!success, folio); + folio_set_swapbacked(folio); + lruvec_add_folio_tail(lruvec, folio); + return true; + } + + /* promoted */ + if (gen != lru_gen_from_seq(lrugen->min_seq[type])) { + list_move(&folio->lru, &lrugen->lists[gen][type][zone]); + return true; + } + + /* protected */ + if (tier > tier_idx) { + int hist = lru_hist_from_seq(lrugen->min_seq[type]); + + gen = folio_inc_gen(lruvec, folio, false); + list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]); + + WRITE_ONCE(lrugen->protected[hist][type][tier - 1], + lrugen->protected[hist][type][tier - 1] + delta); + __mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta); + return true; + } + + /* waiting for writeback */ + if (folio_test_locked(folio) || folio_test_writeback(folio) || + (type == LRU_GEN_FILE && folio_test_dirty(folio))) { + gen = folio_inc_gen(lruvec, folio, true); + list_move(&folio->lru, &lrugen->lists[gen][type][zone]); + return true; + } + + return false; +} + +static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc) +{ + bool success; + + if (!sc->may_unmap && folio_mapped(folio)) + return false; + + if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) && + (folio_test_dirty(folio) || + (folio_test_anon(folio) && !folio_test_swapcache(folio)))) + return false; + + if (!folio_try_get(folio)) + return false; + + if (!folio_test_clear_lru(folio)) { + folio_put(folio); + return false; + } + + success = lru_gen_del_folio(lruvec, folio, true); + VM_WARN_ON_ONCE_FOLIO(!success, folio); + + return true; +} 
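+
+/*
+ * scan_folios() scans folios in the oldest generation of the given type,
+ * from the highest eligible zone downwards. Each folio is either re-sorted
+ * by sort_folio(), isolated onto @list for eviction, or put back and
+ * counted as skipped. It returns the number of folios scanned, or 0 when
+ * nothing was isolated and the scan batch was not exhausted, so that the
+ * caller does not livelock.
+ */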
+ +static int scan_folios(struct lruvec *lruvec, struct scan_control *sc, + int type, int tier, struct list_head *list) +{ + int gen, zone; + enum vm_event_item item; + int sorted = 0; + int scanned = 0; + int isolated = 0; + int remaining = MAX_LRU_BATCH; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + + VM_WARN_ON_ONCE(!list_empty(list)); + + if (get_nr_gens(lruvec, type) == MIN_NR_GENS) + return 0; + + gen = lru_gen_from_seq(lrugen->min_seq[type]); + + for (zone = sc->reclaim_idx; zone >= 0; zone--) { + LIST_HEAD(moved); + int skipped = 0; + struct list_head *head = &lrugen->lists[gen][type][zone]; + + while (!list_empty(head)) { + struct folio *folio = lru_to_folio(head); + int delta = folio_nr_pages(folio); + + VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio); + VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio); + + scanned += delta; + + if (sort_folio(lruvec, folio, tier)) + sorted += delta; + else if (isolate_folio(lruvec, folio, sc)) { + list_add(&folio->lru, list); + isolated += delta; + } else { + list_move(&folio->lru, &moved); + skipped += delta; + } + + if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH) + break; + } + + if (skipped) { + list_splice(&moved, head); + __count_zid_vm_events(PGSCAN_SKIP, zone, skipped); + } + + if (!remaining || isolated >= MIN_LRU_BATCH) + break; + } + + item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT; + if (!cgroup_reclaim(sc)) { + __count_vm_events(item, isolated); + __count_vm_events(PGREFILL, sorted); + } + __count_memcg_events(memcg, item, isolated); + __count_memcg_events(memcg, PGREFILL, sorted); + __count_vm_events(PGSCAN_ANON + type, isolated); + + /* + * There might not be eligible pages due to reclaim_idx, may_unmap and + * may_writepage. Check the remaining to prevent livelock if there is no + * progress. + */ + return isolated || !remaining ? scanned : 0; +} + +static int get_tier_idx(struct lruvec *lruvec, int type) +{ + int tier; + struct ctrl_pos sp, pv; + + /* + * To leave a margin for fluctuations, use a larger gain factor (1:2). + * This value is chosen because any other tier would have at least twice + * as many refaults as the first tier. + */ + read_ctrl_pos(lruvec, type, 0, 1, &sp); + for (tier = 1; tier < MAX_NR_TIERS; tier++) { + read_ctrl_pos(lruvec, type, tier, 2, &pv); + if (!positive_ctrl_err(&sp, &pv)) + break; + } + + return tier - 1; +} + +static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx) +{ + int type, tier; + struct ctrl_pos sp, pv; + int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness }; + + /* + * Compare the first tier of anon with that of file to determine which + * type to scan. Also need to compare other tiers of the selected type + * with the first tier of the other type to determine the last tier (of + * the selected type) to evict. 
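+	 * The gain factors are derived from swappiness: anon is weighted by
+	 * swappiness and file by 200-swappiness, matching the semantics of
+	 * /proc/sys/vm/swappiness.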
+ */ + read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp); + read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv); + type = positive_ctrl_err(&sp, &pv); + + read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp); + for (tier = 1; tier < MAX_NR_TIERS; tier++) { + read_ctrl_pos(lruvec, type, tier, gain[type], &pv); + if (!positive_ctrl_err(&sp, &pv)) + break; + } + + *tier_idx = tier - 1; + + return type; +} + +static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness, + int *type_scanned, struct list_head *list) +{ + int i; + int type; + int scanned; + int tier = -1; + DEFINE_MIN_SEQ(lruvec); + + /* + * Try to make the obvious choice first. When anon and file are both + * available from the same generation, interpret swappiness 1 as file + * first and 200 as anon first. + */ + if (!swappiness) + type = LRU_GEN_FILE; + else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE]) + type = LRU_GEN_ANON; + else if (swappiness == 1) + type = LRU_GEN_FILE; + else if (swappiness == 200) + type = LRU_GEN_ANON; + else + type = get_type_to_scan(lruvec, swappiness, &tier); + + for (i = !swappiness; i < ANON_AND_FILE; i++) { + if (tier < 0) + tier = get_tier_idx(lruvec, type); + + scanned = scan_folios(lruvec, sc, type, tier, list); + if (scanned) + break; + + type = !type; + tier = -1; + } + + *type_scanned = type; + + return scanned; +} + +static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness, + bool *swapped) +{ + int type; + int scanned; + int reclaimed; + LIST_HEAD(list); + struct folio *folio; + enum vm_event_item item; + struct reclaim_stat stat; + struct lru_gen_mm_walk *walk; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + struct pglist_data *pgdat = lruvec_pgdat(lruvec); + + spin_lock_irq(&lruvec->lru_lock); + + scanned = isolate_folios(lruvec, sc, swappiness, &type, &list); + + if (try_to_inc_min_seq(lruvec, swappiness)) + scanned++; + + if (get_nr_gens(lruvec, !swappiness) == MIN_NR_GENS) + scanned = 0; + + spin_unlock_irq(&lruvec->lru_lock); + + if (list_empty(&list)) + return scanned; + + reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false); + + /* + * To avoid livelock, don't add rejected pages back to the same lists + * they were isolated from. See lru_gen_add_folio(). + */ + list_for_each_entry(folio, &list, lru) { + folio_clear_referenced(folio); + folio_clear_workingset(folio); + + if (folio_test_reclaim(folio) && + (folio_test_dirty(folio) || folio_test_writeback(folio))) + folio_clear_active(folio); + else + folio_set_active(folio); + } + + spin_lock_irq(&lruvec->lru_lock); + + move_pages_to_lru(lruvec, &list); + + walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL; + if (walk && walk->batched) + reset_batch_size(lruvec, walk); + + item = current_is_kswapd() ? 
PGSTEAL_KSWAPD : PGSTEAL_DIRECT; + if (!cgroup_reclaim(sc)) + __count_vm_events(item, reclaimed); + __count_memcg_events(memcg, item, reclaimed); + __count_vm_events(PGSTEAL_ANON + type, reclaimed); + + spin_unlock_irq(&lruvec->lru_lock); + + mem_cgroup_uncharge_list(&list); + free_unref_page_list(&list); + + sc->nr_reclaimed += reclaimed; + + if (type == LRU_GEN_ANON && swapped) + *swapped = true; + + return scanned; +} + +static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap) +{ + bool need_aging; + long nr_to_scan; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + DEFINE_MAX_SEQ(lruvec); + DEFINE_MIN_SEQ(lruvec); + + if (mem_cgroup_below_min(memcg) || + (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim)) + return 0; + + nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, can_swap, &need_aging); + if (!nr_to_scan) + return 0; + + /* reset the priority if the target has been met */ + nr_to_scan >>= sc->nr_reclaimed < sc->nr_to_reclaim ? sc->priority : DEF_PRIORITY; + + if (!mem_cgroup_online(memcg)) + nr_to_scan++; + + if (!nr_to_scan) + return 0; + + if (!need_aging) { + sc->memcgs_need_aging = false; + return nr_to_scan; + } + + /* leave the work to lru_gen_age_node() */ + if (current_is_kswapd()) + return 0; + + /* try other memcgs before going to the aging path */ + if (!cgroup_reclaim(sc) && !sc->force_deactivate) { + sc->skipped_deactivate = true; + return 0; + } + + if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false)) + return nr_to_scan; + + return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? nr_to_scan : 0; +} + +static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) +{ + struct blk_plug plug; + long scanned = 0; + bool swapped = false; + unsigned long reclaimed = sc->nr_reclaimed; + struct pglist_data *pgdat = lruvec_pgdat(lruvec); + + lru_add_drain(); + + blk_start_plug(&plug); + + if (current_is_kswapd()) + current->reclaim_state->mm_walk = &pgdat->mm_walk; + + while (true) { + int delta; + int swappiness; + long nr_to_scan; + + if (sc->may_swap) + swappiness = get_swappiness(lruvec, sc); + else if (!cgroup_reclaim(sc) && get_swappiness(lruvec, sc)) + swappiness = 1; + else + swappiness = 0; + + nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness); + if (!nr_to_scan) + break; + + delta = evict_folios(lruvec, sc, swappiness, &swapped); + if (!delta) + break; + + if (sc->memcgs_avoid_swapping && swappiness < 200 && swapped) + break; + + scanned += delta; + if (scanned >= nr_to_scan) { + if (!swapped && sc->nr_reclaimed - reclaimed >= MIN_LRU_BATCH) + sc->memcgs_need_swapping = false; + break; + } + + cond_resched(); + } + + if (current_is_kswapd()) + current->reclaim_state->mm_walk = NULL; + + blk_finish_plug(&plug); +} + +/****************************************************************************** + * state change + ******************************************************************************/ + +static bool __maybe_unused state_is_valid(struct lruvec *lruvec) +{ + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + if (lrugen->enabled) { + enum lru_list lru; + + for_each_evictable_lru(lru) { + if (!list_empty(&lruvec->lists[lru])) + return false; + } + } else { + int gen, type, zone; + + for_each_gen_type_zone(gen, type, zone) { + if (!list_empty(&lrugen->lists[gen][type][zone])) + return false; + + /* unlikely but not a bug when reset_batch_size() is pending */ + VM_WARN_ON_ONCE(lrugen->nr_pages[gen][type][zone]); + } + } + + return true; +} + +static bool fill_evictable(struct lruvec 
*lruvec) +{ + enum lru_list lru; + int remaining = MAX_LRU_BATCH; + + for_each_evictable_lru(lru) { + int type = is_file_lru(lru); + bool active = is_active_lru(lru); + struct list_head *head = &lruvec->lists[lru]; + + while (!list_empty(head)) { + bool success; + struct folio *folio = lru_to_folio(head); + + VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio) != active, folio); + VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio); + VM_WARN_ON_ONCE_FOLIO(folio_lru_gen(folio) != -1, folio); + + lruvec_del_folio(lruvec, folio); + success = lru_gen_add_folio(lruvec, folio, false); + VM_WARN_ON_ONCE(!success); + + if (!--remaining) + return false; + } + } + + return true; +} + +static bool drain_evictable(struct lruvec *lruvec) +{ + int gen, type, zone; + int remaining = MAX_LRU_BATCH; + + for_each_gen_type_zone(gen, type, zone) { + struct list_head *head = &lruvec->lrugen.lists[gen][type][zone]; + + while (!list_empty(head)) { + bool success; + struct folio *folio = lru_to_folio(head); + + VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio); + VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio); + VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio); + + success = lru_gen_del_folio(lruvec, folio, false); + VM_WARN_ON_ONCE(!success); + lruvec_add_folio(lruvec, folio); + + if (!--remaining) + return false; + } + } + + return true; +} + +static void lru_gen_change_state(bool enable) +{ + static DEFINE_MUTEX(state_mutex); + + struct mem_cgroup *memcg; + + cgroup_lock(); + cpus_read_lock(); + get_online_mems(); + mutex_lock(&state_mutex); + + if (enable == lru_gen_enabled()) + goto unlock; + + if (enable) + static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]); + else + static_branch_disable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]); + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + int nid; + + for_each_node(nid) { + struct lruvec *lruvec = get_lruvec(memcg, nid); + + if (!lruvec) + continue; + + spin_lock_irq(&lruvec->lru_lock); + + VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); + VM_WARN_ON_ONCE(!state_is_valid(lruvec)); + + lruvec->lrugen.enabled = enable; + + while (!(enable ? 
fill_evictable(lruvec) : drain_evictable(lruvec))) { + spin_unlock_irq(&lruvec->lru_lock); + cond_resched(); + spin_lock_irq(&lruvec->lru_lock); + } + + spin_unlock_irq(&lruvec->lru_lock); + } + + cond_resched(); + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); +unlock: + mutex_unlock(&state_mutex); + put_online_mems(); + cpus_read_unlock(); + cgroup_unlock(); +} + +/****************************************************************************** + * sysfs interface + ******************************************************************************/ + +static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, char *buf) +{ + return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl))); +} + +static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t len) +{ + unsigned int msecs; + + if (kstrtouint(buf, 0, &msecs)) + return -EINVAL; + + WRITE_ONCE(lru_gen_min_ttl, msecs_to_jiffies(msecs)); + + return len; +} + +static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR( + min_ttl_ms, 0644, show_min_ttl, store_min_ttl +); + +static ssize_t show_enable(struct kobject *kobj, struct kobj_attribute *attr, char *buf) +{ + unsigned int caps = 0; + + if (get_cap(LRU_GEN_CORE)) + caps |= BIT(LRU_GEN_CORE); + + if (arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK)) + caps |= BIT(LRU_GEN_MM_WALK); + + if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && get_cap(LRU_GEN_NONLEAF_YOUNG)) + caps |= BIT(LRU_GEN_NONLEAF_YOUNG); + + return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps); +} + +static ssize_t store_enable(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t len) +{ + int i; + unsigned int caps; + + if (tolower(*buf) == 'n') + caps = 0; + else if (tolower(*buf) == 'y') + caps = -1; + else if (kstrtouint(buf, 0, &caps)) + return -EINVAL; + + for (i = 0; i < NR_LRU_GEN_CAPS; i++) { + bool enable = caps & BIT(i); + + if (i == LRU_GEN_CORE) + lru_gen_change_state(enable); + else if (enable) + static_branch_enable(&lru_gen_caps[i]); + else + static_branch_disable(&lru_gen_caps[i]); + } + + return len; +} + +static struct kobj_attribute lru_gen_enabled_attr = __ATTR( + enabled, 0644, show_enable, store_enable +); + +static struct attribute *lru_gen_attrs[] = { + &lru_gen_min_ttl_attr.attr, + &lru_gen_enabled_attr.attr, + NULL +}; + +static struct attribute_group lru_gen_attr_group = { + .name = "lru_gen", + .attrs = lru_gen_attrs, +}; + +/****************************************************************************** + * debugfs interface + ******************************************************************************/ + +static void *lru_gen_seq_start(struct seq_file *m, loff_t *pos) +{ + struct mem_cgroup *memcg; + loff_t nr_to_skip = *pos; + + m->private = kvmalloc(PATH_MAX, GFP_KERNEL); + if (!m->private) + return ERR_PTR(-ENOMEM); + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + int nid; + + for_each_node_state(nid, N_MEMORY) { + if (!nr_to_skip--) + return get_lruvec(memcg, nid); + } + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); + + return NULL; +} + +static void lru_gen_seq_stop(struct seq_file *m, void *v) +{ + if (!IS_ERR_OR_NULL(v)) + mem_cgroup_iter_break(NULL, lruvec_memcg(v)); + + kvfree(m->private); + m->private = NULL; +} + +static void *lru_gen_seq_next(struct seq_file *m, void *v, loff_t *pos) +{ + int nid = lruvec_pgdat(v)->node_id; + struct mem_cgroup *memcg = lruvec_memcg(v); + + ++*pos; + + nid = next_memory_node(nid); + if (nid == MAX_NUMNODES) { + memcg = 
mem_cgroup_iter(NULL, memcg, NULL); + if (!memcg) + return NULL; + + nid = first_memory_node; + } + + return get_lruvec(memcg, nid); +} + +static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec, + unsigned long max_seq, unsigned long *min_seq, + unsigned long seq) +{ + int i; + int type, tier; + int hist = lru_hist_from_seq(seq); + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + for (tier = 0; tier < MAX_NR_TIERS; tier++) { + seq_printf(m, " %10d", tier); + for (type = 0; type < ANON_AND_FILE; type++) { + unsigned long n[3] = {}; + + if (seq == max_seq) { + n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]); + n[1] = READ_ONCE(lrugen->avg_total[type][tier]); + + seq_printf(m, " %10luR %10luT %10lu ", n[0], n[1], n[2]); + } else if (seq == min_seq[type] || NR_HIST_GENS > 1) { + n[0] = atomic_long_read(&lrugen->refaulted[hist][type][tier]); + n[1] = atomic_long_read(&lrugen->evicted[hist][type][tier]); + if (tier) + n[2] = READ_ONCE(lrugen->protected[hist][type][tier - 1]); + + seq_printf(m, " %10lur %10lue %10lup", n[0], n[1], n[2]); + } else + seq_puts(m, " 0 0 0 "); + } + seq_putc(m, '\n'); + } + + seq_puts(m, " "); + for (i = 0; i < NR_MM_STATS; i++) { + if (seq == max_seq && NR_HIST_GENS == 1) + seq_printf(m, " %10lu%c", READ_ONCE(lruvec->mm_state.stats[hist][i]), + toupper(MM_STAT_CODES[i])); + else if (seq != max_seq && NR_HIST_GENS > 1) + seq_printf(m, " %10lu%c", READ_ONCE(lruvec->mm_state.stats[hist][i]), + MM_STAT_CODES[i]); + else + seq_puts(m, " 0 "); + } + seq_putc(m, '\n'); +} + +static int lru_gen_seq_show(struct seq_file *m, void *v) +{ + unsigned long seq; + bool full = !debugfs_real_fops(m->file)->write; + struct lruvec *lruvec = v; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + int nid = lruvec_pgdat(lruvec)->node_id; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + DEFINE_MAX_SEQ(lruvec); + DEFINE_MIN_SEQ(lruvec); + + if (nid == first_memory_node) { + const char *path = memcg ? m->private : ""; + +#ifdef CONFIG_MEMCG + if (memcg) + cgroup_path(memcg->css.cgroup, m->private, PATH_MAX); +#endif + seq_printf(m, "memcg %5hu %s\n", mem_cgroup_id(memcg), path); + } + + seq_printf(m, " node %5d\n", nid); + + if (!full) + seq = min_seq[LRU_GEN_ANON]; + else if (max_seq >= MAX_NR_GENS) + seq = max_seq - MAX_NR_GENS + 1; + else + seq = 0; + + for (; seq <= max_seq; seq++) { + int type, zone; + int gen = lru_gen_from_seq(seq); + unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]); + + seq_printf(m, " %10lu %10u", seq, jiffies_to_msecs(jiffies - birth)); + + for (type = 0; type < ANON_AND_FILE; type++) { + long size = 0; + char mark = full && seq < min_seq[type] ? 
'x' : ' '; + + for (zone = 0; zone < MAX_NR_ZONES; zone++) + size += READ_ONCE(lrugen->nr_pages[gen][type][zone]); + + seq_printf(m, " %10lu%c", max(size, 0L), mark); + } + + seq_putc(m, '\n'); + + if (full) + lru_gen_seq_show_full(m, lruvec, max_seq, min_seq, seq); + } + + return 0; +} + +static const struct seq_operations lru_gen_seq_ops = { + .start = lru_gen_seq_start, + .stop = lru_gen_seq_stop, + .next = lru_gen_seq_next, + .show = lru_gen_seq_show, +}; + +static int run_aging(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc, + bool can_swap, bool force_scan) +{ + DEFINE_MAX_SEQ(lruvec); + DEFINE_MIN_SEQ(lruvec); + + if (seq < max_seq) + return 0; + + if (seq > max_seq || (!force_scan && min_seq[!can_swap] + MAX_NR_GENS - 1 <= max_seq)) + return -EINVAL; + + try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, force_scan); + + return 0; +} + +static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc, + int swappiness, unsigned long nr_to_reclaim) +{ + struct blk_plug plug; + int err = -EINTR; + DEFINE_MAX_SEQ(lruvec); + + if (seq + MIN_NR_GENS > max_seq) + return -EINVAL; + + sc->nr_reclaimed = 0; + + blk_start_plug(&plug); + + while (!signal_pending(current)) { + DEFINE_MIN_SEQ(lruvec); + + if (seq < min_seq[!swappiness] || sc->nr_reclaimed >= nr_to_reclaim || + !evict_folios(lruvec, sc, swappiness, NULL)) { + err = 0; + break; + } + + cond_resched(); + } + + blk_finish_plug(&plug); + + return err; +} + +static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq, + struct scan_control *sc, int swappiness, unsigned long opt) +{ + struct lruvec *lruvec; + int err = -EINVAL; + struct mem_cgroup *memcg = NULL; + + if (!mem_cgroup_disabled()) { + rcu_read_lock(); + memcg = mem_cgroup_from_id(memcg_id); +#ifdef CONFIG_MEMCG + if (memcg && !css_tryget(&memcg->css)) + memcg = NULL; +#endif + rcu_read_unlock(); + + if (!memcg) + goto done; + } + if (memcg_id != mem_cgroup_id(memcg)) + goto done; + + if (nid < 0 || nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY)) + goto done; + + lruvec = get_lruvec(memcg, nid); + + if (swappiness < 0) + swappiness = get_swappiness(lruvec, sc); + else if (swappiness > 200) + goto done; + + switch (cmd) { + case '+': + err = run_aging(lruvec, seq, sc, swappiness, opt); + break; + case '-': + err = run_eviction(lruvec, seq, sc, swappiness, opt); + break; + } +done: + mem_cgroup_put(memcg); + + return err; +} + +static ssize_t lru_gen_seq_write(struct file *file, const char __user *src, + size_t len, loff_t *pos) +{ + void *buf; + char *cur, *next; + unsigned int flags; + int err = 0; + struct scan_control sc = { + .may_writepage = true, + .may_unmap = true, + .may_swap = true, + .reclaim_idx = MAX_NR_ZONES - 1, + .gfp_mask = GFP_KERNEL, + }; + + buf = kvmalloc(len + 1, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + if (copy_from_user(buf, src, len)) { + kvfree(buf); + return -EFAULT; + } + + next = buf; + next[len] = '\0'; + + sc.reclaim_state.mm_walk = alloc_mm_walk(); + if (!sc.reclaim_state.mm_walk) { + kvfree(buf); + return -ENOMEM; + } + + set_task_reclaim_state(current, &sc.reclaim_state); + flags = memalloc_noreclaim_save(); + + while ((cur = strsep(&next, ",;\n"))) { + int n; + int end; + char cmd; + unsigned int memcg_id; + unsigned int nid; + unsigned long seq; + unsigned int swappiness = -1; + unsigned long opt = -1; + + cur = skip_spaces(cur); + if (!*cur) + continue; + + n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid, + &seq, &end, &swappiness, &end, &opt, &end); + if 
(n < 4 || cur[end]) { + err = -EINVAL; + break; + } + + err = run_cmd(cmd, memcg_id, nid, seq, &sc, swappiness, opt); + if (err) + break; + } + + memalloc_noreclaim_restore(flags); + set_task_reclaim_state(current, NULL); + + free_mm_walk(sc.reclaim_state.mm_walk); + kvfree(buf); + + return err ? : len; +} + +static int lru_gen_seq_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &lru_gen_seq_ops); +} + +static const struct file_operations lru_gen_rw_fops = { + .open = lru_gen_seq_open, + .read = seq_read, + .write = lru_gen_seq_write, + .llseek = seq_lseek, + .release = seq_release, +}; + +static const struct file_operations lru_gen_ro_fops = { + .open = lru_gen_seq_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; + +/****************************************************************************** + * initialization + ******************************************************************************/ + +void lru_gen_init_lruvec(struct lruvec *lruvec) +{ + int i; + int gen, type, zone; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + lrugen->max_seq = MIN_NR_GENS + 1; + lrugen->enabled = lru_gen_enabled(); + + for (i = 0; i <= MIN_NR_GENS + 1; i++) + lrugen->timestamps[i] = jiffies; + + for_each_gen_type_zone(gen, type, zone) + INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]); + + lruvec->mm_state.seq = MIN_NR_GENS; + init_waitqueue_head(&lruvec->mm_state.wait); +} + +#ifdef CONFIG_MEMCG +void lru_gen_init_memcg(struct mem_cgroup *memcg) +{ + INIT_LIST_HEAD(&memcg->mm_list.fifo); + spin_lock_init(&memcg->mm_list.lock); +} + +void lru_gen_exit_memcg(struct mem_cgroup *memcg) +{ + int i; + int nid; + + for_each_node(nid) { + struct lruvec *lruvec = get_lruvec(memcg, nid); + + VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0, + sizeof(lruvec->lrugen.nr_pages))); + + for (i = 0; i < NR_BLOOM_FILTERS; i++) { + bitmap_free(lruvec->mm_state.filters[i]); + lruvec->mm_state.filters[i] = NULL; + } + } +} +#endif + +static int __init init_lru_gen(void) +{ + BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS); + BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS); + BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1); + + if (sysfs_create_group(mm_kobj, &lru_gen_attr_group)) + pr_err("lru_gen: failed to create sysfs group\n"); + + debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops); + debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops); + + return 0; +}; +late_initcall(init_lru_gen); + +#else + +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) +{ +} + +static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) +{ +} + +#endif /* CONFIG_LRU_GEN */ + +static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) +{ + unsigned long nr[NR_LRU_LISTS]; + unsigned long targets[NR_LRU_LISTS]; + unsigned long nr_to_scan; + enum lru_list lru; + unsigned long nr_reclaimed = 0; + unsigned long nr_to_reclaim = sc->nr_to_reclaim; + struct blk_plug plug; + bool scan_adjusted; + + if (lru_gen_enabled()) { + lru_gen_shrink_lruvec(lruvec, sc); + return; + } + + get_scan_count(lruvec, sc, nr); + + /* Record the original scan target for proportional adjustments later */ + memcpy(targets, nr, sizeof(nr)); + + /* + * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal + * event that can occur when there is little memory pressure e.g. + * multiple streaming readers/writers. 
Hence, we do not abort scanning + * when the requested number of pages are reclaimed when scanning at + * DEF_PRIORITY on the assumption that the fact we are direct + * reclaiming implies that kswapd is not keeping up and it is best to + * do a batch of work at once. For memcg reclaim one check is made to + * abort proportional reclaim if either the file or anon lru has already + * dropped to zero at the first pass. + */ + scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() && + sc->priority == DEF_PRIORITY); + + blk_start_plug(&plug); + while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || + nr[LRU_INACTIVE_FILE]) { + unsigned long nr_anon, nr_file, percentage; + unsigned long nr_scanned; + + for_each_evictable_lru(lru) { + if (nr[lru]) { + nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX); + nr[lru] -= nr_to_scan; + + nr_reclaimed += shrink_list(lru, nr_to_scan, + lruvec, sc); + } + } + + cond_resched(); + + if (nr_reclaimed < nr_to_reclaim || scan_adjusted) + continue; + + /* + * For kswapd and memcg, reclaim at least the number of pages + * requested. Ensure that the anon and file LRUs are scanned + * proportionally what was requested by get_scan_count(). We + * stop reclaiming one LRU and reduce the amount scanning + * proportional to the original scan target. + */ + nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE]; + nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON]; + + /* + * It's just vindictive to attack the larger once the smaller + * has gone to zero. And given the way we stop scanning the + * smaller below, this makes sure that we only make one nudge + * towards proportionality once we've got nr_to_reclaim. + */ + if (!nr_file || !nr_anon) + break; + + if (nr_file > nr_anon) { + unsigned long scan_target = targets[LRU_INACTIVE_ANON] + + targets[LRU_ACTIVE_ANON] + 1; + lru = LRU_BASE; + percentage = nr_anon * 100 / scan_target; + } else { + unsigned long scan_target = targets[LRU_INACTIVE_FILE] + + targets[LRU_ACTIVE_FILE] + 1; + lru = LRU_FILE; + percentage = nr_file * 100 / scan_target; + } + + /* Stop scanning the smaller of the LRU */ + nr[lru] = 0; + nr[lru + LRU_ACTIVE] = 0; + + /* + * Recalculate the other LRU scan count based on its original + * scan target and the percentage scanning already complete + */ + lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE; + nr_scanned = targets[lru] - nr[lru]; + nr[lru] = targets[lru] * (100 - percentage) / 100; + nr[lru] -= min(nr[lru], nr_scanned); + + lru += LRU_ACTIVE; + nr_scanned = targets[lru] - nr[lru]; + nr[lru] = targets[lru] * (100 - percentage) / 100; + nr[lru] -= min(nr[lru], nr_scanned); + + scan_adjusted = true; + } + blk_finish_plug(&plug); + sc->nr_reclaimed += nr_reclaimed; + + /* + * Even if we did not try to evict anon pages at all, we want to + * rebalance the anon lru active/inactive ratio. + */ + if (can_age_anon_pages(lruvec_pgdat(lruvec), sc) && + inactive_is_low(lruvec, LRU_INACTIVE_ANON)) + shrink_active_list(SWAP_CLUSTER_MAX, lruvec, + sc, LRU_ACTIVE_ANON); +} + +/* Use reclaim/compaction for costly allocs or under memory pressure */ +static bool in_reclaim_compaction(struct scan_control *sc) +{ + if (IS_ENABLED(CONFIG_COMPACTION) && sc->order && + (sc->order > PAGE_ALLOC_COSTLY_ORDER || + sc->priority < DEF_PRIORITY - 2)) + return true; + + return false; +} + +/* + * Reclaim/compaction is used for high-order allocation requests. It reclaims + * order-0 pages before compacting the zone. 
should_continue_reclaim() returns
+ * true if more pages should be reclaimed such that when the page allocator
+ * calls try_to_compact_pages(), it will have enough free pages to succeed.
+ * It will give up earlier than that if there is difficulty reclaiming pages.
+ */
+static inline bool should_continue_reclaim(struct pglist_data *pgdat,
+					   unsigned long nr_reclaimed,
+					   struct scan_control *sc)
+{
+	unsigned long pages_for_compaction;
+	unsigned long inactive_lru_pages;
+	int z;
+
+	/* If not in reclaim/compaction mode, stop */
+	if (!in_reclaim_compaction(sc))
+		return false;
+
+	/*
+	 * Stop if we failed to reclaim any pages from the last SWAP_CLUSTER_MAX
+	 * number of pages that were scanned. This will return to the caller
+	 * with the risk reclaim/compaction and the resulting allocation attempt
+	 * fails. In the past we have tried harder for __GFP_RETRY_MAYFAIL
@@ -3114,109 +5886,16 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long nr_reclaimed, nr_scanned;
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
-	unsigned long file;
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 again:
-	/*
-	 * Flush the memory cgroup stats, so that we read accurate per-memcg
-	 * lruvec stats for heuristics.
-	 */
-	mem_cgroup_flush_stats();
-
 	memset(&sc->nr, 0, sizeof(sc->nr));
 
 	nr_reclaimed = sc->nr_reclaimed;
 	nr_scanned = sc->nr_scanned;
-
-	/*
-	 * Determine the scan balance between anon and file LRUs.
-	 */
-	spin_lock_irq(&target_lruvec->lru_lock);
-	sc->anon_cost = target_lruvec->anon_cost;
-	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&target_lruvec->lru_lock);
-
-	/*
-	 * Target desirable inactive:active list ratios for the anon
-	 * and file LRU lists.
-	 */
-	if (!sc->force_deactivate) {
-		unsigned long refaults;
-
-		refaults = lruvec_page_state(target_lruvec,
-				WORKINGSET_ACTIVATE_ANON);
-		if (refaults != target_lruvec->refaults[0] ||
-		    inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
-			sc->may_deactivate |= DEACTIVATE_ANON;
-		else
-			sc->may_deactivate &= ~DEACTIVATE_ANON;
-
-		/*
-		 * When refaults are being observed, it means a new
-		 * workingset is being established. Deactivate to get
-		 * rid of any stale active pages quickly.
-		 */
-		refaults = lruvec_page_state(target_lruvec,
-				WORKINGSET_ACTIVATE_FILE);
-		if (refaults != target_lruvec->refaults[1] ||
-		    inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
-			sc->may_deactivate |= DEACTIVATE_FILE;
-		else
-			sc->may_deactivate &= ~DEACTIVATE_FILE;
-	} else
-		sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
-
-	/*
-	 * If we have plenty of inactive file pages that aren't
-	 * thrashing, try to reclaim those first before touching
-	 * anonymous pages.
-	 */
-	file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
-	if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
-		sc->cache_trim_mode = 1;
-	else
-		sc->cache_trim_mode = 0;
-
-	/*
-	 * Prevent the reclaimer from falling into the cache trap: as
-	 * cache pages start out inactive, every cache fault will tip
-	 * the scan balance towards the file LRU. And as the file LRU
-	 * shrinks, so does the window for rotation from references.
-	 * This means we have a runaway feedback loop where a tiny
-	 * thrashing file LRU becomes infinitely more attractive than
-	 * anon pages. Try to detect this based on file LRU size.
-	 */
-	if (!cgroup_reclaim(sc)) {
-		unsigned long total_high_wmark = 0;
-		unsigned long free, anon;
-		int z;
-
-		free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
-		file = node_page_state(pgdat, NR_ACTIVE_FILE) +
-		       node_page_state(pgdat, NR_INACTIVE_FILE);
-
-		for (z = 0; z < MAX_NR_ZONES; z++) {
-			struct zone *zone = &pgdat->node_zones[z];
-			if (!managed_zone(zone))
-				continue;
-
-			total_high_wmark += high_wmark_pages(zone);
-		}
-
-		/*
-		 * Consider anon: if that's low too, this isn't a
-		 * runaway file reclaim problem, but rather just
-		 * extreme pressure. Reclaim as per usual then.
-		 */
-		anon = node_page_state(pgdat, NR_INACTIVE_ANON);
-
-		sc->file_is_tiny =
-			file + free <= total_high_wmark &&
-			!(sc->may_deactivate & DEACTIVATE_ANON) &&
-			anon >> sc->priority;
-	}
+	prepare_scan_count(pgdat, sc);
 
 	shrink_node_memcgs(pgdat, sc);
 
@@ -3473,6 +6152,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 	struct lruvec *target_lruvec;
 	unsigned long refaults;
 
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
 	target_lruvec->refaults[0] = refaults;
@@ -3837,12 +6519,17 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 }
 #endif
 
-static void age_active_anon(struct pglist_data *pgdat,
+static void kswapd_age_node(struct pglist_data *pgdat,
 			    struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
+	if (lru_gen_enabled()) {
+		lru_gen_age_node(pgdat, sc);
+		return;
+	}
+
 	if (!can_age_anon_pages(pgdat, sc))
 		return;
 
@@ -4162,12 +6849,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 		sc.may_swap = !nr_boost_reclaim;
 
 		/*
-		 * Do some background aging of the anon list, to give
-		 * pages a chance to be referenced before reclaiming. All
-		 * pages are rotated regardless of classzone as this is
-		 * about consistent aging.
+		 * Do some background aging, to give pages a chance to be
+		 * referenced before reclaiming. All pages are rotated
+		 * regardless of classzone as this is about consistent aging.
 		 */
-		age_active_anon(pgdat, &sc);
+		kswapd_age_node(pgdat, &sc);
 
 		/*
 		 * If we're getting trouble reclaiming, start doing writepage
diff --git a/mm/workingset.c b/mm/workingset.c
index 592569a8974c..db6f0c8a98c2 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -187,7 +187,6 @@ static unsigned int bucket_order __read_mostly;
 static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
 			 bool workingset)
 {
-	eviction >>= bucket_order;
 	eviction &= EVICTION_MASK;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
@@ -212,10 +211,107 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
-	*evictionp = entry << bucket_order;
+	*evictionp = entry;
 	*workingsetp = workingset;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+static void *lru_gen_eviction(struct folio *folio)
+{
+	int hist;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lru_gen_struct *lrugen;
+	int type = folio_is_file_lru(folio);
+	int delta = folio_nr_pages(folio);
+	int refs = folio_lru_refs(folio);
+	int tier = lru_tier_from_refs(refs);
+	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct pglist_data *pgdat = folio_pgdat(folio);
+
+	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
+
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->lrugen;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+	token = (min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0);
+
+	hist = lru_hist_from_seq(min_seq);
+	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+
+	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
+}
+
+static void lru_gen_refault(struct folio *folio, void *shadow)
+{
+	int hist, tier, refs;
+	int memcg_id;
+	bool workingset;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lru_gen_struct *lrugen;
+	struct mem_cgroup *memcg;
+	struct pglist_data *pgdat;
+	int type = folio_is_file_lru(folio);
+	int delta = folio_nr_pages(folio);
+
+	unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset);
+
+	if (folio_pgdat(folio) != pgdat)
+		return;
+
+	/* see the comment in folio_lru_refs() */
+	refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset;
+	tier = lru_tier_from_refs(refs);
+
+	rcu_read_lock();
+	memcg = folio_memcg_rcu(folio);
+	if (mem_cgroup_id(memcg) != memcg_id)
+		goto unlock;
+
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->lrugen;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+
+	token >>= LRU_REFS_WIDTH;
+	if (token != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH)))
+		goto unlock;
+
+	hist = lru_hist_from_seq(min_seq);
+	atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
+	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
+
+	/*
+	 * Count the following two cases as stalls:
+	 * 1. For pages accessed through page tables, hotter pages pushed out
+	 *    hot pages which refaulted immediately.
+	 * 2. For pages accessed through file descriptors, the number of
+	 *    accesses might have exceeded the limit.
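+	 * In both cases the folio is marked workingset and the refault is
+	 * counted as a WORKINGSET_RESTORE event.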
+ */ + if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) { + folio_set_workingset(folio); + mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta); + } +unlock: + rcu_read_unlock(); +} + +#else + +static void *lru_gen_eviction(struct folio *folio) +{ + return NULL; +} + +static void lru_gen_refault(struct folio *folio, void *shadow) +{ +} + +#endif /* CONFIG_LRU_GEN */ + /** * workingset_age_nonresident - age non-resident entries as LRU ages * @lruvec: the lruvec that was aged @@ -264,10 +360,14 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg) VM_BUG_ON_FOLIO(folio_ref_count(folio), folio); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + if (lru_gen_enabled()) + return lru_gen_eviction(folio); + lruvec = mem_cgroup_lruvec(target_memcg, pgdat); /* XXX: target_memcg can be NULL, go through lruvec */ memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); eviction = atomic_long_read(&lruvec->nonresident_age); + eviction >>= bucket_order; workingset_age_nonresident(lruvec, folio_nr_pages(folio)); return pack_shadow(memcgid, pgdat, eviction, folio_test_workingset(folio)); @@ -298,7 +398,13 @@ void workingset_refault(struct folio *folio, void *shadow) int memcgid; long nr; + if (lru_gen_enabled()) { + lru_gen_refault(folio, shadow); + return; + } + unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset); + eviction <<= bucket_order; rcu_read_lock(); /* diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig index 87983e70f03f..a833a7a67ce7 100644 --- a/net/ipv4/Kconfig +++ b/net/ipv4/Kconfig @@ -669,6 +669,24 @@ config TCP_CONG_BBR AQM schemes that do not provide a delay signal. It requires the fq ("Fair Queue") pacing packet scheduler. +config TCP_CONG_BBR2 + tristate "BBR2 TCP" + default n + help + + BBR2 TCP congestion control is a model-based congestion control + algorithm that aims to maximize network utilization, keep queues and + retransmit rates low, and to be able to coexist with Reno/CUBIC in + common scenarios. It builds an explicit model of the network path. It + tolerates a targeted degree of random packet loss and delay that are + unrelated to congestion. It can operate over LAN, WAN, cellular, wifi, + or cable modem links, and can use DCTCP-L4S-style ECN signals. It can + coexist with flows that use loss-based congestion control, and can + operate with shallow buffers, deep buffers, bufferbloat, policers, or + AQM schemes that do not provide a delay signal. It requires pacing, + using either TCP internal pacing or the fq ("Fair Queue") pacing packet + scheduler. 
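Once the kernel is built with CONFIG_TCP_CONG_BBR2 (or the tcp_bbr2 module is loaded), an application can opt a single socket into the new algorithm with the standard TCP_CONGESTION socket option. A minimal sketch; note that selecting an algorithm outside net.ipv4.tcp_allowed_congestion_control may require CAP_NET_ADMIN:

/* Minimal sketch: select BBR2 for one TCP socket via TCP_CONGESTION. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	const char ca[] = "bbr2";

	if (fd < 0 || setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
				 ca, strlen(ca)) < 0) {
		perror("TCP_CONGESTION");
		return 1;
	}

	char buf[16] = "";
	socklen_t len = sizeof(buf);
	if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len) == 0)
		printf("congestion control: %s\n", buf);	/* "bbr2" */
	close(fd);
	return 0;
}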
+ choice prompt "Default TCP congestion control" default DEFAULT_CUBIC @@ -706,6 +724,9 @@ choice config DEFAULT_BBR bool "BBR" if TCP_CONG_BBR=y + config DEFAULT_BBR2 + bool "BBR2" if TCP_CONG_BBR2=y + config DEFAULT_RENO bool "Reno" endchoice @@ -730,6 +751,7 @@ config DEFAULT_TCP_CONG default "dctcp" if DEFAULT_DCTCP default "cdg" if DEFAULT_CDG default "bbr" if DEFAULT_BBR + default "bbr2" if DEFAULT_BBR2 default "cubic" config TCP_MD5SIG diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile index bbdd9c44f14e..8dee1547d820 100644 --- a/net/ipv4/Makefile +++ b/net/ipv4/Makefile @@ -46,6 +46,7 @@ obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o obj-$(CONFIG_INET_RAW_DIAG) += raw_diag.o obj-$(CONFIG_TCP_CONG_BBR) += tcp_bbr.o +obj-$(CONFIG_TCP_CONG_BBR2) += tcp_bbr2.o obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o obj-$(CONFIG_TCP_CONG_CDG) += tcp_cdg.o obj-$(CONFIG_TCP_CONG_CUBIC) += tcp_cubic.o diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c index f79ab942f03b..389f4c874f87 100644 --- a/net/ipv4/bpf_tcp_ca.c +++ b/net/ipv4/bpf_tcp_ca.c @@ -21,7 +21,7 @@ static u32 optional_ops[] = { offsetof(struct tcp_congestion_ops, cwnd_event), offsetof(struct tcp_congestion_ops, in_ack_event), offsetof(struct tcp_congestion_ops, pkts_acked), - offsetof(struct tcp_congestion_ops, min_tso_segs), + offsetof(struct tcp_congestion_ops, tso_segs), offsetof(struct tcp_congestion_ops, sndbuf_expand), offsetof(struct tcp_congestion_ops, cong_control), }; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index cf18fbcbf123..38b4e4a4dc72 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -3085,6 +3085,7 @@ int tcp_disconnect(struct sock *sk, int flags) tp->rx_opt.dsack = 0; tp->rx_opt.num_sacks = 0; tp->rcv_ooopack = 0; + tp->fast_ack_mode = 0; /* Clean up fastopen related fields */ diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 02e8626ccb27..664c9e119787 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -294,26 +294,40 @@ static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain) sk->sk_pacing_rate = rate; } -/* override sysctl_tcp_min_tso_segs */ static u32 bbr_min_tso_segs(struct sock *sk) { return sk->sk_pacing_rate < (bbr_min_tso_rate >> 3) ? 1 : 2; } +/* Return the number of segments BBR would like in a TSO/GSO skb, given + * a particular max gso size as a constraint. + */ +static u32 bbr_tso_segs_generic(struct sock *sk, unsigned int mss_now, + u32 gso_max_size) +{ + u32 segs; + u64 bytes; + + /* Budget a TSO/GSO burst size allowance based on bw (pacing_rate). */ + bytes = sk->sk_pacing_rate >> sk->sk_pacing_shift; + + bytes = min_t(u32, bytes, gso_max_size - 1 - MAX_TCP_HEADER); + segs = max_t(u32, bytes / mss_now, bbr_min_tso_segs(sk)); + return segs; +} + +/* Custom tcp_tso_autosize() for BBR, used at transmit time to cap skb size. */ +static u32 bbr_tso_segs(struct sock *sk, unsigned int mss_now) +{ + return bbr_tso_segs_generic(sk, mss_now, sk->sk_gso_max_size); +} + +/* Like bbr_tso_segs(), using mss_cache, ignoring driver's sk_gso_max_size. */ static u32 bbr_tso_segs_goal(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); - u32 segs, bytes; - - /* Sort of tcp_tso_autosize() but ignoring - * driver provided sk_gso_max_size. 
- */ - bytes = min_t(unsigned long, - sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift), - GSO_MAX_SIZE - 1 - MAX_TCP_HEADER); - segs = max_t(u32, bytes / tp->mss_cache, bbr_min_tso_segs(sk)); - return min(segs, 0x7FU); + return bbr_tso_segs_generic(sk, tp->mss_cache, GSO_MAX_SIZE); } /* Save "last known good" cwnd so we can restore it after losses or PROBE_RTT */ @@ -1149,7 +1163,7 @@ static struct tcp_congestion_ops tcp_bbr_cong_ops __read_mostly = { .undo_cwnd = bbr_undo_cwnd, .cwnd_event = bbr_cwnd_event, .ssthresh = bbr_ssthresh, - .min_tso_segs = bbr_min_tso_segs, + .tso_segs = bbr_tso_segs, .get_info = bbr_get_info, .set_state = bbr_set_state, }; diff --git a/net/ipv4/tcp_bbr2.c b/net/ipv4/tcp_bbr2.c new file mode 100644 index 000000000000..fa49e17c47ca --- /dev/null +++ b/net/ipv4/tcp_bbr2.c @@ -0,0 +1,2674 @@ +/* BBR (Bottleneck Bandwidth and RTT) congestion control, v2 + * + * BBRv2 is a model-based congestion control algorithm that aims for low + * queues, low loss, and (bounded) Reno/CUBIC coexistence. To maintain a model + * of the network path, it uses measurements of bandwidth and RTT, as well as + * (if they occur) packet loss and/or DCTCP/L4S-style ECN signals. Note that + * although it can use ECN or loss signals explicitly, it does not require + * either; it can bound its in-flight data based on its estimate of the BDP. + * + * The model has both higher and lower bounds for the operating range: + * lo: bw_lo, inflight_lo: conservative short-term lower bound + * hi: bw_hi, inflight_hi: robust long-term upper bound + * The bandwidth-probing time scale is (a) extended dynamically based on + * estimated BDP to improve coexistence with Reno/CUBIC; (b) bounded by + * an interactive wall-clock time-scale to be more scalable and responsive + * than Reno and CUBIC. + * + * Here is a state transition diagram for BBR: + * + * | + * V + * +---> STARTUP ----+ + * | | | + * | V | + * | DRAIN ----+ + * | | | + * | V | + * +---> PROBE_BW ----+ + * | ^ | | + * | | | | + * | +----+ | + * | | + * +---- PROBE_RTT <--+ + * + * A BBR flow starts in STARTUP, and ramps up its sending rate quickly. + * When it estimates the pipe is full, it enters DRAIN to drain the queue. + * In steady state a BBR flow only uses PROBE_BW and PROBE_RTT. + * A long-lived BBR flow spends the vast majority of its time remaining + * (repeatedly) in PROBE_BW, fully probing and utilizing the pipe's bandwidth + * in a fair manner, with a small, bounded queue. *If* a flow has been + * continuously sending for the entire min_rtt window, and hasn't seen an RTT + * sample that matches or decreases its min_rtt estimate for 10 seconds, then + * it briefly enters PROBE_RTT to cut inflight to a minimum value to re-probe + * the path's two-way propagation delay (min_rtt). When exiting PROBE_RTT, if + * we estimated that we reached the full bw of the pipe then we enter PROBE_BW; + * otherwise we enter STARTUP to try to fill the pipe. + * + * BBR is described in detail in: + * "BBR: Congestion-Based Congestion Control", + * Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Soheil Hassas Yeganeh, + * Van Jacobson. ACM Queue, Vol. 14 No. 5, September-October 2016. + * + * There is a public e-mail list for discussing BBR development and testing: + * https://groups.google.com/forum/#!forum/bbr-dev + * + * NOTE: BBR might be used with the fq qdisc ("man tc-fq") with pacing enabled, + * otherwise TCP stack falls back to an internal pacing using one high + * resolution timer per TCP socket and may use more resources. 
+ */ +#include +#include +#include +#include +#include + +#include "tcp_dctcp.h" + +/* Scale factor for rate in pkt/uSec unit to avoid truncation in bandwidth + * estimation. The rate unit ~= (1500 bytes / 1 usec / 2^24) ~= 715 bps. + * This handles bandwidths from 0.06pps (715bps) to 256Mpps (3Tbps) in a u32. + * Since the minimum window is >=4 packets, the lower bound isn't + * an issue. The upper bound isn't an issue with existing technologies. + */ +#define BW_SCALE 24 +#define BW_UNIT (1 << BW_SCALE) + +#define BBR_SCALE 8 /* scaling factor for fractions in BBR (e.g. gains) */ +#define BBR_UNIT (1 << BBR_SCALE) + +#define FLAG_DEBUG_VERBOSE 0x1 /* Verbose debugging messages */ +#define FLAG_DEBUG_LOOPBACK 0x2 /* Do NOT skip loopback addr */ + +#define CYCLE_LEN 8 /* number of phases in a pacing gain cycle */ + +/* BBR has the following modes for deciding how fast to send: */ +enum bbr_mode { + BBR_STARTUP, /* ramp up sending rate rapidly to fill pipe */ + BBR_DRAIN, /* drain any queue created during startup */ + BBR_PROBE_BW, /* discover, share bw: pace around estimated bw */ + BBR_PROBE_RTT, /* cut inflight to min to probe min_rtt */ +}; + +/* How does the incoming ACK stream relate to our bandwidth probing? */ +enum bbr_ack_phase { + BBR_ACKS_INIT, /* not probing; not getting probe feedback */ + BBR_ACKS_REFILLING, /* sending at est. bw to fill pipe */ + BBR_ACKS_PROBE_STARTING, /* inflight rising to probe bw */ + BBR_ACKS_PROBE_FEEDBACK, /* getting feedback from bw probing */ + BBR_ACKS_PROBE_STOPPING, /* stopped probing; still getting feedback */ +}; + +/* BBR congestion control block */ +struct bbr { + u32 min_rtt_us; /* min RTT in min_rtt_win_sec window */ + u32 min_rtt_stamp; /* timestamp of min_rtt_us */ + u32 probe_rtt_done_stamp; /* end time for BBR_PROBE_RTT mode */ + u32 probe_rtt_min_us; /* min RTT in bbr_probe_rtt_win_ms window */ + u32 probe_rtt_min_stamp; /* timestamp of probe_rtt_min_us*/ + u32 next_rtt_delivered; /* scb->tx.delivered at end of round */ + u32 prior_rcv_nxt; /* tp->rcv_nxt when CE state last changed */ + u64 cycle_mstamp; /* time of this cycle phase start */ + u32 mode:3, /* current bbr_mode in state machine */ + prev_ca_state:3, /* CA state on previous ACK */ + packet_conservation:1, /* use packet conservation? */ + round_start:1, /* start of packet-timed tx->ack round? */ + ce_state:1, /* If most recent data has CE bit set */ + bw_probe_up_rounds:5, /* cwnd-limited rounds in PROBE_UP */ + try_fast_path:1, /* can we take fast path? */ + unused2:11, + idle_restart:1, /* restarting after idle? */ + probe_rtt_round_done:1, /* a BBR_PROBE_RTT round at 4 pkts? */ + cycle_idx:3, /* current index in pacing_gain cycle array */ + has_seen_rtt:1; /* have we seen an RTT sample yet? */ + u32 pacing_gain:11, /* current gain for setting pacing rate */ + cwnd_gain:11, /* current gain for setting cwnd */ + full_bw_reached:1, /* reached full bw in Startup? 
*/
+ full_bw_cnt:2, /* number of rounds without large bw gains */
+ init_cwnd:7; /* initial cwnd */
+ u32 prior_cwnd; /* prior cwnd upon entering loss recovery */
+ u32 full_bw; /* recent bw, to estimate if pipe is full */
+
+ /* For tracking ACK aggregation: */
+ u64 ack_epoch_mstamp; /* start of ACK sampling epoch */
+ u16 extra_acked[2]; /* max excess data ACKed in epoch */
+ u32 ack_epoch_acked:20, /* packets (S)ACKed in sampling epoch */
+ extra_acked_win_rtts:5, /* age of extra_acked, in round trips */
+ extra_acked_win_idx:1, /* current index in extra_acked array */
+ /* BBR v2 state: */
+ unused1:2,
+ startup_ecn_rounds:2, /* consecutive hi ECN STARTUP rounds */
+ loss_in_cycle:1, /* packet loss in this cycle? */
+ ecn_in_cycle:1; /* ECN in this cycle? */
+ u32 loss_round_delivered; /* scb->tx.delivered ending loss round */
+ u32 undo_bw_lo; /* bw_lo before latest losses */
+ u32 undo_inflight_lo; /* inflight_lo before latest losses */
+ u32 undo_inflight_hi; /* inflight_hi before latest losses */
+ u32 bw_latest; /* max delivered bw in last round trip */
+ u32 bw_lo; /* lower bound on sending bandwidth */
+ u32 bw_hi[2]; /* upper bound of sending bandwidth range */
+ u32 inflight_latest; /* max delivered data in last round trip */
+ u32 inflight_lo; /* lower bound of inflight data range */
+ u32 inflight_hi; /* upper bound of inflight data range */
+ u32 bw_probe_up_cnt; /* packets delivered per inflight_hi incr */
+ u32 bw_probe_up_acks; /* packets (S)ACKed since inflight_hi incr */
+ u32 probe_wait_us; /* PROBE_DOWN until next clock-driven probe */
+ u32 ecn_eligible:1, /* sender can use ECN (RTT, handshake)? */
+ ecn_alpha:9, /* EWMA delivered_ce/delivered; 0..256 */
+ bw_probe_samples:1, /* rate samples reflect bw probing? */
+ prev_probe_too_high:1, /* did last PROBE_UP go too high? */
+ stopped_risky_probe:1, /* last PROBE_UP stopped due to risk? */
+ rounds_since_probe:8, /* packet-timed rounds since probed bw */
+ loss_round_start:1, /* loss_round_delivered round trip? */
+ loss_in_round:1, /* loss marked in this round trip? */
+ ecn_in_round:1, /* ECN marked in this round trip? */
+ ack_phase:3, /* bbr_ack_phase: meaning of ACKs */
+ loss_events_in_round:4,/* losses in STARTUP round */
+ initialized:1; /* has bbr_init() been called? */
+ u32 alpha_last_delivered; /* tp->delivered at alpha update */
+ u32 alpha_last_delivered_ce; /* tp->delivered_ce at alpha update */
+
+ /* Params configurable using setsockopt. Refer to corresponding
+ * module param for detailed description of params.
+ */
+ struct bbr_params {
+ u32 high_gain:11, /* max allowed value: 2047 */
+ drain_gain:10, /* max allowed value: 1023 */
+ cwnd_gain:11; /* max allowed value: 2047 */
+ u32 cwnd_min_target:4, /* max allowed value: 15 */
+ min_rtt_win_sec:5, /* max allowed value: 31 */
+ probe_rtt_mode_ms:9, /* max allowed value: 511 */
+ full_bw_cnt:3, /* max allowed value: 7 */
+ cwnd_tso_budget:1, /* allowed values: {0, 1} */
+ unused3:6,
+ drain_to_target:1, /* boolean */
+ precise_ece_ack:1, /* boolean */
+ extra_acked_in_startup:1, /* allowed values: {0, 1} */
+ fast_path:1; /* boolean */
+ u32 full_bw_thresh:10, /* max allowed value: 1023 */
+ startup_cwnd_gain:11, /* max allowed value: 2047 */
+ bw_probe_pif_gain:9, /* max allowed value: 511 */
+ usage_based_cwnd:1, /* boolean */
+ unused2:1;
+ u16 probe_rtt_win_ms:14, /* max allowed value: 16383 */
+ refill_add_inc:2; /* max allowed value: 3 */
+ u16 extra_acked_gain:11, /* max allowed value: 2047 */
+ extra_acked_win_rtts:5; /* max allowed value: 31 */
+ u16 pacing_gain[CYCLE_LEN]; /* max allowed value: 1023 */
+ /* Mostly BBR v2 parameters below here: */
+ u32 ecn_alpha_gain:8, /* max allowed value: 255 */
+ ecn_factor:8, /* max allowed value: 255 */
+ ecn_thresh:8, /* max allowed value: 255 */
+ beta:8; /* max allowed value: 255 */
+ u32 ecn_max_rtt_us:19, /* max allowed value: 524287 */
+ bw_probe_reno_gain:9, /* max allowed value: 511 */
+ full_loss_cnt:4; /* max allowed value: 15 */
+ u32 probe_rtt_cwnd_gain:8, /* max allowed value: 255 */
+ inflight_headroom:8, /* max allowed value: 255 */
+ loss_thresh:8, /* max allowed value: 255 */
+ bw_probe_max_rounds:8; /* max allowed value: 255 */
+ u32 bw_probe_rand_rounds:4, /* max allowed value: 15 */
+ bw_probe_base_us:26, /* usecs: 0..2^26-1 (67 secs) */
+ full_ecn_cnt:2; /* max allowed value: 3 */
+ u32 bw_probe_rand_us:26, /* usecs: 0..2^26-1 (67 secs) */
+ undo:1, /* boolean */
+ tso_rtt_shift:4, /* max allowed value: 15 */
+ unused5:1;
+ u32 ecn_reprobe_gain:9, /* max allowed value: 511 */
+ unused1:14,
+ ecn_alpha_init:9; /* max allowed value: 256 */
+ } params;
+
+ struct {
+ u32 snd_isn; /* Initial sequence number */
+ u32 rs_bw; /* last valid rate sample bw */
+ u32 target_cwnd; /* target cwnd, based on BDP */
+ u8 undo:1, /* Undo event happened but not yet logged */
+ unused:7;
+ char event; /* single-letter event debug codes */
+ u16 unused2;
+ } debug;
+};
+
+struct bbr_context {
+ u32 sample_bw;
+ u32 target_cwnd;
+ u32 log:1;
+};
+
+/* Window length of min_rtt filter (in sec). Max allowed value is 31 (0x1F) */
+static u32 bbr_min_rtt_win_sec = 10;
+/* Minimum time (in ms) spent at bbr_cwnd_min_target in BBR_PROBE_RTT mode.
+ * Max allowed value is 511 (0x1FF).
+ */
+static u32 bbr_probe_rtt_mode_ms = 200;
+/* Window length of probe_rtt_min_us filter (in ms), and consequently the
+ * typical interval between PROBE_RTT mode entries.
+ * Note that bbr_probe_rtt_win_ms must be <= bbr_min_rtt_win_sec * MSEC_PER_SEC
+ */
+static u32 bbr_probe_rtt_win_ms = 5000;
+/* Skip TSO below the following bandwidth (bits/sec): */
+static int bbr_min_tso_rate = 1200000;
+
+/* Use min_rtt to help adapt TSO burst size, with smaller min_rtt resulting
+ * in bigger TSO bursts. By default we cut the RTT-based allowance in half
+ * for every 2^9 usec (aka 512 us) of RTT, so that the RTT-based allowance
+ * is below 1500 bytes after 6 * ~500 usec = 3ms.
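A worked example of the halving rule in the comment above, as a small userspace sketch; GSO_MAX_SIZE is assumed to be the usual 64 KB, and only the min_rtt component of the burst budget is shown:

/* Sketch of the min_rtt-based TSO burst allowance: start from a 64 KB
 * budget and halve it for every 2^tso_rtt_shift usecs of min_rtt.
 */
#include <stdio.h>

#define GSO_MAX_SIZE	65536U
#define TSO_RTT_SHIFT	9	/* halve the budget per 512 us of min_rtt */

static unsigned int rtt_tso_allowance(unsigned int min_rtt_us)
{
	unsigned int r = min_rtt_us >> TSO_RTT_SHIFT;

	return r < 32 ? GSO_MAX_SIZE >> r : 0;	/* avoid shift UB */
}

int main(void)
{
	unsigned int rtts[] = { 100, 1024, 3072, 10000 };

	for (int i = 0; i < 4; i++)
		printf("min_rtt=%5u us -> allowance=%5u bytes\n",
		       rtts[i], rtt_tso_allowance(rtts[i]));
	/* 100 us -> 65536; 3072 us -> 1024 (below one 1500 B MSS) */
	return 0;
}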
+ */ +static u32 bbr_tso_rtt_shift = 9; /* halve allowance per 2^9 usecs, 512us */ + +/* Select cwnd TSO budget approach: + * 0: padding + * 1: flooring + */ +static uint bbr_cwnd_tso_budget = 1; + +/* Pace at ~1% below estimated bw, on average, to reduce queue at bottleneck. + * In order to help drive the network toward lower queues and low latency while + * maintaining high utilization, the average pacing rate aims to be slightly + * lower than the estimated bandwidth. This is an important aspect of the + * design. + */ +static const int bbr_pacing_margin_percent = 1; + +/* We use a high_gain value of 2/ln(2) because it's the smallest pacing gain + * that will allow a smoothly increasing pacing rate that will double each RTT + * and send the same number of packets per RTT that an un-paced, slow-starting + * Reno or CUBIC flow would. Max allowed value is 2047 (0x7FF). + */ +static int bbr_high_gain = BBR_UNIT * 2885 / 1000 + 1; +/* The gain for deriving startup cwnd. Max allowed value is 2047 (0x7FF). */ +static int bbr_startup_cwnd_gain = BBR_UNIT * 2885 / 1000 + 1; +/* The pacing gain of 1/high_gain in BBR_DRAIN is calculated to typically drain + * the queue created in BBR_STARTUP in a single round. Max allowed value + * is 1023 (0x3FF). + */ +static int bbr_drain_gain = BBR_UNIT * 1000 / 2885; +/* The gain for deriving steady-state cwnd tolerates delayed/stretched ACKs. + * Max allowed value is 2047 (0x7FF). + */ +static int bbr_cwnd_gain = BBR_UNIT * 2; +/* The pacing_gain values for the PROBE_BW gain cycle, to discover/share bw. + * Max allowed value for each element is 1023 (0x3FF). + */ +enum bbr_pacing_gain_phase { + BBR_BW_PROBE_UP = 0, /* push up inflight to probe for bw/vol */ + BBR_BW_PROBE_DOWN = 1, /* drain excess inflight from the queue */ + BBR_BW_PROBE_CRUISE = 2, /* use pipe, w/ headroom in queue/pipe */ + BBR_BW_PROBE_REFILL = 3, /* v2: refill the pipe again to 100% */ +}; +static int bbr_pacing_gain[] = { + BBR_UNIT * 5 / 4, /* probe for more available bw */ + BBR_UNIT * 3 / 4, /* drain queue and/or yield bw to other flows */ + BBR_UNIT, BBR_UNIT, BBR_UNIT, /* cruise at 1.0*bw to utilize pipe, */ + BBR_UNIT, BBR_UNIT, BBR_UNIT /* without creating excess queue... */ +}; + +/* Try to keep at least this many packets in flight, if things go smoothly. For + * smooth functioning, a sliding window protocol ACKing every other packet + * needs at least 4 packets in flight. Max allowed value is 15 (0xF). + */ +static u32 bbr_cwnd_min_target = 4; + +/* Cwnd to BDP proportion in PROBE_RTT mode scaled by BBR_UNIT. Default: 50%. + * Use 0 to disable. Max allowed value is 255. + */ +static u32 bbr_probe_rtt_cwnd_gain = BBR_UNIT * 1 / 2; + +/* To estimate if BBR_STARTUP mode (i.e. high_gain) has filled pipe... */ +/* If bw has increased significantly (1.25x), there may be more bw available. + * Max allowed value is 1023 (0x3FF). + */ +static u32 bbr_full_bw_thresh = BBR_UNIT * 5 / 4; +/* But after 3 rounds w/o significant bw growth, estimate pipe is full. + * Max allowed value is 7 (0x7). + */ +static u32 bbr_full_bw_cnt = 3; + +static u32 bbr_flags; /* Debugging related stuff */ + +/* Whether to debug using printk. + */ +static bool bbr_debug_with_printk; + +/* Whether to debug using ftrace event tcp:tcp_bbr_event. + * Ignored when bbr_debug_with_printk is set. + */ +static bool bbr_debug_ftrace; + +/* Experiment: each cycle, try to hold sub-unity gain until inflight <= BDP. 
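As a quick sanity check on the bbr_pacing_gain array above: the 1.25x probe entry and the 0.75x drain entry offset each other exactly, so the eight Q8 entries average to precisely BBR_UNIT (1.0x); note that v2 walks these phases via the bbr_pacing_gain_phase enum rather than rotating through all eight slots. A trivial verification:

/* Verify the PROBE_BW gain array averages to exactly 1.0x in Q8. */
#include <stdio.h>

#define BBR_UNIT 256

int main(void)
{
	int gain[8] = { BBR_UNIT * 5 / 4, BBR_UNIT * 3 / 4,
			BBR_UNIT, BBR_UNIT, BBR_UNIT,
			BBR_UNIT, BBR_UNIT, BBR_UNIT };
	int sum = 0;

	for (int i = 0; i < 8; i++)
		sum += gain[i];
	printf("average gain: %d/256 (~%.2fx)\n", sum / 8, sum / 8 / 256.0);
	return 0;
}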
*/
+static bool bbr_drain_to_target = true; /* default: enabled */
+
+/* Experiment: Flags to control BBR with ECN behavior.
+ */
+static bool bbr_precise_ece_ack = true; /* default: enabled */
+
+/* The max rwin scaling shift factor is 14 (RFC 1323), so the max sane rwin is
+ * (2^(16+14) B)/(1024 B/packet) = 1M packets.
+ */
+static u32 bbr_cwnd_warn_val = 1U << 20;
+
+static u16 bbr_debug_port_mask;
+
+/* BBR module parameters. These are module parameters only in Google prod.
+ * Upstream these are intentionally not module parameters.
+ */
+static int bbr_pacing_gain_size = CYCLE_LEN;
+
+/* Gain factor for adding extra_acked to target cwnd: */
+static int bbr_extra_acked_gain = 256;
+
+/* Window length of extra_acked window. Max allowed val is 31. */
+static u32 bbr_extra_acked_win_rtts = 5;
+
+/* Max allowed val for ack_epoch_acked, after which sampling epoch is reset */
+static u32 bbr_ack_epoch_acked_reset_thresh = 1U << 20;
+
+/* Time period for clamping cwnd increment due to ack aggregation */
+static u32 bbr_extra_acked_max_us = 100 * 1000;
+
+/* Use extra acked in startup?
+ * 0: disabled
+ * 1: use latest extra_acked value from 1-2 rtt in startup
+ */
+static int bbr_extra_acked_in_startup = 1; /* default: enabled */
+
+/* Experiment: don't grow cwnd beyond twice what we just probed. */
+static bool bbr_usage_based_cwnd; /* default: disabled */
+
+/* For lab testing, researchers can enable BBRv2 ECN support with this flag,
+ * when they know that any ECN marks that the connections experience will be
+ * DCTCP/L4S-style ECN marks, rather than RFC3168 ECN marks.
+ * TODO(ncardwell): Production use of the BBRv2 ECN functionality depends on
+ * negotiation or configuration that is outside the scope of the BBRv2
+ * alpha release.
+ */
+static bool bbr_ecn_enable = false;
+
+module_param_named(min_tso_rate, bbr_min_tso_rate, int, 0644);
+module_param_named(tso_rtt_shift, bbr_tso_rtt_shift, int, 0644);
+module_param_named(high_gain, bbr_high_gain, int, 0644);
+module_param_named(drain_gain, bbr_drain_gain, int, 0644);
+module_param_named(startup_cwnd_gain, bbr_startup_cwnd_gain, int, 0644);
+module_param_named(cwnd_gain, bbr_cwnd_gain, int, 0644);
+module_param_array_named(pacing_gain, bbr_pacing_gain, int,
+ &bbr_pacing_gain_size, 0644);
+module_param_named(cwnd_min_target, bbr_cwnd_min_target, uint, 0644);
+module_param_named(probe_rtt_cwnd_gain,
+ bbr_probe_rtt_cwnd_gain, uint, 0664);
+module_param_named(cwnd_warn_val, bbr_cwnd_warn_val, uint, 0664);
+module_param_named(debug_port_mask, bbr_debug_port_mask, ushort, 0644);
+module_param_named(flags, bbr_flags, uint, 0644);
+module_param_named(debug_ftrace, bbr_debug_ftrace, bool, 0644);
+module_param_named(debug_with_printk, bbr_debug_with_printk, bool, 0644);
+module_param_named(min_rtt_win_sec, bbr_min_rtt_win_sec, uint, 0644);
+module_param_named(probe_rtt_mode_ms, bbr_probe_rtt_mode_ms, uint, 0644);
+module_param_named(probe_rtt_win_ms, bbr_probe_rtt_win_ms, uint, 0644);
+module_param_named(full_bw_thresh, bbr_full_bw_thresh, uint, 0644);
+module_param_named(full_bw_cnt, bbr_full_bw_cnt, uint, 0644);
+module_param_named(cwnd_tso_budget, bbr_cwnd_tso_budget, uint, 0664);
+module_param_named(extra_acked_gain, bbr_extra_acked_gain, int, 0664);
+module_param_named(extra_acked_win_rtts,
+ bbr_extra_acked_win_rtts, uint, 0664);
+module_param_named(extra_acked_max_us,
+ bbr_extra_acked_max_us, uint, 0664);
+module_param_named(ack_epoch_acked_reset_thresh,
+ bbr_ack_epoch_acked_reset_thresh, uint, 0664);
+module_param_named(drain_to_target, bbr_drain_to_target, bool, 0664); +module_param_named(precise_ece_ack, bbr_precise_ece_ack, bool, 0664); +module_param_named(extra_acked_in_startup, + bbr_extra_acked_in_startup, int, 0664); +module_param_named(usage_based_cwnd, bbr_usage_based_cwnd, bool, 0664); +module_param_named(ecn_enable, bbr_ecn_enable, bool, 0664); + +static void bbr2_exit_probe_rtt(struct sock *sk); +static void bbr2_reset_congestion_signals(struct sock *sk); + +static void bbr_check_probe_rtt_done(struct sock *sk); + +/* Do we estimate that STARTUP filled the pipe? */ +static bool bbr_full_bw_reached(const struct sock *sk) +{ + const struct bbr *bbr = inet_csk_ca(sk); + + return bbr->full_bw_reached; +} + +/* Return the windowed max recent bandwidth sample, in pkts/uS << BW_SCALE. */ +static u32 bbr_max_bw(const struct sock *sk) +{ + struct bbr *bbr = inet_csk_ca(sk); + + return max(bbr->bw_hi[0], bbr->bw_hi[1]); +} + +/* Return the estimated bandwidth of the path, in pkts/uS << BW_SCALE. */ +static u32 bbr_bw(const struct sock *sk) +{ + struct bbr *bbr = inet_csk_ca(sk); + + return min(bbr_max_bw(sk), bbr->bw_lo); +} + +/* Return maximum extra acked in past k-2k round trips, + * where k = bbr_extra_acked_win_rtts. + */ +static u16 bbr_extra_acked(const struct sock *sk) +{ + struct bbr *bbr = inet_csk_ca(sk); + + return max(bbr->extra_acked[0], bbr->extra_acked[1]); +} + +/* Return rate in bytes per second, optionally with a gain. + * The order here is chosen carefully to avoid overflow of u64. This should + * work for input rates of up to 2.9Tbit/sec and gain of 2.89x. + */ +static u64 bbr_rate_bytes_per_sec(struct sock *sk, u64 rate, int gain, + int margin) +{ + unsigned int mss = tcp_sk(sk)->mss_cache; + + rate *= mss; + rate *= gain; + rate >>= BBR_SCALE; + rate *= USEC_PER_SEC / 100 * (100 - margin); + rate >>= BW_SCALE; + rate = max(rate, 1ULL); + return rate; +} + +static u64 bbr_bw_bytes_per_sec(struct sock *sk, u64 rate) +{ + return bbr_rate_bytes_per_sec(sk, rate, BBR_UNIT, 0); +} + +static u64 bbr_rate_kbps(struct sock *sk, u64 rate) +{ + rate = bbr_bw_bytes_per_sec(sk, rate); + rate *= 8; + do_div(rate, 1000); + return rate; +} + +static u32 bbr_tso_segs_goal(struct sock *sk); +static void bbr_debug(struct sock *sk, u32 acked, + const struct rate_sample *rs, struct bbr_context *ctx) +{ + static const char ca_states[] = { + [TCP_CA_Open] = 'O', + [TCP_CA_Disorder] = 'D', + [TCP_CA_CWR] = 'C', + [TCP_CA_Recovery] = 'R', + [TCP_CA_Loss] = 'L', + }; + static const char mode[] = { + 'G', /* Growing - BBR_STARTUP */ + 'D', /* Drain - BBR_DRAIN */ + 'W', /* Window - BBR_PROBE_BW */ + 'M', /* Min RTT - BBR_PROBE_RTT */ + }; + static const char ack_phase[] = { /* bbr_ack_phase strings */ + 'I', /* BBR_ACKS_INIT - 'Init' */ + 'R', /* BBR_ACKS_REFILLING - 'Refilling' */ + 'B', /* BBR_ACKS_PROBE_STARTING - 'Before' */ + 'F', /* BBR_ACKS_PROBE_FEEDBACK - 'Feedback' */ + 'A', /* BBR_ACKS_PROBE_STOPPING - 'After' */ + }; + struct tcp_sock *tp = tcp_sk(sk); + struct bbr *bbr = inet_csk_ca(sk); + const u32 una = tp->snd_una - bbr->debug.snd_isn; + const u32 fack = tcp_highest_sack_seq(tp); + const u16 dport = ntohs(inet_sk(sk)->inet_dport); + bool is_port_match = (bbr_debug_port_mask && + ((dport & bbr_debug_port_mask) == 0)); + char debugmsg[320]; + + if (sk->sk_state == TCP_SYN_SENT) + return; /* no bbr_init() yet if SYN retransmit -> CA_Loss */ + + if (!tp->snd_cwnd || tp->snd_cwnd > bbr_cwnd_warn_val) { + char addr[INET6_ADDRSTRLEN + 10] = { 0 }; + + if (sk->sk_family == 
AF_INET) + snprintf(addr, sizeof(addr), "%pI4:%u", + &inet_sk(sk)->inet_daddr, dport); + else if (sk->sk_family == AF_INET6) + snprintf(addr, sizeof(addr), "%pI6:%u", + &sk->sk_v6_daddr, dport); + + WARN_ONCE(1, + "BBR %s cwnd alert: %u " + "snd_una: %u ca: %d pacing_gain: %u cwnd_gain: %u " + "bw: %u rtt: %u min_rtt: %u " + "acked: %u tso_segs: %u " + "bw: %d %ld %d pif: %u\n", + addr, tp->snd_cwnd, + una, inet_csk(sk)->icsk_ca_state, + bbr->pacing_gain, bbr->cwnd_gain, + bbr_max_bw(sk), (tp->srtt_us >> 3), bbr->min_rtt_us, + acked, bbr_tso_segs_goal(sk), + rs->delivered, rs->interval_us, rs->is_retrans, + tcp_packets_in_flight(tp)); + } + + if (likely(!bbr_debug_with_printk && !bbr_debug_ftrace)) + return; + + if (!sock_flag(sk, SOCK_DBG) && !is_port_match) + return; + + if (!ctx->log && !tp->app_limited && !(bbr_flags & FLAG_DEBUG_VERBOSE)) + return; + + if (ipv4_is_loopback(inet_sk(sk)->inet_daddr) && + !(bbr_flags & FLAG_DEBUG_LOOPBACK)) + return; + + snprintf(debugmsg, sizeof(debugmsg) - 1, + "BBR %pI4:%-5u %5u,%03u:%-7u %c " + "%c %2u br %2u cr %2d rtt %5ld d %2d i %5ld mrtt %d %cbw %llu " + "bw %llu lb %llu ib %llu qb %llu " + "a %u if %2u %c %c dl %u l %u al %u # %u t %u %c %c " + "lr %d er %d ea %d bwl %lld il %d ih %d c %d " + "v %d %c %u %c %s\n", + &inet_sk(sk)->inet_daddr, dport, + una / 1000, una % 1000, fack - tp->snd_una, + ca_states[inet_csk(sk)->icsk_ca_state], + bbr->debug.undo ? '@' : mode[bbr->mode], + tp->snd_cwnd, + bbr_extra_acked(sk), /* br (legacy): extra_acked */ + rs->tx_in_flight, /* cr (legacy): tx_inflight */ + rs->rtt_us, + rs->delivered, + rs->interval_us, + bbr->min_rtt_us, + rs->is_app_limited ? '_' : 'l', + bbr_rate_kbps(sk, ctx->sample_bw), /* lbw: latest sample bw */ + bbr_rate_kbps(sk, bbr_max_bw(sk)), /* bw: max bw */ + 0ULL, /* lb: [obsolete] */ + 0ULL, /* ib: [obsolete] */ + (u64)sk->sk_pacing_rate * 8 / 1000, + acked, + tcp_packets_in_flight(tp), + rs->is_ack_delayed ? 'd' : '.', + bbr->round_start ? '*' : '.', + tp->delivered, tp->lost, + tp->app_limited, + 0, /* #: [obsolete] */ + ctx->target_cwnd, + tp->reord_seen ? 'r' : '.', /* r: reordering seen? */ + ca_states[bbr->prev_ca_state], + (rs->lost + rs->delivered) > 0 ? + (1000 * rs->lost / + (rs->lost + rs->delivered)) : 0, /* lr: loss rate x1000 */ + (rs->delivered) > 0 ? + (1000 * rs->delivered_ce / + (rs->delivered)) : 0, /* er: ECN rate x1000 */ + 1000 * bbr->ecn_alpha >> BBR_SCALE, /* ea: ECN alpha x1000 */ + bbr->bw_lo == ~0U ? + -1 : (s64)bbr_rate_kbps(sk, bbr->bw_lo), /* bwl */ + bbr->inflight_lo, /* il */ + bbr->inflight_hi, /* ih */ + bbr->bw_probe_up_cnt, /* c */ + 2, /* v: version */ + bbr->debug.event, + bbr->cycle_idx, + ack_phase[bbr->ack_phase], + bbr->bw_probe_samples ? "Y" : "N"); + debugmsg[sizeof(debugmsg) - 1] = 0; + + /* printk takes a higher precedence. */ + if (bbr_debug_with_printk) + printk(KERN_DEBUG "%s", debugmsg); + + if (unlikely(bbr->debug.undo)) + bbr->debug.undo = 0; +} + +/* Convert a BBR bw and gain factor to a pacing rate in bytes per second. */ +static unsigned long bbr_bw_to_pacing_rate(struct sock *sk, u32 bw, int gain) +{ + u64 rate = bw; + + rate = bbr_rate_bytes_per_sec(sk, rate, gain, + bbr_pacing_margin_percent); + rate = min_t(u64, rate, sk->sk_max_pacing_rate); + return rate; +} + +/* Initialize pacing rate to: high_gain * init_cwnd / RTT. */ +static void bbr_init_pacing_rate_from_rtt(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct bbr *bbr = inet_csk_ca(sk); + u64 bw; + u32 rtt_us; + + if (tp->srtt_us) { /* any RTT sample yet? 
*/ + rtt_us = max(tp->srtt_us >> 3, 1U); + bbr->has_seen_rtt = 1; + } else { /* no RTT sample yet */ + rtt_us = USEC_PER_MSEC; /* use nominal default RTT */ + } + bw = (u64)tp->snd_cwnd * BW_UNIT; + do_div(bw, rtt_us); + sk->sk_pacing_rate = bbr_bw_to_pacing_rate(sk, bw, bbr->params.high_gain); +} + +/* Pace using current bw estimate and a gain factor. */ +static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct bbr *bbr = inet_csk_ca(sk); + unsigned long rate = bbr_bw_to_pacing_rate(sk, bw, gain); + + if (unlikely(!bbr->has_seen_rtt && tp->srtt_us)) + bbr_init_pacing_rate_from_rtt(sk); + if (bbr_full_bw_reached(sk) || rate > sk->sk_pacing_rate) + sk->sk_pacing_rate = rate; +} + +static u32 bbr_min_tso_segs(struct sock *sk) +{ + return sk->sk_pacing_rate < (bbr_min_tso_rate >> 3) ? 1 : 2; +} + +/* Return the number of segments BBR would like in a TSO/GSO skb, given + * a particular max gso size as a constraint. + */ +static u32 bbr_tso_segs_generic(struct sock *sk, unsigned int mss_now, + u32 gso_max_size) +{ + struct bbr *bbr = inet_csk_ca(sk); + u32 segs, r; + u64 bytes; + + /* Budget a TSO/GSO burst size allowance based on bw (pacing_rate). */ + bytes = sk->sk_pacing_rate >> sk->sk_pacing_shift; + + /* Budget a TSO/GSO burst size allowance based on min_rtt. For every + * K = 2^tso_rtt_shift microseconds of min_rtt, halve the burst. + * The min_rtt-based burst allowance is: 64 KBytes / 2^(min_rtt/K) + */ + if (bbr->params.tso_rtt_shift) { + r = bbr->min_rtt_us >> bbr->params.tso_rtt_shift; + if (r < BITS_PER_TYPE(u32)) /* prevent undefined behavior */ + bytes += GSO_MAX_SIZE >> r; + } + + bytes = min_t(u32, bytes, gso_max_size - 1 - MAX_TCP_HEADER); + segs = max_t(u32, bytes / mss_now, bbr_min_tso_segs(sk)); + return segs; +} + +/* Custom tcp_tso_autosize() for BBR, used at transmit time to cap skb size. */ +static u32 bbr_tso_segs(struct sock *sk, unsigned int mss_now) +{ + return bbr_tso_segs_generic(sk, mss_now, sk->sk_gso_max_size); +} + +/* Like bbr_tso_segs(), using mss_cache, ignoring driver's sk_gso_max_size. */ +static u32 bbr_tso_segs_goal(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + + return bbr_tso_segs_generic(sk, tp->mss_cache, GSO_MAX_SIZE); +} + +/* Save "last known good" cwnd so we can restore it after losses or PROBE_RTT */ +static void bbr_save_cwnd(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct bbr *bbr = inet_csk_ca(sk); + + if (bbr->prev_ca_state < TCP_CA_Recovery && bbr->mode != BBR_PROBE_RTT) + bbr->prior_cwnd = tp->snd_cwnd; /* this cwnd is good enough */ + else /* loss recovery or BBR_PROBE_RTT have temporarily cut cwnd */ + bbr->prior_cwnd = max(bbr->prior_cwnd, tp->snd_cwnd); +} + +static void bbr_cwnd_event(struct sock *sk, enum tcp_ca_event event) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct bbr *bbr = inet_csk_ca(sk); + + if (event == CA_EVENT_TX_START && tp->app_limited) { + bbr->idle_restart = 1; + bbr->ack_epoch_mstamp = tp->tcp_mstamp; + bbr->ack_epoch_acked = 0; + /* Avoid pointless buffer overflows: pace at est. bw if we don't + * need more speed (we're restarting from idle and app-limited). 
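A userspace sketch of the fixed-point plumbing in bbr_rate_bytes_per_sec() and bbr_init_pacing_rate_from_rtt() above, with illustrative cwnd/RTT numbers; it also confirms the ~715 bps rate unit claimed near the top of the file:

/* Derive a raw bandwidth of cwnd / srtt in the (pkt/usec << BW_SCALE)
 * unit, then convert to bytes/sec with a Q8 gain and the pacing margin.
 */
#include <stdio.h>

#define BW_SCALE 24
#define BW_UNIT (1ULL << BW_SCALE)
#define BBR_SCALE 8
#define BBR_UNIT (1 << BBR_SCALE)
#define USEC_PER_SEC 1000000ULL

static unsigned long long rate_bytes_per_sec(unsigned long long rate,
					     unsigned int mss, int gain,
					     int margin)
{
	rate *= mss;			/* packets -> bytes */
	rate *= gain;			/* apply the Q8 gain... */
	rate >>= BBR_SCALE;		/* ...and drop the Q8 shift */
	rate *= USEC_PER_SEC / 100 * (100 - margin);
	rate >>= BW_SCALE;		/* drop the BW_SCALE shift */
	return rate ? rate : 1;
}

int main(void)
{
	unsigned int cwnd = 10, srtt_us = 10000, mss = 1500;
	int high_gain = BBR_UNIT * 2885 / 1000 + 1;	/* ~2/ln(2) */
	unsigned long long bw = cwnd * BW_UNIT / srtt_us;

	/* One rate unit at mss 1500 is ~89 bytes/sec (~715 bps). */
	printf("unit: %llu B/s\n", rate_bytes_per_sec(1, mss, BBR_UNIT, 0));
	printf("initial pacing rate: %llu B/s\n",
	       rate_bytes_per_sec(bw, mss, high_gain, 1));
	return 0;
}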
+ */ + if (bbr->mode == BBR_PROBE_BW) + bbr_set_pacing_rate(sk, bbr_bw(sk), BBR_UNIT); + else if (bbr->mode == BBR_PROBE_RTT) + bbr_check_probe_rtt_done(sk); + } else if ((event == CA_EVENT_ECN_IS_CE || + event == CA_EVENT_ECN_NO_CE) && + bbr_ecn_enable && + bbr->params.precise_ece_ack) { + u32 state = bbr->ce_state; + dctcp_ece_ack_update(sk, event, &bbr->prior_rcv_nxt, &state); + bbr->ce_state = state; + if (tp->fast_ack_mode == 2 && event == CA_EVENT_ECN_IS_CE) + tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS); + } +} + +/* Calculate bdp based on min RTT and the estimated bottleneck bandwidth: + * + * bdp = ceil(bw * min_rtt * gain) + * + * The key factor, gain, controls the amount of queue. While a small gain + * builds a smaller queue, it becomes more vulnerable to noise in RTT + * measurements (e.g., delayed ACKs or other ACK compression effects). This + * noise may cause BBR to under-estimate the rate. + */ +static u32 bbr_bdp(struct sock *sk, u32 bw, int gain) +{ + struct bbr *bbr = inet_csk_ca(sk); + u32 bdp; + u64 w; + + /* If we've never had a valid RTT sample, cap cwnd at the initial + * default. This should only happen when the connection is not using TCP + * timestamps and has retransmitted all of the SYN/SYNACK/data packets + * ACKed so far. In this case, an RTO can cut cwnd to 1, in which + * case we need to slow-start up toward something safe: initial cwnd. + */ + if (unlikely(bbr->min_rtt_us == ~0U)) /* no valid RTT samples yet? */ + return bbr->init_cwnd; /* be safe: cap at initial cwnd */ + + w = (u64)bw * bbr->min_rtt_us; + + /* Apply a gain to the given value, remove the BW_SCALE shift, and + * round the value up to avoid a negative feedback loop. + */ + bdp = (((w * gain) >> BBR_SCALE) + BW_UNIT - 1) / BW_UNIT; + + return bdp; +} + +/* To achieve full performance in high-speed paths, we budget enough cwnd to + * fit full-sized skbs in-flight on both end hosts to fully utilize the path: + * - one skb in sending host Qdisc, + * - one skb in sending host TSO/GSO engine + * - one skb being received by receiver host LRO/GRO/delayed-ACK engine + * Don't worry, at low rates (bbr_min_tso_rate) this won't bloat cwnd because + * in such cases tso_segs_goal is 1. The minimum cwnd is 4 packets, + * which allows 2 outstanding 2-packet sequences, to try to keep pipe + * full even with ACK-every-other-packet delayed ACKs. + */ +static u32 bbr_quantization_budget(struct sock *sk, u32 cwnd) +{ + struct bbr *bbr = inet_csk_ca(sk); + u32 tso_segs_goal; + + tso_segs_goal = 3 * bbr_tso_segs_goal(sk); + + /* Allow enough full-sized skbs in flight to utilize end systems. */ + if (bbr->params.cwnd_tso_budget == 1) { + cwnd = max_t(u32, cwnd, tso_segs_goal); + cwnd = max_t(u32, cwnd, bbr->params.cwnd_min_target); + } else { + cwnd += tso_segs_goal; + cwnd = (cwnd + 1) & ~1U; + } + /* Ensure gain cycling gets inflight above BDP even for small BDPs. */ + if (bbr->mode == BBR_PROBE_BW && bbr->cycle_idx == BBR_BW_PROBE_UP) + cwnd += 2; + + return cwnd; +} + +/* Find inflight based on min RTT and the estimated bottleneck bandwidth. */ +static u32 bbr_inflight(struct sock *sk, u32 bw, int gain) +{ + u32 inflight; + + inflight = bbr_bdp(sk, bw, gain); + inflight = bbr_quantization_budget(sk, inflight); + + return inflight; +} + +/* With pacing at lower layers, there's often less data "in the network" than + * "in flight". With TSQ and departure time pacing at lower layers (e.g. fq), + * we often have several skbs queued in the pacing layer with a pre-scheduled + * earliest departure time (EDT). 
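The bdp computation above is plain fixed-point arithmetic; a self-contained sketch with illustrative numbers (0.01 packets/usec over a 20 ms path gives a 200-packet BDP):

/* Worked example of bbr_bdp(): bw is in (pkt/usec) << BW_SCALE, gain is
 * a Q8 (BBR_UNIT) fraction, and the result is rounded up so small BDPs
 * never round down to a stalling cwnd.
 */
#include <stdio.h>

#define BW_SCALE 24
#define BW_UNIT (1ULL << BW_SCALE)
#define BBR_SCALE 8
#define BBR_UNIT (1 << BBR_SCALE)

static unsigned int bdp(unsigned long long bw, unsigned int min_rtt_us,
			int gain)
{
	unsigned long long w = bw * min_rtt_us;

	return (((w * gain) >> BBR_SCALE) + BW_UNIT - 1) / BW_UNIT;
}

int main(void)
{
	/* 10,000 packets/sec = 0.01 pkt/usec, 20 ms path: BDP = 200 pkts. */
	unsigned long long bw = BW_UNIT / 100;

	printf("bdp at 1.0x gain: %u packets\n", bdp(bw, 20000, BBR_UNIT));
	printf("bdp at 1.25x gain: %u packets\n",
	       bdp(bw, 20000, BBR_UNIT * 5 / 4));
	return 0;
}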
BBR adapts its pacing rate based on the + * inflight level that it estimates has already been "baked in" by previous + * departure time decisions. We calculate a rough estimate of the number of our + * packets that might be in the network at the earliest departure time for the + * next skb scheduled: + * in_network_at_edt = inflight_at_edt - (EDT - now) * bw + * If we're increasing inflight, then we want to know if the transmit of the + * EDT skb will push inflight above the target, so inflight_at_edt includes + * bbr_tso_segs_goal() from the skb departing at EDT. If decreasing inflight, + * then estimate if inflight will sink too low just before the EDT transmit. + */ +static u32 bbr_packets_in_net_at_edt(struct sock *sk, u32 inflight_now) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct bbr *bbr = inet_csk_ca(sk); + u64 now_ns, edt_ns, interval_us; + u32 interval_delivered, inflight_at_edt; + + now_ns = tp->tcp_clock_cache; + edt_ns = max(tp->tcp_wstamp_ns, now_ns); + interval_us = div_u64(edt_ns - now_ns, NSEC_PER_USEC); + interval_delivered = (u64)bbr_bw(sk) * interval_us >> BW_SCALE; + inflight_at_edt = inflight_now; + if (bbr->pacing_gain > BBR_UNIT) /* increasing inflight */ + inflight_at_edt += bbr_tso_segs_goal(sk); /* include EDT skb */ + if (interval_delivered >= inflight_at_edt) + return 0; + return inflight_at_edt - interval_delivered; +} + +/* Find the cwnd increment based on estimate of ack aggregation */ +static u32 bbr_ack_aggregation_cwnd(struct sock *sk) +{ + struct bbr *bbr = inet_csk_ca(sk); + u32 max_aggr_cwnd, aggr_cwnd = 0; + + if (bbr->params.extra_acked_gain && + (bbr_full_bw_reached(sk) || bbr->params.extra_acked_in_startup)) { + max_aggr_cwnd = ((u64)bbr_bw(sk) * bbr_extra_acked_max_us) + / BW_UNIT; + aggr_cwnd = (bbr->params.extra_acked_gain * bbr_extra_acked(sk)) + >> BBR_SCALE; + aggr_cwnd = min(aggr_cwnd, max_aggr_cwnd); + } + + return aggr_cwnd; +} + +/* Returns the cwnd for PROBE_RTT mode. */ +static u32 bbr_probe_rtt_cwnd(struct sock *sk) +{ + struct bbr *bbr = inet_csk_ca(sk); + + if (bbr->params.probe_rtt_cwnd_gain == 0) + return bbr->params.cwnd_min_target; + return max_t(u32, bbr->params.cwnd_min_target, + bbr_bdp(sk, bbr_bw(sk), bbr->params.probe_rtt_cwnd_gain)); +} + +/* Slow-start up toward target cwnd (if bw estimate is growing, or packet loss + * has drawn us down below target), or snap down to target if we're above it. + */ +static void bbr_set_cwnd(struct sock *sk, const struct rate_sample *rs, + u32 acked, u32 bw, int gain, u32 cwnd, + struct bbr_context *ctx) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct bbr *bbr = inet_csk_ca(sk); + u32 target_cwnd = 0, prev_cwnd = tp->snd_cwnd, max_probe; + + if (!acked) + goto done; /* no packet fully ACKed; just apply caps */ + + target_cwnd = bbr_bdp(sk, bw, gain); + + /* Increment the cwnd to account for excess ACKed data that seems + * due to aggregation (of data and/or ACKs) visible in the ACK stream. + */ + target_cwnd += bbr_ack_aggregation_cwnd(sk); + target_cwnd = bbr_quantization_budget(sk, target_cwnd); + + /* If we're below target cwnd, slow start cwnd toward target cwnd. */ + bbr->debug.target_cwnd = target_cwnd; + + /* Update cwnd and enable fast path if cwnd reaches target_cwnd. 
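A minimal sketch of the EDT adjustment described above, with illustrative values; the tso_goal and probing_up inputs stand in for bbr_tso_segs_goal() and the pacing_gain > BBR_UNIT check:

/* Estimate packets still in the network at the next earliest departure
 * time: subtract what the path will have drained by then.
 */
#include <stdio.h>

#define BW_SCALE 24

static unsigned int packets_in_net_at_edt(unsigned int inflight_now,
					  unsigned long long bw, /* pkt/usec << 24 */
					  unsigned long long edt_delay_us,
					  unsigned int tso_goal,
					  int probing_up)
{
	unsigned long long drained = (bw * edt_delay_us) >> BW_SCALE;
	unsigned long long at_edt = inflight_now;

	if (probing_up)		/* count the skb departing at EDT itself */
		at_edt += tso_goal;
	return drained >= at_edt ? 0 : at_edt - drained;
}

int main(void)
{
	/* 100 pkts in flight, ~0.05 pkt/usec, EDT 400 us away. */
	unsigned long long bw = (1ULL << BW_SCALE) / 20;

	printf("in network at EDT: %u packets\n",
	       packets_in_net_at_edt(100, bw, 400, 3, 1));	/* 84 */
	return 0;
}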
*/ + bbr->try_fast_path = 0; + if (bbr_full_bw_reached(sk)) { /* only cut cwnd if we filled the pipe */ + cwnd += acked; + if (cwnd >= target_cwnd) { + cwnd = target_cwnd; + bbr->try_fast_path = 1; + } + } else if (cwnd < target_cwnd || cwnd < 2 * bbr->init_cwnd) { + cwnd += acked; + } else { + bbr->try_fast_path = 1; + } + + /* When growing cwnd, don't grow beyond twice what we just probed. */ + if (bbr->params.usage_based_cwnd) { + max_probe = max(2 * tp->max_packets_out, tp->snd_cwnd); + cwnd = min(cwnd, max_probe); + } + + cwnd = max_t(u32, cwnd, bbr->params.cwnd_min_target); +done: + tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp); /* apply global cap */ + if (bbr->mode == BBR_PROBE_RTT) /* drain queue, refresh min_rtt */ + tp->snd_cwnd = min_t(u32, tp->snd_cwnd, bbr_probe_rtt_cwnd(sk)); + + ctx->target_cwnd = target_cwnd; + ctx->log = (tp->snd_cwnd != prev_cwnd); +} + +/* See if we have reached next round trip */ +static void bbr_update_round_start(struct sock *sk, + const struct rate_sample *rs, struct bbr_context *ctx) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct bbr *bbr = inet_csk_ca(sk); + + bbr->round_start = 0; + + /* See if we've reached the next RTT */ + if (rs->interval_us > 0 && + !before(rs->prior_delivered, bbr->next_rtt_delivered)) { + bbr->next_rtt_delivered = tp->delivered; + bbr->round_start = 1; + } +} + +/* Calculate the bandwidth based on how fast packets are delivered */ +static void bbr_calculate_bw_sample(struct sock *sk, + const struct rate_sample *rs, struct bbr_context *ctx) +{ + struct bbr *bbr = inet_csk_ca(sk); + u64 bw = 0; + + /* Divide delivered by the interval to find a (lower bound) bottleneck + * bandwidth sample. Delivered is in packets and interval_us in uS and + * ratio will be <<1 for most connections. So delivered is first scaled. + * Round up to allow growth at low rates, even with integer division. + */ + if (rs->interval_us > 0) { + if (WARN_ONCE(rs->delivered < 0, + "negative delivered: %d interval_us: %ld\n", + rs->delivered, rs->interval_us)) + return; + + bw = DIV_ROUND_UP_ULL((u64)rs->delivered * BW_UNIT, rs->interval_us); + } + + ctx->sample_bw = bw; + bbr->debug.rs_bw = bw; +} + +/* Estimates the windowed max degree of ack aggregation. + * This is used to provision extra in-flight data to keep sending during + * inter-ACK silences. + * + * Degree of ack aggregation is estimated as extra data acked beyond expected. + * + * max_extra_acked = "maximum recent excess data ACKed beyond max_bw * interval" + * cwnd += max_extra_acked + * + * Max extra_acked is clamped by cwnd and bw * bbr_extra_acked_max_us (100 ms). + * Max filter is an approximate sliding window of 5-10 (packet timed) round + * trips for non-startup phase, and 1-2 round trips for startup. + */ +static void bbr_update_ack_aggregation(struct sock *sk, + const struct rate_sample *rs) +{ + u32 epoch_us, expected_acked, extra_acked; + struct bbr *bbr = inet_csk_ca(sk); + struct tcp_sock *tp = tcp_sk(sk); + u32 extra_acked_win_rtts_thresh = bbr->params.extra_acked_win_rtts; + + if (!bbr->params.extra_acked_gain || rs->acked_sacked <= 0 || + rs->delivered < 0 || rs->interval_us <= 0) + return; + + if (bbr->round_start) { + bbr->extra_acked_win_rtts = min(0x1F, + bbr->extra_acked_win_rtts + 1); + if (bbr->params.extra_acked_in_startup && + !bbr_full_bw_reached(sk)) + extra_acked_win_rtts_thresh = 1; + if (bbr->extra_acked_win_rtts >= + extra_acked_win_rtts_thresh) { + bbr->extra_acked_win_rtts = 0; + bbr->extra_acked_win_idx = bbr->extra_acked_win_idx ? 
+ 0 : 1; + bbr->extra_acked[bbr->extra_acked_win_idx] = 0; + } + } + + /* Compute how many packets we expected to be delivered over epoch. */ + epoch_us = tcp_stamp_us_delta(tp->delivered_mstamp, + bbr->ack_epoch_mstamp); + expected_acked = ((u64)bbr_bw(sk) * epoch_us) / BW_UNIT; + + /* Reset the aggregation epoch if ACK rate is below expected rate or + * significantly large no. of ack received since epoch (potentially + * quite old epoch). + */ + if (bbr->ack_epoch_acked <= expected_acked || + (bbr->ack_epoch_acked + rs->acked_sacked >= + bbr_ack_epoch_acked_reset_thresh)) { + bbr->ack_epoch_acked = 0; + bbr->ack_epoch_mstamp = tp->delivered_mstamp; + expected_acked = 0; + } + + /* Compute excess data delivered, beyond what was expected. */ + bbr->ack_epoch_acked = min_t(u32, 0xFFFFF, + bbr->ack_epoch_acked + rs->acked_sacked); + extra_acked = bbr->ack_epoch_acked - expected_acked; + extra_acked = min(extra_acked, tp->snd_cwnd); + if (extra_acked > bbr->extra_acked[bbr->extra_acked_win_idx]) + bbr->extra_acked[bbr->extra_acked_win_idx] = extra_acked; +} + +/* Estimate when the pipe is full, using the change in delivery rate: BBR + * estimates that STARTUP filled the pipe if the estimated bw hasn't changed by + * at least bbr_full_bw_thresh (25%) after bbr_full_bw_cnt (3) non-app-limited + * rounds. Why 3 rounds: 1: rwin autotuning grows the rwin, 2: we fill the + * higher rwin, 3: we get higher delivery rate samples. Or transient + * cross-traffic or radio noise can go away. CUBIC Hystart shares a similar + * design goal, but uses delay and inter-ACK spacing instead of bandwidth. + */ +static void bbr_check_full_bw_reached(struct sock *sk, + const struct rate_sample *rs) +{ + struct bbr *bbr = inet_csk_ca(sk); + u32 bw_thresh; + + if (bbr_full_bw_reached(sk) || !bbr->round_start || rs->is_app_limited) + return; + + bw_thresh = (u64)bbr->full_bw * bbr->params.full_bw_thresh >> BBR_SCALE; + if (bbr_max_bw(sk) >= bw_thresh) { + bbr->full_bw = bbr_max_bw(sk); + bbr->full_bw_cnt = 0; + return; + } + ++bbr->full_bw_cnt; + bbr->full_bw_reached = bbr->full_bw_cnt >= bbr->params.full_bw_cnt; +} + +/* If pipe is probably full, drain the queue and then enter steady-state. */ +static bool bbr_check_drain(struct sock *sk, const struct rate_sample *rs, + struct bbr_context *ctx) +{ + struct bbr *bbr = inet_csk_ca(sk); + + if (bbr->mode == BBR_STARTUP && bbr_full_bw_reached(sk)) { + bbr->mode = BBR_DRAIN; /* drain queue we created */ + tcp_sk(sk)->snd_ssthresh = + bbr_inflight(sk, bbr_max_bw(sk), BBR_UNIT); + bbr2_reset_congestion_signals(sk); + } /* fall through to check if in-flight is already small: */ + if (bbr->mode == BBR_DRAIN && + bbr_packets_in_net_at_edt(sk, tcp_packets_in_flight(tcp_sk(sk))) <= + bbr_inflight(sk, bbr_max_bw(sk), BBR_UNIT)) + return true; /* exiting DRAIN now */ + return false; +} + +static void bbr_check_probe_rtt_done(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct bbr *bbr = inet_csk_ca(sk); + + if (!(bbr->probe_rtt_done_stamp && + after(tcp_jiffies32, bbr->probe_rtt_done_stamp))) + return; + + bbr->probe_rtt_min_stamp = tcp_jiffies32; /* schedule next PROBE_RTT */ + tp->snd_cwnd = max(tp->snd_cwnd, bbr->prior_cwnd); + bbr2_exit_probe_rtt(sk); +} + +/* The goal of PROBE_RTT mode is to have BBR flows cooperatively and + * periodically drain the bottleneck queue, to converge to measure the true + * min_rtt (unloaded propagation delay). 
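A toy, self-contained simulation of the STARTUP exit rule in bbr_check_full_bw_reached() above: the windowed max bandwidth is tracked round by round, and the pipe is declared full after three consecutive rounds without a 25% gain (the sample numbers are invented):

/* Simulate STARTUP's full-pipe check: stop once bw growth stalls for
 * FULL_BW_CNT consecutive non-app-limited rounds. The 5/4 ratio mirrors
 * bbr_full_bw_thresh / BBR_UNIT, cross-multiplied to avoid fixed point.
 */
#include <stdio.h>

#define FULL_BW_THRESH_NUM 5
#define FULL_BW_THRESH_DEN 4
#define FULL_BW_CNT 3

int main(void)
{
	/* Per-round max bw samples: growth stalls at 100 units. */
	unsigned int samples[] = { 10, 25, 60, 95, 100, 100, 100, 100 };
	unsigned int full_bw = 0, cnt = 0;

	for (int round = 0; round < 8; round++) {
		unsigned int bw = samples[round];

		if (bw * FULL_BW_THRESH_DEN >= full_bw * FULL_BW_THRESH_NUM) {
			full_bw = bw;	/* still growing: reset the count */
			cnt = 0;
			continue;
		}
		if (++cnt >= FULL_BW_CNT) {
			printf("pipe full after round %d (bw=%u)\n",
			       round, bw);
			break;
		}
	}
	return 0;
}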
This allows the flows to keep queues + * small (reducing queuing delay and packet loss) and achieve fairness among + * BBR flows. + * + * The min_rtt filter window is 10 seconds. When the min_rtt estimate expires, + * we enter PROBE_RTT mode and cap the cwnd at bbr_cwnd_min_target=4 packets. + * After at least bbr_probe_rtt_mode_ms=200ms and at least one packet-timed + * round trip elapsed with that flight size <= 4, we leave PROBE_RTT mode and + * re-enter the previous mode. BBR uses 200ms to approximately bound the + * performance penalty of PROBE_RTT's cwnd capping to roughly 2% (200ms/10s). + * + * Note that flows need only pay 2% if they are busy sending over the last 10 + * seconds. Interactive applications (e.g., Web, RPCs, video chunks) often have + * natural silences or low-rate periods within 10 seconds where the rate is low + * enough for long enough to drain its queue in the bottleneck. We pick up + * these min RTT measurements opportunistically with our min_rtt filter. :-) + */ +static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct bbr *bbr = inet_csk_ca(sk); + bool probe_rtt_expired, min_rtt_expired; + u32 expire; + + /* Track min RTT in probe_rtt_win_ms to time next PROBE_RTT state. */ + expire = bbr->probe_rtt_min_stamp + + msecs_to_jiffies(bbr->params.probe_rtt_win_ms); + probe_rtt_expired = after(tcp_jiffies32, expire); + if (rs->rtt_us >= 0 && + (rs->rtt_us <= bbr->probe_rtt_min_us || + (probe_rtt_expired && !rs->is_ack_delayed))) { + bbr->probe_rtt_min_us = rs->rtt_us; + bbr->probe_rtt_min_stamp = tcp_jiffies32; + } + /* Track min RTT seen in the min_rtt_win_sec filter window: */ + expire = bbr->min_rtt_stamp + bbr->params.min_rtt_win_sec * HZ; + min_rtt_expired = after(tcp_jiffies32, expire); + if (bbr->probe_rtt_min_us <= bbr->min_rtt_us || + min_rtt_expired) { + bbr->min_rtt_us = bbr->probe_rtt_min_us; + bbr->min_rtt_stamp = bbr->probe_rtt_min_stamp; + } + + if (bbr->params.probe_rtt_mode_ms > 0 && probe_rtt_expired && + !bbr->idle_restart && bbr->mode != BBR_PROBE_RTT) { + bbr->mode = BBR_PROBE_RTT; /* dip, drain queue */ + bbr_save_cwnd(sk); /* note cwnd so we can restore it */ + bbr->probe_rtt_done_stamp = 0; + bbr->ack_phase = BBR_ACKS_PROBE_STOPPING; + bbr->next_rtt_delivered = tp->delivered; + } + + if (bbr->mode == BBR_PROBE_RTT) { + /* Ignore low rate samples during this mode. */ + tp->app_limited = + (tp->delivered + tcp_packets_in_flight(tp)) ? : 1; + /* Maintain min packets in flight for max(200 ms, 1 round). 
*/ + if (!bbr->probe_rtt_done_stamp && + tcp_packets_in_flight(tp) <= bbr_probe_rtt_cwnd(sk)) { + bbr->probe_rtt_done_stamp = tcp_jiffies32 + + msecs_to_jiffies(bbr->params.probe_rtt_mode_ms); + bbr->probe_rtt_round_done = 0; + bbr->next_rtt_delivered = tp->delivered; + } else if (bbr->probe_rtt_done_stamp) { + if (bbr->round_start) + bbr->probe_rtt_round_done = 1; + if (bbr->probe_rtt_round_done) + bbr_check_probe_rtt_done(sk); + } + } + /* Restart after idle ends only once we process a new S/ACK for data */ + if (rs->delivered > 0) + bbr->idle_restart = 0; +} + +static void bbr_update_gains(struct sock *sk) +{ + struct bbr *bbr = inet_csk_ca(sk); + + switch (bbr->mode) { + case BBR_STARTUP: + bbr->pacing_gain = bbr->params.high_gain; + bbr->cwnd_gain = bbr->params.startup_cwnd_gain; + break; + case BBR_DRAIN: + bbr->pacing_gain = bbr->params.drain_gain; /* slow, to drain */ + bbr->cwnd_gain = bbr->params.startup_cwnd_gain; /* keep cwnd */ + break; + case BBR_PROBE_BW: + bbr->pacing_gain = bbr->params.pacing_gain[bbr->cycle_idx]; + bbr->cwnd_gain = bbr->params.cwnd_gain; + break; + case BBR_PROBE_RTT: + bbr->pacing_gain = BBR_UNIT; + bbr->cwnd_gain = BBR_UNIT; + break; + default: + WARN_ONCE(1, "BBR bad mode: %u\n", bbr->mode); + break; + } +} + +static void bbr_init(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct bbr *bbr = inet_csk_ca(sk); + int i; + + WARN_ON_ONCE(tp->snd_cwnd >= bbr_cwnd_warn_val); + + bbr->initialized = 1; + bbr->params.high_gain = min(0x7FF, bbr_high_gain); + bbr->params.drain_gain = min(0x3FF, bbr_drain_gain); + bbr->params.startup_cwnd_gain = min(0x7FF, bbr_startup_cwnd_gain); + bbr->params.cwnd_gain = min(0x7FF, bbr_cwnd_gain); + bbr->params.cwnd_tso_budget = min(0x1U, bbr_cwnd_tso_budget); + bbr->params.cwnd_min_target = min(0xFU, bbr_cwnd_min_target); + bbr->params.min_rtt_win_sec = min(0x1FU, bbr_min_rtt_win_sec); + bbr->params.probe_rtt_mode_ms = min(0x1FFU, bbr_probe_rtt_mode_ms); + bbr->params.full_bw_cnt = min(0x7U, bbr_full_bw_cnt); + bbr->params.full_bw_thresh = min(0x3FFU, bbr_full_bw_thresh); + bbr->params.extra_acked_gain = min(0x7FF, bbr_extra_acked_gain); + bbr->params.extra_acked_win_rtts = min(0x1FU, bbr_extra_acked_win_rtts); + bbr->params.drain_to_target = bbr_drain_to_target ? 1 : 0; + bbr->params.precise_ece_ack = bbr_precise_ece_ack ? 1 : 0; + bbr->params.extra_acked_in_startup = bbr_extra_acked_in_startup ? 1 : 0; + bbr->params.probe_rtt_cwnd_gain = min(0xFFU, bbr_probe_rtt_cwnd_gain); + bbr->params.probe_rtt_win_ms = + min(0x3FFFU, + min_t(u32, bbr_probe_rtt_win_ms, + bbr->params.min_rtt_win_sec * MSEC_PER_SEC)); + for (i = 0; i < CYCLE_LEN; i++) + bbr->params.pacing_gain[i] = min(0x3FF, bbr_pacing_gain[i]); + bbr->params.usage_based_cwnd = bbr_usage_based_cwnd ? 
1 : 0; + bbr->params.tso_rtt_shift = min(0xFU, bbr_tso_rtt_shift); + + bbr->debug.snd_isn = tp->snd_una; + bbr->debug.target_cwnd = 0; + bbr->debug.undo = 0; + + bbr->init_cwnd = min(0x7FU, tp->snd_cwnd); + bbr->prior_cwnd = tp->prior_cwnd; + tp->snd_ssthresh = TCP_INFINITE_SSTHRESH; + bbr->next_rtt_delivered = 0; + bbr->prev_ca_state = TCP_CA_Open; + bbr->packet_conservation = 0; + + bbr->probe_rtt_done_stamp = 0; + bbr->probe_rtt_round_done = 0; + bbr->probe_rtt_min_us = tcp_min_rtt(tp); + bbr->probe_rtt_min_stamp = tcp_jiffies32; + bbr->min_rtt_us = tcp_min_rtt(tp); + bbr->min_rtt_stamp = tcp_jiffies32; + + bbr->has_seen_rtt = 0; + bbr_init_pacing_rate_from_rtt(sk); + + bbr->round_start = 0; + bbr->idle_restart = 0; + bbr->full_bw_reached = 0; + bbr->full_bw = 0; + bbr->full_bw_cnt = 0; + bbr->cycle_mstamp = 0; + bbr->cycle_idx = 0; + bbr->mode = BBR_STARTUP; + bbr->debug.rs_bw = 0; + + bbr->ack_epoch_mstamp = tp->tcp_mstamp; + bbr->ack_epoch_acked = 0; + bbr->extra_acked_win_rtts = 0; + bbr->extra_acked_win_idx = 0; + bbr->extra_acked[0] = 0; + bbr->extra_acked[1] = 0; + + bbr->ce_state = 0; + bbr->prior_rcv_nxt = tp->rcv_nxt; + bbr->try_fast_path = 0; + + cmpxchg(&sk->sk_pacing_status, SK_PACING_NONE, SK_PACING_NEEDED); +} + +static u32 bbr_sndbuf_expand(struct sock *sk) +{ + /* Provision 3 * cwnd since BBR may slow-start even during recovery. */ + return 3; +} + +/* __________________________________________________________________________ + * + * Functions new to BBR v2 ("bbr") congestion control are below here. + * __________________________________________________________________________ + */ + +/* Incorporate a new bw sample into the current window of our max filter. */ +static void bbr2_take_bw_hi_sample(struct sock *sk, u32 bw) +{ + struct bbr *bbr = inet_csk_ca(sk); + + bbr->bw_hi[1] = max(bw, bbr->bw_hi[1]); +} + +/* Keep max of last 1-2 cycles. Each PROBE_BW cycle, flip filter window. */ +static void bbr2_advance_bw_hi_filter(struct sock *sk) +{ + struct bbr *bbr = inet_csk_ca(sk); + + if (!bbr->bw_hi[1]) + return; /* no samples in this window; remember old window */ + bbr->bw_hi[0] = bbr->bw_hi[1]; + bbr->bw_hi[1] = 0; +} + +/* How much do we want in flight? Our BDP, unless congestion cut cwnd. */ +static u32 bbr2_target_inflight(struct sock *sk) +{ + u32 bdp = bbr_inflight(sk, bbr_bw(sk), BBR_UNIT); + + return min(bdp, tcp_sk(sk)->snd_cwnd); +} + +static bool bbr2_is_probing_bandwidth(struct sock *sk) +{ + struct bbr *bbr = inet_csk_ca(sk); + + return (bbr->mode == BBR_STARTUP) || + (bbr->mode == BBR_PROBE_BW && + (bbr->cycle_idx == BBR_BW_PROBE_REFILL || + bbr->cycle_idx == BBR_BW_PROBE_UP)); +} + +/* Has the given amount of time elapsed since we marked the phase start? */ +static bool bbr2_has_elapsed_in_phase(const struct sock *sk, u32 interval_us) +{ + const struct tcp_sock *tp = tcp_sk(sk); + const struct bbr *bbr = inet_csk_ca(sk); + + return tcp_stamp_us_delta(tp->tcp_mstamp, + bbr->cycle_mstamp + interval_us) > 0; +} + +static void bbr2_handle_queue_too_high_in_startup(struct sock *sk) +{ + struct bbr *bbr = inet_csk_ca(sk); + + bbr->full_bw_reached = 1; + bbr->inflight_hi = bbr_inflight(sk, bbr_max_bw(sk), BBR_UNIT); +} + +/* Exit STARTUP upon N consecutive rounds with ECN mark rate > ecn_thresh. 
+static void bbr2_check_ecn_too_high_in_startup(struct sock *sk, u32 ce_ratio)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if (bbr_full_bw_reached(sk) || !bbr->ecn_eligible ||
+	    !bbr->params.full_ecn_cnt || !bbr->params.ecn_thresh)
+		return;
+
+	if (ce_ratio >= bbr->params.ecn_thresh)
+		bbr->startup_ecn_rounds++;
+	else
+		bbr->startup_ecn_rounds = 0;
+
+	if (bbr->startup_ecn_rounds >= bbr->params.full_ecn_cnt) {
+		bbr->debug.event = 'E';  /* ECN caused STARTUP exit */
+		bbr2_handle_queue_too_high_in_startup(sk);
+		return;
+	}
+}
+
+static void bbr2_update_ecn_alpha(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	s32 delivered, delivered_ce;
+	u64 alpha, ce_ratio;
+	u32 gain;
+
+	if (bbr->params.ecn_factor == 0)
+		return;
+
+	delivered = tp->delivered - bbr->alpha_last_delivered;
+	delivered_ce = tp->delivered_ce - bbr->alpha_last_delivered_ce;
+
+	if (delivered == 0 ||		/* avoid divide by zero */
+	    WARN_ON_ONCE(delivered < 0 || delivered_ce < 0))  /* backwards? */
+		return;
+
+	/* See if we should use ECN sender logic for this connection. */
+	if (!bbr->ecn_eligible && bbr_ecn_enable &&
+	    (bbr->min_rtt_us <= bbr->params.ecn_max_rtt_us ||
+	     !bbr->params.ecn_max_rtt_us))
+		bbr->ecn_eligible = 1;
+
+	ce_ratio = (u64)delivered_ce << BBR_SCALE;
+	do_div(ce_ratio, delivered);
+	gain = bbr->params.ecn_alpha_gain;
+	alpha = ((BBR_UNIT - gain) * bbr->ecn_alpha) >> BBR_SCALE;
+	alpha += (gain * ce_ratio) >> BBR_SCALE;
+	bbr->ecn_alpha = min_t(u32, alpha, BBR_UNIT);
+
+	bbr->alpha_last_delivered = tp->delivered;
+	bbr->alpha_last_delivered_ce = tp->delivered_ce;
+
+	bbr2_check_ecn_too_high_in_startup(sk, ce_ratio);
+}
+
+/* Each round trip of BBR_BW_PROBE_UP, double volume of probing data. */
+static void bbr2_raise_inflight_hi_slope(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 growth_this_round, cnt;
+
+	/* Calculate "slope": packets S/Acked per inflight_hi increment. */
+	growth_this_round = 1 << bbr->bw_probe_up_rounds;
+	bbr->bw_probe_up_rounds = min(bbr->bw_probe_up_rounds + 1, 30);
+	cnt = tp->snd_cwnd / growth_this_round;
+	cnt = max(cnt, 1U);
+	bbr->bw_probe_up_cnt = cnt;
+	bbr->debug.event = 'G';  /* Grow inflight_hi slope */
+}
+
+/* In BBR_BW_PROBE_UP, not seeing high loss/ECN/queue, so raise inflight_hi. */
+static void bbr2_probe_inflight_hi_upward(struct sock *sk,
+					  const struct rate_sample *rs)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 delta;
+
+	if (!tp->is_cwnd_limited || tp->snd_cwnd < bbr->inflight_hi) {
+		bbr->bw_probe_up_acks = 0;  /* don't accumulate unused credits */
+		return;  /* not fully using inflight_hi, so don't grow it */
+	}
+
+	/* For each bw_probe_up_cnt packets ACKed, increase inflight_hi by 1. */
+	bbr->bw_probe_up_acks += rs->acked_sacked;
+	if (bbr->bw_probe_up_acks >= bbr->bw_probe_up_cnt) {
+		delta = bbr->bw_probe_up_acks / bbr->bw_probe_up_cnt;
+		bbr->bw_probe_up_acks -= delta * bbr->bw_probe_up_cnt;
+		bbr->inflight_hi += delta;
+		bbr->debug.event = 'I';  /* Increment inflight_hi */
+	}
+
+	if (bbr->round_start)
+		bbr2_raise_inflight_hi_slope(sk);
+}
+
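The ecn_alpha update above is a fixed-point EWMA: with BBR_SCALE = 8 and
BBR_UNIT = 256, it computes alpha = (1 - gain) * alpha + gain * ce_ratio
in pure integer math. A minimal user-space sketch of the same arithmetic,
using hypothetical per-round delivery counts (bbr2_update_ecn_alpha()
above remains the authoritative version)::

  #include <stdint.h>
  #include <stdio.h>

  #define BBR_SCALE 8			/* same fixed-point convention as the patch */
  #define BBR_UNIT (1 << BBR_SCALE)	/* 256 represents 1.0 */

  /* EWMA of the CE mark ratio, mirroring the fixed-point math in
   * bbr2_update_ecn_alpha(). gain and alpha are BBR_SCALE fixed point.
   */
  static uint32_t ecn_alpha_update(uint32_t alpha, uint32_t gain,
  				 uint64_t delivered_ce, uint64_t delivered)
  {
  	uint64_t ce_ratio = (delivered_ce << BBR_SCALE) / delivered;
  	uint64_t next = ((BBR_UNIT - gain) * (uint64_t)alpha) >> BBR_SCALE;

  	next += (gain * ce_ratio) >> BBR_SCALE;
  	return next > BBR_UNIT ? BBR_UNIT : (uint32_t)next;
  }

  int main(void)
  {
  	/* Hypothetical rounds: 100 packets delivered, 25 CE-marked,
  	 * gain = 1/16 (the bbr_ecn_alpha_gain default in this patch).
  	 */
  	uint32_t alpha = BBR_UNIT;	/* ecn_alpha_init: start at 1.0 */
  	int i;

  	for (i = 0; i < 5; i++) {
  		alpha = ecn_alpha_update(alpha, BBR_UNIT / 16, 25, 100);
  		printf("round %d: alpha = %u/256\n", i + 1, (unsigned)alpha);
  	}
  	return 0;	/* alpha decays from 256 toward ce_ratio = 64 */
  }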
+/* Does loss/ECN rate for this sample say inflight is "too high"?
+ * This is used by both the bbr_check_loss_too_high_in_startup() function,
+ * which can be used in either v1 or v2, and the PROBE_UP phase of v2, which
+ * uses it to notice when loss/ECN rates suggest inflight is too high.
+ */
+static bool bbr2_is_inflight_too_high(const struct sock *sk,
+				      const struct rate_sample *rs)
+{
+	const struct bbr *bbr = inet_csk_ca(sk);
+	u32 loss_thresh, ecn_thresh;
+
+	if (rs->lost > 0 && rs->tx_in_flight) {
+		loss_thresh = (u64)rs->tx_in_flight * bbr->params.loss_thresh >>
+				BBR_SCALE;
+		if (rs->lost > loss_thresh)
+			return true;
+	}
+
+	if (rs->delivered_ce > 0 && rs->delivered > 0 &&
+	    bbr->ecn_eligible && bbr->params.ecn_thresh) {
+		ecn_thresh = (u64)rs->delivered * bbr->params.ecn_thresh >>
+				BBR_SCALE;
+		if (rs->delivered_ce >= ecn_thresh)
+			return true;
+	}
+
+	return false;
+}
+
+/* Calculate the tx_in_flight level that corresponded to excessive loss.
+ * We find "lost_prefix" segs of the skb where loss rate went too high,
+ * by solving for "lost_prefix" in the following equation:
+ *    lost                      /  inflight                      >= loss_thresh
+ *   (lost_prev + lost_prefix)  / (inflight_prev + lost_prefix)  >= loss_thresh
+ * Then we take that equation, convert it to fixed point, and
+ * round up to the nearest packet.
+ */
+static u32 bbr2_inflight_hi_from_lost_skb(const struct sock *sk,
+					  const struct rate_sample *rs,
+					  const struct sk_buff *skb)
+{
+	const struct bbr *bbr = inet_csk_ca(sk);
+	u32 loss_thresh = bbr->params.loss_thresh;
+	u32 pcount, divisor, inflight_hi;
+	s32 inflight_prev, lost_prev;
+	u64 loss_budget, lost_prefix;
+
+	pcount = tcp_skb_pcount(skb);
+
+	/* How much data was in flight before this skb? */
+	inflight_prev = rs->tx_in_flight - pcount;
+	if (WARN_ONCE(inflight_prev < 0,
+		      "tx_in_flight: %u pcount: %u reneg: %u",
+		      rs->tx_in_flight, pcount, tcp_sk(sk)->is_sack_reneg))
+		return ~0U;
+
+	/* How much inflight data was marked lost before this skb? */
+	lost_prev = rs->lost - pcount;
+	if (WARN_ON_ONCE(lost_prev < 0))
+		return ~0U;
+
+	/* At what prefix of this lost skb did loss rate exceed loss_thresh? */
+	loss_budget = (u64)inflight_prev * loss_thresh + BBR_UNIT - 1;
+	loss_budget >>= BBR_SCALE;
+	if (lost_prev >= loss_budget) {
+		lost_prefix = 0;  /* previous losses crossed loss_thresh */
+	} else {
+		lost_prefix = loss_budget - lost_prev;
+		lost_prefix <<= BBR_SCALE;
+		divisor = BBR_UNIT - loss_thresh;
+		if (WARN_ON_ONCE(!divisor))  /* loss_thresh is 8 bits */
+			return ~0U;
+		do_div(lost_prefix, divisor);
+	}
+
+	inflight_hi = inflight_prev + lost_prefix;
+	return inflight_hi;
+}
+
+/* If loss/ECN rates during probing indicated we may have overfilled a
+ * buffer, return an operating point that tries to leave unutilized headroom in
+ * the path for other flows, for fairness convergence and lower RTTs and loss.
+ */
+static u32 bbr2_inflight_with_headroom(const struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 headroom, headroom_fraction;
+
+	if (bbr->inflight_hi == ~0U)
+		return ~0U;
+
+	headroom_fraction = bbr->params.inflight_headroom;
+	headroom = ((u64)bbr->inflight_hi * headroom_fraction) >> BBR_SCALE;
+	headroom = max(headroom, 1U);
+	return max_t(s32, bbr->inflight_hi - headroom,
+		     bbr->params.cwnd_min_target);
+}
+
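The lost_prefix math in bbr2_inflight_hi_from_lost_skb() can be
sanity-checked outside the kernel: solving
(lost_prev + p) / (inflight_prev + p) = loss_thresh for p gives
p = (budget - lost_prev) / (1 - loss_thresh), where budget is the
rounded-up loss allowance for inflight_prev. A standalone sketch with
made-up input values::

  #include <stdint.h>
  #include <stdio.h>

  #define BBR_SCALE 8
  #define BBR_UNIT (1 << BBR_SCALE)

  /* Re-derive the lost_prefix computation from
   * bbr2_inflight_hi_from_lost_skb() in user space, to sanity-check the
   * algebra: find how far into the lost skb the cumulative loss rate
   * first crossed loss_thresh. All inputs are hypothetical.
   */
  int main(void)
  {
  	uint32_t loss_thresh = BBR_UNIT * 2 / 100;	/* ~2%, as in the patch */
  	uint32_t inflight_prev = 300;	/* in flight before this skb */
  	uint32_t lost_prev = 2;		/* already lost before this skb */
  	uint64_t loss_budget, lost_prefix;
  	uint32_t inflight_hi;

  	/* Max losses that keep us at/below loss_thresh, rounded up. */
  	loss_budget = ((uint64_t)inflight_prev * loss_thresh +
  		       BBR_UNIT - 1) >> BBR_SCALE;
  	if (lost_prev >= loss_budget) {
  		lost_prefix = 0;
  	} else {
  		/* p = (budget - lost_prev) / (1 - loss_thresh), fixed point */
  		lost_prefix = ((loss_budget - lost_prev) << BBR_SCALE) /
  			      (BBR_UNIT - loss_thresh);
  	}
  	inflight_hi = inflight_prev + (uint32_t)lost_prefix;

  	printf("budget=%llu prefix=%llu -> inflight_hi=%u (loss rate %.2f%%)\n",
  	       (unsigned long long)loss_budget,
  	       (unsigned long long)lost_prefix, inflight_hi,
  	       100.0 * (lost_prev + lost_prefix) / inflight_hi);
  	return 0;	/* prints ~2%: the level that matched loss_thresh */
  }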
+/* Bound cwnd to a sensible level, based on our current probing state
+ * machine phase and model of a good inflight level (inflight_lo, inflight_hi).
+ */
+static void bbr2_bound_cwnd_for_inflight_model(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 cap;
+
+	/* tcp_rcv_synsent_state_process() currently calls tcp_ack()
+	 * and thus cong_control() without first initializing us(!).
+	 */
+	if (!bbr->initialized)
+		return;
+
+	cap = ~0U;
+	if (bbr->mode == BBR_PROBE_BW &&
+	    bbr->cycle_idx != BBR_BW_PROBE_CRUISE) {
+		/* Probe to see if more packets fit in the path. */
+		cap = bbr->inflight_hi;
+	} else {
+		if (bbr->mode == BBR_PROBE_RTT ||
+		    (bbr->mode == BBR_PROBE_BW &&
+		     bbr->cycle_idx == BBR_BW_PROBE_CRUISE))
+			cap = bbr2_inflight_with_headroom(sk);
+	}
+	/* Adapt to any loss/ECN since our last bw probe. */
+	cap = min(cap, bbr->inflight_lo);
+
+	cap = max_t(u32, cap, bbr->params.cwnd_min_target);
+	tp->snd_cwnd = min(cap, tp->snd_cwnd);
+}
+
+/* Estimate a short-term lower bound on the capacity available now, based
+ * on measurements of the current delivery process and recent history. When we
+ * are seeing loss/ECN at times when we are not probing bw, then conservatively
+ * move toward flow balance by multiplicatively cutting our short-term
+ * estimated safe rate and volume of data (bw_lo and inflight_lo). We use a
+ * multiplicative decrease in order to converge to a lower capacity in time
+ * logarithmic in the magnitude of the decrease.
+ *
+ * However, we do not cut our short-term estimates lower than the current rate
+ * and volume of delivered data from this round trip, since from the current
+ * delivery process we can estimate the measured capacity available now.
+ *
+ * Anything faster than that approach would knowingly risk high loss, which can
+ * cause low bw for Reno/CUBIC and high loss recovery latency for
+ * request/response flows using any congestion control.
+ */
+static void bbr2_adapt_lower_bounds(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 ecn_cut, ecn_inflight_lo, beta;
+
+	/* We only use lower-bound estimates when not probing bw.
+	 * When probing we need to push inflight higher to probe bw.
+	 */
+	if (bbr2_is_probing_bandwidth(sk))
+		return;
+
+	/* ECN response. */
+	if (bbr->ecn_in_round && bbr->ecn_eligible && bbr->params.ecn_factor) {
+		/* Reduce inflight to (1 - alpha*ecn_factor). */
+		ecn_cut = (BBR_UNIT -
+			   ((bbr->ecn_alpha * bbr->params.ecn_factor) >>
+			    BBR_SCALE));
+		if (bbr->inflight_lo == ~0U)
+			bbr->inflight_lo = tp->snd_cwnd;
+		ecn_inflight_lo = (u64)bbr->inflight_lo * ecn_cut >> BBR_SCALE;
+	} else {
+		ecn_inflight_lo = ~0U;
+	}
+
+	/* Loss response. */
+	if (bbr->loss_in_round) {
+		/* Reduce bw and inflight to (1 - beta). */
+		if (bbr->bw_lo == ~0U)
+			bbr->bw_lo = bbr_max_bw(sk);
+		if (bbr->inflight_lo == ~0U)
+			bbr->inflight_lo = tp->snd_cwnd;
+		beta = bbr->params.beta;
+		bbr->bw_lo =
+			max_t(u32, bbr->bw_latest,
+			      (u64)bbr->bw_lo *
+			      (BBR_UNIT - beta) >> BBR_SCALE);
+		bbr->inflight_lo =
+			max_t(u32, bbr->inflight_latest,
+			      (u64)bbr->inflight_lo *
+			      (BBR_UNIT - beta) >> BBR_SCALE);
+	}
+
+	/* Adjust to the lower of the levels implied by loss or ECN. */
+	bbr->inflight_lo = min(bbr->inflight_lo, ecn_inflight_lo);
+}
+
+/* Reset any short-term lower-bound adaptation to congestion, so that we can
+ * push our inflight up.
+ */
+static void bbr2_reset_lower_bounds(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr->bw_lo = ~0U;
+	bbr->inflight_lo = ~0U;
+}
+
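For intuition about the loss response in bbr2_adapt_lower_bounds(): each
lossy round trip multiplies inflight_lo (and bw_lo) by (1 - beta), so
convergence to a lower capacity takes time logarithmic in the size of
the cut, as the comment above argues. A rough sketch with the patch's
default beta of 0.3 and a hypothetical starting cwnd::

  #include <stdint.h>
  #include <stdio.h>

  #define BBR_SCALE 8
  #define BBR_UNIT (1 << BBR_SCALE)

  /* Illustrate the (1 - beta) multiplicative cut that
   * bbr2_adapt_lower_bounds() applies to inflight_lo once per lossy
   * round trip. The starting value is hypothetical.
   */
  int main(void)
  {
  	uint32_t beta = BBR_UNIT * 30 / 100;	/* bbr_beta default: ~0.3 */
  	uint32_t inflight_lo = 1000;		/* hypothetical cwnd, packets */
  	int round;

  	for (round = 1; round <= 5; round++) {
  		inflight_lo = (uint64_t)inflight_lo *
  			      (BBR_UNIT - beta) >> BBR_SCALE;
  		printf("after lossy round %d: inflight_lo = %u\n",
  		       round, (unsigned)inflight_lo);
  	}
  	return 0;	/* roughly 703, 494, 347, 243, 170 */
  }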
+/* After bw probing (STARTUP/PROBE_UP), reset signals before entering a state
+ * machine phase where we adapt our lower bound based on congestion signals.
+ */
+static void bbr2_reset_congestion_signals(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr->loss_in_round = 0;
+	bbr->ecn_in_round = 0;
+	bbr->loss_in_cycle = 0;
+	bbr->ecn_in_cycle = 0;
+	bbr->bw_latest = 0;
+	bbr->inflight_latest = 0;
+}
+
+/* Update (most of) our congestion signals: track the recent rate and volume of
+ * delivered data, presence of loss, and EWMA degree of ECN marking.
+ */
+static void bbr2_update_congestion_signals(
+	struct sock *sk, const struct rate_sample *rs, struct bbr_context *ctx)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u64 bw;
+
+	bbr->loss_round_start = 0;
+	if (rs->interval_us <= 0 || !rs->acked_sacked)
+		return;  /* Not a valid observation */
+	bw = ctx->sample_bw;
+
+	if (!rs->is_app_limited || bw >= bbr_max_bw(sk))
+		bbr2_take_bw_hi_sample(sk, bw);
+
+	bbr->loss_in_round |= (rs->losses > 0);
+
+	/* Update rate and volume of delivered data from latest round trip: */
+	bbr->bw_latest = max_t(u32, bbr->bw_latest, ctx->sample_bw);
+	bbr->inflight_latest = max_t(u32, bbr->inflight_latest, rs->delivered);
+
+	if (before(rs->prior_delivered, bbr->loss_round_delivered))
+		return;		/* skip the per-round-trip updates */
+	/* Now do per-round-trip updates. */
+	bbr->loss_round_delivered = tp->delivered;  /* mark round trip */
+	bbr->loss_round_start = 1;
+	bbr2_adapt_lower_bounds(sk);
+
+	/* Update windowed "latest" (single-round-trip) filters. */
+	bbr->loss_in_round = 0;
+	bbr->ecn_in_round = 0;
+	bbr->bw_latest = ctx->sample_bw;
+	bbr->inflight_latest = rs->delivered;
+}
+
+/* Bandwidth probing can cause loss. To help coexistence with loss-based
+ * congestion control we spread out our probing in a Reno-conscious way. Due to
+ * the shape of the Reno sawtooth, the time required between loss epochs for an
+ * idealized Reno flow is a number of round trips that is the BDP of that
+ * flow. We count packet-timed round trips directly, since measured RTT can
+ * vary widely, and Reno is driven by packet-timed round trips.
+ */
+static bool bbr2_is_reno_coexistence_probe_time(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 inflight, rounds, reno_gain, reno_rounds;
+
+	/* Random loss can shave some small percentage off of our inflight
+	 * in each round. To survive this, flows need robust periodic probes.
+	 */
+	rounds = bbr->params.bw_probe_max_rounds;
+
+	reno_gain = bbr->params.bw_probe_reno_gain;
+	if (reno_gain) {
+		inflight = bbr2_target_inflight(sk);
+		reno_rounds = ((u64)inflight * reno_gain) >> BBR_SCALE;
+		rounds = min(rounds, reno_rounds);
+	}
+	return bbr->rounds_since_probe >= rounds;
+}
+
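To make the Reno-coexistence bound concrete: with the default
bw_probe_reno_gain of 1.0 (BBR_UNIT), reno_rounds is simply the estimated
target inflight in packets, clamped by bw_probe_max_rounds. A quick
numeric check for the 25Mbps/30ms example used in the comment below (the
~62-packet BDP figure is taken from that comment)::

  #include <stdint.h>
  #include <stdio.h>

  #define BBR_SCALE 8
  #define BBR_UNIT (1 << BBR_SCALE)

  /* Numeric check of bbr2_is_reno_coexistence_probe_time()'s bound:
   * with bw_probe_reno_gain = 1.0, a ~62-packet BDP yields a ~62-round
   * probing epoch, clamped by bw_probe_max_rounds (63).
   */
  int main(void)
  {
  	uint32_t reno_gain = BBR_UNIT;	/* bbr_bw_probe_reno_gain default */
  	uint32_t max_rounds = 63;	/* bbr_bw_probe_max_rounds default */
  	uint32_t inflight = 62;		/* ~BDP of the example Reno flow */
  	uint32_t reno_rounds = ((uint64_t)inflight * reno_gain) >> BBR_SCALE;
  	uint32_t rounds = reno_rounds < max_rounds ? reno_rounds : max_rounds;

  	printf("probe every %u packet-timed round trips\n", (unsigned)rounds);
  	return 0;
  }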
+/* How long do we want to wait before probing for bandwidth (and risking
+ * loss)? We randomize the wait, for better mixing and fairness convergence.
+ *
+ * We bound the Reno-coexistence inter-bw-probe time to be 62-63 round trips.
+ * This is calculated to allow fairness with a 25Mbps, 30ms Reno flow,
+ * (eg 4K video to a broadband user):
+ *   BDP = 25Mbps * .030sec / (1514bytes) = 61.9 packets
+ *
+ * We bound the BBR-native inter-bw-probe wall clock time to be:
+ *  (a) higher than 2 sec: to try to avoid causing loss for a long enough time
+ *      to allow Reno at 30ms to get 4K video bw, the inter-bw-probe time must
+ *      be at least: 25Mbps * .030sec / (1514bytes) * 0.030sec = 1.9secs
+ *  (b) lower than 3 sec: to ensure flows can start probing in a reasonable
+ *      amount of time to discover unutilized bw on human-scale interactive
+ *      time-scales (e.g. perhaps traffic from a web page download that we
+ *      were competing with is now complete).
+ */
+static void bbr2_pick_probe_wait(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	/* Decide the random round-trip bound for wait until probe: */
+	bbr->rounds_since_probe =
+		prandom_u32_max(bbr->params.bw_probe_rand_rounds);
+	/* Decide the random wall clock bound for wait until probe: */
+	bbr->probe_wait_us = bbr->params.bw_probe_base_us +
+			     prandom_u32_max(bbr->params.bw_probe_rand_us);
+}
+
+static void bbr2_set_cycle_idx(struct sock *sk, int cycle_idx)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr->cycle_idx = cycle_idx;
+	/* New phase, so need to update cwnd and pacing rate. */
+	bbr->try_fast_path = 0;
+}
+
+/* Send at estimated bw to fill the pipe, but not queue. We need this phase
+ * before PROBE_UP, because as soon as we send faster than the available bw
+ * we will start building a queue, and if the buffer is shallow we can cause
+ * loss. If we do not fill the pipe before we cause this loss, our bw_hi and
+ * inflight_hi estimates will underestimate.
+ */
+static void bbr2_start_bw_probe_refill(struct sock *sk, u32 bw_probe_up_rounds)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr2_reset_lower_bounds(sk);
+	if (bbr->inflight_hi != ~0U)
+		bbr->inflight_hi += bbr->params.refill_add_inc;
+	bbr->bw_probe_up_rounds = bw_probe_up_rounds;
+	bbr->bw_probe_up_acks = 0;
+	bbr->stopped_risky_probe = 0;
+	bbr->ack_phase = BBR_ACKS_REFILLING;
+	bbr->next_rtt_delivered = tp->delivered;
+	bbr2_set_cycle_idx(sk, BBR_BW_PROBE_REFILL);
+}
+
+/* Now probe max deliverable data rate and volume. */
+static void bbr2_start_bw_probe_up(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr->ack_phase = BBR_ACKS_PROBE_STARTING;
+	bbr->next_rtt_delivered = tp->delivered;
+	bbr->cycle_mstamp = tp->tcp_mstamp;
+	bbr2_set_cycle_idx(sk, BBR_BW_PROBE_UP);
+	bbr2_raise_inflight_hi_slope(sk);
+}
+
+/* Start a new PROBE_BW probing cycle of some wall clock length. Pick a wall
+ * clock time at which to probe beyond an inflight that we think to be
+ * safe. This will knowingly risk packet loss, so we want to do this rarely, to
+ * keep packet loss rates low. Also start a round-trip counter, to probe faster
+ * if we estimate a Reno flow at our BDP would probe faster.
+ */
+static void bbr2_start_bw_probe_down(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr2_reset_congestion_signals(sk);
+	bbr->bw_probe_up_cnt = ~0U;	/* not growing inflight_hi any more */
+	bbr2_pick_probe_wait(sk);
+	bbr->cycle_mstamp = tp->tcp_mstamp;	/* start wall clock */
+	bbr->ack_phase = BBR_ACKS_PROBE_STOPPING;
+	bbr->next_rtt_delivered = tp->delivered;
+	bbr2_set_cycle_idx(sk, BBR_BW_PROBE_DOWN);
+}
+
+/* Cruise: maintain what we estimate to be a neutral, conservative
+ * operating point, without attempting to probe up for bandwidth or down for
+ * RTT, and only reducing inflight in response to loss/ECN signals.
+ */
+static void bbr2_start_bw_probe_cruise(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if (bbr->inflight_lo != ~0U)
+		bbr->inflight_lo = min(bbr->inflight_lo, bbr->inflight_hi);
+
+	bbr2_set_cycle_idx(sk, BBR_BW_PROBE_CRUISE);
+}
+
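bbr2_pick_probe_wait() above draws the wall-clock wait uniformly from
[bw_probe_base_us, bw_probe_base_us + bw_probe_rand_us), i.e. 2-3 seconds
with this patch's defaults. A user-space sketch of the same choice;
rand() merely stands in for the kernel's prandom_u32_max()::

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Sketch of the probe-wait randomization in bbr2_pick_probe_wait().
   * The constants mirror bbr_bw_probe_base_us and bbr_bw_probe_rand_us.
   */
  int main(void)
  {
  	const uint32_t base_us = 2 * 1000 * 1000;	/* 2 sec base */
  	const uint32_t rand_us = 1 * 1000 * 1000;	/* +[0,1) sec jitter */
  	int i;

  	srand(1);
  	for (i = 0; i < 3; i++) {
  		uint32_t wait_us = base_us + (uint32_t)rand() % rand_us;

  		printf("probe_wait_us = %u (%.2f sec)\n",
  		       (unsigned)wait_us, wait_us / 1e6);
  	}
  	return 0;
  }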
+/* Loss and/or ECN rate is too high while probing.
+ * Adapt (once per bw probe) by cutting inflight_hi and then restarting cycle.
+ */
+static void bbr2_handle_inflight_too_high(struct sock *sk,
+					  const struct rate_sample *rs)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	const u32 beta = bbr->params.beta;
+
+	bbr->prev_probe_too_high = 1;
+	bbr->bw_probe_samples = 0;  /* only react once per probe */
+	bbr->debug.event = 'L';     /* Loss/ECN too high */
+	/* If we are app-limited then we are not robustly
+	 * probing the max volume of inflight data we think
+	 * might be safe (analogous to how app-limited bw
+	 * samples are not known to be robustly probing bw).
+	 */
+	if (!rs->is_app_limited)
+		bbr->inflight_hi = max_t(u32, rs->tx_in_flight,
+					 (u64)bbr2_target_inflight(sk) *
+					 (BBR_UNIT - beta) >> BBR_SCALE);
+	if (bbr->mode == BBR_PROBE_BW && bbr->cycle_idx == BBR_BW_PROBE_UP)
+		bbr2_start_bw_probe_down(sk);
+}
+
+/* If we're seeing bw and loss samples reflecting our bw probing, adapt
+ * using the signals we see. If loss or ECN mark rate gets too high, then adapt
+ * inflight_hi downward. If we're able to push inflight higher without such
+ * signals, push higher: adapt inflight_hi upward.
+ */
+static bool bbr2_adapt_upper_bounds(struct sock *sk,
+				    const struct rate_sample *rs)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	/* Track when we'll see bw/loss samples resulting from our bw probes. */
+	if (bbr->ack_phase == BBR_ACKS_PROBE_STARTING && bbr->round_start)
+		bbr->ack_phase = BBR_ACKS_PROBE_FEEDBACK;
+	if (bbr->ack_phase == BBR_ACKS_PROBE_STOPPING && bbr->round_start) {
+		/* End of samples from bw probing phase. */
+		bbr->bw_probe_samples = 0;
+		bbr->ack_phase = BBR_ACKS_INIT;
+		/* At this point in the cycle, our current bw sample is also
+		 * our best recent chance at finding the highest available bw
+		 * for this flow. So now is the best time to forget the bw
+		 * samples from the previous cycle, by advancing the window.
+		 */
+		if (bbr->mode == BBR_PROBE_BW && !rs->is_app_limited)
+			bbr2_advance_bw_hi_filter(sk);
+		/* If we had an inflight_hi, then probed and pushed inflight all
+		 * the way up to hit that inflight_hi without seeing any
+		 * high loss/ECN in all the resulting ACKs from that probing,
+		 * then probe up again, this time letting inflight persist at
+		 * inflight_hi for a round trip, then accelerating beyond.
+		 */
+		if (bbr->mode == BBR_PROBE_BW &&
+		    bbr->stopped_risky_probe && !bbr->prev_probe_too_high) {
+			bbr->debug.event = 'R';  /* reprobe */
+			bbr2_start_bw_probe_refill(sk, 0);
+			return true;  /* yes, decided state transition */
+		}
+	}
+
+	if (bbr2_is_inflight_too_high(sk, rs)) {
+		if (bbr->bw_probe_samples)  /* sample is from bw probing? */
+			bbr2_handle_inflight_too_high(sk, rs);
+	} else {
+		/* Loss/ECN rate is declared safe. Adjust upper bound upward. */
+		if (bbr->inflight_hi == ~0U)  /* no excess queue signals yet? */
+			return false;
+
+		/* To be resilient to random loss, we must raise inflight_hi
+		 * if we observe in any phase that a higher level is safe.
+		 */
+		if (rs->tx_in_flight > bbr->inflight_hi) {
+			bbr->inflight_hi = rs->tx_in_flight;
+			bbr->debug.event = 'U';  /* raise up inflight_hi */
+		}
+
+		if (bbr->mode == BBR_PROBE_BW &&
+		    bbr->cycle_idx == BBR_BW_PROBE_UP)
+			bbr2_probe_inflight_hi_upward(sk, rs);
+	}
+
+	return false;
+}
+
+/* Check if it's time to probe for bandwidth now, and if so, kick it off. */
+static bool bbr2_check_time_to_probe_bw(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 n;
+
+	/* If we seem to be at an operating point where we are not seeing loss
+	 * but we are seeing ECN marks, then when the ECN marks cease we reprobe
+	 * quickly (in case a burst of cross-traffic has ceased and freed up bw,
+	 * or in case we are sharing with multiplicatively probing traffic).
+	 */
+	if (bbr->params.ecn_reprobe_gain && bbr->ecn_eligible &&
+	    bbr->ecn_in_cycle && !bbr->loss_in_cycle &&
+	    inet_csk(sk)->icsk_ca_state == TCP_CA_Open) {
+		bbr->debug.event = 'A';  /* *A*ll clear to probe *A*gain */
+		/* Calculate n so that when bbr2_raise_inflight_hi_slope()
+		 * computes growth_this_round as 2^n it will be roughly the
+		 * desired volume of data (inflight_hi*ecn_reprobe_gain).
+		 */
+		n = ilog2((((u64)bbr->inflight_hi *
+			    bbr->params.ecn_reprobe_gain) >> BBR_SCALE));
+		bbr2_start_bw_probe_refill(sk, n);
+		return true;
+	}
+
+	if (bbr2_has_elapsed_in_phase(sk, bbr->probe_wait_us) ||
+	    bbr2_is_reno_coexistence_probe_time(sk)) {
+		bbr2_start_bw_probe_refill(sk, 0);
+		return true;
+	}
+	return false;
+}
+
+/* Is it time to transition from PROBE_DOWN to PROBE_CRUISE? */
+static bool bbr2_check_time_to_cruise(struct sock *sk, u32 inflight, u32 bw)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	bool is_under_bdp, is_long_enough;
+
+	/* Always need to pull inflight down to leave headroom in queue. */
+	if (inflight > bbr2_inflight_with_headroom(sk))
+		return false;
+
+	is_under_bdp = inflight <= bbr_inflight(sk, bw, BBR_UNIT);
+	if (bbr->params.drain_to_target)
+		return is_under_bdp;
+
+	is_long_enough = bbr2_has_elapsed_in_phase(sk, bbr->min_rtt_us);
+	return is_under_bdp || is_long_enough;
+}
+
+/* PROBE_BW state machine: cruise, refill, probe for bw, or drain? */
+static void bbr2_update_cycle_phase(struct sock *sk,
+				    const struct rate_sample *rs)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	bool is_risky = false, is_queuing = false;
+	u32 inflight, bw;
+
+	if (!bbr_full_bw_reached(sk))
+		return;
+
+	/* In DRAIN, PROBE_BW, or PROBE_RTT, adjust upper bounds. */
+	if (bbr2_adapt_upper_bounds(sk, rs))
+		return;		/* already decided state transition */
+
+	if (bbr->mode != BBR_PROBE_BW)
+		return;
+
+	inflight = bbr_packets_in_net_at_edt(sk, rs->prior_in_flight);
+	bw = bbr_max_bw(sk);
+
+	switch (bbr->cycle_idx) {
+	/* First we spend most of our time cruising with a pacing_gain of 1.0,
+	 * which paces at the estimated bw, to try to fully use the pipe
+	 * without building queue. If we encounter loss/ECN marks, we adapt
+	 * by slowing down.
+	 */
+	case BBR_BW_PROBE_CRUISE:
+		if (bbr2_check_time_to_probe_bw(sk))
+			return;		/* already decided state transition */
+		break;
+
+	/* After cruising, when it's time to probe, we first "refill": we send
+	 * at the estimated bw to fill the pipe, before probing higher and
+	 * knowingly risking overflowing the bottleneck buffer (causing loss).
+	 */
+	case BBR_BW_PROBE_REFILL:
+		if (bbr->round_start) {
+			/* After one full round trip of sending in REFILL, we
+			 * start to see bw samples reflecting our REFILL, which
+			 * may be putting too much data in flight.
+			 */
+			bbr->bw_probe_samples = 1;
+			bbr2_start_bw_probe_up(sk);
+		}
+		break;
+
+	/* After we refill the pipe, we probe by using a pacing_gain > 1.0, to
+	 * probe for bw. If we have not seen loss/ECN, we try to raise inflight
+	 * to at least pacing_gain*BDP; note that this may take more than
+	 * min_rtt if min_rtt is small (e.g. on a LAN).
+	 *
+	 * We terminate PROBE_UP bandwidth probing upon any of the following:
+	 *
+	 * (1) We've pushed inflight up to hit the inflight_hi target set in the
+	 *     most recent previous bw probe phase. Thus we want to start
+	 *     draining the queue immediately because it's very likely the most
+	 *     recently sent packets will fill the queue and cause drops.
+	 *     (checked here)
+	 * (2) We have probed for at least 1*min_rtt_us, and the
+	 *     estimated queue is high enough (inflight > 1.25 * estimated_bdp).
+	 *     (checked here)
+	 * (3) Loss filter says loss rate is "too high".
+	 *     (checked in bbr_is_inflight_too_high())
+	 * (4) ECN filter says ECN mark rate is "too high".
+	 *     (checked in bbr_is_inflight_too_high())
+	 */
+	case BBR_BW_PROBE_UP:
+		if (bbr->prev_probe_too_high &&
+		    inflight >= bbr->inflight_hi) {
+			bbr->stopped_risky_probe = 1;
+			is_risky = true;
+			bbr->debug.event = 'D';  /* D for danger */
+		} else if (bbr2_has_elapsed_in_phase(sk, bbr->min_rtt_us) &&
+			   inflight >=
+			   bbr_inflight(sk, bw,
+					bbr->params.bw_probe_pif_gain)) {
+			is_queuing = true;
+			bbr->debug.event = 'Q';  /* building Queue */
+		}
+		if (is_risky || is_queuing) {
+			bbr->prev_probe_too_high = 0;  /* no loss/ECN (yet) */
+			bbr2_start_bw_probe_down(sk);  /* restart w/ down */
+		}
+		break;
+
+	/* After probing in PROBE_UP, we have usually accumulated some data in
+	 * the bottleneck buffer (if bw probing didn't find more bw). We next
+	 * enter PROBE_DOWN to try to drain any excess data from the queue. To
+	 * do this, we use a pacing_gain < 1.0. We hold this pacing gain until
+	 * our inflight is less than the target cruising point, which is the
+	 * minimum of (a) the amount needed to leave headroom, and (b) the
+	 * estimated BDP. Once inflight falls to match the target, we estimate
+	 * the queue is drained; persisting would underutilize the pipe.
+	 */
+	case BBR_BW_PROBE_DOWN:
+		if (bbr2_check_time_to_probe_bw(sk))
+			return;		/* already decided state transition */
+		if (bbr2_check_time_to_cruise(sk, inflight, bw))
+			bbr2_start_bw_probe_cruise(sk);
+		break;
+
+	default:
+		WARN_ONCE(1, "BBR invalid cycle index %u\n", bbr->cycle_idx);
+	}
+}
+
+/* Exiting PROBE_RTT, so return to bandwidth probing in STARTUP or PROBE_BW. */
+static void bbr2_exit_probe_rtt(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr2_reset_lower_bounds(sk);
+	if (bbr_full_bw_reached(sk)) {
+		bbr->mode = BBR_PROBE_BW;
+		/* Raising inflight after PROBE_RTT may cause loss, so reset
+		 * the PROBE_BW clock and schedule the next bandwidth probe for
+		 * a friendly and randomized future point in time.
+		 */
+		bbr2_start_bw_probe_down(sk);
+		/* Since we are exiting PROBE_RTT, we know inflight is
+		 * below our estimated BDP, so it is reasonable to cruise.
+		 */
+		bbr2_start_bw_probe_cruise(sk);
+	} else {
+		bbr->mode = BBR_STARTUP;
+	}
+}
+
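Condition (2) of the PROBE_UP exit logic above compares inflight against
bbr_inflight(sk, bw, bw_probe_pif_gain), i.e. 1.25x the estimated BDP by
default. A small sketch of that threshold computation, with a
hypothetical BDP (in the kernel, bbr_inflight() derives the BDP from the
bw and min_rtt model)::

  #include <stdint.h>
  #include <stdio.h>

  #define BBR_SCALE 8
  #define BBR_UNIT (1 << BBR_SCALE)

  /* Sketch of PROBE_UP exit condition (2): probing stops once inflight
   * exceeds bw_probe_pif_gain times the estimated BDP.
   */
  int main(void)
  {
  	uint32_t bw_probe_pif_gain = BBR_UNIT * 5 / 4;	/* default: 1.25x */
  	uint32_t bdp = 100;		/* hypothetical BDP, in packets */
  	uint32_t threshold = ((uint64_t)bdp * bw_probe_pif_gain) >> BBR_SCALE;
  	uint32_t inflight;

  	for (inflight = 120; inflight <= 130; inflight += 5)
  		printf("inflight=%u -> %s\n", (unsigned)inflight,
  		       inflight >= threshold ?
  		       "building queue; exit PROBE_UP" : "keep probing");
  	return 0;	/* threshold is 125 packets here */
  }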
+/* Exit STARTUP based on loss rate > 1% and loss gaps in round >= N. Wait until
+ * the end of the round in recovery to get a good estimate of how many packets
+ * have been lost, and how many we need to drain with a low pacing rate.
+ */
+static void bbr2_check_loss_too_high_in_startup(struct sock *sk,
+						const struct rate_sample *rs)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if (bbr_full_bw_reached(sk))
+		return;
+
+	/* For STARTUP exit, check the loss rate at the end of each round trip
+	 * of Recovery episodes in STARTUP. We check the loss rate at the end
+	 * of the round trip to filter out noisy/low loss and have a better
+	 * sense of inflight (extent of loss), so we can drain more accurately.
+	 */
+	if (rs->losses && bbr->loss_events_in_round < 0xf)
+		bbr->loss_events_in_round++;  /* update saturating counter */
+	if (bbr->params.full_loss_cnt && bbr->loss_round_start &&
+	    inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery &&
+	    bbr->loss_events_in_round >= bbr->params.full_loss_cnt &&
+	    bbr2_is_inflight_too_high(sk, rs)) {
+		bbr->debug.event = 'P';  /* Packet loss caused STARTUP exit */
+		bbr2_handle_queue_too_high_in_startup(sk);
+		return;
+	}
+	if (bbr->loss_round_start)
+		bbr->loss_events_in_round = 0;
+}
+
+/* If we are done draining, advance into steady state operation in PROBE_BW. */
+static void bbr2_check_drain(struct sock *sk, const struct rate_sample *rs,
+			     struct bbr_context *ctx)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if (bbr_check_drain(sk, rs, ctx)) {
+		bbr->mode = BBR_PROBE_BW;
+		bbr2_start_bw_probe_down(sk);
+	}
+}
+
+static void bbr2_update_model(struct sock *sk, const struct rate_sample *rs,
+			      struct bbr_context *ctx)
+{
+	bbr2_update_congestion_signals(sk, rs, ctx);
+	bbr_update_ack_aggregation(sk, rs);
+	bbr2_check_loss_too_high_in_startup(sk, rs);
+	bbr_check_full_bw_reached(sk, rs);
+	bbr2_check_drain(sk, rs, ctx);
+	bbr2_update_cycle_phase(sk, rs);
+	bbr_update_min_rtt(sk, rs);
+}
+
+/* Fast path for app-limited case.
+ *
+ * On each ack, we execute bbr state machine, which primarily consists of:
+ * 1) update model based on new rate sample, and
+ * 2) update control based on updated model or state change.
+ *
+ * There are certain workloads/scenarios, e.g. the app-limited case, where
+ * either we can skip updating the model, or we can skip updating both the
+ * model and the control. This provides significant softirq cpu savings for
+ * processing incoming acks.
+ *
+ * In case of app-limited, if there is no congestion (loss/ecn) and
+ * if the observed bw sample is less than the current estimated bw, then we
+ * can skip some of the computation in bbr state processing:
+ *
+ * - if there is no rtt/mode/phase change: In this case, since all the
+ *   parameters of the network model are constant, we can skip both the
+ *   model and the control update.
+ *
+ * - else: we can skip the rest of the model update. But we still need to
+ *   update the control to account for the new rtt/mode/phase.
+ *
+ * Returns whether we can take the fast path or not.
+ */
+static bool bbr2_fast_path(struct sock *sk, bool *update_model,
+			   const struct rate_sample *rs, struct bbr_context *ctx)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 prev_min_rtt_us, prev_mode;
+
+	if (bbr->params.fast_path && bbr->try_fast_path &&
+	    rs->is_app_limited && ctx->sample_bw < bbr_max_bw(sk) &&
+	    !bbr->loss_in_round && !bbr->ecn_in_round) {
+		prev_mode = bbr->mode;
+		prev_min_rtt_us = bbr->min_rtt_us;
+		bbr2_check_drain(sk, rs, ctx);
+		bbr2_update_cycle_phase(sk, rs);
+		bbr_update_min_rtt(sk, rs);
+
+		if (bbr->mode == prev_mode &&
+		    bbr->min_rtt_us == prev_min_rtt_us &&
+		    bbr->try_fast_path)
+			return true;
+
+		/* Skip model update, but control still needs to be updated */
+		*update_model = false;
+	}
+	return false;
+}
+
+static void bbr2_main(struct sock *sk, const struct rate_sample *rs)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	struct bbr_context ctx = { 0 };
+	bool update_model = true;
+	u32 bw;
+
+	bbr->debug.event = '.';  /* init to default NOP (no event yet) */
+
+	bbr_update_round_start(sk, rs, &ctx);
+	if (bbr->round_start) {
+		bbr->rounds_since_probe =
+			min_t(s32, bbr->rounds_since_probe + 1, 0xFF);
+		bbr2_update_ecn_alpha(sk);
+	}
+
+	bbr->ecn_in_round |= rs->is_ece;
+	bbr_calculate_bw_sample(sk, rs, &ctx);
+
+	if (bbr2_fast_path(sk, &update_model, rs, &ctx))
+		goto out;
+
+	if (update_model)
+		bbr2_update_model(sk, rs, &ctx);
+
+	bbr_update_gains(sk);
+	bw = bbr_bw(sk);
+	bbr_set_pacing_rate(sk, bw, bbr->pacing_gain);
+	bbr_set_cwnd(sk, rs, rs->acked_sacked, bw, bbr->cwnd_gain,
+		     tp->snd_cwnd, &ctx);
+	bbr2_bound_cwnd_for_inflight_model(sk);
+
+out:
+	bbr->prev_ca_state = inet_csk(sk)->icsk_ca_state;
+	bbr->loss_in_cycle |= rs->lost > 0;
+	bbr->ecn_in_cycle |= rs->delivered_ce > 0;
+
+	bbr_debug(sk, rs->acked_sacked, rs, &ctx);
+}
+
+/* Module parameters that are settable by TCP_CONGESTION_PARAMS are declared
+ * down here, so that the algorithm functions that use the parameters must use
+ * the per-socket parameters; if they accidentally use the global version
+ * then there will be a compile error.
+ * TODO(ncardwell): move all per-socket parameters down to this section.
+ */
+
+/* On losses, scale down inflight and pacing rate by beta scaled by BBR_SCALE.
+ * No loss response when 0. Max allowed value is 255.
+ */
+static u32 bbr_beta = BBR_UNIT * 30 / 100;
+
+/* Gain factor for ECN mark ratio samples, scaled by BBR_SCALE.
+ * Max allowed value is 255.
+ */
+static u32 bbr_ecn_alpha_gain = BBR_UNIT * 1 / 16;  /* 1/16 = 6.25% */
+
+/* The initial value for the ecn_alpha state variable. Default and max
+ * BBR_UNIT (256), representing 1.0. This allows a flow to respond quickly
+ * to congestion if the bottleneck is congested when the flow starts up.
+ */
+static u32 bbr_ecn_alpha_init = BBR_UNIT;	/* 1.0, to respond quickly */
+
+/* On ECN, cut inflight_lo to (1 - ecn_factor * ecn_alpha) scaled by BBR_SCALE.
+ * No ECN based bounding when 0. Max allowed value is 255.
+ */
+static u32 bbr_ecn_factor = BBR_UNIT * 1 / 3;	/* 1/3 = 33% */
+
+/* Estimate bw probing has gone too far if CE ratio exceeds this threshold.
+ * Scaled by BBR_SCALE. Disabled when 0. Max allowed is 255.
+ */
+static u32 bbr_ecn_thresh = BBR_UNIT * 1 / 2;	/* 1/2 = 50% */
+
+/* Max RTT (in usec) at which to use sender-side ECN logic.
+ * Disabled when 0 (ECN allowed at any RTT).
+ * Max allowed for the parameter is 524287 (0x7ffff) us, ~524 ms.
+ */
+static u32 bbr_ecn_max_rtt_us = 5000;
+
+/* If non-zero, if in a cycle with no losses but some ECN marks, after ECN
+ * clears then use a multiplicative increase to quickly reprobe bw by
+ * starting inflight probing at the given multiple of inflight_hi.
+ * Default for this experimental knob is 0 (disabled).
+ * Planned value for experiments: BBR_UNIT * 1 / 2 = 128, representing 0.5.
+ */
+static u32 bbr_ecn_reprobe_gain;
+
+/* Estimate bw probing has gone too far if loss rate exceeds this level. */
+static u32 bbr_loss_thresh = BBR_UNIT * 2 / 100;  /* 2% loss */
+
+/* Exit STARTUP if number of loss marking events in a Recovery round is >= N,
+ * and loss rate is higher than bbr_loss_thresh.
+ * Disabled if 0. Max allowed value is 15 (0xF).
+ */
+static u32 bbr_full_loss_cnt = 8;
+
+/* Exit STARTUP if number of round trips with ECN mark rate above ecn_thresh
+ * meets this count. Max allowed value is 3.
+ */
+static u32 bbr_full_ecn_cnt = 2;
+
+/* Fraction of unutilized headroom to try to leave in path upon high loss. */
+static u32 bbr_inflight_headroom = BBR_UNIT * 15 / 100;
+
+/* Multiplier to get target inflight (as multiple of BDP) for PROBE_UP phase.
+ * Default is 1.25x, as in BBR v1. Max allowed is 511.
+ */
+static u32 bbr_bw_probe_pif_gain = BBR_UNIT * 5 / 4;
+
+/* Multiplier to get Reno-style probe epoch duration as: k * BDP round trips.
+ * If zero, disables this BBR v2 Reno-style BDP-scaled coexistence mechanism.
+ * Max allowed is 511.
+ */
+static u32 bbr_bw_probe_reno_gain = BBR_UNIT;
+
+/* Max number of packet-timed rounds to wait before probing for bandwidth. If
+ * we want to tolerate 1% random loss per round, and not have this cut our
+ * inflight too much, we must probe for bw periodically on roughly this scale.
+ * If low, limits Reno/CUBIC coexistence; if high, limits loss tolerance.
+ * We aim to be fair with Reno/CUBIC up to a BDP of at least:
+ *  BDP = 25Mbps * .030sec / (1514bytes) = 61.9 packets
+ */
+static u32 bbr_bw_probe_max_rounds = 63;
+
+/* Max amount of randomness to inject in round counting for Reno-coexistence.
+ * Max value is 15.
+ */
+static u32 bbr_bw_probe_rand_rounds = 2;
+
+/* Use BBR-native probe time scale starting at this many usec.
+ * We aim to be fair with Reno/CUBIC up to an inter-loss time epoch of at least:
+ *  BDP*RTT = 25Mbps * .030sec / (1514bytes) * 0.030sec = 1.9 secs
+ */
+static u32 bbr_bw_probe_base_us = 2 * USEC_PER_SEC;  /* 2 secs */
+
+/* Use BBR-native probes spread over this many usec: */
+static u32 bbr_bw_probe_rand_us = 1 * USEC_PER_SEC;  /* 1 sec */
+
+/* Undo the model changes made in loss recovery if recovery was spurious? */
+static bool bbr_undo = true;
+
+/* Use fast path if app-limited, no loss/ECN, and target cwnd was reached? */
+static bool bbr_fast_path = true;	/* default: enabled */
+
+/* Use fast ack mode? */
+static int bbr_fast_ack_mode = 1;	/* default: rwnd check off */
+
+/* How much to additively increase inflight_hi when entering REFILL? */
+static u32 bbr_refill_add_inc;		/* default: disabled */
+
+module_param_named(beta,                 bbr_beta,                 uint, 0644);
+module_param_named(ecn_alpha_gain,       bbr_ecn_alpha_gain,       uint, 0644);
+module_param_named(ecn_alpha_init,       bbr_ecn_alpha_init,       uint, 0644);
+module_param_named(ecn_factor,           bbr_ecn_factor,           uint, 0644);
+module_param_named(ecn_thresh,           bbr_ecn_thresh,           uint, 0644);
+module_param_named(ecn_max_rtt_us,       bbr_ecn_max_rtt_us,       uint, 0644);
+module_param_named(ecn_reprobe_gain,     bbr_ecn_reprobe_gain,     uint, 0644);
+module_param_named(loss_thresh,          bbr_loss_thresh,          uint, 0664);
+module_param_named(full_loss_cnt,        bbr_full_loss_cnt,        uint, 0664);
+module_param_named(full_ecn_cnt,         bbr_full_ecn_cnt,         uint, 0664);
+module_param_named(inflight_headroom,    bbr_inflight_headroom,    uint, 0664);
+module_param_named(bw_probe_pif_gain,    bbr_bw_probe_pif_gain,    uint, 0664);
+module_param_named(bw_probe_reno_gain,   bbr_bw_probe_reno_gain,   uint, 0664);
+module_param_named(bw_probe_max_rounds,  bbr_bw_probe_max_rounds,  uint, 0664);
+module_param_named(bw_probe_rand_rounds, bbr_bw_probe_rand_rounds, uint, 0664);
+module_param_named(bw_probe_base_us,     bbr_bw_probe_base_us,     uint, 0664);
+module_param_named(bw_probe_rand_us,     bbr_bw_probe_rand_us,     uint, 0664);
+module_param_named(undo,                 bbr_undo,                 bool, 0664);
+module_param_named(fast_path,            bbr_fast_path,            bool, 0664);
+module_param_named(fast_ack_mode,        bbr_fast_ack_mode,        uint, 0664);
+module_param_named(refill_add_inc,       bbr_refill_add_inc,       uint, 0664);
+
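If this file is built as the tcp_bbr2 module, the knobs declared above
should be visible under /sys/module/tcp_bbr2/parameters/ with the
0644/0664 modes given to module_param_named(). A minimal reader sketch;
the path is an assumption based on that module name::

  #include <stdio.h>

  /* Read one of the tcp_bbr2 module parameters from sysfs. The module
   * name (and hence the path) is assumed from this file's name.
   */
  int main(void)
  {
  	char buf[64];
  	FILE *f = fopen("/sys/module/tcp_bbr2/parameters/beta", "r");

  	if (!f) {
  		perror("tcp_bbr2 not loaded?");
  		return 1;
  	}
  	if (fgets(buf, sizeof(buf), f))
  		printf("beta = %s", buf);	/* default: 256*30/100 = 76 */
  	fclose(f);
  	return 0;
  }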
+static void bbr2_init(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr_init(sk);	/* run shared init code for v1 and v2 */
+
+	/* BBR v2 parameters: */
+	bbr->params.beta = min_t(u32, 0xFFU, bbr_beta);
+	bbr->params.ecn_alpha_gain = min_t(u32, 0xFFU, bbr_ecn_alpha_gain);
+	bbr->params.ecn_alpha_init = min_t(u32, BBR_UNIT, bbr_ecn_alpha_init);
+	bbr->params.ecn_factor = min_t(u32, 0xFFU, bbr_ecn_factor);
+	bbr->params.ecn_thresh = min_t(u32, 0xFFU, bbr_ecn_thresh);
+	bbr->params.ecn_max_rtt_us = min_t(u32, 0x7ffffU, bbr_ecn_max_rtt_us);
+	bbr->params.ecn_reprobe_gain = min_t(u32, 0x1FF, bbr_ecn_reprobe_gain);
+	bbr->params.loss_thresh = min_t(u32, 0xFFU, bbr_loss_thresh);
+	bbr->params.full_loss_cnt = min_t(u32, 0xFU, bbr_full_loss_cnt);
+	bbr->params.full_ecn_cnt = min_t(u32, 0x3U, bbr_full_ecn_cnt);
+	bbr->params.inflight_headroom =
+		min_t(u32, 0xFFU, bbr_inflight_headroom);
+	bbr->params.bw_probe_pif_gain =
+		min_t(u32, 0x1FFU, bbr_bw_probe_pif_gain);
+	bbr->params.bw_probe_reno_gain =
+		min_t(u32, 0x1FFU, bbr_bw_probe_reno_gain);
+	bbr->params.bw_probe_max_rounds =
+		min_t(u32, 0xFFU, bbr_bw_probe_max_rounds);
+	bbr->params.bw_probe_rand_rounds =
+		min_t(u32, 0xFU, bbr_bw_probe_rand_rounds);
+	bbr->params.bw_probe_base_us =
+		min_t(u32, (1 << 26) - 1, bbr_bw_probe_base_us);
+	bbr->params.bw_probe_rand_us =
+		min_t(u32, (1 << 26) - 1, bbr_bw_probe_rand_us);
+	bbr->params.undo = bbr_undo;
+	bbr->params.fast_path = bbr_fast_path ? 1 : 0;
+	bbr->params.refill_add_inc = min_t(u32, 0x3U, bbr_refill_add_inc);
+
+	/* BBR v2 state: */
+	bbr->initialized = 1;
+	/* Start sampling ECN mark rate after first full flight is ACKed: */
+	bbr->loss_round_delivered = tp->delivered + 1;
+	bbr->loss_round_start = 0;
+	bbr->undo_bw_lo = 0;
+	bbr->undo_inflight_lo = 0;
+	bbr->undo_inflight_hi = 0;
+	bbr->loss_events_in_round = 0;
+	bbr->startup_ecn_rounds = 0;
+	bbr2_reset_congestion_signals(sk);
+	bbr->bw_lo = ~0U;
+	bbr->bw_hi[0] = 0;
+	bbr->bw_hi[1] = 0;
+	bbr->inflight_lo = ~0U;
+	bbr->inflight_hi = ~0U;
+	bbr->bw_probe_up_cnt = ~0U;
+	bbr->bw_probe_up_acks = 0;
+	bbr->bw_probe_up_rounds = 0;
+	bbr->probe_wait_us = 0;
+	bbr->stopped_risky_probe = 0;
+	bbr->ack_phase = BBR_ACKS_INIT;
+	bbr->rounds_since_probe = 0;
+	bbr->bw_probe_samples = 0;
+	bbr->prev_probe_too_high = 0;
+	bbr->ecn_eligible = 0;
+	bbr->ecn_alpha = bbr->params.ecn_alpha_init;
+	bbr->alpha_last_delivered = 0;
+	bbr->alpha_last_delivered_ce = 0;
+
+	tp->fast_ack_mode = min_t(u32, 0x2U, bbr_fast_ack_mode);
+
+	if ((tp->ecn_flags & TCP_ECN_OK) && bbr_ecn_enable)
+		tp->ecn_flags |= TCP_ECN_ECT_PERMANENT;
+}
+
+/* Core TCP stack informs us that the given skb was just marked lost. */
+static void bbr2_skb_marked_lost(struct sock *sk, const struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
+	struct rate_sample rs;
+
+	/* Capture "current" data over the full round trip of loss,
+	 * to have a better chance to see the full capacity of the path.
+	 */
+	if (!bbr->loss_in_round)  /* first loss in this round trip? */
+		bbr->loss_round_delivered = tp->delivered;  /* set round trip */
+	bbr->loss_in_round = 1;
+	bbr->loss_in_cycle = 1;
+
+	if (!bbr->bw_probe_samples)
+		return;  /* not an skb sent while probing for bandwidth */
+	if (unlikely(!scb->tx.delivered_mstamp))
+		return;  /* skb was SACKed, reneged, marked lost; ignore it */
+	/* We are probing for bandwidth. Construct a rate sample that
+	 * estimates what happened in the flight leading up to this lost skb,
+	 * then see if the loss rate went too high, and if so at which packet.
+	 */
+	memset(&rs, 0, sizeof(rs));
+	rs.tx_in_flight = scb->tx.in_flight;
+	rs.lost = tp->lost - scb->tx.lost;
+	rs.is_app_limited = scb->tx.is_app_limited;
+	if (bbr2_is_inflight_too_high(sk, &rs)) {
+		rs.tx_in_flight = bbr2_inflight_hi_from_lost_skb(sk, &rs, skb);
+		bbr2_handle_inflight_too_high(sk, &rs);
+	}
+}
+
+/* Revert short-term model if current loss recovery event was spurious. */
+static u32 bbr2_undo_cwnd(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr->debug.undo = 1;
+	bbr->full_bw = 0;  /* spurious slow-down; reset full pipe detection */
+	bbr->full_bw_cnt = 0;
+	bbr->loss_in_round = 0;
+
+	if (!bbr->params.undo)
+		return tp->snd_cwnd;
+
+	/* Revert to cwnd and other state saved before loss episode. */
+	bbr->bw_lo = max(bbr->bw_lo, bbr->undo_bw_lo);
+	bbr->inflight_lo = max(bbr->inflight_lo, bbr->undo_inflight_lo);
+	bbr->inflight_hi = max(bbr->inflight_hi, bbr->undo_inflight_hi);
+	return bbr->prior_cwnd;
+}
+
+/* Entering loss recovery, so save state for when we undo recovery. */
+static u32 bbr2_ssthresh(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr_save_cwnd(sk);
+	/* For undo, save state that adapts based on loss signal. */
+	bbr->undo_bw_lo = bbr->bw_lo;
+	bbr->undo_inflight_lo = bbr->inflight_lo;
+	bbr->undo_inflight_hi = bbr->inflight_hi;
+	return tcp_sk(sk)->snd_ssthresh;
+}
+
+static enum tcp_bbr2_phase bbr2_get_phase(struct bbr *bbr)
+{
+	switch (bbr->mode) {
+	case BBR_STARTUP:
+		return BBR2_PHASE_STARTUP;
+	case BBR_DRAIN:
+		return BBR2_PHASE_DRAIN;
+	case BBR_PROBE_BW:
+		break;
+	case BBR_PROBE_RTT:
+		return BBR2_PHASE_PROBE_RTT;
+	default:
+		return BBR2_PHASE_INVALID;
+	}
+	switch (bbr->cycle_idx) {
+	case BBR_BW_PROBE_UP:
+		return BBR2_PHASE_PROBE_BW_UP;
+	case BBR_BW_PROBE_DOWN:
+		return BBR2_PHASE_PROBE_BW_DOWN;
+	case BBR_BW_PROBE_CRUISE:
+		return BBR2_PHASE_PROBE_BW_CRUISE;
+	case BBR_BW_PROBE_REFILL:
+		return BBR2_PHASE_PROBE_BW_REFILL;
+	default:
+		return BBR2_PHASE_INVALID;
+	}
+}
+
+static size_t bbr2_get_info(struct sock *sk, u32 ext, int *attr,
+			    union tcp_cc_info *info)
+{
+	if (ext & (1 << (INET_DIAG_BBRINFO - 1)) ||
+	    ext & (1 << (INET_DIAG_VEGASINFO - 1))) {
+		struct bbr *bbr = inet_csk_ca(sk);
+		u64 bw = bbr_bw_bytes_per_sec(sk, bbr_bw(sk));
+		u64 bw_hi = bbr_bw_bytes_per_sec(sk, bbr_max_bw(sk));
+		u64 bw_lo = bbr->bw_lo == ~0U ?
+			~0ULL : bbr_bw_bytes_per_sec(sk, bbr->bw_lo);
+
+		memset(&info->bbr2, 0, sizeof(info->bbr2));
+		info->bbr2.bbr_bw_lsb		= (u32)bw;
+		info->bbr2.bbr_bw_msb		= (u32)(bw >> 32);
+		info->bbr2.bbr_min_rtt		= bbr->min_rtt_us;
+		info->bbr2.bbr_pacing_gain	= bbr->pacing_gain;
+		info->bbr2.bbr_cwnd_gain	= bbr->cwnd_gain;
+		info->bbr2.bbr_bw_hi_lsb	= (u32)bw_hi;
+		info->bbr2.bbr_bw_hi_msb	= (u32)(bw_hi >> 32);
+		info->bbr2.bbr_bw_lo_lsb	= (u32)bw_lo;
+		info->bbr2.bbr_bw_lo_msb	= (u32)(bw_lo >> 32);
+		info->bbr2.bbr_mode		= bbr->mode;
+		info->bbr2.bbr_phase		= (__u8)bbr2_get_phase(bbr);
+		info->bbr2.bbr_version		= (__u8)2;
+		info->bbr2.bbr_inflight_lo	= bbr->inflight_lo;
+		info->bbr2.bbr_inflight_hi	= bbr->inflight_hi;
+		info->bbr2.bbr_extra_acked	= bbr_extra_acked(sk);
+		*attr = INET_DIAG_BBRINFO;
+		return sizeof(info->bbr2);
+	}
+	return 0;
+}
+
+static void bbr2_set_state(struct sock *sk, u8 new_state)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if (new_state == TCP_CA_Loss) {
+		struct rate_sample rs = { .losses = 1 };
+		struct bbr_context ctx = { 0 };
+
+		bbr->prev_ca_state = TCP_CA_Loss;
+		bbr->full_bw = 0;
+		if (!bbr2_is_probing_bandwidth(sk) && bbr->inflight_lo == ~0U) {
+			/* bbr_adapt_lower_bounds() needs cwnd before
+			 * we suffered an RTO, to update inflight_lo:
+			 */
+			bbr->inflight_lo =
+				max(tp->snd_cwnd, bbr->prior_cwnd);
+		}
+		bbr_debug(sk, 0, &rs, &ctx);
+	} else if (bbr->prev_ca_state == TCP_CA_Loss &&
+		   new_state != TCP_CA_Loss) {
+		tp->snd_cwnd = max(tp->snd_cwnd, bbr->prior_cwnd);
+		bbr->try_fast_path = 0;  /* bound cwnd using latest model */
+	}
+}
+
+static struct tcp_congestion_ops tcp_bbr2_cong_ops __read_mostly = {
+	.flags		= TCP_CONG_NON_RESTRICTED | TCP_CONG_WANTS_CE_EVENTS,
+	.name		= "bbr2",
+	.owner		= THIS_MODULE,
+	.init		= bbr2_init,
+	.cong_control	= bbr2_main,
+	.sndbuf_expand	= bbr_sndbuf_expand,
+	.skb_marked_lost = bbr2_skb_marked_lost,
+	.undo_cwnd	= bbr2_undo_cwnd,
+	.cwnd_event	= bbr_cwnd_event,
+	.ssthresh	= bbr2_ssthresh,
+	.tso_segs	= bbr_tso_segs,
+	.get_info	= bbr2_get_info,
+	.set_state	= bbr2_set_state,
+};
+
+static int __init bbr_register(void)
+{
+	BUILD_BUG_ON(sizeof(struct bbr) > ICSK_CA_PRIV_SIZE);
+	return tcp_register_congestion_control(&tcp_bbr2_cong_ops);
+}
+
+static void __exit bbr_unregister(void)
+{
+	tcp_unregister_congestion_control(&tcp_bbr2_cong_ops);
+}
+
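Once tcp_bbr2_cong_ops is registered under the name "bbr2", applications
can opt in per socket with the standard TCP_CONGESTION socket option (or
system-wide via net.ipv4.tcp_congestion_control). A minimal sketch,
assuming a kernel with this patch applied and the module loaded::

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  /* Select bbr2 on a fresh TCP socket, then read the choice back. */
  int main(void)
  {
  	int fd = socket(AF_INET, SOCK_STREAM, 0);
  	char name[16] = "bbr2";	/* matches tcp_bbr2_cong_ops.name */
  	socklen_t len = sizeof(name);

  	if (fd < 0 ||
  	    setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name,
  		       strlen(name)) < 0) {
  		perror("TCP_CONGESTION bbr2");
  		return 1;
  	}
  	if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, &len) == 0)
  		printf("congestion control: %s\n", name);
  	close(fd);
  	return 0;
  }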
+module_init(bbr_register);
+module_exit(bbr_unregister);
+
+MODULE_AUTHOR("Van Jacobson <vanj@google.com>");
+MODULE_AUTHOR("Neal Cardwell <ncardwell@google.com>");
+MODULE_AUTHOR("Yuchung Cheng <ycheng@google.com>");
+MODULE_AUTHOR("Soheil Hassas Yeganeh <soheil@google.com>");
+MODULE_AUTHOR("Priyaranjan Jha <priyarjha@google.com>");
+MODULE_AUTHOR("Yousuk Seung <ysseung@google.com>");
+MODULE_AUTHOR("Kevin Yang <yyd@google.com>");
+MODULE_AUTHOR("Arjun Roy <arjunroy@google.com>");
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_DESCRIPTION("TCP BBR (Bottleneck Bandwidth and RTT)");
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index dc95572163df..3e38644718b9 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -177,6 +177,7 @@ void tcp_init_congestion_control(struct sock *sk)
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
 	tcp_sk(sk)->prior_ssthresh = 0;
+	tcp_sk(sk)->fast_ack_mode = 0;
 	if (icsk->icsk_ca_ops->init)
 		icsk->icsk_ca_ops->init(sk);
 	if (tcp_ca_needs_ecn(sk))
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 60f99e9fb6d1..d2e4f9f9b2fd 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -349,7 +349,7 @@ static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
 			tcp_enter_quickack_mode(sk, 2);
 		break;
 	case INET_ECN_CE:
-		if (tcp_ca_needs_ecn(sk))
+		if (tcp_ca_wants_ce_events(sk))
 			tcp_ca_event(sk, CA_EVENT_ECN_IS_CE);
 
 		if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
@@ -360,7 +360,7 @@ static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
 		tp->ecn_flags |= TCP_ECN_SEEN;
 		break;
 	default:
-		if (tcp_ca_needs_ecn(sk))
+		if (tcp_ca_wants_ce_events(sk))
 			tcp_ca_event(sk, CA_EVENT_ECN_NO_CE);
 		tp->ecn_flags |= TCP_ECN_SEEN;
 		break;
@@ -1079,7 +1079,12 @@ static void tcp_verify_retransmit_hint(struct tcp_sock *tp, struct sk_buff *skb)
  */
 static void tcp_notify_skb_loss_event(struct tcp_sock *tp, const struct sk_buff *skb)
 {
+	struct sock *sk = (struct sock *)tp;
+	const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
+
 	tp->lost += tcp_skb_pcount(skb);
+	if (ca_ops->skb_marked_lost)
+		ca_ops->skb_marked_lost(sk, skb);
 }
 
 void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
@@ -1460,6 +1465,17 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *prev,
 	WARN_ON_ONCE(tcp_skb_pcount(skb) < pcount);
 	tcp_skb_pcount_add(skb, -pcount);
 
+	/* Adjust tx.in_flight as pcount is shifted from skb to prev. */
+	if (WARN_ONCE(TCP_SKB_CB(skb)->tx.in_flight < pcount,
+		      "prev in_flight: %u skb in_flight: %u pcount: %u",
+		      TCP_SKB_CB(prev)->tx.in_flight,
+		      TCP_SKB_CB(skb)->tx.in_flight,
+		      pcount))
+		TCP_SKB_CB(skb)->tx.in_flight = 0;
+	else
+		TCP_SKB_CB(skb)->tx.in_flight -= pcount;
+	TCP_SKB_CB(prev)->tx.in_flight += pcount;
+
 	/* When we're adding to gso_segs == 1, gso_size will be zero,
 	 * in theory this shouldn't be necessary but as long as DSACK
 	 * code can come after this skb later on it's better to keep
@@ -3790,6 +3806,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : tp->snd_una;
 	rs.prior_in_flight = tcp_packets_in_flight(tp);
+	tcp_rate_check_app_limited(sk);
 
 	/* ts_recent update must be made after we are sure that the packet
 	 * is in window.
 	 */
@@ -3888,6 +3905,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	delivered = tcp_newly_delivered(sk, delivered, flag);
 	lost = tp->lost - lost;			/* freshly marked lost */
 	rs.is_ack_delayed = !!(flag & FLAG_ACK_MAYBE_DELAYED);
+	rs.is_ece = !!(flag & FLAG_ECE);
 	tcp_rate_gen(sk, delivered, lost, is_sack_reneg, sack_state.rate);
 	tcp_cong_control(sk, ack, delivered, flag, sack_state.rate);
 	tcp_xmit_recovery(sk, rexmit);
@@ -5493,13 +5511,14 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
 	    /* More than one full frame received... */
 	if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
+	     (tp->fast_ack_mode == 1 ||
 	     /* ... and right edge of window advances far enough.
 	      * (tcp_recvmsg() will send ACK otherwise).
 	      * If application uses SO_RCVLOWAT, we want send ack now if
 	      * we have not received enough bytes to satisfy the condition.
 	      */
-	    (tp->rcv_nxt - tp->copied_seq < sk->sk_rcvlowat ||
-	     __tcp_select_window(sk) >= tp->rcv_wnd)) ||
+	      (tp->rcv_nxt - tp->copied_seq < sk->sk_rcvlowat ||
+	       __tcp_select_window(sk) >= tp->rcv_wnd))) ||
 	    /* We ACK each frame or... */
 	    tcp_in_quickack_mode(sk) ||
 	    /* Protocol state mandates a one-time immediate ACK */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1ca2f28c9981..9a336caa715c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -378,7 +378,8 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
 			th->cwr = 1;
 			skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
 		}
-	} else if (!tcp_ca_needs_ecn(sk)) {
+	} else if (!(tp->ecn_flags & TCP_ECN_ECT_PERMANENT) &&
+		   !tcp_ca_needs_ecn(sk)) {
 		/* ACK or retransmitted segment: clear ECT|CE */
 		INET_ECN_dontxmit(sk);
 	}
@@ -1534,7 +1535,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *buff;
-	int nsize, old_factor;
+	int nsize, old_factor, inflight_prev;
 	long limit;
 	int nlen;
 	u8 flags;
@@ -1611,6 +1612,15 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 
 		if (diff)
 			tcp_adjust_pcount(sk, skb, diff);
+
+		/* Set buff tx.in_flight as if buff were sent by itself. */
+		inflight_prev = TCP_SKB_CB(skb)->tx.in_flight - old_factor;
+		if (WARN_ONCE(inflight_prev < 0,
+			      "inconsistent: tx.in_flight: %u old_factor: %d",
+			      TCP_SKB_CB(skb)->tx.in_flight, old_factor))
+			inflight_prev = 0;
+		TCP_SKB_CB(buff)->tx.in_flight = inflight_prev +
+						 tcp_skb_pcount(buff);
 	}
 
 	/* Link BUFF into the send queue. */
@@ -1988,13 +1998,12 @@ static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
 static u32 tcp_tso_segs(struct sock *sk, unsigned int mss_now)
 {
 	const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
-	u32 min_tso, tso_segs;
-
-	min_tso = ca_ops->min_tso_segs ?
-			ca_ops->min_tso_segs(sk) :
-			sock_net(sk)->ipv4.sysctl_tcp_min_tso_segs;
+	u32 tso_segs;
 
-	tso_segs = tcp_tso_autosize(sk, mss_now, min_tso);
+	tso_segs = ca_ops->tso_segs ?
+		ca_ops->tso_segs(sk, mss_now) :
+		tcp_tso_autosize(sk, mss_now,
+				 sock_net(sk)->ipv4.sysctl_tcp_min_tso_segs);
 	return min_t(u32, tso_segs, sk->sk_gso_max_segs);
 }
 
@@ -2630,6 +2639,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true);
 			list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
 			tcp_init_tso_segs(skb, mss_now);
+			tcp_set_tx_in_flight(sk, skb);
 			goto repair; /* Skip network transmission */
 		}
 
diff --git a/net/ipv4/tcp_rate.c b/net/ipv4/tcp_rate.c
index 9a8e014d9b5b..de9d4cc29722 100644
--- a/net/ipv4/tcp_rate.c
+++ b/net/ipv4/tcp_rate.c
@@ -34,6 +34,24 @@
  * ready to send in the write queue.
  */
 
+void tcp_set_tx_in_flight(struct sock *sk, struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	u32 in_flight;
+
+	/* Check, sanitize, and record packets in flight after skb was sent. */
+	in_flight = tcp_packets_in_flight(tp) + tcp_skb_pcount(skb);
+	if (WARN_ONCE(in_flight > TCPCB_IN_FLIGHT_MAX,
+		      "insane in_flight %u cc %s mss %u "
+		      "cwnd %u pif %u %u %u %u\n",
+		      in_flight, inet_csk(sk)->icsk_ca_ops->name,
+		      tp->mss_cache, tp->snd_cwnd,
+		      tp->packets_out, tp->retrans_out,
+		      tp->sacked_out, tp->lost_out))
+		in_flight = TCPCB_IN_FLIGHT_MAX;
+	TCP_SKB_CB(skb)->tx.in_flight = in_flight;
+}
+
 /* Snapshot the current delivery information in the skb, to generate
  * a rate sample later when the skb is (s)acked in tcp_rate_skb_delivered().
  */
@@ -66,7 +84,9 @@ void tcp_rate_skb_sent(struct sock *sk, struct sk_buff *skb)
 	TCP_SKB_CB(skb)->tx.delivered_mstamp	= tp->delivered_mstamp;
 	TCP_SKB_CB(skb)->tx.delivered		= tp->delivered;
 	TCP_SKB_CB(skb)->tx.delivered_ce	= tp->delivered_ce;
+	TCP_SKB_CB(skb)->tx.lost		= tp->lost;
 	TCP_SKB_CB(skb)->tx.is_app_limited	= tp->app_limited ? 1 : 0;
+	tcp_set_tx_in_flight(sk, skb);
 }
 
 /* When an skb is sacked or acked, we fill in the rate sample with the (prior)
@@ -91,18 +111,21 @@ void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb,
 	if (!rs->prior_delivered ||
 	    tcp_skb_sent_after(tx_tstamp, tp->first_tx_mstamp,
 			       scb->end_seq, rs->last_end_seq)) {
+		rs->prior_lost	     = scb->tx.lost;
 		rs->prior_delivered_ce  = scb->tx.delivered_ce;
 		rs->prior_delivered  = scb->tx.delivered;
 		rs->prior_mstamp     = scb->tx.delivered_mstamp;
 		rs->is_app_limited   = scb->tx.is_app_limited;
 		rs->is_retrans	     = scb->sacked & TCPCB_RETRANS;
 		rs->last_end_seq     = scb->end_seq;
+		rs->tx_in_flight     = scb->tx.in_flight;
 
 		/* Record send time of most recently ACKed packet: */
 		tp->first_tx_mstamp  = tx_tstamp;
 		/* Find the duration of the "send phase" of this window: */
-		rs->interval_us = tcp_stamp_us_delta(tp->first_tx_mstamp,
-						     scb->tx.first_tx_mstamp);
+		rs->interval_us = tcp_stamp32_us_delta(
+						tp->first_tx_mstamp,
+						scb->tx.first_tx_mstamp);
 	}
 
 	/* Mark off the skb delivered once it's sacked to avoid being
@@ -144,6 +167,7 @@ void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
 		return;
 	}
 	rs->delivered   = tp->delivered - rs->prior_delivered;
+	rs->lost        = tp->lost - rs->prior_lost;
 	rs->delivered_ce = tp->delivered_ce - rs->prior_delivered_ce;
 	/* delivered_ce occupies less than 32 bits in the skb control block */
@@ -155,7 +179,7 @@ void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
 	 * longer phase.
 	 */
 	snd_us = rs->interval_us;				/* send phase */
-	ack_us = tcp_stamp_us_delta(tp->tcp_mstamp,
+	ack_us = tcp_stamp32_us_delta(tp->tcp_mstamp,
 				    rs->prior_mstamp); /* ack phase */
 	rs->interval_us = max(snd_us, ack_us);
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 20cf4a98c69d..b5f7e49a003a 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -607,6 +607,7 @@ void tcp_write_timer_handler(struct sock *sk)
 		goto out;
 	}
 
+	tcp_rate_check_app_limited(sk);
 	tcp_mstamp_refresh(tcp_sk(sk));
 	event = icsk->icsk_pending;