diff --git a/.gitignore b/.gitignore index 72ef86a5570d..2258e906f01c 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only # # NOTE! Don't add files that are generated in specific # subdirectories here. Add them in the ".gitignore" file diff --git a/.mailmap b/.mailmap index a0dfce8de1ba..db3754a41018 100644 --- a/.mailmap +++ b/.mailmap @@ -210,6 +210,7 @@ Oleksij Rempel Oleksij Rempel Oleksij Rempel Oleksij Rempel +Pali Rohár Paolo 'Blaisorblade' Giarrusso Patrick Mochel Paul Burton @@ -244,9 +245,11 @@ Santosh Shilimkar Santosh Shilimkar Sascha Hauer S.Çağlar Onur +Sakari Ailus Sean Nyekjaer Sebastian Reichel Sebastian Reichel +Sedat Dilek Shiraz Hashim Shuah Khan Shuah Khan diff --git a/Documentation/.gitignore b/Documentation/.gitignore index e74fec8693b2..d6dc7c9b8e25 100644 --- a/Documentation/.gitignore +++ b/Documentation/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only output *.pyc diff --git a/Documentation/ABI/obsolete/sysfs-kernel-fadump_enabled b/Documentation/ABI/obsolete/sysfs-kernel-fadump_enabled new file mode 100644 index 000000000000..e9c2de8b3688 --- /dev/null +++ b/Documentation/ABI/obsolete/sysfs-kernel-fadump_enabled @@ -0,0 +1,9 @@ +This ABI is renamed and moved to a new location /sys/kernel/fadump/enabled. + +What: /sys/kernel/fadump_enabled +Date: Feb 2012 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Primarily used to identify whether the FADump is enabled in + the kernel or not. +User: Kdump service diff --git a/Documentation/ABI/obsolete/sysfs-kernel-fadump_registered b/Documentation/ABI/obsolete/sysfs-kernel-fadump_registered new file mode 100644 index 000000000000..0360be39c98e --- /dev/null +++ b/Documentation/ABI/obsolete/sysfs-kernel-fadump_registered @@ -0,0 +1,10 @@ +This ABI is renamed and moved to a new location /sys/kernel/fadump/registered.¬ + +What: /sys/kernel/fadump_registered +Date: Feb 2012 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read/write + Helps to control the dump collect feature from userspace. + Setting 1 to this file enables the system to collect the + dump and 0 to disable it. +User: Kdump service diff --git a/Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem b/Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem new file mode 100644 index 000000000000..6ce0b129ab12 --- /dev/null +++ b/Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem @@ -0,0 +1,10 @@ +This ABI is renamed and moved to a new location /sys/kernel/fadump/release_mem.¬ + +What: /sys/kernel/fadump_release_mem +Date: Feb 2012 +Contact: linuxppc-dev@lists.ozlabs.org +Description: write only + This is a special sysfs file and only available when + the system is booted to capture the vmcore using FADump. + It is used to release the memory reserved by FADump to + save the crash dump. diff --git a/Documentation/ABI/obsolete/sysfs-selinux-checkreqprot b/Documentation/ABI/obsolete/sysfs-selinux-checkreqprot new file mode 100644 index 000000000000..49ed9c8fd1e5 --- /dev/null +++ b/Documentation/ABI/obsolete/sysfs-selinux-checkreqprot @@ -0,0 +1,23 @@ +What: /sys/fs/selinux/checkreqprot +Date: April 2005 (predates git) +KernelVersion: 2.6.12-rc2 (predates git) +Contact: selinux@vger.kernel.org +Description: + + The selinuxfs "checkreqprot" node allows SELinux to be configured + to check the protection requested by userspace for mmap/mprotect + calls instead of the actual protection applied by the kernel. + This was a compatibility mechanism for legacy userspace and + for the READ_IMPLIES_EXEC personality flag. However, if set to + 1, it weakens security by allowing mappings to be made executable + without authorization by policy. The default value of checkreqprot + at boot was changed starting in Linux v4.4 to 0 (i.e. check the + actual protection), and Android and Linux distributions have been + explicitly writing a "0" to /sys/fs/selinux/checkreqprot during + initialization for some time. Support for setting checkreqprot to 1 + will be removed in a future kernel release, at which point the kernel + will always cease using checkreqprot internally and will always + check the actual protections being applied upon mmap/mprotect calls. + The checkreqprot selinuxfs node will remain for backward compatibility + but will discard writes of the "0" value and will reject writes of the + "1" value when this mechanism is removed. diff --git a/Documentation/ABI/removed/sysfs-kernel-fadump_release_opalcore b/Documentation/ABI/removed/sysfs-kernel-fadump_release_opalcore new file mode 100644 index 000000000000..a8d46cd0f4e6 --- /dev/null +++ b/Documentation/ABI/removed/sysfs-kernel-fadump_release_opalcore @@ -0,0 +1,9 @@ +This ABI is moved to /sys/firmware/opal/mpipl/release_core. + +What: /sys/kernel/fadump_release_opalcore +Date: Sep 2019 +Contact: linuxppc-dev@lists.ozlabs.org +Description: write only + The sysfs file is available when the system is booted to + collect the dump on OPAL based machine. It used to release + the memory used to collect the opalcore. diff --git a/Documentation/ABI/removed/sysfs-kernel-uids b/Documentation/ABI/removed/sysfs-kernel-uids new file mode 100644 index 000000000000..dc4463f190a7 --- /dev/null +++ b/Documentation/ABI/removed/sysfs-kernel-uids @@ -0,0 +1,14 @@ +What: /sys/kernel/uids//cpu_shares +Date: December 2007, finally removed in kernel v2.6.34-rc1 +Contact: Dhaval Giani + Srivatsa Vaddagiri +Description: + The /sys/kernel/uids//cpu_shares tunable is used + to set the cpu bandwidth a user is allowed. This is a + propotional value. What that means is that if there + are two users logged in, each with an equal number of + shares, then they will get equal CPU bandwidth. Another + example would be, if User A has shares = 1024 and user + B has shares = 2048, User B will get twice the CPU + bandwidth user A will. For more details refer + Documentation/scheduler/sched-design-CFS.rst diff --git a/Documentation/ABI/testing/configfs-most b/Documentation/ABI/testing/configfs-most new file mode 100644 index 000000000000..ed67a4d9f6d6 --- /dev/null +++ b/Documentation/ABI/testing/configfs-most @@ -0,0 +1,196 @@ +What: /sys/kernel/config/most_ +Date: March 8, 2019 +KernelVersion: 5.2 +Description: Interface is used to configure and connect device channels + to component drivers. + + Attributes are visible only when configfs is mounted. To mount + configfs in /sys/kernel/config directory use: + # mount -t configfs none /sys/kernel/config/ + + +What: /sys/kernel/config/most_cdev/ +Date: March 8, 2019 +KernelVersion: 5.2 +Description: + The attributes: + + buffer_size configure the buffer size for this channel + + subbuffer_size configure the sub-buffer size for this channel + (needed for synchronous and isochrnous data) + + + num_buffers configure number of buffers used for this + channel + + datatype configure type of data that will travel over + this channel + + direction configure whether this link will be an input + or output + + dbr_size configure DBR data buffer size (this is used + for MediaLB communication only) + + packets_per_xact + configure the number of packets that will be + collected from the network before being + transmitted via USB (this is used for USB + communication only) + + device name of the device the link is to be attached to + + channel name of the channel the link is to be attached to + + comp_params pass parameters needed by some components + + create_link write '1' to this attribute to trigger the + creation of the link. In case of speculative + configuration, the creation is post-poned until + a physical device is being attached to the bus. + + destroy_link write '1' to this attribute to destroy an + active link + +What: /sys/kernel/config/most_video/ +Date: March 8, 2019 +KernelVersion: 5.2 +Description: + The attributes: + + buffer_size configure the buffer size for this channel + + subbuffer_size configure the sub-buffer size for this channel + (needed for synchronous and isochrnous data) + + + num_buffers configure number of buffers used for this + channel + + datatype configure type of data that will travel over + this channel + + direction configure whether this link will be an input + or output + + dbr_size configure DBR data buffer size (this is used + for MediaLB communication only) + + packets_per_xact + configure the number of packets that will be + collected from the network before being + transmitted via USB (this is used for USB + communication only) + + device name of the device the link is to be attached to + + channel name of the channel the link is to be attached to + + comp_params pass parameters needed by some components + + create_link write '1' to this attribute to trigger the + creation of the link. In case of speculative + configuration, the creation is post-poned until + a physical device is being attached to the bus. + + destroy_link write '1' to this attribute to destroy an + active link + +What: /sys/kernel/config/most_net/ +Date: March 8, 2019 +KernelVersion: 5.2 +Description: + The attributes: + + buffer_size configure the buffer size for this channel + + subbuffer_size configure the sub-buffer size for this channel + (needed for synchronous and isochrnous data) + + + num_buffers configure number of buffers used for this + channel + + datatype configure type of data that will travel over + this channel + + direction configure whether this link will be an input + or output + + dbr_size configure DBR data buffer size (this is used + for MediaLB communication only) + + packets_per_xact + configure the number of packets that will be + collected from the network before being + transmitted via USB (this is used for USB + communication only) + + device name of the device the link is to be attached to + + channel name of the channel the link is to be attached to + + comp_params pass parameters needed by some components + + create_link write '1' to this attribute to trigger the + creation of the link. In case of speculative + configuration, the creation is post-poned until + a physical device is being attached to the bus. + + destroy_link write '1' to this attribute to destroy an + active link + +What: /sys/kernel/config/most_sound/ +Date: March 8, 2019 +KernelVersion: 5.2 +Description: + The attributes: + + create_card write '1' to this attribute to trigger the + registration of the sound card with the ALSA + subsystem. + +What: /sys/kernel/config/most_sound// +Date: March 8, 2019 +KernelVersion: 5.2 +Description: + The attributes: + + buffer_size configure the buffer size for this channel + + subbuffer_size configure the sub-buffer size for this channel + (needed for synchronous and isochrnous data) + + + num_buffers configure number of buffers used for this + channel + + datatype configure type of data that will travel over + this channel + + direction configure whether this link will be an input + or output + + dbr_size configure DBR data buffer size (this is used + for MediaLB communication only) + + packets_per_xact + configure the number of packets that will be + collected from the network before being + transmitted via USB (this is used for USB + communication only) + + device name of the device the link is to be attached to + + channel name of the channel the link is to be attached to + + comp_params pass parameters needed by some components + + create_link write '1' to this attribute to trigger the + creation of the link. In case of speculative + configuration, the creation is post-poned until + a physical device is being attached to the bus. + + destroy_link write '1' to this attribute to destroy an + active link diff --git a/Documentation/ABI/testing/debugfs-driver-habanalabs b/Documentation/ABI/testing/debugfs-driver-habanalabs index f0ac14b70ecb..a73601c5121e 100644 --- a/Documentation/ABI/testing/debugfs-driver-habanalabs +++ b/Documentation/ABI/testing/debugfs-driver-habanalabs @@ -43,6 +43,20 @@ Description: Allows the root user to read or write directly through the If the IOMMU is disabled, it also allows the root user to read or write from the host a device VA of a host mapped memory +What: /sys/kernel/debug/habanalabs/hl/data64 +Date: Jan 2020 +KernelVersion: 5.6 +Contact: oded.gabbay@gmail.com +Description: Allows the root user to read or write 64 bit data directly + through the device's PCI bar. Writing to this file generates a + write transaction while reading from the file generates a read + transaction. This custom interface is needed (instead of using + the generic Linux user-space PCI mapping) because the DDR bar + is very small compared to the DDR memory and only the driver can + move the bar before and after the transaction. + If the IOMMU is disabled, it also allows the root user to read + or write from the host a device VA of a host mapped memory + What: /sys/kernel/debug/habanalabs/hl/device Date: Jan 2019 KernelVersion: 5.1 diff --git a/Documentation/ABI/testing/sysfs-bus-coresight-devices-cti b/Documentation/ABI/testing/sysfs-bus-coresight-devices-cti new file mode 100644 index 000000000000..9d11502b4390 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-coresight-devices-cti @@ -0,0 +1,241 @@ +What: /sys/bus/coresight/devices//enable +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (RW) Enable/Disable the CTI hardware. + +What: /sys/bus/coresight/devices//powered +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) Indicate if the CTI hardware is powered. + +What: /sys/bus/coresight/devices//ctmid +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) Display the associated CTM ID + +What: /sys/bus/coresight/devices//nr_trigger_cons +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) Number of devices connected to triggers on this CTI + +What: /sys/bus/coresight/devices//triggers/name +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) Name of connected device + +What: /sys/bus/coresight/devices//triggers/in_signals +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) Input trigger signals from connected device + +What: /sys/bus/coresight/devices//triggers/in_types +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) Functional types for the input trigger signals + from connected device + +What: /sys/bus/coresight/devices//triggers/out_signals +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) Output trigger signals to connected device + +What: /sys/bus/coresight/devices//triggers/out_types +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) Functional types for the output trigger signals + to connected device + +What: /sys/bus/coresight/devices//regs/inout_sel +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (RW) Select the index for inen and outen registers. + +What: /sys/bus/coresight/devices//regs/inen +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (RW) Read or write the CTIINEN register selected by inout_sel. + +What: /sys/bus/coresight/devices//regs/outen +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (RW) Read or write the CTIOUTEN register selected by inout_sel. + +What: /sys/bus/coresight/devices//regs/gate +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (RW) Read or write CTIGATE register. + +What: /sys/bus/coresight/devices//regs/asicctl +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (RW) Read or write ASICCTL register. + +What: /sys/bus/coresight/devices//regs/intack +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (W) Write the INTACK register. + +What: /sys/bus/coresight/devices//regs/appset +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (RW) Set CTIAPPSET register to activate channel. Read back to + determine current value of register. + +What: /sys/bus/coresight/devices//regs/appclear +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (W) Write APPCLEAR register to deactivate channel. + +What: /sys/bus/coresight/devices//regs/apppulse +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (W) Write APPPULSE to pulse a channel active for one clock + cycle. + +What: /sys/bus/coresight/devices//regs/chinstatus +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) Read current status of channel inputs. + +What: /sys/bus/coresight/devices//regs/choutstatus +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) read current status of channel outputs. + +What: /sys/bus/coresight/devices//regs/triginstatus +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) read current status of input trigger signals + +What: /sys/bus/coresight/devices//regs/trigoutstatus +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) read current status of output trigger signals. + +What: /sys/bus/coresight/devices//channels/trigin_attach +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (W) Attach a CTI input trigger to a CTM channel. + +What: /sys/bus/coresight/devices//channels/trigin_detach +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (W) Detach a CTI input trigger from a CTM channel. + +What: /sys/bus/coresight/devices//channels/trigout_attach +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (W) Attach a CTI output trigger to a CTM channel. + +What: /sys/bus/coresight/devices//channels/trigout_detach +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (W) Detach a CTI output trigger from a CTM channel. + +What: /sys/bus/coresight/devices//channels/chan_gate_enable +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (RW) Enable CTIGATE for single channel (W) or list enabled + channels through the gate (R). + +What: /sys/bus/coresight/devices//channels/chan_gate_disable +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (W) Disable CTIGATE for single channel. + +What: /sys/bus/coresight/devices//channels/chan_set +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (W) Activate a single channel. + +What: /sys/bus/coresight/devices//channels/chan_clear +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (W) Deactivate a single channel. + +What: /sys/bus/coresight/devices//channels/chan_pulse +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (W) Pulse a single channel - activate for a single clock cycle. + +What: /sys/bus/coresight/devices//channels/trigout_filtered +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) List of output triggers filtered across all connections. + +What: /sys/bus/coresight/devices//channels/trig_filter_enable +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (RW) Enable or disable trigger output signal filtering. + +What: /sys/bus/coresight/devices//channels/chan_inuse +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) show channels with at least one attached trigger signal. + +What: /sys/bus/coresight/devices//channels/chan_free +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) show channels with no attached trigger signals. + +What: /sys/bus/coresight/devices//channels/chan_xtrigs_sel +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (RW) Write channel number to select a channel to view, read to + see selected channel number. + +What: /sys/bus/coresight/devices//channels/chan_xtrigs_in +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) Read to see input triggers connected to selected view + channel. + +What: /sys/bus/coresight/devices//channels/chan_xtrigs_out +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (R) Read to see output triggers connected to selected view + channel. + +What: /sys/bus/coresight/devices//channels/chan_xtrigs_reset +Date: March 2020 +KernelVersion 5.7 +Contact: Mike Leach or Mathieu Poirier +Description: (W) Clear all channel / trigger programming. diff --git a/Documentation/ABI/testing/sysfs-bus-counter-104-quad-8 b/Documentation/ABI/testing/sysfs-bus-counter-104-quad-8 index 46b1f33b2fce..eac32180c40d 100644 --- a/Documentation/ABI/testing/sysfs-bus-counter-104-quad-8 +++ b/Documentation/ABI/testing/sysfs-bus-counter-104-quad-8 @@ -1,3 +1,28 @@ +What: /sys/bus/counter/devices/counterX/signalY/cable_fault +KernelVersion: 5.7 +Contact: linux-iio@vger.kernel.org +Description: + Read-only attribute that indicates whether a differential + encoder cable fault (not connected or loose wires) is detected + for the respective channel of Signal Y. Valid attribute values + are boolean. Detection must first be enabled via the + corresponding cable_fault_enable attribute. + +What: /sys/bus/counter/devices/counterX/signalY/cable_fault_enable +KernelVersion: 5.7 +Contact: linux-iio@vger.kernel.org +Description: + Whether detection of differential encoder cable faults for the + respective channel of Signal Y is enabled. Valid attribute + values are boolean. + +What: /sys/bus/counter/devices/counterX/signalY/filter_clock_prescaler +KernelVersion: 5.7 +Contact: linux-iio@vger.kernel.org +Description: + Filter clock factor for input Signal Y. This prescaler value + affects the inputs of both quadrature pair signals. + What: /sys/bus/counter/devices/counterX/signalY/index_polarity KernelVersion: 5.2 Contact: linux-iio@vger.kernel.org diff --git a/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7192 b/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7192 index 7627d3be08f5..f8315202c8f0 100644 --- a/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7192 +++ b/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7192 @@ -2,17 +2,22 @@ What: /sys/bus/iio/devices/iio:deviceX/ac_excitation_en KernelVersion: Contact: linux-iio@vger.kernel.org Description: - Reading gives the state of AC excitation. - Writing '1' enables AC excitation. + This attribute, if available, is used to enable the AC + excitation mode found on some converters. In ac excitation mode, + the polarity of the excitation voltage is reversed on + alternate cycles, to eliminate DC errors. What: /sys/bus/iio/devices/iio:deviceX/bridge_switch_en KernelVersion: Contact: linux-iio@vger.kernel.org Description: - This bridge switch is used to disconnect it when there is a - need to minimize the system current consumption. - Reading gives the state of the bridge switch. - Writing '1' enables the bridge switch. + This attribute, if available, is used to close or open the + bridge power down switch found on some converters. + In bridge applications, such as strain gauges and load cells, + the bridge itself consumes the majority of the current in the + system. To minimize the current consumption of the system, + the bridge can be disconnected (when it is not being used + using the bridge_switch_en attribute. What: /sys/bus/iio/devices/iio:deviceX/in_voltagex_sys_calibration KernelVersion: @@ -21,6 +26,13 @@ Description: Initiates the system calibration procedure. This is done on a single channel at a time. Write '1' to start the calibration. +What: /sys/bus/iio/devices/iio:deviceX/in_voltage2-voltage2_shorted_raw +KernelVersion: +Contact: linux-iio@vger.kernel.org +Description: + Measure voltage from AIN2 pin connected to AIN(+) + and AIN(-) shorted. + What: /sys/bus/iio/devices/iio:deviceX/in_voltagex_sys_calibration_mode_available KernelVersion: Contact: linux-iio@vger.kernel.org diff --git a/Documentation/ABI/testing/sysfs-bus-intel_th-devices-msc b/Documentation/ABI/testing/sysfs-bus-intel_th-devices-msc index 456cb62b384c..7fd2601c2831 100644 --- a/Documentation/ABI/testing/sysfs-bus-intel_th-devices-msc +++ b/Documentation/ABI/testing/sysfs-bus-intel_th-devices-msc @@ -40,3 +40,11 @@ Description: (RW) Trigger window switch for the MSC's buffer, in triggering a window switch for the buffer. Returns an error in any other operating mode or attempts to write something other than "1". +What: /sys/bus/intel_th/devices/-msc/stop_on_full +Date: March 2020 +KernelVersion: 5.7 +Contact: Alexander Shishkin +Description: (RW) Configure whether trace stops when the last available window + becomes full (1/y/Y) or wraps around and continues until the next + window becomes available again (0/n/N). + diff --git a/Documentation/ABI/testing/sysfs-bus-most b/Documentation/ABI/testing/sysfs-bus-most new file mode 100644 index 000000000000..6b1d06e3285e --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-most @@ -0,0 +1,295 @@ +What: /sys/bus/most/devices/.../description +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + Provides information about the interface type and the physical + location of the device. Hardware attached via USB, for instance, + might return <1-1.1:1.0> +Users: + +What: /sys/bus/most/devices/.../interface +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + Indicates the type of peripheral interface the device uses. +Users: + +What: /sys/bus/most/devices/.../dci +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + If the network interface controller is attached via USB, a dci + directory is created that allows applications to read and + write the controller's DCI registers. +Users: + +What: /sys/bus/most/devices/.../dci/arb_address +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is used to set an arbitrary DCI register address an + application wants to read from or write to. +Users: + +What: /sys/bus/most/devices/.../dci/arb_value +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is used to read and write the DCI register whose address + is stored in arb_address. +Users: + +What: /sys/bus/most/devices/.../dci/mep_eui48_hi +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is used to check and configure the MAC address. +Users: + +What: /sys/bus/most/devices/.../dci/mep_eui48_lo +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is used to check and configure the MAC address. +Users: + +What: /sys/bus/most/devices/.../dci/mep_eui48_mi +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is used to check and configure the MAC address. +Users: + +What: /sys/bus/most/devices/.../dci/mep_filter +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is used to check and configure the MEP filter address. +Users: + +What: /sys/bus/most/devices/.../dci/mep_hash0 +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is used to check and configure the MEP hash table. +Users: + +What: /sys/bus/most/devices/.../dci/mep_hash1 +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is used to check and configure the MEP hash table. +Users: + +What: /sys/bus/most/devices/.../dci/mep_hash2 +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is used to check and configure the MEP hash table. +Users: + +What: /sys/bus/most/devices/.../dci/mep_hash3 +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is used to check and configure the MEP hash table. +Users: + +What: /sys/bus/most/devices/.../dci/ni_state +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + Indicates the current network interface state. +Users: + +What: /sys/bus/most/devices/.../dci/node_address +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + Indicates the current node address. +Users: + +What: /sys/bus/most/devices/.../dci/node_position +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + Indicates the current node position. +Users: + +What: /sys/bus/most/devices/.../dci/packet_bandwidth +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + Indicates the configured packet bandwidth. +Users: + +What: /sys/bus/most/devices/.../dci/sync_ep +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + Triggers the controller's synchronization process for a certain + endpoint. +Users: + +What: /sys/bus/most/devices/...// +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + For every channel of the device a directory is created, whose + name is dictated by the HDM. This enables an application to + collect information about the channel's capabilities and + configure it. +Users: + +What: /sys/bus/most/devices/...//available_datatypes +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + Indicates the data types the current channel can transport. +Users: + +What: /sys/bus/most/devices/...//available_directions +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + Indicates the directions the current channel is capable of. +Users: + +What: /sys/bus/most/devices/...//number_of_packet_buffers +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + Indicates the number of packet buffers the current channel can + handle. +Users: + +What: /sys/bus/most/devices/...//number_of_stream_buffers +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + Indicates the number of streaming buffers the current channel can + handle. +Users: + +What: /sys/bus/most/devices/...//size_of_packet_buffer +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + Indicates the size of a packet buffer the current channel can + handle. +Users: + +What: /sys/bus/most/devices/...//size_of_stream_buffer +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + Indicates the size of a streaming buffer the current channel can + handle. +Users: + +What: /sys/bus/most/devices/...//set_number_of_buffers +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is to configure the number of buffers of the current channel. +Users: + +What: /sys/bus/most/devices/...//set_buffer_size +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is to configure the size of a buffer of the current channel. +Users: + +What: /sys/bus/most/devices/...//set_direction +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is to configure the direction of the current channel. + The following strings will be accepted: + 'dir_tx', + 'dir_rx' +Users: + +What: /sys/bus/most/devices/...//set_datatype +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is to configure the data type of the current channel. + The following strings will be accepted: + 'control', + 'async', + 'sync', + 'isoc_avp' +Users: + +What: /sys/bus/most/devices/...//set_subbuffer_size +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is to configure the subbuffer size of the current channel. +Users: + +What: /sys/bus/most/devices/...//set_packets_per_xact +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is to configure the number of packets per transaction of + the current channel. This is only needed network interface + controller is attached via USB. +Users: + +What: /sys/bus/most/devices/...//channel_starving +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + Indicates whether current channel ran out of buffers. +Users: + +What: /sys/bus/most/drivers/most_core/components +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is used to retrieve a list of registered components. +Users: + +What: /sys/bus/most/drivers/most_core/links +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm +Description: + This is used to retrieve a list of established links. +Users: diff --git a/Documentation/ABI/testing/sysfs-class-typec b/Documentation/ABI/testing/sysfs-class-typec index d7647b258c3c..b834671522d6 100644 --- a/Documentation/ABI/testing/sysfs-class-typec +++ b/Documentation/ABI/testing/sysfs-class-typec @@ -20,13 +20,13 @@ Date: April 2017 Contact: Heikki Krogerus Description: The supported power roles. This attribute can be used to request - power role swap on the port when the port supports USB Power - Delivery. Swapping is supported as synchronous operation, so - write(2) to the attribute will not return until the operation - has finished. The attribute is notified about role changes so - that poll(2) on the attribute wakes up. Change on the role will - also generate uevent KOBJ_CHANGE. The current role is show in - brackets, for example "[source] sink" when in source mode. + power role swap on the port. Swapping is supported as + synchronous operation, so write(2) to the attribute will not + return until the operation has finished. The attribute is + notified about role changes so that poll(2) on the attribute + wakes up. Change on the role will also generate uevent + KOBJ_CHANGE. The current role is show in brackets, for example + "[source] sink" when in source mode. Valid values: source, sink @@ -108,6 +108,15 @@ Contact: Heikki Krogerus Description: Revision number of the supported USB Type-C specification. +What: /sys/class/typec//orientation +Date: February 2020 +Contact: Badhri Jagan Sridharan +Description: + Indicates the active orientation of the Type-C connector. + Valid values: + - "normal": CC1 orientation + - "reverse": CC2 orientation + - "unknown": Orientation cannot be determined. USB Type-C partner devices (eg. /sys/class/typec/port0-partner/) diff --git a/Documentation/ABI/testing/sysfs-driver-jz4780-efuse b/Documentation/ABI/testing/sysfs-driver-jz4780-efuse new file mode 100644 index 000000000000..bb6f5d6ceea0 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-driver-jz4780-efuse @@ -0,0 +1,16 @@ +What: /sys/devices/*//nvmem +Date: December 2017 +Contact: PrasannaKumar Muralidharan +Description: read-only access to the efuse on the Ingenic JZ4780 SoC + The SoC has a one time programmable 8K efuse that is + split into segments. The driver supports read only. + The segments are + 0x000 64 bit Random Number + 0x008 128 bit Ingenic Chip ID + 0x018 128 bit Customer ID + 0x028 3520 bit Reserved + 0x1E0 8 bit Protect Segment + 0x1E1 2296 bit HDMI Key + 0x300 2048 bit Security boot key +Users: any user space application which wants to read the Chip + and Customer ID diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce new file mode 100644 index 000000000000..08f2591138af --- /dev/null +++ b/Documentation/ABI/testing/sysfs-driver-uacce @@ -0,0 +1,39 @@ +What: /sys/class/uacce//api +Date: Feb 2020 +KernelVersion: 5.7 +Contact: linux-accelerators@lists.ozlabs.org +Description: Api of the device + Can be any string and up to userspace to parse. + Application use the api to match the correct driver + +What: /sys/class/uacce//flags +Date: Feb 2020 +KernelVersion: 5.7 +Contact: linux-accelerators@lists.ozlabs.org +Description: Attributes of the device, see UACCE_DEV_xxx flag defined in uacce.h + +What: /sys/class/uacce//available_instances +Date: Feb 2020 +KernelVersion: 5.7 +Contact: linux-accelerators@lists.ozlabs.org +Description: Available instances left of the device + Return -ENODEV if uacce_ops get_available_instances is not provided + +What: /sys/class/uacce//algorithms +Date: Feb 2020 +KernelVersion: 5.7 +Contact: linux-accelerators@lists.ozlabs.org +Description: Algorithms supported by this accelerator, separated by new line. + Can be any string and up to userspace to parse. + +What: /sys/class/uacce//region_mmio_size +Date: Feb 2020 +KernelVersion: 5.7 +Contact: linux-accelerators@lists.ozlabs.org +Description: Size (bytes) of mmio region queue file + +What: /sys/class/uacce//region_dus_size +Date: Feb 2020 +KernelVersion: 5.7 +Contact: linux-accelerators@lists.ozlabs.org +Description: Size (bytes) of dus region queue file diff --git a/Documentation/ABI/testing/sysfs-firmware-opal-sensor-groups b/Documentation/ABI/testing/sysfs-firmware-opal-sensor-groups new file mode 100644 index 000000000000..3a2dfe542e8c --- /dev/null +++ b/Documentation/ABI/testing/sysfs-firmware-opal-sensor-groups @@ -0,0 +1,21 @@ +What: /sys/firmware/opal/sensor_groups +Date: August 2017 +Contact: Linux for PowerPC mailing list +Description: Sensor groups directory for POWER9 powernv servers + + Each folder in this directory contains a sensor group + which are classified based on type of the sensor + like power, temperature, frequency, current, etc. They + can also indicate the group of sensors belonging to + different owners like CSM, Profiler, Job-Scheduler + +What: /sys/firmware/opal/sensor_groups//clear +Date: August 2017 +Contact: Linux for PowerPC mailing list +Description: Sysfs file to clear the min-max of all the sensors + belonging to the group. + + Writing 1 to this file will clear the minimum and + maximum values of all the sensors in the group. + In POWER9, the min-max of a sensor is the historical minimum + and maximum value of the sensor cached by OCC. diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs index 1a6cd5397129..bd8a0d19abe6 100644 --- a/Documentation/ABI/testing/sysfs-fs-f2fs +++ b/Documentation/ABI/testing/sysfs-fs-f2fs @@ -318,3 +318,8 @@ Date: September 2019 Contact: "Hridya Valsaraju" Description: Average number of valid blocks. Available when CONFIG_F2FS_STAT_FS=y. + +What: /sys/fs/f2fs//mounted_time_sec +Date: February 2020 +Contact: "Jaegeuk Kim" +Description: Show the mounted time in secs of this partition. diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump b/Documentation/ABI/testing/sysfs-kernel-fadump new file mode 100644 index 000000000000..8f7a64a81783 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-fadump @@ -0,0 +1,40 @@ +What: /sys/kernel/fadump/* +Date: Dec 2019 +Contact: linuxppc-dev@lists.ozlabs.org +Description: + The /sys/kernel/fadump/* is a collection of FADump sysfs + file provide information about the configuration status + of Firmware Assisted Dump (FADump). + +What: /sys/kernel/fadump/enabled +Date: Dec 2019 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Primarily used to identify whether the FADump is enabled in + the kernel or not. +User: Kdump service + +What: /sys/kernel/fadump/registered +Date: Dec 2019 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read/write + Helps to control the dump collect feature from userspace. + Setting 1 to this file enables the system to collect the + dump and 0 to disable it. +User: Kdump service + +What: /sys/kernel/fadump/release_mem +Date: Dec 2019 +Contact: linuxppc-dev@lists.ozlabs.org +Description: write only + This is a special sysfs file and only available when + the system is booted to capture the vmcore using FADump. + It is used to release the memory reserved by FADump to + save the crash dump. + +What: /sys/kernel/fadump/mem_reserved +Date: Dec 2019 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read only + Provide information about the amount of memory reserved by + FADump to save the crash dump in bytes. diff --git a/Documentation/ABI/testing/sysfs-kernel-uids b/Documentation/ABI/testing/sysfs-kernel-uids deleted file mode 100644 index 4182b7061816..000000000000 --- a/Documentation/ABI/testing/sysfs-kernel-uids +++ /dev/null @@ -1,14 +0,0 @@ -What: /sys/kernel/uids//cpu_shares -Date: December 2007 -Contact: Dhaval Giani - Srivatsa Vaddagiri -Description: - The /sys/kernel/uids//cpu_shares tunable is used - to set the cpu bandwidth a user is allowed. This is a - propotional value. What that means is that if there - are two users logged in, each with an equal number of - shares, then they will get equal CPU bandwidth. Another - example would be, if User A has shares = 1024 and user - B has shares = 2048, User B will get twice the CPU - bandwidth user A will. For more details refer - Documentation/scheduler/sched-design-CFS.rst diff --git a/Documentation/ABI/testing/sysfs-platform-dell-laptop b/Documentation/ABI/testing/sysfs-platform-dell-laptop index 8c6a0b8e1131..9b917c7453de 100644 --- a/Documentation/ABI/testing/sysfs-platform-dell-laptop +++ b/Documentation/ABI/testing/sysfs-platform-dell-laptop @@ -2,7 +2,7 @@ What: /sys/class/leds/dell::kbd_backlight/als_enabled Date: December 2014 KernelVersion: 3.19 Contact: Gabriele Mazzotta , - Pali Rohár + Pali Rohár Description: This file allows to control the automatic keyboard illumination mode on some systems that have an ambient @@ -13,7 +13,7 @@ What: /sys/class/leds/dell::kbd_backlight/als_setting Date: December 2014 KernelVersion: 3.19 Contact: Gabriele Mazzotta , - Pali Rohár + Pali Rohár Description: This file allows to specifiy the on/off threshold value, as reported by the ambient light sensor. @@ -22,7 +22,7 @@ What: /sys/class/leds/dell::kbd_backlight/start_triggers Date: December 2014 KernelVersion: 3.19 Contact: Gabriele Mazzotta , - Pali Rohár + Pali Rohár Description: This file allows to control the input triggers that turn on the keyboard backlight illumination that is @@ -45,7 +45,7 @@ What: /sys/class/leds/dell::kbd_backlight/stop_timeout Date: December 2014 KernelVersion: 3.19 Contact: Gabriele Mazzotta , - Pali Rohár + Pali Rohár Description: This file allows to specify the interval after which the keyboard illumination is disabled because of inactivity. diff --git a/Documentation/ABI/testing/sysfs-tty b/Documentation/ABI/testing/sysfs-tty index 9eb3c2b6b040..e157130a6792 100644 --- a/Documentation/ABI/testing/sysfs-tty +++ b/Documentation/ABI/testing/sysfs-tty @@ -154,3 +154,10 @@ Description: device specification. For example, when user sets 7bytes on 16550A, which has 1/4/8/14 bytes trigger, the RX trigger is automatically changed to 4 bytes. + +What: /sys/class/tty/ttyS0/console +Date: February 2020 +Contact: Andy Shevchenko +Description: + Allows user to detach or attach back the given device as + kernel console. It shows and accepts a boolean variable. diff --git a/Documentation/EDID/1024x768.S b/Documentation/EDID/1024x768.S deleted file mode 100644 index 4aed3f9ab88a..000000000000 --- a/Documentation/EDID/1024x768.S +++ /dev/null @@ -1,43 +0,0 @@ -/* - 1024x768.S: EDID data set for standard 1024x768 60 Hz monitor - - Copyright (C) 2011 Carsten Emde - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. -*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 65000 /* kHz */ -#define XPIX 1024 -#define YPIX 768 -#define XY_RATIO XY_RATIO_4_3 -#define XBLANK 320 -#define YBLANK 38 -#define XOFFSET 8 -#define XPULSE 144 -#define YOFFSET 3 -#define YPULSE 6 -#define DPI 72 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux XGA" -#define ESTABLISHED_TIMING2_BITS 0x08 /* Bit 3 -> 1024x768 @60 Hz */ -#define HSYNC_POL 0 -#define VSYNC_POL 0 - -#include "edid.S" diff --git a/Documentation/EDID/1280x1024.S b/Documentation/EDID/1280x1024.S deleted file mode 100644 index b26dd424cad7..000000000000 --- a/Documentation/EDID/1280x1024.S +++ /dev/null @@ -1,43 +0,0 @@ -/* - 1280x1024.S: EDID data set for standard 1280x1024 60 Hz monitor - - Copyright (C) 2011 Carsten Emde - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. -*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 108000 /* kHz */ -#define XPIX 1280 -#define YPIX 1024 -#define XY_RATIO XY_RATIO_5_4 -#define XBLANK 408 -#define YBLANK 42 -#define XOFFSET 48 -#define XPULSE 112 -#define YOFFSET 1 -#define YPULSE 3 -#define DPI 72 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux SXGA" -/* No ESTABLISHED_TIMINGx_BITS */ -#define HSYNC_POL 1 -#define VSYNC_POL 1 - -#include "edid.S" diff --git a/Documentation/EDID/1600x1200.S b/Documentation/EDID/1600x1200.S deleted file mode 100644 index 0d091b282768..000000000000 --- a/Documentation/EDID/1600x1200.S +++ /dev/null @@ -1,43 +0,0 @@ -/* - 1600x1200.S: EDID data set for standard 1600x1200 60 Hz monitor - - Copyright (C) 2013 Carsten Emde - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. -*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 162000 /* kHz */ -#define XPIX 1600 -#define YPIX 1200 -#define XY_RATIO XY_RATIO_4_3 -#define XBLANK 560 -#define YBLANK 50 -#define XOFFSET 64 -#define XPULSE 192 -#define YOFFSET 1 -#define YPULSE 3 -#define DPI 72 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux UXGA" -/* No ESTABLISHED_TIMINGx_BITS */ -#define HSYNC_POL 1 -#define VSYNC_POL 1 - -#include "edid.S" diff --git a/Documentation/EDID/1680x1050.S b/Documentation/EDID/1680x1050.S deleted file mode 100644 index 7dfed9a33eab..000000000000 --- a/Documentation/EDID/1680x1050.S +++ /dev/null @@ -1,43 +0,0 @@ -/* - 1680x1050.S: EDID data set for standard 1680x1050 60 Hz monitor - - Copyright (C) 2012 Carsten Emde - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. -*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 146250 /* kHz */ -#define XPIX 1680 -#define YPIX 1050 -#define XY_RATIO XY_RATIO_16_10 -#define XBLANK 560 -#define YBLANK 39 -#define XOFFSET 104 -#define XPULSE 176 -#define YOFFSET 3 -#define YPULSE 6 -#define DPI 96 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux WSXGA" -/* No ESTABLISHED_TIMINGx_BITS */ -#define HSYNC_POL 1 -#define VSYNC_POL 1 - -#include "edid.S" diff --git a/Documentation/EDID/1920x1080.S b/Documentation/EDID/1920x1080.S deleted file mode 100644 index d6ffbba28e95..000000000000 --- a/Documentation/EDID/1920x1080.S +++ /dev/null @@ -1,43 +0,0 @@ -/* - 1920x1080.S: EDID data set for standard 1920x1080 60 Hz monitor - - Copyright (C) 2012 Carsten Emde - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. -*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 148500 /* kHz */ -#define XPIX 1920 -#define YPIX 1080 -#define XY_RATIO XY_RATIO_16_9 -#define XBLANK 280 -#define YBLANK 45 -#define XOFFSET 88 -#define XPULSE 44 -#define YOFFSET 4 -#define YPULSE 5 -#define DPI 96 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux FHD" -/* No ESTABLISHED_TIMINGx_BITS */ -#define HSYNC_POL 1 -#define VSYNC_POL 1 - -#include "edid.S" diff --git a/Documentation/EDID/800x600.S b/Documentation/EDID/800x600.S deleted file mode 100644 index a5616588de08..000000000000 --- a/Documentation/EDID/800x600.S +++ /dev/null @@ -1,40 +0,0 @@ -/* - 800x600.S: EDID data set for standard 800x600 60 Hz monitor - - Copyright (C) 2011 Carsten Emde - Copyright (C) 2014 Linaro Limited - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. -*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 40000 /* kHz */ -#define XPIX 800 -#define YPIX 600 -#define XY_RATIO XY_RATIO_4_3 -#define XBLANK 256 -#define YBLANK 28 -#define XOFFSET 40 -#define XPULSE 128 -#define YOFFSET 1 -#define YPULSE 4 -#define DPI 72 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux SVGA" -#define ESTABLISHED_TIMING1_BITS 0x01 /* Bit 0: 800x600 @ 60Hz */ -#define HSYNC_POL 1 -#define VSYNC_POL 1 - -#include "edid.S" diff --git a/Documentation/EDID/Makefile b/Documentation/EDID/Makefile deleted file mode 100644 index 85a927dfab02..000000000000 --- a/Documentation/EDID/Makefile +++ /dev/null @@ -1,37 +0,0 @@ - -SOURCES := $(wildcard [0-9]*x[0-9]*.S) - -BIN := $(patsubst %.S, %.bin, $(SOURCES)) - -IHEX := $(patsubst %.S, %.bin.ihex, $(SOURCES)) - -CODE := $(patsubst %.S, %.c, $(SOURCES)) - -all: $(BIN) $(IHEX) $(CODE) - -clean: - @rm -f *.o *.bin.ihex *.bin *.c - -%.o: %.S - @cc -c $^ - -%.bin.nocrc: %.o - @objcopy -Obinary $^ $@ - -%.crc: %.bin.nocrc - @list=$$(for i in `seq 1 127`; do head -c$$i $^ | tail -c1 \ - | hexdump -v -e '/1 "%02X+"'; done); \ - echo "ibase=16;100-($${list%?})%100" | bc >$@ - -%.p: %.crc %.S - @cc -c -DCRC="$$(cat $*.crc)" -o $@ $*.S - -%.bin: %.p - @objcopy -Obinary $^ $@ - -%.bin.ihex: %.p - @objcopy -Oihex $^ $@ - @dos2unix $@ 2>/dev/null - -%.c: %.bin - @echo "{" >$@; hexdump -f hex $^ >>$@; echo "};" >>$@ diff --git a/Documentation/EDID/edid.S b/Documentation/EDID/edid.S deleted file mode 100644 index c3d13815526d..000000000000 --- a/Documentation/EDID/edid.S +++ /dev/null @@ -1,274 +0,0 @@ -/* - edid.S: EDID data template - - Copyright (C) 2012 Carsten Emde - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. -*/ - - -/* Manufacturer */ -#define MFG_LNX1 'L' -#define MFG_LNX2 'N' -#define MFG_LNX3 'X' -#define SERIAL 0 -#define YEAR 2012 -#define WEEK 5 - -/* EDID 1.3 standard definitions */ -#define XY_RATIO_16_10 0b00 -#define XY_RATIO_4_3 0b01 -#define XY_RATIO_5_4 0b10 -#define XY_RATIO_16_9 0b11 - -/* Provide defaults for the timing bits */ -#ifndef ESTABLISHED_TIMING1_BITS -#define ESTABLISHED_TIMING1_BITS 0x00 -#endif -#ifndef ESTABLISHED_TIMING2_BITS -#define ESTABLISHED_TIMING2_BITS 0x00 -#endif -#ifndef ESTABLISHED_TIMING3_BITS -#define ESTABLISHED_TIMING3_BITS 0x00 -#endif - -#define mfgname2id(v1,v2,v3) \ - ((((v1-'@')&0x1f)<<10)+(((v2-'@')&0x1f)<<5)+((v3-'@')&0x1f)) -#define swap16(v1) ((v1>>8)+((v1&0xff)<<8)) -#define lsbs2(v1,v2) (((v1&0x0f)<<4)+(v2&0x0f)) -#define msbs2(v1,v2) ((((v1>>8)&0x0f)<<4)+((v2>>8)&0x0f)) -#define msbs4(v1,v2,v3,v4) \ - ((((v1>>8)&0x03)<<6)+(((v2>>8)&0x03)<<4)+\ - (((v3>>4)&0x03)<<2)+((v4>>4)&0x03)) -#define pixdpi2mm(pix,dpi) ((pix*25)/dpi) -#define xsize pixdpi2mm(XPIX,DPI) -#define ysize pixdpi2mm(YPIX,DPI) - - .data - -/* Fixed header pattern */ -header: .byte 0x00,0xff,0xff,0xff,0xff,0xff,0xff,0x00 - -mfg_id: .hword swap16(mfgname2id(MFG_LNX1, MFG_LNX2, MFG_LNX3)) - -prod_code: .hword 0 - -/* Serial number. 32 bits, little endian. */ -serial_number: .long SERIAL - -/* Week of manufacture */ -week: .byte WEEK - -/* Year of manufacture, less 1990. (1990-2245) - If week=255, it is the model year instead */ -year: .byte YEAR-1990 - -version: .byte VERSION /* EDID version, usually 1 (for 1.3) */ -revision: .byte REVISION /* EDID revision, usually 3 (for 1.3) */ - -/* If Bit 7=1 Digital input. If set, the following bit definitions apply: - Bits 6-1 Reserved, must be 0 - Bit 0 Signal is compatible with VESA DFP 1.x TMDS CRGB, - 1 pixel per clock, up to 8 bits per color, MSB aligned, - If Bit 7=0 Analog input. If clear, the following bit definitions apply: - Bits 6-5 Video white and sync levels, relative to blank - 00=+0.7/-0.3 V; 01=+0.714/-0.286 V; - 10=+1.0/-0.4 V; 11=+0.7/0 V - Bit 4 Blank-to-black setup (pedestal) expected - Bit 3 Separate sync supported - Bit 2 Composite sync (on HSync) supported - Bit 1 Sync on green supported - Bit 0 VSync pulse must be serrated when somposite or - sync-on-green is used. */ -video_parms: .byte 0x6d - -/* Maximum horizontal image size, in centimetres - (max 292 cm/115 in at 16:9 aspect ratio) */ -max_hor_size: .byte xsize/10 - -/* Maximum vertical image size, in centimetres. - If either byte is 0, undefined (e.g. projector) */ -max_vert_size: .byte ysize/10 - -/* Display gamma, minus 1, times 100 (range 1.00-3.5 */ -gamma: .byte 120 - -/* Bit 7 DPMS standby supported - Bit 6 DPMS suspend supported - Bit 5 DPMS active-off supported - Bits 4-3 Display type: 00=monochrome; 01=RGB colour; - 10=non-RGB multicolour; 11=undefined - Bit 2 Standard sRGB colour space. Bytes 25-34 must contain - sRGB standard values. - Bit 1 Preferred timing mode specified in descriptor block 1. - Bit 0 GTF supported with default parameter values. */ -dsp_features: .byte 0xea - -/* Chromaticity coordinates. */ -/* Red and green least-significant bits - Bits 7-6 Red x value least-significant 2 bits - Bits 5-4 Red y value least-significant 2 bits - Bits 3-2 Green x value lst-significant 2 bits - Bits 1-0 Green y value least-significant 2 bits */ -red_green_lsb: .byte 0x5e - -/* Blue and white least-significant 2 bits */ -blue_white_lsb: .byte 0xc0 - -/* Red x value most significant 8 bits. - 0-255 encodes 0-0.996 (255/256); 0-0.999 (1023/1024) with lsbits */ -red_x_msb: .byte 0xa4 - -/* Red y value most significant 8 bits */ -red_y_msb: .byte 0x59 - -/* Green x and y value most significant 8 bits */ -green_x_y_msb: .byte 0x4a,0x98 - -/* Blue x and y value most significant 8 bits */ -blue_x_y_msb: .byte 0x25,0x20 - -/* Default white point x and y value most significant 8 bits */ -white_x_y_msb: .byte 0x50,0x54 - -/* Established timings */ -/* Bit 7 720x400 @ 70 Hz - Bit 6 720x400 @ 88 Hz - Bit 5 640x480 @ 60 Hz - Bit 4 640x480 @ 67 Hz - Bit 3 640x480 @ 72 Hz - Bit 2 640x480 @ 75 Hz - Bit 1 800x600 @ 56 Hz - Bit 0 800x600 @ 60 Hz */ -estbl_timing1: .byte ESTABLISHED_TIMING1_BITS - -/* Bit 7 800x600 @ 72 Hz - Bit 6 800x600 @ 75 Hz - Bit 5 832x624 @ 75 Hz - Bit 4 1024x768 @ 87 Hz, interlaced (1024x768) - Bit 3 1024x768 @ 60 Hz - Bit 2 1024x768 @ 72 Hz - Bit 1 1024x768 @ 75 Hz - Bit 0 1280x1024 @ 75 Hz */ -estbl_timing2: .byte ESTABLISHED_TIMING2_BITS - -/* Bit 7 1152x870 @ 75 Hz (Apple Macintosh II) - Bits 6-0 Other manufacturer-specific display mod */ -estbl_timing3: .byte ESTABLISHED_TIMING3_BITS - -/* Standard timing */ -/* X resolution, less 31, divided by 8 (256-2288 pixels) */ -std_xres: .byte (XPIX/8)-31 -/* Y resolution, X:Y pixel ratio - Bits 7-6 X:Y pixel ratio: 00=16:10; 01=4:3; 10=5:4; 11=16:9. - Bits 5-0 Vertical frequency, less 60 (60-123 Hz) */ -std_vres: .byte (XY_RATIO<<6)+VFREQ-60 - .fill 7,2,0x0101 /* Unused */ - -descriptor1: -/* Pixel clock in 10 kHz units. (0.-655.35 MHz, little-endian) */ -clock: .hword CLOCK/10 - -/* Horizontal active pixels 8 lsbits (0-4095) */ -x_act_lsb: .byte XPIX&0xff -/* Horizontal blanking pixels 8 lsbits (0-4095) - End of active to start of next active. */ -x_blk_lsb: .byte XBLANK&0xff -/* Bits 7-4 Horizontal active pixels 4 msbits - Bits 3-0 Horizontal blanking pixels 4 msbits */ -x_msbs: .byte msbs2(XPIX,XBLANK) - -/* Vertical active lines 8 lsbits (0-4095) */ -y_act_lsb: .byte YPIX&0xff -/* Vertical blanking lines 8 lsbits (0-4095) */ -y_blk_lsb: .byte YBLANK&0xff -/* Bits 7-4 Vertical active lines 4 msbits - Bits 3-0 Vertical blanking lines 4 msbits */ -y_msbs: .byte msbs2(YPIX,YBLANK) - -/* Horizontal sync offset pixels 8 lsbits (0-1023) From blanking start */ -x_snc_off_lsb: .byte XOFFSET&0xff -/* Horizontal sync pulse width pixels 8 lsbits (0-1023) */ -x_snc_pls_lsb: .byte XPULSE&0xff -/* Bits 7-4 Vertical sync offset lines 4 lsbits (0-63) - Bits 3-0 Vertical sync pulse width lines 4 lsbits (0-63) */ -y_snc_lsb: .byte lsbs2(YOFFSET, YPULSE) -/* Bits 7-6 Horizontal sync offset pixels 2 msbits - Bits 5-4 Horizontal sync pulse width pixels 2 msbits - Bits 3-2 Vertical sync offset lines 2 msbits - Bits 1-0 Vertical sync pulse width lines 2 msbits */ -xy_snc_msbs: .byte msbs4(XOFFSET,XPULSE,YOFFSET,YPULSE) - -/* Horizontal display size, mm, 8 lsbits (0-4095 mm, 161 in) */ -x_dsp_size: .byte xsize&0xff - -/* Vertical display size, mm, 8 lsbits (0-4095 mm, 161 in) */ -y_dsp_size: .byte ysize&0xff - -/* Bits 7-4 Horizontal display size, mm, 4 msbits - Bits 3-0 Vertical display size, mm, 4 msbits */ -dsp_size_mbsb: .byte msbs2(xsize,ysize) - -/* Horizontal border pixels (each side; total is twice this) */ -x_border: .byte 0 -/* Vertical border lines (each side; total is twice this) */ -y_border: .byte 0 - -/* Bit 7 Interlaced - Bits 6-5 Stereo mode: 00=No stereo; other values depend on bit 0: - Bit 0=0: 01=Field sequential, sync=1 during right; 10=similar, - sync=1 during left; 11=4-way interleaved stereo - Bit 0=1 2-way interleaved stereo: 01=Right image on even lines; - 10=Left image on even lines; 11=side-by-side - Bits 4-3 Sync type: 00=Analog composite; 01=Bipolar analog composite; - 10=Digital composite (on HSync); 11=Digital separate - Bit 2 If digital separate: Vertical sync polarity (1=positive) - Other types: VSync serrated (HSync during VSync) - Bit 1 If analog sync: Sync on all 3 RGB lines (else green only) - Digital: HSync polarity (1=positive) - Bit 0 2-way line-interleaved stereo, if bits 4-3 are not 00. */ -features: .byte 0x18+(VSYNC_POL<<2)+(HSYNC_POL<<1) - -descriptor2: .byte 0,0 /* Not a detailed timing descriptor */ - .byte 0 /* Must be zero */ - .byte 0xff /* Descriptor is monitor serial number (text) */ - .byte 0 /* Must be zero */ -start1: .ascii "Linux #0" -end1: .byte 0x0a /* End marker */ - .fill 12-(end1-start1), 1, 0x20 /* Padded spaces */ -descriptor3: .byte 0,0 /* Not a detailed timing descriptor */ - .byte 0 /* Must be zero */ - .byte 0xfd /* Descriptor is monitor range limits */ - .byte 0 /* Must be zero */ -start2: .byte VFREQ-1 /* Minimum vertical field rate (1-255 Hz) */ - .byte VFREQ+1 /* Maximum vertical field rate (1-255 Hz) */ - .byte (CLOCK/(XPIX+XBLANK))-1 /* Minimum horizontal line rate - (1-255 kHz) */ - .byte (CLOCK/(XPIX+XBLANK))+1 /* Maximum horizontal line rate - (1-255 kHz) */ - .byte (CLOCK/10000)+1 /* Maximum pixel clock rate, rounded up - to 10 MHz multiple (10-2550 MHz) */ - .byte 0 /* No extended timing information type */ -end2: .byte 0x0a /* End marker */ - .fill 12-(end2-start2), 1, 0x20 /* Padded spaces */ -descriptor4: .byte 0,0 /* Not a detailed timing descriptor */ - .byte 0 /* Must be zero */ - .byte 0xfc /* Descriptor is text */ - .byte 0 /* Must be zero */ -start3: .ascii TIMING_NAME -end3: .byte 0x0a /* End marker */ - .fill 12-(end3-start3), 1, 0x20 /* Padded spaces */ -extensions: .byte 0 /* Number of extensions to follow */ -checksum: .byte CRC /* Sum of all bytes must be 0 */ diff --git a/Documentation/EDID/hex b/Documentation/EDID/hex deleted file mode 100644 index 8873ebb618af..000000000000 --- a/Documentation/EDID/hex +++ /dev/null @@ -1 +0,0 @@ -"\t" 8/1 "0x%02x, " "\n" diff --git a/Documentation/Makefile b/Documentation/Makefile index d77bb607aea4..cc786d11a028 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -2,7 +2,8 @@ # Makefile for Sphinx documentation # -subdir-y := devicetree/bindings/ +# for cleaning +subdir- := devicetree/bindings # Check for broken documentation file references ifeq ($(CONFIG_WARN_MISSING_DOCUMENTS),y) @@ -13,7 +14,7 @@ endif SPHINXBUILD = sphinx-build SPHINXOPTS = SPHINXDIRS = . -_SPHINXDIRS = $(patsubst $(srctree)/Documentation/%/index.rst,%,$(wildcard $(srctree)/Documentation/*/index.rst)) +_SPHINXDIRS = $(sort $(patsubst $(srctree)/Documentation/%/index.rst,%,$(wildcard $(srctree)/Documentation/*/index.rst))) SPHINX_CONF = conf.py PAPER = BUILDDIR = $(obj)/output diff --git a/Documentation/PCI/boot-interrupts.rst b/Documentation/PCI/boot-interrupts.rst new file mode 100644 index 000000000000..d078ef3eb192 --- /dev/null +++ b/Documentation/PCI/boot-interrupts.rst @@ -0,0 +1,155 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +Boot Interrupts +=============== + +:Author: - Sean V Kelley + +Overview +======== + +On PCI Express, interrupts are represented with either MSI or inbound +interrupt messages (Assert_INTx/Deassert_INTx). The integrated IO-APIC in a +given Core IO converts the legacy interrupt messages from PCI Express to +MSI interrupts. If the IO-APIC is disabled (via the mask bits in the +IO-APIC table entries), the messages are routed to the legacy PCH. This +in-band interrupt mechanism was traditionally necessary for systems that +did not support the IO-APIC and for boot. Intel in the past has used the +term "boot interrupts" to describe this mechanism. Further, the PCI Express +protocol describes this in-band legacy wire-interrupt INTx mechanism for +I/O devices to signal PCI-style level interrupts. The subsequent paragraphs +describe problems with the Core IO handling of INTx message routing to the +PCH and mitigation within BIOS and the OS. + + +Issue +===== + +When in-band legacy INTx messages are forwarded to the PCH, they in turn +trigger a new interrupt for which the OS likely lacks a handler. When an +interrupt goes unhandled over time, they are tracked by the Linux kernel as +Spurious Interrupts. The IRQ will be disabled by the Linux kernel after it +reaches a specific count with the error "nobody cared". This disabled IRQ +now prevents valid usage by an existing interrupt which may happen to share +the IRQ line. + + irq 19: nobody cared (try booting with the "irqpoll" option) + CPU: 0 PID: 2988 Comm: irq/34-nipalk Tainted: 4.14.87-rt49-02410-g4a640ec-dirty #1 + Hardware name: National Instruments NI PXIe-8880/NI PXIe-8880, BIOS 2.1.5f1 01/09/2020 + Call Trace: + + ? dump_stack+0x46/0x5e + ? __report_bad_irq+0x2e/0xb0 + ? note_interrupt+0x242/0x290 + ? nNIKAL100_memoryRead16+0x8/0x10 [nikal] + ? handle_irq_event_percpu+0x55/0x70 + ? handle_irq_event+0x4f/0x80 + ? handle_fasteoi_irq+0x81/0x180 + ? handle_irq+0x1c/0x30 + ? do_IRQ+0x41/0xd0 + ? common_interrupt+0x84/0x84 + + + handlers: + irq_default_primary_handler threaded usb_hcd_irq + Disabling IRQ #19 + + +Conditions +========== + +The use of threaded interrupts is the most likely condition to trigger +this problem today. Threaded interrupts may not be reenabled after the IRQ +handler wakes. These "one shot" conditions mean that the threaded interrupt +needs to keep the interrupt line masked until the threaded handler has run. +Especially when dealing with high data rate interrupts, the thread needs to +run to completion; otherwise some handlers will end up in stack overflows +since the interrupt of the issuing device is still active. + +Affected Chipsets +================= + +The legacy interrupt forwarding mechanism exists today in a number of +devices including but not limited to chipsets from AMD/ATI, Broadcom, and +Intel. Changes made through the mitigations below have been applied to +drivers/pci/quirks.c + +Starting with ICX there are no longer any IO-APICs in the Core IO's +devices. IO-APIC is only in the PCH. Devices connected to the Core IO's +PCIe Root Ports will use native MSI/MSI-X mechanisms. + +Mitigations +=========== + +The mitigations take the form of PCI quirks. The preference has been to +first identify and make use of a means to disable the routing to the PCH. +In such a case a quirk to disable boot interrupt generation can be +added.[1] + + Intel® 6300ESB I/O Controller Hub + Alternate Base Address Register: + BIE: Boot Interrupt Enable + 0 = Boot interrupt is enabled. + 1 = Boot interrupt is disabled. + + Intel® Sandy Bridge through Sky Lake based Xeon servers: + Coherent Interface Protocol Interrupt Control + dis_intx_route2pch/dis_intx_route2ich/dis_intx_route2dmi2: + When this bit is set. Local INTx messages received from the + Intel® Quick Data DMA/PCI Express ports are not routed to legacy + PCH - they are either converted into MSI via the integrated IO-APIC + (if the IO-APIC mask bit is clear in the appropriate entries) + or cause no further action (when mask bit is set) + +In the absence of a way to directly disable the routing, another approach +has been to make use of PCI Interrupt pin to INTx routing tables for +purposes of redirecting the interrupt handler to the rerouted interrupt +line by default. Therefore, on chipsets where this INTx routing cannot be +disabled, the Linux kernel will reroute the valid interrupt to its legacy +interrupt. This redirection of the handler will prevent the occurrence of +the spurious interrupt detection which would ordinarily disable the IRQ +line due to excessive unhandled counts.[2] + +The config option X86_REROUTE_FOR_BROKEN_BOOT_IRQS exists to enable (or +disable) the redirection of the interrupt handler to the PCH interrupt +line. The option can be overridden by either pci=ioapicreroute or +pci=noioapicreroute.[3] + + +More Documentation +================== + +There is an overview of the legacy interrupt handling in several datasheets +(6300ESB and 6700PXH below). While largely the same, it provides insight +into the evolution of its handling with chipsets. + +Example of disabling of the boot interrupt +------------------------------------------ + +Intel® 6300ESB I/O Controller Hub (Document # 300641-004US) + 5.7.3 Boot Interrupt + https://www.intel.com/content/dam/doc/datasheet/6300esb-io-controller-hub-datasheet.pdf + +Intel® Xeon® Processor E5-1600/2400/2600/4600 v3 Product Families +Datasheet - Volume 2: Registers (Document # 330784-003) + 6.6.41 cipintrc Coherent Interface Protocol Interrupt Control + https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdf + +Example of handler rerouting +---------------------------- + +Intel® 6700PXH 64-bit PCI Hub (Document # 302628) + 2.15.2 PCI Express Legacy INTx Support and Boot Interrupt + https://www.intel.com/content/dam/doc/datasheet/6700pxh-64-bit-pci-hub-datasheet.pdf + + +If you have any legacy PCI interrupt questions that aren't answered, email me. + +Cheers, + Sean V Kelley + sean.v.kelley@linux.intel.com + +[1] https://lore.kernel.org/r/12131949181903-git-send-email-sassmann@suse.de/ +[2] https://lore.kernel.org/r/12131949182094-git-send-email-sassmann@suse.de/ +[3] https://lore.kernel.org/r/487C8EA7.6020205@suse.de/ diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst index 6768305e4c26..8f66feaafd4f 100644 --- a/Documentation/PCI/index.rst +++ b/Documentation/PCI/index.rst @@ -16,3 +16,4 @@ Linux PCI Bus Subsystem pci-error-recovery pcieaer-howto endpoint/index + boot-interrupts diff --git a/Documentation/PCI/pci.rst b/Documentation/PCI/pci.rst index 6864f9a70f5f..8c016d8c9862 100644 --- a/Documentation/PCI/pci.rst +++ b/Documentation/PCI/pci.rst @@ -239,7 +239,7 @@ from the PCI device config space. Use the values in the pci_dev structure as the PCI "bus address" might have been remapped to a "host physical" address by the arch/chip-set specific kernel support. -See Documentation/io-mapping.txt for how to access device registers +See Documentation/driver-api/io-mapping.rst for how to access device registers or device memory. The device driver needs to call pci_request_region() to verify diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst index 18bdefaafd1a..0b36b9ebfa4b 100644 --- a/Documentation/PCI/pcieaer-howto.rst +++ b/Documentation/PCI/pcieaer-howto.rst @@ -156,12 +156,6 @@ default reset_link function, but different upstream ports might have different specifications to reset pci express link, so all upstream ports should provide their own reset_link functions. -In struct pcie_port_service_driver, a new pointer, reset_link, is -added. -:: - - pci_ers_result_t (*reset_link) (struct pci_dev *dev); - Section 3.2.2.2 provides more detailed info on when to call reset_link. @@ -212,15 +206,10 @@ error_detected(dev, pci_channel_io_frozen) to all drivers within a hierarchy in question. Then, performing link reset at upstream is necessary. As different kinds of devices might use different approaches to reset link, AER port service driver is required to provide the -function to reset link. Firstly, kernel looks for if the upstream -component has an aer driver. If it has, kernel uses the reset_link -callback of the aer driver. If the upstream component has no aer driver -and the port is downstream port, we will perform a hot reset as the -default by setting the Secondary Bus Reset bit of the Bridge Control -register associated with the downstream port. As for upstream ports, -they should provide their own aer service drivers with reset_link -function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and -reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes +function to reset link via callback parameter of pcie_do_recovery() +function. If reset_link is not NULL, recovery function will use it +to reset the link. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER +and reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes to mmio_enabled. helper functions @@ -243,9 +232,9 @@ messages to root port when an error is detected. :: - int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);` + int pci_aer_clear_nonfatal_status(struct pci_dev *dev);` -pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable +pci_aer_clear_nonfatal_status clears non-fatal errors in the uncorrectable error status register. Frequent Asked Questions diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst index 1a8b129cfc04..83ae3b79a643 100644 --- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst @@ -4,7 +4,7 @@ A Tour Through TREE_RCU's Grace-Period Memory Ordering August 8, 2017 -This article was contributed by Paul E. McKenney +This article was contributed by Paul E. McKenney Introduction ============ @@ -48,7 +48,7 @@ Tree RCU Grace Period Memory Ordering Building Blocks The workhorse for RCU's grace-period memory ordering is the critical section for the ``rcu_node`` structure's -``->lock``. These critical sections use helper functions for lock +``->lock``. These critical sections use helper functions for lock acquisition, including ``raw_spin_lock_rcu_node()``, ``raw_spin_lock_irq_rcu_node()``, and ``raw_spin_lock_irqsave_rcu_node()``. Their lock-release counterparts are ``raw_spin_unlock_rcu_node()``, @@ -102,9 +102,9 @@ lock-acquisition and lock-release functions:: 23 r3 = READ_ONCE(x); 24 } 25 - 26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0); + 26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0); -The ``WARN_ON()`` is evaluated at “the end of time”, +The ``WARN_ON()`` is evaluated at "the end of time", after all changes have propagated throughout the system. Without the ``smp_mb__after_unlock_lock()`` provided by the acquisition functions, this ``WARN_ON()`` could trigger, for example diff --git a/Documentation/RCU/listRCU.rst b/Documentation/RCU/listRCU.rst index 7956ff33042b..2a643e293fb4 100644 --- a/Documentation/RCU/listRCU.rst +++ b/Documentation/RCU/listRCU.rst @@ -4,12 +4,61 @@ Using RCU to Protect Read-Mostly Linked Lists ============================================= One of the best applications of RCU is to protect read-mostly linked lists -("struct list_head" in list.h). One big advantage of this approach +(``struct list_head`` in list.h). One big advantage of this approach is that all of the required memory barriers are included for you in the list macros. This document describes several applications of RCU, with the best fits first. -Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates + +Example 1: Read-mostly list: Deferred Destruction +------------------------------------------------- + +A widely used usecase for RCU lists in the kernel is lockless iteration over +all processes in the system. ``task_struct::tasks`` represents the list node that +links all the processes. The list can be traversed in parallel to any list +additions or removals. + +The traversal of the list is done using ``for_each_process()`` which is defined +by the 2 macros:: + + #define next_task(p) \ + list_entry_rcu((p)->tasks.next, struct task_struct, tasks) + + #define for_each_process(p) \ + for (p = &init_task ; (p = next_task(p)) != &init_task ; ) + +The code traversing the list of all processes typically looks like:: + + rcu_read_lock(); + for_each_process(p) { + /* Do something with p */ + } + rcu_read_unlock(); + +The simplified code for removing a process from a task list is:: + + void release_task(struct task_struct *p) + { + write_lock(&tasklist_lock); + list_del_rcu(&p->tasks); + write_unlock(&tasklist_lock); + call_rcu(&p->rcu, delayed_put_task_struct); + } + +When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)`` under +``tasklist_lock`` writer lock protection, to remove the task from the list of +all tasks. The ``tasklist_lock`` prevents concurrent list additions/removals +from corrupting the list. Readers using ``for_each_process()`` are not protected +with the ``tasklist_lock``. To prevent readers from noticing changes in the list +pointers, the ``task_struct`` object is freed only after one or more grace +periods elapse (with the help of call_rcu()). This deferring of destruction +ensures that any readers traversing the list will see valid ``p->tasks.next`` +pointers and deletion/freeing can happen in parallel with traversal of the list. +This pattern is also called an **existence lock**, since RCU pins the object in +memory until all existing readers finish. + + +Example 2: Read-Side Action Taken Outside of Lock: No In-Place Updates ---------------------------------------------------------------------- The best applications are cases where, if reader-writer locking were @@ -26,7 +75,7 @@ added or deleted, rather than being modified in place. A straightforward example of this use of RCU may be found in the system-call auditing support. For example, a reader-writer locked -implementation of audit_filter_task() might be as follows:: +implementation of ``audit_filter_task()`` might be as follows:: static enum audit_state audit_filter_task(struct task_struct *tsk) { @@ -34,7 +83,7 @@ implementation of audit_filter_task() might be as follows:: enum audit_state state; read_lock(&auditsc_lock); - /* Note: audit_netlink_sem held by caller. */ + /* Note: audit_filter_mutex held by caller. */ list_for_each_entry(e, &audit_tsklist, list) { if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { read_unlock(&auditsc_lock); @@ -58,7 +107,7 @@ This means that RCU can be easily applied to the read side, as follows:: enum audit_state state; rcu_read_lock(); - /* Note: audit_netlink_sem held by caller. */ + /* Note: audit_filter_mutex held by caller. */ list_for_each_entry_rcu(e, &audit_tsklist, list) { if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { rcu_read_unlock(); @@ -69,18 +118,18 @@ This means that RCU can be easily applied to the read side, as follows:: return AUDIT_BUILD_CONTEXT; } -The read_lock() and read_unlock() calls have become rcu_read_lock() +The ``read_lock()`` and ``read_unlock()`` calls have become rcu_read_lock() and rcu_read_unlock(), respectively, and the list_for_each_entry() has -become list_for_each_entry_rcu(). The _rcu() list-traversal primitives +become list_for_each_entry_rcu(). The **_rcu()** list-traversal primitives insert the read-side memory barriers that are required on DEC Alpha CPUs. -The changes to the update side are also straightforward. A reader-writer -lock might be used as follows for deletion and insertion:: +The changes to the update side are also straightforward. A reader-writer lock +might be used as follows for deletion and insertion:: static inline int audit_del_rule(struct audit_rule *rule, struct list_head *list) { - struct audit_entry *e; + struct audit_entry *e; write_lock(&auditsc_lock); list_for_each_entry(e, list, list) { @@ -113,9 +162,9 @@ Following are the RCU equivalents for these two functions:: static inline int audit_del_rule(struct audit_rule *rule, struct list_head *list) { - struct audit_entry *e; + struct audit_entry *e; - /* Do not use the _rcu iterator here, since this is the only + /* No need to use the _rcu iterator here, since this is the only * deletion routine. */ list_for_each_entry(e, list, list) { if (!audit_compare_rule(rule, &e->rule)) { @@ -139,45 +188,45 @@ Following are the RCU equivalents for these two functions:: return 0; } -Normally, the write_lock() and write_unlock() would be replaced by -a spin_lock() and a spin_unlock(), but in this case, all callers hold -audit_netlink_sem, so no additional locking is required. The auditsc_lock -can therefore be eliminated, since use of RCU eliminates the need for -writers to exclude readers. Normally, the write_lock() calls would -be converted into spin_lock() calls. +Normally, the ``write_lock()`` and ``write_unlock()`` would be replaced by a +spin_lock() and a spin_unlock(). But in this case, all callers hold +``audit_filter_mutex``, so no additional locking is required. The +``auditsc_lock`` can therefore be eliminated, since use of RCU eliminates the +need for writers to exclude readers. The list_del(), list_add(), and list_add_tail() primitives have been replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu(). -The _rcu() list-manipulation primitives add memory barriers that are -needed on weakly ordered CPUs (most of them!). The list_del_rcu() -primitive omits the pointer poisoning debug-assist code that would -otherwise cause concurrent readers to fail spectacularly. +The **_rcu()** list-manipulation primitives add memory barriers that are needed on +weakly ordered CPUs (most of them!). The list_del_rcu() primitive omits the +pointer poisoning debug-assist code that would otherwise cause concurrent +readers to fail spectacularly. -So, when readers can tolerate stale data and when entries are either added -or deleted, without in-place modification, it is very easy to use RCU! +So, when readers can tolerate stale data and when entries are either added or +deleted, without in-place modification, it is very easy to use RCU! -Example 2: Handling In-Place Updates + +Example 3: Handling In-Place Updates ------------------------------------ -The system-call auditing code does not update auditing rules in place. -However, if it did, reader-writer-locked code to do so might look as -follows (presumably, the field_count is only permitted to decrease, -otherwise, the added fields would need to be filled in):: +The system-call auditing code does not update auditing rules in place. However, +if it did, the reader-writer-locked code to do so might look as follows +(assuming only ``field_count`` is updated, otherwise, the added fields would +need to be filled in):: static inline int audit_upd_rule(struct audit_rule *rule, struct list_head *list, __u32 newaction, __u32 newfield_count) { - struct audit_entry *e; - struct audit_newentry *ne; + struct audit_entry *e; + struct audit_entry *ne; write_lock(&auditsc_lock); - /* Note: audit_netlink_sem held by caller. */ + /* Note: audit_filter_mutex held by caller. */ list_for_each_entry(e, list, list) { if (!audit_compare_rule(rule, &e->rule)) { e->rule.action = newaction; - e->rule.file_count = newfield_count; + e->rule.field_count = newfield_count; write_unlock(&auditsc_lock); return 0; } @@ -188,16 +237,16 @@ otherwise, the added fields would need to be filled in):: The RCU version creates a copy, updates the copy, then replaces the old entry with the newly updated entry. This sequence of actions, allowing -concurrent reads while doing a copy to perform an update, is what gives -RCU ("read-copy update") its name. The RCU code is as follows:: +concurrent reads while making a copy to perform an update, is what gives +RCU (*read-copy update*) its name. The RCU code is as follows:: static inline int audit_upd_rule(struct audit_rule *rule, struct list_head *list, __u32 newaction, __u32 newfield_count) { - struct audit_entry *e; - struct audit_newentry *ne; + struct audit_entry *e; + struct audit_entry *ne; list_for_each_entry(e, list, list) { if (!audit_compare_rule(rule, &e->rule)) { @@ -206,7 +255,7 @@ RCU ("read-copy update") its name. The RCU code is as follows:: return -ENOMEM; audit_copy_rule(&ne->rule, &e->rule); ne->rule.action = newaction; - ne->rule.file_count = newfield_count; + ne->rule.field_count = newfield_count; list_replace_rcu(&e->list, &ne->list); call_rcu(&e->rcu, audit_free_rule); return 0; @@ -215,34 +264,45 @@ RCU ("read-copy update") its name. The RCU code is as follows:: return -EFAULT; /* No matching rule */ } -Again, this assumes that the caller holds audit_netlink_sem. Normally, -the reader-writer lock would become a spinlock in this sort of code. +Again, this assumes that the caller holds ``audit_filter_mutex``. Normally, the +writer lock would become a spinlock in this sort of code. -Example 3: Eliminating Stale Data +Another use of this pattern can be found in the openswitch driver's *connection +tracking table* code in ``ct_limit_set()``. The table holds connection tracking +entries and has a limit on the maximum entries. There is one such table +per-zone and hence one *limit* per zone. The zones are mapped to their limits +through a hashtable using an RCU-managed hlist for the hash chains. When a new +limit is set, a new limit object is allocated and ``ct_limit_set()`` is called +to replace the old limit object with the new one using list_replace_rcu(). +The old limit object is then freed after a grace period using kfree_rcu(). + + +Example 4: Eliminating Stale Data --------------------------------- -The auditing examples above tolerate stale data, as do most algorithms +The auditing example above tolerates stale data, as do most algorithms that are tracking external state. Because there is a delay from the time the external state changes before Linux becomes aware of the change, -additional RCU-induced staleness is normally not a problem. +additional RCU-induced staleness is generally not a problem. However, there are many examples where stale data cannot be tolerated. -One example in the Linux kernel is the System V IPC (see the ipc_lock() -function in ipc/util.c). This code checks a "deleted" flag under a -per-entry spinlock, and, if the "deleted" flag is set, pretends that the +One example in the Linux kernel is the System V IPC (see the shm_lock() +function in ipc/shm.c). This code checks a *deleted* flag under a +per-entry spinlock, and, if the *deleted* flag is set, pretends that the entry does not exist. For this to be helpful, the search function must -return holding the per-entry spinlock, as ipc_lock() does in fact do. +return holding the per-entry spinlock, as shm_lock() does in fact do. + +.. _quick_quiz: Quick Quiz: - Why does the search function need to return holding the per-entry lock for - this deleted-flag technique to be helpful? + For the deleted-flag technique to be helpful, why is it necessary + to hold the per-entry lock while returning from the search function? -:ref:`Answer to Quick Quiz ` +:ref:`Answer to Quick Quiz ` -If the system-call audit module were to ever need to reject stale data, -one way to accomplish this would be to add a "deleted" flag and a "lock" -spinlock to the audit_entry structure, and modify audit_filter_task() -as follows:: +If the system-call audit module were to ever need to reject stale data, one way +to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to the +audit_entry structure, and modify ``audit_filter_task()`` as follows:: static enum audit_state audit_filter_task(struct task_struct *tsk) { @@ -267,20 +327,20 @@ as follows:: } Note that this example assumes that entries are only added and deleted. -Additional mechanism is required to deal correctly with the -update-in-place performed by audit_upd_rule(). For one thing, -audit_upd_rule() would need additional memory barriers to ensure -that the list_add_rcu() was really executed before the list_del_rcu(). +Additional mechanism is required to deal correctly with the update-in-place +performed by ``audit_upd_rule()``. For one thing, ``audit_upd_rule()`` would +need additional memory barriers to ensure that the list_add_rcu() was really +executed before the list_del_rcu(). -The audit_del_rule() function would need to set the "deleted" -flag under the spinlock as follows:: +The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the +spinlock as follows:: static inline int audit_del_rule(struct audit_rule *rule, struct list_head *list) { - struct audit_entry *e; + struct audit_entry *e; - /* Do not need to use the _rcu iterator here, since this + /* No need to use the _rcu iterator here, since this * is the only deletion routine. */ list_for_each_entry(e, list, list) { if (!audit_compare_rule(rule, &e->rule)) { @@ -295,6 +355,91 @@ flag under the spinlock as follows:: return -EFAULT; /* No matching rule */ } +This too assumes that the caller holds ``audit_filter_mutex``. + + +Example 5: Skipping Stale Objects +--------------------------------- + +For some usecases, reader performance can be improved by skipping stale objects +during read-side list traversal if the object in concern is pending destruction +after one or more grace periods. One such example can be found in the timerfd +subsystem. When a ``CLOCK_REALTIME`` clock is reprogrammed - for example due to +setting of the system time, then all programmed timerfds that depend on this +clock get triggered and processes waiting on them to expire are woken up in +advance of their scheduled expiry. To facilitate this, all such timers are added +to an RCU-managed ``cancel_list`` when they are setup in +``timerfd_setup_cancel()``:: + + static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags) + { + spin_lock(&ctx->cancel_lock); + if ((ctx->clockid == CLOCK_REALTIME && + (flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) { + if (!ctx->might_cancel) { + ctx->might_cancel = true; + spin_lock(&cancel_lock); + list_add_rcu(&ctx->clist, &cancel_list); + spin_unlock(&cancel_lock); + } + } + spin_unlock(&ctx->cancel_lock); + } + +When a timerfd is freed (fd is closed), then the ``might_cancel`` flag of the +timerfd object is cleared, the object removed from the ``cancel_list`` and +destroyed:: + + int timerfd_release(struct inode *inode, struct file *file) + { + struct timerfd_ctx *ctx = file->private_data; + + spin_lock(&ctx->cancel_lock); + if (ctx->might_cancel) { + ctx->might_cancel = false; + spin_lock(&cancel_lock); + list_del_rcu(&ctx->clist); + spin_unlock(&cancel_lock); + } + spin_unlock(&ctx->cancel_lock); + + hrtimer_cancel(&ctx->t.tmr); + kfree_rcu(ctx, rcu); + return 0; + } + +If the ``CLOCK_REALTIME`` clock is set, for example by a time server, the +hrtimer framework calls ``timerfd_clock_was_set()`` which walks the +``cancel_list`` and wakes up processes waiting on the timerfd. While iterating +the ``cancel_list``, the ``might_cancel`` flag is consulted to skip stale +objects:: + + void timerfd_clock_was_set(void) + { + struct timerfd_ctx *ctx; + unsigned long flags; + + rcu_read_lock(); + list_for_each_entry_rcu(ctx, &cancel_list, clist) { + if (!ctx->might_cancel) + continue; + spin_lock_irqsave(&ctx->wqh.lock, flags); + if (ctx->moffs != ktime_mono_to_real(0)) { + ctx->moffs = KTIME_MAX; + ctx->ticks++; + wake_up_locked_poll(&ctx->wqh, EPOLLIN); + } + spin_unlock_irqrestore(&ctx->wqh.lock, flags); + } + rcu_read_unlock(); + } + +The key point here is, because RCU-traversal of the ``cancel_list`` happens +while objects are being added and removed to the list, sometimes the traversal +can step on an object that has been removed from the list. In this example, it +is seen that it is better to skip such objects using a flag. + + Summary ------- @@ -303,19 +448,21 @@ the most amenable to use of RCU. The simplest case is where entries are either added or deleted from the data structure (or atomically modified in place), but non-atomic in-place modifications can be handled by making a copy, updating the copy, then replacing the original with the copy. -If stale data cannot be tolerated, then a "deleted" flag may be used +If stale data cannot be tolerated, then a *deleted* flag may be used in conjunction with a per-entry spinlock in order to allow the search function to reject newly deleted data. -.. _answer_quick_quiz_list: +.. _quick_quiz_answer: Answer to Quick Quiz: - Why does the search function need to return holding the per-entry - lock for this deleted-flag technique to be helpful? + For the deleted-flag technique to be helpful, why is it necessary + to hold the per-entry lock while returning from the search function? If the search function drops the per-entry lock before returning, then the caller will be processing stale data in any case. If it is really OK to be processing stale data, then you don't need a - "deleted" flag. If processing stale data really is a problem, + *deleted* flag. If processing stale data really is a problem, then you need to hold the per-entry lock across all of the code that uses the value that was returned. + +:ref:`Back to Quick Quiz ` diff --git a/Documentation/RCU/rcu.rst b/Documentation/RCU/rcu.rst index 8dfb437dacc3..0e03c6ef3147 100644 --- a/Documentation/RCU/rcu.rst +++ b/Documentation/RCU/rcu.rst @@ -11,8 +11,8 @@ must be long enough that any readers accessing the item being deleted have since dropped their references. For example, an RCU-protected deletion from a linked list would first remove the item from the list, wait for a grace period to elapse, then free the element. See the -Documentation/RCU/listRCU.rst file for more information on using RCU with -linked lists. +:ref:`Documentation/RCU/listRCU.rst ` for more information on +using RCU with linked lists. Frequently Asked Questions -------------------------- @@ -50,7 +50,7 @@ Frequently Asked Questions - If I am running on a uniprocessor kernel, which can only do one thing at a time, why should I wait for a grace period? - See the Documentation/RCU/UP.rst file for more information. + See :ref:`Documentation/RCU/UP.rst ` for more information. - How can I see where RCU is currently used in the Linux kernel? @@ -68,18 +68,18 @@ Frequently Asked Questions - Why the name "RCU"? - "RCU" stands for "read-copy update". The file Documentation/RCU/listRCU.rst - has more information on where this name came from, search for - "read-copy update" to find it. + "RCU" stands for "read-copy update". + :ref:`Documentation/RCU/listRCU.rst ` has more information on where + this name came from, search for "read-copy update" to find it. - I hear that RCU is patented? What is with that? Yes, it is. There are several known patents related to RCU, - search for the string "Patent" in RTFP.txt to find them. + search for the string "Patent" in Documentation/RCU/RTFP.txt to find them. Of these, one was allowed to lapse by the assignee, and the others have been contributed to the Linux kernel under GPL. There are now also LGPL implementations of user-level RCU - available (http://liburcu.org/). + available (https://liburcu.org/). - I hear that RCU needs work in order to support realtime kernels? @@ -88,5 +88,5 @@ Frequently Asked Questions - Where can I find more information on RCU? - See the RTFP.txt file in this directory. + See the Documentation/RCU/RTFP.txt file. Or point your browser at (http://www.rdrop.com/users/paulmck/RCU/). diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt index a41a0384d20c..af712a3c5b6a 100644 --- a/Documentation/RCU/torture.txt +++ b/Documentation/RCU/torture.txt @@ -124,9 +124,14 @@ using a dynamically allocated srcu_struct (hence "srcud-" rather than debugging. The final "T" entry contains the totals of the counters. -USAGE +USAGE ON SPECIFIC KERNEL BUILDS -The following script may be used to torture RCU: +It is sometimes desirable to torture RCU on a specific kernel build, +for example, when preparing to put that kernel build into production. +In that case, the kernel should be built with CONFIG_RCU_TORTURE_TEST=m +so that the test can be started using modprobe and terminated using rmmod. + +For example, the following script may be used to torture RCU: #!/bin/sh @@ -142,8 +147,136 @@ checked for such errors. The "rmmod" command forces a "SUCCESS", two are self-explanatory, while the last indicates that while there were no RCU failures, CPU-hotplug problems were detected. -However, the tools/testing/selftests/rcutorture/bin/kvm.sh script -provides better automation, including automatic failure analysis. -It assumes a qemu/kvm-enabled platform, and runs guest OSes out of initrd. -See tools/testing/selftests/rcutorture/doc/initrd.txt for instructions -on setting up such an initrd. + +USAGE ON MAINLINE KERNELS + +When using rcutorture to test changes to RCU itself, it is often +necessary to build a number of kernels in order to test that change +across a broad range of combinations of the relevant Kconfig options +and of the relevant kernel boot parameters. In this situation, use +of modprobe and rmmod can be quite time-consuming and error-prone. + +Therefore, the tools/testing/selftests/rcutorture/bin/kvm.sh +script is available for mainline testing for x86, arm64, and +powerpc. By default, it will run the series of tests specified by +tools/testing/selftests/rcutorture/configs/rcu/CFLIST, with each test +running for 30 minutes within a guest OS using a minimal userspace +supplied by an automatically generated initrd. After the tests are +complete, the resulting build products and console output are analyzed +for errors and the results of the runs are summarized. + +On larger systems, rcutorture testing can be accelerated by passing the +--cpus argument to kvm.sh. For example, on a 64-CPU system, "--cpus 43" +would use up to 43 CPUs to run tests concurrently, which as of v5.4 would +complete all the scenarios in two batches, reducing the time to complete +from about eight hours to about one hour (not counting the time to build +the sixteen kernels). The "--dryrun sched" argument will not run tests, +but rather tell you how the tests would be scheduled into batches. This +can be useful when working out how many CPUs to specify in the --cpus +argument. + +Not all changes require that all scenarios be run. For example, a change +to Tree SRCU might run only the SRCU-N and SRCU-P scenarios using the +--configs argument to kvm.sh as follows: "--configs 'SRCU-N SRCU-P'". +Large systems can run multiple copies of of the full set of scenarios, +for example, a system with 448 hardware threads can run five instances +of the full set concurrently. To make this happen: + + kvm.sh --cpus 448 --configs '5*CFLIST' + +Alternatively, such a system can run 56 concurrent instances of a single +eight-CPU scenario: + + kvm.sh --cpus 448 --configs '56*TREE04' + +Or 28 concurrent instances of each of two eight-CPU scenarios: + + kvm.sh --cpus 448 --configs '28*TREE03 28*TREE04' + +Of course, each concurrent instance will use memory, which can be +limited using the --memory argument, which defaults to 512M. Small +values for memory may require disabling the callback-flooding tests +using the --bootargs parameter discussed below. + +Sometimes additional debugging is useful, and in such cases the --kconfig +parameter to kvm.sh may be used, for example, "--kconfig 'CONFIG_KASAN=y'". + +Kernel boot arguments can also be supplied, for example, to control +rcutorture's module parameters. For example, to test a change to RCU's +CPU stall-warning code, use "--bootargs 'rcutorture.stall_cpu=30'". +This will of course result in the scripting reporting a failure, namely +the resuling RCU CPU stall warning. As noted above, reducing memory may +require disabling rcutorture's callback-flooding tests: + + kvm.sh --cpus 448 --configs '56*TREE04' --memory 128M \ + --bootargs 'rcutorture.fwd_progress=0' + +Sometimes all that is needed is a full set of kernel builds. This is +what the --buildonly argument does. + +Finally, the --trust-make argument allows each kernel build to reuse what +it can from the previous kernel build. + +There are additional more arcane arguments that are documented in the +source code of the kvm.sh script. + +If a run contains failures, the number of buildtime and runtime failures +is listed at the end of the kvm.sh output, which you really should redirect +to a file. The build products and console output of each run is kept in +tools/testing/selftests/rcutorture/res in timestamped directories. A +given directory can be supplied to kvm-find-errors.sh in order to have +it cycle you through summaries of errors and full error logs. For example: + + tools/testing/selftests/rcutorture/bin/kvm-find-errors.sh \ + tools/testing/selftests/rcutorture/res/2020.01.20-15.54.23 + +However, it is often more convenient to access the files directly. +Files pertaining to all scenarios in a run reside in the top-level +directory (2020.01.20-15.54.23 in the example above), while per-scenario +files reside in a subdirectory named after the scenario (for example, +"TREE04"). If a given scenario ran more than once (as in "--configs +'56*TREE04'" above), the directories corresponding to the second and +subsequent runs of that scenario include a sequence number, for example, +"TREE04.2", "TREE04.3", and so on. + +The most frequently used file in the top-level directory is testid.txt. +If the test ran in a git repository, then this file contains the commit +that was tested and any uncommitted changes in diff format. + +The most frequently used files in each per-scenario-run directory are: + +.config: This file contains the Kconfig options. + +Make.out: This contains build output for a specific scenario. + +console.log: This contains the console output for a specific scenario. + This file may be examined once the kernel has booted, but + it might not exist if the build failed. + +vmlinux: This contains the kernel, which can be useful with tools like + objdump and gdb. + +A number of additional files are available, but are less frequently used. +Many are intended for debugging of rcutorture itself or of its scripting. + +As of v5.4, a successful run with the default set of scenarios produces +the following summary at the end of the run on a 12-CPU system: + +SRCU-N ------- 804233 GPs (148.932/s) [srcu: g10008272 f0x0 ] +SRCU-P ------- 202320 GPs (37.4667/s) [srcud: g1809476 f0x0 ] +SRCU-t ------- 1122086 GPs (207.794/s) [srcu: g0 f0x0 ] +SRCU-u ------- 1111285 GPs (205.794/s) [srcud: g1 f0x0 ] +TASKS01 ------- 19666 GPs (3.64185/s) [tasks: g0 f0x0 ] +TASKS02 ------- 20541 GPs (3.80389/s) [tasks: g0 f0x0 ] +TASKS03 ------- 19416 GPs (3.59556/s) [tasks: g0 f0x0 ] +TINY01 ------- 836134 GPs (154.84/s) [rcu: g0 f0x0 ] n_max_cbs: 34198 +TINY02 ------- 850371 GPs (157.476/s) [rcu: g0 f0x0 ] n_max_cbs: 2631 +TREE01 ------- 162625 GPs (30.1157/s) [rcu: g1124169 f0x0 ] +TREE02 ------- 333003 GPs (61.6672/s) [rcu: g2647753 f0x0 ] n_max_cbs: 35844 +TREE03 ------- 306623 GPs (56.782/s) [rcu: g2975325 f0x0 ] n_max_cbs: 1496497 +CPU count limited from 16 to 12 +TREE04 ------- 246149 GPs (45.5831/s) [rcu: g1695737 f0x0 ] n_max_cbs: 434961 +TREE05 ------- 314603 GPs (58.2598/s) [rcu: g2257741 f0x2 ] n_max_cbs: 193997 +TREE07 ------- 167347 GPs (30.9902/s) [rcu: g1079021 f0x0 ] n_max_cbs: 478732 +CPU count limited from 16 to 12 +TREE09 ------- 752238 GPs (139.303/s) [rcu: g13075057 f0x0 ] n_max_cbs: 99011 diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst index 621111ce5740..f2b3439edcc2 100644 --- a/Documentation/accounting/psi.rst +++ b/Documentation/accounting/psi.rst @@ -1,3 +1,5 @@ +.. _psi: + ================================ PSI - Pressure Stall Information ================================ diff --git a/Documentation/admin-guide/binderfs.rst b/Documentation/admin-guide/binderfs.rst index c009671f8434..8243af9b3510 100644 --- a/Documentation/admin-guide/binderfs.rst +++ b/Documentation/admin-guide/binderfs.rst @@ -33,6 +33,12 @@ max a per-instance limit. If ``max=`` is set then only ```` number of binder devices can be allocated in this binderfs instance. +stats + Using ``stats=global`` enables global binder statistics. + ``stats=global`` is only available for a binderfs instance mounted in the + initial user namespace. An attempt to use the option to mount a binderfs + instance in another user namespace will return a permission error. + Allocating binder Devices ------------------------- diff --git a/Documentation/admin-guide/binfmt-misc.rst b/Documentation/admin-guide/binfmt-misc.rst index 97b0d7927078..7a864131e5ea 100644 --- a/Documentation/admin-guide/binfmt-misc.rst +++ b/Documentation/admin-guide/binfmt-misc.rst @@ -1,5 +1,5 @@ -Kernel Support for miscellaneous (your favourite) Binary Formats v1.1 -===================================================================== +Kernel Support for miscellaneous Binary Formats (binfmt_misc) +============================================================= This Kernel feature allows you to invoke almost (for restrictions see below) every program by simply typing its name in the shell. @@ -140,8 +140,8 @@ Hints ----- If you want to pass special arguments to your interpreter, you can -write a wrapper script for it. See Documentation/admin-guide/java.rst for an -example. +write a wrapper script for it. +See :doc:`Documentation/admin-guide/java.rst <./java>` for an example. Your interpreter should NOT look in the PATH for the filename; the kernel passes it the full filename (or the file descriptor) to use. Using ``$PATH`` can diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 27c77d853028..a6fd1f9b5faf 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -251,8 +251,6 @@ line of text and contains the following stats separated by whitespace: ================ ============================================================= orig_data_size uncompressed size of data stored in this disk. - This excludes same-element-filled pages (same_pages) since - no memory is allocated for them. Unit: bytes compr_data_size compressed size of data stored in this disk mem_used_total the amount of memory allocated for this disk. This diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst index cf2edcd09183..d6b3b77a4129 100644 --- a/Documentation/admin-guide/bootconfig.rst +++ b/Documentation/admin-guide/bootconfig.rst @@ -23,7 +23,7 @@ of dot-connected-words, and key and value are connected by ``=``. The value has to be terminated by semi-colon (``;``) or newline (``\n``). For array value, array entries are separated by comma (``,``). :: -KEY[.WORD[...]] = VALUE[, VALUE2[...]][;] + KEY[.WORD[...]] = VALUE[, VALUE2[...]][;] Unlike the kernel command line syntax, spaces are OK around the comma and ``=``. diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst index 86a6ae995d54..7ade3abd342a 100644 --- a/Documentation/admin-guide/cgroup-v1/cpusets.rst +++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst @@ -223,6 +223,17 @@ cpu_online_mask using a CPU hotplug notifier, and the mems file automatically tracks the value of node_states[N_MEMORY]--i.e., nodes with memory--using the cpuset_track_online_nodes() hook. +The cpuset.effective_cpus and cpuset.effective_mems files are +normally read-only copies of cpuset.cpus and cpuset.mems files +respectively. If the cpuset cgroup filesystem is mounted with the +special "cpuset_v2_mode" option, the behavior of these files will become +similar to the corresponding files in cpuset v2. In other words, hotplug +events will not change cpuset.cpus and cpuset.mems. Those events will +only affect cpuset.effective_cpus and cpuset.effective_mems which show +the actual cpus and memory nodes that are currently used by this cpuset. +See Documentation/admin-guide/cgroup-v2.rst for more information about +cpuset v2 behavior. + 1.4 What are exclusive cpusets ? -------------------------------- diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst index a3902aa253a9..338f2c7d7a1c 100644 --- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst +++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst @@ -2,13 +2,6 @@ HugeTLB Controller ================== -The HugeTLB controller allows to limit the HugeTLB usage per control group and -enforces the controller limit during page fault. Since HugeTLB doesn't -support page reclaim, enforcing the limit at page fault time implies that, -the application will get SIGBUS signal if it tries to access HugeTLB pages -beyond its limit. This requires the application to know beforehand how much -HugeTLB pages it would require for its use. - HugeTLB controller can be created by first mounting the cgroup filesystem. # mount -t cgroup -o hugetlb none /sys/fs/cgroup @@ -28,10 +21,14 @@ process (bash) into it. Brief summary of control files:: - hugetlb..limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage - hugetlb..max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded - hugetlb..usage_in_bytes # show current usage for "hugepagesize" hugetlb - hugetlb..failcnt # show the number of allocation failure due to HugeTLB limit + hugetlb..rsvd.limit_in_bytes # set/show limit of "hugepagesize" hugetlb reservations + hugetlb..rsvd.max_usage_in_bytes # show max "hugepagesize" hugetlb reservations and no-reserve faults + hugetlb..rsvd.usage_in_bytes # show current reservations and no-reserve faults for "hugepagesize" hugetlb + hugetlb..rsvd.failcnt # show the number of allocation failure due to HugeTLB reservation limit + hugetlb..limit_in_bytes # set/show limit of "hugepagesize" hugetlb faults + hugetlb..max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded + hugetlb..usage_in_bytes # show current usage for "hugepagesize" hugetlb + hugetlb..failcnt # show the number of allocation failure due to HugeTLB usage limit For a system supporting three hugepage sizes (64k, 32M and 1G), the control files include:: @@ -40,11 +37,95 @@ files include:: hugetlb.1GB.max_usage_in_bytes hugetlb.1GB.usage_in_bytes hugetlb.1GB.failcnt + hugetlb.1GB.rsvd.limit_in_bytes + hugetlb.1GB.rsvd.max_usage_in_bytes + hugetlb.1GB.rsvd.usage_in_bytes + hugetlb.1GB.rsvd.failcnt hugetlb.64KB.limit_in_bytes hugetlb.64KB.max_usage_in_bytes hugetlb.64KB.usage_in_bytes hugetlb.64KB.failcnt + hugetlb.64KB.rsvd.limit_in_bytes + hugetlb.64KB.rsvd.max_usage_in_bytes + hugetlb.64KB.rsvd.usage_in_bytes + hugetlb.64KB.rsvd.failcnt hugetlb.32MB.limit_in_bytes hugetlb.32MB.max_usage_in_bytes hugetlb.32MB.usage_in_bytes hugetlb.32MB.failcnt + hugetlb.32MB.rsvd.limit_in_bytes + hugetlb.32MB.rsvd.max_usage_in_bytes + hugetlb.32MB.rsvd.usage_in_bytes + hugetlb.32MB.rsvd.failcnt + + +1. Page fault accounting + +hugetlb..limit_in_bytes +hugetlb..max_usage_in_bytes +hugetlb..usage_in_bytes +hugetlb..failcnt + +The HugeTLB controller allows users to limit the HugeTLB usage (page fault) per +control group and enforces the limit during page fault. Since HugeTLB +doesn't support page reclaim, enforcing the limit at page fault time implies +that, the application will get SIGBUS signal if it tries to fault in HugeTLB +pages beyond its limit. Therefore the application needs to know exactly how many +HugeTLB pages it uses before hand, and the sysadmin needs to make sure that +there are enough available on the machine for all the users to avoid processes +getting SIGBUS. + + +2. Reservation accounting + +hugetlb..rsvd.limit_in_bytes +hugetlb..rsvd.max_usage_in_bytes +hugetlb..rsvd.usage_in_bytes +hugetlb..rsvd.failcnt + +The HugeTLB controller allows to limit the HugeTLB reservations per control +group and enforces the controller limit at reservation time and at the fault of +HugeTLB memory for which no reservation exists. Since reservation limits are +enforced at reservation time (on mmap or shget), reservation limits never causes +the application to get SIGBUS signal if the memory was reserved before hand. For +MAP_NORESERVE allocations, the reservation limit behaves the same as the fault +limit, enforcing memory usage at fault time and causing the application to +receive a SIGBUS if it's crossing its limit. + +Reservation limits are superior to page fault limits described above, since +reservation limits are enforced at reservation time (on mmap or shget), and +never causes the application to get SIGBUS signal if the memory was reserved +before hand. This allows for easier fallback to alternatives such as +non-HugeTLB memory for example. In the case of page fault accounting, it's very +hard to avoid processes getting SIGBUS since the sysadmin needs precisely know +the HugeTLB usage of all the tasks in the system and make sure there is enough +pages to satisfy all requests. Avoiding tasks getting SIGBUS on overcommited +systems is practically impossible with page fault accounting. + + +3. Caveats with shared memory + +For shared HugeTLB memory, both HugeTLB reservation and page faults are charged +to the first task that causes the memory to be reserved or faulted, and all +subsequent uses of this reserved or faulted memory is done without charging. + +Shared HugeTLB memory is only uncharged when it is unreserved or deallocated. +This is usually when the HugeTLB file is deleted, and not when the task that +caused the reservation or fault has exited. + + +4. Caveats with HugeTLB cgroup offline. + +When a HugeTLB cgroup goes offline with some reservations or faults still +charged to it, the behavior is as follows: + +- The fault charges are charged to the parent HugeTLB cgroup (reparented), +- the reservation charges remain on the offline HugeTLB cgroup. + +This means that if a HugeTLB cgroup gets offlined while there is still HugeTLB +reservations charged to it, that cgroup persists as a zombie until all HugeTLB +reservations are uncharged. HugeTLB reservations behave in this manner to match +the memory controller whose cgroups also persist as zombie until all charged +memory is uncharged. Also, the tracking of HugeTLB reservations is a bit more +complex compared to the tracking of HugeTLB faults, so it is significantly +harder to reparent reservations at offline time. diff --git a/Documentation/admin-guide/cgroup-v1/index.rst b/Documentation/admin-guide/cgroup-v1/index.rst index 10bf48bae0b0..226f64473e8e 100644 --- a/Documentation/admin-guide/cgroup-v1/index.rst +++ b/Documentation/admin-guide/cgroup-v1/index.rst @@ -1,3 +1,5 @@ +.. _cgroup-v1: + ======================== Control Groups version 1 ======================== diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 3f801461f0f3..bcc80269bb6a 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -9,7 +9,7 @@ This is the authoritative documentation on the design, interface and conventions of cgroup v2. It describes all userland-visible aspects of cgroup including core and specific controller behaviors. All future changes must be reflected in this document. Documentation for -v1 is available under Documentation/admin-guide/cgroup-v1/. +v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst `. .. CONTENTS @@ -188,6 +188,17 @@ cgroup v2 currently supports the following mount options. modified through remount from the init namespace. The mount option is ignored on non-init namespace mounts. + memory_recursiveprot + + Recursively apply memory.min and memory.low protection to + entire subtrees, without requiring explicit downward + propagation into leaf cgroups. This allows protecting entire + subtrees from one another, while retaining free competition + within those subtrees. This should have been the default + behavior but is a mount-option to avoid regressing setups + relying on the original semantics (e.g. specifying bogusly + high 'bypass' protection values at higher tree levels). + Organizing Processes and Threads -------------------------------- @@ -1023,7 +1034,7 @@ All time durations are in microseconds. A read-only nested-key file which exists on non-root cgroups. Shows pressure stall information for CPU. See - Documentation/accounting/psi.rst for details. + :ref:`Documentation/accounting/psi.rst ` for details. cpu.uclamp.min A read-write single value file which exists on non-root cgroups. @@ -1103,7 +1114,7 @@ PAGE_SIZE multiple when read back. proportionally to the overage, reducing reclaim pressure for smaller overages. - Effective min boundary is limited by memory.min values of + Effective min boundary is limited by memory.min values of all ancestor cgroups. If there is memory.min overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get @@ -1313,53 +1324,41 @@ PAGE_SIZE multiple when read back. Number of major page faults incurred workingset_refault - Number of refaults of previously evicted pages workingset_activate - Number of refaulted pages that were immediately activated workingset_nodereclaim - Number of times a shadow node has been reclaimed pgrefill - Amount of scanned pages (in an active LRU list) pgscan - Amount of scanned pages (in an inactive LRU list) pgsteal - Amount of reclaimed pages pgactivate - Amount of pages moved to the active LRU list pgdeactivate - Amount of pages moved to the inactive LRU list pglazyfree - Amount of pages postponed to be freed under memory pressure pglazyfreed - Amount of reclaimed lazyfree pages thp_fault_alloc - Number of transparent hugepages which were allocated to satisfy a page fault, including COW faults. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set. thp_collapse_alloc - Number of transparent hugepages which were allocated to allow collapsing an existing range of pages. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set. @@ -1403,7 +1402,7 @@ PAGE_SIZE multiple when read back. A read-only nested-key file which exists on non-root cgroups. Shows pressure stall information for memory. See - Documentation/accounting/psi.rst for details. + :ref:`Documentation/accounting/psi.rst ` for details. Usage Guidelines @@ -1478,7 +1477,7 @@ IO Interface Files dios Number of discard IOs ====== ===================== - An example read output follows: + An example read output follows:: 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021 @@ -1643,7 +1642,7 @@ IO Interface Files A read-only nested-key file which exists on non-root cgroups. Shows pressure stall information for IO. See - Documentation/accounting/psi.rst for details. + :ref:`Documentation/accounting/psi.rst ` for details. Writeback @@ -1853,7 +1852,7 @@ Cpuset Interface Files from the requested CPUs. The CPU numbers are comma-separated numbers or ranges. - For example: + For example:: # cat cpuset.cpus 0-4,6,8-10 @@ -1892,7 +1891,7 @@ Cpuset Interface Files from the requested memory nodes. The memory node numbers are comma-separated numbers or ranges. - For example: + For example:: # cat cpuset.mems 0-1,3 diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst index 252e5ef324e5..0dc2eb8e44e5 100644 --- a/Documentation/admin-guide/dynamic-debug-howto.rst +++ b/Documentation/admin-guide/dynamic-debug-howto.rst @@ -54,6 +54,9 @@ If you make a mistake with the syntax, the write will fail thus:: /dynamic_debug/control -bash: echo: write error: Invalid argument +Note, for systems without 'debugfs' enabled, the control file can be +found in ``/proc/dynamic_debug/control``. + Viewing Dynamic Debug Behaviour =============================== diff --git a/Documentation/admin-guide/edid.rst b/Documentation/admin-guide/edid.rst new file mode 100644 index 000000000000..80deeb21a265 --- /dev/null +++ b/Documentation/admin-guide/edid.rst @@ -0,0 +1,60 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==== +EDID +==== + +In the good old days when graphics parameters were configured explicitly +in a file called xorg.conf, even broken hardware could be managed. + +Today, with the advent of Kernel Mode Setting, a graphics board is +either correctly working because all components follow the standards - +or the computer is unusable, because the screen remains dark after +booting or it displays the wrong area. Cases when this happens are: + +- The graphics board does not recognize the monitor. +- The graphics board is unable to detect any EDID data. +- The graphics board incorrectly forwards EDID data to the driver. +- The monitor sends no or bogus EDID data. +- A KVM sends its own EDID data instead of querying the connected monitor. + +Adding the kernel parameter "nomodeset" helps in most cases, but causes +restrictions later on. + +As a remedy for such situations, the kernel configuration item +CONFIG_DRM_LOAD_EDID_FIRMWARE was introduced. It allows to provide an +individually prepared or corrected EDID data set in the /lib/firmware +directory from where it is loaded via the firmware interface. The code +(see drivers/gpu/drm/drm_edid_load.c) contains built-in data sets for +commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200, +1680x1050, 1920x1080) as binary blobs, but the kernel source tree does +not contain code to create these data. In order to elucidate the origin +of the built-in binary EDID blobs and to facilitate the creation of +individual data for a specific misbehaving monitor, commented sources +and a Makefile environment are given here. + +To create binary EDID and C source code files from the existing data +material, simply type "make" in tools/edid/. + +If you want to create your own EDID file, copy the file 1024x768.S, +replace the settings with your own data and add a new target to the +Makefile. Please note that the EDID data structure expects the timing +values in a different way as compared to the standard X11 format. + +X11: + HTimings: + hdisp hsyncstart hsyncend htotal + VTimings: + vdisp vsyncstart vsyncend vtotal + +EDID:: + + #define XPIX hdisp + #define XBLANK htotal-hdisp + #define XOFFSET hsyncstart-hdisp + #define XPULSE hsyncend-hsyncstart + + #define YPIX vdisp + #define YBLANK vtotal-vdisp + #define YOFFSET vsyncstart-vdisp + #define YPULSE vsyncend-vsyncstart diff --git a/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst index af6865b822d2..68d96f0e9c95 100644 --- a/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst +++ b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst @@ -136,8 +136,6 @@ enables the mitigation by default. The mitigation can be controlled at boot time via a kernel command line option. See :ref:`taa_mitigation_control_command_line`. -.. _virt_mechanism: - Virtualization mitigation ^^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index f1d0ccffbe72..5a6269fb8593 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -75,6 +75,7 @@ configure specific aspects of kernel behavior to your liking. cputopology dell_rbu device-mapper/index + edid efi-stub ext4 nfs/index diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst index df5b8345c41d..9b14b0c2c9c4 100644 --- a/Documentation/admin-guide/iostats.rst +++ b/Documentation/admin-guide/iostats.rst @@ -100,7 +100,7 @@ Field 10 -- # of milliseconds spent doing I/Os (unsigned int) Since 5.0 this field counts jiffies when at least one request was started or completed. If request runs more than 2 jiffies then some - I/O time will not be accounted unless there are other requests. + I/O time might be not accounted in case of concurrent requests. Field 11 -- weighted # of milliseconds spent doing I/Os (unsigned int) This field is incremented at each I/O start, I/O completion, I/O @@ -143,6 +143,9 @@ are summed (possibly overflowing the unsigned long variable they are summed to) and the result given to the user. There is no convenient user interface for accessing the per-CPU counters themselves. +Since 4.19 request times are measured with nanoseconds precision and +truncated to milliseconds before showing in this interface. + Disks vs Partitions ------------------- diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index c07815d230bc..f2a93c8679e8 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -22,11 +22,13 @@ default: 0 acpi_backlight= [HW,ACPI] - acpi_backlight=vendor - acpi_backlight=video - If set to vendor, prefer vendor specific driver + { vendor | video | native | none } + If set to vendor, prefer vendor-specific driver (e.g. thinkpad_acpi, sony_acpi, etc.) instead of the ACPI video.ko driver. + If set to video, use the ACPI video.ko driver. + If set to native, use the device's native backlight mode. + If set to none, disable the ACPI backlight interface. acpi_force_32bit_fadt_addr force FADT to use 32 bit addresses rather than the @@ -450,6 +452,9 @@ bert_disable [ACPI] Disable BERT OS support on buggy BIOSes. + bgrt_disable [ACPI][X86] + Disable BGRT to avoid flickering OEM logo. + bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards) bttv.radio= Most important insmod options are available as kernel args too. @@ -522,6 +527,7 @@ Default value is set via a kernel config option. Value can be changed at runtime via /sys/fs/selinux/checkreqprot. + Setting checkreqprot to 1 is deprecated. cio_ignore= [S390] See Documentation/s390/common_io.rst for details. @@ -679,7 +685,7 @@ coredump_filter= [KNL] Change the default value for /proc//coredump_filter. - See also Documentation/filesystems/proc.txt. + See also Documentation/filesystems/proc.rst. coresight_cpu_debug.enable [ARM,ARM64] @@ -956,7 +962,7 @@ edid/1680x1050.bin, or edid/1920x1080.bin is given and no file with the same name exists. Details and instructions how to build your own EDID data are - available in Documentation/driver-api/edid.rst. An EDID + available in Documentation/admin-guide/edid.rst. An EDID data set will only be used for a particular connector, if its name and a colon are prepended to the EDID name. Each connector may use a unique EDID data @@ -986,10 +992,6 @@ Documentation/admin-guide/dynamic-debug-howto.rst for details. - nompx [X86] Disables Intel Memory Protection Extensions. - See Documentation/x86/intel_mpx.rst for more - information about the feature. - nopku [X86] Disable Memory Protection Keys CPU feature found in some Intel CPUs. @@ -1099,6 +1101,12 @@ A valid base address must be provided, and the serial port must already be setup and configured. + ec_imx21, + ec_imx6q, + Start an early, polled-mode, output-only console on the + Freescale i.MX UART at the specified address. The UART + must already be setup and configured. + ar3700_uart, Start an early, polled-mode console on the Armada 3700 serial port at the specified @@ -1354,6 +1362,24 @@ can be changed at run time by the max_graph_depth file in the tracefs tracing directory. default: 0 (no limit) + fw_devlink= [KNL] Create device links between consumer and supplier + devices by scanning the firmware to infer the + consumer/supplier relationships. This feature is + especially useful when drivers are loaded as modules as + it ensures proper ordering of tasks like device probing + (suppliers first, then consumers), supplier boot state + clean up (only after all consumers have probed), + suspend/resume & runtime PM (consumers first, then + suppliers). + Format: { off | permissive | on | rpm } + off -- Don't create device links from firmware info. + permissive -- Create device links from firmware info + but use it only for ordering boot state clean + up (sync_state() calls). + on -- Create device links from firmware info and use it + to enforce probe and suspend/resume ordering. + rpm -- Like "on", but also use to order runtime PM. + gamecon.map[2|3]= [HW,JOY] Multisystem joystick and NES/SNES/PSX pad support via parallel port (up to 5 devices per port) @@ -1445,6 +1471,14 @@ hpet_mmap= [X86, HPET_MMAP] Allow userspace to mmap HPET registers. Default set by CONFIG_HPET_MMAP_DEFAULT. + hugetlb_cma= [HW] The size of a cma area used for allocation + of gigantic hugepages. + Format: nn[KMGTPE] + + Reserve a cma area of given size and allocate gigantic + hugepages using the cma allocator. If enabled, the + boot-time allocation of gigantic hugepages is skipped. + hugepages= [HW,X86-32,IA-64] HugeTLB pages to allocate at boot. hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages. On x86-64 and powerpc, this option can be specified @@ -1779,7 +1813,7 @@ provided by tboot because it makes the system vulnerable to DMA attacks. nobounce [Default off] - Disable bounce buffer for unstrusted devices such as + Disable bounce buffer for untrusted devices such as the Thunderbolt devices. This will treat the untrusted devices as the trusted ones, hence might expose security risks of DMA attacks. @@ -1883,7 +1917,7 @@ No delay ip= [IP_PNP] - See Documentation/filesystems/nfs/nfsroot.txt. + See Documentation/admin-guide/nfs/nfsroot.rst. ipcmni_extend [KNL] Extend the maximum number of unique System V IPC identifiers from 32,768 to 16,777,216. @@ -2543,13 +2577,22 @@ For details see: Documentation/admin-guide/hw-vuln/mds.rst mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory - Amount of memory to be used when the kernel is not able - to see the whole system memory or for test. + Amount of memory to be used in cases as follows: + + 1 for test; + 2 when the kernel is not able to see the whole system memory; + 3 memory that lies after 'mem=' boundary is excluded from + the hypervisor, then assigned to KVM guests. + [X86] Work as limiting max address. Use together with memmap= to avoid physical address space collisions. Without memmap= PCI devices could be placed at addresses belonging to unused RAM. + Note that this only takes effects during boot time since + in above case 3, memory may need be hot added after boot + if system memory of hypervisor is not sufficient. + mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel memory. @@ -2795,7 +2838,7 @@ ,[,,,,] mtdparts= [MTD] - See drivers/mtd/cmdlinepart.c. + See drivers/mtd/parsers/cmdlinepart.c multitce=off [PPC] This parameter disables the use of the pSeries firmware feature for updating multiple TCE entries @@ -2853,13 +2896,13 @@ Default value is 0. nfsaddrs= [NFS] Deprecated. Use ip= instead. - See Documentation/filesystems/nfs/nfsroot.txt. + See Documentation/admin-guide/nfs/nfsroot.rst. nfsroot= [NFS] nfs root filesystem for disk-less boxes. - See Documentation/filesystems/nfs/nfsroot.txt. + See Documentation/admin-guide/nfs/nfsroot.rst. nfsrootdebug [NFS] enable nfsroot debugging messages. - See Documentation/filesystems/nfs/nfsroot.txt. + See Documentation/admin-guide/nfs/nfsroot.rst. nfs.callback_nr_threads= [NFSv4] set the total number of threads that the @@ -3174,7 +3217,7 @@ [X86,PV_OPS] Disable paravirtualized VMware scheduler clock and use the default one. - no-steal-acc [X86,KVM,ARM64] Disable paravirtualized steal time + no-steal-acc [X86,PV_OPS,ARM64] Disable paravirtualized steal time accounting. steal time is computed, but won't influence scheduler behaviour @@ -3285,12 +3328,6 @@ This can be set from sysctl after boot. See Documentation/admin-guide/sysctl/vm.rst for details. - of_devlink [OF, KNL] Create device links between consumer and - supplier devices by scanning the devictree to infer the - consumer/supplier relationships. A consumer device - will not be probed until all the supplier devices have - probed successfully. - ohci1394_dma=early [HW] enable debugging via the ohci1394 driver. See Documentation/debugging-via-ohci1394.txt for more info. @@ -3698,6 +3735,9 @@ Override pmtimer IOPort with a hex value. e.g. pmtmr=0x508 + pm_debug_messages [SUSPEND,KNL] + Enable suspend/resume debug messages during boot up. + pnp.debug=1 [PNP] Enable PNP debug messages (depends on the CONFIG_PNP_DEBUG_MESSAGES option). Change at run-time @@ -3799,6 +3839,11 @@ before loading. See Documentation/admin-guide/blockdev/ramdisk.rst. + prot_virt= [S390] enable hosting protected virtual machines + isolated from the hypervisor (if hardware supports + that). + Format: + psi= [KNL] Enable or disable pressure stall information tracking. Format: @@ -3984,6 +4029,15 @@ Set threshold of queued RCU callbacks below which batch limiting is re-enabled. + rcutree.qovld= [KNL] + Set threshold of queued RCU callbacks beyond which + RCU's force-quiescent-state scan will aggressively + enlist help from cond_resched() and sched IPIs to + help CPUs more quickly reach quiescent states. + Set to less than zero to make this be set based + on rcutree.qhimark at boot time and to zero to + disable more aggressive help enlistment. + rcutree.rcu_idle_gp_delay= [KNL] Set wakeup interval for idle CPUs that have RCU callbacks (RCU_FAST_NO_HZ=y). @@ -4199,6 +4253,12 @@ rcupdate.rcu_cpu_stall_suppress= [KNL] Suppress RCU CPU stall warning messages. + rcupdate.rcu_cpu_stall_suppress_at_boot= [KNL] + Suppress RCU CPU stall warning messages and + rcutorture writer stall warnings that occur + during early boot, that is, during the time + before the init task is spawned. + rcupdate.rcu_cpu_stall_timeout= [KNL] Set timeout for RCU CPU stall warning messages. @@ -4392,6 +4452,22 @@ incurs a small amount of overhead in the scheduler but is useful for debugging and performance tuning. + sched_thermal_decay_shift= + [KNL, SMP] Set a decay shift for scheduler thermal + pressure signal. Thermal pressure signal follows the + default decay period of other scheduler pelt + signals(usually 32 ms but configurable). Setting + sched_thermal_decay_shift will left shift the decay + period for the thermal pressure signal by the shift + value. + i.e. with the default pelt decay period of 32 ms + sched_thermal_decay_shift thermal pressure decay pr + 1 64 ms + 2 128 ms + and so on. + Format: integer between 0 and 10 + Default is 0. + skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate xtime_lock contention on larger systems, and/or RCU lock contention on all systems with CONFIG_MAXSMP set. @@ -4514,10 +4590,10 @@ Format: A nonzero value instructs the soft-lockup detector - to panic the machine when a soft-lockup occurs. This - is also controlled by CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC - which is the respective build-time switch to that - functionality. + to panic the machine when a soft-lockup occurs. It is + also controlled by the kernel.softlockup_panic sysctl + and CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC, which is the + respective build-time switch to that functionality. softlockup_all_cpu_backtrace= [KNL] Should the soft-lockup detector generate @@ -4659,6 +4735,28 @@ spia_pedr= spia_peddr= + split_lock_detect= + [X86] Enable split lock detection + + When enabled (and if hardware support is present), atomic + instructions that access data across cache line + boundaries will result in an alignment check exception. + + off - not enabled + + warn - the kernel will emit rate limited warnings + about applications triggering the #AC + exception. This mode is the default on CPUs + that supports split lock detection. + + fatal - the kernel will send SIGBUS to applications + that trigger the #AC exception. + + If an #AC exception is hit in the kernel or in + firmware (i.e. not while executing in user mode) + the kernel will oops in either "warn" or "fatal" + mode. + srcutree.counter_wrap_check [KNL] Specifies how frequently to check for grace-period sequence counter wrap for the @@ -4871,6 +4969,10 @@ topology updates sent by the hypervisor to this LPAR. + torture.disable_onoff_at_boot= [KNL] + Prevent the CPU-hotplug component of torturing + until after init has spawned. + tp720= [HW,PS2] tpm_suspend_pcr=[HW,TPM] diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst index baeeba8762ae..21818aca4708 100644 --- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst +++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst @@ -234,7 +234,7 @@ To reduce its OS jitter, do any of the following: Such a workqueue can be confined to a given subset of the CPUs using the ``/sys/devices/virtual/workqueue/*/cpumask`` sysfs files. The set of WQ_SYSFS workqueues can be displayed using - "ls sys/devices/virtual/workqueue". That said, the workqueues + "ls /sys/devices/virtual/workqueue". That said, the workqueues maintainer would like to caution people against indiscriminately sprinkling WQ_SYSFS across all the workqueues. The reason for caution is that it is easy to add WQ_SYSFS, but because sysfs is diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index bd5714547cee..2f31de8f7c74 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -310,6 +310,11 @@ thp_fault_fallback is incremented if a page fault fails to allocate a huge page and instead falls back to using small pages. +thp_fault_fallback_charge + is incremented if a page fault fails to charge a huge page and + instead falls back to using small pages even though the + allocation was successful. + thp_collapse_alloc_failed is incremented if khugepaged found a range of pages that should be collapsed into one huge page but failed @@ -319,6 +324,15 @@ thp_file_alloc is incremented every time a file huge page is successfully allocated. +thp_file_fallback + is incremented if a file huge page is attempted to be allocated + but fails and instead falls back to using small pages. + +thp_file_fallback_charge + is incremented if a file huge page cannot be charged and instead + falls back to using small pages even though the allocation was + successful. + thp_file_mapped is incremented every time a file huge page is mapped into user address space. diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index 5048cf661a8a..c30176e67900 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -108,6 +108,57 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an half copied page since it'll keep userfaulting until the copy has finished. +Notes: + +- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then + you must provide some kind of page in your thread after reading from + the uffd. You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE. + The normal behavior of the OS automatically providing a zero page on + an annonymous mmaping is not in place. + +- None of the page-delivering ioctls default to the range that you + registered with. You must fill in all fields for the appropriate + ioctl struct including the range. + +- You get the address of the access that triggered the missing page + event out of a struct uffd_msg that you read in the thread from the + uffd. You can supply as many pages as you want with UFFDIO_COPY or + UFFDIO_ZEROPAGE. Keep in mind that unless you used DONTWAKE then + the first of any of those IOCTLs wakes up the faulting thread. + +- Be sure to test for all errors including (pollfd[0].revents & + POLLERR). This can happen, e.g. when ranges supplied were + incorrect. + +Write Protect Notifications +--------------------------- + +This is equivalent to (but faster than) using mprotect and a SIGSEGV +signal handler. + +Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP. +Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT, +struct *uffdio_writeprotect) while mode = UFFDIO_WRITEPROTECT_MODE_WP +in the struct passed in. The range does not default to and does not +have to be identical to the range you registered with. You can write +protect as many ranges as you like (inside the registered range). +Then, in the thread reading from uffd the struct will have +msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send +ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect) again +while pagefault.mode does not have UFFDIO_WRITEPROTECT_MODE_WP set. +This wakes up the thread which will continue to run with writes. This +allows you to do the bookkeeping about the write in the uffd reading +thread before the ioctl. + +If you registered with both UFFDIO_REGISTER_MODE_MISSING and +UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in +which you supply a page and undo write protect. Note that there is a +difference between writes into a WP area and into a !WP area. The +former will have UFFD_PAGEFAULT_FLAG_WP set, the latter +UFFD_PAGEFAULT_FLAG_WRITE. The latter did not fail on protection but +you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was +used. + QEMU/KVM ======== diff --git a/Documentation/admin-guide/perf/imx-ddr.rst b/Documentation/admin-guide/perf/imx-ddr.rst index 3726a10a03ba..f05f56c73b7d 100644 --- a/Documentation/admin-guide/perf/imx-ddr.rst +++ b/Documentation/admin-guide/perf/imx-ddr.rst @@ -43,7 +43,8 @@ value 1 for supported. AXI_ID and AXI_MASKING are mapped on DPCR1 register in performance counter. When non-masked bits are matching corresponding AXI_ID bits then counter is - incremented. Perf counter is incremented if + incremented. Perf counter is incremented if:: + AxID && AXI_MASKING == AXI_ID && AXI_MASKING This filter doesn't support filter different AXI ID for axid-read and axid-write diff --git a/Documentation/admin-guide/pm/cpufreq_drivers.rst b/Documentation/admin-guide/pm/cpufreq_drivers.rst new file mode 100644 index 000000000000..9a134ae65803 --- /dev/null +++ b/Documentation/admin-guide/pm/cpufreq_drivers.rst @@ -0,0 +1,274 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================================= +Legacy Documentation of CPU Performance Scaling Drivers +======================================================= + +Included below are historic documents describing assorted +:doc:`CPU performance scaling ` drivers. They are reproduced verbatim, +with the original white space formatting and indentation preserved, except for +the added leading space character in every line of text. + + +AMD PowerNow! Drivers +===================== + +:: + + PowerNow! and Cool'n'Quiet are AMD names for frequency + management capabilities in AMD processors. As the hardware + implementation changes in new generations of the processors, + there is a different cpu-freq driver for each generation. + + Note that the driver's will not load on the "wrong" hardware, + so it is safe to try each driver in turn when in doubt as to + which is the correct driver. + + Note that the functionality to change frequency (and voltage) + is not available in all processors. The drivers will refuse + to load on processors without this capability. The capability + is detected with the cpuid instruction. + + The drivers use BIOS supplied tables to obtain frequency and + voltage information appropriate for a particular platform. + Frequency transitions will be unavailable if the BIOS does + not supply these tables. + + 6th Generation: powernow-k6 + + 7th Generation: powernow-k7: Athlon, Duron, Geode. + + 8th Generation: powernow-k8: Athlon, Athlon 64, Opteron, Sempron. + Documentation on this functionality in 8th generation processors + is available in the "BIOS and Kernel Developer's Guide", publication + 26094, in chapter 9, available for download from www.amd.com. + + BIOS supplied data, for powernow-k7 and for powernow-k8, may be + from either the PSB table or from ACPI objects. The ACPI support + is only available if the kernel config sets CONFIG_ACPI_PROCESSOR. + The powernow-k8 driver will attempt to use ACPI if so configured, + and fall back to PST if that fails. + The powernow-k7 driver will try to use the PSB support first, and + fall back to ACPI if the PSB support fails. A module parameter, + acpi_force, is provided to force ACPI support to be used instead + of PSB support. + + +``cpufreq-nforce2`` +=================== + +:: + + The cpufreq-nforce2 driver changes the FSB on nVidia nForce2 platforms. + + This works better than on other platforms, because the FSB of the CPU + can be controlled independently from the PCI/AGP clock. + + The module has two options: + + fid: multiplier * 10 (for example 8.5 = 85) + min_fsb: minimum FSB + + If not set, fid is calculated from the current CPU speed and the FSB. + min_fsb defaults to FSB at boot time - 50 MHz. + + IMPORTANT: The available range is limited downwards! + Also the minimum available FSB can differ, for systems + booting with 200 MHz, 150 should always work. + + +``pcc-cpufreq`` +=============== + +:: + + /* + * pcc-cpufreq.txt - PCC interface documentation + * + * Copyright (C) 2009 Red Hat, Matthew Garrett + * Copyright (C) 2009 Hewlett-Packard Development Company, L.P. + * Nagananda Chumbalkar + */ + + + Processor Clocking Control Driver + --------------------------------- + + Contents: + --------- + 1. Introduction + 1.1 PCC interface + 1.1.1 Get Average Frequency + 1.1.2 Set Desired Frequency + 1.2 Platforms affected + 2. Driver and /sys details + 2.1 scaling_available_frequencies + 2.2 cpuinfo_transition_latency + 2.3 cpuinfo_cur_freq + 2.4 related_cpus + 3. Caveats + + 1. Introduction: + ---------------- + Processor Clocking Control (PCC) is an interface between the platform + firmware and OSPM. It is a mechanism for coordinating processor + performance (ie: frequency) between the platform firmware and the OS. + + The PCC driver (pcc-cpufreq) allows OSPM to take advantage of the PCC + interface. + + OS utilizes the PCC interface to inform platform firmware what frequency the + OS wants for a logical processor. The platform firmware attempts to achieve + the requested frequency. If the request for the target frequency could not be + satisfied by platform firmware, then it usually means that power budget + conditions are in place, and "power capping" is taking place. + + 1.1 PCC interface: + ------------------ + The complete PCC specification is available here: + https://acpica.org/sites/acpica/files/Processor-Clocking-Control-v1p0.pdf + + PCC relies on a shared memory region that provides a channel for communication + between the OS and platform firmware. PCC also implements a "doorbell" that + is used by the OS to inform the platform firmware that a command has been + sent. + + The ACPI PCCH() method is used to discover the location of the PCC shared + memory region. The shared memory region header contains the "command" and + "status" interface. PCCH() also contains details on how to access the platform + doorbell. + + The following commands are supported by the PCC interface: + * Get Average Frequency + * Set Desired Frequency + + The ACPI PCCP() method is implemented for each logical processor and is + used to discover the offsets for the input and output buffers in the shared + memory region. + + When PCC mode is enabled, the platform will not expose processor performance + or throttle states (_PSS, _TSS and related ACPI objects) to OSPM. Therefore, + the native P-state driver (such as acpi-cpufreq for Intel, powernow-k8 for + AMD) will not load. + + However, OSPM remains in control of policy. The governor (eg: "ondemand") + computes the required performance for each processor based on server workload. + The PCC driver fills in the command interface, and the input buffer and + communicates the request to the platform firmware. The platform firmware is + responsible for delivering the requested performance. + + Each PCC command is "global" in scope and can affect all the logical CPUs in + the system. Therefore, PCC is capable of performing "group" updates. With PCC + the OS is capable of getting/setting the frequency of all the logical CPUs in + the system with a single call to the BIOS. + + 1.1.1 Get Average Frequency: + ---------------------------- + This command is used by the OSPM to query the running frequency of the + processor since the last time this command was completed. The output buffer + indicates the average unhalted frequency of the logical processor expressed as + a percentage of the nominal (ie: maximum) CPU frequency. The output buffer + also signifies if the CPU frequency is limited by a power budget condition. + + 1.1.2 Set Desired Frequency: + ---------------------------- + This command is used by the OSPM to communicate to the platform firmware the + desired frequency for a logical processor. The output buffer is currently + ignored by OSPM. The next invocation of "Get Average Frequency" will inform + OSPM if the desired frequency was achieved or not. + + 1.2 Platforms affected: + ----------------------- + The PCC driver will load on any system where the platform firmware: + * supports the PCC interface, and the associated PCCH() and PCCP() methods + * assumes responsibility for managing the hardware clocking controls in order + to deliver the requested processor performance + + Currently, certain HP ProLiant platforms implement the PCC interface. On those + platforms PCC is the "default" choice. + + However, it is possible to disable this interface via a BIOS setting. In + such an instance, as is also the case on platforms where the PCC interface + is not implemented, the PCC driver will fail to load silently. + + 2. Driver and /sys details: + --------------------------- + When the driver loads, it merely prints the lowest and the highest CPU + frequencies supported by the platform firmware. + + The PCC driver loads with a message such as: + pcc-cpufreq: (v1.00.00) driver loaded with frequency limits: 1600 MHz, 2933 + MHz + + This means that the OPSM can request the CPU to run at any frequency in + between the limits (1600 MHz, and 2933 MHz) specified in the message. + + Internally, there is no need for the driver to convert the "target" frequency + to a corresponding P-state. + + The VERSION number for the driver will be of the format v.xy.ab. + eg: 1.00.02 + ----- -- + | | + | -- this will increase with bug fixes/enhancements to the driver + |-- this is the version of the PCC specification the driver adheres to + + + The following is a brief discussion on some of the fields exported via the + /sys filesystem and how their values are affected by the PCC driver: + + 2.1 scaling_available_frequencies: + ---------------------------------- + scaling_available_frequencies is not created in /sys. No intermediate + frequencies need to be listed because the BIOS will try to achieve any + frequency, within limits, requested by the governor. A frequency does not have + to be strictly associated with a P-state. + + 2.2 cpuinfo_transition_latency: + ------------------------------- + The cpuinfo_transition_latency field is 0. The PCC specification does + not include a field to expose this value currently. + + 2.3 cpuinfo_cur_freq: + --------------------- + A) Often cpuinfo_cur_freq will show a value different than what is declared + in the scaling_available_frequencies or scaling_cur_freq, or scaling_max_freq. + This is due to "turbo boost" available on recent Intel processors. If certain + conditions are met the BIOS can achieve a slightly higher speed than requested + by OSPM. An example: + + scaling_cur_freq : 2933000 + cpuinfo_cur_freq : 3196000 + + B) There is a round-off error associated with the cpuinfo_cur_freq value. + Since the driver obtains the current frequency as a "percentage" (%) of the + nominal frequency from the BIOS, sometimes, the values displayed by + scaling_cur_freq and cpuinfo_cur_freq may not match. An example: + + scaling_cur_freq : 1600000 + cpuinfo_cur_freq : 1583000 + + In this example, the nominal frequency is 2933 MHz. The driver obtains the + current frequency, cpuinfo_cur_freq, as 54% of the nominal frequency: + + 54% of 2933 MHz = 1583 MHz + + Nominal frequency is the maximum frequency of the processor, and it usually + corresponds to the frequency of the P0 P-state. + + 2.4 related_cpus: + ----------------- + The related_cpus field is identical to affected_cpus. + + affected_cpus : 4 + related_cpus : 4 + + Currently, the PCC driver does not evaluate _PSD. The platforms that support + PCC do not implement SW_ALL. So OSPM doesn't need to perform any coordination + to ensure that the same frequency is requested of all dependent CPUs. + + 3. Caveats: + ----------- + The "cpufreq_stats" module in its present form cannot be loaded and + expected to work with the PCC driver. Since the "cpufreq_stats" module + provides information wrt each P-state, it is not applicable to the PCC driver. diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst index 6a06dc473dd6..5605cc6f9560 100644 --- a/Documentation/admin-guide/pm/cpuidle.rst +++ b/Documentation/admin-guide/pm/cpuidle.rst @@ -583,20 +583,17 @@ Power Management Quality of Service for CPUs The power management quality of service (PM QoS) framework in the Linux kernel allows kernel code and user space processes to set constraints on various energy-efficiency features of the kernel to prevent performance from dropping -below a required level. The PM QoS constraints can be set globally, in -predefined categories referred to as PM QoS classes, or against individual -devices. +below a required level. CPU idle time management can be affected by PM QoS in two ways, through the -global constraint in the ``PM_QOS_CPU_DMA_LATENCY`` class and through the -resume latency constraints for individual CPUs. Kernel code (e.g. device -drivers) can set both of them with the help of special internal interfaces -provided by the PM QoS framework. User space can modify the former by opening -the :file:`cpu_dma_latency` special device file under :file:`/dev/` and writing -a binary value (interpreted as a signed 32-bit integer) to it. In turn, the -resume latency constraint for a CPU can be modified by user space by writing a -string (representing a signed 32-bit integer) to the -:file:`power/pm_qos_resume_latency_us` file under +global CPU latency limit and through the resume latency constraints for +individual CPUs. Kernel code (e.g. device drivers) can set both of them with +the help of special internal interfaces provided by the PM QoS framework. User +space can modify the former by opening the :file:`cpu_dma_latency` special +device file under :file:`/dev/` and writing a binary value (interpreted as a +signed 32-bit integer) to it. In turn, the resume latency constraint for a CPU +can be modified from user space by writing a string (representing a signed +32-bit integer) to the :file:`power/pm_qos_resume_latency_us` file under :file:`/sys/devices/system/cpu/cpu/` in ``sysfs``, where the CPU number ```` is allocated at the system initialization time. Negative values will be rejected in both cases and, also in both cases, the written integer @@ -605,32 +602,34 @@ number will be interpreted as a requested PM QoS constraint in microseconds. The requested value is not automatically applied as a new constraint, however, as it may be less restrictive (greater in this particular case) than another constraint previously requested by someone else. For this reason, the PM QoS -framework maintains a list of requests that have been made so far in each -global class and for each device, aggregates them and applies the effective -(minimum in this particular case) value as the new constraint. +framework maintains a list of requests that have been made so far for the +global CPU latency limit and for each individual CPU, aggregates them and +applies the effective (minimum in this particular case) value as the new +constraint. In fact, opening the :file:`cpu_dma_latency` special device file causes a new -PM QoS request to be created and added to the priority list of requests in the -``PM_QOS_CPU_DMA_LATENCY`` class and the file descriptor coming from the -"open" operation represents that request. If that file descriptor is then -used for writing, the number written to it will be associated with the PM QoS -request represented by it as a new requested constraint value. Next, the -priority list mechanism will be used to determine the new effective value of -the entire list of requests and that effective value will be set as a new -constraint. Thus setting a new requested constraint value will only change the -real constraint if the effective "list" value is affected by it. In particular, -for the ``PM_QOS_CPU_DMA_LATENCY`` class it only affects the real constraint if -it is the minimum of the requested constraints in the list. The process holding -a file descriptor obtained by opening the :file:`cpu_dma_latency` special device -file controls the PM QoS request associated with that file descriptor, but it -controls this particular PM QoS request only. +PM QoS request to be created and added to a global priority list of CPU latency +limit requests and the file descriptor coming from the "open" operation +represents that request. If that file descriptor is then used for writing, the +number written to it will be associated with the PM QoS request represented by +it as a new requested limit value. Next, the priority list mechanism will be +used to determine the new effective value of the entire list of requests and +that effective value will be set as a new CPU latency limit. Thus requesting a +new limit value will only change the real limit if the effective "list" value is +affected by it, which is the case if it is the minimum of the requested values +in the list. + +The process holding a file descriptor obtained by opening the +:file:`cpu_dma_latency` special device file controls the PM QoS request +associated with that file descriptor, but it controls this particular PM QoS +request only. Closing the :file:`cpu_dma_latency` special device file or, more precisely, the file descriptor obtained while opening it, causes the PM QoS request associated -with that file descriptor to be removed from the ``PM_QOS_CPU_DMA_LATENCY`` -class priority list and destroyed. If that happens, the priority list mechanism -will be used, again, to determine the new effective value for the whole list -and that value will become the new real constraint. +with that file descriptor to be removed from the global priority list of CPU +latency limit requests and destroyed. If that happens, the priority list +mechanism will be used again, to determine the new effective value for the whole +list and that value will become the new limit. In turn, for each CPU there is one resume latency PM QoS request associated with the :file:`power/pm_qos_resume_latency_us` file under @@ -647,10 +646,10 @@ CPU in question every time the list of requests is updated this way or another (there may be other requests coming from kernel code in that list). CPU idle time governors are expected to regard the minimum of the global -effective ``PM_QOS_CPU_DMA_LATENCY`` class constraint and the effective -resume latency constraint for the given CPU as the upper limit for the exit -latency of the idle states they can select for that CPU. They should never -select any idle states with exit latency beyond that limit. +(effective) CPU latency limit and the effective resume latency constraint for +the given CPU as the upper limit for the exit latency of the idle states that +they are allowed to select for that CPU. They should never select any idle +states with exit latency beyond that limit. Idle States Control Via Kernel Command Line diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index 67e414e34f37..ad392f3aee06 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -734,10 +734,10 @@ References ========== .. [1] Kristen Accardi, *Balancing Power and Performance in the Linux Kernel*, - http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf + https://events.static.linuxfound.org/sites/events/files/slides/LinuxConEurope_2015.pdf .. [2] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3: System Programming Guide*, - http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html + https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html .. [3] *Advanced Configuration and Power Interface Specification*, https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf diff --git a/Documentation/admin-guide/pm/suspend-flows.rst b/Documentation/admin-guide/pm/suspend-flows.rst new file mode 100644 index 000000000000..c479d7462647 --- /dev/null +++ b/Documentation/admin-guide/pm/suspend-flows.rst @@ -0,0 +1,270 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +========================= +System Suspend Code Flows +========================= + +:Copyright: |copy| 2020 Intel Corporation + +:Author: Rafael J. Wysocki + +At least one global system-wide transition needs to be carried out for the +system to get from the working state into one of the supported +:doc:`sleep states `. Hibernation requires more than one +transition to occur for this purpose, but the other sleep states, commonly +referred to as *system-wide suspend* (or simply *system suspend*) states, need +only one. + +For those sleep states, the transition from the working state of the system into +the target sleep state is referred to as *system suspend* too (in the majority +of cases, whether this means a transition or a sleep state of the system should +be clear from the context) and the transition back from the sleep state into the +working state is referred to as *system resume*. + +The kernel code flows associated with the suspend and resume transitions for +different sleep states of the system are quite similar, but there are some +significant differences between the :ref:`suspend-to-idle ` code flows +and the code flows related to the :ref:`suspend-to-RAM ` and +:ref:`standby ` sleep states. + +The :ref:`suspend-to-RAM ` and :ref:`standby ` sleep states +cannot be implemented without platform support and the difference between them +boils down to the platform-specific actions carried out by the suspend and +resume hooks that need to be provided by the platform driver to make them +available. Apart from that, the suspend and resume code flows for these sleep +states are mostly identical, so they both together will be referred to as +*platform-dependent suspend* states in what follows. + + +.. _s2idle_suspend: + +Suspend-to-idle Suspend Code Flow +================================= + +The following steps are taken in order to transition the system from the working +state to the :ref:`suspend-to-idle ` sleep state: + + 1. Invoking system-wide suspend notifiers. + + Kernel subsystems can register callbacks to be invoked when the suspend + transition is about to occur and when the resume transition has finished. + + That allows them to prepare for the change of the system state and to clean + up after getting back to the working state. + + 2. Freezing tasks. + + Tasks are frozen primarily in order to avoid unchecked hardware accesses + from user space through MMIO regions or I/O registers exposed directly to + it and to prevent user space from entering the kernel while the next step + of the transition is in progress (which might have been problematic for + various reasons). + + All user space tasks are intercepted as though they were sent a signal and + put into uninterruptible sleep until the end of the subsequent system resume + transition. + + The kernel threads that choose to be frozen during system suspend for + specific reasons are frozen subsequently, but they are not intercepted. + Instead, they are expected to periodically check whether or not they need + to be frozen and to put themselves into uninterruptible sleep if so. [Note, + however, that kernel threads can use locking and other concurrency controls + available in kernel space to synchronize themselves with system suspend and + resume, which can be much more precise than the freezing, so the latter is + not a recommended option for kernel threads.] + + 3. Suspending devices and reconfiguring IRQs. + + Devices are suspended in four phases called *prepare*, *suspend*, + *late suspend* and *noirq suspend* (see :ref:`driverapi_pm_devices` for more + information on what exactly happens in each phase). + + Every device is visited in each phase, but typically it is not physically + accessed in more than two of them. + + The runtime PM API is disabled for every device during the *late* suspend + phase and high-level ("action") interrupt handlers are prevented from being + invoked before the *noirq* suspend phase. + + Interrupts are still handled after that, but they are only acknowledged to + interrupt controllers without performing any device-specific actions that + would be triggered in the working state of the system (those actions are + deferred till the subsequent system resume transition as described + `below `_). + + IRQs associated with system wakeup devices are "armed" so that the resume + transition of the system is started when one of them signals an event. + + 4. Freezing the scheduler tick and suspending timekeeping. + + When all devices have been suspended, CPUs enter the idle loop and are put + into the deepest available idle state. While doing that, each of them + "freezes" its own scheduler tick so that the timer events associated with + the tick do not occur until the CPU is woken up by another interrupt source. + + The last CPU to enter the idle state also stops the timekeeping which + (among other things) prevents high resolution timers from triggering going + forward until the first CPU that is woken up restarts the timekeeping. + That allows the CPUs to stay in the deep idle state relatively long in one + go. + + From this point on, the CPUs can only be woken up by non-timer hardware + interrupts. If that happens, they go back to the idle state unless the + interrupt that woke up one of them comes from an IRQ that has been armed for + system wakeup, in which case the system resume transition is started. + + +.. _s2idle_resume: + +Suspend-to-idle Resume Code Flow +================================ + +The following steps are taken in order to transition the system from the +:ref:`suspend-to-idle ` sleep state into the working state: + + 1. Resuming timekeeping and unfreezing the scheduler tick. + + When one of the CPUs is woken up (by a non-timer hardware interrupt), it + leaves the idle state entered in the last step of the preceding suspend + transition, restarts the timekeeping (unless it has been restarted already + by another CPU that woke up earlier) and the scheduler tick on that CPU is + unfrozen. + + If the interrupt that has woken up the CPU was armed for system wakeup, + the system resume transition begins. + + 2. Resuming devices and restoring the working-state configuration of IRQs. + + Devices are resumed in four phases called *noirq resume*, *early resume*, + *resume* and *complete* (see :ref:`driverapi_pm_devices` for more + information on what exactly happens in each phase). + + Every device is visited in each phase, but typically it is not physically + accessed in more than two of them. + + The working-state configuration of IRQs is restored after the *noirq* resume + phase and the runtime PM API is re-enabled for every device whose driver + supports it during the *early* resume phase. + + 3. Thawing tasks. + + Tasks frozen in step 2 of the preceding `suspend `_ + transition are "thawed", which means that they are woken up from the + uninterruptible sleep that they went into at that time and user space tasks + are allowed to exit the kernel. + + 4. Invoking system-wide resume notifiers. + + This is analogous to step 1 of the `suspend `_ transition + and the same set of callbacks is invoked at this point, but a different + "notification type" parameter value is passed to them. + + +Platform-dependent Suspend Code Flow +==================================== + +The following steps are taken in order to transition the system from the working +state to platform-dependent suspend state: + + 1. Invoking system-wide suspend notifiers. + + This step is the same as step 1 of the suspend-to-idle suspend transition + described `above `_. + + 2. Freezing tasks. + + This step is the same as step 2 of the suspend-to-idle suspend transition + described `above `_. + + 3. Suspending devices and reconfiguring IRQs. + + This step is analogous to step 3 of the suspend-to-idle suspend transition + described `above `_, but the arming of IRQs for system + wakeup generally does not have any effect on the platform. + + There are platforms that can go into a very deep low-power state internally + when all CPUs in them are in sufficiently deep idle states and all I/O + devices have been put into low-power states. On those platforms, + suspend-to-idle can reduce system power very effectively. + + On the other platforms, however, low-level components (like interrupt + controllers) need to be turned off in a platform-specific way (implemented + in the hooks provided by the platform driver) to achieve comparable power + reduction. + + That usually prevents in-band hardware interrupts from waking up the system, + which must be done in a special platform-dependent way. Then, the + configuration of system wakeup sources usually starts when system wakeup + devices are suspended and is finalized by the platform suspend hooks later + on. + + 4. Disabling non-boot CPUs. + + On some platforms the suspend hooks mentioned above must run in a one-CPU + configuration of the system (in particular, the hardware cannot be accessed + by any code running in parallel with the platform suspend hooks that may, + and often do, trap into the platform firmware in order to finalize the + suspend transition). + + For this reason, the CPU offline/online (CPU hotplug) framework is used + to take all of the CPUs in the system, except for one (the boot CPU), + offline (typically, the CPUs that have been taken offline go into deep idle + states). + + This means that all tasks are migrated away from those CPUs and all IRQs are + rerouted to the only CPU that remains online. + + 5. Suspending core system components. + + This prepares the core system components for (possibly) losing power going + forward and suspends the timekeeping. + + 6. Platform-specific power removal. + + This is expected to remove power from all of the system components except + for the memory controller and RAM (in order to preserve the contents of the + latter) and some devices designated for system wakeup. + + In many cases control is passed to the platform firmware which is expected + to finalize the suspend transition as needed. + + +Platform-dependent Resume Code Flow +=================================== + +The following steps are taken in order to transition the system from a +platform-dependent suspend state into the working state: + + 1. Platform-specific system wakeup. + + The platform is woken up by a signal from one of the designated system + wakeup devices (which need not be an in-band hardware interrupt) and + control is passed back to the kernel (the working configuration of the + platform may need to be restored by the platform firmware before the + kernel gets control again). + + 2. Resuming core system components. + + The suspend-time configuration of the core system components is restored and + the timekeeping is resumed. + + 3. Re-enabling non-boot CPUs. + + The CPUs disabled in step 4 of the preceding suspend transition are taken + back online and their suspend-time configuration is restored. + + 4. Resuming devices and restoring the working-state configuration of IRQs. + + This step is the same as step 2 of the suspend-to-idle suspend transition + described `above `_. + + 5. Thawing tasks. + + This step is the same as step 3 of the suspend-to-idle suspend transition + described `above `_. + + 6. Invoking system-wide resume notifiers. + + This step is the same as step 4 of the suspend-to-idle suspend transition + described `above `_. diff --git a/Documentation/admin-guide/pm/system-wide.rst b/Documentation/admin-guide/pm/system-wide.rst index 2b1f987b34f0..1a1924d71006 100644 --- a/Documentation/admin-guide/pm/system-wide.rst +++ b/Documentation/admin-guide/pm/system-wide.rst @@ -8,3 +8,4 @@ System-Wide Power Management :maxdepth: 2 sleep-states + suspend-flows diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst index 88f717e59a42..0a38cdf39df1 100644 --- a/Documentation/admin-guide/pm/working-state.rst +++ b/Documentation/admin-guide/pm/working-state.rst @@ -11,4 +11,5 @@ Working-State Power Management intel_idle cpufreq intel_pstate + cpufreq_drivers intel_epb diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index def074807cee..39c95c0e13d3 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -2,262 +2,197 @@ Documentation for /proc/sys/kernel/ =================================== -kernel version 2.2.10 +.. See scripts/check-sysctl-docs to keep this up to date + Copyright (c) 1998, 1999, Rik van Riel Copyright (c) 2009, Shen Feng -For general info and legal blurb, please look in index.rst. +For general info and legal blurb, please look in :doc:`index`. ------------------------------------------------------------------------------ This file contains documentation for the sysctl files in -/proc/sys/kernel/ and is valid for Linux kernel version 2.2. +``/proc/sys/kernel/`` and is valid for Linux kernel version 2.2. The files in this directory can be used to tune and monitor miscellaneous and general things in the operation of the Linux -kernel. Since some of the files _can_ be used to screw up your +kernel. Since some of the files *can* be used to screw up your system, it is advisable to read both documentation and source before actually making adjustments. Currently, these files might (depending on your configuration) -show up in /proc/sys/kernel: - -- acct -- acpi_video_flags -- auto_msgmni -- bootloader_type [ X86 only ] -- bootloader_version [ X86 only ] -- cap_last_cap -- core_pattern -- core_pipe_limit -- core_uses_pid -- ctrl-alt-del -- dmesg_restrict -- domainname -- hostname -- hotplug -- hardlockup_all_cpu_backtrace -- hardlockup_panic -- hung_task_panic -- hung_task_check_count -- hung_task_timeout_secs -- hung_task_check_interval_secs -- hung_task_warnings -- hyperv_record_panic_msg -- kexec_load_disabled -- kptr_restrict -- l2cr [ PPC only ] -- modprobe ==> Documentation/debugging-modules.txt -- modules_disabled -- msg_next_id [ sysv ipc ] -- msgmax -- msgmnb -- msgmni -- nmi_watchdog -- osrelease -- ostype -- overflowgid -- overflowuid -- panic -- panic_on_oops -- panic_on_stackoverflow -- panic_on_unrecovered_nmi -- panic_on_warn -- panic_print -- panic_on_rcu_stall -- perf_cpu_time_max_percent -- perf_event_paranoid -- perf_event_max_stack -- perf_event_mlock_kb -- perf_event_max_contexts_per_stack -- pid_max -- powersave-nap [ PPC only ] -- printk -- printk_delay -- printk_ratelimit -- printk_ratelimit_burst -- pty ==> Documentation/filesystems/devpts.txt -- randomize_va_space -- real-root-dev ==> Documentation/admin-guide/initrd.rst -- reboot-cmd [ SPARC only ] -- rtsig-max -- rtsig-nr -- sched_energy_aware -- seccomp/ ==> Documentation/userspace-api/seccomp_filter.rst -- sem -- sem_next_id [ sysv ipc ] -- sg-big-buff [ generic SCSI device (sg) ] -- shm_next_id [ sysv ipc ] -- shm_rmid_forced -- shmall -- shmmax [ sysv ipc ] -- shmmni -- softlockup_all_cpu_backtrace -- soft_watchdog -- stack_erasing -- stop-a [ SPARC only ] -- sysrq ==> Documentation/admin-guide/sysrq.rst -- sysctl_writes_strict -- tainted ==> Documentation/admin-guide/tainted-kernels.rst -- threads-max -- unknown_nmi_panic -- watchdog -- watchdog_thresh -- version - - -acct: -===== +show up in ``/proc/sys/kernel``: + +.. contents:: :local: + + +acct +==== + +:: -highwater lowwater frequency + highwater lowwater frequency If BSD-style process accounting is enabled these values control its behaviour. If free space on filesystem where the log lives -goes below % accounting suspends. If free space gets -above % accounting resumes. determines +goes below ``lowwater``% accounting suspends. If free space gets +above ``highwater``% accounting resumes. ``frequency`` determines how often do we check the amount of free space (value is in seconds). Default: -4 2 30 -That is, suspend accounting if there left <= 2% free; resume it -if we got >=4%; consider information about amount of free space -valid for 30 seconds. +:: -acpi_video_flags: -================= + 4 2 30 + +That is, suspend accounting if free space drops below 2%; resume it +if it increases to at least 4%; consider information about amount of +free space valid for 30 seconds. -flags -See Doc*/kernel/power/video.txt, it allows mode of video boot to be -set during run time. +acpi_video_flags +================ +See :doc:`/power/video`. This allows the video resume mode to be set, +in a similar fashion to the ``acpi_sleep`` kernel parameter, by +combining the following values: + += ======= +1 s3_bios +2 s3_mode +4 s3_beep += ======= -auto_msgmni: -============ + +auto_msgmni +=========== This variable has no effect and may be removed in future kernel releases. Reading it always returns 0. -Up to Linux 3.17, it enabled/disabled automatic recomputing of msgmni -upon memory add/remove or upon ipc namespace creation/removal. +Up to Linux 3.17, it enabled/disabled automatic recomputing of +`msgmni`_ +upon memory add/remove or upon IPC namespace creation/removal. Echoing "1" into this file enabled msgmni automatic recomputing. -Echoing "0" turned it off. auto_msgmni default value was 1. - +Echoing "0" turned it off. The default value was 1. -bootloader_type: -================ -x86 bootloader identification +bootloader_type (x86 only) +========================== This gives the bootloader type number as indicated by the bootloader, shifted left by 4, and OR'd with the low four bits of the bootloader version. The reason for this encoding is that this used to match the -type_of_loader field in the kernel header; the encoding is kept for +``type_of_loader`` field in the kernel header; the encoding is kept for backwards compatibility. That is, if the full bootloader type number is 0x15 and the full version number is 0x234, this file will contain the value 340 = 0x154. -See the type_of_loader and ext_loader_type fields in -Documentation/x86/boot.rst for additional information. +See the ``type_of_loader`` and ``ext_loader_type`` fields in +:doc:`/x86/boot` for additional information. -bootloader_version: -=================== - -x86 bootloader version +bootloader_version (x86 only) +============================= The complete bootloader version number. In the example above, this file will contain the value 564 = 0x234. -See the type_of_loader and ext_loader_ver fields in -Documentation/x86/boot.rst for additional information. +See the ``type_of_loader`` and ``ext_loader_ver`` fields in +:doc:`/x86/boot` for additional information. -cap_last_cap: -============= +cap_last_cap +============ Highest valid capability of the running kernel. Exports -CAP_LAST_CAP from the kernel. +``CAP_LAST_CAP`` from the kernel. -core_pattern: -============= +core_pattern +============ -core_pattern is used to specify a core dumpfile pattern name. +``core_pattern`` is used to specify a core dumpfile pattern name. * max length 127 characters; default value is "core" -* core_pattern is used as a pattern template for the output filename; - certain string patterns (beginning with '%') are substituted with - their actual values. -* backward compatibility with core_uses_pid: +* ``core_pattern`` is used as a pattern template for the output + filename; certain string patterns (beginning with '%') are + substituted with their actual values. +* backward compatibility with ``core_uses_pid``: - If core_pattern does not include "%p" (default does not) - and core_uses_pid is set, then .PID will be appended to + If ``core_pattern`` does not include "%p" (default does not) + and ``core_uses_pid`` is set, then .PID will be appended to the filename. -* corename format specifiers:: - - % '%' is dropped - %% output one '%' - %p pid - %P global pid (init PID namespace) - %i tid - %I global tid (init PID namespace) - %u uid (in initial user namespace) - %g gid (in initial user namespace) - %d dump mode, matches PR_SET_DUMPABLE and - /proc/sys/fs/suid_dumpable - %s signal number - %t UNIX time of dump - %h hostname - %e executable filename (may be shortened) - %E executable path - % both are dropped +* corename format specifiers + + ======== ========================================== + % '%' is dropped + %% output one '%' + %p pid + %P global pid (init PID namespace) + %i tid + %I global tid (init PID namespace) + %u uid (in initial user namespace) + %g gid (in initial user namespace) + %d dump mode, matches ``PR_SET_DUMPABLE`` and + ``/proc/sys/fs/suid_dumpable`` + %s signal number + %t UNIX time of dump + %h hostname + %e executable filename (may be shortened) + %E executable path + %c maximum size of core file by resource limit RLIMIT_CORE + % both are dropped + ======== ========================================== * If the first character of the pattern is a '|', the kernel will treat the rest of the pattern as a command to run. The core dump will be written to the standard input of that program instead of to a file. -core_pipe_limit: -================ +core_pipe_limit +=============== -This sysctl is only applicable when core_pattern is configured to pipe -core files to a user space helper (when the first character of -core_pattern is a '|', see above). When collecting cores via a pipe -to an application, it is occasionally useful for the collecting -application to gather data about the crashing process from its -/proc/pid directory. In order to do this safely, the kernel must wait -for the collecting process to exit, so as not to remove the crashing -processes proc files prematurely. This in turn creates the -possibility that a misbehaving userspace collecting process can block -the reaping of a crashed process simply by never exiting. This sysctl -defends against that. It defines how many concurrent crashing -processes may be piped to user space applications in parallel. If -this value is exceeded, then those crashing processes above that value -are noted via the kernel log and their cores are skipped. 0 is a -special value, indicating that unlimited processes may be captured in -parallel, but that no waiting will take place (i.e. the collecting -process is not guaranteed access to /proc//). This -value defaults to 0. - - -core_uses_pid: -============== +This sysctl is only applicable when `core_pattern`_ is configured to +pipe core files to a user space helper (when the first character of +``core_pattern`` is a '|', see above). +When collecting cores via a pipe to an application, it is occasionally +useful for the collecting application to gather data about the +crashing process from its ``/proc/pid`` directory. +In order to do this safely, the kernel must wait for the collecting +process to exit, so as not to remove the crashing processes proc files +prematurely. +This in turn creates the possibility that a misbehaving userspace +collecting process can block the reaping of a crashed process simply +by never exiting. +This sysctl defends against that. +It defines how many concurrent crashing processes may be piped to user +space applications in parallel. +If this value is exceeded, then those crashing processes above that +value are noted via the kernel log and their cores are skipped. +0 is a special value, indicating that unlimited processes may be +captured in parallel, but that no waiting will take place (i.e. the +collecting process is not guaranteed access to ``/proc//``). +This value defaults to 0. + + +core_uses_pid +============= The default coredump filename is "core". By setting -core_uses_pid to 1, the coredump filename becomes core.PID. -If core_pattern does not include "%p" (default does not) -and core_uses_pid is set, then .PID will be appended to +``core_uses_pid`` to 1, the coredump filename becomes core.PID. +If `core_pattern`_ does not include "%p" (default does not) +and ``core_uses_pid`` is set, then .PID will be appended to the filename. -ctrl-alt-del: -============= +ctrl-alt-del +============ When the value in this file is 0, ctrl-alt-del is trapped and -sent to the init(1) program to handle a graceful restart. +sent to the ``init(1)`` program to handle a graceful restart. When, however, the value is > 0, Linux's reaction to a Vulcan Nerve Pinch (tm) will be an immediate reboot, without even syncing its dirty buffers. @@ -269,21 +204,22 @@ Note: to decide what to do with it. -dmesg_restrict: -=============== +dmesg_restrict +============== This toggle indicates whether unprivileged users are prevented -from using dmesg(8) to view messages from the kernel's log buffer. -When dmesg_restrict is set to (0) there are no restrictions. When -dmesg_restrict is set set to (1), users must have CAP_SYSLOG to use -dmesg(8). +from using ``dmesg(8)`` to view messages from the kernel's log +buffer. +When ``dmesg_restrict`` is set to 0 there are no restrictions. +When ``dmesg_restrict`` is set set to 1, users must have +``CAP_SYSLOG`` to use ``dmesg(8)``. -The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the -default value of dmesg_restrict. +The kernel config option ``CONFIG_SECURITY_DMESG_RESTRICT`` sets the +default value of ``dmesg_restrict``. -domainname & hostname: -====================== +domainname & hostname +===================== These files can be used to set the NIS/YP domainname and the hostname of your box in exactly the same way as the commands @@ -302,167 +238,206 @@ hostname "darkstar" and DNS (Internet Domain Name Server) domainname "frop.org", not to be confused with the NIS (Network Information Service) or YP (Yellow Pages) domainname. These two domain names are in general different. For a detailed discussion -see the hostname(1) man page. +see the ``hostname(1)`` man page. -hardlockup_all_cpu_backtrace: -============================= +hardlockup_all_cpu_backtrace +============================ This value controls the hard lockup detector behavior when a hard lockup condition is detected as to whether or not to gather further debug information. If enabled, arch-specific all-CPU stack dumping will be initiated. -0: do nothing. This is the default behavior. += ============================================ +0 Do nothing. This is the default behavior. +1 On detection capture more debug information. += ============================================ -1: on detection capture more debug information. - -hardlockup_panic: -================= +hardlockup_panic +================ This parameter can be used to control whether the kernel panics when a hard lockup is detected. - 0 - don't panic on hard lockup - 1 - panic on hard lockup += =========================== +0 Don't panic on hard lockup. +1 Panic on hard lockup. += =========================== -See Documentation/admin-guide/lockup-watchdogs.rst for more information. This can -also be set using the nmi_watchdog kernel parameter. +See :doc:`/admin-guide/lockup-watchdogs` for more information. +This can also be set using the nmi_watchdog kernel parameter. -hotplug: -======== +hotplug +======= Path for the hotplug policy agent. -Default value is "/sbin/hotplug". +Default value is "``/sbin/hotplug``". -hung_task_panic: -================ +hung_task_panic +=============== Controls the kernel's behavior when a hung task is detected. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. - -0: continue operation. This is the default behavior. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -1: panic immediately. += ================================================= +0 Continue operation. This is the default behavior. +1 Panic immediately. += ================================================= -hung_task_check_count: -====================== +hung_task_check_count +===================== The upper bound on the number of tasks that are checked. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -hung_task_timeout_secs: -======================= +hung_task_timeout_secs +====================== When a task in D state did not get scheduled for more than this value report a warning. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -0: means infinite timeout - no checking done. +0 means infinite timeout, no checking is done. -Possible values to set are in range {0..LONG_MAX/HZ}. +Possible values to set are in range {0:``LONG_MAX``/``HZ``}. -hung_task_check_interval_secs: -============================== +hung_task_check_interval_secs +============================= Hung task check interval. If hung task checking is enabled -(see hung_task_timeout_secs), the check is done every -hung_task_check_interval_secs seconds. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. +(see `hung_task_timeout_secs`_), the check is done every +``hung_task_check_interval_secs`` seconds. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -0 (default): means use hung_task_timeout_secs as checking interval. -Possible values to set are in range {0..LONG_MAX/HZ}. +0 (default) means use ``hung_task_timeout_secs`` as checking +interval. +Possible values to set are in range {0:``LONG_MAX``/``HZ``}. -hung_task_warnings: -=================== + +hung_task_warnings +================== The maximum number of warnings to report. During a check interval if a hung task is detected, this value is decreased by 1. When this value reaches 0, no more warnings will be reported. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -1: report an infinite number of warnings. -hyperv_record_panic_msg: -======================== +hyperv_record_panic_msg +======================= Controls whether the panic kmsg data should be reported to Hyper-V. -0: do not report panic kmsg data. - -1: report the panic kmsg data. This is the default behavior. += ========================================================= +0 Do not report panic kmsg data. +1 Report the panic kmsg data. This is the default behavior. += ========================================================= -kexec_load_disabled: -==================== +kexec_load_disabled +=================== -A toggle indicating if the kexec_load syscall has been disabled. This -value defaults to 0 (false: kexec_load enabled), but can be set to 1 -(true: kexec_load disabled). Once true, kexec can no longer be used, and -the toggle cannot be set back to false. This allows a kexec image to be -loaded before disabling the syscall, allowing a system to set up (and -later use) an image without it being altered. Generally used together -with the "modules_disabled" sysctl. +A toggle indicating if the ``kexec_load`` syscall has been disabled. +This value defaults to 0 (false: ``kexec_load`` enabled), but can be +set to 1 (true: ``kexec_load`` disabled). +Once true, kexec can no longer be used, and the toggle cannot be set +back to false. +This allows a kexec image to be loaded before disabling the syscall, +allowing a system to set up (and later use) an image without it being +altered. +Generally used together with the `modules_disabled`_ sysctl. -kptr_restrict: -============== +kptr_restrict +============= This toggle indicates whether restrictions are placed on -exposing kernel addresses via /proc and other interfaces. +exposing kernel addresses via ``/proc`` and other interfaces. + +When ``kptr_restrict`` is set to 0 (the default) the address is hashed +before printing. +(This is the equivalent to %p.) + +When ``kptr_restrict`` is set to 1, kernel pointers printed using the +%pK format specifier will be replaced with 0s unless the user has +``CAP_SYSLOG`` and effective user and group ids are equal to the real +ids. +This is because %pK checks are done at read() time rather than open() +time, so if permissions are elevated between the open() and the read() +(e.g via a setuid binary) then %pK will not leak kernel pointers to +unprivileged users. +Note, this is a temporary solution only. +The correct long-term solution is to do the permission checks at +open() time. +Consider removing world read permissions from files that use %pK, and +using `dmesg_restrict`_ to protect against uses of %pK in ``dmesg(8)`` +if leaking kernel pointer values to unprivileged users is a concern. + +When ``kptr_restrict`` is set to 2, kernel pointers printed using +%pK will be replaced with 0s regardless of privileges. + + +modprobe +======== -When kptr_restrict is set to 0 (the default) the address is hashed before -printing. (This is the equivalent to %p.) +This gives the full path of the modprobe command which the kernel will +use to load modules. This can be used to debug module loading +requests:: -When kptr_restrict is set to (1), kernel pointers printed using the %pK -format specifier will be replaced with 0's unless the user has CAP_SYSLOG -and effective user and group ids are equal to the real ids. This is -because %pK checks are done at read() time rather than open() time, so -if permissions are elevated between the open() and the read() (e.g via -a setuid binary) then %pK will not leak kernel pointers to unprivileged -users. Note, this is a temporary solution only. The correct long-term -solution is to do the permission checks at open() time. Consider removing -world read permissions from files that use %pK, and using dmesg_restrict -to protect against uses of %pK in dmesg(8) if leaking kernel pointer -values to unprivileged users is a concern. + echo '#! /bin/sh' > /tmp/modprobe + echo 'echo "$@" >> /tmp/modprobe.log' >> /tmp/modprobe + echo 'exec /sbin/modprobe "$@"' >> /tmp/modprobe + chmod a+x /tmp/modprobe + echo /tmp/modprobe > /proc/sys/kernel/modprobe -When kptr_restrict is set to (2), kernel pointers printed using -%pK will be replaced with 0's regardless of privileges. +This only applies when the *kernel* is requesting that the module be +loaded; it won't have any effect if the module is being loaded +explicitly using ``modprobe`` from userspace. -l2cr: (PPC only) +modules_disabled ================ -This flag controls the L2 cache of G3 processor boards. If -0, the cache is disabled. Enabled if nonzero. - - -modules_disabled: -================= - A toggle value indicating if modules are allowed to be loaded in an otherwise modular kernel. This toggle defaults to off (0), but can be set true (1). Once true, modules can be neither loaded nor unloaded, and the toggle cannot be set back -to false. Generally used with the "kexec_load_disabled" toggle. +to false. Generally used with the `kexec_load_disabled`_ toggle. -msg_next_id, sem_next_id, and shm_next_id: -========================================== +.. _msgmni: + +msgmax, msgmnb, and msgmni +========================== + +``msgmax`` is the maximum size of an IPC message, in bytes. 8192 by +default (``MSGMAX``). + +``msgmnb`` is the maximum size of an IPC queue, in bytes. 16384 by +default (``MSGMNB``). + +``msgmni`` is the maximum number of IPC queues. 32000 by default +(``MSGMNI``). + + +msg_next_id, sem_next_id, and shm_next_id (System V IPC) +======================================================== These three toggles allows to specify desired id for next allocated IPC object: message, semaphore or shared memory respectively. By default they are equal to -1, which means generic allocation logic. -Possible values to set are in range {0..INT_MAX}. +Possible values to set are in range {0:``INT_MAX``}. Notes: 1) kernel doesn't guarantee, that new object will have desired id. So, @@ -471,16 +446,38 @@ Notes: successful IPC object allocation. If an IPC object allocation syscall fails, it is undefined if the value remains unmodified or is reset to -1. +modprobe: +========= -nmi_watchdog: -============= +The path to the usermode helper for autoloading kernel modules, by +default "/sbin/modprobe". This binary is executed when the kernel +requests a module. For example, if userspace passes an unknown +filesystem type to mount(), then the kernel will automatically request +the corresponding filesystem module by executing this usermode helper. +This usermode helper should insert the needed module into the kernel. + +This sysctl only affects module autoloading. It has no effect on the +ability to explicitly insert modules. + +If this sysctl is set to the empty string, then module autoloading is +completely disabled. The kernel will not try to execute a usermode +helper at all, nor will it call the kernel_module_request LSM hook. + +If CONFIG_STATIC_USERMODEHELPER=y is set in the kernel configuration, +then the configured static usermode helper overrides this sysctl, +except that the empty string is still accepted to completely disable +module autoloading as described above. + +nmi_watchdog +============ This parameter can be used to control the NMI watchdog (i.e. the hard lockup detector) on x86 systems. -0 - disable the hard lockup detector - -1 - enable the hard lockup detector += ================================= +0 Disable the hard lockup detector. +1 Enable the hard lockup detector. += ================================= The hard lockup detector monitors each CPU for its ability to respond to timer interrupts. The mechanism utilizes CPU performance counter registers @@ -492,11 +489,11 @@ in a KVM virtual machine. This default can be overridden by adding:: nmi_watchdog=1 -to the guest kernel command line (see Documentation/admin-guide/kernel-parameters.rst). +to the guest kernel command line (see :doc:`/admin-guide/kernel-parameters`). -numa_balancing: -=============== +numa_balancing +============== Enables/disables automatic page fault based NUMA memory balancing. Memory is moved automatically to nodes @@ -514,9 +511,10 @@ ideally is offset by improved memory locality but there is no universal guarantee. If the target workload is already bound to NUMA nodes then this feature should be disabled. Otherwise, if the system overhead from the feature is too high then the rate the kernel samples for NUMA hinting -faults may be controlled by the numa_balancing_scan_period_min_ms, +faults may be controlled by the `numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, -numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls. +numa_balancing_scan_size_mb`_, and numa_balancing_settle_count sysctls. + numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb =============================================================================================================================== @@ -542,23 +540,23 @@ workload pattern changes and minimises performance impact due to remote memory accesses. These sysctls control the thresholds for scan delays and the number of pages scanned. -numa_balancing_scan_period_min_ms is the minimum time in milliseconds to +``numa_balancing_scan_period_min_ms`` is the minimum time in milliseconds to scan a tasks virtual memory. It effectively controls the maximum scanning rate for each task. -numa_balancing_scan_delay_ms is the starting "scan delay" used for a task +``numa_balancing_scan_delay_ms`` is the starting "scan delay" used for a task when it initially forks. -numa_balancing_scan_period_max_ms is the maximum time in milliseconds to +``numa_balancing_scan_period_max_ms`` is the maximum time in milliseconds to scan a tasks virtual memory. It effectively controls the minimum scanning rate for each task. -numa_balancing_scan_size_mb is how many megabytes worth of pages are +``numa_balancing_scan_size_mb`` is how many megabytes worth of pages are scanned for a given scan. -osrelease, ostype & version: -============================ +osrelease, ostype & version +=========================== :: @@ -569,15 +567,16 @@ osrelease, ostype & version: # cat version #5 Wed Feb 25 21:49:24 MET 1998 -The files osrelease and ostype should be clear enough. Version +The files ``osrelease`` and ``ostype`` should be clear enough. +``version`` needs a little more clarification however. The '#5' means that this is the fifth kernel built from this source base and the date behind it indicates the time the kernel was built. The only way to tune these values is to rebuild the kernel :-) -overflowgid & overflowuid: -========================== +overflowgid & overflowuid +========================= if your architecture did not always support 32-bit UIDs (i.e. arm, i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to @@ -588,108 +587,119 @@ These sysctls allow you to change the value of the fixed UID and GID. The default is 65534. +panic +===== + +The value in this file determines the behaviour of the kernel on a panic: -====== -The value in this file represents the number of seconds the kernel -waits before rebooting on a panic. When you use the software watchdog, -the recommended setting is 60. +* if zero, the kernel will loop forever; +* if negative, the kernel will reboot immediately; +* if positive, the kernel will reboot after the corresponding number + of seconds. +When you use the software watchdog, the recommended setting is 60. -panic_on_io_nmi: -================ + +panic_on_io_nmi +=============== Controls the kernel's behavior when a CPU receives an NMI caused by an IO error. -0: try to continue operation (default) - -1: panic immediately. The IO error triggered an NMI. This indicates a - serious system condition which could result in IO data corruption. - Rather than continuing, panicking might be a better choice. Some - servers issue this sort of NMI when the dump button is pushed, - and you can use this option to take a crash dump. += ================================================================== +0 Try to continue operation (default). +1 Panic immediately. The IO error triggered an NMI. This indicates a + serious system condition which could result in IO data corruption. + Rather than continuing, panicking might be a better choice. Some + servers issue this sort of NMI when the dump button is pushed, + and you can use this option to take a crash dump. += ================================================================== -panic_on_oops: -============== +panic_on_oops +============= Controls the kernel's behaviour when an oops or BUG is encountered. -0: try to continue operation - -1: panic immediately. If the `panic` sysctl is also non-zero then the - machine will be rebooted. += =================================================================== +0 Try to continue operation. +1 Panic immediately. If the `panic` sysctl is also non-zero then the + machine will be rebooted. += =================================================================== -panic_on_stackoverflow: -======================= +panic_on_stackoverflow +====================== Controls the kernel's behavior when detecting the overflows of kernel, IRQ and exception stacks except a user stack. -This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled. - -0: try to continue operation. +This file shows up if ``CONFIG_DEBUG_STACKOVERFLOW`` is enabled. -1: panic immediately. += ========================== +0 Try to continue operation. +1 Panic immediately. += ========================== -panic_on_unrecovered_nmi: -========================= +panic_on_unrecovered_nmi +======================== The default Linux behaviour on an NMI of either memory or unknown is to continue operation. For many environments such as scientific computing it is preferable that the box is taken out and the error dealt with than an uncorrected parity/ECC error get propagated. -A small number of systems do generate NMI's for bizarre random reasons +A small number of systems do generate NMIs for bizarre random reasons such as power management so the default is off. That sysctl works like the existing panic controls already in that directory. -panic_on_warn: -============== +panic_on_warn +============= Calls panic() in the WARN() path when set to 1. This is useful to avoid a kernel rebuild when attempting to kdump at the location of a WARN(). -0: only WARN(), default behaviour. += ================================================ +0 Only WARN(), default behaviour. +1 Call panic() after printing out WARN() location. += ================================================ -1: call panic() after printing out WARN() location. - -panic_print: -============ +panic_print +=========== Bitmask for printing system info when panic happens. User can chose combination of the following bits: -===== ======================================== +===== ============================================ bit 0 print all tasks info bit 1 print system memory info bit 2 print timer info -bit 3 print locks info if CONFIG_LOCKDEP is on +bit 3 print locks info if ``CONFIG_LOCKDEP`` is on bit 4 print ftrace buffer -===== ======================================== +===== ============================================ So for example to print tasks and memory info on panic, user can:: echo 3 > /proc/sys/kernel/panic_print -panic_on_rcu_stall: -=================== +panic_on_rcu_stall +================== When set to 1, calls panic() after RCU stall detection messages. This is useful to define the root cause of RCU stalls using a vmcore. -0: do not panic() when RCU stall takes place, default behavior. += ============================================================ +0 Do not panic() when RCU stall takes place, default behavior. +1 panic() after printing RCU stall messages. += ============================================================ -1: panic() after printing RCU stall messages. - -perf_cpu_time_max_percent: -========================== +perf_cpu_time_max_percent +========================= Hints to the kernel how much CPU time it should be allowed to use to handle perf sampling events. If the perf subsystem @@ -702,171 +712,179 @@ unexpectedly take too long to execute, the NMIs can become stacked up next to each other so much that nothing else is allowed to execute. -0: - disable the mechanism. Do not monitor or correct perf's - sampling rate no matter how CPU time it takes. +===== ======================================================== +0 Disable the mechanism. Do not monitor or correct perf's + sampling rate no matter how CPU time it takes. -1-100: - attempt to throttle perf's sample rate to this - percentage of CPU. Note: the kernel calculates an - "expected" length of each sample event. 100 here means - 100% of that expected length. Even if this is set to - 100, you may still see sample throttling if this - length is exceeded. Set to 0 if you truly do not care - how much CPU is consumed. +1-100 Attempt to throttle perf's sample rate to this + percentage of CPU. Note: the kernel calculates an + "expected" length of each sample event. 100 here means + 100% of that expected length. Even if this is set to + 100, you may still see sample throttling if this + length is exceeded. Set to 0 if you truly do not care + how much CPU is consumed. +===== ======================================================== -perf_event_paranoid: -==================== +perf_event_paranoid +=================== Controls use of the performance events system by unprivileged users (without CAP_SYS_ADMIN). The default value is 2. === ================================================================== - -1 Allow use of (almost) all events by all users + -1 Allow use of (almost) all events by all users. - Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK + Ignore mlock limit after perf_event_mlock_kb without + ``CAP_IPC_LOCK``. ->=0 Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN +>=0 Disallow ftrace function tracepoint by users without + ``CAP_SYS_ADMIN``. - Disallow raw tracepoint access by users without CAP_SYS_ADMIN + Disallow raw tracepoint access by users without ``CAP_SYS_ADMIN``. ->=1 Disallow CPU event access by users without CAP_SYS_ADMIN +>=1 Disallow CPU event access by users without ``CAP_SYS_ADMIN``. ->=2 Disallow kernel profiling by users without CAP_SYS_ADMIN +>=2 Disallow kernel profiling by users without ``CAP_SYS_ADMIN``. === ================================================================== -perf_event_max_stack: -===================== +perf_event_max_stack +==================== -Controls maximum number of stack frames to copy for (attr.sample_type & -PERF_SAMPLE_CALLCHAIN) configured events, for instance, when using -'perf record -g' or 'perf trace --call-graph fp'. +Controls maximum number of stack frames to copy for (``attr.sample_type & +PERF_SAMPLE_CALLCHAIN``) configured events, for instance, when using +'``perf record -g``' or '``perf trace --call-graph fp``'. This can only be done when no events are in use that have callchains -enabled, otherwise writing to this file will return -EBUSY. +enabled, otherwise writing to this file will return ``-EBUSY``. The default value is 127. -perf_event_mlock_kb: -==================== +perf_event_mlock_kb +=================== Control size of per-cpu ring buffer not counted agains mlock limit. The default value is 512 + 1 page -perf_event_max_contexts_per_stack: -================================== +perf_event_max_contexts_per_stack +================================= Controls maximum number of stack frame context entries for -(attr.sample_type & PERF_SAMPLE_CALLCHAIN) configured events, for -instance, when using 'perf record -g' or 'perf trace --call-graph fp'. +(``attr.sample_type & PERF_SAMPLE_CALLCHAIN``) configured events, for +instance, when using '``perf record -g``' or '``perf trace --call-graph fp``'. This can only be done when no events are in use that have callchains -enabled, otherwise writing to this file will return -EBUSY. +enabled, otherwise writing to this file will return ``-EBUSY``. The default value is 8. -pid_max: -======== +pid_max +======= PID allocation wrap value. When the kernel's next PID value reaches this value, it wraps back to a minimum PID value. -PIDs of value pid_max or larger are not allocated. +PIDs of value ``pid_max`` or larger are not allocated. -ns_last_pid: -============ +ns_last_pid +=========== The last pid allocated in the current (the one task using this sysctl lives in) pid namespace. When selecting a pid for a next task on fork kernel tries to allocate a number starting from this one. -powersave-nap: (PPC only) -========================= +powersave-nap (PPC only) +======================== If set, Linux-PPC will use the 'nap' mode of powersaving, otherwise the 'doze' mode will be used. + ============================================================== -printk: -======= +printk +====== -The four values in printk denote: console_loglevel, -default_message_loglevel, minimum_console_loglevel and -default_console_loglevel respectively. +The four values in printk denote: ``console_loglevel``, +``default_message_loglevel``, ``minimum_console_loglevel`` and +``default_console_loglevel`` respectively. These values influence printk() behavior when printing or -logging error messages. See 'man 2 syslog' for more info on +logging error messages. See '``man 2 syslog``' for more info on the different loglevels. -- console_loglevel: - messages with a higher priority than - this will be printed to the console -- default_message_loglevel: - messages without an explicit priority - will be printed with this priority -- minimum_console_loglevel: - minimum (highest) value to which - console_loglevel can be set -- default_console_loglevel: - default value for console_loglevel +======================== ===================================== +console_loglevel messages with a higher priority than + this will be printed to the console +default_message_loglevel messages without an explicit priority + will be printed with this priority +minimum_console_loglevel minimum (highest) value to which + console_loglevel can be set +default_console_loglevel default value for console_loglevel +======================== ===================================== -printk_delay: -============= +printk_delay +============ -Delay each printk message in printk_delay milliseconds +Delay each printk message in ``printk_delay`` milliseconds Value from 0 - 10000 is allowed. -printk_ratelimit: -================= +printk_ratelimit +================ -Some warning messages are rate limited. printk_ratelimit specifies +Some warning messages are rate limited. ``printk_ratelimit`` specifies the minimum length of time between these messages (in seconds). The default value is 5 seconds. A value of 0 will disable rate limiting. -printk_ratelimit_burst: -======================= +printk_ratelimit_burst +====================== -While long term we enforce one message per printk_ratelimit +While long term we enforce one message per `printk_ratelimit`_ seconds, we do allow a burst of messages to pass through. -printk_ratelimit_burst specifies the number of messages we can +``printk_ratelimit_burst`` specifies the number of messages we can send before ratelimiting kicks in. The default value is 10 messages. -printk_devkmsg: -=============== - -Control the logging to /dev/kmsg from userspace: - -ratelimit: - default, ratelimited +printk_devkmsg +============== -on: unlimited logging to /dev/kmsg from userspace +Control the logging to ``/dev/kmsg`` from userspace: -off: logging to /dev/kmsg disabled +========= ============================================= +ratelimit default, ratelimited +on unlimited logging to /dev/kmsg from userspace +off logging to /dev/kmsg disabled +========= ============================================= -The kernel command line parameter printk.devkmsg= overrides this and is +The kernel command line parameter ``printk.devkmsg=`` overrides this and is a one-time setting until next reboot: once set, it cannot be changed by this sysctl interface anymore. +============================================================== -randomize_va_space: -=================== + +pty +=== + +See Documentation/filesystems/devpts.txt. + + +randomize_va_space +================== This option can be used to select the type of process address space randomization that is used in the system, for architectures @@ -881,10 +899,10 @@ that support this feature. This, among other things, implies that shared libraries will be loaded to random addresses. Also for PIE-linked binaries, the location of code start is randomized. This is the default if the - CONFIG_COMPAT_BRK option is enabled. + ``CONFIG_COMPAT_BRK`` option is enabled. 2 Additionally enable heap randomization. This is the default if - CONFIG_COMPAT_BRK is disabled. + ``CONFIG_COMPAT_BRK`` is disabled. There are a few legacy applications out there (such as some ancient versions of libc.so.5 from 1996) that assume that brk area starts @@ -894,31 +912,27 @@ that support this feature. systems it is safe to choose full randomization. Systems with ancient and/or broken binaries should be configured - with CONFIG_COMPAT_BRK enabled, which excludes the heap from process + with ``CONFIG_COMPAT_BRK`` enabled, which excludes the heap from process address space randomization. == =========================================================================== -reboot-cmd: (Sparc only) -======================== - -??? This seems to be a way to give an argument to the Sparc -ROM/Flash boot loader. Maybe to tell it what to do after -rebooting. ??? +real-root-dev +============= +See :doc:`/admin-guide/initrd`. -rtsig-max & rtsig-nr: -===================== -The file rtsig-max can be used to tune the maximum number -of POSIX realtime (queued) signals that can be outstanding -in the system. +reboot-cmd (SPARC only) +======================= -rtsig-nr shows the number of RT signals currently queued. +??? This seems to be a way to give an argument to the Sparc +ROM/Flash boot loader. Maybe to tell it what to do after +rebooting. ??? -sched_energy_aware: -=================== +sched_energy_aware +================== Enables/disables Energy Aware Scheduling (EAS). EAS starts automatically on platforms where it can run (that is, @@ -928,75 +942,88 @@ requirements for EAS but you do not want to use it, change this value to 0. -sched_schedstats: -================= +sched_schedstats +================ Enables/disables scheduler statistics. Enabling this feature incurs a small amount of overhead in the scheduler but is useful for debugging and performance tuning. -sg-big-buff: -============ +seccomp +======= + +See :doc:`/userspace-api/seccomp_filter`. + + +sg-big-buff +=========== This file shows the size of the generic SCSI (sg) buffer. You can't tune it just yet, but you could change it on -compile time by editing include/scsi/sg.h and changing -the value of SG_BIG_BUFF. +compile time by editing ``include/scsi/sg.h`` and changing +the value of ``SG_BIG_BUFF``. There shouldn't be any reason to change this value. If you can come up with one, you probably know what you are doing anyway :) -shmall: -======= +shmall +====== This parameter sets the total amount of shared memory pages that -can be used system wide. Hence, SHMALL should always be at least -ceil(shmmax/PAGE_SIZE). +can be used system wide. Hence, ``shmall`` should always be at least +``ceil(shmmax/PAGE_SIZE)``. -If you are not sure what the default PAGE_SIZE is on your Linux -system, you can run the following command: +If you are not sure what the default ``PAGE_SIZE`` is on your Linux +system, you can run the following command:: # getconf PAGE_SIZE -shmmax: -======= +shmmax +====== This value can be used to query and set the run time limit on the maximum shared memory segment size that can be created. Shared memory segments up to 1Gb are now supported in the -kernel. This value defaults to SHMMAX. +kernel. This value defaults to ``SHMMAX``. -shm_rmid_forced: -================ +shmmni +====== + +This value determines the maximum number of shared memory segments. +4096 by default (``SHMMNI``). + + +shm_rmid_forced +=============== Linux lets you set resource limits, including how much memory one -process can consume, via setrlimit(2). Unfortunately, shared memory +process can consume, via ``setrlimit(2)``. Unfortunately, shared memory segments are allowed to exist without association with any process, and thus might not be counted against any resource limits. If enabled, shared memory segments are automatically destroyed when their attach count becomes zero after a detach or a process termination. It will also destroy segments that were created, but never attached to, on exit -from the process. The only use left for IPC_RMID is to immediately +from the process. The only use left for ``IPC_RMID`` is to immediately destroy an unattached segment. Of course, this breaks the way things are defined, so some applications might stop working. Note that this feature will do you no good unless you also configure your resource -limits (in particular, RLIMIT_AS and RLIMIT_NPROC). Most systems don't +limits (in particular, ``RLIMIT_AS`` and ``RLIMIT_NPROC``). Most systems don't need this. Note that if you change this from 0 to 1, already created segments without users and with a dead originative process will be destroyed. -sysctl_writes_strict: -===================== +sysctl_writes_strict +==================== Control how file position affects the behavior of updating sysctl values -via the /proc/sys interface: +via the ``/proc/sys`` interface: == ====================================================================== -1 Legacy per-write sysctl value handling, with no printk warnings. @@ -1013,8 +1040,8 @@ via the /proc/sys interface: == ====================================================================== -softlockup_all_cpu_backtrace: -============================= +softlockup_all_cpu_backtrace +============================ This value controls the soft lockup detector thread's behavior when a soft lockup condition is detected as to whether or not @@ -1024,43 +1051,80 @@ be issued an NMI and instructed to capture stack trace. This feature is only applicable for architectures which support NMI. -0: do nothing. This is the default behavior. += ============================================ +0 Do nothing. This is the default behavior. +1 On detection capture more debug information. += ============================================ -1: on detection capture more debug information. +softlockup_panic +================= -soft_watchdog: -============== +This parameter can be used to control whether the kernel panics +when a soft lockup is detected. -This parameter can be used to control the soft lockup detector. += ============================================ +0 Don't panic on soft lockup. +1 Panic on soft lockup. += ============================================ - 0 - disable the soft lockup detector +This can also be set using the softlockup_panic kernel parameter. - 1 - enable the soft lockup detector + +soft_watchdog +============= + +This parameter can be used to control the soft lockup detector. + += ================================= +0 Disable the soft lockup detector. +1 Enable the soft lockup detector. += ================================= The soft lockup detector monitors CPUs for threads that are hogging the CPUs without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads from running. The mechanism depends on the CPUs ability to respond to timer interrupts which are needed for the 'watchdog/N' threads to be woken up by -the watchdog timer function, otherwise the NMI watchdog - if enabled - can +the watchdog timer function, otherwise the NMI watchdog — if enabled — can detect a hard lockup condition. -stack_erasing: -============== +stack_erasing +============= This parameter can be used to control kernel stack erasing at the end -of syscalls for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK. +of syscalls for kernels built with ``CONFIG_GCC_PLUGIN_STACKLEAK``. That erasing reduces the information which kernel stack leak bugs can reveal and blocks some uninitialized stack variable attacks. The tradeoff is the performance impact: on a single CPU system kernel compilation sees a 1% slowdown, other systems and workloads may vary. - 0: kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. += ==================================================================== +0 Kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. +1 Kernel stack erasing is enabled (default), it is performed before + returning to the userspace at the end of syscalls. += ==================================================================== - 1: kernel stack erasing is enabled (default), it is performed before - returning to the userspace at the end of syscalls. + +stop-a (SPARC only) +=================== + +Controls Stop-A: + += ==================================== +0 Stop-A has no effect. +1 Stop-A breaks to the PROM (default). += ==================================== + +Stop-A is always enabled on a panic, so that the user can return to +the boot PROM. + + +sysrq +===== + +See :doc:`/admin-guide/sysrq`. tainted @@ -1090,30 +1154,30 @@ ORed together. The letters are seen in "Tainted" line of Oops reports. 131072 `(T)` The kernel was built with the struct randomization plugin ====== ===== ============================================================== -See Documentation/admin-guide/tainted-kernels.rst for more information. +See :doc:`/admin-guide/tainted-kernels` for more information. -threads-max: -============ +threads-max +=========== This value controls the maximum number of threads that can be created -using fork(). +using ``fork()``. During initialization the kernel sets this value such that even if the maximum number of threads is created, the thread structures occupy only a part (1/8th) of the available RAM pages. -The minimum value that can be written to threads-max is 1. +The minimum value that can be written to ``threads-max`` is 1. -The maximum value that can be written to threads-max is given by the -constant FUTEX_TID_MASK (0x3fffffff). +The maximum value that can be written to ``threads-max`` is given by the +constant ``FUTEX_TID_MASK`` (0x3fffffff). -If a value outside of this range is written to threads-max an error -EINVAL occurs. +If a value outside of this range is written to ``threads-max`` an +``EINVAL`` error occurs. -unknown_nmi_panic: -================== +unknown_nmi_panic +================= The value in this file affects behavior of handling NMI. When the value is non-zero, unknown NMI is trapped and then panic occurs. At @@ -1123,37 +1187,39 @@ NMI switch that most IA32 servers have fires unknown NMI up, for example. If a system hangs up, try pressing the NMI switch. -watchdog: -========= +watchdog +======== This parameter can be used to disable or enable the soft lockup detector -_and_ the NMI watchdog (i.e. the hard lockup detector) at the same time. - - 0 - disable both lockup detectors +*and* the NMI watchdog (i.e. the hard lockup detector) at the same time. - 1 - enable both lockup detectors += ============================== +0 Disable both lockup detectors. +1 Enable both lockup detectors. += ============================== The soft lockup detector and the NMI watchdog can also be disabled or -enabled individually, using the soft_watchdog and nmi_watchdog parameters. -If the watchdog parameter is read, for example by executing:: +enabled individually, using the ``soft_watchdog`` and ``nmi_watchdog`` +parameters. +If the ``watchdog`` parameter is read, for example by executing:: cat /proc/sys/kernel/watchdog -the output of this command (0 or 1) shows the logical OR of soft_watchdog -and nmi_watchdog. +the output of this command (0 or 1) shows the logical OR of +``soft_watchdog`` and ``nmi_watchdog``. -watchdog_cpumask: -================= +watchdog_cpumask +================ This value can be used to control on which cpus the watchdog may run. -The default cpumask is all possible cores, but if NO_HZ_FULL is +The default cpumask is all possible cores, but if ``NO_HZ_FULL`` is enabled in the kernel config, and cores are specified with the -nohz_full= boot argument, those cores are excluded by default. +``nohz_full=`` boot argument, those cores are excluded by default. Offline cores can be included in this mask, and if the core is later brought online, the watchdog will be started based on the mask value. -Typically this value would only be touched in the nohz_full case +Typically this value would only be touched in the ``nohz_full`` case to re-enable cores that by default were not running the watchdog, if a kernel lockup was suspected on those cores. @@ -1164,12 +1230,12 @@ might say:: echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask -watchdog_thresh: -================ +watchdog_thresh +=============== This value can be used to control the frequency of hrtimer and NMI events and the soft and hard lockup thresholds. The default threshold is 10 seconds. -The softlockup threshold is (2 * watchdog_thresh). Setting this +The softlockup threshold is (``2 * watchdog_thresh``). Setting this tunable to zero will disable lockup detection altogether. diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index 287b98708a40..e043c9213388 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -67,7 +67,8 @@ two flavors of JITs, the newer eBPF JIT currently supported on: - sparc64 - mips64 - s390x - - riscv + - riscv64 + - riscv32 And the older cBPF JIT supported on the following archs: diff --git a/Documentation/admin-guide/sysctl/user.rst b/Documentation/admin-guide/sysctl/user.rst index 650eaa03f15e..c45824589339 100644 --- a/Documentation/admin-guide/sysctl/user.rst +++ b/Documentation/admin-guide/sysctl/user.rst @@ -65,6 +65,12 @@ max_pid_namespaces The maximum number of pid namespaces that any user in the current user namespace may create. +max_time_namespaces +=================== + + The maximum number of time namespaces that any user in the current + user namespace may create. + max_user_namespaces =================== diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 64aeee1009ca..0329a4d3fa9e 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -128,6 +128,9 @@ allowed to examine the unevictable lru (mlocked pages) for pages to compact. This should be used on systems where stalls for minor page faults are an acceptable trade for large contiguous free memory. Set to 0 to prevent compaction from moving pages that are unevictable. Default value is 1. +On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due +to compaction, which would block the task from becomming active until the fault +is resolved. dirty_background_bytes diff --git a/Documentation/admin-guide/sysrq.rst b/Documentation/admin-guide/sysrq.rst index 72b2cfb066f4..a46209f4636c 100644 --- a/Documentation/admin-guide/sysrq.rst +++ b/Documentation/admin-guide/sysrq.rst @@ -48,9 +48,10 @@ always allowed (by a user with admin privileges). How do I use the magic SysRq key? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -On x86 - You press the key combo :kbd:`ALT-SysRq-`. +On x86 + You press the key combo :kbd:`ALT-SysRq-`. -.. note:: + .. note:: Some keyboards may not have a key labeled 'SysRq'. The 'SysRq' key is also known as the 'Print Screen' key. Also some keyboards cannot @@ -58,14 +59,15 @@ On x86 - You press the key combo :kbd:`ALT-SysRq-`. have better luck with press :kbd:`Alt`, press :kbd:`SysRq`, release :kbd:`SysRq`, press :kbd:``, release everything. -On SPARC - You press :kbd:`ALT-STOP-`, I believe. +On SPARC + You press :kbd:`ALT-STOP-`, I believe. On the serial console (PC style standard serial ports only) You send a ``BREAK``, then within 5 seconds a command key. Sending ``BREAK`` twice is interpreted as a normal BREAK. On PowerPC - Press :kbd:`ALT - Print Screen` (or :kbd:`F13`) - :kbd:``, + Press :kbd:`ALT - Print Screen` (or :kbd:`F13`) - :kbd:``. :kbd:`Print Screen` (or :kbd:`F13`) - :kbd:`` may suffice. On other @@ -73,7 +75,7 @@ On other let me know so I can add them to this section. On all - write a character to /proc/sysrq-trigger. e.g.:: + Write a character to /proc/sysrq-trigger. e.g.:: echo t > /proc/sysrq-trigger @@ -282,7 +284,7 @@ Just ask them on the linux-kernel mailing list: Credits ~~~~~~~ -Written by Mydraal -Updated by Adam Sulmicki -Updated by Jeremy M. Dolan 2001/01/28 10:15:59 -Added to by Crutcher Dunnavant +- Written by Mydraal +- Updated by Adam Sulmicki +- Updated by Jeremy M. Dolan 2001/01/28 10:15:59 +- Added to by Crutcher Dunnavant diff --git a/Documentation/arm/tcm.rst b/Documentation/arm/tcm.rst index effd9c7bc968..b256f9783883 100644 --- a/Documentation/arm/tcm.rst +++ b/Documentation/arm/tcm.rst @@ -4,18 +4,18 @@ ARM TCM (Tightly-Coupled Memory) handling in Linux Written by Linus Walleij -Some ARM SoC:s have a so-called TCM (Tightly-Coupled Memory). +Some ARM SoCs have a so-called TCM (Tightly-Coupled Memory). This is usually just a few (4-64) KiB of RAM inside the ARM processor. -Due to being embedded inside the CPU The TCM has a +Due to being embedded inside the CPU, the TCM has a Harvard-architecture, so there is an ITCM (instruction TCM) and a DTCM (data TCM). The DTCM can not contain any instructions, but the ITCM can actually contain data. The size of DTCM or ITCM is minimum 4KiB so the typical minimum configuration is 4KiB ITCM and 4KiB DTCM. -ARM CPU:s have special registers to read out status, physical +ARM CPUs have special registers to read out status, physical location and size of TCM memories. arch/arm/include/asm/cputype.h defines a CPUID_TCM register that you can read out from the system control coprocessor. Documentation from ARM can be found diff --git a/Documentation/arm64/amu.rst b/Documentation/arm64/amu.rst new file mode 100644 index 000000000000..5057b11100ed --- /dev/null +++ b/Documentation/arm64/amu.rst @@ -0,0 +1,112 @@ +======================================================= +Activity Monitors Unit (AMU) extension in AArch64 Linux +======================================================= + +Author: Ionela Voinescu + +Date: 2019-09-10 + +This document briefly describes the provision of Activity Monitors Unit +support in AArch64 Linux. + + +Architecture overview +--------------------- + +The activity monitors extension is an optional extension introduced by the +ARMv8.4 CPU architecture. + +The activity monitors unit, implemented in each CPU, provides performance +counters intended for system management use. The AMU extension provides a +system register interface to the counter registers and also supports an +optional external memory-mapped interface. + +Version 1 of the Activity Monitors architecture implements a counter group +of four fixed and architecturally defined 64-bit event counters. + - CPU cycle counter: increments at the frequency of the CPU. + - Constant counter: increments at the fixed frequency of the system + clock. + - Instructions retired: increments with every architecturally executed + instruction. + - Memory stall cycles: counts instruction dispatch stall cycles caused by + misses in the last level cache within the clock domain. + +When in WFI or WFE these counters do not increment. + +The Activity Monitors architecture provides space for up to 16 architected +event counters. Future versions of the architecture may use this space to +implement additional architected event counters. + +Additionally, version 1 implements a counter group of up to 16 auxiliary +64-bit event counters. + +On cold reset all counters reset to 0. + + +Basic support +------------- + +The kernel can safely run a mix of CPUs with and without support for the +activity monitors extension. Therefore, when CONFIG_ARM64_AMU_EXTN is +selected we unconditionally enable the capability to allow any late CPU +(secondary or hotplugged) to detect and use the feature. + +When the feature is detected on a CPU, we flag the availability of the +feature but this does not guarantee the correct functionality of the +counters, only the presence of the extension. + +Firmware (code running at higher exception levels, e.g. arm-tf) support is +needed to: + - Enable access for lower exception levels (EL2 and EL1) to the AMU + registers. + - Enable the counters. If not enabled these will read as 0. + - Save/restore the counters before/after the CPU is being put/brought up + from the 'off' power state. + +When using kernels that have this feature enabled but boot with broken +firmware the user may experience panics or lockups when accessing the +counter registers. Even if these symptoms are not observed, the values +returned by the register reads might not correctly reflect reality. Most +commonly, the counters will read as 0, indicating that they are not +enabled. + +If proper support is not provided in firmware it's best to disable +CONFIG_ARM64_AMU_EXTN. To be noted that for security reasons, this does not +bypass the setting of AMUSERENR_EL0 to trap accesses from EL0 (userspace) to +EL1 (kernel). Therefore, firmware should still ensure accesses to AMU registers +are not trapped in EL2/EL3. + +The fixed counters of AMUv1 are accessible though the following system +register definitions: + - SYS_AMEVCNTR0_CORE_EL0 + - SYS_AMEVCNTR0_CONST_EL0 + - SYS_AMEVCNTR0_INST_RET_EL0 + - SYS_AMEVCNTR0_MEM_STALL_EL0 + +Auxiliary platform specific counters can be accessed using +SYS_AMEVCNTR1_EL0(n), where n is a value between 0 and 15. + +Details can be found in: arch/arm64/include/asm/sysreg.h. + + +Userspace access +---------------- + +Currently, access from userspace to the AMU registers is disabled due to: + - Security reasons: they might expose information about code executed in + secure mode. + - Purpose: AMU counters are intended for system management use. + +Also, the presence of the feature is not visible to userspace. + + +Virtualization +-------------- + +Currently, access from userspace (EL0) and kernelspace (EL1) on the KVM +guest side is disabled due to: + - Security reasons: they might expose information about code executed + by other guests or the host. + +Any attempt to access the AMU registers will result in an UNDEFINED +exception being injected into the guest. diff --git a/Documentation/arm64/booting.rst b/Documentation/arm64/booting.rst index 5d78a6f5b0ae..a3f1a47b6f1c 100644 --- a/Documentation/arm64/booting.rst +++ b/Documentation/arm64/booting.rst @@ -248,6 +248,20 @@ Before jumping into the kernel, the following conditions must be met: - HCR_EL2.APK (bit 40) must be initialised to 0b1 - HCR_EL2.API (bit 41) must be initialised to 0b1 + For CPUs with Activity Monitors Unit v1 (AMUv1) extension present: + - If EL3 is present: + CPTR_EL3.TAM (bit 30) must be initialised to 0b0 + CPTR_EL2.TAM (bit 30) must be initialised to 0b0 + AMCNTENSET0_EL0 must be initialised to 0b1111 + AMCNTENSET1_EL0 must be initialised to a platform specific value + having 0b1 set for the corresponding bit for each of the auxiliary + counters present. + - If the kernel is entered at EL1: + AMCNTENSET0_EL0 must be initialised to 0b1111 + AMCNTENSET1_EL0 must be initialised to a platform specific value + having 0b1 set for the corresponding bit for each of the auxiliary + counters present. + The requirements described above for CPU mode, caches, MMUs, architected timers, coherency and system registers apply to all CPUs. All CPUs must enter the kernel in the same exception level. diff --git a/Documentation/arm64/index.rst b/Documentation/arm64/index.rst index 5c0c69dc58aa..09cbb4ed2237 100644 --- a/Documentation/arm64/index.rst +++ b/Documentation/arm64/index.rst @@ -6,6 +6,7 @@ ARM64 Architecture :maxdepth: 1 acpi_object_usage + amu arm-acpi booting cpu-feature-registers diff --git a/Documentation/block/capability.rst b/Documentation/block/capability.rst index 2cf258d64bbe..160a5148b915 100644 --- a/Documentation/block/capability.rst +++ b/Documentation/block/capability.rst @@ -2,17 +2,9 @@ Generic Block Device Capability =============================== -This file documents the sysfs file block//capability +This file documents the sysfs file ``block//capability``. -capability is a hex word indicating which capabilities a specific disk -supports. For more information on bits not listed here, see -include/linux/genhd.h +``capability`` is a bitfield, printed in hexadecimal, indicating which +capabilities a specific block device supports: -GENHD_FL_MEDIA_CHANGE_NOTIFY ----------------------------- - -Value: 4 - -When this bit is set, the disk supports Asynchronous Notification -of media change events. These events will be broadcast to user -space via kernel uevent. +.. kernel-doc:: include/linux/genhd.h diff --git a/Documentation/bpf/bpf_devel_QA.rst b/Documentation/bpf/bpf_devel_QA.rst index c9856b927055..38c15c6fcb14 100644 --- a/Documentation/bpf/bpf_devel_QA.rst +++ b/Documentation/bpf/bpf_devel_QA.rst @@ -20,11 +20,11 @@ Reporting bugs Q: How do I report bugs for BPF kernel code? -------------------------------------------- A: Since all BPF kernel development as well as bpftool and iproute2 BPF -loader development happens through the netdev kernel mailing list, +loader development happens through the bpf kernel mailing list, please report any found issues around BPF to the following mailing list: - netdev@vger.kernel.org + bpf@vger.kernel.org This may also include issues related to XDP, BPF tracing, etc. @@ -46,17 +46,12 @@ Submitting patches Q: To which mailing list do I need to submit my BPF patches? ------------------------------------------------------------ -A: Please submit your BPF patches to the netdev kernel mailing list: +A: Please submit your BPF patches to the bpf kernel mailing list: - netdev@vger.kernel.org - -Historically, BPF came out of networking and has always been maintained -by the kernel networking community. Although these days BPF touches -many other subsystems as well, the patches are still routed mainly -through the networking community. + bpf@vger.kernel.org In case your patch has changes in various different subsystems (e.g. -tracing, security, etc), make sure to Cc the related kernel mailing +networking, tracing, security, etc), make sure to Cc the related kernel mailing lists and maintainers from there as well, so they are able to review the changes and provide their Acked-by's to the patches. @@ -168,7 +163,7 @@ a BPF point of view. Be aware that this is not a final verdict that the patch will automatically get accepted into net or net-next trees eventually: -On the netdev kernel mailing list reviews can come in at any point +On the bpf kernel mailing list reviews can come in at any point in time. If discussions around a patch conclude that they cannot get included as-is, we will either apply a follow-up fix or drop them from the trees entirely. Therefore, we also reserve to rebase @@ -494,15 +489,15 @@ A: You need cmake and gcc-c++ as build requisites for LLVM. Once you have that set up, proceed with building the latest LLVM and clang version from the git repositories:: - $ git clone http://llvm.org/git/llvm.git - $ cd llvm/tools - $ git clone --depth 1 http://llvm.org/git/clang.git - $ cd ..; mkdir build; cd build - $ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" \ + $ git clone https://github.com/llvm/llvm-project.git + $ mkdir -p llvm-project/llvm/build/install + $ cd llvm-project/llvm/build + $ cmake .. -G "Ninja" -DLLVM_TARGETS_TO_BUILD="BPF;X86" \ + -DLLVM_ENABLE_PROJECTS="clang" \ -DBUILD_SHARED_LIBS=OFF \ -DCMAKE_BUILD_TYPE=Release \ -DLLVM_BUILD_RUNTIME=OFF - $ make -j $(getconf _NPROCESSORS_ONLN) + $ ninja The built binaries can then be found in the build/bin/ directory, where you can point the PATH variable to. diff --git a/Documentation/bpf/bpf_lsm.rst b/Documentation/bpf/bpf_lsm.rst new file mode 100644 index 000000000000..1c0a75a51d79 --- /dev/null +++ b/Documentation/bpf/bpf_lsm.rst @@ -0,0 +1,142 @@ +.. SPDX-License-Identifier: GPL-2.0+ +.. Copyright (C) 2020 Google LLC. + +================ +LSM BPF Programs +================ + +These BPF programs allow runtime instrumentation of the LSM hooks by privileged +users to implement system-wide MAC (Mandatory Access Control) and Audit +policies using eBPF. + +Structure +--------- + +The example shows an eBPF program that can be attached to the ``file_mprotect`` +LSM hook: + +.. c:function:: int file_mprotect(struct vm_area_struct *vma, unsigned long reqprot, unsigned long prot); + +Other LSM hooks which can be instrumented can be found in +``include/linux/lsm_hooks.h``. + +eBPF programs that use :doc:`/bpf/btf` do not need to include kernel headers +for accessing information from the attached eBPF program's context. They can +simply declare the structures in the eBPF program and only specify the fields +that need to be accessed. + +.. code-block:: c + + struct mm_struct { + unsigned long start_brk, brk, start_stack; + } __attribute__((preserve_access_index)); + + struct vm_area_struct { + unsigned long start_brk, brk, start_stack; + unsigned long vm_start, vm_end; + struct mm_struct *vm_mm; + } __attribute__((preserve_access_index)); + + +.. note:: The order of the fields is irrelevant. + +This can be further simplified (if one has access to the BTF information at +build time) by generating the ``vmlinux.h`` with: + +.. code-block:: console + + # bpftool btf dump file format c > vmlinux.h + +.. note:: ``path-to-btf-vmlinux`` can be ``/sys/kernel/btf/vmlinux`` if the + build environment matches the environment the BPF programs are + deployed in. + +The ``vmlinux.h`` can then simply be included in the BPF programs without +requiring the definition of the types. + +The eBPF programs can be declared using the``BPF_PROG`` +macros defined in `tools/lib/bpf/bpf_tracing.h`_. In this +example: + + * ``"lsm/file_mprotect"`` indicates the LSM hook that the program must + be attached to + * ``mprotect_audit`` is the name of the eBPF program + +.. code-block:: c + + SEC("lsm/file_mprotect") + int BPF_PROG(mprotect_audit, struct vm_area_struct *vma, + unsigned long reqprot, unsigned long prot, int ret) + { + /* ret is the return value from the previous BPF program + * or 0 if it's the first hook. + */ + if (ret != 0) + return ret; + + int is_heap; + + is_heap = (vma->vm_start >= vma->vm_mm->start_brk && + vma->vm_end <= vma->vm_mm->brk); + + /* Return an -EPERM or write information to the perf events buffer + * for auditing + */ + if (is_heap) + return -EPERM; + } + +The ``__attribute__((preserve_access_index))`` is a clang feature that allows +the BPF verifier to update the offsets for the access at runtime using the +:doc:`/bpf/btf` information. Since the BPF verifier is aware of the types, it +also validates all the accesses made to the various types in the eBPF program. + +Loading +------- + +eBPF programs can be loaded with the :manpage:`bpf(2)` syscall's +``BPF_PROG_LOAD`` operation: + +.. code-block:: c + + struct bpf_object *obj; + + obj = bpf_object__open("./my_prog.o"); + bpf_object__load(obj); + +This can be simplified by using a skeleton header generated by ``bpftool``: + +.. code-block:: console + + # bpftool gen skeleton my_prog.o > my_prog.skel.h + +and the program can be loaded by including ``my_prog.skel.h`` and using +the generated helper, ``my_prog__open_and_load``. + +Attachment to LSM Hooks +----------------------- + +The LSM allows attachment of eBPF programs as LSM hooks using :manpage:`bpf(2)` +syscall's ``BPF_RAW_TRACEPOINT_OPEN`` operation or more simply by +using the libbpf helper ``bpf_program__attach_lsm``. + +The program can be detached from the LSM hook by *destroying* the ``link`` +link returned by ``bpf_program__attach_lsm`` using ``bpf_link__destroy``. + +One can also use the helpers generated in ``my_prog.skel.h`` i.e. +``my_prog__attach`` for attachment and ``my_prog__destroy`` for cleaning up. + +Examples +-------- + +An example eBPF program can be found in +`tools/testing/selftests/bpf/progs/lsm.c`_ and the corresponding +userspace code in `tools/testing/selftests/bpf/prog_tests/test_lsm.c`_ + +.. Links +.. _tools/lib/bpf/bpf_tracing.h: + https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/lib/bpf/bpf_tracing.h +.. _tools/testing/selftests/bpf/progs/lsm.c: + https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/progs/lsm.c +.. _tools/testing/selftests/bpf/prog_tests/test_lsm.c: + https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/prog_tests/test_lsm.c diff --git a/Documentation/bpf/drgn.rst b/Documentation/bpf/drgn.rst new file mode 100644 index 000000000000..41f223c3161e --- /dev/null +++ b/Documentation/bpf/drgn.rst @@ -0,0 +1,213 @@ +.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) + +============== +BPF drgn tools +============== + +drgn scripts is a convenient and easy to use mechanism to retrieve arbitrary +kernel data structures. drgn is not relying on kernel UAPI to read the data. +Instead it's reading directly from ``/proc/kcore`` or vmcore and pretty prints +the data based on DWARF debug information from vmlinux. + +This document describes BPF related drgn tools. + +See `drgn/tools`_ for all tools available at the moment and `drgn/doc`_ for +more details on drgn itself. + +bpf_inspect.py +-------------- + +Description +=========== + +`bpf_inspect.py`_ is a tool intended to inspect BPF programs and maps. It can +iterate over all programs and maps in the system and print basic information +about these objects, including id, type and name. + +The main use-case `bpf_inspect.py`_ covers is to show BPF programs of types +``BPF_PROG_TYPE_EXT`` and ``BPF_PROG_TYPE_TRACING`` attached to other BPF +programs via ``freplace``/``fentry``/``fexit`` mechanisms, since there is no +user-space API to get this information. + +Getting started +=============== + +List BPF programs (full names are obtained from BTF):: + + % sudo bpf_inspect.py prog + 27: BPF_PROG_TYPE_TRACEPOINT tracepoint__tcp__tcp_send_reset + 4632: BPF_PROG_TYPE_CGROUP_SOCK_ADDR tw_ipt_bind + 49464: BPF_PROG_TYPE_RAW_TRACEPOINT raw_tracepoint__sched_process_exit + +List BPF maps:: + + % sudo bpf_inspect.py map + 2577: BPF_MAP_TYPE_HASH tw_ipt_vips + 4050: BPF_MAP_TYPE_STACK_TRACE stack_traces + 4069: BPF_MAP_TYPE_PERCPU_ARRAY ned_dctcp_cntr + +Find BPF programs attached to BPF program ``test_pkt_access``:: + + % sudo bpf_inspect.py p | grep test_pkt_access + 650: BPF_PROG_TYPE_SCHED_CLS test_pkt_access + 654: BPF_PROG_TYPE_TRACING test_main linked:[650->25: BPF_TRAMP_FEXIT test_pkt_access->test_pkt_access()] + 655: BPF_PROG_TYPE_TRACING test_subprog1 linked:[650->29: BPF_TRAMP_FEXIT test_pkt_access->test_pkt_access_subprog1()] + 656: BPF_PROG_TYPE_TRACING test_subprog2 linked:[650->31: BPF_TRAMP_FEXIT test_pkt_access->test_pkt_access_subprog2()] + 657: BPF_PROG_TYPE_TRACING test_subprog3 linked:[650->21: BPF_TRAMP_FEXIT test_pkt_access->test_pkt_access_subprog3()] + 658: BPF_PROG_TYPE_EXT new_get_skb_len linked:[650->16: BPF_TRAMP_REPLACE test_pkt_access->get_skb_len()] + 659: BPF_PROG_TYPE_EXT new_get_skb_ifindex linked:[650->23: BPF_TRAMP_REPLACE test_pkt_access->get_skb_ifindex()] + 660: BPF_PROG_TYPE_EXT new_get_constant linked:[650->19: BPF_TRAMP_REPLACE test_pkt_access->get_constant()] + +It can be seen that there is a program ``test_pkt_access``, id 650 and there +are multiple other tracing and ext programs attached to functions in +``test_pkt_access``. + +For example the line:: + + 658: BPF_PROG_TYPE_EXT new_get_skb_len linked:[650->16: BPF_TRAMP_REPLACE test_pkt_access->get_skb_len()] + +, means that BPF program id 658, type ``BPF_PROG_TYPE_EXT``, name +``new_get_skb_len`` replaces (``BPF_TRAMP_REPLACE``) function ``get_skb_len()`` +that has BTF id 16 in BPF program id 650, name ``test_pkt_access``. + +Getting help: + +.. code-block:: none + + % sudo bpf_inspect.py + usage: bpf_inspect.py [-h] {prog,p,map,m} ... + + drgn script to list BPF programs or maps and their properties + unavailable via kernel API. + + See https://github.com/osandov/drgn/ for more details on drgn. + + optional arguments: + -h, --help show this help message and exit + + subcommands: + {prog,p,map,m} + prog (p) list BPF programs + map (m) list BPF maps + +Customization +============= + +The script is intended to be customized by developers to print relevant +information about BPF programs, maps and other objects. + +For example, to print ``struct bpf_prog_aux`` for BPF program id 53077: + +.. code-block:: none + + % git diff + diff --git a/tools/bpf_inspect.py b/tools/bpf_inspect.py + index 650e228..aea2357 100755 + --- a/tools/bpf_inspect.py + +++ b/tools/bpf_inspect.py + @@ -112,7 +112,9 @@ def list_bpf_progs(args): + if linked: + linked = f" linked:[{linked}]" + + - print(f"{id_:>6}: {type_:32} {name:32} {linked}") + + if id_ == 53077: + + print(f"{id_:>6}: {type_:32} {name:32}") + + print(f"{bpf_prog.aux}") + + + def list_bpf_maps(args): + +It produces the output:: + + % sudo bpf_inspect.py p + 53077: BPF_PROG_TYPE_XDP tw_xdp_policer + *(struct bpf_prog_aux *)0xffff8893fad4b400 = { + .refcnt = (atomic64_t){ + .counter = (long)58, + }, + .used_map_cnt = (u32)1, + .max_ctx_offset = (u32)8, + .max_pkt_offset = (u32)15, + .max_tp_access = (u32)0, + .stack_depth = (u32)8, + .id = (u32)53077, + .func_cnt = (u32)0, + .func_idx = (u32)0, + .attach_btf_id = (u32)0, + .linked_prog = (struct bpf_prog *)0x0, + .verifier_zext = (bool)0, + .offload_requested = (bool)0, + .attach_btf_trace = (bool)0, + .func_proto_unreliable = (bool)0, + .trampoline_prog_type = (enum bpf_tramp_prog_type)BPF_TRAMP_FENTRY, + .trampoline = (struct bpf_trampoline *)0x0, + .tramp_hlist = (struct hlist_node){ + .next = (struct hlist_node *)0x0, + .pprev = (struct hlist_node **)0x0, + }, + .attach_func_proto = (const struct btf_type *)0x0, + .attach_func_name = (const char *)0x0, + .func = (struct bpf_prog **)0x0, + .jit_data = (void *)0x0, + .poke_tab = (struct bpf_jit_poke_descriptor *)0x0, + .size_poke_tab = (u32)0, + .ksym_tnode = (struct latch_tree_node){ + .node = (struct rb_node [2]){ + { + .__rb_parent_color = (unsigned long)18446612956263126665, + .rb_right = (struct rb_node *)0x0, + .rb_left = (struct rb_node *)0xffff88a0be3d0088, + }, + { + .__rb_parent_color = (unsigned long)18446612956263126689, + .rb_right = (struct rb_node *)0x0, + .rb_left = (struct rb_node *)0xffff88a0be3d00a0, + }, + }, + }, + .ksym_lnode = (struct list_head){ + .next = (struct list_head *)0xffff88bf481830b8, + .prev = (struct list_head *)0xffff888309f536b8, + }, + .ops = (const struct bpf_prog_ops *)xdp_prog_ops+0x0 = 0xffffffff820fa350, + .used_maps = (struct bpf_map **)0xffff889ff795de98, + .prog = (struct bpf_prog *)0xffffc9000cf2d000, + .user = (struct user_struct *)root_user+0x0 = 0xffffffff82444820, + .load_time = (u64)2408348759285319, + .cgroup_storage = (struct bpf_map *[2]){}, + .name = (char [16])"tw_xdp_policer", + .security = (void *)0xffff889ff795d548, + .offload = (struct bpf_prog_offload *)0x0, + .btf = (struct btf *)0xffff8890ce6d0580, + .func_info = (struct bpf_func_info *)0xffff889ff795d240, + .func_info_aux = (struct bpf_func_info_aux *)0xffff889ff795de20, + .linfo = (struct bpf_line_info *)0xffff888a707afc00, + .jited_linfo = (void **)0xffff8893fad48600, + .func_info_cnt = (u32)1, + .nr_linfo = (u32)37, + .linfo_idx = (u32)0, + .num_exentries = (u32)0, + .extable = (struct exception_table_entry *)0xffffffffa032d950, + .stats = (struct bpf_prog_stats *)0x603fe3a1f6d0, + .work = (struct work_struct){ + .data = (atomic_long_t){ + .counter = (long)0, + }, + .entry = (struct list_head){ + .next = (struct list_head *)0x0, + .prev = (struct list_head *)0x0, + }, + .func = (work_func_t)0x0, + }, + .rcu = (struct callback_head){ + .next = (struct callback_head *)0x0, + .func = (void (*)(struct callback_head *))0x0, + }, + } + + +.. Links +.. _drgn/doc: https://drgn.readthedocs.io/en/latest/ +.. _drgn/tools: https://github.com/osandov/drgn/tree/master/tools +.. _bpf_inspect.py: + https://github.com/osandov/drgn/blob/master/tools/bpf_inspect.py diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index 4f5410b61441..f99677f3572f 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -45,14 +45,16 @@ Program types prog_cgroup_sockopt prog_cgroup_sysctl prog_flow_dissector + bpf_lsm -Testing BPF -=========== +Testing and debugging BPF +========================= .. toctree:: :maxdepth: 1 + drgn s390 diff --git a/Documentation/conf.py b/Documentation/conf.py index 3c7bdf4cd31f..9ae8e9abf846 100644 --- a/Documentation/conf.py +++ b/Documentation/conf.py @@ -38,7 +38,11 @@ needs_sphinx = '1.3' # ones. extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain', 'kfigure', 'sphinx.ext.ifconfig', 'automarkup', - 'maintainers_include'] + 'maintainers_include', 'sphinx.ext.autosectionlabel' ] + +# Ensure that autosectionlabel will produce unique names +autosectionlabel_prefix_document = True +autosectionlabel_maxdepth = 2 # The name of the math extension changed on Sphinx 1.4 if (major == 1 and minor > 3) or (major > 1): diff --git a/Documentation/core-api/gcc-plugins.rst b/Documentation/core-api/gcc-plugins.rst deleted file mode 100644 index 8502f24396fb..000000000000 --- a/Documentation/core-api/gcc-plugins.rst +++ /dev/null @@ -1,93 +0,0 @@ -========================= -GCC plugin infrastructure -========================= - - -Introduction -============ - -GCC plugins are loadable modules that provide extra features to the -compiler [1]_. They are useful for runtime instrumentation and static analysis. -We can analyse, change and add further code during compilation via -callbacks [2]_, GIMPLE [3]_, IPA [4]_ and RTL passes [5]_. - -The GCC plugin infrastructure of the kernel supports all gcc versions from -4.5 to 6.0, building out-of-tree modules, cross-compilation and building in a -separate directory. -Plugin source files have to be compilable by both a C and a C++ compiler as well -because gcc versions 4.5 and 4.6 are compiled by a C compiler, -gcc-4.7 can be compiled by a C or a C++ compiler, -and versions 4.8+ can only be compiled by a C++ compiler. - -Currently the GCC plugin infrastructure supports only the x86, arm, arm64 and -powerpc architectures. - -This infrastructure was ported from grsecurity [6]_ and PaX [7]_. - --- - -.. [1] https://gcc.gnu.org/onlinedocs/gccint/Plugins.html -.. [2] https://gcc.gnu.org/onlinedocs/gccint/Plugin-API.html#Plugin-API -.. [3] https://gcc.gnu.org/onlinedocs/gccint/GIMPLE.html -.. [4] https://gcc.gnu.org/onlinedocs/gccint/IPA.html -.. [5] https://gcc.gnu.org/onlinedocs/gccint/RTL.html -.. [6] https://grsecurity.net/ -.. [7] https://pax.grsecurity.net/ - - -Files -===== - -**$(src)/scripts/gcc-plugins** - - This is the directory of the GCC plugins. - -**$(src)/scripts/gcc-plugins/gcc-common.h** - - This is a compatibility header for GCC plugins. - It should be always included instead of individual gcc headers. - -**$(src)/scripts/gcc-plugin.sh** - - This script checks the availability of the included headers in - gcc-common.h and chooses the proper host compiler to build the plugins - (gcc-4.7 can be built by either gcc or g++). - -**$(src)/scripts/gcc-plugins/gcc-generate-gimple-pass.h, -$(src)/scripts/gcc-plugins/gcc-generate-ipa-pass.h, -$(src)/scripts/gcc-plugins/gcc-generate-simple_ipa-pass.h, -$(src)/scripts/gcc-plugins/gcc-generate-rtl-pass.h** - - These headers automatically generate the registration structures for - GIMPLE, SIMPLE_IPA, IPA and RTL passes. They support all gcc versions - from 4.5 to 6.0. - They should be preferred to creating the structures by hand. - - -Usage -===== - -You must install the gcc plugin headers for your gcc version, -e.g., on Ubuntu for gcc-4.9:: - - apt-get install gcc-4.9-plugin-dev - -Enable a GCC plugin based feature in the kernel config:: - - CONFIG_GCC_PLUGIN_CYC_COMPLEXITY = y - -To compile only the plugin(s):: - - make gcc-plugins - -or just run the kernel make and compile the whole kernel with -the cyclomatic complexity GCC plugin. - - -4. How to add a new GCC plugin -============================== - -The GCC plugins are in $(src)/scripts/gcc-plugins/. You can use a file or a directory -here. It must be added to $(src)/scripts/gcc-plugins/Makefile, -$(src)/scripts/Makefile.gcc-plugins and $(src)/arch/Kconfig. -See the cyc_complexity_plugin.c (CONFIG_GCC_PLUGIN_CYC_COMPLEXITY) GCC plugin. diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index a501dc1c90d0..0897ad12c119 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -8,41 +8,81 @@ This is the beginning of a manual for core kernel APIs. The conversion Core utilities ============== +This section has general and "core core" documentation. The first is a +massive grab-bag of kerneldoc info left over from the docbook days; it +should really be broken up someday when somebody finds the energy to do +it. + .. toctree:: :maxdepth: 1 kernel-api + workqueue + printk-formats + symbol-namespaces + +Data structures and low-level utilities +======================================= + +Library functionality that is used throughout the kernel. + +.. toctree:: + :maxdepth: 1 + + kobject assoc_array + xarray + idr + circular-buffers + generic-radix-tree + packing + timekeeping + errseq + +Concurrency primitives +====================== + +How Linux keeps everything from happening at the same time. See +:doc:`/locking/index` for more related documentation. + +.. toctree:: + :maxdepth: 1 + atomic_ops - cachetlb refcount-vs-atomic - cpu_hotplug - idr local_ops - workqueue + padata + ../RCU/index + +Low-level hardware management +============================= + +Cache management, managing CPU hotplug, etc. + +.. toctree:: + :maxdepth: 1 + + cachetlb + cpu_hotplug + memory-hotplug genericirq - xarray - librs - genalloc - errseq - packing - printk-formats - circular-buffers - generic-radix-tree + protection-keys + +Memory management +================= + +How to allocate and use memory in the kernel. Note that there is a lot +more memory-management documentation in :doc:`/vm/index`. + +.. toctree:: + :maxdepth: 1 + memory-allocation mm-api + genalloc pin_user_pages - gfp_mask-from-fs-io - timekeeping boot-time-mm - memory-hotplug - protection-keys - ../RCU/index - gcc-plugins - symbol-namespaces - padata - ioctl - + gfp_mask-from-fs-io Interfaces for kernel debugging =============================== @@ -53,6 +93,16 @@ Interfaces for kernel debugging debug-objects tracepoint +Everything else +=============== + +Documents that don't fit elsewhere or which have yet to be categorized. + +.. toctree:: + :maxdepth: 1 + + librs + .. only:: subproject and html Indices diff --git a/Documentation/core-api/ioctl.rst b/Documentation/core-api/ioctl.rst deleted file mode 100644 index c455db0e1627..000000000000 --- a/Documentation/core-api/ioctl.rst +++ /dev/null @@ -1,253 +0,0 @@ -====================== -ioctl based interfaces -====================== - -ioctl() is the most common way for applications to interface -with device drivers. It is flexible and easily extended by adding new -commands and can be passed through character devices, block devices as -well as sockets and other special file descriptors. - -However, it is also very easy to get ioctl command definitions wrong, -and hard to fix them later without breaking existing applications, -so this documentation tries to help developers get it right. - -Command number definitions -========================== - -The command number, or request number, is the second argument passed to -the ioctl system call. While this can be any 32-bit number that uniquely -identifies an action for a particular driver, there are a number of -conventions around defining them. - -``include/uapi/asm-generic/ioctl.h`` provides four macros for defining -ioctl commands that follow modern conventions: ``_IO``, ``_IOR``, -``_IOW``, and ``_IOWR``. These should be used for all new commands, -with the correct parameters: - -_IO/_IOR/_IOW/_IOWR - The macro name specifies how the argument will be used.  It may be a - pointer to data to be passed into the kernel (_IOW), out of the kernel - (_IOR), or both (_IOWR).  _IO can indicate either commands with no - argument or those passing an integer value instead of a pointer. - It is recommended to only use _IO for commands without arguments, - and use pointers for passing data. - -type - An 8-bit number, often a character literal, specific to a subsystem - or driver, and listed in :doc:`../userspace-api/ioctl/ioctl-number` - -nr - An 8-bit number identifying the specific command, unique for a give - value of 'type' - -data_type - The name of the data type pointed to by the argument, the command number - encodes the ``sizeof(data_type)`` value in a 13-bit or 14-bit integer, - leading to a limit of 8191 bytes for the maximum size of the argument. - Note: do not pass sizeof(data_type) type into _IOR/_IOW/IOWR, as that - will lead to encoding sizeof(sizeof(data_type)), i.e. sizeof(size_t). - _IO does not have a data_type parameter. - - -Interface versions -================== - -Some subsystems use version numbers in data structures to overload -commands with different interpretations of the argument. - -This is generally a bad idea, since changes to existing commands tend -to break existing applications. - -A better approach is to add a new ioctl command with a new number. The -old command still needs to be implemented in the kernel for compatibility, -but this can be a wrapper around the new implementation. - -Return code -=========== - -ioctl commands can return negative error codes as documented in errno(3); -these get turned into errno values in user space. On success, the return -code should be zero. It is also possible but not recommended to return -a positive 'long' value. - -When the ioctl callback is called with an unknown command number, the -handler returns either -ENOTTY or -ENOIOCTLCMD, which also results in --ENOTTY being returned from the system call. Some subsystems return --ENOSYS or -EINVAL here for historic reasons, but this is wrong. - -Prior to Linux 5.5, compat_ioctl handlers were required to return --ENOIOCTLCMD in order to use the fallback conversion into native -commands. As all subsystems are now responsible for handling compat -mode themselves, this is no longer needed, but it may be important to -consider when backporting bug fixes to older kernels. - -Timestamps -========== - -Traditionally, timestamps and timeout values are passed as ``struct -timespec`` or ``struct timeval``, but these are problematic because of -incompatible definitions of these structures in user space after the -move to 64-bit time_t. - -The ``struct __kernel_timespec`` type can be used instead to be embedded -in other data structures when separate second/nanosecond values are -desired, or passed to user space directly. This is still not ideal though, -as the structure matches neither the kernel's timespec64 nor the user -space timespec exactly. The get_timespec64() and put_timespec64() helper -functions can be used to ensure that the layout remains compatible with -user space and the padding is treated correctly. - -As it is cheap to convert seconds to nanoseconds, but the opposite -requires an expensive 64-bit division, a simple __u64 nanosecond value -can be simpler and more efficient. - -Timeout values and timestamps should ideally use CLOCK_MONOTONIC time, -as returned by ktime_get_ns() or ktime_get_ts64(). Unlike -CLOCK_REALTIME, this makes the timestamps immune from jumping backwards -or forwards due to leap second adjustments and clock_settime() calls. - -ktime_get_real_ns() can be used for CLOCK_REALTIME timestamps that -need to be persistent across a reboot or between multiple machines. - -32-bit compat mode -================== - -In order to support 32-bit user space running on a 64-bit machine, each -subsystem or driver that implements an ioctl callback handler must also -implement the corresponding compat_ioctl handler. - -As long as all the rules for data structures are followed, this is as -easy as setting the .compat_ioctl pointer to a helper function such as -compat_ptr_ioctl() or blkdev_compat_ptr_ioctl(). - -compat_ptr() ------------- - -On the s390 architecture, 31-bit user space has ambiguous representations -for data pointers, with the upper bit being ignored. When running such -a process in compat mode, the compat_ptr() helper must be used to -clear the upper bit of a compat_uptr_t and turn it into a valid 64-bit -pointer. On other architectures, this macro only performs a cast to a -``void __user *`` pointer. - -In an compat_ioctl() callback, the last argument is an unsigned long, -which can be interpreted as either a pointer or a scalar depending on -the command. If it is a scalar, then compat_ptr() must not be used, to -ensure that the 64-bit kernel behaves the same way as a 32-bit kernel -for arguments with the upper bit set. - -The compat_ptr_ioctl() helper can be used in place of a custom -compat_ioctl file operation for drivers that only take arguments that -are pointers to compatible data structures. - -Structure layout ----------------- - -Compatible data structures have the same layout on all architectures, -avoiding all problematic members: - -* ``long`` and ``unsigned long`` are the size of a register, so - they can be either 32-bit or 64-bit wide and cannot be used in portable - data structures. Fixed-length replacements are ``__s32``, ``__u32``, - ``__s64`` and ``__u64``. - -* Pointers have the same problem, in addition to requiring the - use of compat_ptr(). The best workaround is to use ``__u64`` - in place of pointers, which requires a cast to ``uintptr_t`` in user - space, and the use of u64_to_user_ptr() in the kernel to convert - it back into a user pointer. - -* On the x86-32 (i386) architecture, the alignment of 64-bit variables - is only 32-bit, but they are naturally aligned on most other - architectures including x86-64. This means a structure like:: - - struct foo { - __u32 a; - __u64 b; - __u32 c; - }; - - has four bytes of padding between a and b on x86-64, plus another four - bytes of padding at the end, but no padding on i386, and it needs a - compat_ioctl conversion handler to translate between the two formats. - - To avoid this problem, all structures should have their members - naturally aligned, or explicit reserved fields added in place of the - implicit padding. The ``pahole`` tool can be used for checking the - alignment. - -* On ARM OABI user space, structures are padded to multiples of 32-bit, - making some structs incompatible with modern EABI kernels if they - do not end on a 32-bit boundary. - -* On the m68k architecture, struct members are not guaranteed to have an - alignment greater than 16-bit, which is a problem when relying on - implicit padding. - -* Bitfields and enums generally work as one would expect them to, - but some properties of them are implementation-defined, so it is better - to avoid them completely in ioctl interfaces. - -* ``char`` members can be either signed or unsigned, depending on - the architecture, so the __u8 and __s8 types should be used for 8-bit - integer values, though char arrays are clearer for fixed-length strings. - -Information leaks -================= - -Uninitialized data must not be copied back to user space, as this can -cause an information leak, which can be used to defeat kernel address -space layout randomization (KASLR), helping in an attack. - -For this reason (and for compat support) it is best to avoid any -implicit padding in data structures.  Where there is implicit padding -in an existing structure, kernel drivers must be careful to fully -initialize an instance of the structure before copying it to user -space.  This is usually done by calling memset() before assigning to -individual members. - -Subsystem abstractions -====================== - -While some device drivers implement their own ioctl function, most -subsystems implement the same command for multiple drivers. Ideally the -subsystem has an .ioctl() handler that copies the arguments from and -to user space, passing them into subsystem specific callback functions -through normal kernel pointers. - -This helps in various ways: - -* Applications written for one driver are more likely to work for - another one in the same subsystem if there are no subtle differences - in the user space ABI. - -* The complexity of user space access and data structure layout is done - in one place, reducing the potential for implementation bugs. - -* It is more likely to be reviewed by experienced developers - that can spot problems in the interface when the ioctl is shared - between multiple drivers than when it is only used in a single driver. - -Alternatives to ioctl -===================== - -There are many cases in which ioctl is not the best solution for a -problem. Alternatives include: - -* System calls are a better choice for a system-wide feature that - is not tied to a physical device or constrained by the file system - permissions of a character device node - -* netlink is the preferred way of configuring any network related - objects through sockets. - -* debugfs is used for ad-hoc interfaces for debugging functionality - that does not need to be exposed as a stable interface to applications. - -* sysfs is a good way to expose the state of an in-kernel object - that is not tied to a file descriptor. - -* configfs can be used for more complex configuration than sysfs - -* A custom file system can provide extra flexibility with a simple - user interface but adds a lot of complexity to the implementation. diff --git a/Documentation/core-api/kobject.rst b/Documentation/core-api/kobject.rst new file mode 100644 index 000000000000..1f62d4d7d966 --- /dev/null +++ b/Documentation/core-api/kobject.rst @@ -0,0 +1,434 @@ +===================================================================== +Everything you never wanted to know about kobjects, ksets, and ktypes +===================================================================== + +:Author: Greg Kroah-Hartman +:Last updated: December 19, 2007 + +Based on an original article by Jon Corbet for lwn.net written October 1, +2003 and located at http://lwn.net/Articles/51437/ + +Part of the difficulty in understanding the driver model - and the kobject +abstraction upon which it is built - is that there is no obvious starting +place. Dealing with kobjects requires understanding a few different types, +all of which make reference to each other. In an attempt to make things +easier, we'll take a multi-pass approach, starting with vague terms and +adding detail as we go. To that end, here are some quick definitions of +some terms we will be working with. + + - A kobject is an object of type struct kobject. Kobjects have a name + and a reference count. A kobject also has a parent pointer (allowing + objects to be arranged into hierarchies), a specific type, and, + usually, a representation in the sysfs virtual filesystem. + + Kobjects are generally not interesting on their own; instead, they are + usually embedded within some other structure which contains the stuff + the code is really interested in. + + No structure should **EVER** have more than one kobject embedded within it. + If it does, the reference counting for the object is sure to be messed + up and incorrect, and your code will be buggy. So do not do this. + + - A ktype is the type of object that embeds a kobject. Every structure + that embeds a kobject needs a corresponding ktype. The ktype controls + what happens to the kobject when it is created and destroyed. + + - A kset is a group of kobjects. These kobjects can be of the same ktype + or belong to different ktypes. The kset is the basic container type for + collections of kobjects. Ksets contain their own kobjects, but you can + safely ignore that implementation detail as the kset core code handles + this kobject automatically. + + When you see a sysfs directory full of other directories, generally each + of those directories corresponds to a kobject in the same kset. + +We'll look at how to create and manipulate all of these types. A bottom-up +approach will be taken, so we'll go back to kobjects. + + +Embedding kobjects +================== + +It is rare for kernel code to create a standalone kobject, with one major +exception explained below. Instead, kobjects are used to control access to +a larger, domain-specific object. To this end, kobjects will be found +embedded in other structures. If you are used to thinking of things in +object-oriented terms, kobjects can be seen as a top-level, abstract class +from which other classes are derived. A kobject implements a set of +capabilities which are not particularly useful by themselves, but are +nice to have in other objects. The C language does not allow for the +direct expression of inheritance, so other techniques - such as structure +embedding - must be used. + +(As an aside, for those familiar with the kernel linked list implementation, +this is analogous as to how "list_head" structs are rarely useful on +their own, but are invariably found embedded in the larger objects of +interest.) + +So, for example, the UIO code in ``drivers/uio/uio.c`` has a structure that +defines the memory region associated with a uio device:: + + struct uio_map { + struct kobject kobj; + struct uio_mem *mem; + }; + +If you have a struct uio_map structure, finding its embedded kobject is +just a matter of using the kobj member. Code that works with kobjects will +often have the opposite problem, however: given a struct kobject pointer, +what is the pointer to the containing structure? You must avoid tricks +(such as assuming that the kobject is at the beginning of the structure) +and, instead, use the container_of() macro, found in ````:: + + container_of(pointer, type, member) + +where: + + * ``pointer`` is the pointer to the embedded kobject, + * ``type`` is the type of the containing structure, and + * ``member`` is the name of the structure field to which ``pointer`` points. + +The return value from container_of() is a pointer to the corresponding +container type. So, for example, a pointer ``kp`` to a struct kobject +embedded **within** a struct uio_map could be converted to a pointer to the +**containing** uio_map structure with:: + + struct uio_map *u_map = container_of(kp, struct uio_map, kobj); + +For convenience, programmers often define a simple macro for **back-casting** +kobject pointers to the containing type. Exactly this happens in the +earlier ``drivers/uio/uio.c``, as you can see here:: + + struct uio_map { + struct kobject kobj; + struct uio_mem *mem; + }; + + #define to_map(map) container_of(map, struct uio_map, kobj) + +where the macro argument "map" is a pointer to the struct kobject in +question. That macro is subsequently invoked with:: + + struct uio_map *map = to_map(kobj); + + +Initialization of kobjects +========================== + +Code which creates a kobject must, of course, initialize that object. Some +of the internal fields are setup with a (mandatory) call to kobject_init():: + + void kobject_init(struct kobject *kobj, struct kobj_type *ktype); + +The ktype is required for a kobject to be created properly, as every kobject +must have an associated kobj_type. After calling kobject_init(), to +register the kobject with sysfs, the function kobject_add() must be called:: + + int kobject_add(struct kobject *kobj, struct kobject *parent, + const char *fmt, ...); + +This sets up the parent of the kobject and the name for the kobject +properly. If the kobject is to be associated with a specific kset, +kobj->kset must be assigned before calling kobject_add(). If a kset is +associated with a kobject, then the parent for the kobject can be set to +NULL in the call to kobject_add() and then the kobject's parent will be the +kset itself. + +As the name of the kobject is set when it is added to the kernel, the name +of the kobject should never be manipulated directly. If you must change +the name of the kobject, call kobject_rename():: + + int kobject_rename(struct kobject *kobj, const char *new_name); + +kobject_rename does not perform any locking or have a solid notion of +what names are valid so the caller must provide their own sanity checking +and serialization. + +There is a function called kobject_set_name() but that is legacy cruft and +is being removed. If your code needs to call this function, it is +incorrect and needs to be fixed. + +To properly access the name of the kobject, use the function +kobject_name():: + + const char *kobject_name(const struct kobject * kobj); + +There is a helper function to both initialize and add the kobject to the +kernel at the same time, called surprisingly enough kobject_init_and_add():: + + int kobject_init_and_add(struct kobject *kobj, struct kobj_type *ktype, + struct kobject *parent, const char *fmt, ...); + +The arguments are the same as the individual kobject_init() and +kobject_add() functions described above. + + +Uevents +======= + +After a kobject has been registered with the kobject core, you need to +announce to the world that it has been created. This can be done with a +call to kobject_uevent():: + + int kobject_uevent(struct kobject *kobj, enum kobject_action action); + +Use the **KOBJ_ADD** action for when the kobject is first added to the kernel. +This should be done only after any attributes or children of the kobject +have been initialized properly, as userspace will instantly start to look +for them when this call happens. + +When the kobject is removed from the kernel (details on how to do that are +below), the uevent for **KOBJ_REMOVE** will be automatically created by the +kobject core, so the caller does not have to worry about doing that by +hand. + + +Reference counts +================ + +One of the key functions of a kobject is to serve as a reference counter +for the object in which it is embedded. As long as references to the object +exist, the object (and the code which supports it) must continue to exist. +The low-level functions for manipulating a kobject's reference counts are:: + + struct kobject *kobject_get(struct kobject *kobj); + void kobject_put(struct kobject *kobj); + +A successful call to kobject_get() will increment the kobject's reference +counter and return the pointer to the kobject. + +When a reference is released, the call to kobject_put() will decrement the +reference count and, possibly, free the object. Note that kobject_init() +sets the reference count to one, so the code which sets up the kobject will +need to do a kobject_put() eventually to release that reference. + +Because kobjects are dynamic, they must not be declared statically or on +the stack, but instead, always allocated dynamically. Future versions of +the kernel will contain a run-time check for kobjects that are created +statically and will warn the developer of this improper usage. + +If all that you want to use a kobject for is to provide a reference counter +for your structure, please use the struct kref instead; a kobject would be +overkill. For more information on how to use struct kref, please see the +file Documentation/kref.txt in the Linux kernel source tree. + + +Creating "simple" kobjects +========================== + +Sometimes all that a developer wants is a way to create a simple directory +in the sysfs hierarchy, and not have to mess with the whole complication of +ksets, show and store functions, and other details. This is the one +exception where a single kobject should be created. To create such an +entry, use the function:: + + struct kobject *kobject_create_and_add(char *name, struct kobject *parent); + +This function will create a kobject and place it in sysfs in the location +underneath the specified parent kobject. To create simple attributes +associated with this kobject, use:: + + int sysfs_create_file(struct kobject *kobj, struct attribute *attr); + +or:: + + int sysfs_create_group(struct kobject *kobj, struct attribute_group *grp); + +Both types of attributes used here, with a kobject that has been created +with the kobject_create_and_add(), can be of type kobj_attribute, so no +special custom attribute is needed to be created. + +See the example module, ``samples/kobject/kobject-example.c`` for an +implementation of a simple kobject and attributes. + + + +ktypes and release methods +========================== + +One important thing still missing from the discussion is what happens to a +kobject when its reference count reaches zero. The code which created the +kobject generally does not know when that will happen; if it did, there +would be little point in using a kobject in the first place. Even +predictable object lifecycles become more complicated when sysfs is brought +in as other portions of the kernel can get a reference on any kobject that +is registered in the system. + +The end result is that a structure protected by a kobject cannot be freed +before its reference count goes to zero. The reference count is not under +the direct control of the code which created the kobject. So that code must +be notified asynchronously whenever the last reference to one of its +kobjects goes away. + +Once you registered your kobject via kobject_add(), you must never use +kfree() to free it directly. The only safe way is to use kobject_put(). It +is good practice to always use kobject_put() after kobject_init() to avoid +errors creeping in. + +This notification is done through a kobject's release() method. Usually +such a method has a form like:: + + void my_object_release(struct kobject *kobj) + { + struct my_object *mine = container_of(kobj, struct my_object, kobj); + + /* Perform any additional cleanup on this object, then... */ + kfree(mine); + } + +One important point cannot be overstated: every kobject must have a +release() method, and the kobject must persist (in a consistent state) +until that method is called. If these constraints are not met, the code is +flawed. Note that the kernel will warn you if you forget to provide a +release() method. Do not try to get rid of this warning by providing an +"empty" release function. + +If all your cleanup function needs to do is call kfree(), then you must +create a wrapper function which uses container_of() to upcast to the correct +type (as shown in the example above) and then calls kfree() on the overall +structure. + +Note, the name of the kobject is available in the release function, but it +must NOT be changed within this callback. Otherwise there will be a memory +leak in the kobject core, which makes people unhappy. + +Interestingly, the release() method is not stored in the kobject itself; +instead, it is associated with the ktype. So let us introduce struct +kobj_type:: + + struct kobj_type { + void (*release)(struct kobject *kobj); + const struct sysfs_ops *sysfs_ops; + struct attribute **default_attrs; + const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj); + const void *(*namespace)(struct kobject *kobj); + }; + +This structure is used to describe a particular type of kobject (or, more +correctly, of containing object). Every kobject needs to have an associated +kobj_type structure; a pointer to that structure must be specified when you +call kobject_init() or kobject_init_and_add(). + +The release field in struct kobj_type is, of course, a pointer to the +release() method for this type of kobject. The other two fields (sysfs_ops +and default_attrs) control how objects of this type are represented in +sysfs; they are beyond the scope of this document. + +The default_attrs pointer is a list of default attributes that will be +automatically created for any kobject that is registered with this ktype. + + +ksets +===== + +A kset is merely a collection of kobjects that want to be associated with +each other. There is no restriction that they be of the same ktype, but be +very careful if they are not. + +A kset serves these functions: + + - It serves as a bag containing a group of objects. A kset can be used by + the kernel to track "all block devices" or "all PCI device drivers." + + - A kset is also a subdirectory in sysfs, where the associated kobjects + with the kset can show up. Every kset contains a kobject which can be + set up to be the parent of other kobjects; the top-level directories of + the sysfs hierarchy are constructed in this way. + + - Ksets can support the "hotplugging" of kobjects and influence how + uevent events are reported to user space. + +In object-oriented terms, "kset" is the top-level container class; ksets +contain their own kobject, but that kobject is managed by the kset code and +should not be manipulated by any other user. + +A kset keeps its children in a standard kernel linked list. Kobjects point +back to their containing kset via their kset field. In almost all cases, +the kobjects belonging to a kset have that kset (or, strictly, its embedded +kobject) in their parent. + +As a kset contains a kobject within it, it should always be dynamically +created and never declared statically or on the stack. To create a new +kset use:: + + struct kset *kset_create_and_add(const char *name, + struct kset_uevent_ops *u, + struct kobject *parent); + +When you are finished with the kset, call:: + + void kset_unregister(struct kset *kset); + +to destroy it. This removes the kset from sysfs and decrements its reference +count. When the reference count goes to zero, the kset will be released. +Because other references to the kset may still exist, the release may happen +after kset_unregister() returns. + +An example of using a kset can be seen in the +``samples/kobject/kset-example.c`` file in the kernel tree. + +If a kset wishes to control the uevent operations of the kobjects +associated with it, it can use the struct kset_uevent_ops to handle it:: + + struct kset_uevent_ops { + int (*filter)(struct kset *kset, struct kobject *kobj); + const char *(*name)(struct kset *kset, struct kobject *kobj); + int (*uevent)(struct kset *kset, struct kobject *kobj, + struct kobj_uevent_env *env); + }; + + +The filter function allows a kset to prevent a uevent from being emitted to +userspace for a specific kobject. If the function returns 0, the uevent +will not be emitted. + +The name function will be called to override the default name of the kset +that the uevent sends to userspace. By default, the name will be the same +as the kset itself, but this function, if present, can override that name. + +The uevent function will be called when the uevent is about to be sent to +userspace to allow more environment variables to be added to the uevent. + +One might ask how, exactly, a kobject is added to a kset, given that no +functions which perform that function have been presented. The answer is +that this task is handled by kobject_add(). When a kobject is passed to +kobject_add(), its kset member should point to the kset to which the +kobject will belong. kobject_add() will handle the rest. + +If the kobject belonging to a kset has no parent kobject set, it will be +added to the kset's directory. Not all members of a kset do necessarily +live in the kset directory. If an explicit parent kobject is assigned +before the kobject is added, the kobject is registered with the kset, but +added below the parent kobject. + + +Kobject removal +=============== + +After a kobject has been registered with the kobject core successfully, it +must be cleaned up when the code is finished with it. To do that, call +kobject_put(). By doing this, the kobject core will automatically clean up +all of the memory allocated by this kobject. If a ``KOBJ_ADD`` uevent has been +sent for the object, a corresponding ``KOBJ_REMOVE`` uevent will be sent, and +any other sysfs housekeeping will be handled for the caller properly. + +If you need to do a two-stage delete of the kobject (say you are not +allowed to sleep when you need to destroy the object), then call +kobject_del() which will unregister the kobject from sysfs. This makes the +kobject "invisible", but it is not cleaned up, and the reference count of +the object is still the same. At a later time call kobject_put() to finish +the cleanup of the memory associated with the kobject. + +kobject_del() can be used to drop the reference to the parent object, if +circular references are constructed. It is valid in some cases, that a +parent objects references a child. Circular references _must_ be broken +with an explicit call to kobject_del(), so that a release functions will be +called, and the objects in the former circle release each other. + + +Example code to copy from +========================= + +For a more complete example of using ksets and kobjects properly, see the +example programs ``samples/kobject/{kobject-example.c,kset-example.c}``, +which will be built as loadable modules if you select ``CONFIG_SAMPLE_KOBJECT``. diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst index be726986ff75..2adffb3f7914 100644 --- a/Documentation/core-api/mm-api.rst +++ b/Documentation/core-api/mm-api.rst @@ -73,6 +73,9 @@ File Mapping and Page Cache .. kernel-doc:: mm/truncate.c :export: +.. kernel-doc:: include/linux/pagemap.h + :internal: + Memory pools ============ diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index 1d490155ecd7..2e939ff10b86 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -52,8 +52,22 @@ Which flags are set by each wrapper For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup flags the caller provides. The caller is required to pass in a non-null struct -pages* array, and the function then pin pages by incrementing each by a special -value. For now, that value is +1, just like get_user_pages*().:: +pages* array, and the function then pins pages by incrementing each by a special +value: GUP_PIN_COUNTING_BIAS. + +For huge pages (and in fact, any compound page of more than 2 pages), the +GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting +is achieved, by using the 3rd struct page in the compound page. A new struct +page field, hpage_pinned_refcount, has been added in order to support this. + +This approach for compound pages avoids the counting upper limit problems that +are discussed below. Those limitations would have been aggravated severely by +huge pages, because each tail page adds a refcount to the head page. And in +fact, testing revealed that, without a separate hpage_pinned_refcount field, +page overflows were seen in some huge page stress tests. + +This also means that huge pages and compound pages (of order > 1) do not suffer +from the false positives problem that is mentioned below.:: Function -------- @@ -99,27 +113,6 @@ pages: This also leads to limitations: there are only 31-10==21 bits available for a counter that increments 10 bits at a time. -TODO: for 1GB and larger huge pages, this is cutting it close. That's because -when pin_user_pages() follows such pages, it increments the head page by "1" -(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for -pin_user_pages()) for each tail page. So if you have a 1GB huge page: - -* There are 256K (18 bits) worth of 4 KB tail pages. -* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is, - 10 bits at a time) -* There are 21 - 18 == 3 bits available to count. Except that there aren't, - because you need to allow for a few normal get_page() calls on the head page, - as well. Fortunately, the approach of using addition, rather than "hard" - bitfields, within page->_refcount, allows for sharing these bits gracefully. - But we're still looking at about 8 references. - -This, however, is a missing feature more than anything else, because it's easily -solved by addressing an obvious inefficiency in the original get_user_pages() -approach of retrieving pages: stop treating all the pages as if they were -PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of -this, so some work is required. Once that's in place, this limitation mostly -disappears from view, because there will be ample refcounting range available. - * Callers must specifically request "dma-pinned tracking of pages". In other words, just calling get_user_pages() will not suffice; a new set of functions, pin_user_page() and related, must be used. @@ -173,8 +166,8 @@ CASE 4: Pinning for struct page manipulation only ------------------------------------------------- Here, normal GUP calls are sufficient, so neither flag needs to be set. -page_dma_pinned(): the whole point of pinning -============================================= +page_maybe_dma_pinned(): the whole point of pinning +=================================================== The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able to query, "is this page DMA-pinned?" That allows code such as page_mkclean() @@ -186,7 +179,7 @@ and debates (see the References at the end of this document). It's a TODO item here: fill in the details once that's worked out. Meanwhile, it's safe to say that having this available: :: - static inline bool page_dma_pinned(struct page *page) + static inline bool page_maybe_dma_pinned(struct page *page) ...is a prerequisite to solving the long-running gup+DMA problem. @@ -215,12 +208,42 @@ has the following new calls to exercise the new pin*() wrapper functions: You can monitor how many total dma-pinned pages have been acquired and released since the system was booted, via two new /proc/vmstat entries: :: - /proc/vmstat/nr_foll_pin_requested - /proc/vmstat/nr_foll_pin_requested + /proc/vmstat/nr_foll_pin_acquired + /proc/vmstat/nr_foll_pin_released + +Under normal conditions, these two values will be equal unless there are any +long-term [R]DMA pins in place, or during pin/unpin transitions. + +* nr_foll_pin_acquired: This is the number of logical pins that have been + acquired since the system was powered on. For huge pages, the head page is + pinned once for each page (head page and each tail page) within the huge page. + This follows the same sort of behavior that get_user_pages() uses for huge + pages: the head page is refcounted once for each tail or head page in the huge + page, when get_user_pages() is applied to a huge page. + +* nr_foll_pin_released: The number of logical pins that have been released since + the system was powered on. Note that pages are released (unpinned) on a + PAGE_SIZE granularity, even if the original pin was applied to a huge page. + Becaused of the pin count behavior described above in "nr_foll_pin_acquired", + the accounting balances out, so that after doing this:: + + pin_user_pages(huge_page); + for (each page in huge_page) + unpin_user_page(page); + +...the following is expected:: + + nr_foll_pin_released == nr_foll_pin_acquired + +(...unless it was already out of balance due to a long-term RDMA pin being in +place.) + +Other diagnostics +================= -Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is -because there is a noticeable performance drop in unpin_user_page(), when they -are activated. +dump_page() has been enhanced slightly, to handle these new counting fields, and +to better report on compound pages in general. Specifically, for compound pages +with order > 1, the exact (hpage_pinned_refcount) pincount is reported. References ========== @@ -228,5 +251,6 @@ References * `Some slow progress on get_user_pages() (Apr 2, 2019) `_ * `DMA and get_user_pages() (LPC: Dec 12, 2018) `_ * `The trouble with get_user_pages() (Apr 30, 2018) `_ +* `LWN kernel index: get_user_pages() `_ John Hubbard, October, 2019 diff --git a/Documentation/cpu-freq/amd-powernow.txt b/Documentation/cpu-freq/amd-powernow.txt deleted file mode 100644 index 254da155fa47..000000000000 --- a/Documentation/cpu-freq/amd-powernow.txt +++ /dev/null @@ -1,38 +0,0 @@ - -PowerNow! and Cool'n'Quiet are AMD names for frequency -management capabilities in AMD processors. As the hardware -implementation changes in new generations of the processors, -there is a different cpu-freq driver for each generation. - -Note that the driver's will not load on the "wrong" hardware, -so it is safe to try each driver in turn when in doubt as to -which is the correct driver. - -Note that the functionality to change frequency (and voltage) -is not available in all processors. The drivers will refuse -to load on processors without this capability. The capability -is detected with the cpuid instruction. - -The drivers use BIOS supplied tables to obtain frequency and -voltage information appropriate for a particular platform. -Frequency transitions will be unavailable if the BIOS does -not supply these tables. - -6th Generation: powernow-k6 - -7th Generation: powernow-k7: Athlon, Duron, Geode. - -8th Generation: powernow-k8: Athlon, Athlon 64, Opteron, Sempron. -Documentation on this functionality in 8th generation processors -is available in the "BIOS and Kernel Developer's Guide", publication -26094, in chapter 9, available for download from www.amd.com. - -BIOS supplied data, for powernow-k7 and for powernow-k8, may be -from either the PSB table or from ACPI objects. The ACPI support -is only available if the kernel config sets CONFIG_ACPI_PROCESSOR. -The powernow-k8 driver will attempt to use ACPI if so configured, -and fall back to PST if that fails. -The powernow-k7 driver will try to use the PSB support first, and -fall back to ACPI if the PSB support fails. A module parameter, -acpi_force, is provided to force ACPI support to be used instead -of PSB support. diff --git a/Documentation/cpu-freq/core.rst b/Documentation/cpu-freq/core.rst new file mode 100644 index 000000000000..33cb90bd1d8f --- /dev/null +++ b/Documentation/cpu-freq/core.rst @@ -0,0 +1,113 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================= +General description of the CPUFreq core and CPUFreq notifiers +============================================================= + +Authors: + - Dominik Brodowski + - David Kimdon + - Rafael J. Wysocki + - Viresh Kumar + +.. Contents: + + 1. CPUFreq core and interfaces + 2. CPUFreq notifiers + 3. CPUFreq Table Generation with Operating Performance Point (OPP) + +1. General Information +====================== + +The CPUFreq core code is located in drivers/cpufreq/cpufreq.c. This +cpufreq code offers a standardized interface for the CPUFreq +architecture drivers (those pieces of code that do actual +frequency transitions), as well as to "notifiers". These are device +drivers or other part of the kernel that need to be informed of +policy changes (ex. thermal modules like ACPI) or of all +frequency changes (ex. timing code) or even need to force certain +speed limits (like LCD drivers on ARM architecture). Additionally, the +kernel "constant" loops_per_jiffy is updated on frequency changes +here. + +Reference counting of the cpufreq policies is done by cpufreq_cpu_get +and cpufreq_cpu_put, which make sure that the cpufreq driver is +correctly registered with the core, and will not be unloaded until +cpufreq_put_cpu is called. That also ensures that the respective cpufreq +policy doesn't get freed while being used. + +2. CPUFreq notifiers +==================== + +CPUFreq notifiers conform to the standard kernel notifier interface. +See linux/include/linux/notifier.h for details on notifiers. + +There are two different CPUFreq notifiers - policy notifiers and +transition notifiers. + + +2.1 CPUFreq policy notifiers +---------------------------- + +These are notified when a new policy is created or removed. + +The phase is specified in the second argument to the notifier. The phase is +CPUFREQ_CREATE_POLICY when the policy is first created and it is +CPUFREQ_REMOVE_POLICY when the policy is removed. + +The third argument, a ``void *pointer``, points to a struct cpufreq_policy +consisting of several values, including min, max (the lower and upper +frequencies (in kHz) of the new policy). + + +2.2 CPUFreq transition notifiers +-------------------------------- + +These are notified twice for each online CPU in the policy, when the +CPUfreq driver switches the CPU core frequency and this change has no +any external implications. + +The second argument specifies the phase - CPUFREQ_PRECHANGE or +CPUFREQ_POSTCHANGE. + +The third argument is a struct cpufreq_freqs with the following +values: + +===== =========================== +cpu number of the affected CPU +old old frequency +new new frequency +flags flags of the cpufreq driver +===== =========================== + +3. CPUFreq Table Generation with Operating Performance Point (OPP) +================================================================== +For details about OPP, see Documentation/power/opp.rst + +dev_pm_opp_init_cpufreq_table - + This function provides a ready to use conversion routine to translate + the OPP layer's internal information about the available frequencies + into a format readily providable to cpufreq. + + .. Warning:: + + Do not use this function in interrupt context. + + Example:: + + soc_pm_init() + { + /* Do things */ + r = dev_pm_opp_init_cpufreq_table(dev, &freq_table); + if (!r) + policy->freq_table = freq_table; + /* Do other things */ + } + + .. note:: + + This function is available only if CONFIG_CPU_FREQ is enabled in + addition to CONFIG_PM_OPP. + +dev_pm_opp_free_cpufreq_table + Free up the table allocated by dev_pm_opp_init_cpufreq_table diff --git a/Documentation/cpu-freq/core.txt b/Documentation/cpu-freq/core.txt deleted file mode 100644 index ed577d9c154b..000000000000 --- a/Documentation/cpu-freq/core.txt +++ /dev/null @@ -1,112 +0,0 @@ - CPU frequency and voltage scaling code in the Linux(TM) kernel - - - L i n u x C P U F r e q - - C P U F r e q C o r e - - - Dominik Brodowski - David Kimdon - Rafael J. Wysocki - Viresh Kumar - - - - Clock scaling allows you to change the clock speed of the CPUs on the - fly. This is a nice method to save battery power, because the lower - the clock speed, the less power the CPU consumes. - - -Contents: ---------- -1. CPUFreq core and interfaces -2. CPUFreq notifiers -3. CPUFreq Table Generation with Operating Performance Point (OPP) - -1. General Information -======================= - -The CPUFreq core code is located in drivers/cpufreq/cpufreq.c. This -cpufreq code offers a standardized interface for the CPUFreq -architecture drivers (those pieces of code that do actual -frequency transitions), as well as to "notifiers". These are device -drivers or other part of the kernel that need to be informed of -policy changes (ex. thermal modules like ACPI) or of all -frequency changes (ex. timing code) or even need to force certain -speed limits (like LCD drivers on ARM architecture). Additionally, the -kernel "constant" loops_per_jiffy is updated on frequency changes -here. - -Reference counting of the cpufreq policies is done by cpufreq_cpu_get -and cpufreq_cpu_put, which make sure that the cpufreq driver is -correctly registered with the core, and will not be unloaded until -cpufreq_put_cpu is called. That also ensures that the respective cpufreq -policy doesn't get freed while being used. - -2. CPUFreq notifiers -==================== - -CPUFreq notifiers conform to the standard kernel notifier interface. -See linux/include/linux/notifier.h for details on notifiers. - -There are two different CPUFreq notifiers - policy notifiers and -transition notifiers. - - -2.1 CPUFreq policy notifiers ----------------------------- - -These are notified when a new policy is created or removed. - -The phase is specified in the second argument to the notifier. The phase is -CPUFREQ_CREATE_POLICY when the policy is first created and it is -CPUFREQ_REMOVE_POLICY when the policy is removed. - -The third argument, a void *pointer, points to a struct cpufreq_policy -consisting of several values, including min, max (the lower and upper -frequencies (in kHz) of the new policy). - - -2.2 CPUFreq transition notifiers --------------------------------- - -These are notified twice for each online CPU in the policy, when the -CPUfreq driver switches the CPU core frequency and this change has no -any external implications. - -The second argument specifies the phase - CPUFREQ_PRECHANGE or -CPUFREQ_POSTCHANGE. - -The third argument is a struct cpufreq_freqs with the following -values: -cpu - number of the affected CPU -old - old frequency -new - new frequency -flags - flags of the cpufreq driver - -3. CPUFreq Table Generation with Operating Performance Point (OPP) -================================================================== -For details about OPP, see Documentation/power/opp.rst - -dev_pm_opp_init_cpufreq_table - - This function provides a ready to use conversion routine to translate - the OPP layer's internal information about the available frequencies - into a format readily providable to cpufreq. - - WARNING: Do not use this function in interrupt context. - - Example: - soc_pm_init() - { - /* Do things */ - r = dev_pm_opp_init_cpufreq_table(dev, &freq_table); - if (!r) - policy->freq_table = freq_table; - /* Do other things */ - } - - NOTE: This function is available only if CONFIG_CPU_FREQ is enabled in - addition to CONFIG_PM_OPP. - -dev_pm_opp_free_cpufreq_table - Free up the table allocated by dev_pm_opp_init_cpufreq_table diff --git a/Documentation/cpu-freq/cpu-drivers.rst b/Documentation/cpu-freq/cpu-drivers.rst new file mode 100644 index 000000000000..a697278ce190 --- /dev/null +++ b/Documentation/cpu-freq/cpu-drivers.rst @@ -0,0 +1,292 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================================== +How to Implement a new CPUFreq Processor Driver +=============================================== + +Authors: + + + - Dominik Brodowski + - Rafael J. Wysocki + - Viresh Kumar + +.. Contents + + 1. What To Do? + 1.1 Initialization + 1.2 Per-CPU Initialization + 1.3 verify + 1.4 target/target_index or setpolicy? + 1.5 target/target_index + 1.6 setpolicy + 1.7 get_intermediate and target_intermediate + 2. Frequency Table Helpers + + + +1. What To Do? +============== + +So, you just got a brand-new CPU / chipset with datasheets and want to +add cpufreq support for this CPU / chipset? Great. Here are some hints +on what is necessary: + + +1.1 Initialization +------------------ + +First of all, in an __initcall level 7 (module_init()) or later +function check whether this kernel runs on the right CPU and the right +chipset. If so, register a struct cpufreq_driver with the CPUfreq core +using cpufreq_register_driver() + +What shall this struct cpufreq_driver contain? + + .name - The name of this driver. + + .init - A pointer to the per-policy initialization function. + + .verify - A pointer to a "verification" function. + + .setpolicy _or_ .fast_switch _or_ .target _or_ .target_index - See + below on the differences. + +And optionally + + .flags - Hints for the cpufreq core. + + .driver_data - cpufreq driver specific data. + + .resolve_freq - Returns the most appropriate frequency for a target + frequency. Doesn't change the frequency though. + + .get_intermediate and target_intermediate - Used to switch to stable + frequency while changing CPU frequency. + + .get - Returns current frequency of the CPU. + + .bios_limit - Returns HW/BIOS max frequency limitations for the CPU. + + .exit - A pointer to a per-policy cleanup function called during + CPU_POST_DEAD phase of cpu hotplug process. + + .stop_cpu - A pointer to a per-policy stop function called during + CPU_DOWN_PREPARE phase of cpu hotplug process. + + .suspend - A pointer to a per-policy suspend function which is called + with interrupts disabled and _after_ the governor is stopped for the + policy. + + .resume - A pointer to a per-policy resume function which is called + with interrupts disabled and _before_ the governor is started again. + + .ready - A pointer to a per-policy ready function which is called after + the policy is fully initialized. + + .attr - A pointer to a NULL-terminated list of "struct freq_attr" which + allow to export values to sysfs. + + .boost_enabled - If set, boost frequencies are enabled. + + .set_boost - A pointer to a per-policy function to enable/disable boost + frequencies. + + +1.2 Per-CPU Initialization +-------------------------- + +Whenever a new CPU is registered with the device model, or after the +cpufreq driver registers itself, the per-policy initialization function +cpufreq_driver.init is called if no cpufreq policy existed for the CPU. +Note that the .init() and .exit() routines are called only once for the +policy and not for each CPU managed by the policy. It takes a ``struct +cpufreq_policy *policy`` as argument. What to do now? + +If necessary, activate the CPUfreq support on your CPU. + +Then, the driver must fill in the following values: + ++-----------------------------------+--------------------------------------+ +|policy->cpuinfo.min_freq _and_ | | +|policy->cpuinfo.max_freq | the minimum and maximum frequency | +| | (in kHz) which is supported by | +| | this CPU | ++-----------------------------------+--------------------------------------+ +|policy->cpuinfo.transition_latency | the time it takes on this CPU to | +| | switch between two frequencies in | +| | nanoseconds (if appropriate, else | +| | specify CPUFREQ_ETERNAL) | ++-----------------------------------+--------------------------------------+ +|policy->cur | The current operating frequency of | +| | this CPU (if appropriate) | ++-----------------------------------+--------------------------------------+ +|policy->min, | | +|policy->max, | | +|policy->policy and, if necessary, | | +|policy->governor | must contain the "default policy" for| +| | this CPU. A few moments later, | +| | cpufreq_driver.verify and either | +| | cpufreq_driver.setpolicy or | +| | cpufreq_driver.target/target_index is| +| | called with these values. | ++-----------------------------------+--------------------------------------+ +|policy->cpus | Update this with the masks of the | +| | (online + offline) CPUs that do DVFS | +| | along with this CPU (i.e. that share| +| | clock/voltage rails with it). | ++-----------------------------------+--------------------------------------+ + +For setting some of these values (cpuinfo.min[max]_freq, policy->min[max]), the +frequency table helpers might be helpful. See the section 2 for more information +on them. + + +1.3 verify +---------- + +When the user decides a new policy (consisting of +"policy,governor,min,max") shall be set, this policy must be validated +so that incompatible values can be corrected. For verifying these +values cpufreq_verify_within_limits(``struct cpufreq_policy *policy``, +``unsigned int min_freq``, ``unsigned int max_freq``) function might be helpful. +See section 2 for details on frequency table helpers. + +You need to make sure that at least one valid frequency (or operating +range) is within policy->min and policy->max. If necessary, increase +policy->max first, and only if this is no solution, decrease policy->min. + + +1.4 target or target_index or setpolicy or fast_switch? +------------------------------------------------------- + +Most cpufreq drivers or even most cpu frequency scaling algorithms +only allow the CPU frequency to be set to predefined fixed values. For +these, you use the ->target(), ->target_index() or ->fast_switch() +callbacks. + +Some cpufreq capable processors switch the frequency between certain +limits on their own. These shall use the ->setpolicy() callback. + + +1.5. target/target_index +------------------------ + +The target_index call has two arguments: ``struct cpufreq_policy *policy``, +and ``unsigned int`` index (into the exposed frequency table). + +The CPUfreq driver must set the new frequency when called here. The +actual frequency must be determined by freq_table[index].frequency. + +It should always restore to earlier frequency (i.e. policy->restore_freq) in +case of errors, even if we switched to intermediate frequency earlier. + +Deprecated +---------- +The target call has three arguments: ``struct cpufreq_policy *policy``, +unsigned int target_frequency, unsigned int relation. + +The CPUfreq driver must set the new frequency when called here. The +actual frequency must be determined using the following rules: + +- keep close to "target_freq" +- policy->min <= new_freq <= policy->max (THIS MUST BE VALID!!!) +- if relation==CPUFREQ_REL_L, try to select a new_freq higher than or equal + target_freq. ("L for lowest, but no lower than") +- if relation==CPUFREQ_REL_H, try to select a new_freq lower than or equal + target_freq. ("H for highest, but no higher than") + +Here again the frequency table helper might assist you - see section 2 +for details. + +1.6. fast_switch +---------------- + +This function is used for frequency switching from scheduler's context. +Not all drivers are expected to implement it, as sleeping from within +this callback isn't allowed. This callback must be highly optimized to +do switching as fast as possible. + +This function has two arguments: ``struct cpufreq_policy *policy`` and +``unsigned int target_frequency``. + + +1.7 setpolicy +------------- + +The setpolicy call only takes a ``struct cpufreq_policy *policy`` as +argument. You need to set the lower limit of the in-processor or +in-chipset dynamic frequency switching to policy->min, the upper limit +to policy->max, and -if supported- select a performance-oriented +setting when policy->policy is CPUFREQ_POLICY_PERFORMANCE, and a +powersaving-oriented setting when CPUFREQ_POLICY_POWERSAVE. Also check +the reference implementation in drivers/cpufreq/longrun.c + +1.8 get_intermediate and target_intermediate +-------------------------------------------- + +Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION unset. + +get_intermediate should return a stable intermediate frequency platform wants to +switch to, and target_intermediate() should set CPU to that frequency, before +jumping to the frequency corresponding to 'index'. Core will take care of +sending notifications and driver doesn't have to handle them in +target_intermediate() or target_index(). + +Drivers can return '0' from get_intermediate() in case they don't wish to switch +to intermediate frequency for some target frequency. In that case core will +directly call ->target_index(). + +NOTE: ->target_index() should restore to policy->restore_freq in case of +failures as core would send notifications for that. + + +2. Frequency Table Helpers +========================== + +As most cpufreq processors only allow for being set to a few specific +frequencies, a "frequency table" with some functions might assist in +some work of the processor driver. Such a "frequency table" consists of +an array of struct cpufreq_frequency_table entries, with driver specific +values in "driver_data", the corresponding frequency in "frequency" and +flags set. At the end of the table, you need to add a +cpufreq_frequency_table entry with frequency set to CPUFREQ_TABLE_END. +And if you want to skip one entry in the table, set the frequency to +CPUFREQ_ENTRY_INVALID. The entries don't need to be in sorted in any +particular order, but if they are cpufreq core will do DVFS a bit +quickly for them as search for best match is faster. + +The cpufreq table is verified automatically by the core if the policy contains a +valid pointer in its policy->freq_table field. + +cpufreq_frequency_table_verify() assures that at least one valid +frequency is within policy->min and policy->max, and all other criteria +are met. This is helpful for the ->verify call. + +cpufreq_frequency_table_target() is the corresponding frequency table +helper for the ->target stage. Just pass the values to this function, +and this function returns the of the frequency table entry which +contains the frequency the CPU shall be set to. + +The following macros can be used as iterators over cpufreq_frequency_table: + +cpufreq_for_each_entry(pos, table) - iterates over all entries of frequency +table. + +cpufreq_for_each_valid_entry(pos, table) - iterates over all entries, +excluding CPUFREQ_ENTRY_INVALID frequencies. +Use arguments "pos" - a ``cpufreq_frequency_table *`` as a loop cursor and +"table" - the ``cpufreq_frequency_table *`` you want to iterate over. + +For example:: + + struct cpufreq_frequency_table *pos, *driver_freq_table; + + cpufreq_for_each_entry(pos, driver_freq_table) { + /* Do something with pos */ + pos->frequency = ... + } + +If you need to work with the position of pos within driver_freq_table, +do not subtract the pointers, as it is quite costly. Instead, use the +macros cpufreq_for_each_entry_idx() and cpufreq_for_each_valid_entry_idx(). diff --git a/Documentation/cpu-freq/cpu-drivers.txt b/Documentation/cpu-freq/cpu-drivers.txt deleted file mode 100644 index 6e353d00cdc6..000000000000 --- a/Documentation/cpu-freq/cpu-drivers.txt +++ /dev/null @@ -1,295 +0,0 @@ - CPU frequency and voltage scaling code in the Linux(TM) kernel - - - L i n u x C P U F r e q - - C P U D r i v e r s - - - information for developers - - - - Dominik Brodowski - Rafael J. Wysocki - Viresh Kumar - - - - Clock scaling allows you to change the clock speed of the CPUs on the - fly. This is a nice method to save battery power, because the lower - the clock speed, the less power the CPU consumes. - - -Contents: ---------- -1. What To Do? -1.1 Initialization -1.2 Per-CPU Initialization -1.3 verify -1.4 target/target_index or setpolicy? -1.5 target/target_index -1.6 setpolicy -1.7 get_intermediate and target_intermediate -2. Frequency Table Helpers - - - -1. What To Do? -============== - -So, you just got a brand-new CPU / chipset with datasheets and want to -add cpufreq support for this CPU / chipset? Great. Here are some hints -on what is necessary: - - -1.1 Initialization ------------------- - -First of all, in an __initcall level 7 (module_init()) or later -function check whether this kernel runs on the right CPU and the right -chipset. If so, register a struct cpufreq_driver with the CPUfreq core -using cpufreq_register_driver() - -What shall this struct cpufreq_driver contain? - - .name - The name of this driver. - - .init - A pointer to the per-policy initialization function. - - .verify - A pointer to a "verification" function. - - .setpolicy _or_ .fast_switch _or_ .target _or_ .target_index - See - below on the differences. - -And optionally - - .flags - Hints for the cpufreq core. - - .driver_data - cpufreq driver specific data. - - .resolve_freq - Returns the most appropriate frequency for a target - frequency. Doesn't change the frequency though. - - .get_intermediate and target_intermediate - Used to switch to stable - frequency while changing CPU frequency. - - .get - Returns current frequency of the CPU. - - .bios_limit - Returns HW/BIOS max frequency limitations for the CPU. - - .exit - A pointer to a per-policy cleanup function called during - CPU_POST_DEAD phase of cpu hotplug process. - - .stop_cpu - A pointer to a per-policy stop function called during - CPU_DOWN_PREPARE phase of cpu hotplug process. - - .suspend - A pointer to a per-policy suspend function which is called - with interrupts disabled and _after_ the governor is stopped for the - policy. - - .resume - A pointer to a per-policy resume function which is called - with interrupts disabled and _before_ the governor is started again. - - .ready - A pointer to a per-policy ready function which is called after - the policy is fully initialized. - - .attr - A pointer to a NULL-terminated list of "struct freq_attr" which - allow to export values to sysfs. - - .boost_enabled - If set, boost frequencies are enabled. - - .set_boost - A pointer to a per-policy function to enable/disable boost - frequencies. - - -1.2 Per-CPU Initialization --------------------------- - -Whenever a new CPU is registered with the device model, or after the -cpufreq driver registers itself, the per-policy initialization function -cpufreq_driver.init is called if no cpufreq policy existed for the CPU. -Note that the .init() and .exit() routines are called only once for the -policy and not for each CPU managed by the policy. It takes a struct -cpufreq_policy *policy as argument. What to do now? - -If necessary, activate the CPUfreq support on your CPU. - -Then, the driver must fill in the following values: - -policy->cpuinfo.min_freq _and_ -policy->cpuinfo.max_freq - the minimum and maximum frequency - (in kHz) which is supported by - this CPU -policy->cpuinfo.transition_latency the time it takes on this CPU to - switch between two frequencies in - nanoseconds (if appropriate, else - specify CPUFREQ_ETERNAL) - -policy->cur The current operating frequency of - this CPU (if appropriate) -policy->min, -policy->max, -policy->policy and, if necessary, -policy->governor must contain the "default policy" for - this CPU. A few moments later, - cpufreq_driver.verify and either - cpufreq_driver.setpolicy or - cpufreq_driver.target/target_index is called - with these values. -policy->cpus Update this with the masks of the - (online + offline) CPUs that do DVFS - along with this CPU (i.e. that share - clock/voltage rails with it). - -For setting some of these values (cpuinfo.min[max]_freq, policy->min[max]), the -frequency table helpers might be helpful. See the section 2 for more information -on them. - - -1.3 verify ----------- - -When the user decides a new policy (consisting of -"policy,governor,min,max") shall be set, this policy must be validated -so that incompatible values can be corrected. For verifying these -values cpufreq_verify_within_limits(struct cpufreq_policy *policy, -unsigned int min_freq, unsigned int max_freq) function might be helpful. -See section 2 for details on frequency table helpers. - -You need to make sure that at least one valid frequency (or operating -range) is within policy->min and policy->max. If necessary, increase -policy->max first, and only if this is no solution, decrease policy->min. - - -1.4 target or target_index or setpolicy or fast_switch? -------------------------------------------------------- - -Most cpufreq drivers or even most cpu frequency scaling algorithms -only allow the CPU frequency to be set to predefined fixed values. For -these, you use the ->target(), ->target_index() or ->fast_switch() -callbacks. - -Some cpufreq capable processors switch the frequency between certain -limits on their own. These shall use the ->setpolicy() callback. - - -1.5. target/target_index ------------------------- - -The target_index call has two arguments: struct cpufreq_policy *policy, -and unsigned int index (into the exposed frequency table). - -The CPUfreq driver must set the new frequency when called here. The -actual frequency must be determined by freq_table[index].frequency. - -It should always restore to earlier frequency (i.e. policy->restore_freq) in -case of errors, even if we switched to intermediate frequency earlier. - -Deprecated: ----------- -The target call has three arguments: struct cpufreq_policy *policy, -unsigned int target_frequency, unsigned int relation. - -The CPUfreq driver must set the new frequency when called here. The -actual frequency must be determined using the following rules: - -- keep close to "target_freq" -- policy->min <= new_freq <= policy->max (THIS MUST BE VALID!!!) -- if relation==CPUFREQ_REL_L, try to select a new_freq higher than or equal - target_freq. ("L for lowest, but no lower than") -- if relation==CPUFREQ_REL_H, try to select a new_freq lower than or equal - target_freq. ("H for highest, but no higher than") - -Here again the frequency table helper might assist you - see section 2 -for details. - -1.6. fast_switch ----------------- - -This function is used for frequency switching from scheduler's context. -Not all drivers are expected to implement it, as sleeping from within -this callback isn't allowed. This callback must be highly optimized to -do switching as fast as possible. - -This function has two arguments: struct cpufreq_policy *policy and -unsigned int target_frequency. - - -1.7 setpolicy -------------- - -The setpolicy call only takes a struct cpufreq_policy *policy as -argument. You need to set the lower limit of the in-processor or -in-chipset dynamic frequency switching to policy->min, the upper limit -to policy->max, and -if supported- select a performance-oriented -setting when policy->policy is CPUFREQ_POLICY_PERFORMANCE, and a -powersaving-oriented setting when CPUFREQ_POLICY_POWERSAVE. Also check -the reference implementation in drivers/cpufreq/longrun.c - -1.8 get_intermediate and target_intermediate --------------------------------------------- - -Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION unset. - -get_intermediate should return a stable intermediate frequency platform wants to -switch to, and target_intermediate() should set CPU to that frequency, before -jumping to the frequency corresponding to 'index'. Core will take care of -sending notifications and driver doesn't have to handle them in -target_intermediate() or target_index(). - -Drivers can return '0' from get_intermediate() in case they don't wish to switch -to intermediate frequency for some target frequency. In that case core will -directly call ->target_index(). - -NOTE: ->target_index() should restore to policy->restore_freq in case of -failures as core would send notifications for that. - - -2. Frequency Table Helpers -========================== - -As most cpufreq processors only allow for being set to a few specific -frequencies, a "frequency table" with some functions might assist in -some work of the processor driver. Such a "frequency table" consists of -an array of struct cpufreq_frequency_table entries, with driver specific -values in "driver_data", the corresponding frequency in "frequency" and -flags set. At the end of the table, you need to add a -cpufreq_frequency_table entry with frequency set to CPUFREQ_TABLE_END. -And if you want to skip one entry in the table, set the frequency to -CPUFREQ_ENTRY_INVALID. The entries don't need to be in sorted in any -particular order, but if they are cpufreq core will do DVFS a bit -quickly for them as search for best match is faster. - -The cpufreq table is verified automatically by the core if the policy contains a -valid pointer in its policy->freq_table field. - -cpufreq_frequency_table_verify() assures that at least one valid -frequency is within policy->min and policy->max, and all other criteria -are met. This is helpful for the ->verify call. - -cpufreq_frequency_table_target() is the corresponding frequency table -helper for the ->target stage. Just pass the values to this function, -and this function returns the of the frequency table entry which -contains the frequency the CPU shall be set to. - -The following macros can be used as iterators over cpufreq_frequency_table: - -cpufreq_for_each_entry(pos, table) - iterates over all entries of frequency -table. - -cpufreq_for_each_valid_entry(pos, table) - iterates over all entries, -excluding CPUFREQ_ENTRY_INVALID frequencies. -Use arguments "pos" - a cpufreq_frequency_table * as a loop cursor and -"table" - the cpufreq_frequency_table * you want to iterate over. - -For example: - - struct cpufreq_frequency_table *pos, *driver_freq_table; - - cpufreq_for_each_entry(pos, driver_freq_table) { - /* Do something with pos */ - pos->frequency = ... - } - -If you need to work with the position of pos within driver_freq_table, -do not subtract the pointers, as it is quite costly. Instead, use the -macros cpufreq_for_each_entry_idx() and cpufreq_for_each_valid_entry_idx(). diff --git a/Documentation/cpu-freq/cpufreq-nforce2.txt b/Documentation/cpu-freq/cpufreq-nforce2.txt deleted file mode 100644 index babce1315026..000000000000 --- a/Documentation/cpu-freq/cpufreq-nforce2.txt +++ /dev/null @@ -1,19 +0,0 @@ - -The cpufreq-nforce2 driver changes the FSB on nVidia nForce2 platforms. - -This works better than on other platforms, because the FSB of the CPU -can be controlled independently from the PCI/AGP clock. - -The module has two options: - - fid: multiplier * 10 (for example 8.5 = 85) - min_fsb: minimum FSB - -If not set, fid is calculated from the current CPU speed and the FSB. -min_fsb defaults to FSB at boot time - 50 MHz. - -IMPORTANT: The available range is limited downwards! - Also the minimum available FSB can differ, for systems - booting with 200 MHz, 150 should always work. - - diff --git a/Documentation/cpu-freq/cpufreq-stats.rst b/Documentation/cpu-freq/cpufreq-stats.rst new file mode 100644 index 000000000000..9ad695b1c7db --- /dev/null +++ b/Documentation/cpu-freq/cpufreq-stats.rst @@ -0,0 +1,136 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================== +General Description of sysfs CPUFreq Stats +========================================== + +information for users + + +Author: Venkatesh Pallipadi + +.. Contents + + 1. Introduction + 2. Statistics Provided (with example) + 3. Configuring cpufreq-stats + + +1. Introduction +=============== + +cpufreq-stats is a driver that provides CPU frequency statistics for each CPU. +These statistics are provided in /sysfs as a bunch of read_only interfaces. This +interface (when configured) will appear in a separate directory under cpufreq +in /sysfs (/devices/system/cpu/cpuX/cpufreq/stats/) for each CPU. +Various statistics will form read_only files under this directory. + +This driver is designed to be independent of any particular cpufreq_driver +that may be running on your CPU. So, it will work with any cpufreq_driver. + + +2. Statistics Provided (with example) +===================================== + +cpufreq stats provides following statistics (explained in detail below). + +- time_in_state +- total_trans +- trans_table + +All the statistics will be from the time the stats driver has been inserted +(or the time the stats were reset) to the time when a read of a particular +statistic is done. Obviously, stats driver will not have any information +about the frequency transitions before the stats driver insertion. + +:: + + :/sys/devices/system/cpu/cpu0/cpufreq/stats # ls -l + total 0 + drwxr-xr-x 2 root root 0 May 14 16:06 . + drwxr-xr-x 3 root root 0 May 14 15:58 .. + --w------- 1 root root 4096 May 14 16:06 reset + -r--r--r-- 1 root root 4096 May 14 16:06 time_in_state + -r--r--r-- 1 root root 4096 May 14 16:06 total_trans + -r--r--r-- 1 root root 4096 May 14 16:06 trans_table + +- **reset** + +Write-only attribute that can be used to reset the stat counters. This can be +useful for evaluating system behaviour under different governors without the +need for a reboot. + +- **time_in_state** + +This gives the amount of time spent in each of the frequencies supported by +this CPU. The cat output will have "