Based on kernel version 4.16.1. Page generated on 2018-04-09 11:53 EST.
=======================
INTEL POWERCLAMP DRIVER
=======================
By: Arjan van de Ven <arjan@linux.intel.com>
    Jacob Pan <jacob.jun.pan@linux.intel.com>

Contents:
        (*) Introduction
            - Goals and Objectives

        (*) Theory of Operation
            - Idle Injection
            - Calibration

        (*) Performance Analysis
            - Effectiveness and Limitations
            - Power vs Performance
            - Scalability
            - Calibration
            - Comparison with Alternative Techniques

        (*) Usage and Interfaces
            - Generic Thermal Layer (sysfs)
            - Kernel APIs (TBD)

============
INTRODUCTION
============

Consider the situation where a system's power consumption must be
reduced at runtime, due to power budget, thermal constraint, or noise
level, and where active cooling is not preferred. Software managed
passive power reduction must be performed to prevent the hardware
actions that are designed for catastrophic scenarios.

Currently, P-states, T-states (clock modulation), and CPU offlining
are used for CPU throttling.

On Intel CPUs, C-states provide effective power reduction, but so far
they're only used opportunistically, based on workload. With the
development of the intel_powerclamp driver, a method of synchronizing
idle injection across all online CPU threads was introduced. The goal
is to achieve forced and controllable C-state residency.

Tests and analysis have been done in the areas of power, performance,
scalability, and user experience. In many cases, a clear advantage is
shown over taking the CPU offline or modulating the CPU clock.


===================
THEORY OF OPERATION
===================

Idle Injection
--------------

On modern Intel processors (Nehalem or later), package level C-state
residency is available in MSRs, thus also available to the kernel.

These MSRs are:
      #define MSR_PKG_C2_RESIDENCY      0x60D
      #define MSR_PKG_C3_RESIDENCY      0x3F8
      #define MSR_PKG_C6_RESIDENCY      0x3F9
      #define MSR_PKG_C7_RESIDENCY      0x3FA

If the kernel can also inject idle time to the system, then a
closed-loop control system can be established that manages package
level C-state. The intel_powerclamp driver is conceived as such a
control system, where the target set point is a user-selected idle
ratio (based on power reduction), and the error is the difference
between the actual package level C-state residency ratio and the
target idle ratio.

Injection is controlled by high priority kernel threads, spawned for
each online CPU.

These kernel threads, with SCHED_FIFO class, are created to perform
clamping actions of controlled duty ratio and duration. Each per-CPU
thread synchronizes its idle time and duration, based on the rounding
of jiffies, so accumulated errors can be prevented to avoid a jittery
effect. Threads are also bound to the CPU such that they cannot be
migrated, unless the CPU is taken offline. In this case, threads
belonging to the offlined CPUs will be terminated immediately.

Running as SCHED_FIFO at a relatively high priority also allows this
scheme to work for both preemptable and non-preemptable kernels.
Alignment of idle time around jiffies ensures scalability for HZ
values. This effect can be better visualized using a Perf timechart.
The following diagram shows the behavior of kernel thread
kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
for a given "duration", then relinquishes the CPU to other tasks,
until the next time interval.

The NOHZ schedule tick is disabled during idle time, but interrupts
are not masked. Tests show that the extra wakeups from the scheduler
tick have a dramatic impact on the effectiveness of the powerclamp
driver on large scale systems (Westmere system with 80 processors).

  CPU0
                  ____________          ____________
kidle_inject/0   |   sleep    |  mwait |  sleep     |
        _________|            |________|            |_______
                               duration
  CPU1
                  ____________          ____________
kidle_inject/1   |   sleep    |  mwait |  sleep     |
        _________|            |________|            |_______
                              ^
                              |
                              |
                              roundup(jiffies, interval)

Only one CPU is allowed to collect statistics and update global
control parameters. This CPU is referred to as the controlling CPU in
this document. The controlling CPU is elected at runtime, with a
policy that favors BSP, taking into account the possibility of a CPU
hot-plug.

In terms of dynamics of the idle control system, package level idle
time is considered largely as a non-causal system where its behavior
cannot be based on the past or current input. Therefore, the
intel_powerclamp driver attempts to enforce the desired idle time
instantly as given input (target idle ratio). After injection,
powerclamp monitors the actual idle for a given time window and
adjusts the next injection accordingly to avoid over/under correction.

When used in a causal control system, such as a temperature control,
it is up to the user of this driver to implement algorithms where
past samples and outputs are included in the feedback. For example, a
PID-based thermal controller can use the powerclamp driver to
maintain a desired target temperature, based on integral and
derivative gains of the past samples.



Calibration
-----------
During scalability testing, it is observed that synchronized actions
among CPUs become challenging as the number of cores grows. This is
also true for the ability of a system to enter package level C-states.

To make sure the intel_powerclamp driver scales well, online
calibration is implemented.
The goals for doing such a calibration
are:

a) determine the effective range of idle injection ratio
b) determine the amount of compensation needed at each target ratio

Compensation to each target ratio consists of two parts:

        a) steady state error compensation
        This is to offset the error occurring when the system can
        enter idle without extra wakeups (such as external interrupts).

        b) dynamic error compensation
        When an excessive number of wakeups occurs during idle, an
        additional idle ratio can be added to quiet interrupts, by
        slowing down CPU activities.

A debugfs file is provided for the user to examine compensation
progress and results, such as on a Westmere system.
[jacob@nex01 ~]$ cat /sys/kernel/debug/intel_powerclamp/powerclamp_calib
controlling cpu: 0
pct     confidence      steady  dynamic (compensation)
0       0               0       0
1       1               0       0
2       1               1       0
3       3               1       0
4       3               1       0
5       3               1       0
6       3               1       0
7       3               1       0
8       3               1       0
...
30      3               2       0
31      3               2       0
32      3               1       0
33      3               2       0
34      3               1       0
35      3               2       0
36      3               1       0
37      3               2       0
38      3               1       0
39      3               2       0
40      3               3       0
41      3               1       0
42      3               2       0
43      3               1       0
44      3               1       0
45      3               2       0
46      3               3       0
47      3               0       0
48      3               2       0
49      3               3       0

Calibration occurs during runtime. No offline method is available.
Steady state compensation is used only when confidence levels of all
adjacent ratios have reached a satisfactory level. A confidence level
is accumulated based on clean data collected at runtime. Data
collected during a period without extra interrupts is considered
clean.

To compensate for an excessive number of wakeups during idle,
additional idle time is injected when such a condition is detected.
Currently, we have a simple algorithm to double the injection ratio.
A possible enhancement might be to throttle the offending IRQ, such as
delaying EOI for level triggered interrupts.
But it is a challenge to be
non-intrusive to the scheduler or the IRQ core code.


CPU Online/Offline
------------------
Per-CPU kernel threads are started/stopped upon receiving
notifications of CPU hotplug activities. The intel_powerclamp driver
keeps track of clamping kernel threads, even after they are migrated
to other CPUs, after a CPU offline event.


=====================
Performance Analysis
=====================
This section describes the general performance data collected on
multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).

Effectiveness and Limitations
-----------------------------
The idle injection ratio is capped at 50 percent. As mentioned
earlier, since interrupts are allowed during forced idle time,
excessive interrupts could result in less effectiveness. The extreme
case would be doing a ping -f to generate a flood of network
interrupts without much CPU acknowledgement. In this case, little can
be done from the idle injection threads. In most normal cases, such as
copying a large file with scp, applications can be throttled by the
powerclamp driver, since slowing down the CPU also slows down network
protocol processing, which in turn reduces interrupts.

When control parameters are changed at runtime by the controlling
CPU, it may take an additional period for the rest of the CPUs to
catch up with the changes. During this time, idle injection is out of
sync, and thus the package cannot enter C-states at the expected
ratio. But this effect is minor, in that in most cases the target
ratio is updated much less frequently than the idle injection
frequency.

Scalability
-----------
Tests also show a minor, but measurable, difference between the 4P/8P
Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
More compensation is needed on Westmere for the same amount of
target idle ratio. The compensation also increases as the idle ratio
gets larger. This is why the calibration code is needed.

On the IVB 8P system, compared to an offline CPU, powerclamp can
achieve up to 40% better performance per watt (measured by a spin
counter summed over per CPU counting threads spawned for all running
CPUs).

====================
Usage and Interfaces
====================
The powerclamp driver is registered to the generic thermal layer as a
cooling device. Currently, it's not bound to any thermal zones.

jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
cur_state:0
max_state:50
type:intel_powerclamp

cur_state allows the user to set the desired idle percentage. Writing
0 to cur_state will stop idle injection. Writing a value between 1 and
max_state will start the idle injection. Reading cur_state returns the
actual and current idle percentage. This may not be the same value
set by the user in that current idle percentage depends on workload
and includes natural idle. When idle injection is disabled, reading
cur_state returns value -1 instead of 0, which is to avoid confusing
the 100% busy state with the disabled state.

Example usage:
- To inject 25% idle time
$ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state"

If the system is not busy and has more than 25% idle time already,
then the powerclamp driver will not start idle injection. Using top
will not show idle injection kernel threads.

If the system is busy (spin test below) and has less than 25% natural
idle time, powerclamp kernel threads will do idle injection. Forced
idle time is accounted as normal idle in that it takes the same code
path as the idle task.

In this example, 24.1% idle is shown.
This helps the system admin or
user determine the cause of slowdown, when the powerclamp driver is in
action.


Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
Swap:  4087804k total,        0k used,  4087804k free,   945336k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
 3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
 3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
 3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
 3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
 2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
 1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
 2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz

Tests have shown that by using the powerclamp driver as a cooling
device, a PID based userspace thermal controller can manage to
control CPU temperature effectively, when no other thermal influence
is added. For example, an Ultrabook user can compile the kernel under
a certain temperature (below most active trip points).
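
A userspace controller of the kind described above could be sketched
roughly as follows. This is an illustration only, not the controller
used in the tests mentioned in this document. The PID gains, the
control period, and the helper names (find_powerclamp, PID,
control_loop) are all assumptions made for the example. The sketch
locates the cooling device by its type string instead of hard-coding
an index, since the index differs between systems, and it clamps the
controller output to the range 0..max_state.

```python
import glob
import os
import time

THERMAL_ROOT = "/sys/class/thermal"


def find_powerclamp(root=THERMAL_ROOT):
    """Return the sysfs directory of the intel_powerclamp cooling device, or None."""
    pattern = os.path.join(root, "cooling_device*", "type")
    for type_file in sorted(glob.glob(pattern)):
        with open(type_file) as f:
            if f.read().strip() == "intel_powerclamp":
                return os.path.dirname(type_file)
    return None


class PID:
    """Textbook discrete PID controller. Gains are illustrative, not tuned."""

    def __init__(self, kp, ki, kd, out_min, out_max):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        """Return the next idle ratio (integer percent) for the given error."""
        derivative = 0.0
        if self.prev_error is not None:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        candidate = self.integral + error * dt
        out = self.kp * error + self.ki * candidate + self.kd * derivative
        # Simple anti-windup: keep the old integral while the output saturates.
        if self.out_min <= out <= self.out_max:
            self.integral = candidate
        return int(round(min(max(out, self.out_min), self.out_max)))


def control_loop(zone_temp_file, target_mc, period_s=1.0):
    """Hold a thermal zone near target_mc (millidegrees Celsius). Runs forever."""
    dev = find_powerclamp()
    if dev is None:
        raise RuntimeError("intel_powerclamp cooling device not found")
    with open(os.path.join(dev, "max_state")) as f:
        max_state = int(f.read())
    # Hypothetical gains: 0.005 maps 1000 millidegrees of error to 5% idle.
    pid = PID(kp=0.005, ki=0.001, kd=0.0, out_min=0, out_max=max_state)
    while True:
        with open(zone_temp_file) as f:
            temp_mc = int(f.read())
        ratio = pid.step(temp_mc - target_mc, period_s)
        with open(os.path.join(dev, "cur_state"), "w") as f:
            f.write(str(ratio))
        time.sleep(period_s)
```

For instance, calling
control_loop("/sys/class/thermal/thermal_zone0/temp", 70000) as root
would try to hold that zone near 70 degrees Celsius. Whether
thermal_zone0 is the CPU zone on a given machine is an assumption
that must be checked against the zone's type file.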