Undervolting can work WONDERS. Especially if you get lucky with binning. I'm running the i7 in my little 13" convertible with a ~140mV undervolt, which is the difference between unplayable and 90 pegged in DCS and War Thunder (flight sims) in VR (w/ eGPU). It's also almost an extra hour of battery life when I'm just browsing the internet or whatnot.
Yes, definitely! Your laptop is generally constrained by how much heat it can push through the heat sink. If you get "lucky" you have a CPU that can remain stable at lower voltages. So it's effectively the same as an "overclock" because your laptop can turbo the core speed up for longer (or forever). Your laptop rarely runs at maximum turbo all day. It can only do that for a few hundred ms usually. It'll basically just hold the clock speed at whatever it can before hitting some specific temp (85-95c). When you undervolt, the frequency that it can "hold" at that temp threshold is higher.
It's not just temperature: many laptops also hit their power limits under full load.
I had a Surface Book 2 that was particularly bad for this, I think because it uses the same connector as older versions, and it could not supply enough power. Undervolting seemed to help quite a bit.
MacBook Pros have this problem. You cannot achieve maximum performance without using both the external adapter and the internal battery. Not sure about the most recent models but true of mine (2013).
Annoyingly, there are 3 major related limits for Intel mobile CPUs: thermal, power (both an instantaneous, absolute max limit and a set of "okay you can use 25W for up to 5 seconds but then have to stop" limits), and current.
Within that envelope, the CPU can boost up to its max turbo frequency when needed.
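Here's a toy model of the power side of that envelope, for anyone who wants to see its shape. It is only a sketch: the 25 W burst figure and the 5 second window come from the comment above, the 15 W sustained limit is a made-up value, and the running-average rule is a simplification rather than Intel's actual algorithm. All function and parameter names are hypothetical.

```python
def simulate_power(demand_w, sustained_w=15.0, burst_w=25.0, tau_s=5.0,
                   dt_s=0.1, steps=200):
    """Return the package power granted at each time step (toy model)."""
    avg_w = 0.0              # running average of recent power draw
    alpha = dt_s / tau_s     # smoothing factor for the running average
    granted = []
    for _ in range(steps):
        # The burst limit applies while the running average is under the
        # sustained limit; after that only the sustained limit is allowed.
        limit_w = burst_w if avg_w < sustained_w else sustained_w
        power_w = min(demand_w, limit_w)
        avg_w += alpha * (power_w - avg_w)
        granted.append(power_w)
    return granted


stock = simulate_power(demand_w=25.0)        # wants more than the sustained limit
undervolted = simulate_power(demand_w=14.0)  # same work, assumed lower draw
print("stock: first step", stock[0], "W, last step", stock[-1], "W")
print("undervolted power levels seen:", set(undervolted))
```

In this model the stock run bursts for a few seconds and then gets clamped, while the run that needs less power for the same work never hits a limit at all.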
I wonder what exactly causes "instability" when overclocking/undervolting in a processor. Transistors don't have enough time to switch, but it matters where it happens.
If there's an error in e.g. floating-point unit, it would just cause random errors in the output, but the game would probably keep running. That might be completely fine, if it's rare and you could undervolt just that unit a lot. Especially somewhere in the graphics pipeline, where it would just generate visual noise. Also, I think energy consumed / heat is proportional to voltage squared, so it has a big effect.
> If there's an error in e.g. floating-point unit, it would just cause random errors in the output, but the game would probably keep running
The processor circuitry does not consist only of transistors dedicated to the datapath; a significant number of transistors are used to control the datapath. If control transistors are faulty, there is no guarantee about the processor's behavior.
Also, I don't know about Intel CPU protections, but in modern secure hardware we include dedicated circuits for power/clock glitch detection.
The glitch detection circuit is designed to be more sensitive to power/clock faults than the rest of the circuit, which keeps those faults from affecting the integrity of the circuit output.
The Plundervolt [1] paper is one study in what exactly goes haywire when a processor is undervolted. They found that the computation errors induced are quite consistent and reproducible. MUL and DIV are vulnerable, and AES-NI happened to be particularly badly broken in the face of undervolting.
Indeed. CLKSCREW is the progenitor of that work. Note that in the context of security, it only matters if undervolting can cross security boundaries. This was possible on Intel systems at the time of Plundervolt but has since been fixed.
> I wonder what exactly causes "instability" when overclocking/undervolting in a processor. Transistors don't have enough time to switch, but it matters where it happens.
Not x86, but in college a series of courses used a battery-powered embedded system with a Motorola 68HC11. When the batteries ran low, conditional branches would never branch, which was very confusing. Everything else seemed to work fine though.
Afaik impurities, small imperfections, and micro-fractures increase resistance, which increases power consumption and generates heat. I'm not an expert on this, but I assume even how well the chip was soldered to the board might make a serious difference here.
Yeah, even with my NUC that had a pretty decent fan running at top speed I literally put a block of metal I kept in the freezer on top of it to keep it from downthrottling due to overheating. It was ridiculous that it could let itself get that hot instead of doing something more graceful.
It's a 5th generation core i7 in the Intel NUC. It's been off the market for four years and the thermal issues were known and addressed in the intervening years to the best of my understanding.
I'm just saying that Linux does not do a great job of gracefully limiting CPU power to avoid a lower BIOS-level limiter kicking in and performing worse than if it had just throttled down a bit earlier.
Of course it does. Undervolting is simply finding the lowest voltage your unit can stably work at.
The only result of undervolting on modern CPUs is lower temperature; performance actually goes up as the processor still aims to hit its set TDP/TPL limits.
Lower voltage = higher clocks for longer on all Intel CPUs that can be undervolted. I believe it's the same for new AMDs, but I haven't had the chance to test one.
You're basically fine tuning your pet rock - some silicon is higher quality and can do better than the stock voltages that were set for thousands/millions of units out of the factory.
A lot of laptops are built with pretty lackluster thermal design.
Less voltage means less waste heat. If you can lower voltage without lowering clock speed, you'll be able to maintain turbo boost longer instead of constantly thermal throttling.
Not my first recommendation but it is a valid strategy.
It helps a bit with power/thermal throttling, but in my experience the effect is not very strong (maybe 5-10% at most, I'd say). If your CPU gets thermally throttled a lot, it's more likely that you have a faulty or undersized heat sink; in that case I'd try applying better thermal paste and/or making sure the heat sink makes proper contact with the CPU. My Alienware throttled the CPU a lot due to overheating because the thermal compound wasn't applied right at the factory (and they seemed to use cheap compound with low conductivity). Applying better thermal compound fixed that problem reliably, while undervolting the CPU did almost nothing in that case.
Voltage is an energy gap between on and off state of the transistor. There is also random noise on top of that gap. And that noise can sometimes cause the transistor to flip randomly, which results in an error. More voltage means system is more stable and has less random errors. But in gaming people often don't care if there is an error or two now and then, so they can tradeoff stability for lower power usage. Companies don't like that, because that means more technical support when programs randomly crash.
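A toy way to see why shaving the margin can take you from "never crashes" to "crashes daily": treat the noise as Gaussian and ask how often it exceeds the margin. The Gaussian assumption, the 20 mV noise figure, and the margins below are illustrative guesses, not measurements of any real chip.

```python
import math


def flip_probability(margin_mv, noise_sigma_mv):
    """Chance that zero-mean Gaussian noise exceeds the margin (one-sided tail)."""
    return 0.5 * math.erfc(margin_mv / (noise_sigma_mv * math.sqrt(2)))


for margin_mv in (200, 160, 120, 80):
    p = flip_probability(margin_mv, noise_sigma_mv=20)
    print(f"margin {margin_mv} mV -> flip probability per event {p:.3e}")
```

The point is only the shape of the curve: in this model the error rate climbs by many orders of magnitude as the margin shrinks, which is why stability tends to fall off a cliff rather than degrade gently.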
of course it does - less voltage => more power headroom and less overall power, less heat dissipated => more thermal headroom.
If the silicon doesn't need the voltage, more voltage is only harmful. I can run my Skylake-like laptop at a -170mV offset (which is borderline golden chip), stable in Prime95 and whatnot. Prime95 runs below 70C.
Did you use "lucky" here sarcastically, or do you actually need to be lucky with binning to be able to undervolt without issues? (I thought that was a concern with overclocking?)
Undervolting goes hand in hand with overclocking. The goal is higher clocks. The enemies are stability and heat. Undervolting reduces heat but the impact on stability depends on the silicon lottery. For mobile chipsets, undervolting at the same clocks is a nice performance uplift because of the reduced heat (on a system that undoubtedly does not have the best heat dissipation.)
Overvolting is also useful for overclocking; at a point it can help increase stability. But generally I think you want to hit a point at which you are getting the highest clocks with the lowest voltage that is still stable.
Roughly, the transistors in a chip can have more or less internal resistance (Rds) due to impurities and variations during production. There's no way to know from one chip to the next until you try them.
If it has less internal resistance it will generate less heat for the same amount of work. Semiconductors are constrained by the maximum junction temperature[1], and CMOS logic chips like modern processors primarily consume power when switching.
So there's a thermal limit to how fast they can switch. Thus having lower internal resistance can allow you to clock higher (switch more often) before reaching the thermal limit.
However, when switching they act as varying resistors, and the amount of power dissipated in any resistor depends on the square of the applied voltage (P = V^2/R).
So if you can run your chip on a lower voltage, it will dissipate a lot less energy when switching, thus allowing you to switch faster before being thermally limited.
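As a back-of-the-envelope check on that V^2 relationship, here's the ratio for a few undervolt offsets at an unchanged clock. The 1.00 V stock voltage is an assumed round number (real chips vary it with load and frequency), and the sketch ignores leakage entirely.

```python
def relative_switching_power(stock_v, offset_mv):
    """Switching power after a voltage offset, relative to stock (P ~ V^2)."""
    new_v = stock_v + offset_mv / 1000.0
    return (new_v / stock_v) ** 2


for offset_mv in (-50, -100, -140):
    ratio = relative_switching_power(stock_v=1.00, offset_mv=offset_mv)
    print(f"{offset_mv} mV -> {ratio:.0%} of stock switching power")
```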
When overclocking, you ask the transistors to switch faster. To turn a MOSFET[3] transistor on or off, you need to transfer a certain amount of charge (aka electrons) into or out of the gate. The faster you can transfer the charge (more electrons per second ie the higher the current) the quicker the transistor turns on. And the way to do that is to increase the voltage[2]. So overclockers tend to bump up the voltage when going for the super-high clocks. But as you see from above that causes a massive increase in heat, thus requiring significantly more cooling.
I'm no semiconductor expert though, so I might have gotten some of this wrong, but I hope I got the essence right. Also I'm not sure if the impurities and such that lead to the variability of the internal resistance are related to the ones affecting the ability to undervolt. Is it mainly the threshold voltage or?
There are other aspects as well though, I assume the parasitic elements[4] play a role as well, especially when overclocking.
One thing that has to be considered, and can become an issue with undervolting, is that multiple transistors make up circuits. Sometimes these transistors work together and sometimes they fight each other. Even two transistors in close proximity to each other on the same chip will be slightly different due to variation. As voltage decreases, for some circuits this variation gets exposed as one transistor approaches the threshold voltage while the other may not, leading to outright failures or stability problems that show up at particular frequencies. At higher voltages this slight variation may not come into play as strongly.
This is actually one of those cases where the 'Voltage as Pressure' analogy works well for a layman's explanation; with a lower voltage, the electrons might not get pushed through quickly enough (due to variations in manufacturing of the gates).
For anyone struggling with this, look at the TLP project which is a package in most distros that makes (I think pretty anodyne) changes to improve power consumption/battery life. Not uncommon to get 50% improvements straightaway after installing it.
> There is a need for documentation concerning the relevant MSR(s) and thus waiting on Intel engineers to find such documentation internally and see if/what can be publicly released. Hopefully Intel's large open-source team will be able to provide some fruits soon
Well, i915 is certainly a completely different story than nouveau. Otherwise I would not be so optimistic. Going by anecdotal evidence from the less than a handful of Intel colleagues I happen to have, my hopes are not that high. According to them Intel is a horrible organization where high walls and secrecy are more common than helping another part of the organization.
I think undervolting is good for the environment, since it lowers the carbon footprint of CPUs. Some newer multicore CPUs have a TDP of 100+ W; it would probably be much better if we lowered the voltage and frequency for a small CPU performance hit. Thankful to the developers of the driver!
> Undervolting lowers voltage, but the processor will now stay at higher clocks for a longer time, hitting the same TDP/power limits.
No, at the same power, but for a shorter amount of time, because now it is able to do a lot more work per unit of time. So no, you'll definitely use less energy, although you'll dissipate the same amount of power. Keeping in mind that you will be idling 90% of the time, if not more, your savings will be even bigger.
Even in that case, if your CPU cooler can deal with dissipating all the power (without throttling), consumed by the CPU in both cases (normal and undervolted), you'll still use less energy when undervolted.
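To make the energy-versus-power distinction concrete, here's a small sketch: same package power while busy, but the undervolted chip holds a higher clock, finishes sooner, and idles for the rest of the window. Every number (clocks, 15 W busy, 1 W idle, the size of the job) is invented for illustration.

```python
def energy_joules(work_gcycles, clock_ghz, busy_power_w, idle_power_w, window_s):
    """Energy used over a fixed window: run the job, then idle for the rest."""
    busy_s = work_gcycles / clock_ghz
    if busy_s > window_s:
        raise ValueError("job does not fit in the window")
    return busy_power_w * busy_s + idle_power_w * (window_s - busy_s)


stock = energy_joules(work_gcycles=60, clock_ghz=2.0,
                      busy_power_w=15, idle_power_w=1, window_s=60)
undervolted = energy_joules(work_gcycles=60, clock_ghz=2.5,
                            busy_power_w=15, idle_power_w=1, window_s=60)
print(f"stock: {stock:.0f} J, undervolted: {undervolted:.0f} J")
```

Peak power is identical in both runs; the saving comes purely from spending less time at that power.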
I haven't made any estimates of how much energy is wasted like this, but I "feel" that it is negligible. Max power consumption will not change in power-demanding cases, as the CPU will still throttle up to the TDP limits. I think it will only consume less power when idling, but I think that this is in the range of watts.
Do people who like to play with the voltage and clock speed settings of their CPU/GPU/memory typically do so with one of these utilities? Or in the EFI/BIOS config pages?
I've never done this, but lately it sounds like there's sometimes quite a lot left on the table, even after part binning. Performance is so high now that I wouldn't bother doing this for an extra 5-10%, but for an extra hour of battery life, it's certainly tempting...
I've tried undervolting, which did very little for my desktop, but prolonged the battery life and reduced heat noticeably for my Surface Pro. Especially since the Surface was fan-less, the reduction in heat actually helps with performance as well. The only quirk is that the Surface BIOS provides no such options and I have to use Intel Extreme Tuning Utility, which crashes sometimes.
If Linux gets undervolting support, that will help a bunch of laptops without highly configurable BIOSes.
> Performance is so high now that I wouldn’t bother doing this for an extra 5-10%
I’m an economist and I’m fascinated by this statement.
Surely if something (‘performance’, in this case) is ‘high’ then getting a 5-10% proportion of it is better than getting the same proportion increase in performance on a lesser level (the bigger the baseline, the bigger that 5-10% extra performance is). So it’s apparently irrational: being willing to incur effort (cost) to get something that is worth a smaller amount and being unwilling to do it when the same effort/cost gets you more isn’t what one would expect.
But maybe it’s satisficing: maybe formerly the experience was endurable only when running at 110% of maximum, but now there’s no need to redline to get a decent experience?
Or maybe it’s the law of diminishing returns, utility approaching some kind of asymptote.
If you're running a compute task that is slow and takes an hour, then 5% additional speed saves you 3 minutes. If the computer gets 2x faster, then an additional 5% improvement only saves you 1.5 minutes. So if utility is proportional to time (and hence inversely proportional to speed), then this makes sense.
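The arithmetic there is easy to check. This sketch computes the exact figures (the "3 minutes" and "1.5 minutes" above are rounded); the function name is just for illustration.

```python
def minutes_saved(baseline_minutes, speedup):
    """Time saved when a job that took baseline_minutes runs `speedup` times faster."""
    return baseline_minutes - baseline_minutes / speedup


slow_machine = minutes_saved(60, 1.05)   # one-hour job, 5% faster
fast_machine = minutes_saved(30, 1.05)   # same job on a 2x faster machine
print(f"{slow_machine:.2f} min saved vs {fast_machine:.2f} min saved")
```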
At a large enough N 5% to 10% is massive, at a small enough N 5% to 10% gets you closer to competitive, somewhere between though is where those extra percentage points are "meh".
For instance I run my computing on a TR 1950X and I could probably undervolt or overclock the chip but, to be honest, it's not worth the perceived "risk". Undervolting could break specific instructions in undetectable ways and the only way to be sure things haven't broken would be to run something like sandsifter [0] multiple times. Because I, and many others who would have the technical know-how, write software on these systems it's not very much worth the headache. If I had a separate system I built just for gaming or doing a specific task and I wanted to reduce how much I spent, or the power usage, etc undervolting would then shift out of headache -> super worth it.
The last thing I want to have to do is run every piece of code I need through godbolt and pray there are no AVX instructions that "may" be fishy.
I think this is largely why cloud providers don't do things like this as aggressively. They do it to some extent with custom off-roadmap chips but not controlled dynamically in userspace. If we can get something like this original post into the kernel that would be a massive win for everyone.
An analogy with airplanes: you pay cubic dollars to fly faster at the high end.
That is, it's easy to build an airplane that flies at 90 knots. You can then clean up the aerodynamics a bit, burn a bit more fuel to get a bit more power to fly 110 knots. If you want to fly 130 knots, you probably need to add fairings and wheel pants and put in a bigger engine that burns even more fuel. It's very hard to get to 150 knots without retractable landing gear, a massive engine, or removing seats. To get to 200 knots you need 50% more cylinders or a turbocharger. To go past that you usually start looking at a turboprop at 10x the cost and 4x the fuel burn, then jets (...etc.)
Back to computers... doubling your power consumption and heat output for a 100% speed improvement is a perfect, linear improvement. If you need more performance, it's a fair trade. Even if you care about battery life, you should still make the trade, because drawing half as much power for twice as long still burns the same amount of total energy, so you don't gain anything by slowing down.
But would you double your power consumption and heat output for a 50% speed improvement? What about a 20% improvement?
It's the same idea. If the chip makers are setting voltages near the high-end, exponential cost, diminishing returns side of the curve, then you'll get more total "value" back by going back to the linear part of the curve, assuming you care about all of the variables, and not just "more speed at any cost."
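One way to put numbers on that trade: energy per task scales as the power ratio divided by the speed ratio. A quick sketch using the hypothetical cases above (2x power for 2x, 1.5x and 1.2x speed):

```python
def relative_energy_per_task(power_ratio, speed_ratio):
    """Energy per finished task relative to baseline (energy = power * time)."""
    return power_ratio / speed_ratio


for speed_ratio in (2.0, 1.5, 1.2):
    e = relative_energy_per_task(power_ratio=2.0, speed_ratio=speed_ratio)
    print(f"2x power for {speed_ratio}x speed -> {e:.2f}x energy per task")
```

Anything above 1.0x means the battery pays for the extra speed, which is the nonlinear part of the curve being described.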
> Or maybe it’s the law of diminishing returns, utility approaching some kind of asymptote.
It's exactly this. Many things have sublinear utility (Money is often assumed to have a log or square-root utility).
Speed often has this effect magnified because what people actually care about is time, and time is the inverse of speed. Let's say that doing task Z takes X + Y time. At some point increasing the performance of Y has no measurable gain in utility because X dominates.
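A tiny sketch of that X + Y point, with made-up numbers (X = 9 s that can't be sped up, Y = 1 s that can):

```python
def total_time(x_seconds, y_seconds, y_speedup):
    """Time for task Z when only the Y part gets faster."""
    return x_seconds + y_seconds / y_speedup


for y_speedup in (1, 2, 10, 1000):
    t = total_time(x_seconds=9.0, y_seconds=1.0, y_speedup=y_speedup)
    print(f"Y sped up {y_speedup}x -> Z takes {t:.3f} s")
```

However fast Y gets, Z never drops below the 9 seconds that X costs.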
I used to. I found on my AMD 5870 GPU that cutting RAM speed in half had no measurable effect on desktop use, but considerably lowered the temperature of the GPU as a whole, like over 10 C. The GPU itself already did dynamic frequency switching, so not much gain there.
So I had a couple of profiles made that I could switch between when gaming and not, which would adjust the RAM speed.
I guess it's a question of whether you want drivers to be in the kernel source tree with supported interfaces, where interactions with other drivers can be mediated, or in userspace, where they can iterate without the kernel release process, which takes a long time to get to distros and end-users.
As this Intel MSR is not well documented, I would not argue strongly either way.
Power used is roughly proportional to the clock times the square of the voltage. Changing the processor governor affects the clock, undervolting affects the voltage.
If you use a more conservative governor you lower performance and power consumption both linearly; this means that e.g. the total energy used for running a long batch job is roughly constant. If you lower the voltage, you don't decrease performance at all, but still lower the power consumption. It's a free lunch, right up until your CPU starts silently generating the wrong results and possibly corrupting your entire system.
On top of this, CPUs now often will (internally, or with the help of the motherboard chipset) dynamically alter their clock based on the temperature. So undervolting reduces the power consumed at a given clock, which will increase the speed your CPU runs at, if you are thermally limited. This means a setup with poor cooling will use the same amount of power as before but run faster.
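The governor-versus-undervolt comparison can be sketched with the same rough P ~ f * V^2 rule. The constant k, the 3.0 GHz clock and the 1.0 V / 0.9 V figures are arbitrary placeholders, and the last function ignores both the maximum turbo cap and the fact that higher clocks normally need more voltage.

```python
def dynamic_power(clock_ghz, volts, k=1.0):
    """Rough dynamic power: proportional to clock times voltage squared."""
    return k * clock_ghz * volts ** 2


def clock_at_same_power(clock_ghz, old_v, new_v):
    """Clock that draws the same dynamic power after a voltage change."""
    return clock_ghz * (old_v / new_v) ** 2


base = dynamic_power(3.0, 1.0)
half_clock = dynamic_power(1.5, 1.0)    # conservative governor: half the clock
undervolted = dynamic_power(3.0, 0.9)   # undervolt: same clock, less voltage

print(f"half clock: {half_clock / base:.0%} power, but the job takes 2x as long")
print(f"undervolt:  {undervolted / base:.0%} power, same speed")
print(f"same power budget at 0.9 V allows ~{clock_at_same_power(3.0, 1.0, 0.9):.2f} GHz")
```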
I'm a layman so don't take this as definitive, but my understanding is Power = Volts * Amps. The amount of Amps going into something is the amount of electrons moving into it (water in a pipe). Volts is how hard you're pushing the electrons through a circuit (pressure in a pipe). Because of magic I don't understand, amps is ~constant while how much force is behind those electrons is changeable (how wide you open your faucet).
Processor governors change the clock speed of the CPU (how many dishes you're washing). Dynamic voltage control is changing how hard you're pushing electrons through the CPU to let it stably operate at a given frequency (opening/closing the faucet as you're scrubbing your dishes).
By doing this we can lower the amount of Power (water) we use while getting the same amount of work done (dishes washed).
amdctl[1] might be useful to look at to learn a little here, as it meddles with these things.
the CPU has different ACPI P-States, different power level modes that the governor will switch between.
but each of those states has an assigned voltage to it. you can tell a governor to keep using lower power states, but those states may still be supplying more voltage/power than the CPU requires to run.
under-volting is about adjusting the power level for a given state. there's a very broad ranging article on AMD Zen architecture[1] that i love, from Matt Dillon, creator of DragonflyBSD, and it talks a bit about undervolting & how it can be useful for their big build-servers & others. [edit: wait no, wrong link: it was THIS other huge thread from Dillon[3] that included low power musings, but still no explicit undervolting, just setting the thermal limit lower].
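The P-state picture described above can be sketched as a plain lookup table: the governor picks which row to use, and undervolting edits the voltage column without touching the clocks. The table values and every name here are invented for illustration; real P-state tables are per-chip and are not exposed like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PState:
    name: str
    clock_mhz: int
    millivolts: int


# Made-up table, for illustration only.
STOCK = [
    PState("P0", 3400, 1200),
    PState("P1", 2800, 1050),
    PState("P2", 1600, 850),
]


def undervolt(table, offset_mv):
    """Return a new table with the same clocks and every voltage shifted by offset_mv."""
    return [PState(p.name, p.clock_mhz, p.millivolts + offset_mv) for p in table]


for before, after in zip(STOCK, undervolt(STOCK, -50)):
    print(f"{before.name}: {before.clock_mhz} MHz, "
          f"{before.millivolts} mV -> {after.millivolts} mV")
```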
But it's tightly integrated with the CPU itself. The VRM does nothing without the CPU saying so. Whatever voltage it requests, the VRM will comply.
For a glorious short time with Haswell, the VRM was fully integrated onto the processor die. It led to increased heat, obviously, but if you properly cooled the processor, you could stop worrying about your VRM failing.
VRs can be off chip (MBVR stands for motherboard VR) versus on chip (FIVR stands for fully integrated VR). In both cases, there exists a protocol called SVID that allows the processor to send voltage control commands. The MSR is merely the interface or proxy for these commands.
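For the curious, this is roughly how userspace undervolting tools pack a voltage-offset command for MSR 0x150. Big caveat: Intel has not documented this register, so the layout below is the one reverse-engineered by the community and should be treated as an assumption, as should the plane numbering (0 is commonly reported as the CPU core plane). The sketch only builds the 64-bit value; it does not touch any hardware.

```python
def encode_voltage_offset(plane, offset_mv):
    """Pack a write command for MSR 0x150 (community-documented layout, unverified)."""
    steps = round(offset_mv * 1.024)     # offset in units of 1/1024 V
    field = (steps & 0x7FF) << 21        # 11-bit two's complement in bits 31..21
    return 0x8000001100000000 | (plane << 40) | field


value = encode_voltage_offset(plane=0, offset_mv=-100)
print(f"-100 mV on plane 0 -> {value:#018x}")
```

Actually writing such a value requires root access to the MSR device, and a bad offset can hang the machine, so treat this as reading material rather than a recipe.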
MSRs can do anything, really. They're also not universally "registers" in any meaningful way. Basically, instead of adding new instructions for everything, CPU designers just add a new MSR. It's kinda like how the kernel assigns a new major/minor for every new device driver instead of a new syscall.
MSRs can also be added by microcode update as we saw with MSR_IA32_SPEC_CTRL, which was used to mitigate Spectre (among other things).
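A minimal sketch of poking at an MSR from Linux userspace, assuming the `msr` kernel module is loaded and you are root: the per-CPU device file is read at a file offset equal to the MSR address. The helper names are made up, and the decoder covers only the three low bits of IA32_SPEC_CTRL (address 0x48) as I understand them from Intel's documentation.

```python
import os
import struct


def msr_device_path(cpu):
    """Path of the per-CPU MSR device exposed by the Linux msr driver."""
    return f"/dev/cpu/{cpu}/msr"


def read_msr(cpu, address):
    """Read one 64-bit MSR; needs root and the msr kernel module."""
    fd = os.open(msr_device_path(cpu), os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, address)   # the file offset selects the MSR address
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]


def decode_spec_ctrl(value):
    """Split out the low three control bits of IA32_SPEC_CTRL."""
    return {
        "IBRS": bool(value & 0x1),
        "STIBP": bool(value & 0x2),
        "SSBD": bool(value & 0x4),
    }
```

On a suitable machine, `decode_spec_ctrl(read_msr(0, 0x48))` run as root would show which of those mitigations CPU 0 currently has switched on.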
Disclaimer: I work on Linux at Intel, and I was one of the ones posting in that LKML thread quoted in the article.