I've had this post sitting in my drafts for the last 7 months. The code is stale, but the concepts are correct. I had intended to add some pictures before posting, but it's clear that won't happen now. Words are better than nothing, I suppose...
Recently I pushed an intel-gpu-tools tool for modifying GPU frequencies. It's meant to simplify using the sysfs interfaces, similar to many other such helper tools in Linux (cpupower has become one of my recent favorites). I wrote this fancy getopt example^w^w^w program to address a specific need I had, but saw its uses for the community as well. Some time after upstreaming the tool, I accidentally put the name of this tool into my favorite search engine (yes, after all these years X clipboard still confuses me). Surprisingly, I was greeted by a discussion about the tool. None of it was terribly misinformed, but I figured I might set the record straight anyway.
Dynamically changing frequencies is a difficult thing to accomplish. Typically there are multiple clock domains in a chip, and coordinating all the necessary ones in order to modify one probably requires a bunch of state which I'd not understand, much less be able to explain. To facilitate this (on Gen6+) there is firmware which does whatever it is that firmwares do to change frequencies. When we talk about changing frequencies from a Linux kernel driver perspective, it means we're asking the firmware for a frequency. It can, and does, overrule, balk at, and ignore our frequency requests.
The term used within the kernel driver and docs is "RPS", which is [I believe] short for Render P-States. They are analogous to CPU P-states in that lower numbers are faster and higher numbers are slower. Conveniently, we only have two numbers, 0 and 1, on GEN. Past that, I don't know how CPU P-States work, so I'm not sure how much else is similar.
There are roughly 4 generations of RPS:
IPS (not to be confused with Intermediate Pixel Storage). The implementation of this predates my knowledge of this part of the hardware, so I can't speak much about it. It stopped being a thing after Ironlake (Gen5). I don't care to look, but you can if you want: drivers/platform/x86/intel_ips.c
RPS (Sandybridge, Ivybridge) There are 4 numbers of interest: RP0, RP1, RPn, and "hw max". The first 3 are read directly from a register:
rp_state_cap = I915_READ(GEN6_RP_STATE_CAP);
dev_priv->rps.rp0_freq = (rp_state_cap >> 0) & 0xff;
dev_priv->rps.rp1_freq = (rp_state_cap >> 8) & 0xff;
dev_priv->rps.min_freq = (rp_state_cap >> 16) & 0xff;
RP0 is the maximum value the driver can request from the firmware. It's the highest non-overclocked frequency supported. RPn is the minimum value the driver can request from the firmware. It's the lowest frequency supported. RP1 is the most efficient frequency. hw_max is RP0 if the system does not support overclocking. Otherwise, it is read through a special set of commands where the firmware tells us the real max (there's a sketch of that read below, after the last generation). The overclocking max typically cannot be sustained for long durations due to thermal considerations, but that is transparent to software.
RPS (HSW+) Similar to the previous RPS, except there is an RPe (for efficient). I don't know how it differs from RP1; I only learned about it after posting the tool - just be aware it's different. Baytrail and Cherryview also have a concept of RPe.
The Atom-based stuff (Baytrail, Cherryview) I don't pretend to know how this works [or doesn't]. They use similar terms and have the same set of names as the HSW+ RPS.
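Since I mentioned hw max being read through a special set of commands: that "special set" is a mailbox interface to the firmware, usually referred to as pcode. The following is a sketch of roughly what the driver did around the time I worked on this, reconstructed from memory, so names may not match current code exactly:
/* Sketch (from memory): ask the pcode mailbox whether overclocking is
 * supported. Bit 31 of the reply says yes/no; the low byte is the real
 * (overclocked) max in the usual 50MHz units. */
u32 pcu_mbox = 0;
int ret;

ret = sandybridge_pcode_read(dev_priv, GEN6_READ_OC_PARAMS, &pcu_mbox);
if (!ret && (pcu_mbox & (1 << 31))) {
	DRM_DEBUG_DRIVER("Overclocking supported, max: %dMHz\n",
			 (pcu_mbox & 0xff) * 50);
	dev_priv->rps.max_freq = pcu_mbox & 0xff;
}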
The driver can make requests to the firmware for a desired frequency:
if (IS_HASWELL(dev) || IS_BROADWELL(dev))
	I915_WRITE(GEN6_RPNSWREQ,
		   HSW_FREQUENCY(val));
else
	I915_WRITE(GEN6_RPNSWREQ,
		   GEN6_FREQUENCY(val) |
		   GEN6_OFFSET(0) |
		   GEN6_AGGRESSIVE_TURBO);
This register interface doesn't provide a way to determine whether the request was granted, other than reading back the current frequency, which is error prone, as explained below.
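For completeness, "reading back the current frequency" is itself just another register read. It looks something like the following, which is my best recollection of how the driver reports it (the mask and shift differ between the Haswell era and earlier parts):
/* Sketch: read what frequency the GPU is actually running at right now.
 * The result is in the same units as RP0/RP1/RPn (multiples of 50MHz). */
u32 rpstat = I915_READ(GEN6_RPSTAT1);
u32 cagf;

if (IS_HASWELL(dev) || IS_BROADWELL(dev))
	cagf = (rpstat & HSW_CAGF_MASK) >> HSW_CAGF_SHIFT;
else
	cagf = (rpstat & GEN6_CAGF_MASK) >> GEN6_CAGF_SHIFT;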
By default, the driver will request frequencies between the efficient frequency (RP1, or RPe) and the max frequency (RP0 or hw max) based on the system's busyness. The busyness can be calculated either by software or by the hardware. For the former, the driver can periodically read a register to get busyness and make decisions based on that:
render_count = I915_READ(VLV_RENDER_C0_COUNT_REG);
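That register is just a counter of how long the render engine has spent busy (in C0). The driver's actual algorithm is more involved than this, but as a purely illustrative sketch (SAMPLE_PERIOD_MS, COUNTS_PER_MS, and the two thresholds are all made-up names), the software approach boils down to:
/* Illustrative only: sample the render C0 (busy) counter over a fixed
 * period, convert the delta to a busyness percentage, and step the
 * requested frequency up or down one notch based on thresholds. */
u32 before, after, busy_pct;
u32 new_freq = dev_priv->rps.cur_freq;

before = I915_READ(VLV_RENDER_C0_COUNT_REG);
msleep(SAMPLE_PERIOD_MS);
after = I915_READ(VLV_RENDER_C0_COUNT_REG);

busy_pct = (after - before) * 100 / (COUNTS_PER_MS * SAMPLE_PERIOD_MS);

if (busy_pct > UP_THRESHOLD_PCT && new_freq < dev_priv->rps.max_freq)
	new_freq++;
else if (busy_pct < DOWN_THRESHOLD_PCT && new_freq > dev_priv->rps.min_freq)
	new_freq--;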
In the latter case, the firmware will itself measure busyness and give the driver an interrupt when it determines that the GPU is sufficiently overworked or underworked. At each interrupt the driver would raise or lower the frequency by the smallest step size (typically 50MHz), and continue on its way. The most complex thing we did (which we still managed to screw up) was to disable the interrupts telling us to go up when we were already at max, and the equivalent for down.
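In rough pseudo-kernel-C, the interrupt-driven version looks something like the sketch below; pm_iir is the interrupt status the handler was handed, the GEN6_PM_* bit names are my recollection of the real ones, and the surrounding bookkeeping is omitted:
/* Sketch: the firmware raised a "too busy" or "too idle" interrupt;
 * step the requested frequency by one unit (1 unit = 50MHz). */
u32 new_freq = dev_priv->rps.cur_freq;

if (pm_iir & GEN6_PM_RP_UP_THRESHOLD) {
	if (new_freq < dev_priv->rps.max_freq)
		new_freq++;
} else if (pm_iir & GEN6_PM_RP_DOWN_THRESHOLD) {
	if (new_freq > dev_priv->rps.min_freq)
		new_freq--;
}

/* The part we kept getting wrong: once pinned at max (or min), mask the
 * interrupt that can only ever tell us to keep going in that direction,
 * so the CPU isn't woken up for nothing. */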
It seems obvious that there are usual trends: if you increment the frequency, you're more likely to increment again in the near future, and so on. Since leaving for sabbatical and returning to work on mesa, there has been a lot of complexity added here by Chris Wilson, things which mimic concepts like the ondemand CPU frequency governor. I never looked much into those, so I can't talk knowledgeably about them - just realize it's not as naive as it once was, and should do a better job as a result.
The flow seems a bit ping-pongish: the GPU gets busy, the firmware interrupts the driver, the driver writes a new frequency request, and then everybody waits for the next interrupt to do it all over again.
The benefit, though, is that the firmware can do all the dirty work, and the CPU can sleep. Particularly when there's nothing going on in the system, that should provide significant power savings.
First, what does the tool do? --max and --min lock the GPU frequency to the max and the min respectively. What this actually means is that in the driver, even if we get interrupts to throttle up or throttle down, we ignore them (hopefully the driver will disable such useless interrupts, but I am too scared to check). I also didn't check what this does when we're using software to determine busyness, but it should be very similar.
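If you'd rather see it than take my word for it, locking to max is more or less equivalent to pinning the driver's two software limits together in sysfs. Here's a minimal userspace sketch (it assumes card0 is the Intel GPU and that your kernel exposes the gt_*_freq_mhz files; the real tool is more careful than this):
#include <stdio.h>

/* Minimal sketch: lock the GPU to its max frequency by writing RP0 into
 * both the min and max software limits. Requires root. */
int main(void)
{
	unsigned int rp0;
	FILE *f;

	/* The highest non-overclocked frequency (RP0), in MHz */
	f = fopen("/sys/class/drm/card0/gt_RP0_freq_mhz", "r");
	if (!f || fscanf(f, "%u", &rp0) != 1)
		return 1;
	fclose(f);

	/* Pin both limits to RP0 so up/down requests become no-ops */
	f = fopen("/sys/class/drm/card0/gt_min_freq_mhz", "w");
	if (!f)
		return 1;
	fprintf(f, "%u\n", rp0);
	fclose(f);

	f = fopen("/sys/class/drm/card0/gt_max_freq_mhz", "w");
	if (!f)
		return 1;
	fprintf(f, "%u\n", rp0);
	fclose(f);

	return 0;
}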
I should mention now that this is what I've inferred through careful observation^w^w random guessing. In the world of thermals and overheating, things can go from good, to bad, to broke faster than the CPU can take an interrupt and adjust things. As a result, the firmware can and will lower frequencies even if it has previously acknowledged that it can give a specific frequency.
As an example, if you do:
intel_gpu_frequency --set X
assert (intel_gpu_frequency --get == X)
There is a period in the middle, and after the assert, where the firmware may have done who knows what. Furthermore, if we try to lock to --max, the GPU is more likely to hit these overheating conditions and throttle you down, so --max doesn't even necessarily get you max. There's sort of an interesting implication there, since one would think these same conditions (probably the CPU is heavily used as well) would end up clocking up all the way anyway and we'd get into the same spot, but I'm not really sure. Perhaps the firmware won't tell us to throttle up so aggressively when it's near its thermal limits. Using --max can actually result in non-optimal performance, and I have some good theories why, but I'll keep them to myself since I really have no proof.
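Newer kernels make this easy to observe: gt_cur_freq_mhz holds the last frequency that was requested, while gt_act_freq_mhz (where your kernel has it) is what the hardware reports it's actually running at. A quick sketch of comparing the two:
#include <stdio.h>

/* Sketch: print requested vs. actual GPU frequency. When the firmware is
 * throttling for thermal reasons, the two will disagree. */
static unsigned int read_mhz(const char *path)
{
	FILE *f = fopen(path, "r");
	unsigned int mhz = 0;

	if (f) {
		if (fscanf(f, "%u", &mhz) != 1)
			mhz = 0;
		fclose(f);
	}
	return mhz;
}

int main(void)
{
	unsigned int req = read_mhz("/sys/class/drm/card0/gt_cur_freq_mhz");
	unsigned int act = read_mhz("/sys/class/drm/card0/gt_act_freq_mhz");

	printf("requested: %u MHz, actual: %u MHz%s\n", req, act,
	       req != act ? " (the firmware had other plans)" : "");
	return 0;
}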
--min, on the other hand, is equally stupid for a different and more obvious reason. As I said above, it's guaranteed to provide the worst possible performance and not guaranteed to provide optimal power usage.
The usages are primarily benchmarking and performance debugging. Assuming you can sustain decent cooling, locking the GPU frequency to anything will give you the most consistent results from run to run. Presumably max, or near max, will be optimal.
min is useful to get a measure of the worst possible performance, to see how it might impact various optimizations. It can help you change the ratio of GPU to CPU frequency drastically (and easily). I don't actually expect anyone to use this very much.
If you are a developer trying to get performance optimizations or measurements which you think could be negatively impacted by GPU throttling, you can set this value. Again, because of the thermal considerations, as you go from run to run making tweaks, I'd recommend setting something around 75% of max - that is a total ballpark figure. When you're ready to get some absolute numbers, you can try setting --max, and the few frequencies near max, to see if you get any unexpected results. Take the highest value when done.
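To make that ballpark concrete (the 1200MHz figure below is made up purely for illustration), the arithmetic is just:
/* Hypothetical example: RP0 = 1200MHz, target ~75% of max, rounded down
 * to the 50MHz granularity the hardware works in. */
unsigned int rp0_mhz = 1200;				/* read gt_RP0_freq_mhz for the real value */
unsigned int target = (rp0_mhz * 3 / 4) / 50 * 50;	/* = 900MHz here */
/* then: intel_gpu_frequency --set 900, or write it to the sysfs limits */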
The sysfs entries require root for a reason. Any time you override what the driver would do on its own, you risk hurting performance, power usage, or both.
On the surface it would seem that the minimum frequency should always use the least amount of power. At idle, I'd assert that is always true. The corollary is that at the minimum frequency the GPU also finishes the workload the slowest. Intel GPUs don't idle for long; they go into RC6. The efficient frequency is a blend of maintaining a low frequency and winning the race to idle. AFAIK, it's a number selected after a whole lot of tests were run - we could ignore it.
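If it helps to see why such a sweet spot exists at all, here's a back-of-the-envelope model of my own (not anything from the hardware docs): for a fixed chunk of work W processed at frequency f, the GPU is busy for W/f out of some period T and spends the rest in RC6, so roughly
E(f) \approx P_{\mathrm{active}}(f) \cdot \frac{W}{f} + P_{\mathrm{RC6}} \cdot \left(T - \frac{W}{f}\right)
Active power grows faster than linearly with frequency (voltage has to climb along with it), so neither extreme wins: the minimum frequency drags out the expensive busy term, and the maximum frequency pays a superlinear power cost to shorten it. RPe is presumably wherever Intel's testing found that trade-off to bottom out.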