GEN Graphics and the URB


Introduction

If you were waiting for a new post to learn something new about Intel GEN graphics, I am sorry. I started moving away from i915 kernel development about one year ago. Although I had landed patches in mesa before, it has taken me some time to get to a point where I had enough of a grasp on things to write a blog post of interest. I did start to write a post about the i965 compiler, but I was never able to complete it. [1]

In general I think the documentation has gotten quite good at giving the full details and explanations in a reasonable format. In terms of diagrams and prose, the GEN8+ Programmer Reference Manuals blow the previous docs out of the water. The greatest challenge I see though is the organization of the information, which is arguably worse in the newer documentation. Also, there is so much of it that it's very difficult to find everything you need when you need it; and it is especially hard when you don't really know what you're looking for. Therefore, I thought writing this would be valuable to people who want to understand this part of the Intel hardware and the i965 driver. More importantly, I have a terrible memory, and this will serve as good documentation for future me.

I highly recommend looking at the SVG files if you want to see the pictures in higher detail. I try very hard to strike a balance between factual accuracy and making the information consumable. If you feel that either something is grossly inaccurate, or far too confusing, please let me know.

And finally, before I begin, I'd like to extend a special thanks to Kristian Høgsberg for the initial 90 minute discussion that seeded this post, as well as helping me along the way. I believe it benefited both of us.

Whiteboarding with krh: blurry picture of the resulting whiteboard

Vertices

One of the most important components that goes into building a 3d scene is a vertex. The point represented by the vertex may simply be a point, or it may be part of some amalgamation of other points to make a line, shape, mesh, etc. Depending on the situation, you might observe a set of vertices referred to as: topology, primitive, or geometry (these aren't all the same thing, but unless you're speaking to a pedant you can expect the terms to be interchanged pretty frequently). Most often, these vertices are parts of triangles which make up a more complex model, which is part of a scene.

The first stage in the modern programmable graphics pipeline is the Vertex Shader, which unsurprisingly operates on the vertices. [2] In the simplest case you might use a vertex shader to simply locate the vertex in the correct place using the model and view transformations. Don't worry if you're asking yourself why it's called a shader; given the example, there isn't anything I would define as "shading" either. How clever programmers make use of the vertex shader is out of scope, thankfully – but rest assured it's not always so simple.


Vertex processing of a gradient square

Vertex Buffers

The attributes that define a vertex are called Vertex Attributes (yay). For now you can just imagine the most obvious attribute – position. The attributes are stored in one or more buffers that are created and defined by the Graphics API. Each vertex (defined here as the collection of all the attributes for a single point) gets processed and/or modified by the vertex shader, with the program also specified by the API.

It is the responsibility of the application to build up the vertex buffers. I already mentioned that position is a very common attribute describing a vertex; unless you're going to be creating parts of your scene within a shader, it's likely a required attribute. So the programmer, in one way or another, will define each position of every vertex for the scene. We can have other things though, and the purpose of the current programmable GPU pipelines is to allow the sky to be the limit. Colors and normals are some other common vertex attributes. With a few more magic API calls, we'll have a nice 2d image on the screen that represents the 3D image provided via the vertices.
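To make that concrete, here is roughly what it looks like from the application side (a hedged sketch using desktop GL; the attribute locations and the interleaved layout are my choices for illustration, not something the API mandates):

    /* Four vertices of a square, each an interleaved vec4 position
     * followed by a vec4 color. */
    static const GLfloat verts[] = {
        /*  x,     y,    z,    w,      r,    g,    b,    a  */
        -1.0f, -1.0f, 0.0f, 1.0f,   0.0f, 1.0f, 0.0f, 1.0f,
         1.0f, -1.0f, 0.0f, 1.0f,   0.0f, 1.0f, 0.0f, 1.0f,
         1.0f,  1.0f, 0.0f, 1.0f,   0.0f, 1.0f, 0.0f, 1.0f,
        -1.0f,  1.0f, 0.0f, 1.0f,   0.0f, 1.0f, 0.0f, 1.0f,
    };

    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(verts), verts, GL_STATIC_DRAW);

    /* position: 4 floats at offset 0; color: 4 floats at offset 16;
     * stride is 32 bytes per vertex. Locations 0 and 1 are assumed to
     * match the shader. */
    glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, 32, (const void *)0);
    glVertexAttribPointer(1, 4, GL_FLOAT, GL_FALSE, 32, (const void *)16);
    glEnableVertexAttribArray(0);
    glEnableVertexAttribArray(1);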


Intel GPU Hardware and the i965 Driver

The GPU hardware will need to be able to operate on the vertices that are contained in the buffers created by the programmer. It will also need to be able to execute a shader program specified by the programmer. Setting this up is the responsibility of the GPU driver. When I say driver, I mean like the i965 driver in mesa. This is just a plain old shared object sitting on your disk. This is not some kernel component. In graphics, the kernel driver knows less about the underlying 3d hardware than the userspace driver.

Ultimately, the driver needs to take all the API goop and make sure it programs the hardware correctly for the given goop. There will be details on this later in the post. Getting the shader program to run takes relatively little state setup; it's primarily a different component within the driver which implements this. Modern graphics drivers have a surprisingly full featured compiler within them for converting the API's shading language into the native code of the GPU.

Execution Units

Programmable shading for all stages in the GPU pipeline occurs on Execution Units, which are often abbreviated to EU. All other stuff is generally called "fixed function". The Execution Units are VLIW processors. The ISA for the EU is documented. The most important point I need to make is that each thread of each EU has its own register file. If you're unfamiliar with the term register file, you can think of it as thread local memory. If you're unfamiliar with that, I'm sorry.

Push vs. Pull Vertex Attributes

As it pertains to the shader, Vertex Attributes can be pushed, or pulled [or both]. Intel and most other hardware started with the push model; this is because there were no programmable shaders for geometry in the olden days. Some of the hardware vendors have since moved exclusively from push to pull. Since Intel hardware can do both (as is likely the case for any HW supporting the push model), I'll give a quick overview, mostly as background. The Intel docs do have a comparison of the two which I noticed after writing this section. I am too lazy to find the link 😛

In the pull model, none of the vertex attributes needed by the shader program are fetched before execution of the shader program begins. This works well assuming you can get the shader programs up and running quickly, and have a way to hide the memory latency (like relatively fast thread switching). It comes with some advantages (maybe more which I haven't thought of):

  1. In programs with control flow or conditional execution, you may not need all the attributes.
  2. Fewer transistors for fixed function logic.
  3. Assuming the push model requires register space – more ability to avoid spills when there are a lot of vertex attributes.
    • More initial register space for the programs

The push model has fixed function hardware that is designed to fetch all the vertex attributes and populate the shader program's registers at invocation. Unlike the pull model, all attributes that may be needed are fetched, and this can cause some overhead. On the other hand:

  1. It doesn’t suffer from the two requirements above though (fast shader invocation + latency hiding).
  2. Can do format conversion automatically.
  3. Hardware should also have a better view of the system and be able to more intelligently do the attribute pushing, perhaps utilizing caches better, or considering memory arbiters better, etc.

As already mentioned, usually if your hardware supports the push model, you can optionally pull things as needed, and you can easily envision how these things may work. That is especially useful on GEN in certain cases where we have hardware constraints that make the push model difficult, or even impossible to use.

Push vs. Pull

Green Square

Before moving on, I'd like to demonstrate a real example with code. We'll use this to see how the attributes make their way down the pipeline. The shader program will simply draw a green square. Such a square can be achieved very easily without a vertex shader, or more specifically, without passing the color through the vertex shader, but that would circumvent what I want to demonstrate. I'll be using shader_runner from the piglit GL test suite and framework. The i965 driver has several debug capabilities built into the driver and exposed via environment variables. In order to see the contents of the commands emitted by the driver to the hardware in a batchbuffer, you can set the environment variable INTEL_DEBUG=batch. The contents of the batchbuffers will be passed to libdrm for decoding. [3] The shader programs get disassembled to the native code by the i965 driver. [4]

Here is the linked shader we’ll be using for shader_runner:
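(The original shader_test file didn't survive the trip to this page; here is a minimal reconstruction consistent with the analysis that follows – a pass-through position and a constant green written to a varying. Treat the [test] section as an assumption about how the square was drawn.)

    [require]
    GLSL >= 1.10

    [vertex shader]
    varying vec4 color;
    void main()
    {
        gl_Position = gl_Vertex;
        color = vec4(0.0, 1.0, 0.0, 1.0);
    }

    [fragment shader]
    varying vec4 color;
    void main()
    {
        gl_FragColor = color;
    }

    [test]
    draw rect -1 -1 2 2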

From the shader program we can infer the vertex shader is reading from exactly one input variable, gl_Vertex, and is going to "output" two things: the varying vec4 color, and the built-in vec4 gl_Position. Here is the disassembled program from invoking INTEL_DEBUG=vs. Don't worry if you can't make sense of the assembly that follows; I'm only trying to show the reader that the above shader program is indeed fetching vertex attributes with the aforementioned push model.
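(The dump itself is lost from this copy. Based on the discussion below, it looked roughly like the following. This is a reconstruction, not real INTEL_DEBUG=vs output; the register numbers are the ones referenced in the text, and the syntax is approximate.)

    mov(8)  g119<1>UD  g1<8,8,1>UD   /* the mov to ignore: URB return handles */
    mov(8)  g120<1>F   g2<8,8,1>F    /* gl_Position.x = gl_Vertex.x */
    mov(8)  g121<1>F   g3<8,8,1>F    /* gl_Position.y = gl_Vertex.y */
    mov(8)  g122<1>F   g4<8,8,1>F    /* gl_Position.z = gl_Vertex.z */
    mov(8)  g123<1>F   g5<8,8,1>F    /* gl_Position.w = gl_Vertex.w */
    mov(8)  g124<1>F   0F            /* color.r = 0.0 */
    mov(8)  g125<1>F   1F            /* color.g = 1.0 */
    mov(8)  g126<1>F   0F            /* color.b = 0.0 */
    mov(8)  g127<1>F   1F            /* color.a = 1.0 */
    send(8) null g119  urb write  mlen 9  rlen 0   /* URB write, see below */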

Ignore that first mov for now. That last send instruction indicates that the program is outputting 9 registers of data, as determined by mlen. Yes, mlen is 9 because of that first mov I asked you to ignore. The rest of the information for the send instruction won't yet make sense. Refer to Intel's public documentation if you want to know more about the send instruction, but hopefully I'll cover most of the important parts by the end. The constant 0, 1, 0, 1 values [5] which are moved to g124-g127 should make it clear what color corresponds to, which means the other thing in g120-g123 is gl_Position, and g2-g5 is gl_Vertex (because gl_Position = gl_Vertex;).

Either you already know how this stuff works, or that should seem weird. We have two vec4 things, color and position, and yet we're passing 8 vec4 registers down to the next stage – what are the other 6? Well, that is the result of SIMD8 dispatch.

Programming the Hardware

Okay, fine. Vertices are important. The various graphics APIs all give ways to specify them, but unless you want to use a software renderer, the hardware needs to be made aware of them too. I guess (and I really am guessing) that most of the graphics APIs are similar enough that a single set of HW commands is sufficient to achieve the goal. It's actually not a whole lot of commands, and in fact they make a good deal of sense once you understand enough about the APIs and the HW.

For the very simple case, we can reduce the number of commands needed to get vertices into the GPU pipeline to 3. There are certainly other very important parts of setting up state so things will run correctly, and the not so simple cases have more commands just for this part of the pipeline setup.

  1. Here are my vertex buffers – 3DSTATE_VERTEX_BUFFERS
  2. Here is how my vertex buffers are laid out – 3DSTATE_VERTEX_ELEMENTS
  3. Do it – 3DPRIMITIVE

There is a 4th command which deserves honorable mention, 3DSTATE_VS. We’re going to have to defer dissecting that for now.

The hardware unit responsible for fetching vertices is simply called the Vertex Fetcher (VF). It's the fixed function hardware that was mentioned earlier in the section about push vs pull models. The Vertex Fetcher's job is to feed the vertex shader, and this can be split into two main functional objectives: transferring the data from the vertex buffer into a format which can be read by the vertex shader, and allocating a handle and special internal memory for the data (more later).

3DSTATE_VERTEX_BUFFERS

The API specifies a vertex buffer which contains very important things for rendering the scene. This is done with a command that describes properties of the vertex buffer. From the 3DSTATE_VERTEX_BUFFERS documentation:

This structure is used in 3DSTATE_VERTEX_BUFFERS to set the state associated with a VB. The VF function will use this state to determine how/where to extract vertex element data for all vertex elements associated with the VB.

(the actual inline data is defined here: VERTEX_BUFFER_STATE)

In my words, the command specifies a number of buffers in any order which are “bound” during the time of drawing. These binding points are needed by the hardware later when it actually begins to fetch the data for drawing the scene. That is why you may have noticed that a lot of information is actually missing from the command. Here is a diagram which shows what a vertex buffer might look like in memory. This diagram is based upon the green square vertex shader described earlier.

3DSTATE_VERTEX_BUFFERS command

3DSTATE_VERTEX_ELEMENTS

The API also specifies the layout of the vertex buffers. This entails where the actual components are, as well as the format of the components. Application programmers can and do use lots of kooky stuff to do amazing things, which is, again, thankfully out of scope. As an example, let's suppose our vertices are composed of three attributes: a 2D coordinate X, Y; a normal vector U, V; and a color R, G, B.

Packing Vertex Buffers

In the above picture, the first two have very similar programming. They have the same number of VERTEX_ELEMENTS. The third case is a bit weird even from the API perspective, so just consider the fact that it exists and move on.

Just like the vertex buffer state we initialized in hardware, here too we must initialize the vertex element state.

Vertex Elements

I don't think I've posted driver code yet, so here is some. What the code does is loop over every enabled attribute (enum gl_vert_attrib – GL defines some, and some are generic), setting up the command for transferring the data from the vertex buffer to the special memory we'll talk about later. The hardware will have a similar looking loop (the green text in the above diagram) to read these things and act upon them.
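(The mesa snippet didn't make it into this copy; here is a hedged sketch of its shape. All of the macro and helper names below are illustrative, not the verbatim i965 code.)

    for (unsigned i = 0; i < brw->vb.nr_enabled; i++) {
       const struct brw_vertex_element *input = brw->vb.enabled[i];

       /* Pick the hardware surface format matching the GL array,
        * e.g. R32G32B32A32_FLOAT for a vec4 of GL_FLOAT. */
       uint32_t format = get_surface_format(input); /* hypothetical helper */

       /* One VERTEX_ELEMENT_STATE entry: which vertex buffer slot to
        * read from, the format, and the byte offset within a vertex. */
       OUT_BATCH(input->buffer << VE_BUFFER_INDEX_SHIFT |
                 VE_VALID |
                 format << VE_FORMAT_SHIFT |
                 input->offset);

       /* Channels the buffer doesn't supply get constants: STORE_0 for
        * x/y/z, STORE_1 for w, exactly as in the diagram above. */
       OUT_BATCH(COMP_STORE_SRC << VE_COMP0_SHIFT |
                 COMP_STORE_SRC << VE_COMP1_SHIFT |
                 COMP_STORE_0   << VE_COMP2_SHIFT |
                 COMP_STORE_1   << VE_COMP3_SHIFT);
    }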

3DPRIMITIVE

Once everything is set up, 3DPRIMITIVE tells the hardware to start doing stuff. Again, much of it is driven by the graphics API in use. I don’t think it’s worth going into too much detail on this one, but feel free to ask questions in the comments…

The URB

Modern GEN hardware has dedicated GPU memory which is referred to as L3 cache. I'm not sure there could have been a more confusing name. I'll try to make sure I call it GPU L3, which separates it from whatever else may be called L3. [6] The GPU L3 is partitioned into several functional parts, the largest of which is the special URB memory.

The Unified Return Buffer (URB) is a general purpose buffer used for sending data between different threads, and, in some cases, between threads and fixed function units (or vice versa).

The Thread Dispatcher (TD) is the main source of URB reads. As a part of spawning a thread, pipeline fixed functions provide the TD with a number of URB handles, read offsets, and lengths. The TD reads the specified data from the URB and provides that data in the thread payload, preloaded into GRF registers.

I'll explain that quote in finer detail in the next section. For now, simply recall from the first diagram that we have some input which goes into the geometry part of the pipeline (starting with vertex shading), followed by rasterization (and other fixed function things), followed by fragment shading. The URB is what holds the geometry data as it makes its way through the various stages of that geometry part of the pipeline. The fixed function part of the hardware will consume the data from the URB, at which point the space can be reused for the next set of geometry.

To summarize, the data going through the pipeline may exist in 3 types of memory depending on the stage and the hardware:

  • Graphics memory (GDDR on discrete cards, or system memory on integrated graphics).
  • The URB.
  • The Execution Unit's register file.

URB space must be divided among the geometry shading stages. To do this, we have state commands like 3DSTATE_URB_VS. Allocation granularity is in 512b chunks; one might make an educated guess that this is based on the cacheline size of the system (cacheline granularity tends to be the norm for what internal clients of memory arbiters can request).

Boiling down the URB allocation in mesa for the VS, you get an admittedly difficult to understand thing.
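(The snippet is gone from this copy; in spirit it does something like the following. This is a sketch under my own naming, using MAX2/ALIGN in the sense of mesa's macros; the constants come from the discussion above, not copied from the source.)

    /* Each VUE needs enough vec4 rows for the larger of the VS inputs
     * (count) and the VS outputs including the VUE header (num_slots). */
    unsigned vue_rows = MAX2(count, num_slots);

    /* URB allocation granularity is 512 bits (64 bytes), i.e. 4 vec4s,
     * so round the entry size up to that. */
    unsigned vs_entry_size_bytes = ALIGN(vue_rows * 16, 64);

    /* However much URB space the VS stage was given determines how many
     * VUE entries (vertices in flight) we get. */
    unsigned nr_vs_entries = vs_urb_space_bytes / vs_entry_size_bytes;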

The vs_entry_size_bytes is the size of every “packet” which is delivered to the VS shader thread with the push model of execution (keep this in mind for the next section).

Above, count is the number of inputs for the vertex shader, and let’s defer a bit on num_slots. Ultimately, we get a count of VUE entries.

WTF is a VUE

In general, vertex data is stored in Vertex URB Entries (VUEs) in the URB, processed by CLIP threads, and only referenced by the pipeline stages indirectly via VUE handles.

The Vertex Fetcher and the Thread Dispatcher work in tandem to get a Vertex Shader thread running with all the correct data in place. A VUE handle is created, for some portion of the URB with the specified size, at the time the Vertex Fetcher acts upon a 3DSTATE_VERTEX_ELEMENT. The Vertex Fetcher will populate some part of the VUE based on the contents of the vertex buffer, as well as doing things like adding the handle it generated and format conversion. That data is referred to as the VS thread payload, and the functionality is referred to as Input Assembly (the documentation there is really good; you should read that page). It's probably been too long without a picture. The following picture deviates from our green square example, in case you are keeping a close eye.

VF create VUEs

There are a few things you should notice there. The VF can only write a full vec4 into the URB. In the example there are 2 vec2s (a position, and a normal), but the VF must populate the remaining 2 components of each. We briefly saw some of the VERTEX_ELEMENT fields earlier, but I didn't explain them. STORE_SRC instructs the VF to copy, or convert, the source data from the vertex buffer and write it into the VUE. STORE_0 and STORE_1 do not read from the vertex buffer, but write a 0 and a 1, respectively, into the VUE. The terminology that seems to be the most prevalent for a vec4 of data in the VUE is "row", i.e. we've used 2 rows per VUE for input: row 0 is the position, row 1 is the normal. The images depict a VUE as an array of vec4 rows.

Here is the second half of the action where the VS actually makes the vertex data available to the EU by loading the register file.

SIMD8 VS Thread Dispatch

A VUE is allocated per vertex, and one handle is allocated for each as well. That happens once (it's more complicated with HS, DS, and GS, but I don't claim to understand that, so I'll skip it for now). That means that as the vertex attributes are getting mutated along the GPU pipeline, the output is going to have to be written back into the same VUE space that the input came from. The other constraint is that we need to actually set up the VUEs so they can be used by the fixed function pipeline later. This may be deferred until a later stage if we're using other geometry shading stages, or perhaps ignored entirely for streamout. To sum up, there are two requirements we have to meet:

  1. The possibly modified vertex data for a given VUE must be written back to the same VUE.
  2. We must build the VUE header, also in place of the old VUE. [7]

#2 seems pretty straightforward once we find the definition of a VUE header. That should also clear up why we had a function doing MAX2 to determine the VUE size in the previous section – we need the max of [the inputs, and the outputs with a VUE header]. You don't have to bother looking at the VUE header definition if you don't want; for our example we just need to know that we need to put an X, Y, Z, W in DWords 4-7 of the header. But how are we going to get the data back into the VUE… And now finally we can explain that 9th register (and mov) I asked you to ignore earlier. Give yourself a pat on the back if you still remember that.

Well, we didn't allocate the handles or the space in the URB for the VUEs – the Vertex Fetcher hardware did that. So we can't possibly know where they are. But wait! The VS Thread Payload for SIMD8 dispatch does have some interesting things in it. Here's my simplified version of the payload from the table in the docs:

DWord                               Register          Description
0-2                                 g0                MBZ
3                                   g0.3              Scratch Space / Sampler State Pointer
4                                   g0.4              Binding Table Pointer
5                                   g0.5              Scratch Offset
6-7                                 g0.6              Don't Care
8-15                                g1.0              Vertex 0-7 Return Handles
16-23 (optional)                    g2.0 (optional)   Constant Data
16-23 (pushed up if constant data)  g2.0              Vertex Data

Oh. g1 contains the "return" handles for the VUEs – in other words, the thing to reference the location of the output. Let's go back to the original VS example now.

With what we now know, we should be able to figure out what this means. Recall that we had 4 vertices in the green square. The destination of the mov instruction is the first register, so our chunk of data to be sent in the URB write message is: URB return handle, x, y, z, w, 0, 1, 0, 1.

The send instruction [message], like the rest of the ISA, operates upon a channel; in other words, read the values in column order. Send outputs n (9 in this case) registers worth of data. The URB write is pretty straightforward: it writes n-1 (8 in this case) elements from the channel, consuming the first register as the URB handle. That happens across all 8 columns because we're using SIMD8 dispatch.

So that's good, but if you are paying very close attention, you are missing one critical detail, and not enough information is provided above to figure it out. I'll give you a moment to go back in case I got your curiosity……

As we already stated above, DWords 4-7 of the VUE header have to be XYZW. So if you believe what I told you about URB writes, what we have above is going to end up with XYZW in DWord 0-3, and RGBA in DWord 4-7. That means the position will be totally screwed, and the color won't even get passed down to the next stage (because it will be VUE header data). This is solved with the global offset field in the URB write message. To be honest, I don't quite understand how the code implements this (it's definitely in src/mesa/drivers/dri/i965/brw_fs_visitor.cpp). I can go find the references to the URB write messages and dig around the i965 code to figure out how it works, or you can do that and I'll do one last diagram. What, diagram? Good choice.

URB writeback from VS

Closing

For me, writing blog posts like this is like writing a big patch series. You write and correct so many times that you become blind to any errors. WordPress says I am closing in on my 300th revision (and that of course excludes the diagrams). Please do let me know if you notice any errors.

I really wanted to talk about a few other things actually. The way SGVs work for VertexID, InstanceID, and the edge flag. Unfortunately, I think I’ve reached the limit on information that should be in a single blog post. Maybe I’ll do those later or something. Here is the gradient square using VertexID which I was planning to elaborate upon (that is why the first picture had a gradient square):

Collateral

A great link about the GPU pipeline which I never actually read.
https://fgiesen.wordpress.com/category/graphics-pipeline/


Inkscape SVGs

I think Inkscape might save some extra stuff, so try that if it looks weird in the browser or whatever.
Basic Pipeline:
https://bwidawsk.net/blog/wp-content/uploads/2015/09/basic_pipeline.svg

Push vs. Pull:
https://bwidawsk.net/blog/wp-content/uploads/2015/09/push_v_pull.svg

Pushed Attributes:
https://bwidawsk.net/blog/wp-content/uploads/2015/09/pushed_attribs.svg

Vertex Buffers and Vertex Elements:
https://bwidawsk.net/blog/wp-content/uploads/2015/09/vb-ve_deconstruct.svg

Attribute Packing:
https://bwidawsk.net/blog/wp-content/uploads/2015/09/packing.svg

VUE Life Cycle:
https://bwidawsk.net/blog/wp-content/uploads/2015/09/vue_lifecycle.svg


  1. I spent a significant amount of time and effort writing a post about the GLSL compiler in mesa with a focus on the i965 backend. If you've read my blog posts, hopefully you see that I try to include real spec references and code snippets (and pretty pictures). I started work on that before NIR was fully baked, and before certain people essentially rewrote the whole freaking driver. Needless to say, by the time I had a decent post going, everything had changed. It would have been great to document that part of the driver, but sadly I discarded that post. If there is any interest in what I had written/drawn, maybe I'd be willing to post it in its current state.

  2. I find the word "pipeline" a bit misleading here, but maybe that's just me. It's not specific enough. It seems to imply that a vertex is the smallest atom the GPU will operate on, and while that's true, it ignores the massively parallel nature of the beast. There are other oddities as well, like streamout.

  3. The decoding within libdrm is quite old and crufty. Most, if not everything in this post was hand decoded via the specifications for the commands in the public documentation.

  4. To dump the generated vertex shader and fragment shader in native code (with some amount of the IR), you similarly add fs, and vs. To dump all three things at once for example:
    INTEL_DEBUG=batch,vs,fs 

  5. The layout is RGBA, ie. a completely opaque green value. https://en.wikipedia.org/wiki/RGBA_color_space

  6. Originally, and in older public docs it is sometimes referred to as Mid Level Cache (MLC), and I believe they are exactly the same thing. MLC was the level before Last Level Cache (LLC), which made sense, until Baytrail which had a GPU L3, but did not have an LLC. 

  7. You could get clever, or buggy, depending on which way you look at it, and actually write into a different VUE. The ability is there, but it's not often (if ever) used.

A bit on Intel GPU frequency and a new tool to control it

I’ve had this post sitting in my drafts for the last 7 months. The code is stale, but the concepts are correct. I had intended to add some pictures before posting, but it’s clear that won’t happen now. Words are better than nothing, I suppose…

Recently I pushed an intel-gpu-tools tool for modifying GPU frequencies. It's meant to simplify using the sysfs interfaces, similar to many other such helper tools in Linux (cpupower has become one of my recent favorites). I wrote this fancy getopt example^w^w^w program to address a specific need I had, but saw its uses for the community as well. Some time after upstreaming the tool, I accidentally put the name of this tool into my favorite search engine (yes, after all these years X clipboard still confuses me). Surprisingly I was greeted by a discussion about the tool. None of it was terribly misinformed, but I figured I might set the record straight anyway.

Introduction

Dynamically changing frequencies is a difficult thing to accomplish. Typically there are multiple clock domains in a chip, and coordinating all the necessary ones in order to modify one probably requires a bunch of state which I'd not understand, much less be able to explain. To facilitate this (on Gen6+) there is firmware which does whatever it is that firmwares do to change frequencies. When we talk about changing frequencies from a Linux kernel driver perspective, it means we're asking the firmware for a frequency. It can, and does, overrule, balk at, and ignore our frequency requests.

A brief explanation on frequencies in i915

The term that is used within the kernel driver and docs is "RPS", which is [I believe] short for Render P-States. They are analogous to CPU P-states in that lower numbers are faster, higher numbers are slower. Conveniently, we only have two numbers, 0 and 1, on GEN. Past that, I don't know how CPU P-States work, so I'm not sure how much else is similar.

There are roughly 4 generations of RPS:

  1. IPS (not to be confused with Intermediate Pixel Storage). The implementation of this predates my knowledge of this part of the hardware, so I can’t speak much about it. It stopped being a thing after Ironlake (Gen5). I don’t care to look, but you can if you want: drivers/platform/x86/intel_ips.c
  2. RPS (Sandybridge, Ivybridge)
    There are 4 numbers of interest: RP0, RP1, RPn, "hw max". The first 3 are directly read from a register.

    RP0 is the maximum value the driver can request from the firmware. It’s the highest non-overclocked frequency supported.
    RPn is the minimum value the driver can request from the firmware. It’s the lowest frequency supported.
    RP1 is the most efficient frequency.
    hw_max is RP0 if the system does not support overclocking.
    Otherwise, it is read through a special set of commands where we’re told by the firmware the real max. The overclocking max typically cannot be sustained for long durations due to thermal considerations, but that is transparent to software.

  3. RPS (HSW+)
    Similar to the previous RPS, except there is an RPe (for efficient). I don’t know how it differs from RP1. I just learned about this after posting the tool – but just be aware it’s different. Baytrail and Cherryview also have a concept of RPe.
  4. The Atom based stuff (Baytrail, Cherryview)
    I don’t pretend to know how this works [or doesn’t]. They use similar terms.
    They have the same set of names as the HSW+ RPS.

How the driver uses it

The driver can make requests to the firmware for a desired frequency:
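(The snippet I had here is gone; the request boils down to a single register write, roughly the following. This is paraphrased from memory of the gen6_set_rps() path in the kernel's intel_pm.c of that era, so treat the details as approximate.)

    /* Ask the firmware (not tell it!) for frequency 'val'. */
    I915_WRITE(GEN6_RPNSWREQ,
               GEN6_FREQUENCY(val) |
               GEN6_OFFSET(0) |
               GEN6_AGGRESSIVE_TURBO);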

This register interface doesn't provide a way to determine whether the request was granted, other than reading back the current frequency, which is error prone, as explained below.

By default, the driver will request frequencies between the efficient frequency (RP1, or RPe) and the max frequency (RP0 or hwmax), based on the system's busyness. The busyness can be calculated either by software or by the hardware. For the former, the driver can periodically read a register to get busyness and make decisions based on that.

In the latter case, the firmware will itself measure busyness and give the driver an interrupt when it determines that the GPU is sufficiently overworked or under worked. At each interrupt the driver would raise or lower by the smallest step size (typically 50MHz), and continue on its way. The most complex thing we did (which we still managed to screw up) was disable interrupts telling us to go up when we were already at max, and the equivalent for down.

It seems obvious that there are usual trends, if you increment the frequency you’re more likely to increment again in the near future and such. Since leaving for sabbatical and returning to work on mesa, there has been a lot of complexity added here by Chris Wilson, things which mimic concepts like the ondemand CPU frequency governor. I never looked much into those, so I can’t talk knowledgeably about it – just realize it’s not as naive as it once was, and should do a better job as a result.

The flow seems a bit ping-pongish: the firmware measures busyness, interrupts the driver, the driver requests a frequency one step up or down, and the firmware goes back to measuring.

The benefit is though that the firmware can do all the dirty work, and the CPU can sleep. Particularly when there’s nothing going on in the system, that should provide significant power savings.

Why you shouldn't set --max or --min

First, what do they do? --max and --min lock the GPU frequency to max and min respectively. What this actually means is that in the driver, even if we get interrupts to throttle up or throttle down, we ignore them (hopefully the driver will disable such useless interrupts, but I am too scared to check). I also didn't check what this does when we're using software to determine busyness, but it should be very similar.

I should mention now that this is what I've inferred through careful observation^w^w random guessing. In the world of thermals and overheating, things can go from good, to bad, to broke faster than the CPU can take an interrupt and adjust things. As a result, the firmware can and will lower frequencies even if it has previously acknowledged it can give a specific frequency.

As an example, if you do:
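(The original example is missing from this copy; in pseudo-C it was essentially this, with hypothetical helper names standing in for the tool or sysfs accesses:)

    set_gpu_freq(700);              /* request 700MHz                        */
    assert(get_gpu_freq() == 700);  /* racy: firmware may already have
                                       throttled us somewhere else entirely */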

There is a period in the middle, and after the assert, where the firmware may have done who knows what. Furthermore, if we try to lock to --max, the GPU is more likely to hit these overheating conditions and throttle you down. So --max doesn't even necessarily get you max. It's sort of an interesting implication there, since one would think these same conditions (probably the CPU is heavily used as well) would end up clocking up all the way anyway, and we'd get into the same spot, but I'm not really sure. Perhaps the firmware won't tell us to throttle up so aggressively when it's near its thermal limits. Using --max can actually result in non-optimal performance, and I have some good theories why, but I'll keep them to myself since I really have no proof.

--min on the other hand is equally stupid for a different and more obvious reason. As I said above, it's guaranteed to provide the worst possible performance and not guaranteed to provide the most optimal power usage.

What you should do with it

The usages are primarily benchmarking and performance debug. Assuming you can sustain decent cooling, locking the GPU frequency to anything will give you the most consistent results from run to run. Presumably max, or near max will be the most optimal.
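(For what it's worth, the tool is just a convenience over sysfs; pinning the frequency by hand looks like this. These are i915's sysfs names; your card index may differ, and you'll need root.)

    # echo 700 > /sys/class/drm/card0/gt_min_freq_mhz
    # echo 700 > /sys/class/drm/card0/gt_max_freq_mhz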

min is useful to get a measure of what the worst possible performance is to see how it might impact various optimizations. It can help you change the ratio of GPU to CPU frequency drastically (and easily). I don’t actually expect anyone to use this very much.

How not to get shot in the foot

If you are a developer trying to get performance optimizations or measurements which you think can be negatively impacted by the GPU throttling, you can set this value. Again, because of the thermal considerations, as you go from run to run making tweaks, I'd recommend setting something maybe 75% or so of max – that is a total ballpark figure. When you're ready to get some absolute numbers, you can try setting --max, and the few frequencies near max, to see if you have any unexpected results. Take the highest value when done.

The sysfs entries require root for a reason. Any time…

Footnotes

On the surface it would seem that the minimum frequency should always use the least amount of power. At idle, I'd assert that is always true. The corollary is that at the minimum frequency, workloads also finish the most slowly. Intel GPUs don't idle for long; they go into RC6. The efficient frequency is a blend of maintaining a low frequency and winning the race to idle. AFAIK, it's a number selected after a whole lot of test runs – we could ignore it.


Goodbye Sabbatical: OS backtrace with symbol names


For those of you reading this that didn’t know, I’ve had two months of paid vacation – one of the real perks of working for Intel. Today is the last day. It is as hard as I thought it would be.

Most of the vacation was spent vacationing. As I have access to none of the pictures at the moment, and I don’t want to make you jealous, I’ll skip over that. Toward the end though, I ended up at a coffee shop waiting for someone with nothing to do. I spent a little bit of time working on HOBos, and I thought it could be interesting to write about it.

WARNING: There is nothing novel here.

A brief history of HOBos

The HOBby operating system is an operating system project I started a while ago. I am a bit embarrassed that I started an OS. In my opinion, it's one of the lamer tasks to take on because:

  1. everyone seems to do it;
  2. there really isn’t a need, there are many operating systems with permissive licenses already; and
  3. sites like OSDev have made much of the work trivial (I like to think that when I started there wasn't quite as much info readily available, but that's a lie).
Larrabee Av in Portland (not what the project was named after)

HOBos began while I was working on the Larrabee project. The team spent a lot of time optimizing the memory management and the scheduler for the embedded software. I really wanted to work on these things full time. Unfortunately for me, having had a background in device drivers, I was often required to do other things. As a means to scratch the itch, I started HOBos after not finding anything suitable for my needs. The stuff I found was all either too advanced or too rudimentary. When I was hired to work on i915, I decided that was a better use of my free time. Since then, I've been tweaking things here or there, and I do try to make sure things still run with the latest QEMU and compilers at least once a year. The last actual feature I added was more than 1300 days ago.

So back to the coffee shop. I tried to do something or other, got a hang, and didn’t want to fire up GDB.

Backtracing

HOBos had implemented backtraces since the original import from SVN (let’s pretend that means, since always). Obtaining a backtrace is actually pretty straightforward on x86.

The stack frame

The stack frame can be thought of as memory contents that are locally scoped to a function. Declaring a local variable will end up in the stack frame. A global variable will not. As functions call other functions, you end up with multiple frames. A stack is used because the last frames added are the first ones removed (this is not always true for things like exceptions, but nevermind that for now). The fact that a stack decrements is arbitrarily chosen, as far as I can tell. The following shows the stack when the function foo() calls the function bar().

Example Stackframe

The memory contents shown above are created as a result of two things. The first is what the CPU implicitly does upon execution of the call instruction. The second is what the compiler generates; I'll detail that a bit more shortly. Since we're talking about x86 here, the call instruction always pushes at least the return address. Correlating this to the picture, the green (foo) and blue (bar) are creations of the compiler. The brown is automatically pushed onto the stack by the hardware, and is automatically popped from the stack by the ret instruction.

In the above there are two registers worth noting: RBP and RSP. RBP, the 64b extension to the original 8086 BP (Base Pointer) register, points to the beginning of the stack frame, i.e. it is the Frame Pointer. RSP, the extension to the 8086 SP (Stack Pointer), points to the end of the stack frame. By convention the Base Pointer doesn't change throughout a function's execution, and therefore it is often used as the reference for local variables stored on the stack; -100(%rbp) in the example above.

Digging further into that disassembly above, one notices a pattern. Every function begins with:
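(The listing was dropped from this copy; the pattern in question is the standard x86 prologue:)

    push %rbp          # save the caller's Base Pointer
    mov  %rsp, %rbp    # start this function's frame at the current Stack Pointer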

Assuming this convention is followed, it implies that at any given point during the execution of a function we can obtain the previous RBP by reading the current RBP and doing some processing. Specifically, the memory at the current RBP holds the caller's saved RBP. As mentioned above, the x86 CPU pushed the return address immediately before the push %rbp – which means as we work backwards through the Base Pointers, we can also obtain the caller for the current stack frame. People have done really nice pictures on this – use your favorite search engine.

Here is the HOBos code (ignore the part about symbols for now):
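(The original listing isn't reproduced here; this sketch matches the description in the text rather than the actual HOBos tree, and print_symbol stands in for the symbol lookup described later.)

    struct stack_frame {
            struct stack_frame *prev;  /* saved RBP of the caller        */
            uint64_t ret;              /* return address pushed by call  */
    };

    static void backtrace(void)
    {
            struct stack_frame *frame;

            /* The current RBP points at the saved RBP/return address pair. */
            __asm__ volatile("movq %%rbp, %0" : "=r"(frame));

            while (frame) {            /* initial RBP is zeroed at thread start */
                    print_symbol(frame->ret);  /* ignore this part for now */
                    frame = frame->prev;       /* walk to the caller's frame */
            }
    }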

As far as I know, all modern CPUs work in a similar fashion with differences sprinkled here and there. ARM for example has an LR register for the return address instead of using the stack.

ABI/Calling Conventions

The fact that we can work backwards this way is a byproduct of the calling convention. One aspect of the calling convention is where to put the arguments to a function: do they go on the stack, in registers, or somewhere else? In addition, the ways in which RBP and RSP are used are strictly software constructs that are part of the convention. As a result, it might not always be possible to get a backtrace if:

  1. This convention is not adhered to (or -fomit-frame-pointer)
  2. The contents of RBP are destroyed
  3. The contents of the stack are corrupted.

How arguments are passed to a function also needs to be defined so that linkers and loaders (both static and dynamic) can operate to correctly form an executable, or dynamically call a function. Since this isn't really important to obtaining a backtrace, I will leave it there. Some architectures do provide a way to obtain useful backtrace information without caring about the calling convention: Intel's Processor Trace, for example.

Symbol information

The previous section will get us a reverse list of addresses for all function calls working backward from a given point during execution. But having names makes it much easier to quickly diagnose what is going wrong. There is a lot of information on the internet about this stuff; I'm simply providing all that's relevant to my specific problem.

ELF Symbols (linking)

The ELF format provides everything we need (assuming things aren't stripped). Glossing over the details (see this simple tool if you're curious), we end up with two "sections" that tell us everything we need. They are conventionally named ".symtab" and ".strtab", and are conveniently of type SHT_SYMTAB and SHT_STRTAB. The symbol table defines the information about each symbol (functions, variables, whatever). Part of that information is a name, which is an index into the string table. In the simplest case, these are provisions for inter-object linking: if I had defined foo() in foo.c, and bar() in bar.c, the compiled object files can be linked together, but the linker needs the information about the symbol bar (in this case) in order to do its job.

[Nr] Name Type Address Offset
[33] .symtab SYMTAB 0000000000000000 000015b8
[34] .strtab STRTAB 0000000000000000 00001c90

Summing that up: if we have an entire ELF file, and the symbol and string tables haven't been stripped, we're good to go. However, ELF sections are not the unit in which an ELF loader decides what to load. The loader loads segments, which are of type PT_LOAD. A segment is made up of 0 or more sections, plus padding. Since the Operating System is itself an ELF loaded by an ELF loader (the bootloader), we're not in a good position. :(

ELF Loader

Debug Info

Note that what we want is not the same thing as debug information. If one wants to do source level debug, there needs to be some way of correlating a machine instruction to a line of source code. This is also a software abstraction, and there is usually a spec for it unless you are using some proprietary thing. It would technically be possible to include DWARF capabilities within the kernel, but I do not know of a way to get that info to the OS (see multiboot stuff for details).

From boot to symbols

The HOBos project implements a small Multiboot compliant bootloader called smallboot. When the machine starts up, boot firmware is loaded from a fixed location (this is currently done by SeaBIOS). The boot firmware then loads the smallboot bootloader. The bootloader will load the kernel (smallboot, and most modern bootloaders will do this through a text configuration file on the resident filesystem). In the HOBos case, the kernel is simply an ELF file. smallboot implements a basic ELF loader to load the kernel into memory and give execution over.

The multiboot specification is a standardized communication mechanism (for various things) from the bootloader to the Operating System (or any file really). One of these things is symbol information. Quoting the multiboot spec:

If bit 5 in the 'flags' word is set, then the following fields in the Multiboot information structure starting at byte 28 are valid:

These indicate where the section header table from an ELF kernel is, the size of each entry, number of entries, and the string table used as the index of names. They correspond to the 'shdr_*' entries ('shdr_num', etc.) in the Executable and Linkable Format (ELF) specification in the program header. All sections are loaded, and the physical address fields of the ELF section header then refer to where the sections are in memory (refer to the i386 ELF documentation for details as to how to read the section header(s)). Note that 'shdr_num' may be 0, indicating no symbols, even if bit 5 in the 'flags' word is set.

Since the beginning I had implemented these fields in the bootloader:
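(The snippet is missing from this copy; the fields in question are the ELF section header table entries of the Multiboot info structure, which the spec lays out essentially like so:)

    struct multiboot_elf_section_header_table {
            uint32_t num;    /* number of section headers           */
            uint32_t size;   /* size of each section header entry   */
            uint32_t addr;   /* address of the section header table */
            uint32_t shndx;  /* section index of the string table   */
    };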

Because the symbols weren't in the ELF segments though, I was stumped as to how to get at the data once the OS is loaded. As it turned out, I didn't actually read all 4 sentences, and I had missed one very important part.

All sections are loaded, and the physical address fields of the elf section header then refer to where the sections are in memory

What the spec is dictating is that even though the sections are not in loadable segments, they shall exist within memory during the handover to the OS, and the section header information will be updated so that the OS knows where to find them. With this, the OS can copy the data out, or just make sure not to overwrite it, and then access it.

et cetera

The code for pulling out the symbols is quite a bit longer, but it can be found in kern/core/syms.c. With the RBP unwinder given near the top, we can easily get the IP of the caller. With that IP, we do a symbol lookup from the symbols we got via the multiboot info.
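(For flavor, the lookup amounts to scanning .symtab for the function whose bounds contain the instruction pointer. This is a hedged sketch, not the code from kern/core/syms.c.)

    static const char *symbol_for_addr(const Elf64_Sym *symtab, size_t nsyms,
                                       const char *strtab, uint64_t ip)
    {
            for (size_t i = 0; i < nsyms; i++) {
                    const Elf64_Sym *sym = &symtab[i];
                    if (ELF64_ST_TYPE(sym->st_info) != STT_FUNC)
                            continue;
                    if (ip >= sym->st_value && ip < sym->st_value + sym->st_size)
                            return strtab + sym->st_name; /* index into .strtab */
            }
            return "<unknown>";
    }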

Screenshot with backtrace

Inkscape Links

https://bwidawsk.net/blog/wp-content/uploads/2014/10/stackframe.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/10/loader.svg


Future PPGTT [part 4] (Dynamic page table allocations, 64 bit address space, GPU “mirroring”, and yeah, something about relocs too)


Preface

GPU mirroring provides a mechanism to have the CPU and the GPU use the same virtual address for the same physical (or IOMMU) page. An immediate result of this is that relocations can be eliminated. There are a few derivative benefits from the removal of the relocation mechanism, but it really all boils down to that. Other people call it other things, but I chose this name before I had heard other names. SVM would probably have been a better name had I read the OCL spec sooner. This is not an exclusive feature restricted to OpenCL. Any GPU client will hopefully eventually have this capability provided to them.

If you're going to read any single PPGTT post of this series, I think it should not be this one. I was not sure I'd write this post when I started documenting the PPGTT (part 1, part 2, part 3). I had hoped that any of the following things would have solidified the decision by the time I completed part 3.

  1. CODE: The code is not merged, not reviewed, and not tested (by anyone but me). There's no indication about the "upstreamability". What this means is that if you read my blog to understand how the i915 driver currently works, you'll be taking a crap-shoot on this one.
  2. DOCS: The Broadwell public Programmer Reference Manuals are not available. I can’t refer to them directly, I can only refer to the code.
  3. PRODUCT: Broadwell has not yet shipped. My ulterior motive had always been to rally the masses to test the code. Without product, that isn't possible.

Concomitant with these facts, my memory of the code and interesting parts of the hardware it utilizes continues to degrade. Ultimately, I decided to write down what I can while it’s still fresh (for some very warped definition of “fresh”).

Goal

GPU mirroring is the goal. Dynamic page table allocations are very valuable by themselves. Using dynamic page table allocations can dramatically conserve system memory when running with multiple address spaces (part 3 if you forgot), which is something that should become pretty common shortly. Consider for a moment a Broadwell legacy 32b system (more details later). You would require about 8MB for page tables to map one page of system memory. With dynamic page table allocations, this is reduced to 8K. Dynamic page table allocations are also an indirect requirement for implementing a 64b virtual address space. Having a 64b virtual address space is a pretty unremarkable feature by itself; on current workloads [that I am aware of] it provides no real benefit. Supporting 64b did require cleaning up the infrastructure code quite a bit though, and should anything from the series get merged, I believe the result is a huge improvement in code readability.

Current Status

I briefly mentioned dogfooding these several months ago. At that time I only had the dynamic page table allocations on GEN7 working. The fallout wasn’t nearly as bad as I was expecting, but things were far from stable. There was a second posting which is much more stable and contains support of everything through Broadwell. To summarize:

Feature              Status             TODO
Dynamic page tables  Implemented        Test and fix bugs
64b address space    Implemented        Test and fix bugs
GPU mirroring        Proof of Concept   Decide on interface; implement interface. [1]

Testing has been limited to just one machine, mine, when I don't have a million other things to do. With that caveat, on top of my last PPGTT stabilization patches things look pretty stable.

Present: Relocations

Throughout many of my previous blog posts I've gone out of the way to avoid explaining relocations. My reluctance was because explaining the mechanics is quite tedious, not because it is a difficult concept. It's impossible [and extremely unfortunate for my weekend] to make the case for why these new PPGTT features are cool without touching on relocations at least a little bit. The following picture exemplifies both the CPU and GPU mapping the same pages with the current relocation mechanism.

Current PPGTT support

To get to the above state, something like the following would happen.

  1. Create BOx
  2. Create BOy
  3. Request BOx be uncached via the DRM_IOCTL_I915_GEM_SET_CACHING IOCTL.
  4. Do one of the aforementioned operations on BOx and BOy.
  5. Perform execbuf2.

Accesses to the BO from the CPU require having a CPU virtual address that eventually points to the pages representing the BO. [2] The GPU has no notion of CPU virtual addresses (unless you have a bug in your code). Inevitably, all the GPU really cares about is physical pages, and which ones they are. On the other hand, userspace needs to build up a set of GPU commands which sometimes need to be aware of the absolute graphics address.

Several commands do not need an absolute address. 3DSTATE_VS, for instance, does not need to know anything about where the Scratch Space Base Offset actually is; it needs only an offset from the General State Base Address. The General State Base Address, however, does need to be known by userspace, and it is programmed via STATE_BASE_ADDRESS.

Using the relocation mechanism gives userspace a way to inform the i915 driver about the BOs which needs an absolute address. The handles plus some information about the GPU commands that need absolute graphics addresses are submitted at execbuf time. The kernel will make a GPU mapping for all the pages that constitute the BO, process the list of GPU commands needing update, and finally submit the work to the GPU.

Future: No relocations

GPU Mirroring

The diagram above demonstrates the goal: symmetric mappings to a BO on both the GPU and the CPU. There are benefits to ditching relocations. One of the nice side effects of getting rid of relocations is that it allows us to drop the use of the DRM memory manager and simply rely on malloc as the address space allocator. The DRM memory allocator does not get the same amount of attention with regard to performance as malloc does. Even if it did perform as ideally as possible, it's still a superfluous CPU workload. Other people can probably explain the CPU overhead in better detail. Oh, and OpenCL 2.0 requires it.

Makin’ it Happen

64b

As I’ve already mentioned, the most obvious requirement is expanding the GPU address space to match the CPU.

Page Table Hierarchy

If you have taken any sort of Operating Systems class, or read up on Linux MM within the last 10 years or so, the above drawing should be incredibly unremarkable. If you have not, you're probably left with a big 'WTF' face. I probably can't help you if you're in the latter group, but I do sympathize. For the other camp: Broadwell brought 4 level page tables that work exactly how you'd expect them to. Instead of the x86 CPU's CR3, GEN GPUs have PML4. When operating in legacy 32b mode, there are 4 PDP registers that each point to a page directory and therefore map 4GB of address space. [3] The register is just a simple logical address pointing to a page directory. The actual changes in hardware interactions are trivial on top of all the existing PPGTT work.

“This will take one week. I can just allocate everything up front.” (Dynamic Page Table Allocation)

Funny story. I was asked to estimate how long it would take me to get this GPU mirror stuff in shape for a very rough proof of concept. "One week. I can just allocate everything up front." If what I have is "done", then I was off by 10x.

Where I went wrong in my estimate was math. If you consider the above, you quickly see why allocating everything up front is a terrible idea and flat out impossible on some systems.

Dissimilarities to x86

First and foremost, there are no GPU page faults to speak of. We cannot demand-allocate anything in the traditional sense. I was naive though, and one of the first thoughts I had was: the Linux kernel [heck, just about everything that calls itself an OS] manages 4 level page tables on multiple architectures. The page table format on Broadwell is remarkably similar to x86 page tables. If I can't use the code directly, surely I can copy. Wrong.

Here is some code from the Linux kernel which demonstrates how you can get a PTE for a given address in Linux.
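(The kernel snippet is gone from this copy; for the kernel of that era it was the classic 4-level walk, roughly:)

    pgd_t *pgd = pgd_offset(mm, addr);   /* mm->pgd is what CR3 points at */
    if (pgd_none(*pgd))
            return NULL;
    pud_t *pud = pud_offset(pgd, addr);
    if (pud_none(*pud))
            return NULL;
    pmd_t *pmd = pmd_offset(pud, addr);
    if (pmd_none(*pmd))
            return NULL;
    pte_t *pte = pte_offset_map(pmd, addr);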

X86 page table code has two very distinct properties that do not exist here (warning, this is slightly hand-wavy).

  1. The kernel knows exactly where in physical memory the page tables reside. [4] On x86, it need only read CR3. We don't know where our page tables reside in physical memory because of the IOMMU. When VT-d is enabled, the i915 driver only knows the DMA address of the page tables.
  2. There is a strong correlation between a CPU process and an mm (set of page tables). Keeping mappings around of the page tables is easy to do if you don’t want to take the hit to map them every time you need to look at a PTE.

If the Linux kernel needs to find out if a page is present or not without taking a fault, it need only look to one of those two options. After about a week of making the IOMMU driver do things it shouldn't do, and trying to push the square block through the round hole, I gave up on reusing the x86 code.

Why Do We Actually Need Page Table Tracking?

The IOMMU interfaces were not designed to pull a physical address from a DMA address. Pre-allocation is right out. It's difficult to try to get the instantaneous state of the page tables…

Another thought I had very early on was that tracking could be avoided if we just never tore down page tables. I knew this wasn't a good solution, but at that time I just wanted to get the thing working and didn't really care if things blew up spectacularly after running for a few minutes. There is actually a really easy set of operations that shows why this won't work. For the following, think of the four level page tables as arrays. ie.

  • PML4[0-255], each point to a PDP
  • PDP[0-255][0-511], each point to a PD
  • PD[0-255][0-511][0-511], each point to a PT
  • PT[0-255][0-511][0-511][0-511] (where PT[0][0][0][0][0] is the 0th¬†PTE in the system)
  1. [mesa] Create a 2M sized BO. Write to it. Submit it via execbuffer.
  2. [i915] See the new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM] Find that address 0 is free.
    2. [i915] Allocate a PDP for PML4[0].
    3. [i915] Allocate a PD for PDP[0][0].
    4. [i915] Allocate a PT for PD[0][0][0].
    5. [i915] (condensed) Set pointers from PML4->PDP->PD->PT.
    6. [i915] Set the 512 PTEs PT[0][0][0][0][0-511] to point to the BO’s backing pages.
  3. [i915] Dispatch work to the GPU on behalf of mesa.
  4. [i915] Observe that the hardware has completed.
  5. [mesa] Create a 4k sized BO. Write to it. Submit both BOs via execbuffer.
  6. [i915] See the new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM] Find that address 0x200000 is free.
    2. [i915] Allocate PDP[0][0], PD[0][0][0], PT[0][0][0][1].
    3. Set pointers… Wait. Is PDP[0][0] allocated already? Did we already set pointers? I have no freaking idea!
    4. Abort.

Page Table Tracking with Bitmaps

Okay, I could have used a sentinel for empty entries: point every unused page table entry at the scratch page. Determining whether an entry is allocated would then mean reading back potentially large amounts of data from the page tables, which would be slow. It should work though. I didn’t try it.
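
For illustration, here is roughly what the sentinel check could look like (types and names invented; only the idea matters): asking “is this PT allocated?” becomes “read the PDE back and compare it against the scratch page.”

    /* Hypothetical sketch of the sentinel approach. Unused PDEs point at
     * a scratch PT, so allocation state lives in the page tables
     * themselves and must be read back to be queried. */
    typedef uint64_t gen8_pde_t;

    static bool pt_is_allocated(const gen8_pde_t *pd_vaddr, unsigned pde,
                                uint64_t scratch_pt_addr)
    {
            /* Mask off the flag bits, keep only the address bits. */
            return (pd_vaddr[pde] & ~0xfffull) != scratch_pt_addr;
    }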

After I had determined I couldn’t reuse the x86 code, and that I needed some way to track which page table elements were allocated, I was pretty set on using bitmaps to track usage. The idea of a hash table came and went – none of the upsides of a hash table are useful here, but all of the downsides are present (space). Bitmaps were sort of the default case. Unfortunately though, I did some math at this point (notice the LaTeX!).
  \frac{2^{47}\,bytes}{4096\,bytes/page} = 34359738368\,pages \newline  34359738368\,pages \times \frac{1\,bit}{1\,page} = 34359738368\,bits \newline  34359738368\,bits \times \frac{1\,byte}{8\,bits} = 4294967296\,bytes
That’s 4GB simply to track every page. There’s some more overhead because page [tables, directories, directory pointers] are also tracked.
  256\,entries + (256\times512)\,entries + (256\times512^2)\,entries = 67240192\,entries \newline  67240192\,entries \times \frac{1\,bit}{1\,entry} = 67240192\,bits \newline  67240192\,bits \times \frac{1\,byte}{8\,bits} = 8405024\,bytes \newline  4294967296\,bytes + 8405024\,bytes = 4303372320\,bytes \newline  4303372320\,bytes \times \frac{1\,GB}{1073741824\,bytes} = 4.0078\,GB

I can’t remember whether I had planned to statically pre-allocate the bitmaps, or whether I was just so caught up in the details that I couldn’t see the big picture. I remember thinking, 4GB just for the bitmaps, that will never fly. I probably spent a week trying to figure out a better solution. When we invent time travel, I will go back and talk to my former self: 4GB of bitmap tracking when you’re mapping 128TB of memory is inconsequential – roughly 0.003% of the memory addressable by the GPU. Hopefully you didn’t fall into that trap, and I just wasted your time, but there it is anyway.

Sample code to walk the page tables

This code does not actually exist, but it is very similar to the real code. The following shows how one would “walk” to a specific address, allocating the necessary page tables and setting the bitmaps along the way. Teardown is a bit harder, but it is similar.
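
Something along these lines, using the index helpers sketched earlier; every name here is invented, and error unwinding is omitted.

    /* Illustrative walk-and-allocate, not the real i915 code. Missing
     * levels are allocated on the way down, and each newly allocated
     * structure is marked in its parent's usage bitmap so teardown (and
     * the next allocation) knows exactly what exists. */
    static int walk_and_alloc(struct ppgtt *ppgtt, uint64_t addr)
    {
            unsigned pml4e = pml4e_index(addr);
            unsigned pdpe  = pdpe_index(addr);
            unsigned pde   = pde_index(addr);
            struct pdp *pdp;
            struct pd  *pd;

            if (!test_bit(pml4e, ppgtt->pml4.used_pml4es)) {
                    ppgtt->pml4.pdps[pml4e] = alloc_pdp();
                    if (!ppgtt->pml4.pdps[pml4e])
                            return -ENOMEM;
                    __set_bit(pml4e, ppgtt->pml4.used_pml4es);
            }
            pdp = ppgtt->pml4.pdps[pml4e];

            if (!test_bit(pdpe, pdp->used_pdpes)) {
                    pdp->pds[pdpe] = alloc_pd();
                    if (!pdp->pds[pdpe])
                            return -ENOMEM;
                    __set_bit(pdpe, pdp->used_pdpes);
            }
            pd = pdp->pds[pdpe];

            if (!test_bit(pde, pd->used_pdes)) {
                    pd->pts[pde] = alloc_pt();
                    if (!pd->pts[pde])
                            return -ENOMEM;
                    __set_bit(pde, pd->used_pdes);
            }

            return 0;  /* the caller may now write PTEs into pd->pts[pde] */
    }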

Here is a picture which shows the bitmaps for the two-allocation example above.

Bitmaps tracking page tables

The GPU mirroring interface

I really don’t want to spend too much time here. In other words, no more pictures. As I’ve already mentioned, the interface was designed for a proof of concept which already had code using userptr. The shortest path was to simply reuse the interface.

In the patches I’ve submitted, two changes were made to the existing userptr interface (which wasn’t merged upstream at the time, but is now). I added a context ID, and a flag to specify that you want mirroring.

The context argument tells the i915 driver which address space we’ll be mirroring the BO into. Recall from part 3 that a GPU process may have multiple contexts. The flag simply tells the kernel to use the value in user_ptr as the address at which to map the BO in the GEN GPU’s virtual address space. When using the normal userptr interface, the i915 driver picks the GPU virtual address.
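
To make that concrete, here is a hypothetical sketch of the extended argument struct; ctx_id and the MIRROR flag are the two additions described above and are not part of the upstream drm_i915_gem_userptr layout.

    /* Hypothetical, for illustration only - not the upstream interface. */
    struct local_i915_gem_userptr {
            __u64 user_ptr;    /* CPU VA; with MIRROR set, also the GPU VA */
            __u64 user_size;
            __u32 ctx_id;      /* added: whose PPGTT to mirror the BO into */
            __u32 flags;
    #define LOCAL_USERPTR_READ_ONLY 0x1
    #define LOCAL_USERPTR_MIRROR    0x2   /* added: map at user_ptr on the GPU */
            __u32 handle;      /* out: handle of the new BO */
    };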

  • Pros:
    • This interface is very simple.
    • Existing userptr code does the hard work for us
  • Cons:
    • You need 1 IOCTL per object. Much unneeded overhead.
    • It’s subject to a lot of the problems userptr has5
    • Userptr was already merged, so unless pad gets repurposed, we’re screwed

What should be: soft pin

There hasn’t been too much discussion here, so it’s hard to say. I believe the trend of the discussion (and the author’s personal preference) would be to add flags to the existing execbuf relocation mechanism. The flag would tell the kernel not to relocate the object, and to use the presumed_offset field that already exists. This is sometimes called “soft pin.” It’s a bit of a chicken and egg problem, since the amount of work in userspace to make this useful is non-trivial, and the feature can’t be merged until there is an open source userspace. Stay tuned. Perhaps I’ll update the blog as the story unfolds.
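
If it lands along those lines, userspace might look something like the sketch below. The flag name and value are hypothetical; the point is that the offset the kernel normally writes back becomes an input the kernel must honor.

    /* Hypothetical soft pin usage. drm_i915_gem_exec_object2 is the real
     * execbuffer object struct; the flag is invented for illustration. */
    #define LOCAL_EXEC_OBJECT_SOFT_PIN (1ull << 4)   /* hypothetical */

    struct drm_i915_gem_exec_object2 obj = {
            .handle = bo_handle,
            .offset = desired_gpu_addr,  /* normally an output; now a demand */
            .flags  = LOCAL_EXEC_OBJECT_SOFT_PIN,
    };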

Wrapping it up (all 4 parts)

As usual, please report bugs or ask questions.

So with the 4 parts you should understand how the GPU interacts with system memory. You should know what the Global GTT is, why it still exists, and how it works. You might recall what a PPGTT is, and the intricacies of multiple address spaces. Hopefully you remember what you just read about 64b and GPU mirroring. Expect a rebased patch series from me soon with all that was discussed (quite a bit has changed around me since my original posting of the patches).

This is the last post I will be writing on how GEN hardware interfaces with system memory, and how that relates to the i915 driver. Unlike the Rocky movie series, I will stop at the 4th. Like the Rocky movie series, I hope this is the best. Yes, I just went there.

Unlike the usual “buy me a beer if you liked this,” I would like to buy you a beer if you read it and considered giving me feedback. So if you know me, or meet me somewhere, feel free to redeem the voucher.

Image links

These are the images I’ve created. Feel free to do with them as you please.
https://bwidawsk.net/blog/wp-content/uploads/2014/07/legacy.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/mirrored.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/table_hierarchy.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/addr-bitmap.svg


  1. The patches I posted for enabling GPU mirroring piggyback off of the existing userptr interface. Before those patches were merged I added some info to the API (a flag + context) for the purpose of testing. I needed to get this working quickly, and porting from the existing userptr code was the shortest path. Since then userptr has been merged without this extra info, which makes things difficult for people trying to test things. In any case an interface needs to be agreed upon. My preference would be to do this via the existing relocation flags. One could add a new flag called "SOFT_PIN".

  2. The GEM and BO terminology is a fancy sounding wrapper for the notion that we want an interface to coherently write data which the GPU can read (input), and have the CPU observe data which the GPU has written (output).

  3. The PDP registers are not PDPEs because they do not have any of the associated flags of a PDPE. Also, note that in my patch series I submitted a patch which defines the number of these to be PDPEs. This is incorrect.

  4. I am not sure how KVM manages page tables. At least conceptually I’d think it has a similar problem to the i915 driver’s page table management. I should probably have looked a bit closer, as I may have been able to leverage that; but I didn’t have the idea until just now… looking at the KVM code, it does have a lot of similarities to the approach I took.

  5. Let me be clear that I don’t think userptr is a bad thing. It’s a very hard thing to get right, and much of the trickery needed for it is *not* needed for GPU mirroring.

Eating my own dogfood – dynamic page table allocations

I don’t know if I’ve ever eaten my own dogfood that smells this risky.

A few days ago, I published patches to support dynamic page table allocation and teardown in the i915 driver: http://lists.freedesktop.org/archives/intel-gfx/2014-March/041814.html. This work will eventually help us support expanded page tables (similar to how things work for normal Linux page tables). The patches rely on full PPGTT support, which still requires some work to get enabled by default. As a result, I’ll be carrying this work around for quite a while. The patches provide a lot of opportunity to uncover all sorts of weird bugs we’ve never seen, due to the more stressful usage of the GPU’s TLBs. To avoid the patches getting too stale, and to further the bug extermination, I figured, why not run it myself?

If you feel like some serious pain, or just want to help me debug it, give it a go – there should be absolutely no visible gain for you, only harm. You can grab the patches from the mailing list, patchwork, or my branch. Make sure to turn on full PPGTT support with i915.enable_ppgtt=2. If you do decide to opt for the pain, you can take comfort in the fact that you’re helping get the next big piece of prep work in place.

The question is, how long before I get sick of this terrible dogfood? I’m thinking by Monday I’ll be finished 😀
