True PPGTT [part 3]

This entry is part 4 of 4 in the series i915 object mapping
  • EDIT1: I forgot to include a diagram I did of the software state machine for some presentation. I long ago lost the SVG, and it got kind of messed up, but it’s there at the bottom.
  • EDIT2: (Apologies to aggregators) Grammar fixes. Fixed some bugs in a couple of the images.
  • EDIT3: (Again, apologies to aggregators) s/indirect rendering/direct rendering. I had to fix this or else the sentence made no sense.
  • EDIT4 (2017-07-13): I was under the impression we were not yet allowed to talk about preemption. But apparently we are. So feature matrix at the bottom is updated.

The Per-Process Graphics Translation Tables provide real process isolation among the various graphics processes running within an i915 based system. When in use, the combination of the PPGTT and the Hardware Context provide the equivalent of the traditional CPU process. Most of the same capabilities can be provided, and most of the same limitations come with it. True PPGTT encompasses all of the functionality currently merged into the i915 kernel driver that supports page tables and address spaces. It’s called “true” because the Aliasing PPGTT was introduced first and was often simply called “PPGTT.”

The True PPGTT patches represent one of the more challenging aspects of working on a project like the Linux kernel. The feature couldn’t realistically be enabled in isolation from the existing driver. When regressions occur, it’s likely that the user gets no display. To say we get chided on occasion would be an understatement. Ipso facto, this feature is not enabled by default. There are quite a few patches on the mailing list that build new functionality on top of this support, and that help stabilize the existing support. If you wish to try enabling the real PPGTT, simply use the i915 module parameter: enable_ppgtt=2. I highly recommend that the stability patches be used, unless you’re reading this in some future where the stability problems are fixed upstream.
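For example (a minimal usage sketch; in the driver as of this writing, 0 disables PPGTT, 1 selects the aliasing PPGTT, and 2 selects full PPGTT):

    # On the kernel command line:
    i915.enable_ppgtt=2

    # Or when loading the module:
    modprobe i915 enable_ppgtt=2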

Unlike the previous posts, where I tried to emphasize the hardware architecture for this feature, the following will go into almost no detail about how the hardware works. There won’t be PRM references or hardware state machines. All of those mechanics have been described in part 1 and part 2.

A Brief History of the i915 Graphics Process

There have been three stages of the definition of a graphics process within the i915 driver. I believe that by explaining the stages, one can get a better appreciation for the capabilities. In the following pictures there is meant to be a highlighted region (yellow in the first two; yellow, orange, and blue in the last) that denotes the scope of a GPU context/process with the specified feature. Incrementally, the definition of a process begins to bleed between the CPU and the GPU.

Unfortunately I have some overlap with my earlier post about Hardware Contexts. I found no good way to write this post without doing so. If you read that post, consider this a refresher.

File Descriptors

Initially, all GPU state was shared by every GPU client. The only partition was done via the operating system. Every process that does direct rendering gets a file descriptor for the device. The file descriptor is the thing through which commands are submitted. This could be used by the i915 driver to help disambiguate “who” was doing “what.” This permitted the i915 kernel driver to prevent one GPU client from directly referencing the buffers owned by a different GPU client. By making the buffer object handles per file descriptor (this is very easy to implement; it’s just an idr in the kernel), there exists no mechanism to reference buffer handles from a different file descriptor. For applications which do not require their context to be saved, for non-buggy apps, and for non-malicious apps, this separation is still perfectly sufficient. As an example, BO handle #1 for the X server is not the same as BO handle #1 for xonotic, since each has a different file descriptor1. Even though we had this partition at the software level, nothing was enforced by the hardware. Provided a GPU client could guess where another buffer resided, it could easily operate on that buffer. Similarly, a GPU client could not expect the GPU state it had set previously to be preserved for any amount of time.

File descriptor isolation. Before hardware contexts.
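To make the per-file-descriptor handles concrete, here is a minimal sketch in the spirit of the GEM code; the structure and function names are illustrative, not the exact i915 ones:

    #include <linux/gfp.h>
    #include <linux/idr.h>

    struct drm_i915_gem_object;     /* the BO; its contents don't matter here */

    /* Each open() of the device gets its own idr, which is why BO handle #1
     * in one fd is unrelated to BO handle #1 in another fd. */
    struct example_file_private {
            struct idr object_idr;
    };

    /* Returns a small integer handle that is only meaningful within this
     * fd's idr. */
    static int example_create_handle(struct example_file_private *fpriv,
                                     struct drm_i915_gem_object *obj)
    {
            return idr_alloc(&fpriv->object_idr, obj, 1, 0, GFP_KERNEL);
    }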

Hardware Contexts

The next step towards isolation was the Hardware Context2. The hardware contexts built upon the isolation provided by the original file descriptor mechanism. The hardware context was an opt-in interface, which meant that those not wishing to use the interface received the old behavior: they could purposefully or accidentally use the state from another GPU client3. There was quite a bit of discussion around this at the time the patches were in review, and there’s not really any point in lamenting about how it could have been better now.

The context exists within the domain of the process/file descriptor in the same way that a BO exists in that domain. Contexts cannot be shared [intentionally]. The interface created was, and remains, extremely simple.
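Here are the two payloads, as defined in include/uapi/drm/i915_drm.h:

    struct drm_i915_gem_context_create {
            /* output: id of new context */
            __u32 ctx_id;
            __u32 pad;
    };

    struct drm_i915_gem_context_destroy {
            __u32 ctx_id;
            __u32 pad;
    };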

As you can see from the two IOCTL payloads above, I wasn’t lying about the simplicity. Because there was not a great deal of variable functionality, there just wasn’t a lot to add in terms of the interface. Destroy is an optional call because we have the file descriptor and can clean up if a process does not. The primary motivation for destroy() is simply to allow very meticulous and memory conscious GPU clients to keep things tidy. Earlier I had a list of 3 types of GPU clients that could survive without this separation. Considering their inverse, hardware contexts take the first one off the list:

  • GPU clients that need their HW context preserved (no longer an issue once contexts exist)
  • Buggy applications writing to random memory
  • Malicious applications

The block diagram is quite similar to the diagram above, with the exception that now there are discrete blocks for the persistent state. I was a bit lazy with the separation in this drawing. Hopefully, you get the idea.

Hardware context isolation

Full PPGTT

The last piece was to provide a discrete virtual address space for each GPU client. For completeness, I will provide the diagram, but by now you should already know what to expect.

PPGTT, full isolation

If I wrote about this picture here, there would be no point in continuing with an organized blog post :-). So I’ll continue to explain this topic below. Take my word for it that this addresses the other two types of GPU clients:

  • GPU clients that need their HW context preserved (addressed by hardware contexts)
  • Buggy applications writing to random memory (addressed by the PPGTT)
  • Malicious applications (addressed by the PPGTT)

Since the GGTT isn’t really mentioned much in this post, I’d like to point out that the GGTT still exists, as you can see in this diagram. It is required for several components that were listed in my previous blog post.

VMAs and Address Spaces (AKA VMs)

The patch series which began to implement PPGTT was actually a separate series: the one that introduced the Virtual Memory Area for the PPGTT, simply referred to as a VMA4. You can think of a VMA in a very similar way to a GEM BO. It is an identifiable, contiguous range within an address space; conceptually there isn’t much difference between it and a GEM BO. To try to define it in my horrible math jargon: a logical grouping of virtual addresses representing an operand for some GPU operation within a given PPGTT domain. A VMA is uniquely identified via the tuple (BO, address space). In the likely case that I made no sense just there: a VMA is just another handle on a chunk of GPU memory used for rendering.
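As a sketch of that uniqueness, the (BO, address space) to VMA lookup is just a walk over the BO’s list of VMAs. The code below is modeled on the driver’s i915_gem_obj_to_vma() helper, but treat it as illustrative rather than verbatim:

    /* Find the VMA mapping 'obj' into 'vm', if one exists. There is at
     * most one VMA per (BO, address space) tuple. */
    static struct i915_vma *
    example_obj_to_vma(struct drm_i915_gem_object *obj,
                       struct i915_address_space *vm)
    {
            struct i915_vma *vma;

            list_for_each_entry(vma, &obj->vma_list, vma_link)
                    if (vma->vm == vm)
                            return vma;

            return NULL;
    }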

Sharing VMAs

You can’t (see the note at the bottom). There’s not a whole lot I can say without doing another post about DMA-Buf and/or Flink. Perhaps someday I will, but for now I’ll keep things general and brief.

It is impossible to share a VMA. To repeat: a VMA is uniquely identified by the address space and a BO. It remains possible to share a BO. An address space exists for an individual GPU client’s process. Therefore it makes no sense to share a VMA, since the address space cannot be shared5. As a result of using the existing sharing interfaces, a shared BO ends up with multiple VMAs referencing it, one per address space. Trying to go back to the math jargon again:

  1. VMA: (BO, Address Space) // Some BO mapped by the address space.
  2. VMA′: (BO′, Address Space) // Another BO mapped into the address space
  3. VMA″: (BO, Address Space′) // The same BO as 1, mapped into a different address space.
VMA : PPGTT :: BO : GGTT (a VMA is to a PPGTT what a BO’s mapping is to the GGTT).
A single BO may be referenced by N VMAs, one per address space that maps it, with N ∈ {1, 2, 3, …}.

In case it’s still unclear, I’ll use an example (which is kind of a simplified/false demonstration). The scanout buffer is the thing which is displayed on the screen. When doing frontbuffer rendering, one renders directly to that buffer. If we remember my previous post, the Display Engine requires a GGTT mapping; therefore we know we have VMA_global. Jumping ahead, a GPU client cannot have a global mapping; therefore, to render to the frontbuffer, it too has a VMA, VMA_pp. There you have two VMAs pointing to the same Buffer Object.

NOTE: You can actually share VMAs if you are already sharing a Context/PPGTT. I can’t think of any real world examples off of the top of my head, but it is possible, and potentially a useful thing to do.

Data Structures

Here are the relevant data structures cropped for the sake of brevity.
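These are cropped from the i915 headers of roughly this era, so take the exact fields as approximate:

    struct i915_address_space {
            struct drm_mm mm;               /* allocator for the VA range */
            struct drm_device *dev;
            struct list_head global_link;
            unsigned long start;            /* start offset */
            size_t total;                   /* size the address space maps */
            ...
    };

    struct i915_hw_ppgtt {
            struct i915_address_space base; /* a PPGTT is an address space */
            ...
    };

    struct i915_vma {
            struct drm_mm_node node;         /* the used range of address space */
            struct drm_i915_gem_object *obj; /* the BO being mapped */
            struct i915_address_space *vm;   /* the address space mapping it */
            ...
    };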

The struct i915_hw_ppgtt is a subclass of struct i915_address_space. Only two implementors of i915_address_space exist: the i915_hw_ppgtt (a PPGTT), and the i915_gtt (the GGTT). It might make some sense to create a new PPGTT subclass for GEN8+, but I’ve opted not to do this. I feel there is too much duplication for not enough benefit.

I’ve already explained in different words that a VMA is a range of used address space. If the address space has the drm_mm, then it should make direct sense that the VMA has the drm_mm_node, because that is the used part of the address space6. In the i915_vma struct above there is a pointer to the address space in which the VMA exists, and to the object the VMA is referencing. Together these provide the tuple that defines the VMA.

An example address space layout:

HOLE   0x0     -> 0x64000
VMA 1  0x64000 -> 0x69000
HOLE   0x69000 -> 512M
VMA 2  512M    -> 512.004M
HOLE   ~512M   -> 2GB

Allocated space: 0x6000. Free space: 0x7fffa000.
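To tie the drm_mm back to the picture above, here is a hedged sketch of how an address space might reserve a node for a new VMA; the drm_mm insertion API has changed signatures across kernel versions, so treat this one as approximate:

    #include <drm/drm_mm.h>

    /* Carve a VMA-sized node out of an address space's allocator. On
     * success, node->start holds the GPU virtual address of the new VMA
     * (e.g. 0x64000 for VMA 1 in the map above). */
    static int example_reserve_vma_node(struct drm_mm *mm,
                                        struct drm_mm_node *node, u64 size)
    {
            /* Search the free holes for 'size' bytes and claim them. */
            return drm_mm_insert_node(mm, node, size);
    }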

Relation to the Hardware Context

With the 3 elements discussed a few times already (file descriptor, context, and PPGTT), we get real GPU process isolation. Since the context was historically an opt-in interface, changes needed to be made in order to keep the opt-in behavior yet provide isolation behind the scenes regardless of what the GPU client tried to do. If this were not done, innocent GPU clients could feel the wrath. Since the file descriptor was already intimately connected with the direct rendering process (one cannot render without getting a file descriptor), it made sense to hook off of that to create the contexts and PPGTTs.

Implicit Context (“private default context”)

From here on out, we can consider a “context” as the combination of the 3 elements: fd, HW context, and PPGTT. In the driver as it exists today, if a GPU client does not provide a context for rendering, it cannot rely on GPU state being preserved. A context is created for GPU clients that do not provide one, but the state of this context should be considered completely opaque to all GPU clients. I’ve called this the Private Default Context, as it very much resembles the default context that exists for the whole system (again, let me point you to the previous blog post on contexts). The driver will isolate the various contexts within the system from implicit contexts, and vice versa. Hardware state is undefined while using the private default context; it is only maintained from one render operation to the next when a context created through the IOCTL is used.

The behavior of the implicit context does result in waste when userspace uses contexts (as mesa/libgl does). There are a few solutions to this problem, and I’ve submitted patches for all of them (I can count 3 off the top of my head). Perhaps one day in the not too distant future, the above section will be false and we can just say: every process gets a context when it opens the DRI file. If it wants more contexts, it can use the IOCTL.

Multi Context

A GPU client can create more than one context. The context to use for a given rendering command is specified through the execbuffer2 API (note that KMS is not context savvy).

A process may wish to create several GL contexts. The API allows this, and for reasons I don’t understand, it’s something some applications wish to do. If there were no mechanism to create new contexts, userspace would be forced to open a new file descriptor for each GL context, or else it would not reap the benefits of everything we’ve discussed for a GL context.
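Here is a hedged sketch of what multi-context use looks like from userspace; error handling is omitted, and the helper names are mine:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* Create an additional context on an already-open DRI file descriptor. */
    static uint32_t example_create_context(int fd)
    {
            struct drm_i915_gem_context_create create = { 0 };

            ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &create);
            return create.ctx_id;
    }

    /* Select which context a batch runs in; the handle travels in the
     * rsvd1 field of the execbuffer2 payload. */
    static void example_set_context(struct drm_i915_gem_execbuffer2 *eb,
                                    uint32_t ctx_id)
    {
            i915_execbuffer2_set_context_id(*eb, ctx_id);
    }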

The Big Picture – literally

Overview

Context:PPGTT

One of the more contentious topics in the very early stages of development was the relationship and connection of a PPGTT and a HW context.

Quoting myself from one of my earlier public declarations, here:

My idea was to embed the PPGTT within the context structure, so that creating a context always resulted in a new PPGTT; creating a PPGTT by itself would have been impossible. This is not what we ended up doing. The implementation allows multiple hardware contexts to share a PPGTT. I’m still unclear exactly what is needed to support share groups within OpenGL, but it has been speculated that sharing an address space is a requirement for them. Fundamentally, this would allow a client to create multiple GPU contexts that share an address space (it resembles what you’d get back when there were only HW contexts). The execbuffer2 IOCTL allows one to specify the context. Behaviorally, however, my proposal matches what is in use currently. I think it’s a bit easier to think of things this way, too.

Current Mesa / Current DDX / 2 hypothetical scenarios

Conclusion

Please feel free to send me issues or questions.
Oh yeah. Here is a state machine that I did for a presentation on this. Things got rendered weird, and I lost the original SVG file, but perhaps it will be of some value to someone.

State Machine

TODO

As I alluded to earlier, there is still some work left to do in order to get this feature turned on by default. I gave the links to some patches and the parameter to make it happen. If you feel motivated to help get this stuff moving forward: test it, report bugs, try to fix stuff, and don’t yell at me when things break :-).

Summary

That’s most of it. I’d like to give the 10 second summary.

  1. i915_vma, i915_hw_ppgtt, i915_address_space: important things.
  2. The GPU has a virtual address space per DRI file descriptor.
  3. There is a connection between the PPGTT, and a Hardware Context.
  4. VMAs are backed by BOs which are backed by physical pages.
  5. GPU clients have some flexibility with how they interact with contexts, and therefore the PPGTT.

And finally, since I compared our now well defined notion of a GPU process to the traditional CPU process, I wanted to create a quick list of what I think are some interesting data points regarding the capabilities of the processors.

Thing                 | Modern X86 CPU | Modern i915 GPU
----------------------|----------------|-----------------------
Phys Address Limit    | 48b?           | ~40b
Process Isolation     | Yes            | Yes (with True PPGTT)
Virtual Address Space | Yes            | Yes
64b VA Space          | Yes            | GEN8+ 48b only
PTE Access Controls   | Yes            | No
Page Fault Handling   | Yes            | No
Preemption7           | Yes            | *With execlists

So while True PPGTT brings the GPU closer to having all of the [what I consider to be] interesting features of a modern x86 CPU – it still has a ways to go. I would be surprised if things didn’t continue going in this direction.

SVG Links

As usual, please feel free to do something useful with the images I’ve created. Also as usual, they are really poorly named.
https://bwidawsk.net/blog/wp-content/uploads/2014/07/pre-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/post-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/post-ppgtt.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/vma-bo-page.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/vma.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/ppgtt-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/multi-context.svg


  1. It’s technically possible to make them the same BO through the two buffer sharing mechanisms. 

  2. Around the same time Hardware Contexts were introduced, so was the Aliasing PPGTT. The Aliasing PPGTT was interesting; however, it does not contribute to any part of the GPU “process.” 

  3. Hardware contexts use a mechanism which inhibits the restoration of state when a client has not opted in. This means that if one GPU client opts in and another does not, the client without contexts can reuse the state of the client with contexts. As the address space is still shared, this is actually a really dangerous thing to allow. 

  4. I would have preferred that the reservation of a space within the address space be called a “GVMA,” but that was shot down during review. 

  5. There’s a whole section above describing how this statement could be false. For now, let’s pretend address spaces can’t be shared. 

  6. For those unfamiliar with the Direct Rendering Manager memory manager: a drm_mm is the structure for the memory manager provided by the DRM midlayer. It does all the things you’d expect of a memory manager, like find free nodes, allocate nodes, and free up nodes. A drm_mm_node is a structure representing an allocation from the memory manager. The PPGTT code relies entirely on the drm_mm and the DRM helper functions in order to actually do the address space allocations and frees. 

  7. I am defining the word preemption as the ability to switch, at an arbitrary point in time, between contexts. On the CPU this is easily accomplished. The GPU running the i915 driver as of today has no way to do this. Once a batch is running, it cannot be interrupted except for RC6. 

4 thoughts on “True PPGTT [part 3]”

  1. Very interesting! Regarding Footnote 3, is this a security problem in practice, or will hardware contexts only be used together with PPGTT? Or will the verification code in the kernel make sure that the old context is properly overwritten by the client that has not opted in?

    1. That’s an excellent observation. It’s not a problem in practice: I know of no way to exploit this information “leak,” though proving that no such way exists would be hard. There is no software mechanism to assert any sort of correctness in this regard. Perhaps you understood, but for posterity let me explain the difference hardware contexts make regarding security…

      Without hardware contexts, GPU clients’ state does bleed into one another. In that sense, there is no difference with or without contexts. However, it’s likely impossible to determine the previous client’s entire state. Enabling hardware contexts [without PPGTT] implies that a globally accessible chunk of GPU memory contains the other GPU client’s last state. This happens on context save, when the GPU writes the state out to memory. Getting to this memory is not easy, but it’s easier than trying to determine the state without it.

      As a side note, hardware contexts have been used without PPGTT (if you’re referring to more than aliasing PPGTT) for a very long time. Most likely, Sandybridge (and ILK, if those damn patches had been merged) will never support PPGTT, but will support hardware contexts, so even in the future this situation will exist.

      Thanks for the question.

      1. Thanks, I think I get it now!

        The reason for my confusion was probably that I mistook the buffer relocation mechanism for a security barrier; I thought that being able to reuse a different client’s leftover state allowed clients to do things they wouldn’t be able to do with a “clean” state.

        Just to make sure I understand it correctly: With or without contexts, it has always been possible for DRI clients to guess the memory location of a different client’s buffer and operate on it (e.g. read its content); this is something that will only be fixed by using PPGTT?

        Is this something that could theoretically be prevented in the component of the driver that verifies commands submitted to the hardware by clients, or would that be too resource intensive? Or could such malicious memory accesses even originate from GPU programs, which can’t be statically proven to be safe?

        1. Sorry about the delayed response. It was a little more thought intensive than I expected.

          > With or without contexts, it has always been possible for DRI clients to guess the memory location of a different client’s buffer and operate on it (e.g. read its content); this is something that will only be fixed by using PPGTT?
          That is correct. Only the real PPGTT will prevent this. Given that we use a single scratch page for all unallocated virtual address space, it would even be relatively easy to identify memory as owned by another client.

          > Is this something that could theoretically be prevented in the component of the driver that verifies commands submitted to the hardware by clients, or would that be too resource intensive? Or could such malicious memory accesses even originate from GPU programs, which can’t be statically proven to be safe?

          You asked 3 questions with only two question marks :-). This is a rather long answer which could probably serve as a post of its own since I have no code or diagrams to explain things. Anyway…
          1. Is it feasible?
          The short and easy answer is, it’s way too resource intensive to do this.

          2. Can the command validator catch this?
          If one were using the interfaces properly, it would not be possible. Every time you refer to a Buffer Object, you emit a relocation which is implicitly isolated from all other objects, and therefore from other clients’ objects. However, it’s quite easy to issue a blit command using a hard-coded source address and not issue a relocation for it. Scanning for this kind of situation is probably feasible and not entirely detrimental to performance. We already have a command validator, so some of the framework is even there. This would likely require forcing clients to always emit relocations, and not take short-cuts such as emitting the presumed offset of the buffer so that the kernel only needs to relocate if the buffer moved. This step would be required if we even wanted to attempt the next part…

          3. But what about the shaders?
          You pick up on the real challenge, however, and that’s figuring out what the shader programs are doing as well. There are hardware features which we haven’t yet implemented that make the static determination impossible. When we enable those, you could of course disallow their use for the sake of security.

          The way in which the EU accesses system memory [again foregoing the not-yet-implemented features] is through something called a binding table. This table is populated by userspace, and gives the locations of the surfaces used by shaders. Malicious software can attempt to populate the table with surfaces that it doesn’t actually own. The kernel can potentially find this table by parsing the command stream; however, the size of the table isn’t known, and so there is no 100% foolproof way to determine that all the entries point to memory owned by this client. If there were some way to actually parse the binding table, you could try to shut down any programs that accessed memory that doesn’t belong to them. In theory, you could force userspace to always upload all entries (with magic number entries for unused ones) so that the kernel can make this determination, but that requires breaking compatibility with older, or un-ported, userspace.

          To keep compatibility, let’s think about what’s required to even accomplish this… At the kernel level, using the validator, one can potentially find where the shaders are actually located. This is given in commands like 3DSTATE_PS for the fragment shader. Once found, the machine code for the EU can be obtained. You must disassemble the machine code, analyze all the instructions, and determine whether the accesses will be invalid. The disassembling is certainly within reason, given the ISA of the EU and that we already have code to do this in both mesa and intel-gpu-tools. Determining exactly which binding table entries are used is very hard [IMO], but if you can do it, go back to the binding table, look them up, and then see if those entries are within the valid range “owned” by this client. AFAICT, with the aforementioned constraints and a bunch of code, you can make the static determination.

          Summarizing.
          1. Not feasible
          2. Probably can at least add the simple check in the command validator
          3. Checking shaders without breaking compatibility is hard, and a lot of work.

          Thanks again for the questions. I’m glad someone is getting something out of this 🙂
