The Global GTT [Part 1]


Global Graphics Translation Tables

Here are the basics of how the GEN GPU interacts with memory. This post focuses on the lowest levels of the i915 driver and the hardware interaction. My hope is that by going through this in excruciating detail, I can take more liberties in future posts.

What are the Global Graphics Translation Tables?

The graphics translation tables provide the address mapping from the GPU’s virtual address space to a physical address[1]. The GTT is somewhat of a relic of the AGP days (the GART), with the distinction that the GTT, as it pertains to Intel GEN GPUs, has logic contained within the GPU and does not act as a platform IOMMU. I believe (and Wikipedia seems to agree) that GTT and GART were used interchangeably in the AGP days.

GGTT architecture

Each element within the GTT is an entry, and the initialism for each entry is “PTE,” or page table entry. Much of the required initialization is handled by the boot firmware. The i915 driver gets any required information from the initialization process via PCI config space or MMIO.

[Figure: Intel/GEN UMA system, example illustrating Intel/GEN memory organization]

Location

The table is located within system memory, and is allocated for us by the BIOS or boot firmware. To clarify the docs a bit: GSM is the portion of stolen memory reserved for the GTT, and DSM is the rest of stolen memory, used for miscellaneous things. DSM is what the current i915 code refers to as “stolen memory.” In theory we could get the location of the GTT from the MMIO register MPGFXTRK_CR_MBGSM_0_2_0_GTTMMADR (0x108100, bits 31:20), but we do not do that. The register space and the GTT entries are both accessible within BAR0 (GTTMMADR).

All the information can be found in Volume 12, p.129: UNCORE_CR_GTTMMADR_0_2_0_PCI. Quoting directly from the HSW spec, “The range requires 4 MB combined for MMIO and Global GTT aperture, with 2MB of that used by MMIO and 2MB used by GTT. GTTADR will begin at GTTMMADR + 2 MB while the MMIO base address will be the same as GTTMMADR.”

In the code below you can see that we take the address in the PCI BAR and add half the BAR’s length to the base. For all modern GENs, this is how things are split in the BAR.
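Here is a minimal sketch of that split, loosely modeled on the i915’s gen6/gen7 probe path (variable names such as pdev and gtt_total_entries are placeholders, and error handling is trimmed):

    /* BAR0 (GTTMMADR): the lower half is MMIO register space, the
     * upper half holds the Global GTT entries, so the PTEs start at
     * base + len/2. */
    phys_addr_t gtt_phys_addr;
    void __iomem *gsm;

    gtt_phys_addr = pci_resource_start(pdev, 0) +
                    (pci_resource_len(pdev, 0) / 2);

    /* Map the PTEs write-combined; sequential PTE updates are common. */
    gsm = ioremap_wc(gtt_phys_addr, gtt_total_entries * sizeof(u32));
    if (!gsm)
        return -ENOMEM;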

One important thing to notice above is that the PTEs are mapped in a write-combined fashion. Write combining makes sequential updates (something which is very common when mapping objects) significantly faster. Also, the observant reader might ask, “why go through the BAR to update the PTEs if we have the actual physical memory location?” This is the only way we have to make sure the GPU’s TLBs get synchronized properly on PTE updates. If this weren’t required, a nice optimization might be to update all the entries at once with the CPU, and then go tell the GPU to invalidate them.

Size

Size is a bit more straightforward. We just read the relevant PCI offset. In the docs: p.151, GSA_CR_MGGC0_0_2_0_PCI, offset 0x50, bits 9:8.

And the code is even more straightforward.
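A minimal sketch of that read (kernel-style; the register offset and bit field come straight from the doc reference above, the rest is approximate):

    /* GMCH graphics control: PCI config offset 0x50, bits 9:8 encode
     * the amount of memory set aside for GTT entries, in MB. */
    u16 gmch_ctrl;
    unsigned int gtt_size;

    pci_read_config_word(pdev, 0x50, &gmch_ctrl);
    gtt_size = ((gmch_ctrl >> 8) & 0x3) << 20;   /* size in bytes */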

Layout

The PTE layout is defined by the PRM; as an example, it can be found on page 35 of HSW – Volume 5: Memory Views. For convenience, I have reconstructed the important part here:

Bits    Field
31:12   Physical Page Address 31:12
11      Cacheability Control[3]
10:04   Physical Page Address 38:32[2]
03:01   Cacheability Control[2:0]
0       Valid

The valid bit is always set for all GGTT PTEs. The programming notes tell us to do this (also on page 35 of HSW – Volume 5: Memory Views)[3].

Putting it together

As a result of what we’ve just learned, we can make up a function to write the PTEs:
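A sketch under the assumptions above (the write-combined gsm mapping from the Location section, a 4-bit cacheability value, and no error handling; this is illustrative, not the driver’s actual helper):

    /* Encode and write a single GGTT PTE following the layout above. */
    static void ggtt_set_pte(void __iomem *gsm, unsigned int index,
                             u64 phys_addr, u32 cache_ctl)
    {
        u32 pte = 0;

        pte |= phys_addr & 0xfffff000;            /* address bits 31:12 */
        pte |= ((phys_addr >> 32) & 0x7f) << 4;   /* address bits 38:32 */
        pte |= (cache_ctl & 0x8) << 8;            /* cacheability control[3] -> bit 11 */
        pte |= (cache_ctl & 0x7) << 1;            /* cacheability control[2:0] -> bits 3:1 */
        pte |= 1;                                 /* valid: always set for GGTT entries */

        iowrite32(pte, (u32 __iomem *)gsm + index);
    }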

Example

Let’s analyze a real HSW machine running something. We can do this with the intel_gtt tool in the intel-gpu-tools suite, passing it the -d option[4].

And just to continue beating the dead horse, let’s break out the first PTE:

Bits    Field                          Value
31:12   Physical Page Address 31:12    0xee23000
11      Cacheability Control[3]        0
10:04   Physical Page Address 38:32    0x2
03:01   Cacheability Control[2:0]      0x2
0       Valid                          1

Physical address: 0x20ee23000 (the upper address bits 38:32 are 0x2, so the full address is (0x2 << 32) | 0xee23000)
Cache type: 0x2 (WB in LLC Only – Aged "3")
Valid: yes

Definition of a GEM BO

We refer to virtually contiguous locations which are mapped to specific graphics operands as objects, buffer objects, BOs, or GEM BOs.

In the i915 driver, the verb “bind” is used to describe the action of making a GPU virtual address range point to the valid backing pages of a buffer object[5]. The driver also reuses the verb “pin” from the Linux mm to mean: prevent the object from being unbound.

[Figure: Example of a “bound” GPU buffer]

Scratch Page

We’ve already talked about the scratch page twice, albeit briefly. There was an indirect mention, and of course in the image directly above. The scratch page is a single page allocated from memory which every unused GGTT PTE will point to.

To the best of my knowledge, the docs have never given a concrete explanation for the necessity of this; however, one might assume unintended behavior should the GPU take a page fault. One would be right to interject at this point that, by the very nature of DRI drivers, userspace can almost certainly find a way to hang the GPU, so why should we bother to protect them against this particular issue? Given that the GPU has undefined (read: not part of the behavioral specification) prefetching behavior, we cannot guarantee that even a well-behaved userspace won’t invoke page faults[6]. Correction: after writing this, I went and looked at the docs. They do explain exactly which engines can, and cannot, take faults. The “why” seems to be missing, however.

Mappings and the aperture

The Aperture

First we need to take a bit of a diversion away from GEN graphics (which to repeat myself, are all of the shared memory type). If one thinks of traditional discrete graphics devices, there is always embedded GPU memory. This poses somewhat of an issue given that all end user applications require the CPU to run. The CPU still dispatches work to the GPU, and for cases like games, the event loop still runs on the CPU. As a result, the CPU needs to be able to both read, and write to memory that the GPU will operate on. There are two common solutions to this problem.
  • DMA engine
    • Setup overhead.
      • Need to deal with asynchronous (and possibly out of order) completion. Latencies involved with both setup and completion notification.
      • Need to actually program the interface via MMIO, or send a command to the GPU[7]
    • Unlikely to re-arrange or process memory
      • tile/detile surfaces[8].
      • can’t take page faults, pages must be pinned
    • No size restrictions (I guess that’s implementation specific)
    • Completely asynchronous – the CPU is free to do whatever else needs doing.
  • Aperture
    • Synchronous. Not only is it slow, but the CPU has to hand hold the data transfer.
    • Size limited/limited resource. There is really no excuse with PCIe and modern 64b platforms why the aperture can’t be as large as needed, but for Intel at least, someone must be making some excuses, because 512MB is as large as it gets for now.
    • Can swizzle as needed (for various tiling formats).
    • Simple usage model. Particularly for unified memory systems.
[Figure: Moving data via the aperture]
[Figure: Moving data via DMA]

The Intel GEN GPUs have no local memory[9]. However, on unified memory systems DMA has very similar properties to writing the backing pages directly. The aperture is still used for accesses to tiled memory, and for systems without LLC. LLC is out of scope for this post.

GTT and MMAP

There are two distinct interfaces to map an object for reading or writing, and there are lots of caveats to using them. My point isn’t to explain how to use them (libdrm is a better way to learn that anyway); rather, I want to clear up something which confused me early on.

The first is very straightforward, and has behavior I would have expected.
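That interface is the drm_i915_gem_mmap ioctl. Its argument struct, roughly as it appeared in the uapi header of the era (later kernels added a flags field), looks like this:

    struct drm_i915_gem_mmap {
        __u32 handle;   /* GEM handle of the object to map */
        __u32 pad;
        __u64 offset;   /* offset into the object at which to start the mapping */
        __u64 size;     /* length of data to map */
        __u64 addr_ptr; /* out: CPU pointer the data was mapped at */
    };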

I might be projecting my ineptitude onto the reader, but it’s the second interface which caused me a lot of confusion, and the one I’ll talk briefly about. The interface itself is even simpler and smaller:
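From the same uapi header (again, approximately):

    struct drm_i915_gem_mmap_gtt {
        __u32 handle;   /* GEM handle of the object to map */
        __u32 pad;
        __u64 offset;   /* out: fake offset to pass to mmap() on the DRM fd */
    };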

Why do I think this is confusing? The name itself never quite made sense – what use is there in mapping an object to the GTT? Furthermore, how does mapping it to the GPU allow me to do anything with it from userspace? For one thing, I had confused “mmap” with “map.” The former really does identify the recipient of the mapping (the CPU, not the GPU); it follows the conventional use of mmap(). The other thing is that the interface has an implicit meaning: a GTT map here actually means a GTT mapping within the aperture space. Recall that the aperture is a subset of the GTT which can be accessed through a PCI BAR. Therefore, what this interface actually does is return a token to userspace which can be mmap’d to get a CPU mapping (through the BAR, to the GPU memory). Like I said before, there are a lot of caveats to the decision to use one vs. the other, which depend on the platform, the type of surface you are operating on, and the available aperture space at the time of the call. All of these things will not be discussed.

[Figure: Conceptualized view of mmap and mmap_gtt]

Finally, here is a snippet of code from intel-gpu-tools that hopefully just encapsulates what I said and drew.
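Something like the following, modeled loosely on the ioctl wrappers in intel-gpu-tools (simplified, with most error handling dropped; the function names follow the tool’s convention and are not part of any API):

    #include <stdint.h>
    #include <sys/mman.h>
    #include <xf86drm.h>
    #include <i915_drm.h>

    /* Regular mmap: the kernel hands back a CPU pointer to the object's
     * backing pages. */
    static void *gem_mmap__cpu(int fd, uint32_t handle, uint64_t size)
    {
        struct drm_i915_gem_mmap arg = {
            .handle = handle,
            .size = size,
        };

        if (drmIoctl(fd, DRM_IOCTL_I915_GEM_MMAP, &arg))
            return NULL;

        return (void *)(uintptr_t)arg.addr_ptr;
    }

    /* GTT mmap: the ioctl only returns a fake offset; the CPU-visible
     * mapping (through the aperture BAR) comes from mmap()ing the DRM
     * fd at that offset. */
    static void *gem_mmap__gtt(int fd, uint32_t handle, uint64_t size, int prot)
    {
        struct drm_i915_gem_mmap_gtt arg = { .handle = handle };
        void *ptr;

        if (drmIoctl(fd, DRM_IOCTL_I915_GEM_MMAP_GTT, &arg))
            return NULL;

        ptr = mmap(NULL, size, prot, MAP_SHARED, fd, arg.offset);
        return ptr == MAP_FAILED ? NULL : ptr;
    }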

Summary

This is how modern Intel GPUs deal with system memory on all platforms without a PPGTT (or with PPGTT disabled via module parameter). Although I happily skipped over the parts about tiling, fences, and cache coherency, rest assured that if you understood all of this post, you have a good footing. Going over the HSW docs again for this post, I am really pleased with how much Intel has improved their organization and clarity. I highly encourage you to go off and read them for any missing pieces.

Please let me know about any bugs, or feature requests in this post. I would be happy to add them as time allows.

Here are links to SVGs of all the images I created. Feel free to use them how you please.
https://bwidawsk.net/blog/wp-content/uploads/2014/06/overview_standard.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/bo_mapped.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/dma_example.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/aper_example.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/mmaps.svg


  1. When using VT-d, the address is actually an I/O address rather than a physical address.

  2. Previous gens went to 39 

  3. I have submitted two patch series, one of which has been reverted, the other, never merged, which allow invalid PTEs for debug purposes 

  4. intel_gtt is currently not supported for GEN8+. If someone wants to volunteer to update this tool for gen8, please let me know 

  5. I’ve fought to call this operation, “map” 

  6. Empirically (for me), GEN7+ GPUs have behaved themselves quite well after taking the page fault. I very much believe we should be using this feature as much as possible to help userspace driver developers 

  7. I’ve previously written a post on how this works for Intel 

  8. Sorry people, this one is too far out of scope for an explanation in this post. Just trust that it’s a limitation if you don’t understand. Daniel Vetter probably wrote an article about it if you feel like heading over to his blog.

  9. There are several distinct caches on all modern GEN GPUs, as well as eDRAM for Intel’s Iris Pro. The combined amount of this “local” memory is actually greater than many earlier discrete GPUs 

8 thoughts on “The Global GTT [Part 1]”

  1. Hello,

    When talking about the “stolen memory,” there is nothing about its size. The size is configured through the BIOS, but is there an optimal size? What happens if the user reserves too little or too much memory?
    Do you have some links to documentation ?

    Best Regards !

    1. How the stolen memory is allocated should be part of a BIOS writer’s guide. Your favorite search engine turns up a few hits, but I am too lazy to see if they have one for modern platforms with graphics. Essentially, the boot firmware picks some number, shoves it in the memory map, and hands off to the OS. The stolen memory size is sometimes configurable from the BIOS menu, sometimes not. Coreboot from the Chrome OS project would have some implementation of this. I’d encourage you to find what they do.

      Too little:
      Today the only consumer that requires stolen memory is the Frame Buffer Compressor (FBC). If you have too little, FBC will not be able to work. This is a power conservation feature. The amount you need for this is roughly the size of your display, plus some fudge for limits due to tiling.

      Too much:
      We do try to reuse stolen memory as much as possible in recent kernel versions (this is relatively quite new). Assuming we aren’t able to reuse it all, then you just waste system memory. It will not have a direct negative impact on your Intel graphics hardware. If we can reuse it all, then you should see no difference over having the exact right amount.

  2. Hi,
    This post is great and very useful for general understanding.
    However, you said the following:
    “This is the only way we have to make sure the GPUs TLBs get synchronized properly on PTE updates.”

    This seems to somewhat contradict Daniel Vetter’s post [1]:
    “Note though that this only invalidates TLBs for cpu access”

    What have I missed?

    [1] Ref: http://blog.ffwll.ch/2012/10/i915gem-crashcourse.html

    – Ofir

    1. Daniel is certainly right in the part of his blog that says we need a special register to flush SA TLBs. I completely forgot about this when I wrote the blog post.
      (The register: https://01.org/linuxgraphics/sites/default/files/documentation/intel-gfx-prm-osrc-hsw-pcie-config-registers.pdf#page=73)

      However, quoting https://01.org/linuxgraphics/sites/default/files/documentation/intel-gfx-prm-osrc-hsw-pcie-config-registers.pdf#page=129 :
      “The device snoops writes to this region in order to invalidate any cached translations within the various TLB’s implemented on chip.”

      I always read that as the opposite of what Daniel stated. Perhaps he knows something that I do not, or I am misinterpreting it.

  3. Thx for nice explanation.
    You mentioned drm_i915_gem_mmap_gtt and drm_i915_gem_mmap:
    “there are a lot of caveats with the decisions to use one vs. the other which depend on platform, the type of surface you are operating on, and available aperture space at the time of the call.”
    Could you explain how I can decide to use one vs. the other? Or could you point out a link where I can learn?

    I ask to resolve [INTERNAL LINK REMOVED]

    1. Man, didn’t you see that I said there are a lot of caveats 🙂 ?

      I’ll assume that you’re asking about modern platforms (gen3 and earlier had many
      reasons to use mmap_gtt).

      On LLC platforms it’s pretty straightforward: a GTT mapping is always slower
      because at best you’ve mapped it write-combined, while CPU mappings will be WB cached.
      Therefore, you only want to use mmap_gtt if you don’t have a software
      tiling/detiling algorithm and want to be able to map the buffer with a linear
      view on the CPU. Modern mesa has tiling/detiling algorithms for just about every
      situation, so we almost never want to use the GTT.

      On non-LLC platforms there are some benefits:
      1. Tiling/detiling just like LLC platforms
      2. Some semblance of GPU-CPU coherency
      On LLC, you’d get this with cached mappings for free. CPU writes through and
      reads through the GTT are done through the same hardware that the GPU uses (via
      the PCI BAR). Accesses are therefore coherent and synchronized with respect to
      both the GPU and CPU.
      3. Avoiding the clflush
      Unless you actively track the state of your cachelines that are used to map
      objects, which we do not do, you need to clflush the cachelines that may have
      been touched while operating on an object. Since i915 has no concept [yet] of
      operating on anything smaller than an entire BO, the result is we have to
      clflush the whole object very often. For small objects it’s not a problem, but
      large objects like 2d texture arrays can take a very large hit on this operation
      (it also has a lot of redundancy).
      4. Minimizing cache pollution
      Most non-LLC platforms have relatively small caches, and doing mmap_gtt
      will avoid using your precious little cache.

      There has been a lot of conjecture that non-temporal hints to SSE moves carry a
      lot of the benefits from #3 and #4, and additionally provide a lot more
      flexibility with how and when you cache things over GTT. I don’t want to go into
      too many details because I don’t have them very clear in my head at the moment,
      but you can go use your favorite search engine to read up on non-temporal SSE
      instructions.

