The Global GTT [Part 1]

This entry is part 1 of 4 in the series i915 object mapping

Global Graphics Translation Tables

This post covers the basics of how the GEN GPU interacts with memory. It focuses on the lowest levels of the i915 driver and the hardware interaction. My hope is that by going through this in excruciating detail, I can take more liberties in future posts.

What are the Global Graphics Translation Tables?

The graphics translation tables provide the address mapping from the GPU’s virtual address space to a physical address ((when VT-d is in use, the address is actually an I/O address rather than a physical address)). The GTT is somewhat of a relic of the AGP days (the GART), with the distinction being that the GTT, as it pertains to Intel GEN GPUs, has logic contained within the GPU and does not act as a platform IOMMU. I believe (and Wikipedia seems to agree) that GTT and GART were used interchangeably in the AGP days.

GGTT architecture

Each element within the GTT is an entry, commonly referred to by the initialism “PTE,” for page table entry. Much of the required initialization is handled by the boot firmware. The i915 driver gets any required information from the initialization process via PCI config space or MMIO.

Intel/GEN UMA system
Example illustrating Intel/GEN memory organization:


The table is located within system memory, and is allocated for us by the BIOS or boot firmware. To clarify the docs a bit: GSM is the portion of stolen memory for the GTT; DSM is the rest of stolen memory, used for miscellaneous things. DSM is the stolen memory referred to by the current i915 code as “stolen memory.” In theory we can get the location of the GTT from the MMIO register MPGFXTRK_CR_MBGSM_0_2_0_GTTMMADR (0x108100, bits 31:20), but we do not do that. The register space and the GTT entries are both accessible within BAR0 (GTTMMADR).

All the information can be found in Volume 12, p.129: UNCORE_CR_GTTMMADR_0_2_0_PCI. Quoting directly from the HSW spec, “The range requires 4 MB combined for MMIO and Global GTT aperture, with 2MB of that used by MMIO and 2MB used by GTT. GTTADR will begin at GTTMMADR + 2 MB while the MMIO base address will be the same as GTTMMADR.”

In the below code you can see we take the address in the PCI BAR and add half the length to the base. For all modern GENs, this is how things are split in the BAR.

/* For Modern GENs the PTEs and register space are split in the BAR */
gtt_phys_addr = pci_resource_start(dev->pdev, 0) +
    (pci_resource_len(dev->pdev, 0) / 2);

dev_priv->gtt.gsm = ioremap_wc(gtt_phys_addr, gtt_size);

One important thing to notice above is that the PTEs are mapped in a write-combined fashion. Write combining makes sequential updates (something very common when mapping objects) significantly faster. Also, the observant reader might ask, ‘why go through the BAR to update the PTEs if we have the actual physical memory location?’ This is the only way we have to make sure the GPU’s TLBs get synchronized properly on PTE updates. If this weren’t required, a nice optimization might be to update all the entries at once with the CPU, and then tell the GPU to invalidate them.


Size is a bit more straightforward. We just read the relevant PCI offset. In the docs: p.151, GSA_CR_MGGC0_0_2_0_PCI, offset 0x50, bits 9:8.

And the code is even more straightforward.

static inline unsigned int gen6_get_total_gtt_size(u16 snb_gmch_ctl)
{
        snb_gmch_ctl >>= SNB_GMCH_GGMS_SHIFT;
        snb_gmch_ctl &= SNB_GMCH_GGMS_MASK;
        return snb_gmch_ctl << 20;
}

pci_read_config_word(dev->pdev, SNB_GMCH_CTRL, &snb_gmch_ctl);
gtt_size = gen6_get_total_gtt_size(snb_gmch_ctl);
gtt_total = (gtt_size / sizeof(gen6_gtt_pte_t)) << PAGE_SHIFT;
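To sanity-check that arithmetic, here is a stand-alone user-space sketch of the same decode. The macro values mirror the driver's, but everything here is reproduced for illustration: a GGMS field of 2 yields a 2MB GTT, and since each 4-byte PTE maps a 4KiB page, that covers 2GB of GPU virtual address space.

```c
#include <stdint.h>

#define SNB_GMCH_GGMS_SHIFT 8   /* GGMS lives in bits 9:8 of SNB_GMCH_CTRL */
#define SNB_GMCH_GGMS_MASK  0x3
#define PAGE_SHIFT          12

/* Decode the GGMS field into the GTT size in bytes (the field counts MB). */
static uint32_t gen6_gtt_size_bytes(uint16_t snb_gmch_ctl)
{
	snb_gmch_ctl >>= SNB_GMCH_GGMS_SHIFT;
	snb_gmch_ctl &= SNB_GMCH_GGMS_MASK;
	return (uint32_t)snb_gmch_ctl << 20;
}

/* Each 4-byte PTE maps one 4KiB page, so the GPU VA space covered is: */
static uint64_t gen6_gtt_total_va(uint32_t gtt_size_bytes)
{
	return (uint64_t)(gtt_size_bytes / 4) << PAGE_SHIFT;
}
```

So a 2MB GSM gives 524,288 PTEs, which map 2GB of graphics address space.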


The PTE layout is defined by the PRM and as an example, can be found on page 35 of HSW – Volume 5: Memory Views. For convenience, I have reconstructed the important part here:

31:12   Physical Page Address 31:12
11      Cacheability Control[3]
10:04   Physical Page Address 38:32 ((Previous gens went to 39))
03:01   Cacheability Control[2:0]
0       Valid

The valid bit is always set for all GGTT PTEs. The programming notes tell us to do this (also on page 35 of HSW – Volume 5: Memory Views) ((I have submitted two patch series, one of which has been reverted, the other, never merged, which allow invalid PTEs for debug purposes)).

Putting it together

Based on what we’ve just learned, we can put together a function to write the PTEs:

/**
 * gen_write_pte() - Write a PTE entry
 * @dev_priv:   The driver private structure
 * @address:    The physical address to back the graphics VA
 * @entry:      Which PTE in the table to update
 * @cache_type: Preformatted cache type. Varies by platform
 */
static void
gen_write_pte(struct drm_i915_private *dev_priv, phys_addr_t address,
	      unsigned int entry, uint32_t cache_type)
{
	/* Total size, divided by the PTE size, is the max entry */
	BUG_ON(entry >= gtt_total / 4);
	/* We can only use address bits 38:0 */
	BUG_ON(address >= (1ULL << 39));

	uint32_t pte = lower_32_bits(address) |
		       (upper_32_bits(address) << 4) |
		       cache_type |
		       1; /* valid */
	iowrite32(pte, dev_priv->gtt.gsm + (entry * 4));
}


Let’s analyze a real HSW running something. We can do this with the tool in the intel-gpu-tools suite, intel_gtt, passing it the -d option ((intel_gtt is currently not supported for GEN8+. If someone wants to volunteer to update this tool for gen8, please let me know)).

GTT offset |                 PTEs
  0x000000 | 0x0ee23025 0x0ee28025 0x0ee29025 0x0ee2a025
  0x004000 | 0x0ee2b025 0x0ee2c025 0x0ee2d025 0x0ee2e025
  0x008000 | 0x0ee2f025 0x0ee30025 0x0ee31025 0x0ee32025
  0x00c000 | 0x0ee33025 0x0ee34025 0x0ee35025 0x0ee36025
  0x010000 | 0x0ee37025 0x0ee13025 0x0ee1a025 0x0ee1b025
  0x014000 | 0x0ee1c025 0x0ee1d025 0x0ee1e025 0x0ee1f025
  0x018000 | 0x0ee80025 0x0ee81025 0x0ee82025 0x0ee83025
  0x01c000 | 0x0ee84025 0x0ee85025 0x0ee86025 0x0ee87025

And just to continue beating the dead horse, let’s break out the first PTE:

31:12   Physical Page Address 31:12    0xee23000
11      Cacheability Control[3]        0
10:04   Physical Page Address 38:32    0x2
03:01   Cacheability Control[2:0]      0x2
0       Valid                          1

Physical address: 0x20ee23000
Cache type: 0x2 (WB in LLC Only – Aged "3")
Valid: yes
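To verify that breakdown, here is a small stand-alone decoder following the field layout above. The struct and function names are mine, purely for illustration:

```c
#include <stdint.h>

/* Decoded view of a HSW-style GGTT PTE, per the table above. */
struct gen_pte {
	uint64_t phys;  /* bits 38:32 and 31:12 recombined */
	uint8_t  cache; /* Cacheability Control[2:0]; bit 11 carries CC[3] */
	int      valid; /* bit 0 */
};

static struct gen_pte decode_pte(uint32_t pte)
{
	struct gen_pte d;
	uint64_t hi = (pte >> 4) & 0x7f;            /* bits 10:04 -> addr 38:32 */

	d.phys  = (hi << 32) | (pte & 0xfffff000u); /* bits 31:12 -> addr 31:12 */
	d.cache = (pte >> 1) & 0x7;                 /* bits 03:01 */
	d.valid = pte & 1;
	return d;
}
```

Feeding it 0x0ee23025 reproduces the values above: physical address 0x20ee23000, cache type 0x2, valid.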

Definition of a GEM BO

We refer to virtually contiguous locations which are mapped to specific graphics operands as objects, buffer objects, BOs, or GEM BOs.

In the i915 driver, the verb “bind” is used to describe the action of making a GPU virtual address range point to the valid backing pages of a buffer object ((I’ve fought to call this operation “map”)). The driver also reuses the verb “pin” from the Linux mm to mean: prevent the object from being unbound.

Example of a “bound” GPU buffer

Scratch Page

We’ve already talked about the scratch page twice, albeit briefly. There was an indirect mention, and of course in the image directly above. The scratch page is a single page allocated from memory which every unused GGTT PTE will point to.

To the best of my knowledge, the docs have never given a concrete explanation for the necessity of this; however, one might assume unintentional behavior should the GPU take a page fault. One would be right to interject at this point with the fact that, by the very nature of DRI drivers, userspace can almost certainly find a way to hang the GPU. Why should we bother to protect against this particular issue? Given that the GPU has undefined (read: not part of the behavioral specification) prefetching behavior, we cannot guarantee that even a well-behaved userspace won’t invoke page faults ((Empirically (for me), GEN7+ GPUs have behaved themselves quite well after taking the page fault. I very much believe we should be using this feature as much as possible to help userspace driver developers)). Correction: after writing this, I went and looked at the docs. They do explain exactly which engines can, and cannot, take faults. The “why” seems to be missing, however.
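The converse of decoding a PTE is filling the table: at init the driver points every unused entry at the scratch page, roughly what i915's clear-range path does. Here is an illustrative user-space simulation; `gsm` stands in for the write-combined BAR mapping, and all names are made up for this sketch:

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a PTE per the HSW layout discussed earlier. */
static uint32_t encode_pte(uint64_t phys, uint8_t cache_type)
{
	return (uint32_t)(phys & 0xfffff000u) |         /* addr 31:12 */
	       (uint32_t)(((phys >> 32) & 0x7f) << 4) | /* addr 38:32 */
	       (uint32_t)((cache_type & 0x7) << 1) |    /* CC[2:0] */
	       1u;                                      /* valid */
}

/* Point GGTT entries [first, first + count) at the scratch page. */
static void ggtt_clear_range(uint32_t *gsm, size_t first, size_t count,
			     uint64_t scratch_phys)
{
	for (size_t i = first; i < first + count; i++)
		gsm[i] = encode_pte(scratch_phys, 0x2); /* WB in LLC only */
}
```

As a cross-check, encoding address 0x20ee23000 with cache type 0x2 yields 0x0ee23025, the first PTE in the dump above.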

Mappings and the aperture

The Aperture

First we need to take a bit of a diversion away from GEN graphics (which, to repeat myself, are all of the shared memory type). If one thinks of traditional discrete graphics devices, there is always embedded GPU memory. This poses somewhat of an issue given that all end-user applications require the CPU to run. The CPU still dispatches work to the GPU, and for cases like games, the event loop still runs on the CPU. As a result, the CPU needs to be able to both read and write memory that the GPU will operate on. There are two common solutions to this problem.
  • DMA engine
    • Setup overhead.
      • Need to deal with asynchronous (and possibly out of order) completion. Latencies involved with both setup and completion notification.
      • Need to actually program the interface via MMIO, or send a command to the GPU ((I’ve previously written a post on how this works for Intel)). 
    • Unlikely to re-arrange or process memory
      • tile/detile surfaces ((Sorry people, this one is too far out of scope for an explanation in this post. Just trust it’s a limitation if you don’t understand. Daniel Vetter probably wrote an article about it if you feel like heading over to his blog.)).
      • can’t take page faults, pages must be pinned
    • No size restrictions (I guess that’s implementation specific)
    • Completely asynchronous – the CPU is free to do whatever else needs doing.
  • Aperture
    • Synchronous. Not only is it slow, but the CPU has to hand hold the data transfer.
    • Size limited/limited resource. With PCIe and modern 64-bit platforms there is really no excuse why the aperture can’t be as large as needed, but for Intel at least, someone must be making some excuses, because 512MB is as large as it gets for now.
    • Can swizzle as needed (for various tiling formats).
    • Simple usage model. Particularly for unified memory systems.
Moving data via the aperture
Moving data via DMA

The Intel GEN GPUs have no local memory ((There are several distinct caches on all modern GEN GPUs, as well as eDRAM for Intel’s Iris Pro. The combined amount of this “local” memory is actually greater than many earlier discrete GPUs)). However, DMA has very similar properties to writing the backing pages directly on unified memory systems. The aperture is still used for accesses to tiled memory, and for systems without LLC. LLC is out of scope for this post.


There are two distinct interfaces to map an object for reading or writing. There are lots of caveats to the usage of these two methods. My point isn’t to explain how to use them (libdrm is a better way to learn to use them anyway). Rather I wanted to clear up something which confused me early on.

The first is very straightforward, and has behavior I would have expected.

#define DRM_I915_GEM_MMAP       0x1e
struct drm_i915_gem_mmap {
    /** Handle for the object being mapped. */
    __u32 handle;
    __u32 pad;
    /** Offset in the object to map. */
    __u64 offset;
    /**
     * Length of data to map.
     * The value will be page-aligned.
     */
    __u64 size;
    /**
     * Returned pointer the data was mapped at.
     * This is a fixed-size type for 32/64 compatibility.
     */
    __u64 addr_ptr;
};

// let bo_handle = some valid GEM BO handle to a 4k object
// What follows is a way to map the BO, and write something
struct drm_i915_gem_mmap arg;

memset(&arg, 0, sizeof(arg));
arg.handle = bo_handle;
arg.offset = 0;
arg.size = 4096;
ioctl(fd, DRM_IOCTL_I915_GEM_MMAP, &arg);
*((uint32_t *)(uintptr_t)arg.addr_ptr) = 0xdefeca7e;

I might be projecting my ineptitude on the reader, but it’s the second interface which caused me a lot of confusion, and the one which I’ll talk briefly about. The interface itself is even smaller:

#define DRM_I915_GEM_MMAP_GTT   0x24
struct drm_i915_gem_mmap_gtt {
    /** Handle for the object being mapped. */
    __u32 handle;
    __u32 pad;
    /**
     * Fake offset to use for subsequent mmap call
     * This is a fixed-size type for 32/64 compatibility.
     */
    __u64 offset;
};

Why do I think this is confusing? The name itself never quite made sense – what use is there in mapping an object to the GTT? Furthermore, how does mapping it to the GPU allow me to do anything with it from userspace? For one thing, I had confused “mmap” with “map.” The former really does identify the recipient of the mapping (the CPU, not the GPU). It follows the conventional use of mmap(). The other thing is that the interface has an implicit meaning. A GTT map here actually means a GTT mapping within the aperture space. Recall that the aperture is a subset of the GTT which can be accessed through a PCI BAR. Therefore, what this interface actually does is return a token to userspace which can be mmap’d to get a CPU mapping (through the BAR, to the GPU memory). Like I said before, there are a lot of caveats with the decision to use one vs. the other, which depend on platform, the type of surface you are operating on, and available aperture space at the time of the call. All of these things will not be discussed here.

Conceptualized view of mmap and mmap_gtt
Conceptualized view of mmap and mmap_gtt

Finally, here is a snippet of code from intel-gpu-tools that hopefully just encapsulates what I said and drew.

mmap_arg.handle = handle;
assert(drmIoctl(fd, DRM_IOCTL_I915_GEM_MMAP_GTT, &mmap_arg) == 0);
ptr = mmap64(0, OBJECT_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, mmap_arg.offset);
assert(ptr != MAP_FAILED);


This is how modern Intel GPUs deal with system memory on all platforms without a PPGTT (or if you disable it via module parameter). Although I happily skipped over the parts about tiling, fences, and cache coherency, rest assured that if you understood all of this post, you have a good footing. Going over the HSW docs again for this post, I am really pleased with how much Intel has improved the organization, and clarity. I highly encourage you to go off and read those for any missing pieces.

Please let me know about any bugs, or feature requests in this post. I would be happy to add them as time allows.

Here are links to SVGs of all the images I created. Feel free to use them how you please.

Series Navigation: Aliasing PPGTT [part 2] >>

9 thoughts on “The Global GTT [Part 1]”

  1. Hello,

    When talking about the “stolen memory”, there is nothing about its size. The size is configured through the BIOS, but is there an optimal size? What happens if the user reserves too little or too much memory?
    Do you have some links to documentation ?

    Best Regards !

    1. How the stolen memory is allocated should be part of a BIOS writers’ guide. Your favorite search engine turns up a few hits, but I am too lazy to see if they have one for modern platforms with graphics. Essentially the boot firmware picks some number, shoves it in the memory map, and hands off to the OS. The stolen memory size is sometimes configurable from the BIOS menu, sometimes not. Coreboot from the Chrome OS project would have some implementation of this. I’d encourage you to find what they do.

      Too little:
      Today the only consumer that requires stolen memory is the Frame Buffer Compressor (FBC). If you have too little, FBC will not be able to work. This is a power conservation feature. The amount you need for this is roughly the size of your display, plus some fudge for limits due to tiling.

      Too much:
      We do try to reuse stolen memory as much as possible in recent kernel versions (this is relatively quite new). Assuming we aren’t able to reuse it all, then you just waste system memory. It will not have a direct negative impact on your Intel graphics hardware. If we can reuse it all, then you should see no difference over having the exact right amount.

  2. Hi,
    This post is great and very useful for general understanding.
    However, you said the following:
    “This is the only way we have to make sure the GPUs TLBs get synchronized properly on PTE updates.”

    This seems to somewhat contradict Daniel Vetter’s post [1]:
    “Note though that this only invalidates TLBs for cpu access”

    What have I missed?

    [1] Ref:

    – Ofir

    1. Daniel is certainly right in the part of his blog that says we need a special register to flush SA TLBs. I completely forgot about this when I wrote the blog post.
      (The register:

      However, quoting :
      “The device snoops writes to this region in order to invalidate any cached translations within the various TLB’s implemented on chip.”

      I always read that as the opposite of what Daniel stated. Perhaps he knows something that I do not, or I am misinterpreting it.

  3. Thx for nice explanation.
    You mentioned about drm_i915_gem_mmap_gtt and drm_i915_gem_mmap
    “there are a lot of caveats with the decisions to use one vs. the other which depend on platform, the type of surface you are operating on, and available aperture space at the time of the call.”
    Could you explain how I can decide to use one vs. another? or could you point out link I can learn?

    I ask to resolve [INTERNAL LINK REMOVED]

    1. Man, didn’t you see that I said there are a lot of caveats 🙂 ?

      I’ll assume that you’re asking about modern platforms (gen3 and earlier had many
      reasons to use mmap_gtt).

      On LLC platforms it’s pretty straightforward: a GTT mapping is always slower
      because at best you’ve mapped it write-combined, while CPU mappings will be WB cached.
      Therefore, you only want to use mmap_gtt if you don’t have a software
      tiling/detiling algorithm and want to be able to map the buffer with a linear
      view on the CPU. Modern mesa has tiling/detiling algorithms for just about every
      situation, so we almost never want to use the GTT.

      On non-LLC platforms there are some benefits:
      1. Tiling/detiling just like LLC platforms
      2. Some semblance of GPU-CPU coherency
      On LLC, you’d get this with cached mappings for free. CPU writes through and
      reads through the GTT are done through the same hardware that the GPU uses (via
      the PCI BAR). Accesses are therefore coherent and synchronized with respect to
      both the GPU and CPU.
      3. Avoiding the clflush
      Unless you actively track the state of your cachelines that are used to map
      objects, which we do not do, you need to clflush the cachelines that may have
      been touched while operating on an object. Since i915 has no concept [yet] of
      operating on anything smaller than an entire BO, the result is we have to
      clflush the whole object very often. For small objects it’s not a problem, but
      large objects like 2d texture arrays can take a very large hit on this operation
      (it also has a lot of redundancy).
      4. Minimizing cache pollution Most non-LLC platforms have relatively small
      caches, and doing mmap_gtt will avoid using your precious little cache.

      There has been a lot of conjecture that non-temporal hints to SSE moves carry a
      lot of the benefits from #3 and #4, and additionally provide a lot more
      flexibility with how and when you cache things over GTT. I don’t want to go into
      too many details because I don’t have them very clear in my head at the moment,
      but you can go use your favorite search engine to read up on non-temporal SSE moves.

    1. Sadly, I will not be at either event. Feel free to email me, or you can always visit sunny Portland (a nice 4 hour train ride from Seattle). I will be at XDC2015 in Toronto.

  4. I have query related to memory mapping drm library calls:
    1. drm_intel_gem_bo_map_gtt
    2. dri_bo_map
    Please could you tell me what is reason if both behave the same way?
    Means I am getting same h264 enceded video if I use dri_bo_map instead of drm_intel_gem_bo_map_gtt.
