CppCon 2017: Nicolas Guillemot “Design Patterns for Low-Level Real-Time Rendering”


– Alright, so hi everyone, and
thank you for coming today. I’m going to be talking to
you about Design Patterns for Low-Level Real-Time
Rendering and my name is Nicolas. So, the reason why I made this talk, or one big reason is because of the introduction in the recent years of brand new graphics programming APIs like Vulkan and DirectX 12 and Metal. These APIs give you control
that wasn’t possible before. And so what I think that
we need to do at this point is we need to realize that
the rules have changed. About what we can do with
graphics programming. And so I want to take
a step back and rethink how we can program with GPUs using the new primitives
that we have access to. And a lot of this talk is, I’m trying to boil down
things to like a level that can be understood
so there maybe times where you disagree about the way that I present the information, so I encourage you to look
at the nitty-gritty details yourself and try and
think if you can think of and even better way of seeing
a lot of these problems. One of the big motivations I have for what I’m doing here
is to try to think of generic solutions to
Domain-Specific problems that can be expressed
in Domain-Specific terms for the solution as well, so this is just a little project I did. It’s a ray tracer
implemented in DirectX 12. It’s nothing super amazing, but I thought it was a really great project and I was really happy with the results, because I implemented this ray tracer based on a design from a paper by Simon Elaine and the
design of the ray tracer you can see on the right a diagram there. So the ray tracer itself is kind of like a pass graph, with data that passes from one node to the other through these cues. So when I started implementing this I started just writing the code, you know hard coding, you
know, here’s the function for the logic node, here’s the function for the material node and just typing in DirectX 12 code and it was really tedious. And then I realized I was wasting my time because I was typing the same
thing over and over again. So I just made kind of a
pass graph system myself. Where I could add nodes and declare cues between them and then it kind of all just worked automatically. I could focus on those things
that I really care about which is adding new materials and customizing all the nodes. So this is kind of like the
general kind of solutions that I think are a great
way to use these new APIs. Is to use like a low level API, to implement a very high
level interface to work at the level at which
you are most productive. So that’s just the motivation, kind of like a lofty
goal to try to achieve. This talk is mostly split into two parts. So in the first part, I’m
going to be talking about some general GPU Programming Basics. From the perspective of a fresh, kind of modern perspective, I think. And in the 2nd part, I’m
going to be talking about how these primitives can be used to implement various data structures that are useful for designing a renderer. So let’s get on with part one. So, GPU Programming Basics. The first thing I want to talk
about is Memory Management. So, I want to start with Memory Management because it’s such an important
part of everything we do. That it feels really important to address how to do that properly first. The first thing I want to do is I want to talk about memory
management on a discrete GPU. So discrete GPU the CPU and
GPU are relatively separate. As far as the CPU concerned, the CPU as you can see right here, has it’s own region of memory, we refer to as system memory
or various things for it. This basically your RAM. And the CPU can access this RAM through a memory management unit. And this memory management unit does very many fancy things. One of the things that it does is it allows you to use virtual memory. So for example, in this case, using virtual memory,
what’s happening here is that they system memory has
two pages of memory allocated inside of it as you can see at the bottom. And these two pages of
memory are not contiguous. But when the pages are
accessed through virtual memory which you can see at the
top in the MMU block. These pages appear to be contiguous. So they’re not physically contiguous but they’re contiguous in virtual memory. And this is a theme that
we’ll see over and over again. In this slide and the next slide so that’s why I wanted to
really deliberate on this point. The MMU does actually a
whole bunch of fancy things. But another fancy thing that it does is it can automagically fold
pages into the file system. So what I mean by that is to say that for example if you have too many processes running at the same time and you actually exhaust the total amount of RAM that you have in your computer it’s actually possible
to take pages of RAM and put them on the file system and run another process in the meantime. This is something that
happens kind of automagically and is very interesting. I’ll show you why I’m
talking about this topic. It’s kind of relevant to GPUs as well. Or rather the fact that it’s not relevant is what’s relevant. So, moving on to the GPU part, the GPU is, you can say it in this way. It’s very similar to the CPU. It also has it’s own region of memory. And that region of memory, it’s called in this case, we’ll call it video memory. But it has many names. This region of memory
is also accessed by MMU. That is on the GPU side. Now this MMU is set to do very much of the same functionality. Such as, for example, virtual memory. So in this case there are physical pages of video memory being allocated, which may not be physically contiguous, but these physically dis contiguous pages can be accessed through
a virtual address range that is appears to be contiguous. So this kind of functionality
is supported by GPUs as well. And now we get to the weird part where the two devices kind of start crossing over each other. One thing you can do for example, also, is you can have pages of memory in system memory, which the
CPU can refer to of course, using it’s own virtual memory space. Also, the GPU can refer to it using it’s own virtual address space. So this is one way in
which you can share memory between the CPU and the GPU. And you can actually go one step further and do this the other way around. So in this case, if you
look at the purple boxes, that just appeared, in this
case the physical memory is in video memory and we have
the same kind of situation where the CPU and the GPU
each have virtual addresses that point to the same physical memory, but this time the physical
memory is on the GPU side. So just to say these
scenarios are all possible. And another one that’s
also important to note is DMA engines, so GPUs
have the ability to, they have special
hardware for doing copies very quickly but between
system memory and video memory, using the so-called DMA engines. And actually one thing I want to highlight is that I deliberately put the DMA engine block here at the bottom
and it’s directly connected to the physical memory of one device the physical memory of the other device. To say that in general you can’t assume that a DMA engine has
the ability to understand what virtual memory is, which means that for example, it can only work
at the granular area of pages. Because it can’t go from
one page to the other in a virtual way it has to actually go from one page to another
in a physical way. Now there’s also the case
of the integrated GPU. In this case, the CPU and
GPU are roughly one unit. In this case, things
actually get a lot simpler. But it’s not too different
at the same time. In this case, we have both devices sharing the same memory, in system memory. They don’t necessarily share
the same address space, although that kind of thing is possible with some extensions. But in the most general case,
they each have their own virtual address space
in which they each have different addresses to refer to the same locations of physical memory. And one thing I want to highlight, which I think I forgot to mention on the last slide, one
thing I want to highlight is the memory that’s being access by the GPU in the yellow blocks, I deliberately did not
connect it to the file system. In these diagrams, because
what I wanted to say is that the memory that is
accessible to the GPU, does not support this
kind of fancy mechanism of faulting pages back to the hard drive, like it may be possible, but
it is physically possible to support this configuration and some do, but in the most general
case you can’t assume that you can do that. So, moving on, I want to talk about Command lists a little bit. The way I see Command lists, from a big picture point of view is that in this case, we
have two processors, mainly. The CPU which is an Out of Order pPocessor in most usual cases. And GPU which is an In Order Processor. And so by the nature of, you know if you think about the word in order, what does that mean, it means that the order in which it runs the work that it’s given, is the order in which it was given. Rather than an order
that is more efficient, that it can lift the CPU run time, which is what your CPU do. Right, when we talk
about branch prediction and those kinds of things. Those are fancy things that a CPU can do because it’s an out of order processor. So the importance of,
the thing about being an in order processor
is that it means that the performance depends mainly on whether the commands you are given were given in an order
that is actually efficient. And so one of the big picture ways that I think about this
relationship is that it’s very beneficial to think about how we can use the CPU
to order the commands to be sent to the GPU in a
way that is most efficient. If you want an analogy, here’s an analogy. So let’s suppose the GPU is the train and the dog is the CPU, and
the CPU is being very careful being very agile in
laying these train tracks as a command list for the GPU to run. And actually while I’m on this topic, I’m going to actually take
this analogy one step further. And I’m going to introduce to
you the concept of a fence. So, a fence is kind of like
an operating system object and also represents a
hardware object, I suppose. What it does is it lets you keep track of the progress of the
GPU as it makes progress in the list of commands you’ve given it. So for example, the fence
has a current value. And this current value
might initially be zero. If we assume that the train
on the right is the GPU, and the tracks are the command list, then as the train moves along,
and reaches the first fence with one, then the fence
changes, when this reaches the fence value of one, then
the fence value changes to one. And as it keeps moving along the commands and reaches two, then the fence changes from value one to two. And so on, as it reaches number three, and this mechanism as I’ll show you, is how you can implement
various kinds of synchronization between the CPU and the GPU. So I’m actually going to elaborate on that a little bit more with a
slightly more concrete example. So in this scenario we
have a CPU and a GPU. And we have a fence object in the middle. The fence object has
currently the value of one. Just to be arbitrary, and
let’s suppose we do this. The CPU writes a list of commands that it wants to send to the GPU. And at the end of this list of commands, it outputs a special command
which is a signal command. Which is what changes
the value of the fence. So after it writes this list of commands, and sends it to the GPU,
the GPU takes control of that list of commands and
it executes the commands. When the GPU reaches the signal command, it changes the value of
the fence from one to two. And using this fact,
we can now synchronize on the fact that this list
of commands is completed. Using the knowledge that
completion is indicated by having passed by two. So for example, the CPU can wait for the GPU to finish what it did by checking that the fence is
greater than or equal to two. And similarly, the GPU can also check that it finished the work
by checking the fence is greater than equal to two. Because it’s actually
possible to have more than one list of commands running concurrently on the GPU in a general case. So you need to synchronize sometimes between these different lists of commands. So I’m going with this. So I’ve talked a lot about commands but I haven’t actually said
what the commands kind of are. So I’m going to give
you a taste of commands. And again this is very high level so I’m kind of like
hiding a lot of details. But hopefully it’s easier
to think about this way. That’s the way I see it, even for me, I think it’s easier to
think about it this way. So for example we have some commands that may abstract, for
example, the DMA engines. So you might have a command
that just copies memory from a source location to
a destination location. Page two four, I think. You may also have some commands that are made for executing
programs on the GPU. So for example the sequence
of commands I’m showing at the bottom left here. So like we start by setting the program, to say like this is the
program I want to use the next time we run a program. And then we have a command
that sets the parameters to that program and then we can actually invoke the program using generally like a draw
call or a dispatch call. Those are the two main ways in
which you can set a program. And you can kind of take these three this three couple of commands and boil that down to
being from a C+ programming perspective it’s like a function call. So when we make a function call in every day C++ code, you
know you have a function you pass the parameters and
then you do a branch, right? That’s like generally what happens when you make a function call. And this is actually
pretty much exactly that. It’s just that it’s spread
out over multiple statements. So sometimes it helps to look
at this sequence of statements and think about it more
like a function call then to think about it as a separate like as a separate three steps. And from an API perspective you can make it look like that
and it could be convenient. So GPUs also have built
in hardware support for various kinds of rendering operations. So the commands are
also commands that exist for just controlling how this hardware renderer works. So for example you might have a command that sets the render target. To say like where should the
render be drawing pixels to. You may have a command
that allows you to set the graphics pipeline to say that you configure the way in
which the renderer should render triangles for example. There’s a wide variety of
configurations possible. And finally, this is
one of the most abstract set of commands that
I’m trying to basically pull a quick one on you or something. So they’re commands for
managing the memory model, and I call it the object model. To say that for example,
if you want to deal with data hazards, you may need for example a memory barrier to say that like the rights that have happened previously should be done before you move on. Those are the kinds of things you have to care about when you write some kinds of programs. Also, I have this command here, which I call transition, to say that it transitions an object
from state A to state B. Which is to say like some operations on some objects can only be done when the object is in a certain state. And so to be able to
do a certain operation, you need the object to
be in a certain state, but if it was not in that state beforehand then you have to
transition it to that state before it can do the operation. This is like a slightly more flexible, I would say I think about it
as a slightly more flexible way of doing a memory barrier in the sense that the barrier does more than just wait
for memory to finish. I also have these, these
are the sneakiest ones. Construct and destruct, if you look in the documentation, you will probably not find anything that looks like a command
that constructs and destructs. And I think about it a little bit more placement new, placement delete. I’ll talk more about that topic later. Another thing I want to talk about is just a very common pattern you’ll find in this kind of programs
is related to commands. Is whether the command
argument should be read at the time at which the
commands are recorded or whether the command
argument should be read at the time at which
the command is executed. And so that’s kind of a trade off between these two ways of doing it. If you for example on the left if you pass the number of threads you want to run by value, then the number of threads
is stored within the command. And then hopefully that can give more opportunities for the driver
and the GPU and all that to know that what’s
coming up ahead of time. On the other hand, another
thing you can do is the so called indirect way of making draw calls for example and dispatch calls. Where instead of passing the
number of threads by value, you pass it by reference and then just before
the command is executed, it’s gonna read from memory
the number of threads that it needs to run actually at run time. And run that number of
threads at the time. So there’s kind of like a performance versus flexibility trade-off. And I wanted to bring this up because this is actually a, like
if you look through stuff you’ll find that this is
a very common trade-off of preparing things ahead of time or doing things at the last second. And the trade-offs in
performance and flexibility that that involves. One last thing I want to talk about when it comes to GPU programming quote unquote basics is descriptors. So descriptors, the way I think about it is something like the hardware view of what an object is. And when I say object, I
mean mostly things like buffers and textures, although there are
various kinds of buffers and there are various kinds of textures. So it’s a bit more complicated. The way I see it mainly
it’s a small struct that’s sort of something like an address plus some metadata that is relevant to how that object should be accessed. So for example, a concrete example, is from the documentation here is what a descriptor looks
like on AMDs processors. GB processors, I mean. So there’s a bunch of
stuff it’s like 128 bits. It’s not the only texture
descriptor that they support. There’s more than one kind of format. But in this case you
can actually clearly see that it begins with an address to the resource that
this object represents. And then later on it describes
the width and the height of the object that’s being referenced, or the texture that’s being referenced. So just to give you an idea that this is roughly what I expect to see when I think about a descriptor. Then these actually kind of map relatively closely to the assembly code in the sense that if you
look at the assembly code you’ll see that it’ll take the descriptor and pass it as an
argument to an instruction and that may sample from a texture. For example, so this is a concept that is very closely tied to
the architecture I would say. So that is all for the
first part of this talk, Which was about GPU basics. And now I’m going to talk more about designing a renderer at a high level using the primitives that I talked about in the first part of this talk. So before I go too much
into the nitty gritty, I want to introduce you to what I see as a decent way to architect
a real time renderer at a high level. So usually the way I see it is
I start from the simulation, which is for example if
you’re making a video game this is the part where maybe
the game logic happens. And the game logic happens in terms of so called game objects. And so the term game
object is not relating to any specific implementation. Every game engine has it’s own idea of what a game object is, so that’s kind of like
a very general term. But the point is that
every update of the game somehow you’re taking all
the state of the game objects and moving them around
based on the game logic. And so this kind of like
happens, I see it as two separate parts where like, in one sense you have like some objects that are updated based purely on time. As the time passes in a
relatively continuous way, and in another sense you have things that are more like events like you might receive a
message from another object. Or maybe you press the keyboard key and then that keyboard key
is like a discrete event that you have to handle
like on that one frame. So that’s just like one that I think about breaking down that problem,
that you might like. So, as a simulation happens
the way I see it is that the simulation should make
updates to a data structure that represents the
current state of the scene. So as objects are being moved around, you would take like for
example the position of a mesh and move it forward
based on some game logic. And this is all happening at the level of what I call the scene. Which exposes some high level concepts such as geometry and materials and instances of meshes and cameras and lights and so these
are all like not related strictly to any GPU concepts. But are kind of like
a high level interface for the simulation to describe
what it wants to render. From there, the renderer
itself is in charge of reading the data in the
scene and presenting that like how a presentation of
what it needs to be done into a representation that makes sense for talking to a GPU such as for example you this is where we bring out the Buffers and the Textures and the Shaders and the rendering passes. So you know it’s a bit lower level and at the same time it kind
of gives you a separation of like how the scene is describe versus how the scene is rendered. The separation can get a bit blurry so it’s not always straight forward. This is just a mental model for how to mentally separate it if it helps you. So the renderer will read
the data in the scene and then presumably
create a list of commands that it will send to the GPU
to actually render the scene. So this list of commands
makes it’s way to the GPU and the GPU will run those commands. At the end of running
all of those commands it will probably output
an image to the screen. Which we call like presenting and so this presented image actually gets put in the
so called swap chain. That’s like the DirectX
vocabulary anyways. The swap chain represents the
double buffering mechanism supported by the operating system. What you put in the swap
chain gets composited onto your desktop along with the rest of what’s on your desktop. Or directly to your screen if
you’re in full screen mode. Finally that makes its
way to your display. So in the context of real-time rendering we actually want this
whole procedure to happen within 10 to a hundred milliseconds so if you’re working on a shooting game or you’re working on
a virtual reality game you probably want to be on the
lower end of that spectrum. But if you’re making something
that’s kind of like an editor maybe it’s acceptable to
have a hundred milliseconds if it makes it easier to see like a good visualization
of what you’re working on. So depending on exactly what
kind of game you’re making or what kind of project you’re making you can afford some latency. So the next thing I want to
talk about is Ring Buffers. Because I think they are
a very useful primitive for implementing all sorts of stuff. I’m going to give you a very basic example of what a Ring Buffer is. So, in this case, I’m
showing you a Ring Buffer in the context of streaming
data from the CPU to the GPU. Although it could happen
the other way around. It’s a very general data structure so you don’t have to be
tied to this exact use case. But for the sake of
example, let’s think about a CPU streaming data to the GPU. So what happens is that
you’ve got the storage of memory, that is the Ring Buffer itself. And the CPU will allocate
a range of that memory and write data to it. Then this data that was just written gets passed on to the GPU
and the GPU can read it. And while the GPU is reading that data, the CPU can write more data. So you’ve got this kind of
like back and forth going on where the CPU and GPU are
both able to work in parallel. And this pattern kind
of just repeats itself So what was just run by
the CPU again gets passed to the CPU and CPU keep
writing the next RAM data. Now I think that the API you make for accessing stuff in data structure can be actually beautifully simple. So for example, what I’m proposing here is kind of like a little
bit of an extension on the typical allocators
you see in C++ code, where I’m proposing
something like an allocator where you take your ring data structure and you call it Alloc on it in this case passing it a type to see how you want to allocate an object or a struc and instead of just returning just a CPU virtual address which is what an everyday
C++ allocator does, this allocator returns both
the CPU virtual address and the GPU virtual address. So in this case what you can do with this with this kind of allocation is you can use the CPU virtual address to write the data you want to upload and then you can take
the GPU virtual address and you can pass that on to the renderer through a command. Now I think some people will look at this and also throw in the suggestion that you should probably
consider copying that memory to proper video memory
before making a command. Making a draw command
so that some people are some people are really strict about that. Depending on how you feel,
there’s some trade-offs. And moving on, at the
bottom part of this slide you can see that I’m
applying the same idea but instead of allocating
just plain memory I’m allocating descriptors,
so you can make a Ring Buffer of descriptors and just have
a function that allocates you know end descriptors. You get back a CPU virtual address to the descriptors you allocated, a GPU virtual address to the
descriptors you allocated. So you can write the
descriptors metadata using the CPU virtual address and then you can take the GPU virtual address and pass that on for your rendering. Now there’s a bunch of tricky
cases with ring buffers. There’s a lot of buts, so for example, what happens when you run out-of-memory? This is a tricky case and I think ideally you want to avoid it. So I’ll talk about that a little bit. So what happens if the buffer’s full? The GPU is still reading the
yellow blocks on the right. But the CPU wants to write
the next RAM of data. And it can’t because the GPU
is still reading that memory. In this case the CPU
might just have to wait for that block of memory to be available which it might do using a fence. So once the GPU finishes
processing that data the CPU can take control of
it and start writing again. That’s one way of handling
out-of-memory situations. Using a fence, for example
to synchronize that access. So another big problem is how
do you handle wrap-around? Which conveniently,
you might have noticed, I avoided in all of my animations. So there are some tricks that
can make it a little bit sane. So for example, one thing
that I highly recommend is to make something like a virtual offset in the ring buffer. So, like in reality your ring buffer has a limited size, probably, but with the virtual
offset you just pretend that the ring buffer actually
has an infinite size. So even when get past the end you just keep incrementing
your current iterator and keep going. But then when you actually
want to access the memory that corresponds to one
of these virtual offsets, you mask it by the size of the buffer. Here we are assuming
that the buffer’s size is the power of two so that’s
why this masking trick works. This simplifies a lot of math
relating to ring buffers. Another thing that is kind of convenient is that is simplifies
working with fences, too. Because the fences increase monotonically and in this case, the virtual offset also increases monotonically. So it’s gonna be easier to match them up if you want to use fences and the ring buffer together. One thing you can do is and actually I do this
almost all the time, is just disallow wrap-around, so you just allocate enough memory that you know you’re not
going to need anymore and you just run with that. And if it fails like you
actually run out of memory you just assert and if you hit that assert you just double the amount of memory. And it’s kind of dumb but
it’ll get you very, very far. So that could be a very
worthwhile engineering trade-off, let’s say. If you’re actually
implementing this in the wild I recommend you consider
reading this article by Fabian that actually has a lot of really nice break down of the invariance that are related to a ring buffer and some proofs about how the
virtual offset for example makes things a lot simpler. So moving on, another neat
thing about ring buffers is you can actually do
allocations with them in a lock-free fashion. So let’s suppose that the
virtual offset in the last slide is now an atomic, in this case it’s actually very straight forward to allocate memory with
it in a lock-free fashion. For example, if the ring buffer’s offset represents an index in an array, it’s very easy to do
a lock-free allocation because all you have to
do is do a fetch add. So when you do a fetch add, what that does is it
increments the current offset by the number you said,
so it will allocate that many entries and it
returns the old value. So by returning the old value that’s what tells you
like here’s the start of the range you allocated. And it bumps the pointer
or the offset forward by that amount, so that’s a
relatively straightforward application of lock-free allocation. The more complicated case is
when you’re allocating bytes. So in the case for example earlier when it had like alloc of camera this is supposedly an
interface that can allocate any struct of any size. So it tends to be that
some different interfaces for working with a GPU
require different alignments for the data you pass in. And so you need to align your allocations which makes it a little
bit more complicated than simply bumping a pointer forward. And what I suggest we do here is just pick the worst case alignment that is relevant to the platform you’re working on which for example for DirectX is basically 512. And just pad up your
allocations to that size. So, you know it’s not,
maybe there’s a lot of waste in this kind of approach. I haven’t really measured it. It’s convenient, especially like it’s very easy to forget that
you need to align something if you don’t do it automatically like this so this is actually pretty
convenient for a lot of reasons. Yeah, so I just said. So there’s some pros and
cons like I mentioned. So for example, I think one
of the biggest pros for me is that this interface
takes memory management which is a relatively complex problem where you have to create a
bunch of buffers everywhere and it boils it down to being
a very, very simple API. And also like avoids fragmentation because all your allocations
are tightly packed in this ring buffer, so it can be a very nice way to use memory. I think it’s a , I say a
powerful building block to build all sorts of
higher level abstractions. Like when you find you
have this ring buffer, you find all sorts of uses for it. So I’ll suggest you know
Procedural geometry, if you want to generate
some stuff on the fly. Texture streaming, so for example you allocate like pixels
into your ring buffer and then kick off a copy into
a textural objects on the GPU. And also this so called
Sprite Batch abstraction for rendering sprites in
a 2D game for example. It’s something you can
implement very easily if you have a ring buffer handy. The main cons I can think
of are that it’s hard to allocate memory for it, so for example if you make it too small, then either your performance is going to be bad because you’re going to be synchronizing on these fences all the time Or you’re just going to crash because you put an assert there so it’s kind of tough. And if you make it too big then you’re just wasting memory so it’s not straightforward to solve this problem in
a general way, I think. Although I’m open, if
anybody in the audience has any solutions please, write a blog post or something about it. So another big problem I think is that the memory configuration itself is not one-size-fits-all in the sense that this whole time I’ve been
talking about this ring buffer like memory region but
we haven’t talked about should this memory region
be in system memory? Should it be in video memory and what should the cache
properties of this memory be? Should it be write combined cache? Should it be a write back cache? Should it be a write through cache? I don’t know, so and these
things really matter a lot because for example, you know, if you do procedural, if you run a procedural geometry algorithm
in place in a buffer that you’ve allocated
with write combined cache you’re gonna have a really bad day. So, it’s important to
think about these details when you’re implementing or
when you’re using a ring buffer. So the next thing I want to talk about is Parallel Command Recording which is a really big feature
of a lot of these new APIs. Here’s the big idea,
so the big idea is that it turns out that writing command lists is actually quite CPU intensive, so one of the ways that that
overhead has been mitigated in recent APIs is that
they make it possible to write commands in
multiple threads in parallel. So in this scenario, the
CPU creates multiple threads and each of these threads
can record command lists in parallel. And when they’re all done
creating their command lists you can bundle them up together and submit them all to
the GPU all at once. And this has been proven to
make some pretty big advantages, when you have a lot of work to do. So if we actually want to apply this let’s first approach the easy case. And before actually, before
I even talk about this I just want to say if you don’t need to make it like multi-threaded, don’t. But if you have like 10,000 objects then maybe it starts getting interesting so my point is, like
don’t over engineer it. But anyways, this is a
relatively simple case when it comes to multi-threading
the command writing. In this case, all we’ve done is you know say we have a whole bunch
of asteroids to draw and we want to draw
them as fast as possible and writing the commands is expensive because we have a whole lot of asteroids. In this case, what we can do is we can split the list of asteroids into several jobs and send each job to a different thread. In this case, each thread will write the commands independently so each one will start by
setting the render target and then submit the changes
of geometry and material and make the draw calls
for each of the objects that are being drawn. And at the end of this parallel for loop all the lists are grouped together and submitted all at once. So this is the easy case. In the next few slides,
I’m going to explore the difficult case, which
is kind of adventurous and almost academic so
if you cringe, it’s okay. I’m a bit in the clouds right now. The difficult case is what if you’re not
working with regular work? Regularly shaped objects, for example maybe you have a render
that you want to be extremely general and you want to handle like blob shaped objects
which require like martian cubes or you want to have like subdivision objects which require like some kind of triangle
subdivision algorithm. And then you want to have particles which require like bitting and sorting, all sorts of stuff to
make them look right. So if you want to have all
sorts of different objects rendering like this then
the amount of CPU work that is required to render
one of these objects changes depending on
what type of object it is and maybe other things
that are hard to measure like how far it is from the camera and how much of it can you see like how much is actually
visible to the camera. In these cases we have to adopt a slightly more sophisticated approach then a parallel for each if you want to be efficient I suppose. And so the solution I’m going
to propose for this problem is to use Fork/Join parallelism. So to try and summarize
Fork/Join parallelism in relatively rockable way, I have this image on the right here. So the idea is that we
start off with a big list that contains all the objects in the scene and then we first begin by checking is this list big enough
that it’s worth splitting among multiple cores? And if it’s worth splitting
among multiple cores then we split it in half and we actually apply this recursively. So like every time the list
appears to still be big enough that it’s worth splitting
among multiple cores you split it again. And at the end of the day,
you get a bunch of tasks that are hopefully roughly well-sized to be run on their own core by themselves. If this interests you, I highly recommend that you watch the talk by Pablo Halpern from CppCon 2015, I thought
that was a brilliant talk. And really explains all
this slide in more detail. But actually, what I’m going to do here is I’m going to go one step deeper and I’m going to talk to you about a solution to another problem with this fork/join thing. So zooming into the part
of the fork/join graph I showed earlier where there
was a cube and a green cloud. In the case that I
showed in the last slide the cube commands, the rendering commands for drawing the cube are written into their own command list and the commands for rendering the green cloud are written into their own command list as well. And so, I should say
like, you generally want to avoid creating too many command lists because there is some overhead associated with submitting
to many command lists. And one case where this is especially an obvious loss is for example if you find out that the two tasks that were created for these two objects actually ended up running
on the same CPU core. So for example, if they both run on the same CPU core,
we don’t need to have two separate command lists. We can actually just have one command list that will write the commands
to both of those objects. And that reduces the number of command lists that we need in total. So what I’m saying here is like why not, for the 2nd task, if
we’re in the circumstance where the 2nd task is
run on the same core, why not automatically use list number one again for the 2nd object? Now this is actually possible to do in a relatively elegant way using the so called Hyperobject idea. And with this Hyperobject
idea implemented, the final result is that
in those circumstances where the same CPU is used
for subsequent draw calls the two lists of draw calls are combined into one single list. Now, the key idea, I don’t know if I want to go into too much detail but I think this is a really neat idea. And if it interests
you, I highly recommend that you read this paper. Actually, I would say this
paper is really interesting even in general because
it explains a lot about the internals of how C++ scheduler works because it explains how
this whole Hyperobject idea actually integrates super nicely into their scheduling. So a very interesting
paper that I recommend to read if you’re
interested in this topic. And another thing that’s
specifically about rendering is that if you apply this
kind of Hyperobject idea it still keeps the order
of the draw calls intact. Which is actually very important because it turns out that if the order of the draw calls changes between frames for example, it may have issues of aesthetics like for example, if the order of the draw calls change you might see objects flickering. And also changing the order of draw calls may change the performance, so it’s important to
have stable performance for each frame so keeping the
order of the draw calls intact despite the parallelism
that you’re adding here is actually a nice bonus. So now one of the last parts of this talk that I’m going to show you today is about scheduling GPU Work & Memory. So going back again to the big picture. The big picture here is that you can decompose a frame of rendering in a real time scenario into something like a bunch of passes where each pass has a
certain set of inputs and a certain set of outputs. And you can kind of like take a step back and visualize this whole thing as a big kind of graph. Now when we make a scheduler what we want to do conceptually is we want to take as input this graph and somehow find a way
to execute that work in a way that is first of all valid in the sense that like
dependencies are respected. But also ideally efficient
as much as we can. Another thing is that for example if you look at the middle of this graph, there’s a Texture 2 object which is the output of Pass
1 and the input of Pass 2. So actually this Texture 2
object is no longer relevant after those two nodes that are using it. So what I mean by managing
allocation and lifetime is that there exists some resources that exist for only a
small period of the graph. And it is interesting to think about how we can best do these allocations and de allocations of those objects, as we execute the graph. And finally, I wanted to say that I’m putting it as respect dynamic nature of real-time rendering. What I mean by that it’s very tempting to look at something like this and to make an API where you build a whole graph of the frame up front that’s like a big data structure in like a init function somewhere and then just like let that run and that can work and it
has some nice advantages but I want to make
there’s some respect given to the fact that in a real time scenario the contents of what’s shown on the screen changes a lot from one frame
to the other potentially. So what happens is that
a game can change a lot so it’s hard to come up
with one, big, static data structure that represents a frame in these kinds of scenarios, so there has to be some
kind of thought put into the level of like flexibility
that you want to allow for the generation of this graph and in the most general case,
I think are very common cases to actually just regenerate
the whole graph every frame. This is a common approach,
but it’s unnecessary. So you make your own decisions. So this is, this kind of pseudocode here is roughly how you might
have done it for example in Opengeo DirectX 11, so
this is kind of like the old school approach
of how to do the problem. It illustrates a lot of the problems that we have to deal with when we implement a more general solution to this problem. So for example, the code on this slide is split into two parts. We have the higher part
and the lower part. Both of these parts of code represent a pass of rendering. The fist pass of rendering
puts a depth buffer in the form of a shadow map. So the word shadow map
that I’ve highlighted there in the first part that’s
the output of this pass. And if you don’t understand
what a shadow map is that’s fine, just imagine just the output of that pass of rendering. Meanwhile, in the bottom part of the code we have another pass of rendering and this pass of rendering
takes the shadow map as in input now, so this shadow map went from being an
output to being an input. And more precisely it went
from being a depth buffer which has like this read and write kind of permissions to being a texture that is only read as a depth texture. And so if you’re working with
Opengeo or if you’re working with DirectX 11, this kind
of like change of an object going from being written to to being read and changing from being a depth buffer to something like a depth
texture that’s being read, these kinds of transitions are handled automatically for you. And the driver may even be able to build a graph very similar to what I showed in the last slide automatically from what you’re submitting. But these are the kinds
of responsibilities that now are on our plates so we have to figure out the solution to this automatically. But if you are using Opengeo or DirectX 11 you can probably actually
make your real code very similar to the code
I showed in this slide. And it’s going to be just fine. So, I think actually this
is a nice way to architect Opengeo code for example in general. If you’re not going to
come up with a really fancy astronaut solution like
the rendering graph I’m going to talk about. So when it comes to
scheduling GPU work and memory there is actually some
previous work on this topic. Which is, I think very interesting. So if you’re interested in this topic I definitely recommend that you look at the work on frame graph
by Yuiry from Frostbite and also the work on
render graphs with Vulkan by Hans-Kristian and so if
you’re interested in Vulkan I especially recommend that you look at the 2nd one there, it’s
actually open source as well. So you can go and look
at how it’s implemented. I don’t want to repeat
everything they said but today I’m going to try
and give you a summary. I tried to boil it down,
like the core ideas and some of the fundamental challenges to make it a bit easier to understand some of the details. So first I want to talk
about submitting work. So there’s various operations, I would say various problems that we can solve in the process of submitting work using a work scheduler that I’m proposing. For example, I’ll say in general you can implement
Compiler-like optimizations. And so one example of this
is dead code elimination. So let’s suppose that you have the graph that’s shown on the right here. Where you can see this Pass 1 and Pass 2 and actually Pass 1 is
outputting a texture. But a texture is not
connected to anything else. So we’re creating this texture and then we’re just throwing it away. So an obvious solution to this problem is hey, make the work scheduler realize this output is actually useless and remove that entire part of the graph. And then you know, at this
point you might be asking well why did you even write that code? Just delete the code, why
should it even be there? But you know, it happens
that sometimes, for example you might have a node that
outputs some debug information and then you want to be
able to conditionally display that debug information. So there are various cases like that where you have these optional outputs and you ideally want to not output it when nobody needs it but you don’t know that nobody needs it until the end of the frame. Just to say that it’s not
always obvious to know some code is dead ahead of time. Another thing that’s very important thing that your scheduler should handle is handling the Data hazards when it comes to both memory barriers and to handle like transitions
between reads and writes or exchanging between an object that’s being read and written
or writ and read, that’s one. And also barriers that
are relating to the render because there are various kinds of I’ll say various kinds of barriers that are more advanced
that are relating to the way that the rendering pipeline works. So for example in this
area that I’ve shown we have one blue node
which is attached node writes to some memory which is outputted in this texture. And then the next node
will read that texture. So in this situation we
have a read after write data hazard and so in
between these two nodes we have to make sure we insert a barrier that says that all
writes before this point must have finished before we can move on and read the data that we just wrote. Another, I guess this is kind of tricky aspect is managing the
latency between passes. So for example, I’ve
shown two schedules here. On the left we have A
followed quickly by B. And C followed quickly by
D and on the other side, on the right, we start
A as early as possible and then only run B like very late. So the trade-off here is that if you run A and then B immediately, then hopefully you get
some cache coherency from running like the outputs and then using the outputs
immediately as inputs. But you may have to wait for A to finish before you can do that. So there might some
extra like cost to that. Whereas in the other case, we start A soon and do B like very late and as a result, like it’s very likely that while A started has been mostly finished by now, so we won’t have to
wait for A for very long but it maybe that the
memory that A produced is now no longer in cache so it’s kind of a difficult trade-off. And also I’ll add that on the right side there’s more tasks
running at the same time so that means that the
total amount of memory necessary is greater
so lots of non-obvious kind of heuristic trade-offs
are necessary here. And finally when your
scheduler has finished deciding what tasks
actually need to be run it has to somehow traverse this graph and invoke some code that somehow will record the commands that are relevant to that node of rendering,
that pass of rendering. Now I want to get a
little bit more concrete about the algorithms that you can use to schedule this work. So the closest thing I could find to describe this algorithm is like list scheduling, so called. This is a very simple algorithm. The algorithm itself is simple but the heuristics are the
tricky part, basically. So the algorithm itself is something like allocate priorities to all the tasks. And then in a loop, take
the highest priority task that you can actually run and schedule it and then just repeat that until you’ve hopefully
scheduled everything. So in this case, to
illustrate that concept, I’ve created a small task graph here. And I’ve ordered them
by priority so A B C D E is the order of priority. And let’s suppose that we
want to run this schedule. So first we begin by running
the highest priority task which is A, so A gets scheduled. And then let’s suppose that the next highest priority task is B. But maybe B needs to allocate some memory that it can’t satisfy, like there’s actually not
enough memory available to run B, so we skip B, because it’s not a valid runnable task and move on to the next highest
priority task, which is C. And maybe C is possible to
run, so C gets scheduled. And maybe now after C has finished, it may be possible to
allocate the memory for B because it has used up its input so we can schedule the rest of the graph. That’s kind of how I see it and again like the tricky part about this is how I come up with the right heuristics which I don’t know if it’s
going to solve the problem although if you look at the
blog post that I linked, there are some suggested heuristics. Another aspect is, I talked
about this a little bit before, but just like managing
the lifetime of objects as the flow of tasks come in. So for example, if you have a very simple task graph like this one, where Texture 1 is an input to Pass A, and then Pass A outputs Texture 2, which is an input to Pass B, which then outputs finally
the final output of the frame. One potential way you could schedule this is something like this. So in this case, what’s happening is that before I run A, I first allocate Texture 1 and Texture 2. Because both its inputs and its
outputs need to be allocated before the task can actually run. But then maybe after A has run the memory for Texture
1 is no longer necessary so it can be deleted. And again, like earlier,
when I say new and delete I’m talking about something
that’s a little bit abstract and it’s more concrete if you
look at the documentation. You’ll find it, it’s like a sequence of specific commands, I’ll say. But moving on, so then before running B we have to make sure that
the output is also allocated. So make sure that that memory
is available and then run B. And finally we can delete Texture 2 because it’s no longer necessary and we have the final output. And part of what your scheduler can do is optimize scenarios like
this one if for example it may be that the same
memory can be reused for Texture 1 and Texture 2. So for example, if Texture 1 and Texture 2 have the same size and format, and it turns out that the
operation done by Pass A actually like takes every
pixel from Texture 1 and just makes the calculation on it and outputs a new pixel at the same place it may be that you can
reuse the same texture for both of those operations rather than doing a copy from one region of memory to the other. And in this case you can have something more compact schedule where you only allocate the one texture and reuse it for both
Texture 1 and Texture 2. So these are some kinds
of things that you can do with a task scheduler
is identify resources that can be merged or reused. I think that the, well one of the places where you can find most of
the actual concrete details of how you implement this, I’ll say is the documentation is CreatePlacedResource in DirectX 12, turns out to actually have quite a beefy amount of documentation in that article. So, in summary, in part one today, I talked about three main topics. First I talked about Memory Management where I talked about the separation between System Memory and Video Memory and I talked about how Virtual Memory can be used to bridge
between them as well as DMA. Then I talked about Command Lists where I proposed a big idea where the CPU builds the schedule
for the GPU to consume. In an order that is efficient
and ideally correct. And then I talked about
how we can use fences to synchronize the flow of commands between the two processors. Finally, in that part of the talk, I talked about Descriptors very briefly, just to say that descriptors are, I say the hardware view of an object which is something like
an address, the memory plus some metadata that describes how the resource should be accessed. Then in part two, I
talked about Ring Buffers and I proposed them as a CPU
to GPU streaming primitive, although they have more uses than that. You can go backwards too, like GPU to CPU or just lots of kinds of
different code I would say. And then I talked about
Parallel Command Recording, to say that it’s possible
with these new APIs to record command lists in parallel and submit them all at once to the GPU. And then I went a little bit deep in talking about the differences between regular and irregular
cases and proposed some I think, maybe academic solutions to the irregular cases. Finally, I talked about
Scheduling Work & Memory. To talk about how we can
take a graph that represents a frame being rendered and
officially convert that into a list of commands
and send that to the GPU. That both optimizes the frame globally and manages the lifetime of the objects throughout the frame. I think actually, some people will say that if you don’t optimize
the frame globally then you’re not using these
APIs to their full potential so I think this is something
that’s relatively important. So before I finish this talk I’d like to acknowledge a lot of people who helped me, I guess, prepare this talk. Scott and Mauricio
mainly, helped me review early versions of this talk
that I pitched to them. And then there’s a whole
bunch of people here who just answered various questions and inspired various ideas
that I’ve shared today so I’d like to thank them all for their time. If you look at the slides later, you’ll be able to skim
through these references. If you want to see all the papers and slides that I’ve talked about. But for now, that’s all for today. So thank you for listening and I’ll accept any comments
or questions you may have. (clapping) – [Speaker] Hello, so despite all the talk you made about data in general do you think that so far there is room today to have abstractions in game engine to how you organize data on the GPU or is it always the case that for example I know I have a game with a lot of trees so I have to make all my rendering code consider that. Do you think there’s room for
an abstraction layer there where a game might not need to care about how data is put on the GPU? – I mean it’s a very
situational thing to consider. Depending on what kind of game
you’re working on, I suppose. Some things for example the
frame graph that I showed like that’s something that
like exists with Frostbite so people do have, I would say
like some little abstraction. And it’s almost like, to quote Agora, it’s a negative abstraction sometimes. It’s an abstraction that
actually improves the performance in a lot of ways, so I
think that this is something we can aim for to create
abstractions that actually improve the performance. Although, you know it’s hard to say. If you want to render the trees it might involve a lot of tricky work and it might be hard to actually organize all sorts of different concepts including rendering trees
and rendering all sorts of objects that’s in the same framework, then it may not be straight
forward, so hard to say. But I think it’s a noble goal, so. – [Speaker] Alright. – So yeah, so actually
I would say in practice one thing is to consider also the cost the time to even implement
system abstraction. So if you sit down today
and you start programming something like that,
you’d be way better off not trying to abstract it, hopefully, until you realize that it may be useful. Just saying that it’s some times better to ignore like you
don’t need a frame graph to like code something simple. – [Speaker] Right – It’s one of those trade-offs. – [Speaker] The goal is not
to write a new abstraction on every project, right?
– Yeah. – [Speaker] The goal is for
example, if I’m at my company I have 10 projects, I should
write the rendering code once for all my projects,
in the best case. – Right. – [Speaker] I guess, so anyway thank you. – Yeah, thanks. Any other questions? So I guess we’re gonna call it. (clapping)

Posts Tagged with…

Reader Comments

  1. Henri Tuhola

    Barely scratches the paint on the surface of the subject. I hoped to see some real code among it and not just those oversimplifications.

  2. scolic03

    10/10 talk, liked, shared on Facebook, subscribed, excitedly printed on A3 wallpaper, studied and revered!
    This kind of explanation only comes from a person knowing the subject matter intimately. A lot of hard work went into this talk and it shows. Thank you.

  3. Xavier Thomas

    One solution for the ring buffer sizing/wrapping issue is to use virtual memory as explained here http://ourmachinery.com/post/virtual-memory-tricks/ which permit allocating a huge addressing space but actually using just the memory you touch.

  4. Operation Darkside

    I would have needed these explanations far sooner. This was finally a version I understood. So many other tutorials and talks assume existing knowledge of graphics programming

  5. Sergey Dzhaltyr

    48:01
    Stack allocator might be beneficial for this case. Basically it makes the memory allocated for T1 to be implicitly reused for OUT (due to nature of stack allocators).
    So sequence will be: [NEW T2] [NEW T1] [RUN A] [DEL T1] [NEW OUT] …

Write a Comment

Your email address will not be published. Required fields are marked *