Welcome to the 3rd book! Here we’ll check out some problems that can occur during the rendering process. But first, some practice:
Knowing about a problem is useful, but actually feeling the problem is even better for understanding. So let’s try to feel like a CPU/GPU.
Experiment
Please create 10,000 small files (e.g. 1 KB each) and copy them from one hard drive to another. It will take a long time, even though the total amount of data is just 9.7 MB.
Now create a single file with a size of 9.7 MB and copy it the same way. It will go a lot faster!
That’s right, but every copy action involves some extra work, for example: preparing the file transfer, allocating memory, moving the read/write heads of the HDD back and forth, … which is overhead for every write action. As you painfully felt, this overhead is immense when you copy a lot of small files. Rendering many meshes (which means executing many commands) is a lot more complex, but it feels similar.
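The same intuition can be written down as a tiny cost model. The numbers below (per-operation overhead, bandwidth) are completely made up for illustration, but they show how fixed overhead dominates when the work is split into many small operations:

```python
# Toy cost model: every copy (or draw call) pays a fixed per-operation
# overhead, plus a throughput-dependent cost for the actual data.
# All numbers here are invented for illustration.
def total_time_ms(num_ops, bytes_per_op, overhead_ms=1.0, mb_per_s=100.0):
    transfer_ms = (num_ops * bytes_per_op) / (mb_per_s * 1024 * 1024) * 1000
    return num_ops * overhead_ms + transfer_ms

many_small = total_time_ms(10_000, 1024)      # 10,000 x 1 KB files
one_big    = total_time_ms(1, 10_000 * 1024)  # one ~9.7 MB file

# Same total data, wildly different cost: the overhead dominates.
print(f"{many_small:.0f} ms vs {one_big:.0f} ms")
```

The transfer part is identical in both cases; the entire difference comes from paying the fixed overhead 10,000 times instead of once.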
Let’s now have a look at the worst case you can get during the rendering process.
Worst Case
Having many small meshes is bad. If they use different material parameters, it gets even worse. But why?
1. Many Draw Calls
“The main reason to make fewer draw calls is that graphics hardware can transform and render triangles much faster than you can submit them. If you submit few triangles with each call, you will be completely bound by the CPU and the GPU will be mostly idle. The CPU won’t be able to feed the GPU fast enough.” [f05]
In addition, every draw call produces some kind of overhead (like mentioned above):
“There is driver overhead whenever you make an API call, and the best way to amortize this overhead is to call the API as little as possible.” [a02]
More Details here: NVIDIA OpenGL extension showcasing perf benefits of new concepts in APIs
2. Many Commands
To reduce this overhead, the driver batches up several commands in something called a push-buffer: it does not hand over one command after another, but first fills up this buffer and then hands a complete chunk of commands over to the GPU.
You can find the settings for this buffering in the control panel of the graphics driver (“maximum pre-rendered frames”). The downside of a high number of pre-rendered frames is that we essentially render further in the “past”: the CPU frame already has the latest player input, but the GPU renders something that is a few frames old. This added latency can be bad for certain content (virtual reality, for example).
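As a rough mental model (the class and method names here are invented, this is not a real driver API), the batching could be sketched like this:

```python
# Minimal sketch of a driver-side push-buffer: commands are collected
# and only handed to the "GPU" in whole chunks. All names are made up.
class PushBuffer:
    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.pending = []
        self.submissions = 0  # how often we actually talked to the GPU

    def record(self, command):
        self.pending.append(command)
        if len(self.pending) >= self.chunk_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.submissions += 1  # one expensive hand-over per chunk
            self.pending.clear()

buf = PushBuffer()
for i in range(10):
    buf.record(f"draw mesh {i}")
buf.flush()  # hand over whatever is left at the end of the frame
print(buf.submissions)  # 3 submissions instead of 10
```

The per-hand-over overhead is amortized over a whole chunk of commands instead of being paid per command.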
Modern or console graphics APIs also allow you to fill several command buffers in parallel and the driver hands them over to the GPU one after another (serial submission queue).
The key difference between DirectX 12’s and DirectX 11’s command buffers is essentially that the command buffers built in parallel are now created in such a way that they are very quick for the driver to submit later. In DirectX 11, the driver still had to do more tracking of things during the serial submission, which reduced the benefit of building in parallel.
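A toy sketch of that split, “record in parallel, submit serially”, might look like this (everything here is invented for illustration, it is not how any real driver is written):

```python
# Sketch: several threads record command lists in parallel; a single
# submission queue then hands them to the "GPU" in order.
from concurrent.futures import ThreadPoolExecutor

def record_command_list(thread_id, draws):
    # Each thread builds its own, independent command list.
    return [f"thread{thread_id}: draw {d}" for d in draws]

work = [range(0, 3), range(3, 6), range(6, 9)]
with ThreadPoolExecutor(max_workers=3) as pool:
    command_lists = list(pool.map(record_command_list, range(3), work))

# Serial submission: the driver consumes the lists one after another,
# preserving the order in which they were handed in.
submission_queue = [cmd for cl in command_lists for cmd in cl]
print(len(submission_queue))  # 9 commands, recorded in parallel
```

The expensive part (recording) is parallelized, while the hand-over to the GPU stays a simple serial queue.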
We only spoke about many meshes with the same material parameters (render state). But what happens when you want to render meshes with different materials?
3. Many Meshes and Materials
Flush the pipeline.
“When changing the state, there is sometimes a need to wholly or partially flush the pipeline. For this reason, changing shader programs or material parameters can be very expensive […]” [b01 page 711/712]
You thought it couldn’t get worse? Well … if you have different materials on different meshes, you may introduce additional setup time on both the CPU and the GPU: you set a render state for the first mesh, command it to be rendered, set a new render state, command the next mesh rendering, and so on.
I colored the “change state” commands red because a) they can be expensive and b) it gives a better overview.
Setting the render state sometimes results in a “flush” of some parts of the pipeline (not always; it depends on which parameters you change and on the available units of the GPU). This means: every mesh currently being processed by some hardware units (with the current render state) has to be finished before new meshes can be rendered (with the new render state). It would look like the image above. Instead of feeding in a huge number of vertices (e.g. when you combine several meshes with the same render state – an optimization I’ll explain later), you would render only a small amount before changing the render state, which – this should be clear by now – is a bad thing.
This “rule” changes of course when we talk about combining several small meshes into one big mesh (we’ll look at this in a second).
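To make the cost tangible, here is a toy sketch that counts how many “set render state” commands a given draw order costs; the meshes and materials are invented. Sorting draws by material is a classic way to reduce these changes:

```python
# Sketch: count how many "set render state" commands a draw order costs,
# assuming one state change whenever the material differs from the last.
def count_state_changes(draws):
    changes, current = 0, None
    for mesh, material in draws:
        if material != current:
            changes += 1  # potentially flushes parts of the pipeline
            current = material
    return changes

draws = [("rock", "stone"), ("bush", "leaf"), ("cliff", "stone"),
         ("tree", "leaf"), ("pebble", "stone")]

naive_changes  = count_state_changes(draws)
sorted_changes = count_state_changes(sorted(draws, key=lambda d: d[1]))
print(naive_changes, sorted_changes)  # 5 state changes vs 2
```

Same draws, same meshes – simply reordering them by material cuts the number of potentially expensive state changes from five to two.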
4. Meshes and Multi-Materials
What happens when not just one material is assigned to a mesh but two or more? Basically, your mesh is ripped into pieces which are then fed piece by piece into the command buffer.
This of course creates one draw call per mesh piece.
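A minimal sketch of that splitting (the mesh data and material names are invented):

```python
# Sketch: a mesh with multiple materials gets split into one piece
# (and hence one draw call) per material.
def split_by_material(triangles):
    pieces = {}
    for tri, material in triangles:
        pieces.setdefault(material, []).append(tri)
    return pieces

# One logical mesh: (triangle index, assigned material)
mesh = [(0, "metal"), (1, "metal"), (2, "glass"), (3, "rubber"), (4, "glass")]
pieces = split_by_material(mesh)
draw_calls = len(pieces)
print(draw_calls)  # 3 draw calls for a single logical mesh
```

One mesh in the editor quietly turns into three draw calls (plus the state changes between them) at runtime.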
5. Single Command Processor
Typical GPUs today have a single command processor / front-end for graphics. That means even with parallel “chunk” submission from the CPU, all graphics-related commands are processed serially before they are distributed to the many parallel units on the GPU. More about how the GPU works can be found in the in-depth part of Book 2.
6. Thin Triangles
The rasterizing process may (depending on the hardware) have performance-related details which I already teased in Book II “3.16 Rasterizing”:
Most current graphics hardware shades 2×2 pixel quads belonging to one triangle (on NVIDIA hardware this would be 8 such quads = 32 threads in one group).
If some of those fragments don’t cover the triangle, their outputs will simply be ignored.
You can imagine why long thin triangles are really bad for the hardware: tons of those quads will have only a single one of their 4 threads actually computing a pixel. It even goes so far that for very costly fullscreen post-processing effects, we don’t render them as two triangles, but as one giant triangle whose corners lie outside of the view, so that no diagonal edge runs through the visible area of the screen.
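A small sketch makes the waste measurable. It estimates shading efficiency as covered pixels divided by threads launched, under the simplifying assumption that the GPU always shades full 2×2 quads:

```python
# Sketch: efficiency = covered pixels / shading threads launched,
# assuming the hardware always launches all 4 threads of a 2x2 quad.
def quad_efficiency(pixels):
    quads = {(x // 2, y // 2) for x, y in pixels}  # quads touched
    return len(pixels) / (len(quads) * 4)          # 4 threads per quad

# A "fat" 8x8 block vs a long, 1-pixel-high sliver of the same area.
block  = {(x, y) for x in range(8) for y in range(8)}
sliver = {(x, 0) for x in range(64)}

print(quad_efficiency(block), quad_efficiency(sliver))  # 1.0 vs 0.5
```

Both shapes cover 64 pixels, but the sliver touches twice as many quads, so half of the launched shading threads do work that is thrown away – and real thin triangles can be far worse than this axis-aligned example.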
7. Useless Overdraw
A lot of performance may be wasted when polygons are rendered with soft alpha and big areas of the texture are 100% transparent. This can happen when you have a branch/leaf texture, or when you use a full-screen quad to render a vignette (which only darkens the image in the corners).
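As a rough sketch of how much fragment work is thrown away (the tiny alpha mask below is invented; real leaf textures behave the same way at a larger scale):

```python
# Sketch: fraction of fragment work wasted when a quad's texture is
# largely fully transparent (alpha == 0) but still rasterized.
def wasted_fraction(alpha_texture):
    texels = [a for row in alpha_texture for a in row]
    return sum(1 for a in texels if a == 0) / len(texels)

# A 4x4 "leaf" alpha mask: only the centre is opaque.
leaf = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
print(wasted_fraction(leaf))  # 0.75: three quarters of the fragments are discarded
```

Every one of those fully transparent fragments is still rasterized and shaded before its result is thrown away.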
A solution to this problem will be presented in Book IV – “Solutions”.
8. Mobile vs. PC
A lot of mobile devices are good at blending and anti-aliasing, while being more challenged by lots of geometry. Desktop/console GPUs are a bit on the opposite end. The reason is that mobile GPUs use “on-die/on-chip memory” (those tiny caches) as an intermediate frame buffer (the Xbox 360 also had this). Hence they can do blends very quickly and also anti-aliasing at a relatively low performance cost.
However, the amount of memory needed to render in full HD would be way too expensive to have on chip, so they render the frame not in one go but in little tiles (or chunks). The scene is rendered one tile at a time, and after a tile is done, it is copied from the tile cache to the final frame-buffer memory. This is also more power-efficient than writing to frame-buffer memory directly, as desktop GPUs do.
The downside is that they have to process the geometry multiple times, because a mesh may overlap several of the tiles. That means lots of vertices become more costly.
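A back-of-the-envelope sketch of that cost (the 32-pixel tile size is invented; real tile sizes vary by GPU):

```python
# Sketch: on a tile-based GPU, a triangle's geometry work is (roughly)
# paid once per screen tile that its bounding box overlaps.
def tiles_touched(min_x, min_y, max_x, max_y, tile=32):
    tx = max_x // tile - min_x // tile + 1
    ty = max_y // tile - min_y // tile + 1
    return tx * ty

small_tri = tiles_touched(10, 10, 20, 20)  # fits inside one tile
big_tri   = tiles_touched(0, 0, 255, 255)  # spans an 8x8 grid of tiles

print(small_tri, big_tri)  # 1 vs 64: the big triangle is binned 64 times
```

A triangle that stays inside one tile is processed once; a screen-filling one is touched by every tile it overlaps, which is why heavy geometry hurts more on tilers.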
However, this approach works great for UI and text rendering (textured quads) with lots of blending, which is a dominant task for mobile devices.
I hope I could give you a small insight into what is bad about having a lot of meshes and materials. Let’s now look at some solutions, because even though all of this sounds really, really bad, there are beautiful games out there – which means they solved the mentioned problems somehow.
The End.
Hey there, excellent work! I believe you forgot to update the beginning of this book after releasing the new book, because it starts saying this is the second book, heheh.
Thank you so much for sharing stuff in such a fun way!
Thank you very much for the hint! Just corrected it :)
So I’m a bit unclear about the cost of lots of draw calls. Is it that you fill up the command buffer faster, and the driver has to copy the buffer over to kernel space every time it reaches a sufficient size, or is there something else to it? I don’t really get what the overhead is supposed to be.
It’s hard to define a strict rule for the draw-call amount – especially as Vulkan/DX12 allow for a lot of them with their new architecture. Actually, I’m not sure how many draw calls can be referenced in the buffer, and whether it’s a realistic problem that it fills up completely and everything then has to wait until the graphics card has finished its jobs.
This article gives a rough overview of how the pipeline works, and I think everyone has to run some tests on their own project to see where the bottlenecks are. The number of draw calls alone isn’t the sole indicator. But of course, if you have a simple scene and 3000 draw calls, it might be worth a look at where those are going – maybe some assets are poorly optimized :D
Usually the overhead of draw calls is CPU processing: the driver guesses whether you are going to send more data, tries to optimise the resulting command buffer, and removes redundant state changes.
Indirect draw calls exist: you pack multiple draw calls (no state changes) into a single buffer, and it is much easier to optimise. Actually, you can even generate the draw commands on the GPU this way. But even then there is overhead: I recall making an instancing test where I would do a single draw call with many instances, or many draw calls with 1 instance each (both via indirect draws, obviously). The many-draw-call version was 2 or 4 times slower… It could be that the GPU flushes some buffers and repopulates them (instanced data?). It could be synchronisation. It’s not that slow though (probably depends on the implementation); the pure instanced test was bound by primitive generation (2 per cycle on my 1050 Ti). The maximum amount of draw calls? As long as you have memory for it (and it’s less than 2^32, but why would you ever…).
Personally I’m trying to make a renderer that uses the least possible amount of CPU control (just for the sake of doing it): compute-shader GPU frustum culling, draw-call generation, etc. And since Vulkan allows for explicit memory allocation, no redundant data transfer is needed!
Also, there is NVIDIA’s device generated commands extension (https://developer.nvidia.com/blog/new-vulkan-device-generated-commands/), which allows creating indirect draws with state.