Reading the OpenGL backbuffer to system memory

Sometimes you are in the need to read back the OpenGL backbuffer, or other framebuffers. In my case the question was how to read back a downsampled framebuffer resp. its texture to system memory. There are different methods for this and I wrote a small benchmark to test them on various systems.

UPDATE #1: A thing I’ve totally overlooked is that you can actually blit the backbuffer to the downsampled framebuffer directly, saving you the FBO overhead (memory, setup, rendering time). I’ve updated the page and the code reflecting that.
UPDATE #2: The code now works on Windows and Linux (via GLX). Maybe I’ll port it to GLES2 for the Raspberry Pi too… 🙂
UPDATE #3: The code works on the Raspberry Pi now using OpenGL ES 2.0. Reading the framebuffer is quite slow though even with an overclocked device. It’s not the actual reading that is slow, but that we have to render to another framebuffer first and back to the screen. Atm all buffers and textures use RGBA8888, which might slow things down. I’ll try RGB565 in the future. Also there are no PBOs and no glGetTexImage(), so only the glReadPixels method is working… If you have any ideas on how to make the code faster, I’d love to hear them.
UPDATE #4: Tinkered around with color formats a bit and updated the Pi’s firmware and MESA implementation. I’ve tried changing the FBOs color format to 16bit, but that didn’t change much. The secret seems to lie in using a RGB565 EGL display/framebuffer format with no alpha, no depth, no stencil and to remove the glDiscardFramebuffer calls. The FBO backbuffer and the downsampled FBO still have RGBA8888. Together with using a screen size of 640×480 this nearly quadrupled the frame rate. Blitting the FBO to the screen now seems to be much, much faster… I updated the code accordingly and also did some cleanup and found a safer way to get function addresses on Linux systems.
UPDATE #5: This is old. There are new options now, even for OpenGL.

Prequisites

I needed to downsample the framebuffer. This can be done by blitting to a smaller framebuffer that will be read back. For setting up framebuffers, see here. In your rendering loop, do:

// set viewport to window size
glViewport(0, 0, width, height);

// draw stuff here

// blit backbuffer to downsampled buffer
context->glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);
context->glBindFramebuffer(GL_DRAW_FRAMEBUFFER, smallId);
context->glBlitFramebuffer(0, 0, width, height, 0, 0, smallWidth, smallHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

// read back buffer content here (Method #1, #2, #3)
// unbind again
glBindFramebuffer(GL_FRAMEBUFFER, 0);

Blitting the framebuffer using glBlitFramebuffer isn’t much faster than binding a texture and rendering a quad to another framebuffer, but it has much less setup needed and you don’t find yourself setting up projection or modelview matrices and trashing the state.

Method #1 – glReadPixels

The first method is to use a standard glReadPixels. This is the common way to read the backbuffer since ancient OpenGL times. Do (depending on you framebuffer format):

// bind downsampled buffer for reading
glBindFramebuffer(GL_FRAMEBUFFER, smallId);
glReadPixels(0, 0, smallWidth, smallHeight, GL_RGBA, GL_UNSIGNED_BYTE, downsampleData);
// unbind buffer again 
glBindFramebuffer(GL_FRAMEBUFFER, 0);

The data will end up in downsampleData in system memory. Note that you need to allocate space for that data before!

Method #2 – glGetTexImage

The second method does not read the actual buffer, but the texture attached to it. That can be done using glGetTexImage. Do (depending on you framebuffer format):

// bind downsampled texture
glBindTexture(GL_TEXTURE_2D, downsampleTextureId);
// read from bound texture to CPU
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, downsampleData);
// unbind texture again
glBindTexture(GL_TEXTURE_2D, 0);

The data will end up in downsampleData in system memory again. Note that you need to allocate space for that data before!

Method #3 – PixelBufferObjects

This is the most complicated, but a slightly faster method than the previous two. You read the downsampled framebuffer using glReadPixels again, but now you read it to a PixelBuffer object asynchronously. For that you set up two PBOs and alternate between them so the GPU can go on rendering to the other while you process the first. The download of the data to system memory happens using DMA and does not block the CPU. To do this, first start up creating two PBOs:

int readIndex = 0;
int writeIndex = 1;
GLuint pbo[2];

// create PBOs to hold the data. this allocates memory for them too
glGenBuffers(2, pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[0]);
glBufferData(GL_PIXEL_PACK_BUFFER, smallWidth, smallHeight * smallDepth, 0, GL_STREAM_READ);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[1]);
glBufferData(GL_PIXEL_PACK_BUFFER, smallWidth, smallHeight * smallDepth, 0, GL_STREAM_READ);

// unbind buffers for now
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

Now that you have created the PBOs, go to your rendering loop and there you read back data to the PBOs there.

// bind downsampled fbo
downsample->bind();
// swap PBOs each frame
writeIndex = (writeIndex + 1) % 2;
readIndex = (writeIndex + 1) % 2;
// bind PBO to read pixels. This buffer is being copied from GPU to CPU memory
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[writeIndex]);
// copy from framebuffer to PBO asynchronously. it will be ready in the NEXT frame
glReadPixels(0, 0, downsample->getWidth(), downsample->getHeight(), GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
// now read other PBO which should be already in CPU memory
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[readIndex]);
// map buffer so we can access it
downsampleData = (unsigned char *)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (downsampleData) {
    // ok. we have the data now available and could process it or copy it somewhere
    // ...
    // unmap the buffer again
    downsampleData = nullptr;
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
// back to conventional pixel operation
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
// unbind downsampled fbo
downsample->unbind();
// now in the next readIndex and writeIndex will be swapped and we'll read from what is currently writeIndex...

The data will end up in downsampleData in system memory. This method seems to be slightly faster on some systems, but gain is highly dependent on the frame rate.

Results

The results depend on the system, method used and read color format. Here’s a table showing my benchmarking results. The first column is regular rendering to the backbuffer, no FBOs involved. Then follow the tests, every test first without actually reading the buffer, just rendering to the downsampled FBO. Then two tests reading the downsampled buffer to system memory using different formats:

System	No FBOs	#1 no reading	#1 BGRA	#1 RGBA	#2 no reading	#2 BGRA	#2 RGBA	#3 no reading	#3 BGRA	#3 RGBA
Intel Core i7 960 Nvidia Quadro 600, Driver 310.90 OpenGL 4.3, 640×480	0.39	0.54	0.70	0.67	0.54	0.88	0.87	0.55	0.64	0.55
Intel Core i5 480M AMD Radeon 5650M, Driver Catalyst 13.1 OpenGL 4.2, 640×480	0.30	0.41	0.80	0.77	0.41	0.81	0.81	0.42	0.51	0.49
Intel Core i5 480M Integrated HD Graphics, Driver 8.15.10.2827 OpenGL 2.1, 640×480	0.73	0.76	1.75	1.52	0.78	5.26	2.56	0.79	1.77	1.70

Windows 7 x64, frame time in ms, vsync off, data size 160x90x32bit

System	No FBOs	#1 no reading	#1 BGRA	#1 RGBA	#2 no reading	#2 BGRA	#2 RGBA	#3 no reading	#3 BGRA	#3 RGBA
Intel Core i5 480M Integrated HD Graphics, Driver Mesa DRI Intel Ironlake Mobile OpenGL 2.1, GLX 1.4, MESA 9.0.2, 640×480	1.13	1.46	3.25	16.91	1.97	3.40	16.77	1.89	4.22	17.05
Intel Core i3-2120T Integrated HD Graphics 2000, Driver Mesa DRI Intel Sandybridge Desktop x86/MMX/SSE2 OpenGL 3.0, GLX 1.4, MESA 9.0.2, 640×480	0.78	1.14	1.30	1.77	1.13	1.29	1.55	1.13	1.30	1.77

Ubuntu 12.10 x64, frame time in ms, vsync off (“export vblank_mode=0”), data size 160x90x32bit

System	No FBOs	#1 no reading	#1 BGRA	#1 RGBA	#2 no reading	#2 BGRA	#2 RGBA	#3 no reading	#3 BGRA	#3 RGBA
Intel Core i5 480M Integrated HD Graphics, Driver Mesa DRI Intel Ironlake Mobile OpenGL 2.1, GLX 1.4, MESA 17.0.7, 640×480	1.08	1.04	1.48	1.42	1.03	1.44	1.42	1.04	1.07	1.07

Ubuntu 16.04 x64, frame time in ms, vsync off (“export vblank_mode=0”), data size 160x90x32bit

System	No FBOs	#1 no reading	#1 BGRA	#1 RGBA	#2 no reading	#2 BGRA	#2 RGBA	#3 no reading	#3 BGRA	#3 RGBA
Intel Core i7 7500U Mesa DRI Intel(R) HD Graphics 620 (Kabylake GT2) OpenGL 3.0, GLX 1.4, MESA 11.2.0, 640×480	0.20	0.22	0.28	0.29	0.22	0.29	0.28	0.22	0.25	0.24

Ubuntu 16.04.3 x64, frame time in ms, vsync off (“export vblank_mode=0”), data size 160x90x32bit

System	No FBOs	#1 no reading	#1 BGRA	#1 RGBA	#2 no reading	#2 BGRA	#2 RGBA	#3 no reading	#3 BGRA	#3 RGBA
Raspberry Pi OpenGL ES 2.0, EGL 1.4, 640×480	2.37	4.49	4.54	6.25	–	–	–	–	–	–

Raspian (Updates: March 25th, 2013), frame time in ms, vsync off (eglSwapInterval(0)), data size 160x90x32bit

Source code

The source code, a CMake build file and the benchmark results can be found on GitHub and building should work at least on Windows 7, Ubuntu 12.10 (via GLX on X11) and on a current Raspian for the Raspberry Pi (via EGL). The beef is in the Test_… classes. To do a benchmark, run the application from the command line and wait for it to finish. It’ll spit out some frame time values for the different methods and color formats.
I’d love to get your feedback on how else to read to system memory or maybe speed up the methods. Just drop me a comment.

11 thoughts on “Reading the OpenGL backbuffer to system memory”

jdobmeier says:

March 10, 2013 at 8:55 pm

With some SoCs, in particular the RPi, you can get true DMA as opposed to mere memory mapped I/O. A call to glMapBuffer returns a pointer to the actual buffer in GPU address space (since all memory is, in principal, mutually accessible via the front side bus) so you can avoid even the latency associated with transfer over the PCI bus. This \”issue\” on Github demonstrates one method: https://github.com/raspberrypi/firmware/issues/85 There is also a lot of info on the bare metal section of the RPi forum. Also here: http://www.cl.cam.ac.uk/freshers/raspberrypi/tutorials/os/screen01.html If you do end up porting what you have above I would be willing to help test. I am also working toward getting this same functionality implemented for the Pi to support some GPGPU experiments.

LikeLike

Bim says:

March 10, 2013 at 10:11 pm

Sounds nice. Thanks for the info!There's a reason I'm actually trying this is (as you can grasp from the blog posts) – to get XBMC to render all GL-rendered stuff to my ambilight via boblight (not only videos). I threw together those examples to convince the XBMC people (and myself) that performance is not an issue.I'm currently porting to the PI, but wasn't familiar with EGL before, there's no glBlitFrameBuffer, eglGetProcAdress doesn't work properly, I'm trying out new things and so on… 🙂 So it takes time…You should be able to drop me an Email now if you're interested. I can send you the code package when it's done. Or just check the page from time to time. No deadlines though 😉

LikeLike

Bim says:

March 19, 2013 at 3:03 pm

@jdobmeier: The test code is working, but the problem is that glMapBuffer only supports vertex or index buffers, no PBOs or other stuff AND only supports writes, so even rendering to a vertex buffer is of no use. All the examples you mentioned show that reading the actual framebuffer (what is on screen) is possible, but not just any FBO. The trick in my code is supposed to be that the more powerful GPU does the downsampling and frees the CPU from that task… So your approach actually doesn't make sense in my case…If you have any idea on how to get the address of an FBO attachment that would be great though.

LikeLike

Bim says:

March 19, 2013 at 3:08 pm

Btw. The documentation of the GL_OES_mapbuffer extension is here: http://www.khronos.org/registry/gles/extensions/OES/OES_mapbuffer.txt

LikeLike

Unknown says:

April 12, 2013 at 4:17 pm

hello,Nadnerb from raspberry pi forum made a guide on setting up boblight on a Raspberry Pi with, hopefully, every step you need to take to get it up and running.This includes the hardware and software elements and uses 50 WS2801 LEDs.here after is the link:http://bit.ly/pi_boblight

LikeLike

Bim says:

April 12, 2013 at 4:41 pm

Thanks for the info, but it does not seem like the boblight forks or the omxplayer thingie capture from OpenGL, which is what I wanted…I wanted boblight not only when playing videos, but also in the menu, visualization etc.

LikeLike

D. Cerisano says:

December 13, 2015 at 1:36 pm

Not sure why, but following this thread just reduced my GL to H264 transcoder to almost nothing. Time for bed.

LikeLike

Anton says:

December 1, 2016 at 1:59 pm

Why not do glReadPixels after glUnmapBuffer? Same speed, minus one buffer.

LikeLike

Bim says:

December 3, 2016 at 11:39 am

Not sure what you mean. The idea is to hide the latency of copying to CPU memory, and not stall the CPU and GPU. Therefore 2 buffers are used, one that is used to copy from GPU to CPU memory asynchronously (writeIndex) and the other that is already copied and can be read from (readIndex). These buffers are swapped every frame.With only one buffer you'd stall the GPU.You are right, whether this is a real gain depends on your GPU + driver + system and probably on how long your rendering takes…

LikeLike

Anonymous says:

September 13, 2017 at 4:45 am

Thanks for the good info! Read through the code, couldn't get it to work on 16.04, but I'm running amdgpu pro drivers… get this error:build$ ./read_test GLX is supported.XF86 VideoMode extension version is 2.224 video modes found.20 suitable framebuffer modes found. Using config #0: Color buffer: R8G8B8A8, no multisampling Depth buffer: D24S0 Double bufferingglXCreateContextAttribs is available.Trying to get an OpenGL 3.0 context… worked.Got a direct context.Failed to bind OpenGL function \”glXSwapInterval\”!Failed to get all function bindings!Segmentation fault (core dumped)I'll look around and see if I see something simple. Had a question, I've been working with GLFW, GLEW, OpenCV and sometimes OpenCL. I've found a few work arounds, but nothing really direct to render to the backbuffer and use that rendering in a OpenCV UMat (image file) that resides on the GPU still. I'm trying to avoid the GPU to CPU transfer, since I plan on doing some post processing on the GPU with the CV library… so I often have to transfer it right back to the GPU to process it. Thanks in Advance!

LikeLike

Bim says:

September 13, 2017 at 6:01 pm

Not sure where it crashes by looking at the output, but if you find out, pull requests are welcome :)Not sure how OpenCV works nowadays, but back when I was using it, it was CPU-only. If it supports OpenCL now, sharing OpenGL textures with OpenCL is possible (see: https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/gl_sharing.html and https://stackoverflow.com/questions/8824269/gl-cl-interoperability-shared-texture.Best) of luck.

LikeLike

Reading the OpenGL backbuffer to system memory

Prequisites

Method #1 – glReadPixels

Method #2 – glGetTexImage

Method #3 – PixelBufferObjects

Results

Source code

Published by HorstBaerbel

11 thoughts on “Reading the OpenGL backbuffer to system memory”

Leave a comment Cancel reply

Prequisites

Method #1 – glReadPixels

Method #2 – glGetTexImage

Method #3 – PixelBufferObjects

Results

Source code

Share this:

Related

Published by HorstBaerbel

11 thoughts on “Reading the OpenGL backbuffer to system memory”

Leave a comment Cancel reply