Rendering 18,000 videos in real-time with Python

The Scale of The Problem

"An image, but every pixel is actually a video."

It seems simple enough. But let's look at the numbers:

The image in the video above is 1536 x 864 pixels.
It uses 18,000 colour-matched video clips analysed from 121 sitcom episodes and a total of 858,000 different frames of video.
At 1024x level of detail shown in the video, you're looking at 2513 gigabytes of RAM required to render a video that is 2 gigapixels in size.
Even if you had enough RAM to store all the videos, relying on raw Python to render this image would take 11 minutes per frame.

The solution? We need to stop thinking like a video player, and start thinking like an open world video game.

Technique 1 - Mip-mapping & levels of detail

A screenshot from "Assassin's Creed: Unity", showcasing Paris at a 1:1 scale.

Open-world games boast incredible amounts of detail, often recreating entire cities that can be fully explored by the player. The first thing we can learn from these games is how they handle distance.

In the screenshot above, the buildings further away from the player aren't treated the same as the one the player is standing on. The game engine measures how big the building is on the screen, and replaces it with a lower quality version if it's small enough.

We can do the same thing with our mosaic. When we're zoomed out, we don't need videos that are 1024 pixels wide. We can replace them with lower quality tiles (called "mips") that are as small as 1x1, and the difference isn't noticeable - because we only need to generate as much detail as can fit on the screen.

This is the first optimisation we can apply, and it speeds up the rendering by 1024x, allowing us to generate 2 frames per second. Progress?

Technique 2 - Culling

Demo video with intermediate scaling turned off to showcase mip-mapping and view culling.

Another trick we can learn from video games is by analysing what's not on screen. In an optimised renderer, there's no point doing work for things the player isn't going to see. If it's not on screen, it gets ignored, and the GPU can work on something else.

We can do something similar with our renderer. In the video above, I've turned off the intermediate scaling between mip levels to display exactly what gets generated by the renderer as you zoom in. You'll notice that the window gets smaller and smaller each time, with the tiles doubling in size when the window gets small enough. Eventually, the tiles become as big as the window itself.

This follows from our initial observation - we only need to render what's on screen. Everything else gets ignored. This saves us exponential amounts of processing power - theoretically allowing for zoom levels as big as the screen will allow.

Technique 3 - Bypassing Python

So far, we've solved the problem of managing large zoom levels, but Python will still struggle with rendering a zoomed out 1080p image, even if the mips are a single pixel each. A standard 1080p image consists of 2 million pixels - that's 2 million for-loop iterations per frame. Python is an interpreted language with heavy overhead and no just-in-time compilation - so it'll struggle at real-time framerates (60 frames per second minimum).

Luckily, Python has a great ecosystem designed around bypassing Python itself for extra speed. We can use PyGame to represent our interactive canvas as an array of numbers instead, and Numba to copy bits over to our array as fast as possible, by compiling our loop into machine code and automatically spreading it over all of our CPU cores.

@numba.jit(nopython=True, nogil=True, parallel=True, cache=True) 
def draw_fast_static( x_start: int, ... render_array: np.ndarray):
for y in numba.prange(max(y_start, 0), ...):
for x in range(max(x_start, 0), ...):
match_id = match_list[y, x]
# sparse to dense mapping
mapped_id = tile_map[match_id]
render_array[
  (x - x_start) * tile_size : (x - x_start + 1) * tile_size,
  (y - y_start) * tile_size : (y - y_start + 1) * tile_size
] = tile_store[mapped_id]

Our rendering loop boiled down to essentially a fragment shader written in Numba.

This gives us a 100x speed increase over our raw Python loop, allowing us to render our mosaic at 120 frames per second.

In a real video game, they use the GPU for thousands of times more performance for an embarrassingly parallel task such as tile-based rendering, since GPUs have thousands of specialised cores, while consumer CPUs only have a few cores since they're optimised for more general-purpose work. However, we already have our mosaic rendering at 120 frames per second, and keeping it CPU-only greatly simplifies the requirements for our biggest optimisation yet.

Technique 4 - Texture Streaming

So far, we've solved the FPS problem, but we still haven't solved the RAM problem. We can go back to our original insight about culling what's off screen, and look at it from another angle. Now that we know exactly what's on screen at any given moment, we only need to worry about loading in tiles that the user will see.

However, as we zoom in, we're faced with a problem - we're storing all of our tiles in arrays, and an array needs all of its space upfront. At 1x1, this is 54 kilobytes. At 2x2, 200 kilobytes. But as you zoom in, you'll find yourself allocating terabytes, because you need to make space for all tiles, since you don't know which one the user will zoom in on.

Technique 4a - Dynamic Allocation

We can solve our array allocation problem by using a sparse data structure instead. The one I settled on is the Numba typed dictionary since it works inside our hot loop and only needs as much RAM as the tiles inside of it.

It's not free, though - a hashmap lookup inside our rendering loop is 3x slower than an array lookup. There are more advanced ways of doing virtual texturing, including the techniques employed on actual GPUs (since they can't do hashmaps), but for now we can settle with just splitting our streaming cache into 2 distinct pools - a "fast" pool using fixed arrays, and a "smart" pool using dynamic hashmaps.

The way I settled on deciding the split is a bit arbitrary.

available_ram = get_system_ram() * 0.75
for size in power_of_two_sizes:
# 2 arrays at current size + tile dictionary at 4x(default)
# the screen resolution * avg no. of frames per tile
required = 2 * total_frames * size * size * 3 + 
3 * prefetch_size * min(avg_tile_len, max_tile_len) *
display_res[0] * display_res[1] * 3
if available_ram - required >= 0:
tile_stores[size] = TileStore(smart=False)
available_ram -= required
else:
tile_stores[size] = TileStore(smart=True)

The split is essentially decided greedily with a very generous bias towards dynamic allocation.

Technique 4b - LRU Caching

How the renderer behaves when told to minimise RAM usage at all costs. Notice the debug overlay, with brighter pixels corresponding to higher resolution mip levels. As the image is zoomed in, the image gets sparser as low resolution mips off screen are freed to make room for higher resolution ones on screen. As the image is zoomed out, the entire image gets duller as higher resolution mips are discarded to make room for lower resolution ones covering the whole screen instead.

Dynamic allocation solves the accumulation problem, meaning we start off with a mosaic that only allocates as you zoom in, minimising startup RAM costs. However, we need to go one step further to prevent crashes and constrain RAM usage - we need to delete tiles to save space.

A common solution is to simply free the least recently accessed tile. We can use an OrderedDict for relatively fast bookkeeping, and with a large enough RAM buffer the amount of thrashing is reduced. This works, but we can go one step further to really ensure a smooth mosaic streaming experience.

Technique 4c - Fine-tuned heuristics

We can further tune our algorithm with number of optimisations that take advantage of RAM beyond just "allocate what's on screen, free what's not":

Prefetching: H.264 video is very slow to seek and decode, but the decoded frames are cheap to scale. We can decode each frame once, at a given multiplier of the current mip resolution, and then downscale it to generate mips at the level of zoom the user is viewing at. I settled on 4x as a balance between speed and RAM.
Layered freeing strategy: Mips can be downscaled from a larger tile, but not vice versa. We can exploit this by tuning our LRU strategy to prioritise tiles that have a bigger sibling over ones that don't.
Balance the size of our buffer: Freeing too aggressively causes tiles to disappear as soon as they go off screen, which is not ideal when they take so much time to load back in. Increasing the size of our RAM buffer increases the chance of a cache hit, but increases RAM usage. A compromise is to stress-test the algorithm and measure metrics where the pop-in starts getting "acceptable" - a subjective threshold.

Conclusion

After applying all of these techniques, let's look back at the original set of problems:

We needed 2 terabytes of RAM for the original video, now we can get by comfortably (relatively low amounts of pop-in) with a buffer of 6 gigabytes.
We can fake 1000x detail using only a fixed number of pixels on screen at any time.
We were rendering a frame every couple of minutes, now we're rendering 120 per second.
And technically, we're still doing it all in Python.

Bonus: Video Mode

We have a renderer that works on static photos, but can we make it work with videos? Thanks to Numba and the fact that it can release the Global Interpreter Lock, we can.

Using Python as the glue, we can have:

One thread decoding video into a shared buffer
One thread taking decoded video frames and generating mosaics out of them in real-time
One thread monitoring pixels on screen to stream in tiles
And one thread rendering and responding to user input

All four threads, writing to shared memory, all working together in lockstep, to achieve this marvel of computing:

An episode of Parks & Recreation, but each pixel is replaced with a frame from an episode of Parks & Recreation.

Read more about the custom UI engine I originally designed for this project.