session replay to mp4, and fast
Rob Pruzan 6/12/25
First I want to quickly define what a session replay is on the web, since it's surprising it exists at all.
If you want to record a user's screen to understand how they use your website, you might look for a browser API that gives you a recording of their screen. One of those does exist- it's called getDisplayMedia,
and it has the exact capabilities you would need. But it has one big problem: you need to ask for permission before recording the user's screen (reasonably?)

Almost no user will grant screen recording permissions for analytics purposes, and even asking would raise a lot of questions. A logical next step from here is to give up on getting a full screen recording; the permission popup is a deal breaker.
You definitely still want some data about how users are using your website- like how frequently a feature is being used. So you track all the interactions users make on your website. You may also be interested in the order in which users use features, so you track the interactions as a sequence of events over the period they are using your website.
[{type: 'clicked', element: {...button info}, timestamp: '0:01'}, ...]
Now that you have all this data, it would be nice to visualize it (it's hard to interpret a long stream of events for each user). One clever thing you can do is track the screen position of the elements interacted with, and reconstruct a lower resolution version of the user's session on your website. Then you can observe how the website changes over the course of the session.

The more data you collect, the higher resolution you can make the "lower resolution" replay- you can collect the color of the button clicked, the CSS that lays out the page, the scroll position of the page, and then apply all that data to your fake version.
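To make that concrete, here's a rough sketch of what a richer event might carry- the field names are my own, not any particular library's schema:

```ts
// Hypothetical event shape: the more fields you capture per interaction,
// the more faithfully you can redraw the page later.
type ReplayEvent = {
  type: 'click' | 'scroll' | 'input' | 'mutation';
  timestamp: number; // ms since the session started
  target?: {
    selector: string; // e.g. 'button#checkout'
    rect: { x: number; y: number; width: number; height: number };
    styles?: Record<string, string>; // whichever computed CSS you chose to capture
  };
  scroll?: { x: number; y: number };
};
```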
As it turns out, if you collect enough data you can reconstruct a replay that's nearly identical to what the user saw on their screen- all without asking for permission. Doing this without asking the user is a little questionable, but it's tough to draw a line between "collecting button clicks" and "reconstructing a pixel-perfect replay of the user's session", so nobody has drawn it (lol)

Now that the context is out of the way, the problem we're trying to solve is how to turn this array-of-user-events:
[{type: 'clicked', element: {...button info}, timestamp: '0:01'}, ...]
into an mp4. The conventional way to replay these events, in packages like rrweb, is to make a website that updates according to the events. You change the fake website the same way the user's real website was changed.
Unfortunately, there's no library that can re-encode these events into an mp4 directly, as that library would have to contain a spec-compliant browser to correctly render the content.
So, we will have to replay the events the conventional way: with a browser, recording it as it renders the website over time. This is not a very hard problem- you can use a headless browser, like puppeteer, record it, and then save the recording to disk as an mp4. This is the approach provided by the most popular session replay library- rrweb.
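As a rough sketch of that conventional pipeline- assuming a locally hosted page that mounts a replayer and exposes a hypothetical startReplay() helper (names here are illustrative, not rrweb's API):

```ts
import puppeteer from 'puppeteer';

// Conventional approach: play the events back at 1x speed and screenshot as it plays.
// The wall-clock cost scales with the session length- a 20 minute session takes
// roughly 20 minutes to capture.
async function replayToFrames(events: unknown[], sessionMs: number, fps = 30) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://localhost:3000/replayer'); // page hosting the replayer
  await page.evaluate((evts) => (window as any).startReplay(evts), events);

  const frames: Uint8Array[] = [];
  const frameInterval = 1000 / fps;
  for (let elapsed = 0; elapsed < sessionMs; elapsed += frameInterval) {
    frames.push(await page.screenshot({ type: 'png' }));
    await new Promise((r) => setTimeout(r, frameInterval));
  }
  await browser.close();
  return frames; // hand these to ffmpeg to encode the mp4
}
```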
But a very unfortunate property of this approach is that the time it takes to get the mp4 scales linearly with how long the session replay is. This means if the session was 20 minutes, it will take ~20 minutes to get the mp4.

This is a huge problem if you need the mp4 immediately- for example, if you are vibe coding with an LLM and want to send a video of what's happening on your screen. No LLM can accurately interpret session replays- that would require a spec-compliant browser encoded in its weights. But their tokenizers can very much understand video!
Even if we must send video, it wouldn't be reasonable to ask the user to wait minutes for a video to be prepared, so we need a solution.
As with any performance problem, it's helpful to visualize where time is being spent during the routine:

There are 2 potential paths here: make the bars smaller, or parallelize the work.
It seems weird to say we want the user session to be smaller, so let's define the problem better.

We need to minimize the time between when we start recording and when we get the recording.
To do this, maybe we can replay the session at a higher playback speed? Let's say 60x speed. That way a 1 minute session replay could be recorded in just a second.
Assuming it was even possible to update the DOM this fast, you would still need to capture the browser's screen output to turn it into an mp4. If you want to maintain 30fps, that means you need to capture at 1800fps. We aren't limited by our monitor's refresh rate since we aren't drawing to an actual screen, but we still need to go through the browser rendering pipeline, which will take on the order of milliseconds for any non-trivial operation. This path will likely bear no fruit.
We can't make the green bar shorter, and we can't make the red bar short enough to meet our time budget. So, next up is path 2- parallelization.
The factor that decides whether you can parallelize something is whether its operations are independent of each other. So, let's try to break up a hypothetical replay into atomic tasks, and see which, if any, are independent:

Now let's define which operations depend on each other with arrows:

To record any event, we need to wait for that event, and all previous events, to happen. To avoid violating our dependency graph, we just need to make sure every dependency happens before our operation (recording an interaction up to a certain timestamp).
We can't parallelize any "RECORD" tasks with each other, as each depends on the previous record task finishing, but we absolutely don't need to wait for the last user event (click button) to record the first. Or, more generally, we don't need to wait for any future events to record what's already happened. This means we can shift all our recording tasks over on the timeline without violating our dependency rule.

That's a lot better, but let's confirm our algorithm scales infinitely- that it runs in O(1) with respect to the session replay size. An easy way to do that is to double our session replay and observe how long our algorithm takes to execute.

Since we only care about the time after we "Start Recording", we can definitely say the algorithm's execution time is unchanged, so it seems we have something we can work with.
Now, for how to actually implement such a system.
We need to receive events about the user's session in real time for this to work- since we need to play the interactions on a remote browser and record it at the same time.
If we are using rrweb, we can just use its emit API to send events to our remote server.
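On the client that might look roughly like this (the WebSocket endpoint is a placeholder for wherever your server listens):

```ts
import { record } from 'rrweb';

// Stream every rrweb event to the remote replay server the moment it happens.
const socket = new WebSocket('wss://replay.example.com/events'); // placeholder URL

record({
  emit(event) {
    // Events arrive in order; the server feeds them straight into a live replayer.
    // (A real client would buffer until the socket is open and handle reconnects.)
    socket.send(JSON.stringify(event));
  },
});
```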
Our remote server will be hosting a headless browser with an rrweb replayer instance that can interpret these events.
If we run the replayer with liveMode on, we just need to feed new events to the instance as they come in from the user:
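A minimal sketch of the replayer side- liveMode, startLive, and addEvent are rrweb's live mode API, while the WebSocket stands in for however your server forwards events into the headless browser page:

```ts
import { Replayer } from 'rrweb';

// Runs inside the headless browser page on the server.
// Start an empty replayer in live mode, then append events as they stream in.
const replayer = new Replayer([], { liveMode: true });
replayer.startLive();

// Placeholder wiring: a WebSocket here, but pushing events in with
// puppeteer's page.evaluate works just as well.
const socket = new WebSocket('ws://localhost:8080/events');
socket.onmessage = (msg) => {
  replayer.addEvent(JSON.parse(msg.data));
};
```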

To finish the loop, we need to incrementally record the browser window as more events come in.
One foolproof way is to just screenshot the page really, really fast, collect the PNGs into an array, and when you need the video, use ffmpeg to convert the PNGs into an mp4. There also seems to be a puppeteer API to get a proper video stream from the browser that I haven't tried yet, but that should allow for higher frame rates vs screenshotting.
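A minimal sketch of that screenshot-then-encode loop, piping the PNGs into ffmpeg over stdin (frame rate, paths, and the stop flag are all arbitrary choices, and stdin backpressure is ignored for brevity):

```ts
import { spawn } from 'node:child_process';
import type { Page } from 'puppeteer';

const frames: Uint8Array[] = [];
let stopped = false;

// Continuously screenshot the replay page while events stream in.
async function captureLoop(page: Page, fps = 10) {
  while (!stopped) {
    frames.push(await page.screenshot({ type: 'png' }));
    await new Promise((r) => setTimeout(r, 1000 / fps));
  }
}

// The moment the mp4 is requested, pipe everything captured so far into ffmpeg.
function framesToMp4(outPath: string, fps = 10) {
  const ffmpeg = spawn('ffmpeg', [
    '-y',
    '-framerate', String(fps),
    '-f', 'image2pipe', // read a stream of images from stdin
    '-i', '-',
    '-c:v', 'libx264',
    '-pix_fmt', 'yuv420p', // widest player compatibility
    outPath,
  ]);
  for (const frame of frames) ffmpeg.stdin.write(frame);
  ffmpeg.stdin.end();
  return new Promise<void>((resolve) => ffmpeg.on('close', () => resolve()));
}
```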
With this done, we have our system implemented e2e. As the user is interacting with the website, events are sent live to our remote browser. Our remote browser is being updated live to mirror the user's browser, and is being recorded as this happens. The second we want the mp4, we have all the data we need without having to ask our remote browser to do anything.
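Serving the finished video could then be as simple as a hypothetical endpoint like this- the route and file path are made up, and it reuses framesToMp4 and the stop flag from the sketch above:

```ts
// Hypothetical Bun endpoint: encode whatever frames we have so far and return the mp4.
Bun.serve({
  port: 3001,
  async fetch(req) {
    if (new URL(req.url).pathname === '/export') {
      stopped = true; // stop the capture loop
      await framesToMp4('/tmp/session.mp4');
      return new Response(Bun.file('/tmp/session.mp4'), {
        headers: { 'Content-Type': 'video/mp4' },
      });
    }
    return new Response('not found', { status: 404 });
  },
});
```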
Show me the code
If you want to start hacking around with this approach, here's a fully self-contained file of the implementation you can run with bun- index.ts
And a sample app that sends replay events to the server and requests an mp4 of the session- https://github.com/RobPruzan/real-time-example
Finally, a demo of all the pieces together
