Real-time depth compositing with an AI assist

Behind the system Wētā FX invented for ‘The Way of Water’.

One of the hardest parts of filming a scene that features both live-action and CG characters is visualizing how they interact. It can be difficult to place real-world objects in the correct screen space in what will eventually be a CG environment.

On James Cameron’s Avatar: The Way of Water, Wētā FX was able to solve this problem–and allow the director to more accurately frame and block scenes that had both live-action and CG characters–by implementing an on-set real-time depth compositing solution.

The aim of the tool was to provide a real-time, in-camera composite in 3D space with pixel-perfect occlusions, without the need for bluescreen or greenscreen. This was made possible by linking two computer vision cameras, calibrated with the main stereo cinema cameras, to a deep learning model trained on thousands of synthetic images generated from character and set scans. The model processed the stereo images to generate a usable depth map.

Using Wētā FX’s proprietary real-time compositing tool, Live Comp, the resulting depth map could be combined with the live-action images and composited with CG elements, correctly occluded, right there on set for the filmmakers.

In this excerpt from issue #11 of befores & afters magazine, Wētā FX virtual production supervisor Dejan Momcilovic and senior researcher Tobias Schmidt break down the system.

b&a: I want to go back to the beginning. What were the earliest incarnations of a real-time depth compositing system at Wētā FX?

Dejan Momcilovic: On The BFG we had some of the first attempts at this in terms of depth compositing and the correct sorting of the objects. BFG was unique for us then because we had a live-action character and a CG character with a lot of interaction. We were capturing BFG, scaling him up, and then having him interact with the character Sophie. At some point you might be looking at the back of BFG, which meant we shouldn’t be seeing Sophie through him. We were just making it work by, say, attaching the plate onto his hand and putting it into the scene, and then compositing so he would properly occlude her with his. It was something we did to just make it work.

Fast forward to Way of Water performance capture, which started in 2017, on September 26th–I happen to remember the date! Even years before that we were talking about a lot of these things and testing redundancy and camera tracking. We knew that the whole thing would be heavy on our tool Live Comp and we needed to update all this tooling, but there was no solution yet for depth compositing.

Jim was quite determined that he wanted a depth comp and there was no such thing. We were kind of phasing in and out considering what could be done and did a few tests with the Z CAM, but the traditional methods were not fast enough and were too easy to break.

Then we started considering an AI approach. The live-action shoot was to begin in March/April 2019. We had an idea that needed to be evaluated and we got Tobias to look into it for us.

Tobias Schmidt: It was very interesting. We actually considered some other approaches first, such as just detecting humans to work out which one is in front or behind, but this would of course only work for humans. In the end, the depth approach with AI was the holy grail. If that worked and worked fast enough, it would not just be limited to objects, but could work out of the box for anything.

b&a: Where did that idea and research take you further?

Dejan Momcilovic: There was a paper that we based our initial work on called StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction. At some point Toby said, ‘I think it could work.’

Tobias Schmidt: We were told by a couple of experts in the field that this cannot work, so it was kind of nice to disprove that.

Dejan Momcilovic: We got something ready and went to LA to show Lightstorm. That worked well, but then we also needed to show Jim. [Senior visual effects supervisor] Joe Letteri had a laptop while they were doing all this other kind of setup and said, ‘Hey Jim, have a look at this.’ It was a clip of me sitting there and then one of the Avatars hugging me. His hands were in front of me, his body was behind me. Jim was like, ‘What’s happening…?’ He said, ‘Alright, this is what I’m asking for!’ From there on it was sold to him. We never looked back.

b&a: Take me through, ultimately, what you devised to enable the live depth comp?

Dejan Momcilovic: On set, we had two computer vision cameras that were calibrated along with the stereo pair of cinema cameras. We put a grid in front of the cinema camera, it took a minute or two to process, and then we had the calibration. This way we knew the relationship between the cinema cameras and our computer vision cameras.

What we derive from the hero camera is the position or the relative position to our system. For other data like the focus distance and zoom–because most of the film was shot on zoom lenses–we were getting this data streamed through a systemizer, part of the Cameron/Pace rig.
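
To make the grid calibration described above concrete, here is a minimal sketch of how a shared target can yield the relationship between two cameras. It assumes each camera has already estimated the pose of the same grid as a 4x4 rigid transform; the function names (make_pose, invert_pose, relative_pose) are hypothetical and this is not Wētā FX's calibration code.

```python
import numpy as np

def make_pose(rotation, translation):
    """Build a 4x4 rigid transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose

def invert_pose(pose):
    """Invert a rigid transform using R^T and -R^T t."""
    rotation = pose[:3, :3]
    translation = pose[:3, 3]
    inverse = np.eye(4)
    inverse[:3, :3] = rotation.T
    inverse[:3, 3] = -rotation.T @ translation
    return inverse

def relative_pose(cine_from_grid, vision_from_grid):
    """Return the transform mapping vision-camera coordinates into
    cinema-camera coordinates, given each camera's pose of the same grid."""
    return cine_from_grid @ invert_pose(vision_from_grid)

# Assumed example values: the grid sits 2 m in front of both cameras,
# and the cinema camera sees it offset by 10 cm along x.
cine_from_grid = make_pose(np.eye(3), [0.1, 0.0, 2.0])
vision_from_grid = make_pose(np.eye(3), [0.0, 0.0, 2.0])
cine_from_vision = relative_pose(cine_from_grid, vision_from_grid)
```

Once a transform like cine_from_vision is known, anything measured in a computer vision camera can be expressed in the cinema camera's frame without the grid being in shot.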

Our network was then trained based off of the views from these two cameras to generate the depth in the left or right camera. We were choosing and picking which lens, depending on if we were upside down or above the camera. We had trained models for all these configurations and that would be generated in one of the computer vision cameras, then it would be re-projected into a hero camera. It then filled the gaps that were caused by stereo shadows and then we could obtain depth for every pixel of the hero camera image.
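
The re-projection and gap-filling steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production implementation: it assumes simple pinhole intrinsics (k_src, k_dst), a known rigid transform between the cameras (dst_from_src), and a fill rule that copies the farther of the two nearest valid neighbours on each row, on the reasoning that stereo shadows usually reveal background. The interview does not say how the gaps were actually filled.

```python
import numpy as np

def reproject_depth(depth, k_src, k_dst, dst_from_src, dst_shape):
    """Forward-warp a depth map from a source camera into a destination camera.
    Pixels that receive no sample are left as np.inf (the 'stereo shadows')."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth.ravel()
    valid = np.isfinite(z) & (z > 0)
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])[:, valid]
    # Back-project each pixel to a 3D point in the source camera frame.
    points_src = (np.linalg.inv(k_src) @ pixels) * z[valid]
    # Move the points into the destination camera frame.
    points_dst = dst_from_src[:3, :3] @ points_src + dst_from_src[:3, 3:4]
    points_dst = points_dst[:, points_dst[2] > 0]
    # Project into the destination image and round to the nearest pixel.
    proj = k_dst @ points_dst
    x = np.round(proj[0] / proj[2]).astype(int)
    y = np.round(proj[1] / proj[2]).astype(int)
    inside = (x >= 0) & (x < dst_shape[1]) & (y >= 0) & (y < dst_shape[0])
    out = np.full(dst_shape, np.inf)
    # Z-buffer: when several points land on one pixel, keep the nearest.
    np.minimum.at(out, (y[inside], x[inside]), points_dst[2, inside])
    return out

def fill_gaps(depth):
    """Fill holes row by row with the farther of the nearest valid neighbours."""
    filled = depth.copy()
    for row in filled:  # each row is a view, so edits land in 'filled'
        holes = ~np.isfinite(row)
        if holes.all() or not holes.any():
            continue
        idx = np.arange(row.size)
        left = np.maximum.accumulate(np.where(holes, -1, idx))
        right = np.minimum.accumulate(np.where(holes, row.size, idx)[::-1])[::-1]
        left_val = np.where(left >= 0, row[np.clip(left, 0, None)], -np.inf)
        right_val = np.where(right < row.size,
                             row[np.clip(right, None, row.size - 1)], -np.inf)
        row[holes] = np.maximum(left_val, right_val)[holes]
    return filled
```

Calling fill_gaps(reproject_depth(...)) gives a depth value for every pixel of the destination image, provided each row received at least one sample.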

We would then send this into our compositing software or into our scene. If we had to do transparencies, we had to turn this into a full 3D object so we could see through transparencies and so forth. Or we could do a simpler ‘depth sorting’ in an external application. We had MotionBuilder with our renderer and our whole scene in there, and then we had this external module that was grabbing the images from the render and mixing them by the distance to the camera. That meant we could easily and very efficiently combine the hero image with the full CG content and basically intertwine them correctly based on the depth.
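
The simpler ‘depth sorting’ described above amounts to a per-pixel comparison of distances to the camera. A minimal sketch, assuming both layers come with a depth map at the same resolution and that pixels with no CG content carry an infinite depth; the function name depth_composite and the array layout are illustrative choices, not the external module mentioned in the interview:

```python
import numpy as np

def depth_composite(live_rgb, live_depth, cg_rgb, cg_depth):
    """For each pixel, keep the colour of whichever layer is nearer the camera.

    live_rgb, cg_rgb:     (H, W, 3) colour images
    live_depth, cg_depth: (H, W) distances to camera; use np.inf for 'empty'
    """
    cg_in_front = cg_depth < live_depth
    # Broadcast the (H, W) mask across the three colour channels.
    return np.where(cg_in_front[..., None], cg_rgb, live_rgb)

# Tiny example: a red live-action plate 2 m away and a green CG layer
# that is in front, behind, and absent across three pixels.
live_rgb = np.tile(np.array([1.0, 0.0, 0.0]), (1, 3, 1))
live_depth = np.full((1, 3), 2.0)
cg_rgb = np.tile(np.array([0.0, 1.0, 0.0]), (1, 3, 1))
cg_depth = np.array([[1.0, 3.0, np.inf]])
result = depth_composite(live_rgb, live_depth, cg_rgb, cg_depth)
```

In the example only the first pixel shows the CG layer. A hard per-pixel choice like this cannot represent semi-transparent elements, which is consistent with the point above that transparencies needed the depth turned into a full 3D object.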

Read more in issue #11.