Behind the face replacement work for the ballet dancing scenes in ‘Étoile’.
The Amazon Prime Video series Étoile features elite ballet dancers. However, neither of the main stars of the show, Lou de Laâge nor Ivan du Pontavice, is a professional dancer. So, a face replacement process was employed to swap the dancer doubles’ faces with those of the lead actors for the ballet scenes.
A prominent challenge on the show was the nature of the ballet scenes themselves, which often involved characters spinning at high speed, hair falling over faces and a lot of motion blur. Many ‘deep fake’ solutions are centered on developing models from training data that is largely frontal face material, something that was not always possible to capture in the frenetic dances here. In addition, the ballet dances would ultimately be shown in lengthy takes.
All these considerations came into play for Afterparty VFX and visual effects supervisor David Gaddie, tasked with the face replacement work by production visual effects supervisor Lesley Robson-Foster. The studio built a set of proprietary tools and workflows that combined AI/machine learning, CG proxy builds of the actors’ faces, and compositing to complete around 100 VFX shots, totalling over 20 minutes of face replacement.
The test
As noted, the leads in Étoile needed to pass as professional dancers, even though they were not. In the past, this kind of challenge might have been solved by featuring dancer doubles in wide shots and then cutting to a close-up of the actor. Or, a production may have filmed scenes with the dancer doubles and then used a combination of 2D face replacement and 3D facial animation to replace the doubles. More recent approaches for this kind of task have adopted machine learning techniques.
In the end, a kind of ‘Pepsi challenge’ was initiated. “They did a test with several vendors and an in-house team,” shares Gaddie. “One doing CG, one doing an AI hybrid approach, and one doing conventional 2D. Afterparty was the AI hybrid. I really went into that test feeling fairly unconfident, because everything I’d tested with the tools that existed at the time, and having to do a spinning dancer, well, they were not ideal. Everything back then that was used for training faces didn’t work when the full faces weren’t in the frame.”

Afterparty’s test—something Gaddie says relied upon a “very gaffer-taped together set of tools”—was the one considered to be the most seamless. In making the test, Gaddie had a chance to consider what he thought would be the toughest thing about the face replacements: the heavy amount of movement.
“The face is so much more than the face,” says Gaddie. “The face is how the whole head shape works. It’s the feeling of the skull, the way the neck connects to the body and the way that moves. There are so many little details that make the task more complicated. In our dancing, I knew that we might just see flashes of the faces sometimes. There would be a lot of motion blur and a lot of obstructions and head turns. But you’ve got to make sure that when you see the face it looks completely real, because when it doesn’t look real, it’s incredibly obvious.”
“The biggest thing we learned,” adds Gaddie, “is that the easiest thing to swap is a face that’s facing camera, because you recognize the face. I don’t just mean it from a technical standpoint, I mean it from a perception standpoint. If you see an actor who’s a famous person and you recognize them, there’s no question that’s the actor. But when you see someone in a profile or they’re in the distance and you can’t quite make out their details, then you’re seeing everything else that isn’t just the recognizable attributes of the face. That’s when you start to really notice if the exact skull shape is wrong or if the way their head sits in their neck feels wrong and you start to not recognize the character.”
Afterparty’s initial explorations with open source face swapping software Faceswap proved to Gaddie that whenever there was a head turn or rotation, the software would only deliver a ‘blurry mess’. Indeed, Gaddie’s first thought was to override the automatic face tracking markers and instead do the tracking of the eyes, nose and mouth in Nuke. “I found a way to write a script to get it back into Faceswap so we could get more stable data. That worked really well, except when I put it all back together, I worked out that the machine learning tools that Faceswap was using to stabilize the face assumed a frontal head. When we started getting the back of the head, the head would start shrinking to become a tiny head and then as it turned back, it would grow back to normal heads.”
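To make the idea of that workaround concrete, here is a minimal sketch of the kind of script Gaddie describes: hand-keyed eye, nose and mouth tracks exported from Nuke are converted into per-frame landmark data that a face-swapping pipeline could ingest in place of its automatic face detection. The CSV layout, file names and output JSON schema are hypothetical; Faceswap’s actual alignments format differs and would need its own importer.

```python
# Sketch: convert Nuke tracker exports (feature, frame, x, y) into per-frame
# landmark data for a face-swapping tool. File formats here are assumptions.
import csv
import json
from collections import defaultdict

def load_nuke_tracks(csv_path):
    """Read rows of (feature, frame, x, y) exported from Nuke trackers."""
    frames = defaultdict(dict)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            frame = int(row["frame"])
            frames[frame][row["feature"]] = (float(row["x"]), float(row["y"]))
    return frames

def write_landmarks(frames, out_path, required=("left_eye", "right_eye", "nose", "mouth")):
    """Keep only frames where every feature was tracked, then write JSON."""
    usable = {
        frame: pts for frame, pts in frames.items()
        if all(name in pts for name in required)
    }
    with open(out_path, "w") as f:
        json.dump({str(k): usable[k] for k in sorted(usable)}, f, indent=2)
    return len(usable)

if __name__ == "__main__":
    tracks = load_nuke_tracks("shot010_face_tracks.csv")   # hypothetical export
    count = write_landmarks(tracks, "shot010_landmarks.json")
    print(f"wrote {count} frames of stabilised landmark data")
```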
A different approach was needed.
Gaddie therefore set out to find a better way to stabilize the training footage so that it was consistent, no matter what the head was doing in 360 space. “We made a head geo and used KeenTools to do it. We created a KeenTools head mesh, we tracked that in KeenTools and we then wrote a script that let us take that stabilized and tracked-in 3D space KeenTools head and convert it into tracking data that we could use to import into a training set.”
The idea here was to track the head position in 3D space but then bring it into 2D space, since that’s where the machine learning tools were reading things, i.e. in 2D space. “What that gave us was, for example, when someone’s turned three-quarters, no matter how many times they turned three-quarters, it’s always exactly the same framing in our training set,” explains Gaddie. “We also created another script that let us bring the cleanup tools from the face swapping software we had into Nuke so that we could make adjustments to everything in Nuke. We could realign things in Nuke and we could create our mattes in Nuke.”
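As a rough illustration of that stabilization step, the sketch below takes the 2D projections of a few fixed points on a tracked 3D head mesh (of the kind a KeenTools track could provide per frame) and solves for a similarity transform that pins those points to one canonical framing, so the head always fills the training crop the same way regardless of where it sits in the plate. The point choices, coordinates and file names are assumptions, not Afterparty’s actual tool.

```python
# Sketch: stabilise plate frames so a tracked head lands in a consistent
# training crop. Canonical point positions and inputs are hypothetical.
import cv2
import numpy as np

# Canonical positions (pixels) for three rigid head points in a 512x512 training crop.
CANONICAL = np.float32([
    [256, 200],   # bridge of the nose
    [170, 210],   # left ear anchor
    [342, 210],   # right ear anchor
])

def stabilise_frame(frame_bgr, projected_pts, size=512):
    """Warp one plate frame so the projected head points land on CANONICAL."""
    src = np.float32(projected_pts)                          # (3, 2) points from the 3D track
    matrix, _ = cv2.estimateAffinePartial2D(src, CANONICAL)  # rotation + scale + translation
    return cv2.warpAffine(frame_bgr, matrix, (size, size))

if __name__ == "__main__":
    frame = cv2.imread("plate_1001.png")                            # hypothetical plate
    pts = [(812.4, 390.1), (760.3, 402.8), (871.9, 404.5)]          # projected head points
    crop = stabilise_frame(frame, pts)
    cv2.imwrite("training_1001.png", crop)
```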
Again, this workflow was arrived at for the test. The different tests (by the three teams) had utilized a dancer’s body, meaning the final performance was driven from the dancer. The nuance of what the dancer was doing was carrying across to the actor. But, at some point the question became, how do we get the actor’s performance onto the dancer?
That became Afterparty’s next engineering challenge.
Actor onto dancer
Looking to translate the actor’s performance onto the dancer, Gaddie noticed a happy accident. “We had tested several different varieties of models and one particular model we were testing was not very good at learning in-between poses. For example, if we hadn’t shot material of the actor doing something exactly like the dancer did, it just went very blurry and didn’t convert at all. Other models were being sold to us as, ‘It’s going to actually learn the in-between.’ What we found was that those models that learned the in-between also learned the performance of the dancer. But the models that did a terrible job of learning the in-betweens never strayed from what the actor was doing.”

“A classic example of this was,” continues Gaddie, “if the dancer was talking, but the actor kept a very serious face in character, what I would see is a mask where I could almost feel like the dancer was talking underneath, but the actual face I was putting on never talked—it was only the serious face of the actor.”
The takeaway for Gaddie was that, by training a model with a specific performance, you could force that performance onto the AI double. “That became the foundation of our base concept. The methodology was, we want to get exactly what the actor is doing for a scene performed by the actor in the lighting of the scene. We want to capture that from as many cameras as possible so that we have a chance to get those in-between angles. We don’t want a situation where we don’t get quite the right angle and the AI doesn’t know what to do with it. We want to get as many angles as possible from as many cameras as possible in the lighting with the correct performance from the actor. That became the foundational thing for us.”
During shooting, then, every time production shot a scene, the actor would walk through the scene at around half speed. First the dancer would walk through the scene and then the actor would follow behind the dancer, copying the dancer, keeping in character the whole time. The main camera would follow the actor as best it could, with additional cameras also capturing the moment. This became Afterparty’s main training data, which was used in conjunction with other base training data of the actors filmed in a capture rig.
“The in-scene capture became really critical,” notes Gaddie. “When we got the final ‘fit training’ for each shot [where the model is trained on data it won’t see in the final swap], we said, ‘OK, for this shot, this is the material that matches this shot and we’re going to force the model to only learn that material from that shot so that we only get that performance transferred onto the double.’ That worked most of the time. Sometimes we wanted to change the performance because the dancer was doing something that we didn’t want the actor to do.”
An example of this was for the character, Cheyenne Toussaint, played by Lou de Laâge and dance doubled by Constance Devernay. “Constance had a very different smile than Lou,” points out Gaddie. “Constance had very visible teeth and gums, so when she would smile, you would see her teeth. But Lou doesn’t reveal a lot of teeth when she smiles. So, if you revealed anything of that big toothy smile of the dancer, it did not look like Lou, i.e. Cheyenne, at all. It just suddenly became a different person. We had to really control our dataset for that. We had to continuously prune our dataset to remove anything that looked like an open mouth because it forced the wrong smile.”
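A small sketch of that pruning step, under stated assumptions: keep only the training frames captured for a given shot, and drop any frame whose landmarks show an open mouth, so the model never learns the double’s toothy smile. The manifest layout and the 68-point landmark convention (points 62/66 as the inner lips, 48/54 as the mouth corners) are assumptions about how the data might be stored, not Afterparty’s actual pipeline.

```python
# Sketch: per-shot dataset selection plus open-mouth pruning.
# Manifest format and landmark indexing are assumptions.
import json

MOUTH_OPEN_RATIO = 0.25   # inner-lip gap relative to mouth width; tune per actor

def mouth_is_open(landmarks):
    """landmarks: list of 68 (x, y) pairs in the common dlib ordering."""
    top, bottom = landmarks[62], landmarks[66]
    left, right = landmarks[48], landmarks[54]
    gap = abs(bottom[1] - top[1])
    width = abs(right[0] - left[0]) or 1.0
    return gap / width > MOUTH_OPEN_RATIO

def prune_dataset(manifest_path, shot_id):
    """Return frame paths belonging to shot_id whose mouths are closed."""
    with open(manifest_path) as f:
        frames = json.load(f)   # [{"path": ..., "shot": ..., "landmarks": [[x, y], ...]}, ...]
    return [
        fr["path"] for fr in frames
        if fr["shot"] == shot_id and not mouth_is_open(fr["landmarks"])
    ]

if __name__ == "__main__":
    kept = prune_dataset("cheyenne_training_manifest.json", shot_id="ep08_sc42_0110")
    print(f"{len(kept)} frames kept for fit training")
```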
The 3D side of things
As noted, Afterparty had been obtaining 3D tracking data for each character that required a face/head replacement (as part of stabilizing each head). Since the studio was already tracking every head to get precise head geometry in the scenes, that tracking data also provided the opportunity to bring in a CG version of the head to replace the performance.

For this stage, Afterparty would rely on either performance capture of the dancer or reference from the footage. “So,” says Gaddie, “when the actor did her walkthrough, we basically had the animator follow her performance and match it as closely as they could (since we couldn’t get the actor back for a standard performance capture). It was the animator going through a performance from that walkthrough, matching a performance on a CG double of the actor. That was then what we used for doing the AI swap—basically swapping the actor for the actor.”
Essentially this was the creation of a new model that let Afterparty swap the CG actor with the AI version of the same actor. “What it meant was,” outlines Gaddie, “if there were specific nuances to her performance, the animator could make sure they got those performances. That, combined with our dataset, which was completely trained on the actor’s performance, pushed it into the right place.”
“The truth of the performance was more about the data than about the animation,” adds Gaddie. “Just choosing the right data of the actor doing the performance a certain way with a certain emotion made the dancer feel like the actor doing it with the right emotion and the right performance. This is why I don’t like the term AI—the machine learning is not intelligently working out what the character’s supposed to be doing and giving you a performance. It doesn’t know what a performance is, but if you feed it the right performance, then it knows how to transfer that right performance onto the character. Our system curates the data based on what performance the actor is delivering or what the director is looking for in a scene.”
Another factor was that Afterparty had two models running. One of the models was a very ‘rigid’ model that only did what the actors did, while the other model, as Gaddie describes, “took a little bit more influence from what the dancer did. That gave us a range of performances that I could feed back to Lesley and her team and say, okay, which of these performances feels like the right performance? We coded them all and we had these performance options based on the nuance of the training.”
Comp time
Compositing, done in Nuke, was the final stage of the face replacement process on Étoile. Here, Faceswap passes were tracked, rotoscoped and composited onto the dancer doubles. In fact, this is where Gaddie notes that it was always effectively head replacement, rather than just the face, because there were always considerations of skull shape and hair to deal with.

“Skull shape and hair became one of the biggest comp challenges,” states Gaddie. “Even if you’re using the hair of the double, you’ve got to really reshape everything. You’ve got to go, okay, here’s the head and here is the hair, and you’ve got to align the two things. We trained full heads, and while we didn’t always use the full hair, we always used part of the hair. The important thing was to get the skull shape and the hairlines to match. That involved sometimes stretching the skull out to get the silhouette to match and then bringing in or pushing out the hairlines to make the hairline exactly match between the double and the actor. Most of the work was getting the hairlines to match.”
Compositing also came in for fixing individual frames, where necessary. “You always need to fix frames,” says Gaddie. “Someone’s spinning and you’ve got a bit of an ear and a piece of a nose and you just don’t have all that in the dataset, or as much as you want in the dataset, and when things go wrong, suddenly the swap, as you’re training it, starts growing an eye where the nose is or it just gets confused. So there’s always going to be a weird frame that you have to fix. With all your coaxing and magic, at a certain point you’ve got to just go into comp and you’ve got to make it all work.”
The final episode
All of Afterparty’s work culminated in a significant sequence in the final episode where the performance happens on a dark stage with a moving camera and different lighting conditions. “It’s the climax of the whole season,” outlines Gaddie. “We knew we had to do that with a hybrid approach because Cheyenne had her hair out. Sometimes you only saw an eye or a nose. Sometimes you saw the whole face, but then there were wisps of hair going over her face, so I very quickly knew I couldn’t rely on the AI alone. My intention was never to use the CG alone, it was to AI-swap the CG. We trained a separate model that was CG and AI, and that model had no hair at all. It was just a bald character and we rendered it with different lighting, with shadow lighting and a version that was the spotlight as if cut by hair. It was like doing a bunch of traditional 3D passes that would help us put together a 3D asset in comp.”

“But, what we did instead was run each of those passes through our ML model. We generated outputs that were, say, the shadow pass AI output, the spotlight pass AI output, and then in comp, we could take those layers and bring them together with all the masks we had for the hair, et cetera. That ended up being a really great save for the most complicated scene.”
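In simplified form, the idea is something like the sketch below: each CG lighting pass of the bald proxy head is run through the trained swap model separately, and the resulting AI outputs are recombined with the hair and spotlight mattes, much as the original 3D passes would have been assembled in comp. The run_swap() placeholder, pass names and files are assumptions standing in for whatever inference and compositing tools were actually used.

```python
# Sketch: run each lighting pass through the swap model, then recombine the
# AI outputs with a hair-cut spotlight matte. All inputs are hypothetical.
import numpy as np
import cv2

def run_swap(image):
    """Placeholder for the trained face-swap model's inference on one pass."""
    return image  # the real pipeline would return the AI-swapped version

def combine_passes(shadow_pass, spotlight_pass, hair_matte):
    """Blend the two AI-swapped lighting passes using the hair-cut spotlight matte."""
    matte = hair_matte.astype(np.float32) / 255.0
    if matte.ndim == 2:
        matte = matte[..., None]
    lit = run_swap(spotlight_pass).astype(np.float32)
    shadowed = run_swap(shadow_pass).astype(np.float32)
    out = lit * matte + shadowed * (1.0 - matte)
    return np.clip(out, 0, 255).astype(np.uint8)

if __name__ == "__main__":
    shadow = cv2.imread("ep08_shadow_pass.1001.png")
    spot = cv2.imread("ep08_spotlight_pass.1001.png")
    matte = cv2.imread("ep08_hair_matte.1001.png", cv2.IMREAD_GRAYSCALE)
    cv2.imwrite("ep08_combined.1001.png", combine_passes(shadow, spot, matte))
```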
Gaddie says this approach was the most appropriate for that scene because performance was critical; it had to most closely resemble the actor’s performance. “We worked pretty hard to get the animation to at least look very close to what the dancer was doing. And then of course the AI did the rest of the work. We used both CG-driven AI and dancer-driven AI—whatever worked best—and we had to transition between those different moments. It was a beast, that scene, basically four and a half minutes of faces! It was intense.”





