The volumetric capture crowd tech behind ‘I Wanna Dance With Somebody’

A step-by-step look at how 700 terabytes’ worth of volumetric capture was crafted for the concert crowds in the Whitney Houston biopic.

The history of movie-making is also a history of crowd generation and replication: from using thousands of real extras, to shooting multiple 2D plates and stitching or tiling them together, to building 3D crowd simulations. The VFX filmmakers on Kasi Lemmons’ I Wanna Dance with Somebody, which follows the life of singer Whitney Houston, already had significant experience with digital crowds from Bohemian Rhapsody, and wanted to go further this time.

“We approached I Wanna Dance with Somebody with the intention to take what we learnt from Bohemian Rhapsody to the next level,” the film’s Academy Award-winning visual effects supervisor Paul Norris and visual effects producer Tim Field told befores & afters. “The crowds for Bohemian Rhapsody were achieved using an array of six cameras capturing sprite elements from 120 degrees. For I Wanna Dance the volumetric capture approach utilized over 70 cameras from 360 degrees.”

“This allowed,” they added, “for more expansive camera moves and the flexibility in post to create new shots as required by the edit, some without the requirement to shoot pickups. As before, we wanted to shoot foreground crowd elements in the plates as often as possible.”

Volumetric capture relies on multiple synchronized cameras that record textures and shapes to produce 3D representations of people or objects. Ultimately, volumetric capture handled by Dimension Studio (through Avatar Dimension), together with visual effects from Zero VFX and ReDefine, would help fill stadiums with upwards of 70,000 digital crowd members for a number of key Houston performances, such as her delivery of The Star-Spangled Banner at Super Bowl XXV in 1991.

“The whole pipeline for I Wanna Dance was tested a year before principal photography with DNEG and Dimension, testing the incorporation of CG elements and the variations that could be achieved for each capture,” detail Field and Norris. “This informed the overall methodology, bidding and awarding process.”

Capturing humans, volumetrically

For the final film, the volumetric capture process was extensive. For each concert or singing event featured in the film, audience members were captured one at a time in a greenscreen booth. This happened inside an array of 70 cameras established by Avatar Dimension using their Microsoft Mixed Reality Capture Studio setup.

The process took place in Boston, with more than 300 people ultimately being captured. In addition to breaking down the captures by concert, Norris and Field worked out what the audience members’ emotions and actions needed to be by studying the crowds in the original performance footage. They created a matrix document of actions, with a specific ratio of captures required for each and a 20-second capture duration.

As crowds never react in complete unison, capturing the correct balance and priority of actions enabled the creation of a vast range of natural crowd reactions, with the right mix of randomness, timings and ratio of actions. The matrix document provided enough advance planning for the whole capture shoot to be scheduled. It also guided the selection of captures for processing into CG assets, narrowing each successful 20-second take down to a 10-second range. Both 2K and 4K assets could be generated from the resulting data.
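The ratio-driven planning described above can be sketched in a few lines of Python. This is only an illustration, not the production's actual matrix: the action names, the weights, the total take count and the helper names `allocate_takes` and `select_range` are all assumptions made for the example. Only the 20-second take and 10-second selection come from the article.

```python
from math import floor


def allocate_takes(ratios, total_takes):
    """Split total_takes across actions in proportion to their ratio weights.

    Uses largest-remainder rounding so the counts always sum to total_takes.
    """
    total_weight = sum(ratios.values())
    if total_weight <= 0:
        raise ValueError("ratios must contain at least one positive weight")
    exact = {a: total_takes * w / total_weight for a, w in ratios.items()}
    counts = {a: floor(x) for a, x in exact.items()}
    leftover = total_takes - sum(counts.values())
    # Hand the remaining takes to the actions with the largest fractional parts.
    by_remainder = sorted(ratios, key=lambda a: exact[a] - counts[a], reverse=True)
    for action in by_remainder[:leftover]:
        counts[action] += 1
    return counts


def select_range(take_seconds=20.0, keep_seconds=10.0, start=5.0):
    """Return the (in, out) times of the kept window, clamped inside the take."""
    start = max(0.0, min(start, take_seconds - keep_seconds))
    return (start, start + keep_seconds)


# Hypothetical action matrix: weights are illustrative, not from the film.
EXAMPLE_RATIOS = {"cheering": 5, "clapping": 3, "swaying": 1, "standing_still": 1}
```

Calling `allocate_takes(EXAMPLE_RATIOS, 300)` would give a per-action shooting plan, and `select_range(start=...)` picks which 10 seconds of a 20-second take go forward for processing.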

Norris directed the audience members, who were filmed in different costumes and in era- and location-correct hair and make-up. “Volumetric capture is limited in the colour palette and texture for certain costumes and the ability to capture hair detail,” note Norris and Field. “We had to be clever with costume selection, testing different materials that were more scanning friendly, avoiding certain materials or textures and with hairstyles taking advantage of headwear like hats and bandanas as well as hairstyles with hard edges.”

Additional elements were also captured of individual audience members who might be waving flags, acting as security guards or even appearing as players on the field. Zero was able to work out with Dimension a way of using “in-volume motion tracking so that we could then put flags in people’s hands without having to go back and do match moves on their hands.”

Filling venues: a look at the work by Zero VFX

A large part of the volumetric capture approach became managing all the captures, and bringing them into editorial. Dimension handled the processing. For Zero’s shots, the VFX studio then ingested the assets into its database after editorial selection.

“We had a pretty tight dailies process so that we could look at dailies and rushes and go through and start figuring out which ones we wanted to move to the next step of processing with,” outlines Zero visual effects supervisor Brian Drewes. “We had an editor from Zero on set to help go through it all–there was a ton of data to push around. It was massive. It was 700 terabytes.”

“We would have a process by which we would then look at the results, QC them on a per item basis, and then those would go into our Houdini pipe from there,” continues Drewes. “We could tag each asset so that then we were able to leverage our Houdini systems to deal with dispersion and to be able to really have a good cheat sheet for which of these frames correspond to which action and to which character, so that then we could have this really precise labeling for a very large group of people.”
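To make the idea of tagged assets and dispersion concrete, here is a minimal sketch in plain Python. It does not use Houdini's API and is not Zero VFX's pipeline: the `CaptureAsset` fields, the function names, the seat identifiers and the action weights are all hypothetical. It shows one way that per-asset tags (character, action, frame range) could drive a weighted, repeatable scatter in which each placed asset starts at a random frame so neighbours do not move in unison.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureAsset:
    """One processed capture with the tags used to look it up (hypothetical schema)."""
    character: str
    action: str
    first_frame: int
    last_frame: int


def index_by_action(assets):
    """Build the 'cheat sheet': action tag -> list of assets performing it."""
    by_action = defaultdict(list)
    for asset in assets:
        by_action[asset.action].append(asset)
    return by_action


def disperse(seat_ids, by_action, action_weights, seed=0):
    """Assign an asset and a start-frame offset to every seat.

    action_weights sets the mix of actions; the seed keeps results repeatable.
    """
    rng = random.Random(seed)
    # Only consider actions that actually have captured assets.
    actions = [a for a in action_weights if by_action.get(a)]
    if not actions:
        raise ValueError("no captured assets match the requested actions")
    weights = [action_weights[a] for a in actions]
    placements = {}
    for seat in seat_ids:
        action = rng.choices(actions, weights=weights, k=1)[0]
        asset = rng.choice(by_action[action])
        clip_length = asset.last_frame - asset.first_frame + 1
        # Random offset into the clip so identical assets are out of phase.
        placements[seat] = (asset, rng.randrange(clip_length))
    return placements
```

Changing the numbers in `action_weights` and re-running with the same seat list is one simple way to shift the overall mood of a crowd without touching the assets themselves.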

As the audience member assets were placed in stadiums and venues (more on those below), one challenge for Zero VFX was avoiding collisions. Its system was able to figure out the general proximity of each asset, says Drewes. “We were able to figure out the best way for that pipeline to hand off so that then compositing could help with depth of field and other things like that. We were able to attack it both from lighting and from comp to balance it all together.”
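The article does not say how Zero's system measured proximity. One common approach, shown here purely as an assumption, is a uniform grid (spatial hash): each placed asset is dropped into a grid cell the size of the minimum allowed spacing, so only assets in the same or adjacent cells need to be compared. The function name `find_close_pairs` and the 2D floor-plane coordinates are hypothetical.

```python
from collections import defaultdict
from math import floor, hypot


def find_close_pairs(positions, min_distance):
    """Return index pairs (i, j), i < j, whose positions are closer than min_distance.

    positions is a list of (x, z) floor-plane coordinates, one per placed asset.
    """
    if min_distance <= 0:
        raise ValueError("min_distance must be positive")
    # Bucket every asset into a grid cell whose side equals min_distance.
    grid = defaultdict(list)
    for index, (x, z) in enumerate(positions):
        grid[(floor(x / min_distance), floor(z / min_distance))].append(index)

    pairs = set()
    for (cx, cz), members in grid.items():
        # Two points closer than min_distance must sit in the same or adjacent cells.
        for dx in (-1, 0, 1):
            for dz in (-1, 0, 1):
                neighbours = grid.get((cx + dx, cz + dz), ())
                for i in members:
                    for j in neighbours:
                        if i < j:
                            (xi, zi), (xj, zj) = positions[i], positions[j]
                            if hypot(xi - xj, zi - zj) < min_distance:
                                pairs.add((i, j))
    return sorted(pairs)
```

Any pair this returns would be a candidate for nudging apart or swapping for a different asset before lighting and compositing.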

This processed data was then shared with ReDefine, the other vendor involved with the concert VFX sequences.

Before building crowds, building stadiums

The venues featured in the film were, of course, as vital as the crowds themselves. Research went into how these places looked and felt around the 1990s, drawing on broadcast footage of Houston singing at them. At the actual venues (if they still existed) and at stand-in locations for filming, Norris and Field orchestrated a detailed photographic reference shoot, as well as the capturing of textures during filming. There was also a comprehensive Lidar scan done of each location.

Full CG environments built by Zero VFX included the stage for an Oprah Winfrey performance and the venue for the 1994 American Music Awards. The Super Bowl XXV event, held at Tampa Stadium in Florida, was one of the most challenging for the VFX team. That’s because the unique shape of Tampa Stadium did not match the dimensions of the stand-in location, Gillette Stadium near Boston.

“We ended up having to resolve those two worlds together to create a CG stadium that worked,” says Drewes. “That was a pretty substantial build because we were also seeing it from up high and the blimp view, and then also from the view of some F-16s that come screaming past it during the moment of the national anthem.”

The flexibility of volcap

For the VFX team on I Wanna Dance, volumetric capture proved highly effective at giving the audiences a unique look and feel, as Norris and Field attest.

“One of the biggest benefits of volumetric capture is the capture of an individual’s performance. As with Bohemian Rhapsody, this makes for a very realistic, believable crowd with all the individual nuances and performance characteristics of each performer as well as all the real clothing movements and facial expressions.”

Indeed, on occasion, the individual performances captured volumetrically were almost too individual. But Zero and ReDefine’s systems allowed them to swap out crowd members where necessary.

Norris recalls it was easy to swap out individual crowd members who were distracting, acting too crazy or too emotional for that moment in the movie. “In fact we were able to adjust the entire balance of the whole crowd by changing the ratios of the different actions as required by the director and editor to achieve the desired emotional tone for each shot.”

Additionally, the volumetric capture approach allowed ReDefine to create new concert shots for the World Tour sequence as the edit required. The new shots were designed using the existing crowd capture assets and greenscreen foreground plates.

Norris and Field conclude that “these photographic techniques using volumetric capture elements produced audiences for the movie that were genuinely ‘real’.”