Dozo: an early mocap marvel

The motion-captured character made her debut at SIGGRAPH in 1989.

#mocaphistory week is brought to you by Xsens.

For #mocaphistory week, I thought it might be fun to go back to the beginnings of what most people might equate with motion capture: the idea of performers covered in markers being captured by infrared cameras.

A project widely acknowledged as one of the earliest CG films made using ‘passive’ optical motion capture is Jeff Kleiser and Diana Walczak’s ‘Don’t Touch Me’, a music video that played at SIGGRAPH 1989 in Boston with a digital singing performer. Kleiser-Walczak’s ‘Dozo’ character in the video was made possible with mocap (albeit only partially, as is revealed below) and some elaborate real-world facial sculpts translated into 3D.

Kleiser, who with Walczak was an early pioneer in the world of digital avatars, runs down how ‘Don’t Touch Me’ was made.

b&a: How did ‘Don’t Touch Me’ come about?

Jeff Kleiser: I had been working at Omnibus Computer Graphics. I was supervising feature film projects for them, and I had met my partner, Diana Walczak, at SIGGRAPH in 1985 in San Francisco.

At Omnibus we were talking about trying to figure out what kind of things we could get off the ground, and one of the ideas was to do a Spider-Man TV series, an animated Spider-Man TV series, using 3D animation, but they really didn’t have any idea how to create the Spider-Man body.

I said to the producers, ‘Well, look, I just met this really great sculptor. Why don’t we get her to sculpt the body, and then we can get one of those digitizers and digitize it, and then figure out how to make it come to life and move?’

So Diana sculpted this muscular character, much more muscular than Spider-Man, really, because we weren’t sure exactly what it was going to be for yet. We thought it could be Spider-Man, but that we could always stretch him out.

Anyway, Diana made this powerful looking sculpture about four feet tall, and then we started figuring out how to animate it. And we knew that we couldn’t really do shoulders very well. We didn’t have any software that could bend polygonal surfaces. There just wasn’t any code written, and we would have to write that. So we decided we would build this character with interpenetrating body parts, like an upper arm would interpenetrate with the lower arm, and then the upper arm would interpenetrate with the shoulder.

Diana made these pieces based on the original sculpture that would ultimately interpenetrate so we could animate the character and have it look semi-okay without any deformation software. And that ultimately became our first film called ‘Nestor Sextone for President’.

At that point, I was starting to get interested in motion capture. We had done a little bit of motion capture, or faux motion capture, at my previous company, Digital Effects in New York. We put our art director in some red pyjamas with ping pong balls on them, and photographed him with a 35 millimetre camera. And then we projected that image down on an Oxberry ‘down shooter’ and used a digitizing tablet to digitize each frame. So we collected pseudo three-dimensional information. It was really two-dimensional information, but because it was taken off of a 3D body moving around from one angle, from the front angle, it looked like 3D data.

Anyway, I wanted to do motion capture, but something much more elaborate. And so, I contacted Motion Analysis up in San Francisco who were one of the main companies doing mocap. Meanwhile, our first film ‘Nestor Sextone’ was spotted by Molly Connolly at Hewlett-Packard. And she said, ‘I think that thing’s a real breakthrough, and I want to give you guys some workstations, so you can do another film and put HP in the credits. I’ll give you the equipment.’

So we got three workstations from HP, really good ones at the time, and we said, ‘Okay, now, we’re going to do a three minute piece.’ Sextone was about a minute. He’s just talking, kind of a joke. And this one, we wanted to do a music video. We wanted to capture the motion of a female singer, create the female body and apply the motion capture to it, and also videotape our singer in closeup, so we could make a series of faces and deform them to get her to sing along in sync with the music. This became ‘Don’t Touch Me’ and our digital singer was ‘Dozo’.

Motion capture for ‘Don’t Touch Me’.

So we wrote the song with our friend, Frank Serafine, and then we got our friend, Perla Batalla, the singer, to do the motion capture and to sing the song. We videotaped her and motion captured her and put it all together in a big rush for SIGGRAPH ’89 in Boston. And we barely made the deadline.

Our film got into the shipping room at 11:55, and the cutoff was midnight. So we really didn’t have time to do anything but get it out the door. There’s all kinds of technical flaws in it, because we didn’t have any opportunity to preview it. We had to get it into SIGGRAPH, or there would be a whole three or four months of wasted production effort, and we’d have to wait another year to present it.

b&a: How did you approach the motion capture, in particular?

Jeff Kleiser: We went up to Motion Analysis and we had choreographed a whole three minute song. We would play back the song, and Perla was singing, and she did a whole dance number. Unfortunately, the Motion Analysis software was kind of clumsy at the time, so we could only get about 18 seconds of clean 3D data out of the whole three-and-a-half minute performance.

So we had to cycle those 18 seconds through the whole thing. It was nowhere near the performance that we were hoping and expecting to get. We just had this one section that was clean, and because of the time constraints, we couldn’t wait for them to clean up the rest of the performance. We just had to cycle that 18 seconds over and over about six times in the film.

b&a: What was the mocap setup that you used?

Jeff Kleiser: It was eight IR cameras in a circle, raised up about 10 feet, looking down at her. And we had tracking markers on her limbs, little small ping-pong-like balls. And we also affixed some tracking markers in her hair, because we wanted to get a whole swinging hair thing, but they never actually successfully captured that. So it was primarily shoulders, elbows.

Then we had a stick that she was holding with a tracking point at the thumb and baby finger, just to capture the attitude of her hand. And then there were two on each foot, at the toe and heel, plus the knees, four around her hips, and then shoulders, elbows, and hands, the top of her head, and a few on her pigtails, or ponytail.

b&a: In terms of her body capture, what do you remember were the challenges of crafting the performance from the capture?

Jeff Kleiser: Well, we were totally dependent on Motion Analysis to give us 3D data, because they had the software that analyzes the images coming in from all the cameras and turns them into a 3D thing. The process was either fraught with errors or very, very slow. But like I said, we only got the 18 seconds of clean data out of them.

Ultimately, we figured out how to make elbows bend, sort of. Unlike ‘Nestor Sextone’, we were able to have continuous moving joints at the elbows and the neck. The shoulder proved to be too complicated, because there’s forward and backward, and up and down. We had to interpenetrate the arms and shoulders, so that’s why her shoulders look weird. The knees were a curved joint like the elbows. We really should have taken another year off and done it properly, but we were excited to get it into the show and couldn’t wait!

b&a: What about Dozo’s facial animation? I’m guessing that wasn’t anything to do with the mocap at all.

Jeff Kleiser: No, it was separate. So, there’s a guy named Larry Weinberg, who wrote Poser. At the time, he was working with me at Omnibus. And we went to Larry and said, ‘Larry, we need a way to make this character talk.’ And we figured out a way for Diana to make a neutral face, which is the same thing we did with Nestor Sextone, and then make a mould of that. And then we could push clay into the mould and make duplicate neutral faces. And then Diana would take a neutral face and re-sculpt the mouth into different positions, like FACS expressions, and also emotional cues, mostly seen in the eyebrows.

So we could mix and match frowns and raised eyebrows, along with all the different phonemes of speech, A, E, I, O, U, and M, and N, and all the different ones we figured we might need. And then we had all these sculpts. Then we had to digitize them all, and each one had to have the exact same topology, the same number of polygons in the same order, in order to interpolate from one to the other.

Larry wrote a program called Reorder, basically because there was no way we could digitize all these faces in the same order. There are 3,000 polygons in each face. So he wrote a program where you take two faces that we digitized and the same starting polygon, and it would go through and reorder the polygons to be in the same order, and then point out whenever there was a problem, a discontinuity in the polygon count, so that we could go in and fix it.
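Weinberg’s original Reorder code isn’t public, but the idea Kleiser describes, walking each mesh outward from the same starting polygon so their polygon lists end up in matching order, can be sketched roughly as follows. This is a simplified Python illustration under assumed representations (polygons as tuples of vertex indices, adjacency via shared edges); the real program worked on digitized sculpt data and handled far messier input.

```python
from collections import deque

def canonical_order(polygons, start):
    """Breadth-first walk over shared-edge adjacency, starting from a
    chosen polygon, producing a canonical ordering of the mesh's
    polygons.  `polygons` is a list of tuples of vertex indices.
    Running this on two digitizations of the same topology, from the
    same starting polygon, yields corresponding polygon orders."""
    # Index every undirected edge to the polygons that share it.
    edge_to_polys = {}
    for i, poly in enumerate(polygons):
        for a, b in zip(poly, poly[1:] + poly[:1]):
            edge_to_polys.setdefault(frozenset((a, b)), []).append(i)

    order, seen, queue = [], {start}, deque([start])
    while queue:
        i = queue.popleft()
        order.append(i)
        poly = polygons[i]
        for a, b in zip(poly, poly[1:] + poly[:1]):
            for j in edge_to_polys[frozenset((a, b))]:
                if j not in seen:
                    seen.add(j)
                    queue.append(j)
    # A shortfall here is the kind of discontinuity Reorder flagged
    # for a human to go in and fix.
    if len(order) != len(polygons):
        raise ValueError("discontinuity: mesh is not edge-connected")
    return order
```

Because the traversal is driven purely by connectivity, two meshes that share a topology visit their polygons in the same sequence regardless of how they were digitized, which is exactly what per-vertex interpolation between expressions requires.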

It was very, very laborious to digitize these faces and get through the reordering process, so that you could actually interpolate from one expression to another. Then, once we had all these faces and phonemes of speech in the bag, Larry wrote another program called Talk. By visually studying the video reference of Perla singing, the closeup camera of her singing, we could scrub through, look at the time code, and say, ‘Okay, at four minutes, 32 seconds, eight frames, her mouth is in an M position.’

And we’d go and find the M phoneme and assign that face at that frame. And six frames later, it’s starting to turn into an ‘O’. We could put different percentages of different expressions into each key frame. So we’d make a list of frames, and then Larry’s Talk program would go and interpolate between expressions in sync with our video reference track.
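The Talk program itself isn’t public either, but the scheme described here, keyframed percentages of sculpted expressions with linear interpolation between keyframes, is an early form of what later became standard blend-shape animation. A minimal sketch, with hypothetical function names and assuming all faces share the same vertex order (which is what Reorder guaranteed):

```python
def weights_at(keyframes, frame):
    """Linearly interpolate per-expression weights between the two
    keyframes bracketing `frame`.  `keyframes` is a dict mapping
    frame number -> {expression_name: weight}."""
    times = sorted(keyframes)
    if frame <= times[0]:
        return dict(keyframes[times[0]])
    if frame >= times[-1]:
        return dict(keyframes[times[-1]])
    for t0, t1 in zip(times, times[1:]):
        if t0 <= frame <= t1:
            break
    a, b = keyframes[t0], keyframes[t1]
    t = (frame - t0) / (t1 - t0)
    names = set(a) | set(b)
    return {n: (1 - t) * a.get(n, 0.0) + t * b.get(n, 0.0) for n in names}

def blend_vertices(neutral, targets, weights):
    """Blend-shape mix: offset each neutral vertex by the weighted
    deltas of the target (expression) faces.  All meshes are lists of
    (x, y, z) tuples in identical vertex order."""
    out = []
    for i, (x, y, z) in enumerate(neutral):
        dx = dy = dz = 0.0
        for name, w in weights.items():
            tx, ty, tz = targets[name][i]
            dx += w * (tx - x)
            dy += w * (ty - y)
            dz += w * (tz - z)
        out.append((x + dx, y + dy, z + dz))
    return out
```

So a keyframe list like `{100: {"M": 1.0}, 106: {"O": 1.0}}` would, at frame 103, give a face halfway between the M and O sculpts, which matches the “different percentages of different expressions” workflow Kleiser describes.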

Like I say, I wish we had another six months to work on it, because we could have made it really, really nice. But this was our first prototype, and we just wanted to get it out there.

b&a: What do you remember was the reaction at SIGGRAPH to this in terms of audience reaction, but also just the community?

Jeff Kleiser: At the show, I remember we got really loud applause and people were bowled over, because they hadn’t really seen characters talking with that level of geometric fidelity. I think at the time, there had been a number of attempts at making faces emote and talk, one of them being Mike the Talking Head, among others, but those had mostly been algorithmic approaches.

The fact that we were interpolating between different phonemes of speech based on sculpts that Diana had made meant it looked much more realistic than the algorithmic approaches. The algorithmic approaches, of course, quickly overtook our sculptural approach in terms of accuracy and capability with blend shapes, and all that stuff became a whole science unto itself. But at the time, this was better facial animation than most people had seen before, so we got a really good reaction.

Follow along during this special weekly series, #mocaphistory, to re-visit motion capture history and hear from several performance capture professionals.

Get exclusive content, join the befores & afters Patreon community