A Facial Animation Markup Language (FAML) for the Scripting of a Talking Head


by




Quoc Hung Huynh

09525748




Project Supervisor: Andrew Marriott





Submitted to

the Department of Computer Science

in fulfillment of the requirements for the degree of




Bachelor of Science (Computer Science) (Honours)



at


School of Computing

Curtin University of Technology

Perth, Western Australia



November 24, 2000


© Curtin University of Technology 2000








Abstract



The FAQBot forms the focus of our project. The FAQBot Talking Head animation combines a text-to-speech (TTS) system, an MPEG-4 based facial animation engine (FAE) and artificial intelligence (AI) to produce a 3D talking head that answers users' requests. The aim of this project is to implement a Facial Animation Markup Language (FAML) that enables the facial expressions, gestures and emotions of the animated Talking Head to be controlled through the input text stream.


Our research encompasses the domains of human psychology and cognitive science, computer graphics, computer vision and human-machine interaction, in order to identify the factors that contribute to the non-verbal communication of facial gestures, expressions and emotions in humans. From this we derived a subset of FAML tags that mimic the identified non-verbal behaviours. The FAML tags form the tools needed to realistically animate the Talking Head.


The research utilises the MPEG-4 facial animation coding standard. The subset of FAML tags specifies the movement of Facial Animation Parameters (FAPs) as defined by the MPEG-4 specification. The FAPs are used to display the facial expressions denoted by the FAML tags. The FAML tags are implemented to work in conjunction with the personality of the Talking Head, allowing smooth and continuous animation. The timing of gestures is synchronized to the audio clock, which is defined by the timing of the Talking Head's synthesized speech. This dissertation describes the development of the tools and techniques required to control the text-driven animation of the Talking Head through the use of FAML tags.






















Acknowledgements



I wish to acknowledge the people who have supported me and made significant contributions to my Honours work. First and foremost, I wish to thank my project supervisor, Andrew Marriott, for initially giving me the chance to work on a project that appealed to me and yet challenged the depth of my skills. I would also like to thank him for his guidance, encouragement and support above and beyond my project.


I am indebted to Trang Ly for her support, forgiveness and encouragement, for allowing me to use her as a facial model, and for making the late nights and weekends spent at university more enjoyable.


Special thanks to John Stallo, with whom I worked on this project to truly bring the Talking Head to life. John was not only an academic colleague but also a friend, who allowed me to discuss problems with him and contributed meaningful ideas to the work. Lastly, I would like to extend my gratitude to the previous students, too numerous to name, who have contributed to the body of work that is the FAQBot, for I have truly stood on their shoulders.






















Quoc Hung Huynh

11 November 2000





Contents




Chapter 1

Introduction

Chapter 2

Problem Description

2.1 Problem Statement

2.2 Subproblems

2.3 Significance of Study

Chapter 3

Literature Review

3.1 Talking Heads

3.2 Face Modelling

3.2.1 Interpolation

3.2.2 Parametric

3.2.3 Physical

3.3 Nonverbal Behaviour and Communication

3.3.1 Facial Displays

3.3.2 Facial expression

3.3.3 Emotion

3.3.4 Head Movement

3.3.5 Eye behaviour

3.4 Implemented Visual Text-to-Speech (VTTS) systems

3.5 MPEG-4

3.5.1 Facial Animation in MPEG-4

3.6 Markup Languages

3.7 Virtual Characters

3.8 Summary of literature review

Chapter 4

Research Methodology

4.1 Hypothesis

4.2 Delimitations and Assumptions

4.3 Limitations

4.4 Design and Demonstration

4.5 Evaluation

Chapter 5

Implementation

5.1 Background

5.1.1 Input text stream

5.1.2 Collaboration

5.1.3 The TTS Module

5.1.4 PST Personality Module

5.2 Overview

5.2.1 Festival Text-To-Speech Synthesiser Word Expansion

5.3 Synchronisation

5.3.1 Timing

5.3.2 Frames

5.3.3 FAML Synchronisation

5.4 Personality and Gesture conflict resolution

5.4.1 Blinking

5.4.2 Eyebrow

5.4.3 Gestures and Head Movements

5.4.4 Expressions

5.4.5 Emotion

5.5 Generic Tag specifications

5.6 Tag animation

5.7 Gesture FAML Tags

5.7.1 Head

5.7.2 Eyes

5.7.3 Brows

5.8 Expression FAML Tags

5.9 Emotion FAML Tags

5.10 Virtual Characters

5.10.1 News presenter

5.10.2 Sales assistant

5.10.3 Narrator / Storyteller

5.11 Producing realistic animation

5.11.1 Realistic Head Turns

5.11.2 Realistic Eye Movements

Chapter 6

Results and Analysis

6.1 The Experiment

6.2 Evaluation of Results

6.2.1 Profile of Users

6.2.2 Results of Phase Two

6.2.3 Results of Phase Three

6.3 Summary of Results

Chapter 7

Conclusions

7.1 Future work

Bibliography

Appendix A

Appendix B

Appendix C

Appendix D

Appendix E

Appendix F

Appendix G









































List of Figures



Figure 1 Facial expressions

Figure 2 Typical input text document for the FAQBot application

Figure 3 Straight filtering of unknown tags in the input text

Figure 4 Filtered input text document preserving utterance structure

Figure 5 FAQBot animation modules

Figure 6 FAML module overview

Figure 7 Festival expansion list

Figure 8 Utterance timing file

Figure 9 Phoneme data for the word "Here's"

Figure 10 Phoneme duration breakdown

Figure 11 Complete duration information for the utterance "Here's"

Figure 12 Timing file for utterance "Here's the latest news"

Figure 13 Calculated word timing for the utterance "Here's the latest news"

Figure 14 Start time values of each word as offset from time 0 of the audio stream

Figure 15 Frame synchronisation of word in the animation sequence

Figure 16 Example of FAML tags in text

Figure 17 FAML tags synchronization

Figure 18 A generic FAML tag

Figure 19 Breakdown of smile tag animation

Figure 20 The amplitude of a generic FAML tag over its duration in the animation sequence

Figure 21 The amplitude of a "smile" FAML tag over its duration in the animation sequence

Figure 22 Comparative Storyteller results from demonstration 1 vs demonstration 2

Figure 23 Comparative News presenter results from demonstration 1 vs demonstration 2

Figure 24 Comparative Sales assistant results for demonstration 1 vs demonstration 2




















List of Tables



Table 1 Some communicative facial displays categorized by Chovil (1991)

Table 2 Facial Animation Parameter (FAP) Groups

Table 3 Primary Facial expressions as defined for FAP 2

Table 4 McNemar's and Stuart Maxwell's p values for Storyteller character

Table 5 McNemar's and Stuart Maxwell's p values for News presenter character

Table 6 McNemar's and Stuart Maxwell's p values for Sales assistant character

Table 5 TTS vs TTS and FAML results































Chapter 1



Introduction






Facial animation is now attracting more attention than ever before in its 25 years of development. Imaginative applications of animated graphics can be found in sophisticated human-computer interfaces, interactive games, multimedia titles, virtual reality and in an extensive variety of computer-generated animations. Supporting technologies include synthesized speech and artificial intelligence. The goal is to synthesize realistic Talking Heads, representing the dynamic facial likeness of humans.


One particular application of a computer-generated Talking Head is the FAQBot application (Beard et al., 1999), developed jointly by Curtin University of Technology and the University of Genoa, which uses an animated Talking Head as its interface. The FAQBot is designed to answer users' frequently asked questions on predefined topics. The FAQBot combines an MPEG-4 based facial animation engine (FAE), a text-to-speech (TTS) system and artificial intelligence (AI).


The FAQBot application is still in development and forms the basis for this project. The FAQBot has evolved from a simple Talking Head animation, with only animated lip movement, to a Talking Head that can display user-defined personalities. The FAQBot Talking Head is, however, still lacking in terms of animation control. Because the animation of the FAQBot Talking Head is text driven, control of the animation needs to reside within the input text. It is the task of this project to implement a markup system for the text input to the FAQBot Talking Head application that controls the animation of the facial gestures, expressions and emotions of the Talking Head.


Animating the face by specifying every action manually is a very tedious task and often does not yield the desired results. To improve facial animation systems, an understanding of non-verbal communication and non-verbal behaviour is a priority. It is suggested that integrating non-verbal behaviour such as facial gestures, expressions and emotions with the accompanying speech will increase the realism of the Talking Head animation (Pelachaud et al., 1996).


Facial expression changes continuously in humans, and many of these changes are synchronized to the spoken discourse. When people speak, their faces are rarely still. They not only use their lips to talk, but also raise their eyebrows, move and blink their eyes, or nod and turn their heads. Facial expression is linked to the content of speech, for instance scrunching one's nose when talking about something unpleasant. It is also inherently linked to emotion, personality and other behavioural variables (Ekman, 1979).


The goals and contributions of this research are described in Chapters 2 and 4. The relevant literature and background are discussed in Chapter 3, which encompasses the domains of psychology, cognitive science, computer graphics, computer vision and human-machine interaction. Chapter 5 describes the implementation of the facial animation markup. Chapter 6 details the experiments and analyses the data acquired from them. Finally, Chapter 7 presents the conclusions of this research and suggestions for future work.



































Chapter 2



Problem Description






The following sections formally outline the problems investigated in this research. We further discuss the significant aspects concerning our project.


2.1 Problem Statement


The aim of this research is to design and implement a Facial Animation Markup Language (FAML) to control the facial gestures, expressions and emotions in the Talking Head animation for the FAQBot application. The FAML is to be used in the input text stream to "drive" the facial animation of the Talking Head. The FAML will enable the animator to mark up the input text, specifying the type, intensity and duration of facial gestures, expressions and emotions. The facial displays will be synchronized to the speech such that the timing of each facial display coincides with its location in the input text. Facial displays encompass the facial expressions, movements, gestures and emotions displayed by the face.
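
The requirement above can be illustrated with a minimal sketch. It assumes a hypothetical XML-like tag syntax: the tag name `smile`, the attribute names `intensity` and `duration`, their units and their default values are illustrative assumptions only, and the tag specifications themselves are given in Chapter 5. The sketch records the type, intensity and duration of each tag together with the index of the word at which the tag opens, which is the information needed to tie a facial display to its location in the text.

```python
import re
from dataclasses import dataclass

# Hypothetical tag syntax, for illustration only:
#   <name intensity="..." duration="..."> words </name>
TOKEN = re.compile(r"<(/?)(\w+)([^>]*)>|([^\s<]+)")
ATTR = re.compile(r'(\w+)="([^"]*)"')


@dataclass
class FacialDisplay:
    kind: str         # type of display, e.g. "smile"
    intensity: int    # assumed scale of 0-100
    duration_ms: int  # assumed unit: milliseconds
    word_index: int   # index of the first word after the opening tag


def scan(marked_up_text):
    """Return the plain words and the facial displays found in the text."""
    words, displays = [], []
    for closing, name, attrs, word in TOKEN.findall(marked_up_text):
        if word:
            words.append(word)
        elif not closing:
            found = dict(ATTR.findall(attrs))
            displays.append(FacialDisplay(
                kind=name,
                # The defaults (50 and 1000) are assumptions of this sketch.
                intensity=int(found.get("intensity", 50)),
                duration_ms=int(found.get("duration", 1000)),
                word_index=len(words),
            ))
    return words, displays


text = 'Here\'s the <smile intensity="80" duration="1200">latest</smile> news'
words, displays = scan(text)
```

Here the word list is the text that would be passed to the speech synthesiser, and each `word_index` identifies the word whose start time anchors the display once word timings are known.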


2.2 Subproblems


Non-verbal communication


We consider the work to be an issue of multi-modal communication, particularly the non-verbal mode. It is imperative that the FAML is able to animate the aspects of non-verbal communication that relate to the content and structure of the spoken text, as well as the underlying behavioural aspects of human physiology present during communication. The FAML must therefore provide the functionality that allows the animator to exhibit these non-verbal displays on the Talking Head.


Animation control


The current animation of the Talking Head is probabilistic in nature, and no exact method of control can be used to animate it. The Talking Head is able to portray a personality, but is unable to link the personality to the text or speech. The FAML provides this control by directing the facial displays of the Talking Head animation.


The FAML is to work in conjunction with the underlying personality of the Talking Head and as such a method of conflict resolution between the two animation processes is required to ensure the continuous and smooth animation of the Talking Head.



Mutually exclusive personalities


Currently, only one personality can be portrayed for each animation sequence of the Talking Head. No mechanisms exist to allow the personalities to change during the animation. The FAML, although not able to specify a personality, is capable of changing the facial expression of the Talking Head during the animation such that a friendly personality can display sad facial expressions.


2.3 Significance of Study


In the current implementation of the FAQBot application, there is no mechanism to control the animation of facial expressions, gestures and emotions for the Talking Head animation. Recent developments have included a personality for the Talking Head, allowing probabilistic movements and expressions to be displayed in the Talking Head animation through personality-defined parameters. However, there still does not exist a mechanism to allow consistent animation of a character or persona in the Talking Head animation.


The implementation of the FAML allows high-level control of the Talking Head through the input text stream, enabling specified gestures, expressions and emotions to be scripted into the Talking Head animation, synchronized to the speech. The FAML tags can be used in conjunction with the underlying personality of the Talking Head to convey consistent persona or characters for the Talking Head animation.


The FAML enables the Talking Head animation of the FAQBot application to be further utilised in scripting characters such as virtual storytellers, virtual news presenters and virtual sales assistants.


An important aspect of this proposed research is that the work is based on the recently ratified MPEG-4 standard. MPEG-4 enables the animation of 3D Talking Heads using very low bandwidth, enabling smooth facial animation in multimedia and web-based applications.







Chapter 3



Literature Review







The literature review begins with an introduction to the domain of Talking Heads, which are synthetic computer-modelled human faces that can speak and move, and in particular the seminal FAQBot application. It then describes aspects of facial modelling, the types of models and the animation techniques used to breathe life into them. The Talking Head and its animation feature prominently in our work, and as such the techniques that are used to derive and manipulate the face model are of significant interest.


As part of the delimitations of this body of work, the modelling of the relationship between gesture and speech is beyond the scope of the project. It is however relevant to address the level of symbiosis between the two communication channels and this will be discussed through aspects of multi-modal communication.


Animation control techniques are discussed as methods for animating and simulating a Talking Head. Particular attention is paid to the MPEG-4 specifications to which our project is delimited (see section 4.2). A section of this literature review is dedicated to the exploration of MPEG-4, its specifications and facial animation coding system.


This research involves the implementation of a facial animation markup language to script facial expressions, gestures and emotions. As such, we touch upon the types of non-verbal communication and their function in humans in order to narrow our focus to the expressions and gestures we choose to model. The emotional and linguistic aspects of gestures and how well they relate semantically will also be addressed. We also discuss markup languages in detail and how they can be used to structure and organize input data for our Talking Head.


Lastly we move on to virtual characters and how they can be constructed and animated using FAML tags to produce "believable characters" that convey the illusion of life.


3.1 Talking Heads


Synthetic Talking Heads form a rapidly developing research area that is nevertheless still in its infancy. The area continues to attract attention for its application potential. A Talking Head can be applied to synthesise an intelligent desktop agent, a virtual friend, virtual salesperson, virtual teacher, virtual presenter and even virtual actors (Noh and Neumann, 2000) (Binsted, 1999) (Parke and Waters, 1996).


The research and implementation of a Talking Head encompass many disciplines, including facial animation, speech synthesis and multi-modal communication.


The FAQBot (Beard et al., 1999) is an example of the application of a Talking Head. The FAQBot application forms the interface to a Frequently Asked Questions (FAQ) database. The FAQBot application accepts user input based on the FAQ topic and the underlying AI (Artificial Intelligence) matches the input to an answer. The Talking Head then communicates this answer both visually and audibly. In its original state the FAQBot application was very static in nature and the Talking Head did not convey much movement. Recent developments made by Shepherdson (2000) have incorporated personality traits to the Talking Head, improving the realism of the Talking Head animation.


Binsted (1999) relates the application of a Talking Head to a soccer game commentator known as Rocco. Rocco is designed as a system for analysing simulation league games and generating multimedia presentations of the games. Its output is a combination of spoken natural language utterances, gestures and facial expressions. Although the system is still in its infancy, Binsted (1999) has designed Rocco to be as believable as possible, mimicking the consistency between expression and action, as well as the modalities of expression.


All of the above applications have required a large amount of research involving aspects of facial animation, speech synthesis, non-verbal and multi-modal communication. The animation of a synthetic face is a very important aspect of this research as it forms the visual modality for communication. We discuss facial models and animation in the following section.


3.2 Face Modelling


A face is an independent communication channel that conveys both emotional and conversational signals, encoded as facial expressions (Nagao and Takeuchi, 1994).


As the technology of computer graphics and animation has advanced, so too have the realism and performance of facial modelling and animation. Recent progress in computational power and facial animation has opened the door to powerful tools for the design, implementation and exploration of virtual environments (Badler, 1995) (Parke and Waters, 1996). Facial models are now more complex than ever, capable of modelling greater dimensions and subtlety in the human face, even to the extent of wrinkle modelling as described by Pelachaud and Prevost (1995).


There have been a number of approaches applied to the animation of synthetic faces. The following sections present three common approaches to facial modelling and animation: interpolation, parametric and physical.


3.2.1 Interpolation


In early systems, modelling was done by digitizing the face (or part of the face) with different expressions. Each expression model was stored in an expression database. The animation was obtained by interpolating between two expressions. This method was very simple, but also an arduously time-consuming one. Even though simplistic, the system was still capable of generating expressive animations as outlined by Benoit et al. (1999).


Interpolation generalizes to polygonal surfaces by applying the scheme to each vertex defining the surface. Intermediate forms of a surface are achieved by interpolating each vertex between its two extreme positions (Parke and Waters, 1996). As noted by Shepherdson (2000), a basic assumption underlying the interpolation of facial surfaces is that a single facial topology can be used for each surface.
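
The per-vertex scheme can be sketched as follows. This is a minimal illustration that assumes linear interpolation, the simplest scheme; the function name `interpolate_face` and the two-vertex "faces" are hypothetical and serve only to show that both expressions must share one topology, that is, the same vertices in the same order.

```python
def interpolate_face(neutral, target, t):
    """Linearly blend two vertex lists that share one topology.

    t = 0 returns the `neutral` positions and t = 1 the `target` positions.
    """
    if len(neutral) != len(target):
        raise ValueError("both expressions must use the same topology")
    return [
        tuple(a + t * (b - a) for a, b in zip(v0, v1))
        for v0, v1 in zip(neutral, target)
    ]


# Two toy expressions of a two-vertex "face" (illustrative values only).
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
smile = [(0.0, 1.0, 0.0), (1.0, 2.0, 0.0)]

# The intermediate surface halfway between the two expressions.
halfway = interpolate_face(neutral, smile, 0.5)
```

Stepping `t` from 0 to 1 over successive frames produces the in-between surfaces of the animation.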


Interpolation is similar to the cel-based animation of cartoon characters, where key frames or key cels are produced and intermediate cels drawn to animate the cartoon from one key frame to another (Pelachaud and Prevost, 1995). The key frame technique requires a complete specification (point by point) at each key frame, but does not require any physical or structural information about the model.


Key frame interpolation derived from simple interpolation is still widely used for implementing and controlling facial animation. This approach was first demonstrated by Parke (1972) to produce viable facial animation.


While this interpolation can be quite successful for limited applications, such as creating stimuli for perceptual experiments, such a system lacks the flexibility of animating the face to represent realism and consistency, since there is no way to control different facial features independently of each other (Beskow, 1996).



3.2.2 Parametric



An alternative method developed by Parke (1982) uses a parameterized three-dimensional facial model. Here, the facial model is produced through a set of parameters. Generally the parameters can be divided into two main groups, expression and conformation parameters, as initially outlined by Parke (1990). Expression parameters can be used to specify expressions such as brow actions, mouth shape or head direction. Conformation parameters control the overall topology of the face, allowing local or global control, and relate to the actual parameters acting upon the topology of the face (including the position and size of features such as the eyes, nose and mouth). The animation is obtained by changing the set of parameter values and by interpolating between key frames (Pearce et al., 1986) (Cohen and Massaro, 1993) (Guiard-Marigny et al., 1994).
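
As a minimal sketch of animating by parameter values rather than by vertices, the fragment below samples a set of expression parameters at an arbitrary time by interpolating linearly between the surrounding key frames. The parameter names `brow_raise` and `jaw_open`, their value ranges and the function name `sample` are hypothetical; linear interpolation and strictly increasing key frame times are assumptions of the sketch.

```python
from bisect import bisect_right


def sample(keyframes, time):
    """Return the parameter values at `time`.

    `keyframes` is a list of (time, {parameter: value}) pairs sorted by
    strictly increasing time. Times outside the range clamp to the first
    or last key frame.
    """
    times = [t for t, _ in keyframes]
    i = bisect_right(times, time)
    if i == 0:
        return dict(keyframes[0][1])
    if i == len(keyframes):
        return dict(keyframes[-1][1])
    (t0, p0), (t1, p1) = keyframes[i - 1], keyframes[i]
    w = (time - t0) / (t1 - t0)  # blend weight between the two key frames
    return {name: p0[name] + w * (p1[name] - p0[name]) for name in p0}


# Hypothetical expression parameters, key framed at 0, 1 and 2 seconds.
keyframes = [
    (0.0, {"brow_raise": 0.0, "jaw_open": 0.0}),
    (1.0, {"brow_raise": 1.0, "jaw_open": 0.5}),
    (2.0, {"brow_raise": 0.0, "jaw_open": 0.0}),
]

quarter = sample(keyframes, 0.5)
```

Only the handful of parameter values needs to be stored or transmitted for each key frame, which is the source of the low data requirement of the parametric approach.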


In the context of this project, the parameterization technique is utilised by MPEG-4, which provides counterparts to both the expression and conformation parameters of the parametric model. The MPEG-4 Facial Animation Parameters (FAPs) correspond to the expression parameters, whilst the Facial Definition Parameters (FDPs) correspond to the conformation parameters. MPEG-4, FAPs and FDPs are discussed in section 3.5.


The main concerns of the parameterization technique are to define the physical properties of an element, and to determine the appropriate parameters of those properties. Since only the parameters of the face are required, this approach has the advantage of being quite simple and efficient, in that it requires little data storage while providing precise control of the parameters needed to reproduce exact lip shapes during speech. MPEG-4 utilises a parameterized method of facial animation for the efficiency, simplicity and low bandwidth of the parameterization technique (Ambrosini et al., 1998) (Lavagetto and Pockaj, 1999).


However, one major difficulty with parametric models, as Parke (1991) illustrated, is developing a complete set of parameters that can describe any facial expression and any facial conformation. Furthermore, parametric models neither model movement propagation nor simulate muscle movement, since this would require modelling of the underlying facial anatomy.


The next evolution in face modelling was the physically based, muscle-controlled face model, which relates the movement of the face to the underlying muscles.


3.2.3 Physical



Physically based models attempt to model the shape and dynamic changes of the face by modelling the underlying properties of facial tissue and muscle action (Parke and Waters, 1996) (Terzopoulos and Waters, 1990, 1993) (Pelachaud and Prevost, 1995).


Platt and Badler (1981) created the first model to simulate muscle actions. Waters (1987) was the first to include forces, with direction and magnitude, in his model. Later, Terzopoulos and Waters (1990, 1993) integrated various layers of skin. Using this technique, greater realism and subtle facial movements were created. These models provide the ability to manipulate facial expression based on the underlying muscles and facial tissue. Waters showed that the deformation that simulates the actions of muscles underlying the face looks more natural, as muscle movement propagation is intrinsic to the model (Waters, 1987).


Structural models


Platt's model (Platt and Badler, 1981) consisted of an object decomposed into hierarchically structured regions. The face is decomposed further into subregions, where each subregion corresponds to one muscle or group of muscles in the face. Each muscle can be simulated by specifying the precise locations of its attachment to the surface structure. Under the action of the muscle, these regions can show the propagation of movement along the surface of the subregions.


Muscle-Based models


Muscle-based models, or abstract-muscle models, mimic at a simple level the actions of primary muscle groups in the face. There are two distinct advantages for these models: (1) they are independent of particular facial geometry and (2) they map directly into muscle-based coding systems.


Ekman and Friesen (1978) used a Facial Action Coding System (FACS) to describe facial expressions. FACS is derived from an analysis of the anatomical basis of facial movement. Each facial movement is the result of muscle action. An action unit (AU) is the basic element of FACS. Each AU defines the direct effect of a muscle as well as the eventual secondary propagation of movement in relation to the surface of the face.


Procedural model


This method is based on empirical data and not on biomechanical studies. Unlike muscle-based models, there is no propagation of movement. It allows hierarchical definitions of movement in the face, defining low-level actions that can be combined to form facial expressions and/or lip shapes for speech.


The face model and the way it is animated determine the constraints under which the face can be manipulated and how those constraints are managed. An understanding of the techniques used for face modelling and animation provides insight into the evolution of facial animation and how it relates to our project.



3.3 Nonverbal Behaviour and Communication



Communication is a dynamic process with many interacting components. Nonverbal cues may provide clarity, meaning or contradiction for a spoken utterance. Nonverbal cues can also influence how we perceive others and how we, ourselves are perceived. Familiar faces may make us more likely to start a relationship and continue it (Chovil, 1991). A large number of studies have been conducted to aid understanding of nonverbal communication and its role in human interaction (Ekman, 1992) (Chovil, 1991) (Harper et al., 1978). Nonverbal communication is an important means to convey meaning and information at the verbal, semantic and emotional level.


Ellyson and Dovidio (1985) define the term nonverbal behaviour as that not part of formal, verbal language, referring to facial expressions, body, gaze and hand movements significant through the discourse of social interaction. Malandro (1989) elaborated on the work of Ellyson and Dovidio (1985) and defined nonverbal communication as the process by which nonverbal behaviours are used, either independently or in combination with verbal behaviours.


Miller (1981) has identified the primary uses of nonverbal behaviour in human communication as:


  1. Expressing emotion.


  2. Conveying interpersonal attitudes.


  3. Accompanying speech for the purpose of managing turn taking, feedback and attention.



Miller (1981) suggests that only 7% of a message is sent through words, with the remaining 93% sent through facial expressions (55%) and vocal intonation (38%).


He further explains why humans use non-verbal communication to such a degree:



  1. Non-verbal signals are powerful. They primarily express inner feelings and evoke immediate action or response.


  2. Non-verbal messages are likely to be more genuine. Non-verbal behaviours are not as easily controlled as spoken words, with the exception of some facial expressions and tone of voice.


  3. Non-verbal signals can express feelings too disturbing to state. These are feelings of superiority or dislike, or feelings that etiquette or rules may prevent from being stated verbally. There is also the advantage of being able to change one's mind, since a commitment has not been made out loud.


  4. Words have limitations. It is easier to explain the shape of something or give directions using hand gestures or head nods.


Nonverbal cues are also symbols whose meaning must be interpreted. Ellyson and Dovidio (1985) suggest that, in general, nonverbal symbols perform five activities of nonverbal behaviour.



The non-verbal signals and expressions exhibited during communication form the basis for the subset of FAML tags that are to be implemented. Facial expressions, head kinesics and eye behaviour all contribute to the realism of human behaviour. Miller (1981) further highlights the importance of non-verbal communication and alludes to its uses during communication.

An important component of nonverbal communication is facial expression, movement and action. These facial components of nonverbal communication are described in the following section.

3.3.1 Facial Displays


There are three main views on facial expression and facial displays and how they relate to communication. The "emotional view" correlates the movement of the face with the emotional state of the person. In essence, emotions are central to the display of facial movements and expressions (Ekman and Rosenberg, 1997). Contrary to this, the "behavioural ecology view" does not treat facial displays as expressions of emotion, but rather as social signals of intent, which have meaning only in the social context (Chovil, 1991) (Fridlund, 1994). Recently, facial expression has also been considered as an emotional activator in the "brain plasticity view" (Zajonc, 1994) (Ekman and Davidson, 1994) (Camras, 1992) (Lisetti and Schiano, 2000).


Emotional View: Expressions of Emotion


The emotional view suggests that there are essentially only two types of facial actions. The first are the reflex actions that indicate ongoing emotion and display them with facial expressions of emotions. The second are instrumental facial actions that show emotion that is not occurring, and reflect everyday social interactivity, such as a smile of politeness.


The emotional view has proposed a subset of universal emotions that are accompanied by facial displays. Ekman and Friesen (1975) identified six basic universal emotions: surprise, fear, anger, disgust, sadness and happiness. These basic emotions are discussed in further detail in section 3.3.3.



Behavioural Ecology View: Signals of Intent


Furthermore, facial expression can also be considered as a multi-modal form of communication, the face being only one independent element conveying conversational signals. It was noted by Birdwhistell (1970) that although the human face is capable of as many as 250,000 expressions, fewer than 100 sets of expressions constitute distinct and meaningful symbols. The table below lists communicative displays whose categorization is based mostly on Chovil (1991):


Syntactic Display

  1. Exclamation mark : Eyebrow raising
  2. Question mark : Eyebrow raising or lowering
  3. Emphasiser : Eyebrow raising or lowering
  4. Underliner : Longer eyebrow raising
  5. Punctuation : Eyebrow movement
  6. End of an utterance : Eyebrow raising
  7. Beginning of a story : Eyebrow raising
  8. Story continuation : Avoid eye contact
  9. End of a story : Eye contact

Speaker Display

  10. Thinking/Remembering : Eyebrow raising and lowering, closing the eyes, pulling back one mouth side
  11. Facial Shrug: "I don't know" : Eyebrow flashes, mouth corners pulled down, mouth corners pulled back
  12. Interactive: "You know?" : Eyebrow raising
  13. Metacommunicative: indication of sarcasm or joke : Eyebrow raising and looking up and off
  14. "Yes" : Eyebrow actions
  15. "No" : Eyebrow actions
  16. "Not" : Eyebrow actions
  17. "But" : Eyebrow actions

Listener Comment Display

  18. Backchannel:
  19. Indication of attendance : Eyebrow raising, mouth corners pulled down
  20. Indication of loudness : Eyebrows drawn together
  21. Understanding levels:
  22. Eyebrow raising : Eyebrow raising, head nod
  23. Moderately confident : Eyebrow raising
  24. Not confident : Eyebrow lowering
  25. "Yes" : Eyebrow raising
  26. Evaluation of utterances:
  27. Agreement : Eyebrow raising
  28. Request for more information : Eyebrow raising
  29. Incredulity : Longer eyebrow raise

Table 1 Some communicative facial displays categorized by Chovil (1991)


These communicative signals were implemented in a human-computer interface system by Nagao and Takeuchi (1994) with successful results indicating that facial displays help conversation in the case of initial contact.


In the behavioural view there are no fundamental emotions or fundamental expressions. This view does not treat facial displays as "expressions" of discrete or internal emotional states. Facial displays are considered as a "signification of intent", evolving in response to stimulus. Facial displays have meanings specific only to their context of occurrence, and are only used to serve the user's social motives in that context. These motives do not necessarily have any relation to emotion, and a range of emotions can occur in one social motive. Facial displays therefore depend upon the intent of the user, the behaviour of the listener, and the context of the interaction, and not on inner feelings as the emotional view suggests (Lisetti and Schiano, 2000).


Brain Plasticity: Emotional Activators and Regulators


Based on breakthroughs in the neuroscience of the human brain, facial actions have recently been considered as emotional activators and regulators. Research suggests that facial actions such as muscle movements can in fact generate emotion, as opposed to being just an expression of emotion (Ekman 1993). Research conducted by Ekman and Davidson (1994) suggests that by voluntarily smiling, it is possible to generate a happy emotion within an individual. In this sense facial movements and expressions are used to activate and regulate emotion. They suggest that facial movements could help change the emotional state of a person.


The question of whether facial activity is a necessary part of emotion is of particular concern to the project. Understanding the link between facial expression and emotion helps identify the subset of non-verbal facial behaviours that are used during communication. It also supports the implementation of a better model of expression to produce emotion, and gives insight into the types of expression required for an emotional display. This in turn improves the ability to create more realistic and believable characters that exhibit the illusion of life.


3.3.2 Facial expression


The context of this project will be placed within the behavioural ecology view that facial expression and displays are used as a form of multi-modal communication centering on the human face. This is the computationally simplest method of viewing non-verbal facial expressions and displays. As such, facial expressions do not necessarily correspond to any particular emotion. Some facial expressions are used to accentuate words in an utterance. The raising of the eyebrow can be used to punctuate a discourse and not be a signal of surprise. Ekman (1982) characterized facial expressions into the following groups.


  1. Emblems: correspond to the meanings of well known but culturally dependent movements. They can be used to replace verbal expressions such as a nod for "yes" and a shake for "no". Essentially emblems are a way to iconically accentuate what is being said.

  2. Emotional emblems: are made to convey signals about an emotion that is being referenced. A person uses emotional emblems to refer to an emotion. For example, when you talk about something disgusting you wrinkle your nose, although you do not actually feel disgust at the time.

  3. Conversational signals: are made to punctuate speech, or to emphasize it. Raising the eyebrows may be used to punctuate the end of an utterance.

  4. Punctuators: are movements over pauses. Certain head movements occur over pauses.

  5. Regulators: are movements that help the interaction between speaker and listener. They control the turn-taking between speakers in a conversation.

  6. Manipulators: correspond to the biological needs of the face, for instance blinking the eyes to keep them moist.

  7. Affect displays: are facial expressions of mood.



The following features were identified by Pelachaud et al. (1994) as relevant in modelling the human face. The relevance of these features comes from their role in facial conformation, movement, and communication.


Nose : Nose movement usually conveys an emotion of disgust. Furthermore, nostril movements are observed during deep respiration and inspiration. The size and shape of the nose varies among people with different origins. Nose shape contributes significantly to identification.


Eyebrows : Eyebrow movement is vital, both in verbal and non-verbal communication. The eyebrows are predominantly visible in emotions such as "surprise", "fear", and "anger".


Eyes : Eyes are a crucial source of expressive information. When looking at a picture of a person, people tend to devote the greatest attention to the eyes. The eye movement may reveal "interest" or "attention" of a person. The shape, size, and color of the eyes provide cues in recognizing individuals.


Ears : A face without ears looks like a mask. Ears have an intricate structure and shape. Modelling the detailed shape of ears may not be necessary, depending on the application. However, the simplification of ear shape changes the appearance of a complete face. Ear movement is extremely rare in humans.


Mouth : The mouth is a highly articulate facial zone. Lips articulate elaborately during speech. Modelling of lip motions should be able to open the mouth, stretch the lips, protrude the lips etc., to produce the phonemes and basic emotional expressions.


Cheeks : Cheek movement is visible in many emotional states. Generally, cheek movements supplement other movements that may include the mouth or lower part of the eyes. Actions such as the puffing and sucking of cheeks may provide emphasis for certain emotions. They reveal characteristic movements during sucking or whistling.


Chin : The movement of the chin is mainly associated with jaw motion. However, the chin is distinctively deformed to indicate "disgust" and "anger" with the lips tightened. The shape of the chin also plays an important role when conforming facial models to individuals.


Neck : The neck permits the movement of the entire head, such as nodding, turning, rolling etc. As the neck moves, it can change its width or it may elongate.


In the context of this project, the eyes, eyebrows, mouth, chin, nose, cheeks and ears form the basis of the facial features in the Talking Head, and as such should be included in the FAML subset of tags animating the Talking Head. The neck, however, as stated in the delimitations, is not independent of the head, and as such is unable to move independently. All other facial features have been modeled accurately.


As stated by Pelachaud et al. (1991) all categories of facial expressions as outlined by Ekman (1979) need to be included and integrated to obtain a more complete facial animation. In the context of our project and the FAML, we need to ensure that for the effect of realism we provide a set of FAML tags that cover a subset of the identified categories of facial expression to provide a set of tools for the author to create the believable characters. Facial expressions occur continuously during speech, both complementing and reinforcing the information delivered in speech.


Temporal characteristics of facial actions


Facial expression can be defined as time-dependent changes in facial movement and can be described by the following three temporal parameters:


  1. Onset duration : How long the facial display takes to appear.

  2. Apex duration : How long the expression remains in the apex position.

  3. Offset duration : How long the expression takes to disappear.


Facial displays of expressions and emotion differ in the aforementioned parameters. For example the expression of sadness has a slow offset, whilst expression of happiness has a short onset. Although these parameters are vital in terms of believable animation of expression and emotion, observation of the literature indicates that there exists little data on the definitive values of onset, apex and offset durations (Essa, 1994) (Yacoob and Davis, 1994) (Bartlett et al., 1999). Pelachaud et al. (1996) use three parameters to specify a facial expression. Kalra (1993) used four parameters, attack (onset), decay, sustain (apex), and release (offset).


In the context of this project, we utilise the three parameters of onset, apex and offset for the temporal characteristics of all facial expressions, gestures and emotions. The three parameters provide adequate realism in facial expressions as indicated by the literature. The extra parameter of decay, suggested by Kalra (1993), did not provide a significant enough increase in realism to warrant a fourth parameter to model temporal changes of expression.
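The three temporal parameters can be illustrated with a minimal sketch. It assumes a linear rise during onset and a linear fall during offset, which is an assumption made for illustration only; the function name `expression_intensity` and the `peak` argument are hypothetical and are not taken from the FAML implementation.

```python
def expression_intensity(t, onset, apex, offset, peak=1.0):
    """Return the intensity of an expression t seconds after it starts.

    onset, apex and offset are durations in seconds. The linear ramps
    are an illustrative assumption, not a measured profile.
    """
    if t < 0:
        return 0.0
    if t < onset:
        # Expression is appearing: rise from 0 to peak.
        return peak * t / onset
    if t < onset + apex:
        # Expression is held at its apex.
        return peak
    if t < onset + apex + offset:
        # Expression is disappearing: fall from peak to 0.
        return peak * (1.0 - (t - onset - apex) / offset)
    return 0.0
```

Under these assumptions, an expression with a short onset and a long offset (as suggested for sadness) is obtained simply by choosing a small `onset` and a large `offset`.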


Synchronism


A person conveys his thoughts with words and facial expressions. For example, actions such as smiling, raising of the eyebrow and wrinkling of the nose often occur with speech. Facial expressions accompany the flow of speech and are synchronised at the verbal level, punctuating accented segments and pauses.


An important aspect of communication is the link between gesture and speech and their tendency to occur in synchrony (Condon and Ogston, 1971). Synchrony implies that changes in speech and in body movements, such as those of the head and facial expressions, appear at the same time. For example, when a word begins to be articulated, eye blinks, head movement, head turning and brow movements can occur and finish at the end of the word.


Synchrony among body and facial motions occurs at all levels of speech, including the phoneme, the syllable, the word, the intonational phrase and the utterance (Cassell et al., 1994c). Speech has to be synchronised with lip movement, but this also includes facial expressions and gaze. A delay in the synchronisation process is easily perceived by the viewer and can appear unnatural and disturbing (Malandro, 1989).


Timely responses are crucial to successful conversation, since some delay in reactions can imply specific meaning or make the utterance unnecessarily ambiguous (Nagao and Takeuchi, 1994). Systems that use an automated interaction of both audio and visual channels (Pelachaud et al., 1996) (Nagao and Takeuchi, 1994) (Ostermann et al., 1998) (Cassell et al., 1994b) use the audio channel as the synchronous clock. The audio module sends a signal to the visual module, ensuring that the audio and visual representations are synchronised to support the communicative process.


Synchronism for this project is implemented at the word level. All gestures, facial expressions and movements are linked to the start time of words in the utterance. The audio channel is used as the clock to denote the start times and durations of words in the utterance. The literature supports the use of the audio channel as the basis of synchronism between gesture, expression and speech. In the context of this project, the Text-To-Speech module signals the FAML module, ensuring that the audio and visual representations of speech and facial gestures, expressions and emotions are synchronised (Ostermann et al., 1999).
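Word-level synchronism can be sketched as follows. The sketch assumes that the Text-To-Speech module reports a start time and duration for every word, and that each gesture is attached to the index of a word in the utterance. The names `WordTiming`, `GestureEvent` and `schedule_gestures` are hypothetical and are used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class WordTiming:
    text: str
    start: float      # seconds from the start of the audio (assumed unit)
    duration: float   # seconds


@dataclass
class GestureEvent:
    gesture: str
    start: float      # taken from the audio clock
    word: str


def schedule_gestures(timings, gestures_by_word_index):
    """Link each gesture to the start time of the word it is attached to."""
    events = []
    for index, gesture in sorted(gestures_by_word_index.items()):
        word = timings[index]
        events.append(GestureEvent(gesture, word.start, word.text))
    return events


timings = [WordTiming("hello", 0.0, 0.4), WordTiming("there", 0.45, 0.35)]
events = schedule_gestures(timings, {1: "eyebrow_raise"})
```

Here the audio timings act as the only clock: the visual module never chooses a start time of its own, it only reads the start time of the word to which the gesture is attached.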


Gestures occur in parallel with speech, although in the case of hesitations, pauses or syntactically complex speech, it is the gestures that appear first (McNeill, 1992). At the most local level, individual gestures and words are synchronised in time so that the "stroke", the most energetic part of the gesture, occurs either with or just before the phonologically most prominent syllable of the accompanying speech segment (McNeill, 1992).


Multi-modal communication: The link between gesture and speech


Evidence presented by Kendon (1994) suggests that there is a close relationship between speech and spontaneous gestures during conversation. McNeill (1992) suggests that 75 percent of speech is accompanied by gestures, although the proportions of gestures change. In general, gesture types occur in all languages. For instance, many hesitation gestures occur at the beginning of speech and correlate with the avoidance of gaze (the head turns away from the viewer) as if to help the speaker to concentrate on what is going to be said.


Communication is still possible without gesture. Information appears to be communicated about as effectively in the absence of gestures (Williams, 1977), for example on the telephone. However, it has been shown that when speech is ambiguous or obscure, listeners tend to rely on gestures to fill in the gaps in their comprehension.


It is noted that gesture and speech do not always manifest the same information, yet they remain consistent in two respects: semantically, in that speech and gesture give a consistent view of an overall meaning to be conveyed, and pragmatically, in that speech and gesture mark information about this meaning as advancing the purpose of conversation in a consistent way. For example, gestures may depict the way in which an action was carried out when this aspect of meaning is not depicted in speech (Cassell and Stone, 1999).


McNeill (1992) stated that in terms of a computational implementation model, gesture and speech must arise from a common conceptual source, and that gesture plays an intrinsic role in communicative intent. In the implementation of the model two aspects must clearly be defined. Firstly, one single underlying conceptual source must serve as the representation that gives rise to the form of both speech and gesture. Secondly, communicative intent must be specified.


According to McNeill (1992), gesture and speech arise together from the underlying representation that has both visual and linguistic aspects, and so the relationship between gesture and speech is essential to the production of meaning and comprehension.


We have sought to ensure that the chosen subset of FAML tags is sufficiently comprehensive to allow the animator to mimic the relationship between gesture and speech. As indicated by the literature, this is an important aspect of multi-modal communication, as gestures and facial expression supporting the speech aid in comprehension. So too, the FAML tags will enable the animation of the Talking Head to support the synthesised speech.




3.3.3 Emotion


When people speak, there is almost always emotional information communicated with speech. This emotional information is conveyed through multiple communication channels, including emotional qualities of the voice and visible facial expression.


Producing emotional responses requires both the ability to generate facial expressions, and a model for synthesizing appropriate emotion in a dynamic environment. Three main areas of the face are involved in visible expression, firstly, the upper part of the face, with the brows and forehead, secondly the eyes and thirdly the lower part of the face with the mouth (Parke and Waters, 1996). An emotion is defined as the evolution of the human face over time: it is a sequence of expressions with various durations and intensities (Ekman, 1978).


Events can often elicit multiple emotions whose effects blend together. For example a person can be both surprised and frightened. Such emotion can appear concurrently or in rapid succession.


Emotions can sometimes be confused with other aspects of expression, such as reflex and mood. A reflex, such as from being startled, is a brief event that, unlike an emotional response, cannot be completely inhibited. Mood, by contrast, stretches over a longer period of time than an emotion, and refers more to the tendency towards an emotional display within a person. An emotion has a limited duration, half a second to four seconds as suggested by Ekman (1982), and the facial muscles cannot hold the expressions for minutes or hours (Ekman, 1982).


Each specific emotion has an average overall duration, but the actual duration varies with context. For example, a smile of politeness may last a few seconds, but a smile may last longer with euphoria. Emotions adhere to the same temporal characteristics as described previously. When the overall duration of the emotion is lengthened, the temporal stages of onset, apex and offset expand proportionally.
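The proportional expansion of the temporal stages can be written as a small worked example. The function name `scale_stages` is hypothetical; the sketch only assumes that each stage keeps its share of the overall duration when that duration changes.

```python
def scale_stages(onset, apex, offset, new_total):
    """Stretch or shrink onset, apex and offset to fit a new overall duration.

    Each stage keeps the same proportion of the total that it had before.
    """
    total = onset + apex + offset
    if total <= 0:
        raise ValueError("overall duration must be positive")
    factor = new_total / total
    return onset * factor, apex * factor, offset * factor
```

For instance, an emotion with stages of 0.5, 1.0 and 0.5 seconds that is lengthened from 2 to 4 seconds would, under this assumption, have stages of 1.0, 2.0 and 1.0 seconds.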


Ekman and Friesen (1978) found six emotions to have universal facial expressions: sadness, anger, joy, fear, disgust and surprise. Most existing facial animation systems use this set of emotions (Pelachaud et al., 1996) (Nagao and Takeuchi, 1994) (Cassell et al., 1994b), as does the MPEG-4 specification to which this project is delimited.


Sadness


Sadness has many intensities and variations, including open-mouth crying, closed-mouth crying, suppressed sadness, nearly crying and miserable. In simple sadness the inner portions of the eyebrows are bent upwards and the corners of the mouth bend slightly downwards (Parke and Waters, 1996) (Flemming and Dobbs, 1999).


Anger


Anger can be aroused by frustration, physical threat, or psychological harm. In simple anger, the inner corners of the eyebrows are pulled downward and together. The lower edge of the eyebrow is at the same level as the upper eyelid. The mouth is closed with the upper lip slightly compressed or squared off. Variations of anger include shouting rage, rage, and sternness (Parke and Waters, 1996) (Flemming and Dobbs, 1999).


Joy


In simple joy, the eyebrows are relaxed. The upper eyelid is lowered slightly and the lower eyelid is straight, being pushed up by the upper cheek. The mouth is wide with the corners pulled back towards the ears. Variations of joy include uproarious laughter, laughter, sly smile, open smile, false smile and false laughter (Parke and Waters, 1996) (Flemming and Dobbs, 1999).


Fear


Fear arises from persons, or situations that seem dangerous. Fear can range from worry to terror. In fear the eyebrows are raised and pulled together. The inner portions of the eyebrows are bent upwards. The eyes are alert. The mouth might be slightly dropped open and stretched horizontally (Parke and Waters, 1996) (Flemming and Dobbs, 1999).


Disgust


Disgust is a reaction to something that is unpleasant or distasteful. Disgust ranges from disdain to physical repulsion. In disgust the eyebrows are relaxed. The eyelids are relaxed or closed. The upper lip is raised in a sneer, often asymmetrical. For physical repulsion the eyebrows are lowered, especially at the inner corners. The eyes may be mostly shut in a squint (Parke and Waters, 1996) (Flemming and Dobbs, 1999).


Surprise


Surprise is a reaction to a sudden, unexpected event. In surprise the eyebrows are raised straight up as high as possible. The upper eyelids are opened as wide as possible with the lower eyelids relaxed. The mouth is dropped open without muscle tension to form an oval shape (Parke and Waters, 1996) (Flemming and Dobbs, 1999).


Emotions constitute the primary motivational system of humans. The description of emotion constitutes various components, such as physical responses, autonomic nervous system and brain responses, verbal responses (vocalisations), memories, feelings and facial expressions (Ekman, 1982). For a believable interactive application, there needs to be a connection of facial expression generation with a process that produces believable behaviours given the inputs to the system.


We have chosen to implement these six universal emotions stipulated by Ekman (1978) as they enable the connection of facial expression with the portrayal of believable behaviours, enabling the animator through the use of FAML tags to create realism and believability in the character animation.


3.3.4 Head Movement


Movements of the head and facial expressions can be characterized by their placement with respect to the linguistic utterance and their significance in transmitting information. Head movements can be categorised into three distinct sections: head turning, head nodding and head orientation (Shepherdson, 2000).


Head turns denote the direction of gaze and are accompanied by a change of head position. Head nodding is an example of an emblem, a form of nonverbal communication that can be directly related to a verbal phrase. A head nod could show agreement "yes", whilst a headshake would show disagreement "no". Head orientation can be used to impart personality traits, such as the lowering of the head to show submissive nonverbal behaviour (Shepherdson 2000).


Head movement can also coincide with hesitation and pauses within speech. Hadar et al. (1983) examined the relationship between head movement and speech. They established a link between the temporal aspects of head movement and the prosodic nature of speech. Head movement was classified into three categories based on its temporal aspects: (1) slow movements, occurring at 0.2-1.8 Hz, (2) ordinary movements at 1.8-3.7 Hz, and (3) rapid movements at 3.7-7.0 Hz. Hadar et al. (1983) concluded that primary accents are marked by rapid movements, while ordinary movements followed by stillness denoted terminal points or the end of conversation. Rapid movement may also occur during marked repetition of syllables or words and short speech pauses.
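The three frequency bands reported by Hadar et al. (1983) can be expressed as a simple classification. The bands in the literature share their end points, so assigning 1.8 Hz to "ordinary" and 3.7 Hz to "rapid" is an assumption made here, as is the function name `classify_head_movement`.

```python
def classify_head_movement(frequency_hz):
    """Classify a head movement by its frequency in Hz.

    Returns "slow", "ordinary" or "rapid", or None when the frequency
    falls outside the reported 0.2-7.0 Hz range. The treatment of the
    shared band edges is an illustrative assumption.
    """
    if 0.2 <= frequency_hz < 1.8:
        return "slow"
    if 1.8 <= frequency_hz < 3.7:
        return "ordinary"
    if 3.7 <= frequency_hz <= 7.0:
        return "rapid"
    return None
```

A scripting tool could use such a classification to choose a rapid movement for a primary accent and an ordinary movement followed by stillness at the end of an utterance.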


3.3.5 Eye behaviour


Visual behaviour of the eyes is an important feature whose main functions are to help regulate the flow of conversation, to signal the search for feedback during interactions, to express emotion or to influence another person's behaviour (Walker and Trimboli, 1993) (Webbink, 1986). Eye contact is an important non-verbal method of establishing relationships and communicating with others. People are very sensitive to eye behaviour and are able to perceive the slightest change in eye direction.


As discussed by Argyle and Cook (1976), eye movement can be defined by the direction of gaze, the point or points of fixation, the percentage of eye contact over gaze avoidance, and the duration of eye contact. A common metric for eye behaviour is "interest". Eyes tend to fixate on objects of interest for longer periods of time. When a person is exasperated, trying to solve a problem, or trying to remember something, the eyes will look up (Parke and Waters 1996).


Eye movement


The eyes are in a state of continual motion, usually with rapid changes in fixation. When looking at another person, research conducted by Argyle and Cook (1976) found that the viewer concentrates upon the eyes of the other person 58 percent of the time, and then their mouth 13 percent of the time. The remaining regions of the face are attributed only one percent of the time.


The actual change of focus of the eyes, close focus versus distance focus, constitutes about 6 millimeters in iris displacement (Benoit et al., 1999); however, this is easily perceived by humans. Pupils are closer to each other during close focus than distant focus, giving rise to the term "cross-eyed". This highlights the importance of synchronisation of eye movements for both the left and right eyes.


Eye-head coordination


Argyle (1975) stated, on the basis of experimentation, that when people break eye contact to avoid gazing at one another, they usually move their heads to look away. A change in the direction of gaze is frequently accompanied by head movement (Argyle and Cook, 1976) (Bizzi, 1974). For example, a sad person has a tendency to look down as well as lowering the head. In the case of a predictive event, an event that preludes another event (Bizzi, 1974), the head generally moves before the eyes, which eventually follow with rapid eye movements. If a person lowers the head first, maintaining gaze, and then casts the eyes downward, this preludes the expression of sadness in the person. This is the only case of a predictive event as discussed by Bizzi (1974). However, in general the eyes lead head movement (Parke and Waters, 1996) (Argyle, 1975) (White, 1986) (Maestri, 1996).


Blinking of the eyes


Blinking forms an important aspect of synthetic facial animation. Blinking is the rapid closure and opening of the lower and upper eyelids, a process that occurs simultaneously in the left and right eyes. The eyes blink frequently, serving not only to accentuate speech, but also to satisfy the biological need to lubricate the eyes. In general, there is at least one blink per utterance (Parke and Waters, 1996) (Pelachaud and Prevost, 1995).


It is important to note that the structure of the eye blink is synchronised to the articulation in speech: the eye might close over one syllable and open on another, and blinks can also occur on stressed vowels (Condon and Ogston, 1971).


Blinking can be categorized by the following parameters, as outlined by Pelachaud and Prevost (1995):



As discussed by Parke and Waters (1996), observations of face-to-face communication show that there exists synchrony between the speaker's voice and the speaker's eye blinks. The speaker's eye blinks tend to follow pauses in the speech, with experimental results showing that this occurs about 75 percent of the time.


Blink occurrence is also emotionally dependent. During fear, tension, anger, excitement and lying, the amount of blinking increases while it decreases during periods of concentration. Blinks also occur on any shift of eye direction as they call attention to change, as well as allowing the animator to make the expression stronger. The eyes are the most important part of an expression and must be animated with care. Any jitter or false movement on an in between destroys both communication and believability (Thomas and Johnston, 1981).


The discussion on eye and head movement highlights the importance of the eyes and head as forms of non-verbal communication and behaviour. It is clear that with regard to this project, the ability to control the movement of the eyes is essential for added realism. The FAML tags provide a subset of both head and eye movements to allow further realism in the scripting of the Talking Head animation.


3.4 Implemented Visual Text-to-Speech (VTTS) systems


From observations of the literature there are a number of systems that integrate a Talking Head with speech, facial expression and gestures, each with varying degrees of realism and effectiveness.


Morphing Systems


Ezzat and Poggio (1997) implemented a VTTS system that pre-stored all the images of the visemes, the visual representations of the phonemes, to allow the animation of lip movement during speech. The intermediate visemes were animated using a morphing technique. The system used optical flow methods borrowed from the computer vision literature to compute realistic transitions from each viseme to every other viseme. A text-to-speech (TTS) synthesiser was exploited to generate phonemes/visemes and timing information to determine what visemes to use and the rate of morphing. Using this technique Ezzat and Poggio (1997) were able to synchronise the visual speech stream with the audio speech stream, and hence give the impression of a video-realistic talking face. It can however be noted that Ezzat and Poggio (1997) only morphed viseme transitions and not any other facial gesture or feature. Eyebrow movement, blinking and nodding of the head were omitted.


Cosatto and Graf (1998) also used the method adopted by Ezzat and Poggio (1997) for facial animation but implemented a new technique that was capable of extracting facial parts such as the mouth, eyebrows into a compact library independently of each other. Then using these face models and a TTS, new video sequences are "warped" or "morphed" between different views. Because the facial features are controlled independently, each facial feature can be warped independently of the other. This technique can provide photo realistic animation of a Talking Head, however the difficulty is in finding precise specifications of the displacements of many points in order to guarantee results that mimic real faces. Moreover, the computation of such displacement is actually quite expensive and could never be used in real-time animation with the current level of technology. Both Cosatto and Graf (1998) and Ezzat and Poggio (1997) implemented systems animating a Talking Head but the computational expense was far too great for a real-time application. Both techniques suffered from any type of misalignment between visemes, which greatly degraded the performance of the facial animation.


Although the morphing approach does produce sufficient results with regard to realism, as indicated by the literature, the morphing technique cannot be applied to this project, as the animation system is based on the parametric head model and utilises the MPEG-4 facial animation coding system. MPEG-4 will be discussed in further detail in section 3.5.


Parametric systems


Cassell et al. (1994a) developed a rules-based model for the interaction between intonations and gesture, and implemented these rules in a conversation simulation system with two Talking Heads. Although the modelling of the interrelationships between speech and gesture is beyond the scope of this project, the implementation and synchronisation of the gestures is relevant. The implementation of the gestures was carried out by a group of Parallel Transition Networks (PaT-Nets), finite state machines, several of which ran in tandem. The PaT-Nets govern the production of the gesture and integration of the gesture into the facial animation. An AT&T Bell laboratories TTS synthesiser was used to produce the actual speech wave and phoneme timings. The phoneme timings, duration outputs and speech waves from the synthesis were merged together by rule with the abstract intonational and gestural notations. The detailed timing information allowed the synchronisation of the gestural animations with the speech.


The approach taken by Pelachaud et al. (1994), although similar to Cassell et al. (1994a), used the FACS notation (Facial Action Coding System) created by P. Ekman and W. Friesen (1978) to describe visible facial expressions. FACS describes temporary changes in facial appearance, how a feature is affected by its location, and the intensity of changes. An Action Unit (AU) corresponds to an action produced by one or a group of muscles. The facial model presented by Pelachaud et al. (1994) integrated both the FACS and the AU to realistically animate the Talking Head. Expressions and facial gestures were broken into the corresponding AU or groups of AU and these were in turn animated using the FACS. Synchronisation was implemented in the same manner as Cassell et al. (1994b) and used timing information from the phoneme and TTS synthesiser.


The model implemented by Pelachaud et al. (1994) and Cassell et al. (1994a) provides an animation control system based on rules rather than tags. The system utilises a semantic model of the input text and, based on behavioural rules of non-verbal communication, links the facial gestures, expressions and emotions to the Talking Head animation. This project makes the same link between the facial gestures and expressions of non-verbal communication, the speech and the Talking Head animation, but uses tags to do so. The rules used in both the Pelachaud et al. (1994) and Cassell et al. (1994a) systems are based on FACS, which is also very similar to the MPEG-4 coding system discussed in section 3.5. The knowledge gained from the rules of non-verbal communication provides a good indication of the subset of FAML tags that is required to truly mimic the non-verbal behaviour in humans.


The FAQBot (Beard et al. 1999) was implemented as a collaboration between Curtin University and the University of Genoa. The FAQBot was to provide a humane interface to a frequently asked questions (FAQ) database. It was implemented using the MPEG-4 coding standard, which has already defined and standardised the animation of a synthetic face. The facial animation engine (FAE) implemented for the FAQBot application is similar to FACS as implemented by Pelachaud et al. (1994), and the higher order expressions, such as a smile, can similarly be thought of as a collection of AUs. In its initial implementation there were no facial gestures, simply a Talking Head with lip-synchronised speech. Further work has integrated a personality module to give the Talking Head behavioural parameters and facial gestures. However, there is still no mechanism to synchronise specific gestures to the words or phonemes in the spoken speech. The FAQBot application is the foundational work for this project.


Ostermann et al. (1998) investigated the integration of Talking Heads and text-to-speech synthesisers for a visual TTS (VTTS). The VTTS synthesiser allows facial expressions to be defined as bookmarks in the text, which are used to animate the Talking Head while it is talking. The bookmark itself names the expression, its amplitude and the duration within which the amplitude has to be reached by the face. Ostermann et al. (1998) used MPEG-4 as the animation system. Their research has outlined a method of driving an MPEG-4 facial animation from the input text. The bookmarks provide a mechanism to link a gesture or facial expression to its position in the synthesised speech, and the work also provides a syntax for the bookmarks.
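
The bookmark idea can be sketched as follows. The square-bracket syntax `[expression amplitude duration]` used here is invented for illustration and is not the syntax of Ostermann et al. (1998). The sketch only shows how a bookmark carrying an expression name, an amplitude and a duration can be separated from the text while its position among the spoken words is recorded.

```python
import re
from dataclasses import dataclass


@dataclass
class Bookmark:
    expression: str
    amplitude: int
    duration_ms: int
    word_index: int  # index of the first word spoken after the bookmark


# Hypothetical bookmark syntax: [name amplitude duration_ms]
BOOKMARK_RE = re.compile(r"\[(\w+)\s+(\d+)\s+(\d+)\]")


def split_text_and_bookmarks(marked_text):
    """Return the plain text for the synthesiser and the list of bookmarks."""
    words = []
    bookmarks = []
    position = 0
    for match in BOOKMARK_RE.finditer(marked_text):
        # Words that precede this bookmark are passed through to the speech text.
        words.extend(marked_text[position:match.start()].split())
        name, amplitude, duration = match.groups()
        bookmarks.append(Bookmark(name, int(amplitude), int(duration), len(words)))
        position = match.end()
    words.extend(marked_text[position:].split())
    return " ".join(words), bookmarks


plain, marks = split_text_and_bookmarks(
    "Hello [smile 40 500] world, how [surprise 20 300] are you"
)
```

The plain text would be passed to the synthesiser, and each recorded word index could then be converted to a time once the synthesiser reports its timings.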


Ostermann's implementation of the bookmark mechanism for the MPEG-4 animation of the Talking Head forms a primary knowledge base for the implementation of the FAML tag system. The tag specification, as delimited by the scope of this project, is derived from the work of Ostermann. Ostermann et al. (1998) describe the process by which the bookmarks alter the flow of the animation and how they are co-articulated to ensure continuous and flowing animation.


3.5 MPEG-4



As indicated in the delimitations section of this project, MPEG-4 forms the definition and animation control system. MPEG-4 was developed by the Moving Picture Experts Group (MPEG) and has been standardised by the International Organization for Standardization (ISO) (MPEG 1999). MPEG-4 enables the integration of face animation with multimedia communications and presentation. With regard to this project, MPEG-4 forms a crucial part of the architecture that drives the face model, including the facial expressions derived from the text input.


MPEG-4 separates the animation into two bit-streams, the face animation bit-stream and the audio bit-stream