
To appear in Journal of Artificial Intelligence Review: Special Issue on the Integration of Natural Language and Vision Processing.

Research in Multimedia and Multimodal Parsing and Generation

Mark T. Maybury
The MITRE Corporation
Artificial Intelligence Center
Mail Stop K331, 202 Burlington Road
Bedford, MA 01730
(617) 271-7230
maybury@mitre.org

Abstract

This overview introduces the emerging set of techniques for parsing and generating multiple media (e.g., text, graphics, maps, gestures) using multiple sensory modalities (e.g., auditory, visual, tactile). We first briefly introduce and motivate the value of such techniques. Next we describe various computational methods for parsing input from heterogeneous media and modalities (e.g., natural language, gesture, gaze). We subsequently overview complementary techniques for generating coordinated multimedia and multimodal output. Finally, we discuss systems that have integrated both parsing and generation to enable multimedia dialogue in the context of intelligent interfaces. The article concludes by outlining fundamental problems which require further research.

KEYWORDS: Multimedia Interfaces, Multimodal Interfaces, Parsing, Generation, Intelligent Interfaces

1. Introduction

When humans converse with one another, we utilize a wide array of media to interact including written and spoken language, gestures, and drawings. We exploit multiple human sensory systems or modes of communication including vision, audition, and taction. Some media and modes of communication are more efficient or effective than others for certain tasks, users, or contexts (e.g., the use of maps to convey spatial information, the use of speech to control devices in hand and eyes-busy contexts). In other cases, some combination of media supports more natural, efficient, or accurate interaction (e.g., "put that there" interactions which mix speech and gesture). Whereas humans have a natural facility for managing and exploiting multiple input and output media and modalities, computers do not. Hence, the ability of machines to interpret multimedia input and generate multimedia output would be a valuable facility for a number of key applications such as information retrieval and analysis, training, and decision support. While significant progress has been made developing mechanisms for parsing and generating single media, less emphasis has been placed on the integration and coordination of multiple media and multiple modalities. The purpose of this article is to introduce techniques for building multimedia and multimodal interfaces, that is, those interfaces that parse and generate some combination of spoken and written natural language, graphics, maps, gesture, non-speech audio, animation, and so on.

The primary motivation for this line of research is the premise that human abilities should be amplified, not impeded, by using computers. While there has been much research focused on developing user, discourse, and task models to improve human computer interaction [Kobsa and Wahlster 1989], here we focus on intelligent multimedia interaction. If appropriate media are utilized for human computer interaction, there is the potential to (1) increase the bandwidth of information flow between human and machine (that is, the raw number of bits of information being communicated), and (2) improve the signal-to-noise ratio of this information (that is, the amount of useful bits conveyed). To achieve these potential gains, however, requires a better understanding of information characteristics, how they relate to characteristics of media, and how they relate to models of tasks, users, and environments. Achieving this goal is complicated by the proliferation of new interactive devices (e.g., datagloves and bodysuits, head mounted displays, eye-trackers, three dimensional sound), the lack of standards, and a poor or at least ill-applied knowledge of human cognitive and physical capabilities with respect to multimedia devices (See Figure 1). For example, some empirical studies [Krause 1993] provide evidence that even well accepted applications of multimedia (e.g., the use of check marks and graying out in menus) can exacerbate rather than improve user performance. This motivates the need to understand the principles underlying multimedia communication. Understanding these principles will not only result in better models and interactive devices, but also lead to new capabilities such as context-sensitive multimedia help, tools for (semi-)automated multimedia interface construction, and intelligent agents for multimedia information retrieval, processing, and presentation. This article first outlines research in parsing multimedia input, next mechanisms for generating multimedia output, and finally methods for integrating these together to support multimedia dialogue between computer and user. The article concludes by indicating directions for future research.

Figure 1: Device Proliferation

2. Multimedia Parsing

While there has been significant previous research in parsing and interpreting spoken and written natural languages (e.g., English, French), the advent of new interactive devices has motivated the extension of traditional lines of research. There has been significant investigation into processing isolated media, especially speech and natural language and, to a lesser degree, handwriting. Other research has focused on parsing equations (e.g., a handwritten "5+3"), drawings (e.g., flow charts), and even face recognition (e.g., lip, eye, head movements). Wittenburg [1993] overviews techniques for parsing handwriting, math expressions, and diagrams, including array grammars, graph grammars (node and hyper-edge replacement), and constraint-based grammars. In contrast, integrating multiple input media presents even greater challenges, and yet the potential benefits are great. For example, adding a visual channel to a speech recognizer provides visual information (e.g., lip movements, body posture) that can help resolve ambiguous speech as well as convey additional information (e.g., about attitudes). Figure 2 illustrates the notion of integrating multiple channels of input. As with natural language processing, there are many distinct representation and processing problems in multimedia parsing, including the need to segment the input into discrete elements, syntactically parse them, semantically interpret them, and exploit discourse and contextual information to deal with input that is ungrammatical, ambiguous, vague, and/or partial. Input media need to be represented at many levels of abstraction (at least morphological, lexico-syntactic, semantic, pragmatic, and discourse levels) to enable cross-media constraint, correlation, and, ultimately, media integration.

Figure 2: Multimedia Parsing

2.1 Interpreting Gesture

Some of these representational and processing complexities manifest themselves in gesture. Gestural input can come in a variety of forms from a variety of interactive devices (e.g., mouse, pen, dataglove). Just as (spoken and written) natural language can be used to perform various communicative functions (e.g., identify, make reference to, explain, shift focus), so too gestures can perform multiple functions and/or be multifunctional. Analogously, whereas gaze has traditionally been used as a replacement for mouse or deictic input [Jacob 1990], it can also be used to track user interest and focus of attention, to regulate turn taking, to indicate emotional state (e.g., drooping eyes), to indicate interpersonal attitudes (e.g., winking, rolling eyes), and even to indicate level of expertise (e.g., by correlating movement with task accomplishment). Rimé and Schiaratura [1991] characterize several classes of gesture. Symbolic gestures are conventional, context-independent, and typically unambiguous expressions (e.g., an OK or peace sign). In contrast, deictic gestures are pointers to entities, analogous to natural language deixis (e.g., "this not that"). Iconic gestures are used to display objects, spatial relations, and actions (e.g., illustrating the orientation of two cars at an accident scene). Finally, pantomimic gestures display an invisible object or tool (e.g., making a fist and moving it to indicate a hammer).

Gestural languages, or rather sublanguages, exist as well. These include sign languages (which have associated syntax, semantics, discourse properties and so on) as well as signing systems for use in environments where alternative communication is difficult (e.g., interoperator signing in noisy environments, signing between two SCUBA divers). Gestures also have functions that are context dependent, e.g., shaking a finger can indicate disapproval or a request for attention. Thus, the interpretation of many gestures is (at least) task, (discourse) context, and culture dependent (e.g., repeatedly closing your fingers to your palm with the palm facing the ground means "come here" in Latin America but "good bye" in North America.) Fortunately, some of this ambiguity can be resolved using input from other channels, automated techniques for which we consider next.

2.2 Integrating Media Input

Current workstations support limited input from multiple input channels: keyboard (textual), mouse (graphical), and, increasingly, microphone (speech, non-speech audio) and video. Unfortunately, for most users and applications these channels are severely restricted to independent, sequential, and, in the case of gesture, two-dimensional input. Despite, or perhaps because of, the daunting range and complexity of gestural input, there have been many computational investigations of gesture. The first was Carbonell's [1970] SCHOLAR system for geography tutoring, the intelligent interface of which allowed natural language interaction with simple pointing gestures to a map. As in many subsequent investigations, referent objects had pre-defined, unambiguous screen regions associated with them to enable a direct (one-to-one) mapping between screen location and referent.

Such was not the case in Kobsa et al.'s [1986] TACTILUS subcomponent of the XTRA (eXpert TRAnslator) system, an expert system interface to an electronic tax form [Wahlster 1991]. TACTILUS contained no pre-defined screen areas, and graphical objects were composites, so there was no one-to-one correspondence between a visual region and a referent. Moreover, the user could choose from a menu of deictic gestures of varying "granularity" (pencil, index finger, hand, region encircler). The ambiguity of regions combined with the vagueness of gesture required the system to resolve inexact and pars-pro-toto (part-for-the-whole) pointing. It did so by computing plausibility values for each demonstratum (an object being pointed to), measuring the portion of the demonstratum covered by the pointer. Candidates were then pruned using the semantics of any associated language or dialogue. When referents were determined, no visual feedback was given to the user (following human communication conventions), although the authors recognized this risked possible false user implicature regarding the success of their identification. Also, as in other systems that integrated language and deixis (e.g., CUBRICON, which we consider next), language and pointing input had to occur sequentially, and yet they often temporally overlap in human-human communication. Interestingly, their implementation investigated two-handed input, as did Buxton and Myers [1986], in which one hand could be used to indicate a region of focus of attention while the other could perform a selection.
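A minimal sketch of this style of coverage-based plausibility scoring appears below. It assumes rectangular screen regions and a simple semantic type filter standing in for the dialogue-based pruning; the data structures and names are illustrative, not TACTILUS's own.

```python
from dataclasses import dataclass

@dataclass
class Region:
    x: float
    y: float
    w: float
    h: float

@dataclass
class Demonstratum:
    name: str
    semantic_type: str
    region: Region

def coverage(obj: Region, pointer: Region) -> float:
    """Fraction of the object's area covered by the pointing region."""
    dx = max(0.0, min(obj.x + obj.w, pointer.x + pointer.w) - max(obj.x, pointer.x))
    dy = max(0.0, min(obj.y + obj.h, pointer.y + pointer.h) - max(obj.y, pointer.y))
    return (dx * dy) / (obj.w * obj.h)

def resolve_deixis(candidates, pointer, expected_type=None):
    """Rank demonstrata by how much of each the pointer covers, then prune
    with the semantics of the accompanying language (here, a type label)."""
    scored = [(coverage(c.region, pointer), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s > 0.0]
    if expected_type is not None:
        scored = [(s, c) for s, c in scored if c.semantic_type == expected_type]
    return [c.name for s, c in sorted(scored, key=lambda sc: -sc[0])]

# "this field", with a finger-sized pointing region over two overlapping form fields
form = [Demonstratum("deductions", "field", Region(0, 0, 60, 20)),
        Demonstratum("income", "field", Region(50, 0, 60, 20))]
print(resolve_deixis(form, Region(55, 5, 10, 10), expected_type="field"))
```

The key design point is that the gesture alone yields a ranked set of candidates, not a single answer; the accompanying language then narrows that set.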

The CUBRICON (Calspan-UB Research Center Intelligent CONversationalist) system [Neal and Shapiro 1991] permitted not only the use of gestures to resolve ambiguous linguistic input, but also the use of linguistic input to resolve ambiguous gestures. CUBRICON addressed the interpretation of speech, keyboard, and mouse input using an Augmented Transition Network grammar of natural language that included gestural constituents (in noun phrase and locative adverbial positions). Thus a user could, in natural language, query "Is this <point-using-the-mouse> a SAM1?", command "Enter this <point-map-icon> here <point-form-slot>.", or state "Units from this <point-1> airbase will strike these targets <point-2> <point-3> <point-4>." The system allowed either spoken or written input, although the natural language and deictic input had to occur sequentially. Interestingly, when semantically interpreting combined linguistic and gestural input, ambiguous point gestures were resolved by exploiting a class or property expressed in the natural language. Moreover, when the natural language and gesture were inconsistent, the system applied a heuristic that started at the display position indicated by the point gesture and performed an incremental bounded search to find at least one object consistent with the semantic features expressed in the natural language.
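The reconciliation heuristic can be pictured as a bounded search that widens outward from the pointed-to location until it finds an object matching the linguistically expressed class. The sketch below assumes a flat list of display objects with positions and semantic feature sets; the names, thresholds, and data layout are illustrative, not CUBRICON's actual representation.

```python
import math

def reconcile_point_and_language(objects, point_xy, required_features,
                                 max_radius=100.0, step=10.0):
    """Incrementally widen a search radius around the point gesture until at
    least one object consistent with the language's semantic features is
    found, or the bound is exceeded."""
    radius = step
    while radius <= max_radius:
        hits = [o for o in objects
                if math.dist(o["position"], point_xy) <= radius
                and required_features <= o["features"]]
        if hits:
            return hits
        radius += step
    return []

# Example: the language says "this SAM", but the gesture lands on an airbase icon.
display = [
    {"name": "SAM-1",     "position": (105.0, 42.0), "features": {"SAM"}},
    {"name": "Airbase-3", "position": (100.0, 40.0), "features": {"airbase"}},
]
print(reconcile_point_and_language(display, (100.0, 40.0), {"SAM"}))
```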

An interesting issue investigated in CUBRICON, parallel to the integration of linguistic and gestural input, was the coordination of gesture and language upon output. CUBRICON generated natural language with coordinated gestures in two cases: 1) when referents were visible on a display, e.g., "The mobility of these SAMs <point-1> <point-2> is low and the mobility of this SAM <point-3> is high", and 2) when the referent was a component of an object that was visible on a display, e.g., "The target of OCA001 is the 3-L-runway of the Merseberg Airbase <point-to-airbase>." We will return more fully to the issue of coordinating multimedia output in a subsequent section.

Like CUBRICON, the Intelligent Multimedia Interface (AIMI) system [Burger and Marshall 1993] also investigated multimedia dialogue, including natural language and graphical deixis, in the context of a mission planning system. For example, a user might first query in natural language "What aircraft are appropriate for the rescue mission?", to which the system might respond by automatically generating a table of appropriate aircraft. If the user then pointed to an item on the generated table, this would be introduced into the discourse context so that she could subsequently simply ask "What is its speed?". Similarly, if a window containing a graphic was in focus in the dialogue, the user could anaphorically say "Print that chart". In contrast to CUBRICON's Augmented Transition Network grammar representation, AIMI parsed input (including menu and window interactions) into expressions in a sorted first-order language (with generalized quantifiers) whose predicates are drawn from the terms in a subsumption hierarchy, as in KL-ONE (see Brachman and Schmolze [1985]). Like CUBRICON, AIMI included a detailed model of the discourse context, including the user's focus of attention, but also incorporated a rich model of the (mission planning) task which enabled cooperative interaction. We return to multimedia dialogue in a subsequent section.

Two of the fundamental problems with interpreting heterogeneous input are integrating temporally asynchronous input from different channels and parsing and interpreting the input at an appropriate level of abstraction so that these multiple channels can be integrated. For example, Koons et al. [1993] investigated integrating simultaneous speech, gesture, and eye movement for reference resolution in map and blocks world interactions. Whereas the previously described systems focused on (two dimensional) pointing in which gestures are typically terminals in the grammar (e.g., in CUBRICON), Koons et al.'s [1993] application required parsing three dimensional and time varying gestures. For processing, they found it necessary to capture gesture features such as the posture of the hand (straight, relaxed, closed), its motion (moving, stopped), and its orientation (up, down, left, right, forward, backward -- derived from normal and longitudinal vectors from the palm). Over time, a stream of gesture features is then abstracted into more general gestlets (e.g., Pointing = attack, sweep, end reference). Similarly, low level eye tracking input was classified into classes of events (fixations, saccades, and blinks). The more general levels of representation were then exploited to integrate different channels of information.

Figure 3: Idealized input to speech, gesture, gaze parser

Figure 3 illustrates an idealized example of the frame representation their media parsers produce when a user utters "... that blue square below the red triangle" while pointing to and looking at objects on a screen. After parsing, an interpretation component finds values for each frame. Values are either objects, for qualitative and categorical expressions such as "red" or "square", or regions in a spatial system, for input such as gestures, gaze, and spatial expressions (e.g., "below"). Ambiguous references (e.g., if there are multiple blue squares in the scene) are resolved by methods associated with the frames that find temporally adjacent input events that constrain the interpretation.
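A minimal sketch of this kind of frame filling and temporal alignment is given below, under the illustrative assumption that each channel delivers time-stamped events and that a spoken description is resolved against gesture or gaze events that overlap it in time; the event and scene structures are simplifications, not the representations of Koons et al.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    channel: str          # "speech", "gesture", or "gaze"
    start: float          # seconds
    end: float
    content: dict = field(default_factory=dict)

def overlaps(a: Event, b: Event, slack: float = 0.5) -> bool:
    """True if two events are temporally adjacent within a small slack."""
    return a.start <= b.end + slack and b.start <= a.end + slack

def resolve_reference(speech_event: Event, other_events, scene):
    """Fill the referent of a spoken description ("that blue square") by
    intersecting scene objects matching its categorical features with objects
    indicated by temporally overlapping gesture/gaze events."""
    features = speech_event.content.get("features", set())
    candidates = {o["id"] for o in scene if features <= o["features"]}
    for ev in other_events:
        if overlaps(speech_event, ev) and "indicated" in ev.content:
            candidates &= set(ev.content["indicated"])
    return candidates

scene = [{"id": "sq1", "features": {"blue", "square"}},
         {"id": "sq2", "features": {"blue", "square"}}]
speech = Event("speech", 1.0, 2.2, {"features": {"blue", "square"}})
point  = Event("gesture", 1.4, 1.9, {"indicated": ["sq2"]})
print(resolve_reference(speech, [point], scene))   # {'sq2'}
```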

While this work is ambitious in its integration of three input modalities, many applications will require even richer and deeper representations of input to capture the complexity of language, gaze, and gesture (e.g., selectional restrictions, intentional representations). Equally, this work could be extended to incorporate other modalities (e.g., face recognition, lip reading, body posture). As we consider in the next section, equal challenges apply when coordinating several modalities for output.

3. Multimedia Generation

Just as improved human-computer communication requires the ability to interpret multiple media and modalities upon input, so too it requires the selection and coordination of multiple media and modalities upon output. There have been many investigations into single media generation. With respect to language, a number of methods for planning and realizing natural language text have emerged. Several natural language generation workshops and collections have addressed issues such as content selection, intention (i.e., speech act) planning, text structuring and ordering, text tailoring, incremental sentence generation, revision, lexical choice, avoiding and correcting misconceptions, the relation of language generation to speech synthesis, and text formatting and layout [Kempen 1987, Dale 1990, Paris et al. 1991, Dale 1992, Horacek and Zock 1993, Maybury forthcoming]. In addition to text generation, a number of techniques have been developed to automatically design graphics [Feiner, Mackinlay and Marks 1992]. These include techniques to design and realize tables and charts [Mackinlay 1986], network diagrams [Marks 1991ab], business graphics displays [Roth and Mattis 1991], and three dimensional explanatory graphics [Feiner 1985]. Other research has investigated the use of non-speech audio to communicate state and process information [Buxton et al. 1985; Gaver 1986; Buxton 1989; Buxton, Gaver, and Bly, in press] and even the synthesis of music [Schwanauer and Levitt 1993]. While effort has focused, and continues to focus, on single media design, an equally important concern is the selection, allocation of information to, and coordination of multiple media and modalities. As Figure 4 suggests, the same propositional content can often be expressed in multiple, possibly intermixed, media and modalities. How this can be done automatically, we consider next.

Figure 4: Multimedia Generation
(Attribute Information of Aircraft Represented in Multiple Media)

3.1 Content Selection and Media Representation

Often the first step in designing a presentation is determining what information to convey. Determining the importance and relevance of content is a key problem in natural language generation and information retrieval and hence is beyond the scope of this article. Nevertheless, it is important to note that selecting content interacts in non-trivial ways with media design and realization; indeed, while it may be the first step, it may need to be revisited as subsequent realization or layout constraints dictate. For example, selecting particular kinds of content can dictate certain media (e.g., the expression of quantification in natural language).

Related to selecting what to say is the way in which it is represented (and of course what is represented). Hovy and Arens [1993] describe the knowledge required for reasoning about multimedia information, formalizing in systemic networks the content, form, and purpose of the presentation and the characteristics of the producer, perceiver, and communicative situation. Other researchers have extended KL-ONE like knowledge representation schemes to include graphical knowledge and reasoning mechanisms, for example, to support reasoning about layout [Graf 1992] or cross-modal references (e.g., "the red switch at the bottom of the picture") which requires mapping between spatial and linguistic structures [André et al. 1993].

3.2 Media Allocation

Once a system has determined (at least initially) what information to convey to an addressee, it must determine in what medium or which media to convey this information. In his APT system, Mackinlay [1986] developed "expressiveness rules" to relate characteristics of information to encoding techniques (e.g., position, size, orientation, shape, color) for graphical displays. Similarly, the Integrated Interfaces system [Arens et al. 1991] used simple "presentation rules" (e.g., display positions using points on a map, display future actions as text) to design mixed text, graphics, map, and tabular displays of daily reports of the US Navy's Pacific Fleet. Unlike APT, it allowed for preset stereotypical presentations if requested and was based on a KL-TWO knowledge representation scheme to handle a broader range of data. SAGE [Roth et al. 1991] also went beyond the graphics focus of APT and employed allocation heuristics (as in COMET, described below) which preferred graphics when there were a large number of quantitative or relational facts, but natural language when the information concerned, e.g., abstract concepts or processes, or when it contained relational attributes for a small number of data objects (e.g., total budget).

In contrast to these heuristics, the AIMI system [Burger and Marshall 1993] utilized design rules for media allocation which included media preferences (e.g., cartographic displays are preferred to flat lists, which are preferred to text) that were governed by the nature of the original query and the resulting information to be presented (e.g., qualitative vs. quantitative, its dimensionality). For example, a natural language query about airbases might result in the design of a cartographic presentation; a query about planes with certain qualitative characteristics, a list; and one about quantitative characteristics, a bar chart. One interesting notion in AIMI was the use of non-speech audio to convey the speed, stage, or duration of processes not visible to the user (e.g., background computations). AIMI also included mechanisms to tailor the design to the output device.

The WIP knowledge-based presentation system [André and Rist 1993] also incorporated media preferences for different information types.

Some of these preferences were captured in constraints associated with presentation actions, which were encoded in plan operators, and used feedback from media realizers to influence the selection of content.

In contrast to WIP's plan-based approach, the COMET (COordinated Multimedia Explanation Testbed) system [Feiner and McKeown 1993] followed a pipeline approach, using rhetorical schemas to determine presentation content (logical forms), followed by a heuristic media allocator to select between text and three dimensional graphics (see Figure 5). The media allocation heuristics were 1) realize locational and physical attributes in graphics only, 2) realize abstract actions and connectives among actions (e.g., causality) in text only, and 3) realize simple and compound actions in both text and graphics.

Figure 5: The COMET Architecture
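A minimal sketch of COMET's three allocation heuristics as a lookup over information types is shown below; the type labels and the default case are illustrative additions, not COMET's actual vocabulary.

```python
def allocate_media(info_type: str) -> set:
    """Map an information type to the media that should realize it,
    following the three heuristics described above."""
    graphics_only = {"location", "physical-attribute"}
    text_only = {"abstract-action", "causal-connective"}
    both = {"simple-action", "compound-action"}
    if info_type in graphics_only:
        return {"graphics"}
    if info_type in text_only:
        return {"text"}
    if info_type in both:
        return {"text", "graphics"}
    return {"text"}   # conservative default for unlisted types (an assumption)

print(allocate_media("location"))        # {'graphics'}
print(allocate_media("compound-action")) # {'text', 'graphics'}
```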

One problem with the above approaches to media allocation is that they map information characteristics onto media classes (e.g., text versus graphics) or media objects (e.g., tables, bar charts). In contrast, in an analytical study, Hovy and Arens [1993] characterize the complexity of media allocation in their investigation of rules that relate more general characteristics of information to characteristics of media. They characterize media allocation as the multistep process:

  1. Present data duples (e.g., locations) on planar media (e.g., graphs, tables, maps)
  2. Present data with specific denotations (e.g., spatial) on media with same denotations (e.g., locations on maps)
  3. If more than one medium can be used, and there is an existing presentation, prefer the medium/a that is/are already present ...
  4. Choose medium/a that can accommodate the most amount of information to be presented
They further characterize the complexity of information to media interdependencies by defining rules in a systemic framework. For example, to convey the notion of urgency they have two rules:
  1. If the information is not yet part of the presentation, use a medium whose default detectability is high (e.g., aural medium) either for substrate (e.g., a piece of paper, a screen, a grid) or carrier (e.g., a marker on a map substrate; a prepositional phrase within a sentence predicate substrate).
  2. If information is already displayed, use a present medium but switch one or more of the channels from fixed to the corresponding temporally varying state (e.g., flashing, pulsating, hopping).
What remains to be done is to computationally investigate and (with human subjects) evaluate these and other rules in an attempt to make progress toward a set of principles for media allocation.
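One way to begin such an investigation is to encode rules of this kind declaratively so that they can be varied and tested against human judgments. The sketch below is purely illustrative: it expresses only the two urgency rules above, collapses the substrate/carrier distinction, and uses made-up medium and channel labels.

```python
def realize_urgency(item, presentation):
    """Apply the two urgency rules: use a highly detectable medium for
    information not yet presented, or switch a channel of an existing
    display from fixed to a temporally varying state (e.g., flashing)."""
    if item not in presentation:                     # rule 1: not yet presented
        presentation[item] = {"medium": "aural", "channel": "fixed"}
    else:                                            # rule 2: already displayed
        presentation[item]["channel"] = "flashing"
    return presentation

p = {}
p = realize_urgency("low-fuel-warning", p)   # first mention: aural alert
p = realize_urgency("low-fuel-warning", p)   # repeated: vary the existing carrier
print(p)
```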

3.3 Media Design and Coordination

In addition to selecting media appropriate to the information, successful presentations must ensure coordination across media. First, the content must be consistent, although not necessarily equivalent, across media. In addition, the resulting form (layout, expression) should be consistent. The COMET system achieves this, in part, by coordinating sentence breaks with picture breaks, providing cross references from text to graphics, and allowing intergenerator influence (i.e., between text and graphics generators). A related issue considered by Feiner et al. [1993] concerns the need to incorporate a temporal reasoning mechanism within presentation design systems to enable the control of the presentation of temporal information (e.g., temporal relations and event duration) using temporal media (e.g., animation and speech) by managing the order and duration of communicative acts.

Whereas COMET allocates information to media subsequent to content selection, the WIP architecture and that of TEXPLAN [Maybury 1991] perform content selection and media allocation simultaneously using plan operators. This enables declarative encoding of media and content constraints in one formalism which is then used to generate a resulting hierarchical, multimedia presentation plan. Media generators (e.g., for text, graphics) can then interact with this structure, as well as with each other, to provide feedback to the design. Figure 6 illustrates a portion of an illustrated instruction generated by WIP with its underlying presentation plan. In WIP, the text and graphics generators interact, for example, to generate unambiguous linguistic and visual references to objects [André et al. 1993]. This interaction enables the text generator to make visual references such as "The on/off switch is located in the upper left part of the picture." WIP also includes a grid-based layout system [Graf 1992], described below, that co-constrains the presentation planner.

Figure 6: A Document Fragment and its Structure

3.4 Communicative Acts for Multimedia Generation

Following a tradition that views language as an action-based endeavor [Austin 1962, Searle 1969], researchers have begun to formalize multimedia communication as actions, in the hope of arriving at a deeper representation of the mechanisms underlying communication. Some systems have gone beyond single media to formalize multimedia actions (e.g., WIP, TEXPLAN), attempting to capture both the underlying structure and intent of presentations using a plan-based approach to communication. For example, as Figure 6 above illustrates, WIP designs its picture-text instructions using a speech-act like formalism that includes communicative (e.g., describe), textual (e.g., S-request), and graphical (e.g., depict) actions.

Maybury [1991, 1993, 1994, in press] details a taxonomy of communicative acts that includes linguistic, graphical, and physical actions. These are formalized as plan operators with associated constraints, enabling conditions, effects, and subacts. Certain classes of actions (e.g., deictic actions) are characterized in a media-independent form, and then specialized for particular media (e.g., pointing or tapping with a gestural device, highlighting a graphic, or utilizing linguistic deixis as in "this" or "that"). When multiple design and realization choices are possible, preference metrics, which include media preferences, mediate the choice. Given a choice, the metric prefers plan operators with fewer subplans (cognitive economy), fewer new variables (limiting the introduction of new entities in the focus space of the discourse), those that satisfy all preconditions (to avoid backward chaining for efficiency), and those plan operators that are more common or preferred in naturally-occurring explanations (e.g., certain kinds of communicative acts occur more frequently in human-produced presentations or are preferred by rhetoricians over other methods). Maybury [1991] details its application to the design of narrated, animated route directions.
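A minimal sketch of such a preference metric over competing plan operators follows. It assumes each operator records its subplans, newly introduced variables, unsatisfied preconditions, and an empirical frequency drawn from human-produced presentations; the fields, weights, and operator names are illustrative, not those of TEXPLAN.

```python
from dataclasses import dataclass, field

@dataclass
class PlanOperator:
    name: str
    subplans: list = field(default_factory=list)
    new_variables: list = field(default_factory=list)
    unsatisfied_preconditions: int = 0
    corpus_frequency: float = 0.0   # how often this act occurs in human presentations

def preference(op: PlanOperator) -> float:
    """Higher is better: penalize decomposition size, new discourse entities,
    and backward chaining; reward empirically common operators."""
    return (op.corpus_frequency
            - 1.0 * len(op.subplans)
            - 0.5 * len(op.new_variables)
            - 2.0 * op.unsatisfied_preconditions)

def choose(operators):
    return max(operators, key=preference)

ops = [PlanOperator("deixis-by-highlight", subplans=["highlight"],
                    corpus_frequency=0.6),
       PlanOperator("deixis-by-description", subplans=["describe", "locate"],
                    new_variables=["landmark"], corpus_frequency=0.3)]
print(choose(ops).name)   # deixis-by-highlight
```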

3.5 Automated Layout of Media

The physical format and layout of a presentation often conveys the structure, intention, and significance of the underlying information and plays an important role in the presentation coherency. Most investigations of layout have focused on single media. For example, Hovy and Arens [1991] exploited the rhetorical structure used to generate text to guide the format of the resulting text, realized using the text formatting program, TeX. For example, when their text planner structured text using a SEQUENCE relation, this would be realized using the \bullet TeX command. Marks [1991ab] investigated layout and encoding of arc-node diagrams in his ANDD system. ANDD grouped nodes sharing common graphical values (e.g., shape, color, size) to reinforce perception of graphical properties. It was guided by pragmatic directives, for example, "emphasize" certain structural or quantitative values associated with particular nodes or groups of nodes. Relatedly, Roth and Mattis [1991] sorted chart objects and tree nodes to support search.

In contrast to these investigations of single media, Feiner's [1988] GRaphical Interface Design (GRID) system investigated text, illustrations, and, subsequently, virtual input devices. Layout was performed in an OPS5-like production system guided by a graphical design grid. In contrast to this rule-based approach, Graf [1992] argues that the design of an aesthetically pleasing layout can be characterized as a constraint satisfaction problem. Graf's LayLab system, the constraint-based layout manager within WIP, achieves coherent and effective output by reflecting certain semantic and pragmatic relations in the visual arrangement of a mixture of automatically generated text and graphics fragments. LayLab incorporates knowledge of document stereotypes (e.g., slides, instruction manuals, display environments), design heuristics (e.g., vertical vs. horizontal alignment), and graphical constraints. Layout includes the mapping of semantic and pragmatic relations (e.g., sequence, contrast relations) onto geometrical/topological/temporal constraints (e.g., horizontal and vertical layout, alignment, and symmetry) using specific visualization techniques. For example, two equally sized graphics can be contrasted by putting them beside one another or one under the other. To accomplish this task, LayLab integrates an incremental hierarchy solver and a finite domain solver in a layered constraint solver model in order to position the individual fragments on an automatically produced graphic design grid. Thus, layout is viewed as an important carrier of meaning. This approach could be generalized beyond static text-picture combinations to include dynamic and incrementally presented presentations as well as those that incorporate additional media (e.g., animation, video) [Graf forthcoming]. An interesting research issue concerns developing a constraint acquisition component that could infer design constraints from graphical sketches by human experts.
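A minimal sketch of the mapping from semantic/pragmatic relations to layout constraints appears below. It ignores the layered constraint-solving machinery itself, covers only two relations, and uses illustrative constraint names rather than LayLab's.

```python
def layout_constraints(relation: str, frag_a: str, frag_b: str):
    """Map a semantic/pragmatic relation between two presentation fragments
    onto geometric constraints over their positions on a design grid."""
    if relation == "contrast":
        # contrasted, equally sized graphics go beside or under one another
        return [("equal-size", frag_a, frag_b),
                ("align", frag_a, frag_b, "horizontal-or-vertical")]
    if relation == "sequence":
        # sequential steps read top to bottom, left-aligned
        return [("above", frag_a, frag_b),
                ("align", frag_a, frag_b, "left")]
    return []

print(layout_constraints("contrast", "fig-before", "fig-after"))
```

A constraint solver would then search for grid positions satisfying all such constraints at once, which is the sense in which layout carries meaning.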

In an interactive setting, the CUBRICON multimedia interface [Neal and Shapiro 1991] supported multiple monitor/window interaction using an Intelligent Window Manager (IWM). IWM rated the importance of each window using a weighted combination of factors.

Window management rules then controlled the allocation of generated media to screen real estate. Accordingly, IWM preferred to place maps on the color monitor and tables on the monochrome monitor. Forms were only placed on the monochrome monitor. IWM would place a map window with a related table on the color monitor if there was space; otherwise the least important window (and any related table) would be removed to make space. If there was space on the monochrome monitor, the table was placed there in a position corresponding to the new map; otherwise, if the new table was more important than an existing form, it was placed in the lower right corner of the monochrome monitor. The importance ratings combined with placement heuristics yielded an effective technique for managing high level media objects (e.g., windows, tables) in an interactive setting.
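A minimal sketch of this style of placement logic is given below. It takes window importance as a given input (the weighting factors themselves are not reproduced here), models only the map/table eviction behavior, and uses illustrative structures rather than IWM's actual rules.

```python
def place_window(new_window, color_monitor, mono_monitor, capacity=4):
    """Place maps on the color monitor and tables on the monochrome monitor,
    evicting the least important window (and any table related to it) when
    the target monitor is full."""
    target = color_monitor if new_window["kind"] == "map" else mono_monitor
    if len(target) >= capacity:
        victim = min(target, key=lambda w: w["importance"])
        target.remove(victim)
        # also drop any table related to an evicted window
        for w in [w for w in mono_monitor if w.get("related_to") == victim["name"]]:
            mono_monitor.remove(w)
    target.append(new_window)

color, mono = [], []
place_window({"name": "map-merseburg", "kind": "map", "importance": 0.9}, color, mono)
place_window({"name": "tbl-merseburg", "kind": "table", "importance": 0.7,
              "related_to": "map-merseburg"}, color, mono)
print([w["name"] for w in color], [w["name"] for w in mono])
```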

3.6 Tailoring Multimedia Output

In addition to selecting and coordinating output, it is important to design presentations that are suited to a particular user's abilities and task. In their research with SAGE, Roth and Mattis [1990, 1991] characterize a range of information seeking goals for viewing quantitative and relational information. These purposes included accurate value-lookup (e.g., train table times, phone numbers), value-scanning (approximate computations of, e.g., the mean, range, or sum of a data set), counting, n-wise comparison (e.g., product prices, stock performances), judging correlation (e.g., estimating covariance of variables), and locating data (e.g., finding data indexed by attribute values). Each of these goals may be supported by different presentations. Burger and Marshall [1993] capture this task-tailoring when they contrast two (fictional) responses to one natural language query, "When do trains leave for New York from Washington?" (see Figure 7). If the addressee is interested in trend detection, a bar chart presentation is preferred; if they are interested in exact quantities (e.g., to support further calculations), a table is preferred.

Figure 7: Exactness (table response) versus trend analysis (bar chart response) [Burger and Marshall 1993]
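A minimal sketch of this kind of goal-driven choice follows, assuming the addressee's information-seeking goal has already been identified; the goal labels and the mapping are illustrative.

```python
def choose_presentation(goal: str) -> str:
    """Pick a presentation technique from the addressee's information-seeking goal."""
    if goal in {"value-lookup", "counting"}:
        return "table"       # exact values support further calculation
    if goal in {"trend-detection", "n-wise-comparison", "judging-correlation"}:
        return "bar-chart"   # approximate, visual comparison
    return "table"           # assumed fallback when the goal is unknown

print(choose_presentation("trend-detection"))  # bar-chart
print(choose_presentation("value-lookup"))     # table
```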

Presentations can be tailored to other factors, such as properties of the task, context (e.g., previously generated media), and user (e.g., visual ability, level of expertise).

4. Toward Intelligent Multimedia Interfaces

The previous two sections considered the integration and coordination of multimedia input and output. What about integrating these into an interactive multimedia interface? As Figure 8 illustrates, supporting multimedia dialogue requires models of the user, task, and discourse as well as models of media. Thus, multimedia interfaces build upon research in discourse and user modeling for interface design and management [Kobsa and Wahlster 1989]. Not only must models of media and their design and interaction be added, but other traditional knowledge sources and processes also need to be modified to support multimedia interaction. For example, discourse models need to incorporate representations of media to enable, for example, media (cross) reference and reuse. Similarly, user models need to be extended to represent the media preferences of users.

Figure 8: Intelligent Multimedia Interfaces

Examples of multimedia dialogue systems include CUBRICON [Neal and Shapiro 1991], XTRA [Wahlster 1991], AIMI [Burger and Marshall 1993], and AlFresco [Stock et al. 1993]. Typically, these systems parse integrated input and generate coordinated output, but they must also address concerns such as maintaining coherence, cohesion, and consistency across both multimedia input and output. For example, these systems typically support integrated language and deixis for both input and output. They incorporate models of the discourse to, for example, resolve multimedia references (e.g., "Send this plane there"). CUBRICON represented a global focus space ordered by recency; AIMI, a focus space segmented by the intentional structure of the discourse (i.e., a model of the domain tasks to be completed).

While intelligent multimedia interfaces promise natural and personalized interaction, they remain complicated and require specialized expertise to build. One practical approach to achieving some of the benefits of these more sophisticated systems, without the expense of developing full multimedia interpretation and generation components, is exemplified by AlFresco [Stock et al. 1993], a multimedia information kiosk for Italian art exploration. Figure 9 illustrates a typical multimedia interaction with AlFresco. By adding natural language processing to a traditional hypermedia system, AlFresco achieved the benefits of hypermedia (e.g., organization of heterogeneous and unstructured information via hyperlinks, direct manipulation to facilitate exploration) together with the benefits of natural language parsing (e.g., direct query of nodes, links, and subnetworks, which provides rapid navigation). Parsing helps overcome the indirectness of hypermedia as well as the disorientation and cognitive overhead caused by too many links. Also, as in other systems previously described (e.g., CUBRICON, TACTILUS), ambiguous gesture and language can yield a unique referent through mutual constraint. Finally, AlFresco incorporates simple natural language generation which can be combined with more complex canned text (e.g., art critiques) and images.

USER: Vorrei sapere se Ambrogio Lorenzetti ha dipinto un'opera che raffigura una scena sacra con un angelo. (I would like to know if Ambrogio Lorenzetti ever painted a work that represents a sacred scene with an angel.)
ALFRESCO: Sì. Per esempio: A. Lorenzetti, l'Annunciazione. L'angelo è l'Arcangelo Gabriele. (Yes, for example: A. Lorenzetti, the Annunciation. The angel is the Archangel Gabriel.)
USER: Chi è questa persona? [pointing at Mary on the touch screen] (Who is this person?)
ALFRESCO: La Madonna. (Mary.)
USER: Puoi mostrarmi un ingrandimento che la contiene? (Can you show me an enlargement containing her?)
ALFRESCO: [The system shows an enlargement]

Figure 9: Multimedia Interaction with AlFresco

Reiter, Mellish, and Levine [1992] also integrated traditional language generation with hypertext to produce hypertext technical manuals. In addition to information kiosks, many other practical applications of intelligent multimedia interfaces have been investigated in domains such as intelligent tutoring [Cornell et al. 1993, Goodman 1993], car-driver interfaces [Bonarini 1993], tax-form completion [Kobsa et al. 1986, Wahlster 1991], and air pollution analysis [Marti et al. 1992]. A less investigated but nevertheless important class of applications, which can in part exploit results in multimedia parsing and generation, is multimedia indexing, retrieval [Stein et al. 1992; Stein and Thiel 1993], and summarization.

While practical systems are possible today, the multimedia interface of the future may have facilities that are much more sophisticated. These interfaces may include human-like agents that converse naturally with users, monitoring their interaction with the interface (e.g., key strokes, gestures, facial expressions) and the properties of those (e.g., conversational syntax and semantics, dialogue structure) over time and for different tasks and contexts. Equally, future interfaces will likely incorporate more sophisticated presentation mechanisms. For example, Pelachaud [1992] characterizes spoken language intonation and associated emotions (anger, disgust, fear, happiness, sadness, and surprise) and from these uses rules to compute facial expressions, including lip shapes, head movements, eye and eyebrow movements, and blinks. Finally, future multimedia interfaces should support richer interactions, including user and session adaptation, dialogue interruptions, follow-up questions, and management of focus of attention.

5. Remaining Research Problems

While the above systems afford exciting possibilities, there remains a gap between current capabilities and a system that could more fully interact using multiple media and modalities. Many fundamental questions remain unanswered, including issues concerning the architectures and knowledge needed to support intelligent multimedia interaction, techniques for media integration and coordination, and methods for evaluation. With respect to architectures, many questions remain, including: What are the key components (knowledge sources, processes) for multimedia interaction? What functionality do they need to support? What is the proper flow of control? How should they interact (e.g., serially, interleaved, co-constraining)? Can we develop infrastructure/tools to support and encourage progress?

5.1 Models

Just as important as these architectural concerns is the issue of how systems and system builders can acquire, represent, maintain, and exploit models of the knowledge required for such systems. Knowledge needed includes models of information characteristics, producer characteristics and goals, media and modalities (e.g., their characteristics, strengths and weaknesses), users (e.g., (physical and cognitive) abilities, preferences, attention, and intentions), discourse/dialogue histories, task histories, and models of the situation (e.g., tracking system parameters such as load and available media). What seems evident from the current state of the art is that no single model will suffice; rather, models will have to be represented and reasoned about at multiple levels of abstraction and/or fidelity. For example, media models will have to represent the range from higher level artifacts (e.g., text, tables, graphs) down to their smallest constituent parts (e.g., pixels).

5.2 Multimedia Parsing and Generation

In addition to knowledge, expert processes need to be invented, both for the interpretation and generation of media. Fundamental input processing includes segmentation within and across media, parsing and interpretation of (both ill-formed and partial) input, and resolution of ambiguous and partial multimedia references. In addition to extended and novel parsing technology, new interactive devices (e.g., those replicating force feedback, those recognizing facial and body expressions) will need to be developed and tested. Finally, techniques for media integration and aggregation need to be further refined to ensure synergistic coupling among multiple media, overcoming asynchronous timing and varying levels of abstraction of input.

In addition to input, output techniques require further extension. Generation advances are required for content selection (i.e., choosing what to say), media allocation (i.e., choosing which media to say what in), modality selection (e.g., realizing language as visual text or aural speech), media realization (i.e., choosing how to say items in a particular medium), media coordination (cross modal references, synchronicity), and media layout (size and position). One research issue concerns the degree of module interaction and/or self correction. For example, the WIP system described above required two feedback loops (one after media design and one after realization) to help resolve inter/intra media synthesis problems. Another issue concerns the degree of reuse and/or refinement of pre-existing, canned media with dynamically generated media (e.g., AlFresco's integration of hypertext and natural language processing). Related to this is whether or not systems save the history or structure of a presentation and if and how canned artifacts (e.g., an animation) can be connected to representations of abstract knowledge. The need for deep knowledge of designed graphics depends at least upon the intended use of the multimedia presentation (e.g., for teaching versus manual generation) and the environment in which it is used (e.g., interactive, static). Still another issue concerns the degree of automation versus mixed initiative. For example, while many systems attempt to generate a final presentation, COMET [Feiner and McKeown 1993] does support user-controlled camera positions, and the InformationKit (IKIT) tutoring environment [Suthers et al. 1993] enables the user to control which "viewer" a concept is realized in (e.g., text, graphics). Fully automated generation becomes even more difficult, if not impossible, when aiming for user and context tailored multimedia presentations. Moreover, presentation composition and coordination must be sensitive to the purpose of the communication, the (cognitive) complexity of the resulting presentation, its perceptual impact (e.g., clutter), consistency (e.g., of dimensions, sizes, with respect to previous presentations), and ambiguity (overlapping encodings of color, shape, etc.).

5.3 Multimedia Interfaces

In addition to better techniques for interpreting and synthesizing media, initial multimedia interface prototypes have uncovered areas for further research. These include moving beyond hypermedia to support multimedia question answering (including cross modal references and follow-up questions), the ability to post-edit presentations, to critique user designs (both during and after their construction), and, finally, to manage the multimedia dialogue (e.g., turn-taking). Many issues remain to be resolved, including dealing with ill-formed and incomplete input and output at multiple levels of abstraction (e.g., syntax, semantics, pragmatics), the utilization of media independent acts (for analysis, generation, and interaction management), and the relationship of models of focus of attention and multimedia acts. We also need to move beyond interfaces incorporating language, graphics, and gesture to consider less explored media (e.g., non-speech audio, facial expressions, body language) and modalities (e.g., taction, olfaction), or even to invent new ones. Finally, the relation of multimedia interfaces to multimedia classification, indexing, and retrieval offers interesting research possibilities.

5.4 Methodology and Evaluation

A final research area which can help foster progress toward a science of multimedia interaction is methodology and evaluation. A number of different methodological approaches are apparent in current work in the field. Some researchers build systems, guided by reverse-engineering human designs and human-human interactions. Others focus on self-adaptive systems, where effective techniques are learned via interaction with users. Still others focus on empirical validation of techniques (through observation of man-machine interactions). Others follow a combination of approaches. Perhaps there are other, equally useful approaches? In all of these cases evaluation metrics and methods to measure progress need to be developed. In some cases, we need to determine the goodness of alternative presentations by measuring presentation well-formedness, consistency, balance, coherency, and cohesion. In other cases it is important to measure the pedagogic benefit, increase in efficiency, or increase in the effectiveness of accomplishing some task (e.g., teaching, fixing) to provide evidence of value-added of additional machinery for input or output. This may involve time/quality tradeoffs among media or processes. It will require both black box and glass box system evaluations. Finally, we also need to judge among possible input and output facilities, matching media to human (physical and cognitive) capabilities such as memory and attention. All of these evaluative endeavors demand standard terms, units of measurement, levels of performance, techniques of use, and so on, to enable comparison and sharing of results.

6. Conclusion

This article has described research into parsing simultaneous multimedia input and generating coordinated multimedia output, as well as prototypes that integrate these to support multimedia dialogue. Intelligent multimedia interfaces have the potential to enable systems and people to use media to their best advantage, in several ways. First, they can increase the raw bit rate of information flow between human and machine (for example, by using the most appropriate medium or mix of media and modalities for information exchange). Second, they can facilitate human interpretation of information by helping to focus user attention on the most important or relevant information. Third, these investigations can provide explicit models of media to facilitate interface design so that, for example, future interfaces can benefit from aspects of human communication that are ignored by current interfaces (e.g., speech inflections, facial expressions, hand gestures). Intelligent multimedia interaction is relevant to a range of application areas such as decision support, information retrieval, education and training, and entertainment. This technology promises to improve the quality and effectiveness of interaction for everyone who communicates with a machine in the future, but we will only reach this state by solving the remaining fundamental problems outlined above.

7. Acknowledgments

I thank all the referenced authors for their ideas, which I have attempted to faithfully represent herein. Particular thanks go to Wolfgang Wahlster, Ed Hovy, and Yigal Arens for their comments on multimedia and multimodal issues. I am grateful to Rich Mitchell for creating Figures 1 and 2.

8. References

[André and Rist 1993] André, E. and Rist, T. 1993. The Design of Illustrated Documents as a Planning Task. In Intelligent Multimedia Interfaces, ed. M. Maybury, 94-116. Menlo Park: AAAI/MIT Press. Also DFKI Research Report RR-92-45.
[André et al. 1993] André, E.; Finkler, W.; Graf, W.; Rist, T.; Schauder, A.; and Wahlster, W. 1993. WIP: The Automatic Synthesis of Multimodal Presentations. In Intelligent Multimedia Interfaces, ed. M. Maybury, 73-90. Menlo Park: AAAI/MIT Press. Also DFKI Research Report RR-92-46.
[Arens et al. 1991] Arens, Y.; Miller, L.; and Sondheimer, N. K. 1991. Presentation Design Using an Integrated Knowledge Base. In [Sullivan and Tyler 1991], 241-258.
[Austin 1962] Austin, J. 1962. How to do Things with Words, ed. J. O. Urmson. England: Oxford University Press.
[Buxton and Myers 1986] Buxton, W. and Myers, B. A. 1986. A Study in Two-Handed Input. Proceedings of Human Factors in Computing Systems (CHI-86), 321-326, New York: ACM.
[Buxton et al. 1985] Buxton, W., Bly, S., Frysinger, S., Lunney, D., Mansur, D., Mezrich, J. Morrison, R. 1985. Communicating with Sound. Proceedings of Human Factors in Computing Systems (CHI-85), 115-119, New York: ACM
[Buxton 1989] Buxton, W. (ed) 1989. Human-Computer Interaction 4: Special Issue on Nonspeech Audio, Lawrence Erlbaum.
[Buxton, Gaver, and Bly, in press] Buxton, W., Gaver, W. and Bly, S. Auditory Interfaces: The use of Non-speech Audio at the Interface. Cambridge University Press.
[Blattner and Dannenberg 1992] Blattner, M. M. and Dannenberg, R. B. (eds). 1992. Multimedia Interface Design, Reading, MA: ACM Press/Addison-Wesley.
[Bonarini 1993] Bonarini, A. 1993. Modeling Issues in Multimedia Car-Driver Interaction. In Intelligent Multimedia Interfaces, ed. M. Maybury, 353-371. Menlo Park: AAAI/MIT Press.
[Brachman and Schmolze 1985] Brachman, R. J., and Schmolze, J. G. 1985. An Overview of the KL-ONE Knowledge Representation System. Cognitive Science 9(2):171-216.
[Burger and Marshall 1993] Burger, J., and Marshall, R. 1993. The Application of Natural Language Models to Intelligent Multimedia. In Intelligent Multimedia Interfaces, ed. M. Maybury, 167-187. Menlo Park: AAAI/MIT Press.
[Carbonell 1970] Carbonell, J. R. 1970. Mixed-Initiative Man-Computer Dialogues. Bolt, Beranek and Newman (BBN) Report No. 1971, Cambridge, MA.
[Cornell et al. 1993] Cornell, M., Woolf, B., and Suthers, D. 1993. Using Live Information in a Multimedia Framework. In Intelligent Multimedia Interfaces, ed. M. Maybury, 307-327. Menlo Park: AAAI/MIT Press.
[Dale et al. 1990] Dale, R., Mellish, C. and M. Zock, editors. 1990. Current Research in Natural Language Generation. Based on Extended Abstracts from the Second European Workshop on Natural Language Generation, University of Edinburgh, Edinburgh, Scotland, 6-8 April, 1989. London: Academic Press. ISBN 0-12-200735-2, 356 pages.
[Dale et al. 1992] Dale, R., Hovy, E., Rösner, D., and Stock, O. (eds). 1992. Aspects of Automated Natural Language Generation, Lecture Notes in Computer Science, 587. Proceedings of 6th International Workshop on Natural Language Generation, Trento, Italy, April 5-7, 1992. Springer-Verlag: Berlin.
[Feiner 1985] Feiner, S. 1985. APEX: An Experiment in the Automated Creation of Pictorial Explanations. IEEE Computer Graphics and Application 5(11):29-37.
[Feiner 1988] Feiner, S. 1988. A Grid-based Approach to Automating Display Layout. Proceedings of the Graphics Interface, 192-197. Morgan Kaufmann, LA, CA, June.
[Feiner and McKeown 1993] Feiner, S. K. and McKeown, K. R. 1993. Automating the Generation of Coordinated Multimedia Explanations. In Intelligent Multimedia Interfaces, ed. M. Maybury, 113-134. Menlo Park: AAAI/MIT Press.
[Feiner et al. 1993] Feiner, S. K., Litman D. J., McKeown, K. R., Passonneau, R. J. 1993. Towards Coordinated Temporal Multimedia Presentations. In Intelligent Multimedia Interfaces, ed. M. Maybury, 139-147. Menlo Park: AAAI/MIT Press.
[Feiner, Mackinlay and Marks 1992] Feiner, S., Mackinlay, J. and Marks, J. 1992. Automating the Design of Effective Graphics for Intelligent User Interfaces. Tutorial Notes. Human Factors in Computing Systems, CHI-92, Monterey, May 4, 1992.
[Goodman 1993] Goodman, B. A. 1993. Multimedia Explanations for Intelligent Training Systems. In Intelligent Multimedia Interfaces, ed. M. Maybury, 148-171. Menlo Park: AAAI/MIT Press.
[Gray et al. 1993] Gray, W. D., Hefley, W. E., and Murray, D. (eds.) 1993. In Proceedings of the 1993 International Workshop on Intelligent User Interfaces, Orlando, FL January, 1993. New York: ACM.
[Graf 1992] Graf, W. 1992. Constraint-based Graphical Layout of Multimodal Presentations. In [Catarci, Costabile, and Levialdi 1992], 365-385. Also available as DFKI Report RR-92-15.
[Graf forthcoming] Graf, W. 1994. Semantik-gesteuertes Layout-Design multimodaler Präsentationen. Universität des Saarlandes, Technische Fakultät, Saarbrücken, Germany.
[Horacek and Zock 1993] Horacek, H. and Zock, M. (eds) 1993. New Concepts in Natural Language Generation: Planning, Realization and Systems. Frances Pinter, London and New York.
[Hovy and Arens 1991] Hovy, E. H. and Arens, Y. 1991. Automatic Generation of Formatted Text. In Proceedings of Ninth National Conference of the American Association for Artificial Intelligence, 92-97, Anaheim, CA, July.
[Hovy and Arens 1993] Hovy, E. H. and Arens, Y. 1993. On the Knowledge Underlying Multimedia Presentations. In Intelligent Multimedia Interfaces, ed. M. Maybury, 280-306. Menlo Park: AAAI/MIT Press.
[Jacob 1990] Jacob, R. J. K. 1990. What You Look at is What You Get: Eye Movement-Based Interaction Techniques. In Proceedings of Human Factors in Computing Systems (CHI '90), 11-18. New York: ACM Press. Seattle, April 1-5.
[Kempen 1987] Kempen, G., editor. 1987. Natural Language Generation: New Results in Artificial Intelligence, Psychology, and Linguistics. Dordrecht: Martinus Nijhoff. NATO ASI Series.
[Kobsa and Wahlster 1989] Kobsa, A., and Wahlster, W. (eds.) 1989. User Models in Dialog Systems. Berlin: Springer-Verlag.
[Kobsa et al. 1986] Kobsa, A., Allgayer, J. Reddig, C. Reithinger, N. Schmauks, D. Harbush, K. and Wahlster, W. 1986. Combining Deictic Gestures and Natural Language for Referent Identification. Proceedings of the 11th International Conference on Computational Linguistics, Bonn, West Germany, 356-361.
[Koons et al. 1993] Koons, D. B., Sparrell, C. J., and Thorisson, K. R. 1993. Integrating Simultaneous Output from Speech, Gaze, and Hand Gestures. In Intelligent Multimedia Interfaces, ed. M. Maybury, 243-261. Menlo Park: AAAI/MIT Press.
[Krause 1993] Krause, J. 1993. A Multilayered Empirical Approach to Multimodality: Towards Mixed Solutions of Natural Language and Graphical Interfaces. In Intelligent Multimedia Interfaces, ed. M. Maybury, 312-336. Menlo Park: AAAI/MIT Press.
[Mackinlay 1986] Mackinlay, J. D. 1986. Automating the Design of Graphical Presentations of Relational Information. ACM Transactions on Graphics 5(2):110-141.
[Marks 1991a] Marks, J. W. 1991. Automating the Design of Network Diagrams. Ph.D. thesis, Harvard University.
[Marks 1991b] Marks, J. 1991. A Formal Specification Scheme for Network Diagrams That Facilitates Automated Design. Journal of Visual Languages and Computing 2(4):395-414.
[Marti et al. 1992] Marti, P., Profili, M., Raffaelli, P., and Toffoli, G. 1992. Graphics, Hyperqueries, and Natural Language: An Integrated Approach to User-Computer Interfaces. In [Catarci, Costabile, and Levialdi 1992], 68-84.
[Maybury 1990] Maybury, M. T. 1990. Planning Multisentential English Text using Communicative Acts. Ph.D. diss., University of Cambridge, England. Available as Rome Air Development Center TR 90-411, In-House Report, December 1990 or as Cambridge University Computer Laboratory TR-239, December, 1991.
[Maybury 1991] Maybury, M. T. 1991a. Planning Multimedia Explanations Using Communicative Acts. In Proceedings of the Ninth National Conference on Artificial Intelligence, 61-66. Anaheim, CA: AAAI.
[Maybury 1993] Maybury, M. T. (ed.) 1993. Intelligent Multimedia Interfaces. Menlo Park: AAAI/MIT Press.
[Maybury 1994a] Maybury, M. T. 1994. Knowledge Based Multimedia: The Future of Expert Systems and Multimedia. International Journal of Expert Systems with Applications 7(3), April 1994. Special issue on Expert Systems Integration with Multimedia Technologies, Ragusa, J. (ed.).
[Maybury in press] Maybury, M. T. in press. Communicative Acts for Multimedia and Multimodal Dialogue. In Taylor, M. M., Néel, F., and Bouwhuis, D. G. (eds.), The Structure of Multimodal Dialogue. London: North-Holland. ISSN 1018-4554. Proceedings from workshop at Acquafredda di Maratea, Italy, September 16-20, 1991.
[Maybury forthcoming] Maybury, M. T. forthcoming. Automated Explanation and Natural Language Generation. In A Bibliography of Natural Language Generation, Sabourin, C. (ed.). Montreal: Infolingua, 1-88.
[Neal and Shapiro 1991] Neal, J. G. and Shapiro, S. C. 1991. Intelligent Multi-Media Interface Technology. In [Sullivan and Tyler 1991], 11-43.
[Paris et al. 1991] Paris, C. L., W. R. Swartout, and W. C. Mann (eds). 1991. Natural Language Generation in Artificial Intelligence and Computational Linguistics, Kluwer: Norwell, MA.
[Pelachaud 1992] Pelachaud, C. 1992. Functional Decomposition of Facial Expressions for an Animation System. In [Catarci, Costabile, and Levialdi 1992], 26-49.
[Rimé and Schiaratura 1991] Rimé, B., and Schiaratura, L. 1991. Gesture and Speech. In Fundamentals of Nonverbal Behavior, eds. R. S. Feldman and B. Rimé, 239-281. New York: Press Syndicate of the University of Cambridge.
[Reiter, Mellish, and Levine 1992] Reiter, E., Mellish, C. and Levine, J. 1992. Automatic Generation of on-line Documentation in the IDAS Project. Proceedings of the 3rd Conference on Applied Natural Language Processing, 31 March - 3 April 1992, Trento, Italy. Morristown: ACL.
[Roth and Mattis 1990] Roth, S. F., and Mattis, J. 1990. Data Characterization for Intelligent Graphics Presentation. In Proceedings of the 1990 Conference on Human Factors in Computing Systems, 193-200. New Orleans, Louisiana. ACM/SIGCHI.
[Roth and Mattis 1991] Roth, S. F., and Mattis, J. 1991. Automating the Presentation of Information. In Proceedings of the IEEE Conference on AI Applications, 90-97. Miami Beach, FL.
[Roth et al. 1991] Roth, S. F.; Mattis, J.; and Mesnard, X. 1991. Graphics and Natural Language Generation as Components of Automatic Explanation. In [Sullivan and Tyler 1991], 207-239.
[Schwanauer and Levitt 1993] Schwanauer, S. and Levitt, D. (eds). 1993. Machine Models of Music. Cambridge, MA: MIT Press.
[Searle 1969] Searle, J. R. 1969. Speech Acts: An Essay in the Philosophy of Language. London: Cambridge University Press.
[Stein et al. 1992] Stein, A., Thiel, U., and Tissen, A. 1992. Knowledge-based Control of Visual Dialogues in Information Systems. In [Catarci, Costabile, and Levialdi 1992], 138-155.
[Stein and Thiel 1993] Stein, A. and Thiel, U. 1993. A Conversational Model of Multimodal Interaction in Information Systems. In Proceedings of the Eleventh National Conference on Artificial Intelligence, 283-288. Washington, DC: AAAI/MIT Press.
[Stock et al. 1993] Stock, O. and the AlFresco Project Team. 1993. AlFresco: Enjoying the Combination of Natural Language Processing and Hypermedia for Information Exploration. In Intelligent Multimedia Interfaces, ed. M. Maybury, 197-224. Menlo Park: AAAI/MIT Press.
[Sullivan and Tyler 1991] Sullivan, J. W., and Tyler, S. W. (eds.) 1991. Intelligent User Interfaces. Frontier Series. New York: ACM Press.
[Taylor and Bouwhuis 1989] Taylor, M., and Bouwhuis, D. G. (eds). 1989. The Structure of Multimodal Dialogue. Amsterdam: Elsevier Science Publishers B.V.
[Thorisson et al. 1992] Thorisson, K., Koons, D., and Bolt, R. 1992. Multi-modal Natural Dialogue. In Proceedings of Computer Human Interaction (CHI-92), 653-654.
[Wahlster 1991] Wahlster, W. 1991. User and Discourse Models for Multimodal Communication. In [Sullivan and Tyler 1991], 45-67.
[Wittenburg 1993] Wittenburg, K. 1993. Multimedia and Multimodal Parsing: Tutorial Notes. 31st Annual Meeting of the ACL, Columbus, Ohio, 23 June, 1993.

About the Author

Mark Maybury is Director of the Bedford Artificial Intelligence Center at the MITRE Corporation, a group which performs research in intelligent human computer interaction, natural language processing, knowledge based software and intelligent training. Mark received a BA in Mathematics from the College of the Holy Cross, Worcester, MA in 1986, an M. Phil. in Computer Speech and Language Processing from Cambridge University, England in 1987, an MBA from Rensselaer Polytechnic Institute in Troy, NY in 1989, and a PhD in Artificial Intelligence from Cambridge University, England in 1991 for his dissertation, Generating Multisentential Text using Communicative Acts. Mark has published over fifty articles in the area of language generation and multimedia presentation. He chaired the AAAI-91 Workshop on Intelligent Multimedia Interfaces and edited the resulting international collection, Intelligent Multimedia Interfaces (AAAI/MIT Press, 1993). Mark's research interests include communication planning, tailored information presentation, and narrated animation.

Appendix A: Quick Guide to the Current Literature

The literature relevant to intelligent multimedia interfaces includes proceedings from many workshops on individual media (e.g., text generation, graphics generation). Related collections from workshops have focused on intelligent user interfaces in general [Sullivan and Tyler 1991; Gray et al. 1993], on multimedia interface design [Blattner and Dannenberg 1992; Catarci, Costabile, and Levialdi 1992], and on multimedia communication [Taylor and Bouwhuis 1989]. [Maybury 1993] focuses specifically on those intelligent interfaces that exploit multiple media and modes to facilitate human-computer communication.

Key References (Books)

[Taylor and Bouwhuis 1989] Taylor, M., and Bouwhuis, D. G. (eds) 1989. The Structure of Multimodal Dialogue. Amsterdam: Elsevier Science Publishers B.V.
[Sullivan and Tyler 1991] Sullivan, J. W., and Tyler, S. W. (eds) 1991. Intelligent User Interfaces. Frontier Series. New York: ACM Press.
[Blattner and Dannenberg 1992] Blattner, M. M. and Dannenberg, R. B. (eds) 1992. Multimedia Interface Design, Reading, MA: ACM Press/Addison-Wesley.
[Catarci, Costabile, and Levialdi 1992] Catarci, T., Costabile, M. F., and Levialdi, S. (eds) 1992. Advanced Visual Interfaces: Proceedings of the International Workshop AVI92, Singapore: World Scientific Series in Computer Science, Vol 36.
[Maybury 1993] Maybury, M. (ed) 1993. Intelligent Multimedia Interfaces, Cambridge, MA: AAAI/MIT Press.

Key References (Workshop/Conference Proceedings)

[Neches and Kaczmarek 1986] Neches, R. and Kaczmarek, T. 1986. Working Notes from the AAAI Workshop on Intelligence in Interfaces, August 14, 1986. Menlo Park: AAAI.
[Arens et al. 1989] Arens, Y., Feiner, S., Hollan, J., and Neches, R. (eds) 1989. Workshop Notes from the IJCAI-89 Workshop on A New Generation of Intelligent Interfaces, Detroit, MI, 22 August.
[Maybury 1991] Maybury, M. T. (ed) 1991. Working Notes from the AAAI Workshop on Intelligent Multimedia Interfaces, Ninth National Conference on Artificial Intelligence. 15 July, Anaheim, CA. Menlo Park: AAAI.
[Taylor et al. 1991] Taylor, M. M., Néel, F., and Bouwhuis, D. G. (eds) 1991. Pre-proceedings of the Second Venaco Workshop on The Structure of Multimodal Dialogue, Acquafredda di Maratea, Italy, September 1991.
[Gray et al. 1993] Gray, W. D., Hefley, W. E., and Murray, D. (eds) 1993. Proceedings of the 1993 International Workshop on Intelligent User Interfaces, Orlando, FL, January 1993. New York: ACM.
[Johnson et al. forthcoming] Johnson, P., Marks, J., Maybury, M., Moore, J., and Feiner, S. (organizing committee). AAAI 1994 Spring Symposium on Intelligent Multimedia and Multimodal Systems, Stanford, CA, March 21-24, 1994.
[Buxton, Gaver, and Bly, in press] Buxton, W., Gaver, W. and Bly, S. Auditory Interfaces: The use of Non-speech Audio at the Interface. Cambridge University Press.

Key References (Tutorials/Overviews)

[Wittenburg 1993] Wittenburg, K. 1993. Multimedia and Multimodal Parsing: Tutorial Notes. 31st Annual Meeting of the ACL, Columbus, Ohio, 23 June, 1993.
[Feiner, Mackinlay and Marks 1992] Feiner, S., Mackinlay, J. and Marks, J. 1992. Automating the Design of Effective Graphics for Intelligent User Interfaces. Tutorial Notes. Human Factors in Computing Systems, CHI-92, Monterey, May 4, 1992.
[Wahlster 1993] Wahlster, W. 1993. Planning Multimodal Discourse. Invited Talk. Association for Computational Linguistics, Annual Meeting, Ohio State Univ., Columbus, Ohio, 24 June 1993.

Appendix B: Terminology Definitions

There is much terminological inconsistency in the literature regarding the use of the terms media and modality. By mode or modality we refer primarily to the human senses employed to process incoming information, e.g., vision, audition, taction, olfaction. We do not mean mode in the sense of purpose, e.g., word processing mode versus spreadsheet mode. Additionally, we recognize medium, in its conventional definition, to refer both to the material object (e.g., paper, video) and to the means by which information is conveyed (e.g., a sheet of paper with text on it). We elaborate these definitions to allow for layering so that, for example, a natural language medium might itself use written text or speech as media, even though those media in turn rely on other modes.

Media and modes are related non-trivially. First, a single medium may support several other media or modalities. For example, a piece of paper may support both text and graphics, just as a visual display may support text, images, and video. Likewise, a single modality may be supported by many media. For example, the language modality can be expressed visually (i.e., typed or written language) and aurally (i.e., spoken language); indeed, spoken language can even have a visual component (e.g., lip reading). In general, then, the relationship is many-to-many: a single medium may support several modalities, a single modality may be supported by many media, and a collection of media may engage several modalities at once. For example, a multimedia document that includes text, graphics, speech, and video engages several modalities, e.g., visual and auditory perception of natural language, visual perception of images (still and moving), and auditory perception of sounds. Finally, this multimedia and multimodal interaction occurs over time. Therefore, it is necessary to account for the processing of discourse, context shifts, and changes in agent states over time.
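To make this many-to-many relationship concrete, the following minimal sketch encodes each medium as a carrier that engages a set of modalities and computes the modalities engaged by a document composed of several media. It is illustrative only: the names Modality, Medium, MultimediaDocument, and modalities_engaged are hypothetical and do not correspond to any system discussed in this article.

    # Minimal illustrative sketch (hypothetical names) of the many-to-many
    # relationship between media and modalities described above.
    from dataclasses import dataclass, field
    from enum import Enum, auto

    class Modality(Enum):
        """Human sensory channels used to process incoming information."""
        VISION = auto()
        AUDITION = auto()
        TACTION = auto()
        OLFACTION = auto()

    @dataclass(frozen=True)
    class Medium:
        """A carrier of information, e.g., written text, graphics, speech, video."""
        name: str
        modalities: frozenset  # the modalities this medium engages

    @dataclass
    class MultimediaDocument:
        """A document composed of several media."""
        media: list = field(default_factory=list)

        def modalities_engaged(self) -> set:
            # A medium may engage several modalities, and a modality may be
            # engaged by several media; the union captures the overall mapping.
            engaged = set()
            for medium in self.media:
                engaged |= medium.modalities
            return engaged

    # Example: text and graphics are visual, speech is auditory, video is both
    # (ignoring, for simplicity, the visual component of lip reading noted above).
    doc = MultimediaDocument([
        Medium("written text", frozenset({Modality.VISION})),
        Medium("graphics", frozenset({Modality.VISION})),
        Medium("spoken language", frozenset({Modality.AUDITION})),
        Medium("video", frozenset({Modality.VISION, Modality.AUDITION})),
    ])
    print(doc.modalities_engaged())  # the set {Modality.VISION, Modality.AUDITION}

Such an explicit mapping is one simple way an interface could reason about which sensory channels a planned presentation will occupy, although, as noted above, a full account must also model how the interaction unfolds over time.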
