[0002] 1. Field of Invention
[0003] This invention is directed to abstracting portions of information that is represented in finite-state devices.
[0004] 2. Description of Related Art
[0005] Multimodal interfaces allow input and/or output to be conveyed over multiple different channels, such as speech, graphics, gesture and the like. Multimodal interfaces enable more natural and effective interaction, because particular modes are best-suited for particular kinds of content. Multimodal interfaces are likely to play a critical role in the ongoing migration of interaction from desktop computing to wireless portable computing devices, such as personal digital assistants, like the Palm Pilot®, digital cellular telephones, public information kiosks that are wirelessly connected to the Internet or other distributed networks, and the like. One barrier to adopting such wireless portable computing devices is that they offer limited screen real estate, and often have limited keyboard interfaces, if any keyboard interface at all.
[0006] To realize the full potential of such wireless portable computing devices, multimodal interfaces need to support not just input from multiple modes. Rather, multimodal interfaces also need to support synergistic multimodal utterances that are optimally distributed over the various available modes. In order to achieve this, the content from different modes needs to be effectively integrated.
[0007] One previous attempt at integrating the content from the different modes is disclosed in “Unification-Based Multimodal Integration”, M. Johnston et al. (hereafter Johnston 1).
[0008] In Johnston 1, a unification operation over typed feature structures was used to model the integration between the gesture mode and the speech mode. Unification operations determine the consistency of two pieces of partial information. If the two pieces of partial information are determined to be consistent, the unification operation combines the two pieces of partial information into a single result. Unification operations were used to determine whether a given piece of gestural input received over the gesture mode was compatible with a given piece of spoken input received over the speech mode. If the gestural input was determined to be compatible with the spoken input, the two inputs were combined into a single result that could be further interpreted.
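For illustration only, the following Python sketch shows the kind of unification operation described above, using nested dictionaries as a stand-in for typed feature structures. The function name, the example command structures and the coordinate values are assumptions made for this sketch and are not taken from Johnston 1 or from this disclosure; subtype compatibility within a type hierarchy is not modeled.

# Illustrative sketch: unification of two partial feature structures
# represented as nested Python dicts. Unification returns None when the two
# pieces of partial information conflict, and otherwise merges them into a
# single combined structure.

def unify(fs1, fs2):
    """Return the unification of two feature structures, or None on conflict."""
    if not isinstance(fs1, dict) or not isinstance(fs2, dict):
        return fs1 if fs1 == fs2 else None
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature in result:
            merged = unify(result[feature], value)
            if merged is None:
                return None          # inconsistent partial information
            result[feature] = merged
        else:
            result[feature] = value  # feature specified on only one side
    return result

# A spoken command whose location feature is required to be of type line ...
speech_fs = {"cmd": "move", "location": {"type": "line"}}
# ... unified with the interpretation of a drawn line gesture.
gesture_fs = {"location": {"type": "line", "coords": [(0, 0), (5, 5)]}}

print(unify(speech_fs, gesture_fs))                      # combined result
print(unify(speech_fs, {"location": {"type": "point"}})) # None: type clash

In this simplified form, unification fails whenever the two structures assign conflicting atomic values to the same feature, which corresponds to the consistency check described above.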
[0009] In Johnston 1, typed feature structures were used as a common meaning representation for both the gestural inputs and the spoken inputs. In Johnston 1, the multimodal integration was modeled as a cross-product unification of feature structures assigned to the speech and gestural inputs. While the technique disclosed in Johnston 1 overcomes many of the limitations of earlier multimodal systems, this technique does not scale well to support multi-gesture utterances, complex unimodal gestures, or other modes and combinations of modes. To address these limitations, the unification-based multimodal integration technique disclosed in Johnston 1 was extended in “Unification-Based Multimodal Parsing”, M. Johnston,
[0010] Johnston 2 and 3 disclosed how techniques from natural language processing can be adapted to support parsing and interpretation of utterances distributed over multiple modes. In the approach disclosed by Johnston 2 and 3, speech and gesture recognition produce n-best lists of recognition results. The n-best recognition results are assigned typed feature structure representations by speech interpretation and gesture interpretation components. The n-best lists of feature structures from the spoken inputs and the gestural inputs are passed to a multi-dimensional chart parser that uses a multimodal unification-based grammar to combine the representations assigned to the input elements. Possible multimodal interpretations are then ranked. The optimal interpretation is then passed on for execution.
[0011] Further, in Johnston 1-3, gestural inputs are assigned typed feature structures by the gesture representation agents. Using feature structures as a semantic representation framework allows for the specification of partial meanings. Spoken or gestural input which partially specifies a command can be represented as an underspecified feature structure in which certain features are not instantiated. Adopting typed feature structures facilitates the statement of constraints on integration. For example, if a given speech input can be integrated with a line gesture, it can be assigned a feature structure with an underspecified location feature whose value is required to be of type line.
[0012] However, the unification-based approach disclosed in Johnston 1-Johnston 3 does not allow for tight coupling of multimodal parsing with speech and gesture recognition. Compensation effects depend on the correct answer appearing in each of the n-best lists of interpretations obtained from recognizing the inputs of each mode. Moreover, multimodal parsing cannot directly influence the progress of either speech recognition or gesture recognition. The multi-dimensional parsing approach also raises significant concerns in terms of computational complexity. In the worst case, for the multi-dimensional parsing technique disclosed in Johnston 2, the number of parses to be considered is exponential in the number of input elements and the number of interpretations those input elements have. This complexity is manageable when the inputs yield only n-best results for small n. However, the complexity quickly gets out of hand if the inputs are sizable lattices with associated probabilities.
[0013] The unification-based approach also runs into significant problems when choosing between multiple competing parses and interpretations. Probabilities associated with composing speech events and multiple gestures need to be combined. Uni-modal interpretations need to be compared to multimodal interpretations and so on. While this can all be achieved using the unification-based approach disclosed in Johnston 1-Johnston 3, significant post-processing of sets of competing multimodal interpretations generated by the multimodal parser will be involved.
[0014] An alternative to the unification-based multimodal parsing technique disclosed in Johnston 3 is discussed in “Finite-state Multimodal Parsing and Understanding”, M. Johnston and S. Bangalore (hereafter Johnston 4).
[0015] In Johnston 4 and the incorporated 253 application, language and gesture input streams are parsed and integrated by a single weighted finite-state device. This single weighted finite-state device provides language models for speech and gesture recognition and composes the meaning content from the speech and gesture input streams into a single semantic representation. Thus, Johnston 4 and the incorporated 253 application not only address multimodal language recognition, but also encode the semantics as well as the syntax into a single weighted finite-state device. Compared to the previous approaches for integrating multimodal input streams, such as those described in Johnston 1-3, which compose elements from n-best lists of recognition results, Johnston 4 and the incorporated 253 application provide the potential for direct compensation among the various multimodal input modes.
[0016] Further, in Johnston 4 and the incorporated 253 application, the structure and interpretation of the multimodal commands are captured in a multimodal context-free grammar in which the multimodal aspects of the context-free grammar are apparent in the terminals. The terminals contain three components, W, G and M, where W is the spoken language stream, G is the gesture stream, and M is the combined meaning. The non-terminals in the multimodal grammar are atomic symbols. In some exemplary embodiments of Johnston 4 and the incorporated 253 application, the gesture symbols G can be organized into a type hierarchy reflecting the ontology of the entities in the application domain. For example, a pointing gesture may be assigned the general semantic type “G”. This general semantic gesture “G” may have various subtypes, such as “Go” and “Gp”. In various exemplary embodiments, “Go” represents a gesture made against an organization object. In various exemplary embodiments, “Gp” represents a gesture made against a person object. Furthermore, the “Gp” type gesture may itself have subtypes, such as, for example, “Gpm” and “Gpf” for objects that respectively represent male and female persons. Compared with a feature-based multimodal grammar, these semantic types constitute a set of atomic categories which make the relevant distinctions for gesture events to predict speech events and vice versa.
[0017] Johnston 4 and the incorporated 253 application allow the gestural input to dynamically alter the language model used for speech recognition. In addition, the approach provided in Johnston 4 and the incorporated 253 application reduces the computational complexity of multi-dimensional multimodal parsing. In particular, the weighted finite-state devices used in Johnston 4 and the incorporated 253 application provide a well-understood probabilistic framework for combining the probability distributions associated with the speech and gesture or other input modes and for selecting among multiple competing multimodal interpretations.
[0018] Systems and methods for recognizing and representing gestural input have been very limited. Both the feature-based multimodal grammar provided in Johnston 1-3 and the multimodal context free grammar disclosed in Johnston 4 and the incorporated 253 application may be used to capture the structure and interpretation of multimodal commands. However, neither approach allows for efficiently and generically representing arbitrary gestures.
[0019] In addition, in the unification-based multimodal grammar disclosed in Johnston 1-3, spoken phrases and physical gestures are assigned typed feature structures by the natural language and gesture interpretation systems, respectively. In particular, each gesture feature structure includes a content portion that allows the specific location of the gesture on the gesture input portion of the multimodal user input device
[0020] However, in order to capture multimodal integration using finite-state methods, it is necessary to abstract over certain aspects of the gestural content. When using finite-state automata, a unique identifier is needed for each object or location in the gesture input portion
[0021] In the finite-state approach used in the systems and methods according to this invention, one possible, but ultimately unworkable solution, would be to incorporate all of the different possible identifiers for all of the different possible elements of the gesture input device
[0022] Various exemplary embodiments of the systems and methods disclosed in Johnston 4 and the incorporated 253 application overcome this problem by storing the specific identifiers of the elements within the gesture input portion
[0023] In Johnston 4 and the incorporated 253 application, therefore, instead of having the specific values for each possible object identifier in a finite-state automaton, that finite-state automaton instead incorporates the transitions “ε:e
[0024] For example, assuming a user using the gesture input portion
[0025] In Johnston 4 and the incorporated 253 application, during operation, the gesture recognition system
[0026] This invention separately provides systems and methods for abstracting over certain aspects of certain components of information represented by finite-state methods and then reintegrating the components after performing any applicable operation.
[0027] This invention provides systems and methods for capturing multimodal integration using finite-state methods.
[0028] This invention separately provides systems and methods for abstracting over certain aspects of certain components of information represented by finite-state methods and then reintegrating the components after multimodal parsing and integration.
[0029] This invention separately provides systems and methods for abstracting over certain aspects of the gestural content when using finite-state methods.
[0030] This invention separately provides systems and methods for abstracting over certain aspects of the gestural content and then reintegrating that content after multimodal parsing and integration.
[0031] In various exemplary embodiments, the systems and methods according to this invention abstract over certain parts, types and/or elements of information that are input into a device by representing the information with at least one symbol that serves, at least in part, as a placeholder for that information. This placeholder symbol is used by a function implemented using finite-state devices. After the function is performed, the certain parts, types and/or elements of information represented by the symbols are reintegrated with the result of the performed function.
[0032] In various exemplary embodiments, the systems and methods according to this invention abstract over certain parts, types and/or elements of information that are input into a device by representing the information with at least one symbol that serves, at least in part, as a placeholder for that information. This placeholder symbol is used by a function implemented using finite-state devices. After the function is performed, the certain parts, types and/or elements of information represented by the symbols are reintegrated with the result of the performed function. When the output contains one of the symbols that served as a placeholder for the information, the corresponding part, type and/or element of information is substituted back into the result. As a result, the result contains those parts, types and/or elements of information that were unnecessary and/or difficult to include when performing the desired function.
[0033] In various exemplary embodiments, the systems and methods according to this invention abstract over certain information contained in a finite-state device by representing the input information using a first finite-state device that relates at least one set of symbols to another set of symbols representing the input information. Certain parts, types and/or elements of the information, although representing the same information, are represented by different symbols. A projection of the first finite-state device is used to perform a desired function using finite-state devices. A result finite-state machine is generated, and a projection of the result finite-state machine is obtained. The projection of the result finite-state machine is composed with the first finite-state device. A final finite-state device is generated to reincorporate the symbols, representing certain information, that were not used in carrying out the desired function. The output of the performed function may then be obtained by concatenating the symbols representing that output. When a symbol represents certain information that is also represented by at least one other symbol, the symbol is replaced with the corresponding symbol for that information that was not used in performing the desired function.
[0034] These and other features and advantages of this invention are described in, or are apparent from, the following detailed description of various exemplary embodiments of the systems and methods according to this invention.
[0035] Various exemplary embodiments of this invention will be described in detail with reference to the accompanying figures.
[0069] In each step of the processing cascade, one or two lattices are input and composed to produce an output lattice. In automatic speech recognition and in the following description of the exemplary embodiments of the systems and methods of this invention, the term “lattice” denotes a directed and labeled graph, which is possibly weighted. In each lattice, there is typically a designated start node “s” and a designated final node “t”. Each possible pathway through the lattice from the start node s to the final node t induces a hypothesis based on the arc labels between each pair of nodes in the path. For example, in a word lattice, the arc labels are words and the various paths between the start node s and the final node t form sentences. The weights on the arcs on each path between the start node s and the final node t are combined to represent the likelihood that that path will represent a particular portion of the utterance.
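As an informal illustration of the lattice terminology used above, the following Python sketch represents a small word lattice as an adjacency map with a start node “s” and a final node “t” and enumerates the hypotheses induced by its paths, combining arc weights by addition (as with negative-log costs). The node names, words and weights are invented for this sketch.

# arcs: node -> list of (next_node, word_label, cost); "s" is the start node
# and "t" the final node, as in the description above.
lattice = {
    "s": [("1", "show", 0.1)],
    "1": [("2", "cheap", 0.4), ("2", "these", 0.9)],
    "2": [("t", "restaurants", 0.2)],
    "t": [],
}

def paths(lattice, node="s", labels=(), cost=0.0):
    """Yield (hypothesis, combined cost) for every path from s to t."""
    if node == "t":
        yield " ".join(labels), cost
        return
    for nxt, label, weight in lattice[node]:
        yield from paths(lattice, nxt, labels + (label,), cost + weight)

for hypothesis, cost in sorted(paths(lattice), key=lambda pair: pair[1]):
    print(f"{cost:.1f}  {hypothesis}")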
[0070] As shown in
[0071] The phone lattice
[0072] In particular, one conventional method of implementing automatic speech recognition forms each of the acoustic model lattice
[0073] Conventionally, the grammar or language model lattice
[0074] In contrast, in various exemplary embodiments of the systems and methods according to this invention, in the multimodal recognition or meaning system
[0075] Alternatively, in various exemplary embodiments of the systems and methods according to this invention, the output of the automatic speech recognition system
[0076] Furthermore, it should be appreciated that, in various exemplary embodiments of the systems and methods according to this invention, the output of the gesture recognition system
[0077] Thus, it should further be appreciated that, while the following detailed description focuses on speech and gesture as the two input modes, any two or more input modes that can provide compensation between the modes, which can be combined to allow meaning to be extracted from the two or more recognized outputs, or both, can be used in place of, or in addition to, the speech and gesture input modes discussed herein.
[0078] In particular, as shown in
[0079] In those various exemplary embodiments that provide compensation between the gesture and speech recognition systems
[0080] The automatic speech recognition system
[0081] In contrast, in those exemplary embodiments that additionally extract meaning from the combination of the recognized gesture and the recognized speech, the multimodal parser/meaning recognition system
[0082] The multimodal parser/meaning recognition system
[0083] Moreover, in contrast to both of the embodiments outlined above, in those exemplary embodiments that only extract meaning from the combination of the recognized multimodal inputs, the multimodal parser/meaning recognition system
[0084] When the gesture recognition system
[0085] In this case, the gesture recognition lattice
[0086] FIGS.
[0087] The gesture feature extraction subsystem
[0088] The gesture feature lattice
[0089] It should be appreciated that the gesture feature recognition subsystem
[0090] For example, one known system captures the time and location or locations of the gesture. Optionally, these inputs are then normalized and/or rotated. The gestures are then provided to a pattern classification device that is implemented as part of the gesture feature recognition subsystem
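The following Python sketch is an illustrative toy example of the kind of pattern classification step described above, not the known system itself: captured points are resampled, scaled into a unit box and compared against a small set of stored templates by average point-to-point distance. The template shapes, the resampling scheme and the distance measure are assumptions made for this sketch.

import math

def normalize(points, n=16):
    """Resample a stroke to n points and scale it into a unit bounding box."""
    xs, ys = zip(*points)
    width = (max(xs) - min(xs)) or 1.0
    height = (max(ys) - min(ys)) or 1.0
    scaled = [((x - min(xs)) / width, (y - min(ys)) / height) for x, y in points]
    step = (len(scaled) - 1) / (n - 1)
    return [scaled[round(i * step)] for i in range(n)]

def distance(a, b):
    """Average point-to-point distance between two normalized strokes."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

templates = {
    "line": normalize([(i, i) for i in range(10)]),
    "area": normalize([(math.cos(t / 5), math.sin(t / 5)) for t in range(32)]),
}

def classify(points):
    """Return the name of the stored template closest to the input stroke."""
    candidate = normalize(points)
    return min(templates, key=lambda name: distance(templates[name], candidate))

# A roughly circular stroke should fall closest to the "area" template.
print(classify([(math.cos(t / 4), math.sin(t / 4)) for t in range(26)]))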
[0091] When a single gesture is formed by two or more temporally-related and/or spatially-related gesture components, those gesture components can be combined into a single gesture either during the recognition process or by the multimodal parser/meaning recognition system
[0092] In various exemplary embodiments, the multimodal parser and meaning recognition system
[0094] The lattice projection subsystem
[0095] In those various embodiments that combine the gesture recognition lattice
[0096] In those various exemplary embodiments that extract meaning from the multimodal inputs, the gesture and speech composing subsystem
[0097] It should be appreciated that the systems and methods disclosed herein use certain simplifying assumptions with respect to temporal constraints. In multi-gesture utterances, the primary function of temporal constraints is to force an order on the gestures. For example, if a user generates the spoken utterance “move this here” and simultaneously makes two gestures, then the first gesture corresponds to the spoken utterance “this”, while the second gesture corresponds to the spoken utterance “here”. In the various exemplary embodiments of the systems and methods according to this invention described herein, the multimodal grammars encode order, but do not impose explicit temporal constraints. However, it should be appreciated that there are multimodal applications in which more specific temporal constraints are relevant. For example, specific temporal constraints can be relevant in selecting among unimodal and multimodal interpretations. That is, for example, if a gesture is temporally distant from the speech, then the unimodal interpretation should be treated as having a higher probability of being correct. Methods for aggregating related inputs are disclosed in the U.S. patent application Ser. No. ______ (Attorney Docket No. 112364, filed on even date herewith), which is incorporated herein by reference in its entirety.
[0098] To illustrate one exemplary embodiment of the operation of the multimodal recognition and/or meaning system
[0099] The structure and interpretation of multimodal commands of this kind can be captured declaratively in a multimodal context-free grammar. A multimodal context-free grammar can be defined formally as the quadruple MCFG as follows:
TABLE 1
MCFG = <N, T, P, S> where
  N is the set of non-terminals;
  P is the set of productions of the form A → α, where A ∈ N and α ∈ (N ∪ T)*;
  S is the start symbol for the grammar;
  T is the set of terminals: ((W ∪ ε) × (G ∪ ε) × (M ∪ ε)), where
    W is the vocabulary of the speech;
    G is the vocabulary of gesture: G = (GestureSymbols ∪ EventSymbols);
      GestureSymbols = {Gp, Go, Gpf, Gpm, . . . };
      Finite collections of EventSymbols = {e
    M is the vocabulary that represents meaning and includes EventSymbols ⊂ M.
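For illustration only, the following Python sketch shows one way the terminal triples of such a multimodal context-free grammar could be represented as data and expanded into a sequence of W:G:M terminals. The particular words, gesture symbols, meaning symbols and the helper function are assumptions made for this sketch and are not taken from the grammar of Table 1.

EPS = "eps"   # epsilon: the empty symbol on a tape

# Productions map non-terminals to alternatives; each alternative is a list
# of items that are either non-terminals (strings) or terminals written as
# (W, G, M) triples.
productions = {
    "S": [["CMD"]],
    "CMD": [[("email", EPS, "email(["), "PERSON", (EPS, EPS, "])")]],
    "PERSON": [
        [("this", "Gp", EPS), ("person", EPS, "person("),
         (EPS, "e1", "e1"), (EPS, EPS, ")")],
    ],
}

def expand(symbol="S"):
    """Expand the first alternative of each non-terminal into terminal triples."""
    terminals = []
    for item in productions[symbol][0]:
        terminals.extend(expand(item) if isinstance(item, str) else [item])
    return terminals

for w, g, m in expand():
    print(f"W:{w:<10} G:{g:<6} M:{m}")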
[0100] In general, a context-free grammar can be approximated by a finite-state automaton. The transition symbols of the finite-state automaton are the terminals of the context-free grammar. In the case of the multimodal context-free grammar defined above, these terminals contain three components, W, G and M. With respect to the discussion outlined above regarding temporal constraints, more specific temporal constraints than order can be encoded in the finite-state approach by writing symbols representing the passage of time onto the gesture tape and referring to such symbols in the multimodal grammar.
[0101] In the exemplary embodiment of the gesture input portion
[0102] In Johnston 4 and the incorporated 253 application, compared with a feature-based multimodal grammar, these semantic types constitute a set of atomic categories which make the relevant distinctions for gesture events to predict speech events and vice versa. For example, if the gesture is a deictic, i.e., pointing, gesture to an object in the gesture input portion
[0103] In Johnston 4 and the incorporated 253 application, the gesture symbols G can be organized into a type hierarchy reflecting the ontology of the entities in the application domain. For example, in the exemplary embodiment of the gesture input portion
[0104] The systems and methods for recognizing and representing gestures disclosed in U.S. patent application Ser. No. ______ (Attorney Docket No. 110805, filed on even date herewith), incorporated herein by reference in its entirety, provide an approach that can be used instead of, or in addition to, that discussed above and shown in FIGS.
TABLE 2
GestureSymbols = {G, area, location, restaurant, 1, . . . };
EventSymbols = {SEM}
[0105] These definitions can be used instead of, or in addition to, the definitions of the Gesture Symbols and Event Symbols shown in Table 1.
[0107] By decomposing the gesture symbols into sequences of symbols, it is easier to reference sets of entities of a specific type. In addition, a smaller number of symbols are required in the alphabet of symbols that represent the gestural content of the grammar. Further, decomposing the gesture symbols into sequences of symbols facilitates storing specific gesture content, discussed below, and aggregating adjacent selection gestures, as disclosed in U.S. patent application Ser. No. ______ (Attorney Docket No. 112364), incorporated herein by reference in its entirety.
[0108] Under this approach, for example, the gesture symbol complexes have a basic format such as:
[0109] G FORM MEANING (NUMBER TYPE) SEM
[0110] However, it should be appreciated that the gesture symbol complexes can be implemented in any appropriate format. The “FORM” term specifies the physical form of the gesture. In various exemplary embodiments, the “FORM” term can take such values as area, point, line and arrow. The “MEANING” term specifies the specific meaning of the form of the gesture. For example, if the value for the “FORM” term of a gesture is area, the value for the “MEANING” term of that “FORM” term can be location, selection or any other value that is appropriate. If the value of the “MEANING” term of that “FORM” term is selection, such that one or more specific entities are selected, the “NUMBER” term and the “TYPE” term can be used to further specify the entities that are or have been selected. In particular, the value for the “NUMBER” term specifies the number of entities selected, such as 1, 2, 3, “many” and the like. Similarly, the value for the “TYPE” term specifies a particular type of entity, such as restaurant, theatre and the like, as appropriate for the given implementation or use. A value of mixed can be used for the “TYPE” term when one or more associated gestures reference a collection of entities of different types. The “SEM” term is a placeholder for the specific content of the gesture, such as the points that make up an area or the identifiers (ids) of objects in a selection. To facilitate recomposing specific gestural content, specific content is mapped to a distinguished symbol, such as the “SEM” term, while the other attributes of the gesture are mapped to themselves.
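As an illustration of the symbol-complex format described above, the following Python sketch maps a single recognized gesture to its symbol sequence while carrying the specific content separately under the “SEM” placeholder. The field names and the sample gesture are assumptions made for this sketch.

def gesture_symbols(gesture):
    """Map one recognized gesture to (symbol sequence, specific content)."""
    symbols = ["G", gesture["form"], gesture["meaning"]]
    if gesture["meaning"] == "selection":
        symbols += [str(len(gesture["content"])), gesture["type"]]
    # The specific content (points or object ids) is abstracted to "SEM";
    # the payload itself is carried alongside for later reintegration.
    symbols.append("SEM")
    return symbols, gesture["content"]

circled_two_restaurants = {
    "form": "area",
    "meaning": "selection",
    "type": "restaurant",
    "content": ["id4", "id7"],      # ids of the selected restaurant objects
}

print(gesture_symbols(circled_two_restaurants))
# (['G', 'area', 'selection', '2', 'restaurant', 'SEM'], ['id4', 'id7'])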
[0111] When using finite-state methods, in order to capture multimodal integration, it is desirable to abstract over specific aspects of gestural content. In the systems and methods according to this invention, abstraction is performed by representing the input gesture as a finite-state transducer that maps the specific contents of the input gesture to a distinguished symbol, such as, for example, the “SEM” term, while the other attributes of the input gesture are mapped to themselves. To perform multimodal integration, the output projection of the gesture finite-state transducer is used. After multimodal integration is completed, a projection of the gesture input and the meaning portions of the resulting finite-state machine is taken. The projection of the resulting finite-state machine is composed with the gesture input finite-state transducer to reintegrate the specific content of the gesture input which was left out of the finite-state process.
[0112] Thus, in the finite-state approach used in the systems and methods according to this invention, the specific content of the input gesture is essentially stored in the gesture finite-state transducer and a projection of the output of the gestural input finite-state transducer is used to conduct multimodal modal integration using finite-state devices. Multimodal integration is performed and a resulting finite-state device, which relates the gesture input and at least one other mode of multimodal input to meaning, is generated. After multimodal integration is performed, a projection of the resulting finite-state device is taken such that the projection contains the gesture input and the meaning portions of the resulting finite-state device. The specific content of the input gesture is then retrieved from the gesture finite-state transducer by composing the gesture finite-state transducer with the projection of the resulting finite-state device.
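A minimal sketch of this abstraction step, assuming the gesture transducer is reduced to a single linear path of (I, G) label pairs, is shown below in Python. The specific content is the only label that differs between the two sides, being mapped to the distinguished “SEM” symbol; the output projection, which is what participates in multimodal integration, simply keeps the G side. The labels used are assumptions for this sketch.

# One linear path of the gesture transducer I:G, as a list of (I, G) label
# pairs. Ordinary gesture symbols map to themselves; the specific content
# ("[id4, id7]", an assumed identifier list) maps to SEM.
gesture_transducer = [
    ("G", "G"), ("area", "area"), ("selection", "selection"),
    ("2", "2"), ("restaurant", "restaurant"),
    ("[id4, id7]", "SEM"),
]

def output_projection(transducer):
    """Keep only the G side: the symbols used during multimodal integration."""
    return [g for _i, g in transducer]

print(output_projection(gesture_transducer))
# ['G', 'area', 'selection', '2', 'restaurant', 'SEM']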
[0114] In addition, in one exemplary embodiment of the systems and methods of this invention, a projection, i.e., the gesture input finite-state machine, is then used to perform multimodal integration. Accordingly, in order to reintegrate the specific content of the input gesture after multimodal integration is performed, an output projection of the resulting meaning finite-state machine is composed with the gesture finite-state transducer.
[0115] In the finite-state automata approach used in the systems and methods according to this invention, in addition to capturing the structure of language with the finite-state device, meaning is also captured. This is significant in multimodal language processing, because the central goal is to capture how the multiple modes contribute to the combined interpretation. In the finite-state automata technique used in the systems and methods according to this invention, symbols are written onto the third tape of the three-tape finite-state automaton, which, when concatenated together, yield the semantic representation for the multimodal utterance.
[0117] The user is, for example, free to give commands or to reply to requests displayed on the graphical user interface using speech, by drawing on the display with a stylus, or using synchronous multimodal combinations of the available modes. The user can, for example, ask for the review, cuisine, phone number, address, or other information for a restaurant or set of restaurants. The working-city-guide application generates graphical callouts on the display.
[0118] For example, a user can request to see restaurants using the spoken command “Show cheap Italian restaurants in Chelsea ”. The working-city-guide application will then display the appropriate map location and show the locations of the restaurants that meet the indicated criteria. Alternatively, the user can give the same command multimodally by circling an area on the map and uttering “Show cheap Italian restaurants in this neighborhood”. As shown in
[0119] As shown in
[0122] Accordingly, by splitting the symbols into symbol complexes, it is not necessary to have a unique symbol for “G_area_location”, “G_area_selection
[0123] In various exemplary embodiments, the working-city-guide application can also provide a summary, one or more comparisons, and/or recommendations for an arbitrary set of restaurants. The output is tailored to the user's preferences based on a user model, which is, for example, based on answers to a brief questionnaire. For example, as shown in
[0124] The working-city-guide application can also provide subway directions. For example, if the user speaks the spoken utterance “How do I get to this place?” and circles one of the restaurants displayed on the map, the working-city-guide application will, for example, display a query or graphical callout, such as, for example, the text string “Where do you want to go from?” The user can then respond, for example, with a spoken utterance stating a location, such as, for example, “25
[0125] The working-city-guide application then determines a subway route. In various exemplary embodiments, as appropriate, the working-city-guide application generates a multimodal presentation indicating the series of actions the user needs to take. In various exemplary embodiments, the working-city-guide application starts, for example, by zooming in on the first station and then gradually presenting each stage of the route along with a series of synchronized text-to-speech (TTS) prompts.
[0126] It should thus be appreciated, from the preceding description of Table 2 and FIGS.
[0127] To abstract over specific contents of the gestural content in accordance with the systems and methods of this invention, the gesture lattice is converted to a gesture transducer I:G, where the G side is the set of gesture symbols (including SEM) and I contains both gesture symbols and the specific contents. The specific contents of the gestural input include, for example, entities or points on the gesture input portion
[0128] Alternatively, a gesture transducer I:G can be generated by the gesture recognition system based on the gestural input. In this case, the gesture transducer is converted to a gesture finite-state machine usable to carry out multimodal integration or any other applicable function using finite-state devices. For example, if the user circles two restaurants and says “phone numbers for these two restaurants”, the gesture is represented as a transducer as shown in
[0129] After the gesture symbols G and the words W are integrated using the finite-state devices G:W and G_W:M, for example, i.e., after multimodal integration, the gesture path G and meaning path M in the resulting finite-state device are used to re-establish the connection between the SEM symbols and their specific contents, for example entities or points selected by the user, that are stored in the I path of the gesture transducer I:G. In particular, in order to reintegrate the specific contents of the gesture, the gesture transducer I:G is composed with the gesture path G and meaning path M of the device resulting from multimodal integration (I:G ∘ G:M = I:M). In addition, in order to output the meaning, the symbols on the M side are concatenated together. Further, when outputting the meaning, if the M symbol is SEM, the symbol on the I side is taken for that arc.
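The following Python sketch illustrates, under the same single-path simplification and with invented labels, the reintegration and meaning read-out just described: the stored I:G path is walked in step with the G:M result of integration, the M symbols are concatenated, and wherever the M symbol is SEM the corresponding I-side content is substituted.

gesture_transducer = [                      # I : G  (stored before integration)
    ("G", "G"), ("area", "area"), ("selection", "selection"),
    ("2", "2"), ("restaurant", "restaurant"), ("[id4, id7]", "SEM"),
]

integration_result = [                      # G : M  (produced by integration)
    ("G", "phone("), ("area", ""), ("selection", ""),
    ("2", ""), ("restaurant", ""), ("SEM", "SEM"), ("", ")"),
]

def reintegrate(ig, gm):
    """Walk the G:M result, consuming the stored I:G path in step, and
    concatenate the meaning side, substituting specific content for SEM."""
    meaning, k = [], 0
    for g, m in gm:
        i = ""
        if g:                               # non-epsilon gesture symbol:
            i, g_stored = ig[k]             # consume the matching I:G arc
            assert g == g_stored, "paths must agree on the shared G symbols"
            k += 1
        if m == "SEM":
            meaning.append(i)               # substitute the specific content
        elif m:
            meaning.append(m)
    return "".join(meaning)

print(reintegrate(gesture_transducer, integration_result))   # phone([id4, id7])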
[0130] While a three-tape finite-state automaton is feasible in principle, currently available tools for finite-state language processing generally only support two-tape finite-state automata, i.e., finite-state transducers. Furthermore, speech recognizers typically do not support the use of a three-tape finite-state automaton as a language model. Accordingly, the multimodal recognition and/or meaning system
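For illustration only, the following Python sketch shows one simple way the three components of each terminal could be split into the pair of transducers referred to in the surrounding paragraphs: a gesture-to-speech transducer over G:W pairs and a speech/gesture/meaning transducer from fused G_W symbols to meaning symbols M. The sample terminals and the fused-symbol naming convention are assumptions made for this sketch.

terminals = [                     # (W, G, M) triples along one grammar path
    ("phone", "eps", "phone("),
    ("these", "G", ""),
    ("restaurants", "restaurant", ""),
    ("eps", "SEM", "SEM"),
    ("eps", "eps", ")"),
]

gesture_to_speech = [(g, w) for w, g, m in terminals]                   # G : W
speech_gesture_to_meaning = [(f"{g}_{w}", m) for w, g, m in terminals]  # G_W : M

print(gesture_to_speech)
print(speech_gesture_to_meaning)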
[0132] The gesture-to-speech finite-state transducer shown in
[0133] The gesture-to-speech transducer shown in
[0134] It should be appreciated that, in those exemplary embodiments that do not also extract meaning, the further processing outlined below with respect to FIGS.
[0135] The speech/gesture/meaning finite-state transducer shown in
[0136] Thus, the gesture-to-speech finite-state transducer and the speech/gesture/meaning finite-state transducers shown in
[0137] It should be appreciated that there are a variety of ways in which the multimodal finite-state transducers can be integrated with the automatic speech recognition system
[0138] The approach outlined in the following description of FIGS.
[0139] Accordingly, for the specific exemplary embodiment of the multimodal user input device
[0140] In addition, in the exemplary embodiments described above with respect to
[0142] Next, in step
[0143] Then, in step
[0144] Alternatively, in step
[0145] Next, in step
[0146] Then, in step
[0147] Next, in step
[0148] Then, in step
[0149] In step
[0150] Next, in step
[0151] In step
[0152] It should be appreciated that, in embodiments that use much more complex multimodal interfaces, such as those illustrated in Johnston 1-3, the meaning finite-state transducer may very well be a weighted finite-state transducer having multiple paths between the start and end nodes representing the various possible meanings for the multimodal input and the probability corresponding to each path. In this case, in step
[0153] As outlined above, the various exemplary embodiments described herein allow spoken language and gesture input streams to be parsed and integrated by a single weighted finite-state device. This single weighted finite-state device provides language models for speech and gesture recognition and composes the meaning content from the speech and gesture input streams into a single semantic representation. Thus, the various systems and methods according to this invention not only address multimodal language recognition, but also encode the semantics as well as the syntax into a single weighted finite-state device. Compared to the previous approaches for integrating multimodal input streams, such as those described in Johnston 1-3, which compose elements from n-best lists of recognition results, the systems and methods according to this invention provide the potential for mutual compensation among the various multimodal input modes.
[0154] The systems and methods according to this invention allow the gestural input to dynamically alter the language model used for speech recognition. Additionally, the systems and methods according to this invention reduce the computational complexity of multi-dimensional multimodal parsing. In particular, the weighted finite-state devices used in the systems and methods according to this invention provide a well-understood probabilistic framework for combining the probability distributions associated with the speech and gesture input streams and for selecting among multiple competing multimodal interpretations.
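As an informal illustration of selecting among competing weighted interpretations, the following Python sketch finds the lowest-cost path through a small weighted meaning lattice using a standard shortest-path search. The lattice, its labels and its weights are invented for this sketch.

import heapq

# arcs: node -> list of (next_node, meaning_symbol, cost); costs behave like
# negative-log probabilities, so the lowest-cost path is the most likely one.
meaning_lattice = {
    "s": [("1", "phone(", 0.3), ("2", "review(", 0.9)],
    "1": [("t", "[id4])", 0.2)],
    "2": [("t", "[id4])", 0.2)],
    "t": [],
}

def best_interpretation(lattice, start="s", final="t"):
    """Dijkstra search returning (cost, concatenated meaning) of the best path."""
    queue = [(0.0, start, "")]
    seen = set()
    while queue:
        cost, node, meaning = heapq.heappop(queue)
        if node == final:
            return cost, meaning
        if node in seen:
            continue
        seen.add(node)
        for nxt, symbol, weight in lattice[node]:
            heapq.heappush(queue, (cost + weight, nxt, meaning + symbol))
    return None

print(best_interpretation(meaning_lattice))   # (0.5, 'phone([id4])')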
[0155] It should be appreciated that the systems and methods for abstraction according to this invention are not limited to gestural input. The systems and methods for abstraction according to this invention may be used for any information represented by finite-state methods. In addition, the systems and methods for abstraction may be used to abstract any components of the information that are not necessary or are difficult to retain when performing an operation using finite-state methods.
[0156] It should be appreciated that the systems and methods for abstraction according to this invention are not limited to multimodal integration. The systems and methods for abstraction according to this invention may be used in other functions implemented with finite-state devices in which it would be undesirable, difficult, impossible and/or unnecessary to include certain components of the information represented by the finite-state devices when performing the desired function. In addition, it should be appreciated that the systems and methods for abstraction according to this invention may be used to abstract over certain components of the information represented by the finite-state devices that represent other forms of multimodal inputs.
[0157] It should be appreciated that the multimodal recognition and/or meaning system
[0158] Thus, it should be understood that each of the various systems and subsystems shown in FIGS.
[0159] It should also be appreciated that, while the above-outlined description of the various systems and methods according to this invention and the figures focus on speech and gesture as the multimodal inputs, any known or later-developed set of two or more input streams representing different modes of information or communication, such as speech, electronic-ink-based gestures or other haptic modes, keyboard input, inputs generated by observing or sensing human body motions, including hand motions, gaze motions, facial expressions, or other human body motions, or any other known or later-developed method for communicating information, can be combined and used as one of the input streams in the multimodal utterance.
[0160] Thus, while this invention has been described in conjunction with the exemplary embodiments outlined above, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, the exemplary embodiments of these systems and methods according to this invention, as set forth above, are intended to be illustrative, not limiting. Various changes may be made without departing from the spirit and scope of this invention.