[0001] 1. Field of the Invention
[0002] The present invention relates to automatically identifying the presence of speech. More particularly, the invention is directed to methods and apparatus for automatically detecting and identifying received speech from a user of a speech recognition unit.
[0003] 2. Description of the Related Art
[0004] Speech recognition systems are well known in the art and are being used with increasingly frequency in hand held devices such as the “Palm Pilot” or “Compaq iPAQ” to store, in verbal form, calendar data and contact information. Hand held devices are also being used as voice message recorders and/or communication devices to record a reminder message, make a telephone call, access remote information, and the like. For example, demonstrations in laboratories have shown that these devices can function as an IP phone to transmit speech via IP packets, and to access voice portals which support voice enabled services by utilizing automatic speech recognition systems. In these applications speech is normally the input source and, therefore, speech detection is essential.
[0005] One problem with current speech detection systems is their inability to distinguish relevant speech from irrelevant speech or sounds that are normally present or heard, either separately or in combination with relevant speech, such as passing background conversations. Currently, speech recognition systems normally require the user to mark the beginning and/or end of speech input by performing an indexing activity such as pushing a button, or saying a specific word or phrase, so that the system will know when to “listen” and when to “sleep”. Some of the various techniques used by humans to determine when speech is intended for them is to listen for the use of a specific word such as their name, or look for a visual clue such as the movements of a person's mouth in combination with detecting speech. To provide speech recognition systems with functions that are compatible with the way that humans normally function, some speech recognition units use a specific word or phrase (similar to the use of a persons name) to activate or “wake up” the speech recognition system and a “go to sleep” phrase to tell the speech recognition system to stop operating. Many speech recognition units use the more positive approach of requiring the user to depress a “talk” button to activate the system. These methods, however, have specific limitations. The use of “wake up” words or phases are often undetected and additional time is then required to return the speech recognition system on or off. Toggle-to-talk buttons require user proximity which undermines the advantage of operating without the need for physical contact with the speech recognition system.
[0006] Aside from the general need of reliability in speech activity detection, recognition of speech input to an automatic speech system can be adversely affected by background voices and environmental noise. To overcome this obstacle, a point and speak method has been proposed for use with a computer. With this system, before speaking the user points a stylus at a screen icon to alert the system that he is going to talk. This system however is not only inconvenient to the user, but the process that it uses is inherently unnatural.
[0007] Clearly, what is needed is a method and apparatus for using operating or employing a speech recognition system that avoids the shortcomings of current systems. In the present invention, human speech activity is automatically detected and processed in a passive manner so that no extra effort or unnatural activity is required by the user to activate the speech recognition system.
[0008] The present invention provides methods and apparatus for automatically controlling the operation of a speech recognition system without requiring any unnatural movement or activity on the part of the speaker. As is apparent, people make various sounds that form the basis of all speech by controlling the shape and position of speech articulators such as the lips, mouth, tongue, teeth, etc. while passing air outwardly from the mouth. Controlled shapes of these articulators and their positions relative to each other determine the characteristics of the sound that is produced. The present invention identifies if received audio information is actually speech of the person that is using the speech recognition system before the system is turned on. In a hand held device, a video camera takes a video image of a speaker's face, specifically his speech articulators such as the lips and/or mouth shape. In a manner similar to “lip reading”, this information is analyzed to identify the sounds or words that such shape would make. At the same time, a microphone receives the sound that is actually produced. The characteristics of that sound are then compared to the sound that “should” result from the observed shape of the speech articulators to determine whether there is a match. If so, then the speech received is identified as emanating from the person in the video image, and the speech recognition system is activated to process the received sound.
[0009] For a better understanding of the invention, reference is made to the following description taken in conjunction with the accompanying drawing, and the scope of the invention will be pointed out by the appended claims.
[0010] Other objects and features of the present invention will become apparent from the following detailed description considered in conjunction with the accompanying drawings. It is It is to be understood, however, that the drawings are designed solely for purposes of illustration and not as a definition of the limits of the invention, for which reference should be made to the appended claims. It should be further understood that the drawings are not necessarily drawn to scale and that, unless otherwise indicated, they are merely intended to conceptually illustrate the structures and procedures described herein.
[0011] In the drawings:
[0012]
[0013]
[0014]
[0015] The present invention is broadly directed to methods and apparatus which automatically detect and determine if received speech is that of the user of the speech recognition system and, if so, for generating a signal to start the operation of the speech recognition system without requiring the speaker to first utter any activating control words or to depress or operate a start-stop button or switch. Thus, the occurrence of human speech is automatically detected and such detection is entirely transparent to the speaker. The method of the invention is preferably implemented in a digital computer based speech recognition system capable of recognizing speech data, and of at least temporarily storing recognized speech data in a memory. A typical speech recognition system receives speech as a collection or stream of speech data segments. As each speech data segment is vocalized by a user, the automated speech recognition system recognizes and stores a data element that corresponds to that speech data segment. In accordance with the present invention, as soon as speech is detected and is identified as being from the user, the speech recognition system is activated and remains so until speech from the user is no longer detected. No overt or conscious act on the part of the speaker is, therefore, required to activate and deactivate the system.
[0016] Referring to
[0017]
[0018]
[0019] There has accordingly been disclosed a method for inputting speech to a handheld device simultaneously using a microphone and a camera, where the microphone and camera are located on or in the handheld device. It should nevertheless be understood that the method herein disclosed can also be used with other devices, such as desktop computers or in IP telephony, that are equipped or associated with a microphone and camera. It is again noted that the inventive method is totally passive as it does not require that the user or others perform any unusual steps or other function in addition to the normal action of talking into the handheld or other device.
[0020] Although the invention has been described herein with respect to specific embodiments, it should be understood that these embodiments are exemplary only, and that it is contemplated that the described methods and apparatus of the invention can be varied widely while still maintaining the advantages of the invention. Thus, the disclosure should not be understood as limiting in any way the intended scope of the invention. In addition, as used herein, the term “unit” is intended to refer to a digital device that may, for example, take the form of a hardwired circuit, or software executing on a processor, or a combination thereof. For example, the units
[0021] Thus, while there have shown and described and pointed out fundamental novel features of the invention as applied to a preferred embodiment thereof, it will be understood that various omissions and substitutions and changes in the form and details of the devices illustrated, and in their operation, may be made by those skilled in the art without departing from the spirit of the invention. For example, it is expressly intended that all combinations of those elements and/or method steps which perform substantially the same function in substantially the same way to achieve the same results are within the scope of the invention. Moreover, it should be recognized that structures and/or elements and/or method steps shown and/or described in connection with any disclosed form or embodiment of the invention may be incorporated in any other disclosed or described or suggested form or embodiment as a general matter of design choice. It is the intention, therefore, to be limited only as indicated by the scope of the claims appended hereto.