Title:
SPEECH RECOGNITION APPARATUS
United States Patent 3553372
Abstract:
Speech recognition is effected by zero crossing analysis of the speech waveform wherein the time interval between consecutive zero crossings is measured, with these intervals and combinations thereof being subsequently identified. Measurement of the intervals is based on a nonlinear timescale, the rate of generation of which is dependent on the fundamental frequency of the speaker. Alterations in the timescale generation are effected due to the initial time constant of the timescale generator circuitry being proportional to and controlled by the variations in the fundamental frequency of the speech waveform.


Inventors:
Wright, Esmond Philip Goodwin (Bishop's Stortford, EN)
Bezdel, Wincenty (Harlow, Essex, EN)
Application Number:
04/587539
Publication Date:
01/05/1971
Filing Date:
10/18/1966
Assignee:
International Standard Electric Corporation (New York, NY)
Primary Class:
Other Classes:
324/76.12, 704/251, 704/253, 704/254
International Classes:
G10L11/00; G10L11/04; (IPC1-7): G10L1/00
Field of Search:
179/1AS 340
US Patent References:
3416080, Apparatus for the analysis of waveforms, December 1968, Wright et al.
3335225, Formant period tracker, August 1967, Campanella
3278685, Wave analyzing system, October 1966, Harper
3102928, Vocoder excitation generator, September 1963, Schroeder
Primary Examiner:
Claffy, Kathleen H.
Assistant Examiner:
Jirauch, Charles W.
Claims:
We claim

1. Speech recognition apparatus comprising:

2. Apparatus according to claim 1 in which the means for generating the nonlinear time scale includes first and second transistors having complementary symmetry with their emitters connected together, a positive feedback connection between the base of the first transistor and the collector of the second transistor, first and second capacitors connected to the base of the second transistor and means for charging the first and second capacitors at differential rates by the voltage related to the fundamental frequency.

Description:
This invention relates to speech recognition equipment in which automatic adjustment takes place to enable the equipment to suit itself to the speech characteristics of different talkers.

In our copending application Ser. No. 437,349 filed Mar. 2, 1965 for Apparatus for the Analysis of Waveforms, now issued as U.S. Pat. No. 3,416,080, there is described apparatus in which speech recognition is accomplished by analysis of the zero crossing intervals in the speech wave. Every word has, within fairly wide limits, a recognizable pattern of zero crossings which can be divided into groups representing different sounds, the crossings making up a group being in turn identified by their number and timing relative to each other. Such a method of speech recognition can be distinguished from frequency spectrum analysis inasmuch as the information bearing parameters can be converted into a time or digital domain in the case of zero crossing analysis. The zero crossing intervals making up each group are counted under the control of a suitable nonlinear time scale.

According to the present invention there is provided speech recognition apparatus including means for detecting reversals of polarity in the speech waveform, means for generating a measuring time scale waveform when a reversal is detected, means for counting the number of time scale units generated between the detected reversal and the next detected reversal and means for altering the scale of the time scale waveform according to a characteristic of the speech waveform.

In a preferred embodiment of the present invention there is provided means for producing a voltage proportional to the fundamental frequency of the speech waveform and means for generating a nonlinear pulse train time scale, the initial time constant of the pulse generator being controlled by and proportional to the voltage derived from the fundamental frequency.

The above and other features of the invention will become more readily apparent and be better understood from the following description of an embodiment thereof, taken in conjunction with the accompanying drawings in which:

FIG. 1 illustrates a typical speech waveform and the timing of the zero crossings contained therein,

FIG. 2 illustrates an alternative method of locating the zero crossings in the waveform,

FIG. 3 is a nonlinear timescale,

FIG. 4 is a block diagram of a circuit arranged to time the intervals between successive zero crossings in a waveform,

FIG. 5 illustrates a method of extracting zero crossings from the waveform,

FIG. 6 is a circuit by which the square wave shown in FIG. 5 may be obtained,

FIG. 7 is a block diagram of a circuit by which a limited number of parts of speech may be recognized,

FIG. 8 is a block diagram of an arrangement by which a larger vocabulary may be recognized,

FIGS. 9 and 10 illustrate sections of FIG. 8,

FIG. 11 illustrates the nonlinear pulse train timescale generating circuit, and

FIG. 12 illustrates diagrammatically two nonlinear pulse time scales derived for different fundamental frequencies.

A fundamental aspect of speech recognition is the ability to extract from a speech waveform features such as frequencies, amplitudes, phase relationships etc., which can be recognized as conforming to certain known patterns for each type of speech sound. These features can be extracted and, with the aid of modern computers, measured, classified, stored and compared with various standards or reference patterns.

One method of analyzing speech waveforms for the purpose of extracting recognizable features therefrom is to count and measure the intervals between zero crossings of the waveform. A refinement of this technique is to count the number of combinations of zero crossing intervals that conform to a particular pattern. For example the speech waveform may be analyzed to ascertain the number of adjacent pairs of zero crossing intervals where the first interval falls within the range between 1 and 1.5 msec and is followed by an interval that falls within the range between 0.5 and 0.7 msec.

FIG. 1 illustrates a speech waveform 11 having zero crossings 12 to 20. The intervals between these zero crossings are represented as periods of time 21 to 28. The timing of these intervals is achieved by counting the number of timescale units generated by a timescale which is started when a zero crossing is detected. Thus interval 21 is timed as being 1 timescale unit in duration, while interval 24 is 3 timescale units in duration.
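
By way of illustration only, the location of polarity reversals in a sampled waveform and the measurement of the intervals between them may be sketched as follows. The sampled representation, the function name `zero_crossing_intervals` and the parameter `sample_rate_hz` are assumptions introduced for the example; the apparatus described here operates on the analogue waveform.

```python
def zero_crossing_intervals(samples, sample_rate_hz):
    """Return the times, in seconds, between consecutive polarity reversals."""
    crossing_times = []
    previous_sign = 0  # 0 means that no nonzero sample has been seen yet
    for index, value in enumerate(samples):
        sign = (value > 0) - (value < 0)
        if sign == 0:
            continue  # an exact zero carries no polarity; wait for the next sample
        if previous_sign != 0 and sign != previous_sign:
            crossing_times.append(index / sample_rate_hz)
        previous_sign = sign
    # Each interval is the difference between two successive crossing times.
    return [later - earlier
            for earlier, later in zip(crossing_times, crossing_times[1:])]


if __name__ == "__main__":
    waveform = [1, 1, -1, -1, -1, 1, 1, -1]
    print(zero_crossing_intervals(waveform, sample_rate_hz=1000))
```

The intervals are returned here in seconds; in the apparatus they are expressed instead as a count of timescale units.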

Whilst it has been assumed that the intervals between the actual zero crossings can be timed and counted, in practice it may be found that unwanted noise in the waveform will produce spurious zero crossings. To overcome this it can be arranged that instead of detecting the actual zero crossings, the analysis is based on the detection of those points where the waveform alternately exceeds positive and negative threshold amplitudes. This is illustrated in FIG. 2, in which the waveform 31 is depicted as crossing the positive threshold at points 32, 34, 36, 38 and 40, and crossing the negative threshold at points 33, 35, 37 and 39. This arrangement can be adopted because most of the noise in the waveform is of small amplitude compared with the speech waveform. Therefore the threshold values can be chosen so that the noise content of the waveform lies between them, and detection of the points 32 to 40 will not include spurious zero crossings. It will be noted that the threshold crossings do not depart significantly from the zero crossings, and in practice the intervals between the threshold crossings will be substantially the same as the intervals between the zero crossings.

Therefore, for the remainder of this specification the term "zero crossings" will be used to denote both actual zero crossings and threshold crossings.
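
The threshold arrangement of FIG. 2 may be sketched, under the assumption of a sampled waveform and symmetrical thresholds, as a detector which responds only when the waveform alternately exceeds the positive and the negative threshold. The function name `threshold_crossings` and the numerical values are hypothetical.

```python
def threshold_crossings(samples, threshold):
    """Return indices where the waveform alternately exceeds +threshold and -threshold."""
    crossings = []
    state = 0  # +1 after a positive crossing, -1 after a negative one, 0 at the start
    for index, value in enumerate(samples):
        if value > threshold and state != 1:
            crossings.append(index)
            state = 1
        elif value < -threshold and state != -1:
            crossings.append(index)
            state = -1
    return crossings


if __name__ == "__main__":
    # Small excursions about zero (noise) lie between the thresholds and are ignored.
    noisy = [0.0, 0.05, -0.05, 0.6, 0.4, -0.02, 0.03, -0.7, -0.1, 0.05, 0.8]
    print(threshold_crossings(noisy, threshold=0.2))
```

Because a further crossing of the same polarity is ignored until the opposite threshold has been exceeded, low-amplitude noise about zero does not give rise to spurious crossings.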

It has been stated above that the intervals between zero crossings are timed by counting timescale units, the timescale being started afresh in each case when a zero crossing is detected.

The relation between the measured interval $Z_t$, the counting period $t_c$, and the count number $n$ is:

$$n\,t_c \le Z_t < t_c\,(n+1)$$

It should be noted that $Z_t = 1/(2f)$, where $f$ is the frequency of the zero crossing wave.

Considering the lower and upper end frequencies of this wave, namely, $f_1$ and $f_2$, then

$$f_1 = \tfrac{1}{2}\, f_c\,(n+1)^{-1}$$

$$f_2 = \tfrac{1}{2}\, f_c\, n^{-1}$$

where $f_c = 1/t_c$ is the counting rate, or pulse repetition frequency in the case of a pulse timescale.

Thus $f_o = \tfrac{1}{4}\, f_c\,(2n+1)\, n^{-1}(n+1)^{-1}$, where $f_o$ is the center frequency, and $B = f_2 - f_1 = \tfrac{1}{2}\, f_c\, n^{-1}(n+1)^{-1}$ is the bandwidth.

In the previous discussion, it was assumed that the counting rate was constant during the measured interval or channel. The principal disadvantage of this technique is that the accuracy of measurement depends directly upon the frequency of the signal to be measured. It can be seen that a low frequency or long interval will be measured very accurately compared with the measurement of a high frequency or short interval.

In terms of frequency bands, each count number at the lower end of the measured spectrum will produce a bandwidth which is too narrow, and each count number at the higher end will produce a bandwidth which is too wide. For example, consider that the counting rate is 10 kc./s. The interval between two successive counts is equivalent to 5 kc./s. However, substitution of n in the preceding formulas shows that where n is equal to 1, the band is equivalent to 2,500 to 5,000 c./s. Similarly it is possible to show that for n = 15 the frequency band is approximately 312 to 333 c./s.
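
The arithmetic of the foregoing example may be checked with a short sketch which evaluates the lower and upper end frequencies for a given count number at a constant counting rate. The function name `band_for_count` is introduced for the example only.

```python
def band_for_count(counting_rate_hz, n):
    """Return (lower, upper, centre, bandwidth) in c./s. for count number n."""
    lower = counting_rate_hz / (2 * (n + 1))   # f1 = 1/2 fc (n+1)^-1
    upper = counting_rate_hz / (2 * n)         # f2 = 1/2 fc n^-1
    centre = (lower + upper) / 2
    return lower, upper, centre, upper - lower


if __name__ == "__main__":
    for n in (1, 15):
        lower, upper, centre, bandwidth = band_for_count(10_000, n)
        print(f"n={n}: {lower:.1f} to {upper:.1f} c./s., bandwidth {bandwidth:.1f}")
```

For a counting rate of 10 kc./s. the band for n = 1 is 2,500 c./s. wide, whereas the band for n = 15 is only about 21 c./s. wide, which shows how unevenly a linear timescale divides the spectrum.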

In any practical application of this counting technique, it is most desirable to increase the number of counts for a high frequency, i.e. reduce the width of the band, and to decrease the number of counts for a lower frequency, i.e. increase the width of the band. A possible method of achieving this object is to use a nonlinear measuring scale so that the counting rate is effectively different in adjacent channels.

The formulas which were derived previously for counting frequency, count number, etc., still apply. However, instead of using fc, one has to substitute a function relating fc to either time, or to count number.

This function has the form

$$f_c(n) = f_o\,\bigl(1 + \log f(n)\bigr)$$

where $f_o$ is the frequency of the first pulse.

FIG. 3 depicts a nonlinear timescale such as is used in FIGS. 1 and 2.
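
The counting of timescale units on a nonlinear scale may be sketched as follows. The sketch assumes, purely for illustration, a scale whose pulse intervals grow by a constant ratio; this geometric law, the names `make_timescale` and `count_units`, and the numerical values are hypothetical and are not taken from FIG. 3.

```python
from bisect import bisect_right


def make_timescale(first_interval, growth, pulse_count):
    """Return pulse times for a scale whose intervals grow by a constant ratio."""
    times = []
    elapsed = 0.0
    interval = first_interval
    for _ in range(pulse_count):
        elapsed += interval
        times.append(elapsed)
        interval *= growth
    return times


def count_units(measured_interval, pulse_times):
    """Count the timescale pulses generated up to the end of the measured interval."""
    return bisect_right(pulse_times, measured_interval)


if __name__ == "__main__":
    scale = make_timescale(first_interval=1.0, growth=2.0, pulse_count=4)
    print(scale)
    print([count_units(interval, scale) for interval in (0.5, 3.0, 6.9, 100.0)])
```

With such a scale short intervals are resolved by closely spaced pulses and long intervals by widely spaced ones, so that the counting rate is effectively different in adjacent channels.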

FIG. 4 illustrates by block diagrams a circuit for timing the intervals between successive zero crossings in a waveform such as that shown in either FIG. 1 or FIG. 2.

The equipments denoted by the various blocks in the drawings are known electronic circuits and do not in themselves constitute novel features of the invention.

The incoming speech waveform 50 is fed to a wave-shaping circuit 51 used to identify the zero crossings. The identification may be performed according to the procedures outlined with reference to FIG. 2. The output from the wave-shaping circuit may take the form of a square wave, as shown in FIG. 5. It will be seen that the waveform 61 in FIG. 5 can be used to produce a square wave 62 having the same zero crossing characteristics as the waveform 61. Since zero crossing analysis is independent of amplitude or other factors, a square wave of fixed amplitude having the necessary zero crossing intervals makes a suitable trigger waveform for operating counters and other circuits.

One method of producing the desired square wave is by utilizing the circuit shown in FIG. 6. In this FIG., transistor 70 operates as an amplifier for the speech input, which is limited by amplitude limiter diodes 68 and 69 so as to avoid overloading of the amplifier. Transistor 71 operates as a phase-splitter and converts the amplified and limited signal from transistor 70 into two outputs in opposite phase. These outputs are passed to two transistors 72 and 73 operating as emitter followers and arranged to reproduce negative going signals only. The waveform 63 of FIG. 5 represents the outputs of transistors 72 and 73 added together. These two outputs are taken to the inputs of a pair of trigger transistors 74 and 75. The trigger can be set to a threshold value which is adjustable by means of a potentiometer 76 in the common emitter connection of the two transistors. The outputs from the circuit are derived from two inverter transistors 77 and 78, and are represented by the square wave 62 in FIG. 5.

The circuit of FIG. 6 is biased where shown by voltages V+ or V-, all of equal amplitude with respect to ground.

Returning to FIG. 4, the output of the wave-shaping circuit is applied to a measuring circuit 55 which includes separate timescale counting circuits 52 and 53, and a timescale generating circuit 54.

As has been previously stated the timescale generated is nonlinear, and recommences when each zero crossing is detected. The counter 52 is arranged to count the timescale units following all zero crossings going positive, and the counter 53 is arranged to count the timescale units following all negative going zero crossings.

Switches 56 and 57 can be set to select the counts of either counter 52 or 53, and the selected count is passed through a gate 58 which is under the control of a threshold and control circuit 59. This threshold and control circuit is used to control the time during which an examination of zero crossings is made. The results of each examination are displayed in a display counter 60, which registers the total number of zero crossings which occur during examination time.

The equipment depicted in FIG. 4 can be arranged to make various types of examination of the speech waveform 50, for example:

I. It can count the number of zero crossing intervals that fall into the time range between 1 msec and 1.5 msec.

II. It can count the number of combinations of intervals, such as those combinations where an interval of between 1 msec and 1.5 msec is followed by an interval of between 0.5 msec and 0.7 msec.
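
The two types of examination listed above may be sketched in software as follows, on the assumption that the zero crossing intervals are already available as a list of durations in milliseconds. The function names and the sample data are hypothetical.

```python
def count_in_range(intervals_ms, low_ms, high_ms):
    """Examination I: count the intervals lying within one time range."""
    return sum(1 for interval in intervals_ms if low_ms <= interval <= high_ms)


def count_pairs(intervals_ms, first_range, second_range):
    """Examination II: count adjacent pairs whose members lie within two given ranges."""
    count = 0
    for first, second in zip(intervals_ms, intervals_ms[1:]):
        first_ok = first_range[0] <= first <= first_range[1]
        second_ok = second_range[0] <= second <= second_range[1]
        if first_ok and second_ok:
            count += 1
    return count


if __name__ == "__main__":
    measured = [1.2, 0.6, 1.4, 0.9, 1.1, 0.55]
    print(count_in_range(measured, 1.0, 1.5))
    print(count_pairs(measured, (1.0, 1.5), (0.5, 0.7)))
```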

The recognition of simple parts of speech (not in the grammatical sense), such as digits zero to nine, as opposed to simple waveform analysis, can be achieved by an arrangement such as that shown in FIG. 7. It consists of a squaring circuit 80 which identifies the zero crossing intervals, a measuring circuit 81 which measures the zero crossing intervals, and a gating circuit 82 which sorts the zero crossing intervals into seven interval ranges, referred to as channels CH, as follows:

CH1 - ∞ to 1.31 msec

CH2 - 1.31 to 0.93 msec

CH3 - 0.93 to 0.73 msec

CH4 - 0.73 to 0.42 msec

CH5 - 0.42 to 0.31 msec

CH6 - 0.31 to 0.18 msec

CH7 - 0.18 to 0 msec.
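
The sorting performed by the gating circuit 82 may be sketched as a comparison against the channel boundaries listed above. The sketch treats the first channel as open-ended above 1.31 msec and assigns an interval lying exactly on a boundary to the longer-interval channel; both points are assumptions, as is the function name `channel_for_interval`.

```python
# Lower edge, in msec, of channels CH1 to CH7 in order.
CHANNEL_LOWER_EDGES_MS = [1.31, 0.93, 0.73, 0.42, 0.31, 0.18, 0.0]


def channel_for_interval(interval_ms):
    """Return the channel number (1 to 7) into which a zero crossing interval falls."""
    for channel, lower_edge in enumerate(CHANNEL_LOWER_EDGES_MS, start=1):
        if interval_ms >= lower_edge:
            return channel
    raise ValueError("interval must not be negative")


if __name__ == "__main__":
    for interval in (2.0, 1.0, 0.8, 0.5, 0.35, 0.2, 0.1):
        print(interval, "msec -> CH", channel_for_interval(interval))
```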

A threshold circuit 83 provides "on" or "off" signals during the presence or absence of speech signals, and controls a timing circuit 84 which provides the following outputs: ##SPC1##

A group of threshold counters 85 are set to count the number of zero crossing intervals in a given channel. Each threshold counter produces an output when a threshold to which the counter is preset is reached. The following threshold counters (TC) are provided.

TC1 for CH1

TC2 for CH1 + CH2

TC3 for CH3 + CH4

TC4 for CH5

TC5 for CH6 + CH7
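
The action of the threshold counters may be sketched as follows. The allocation of channels to counters follows the list above; the preset thresholds are not given in this text, and the values used in the example are hypothetical, as is the function name `threshold_counter_outputs`.

```python
from collections import Counter

# Channels summed by each threshold counter, as listed above.
THRESHOLD_COUNTER_CHANNELS = {
    "TC1": (1,),
    "TC2": (1, 2),
    "TC3": (3, 4),
    "TC4": (5,),
    "TC5": (6, 7),
}


def threshold_counter_outputs(channel_sequence, presets):
    """Return 1 for each counter whose total reaches its preset threshold, else 0."""
    per_channel = Counter(channel_sequence)
    outputs = {}
    for name, channels in THRESHOLD_COUNTER_CHANNELS.items():
        total = sum(per_channel[channel] for channel in channels)
        outputs[name] = 1 if total >= presets[name] else 0
    return outputs


if __name__ == "__main__":
    observed_channels = [1, 1, 2, 3, 4, 4, 5, 7]
    presets = {name: 2 for name in THRESHOLD_COUNTER_CHANNELS}  # hypothetical presets
    print(threshold_counter_outputs(observed_channels, presets))
```

The resulting pattern of ones and noughts is of the kind examined by the subsequent gating circuit.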

Finally a gating circuit 86 is used to identify spoken digits according to the following patterns: ##SPC2## In these patterns 1 indicates the presence of a parameter, 0 indicates its absence, and "blank space" means that the presence or absence of a parameter is immaterial to the recognition.

An arrangement for recognizing a larger vocabulary is illustrated in FIG. 8. The speech input passes through an amplitude normalization circuit 87. In this unit a wide range of amplitudes is reduced to a range that can be handled by the circuits in the first stage of the recognition process.

In the first stage there are a number of units 88 to 95 which perform broad classifications of speech characteristics. For example, the unit marked 88 classifies the voiced or unvoiced characteristics. Units 89 and 90 isolate the first and second frequency ranges corresponding to formants of vowel sounds respectively and pass the vowel information in the form of zero crossings. Unit 91 extracts the fundamental frequency of a talker. Units marked 92 and 93 extract two groups of frequencies with respect to unvoiced sounds, and unit 94 detects consonant groups. The unit 95 is a threshold detector and unit 96 is a word-end detector.

The complexity of the first stage in the classification of speech characteristics depends mainly on the size of vocabulary and the range of talkers. For example, for the recognition of vowels it may be sufficient to analyze only one frequency range.

In the second stage of the recognition process analysis is performed on the portions of speech which were separated in the first stage. This analysis leads to the recognition of specific voiced and unvoiced sounds by the recognition circuits 97 and 98. The analysis is performed during the time controlled by a sample A which covers a segment of sound. The same analysis is repeated for any subsequent segment of the speech wave. The length of each segment, e.g. sample A, is determined by the fundamental frequency of the talker. This is the function of the measuring and segmentation unit 99.

FIG. 9 shows in more detail a part of a vowel recognition arrangement. Information is derived from the zero crossings of the first formant, and the analysis is done by measuring zero crossing distances and extracting only the significant ones. The zero crossing intervals are measured in the unit 102, and the timing control 103, controlled by sample pulse A, selects the period during which the zero crossing distances are measured. The significant zero crossing distances extracted by the unit 102 are stored in the storage units marked D1, D2 ..... Dn. As has been stated above, the length of each sample of speech is determined by the fundamental frequency of the talker. The fundamental frequency also controls measurement of zero crossing distances. One sample constitutes the shortest recognizable portion of a sound. In the case of vowels these portions may be referred to as "little vowels." For example, during an utterance of the sound "a", the recognition of a segment of the sound can consist of the following series of samples:

o, a, a, a, o.

This series is stored as three a's and two o's. The recognition of each sample is performed by the recognition circuit 104 under the control of the sample pulse A, and when a sufficient number of samples have been recognized a complete group of samples, i.e. a segment, is recognized by the recognition circuit 105 under the control of a segment pulse B. The recognition of the group of samples given above, under the control of the segment pulse B, indicates that the unknown sound was "a". The segment B covers a number of samples A which is sufficient to make a decision on the unknown sound.
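
The decision taken over a segment may be sketched as a tally of the recognized samples. The sketch assumes that the most frequently recognized sample decides the segment; the rule actually applied by the recognition circuit 105 is not specified here, and the function name `recognise_segment` is hypothetical.

```python
from collections import Counter


def recognise_segment(sample_labels):
    """Return the most frequent sample label in a segment together with the tally."""
    tally = Counter(sample_labels)
    label, _ = tally.most_common(1)[0]
    return label, dict(tally)


if __name__ == "__main__":
    # The series of "little vowels" given in the example above.
    print(recognise_segment(["o", "a", "a", "a", "o"]))
```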

Recognition of a group of parameters, such as zero crossing distances or "little vowels" and so on, can be accomplished by a straightforward threshold circuit followed by logical gating or by a statistical decision circuit.

An example of the latter is shown schematically in FIG. 10. The output from each parameter (a parameter can be represented as either 1 or 0 voltage levels, or as an analogue or quantized voltage level) is taken via resistor Ri to a point recognizing, for example, a, o etc. The value of the resistor Ri represents a weighted contribution of a given parameter to the recognition of a, o etc., and is such that ##SPC3## where Ro is a constant of the adding circuit. Contributions of Ri should satisfy the expression ##SPC4## for all i's associated with a given point, say, a, o etc.
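
The principle of a weighted summation followed by a decision may be sketched as follows. The weights in the example are hypothetical numbers chosen for illustration; the relation between the resistor values Ri and the weights is not reproduced in this text and is not modelled, and the function name `weighted_decision` is likewise an assumption.

```python
def weighted_decision(parameters, weights_by_sound):
    """Sum the weighted parameters for each candidate sound and pick the largest sum."""
    scores = {
        sound: sum(weight * value for weight, value in zip(weights, parameters))
        for sound, weights in weights_by_sound.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    parameters = [1, 0, 1]  # each parameter represented as a 1 or 0 level
    weights = {"a": [0.5, 0.2, 0.3], "o": [0.1, 0.8, 0.1]}  # hypothetical weights
    print(weighted_decision(parameters, weights))
```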

Similarly the unvoiced sounds are recognized by the recognition circuit 98.

As in the first stage, complexity of the remaining stages in the recognition process is mainly related to the size of vocabulary and the range of talkers. For example, voiced, unvoiced and phoneme recognition can be reduced to one unit. The phoneme recognition circuit 100 and the word recognition circuit 101 are arranged on the same lines as previously described with reference to FIGS. 9 and 10. The main difference is that in each succeeding recognition sequence another set of parameters is brought into use from the preceding stage.

The number of stages in the recognition process is also related to the size of vocabulary and the range of talkers. In the recognition of a short selected vocabulary it may be quite feasible to recognize words directly, without dividing them into phonemes, voiced sounds, etc.

In the arrangement shown in FIG. 11 two complementary transistors 201 and 202 have their emitters connected together. The base of transistor 202 is connected to the collector of transistor 201 by a positive feedback connection 203. The base of transistor 201 is connected to a bias voltage source at b via two resistors 210, 211 and is also connected to two grounded capacitors 212 and 213. Transistors 201 and 202 are respectively PNP and NPN and positive and negative DC bias supplies are connected as indicated to the collector and base of transistor 202 and the collector of transistor 201.

When the base of transistor 201 is driven negative sufficiently for it to begin to conduct then the action of the feedback circuit 203 will start to drive the base of transistor 202 positive. Transistor 202 then begins to conduct and its emitter-collector current reinforces the emitter-collector current of transistor 201 and the rise in emitter voltage of transistor 201 makes it conduct even more. This process continues until saturation is reached and the feedback voltage applied to the base of transistor 202 cannot rise any further.

The capacitors 212, 213 and resistors 210 and 211 control the voltage applied to the base of transistor 201 in response to a pulse at the input 204.

Initially the bias voltage b at point 208 is arranged to be equal to or more positive than the voltage a at point 209. The timing scale is initiated at time t by a negative going pulse at the input 204, applied to capacitor 212 by transistor 206. The amplitude of this pulse determines the duration T (see FIG. 12) of the succession of pulses in a timescale. This negative going pulse at 204 negatively charges capacitor 212 according to its amplitude. Capacitor 212 immediately starts to discharge according to the time constant of resistor 210 and capacitor 212. At the same time capacitor 213 is charged negatively via resistor 211 at a rate determined by the time constant of capacitor 213 and resistor 211. When the voltage on capacitor 213 drops to a point where it is equal to the voltage a at point 209, the base voltage of transistor 201 is sufficiently negative to cause the transistor to conduct. The positive feedback circuit 203 ensures that the rise in conduction of transistors 201 and 202 is very rapid and causes the first timing pulse to be delivered to the output 205. When transistor 201 is saturated the drain on capacitor 213 via the base of transistor 201 is so great that the negative potential on the base of transistor 201 cannot be maintained and collapses. The base goes more positive and the transistor 201 cuts off. Capacitor 213 has now gone positive and starts to recharge negatively. Meanwhile capacitor 212 has lost some of its negative charge due to the potential b at point 208, and therefore the rate of negative charge of capacitor 213 is reduced. Thus the second pulse interval is longer than the first, and each succeeding interval is longer than the last. FIG. 12 illustrates a timescale P generated by the circuit of FIG. 11.

The negative-going pulses at point 204 are derived from the trigger output of the circuit of FIG. 6. This circuit produces two square wave outputs which have positive-going trigger pulses, each trigger pulse in the one square wave output being representative of a positive-going zero crossing contained in the input speech wave and each trigger pulse in the other square wave output being representative of a negative-going zero crossing contained in the input speech wave. Each trigger output is inverted by conventional means to give a negative-going pulse, the leading edge of which coincides with the positive-going edge of the relevant trigger output. These two sets of negative-going pulses have a constant width and amplitude to define the period T referred to above.

If the circuit is left untouched after the initial pulse at point 204 there will come a time when the output pulse interval becomes infinite. However, in practice the period T over which the timescale is required to function covers only a small number of pulses, and at the end of this period the timescale will be restarted by receipt of a new negative going pulse at point 204. To ensure that the timescale starts from zero, so to speak, capacitor 213 is fully discharged positively at the start time t by a positive going pulse applied via the diode 207.

The value of the potential b at point 208 in relation to the potential a at point 209 controls the number and distribution of output pulses during a given period T. To alter the scale, i.e. to increase or reduce T for the same number of pulses with the same pulse interval ratios it is only necessary to alter the initial negative charge on the capacitor 212. The timescale q in FIG. 12 illustrates the effect of reducing the amplitude of the input pulse at point 204.
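
The alteration of scale described above, in which T is increased or reduced for the same number of pulses with the same pulse interval ratios, may be sketched as a uniform multiplication of the pulse times. The scaling factor stands in for the effect of the input pulse amplitude; its relation to that amplitude is not modelled, and the names `scale_timescale` and `interval_ratios` are hypothetical.

```python
def scale_timescale(pulse_times, factor):
    """Expand or contract a timescale while keeping its pulse interval ratios."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return [time * factor for time in pulse_times]


def interval_ratios(pulse_times):
    """Return the ratio of each pulse interval to the interval preceding it."""
    edges = [0.0] + list(pulse_times)
    intervals = [later - earlier for earlier, later in zip(edges, edges[1:])]
    return [later / earlier for earlier, later in zip(intervals, intervals[1:])]


if __name__ == "__main__":
    scale_p = [1.0, 3.0, 7.0]                  # hypothetical pulse times
    scale_q = scale_timescale(scale_p, 0.5)    # contracted scale
    print(scale_q)
    print(interval_ratios(scale_p) == interval_ratios(scale_q))
```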

As noted previously, a nonlinear timescale is used for counting zero crossing intervals. In the present invention the circuit of FIG. 11 is used to generate a nonlinear timescale, the scale of which is automatically expanded or contracted according to the fundamental frequency or other characteristics of the talker. The derivation of a signal representing the fundamental frequency of a talker is well known and forms no part of the present invention; see for example "Automatic Extraction of the Excitation Function of Speech with Particular Reference to the Use of Correlation Methods" by J. S. Gill, Proceedings of the Third International Congress on Acoustics, Stuttgart 1959, Vol. 1, page 217. The pitch analogue output of the system illustrated and described therein can be converted by means (not shown) to provide a controlling voltage waveform for the input 204 in the nonlinear timescale generator of FIG. 11, the amplitude of this voltage being related to the fundamental frequency or other characteristic of the talker.

It is to be understood that the foregoing description of specific examples of this invention is made by way of example only and is not to be considered as a limitation on its scope.



