Title:
HELIUM ENVIRONMENT VOCODER
United States Patent 3825685


Abstract:
There is disclosed herein a vocoder system having a multichannel speech analyzer, a multichannel speech synthesizer and an excitation system for extracting time-varying characteristics of the excitation function of the input speech signal and reproducing this function for the excitation of the synthesizer. The excitation system includes a pitch extractor, a voiced/unvoiced detector, a pitch generator, a noise generator, a voiced/unvoiced switch and an output pulse generator. The pitch extractor operates on the short term envelope of the speech, which is derived from a signal which is the sum of the signals of all analyzer channels. Constant width pulses are generated and integrated to form a varying d.c. signal proportional to the frequency of the extracted pitch. During voiced speech a multivibrator controlled by the d.c. signal produces the input to the output pulse generator. The output pulse generator produces an output pulse train of constant energy which is applied for excitation of the synthesizer channels. During unvoiced speech the noise generator drives the output pulse generator.



Inventors:
ROWORTH D
Application Number:
05/250534
Publication Date:
07/23/1974
Filing Date:
05/05/1972
Assignee:
INT STANDARD CORP,US
Primary Class:
International Classes:
G10L19/02; (IPC1-7): G10L1/00
Field of Search:
179/1SA,15.55R



Primary Examiner:
Claffy, Kathleen H.
Assistant Examiner:
Leaheey, Jon Bradford
Attorney, Agent or Firm:
O'Halloran, John T.; Lombardi Jr., Menotti J.; Hill, Alfred C.
Claims:
I claim

1. A helium environment vocoder having a multichannel speech analyzer, a multichannel speech synthesizer and an excitation system for extracting time varying characteristics of the excitation function of input helium speech and reproducing this function for the excitation of said synthesizer, said excitation system comprising:

2. A helium environment vocoder having a multichannel speech analyzer, a multichannel speech synthesizer and an excitation system for extracting time varying characteristics of the excitation function of input helium speech and reproducing this function for the excitation of said synthesizer, said excitation system comprising:

3. A vocoder according to claim 2, wherein

4. A helium environment vocoder having a multichannel speech analyzer, a multichannel speech synthesizer and an excitation system for extracting time varying characteristics of the excitation function of input helium speech and reproducing this function for the excitation of said synthesizer, said excitation system comprising:

5. A vocoder according to claim 4, wherein

6. A vocoder according to claim 4, wherein

7. A vocoder according to claim 6, wherein

8. A vocoder according to claim 4, wherein

9. A vocoder according to claim 8, wherein

10. A vocoder according to claim 8, wherein

11. A vocoder according to claim 10, wherein

12. A vocoder according to claim 10, further including

13. A vocoder according to claim 12, wherein

Description:
BACKGROUND OF THE INVENTION

This invention relates to a speech processor, such as a vocoder, and in particular to an excitation system therefor.

Such processors are especially useful in the processing of helium speech, which suffers severe distortion due to the speaker breathing an exotic gas mixture at abnormal pressures. This distortion is sometimes referred to as the "Donald Duck effect."

SUMMARY OF THE INVENTION

An object of the present invention is to provide a helium environment vocoder that will reduce the above-mentioned distortion.

A feature of the present invention is the provision of a helium environment vocoder having a multichannel speech analyzer, a multichannel speech synthesizer and an excitation system for extracting time varying characteristics of the excitation function of input helium speech and reproducing this function for the excitation of the synthesizer, the excitation system comprising: first means coupled to the analyzer for extracting continuously the fundamental frequency from the input speech; second means coupled to the first means for producing a sequence of pulses at a rate having a predetermined relationship to the fundamental frequency; a noise generator to produce noise; third means coupled to the analyzer for determining whether the input speech is voiced or unvoiced; fourth means coupled to the second means, the noise generator and the third means responsive to the output signal of the third means for selecting the sequence of pulses if said input speech is voiced and for selecting the noise from the noise generator if the input speech is unvoiced; and fifth means coupled to the fourth means and the synthesizer for applying the selected one of the sequence of pulses and the noise as an excitation input pulse stream to the synthesizer, the input pulse stream having a constant energy level.

Another feature of the present invention is the provision of a vocoder as defined above wherein each channel of the analyzer includes a bandpass filter coupled to the input speech signal, the passband of the bandpass filter being different for each channel, and a rectifier coupled to the output of the bandpass filter; and the first means includes sixth means coupled in parallel to the output of a given number of the rectifiers to produce a sum of the output signals passed by each of the given number of the rectifiers, a first differential amplifier having an output and two inputs, one of the two inputs of the first differential amplifier being coupled to the sixth means to receive the sum of the output signals, a seventh means coupled between the sixth means and the other of the two inputs of the first differential amplifier to smooth the sum of the output signals prior to being applied to the other of the two inputs of the first differential amplifier, eighth means coupled to the output of the first differential amplifier to square the output signal of the differential amplifier, a monostable circuit coupled to the eighth means driven by a squared output signal from the eighth means, and a low pass filter means having gain coupled to the output of the monostable circuit to operate on the output signal of the monostable circuit.

Still another feature of the present invention is the provision of a vocoder as defined above wherein the second means includes a voltage controlled multivibrator, and an input circuit for the multivibrator having a capacitor, and a pair of isolating diodes coupled to the capacitor, the pair of diodes having coupled thereto for applying to the capacitor through the pair of diodes a direct current voltage bearing a predetermined relationship to the fundamental frequency.

A further feature of the present invention is the provision of a vocoder as defined above and further including switching means coupled to the second means, the third means and the input circuit for the multivibrator, the switching means being responsive to the detection of unvoiced speech by the third means to disconnect the second means from the input circuit for the multivibrator.

Still a further feature of the present invention is the provision of a vocoder as defined above wherein the third means includes a comparator having two inputs and an output, one of the inputs being coupled to a first group of channels of the analyzer having a passband related to voiced speech and the other of the inputs being coupled to a second group of channels of the analyzer having a passband related to unvoiced speech, the second group of channels being different than the first group of channels, and a two level clamp circuit coupled to one of the two inputs to hold the output in one condition when the input exceeds a predetermined first threshold level and to hold the output in the other condition when that input falls below a second predetermined threshold level lower than the first threshold level.

BRIEF DESCRIPTION OF THE DRAWING

Above-mentioned and other features and objects of this invention will become more apparent by reference to the following description taken in conjunction with the accompanying drawing in which:

FIG. 1 is a block diagram of a helium environment vocoder arrangement in accordance with the principles of the present invention;

FIG. 2 illustrates the block diagram of the multichannel analyzer and synthesizer channels of FIG. 1;

FIG. 3 illustrates the schematic diagram of the first and second stages of the pitch extractor section of the excitation system of FIG. 1;

FIG. 4 illustrates the frequency response of the second stage of FIG. 3;

FIG. 5 illustrates the schematic diagram of the third and fourth stages of the pitch extraction section of the excitation system of FIG. 1;

FIG. 6 illustrates the schematic diagram of the pitch and noise generators of the excitation system of FIG. 1;

FIG. 7 illustrates the schematic diagram of the voiced/unvoiced detector for the excitation system of FIG. 1; and

FIG. 8 illustrates the schematic diagram of the output pulse generator for the excitation system of FIG. 1.

DESCRIPTION OF THE PREFERRED EMBODIMENT

In the general arrangement shown in FIG. 1 the input helium speech is received in a multichannel analyzer 10, from which the excitation system input is obtained via a summing network 11 as described below. Each analyzer channel also produces a channel signal which is the input for the corresponding channel in the synthesizer 12, the outputs of which are summed to give the processed speech output. The excitation system can be broadly divided into two sections, namely, pitch extraction from analyzer 10 and pitch generation for the synthesizer. The pitch extraction section operates upon the short-term envelope of the speech which is obtained from all the analyzer channels via summing network 11. The d.c. component and unwanted high frequency a.c. components are eliminated by d.c. eliminator 13 and squaring circuit 14, respectively, to give a square wave at the fundamental frequency. This square wave is then converted to a d.c. voltage whose amplitude is proportional to the instantaneous frequency averaged over a short period by pulse generator 15 and integrator 16, respectively. Meanwhile, a voiced/unvoiced decision is made by voiced/unvoiced detector 17, based on the relative high and low frequency energies in the multichannel inputs to the synthesizer, as derived by two groups of diode gates 18 and 19. Detector 17 controls the application of the d.c. voltage representing pitch to a voltage controlled multivibrator 20 via gate 21 and backlash circuit 22. At the same time detector 17 is responsible for selecting either the output of multivibrator 20, or the output of noise generator 23 by way of the changeover switch 24. The selected output is fed via a squaring circuit 25 to an output pulse generator 26 which provides the excitation signal for the synthesizer channels.

Turning now to the details of the arrangement of FIG. 1, the analyzer 10 consists of a number of similar channels, say 22 in all, each of which has a bandpass filter 27, FIG. 2, followed by a rectifier 28. Each bandpass filter 27 covers a different portion of the speech spectrum, and provides an output which, during voiced speech, is approximately equivalent to a pure tone at the filter center frequency amplitude modulated with a sawtooth waveform at the fundamental frequency. After rectification, with a short time constant tank capacitor (not shown), the result is a d.c. signal with a sawtooth type of waveform and a superimposed a.c. ripple component at the channel input filter center frequency. By summing this signal from all the channels in parallel in summing network 11, the sawtooth component, which is approximately the same for all the channels, is enhanced and the ripple components, being different for each channel, are diminished. The result is a d.c. signal whose amplitude is varying in accordance with the dynamics of the speech signal, and upon which is imposed a sawtooth waveform at the fundamental frequency.
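The enhancement obtained by summing can be illustrated numerically. The following is a minimal sketch only, assuming an idealized channel output consisting of a common unit sawtooth plus a sinusoidal ripple at the channel center frequency; the function names, the ripple amplitude and the center frequencies are hypothetical and are not taken from the drawing.

```python
import math

def sawtooth(t, f0):
    """Unit sawtooth at the fundamental f0 (idealized envelope shape)."""
    return 1.0 - (t * f0) % 1.0

def channel_envelope(t, f0, fc, ripple=0.5):
    """Assumed rectified channel output: common sawtooth plus ripple at fc."""
    return sawtooth(t, f0) + ripple * math.sin(2 * math.pi * fc * t)

def summed_envelope(t, f0, centres):
    """Model of summing network 11, normalized by the number of channels."""
    return sum(channel_envelope(t, f0, fc) for fc in centres) / len(centres)

def ripple_rms(signal_fn, f0, fs=48000, duration=0.1):
    """RMS deviation of a signal from the ideal sawtooth over the window."""
    n = int(fs * duration)
    err = [signal_fn(i / fs) - sawtooth(i / fs, f0) for i in range(n)]
    return math.sqrt(sum(e * e for e in err) / n)

if __name__ == "__main__":
    f0 = 100
    centres = [500 + 200 * k for k in range(10)]  # hypothetical values
    one = ripple_rms(lambda t: channel_envelope(t, f0, centres[0]), f0)
    many = ripple_rms(lambda t: summed_envelope(t, f0, centres), f0)
    print("single channel ripple:", one, "summed ripple:", many)
```

In this model the sawtooth is identical in every channel and so survives the sum unchanged, while the ripple terms, being at different frequencies, add incoherently and shrink relative to it.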

It is now necessary to eliminate the d.c. component and unwanted high frequency a.c. components to produce a square wave at the fundamental frequency. Care is required in this operation: rapid rises in speech energy can produce a sharply rising d.c. waveform and this transient must be eliminated without losing the information from the wanted low frequency a.c. component. This could normally be achieved by a high pass filter, but if the filter has a sufficiently rapid roll-off to provide good discrimination its impulse response will produce spurious signals which distort the required a.c. information.

The first stage of the pitch extraction section uses a differential amplifier 30, FIG. 3, to compare the input waveform from summing network 11 with a smoothed version of itself. The input waveform is applied to one input terminal of the amplifier and also to a tank circuit 31, from which the other input is derived. The rise time and d.c. components of the two signals are the same, but one signal has a smaller sawtooth component and a longer decay time than the other signal. Thus, the d.c. component is eliminated for rising and steady signals, but falling signals (such as at the end of a word) produce a temporary d.c. offset due to one input to the differential amplifier taking longer to decay than the other.
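The behavior of this first stage may be sketched in discrete time. The model below is an assumption made for illustration: the tank circuit 31 is represented as a follower which tracks rises at once and decays slowly, and the `decay` coefficient is a hypothetical value.

```python
def tank_smooth(samples, decay=0.01):
    """Assumed model of tank circuit 31: instant rise, slow decay toward the input."""
    out = []
    held = samples[0]
    for x in samples:
        if x > held:
            held = x                      # same rise time as the input
        else:
            held += decay * (x - held)    # longer decay time than the input
        out.append(held)
    return out

def first_stage(samples, decay=0.01):
    """Differential amplifier 30: input minus its smoothed version."""
    smooth = tank_smooth(samples, decay)
    return [x - s for x, s in zip(samples, smooth)]

if __name__ == "__main__":
    saw = [1.0 - (i % 100) / 100.0 for i in range(1000)]
    print(first_stage(saw)[:5])
    print(first_stage([x + 5.0 for x in saw])[:5])  # d.c. offset has no effect
```

In this sketch a steady d.c. level added to the input appears equally on both inputs and cancels, whereas a sudden fall leaves a temporary negative offset that dies away at the tank decay rate.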

This remaining d.c. component is eliminated, the derived a.c. component low pass filtered and the remainder squared by the second stage of FIG. 3. Again the signal is applied to both inputs of a differential amplifier 32 with RC integrating networks at each input. The two integrating networks 33 and 34 are arranged to have slightly different integrating times (about 20 percent difference) and this results in a frequency response as shown in FIG. 4: the gain falls away at about 6 dB/octave in both directions from a center frequency which is set by the network time constants. The d.c. components of the signal are completely eliminated without introducing any high pass networks: effectively the operation relies on the difference in phase shift between the two integrating networks 33 and 34. This stage is also arranged to square the output by utilizing the feedback network 35, which provides d.c. feedback but no a.c. feedback.
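The shape of this response can be checked with a short calculation. The sketch below assumes each integrating network is an ideal single-pole RC low pass, so that the stage responds to the difference 1/(1 + jwT1) - 1/(1 + jwT2); the time constants are hypothetical values chosen only to differ by 20 percent.

```python
import math

def gain(freq_hz, tau_a, tau_b):
    """Magnitude of the difference between two ideal RC integrator responses."""
    w = 2 * math.pi * freq_hz
    h = 1 / complex(1, w * tau_a) - 1 / complex(1, w * tau_b)
    return abs(h)

def centre_frequency(tau_a, tau_b):
    """Frequency of maximum gain, which occurs where w*w*tau_a*tau_b = 1."""
    return 1 / (2 * math.pi * math.sqrt(tau_a * tau_b))

def slope_db_per_octave(freq_hz, tau_a, tau_b):
    """Gain change in dB between freq_hz and twice freq_hz."""
    return 20 * math.log10(gain(2 * freq_hz, tau_a, tau_b) / gain(freq_hz, tau_a, tau_b))

if __name__ == "__main__":
    tau_a, tau_b = 1.0e-3, 1.2e-3  # hypothetical, about 20 percent apart
    fc = centre_frequency(tau_a, tau_b)
    print("centre frequency (Hz):", fc)
    print("slope well below centre:", slope_db_per_octave(fc / 128, tau_a, tau_b))
    print("slope well above centre:", slope_db_per_octave(fc * 64, tau_a, tau_b))
```

Under these assumptions the gain is exactly zero at d.c., rises at about 6 dB/octave below the center frequency and falls at about 6 dB/octave above it, although neither network is itself a high pass network.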

Next it is necessary to convert the square wave into a d.c. voltage whose amplitude is proportional to the instantaneous frequency averaged over a short period. The square wave is applied to a monostable circuit 40, FIG. 5, which produces a pulse of fixed width each time it is triggered by a negative going edge in the square wave input. These pulses are then passed through a two stage low pass filter. The first stage 41 produces gain, set by the potentiometer for unity frequency ratio. The second stage 42 has only unity gain and its output is a varying d.c. with minimal ripple at the pulse repetition frequency. The gain is set so that the succeeding voltage controlled multivibrator 20, FIG. 1, produces pulses at the correct rate. The second stage of the filter is arranged to have a d.c. shift sufficient to compensate exactly for the d.c. threshold exhibited in the control characteristic of multivibrator 20, due to the need to overcome the Vbe of the control transistors.
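The conversion from frequency to d.c. level rests on the fact that the mean value of a train of fixed width pulses is proportional to the pulse rate. A minimal simulation is sketched below; the sample rate, pulse width and filter time constant are hypothetical, the filter stages are modeled as simple one-pole sections with unity gain, and the gain and d.c. shift adjustments described above are omitted.

```python
def square_wave(freq_hz, fs, n):
    """+1/-1 square wave using integer arithmetic (freq_hz and fs are ints)."""
    return [1.0 if 2 * ((i * freq_hz) % fs) < fs else -1.0 for i in range(n)]

def monostable(square, fs, width_s):
    """Emit a fixed-width unit pulse on each negative-going edge of the input."""
    width = int(round(width_s * fs))
    out = [0.0] * len(square)
    remaining = 0
    for i in range(1, len(square)):
        if square[i - 1] > 0 and square[i] <= 0:
            remaining = width
        if remaining > 0:
            out[i] = 1.0
            remaining -= 1
    return out

def low_pass(samples, fs, tau_s, stages=2):
    """Cascade of one-pole low-pass stages, each with unity d.c. gain."""
    alpha = 1.0 / (1.0 + tau_s * fs)
    for _ in range(stages):
        y, out = 0.0, []
        for x in samples:
            y += alpha * (x - y)
            out.append(y)
        samples = out
    return samples

def pitch_to_dc(freq_hz, fs=8000, width_s=0.001, tau_s=0.02):
    """Settled d.c. level for a steady input frequency (mean of the last half second)."""
    n = fs  # one second of signal
    smoothed = low_pass(monostable(square_wave(freq_hz, fs, n), fs, width_s), fs, tau_s)
    tail = smoothed[n // 2:]
    return sum(tail) / len(tail)

if __name__ == "__main__":
    print(pitch_to_dc(100), pitch_to_dc(200))
```

Doubling the input frequency doubles the settled output in this model, which is the proportionality the voltage controlled multivibrator relies upon.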

The varying d.c. voltage from the transistor of stage 42 is used to control the pitch generator multivibrator 43, FIG. 6. This is a conventional cross-coupled multivibrator in which the charging current for the coupling capacitors is provided by current-source transistors 44 and 45 to whose bases is applied the control voltage.

There may be still some ripple remaining on the d.c. control voltage and to prevent this from frequency modulating the multivibrator output a backlash circuit 22, FIG. 1, is interposed in the control circuit. This backlash circuit takes the form of two germanium diodes 46 and 47 and a tank capacitor 48, as shown in FIG. 6. Changes in control voltage must be greater than the forward voltage of the diodes to be transferred to the control point of the multivibrator, so that significant changes in control voltage are transmitted while ripple components are rejected.
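The action of the backlash circuit resembles a dead band, which can be sketched as follows. This is an idealized illustration only: the diodes are treated as perfect thresholds, the tank capacitor as a perfect hold, and the 0.3 volt dead band is an assumed figure for a germanium diode rather than a value given in this description.

```python
def backlash(control, dead_band=0.3):
    """Output follows the input only when the input leads it by more than dead_band."""
    out = []
    held = control[0]                 # voltage held on the tank capacitor
    for v in control:
        if v - held > dead_band:      # one diode conducts, output is pulled up
            held = v - dead_band
        elif held - v > dead_band:    # other diode conducts, output is pulled down
            held = v + dead_band
        out.append(held)
    return out

if __name__ == "__main__":
    ripple = [2.0 + 0.1 * (-1) ** i for i in range(10)]
    print(backlash(ripple))                      # ripple is rejected
    print(backlash([2.0] * 3 + [3.0] * 3))       # a large change is transmitted
```

Ripple smaller than the dead band leaves the held voltage unchanged, whereas a larger change moves it to within one diode drop of the new level.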

Gate 21 of FIG. 1 is realized by an FET switch 50 placed in the control line. During unvoiced speech the control voltage effectively falls to zero and without the switch the multivibrator output would cease. Then, when voicing recommenced, the system would take a finite and not insignificant time to build up to the correct frequency. This is avoided by arranging for the switch 50 to be opened during the absence of voicing and then the effect of the bias network and tank capacitor 48 is to keep the multivibrator running. The multivibrator frequency will, however, drift slowly towards a pre-set median value. In this way the multivibrator is already operating in approximately the correct condition when voicing resumes. The switch 50 is controlled by detector 17, FIG. 1.
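The hold and drift behavior during unvoiced intervals may be pictured with the following sketch. It is a simplified assumption, not a circuit model: the switch is taken as ideal, the drift is taken as exponential, and the `median` and `drift` values are hypothetical.

```python
def gated_control(pitch_dc, voiced, median=1.5, drift=0.01):
    """Control voltage at the multivibrator with switch 50 opened when unvoiced."""
    out = []
    held = median
    for v, is_voiced in zip(pitch_dc, voiced):
        if is_voiced:
            held = v                          # switch closed: follow the pitch voltage
        else:
            held += drift * (median - held)   # switch open: slow drift toward the median
        out.append(held)
    return out

if __name__ == "__main__":
    pitch = [2.0] * 5 + [0.0] * 500
    flags = [True] * 5 + [False] * 500
    trace = gated_control(pitch, flags)
    print(trace[4], trace[5], trace[-1])
```

The held voltage does not collapse to zero when voicing stops, so in this model the multivibrator keeps running near a usable frequency until voicing resumes.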

For unvoiced speech a noise generator 23, FIG. 1, is required. Noise is generated by amplifying the input noise of an operational amplifier 51, FIG. 6, operating with a high source impedance. The feedback network 52 shown provides feedback only at d.c. and low frequencies so that the generator is operating essentially with open-loop gain for medium and high frequencies. The terms low, medium and high are here used in the context of the speech frequency range. The feedback configuration is designed to provide d.c. stability and a 6 dB/octave bass roll-off in the noise output. Frequency compensation provides a 6 dB/octave high frequency roll-off in open-loop gain, so that the net result is a broad band of noise. By adjustment of the value of the capacitor in the feedback loop the center frequency of the band can be set so that it is centered at the optimum value for unvoiced speech synthesis. Since the noise amplitude obtained varies from one amplifier to another the output signal is clipped to a standard amplitude by a pair of reversed diodes 53 and 54 in parallel. The signal from the pitch generator 43 is reduced to the same amplitude for application to the voiced/unvoiced switch by a suitable tap on the multivibrator.

The voiced/unvoiced gate or switch 24, FIG. 1, is essentially a single-pole change-over switch, to pass the output of either the pitch generator, multivibrator 43, FIG. 6, or the noise generator 51, FIG. 6. This is achieved electronically by the arrangement shown in FIG. 6, which is a two-channel analog switch using bipolar transistors 55 and 56 in the saturated mode. The control voltage swings between +10 volts and -10 volts, and the transistors are so biased that one switches from a conducting state to a non-conducting state as the other switches in the reverse direction. A dead space is avoided by the pair of diodes 57 and 58 in the bias chain. When one of the transistors is conducting it acts as a low impedance shunt and adequately attenuates the signal on that line.

The two outputs from the switch are fed to the squaring circuit 25, FIG. 1, which is realized by the differential amplifier 60, FIG. 6. This also rejects the spurious pulses which are introduced into both lines via the switching transistors when the switching signal changes polarity. This stage is arranged to square the signal by providing d.c. feedback only. Thus, at the output of this stage the signal consists of an infinitely clipped waveform which is either periodic (i.e., at the fundamental frequency during voicing), or random (during unvoiced speech).

The switching signal for switch 24, FIG. 1, is provided by detector 17, which is illustrated in detail in FIG. 7. Each of the analyzer channels, FIG. 2, has a low pass filter 29 following the rectifier 28. These low pass filtered channel signals reflect the energy levels in each of the frequency ranges covered by the channels. To make the voiced/unvoiced decision a comparison is made of the maximum signal amplitude in two groups of channels, one group covering the upper frequencies of the speech spectrum and the other group covering the lower frequencies. The circuit of FIG. 7 will, depending on certain rules built into it, make the decision as to whether the speech is voiced or unvoiced.

The two signals to be compared are derived from the two groups of channels by diode gate networks 18 and 19, FIG. 1. As shown in detail in FIG. 7 these two groups of channels consist of the six lowest frequency channels and the four highest frequency channels. The largest signal in each group is transmitted by the relevant diode in the two gate groups through resistors 70 and 71. The decision circuit is a comparator 72 so arranged that the output holds the switch in the unvoiced state when the upper group of channels has a larger signal than the lower. When the upper group has a smaller signal, the switch is in the voiced state. This decision relies upon the fact that during fricatives there is a fairly strong energy component at high frequencies but little energy at low frequencies, and vice versa during voiced sounds.

A two level clamp 73 and 74 is applied to the high frequency bus input to the comparator to ensure the correct input under all conditions for a helium environment where, in helium speech, the amplitude of the unvoiced components relative to the voiced components is relatively depressed as compared to normal speech due to the "Donald Duck effect." One clamp ensures that for h.f. signal levels above about 3 volts the output is always voiced, to cater to those vowel sounds which have a relatively high amount of energy at high frequencies. The other clamp ensures that for signal levels lower than about 1 volt the switch is always in the unvoiced condition. This ensures that a steady background noise is produced at the output during the absence of speech, rather than allowing erratic excitation depending upon the characteristics of the input noise. Positive feedback is applied to the comparator to provide a small amount of hysteresis in order to avoid spurious transitions when the two inputs are similar.
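The decision logic may be summarized in software form. The sketch below represents one possible reading of the detector, in which the clamp confines the high frequency bus to the range between the two thresholds before the comparison is made; that reading, the hysteresis width and all names are assumptions made for illustration.

```python
def make_detector(hf_floor=1.0, hf_ceiling=3.0, hysteresis=0.05):
    """Return a stateful voiced/unvoiced decision function (True means voiced)."""
    state = {"voiced": False}

    def decide(low_group, high_group):
        lf = max(low_group)                    # diode gate: largest low-band signal
        hf = max(high_group)                   # diode gate: largest high-band signal
        hf = min(max(hf, hf_floor), hf_ceiling)  # assumed action of the two level clamp
        # Positive feedback: the threshold shifts in favour of the present state.
        margin = -hysteresis if state["voiced"] else hysteresis
        state["voiced"] = lf > hf + margin
        return state["voiced"]

    return decide

if __name__ == "__main__":
    decide = make_detector()
    print(decide([0.1] * 6, [0.05] * 4))   # near silence
    print(decide([4.0] * 6, [5.0] * 4))    # vowel with strong high-band energy
    print(decide([1.2] * 6, [2.5] * 4))    # fricative
```

Under this reading, low level inputs give the unvoiced decision (and hence steady noise excitation), a strong low frequency signal gives the voiced decision even when the high frequency group is larger, and the hysteresis prevents chatter when the two inputs are nearly equal.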

The output pulse generator 26, FIG. 1, receives the rectangular waveform from the switch via a level detector 80, FIG. 8, which sharpens the edges and removes spurious ripples around zero by infinitely clipping the signal. The positive going edges of the signal are then used to trigger the monostable circuit 81, which has a power stage 82 and 83 added to provide the necessary output drive capability.

The charge and discharge paths for the monostable timing capacitor are separated by a diode 84 in such a way that the charge on the capacitor (and hence the pulse width) is dependent upon the elapsed time since the previous pulse. As the time decreases the pulse becomes shorter, and the effect is to tend to equalize the average power of the pulse train. The correction is not exact, and the power tends to increase somewhat as the pulse rate rises. This is desirable in the case of a helium speech processor. The pulse rate is highest during random excitation, so that the energy level is somewhat higher for unvoiced speech than for voiced speech and this helps to compensate for the fact that the energy in the fricatives is depressed during helium speech.
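The equalizing effect can be illustrated with a simple calculation. The sketch below assumes that the timing capacitor recharges exponentially between pulses, so that the pulse width is proportional to 1 - exp(-T/tau) for an interval T since the previous pulse; this law, the recharge time constant and the full pulse width are hypothetical, since the description gives only the qualitative behavior.

```python
import math

def pulse_width(period_s, full_width_s=0.0005, recharge_tau_s=0.005):
    """Assumed pulse width after the capacitor has recharged for period_s."""
    return full_width_s * (1.0 - math.exp(-period_s / recharge_tau_s))

def mean_power(rate_hz, compensated=True, full_width_s=0.0005):
    """Duty cycle of a unit-height pulse train, proportional to its average power."""
    period = 1.0 / rate_hz
    width = pulse_width(period, full_width_s) if compensated else full_width_s
    return width / period

if __name__ == "__main__":
    for rate in (100, 200, 300):
        print(rate, mean_power(rate, compensated=False), mean_power(rate))
```

With a fixed pulse width the average power would triple between 100 and 300 pulses per second. In this model it rises by a smaller factor, so the power is partly equalized yet still increases somewhat with pulse rate.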

While I have described above the principles of my invention in connection with specific apparatus it is to be clearly understood that this description is made only by way of example and not as a limitation to the scope of my invention as set forth in the objects thereof and in the accompanying claims.