Title:
Speech synthesis system
United States Patent 3892919


Abstract:
In a system in which a plurality of previously recorded waveforms corresponding to phonetic elements separately picked up from natural voice and having a pitch length, are connected to form any required speech, the degradation in the quality of the synthesized speech due to the discontinuity in the waveform of the synthesized speech is prevented by so controlling the period of reading out each phonetic element as to change the period stepwise at intervals of several phonetic elements (i.e., pitch lengths).



Inventors:
ICHIKAWA AKIRA
Application Number:
05/414746
Publication Date:
07/01/1975
Filing Date:
11/12/1973
Assignee:
HITACHI, LTD.
Primary Class:
Other Classes:
704/268, 704/E13.01
International Classes:
G10L13/06; (IPC1-7): G10L1/00
Field of Search:
179/1SM 340
Primary Examiner:
Claffy, Kathleen H.
Assistant Examiner:
Kemeny, Matt E. S.
Attorney, Agent or Firm:
Craig & Antonelli
Claims:
I claim

1. A speech synthesis system comprising:

2. A speech synthesis system according to claim 1, wherein said second means includes means for adjusting the intervals of the stepwise change between a quarter of a syllable and a full syllable.

3. A speech synthesis system comprising:

Description:
BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech synthesis system, and more particularly to a system in which a sound wave extracted from natural voice and having about one pitch length is used as a phonetic segment, or speech segment, and in which the previously stored phonetic segments are selectively connected at controlled periods in response to control signals corresponding to a required word or sentence to be synthesized.

2. Description of the Prior Art

In recent years, information service systems which connect data processing devices such as electronic computers with communication lines such as telephone lines have been developed. In such a system, a remote subscriber's question sent through a communication line is received by a central signal processing device which stores a large amount of information; the device prepares an answer to the question and sends it back to the subscriber in the form of sound resembling the human voice.

In this system, the most important part is the speech synthesis part, which produces the answer in the form of voice.

The requirements for the speech synthesis part are as follows: (1) the synthesized speech must be as near the human voice as possible; (2) the production cost must be low; and (3) the system incorporating the part must permit multiple uses, that is, the part must be able to generate a plurality of speech outputs at a time.

In a conventional speech synthesis system which is rather satisfactory from the standpoint of the above-mentioned requirements, a plurality of sound waveforms, each having a pitch length, are previously prepared so as to be used as speech sound waveforms, i.e. speech segments, and the speech segments are selectively connected in response to control signals corresponding to the words or sentences to be synthesized.

This conventional system is rather cheap since any desired speech can be synthesized by connecting speech segments each having a waveform of a pitch length so that the number of the stored speech segments is relatively small. The speech segments can be read out rapidly, that is, the access time is very short, so that the multiple synthesis of speech is possible.

Moreover, the read-out time of a speech segment, that is, the length of the waveform of the speech segment can be controlled so that the pitch of the synthesized speech can also be controlled.

Although the conventional system has several merits as mentioned above, it has also been revealed by the inventors' experiments that the speech synthesized by the conventional system suffers from hoarse noises and that the vocal quality thereof is very poor. The cause of such a drawback is as follows. Namely, in this speech synthesis system, connected speech is formed by connecting the waveforms of speech segments and therefore a discontinuity, i.e. rapid change in amplitude, is caused in the junction portion between any two adjacent waveforms of speech segments and such discontinuities appear every pitch period (equal to the fundamental period of speech and having an audible range of frequencies) to generate hoarse noises in synthesized speech.

SUMMARY OF THE INVENTION

One object of the present invention is to improve the quality of the synthesized speech produced by a speech synthesis system in which a plurality of speech sound waveforms, each having a pitch length, to be used as speech segments are recorded and these speech segments are selectively connected to form synthesized speech.

Another object of the present invention is to provide a speech synthesis system in which a plurality of speech sound waveforms, each having a pitch length, to be used as speech segments are recorded and these speech segments are selectively connected to form synthesized speech, and in which the pitch control of speech sounds is simplified so that the system can be economically fabricated without deterioration in the vocal quality of the synthesized speech.

According to the present invention, which has been made to attain the above objects, in a speech synthesizing system in which speech segments, each having a pitch length, are selectively connected to synthesize desired speech, the time of reading out each speech segment, that is, the wavelength of each speech segment of synthesized speech, is changed stepwise at intervals of several speech segments. Namely, the lengths of the waveforms of the speech segments read out are changed at intervals of a quarter of a syllable to a full syllable. Therefore, the system according to the present invention can produce synthesized speech softer to the ear than that produced by a conventional speech synthesis system in which the length of the waveform of every speech segment is controlled individually.

Other objects, features and advantages of the present invention will be made apparent when one reads the following part of the specification with the aid of the attached drawings.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is an oscillographic representation of a monosyllable speech sound waveform.

FIG. 2 shows the modes of variation in the pitch frequency of monosyllable speech sounds in various pronunciations.

FIG. 3 shows the variations in the pitch frequency of one word.

FIGS. 4A and 4B show waveforms illustrating the discontinuities resulting from the connection of separate speech segments.

FIG. 5 shows the variation in pitch frequency of the synthesized speech formed by the speech synthesizing system according to the present invention.

FIG. 6 is a block diagram of a speech synthesis system embodying the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

In FIG. 1, the waveform of a monosyllable speech sound is shown in a rectangular coordinate system in which the abscissa represents the time base and the ordinate gives the amplitude of the waveform. As seen from FIG. 1, the waveform of the monosyllable speech sound consists of an irregular portion C like that of a consonant and a periodical portion V like that of a vowel. In particular, every syllable of Japanese speech is composed either of a single consonant followed by a single vowel or of a single vowel, and about one hundred different syllables can make up all the speech sounds covering the entire vocabulary of the Japanese language. Of the portions of the waveform shown in FIG. 1, the more important is the periodical portion V, which occupies most of the monosyllable speech sound waveform and determines the pitch, intonation and tone (indicating the kind of syllable) of the speech sound.

Namely, the pitch or intonation of the speech sound depends mainly on the repetition periods T1, T2, . . . , Tn, i.e. the pitch period, while the tone is determined by the frequency characteristic of the periodical portion V. The pitch period is usually 10 to 20 milliseconds.

FIG. 2 shows the variation in the pitch frequency (defined as the reciprocal of the pitch period) with time of the monosyllable speech sound shown in FIG. 1. In FIG. 2, the abscissa and the ordinate respectively represent the time base and the pitch frequency. When a monosyllable speech sound is individually pronounced, it has a characteristic curve 1 convex up as shown in FIG. 2. However, when the same speech sound is pronounced in a word or sentence, it may assume characteristic curves 2, 3 and 4 corresponding respectively to level, rising and falling intonation, depending upon the position it assumes in the word or sentence or upon the kind of word or sentence.

Accordingly, in the case where the connected speech sounds corresponding to a desired word or sentence are formed by connecting together the prerecorded speech segments, i.e. speech sound waveforms obtained by dividing the waveform of the monosyllable speech sound as shown in FIG. 1, pronounced in a manner corresponding to the curve 1 in FIG. 2, into units each having a pitch length, discontinuities are formed at the junction points between the unit waveforms, i.e. speech segment waveforms, the discontinuities being the portions where the amplitude of the waveform changes rapidly.

Such discontinuities will be described in further detail. FIG. 3 shows the variation in pitch frequency with time of the speech sound corresponding to a word, in which the abscissa and the ordinate respectively represent the time base and the pitch frequency. In FIG. 3, curve 5 indicates the mode of the variation in pitch frequency of natural speech sound corresponding to a word to be synthesized, while curve 6 shows the mode of the variation in the pitch frequency of the monosyllable speech sound corresponding to the curve 1 in FIG. 2. The abscissa is divided into pronunciation intervals t1, t2, . . . , t5 of the monosyllable sounds. Accordingly, in order that the speech sound having a pitch frequency characteristic corresponding to the curve 5 may be composed of speech segments obtained from the natural voice having a pitch frequency characteristic corresponding to the curve 6, the length of the waveform of each speech segment, i.e. the pitch period, has to be controlled. Therefore, if the waveforms of the speech segments having pitch periods T1, T2, . . . , T4 as in FIG. 1 are connected and synthesized into connected speech having pitch periods longer or shorter than those periods T1, T2, T3 and T4, then the discontinuities 7 are formed in the junction portions of the respective speech segment waveforms as shown in FIGS. 4A and 4B. FIG. 4A corresponds to the case where the synthesized speech has a pitch frequency higher than that of the original natural voice from which the speech segment waveforms are obtained and has a pitch period shorter than that of the natural voice. FIG. 4B, on the other hand, corresponds to the case where the synthesized speech has a pitch frequency lower than that of the original natural voice and a pitch period longer than that of the original natural voice. The discontinuities 7 thus produced deteriorate the vocal quality of the synthesized speech and also generate hoarse noises.
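
The discontinuity of FIG. 4A can be illustrated numerically. The following is a minimal sketch only, assuming a toy stored segment consisting of one sine cycle (not real speech) and assuming that the same segment is read out repeatedly; the function names stored_segment and junction_jump are hypothetical. It compares the amplitude jump at the junction when the full segment is read out with the jump when the read-out is cut short to raise the pitch.

```python
import math

def stored_segment(n_samples):
    """Toy stand-in for a stored pitch-length segment: one sine cycle."""
    return [math.sin(2 * math.pi * i / n_samples) for i in range(n_samples)]

def junction_jump(segment, read_len):
    """Amplitude step between the last sample read out and the first
    sample of the next read-out, when only read_len samples are used."""
    return abs(segment[0] - segment[read_len - 1])

segment = stored_segment(80)              # 80 samples, i.e. 10 ms at 8 kHz
full_jump = junction_jump(segment, 80)    # original pitch period kept
short_jump = junction_jump(segment, 60)   # pitch period shortened (FIG. 4A case)
print(round(full_jump, 3), round(short_jump, 3))
```

In this toy case the junction is nearly smooth when the whole segment is read out, and a large amplitude step appears once the read-out is truncated; real speech segments would behave differently in detail.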

In order to eliminate the influence of the discontinuities, a special treatment of waveforms must be introduced. According to the present invention, the degradation of the vocal quality due to the discontinuities can be prevented since the way of the pitch control in the speech control system is improved, and moreover a system can be realized in which the pitch control is further simplified by making the best use of the merits of the speech synthesizing system in which speech segments are connected to form synthesized speech.

Namely, as shown in FIG. 5, the pitch frequency or the pitch period of the synthesized speech is changed stepwise at intervals of a quarter of a syllable to a full syllable. It has been empirically verified that synthesized speech having a pitch frequency characteristic corresponding to the stair curve 8, indicated by the dotted line in FIG. 5, has a vocal quality superior to that of synthesized speech having the pitch frequency characteristic indicated by the solid curve 5 in FIG. 5. In this case, it is unnecessary to perform the pitch control for every speech segment, and since the pitch periods of the successive speech segments within one step are all the same, the pitch control system of the speech synthesis system is simplified.
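
The stepwise change can be sketched as follows. This is an illustration under stated assumptions, not the method of the embodiment: the text does not say how the level of each step is chosen, so the sketch simply holds the mean of the smooth contour over each of a given number of equal intervals within one syllable. The function name stair_contour and the sample values are hypothetical.

```python
def stair_contour(contour_hz, steps):
    """Replace a smooth pitch contour for one syllable by `steps` flat levels.

    Assumption: each level is the mean of the contour over its interval.
    """
    n = len(contour_hz)
    if steps < 1 or steps > n:
        raise ValueError("steps must be between 1 and the contour length")
    stairs = []
    for k in range(steps):
        start = k * n // steps
        end = (k + 1) * n // steps
        chunk = contour_hz[start:end]
        level = sum(chunk) / len(chunk)
        stairs.extend([level] * len(chunk))   # pitch held constant within the step
    return stairs

# Hypothetical rising contour (Hz) over one syllable, quantized into three steps.
smooth = [100, 102, 104, 106, 108, 110, 112, 114, 116]
stairs = stair_contour(smooth, 3)
print(stairs)
```

With steps set to 1 the pitch is constant over the whole syllable, and with steps set to 4 it changes every quarter of a syllable, which spans the range of intervals stated above.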

In the following, the present invention will be described by way of a preferred embodiment.

FIG. 6 is a block diagram of a concrete structure of a speech synthesis system embodying the present invention.

First, a speech segment memory 32 is described for convenience' sake. In the memory 32, the speech sound waveforms of all the syllables necessary for the speech synthesis are stored in a high speed memory device such as a core memory. Each syllable in the memory consists of time-sequentially arranged speech segments constituting a waveform as shown in FIG. 1 and the waveform of each speech segment has an address allotted to indicate its location in the memory. In a monosyllable, serial numbers are allotted to the addresses of the speech segment waveforms arranged in time-sequence. Therefore, the first address is used as a syllable address to represent the syllable.

Each speech segment waveform is obtained by sampling the speech sound waveform shown in FIG. 1 at 8 kHz, and each of the sampled values is coded into an 8-bit signal. The period over which one speech segment, i.e. the wave portion within T1, T2, T3 or T4 in FIG. 1, is recorded is 10 to 20 msec. Namely, the period is set equal to the maximum one of the pitch periods of the speech sounds to be synthesized.
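
The figures above imply simple arithmetic, worked here as a short sketch. The helper names samples_in and pitch_frequency_hz are hypothetical; only the 8 kHz rate and the 10 to 20 msec range come from the text.

```python
SAMPLE_RATE_HZ = 8000  # sampling rate stated in the text

def samples_in(period_ms):
    """Number of samples recorded over a period given in milliseconds."""
    return round(SAMPLE_RATE_HZ * period_ms / 1000)

def pitch_frequency_hz(samples_read):
    """Pitch frequency obtained when samples_read samples are read out
    per pitch period (reciprocal of the pitch period)."""
    return SAMPLE_RATE_HZ / samples_read

for period_ms in (10, 20):
    n = samples_in(period_ms)
    print(period_ms, "msec:", n, "samples,", pitch_frequency_hz(n), "Hz")
```

Since each sample is coded in 8 bits, a segment recorded over 10 to 20 msec occupies 80 to 160 bytes, and reading out fewer samples per segment raises the pitch frequency accordingly.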

A series of code signals, each representing one syllable, to constitute speech to be synthesized are received at a terminal 9 and fed through an input-output control circuit 10 to a data processing circuit 11. For example, code signals corresponding to the syllables YO, KO, HA and MA constituting the name of a famous port city of Japan, are applied to the circuit 11. The device to generate such code signals is not within the scope of this invention and not shown in the figure, but the device is equivalent to the conventional automatic response system, being designed to form data for answers to preset questions and to connect the code signals according to the arrangement of words corresponding to those answers.

The data processing circuit 11 interprets the code signals according to the predetermined program and generates signals instructing and controlling the operations of the respective parts of the speech synthesizing apparatus described later.

The operation of the circuit 11 will be described in further detail. Judging from the series of code signals, the circuit 11 generates speech segment information, pitch information and syllable time information according to a reference table.

The speech segment information is, for example, the address of the first speech segment of a syllable stored in the speech segment memory 32 described above; the pitch information is the information indicated by the dotted curve 8 in FIG. 5, that is, the number indicating how many samples, counted from the first one, of each speech segment stored in the memory 32 are to be read out; and the syllable time information is the time information representing t1 to t5 in FIG. 5, that is, the number of samples to be read out within the time of one syllable.
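
One possible way to picture the three kinds of information is as a small record per syllable. The sketch below is illustrative only: the record layout, the name SyllableControl, the reference table addresses and the numeric values are all hypothetical and are not taken from the embodiment. The field step_samples follows the later description of the syllable time data as the number of sampling points during which the pitch frequency stays the same.

```python
from dataclasses import dataclass

@dataclass
class SyllableControl:
    segment_address: int   # address of the syllable's first speech segment
    pitch_samples: list    # samples to read per segment, one entry per step
    step_samples: int      # samples during which the pitch stays constant

# Hypothetical reference table: syllable code -> first segment address.
REFERENCE_TABLE = {"YO": 0, "KO": 16, "HA": 32, "MA": 48}

def build_controls(codes, pitch_plan, step_samples):
    """Turn a series of syllable codes into per-syllable control records."""
    return [
        SyllableControl(REFERENCE_TABLE[code], list(pitches), step_samples)
        for code, pitches in zip(codes, pitch_plan)
    ]

# Illustrative values only: three pitch steps per syllable.
controls = build_controls(["YO", "KO"], [[80, 76, 72], [72, 70, 68]], 240)
print(controls)
```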

The data processing circuit to perform such processing as described above may be designed especially for the present invention, but a general purpose computer can be used as such a circuit, so that the details thereof are omitted.

The three kinds of information are respectively stored as time-sequential signals in a syllable address buffer memory 14, a pitch time buffer memory 15 and a syllable time buffer memory 16 of a speech synthesizing apparatus 13. The speech synthesizing apparatus 13 consists of a part to select speech segments necessary to synthesize connected speech according to the speech segment information, a part to determine the pitch periods of the speech segments according to the pitch information and a part to determine the times allotted to syllables according to the syllable time information.

Next, the operations of the respective components of the speech synthesizing apparatus 13 will be described.

The address data of the syllable address buffer memory 14 are transferred one by one to a segment address memory 17 in response to an external signal, and simultaneously the data in the syllable address buffer memory 14 are shifted forward to cause the address of the next syllable to come to the head position. Namely, the memory 14 and the memory 17 may be considered to form a shift register. Likewise, the combination of the pitch time buffer memory 15 and a pitch time memory 18, or of the syllable time buffer memory 16 and a syllable time memory 19, may be considered to form a shift register.

With the circuit arrangement as described above, the address signal of the first speech segment of a syllable stored in the segment address memory 17 is applied to a read out circuit 29 so that a series of sampled values constituting the segment are sequentially read out in synchronism with clock pulses from a clock signal generator 20. The number of the read-out samples is detected by counting the clock pulses by a pitch counter 22. When the content of the pitch counter 22 coincides with the pitch time data set in the pitch time memory 18, a coincidence circuit 25 detects the instant of coincidence to deliver a coincidence pulse. The coincidence pulse serves not only to reset the pitch counter 22 but also to shift a segment address counter 21 step by step. The output of the shifted segment address counter 21 is applied to the segment address memory 17 to read out the next speech segment from the speech segment memory 32, in the same manner as described above. Thereafter, the same operation of reading out the sampled values is repeated. The coincidence pulse also resets the counter 23 at the same time.

On the other hand, the time counter 23 also counts the clock pulses, and when the content of the time counter 23 coincides with the syllable time data (that is, the number of sampling points occurring within a time during which the pitch frequency in one syllable remains the same, as described above) set in the syllable time memory 19, a coincidence circuit 26 detects the instant of coincidence to deliver a coincidence pulse at that instant.

The coincidence pulse serves not only to transfer or shift the foremost pitch time data of the pitch time buffer memory 15 to the pitch time memory 18, but also to shift a syllable counter 24 step by step. When the content of the syllable counter 24 coincides with the step number recorded in a step number memory 23, a coincidence circuit 27 detects the instant of coincidence to deliver a coincidence pulse. The coincidence pulse resets the counter 24 and is also applied to the syllable address buffer memory 14 and the syllable time buffer memory 16 so that the control information for the syllable to be synthesized next, i.e. the segment address and the time data for the syllable, is transferred respectively to the memory 17 and the memory 19. The step number stored in the step number memory refers to the number of steps occurring within the time of one syllable when the pitch frequency is changed stepwise as shown in FIG. 5. In the case of FIG. 5, the number of steps is three. As has been revealed by the inventors' experiments, it is when the number of steps is three that the deterioration of the vocal quality of the synthesized speech due to the waveform discontinuities is reduced to the minimum. However, the number of steps need not necessarily be limited to 3 but may be from 1 to 4, that is, the pitch frequency of the synthesized speech sounds may be varied at intervals of from a quarter of a syllable to a full syllable.
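
The read-out sequence described above can be approximated in software. The following is a simplified sketch, not the circuit of FIG. 6: nested loops stand in for the counters, the time count is assumed to restart at each step boundary, the pitch counter is assumed to restart with each new syllable, and a "greater than or equal" test replaces exact coincidence. The memory contents and control values are toy data chosen so that each sample value shows which address and position it came from.

```python
def synthesize(memory, syllables, step_number=3):
    """memory: list of segments (lists of samples) indexed by segment address.
    syllables: list of (first_address, pitch_times, step_samples) tuples."""
    out = []
    for first_address, pitch_times, step_samples in syllables:
        address = first_address            # segment address memory 17 / counter 21
        pitch_counter = 0                  # pitch counter 22 (assumed to restart here)
        for step in range(step_number):    # syllable counter 24
            pitch_time = pitch_times[step] # pitch time memory 18
            for _ in range(step_samples):  # time counter 23, one pass per step
                out.append(memory[address][pitch_counter])
                pitch_counter += 1
                if pitch_counter >= pitch_time:  # coincidence circuit 25
                    pitch_counter = 0
                    address += 1           # advance to the next stored segment
    return out

# Toy memory: 16 segments of 4 samples; value = 10 * address + position.
memory = [[10 * a + i for i in range(4)] for a in range(16)]
# One syllable starting at address 0: pitch times 4, 3, 2 over steps of 12 samples.
out = synthesize(memory, [(0, [4, 3, 2], 12)])
print(len(out), out[:4], out[12:15], out[24:26])
```

In this toy run the first step reads three full segments, the second reads four segments of three samples each, and the third reads six segments of two samples each, so the pitch period changes only at the step boundaries.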

The output signal obtained from the read out circuit 29 as a result of the operations described above is equivalent to a signal obtained by subjecting the signal waveform shown in FIG. 4A or 4B to pulse code modulation, since the speech synthesizing apparatus 13 consists of digital circuits. The signal is then converted to an analog signal through a digital-to-analog converter 30, and the analog signal is finally converted to a speech sound signal or audible voice through an electro-acoustic transducer 31. In this case, the digital-to-analog converter 30 and the electro-acoustic transducer 31 are connected by a transmission line, such as a telephone line, which electrically connects a remote subscriber with the central information service system.

The speech synthesis system shown in FIG. 6 has been described as applied to the case where the speech sounds for only one channel are synthesized. It is, however, a matter of course that, since the whole system is composed of digital signal processing circuits and the speech segments are stored in a memory capable of high speed access, such as a core memory, the system can easily be constructed in a multichannel arrangement as known in the art.

Namely, such an arrangement for multichannel purpose can be realized if the input-output control circuit 10, the data processing circuit 11 and the speech segment memory 32 are used commonly and if the number of the speech synthesizing apparatuses 13 is increased according to the number of channels required.

Moreover, the speech segments stored in the speech segment memory may be obtained by directly extracting the components from the natural human voice or by artificially treating the waveforms of the human speech sounds.