US3575555A - Speech synthesizer providing smooth transistion between adjacent phonemes - Google Patents

Speech synthesizer providing smooth transistion between adjacent phonemes

Info

Publication number
US3575555A
Authority
US
United States
Prior art keywords
phoneme
storage means
phonemes
address
digitally
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US708323A
Inventor
Joseph F Schanne
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
RCA Corp
Original Assignee
RCA Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by RCA Corp
Application granted
Publication of US3575555A
Anticipated expiration
Expired - Lifetime

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • a drum is used to store all the phonemes required and delays between phonemes are prevented by using core memories as temporary storage, transferring from the drum to one of the core memories while concurrently extracting the preceding phoneme from another core memory to be converted to sound.
  • FIGS. 1 through 6 are the same in both applications.
  • Speech is a series of complex sounds generated and controlled by the larynx, tongue, oral and nasal cavities, and force of the breath.
  • the abilities of persons to speak and to understand one another are acquired characteristics that tend to mask the implicit complications involved.
  • the synthesis of speech by means other than human must take into account all the factors, however insignificant, that comprise understandable spoken words.
  • the recording of speech, as well as music, is usually done in an analog fashion. That is, the continuous changes in amplitude and frequency are maintained upon the storage medium.
  • the reproduction of speech can be effected by reconverting the recorded signals into audible sound.
  • In the synthesis of speech, more than mere reproduction is desired.
  • the objective of synthesized speech is the conversion of abstract facts or stored concepts into understandable speech to communicate the said facts or concepts to persons who want to know what they are.
  • a phoneme is a group of like or related sounds, varying under different phonetic conditions. Forty phonemes are involved in speaking English and they can be categorized into seven groups.
  • the first three groups comprise the vowel sounds.
  • the first group consists of 10 simple vowels; the second, the six complex vowels; and the third, the four semivowels and liquids.
  • the fourth group is the six plosives, or explosive sounds.
  • the fifth consists of the three nasal consonants.
  • the sixth group is comprised of nine fricatives or spirants, characterized by frictional rustling of the breath against some part of the oral passage as it is emitted.
  • the seventh group consists of two affricatives. These are a stop or explosive sound followed by a slow separation of the articulating organs, so that the last part is a fricative, or spirant, with corresponding organic position.
  • Table I (below) lists the phonemes by group as described above. Each of the phonemes is illustrated in a simple comprehensive word indicating by the usual pronunciation the sound of the phoneme, which is underlined for identification.
  • the constituent frequencies of a phoneme can be considered as the dominant frequencies called formants. It is well known that any complex periodic waveform can be synthesized by a combination of sine waves of proper frequencies, amplitudes, and phase relations. The characteristic sound of a phoneme can be reproduced recognizably by the combination of no more than three formants, each of which may or may not vary with respect to time.
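
The idea that a phoneme can be approximated by no more than three formants can be sketched as a sum of sine waves. This is only an illustration: the function name `synthesize_formants`, the fixed (non-time-varying) frequencies, and the specific frequency and amplitude values are assumptions, not values taken from the patent.

```python
import math

def synthesize_formants(formants, duration_s, sample_rate_hz):
    """Sum one sine wave per formant and return a list of float samples.

    formants: list of (frequency_hz, amplitude) pairs. Fixed frequencies are
    an assumption; the text notes that formants may also vary with time.
    """
    n = round(duration_s * sample_rate_hz)
    samples = []
    for i in range(n):
        t = i / sample_rate_hz
        samples.append(sum(a * math.sin(2 * math.pi * f * t) for f, a in formants))
    return samples

# Hypothetical formant values chosen only for illustration.
vowel = synthesize_formants([(500.0, 1.0), (1500.0, 0.5), (2500.0, 0.25)], 0.01, 14000)
```

Phase relations are ignored here (all components start at zero phase), which is a simplification of the text's "proper frequencies, amplitudes, and phase relations".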
  • transitions between phonemes are provided by special circuits that provide the required continuities across the junction or store several forms of each phoneme so that the proper one could be selected to provide the continuity at the junction.
  • An object of this invention is to provide artificial, or synthesized, speech of improved quality and requiring as little storage of sound as possible for an unrestricted vocabulary.
  • Another object of this invention is to produce synthetic speech in response to control signals that determine the information to be transmitted.
  • a further object of this invention is the transmission of speech by means of pulses to reduce the bandwidth requirements.
  • Another object of this invention is to provide means for converting the output data of electronic computers or other control devices into understandable speech.
  • a speech synthesizer embodying the invention includes a first storage for storing phoneme signals required to produce speech, transferring and selecting means to extract a predetermined sequence of phonemes from the first storage to one of more than one second storages, means for extracting the phonemes from the second storages in order, and for converting them to audible sound.
  • the second storages are provided with means whereby the first and last locations of the phoneme signals extracted can be varied in response to control signals. Furthermore, provision is made to load the first storage with phoneme signals.
  • FIG. 1 represents the approximate variations of the formants in the spoken word WED
  • FIG. 2 represents the approximate variations of the formants in the spoken word WADE
  • FIG. 3 illustrates how a junction to provide continuity of formants between phonemes is determined
  • FIG. 4 shows the resultant formants of two phonemes from FIG. 3 joined as illustrated
  • FIG. 5 represents two periods of a typical periodic complex waveform involved in speech
  • FIG. 6 shows the pulses resulting when the waveform of FIG. 5 is sampled at periodic intervals
  • FIG. 7 is a block diagram of an embodiment of the present invention for loading a drum with phoneme signals.
  • FIG. 8 is a block diagram of an embodiment of the present invention.
  • FIG. 1 shows the formants 103, 107, and 109 for the spoken word WED as they might appear on a spectrogram with solid lines depicting the midpoint of the bands of frequencies present.
  • the lowest frequency formants 107 and 109 between the origin and the ordinate 101 constitute the /W/ phoneme which, in the word WED, is shown to be made up of two frequencies, both of which increase with time during the time frame 115.
  • the vowel sound of the /EH/ phoneme is composed of three formants 103, 107, and 109 between the ordinates represented by the dashed lines 101 and 105.
  • the final consonant /D/ occurs after a short pause at the end of the vowel sound.
  • the formants of each phoneme must be continuous with those of the following and preceding phonemes across their junctions.
  • the ordinate 101 in FIG. 1 represents one such junction between the /W/ and /EH/ phonemes; the formants 107 and 109 blend smoothly together and are continuous across the junction 101.
  • FIG. 2 is a similar representation of the spectrogram of the spoken word "WADE."
  • the /W/ phoneme consists of the lower two formants 207 and 209 in the time frame 215 which is delineated by the origin and the ordinate represented by the dashed line 201.
  • the /AY/ phoneme consists of the three formants 203, 207 and 209 between the dashed lines 201 and 205.
  • the formants of the /W/ and /AY/ phonemes blend smoothly at the junction therebetween represented by the ordinate 201.
  • the /W/ phoneme in the word "WED" occupies a time frame 115 that is longer in duration than that 215 of the /W/ phoneme in the word "WADE." Also, in FIG. 1 the lowest two formants 107 and 109 of the phoneme /EH/ are lower respectively than the lowest two formants 207 and 209 of the phoneme /AY/ in FIG. 2.
  • The /W/ phoneme formants in the word "WADE" in FIG. 2 are similar to the /W/ phoneme formants in the word "WED" in FIG. 1 over the same period of time. The outstanding difference between the two /W/ phonemes is that the one in FIG. 2 is truncated at an earlier point in time than the /W/ phoneme in FIG. 1.
  • FIG. 3 shows two phonemes not joined, but rather separated by some time interval. The sound depicted in FIG. 3 would be two complete phonemes spoken separately and distinctively.
  • If the first phoneme in FIG. 3 is truncated at the point in time 321 depicted by the described intersection and this point in time is made to coincide with the beginning point in time 312 of the second phoneme, a junction is formed across which the formants are continuous. This is shown in FIG. 4 wherein the formants 407 and 409 are continuous across the junction 421, the combined phonemes starting at the same point in time 401 as that 301 of the first phoneme in FIG. 3 and the end of the combined phonemes occurring at a point in time 405 earlier than that 305 of the second phoneme in FIG. 3.
  • Another method of implementing this technique is to store individual phonemes in a manner that permits selected phonemes to be retrieved, truncated, and reproduced in a sequence previously determined, as, for example, by a control device such as a computer, to synthesize a desired speech pattern.
  • the phonemes are stored digitally by taking periodic samples of the amplitude of the wave shape of each phoneme and converting the magnitudes into binary numbers. The binary numbers obtained are stored in sequence for each phoneme.
  • FIG. 5 shows two periods of a typical waveform.
  • the line 501 representing the amplitude of the wave as a function of time traces a complex path from the origin to the end of the first period 503.
  • the line 501 then traces a similar path to the end of the second period 505. If the amplitude of such a wave is measured at periodically occurring points in time that occur many times during the period of the wave being sampled, a series of numbers will result that will permit a close approximation of the original wave to be produced by generating individual amplitudes, as determined by the series, at intervals of time that are the same as those at which the measurements were taken. The more samples that are taken during a wave period, the more accurate the reproduction will be.
  • FIG. 6 is an example of a sample that could be taken of the waveform depicted in FIG. 5.
  • Each of the amplitude plots 601 represents the instantaneous value of the continuously varying amplitude of the line 501 of FIG. 5 at a corresponding point in time.
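
The periodic sampling described above can be sketched as follows. The waveform `example_wave` is a hypothetical stand-in for the wave of FIG. 5, and the sample counts are arbitrary; the point is only that more samples per period give a closer approximation.

```python
import math

def sample_waveform(wave, period_s, samples_per_period, periods=2):
    """Measure wave(t) at evenly spaced instants and return the amplitudes."""
    step = period_s / samples_per_period
    return [wave(k * step) for k in range(samples_per_period * periods)]

def example_wave(t, period_s=0.01):
    # Hypothetical complex periodic wave: a fundamental plus one harmonic.
    w = 2 * math.pi / period_s
    return math.sin(w * t) + 0.5 * math.sin(3 * w * t)

coarse = sample_waveform(example_wave, 0.01, 8)    # 8 samples per period
fine = sample_waveform(example_wave, 0.01, 64)     # 64 samples per period
```

Replaying either list at the same interval used for sampling reproduces the wave, with `fine` following the original shape more closely than `coarse`.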
  • FIGS. 7 and 8. The loading mode will be described first.
  • the audio signal 701 provides one of the inputs to an analog-to-digital (A/D) converter 703.
  • the other input to the A/D converter 703 is derived from the timing signal source 705 so that the output of the A/D converter represents the instantaneous amplitude of the audio signal at the time of this input.
  • the amplitudes of the pulses can be divided into 128 divisions. Each magnitude can then be represented by a binary number of seven bits from the minimum value (0000000) to the maximum value (1111111).
  • the AC zero level is at 64 (1000000).
  • the reference level is actually taken as approximately 5 percent off center. The direction depends on the number of inversions through the amplifier. The reason for this offset is that the amplitude of the sound waves caused by the expulsion of breath is greater than that caused by the actions of the muscles in the larynx.
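
A minimal sketch of the seven-bit coding described above, assuming a linear mapping of amplitudes in the range -1.0 to 1.0 onto codes centered on 64. The function names, the linear scaling, and the clamping are assumptions; the roughly 5 percent offset of the reference level is left out.

```python
ZERO_LEVEL = 64   # AC zero level from the text, binary 1000000
MAX_CODE = 127    # largest seven-bit value, binary 1111111

def quantize(amplitude):
    """Map an amplitude in [-1.0, 1.0] to a seven-bit code centered on 64.

    Linear scaling and clamping are assumptions made for this sketch.
    """
    code = ZERO_LEVEL + round(amplitude * 64)
    return max(0, min(MAX_CODE, code))

def to_bits(code):
    """Format a code as the seven-digit binary string used in the text."""
    return format(code, "07b")
```

For example, `to_bits(quantize(0.0))` gives the zero level `1000000`, and full positive amplitude is clamped to `1111111`.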
  • On receipt of record command pulse 707, a flip-flop 709 is triggered to enable the signals from the timing source 705 (1) to advance a triggerable address counter 711 through an enabling gate 713, and (2) to permit write command signals to a core memory 17 through an enabling gate 715.
  • Each pulse from the timing source 705 will advance the address counter 711 one location address, permit the output of the A/D converter 703 to be sent to core memory 17, and produce a write command pulse via gate 715 that causes the output of the A/D converter 703 to be transferred into the core memory 17 at a location specified by the address counter 711.
  • Successive digitized signals that compose the phoneme being stored are thereby stored in the core memory 17 starting at the lowest address of the memory 17.
  • the flip-flop 709 is triggered again.
  • the gate 713 is thereby disabled, inhibiting further advancement of the address counter 711.
  • the gate 715 is also disabled, inhibiting write commands to the core memory 17.
  • the state of the flip-flop 709 after the second triggering described enables the triggering circuit of a second flip-flop 719.
  • the second flip-flop 719 is triggered after being enabled by the first index pulse received from a storage drum 21.
  • the triggering of the second flip-flop 719 enables a gate 723, the other input of which is provided by a sector timing signal from the drum 21.
  • the sector timing signal from the drum 21 occurs once for each of the digitized signals to be stored thereon.
  • Each of the digitized signals stored in the core memory 17 consists of seven binary digits in the embodiment being described.
  • the seven binary bits of each signal are transferred into and out of the core memory 17 in parallel, i.e., simultaneously. Storage on the drum 21 of the seven binary bits of each signal is performed serially, i.e., in consecutive order.
  • the sector timing signal from the drum 21 causes the address counter 711 to advance one memory location and provides a control signal (read command) to cause the core memory 17 to transfer a digitized signal to a parallel-to-serial converter 725.
  • These two functions of the sector timing signal are accomplished only when the gate 723 has been enabled by the second flip-flop 719.
  • Another function of the sector timing pulse is to gate the read output of the core memory 17 into the parallel-to-serial converter 725.
  • the parallel-to-serial converter 725 is merely a seven-stage shift register into which the output of the core memory 17 is gated in parallel and the output of which is the result of shifting each successive stage into the last stage from where the output is taken.
  • a clock timing pulse from the drum 21 occurs once for each bit to be transferred into the drum 21 from the parallel-to-serial converter 725.
  • seven clock timing pulses from the drum 21 are required to store the seven binary bits in serial fashion on the drum 21. Furthermore, for every seven clock timing pulses that occur, one sector timing pulse occurs.
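
The parallel-to-serial conversion can be sketched as a seven-stage shift register that is loaded once per sector pulse and shifted once per clock pulse. The class and function names are hypothetical, and the bit order (which end of the word leaves first) is an assumption, since the text does not state it.

```python
class ParallelToSerial:
    """Seven-stage shift register: load in parallel, shift out one bit per clock."""

    def __init__(self, width=7):
        self.width = width
        self.stages = [0] * width

    def load(self, code):
        # Sector pulse: gate all seven bits in at once. Stage 0 holds the most
        # significant bit (assumed ordering), so the least significant bit
        # sits in the last stage and leaves first.
        self.stages = [(code >> (self.width - 1 - i)) & 1 for i in range(self.width)]

    def clock(self):
        # Clock pulse: take the output from the last stage, then shift every
        # stage one position toward the last stage.
        bit = self.stages[-1]
        self.stages = [0] + self.stages[:-1]
        return bit

def serialize(codes):
    """Return the serial bit stream for a list of seven-bit codes."""
    reg = ParallelToSerial()
    stream = []
    for code in codes:
        reg.load(code)                                  # one sector pulse
        stream.extend(reg.clock() for _ in range(7))    # seven clock pulses
    return stream
```

Each code contributes exactly seven bits, matching the ratio of seven clock pulses to one sector pulse.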
  • the address counter 711 used in the embodiment of the invention being described is based on modulo 4096. That is, when the contents of the address counter 711 is advanced to 4095 (in binary digits, 111111111111), the next advance resets the counter to zero (in binary digits, 000000000000).
  • the loading of the core memory 17 is complete when the address counter 711 contents are advanced to 4095.
  • the first sector timing pulse from the gate 723 causes the address counter 711 to be advanced to zero so that the extraction of the successive digitized signals begins at the first address of the core memory 17.
  • the number of digitized signals transferred from the core memory 17 to the drum 21 via the parallel-to-serial converter 725 may be less than 4096. It is therefore necessary to reset the address counter 711 by the record command pulse 707 prior to loading the core memory 17.
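
The modulo-4096 behavior of the address counter can be sketched as below. The class name `AddressCounter` and its methods are hypothetical; only the wrap from 4095 to zero and the reset come from the text.

```python
class AddressCounter:
    """Twelve-bit address counter that wraps from 4095 back to 0."""

    MODULUS = 4096

    def __init__(self, value=0):
        self.value = value % self.MODULUS

    def advance(self):
        # One timing pulse: step to the next address, wrapping at 4096.
        self.value = (self.value + 1) % self.MODULUS
        return self.value

    def reset(self):
        # Record command pulse: return to the first address.
        self.value = 0

counter = AddressCounter(4095)           # state when loading is complete
first_read_address = counter.advance()   # first sector pulse wraps to zero
```

The wrap is what lets the first sector timing pulse after a full load land on address zero without a separate reset.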
  • the transfer from the core memory 17 to the drum 21 continues until another index pulse from the drum 21, signifying the drum has completed a revolution, triggers the second flip-flop 719.
  • the gate 723 is disabled, inhibiting advancement of the address counter and preventing further command signals to the core memory 17.
  • the digitized signals comprised of seven binary bits each that constitute a phoneme are therefore recorded serially on a track of the drum 21 during one revolution. Additional phonemes are recorded on other tracks of the drum 21 by using other heads distributed axially along the drum surface. In the present embodiment, there are 128 such tracks for data.
  • the index, sector, and clock pulses are each recorded on a separate track. There are seven clock pulses between sector pulses, and approximately 4,000 sector pulses between index pulses, the latter occurring once per revolution.
  • Each track consists of an individual phoneme.
  • the tracks may be selected manually, as by means of switches. By selecting one of the data heads, the particular phoneme associated therewith can be recorded and later retrieved.
  • a specified sequence of phonemes can be extracted from the drum and transferred to one of two core memories alternately during a speech synthesis operation.
  • the phonemes will then be extracted from the core memories in the same order, truncated to produce continuity of speech sounds, and converted to audible sounds. Transfer from the drum to one core memory and extraction from the other core memory for conversion to audible sounds will be accomplished concurrently. The details of how this is accomplished in the present embodiment will now be explained by reference to FIG. 8.
  • the first designates which data track is to be read from the drum 21, thereby selecting the phoneme.
  • the second number indicates a starting position and the third, a finishing position.
  • the second and third numbers supplied indicating the starting and finishing positions for the phoneme being selected are delayed until the selected phoneme has been extracted from the drum as described below.
  • These numbers are shown in the present embodiment to be supplied manually 831 or by a paper tape reader 833. It is apparent that such numbers could be provided by a complex control device such as a computer.
  • the starting and finishing positions are predetermined so as to truncate each phoneme to blend properly with the preceding and succeeding phonemes respectively.
  • the beginning and ending addresses are selected so that:
  • the value of the binary number is within 5 percent of 64 (1000000);
  • the formants in the phoneme are at frequencies which are contiguous with those in the end of the preceding phoneme with regard to the starting address and with those of the beginning of the following phoneme with regard to the ending address.
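
The first selection criterion above can be sketched as a search for addresses whose stored value is close to the zero level. This is only an illustration: reading "within 5 percent of 64" as |value - 64| <= 0.05 * 64 is an assumption, the function names are hypothetical, and the second criterion (formant continuity) is not checked at all.

```python
ZERO_LEVEL = 64

def near_zero_addresses(samples, tolerance=0.05):
    """Return addresses whose stored value lies within 5 percent of 64."""
    limit = tolerance * ZERO_LEVEL
    return [addr for addr, value in enumerate(samples)
            if abs(value - ZERO_LEVEL) <= limit]

def pick_window(samples, earliest_start, latest_finish):
    """Pick the first candidate at or after earliest_start as the start and
    the last candidate at or before latest_finish as the finish.

    Returns (start, finish), or None if no usable pair exists.
    """
    candidates = near_zero_addresses(samples)
    starts = [a for a in candidates if a >= earliest_start]
    finishes = [a for a in candidates if a <= latest_finish]
    if not starts or not finishes or starts[0] >= finishes[-1]:
        return None
    return starts[0], finishes[-1]
```

Starting and stopping near the zero level avoids an amplitude step at the junction; in practice the formant criterion would narrow the candidates further.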
  • the first number is transferred from the manual control 831 or the paper tape reader 833 to a phoneme selector register 835.
  • the second number is transferred to a read counter 837 via an intermediate register 839.
  • the third number is transferred to a holding register 841.
  • the phoneme associated with the starting and finishing position in the register will have been extracted from the drum 21 and stored in one of the core memories 17 or 827.
  • a flip-flop 843 is provided to designate which core memory is involved in the first operation and which is involved in the second. For purpose of illustration, it will be assumed that the A-output 847 of the flip-flop 843 is true and that the B-output 845 is false. It is immaterial to the operations to be described which output is assumed to be true first.
  • the first operation of transferring from the drum 21 to one of the core memories will be described.
  • the track to be read from the drum 21 is selected by the phoneme selector register 835.
  • An index pulse from the drum 21 resets a write counter 811 to zero.
  • the binary digits of the phoneme signal from the head selected on the drum 21 are gated by the clock pulse track into a serial-to-parallel converter 849.
  • the parallel output of the converter 849 consists of seven lines to each of the core memories 17 and 827.
  • the true A-output 847 of the flip-flop 843 will cause the binary digits from the converter 849 to be written into core memory B 827 by enabling the gates 851 and 853, associated with writing into core memory B 827.
  • the address at which each character of seven bits is to be written is transmitted to core memory B 827 from the write counter 811 via the enabled gate 851. Every seven clock pulses from the drum 21 will be accompanied in time by a sector pulse, which is transmitted to core memory B 827 by the enabled gate 853 to cause the storage of the seven bits from the converter 849 to occur, and which also advances the write counter 811 by one count. Thus, each successive seven binary digit phoneme signal is transferred from the drum 21 through the converter 849 into the core memory B 827.
  • the write counter 811 "wraps around" at a count of 4095 so that a maximum of 4096 characters can be transferred. When the drum 21 has made a complete revolution, all the characters comprising a complete phoneme will have been transferred to core memory B 827. An index pulse will reset the write counter 811 to zero and, with no change in the flip-flop 843, the same sequence of characters will be transferred again without altering the contents of the core memory B 827.
  • the true A-output 847 of the flip-flop 843 also enables the gates 855, 857 and 859 associated with reading from the core memory A 17.
  • the address from which the characters are extracted from core memory A 17 is provided by the read counter 837 via the enabled gate 855.
  • the original setting of the read counter 837 was supplied externally from either the manual control 831 or the paper tape reader 833.
  • the timing pulse to cause the read-out (extraction) from the core memory A 17 is supplied by a 14 kHz oscillator 861 via the enabled gate 859.
  • the timing pulse also advances the read counter 837 one location.
  • the output of the core memory A 17 is a seven bit binary character which furnishes the input to a digital-to-analog converter 863 via the enabled gate 857.
  • the output of the digital-to-analog (D/A) converter 863 is a continuously varying electrical signal, the amplitude of which is determined by the digital input.
  • the continually varying output of the D/A converter is amplified by a suitable amplifier 865 and converted to audible sounds by a suitable transducer such as a speaker 867. Successive phonemes are read out from the core memory A 17 until the number in the read counter 837 is equal to the number in the finishing position register 841.
  • a comparator 869 the output of which triggers the flip-flop 843 and signals the paper tape reader 833 to furnish the second and third numbers associated with the phoneme just transferred from the drum 21 to the core memory B 827 and to furnish the first number of the next phoneme to be so transferred.
  • the triggering of the flip-flop 843 causes the B-output 845, which was previously false, to become true, and the A-output 847, which was previously true, to become false.
  • the track on the drum 21 selected by the phoneme selector 835 is read out to the serial-to-parallel converter 849, timed as described above, and the output of the converter 849 is sent to both core memories.
  • the true B-output 845 of the flip-flop 843 enables the gates 871 and 873 associated with writing into core memory A 17. Specifically, the sector timing pulse is supplied via the enabled gate 871 and the address via the enabled gate 873 from the write counter 811.
  • the corresponding gates 853 and 851 of core memory B 827 are now disabled because the A-output 847 of the flip-flop 843 is false. The first operation is therefore performed using the alternate core memory.
  • the second operation is also performed using the other core memory because the true B-output 845 of the flip-flop 843 enables the gates 875, 877 and 879 associated with reading from the core memory B 827.
  • the timing pulse from the oscillator 861 is supplied to the core memory B 827 via the enabled gate 875; the address from the read counter 837, via the enabled gate 877; and the output from the core memory B 827 is transmitted to the input of the digital-to-analog converter 863 via the enabled gate 879.
  • the corresponding gates 855, 859 and 857 associated with the core memory A 17 are disabled because the A-output 847 of the flip-flop 843 is false.
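
The alternation between the two core memories can be sketched as a two-buffer loop in which a single toggle decides which memory is written and which is read. Everything here is a simplified model: the function name `play_sequence`, the dictionary used for the drum, and the inclusive treatment of the finishing address are assumptions, and the two operations run one after the other rather than concurrently as in the hardware.

```python
def play_sequence(drum, commands):
    """Play a phoneme sequence using two alternating buffers.

    drum: dict mapping track number to a list of seven-bit codes.
    commands: list of (track, start, finish) triples.
    Returns the codes sent to the D/A converter, in order.
    """
    memories = {"A": [], "B": []}
    write_to = "B"       # A-output true: write memory B, read memory A
    output = []
    pending = None       # (start, finish) of the phoneme loaded last
    for track, start, finish in commands + [(None, 0, 0)]:
        read_from = "A" if write_to == "B" else "B"
        # Second operation: read out the previously loaded phoneme, truncated
        # to its start and finish addresses (finish taken as inclusive).
        if pending is not None:
            s, f = pending
            output.extend(memories[read_from][s:f + 1])
        # First operation: load the next phoneme into the other memory.
        if track is not None:
            memories[write_to] = list(drum[track])
            pending = (start, finish)
        # Comparator output: the flip-flop toggles and the roles swap.
        write_to = read_from
    return output
```

With `drum = {0: [1, 2, 3, 4, 5], 1: [10, 20, 30, 40, 50]}` and commands `[(0, 1, 3), (1, 0, 2)]`, the output is the truncated first phoneme followed directly by the truncated second one.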
  • the drum 21 revolves at an angular velocity of 1800 revolutions per minute. One revolution of the drum is necessary to transfer an entire phoneme. It therefore requires approximately 34 milliseconds to transfer a phoneme from the drum to a core memory. In addition, there is a latency period, i.e., waiting time for the index pulse, of almost 34 milliseconds.
  • the read out frequency from the other core memory is 14 kHz, so that a character is retrieved every 71½ microseconds. Therefore, to transfer the phoneme into one core memory requires the amount of time needed to read out approximately 950 characters from the other core memory.
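
The timing figures above can be checked with a few lines of arithmetic. The variable names are illustrative; the exact results (about 33.3 ms per revolution and about 933 characters) are slightly below the text's rounded values of 34 ms and 950 characters.

```python
DRUM_RPM = 1800
READ_RATE_HZ = 14_000

revolution_ms = 60_000 / DRUM_RPM            # time for one drum revolution
worst_case_load_ms = 2 * revolution_ms       # full latency plus one transfer
char_period_us = 1_000_000 / READ_RATE_HZ    # time per character read out
chars_during_load = worst_case_load_ms * 1000 / char_period_us
```

The result is the minimum number of characters that must be read from one core memory to cover the worst-case loading time of the other.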
  • the capacity of each core memory is 4096 characters, but this amount is never extracted because of truncation at the beginning and end of the phoneme. However, the number of characters extracted will always exceed the minimum required to provide the time to load the other core memory from the drum. The time required to set a new starting position is short enough that no discontinuity of the sound produced is detectable. The output of the digital-to-analog converter 863 is sustained long enough to prevent any minor discontinuity that may otherwise tend to occur.
  • the numbers supplied by the paper tape reader 833 or equivalent control device, such as a computer, are chosen to select the phonemes required in the proper sequence to produce the speech sounds desired and to truncate such phonemes to provide the maximum intelligibility.
  • Another possible embodiment of this invention employing only one core memory provides for extracting from the drum only that portion of each phoneme that is to be reproduced and storing all extracted phoneme portions serially in a large core memory. Extraction from the drum and transfer to the core memory would begin at the first, or lowest, core memory address. When the last, or highest, core memory address is reached, the transfer begins again at the first address. Extraction from the core memory of the digital signals to be converted to analog signals commences from the first address and proceeds sequentially to the last, at which time extraction would begin at the first address again. Suitable means for a specified number of drum revolutions provides proper timing.
  • Apparatus for synthesizing speech comprising:
  • first storage means for storing a plurality of digitally-coded phonemes
  • second and third storage means each capable of storing at least one digitally-coded phoneme
  • control means for selecting a predetermined sequence of phonemes from said first storage means
  • transfer means for loading the successive digitally coded phonemes selected by said control means from the first storage means alternately to said second and third storage means;
  • read-out means operative concurrently with said transfer means for converting the digitally coded phoneme stored in said second storage means into audio signals during the time said third storage means is being loaded with the following phoneme by said transfer means and for converting the digitally coded phoneme stored by said third storage means into audio signals during the time said second storage means is being loaded with the following phoneme by said transfer means;
  • addressing means coupled to said read-out means and said control means for causing the read-out means to finish the retrieval of a phoneme from one storage means and to start the retrieval of a phoneme from the other storage means at respective addresses at which the stored digital values represent contiguous audio frequencies.
  • address register means for storing and incrementing the address of the storage means whose contents are being retrieved
  • start register means coupled to said control means for placing a starting address in said address register means
  • finish register means coupled to said control means for receiving therefrom the last address from which the contents of the addressed storage means is to be retrieved;
  • recognition means responsive to the contents of the address register means and the finish register means for providing a signal to said control means when the last address has been specified.
  • the memory select means include a triggerable bistable multivibrator which changes state in response to the signal from the recognition means in said addressing means.
  • control means includes means for specifying digitally, in sequence, groups of three numbers, each group comprising a first number designating the phoneme to be retrieved, a second number designating the start address, and a third number designating the finish address.
  • said first storage means stores a plurality of digitally coded phonemes serially; and further including serial-to-parallel converting means coupled between said first storage means and said transfer means for converting the serially stored digitally-coded phonemes into groups of signals for parallel transfer to either the second or third storage means.
  • said first storage means comprises a serial memory and said second and third storage means comprises static memories.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

An apparatus for synthesizing speech from phonemes is described. The smooth transition of one phoneme into the next is accomplished by a timewise truncation of the end of the leading phoneme and of the beginning of the following phoneme so that the formants are continuous over the junction. The apparatus described stores phonemes in digital fashion to permit their retrieval, starting and stopping the retrieval so that the desired truncation is achieved. A drum is used to store all the phonemes required and delays between phonemes are prevented by using core memories as temporary storage, transferring from the drum to one of the core memories while concurrently extracting the preceding phoneme from another core memory to be converted to sound.

Description

United States Patent
[72] Inventor: Joseph F. Schanne, Cheltenham, Pa.
[21] Appl. No.: 708,323
[22] Filed: Feb. 26, 1968
[45] Patented: Apr. 20, 1971
[73] Assignee: RCA Corporation
Primary Examiner: William C. Cooper
Assistant Examiner: Jon Bradford Leaheey
Attorney: H. Christoffersen
[54] SPEECH SYNTHESIZER PROVIDING SMOOTH TRANSISTION BETWEEN ADJACENT PHONEMES
6 Claims, 8 Drawing Figs.
[50] Field of Search: 179/1 (AS); 340/15 (2)
[56] References Cited
UNITED STATES PATENTS
3,286,235 11/1966 Sinn 340/152
3,319,002 5/1967 Clerk et al. 179/1(AS)
3,344,233 9/1967 Tufts 179/1(AS)
[Drawing sheets 1 to 3: FIGS. 1 through 8. Block labels on the drawings are not recoverable from the scan.]
SPEECH SYNTHESIZER PROVIDING SMOOTH TRANSISTION BETWEEN ADJACENT PHONEMES
The invention herein described was made in the course of or under a contract or subcontract thereunder with the Department of the Air Force.
CROSS-REFERENCES TO RELATED APPLICATIONS
A patent application, Ser. No. 708,389, titled SPEECH SYNTHESIZER, filed by Thomas B. Martin concurrently herewith and assigned to the assignee of this application contains related subject matter. FIGS. 1 through 6 are the same in both applications.
BACKGROUND OF THE INVENTION
Speech is a series of complex sounds generated and controlled by the larynx, tongue, oral and nasal cavities, and force of the breath. The abilities of persons to speak and to understand one another are acquired characteristics that tend to mask the implicit complications involved. The synthesis of speech by means other than human must take into account all the factors, however insignificant, that comprise understandable spoken words.
The recording of speech, as well as music, is usually done in an analog fashion. That is, the continuous changes in amplitude and frequency are maintained upon the storage medium. The reproduction of speech can be effected by reconverting the recorded signals into audible sound.
In the synthesis of speech, more than mere reproduction is desired. The objective of synthesized speech is the conversion of abstract facts or stored concepts into understandable speech to communicate the said facts or concepts to persons who want to know what they are.
There are many methods of accomplishing this desired result. The most obvious is to record all the sentences possible within the framework of all the facts that the user might desire or require. For even a small number of facts, however, the storage requirements for all permutations and combinations of the facts involved becomes prohibitive.
An approach to reducing the required amount of storage is to store phrases instead of sentences. The storage required is still very large for only a few facts. A further reduction is possible by storing words and combining them, under suitable control, into sentences. This has been done but results in a limited vocabulary. The same problems are encountered using syllables.
The most successful approach compatible with a large vocabulary without a prohibitively large storage requirement has been to use the basic speech unit, the phoneme.
A phoneme is a group of like or related sounds, varying under different phonetic conditions. Forty phonemes are involved in speaking English and they can be categorized into seven groups.
The first three groups comprise the vowel sounds. The first group consists of 10 simple vowels; the second, the six complex vowels; and the third, the four semivowels and liquids.
The fourth group is the six plosives, or explosive sounds.
The fifth consists of the three nasal consonants.
The sixth group is comprised of nine fricatives or spirants, characterized by frictional rustling of the breath against some part of the oral passage as it is emitted.
The seventh group consists of two affricatives. These are a stop or explosive sound followed by a slow separation of the articulating organs, so that the last part is a fricative, or spirant, with corresponding organic position.
Table I (below) lists the phonemes by group as described above. Each of the phonemes is illustrated in a simple comprehensible word indicating by the usual pronunciation the sound of the phoneme, which is underlined for identification.
TABLE I. ELEMENTARY SOUNDS (PHONEMES) OF THE ENGLISH LANGUAGE
[The underlining that marks each phoneme was lost in extraction; entries that cannot be read are marked as illegible.]
I. Simple vowels: (1) fit; (2) feet; (3) let; (4) bat; (5) but; (6) not; (7) law; (8) [illegible]; (9) [illegible]; (10) bird
II. Complex vowels: (1) pain; (2) [illegible]; (3) house; (4) ice; (5) boy; (6) few
III. Semi-vowels and liquids: (1) [illegible]; (2) [illegible]; (3) late; (4) [illegible]
IV. Plosives: (1) bad; (2) [illegible]; (3) give; (4) pot; (5) toy; (6) [illegible]
V. Nasal consonants: (1) may; (2) now; (3) [illegible]
VI. Fricatives: (1) [illegible]; (2) vision; (3) very; (4) that; (5) [illegible]; (6) fat; (7) thing; (8) shed; (9) sat
VII. Affricatives: (1) church; (2) [illegible]
It is not enough, however, merely to reproduce a sequence of recorded phonemes to produce artificial or synthesized speech. Three conditions must be met in the production of natural sounding synthetic speech from phonemes, viz:
1. there must be continuity in the speech waveform at the junction of phonemes;
2. there must be continuity in the pitch periods across the phoneme boundaries; and
3. there must be continuity of the constituent frequency components between phonemes.
The constituent frequencies of a phoneme can be considered as the dominant frequencies called formants. It is well known that any complex periodic waveform can be synthesized by a combination of sine waves of proper frequencies, amplitudes, and phase relations. The characteristic sound of a phoneme can be reproduced recognizably by the combination of no more than three formants, each of which may or may not vary with respect to time.
The synthesis of speech from phonemes requires, therefore, selecting the proper phoneme sequence and merging the formants of each at their junction points so that there are no discontinuities in the resulting speech.
Some of the approaches to providing smooth transition between phonemes have been described by Dudley et al. in Pat. No. 2,771,509; by David et al. in 2,860,187; and by Gerstman et al. in 3,158,685.
In this prior art, transitions between phonemes are provided by special circuits that provide the required continuities across the junction or store several forms of each phoneme so that the proper one could be selected to provide the continuity at the junction.
An object of this invention is to provide artificial, or synthesized, speech of improved quality and requiring as little storage of sound as possible for an unrestricted vocabulary.
Another object of this invention is to produce synthetic speech in response to control signals that determine the information to be transmitted.
A further object of this invention is the transmission of speech by means of pulses to reduce the bandwidth requirements.
Another object of this invention is to provide means for converting the output data of electronic computers or other control devices into understandable speech.
BRIEF SUMMARY OF THE INVENTION
A speech synthesizer embodying the invention includes a first storage for storing phoneme signals required to produce speech, transferring and selecting means to extract a predetermined sequence of phonemes from the first storage to one of more than one second storages, means for extracting the phonemes from the second storages in order, and for converting them to audible sound. The second storages are provided with means whereby the first and last locations of the phoneme signals extracted can be varied in response to control signals. Furthermore, provision is made to load the first storage with phoneme signals.
BRIEF DESCRIPTION OF THE DRAWING
FIG. 1 represents the approximate variations of the formants in the spoken word "WED";
FIG. 2 represents the approximate variations of the formants in the spoken word "WADE";
FIG. 3 illustrates how a junction to provide continuity of formants between phonemes is determined;
FIG. 4 shows the resultant formants of two phonemes from FIG. 3 joined as illustrated;
FIG. 5 represents two periods of a typical periodic complex waveform involved in speech;
FIG. 6 shows the pulses resulting when the waveform of FIG. 5 is sampled at periodic intervals;
FIG. 7 is a block diagram of an embodiment of the present invention for loading a drum with phoneme signals; and
FIG. 8 is a block diagram of an embodiment of the present invention.
DETAILED DESCRIPTION
FIG. 1 shows the formants 103, 107, and 109 for the spoken word "WED" as they might appear on a spectrogram with solid lines depicting the midpoint of the bands of frequencies present. For instance, the lowest frequency formants 107 and 109 between the origin and the ordinate 101 constitute the /W/ phoneme which, in the word "WED," is shown to be made up of two frequencies, both of which increase with time during the time frame 115. The vowel sound of the /EH/ phoneme is composed of three formants 103, 107, and 109 between the ordinates represented by the dashed lines 101 and 105. The final consonant /D/ occurs after a short pause at the end of the vowel sound. For smooth, intelligible speech, the formants of each phoneme must be continuous with those of the following and preceding phonemes across their junctions. The ordinate 101 in FIG. 1 represents one such junction between the /W/ and /EH/ phonemes; the formants 107 and 109 blend smoothly together and are continuous across the junction 101.
FIG. 2 is a similar representation of the spectrogram of the spoken word "WADE." The /W/ phoneme consists of the lower two formants 207 and 209 in the time frame 215 which is delineated by the origin and the ordinate represented by the dashed line 201. The /AY/ phoneme consists of the three formants 203, 207 and 209 between the dashed lines 201 and 205. The formants of the /W/ and /AY/ phonemes blend smoothly at the junction therebetween represented by the ordinate 201.
Comparing FIGS. 1 and 2, the /W/ phoneme in the word "WED" occupies a time frame 115 that is longer in duration than that 215 of the /W/ phoneme in the word "WADE." Also, in FIG. 1 the lowest two formants 107 and 109 of the phoneme /EH/ are lower respectively than the lowest two formants 207 and 209 of the phoneme /AY/ in FIG. 2. The /W/ phoneme formants in the word "WADE" in FIG. 2 are similar to the /W/ phoneme formants in the word "WED" in FIG. 1 over the same period of time. The outstanding difference between the two /W/ phonemes is that the one in FIG. 2 is truncated at an earlier point in time than the /W/ phoneme in FIG. 1.
FIG. 3 shows two phonemes not joined, but rather separated by some time interval. The sound depicted in FIG. 3 would be two complete phonemes spoken separately and distinctly.
If the two phonemes depicted in FIG. 3 are to be joined as part of speech synthesis, it is obvious that moving the terminating point in time 311 of the first phoneme into coincidence with the beginning point in time 312 of the second phoneme would result in discontinuities at the junction line so formed. The formants 303 of the first phoneme would end abruptly and the formants 313 of the second phoneme would immediately begin at different frequencies from those of the first. These abrupt changes in frequencies would result in distortions that would destroy the intelligibility of the speech being synthesized.
By extending lines 327 and 329 from the formant beginnings of the second phoneme, it can be noted in FIG. 3 that such lines will intersect the formants 307 and 309 respectively of the first phoneme. The point in time 321 determined by the intersection is a point at which the first phoneme can be truncated for a smooth transition of the formants from the first phoneme into those of the second phoneme. If the aforedescribed intersections do not occur at each formant of the first phoneme at the same time, the beginning of the second phoneme is changed so that they do. A slight amount of discontinuity is permissible so that the point in time at which the intersections occur need not be exactly the same.
If the first phoneme in FIG. 3 is truncated at the point in time 321 depicted by the described intersection and this point in time is made to coincide with the beginning point in time 312 of the second phoneme, a junction is formed across which the formants are continuous. This is shown in FIG. 4 wherein the formants 407 and 409 are continuous across the junction 421, the combined phonemes starting at the same point in time 401 as that 301 of the first phoneme in FIG. 3 and the end of the combined phonemes occurring at a point in time 405 earlier than that 305 of the second phoneme in FIG. 3.
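The junction search of FIGS. 3 and 4 can be sketched in a few lines. This is only an illustration under stated assumptions: the formant tracks are given as lists of frequencies per sample, the lines 327 and 329 are treated as horizontal extensions of the second phoneme's starting frequencies, and the function name `find_truncation_index` and the 50 Hz tolerance are hypothetical choices that do not come from the patent.

```python
def find_truncation_index(first_tracks, second_start_freqs, tolerance_hz=50.0):
    """Pick the sample index at which to truncate the first phoneme.

    first_tracks: one list of frequencies (Hz) per formant of the first phoneme.
    second_start_freqs: starting frequency (Hz) of each matching formant of the
        second phoneme (assumed here to extend backward as horizontal lines).
    Returns the index with the smallest worst-case mismatch, or None when no
    index brings every formant within tolerance_hz.
    """
    best_index = None
    best_mismatch = None
    for t in range(len(first_tracks[0])):
        # Worst mismatch over all formants at this instant.
        mismatch = max(
            abs(track[t] - start)
            for track, start in zip(first_tracks, second_start_freqs)
        )
        if mismatch <= tolerance_hz and (best_mismatch is None or mismatch < best_mismatch):
            best_index, best_mismatch = t, mismatch
    return best_index


# Two rising formants of a first phoneme (illustrative values only).
f1 = [300, 400, 500, 600, 700]
f2 = [800, 1000, 1200, 1400, 1600]
print(find_truncation_index([f1, f2], [500, 1200]))  # index where both tracks meet
```

When the function returns None, the formants do not cross the extended lines at a common instant; the description handles that case by changing the beginning of the second phoneme instead.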
Another method of implementing this technique is to store individual phonemes in a manner that permits selected phonemes to be retrieved, truncated, and reproduced in a sequence previously determined, as, for example, by a control device such as a computer, to synthesize a desired speech pattern. In the described embodiment of this invention, the phonemes are stored digitally by taking periodic samples of the amplitude of the wave shape of each phoneme and converting the magnitudes into binary numbers. The binary numbers obtained are stored in sequence for each phoneme.
FIG. 5 shows two periods of a typical waveform. The line 501 representing the amplitude of the wave as a function of time traces a complex path from the origin to the end of the first period 503. The line 501 then traces a similar path to the end of the second period 505. If the amplitude of such a wave is measured at periodically occurring points in time that occur many times during the period of the wave being sampled, a series of numbers will result that will permit a close approximation of the original wave to be produced by generating individual amplitudes, as determined by the series, at intervals of time that are the same as those at which the measurements were taken. The more samples that are taken during a wave period, the more accurate the reproduction will be.
FIG. 6 is an example of a sample that could be taken of the waveform depicted in FIG. 5. Each of the amplitude plots 601 represents the instantaneous value of the continuously varying amplitude of the line 501 of FIG. 5 at a corresponding point in time.
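The periodic sampling of FIGS. 5 and 6 might be sketched as follows. The 14 kHz rate is borrowed from the read-out oscillator described for FIG. 8; the 140 Hz pitch, the harmonic mix in `example_wave`, and both function names are assumptions made purely for illustration.

```python
import math

SAMPLE_RATE_HZ = 14_000  # rate of the read-out oscillator described for FIG. 8


def sample_waveform(wave, duration_s, rate_hz=SAMPLE_RATE_HZ):
    """Measure wave(t) at evenly spaced instants, as in FIG. 6."""
    count = int(round(duration_s * rate_hz))
    return [wave(i / rate_hz) for i in range(count)]


def example_wave(t, pitch_hz=140.0):
    """A made-up periodic complex wave: a fundamental plus two harmonics."""
    w = 2 * math.pi * pitch_hz * t
    return math.sin(w) + 0.5 * math.sin(3 * w) + 0.25 * math.sin(5 * w)


# Two periods of the wave, matching the two periods drawn in FIG. 5.
samples = sample_waveform(example_wave, duration_s=2 / 140.0)
print(len(samples))
```

At these assumed values each period is covered by 100 samples, so the second period repeats the first sample for sample.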
To demonstrate one method, by way of example, employing such a technique, refer to FIGS. 7 and 8. The loading mode will be described first.
In FIG. 7, the audio signal 701 provides one of the inputs to an analog-to-digital (A/D) converter 703. The other input to the A/D converter 703 is derived from the timing signal source 705 so that the output of the A/D converter represents the instantaneous amplitude of the audio signal at the time of this input. By way of example, the amplitudes of the pulses can be divided into 128 divisions. Each magnitude can then be represented by a binary number of seven bits from the minimum value (0000000) to the maximum value (1111111). The AC zero level is at 64 (1000000). (The reference level is actually taken as approximately 5 percent off center. The direction depends on the number of inversions through the amplifier. The reason for this offset is that the amplitude of the sound waves caused by the expulsion of breath is greater than that caused by the actions of the muscles in the larynx.)
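A minimal sketch of the seven-bit coding, assuming an input amplitude normalized to the range -1.0 to +1.0 and ignoring the roughly 5 percent offset of the reference level mentioned above. The names `quantize` and `to_bits` are hypothetical; only the code range 0000000 to 1111111 and the zero level of 64 come from the description.

```python
ZERO_LEVEL = 64   # AC zero level, binary 1000000
MAX_CODE = 127    # binary 1111111


def quantize(amplitude, full_scale=1.0):
    """Map an amplitude in [-full_scale, +full_scale] to a 7-bit code.

    Assumption: a linear scale centered on ZERO_LEVEL, clamped to 0..127.
    """
    code = ZERO_LEVEL + round(amplitude / full_scale * 64)
    return max(0, min(MAX_CODE, code))


def to_bits(code):
    """Render a code as the seven binary digits used in the description."""
    return format(code, "07b")


for amplitude in (-1.0, 0.0, 0.5, 1.0):
    print(amplitude, to_bits(quantize(amplitude)))
```

Because 64 is not the exact midpoint of 0 to 127, the positive side has one step fewer than the negative side, which is why the sketch clamps the top of the range.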
On receipt of record command pulse 707, a flip-flop 709 is triggered to enable the signals from the timing source 705 (1) to advance a triggerable address counter 711 through an enabling gate 713, and (2) to permit write command signals to a core memory 17 through an enabling gate 715. Each pulse from the timing source 705 will advance the address counter 711 one location address, permit the output of the A/D converter 703 to be sent to core memory 17, and produce a write command pulse via gate 715 that causes the output of the A/D converter 703 to be transferred into the core memory 17 at a location specified by the address counter 711.
Successive digitized signals that compose the phoneme being stored are thereby stored in the core memory 17 starting at the lowest address of the memory 17. When the address counter 711 has been advanced to the highest address of the memory, indicating that the memory capacity has been reached, the flip-flop 709 is triggered again. The gate 713 is thereby disabled, inhibiting further advancement of the address counter 711. The gate 715 is also disabled, inhibiting write commands to the core memory 17. The state of the flip-flop 709 after the second triggering described enables the triggering circuit of a second flip-flop 719. The second flip-flop 719 is triggered after being enabled by the first index pulse received from a storage drum 21. The triggering of the second flip-flop 719 enables a gate 723, the other input of which is provided by a sector timing signal from the drum 21. The sector timing signal from the drum 21 occurs once for each of the digitized signals to be stored thereon. Each of the digitized signals stored in the core memory 17 consists of seven binary digits in the embodiment being described. The seven binary bits of each signal are transferred into and out of the core memory 17 in parallel, i.e., simultaneously. Storage on the drum 21 of the seven binary bits of each signal is performed serially, i.e., in consecutive order. The sector timing signal from the drum 21 causes the address counter 711 to advance one memory location and provides a control signal (read command) to cause the core memory 17 to transfer a digitized signal to a parallel-to-serial converter 725. These two functions of the sector timing signal are accomplished only when the gate 723 has been enabled by the second flip-flop 719. Another function of the sector timing pulse is to gate the read output of the core memory 17 into the parallel-to-serial converter 725.
The parallel-to-serial converter 725 is merely a seven-stage shift register into which the output of the core memory 17 is gated in parallel and the output of which is the result of shifting each successive stage into the last stage from where the output is taken. A clock timing pulse from the drum 21 occurs once for each bit to be transferred into the drum 21 from the parallel-to-serial converter 725. For each digitized signal extracted from the core memory 17, seven clock timing pulses from the drum 21 are required to store the seven binary bits in serial fashion on the drum 21. Furthermore, for every seven clock timing pulses that occur, one sector timing pulse occurs.
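The seven-stage shift register can be modeled as a pair of small functions, one for each direction of conversion. The bit order (most significant bit first) and the function names are assumptions; the description fixes only the width of seven bits and the ratio of seven clock pulses per sector pulse.

```python
WIDTH = 7  # bits per digitized signal


def parallel_to_serial(code, width=WIDTH):
    """Shift a code out one bit per clock pulse, most significant bit first."""
    return [(code >> shift) & 1 for shift in range(width - 1, -1, -1)]


def serial_to_parallel(bits):
    """Reassemble a code from bits shifted in, most significant bit first."""
    code = 0
    for bit in bits:
        code = (code << 1) | bit
    return code


def write_track(codes):
    """Lay successive codes end to end, as on one drum track."""
    stream = []
    for code in codes:
        stream.extend(parallel_to_serial(code))
    return stream


track = write_track([64, 96, 3])
print(len(track))                      # seven clock pulses per stored signal
print(serial_to_parallel(track[:7]))   # first signal recovered from the stream
```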
The address counter 711 used in the embodiment of the invention being described is based on modulo 4096. That is, when the contents of the address counter 711 are advanced to 4095 (in binary digits, 111111111111), the next advance resets the counter to zero (in binary digits, 000000000000). The loading of the core memory 17 is complete when the address counter 711 contents are advanced to 4095. The first sector timing pulse from the gate 723 causes the address counter 711 to be advanced to zero so that the extraction of the successive digitized signals begins at the first address of the core memory 17. The number of digitized signals transferred from the core memory 17 to the drum 21 via the parallel-to-serial converter 725 may be less than 4096. It is therefore necessary to reset the address counter 711 by the record command pulse 707 prior to loading the core memory 17.
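The wrap-around behavior of the modulo-4096 counter can be illustrated with a small class. The class and method names are hypothetical; only the modulus and the reset-before-loading step come from the description.

```python
class AddressCounter:
    """Twelve-bit counter that wraps from 4095 back to 0."""

    MODULUS = 4096

    def __init__(self, value=0):
        self.value = value % self.MODULUS

    def advance(self):
        """Advance one location, wrapping at the modulus."""
        self.value = (self.value + 1) % self.MODULUS
        return self.value

    def reset(self):
        """Model the reset applied by the record command pulse."""
        self.value = 0


counter = AddressCounter(4095)      # state when loading of the core memory is complete
print(format(counter.value, "012b"))
print(counter.advance())            # first sector pulse wraps the counter to zero
```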
The transfer from the core memory 17 to the drum 21 continues until another index pulse from the drum 21, signifying the drum has completed a revolution, triggers the second flip-flop 719. The gate 723 is disabled, inhibiting advancement of the address counter and preventing further command signals to the core memory 17.
The digitized signals comprised of seven binary bits each that constitute a phoneme are therefore recorded serially on a track of the drum 21 during one revolution. Additional phonemes are recorded on other tracks of the drum 21 by using other heads distributed axially along the drum surface. In the present embodiment, there are 128 such tracks for data. The index, sector, and clock pulses are each recorded on a separate track. There are seven clock pulses between sector pulses, and approximately 4,000 sector pulses between index pulses, the latter occurring once per revolution. Each track consists of an individual phoneme. During the loading mode, the tracks may be selected manually, as by means of switches. By selecting one of the data heads, the particular phoneme associated therewith can be recorded and later retrieved.
After all the phonemes to be used have been recorded, a specified sequence of phonemes can be extracted from the drum and transferred to one of two core memories alternately during a speech synthesis operation. The phonemes will then be extracted from the core memories in the same order, truncated to produce continuity of speech sounds, and converted to audible sounds. Transfer from the drum to one core memory and extraction from the other core memory for conversion to audible sounds will be accomplished concurrently. The details of how this is accomplished in the present embodiment will now be explained by reference to FIG. 8.
Three numbers are provided for each phoneme to be reproduced. The first designates which data track is to be read from the drum 21, thereby selecting the phoneme. The second number indicates a starting position and the third, a finishing position. The second and third numbers supplied indicating the starting and finishing positions for the phoneme being selected are delayed until the selected phoneme has been extracted from the drum as described below. These numbers are shown in the present embodiment to be supplied manually 831 or by a paper tape reader 833. It is apparent that such numbers could be provided by a complex control device such as a computer. The starting and finishing positions are predetermined so as to truncate each phoneme to blend properly with the preceding and succeeding phonemes respectively. The beginning and ending addresses are selected so that:
1. The value of the binary number is within 5 percent of 64 (1000000);
2. The formants in the phoneme are at frequencies which are contiguous with those in the end of the preceding phoneme with regard to the starting address and with those of the beginning of the following phoneme with regard to the ending address.
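The first of the two conditions above lends itself to a simple filter over the stored codes. The sketch below lists the addresses whose value lies within 5 percent of 64; it says nothing about the second condition, which depends on the formants and would have to be checked separately. The function name and the sample values are illustrative assumptions.

```python
def candidate_addresses(samples, zero_level=64, tolerance=0.05):
    """Return addresses whose stored code is within tolerance of the zero level."""
    margin = zero_level * tolerance  # 3.2 code steps for the values in the description
    return [
        address
        for address, value in enumerate(samples)
        if abs(value - zero_level) <= margin
    ]


stored = [64, 70, 67, 68, 61, 60, 100]  # made-up codes from one core memory
print(candidate_addresses(stored))
```

A starting or finishing position chosen from this list keeps the waveform close to the zero level at the junction, which addresses the continuity of the speech waveform required earlier in the description.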
The first number is transferred from the manual control 831 or the paper tape reader 833 to a phoneme selector register 835. The second number is transferred to a read counter 837 via an intermediate register 839. The third number is transferred to a holding register 841.
The phoneme associated with the starting and finishing position in the register will have been extracted from the drum 21 and stored in one of the core memories 17 or 827.
Two operations will be performed concurrently. The first transfers a phoneme from the drum 21 to one of the core memories 17 or 827, and the second extracts the phoneme in the other core memory and converts it to audible sound. A flip-flop 843 is provided to designate which core memory is involved in the first operation and which is involved in the second. For purpose of illustration, it will be assumed that the A-output 847 of the flip-flop 843 is true and that the B-output 845 is false. It is immaterial to the operations to be described which output is assumed to be true first.
The first operation of transferring from the drum 21 to one of the core memories will be described. The track to be read from the drum 21 is selected by the phoneme selector register 835. An index pulse from the drum 21 resets a write counter 811 to zero. The binary digits of the phoneme signal from the head selected on the drum 21 are gated by the clock pulse track into a serial-to-parallel converter 849. The parallel output of the converter 849 consists of seven lines to each of the core memories 17 and 827. The true A-output 847 of the flip-flop 843 will cause the binary digits from the converter 849 to be written into core memory B 827 by enabling the gates 851 and 853, associated with writing into core memory B 827. The address at which each character of seven bits is to be written is transmitted to core memory B 827 from the write counter 811 via the enabled gate 851. Every seven clock pulses from the drum 21 will be accompanied in time by a sector pulse, which is transmitted to core memory B 827 by the enabled gate 853 to cause the storage of the seven bits from the converter 849 to occur, and which also advances the write counter 811 by one count. Thus, each successive seven binary digit phoneme signal is transferred from the drum 21 through the converter 849 into the core memory B 827. The write counter 811 "wraps around" at a count of 4095 so that a maximum of 4096 characters can be transferred. When the drum 21 has made a complete revolution, all the characters comprising a complete phoneme will have been transferred to core memory B 827. An index pulse will reset the write counter 811 to zero and, with no change in the flip-flop 843, the same sequence of characters will be transferred again without altering the contents of the core memory B 827.
The second operation of extracting the signals from the other core memory and converting them to audible sound occurs concurrently with the first operation just described, and will now be described.
The true A-output 847 of the flip-flop 843 also enables the gates 855, 857 and 859 associated with reading from the core memory A 17. The address from which the characters are extracted from core memory A 17 is provided by the read counter 837 via the enabled gate 855. The original setting of the read counter 837 was supplied externally from either the manual control 831 or the paper tape reader 833. The timing pulse to cause the read-out (extraction) from the core memory A 17 is supplied by a 14 kHz oscillator 861 via the enabled gate 859. The timing pulse also advances the read counter 837 one location. The output of the core memory A 17 is a seven bit binary character which furnishes the input to a digital-to-analog converter 863 via the enabled gate 857. The output of the digital-to-analog (D/A) converter 863 is a continuously varying electrical signal, the amplitude of which is determined by the digital input. The continually varying output of the D/A converter is amplified by a suitable amplifier 865 and converted to audible sounds by a suitable transducer such as a speaker 867. Successive phonemes are read out from the core memory A 17 until the number in the read counter 837 is equal to the number in the finishing position register 841. The equality is detected by a comparator 869, the output of which triggers the flip-flop 843 and signals the paper tape reader 833 to furnish the second and third numbers associated with the phoneme just transferred from the drum 21 to the core memory B 827 and to furnish the first number of the next phoneme to be so transferred.
The triggering of the flip-flop 843 causes the B-output 845, which was previously false, to become true, and the A-output 847, which was previously true, to become false.
The track on the drum 21 selected by the phoneme selector 835 is read out to the serial-to-parallel converter 849, timed as described above, and the output of the converter 849 is sent to both core memories. The true B-output 845 of the flip-flop 843 enables the gates 871 and 873 associated with writing into core memory A 17. Specifically, the sector timing pulse is supplied via the enabled gate 871 and the address via the enabled gate 873 from the write counter 811. The corresponding gates 853 and 851 of core memory B 827 are now disabled because the A-output 847 of the flip-flop 843 is false. The first operation is therefore performed using the alternate core memory.
The second operation is also performed using the other core memory because the true B-output 845 of the flip-flop 843 enables the gates 875, 877 and 879 associated with reading from the core memory B 827. The timing pulse from the oscillator 861 is supplied to the core memory B 827 via the enabled gate 875; the address from the read counter 837, via the enabled gate 877; and the output from the core memory B 827 is transmitted to the input of the digital-to-analog converter 863 via the enabled gate 879. The corresponding gates 855, 859 and 857 associated with the core memory A 17 are disabled because the A-output 847 of the flip-flop 843 is false.
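The alternation between the two core memories can be sketched as a sequential simulation. This is a simplification under stated assumptions: loading and read-out are modeled one after the other rather than concurrently, the finishing address is treated as the last address read (as the claims suggest), and `Command` and `synthesize` are hypothetical names. The dictionary `drum` stands in for the tracks of the drum 21.

```python
from dataclasses import dataclass


@dataclass
class Command:
    track: int   # first number: drum track, i.e. which phoneme
    start: int   # second number: starting position
    finish: int  # third number: finishing position (assumed inclusive)


def synthesize(drum, commands):
    """Return the stream of codes that would be sent to the D/A converter."""
    memories = {"A": [], "B": []}
    write_target, read_target = "B", "A"  # A-output true: write B, read A
    output = []
    previous = None  # command for the phoneme already sitting in read_target
    for command in list(commands) + [None]:
        if command is not None:
            # First operation: copy one full track into the idle memory.
            memories[write_target] = list(drum[command.track])
        if previous is not None:
            memory = memories[read_target]
            if not 0 <= previous.start <= previous.finish < len(memory):
                raise ValueError("start/finish outside the stored phoneme")
            # Second operation: read from start until the comparator sees finish.
            read_counter = previous.start
            while True:
                output.append(memory[read_counter])
                if read_counter == previous.finish:
                    break
                read_counter += 1
        # Comparator output triggers the flip-flop: the memories swap roles.
        write_target, read_target = read_target, write_target
        previous = command
    return output


drum = {0: [10, 11, 12, 13, 14], 1: [20, 21, 22, 23, 24]}
print(synthesize(drum, [Command(0, 0, 2), Command(1, 2, 4)]))
```

The first phoneme loses its tail and the second loses its head, which is the truncation at both sides of the junction that the description relies on.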
The alternate reading and writing to each core memory continues until all the desired phonemes selected have been converted to audible sound. In the described embodiment, the drum 21 revolves at an angular velocity of 1800 revolutions per minute. One revolution of the drum is necessary to transfer an entire phoneme. It therefore requires approximately 34 milliseconds to transfer a phoneme from the drum to a core memory. In addition, there is a latency period, i.e., waiting time for the index pulse, of almost 34 milliseconds. The read-out frequency from the other core memory is 14 kHz, so that a character is retrieved approximately every 71 microseconds. Therefore, to transfer the phoneme into one core memory requires the amount of time needed to read out approximately 950 characters from the other core memory. The maximum capacity of each core memory is 4096 characters but this amount is never extracted because of truncation at the beginning and end of the phoneme. However, the number of characters extracted will always exceed the minimum required to provide the time to load the other core memory from the drum. The time required to set a new starting position is short enough that no discontinuity of the sound produced is detectable. The output of the digital-to-analog converter 863 is sustained long enough to prevent any minor discontinuity that may otherwise tend to occur.
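The timing budget in this paragraph can be checked with a few lines of arithmetic. The function names are hypothetical; the drum speed, the read-out rate, and the rounded 34 millisecond figures come from the description.

```python
DRUM_RPM = 1800
READ_RATE_HZ = 14_000


def revolution_ms(rpm=DRUM_RPM):
    """Time for one drum revolution in milliseconds."""
    return 60_000 / rpm


def characters_read_during(interval_ms, rate_hz=READ_RATE_HZ):
    """Number of characters read out of a core memory during interval_ms."""
    return interval_ms * rate_hz / 1000


print(revolution_ms())                            # one revolution, about 33.3 ms
print(1_000_000 / READ_RATE_HZ)                   # read-out period, about 71.4 microseconds
print(characters_read_during(2 * revolution_ms()))  # latency plus transfer, exact figures
print(characters_read_during(34 + 34))            # same, using the rounded 34 ms figures
```

With the rounded figures the result is 952 characters, which agrees with the "approximately 950" stated above; the unrounded figures give a slightly smaller number.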
The numbers supplied by the paper tape reader 833 or equivalent control device, such as a computer, are chosen to select the phonemes required in the proper sequence to produce the speech sounds desired and to truncate such phonemes to provide the maximum intelligibility.
Another possible embodiment of this invention employing only one core memory provides for extracting from the drum only that portion of each phoneme that is to be reproduced and storing all extracted phoneme portions serially in a large core memory. Extraction from the drum and transfer to the core memory would begin at the first, or lowest, core memory address. When the last, or highest, core memory address is reached, the transfer begins again at the first address. Extraction from the core memory of the digital signals to be converted to analog signals commences from the first address and proceeds sequentially to the last, at which time extraction would begin at the first address again. Suitable means for a specified number of drum revolutions provides proper timing.
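The single-memory alternative reads like a circular buffer, and a rough sketch follows. The fill count used to guard against overrun is an added assumption, since the description leaves the timing to "suitable means"; the class name `RingMemory` and its methods are likewise hypothetical.

```python
class RingMemory:
    """One core memory written and read sequentially with wrap-around."""

    def __init__(self, size):
        self.cells = [0] * size
        self.write_address = 0
        self.read_address = 0
        self.count = 0  # assumption: tracks unread cells to prevent overrun

    def store(self, samples):
        """Append the retained (truncated) portion of a phoneme."""
        for sample in samples:
            if self.count == len(self.cells):
                raise OverflowError("write would overtake read-out")
            self.cells[self.write_address] = sample
            self.write_address = (self.write_address + 1) % len(self.cells)
            self.count += 1

    def extract(self):
        """Return the next code for the D/A converter."""
        if self.count == 0:
            raise IndexError("nothing left to read")
        sample = self.cells[self.read_address]
        self.read_address = (self.read_address + 1) % len(self.cells)
        self.count -= 1
        return sample


ring = RingMemory(4)
ring.store([1, 2, 3])
first_two = [ring.extract(), ring.extract()]
ring.store([4, 5, 6])  # wraps past the highest address back to the first
print(first_two, [ring.extract() for _ in range(4)])
```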
I claim:
1. Apparatus for synthesizing speech comprising:
first storage means for storing a plurality of digitally-coded phonemes;
second and third storage means, each capable of storing at least one digitally-coded phoneme;
control means for selecting a predetermined sequence of phonemes from said first storage means;
transfer means for loading the successive digitally coded phonemes selected by said control means from the first storage means alternately to said second and third storage means;
read-out means operative concurrently with said transfer means for converting the digitally coded phoneme stored in said second storage means into audio signals during the time said third storage means is being loaded with the following phoneme by said transfer means and for converting the digitally coded phoneme stored by said third storage means into audio signals during the time said second storage means is being loaded with the following phoneme by said transfer means; and
addressing means coupled to said read-out means and said control means for causing the read-out means to finish the retrieval of a phoneme from one storage means and to start the retrieval of a phoneme from the other storage means at respective addresses at which the stored digital values represent contiguous audio frequencies.
2. The invention as claimed in claim 1 wherein the addressing means comprises:
address register means for storing and incrementing the address of the storage means whose contents are being retrieved;
start register means coupled to said control means for placing a starting address in said address register means;
finish register means coupled to said control means for receiving therefrom the last address from which the contents of the addressed storage means is to be retrieved; and
recognition means responsive to the contents of the address register means and the finish register means for providing a signal to said control means when the last address has been specified.
3. The invention as claimed in claim 2 wherein the memory select means include a triggerable bistable multivibrator which changes state in response to the signal from the recognition means in said addressing means.
4. The invention as claimed in claim 3 wherein the control means includes means for specifying digitally, in sequence, groups of three numbers, each group comprising a first number designating the phoneme to be retrieved, a second number designating the start address, and a third number designating the finish address.
5. The invention as claimed in claim 4 wherein said first storage means stores a plurality of digitally coded phonemes serially; and further including serial-to-parallel converting means coupled between said first storage means and said transfer means for converting the serially stored digitally-coded phonemes into groups of signals for parallel transfer to either the second or third storage means.
6. The invention as claimed in claim 1 wherein said first storage means comprises a serial memory and said second and third storage means comprises static memories.
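The alternating use of the second and third storage means, together with the start and finish addresses of claims 2 and 4, can be illustrated in software. This is only a sketch under assumptions: the function name `synthesize`, the dictionary standing in for the drum, and the tuple layout `(phoneme, start, finish)` are hypothetical, and loading and read-out run one after the other here rather than concurrently as in the claimed apparatus.

```python
def synthesize(drum, commands):
    """Read out truncated phonemes from two alternating buffers.

    drum     -- dict mapping a phoneme name to its list of digital samples
    commands -- list of (phoneme, start_address, finish_address) groups
    """
    buffers = [None, None]  # models the second and third storage means
    select = 0              # models the state of the memory-select flip-flop
    output = []
    if not commands:
        return output
    # Preload the first phoneme so read-out can begin.
    buffers[select] = drum[commands[0][0]]
    for i, (_, start, finish) in enumerate(commands):
        # Load the following phoneme into the buffer that is not being read.
        if i + 1 < len(commands):
            buffers[1 - select] = drum[commands[i + 1][0]]
        address = start  # start register places the address in the address register
        while True:
            output.append(buffers[select][address])
            if address == finish:  # recognition: last address has been specified
                break
            address += 1
        select = 1 - select  # the flip-flop changes state, switching buffers
    return output
```

For example, with `drum = {"A": [0, 1, 2, 3, 4], "B": [10, 11, 12, 13, 14]}`, the command list `[("A", 0, 2), ("B", 2, 4)]` reads addresses 0 through 2 of the first phoneme and then addresses 2 through 4 of the second. In practice the start and finish addresses would be chosen so that the samples on either side of the switch are acoustically contiguous.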
UNITED STATES PATENT OFFICE CERTIFICATE OF CORRECTION
Patent No. 3,575,555 Dated April 20, 1971
Inventor(s): Joseph F. Schanne
It is certified that error appears in the above-identified patent and that said Letters Patent are hereby corrected as shown below:
Col. 1, lines 6-8: Delete the entire lines.
Col. 1, line 74: "work" should be --word--.
Col. 2, line 1: Delete the entire line.
Col. 2, line 30: "VIII" should be --VII--.
Col. 5, line 51: "Le." should be --i.e.--.
Col. 8, line 74: after "means for" insert --inhibiting retrieval of a phoneme from the drum for--.
Signed and sealed this 9th day of May 1972.
(SEAL) Attest:
EDWARD M. FLETCHER, JR., Attesting Officer
ROBERT GOTTSCHALK, Commissioner of Patents

US708323A 1968-02-26 1968-02-26 Speech synthesizer providing smooth transistion between adjacent phonemes Expired - Lifetime US3575555A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US70832368A 1968-02-26 1968-02-26

Publications (1)

Publication Number Publication Date
US3575555A true US3575555A (en) 1971-04-20

Family

ID=24845338

Family Applications (1)

Application Number Title Priority Date Filing Date
US708323A Expired - Lifetime US3575555A (en) 1968-02-26 1968-02-26 Speech synthesizer providing smooth transistion between adjacent phonemes

Country Status (2)

Country Link
US (1) US3575555A (en)
JP (1) JPS5148002B1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3344233A (en) * 1967-09-26 Method and apparatus for segmenting speech into phonemes
US3286235A (en) * 1961-05-05 1966-11-15 Ultronic Systems Corp Information storage system
US3319002A (en) * 1963-05-24 1967-05-09 Clerk Joseph L De Electronic formant speech synthesizer

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3905030A (en) * 1970-07-17 1975-09-09 Lannionnais Electronique Digital source of periodic signals
US3794753A (en) * 1971-09-16 1974-02-26 Weston D Synthesis of speech from a magnetic tape matrix storage of phonetic segments
US3865982A (en) * 1973-05-15 1975-02-11 Belton Electronics Corp Digital audiometry apparatus and method
JPS5735479B2 (en) * 1974-11-20 1982-07-29
JPS5159207A (en) * 1974-11-20 1976-05-24 Shurago Moza Fuoresuto
JPS5737079B2 (en) * 1974-11-20 1982-08-07
JPS564195A (en) * 1974-11-20 1981-01-17 Mozer Forrest Shrago Voice synthesizer
JPS52122004A (en) * 1975-11-14 1977-10-13 Mozer Forrest Shrago Method and device for synthesizing audio
JPS573960B2 (en) * 1975-11-14 1982-01-23
US4384169A (en) * 1977-01-21 1983-05-17 Forrest S. Mozer Method and apparatus for speech synthesizing
US4121051A (en) * 1977-06-29 1978-10-17 International Telephone & Telegraph Corporation Speech synthesizer
US4210781A (en) * 1977-12-16 1980-07-01 Sanyo Electric Co., Ltd. Sound synthesizing apparatus
USRE31172E (en) * 1977-12-16 1983-03-08 Sanyo Electric Co., Ltd. Sound synthesizing apparatus
US4335277A (en) * 1979-05-07 1982-06-15 Texas Instruments Incorporated Control interface system for use with a memory device executing variable length instructions
US4337375A (en) * 1980-06-12 1982-06-29 Texas Instruments Incorporated Manually controllable data reading apparatus for speech synthesizers
US4658369A (en) * 1981-06-18 1987-04-14 Sanyo Electric Co., Ltd Sound synthesizing apparatus
WO1989003573A1 (en) * 1987-10-09 1989-04-20 Sound Entertainment, Inc. Generating speech from digitally stored coarticulated speech segments
EP0582377A3 (en) * 1992-08-03 1994-06-01 Ibm Speech synthesis
EP0582377A2 (en) * 1992-08-03 1994-02-09 International Business Machines Corporation Speech Synthesis
US5430841A (en) * 1992-10-29 1995-07-04 International Business Machines Corporation Context management in a graphics system
US5806039A (en) * 1992-12-25 1998-09-08 Canon Kabushiki Kaisha Data processing method and apparatus for generating sound signals representing music and speech in a multimedia apparatus
US5802250A (en) * 1994-11-15 1998-09-01 United Microelectronics Corporation Method to eliminate noise in repeated sound start during digital sound recording
US6639512B1 (en) 1998-07-15 2003-10-28 Kyu-Woong Lee Environmental warning system
US20080217823A1 (en) * 2007-03-07 2008-09-11 Ball Corporation Mold construction for a process and apparatus for manufacturing shaped containers
US7568369B2 (en) * 2007-03-07 2009-08-04 Ball Corporation Mold construction for a process and apparatus for manufacturing shaped containers
US20170249953A1 (en) * 2014-04-15 2017-08-31 Speech Morphing Systems, Inc. Method and apparatus for exemplary morphing computer system background
US10008216B2 (en) * 2014-04-15 2018-06-26 Speech Morphing Systems, Inc. Method and apparatus for exemplary morphing computer system background
US9218798B1 (en) * 2014-08-21 2015-12-22 Kawai Musical Instruments Manufacturing Co., Ltd. Voice assist device and program in electronic musical instrument
US10685644B2 (en) * 2017-12-29 2020-06-16 Yandex Europe Ag Method and system for text-to-speech synthesis

Also Published As

Publication number Publication date
JPS5148002B1 (en) 1976-12-18

Similar Documents

Publication Publication Date Title
US3575555A (en) Speech synthesizer providing smooth transistion between adjacent phonemes
US3588353A (en) Speech synthesizer utilizing timewise truncation of adjacent phonemes to provide smooth formant transition
US4214125A (en) Method and apparatus for speech synthesizing
US4121058A (en) Voice processor
US3928722A (en) Audio message generating apparatus used for query-reply system
US3803363A (en) Apparatus for the modification of the time duration of waveforms
GB1592473A (en) Method and apparatus for synthesis of speech
JPS6030960B2 (en) Synthesizer that converts digital frames into analog signals
GB2036516A (en) Voice synthesizer
US3398241A (en) Digital storage voice message generator
GB1257850A (en)
US4458110A (en) Storage element for speech synthesizer
US4384170A (en) Method and apparatus for speech synthesizing
US3532821A (en) Speech synthesizer
JP2700937B2 (en) Fast listening device
EP0194004A2 (en) Voice synthesis module
SU417965A3 (en)
JPS6184771A (en) Voice input device
JPS6067998A (en) Voice synthesizer
AT311077B (en) Device for synthesizing audio information
JPS5945498A (en) Recording/editing type voice synthesizer
Becker et al. Natural speech from a computer
SU763942A1 (en) Device for transmitting remote measurement data
DE2531006A1 (en) Speech synthesis system from diphthongs and phonemes - uses time limit for stored diphthongs and their double application
SU699545A1 (en) Speech synthesis device