JP6084654B2

JP6084654B2 - Speech recognition apparatus, speech recognition system, terminal used in the speech recognition system, and method for generating a speaker identification model

Info

Publication number: JP6084654B2
Application number: JP2015113949A
Authority: JP
Inventors: 泰貴畠山
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2015-06-04
Filing date: 2015-06-04
Publication date: 2017-02-22
Anticipated expiration: 2035-06-04
Also published as: WO2016194740A1; JP2017003611A

Description

The present disclosure relates to speech recognition and, more specifically, to a technique for identifying a speaker.

Techniques for identifying a speaker in speech recognition are known. For example, Japanese Patent Laid-Open No. 2010-217319 (Patent Document 1) discloses a technique for “improving the accuracy of speaker identification in a speaker identification device that identifies a speaker from an audio signal” (see [Abstract]). Japanese Patent Laid-Open No. 7-261781 (Patent Document 2) discloses “a learning method for creating a phoneme model for speaker recognition with high speaker recognition accuracy” (see [Abstract]).

Patent Document 1: Japanese Patent Laid-Open No. 2010-217319 (JP 2010-217319 A)
Patent Document 2: Japanese Patent Laid-Open No. 7-261781

In conventional speech-based speaker identification, it is assumed that a model for identifying a speaker is given in advance, and the goal is to build an efficient model from shorter user utterances. For that reason, the user is asked in advance to speak for a period that, although short, still lasts about one to two minutes, and the speaker identification model is established from the voice data thus obtained.

The conventional technique requires the user to speak as preprocessing. In speaker identification for voice communication, however, a more natural dialogue requires that learning data be acquired without making the user feel that his or her utterances are being used for learning. It is therefore necessary to acquire the voice data needed to build a speaker identification model, in a state where no such model has yet been built, without placing a burden on the user.

The present disclosure has been made to solve the problems described above, and an object in one aspect is to provide a speech recognition apparatus capable of acquiring the voice data necessary for building a speaker identification model.

An object in another aspect is to provide a speech recognition system capable of acquiring the voice data necessary for building a speaker identification model.

An object in another aspect is to provide a terminal used in the speech recognition system.

An object in still another aspect is to provide a method for generating a speaker identification model.

A speech recognition apparatus according to an embodiment includes a voice input unit for receiving an utterance that includes information identifying a speaker and an utterance that does not include such information, a speech recognition unit for performing speech recognition processing, a voice output unit for outputting voice, and a control unit for controlling the speech recognition apparatus based on a result of the speech recognition processing. The control unit generates a speaker identification model for identifying the speaker by associating the information identifying the speaker with the utterance that does not include that information.

In one aspect, the voice data necessary for learning can be collected simply by the user carrying on an ordinary voice dialogue, without the user being aware of any preprocessing for learning.

The above and other objects, features, aspects, and advantages of the present invention will become apparent from the following detailed description of the invention, taken in conjunction with the accompanying drawings.

A diagram showing the exchange between the user 1 and the terminal 2 when a shiritori game is played.
A diagram showing an outline of the configuration of the speech recognition system according to a first example of the present disclosure.
A diagram showing an outline of the configuration of the speech recognition system according to a second example of the present disclosure.
A diagram showing an outline of the configuration of the speech recognition system according to a third example of the present disclosure.
A diagram showing an outline of the configuration of the speech recognition system according to a fourth example of the present disclosure.
A block diagram showing the configuration of the functions that implement the speech recognition system according to the present disclosure.
A diagram conceptually showing one mode of storing the data held in the speech recognition system.
A diagram showing a state in which the speaker model 80 is generated through dialogue between the user 1 and the terminal 2.
A flowchart (part 1) showing the sequence when the user is the starting point of the utterances.
A flowchart (part 2) showing the sequence when the user is the starting point of the utterances.
A flowchart (part 3) showing the sequence when the user is the starting point of the utterances.
A diagram showing the sequence of exchanges between the user 1 and the terminal 2 when the user is known to the speech recognition system.
A sequence chart (part 1) showing the flow of the processing performed when the user is known.
A sequence chart (part 2) showing the flow of the processing performed when the user is known.
A diagram showing the case where the terminal 2 speaking to the user 1 triggers the dialogue.
A sequence chart (part 1) showing part of the processing performed in the speech recognition system.
A sequence chart (part 2) showing part of the processing performed in the speech recognition system.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the following description, identical components are denoted by identical reference numerals. Their names and functions are also the same, so detailed description of them will not be repeated.

<Technical idea>
According to the present disclosure, when the user is unidentified, the system asks for the user's name as part of the voice dialogue and classifies the user, thereby collecting the voice data (for example, voiceprint information) needed to build a model for speaker identification. For example, in a game-like dialogue such as shiritori (a word-chain game) or tongue twisters, the user can be expected to speak to the game partner (for example, a terminal or a home appliance) multiple times. In such a case, the device acting as the game partner can use the series of user utterances as learning data by asking for the user's name before playing the game. Alternatively, by asking for the user's name after an unknown user has spoken, the device can determine the name of the unknown user who produced the immediately preceding utterance.

In the present embodiment, morphological analysis, for example, is used as one example of speech recognition. This analysis technique separates proper nouns from other words. For example, the speech recognition system may hold a dictionary of names as a database. Speech recognition is performed by matching the proper nouns extracted in the morphological analysis against the dictionary.
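The dictionary matching described here can be pictured with a short Python sketch. It assumes that a morphological analyzer has already split the recognized text into (surface form, part of speech) pairs. The token format, the contents of `NAME_DICTIONARY`, and the function `extract_speaker_name` are hypothetical and are not taken from the disclosure.

```python
from typing import List, Optional, Tuple

# Hypothetical name dictionary; the disclosure only states that a
# dictionary of names may be held as a database.
NAME_DICTIONARY = {"たろう", "はなこ"}

# Assumed token format: (surface form, part of speech).
Token = Tuple[str, str]


def extract_speaker_name(tokens: List[Token]) -> Optional[str]:
    """Return the first proper noun found in the name dictionary, if any."""
    for surface, part_of_speech in tokens:
        if part_of_speech == "proper_noun" and surface in NAME_DICTIONARY:
            return surface
    return None


# "たろうだよ" ("I'm Taro") as an assumed analyzer might segment it.
tokens = [("たろう", "proper_noun"), ("だ", "auxiliary"), ("よ", "particle")]
print(extract_speaker_name(tokens))
```

An utterance for which the function returns `None` corresponds to an utterance that does not include information identifying the speaker.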

<Outline of configuration>
(Configuration 1) A speech recognition apparatus according to one aspect includes a microphone for receiving an utterance that includes information identifying a speaker and an utterance that does not include such information, a processor for performing speech recognition processing, a loudspeaker for outputting voice, and a processor for controlling the speech recognition apparatus based on a result of the speech recognition processing. The processor generates a speaker identification model for identifying the speaker by associating the information identifying the speaker with the utterance that does not include that information. The speaker identification model may include, for example, an identification ID of the speaker, the name of the speaker (a user of the speech recognition apparatus), and voiceprint information extracted from the speaker's utterances.

In the present embodiment, the information identifying a speaker refers to a word or phrase that can be included in an utterance, for example a name, a nickname, a resident number, an identification number issued by a government agency, or other such information.

(Configuration 2) Preferably, the loudspeaker outputs an inquiry asking for the information identifying the speaker. Associating the information identifying the speaker with the utterance that does not include that information includes associating the information identifying the speaker with an utterance that is made after the inquiry and does not include that information.

(Configuration 3) Preferably, the loudspeaker outputs the inquiry asking for the information identifying the speaker after an utterance that does not include that information. Associating the information identifying the speaker with the utterance that does not include that information includes associating the utterance made before the inquiry with the information identifying the speaker contained in the utterance made in response to the inquiry.
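The two association patterns in Configurations 2 and 3 can be sketched as follows. This is a minimal illustration in Python that assumes a single speaker per dialogue session; the classes `Utterance` and `Session`, the `audio_id` handle, and the (name, audio) pair format are hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Utterance:
    audio_id: str               # hypothetical handle to stored voice data
    name: Optional[str] = None  # identifying information found in the utterance


@dataclass
class Session:
    """Buffers the utterances of one dialogue and labels them once a name is known."""
    speaker: Optional[str] = None
    pending: List[str] = field(default_factory=list)              # not yet tied to a speaker
    labeled: List[Tuple[str, str]] = field(default_factory=list)  # (name, audio_id) pairs

    def add(self, utt: Utterance) -> None:
        if utt.name is not None:
            # A name arrived: label earlier utterances retroactively (Configuration 3).
            self.speaker = utt.name
            self.labeled += [(self.speaker, audio) for audio in self.pending]
            self.pending.clear()
        if self.speaker is None:
            self.pending.append(utt.audio_id)
        else:
            # Utterances after the name is known are labeled directly (Configuration 2).
            self.labeled.append((self.speaker, utt.audio_id))


session = Session()
session.add(Utterance("a1"))                # unknown speaker says something
session.add(Utterance("a2", name="たろう"))  # answers the inquiry with a name
session.add(Utterance("a3"))                # keeps talking
print(session.labeled)
```

The labeled pairs are the kind of data from which a speaker identification model could then be trained; how the model itself is trained is outside this sketch.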

(Configuration 4) The processor is configured to determine the content of the next utterance to be output from the loudspeaker based on the content of the response to an utterance output from the loudspeaker. For example, the speech recognition apparatus holds a plurality of inquiries in advance, graded into levels of difficulty. In one aspect, when no response to an inquiry of medium difficulty is returned within a predetermined time, or when the response is incorrect, the processor utters an inquiry of low difficulty (a shiritori question). In another aspect, when a response is returned early within the predetermined time, the processor utters an inquiry of high difficulty (a shiritori question) as the next inquiry.
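One way to picture the difficulty selection in Configuration 4 is the following Python sketch. The three-level scale (0 = low, 1 = medium, 2 = high), the 10-second time limit, and the 3-second threshold for an "early" response are assumptions chosen for illustration; the disclosure does not give concrete values.

```python
from typing import Optional


def next_difficulty(current: int,
                    response_seconds: Optional[float],
                    correct: bool,
                    time_limit: float = 10.0,     # assumed predetermined time
                    fast_threshold: float = 3.0,  # assumed meaning of "early"
                    lowest: int = 0,
                    highest: int = 2) -> int:
    """Pick the difficulty level of the next inquiry.

    response_seconds is None when no response arrived at all.
    """
    timed_out = response_seconds is None or response_seconds > time_limit
    if timed_out or not correct:
        return max(lowest, current - 1)   # fall back to an easier inquiry
    if response_seconds <= fast_threshold:
        return min(highest, current + 1)  # quick, correct answer: harder inquiry
    return current


print(next_difficulty(1, None, False))  # no response to a medium inquiry
print(next_difficulty(1, 2.0, True))    # fast, correct response
```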

(Configuration 5) The speech recognition apparatus further includes a memory for storing the generated speaker identification model. The processor is configured to update the generated speaker identification model based on the response to the inquiry.

(Configuration 6) According to another aspect, a speech recognition system is provided. The speech recognition system includes a terminal and a device capable of communicating with the terminal. The terminal includes a microphone for receiving an utterance that includes information identifying a speaker and an utterance that does not include such information, a loudspeaker for outputting voice, and a communication interface that is electrically connected to the microphone and the loudspeaker and is used for communicating with the device. The device includes a communication interface for communicating with the terminal, a processor for performing speech recognition processing, and a processor for controlling the device based on a result of the speech recognition processing. The processor generates a speaker identification model for identifying the speaker by associating the information identifying the speaker with the utterance that does not include that information.

<Background of the technical idea>
The background of the technical idea according to the present embodiment will be described with reference to FIG. 1. FIG. 1 is a diagram showing the exchange between the user 1 and the terminal 2 when a shiritori game is played. The user 1 utters a message 10 to the terminal 2. When the terminal 2 recognizes the message 10, it utters a message 11 in response.

The user 1 utters a message 12 to the terminal 2. When the terminal 2 recognizes the message 12, it utters a message 13, which is synthesized from the name contained in the message 12 and a predefined message.

When a predetermined time has elapsed, the terminal 2 utters a message 14. On recognizing the message 14, the user 1 thinks of a word that follows the message 14 as a response within a predefined time. The user 1 utters a message 15 to the terminal 2. When the terminal 2 recognizes the message 15, it refers to a Japanese-language dictionary prepared in advance and looks for a word that follows the message 15. Within the predefined time, the terminal 2 utters a message 16 as the word answering the message 15. In this way, the user 1 and the terminal 2 continue the shiritori game.
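The dictionary lookup that the terminal 2 performs for its turn can be sketched in Python as follows. This is a simplified illustration: it matches on the final character only and ignores the usual shiritori conventions for small kana and long-vowel marks. The word list and the function `next_word` are hypothetical.

```python
from typing import Iterable, Optional, Set


def next_word(previous: str, dictionary: Iterable[str], used: Set[str]) -> Optional[str]:
    """Return an unused word that starts with the last character of `previous`.

    Words ending in "ん" are skipped because uttering one loses the game.
    Returns None when the dictionary offers no valid continuation.
    """
    last = previous[-1]
    for word in dictionary:
        if word[0] == last and word not in used and not word.endswith("ん"):
            return word
    return None


dictionary = ["ごりら", "らっぱ", "ぱんだ"]  # stand-in for the prepared dictionary
used = {"りんご"}
reply = next_word("りんご", dictionary, used)
print(reply)
```

A return value of `None` would correspond to the terminal being unable to continue, a case this sketch leaves unhandled.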

If the user 1 can return the next word within the predefined time in response to the utterance of the terminal 2, the shiritori continues in the same way. For example, the user 1 utters a message 17 to the terminal 2. When the terminal 2 recognizes the message 17, it utters a message 18.

On the other hand, the user 1 may be unable to return the next word. In this case, the user 1 either remains silent or utters a message 19 to the effect that he or she does not know. When the terminal 2 determines that there has been no response from the user 1 within a predetermined waiting time, or when it recognizes the message 19, it utters a message 20 that has been defined in advance for that situation.

In such a case, the terminal 2 recognizes, through the exchange of messages with the user 1, that the user 1 is “Taro” and associates “Taro” with each piece of data as user information.

The configuration of the speech recognition system according to the present disclosure will be described with reference to FIGS. 2 to 5.
[Terminal]
FIG. 2 is a diagram showing an outline of the configuration of the speech recognition system according to a first example of the present disclosure. In this speech recognition system, a single terminal 200 functions as the speech recognition system.

The terminal 200 includes a control unit 30, a voice input unit 31, a voice output unit 32, a speaker identification unit 33, a speaker identification learning unit 34, a user management unit 35, a speech recognition unit 36, and a dialogue analysis/generation unit 37. The terminal 200 may be any terminal that has, for example, a voice input/output function and a speech recognition function. Such a terminal may include, for example, a smartphone, a television, a cleaning robot capable of standalone operation, and other devices.

The control unit 30 controls the operation of the terminal 200. The voice input unit 31 accepts voice input and outputs a signal to the control unit 30. The voice output unit 32 converts the signal output from the control unit 30 into voice and outputs the voice to the outside of the terminal 200. The voice output unit 32 includes, for example, a loudspeaker, an output terminal, and the like. The speaker identification unit 33 identifies the speaker who has spoken to the terminal 200 based on a signal sent from the control unit 30. In another aspect, the speaker identification unit 33 identifies the speaker based on that signal and data stored in the terminal 200. The data may include, for example, voiceprint information registered in advance for a user of the terminal 200.

The speaker identification learning unit 34 creates data for each speaker (a user profile) using the information on the speaker identified by the speaker identification unit 33 (a user ID and the like). The user management unit 35 stores the user information of the terminal 200. The user information may include the user profile and the like. The speech recognition unit 36 performs speech recognition processing using the voice signal sent from the control unit 30. For example, the speech recognition unit 36 extracts the characters contained in an utterance.

The dialogue analysis/generation unit 37 analyzes a message addressed to the terminal 200 based on the result of recognition by the speech recognition unit 36. The dialogue analysis/generation unit 37 further generates a response to the utterance according to the result of the analysis. In another aspect, the dialogue analysis/generation unit 37 generates an utterance for approaching the user of the terminal 200 based on a setting in the terminal 200. The setting may include, for example, a condition that the terminal 200 has detected the presence of a user in its vicinity or that a preset time has arrived.

[Terminal + server]
FIG. 3 is a diagram showing an outline of the configuration of the speech recognition system according to a second example of the present disclosure. This speech recognition system includes a terminal 300 and a server 350. The terminal 300 includes the voice input unit 31 and the voice output unit 32. The terminal 300 is controlled by a processor (not shown). The server 350 includes the control unit 30, the speaker identification unit 33, the speaker identification learning unit 34, the user management unit 35, the speech recognition unit 36, and the dialogue analysis/generation unit 37. The terminal 300 is realized, for example, as a terminal having a voice input/output function and a communication function. Such a terminal may include, for example, a mobile phone or other information communication terminal, or a cleaning robot or other device having a speech recognition function and a communication function.

When the terminal 300 accepts a user's utterance, it transmits a voice signal corresponding to the utterance to the server 350 via a communication interface (not shown). On receiving the voice signal, the server 350 executes processing such as speaker identification, speech recognition, dialogue analysis, and response generation. Each of these processes is the same as the corresponding process realized by the configuration shown in FIG. 2, so detailed description will not be repeated.

The server 350 transmits the generated response to the terminal 300 via a communication interface (not shown). When the terminal 300 receives the response, the voice output unit 32 outputs voice corresponding to the response.

[Terminal + server + speaker identification server]
FIG. 4 is a diagram showing an outline of the configuration of the speech recognition system according to a third example of the present disclosure. This speech recognition system includes the terminal 300, a server 400, and a speaker identification server 410. The server 400 includes the control unit 30, the user management unit 35, the speech recognition unit 36, and the dialogue analysis/generation unit 37. The speaker identification server 410 includes the speaker identification unit 33 and the speaker identification learning unit 34.

The server 400 and the speaker identification server 410 are each realized by a computer having a known configuration. The computer includes, as main components, a CPU (Central Processing Unit) that executes programs, a keyboard and other input devices, a RAM (Random Access Memory), a hard disk, an optical disk drive, a monitor, and a communication IF (Interface).

Processing in the computer is realized by the hardware components and by software executed by the CPU. In one aspect, the software is stored in advance on the hard disk. In another aspect, the software is stored on a CD-ROM or other computer-readable nonvolatile data recording medium and distributed as a program product. In yet another aspect, the software may be provided as a downloadable program product by an information provider connected to the Internet or another network.

The hardware configuration of the computer is a common one. Therefore, the description of the hardware configuration of the server 400 and the speaker identification server 410 will not be repeated. It can be said that the essential part realizing the technical idea according to the present embodiment is the program stored in the computer.

When the server 400 receives the voice signal sent from the terminal 300, it transmits the voice signal to the speaker identification server 410 via the communication interface.

The speaker identification server 410 recognizes the speaker and also generates data for registering the speaker. The speaker identification server 410 transmits the generated data to the server 400.

[Terminal + server + speaker identification server + speech recognition server]
FIG. 5 is a diagram showing an outline of the configuration of the speech recognition system according to a fourth example of the present disclosure. This speech recognition system includes the terminal 300, a server 500, the speaker identification server 410, and a speech recognition server 520. The server 500 includes the control unit 30, the user management unit 35, and the dialogue analysis/generation unit 37. The speech recognition server 520 includes the speech recognition unit 36.

When the server 500 receives a voice signal from the terminal 300, it transmits the voice signal to the speaker identification server 410 and the speech recognition server 520. The speech recognition server 520 executes speech recognition processing using the voice signal and transmits the recognition result to the server 500.
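The fan-out from the server 500 to the two back-end servers can be sketched with Python's `asyncio`. The disclosure states only that the voice signal is sent to both servers; issuing the two requests concurrently is an assumption of this sketch. The coroutines `identify_speaker` and `recognize_speech` are hypothetical stand-ins that return fixed values instead of contacting the speaker identification server 410 and the speech recognition server 520.

```python
import asyncio


async def identify_speaker(audio: bytes) -> dict:
    # Stand-in for a request to the speaker identification server 410 (assumed interface).
    await asyncio.sleep(0)
    return {"identified": False, "user_id": None}


async def recognize_speech(audio: bytes) -> str:
    # Stand-in for a request to the speech recognition server 520 (assumed interface).
    await asyncio.sleep(0)
    return "たろうだよ"


async def handle_voice_signal(audio: bytes) -> dict:
    """Forward one voice signal to both back ends and collect both results."""
    speaker, text = await asyncio.gather(identify_speaker(audio),
                                         recognize_speech(audio))
    return {"speaker": speaker, "text": text}


result = asyncio.run(handle_voice_signal(b"\x00\x01"))
print(result)
```

With both results in hand, the control side can decide, for instance, whether an inquiry for the user's name is needed because the speaker was not identified.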

The other operations are the same as those in the configurations of the speech recognition systems according to the other examples described above. Therefore, the description of those operations will not be repeated.

[Functional configuration]
FIG. 6 is a block diagram showing the configuration of the functions that implement the speech recognition system according to the present disclosure. The speech recognition system includes a terminal module 600, a main module 610, a speaker identification module 620, and a speech recognition module 630.

The terminal module 600 includes the voice input unit 31 and the voice output unit 32. The terminal module 600 is located near the user, accepts utterances, and transmits voice data and a terminal ID to the main module 610. In another aspect, the terminal module 600 receives synthesized voice data sent from the main module 610 and outputs voice based on the synthesized voice data from the voice output unit 32.

In the main module 610, the control unit 30 transmits the voice data and a speaker model list to the speaker identification module 620. When the speaker identification module 620 has identified the speaker, it transmits a speaker identification result (for example, a message ID, a flag indicating that the speaker was identified, and the like) to the main module 610.

The control unit 30 transmits the terminal ID or the voice data to the user management unit 35. The user management unit 35 stores the terminal ID or the voice data.

The control unit 30 reads the speaker model list from the user management unit 35.
The control unit 30 exchanges, for example, text data with the dialogue analysis/generation unit 37.

The control unit 30 transmits the voice data to the speech recognition module 630. When the speech recognition module 630 has executed speech recognition processing using the voice data, it sends the result to the control unit 30 as text.

The functions shown in FIG. 6 are realized by any of the configurations shown in FIGS. 2 to 5.
[Data structure]
The data structure of the speech recognition system according to the present embodiment will be described with reference to FIG. 7. FIG. 7 is a diagram conceptually showing one mode of storing the data held in the speech recognition system. In one aspect, the speech recognition system includes a terminal management table, a home management table, and a user management table.

(Terminal management table)
The terminal management table includes a terminal ID and affiliated user IDs. The terminal ID identifies a terminal registered in the speech recognition system. In one aspect, the terminal ID is uniquely assigned by an administrator of the speech recognition system (for example, the administrator of the computer that includes the control unit 30). In another aspect, the terminal ID consists of an arbitrary character string (for example, alphanumeric characters and symbols) chosen by the user of the terminal. In this case, to prevent duplicate terminal IDs, the control unit 30, for example, checks whether the ID entered by the user is already in use and, if an ID already in use has been entered, notifies the terminal to that effect. An affiliated user ID identifies a user registered as a user of the terminal. The number of users of a terminal is not particularly limited.

（家庭管理テーブル） (Home management table)

家庭管理テーブルは、家庭ＩＤと、当該家庭に所属する端末の端末ＩＤとを含む。家庭ＩＤは、音声認識システムのサービスを利用するユーザのグループとして家庭を識別する。ユーザのグループの単位は家庭に限られない。複数のユーザが一つのグループに関連付けられるものであればよい。家庭ＩＤには、１つ以上の端末の各端末ＩＤが関連付けられている。家庭に関連付けられる端末の数は特に限られない。 The home management table includes a home ID and the terminal IDs of the terminals belonging to that home. The home ID identifies a home as a group of users who use the service of the speech recognition system. The unit of a user group is not limited to a home; any unit in which a plurality of users are associated with one group will do. Each home ID is associated with the terminal IDs of one or more terminals. The number of terminals associated with a home is not particularly limited.

（ユーザ管理テーブル） (User management table)

ユーザ管理テーブルは、ユーザＩＤと、ユーザ名と、話者モデルデータと、音声データリストとを含む。 The user management table includes a user ID, a user name, speaker model data, and a voice data list.

ユーザＩＤは、端末を使用するユーザを識別する。ユーザ名は、当該ユーザＩＤが割り当てられたユーザを識別する。話者モデルデータは、当該ユーザを識別するためのデータである。話者モデルデータは、たとえば、声紋情報を含み得る。 The user ID identifies a user who uses the terminal. The user name identifies the user to whom the user ID is assigned. The speaker model data is data for identifying the user. The speaker model data may include voiceprint information, for example.

音声データリストは、当該ユーザを識別するための音声データを含む。当該音声データは、ユーザから端末に対する発話、端末の発話に対するユーザの応答、端末に表示された文字列のユーザによる発話等を含み得る。 The voice data list includes voice data for identifying the user. The voice data may include an utterance from the user to the terminal, the user's response to an utterance by the terminal, the user's reading aloud of a character string displayed on the terminal, and the like.
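
As an illustration only, the three tables described above could be held in memory as follows. This is a minimal Python sketch, not part of the disclosed embodiment: the class names `UserRecord` and `ManagementTables`, the field names, and the methods are hypothetical. It also shows the duplicate check on terminal IDs described for the terminal management table.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserRecord:
    """One row of the user management table (field names are assumptions)."""
    user_id: str
    user_name: str
    speaker_model: Optional[bytes] = None           # e.g. voiceprint-based model data
    voice_data: list = field(default_factory=list)  # voice data list


@dataclass
class ManagementTables:
    terminals: dict = field(default_factory=dict)  # terminal ID -> affiliated user IDs
    homes: dict = field(default_factory=dict)      # home ID -> terminal IDs
    users: dict = field(default_factory=dict)      # user ID -> UserRecord

    def register_terminal(self, terminal_id: str, home_id: str) -> bool:
        """Register a terminal; return False if the ID is already in use."""
        if terminal_id in self.terminals:
            return False  # the caller would notify the terminal of the duplicate
        self.terminals[terminal_id] = []
        self.homes.setdefault(home_id, []).append(terminal_id)
        return True

    def add_user(self, terminal_id: str, user_id: str, user_name: str) -> UserRecord:
        """Add a user and list the user as belonging to the given terminal."""
        record = UserRecord(user_id=user_id, user_name=user_name)
        self.users[user_id] = record
        self.terminals[terminal_id].append(user_id)
        return record
```

In this sketch a home may hold any number of terminals and a terminal any number of users, matching the statement that neither number is particularly limited.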

［話者モデルの生成］ [Generation of speaker model]

図８を参照して、話者モデルの生成について説明する。図８は、ユーザ１と端末２との間の対話により話者モデル８０が生成される状態を表す図である。なお、図１における状態と同様の状態の説明は繰り返さない。 The generation of a speaker model will be described with reference to FIG. 8. FIG. 8 is a diagram illustrating a state in which the speaker model 80 is generated through the dialog between the user 1 and the terminal 2. The description of states identical to those in FIG. 1 will not be repeated.

ユーザ１と端末２との対話において、ユーザ１が未登録の場合には、端末２は、まず最初にユーザ名を聞いて、以降の一定区間（たとえば、ゲーム終了等）までをそのユーザの発話として音声データをデータベースに登録する。音声データは声紋情報を含み得る。 In the dialog between the user 1 and the terminal 2, when the user 1 is not registered, the terminal 2 first asks for the user name and then registers in the database, as utterances of that user, the voice data up to the end of a certain subsequent interval (for example, the end of the game). The voice data can include voiceprint information.

ユーザ発話毎に、話者識別学習部は、対象の音声ＤＢ（Database）からこれまでの音声データ全てを学習データとして話者識別の学習を行う。 For each user utterance, the speaker identification learning unit learns the speaker identification using all the speech data from the target speech DB (Database) as learning data.

ＩＤが端末ごとに割り当てられる。端末とユーザ名とによってユーザを管理することにより他端末で同名のユーザがいるばあいでも対応可能となる。 An ID is assigned to each terminal. Managing users by the combination of terminal and user name makes it possible to handle the case where a user with the same name exists on another terminal.

ユーザ１が自身の名前を発すると（メッセージ１２）、端末２はメッセージ１２を認識する。端末２は、メッセージ１２からユーザ名（＝たろう）を抽出すると、当該ユーザ名と端末２の端末ＩＤとをユーザ管理部３５に送信する。その後も、ユーザ１が発話すると、各メッセージ１５，メッセージ１７は、端末２を通してユーザ管理部３５に蓄積される。 When the user 1 utters his or her own name (message 12), the terminal 2 recognizes the message 12. When the terminal 2 extracts the user name (= Taro) from the message 12, the terminal 2 transmits the user name and the terminal ID of the terminal 2 to the user management unit 35. Thereafter, each time the user 1 speaks, the messages 15 and 17 are stored in the user management unit 35 through the terminal 2.

話者識別学習部３４は、ユーザ管理部３５に保存されている端末ＩＤとユーザ名とを読み出して、話者モデル８０を生成する。話者モデル８０は、当該ユーザ名と端末ＩＤとを含む。したがって、以降は、端末２がユーザ１と対話することによりユーザ名が特定されると、当該ユーザに関連付けられた話者モデル８０が利用可能となる。 The speaker identification learning unit 34 reads the terminal ID and user name stored in the user management unit 35 and generates a speaker model 80. The speaker model 80 includes the user name and the terminal ID. Therefore, thereafter, when the user name is specified by the terminal 2 interacting with the user 1, the speaker model 80 associated with the user can be used.
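
The bookkeeping described in this section can be sketched as follows, under stated assumptions. The names `VoiceStore` and `train_speaker_model` are hypothetical, and the "model" returned here is only a placeholder record; the source does not specify the learning algorithm. The sketch shows two points from the text: voice data is keyed by terminal ID and user name, so users with the same name on different terminals stay separate, and learning uses all voice data stored so far for the target user.

```python
class VoiceStore:
    """Stand-in for the voice DB held by the user management unit 35 (hypothetical)."""

    def __init__(self):
        self._samples = {}

    def add(self, terminal_id, user_name, audio):
        # Key by (terminal ID, user name) so that same-named users do not collide.
        self._samples.setdefault((terminal_id, user_name), []).append(audio)

    def all_samples(self, terminal_id, user_name):
        return list(self._samples.get((terminal_id, user_name), []))


def train_speaker_model(store, terminal_id, user_name):
    """Placeholder for the speaker identification learning unit 34.

    A real implementation would fit a model to the audio; only the
    bookkeeping is shown here.
    """
    samples = store.all_samples(terminal_id, user_name)
    return {
        "terminal_id": terminal_id,
        "user_name": user_name,
        "num_samples": len(samples),
    }


store = VoiceStore()
store.add("T-001", "Taro", b"utterance-12")
store.add("T-001", "Taro", b"utterance-15")
store.add("T-002", "Taro", b"other-taro")  # same name, different terminal
model = train_speaker_model(store, "T-001", "Taro")
```

Calling `train_speaker_model` again after each new utterance corresponds to relearning from all accumulated data for every user utterance.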

［制御構造］ [Control structure]

図９〜図１１を参照して、本実施の形態に係る音声認識システムの制御構造について説明する。図９から図１１は、それぞれ、ユーザが発話の起点となる場合におけるシーケンスを表すフローチャートである。 A control structure of the speech recognition system according to the present embodiment will be described with reference to FIGS. 9 to 11. FIGS. 9 to 11 are flowcharts each showing a sequence in the case where the user is the starting point of the utterance.

ステップ９１０にて、ユーザによる話者識別学習用のシーケンスを開始するための発話が行なわれる。たとえば、ユーザは「しりとりしようよ」というメッセージ９１１を発する。音声入力部３１は、メッセージ９１１を受け付けると、メッセージ９１１に応じた音声信号を制御部３０に送信する。 In step 910, the user makes an utterance that starts the sequence for speaker identification learning. For example, the user utters a message 911, "Let's play shiritori" (shiritori is a Japanese word-chain game). When receiving the message 911, the voice input unit 31 transmits a voice signal corresponding to the message 911 to the control unit 30.

ステップ９１５にて、制御部３０は、当該音声信号を受信したことを検知すると、音声認識リクエストを音声認識部３６に送信する。 In step 915, when the control unit 30 detects that the voice signal has been received, the control unit 30 transmits a voice recognition request to the voice recognition unit 36.

ステップ９２０にて、制御部３０は、当該音声信号を受信したことを検知すると、話者モデルリスト取得リクエストをユーザ管理部３５に送信する。話者モデルリスト取得リクエストは、当該発話を与えたユーザに関連付けられている話者モデルリストにアクセスすることを要求する。 In step 920, when detecting that the voice signal has been received, the control unit 30 transmits a speaker model list acquisition request to the user management unit 35. The speaker model list acquisition request requests access to the speaker model list associated with the user who gave the utterance.

ステップ９２５にて、ユーザ管理部３５は、当該話者モデルリスト取得リクエストに応答して、話者モデルリストレスポンスを制御部３０に送信する。話者モデルリストレスポンスは、当該ユーザに関連付けられている話者モデルリストの取得結果を含む。 In step 925, the user management unit 35 transmits a speaker model list response to the control unit 30 in response to the speaker model list acquisition request. The speaker model list response includes the acquisition result of the speaker model list associated with the user.

ステップ９３０にて、制御部３０は、話者識別部３３に対して、話者識別リクエストを送信する。話者識別部３３は、話者識別リクエストの受信を検知すると、ユーザ管理部３５に保存されているデータを参照して、ステップ９１０にて発話を行なったユーザ（話者）の識別を試みる。 In step 930, the control unit 30 transmits a speaker identification request to the speaker identification unit 33. When the speaker identification unit 33 detects the reception of the speaker identification request, the speaker identification unit 33 refers to the data stored in the user management unit 35 and tries to identify the user (speaker) who made the utterance in step 910.

ステップ９３５にて、音声認識部３６は、ステップ９１５における音声認識リクエストに応答して、音声認識レスポンスを制御部３０に送信する。音声認識レスポンスは、音声認識が成功したか否かを含む。 In step 935, the voice recognition unit 36 transmits a voice recognition response to the control unit 30 in response to the voice recognition request in step 915. The voice recognition response includes whether or not the voice recognition is successful.

ステップ９４０にて、話者識別部３３は、話者識別失敗レスポンスを制御部３０に送信する。すなわち、ユーザが音声認識システムに登録されていないため、話者識別部３３は、当該発話を与えたユーザ（話者）を識別することができない。そこで、話者の識別が失敗したことを通知する話者識別失敗レスポンスが、話者識別部３３から制御部３０に送られる。 In step 940, the speaker identification unit 33 transmits a speaker identification failure response to the control unit 30. That is, since the user is not registered in the speech recognition system, the speaker identification unit 33 cannot identify the user (speaker) who gave the utterance. Therefore, a speaker identification failure response notifying that speaker identification has failed is sent from the speaker identification unit 33 to the control unit 30.
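
In steps 915 to 940, the voice recognition request is issued before the speaker identification request, and the two responses return afterwards, which suggests that the two processes may run concurrently. The following Python sketch illustrates that reading only; the source does not state how the requests are scheduled, and `recognize_speech`, `identify_speaker`, and `handle_utterance` are hypothetical stubs rather than the disclosed units.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def recognize_speech(audio: bytes) -> str:
    """Stub for the voice recognition unit 36: pretends the audio is UTF-8 text."""
    return audio.decode("utf-8")


def identify_speaker(audio: bytes, speaker_models: dict) -> Optional[str]:
    """Stub for the speaker identification unit 33: None means identification failed."""
    return speaker_models.get(audio)


def handle_utterance(audio: bytes, speaker_models: dict):
    """Issue both requests, then wait for both responses."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(recognize_speech, audio)                     # cf. step 915
        speaker_future = pool.submit(identify_speaker, audio, speaker_models)  # cf. step 930
        return text_future.result(), speaker_future.result()


# An unregistered user: recognition succeeds, identification fails (cf. step 940).
result = handle_utterance(b"shiritori", {})
```

The speaker model list is passed in as an argument here, standing in for the list obtained in steps 920 and 925.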

ステップ９４５にて、制御部３０は、話者識別失敗レスポンスの受信に応答して、対話分析・生成リクエストを対話分析・生成部３７に送信する。対話分析・生成リクエストは、音声識別結果および話者識別結果を含み得る。対話分析・生成部３７は、対話分析・生成リクエストを受信すると、当該発話を与えたユーザの名前を取得するためのメッセージを生成する。たとえば、対話分析・生成部３７は、音声認識システムにおいて予め準備されているテンプレートと、メッセージ９１１に含まれる用語「しりとり」とを用いて、メッセージ９４６（しりとりをはじめるよ。それじゃ、名前を教えてね。）を作成する。 In step 945, the control unit 30 transmits a dialog analysis / generation request to the dialog analysis / generation unit 37 in response to the reception of the speaker identification failure response. The dialog analysis / generation request may include a voice identification result and a speaker identification result. Upon receiving the dialog analysis / generation request, the dialog analysis / generation unit 37 generates a message for acquiring the name of the user who gave the utterance. For example, the dialog analysis / generation unit 37 uses a template prepared in advance in the speech recognition system and the term "shiritori" included in the message 911 to create a message 946 ("Let's start shiritori. So, tell me your name.").

ステップ９５０にて、対話分析・生成部３７は、生成したメッセージ９４６を制御部３０に送信する。制御部３０は、当該メッセージの受信を検知すると、当該発話を与えた端末の端末ＩＤと当該メッセージとを含む音声レスポンスを生成する。 In step 950, the dialog analysis / generation unit 37 transmits the generated message 946 to the control unit 30. When detecting the reception of the message, the control unit 30 generates a voice response including the terminal ID of the terminal that has given the utterance and the message.

ステップ９５５にて、制御部３０は、音声出力部３２に対して、当該音声レスポンスを送信する。音声出力部３２は、当該音声レスポンスの信号を受信すると、当該信号に基づく音声を出力する。ユーザが当該音声を認識すると、その音声に対する発話を行なう。その発話は、音声入力部３１によって受け付けられる。 In step 955, the control unit 30 transmits the voice response to the voice output unit 32. When the voice output unit 32 receives the voice response signal, the voice output unit 32 outputs voice based on the signal. When the user recognizes the voice, the user makes an utterance in response to it. The utterance is received by the voice input unit 31.

ステップ９６０にて、音声入力部３１は、受け付けたメッセージ９６１（名前登録発話）の内容を制御部３０に送信する。メッセージ９６１は、たとえば「たろうだよ」のように、メッセージ９４６に対する回答（名前）を含む。制御部３０は、メッセージ９６１の受信を検知すると、音声認識リクエストを生成する。 In step 960, the voice input unit 31 transmits the content of the accepted message 961 (name registration utterance) to the control unit 30. The message 961 includes an answer (name) to the message 946, for example, "I'm Taro". When the control unit 30 detects reception of the message 961, the control unit 30 generates a voice recognition request.

ステップ９６５にて、制御部３０は、音声認識部３６に対して音声認識リクエストを送信する。音声認識部３６は、音声認識リクエストの受信を検知すると、メッセージ９６１の音声認識処理を実行する。 In step 965, the control unit 30 transmits a voice recognition request to the voice recognition unit 36. When the voice recognition unit 36 detects the reception of the voice recognition request, the voice recognition unit 36 executes voice recognition processing of the message 961.

ステップ９７０にて、制御部３０は、ユーザ管理部３５に対して、話者モデルリスト取得リクエストを送信する。ユーザ管理部３５は、話者モデルリクエストの受信を検知すると、話者モデルリストの取得を試みる。ユーザ管理部３５は、取得を試みた結果を話者モデルリストレスポンスとして生成する。 In step 970, the control unit 30 transmits a speaker model list acquisition request to the user management unit 35. When the user management unit 35 detects the reception of the speaker model request, the user management unit 35 tries to acquire the speaker model list. The user management unit 35 generates a result of the acquisition attempt as a speaker model list response.

ステップ９７５にて、ユーザ管理部３５は、制御部３０に対して、話者モデルリストレスポンスを送信する。 In step 975, the user management unit 35 transmits a speaker model list response to the control unit 30.

ステップ９８０にて、制御部３０は、話者モデルリストレスポンスの受信に応答して、話者識別リクエストを話者識別部３３に送信する。話者識別部３３は、話者識別リクエストの受信を検知すると、話者の識別を開始し、識別結果を生成する。 In step 980, control unit 30 transmits a speaker identification request to speaker identification unit 33 in response to reception of the speaker model list response. When the speaker identification unit 33 detects reception of a speaker identification request, the speaker identification unit 33 starts speaker identification and generates an identification result.

図１０を参照して、ステップ１０１０にて、音声認識部３６は、音声認識リクエストに対する応答として、音声認識レスポンスを制御部３０に送信する。当該音声認識レスポンスは、メッセージ９６１の内容を認識できた旨を含み得る。 Referring to FIG. 10, in step 1010, the voice recognition unit 36 transmits a voice recognition response to the control unit 30 as a response to the voice recognition request. The voice recognition response may include information indicating that the content of the message 961 has been recognized.

ステップ１０１５にて、話者識別部３３は、話者識別失敗レスポンスを制御部３０に送信する。すなわち、話者（たろう）は、音声認識システムにおいて登録されていない。そこで、話者識別部３３は、話者を識別する試みが失敗したことを表すレスポンスを生成する。 In step 1015, the speaker identification unit 33 transmits a speaker identification failure response to the control unit 30. That is, the speaker (Taro) is not registered in the speech recognition system. Therefore, the speaker identification unit 33 generates a response indicating that the attempt to identify the speaker has failed.

ステップ１０２０にて、制御部３０は、対話分析・生成リクエストを対話分析・生成部３７に送信する。対話分析・生成部３７は、対話分析・生成リクエストの受信に応答して、対話のためのメッセージ１０３１を生成する。メッセージ１０３１は、たとえば「たろうさんだね。それじゃはじめるよ。最初はりんご。」のように、発話の内容および話者を識別する情報を含むメッセージとして生成される。 In step 1020, the control unit 30 transmits a dialog analysis / generation request to the dialog analysis / generation unit 37. In response to receiving the dialog analysis / generation request, the dialog analysis / generation unit 37 generates a message 1031 for the dialog. The message 1031 is generated as a message including the content of the utterance and information for identifying the speaker, for example, "You're Taro, right? Let's begin, then. The first word is 'ringo' (apple)."

ステップ１０３０にて、対話分析・生成部３７は、メッセージ１０３１を制御部３０に送信する。制御部３０は、メッセージ１０３１の受信を検知すると、端末への発話に対して応答するため、メッセージ１０３１と端末ＩＤとを含む音声レスポンスを生成する。 In step 1030, dialogue analysis / generation unit 37 transmits message 1031 to control unit 30. When detecting the reception of the message 1031, the control unit 30 generates an audio response including the message 1031 and the terminal ID in order to respond to the utterance to the terminal.

ステップ１０３５にて、制御部３０は、当該音声レスポンスを端末に送信する。端末の音声出力部３２は、音声レスポンスの信号を受信すると、当該信号に基づく音声を出力する。ユーザは、その音声を認識すると、次の応答を考えて、端末に発話する。音声入力部３１は、その発話、たとえば「ゴリラ」を受け付ける。 In step 1035, control unit 30 transmits the voice response to the terminal. When receiving the audio response signal, the audio output unit 32 of the terminal outputs audio based on the signal. When the user recognizes the voice, the user speaks to the terminal in consideration of the next response. The voice input unit 31 receives the utterance, for example, “gorilla”.

その後、しりとりのための数回のやり取りが行なわれる（ステップ１０４０以降）。 Thereafter, several exchanges for shiritori are performed (step 1040 and subsequent steps).

ステップ１０４０にて、音声入力部３１は、受け付けたメッセージ１０４１を制御部３０に送信する。制御部３０は、メッセージ１０４１の受信を検知すると、音声認識リクエストを生成する。 In step 1040, the voice input unit 31 transmits the received message 1041 to the control unit 30. When the control unit 30 detects the reception of the message 1041, the control unit 30 generates a voice recognition request.

ステップ１０４５にて、制御部３０は、音声認識リクエストを音声認識部３６に送信する。音声認識部３６は、当該リクエストを受信すると、音声認識処理を開始する。 In step 1045, the control unit 30 transmits a voice recognition request to the voice recognition unit 36. When the voice recognition unit 36 receives the request, the voice recognition unit 36 starts a voice recognition process.

ステップ１０５０にて、制御部３０は、話者音声保存・リスト取得リクエストをユーザ管理部３５に送信する。ユーザ管理部３５は、当該リクエストの受信を検知すると、話者（たろう）の識別ＩＤと、話者（たろう）の名前とを、互いに関連付けることにより保存する。さらに、ユーザ管理部３５は、話者音声の保存が成功したことを表す応答を生成する。 In step 1050, the control unit 30 transmits a speaker voice storage / list acquisition request to the user management unit 35. When detecting the reception of the request, the user management unit 35 stores the identification ID of the speaker (Taro) and the name of the speaker (Taro) in association with each other. Further, the user management unit 35 generates a response indicating that the speaker voice has been successfully stored.

ステップ１０５５にて、ユーザ管理部３５は、当該応答として、話者音声保存・リスト取得レスポンスを制御部３０に送信する。 In step 1055, the user management unit 35 transmits a speaker voice storage / list acquisition response to the control unit 30 as the response.

ステップ１０６０にて、制御部３０は、話者識別モデル学習リクエストを話者識別学習部３４に送信する。話者識別学習部３４は、当該リクエストの受信を検知すると、話者識別モデルとして、当該発話を与えたユーザに音声を関連付けてモデルを生成し、適宜、更新する。 In step 1060, control unit 30 transmits a speaker identification model learning request to speaker identification learning unit 34. When the speaker identification learning unit 34 detects the reception of the request, the speaker identification learning unit 34 generates a model by associating a voice with the user who gave the utterance as a speaker identification model, and updates it appropriately.

ステップ１０６５にて、音声認識部３６は、音声認識リクエストに基づく処理の結果を音声認識レスポンスとして制御部３０に送信する。 In step 1065, the voice recognition unit 36 transmits the processing result based on the voice recognition request to the control unit 30 as a voice recognition response.

ステップ１０７０にて、話者識別学習部３４は、話者識別モデル学習リクエストに対する応答して、話者識別学習レスポンスを制御部３０に送信する。 In step 1070, the speaker identification learning unit 34 transmits a speaker identification learning response to the control unit 30 in response to the speaker identification model learning request.

ステップ１０７５にて、制御部３０は、対話分析・生成リクエストを生成して、生成したリクエストを対話分析・生成部３７に送信する。たとえば、制御部３０は、話者の学習のために十分なデータがなく学習失敗であると判断した場合には、当該リクエストを生成する。対話分析・生成部３７は、当該リクエストの受信を検知すると、さらに学習するためのメッセージ１０８１（たとえば、「ゴリラ・・・。それじゃぁ「ラクダ」」）を生成する。 In step 1075, the control unit 30 generates a dialog analysis / generation request and transmits the generated request to the dialog analysis / generation unit 37. For example, the control unit 30 generates the request when it determines that learning has failed because there is not enough data for learning the speaker. When the dialog analysis / generation unit 37 detects the reception of the request, the dialog analysis / generation unit 37 generates a message 1081 for further learning (for example, "Gorilla... Well then, 'rakuda' (camel)").

ステップ１０８０にて、対話分析・生成部３７は、生成したメッセージ１０８１を制御部３０に送信する。制御部３０は、メッセージ１０８１を受信すると、端末ＩＤとメッセージ１０８１とを含む音声レスポンスを生成する。 In step 1080, dialogue analysis / generation unit 37 transmits generated message 1081 to control unit 30. When the control unit 30 receives the message 1081, the control unit 30 generates a voice response including the terminal ID and the message 1081.

ステップ１０８５にて、制御部３０は、生成した音声レスポンスを端末に送信する。端末は、音声レスポンスを受信すると、音声出力部３２は、音声レスポンスに基づく音声を出力する。ユーザは、端末の音声出力部３２から発せられた音声を認識すると、その次の応答を考える。予め定められた時間内にユーザが、当該次の応答を発すると、音声入力部３１は、ユーザの発話を受け付けて、当該発話に応じた音声応答を生成する。 In step 1085, control unit 30 transmits the generated voice response to the terminal. When the terminal receives the voice response, the voice output unit 32 outputs voice based on the voice response. When the user recognizes the voice emitted from the voice output unit 32 of the terminal, the user considers the next response. When the user issues the next response within a predetermined time, the voice input unit 31 accepts the user's utterance and generates a voice response corresponding to the utterance.

図１１を参照して、ステップ１１１０にて、音声入力部３１は、メッセージ１１１１（たとえば、「ダイヤモンド」）を制御部３０に送信する。制御部３０は、メッセージ１１１１の受信を検知すると、音声認識リクエストと、話者音声保存・リスト取得リクエストとを生成する。 Referring to FIG. 11, in step 1110, voice input unit 31 transmits a message 1111 (for example, “diamond”) to control unit 30. When detecting the reception of the message 1111, the control unit 30 generates a speech recognition request and a speaker speech storage / list acquisition request.

ステップ１１１５にて、制御部３０は、音声認識リクエストを音声認識部３６に送信する。音声認識部３６は、当該リクエストの受信を検知すると、メッセージ１１１１の音声認識処理を開始する。 In step 1115, the control unit 30 transmits a voice recognition request to the voice recognition unit 36. When the voice recognition unit 36 detects reception of the request, the voice recognition unit 36 starts voice recognition processing of the message 1111.

ステップ１１２０にて、制御部３０は、メッセージ１１１１と話者音声保存・リスト取得リクエストとをユーザ管理部３５に送信する。ユーザ管理部３５は、当該リクエストの受信を検知すると、メッセージ１１１１の内容（音声データ）を、ユーザ（話者）の識別ＩＤに関連付けて格納する。 In step 1120, the control unit 30 transmits a message 1111 and a speaker voice storage / list acquisition request to the user management unit 35. When detecting the reception of the request, the user management unit 35 stores the content (voice data) of the message 1111 in association with the identification ID of the user (speaker).

ステップ１１３０にて、制御部３０は、話者識別モデル学習リクエストを話者識別学習部３４に送信する。話者識別学習部３４は、当該リクエストの受信を検知すると、話者識別モデルを学習する。より具体的には、話者識別学習部３４は、ユーザの識別ＩＤと、メッセージ１１１１に含まれる音声情報（たとえば、声紋情報）とを関連付けて保存する。学習が完了すると、話者識別学習部３４は、話者識別モデルの学習が完了したことを表すレスポンスを生成する。 In step 1130, the control unit 30 transmits a speaker identification model learning request to the speaker identification learning unit 34. When the speaker identification learning unit 34 detects reception of the request, the speaker identification learning unit 34 learns a speaker identification model. More specifically, the speaker identification learning unit 34 stores the user identification ID and voice information (for example, voiceprint information) included in the message 1111 in association with each other. When the learning is completed, the speaker identification learning unit 34 generates a response indicating that the learning of the speaker identification model is completed.

ステップ１１３５にて、音声認識部３６は、音声認識処理が終わったことに応答して、音声認識処理の結果を通知する音声認識レスポンスを生成し、当該レスポンスを制御部３０に送信する。 In step 1135, in response to the completion of the voice recognition process, the voice recognition unit 36 generates a voice recognition response that notifies the result of the voice recognition process, and transmits the response to the control unit 30.

ステップ１１４０にて、話者識別学習部３４は、生成したレスポンスを制御部３０に送信する。制御部３０は、音声認識部３６からのレスポンスと話者識別学習部３４からのレスポンスとを受信すると、学習に十分なデータが揃い、学習が完了したか否かを判断する。たとえば、予め定められた数以上の音声データがユーザの識別ＩＤに関連付けられた場合には、制御部３０は、学習に十分なデータが揃い学習が完了したと判断する。 In step 1140, the speaker identification learning unit 34 transmits the generated response to the control unit 30. When the control unit 30 receives the response from the voice recognition unit 36 and the response from the speaker identification learning unit 34, the control unit 30 determines whether sufficient data for learning has been collected and learning has been completed. For example, when at least a predetermined number of voice data items are associated with the user's identification ID, the control unit 30 determines that sufficient data for learning has been collected and learning has been completed.

制御部３０は、音声認識部３６からのレスポンスと話者識別学習部３４からのレスポンスの受信の内容に基づいて、対話分析・生成リクエストを生成する。たとえば、制御部３０は、各レスポンスの結果に基づいて、音声認識が成功し、かつ、学習に十分なデータが揃い学習が完了したと判断すると、当該リクエストを生成する。学習に十分なデータとは、たとえば、予め定められた一定時間内に音声データから抽出された情報量（一定のデータサイズを有する声紋情報の個数など）が学習に必要であると規定された情報量を超えているものをいう。 The control unit 30 generates a dialog analysis / generation request based on the contents of the response received from the voice recognition unit 36 and the response received from the speaker identification learning unit 34. For example, the control unit 30 generates the request when it determines, based on the result of each response, that the speech recognition has succeeded and that sufficient data for learning has been collected and learning has been completed. Data sufficient for learning means, for example, that the amount of information extracted from the voice data within a predetermined period of time (such as the number of pieces of voiceprint information having a fixed data size) exceeds the amount of information specified as necessary for learning.
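
The completion check made by the control unit 30 can be illustrated with a short Python sketch. The function name `learning_complete` and the threshold `MIN_SAMPLES = 3` are assumptions for illustration; the source says only that a predetermined number of voice data items is required and does not give a value.

```python
MIN_SAMPLES = 3  # hypothetical value; the source only says "a predetermined number"


def learning_complete(recognition_ok: bool, num_samples: int,
                      min_samples: int = MIN_SAMPLES) -> bool:
    """Return True when speech recognition succeeded and enough voice data is stored.

    num_samples is the number of voice data items associated with the
    user's identification ID.
    """
    return recognition_ok and num_samples >= min_samples
```

When this check returns False because too little data is available, the dialog would continue (as with message 1081) so that further utterances can be collected.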

ステップ１１４５にて、制御部３０は、生成したリクエストを対話分析・生成部３７に送信する。対話分析・生成部３７は、当該リクエストの受信を検知すると、メッセージ１１１１に対するメッセージ１１５１を生成する。 In step 1145, the control unit 30 transmits the generated request to the dialog analysis / generation unit 37. When the dialog analysis / generation unit 37 detects reception of the request, the dialog analysis / generation unit 37 generates a message 1151 for the message 1111.

ステップ１１５０にて、対話分析・生成部３７は、生成したメッセージ１１５１を制御部３０に送信する。制御部３０は、メッセージ１１５１の受信を検知すると、端末ＩＤとメッセージ１１５１とを含む音声レスポンスを生成する。 In step 1150, the dialog analysis / generation unit 37 transmits the generated message 1151 to the control unit 30. When detecting the reception of the message 1151, the control unit 30 generates a voice response including the terminal ID and the message 1151.

ステップ１１５５にて、制御部３０は、端末に音声レスポンスを送信する。端末は、音声レスポンスを受信すると、音声出力部３２から音声を出力する。 In step 1155, the control unit 30 transmits the voice response to the terminal. When the terminal receives the voice response, the terminal outputs the voice from the voice output unit 32.

＜ユーザからの発話起点のシーケンス＞ <Sequence starting from a user utterance>

図１２を参照して、他の局面について説明する。図１２は、ユーザが音声認識システムに既知である場合におけるユーザ１と端末２とのやり取りのシーケンスを表す図である。なお、前述の動作と同じ動作には同じ番号を付してある。したがって、同じ動作の説明は、繰り返さない。 Another aspect will be described with reference to FIG. 12. FIG. 12 is a diagram illustrating a sequence of exchanges between the user 1 and the terminal 2 when the user is known to the speech recognition system. Operations identical to those described above are given the same numbers. Therefore, the description of the same operations will not be repeated.

ユーザが既に登録されている場合には、話者モデルが適宜更新される。したがって、常に直近のユーザの音声データに基づいた話者識別が可能となる。 If the user is already registered, the speaker model is updated as appropriate. Therefore, speaker identification based on the latest user's voice data is always possible.

ユーザ１が端末２に対して、メッセージ１０を発する。端末２は、メッセージ１０を受け付けると、音声認識処理と話者識別処理とを実行する。端末２は、話者識別処理の結果に基づいて、メッセージ１０の話者を識別できたと判断すると、その判断の結果に応じて、メッセージ１２１０を発する。メッセージ１２１０は、メッセージ１０に対する応答と、メッセージ１０の話者を確認するための問いかけとを含む。ユーザ１が、メッセージ１２１０に対するメッセージ１２２０を発すると、端末２は、メッセージ１２２０について音声認識処理と話者識別処理とを行なう。 User 1 issues message 10 to terminal 2. When the terminal 2 accepts the message 10, the terminal 2 executes voice recognition processing and speaker identification processing. If the terminal 2 determines that the speaker of the message 10 has been identified based on the result of the speaker identification process, the terminal 2 issues a message 1210 according to the determination result. Message 1210 includes a response to message 10 and an inquiry to identify the speaker of message 10. When the user 1 issues a message 1220 for the message 1210, the terminal 2 performs voice recognition processing and speaker identification processing on the message 1220.

端末２は、メッセージ１２２０の内容から、当該問いかけに対する回答が得られたと判断すると、端末２の端末ＩＤとユーザ名（たろう）とを含むデータをユーザ管理部３５に送信する。ユーザ管理部３５は、当該データを蓄積する。さらに、端末２は、メッセージ１２２０に対するメッセージ１２３０を発する。 When the terminal 2 determines from the content of the message 1220 that an answer to the question has been obtained, the terminal 2 transmits data including the terminal ID of the terminal 2 and the user name (Taro) to the user management unit 35. The user management unit 35 accumulates the data. Further, the terminal 2 utters a message 1230 in response to the message 1220.

その後、端末２は、ユーザ１からの発話を認識するたびに、端末ＩＤとユーザ名とを含むデータをユーザ管理部３５に送信する。ユーザ管理部３５は、各データを保存する。 Thereafter, each time the terminal 2 recognizes an utterance from the user 1, the terminal 2 transmits data including the terminal ID and the user name to the user management unit 35. The user management unit 35 stores each data.

話者識別学習部３４は、ユーザ管理部３５から、端末ＩＤとユーザ名とを参照して、蓄積されたデータから、当該ユーザに関連付けられたデータを読み出し、話者モデル８０を作成する。 The speaker identification learning unit 34 refers to the terminal ID and the user name from the user management unit 35, reads data associated with the user from the accumulated data, and creates a speaker model 80.

図１３および図１４を参照して、ある局面に従う音声認識システムにおけるシーケンスについて説明する。図１３および図１４は、ユーザが既知である場合に行なわれる処理の流れを表すシーケンスチャートである。なお、前述の処理と同一の処理には同一のステップ番号を付してある。したがって、同一の処理の説明は繰り返さない。 With reference to FIG. 13 and FIG. 14, the sequence in the speech recognition system according to a certain aspect will be described. 13 and 14 are sequence charts showing the flow of processing performed when the user is known. The same steps as those described above are denoted by the same step numbers. Therefore, the description of the same process will not be repeated.

ステップ１３４０にて、話者識別部３３は、話者識別が成功したことを通知するために、話者識別レスポンスを制御部３０に送信する。制御部３０は、当該レスポンスと、音声認識部３６からのレスポンスとの受信を検知すると、対話分析・生成リクエストを生成する。当該リクエストは、音声識別結果と話者識別結果とを含む。 In step 1340, the speaker identification unit 33 transmits a speaker identification response to the control unit 30 in order to notify that the speaker identification has been successful. When the control unit 30 detects the reception of the response and the response from the voice recognition unit 36, the control unit 30 generates a dialog analysis / generation request. The request includes a voice identification result and a speaker identification result.

ステップ１３４５にて、制御部３０は、対話分析・生成部３７に対して、対話分析・生成リクエストを送信する。対話分析・生成部３７は、当該リクエストの受信を検知すると、メッセージ９１１に応答するためのメッセージ１３５１を生成する。このとき、メッセージ１３５１は、メッセージ９１１に対する応答と、メッセージ９１１の発話者を確認するための問いかけとを含む。 In step 1345, control unit 30 transmits a dialog analysis / generation request to dialog analysis / generation unit 37. When the dialog analysis / generation unit 37 detects reception of the request, the dialog analysis / generation unit 37 generates a message 1351 for responding to the message 911. At this time, the message 1351 includes a response to the message 911 and an inquiry for confirming the speaker of the message 911.
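
The two opening replies described so far, a request for the name when speaker identification fails (message 946) and a confirmation question when it succeeds (message 1351), can be sketched as template filling. The template wording, the constant names, and the function `build_opening_message` are assumptions for illustration; the source states only that a template prepared in advance is combined with a term taken from the user's utterance.

```python
NAME_REQUEST_TEMPLATE = "Let's start {activity}. So, tell me your name."
NAME_CONFIRM_TEMPLATE = "Let's start {activity}. Are you {name}?"


def build_opening_message(activity, identified_name=None):
    """Build the reply to the first utterance.

    identified_name is None when speaker identification failed.
    """
    if identified_name is None:
        # Unknown speaker: ask for the name (cf. message 946).
        return NAME_REQUEST_TEMPLATE.format(activity=activity)
    # Known speaker: answer and confirm the speaker (cf. message 1351).
    return NAME_CONFIRM_TEMPLATE.format(activity=activity, name=identified_name)
```

For example, `build_opening_message("shiritori")` yields the name request, and `build_opening_message("shiritori", "Taro")` yields the confirmation question.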

ステップ１３５０にて、対話分析・生成部３７は、生成したメッセージ１３５１を制御部３０に送信する。制御部３０がメッセージ１３５１と端末ＩＤとを含む音声レスポンスを端末に送信すると、端末の音声出力部３２は、音声を発話する。ユーザは、当該音声を認識して当該音声が正しいと判断すると、たとえば「そうだよ」とのメッセージ１３６１を発する（名前登録発話）。 In step 1350, the dialog analysis / generation unit 37 transmits the generated message 1351 to the control unit 30. When the control unit 30 transmits a voice response including the message 1351 and the terminal ID to the terminal, the voice output unit 32 of the terminal outputs the voice. When the user recognizes the voice and judges its content to be correct, the user utters a message 1361, for example, "That's right" (name registration utterance).

ステップ１３６０にて、音声入力部３１は、メッセージ１３６１の入力を受け付けると、その入力に応じた音声信号を制御部３０に送信する。その後、制御部３０は、音声認識リクエストを音声認識部３６に送信する（ステップ９６５）。 In step 1360, when voice input unit 31 receives input of message 1361, voice input unit 31 transmits a voice signal corresponding to the input to control unit 30. Thereafter, the control unit 30 transmits a voice recognition request to the voice recognition unit 36 (step 965).

図１４を参照して、ステップ１４１０にて、話者識別部３３は、話者識別リクエスト（ステップ９８０）に対する応答を話者認識レスポンスとして制御部３０に送信する。ユーザが音声認識システムにとって既知である場合、話者認識レスポンスは、話者が識別されたことを表す。制御部３０は、当該レスポンスの受信を検知すると、対話分析・生成リクエストを生成する。 Referring to FIG. 14, in step 1410, the speaker identification unit 33 transmits a response to the speaker identification request (step 980) to the control unit 30 as a speaker recognition response. If the user is known to the speech recognition system, the speaker recognition response indicates that the speaker has been identified. When detecting the reception of the response, the control unit 30 generates a dialog analysis / generation request.

ステップ１４２０にて、制御部３０は、生成した対話分析・生成リクエストを対話分析・生成部３７に送信する。対話分析・生成部３７は、当該リクエストの受信を検知すると、メッセージ１４３１を生成する。メッセージ１４３１は、これまでのやり取りの結果に基づいて、メッセージ１３５１に含まれる問いかけ（たろうさんかな？）が正しかったことを踏まえた内容（やっぱり！）を含む。 In step 1420, the control unit 30 transmits the generated dialog analysis / generation request to the dialog analysis / generation unit 37. When the dialog analysis / generation unit 37 detects reception of the request, the dialog analysis / generation unit 37 generates a message 1431. Based on the results of the exchange so far, the message 1431 includes content ("I knew it!") reflecting that the question included in the message 1351 ("Are you Taro?") was correct.

ステップ１４３０にて、対話分析・生成部３７は、メッセージ１４３１を制御部３０に送信する。制御部３０は、メッセージ１４３１の受信を検知すると、端末ＩＤとメッセージ１４３１とを含む音声レスポンスを生成する。 In step 1430, dialogue analysis / generation unit 37 transmits message 1431 to control unit 30. When the control unit 30 detects reception of the message 1431, the control unit 30 generates an audio response including the terminal ID and the message 1431.

ステップ１４４０にて、制御部３０は、端末に音声レスポンスを送信する。音声出力部３２は、当該音声レスポンスに基づいて、メッセージ１４３１を音声で出力する。 In step 1440, control unit 30 transmits an audio response to the terminal. The voice output unit 32 outputs the message 1431 by voice based on the voice response.

その後、ステップ１０４０以降の処理が、前述の場合と同様に行なわれる。音声データが保存され、学習データ（たとえば、声紋情報等）は、対象ユーザの常に新しい音声データで更新される。なお、ユーザが既知の場合には、学習が完了しても、端末は、ユーザの名前を確認するための発話を行なわない。 Thereafter, the processing after step 1040 is performed in the same manner as described above. The voice data is stored, and the learning data (for example, voiceprint information) is constantly updated with new voice data of the target user. When the user is known, even when learning is completed, the terminal does not make an utterance for confirming the user's name.

<When the terminal is the starting point of utterance>
Still another aspect will be described with reference to FIGS. 15 to 17. FIG. 15 is a diagram illustrating a case in which terminal 2 speaking to user 1 triggers the dialog.

Terminal 2 speaks to the user and elicits both user utterances and the user name. The voice data obtained in this way is learned by linking it to the user name and the terminal ID.

When terminal 2 detects the presence of user 1, terminal 2 speaks to user 1. The presence of user 1 is detected based on, for example, the output of an infrared sensor, a human presence sensor, or the like. Terminal 2 utters, for example, message 1510. User 1 recognizes message 1510.

In response to message 1510, user 1 utters message 1520. When terminal 2 recognizes message 1520, terminal 2 executes speech recognition processing and speaker identification processing. Terminal 2 switches its utterance to user 1 based on the result of each process. For example, if it determines that the speaker is not known, terminal 2 generates message 1530 and outputs message 1530 by voice.

In response to message 1530, user 1 utters message 1540 toward terminal 2. Terminal 2 executes speech recognition processing and speaker identification processing on message 1540. Further, terminal 2 associates the speaker "Taro", recognized as the name of the user of terminal 2, with the terminal ID, and stores messages 1520 and 1540 received from user 1 so far in user management unit 35 as the speaker's voice data.

Further, terminal 2 generates message 1550 as a response to message 1540 and outputs message 1550 by voice.

User management unit 35 stores the voice data associated with the user "Taro" and the identification information (for example, voiceprint information) obtained from that voice data.
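The association described above, in which utterances received before the name is known are later linked to the user name and terminal ID, might be sketched as follows. This is only an illustrative sketch: the class names `UserStore` and `Session`, the terminal ID `"T-001"`, and the byte-string stand-ins for audio are assumptions and do not appear in the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class UserStore:
    """Hypothetical stand-in for user management unit 35."""
    # Maps (terminal_id, user_name) to the list of stored utterances.
    voice_data: dict = field(default_factory=dict)

    def add(self, terminal_id, user_name, utterances):
        key = (terminal_id, user_name)
        self.voice_data.setdefault(key, []).extend(utterances)


@dataclass
class Session:
    """Holds utterances whose speaker is not yet known (hypothetical)."""
    terminal_id: str
    pending: list = field(default_factory=list)

    def hear(self, audio):
        # Buffer the utterance until a user name becomes available.
        self.pending.append(audio)

    def name_learned(self, user_name, store):
        # Link everything heard so far to the name just obtained,
        # then clear the buffer.
        store.add(self.terminal_id, user_name, self.pending)
        self.pending = []


store = UserStore()
session = Session(terminal_id="T-001")
session.hear(b"audio-of-message-1520")   # utterance without a name
session.hear(b"audio-of-message-1540")   # utterance containing the name
session.name_learned("Taro", store)
```

Buffering per terminal keeps the anonymous utterances separate until the name arrives, which mirrors how messages 1520 and 1540 are stored together once "Taro" is recognized.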

The operation of the speech recognition system in one aspect will be described with reference to FIGS. 16 and 17. FIGS. 16 and 17 are sequence charts showing part of the processing performed in the speech recognition system.

In step 1610, upon detecting that a predetermined condition is satisfied, control unit 30 transmits a dialog generation request to dialog analysis / generation unit 37. The condition is, for example, that the presence of a user has been detected within the range of the speech recognition system, or that a previously specified time has arrived. The dialog generation request includes, for example, a request to generate message 1510 for speaking to the detected user. Upon detecting reception of the request, dialog analysis / generation unit 37 generates message 1510 based on a template prepared in advance.
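As a rough illustration of step 1610, the trigger check and template-based generation could look like the sketch below. The function names, the request fields, and the template text are all hypothetical; the disclosure does not specify the wording of message 1510 or the request format.

```python
from datetime import time

# Placeholder template text; the actual wording of message 1510 is not
# given in this section.
TEMPLATES = {"greeting": "Hello! Shall we talk?"}


def should_start_dialog(presence_detected, now, scheduled_times):
    """Return True if either example condition from step 1610 holds."""
    at_scheduled_time = any(
        (now.hour, now.minute) == (t.hour, t.minute) for t in scheduled_times
    )
    return presence_detected or at_scheduled_time


def build_dialog_request(terminal_id):
    """Hypothetical dialog generation request sent by the control unit."""
    return {"terminal_id": terminal_id, "template": "greeting"}


def generate_message(request, templates=TEMPLATES):
    """Hypothetical template lookup done by the generation unit."""
    return templates[request["template"]]


if should_start_dialog(True, time(8, 0), [time(7, 30)]):
    opening = generate_message(build_dialog_request("T-001"))
```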

In step 1615, dialog analysis / generation unit 37 transmits message 1510, generated in response to the request, to control unit 30. Upon detecting reception of message 1510, control unit 30 transmits a voice utterance request including message 1510 and the terminal ID to the terminal. Upon receiving the request, voice output unit 32 of the terminal outputs message 1510 by voice. When the user recognizes message 1510, the user utters message 1520 in response.

In step 1625, voice input unit 31 transmits message 1520 to control unit 30 as a voice signal. Thereafter, from step 915 to step 1345, processing similar to that described above is executed.

In step 1350, dialog analysis / generation unit 37 transmits message 1530 to control unit 30. When voice based on message 1530 is output, the user utters message 1540. Message 1540 is sent from control unit 30 to voice recognition unit 36, and speech recognition processing is executed (step 1045).

Referring to FIG. 17, the processing from step 1050 to step 1070 is executed in the same manner. Thereafter, if control unit 30 determines that there is not enough data for learning and that learning has failed, the process of step 1740 is executed. More specifically, in step 1741, control unit 30 transmits a dialog analysis / generation request to dialog analysis / generation unit 37. Upon detecting reception of the request, dialog analysis / generation unit 37 generates message 1550 corresponding to the request.

In step 1742, dialog analysis / generation unit 37 transmits message 1550 to control unit 30. Upon detecting reception of message 1550, control unit 30 generates a voice response including the terminal ID and message 1550.

On the other hand, if control unit 30 determines that enough data for learning has been collected and that learning has been completed, it executes the process of step 1750. More specifically, in step 1751, control unit 30 transmits a dialog analysis / generation request to dialog analysis / generation unit 37. Upon detecting reception of the request, dialog analysis / generation unit 37 generates message 1560 for responding to the request.

In step 1752, dialog analysis / generation unit 37 transmits message 1560 to control unit 30. Upon detecting reception of message 1560, control unit 30 generates a voice response including the terminal ID and message 1560.

In step 1760, control unit 30 transmits the voice response to the terminal. Upon receiving the voice response, voice output unit 32 outputs message 1560 by voice.
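The branch between steps 1740 and 1750 can be pictured with the sketch below. The disclosure does not state how "enough data for learning" is judged, so the sample-count threshold used here is purely an assumption, as are the function names and the dictionary layout of the voice response.

```python
def choose_reply(num_samples, required_samples):
    """Pick the message for the learning outcome.

    Assumption: learning is treated as complete once a hypothetical
    minimum number of stored utterances is reached.
    """
    learned = num_samples >= required_samples
    # Message 1550 follows a failed learning attempt (step 1740);
    # message 1560 follows completed learning (step 1750).
    message_id = 1560 if learned else 1550
    return learned, message_id


def build_voice_response(terminal_id, message_id):
    """Hypothetical voice response carrying the terminal ID and message."""
    return {"terminal_id": terminal_id, "message_id": message_id}


learned, message_id = choose_reply(num_samples=2, required_samples=5)
response = build_voice_response("T-001", message_id)
```

In either branch the control unit ends up with the same kind of voice response, so only the message selection differs between the two paths.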

<Other aspects>
Still other aspects will be described. In other aspects, the following configurations may be used.

(1) Speech recognition and voice authentication are performed in parallel. Therefore, recognition of the content of the user's utterance and authentication of the user are performed simultaneously.

(2) For each user, topics of interest to that user are estimated based on a log of the dialog content, and a dialog based on the estimated topics is generated.

(3) The utterance content of the robot (voice dialog device or voice dialog system) changes based on the number of dialogs and their frequency.

As a result of these elements, the user can feel a sense of familiarity with the robot (voice dialog system).

For example, with configuration (1), a voice dialog system to which this technical idea is applied can identify the user (voice authentication) and obtain the content of the user's speech (speech recognition) without using information from devices such as cameras or wireless tags.

Next, with configuration (2), the user's daily conversations are stored in the voice dialog system and analyzed as necessary. Based on the analysis result, the voice dialog system can acquire topics of interest to each user (sports, entertainment news, and so on) from other information providing devices and offer the user it is talking with topics suited to that user.

Further, with configuration (3), when dialogs between the voice dialog system and the user take place regularly over a long period, the way the voice dialog system expresses its utterances (wording, tone, and so on) can change according to the dialog content. As a result, the user can feel a sense of closeness to the voice dialog system (or to a voice input/output terminal, such as a robot, included in the voice dialog system). These configurations can be combined as appropriate.
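Configuration (1), in which speech recognition and voice authentication run in parallel on the same utterance, might be sketched with Python's standard `concurrent.futures` module as follows. The two worker functions are stubs with fixed return values; they are assumptions for illustration and do not represent the actual recognition or authentication processing.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_speech(audio):
    """Stub for speech recognition; returns the utterance text."""
    return "hello"


def identify_speaker(audio):
    """Stub for voice authentication; returns the identified user."""
    return "Taro"


def process_utterance(audio):
    """Run both tasks concurrently on the same audio and collect results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(recognize_speech, audio)
        speaker_future = pool.submit(identify_speaker, audio)
        return text_future.result(), speaker_future.result()


text, speaker = process_utterance(b"raw-audio-bytes")
```

Submitting both tasks before waiting on either result means neither has to finish before the other starts, which is the property configuration (1) relies on.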

<Summary>
As described above, with the speech recognition system according to the present embodiment, the user can supply the system with the voice data needed for learning simply by carrying on an ordinary voice dialog, without being conscious of any preprocessing for learning. The user can therefore easily use the functions provided by the system.

In yet another aspect, the user is authenticated without being aware of it, and topics suited to that user are output, so the user can feel a sense of closeness to the services and functions provided by the speech recognition system.

The embodiment disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims rather than by the description above, and is intended to include all modifications within the meaning and scope equivalent to the claims.

30 control unit, 31 voice input unit, 32 voice output unit, 33 speaker identification unit, 34 speaker identification learning unit, 35 user management unit, 36 voice recognition unit, 37 generation unit, 80 speaker model, 350, 400, 500 server, 410 speaker identification server, 520 speech recognition server, 600 terminal module, 610 main module, 620 speaker identification module, 630 speech recognition module.

Claims

A speech recognition device comprising:
An audio input unit for receiving an utterance including information for identifying a speaker and an utterance not including information for identifying a speaker;
A speech recognition unit for performing speech recognition processing;
An audio output unit for outputting audio; and
A control unit for controlling the speech recognition device based on the result of the speech recognition processing, wherein
The audio output unit outputs an inquiry asking for information for identifying the speaker after an utterance not including information for identifying the speaker has been obtained by playing a game with the speaker, and
The control unit generates a speaker identification model for identifying the speaker by associating the utterance not including information for identifying the speaker, issued before the inquiry, with the information for identifying the speaker included in an utterance responding to the inquiry.

The speech recognition device according to claim 1, wherein the game includes at least one of a shiritori game and a fast-paced word game.

The speech recognition device according to claim 1 or 2, wherein the control unit is configured to determine, based on the content of a response to an utterance output from the audio output unit, the content of the utterance to be output next from the audio output unit.

The speech recognition device according to claim 1, further comprising a storage device for storing a plurality of inquiries of different difficulty levels, wherein
The control unit is configured to determine, based on the content of the utterance for the game received by the audio input unit, the content of the utterance for the game to be output next from the audio output unit from among the plurality of inquiries stored in the storage device.

The speech recognition device according to claim 1, further comprising a storage unit for storing the generated speaker identification model, wherein
The control unit is configured to update the generated speaker identification model based on a response to the inquiry.

A speech recognition system comprising:
A terminal; and
A device capable of communicating with the terminal, wherein
The terminal includes:
An audio input unit for receiving an utterance including information for identifying a speaker and an utterance not including information for identifying a speaker;
An audio output unit for outputting audio; and
A communication unit that is electrically connected to the audio input unit and the audio output unit and communicates with the device,
The audio output unit is configured to output an inquiry asking for information for identifying the speaker after an utterance not including information for identifying the speaker has been obtained by playing a game with the speaker,
The device includes:
A communication unit for communicating with the terminal;
A speech recognition processing unit for performing speech recognition processing; and
A control unit for controlling the device based on the result of the speech recognition processing, and
The control unit generates a speaker identification model for identifying the speaker by associating the utterance not including information for identifying the speaker, issued before the inquiry, with the information for identifying the speaker included in an utterance responding to the inquiry.

A terminal used in the speech recognition system according to claim 6.

A method for generating a speaker identification model, comprising:
Accepting an utterance that does not include information identifying the speaker by playing a game;
Outputting an inquiry asking for information identifying the speaker;
Receiving an utterance in response to the inquiry;
Performing speech recognition processing; and
Generating a speaker identification model for identifying the speaker by associating, based on the result of the speech recognition processing, the utterance not including information identifying the speaker issued before the inquiry with the information identifying the speaker included in the utterance responding to the inquiry.