CN114360535B - Voice conversation generation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114360535B
CN114360535B
Authority
CN
China
Prior art keywords
text
voice
audio feature
input voice
input
Legal status
Active
Application number
CN202111601277.6A
Other languages
Chinese (zh)
Other versions
CN114360535A (en)
Inventor
吴文权
吴华
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111601277.6A
Publication of CN114360535A
Application granted
Publication of CN114360535B

Landscapes

  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure provides a voice dialog generation method and apparatus, an electronic device, and a storage medium, which relate to the field of computer technologies, and in particular to artificial intelligence technologies such as speech technology, natural language processing, and computer vision. A specific implementation scheme is as follows: performing voice recognition on acquired input voice to determine a first text corresponding to the input voice; performing audio feature extraction on the input voice to determine a first audio feature corresponding to the input voice; determining, according to the first audio feature and the first text, a second text and a second audio feature corresponding to a reply sentence to be generated; and generating a reply voice based on the second audio feature and the second text. Because the second text and the second audio feature are determined according to the first audio feature and the first text corresponding to the input voice, the accuracy of the determined second text is improved, and the generated reply voice better fits the emotion of the speaker corresponding to the input voice.

Description

Voice conversation generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular to artificial intelligence technologies such as speech technology, natural language processing, and computer vision, and more particularly to a voice dialog generation method and apparatus, an electronic device, and a storage medium.
Background
As artificial intelligence technology has continuously developed and matured, it has come to play an extremely important role in many fields of daily life. For example, artificial intelligence has made significant progress in the field of voice dialog. In the related art, voice information may be converted into text, and the text may be semantically analyzed to determine a reply text. Because the reply text in the related art is determined based only on the single textual feature of the voice information, the accuracy of the finally determined reply text may be low; therefore, how to improve the accuracy of reply sentences is an important research direction.
Disclosure of Invention
The disclosure provides a method and a device for generating a voice conversation, electronic equipment and a storage medium.
According to a first aspect of the present disclosure, there is provided a method for generating a voice dialog, including:
performing voice recognition on the acquired input voice to determine a first text corresponding to the input voice;
performing audio feature extraction on the input voice to determine a first audio feature corresponding to the input voice;
determining a second text and a second audio characteristic corresponding to a reply sentence to be generated according to the first audio characteristic and the first text;
generating a response voice based on the second audio feature and the second text.
According to a second aspect of the present disclosure, there is provided a generation apparatus of a voice dialog, including:
the first determining module is used for performing voice recognition on the acquired input voice so as to determine a first text corresponding to the input voice;
the second determining module is used for performing audio feature extraction on the input voice to determine a first audio feature corresponding to the input voice;
a third determining module, configured to determine, according to the first audio feature and the first text, a second text and a second audio feature corresponding to a reply sentence to be generated;
a generating module for generating a reply voice based on the second audio feature and the second text.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of generating a voice dialog according to the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method of generating a voice dialog according to the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method of generating a voice dialog according to the first aspect.
The method, the device, the electronic equipment and the storage medium for generating the voice conversation have the following beneficial effects:
in the embodiment of the disclosure, voice recognition is performed on the acquired input voice to determine a first text corresponding to the input voice; audio feature extraction is then performed on the input voice to determine a first audio feature corresponding to the input voice; a second text and a second audio feature corresponding to a reply sentence to be generated are determined according to the first audio feature and the first text; and finally, the reply voice is generated based on the second audio feature and the second text. Because the second text and the second audio feature corresponding to the reply sentence are determined according to the first audio feature and the first text corresponding to the input voice, the accuracy of the determined second text is improved, the emotional feature of the reply sentence can be determined according to the emotional feature of the input sentence, and the generated reply voice better fits the emotion of the speaker corresponding to the input voice.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart of a method for generating a voice dialog according to an embodiment of the present disclosure;
fig. 2 is a flowchart illustrating a method for generating a voice dialog according to another embodiment of the present disclosure;
fig. 3 is a flowchart illustrating a method for generating a voice dialog according to another embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an apparatus for generating a voice dialog according to an embodiment of the present disclosure;
fig. 5 is a block diagram of an electronic device for implementing a method of generating a voice dialog according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The embodiment of the disclosure relates to the technical field of artificial intelligence such as computer vision and deep learning.
Artificial Intelligence (AI) is a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence.
The key technologies of speech technology in the computer field are Automatic Speech Recognition (ASR) and Text To Speech (TTS). Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, in which voice is expected to become the most promising interaction mode, with more advantages than other interaction modes.
Natural language processing is the computer processing, understanding, and use of human languages (such as Chinese and English). It is an interdisciplinary field between computer science and linguistics, also commonly referred to as computational linguistics. Natural language is the fundamental hallmark that distinguishes human beings from other animals; without language, human thinking would be impossible to speak of. Natural language processing therefore embodies one of the highest goals of artificial intelligence: only when a computer can process natural language can the machine be said to achieve real intelligence.
Computer vision means using a camera and a computer, in place of human eyes, to identify, track, and measure targets, and to further process the resulting images so that they become more suitable for human observation or for transmission to an instrument for detection.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other processing of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
Fig. 1 is a schematic flow chart of a method for generating a voice dialog according to an embodiment of the present disclosure;
it should be noted that the main execution body of the method for generating a voice dialog in this embodiment is a device for generating a voice dialog, and the device may be implemented in software and/or hardware, and the device may be configured in an electronic device, and the electronic device may include, but is not limited to, a terminal, a server, and the like.
As shown in fig. 1, the method for generating a voice dialog includes:
s101: and performing voice recognition on the acquired input voice to determine a first text corresponding to the input voice.
The acquired input voice may be voice for which a corresponding reply needs to be generated according to the content it contains. The input voice may be a continuous piece of speech, such as a sentence or a passage.
Optionally, the input voice may be acquired through a voice acquisition device, such as a microphone, a sound sensor, or the like, and the input voice may also be read from a storage space where the voice is stored.
The first text refers to the text carried by the input voice, that is, the content of the input voice presented in text form.
In the embodiment of the disclosure, speech recognition is used to convert the speech signal corresponding to the input voice into the corresponding first text. Optionally, a Hidden Markov Model (HMM) may be used to perform speech recognition on the input voice to determine the first text corresponding to the input voice; alternatively, the acquired input voice may be compared with voices in a voice database to find a matching voice, and the utterance text corresponding to the matched voice in the database is taken as the first text corresponding to the input voice. The present disclosure is not limited thereto.
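For illustration only, the database-lookup variant described above might be sketched as follows in Python; the feature representation, the database contents, and the similarity threshold are assumptions, not part of the disclosure:

```python
import numpy as np

def recognize_by_lookup(input_features, voice_db, threshold=0.9):
    """Compare the input voice (as one acoustic feature vector) against stored
    voices and return the utterance text of the closest match, or None.

    voice_db: list of (feature_vector, utterance_text) pairs.
    """
    best_text, best_score = None, threshold
    for db_features, utterance_text in voice_db:
        # cosine similarity between the input voice and a database voice
        score = float(np.dot(input_features, db_features) /
                      (np.linalg.norm(input_features) * np.linalg.norm(db_features)))
        if score > best_score:
            best_text, best_score = utterance_text, score
    return best_text  # the first text corresponding to the input voice, if matched
```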
S102: and performing audio feature extraction on the input voice to determine a first audio feature corresponding to the input voice.
The first audio characteristic may be information such as frequency and amplitude of a voice signal corresponding to the input voice.
It should be noted that characteristics of the voice signal, such as frequency and amplitude, can reflect the emotional state of the speaker corresponding to the input voice. For example, if the frequency of the voice signal corresponding to the input voice is high, the speaker is speaking quickly and may be impatient; when the frequency of the speech signal is normal, the speaker's emotion may be relatively relaxed. When the amplitude of the speech signal is high, the speaker is speaking loudly and the emotion may be heightened; when the amplitude is low, the speaker's voice is soft and the emotion may be low.
Optionally, a fast Fourier transform may be used to perform audio feature extraction on the input voice to determine the frequency, amplitude, and the like corresponding to the input voice. Alternatively, the max function in MATLAB may be used to extract the amplitude corresponding to the input voice, and the pitch function may be used to extract the frequency of the input voice. The present disclosure is not limited thereto.
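A minimal sketch of the FFT-based extraction mentioned above, using pure NumPy; the exact feature definitions (peak amplitude, dominant spectral frequency) are simplified assumptions:

```python
import numpy as np

def extract_audio_features(samples: np.ndarray, sample_rate: int):
    """Return (amplitude, dominant_frequency) for one utterance."""
    amplitude = float(np.max(np.abs(samples)))                 # peak amplitude
    spectrum = np.abs(np.fft.rfft(samples))                     # magnitude spectrum
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    dominant_freq = float(freqs[np.argmax(spectrum[1:]) + 1])   # skip the DC bin
    return amplitude, dominant_freq
```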
S103: and determining a second text and a second audio characteristic corresponding to the reply sentence to be generated according to the first audio characteristic and the first text.
It should be noted that, in the embodiment of the present disclosure, when determining the reply sentence according to the first audio feature of the input voice and the first text, not only the second text corresponding to the reply sentence may be determined, but also the second audio feature of the reply sentence may be determined at the same time, that is, the emotion when playing the reply sentence may be determined.
Wherein the second text may be text generated from the first audio feature and the first text for answering the input speech.
The second audio feature may be the emotional feature to be used when playing the second text, determined according to the first audio feature. For example, if the first audio feature has a high frequency and a high amplitude, indicating that the emotion of the speaker corresponding to the input sentence is agitated, the second audio feature corresponding to the reply sentence may have a moderate frequency and amplitude, that is, the second text is played in a relatively relaxed intonation.
Optionally, the first audio feature and the first text may be input into a preset dialogue model to obtain a second text and a second audio feature corresponding to the reply sentence to be generated.
Or, a keyword included in the first text may be extracted first, and then the second text and the second audio feature corresponding to the reply sentence to be generated may be determined according to the keyword and the first audio feature included in the first text.
In the embodiment of the disclosure, the second text of the reply sentence is determined according to the first audio feature and the first text corresponding to the input voice, so that under the condition that the first text corresponding to the input voice is the same, if the first audio features corresponding to the input voice are different, the second text corresponding to the generated reply sentence is also different, thereby not only improving the accuracy of the reply sentence, but also enabling the determined reply sentence to be more suitable for the emotion of the speaker corresponding to the input voice, and improving the diversity of the reply text.
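As a hedged illustration of the model-based variant described above, the inference interface could look roughly as below; the `dialogue_model` object, its `predict` method, and the tagged-prompt format are hypothetical placeholders for the preset dialogue model:

```python
def generate_reply(first_text: str, first_audio: dict, dialogue_model):
    """Feed the first text together with its audio feature into the preset
    dialogue model and obtain the second text and second audio feature."""
    # encode the coarse audio feature as tags prepended to the text (assumption)
    prompt = f"[freq={first_audio['frequency']}][amp={first_audio['amplitude']}] {first_text}"
    second_text, second_audio = dialogue_model.predict(prompt)  # hypothetical API
    return second_text, second_audio
```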
S104: based on the second audio feature and the second text, a response speech is generated.
Wherein the reply speech is speech obtained by playing the second text using the second audio feature.
Optionally, a speech synthesis (Text To Speech, TTS) technology may be used to combine the second text and the second audio feature to generate the reply voice.
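A non-limiting sketch of how the second audio feature could steer synthesis; the `tts_engine` object, its `speak` call, and the numeric rate/volume mapping are assumptions, and any TTS backend exposing prosody controls could be substituted:

```python
def synthesize_reply(second_text: str, second_audio: dict, tts_engine):
    """Map the coarse second audio feature onto prosody controls and synthesize
    the reply voice; the mapping values are illustrative only."""
    rate = {"high": 1.3, "medium": 1.0, "low": 0.8}[second_audio["frequency"]]
    volume = {"high": 1.0, "medium": 0.7, "low": 0.4}[second_audio["amplitude"]]
    return tts_engine.speak(second_text, rate=rate, volume=volume)  # hypothetical call
```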
In the embodiment of the disclosure, voice recognition is performed on the acquired input voice to determine a first text corresponding to the input voice; audio feature extraction is then performed on the input voice to determine a first audio feature corresponding to the input voice; a second text and a second audio feature corresponding to a reply sentence to be generated are determined according to the first audio feature and the first text; and finally, the reply voice is generated based on the second audio feature and the second text. Because the second text and the second audio feature corresponding to the reply sentence are determined according to the first audio feature and the first text corresponding to the input voice, the accuracy of the determined second text is improved, the emotional feature of the reply sentence can be determined according to the emotional feature of the input sentence, and the generated reply voice better fits the emotion of the speaker corresponding to the input voice.
Fig. 2 is a flowchart illustrating a method for generating a voice dialog according to another embodiment of the present disclosure. As shown in fig. 2, the method for generating a voice dialog includes:
s201: and performing voice recognition on the acquired input voice to determine a first text corresponding to the input voice.
The specific implementation form of step S201 may refer to detailed descriptions in other embodiments in the present disclosure, and details are not repeated here.
S202: and performing audio feature extraction on the input voice to determine a first audio feature corresponding to the input voice.
Wherein the first audio characteristic may comprise a magnitude characteristic and a frequency characteristic.
Optionally, a second amplitude corresponding to the input speech may be determined according to a first amplitude corresponding to each frame of speech in the input speech, and then an amplitude feature corresponding to the input speech may be determined according to a range to which the second amplitude belongs.
The first amplitude may be a maximum value of amplitudes corresponding to each frame of speech.
The second amplitude may be a maximum value of the first amplitude corresponding to each frame of speech. Namely, the maximum amplitude corresponding to the input voice is used as the second amplitude corresponding to the input voice.
Wherein the amplitude characteristics may include: high amplitude, medium amplitude, low amplitude, etc., which the present disclosure does not limit. It should be noted that each amplitude feature corresponds to a different amplitude range, and in the embodiment of the present disclosure, the amplitude feature corresponding to the input speech may be determined according to the range to which the second amplitude belongs.
In the embodiment of the present disclosure, before the first amplitude corresponding to each frame of voice in the input voice is obtained, the input voice may be divided into frames, that is, segmented into segments of fixed length.
Alternatively, the first amplitude corresponding to each frame of speech in the input speech may be determined in any desirable manner, which is not limited in this disclosure. For example, each frame of voice in the input voice may be fourier transformed to obtain a first amplitude value corresponding to each frame of voice.
In the embodiment of the disclosure, the second amplitude corresponding to the input voice is determined according to the first amplitude corresponding to each frame of voice in the input voice, and the input voice is then classified as high amplitude, medium amplitude, or low amplitude according to the range to which the second amplitude belongs, so that the multiple amplitudes of the input voice can be represented by a single amplitude feature, thereby reducing the amount of computation in subsequent processing without affecting the accuracy of the subsequently obtained second text.
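A minimal sketch of the amplitude-feature computation above, assuming samples normalized to [-1, 1], a fixed frame length, and illustrative cut-off values (the actual ranges are not limited by the disclosure):

```python
import numpy as np

def amplitude_feature(samples: np.ndarray, frame_len: int = 400,
                      hi: float = 0.7, lo: float = 0.2) -> str:
    """Split the utterance into fixed-length frames, take the peak of each frame
    (first amplitudes), keep the overall maximum (second amplitude), then map it
    to a coarse level."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    first_amplitudes = np.max(np.abs(frames), axis=1)   # per-frame peaks
    second_amplitude = float(first_amplitudes.max())    # utterance-level peak
    if second_amplitude >= hi:
        return "high"
    if second_amplitude <= lo:
        return "low"
    return "medium"
```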
Optionally, pitch detection may be performed on the input voice to determine a frequency value corresponding to the voice signal, and then, according to a range to which the frequency value belongs, a frequency feature corresponding to the input voice is determined.
Wherein the frequency characteristics may include: high frequency, medium frequency, low frequency, etc., as the present disclosure does not limit. It should be noted that each frequency feature corresponds to a different frequency range, and in the embodiment of the present disclosure, the frequency feature corresponding to the input voice may be determined according to the range to which the frequency value belongs.
Optionally, pitch detection may be performed on the input voice to obtain the maximum frequency corresponding to the input voice, and the maximum frequency is used as the frequency value corresponding to the voice signal. Alternatively, the average frequency corresponding to the input voice may be used as the frequency value corresponding to the voice signal. The present disclosure is not limited thereto.
In the embodiment of the disclosure, pitch detection is performed on the input voice to determine the frequency value corresponding to the voice signal, and the input voice is then classified as high frequency, medium frequency, or low frequency according to the range to which the frequency value belongs, so that a single frequency feature can represent the multiple frequencies of the input voice, thereby reducing the amount of computation in subsequent processing without affecting the accuracy of the subsequently obtained second text.
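A rough autocorrelation-based sketch of the pitch detection and range bucketing above; the 120 Hz / 250 Hz cut-offs and the 500 Hz search bound are assumptions for illustration:

```python
import numpy as np

def frequency_feature(samples: np.ndarray, sample_rate: int,
                      hi: float = 250.0, lo: float = 120.0) -> str:
    """Crude autocorrelation pitch estimate followed by range bucketing."""
    x = samples - np.mean(samples)
    corr = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0..N-1
    min_lag = int(sample_rate / 500)                       # ignore f0 above ~500 Hz
    lag = int(np.argmax(corr[min_lag:]) + min_lag)
    f0 = sample_rate / lag                                 # estimated pitch in Hz
    if f0 >= hi:
        return "high"
    if f0 <= lo:
        return "low"
    return "medium"
```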
S203: and determining a second text corresponding to the reply sentence to be generated and an emoticon contained in the second text according to the first audio feature and the first text.
Wherein the emoticon may include: joy, surprise, fear, sadness, and the like, which the present disclosure does not limit.
It should be noted that the second text may include one emoticon or a plurality of emoticons, which is not limited in the present disclosure. For example, the first sentence in the second text may correspond to one emoticon, while the second and third sentences correspond to another emoticon.
For example, if the amplitude feature included in the first audio feature is a low amplitude and the frequency feature is a low frequency, and it is determined that the emotion of the speaker corresponding to the input voice is sad, the emoticon included in the generated second text may be a concerned (consoling) one.
It should be noted that the above examples are only simple examples, and are not intended to be specific limitations of the emoticons included in the first audio feature and the second text in the embodiments of the present disclosure.
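Purely for illustration, a rule-based mapping consistent with the example above might look like this; the table entries are assumptions, and in the disclosure the emoticon is produced together with the second text rather than by fixed rules:

```python
def pick_emoticon(frequency_level: str, amplitude_level: str) -> str:
    """Illustrative mapping from the first audio feature to the emoticon shown
    with the reply text."""
    if frequency_level == "low" and amplitude_level == "low":
        return "concerned"   # speaker sounds sad -> consoling emoticon
    if frequency_level == "high" and amplitude_level == "high":
        return "calm"        # speaker sounds agitated -> soothing emoticon
    return "joy"
```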
S204: and displaying the second text and the emoticon on a display screen of the interactive device.
The interactive device is an electronic device capable of interacting with a user. The interactive device can generate a result corresponding to the interactive request by receiving the interactive request of the user and processing the interactive request, and then display the result to the user in the forms of voice, text and the like.
In the embodiment of the disclosure, after the second text of the reply sentence corresponding to the input voice and the emoticon included in the second text are determined, the second text and the emoticon may be displayed on the display screen of the interactive device, so that the user can read the second text together with the emoticon in the display interface, thereby not only responding to the user's request but also communicating with the user effectively from multiple angles.
Optionally, in the embodiment of the present disclosure, the second text and the emoticon may be displayed on a display screen of the interactive device, and the interactive device may also play the second text in a tone corresponding to the emoticon.
For example, if the emoticon included in the second text is a concerned one, the interactive device may not only display the second text and the concerned emoticon on the display screen, but also play the second text in a comforting tone.
In the embodiment of the disclosure, voice recognition is performed on the acquired input voice to determine a first text corresponding to the input voice; audio feature extraction is then performed on the input voice to determine a first audio feature corresponding to the input voice; a second text corresponding to the reply sentence to be generated and the emoticon contained in the second text are determined according to the first audio feature and the first text; and finally, the second text and the emoticon are displayed on the display screen of the interactive device. Because the second text corresponding to the reply sentence and the emoticon contained in it are determined according to the first audio feature and the first text corresponding to the input voice, the determined second text is more accurate, and displaying the second text and the corresponding emoticon on the display screen of the interactive device enables effective multi-angle communication with the user.
Fig. 3 is a flowchart illustrating a method for generating a voice dialog according to another embodiment of the present disclosure. As shown in fig. 3, the method for generating a voice dialog includes:
s301: and performing voice recognition on the acquired input voice to determine a first text corresponding to the input voice.
S302: and performing audio feature extraction on the input voice to determine a first audio feature corresponding to the input voice.
The specific implementation forms of step S301 and step S302 may refer to detailed descriptions in other embodiments in the present disclosure, and are not described in detail here.
S303: and determining a second text and a second audio characteristic corresponding to the reply sentence to be generated according to the first audio characteristic and the first text.
In the embodiment of the present disclosure, the first audio feature and the first text may be input into a preset dialogue model to obtain a second text and a second audio feature corresponding to a reply sentence to be generated.
Optionally, the preset dialogue model may be obtained as follows: a training sample set is acquired, where the training sample set includes input texts and their corresponding audio features, as well as labeled reply texts corresponding to the input texts and their corresponding audio feature labels; each input text and its corresponding audio feature are input into an initial dialogue model to obtain a predicted reply text and a corresponding predicted audio feature output by the initial dialogue model; and the initial dialogue model is corrected according to the difference between the predicted reply text and the labeled reply text and the difference between the predicted audio feature and the audio feature label, so as to generate the preset dialogue model.
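The training procedure could be sketched roughly as follows; the `initial_model` object and its `predict`/`text_loss`/`audio_loss`/`update` methods are hypothetical stand-ins for whatever dialogue architecture is actually used:

```python
from dataclasses import dataclass

@dataclass
class DialogueSample:
    input_text: str
    input_audio: dict    # e.g. {"frequency": "low", "amplitude": "medium"}
    reply_text: str      # labeled reply text
    reply_audio: dict    # labeled audio feature tag

def train(initial_model, samples, epochs: int = 3):
    """Correction loop: compare predicted reply text / audio feature against
    the labels and let the model update itself (placeholder interface)."""
    for _ in range(epochs):
        for s in samples:
            pred_text, pred_audio = initial_model.predict(s.input_text, s.input_audio)
            text_loss = initial_model.text_loss(pred_text, s.reply_text)
            audio_loss = initial_model.audio_loss(pred_audio, s.reply_audio)
            initial_model.update(text_loss + audio_loss)   # hypothetical correction step
    return initial_model
```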
Alternatively, the training sample set may be obtained as follows: a large amount of text dialogue corpora is first mined automatically from network information and manually dubbed; audio feature extraction is then performed on the dubbed sample voice data to obtain the input texts contained in the corpora and their corresponding audio features, and the reply texts and corresponding audio feature labels are annotated.
The audio features may include frequency features and amplitude features, among others. The amplitude characteristics may include: high, medium and low amplitude values; the frequency characteristics may include: high frequency, medium frequency, and low frequency, etc.
In the embodiment of the present disclosure, after audio feature analysis is performed on sample voice data obtained by dubbing, the obtained frequencies and amplitudes may be sorted in order from large to small, and then the frequency within the first threshold range is labeled as a high frequency, the frequency within the second threshold range is labeled as a medium frequency, and the frequency within the third threshold range is labeled as a low frequency; and marking the amplitude value in the fourth threshold value range as a high amplitude value, marking the amplitude value in the fifth threshold value range as a medium amplitude value, and marking the amplitude value in the sixth threshold value range as a low amplitude value.
For example, if the frequency range corresponding to all the sample voice data is [a, b], the first threshold range may be [b − 10%×(b−a), b], i.e., the highest 10% of the frequency range is labeled as high frequency; the second threshold range may be [a + 10%×(b−a), b − 10%×(b−a)], i.e., the middle 10%–90% of the frequency range is labeled as medium frequency; and the third threshold range may be [a, a + 10%×(b−a)], i.e., the lowest 10% of the frequency range is labeled as low frequency.
Similarly, if the amplitude range corresponding to all the sample voice data is [c, d], the fourth threshold range may be [d − 10%×(d−c), d], i.e., the highest 10% of the amplitude range is labeled as a high amplitude; the fifth threshold range may be [c + 10%×(d−c), d − 10%×(d−c)], i.e., the middle 10%–90% of the amplitude range is labeled as a medium amplitude; and the sixth threshold range may be [c, c + 10%×(d−c)], i.e., the lowest 10% of the amplitude range is labeled as a low amplitude.
It should be noted that the above examples are only simple examples, and cannot be used as specific limitations of the first threshold range, the second threshold range, the third threshold range, the fourth threshold range, the fifth threshold range, the sixth threshold range, and the like in the embodiments of the present disclosure.
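For illustration, the range-based labeling in the examples above could be carried out as in the following small NumPy sketch; the 10% margin mirrors the example and is not a fixed requirement:

```python
import numpy as np

def label_levels(values: np.ndarray, margin: float = 0.10):
    """Label each corpus frequency (or amplitude) as low/medium/high using
    range-based cut-offs: the top 10% of the value range is "high", the bottom
    10% is "low", and everything in between is "medium"."""
    a, b = float(values.min()), float(values.max())
    lo_cut = a + margin * (b - a)
    hi_cut = b - margin * (b - a)
    return np.where(values >= hi_cut, "high",
                    np.where(values <= lo_cut, "low", "medium"))
```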
It can be understood that, since the preset dialogue model cannot learn every possible frequency or amplitude value, in the embodiment of the present disclosure the frequency or amplitude may be divided into different levels according to ranges, which can improve the generalization capability of the dialogue model.
S304: and acquiring a scene image corresponding to the input voice.
The scene image may include the scene in which the speaker of the input voice is located, such as a classroom, a restaurant, or a playground. Optionally, the scene image may or may not include a face image of the speaker corresponding to the input voice, which is not limited in this disclosure.
Optionally, when it is detected that the collected voice data contains user voice, an image acquisition component is started to capture the scene image corresponding to the input voice.
The image acquisition component may be any component of the interactive device with a photographing function, such as the camera assembly included in a mobile phone or a tablet device having an interactive function.
Or intercepting a scene image corresponding to the input voice from the collected video stream according to the acquisition time of the input voice.
Optionally, a video capture device included in the interactive device is used to capture a video stream in real time, store the captured video stream in a memory, and then capture a scene image corresponding to the input voice from the captured video stream according to the time for acquiring the input voice.
In the embodiment of the disclosure, under the condition that the collected voice data contains the voice of the user, a scene image corresponding to the input voice is obtained; or, according to the acquisition time of the input voice, a scene image corresponding to the input voice is intercepted from the acquired video stream, so that the acquired scene image can accurately reflect the scene where the speaker corresponding to the input voice is located.
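For illustration, the second option (cutting a frame from a recorded stream at the utterance time) could be sketched with OpenCV as follows; reading from a file path is an assumption standing in for the device's in-memory video buffer:

```python
import cv2

def grab_scene_image(video_path: str, speech_time_ms: float):
    """Return the frame closest to the moment the input voice was captured,
    or None if the stream cannot be read."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, speech_time_ms)   # seek to the utterance time
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None
```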
S305: and performing visual feature extraction on the scene image to determine the visual feature corresponding to the scene image.
The visual feature may be a scene feature contained in the scene image, such as a classroom, a restaurant, or a basketball court.
Optionally, the target detection may be performed on the scene image to obtain the type and position information of the object included in the scene image, and then the scene described by the scene image, that is, the visual feature corresponding to the scene image, is determined according to the type and position information of each object.
Or, the scene image may be automatically segmented to partition an object or a color region included in the scene image, then, the features of each image sub-block are extracted, and an index is established, so that a spatial relationship feature corresponding to each object in the scene image is obtained, and then, the scene described by the scene image is determined based on the type and the spatial relationship of each object.
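A hedged sketch of the detection-based variant; the `detector` callable and the object-to-scene rules below are illustrative assumptions, not the disclosed implementation:

```python
def scene_feature(scene_image, detector):
    """Derive a coarse scene label from the objects found in the image.

    `detector` is a placeholder for any object detector returning
    (class_name, bounding_box) pairs for the given image.
    """
    objects = [cls for cls, _box in detector(scene_image)]
    if "blackboard" in objects or "desk" in objects:
        return "classroom"
    if "dining table" in objects or "plate" in objects:
        return "restaurant"
    return "unknown"
```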
S306: the second text and/or the second audio characteristic is modified based on the visual characteristic.
It can be understood that, after the visual characteristics are determined, the second text or the second audio characteristics may be modified according to the visual characteristics, so that the modified second text or the second audio characteristics are more accurate, and further fit the emotion of the speaker corresponding to the input voice.
S307: and generating a response voice based on the modified second audio characteristic and the second text.
The specific implementation form of step S307 may refer to detailed descriptions in other embodiments of the present disclosure, and details are not repeated here.
In the embodiment of the disclosure, the voice recognition is performed on the acquired input voice to determine a first text corresponding to the input voice, then the audio feature extraction is performed on the input voice to determine a first audio feature corresponding to the input voice, then a second text and a second audio feature corresponding to a reply sentence to be generated are determined according to the first audio feature and the first text, then the second text and the second audio feature are corrected according to a visual feature corresponding to a scene image, and finally the reply voice is generated based on the corrected second audio feature and the second text. Therefore, the second text and the second audio feature generated according to the first audio feature and the first text are corrected based on the visual feature corresponding to the scene image, so that the accuracy of the corrected second text and the corrected second audio feature is further improved, the accuracy of the generated reply voice is further improved, and the reply voice is more suitable for the emotion of a speaker corresponding to the input voice.
Fig. 4 is a schematic structural diagram of a device for generating a voice dialog according to yet another embodiment of the present disclosure, and as shown in fig. 4, the device 400 for generating a voice dialog includes: a first determination module 410, a second determination module 420, a third determination module 430, and a generation module 440.
A first determining module 410, configured to perform speech recognition on the acquired input speech to determine a first text corresponding to the input speech;
the second determining module 420 is configured to perform audio feature extraction on the input voice to determine a first audio feature corresponding to the input voice;
a third determining module 430, configured to determine, according to the first audio feature and the first text, a second text and a second audio feature corresponding to the reply sentence to be generated;
a generating module 440, configured to generate the reply voice based on the second audio feature and the second text.
Optionally, the second determining module 420 is specifically configured to:
determining a second amplitude value corresponding to the input voice according to a first amplitude value corresponding to each frame of voice in the input voice;
and determining the amplitude characteristic corresponding to the input voice according to the range of the second amplitude.
Optionally, the second determining module 420 is specifically configured to:
performing fundamental tone detection on input voice to determine a frequency value corresponding to a voice signal;
and determining the frequency characteristics corresponding to the input voice according to the range of the frequency value.
Optionally, the generating module 440 includes:
the device comprises a first acquisition unit, a second acquisition unit and a control unit, wherein the first acquisition unit is used for acquiring a scene image corresponding to input voice;
the first determining unit is used for extracting visual features of the scene image to determine the visual features corresponding to the scene image;
the correcting unit is used for correcting the second text and/or the second audio characteristic according to the visual characteristic;
and the generating unit is used for generating the reply voice based on the corrected second audio characteristic and the second text.
Optionally, the first obtaining unit is specifically configured to:
starting an image acquisition assembly to acquire a scene image corresponding to input voice in response to the condition that the acquired voice data includes the voice of the user;
or intercepting a scene image corresponding to the input voice from the collected video stream according to the acquisition time of the input voice.
Optionally, the method further includes:
the fourth determining module is used for determining a second text corresponding to the reply sentence to be generated and an emoticon contained in the second text according to the first audio feature and the first text;
and the display module is used for displaying the second text and the emoticon on a display screen of the interactive equipment.
It should be noted that the explanation of the voice dialog generating method described above is also applicable to the voice dialog generating device of the present embodiment, and will not be described herein again.
In the embodiment of the disclosure, voice recognition is performed on the acquired input voice to determine a first text corresponding to the input voice; audio feature extraction is then performed on the input voice to determine a first audio feature corresponding to the input voice; a second text and a second audio feature corresponding to a reply sentence to be generated are determined according to the first audio feature and the first text; and finally, the reply voice is generated based on the second audio feature and the second text. Because the second text and the second audio feature corresponding to the reply sentence are determined according to the first audio feature and the first text corresponding to the input voice, the accuracy of the determined second text is improved, the emotional feature of the reply sentence can be determined according to the emotional feature of the input sentence, and the generated reply voice better fits the emotion of the speaker corresponding to the input voice.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the device 500 comprises a computing unit 501, which may perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 performs the various methods and processes described above, such as the generation of a voice conversation. For example, in some embodiments, the generation of voice dialogs may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by computing unit 501, one or more steps of the generation of a voice dialog described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the generation of the voice dialog by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
In this embodiment, first, voice recognition is performed on the acquired input voice to determine a first text corresponding to the input voice, then, audio feature extraction is performed on the input voice to determine a first audio feature corresponding to the input voice, then, according to the first audio feature and the first text, a second text and a second audio feature corresponding to a reply sentence to be generated are determined, and finally, the reply voice is generated based on the second audio feature and the second text. Therefore, the second text and the second audio characteristic corresponding to the reply sentence are determined according to the first audio characteristic and the first text corresponding to the input voice, so that the accuracy of the determined second text is improved, the emotional characteristic of the reply sentence can be determined according to the emotional characteristic corresponding to the input sentence, and the generated reply voice is more suitable for the emotion of the speaker corresponding to the input voice.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, "a plurality" means at least two, e.g., two, three, etc., unless explicitly and specifically limited otherwise. In the description of the present disclosure, the word "if" may be interpreted as "when", "upon", "in response to determining", or "in the case of".
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (13)

1. A method of generating a voice dialog, comprising:
performing voice recognition on the acquired input voice to determine a first text corresponding to the input voice;
performing audio feature extraction on the input voice to determine a first audio feature corresponding to the input voice;
determining a second text and a second audio feature corresponding to a reply sentence to be generated according to the first audio feature and the first text, wherein the first audio feature and the first text are input into a preset dialogue model to obtain the second text and the second audio feature corresponding to the reply sentence to be generated;
generating a response voice based on the second audio feature and the second text;
the generation method of the preset dialogue model comprises the following steps: acquiring a training sample set, wherein the training sample set comprises an input text and a corresponding audio feature, a labeled reply text corresponding to the input text and a corresponding audio feature tag, inputting the input text and the corresponding audio feature into an initial dialogue model to acquire a predicted reply text output by the initial dialogue model and a corresponding predicted audio feature, and correcting the initial dialogue model according to a difference between the predicted reply text and the labeled reply text and a difference between the predicted audio feature and the audio feature tag to generate the preset dialogue model, wherein the audio feature comprises a frequency feature and a magnitude feature;
generating a response speech based on the second audio feature and the second text, comprising:
acquiring a scene image corresponding to the input voice, wherein the scene image comprises a scene where a speaker of the input voice is located;
performing visual feature extraction on the scene image to determine a visual feature corresponding to the scene image, wherein the visual feature is a scene feature contained in the scene image;
according to the visual features, correcting the second text and/or the second audio features;
generating a reply voice based on the modified second audio feature and the second text;
the determining the visual characteristics corresponding to the scene image comprises:
the method comprises the steps of automatically segmenting the scene image, dividing an object or a color area contained in the scene image, extracting features of each image sub-block, establishing an index to obtain a spatial relationship feature corresponding to each object in the scene image, and determining a visual feature corresponding to the scene image based on the type and the spatial relationship of each object.
2. The method of claim 1, wherein the performing audio feature extraction on the input speech to determine a first audio feature corresponding to the input speech comprises:
determining a second amplitude value corresponding to the input voice according to a first amplitude value corresponding to each frame of voice in the input voice;
and determining the amplitude characteristic corresponding to the input voice according to the range to which the second amplitude belongs.
3. The method of claim 2, wherein the performing audio feature extraction on the input speech to determine a first audio feature corresponding to the input speech comprises:
performing fundamental tone detection on the input voice to determine a frequency value corresponding to a voice signal;
and determining the frequency characteristics corresponding to the input voice according to the range of the frequency value.
4. The method of claim 1, wherein the acquiring a scene image corresponding to the input voice comprises:
starting an image acquisition component to capture the scene image corresponding to the input voice in response to detecting that the monitored voice data contains a user voice;
or extracting the scene image corresponding to the input voice from a captured video stream according to the acquisition time of the input voice.
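Both acquisition paths of claim 4 can be sketched with OpenCV as follows. The energy-based check for user voice, the camera index, and the function names are assumptions added for illustration.

```python
import cv2
import numpy as np

def capture_scene_if_voice(audio_chunk: np.ndarray, energy_threshold: float = 0.02):
    """Grab one frame from the default camera when the monitored audio looks like speech."""
    if np.sqrt(np.mean(audio_chunk ** 2)) < energy_threshold:
        return None
    camera = cv2.VideoCapture(0)
    ok, frame = camera.read()
    camera.release()
    return frame if ok else None

def frame_at_voice_time(video_path: str, voice_time_ms: float):
    """Extract the frame of a recorded video stream at the acquisition time of the input voice."""
    video = cv2.VideoCapture(video_path)
    video.set(cv2.CAP_PROP_POS_MSEC, voice_time_ms)
    ok, frame = video.read()
    video.release()
    return frame if ok else None
```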
5. The method of any of claims 1-3, wherein after the determining the first audio feature corresponding to the input voice, the method further comprises:
determining, according to the first audio feature and the first text, a second text corresponding to the reply sentence to be generated and an emoticon contained in the second text;
and displaying the second text and the emoticon on a display screen of the interactive device.
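A minimal sketch of attaching an emoticon to the second text, assuming the coarse amplitude and frequency ranges from the earlier examples; the claim itself leaves the mapping from the first audio feature and the first text to the emoticon unspecified.

```python
# The feature-to-emoticon table below is a made-up placeholder, not the patent's mapping.
EMOTICON_BY_FEATURE = {
    ("high", "high"): "😄",   # loud and high-pitched -> upbeat reply
    ("low", "low"): "😔",     # quiet and low-pitched -> subdued reply
}

def attach_emoticon(second_text: str, amplitude: str, frequency: str) -> str:
    emoticon = EMOTICON_BY_FEATURE.get((amplitude, frequency), "🙂")
    return f"{second_text} {emoticon}"
```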
6. An apparatus for generating a voice dialog, comprising:
a first determining module, configured to perform voice recognition on acquired input voice to determine a first text corresponding to the input voice;
a second determining module, configured to perform audio feature extraction on the input voice to determine a first audio feature corresponding to the input voice;
a third determining module, configured to determine, according to the first audio feature and the first text, a second text and a second audio feature corresponding to a reply sentence to be generated, wherein the first audio feature and the first text are input into a preset dialogue model to obtain the second text and the second audio feature corresponding to the reply sentence to be generated;
a generating module, configured to generate a reply voice based on the second audio feature and the second text;
wherein the preset dialogue model is generated by the following steps: acquiring a training sample set, wherein the training sample set comprises an input text and a corresponding audio feature, and a labeled reply text corresponding to the input text and a corresponding audio feature tag; inputting the input text and the corresponding audio feature into an initial dialogue model to acquire a predicted reply text and a corresponding predicted audio feature output by the initial dialogue model; and correcting the initial dialogue model according to the difference between the predicted reply text and the labeled reply text and the difference between the predicted audio feature and the audio feature tag, so as to generate the preset dialogue model, wherein the audio feature comprises a frequency feature and an amplitude feature;
wherein the generating module comprises:
a first obtaining unit, configured to obtain a scene image corresponding to the input voice, wherein the scene image contains the scene in which the speaker of the input voice is located;
a first determining unit, configured to perform visual feature extraction on the scene image to determine a visual feature corresponding to the scene image, wherein the visual feature is a scene feature contained in the scene image;
a correcting unit, configured to correct the second text and/or the second audio feature according to the visual feature;
a generating unit, configured to generate the reply voice based on the corrected second audio feature and second text;
wherein the determining a visual feature corresponding to the scene image comprises:
automatically segmenting the scene image into image sub-blocks according to the objects or color regions it contains, extracting a feature from each image sub-block and establishing an index to obtain a spatial relationship feature for each object in the scene image, and determining the visual feature corresponding to the scene image based on the type and the spatial relationship of each object.
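One way to read the segmentation-and-indexing step described above is sketched below: the scene image is split into color regions, each region is described, and coarse pairwise spatial relations are recorded. The k-means clustering, the fixed number of regions, and the left-of/right-of relations are illustrative assumptions, not the patent's stated method.

```python
import cv2
import numpy as np

def scene_visual_features(image_bgr: np.ndarray, n_regions: int = 4):
    pixels = image_bgr.reshape(-1, 3).astype(np.float32)
    # Group pixels into color regions with k-means (one possible "automatic segmentation").
    _, labels, centers = cv2.kmeans(
        pixels, n_regions, None,
        criteria=(cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0),
        attempts=3, flags=cv2.KMEANS_PP_CENTERS)
    labels = labels.reshape(image_bgr.shape[:2])

    regions = []
    for k in range(n_regions):
        ys, xs = np.where(labels == k)
        if len(xs) == 0:
            continue
        regions.append({
            "mean_color_bgr": centers[k].tolist(),
            "centroid": (float(xs.mean()), float(ys.mean())),  # index used for spatial relations
            "area": int(len(xs)),
        })

    # Derive simple pairwise spatial-relationship features between regions.
    relations = []
    for i, a in enumerate(regions):
        for j in range(i + 1, len(regions)):
            horizontal = "left_of" if a["centroid"][0] < regions[j]["centroid"][0] else "right_of"
            relations.append((i, horizontal, j))
    return {"regions": regions, "relations": relations}
```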
7. The apparatus of claim 6, wherein the second determining module is specifically configured to:
determine a second amplitude value corresponding to the input voice according to a first amplitude value corresponding to each frame of voice in the input voice;
and determine the amplitude feature corresponding to the input voice according to the range to which the second amplitude value belongs.
8. The apparatus of claim 7, wherein the second determining module is specifically configured to:
perform fundamental tone detection on the input voice to determine a frequency value corresponding to the voice signal;
and determine the frequency feature corresponding to the input voice according to the range to which the frequency value belongs.
9. The apparatus of claim 6, wherein the first obtaining unit is specifically configured to:
start an image acquisition component to capture the scene image corresponding to the input voice in response to detecting that the acquired voice data contains a user voice;
or extract the scene image corresponding to the input voice from a captured video stream according to the acquisition time of the input voice.
10. The apparatus of any of claims 6-8, further comprising:
a fourth determining module, configured to determine, according to the first audio feature and the first text, a second text corresponding to a reply sentence to be generated and an emoticon included in the second text;
and a display module, configured to display the second text and the emoticon on a display screen of the interactive device.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
13. A computer program product comprising computer instructions which, when executed by a processor, implement the method of any one of claims 1-5.
CN202111601277.6A 2021-12-24 2021-12-24 Voice conversation generation method and device, electronic equipment and storage medium Active CN114360535B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111601277.6A CN114360535B (en) 2021-12-24 2021-12-24 Voice conversation generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111601277.6A CN114360535B (en) 2021-12-24 2021-12-24 Voice conversation generation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114360535A CN114360535A (en) 2022-04-15
CN114360535B true CN114360535B (en) 2023-01-31

Family

ID=81102248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111601277.6A Active CN114360535B (en) 2021-12-24 2021-12-24 Voice conversation generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114360535B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109215679A (en) * 2018-08-06 2019-01-15 百度在线网络技术(北京)有限公司 Dialogue method and device based on user emotion
CN112286366A (en) * 2020-12-30 2021-01-29 北京百度网讯科技有限公司 Method, apparatus, device and medium for human-computer interaction
CN112434139A (en) * 2020-10-23 2021-03-02 北京百度网讯科技有限公司 Information interaction method and device, electronic equipment and storage medium
CN112528004A (en) * 2020-12-24 2021-03-19 北京百度网讯科技有限公司 Voice interaction method, voice interaction device, electronic equipment, medium and computer program product
CN112786047A (en) * 2021-01-28 2021-05-11 百度在线网络技术(北京)有限公司 Voice processing method, device, equipment, storage medium and intelligent sound box
CN113434647A (en) * 2021-06-18 2021-09-24 竹间智能科技(上海)有限公司 Man-machine interaction method, system and storage medium


Also Published As

Publication number Publication date
CN114360535A (en) 2022-04-15

Similar Documents

Publication Publication Date Title
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN114416934B (en) Multi-modal dialog generation model training method and device and electronic equipment
CN110808034A (en) Voice conversion method, device, storage medium and electronic equipment
KR20220064940A (en) Method and apparatus for generating speech, electronic device and storage medium
TWI509432B (en) Electronic device and language analysis method thereof
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN109697978B (en) Method and apparatus for generating a model
CN115309877A (en) Dialog generation method, dialog model training method and device
US20220068265A1 (en) Method for displaying streaming speech recognition result, electronic device, and storage medium
CN108268443B (en) Method and device for determining topic point transfer and acquiring reply text
CN114360535B (en) Voice conversation generation method and device, electronic equipment and storage medium
CN114758649B (en) Voice recognition method, device, equipment and medium
CN112509567B (en) Method, apparatus, device, storage medium and program product for processing voice data
US20190228765A1 (en) Speech analysis apparatus, speech analysis system, and non-transitory computer readable medium
CN113808572B (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113689866B (en) Training method and device of voice conversion model, electronic equipment and medium
CN114049875A (en) TTS (text to speech) broadcasting method, device, equipment and storage medium
CN113761865A (en) Sound and text realignment and information presentation method and device, electronic equipment and storage medium
CN113221514A (en) Text processing method and device, electronic equipment and storage medium
CN112951274A (en) Voice similarity determination method and device, and program product
CN113448533B (en) Method and device for generating reminding audio, electronic equipment and storage medium
JP2021128632A (en) Information processing apparatus and information processing method
US11750689B2 (en) Speech processing method and apparatus, device, storage medium and program
US20230106550A1 (en) Method of processing speech, electronic device, and storage medium
CN113241061B (en) Method and device for processing voice recognition result, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant