CN114882862A - Voice processing method and related equipment - Google Patents
- Publication number
- CN114882862A (application number CN202210468926.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- voice
- pitch
- speech
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
- G10L13/0335—Pitch control
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/08—Speech classification or search
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
Abstract
A speech processing method, applied to the field of singing voice editing, comprises: acquiring an original speech and a second text; predicting a second pitch feature of the second text according to a first pitch feature of the non-edited speech in the original speech and information of the target text; obtaining, through a neural network, a first speech feature corresponding to the second text according to the second pitch feature and the second text; and generating a target editing speech corresponding to the second text according to the first speech feature. In this application, the pitch feature of the second text (the text to be edited) is predicted, the first speech feature of the second text is generated according to that pitch feature, and the target editing speech corresponding to the second text is generated based on the first speech feature. The pitch features of the speech before and after singing voice editing are therefore similar, so that the target editing speech sounds similar to the original speech.
Description
Technical Field
The embodiments of this application relate to the field of artificial intelligence, and in particular to a voice processing method and related equipment.
Background
Artificial intelligence (AI) is the theory, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision and reasoning, human-computer interaction, recommendation and search, basic AI theory, and the like.
At present, voice editing is of great practical significance. For example, when a user records a song (that is, sings), some content in the recording may be wrong because of a slip of the tongue. In this case, voice editing can help the user quickly correct the erroneous content in the original singing voice and generate a corrected speech. A commonly used voice editing method constructs, in advance, a database containing a large number of speech segments, acquires a segment of a pronunciation unit from the database, and replaces the erroneous segment in the original speech with that segment to produce the corrected speech.
However, this manner of voice editing depends on the diversity of the speech segments in the database. When the database contains few speech segments, the corrected speech (for example, the user's singing voice) may sound poor.
Disclosure of Invention
The embodiments of this application provide a voice processing method and related equipment, which can make the edited singing voice sound similar to the original speech and thereby improve user experience.
In a first aspect, this application provides a voice processing method, which may be applied to scenarios such as a user recording a short video or a teacher recording a lecture. The method may be performed by a speech processing device, or by a component of a speech processing device such as a processor, a chip, or a chip system. The speech processing device may be a terminal device or a cloud device. The method includes: acquiring an original speech and a second text, where the second text is the text in a target text other than a first text, the target text and the original text corresponding to the original speech both include the first text, and the speech corresponding to the first text in the original speech is the non-edited speech; predicting a second pitch feature of the second text according to a first pitch feature of the non-edited speech and information of the target text; obtaining, through a neural network, a first speech feature corresponding to the second text according to the second pitch feature and the second text; and generating a target editing speech corresponding to the second text according to the first speech feature. In this application, the pitch feature of the second text (the text to be edited) is predicted, the first speech feature of the second text is generated according to that pitch feature, and the target editing speech corresponding to the second text is generated based on the first speech feature. The pitch features of the speech before and after singing voice editing are therefore similar, so that the target editing speech sounds similar to the original speech.
In addition, the second text may be obtained in various ways. The second text may be obtained directly. Alternatively, position information (which may also be understood as mark information) indicating the position of the second text in the target text may be obtained first, and the second text is then obtained according to the position information and the target text. Alternatively, the target text and the original text may be obtained (or the target text and the original speech are obtained, and the original speech is recognized to obtain the original text), and the second text is then determined based on the original text and the target text.
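The third way of obtaining the second text, determining it from the original text and the target text, can be illustrated with an ordinary sequence diff. The following is a minimal sketch, not the method of this application: the function name `find_second_text`, the word-level tokenization and the use of Python's `difflib` are assumptions made only for illustration.

```python
from difflib import SequenceMatcher


def find_second_text(original_words, target_words):
    """Diff two token lists (hypothetical helper, for illustration only).

    Returns (first_text, edits):
      first_text: tokens shared by both texts (they map to non-edited speech).
      edits: list of (op, start, end, tokens) giving the position of each
             changed span inside the target text; a deletion has no tokens.
    """
    matcher = SequenceMatcher(a=original_words, b=target_words, autojunk=False)
    first_text, edits = [], []
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            first_text.extend(target_words[j1:j2])
        else:  # "insert", "replace" or "delete"
            edits.append((op, j1, j2, target_words[j1:j2]))
    return first_text, edits


original = "the weather is fine today".split()
target = "the weather is lovely today".split()
first_text, edits = find_second_text(original, target)
print(first_text)  # ['the', 'weather', 'is', 'today']
print(edits)       # [('replace', 3, 4, ['lovely'])]
```

Here the span `(3, 4)` plays the role of the position information, and `['lovely']` plays the role of the second text.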
In one possible implementation, generating the target editing speech corresponding to the second text based on the first speech feature includes: generating the target editing speech through a vocoder based on the first speech feature.
In this possible implementation, the vocoder converts the first speech feature into the target editing speech, so that the target editing speech has speech features similar to those of the original speech, which improves the listening experience of the user.
In one possible implementation, the content of the original voice is the singing voice of the user, and may be, for example, the voice recorded when the user sings.
In one possible implementation, acquiring the original speech and the second text includes: receiving the original speech and the second text sent by a terminal device. The method further includes: sending the target editing speech to the terminal device, where the target editing speech is used by the terminal device to generate the target speech corresponding to the target text. This can be understood as an interactive scenario in which the cloud device performs the complex computation and the terminal device performs the simple splicing operation: the cloud device obtains the original speech and the second text from the terminal device, generates the target editing speech and sends it to the terminal device, and the terminal device performs splicing to obtain the target speech.
In this possible implementation, when the speech processing device is a cloud device, on the one hand, through interaction between the cloud device and the terminal device, the cloud device performs the complex computation to obtain the target editing speech and returns it to the terminal device, which reduces the computing power and storage space required of the terminal device. On the other hand, the target editing speech corresponding to the modified text can be generated according to the speech features of the non-edited region in the original speech, and the target speech corresponding to the target text is then generated from the target editing speech and the non-edited speech.
Optionally, in a possible implementation of the first aspect, acquiring the original speech and the second text includes: receiving the original speech and the target text sent by a terminal device. The method further includes: generating the target speech corresponding to the target text based on the non-edited speech and the target editing speech, and sending the target speech to the terminal device.
In this possible implementation, the original speech and the target text sent by the terminal device are received, the non-edited speech is obtained, the first speech feature corresponding to the second text is generated according to the second speech feature of the non-edited speech, the target editing speech is obtained through the vocoder, and the target editing speech and the non-edited speech are spliced to generate the target speech. In other words, all processing is performed in the speech processing device, and only the result is returned to the terminal device. Because the cloud device performs the complex computation to obtain the target speech and returns it to the terminal device, the computing power and storage space required of the terminal device are reduced.
In one possible implementation, predicting the second pitch feature of the second text according to the first pitch feature of the non-edited speech and the information of the target text includes: predicting the second pitch feature according to the first pitch feature of the non-edited speech, the information of the target text, and a second speech feature of the non-edited speech. The second speech feature carries at least one of the following information: some or all of the speech frames of the non-edited speech; a voiceprint feature of the non-edited speech; a timbre feature of the non-edited speech; a prosodic feature of the non-edited speech; and a rhythm feature of the non-edited speech.
The first speech feature and the second speech feature have the same or similar prosody, tone and/or signal-to-noise ratio. Prosody can reflect the emotional state or speaking style of the speaker, and generally refers to features such as intonation, stress, pauses or rhythm.
In one possible implementation, the second speech feature carries a voiceprint feature of the original speech. The method for obtaining the voiceprint feature may be direct obtaining, or obtaining the voiceprint feature by recognizing the original voice.
In this possible implementation, on the one hand, introducing the voiceprint feature of the original speech means that the subsequently generated first speech feature also carries the voiceprint feature of the original speech, which increases the similarity between the target editing speech and the original speech. On the other hand, when there are multiple speakers (or users), introducing the voiceprint feature makes the subsequently predicted speech feature more similar to the voiceprint of the speaker of the original speech.
In one possible implementation, the information of the target text includes: a text embedding of each phoneme in the target text.
In one possible implementation, the target text is a text obtained by inserting the second text into the first text; or the target text is a text obtained by deleting a first part of text of the first text, and the second text is a text adjacent to the first part of text;
predicting a second pitch characteristic of the second text from a first pitch (pitch) characteristic of the non-editing speech and information of the target text, comprising:
fusing a first pitch (pitch) feature of the non-editing voice and information of the target text to obtain a first fusion result;
and inputting the first fusion result into a second neural network to obtain a second pitch feature of the second text.
In one possible implementation, the target text is obtained by replacing a second part of text in the first text with the second text;
predicting a second pitch characteristic of the second text from a first pitch (pitch) characteristic of the non-editing speech and information of the target text, comprising:
inputting the first pitch feature of the non-edited speech into a third neural network to obtain an initial pitch feature, the initial pitch feature comprising a pitch for each of a plurality of frames;
inputting the information of the target text into a fourth neural network to obtain a pronunciation feature of the second text, where the pronunciation feature indicates whether each of the plurality of frames included in the initial pitch feature is pronounced;
fusing the initial pitch feature and the pronunciation feature to obtain a second pitch feature of the second text.
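The fusion step for the replacement case can be pictured as gating a per-frame pitch value with a per-frame pronounced/not-pronounced decision. The sketch below is only one plausible reading: the text does not specify the fusion operation, so the function name `fuse_pitch_and_pronunciation`, the probability-style pronunciation feature, the 0.5 threshold and the use of 0.0 for non-pronounced frames are all assumptions. The two input lists stand in for the outputs of the third and fourth neural networks.

```python
def fuse_pitch_and_pronunciation(initial_pitch, pronounced_prob, threshold=0.5):
    """Keep the pitch of frames judged pronounced; set the rest to 0.0.

    initial_pitch:   per-frame pitch values (stand-in for the third network).
    pronounced_prob: per-frame probability that the frame is pronounced
                     (stand-in for the fourth network).
    The threshold and the 0.0 convention are assumptions for illustration.
    """
    if len(initial_pitch) != len(pronounced_prob):
        raise ValueError("both features must cover the same frames")
    return [
        pitch if prob >= threshold else 0.0
        for pitch, prob in zip(initial_pitch, pronounced_prob)
    ]


initial_pitch = [220.0, 225.0, 230.0, 228.0]
pronounced_prob = [0.9, 0.8, 0.1, 0.7]
second_pitch = fuse_pitch_and_pronunciation(initial_pitch, pronounced_prob)
print(second_pitch)  # [220.0, 225.0, 0.0, 228.0]
```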
In one possible implementation, the method further comprises:
and predicting the frame number of each phoneme in the second text according to the frame number of each phoneme in the non-edited voice and the information of the target text.
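The predicted frame number of each phoneme fixes how many frames the edited region occupies, and therefore how many per-frame pitch values are needed. As a hedged illustration (the expansion by repetition shown here is a common convention in duration-based synthesis and is an assumption, not something stated in this application), phoneme-level items can be stretched to frame level as follows; `expand_by_duration` is a hypothetical helper name.

```python
def expand_by_duration(phonemes, frame_counts):
    """Repeat each phoneme-level item once per predicted frame."""
    if len(phonemes) != len(frame_counts):
        raise ValueError("one frame count is needed per phoneme")
    frame_level = []
    for phoneme, count in zip(phonemes, frame_counts):
        frame_level.extend([phoneme] * count)
    return frame_level


phonemes = ["n", "i", "h", "ao"]
frame_counts = [3, 5, 2, 6]  # illustrative values, not real predictions
frame_level = expand_by_duration(phonemes, frame_counts)
print(len(frame_level))  # 16 frames in total
print(frame_level[:4])   # ['n', 'n', 'n', 'i']
```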
In one possible implementation, the first pitch (pitch) feature includes: a pitch characteristic of each of a plurality of frames of the non-edited speech;
the second pitch feature includes: a pitch feature of each of a plurality of frames of the target editing speech.
In one possible implementation, predicting the frame number of each phoneme in the second text according to the frame number of each phoneme in the non-edited speech and the information of the target text includes:
predicting the frame number of each phoneme in the second text according to the frame number of each phoneme in the non-edited speech, the information of the target text, and the second speech feature of the non-edited speech.
In one possible implementation, the method further includes: acquiring the position of the second text in the target text; and splicing the target editing speech and the non-edited speech based on the position to obtain the target speech corresponding to the target text. This can also be understood as replacing the edited speech in the original speech with the target editing speech, where the edited speech is the speech in the original speech other than the non-edited speech.
In this possible implementation, the target editing speech and the non-editing speech may be spliced according to the position of the second text in the target text. If the first text is all of the overlapping text in the original text and the target text, the speech of the desired text (i.e., the target text) can be generated without altering the non-edited speech in the original speech.
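The splicing described above amounts to replacing one span of the original waveform with the target editing speech while leaving the non-edited samples untouched. A minimal sketch, assuming the position of the second text has already been converted into sample indices `edit_start` and `edit_end` (these names, and the plain-list representation of audio, are assumptions for illustration):

```python
def splice_target_speech(original, edit_start, edit_end, target_editing):
    """Replace original[edit_start:edit_end] with the target editing speech.

    An insertion is the special case edit_start == edit_end; the samples
    outside the span (the non-edited speech) are copied unchanged.
    """
    if not 0 <= edit_start <= edit_end <= len(original):
        raise ValueError("edit span must lie inside the original speech")
    return original[:edit_start] + target_editing + original[edit_end:]


original = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
target_editing = [0.9, 0.9, 0.9]
target_speech = splice_target_speech(original, 2, 4, target_editing)
print(target_speech)  # [0.1, 0.2, 0.9, 0.9, 0.9, 0.5, 0.6]
```

A practical system would usually also smooth the joins (for example with a short cross-fade); that step is omitted here because the text does not describe it.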
Optionally, in a possible implementation of the first aspect, the method further includes: determining the non-edited speech based on the target text, the original text and the original speech. Specifically, this may be: determining the first text based on the target text and the original text; and determining the non-edited speech based on the first text, the original text, and the original speech.
In this possible implementation, the non-edited speech corresponding to the first text in the original speech is determined by comparing the original text with the original speech, which facilitates the subsequent generation of the first speech feature.
Optionally, in a possible implementation of the first aspect, determining the first text based on the target text and the original text includes: determining an overlapping text based on the target text and the original text; displaying the overlapping text to the user; and in response to a second operation by the user, determining the first text from the overlapping text.
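Once the first text is known, locating the non-edited speech requires some alignment between the original text and the original speech. The sketch below assumes such an alignment is already available as `(word, start_frame, end_frame)` tuples, for example from a forced aligner; the alignment format, the function name `non_edited_spans` and the frame numbers are all hypothetical and are not specified in this application.

```python
def non_edited_spans(alignment, kept_indices):
    """Return merged (start, end) frame spans for the words that are kept.

    alignment:    list of (word, start_frame, end_frame) for the original text.
    kept_indices: indices of original words that belong to the first text.
    Adjacent kept words are merged into one continuous span.
    """
    kept = set(kept_indices)
    spans = []
    for index, (_word, start, end) in enumerate(alignment):
        if index not in kept:
            continue
        if spans and spans[-1][1] == start:
            spans[-1] = (spans[-1][0], end)  # extend the previous span
        else:
            spans.append((start, end))
    return spans


alignment = [
    ("the", 0, 10), ("weather", 10, 35), ("is", 35, 42),
    ("fine", 42, 60), ("today", 60, 85),
]
spans = non_edited_spans(alignment, [0, 1, 2, 4])
print(spans)  # [(0, 42), (60, 85)]
```

The gap between the two spans, frames 42 to 60, is the edited speech that the target editing speech would replace.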
In a second aspect, the present application provides a speech processing apparatus, comprising:
an acquisition module, configured to acquire an original speech and a second text, where the second text is the text in a target text other than a first text, the target text and the original text corresponding to the original speech both include the first text, and the speech corresponding to the first text in the original speech is the non-edited speech;
a pitch prediction module, configured to predict a second pitch feature of the second text according to a first pitch feature of the non-edited speech and information of the target text;
a generating module, configured to obtain, through a neural network, a first speech feature corresponding to the second text according to the second pitch feature and the second text;
and to generate a target editing speech corresponding to the second text according to the first speech feature.
In one possible implementation, the content of the original speech is the singing voice of the user.
In one possible implementation, predicting the second pitch feature of the second text according to the first pitch feature of the non-edited speech and the information of the target text includes:
predicting the second pitch feature according to the first pitch feature of the non-edited speech, the information of the target text, and a second speech feature of the non-edited speech; the second speech feature carries at least one of the following information:
some or all of the speech frames of the non-edited speech;
a voiceprint feature of the non-edited speech;
a timbre feature of the non-edited speech;
a prosodic feature of the non-edited speech; and
a rhythm feature of the non-edited speech.
In one possible implementation, the information of the target text includes: a text embedding of each phoneme in the target text.
In one possible implementation, the target text is a text obtained by inserting the second text into the first text; or the target text is a text obtained by deleting a first part of text of the first text, and the second text is a text adjacent to the first part of text;
the pitch prediction module is specifically configured to:
fusing a first pitch (pitch) feature of the non-editing voice and information of the target text to obtain a first fusion result;
and inputting the first fusion result into a second neural network to obtain a second pitch feature of the second text.
In one possible implementation, the target text is obtained by replacing a second part of text in the first text with the second text;
the pitch prediction module is specifically configured to:
inputting the first pitch feature of the non-edited speech into a third neural network to obtain an initial pitch feature, the initial pitch feature comprising a pitch for each of a plurality of frames;
inputting the information of the target text into a fourth neural network to obtain a pronunciation feature of the second text, where the pronunciation feature indicates whether each of the plurality of frames included in the initial pitch feature is pronounced;
fusing the initial pitch feature and the pronunciation feature to obtain a second pitch feature of the second text.
In one possible implementation, the apparatus further comprises:
a duration prediction module, configured to predict the frame number of each phoneme in the second text according to the frame number of each phoneme in the non-edited speech and the information of the target text.
In one possible implementation, the first pitch feature includes: a pitch feature of each of a plurality of frames of the non-edited speech;
the second pitch feature includes: a pitch feature of each of a plurality of frames of the target editing speech.
In a possible implementation, the duration prediction module is specifically configured to:
predict the frame number of each phoneme in the second text according to the frame number of each phoneme in the non-edited speech, the information of the target text, and the second speech feature of the non-edited speech.
In one possible implementation, the obtaining module is further configured to:
acquiring the position of the second text in the target text;
and the generating module is further used for splicing the target editing voice and the non-editing voice based on the position to obtain a target voice corresponding to the target text.
A third aspect of the present application provides a speech processing apparatus that performs the method of the first aspect or any possible implementation manner of the first aspect.
A fourth aspect of the present application provides a speech processing apparatus comprising: a processor coupled to a memory for storing a program or instructions which, when executed by the processor, cause the speech processing device to carry out the method of the first aspect described above or any possible implementation manner of the first aspect.
A fifth aspect of the present application provides a computer-readable medium having stored thereon a computer program or instructions which, when run on a computer, cause the computer to perform the method of the first aspect or any possible implementation manner of the first aspect.
A sixth aspect of the present application provides a computer program product which, when executed on a computer, causes the computer to perform the method of the preceding first aspect or any possible implementation manner of the first aspect.
Drawings
FIG. 1 is a block diagram of a system architecture provided herein;
FIG. 2 is a schematic diagram of a convolutional neural network structure provided in the present application;
FIG. 3 is a schematic diagram of another convolutional neural network structure provided in the present application;
FIG. 4 is a schematic diagram of a chip hardware structure provided in the present application;
FIG. 5 is a schematic flow chart diagram of a method for training a neural network provided herein;
FIG. 6 is a schematic diagram of a neural network according to the present disclosure;
FIG. 7a is a schematic flow chart of a speech processing method provided in the present application;
FIG. 7b is a schematic diagram of duration prediction provided herein;
FIG. 7c is a schematic illustration of pitch prediction provided herein;
FIG. 7d is a schematic illustration of pitch prediction provided by the present application;
FIGS. 8-10 are several schematic diagrams of a speech processing device display interface provided herein;
FIG. 11 is a schematic structural diagram of a bidirectional decoder provided in the present application;
FIG. 12 is another schematic view of a speech processing device display interface provided herein;
FIG. 13 is another schematic flow chart of a speech processing method provided herein;
FIGS. 14-16 are schematic diagrams of several configurations of the speech processing device provided by the present application.
Detailed Description
The embodiments of the present application provide a speech processing method and related equipment, with which the edited speech sounds similar to the original speech, thereby improving user experience.
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For ease of understanding, the relevant terms and concepts to which the embodiments of the present application relate generally will be described below.
1. Neural network
A neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes $X_s$ and an intercept of 1 as inputs, and the output of the arithmetic unit may be:
$$h_{W,b}(X) = f\left(W^{T}X\right) = f\left(\sum_{s=1}^{n} W_s X_s + b\right)$$
where s = 1, 2, ..., n, n is a natural number greater than 1, $W_s$ is the weight of $X_s$, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of the previous layer to extract the features of that local receptive field, and the local receptive field may be a region composed of several neural units.
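The single neural unit described above can be sketched in a few lines. This is only an illustrative sketch, assuming the sigmoid activation mentioned in the text; the function names and the numeric inputs, weights, and bias are hypothetical and are not taken from the application.

```python
import math

def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neural_unit(x, w, b):
    """Output of one neural unit: f(sum_s w[s] * x[s] + b)."""
    z = sum(w_s * x_s for w_s, x_s in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical inputs, weights, and bias, chosen only for illustration.
# The weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0.
output = neural_unit(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
```

Feeding `output` as one of the inputs of another `neural_unit` call corresponds to joining units into a network.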
2. Deep neural network
A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network having many hidden layers, where "many" has no particular metric. Dividing a DNN by the position of its layers, the layers inside the DNN fall into three categories: input layer, hidden layer, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected, that is, any neuron of the i-th layer is connected to every neuron of the (i+1)-th layer. Of course, a deep neural network may also contain no hidden layer, which is not limited herein.
The operation of each layer in a deep neural network can be described by the mathematical expression $\vec{y} = \alpha(W \cdot \vec{x} + b)$. At the physical level, the work of each layer can be understood as completing the transformation from the input space to the output space (i.e., from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/reducing the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are performed by $W \cdot \vec{x}$, operation 4 is performed by $+b$, and operation 5 is performed by $\alpha()$. The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the collection of all individuals of that class. W is a weight vector, and each value in the vector represents the weight of one neuron in that layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, i.e., the weight W of each layer controls how the space is transformed. The purpose of training a deep neural network is to finally obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors W of the many layers). Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
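As a rough illustration of the per-layer operation described above (a linear map, a translation by the bias, then a nonlinear "bending"), the following sketch applies one layer to one input vector. ReLU is used as the nonlinearity purely as an assumption, since the text does not fix a particular function, and all values are hypothetical.

```python
def relu(vector):
    """Element-wise nonlinearity (the "bending" step); ReLU is an assumed choice."""
    return [max(0.0, element) for element in vector]

def dense_layer(W, x, b, activation=relu):
    """Compute activation(W . x + b) for one layer.

    W is a list of rows, so a 3 x 2 matrix raises the dimension from 2 to 3.
    """
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    return activation(z)

# Hypothetical weights and bias, for illustration only.
W = [[1.0, 0.0],
     [0.0, -1.0],
     [1.0, 1.0]]
x = [2.0, 3.0]
b = [0.5, 0.5, -10.0]
y = dense_layer(W, x, b)  # pre-activations are [2.5, -2.5, -5.0]
```

Stacking several such calls, each with its own `W` and `b`, gives the multi-layer structure, and training amounts to choosing those matrices.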
3. Convolutional neural network
A convolutional neural network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter, and the convolution process may be viewed as convolving a trainable filter with an input image or a convolutional feature plane (feature map). A convolutional layer is a layer of neurons in the convolutional neural network that performs convolution processing on the input signal. In a convolutional layer, one neuron may be connected to only some of the neurons in adjacent layers. A convolutional layer usually contains several feature planes, and each feature plane may be composed of neural units arranged in a rectangle. The neural units of the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights can be understood as extracting image information in a way that is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of other parts, which means that image information learned in one part can also be used in another part, so the same learned image information can be used for all positions on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information; generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and can acquire reasonable weight through learning in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting. The networks such as the separation network, the identification network, the detection network, and the depth estimation network in the embodiment of the present application may all be CNNs.
4. Recurrent Neural Network (RNN)
In a traditional neural network model, adjacent layers are fully connected, while the nodes within each layer are not connected to one another. Such an ordinary neural network cannot solve many problems. For example, predicting the next word of a sentence generally requires the preceding words, because the words in a sentence are not independent of one another. In a recurrent neural network (RNN), the current output of a sequence also depends on the previous outputs. Specifically, the network memorizes the previous information, stores it in the internal state of the network, and applies it to the calculation of the current output.
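The idea that the network keeps an internal state and applies it to the current output can be sketched with a scalar recurrent step. This is a minimal sketch assuming a tanh recurrence with hypothetical scalar weights `w_x`, `w_h` and bias `b`; it is not the network of the application.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent step: the new state depends on the input and the previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Process a sequence, carrying the internal state from step to step."""
    h = 0.0  # initial internal state
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

# The same input value at different positions gives different outputs,
# because the stored state differs at each step.
states = run_rnn([1.0, 1.0, 1.0])
```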
5. Loss function
In the process of training a deep neural network, the output of the network is expected to be as close as possible to the value that is actually desired. Therefore, the weight vector of each layer can be updated according to the difference between the predicted value of the current network and the desired target value (an initialization process usually takes place before the first update, i.e., parameters are configured in advance for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted so that the prediction becomes lower, and the adjustment continues until the neural network can predict the desired target value. It is therefore necessary to define in advance how to compare the difference between the predicted value and the target value. This is the role of the loss function or objective function, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing the loss as much as possible.
6. Text to speech
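The adjust-until-close procedure described above can be illustrated with a one-weight model. The sketch below assumes a squared-error loss and plain gradient descent, which are common choices but are not stated in the text; the model `pred = w * x`, the learning rate, and the data are hypothetical.

```python
def squared_error(pred, target):
    """Loss: a larger value indicates a larger difference between prediction and target."""
    return (pred - target) ** 2

def train(x, target, w=0.0, learning_rate=0.1, steps=50):
    """Repeatedly adjust the weight w of the model pred = w * x to reduce the loss."""
    losses = []
    for _ in range(steps):
        pred = w * x
        losses.append(squared_error(pred, target))
        gradient = 2.0 * (pred - target) * x  # d(loss)/dw
        w -= learning_rate * gradient         # lower w if pred is too high, raise it if too low
    return w, losses

# With x = 2 and target = 6, the weight should approach 3.
w, losses = train(x=2.0, target=6.0)
```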
Text To Speech (TTS) is a program or software system that converts text to speech.
7. Vocoder
A vocoder is a sound signal processing module or software that encodes acoustic features to generate sound waveforms.
8. Pitch
When the sounding body sounds due to vibration, the sound can be generally decomposed into a plurality of pure sine waves, that is, all natural sounds are basically composed of a plurality of sine waves with different frequencies, wherein the sine wave with the lowest frequency is a fundamental tone (i.e., the fundamental frequency, which can be represented by F0), and the other sine waves with higher frequencies are overtones.
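To make the definition above concrete, the following sketch builds a signal from three sine waves and reports the lowest-frequency component as the fundamental (F0). It only illustrates the definition and is not a robust pitch tracker; the frequencies, amplitudes, sample rate, and detection threshold are all hypothetical values.

```python
import cmath
import math

def synthesize(partials, sample_rate, num_samples):
    """Sum of pure sine waves; partials is a list of (frequency_hz, amplitude)."""
    return [sum(a * math.sin(2.0 * math.pi * f * t / sample_rate) for f, a in partials)
            for t in range(num_samples)]

def dft_magnitude(signal, k):
    """Normalized magnitude of DFT bin k (naive O(n) evaluation per bin)."""
    n = len(signal)
    return abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                   for t, s in enumerate(signal))) / n

def lowest_partial(signal, sample_rate, max_hz, threshold=0.1):
    """Frequency of the lowest bin whose magnitude exceeds the threshold."""
    n = len(signal)
    for k in range(1, int(max_hz * n / sample_rate) + 1):
        if dft_magnitude(signal, k) > threshold:
            return k * sample_rate / n
    return None

# Fundamental at 220 Hz with overtones at 440 Hz and 660 Hz (illustrative values).
signal = synthesize([(220.0, 1.0), (440.0, 0.5), (660.0, 0.3)],
                    sample_rate=8000, num_samples=800)
f0 = lowest_partial(signal, sample_rate=8000, max_hz=1000)
```

With 800 samples at 8000 Hz each bin is 10 Hz wide, so all three components fall exactly on a bin.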
9. Prosody
In the field of speech synthesis, prosody broadly refers to features that control the functions of intonation, pitch, accent emphasis, pause, and tempo. Prosody may reflect the emotional state of the speaker or the form of speech, etc.
10. Phoneme
Phoneme (phone): the smallest phonetic unit, divided according to the natural attributes of speech. It is analyzed according to the articulatory actions within a syllable, with one action forming one phoneme. Phonemes are divided into two major categories, vowels and consonants. For example, the Chinese syllable "a" (e.g., in the first tone) has only one phoneme, "ai" (e.g., in the fourth tone) has two phonemes, and "dai" (e.g., in the first tone) has three phonemes.
11. Word vector (embedding)
The word vector may also be referred to as "word embedding", "vectorization", "vector mapping", "embedding", and the like. Formally, a word vector represents an object with a dense vector.
12. Speech features
Speech features: the processed speech signal is converted into a compact and logical representation that is more discriminative and reliable than the raw signal. After a segment of speech signal is acquired, speech features may be extracted from it. The extraction method usually extracts one multi-dimensional feature vector for each speech signal. There are many parametric representations of speech signals, for example perceptual linear prediction (PLP), linear predictive coding (LPC), and Mel-frequency cepstral coefficients (MFCC).
13. transformer layer
The neural network comprises an embedding layer and at least one transformer layer. The at least one transformer layer may be N transformer layers (N is an integer greater than 0), where each transformer layer comprises an attention layer, an add & norm layer, a feed forward layer, and another add & norm layer that are sequentially adjacent. In the embedding layer, embedding processing is performed on the current input to obtain a plurality of feature vectors. In the attention layer, P input vectors are acquired from the layer preceding the transformer layer; taking any first input vector of the P input vectors as a center, an intermediate vector corresponding to the first input vector is obtained based on the degree of association between the first input vector and each input vector within a preset attention window, so that P intermediate vectors corresponding to the P input vectors are determined. In the pooling layer, the P intermediate vectors are combined into Q output vectors, and the output vectors obtained by the last transformer layer among the transformer layers are used as the feature representation of the current input.
Next, each step described above will be specifically described with reference to specific examples.
Firstly, embedding processing is carried out on the current input in the embedding layer to obtain a plurality of feature vectors.
The embedding layer may be referred to as an input embedding layer. The current input may be a text input, for example a piece of text or a sentence. The text may be Chinese text, English text, or text in another language. After the current input is obtained, the embedding layer may perform embedding processing on each word in the current input to obtain a feature vector of each word. In some embodiments, the embedding layer includes an input embedding layer and a positional encoding layer. In the input embedding layer, word embedding processing may be performed on each word in the current input to obtain a word embedding vector for each word. In the positional encoding layer, the position of each word in the current input may be obtained, and a position vector may be generated for the position of each word. In some examples, the position of a word may be its absolute position in the current input: the first word of the current input is represented as the first position, the second word as the second position, and so on. In some examples, the position of a word may be its position relative to the other words: the first word is represented as being before the second word, the second word as being after the first word and before the third word, and so on. Once the word embedding vector and the position vector of each word in the current input are obtained, the position vector of each word may be combined with the corresponding word embedding vector to obtain the feature vector of each word, which yields the plurality of feature vectors corresponding to the current input. The plurality of feature vectors may be represented as an embedding matrix having a preset dimension.
If the number of feature vectors is M and the preset dimension is H, the plurality of feature vectors can be represented as an M × H embedding matrix.
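The construction of the M × H matrix can be sketched as follows. The text only says that the word embedding vector and the position vector are "combined"; element-wise addition is assumed here as one common way to combine them. The lookup tables, token names, and values are hypothetical, and absolute positions are used.

```python
def build_feature_matrix(tokens, word_table, position_table):
    """Return an M x H matrix with one feature vector per token.

    Each feature vector is the word embedding vector plus the position
    vector of the token's absolute position (addition is an assumption).
    """
    matrix = []
    for position, token in enumerate(tokens):
        word_vector = word_table[token]
        position_vector = position_table[position]
        matrix.append([w + p for w, p in zip(word_vector, position_vector)])
    return matrix

# Hypothetical lookup tables with H = 3.
word_table = {"a": [1.0, 0.0, 2.0], "b": [0.0, 1.0, 0.0]}
position_table = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]

# M = 3 tokens; the token "a" appears at two different positions.
features = build_feature_matrix(["a", "b", "a"], word_table, position_table)
```

Because the position vector differs, the two occurrences of the same token receive different feature vectors.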
14. Attention mechanism (attention mechanism)
The attention mechanism simulates the internal process of biological observation behavior, that is, a mechanism that aligns internal experience with external sensation to increase the fineness of observation of certain regions, and it can use limited attention resources to rapidly screen out high-value information from a large amount of information. The attention mechanism can quickly extract the important features of sparse data and is therefore widely used for natural language processing tasks, especially machine translation. The self-attention mechanism is an improvement on the attention mechanism that reduces the dependence on external information and is better at capturing the internal correlation of data or features. The essential idea of the attention mechanism can be written as the following formula:
$$\mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} \mathrm{Similarity}(\mathrm{Query}, \mathrm{Key}_i) \cdot \mathrm{Value}_i$$
where $L_x$ denotes the length of Source. The formula means that the constituent elements of Source are imagined as a series of (Key, Value) data pairs. Given an element Query in Target, the weight coefficient of the Value corresponding to each Key is obtained by calculating the similarity or correlation between the Query and that Key, and the Values are then weighted and summed to obtain the final attention value. So, in essence, the attention mechanism performs a weighted summation of the Values of the elements in Source, and Query and Key are used to calculate the weight coefficients of the corresponding Values. Conceptually, attention can be understood as selectively screening out a small amount of important information from a large amount of information and focusing on it, while ignoring most of the unimportant information. The focusing process is embodied in the calculation of the weight coefficients: the greater the weight, the more the focus falls on the corresponding Value; that is, the weight represents the importance of the information, and the Value is the corresponding information. The self-attention mechanism may be understood as internal attention (intra attention). The attention mechanism occurs between the Target element Query and all elements in Source, whereas the self-attention mechanism occurs among the elements within Source or among the elements within Target, and may also be understood as the attention calculation in the special case where Target = Source. The specific calculation process is the same; only the calculation object changes.
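The weighted summation over Values described above can be sketched as follows. The similarity function is not specified in the text; a dot product followed by softmax normalization is assumed here as one common choice, and the Query, Keys, and Values are hypothetical.

```python
import math

def attention(query, keys, values):
    """Weighted sum of values, with weights derived from query-key similarity.

    Similarity is an assumed dot product, normalized with softmax so that
    the weight coefficients sum to 1.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    max_score = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

# Hypothetical (Key, Value) pairs forming the Source, and one Query from the Target.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
output, weights = attention([1.0, 0.0], keys, values)
```

Passing the same sequence as the source of the queries, keys, and values corresponds to the self-attention case.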
At present, speech editing is used in more and more scenarios, for example a scenario in which a user records a song. To repair erroneous content in the original speech caused by a slip of the tongue, speech editing is generally used. The current speech editing method acquires a speech segment from a database, replaces the erroneous content with that speech segment, and generates the corrected speech.
However, this method depends too heavily on the speech segments stored in the database. If a speech segment differs greatly from the original speech, the corrected speech is incoherent and its prosody is unnatural, so the corrected speech sounds poor. Although singing voice editing is very similar to the speech editing scenario, singing voice data, unlike relatively stable speaking voice, varies more in dimensions such as pronunciation duration, voice energy, and pitch, so existing speech editing technology is difficult to apply directly to singing voice editing.
To solve the above problems, the present application provides a speech editing method. During singing voice editing, the pitch feature affects how similar the target edited speech sounds to the original speech. The present application therefore predicts the pitch feature of the second text (the text to be edited), generates a first speech feature of the second text according to the pitch feature, and generates the target edited speech corresponding to the second text based on the first speech feature, so that the pitch features of the speech before and after singing voice editing are similar, and the target edited speech in turn sounds similar to the original speech.
First, a system architecture provided in the embodiments of the present application is described.
Referring to FIG. 1, an embodiment of the present application provides a system architecture 10. As shown in the system architecture 10, the data collection device 16 is configured to collect training data, which in the embodiment of the present application includes training speech and training text corresponding to the training speech, and to store the training data in the database 13. The training device 12 trains the target model/rule 101 based on the training data maintained in the database 13. How the training device 12 obtains the target model/rule 101 from the training data is described in more detail below. The target model/rule 101 can be used to implement the speech processing method provided in the embodiment of the present application; that is, a text is input into the target model/rule 101 after relevant preprocessing, and the speech feature of the text is obtained. The target model/rule 101 in the embodiment of the present application may specifically be a neural network. It should be noted that, in practical applications, the training data maintained in the database 13 is not necessarily all collected by the data collection device 16 and may also be received from other devices. It should also be noted that the training device 12 does not necessarily train the target model/rule 101 entirely on the training data maintained in the database 13, and may also obtain training data from the cloud or elsewhere for model training.
The target model/rule 101 obtained through training by the training device 12 may be applied to different systems or devices, for example the execution device 11 shown in FIG. 1. The execution device 11 may be a terminal, such as a mobile phone, a tablet computer, a notebook computer, an AR/VR device, or a vehicle-mounted terminal, and may also be a server or a cloud. In FIG. 1, the execution device 11 is configured with an I/O interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through the client device 14. In the embodiment of the present application, the input data may include the second speech feature, the target text, and the tag information; alternatively, the input data may include the second speech feature and the second text. In addition, the input data may be input by the user, may be uploaded by the user through another device, or may come from a database, which is not limited herein.
If the input data includes the second speech feature, the target text, and the tag information, the preprocessing module 113 is configured to perform preprocessing according to the target text and the tag information received by the I/O interface 112. In this embodiment, the preprocessing module 113 may be configured to determine the target edit text in the target text based on the target text and the tag information. If the input data includes the second speech feature and the second text, the preprocessing module 113 is configured to perform preprocessing according to the target text and the tag information received by the I/O interface 112, for example to prepare for converting the target text into phonemes.
In the process that the execution device 11 performs preprocessing on the input data or in the process that the calculation module 111 of the execution device 11 performs calculation and the like, the execution device 11 may call data, codes and the like in the data storage system 15 for corresponding processing, and may store data, instructions and the like obtained by corresponding processing into the data storage system 15.
Finally, the I/O interface 112 returns the processing results, such as the first speech characteristic obtained as described above, to the client device 14 for presentation to the user.
It should be noted that the training device 12 may generate corresponding target models/rules 101 for different targets or different tasks based on different training data, and the corresponding target models/rules 101 may be used to achieve the targets or complete the tasks, so as to provide the user with the required results or provide the input for the subsequent other processes.
In the case shown in FIG. 1, the user may manually provide the input data, and this may be done through an interface provided by the I/O interface 112. Alternatively, the client device 14 may automatically send the input data to the I/O interface 112. If automatic sending of the input data by the client device 14 requires the user's authorization, the user may set the corresponding permission in the client device 14. The user can view the result output by the execution device 11 on the client device 14, and the result may be presented in the form of display, sound, action, or the like. The client device 14 may also serve as a data collection terminal that collects the input data input to the I/O interface 112 and the output results output from the I/O interface 112, as shown in the figure, as new sample data and stores the new sample data in the database 13. Of course, the collection need not go through the client device 14; the I/O interface 112 may instead directly store the input data input to the I/O interface 112 and the output results output from the I/O interface 112, as shown in the figure, in the database 13 as new sample data.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 1, the data storage system 15 is an external memory with respect to the execution device 11, and in other cases, the data storage system 15 may also be disposed in the execution device 11.
As shown in FIG. 1, the target model/rule 101 is obtained through training by the training device 12. In the embodiment of the present application, the target model/rule 101 may be a neural network; specifically, the neural network may be a recurrent neural network, a long short-term memory network, or the like, and the prediction network may be a convolutional neural network, a recurrent neural network, or the like.
Optionally, the neural network and the prediction network in the embodiment of the present application may be two separate networks, or may be a multi-task neural network, where one task is outputting the duration, one task is predicting the pitch characteristic, and the other task is outputting the speech characteristic.
Since CNN is a very common neural network, the structure of CNN will be described in detail below with reference to fig. 2. As described in the introduction of the basic concept, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, and the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images input thereto.
As shown in FIG. 2, the convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120, and a neural network layer 130, where the pooling layer is optional.
Convolutional layer/pooling layer 120:
Convolutional layer:
As shown in FIG. 2, the convolutional layer/pooling layer 120 may include, for example, layers 121-126. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer. In another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
Taking convolutional layer 121 as an example, convolutional layer 121 may include a plurality of convolution operators, also called kernels. In image processing, a convolution operator acts as a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved across the input image in the horizontal direction one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), so as to extract a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends over the entire depth of the input image. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension. In most cases, however, a single weight matrix is not used; instead, a plurality of weight matrices of the same dimensions are applied, and the outputs of the weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features of the image: for example, one weight matrix extracts image edge information, another extracts a specific color of the image, and yet another blurs unwanted noise in the image. Because the weight matrices have the same dimensions, the feature maps they extract also have the same dimensions, and the extracted feature maps are combined to form the output of the convolution operation.
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 100 to make correct prediction.
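The sliding of a weight matrix over the image with a given stride, and the stacking of the outputs of several weight matrices, can be sketched as follows. This is a simplified sketch assuming a single-channel image (depth 1) and no padding; as is conventional in deep learning, the kernel is not flipped. The image and the two kernels are hypothetical, whereas in practice the weights are obtained through training.

```python
def conv2d(image, kernel, stride=1):
    """Slide one weight matrix over a single-channel image and return one feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    feature_map = []
    for i in range(0, len(image) - k_rows + 1, stride):
        row = []
        for j in range(0, len(image[0]) - k_cols + 1, stride):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(k_rows) for b in range(k_cols)))
        feature_map.append(row)
    return feature_map

def conv_layer(image, kernels, stride=1):
    """Apply several same-sized kernels; the stacked outputs form the output depth."""
    return [conv2d(image, kernel, stride) for kernel in kernels]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
edge_kernel = [[-1, 1], [-1, 1]]  # responds to left-to-right intensity change
sum_kernel = [[1, 1], [1, 1]]     # sums each 2 x 2 window
feature_maps = conv_layer(image, [edge_kernel, sum_kernel], stride=2)
```

Here the output depth equals the number of kernels (2), and both feature maps have the same 2 × 2 size because the kernels have the same dimensions.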
When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layers (e.g., 121) tend to extract more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 100 increases, the later convolutional layers (e.g., 126) extract increasingly complex features, such as features with high-level semantics, and features with higher-level semantics are better suited to the problem to be solved.
Pooling layer:
Since it is often necessary to reduce the number of training parameters, pooling layers often need to be introduced periodically after convolutional layers. In layers 121-126 illustrated by 120 in FIG. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator calculates the average of the pixel values within a particular range of the image. The maximum pooling operator takes the pixel with the largest value within a particular range as the result of maximum pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
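The average and maximum pooling operators can be sketched as follows, assuming square, non-overlapping windows (the window size equals the stride) and a single-channel image; the image values are hypothetical.

```python
def pool2d(image, size, operator):
    """Apply a pooling operator to non-overlapping size x size windows."""
    return [[operator([image[i + a][j + b]
                       for a in range(size) for b in range(size)])
             for j in range(0, len(image[0]) - size + 1, size)]
            for i in range(0, len(image) - size + 1, size)]

def average(values):
    """Average pooling operator: mean of the pixel values in the window."""
    return sum(values) / len(values)

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
max_pooled = pool2d(image, 2, max)      # largest pixel in each window
avg_pooled = pool2d(image, 2, average)  # mean pixel in each window
```

Each output pixel represents one 2 × 2 sub-region of the input, so the 4 × 4 image is reduced to 2 × 2.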
The neural network layer 130:
After processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet able to output the required output information, because, as previously described, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 100 needs to use the neural network layer 130 to generate one output or a group of outputs of the required number of classes. Accordingly, the neural network layer 130 may include a plurality of hidden layers (such as 131, 132, to 13n shown in fig. 2) and an output layer 140, and the parameters included in the plurality of hidden layers may be pre-trained on training data related to a specific task type. For example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
The last layer of the whole convolutional neural network 100, which follows the hidden layers in the neural network layer 130, is the output layer 140. The output layer 140 has a loss function similar to the categorical cross entropy, which is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 100 (that is, the propagation from 110 to 140 in fig. 2) is completed, the backward propagation (that is, the propagation from 140 to 110 in fig. 2) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100, that is, the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
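As a rough illustration of how a cross-entropy loss measures the prediction error and how one update step reduces it, consider the sketch below. It is a simplification under stated assumptions: the three output-layer scores (`logits`) are updated directly as a stand-in for the weights and biases of the network, and the learning rate and sample values are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw output-layer scores into class probabilities."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Prediction error for a single sample whose true class index is target."""
    return -math.log(softmax(logits)[target])


def gradient_step(logits, target, lr=0.5):
    """One gradient-descent update; d(loss)/d(logit_i) = p_i - 1[i == target]."""
    probs = softmax(logits)
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [z - lr * g for z, g in zip(logits, grads)]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
target = 1                # hypothetical true class
loss_before = cross_entropy(logits, target)
loss_after = cross_entropy(gradient_step(logits, target), target)
```

After the update, `loss_after` is smaller than `loss_before`, which mirrors the role of backward propagation in reducing the error between the network output and the ideal result.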
It should be noted that the convolutional neural network 100 shown in fig. 2 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, as shown in fig. 3, a plurality of convolutional layers/pooling layers are parallel, and the features respectively extracted are all input to the whole neural network layer 130 for processing.
A hardware structure of a chip provided in an embodiment of the present application is described below.
Fig. 4 shows a hardware structure of a chip provided in an embodiment of the present application, where the chip includes a neural network processor 40. The chip may be provided in the execution device 110 shown in fig. 1 to complete the calculation work of the calculation module 111. The chip may also be disposed in the training apparatus 120 shown in fig. 1 to complete the training work of the training apparatus 120 and output the target model/rule 101. The algorithm for each layer in the convolutional neural network shown in fig. 2 can be implemented in the chip shown in fig. 4.
The neural network processor 40 may be any processor suitable for large-scale exclusive or operation processing, such as a neural-network processing unit (NPU), a tensor processing unit (TPU), or a graphics processing unit (GPU). Taking the NPU as an example: the NPU 40 is mounted as a coprocessor on a main central processing unit (host CPU), and tasks are distributed by the host CPU. The core portion of the NPU is an arithmetic circuit 403, and a controller 404 controls the arithmetic circuit 403 to fetch data from a memory (the weight memory or the input memory) and perform operations.
In some implementations, the arithmetic circuit 403 includes a plurality of processing engines (PEs). In some implementations, the arithmetic circuit 403 is a two-dimensional systolic array. The arithmetic circuit 403 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 403 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 402 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit fetches the data of matrix A from the input memory 401, performs a matrix operation with matrix B, and stores partial or final results of the resulting matrix in the accumulator 408.
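The role of the accumulator in this matrix operation can be sketched in software. This is only an illustrative model, not the circuit itself: the function name `matmul_accumulate` and the 2x2 sample matrices are assumptions, and the loop order is chosen so that partial results are visibly accumulated step by step.

```python
def matmul_accumulate(a, b):
    """Compute C = A x B by adding partial products into an accumulator."""
    rows, inner, cols = len(a), len(b), len(b[0])
    accumulator = [[0] * cols for _ in range(rows)]
    for k in range(inner):
        # After step k the accumulator holds the partial result that covers
        # columns 0..k of A and rows 0..k of B.
        for i in range(rows):
            for j in range(cols):
                accumulator[i][j] += a[i][k] * b[k][j]
    return accumulator


A = [[1, 2], [3, 4]]  # input matrix (hypothetical values)
B = [[5, 6], [7, 8]]  # weight matrix (hypothetical values)
C = matmul_accumulate(A, B)  # [[19, 22], [43, 50]]
```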
The vector calculation unit 407 may further process the output of the arithmetic circuit, for example by vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 407 may be used for the network calculations of the non-convolution/non-fully-connected (non-FC) layers in a neural network, such as pooling, batch normalization, local response normalization, and the like.
In some implementations, the vector calculation unit 407 can store the processed output vector to the unified memory 406. For example, the vector calculation unit 407 may apply a non-linear function to the output of the arithmetic circuit 403, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 407 generates normalized values, combined values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 403, for example for use in subsequent layers in the neural network.
The unified memory 406 is used to store input data as well as output data.
A direct memory access controller (DMAC) 405 transfers the input data in the external memory to the input memory 401 and/or the unified memory 406, stores the weight data in the external memory into the weight memory 402, and stores the data in the unified memory 406 into the external memory.
A bus interface unit (BIU) 410 is configured to implement interaction among the host CPU, the DMAC, and the instruction fetch buffer 409 through a bus.
The instruction fetch buffer 409, which is connected to the controller 404, is used for storing instructions used by the controller 404.
The controller 404 is configured to call the instructions cached in the instruction fetch buffer 409 to control the working process of the operation accelerator.
Generally, the unified memory 406, the input memory 401, the weight memory 402, and the instruction fetch buffer 409 are on-chip memories. The external memory is a memory outside the NPU, and may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
The operation of each layer in the convolutional neural network shown in fig. 2 or fig. 3 may be performed by the arithmetic circuit 403 or the vector calculation unit 407.
First, an application scenario of the speech processing method provided in the embodiment of the present application is described. The speech processing method can be applied to scenarios in which speech content needs to be modified, for example, a user recording a short video or a teacher recording teaching speech. The speech processing method can be applied to an application program, software, or speech processing device with a speech editing function, such as a mobile phone, a computer, an intelligent voice assistant on a sound-capable detachable terminal, a smart speaker, and the like.
The speech processing device is a terminal device for serving a user, or a cloud device. The terminal device may include a head mounted display (HMD), which may be a combination of a virtual reality (VR) box and a terminal, a VR all-in-one machine, a personal computer (PC), an augmented reality (AR) device, a mixed reality (MR) device, and the like. The terminal device may further include a cellular phone, a smart phone, a personal digital assistant (PDA), a tablet computer, a laptop computer, a vehicle-mounted terminal, and the like, which are not limited herein.
The neural network, the prediction network training method, and the speech processing method according to the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
The neural network and the prediction network in the embodiment of the present application can be two separate networks, or can be one multi-task neural network, in which one task outputs the duration and the other task outputs the speech features.
Next, the training method of the neural network provided in an embodiment of the present application is described in detail with reference to fig. 5. The training method shown in fig. 5 may be performed by a training apparatus of the neural network. The training apparatus may be a cloud service device, a terminal device (for example, a computer, a server, or another device with sufficient computing power to perform the training method), or a system composed of a cloud service device and a terminal device. Illustratively, the training method may be performed by the training apparatus 120 in fig. 1 or the neural network processor 40 in fig. 4.
Optionally, the training method may be processed by a CPU, or may be processed by both the CPU and the GPU, or may use other processors suitable for neural network computation instead of the GPU, which is not limited in this application.
The training method shown in fig. 5 includes steps 501 and 502. Step 501 and step 502 are described in detail below.
First, the training process of the prediction network is briefly described. The prediction network in the embodiment of the present application may be a Transformer network, an RNN, a CNN, or the like, and is not limited herein. In the training stage of the prediction network, the input is a vector of a training text, and the output is the duration, pitch feature, or speech feature of each phoneme in the training text. Training continuously reduces the difference between the duration, pitch feature, or speech feature of each phoneme output by the prediction network and the actual duration, actual pitch feature, or actual speech feature of the training speech corresponding to the training text, thereby obtaining the trained prediction network.
The training data in the embodiment of the present application includes training speech, or includes training speech and training text corresponding to the training speech. If the training data does not include training text, the training text may be obtained by recognizing training speech.
Optionally, if there are multiple speakers (or users), the training speech features in the training data may further include a user identifier, the voiceprint features of the training speech, or a vector identifying the voiceprint features of the training speech, so that the speech features are predicted correctly later.
Optionally, the training data may further include start-stop duration information of each phoneme in the training speech.
In the embodiment of the present application, the training data can be acquired by directly recording the voice of the speaking subject, by a user inputting audio information and video information, or by receiving audio information and video information sent by an acquisition device. In practical applications, the training data can also be acquired in other manners, and the specific acquisition manner of the training data is not limited here.
Step 502: train the neural network by taking the training data as the input of the neural network and taking a value of the loss function smaller than a threshold as the target, to obtain the trained neural network.
Optionally, the training data may be pre-processed. For example, as described above, if the training data includes training speech, the training text may be obtained by recognizing the training speech, and the training text is input to the neural network as a phoneme representation.
In the training process, the whole training text can be used as the target editing text and taken as the input, and the neural network is trained with the goal of reducing the value of the loss function, that is, the difference between the speech features output by the neural network and the actual speech features corresponding to the training speech is continuously reduced. This training process may be understood as a prediction task, and the loss function can be understood as the loss function corresponding to the prediction task.
The neural network in the embodiment of the present application may specifically be an attention mechanism model, for example, Transformer, Tacotron2, and the like. The attention mechanism model comprises an encoder-decoder, and the structure of the encoder or the decoder can be a recurrent neural network, a long short-term memory (LSTM) network, and the like.
The neural network in the embodiment of the present application includes an encoder and a decoder, and the structure types of the encoder and the decoder may be RNN, LSTM, and the like, which is not limited herein. The encoder is used for encoding the training text into text vectors (vector representations in units of phonemes, with one vector for each input), and the decoder is used for obtaining the speech features corresponding to the text according to the text vectors. In the training process of the decoder, the calculation at each step is conditioned on the real speech feature corresponding to the previous step.
Further, in order to ensure the continuity of the preceding and following speech, the speech duration corresponding to the text vectors may be corrected by using the prediction network. This can be understood as upsampling the text vectors according to the duration of each phoneme in the training speech (which can also be understood as extending the frame number of the vectors) to obtain vectors of the corresponding frame number. The decoder is then used for obtaining the speech features corresponding to the text according to the vectors of the corresponding frame number.
Optionally, the decoder may be a unidirectional decoder or a bidirectional decoder (that is, two directions in parallel), and is not limited herein. The two directions refer to the directions of the training text, and can also be understood as the directions of the vectors corresponding to the training text, or as the forward order and the reverse order of the training text. One direction runs from one side of the training text to the other side, and the other direction runs from the other side back to the first side.
Illustratively, if the training text is "no lunch is eaten" (a text whose first character is rendered here as "middle" and whose last character is rendered as "no"), the first direction or forward order may be the direction from "middle" to "no", and the second direction or reverse order may be the direction from "no" to "middle".
If the decoder is a bidirectional decoder, the decoders in the two directions (forward and reverse order) are trained in parallel and are computed independently during training, so neither depends on the result of the other. In addition, if the prediction network and the neural network form one multi-task network, in which case the prediction network may be referred to as a prediction module, the decoder may correct the speech features output by the neural network according to the real duration information corresponding to the training text.
For example, in the case of singing voice editing, the inputs during model training may be the original singing voice audio and the corresponding lyric text (expressed in units of phonemes). The duration information of each phoneme in the original audio, the singer voiceprint features, the frame-level pitch information, and the like are obtained from the original singing voice audio, for example through other pre-trained models or tools (such as a singing voice and lyric alignment tool, a singer voiceprint extraction tool, and a pitch extraction algorithm). The output may be a trained acoustic model whose training goal is to minimize the error between the predicted singing voice features and the real singing voice features.
In the data preparation of the training samples, a training data set can be synthesized based on singing voice, and corresponding training data samples can be constructed by respectively simulating 'inserting, deleting and replacing' operation scenes.
Training:
Stage 1: train a singing voice synthesis model using the ground-truth lyrics, audio, pitch, and duration data, thereby obtaining a trained text encoding module and a trained audio feature decoding module;
Stage 2: with the text encoding module and the audio feature decoding module fixed, train the duration normalization module and the pitch prediction module using the training data set of simulated editing operations;
Stage 3: end-to-end training, in which the whole model is fine-tuned using all the training data.
The architecture of the neural network in the embodiment of the present application can be seen in fig. 6. The neural network comprises an encoder and a decoder. Optionally, the neural network may further include a prediction module and an upsampling module. The prediction module is specifically configured to implement the function of the prediction network, and the upsampling module is specifically configured to implement the process of upsampling the text vector according to the duration of each phoneme in the training speech, which is not described herein again specifically.
It should be noted that the training process may also adopt other training methods instead of the aforementioned training method, and is not limited herein.
The following describes a speech processing method according to an embodiment of the present application in detail with reference to the drawings.
First, the speech processing method provided by the embodiment of the present application may be applied to a replacement scenario, an insertion scenario, or a deletion scenario. These scenarios can be understood as performing replacement, insertion, deletion, and similar operations on the original speech corresponding to an original text to obtain the target speech, so that the target speech sounds similar to the original speech and/or the fluency of the target speech is improved. The original speech can be considered to include the speech to be modified, and the target speech is the speech that the user wants to obtain by modifying the original speech.
For ease of understanding, several examples of the above scenarios are described below:
1. Replacement scenario.
The original text is "today Shenzhen weather is very good" and the target text is "today Guangzhou weather is very good". The overlapping text is "today's weather is very good". The non-overlapping text in the original text is "Shenzhen" and the non-overlapping text in the target text is "Guangzhou". The target text comprises a first text and a second text. The first text is the overlapping text or a part of the overlapping text, and the second text is the text in the target text other than the first text. For example, if the first text is "today's weather is very good", the second text is "Guangzhou". If the first text is only a part of the overlapping text, for example "good today", the second text is "Tianguangzhou Tian", that is, "Guangzhou" together with the overlapping characters adjacent to it.
2. Insertion scenario.
The original text is "today Shenzhen weather is good" and the target text is "today morning Shenzhen weather is good". The overlapping text is "today Shenzhen weather is good". The non-overlapping text in the target text is "morning". To keep the target speech coherent before and after the edit, the insertion scenario can be treated as a replacement scenario in which the characters adjacent to the insertion point in the original speech are replaced with the same characters plus the inserted text. In this case, the first text is the remaining part of the overlapping text, and the second text is "Tian Shang Mian", that is, the inserted text together with the adjacent characters.
3. Deletion scenario.
The original text is "today Shenzhen weather is good" and the target text is "today weather is good". The overlapping text is "today weather is good". The non-overlapping text in the original text is "Shenzhen". To keep the target speech coherent before and after the edit, the deletion scenario can be treated as a replacement scenario in which "Shenzhen" and its adjacent characters in the original speech ("tianshenzhen day") are replaced with the adjacent characters alone ("tiantian"). In this case, the first text is the remaining part of the overlapping text, and the second text is "day and day", that is, the two adjacent characters.
Optionally, the above scenarios are only examples, and in practical applications, there are other scenarios, which are not limited herein.
Since both the deletion scenario and the insertion scenario can be treated as a replacement scenario, the speech processing method provided by the embodiment of the present application is described below by taking the replacement scenario as an example. The speech processing method provided by the embodiment of the present application can be executed independently by the terminal device or the cloud device, or can be completed jointly by the terminal device and the cloud device. The two cases are described separately as follows:
the first embodiment is as follows: the voice processing method is independently executed by the terminal equipment or the cloud equipment.
Referring to fig. 7a, an embodiment of a voice processing method provided in this application may be executed by a voice processing device, or may be executed by a component (e.g., a processor, a chip, or a system on a chip) of the voice processing device, where the voice processing device may be a terminal device or a cloud device, and the embodiment includes steps 701 to 704.
In the embodiment of the application, the voice processing device can directly acquire the original voice, the original text and the second text. Or the original speech and the second text may be obtained first, and the original text corresponding to the original speech is obtained after the original speech is recognized. The second text is the text in the target text except the first text, and the original text and the target text contain the first text. The first text may be understood as part or all of the text in the overlapping text of the original text and the target text.
In one possible implementation, the content of the original voice is the singing voice of the user, and may be, for example, the voice recorded when the user sings.
In the embodiment of the present application, there are multiple ways for the speech processing device to obtain the second text, which are described below:
First, the speech processing device may directly obtain the second text from another device or from user input.
Second, the speech processing device acquires the target text, obtains the overlapping text according to the target text and the original text corresponding to the original speech, and determines the second text according to the overlapping text. Specifically, the characters in the original text and the target text are compared one by one, or the two texts are input into a comparison model, to determine the overlapping text and/or the non-overlapping text of the original text and the target text. The first text is then determined according to the overlapping text. The first text may be the overlapping text, or may be a part of the overlapping text.
In the embodiment of the present application, there are various ways to determine the first text according to the overlapped text, and the speech processing device may directly determine the overlapped text as the first text, may also determine the first text in the overlapped text according to a preset rule, and may also determine the first text in the overlapped text according to an operation of a user. The preset rule may be that N characters in the overlapped content are removed to obtain a first text, where N is a positive integer.
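The comparison and the preset rule above can be sketched as follows. This is an illustrative sketch under several assumptions: whitespace-separated English words stand in for the characters compared in the embodiment, the edit is treated as one contiguous span, the helper name `split_target` is hypothetical, and the preset rule is modeled as moving `n` overlapping tokens on each side of the edit into the second text. It uses `difflib.SequenceMatcher` from the Python standard library as one possible way to compare the two texts.

```python
from difflib import SequenceMatcher


def split_target(original, target, n=0):
    """Split the target tokens into (prefix, second_text, suffix).

    prefix + suffix play the role of the first text, and second_text is the
    span whose speech has to be generated. n widens the span by n overlapping
    tokens on each side (the preset rule of removing characters from the
    overlapping text).
    """
    matcher = SequenceMatcher(None, original, target, autojunk=False)
    edits = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if not edits:
        return list(target), [], []
    # Span in the target that covers every changed region.
    start = min(op[3] for op in edits)
    end = max(op[4] for op in edits)
    start = max(0, start - n)
    end = min(len(target), end + n)
    return target[:start], target[start:end], target[end:]


original = "today Shenzhen weather is very good".split()
replaced = "today Guangzhou weather is very good".split()
inserted = "today morning Shenzhen weather is very good".split()
deleted = "today weather is very good".split()

replace_split = split_target(original, replaced)
insert_split = split_target(original, inserted)
delete_split = split_target(original, deleted, n=1)
```

With `n=1` the deletion case yields a non-empty second text made of the two tokens adjacent to the deleted word, which is how a deletion can be handled as a replacement.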
It should be understood that the above two manners are only examples, and in practical applications, there are other manners to obtain the second text, and the details are not limited herein.
In addition, the speech processing device may align the original text with the original speech to determine the start and end positions in the original speech of each phoneme in the original text, thereby obtaining the duration of each phoneme in the original text. The phonemes corresponding to the first text are then obtained, that is, the speech corresponding to the first text in the original speech (the non-edited speech) is obtained.
Optionally, the speech processing device may align the original text with the original speech by using a forced alignment method, for example, the Montreal Forced Aligner (MFA), a neural network with an alignment function, or another alignment tool, which is not limited herein.
Optionally, after the speech processing device obtains the original speech and the original text, a user interface including the original speech and the original text may be presented to the user. The user then performs a first operation on the original text through the user interface, and the speech processing device determines the target text in response to the first operation of the user. The first operation may be understood as the user editing the original text, and the editing may specifically be the aforementioned replacement, insertion, deletion, or the like.
Illustratively, the example in the replacement scenario above is continued. The original text is "today Shenzhen weather is very good" and the target text is "today Guangzhou weather is very good". Illustratively, the speech processing device is a mobile phone. After the speech processing device obtains the original text and the original speech, an interface as shown in fig. 8 is presented to the user, and the interface includes the original text and the original speech. As shown in fig. 9, the user may perform a first operation 901 on the original text, such as the aforementioned insertion, deletion, or replacement operation. Here, only the replacement of "Shenzhen" with "Guangzhou" is described as an example.
Optionally, after the speech processing device determines the overlapped text of the original text and the target text, the overlapped text is displayed to the user, and then the first text is determined from the overlapped text according to a second operation of the user, so as to determine the second text. The second operation may be a click operation, a drag operation, a slide operation, and the like, and is not limited herein.
Illustratively, continuing the above example, the second text is "Guangzhou", the first text is "today's weather is good", and the non-edited speech is the speech of the first text in the original speech. Assuming that one word corresponds to 2 frames and the original speech corresponding to the original text includes 16 frames, the non-edited speech corresponds to the 1st to 4th frames and the 9th to 16th frames in the original speech. It can be understood that, in practical applications, the correspondence between text and speech frames is not necessarily the 1-to-2 ratio in the above example, which is used only to make the non-edited region easier to understand, and the frame number corresponding to the original text is not limited herein. After determining the target text, the speech processing device may display an interface as shown in fig. 10, where the interface may include the non-edited speech and the edited speech in the original speech, the target text, and the original speech. Here the second text is "Guangzhou", the target text is "today Guangzhou weather is good", the non-edited speech is the speech corresponding to "today's weather is good", and the edited speech is the speech corresponding to "Shenzhen". It can also be understood that, as the user edits the target text, the speech processing device determines the non-edited speech in the original speech based on the target text, the original text, and the original speech.
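The frame arithmetic in this example can be checked with a short sketch. The function name `non_edited_frames` and the representation of the text as eight units of 2 frames each (16 frames in total, with the third and fourth units edited) are assumptions chosen to match the numbers in the example.

```python
def non_edited_frames(durations, edit_start, edit_end):
    """Return inclusive, 1-based frame ranges of the non-edited speech.

    durations holds the number of frames per text unit; the units with
    0-based indices edit_start .. edit_end - 1 form the edited region.
    """
    before = sum(durations[:edit_start])
    edited = sum(durations[edit_start:edit_end])
    total = sum(durations)
    ranges = []
    if before > 0:
        ranges.append((1, before))
    if before + edited < total:
        ranges.append((before + edited + 1, total))
    return ranges


durations = [2] * 8  # eight units, 2 frames each, 16 frames in total
ranges = non_edited_frames(durations, 2, 4)  # [(1, 4), (9, 16)]
```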
Optionally, the speech processing device receives an editing request sent by a user, where the editing request includes the original speech and the second text. Optionally, the edit request further includes the original text and/or the speaker identification. Of course, the edit request may also include the original speech and the target text.
In one possible implementation, the information of the target text includes: text embedding (text embedding) of each phoneme in the target text.
In one possible implementation, the text embedding of each phoneme in the target text may be obtained by a text encoding module (Text Encoder) according to the target text. For example, the target text may be converted into the corresponding phoneme sequence (for example, for the text "what can be read without errors", the corresponding phonemes are the sequence of pinyin initials and finals), which is then input to the Text Encoder and converted into the corresponding text embeddings in units of phonemes. The network structure of the Text Encoder may be, for example, that of the Tacotron2 model.
In one possible implementation, the frame number (also referred to as duration) of each phoneme in the non-edited speech may be obtained, and the frame number of each phoneme in the second text may be predicted according to the frame number of each phoneme in the non-edited speech and the information of the target text.
In one possible implementation, the neural network used to predict the frame number of each phoneme in the second text may be as shown in fig. 7b (for example, a duration prediction model that is based on a masking mechanism and fuses the original real durations). Its inputs are the output of the Text Encoder, the original real durations (Reference Duration, that is, the duration of each phoneme in the first text), and the corresponding mask. Its output is the predicted duration (that is, the number of frames in the corresponding audio) of each phoneme to be edited (that is, each phoneme in the second text).
In a possible implementation, after the number of frames of each phoneme in the target text (including the first text and the second text) is obtained, each text embedding is upsampled according to the duration of the corresponding phoneme to obtain an embedding result matching the number of frames. For example, if the predicted duration of the phoneme ai is 10 frames, the text embedding corresponding to ai may be copied N times, where N is a positive integer greater than 1, for example, N is 10.
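The upsampling step can be sketched as repeating each phoneme-level embedding once per frame. The function name `upsample`, the two-dimensional toy embeddings, and the second phoneme with a duration of 3 frames are assumptions for illustration; only the 10-frame duration of the phoneme ai comes from the text above.

```python
def upsample(embeddings, durations):
    """Expand phoneme-level embeddings to frame level by repetition."""
    if len(embeddings) != len(durations):
        raise ValueError("need exactly one duration per embedding")
    frames = []
    for vector, duration in zip(embeddings, durations):
        # Copy the embedding once for every frame the phoneme occupies.
        frames.extend(list(vector) for _ in range(duration))
    return frames


embeddings = [[0.1, 0.2], [0.3, 0.4]]  # toy embeddings for two phonemes
durations = [10, 3]                    # frames per phoneme; "ai" lasts 10 frames
frame_level = upsample(embeddings, durations)  # 13 frame-level vectors
```

The length of the result equals the sum of the durations, so the upsampled sequence lines up frame by frame with the speech features to be generated.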
It will be appreciated that, alternatively, in the context of singing voice editing, the singing voice itself may follow a certain score which, in turn, specifies the duration of pronunciation and Pitch etc. of each word. Therefore, when the singing voice is edited, the corresponding duration and Pitch information of the area (non-edited voice) which does not need to be edited do not need to be predicted, and an accurate real numerical value can be directly obtained and used.
An illustration of the duration prediction for the second text is given below:
Referring to fig. 7b, the Reference Durations are the real durations of the phonemes in the original singing voice audio, where the dashed boxes mark the durations to be predicted for the phonemes in the second text (these may be replaced by 0 because they are unknown at this point). The Edit Mask marks the phonemes to be predicted (Mask = 0 indicates that prediction is needed). The Reference Durations are fused with the Edit Mask (for example, by performing an inner product operation), and the result is then added to the Text Embedding and the Singer Embedding (the extracted voiceprint features). Each FFT Block may be a Transformer block, and, for example, 4 FFT Blocks (that is, N = 4) may be used. Finally, the model predicts the durations of the phonemes whose Mask is 0 and outputs them together with the durations of the other, unedited phonemes.
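The data flow around the mask can be sketched as follows. This is not the model of fig. 7b: the fusion is assumed here to be an element-wise product, the embeddings and FFT Blocks are omitted, and `toy_predictor` is a hypothetical stand-in that simply returns the rounded mean of the known durations. The sketch only shows how masked positions are zeroed on input and how predicted and reference durations are merged on output.

```python
def fuse(reference_durations, edit_mask):
    """Zero out the positions to be predicted (mask 0), keep the rest."""
    return [d * m for d, m in zip(reference_durations, edit_mask)]


def toy_predictor(fused):
    """Hypothetical stand-in for the network: rounded mean of known durations."""
    known = [d for d in fused if d > 0]
    guess = round(sum(known) / len(known)) if known else 1
    return [guess] * len(fused)


def merge(reference_durations, edit_mask, predicted):
    """Use predictions where mask is 0 and real durations where mask is 1."""
    return [
        ref if mask == 1 else pred
        for ref, mask, pred in zip(reference_durations, edit_mask, predicted)
    ]


reference = [12, 8, 0, 0, 15, 9]  # frames per phoneme; 0 marks unknown durations
edit_mask = [1, 1, 0, 0, 1, 1]    # 0 marks phonemes of the second text
fused = fuse(reference, edit_mask)
final_durations = merge(reference, edit_mask, toy_predictor(fused))
```

In this toy run the two masked phonemes receive a duration of 11 frames each, while the unedited phonemes keep their real durations.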
In one possible implementation, the predicted durations of the phonemes in the second text may be used to upsample inputs for pitch feature prediction, e.g., the inputs for pitch feature prediction may include text embedding, each text embedding before upsampling corresponding to a phoneme, and the text embedding after upsampling including the number of frames of the corresponding phoneme.
In one possible implementation, the second speech characteristic of the non-edited speech may also be obtained from the non-edited speech. The second speech feature may carry at least one of the following information: part of or all of the speech frames of the non-edited speech; a voiceprint feature of the non-editing speech; a timbre characteristic of the non-edited speech; prosodic features of the non-editing speech; and, a cadence characteristic of the non-editing speech.
The speech features in the embodiment of the present application may be used to represent features of speech (for example, timbre, prosody, emotion, or rhythm, etc.), and the representation forms of the speech features may be various, and may be speech frames, sequences, vectors, etc., and are not limited herein. In addition, the speech features in the embodiment of the present application may specifically be parameters extracted from the above expression forms by the above-described PLP, LPC, MFCC, or the like.
Optionally, at least one speech frame is selected from the non-edited speech as the second speech feature. In this way, the first speech feature generated subsequently better incorporates the second speech feature of the context. The text corresponding to the at least one speech frame may be the text adjacent to the second text in the first text.
Optionally, the non-edited speech is encoded by an encoding model to obtain a target sequence, and the target sequence is used as the second speech feature. The coding model may be CNN, RNN, etc., and is not limited herein.
In addition, the second speech feature may also carry a voiceprint feature of the original speech. The method for obtaining the voiceprint feature may be direct obtaining, or obtaining the voiceprint feature by recognizing the original voice. On one hand, by introducing the voiceprint feature of the original voice, the subsequently generated first voice feature also carries the voiceprint feature of the original voice, and therefore the similarity degree of the target editing voice and the original voice is improved. On the other hand, in the case that the number of speakers (or users) is multiple, the introduction of the voiceprint feature can improve the subsequent predicted speech feature to be more similar to the voiceprint of the speaker of the original speech.
Optionally, the speech processing device may further obtain an identifier of a speaker of the original speech, so that when there are a plurality of speakers, the corresponding speech of the corresponding speaker may be matched, and the similarity between the subsequent target editing speech and the original speech is improved.
The following description will be given only by taking a speech frame as a speech feature (or understanding that a speech feature is obtained from a speech frame) as an example. Illustratively, continuing the above example, at least one of the 1 st frame to the 4 th frame and the 9 th frame to the 16 th frame in the original speech is selected as the second speech feature.
Illustratively, the second speech feature is a mel-frequency spectral feature.
In one possible implementation, the second speech feature may be expressed in the form of vectors. In one possible implementation, the predicted duration of each phoneme in the second text may be used to upsample each input of the pitch feature prediction; e.g., the inputs of the pitch feature prediction may include the second speech feature, where each vector before upsampling corresponds to one phoneme, and after upsampling each phoneme is represented by as many vectors as it has frames.
In one possible implementation, a second pitch characteristic of the second text may be predicted based on a first pitch (pitch) characteristic of the non-editing speech and information of the target text.
In one possible implementation, the first Pitch (Pitch) feature of the non-edited speech may be obtained by an existing Pitch extraction algorithm, which is not limited in this application.
In one possible implementation, a second pitch characteristic of the second text may be predicted by a neural network based on a first pitch (pitch) characteristic of the non-editing speech, information of the target text, and a second speech characteristic of the non-editing speech.
Next, how to predict a second pitch characteristic of the second text according to a first pitch (pitch) characteristic of the non-editing speech and the information of the target text is described:
in one possible implementation, the target text is a text obtained by inserting the second text into the first text; or the target text is a text obtained by deleting a first part of text of the first text, and the second text is a text adjacent to the first part of text; a first pitch (pitch) feature of the non-editing speech and information of the target text may be fused to obtain a first fusion result; and inputting the first fusion result into a second neural network to obtain a second pitch feature of the second text.
For insert and delete operations: the frame-level pitch feature of the target editing phoneme is predicted using the model shown in fig. 7 c. The Pitch prediction model for the insert and delete operation may have a model structure identical or similar to that of fig. 7b, except that the input at this time is a frame-level (the input in fig. 7b is a phoneme-level) Pitch value (the input in fig. 7b is duration information) extracted from the real singing voice, where the Pitch of the region to be edited is marked as a dashed box and its corresponding Edit Mask flag is set to 0.
In one possible implementation, the target text is obtained by replacing a second part of text in the first text with the second text; a first pitch (pitch) feature of the non-editing speech may be input to a third neural network, resulting in an initial pitch feature comprising a pitch for each of a plurality of frames; inputting information of the target text into a fourth neural network to obtain pronunciation characteristics of the second text, wherein the pronunciation characteristics are used for indicating whether each frame in a plurality of frames included in the initial pitch characteristics pronounces; fusing the initial pitch feature and the pronunciation feature to obtain a second pitch feature of the second text.
For a replacement operation (here, a replacement operation means only that the number of newly edited words equals the number of replaced words; if the numbers differ, the replacement is broken into two editing operations, i.e., deletion and insertion): since the replaced text may differ greatly in pronunciation, the new pitch is predicted using the model shown in fig. 7d to ensure the consistency of the singing voice before and after replacement:
Pitch prediction model for the replacement operation. Frame-level voiced/unvoiced (V/UV) prediction may be introduced to aid the prediction of Pitch. Illustratively, for the design of the V/UV Predictor and F0 Predictor modules, reference may be made to the F0 Predictor in FastSpeech 2.
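As an illustration of how a frame-level voiced/unvoiced decision might be fused with an initial pitch contour, the following minimal sketch gates each frame's pitch by its voicing probability. The function name, the 0.5 threshold, and the convention of writing 0.0 for unvoiced frames are assumptions made here; the text above only states that the two features are fused.

```python
def fuse_pitch_with_voicing(initial_pitch, voiced_prob, threshold=0.5):
    """Return a frame-level pitch contour with unvoiced frames set to 0.0.

    initial_pitch: predicted pitch value per frame (e.g. F0 in Hz).
    voiced_prob:   predicted probability per frame that the frame is voiced.
    threshold:     assumed decision boundary between voiced and unvoiced.
    """
    if len(initial_pitch) != len(voiced_prob):
        raise ValueError("pitch and voicing must cover the same frames")
    return [
        f0 if prob >= threshold else 0.0
        for f0, prob in zip(initial_pitch, voiced_prob)
    ]


contour = fuse_pitch_with_voicing(
    initial_pitch=[220.0, 225.0, 230.0, 228.0],
    voiced_prob=[0.9, 0.2, 0.7, 0.5],
)
print(contour)  # [220.0, 0.0, 230.0, 228.0]
```

A hard gate is only one possible fusion; multiplying the pitch by the probability itself would be a softer alternative.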
In one possible implementation, the first pitch (pitch) feature entered may include a pitch feature of each of a plurality of frames of the non-editing speech; accordingly, the output second pitch characteristic may include a pitch characteristic of each of the plurality of frames of the target editing speech.
And 703, obtaining a first voice feature corresponding to the second text through a neural network according to the second pitch feature and the second text.
In one possible implementation, the second pitch feature and the second text (e.g., text embedding of the second text) may be fused (e.g., added) and the fusion result may be input into a neural network to obtain the first speech feature corresponding to the second text. The first speech feature corresponding to the second text may be a mel-frequency spectrum feature.
In a possible implementation, a second speech feature of the non-edited speech may be used in addition to the first pitch (pitch) feature of the non-edited speech and the information of the target text; for the description of the second speech feature, refer to the foregoing embodiment, and details are not repeated here.
In a possible implementation, after the second speech feature is obtained, the first speech feature corresponding to the second text may be obtained through a neural network based on the second speech feature and the second text. The neural network may include an encoder and a decoder. And inputting the second text into an encoder to obtain a first vector corresponding to the second text, and decoding the first vector through a decoder based on the second voice feature to obtain the first voice feature. The second speech feature may be the same as or similar to the prosody, tone and/or signal-to-noise ratio of the first speech feature, and the prosody may reflect the emotional state or the speech form of the speaker, and the prosody generally refers to features such as intonation, tone, accent emphasis, pause or rhythm.
Optionally, an attention mechanism may be introduced between the encoder and the decoder for adjusting the correspondence of the number between the inputs and the outputs.
Optionally, a target text where the second text is located may be introduced in the encoding process of the encoder, so that the target text is referred to by the generated first vector of the second text, and the second text described by the first vector is more accurate. Namely, the first speech feature corresponding to the second text can be obtained through the neural network based on the second speech feature, the target text and the marking information. Specifically, the target text and the tag information are input into an encoder to obtain a first vector corresponding to the second text, and the first vector is decoded by a decoder based on the second speech feature to obtain the first speech feature. The marking information is used for marking the second text in the target text.
The decoder in the embodiment of the present application may be a unidirectional decoder or a bidirectional decoder, which is described below separately.
First, the decoder is a unidirectional decoder.
The decoder calculates a first vector or a second vector of speech frames from a first direction of the target text as the first speech feature based on the second speech feature. Wherein the first direction is a direction pointing from one side of the target text to the other side of the target text. In addition, the first direction may be understood as the forward or reverse order of the target text (the related description may refer to the description about the forward or reverse order in the embodiment shown in fig. 5).
Optionally, the second speech feature and the first vector are input into a decoder to obtain the first speech feature. Or inputting the second speech characteristic and the second vector into a decoder to obtain the first speech characteristic.
Second, if the second text is in the middle region of the target text, the decoder may be a bidirectional decoder (it may also be understood that the decoder includes a first decoder and a second decoder).
The second text is in the middle area of the target text, and it can be understood that the second text is not at the two ends of the target text.
The bidirectional decoder in the embodiment of the present application has a plurality of cases, which are described below:
1. the first speech characteristic output by the bidirectional decoder from the first direction is the speech characteristic corresponding to the second text, and the fourth speech characteristic output by the bidirectional decoder from the second direction is the speech characteristic corresponding to the second text.
In this case, it can be understood that the complete speech features corresponding to the two second texts can be obtained through the left and right sides (i.e., the positive sequence and the negative sequence), respectively, and the first speech feature can be obtained according to the two speech features.
The first decoder calculates a first vector or a second vector from a first direction of the target text based on the second speech feature to obtain a first speech feature (hereinafter referred to as LR) of the second text. The second decoder calculates the first vector or the second vector from the second direction of the target text based on the second speech feature to obtain a fourth speech feature (hereinafter referred to as RL) of the second text. And generating a first voice characteristic according to the first voice characteristic and the fourth voice characteristic. Wherein the first direction is a direction pointing from one side of the target text to the other side of the target text and the second direction is opposite to the first direction (or it is understood that the second direction is a direction pointing from the other side of the target text to one side of the target text). The first direction may be in the positive order described above and the second direction may be in the negative order described above.
For a bidirectional decoder, when decoding the first frame of the first vector or the second vector in the first direction, the first decoder may use the speech frame of the non-edited speech adjacent to one side (which may also be referred to as the left side) of the second text as a condition, and decode to obtain N frames of LR. When decoding the first frame of the first vector or the second vector in the second direction, the second decoder may use the speech frame of the non-edited speech adjacent to the other side (which may also be referred to as the right side) of the second text as a condition, and decode to obtain N frames of RL. Optionally, for the structure of the bidirectional decoder, refer to fig. 11. After the N frames of LR and the N frames of RL are obtained, a frame at which the difference between LR and RL is smaller than a threshold may be used as the transition frame (at position m, m < N), or the frame at which the difference between LR and RL is smallest may be used as the transition frame. The N frames of the first speech feature may then include the first m frames of LR and the last N-m frames of RL, or the first N-m frames of LR and the last m frames of RL. The difference between LR and RL may be understood as the distance between vectors. In addition, if the speaker identifier is obtained in step 701, the first vector or the second vector in this step may further include a third vector for identifying the speaker. It can also be understood that the third vector is used to identify the voiceprint features of the original speech.
Illustratively, continuing the above example, assume that the LR frames corresponding to "Guangzhou" obtained by the first decoder include LR1, LR2, LR3, and LR4, and the RL frames corresponding to "Guangzhou" obtained by the second decoder include RL1, RL2, RL3, and RL4. If the difference between LR2 and RL2 is the smallest, then LR1, LR2, RL3, RL4 or LR1, RL2, RL3, RL4 is used as the first speech feature.
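The transition-frame selection in this example can be sketched as follows. This is a minimal illustration under stated assumptions: both frame lists are given in time order, the difference between frames is measured as Euclidean distance, and the function name `merge_bidirectional` and the flag `keep_lr_at_transition` are hypothetical names introduced here to cover the two alternatives in the example.

```python
import math


def merge_bidirectional(lr_frames, rl_frames, keep_lr_at_transition=True):
    """Merge left-to-right (LR) and right-to-left (RL) frames at the closest frame.

    Both inputs hold N feature vectors for the same N time positions.
    Returns the merged frames and the number m of frames taken from LR.
    """
    if not lr_frames or len(lr_frames) != len(rl_frames):
        raise ValueError("LR and RL must be non-empty and of equal length")
    # Distance between the two predictions at every time position.
    distances = [math.dist(lr, rl) for lr, rl in zip(lr_frames, rl_frames)]
    transition = distances.index(min(distances))  # 0-based position
    # Take the transition frame itself from LR or from RL.
    m = transition + 1 if keep_lr_at_transition else transition
    return lr_frames[:m] + rl_frames[m:], m


lr = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # LR1..LR4
rl = [[5.0, 5.0], [1.0, 1.5], [4.0, 4.0], [9.0, 9.0]]  # RL1..RL4
merged, m = merge_bidirectional(lr, rl)
print(m, merged)  # 2 [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [9.0, 9.0]]
```

With `keep_lr_at_transition=False` the same inputs give LR1, RL2, RL3, RL4, the second alternative named in the example.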
2. The first speech characteristic output by the bidirectional decoder from the first direction is the speech characteristic corresponding to the third text in the second text, and the fourth speech characteristic output by the bidirectional decoder from the second direction is the speech characteristic corresponding to the fourth text in the second text.
In this case, it can be understood that the partial speech features corresponding to the second text can be obtained through the left and right sides (i.e., the positive sequence and the negative sequence), respectively, and the complete first speech feature can be obtained according to the two partial speech features. Namely, a part of voice features are taken from the direction of positive sequence, another part of voice features are taken from the direction of negative sequence, and the voice features of the part and the voice features of the other part are spliced to obtain the integral voice features.
Illustratively, continuing the above example, assume that the LR frames corresponding to the third text ("Guang") obtained by the first decoder include LR1 and LR2, and the RL frames corresponding to the fourth text ("zhou") obtained by the second decoder include RL3 and RL4. Splicing LR1, LR2, RL3, and RL4 then yields the first speech feature.
It should be understood that the above two ways are only examples, and in practical applications, there are other ways to obtain the first speech feature, and the details are not limited herein.
In one possible implementation, after the first speech feature is obtained, the first speech feature may be converted by a vocoder into the target editing speech corresponding to the second text. The vocoder may be a conventional vocoder (e.g., the Griffin-Lim algorithm) or a neural network vocoder (e.g., MelGAN or HiFi-GAN pre-trained using audio training data), and the like, which is not limited herein.
Illustratively, continuing the above example, the target editing speech corresponding to "Guangzhou" is shown in FIG. 12.
Alternatively, if the original speech and the second text are obtained in step 701, the position of the second text in the target text is obtained.
Alternatively, if the target text has been acquired in step 701, the start-stop positions of the phonemes in the original text in the original speech may be determined by aligning the original speech with the original text through the alignment technique in step 701. And determining the position of the second text in the target text according to the starting and ending positions of the phonemes.
And step 706, splicing the target editing voice and the non-editing voice based on the position to generate a target voice corresponding to the target text. This step is optional.
The position in the embodiment of the application is used for splicing the non-edited voice and the target edited voice, and the position may be a position of the second text in the target text, a position of the first text in the target text, a position of the non-edited voice in the original voice, or a position of the edited voice in the original voice.
Alternatively, after the position of the second text in the target text is obtained, the start-stop position of each phoneme in the original text in the original speech may be determined by aligning the original speech and the original text through the alignment technique in the foregoing step 701. And determining the position of the non-editing voice or the editing voice in the original voice according to the position of the first text in the original text. And then the voice processing equipment splices the target edited voice and the non-edited voice based on the position to obtain the target voice. Namely, the target voice corresponding to the second text replaces the editing area in the original voice to obtain the target voice.
Illustratively, continuing the above example, the non-edited speech corresponds to the 1st to 4th frames and the 9th to 16th frames in the original speech. The target editing speech is LR1, LR2, RL3, RL4 or LR1, RL2, RL3, RL4. Splicing the target editing speech and the non-edited speech can be understood as replacing the 5th to 8th frames in the original speech with the obtained four frames, so as to obtain the target speech. That is, the speech corresponding to "Guangzhou" replaces the speech corresponding to "Shenzhen" in the original speech, which yields the target speech corresponding to the target text "today Guangzhou weather is very good". The target speech corresponding to "today Guangzhou weather is good" is shown in FIG. 12.
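The splicing in this example amounts to replacing a frame range, which can be sketched as below. The function name `splice_frames` and the 0-based, end-exclusive indexing are choices made for this illustration; the frame labels stand in for real feature vectors or audio segments.

```python
def splice_frames(original_frames, edited_frames, start, end):
    """Replace original_frames[start:end] with edited_frames.

    start/end are 0-based with end exclusive, so frames 5..8 (1-based,
    inclusive) correspond to start=4, end=8. An insertion uses start == end;
    a deletion passes an empty edited_frames list.
    """
    if not 0 <= start <= end <= len(original_frames):
        raise ValueError("edit region lies outside the original speech")
    return original_frames[:start] + edited_frames + original_frames[end:]


original = [f"orig{i}" for i in range(1, 17)]   # 16 frames of original speech
edited = ["LR1", "LR2", "RL3", "RL4"]           # target editing speech
target = splice_frames(original, edited, start=4, end=8)
print(target[3:9])  # ['orig4', 'LR1', 'LR2', 'RL3', 'RL4', 'orig9']
```

The non-edited frames on both sides are copied unchanged, which matches the statement that the non-edited speech is not modified.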
Optionally, the voice processing device plays the target editing voice or the target voice after acquiring the target editing voice or the target voice.
In a possible implementation manner, the speech processing method provided in the embodiment of the present application includes steps 701 to 704. In another possible implementation manner, the speech processing method provided in this embodiment of the present application includes steps 701 to 705. In another possible implementation manner, the speech processing method provided in this embodiment of the present application includes steps 701 to 706. In addition, in the embodiment of the present application, each step shown in fig. 7a does not limit a timing relationship. For example: step 705 in the above method may also be performed after step 704, before step 701, or together with step 701.
The embodiment of the application provides a voice processing method, which comprises the following steps: acquiring original voice and a second text, wherein the second text is a text in a target text except for a first text, the target text and the original text corresponding to the original voice both comprise the first text, and the voice of the first text corresponding to the original voice is non-edited voice; predicting a second pitch characteristic of the second text from a first pitch (pitch) characteristic of the non-editing speech and information of the target text; obtaining a first voice feature corresponding to the second text through a neural network according to the second pitch feature and the second text; and generating target editing voice corresponding to the second text according to the first voice characteristic. In this method, the pitch characteristic of the second text (the text to be edited) is predicted, the first voice characteristic of the second text is generated according to that pitch characteristic, and the target editing voice corresponding to the second text is generated based on the first voice characteristic, so that the pitch characteristics of the voice before and after singing voice editing are similar and the target editing voice sounds similar to the original voice.
The speech processing method in the embodiment of the present application will be described with reference to an example:
Taking the scenario of singing voice editing as an example, the original singing voice to be edited is W (its voice content S is "love can not be wrong"), and the following three different target voices are taken as examples:
edit request Q1: the target voice is W1 (the voice content corresponding text T1 is "how can be asked but not wrong"),
edit request Q2: the target voice is W2 (the text T2 corresponding to the voice content is "love not question wrong"),
edit request Q3: the target voice is W3 (the text T3 corresponding to the voice content is "love is not asked about wrong").
Step S1: receiving a user 'voice editing' request;
The request includes at least the original voice W to be edited, the original lyric text S, and the target text T (T1, T2, or T3). The pre-operations include the following:
comparing the original text S with the target text and determining the editing type of the current editing request; that is, Q1, Q2, and Q3 are determined to be insertion, deletion, and replacement operations, respectively;
extracting the audio features and the Pitch feature of each frame from W;
extracting the Singer Embedding from W through a voiceprint model;
converting S and the target text T into phoneme representations; for example, the phoneme sequence of T2 is [ ai4 b u2 w en4 d ui4 c cuo4 ];
extracting the duration (number of frames) of each phoneme in S according to W and S;
determining the Mask region according to the operation type: Q1 is an insertion operation (the inserted word is "how"), so its target Mask phonemes are the phonemes corresponding to "how", and the final target text phonemes of Q1 are [ ai4 z en3 m e5 k e2 y i3 b u2 w en4 d ui4 c cuo4 ] (where red represents the masked phonemes); Q2 is a deletion operation (the deleted word is "can"), so its target Mask phonemes are the phonemes of the words in S adjacent to the deleted "can", and the final target text phonemes of Q2 are [ ai4 b u2 w en4 d ui4 c cuo4 ] (where red represents the masked phonemes); Q3 is a replacement operation ("can" is replaced with "how"), so its target text phonemes are [ ai4 z en3 m e5 b u2 w en4 d ui4 c cuo4 ] (where red represents the masked phonemes);
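The comparison of the original text with the target text in step S1 can be sketched with a standard sequence diff. The example below is only an illustration: it uses Python's `difflib.SequenceMatcher`, works on syllable-level tokens assembled from the phoneme sequences quoted above, and the function name `classify_edits` is hypothetical. Following the convention stated earlier, a replacement whose two sides differ in length is split into a deletion followed by an insertion.

```python
from difflib import SequenceMatcher


def classify_edits(original, target):
    """Return a list of (operation, tokens) describing how target differs."""
    edits = []
    matcher = SequenceMatcher(None, original, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged region: non-edited speech
        if tag == "replace" and (i2 - i1) != (j2 - j1):
            # Unequal lengths: treat as deletion followed by insertion.
            edits.append(("delete", original[i1:i2]))
            edits.append(("insert", target[j1:j2]))
        elif tag == "delete":
            edits.append(("delete", original[i1:i2]))
        else:  # "insert" or an equal-length "replace"
            edits.append((tag, target[j1:j2]))
    return edits


S = "ai4 ke2 yi3 bu2 wen4 dui4 cuo4".split()
T1 = "ai4 zen3 me5 ke2 yi3 bu2 wen4 dui4 cuo4".split()
T2 = "ai4 bu2 wen4 dui4 cuo4".split()
T3 = "ai4 zen3 me5 bu2 wen4 dui4 cuo4".split()

print(classify_edits(S, T1))  # [('insert', ['zen3', 'me5'])]
print(classify_edits(S, T2))  # [('delete', ['ke2', 'yi3'])]
print(classify_edits(S, T3))  # [('replace', ['zen3', 'me5'])]
```

The returned tokens identify the Mask region for insertion and replacement; for deletion, the Mask region would instead be the tokens adjacent to the deleted span, which this sketch does not compute.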
step S2: the target Text Phoneme obtained in the S1 generates a Text feature, namely, Phone-level Text Embedding, through a Text coding module;
step S3: predicting duration information of each phoneme in the target text through a duration warping module; this step can be accomplished by the following substeps:
generating a Mask vector and a reference duration vector according to Mask marks of the phonemes: for the non-Mask phoneme, the reference duration is the real duration extracted in the step S1, otherwise, the reference duration is set to 0; for non-Mask phonemes, setting the corresponding position in the Mask vector as 1, otherwise, setting the corresponding position as 0;
taking the Text Embedding, the Singer Embedding, the reference duration vector, and the Mask vector as input, and predicting the durations corresponding to the Mask phonemes by using the duration prediction module shown in Figure 2-2;
According to the duration corresponding to each phoneme, upsampling the Embedding of each phoneme (namely if the duration of the phoneme A is 10, copying 10 copies of the Embedding of the phoneme A), thereby generating a Frame-level Text Embedding;
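The upsampling in the last sub-step can be sketched as follows; the function name `upsample_by_duration` is hypothetical and plain lists stand in for embedding tensors.

```python
def upsample_by_duration(phoneme_embeddings, durations):
    """Repeat each phoneme embedding once per frame of that phoneme.

    phoneme_embeddings: one vector per phoneme (phone-level text embedding).
    durations:          number of frames per phoneme (real or predicted).
    Returns the frame-level embedding sequence.
    """
    if len(phoneme_embeddings) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for embedding, duration in zip(phoneme_embeddings, durations):
        # Copy the vector so that frames do not share one list object.
        frames.extend(list(embedding) for _ in range(duration))
    return frames


frame_level = upsample_by_duration([[1, 2], [3, 4]], durations=[2, 3])
print(frame_level)  # [[1, 2], [1, 2], [3, 4], [3, 4], [3, 4]]
```

The output length equals the sum of the durations, so the frame-level embedding lines up with frame-level quantities such as pitch.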
step S4: predicting the Pitch value of each frame by a Pitch prediction module, wherein the step can be completed by the following sub-steps:
for Q1 and Q2, the pitch of the frame corresponding to the Mask phoneme is predicted using the model shown in Figure 2-3:
wherein, for the non-Mask phoneme, the reference Pitch is the real Pitch extracted in S1, and the position corresponding to the Mask vector is marked as 1; for Mask phonemes, the pitch on the corresponding frame is set to 0, and the Mask is set to 0; and predicting the Frame-level pitch corresponding to the Mask phoneme.
For the replace operation Q3, the Frame-level Pitch of the Mask phoneme is predicted using the model shown in Figure 2-4;
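For the insertion and deletion case (Q1 and Q2), the frame-level reference Pitch and Mask vectors described in step S4 can be assembled as sketched below. The per-phoneme dictionary layout and the function name `build_pitch_inputs` are assumptions made for this illustration; the durations of Mask phonemes are taken to be those predicted in step S3.

```python
def build_pitch_inputs(phonemes):
    """Build frame-level reference pitch and mask vectors.

    phonemes: list of dicts with keys
        "masked"   (bool) whether the phoneme's pitch must be predicted,
        "duration" (int)  number of frames of the phoneme,
        "pitch"    (list) real per-frame pitch, required when not masked.
    Returns (reference_pitch, frame_mask) with mask 1 = known, 0 = predict.
    """
    reference_pitch, frame_mask = [], []
    for phoneme in phonemes:
        duration = phoneme["duration"]
        if phoneme["masked"]:
            reference_pitch.extend([0.0] * duration)  # unknown pitch set to 0
            frame_mask.extend([0] * duration)
        else:
            if len(phoneme["pitch"]) != duration:
                raise ValueError("pitch length must equal the duration")
            reference_pitch.extend(phoneme["pitch"])
            frame_mask.extend([1] * duration)
    return reference_pitch, frame_mask


pitch, mask = build_pitch_inputs([
    {"masked": False, "duration": 2, "pitch": [220.0, 221.0]},
    {"masked": True, "duration": 3},
    {"masked": False, "duration": 1, "pitch": [230.0]},
])
print(pitch)  # [220.0, 221.0, 0.0, 0.0, 0.0, 230.0]
print(mask)   # [1, 1, 0, 0, 0, 1]
```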
step S5: and adding the Frame-Level text Embedding and the Pitch together, inputting the mixture into an audio characteristic decoding module, and predicting an audio characteristic Frame corresponding to a new Mask phoneme.
It should be understood that if a plurality of editing operations are involved in one editing request, editing can be performed in a left-to-right processing order, one by one, using the flow described above. On the other hand, a replacement operation can also be implemented by two operations "delete before insert".
The above describes the voice processing method implemented by the terminal device or the cloud device alone, and the following describes the voice processing method executed by the terminal device and the cloud device together.
Example two: the terminal equipment and the cloud equipment jointly execute the voice processing method.
Referring to fig. 13, an embodiment of a voice processing method provided in this application may be executed by a terminal device and a cloud device together, or may be executed by a component (e.g., a processor, a chip, or a system-on-chip) of the terminal device and a component (e.g., a processor, a chip, or a system-on-chip) of the cloud device, and the embodiment includes steps 1301 to 1306.
Step 1301, the terminal device obtains the original voice and the second text.
Step 1301 executed by the terminal device in this embodiment is similar to step 701 executed by the speech processing device in the embodiment shown in fig. 7a, and is not described again here.
In step 1302, the terminal device sends the original speech and the second text to the cloud device.
After the terminal device obtains the original voice and the second text, the terminal device may send the original voice and the second text to the cloud device.
Optionally, in step 1301, if the terminal device acquires the original speech and the target text, the terminal device sends the original speech and the target text to the cloud device.
Step 1303, the cloud device obtains the non-editing voice based on the original voice and the second text.
Step 1303 executed by the cloud device in this embodiment is similar to the description of determining the non-editing speech in step 701 executed by the speech processing device in the embodiment shown in fig. 7a, and is not described herein again.
In step 1304, the cloud device obtains a second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text.
Step 1304 executed by the cloud device in this embodiment is similar to the description of predicting the second pitch feature in step 702 executed by the speech processing device in the embodiment shown in fig. 7a, and is not described herein again.
Step 1305, the cloud device obtains, through the neural network, a first voice feature corresponding to the second text based on the second pitch feature and the second text.
In step 1306, the cloud device generates a target editing voice corresponding to the second text based on the first voice feature.
Steps 1304 to 1306 executed by the cloud device in this embodiment are similar to steps 702 to 704 executed by the speech processing device in the embodiment shown in fig. 7a, and are not described again here.
Step 1307, the cloud device sends the target editing voice to the terminal device. This step is optional.
Optionally, after the cloud device obtains the target editing voice, the target editing voice may be sent to the terminal device.
In step 1308, the terminal device or the cloud device obtains a position of the second text in the target text. This step is optional.
Step 1309, the terminal device or the cloud device concatenates the target editing speech and the non-editing speech based on the position to generate a target speech corresponding to the target text. This step is optional.
Step 1308 and step 1309 in this embodiment are similar to steps 705 and 706 executed by the speech processing apparatus in the embodiment shown in fig. 7a, and are not described here again. Step 1308 and step 1309 in this embodiment may be executed by a terminal device or a cloud device.
Step 1310, the cloud device sends the target voice to the terminal device. This step is optional.
Optionally, if steps 1308 and 1309 are performed by the cloud end device, the cloud end device sends the target voice to the terminal device after acquiring the target voice. If steps 1308 and 1309 are performed by the terminal device, this step may not be performed.
Optionally, the terminal device plays the target editing voice or the target voice after acquiring the target editing voice or the target voice.
In a possible implementation manner, a speech processing method provided in an embodiment of the present application may include: the cloud device generates a target editing voice and sends the target editing voice to the terminal device, namely the method comprises steps 1301 to 1307. In another possible implementation manner, the speech processing method provided in the embodiment of the present application may include: the cloud device generates target editing voice, generates target voice according to the target editing voice and the non-editing voice, and sends the target voice to the terminal device. Namely, the method includes steps 1301 to 1306 and 1308 to 1310. In another possible implementation manner, the speech processing method provided in the embodiment of the present application may include: the cloud device generates a target editing voice and sends the target editing voice to the terminal device. And the terminal equipment generates target voice according to the target editing voice and the non-editing voice. I.e. the method comprises steps 1301 to 1309.
In the embodiment of the application, first, through the interaction between the cloud device and the terminal device, the cloud device performs the complex calculation to obtain the target editing voice or the target voice and returns it to the terminal device, so that the computing power and storage space required of the terminal device can be reduced. Second, the target editing voice corresponding to the modified text can be generated according to the voice features of the non-edited region in the original voice, and the target voice corresponding to the target text is then generated from the target editing voice and the non-edited voice. Third, the user can modify the text in the original text to obtain the target editing voice corresponding to the modified text (i.e., the second text), which improves the user's experience of text-based voice editing. Fourth, when the target voice is generated, the non-edited voice is not modified, and the pitch feature of the target editing voice is similar to that of the non-edited voice, so that when listening to the original voice and the target voice, the user can hardly hear any difference between their voice features.
With reference to fig. 14, the voice processing method in the embodiment of the present application is described above, and a voice processing device in the embodiment of the present application is described below, where an embodiment of the voice processing device in the embodiment of the present application includes:
an obtaining module 1401, configured to obtain an original voice and a second text, where the second text is a text in a target text except for a first text, where both the target text and an original text corresponding to the original voice include the first text, and a voice of the first text corresponding to the original voice is a non-edited voice;
For a specific description of the obtaining module 1401, reference may be made to the description of step 701 in the foregoing embodiment, and details are not described here again.
A pitch prediction module 1402 for predicting a second pitch characteristic of the second text based on a first pitch (pitch) characteristic of the non-editing speech and the information of the target text;
For a detailed description of the pitch prediction module 1402, reference may be made to the description of step 702 in the foregoing embodiment, and details are not described here again.
A generating module 1403, configured to obtain, according to the second pitch feature and the second text, a first speech feature corresponding to the second text through a neural network;
and generating target editing voice corresponding to the second text according to the first voice characteristic.
For a detailed description of the generating module 1403, reference may be made to the descriptions of steps 703 and 704 in the foregoing embodiment, which is not described herein again.
In one possible implementation, the content of the original speech is the singing voice of the user.
In one possible implementation, the predicting according to the first pitch (pitch) feature of the non-edited speech and the information of the target text comprises:
predicting according to the first pitch (pitch) feature of the non-edited speech, the information of the target text, and a second speech feature of the non-edited speech, where the second speech feature carries at least one of the following information:
part of or all of the speech frames of the non-edited speech;
a voiceprint feature of the non-editing speech;
a timbre characteristic of the non-edited speech;
prosodic features of the non-edited speech; and
a rhythmic characteristic of the non-edited speech.
In one possible implementation, the information of the target text includes: text embedding (text embedding) of each phoneme in the target text.
In one possible implementation, the target text is a text obtained by inserting the second text into the first text; or the target text is a text obtained by deleting a first part of text of the first text, and the second text is a text adjacent to the first part of text;
the pitch prediction module is specifically configured to:
fusing a first pitch (pitch) feature of the non-editing voice and information of the target text to obtain a first fusion result;
and inputting the first fusion result into a second neural network to obtain a second pitch feature of the second text.
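As an illustration only, the fusion step for the insertion or deletion case can be sketched as follows. The sketch assumes frame-aligned inputs, marks each frame of the second text with a missing pitch (`None`), and fuses by concatenating the pitch value, a known/unknown flag, and the text embedding of the frame. The names `fuse_pitch_and_text` and `placeholder_second_network` are hypothetical, and the placeholder simply returns the mean of the known pitch values; it stands in for the trained second neural network, whose structure is not specified here.

```python
from typing import List, Optional, Sequence


def fuse_pitch_and_text(
    pitch: Sequence[Optional[float]],
    text_embedding: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate per-frame pitch, a known-pitch flag, and the text embedding.

    pitch[i] is None for frames that belong to the second text (assumption).
    """
    if len(pitch) != len(text_embedding):
        raise ValueError("pitch and text embedding must be frame-aligned")
    fused = []
    for f0, emb in zip(pitch, text_embedding):
        known = f0 is not None
        fused.append([f0 if known else 0.0, 1.0 if known else 0.0, *emb])
    return fused


def placeholder_second_network(fused: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical stand-in for the second neural network.

    Returns one pitch value per unknown frame: the mean of the known pitches.
    """
    known = [row[0] for row in fused if row[1] == 1.0]
    mean = sum(known) / len(known) if known else 0.0
    return [mean for row in fused if row[1] == 0.0]


# Frames 2 and 3 belong to the inserted second text; the others are non-edited.
first_pitch = [200.0, 210.0, None, None, 220.0]
text_embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]

fused = fuse_pitch_and_text(first_pitch, text_embedding)
second_pitch = placeholder_second_network(fused)
print(fused[2])
print(second_pitch)
```

Concatenation with a mask flag is only one possible way to fuse the two inputs; the text does not state which fusion operation is used.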
In one possible implementation, the target text is obtained by replacing a second part of text in the first text with the second text;
the pitch prediction module is specifically configured to:
inputting a first pitch (pitch) feature of the non-edited speech into a third neural network to obtain an initial pitch feature, the initial pitch feature comprising a pitch for each of a plurality of frames;
inputting information of the target text into a fourth neural network to obtain pronunciation characteristics of the second text, wherein the pronunciation characteristics are used for indicating whether each frame in a plurality of frames included in the initial pitch characteristics pronounces;
fusing the initial pitch feature and the pronunciation feature to obtain a second pitch feature of the second text.
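The final fusion step of the replacement case can be illustrated with a small sketch. It assumes that the pronunciation feature is a per-frame voicing probability and that fusion means zeroing the pitch of frames judged unvoiced; both are assumptions, since the text only says the two features are fused. The function name `fuse_pitch_with_voicing` and the threshold value are hypothetical, and the input lists stand in for the outputs of the third and fourth neural networks.

```python
from typing import List, Sequence


def fuse_pitch_with_voicing(
    initial_pitch: Sequence[float],
    voiced_prob: Sequence[float],
    threshold: float = 0.5,
) -> List[float]:
    """Keep the predicted pitch on voiced frames and set unvoiced frames to 0.0."""
    if len(initial_pitch) != len(voiced_prob):
        raise ValueError("both features must cover the same frames")
    return [
        f0 if prob >= threshold else 0.0
        for f0, prob in zip(initial_pitch, voiced_prob)
    ]


# Stand-ins for the network outputs (illustrative values only).
initial_pitch = [220.0, 221.5, 223.0, 224.0]  # pitch in Hz for each frame
voiced_prob = [0.9, 0.2, 0.8, 0.1]            # whether each frame is pronounced

second_pitch = fuse_pitch_with_voicing(initial_pitch, voiced_prob)
print(second_pitch)  # [220.0, 0.0, 223.0, 0.0]
```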
In one possible implementation, the apparatus further comprises:
a duration prediction module, configured to predict the number of frames of each phoneme in the second text according to the number of frames of each phoneme in the non-edited speech and the information of the target text.
In one possible implementation, the first pitch (pitch) feature includes: a pitch characteristic of each of a plurality of frames of the non-edited speech;
the second pitch feature includes: a pitch feature of each of a plurality of frames of the target edited speech.
In a possible implementation, the duration prediction module is specifically configured to:
predict the number of frames of each phoneme in the second text according to the number of frames of each phoneme in the non-edited speech, the information of the target text, and the second speech feature of the non-edited speech.
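To make the role of the per-phoneme frame counts concrete, the sketch below shows how predicted frame counts determine how many frames the second text occupies. The predictor here is a deliberately simple placeholder that reuses the mean duration of the non-edited phonemes; it is an assumption standing in for the trained duration prediction module, and the names `placeholder_predict_frames` and `expand_to_frames` are hypothetical.

```python
from typing import List, Sequence


def placeholder_predict_frames(
    non_edited_frames: Sequence[int], num_new_phonemes: int
) -> List[int]:
    """Hypothetical stand-in for the duration predictor.

    Assigns every phoneme of the second text the mean frame count of the
    non-edited phonemes (rounded, at least one frame).
    """
    if not non_edited_frames:
        raise ValueError("at least one non-edited phoneme is required")
    mean = sum(non_edited_frames) / len(non_edited_frames)
    return [max(1, round(mean))] * num_new_phonemes


def expand_to_frames(phonemes: Sequence[str], frames: Sequence[int]) -> List[str]:
    """Repeat each phoneme once per frame so later features are frame-level."""
    return [p for p, n in zip(phonemes, frames) for _ in range(n)]


non_edited_frames = [4, 6, 5]       # frames per phoneme in the non-edited speech
second_text_phonemes = ["n", "i"]   # illustrative phonemes of the second text

predicted = placeholder_predict_frames(non_edited_frames, len(second_text_phonemes))
frame_level = expand_to_frames(second_text_phonemes, predicted)
print(predicted)
print(len(frame_level))
```

The length of `frame_level` is the number of frames for which a pitch value of the second text would then be predicted.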
In one possible implementation, the obtaining module is further configured to:
acquiring the position of the second text in the target text;
and the generating module is further used for splicing the target editing voice and the non-editing voice based on the position to obtain a target voice corresponding to the target text.
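The splicing step can be sketched as follows, under the assumption that the position of the second text is given as a phoneme index and is converted to a sample offset using the frame counts of the preceding non-edited phonemes and a fixed hop length. The names `sample_offset`, `splice`, and `hop_length` are hypothetical, and waveforms are represented as plain lists of samples.

```python
from typing import List, Sequence


def sample_offset(
    frames_per_phoneme: Sequence[int], phoneme_index: int, hop_length: int
) -> int:
    """Sample index at which the phoneme with the given index starts."""
    if not 0 <= phoneme_index <= len(frames_per_phoneme):
        raise ValueError("phoneme_index is outside the non-edited phoneme sequence")
    return sum(frames_per_phoneme[:phoneme_index]) * hop_length


def splice(
    non_edited: Sequence[float], edited: Sequence[float], offset: int
) -> List[float]:
    """Insert the target edited speech into the non-edited speech at offset."""
    if not 0 <= offset <= len(non_edited):
        raise ValueError("offset is outside the non-edited speech")
    return list(non_edited[:offset]) + list(edited) + list(non_edited[offset:])


frames_per_phoneme = [2, 1, 3]               # non-edited phonemes, 6 frames in total
hop_length = 2                               # samples per frame (illustrative)
non_edited = [float(i) for i in range(12)]   # 6 frames * 2 samples
edited = [9.0, 9.0, 9.0, 9.0]                # target edited speech

offset = sample_offset(frames_per_phoneme, 2, hop_length)
target = splice(non_edited, edited, offset)
print(offset, len(target))
```

Because the non-edited speech already excludes any replaced or deleted region, the same concatenation covers the replacement case once the offset is known.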
Referring to fig. 15, an embodiment of the present application provides another speech processing device. For convenience of description, only the parts related to the embodiments of the present application are shown; for specific technical details that are not disclosed, refer to the method part of the embodiments of the present application. The speech processing device may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point of sale (POS) terminal, a vehicle-mounted computer, and the like. The following takes a mobile phone as an example:
Fig. 15 is a block diagram illustrating a partial structure of a mobile phone related to the speech processing device provided in an embodiment of the present application. Referring to fig. 15, the mobile phone includes: a radio frequency (RF) circuit 1510, a memory 1520, an input unit 1530, a display unit 1540, a sensor 1550, an audio circuit 1560, a wireless fidelity (WiFi) module 1570, a processor 1580, and a power supply 1590. Those skilled in the art will appreciate that the mobile phone structure shown in fig. 15 does not constitute a limitation on the mobile phone, which may include more or fewer components than those shown, combine some components, or use a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 15:
The RF circuit 1510 may be configured to receive and transmit signals during information transmission and reception or during a call. In particular, after receiving downlink information from a base station, the RF circuit 1510 delivers the information to the processor 1580 for processing; in addition, it transmits uplink data to the base station. In general, the RF circuit 1510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1510 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short message service (SMS), and the like.
The memory 1520 may be used to store software programs and modules, and the processor 1580 performs various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1520. The memory 1520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone, and the like. Further, the memory 1520 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 1530 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1530 may include a touch panel 1531 and other input devices 1532. The touch panel 1531, also referred to as a touch screen, can collect touch operations performed by a user on or near it (e.g., operations performed by the user on or near the touch panel 1531 using any suitable object or accessory such as a finger or a stylus) and drive the corresponding connected device according to a preset program. Optionally, the touch panel 1531 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 1580, and can receive and execute commands sent by the processor 1580. In addition, the touch panel 1531 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 1530 may include other input devices 1532 in addition to the touch panel 1531. In particular, the other input devices 1532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1540 may be used to display information input by the user or information provided to the user and various menus of the mobile phone. The display unit 1540 may include a display panel 1541, and optionally, the display panel 1541 may be configured in the form of a Liquid Crystal Display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 1531 may cover the display panel 1541, and when the touch panel 1531 detects a touch operation on or near the touch panel 1531, the touch operation is transmitted to the processor 1580 to determine the type of the touch event, and then the processor 1580 provides a corresponding visual output on the display panel 1541 according to the type of the touch event. Although in fig. 15, the touch panel 1531 and the display panel 1541 are two separate components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 1531 and the display panel 1541 may be integrated to implement the input and output functions of the mobile phone.
The mobile phone may also include at least one sensor 1550, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor, which adjusts the brightness of the display panel 1541 according to the brightness of ambient light, and a proximity sensor, which turns off the display panel 1541 and/or the backlight when the mobile phone is moved to the ear. As one type of motion sensor, an accelerometer can detect the magnitude of acceleration in each direction (generally along three axes), can detect the magnitude and direction of gravity when stationary, and can be used in applications that recognize the posture of the mobile phone (such as switching between landscape and portrait modes, related games, and magnetometer posture calibration) and in vibration-recognition functions (such as a pedometer and tap detection). Other sensors that can be configured on the mobile phone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described here.
WiFi belongs to short-distance wireless transmission technology, and the mobile phone can help a user to receive and send e-mails, browse webpages, access streaming media and the like through a WiFi module 1570, and provides wireless broadband internet access for the user. Although fig. 15 shows WiFi module 1570, it is understood that it does not belong to the essential components of the handset.
The processor 1580 is the control center of the mobile phone. It connects the various parts of the entire mobile phone through various interfaces and lines, and performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 1520 and calling the data stored in the memory 1520, thereby monitoring the mobile phone as a whole. Optionally, the processor 1580 may include one or more processing units; preferably, the processor 1580 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 1580.
The mobile phone also includes a power supply 1590 (e.g., a battery) that supplies power to the various components. Preferably, the power supply may be logically connected to the processor 1580 through a power management system, so that charging, discharging, and power consumption are managed through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In this embodiment of the application, the processor 1580 included in the terminal device may execute the function of the speech processing device in the embodiment shown in fig. 7a or execute the function of the terminal device in the embodiment shown in fig. 13, which is not described herein again.
Referring to fig. 16, a schematic structural diagram of another speech processing device provided by the present application is shown. The voice processing device may be a cloud device. The cloud device may include a processor 1601, memory 1602, and a communication interface 1603. The processor 1601, the memory 1602 and the communication interface 1603 are interconnected by wires. Among other things, memory 1602 has stored therein program instructions and data.
The memory 1602 stores program instructions and data corresponding to the steps executed by the speech processing device in the embodiment corresponding to fig. 7a, or program instructions and data corresponding to the steps executed by the cloud device in the embodiment corresponding to fig. 13.
The processor 1601 is configured to perform the steps performed by the speech processing device in any of the embodiments shown in fig. 7a, or the steps performed by the cloud device in any of the embodiments shown in fig. 13.
In one implementation, the cloud device may include more or fewer components than those shown in fig. 16; the components shown are merely exemplary and not limiting.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is merely a logical division, and an actual implementation may use another division; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in an electrical, mechanical, or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated units described above may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
When the integrated unit is implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be any usable medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Claims (24)
1. A method of speech processing, the method comprising:
acquiring original voice and a second text, wherein the second text is a text in a target text except for a first text, the target text and the original text corresponding to the original voice both comprise the first text, and the voice of the first text corresponding to the original voice is non-edited voice;
predicting a second pitch characteristic of the second text from a first pitch (pitch) characteristic of the non-editing speech and information of the target text;
obtaining a first voice feature corresponding to the second text through a neural network according to the second pitch feature and the second text;
and generating target editing voice corresponding to the second text according to the first voice characteristic.
2. The method of claim 1, wherein the content of the original speech is a singing voice of the user.
3. The method of any of claims 1 to 3, wherein the predicting according to the first pitch (pitch) feature of the non-editing speech and the information of the target text comprises:
predicting according to the first pitch (pitch) feature of the non-editing speech, the information of the target text, and a second speech feature of the non-editing speech; the second speech feature carries at least one of the following information:
part of or all of the speech frames of the non-edited speech;
a voiceprint feature of the non-editing speech;
a timbre characteristic of the non-edited speech;
prosodic features of the non-editing speech; and
a rhythmic characteristic of the non-edited speech.
4. The method according to any one of claims 1 to 4, wherein the information of the target text comprises:
text embedding (text embedding) of each phoneme in the target text.
5. The method according to any one of claims 1 to 5, wherein the target text is a text obtained by inserting the second text into the first text; or the target text is a text obtained by deleting a first part of text of the first text, and the second text is a text adjacent to the first part of text;
predicting a second pitch characteristic of the second text from a first pitch (pitch) characteristic of the non-editing speech and information of the target text, comprising:
fusing a first pitch (pitch) feature of the non-editing voice and information of the target text to obtain a first fusion result;
and inputting the first fusion result into a second neural network to obtain a second pitch feature of the second text.
6. The method according to any one of claims 1 to 5, wherein the target text is obtained by replacing a second part of text in the first text with the second text;
predicting a second pitch characteristic of the second text from a first pitch (pitch) characteristic of the non-editing speech and information of the target text, comprising:
inputting a first pitch (pitch) feature of the non-editing speech into a third neural network to obtain an initial pitch feature, the initial pitch feature comprising a pitch for each of a plurality of frames;
inputting information of the target text into a fourth neural network to obtain pronunciation characteristics of the second text, wherein the pronunciation characteristics are used for indicating whether each frame in a plurality of frames included in the initial pitch characteristics pronounces;
fusing the initial pitch feature and the pronunciation feature to obtain a second pitch feature of the second text.
7. The method of any of claims 1 to 6, further comprising:
and predicting the frame number of each phoneme in the second text according to the frame number of each phoneme in the non-edited voice and the information of the target text.
8. The method of any one of claims 1 to 7, wherein the first pitch (pitch) feature comprises: a pitch characteristic of each of a plurality of frames of the non-edited speech;
the second pitch feature comprises: a pitch characteristic of each of a plurality of frames of the target editing voice.
9. The method according to claim 7 or 8, wherein the predicting according to the frame number of each phoneme in the non-edited voice and the information of the target text comprises:
predicting according to the frame number of each phoneme in the non-edited voice, the information of the target text and the second voice characteristic of the non-edited voice.
10. The method according to any one of claims 1 to 9, further comprising:
acquiring the position of the second text in the target text;
and splicing the target editing voice and the non-editing voice based on the position to obtain the target voice corresponding to the target text.
11. A speech processing apparatus, characterized in that the apparatus comprises:
an obtaining module, configured to obtain an original voice and a second text, wherein the second text is a text in a target text except for a first text, the target text and the original text corresponding to the original voice both comprise the first text, and the voice of the first text corresponding to the original voice is a non-edited voice;
a pitch prediction module to predict a second pitch characteristic of the second text from a first pitch (pitch) characteristic of the non-editing speech and information of the target text;
the generating module is used for obtaining a first voice feature corresponding to the second text through a neural network according to the second pitch feature and the second text;
and generating target editing voice corresponding to the second text according to the first voice characteristic.
12. The apparatus of claim 11, wherein the content of the original speech is a singing voice of the user.
13. The apparatus of claim 11 or 12, wherein the predicting according to the first pitch (pitch) feature of the non-editing speech and the information of the target text comprises:
predicting according to the first pitch (pitch) feature of the non-editing speech, the information of the target text, and a second speech feature of the non-editing speech; the second speech feature carries at least one of the following information:
part of or all of the speech frames of the non-edited speech;
a voiceprint feature of the non-editing speech;
a timbre characteristic of the non-edited speech;
prosodic features of the non-editing speech; and
a rhythmic characteristic of the non-edited speech.
14. The apparatus according to any one of claims 11 to 14, wherein the information of the target text comprises: text embedding (text embedding) of each phoneme in the target text.
15. The apparatus according to any one of claims 11 to 14, wherein the target text is a text obtained by inserting the second text into the first text; or the target text is a text obtained by deleting a first part of text of the first text, and the second text is a text adjacent to the first part of text;
the pitch prediction module is specifically configured to:
fusing a first pitch (pitch) feature of the non-editing voice and information of the target text to obtain a first fusion result;
and inputting the first fusion result into a second neural network to obtain a second pitch feature of the second text.
16. The apparatus according to any one of claims 11 to 14, wherein the target text is obtained by replacing a second part of text in the first text with the second text;
the pitch prediction module is specifically configured to:
inputting a first pitch (pitch) feature of the non-editing speech into a third neural network to obtain an initial pitch feature, the initial pitch feature comprising a pitch for each of a plurality of frames;
inputting information of the target text into a fourth neural network to obtain pronunciation characteristics of the second text, wherein the pronunciation characteristics are used for indicating whether each frame in a plurality of frames included in the initial pitch characteristics pronounces;
fusing the initial pitch feature and the pronunciation feature to obtain a second pitch feature of the second text.
17. The apparatus of any one of claims 11 to 16, further comprising:
and the duration prediction module is used for predicting the frame number of each phoneme in the second text according to the frame number of each phoneme in the non-edited voice and the information of the target text.
18. The apparatus of any one of claims 11 to 17, wherein the first pitch (pitch) feature comprises: a pitch characteristic of each of a plurality of frames of the non-edited speech;
the second pitch feature comprises: a pitch characteristic of each of a plurality of frames of the target editing voice.
19. The apparatus according to claim 17 or 18, wherein the duration prediction module is specifically configured to:
predict the frame number of each phoneme in the second text according to the frame number of each phoneme in the non-edited voice, the information of the target text and the second voice characteristic of the non-edited voice.
20. The apparatus according to any one of claims 11 to 19, wherein the obtaining module is further configured to:
acquiring the position of the second text in the target text;
and the generating module is further used for splicing the target editing voice and the non-editing voice based on the position to obtain a target voice corresponding to the target text.
21. A speech processing device, comprising: a processor coupled with a memory for storing a program or instructions that, when executed by the processor, cause the speech processing device to perform the method of any of claims 1 to 10.
22. The apparatus of claim 21, further comprising:
an input unit for receiving a second text;
and the output unit is used for playing the target editing voice corresponding to the second text or the target voice corresponding to the target text.
23. A computer-readable storage medium having stored therein instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 10.
24. A computer program product, characterized in that the computer program product, when executed on a computer, causes the computer to perform the method according to any of claims 1 to 10.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210468926.8A CN114882862A (en) | 2022-04-29 | 2022-04-29 | Voice processing method and related equipment |
PCT/CN2023/086497 WO2023207541A1 (en) | 2022-04-29 | 2023-04-06 | Speech processing method and related device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114882862A true CN114882862A (en) | 2022-08-09 |
Family
ID=82673378
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202210468926.8A | Pending | CN114882862A (en) | 2022-04-29 | 2022-04-29 | Voice processing method and related equipment |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114882862A (en) |
WO (1) | WO2023207541A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20240177386A1 (en) * | 2022-11-28 | 2024-05-30 | Alemira Ag | System and method for an audio-visual avatar creation |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006349787A (en) * | 2005-06-14 | 2006-12-28 | Hitachi Information & Control Solutions Ltd | Method and device for synthesizing voices |
JP5423466B2 (en) * | 2010-02-19 | 2014-02-19 | 富士通株式会社 | Speech synthesis apparatus, speech synthesis method, and speech synthesis program |
CN111899706B (en) * | 2020-07-30 | 2024-08-23 | 广州酷狗计算机科技有限公司 | Audio production method, device, equipment and storage medium |
CN113421547B (en) * | 2021-06-03 | 2023-03-17 | 华为技术有限公司 | Voice processing method and related equipment |
CN113808555B (en) * | 2021-09-17 | 2024-08-02 | 广州酷狗计算机科技有限公司 | Song synthesizing method and device, equipment, medium and product thereof |
CN113920977A (en) * | 2021-09-30 | 2022-01-11 | 宿迁硅基智能科技有限公司 | Speech synthesis model, model training method and speech synthesis method |
CN114882862A (en) * | 2022-04-29 | 2022-08-09 | 华为技术有限公司 | Voice processing method and related equipment |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023207541A1 (en) * | 2022-04-29 | 2023-11-02 | 华为技术有限公司 | Speech processing method and related device |
CN116189654A (en) * | 2023-02-23 | 2023-05-30 | 京东科技信息技术有限公司 | Voice editing method and device, electronic equipment and storage medium |
CN117153144A (en) * | 2023-10-31 | 2023-12-01 | 杭州宇谷科技股份有限公司 | Battery information voice broadcasting method and device based on terminal calculation |
CN117153144B (en) * | 2023-10-31 | 2024-02-06 | 杭州宇谷科技股份有限公司 | Battery information voice broadcasting method and device based on terminal calculation |
Also Published As
Publication number | Publication date |
---|---|
WO2023207541A1 (en) | 2023-11-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113421547B (en) | | Voice processing method and related equipment |
CN112487182B (en) | | Training method of text processing model, text processing method and device |
CN110782870B (en) | | Speech synthesis method, device, electronic equipment and storage medium |
CN110490213B (en) | | Image recognition method, device and storage medium |
CN107979764B (en) | | Video subtitle generating method based on semantic segmentation and multi-layer attention framework |
CN111048062B (en) | | Speech synthesis method and apparatus |
CN111276120B (en) | | Speech synthesis method, apparatus and computer-readable storage medium |
WO2023207541A1 (en) | | Speech processing method and related device |
CN112347795A (en) | | Machine translation quality evaluation method, device, equipment and medium |
JPH0375860A (en) | | Personalized terminal |
CN113948060A (en) | | Network training method, data processing method and related equipment |
CN112837669B (en) | | Speech synthesis method, device and server |
CN113822076A (en) | | Text generation method and device, computer equipment and storage medium |
CN114360492B (en) | | Audio synthesis method, device, computer equipment and storage medium |
JP2022502758A (en) | | Coding methods, equipment, equipment and programs |
CN112002346A (en) | | Gender and age identification method, device, equipment and storage medium based on voice |
CN110162598A (en) | | A kind of data processing method and device, a kind of device for data processing |
CN115937369A (en) | | Expression animation generation method and system, electronic equipment and storage medium |
CN115688937A (en) | | Model training method and device |
CN115169472A (en) | | Music matching method and device for multimedia data and computer equipment |
WO2024193434A1 (en) | | Audio processing method and apparatus, device and storage medium |
CN117877125B (en) | | Action recognition and model training method and device, electronic equipment and storage medium |
CN115240713B (en) | | Voice emotion recognition method and device based on multi-modal characteristics and contrast learning |
CN112712788B (en) | | Speech synthesis method, training method and device of speech synthesis model |
CN114464163A (en) | | Method, device, equipment, storage medium and product for training speech synthesis model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||