CN108198547B - Voice endpoint detection method and device, computer equipment and storage medium - Google Patents

Publication number
CN108198547B
Authority
CN
China
Prior art keywords
voice
noise
feature vector
acoustic
spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810048223.3A
Other languages
Chinese (zh)
Other versions
CN108198547A (en)
Inventor
黄石磊
刘轶
王昕�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd
Priority to CN201810048223.3A
Publication of CN108198547A
Application granted
Publication of CN108198547B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/032 Quantisation or dequantisation of spectral components
    • G10L19/038 Vector quantisation, e.g. TwinVQ audio
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to a voice endpoint detection method, a voice endpoint detection device, computer equipment and a storage medium. The method comprises the following steps: acquiring a voice signal with noise, and extracting acoustic features and spectrum features corresponding to the voice signal with noise; converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors; acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain an acoustic feature vector added with a voice tag and a spectral feature vector added with the voice tag; analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain a corresponding voice signal; and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal. The method can effectively improve the accuracy of voice endpoint detection.

Description

Voice endpoint detection method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of signal processing technologies, and in particular, to a voice endpoint detection method and apparatus, a computer device, and a storage medium.
Background
With the continuous development of voice technology, voice endpoint detection plays a very important role in voice recognition. Voice endpoint detection locates the start point and the end point of the voice portion within a continuous segment of noisy voice, so that the voice can be effectively recognized.
There are two traditional approaches to voice endpoint detection. The first extracts features from each signal segment, based on the differences between voice and noise signals in their time-domain and frequency-domain characteristics, and compares those features with a set threshold to detect the voice endpoints. However, this method is suitable only for stationary noise conditions; its noise robustness is poor and it has difficulty distinguishing pure voice from noise, resulting in low accuracy of voice endpoint detection. The second is based on neural networks and performs endpoint detection on voice signals using a trained model. However, the input vectors of most such models contain only the features of the noisy voice, so noise robustness is poor and the accuracy of voice endpoint detection is low. Therefore, how to effectively improve the accuracy of voice endpoint detection has become a technical problem to be solved.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a voice endpoint detection method, apparatus, computer device and storage medium capable of effectively improving accuracy of voice endpoint detection.
A voice endpoint detection method, comprising:
acquiring a voice signal with noise, and extracting acoustic features and spectrum features corresponding to the voice signal with noise;
converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain an acoustic feature vector added with a voice tag and a spectral feature vector added with the voice tag;
analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain a corresponding voice signal;
and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal.
In one embodiment, before the extracting the acoustic feature and the spectral feature corresponding to the noisy speech signal, the method further includes:
converting the voice signal with noise into a voice frequency spectrum with noise;
and carrying out time domain analysis and/or frequency domain analysis and/or transform domain analysis on the voice frequency spectrum with the noise to obtain acoustic characteristics corresponding to the voice signal with the noise.
In one embodiment, before the extracting the acoustic feature and the spectral feature corresponding to the noisy speech signal, the method further includes:
converting the voice signal with noise into a voice frequency spectrum with noise, and calculating a voice amplitude spectrum with noise according to the voice frequency spectrum with noise;
carrying out dynamic noise estimation on the voice frequency spectrum with the noise according to the voice amplitude spectrum with the noise to obtain a noise amplitude spectrum;
estimating a voice amplitude spectrum of the pure voice signal according to the voice amplitude spectrum with the noise and the noise amplitude spectrum;
and generating the frequency spectrum characteristics corresponding to the voice signal with the noise by using the voice amplitude spectrum with the noise, the noise amplitude spectrum and the voice amplitude spectrum.
In one embodiment, the converting the acoustic features and the spectral features includes:
extracting a preset number of frames before and after the current frame in the acoustic features and the frequency spectrum features;
calculating a mean vector and/or a variance vector corresponding to the current frame by using a preset number of frames before and after the current frame;
and carrying out logarithmic domain conversion on the acoustic feature and the frequency spectrum feature after the mean vector and/or the variance vector corresponding to the current frame are calculated to obtain a converted acoustic feature vector and a converted frequency spectrum feature vector.
In one embodiment, the step of obtaining the classifier further comprises:
acquiring noisy voice data added with a voice category label, and training the noisy voice data to obtain an initial classifier;
obtaining a first verification set, wherein the first verification set comprises a plurality of first voice data;
inputting a plurality of first voice data into a classifier to obtain class probabilities corresponding to the plurality of first voice data;
screening the category probabilities corresponding to the plurality of first voice data, and adding category labels to the selected first voice data to obtain a verification set added with the category labels;
training by using the verification set added with the class label and the training set to obtain a verification classifier;
obtaining a second verification set, wherein the second verification set comprises a plurality of second voice data;
inputting a plurality of second voice data into a verification classifier to obtain class probabilities corresponding to the plurality of second voice data;
and when the class probability corresponding to the second voice data reaches a preset probability value, obtaining the required classifier.
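As a non-limiting illustration, the classifier training steps above might be sketched in Python as follows. Everything concrete here is an assumption made for the sketch and is not specified by the application: the one-dimensional features, the nearest-class-mean classifier, the distance-based pseudo-probability, the function names `train`, `voice_probability`, and `self_train`, and the `select` and `accept` probability values.

```python
def train(data):
    """Fit a toy classifier: one mean per class.

    data is a list of (feature, label) pairs, label 1 = voice, 0 = non-voice.
    The class-mean model is a hypothetical stand-in for the real classifier.
    """
    means = {}
    for label in (0, 1):
        values = [x for x, y in data if y == label]
        means[label] = sum(values) / len(values)
    return means


def voice_probability(means, x):
    """Pseudo-probability of the voice class from distances to both class means."""
    d0 = abs(x - means[0])
    d1 = abs(x - means[1])
    if d0 + d1 == 0:
        return 0.5
    return d0 / (d0 + d1)


def self_train(train_set, first_set, second_set, select=0.8, accept=0.8):
    """Train, pseudo-label the first verification set, retrain, then verify."""
    initial = train(train_set)

    # Screen the first verification set by class probability and add
    # class labels only to the confidently classified items.
    labelled = []
    for x in first_set:
        p = voice_probability(initial, x)
        if p >= select:
            labelled.append((x, 1))
        elif p <= 1 - select:
            labelled.append((x, 0))

    # Retrain on the training set plus the newly labelled verification data.
    verified = train(train_set + labelled)

    # Accept the classifier only if every item of the second verification
    # set reaches the preset probability value for one of the classes.
    accepted = all(
        max(p, 1 - p) >= accept
        for p in (voice_probability(verified, x) for x in second_set)
    )
    return verified, accepted
```

If `accepted` is false, a caller would presumably repeat the procedure with more data; the application does not state what happens in that case.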
In one embodiment, the step of classifying the acoustic feature vector and the spectral feature vector by the classifier comprises:
taking the acoustic feature vector and the spectral feature vector as input of a classifier to obtain decision values corresponding to the acoustic feature vector and the spectral feature vector;
when the decision value is a first threshold value, adding a voice tag to the acoustic feature vector or the spectrum feature vector;
and when the decision value is a second threshold value, adding a non-voice label to the acoustic feature vector or the spectrum feature vector.
A voice endpoint detection apparatus comprising:
the extraction module is used for acquiring a voice signal with noise and extracting acoustic features and spectral features corresponding to the voice signal with noise;
the conversion module is used for converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
the classification module is used for acquiring a classifier, inputting the acoustic feature vector and the spectral feature vector into the classifier, and obtaining an acoustic feature vector added with a voice tag and a spectral feature vector added with the voice tag;
the analysis module is used for analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain a corresponding voice signal; and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal.
In one embodiment, the conversion module is further configured to extract a preset number of frames before and after a current frame in the acoustic features and the spectral features; calculating a mean vector and/or a variance vector corresponding to the current frame by using a preset number of frames before and after the current frame; and carrying out logarithmic domain conversion on the acoustic feature and the frequency spectrum feature after the mean vector and/or the variance vector corresponding to the current frame are calculated to obtain a converted acoustic feature vector and a converted frequency spectrum feature vector.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a voice signal with noise, and extracting acoustic features and spectrum features corresponding to the voice signal with noise;
converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain an acoustic feature vector added with a voice tag and a spectral feature vector added with the voice tag;
analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain a corresponding voice signal;
and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal.
A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of:
acquiring a voice signal with noise, and extracting acoustic features and spectrum features corresponding to the voice signal with noise;
converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain an acoustic feature vector added with a voice tag and a spectral feature vector added with the voice tag;
analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain a corresponding voice signal;
and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal.
With the above voice endpoint detection method, apparatus, computer device and storage medium, a voice signal with noise is acquired, and the acoustic features and spectral features corresponding to the voice signal with noise are extracted; the acoustic features and the spectral features are converted to obtain corresponding acoustic feature vectors and spectral feature vectors. A classifier is acquired, and the acoustic feature vector and the spectral feature vector are input into the classifier to obtain the acoustic feature vector added with the voice tag and the spectral feature vector added with the voice tag, so that the acoustic feature vector and the spectral feature vector can be effectively classified, and voice and non-voice can be effectively identified. The acoustic feature vector added with the voice tag and the spectral feature vector added with the voice tag are analyzed to obtain a corresponding voice signal; the starting point and the ending point corresponding to the voice signal are determined according to the time sequence of the voice signal, so that the starting point and the ending point of the voice signal with noise can be accurately identified, and the accuracy of voice endpoint detection can be effectively improved.
Drawings
FIG. 1 is a flow diagram of a method for voice endpoint detection in one embodiment;
FIG. 2 is a diagram of the internal structure of the speech endpoint detection apparatus in one embodiment;
FIG. 3 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not limiting of the application. It will be understood that the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another.
In one embodiment, as shown in FIG. 1, a voice endpoint detection method is provided. The method is described here as applied to a terminal, and includes the following steps:
Step 102, acquiring a voice signal with noise, and extracting acoustic features and spectral features corresponding to the voice signal with noise.
A voice signal collected in practice usually contains noise of a certain intensity. When the noise is strong, voice applications are noticeably affected; for example, voice recognition efficiency is low and endpoint detection accuracy is reduced.
The terminal can acquire the voice input by the user through a voice input device. The terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, or the like, and further includes a voice input device, for example, a microphone or another device having a voice recording function. The voice input by the user and acquired by the terminal is usually a noisy voice signal, such as call voice, recorded audio, or a voice instruction input by the user. After the terminal acquires the voice signal with noise, it extracts the acoustic features and the spectral features corresponding to the voice signal with noise. The acoustic features may include feature information of unvoiced sounds, voiced sounds, vowels, consonants, and the like in the noisy voice signal. The spectral features may include the vibration frequency and vibration amplitude of the noisy voice signal and feature information such as its loudness and timbre.
Specifically, after the terminal acquires the voice signal with noise, it windows and frames the signal. For example, a Hanning window may be used to divide the noisy voice signal into a plurality of frames, each 10-30 ms (milliseconds) long, with a frame shift of 10 ms. After windowing and framing, the terminal performs a fast Fourier transform on each frame, thereby obtaining the frequency spectrum of the noisy voice signal. The terminal can then extract the acoustic features and the spectral features corresponding to the noisy voice signal from this frequency spectrum.
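As a non-limiting illustration, the windowing, framing, and Fourier transform described above might be sketched in Python as follows. The function names `frame_signal` and `dft`, the 25 ms frame length (chosen from the stated 10-30 ms range), and the use of a naive discrete Fourier transform in place of a fast Fourier transform are assumptions made to keep the sketch self-contained; a practical implementation would use an FFT routine.

```python
import cmath
import math


def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split a signal into overlapping frames and apply a Hanning window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    # Hanning window: w[n] = 0.5 - 0.5 * cos(2 * pi * n / (N - 1))
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


def dft(frame):
    """Naive O(N^2) DFT returning the non-negative frequency bins 0..N/2.

    This stands in for the fast Fourier transform named in the text.
    """
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n // 2 + 1)]
```

Taking `abs()` of each returned bin gives the amplitude spectrum of a frame, and `cmath.phase()` gives its phase spectrum.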
Step 104, converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors.
After the terminal extracts the acoustic features and the spectrum features corresponding to the voice signals with noise, the acoustic features and the spectrum features corresponding to the extracted voice signals with noise are converted, the acoustic features are converted into corresponding acoustic feature vectors, and the spectrum features are converted into corresponding spectrum feature vectors.
Step 106, acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain the acoustic feature vector added with the voice tag and the spectral feature vector added with the voice tag.
The terminal obtains a classifier that was trained before voice endpoint detection is performed. By adding voice tags and non-voice tags to the input acoustic feature vectors and spectral feature vectors, the classifier divides them into acoustic feature vectors and spectral feature vectors of the voice class and acoustic feature vectors and spectral feature vectors of the non-voice class. The terminal inputs the acoustic feature vector and the spectral feature vector corresponding to the noisy voice signal into the classifier, which classifies them. When an input acoustic feature vector or spectral feature vector belongs to the voice class, a voice tag is added to it; when it belongs to the non-voice class, a non-voice tag is added to it, so that voice and non-voice can be accurately distinguished. After the terminal classifies the acoustic feature vector and the spectral feature vector with the classifier, the acoustic feature vector added with the voice tag and the spectral feature vector added with the voice tag are obtained.
Further, when the terminal takes the acoustic feature vector and the spectral feature vector as input of the classifier, it can also obtain decision values corresponding to the acoustic feature vector and the spectral feature vector. The terminal can add voice tags or non-voice tags to the acoustic feature vectors and the spectral feature vectors according to the obtained decision values, so that the acoustic feature vectors and the spectral feature vectors are accurately classified.
Step 108, analyzing the acoustic feature vector added with the voice tag and the spectral feature vector added with the voice tag to obtain the voice signal added with the voice tag.
Step 110, determining a starting point and an ending point corresponding to the voice signal according to the voice tag and the time sequence of the voice signal.
After the terminal classifies the acoustic feature vectors and the spectral feature vectors, the acoustic feature vectors to which the voice tags are added and the spectral feature vectors to which the voice tags are added need to be analyzed. Specifically, the terminal analyzes the acoustic feature vector added with the voice tag and the spectrum feature vector added with the voice tag to obtain the acoustic feature added with the voice tag and a spectrum corresponding to the spectrum feature. And the terminal converts the acoustic characteristics added with the voice tag and the frequency spectrum corresponding to the spectral characteristics into corresponding voice signals according to the time sequence of the voice signals with the noise, so that the corresponding voice signals can be obtained through analysis.
The noisy speech signal has a timing sequence, and the timing sequence of the speech signal after the addition of the voice tag still corresponds to the timing sequence of the noisy speech signal. The terminal analyzes the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag into corresponding voice signals added with the voice tag, so that the terminal can determine a starting point and an ending point corresponding to the voice signals with noise according to the voice tag and the time sequence of the voice signals.
For example, after the terminal classifies the input acoustic feature vector and spectral feature vector with the classifier, the obtained decision value may be a value between 0 and 1. When the obtained decision value is 1, the terminal adds a voice tag to the acoustic feature vector or the spectral feature vector. When the obtained decision value is 0, the terminal adds a non-voice tag to the acoustic feature vector or the spectral feature vector. The acoustic feature vector and the spectral feature vector can thereby be accurately classified. After the terminal analyzes the acoustic feature vector added with the voice tag and the spectral feature vector added with the voice tag, the voice signal added with the voice tag is obtained. Following the time sequence of the voice signal added with the voice tag, the first voice frame carrying the voice tag is the starting point of the noisy voice signal, and the last voice frame carrying the voice tag is the ending point of the noisy voice signal. Further, the starting point of the voice signal may also be determined from a jump of the decision value from 0 to 1, and the ending point from a jump of the decision value from 1 to 0. Therefore, the starting point and the ending point corresponding to the noisy voice signal can be accurately determined.
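As a non-limiting illustration, the two ways of locating endpoints described above (first and last tagged frame, or jumps of the decision value) might be sketched in Python as follows. The function names, the 0.5 threshold used to map intermediate decision values to tags, and the convention that an endpoint time is the frame index multiplied by a 10 ms frame shift are assumptions made for this sketch.

```python
def label_frames(decisions, threshold=0.5):
    """Map per-frame decision values in [0, 1] to voice (True) / non-voice (False).

    The text only states the cases 1 (voice) and 0 (non-voice); the threshold
    for values in between is an assumption.
    """
    return [d >= threshold for d in decisions]


def find_endpoints(labels, shift_ms=10):
    """Return (start_ms, end_ms) from the first and last voice-tagged frames.

    Returns None when no frame carries the voice tag.
    """
    voiced = [i for i, is_voice in enumerate(labels) if is_voice]
    if not voiced:
        return None
    return voiced[0] * shift_ms, (voiced[-1] + 1) * shift_ms


def find_segments(labels):
    """Return (start_frame, end_frame) pairs from 0->1 and 1->0 jumps.

    end_frame is exclusive: it is the first non-voice frame after the segment.
    """
    segments = []
    start = None
    for i, is_voice in enumerate(labels):
        if is_voice and start is None:
            start = i                     # jump from 0 to 1: starting point
        elif not is_voice and start is not None:
            segments.append((start, i))   # jump from 1 to 0: ending point
            start = None
    if start is not None:                 # voice runs to the end of the signal
        segments.append((start, len(labels)))
    return segments
```

The first function pair yields a single start and end for the whole signal, while `find_segments` also exposes pauses inside it, which is the practical difference between the two rules stated in the text.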
In this embodiment, after the terminal acquires the voice signal with noise, the terminal extracts the acoustic feature and the spectral feature corresponding to the voice signal with noise, and converts the acoustic feature and the spectral feature to obtain a corresponding acoustic feature vector and a corresponding spectral feature vector. The acoustic feature vector and the spectral feature vector are input to the classifier to obtain the acoustic feature vector added with the voice tag and the spectral feature vector added with the voice tag, so that the acoustic feature vector and the spectral feature vector can be effectively classified, and voice and non-voice can be effectively identified. And the terminal analyzes the acoustic characteristic vector added with the voice tag and the frequency spectrum characteristic vector added with the voice tag to obtain a corresponding voice signal. The terminal determines the starting point and the ending point corresponding to the voice signal according to the time sequence of the voice signal, so that the starting point and the ending point of the voice signal with noise can be accurately identified, and the accuracy of voice endpoint detection can be effectively improved.
In one embodiment, before extracting the acoustic features and the spectral features corresponding to the noisy speech signal, the method further includes: converting the voice signal with noise into voice frequency spectrum with noise; and carrying out time domain analysis and/or frequency domain analysis and/or transform domain analysis on the noisy speech frequency spectrum to obtain acoustic characteristics corresponding to the noisy speech signal.
In phonetics, speech features can be classified into acoustic features such as vowels, consonants, unvoiced sounds, voiced sounds, and silence. After the terminal acquires the voice signal with noise, it windows and frames the signal. For example, a Hanning window may be used to divide the noisy voice signal into a plurality of frames, each 10-30 ms (milliseconds) long, with a frame shift of 10 ms. After windowing and framing, the terminal performs a fast Fourier transform on each frame, thereby obtaining the frequency spectrum of the noisy voice signal.
Further, the terminal can perform time domain analysis and/or frequency domain analysis and/or transform domain analysis on the noisy speech frequency spectrum, so that acoustic characteristics corresponding to the noisy speech signal can be obtained.
For example, the terminal may extract the acoustic features corresponding to the noisy voice signal as MFCCs (Mel-frequency cepstral coefficients). After windowing and framing the noisy voice signal, the terminal converts it into the frequency spectrum of the noisy voice signal. The terminal transforms this frequency spectrum into a noisy voice cepstrum, performs cepstral analysis on it, and applies a discrete cosine transform to obtain the acoustic feature of each frame, so that effective acoustic features of the noisy voice can be obtained.
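As a non-limiting illustration, a conventional MFCC computation for one frame might be sketched in Python as follows. The application does not give the filter bank layout or coefficient counts, so the triangular Mel filter bank, the common `2595 * log10(1 + f / 700)` Mel formula, the 20 filters, the 12 coefficients, and all function names are assumptions made for this sketch.

```python
import math


def hz_to_mel(hz):
    """Convert frequency in Hz to the Mel scale (a commonly used formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters spaced evenly on the Mel scale.

    n_bins is the number of spectrum bins from 0 Hz to sample_rate / 2.
    """
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    # n_filters + 2 edge frequencies, expressed as fractional bin positions
    edges = [mel_to_hz(top * i / (n_filters + 1)) / nyquist * (n_bins - 1)
             for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_bins):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(power_spectrum, sample_rate, n_filters=20, n_coeffs=12):
    """Mel filter bank -> log energies -> DCT-II, for a single frame."""
    bank = mel_filterbank(n_filters, len(power_spectrum), sample_rate)
    log_energies = [
        math.log(sum(w * p for w, p in zip(row, power_spectrum)) + 1e-10)
        for row in bank
    ]
    return [
        sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
            for m, e in enumerate(log_energies))
        for c in range(n_coeffs)
    ]
```

Here `power_spectrum` would be the squared magnitude of one frame's spectrum; the small constant `1e-10` only guards against taking the logarithm of zero.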
In one embodiment, before extracting the acoustic features and the spectral features corresponding to the noisy speech signal, the method further includes: converting the voice signal with noise into a voice frequency spectrum with noise, and calculating a voice amplitude spectrum with noise according to the voice frequency spectrum with noise; carrying out dynamic noise estimation on the noisy speech frequency spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimating a voice amplitude spectrum of the pure voice signal according to the voice amplitude spectrum with the noise and the noise amplitude spectrum; and generating the frequency spectrum characteristics corresponding to the voice signal with the noise by using the voice amplitude spectrum with the noise, the noise amplitude spectrum and the voice amplitude spectrum.
After the terminal acquires the voice signal with noise, windowing and framing are carried out on the voice signal with noise. For example, a hanning window may be used to divide a noisy speech signal into frames that are 10-30ms (milliseconds) in length, and the frame shift may take 10 ms. So that the noisy speech signal can be split into frames of noisy speech signals. And after windowing and framing the voice signal with noise by the terminal, carrying out fast Fourier transform on the voice signal with noise after windowing and framing, thereby obtaining the frequency spectrum of the voice signal with noise. The frequency spectrum of the noisy speech signal may be an energy magnitude spectrum of the noisy speech after the fast fourier transform.
Further, the terminal can calculate a noisy speech amplitude spectrum and a noisy speech phase spectrum from the noisy speech spectrum. The terminal performs dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum and the noisy speech phase spectrum. Specifically, the terminal may perform dynamic noise estimation on the noisy speech spectrum by using the improved minima controlled recursive averaging (IMCRA) algorithm, so that a noise amplitude spectrum is obtained. The terminal then estimates the speech amplitude spectrum of the clean speech signal according to the noisy speech amplitude spectrum, the noisy speech phase spectrum, and the noise amplitude spectrum. For example, the terminal may estimate the speech amplitude spectrum of the clean speech signal by using a log-magnitude-spectrum minimum mean square error estimation method.
The terminal generates the frequency spectrum characteristic corresponding to the voice signal with noise by using the estimated voice amplitude spectrum with noise, the estimated noise amplitude spectrum and the estimated voice amplitude spectrum of the pure voice signal, so that the terminal can effectively extract the frequency spectrum characteristic corresponding to the voice signal with noise.
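The way the three amplitude spectra are combined into one spectral feature can be sketched as follows. This sketch deliberately swaps in much simpler techniques than those named in the text: a smoothed-minimum tracker stands in for the improved minima controlled recursive averaging algorithm, and spectral subtraction with a floor stands in for the log-magnitude-spectrum minimum mean square error estimator. The frame length, hop, FFT size, smoothing constant, window length, floor, and all function names are assumptions.

```python
import numpy as np

def stft_magnitude(signal, frame_len=400, hop=160, n_fft=512):
    """Noisy speech amplitude spectrum, shape (n_frames, n_fft // 2 + 1)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

def estimate_noise(noisy_mag, alpha=0.9, window=50):
    """Simplified dynamic noise estimate (not IMCRA): recursively smooth
    each frequency bin over time, then take the minimum of the smoothed
    values over the most recent `window` frames."""
    smoothed = np.empty_like(noisy_mag)
    smoothed[0] = noisy_mag[0]
    for t in range(1, len(noisy_mag)):
        smoothed[t] = alpha * smoothed[t - 1] + (1.0 - alpha) * noisy_mag[t]
    noise = np.empty_like(noisy_mag)
    for t in range(len(noisy_mag)):
        noise[t] = smoothed[max(0, t - window + 1): t + 1].min(axis=0)
    return noise

def estimate_speech(noisy_mag, noise_mag, floor=0.05):
    """Simplified clean-speech estimate (not log-MMSE): spectral
    subtraction, never dropping below a fraction of the noisy spectrum."""
    return np.maximum(noisy_mag - noise_mag, floor * noisy_mag)

def spectral_features(signal):
    """Concatenate the noisy, noise, and estimated speech amplitude spectra per frame."""
    noisy = stft_magnitude(signal)
    noise = estimate_noise(noisy)
    speech = estimate_speech(noisy, noise)
    return np.concatenate([noisy, noise, speech], axis=1)
```

Concatenating along the frequency axis keeps one feature row per frame, so the result lines up frame by frame with the acoustic features.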
In one embodiment, converting the acoustic features and the spectral features comprises: extracting a preset number of frames before and after a current frame in the acoustic features and the spectral features; calculating a mean vector and/or a variance vector corresponding to the current frame by using a preset number of frames before and after the current frame; and carrying out logarithmic domain conversion on the acoustic feature and the frequency spectrum feature after the mean vector and/or the variance vector corresponding to the current frame are calculated to obtain a converted acoustic feature vector and a converted frequency spectrum feature vector.
After the terminal acquires the voice signal with noise, windowing and framing are carried out on the voice signal with noise. So that the noisy speech signal can be split into frames of noisy speech signals. And after windowing and framing the voice signal with noise by the terminal, carrying out fast Fourier transform on the voice signal with noise after windowing and framing, thereby obtaining the frequency spectrum of the voice signal with noise. The terminal can extract the acoustic characteristics and the spectrum characteristics corresponding to the voice signal with noise according to the spectrum of the voice with noise.
And after the terminal extracts the acoustic features and the spectral features corresponding to the voice signals with the noise, converting the acoustic features and the spectral features into acoustic feature vectors and spectral feature vectors. The terminal extracts a preset number of frames before and after the current frame in the acoustic feature vector and the spectral feature vector. The terminal calculates a mean vector or a variance vector corresponding to the current frame by using a preset number of frames before and after the current frame, so that the acoustic feature and the frequency spectrum feature can be smoothed to obtain a smoothed acoustic feature vector and a smoothed frequency spectrum feature vector.
For example, the terminal may obtain the five frames before and the five frames after the current frame of the acoustic or spectral features, for a total of 11 frames of the noisy speech spectrum including the current frame. By averaging these 11 frames, the mean vector of the current frame can be obtained. Specifically, the terminal may obtain a filter bank in which each filter has a triangular shape, the triangular window representing the filtering window. The triangular filters may have equal bandwidth over the noisy speech spectrum. The terminal can calculate the mean vector of the current frame by using the filter bank, so that the noisy speech spectrum is smoothed and the smoothed acoustic feature vector and spectral feature vector are obtained.
After smoothing the noisy speech spectrum, the terminal converts the smoothed acoustic feature vector and the smoothed spectral feature vector into the logarithmic domain to obtain the converted acoustic feature vector and spectral feature vector. Specifically, the terminal may calculate the log energy of the acoustic features and spectral features output by each filter, thereby obtaining the log-domain acoustic feature vector and the log-domain spectral feature vector, so that the converted acoustic feature vector and spectral feature vector can be obtained effectively.
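The context-window smoothing and logarithmic-domain conversion can be sketched as follows. This is a minimal illustration that assumes a plain arithmetic mean over the current frame and five frames on each side (11 frames in total), repeats the boundary frame at the edges of the signal, and assumes non-negative features such as amplitude spectra for the logarithm. The function names and the `eps` clamp are hypothetical.

```python
import numpy as np

def context_mean(features, context=5):
    """Mean vector of each frame over itself and `context` frames on each side.

    `features` has shape (n_frames, n_dims). Edge frames are handled by
    repeating the first and last frame (an assumption, not from the text).
    """
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    window = 2 * context + 1
    # A cumulative sum with a leading zero row lets each window sum be
    # computed as a difference of two cumulative values.
    csum = np.cumsum(np.vstack([np.zeros((1, features.shape[1])), padded]), axis=0)
    return (csum[window:] - csum[:-window]) / window

def to_log_domain(features, eps=1e-10):
    """Natural logarithm of the features, clamped below at `eps` to avoid log(0)."""
    return np.log(np.maximum(features, eps))
```

For a feature matrix `x`, `to_log_domain(context_mean(x))` yields the smoothed, log-domain feature vectors with the same shape as `x`.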
In one embodiment, the step of obtaining the classifier further comprises: acquiring noisy voice data added with a voice category label, and training the noisy voice data to obtain an initial classifier; acquiring a first verification set, wherein the first verification set comprises a plurality of first voice data; inputting the first voice data into the initial classifier to obtain class probabilities corresponding to the first voice data; screening the class probabilities corresponding to the plurality of first voice data, and adding category labels to the selected first voice data to obtain a verification set added with the category labels; training by using the verification set added with the category labels and the noisy voice data added with the voice category label to obtain a verification classifier; acquiring a second verification set, wherein the second verification set comprises a plurality of second voice data; inputting the second voice data into the verification classifier to obtain class probabilities corresponding to the second voice data; and when the class probability corresponding to the second voice data reaches a preset probability value, obtaining the required classifier.
Before acquiring the classifier, a large amount of noisy speech data, which may be noisy speech data acquired by the terminal from a database or noisy speech data acquired by the terminal from the internet, needs to be used to train the classifier. When the classifier is trained, firstly, noisy speech data is labeled manually, and the classifier is obtained by training the artificially labeled noisy speech data.
Specifically, after extracting the acoustic features and the spectral features corresponding to the noisy speech data, the terminal converts the acoustic features and the spectral features into corresponding acoustic feature vectors and spectral feature vectors. The staff can label the acoustic feature vector and the spectral feature vector according to the category comparison table, and add a voice tag or a non-voice tag to each frame of voice signals with noise. And the terminal acquires the voice data with noise after the staff marks the voice data with noise according to the category comparison table.
The terminal combines the labeled acoustic feature vector and spectral feature vector and inputs them into the input layer of a bidirectional LSTM (Long Short-Term Memory) neural network. A nonlinear hidden layer in the LSTM neural network can learn new features from the input vector, and the category of the input vector is calculated through an activation function. Specifically, each LSTM unit has three gates: a forgetting gate, a candidate gate, and an output gate. The forgetting gate may be calculated as:
$$f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f)$$
where $\sigma$ denotes the activation function, $W_f$ denotes the forgetting gate weight matrix, $U_f$ is the weight matrix between the input layer and the hidden layer of the forgetting gate, and $b_f$ denotes the bias of the forgetting gate. The output $h_{t-1}$ of the previous hidden layer is linearly combined with the current input $x_t$, and the activation function then compresses the output value to between 0 and 1. The closer the output value is to 1, the more information the memory retains; conversely, the closer it is to 0, the less information the memory retains.
The candidate gate calculates the cell state of the current input, and the specific formula may be:
$$\tilde{C}_t = \tanh(W_c h_{t-1} + U_c x_t + b_c)$$
where $\tilde{C}_t$ represents the cell state of the current input, and the tanh activation function scales the output value to between -1 and 1.
The output gate controls the amount of memory information used for updating the next network layer, and the formula can be expressed as:
$$O_t = \sigma(W_o h_{t-1} + U_o x_t + b_o)$$
where $O_t$ indicates the amount of memory information used for updating the next network layer.
The final output can be calculated by the LSTM unit and the formula can be expressed as:
$$h_t = O_t \times \tanh(C_t)$$
The final acoustic feature vector or spectral feature vector is obtained by combining the forward and backward calculations, and the formula can be expressed as:
$$h_i = [\overrightarrow{h_i}, \overleftarrow{h_i}]$$
where $\overrightarrow{h_i}$ is the forward output vector, $\overleftarrow{h_i}$ is the backward output vector, and $h_i$ is the final acoustic feature vector or spectral feature vector labeled with a category label.
Further, the output layer of the LSTM may calculate the value of the output unit $C_i$ according to a preset decision function. The value of the output unit $C_i$ may lie between 0 and 1, with 1 representing the speech class and 0 representing the non-speech class.
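The gate computations and the forward and backward passes can be sketched in NumPy as follows. This is a minimal sketch of a conventional bidirectional LSTM rather than the exact network of this embodiment: the input gate and the cell-state update rule are taken from the standard LSTM formulation and are not described in the text above, the forward and backward outputs are assumed to be concatenated, and the parameter names (`Wf`, `Uf`, `bf`, and so on) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(input_dim, hidden_dim, rng):
    """Small random weights for the forget (f), input (i), candidate (c) and output (o) maps."""
    params = {}
    for gate in "fico":
        params["W" + gate] = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
        params["U" + gate] = rng.standard_normal((hidden_dim, input_dim)) * 0.1
        params["b" + gate] = np.zeros(hidden_dim)
    return params

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; returns the new hidden state and cell state."""
    f_t = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x_t + p["bf"])    # forgetting gate, in (0, 1)
    i_t = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x_t + p["bi"])    # assumption: standard input gate
    c_hat = np.tanh(p["Wc"] @ h_prev + p["Uc"] @ x_t + p["bc"])  # candidate state, in (-1, 1)
    o_t = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x_t + p["bo"])    # output gate, in (0, 1)
    c_t = f_t * c_prev + i_t * c_hat                             # assumption: standard cell update
    h_t = o_t * np.tanh(c_t)                                     # final output of the unit
    return h_t, c_t

def run_lstm(xs, p, hidden_dim):
    """Run the LSTM over a sequence xs of shape (n_frames, input_dim)."""
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        outputs.append(h)
    return np.stack(outputs)

def bidirectional_lstm(xs, p_fwd, p_bwd, hidden_dim):
    """Concatenate forward outputs with time-aligned backward outputs."""
    forward = run_lstm(xs, p_fwd, hidden_dim)
    backward = run_lstm(xs[::-1], p_bwd, hidden_dim)[::-1]
    return np.concatenate([forward, backward], axis=1)
```

A decision function such as a sigmoid over a linear layer could then map each row of the concatenated output to a value between 0 and 1; that layer is omitted here.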
The terminal calculates the probability that each acoustic feature and spectral feature belong to the voice category and the non-voice category in the category comparison table by using the plurality of acoustic feature vectors and the spectral feature vectors marked with the voice category labels, extracts the category with the maximum probability value of the acoustic feature vectors and the spectral feature vectors in the category comparison table, and adds the voice category label corresponding to the category with the maximum probability value to the acoustic feature vectors or the spectral feature vectors.
The terminal trains an initial classifier by using the noisy voice data to which voice category labels have been added. The terminal then acquires a first verification set, which includes a plurality of first voice data. The terminal inputs the first voice data into the initial classifier and, after obtaining the class probabilities corresponding to the first voice data, screens those class probabilities. The staff adds voice category labels to the selected first voice data through the terminal; the terminal obtains the labeled first voice data and uses it to generate a verification set with voice category labels. The terminal trains a verification classifier by using the labeled verification set together with the labeled noisy voice data. The terminal acquires a second verification set, which includes a plurality of second voice data, and inputs the second voice data into the verification classifier to obtain the class probabilities corresponding to the second voice data. The terminal screens out the second voice data whose class probability lies within a preset range, labels the screened second voice data, and trains again with the labeled second voice data and the labeled noisy voice data to obtain a new classifier. Training continues until the probability values of a preset number of acoustic feature vectors or spectral feature vectors in all verification sets fall within a preset probability range, at which point training stops and the required classifier is obtained. A classifier with high accuracy can thus be obtained, so that the acoustic feature vectors and spectral feature vectors can be accurately classified and speech and non-speech can be accurately identified.
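The screen, label, and retrain loop described above can be sketched as follows. The text does not specify the classifier interface, the screening range, or the stopping range, so this sketch treats them as parameters: `train_fn`, `predict_fn`, and `label_fn` are hypothetical callables standing in for training, class-probability prediction, and manual labeling by staff, and `select_range` and `stop_range` are assumed to be separate (low, high) probability intervals.

```python
import numpy as np

def select_in_range(probs, low, high):
    """Indices of samples whose class probability lies in [low, high]."""
    probs = np.asarray(probs, dtype=float)
    return np.flatnonzero((probs >= low) & (probs <= high))

def should_stop(probs, low, high, required_count):
    """True when at least `required_count` probabilities lie in [low, high]."""
    return len(select_in_range(probs, low, high)) >= required_count

def iterative_training(train_fn, predict_fn, label_fn, labeled, validation_sets,
                       select_range, stop_range, required_count):
    """Train, screen each verification set, label the selected samples, retrain.

    `labeled` is a list of (sample, label) pairs. Returns the last
    classifier and the enlarged labeled data.
    """
    classifier = train_fn(labeled)
    for samples in validation_sets:
        probs = predict_fn(classifier, samples)
        if should_stop(probs, *stop_range, required_count):
            break
        picked = select_in_range(probs, *select_range)
        # Selected samples receive labels (manually, in the text) and join the training data.
        labeled = labeled + [(samples[i], label_fn(samples[i])) for i in picked]
        classifier = train_fn(labeled)
    return classifier, labeled
```

Passing the verification sets as a list lets the same loop cover the first set, the second set, and any further rounds of retraining.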
In one embodiment, the step of classifying the acoustic feature vector and the spectral feature vector using a classifier comprises: taking the acoustic feature vector and the spectral feature vector as input of a classifier to obtain decision values corresponding to the acoustic feature vector and the spectral feature vector; when the decision value is a first threshold value, adding a voice tag to the acoustic feature vector or the frequency spectrum feature vector; and when the decision value is a second threshold value, adding a non-voice label to the acoustic feature vector or the spectrum feature vector.
And after the terminal acquires the voice signal with the noise, extracting the acoustic characteristic and the spectral characteristic corresponding to the voice signal with the noise. And the terminal converts the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors. And after the terminal acquires the classifier, inputting the acoustic feature vector and the spectral feature vector into the classifier. After the classifier classifies the input acoustic feature vector and the input spectral feature vector, the decision values corresponding to the acoustic feature vector and the spectral feature vector can be obtained. And when the obtained decision value is a preset first threshold value, the terminal adds a voice tag to the acoustic characteristic vector or the frequency spectrum characteristic vector. Wherein the first threshold may be a range of values. And when the obtained decision value is a preset second threshold value, the terminal adds a non-voice label to the acoustic characteristic vector or the frequency spectrum characteristic vector. By accurately classifying the acoustic feature vectors and the spectral feature vectors by using the classifier, the voice signals and the non-voice signals in the noisy voice signals can be accurately identified.
For example, the resulting decision value may be a value between 0 and 1. The preset first threshold may be 1, and the preset second threshold may be 0. And when the obtained decision value is 1, the terminal adds a voice tag to the acoustic feature vector or the frequency spectrum feature vector. And when the obtained decision value is 0, the terminal adds a non-voice label to the acoustic feature vector or the frequency spectrum feature vector. Thereby, the acoustic feature vector and the spectral feature vector can be accurately classified.
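Turning per-frame decision values into labels, and the labels into start and end points, can be sketched as follows. This is an illustrative sketch: the single cut-off of 0.5 used to map decision values between 0 and 1 onto the voice label (1) and non-voice label (0), the 10 ms frame shift used to convert frame indices into times, and the function names are all assumptions.

```python
import numpy as np

def label_frames(decision_values, threshold=0.5):
    """Label each frame 1 (voice) or 0 (non-voice) from its decision value."""
    return (np.asarray(decision_values, dtype=float) >= threshold).astype(int)

def find_endpoints(labels, hop_ms=10):
    """Return a (start_ms, end_ms) pair for every run of consecutive voice frames.

    The end time is the start of the first non-voice frame after the run.
    """
    labels = np.asarray(labels, dtype=int)
    # Padding with non-voice frames makes runs touching either edge detectable.
    padded = np.concatenate([[0], labels, [0]])
    changes = np.diff(padded)
    starts = np.flatnonzero(changes == 1)    # 0 -> 1 transitions
    ends = np.flatnonzero(changes == -1)     # 1 -> 0 transitions
    return [(int(s) * hop_ms, int(e) * hop_ms) for s, e in zip(starts, ends)]
```

For example, the decision values `[0.1, 0.9, 0.8, 0.2, 0.7]` give the labels `[0, 1, 1, 0, 1]` and the segments `[(10, 30), (40, 50)]` in milliseconds.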
In one embodiment, as shown in fig. 2, there is provided a voice endpoint detection apparatus, comprising an extraction module 202, a conversion module 204, a classification module 206, and a parsing module 208, wherein:
The extraction module 202 is configured to obtain a noisy speech signal and extract an acoustic feature and a spectral feature corresponding to the noisy speech signal.
The conversion module 204 is configured to convert the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors.
The classification module 206 is configured to obtain a classifier, and input the acoustic feature vector and the spectral feature vector to the classifier to obtain an acoustic feature vector to which a voice tag is added and a spectral feature vector to which a voice tag is added.
The parsing module 208 is configured to parse the acoustic feature vector with the voice tag added and the spectral feature vector with the voice tag added to obtain a corresponding voice signal, and to determine a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal.
In one embodiment, the extraction module 202 is further configured to convert the noisy speech signal into a noisy speech spectrum; and carrying out time domain analysis and/or frequency domain analysis and/or transform domain analysis on the voice frequency spectrum with the noise to obtain acoustic characteristics corresponding to the voice signal with the noise.
In one embodiment, the extracting module 202 is further configured to convert the noisy speech signal into a noisy speech spectrum, and calculate a noisy speech magnitude spectrum according to the noisy speech spectrum; carrying out dynamic noise estimation on the noisy speech frequency spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimating a voice amplitude spectrum of the pure voice signal according to the voice amplitude spectrum with the noise and the noise amplitude spectrum; and generating the frequency spectrum characteristics corresponding to the voice signal with the noise by using the voice amplitude spectrum with the noise, the noise amplitude spectrum and the voice amplitude spectrum.
In one embodiment, the conversion module 204 is further configured to extract a preset number of frames before and after the current frame in the acoustic feature and the spectral feature; calculating a mean vector and/or a variance vector corresponding to the current frame by using a preset number of frames before and after the current frame; and carrying out logarithmic domain conversion on the acoustic feature and the frequency spectrum feature after the mean vector and/or the variance vector corresponding to the current frame are calculated to obtain a converted acoustic feature vector and a converted frequency spectrum feature vector.
In one embodiment, the device further comprises a training module, configured to acquire noisy speech data to which the speech category label is added, and train the noisy speech data to obtain an initial classifier; acquiring a first verification set, wherein the first verification set comprises a plurality of first voice data; inputting the first voice data into an initial classifier to obtain class probabilities corresponding to the first voice data; screening the category probabilities corresponding to the plurality of first voice data, and adding category labels to the selected first voice data to obtain a verification set added with the category labels; training by using the verification set added with the class label and the noisy voice data added with the voice class label to obtain a verification classifier; acquiring a second verification set, wherein the second verification set comprises a plurality of second voice data; inputting the second voice data into a verification classifier to obtain class probabilities corresponding to the second voice data; and when the class probability corresponding to the second voice data reaches a preset probability value, obtaining the required classifier.
In one embodiment, the classification module 206 is further configured to use the acoustic feature vector and the spectral feature vector as inputs of a classifier to obtain decision values corresponding to the acoustic feature vector and the spectral feature vector; when the decision value is a first threshold value, adding a voice tag to the acoustic feature vector or the frequency spectrum feature vector; and when the decision value is a second threshold value, adding a non-voice label to the acoustic feature vector or the spectrum feature vector.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 3. For example, the computer device may be a terminal, and the terminal may be, but is not limited to, various devices having a function of inputting voice, such as a smart phone, a tablet computer, a notebook computer, a personal computer, and a portable wearable device. The computer device includes a processor, a memory, a network interface, and a voice input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of voice endpoint detection. The voice input device of the computer equipment can comprise a microphone, and can also comprise an external earphone and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is a block diagram of only the portion of the architecture relevant to the present application and does not limit the computer device to which the present application is applied; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: acquiring a voice signal with noise, and extracting acoustic features and spectrum features corresponding to the voice signal with noise; converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors; acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain the acoustic feature vector added with the voice tag and the spectral feature vector added with the voice tag; analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain a corresponding voice signal; and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal.
In one embodiment, the processor, when executing the computer program, further performs the steps of: converting the voice signal with noise into a voice frequency spectrum with noise; and carrying out time domain analysis and/or frequency domain analysis and/or transform domain analysis on the voice frequency spectrum with the noise to obtain acoustic characteristics corresponding to the voice signal with the noise.
In one embodiment, the processor, when executing the computer program, further performs the steps of: converting the voice signal with noise into a voice frequency spectrum with noise, and calculating a voice amplitude spectrum with noise according to the voice frequency spectrum with noise; carrying out dynamic noise estimation on the noisy speech frequency spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimating a voice amplitude spectrum of the pure voice signal according to the voice amplitude spectrum with the noise and the noise amplitude spectrum; and generating the frequency spectrum characteristics corresponding to the voice signal with the noise by using the voice amplitude spectrum with the noise, the noise amplitude spectrum and the voice amplitude spectrum.
In one embodiment, the processor, when executing the computer program, further performs the steps of: extracting a preset number of frames before and after the current frame in the acoustic features and the frequency spectrum features; calculating a mean vector and/or a variance vector corresponding to the current frame by using a preset number of frames before and after the current frame; and carrying out logarithmic domain conversion on the acoustic feature and the frequency spectrum feature after the mean vector and/or the variance vector corresponding to the current frame are calculated to obtain a converted acoustic feature vector and a converted frequency spectrum feature vector.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring noisy voice data added with a voice category label, and training the noisy voice data to obtain an initial classifier; acquiring a first verification set, wherein the first verification set comprises a plurality of first voice data; inputting the first voice data into a classifier to obtain class probabilities corresponding to the first voice data; screening the category probabilities corresponding to the plurality of first voice data, and adding category labels to the selected first voice data to obtain a verification set added with the category labels; training by using the verification set and the training set added with the class labels to obtain a verification classifier; acquiring a second verification set, wherein the second verification set comprises a plurality of second voice data; inputting the second voice data into a verification classifier to obtain class probabilities corresponding to the second voice data; and when the class probability corresponding to the second voice data reaches a preset probability value, obtaining the required classifier.
In one embodiment, the processor, when executing the computer program, further performs the steps of: taking the acoustic feature vector and the spectral feature vector as input of a classifier to obtain decision values corresponding to the acoustic feature vector and the spectral feature vector; when the decision value is a first threshold value, adding a voice tag to the acoustic feature vector or the frequency spectrum feature vector; and when the decision value is a second threshold value, adding a non-voice label to the acoustic feature vector or the spectrum feature vector.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring a voice signal with noise, and extracting acoustic features and spectrum features corresponding to the voice signal with noise; converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors; acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain the acoustic feature vector added with the voice tag and the spectral feature vector added with the voice tag; analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain a corresponding voice signal; and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal.
In one embodiment, the computer program when executed by the processor further performs the steps of: converting the voice signal with noise into a voice frequency spectrum with noise; and carrying out time domain analysis and/or frequency domain analysis and/or transform domain analysis on the voice frequency spectrum with the noise to obtain acoustic characteristics corresponding to the voice signal with the noise.
In one embodiment, the computer program when executed by the processor further performs the steps of: converting the voice signal with noise into a voice frequency spectrum with noise, and calculating a voice amplitude spectrum with noise according to the voice frequency spectrum with noise; carrying out dynamic noise estimation on the noisy speech frequency spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimating a voice amplitude spectrum of the pure voice signal according to the voice amplitude spectrum with the noise and the noise amplitude spectrum; and generating the frequency spectrum characteristics corresponding to the voice signal with the noise by using the voice amplitude spectrum with the noise, the noise amplitude spectrum and the voice amplitude spectrum.
In one embodiment, the computer program when executed by the processor further performs the steps of: extracting a preset number of frames before and after the current frame in the acoustic features and the frequency spectrum features; calculating a mean vector and/or a variance vector corresponding to the current frame by using a preset number of frames before and after the current frame; and carrying out logarithmic domain conversion on the acoustic feature and the frequency spectrum feature after the mean vector and/or the variance vector corresponding to the current frame are calculated to obtain a converted acoustic feature vector and a converted frequency spectrum feature vector.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring noisy voice data added with a voice category label, and training the noisy voice data to obtain an initial classifier; acquiring a first verification set, wherein the first verification set comprises a plurality of first voice data; inputting the first voice data into a classifier to obtain class probabilities corresponding to the first voice data; screening the category probabilities corresponding to the plurality of first voice data, and adding category labels to the selected first voice data to obtain a verification set added with the category labels; training by using the verification set and the training set added with the class labels to obtain a verification classifier; acquiring a second verification set, wherein the second verification set comprises a plurality of second voice data; inputting the second voice data into a verification classifier to obtain class probabilities corresponding to the second voice data; and when the class probability corresponding to the second voice data reaches a preset probability value, obtaining the required classifier.
In one embodiment, the computer program when executed by the processor further performs the steps of: taking the acoustic feature vector and the spectral feature vector as input of a classifier to obtain decision values corresponding to the acoustic feature vector and the spectral feature vector; when the decision value is a first threshold value, adding a voice tag to the acoustic feature vector or the frequency spectrum feature vector; and when the decision value is a second threshold value, adding a non-voice label to the acoustic feature vector or the spectrum feature vector.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

1. A voice endpoint detection method, comprising:
acquiring a noisy voice signal, and extracting acoustic features corresponding to the noisy voice signal;
extracting a noisy voice amplitude spectrum, a noise amplitude spectrum and a voice amplitude spectrum of the noisy voice signal;
generating spectral features corresponding to the noisy voice signal according to the noisy voice amplitude spectrum, the noise amplitude spectrum and the voice amplitude spectrum;
converting the acoustic features and the spectral features to obtain a corresponding acoustic feature vector and spectral feature vector;
acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain an acoustic feature vector with a voice tag added and a spectral feature vector with a voice tag added;
analyzing the acoustic feature vector with the voice tag added and the spectral feature vector with the voice tag added to obtain a corresponding voice signal;
and determining a starting point and an ending point of the voice signal according to the time sequence of the voice signal.
2. The method according to claim 1, further comprising, before extracting the acoustic features and the spectral features corresponding to the noisy voice signal:
converting the noisy voice signal into a noisy voice spectrum;
and performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy voice spectrum to obtain the acoustic features corresponding to the noisy voice signal.
3. The method according to claim 1, wherein extracting the noisy voice amplitude spectrum, the noise amplitude spectrum and the voice amplitude spectrum of the noisy voice signal comprises:
converting the noisy voice signal into a noisy voice spectrum, and calculating the noisy voice amplitude spectrum from the noisy voice spectrum;
performing dynamic noise estimation on the noisy voice spectrum according to the noisy voice amplitude spectrum to obtain the noise amplitude spectrum;
and estimating the voice amplitude spectrum of the pure voice signal according to the noisy voice amplitude spectrum and the noise amplitude spectrum.
4. The method of claim 1, wherein converting the acoustic features and the spectral features comprises:
extracting a preset number of frames before and after the current frame in the acoustic features and the spectral features;
calculating a mean vector and/or a variance vector corresponding to the current frame using the preset number of frames before and after the current frame;
and performing logarithmic-domain conversion on the acoustic features and the spectral features for which the mean vector and/or the variance vector corresponding to the current frame has been calculated, to obtain a converted acoustic feature vector and a converted spectral feature vector.
5. The method of claim 1, further comprising, before the step of acquiring a classifier:
acquiring noisy voice data to which voice class labels have been added, and training on the noisy voice data to obtain an initial classifier;
obtaining a first verification set, wherein the first verification set comprises a plurality of first voice data;
inputting the plurality of first voice data into the initial classifier to obtain class probabilities corresponding to the plurality of first voice data;
screening the class probabilities corresponding to the plurality of first voice data, and adding class labels to the selected first voice data to obtain a class-labeled verification set;
training with the class-labeled verification set and the noisy voice data to which voice class labels have been added to obtain a verification classifier;
obtaining a second verification set, wherein the second verification set comprises a plurality of second voice data;
inputting the plurality of second voice data into the verification classifier to obtain class probabilities corresponding to the plurality of second voice data;
and when the class probability corresponding to the second voice data reaches a preset probability value, obtaining the required classifier.
6. The method of any one of claims 1 to 5, wherein the step of classifying the acoustic and spectral feature vectors using the classifier comprises:
taking the acoustic feature vector and the spectral feature vector as input of a classifier to obtain decision values corresponding to the acoustic feature vector and the spectral feature vector;
when the decision value is a first threshold value, adding a voice tag to the acoustic feature vector or the spectrum feature vector;
and when the decision value is a second threshold value, adding a non-voice tag to the acoustic feature vector or the spectral feature vector.
7. A voice endpoint detection apparatus, comprising:
an extraction module, configured to acquire a noisy voice signal and extract acoustic features corresponding to the noisy voice signal; extract a noisy voice amplitude spectrum, a noise amplitude spectrum and a voice amplitude spectrum of the noisy voice signal; and generate spectral features corresponding to the noisy voice signal according to the noisy voice amplitude spectrum, the noise amplitude spectrum and the voice amplitude spectrum;
a conversion module, configured to convert the acoustic features and the spectral features to obtain a corresponding acoustic feature vector and spectral feature vector;
a classification module, configured to acquire a classifier and input the acoustic feature vector and the spectral feature vector into the classifier to obtain an acoustic feature vector with a voice tag added and a spectral feature vector with a voice tag added;
an analysis module, configured to analyze the acoustic feature vector with the voice tag added and the spectral feature vector with the voice tag added to obtain a corresponding voice signal; and determine a starting point and an ending point of the voice signal according to the time sequence of the voice signal.
8. The apparatus of claim 7, wherein the conversion module is further configured to: extract a preset number of frames before and after a current frame in the acoustic features and the spectral features; calculate a mean vector and/or a variance vector corresponding to the current frame using the preset number of frames before and after the current frame; and perform logarithmic-domain conversion on the acoustic features and the spectral features for which the mean vector and/or the variance vector corresponding to the current frame has been calculated, to obtain a converted acoustic feature vector and a converted spectral feature vector.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
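For illustration only, the final step of claim 1 (determining a starting point and an ending point from the time sequence of the voice-tagged frames) might be sketched in Python as below. The function name `find_endpoints`, the boolean per-frame representation, and the choice to report every contiguous run of voice frames are assumptions of this sketch; the claims do not prescribe a particular data structure or how multiple voice segments are handled.

```python
from typing import List, Tuple


def find_endpoints(is_voice: List[bool]) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices for each run of voice frames.

    `is_voice[i]` is True when frame i carries a voice tag. `start` is the
    first voice frame of a run and `end` is the index just past its last
    voice frame (exclusive), in time order.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, voiced in enumerate(is_voice):
        if voiced and start is None:
            start = i                      # a voice run begins here
        elif not voiced and start is not None:
            segments.append((start, i))    # the run ended at the previous frame
            start = None
    if start is not None:                  # signal ends while still in voice
        segments.append((start, len(is_voice)))
    return segments


print(find_endpoints([False, True, True, False, False, True]))
# [(1, 3), (5, 6)]
```

Multiplying the returned frame indices by the frame shift (a value not given in the claims) would convert them to times in the original signal.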
CN201810048223.3A 2018-01-18 2018-01-18 Voice endpoint detection method and device, computer equipment and storage medium Active CN108198547B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810048223.3A CN108198547B (en) 2018-01-18 2018-01-18 Voice endpoint detection method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108198547A CN108198547A (en) 2018-06-22
CN108198547B CN108198547B (en) 2020-10-23

Family

ID=62589616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810048223.3A Active CN108198547B (en) 2018-01-18 2018-01-18 Voice endpoint detection method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108198547B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108922556B (en) * 2018-07-16 2019-08-27 百度在线网络技术(北京)有限公司 Sound processing method, device and equipment
CN110752973B (en) * 2018-07-24 2020-12-25 Tcl科技集团股份有限公司 Terminal equipment control method and device and terminal equipment
CN109036471B (en) * 2018-08-20 2020-06-30 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device
CN110070884B (en) 2019-02-28 2022-03-15 北京字节跳动网络技术有限公司 Audio starting point detection method and device
CN110265032A (en) * 2019-06-05 2019-09-20 平安科技(深圳)有限公司 Conferencing data analysis and processing method, device, computer equipment and storage medium
CN110322872A (en) * 2019-06-05 2019-10-11 平安科技(深圳)有限公司 Conference voice data processing method, device, computer equipment and storage medium
CN110415704A (en) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium are put down in court's trial
CN110428853A (en) * 2019-08-30 2019-11-08 北京太极华保科技股份有限公司 Voice activity detection method, Voice activity detection device and electronic equipment
CN110808061B (en) * 2019-11-11 2022-03-15 广州国音智能科技有限公司 Voice separation method and device, mobile terminal and computer readable storage medium
CN110910906A (en) * 2019-11-12 2020-03-24 国网山东省电力公司临沂供电公司 Audio endpoint detection and noise reduction method based on power intranet
CN111179972A (en) * 2019-12-12 2020-05-19 中山大学 Human voice detection algorithm based on deep learning
CN111192600A (en) * 2019-12-27 2020-05-22 北京网众共创科技有限公司 Sound data processing method and device, storage medium and electronic device
CN111626061A (en) * 2020-05-27 2020-09-04 深圳前海微众银行股份有限公司 Conference record generation method, device, equipment and readable storage medium
CN111916060B (en) * 2020-08-12 2022-03-01 四川长虹电器股份有限公司 Deep learning voice endpoint detection method and system based on spectral subtraction
CN112652324A (en) * 2020-12-28 2021-04-13 深圳万兴软件有限公司 Speech enhancement optimization method, speech enhancement optimization system and readable storage medium
CN113327626B (en) * 2021-06-23 2023-09-08 深圳市北科瑞声科技股份有限公司 Voice noise reduction method, device, equipment and storage medium
CN113744725B (en) * 2021-08-19 2024-07-05 清华大学苏州汽车研究院(相城) Training method of voice endpoint detection model and voice noise reduction method
CN114974258B (en) * 2022-07-27 2022-12-16 深圳市北科瑞声科技股份有限公司 Speaker separation method, device, equipment and storage medium based on voice processing
CN115497511A (en) * 2022-10-31 2022-12-20 广州方硅信息技术有限公司 Method, device, equipment and medium for training and detecting voice activity detection model

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040064314A1 (en) * 2002-09-27 2004-04-01 Aubert Nicolas De Saint Methods and apparatus for speech end-point detection
KR100745976B1 (en) * 2005-01-12 2007-08-06 삼성전자주식회사 Method and apparatus for classifying voice and non-voice using sound model
JP4950930B2 (en) * 2008-04-03 2012-06-13 株式会社東芝 Apparatus, method and program for determining voice / non-voice
CN101308653A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 End-point detecting method applied to speech identification system
CN101599269B (en) * 2009-07-02 2011-07-20 中国农业大学 Phonetic end point detection method and device therefor
CN103489454B (en) * 2013-09-22 2016-01-20 浙江大学 Based on the sound end detecting method of wave configuration feature cluster
CN103730124A (en) * 2013-12-31 2014-04-16 上海交通大学无锡研究院 Noise robustness endpoint detection method based on likelihood ratio test
CN105023572A (en) * 2014-04-16 2015-11-04 王景芳 Noised voice end point robustness detection method
CN104021789A (en) * 2014-06-25 2014-09-03 厦门大学 Self-adaption endpoint detection method using short-time time-frequency value
US10121471B2 (en) * 2015-06-29 2018-11-06 Amazon Technologies, Inc. Language model speech endpointing
CN105118502B (en) * 2015-07-14 2017-05-10 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
CN107393526B (en) * 2017-07-19 2024-01-02 腾讯科技(深圳)有限公司 Voice silence detection method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN108198547A (en) 2018-06-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant