US8175874B2 - Personalized voice activity detection - Google Patents
- Publication number
- US8175874B2 (application US12/092,578)
- Authority
- US
- United States
- Prior art keywords
- segment
- segments
- user
- audio signal
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
Definitions
- the present invention relates generally to voice activity detection and more specifically to automatic identification and transfer of voice activity of specific speakers.
- Voice activity detection is the art of detecting the presence of voice activity, generally human speech, in audio signals.
- Voice activity detection is used in a wide range of systems handling audio signals, for example systems dealing with telecommunication, speech recognition, speaker verification, speaker identification, speaker segmentation, voice recording, noise suppression and others.
- voice activity detection can be used to implement different sampling rates based on the voice activity level detected, for example to raise/reduce the bandwidth when dealing with audio segments containing human speech.
- a speaker verification/identification system can be simplified by limiting processing to audio segments containing speech.
- a noise suppression system can use voice activity detection to compare segments with speech activity to segments without speech activity.
- voice activity detection can be used to reduce the required storage space by limiting the recording to meaningful information (e.g. segments with speech activity).
- voice controlled systems and/or applications are intended to receive voices from a single person or single group of people, and would function better if they actually receive only the voice or voices of the intended people, for example:
- Speaker verification systems such as used by banks to authenticate the customer
- noise for example:
- Some systems attempt to transfer voice and eliminate noise in order to improve efficiency in dealing with the signal.
- more sophisticated input devices e.g. extra microphones and/or sensors
- U.S. patent application publication No. 2005/0033572, published Feb. 10, 2005, the disclosure of which is incorporated herein by reference, describes an apparatus and method of a voice recognition system for an audio-visual system.
- the system receives reflected sounds from an audio-visual system, noise and a user's voice and is configured to isolate the user's voice and compare it to voice patterns that belong to at least one model.
- Japanese patent No. 11-154998 from Jun. 8, 1999, the disclosure of which is incorporated herein by reference, describes registering a voice print of a speaker; then, during transmission, a microphone collects a signal comprising the speaker's voice and ambient noise. The signal is input to a comparing filter that extracts the voice of the speaker from the signal by comparing it to the registered voice print.
- An aspect of an embodiment of the invention relates to a system and method of transferring audio data in real-time wherein only the voice of a registered user will be transferred.
- the system initially registers the voice patterns and/or characteristics of one or more users.
- the system then analyzes the audio data, segment by segment as it is created and transferred in real time.
- the system checks if a segment contains voice and if the voice is of a registered user.
- the system calculates a probability level that a segment representing voice is of a registered user and transfers the segment responsive to the determination.
- if the probability level is below a pre-selected threshold value, the segment is blocked.
- if the probability level is above a pre-selected value, the segment is transferred.
- if the probability level is less than the pre-selected value and greater than the threshold value, the segment is transferred with the quality and/or strength of the signal adjusted according to the probability level, for example by raising or lowering the volume.
- some previously blocked segments may be transferred responsive to a recalculation of their probability level when calculating the probability level of a subsequent segment.
- the transferred blocked segments are transferred at a higher rate to prevent a delay in the flow of the segments.
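The three-way decision described above (block, transfer, or transfer with adjusted quality) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `decide_transfer`, the default threshold values 0.3 and 0.8, and the linear mapping from probability to gain are all hypothetical.

```python
from typing import Tuple


def decide_transfer(probability: float,
                    block_threshold: float = 0.3,   # assumed value
                    pass_threshold: float = 0.8     # assumed value
                    ) -> Tuple[str, float]:
    """Return (action, gain) for one segment given its probability level."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if probability < block_threshold:
        # Below the threshold value: the segment is blocked.
        return ("block", 0.0)
    if probability > pass_threshold:
        # Above the pre-selected value: the segment is transferred unchanged.
        return ("transfer", 1.0)
    # In between: transfer with the volume scaled by the probability level.
    # The linear ramp is an illustrative choice only.
    gain = (probability - block_threshold) / (pass_threshold - block_threshold)
    return ("transfer_adjusted", gain)
```

For example, with the assumed defaults a probability of 0.55 falls midway between the two thresholds and yields a gain of about 0.5.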
- a method of transferring a real-time audio signal transmission comprising, registering user characteristics of one or more users to be used to identify the voices of the users, accepting an audio signal as it is created as a sequence of segments, analyzing each segment of the accepted audio signal to determine if it contains voice activity, determining a probability level that the voice activity of the segment is of a registered user; and selectively transferring the contents of a segment responsive to the determined probability level.
- the segments are selected to comprise a single syllable.
- the segments are selected to comprise an interval of the audio signal smaller than 0.1 seconds.
- the selectively transferring comprises adjusting the quality level of the audio segment according to the determined probability level.
- adjusting the quality level comprises raising or lowering the volume of the audio signal in the segment.
- the selectively transferring comprises transferring previously blocked segments responsive to the determination for a consecutive segment.
- the previously blocked segments are transferred at a higher rate than the standard transfer rate of segments.
- the probability level of a segment is affected by the selective transfer of segments prior to the current segment.
- the method further comprises filtering out noise from each segment before analyzing the segment.
- the method comprises filtering out noise from each segment after analyzing the segment.
- the method further comprises performing source separation to the signal in a segment creating multiple segments before analyzing the segment and analyzing the multiple segments independently.
- the method further comprises analyzing the strongest signal in a segment comprising multiple audio signals, while taking into account prior segments.
- the method further comprises inserting an audio signal into a segment to indicate amendments to the segment.
- the transferring is through a communication system.
- the characteristics comprise voice patterns of the user.
- the characteristics comprise general information about the user.
- the selectively transferring allows voice activity of any user.
- the selectively transferring allows voice activity of a group of users with a common characteristic.
- the selectively transferring reduces bandwidth of transmissions through a communication network.
- the probability level becomes more accurate as the audio signal is processed.
- a system for transferring a real time audio transmission comprising, a processor to process data of the real time audio transmission and control the system, a working memory to serve as a work area for said central processing unit, a database memory to store data provided to the system for processing by said central processing unit, a channel interface to accept an audio signal for processing and transfer the processed audio signal to a receiver, a user interface to communicate with the user, wherein the system is adapted to, register characteristics of one or more users to be used to identify the voice of the users, accept via the channel interface an audio signal as it is created as a sequence of segments, analyze with the central processing unit each segment of the accepted audio signal to determine if it contains voice activity, determine a probability level that the voice activity of the segment is of a registered user, and selectively transfer the contents of a segment responsive to the determined probability level.
- the system provides an indication at a user communication device if it is activated.
- the indication gives indication
- FIG. 1 is a schematic illustration of implementation of a personalized voice detection system, according to an exemplary embodiment of the invention
- FIG. 2 is a schematic illustration of a communication network and optional positions for deploying a personal voice detection system, according to an exemplary embodiment of the invention
- FIG. 3A is a flow diagram of the process of analyzing an audio segment, according to an exemplary embodiment of the invention.
- FIG. 3B is a flow diagram of the process of analyzing an audio segment before voice is detected, according to an exemplary embodiment of the invention.
- FIG. 3C is a flow diagram of the process of analyzing an audio segment after a registered user's voice is detected, according to an exemplary embodiment of the invention.
- FIG. 3D is a flow diagram of the process of analyzing an audio segment when voice that does not belong to a registered user is detected, according to an exemplary embodiment of the invention.
- FIG. 4 is a schematic illustration of a buffer storing segments of an audio signal, according to an exemplary embodiment of the invention.
- FIG. 5 is a flow diagram of the registration process of a user, according to an exemplary embodiment of the invention.
- FIG. 6 is a schematic illustration of the main components of a system for analyzing audio data, according to an exemplary embodiment of the invention.
- FIG. 1 is a schematic illustration of implementation of a personalized voice detection system, according to an exemplary embodiment of the invention.
- two users user A and user B conduct a conversation over telephones 110 and 140 respectively.
- user A produces an audio signal 120 comprising speech, which is received by telephone 110 for transmission to telephone 140 at user B.
- other audio signals 130 in the vicinity of user A are also received by telephone 110 .
- Other signals 130 may be noises (e.g. dogs barking, musical instruments playing, water running, a toilet flushing) or may be other people speaking.
- additional noises that originate from the communication equipment and/or the communication network may also interfere with audio signal 120 .
- user B is interested in hearing audio signal 120 of user A without audio signals 130 of the noise surrounding user A.
- user A is interested that user B receive audio signal 120 and not hear audio signals 130 , which for example may disclose details that user A is not interested in disclosing, for example the current location of user A (e.g. not at work) or current activity of user A (e.g. watching a movie).
- the conversation originating from user A will use a personalized voice detection system 100 which transfers only voices of users that are registered to use system 100 and suppresses all other sounds.
- system 100 can be implemented as a uni-directional system or a bi-directional system. Additionally, system 100 can be implemented to transfer the voice signal of a single user or multiple users. Further additionally, system 100 may interface between two or more users or between a user and a machine, for example system 100 may serve as a filter for receiving audio signals to activate a voice activated device so that the machine will receive a clean voice signal, wherein only the voice of registered users will reach the machine.
- system 100 enhances system performance by reducing bandwidth requirements since it provides a filtered signal.
- a clean voice signal and/or a reduced size voice signal can reduce processing time since the network has less data to handle.
- FIG. 2 is a schematic illustration of a communication network 200 and optional positions for deploying a personal voice detection system, according to an exemplary embodiment of the invention.
- various optional positions for implementing a personal voice detection system are marked with a star 210 .
- any device in the communication network through which a transmitted audio signal is transferred can be altered to accommodate system 100 , for example a base station of a mobile telephone network, a multiple control unit (MCU), a mobile switching center (MSC), a mobile telecommunications switching office (MTSO), a public switched telephone network server (PSTN), a voice mail server, a signal processing unit, a telephone of any type (e.g.
- system 100 is positioned directly in the transmitting device (e.g. telephone) in order to reduce transmission bandwidth requirements by sending a clean voice signal and/or to deal with the signal as it is created, without distortions that may be incurred during the transfer of the signal.
- system 100 may be positioned directly in the device receiving the audio data, for example to serve as a filter that provides only speech data of authorized users to arrive at the device, thus simplifying the functions of the device.
- system 100 may comprise more than one unit with each part residing in a different location, for example the processing unit may be in one location and a database for recording information of a registered user may be located in a different location.
- a conversation may be transmitted through a path with more than one system 100 .
- only one system 100 is activated at a time to filter the audio transmissions.
- more than one system 100 can be used sequentially to verify accuracy of the filtering process performed by the previous units.
- FIGS. 3A-3D are flow diagrams of the process of analyzing an audio segment, according to an exemplary embodiment of the invention.
- a speaker speaks in bursts of speech followed by a pause and/or a response from the opposing party or device. This process is typically repeated throughout a conversation.
- Each burst can be viewed as a sequence of one or more audio segments.
- FIG. 4 is a schematic illustration of a buffer 400 storing previous segments of an audio signal, according to an exemplary embodiment of the invention.
- FIG. 4 illustrates a currently being created audio burst 420 with a current segment 410 .
- T 1 and T 2 define previously received bursts 430 .
- every audio segment received 410 is analyzed by system 100 to determine if it is voice and if it is the voice of one of the registered users that needs to be transferred, or if it is a foreign sound that should be suppressed, for example speech of a non-registered person.
- system 100 takes into account the state of system 100 when receiving the segment and the state of previously accepted segments, for example whether segment 410 is the start of a burst of speech, and whether the previous segments were determined to be a registered user's speech.
- the size of a segment may be a piece of an audio signal measured by an absolute time interval or may be the size of a single syllable or more or less than a single syllable of speech.
- segment 410 will be short (e.g. less than 100 milliseconds) to prevent delay in transfer of the audio signal; for example, if system 100 analyzed full words there would be a noticeable delay when conducting a live conversation.
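As a rough sketch of how an incoming signal might be cut into short fixed-length segments, the helper below splits a list of samples into pieces shorter than 100 milliseconds. The function name, the 20 ms default and the use of plain Python lists are assumptions made for illustration; the source also allows syllable-sized segments, which this sketch does not cover.

```python
from typing import Iterator, List, Sequence


def split_into_segments(samples: Sequence[float],
                        sample_rate_hz: int,
                        segment_ms: int = 20) -> Iterator[List[float]]:
    """Yield consecutive fixed-length segments of an audio signal."""
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    if not 0 < segment_ms < 100:
        # Keep segments under 100 ms to avoid a noticeable delay.
        raise ValueError("segment_ms must be between 1 and 99")
    size = max(1, sample_rate_hz * segment_ms // 1000)
    for start in range(0, len(samples), size):
        # The final segment may be shorter than the others.
        yield list(samples[start:start + size])
```

At 8000 samples per second and 20 ms per segment, one second of audio produces 50 segments of 160 samples each.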
- segment 410 may comprise:
- A registered user's voice with additional voices and/or sounds, which will be referred to as noise;
- system 100 may pre-process each segment to remove sounds which are clearly not a person's voice before analyzing the segment to determine if it contains a registered user's voice. Alternatively or additionally, system 100 may process each segment after analyzing the segment to remove unwanted sounds, which co-reside in segment 410 with the voice of a registered user.
- voice includes only speech of a person. Alternatively, voice includes any sound coming from a person's mouth, for example laughter and a cough.
- system 100 uses long term analysis of audio signal 420 (e.g. burst by burst) with short term analysis (e.g. segment by segment) to clean out the registered user's voice signal.
- system 100 may use long term analysis to identify the voices of the registered users and learn the characteristics of the other noises appearing in the signal.
- the other noises can then be identified more easily when analyzing each segment and be removed when they appear in future segments as the conversation advances.
- the receiver of the conversation may hear foreign sounds, which will quickly disappear as the conversation advances.
- system 100 can be used to remove echo from a conversation.
- Another method that can be used by system 100 to clean out the audio signal is called a Wiener filter, wherein the noise is assumed to be mainly stationary. When the noise appears by itself its characteristics are learnt, and the noise is subsequently removed from the following segments.
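The stationary-noise approach described above (commonly known as Wiener filtering) can be sketched per frequency band. The class below is an illustrative assumption, not the patent's method: it presumes some earlier step has already converted each segment into a list of per-band power values, learns the average noise power from noise-only segments, and then computes a gain of `max(0, 1 - noise / observed)` for each band of a later segment.

```python
from typing import List, Sequence


class StationaryNoiseEstimate:
    """Learn per-band noise power and derive Wiener-style gains."""

    def __init__(self, num_bands: int) -> None:
        self.noise_power: List[float] = [0.0] * num_bands
        self.count = 0

    def learn(self, band_power: Sequence[float]) -> None:
        """Update the running mean from a segment that contains only noise."""
        self.count += 1
        for i, power in enumerate(band_power):
            self.noise_power[i] += (power - self.noise_power[i]) / self.count

    def gains(self, band_power: Sequence[float]) -> List[float]:
        """Return a gain in [0, 1] per band for a noisy segment."""
        result = []
        for observed, noise in zip(band_power, self.noise_power):
            if observed <= 0.0:
                result.append(0.0)
            else:
                # Bands dominated by the learnt noise are attenuated to zero.
                result.append(max(0.0, 1.0 - noise / observed))
        return result
```

Multiplying each band of the segment by its gain attenuates bands where the learnt noise dominates while leaving bands with strong voice energy nearly untouched.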
- system 100 may analyze the strongest signal in segment 410 (and previous segments) to determine whether to transfer the segment.
- source separation techniques that are known in the art are used to divide each segment into multiple segments, one for each audio source.
- the segments of each audio source can then be analyzed separately, for example when multiple people are speaking system 100 can analyze the voice of each speaker separately to determine if it is of a registered user and should be transferred or suppressed.
- source separation can be performed by forming separate distinct continuous signals from a given signal. In some methods source separation requires knowledge of the characteristics of one of the signals (e.g. the registered user) or, alternatively, additional hardware such as extra microphones near the audio source.
- the frequencies belonging to the registered user can be determined and frequencies that could not belong to the registered user are removed from audio segment 410 .
- other known noise filtering methods can also be implemented to clear out noise from the audio signal being generated.
- system 100 is provided with a database of common sounds which are suppressed if identified in the analyzed segments.
- system 100 records new sounds for suppression in the database, based on the activity of system 100 .
- system 100 attempts to completely suppress unwanted sounds.
- system 100 allows the user to select a suppression level by which unwanted sounds will be suppressed, so that the user may control the level of the background noise.
- the suppression level is selected automatically, for example to achieve a noise level proportional to the volume of the voice of the registered user.
- system 100 additionally enhances the quality (e.g. volume) of the voice of registered users.
- when background noises are eliminated, system 100 provides an audio signal, for example a low beep at the beginning of or during burst 420 , to signify to the listener that the quality of the voices may be low and hard to understand due to noises that previously interfered with the voice signal, even if the noises have been removed.
- system 100 provides an indication (e.g. a graphical indication) on the communication device (e.g. 110 , 140 ) if system 100 is activated.
- the speaker may receive an indication if system 100 recognizes him/her as a registered speaker.
- various parameters are used independently or in combination to determine if a specific segment comprises voice and matches the voice of a specific user.
- An example of such a parameter includes details such as the pitch of the speaker.
- Another parameter that may be used is the average long-term LPC spectrum.
- pronunciation of specific letters assuming a specific language is dealt with can also be used to identify the speaker.
- some parameters (e.g. pitch) can be determined from a short term piece of an audio signal (e.g. a segment), while others (e.g. long-term LPC) require longer term pieces of the audio signal (e.g. a burst of speech).
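To illustrate how a short-term parameter such as pitch might be obtained from a single segment, the sketch below uses plain autocorrelation. The source does not say how pitch is measured, so the technique, the function name and the 60 to 400 Hz search range are all assumptions chosen for illustration.

```python
import math
from typing import Optional, Sequence


def estimate_pitch_hz(segment: Sequence[float],
                      sample_rate_hz: int,
                      min_hz: float = 60.0,    # assumed lower bound for voice
                      max_hz: float = 400.0    # assumed upper bound for voice
                      ) -> Optional[float]:
    """Estimate pitch as the lag with the strongest autocorrelation."""
    min_lag = max(1, int(sample_rate_hz / max_hz))
    max_lag = min(int(sample_rate_hz / min_hz), len(segment) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(segment[n] * segment[n + lag]
                    for n in range(len(segment) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # None means no periodicity was found (e.g. silence).
    return sample_rate_hz / best_lag if best_lag else None


# Usage: a 40 ms test tone at 200 Hz sampled at 8000 Hz.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(320)]
pitch = estimate_pitch_hz(tone, 8000)
```

A long-term parameter such as an average LPC spectrum would instead need to accumulate data over a whole burst, which is why the two kinds of parameter are treated separately.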
- system 100 attempts to provide a short term determination and improve it by long term determinations as described below.
- long term determinations are additionally used to identify background noises so that they can be removed.
- system 100 initially accepts ( 302 ) segment 410 of an audio signal.
- system 100 checks ( 304 ) what the current state of analysis is and processes the segment accordingly. If the current status is that no voice has been detected so far, or if this is a first segment, processing of the next segment 410 is transferred to state 308 “novoice”. Alternatively, if the previous segment was determined to be voice from a burst of speech from a registered user then processing is transferred to state 310 “myburst” (voice of a registered user). Otherwise, if the previous segments were determined to be voice from a non-registered user then processing is transferred to state 306 “oburst” (voice of “other” than a registered user (i.e. not belonging to a registered user)).
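The state check at step 304 can be pictured as a small dispatcher. In this sketch the enum values reuse the reference numerals 308, 310 and 306 from the flow diagram; the `State` type, the `dispatch` function and the string labels for the previous determination are hypothetical names introduced only for illustration.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    NOVOICE = 308   # no voice detected so far, or first segment
    MYBURST = 310   # previous segment was voice of a registered user
    OBURST = 306    # previous segments were voice of a non-registered user


def dispatch(previous: Optional[str]) -> State:
    """Pick the processing state from the previous determination.

    `previous` is None when no voice has been detected yet, otherwise
    "registered" or "other" (labels assumed for this sketch).
    """
    if previous is None:
        return State.NOVOICE
    if previous == "registered":
        return State.MYBURST
    if previous == "other":
        return State.OBURST
    raise ValueError(f"unknown determination: {previous!r}")
```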
- FIG. 3B illustrates the processing flow for state 308 “novoice”.
- system 100 attempts to determine if segment 410 represents voice activity ( 312 ). If no voice activity is detected ( 314 ), for example segment 410 is determined to contain silence, segment 410 is suppressed ( 316 ), and not transmitted further. Alternatively, if segment 410 is determined to be voice activity ( 314 ), segment 410 is analyzed ( 318 ) to determine if it is the voice of a registered user ( 320 ). If segment 410 is determined to be the voice of a registered user with a high level of certainty, processing of segment 410 is transferred to state 310 “myburst”.
- If segment 410 is determined not to be the voice of a registered user with a high level of certainty, processing of segment 410 is transferred to state 306 “oburst”. If segment 410 does not provide a clear determination regarding conformity to the voice of the registered users, system 100 will calculate ( 322 ) an estimated probability value regarding the probability that segment 410 is the voice of a registered user, based on segment 410 and any previous segments. If the estimated probability value is less than a pre-determined threshold value, processing of segment 410 is transferred to state 306 “oburst”. If the estimated probability value is greater than a pre-determined threshold value, processing of segment 410 is transferred to state 310 “myburst”.
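The “novoice” flow can be condensed into one decision function. This is a sketch under assumptions: the voice detector and the speaker matcher are represented only by their results (`voice_detected`, `clear_match`), the threshold default of 0.5 is invented, and a probability exactly equal to the threshold is sent to “oburst”, a case the source leaves open.

```python
from typing import Optional


def novoice_step(voice_detected: bool,
                 clear_match: Optional[bool],
                 estimated_probability: float,
                 threshold: float = 0.5) -> str:   # assumed threshold
    """Return "suppress", "myburst" or "oburst" for one segment.

    `clear_match` is True or False when the segment is determined with a
    high level of certainty, and None when no clear determination exists.
    """
    if not voice_detected:
        return "suppress"            # steps 314, 316: e.g. silence
    if clear_match is True:
        return "myburst"             # certain: voice of a registered user
    if clear_match is False:
        return "oburst"              # certain: voice of someone else
    # Unclear: fall back to the estimated probability value (step 322).
    return "myburst" if estimated_probability > threshold else "oburst"
```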
- FIG. 3C illustrates the processing flow for state 310 “myburst”.
- If the current segment is determined by state 308 “novoice” to represent voice of a registered user, or if previous segments were determined to be voice of a registered user or to probably be voice of a registered user, then the current consecutive segment 410 will be processed in state 310 (“myburst”).
- segment 410 is analyzed ( 330 ) to determine ( 332 ) if segment 410 is part of current burst 420 with voice of a registered user, for example segment 410 may contain silence but yet be considered part of burst 420 since it is assumed that the user will continue to talk shortly.
- If segment 410 contains silence, it makes no difference to the receiver whether segment 410 is transmitted or suppressed. However, there is a difference if segment 410 contains other noise while the user is not speaking. If it is clear that segment 410 is the voice of a registered user, system 100 updates ( 334 ) burst counters, which record for current burst 420 details such as the audio parameters found for each segment of burst 420 , the number of segments analyzed and other details used to implement a determination regarding the identity of the speaker for burst 420 (e.g. the average probability estimation of each of the segments so far received for burst 420 ). Optionally, the recorded counters are used to improve the accuracy in estimating the probability of the subsequent segments of burst 420 as belonging to the voice of a registered user.
- segment 410 is transferred ( 336 ) to the receiver, while taking into account the quality with which the prior segments were transferred.
- quality is adjusted to form a smooth and continuous audio signal with prior segments, which may have been transferred with a degraded quality as described below.
- system 100 estimates ( 338 ) a probability value regarding the probability of segment 410 belonging to the burst of a registered user based on segment 410 and previous segments of current burst 420 , for example using the values recorded in the burst counters.
- the estimated probability value and/or the content of segment 410 are used to determine if segment 410 is part of current burst 420 or if current burst 420 has completed ( 340 ) and segment 410 is the beginning of a new burst. If system 100 determines that segment 410 is after the end of burst 420 (e.g. a pause between bursts of speech) or the beginning of a new burst (e.g. with a different speaker), system 100 will transfer control to state 308 (“novoice”) to deal with the segment.
- If system 100 determines that segment 410 does not complete current burst 420 , system 100 updates ( 342 ) the burst counters.
- since system 100 is in state 310 (“myburst”), there is a presumption that segment 410 is of a registered user, and segment 410 is transferred ( 344 ) to the receiver.
- the quality of segment 410 is optionally degraded according to the estimated probability, which may even be zero.
- the quality used to transfer segment 410 is proportional to the probability level, for example the more probable that segment 410 is a registered user's voice the better the quality with which it is transferred.
- each segment of current burst 420 which is identified as voice of a registered user changes the estimated probability value for the subsequent segments; thus the first segment or segments of current burst 420 may be transferred with a reduced quality, and the quality will increase quickly to the full quality of the source as the burst is authenticated.
- If a burst is initially assumed to be the voice of a registered user and is gradually determined not to be the voice of a registered user, it will quickly peter out as the determination is verified.
- the characteristics of the quality that is controlled include volume and/or removal of high order frequencies. Alternatively or additionally, other characteristics may be controlled, for example sample rate.
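The interplay between burst counters and transfer quality can be sketched as below. The sketch assumes that the burst-level probability is simply the running average of the per-segment estimates and that quality is controlled through volume alone; the names `BurstCounters` and `transfer_with_quality` are hypothetical, and a real system could equally control high-order frequencies or sample rate.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class BurstCounters:
    """Running details recorded for the current burst."""
    segments: int = 0
    probability_sum: float = 0.0

    def update(self, probability: float) -> None:
        self.segments += 1
        self.probability_sum += probability

    @property
    def average(self) -> float:
        """Average probability estimate of the segments received so far."""
        return self.probability_sum / self.segments if self.segments else 0.0


def transfer_with_quality(samples: Sequence[float],
                          counters: BurstCounters,
                          segment_probability: float) -> List[float]:
    """Scale the segment volume by the burst's average probability."""
    counters.update(segment_probability)
    gain = counters.average          # quality proportional to probability
    return [sample * gain for sample in samples]
```

With this scheme, a burst whose first segment scores 0.5 and whose second scores 1.0 is transferred at half volume and then at three-quarters volume, rising toward full quality as evidence accumulates.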
- when transferring segment 410 to the receiver, system 100 re-estimates the probability of one or more prior segments, or even of the entire preceding part of current burst 420 , to verify that the previous decision is correct in view of the current decision regarding segment 410 .
- system 100 may transfer one or more prior segments from buffer 400 at an accelerated rate (relative to the standard transmission rate) to the receiver, to prevent forming a delay in transfer of the conversation.
- the transfer of a small number of prior segments can improve accuracy of system 100 without a person receiving the transmission noticing any deterioration in the signal due to the small size of the segments and the small time interval under consideration.
- one or more consecutive segments are also transferred at an accelerated rate to achieve a smooth signal.
- the transfer rate of the segments is accelerated by transferring them at a higher frequency.
- the transfer rate is accelerated by analyzing the content of the segments and removing or combining and shortening segments which do not affect the speech content being transferred, for example segments or portions of segments which contain silence or noise.
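One way to picture shortening a backlog of previously blocked segments is to drop those that carry no speech content before sending the rest. The sketch below is an assumption-laden illustration: it treats a segment as silence when its RMS level falls below an invented threshold, and the names `rms` and `compact_backlog` are hypothetical.

```python
import math
from typing import List, Sequence


def rms(segment: Sequence[float]) -> float:
    """Root-mean-square level of a segment (0.0 for an empty segment)."""
    if not segment:
        return 0.0
    return math.sqrt(sum(s * s for s in segment) / len(segment))


def compact_backlog(segments: Sequence[Sequence[float]],
                    silence_rms: float = 0.01      # assumed silence level
                    ) -> List[List[float]]:
    """Keep only segments loud enough to affect the speech content."""
    return [list(seg) for seg in segments if rms(seg) >= silence_rms]
```

Sending the compacted backlog takes less time than sending every buffered segment, which lets the transfer catch up with the live signal.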
- FIG. 3D illustrates the flow process for state 306 “oburst”.
- segment 410 is analyzed ( 350 ) to determine if segment 410 represents a piece of current burst 420 ( 352 ) and is a continuation of the previously identified voice. If it is clear that segment 410 represents the voice of a non-registered user, as previously determined, the burst counters are updated ( 354 ).
- the segment is suppressed ( 356 ) so that the receiver does not receive the content of the segment.
- system 100 will estimate ( 358 ) a probability value regarding the content of segment 410 .
- the probability value is compared to a pre-selected threshold value to determine if segment 410 should be suppressed or handled by 308 “novoice”.
- If segment 410 is probably a new burst ( 360 ), control will be transferred to 308 “novoice” to continue processing.
- the burst counters are updated ( 362 ) and segment 410 is suppressed ( 364 ).
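The “oburst” flow can be summarized in a short function. As a sketch, it assumes that the estimated probability value expresses how likely the segment is to start a new burst, and it uses an invented default threshold; the function name and the returned tuple are hypothetical.

```python
from typing import Tuple


def oburst_step(clearly_other_voice: bool,
                new_burst_probability: float,
                threshold: float = 0.5) -> Tuple[str, bool]:   # assumed threshold
    """Return (action, update_counters) for a segment in state "oburst"."""
    if clearly_other_voice:
        # Steps 354 and 356: continuation of a non-registered voice.
        return ("suppress", True)
    if new_burst_probability > threshold:
        # Step 360: probably a new burst, hand over to "novoice".
        return ("novoice", False)
    # Steps 362 and 364: still part of the other speaker's burst.
    return ("suppress", True)
```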
- system 100 is implemented with an adjustable aggressiveness level, which determines if system 100 should be more stringent or more flexible in deciding whether to transfer segments or to suppress them.
- the user can select the aggressiveness level, which will typically alter the threshold values used.
- the aggressiveness level is determined automatically, for example based on the type of noise.
- the aggressiveness level especially affects the decisions regarding audio data in a noisy environment or audio data which is not exactly speech, for example laughter and coughs of a registered user.
- system 100 is implemented to allow the user to select if laughter, coughs and other sounds made by the user should be transferred.
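One way to picture how an aggressiveness level could alter the threshold values is a simple linear mapping. The sketch below is an assumption made for illustration: the range of 0 to 1, the function name `thresholds_for` and the numeric constants are all hypothetical.

```python
from typing import Tuple


def thresholds_for(aggressiveness: float) -> Tuple[float, float]:
    """Map an aggressiveness level in [0, 1] to (block_below, pass_above).

    A higher level raises both thresholds, so more segments are suppressed.
    The constants are illustrative and not taken from the patent.
    """
    if not 0.0 <= aggressiveness <= 1.0:
        raise ValueError("aggressiveness must be between 0 and 1")
    block_below = 0.2 + 0.4 * aggressiveness
    pass_above = 0.6 + 0.3 * aggressiveness
    return block_below, pass_above
```

With these constants, a segment scored at 0.5 would be passed with adjustment at level 0 but blocked at level 1.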
- system 100 needs to define which voices are to be transferred and which voices are to be suppressed. This information is provided by registration of authorized users; during registration system 100 records information that is needed to identify a user.
- the registration process may be done by a dedicated process, for example using additional equipment such as a computer which records the voice of the user to analyze the verbal data provided by the user.
- the computer may record details about the user, for example age and sex. The results of the registration process are then supplied to system 100 .
- system 100 may allow a user to activate a registration process using the same equipment as for conducting a conversation, for example by dialing a special key sequence and repeating a specific message.
- the registration process may require providing general information about the user, for example sex, and age, by responding to questions presented to the user.
- the registration process may be automatic, for example by learning the characteristics of voice of the first speaker or main speaker in a conversation (e.g. the loudest speaker or speaker that talked the most) and suppressing any other sound or voice during the duration of the conversation.
- system 100 may register more than one speaker, for example all the people that spoke during a learning conversation.
- characteristics of a user's voice may include pitch, long term LPC spectrum and pronunciation of specific words.
- the registration process may take into account the language used by the user, since identification of a user is improved when knowing the language used by the user.
- FIG. 5 is a flow diagram 500 of the registration process of a user, according to an exemplary embodiment of the invention.
- the registration process ( 505 ) for system 100 is activated by the user or activated by system 100 at the beginning of a conversation to automatically register the speaker as mentioned above.
- system 100 checks if the user is a new user ( 510 ) or already exists in a registration database on system 100 . If the user already exists, system 100 retrieves ( 515 ) the stored information to allow the user to amend the previously registered data.
- system 100 checks if the user is interested in updating his/her background information ( 520 ), for example the age and sex of the user. If the user is interested in updating background information, system 100 accepts the details to update the information ( 525 ), for example using a voice recognition system or verbally providing the user with options that can be selected by key-tones.
- the user is provided with the option of generating a user voice model by repeating specific sentences ( 530 ). If the user is interested, he/she is requested to repeat specific sentences ( 535 ) and the information is used to create or update a model representing the user voice.
- the user is provided with the option of recording free style speech ( 540 ). If the user is interested in performing registration in this form, he/she is required to provide ( 545 ) free style verbal data to system 100, which will allow system 100 to create a model representing the user's voice.
- system 100 can analyze ( 550 ) the speech data spoken by the user to automatically form a voice model.
- a user may register using one of the above methods and then provide more input ( 555 ) by performing another of the above registration methods in order to enhance the accuracy of the voice model created for the user.
- the registered information can be stored in the registration database of system 100 and the user can exit the registration process ( 560 ).
- an enhanced voice model provides more accurate results when analyzing speech segments.
- even the knowledge of a user's background by itself can be used to determine possible voice ranges, for example an average pitch value, and rule out other values.
- system 100 will thus be able to identify that an analyzed segment does not belong to the age group of the registered user or to the sex of the user, and rule out the possibility that the voice segment belongs to him or her.
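Ruling out a segment from background information alone can be sketched as a range check on the measured pitch. The pitch ranges below are rough illustrative figures for adult speakers, not values from the patent, and `could_be_user` and `margin_hz` are hypothetical names.

```python
# Rough, illustrative fundamental-frequency ranges in Hz (assumed values).
PITCH_RANGES_HZ = {
    "male": (85.0, 180.0),
    "female": (165.0, 255.0),
}


def could_be_user(measured_pitch_hz: float, sex: str,
                  margin_hz: float = 20.0) -> bool:
    """Return False when the measured pitch lies outside the range expected
    for the registered user's background, so the segment can be ruled out."""
    low, high = PITCH_RANGES_HZ[sex]
    return low - margin_hz <= measured_pitch_hz <= high + margin_hz
```

A check of this kind can only exclude candidates; a segment that passes still needs to be compared against the user's voice model.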
- system 100 is activated to filter out non-registered users of a communication device. Following is an exemplary list of characteristics and uses demonstrating the affinity between the users and a specific device, which can be implemented using system 100 :
- The sole user of the device, for example a private mobile telephone, which is usable only by the owner.
- The main users of a device, for example a family telephone line for use by a specific family.
- system 100 will filter out other voices during the shift hours and block the user's voice at other times.
- Short term users of a device, for example guests at a hotel, can be registered to use a device for a limited period, which expires when their stay is over.
- A group of users, for example a group of users conducting a conference call, wherein any other voices are filtered out.
- Any user, for example a communication device that allows anybody to initiate a conversation, automatically registers the speaker and only transfers their voice during the duration of the call.
- Physical authorization, for example a communication device that requires plugging in an identity card, which authorizes use by the card holder and blocks voices of any other person.
- Code authorization, for example a communication device that requires entering a code for each user that will be considered registered for a specific conversation or until cancellation by the user.
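Several entries in the list above imply that a registration can lapse, for example at the end of a hotel stay or of a single conversation. A minimal sketch of that idea follows, assuming a hypothetical `Registration` record with an optional expiry timestamp in seconds; none of these names appear in the patent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Registration:
    """Hypothetical registration record; expires_at of None means no expiry."""
    user_id: str
    expires_at: Optional[float] = None

    def is_active(self, now: float) -> bool:
        return self.expires_at is None or now < self.expires_at


def active_users(registrations: List[Registration], now: float) -> List[str]:
    """Return the ids of users whose voices should currently be transferred."""
    return [r.user_id for r in registrations if r.is_active(now)]
```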
- a communication device that implements use of system 100 can provide an option of toggling the use of system 100 .
- a non-registered user can disable the personalized voice detection function so that a non-registered user can use the communication device without system 100.
- system 100 can be toggled off.
- system 100 is installed in the communication device and toggled directly on the device, for example using a switch or special key sequence.
- system 100 is installed in other parts of the network and can be toggled, for example by transmitting a code (e.g. dialing a specific number), which is intercepted by system 100.
- a specific code can be associated with a list of one or more registered users.
- the voices of the registered users in the list will be transferred during the call and any other sound or voice will be suppressed.
- FIG. 6 is a schematic illustration of the main components of system 100 for analyzing audio data, according to an exemplary embodiment of the invention.
- system 100 comprises a processor, which could be implemented by a central processing unit (CPU), a dedicated processing chip (e.g. a DSP) 610 or a circuit for processing and controlling functionality of system 100 .
- system 100 comprises a work memory 620 for storing data while processing audio signals, and a database 630 , which provides a non-volatile memory for long term storage of data, for example the entire content of a conversation and registration details of users.
- system 100 comprises a user interface 640 and a channel interface 650 .
- user interface 640 provides interaction with the user regarding the function of system 100 , for example to provide a graphical interface showing the state of the system, to allow toggling the state of the system from on to off or vice versa.
- channel interface 650 accepts audio signals from the user side and transfers the processed audio signal to a communication channel for transferring it to a receiver.
- system 100 accepts the audio signal provided by channel interface 650 , buffers it in work memory 620 , processes the audio signal as it is created segment by segment and passes on the processed signal via channel interface 650 to the receiver.
- system 100 is implemented by a standard general purpose computer, for example a personal computer.
- system 100 can be implemented by a dedicated device.
- system 100 is implemented in a single encasement at a single location.
- system 100 may be implemented by multiple pieces in a plurality of encasements and optionally, positioned at multiple locations, for example the database may be stationed at a remote site.
- system 100 may analyze the audio signal in parallel to other audio analysis systems which may deal with other aspects of analyzing audio signals, for example CODEC processing.
- system 100 may provide analysis information to other audio systems, which may be connected sequentially or any other way to reduce the processing required for those systems.
- a video call system can register the face of one or more users and filter out the background during a call, for example if they are not interested in disclosing their location (e.g. not in their office).
- the face can be registered by taking a picture of the user with a digital camera for example a web cam or a camera built into a cellular telephone.
- the system will learn identifying characteristics of the user's face as is commonly known in the art.
- filtering out the background of a picture leaving only the user's face and possibly body is commonly practiced in the art.
- the background may be replaced with a predetermined background that is selected by the user.
- the image processing system can be used to reduce bandwidth since the background is filtered out.
- many of the options that are applicable to audio processing are also applicable to image processing, for example toggling the system on and off, and automatic registration (e.g. during the beginning of the call).
- Section headings are provided for assistance in navigation and should not be considered as necessarily limiting the contents of the section.
Abstract
A method of transferring a real-time audio signal transmission, including: registering voice patterns (or other characteristics) of one or more users to be used to identify the voices of the users, accepting an audio signal as it is created as a sequence of segments, analyzing each segment of the accepted audio signal to determine if it contains voice activity (314), determining a probability level that the voice activity of the segment is of a registered user (320 & 322); and selectively transferring the contents of a segment responsive to the determined probability level (324).
Description
This application claims priority from U.S. provisional application No. 60/597,213 filed Nov. 17, 2005, the disclosure of which is incorporated herein by reference.
The present invention relates generally to voice activity detection and more specifically to automatic identification and transfer of voice activity of specific speakers.
Voice activity detection (VAD) is the art of detecting the presence of voice activity, generally human speech, in audio signals. Voice activity detection is used in a wide range of systems handling audio signals for example systems dealing with: telecommunication, speech recognition, speaker verification, speaker identification, speaker segmentation, voice recording, noise suppression and others. In a telecommunication system voice activity detection can be used to implement different sampling rates based on the voice activity level detected, for example to raise/reduce the bandwidth when dealing with audio segments containing human speech. A speaker verification/identification system can be simplified by limiting processing to audio segments containing speech. A noise suppression system can use voice activity detection for comparing between segments with speech activity relative to segments without speech activity. In voice recording systems voice activity detection can be used to reduce the required storage space by limiting the recording to meaningful information (e.g. segments with speech activity).
Many voice controlled systems and/or applications are intended to receive voices from a single person or single group of people, and would function better if they actually receive only the voice or voices of the intended people, for example:
1. Speaker verification systems such as used by banks to authenticate the customer;
2. Voice activated appliances, which are trained to recognize specific voices and/or commands; and
3. Telephone tapping devices, which are interested in recording voices of specific people.
Likewise in telephone conversations any background noise or voices of other people not participating in the conversation can be considered noise, for example:
1. When talking on a speakerphone with other people talking in the background;
2. When talking on a public telephone on a noisy street;
3. When talking on a mobile telephone in a noisy environment;
4. In a call center with many agents speaking to different callers in the same room;
5. When talking on the telephone and not interested that the party on the other end will identify the speaker's location, for example with a loudspeaker giving announcements in the background;
6. When conducting a conference call in a closed room and a person that is not participating in the conversation enters the room to deliver a verbal message to one of the participants.
Some systems attempt to transfer voice and eliminate noise in order to improve efficiency in dealing with the signal. In some cases more sophisticated input devices (e.g. extra microphones and/or sensors) are used in order to help differentiate between different speakers and/or noise.
U.S. patent application publication No. 2005/0033572 published Feb. 10, 2005, the disclosure of which is incorporated herein by reference, describes an apparatus and method of a voice recognition system for an audio-visual system. The system receives reflected sounds from an audio-visual system, noise and a user's voice and is configured to isolate the user's voice and compare it to voice patterns that belong to at least one model.
Japanese patent No. 11-154998 from Jun. 8, 1999, the disclosure of which is incorporated herein by reference, describes registering a voice print of a speaker; then, during transmission, a microphone collects a signal comprising the speaker's voice and ambient noise. The signal is input to a comparing filter that extracts the voice of the speaker from the signal by comparing to the registered voice print.
There is however a basic problem in implementing a system as suggested in the Japanese patent. In implementing a system for determining if a specific audio signal is voice and if it matches a specific voice pattern of a specific speaker, statistical methods are used, providing a probability level of conformity. The determination is not an absolute process wherein a real-time signal being generated is passed through a processor, which instantaneously provides a clean output signal that includes only the speech of a specific speaker. The above determination requires statistical analysis of each part of the evolving audio signal to determine if the part contains the specific speaker's voice or not. In some cases further evaluation of the evolving audio signal may reverse a previous determination, for example an audio segment which was initially determined to probably be a specific speaker may later be determined not to be the specific speaker or vice versa. Generally, instantaneous transfer of the audio will introduce a high level of error in the output signal, leading to portions of the speech being cut off or transfer of a large portion of the background noise. In contrast, the greater the delay introduced before providing a determination the more accurate the decision tends to be; however, providing a determination with a delay of more than a small amount (e.g. more than 100 milliseconds) will result in a conversation of unacceptable quality.
An aspect of an embodiment of the invention relates to a system and method of transferring audio data in real-time wherein only the voice of a registered user will be transferred. The system initially registers the voice patterns and/or characteristics of one or more users. The system then analyzes the audio data, segment by segment as it is created and transferred in real time. The system checks if a segment contains voice and if the voice is of a registered user.
In an exemplary embodiment of the invention, the system calculates a probability level that a segment representing voice is of a registered user and transfers the segment responsive to the determination. Optionally, if the probability is below a pre-selected threshold value the segment is blocked. In an exemplary embodiment of the invention, if the probability level is above a pre-selected value the segment is transferred. Optionally, if the probability level is less than a pre-selected value and greater than a threshold value the segment is transferred with the quality and/or strength of the signal adjusted according to the probability level, for example raising or lowering the volume.
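The three outcomes described above (block, transfer, transfer with adjusted strength) can be sketched as a single decision function. The numeric defaults, the linear gain mapping and the names `decide`, `threshold`, `pass_value` and `min_gain` are assumptions made for illustration.

```python
from typing import Tuple


def decide(probability: float,
           threshold: float = 0.3,
           pass_value: float = 0.7,
           min_gain: float = 0.25) -> Tuple[str, float]:
    """Return (action, gain) for a segment with the given probability of
    belonging to a registered user.

    Below `threshold` the segment is blocked, above `pass_value` it is
    transferred unchanged, and in between it is transferred with a volume
    gain that grows linearly with the probability.
    """
    if probability < threshold:
        return ("block", 0.0)
    if probability > pass_value:
        return ("transfer", 1.0)
    fraction = (probability - threshold) / (pass_value - threshold)
    return ("transfer_adjusted", min_gain + (1.0 - min_gain) * fraction)
```

Scaling the volume is only one possible quality adjustment; the text leaves the exact form of the adjustment open.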
In some embodiments of the invention, some previously blocked segments may be transferred responsive to a recalculation of their probability level when calculating the probability level of a subsequent segment. Optionally, the transferred blocked segments are transferred at a higher rate to prevent a delay in the flow of the segments.
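The late release of blocked segments can be illustrated with a small hold buffer. The sketch simplifies the recalculation to a single rule, which is an assumption: once a later segment is confidently attributed to a registered user, the few most recently held segments are released ahead of it. `HoldBuffer`, `max_held` and `pass_value` are hypothetical names.

```python
from collections import deque
from typing import Deque, List


class HoldBuffer:
    """Keep the last few blocked segments so they can be sent late."""

    def __init__(self, max_held: int = 3) -> None:
        # Older held segments fall off automatically once the limit is reached.
        self.held: Deque[str] = deque(maxlen=max_held)

    def process(self, segment: str, probability: float,
                pass_value: float = 0.7) -> List[str]:
        """Return the segments to transmit now, in playback order."""
        if probability >= pass_value:
            released = list(self.held)
            self.held.clear()
            return released + [segment]
        self.held.append(segment)
        return []
```

Segments are represented by strings here only to keep the example readable; the released segments would be the ones sent at the higher rate.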
There is thus provided in accordance to an exemplary embodiment of the invention, a method of transferring a real-time audio signal transmission, comprising, registering user characteristics of one or more users to be used to identify the voices of the users, accepting an audio signal as it is created as a sequence of segments, analyzing each segment of the accepted audio signal to determine if it contains voice activity, determining a probability level that the voice activity of the segment is of a registered user; and selectively transferring the contents of a segment responsive to the determined probability level.
In an exemplary embodiment of the invention, the segments are selected to comprise a single syllable. Optionally, the segments are selected to comprise an interval of the audio signal smaller than 0.1 seconds. In an exemplary embodiment of the invention, the selectively transferring comprises adjusting the quality level of the audio segment according to the determined probability level. Optionally, adjusting the quality level comprises raising or lowering the volume of the audio signal in the segment. In an exemplary embodiment of the invention, the selectively transferring comprises transferring previously blocked segments responsive to the determination for a consecutive segment. Optionally, the previously blocked segments are transferred at a higher rate than the standard transfer rate of segments. In an exemplary embodiment of the invention, the probability level of a segment is affected by the selective transfer of segments prior to the current segment.
Optionally, the method further comprises filtering out noise from each segment before analyzing the segment. In an exemplary embodiment of the invention, the method comprises filtering out noise from each segment after analyzing the segment. Optionally, the method further comprises performing source separation to the signal in a segment creating multiple segments before analyzing the segment and analyzing the multiple segments independently. In an exemplary embodiment of the invention, the method further comprises analyzing the strongest signal in a segment comprising multiple audio signals, while taking into account prior segments. Optionally, the method further comprises inserting an audio signal into a segment to indicate amendments to the segment. In an exemplary embodiment of the invention, the transferring is through a communication system. Optionally, the characteristics comprise voice patterns of the user.
In an exemplary embodiment of the invention, the characteristics comprise general information about the user. Optionally, the selectively transferring allows voice activity of any user. In an exemplary embodiment of the invention, the selectively transferring allows voice activity of a group of users with a common characteristic. Optionally, the selectively transferring reduces bandwidth of transmissions through a communication network. In an exemplary embodiment of the invention, the probability level becomes more accurate as the audio signal is processed.
There is thus additionally provided according to an exemplary embodiment of the invention, a system for transferring a real time audio transmission, comprising, a processor to process data of the real time audio transmission and control the system, a working memory to serve as a work area for said central processing unit, a database memory to store data provided to the system for processing by said central processing unit, a channel interface to accept an audio signal for processing and transfer the processed audio signal to a receiver, a user interface to communicate with the user, wherein the system is adapted to, register characteristics of one or more users to be used to identify the voice of the users, accept via the channel interface an audio signal as it is created as a sequence of segments, analyze with the central processing unit each segment of the accepted audio signal to determine if it contains voice activity, determine a probability level that the voice activity of the segment is of a registered user, and selectively transfer the contents of a segment responsive to the determined probability level. Optionally, the system provides an indication at a user communication device if it is activated. In an exemplary embodiment of the invention, the indication gives indication if the user is recognized as a registered speaker. Optionally, the user can toggle the system on and off.
The present invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the drawings. Identical structures, elements or parts, which appear in more than one figure, are generally labeled with the same or similar number in all the figures in which they appear, wherein:
Overview
In an exemplary embodiment of the invention, user B is interested in hearing audio signal 120 of user A without audio signals 130 of the noise surrounding user A. Alternatively or additionally, user A is interested that user B receive audio signal 120 and not hear audio signals 130, which for example may disclose details that user A is not interested in disclosing, for example the current location of user A (e.g. not at work) or current activity of user A (e.g. watching a movie). In an exemplary embodiment of the invention, the conversation originating from user A will use a personalized voice detection system 100 which transfers only voices of users that are registered to use system 100 and suppresses all other sounds.
In an exemplary embodiment of the invention, user A will register for use of system 100 so that user A's voice signals will be transferred. Optionally, user B is essentially provided only with the voice of user A. In an exemplary embodiment of the invention, system 100 can be implemented as a uni-directional system or a bi-directional system. Additionally, system 100 can be implemented to transfer the voice signal of a single user or multiple users. Further additionally, system 100 may interface between two or more users or between a user and a machine, for example system 100 may serve as a filter for receiving audio signals to activate a voice activated device so that the machine will receive a clean voice signal, wherein only the voice of registered users will reach the machine.
In some embodiments of the invention, system 100 enhances system performance by reducing bandwidth requirements since it provides a filtered signal. Optionally, a clean voice signal and/or a reduced size voice signal can reduce processing time since the network has less data to handle.
Positioning
In some embodiments of the invention, system 100 may comprise more than one unit with each part residing in a different location, for example the processing unit may be in one location and a database for recording information of a registered user may be located in a different location.
In some embodiments of the invention, a conversation may be transmitted through a path with more than one system 100. Optionally, only one system 100 is activated at a time to filter the audio transmissions. In some embodiments of the invention, more than one system 100 can be used sequentially to verify accuracy of the filtering process performed by the previous units.
Implementation
In an exemplary embodiment of the invention, every audio segment received 410 is analyzed by system 100 to determine if it is voice and if it is the voice of one of the registered users that needs to be transferred or if it is a foreign sound that should be suppressed, for example speech of a non-registered person. In order to determine if an audio segment is voice of a registered user, system 100 takes into account the state of system 100 when receiving the segment and the state of previously accepted segments, for example whether segment 410 is the start of a burst of speech or not, and whether the previous segments were determined to be a registered user's speech or not.
In an exemplary embodiment of the invention, the size of a segment may be a piece of an audio signal measured by an absolute time interval or may be the size of a single syllable or more or less than a single syllable of speech. Optionally, segment 410 will be short (e.g. less than 100 milliseconds) to prevent delay in transfer of the audio signal, for example if system 100 analyzes full words there would be a noticeable delay when conducting a live conversation.
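Segmenting by an absolute time interval can be sketched as slicing the sample stream into fixed-length pieces. The 20 ms default and the function name are assumptions; the text only requires that a segment be short.

```python
from typing import List, Sequence


def split_into_segments(samples: Sequence[float], sample_rate_hz: int,
                        segment_ms: int = 20) -> List[Sequence[float]]:
    """Split a sample sequence into consecutive segments of `segment_ms`
    milliseconds each; the last segment may be shorter."""
    if sample_rate_hz <= 0 or segment_ms <= 0:
        raise ValueError("sample rate and segment length must be positive")
    samples_per_segment = max(1, sample_rate_hz * segment_ms // 1000)
    return [samples[i:i + samples_per_segment]
            for i in range(0, len(samples), samples_per_segment)]
```

At a sample rate of 8000 Hz, a 20 ms segment holds 160 samples.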
In an exemplary embodiment of the invention, segment 410 may comprise:
1. A registered user's voice without additional sounds;
2. A registered user's voice with additional voices and/or sounds, which will be referred to as noise;
3. No voice just additional sounds;
4. Voice not belonging to the registered user; or
5. Voices belonging to more than one registered user with/without additional voices, which will be referred to as noise.
In an exemplary embodiment of the invention, system 100 may pre-process each segment to remove sounds which are clearly not a person's voice before analyzing the segment to determine if it contains a registered user's voice. Alternatively or additionally, system 100 may process each segment after analyzing the segment to remove unwanted sounds, which co-reside in segment 410 with the voice of a registered user. In some embodiments of the invention, voice includes only speech of a person. Alternatively, voice includes any sound coming from a person's mouth, for example laughter and a cough.
Optionally, system 100 uses long term analysis of audio signal 420 (e.g. burst by burst) with short term analysis (e.g. segment by segment) to clean out the registered user's voice signal. As an example system 100 may use long term analysis to identify the voices of the registered users and learn the characteristics of the other noises appearing in the signal. Optionally, the other noises can then be identified more easily when analyzing each segment and be removed when they appear in future segments as the conversation advances. Thus initially, the receiver of the conversation may hear foreign sounds, which will quickly disappear as the conversation advances. Optionally, system 100 can be used to remove echo from a conversation.
Another method that can be used by system 100 to clean out the audio signal is called a Wiener filter, wherein the noise is assumed to be mainly stationary. When the noise appears by itself its characteristics are learnt and then removed from the segments that follow.
In some embodiments of the invention, when a signal includes sounds from multiple sources (e.g. multiple speakers), system 100 may analyze the strongest signal in segment 410 (and previous segments) to determine whether to transfer the segment or not. Alternatively or additionally, source separation techniques that are known in the art are used to divide each segment into multiple segments, one for each audio source. Optionally, the segments of each audio source can then be analyzed separately, for example when multiple people are speaking system 100 can analyze the voice of each speaker separately to determine if it is of a registered user and should be transferred or suppressed. Optionally, source separation can be performed by forming separate distinct continuous signals from a given signal. In some methods source separation requires knowledge of the characteristics of one of the signals (e.g. the registered user) or alternatively, additional hardware such as extra microphones near the audio source are required.
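The stationary-noise approach can be illustrated with a Wiener-style spectral gain. The sketch assumes that each frame is already available as a per-frequency-bin power spectrum (the transform itself is omitted) and that all frames have the same number of bins. The function names are hypothetical, and the gain formula is the common textbook estimate rather than anything specified in the patent.

```python
from typing import List


def estimate_noise_power(noise_frames: List[List[float]]) -> List[float]:
    """Average the power spectra of frames known to contain only noise."""
    if not noise_frames:
        raise ValueError("need at least one noise-only frame")
    bins = len(noise_frames[0])
    return [sum(frame[k] for frame in noise_frames) / len(noise_frames)
            for k in range(bins)]


def wiener_gains(observed_power: List[float],
                 noise_power: List[float]) -> List[float]:
    """Per-bin gain max(0, 1 - N/P), where P is the observed power and N the
    learnt noise power; bins dominated by noise are attenuated to zero."""
    gains = []
    for p, n in zip(observed_power, noise_power):
        gains.append(max(0.0, 1.0 - n / p) if p > 0.0 else 0.0)
    return gains
```

The gains would be multiplied into the magnitude spectrum of each later segment before converting it back to a time signal.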
In some embodiments of the invention, the frequencies belonging to the registered user can be determined and frequencies that could not belong to the registered user are removed from audio segment 410. Optionally, other known noise filtering methods can also be implemented to clear out noise from the audio signal being generated.
In an exemplary embodiment of the invention, system 100 is provided with a database of common sounds which are suppressed if identified in the analyzed segments. Optionally, system 100 records new sounds for suppression in the database, based on the activity of system 100.
In some embodiments of the invention, system 100 attempts to completely suppress unwanted sounds. Alternatively, system 100 allows the user to select a suppression level by which unwanted sounds will be suppressed, so that the user may control the level of the background noise. In some embodiments of the invention, the suppression level is selected automatically, for example to achieve a noise level proportional to the volume of the voice of the registered user. In some embodiments of the invention, system 100 additionally enhances the quality (e.g. volume) of the voice of registered users.
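An automatically selected suppression level can be sketched as a gain applied to the background so that its level stays proportional to the voice level. The target ratio and the function name are assumptions, and the levels are taken to be RMS values measured elsewhere.

```python
def noise_gain(voice_rms: float, noise_rms: float,
               target_ratio: float = 0.1) -> float:
    """Return the gain to apply to the background so that its level is about
    `target_ratio` times the voice level. The gain never exceeds 1.0, so the
    background is attenuated but never amplified."""
    if noise_rms <= 0.0:
        return 1.0  # nothing to suppress
    return min(1.0, target_ratio * voice_rms / noise_rms)
```

A user-selected suppression level could replace `target_ratio` with a value chosen through the user interface.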
In some embodiments of the invention, when background noises are eliminated system 100 provides an audio signal, for example a low beep at the beginning or during burst 420 to signify to the listener that the quality of the voices may be low and hard to understand due to noises that previously interfered with the voice signal, even if the noises have been removed.
In some embodiments of the invention, system 100 provides an indication (e.g. a graphical indication) on the communication device (e.g. 110, 140) if system 100 is activated. Optionally, the speaker may receive an indication if system 100 recognizes him/her as a registered speaker.
In an exemplary embodiment of the invention, various parameters are used independently or in combination to determine if a specific segment comprises voice and matches the voice of a specific user. An example of such a parameter is the pitch of the speaker. Another parameter that may be used is the average long-term LPC spectrum. Additionally, pronunciation of specific letters, assuming a specific language is dealt with, can also be used to identify the speaker. Optionally, some parameters (e.g. pitch) provide a determination on a short term piece of an audio signal (e.g. a segment), whereas others (e.g. long-term LPC) require longer term pieces of the audio signal (e.g. a burst of speech). In an exemplary embodiment of the invention, system 100 attempts to provide a short term determination and improve it by long term determinations as described below. Optionally, as mentioned above, long term determinations are additionally used to identify background noises so that they can be removed.
In an exemplary embodiment of the invention, as shown in flow diagram 300, system 100 initially accepts (302) segment 410 of an audio signal. Optionally, system 100 checks (304) what the current state of analysis is and processes the segment accordingly. If the current status is that no voice has been detected so far, or if this is a first segment, processing of the next segment 410 is transferred to state 308 “novoice”. Alternatively, if the previous segment was determined to be voice from a burst of speech from a registered user, then processing is transferred to state 310 “myburst” (voice of a registered user). Otherwise, if the previous segment was determined to be voice from a non-registered user, then processing is transferred to state 306 “oburst” (voice of a speaker other than a registered user).
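By way of a non-limiting illustration, the state check (304) described above may be sketched in Python as follows. The enumeration, the function name and the string labels for the previous determination are hypothetical and are introduced only for this sketch; only the state numbers and state names are taken from the description.

```python
from enum import Enum


class State(Enum):
    """Processing states named in flow diagram 300."""
    OBURST = 306   # previous segment was voice of a non-registered user
    NOVOICE = 308  # no voice detected so far, or first segment
    MYBURST = 310  # previous segment was voice of a registered user


def next_state(previous_label):
    """Select the state that processes the next segment (step 304).

    previous_label is a hypothetical summary of the previous determination:
    None (first segment), "novoice", "registered" or "other".
    """
    if previous_label is None or previous_label == "novoice":
        return State.NOVOICE
    if previous_label == "registered":
        return State.MYBURST
    return State.OBURST
```

For example, `next_state(None)` selects state 308, since a first segment has no prior determination.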
In an exemplary embodiment of the invention, segment 410 is transferred (336) to the receiver, while taking into account the quality with which the prior segments were transferred. Optionally, even if current segment 410 is clearly determined to be of a registered user its quality is adjusted to form a smooth and continuous audio signal with prior segments, which may have been transferred with a degraded quality as described below.
If it is clear that segment 410 is not voice of a registered user or if it is not clear if segment 410 is from a registered user or not, system 100 estimates (338) a probability value regarding the probability of segment 410 belonging to the burst of a registered user based on segment 410 and previous segments of current burst 420, for example using the values recorded in the burst counters.
In an exemplary embodiment of the invention, the estimated probability value and/or the content of segment 410 are used to determine if segment 410 is part of current burst 420 or if current burst 420 has completed (340) and segment 410 is the beginning of a new burst. If system 100 determines that segment 410 is after the end of burst 420 (e.g. a pause between bursts of speech) or the beginning of a new burst (e.g. with a different speaker), system 100 will transfer control to state 308 (“novoice”) to deal with the segment.
If system 100 determines that segment 410 does not complete current burst 420, system 100 updates (342) the burst counters. Optionally, since system 100 is in state 310 (“myburst”) there is a presumption that segment 410 is of a registered user and segment 410 is transferred (344) to the receiver. However, the quality of segment 410 is optionally degraded according to the estimated probability, which may even be zero. Optionally, the quality used to transfer segment 410 is proportional to the probability level, for example the more probable it is that segment 410 is a registered user's voice, the better the quality with which it is transferred. In an exemplary embodiment of the invention, each segment of current burst 420 which is identified as voice of a registered user changes the estimated probability value for the subsequent segments. Thus the first segment or segments of current burst 420 may be transferred with a reduced quality, and the quality will increase quickly to the full quality of the source as the burst is authenticated. In contrast, if a burst is initially assumed to be the voice of a registered user and is gradually determined not to be the voice of a registered user, it will quickly peter out as the determination is verified. In some embodiments of the invention, the characteristics of the quality that is controlled include volume and/or removal of high order frequencies. Alternatively or additionally, other characteristics may be controlled, for example sample rate.
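As a minimal sketch of transferring a segment with a quality proportional to the estimated probability, the following Python function scales only the volume of a segment. The function name and the representation of a segment as a list of sample values are assumptions made for this illustration; removal of high order frequencies and sample rate changes are not shown.

```python
def scale_volume(samples, probability):
    """Return the samples attenuated in proportion to the probability.

    samples: list of numeric sample values (assumed representation).
    probability: estimated probability that the segment is voice of a
    registered user; values outside [0, 1] are clamped.
    """
    gain = min(1.0, max(0.0, probability))
    return [sample * gain for sample in samples]
```

Under these assumptions, a probability of zero silences the segment, and a probability of one transfers it at the full volume of the source.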
In some embodiments of the invention, when transferring segment 410 to the receiver, system 100 re-estimates the probability of one or more prior segments, or even of the preceding part or all of current burst 420, to verify that the previous decision is correct in view of the current decision regarding segment 410. Optionally, if one or more segments were suppressed and it is determined that they should have been transmitted, system 100 may transfer one or more prior segments from buffer 400 to the receiver at an accelerated rate (relative to the standard transmission rate), to prevent forming a delay in transfer of the conversation. Optionally, the transfer of a small number of prior segments can improve accuracy of system 100 without a person receiving the transmission noticing any deterioration in the signal, due to the small size of the segments and the small time interval under consideration. In some embodiments of the invention, when accelerating the transfer rate of previously blocked segments, one or more consecutive segments are also transferred at an accelerated rate to achieve a smooth signal. In some embodiments of the invention, the transfer rate of the segments is accelerated by transferring them at a higher frequency. Alternatively or additionally, the transfer rate is accelerated by analyzing the content of the segments and removing, or combining and shortening, segments which do not affect the speech content being transferred, for example segments or portions of segments which contain silence or noise.
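One possible way of shortening a backlog of previously blocked segments by removing segments that contain only silence may be sketched as follows. The function name, the peak-amplitude test and the default threshold are assumptions chosen for illustration and are not specified in the description.

```python
def compact_backlog(segments, silence_threshold=0.01):
    """Drop buffered segments that carry no audible content.

    segments: list of segments, each a list of samples in [-1.0, 1.0]
    (assumed representation). A segment is kept only if its peak
    absolute amplitude reaches silence_threshold (assumed criterion).
    """
    return [
        segment
        for segment in segments
        if segment and max(abs(sample) for sample in segment) >= silence_threshold
    ]
```

Fewer segments then need to be sent to catch up with the live conversation, which is one way to raise the effective transfer rate without changing the content of the speech.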
In contrast, if it is not clear whether segment 410 is the continuation of the non-registered user's voice from current burst 420 as previously determined, or if it is clear that the non-registered user's voice has terminated, system 100 will estimate (358) a probability value regarding the content of segment 410. In an exemplary embodiment of the invention, the probability value is compared to a pre-selected threshold value to determine if segment 410 should be suppressed or handled by state 308 “novoice”.
If it is determined that segment 410 is probably a new burst (360), control will be transferred to state 308 “novoice” to continue processing. Optionally, if it is determined that segment 410 is probably part of the current burst of the non-registered user's voice, the burst counters are updated (362) and segment 410 is suppressed (364).
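The comparison against a pre-selected threshold in state 306 may be sketched as follows. The interpretation of the probability value as the probability that the segment continues the current non-registered burst, the dictionary used for the burst counters and the default threshold are all assumptions of this sketch.

```python
def handle_oburst(p_same_burst, counters, threshold=0.5):
    """Decide how a segment is handled while in state 306 ("oburst").

    p_same_burst: assumed meaning of the estimated probability value (358),
    namely that the segment continues the non-registered user's burst.
    counters: hypothetical dict standing in for the burst counters.
    Returns "novoice" to hand the segment to state 308, or "suppress".
    """
    if p_same_burst < threshold:
        # Probably a new burst (360): let state 308 process the segment.
        return "novoice"
    # Probably the same non-registered burst: update counters (362), suppress (364).
    counters["segments"] = counters.get("segments", 0) + 1
    return "suppress"
```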
In an exemplary embodiment of the invention, system 100 is implemented with an adjustable aggressiveness level, which determines if system 100 should be more stringent or more flexible in deciding whether to transfer segments or to suppress them. Optionally, the user can select the aggressiveness level, which will typically alter the threshold values used. Alternatively, the aggressiveness level is determined automatically, for example based on the type of noise. The aggressiveness level especially affects the decisions regarding audio data in a noisy environment or audio data which is not exactly speech, for example laughter and coughs of a registered user. In some embodiments of the invention, system 100 is implemented to allow the user to select if laughter, coughs and other sounds made by the user should be transferred.
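As one hedged illustration of how an aggressiveness level might alter a threshold value, the sketch below maps a level between 0 and 1 linearly onto a threshold range. The linear mapping, the function name and the range limits are assumptions and are not taken from the description.

```python
def threshold_for(aggressiveness, low=0.25, high=0.75):
    """Map an aggressiveness level in [0, 1] to a transfer threshold.

    A higher level yields a higher threshold, so that more segments are
    suppressed. The linear mapping and the limits are illustrative only.
    """
    level = min(1.0, max(0.0, aggressiveness))
    return low + level * (high - low)
```

With these illustrative limits, the most flexible setting requires a probability of 0.25 to transfer a segment, and the most stringent setting requires 0.75.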
Registration
In an exemplary embodiment of the invention, system 100 needs to define which voices are to be transferred and which voices are to be suppressed. This information is provided by registration of authorized users. During registration, system 100 records information that is needed to identify a user. In some embodiments of the invention, the registration process may be done by a dedicated process, for example using additional equipment such as a computer which records the voice of the user to analyze the verbal data provided by the user. Alternatively or additionally, the computer may record details about the user, for example age and sex. The results of the registration process are then supplied to system 100.
Alternatively, system 100 may allow a user to activate a registration process using the same equipment as for conducting a conversation, for example by dialing a special key sequence and repeating a specific message. Alternatively or additionally, the registration process may require providing general information about the user, for example sex, and age, by responding to questions presented to the user. In some embodiments of the invention, the registration process may be automatic, for example by learning the characteristics of voice of the first speaker or main speaker in a conversation (e.g. the loudest speaker or speaker that talked the most) and suppressing any other sound or voice during the duration of the conversation. Optionally, system 100 may register more than one speaker, for example all the people that spoke during a learning conversation.
In an exemplary embodiment of the invention, characteristics of a user's voice may include pitch, long term LPC spectrum and pronunciation of specific words. Optionally, the registration process may take into account the language used by the user, since identification of a user is improved when knowing the language used by the user.
Alternatively or additionally, the user is provided with the option of generating a user voice model by repeating specific sentences (530). If the user is interested, he/she is requested to repeat specific sentences (535) and the information is used to create or update a model representing the user voice.
Alternatively or additionally, the user is provided with the option of recording free style speech (540). If the user is interested in performing registration in this form, he/she is required to provide (545) free style verbal data to system 100, which will allow system 100 to create a model representing the user's voice.
Alternatively or additionally, system 100 can analyze (550) the speech data spoken by the user to automatically form a voice model.
In an exemplary embodiment of the invention, a user may register using one of the above methods and then provide more input (555) by performing another of the above registration methods in order to enhance the accuracy of the voice model created for the user. Alternatively, the registered information can be stored in the registration database of system 100 and the user can exit the registration process (560). Optionally, an enhanced voice model provides more accurate results when analyzing speech segments. In an exemplary embodiment of the invention, even the knowledge of a user's background by itself can be used to determine possible voice ranges, for example an average pitch value, and rule out other values. Optionally, system 100 will thus be able to identify that an analyzed segment does not belong to the age group of the registered user or to the sex of the user, and rule out the possibility that the voice segment belongs to him or her.
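The use of background information to rule out a speaker may be sketched as a simple pitch range check. The pitch ranges below are rough illustrative values only, and the table, the function name and the margin are hypothetical; none of them are taken from the description.

```python
# Rough illustrative ranges of fundamental frequency in Hz (assumed values).
PITCH_RANGES_HZ = {
    "male": (85.0, 180.0),
    "female": (165.0, 255.0),
}


def may_belong_to(sex, segment_pitch_hz, margin_hz=20.0):
    """Return False if the segment pitch rules out the registered user.

    sex: registered background information, a key of PITCH_RANGES_HZ.
    segment_pitch_hz: pitch measured for the analyzed segment.
    margin_hz: hypothetical tolerance added on both sides of the range.
    """
    low, high = PITCH_RANGES_HZ[sex]
    return low - margin_hz <= segment_pitch_hz <= high + margin_hz
```

A segment that fails this check can be rejected without consulting the full voice model, whereas a segment that passes still requires the more detailed analysis.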
Affinity
In an exemplary embodiment of the invention, system 100 is activated to filter out non-registered users of a communication device. Following is an exemplary list of characteristics and uses demonstrating the affinity between the users and a specific device, which can be implemented using system 100:
1. The sole user of the device, for example a private mobile telephone, which is usable only by the owner.
2. The main users of a device, for example a family telephone line for use by a specific family.
3. Limited scheduled users of a device, for example in a call center a worker that uses the device during specific hours (during his/her shift). Optionally, system 100 will filter out other voices during the shift hours and block the user's voice at other times.
4. Short term users of a device, for example guests at a hotel can be registered to use a device for a limited period, which expires when their stay is over.
5. Selective users, for example a telephone in a public place at a work place that is limited to function only for specific authorized workers.
6. A group of users, for example a group of users conducting a conference call, wherein any other voices are filtered out.
7. Any user, for example a communication device that allows anybody to initiate a conversation, automatically registers the speaker and only transfers their voice during the duration of the call.
8. Physical authorization, for example a communication device that requires plugging in an identity card, which authorizes use by the card holder and blocks voices of any other person.
9. Code authorization, for example a communication device that requires entering a code for each user that will be considered registered for a specific conversation or until cancellation by the user.
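For the limited scheduled users of item 3 above, the check of whether a registered worker is currently within his/her shift may be sketched as follows. The function name and the handling of shifts that span midnight are assumptions of this sketch.

```python
from datetime import time


def within_shift(now, shift_start, shift_end):
    """Return True if the time of day `now` falls within the shift.

    All arguments are datetime.time values. The shift includes its start
    and excludes its end. A shift whose end is earlier than its start is
    assumed to span midnight.
    """
    if shift_start <= shift_end:
        return shift_start <= now < shift_end
    return now >= shift_start or now < shift_end
```

Under this sketch, the voice of the worker would be transferred only while `within_shift` returns True and would be blocked at other times.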
In an exemplary embodiment of the invention, a communication device that implements use of system 100 can provide an option of toggling the use of system 100. Optionally, a non-registered user can disable the personalized voice detection function so that the non-registered user can use the communication device without system 100. Optionally, when a user is interested in sending tones or other sounds which should not be altered, system 100 can be toggled off. In an exemplary embodiment of the invention, system 100 is installed in the communication device and toggled directly on the device, for example using a switch or special key sequence. Alternatively, system 100 is installed in other parts of the network and can be toggled, for example, by transmitting a code (e.g. dialing a specific number), which is intercepted by system 100.
In an exemplary embodiment of the invention, a specific code can be associated with a list of one or more registered users. Optionally, by dialing the code before establishing a call or conference call, or as part of the establishment of the conference call, the voices of the registered users in the list will be transferred during the call and any other sound or voice will be suppressed.
Structure
In some embodiments of the invention, system 100 is implemented by a standard general purpose computer, for example a personal computer. Alternatively, system 100 can be implemented by a dedicated device. In some embodiments of the invention, system 100 is implemented in a single encasement at a single location. Alternatively, system 100 may be implemented by multiple pieces in a plurality of encasements and optionally, positioned at multiple locations, for example the database may be stationed at a remote site.
In some embodiments of the invention, system 100 may analyze the audio signal in parallel to other audio analysis systems which may deal with other aspects of analyzing audio signals, for example CODEC processing. Optionally, system 100 may provide analysis information to other audio systems, which may be connected sequentially or in any other way, to reduce the processing required for those systems.
In an exemplary embodiment of the invention, the idea presented hereinabove may be expanded from audio processing to image processing. Optionally, a video call system can register the face of one or more users and filter out the background during a call, for example if they are not interested in disclosing their location (e.g. not in their office). In an exemplary embodiment of the invention, the face can be registered by taking a picture of the user with a digital camera, for example a web cam or a camera built into a cellular telephone. Optionally, the system will learn identifying characteristics of the user's face as is commonly known in the art. Additionally, filtering out the background of a picture, leaving only the user's face and possibly body, is commonly practiced in the art. Optionally, the background may be replaced with a predetermined background that is selected by the user.
In an exemplary embodiment of the invention, the image processing system can be used to reduce bandwidth since the background is filtered out. Optionally, many of the options that are applicable to audio processing are also applicable to image processing, for example toggling the system on and off, and automatic registration (e.g. during the beginning of the call).
It should be appreciated that the above described methods and apparatus may be varied in many ways, including omitting or adding steps, changing the order of steps and the type of devices used. It should be appreciated that different features may be combined in different ways. In particular, not all the features shown above in a particular embodiment are necessary in every embodiment of the invention. Further combinations of the above features are also considered to be within the scope of some embodiments of the invention.
Section headings are provided for assistance in navigation and should not be considered as necessarily limiting the contents of the section.
It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather the scope of the present invention is defined only by the claims, which follow.
Claims (28)
1. A method of transferring to a receiver in real time content of segments of an audio signal transmission of a call, the method comprising:
receiving from a call an audio signal as a sequence of segments including segments that have user characteristics that were registered to identify voices of users and other segments that do not have registered user characteristics;
analyzing at least one segment of the received audio signal to determine if it contains voice activity;
determining a probability level that the voice activity of the analyzed segment is of a registered user according to the registered user characteristics; and
selectively transferring during the call the content of a segment to a receiver if the determined probability level is greater than a threshold value;
wherein the content of segments of the same call, for which the determined probability level is less than the threshold value, is suppressed completely or partially.
2. A method according to claim 1, wherein said segments are selected to comprise a single syllable.
3. A method according to claim 1, wherein said segments are selected to comprise an interval of the audio signal smaller than 0.1 seconds.
4. A method according to claim 1, wherein said selectively transferring comprises adjusting the quality level of the audio segment according to the determined probability level.
5. A method according to claim 4, wherein adjusting the quality level comprises raising or lowering the volume of the audio signal in the segment.
6. A method according to claim 1, wherein said selectively transferring comprises transferring previously blocked segments responsive to the determination for a consecutive segment.
7. A method according to claim 6, wherein said previously blocked segments are transferred at a higher rate than the standard transfer rate of segments.
8. A method according to claim 1, wherein said probability level of a segment is affected by the selective transfer of segments prior to the current segment.
9. A method according to claim 1, further comprising filtering out noise from each segment before analyzing the segment.
10. A method according to claim 1, further comprising filtering out noise from each segment after analyzing the segment.
11. A method according to claim 1, further comprising performing source separation to the signal in a segment creating multiple segments before analyzing the segment and analyzing the multiple segments independently.
12. A method according to claim 1, further comprising analyzing the strongest signal in a segment with multiple audio signals, while taking into account prior segments.
13. A method according to claim 1, further comprising inserting an audio signal into a segment to indicate amendments to the segment.
14. A method according to claim 1, wherein said transferring is through a communication system.
15. A method according to claim 1, wherein said characteristics comprise voice patterns of the user.
16. A method according to claim 1, wherein said characteristics comprise general information about the user.
17. A method according to claim 1, wherein said selectively transferring allows voice activity of any user.
18. A method according to claim 1, wherein said selectively transferring allows voice activity of a group of users with a common characteristic.
19. A method according to claim 1, wherein said selectively transferring reduces bandwidth of transmissions through a communication network.
20. A method according to claim 1, wherein said probability level becomes more accurate as the audio signal is processed.
21. A method according to claim 1 further comprising:
allowing a user to select a suppression level by which unwanted sounds are suppressed.
22. A system for transferring to a receiver in real time content of segments of an audio transmission of a call, the system comprising:
a processor to process data of the real time audio transmission and to control the system;
a memory to serve as a work area for said processor;
a channel interface to provide an audio signal for processing and to transfer the processed audio signal to a receiver;
wherein said system is adapted to:
receiving from a call an audio signal from the channel interface as a sequence of segments including segments that have user characteristics that were registered to identify voices of users and other segments that do not have registered user characteristics;
analyzing with the processor at least one segment of the received audio signal to determine if it contains voice activity;
determining a probability level that the voice activity of the analyzed segment is of a registered user according to the registered user characteristics;
and selectively transferring during the call the contents of a segment to the receiver if the determined probability level is greater than a threshold value;
wherein the contents of segments of the same call, for which the determined probability level is less than the threshold value, is suppressed completely or partially.
23. A system according to claim 22, wherein said system provides an indication at a user communication device if it is activated.
24. A system according to claim 23, wherein said indication gives indication if the user is recognized as a registered speaker.
25. A system according to claim 22 further comprising a database memory to store data provided to the system for processing by said processor.
26. A system according to claim 22 further comprising a user interface to communicate with the user.
27. A system according to claim 22, wherein said processor performs source separation to the signal in a segment creating multiple segments before analyzing the segment and analyzing the multiple segments independently.
28. A processor for transferring to a receiver in real time content of segments of an audio signal transmission of a call, the processor comprising:
an audio signal interface; and
circuitry operative to:
receive from a call through the audio signal interface an audio signal as a sequence of segments including segments that have user characteristics that were registered to identify voices of users and other segments that do not have registered user characteristics;
analyze at least one segment of the received audio signal to determine if it contains voice activity;
determine a probability level that the voice activity of the analyzed segment is of a registered user according to the registered user characteristics; and
selectively transfer through the audio signal interface during the call the content of a segment to a receiver if the determined probability level is greater than a threshold value;
wherein the content of segments of the same call, for which the determined probability level is less than the threshold value, is suppressed completely or partially.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/092,578 US8175874B2 (en) | 2005-11-17 | 2006-07-18 | Personalized voice activity detection |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US59721305P | 2005-11-17 | 2005-11-17 | |
PCT/IL2006/000831 WO2007057879A1 (en) | 2005-11-17 | 2006-07-18 | Personalized voice activity detection |
US12/092,578 US8175874B2 (en) | 2005-11-17 | 2006-07-18 | Personalized voice activity detection |
Publications (2)
Publication Number | Publication Date |
---|---|
US20080255842A1 (en) | 2008-10-16 |
US8175874B2 (en) | 2012-05-08 |
Family
ID=38048330
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/092,578 Active 2029-01-28 US8175874B2 (en) | 2005-11-17 | 2006-07-18 | Personalized voice activity detection |
Country Status (2)
Country | Link |
---|---|
US (1) | US8175874B2 (en) |
WO (1) | WO2007057879A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8646101B1 (en) * | 2005-02-11 | 2014-02-04 | Steven C. Millwee | Method and system of verifying and authenticating personal history |
US20150055764A1 (en) * | 2008-07-30 | 2015-02-26 | At&T Intellectual Property I, L.P. | Transparent voice registration and verification method and system |
US10522151B2 (en) | 2015-02-03 | 2019-12-31 | Dolby Laboratories Licensing Corporation | Conference segmentation based on conversational dynamics |
Families Citing this family (135)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8677377B2 (en) | 2005-09-08 | 2014-03-18 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US9318108B2 (en) | 2010-01-18 | 2016-04-19 | Apple Inc. | Intelligent automated assistant |
US7599475B2 (en) * | 2007-03-12 | 2009-10-06 | Nice Systems, Ltd. | Method and apparatus for generic analytics |
US10002189B2 (en) | 2007-12-20 | 2018-06-19 | Apple Inc. | Method and apparatus for searching using an active ontology |
US9330720B2 (en) | 2008-01-03 | 2016-05-03 | Apple Inc. | Methods and apparatus for altering audio output signals |
US8996376B2 (en) | 2008-04-05 | 2015-03-31 | Apple Inc. | Intelligent text-to-speech conversion |
US20100030549A1 (en) | 2008-07-31 | 2010-02-04 | Lee Michael M | Mobile device having human language translation capability with positional feedback |
US8676904B2 (en) | 2008-10-02 | 2014-03-18 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
KR101581883B1 (en) * | 2009-04-30 | 2016-01-11 | 삼성전자주식회사 | Appratus for detecting voice using motion information and method thereof |
WO2010126321A2 (en) * | 2009-04-30 | 2010-11-04 | 삼성전자주식회사 | Apparatus and method for user intention inference using multimodal information |
US10241644B2 (en) | 2011-06-03 | 2019-03-26 | Apple Inc. | Actionable reminder entries |
US10241752B2 (en) | 2011-09-30 | 2019-03-26 | Apple Inc. | Interface for a virtual digital assistant |
US9431006B2 (en) | 2009-07-02 | 2016-08-30 | Apple Inc. | Methods and apparatuses for automatic speech recognition |
US20110103370A1 (en) * | 2009-10-29 | 2011-05-05 | General Instruments Corporation | Call monitoring and hung call prevention |
US8682667B2 (en) | 2010-02-25 | 2014-03-25 | Apple Inc. | User profiling for selecting user specific voice input processing information |
KR101791444B1 (en) * | 2010-11-29 | 2017-10-30 | 뉘앙스 커뮤니케이션즈, 인코포레이티드 | Dynamic microphone signal mixer |
US9262612B2 (en) | 2011-03-21 | 2016-02-16 | Apple Inc. | Device access using voice authentication |
CN102781075B (en) * | 2011-05-12 | 2016-08-24 | 中兴通讯股份有限公司 | A kind of method reducing mobile terminal call power consumption and mobile terminal |
US10057736B2 (en) | 2011-06-03 | 2018-08-21 | Apple Inc. | Active transport based notifications |
EP2721609A1 (en) * | 2011-06-20 | 2014-04-23 | Agnitio S.L. | Identification of a local speaker |
US10134385B2 (en) | 2012-03-02 | 2018-11-20 | Apple Inc. | Systems and methods for name pronunciation |
US9280610B2 (en) | 2012-05-14 | 2016-03-08 | Apple Inc. | Crowd sourcing information to fulfill user requests |
US9721563B2 (en) | 2012-06-08 | 2017-08-01 | Apple Inc. | Name recognition system |
US9547647B2 (en) | 2012-09-19 | 2017-01-17 | Apple Inc. | Voice-based media searching |
US9837078B2 (en) | 2012-11-09 | 2017-12-05 | Mattersight Corporation | Methods and apparatus for identifying fraudulent callers |
WO2014197334A2 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
WO2014197335A1 (en) | 2013-06-08 | 2014-12-11 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
US10176167B2 (en) | 2013-06-09 | 2019-01-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
CN105264524B (en) | 2013-06-09 | 2019-08-02 | 苹果公司 | For realizing the equipment, method and graphic user interface of the session continuity of two or more examples across digital assistants |
US9595271B2 (en) * | 2013-06-27 | 2017-03-14 | Getgo, Inc. | Computer system employing speech recognition for detection of non-speech audio |
CN104217715B (en) * | 2013-08-12 | 2017-06-16 | 北京诺亚星云科技有限责任公司 | A kind of real-time voice sample testing method and system |
US10296160B2 (en) | 2013-12-06 | 2019-05-21 | Apple Inc. | Method for extracting salient dialog usage from live data |
US20150334720A1 (en) * | 2014-05-13 | 2015-11-19 | Shaul Simhi | Profile-Based Noise Reduction |
US10170123B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Intelligent assistant for home automation |
US9430463B2 (en) | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
EP3480811A1 (en) | 2014-05-30 | 2019-05-08 | Apple Inc. | Multi-command single utterance input method |
US9633004B2 (en) | 2014-05-30 | 2017-04-25 | Apple Inc. | Better resolution when referencing to concepts |
US10186282B2 (en) * | 2014-06-19 | 2019-01-22 | Apple Inc. | Robust end-pointing of speech signals using speaker recognition |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US9818400B2 (en) | 2014-09-11 | 2017-11-14 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10127911B2 (en) | 2014-09-30 | 2018-11-13 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US10074360B2 (en) | 2014-09-30 | 2018-09-11 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
US10152299B2 (en) | 2015-03-06 | 2018-12-11 | Apple Inc. | Reducing response latency of intelligent automated assistants |
US9721566B2 (en) | 2015-03-08 | 2017-08-01 | Apple Inc. | Competing devices responding to voice triggers |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US9886953B2 (en) | 2015-03-08 | 2018-02-06 | Apple Inc. | Virtual assistant activation |
US10083688B2 (en) | 2015-05-27 | 2018-09-25 | Apple Inc. | Device voice control for selecting a displayed affordance |
US9578173B2 (en) | 2015-06-05 | 2017-02-21 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US10048936B2 (en) | 2015-08-31 | 2018-08-14 | Roku, Inc. | Audio command interface for a multimedia device |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US11587559B2 (en) | 2015-09-30 | 2023-02-21 | Apple Inc. | Intelligent device identification |
US10370118B1 (en) | 2015-10-31 | 2019-08-06 | Simon Saito Nielsen | Lighting apparatus for remote controlled device |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US11322157B2 (en) | 2016-06-06 | 2022-05-03 | Cirrus Logic, Inc. | Voice user interface |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple Inc. | Intelligent automated assistant for media exploration |
DK179588B1 (en) | 2016-06-09 | 2019-02-22 | Apple Inc. | Intelligent automated assistant in a home environment |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10586535B2 (en) | 2016-06-10 | 2020-03-10 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
DK179415B1 (en) | 2016-06-11 | 2018-06-14 | Apple Inc. | Intelligent device arbitration and control |
DK179343B1 (en) | 2016-06-11 | 2018-05-14 | Apple Inc. | Intelligent task discovery |
DK201670540A1 (en) | 2016-06-11 | 2018-01-08 | Apple Inc. | Application integration with a digital assistant |
DK179049B1 (en) | 2016-06-11 | 2017-09-18 | Apple Inc. | Data driven natural language event detection and classification |
US11404056B1 (en) | 2016-06-30 | 2022-08-02 | Snap Inc. | Remoteless control of drone behavior |
KR102596430B1 (en) * | 2016-08-31 | 2023-10-31 | 삼성전자주식회사 | Method and apparatus for speech recognition based on speaker recognition |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
EP3364615B1 (en) * | 2017-02-17 | 2022-07-27 | Telefónica Germany GmbH & Co. OHG | Device and method for forwarding or routing speech frames in a transport network of a mobile communications system |
US10593351B2 (en) * | 2017-05-03 | 2020-03-17 | Ajit Arun Zadgaonkar | System and method for estimating hormone level and physiological conditions by analysing speech samples |
DK201770383A1 (en) | 2017-05-09 | 2018-12-14 | Apple Inc. | User interface for correcting recognition errors |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
DK201770439A1 (en) | 2017-05-11 | 2018-12-13 | Apple Inc. | Offline personal assistant |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | User-specific acoustic models |
DK201770428A1 (en) | 2017-05-12 | 2019-02-18 | Apple Inc. | Low-latency intelligent automated assistant |
DK201770431A1 (en) | 2017-05-15 | 2018-12-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
DK201770432A1 (en) | 2017-05-15 | 2018-12-21 | Apple Inc. | Hierarchical belief states for digital assistants |
DK179560B1 (en) | 2017-05-16 | 2019-02-18 | Apple Inc. | Far-field extension for digital assistant services |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
US10303715B2 (en) | 2017-05-16 | 2019-05-28 | Apple Inc. | Intelligent automated assistant for media exploration |
US10446138B2 (en) * | 2017-05-23 | 2019-10-15 | Verbit Software Ltd. | System and method for assessing audio files for transcription services |
US10522135B2 (en) | 2017-05-24 | 2019-12-31 | Verbit Software Ltd. | System and method for segmenting audio files for transcription |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
EP3425923B1 (en) * | 2017-07-06 | 2024-05-08 | GN Audio A/S | Headset with reduction of ambient noise |
US11348265B1 (en) | 2017-09-15 | 2022-05-31 | Snap Inc. | Computing a point cloud from stitched images |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US11753142B1 (en) | 2017-09-29 | 2023-09-12 | Snap Inc. | Noise modulation for unmanned aerial vehicles |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
US11531357B1 (en) | 2017-10-05 | 2022-12-20 | Snap Inc. | Spatial vector-based drone control |
US10403288B2 (en) * | 2017-10-17 | 2019-09-03 | Google Llc | Speaker diarization |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
US11037567B2 (en) | 2018-01-19 | 2021-06-15 | Sorenson Ip Holdings, Llc | Transcription of communications |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
US11822346B1 (en) | 2018-03-06 | 2023-11-21 | Snap Inc. | Systems and methods for estimating user intent to launch autonomous aerial vehicle |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US10388286B1 (en) * | 2018-03-20 | 2019-08-20 | Capital One Services, Llc | Systems and methods of sound-based fraud protection |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
DK180639B1 (en) | 2018-06-01 | 2021-11-04 | Apple Inc. | Disability of attention-attentive virtual assistant |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
DK201870355A1 (en) | 2018-06-01 | 2019-12-16 | Apple Inc. | Virtual assistant operation in multi-device environments |
DK179822B1 (en) | 2018-06-01 | 2019-07-12 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11076039B2 (en) | 2018-06-03 | 2021-07-27 | Apple Inc. | Accelerated task performance |
CN108922540B (en) * | 2018-07-27 | 2023-01-24 | 重庆柚瓣家科技有限公司 | Method and system for carrying out continuous AI (Artificial Intelligence) conversation with old people user |
US12071228B1 (en) * | 2019-03-28 | 2024-08-27 | Snap Inc. | Drone with propeller guard configured as an airfoil |
US11521643B2 (en) * | 2020-05-08 | 2022-12-06 | Bose Corporation | Wearable audio device with user own-voice recording |
CN112397093B (en) * | 2020-12-04 | 2024-02-27 | 中国联合网络通信集团有限公司 | Voice detection method and device |
JP2023045371A (en) * | 2021-09-22 | 2023-04-03 | 富士フイルムビジネスイノベーション株式会社 | Communication server and communication system |
US11972521B2 (en) | 2022-08-31 | 2024-04-30 | Snap Inc. | Multisensorial presentation of volumetric content |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4038503A (en) * | 1975-12-29 | 1977-07-26 | Dialog Systems, Inc. | Speech recognition apparatus |
US4328591A (en) * | 1979-04-23 | 1982-05-04 | Baghdady Elie J | Method and apparatus for signal detection, separation and suppression |
US4489435A (en) * | 1981-10-05 | 1984-12-18 | Exxon Corporation | Method and apparatus for continuous word string recognition |
US4720802A (en) * | 1983-07-26 | 1988-01-19 | Lear Siegler | Noise compensation arrangement |
US5572623A (en) * | 1992-10-21 | 1996-11-05 | Sextant Avionique | Method of speech detection |
US6006175A (en) * | 1996-02-06 | 1999-12-21 | The Regents Of The University Of California | Methods and apparatus for non-acoustic speech characterization and recognition |
US6011853A (en) * | 1995-10-05 | 2000-01-04 | Nokia Mobile Phones, Ltd. | Equalization of speech signal in mobile phone |
US6259916B1 (en) * | 1996-08-06 | 2001-07-10 | Conexant Systems, Inc. | Method and apparatus for minimizing perceptible impact on an interrupted call prior to hand-off |
US6285535B1 (en) * | 1998-02-23 | 2001-09-04 | Mitsubishi Materials Corporation | Surge absorber |
US20030061036A1 (en) * | 2001-05-17 | 2003-03-27 | Harinath Garudadri | System and method for transmitting speech activity in a distributed voice recognition system |
US6882971B2 (en) * | 2002-07-18 | 2005-04-19 | General Instrument Corporation | Method and apparatus for improving listener differentiation of talkers during a conference call |
US7231019B2 (en) * | 2004-02-12 | 2007-06-12 | Microsoft Corporation | Automatic identification of telephone callers based on voice characteristics |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5299198A (en) * | 1990-12-06 | 1994-03-29 | Hughes Aircraft Company | Method and apparatus for exploitation of voice inactivity to increase the capacity of a time division multiple access radio communications system |
JP2001289661A (en) * | 2000-04-07 | 2001-10-19 | Alpine Electronics Inc | Navigator |
- 2006
  - 2006-07-18: US application US12/092,578, patent US8175874B2 (en), status: Active
  - 2006-07-18: WO application PCT/IL2006/000831, patent WO2007057879A1 (en), status: Application Filing
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8646101B1 (en) * | 2005-02-11 | 2014-02-04 | Steven C. Millwee | Method and system of verifying and authenticating personal history |
US9183363B1 (en) | 2005-02-11 | 2015-11-10 | Ireviewnow Llc | Method and system of verifying and authenticating consumer reporting history |
US20150055764A1 (en) * | 2008-07-30 | 2015-02-26 | At&T Intellectual Property I, L.P. | Transparent voice registration and verification method and system |
US9369577B2 (en) * | 2008-07-30 | 2016-06-14 | Interactions Llc | Transparent voice registration and verification method and system |
US10522151B2 (en) | 2015-02-03 | 2019-12-31 | Dolby Laboratories Licensing Corporation | Conference segmentation based on conversational dynamics |
Also Published As
Publication number | Publication date |
---|---|
WO2007057879A1 (en) | 2007-05-24 |
US20080255842A1 (en) | 2008-10-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8175874B2 (en) | | Personalized voice activity detection |
USRE45289E1 (en) | | Selective noise/channel/coding models and recognizers for automatic speech recognition |
US8606573B2 (en) | | Voice recognition improved accuracy in mobile environments |
US5594784A (en) | | Apparatus and method for transparent telephony utilizing speech-based signaling for initiating and handling calls |
US20090248411A1 (en) | | Front-End Noise Reduction for Speech Recognition Engine |
JP5137376B2 (en) | | Two-way telephony trainer and exerciser |
US20070263823A1 (en) | | Automatic participant placement in conferencing |
US8731940B2 (en) | | Method of controlling a system and signal processing system |
US8762138B2 (en) | | Method of editing a noise-database and computer device |
US9661139B2 (en) | | Conversation detection in an ambient telephony system |
US9135928B2 (en) | | Audio transmission channel quality assessment |
JP2004133403A (en) | | Sound signal processing apparatus |
JP2001075580A (en) | | Method and device for voice recognition |
CN111199751B (en) | | Microphone shielding method and device and electronic equipment |
US7970115B1 (en) | | Assisted discrimination of similar sounding speakers |
CN110265038B (en) | | Processing method and electronic equipment |
JP2009027239A (en) | | Telecommunication conference apparatus |
US9398150B2 (en) | | Method of setting detection parameters in an apparatus for on hold music detection |
Harma et al. | | Conversation detection in ambient telephony |
CN113923395A (en) | | Method, equipment and storage medium for improving conference quality |
JP2019176412A (en) | | Communication processing device, program, and method |
JP2003060792A (en) | | Device for recording and reproducing a plurality of voices |
US20150334720A1 (en) | | Profile-Based Noise Reduction |
JP2005123869A (en) | | System and method for dictating call content |
Gallardo et al. | | Transmission channel effects on human speaker identification in multiparty conference calls |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| CC | Certificate of correction | |
| FPAY | Fee payment | Year of fee payment: 4 |
| MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2552); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY; Year of fee payment: 8 |
| MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2553); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY; Year of fee payment: 12 |