WO2014096506A1 - Method, apparatus, and computer program product for personalizing speech recognition - Google Patents
- Publication number: WO2014096506A1 (application PCT/FI2012/051285)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech recognition
- recognition model
- speech
- received
- model
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
Description
- An example embodiment of the present invention relates generally to speech recognition, and more particularly, to a method, apparatus and computer program product for personalizing speech recognition.
- Speech recognition may be used to control a wide variety of devices, such as wireless phones, cars, household appliances, and other devices used in everyday life or work.
- Speech recognition, which may be referred to as automatic speech recognition (ASR), may employ a speech recognition model (SRM) comprising an acoustic model (basic sound units) and a language model (words).
- The acoustic models and language models can be fused together, or otherwise may be combined.
- These SRMs are the building blocks for words and strings of words, such as phrases or sentences, and are used by a device to process speech input (e.g., recognize the speech input and derive a machine-readable interpretation).
- a speech recognition processor may receive speech samples and then may match those samples with the basic sound units in the acoustic model.
- the speech recognition processor then may, for example, calculate the most likely words from the SRM based on the matched basic sound units, such as by using Hidden Markov Models (HMMs) and/or dynamic time warping (DTW).
- HMM and DTW are examples of statistical models that describe speech patterns probabilistically.
- Various neural networks (NNs) and/or finite state transducers (FSTs) may also be used as SRMs.
- Other suitable models can also be used as SRMs.
- In DTW, an unknown speech pattern is compared with known reference patterns.
- The speech pattern is divided into several frames, and the local distance between the speech segment in each frame and the corresponding speech segment of the reference pattern is calculated; this distance is thus a numerical value for the differences found in the comparison. For speech segments close to each other, a smaller distance is usually obtained than for speech segments further from each other. On the basis of the local distances obtained this way, a minimum path between the beginning and end points of the word is sought by using a DTW algorithm. Thus, by DTW, a distance is obtained between the uttered word and the reference word.
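- As a rough illustrative sketch (not part of the patent text), the frame-by-frame distance accumulation described above can be written as follows; scalar frames stand in for the multi-dimensional acoustic feature vectors a real recognizer would use, and the reference patterns are hypothetical:

```python
import numpy as np

def dtw_distance(pattern, reference):
    """Accumulate local distances along a minimum path between an
    uttered pattern and a reference pattern (classic DTW recursion)."""
    n, m = len(pattern), len(reference)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance between a frame of the speech pattern and
            # the corresponding segment of the reference pattern.
            local = abs(pattern[i - 1] - reference[j - 1])
            # Extend the cheapest of the three admissible path steps.
            cost[i, j] = local + min(cost[i - 1, j],
                                     cost[i, j - 1],
                                     cost[i - 1, j - 1])
    return cost[n, m]

# The reference word with the smallest DTW distance is the best match.
references = {"brew": [1.0, 3.0, 4.0, 2.0], "grind": [2.0, 2.5, 1.0, 0.5]}
utterance = [1.1, 2.9, 4.2, 1.8]
best = min(references, key=lambda w: dtw_distance(utterance, references[w]))
```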
- an HMM model is first formed for each word to be recognized (e.g. for each reference word).
- an observation probability is calculated for each HMM model in the memory, and as the recognition result, a counterpart word is obtained for the HMM model with the greatest observation probability.
- For each reference word, the probability is calculated that it is the word uttered by the speaker.
- the above-mentioned observation probability describes the resemblance of the received speech pattern and the closest HMM model (e.g. the closest reference speech pattern).
- The reference words, or word candidates, can be further weighted by the language models.
- the recognition process can occur in a single pass- through mode with fused acoustic models and language models.
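- For illustration only (the patent does not prescribe a particular algorithm), the observation probability of a speech pattern under one word HMM can be computed with the standard forward recursion, and the word whose model scores highest is returned as the recognition result; all model parameters below are hypothetical:

```python
import numpy as np

def observation_probability(obs, start, trans, emit):
    """Forward algorithm: P(observation sequence | word HMM), where
    start[i] is the initial probability of state i, trans[i, j] the
    transition probability, and emit[i, o] the emission probability."""
    alpha = start * emit[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
    return alpha.sum()

# One toy two-state HMM per reference word (hypothetical numbers).
models = {
    "yes": (np.array([0.6, 0.4]),
            np.array([[0.7, 0.3], [0.4, 0.6]]),
            np.array([[0.9, 0.1], [0.2, 0.8]])),
    "no":  (np.array([0.5, 0.5]),
            np.array([[0.5, 0.5], [0.5, 0.5]]),
            np.array([[0.3, 0.7], [0.8, 0.2]])),
}
observed = [0, 0, 1]  # quantized acoustic symbols
recognized = max(models, key=lambda w: observation_probability(observed, *models[w]))
```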
- In an NN, interconnected data nodes store information regarding speech patterns.
- the nodes of the NN may be used to classify phonetic features of speech input, and may be configured so as to focus on portions of the model that may be most valuable in distinguishing words during speech recognition processes.
- A well-designed NN will therefore minimize, in some examples, the processing time required to recognize speech inputs.
- NNs are particularly well suited for training of larger data sets, such as data sets representing natural language.
- In an FST, speech inputs may be processed, various operations may be performed on the speech input, and a most probable output (e.g., a recognized word) may be selected.
- FSTs may be particularly beneficial, in some examples, in phonological analysis. The reusability and flexibility of algorithms performed on FSTs make FSTs particularly useful in combining various SRMs or portions thereof.
- An SRM may therefore incorporate speech recognition data from various sources, apply weights to the speech recognition data, and generate weighted FSTs for use in speech recognition tasks.
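- A minimal sketch of a weighted transducer (an assumption for illustration, not the patent's implementation): arcs carry an input symbol, an output word, and a weight, and the lowest-weight accepting path over a phone sequence yields the recognized words:

```python
from heapq import heappush, heappop

# Weighted FST as an arc list: state -> [(input, output, weight, next_state)].
# Hypothetical phone-to-word arcs; weights are negative log probabilities.
arcs = {
    0: [("g", "", 0.1, 1), ("b", "", 0.2, 3)],
    1: [("r", "", 0.1, 2)],
    2: [("aind", "grind", 0.3, 4)],
    3: [("ru", "brew", 0.2, 4)],
}
FINAL = 4

def best_path(phones):
    """Dijkstra-style search for the lowest-weight accepting path."""
    heap = [(0.0, 0, 0, [])]  # (weight, state, input position, outputs)
    while heap:
        w, state, pos, out = heappop(heap)
        if state == FINAL and pos == len(phones):
            return out, w
        for label, word, arc_w, nxt in arcs.get(state, []):
            if pos < len(phones) and phones[pos] == label:
                new_out = out + [word] if word else out
                heappush(heap, (w + arc_w, nxt, pos + 1, new_out))
    return None, float("inf")

words, weight = best_path(["g", "r", "aind"])  # -> (["grind"], 0.5)
```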
- the various types of SRMs may include speaker independent SRMs and speaker dependent SRMs.
- Speaker independent SRMs may comprise averages of language and acoustic models collected from a large sample of users.
- A speaker dependent SRM may be specific to the user and may be adapted by the user through training. Initial training may be performed during a first use of the SRM, and training may continue during normal use of the SRM.
- a speaker dependent SRM comprises unique sets of electronic characteristics for the acoustic model and a unique language model for the words formed from combinations of unique basic sound units.
- SRMs used by a device to process speech input may therefore rely on any combination of the HMM, DTW, NN, FST, and other models, as well as a blend of speaker dependent SRMs and speaker independent SRMs.
- a method, apparatus, and computer program product are provided for personalizing a speech recognition model (SRM).
- A method is provided for receiving at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely and is adaptable by one or more user terminals to process input speech; accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model; and adapting the speech recognition model based on terminal dependent data.
- the method may further include processing received speech input using the speech recognition model, and generating a textual output.
- the method may further include receiving a speech input, and refining a speaker dependent speech recognition model based on the speech input.
- the method may further include verifying or correcting a processing of the speech input, wherein refining the speaker dependent speech recognition model is further based on the verification or correction.
- the method may further include causing transmission of at least one portion of the speaker dependent speech recognition model to a remote storage location.
- the terminal dependent data may comprise microphone information and/or a context.
- the received at least one portion of a speech recognition model is received based on one of at least an individual user, group of users, geographic location, or a dialect.
- the received at least one portion of a speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
- An additional method is provided including receiving at least one portion of a speaker dependent speech recognition model from a user terminal and generating at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of a speech recognition model is adaptable by one or more user terminals.
- the method may further include causing transmission of the at least one additional portion of the speech recognition model to an additional user terminal.
- Generating the at least one additional portion of the speech recognition model is further based on one of at least an individual user, group of users, geographic location, or a dialect.
- The at least one additional portion of the speech recognition model may be based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
- An apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the processor, cause the apparatus to at least receive at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech, access a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model, and adapt the speech recognition model based on terminal dependent data.
- An additional apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the processor, cause the apparatus to at least receive at least one portion of a speaker dependent speech recognition model from a user terminal, and generate at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of the speech recognition model is adaptable by one or more user terminals.
- a computer program product comprising at least one non-transitory computer- readable storage medium having computer-executable program code instructions stored therein, the computer-executable program code instructions comprising program code instructions to receive at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech, access a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model, and adapt the speech recognition model based on terminal dependent data.
- An additional computer program product comprising at least one non-transitory computer-readable storage medium having computer-executable program code instructions stored therein, the computer-executable program code instructions comprising program code instructions to receive at least one portion of a speaker dependent speech recognition model from a user terminal, generate at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of a speech recognition model is adaptable by one or more user terminals.
- An apparatus comprising means for receiving at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech, accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model, and adapting the speech recognition model based on terminal dependent data.
- An additional apparatus comprising means for receiving at least one portion of a speaker dependent speech recognition model from a user terminal, and generating at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of the speech recognition model is adaptable by one or more user terminals.
- Figure 1 is a block diagram of a personalized speech recognition apparatus in communication with user terminals which may be configured to implement example embodiments of the present invention;
- Figure 2 is a flowchart illustrating operations to receive and adapt an SRM on a user terminal, in accordance with one embodiment of the present invention;
- Figure 3 is a flowchart illustrating operations to transmit an SRM, receive a speaker dependent SRM, and generate an additional SRM, using a speech personalization apparatus, in accordance with one embodiment of the present invention; and
- Figure 4 is a display for training an SRM, in accordance with one embodiment of the present invention.
- circuitry refers to (a) hardware-only circuit implementations (e.g., implementations in analog circuitry and/or digital circuitry); (b) combinations of circuits and computer program product(s) comprising software and/or firmware instructions stored on one or more computer readable memories that work together to cause an apparatus to perform one or more functions described herein; and (c) circuits, such as, for example, a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation even if the software or firmware is not physically present.
- This definition of 'circuitry' applies to all uses of this term herein, including in any claims.
- the term 'circuitry' also includes an implementation comprising one or more processors and/or portion(s) thereof and accompanying software and/or firmware.
- the term 'circuitry' as used herein also includes, for example, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, other network device, and/or other computing device.
- personalized speech recognition apparatus 102 may include or otherwise be in communication with processor 20, user interface 22, communication interface 24, memory device 26, and speech personalization administrator 28.
- Personalized speech recognition apparatus 102 may be embodied by a wide variety of devices including mobile terminals, e.g., mobile telephones, smartphones, tablet computers, laptop computers, or the like, computers, workstations, servers or the like and may be implemented as a distributed system or a cloud based entity.
- The personalized speech recognition apparatus 102 may receive and/or transmit SRMs, as well as generate additional SRMs that may be adaptable by one or more user terminals.
- An SRM is a statistical model that describes speech patterns probabilistically, and may include a language model (words) and an acoustic model (basic sound units).
- Example SRMs include the HMM, DTW, NN, and FST models.
- An SRM may be provided to a user terminal to enable speech recognition capabilities (e.g., processing of input speech) on the user terminal.
- transmittal of an SRM may include transmittal of a portion of the SRM, since an SRM in its entirety may be too large for practical transmission (and an SRM portion may also be considered an SRM).
- The SRM portion may be incorporable into an SRM, so that the portion may then be incorporated with another portion of an SRM to provide a complete or fully functioning SRM. It will therefore be appreciated that any reference to an SRM herein may indicate a portion or portions of an SRM, but for simplicity is referred to as an SRM.
- the SRMs may incorporate speaker independent data, speaker dependent data, and/or terminal dependent data.
- the speaker independent data may include averaged, normalized, or otherwise consolidated language and acoustic models collected from a large sample of users.
- the speaker dependent data may alternatively be biased toward a particular individual, or group of users, such as a group of users speaking a particular language or dialect, or from a particular geographic region.
- the speaker dependent data may be generated and/or refined on a user terminal by training the SRM.
- the speaker dependent data may be generated or refined on one or more user terminals and/or devices, such that it may be shared, via the personalized speech recognition apparatus 102, between the one or more user terminals and/or devices.
- training may include, but is not limited to, providing speech input to the user terminal, potentially updating and/or verifying the processing of the speech input, and updating the SRM accordingly.
- the training may include the explicit dictation of special training data by a speaker, and/or implicit training through the general use of the user terminal.
- Various models, such as an HMM, may be constructed for each speaker dependent SRM to be stored.
- a speaker dependent SRM incorporating the speaker dependent data may be communicated from the user terminal to the personalized speech recognition apparatus 102.
- The terminal dependent data may include information regarding the user terminal itself, such as characteristics of the microphone on the user terminal to capture the speech input, and/or a context of the user terminal (e.g., an environment the device is commonly used in, or the intended purpose of the device), or any settings of the user terminal 110A that could impact the processing of speech input.
- An SRM received from the personalized speech recognition apparatus 102 may be adapted on the user terminal based on the terminal dependent data, so that the particular user terminal may more accurately process speech inputs.
- Speaker dependent SRMs including speaker dependent data may be stored on personalized speech recognition apparatus 102.
- The speaker dependent SRM, or a portion thereof, may be further modified and/or transmitted to another device to allow that device to benefit from the speaker dependent data, thereby improving the probability of successful speech recognition on another user terminal.
- one or more user terminals may access or otherwise download the speaker dependent model for the purposes of providing personalized speech recognition.
- the personalized speech recognition apparatus 102 may receive updates to the speaker dependent model.
- the personalized speech recognition apparatus 102 may therefore further tune or otherwise modify the speaker dependent model.
- the processor 20 (and/or co-processors or any other processing circuitry assisting or otherwise associated with the processor 20) may be in communication with the memory device 26 via a bus for passing information among components of the personalized speech recognition apparatus 102.
- the memory device 26 may include, for example, one or more volatile and/or non-volatile memories.
- the memory device 26 may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processor 20).
- the memory device 26 may be configured to store information, data, content, applications, instructions, or the like for enabling the apparatus to carry out various functions in accordance with an example embodiment of the present invention.
- the memory device 26 could be configured to store various SRMs, including speaker independent and speaker dependent portions.
- the speaker dependent data may be associated with a particular user or group of users, enabling the processor 20 to identify and provide appropriate SRMs to various devices.
- the memory device 26 could be configured to buffer input data for processing by the processor 20, and/or to store instructions for execution by the processor 20.
- the personalized speech recognition apparatus 102 may, in some embodiments, be embodied in various devices as described above. However, in some embodiments, the personalized speech recognition apparatus 102 may be embodied as a chip or chip set.
- the personalized speech recognition apparatus 102 may comprise one or more physical packages (e.g., chips) including materials, components and/or wires on a structural assembly (e.g., a baseboard).
- the structural assembly may provide physical strength, conservation of size, and/or limitation of electrical interaction for component circuitry included thereon.
- The personalized speech recognition apparatus 102 may therefore, in some cases, be configured to implement an embodiment of the present invention on a single chip or as a single "system on a chip."
- a chip or chipset may constitute means for performing one or more operations described herein for personalizing speech recognition in devices.
- the processor 20 may be embodied in a number of different ways.
- The processor 20 may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing element with or without an accompanying DSP, or various other processing circuitry including integrated circuits such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like.
- the processor 20 may include one or more processing cores configured to perform independently.
- a multi-core processor may enable multiprocessing within a single physical package.
- the processor 20 may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading.
- the processor 20 may be configured to execute instructions stored in the memory device 26 or otherwise accessible to the processor 20.
- such instructions may provide for the retrieval, transmittal, and/or processing of SRMs, including generating additional SRMs based on received updated speaker dependent SRMs.
- the processor 20 may be configured to execute hard coded functionality.
- the processor 20 may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present invention, such as the personalization of SRMs.
- When the processor 20 is embodied as an ASIC, FPGA, or the like, the processor 20 may be specifically configured hardware for conducting the operations described herein.
- When the processor 20 is embodied as an executor of software instructions, the instructions may specifically configure the processor 20 to perform the algorithms and/or operations described herein when the instructions are executed.
- the processor 20 may be a processor of a specific device (e.g., a user terminal or network entity) configured to employ an embodiment of the present invention by further configuration of the processor 20 by instructions for performing the algorithms and/or operations described herein.
- the processor 20 may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processor 20.
- the communication interface 24 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data from/to a network and/or any other device or module in communication with the personalized speech recognition apparatus 102.
- the communication interface 24 may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network, for transmitting and receiving SRMs to and from remote devices. Additionally or alternatively, the communication interface 24 may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s).
- the communication interface 24 may alternatively or also support wired communication.
- the communication interface 24 may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms.
- the personalized speech recognition apparatus 102 may include a user interface 22 that may, in turn, be in communication with the processor 20 to receive an indication of a user input and/or to cause provision of an audible, visual, mechanical or other output to the user.
- The user interface 22 may include, for example, a keyboard, a mouse, a display, a touch screen(s), touch areas, soft keys, a microphone, a speaker, or other input/output mechanisms.
- the processor 20 may comprise user interface circuitry configured to control at least some functions of one or more user interface elements such as, for example, a speaker, ringer, microphone, display, and/or the like.
- the processor 20 and/or user interface circuitry comprising the processor 20 may be configured to control one or more functions of one or more user interface elements through computer program instructions (e.g., software and/or firmware) stored on a memory accessible to the processor 20 (e.g., memory device 26, and/or the like).
- processor 20 may be embodied as, include, or otherwise control a speech personalization administrator 28 for providing personalized speech recognition.
- the speech personalization administrator 28 may be embodied as various means, such as circuitry, hardware, a computer program product comprising computer readable program instructions stored on a computer readable medium (for example, memory device 26) and executed by a processing device (for example, processor 20), or some combination thereof.
- Speech personalization administrator 28 may be capable of communication with one or more of the processor 20, memory device 26, user interface 22, and communication interface 24. As such, the speech personalization administrator 28 may be configured to generate additional SRMs, adaptable by a variety of user terminals and that may be based on speaker dependent SRMs, as described above and in further detail hereinafter.
- User terminal 110 may be embodied as a mobile terminal, such as personal digital assistants (PDAs), pagers, mobile televisions, mobile telephones, gaming devices, laptop computers, tablet computers, cameras, camera phones, video recorders, audio/video players, radios, global positioning system (GPS) devices, navigation devices, or any combination of the aforementioned, and other types of devices capable of providing speech recognition.
- the user terminal 110 need not necessarily be embodied by a mobile device and, instead, may be embodied in a fixed device, such as a computer, workstation, or home appliance, such as a coffee maker. Additionally or alternatively, user terminal(s) 110 may be embodied in a vehicle, or any other machine or device capable of processing voice commands.
- User terminal 110A is illustrated in further detail, but it will be appreciated that any of the user terminals 110, such as user terminal 110B, may be configured as illustrated in and described with respect to user terminal 110A.
- The user terminal 110 may therefore include or otherwise be in communication with processor 120, user interface 122, communication interface 124, and memory device 126.
- The processor 120 (and/or co-processors or any other processing circuitry assisting or otherwise associated with the processor 120) may be in communication with the memory device 126 via a bus for passing information among components of the user terminal 110.
- the memory device 126 may include, for example, one or more volatile and/or non-volatile memories.
- the memory device 126 may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processor 120).
- the memory device 126 may be configured to store information, data, content, applications, instructions, or the like for enabling the user terminal to carry out various functions in accordance with an example embodiment of the present invention.
- The memory device 126 could be configured to store SRMs, instructions for adapting SRMs with terminal dependent data, and instructions for training SRMs with speaker dependent data. Memory device 126 may therefore buffer input data for processing by the processor 120. Additionally or alternatively, the memory device 126 could be configured to store instructions for execution by the processor 120.
- the processor 120 may be embodied in a number of different ways.
- The processor 120 may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a DSP, a processing element with or without an accompanying DSP, or various other processing circuitry including integrated circuits such as, for example, an ASIC, an FPGA, an MCU, a hardware accelerator, a special-purpose computer chip, or the like.
- the processor 120 may include one or more processing cores configured to perform independently.
- a multi-core processor may enable multiprocessing within a single physical package.
- the processor 120 may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading.
- the processor 120 may be configured to execute instructions stored in the memory device 126 or otherwise accessible to the processor 120.
- the processor 120 may be configured to adapt an SRM advantageously to the user terminal, based on terminal dependent data, such as microphone information and context, so that the SRM may account for variances across user terminals.
- the user terminal(s) 110 may include means, such as a processor 120, for training the SRM with speech input, to generate and/or refine a speaker dependent SRM that may improve speech input processing on the user terminal (and subsequently, other user terminals).
- the processor 120 may be configured to execute hard coded functionality.
- the processor 120 may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present invention while configured accordingly.
- When the processor 120 is embodied as an ASIC, FPGA, or the like, the processor 120 may be specifically configured hardware for conducting the operations described herein.
- When the processor 120 is embodied as an executor of software instructions, the instructions may specifically configure the processor 120 to perform the algorithms and/or operations, such as adaptation and training of SRMs, and processing of speech input, such as by using the SRMs, for conversion to text, when the instructions are executed.
- the processor 120 may be a processor of a specific device (e.g., a mobile terminal or network entity) configured to employ an embodiment of the present invention by further configuration of the processor 120 by instructions for performing the algorithms and/or operations described herein.
- the processor 120 may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processor 120.
- the communication interface 124 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data from/to a network and/or any other device or module in communication with the user terminal 110.
- the communication interface 124 may be specifically configured for transmitting and receiving SRMs to and from the personalized speech recognition apparatus 102.
- the communication interface 124 may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network. Additionally or alternatively, the communication interface 124 may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s).
- the communication interface 124 may alternatively or also support wired communication for communication of SRMs.
- the communication interface 124 may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms.
- The user terminal 110 may include a user interface 122 that may, in turn, be in communication with the processor 120 to receive an indication of a user input and/or to cause provision of an audible, visual, mechanical or other output to the user.
- the user interface 122 may include, for example, a keyboard, a mouse, a display, a touch screen(s), touch areas, soft keys, a microphone, a speaker, or other input/output mechanisms.
- the user interface 122 may therefore be configured to receive speech input, such as, via a microphone, for the purposes of speech recognition and/or training of an SRM.
- the processor 120 may comprise user interface circuitry configured to control at least some functions of one or more user interface elements such as, for example, a speaker, ringer, microphone, display, and/or the like.
- the processor 120 and/or user interface circuitry comprising the processor 120 may be configured to control one or more functions of one or more user interface elements through computer program instructions (e.g., software and/or firmware) stored on a memory accessible to the processor 120 (e.g., memory device 126, and/or the like).
- Network 100 may be embodied in a local area network, the Internet, any other form of a network, or in any combination thereof, including proprietary private and semi-private networks and public networks.
- the network 100 may comprise a wire line network, wireless network (e.g., a cellular network, wireless local area network, wireless wide area network, some combination thereof, or the like), or a combination thereof, and in some example embodiments comprises at least a portion of the Internet.
- the network 100 may be used for transmitting speaker dependent data and/or SRMs to and from devices.
- A user terminal 110 may be directly coupled to and/or may include a personalized speech recognition apparatus 102.
- Referring to Figure 2, the operations for receiving and adapting an SRM on a user terminal are outlined in accordance with one example embodiment of the present invention.
- The operations of Figure 2 may be performed by the user terminal 110A, user terminal 110B, and/or the like, for example.
- The user terminal 110A may include means, such as the processor 120, communication interface 124, or the like, for receiving at least one portion of an SRM, wherein the at least one portion of an SRM is stored remotely and is adaptable by one or more user terminals to process input speech.
- The user terminal 110A may receive at least one portion of an SRM from the personalized speech recognition apparatus 102, for example, including any combination of the HMM, DTW, NN, and FST models, as described above.
- The at least one portion of an SRM may also include any combination of speaker independent data and/or speaker dependent data, and may be adaptable by the user terminal 110A to process speech input (e.g., perform speech recognition tasks).
- the adaptation is described in further detail with respect to operation 210.
- A user of user terminal 110A may provide logon credentials or the like, via user interface 122, communication interface 124, and/or network 100 to the personalized speech recognition apparatus 102.
- the user terminal 110A may check for updates by communicating with the personalized speech recognition apparatus 102, and receive an SRM or portion thereof if an update is available.
- An update may be available, for example, if a user updated the SRM, based on training, verification, or the like, on another device, such as user terminal 110B.
- The user terminal 110A may download an SRM or portion thereof for the first time (such as during initial device setup, or factory reset), or the newly received SRM or portion thereof may include updates compared to a previous version used by user terminal 110A.
- Receipt of the SRM or portion thereof by the user terminal 110A may occur during scheduled update routines that may be unobtrusive to or unnoticed by a user. That is, the synchronization may occur seamlessly as a background system update.
- A request for an SRM or portion thereof may be explicitly initiated on the user terminal 110A (such as by logging onto the personalized speech recognition apparatus 102 and requesting an update).
- Alternatively, an update may be initiated by the personalized speech recognition apparatus 102. For example, a user may be automatically notified that an update is available, such as by Short Message Service (SMS), so as to confirm that they would like to receive the at least one portion of an SRM on the user terminal 110A.
- The user terminal 110A may therefore receive at least one portion of an SRM associated with the individual user (such as identified with the logon credentials). Additionally or alternatively, the SRM or portion thereof may be identified by the personalized speech recognition apparatus by other means. For example, a user of a device may provide a geographic location, via a Global Positioning System (GPS) device and/or manual indication of a location, for example. The user terminal 110A may therefore receive an SRM based on a geographic location and/or dialect. Having received at least one portion of an SRM, as described with respect to operation 200, the user terminal 110A may include means, such as the processor 120, for accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model.
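- As a sketch of how such an update check might look on the user terminal (the endpoint, query parameters, and payload shape are assumptions for illustration; the patent does not define a protocol):

```python
import json
import urllib.request

def check_for_srm_update(base_url, user_id, current_version, location=None):
    """Ask a remote personalized speech recognition apparatus whether a
    newer SRM portion is available for this user and, optionally, this
    geographic location; returns the portion, or None if up to date."""
    query = f"{base_url}/srm?user={user_id}&version={current_version}"
    if location:
        query += f"&location={location}"
    with urllib.request.urlopen(query) as resp:
        payload = json.load(resp)
    # Hypothetical payload: {"update": true, "srm_portion": {...}}
    return payload.get("srm_portion") if payload.get("update") else None
```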
- the received at least one portion of an SRM may be a complete SRM, and may therefore be stored on memory device 126, and accessed by the processor 120.
- the processor 120 may incorporate the at least one portion of an SRM to form a complete SRM.
- the SRM may be stored and accessed on memory device 126, for example.
- The user terminal 110A may include means, such as the processor 120, for adapting the SRM based on terminal dependent data.
- The terminal dependent data may include information regarding the user terminal 110A itself, such as characteristics of the microphone on the user terminal 110A used to capture the speech input, and/or a context of the user terminal (e.g., an environment the device is commonly used in, or the intended purpose of the user terminal), or any settings of the user terminal 110A that could impact the processing of speech input.
- the processor 120 may therefore utilize the terminal dependent data in adapting the SRM for use on the user terminal 110A.
- microphone information may be retrieved from memory device 126, or read from a microphone component of user interface 122 by processor 120, for example.
- the microphone information may include any information relating to the microphone that may impact how speech input is recognized and/or processed according to the SRM.
- the microphone information may comprise a microphone model identifier, or orientation of the microphone within the device.
- the microphone may additionally or alternatively be characterized by its transduction type, such as condenser and/or dynamic, for example.
- The user terminal 110A, using the processor 120, may therefore adapt the SRM according to microphone information to account for acoustic, phonetic, and/or other variances between microphones. For example, calculations in a DTW model may be consistently modified throughout, so that the user terminal 110A may accurately interpret sounds captured by the microphone.
- The user terminal 110A may adapt the SRM based on the context of the user terminal.
- Use of an SRM by a speaker phone in a vehicle may be subject to background noise, such as wind, and/or radio or other device interference.
- The processor 120 of user terminal 110A may therefore adapt the received SRM, which in its previous state may not have accounted for such background noises, accordingly.
- Information regarding the context or use of the user terminal 110A may be explicitly retrieved from memory device 126, for example, and/or derived from various components of the user terminal 110A, allowing processor 120 to adapt the SRM based on what contexts the user terminal 110A will most likely be used in.
- Settings configuring various components of the user terminal 110A may be considered by the processor 120 in adapting the SRM for the user terminal 110A.
- The settings may affect the adaptation of the SRM, and/or cause the processor 120 to adjust the settings of the user terminal 110A to tailor the device for use of the SRM.
- An adapted SRM may be stored on memory device 126, for example.
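- One possible shape for the terminal dependent data and a simple adaptation step is sketched below (both are assumptions for illustration; the patent leaves the concrete adaptation open), consistently rescaling acoustic frames so that DTW local distances remain comparable across microphones:

```python
from dataclasses import dataclass

@dataclass
class TerminalDependentData:
    """Hypothetical container for the terminal dependent data named above."""
    microphone_model: str
    transduction_type: str  # e.g., "condenser" or "dynamic"
    gain_offset_db: float   # assumed per-microphone calibration value
    context: str            # e.g., "vehicle", "kitchen"

def adapt_frames(frames, terminal: TerminalDependentData):
    """Consistently modify the values entering the DTW calculations so
    that sounds captured by this particular microphone are interpreted
    on the same scale as the reference patterns."""
    scale = 10 ** (terminal.gain_offset_db / 20)  # dB to linear amplitude
    return [f * scale for f in frames]
```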
- The user terminal 110A may include means, such as the user interface 122, communication interface 124, and/or processor 120, for receiving a speech input.
- The speech input may be provided by a user to user terminal 110A by using a microphone of user interface 122, for example.
- The user terminal 110A may receive a speech input through everyday use of the user terminal and may process the speech to generate text.
- the user terminal 110A may process received speech input using the SRM, and generate a textual output.
- the processor 120 may process the speech input according to the SRM.
- The processor 120 may calculate an observation probability for the speech input based on the SRM that includes one or more HMM, DTW, NN, or FST models, for example.
- the processor 120 may identify a reference word with the highest probability when compared to other reference words, a threshold or the like. Based on those probabilities, the processor may then select or otherwise generate the speech recognition result (e.g. a text output).
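- A minimal sketch of that selection step (the threshold value is illustrative; the patent only requires comparison against other reference words, a threshold, or the like):

```python
def recognize(word_probabilities, threshold=0.5):
    """Return the reference word with the greatest probability, or None
    when no candidate clears the threshold (so that the terminal can
    flag the word for correction, as in operation 230)."""
    word, p = max(word_probabilities.items(), key=lambda kv: kv[1])
    return word if p >= threshold else None

text = recognize({"forest": 0.42, "for the rest": 0.71})  # -> "for the rest"
```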
- The user terminal 110A may include means, such as the user interface 122, communication interface 124 and/or processor 120, for verifying or correcting a processing of the speech input.
- The verification or correction could be received explicitly by a user input to the user terminal 110A, or implicitly by everyday use of the user terminal 110A.
- The user terminal 110A may be configured to receive an explicit correction of a processed speech input.
- For example, in an application that uses speech recognition to prefill dictated words in a draft email message, the interpretation of the speech input may be incorrect.
- the user may correct a misinterpreted word(s) by selecting the misinterpreted word, and typing the corrected word in its place. See Figure 4.
- A user interface 122 may display an indication 400 of a word, such as a word that is misinterpreted during the processing of input speech.
- Indication 400 may be provided by the user terminal 110A in scenarios such as those in which the SRM provided no reference word above some threshold probability, indicating that the processing of the speech input was not likely correct. Additionally or alternatively, the indication 400 may be provided explicitly by a user, by selection of the word for correction, for example.
- User input 410 provides a means for receiving a correction of the processed speech input. In this example, the speech recognition system has interpreted the word "forest,” and a user provides the correct phrase, "for the rest.”
- a speech input may be deemed as correct based on implicit verification.
- A user terminal, such as user terminal 110A, may be embodied as a mobile phone and may further be operable to receive a speech input such as "call Suzanne."
- Upon automatic selection and execution of the associated command (e.g., initiating a call to a phone number saved for a contact by the name of Suzanne), and failure to receive any correction to stop the initiated phone call, the user terminal 110A, such as by the processor 120, may consider this absence of any action by the user a verification of the processed speech input.
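- A sketch of that implicit verification logic (the grace period and callback names are assumptions for illustration):

```python
import time

def execute_with_implicit_verification(command, cancel_requested, grace_seconds=5):
    """Execute the recognized command (e.g., initiate a call); if the
    user does not act to stop it within a grace period, treat the
    absence of any correction as verification of the processed input."""
    command()
    time.sleep(grace_seconds)
    return not cancel_requested()  # True -> implicitly verified
```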
- The user terminal 110A may generate and/or otherwise refine a speaker dependent SRM based on the speech input.
- the SRM may be trained using speech input received with respect to operation 220, and/or verification or correction of the processed speech input with respect to operation 230.
- Existing SRMs on memory device 126 may therefore be tailored for use by a particular user or group of users.
- New speaker dependent SRMs may be generated for improved speech input processing. Training can be performed, for example, by using feature vectors of the speech input (provided with respect to operation 220) and associating them with corresponding reference words, as provided by the verification and/or correction with respect to operation 230 above.
- a verification or correction need not be provided, but the processor 120 may identify the reference words from a script on memory device 126 (such as in an example embodiment where the speech input is received based on a script).
- The SRM, such as an HMM, DTW, NN, FST, or the like, may therefore be expanded, or otherwise modified, to incorporate the speech input and associated reference words.
- processed speech input and associated reference words may be further processed by processor 120, and applied to an existing SRM, to refine a speaker dependent SRM.
- a new speaker dependent SRM may be generated.
- the generated or refined speaker dependent SRM may be stored on memory device 126, for example.
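- The refinement step might be sketched as follows (a toy per-word template store stands in for updating HMM/DTW/NN/FST parameters, which the patent leaves implementation specific):

```python
from collections import defaultdict

class SpeakerDependentSRM:
    """Toy speaker dependent SRM: refining appends verified feature
    vectors to the entry for the corresponding reference word."""
    def __init__(self):
        self.templates = defaultdict(list)

    def refine(self, feature_vectors, reference_word):
        self.templates[reference_word].append(feature_vectors)

srm = SpeakerDependentSRM()
srm.refine([0.12, 0.87, 0.33], "for the rest")  # verified corrected phrase
```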
- The user terminal 110A may include means, such as communication interface 124, and/or processor 120, for causing transmission of the speaker dependent SRM to a remote storage location, such as personalized speech recognition apparatus 102, for example.
- Transmission of the speaker dependent SRM to a remote location may allow the speaker dependent SRM to be advantageously transmitted to other user terminals, such as described in further detail with respect to Figure 3. Further, and in some examples, by transmitting the speaker dependent SRM to the remote location, one or more user terminals may provide updates to or otherwise refine the speaker dependent SRM. The speaker dependent SRM may therefore be retrieved from memory device 126, and transmitted via communication interface 124 and over network 100, for example, to the remote storage location.
- the transmission may occur automatically following generation and/or refinement of the speaker dependent SRM with respect to operation 240.
- A user of user terminal 110A may initiate the transmission, such as, for example, by providing logon credentials to the personalized speech recognition apparatus 102, as described with respect to operation 200.
- the speaker dependent SRM may then be transmitted to the personalized speech recognition apparatus 102 for storage, and subsequent retrievals.
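- For illustration, the transmission to remote storage could look like the following (URL, authentication scheme, and serialization are assumptions; the patent does not specify them):

```python
import json
import urllib.request

def upload_speaker_dependent_srm(base_url, credentials, srm_portion):
    """POST a serialized speaker dependent SRM portion to a hypothetical
    endpoint on the personalized speech recognition apparatus."""
    body = json.dumps({"auth": credentials, "srm": srm_portion}).encode()
    req = urllib.request.Request(
        f"{base_url}/srm/upload", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.status == 200  # True on successful storage
```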
- FIG. 3 is a flowchart illustrating operations to transmit an SRM, receive a speaker dependent SRM, and generate an additional SRM, using a speech personalization apparatus 102 in accordance with one embodiment of the present invention.
- the personalized speech recognition apparatus 102 may include means, such as the processor 20, speech personalization administrator 28, communication interface 24, or the like, for causing transmission of an SRM (or portion thereof) to a user terminal.
- The SRM may therefore be retrieved from memory device 26, and sent over network 100, via communication interface 24, to user terminal 110A, for example.
- the SRM that is transmitted may be an SRM that is configured for a particular device, a particular region or dialect or the like.
- The personalized speech recognition apparatus 102 may generate the additional SRM based on an association with a group of users, such as one associated with a geographic location. For example, some geographic areas, like the southern United States, may feature regional accents that may otherwise confuse speech input processing systems. Personalized speech recognition apparatus 102 may therefore generate the additional SRM based on a particular geographic location in order to subsequently provide more accurate speech recognition functions to users in, from, or otherwise associated with the same geographic location. Similarly, an additional SRM may be generated based on a specific dialect. For example, due to varying dialects, the same word may be pronounced differently from one dialect or language to another, potentially causing erroneous speech input processing on a user terminal.
- Personalized speech recognition apparatus 102 may therefore associate the speaker dependent SRM with a dialect in order to provide more accurate speech recognition functions to users whose speech is closely related to the specific dialect.
- a user of a device may then provide indication of a particular dialect, and receive an SRM adapted for that dialect.
- the SRM may already be adapted to a particular user.
- The personalized speech recognition apparatus 102 may receive logon information from a user terminal, such as user terminal 110A, that indicates the identity of a particular user.
- Personalized speech recognition apparatus 102, such as via the processor 20, the communications interface 24, or the like, may cause the SRM related to the particular user to be transmitted to user terminal 110A.
- The transmission may be initiated on the personalized speech recognition apparatus 102 in various ways, such as by receiving requests initiated explicitly (e.g., logon) or automatically (e.g., initial installation) from the user terminal 110A, and/or by automatic transmission initiated by the personalized speech recognition apparatus 102.
- The personalized speech recognition apparatus 102 may include means, such as the processor 20, speech personalization administrator 28, communication interface 24, or the like, for receiving at least a portion of a speaker dependent SRM from the user terminal, such as user terminal 110A.
- the received speaker dependent SRM (or portion thereof) may contain one or more updates to or refinements of the speaker dependent SRM as is described with respect to operations 240 and 250 of Figure 2.
- the personalized speech recognition apparatus 102 may include means, such as the processor 20, speech personalization administrator 28, or the like, for generating an additional or otherwise updated SRM based on the speaker dependent SRM, wherein the additional SRM is adaptable by one or more user terminals.
- The additional SRM is constructed based on the speaker dependent SRM and comprises the updates to or refinements of the SRM from the user terminal, as well as from one or more other user terminals.
- the speech personalization administrator 28 may access an existing SRM on memory 26, and modify, update or otherwise refine the SRM with the speaker dependent SRM, or a portion of the speaker dependent SRM, accordingly. Additionally, or alternatively, a new SRM may be generated using the speaker dependent SRM.
- The additional SRM may be, or otherwise include, an HMM, DTW, NN, or FST, for example.
- the additional SRM may be adaptable by one or more user terminals, such as described with respect to operations 200 and 210 above.
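- Generation of the additional SRM might be sketched as a simple pooling of per-word material received from several terminals (an assumption for illustration; actual model fusion is implementation specific):

```python
def merge_speaker_dependent_portions(existing, received_portions):
    """Fold speaker dependent SRM portions received from one or more
    user terminals into an additional SRM keyed by reference word."""
    merged = {word: list(examples) for word, examples in existing.items()}
    for portion in received_portions:
        for word, examples in portion.items():
            merged.setdefault(word, []).extend(examples)
    return merged
```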
- the personalized speech recognition apparatus 102 may include means, such as the processor 20, communication interface 24, or the like, for causing transmission of the additional SRM to an additional device.
- The transmission may be initiated and completed by use of operations similar to those described with respect to operation 300, but the SRM may this time be transmitted to a different terminal, such as user terminal 110B, for example.
- the additional SRM may be shared between one or more user terminals, devices and/or the like.
- The personalized speech recognition apparatus 102 may select the additional SRM to transmit to the user terminal 110B based on a variety of factors, such as terminal dependent data and/or user identification, for example.
- An association of the individual user (or group of users) and speaker dependent SRM may allow the personalized speech recognition apparatus 102 to advantageously provide the SRM on demand, to various devices belonging to a user.
- a user terminal 110A embodied as a personal computer or laptop capable of producing text from speech input, such as dictated reports or emails, may rely on an extensive SRM representing an entire natural language.
- a speaker dependent SRM generated and/or refined on the user terminal 110A may be available on personalized speech recognition apparatus 102 for distribution to one or more other user terminals.
- if the same user of user terminal 110A purchases a new user terminal 110B, it may be advantageous to provide portions of the speaker dependent SRM to the user terminal 110B.
- in one example, the user terminal 110B is a coffee maker.
- a coffee maker may not require the broad vocabulary required by the personal computer or laptop embodiment of user terminal 110A, but only portions of the speaker dependent SRM including language relating to functions of the coffee maker (e.g., grind, brew), or language relating to measurements and/or timing.
- the personalized speech recognition apparatus 102 may advantageously select an SRM or a portion of an SRM (as generated in operation 320) for use by the coffee maker, potentially minimizing the bandwidth required for transmitting, and the memory required for storing (on user terminal 110B), an otherwise extensive SRM, as sketched below.
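A non-limiting sketch of such portion selection: keep only the entries of an extensive model that fall within the target terminal's domain vocabulary and renormalize. The vocabulary set and the probability-map representation are invented for illustration.

```python
# Illustrative selection of a domain-limited SRM portion before
# transmission; the coffee-maker vocabulary is an invented example.

COFFEE_VOCABULARY = {"grind", "brew", "cup", "minute", "ounce", "start", "stop"}

def srm_portion(full_model, domain_vocabulary):
    """Keep only word probabilities the target terminal can act on."""
    kept = {w: p for w, p in full_model.items() if w in domain_vocabulary}
    total = sum(kept.values()) or 1.0
    return {w: p / total for w, p in kept.items()}   # renormalize the portion

full = {"brew": 0.02, "grind": 0.01, "hello": 0.20, "meeting": 0.10}
print(srm_portion(full, COFFEE_VOCABULARY))  # only 'brew' and 'grind' remain
```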
- the user terminal 110B may utilize the SRM in processing speech input, thereby offering its users personalized speech recognition without requiring them to retrain an SRM. That is, the speech input processing may be improved by use of the SRM, which may include a portion or portions of the speaker dependent SRM generated or refined on user terminal 110A. As such, the user of user terminal 110B may provide speech input to user terminal 110B, and experience reduced or minimized error rates in speech input processing and/or execution of associated voice commands.
- Figure 1 illustrates an embodiment utilizing user terminals 110A and 110B, and a personalized speech recognition apparatus 102.
- a personalized speech recognition apparatus 102 may be locally installed on a device such as user terminal 110A and/or 110B and configured to run independently, where data may not necessarily be shared across devices or a server.
- a user terminal such as user terminal 110A and/or 110B may provide a speaker dependent SRM to a personalized speech recognition apparatus 102, may receive an SRM from a personalized speech recognition apparatus 102, or may both provide and receive the same, respectively.
- the personalized speech recognition apparatus 102 may be implemented in the cloud, and data may be transmitted between user terminal(s) and server(s) over network 100.
- various user terminals may routinely receive updated SRMs, thereby continually improving speech input processing by utilizing speaker dependent SRMs generated and/or refined on other user terminals.
- a device such as user terminal 110A and/or 110B may be shipped with SRMs preinstalled.
- the SRM may be local to the area or country in which the user terminal is distributed. That is, the SRM may be based on a speaker dependent SRM associated with a dialect or geographic area.
- Figures 2 and 3 are flowcharts illustrating operations performed by a user terminal 110A, user terminal 110B, and/or the like, and by the personalized speech recognition apparatus 102, respectively.
- each block of the flowchart, and combinations of blocks in the flowcharts may be implemented by various means, such as hardware, firmware, processor, circuitry, and/or other devices associated with execution of software including one or more computer program instructions.
- one or more of the procedures described above may be embodied by computer program instructions.
- the computer program instructions which embody the procedures described above may be stored by a memory device 26 or 126 employing an embodiment of the present invention and executed by a processor 20 or 120.
- any such computer program instructions may be loaded onto a computer or other programmable apparatus (e.g., hardware) to produce a machine, such that the resulting computer or other programmable apparatus implements the functions specified in the flowcharts' blocks.
- These computer program instructions may also be stored in a computer-readable memory that may direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture the execution of which implements the function specified in the flowcharts' blocks.
- the computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operations to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide operations for implementing the functions specified in the flowcharts' blocks.
- blocks of the flowcharts support combinations of means for performing the specified functions and combinations of operations for performing the specified functions. It will also be understood that one or more blocks of the flowcharts, and combinations of blocks in the flowcharts, may be implemented by special purpose hardware-based computer systems which perform the specified functions, or combinations of special purpose hardware and computer instructions.
- certain ones of the operations above may be modified or further amplified.
- additional optional operations may be included as indicated by the blocks shown with a dashed outline in Figures 2 and 3. Modifications, additions, or amplifications to the operations above may be performed in any order and in any combination.
Abstract
A method, apparatus and computer program product are provided for personalizing speech recognition data. A speech recognition model (SRM) that is adaptable by a user terminal based on user terminal dependent data may be received and adapted by a user terminal. A speaker dependent SRM may be refined on the user terminal and transmitted to a remote storage location, such as a personalized speech recognition apparatus. The apparatus may cause transmission of SRMs to various user terminals, and may generate additional SRMs based on speaker dependent SRMs. Speaker dependent SRMs may be generated based on an individual, a group of users, a geographic location, a dialect, or the like. SRMs may be based on hidden Markov models, dynamic time warping models, neural networks, finite state transducers, or the like.
Description
METHOD, APPARATUS, AND COMPUTER PROGRAM PRODUCT FOR PERSONALIZING SPEECH RECOGNITION
TECHNOLOGICAL FIELD
An example embodiment of the present invention relates generally to speech recognition, and more particularly, to a method, apparatus and computer program product for personalizing speech recognition.
BACKGROUND
The widespread use of technology, including mobile technology, in everyday life has led to an increased demand for other forms of user interaction with various devices. Devices providing a user with hands-free control capabilities are becoming increasingly popular, allowing users to control a device with voice commands, such as via speech recognition, while still focusing their attention on driving or other activities. Speech recognition may be used to control these and other devices, such as wireless phones, cars, household appliances, and other devices used in everyday life or work. Speech recognition, which may be referred to as automatic speech recognition (ASR), may be conducted by various applications that may be operable to convert recognized speech into text (e.g., a speech-to-text system). Current ASR and/or speech-to-text systems are typically based on a speech recognition model (SRM) comprising an acoustic model and a language model. For improved efficiency, the acoustic models and language models can be fused together, or otherwise combined. These SRMs are the building blocks for words and strings of words, such as phrases or sentences, and are used by a device to process speech input (e.g., recognize the speech input and derive a machine readable interpretation).
By way of example, a speech recognition processor, in some examples, may receive speech samples and then may match those samples with the basic sound units in the acoustic model. The speech recognition processor then may, for example, calculate the most likely words from the SRM based on the matched basic sound units, such as by using Hidden Markov Models (HMMs) and/or dynamic time warping (DTW). HMM and DTW are examples of statistical models that describe speech patterns probabilistically. Additionally or alternatively, various neural networks (NNs) and/or finite state transducers (FSTs) may also be used as SRMs. Other suitable models can also be used as SRMs.
In the DTW and in some additional examples, an unknown speech pattern is compared with known reference patterns. In dynamic time warping, the speech pattern is divided into several frames, and the local distance between the speech pattern included in each frame and the corresponding speech segment of the reference pattern is calculated. This distance is calculated by comparing the speech segment and the corresponding speech segment of the reference pattern with each other, and it is thus a kind of numerical value for the differences found in the comparison. For speech segments close to each other, a smaller distance is usually obtained than for speech segments further from each other. On the basis of local distances obtained this way, a minimum path between the beginning and end points of the word is sought by using a DTW algorithm. Thus, by DTW, a distance is obtained between the uttered word and the reference word.
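To make the above concrete, the following is a minimal dynamic time warping distance in Python. Frames are simplified to scalar values; a practical system would compare feature vectors (e.g., MFCCs), and this sketch is illustrative rather than the disclosed algorithm.

```python
# Minimal DTW: accumulate local frame distances along a minimum-cost
# alignment path between the beginning and end of the word.

def dtw_distance(pattern, reference):
    n, m = len(pattern), len(reference)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(pattern[i - 1] - reference[j - 1])  # local distance
            cost[i][j] = local + min(cost[i - 1][j],        # step in pattern
                                     cost[i][j - 1],        # step in reference
                                     cost[i - 1][j - 1])    # step in both
    return cost[n][m]

print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.2, 2.9, 3.0]))  # small distance
```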
In speech recognition using the HMM method, an HMM model is first formed for each word to be recognized (e.g., for each reference word). When the speech recognition device receives a speech pattern, an observation probability is calculated for each HMM model in the memory, and as the recognition result, a counterpart word is obtained for the HMM model with the greatest observation probability. Thus for each reference word, the probability is calculated that it is the word uttered by the speaker. The above-mentioned observation probability describes the resemblance of the received speech pattern and the closest HMM model (e.g., the closest reference speech pattern). The reference words, or word candidates, can be further weighted by the language models. In some embodiments, the recognition process can occur in a single pass-through mode with fused acoustic models and language models.
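The selection step described above can be sketched as follows; the per-word scoring functions stand in for a real forward-algorithm evaluation of each HMM, and all names and numbers are illustrative.

```python
import math

# Pick the reference word whose HMM yields the greatest observation
# probability, weighted by a language model (in log space).

def recognize(observation, word_scorers, language_model):
    best_word, best_score = None, -math.inf
    for word, score_fn in word_scorers.items():
        acoustic = score_fn(observation)        # log P(observation | word)
        prior = math.log(language_model[word])  # log P(word)
        if acoustic + prior > best_score:
            best_word, best_score = word, acoustic + prior
    return best_word

word_scorers = {"brew": lambda o: -2.0, "blue": lambda o: -2.1}  # toy values
language_model = {"brew": 0.6, "blue": 0.4}
print(recognize(None, word_scorers, language_model))  # 'brew'
```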
In a NN method, interconnecting data nodes store information regarding speech patterns. The nodes of the NN may be used to classify phonetic features of speech input, and may be configured so as to focus on portions of the model that may be most valuable in distinguishing words during speech recognition processes. A well designed NN will therefore minimize, in some examples, the processing time required to recognize speech inputs. NNs are particularly well suited for training of larger data sets, such as data sets representing natural language.
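As a toy counterpart to the NN description above, the following single-layer classifier maps a frame of acoustic features to a phonetic class. The weights and class labels are invented; a deployed network would be trained on large data sets.

```python
import math

# One-layer phonetic-feature classifier with a softmax output.

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_frame(features, weights, classes):
    scores = [sum(w * f for w, f in zip(row, features)) for row in weights]
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best]

weights = [[0.9, -0.2],   # weights for class "/b/"
           [-0.4, 0.8]]   # weights for class "/u/"
print(classify_frame([1.0, 0.1], weights, ["/b/", "/u/"]))  # '/b/'
```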
In an FST method, speech inputs may be processed, various operations may be performed on the speech input, and a most probable output (e.g., a recognized word) may be selected. FSTs may be particularly beneficial, in some examples, in phonological analysis. The reusability and flexibility of algorithms performed on FSTs make FSTs particularly useful in combining portions of SRMs, or various SRMs. An SRM may therefore incorporate speech recognition data from various sources, apply weights to the speech recognition data, and generate weighted FSTs for use in speech recognition tasks.
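The following sketch captures the "most probable output" idea with a toy weighted transducer: states, arcs labeled with words and costs, and a lowest-cost path search. It is a stand-in for a production FST library, with invented data.

```python
import heapq

# Best (lowest-cost) path through a toy weighted word graph.
# arcs: {state: [(next_state, word, cost), ...]}

def best_path(arcs, start, final):
    heap = [(0.0, start, [])]
    visited = set()
    while heap:
        cost, state, words = heapq.heappop(heap)
        if state == final:
            return cost, words
        if state in visited:
            continue
        visited.add(state)
        for next_state, word, weight in arcs.get(state, []):
            heapq.heappush(heap, (cost + weight, next_state, words + [word]))
    return float("inf"), []

arcs = {0: [(1, "for", 0.4), (1, "four", 0.9)], 1: [(2, "the", 0.1)]}
print(best_path(arcs, 0, 2))  # (0.5, ['for', 'the'])
```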
The various types of SRMs may include speaker independent SRMs and speaker dependent SRMs. Speaker independent SRMs may comprise averages of language and acoustic models collected from a large sample of users. A speaker dependent SRM may be specific to the user and may be adapted by the user through training. Initial training may be performed during a first use of the SRM, and training continues during normal use of the SRM. A speaker dependent SRM comprises unique sets of electronic characteristics for the acoustic model and a unique language model for the words formed from combinations of unique basic sound units.
It is appreciated that given the complexity of natural language, the data needed to process and understand speech may also be complex. SRMs used by a device to process speech input may therefore rely on any combination of the HMM, DTW, NN, FST, and other models, as well as a blend of speaker dependent SRMs and speaker independent SRMs.
BRIEF SUMMARY
A method, apparatus, and computer program product are provided for personalizing a speech recognition model (SRM). In one embodiment, a method is provided for receiving at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely and is adaptable by one or more user terminals to process input speech, accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model, and adapting the speech recognition model based on terminal dependent data.
In some embodiments, the method may further include processing received speech input using the speech recognition model, and generating a textual output. In some embodiments, the method may further include receiving a speech input, and refining a speaker dependent speech recognition model based on the speech input. In some embodiments, the method may further include verifying or correcting a processing of the speech input, wherein refining the speaker dependent speech recognition model is further based on the verification or correction. In some embodiments, the method may further include causing transmission of at least one portion of the speaker dependent speech recognition model to a remote storage location. The terminal dependent data may comprise microphone information and/or a context. The received at least one portion of a speech recognition model is received based on one of at least an individual user, group of users, geographic location, or a dialect. The received at least one portion of a speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer. An additional method is provided including receiving at least one portion of a speaker dependent speech recognition model from a user terminal and generating at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of a speech recognition model is adaptable by one or more user terminals.
In some embodiments, the method may further include causing transmission of the at least one additional portion of the speech recognition model to an additional user terminal. Generating the at least one additional portion of the speech recognition model is further based on one of at least an individual user, group of users, geographic location, or a dialect. The at least one additional portion of the speech recognition model may be based on at least one of a hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
An apparatus is also provided, comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the processor, cause the apparatus to at least receive at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech, access a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model, and adapt the speech recognition model based on terminal dependent data.
An additional apparatus is provided comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the processor, cause the apparatus to at least receive at least one portion of a speaker dependent speech recognition model from a user terminal, and generate at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of the speech recognition model is adaptable by one or more user terminals.
A computer program product is provided, comprising at least one non-transitory computer-readable storage medium having computer-executable program code instructions stored therein, the computer-executable program code instructions comprising program code instructions to receive at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech, access a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model, and adapt the speech recognition model based on terminal dependent data.
An additional computer program product is provided, comprising at least one non-transitory computer-readable storage medium having computer-executable program code instructions stored therein, the computer-executable program code instructions comprising program code instructions to receive at least one portion of a speaker dependent speech recognition model from a user terminal, and generate at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of the speech recognition model is adaptable by one or more user terminals.
An apparatus is also provided, comprising means for receiving at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech, accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model, and adapting the speech recognition model based on terminal dependent data.
An additional apparatus is provided, comprising means for receiving at least one portion of a speaker dependent speech recognition model from a user terminal, and generating at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of the speech recognition model is adaptable by one or more user terminals.
BRIEF DESCRIPTION OF THE DRAWINGS
Having thus described certain example embodiments of the present invention in general terms, reference will hereinafter be made to the accompanying drawings which are not necessarily drawn to scale, and wherein:
Figure 1 is a block diagram of a personalized speech recognition apparatus in communication with user terminals which may be configured to implement example embodiments of the present invention;
Figure 2 is a flowchart illustrating operations to receive and adapt an SRM on a user terminal, in accordance with one embodiment of the present invention;
Figure 3 is a flowchart illustrating operations to transmit an SRM, receive a speaker dependent SRM, and generate an additional SRM, using a speech personalization apparatus in accordance with one embodiment of the present invention; and
Figure 4 is a display for training an SRM, in accordance with one embodiment of the present invention.
DETAILED DESCRIPTION
Some embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. Indeed, various embodiments of the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like reference numerals refer to like elements throughout. As used herein, the terms "data," "content," "information," and similar terms may be used interchangeably to refer to data capable of being transmitted, received and/or stored in accordance with embodiments of the present invention. Thus, use of any such terms should not be taken to limit the spirit and scope of embodiments of the present invention.
Additionally, as used herein, the term 'circuitry' refers to (a) hardware-only circuit implementations (e.g., implementations in analog circuitry and/or digital circuitry); (b) combinations of circuits and computer program product(s) comprising software and/or firmware instructions stored on one or more computer readable memories that work together to cause an apparatus to perform one or more functions described herein; and (c) circuits, such as, for example, a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation even if the software or firmware is not physically present. This definition of 'circuitry' applies to all uses of this term herein, including in any claims. As a further example, as used herein, the term 'circuitry' also includes an implementation comprising one or more processors and/or portion(s) thereof and accompanying software and/or firmware. As another example, the term 'circuitry' as used herein also includes, for example, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, other network device, and/or other computing device.
As defined herein, a "computer-readable storage medium," which refers to a physical storage medium (e.g., volatile or non-volatile memory device), may be differentiated from a "computer- readable transmission medium," which refers to an electromagnetic signal.
As described below, a method, apparatus and computer program product are provided for accessing and adapting remotely stored personalized speech recognition data for use on one or more devices. Referring to Figure 1, personalized speech recognition apparatus 102 may include or otherwise be in communication with processor 20, user interface 22, communication interface 24, memory device 26, and speech personalization administrator 28. Personalized speech recognition apparatus 102 may be embodied by a wide variety of devices including mobile terminals, e.g., mobile telephones, smartphones, tablet computers, laptop computers, or the like, computers, workstations, servers or the like and may be implemented as a distributed system or a cloud based entity.
In example embodiments, the personalized speech recognition apparatus 102 may receive and/or transmit SRMs, as well as generate additional SRMs that may be adaptable by one or more user terminals. An SRM is a statistical model that describes speech patterns probabilistically, and may include a language model (words) and an acoustic model (basic sound units). Example SRMs include the HMM, DTW, NN, and FST models. An SRM may be provided to a user terminal to enable speech recognition capabilities (e.g., processing of input speech) on the user terminal. In some embodiments, transmittal of an SRM may include transmittal of a portion of the SRM, since an SRM in its entirety may be too large for practical transmission (and an SRM portion may also be considered an SRM). The SRM portion may be incorporable into an SRM, so that the portion may then be incorporated with another portion of an SRM to provide a complete or fully functioning SRM. It will therefore be appreciated that any reference to an SRM herein may indicate a portion or portions of an SRM, but for simplicity may be referred to as an SRM.
In some embodiments, the SRMs may incorporate speaker independent data, speaker dependent data, and/or terminal dependent data. The speaker independent data may include averaged, normalized, or otherwise consolidated language and acoustic models collected from a large sample of users.
The speaker dependent data may alternatively be biased toward a particular individual, or group of users, such as a group of users speaking a particular language or dialect, or from a particular geographic region. The speaker dependent data may be generated and/or refined on a user terminal by training the SRM. Alternatively or additionally, the speaker dependent data may be generated or refined on one or more user terminals and/or devices, such that it may be shared, via the personalized speech recognition apparatus 102, between the one or more user terminals and/or devices.
In some example embodiments, training may include, but is not limited to, providing speech input to the user terminal, potentially updating and/or verifying the processing of the speech input, and updating the SRM accordingly. On some user terminals, the training may include the explicit dictation of special training data by a speaker, and/or implicit training through the general use of the user terminal. In some embodiments, various models, such as an HMM, may be constructed for each speaker dependent SRM to be stored. A speaker dependent SRM incorporating the speaker dependent data may be communicated from the user terminal to the personalized speech recognition apparatus 102. The terminal dependent data may include information regarding the user terminal itself, such as characteristics of the microphone on the user terminal used to capture the speech input, and/or a context of the user terminal (e.g., an environment the device is commonly used in, or the intended purpose of the device), or any settings of the user terminal 110A that could impact the processing of speech input. An SRM received from the personalized speech recognition apparatus 102 may be adapted on the user terminal based on the terminal dependent data, so that the particular user terminal may more accurately process speech inputs.
Speaker dependent SRMs, including speaker dependent data may be stored on personalized speech recognition apparatus 102. The speaker dependent SRM, or a portion thereof, may be further modified and/or transmitted to another device to allow the user terminal to benefit from the speaker dependent data, thereby improving the probability of successful speech recognition on another user terminal. As such, one or more user terminals may access or otherwise download the speaker dependent model for the purposes of providing personalized speech recognition.
Advantageously, for example, as the one or more user terminals provide personalized speech recognition using the speaker dependent model, and the speech recognition result is verified or otherwise confirmed (e.g., checked by a user for errors), the personalized speech recognition apparatus 102 may receive updates to the speaker dependent model. The personalized speech recognition apparatus 102 may therefore further tune or otherwise modify the speaker dependent model.
In some embodiments, the processor 20 (and/or co-processors or any other processing circuitry assisting or otherwise associated with the processor 20) may be in communication with the memory device 26 via a bus for passing information among components of the personalized speech recognition apparatus 102. The memory device 26 may include, for example, one or more volatile and/or non-volatile memories. In other words, for example, the memory device 26 may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processor 20). The memory device 26 may be configured to store information, data, content, applications, instructions, or the like for enabling the apparatus to carry out various functions in accordance with an example embodiment of the present invention. For example, the memory device 26 could be configured to store various SRMs, including speaker independent and speaker dependent portions. The speaker dependent data may be associated with a particular user or group of users, enabling the processor 20 to identify and provide appropriate SRMs to various devices. As such, the memory device 26 could be configured to buffer input data for processing by the processor 20, and/or to store instructions for execution by the processor 20. The personalized speech recognition apparatus 102 may, in some embodiments, be embodied in various devices as described above. However, in some embodiments, the personalized speech recognition apparatus 102 may be embodied as a chip or chip set. In other words, the personalized speech recognition apparatus 102 may comprise one or more physical packages (e.g., chips) including materials, components and/or wires on a structural assembly (e.g., a baseboard). The structural assembly may provide physical strength, conservation of size, and/or limitation of electrical interaction for component circuitry included thereon. The personalized speech recognition apparatus 102 may therefore, in some cases, be configured to implement an embodiment of the present invention on a single chip or as a single "system on a chip." As such, in some cases, a chip or chipset may constitute means for performing one or more operations described herein for personalizing speech recognition in devices.
The processor 20 may be embodied in a number of different ways. For example, the processor 20 may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing element with or without an accompanying DSP, or various other processing circuitry including integrated circuits such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like. As such, in some embodiments, the processor 20 may include one or more processing cores configured to perform independently. A multi-core processor may enable multiprocessing within a single physical package. Additionally or alternatively, the processor 20 may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading. In an example embodiment, the processor 20 may be configured to execute instructions stored in the memory device 26 or otherwise accessible to the processor 20. In example embodiments, such instructions may provide for the retrieval, transmittal, and/or processing of SRMs, including generating additional SRMs based on received updated speaker dependent SRMs. Alternatively or additionally, the processor 20 may be configured to execute hard coded functionality. As such, whether configured by hardware or software methods, or by a combination thereof, the processor 20 may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present invention, such as the personalization of SRMs. Thus, for example, when the processor 20 is embodied as an ASIC, FPGA or the like, the processor 20 may be specifically configured hardware for conducting the operations described herein. Alternatively, as another example, when the processor 20 is embodied as an executor of software instructions, the instructions may specifically configure the processor 20 to perform the algorithms and/or operations described herein when the instructions are executed. However, in some cases, the processor 20 may be a processor of a specific device (e.g., a user terminal or network entity) configured to employ an embodiment of the present invention by further configuration of the processor 20 by instructions for performing the algorithms and/or operations described herein. The processor 20 may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processor 20.
Meanwhile, the communication interface 24 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data from/to a network and/or any other device or module in communication with the personalized speech recognition apparatus 102. In this regard, the communication interface 24 may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network, for transmitting and receiving SRMs to and from remote devices. Additionally or alternatively, the communication interface 24 may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s). In some environments, the communication interface 24 may alternatively or also support wired communication. As such, for example, the communication interface 24 may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms.
In some embodiments, such as instances in which the personalized speech recognition apparatus 102 is embodied by a user device, the personalized speech recognition apparatus 102 may include a user interface 22 that may, in turn, be in communication with the processor 20 to receive an indication of a user input and/or to cause provision of an audible, visual, mechanical or other output to the user. As such, the user interface 22 may include, for example, a keyboard, a mouse, a display, a touch screen(s), touch areas, soft keys, a microphone, a speaker, or other input/output mechanisms. Alternatively or additionally, the processor 20 may comprise user interface circuitry configured to control at least some functions of one or more user interface elements such as, for example, a speaker, ringer, microphone, display, and/or the like. The processor 20 and/or user interface circuitry comprising the processor 20 may be configured to control one or more functions of one or more user interface elements through computer program instructions (e.g., software and/or firmware) stored on a memory accessible to the processor 20 (e.g., memory device 26, and/or the like). In some example embodiments, processor 20 may be embodied as, include, or otherwise control a speech personalization administrator 28 for providing personalized speech recognition. As such, the speech personalization administrator 28 may be embodied as various means, such as circuitry, hardware, a computer program product comprising computer readable program instructions stored on a computer readable medium (for example, memory device 26) and executed by a processing device (for example, processor 20), or some combination thereof. Speech personalization administrator 28 may be capable of communication with one or more of the processor 20, memory device 26, user interface 22, and communication interface 24. As such, the speech personalization administrator 28 may be configured to generate additional SRMs, adaptable by a variety of user terminals, that may be based on speaker dependent SRMs, as described above and in further detail hereinafter.
Any number of user terminal(s) 110, such as 110A and 110B, may connect to personalized speech recognition apparatus 102 via a network 100. User terminal 110 may be embodied as a mobile terminal, such as personal digital assistants (PDAs), pagers, mobile televisions, mobile telephones, gaming devices, laptop computers, tablet computers, cameras, camera phones, video recorders, audio/video players, radios, global positioning system (GPS) devices, navigation devices, or any combination of the aforementioned, and other types of devices capable of providing speech recognition. The user terminal 110 need not necessarily be embodied by a mobile device and, instead, may be embodied in a fixed device, such as a computer, workstation, or home appliance, such as a coffee maker. Additionally or alternatively, user terminal(s) 110 may be embodied in a vehicle, or any other machine or device capable of processing voice commands.
For simplicity, only user terminal 110A is illustrated in further detail, but it will be appreciated that any of the user terminals 110, such as user terminal 110B, may be configured as illustrated in and described with respect to user terminal 110A. The user terminal 110 may therefore include or otherwise be in communication with processor 120, user interface 122, communication interface 124, and memory device 126.
In some embodiments, the processor 120 (and/or co-processors or any other processing circuitry assisting or otherwise associated with the processor 120) may be in communication with the memory device 126 via a bus for passing information among components of the user terminal 110. The memory device 126 may include, for example, one or more volatile and/or non-volatile memories. In other words, for example, the memory device 126 may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processor 120). The memory device 126 may be configured to store information, data, content, applications, instructions, or the like for enabling the user terminal to carry out various functions in accordance with an example embodiment of the present invention. For example, the memory device 126 could be configured to store SRMs, instructions for adapting SRMs with terminal dependent data, and instructions for training SRMs with speaker dependent data. Memory device 126 may therefore buffer input data for processing by the processor 120. Additionally or alternatively, the memory device 126 could be configured to store instructions for execution by the processor 120.
The processor 120 may be embodied in a number of different ways. For example, the processor 120 may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a DSP, a processing element with or without an accompanying DSP, or various other processing circuitry including integrated circuits such as, for example, an ASIC, an FPGA, an MCU, a hardware accelerator, a special-purpose computer chip, or the like. As such, in some embodiments, the processor 120 may include one or more processing cores configured to perform independently. A multi-core processor may enable multiprocessing within a single physical package. Additionally or alternatively, the processor 120 may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading. In an example embodiment, the processor 120 may be configured to execute instructions stored in the memory device 126 or otherwise accessible to the processor 120. For example, the processor 120 may be configured to adapt an SRM advantageously to the user terminal, based on terminal dependent data, such as microphone information and context, so that the SRM may account for variances across user terminals. In example embodiments, the user terminal(s) 110 may include means, such as a processor 120, for training the SRM with speech input, to generate and/or refine a speaker dependent SRM that may improve speech input processing on the user terminal (and subsequently, other user terminals). Alternatively or additionally, the processor 120 may be configured to execute hard coded functionality. As such, whether configured by hardware or software methods, or by a combination thereof, the processor 120 may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present invention while configured accordingly. Thus, for example, when the processor 120 is embodied as an ASIC, FPGA or the like, the processor 120 may be specifically configured hardware for conducting the operations described herein. Alternatively, as another example, when the processor 120 is embodied as an executor of software instructions, the instructions may specifically configure the processor 120 to perform the algorithms and/or operations, such as adaptation and training of SRMs, and processing of speech input, such as by using the SRMs, for conversion to text, when the instructions are executed. However, in some cases, the processor 120 may be a processor of a specific device (e.g., a mobile terminal or network entity) configured to employ an embodiment of the present invention by further configuration of the processor 120 by instructions for performing the algorithms and/or operations described herein. The processor 120 may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processor 120.
Meanwhile, the communication interface 124 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data from/to a network and/or any other device or module in communication with the user terminal 110. In example embodiments, the communication interface 124 may be specifically configured for transmitting and receiving SRMs to and from the personalized speech recognition apparatus 102. In this regard, the communication interface 124 may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network. Additionally or alternatively, the communication interface 124 may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s). In some environments, the communication interface 124 may alternatively or also support wired communication for communication of SRMs. As such, for example, the communication interface 124 may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms.
The user terminal 110 may include a user interface 122 that may, in turn, be in communication with the processor 120 to receive an indication of a user input and/or to cause provision of an audible, visual, mechanical or other output to the user. As such, the user interface 122 may include, for example, a keyboard, a mouse, a display, a touch screen(s), touch areas, soft keys, a microphone, a speaker, or other input/output mechanisms. The user interface 122 may therefore be configured to receive speech input, such as via a microphone, for the purposes of speech recognition and/or training of an SRM. Alternatively or additionally, the processor 120 may comprise user interface circuitry configured to control at least some functions of one or more user interface elements such as, for example, a speaker, ringer, microphone, display, and/or the like. The processor 120 and/or user interface circuitry comprising the processor 120 may be configured to control one or more functions of one or more user interface elements through computer program instructions (e.g., software and/or firmware) stored on a memory accessible to the processor 120 (e.g., memory device 126, and/or the like). Network 100 may be embodied in a local area network, the Internet, any other form of a network, or in any combination thereof, including proprietary private and semi-private networks and public networks. The network 100 may comprise a wire line network, wireless network (e.g., a cellular network, wireless local area network, wireless wide area network, some combination thereof, or the like), or a combination thereof, and in some example embodiments comprises at least a portion of the Internet. The network 100 may be used for transmitting speaker dependent data and/or SRMs to and from devices. As another example, a user terminal 110 may be directly coupled to and/or may include a personalized speech recognition apparatus 102. Referring now to Figure 2, the operations for receiving and adapting an SRM on a user terminal are outlined in accordance with one example embodiment. In this regard and as described below, the operations of Figure 2 may be performed by the user terminal 110A, user terminal 110B, and/or the like, for example.
As shown by operation 200, the user terminal 110A may include means, such as the processor 120, communication interface 124, or the like, for receiving at least one portion of an SRM, wherein the at least one portion of an SRM is stored remotely and is adaptable by one or more user terminals to process input speech. In other words, the user terminal 110A may receive at least one portion of an SRM from the personalized speech recognition apparatus 102, for example, including any combination of the HMM, DTW, NN, and FST models, as described above. The at least one portion of an SRM may also include any combination of speaker independent data and/or speaker dependent data, and may be adaptable by the user terminal 110A to process speech input (e.g., perform speech recognition tasks). The adaptation is described in further detail with respect to operation 210.
To receive the at least one portion of an SRM on the user terminal 110A, in an example embodiment, a user of user terminal 110A may provide logon credentials or the like, via user interface 122, communication interface 124, and/or network 100, to the personalized speech recognition apparatus 102. In some embodiments, the user terminal 110A may check for updates by communicating with the personalized speech recognition apparatus 102, and receive an SRM or portion thereof if an update is available. In some examples, an update may be available if a user updated the SRM, based on training, verification or the like, on another device, such as user terminal 110B.
In some embodiments, the user terminal 110A may download an SRM or portion thereof for the first time (such as during initial device setup, or factory reset), or the newly received SRM or portion thereof may include updates compared to a previous version used by user terminal 110A. In some embodiments, receipt of the SRM or portion thereof by the user terminal 110A may occur during scheduled update routines that may be unobtrusive to or unnoticed by a user. That is, the synchronization may occur seamlessly as a background system update. Additionally or alternatively, a request for an SRM or portion thereof may be explicitly initiated on the user terminal 110A (such as by logging onto the personalized speech recognition apparatus 102 and requesting an update). In some embodiments, an update may be initiated by the personalized speech recognition apparatus 102. For example, a user may be automatically notified that an update is available, such as by Short Message Service (SMS), for example, so as to confirm that they would like to receive the at least one portion of an SRM on the user terminal 110A.
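A non-limiting sketch of such an update check on the terminal side follows; `fetch_remote_version` and `download_srm` are hypothetical helpers standing in for whatever protocol the terminal and apparatus actually use.

```python
# Illustrative version-based update check, e.g., run during a scheduled
# background synchronization; the helper callables are assumptions.

def check_for_srm_update(local_version, fetch_remote_version, download_srm):
    remote_version = fetch_remote_version()
    if remote_version > local_version:
        return download_srm(remote_version)   # newer SRM (or portion) found
    return None                               # already up to date

updated = check_for_srm_update(
    local_version=3,
    fetch_remote_version=lambda: 4,
    download_srm=lambda version: {"version": version, "model": "hmm-portion"},
)
print(updated)
```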
The user terminal 110A may therefore receive at least one portion of an SRM associated with the individual user (such as identified with the logon credentials). Additionally or alternatively, the SRM or portion thereof may be identified by the personalized speech recognition apparatus by other means. For example, a user of a device may provide a geographic location, via a global positioning system (GPS) device and/or manual indication of a location, for example. The user terminal 110A may therefore receive an SRM based on a geographic location and/or dialect.
Having received at least one portion of an SRM, as described with respect to operation 200, the user terminal 110A may include means, such as the processor 120, for accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model. As such, the received at least one portion of an SRM may be a complete SRM, and may therefore be stored on memory device 126, and accessed by the processor 120. In some embodiments, where the at least one portion of the SRM does not provide a complete or fully functioning SRM, the processor 120 may incorporate the at least one portion of an SRM to form a complete SRM. As such, the SRM may be stored and accessed on memory device 126, for example.
Having accessed an SRM, as shown by operation 210, the user terminal 110A may include means, such as the processor 120, for adapting the SRM based on one or more terminal dependent data. The terminal dependent data may include information regarding the user terminal 110A itself, such as characteristics of the microphone on the user terminal 110A used to capture the speech input, and/or a context of the user terminal (e.g., an environment the device is commonly used in, or the intended purpose of the user terminal), or any settings of the user terminal 110A that could impact the processing of speech input. The processor 120 may therefore utilize the terminal dependent data in adapting the SRM for use on the user terminal 110A.
In an example embodiment, microphone information may be retrieved from memory device 126, or read from a microphone component of user interface 122 by processor 120, for example. The microphone information may include any information relating to the microphone that may impact how speech input is recognized and/or processed according to the SRM. For example, the microphone information may comprise a microphone model identifier, or the orientation of the microphone within the device. The microphone may additionally or alternatively be characterized by its transduction type, such as condenser and/or dynamic, for example. The user terminal 110A, using the processor 120, may therefore adapt the SRM according to the microphone information to account for acoustic, phonetic, and/or other variances between microphones. For example, calculations in a DTW model may be consistently modified throughout, so that the user terminal 110A may accurately interpret sounds captured by the microphone.
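As a non-limiting illustration of such adaptation, the sketch below applies a per-microphone offset to incoming feature vectors before they are matched against the SRM; the calibration table and feature representation are invented for the example.

```python
# Illustrative per-microphone feature adaptation; calibration values are
# invented and stand in for the consistent model modifications described
# above.

MIC_CALIBRATION = {
    "condenser-x1": [0.20, -0.10],   # per-feature offsets for this model
    "dynamic-y2": [-0.30, 0.05],
}

def adapt_features(features, microphone_model):
    offsets = MIC_CALIBRATION.get(microphone_model, [0.0] * len(features))
    return [f - o for f, o in zip(features, offsets)]

print(adapt_features([1.0, 0.5], "condenser-x1"))  # [0.8, 0.6]
```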
In another example embodiment, the user terminal 110A may adapt the SRM based on the context of the user terminal. Use of an SRM by a speaker phone in a vehicle, for example, may be subject to background noise, such as wind, and/or radio or other device interference. The processor 120 of user terminal 110A may therefore adapt the received SRM, which in its previous state may not have accounted for such background noises, accordingly. Information regarding the context or use of the user terminal 110A may be explicitly retrieved from memory device 126, for example, and/or derived from various components of the user terminal 110A, allowing processor 120 to adapt the SRM based on what contexts the user terminal 110A will most likely be used in.
Although microphone information and context of the user terminal are provided as example terminal dependent data, it will be appreciated that numerous other terminal dependent data exist. Settings configuring various components of the user terminal 110A may be considered by the processor 120 in adapting the SRM for the user terminal 110A. In some embodiments, the settings may affect the adaptation of the SRM, and/or cause the processor 120 to adjust the settings of the user terminal 110A to tailor the device for use of the SRM. An adapted SRM may be stored on memory device 126, for example.
As shown by operation 220, the user terminal 110A may include means, such as the user interface 122, communication interface 124, and/or processor 120, for receiving a speech input. The speech input may be provided by a user to user terminal 110A by using a microphone of user interface 122, for example.
Additionally or alternatively, the user terminal 110A may receive a speech input through everyday use of the user terminal and may process the speech to generate text. The user terminal 110A may process received speech input using the SRM, and generate a textual output. In some examples, the processor 120 may process the speech input according to the SRM. For example, the processor 120 may calculate observation probabilities for the speech input based on the SRM, which may include one or more HMM, DTW, NN, or FST models, for example. By way of further example, the processor 120 may identify a reference word with the highest probability when compared to other reference words, a threshold, or the like. Based on those probabilities, the processor may then select or otherwise generate the speech recognition result (e.g., a text output).
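A non-limiting sketch of that selection step: pick the reference word with the highest probability and compare it against a threshold, flagging low-confidence results for the verification step of operation 230. The probabilities and threshold are illustrative.

```python
# Illustrative result selection with a confidence threshold.

def select_result(word_probabilities, threshold=0.5):
    word, probability = max(word_probabilities.items(), key=lambda kv: kv[1])
    needs_verification = probability < threshold
    return word, needs_verification

print(select_result({"forest": 0.41, "for the rest": 0.38}))
# ('forest', True) -- low confidence, so the user may be asked to verify
```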
As shown by operation 230, the user terminal 110A may include means, such as the user interface 122, communication interface 124 and/or processor 120, for verifying or correcting a processing of the speech input. The verification or correction could be received explicitly by a user input to the user terminal 110A, or implicitly by everyday use of the user terminal 110A.
For example, the user terminal 110A may be configured to receive an explicit correction of a processed speech input. In applications employing speech recognition, such as an example application that prefills dictated words in a draft email message, the interpretation of the speech input may be incorrect. In such cases, the user may correct a misinterpreted word(s) by selecting the misinterpreted word, and typing the corrected word in its place. See Figure 4.
As is provided in Figure 4, a user interface 122 may display an indication 400 of a word, such as a word that is misinterpreted during the processing of input speech. In some examples, indication 400 may be provided by the user terminal 110A in scenarios such as those in which the SRM provided no reference word above some threshold probability, indicating that the processing of the speech input was not likely correct. Additionally or alternatively, the indication 400 may be provided explicitly by a user, by selection of the word for correction, for example. User input 410 provides a means for receiving a correction of the processed speech input. In this example, the speech recognition system has interpreted the word "forest," and a user provides the correct phrase, "for the rest."
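The captured correction can be retained as a (recognized, corrected) pair for the refinement of operation 240; the sketch below is a hypothetical illustration, not the disclosed data structure.

```python
# Illustrative capture of an explicit correction such as the one in
# Figure 4; the record layout is an invented example.

corrections = []

def record_correction(utterance_id, recognized, corrected):
    corrections.append(
        {"utterance": utterance_id, "recognized": recognized,
         "corrected": corrected}
    )

record_correction("utterance-42", "forest", "for the rest")
print(corrections)
```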
In other examples, a speech input may be deemed as correct based on implicit verification. For example, a user terminal, such as user terminal 110A, may be embodied as a mobile phone and may further be operable to receive a speech input such as "call Suzanne." Upon automatic selection and execution of the associated command (e.g., initiating a call to a phone number saved for a contact by the name of Suzanne), and failure to receive any correction to stop the initiated phone call, the user terminal 110A, such as by the processor 120, may consider this absence of any action by the user a verification of the processed speech input.
As shown by operation 240, the user terminal 110A, such as by processor 120, and memory device 126, for example, may generate and/or otherwise refine a speaker dependent SRM based on the speech input. As such, the SRM may be trained using speech input received with respect to operation 220, and/or verification or correction of the processed speech input with respect to operation 230. Existing SRMs on memory device 126 may therefore be tailored for use by a particular user or group of users. Additionally or alternatively, new speaker dependent SRMs may be generated for improved speech input processing. Training can be performed, for example, by using feature vectors of the speech input (provided with respect to operation 220) and associating them with corresponding reference words, as provided by the verification and/or correction with respect to operation 230 above. Additionally or alternatively, a verification or correction need not be provided, but the processor 120 may identify the reference words from a script on memory device 126 (such as in an example embodiment where the speech input is received based on a script).
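A non-limiting sketch of such refinement: nudge a stored per-word reference template toward newly verified feature vectors. Averaging templates is a deliberate simplification standing in for retraining an HMM, DTW, NN, or FST model.

```python
# Illustrative refinement of a speaker dependent model: move each word's
# reference template toward verified feature vectors at a small rate.

def refine_template(model, word, feature_vector, rate=0.1):
    template = model.setdefault(word, list(feature_vector))
    for i, value in enumerate(feature_vector):
        template[i] += rate * (value - template[i])  # exponential averaging
    return model

model = {}
refine_template(model, "for the rest", [0.9, 0.3])
refine_template(model, "for the rest", [1.1, 0.5])
print(model)  # template has drifted slightly toward the second input
```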
The SRM, such as an HMM, DTW, neural network, FST, or the like, may therefore be expanded or otherwise modified to incorporate the speech input and associated reference words. In some examples, processed speech input and associated reference words may be further processed by processor 120 and applied to an existing SRM to refine a speaker dependent SRM. In some embodiments, where an SRM is not already present on the user terminal 110A, a new speaker dependent SRM may be generated. The generated or refined speaker dependent SRM may be stored on memory device 126, for example.
As shown by operation 250, the user terminal 110A may include means, such as communication interface 124 and/or processor 120, for causing transmission of the speaker dependent SRM to a remote storage location, such as personalized speech recognition apparatus 102, for example. Transmission of the speaker dependent SRM to a remote location may allow the speaker dependent SRM to be advantageously transmitted to other user terminals, such as described in further detail with respect to Figure 3. Further, and in some examples, by transmitting the speaker dependent SRM to the remote location, one or more user terminals may provide updates to or otherwise refine the speaker dependent SRM. The speaker dependent SRM may therefore be retrieved from memory device 126, and transmitted via communication interface 124 and over network 100, for example, to the remote storage location.
In some embodiments, the transmission may occur automatically following generation and/or refinement of the speaker dependent SRM with respect to operation 240. In some embodiments, a user of user terminal 110A may initiate the transmission, such as, for example, by providing logon credentials to the personalized speech recognition apparatus 102, as described with respect to operation 200. The speaker dependent SRM may then be transmitted to the personalized speech recognition apparatus 102 for storage and subsequent retrieval.
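As a concrete illustration, a minimal sketch of such a transmission is given below, assuming a JSON serialization of the model and a hypothetical HTTP endpoint on the remote storage location; neither the payload shape nor the URL is prescribed by the embodiments above:

```python
import json
import urllib.request

def upload_speaker_model(model, user_id, url):
    """Serialize a speaker dependent SRM and POST it to a remote store.
    The endpoint, payload shape, and field names are illustrative only."""
    payload = json.dumps({"user": user_id, "model": model}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, method="POST",
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return response.status == 200  # treat HTTP 200 as stored

# Hypothetical call; example.invalid is a reserved, non-resolving domain.
# upload_speaker_model({"brew": [1.0, 2.0, 3.0]}, "user-123",
#                      "https://example.invalid/srm/upload")
```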
Figure 3 is a flowchart illustrating operations to transmit an SRM, receive a speaker dependent SRM, and generate an additional SRM, using a personalized speech recognition apparatus 102 in accordance with one embodiment of the present invention.
As shown by operation 300, the personalized speech recognition apparatus 102 may include means, such as the processor 20, speech personalization administrator 28, communication interface 24, or the like, for causing transmission of an SRM (or portion thereof) to a user terminal. The SRM may therefore be retrieved from memory device 26 and sent over network 100, via communication interface 24, to user terminal 110A, for example. In some examples, the SRM that is transmitted may be an SRM that is configured for a particular device, a particular region or dialect, or the like.
For example, the personalized speech recognition apparatus 102 may generate the additional SRM based on an association with a group of users, such as a group associated with a geographic location. For example, some geographic areas, such as the southern United States, may exhibit regional accents that may otherwise confuse speech input processing systems. Personalized speech recognition apparatus 102 may therefore generate the additional SRM based on a particular geographic location in order to subsequently provide more accurate speech recognition functions to users in, from, or otherwise associated with the same geographic location.
Similarly, an additional SRM may be generated based on a specific dialect. For example, due to varying dialects, the same word may be pronounced differently by different speakers, potentially causing erroneous speech input processing on a user terminal. Personalized speech recognition apparatus 102 may therefore associate the speaker dependent SRM with a dialect in order to provide more accurate speech recognition functions to users whose speech is closely related to that dialect. A user of a device may then provide an indication of a particular dialect and receive an SRM adapted for that dialect.
Alternatively or additionally, the SRM may already be adapted to a particular user. For example, the personalized speech recognition apparatus 102 may receive logon information from a user terminal, such as user terminal 110A, that indicates the identity of a particular user. As such, personalized speech recognition apparatus 102, such as via the processor 20, the communications interface 24 or the like, may cause the SRM related to the particular user to be transmitted to user terminal 110A.
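A sketch of the selection logic the personalized speech recognition apparatus 102 might apply follows, preferring the most specific model available (per-user, then dialect, then region, then a generic fallback). The store layout and key names are assumptions for illustration only:

```python
def select_model(store, user_id=None, dialect=None, region=None):
    """Pick the most specific SRM available: per-user first, then
    dialect, then region, then a generic fallback."""
    for key in (("user", user_id), ("dialect", dialect), ("region", region)):
        model = store.get(key) if key[1] else None
        if model is not None:
            return model
    return store[("generic", None)]  # always-present fallback model

store = {
    ("user", "suzanne"): "suzanne-srm",
    ("dialect", "southern-us"): "southern-us-srm",
    ("generic", None): "base-srm",
}
print(select_model(store, dialect="southern-us"))  # -> "southern-us-srm"
print(select_model(store))                         # -> "base-srm"
```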
The transmission may be initiated on the personalized speech recognition apparatus 102 in various ways, such as by receiving requests initiated explicitly (e.g., logon) or automatically (e.g., initial installation) from the user terminal 110A, and/or by automatic transmission initiated by the personalized speech recognition apparatus 102. Various other methods for initiation of transmission of the SRM are described herein.
As shown by operation 310, the personalized speech recognition apparatus 102 may include means, such as the processor 20, speech personalization administrator 28, communication interface 24, or the like, for receiving at least a portion of a speaker dependent SRM from the user terminal, such as user terminal 110A. In some examples, the received speaker dependent SRM (or portion thereof) may contain one or more updates to or refinements of the speaker dependent SRM as is described with respect to operations 240 and 250 of Figure 2.
As shown by operation 320, the personalized speech recognition apparatus 102 may include means, such as the processor 20, speech personalization administrator 28, or the like, for generating an additional or otherwise updated SRM based on the speaker dependent SRM, wherein the additional SRM is adaptable by one or more user terminals. In some examples, the additional SRM is constructed based on the speaker dependent SRM and comprises the updates to or refinements of the SRM from the user terminal, as well as from one or more other user terminals.
As such, the speech personalization administrator 28 may access an existing SRM on memory 26, and modify, update or otherwise refine the SRM with the speaker dependent SRM, or a portion of the speaker dependent SRM, accordingly. Additionally or alternatively, a new SRM may be generated using the speaker dependent SRM. The additional SRM may be, or otherwise include, an HMM, DTW, neural network, or FST, for example. The additional SRM may be adaptable by one or more user terminals, such as described with respect to operations 200 and 210 above.
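The following sketch illustrates one plausible way the speech personalization administrator 28 could fold a received speaker dependent SRM into an existing model; weighted averaging of per-word templates stands in for the statistical re-estimation a real HMM, neural network, or FST would require, and the weight is an assumption:

```python
def merge_models(base, contribution, weight=0.1):
    """Blend one terminal's speaker-dependent templates into a shared SRM.
    Unknown words are adopted outright; known words are averaged, a rough
    stand-in for re-estimating full model parameters."""
    merged = {word: list(tmpl) for word, tmpl in base.items()}
    for word, template in contribution.items():
        old = merged.get(word)
        if old is None or len(old) != len(template):
            merged[word] = list(template)          # adopt the new word
        else:
            merged[word] = [(1 - weight) * o + weight * t
                            for o, t in zip(old, template)]
    return merged

shared = {"brew": [1.0, 2.0, 3.0]}
update = {"brew": [1.4, 2.4, 3.4], "grind": [3.0, 1.0, 0.5]}
print(merge_models(shared, update))
```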
As shown by operation 330, the personalized speech recognition apparatus 102 may include means, such as the processor 20, communication interface 24, or the like, for causing transmission of the additional SRM to an additional device. The transmission may be initiated and completed by use of similar operations described with respect to operation 300, but the SRM may this time be transmitted to a different terminal, such as user terminal 110B, for example. Advantageously, for example, the additional SRM may be shared among one or more user terminals, devices and/or the like.
In an example embodiment, the personalized speech recognition apparatus 102 may select the additional SRM to transmit to the user terminal 110B based on a variety of factors, such as terminal dependent data and/or user identification, for example. An association of the individual user (or group of users) and the speaker dependent SRM may allow the personalized speech recognition apparatus 102 to advantageously provide the SRM on demand to various devices belonging to a user.
For example, a user terminal 110A embodied as a personal computer or laptop capable of producing text from speech input, such as dictated reports or emails, may rely on an extensive SRM representing an entire natural language. A speaker dependent SRM generated and/or refined on the user terminal 110A may be available on personalized speech recognition apparatus 102 for distribution to one or more other user terminals. For example, if the same user of user terminal 110A purchases a new user terminal 110B, it may be advantageous to provide portions of the speaker dependent SRM to the user terminal 110B. Presume, for example, that the user terminal 110B is a coffee maker. A coffee maker may not require the broad vocabulary required by the personal computer or laptop embodiment of user terminal 110A, but only portions of the speaker dependent SRM including language relating to functions of the coffee maker (e.g., grind, brew), or language relating to measurements and/or timing. As such, upon detecting that user terminal 110B is embodied as a coffee maker, for example, the personalized speech recognition apparatus 102 may advantageously select an SRM or a portion of an SRM (as generated in operation 320) for use by the coffee maker, potentially minimizing the bandwidth required for transmitting, and the memory required for storing (on user terminal 110B), an otherwise extensive SRM.
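A simplified sketch of this device-based selection follows; the device profiles and vocabulary sets are illustrative assumptions, and a real registry would of course be far richer:

```python
DEVICE_VOCABULARIES = {
    # Hypothetical device profiles mapping device type to needed words.
    "coffee_maker": {"brew", "grind", "stop", "timer"},
    "laptop": None,  # None means: send the full model
}

def model_for_device(full_model, device_type):
    """Return only the portion of an SRM a given terminal needs, trimming
    transmission bandwidth and on-device storage for small terminals."""
    vocabulary = DEVICE_VOCABULARIES.get(device_type)
    if vocabulary is None:
        return dict(full_model)
    return {word: tmpl for word, tmpl in full_model.items()
            if word in vocabulary}

full = {"brew": [1.0], "grind": [2.0], "dictate": [3.0]}
print(model_for_device(full, "coffee_maker"))  # -> brew and grind only
```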
Having received an SRM from the personalized speech recognition apparatus 102, the user terminal 110B may utilize the SRM in processing speech input, thereby offering its users personalized speech recognition without requiring them to retrain an SRM.
That is, the speech input processing may be improved by use of the SRM, which may include a portion or portions of the speaker dependent SRM generated or refined on user terminal 110A. As such, the user of user terminal 110B may provide speech input to user terminal 110B and experience reduced or minimized error rates in speech input processing and/or execution of associated voice commands.
It will be appreciated that, although Figure 1 illustrates an embodiment utilizing user terminals 110A and 110B, and a personalized speech recognition apparatus 102, many other configurations exist. Indeed, a personalized speech recognition apparatus 102 may be locally installed on a device such as user terminal 110A and/or 110B and configured to run independently, where data may not necessarily be shared across devices or a server.
In some embodiments, a user terminal such as user terminal 110A and/or 110B may provide a speaker dependent SRM to a personalized speech recognition apparatus 102, may receive an SRM from a personalized speech recognition apparatus 102, or may both provide and receive such models. In such example embodiments, the personalized speech recognition apparatus 102 may be implemented in the cloud, and data may be transmitted between user terminal(s) and server(s) over network 100. In a particularly advantageous embodiment, various user terminals may routinely receive updated SRMs, thereby continually improving speech input processing by utilizing speaker dependent SRMs generated and/or refined on other user terminals.
Additionally or alternatively, in some embodiments, a device such as user terminal 110A and/or 110B may be shipped with SRMs preinstalled. In some embodiments, the SRM may be localized to the area or country in which the user terminal is distributed. That is, the SRM may be based on a speaker dependent SRM associated with a dialect or geographic area.
As described above, Figures 2 and 3 are flowcharts illustrating operations performed by a user terminal 110A, user terminal 110B and/or the like, and personalized speech recognition apparatus 102, respectively. It will be understood that each block of the flowcharts, and combinations of blocks in the flowcharts, may be implemented by various means, such as hardware, firmware, processor, circuitry, and/or other devices associated with execution of software including one or more computer program instructions. For example, one or more of the procedures described above may be embodied by computer program instructions. In this regard, the computer program instructions which embody the procedures described above may be stored by a memory device 26 or 126 employing an embodiment of the present invention and executed by a processor 20 or 120. As will be appreciated, any such computer program instructions may be loaded onto a computer or other programmable apparatus (e.g., hardware) to produce a machine, such that the resulting computer or other programmable apparatus implements the functions specified in the flowcharts' blocks. These computer program instructions may also be stored in a computer-readable memory that may direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture the execution of which implements the function specified in the flowcharts' blocks. The computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operations to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide operations for implementing the functions specified in the flowcharts' blocks.
Accordingly, blocks of the flowcharts support combinations of means for performing the specified functions and combinations of operations for performing the specified functions. It will also be understood that one or more blocks of the flowcharts, and combinations of blocks in the flowcharts, may be implemented by special purpose hardware-based computer systems which perform the specified functions, or combinations of special purpose hardware and computer instructions.
In some embodiments, certain ones of the operations above may be modified or further amplified. Furthermore, in some embodiments, additional optional operations may be included as indicated by the blocks shown with a dashed outline in Figures 2 and 3. Modifications, additions, or amplifications to the operations above may be performed in any order and in any combination.
Many modifications and other embodiments of the inventions set forth herein will come to mind to one skilled in the art to which these inventions pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the inventions are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Moreover, although the foregoing descriptions and the associated drawings describe example embodiments in the context of certain example combinations of elements and/or functions, it should be appreciated that different combinations of elements and/or functions may be provided by alternative embodiments without departing from the scope of the appended claims. In this regard, for example, different combinations of elements and/or functions than those explicitly described above are also contemplated as may be set forth in some of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
Claims
1. A method comprising:
receiving at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely and is adaptable by one or more user terminals to process input speech;
accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model; and adapting the speech recognition model based on terminal dependent data.
2. A method according to claim 1, further comprising:
processing received speech input using the speech recognition model; and generating a textual output.
3. A method according to claim 1 or 2, further comprising:
receiving a speech input; and
refining a speaker dependent speech recognition model based on the speech input.
4. A method according to claim 3, further comprising:
verifying or correcting a processing of the speech input, wherein refining the speaker dependent speech recognition model is further based on the verification or correction.
5. A method according to claim 3 or 4, further comprising:
causing transmission of at least one portion of the speaker dependent speech recognition model to a remote storage location.
6. The method according to claim 1, 2, 3, or 4, wherein the terminal dependent data
comprises microphone information.
7. The method according to claim 1, 2, 3, 4 or 5, wherein the terminal dependent data
comprises a context.
8. The method according to claim 1, 2, 3, 4, 5, or 6, wherein the received at least one
portion of a speech recognition model is received based on one of at least an individual user, group of users, geographic location, or a dialect.
9. A method according to claim 1, 2, 3, 4, 5, 6 or 7, wherein the received at least one portion of a speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
10. A method comprising:
receiving at least one portion of a speaker dependent speech recognition model from a user terminal; and
generating at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of a speech recognition model is adaptable by one or more user terminals.
11. A method according to claim 10, further comprising:
causing transmission of the at least one additional portion of the speech recognition model to an additional user terminal.
12. A method according to claim 10 or 11, wherein generating the at least one additional portion of the speech recognition model is further based on one of at least an individual user, group of users, geographic location, or a dialect.
13. A method according to claim 10, 11 or 12, wherein the at least one additional portion of the speech recognition model is based on at least one of a hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
14. An apparatus comprising at least one processor and at least one memory including
computer program code, the at least one memory and the computer program code configured to, with the processor, cause the apparatus to at least:
receive at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech;
access a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model; and
adapt the speech recognition model based on terminal dependent data.
15. An apparatus according to claim 14, wherein the at least one memory and the computer program code are further configured to, with the processor, cause the apparatus to at least:
process received speech input using the speech recognition model; and generate a textual output.
16. An apparatus according to claim 14 or 15, wherein the at least one memory and the computer program code are further configured to, with the processor, cause the apparatus to at least:
receive a speech input; and
refine a speaker dependent speech recognition model based on the speech input.
17. An apparatus according to claim 16, wherein the at least one memory and the computer program code are further configured to, with the processor, cause the apparatus to at least:
verify or correct a processing of the speech input, wherein refining the speaker dependent speech recognition model is further based on the verification or correction.
18. An apparatus according to claim 16 or 17, wherein the at least one memory and the
computer program code are further configured to, with the processor, cause the apparatus to at least:
cause transmission of at least a portion of the speaker dependent speech recognition model to a remote storage location.
19. An apparatus according to claim 14, 15, 16, 17 or 18, wherein the terminal dependent data comprises microphone information.
20. An apparatus according to claim 14, 15, 16, 17, 18 or 19, wherein the terminal dependent data comprises a context.
21. An apparatus according to claim 14, 15, 16, 17 or 18, wherein the received at least one portion of a speech recognition model is received based on one of at least an individual user, group of users, geographic location, or a dialect.
22. An apparatus according to claim 14, 15, 16, 17, 18, 19, 20 or 21, wherein the received at least one portion of a speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
23. An apparatus comprising at least one processor and at least one memory including
computer program code, the at least one memory and the computer program code configured to, with the processor, cause the apparatus to at least:
receive at least one portion of a speaker dependent speech recognition model from a user terminal; and
generate at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model,
wherein the at least one additional portion of the speech recognition model is adaptable by one or more user terminals.
24. An apparatus according to claim 23, wherein the at least one memory and the computer program code are further configured to, with the processor, cause the apparatus to at least:
cause transmission of the at least one additional portion of a speech recognition model to an additional user terminal.
25. An apparatus according to claim 23 or 24, wherein generating the at least one additional portion of a speech recognition model is further based on one of at least an individual user, group of users, geographic location, or a dialect.
26. An apparatus according to claim 23, 24 or 25, wherein the at least one additional portion of the speech recognition model is based on at least one of a hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
27. A computer program product comprising at least one non-transitory computer-readable storage medium having computer-executable program code instructions stored therein, the computer-executable program code instructions comprising program code instructions to:
receive at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech;
access a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model; and
adapt the speech recognition model based on terminal dependent data.
28. A computer program product according to claim 27, wherein the computer-executable program code instructions further comprise program code instructions to:
process received speech input using the speech recognition model; and generate a textual output.
29. A computer program product according to claim 27 or 28, wherein the computer- executable program code instructions further comprise program code instructions to: receive a speech input; and
refine a speaker dependent speech recognition model based on the speech input.
30. A computer program product according to claim 29, wherein the computer-executable program code instructions further comprise program code instructions to:
verify or correct a processing of the speech input, wherein refining the speaker dependent speech recognition model is further based on the verification or correction.
31. A computer program product according to claim 29 or 30, wherein the computer- executable program code instructions further comprise program code instructions to: cause transmission of at least one portion of the speaker dependent speech recognition model to a remote storage location.
32. A computer program product according to claim 27, 28, 29, 30 or 31, wherein the
terminal dependent data comprises microphone information.
33. A computer program product according to claim 27, 28, 29, 30, 31 or 32, wherein the terminal dependent data comprises a context.
34. A computer program product according to claim 27, 28, 29, 30, 31, 32 or 33, wherein the received at least one portion of a speech recognition model is received based on one of at least an individual user, group of users, geographic location, or a dialect.
35. A computer program product according to claim 27, 28, 29, 30, 31, 32, 33 or 34, wherein the received at least one portion of a speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
36. A computer program product comprising at least one non-transitory computer-readable storage medium having computer-executable program code instructions stored therein, the computer-executable program code instructions comprising program code instructions to:
receive at least one portion of a speaker dependent speech recognition model from a user terminal; and
generate at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of a speech recognition model is adaptable by one or more user terminals.
37. A computer program product according to claim 36, wherein the computer-executable program code instructions further comprise program code instructions to:
cause transmission of the at least one additional portion of the speech recognition model to an additional user terminal.
38. A computer program product according to claim 36 or 37, wherein generating the at least one additional portion of the speech recognition model is further based on one of at least an individual user, group of users, geographic location, or a dialect.
39. A computer program product according to claim 36, 37 or 38, wherein the at least one additional portion of the speech recognition model is based on at least one of a hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
40. An apparatus comprising means for:
receiving at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech;
accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model; and adapting the speech recognition model based on terminal dependent data.
41. An apparatus according to claim 40, further comprising means for:
processing received speech input using the speech recognition model; and generating a textual output.
42. An apparatus according to claim 40 or 41, further comprising means for:
receiving a speech input; and
refining a speaker dependent speech recognition model based on the speech input.
43. An apparatus according to claim 42, further comprising means for:
verifying or correcting a processing of the speech input, wherein refining the speaker dependent speech recognition model is further based on the verification or correction.
44. An apparatus according to claim 42 or 43, further comprising means for:
causing transmission of at least one portion of the speaker dependent speech recognition model to a remote storage location.
45. An apparatus according to claim 40, 41, 42, 43 or 44, wherein the terminal dependent data comprises microphone information.
46. An apparatus according to claim 40, 41, 42, 43, 44 or 45, wherein the terminal dependent data comprises a context.
47. An apparatus according to claim 40, 41, 42, 43, 44, 45 or 46, wherein the received at least one portion of a speech recognition model is received based on one of at least an individual user, group of users, geographic location, or a dialect.
48. An apparatus according to claim 40, 41, 42, 43, 44, 45, 46 or 47, wherein the received at least one portion of a speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
49. An apparatus comprising means for:
receiving at least one portion of a speaker dependent speech recognition model from a user terminal; and
generating at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of the speech recognition model is adaptable by one or more user terminals.
50. An apparatus according to claim 49, further comprising means for:
causing transmission of the at least one additional portion to an additional user terminal.
51. An apparatus according to claim 49 or 50, wherein generating the at least one additional portion of the speech recognition model is further based on one of at least an individual user, group of users, geographic location, or a dialect.
52. An apparatus according to claim 49, 50 or 51, wherein the at least one additional portion of the speech recognition model is based on at least one of a hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/FI2012/051285 WO2014096506A1 (en) | 2012-12-21 | 2012-12-21 | Method, apparatus, and computer program product for personalizing speech recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2014096506A1 (en) | 2014-06-26 |
Family
ID=50977656
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2014096506A1 (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030050783A1 (en) * | 2001-09-13 | 2003-03-13 | Shinichi Yoshizawa | Terminal device, server device and speech recognition method |
US20070124134A1 (en) * | 2005-11-25 | 2007-05-31 | Swisscom Mobile Ag | Method for personalization of a service |
US20100145699A1 (en) * | 2008-12-09 | 2010-06-10 | Nokia Corporation | Adaptation of automatic speech recognition acoustic models |
Cited By (206)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11928604B2 (en) | 2005-09-08 | 2024-03-12 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US11979836B2 (en) | 2007-04-03 | 2024-05-07 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US11671920B2 (en) | 2007-04-03 | 2023-06-06 | Apple Inc. | Method and system for operating a multifunction portable electronic device using voice-activation |
US11023513B2 (en) | 2007-12-20 | 2021-06-01 | Apple Inc. | Method and apparatus for searching using an active ontology |
US10381016B2 (en) | 2008-01-03 | 2019-08-13 | Apple Inc. | Methods and apparatus for altering audio output signals |
US10108612B2 (en) | 2008-07-31 | 2018-10-23 | Apple Inc. | Mobile device having human language translation capability with positional feedback |
US10643611B2 (en) | 2008-10-02 | 2020-05-05 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US11900936B2 (en) | 2008-10-02 | 2024-02-13 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US11348582B2 (en) | 2008-10-02 | 2022-05-31 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US12087308B2 (en) | 2010-01-18 | 2024-09-10 | Apple Inc. | Intelligent automated assistant |
US11423886B2 (en) | 2010-01-18 | 2022-08-23 | Apple Inc. | Task flow identification based on user intent |
US10741185B2 (en) | 2010-01-18 | 2020-08-11 | Apple Inc. | Intelligent automated assistant |
US10692504B2 (en) | 2010-02-25 | 2020-06-23 | Apple Inc. | User profiling for voice input processing |
US10417405B2 (en) | 2011-03-21 | 2019-09-17 | Apple Inc. | Device access using voice authentication |
US11350253B2 (en) | 2011-06-03 | 2022-05-31 | Apple Inc. | Active transport based notifications |
US11120372B2 (en) | 2011-06-03 | 2021-09-14 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US11069336B2 (en) | 2012-03-02 | 2021-07-20 | Apple Inc. | Systems and methods for name pronunciation |
US11321116B2 (en) | 2012-05-15 | 2022-05-03 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US11269678B2 (en) | 2012-05-15 | 2022-03-08 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US10978090B2 (en) | 2013-02-07 | 2021-04-13 | Apple Inc. | Voice trigger for a digital assistant |
US11636869B2 (en) | 2013-02-07 | 2023-04-25 | Apple Inc. | Voice trigger for a digital assistant |
US10714117B2 (en) | 2013-02-07 | 2020-07-14 | Apple Inc. | Voice trigger for a digital assistant |
US11557310B2 (en) | 2013-02-07 | 2023-01-17 | Apple Inc. | Voice trigger for a digital assistant |
US12009007B2 (en) | 2013-02-07 | 2024-06-11 | Apple Inc. | Voice trigger for a digital assistant |
US11862186B2 (en) | 2013-02-07 | 2024-01-02 | Apple Inc. | Voice trigger for a digital assistant |
US11388291B2 (en) | 2013-03-14 | 2022-07-12 | Apple Inc. | System and method for processing voicemail |
US11798547B2 (en) | 2013-03-15 | 2023-10-24 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
US10657961B2 (en) | 2013-06-08 | 2020-05-19 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
US10769385B2 (en) | 2013-06-09 | 2020-09-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
US11727219B2 (en) | 2013-06-09 | 2023-08-15 | Apple Inc. | System and method for inferring user intent from speech inputs |
US11048473B2 (en) | 2013-06-09 | 2021-06-29 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US12073147B2 (en) | 2013-06-09 | 2024-08-27 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US12010262B2 (en) | 2013-08-06 | 2024-06-11 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
US11314370B2 (en) | 2013-12-06 | 2022-04-26 | Apple Inc. | Method for extracting salient dialog usage from live data |
US11810562B2 (en) | 2014-05-30 | 2023-11-07 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US10699717B2 (en) | 2014-05-30 | 2020-06-30 | Apple Inc. | Intelligent assistant for home automation |
US11670289B2 (en) | 2014-05-30 | 2023-06-06 | Apple Inc. | Multi-command single utterance input method |
US10657966B2 (en) | 2014-05-30 | 2020-05-19 | Apple Inc. | Better resolution when referencing to concepts |
US11133008B2 (en) | 2014-05-30 | 2021-09-28 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US10417344B2 (en) | 2014-05-30 | 2019-09-17 | Apple Inc. | Exemplar-based natural language processing |
US12067990B2 (en) | 2014-05-30 | 2024-08-20 | Apple Inc. | Intelligent assistant for home automation |
US10083690B2 (en) | 2014-05-30 | 2018-09-25 | Apple Inc. | Better resolution when referencing to concepts |
US10878809B2 (en) | 2014-05-30 | 2020-12-29 | Apple Inc. | Multi-command single utterance input method |
US11257504B2 (en) | 2014-05-30 | 2022-02-22 | Apple Inc. | Intelligent assistant for home automation |
US10714095B2 (en) | 2014-05-30 | 2020-07-14 | Apple Inc. | Intelligent assistant for home automation |
US12118999B2 (en) | 2014-05-30 | 2024-10-15 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US10497365B2 (en) | 2014-05-30 | 2019-12-03 | Apple Inc. | Multi-command single utterance input method |
US11699448B2 (en) | 2014-05-30 | 2023-07-11 | Apple Inc. | Intelligent assistant for home automation |
US11516537B2 (en) | 2014-06-30 | 2022-11-29 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US11838579B2 (en) | 2014-06-30 | 2023-12-05 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US10431204B2 (en) | 2014-09-11 | 2019-10-01 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10438595B2 (en) | 2014-09-30 | 2019-10-08 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US10390213B2 (en) | 2014-09-30 | 2019-08-20 | Apple Inc. | Social reminders |
US10453443B2 (en) | 2014-09-30 | 2019-10-22 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US11231904B2 (en) | 2015-03-06 | 2022-01-25 | Apple Inc. | Reducing response latency of intelligent automated assistants |
US10930282B2 (en) | 2015-03-08 | 2021-02-23 | Apple Inc. | Competing devices responding to voice triggers |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US10529332B2 (en) | 2015-03-08 | 2020-01-07 | Apple Inc. | Virtual assistant activation |
US11087759B2 (en) | 2015-03-08 | 2021-08-10 | Apple Inc. | Virtual assistant activation |
US10311871B2 (en) | 2015-03-08 | 2019-06-04 | Apple Inc. | Competing devices responding to voice triggers |
US11842734B2 (en) | 2015-03-08 | 2023-12-12 | Apple Inc. | Virtual assistant activation |
US11468282B2 (en) | 2015-05-15 | 2022-10-11 | Apple Inc. | Virtual assistant in a communication session |
US12001933B2 (en) | 2015-05-15 | 2024-06-04 | Apple Inc. | Virtual assistant in a communication session |
US11070949B2 (en) | 2015-05-27 | 2021-07-20 | Apple Inc. | Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display |
US11127397B2 (en) | 2015-05-27 | 2021-09-21 | Apple Inc. | Device voice control |
US10356243B2 (en) | 2015-06-05 | 2019-07-16 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US10681212B2 (en) | 2015-06-05 | 2020-06-09 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
WO2016209499A1 (en) * | 2015-06-25 | 2016-12-29 | Intel Corporation | Speech recognition services |
US11947873B2 (en) | 2015-06-29 | 2024-04-02 | Apple Inc. | Virtual assistant for media playback |
US11010127B2 (en) | 2015-06-29 | 2021-05-18 | Apple Inc. | Virtual assistant for media playback |
US11500672B2 (en) | 2015-09-08 | 2022-11-15 | Apple Inc. | Distributed personal assistant |
US11809483B2 (en) | 2015-09-08 | 2023-11-07 | Apple Inc. | Intelligent automated assistant for media search and playback |
US11550542B2 (en) | 2015-09-08 | 2023-01-10 | Apple Inc. | Zero latency digital assistant |
US11954405B2 (en) | 2015-09-08 | 2024-04-09 | Apple Inc. | Zero latency digital assistant |
US11853536B2 (en) | 2015-09-08 | 2023-12-26 | Apple Inc. | Intelligent automated assistant in a media environment |
US11126400B2 (en) | 2015-09-08 | 2021-09-21 | Apple Inc. | Zero latency digital assistant |
US12051413B2 (en) | 2015-09-30 | 2024-07-30 | Apple Inc. | Intelligent device identification |
US11526368B2 (en) | 2015-11-06 | 2022-12-13 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11809886B2 (en) | 2015-11-06 | 2023-11-07 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10956666B2 (en) | 2015-11-09 | 2021-03-23 | Apple Inc. | Unconventional virtual assistant interactions |
US11886805B2 (en) | 2015-11-09 | 2024-01-30 | Apple Inc. | Unconventional virtual assistant interactions |
US10354652B2 (en) | 2015-12-02 | 2019-07-16 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10942703B2 (en) | 2015-12-23 | 2021-03-09 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US11227589B2 (en) | 2016-06-06 | 2022-01-18 | Apple Inc. | Intelligent list reading |
US11069347B2 (en) | 2016-06-08 | 2021-07-20 | Apple Inc. | Intelligent automated assistant for media exploration |
US11037565B2 (en) | 2016-06-10 | 2021-06-15 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US11657820B2 (en) | 2016-06-10 | 2023-05-23 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10733993B2 (en) | 2016-06-10 | 2020-08-04 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US11152002B2 (en) | 2016-06-11 | 2021-10-19 | Apple Inc. | Application integration with a digital assistant |
US10580409B2 (en) | 2016-06-11 | 2020-03-03 | Apple Inc. | Application integration with a digital assistant |
US10942702B2 (en) | 2016-06-11 | 2021-03-09 | Apple Inc. | Intelligent device arbitration and control |
US11749275B2 (en) | 2016-06-11 | 2023-09-05 | Apple Inc. | Application integration with a digital assistant |
US11809783B2 (en) | 2016-06-11 | 2023-11-07 | Apple Inc. | Intelligent device arbitration and control |
US10825447B2 (en) | 2016-06-23 | 2020-11-03 | Huawei Technologies Co., Ltd. | Method and apparatus for optimizing model applicable to pattern recognition, and terminal device |
JP2019528502A (en) * | 2016-06-23 | 2019-10-10 | Huawei Technologies Co., Ltd. | Method and apparatus for optimizing a model applicable to pattern recognition, and terminal device |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US10553215B2 (en) | 2016-09-23 | 2020-02-04 | Apple Inc. | Intelligent automated assistant |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
US11656884B2 (en) | 2017-01-09 | 2023-05-23 | Apple Inc. | Application integration with a digital assistant |
US10741181B2 (en) | 2017-05-09 | 2020-08-11 | Apple Inc. | User interface for correcting recognition errors |
US10332518B2 (en) | 2017-05-09 | 2019-06-25 | Apple Inc. | User interface for correcting recognition errors |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
US10847142B2 (en) | 2017-05-11 | 2020-11-24 | Apple Inc. | Maintaining privacy of personal information |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
US10755703B2 (en) | 2017-05-11 | 2020-08-25 | Apple Inc. | Offline personal assistant |
US11467802B2 (en) | 2017-05-11 | 2022-10-11 | Apple Inc. | Maintaining privacy of personal information |
US11599331B2 (en) | 2017-05-11 | 2023-03-07 | Apple Inc. | Maintaining privacy of personal information |
US10410637B2 (en) | 2017-05-12 | 2019-09-10 | Apple Inc. | User-specific acoustic models |
US11538469B2 (en) | 2017-05-12 | 2022-12-27 | Apple Inc. | Low-latency intelligent automated assistant |
US11405466B2 (en) | 2017-05-12 | 2022-08-02 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US11862151B2 (en) | 2017-05-12 | 2024-01-02 | Apple Inc. | Low-latency intelligent automated assistant |
US10789945B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Low-latency intelligent automated assistant |
US11837237B2 (en) | 2017-05-12 | 2023-12-05 | Apple Inc. | User-specific acoustic models |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
US11580990B2 (en) | 2017-05-12 | 2023-02-14 | Apple Inc. | User-specific acoustic models |
US11380310B2 (en) | 2017-05-12 | 2022-07-05 | Apple Inc. | Low-latency intelligent automated assistant |
US10791176B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US12014118B2 (en) | 2017-05-15 | 2024-06-18 | Apple Inc. | Multi-modal interfaces having selection disambiguation and text modification capability |
US10482874B2 (en) | 2017-05-15 | 2019-11-19 | Apple Inc. | Hierarchical belief states for digital assistants |
US10810274B2 (en) | 2017-05-15 | 2020-10-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
US11217255B2 (en) | 2017-05-16 | 2022-01-04 | Apple Inc. | Far-field extension for digital assistant services |
US12026197B2 (en) | 2017-05-16 | 2024-07-02 | Apple Inc. | Intelligent automated assistant for media exploration |
US10909171B2 (en) | 2017-05-16 | 2021-02-02 | Apple Inc. | Intelligent automated assistant for media exploration |
US11532306B2 (en) | 2017-05-16 | 2022-12-20 | Apple Inc. | Detecting a trigger of a digital assistant |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
US11675829B2 (en) | 2017-05-16 | 2023-06-13 | Apple Inc. | Intelligent automated assistant for media exploration |
US10748546B2 (en) | 2017-05-16 | 2020-08-18 | Apple Inc. | Digital assistant services based on device capabilities |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US10303715B2 (en) | 2017-05-16 | 2019-05-28 | Apple Inc. | Intelligent automated assistant for media exploration |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US11710482B2 (en) | 2018-03-26 | 2023-07-25 | Apple Inc. | Natural assistant interaction |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11907436B2 (en) | 2018-05-07 | 2024-02-20 | Apple Inc. | Raise to speak |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US11169616B2 (en) | 2018-05-07 | 2021-11-09 | Apple Inc. | Raise to speak |
US11487364B2 (en) | 2018-05-07 | 2022-11-01 | Apple Inc. | Raise to speak |
US11900923B2 (en) | 2018-05-07 | 2024-02-13 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11854539B2 (en) | 2018-05-07 | 2023-12-26 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
US11495218B2 (en) | 2018-06-01 | 2022-11-08 | Apple Inc. | Virtual assistant operation in multi-device environments |
US11360577B2 (en) | 2018-06-01 | 2022-06-14 | Apple Inc. | Attention aware virtual assistant dismissal |
US12080287B2 (en) | 2018-06-01 | 2024-09-03 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11431642B2 (en) | 2018-06-01 | 2022-08-30 | Apple Inc. | Variable latency device coordination |
US11630525B2 (en) | 2018-06-01 | 2023-04-18 | Apple Inc. | Attention aware virtual assistant dismissal |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
US12067985B2 (en) | 2018-06-01 | 2024-08-20 | Apple Inc. | Virtual assistant operations in multi-device environments |
US11009970B2 (en) | 2018-06-01 | 2021-05-18 | Apple Inc. | Attention aware virtual assistant dismissal |
US10720160B2 (en) | 2018-06-01 | 2020-07-21 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10403283B1 (en) | 2018-06-01 | 2019-09-03 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
US10984798B2 (en) | 2018-06-01 | 2021-04-20 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10684703B2 (en) | 2018-06-01 | 2020-06-16 | Apple Inc. | Attention aware virtual assistant dismissal |
US12061752B2 (en) | 2018-06-01 | 2024-08-13 | Apple Inc. | Attention aware virtual assistant dismissal |
US10496705B1 (en) | 2018-06-03 | 2019-12-03 | Apple Inc. | Accelerated task performance |
US10944859B2 (en) | 2018-06-03 | 2021-03-09 | Apple Inc. | Accelerated task performance |
US10504518B1 (en) | 2018-06-03 | 2019-12-10 | Apple Inc. | Accelerated task performance |
US11010561B2 (en) | 2018-09-27 | 2021-05-18 | Apple Inc. | Sentiment prediction from textual data |
US11893992B2 (en) | 2018-09-28 | 2024-02-06 | Apple Inc. | Multi-modal inputs for voice commands |
US11170166B2 (en) | 2018-09-28 | 2021-11-09 | Apple Inc. | Neural typographical error modeling via generative adversarial networks |
US11462215B2 (en) | 2018-09-28 | 2022-10-04 | Apple Inc. | Multi-modal inputs for voice commands |
US10839159B2 (en) | 2018-09-28 | 2020-11-17 | Apple Inc. | Named entity normalization in a spoken dialog system |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
US11638059B2 (en) | 2019-01-04 | 2023-04-25 | Apple Inc. | Content playback on multiple devices |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
US11783815B2 (en) | 2019-03-18 | 2023-10-10 | Apple Inc. | Multimodality in digital assistant systems |
US12136419B2 (en) | 2019-03-18 | 2024-11-05 | Apple Inc. | Multimodality in digital assistant systems |
US11705130B2 (en) | 2019-05-06 | 2023-07-18 | Apple Inc. | Spoken notifications |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US11675491B2 (en) | 2019-05-06 | 2023-06-13 | Apple Inc. | User configurable task triggers |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11217251B2 (en) | 2019-05-06 | 2022-01-04 | Apple Inc. | Spoken notifications |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
US11888791B2 (en) | 2019-05-21 | 2024-01-30 | Apple Inc. | Providing message response suggestions |
US11360739B2 (en) | 2019-05-31 | 2022-06-14 | Apple Inc. | User activity shortcut suggestions |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
US11657813B2 (en) | 2019-05-31 | 2023-05-23 | Apple Inc. | Voice identification in digital assistant systems |
US11237797B2 (en) | 2019-05-31 | 2022-02-01 | Apple Inc. | User activity shortcut suggestions |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US11790914B2 (en) | 2019-06-01 | 2023-10-17 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11488406B2 (en) | 2019-09-25 | 2022-11-01 | Apple Inc. | Text detection using global geometry estimators |
CN110765105A (en) * | 2019-10-14 | 2020-02-07 | Gree Electric Appliances, Inc. of Zhuhai | Method, device, equipment and medium for establishing wake-up instruction database |
US11765209B2 (en) | 2020-05-11 | 2023-09-19 | Apple Inc. | Digital assistant hardware abstraction |
US11924254B2 (en) | 2020-05-11 | 2024-03-05 | Apple Inc. | Digital assistant hardware abstraction |
US11914848B2 (en) | 2020-05-11 | 2024-02-27 | Apple Inc. | Providing relevant data items based on context |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US20230051062A1 (en) * | 2020-05-26 | 2023-02-16 | Apple Inc. | Personalized voices for text messaging |
US11508380B2 (en) | 2020-05-26 | 2022-11-22 | Apple Inc. | Personalized voices for text messaging |
US20210375290A1 (en) * | 2020-05-26 | 2021-12-02 | Apple Inc. | Personalized voices for text messaging |
US11838734B2 (en) | 2020-07-20 | 2023-12-05 | Apple Inc. | Multi-device audio adjustment coordination |
US11696060B2 (en) | 2020-07-21 | 2023-07-04 | Apple Inc. | User identification using headphones |
US11750962B2 (en) | 2020-07-21 | 2023-09-05 | Apple Inc. | User identification using headphones |
Similar Documents
Publication | Title |
---|---|
WO2014096506A1 (en) | Method, apparatus, and computer program product for personalizing speech recognition |
CN107644638B (en) | Audio recognition method, device, terminal and computer readable storage medium | |
KR102100389B1 (en) | Personalized entity pronunciation learning | |
US20210264916A1 (en) | Electronic device for generating personalized asr model and method for operating same | |
EP3195310B1 (en) | Keyword detection using speaker-independent keyword models for user-designated keywords | |
KR101786533B1 (en) | Multi-level speech recofnition | |
EP3413305A1 (en) | Dual mode speech recognition | |
US10705789B2 (en) | Dynamic volume adjustment for virtual assistants | |
EP3340239A1 (en) | Electronic device and speech recognition method therefor | |
CN112970059B (en) | Electronic device for processing user utterance and control method thereof | |
CN112470217A (en) | Method for determining electronic device to perform speech recognition and electronic device | |
WO2019213443A1 (en) | Audio analytics for natural language processing | |
US9653073B2 (en) | Voice input correction | |
TWI682385B (en) | Speech service control apparatus and method thereof | |
CN111261151B (en) | Voice processing method and device, electronic equipment and storage medium | |
US10170122B2 (en) | Speech recognition method, electronic device and speech recognition system | |
CN107544271A (en) | Terminal control method, device and computer-readable recording medium | |
US10535337B2 (en) | Method for correcting false recognition contained in recognition result of speech of user | |
AU2019201441B2 (en) | Electronic device for processing user voice input | |
CN114223029A (en) | Server supporting device to perform voice recognition and operation method of server | |
CN112334978A (en) | Electronic device supporting personalized device connection and method thereof | |
CN111640429B (en) | Method for providing voice recognition service and electronic device for the same | |
CN110942779A (en) | Noise processing method, device and system | |
KR20190122457A (en) | Electronic device for performing speech recognition and the method for the same | |
US20220270617A1 (en) | Electronic device for supporting artificial intelligence agent services to talk to users |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | EP: The EPO has been informed by WIPO that EP was designated in this application | Ref document number: 12890352; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | EP: PCT application non-entry in European phase | Ref document number: 12890352; Country of ref document: EP; Kind code of ref document: A1 |