Disclosure of Invention
The present invention provides a method, an apparatus, a server and a storage medium for dual-record quality inspection, so as to accurately determine the authenticity of a remote user's identity.
In order to achieve the above purpose, the technical solutions adopted in the embodiments of the present application are as follows:
In a first aspect, an embodiment of the present application provides a dual-record quality inspection method, including:
extracting the features of a face image in a first video image collected by a client by adopting a preset face recognition model to obtain first face features;
performing feature extraction on the voice data acquired by the client by adopting a preset voiceprint recognition model to obtain voiceprint features;
performing feature extraction on the voice data by adopting a preset voice recognition model to obtain a voice text;
processing by adopting a preset multi-mode recognition model according to the first face feature, the voiceprint feature and the feature of the voice text to obtain an identity confirmation result of the target user, wherein the identity confirmation result is used for indicating whether the target user confirms that the identity is real and unique;
the multi-modal recognition model is obtained by training in advance by adopting sample face features, sample voiceprint features and features of a sample voice text, wherein the sample face features have marking information of whether the sample face features are real faces, the sample voiceprint features have marking information of whether the sample voiceprint features are real voices, and the features of the sample voice text have marking information of whether the sample voice text is confirmed voices.
Optionally, before the processing is performed by using a preset multi-modal recognition model according to the first face feature, the voiceprint feature and the feature of the voice text to obtain an identity confirmation result of the target user, the method further includes:
according to the start-stop time periods of all text segments in the voice text, intercepting a lip language image frame sequence corresponding to the start-stop time periods from the first video image;
judging whether the action of the lip language image frame sequence is matched with the preset lip language action corresponding to each text segment or not by adopting a preset lip language identification model to obtain a lip language matching result of the target user;
the processing by adopting a preset multi-mode recognition model according to the first face feature, the voiceprint feature and the feature of the voice text to obtain the identity confirmation result of the target user comprises the following steps:
and if the lip language matching result passes, processing by adopting the multi-mode recognition model according to the first face feature, the voiceprint feature and the feature of the voice text to obtain the identity confirmation result.
Optionally, before the capturing, according to the start-stop time period of each text segment in the voice text, a sequence of lip language image frames corresponding to the start-stop time period from the first video image, the method further includes:
acquiring the recorded voiceprint features of the target user from a preset voiceprint database;
comparing the voiceprint features with the recorded voiceprint features to obtain a recorded voiceprint comparison result of the target user;
the capturing a sequence of lip language image frames corresponding to the start-stop time period from the first video image according to the start-stop time period of each text segment in the voice text includes:
and if the recorded voiceprint comparison result passes, intercepting the sequence of the lip language image frames from the first video image according to the starting and stopping time period.
Optionally, before the obtaining the recorded voiceprint feature of the target user from the preset voiceprint database, the method further includes:
acquiring a record face image of the target user from a preset face database;
adopting the face recognition model to extract face features of the recorded face image to obtain second face features;
comparing the first face features with the second face features to obtain a recorded face comparison result of the target user;
the obtaining of the recorded voiceprint features of the target user from a preset voiceprint database includes:
and if the recorded face comparison result passes, obtaining the recorded voiceprint features of the target user from the voiceprint database.
Optionally, before the obtaining of the record face image of the target user from the preset face database, the method further includes:
carrying out region detection on the second video image acquired by the client by adopting a preset identity card detection model to obtain an identity card text region;
performing character recognition on the text area of the identity card by adopting a preset identity character recognition model to obtain first identity character information;
comparing the first identity character information with second identity character information of the target user in a preset identity information database to obtain an identity card character comparison result;
the obtaining of the record face image of the target user from a preset face database includes:
and if the character comparison result of the identity card passes, acquiring the record face image from the face database.
Optionally, the performing region detection on the second video image acquired by the client by adopting a preset identity card detection model to obtain an identity card text region includes:
performing region detection on the second video image by adopting the identity card detection model to obtain an identity card face portrait region and the identity card text region;
the acquiring the record face image from the face database if the identity card character comparison result passes includes:
adopting the face recognition model to extract face features of the face portrait area of the identity card to obtain third face features;
comparing the first face features with the third face features to obtain an identity card face comparison result of the target user;
and if the identity card face comparison result and the identity card character comparison result both pass, acquiring the record face image from the face database.
Optionally, the method further includes:
performing action detection on the first video image by adopting a preset facial action model to obtain an action detection result of the first video image;
if the action detection result indicates that the number of recognized preset facial actions is greater than or equal to a preset number threshold, and the identity confirmation result passes, determining that the target user is a living body with a unique identity.
In a second aspect, an embodiment of the present application further provides a dual-record quality inspection apparatus, where the apparatus includes:
the first face recognition module is used for extracting features of a face image in a first video image collected by a client by adopting a preset face recognition model to obtain first face features;
the voiceprint recognition module is used for performing feature extraction on the voice data collected by the client by adopting a preset voiceprint recognition model to obtain voiceprint features;
the voice recognition module is used for performing feature extraction on the voice data by adopting a preset voice recognition model to obtain a voice text;
the identity confirmation module is used for processing the first face feature, the voiceprint feature and the feature of the voice text by adopting a preset multi-mode recognition model to obtain an identity confirmation result of the target user, wherein the identity confirmation result is used for indicating whether the target user confirms that the identity is real and unique;
the multi-modal recognition model is obtained by training in advance by adopting sample face features, sample voiceprint features and features of a sample voice text, wherein the sample face features have marking information of whether the sample face features are real faces, the sample voiceprint features have marking information of whether the sample voiceprint features are real voices, and the features of the sample voice text have marking information of whether the sample voice text is confirmed voices.
Optionally, before the identity confirmation module, the apparatus further includes:
a lip language image frame acquisition module, configured to intercept, from the first video image, a lip language image frame sequence corresponding to a start-stop time period according to the start-stop time period of each text segment in the voice text;
the lip language identification module is used for judging whether the action of the lip language image frame sequence is matched with the preset lip language action corresponding to each text segment or not by adopting a preset lip language identification model to obtain a lip language matching result of the target user;
and the identity confirmation module is used for processing, by adopting the multi-modal recognition model, the first face feature, the voiceprint feature and the feature of the voice text to obtain the identity confirmation result if the lip language matching result passes.
Optionally, before the lip language image frame acquiring module, the apparatus further includes:
a record voiceprint feature acquisition module, configured to acquire a record voiceprint feature of the target user from a preset voiceprint database;
the voiceprint comparison module is used for comparing the voiceprint features with the recorded voiceprint features to obtain a recorded voiceprint comparison result of the target user;
and the lip language image frame acquisition module is used for intercepting the lip language image frame sequence from the first video image according to the start-stop time period if the recorded voiceprint comparison result passes.
Optionally, before the recording voiceprint feature obtaining module, the apparatus further includes:
the record face image acquisition module is used for acquiring a record face image of the target user from a preset face database;
the second face recognition module is used for extracting face features of the recorded face image by adopting the face recognition model to obtain second face features;
the face comparison module is used for comparing the first face feature with the second face feature to obtain a recorded face comparison result of the target user;
and the recorded voiceprint feature acquisition module is used for acquiring the recorded voiceprint features of the target user from the voiceprint database if the recorded face comparison result passes.
Optionally, before the record face image acquisition module, the apparatus further includes:
the identity card detection module is used for carrying out region detection on the second video image acquired by the client by adopting a preset identity card detection model to obtain an identity card text region;
the character recognition module is used for carrying out character recognition on the text area of the identity card by adopting a preset identity character recognition model to obtain first identity character information;
the character comparison module is used for comparing the first identity character information with second identity character information of the target user in a preset identity information database to obtain an identity card character comparison result;
and the record face image acquisition module is used for acquiring the record face image from the face database if the character comparison result of the identity card passes.
Optionally, the identity card detecting module includes:
the head portrait and text detection unit is used for carrying out region detection on the second video image by adopting the identity card detection model to obtain an identity card face head portrait region and the identity card text region;
the record face image acquisition module comprises:
the face feature recognition unit is used for extracting face features of the face portrait area of the identity card by adopting the face recognition model to obtain third face features;
the face feature comparison unit is used for comparing the first face feature with the third face feature to obtain an identity card face comparison result of the target user;
and the record face image acquisition unit is used for acquiring the record face image from the face database if the identity card face comparison result and the identity card character comparison result both pass.
Optionally, the apparatus further comprises:
the facial action detection module is used for performing action detection on the first video image by adopting a preset facial action model to obtain an action detection result of the first video image;
and the living body confirmation module is used for determining that the target user is a living body with a unique identity if the action detection result indicates that the number of recognized preset facial actions is greater than or equal to a preset number threshold and the identity confirmation result passes.
In a third aspect, an embodiment of the present application further provides a server, including: a processor and a memory, wherein the memory stores program instructions executable by the processor; when the server runs, the processor executes the program instructions stored in the memory to perform the steps of any one of the dual-record quality inspection methods described above.
In a fourth aspect, an embodiment of the present application further provides a storage medium, where the storage medium stores a computer program, and when the computer program is executed by a processor, the steps of any one of the dual-record quality inspection methods described above are performed.
The beneficial effects of this application are as follows:
according to the double-record quality inspection method, the double-record quality inspection device, the server and the storage medium, the face recognition model is adopted to extract the first face feature in the first video image, the voiceprint recognition model is adopted to extract the voiceprint feature in the voice data, the voice recognition model is adopted to extract the voice text in the voice data, and according to the first face feature, the voiceprint feature and the feature of the voice text, the multi-modal recognition model is adopted to obtain the identity confirmation result of the target user so as to indicate whether the target user confirms that the identity is real and unique. By the method provided by the embodiment of the application, a multi-modal recognition model can be adopted to process the first face feature, the voiceprint feature and the feature of the voice text to obtain the probability value indicating whether the target user confirms the real and unique identity, so that the effective combination of cross-media data, namely video images, voice data and text data, can be realized; the user does not need to be instructed to execute an appointed action, the real and unique identity of the user is monitored in the whole process without the user perceiving it, the user experience is improved, and the smoothness of business handling is improved. Multiple deep learning models are adopted to carry out multi-directional recognition on the video image, voice data and text data of the user, and the multi-modal fusion model is fully utilized to recognize the identity of the user, which effectively guarantees the reliability of user identity authentication in the remote double-recording quality inspection process, prevents others from forging identity information to carry out illegal operations, and provides a safety guarantee for remote double-recording quality inspection.
In addition, automatic quality inspection is realized through the server, manual quality inspection is replaced, quality inspection cost is reduced, quality inspection efficiency and quality inspection accuracy are improved, and false inspection and missing inspection caused by manual quality inspection are avoided.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, the terms "first", "second", "third", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance.
The double-recording quality inspection method provided by the embodiment of the application can be performed on a server with a double-recording quality inspection function. The server may be an application server corresponding to the client, and the client may be a client application installed on an electronic device such as a mobile phone or a computer device. The application server may be a device where a server application corresponding to the client application is located, and may be a local server or a cloud server.
For example, the dual-recording quality inspection method is executed by the server, and a plurality of deep learning models, such as a face recognition model, a voiceprint recognition model, a voice recognition model, a multi-modal recognition model, a lip language recognition model, an identity card detection model, an identity character recognition model, a facial action model, and a face detection model, are deployed in advance on the server. The face recognition model can be used for extracting face features in face images, the voiceprint recognition model can be used for extracting voiceprint features in voice data, the voice recognition model can be used for extracting voice texts in the voice data, the multi-modal recognition model can be used for confirming user identities, the lip language recognition model can be used for recognizing whether lip actions match the preset lip actions corresponding to the texts, the identity card detection model can be used for extracting an identity card text region and an identity card face portrait region, the identity character recognition model can be used for extracting character information of the identity card text region, the facial action model can be used for recognizing facial actions, and the face detection model can be used for detecting face images in video images.
Fig. 1 is a schematic flowchart of a first dual-recording quality inspection method according to an embodiment of the present disclosure; as shown in fig. 1, the method includes:
s100: and extracting the characteristics of the face image in the first video image collected by the client by adopting a preset face recognition model to obtain first face characteristics.
Specifically, the preset face recognition model is a model trained in advance by using sample face images, and can be used for extracting face features from face images. In practical application, the server can send a dual-recording request instruction to the client, and according to the dual-recording request instruction, the client controls a camera of the device where the client is located to collect video images and controls a voice input device of that device, such as a microphone, to collect voice data. The video images collected by the camera include a video image captured of the user's face, that is, the first video image. After the camera collects the first video image, the client can transmit it to the server, and the server extracts the features of the face image in the first video image through the face recognition model to obtain the first face features.
In one possible example, prior to S100, the method further comprises:
and carrying out face detection on the first video image by adopting a preset face detection module, and deducting the face image from the first video image.
Specifically, the preset face detection model is a model trained in advance by using first sample video images, and can be used for extracting the face image in a video image. Because the first video image may contain other content besides the user's face, such as a background image, in order to avoid interference of the background with face feature extraction, the face detection model may be used to perform face detection on the first video image before face feature extraction, crop the detected face image from the first video image, and transmit the face image to the face recognition model, so that the face recognition model extracts the first face features based on the face image.
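A minimal sketch of this detect-then-crop step is given below; the bounding-box detector is a hypothetical stub, and the fixed box it returns is illustrative only:

```python
def crop_face(frame, detector):
    """Crop the detected face region so background pixels do not
    interfere with face feature extraction (illustrative sketch)."""
    x, y, w, h = detector(frame)             # bounding box (x, y, width, height)
    return [row[x:x + w] for row in frame[y:y + h]]

# toy 4x4 "frame" with a fake detector that returns a fixed box
frame = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
face = crop_face(frame, detector=lambda f: (1, 1, 2, 2))
# face is the 2x2 sub-image [[5, 6], [9, 10]]
```

A real system would obtain the bounding box from the trained face detection model rather than a fixed stub.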
S200: and performing feature extraction on the voice data acquired by the client by adopting a preset voiceprint recognition model to obtain voiceprint features.
Specifically, the preset voiceprint recognition model is a model obtained by training in advance by using sample voice data, and can be used for extracting voiceprint features in the voice data.
The voice data collected by the voice input device may include: the voice input device is used for acquiring voice data aiming at the voice of a user. After the voice data is collected by the voice input equipment, the voice data can be transmitted to the server, and the server adopts a voiceprint recognition model to carry out voiceprint feature extraction on the voice data to obtain voiceprint features.
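As a toy illustration of turning raw voice samples into per-frame features (a real voiceprint model would compute spectral features and a learned utterance embedding; the frame length and energy statistic here are assumptions):

```python
def frame_signal(samples, frame_len):
    """Split raw audio samples into fixed-length frames
    (any trailing partial frame is dropped)."""
    n = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n)]

def toy_voiceprint(samples, frame_len=4):
    # hypothetical stand-in feature: mean absolute energy per frame;
    # a trained voiceprint model would output a neural embedding instead
    frames = frame_signal(samples, frame_len)
    return [sum(abs(s) for s in f) / frame_len for f in frames]

vp = toy_voiceprint([1, -1, 2, -2, 3, -3, 4, -4], frame_len=4)
# vp == [1.5, 3.5]
```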
S300: and performing feature extraction on the voice data by adopting a preset voice recognition model to obtain a voice text.
Specifically, the preset voice recognition model is a model trained in advance with sample voice data, and can be used for extracting a voice text from voice data; the server extracts semantic features from the voice data through the voice recognition model to obtain the voice text. The voice text may be text that characterizes the semantics of the voice data.
For example, the voice data includes an identity confirmation utterance of the user corresponding to the client, such as: "I, XXX, confirm that my identity information is real and valid."
S400: and processing by adopting a preset multi-mode recognition model according to the first face characteristic, the voiceprint characteristic and the characteristic of the voice text to obtain an identity confirmation result of the target user, wherein the identity confirmation result is used for indicating whether the target user confirms that the identity is real and unique.
The multi-modal recognition model is obtained by training in advance by adopting sample face features, sample voiceprint features and features of a sample voice text, wherein the sample face features have marking information of whether the sample face features are real faces, the sample voiceprint features have marking information of whether the sample voiceprint features are real voices, and the features of the sample voice text have marking information of whether the sample voice text is confirmed voices.
Specifically, the multi-modal recognition model comprises a multi-modal fusion layer, a self-attention layer and a fully connected layer. The multi-modal fusion layer can perform multi-modal feature fusion on the extracted first face feature F, the voiceprint feature V and the feature T of the voice text by adopting a preset fusion method to obtain multi-modal fusion features. For example, the preset fusion method may be a tensor product fusion method, and the fused multi-modal fusion feature M is: M = F ⊗ V ⊗ T, where ⊗ denotes the tensor product.
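The tensor product fusion M = F ⊗ V ⊗ T described above can be written out directly on small vectors; the tiny feature vectors below are illustrative only:

```python
def tensor_fusion(F, V, T):
    """Tensor (outer) product fusion of three modality feature vectors:
    M[i][j][k] = F[i] * V[j] * T[k]."""
    return [[[f * v * t for t in T] for v in V] for f in F]

# toy 2-, 2- and 1-dimensional modality features
M = tensor_fusion([1, 2], [3, 4], [5])
# M[1][0][0] == 2 * 3 * 5 == 30
```

In practice the three feature vectors are high-dimensional model outputs, and the fused tensor is flattened before the subsequent layers.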
The self-attention layer can learn the first face feature F, the voiceprint feature V and the feature T of the voice text in the multi-mode fusion feature M by adopting a preset learning method, and a living body identity recognition vector S is used for representing a learning result.
For example, the multi-modal fusion feature M may be multiplied by three preset matrices A, B and C to obtain vectors Q, K and V; the similarity may be calculated by taking the vector inner product of Q and K to obtain a weight vector P; and the Hadamard product of P and V may be computed to obtain the living body identity recognition vector S. The matrices A, B and C are parameter matrices trained in advance.
The fully connected layer can classify the living body identity recognition vector S output by the self-attention layer and output a classification recognition result, and the classification recognition result can be used for representing the identity confirmation result of the user corresponding to the client, namely the target user. For example, the classification recognition result may be a one-dimensional vector, such as a one-dimensional probability value. If the probability value is greater than or equal to a preset probability value, the identity confirmation result is determined to be that the identity of the target user is real and unique; if the probability value is smaller than the preset probability value, the identity confirmation result may be determined to be that the identity of the target user is not real and unique.
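One literal reading of the described self-attention and classification computation, on a flattened fusion vector, might look as follows. The matrices, weights and threshold are invented toy values; standard self-attention would instead use softmax(QKᵀ/√d)·V, whereas this sketch follows the inner-product-plus-Hadamard description given above:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(A, x):
    return [dot(row, x) for row in A]

def toy_attention_classify(M, A, B, C, w, bias, threshold=0.5):
    """Q, K, V from trained matrices A, B, C; a similarity weight from
    the Q-K inner product; Hadamard-style reweighting of V to get the
    living body identity vector S; then a fully connected layer with a
    sigmoid as the classifier (toy sketch, not the claimed model)."""
    Q, K, V = matvec(A, M), matvec(B, M), matvec(C, M)
    p = dot(Q, K)                       # similarity weight
    S = [p * v for v in V]              # living body identity vector
    prob = 1 / (1 + math.exp(-(dot(w, S) + bias)))
    return prob, prob >= threshold      # identity real-and-unique if True

# toy values: identity matrices for A, B, C and an invented classifier
I2 = [[1, 0], [0, 1]]
prob, confirmed = toy_attention_classify([1, 1], I2, I2, I2, w=[1, 1], bias=-3)
# prob == sigmoid(1) ≈ 0.731, so confirmed is True at threshold 0.5
```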
By adopting the multi-modal recognition model to perform multi-modal fusion, self-attention learning and fully connected classification on the first face feature, the voiceprint feature and the feature of the voice text, a classification recognition result representing the identity confirmation result of the user is obtained. Compared with separately recognizing face authenticity with a face recognition model, voiceprint authenticity with a voiceprint recognition model and the semantics of the voice text with a semantic recognition model, and then combining the three to judge the identity confirmation result of the target user, the multi-modal recognition model performs multi-modal fusion, learning and recognition on the features jointly; through mutual assistance and supplement among the modalities, the accuracy of the identity confirmation result is higher, and the safety and reliability of double-record quality inspection are guaranteed.
According to the double-record quality inspection method provided by the embodiment of the application, a face recognition model is adopted to extract a first face feature in a first video image, a voiceprint recognition model is adopted to extract a voiceprint feature in voice data, a voice recognition model is adopted to extract a voice text in the voice data, and according to the first face feature, the voiceprint feature and the characteristics of the voice text, a multi-mode recognition model is adopted to obtain an identity confirmation result of a target user so as to indicate whether the target user confirms that the identity is real and unique. By the method provided by the embodiment of the application, a multi-mode recognition model can be adopted to process the first face feature, the voiceprint feature and the feature of the voice text to obtain the probability value indicating whether the target user confirms the real and unique identity, so that cross-media data, namely the effective combination of video images, voice data and text data can be realized, the user does not need to be indicated to execute an appointed action, the real and unique identity of the user is monitored in the whole process under the condition that the user does not sense, the user experience is improved, and the business handling smoothness is improved; the method has the advantages that the various deep learning models are adopted to carry out multi-directional identification on the video image, the voice data and the text data of the user, the multi-mode fusion model is fully utilized to identify the identity of the user, the reliability of identity authentication of the user in the remote double-recording quality inspection process is effectively guaranteed, other people are prevented from forging identity information to carry out illegal operation, and safety guarantee is provided for remote double-recording quality inspection.
In addition, automatic quality inspection is realized through the server, manual quality inspection is replaced, quality inspection cost is reduced, quality inspection efficiency and quality inspection accuracy are improved, and false inspection and missing inspection caused by manual quality inspection are avoided.
On the basis of the foregoing embodiment, an embodiment of the present application further provides a dual-record quality inspection method, and fig. 2 is a schematic flow chart of a second dual-record quality inspection method provided in the embodiment of the present application, as shown in fig. 2, before the foregoing S400, the method further includes:
s51: and according to the start-stop time periods of all the text segments in the voice text, intercepting a sequence of lip language image frames corresponding to the start-stop time periods from the first video image.
Specifically, in order to avoid the voice data input by the user being pre-recorded data, the lip language actions in the first video image need to be compared with the preset lip language actions corresponding to the voice text. The voice text obtained in step S300 is divided into a plurality of text segments according to time periods, and the sequence of lip language image frames corresponding to the same time period is intercepted from the first video image according to the start time and end time of each text segment.
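The interception of a lip language image frame sequence for one text segment's start-stop time period can be sketched as frame-index arithmetic; the frame rate, timestamps and frame labels below are illustrative assumptions:

```python
def lip_frames_for_segment(frames, fps, start_s, end_s):
    """Intercept the lip language image frame sub-sequence whose
    timestamps fall in [start_s, end_s) for one text segment
    (illustrative; frame indices are derived from time * fps)."""
    first = int(start_s * fps)
    last = int(end_s * fps)
    return frames[first:last]

# 2 seconds of video at 5 fps, frames labelled by their index
frames = list(range(10))
segment = lip_frames_for_segment(frames, fps=5, start_s=0.4, end_s=1.2)
# segment == [2, 3, 4, 5]
```

Each intercepted sub-sequence would then be paired with its text segment and fed to the lip language recognition model.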
S52: and judging whether the action of the lip language image frame sequence is matched with the preset lip language action corresponding to each text segment or not by adopting a preset lip language identification model to obtain a lip language matching result of the target user.
Specifically, the preset lip language recognition model is a model trained in advance with sample lip language actions and the preset lip language actions corresponding to sample texts, and can be used for comparing a lip language action with a preset lip language action and judging whether the two match. In practical application, the preset lip language actions corresponding to voice texts are stored in the server; each text segment and the lip language image frame sequence corresponding to its start-stop time period are input into the lip language recognition model, which judges whether the preset lip language action of each text segment matches the lip language action in the lip language image frame sequence, and the obtained lip language matching result is used for indicating whether the target user is a living body. If the preset lip language action of a text segment does not match the lip language action in the sequence of lip language image frames, the lip language matching result indicates that the target user is not a living body and the dual-record quality inspection fails; the server then indicates at the client that the lip language quality inspection of the target user has failed, the user authentication is invalid, and re-authentication is needed.
The above S400 includes:
S400a: And if the lip language matching result passes, processing by adopting a multi-mode recognition model according to the first face feature, the voiceprint feature and the feature of the voice text to obtain an identity confirmation result.
Specifically, if the preset lip language action of each text segment is matched with the lip language action in the sequence of lip language image frames, the lip language matching result indicates that the target user is a living body, the first face feature, the voiceprint feature and the feature of the voice text can be input into the multi-modal recognition model, and the identity of the target user is secondarily confirmed to obtain an identity confirmation result.
Before the multi-modal recognition model is used to obtain the identity confirmation result of the target user, a lip language image frame sequence corresponding to the start-stop time period is intercepted from the first video image according to the start-stop time period of each text segment in the voice text, and the lip language recognition model is used to judge whether the action of the lip language image frame sequence matches the preset lip language action corresponding to each text segment, so as to obtain the lip language matching result of the target user. This lip language matching detection judges whether the lip language action of the target user reading a specified text is the same as the preset lip language action of that text, and confirms that the target user is a living body when they are the same. It improves the effect of double-recording quality inspection, prevents other people from impersonating a real user with photos and recordings, and provides a safety guarantee and reliability for remote double-recording quality inspection.
On the basis of any of the foregoing embodiments, an embodiment of the present application further provides a dual-record quality inspection method, and fig. 3 is a schematic flow chart of a third dual-record quality inspection method provided in the embodiment of the present application, as shown in fig. 3, before the foregoing S51, the method further includes:
S41: And acquiring the recorded voiceprint characteristics of the target user from a preset voiceprint database.
Specifically, before lip language recognition, whether the user identity is real can be determined by judging the voiceprint features of the target user. Real recorded voiceprint features of a plurality of users are pre-stored in the preset voiceprint database; each user has a unique user identifier and a recorded voiceprint feature, and the recorded voiceprint feature of a user can be retrieved from the preset voiceprint database according to that user's unique identifier. For example, the preset voiceprint database may be a voiceprint feature base database recorded by users with the public security department, the user identifier is the identification number, and the recorded voiceprint feature of the target user is retrieved from this base database according to the identification number of the target user.
S42: and comparing the voiceprint features with the recorded voiceprint features to obtain a recorded voiceprint comparison result of the target user.
Specifically, the voiceprint feature of the target user extracted by the voiceprint recognition model is compared with the recorded voiceprint feature of the target user to obtain a voiceprint similarity value of the two. If the voiceprint similarity value is smaller than a voiceprint similarity threshold, the recorded voiceprint comparison result of the target user indicates that the voiceprint recognition of the target user is wrong, that is, the target user and the user corresponding to the recorded voiceprint feature are not the same person; the double-recording quality inspection fails, the server indicates at the client that the voiceprint quality inspection of the target user fails, the user authentication is invalid, and re-authentication is needed.
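The threshold check above can be sketched with a standard cosine similarity over voiceprint feature vectors. This is an illustrative assumption: the patent does not specify the similarity measure, and the names `cosine_similarity`, `voiceprint_check` and the 0.8 threshold are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def voiceprint_check(live_feature, enrolled_feature, threshold=0.8):
    """Pass only when the live voiceprint is sufficiently similar
    to the recorded voiceprint feature on file."""
    return cosine_similarity(live_feature, enrolled_feature) >= threshold

print(voiceprint_check([1.0, 0.0], [1.0, 0.0]))   # True  (similarity 1.0)
print(voiceprint_check([1.0, 0.0], [0.0, 1.0]))   # False (similarity 0.0)
```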
The S51 includes:
S51a: And if the recorded voiceprint comparison result passes, intercepting a sequence of lip language image frames from the first video image according to the start-stop time period.
Specifically, if the voiceprint similarity value is greater than or equal to the voiceprint similarity threshold, the recorded voiceprint comparison result of the target user indicates that the voiceprint recognition of the target user is correct, and the target user and the user corresponding to the recorded voiceprint feature are the same person, so that lip language matching can be performed.
According to the double-record quality inspection method provided by the embodiment of the application, before lip language recognition and multi-mode recognition, the recorded voiceprint features of the target user are obtained from the preset voiceprint database, and the voiceprint features and the recorded voiceprint features are compared to obtain the recorded voiceprint comparison result of the target user. By the method, the voiceprint characteristics of the target user extracted from the voice data are compared with the recorded voiceprint characteristics of the target user to determine whether the identity of the target user is real or not, so that other people are prevented from impersonating the target user by using a voice changing device and the like, the reliability of identity authentication of the user in the remote double-record quality inspection process is effectively ensured, other people are prevented from forging identity information to carry out illegal operation, and safety guarantee is provided for remote double-record quality inspection.
On the basis of any of the foregoing embodiments, an embodiment of the present application further provides a dual-record quality inspection method, and fig. 4 is a schematic flow chart of a fourth dual-record quality inspection method provided in the embodiment of the present application, as shown in fig. 4, before the foregoing S41, the method further includes:
S31: And acquiring a recorded face image of the target user from a preset face database.
Specifically, before voiceprint comparison, whether the user identity is real can be determined by judging the face features of the target user. Real recorded face images of a plurality of users are pre-stored in the preset face database; each user has a unique user identifier and a recorded face image, and the recorded face image of a user can be retrieved from the preset face database according to that user's unique identifier. For example, the preset face database may be a face image base of users on record with the public security department, and the recorded face image of the target user is retrieved from this base according to the identification number of the target user.
S32: and adopting a face recognition model to extract face features of the recorded face image to obtain second face features.
Specifically, in order to ensure the accuracy of the comparison result, the face features of the recorded face image and the face image of the first video image can be extracted by using the same face recognition model, and the server extracts the face features in the recorded face image through the face recognition model to obtain the second face features.
S33: and comparing the first face features with the second face features to obtain a recorded face comparison result of the target user.
Specifically, the first face feature is compared with the second face feature to obtain a first face similarity value of the two. If the first face similarity value is smaller than a first face similarity threshold, the recorded face comparison result of the target user indicates that the face recognition of the target user is wrong, that is, the target user and the user corresponding to the recorded face image are not the same person; the double-record quality inspection fails, the server indicates at the client that the face quality inspection of the target user fails, the user authentication is invalid, and re-authentication is needed.
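One common way to obtain such a face similarity value is to map the distance between two embeddings into (0, 1]. The sketch below is an illustrative assumption, not the patent's method; `face_similarity`, `recorded_face_check` and the 0.5 threshold are hypothetical names.

```python
import math

def face_similarity(feat_a, feat_b):
    """Convert the Euclidean distance between two face embeddings into a
    similarity value in (0, 1]; identical features give 1.0."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(feat_a, feat_b)))
    return 1.0 / (1.0 + dist)

def recorded_face_check(first_face_feature, second_face_feature, threshold=0.5):
    """Pass only when the live face feature matches the recorded one."""
    return face_similarity(first_face_feature, second_face_feature) >= threshold

print(recorded_face_check([0.1, 0.2], [0.1, 0.2]))   # True  (similarity 1.0)
print(recorded_face_check([0.0, 0.0], [3.0, 4.0]))   # False (distance 5.0)
```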
The S41 includes:
S41a: And if the recorded face comparison result passes, acquiring the recorded voiceprint features of the target user from the voiceprint database.
Specifically, if the first face similarity value is greater than or equal to the first face similarity threshold, the recorded face comparison result of the target user indicates that the face recognition of the target user is correct, and the target user and the user corresponding to the recorded face image are the same person, then voiceprint comparison can be performed.
According to the double-record quality inspection method, before voiceprint comparison, lip language recognition and multi-modal recognition are carried out, the recorded face image of the target user is obtained from the preset face database, the second face feature of the recorded face image is extracted by the face recognition model, and the first face feature is compared with the second face feature to obtain the recorded face comparison result of the target user. In this way, the first face feature extracted from the first video image is compared with the second face feature of the recorded face image of the target user to confirm whether the identity of the target user is real, which effectively guarantees the reliability of identity authentication in the remote double-record quality inspection process, prevents other people from forging identity information to carry out illegal operations, and provides a safety guarantee for remote double-record quality inspection.
On the basis of any of the foregoing embodiments, an embodiment of the present application further provides a dual-record quality inspection method, and fig. 5 is a schematic flow chart of a fifth dual-record quality inspection method provided in the embodiment of the present application, as shown in fig. 5, before the foregoing S31, the method further includes:
S21: And carrying out region detection on the second video image acquired by the client by adopting a preset identity card detection model to obtain an identity card text region.
Specifically, before the face comparison, whether the user identity is real can be determined by judging the identity text information of the target user. The preset identity card detection model is obtained by training in advance with second sample video images and can be used for detecting an identity card text region in a video image. In practical application, the server may send an identity card presentation instruction to the client; according to this instruction, the client controls the device where it is located to output information reminding the target user to present the identity card. The information may be text information displayed on the display screen of the device, or voice information played by the loudspeaker of the device. The video image then acquired by the camera includes the identity card presented by the user, namely the second video image. After the camera collects the second video image, the client transmits it to the server, and the server performs region detection on the second video image through the identity card detection model to obtain the identity card text region.
S22: and performing character recognition on the text area of the identity card by adopting a preset identity character recognition model to obtain first identity character information.
Specifically, the preset identity character recognition model is a model obtained by training in advance with sample identity card text regions, and may be used to extract identity text information from an identity card text region. After the server obtains the identity card text region by using the identity card detection model, the identity card text region is input into the identity character recognition model, and character recognition is performed on it to obtain the first identity character information. For example, the first identity character information may include: name, sex, nationality, date of birth, identification number, and address.
S23: and comparing the first identity character information with second identity character information of the target user in a preset identity information database to obtain an identity card character comparison result.
Specifically, real identity character information of a plurality of users is pre-stored in the preset identity information database; each user has a unique user identifier and real identity character information, and the real identity character information of the target user can be retrieved from the preset identity information database according to the unique user identifier of the target user and used as the second identity character information of the target user. For example, the preset identity information database may be an identity text information base of users on record with the public security department, and the second identity character information of the target user is retrieved from this base according to the identification number of the target user.
The first identity character information is compared with the second identity character information to obtain an identity similarity value of the two. If the identity similarity value is smaller than an identity similarity threshold, the identity card character comparison result of the target user indicates that the identity character information of the target user is identified wrongly, that is, the target user and the user corresponding to the second identity character information are not the same person; the double-recording quality inspection fails, the server indicates at the client that the quality inspection of the identity character information of the target user fails, the user authentication is invalid, and re-authentication is needed.
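One simple way to compute such an identity similarity value is field-by-field agreement between the OCR result and the database record. The sketch below is an illustrative assumption; the field names, the helper `identity_text_similarity` and the 0.9 threshold are hypothetical, and real systems may weight fields such as the identification number more heavily.

```python
# Hypothetical sketch of the identity character comparison: fraction of
# shared identity fields (name, ID number, ...) whose values agree exactly
# between the first (OCR) and second (database) identity character information.

def identity_text_similarity(first_info, second_info):
    keys = first_info.keys() & second_info.keys()
    if not keys:
        return 0.0
    matched = sum(1 for k in keys if first_info[k] == second_info[k])
    return matched / len(keys)

ocr_result = {"name": "Zhang San", "id_number": "ID0001", "address": "Beijing"}
on_record = {"name": "Zhang San", "id_number": "ID0001", "address": "Shanghai"}
similarity = identity_text_similarity(ocr_result, on_record)
print(round(similarity, 3))    # 0.667 (2 of 3 fields agree)
print(similarity >= 0.9)       # False -> quality inspection fails
```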
The S31 includes:
S31a: And if the identity card character comparison result passes, acquiring a recorded face image from the face database.
Specifically, if the identity similarity value is greater than or equal to the identity similarity threshold, the identity card character comparison result of the target user indicates that the identity character information of the target user is correctly identified, and the target user and the user corresponding to the second identity character information are the same person, so face comparison can be performed.
Before face comparison, voiceprint comparison, lip language recognition and multi-mode recognition are carried out, an identity card text region of a second video image is extracted through an identity card detection model, first identity character information is extracted through an identity character recognition model, and the first identity character information is compared with second identity character information called from an identity information database to obtain an identity card character comparison result of a target user. By the method, the first identity character information of the identity card provided by the target user can be compared with the second identity character information in the identity information database to determine whether the identity card provided by the target user is real or not, so that false identity card authentication is avoided, the reliability of identity authentication of the user in a remote double-recording quality inspection process is effectively ensured, other people are prevented from forging the identity information to carry out illegal operation, and safety guarantee is provided for remote double-recording quality inspection.
On the basis of any of the foregoing embodiments, an embodiment of the present application further provides a dual-record quality inspection method, fig. 6 is a flowchart illustrating a sixth dual-record quality inspection method provided in the embodiment of the present application, and as shown in fig. 6, step S21 includes:
and carrying out region detection on the second video image by adopting an identity card detection model to obtain an identity card face head portrait region and an identity card text region.
Specifically, the identity card detection model can extract, in addition to the identity card text region, an identity card face head portrait region; the server detects the identity card region in the second video image through the identity card detection model to extract the identity card face head portrait region and the identity card text region.
The S31a includes:
S31a1: And adopting a face recognition model to extract face features of the identity card face head portrait region to obtain third face features.
Specifically, in order to ensure the accuracy of the comparison result, the same face recognition model can be adopted to extract the face features of the identity card face head portrait region and of the face image in the first video image; the server extracts the face features in the identity card face head portrait region through the face recognition model to obtain the third face features.
S31a2: And comparing the first face feature with the third face feature to obtain an identity card face comparison result of the target user.
Specifically, the first face feature is compared with the third face feature to obtain a second face similarity value of the two. If the second face similarity value is smaller than a second face similarity threshold, the identity card face comparison result of the target user indicates that the identity card face recognition of the target user is wrong, that is, the target user and the face on the identity card are not the same person; the double-record quality inspection fails, the server indicates at the client that the identity card face quality inspection of the target user fails, the user authentication is invalid, and re-authentication is needed.
S31a3: And if the identity card face comparison result and the identity card character comparison result both pass, acquiring a recorded face image from the face database.
Specifically, if the second face similarity value is greater than or equal to the second face similarity threshold, the identity card face comparison result of the target user indicates that the identity card face recognition of the target user is correct, and the target user and the face on the identity card are the same person, so the identity card face comparison result passes; if, at the same time, the identity card character comparison result obtained in S23 also passes, the face comparison can be performed.
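The S31a3 gate is a simple conjunction of the two earlier results. A minimal sketch, with the helper name `can_fetch_recorded_face` and the thresholds being illustrative assumptions:

```python
# Hypothetical sketch of the S31a3 gate: the recorded face image is fetched
# only when BOTH the identity card face comparison and the identity card
# character comparison pass.

def can_fetch_recorded_face(second_face_similarity, second_face_threshold,
                            id_text_passed):
    id_face_passed = second_face_similarity >= second_face_threshold
    return id_face_passed and id_text_passed

print(can_fetch_recorded_face(0.92, 0.8, True))   # True  -> proceed to S32/S33
print(can_fetch_recorded_face(0.92, 0.8, False))  # False -> re-authentication
print(can_fetch_recorded_face(0.50, 0.8, True))   # False -> re-authentication
```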
According to the double-record quality inspection method, before face comparison, voiceprint comparison, lip language recognition and multi-modal recognition are carried out, the face recognition model is adopted to extract the third face feature of the identity card face head portrait region, and the first face feature is compared with the third face feature to obtain the identity card face comparison result of the target user. In this way, the first face feature in the first video image can be compared with the third face feature from the identity card presented by the target user, so as to confirm that the person and the identity card belong to the same individual, which effectively ensures the reliability of identity authentication in the remote double-record quality inspection process, prevents other people from forging identity information to carry out illegal operations, and provides a safety guarantee for remote double-record quality inspection.
On the basis of any of the foregoing embodiments, an embodiment of the present application further provides a dual-record quality inspection method, and fig. 7 is a schematic flow chart of a seventh dual-record quality inspection method provided in the embodiment of the present application, and as shown in fig. 7, the method further includes:
S500: And performing motion detection on the first video image by adopting a preset face motion model to obtain a motion detection result of the first video image.
Specifically, the preset facial motion model is a model obtained by training a first sample video image in advance, and can be used for detecting facial motion in the video image. After the server completes the identity confirmation of the multi-modal fusion feature of the target user in S400, the server may further detect facial movements of consecutive frames in the first video image by using a facial movement model, to obtain a movement detection result of the first video image, where the movement detection result is used to indicate whether the target user is a living body.
In one possible example, the face motion model is a blink model, and the blink motion of consecutive frames in the first video image is detected by using the blink model to obtain a blink detection result.
In another possible example, the facial motion model is an open-close mouth model, and the open-close mouth model is adopted to detect the mouth opening and closing actions of a plurality of consecutive frames in the first video image to obtain an open-close mouth detection result.
S600: if the action detection result comprises: and if the number of the recognized preset facial actions is larger than or equal to the preset number threshold, and the identity confirmation result passes, determining that the target user is the unique identity of the living body.
Specifically, according to the number of preset facial movements within consecutive frames identified by the facial movement model, if the number of preset facial movements is greater than or equal to a preset number threshold, it is indicated that the target user is a living body, and meanwhile, the identity confirmation result of S400 passes, and it is determined that the target user is the unique identity of the living body.
In one possible example, the facial motion model is a blink model, and based on the number of blinks identified by the blink model in consecutive frames, the motion detection result indicates that the number of recognized blinking actions is greater than or equal to the preset number threshold.
In another possible example, the facial motion model is an open-close mouth model, and based on the number of mouth opening and closing actions identified by the open-close mouth model in consecutive frames, the motion detection result indicates that the number of recognized mouth opening and closing actions is greater than or equal to the preset number threshold.
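The S600 decision combines the action count with the earlier identity confirmation. A minimal sketch, where `final_decision` and the example threshold are illustrative assumptions:

```python
# Hypothetical sketch of the S600 decision: the target user is judged the
# unique identity of a living body only when enough preset facial actions
# (e.g. blinks or mouth openings) were detected in consecutive frames AND
# the multi-modal identity confirmation result passed.

def final_decision(detected_action_count, action_count_threshold,
                   identity_confirmed):
    is_living = detected_action_count >= action_count_threshold
    return is_living and identity_confirmed

print(final_decision(3, 2, True))   # True
print(final_decision(1, 2, True))   # False: too few actions, not a living body
print(final_decision(3, 2, False))  # False: identity confirmation failed
```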
The double-recording quality inspection method provided by the embodiment of the application adopts the facial action model to detect the actions in the first video image to obtain the action detection result of the first video image; if the number of the preset facial actions identified in the action detection result is greater than or equal to the preset number threshold and the identity confirmation result passes, the target user is determined to be the unique identity of the living body. In this way, the facial actions in the first video image are detected without instructing the user to execute a specified action, so whether the target user is a living body is judged without the user's awareness, which improves the user experience; and the target user is determined to be the unique identity of the living body only when the identity confirmation result also passes, which effectively guarantees the reliability of identity authentication in the remote double-recording quality inspection process.
On the basis of any of the foregoing embodiments, an embodiment of the present application further provides a dual-record quality inspection method, and fig. 8 is a schematic flow chart of an eighth dual-record quality inspection method provided in the embodiment of the present application, and as shown in fig. 8, the method includes:
S100: And extracting the characteristics of the face image in the first video image collected by the client by adopting a preset face recognition model to obtain first face characteristics.
S200: and performing feature extraction on the voice data acquired by the client by adopting a preset voiceprint recognition model to obtain voiceprint features.
S300: and performing feature extraction on the voice data by adopting a preset voice recognition model to obtain a voice text.
S21: and carrying out region detection on the second video image acquired by the client by adopting a preset identity card detection model to obtain an identity card text region.
S22: performing character recognition on a text region of the identity card by adopting a preset identity character recognition model to obtain first identity character information;
S23: And comparing the first identity character information with second identity character information of the target user in a preset identity information database to obtain an identity card character comparison result.
S31a: And if the identity card character comparison result passes, acquiring a recorded face image from the face database.
Specifically, S31a may adopt the methods of S31 and S31a1-S31a3, which are not described herein again.
S32: and adopting a face recognition model to extract face features of the recorded face image to obtain second face features.
S33: and comparing the first face features with the second face features to obtain a recorded face comparison result of the target user.
S41a: And if the recorded face comparison result passes, acquiring the recorded voiceprint characteristics of the target user from the voiceprint database.
Specifically, the method of S41 is adopted in S41a, which is not described herein again.
S42: and comparing the voiceprint features with the recorded voiceprint features to obtain a recorded voiceprint comparison result of the target user.
S51a: And if the recorded voiceprint comparison result passes, intercepting a sequence of lip language image frames from the first video image according to the start-stop time period.
Specifically, the method of S51 is adopted in S51a, which is not described herein again.
S52: and judging whether the action of the lip language image frame sequence is matched with the preset lip language action corresponding to each text segment or not by adopting a preset lip language identification model to obtain a lip language matching result of the target user.
S400a: And if the lip language matching result passes, processing by adopting a multi-mode recognition model according to the first face feature, the voiceprint feature and the feature of the voice text to obtain an identity confirmation result.
Specifically, the method of S400 is adopted in S400a, which is not described herein.
S500: and performing motion detection on the first video image by adopting a preset face motion model to obtain a motion detection result of the first video image.
S600: if the action detection result comprises: and if the number of the recognized preset facial actions is larger than or equal to the preset number threshold, and the identity confirmation result passes, determining that the target user is the unique identity of the living body.
On the basis of any of the above embodiments, an embodiment of the present application further provides a dual-recording quality inspection apparatus, and fig. 9 is a schematic structural diagram of the dual-recording quality inspection apparatus provided in the embodiment of the present application, and as shown in fig. 9, the apparatus includes:
the first face recognition module 100 is configured to perform feature extraction on a face image in a first video image acquired by a client by using a preset face recognition model to obtain a first face feature;
the voiceprint recognition module 200 is configured to perform feature extraction on voice data acquired by a client by using a preset voiceprint recognition model to obtain voiceprint features;
the voice recognition module 300 is configured to perform feature extraction on voice data by using a preset voice recognition model to obtain a voice text;
the identity confirmation module 400 is configured to perform processing by using a preset multi-modal recognition model according to the first face feature, the voiceprint feature and the feature of the voice text to obtain an identity confirmation result of the target user, where the identity confirmation result is used to indicate whether the target user confirms that the identity is true and unique;
the multi-modal recognition model is obtained by training in advance by adopting sample face features, sample voiceprint features and features of a sample voice text, wherein the sample face features have marking information of whether the sample face features are real faces, the sample voiceprint features have marking information of whether the sample voiceprint features are real voices, and the features of the sample voice text have marking information of whether the sample voice text is confirmed voices.
Optionally, before the identity confirmation module 400, the apparatus further includes:
the lip language image frame acquisition module is used for intercepting a lip language image frame sequence corresponding to a start-stop time period from the first video image according to the start-stop time period of each text segment in the voice text;
the lip language identification module is used for judging whether the action of the lip language image frame sequence is matched with the preset lip language action corresponding to each text segment or not by adopting a preset lip language identification model to obtain a lip language matching result of the target user;
and the identity confirmation module is used for processing by adopting a multi-mode recognition model according to the first face feature, the voiceprint feature and the feature of the voice text to obtain an identity confirmation result if the lip language matching result passes.
Optionally, before the lip language image frame acquiring module, the apparatus further includes:
the record voiceprint feature acquisition module is used for acquiring record voiceprint features of a target user from a preset voiceprint database;
the voiceprint comparison module is used for comparing the voiceprint characteristics with the recorded voiceprint characteristics to obtain a recorded voiceprint comparison result of the target user;
and the lip language image frame acquisition module is used for intercepting a lip language image frame sequence from the first video image according to the start-stop time period if the recorded voiceprint comparison result passes.
Optionally, before the recording of the voiceprint feature obtaining module, the apparatus further includes:
the system comprises a record face image acquisition module, a record face image acquisition module and a record face image acquisition module, wherein the record face image acquisition module is used for acquiring a record face image of a target user from a preset face database;
the second face recognition module is used for extracting face features of the recorded face image by adopting a face recognition model to obtain second face features;
the face comparison module is used for comparing the first face features with the second face features to obtain a recorded face comparison result of the target user;
and the record voiceprint feature acquisition module is used for acquiring the record voiceprint features of the target user from the voiceprint database if the recorded face comparison result passes.
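The chain of the second face recognition, face comparison, and conditional voiceprint retrieval can be sketched as below; the Euclidean-distance metric, the 0.6 threshold, and the dictionary-backed database are all illustrative assumptions:

```python
import math

def face_comparison_passes(first_feat, second_feat, max_dist=0.6):
    # Euclidean distance between the live (first) and recorded (second)
    # face feature vectors; 0.6 is an illustrative threshold for
    # normalized embeddings, not a value from the specification.
    dist = math.sqrt(sum((a - b) ** 2
                         for a, b in zip(first_feat, second_feat)))
    return dist <= max_dist

def get_recorded_voiceprint(user_id, voiceprint_db, first_feat, second_feat):
    # Fetch the recorded voiceprint only when the face comparison passes.
    if face_comparison_passes(first_feat, second_feat):
        return voiceprint_db.get(user_id)
    return None
```

The returned recorded voiceprint then feeds the voiceprint comparison module described above.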
Optionally, before the record face image acquisition module, the apparatus further includes:
the identity card detection module is used for carrying out region detection on the second video image acquired by the client by adopting a preset identity card detection model to obtain an identity card text region;
the character recognition module is used for carrying out character recognition on the text area of the identity card by adopting a preset identity character recognition model to obtain first identity character information;
the character comparison module is used for comparing the first identity character information with second identity character information of the target user in a preset identity information database to obtain an identity card character comparison result;
and the record face image acquisition module is used for acquiring a record face image from the face database if the character comparison result of the identity card passes.
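A simple form of the character comparison module normalizes the OCR output before matching it against the stored identity record; the field names and normalization rules below are assumptions for illustration:

```python
def id_text_passes(first_info, second_info, fields=("name", "id_number")):
    # Normalize OCR output (strip whitespace, unify case) before comparing
    # with the stored record; the field names here are illustrative.
    def norm(s):
        return "".join(s.split()).upper()
    return all(norm(first_info.get(f, "")) == norm(second_info.get(f, ""))
               for f in fields)
```

Normalization absorbs common OCR artifacts such as stray spaces, so that only substantive mismatches cause the comparison to fail.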
Optionally, the identity card detection module includes:
the head portrait and text detection unit is used for carrying out region detection on the second video image by adopting an identity card detection model to obtain an identity card face head portrait region and an identity card text region;
the record face image acquisition module comprises:
the face feature recognition unit is used for extracting face features of the face portrait area of the identity card by adopting a face recognition model to obtain third face features;
the face characteristic comparison unit is used for comparing the first face characteristic with the third face characteristic to obtain an identity card face comparison result of the target user;
and the record face image acquisition unit is used for acquiring a record face image from the face database if the identity card face comparison result and the identity card character comparison result both pass.
Optionally, the apparatus further comprises:
the facial action detection module is used for performing action detection on the first video image by adopting a preset face action model to obtain an action detection result of the first video image;
and the living body confirmation module is used for determining that the target user is a living body with a real and unique identity if the action detection result indicates that the number of recognized preset facial actions is greater than or equal to a preset number threshold and the identity confirmation result passes.
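The living body confirmation logic reduces to counting recognized preset facial actions and combining that count with the identity confirmation result; in the sketch below the action names and the minimum count are illustrative, not values from the specification:

```python
def is_live_and_unique(detected_actions, identity_confirmed,
                       preset_actions=("blink", "open_mouth", "turn_head"),
                       min_count=2):
    # Count how many of the preset facial actions were recognized in the
    # first video image; action names and min_count are illustrative.
    recognized = sum(1 for action in preset_actions
                     if action in detected_actions)
    return recognized >= min_count and identity_confirmed
```

Requiring multiple distinct actions makes replay of a single recorded gesture insufficient to pass the liveness gate.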
The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
These above modules may be one or more integrated circuits configured to implement the above methods, for example: one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), among others. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
Fig. 10 is a schematic diagram of a server provided in an embodiment of the present application. As shown in Fig. 10, the server 500 includes: a processor 501, a memory 502, and program instructions stored in the memory 502 and executable by the processor 501. When the server 500 runs, the processor 501 executes the program instructions stored in the memory 502 to perform any one of the above method embodiments. The specific implementation and technical effects are similar and are not described herein again.
Optionally, the present invention further provides a storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of any of the above method embodiments are performed.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and shall be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.