CN118673180B - Video content retrieval method based on label retrieval and multi-modal vector - Google Patents


Info

Publication number
CN118673180B
CN118673180B (application CN202411163349.7A)
Authority
CN
China
Prior art keywords
retrieval
score
video content
video
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411163349.7A
Other languages
Chinese (zh)
Other versions
CN118673180A (en)
Inventor
颜涛
余军
朱愚
黄信云
刘让刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Chinamcloud Technology Co ltd
Original Assignee
Chengdu Chinamcloud Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Chinamcloud Technology Co ltd filed Critical Chengdu Chinamcloud Technology Co ltd
Priority to CN202411163349.7A priority Critical patent/CN118673180B/en
Publication of CN118673180A publication Critical patent/CN118673180A/en
Application granted granted Critical
Publication of CN118673180B publication Critical patent/CN118673180B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of video retrieval and discloses a video content retrieval method based on tag retrieval and multi-modal vectors, comprising the following steps: step 100: video processing before retrieval, where video processing refers to extracting tags, semantic vectors and text descriptions from the video content; step 200: ranking and displaying the search results according to the search query input by the user. The method realizes multi-dimensional hybrid retrieval of video content, mitigates the weaknesses of any single model on unusual data, and greatly improves the accuracy of video search through fused retrieval over multi-dimensional data; it can be applied to the management of all kinds of video data assets and has high economic value for the secondary use of such assets.

Description

Video content retrieval method based on label retrieval and multi-modal vector
Technical Field
The invention belongs to the technical field of video retrieval, and particularly relates to a video content retrieval method based on tag retrieval and multi-modal vectors.
Background
With the continuous development of internet technology and the steady growth of network bandwidth, online video has attracted a broad user base with its convenient access, diverse sources and real-time updates, making it an essential part of users' online lives. As the volume of video of every type on the network keeps growing, video users typically locate content of interest through video retrieval.
The development of video content retrieval technology has passed through several distinct stages: from traditional text-based video retrieval, to cross-modal video retrieval based on AI tags, and on to natural-language video retrieval based on large models. These advances have not only increased search efficiency and accuracy but also greatly improved the user experience. Multi-modal feature vector retrieval performs multi-modal feature analysis on video pictures and uses multi-modal large models, such as the multi-modal pre-trained neural network model CLIP, to extract features for vector retrieval; it handles scene-level semantic features well and is widely applied to image retrieval. Its drawback, however, is a high false detection rate: even models regarded as highly accurate in the industry reach only slightly above 90% accuracy.
Disclosure of Invention
The invention aims to provide a video content retrieval method based on tag retrieval and multi-modal vectors that combines tag retrieval with multi-modal feature vector retrieval to achieve accurate retrieval of video content.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A video content retrieval method based on tag retrieval and multimodal vectors, comprising:
Video processing before retrieval; the video processing refers to extracting tags, semantic vectors and text descriptions from the video content;
Ranking and displaying the search results according to the search query input by the user;
The ranking of search results according to the user's search query proceeds as follows: step 201: extracting the semantic vector of the search query input by the user with a multi-modal large model; matching the query semantic vector against the semantic vectors of the video content to obtain a multi-modal semantic vector retrieval score, ranking these scores from high to low, and taking the first N video contents as a data list vec_ret;
step 202: performing entity word segmentation on the search query input by the user; searching for the entity tokens within the data list vec_ret with an ES search engine to obtain an ES score, normalizing the ES score with a logarithmic function f(x), and re-ranking from high to low to obtain a data list tag_ret;
Step 203: averaging the multi-modal semantic vector retrieval score of the data list vec_ret from step 201 with the ES score of the data list tag_ret from step 202 as a weighted average, then re-ranking to obtain a data list m_ret of the first M (M < N) video contents;
Step 204: counting, for the entity tokens of the user's search query, how many tags each item in the data list m_ret hits, and ranking with the hit count taking priority: items with more tag hits rank higher, and items with the same hit count are ordered by their average weighted score; and outputting the final ranking result.
Further, extracting the tags, semantic vectors and text descriptions from the video content comprises the following steps:
Step 101: performing transition cutting on the video content with a CV algorithm to obtain a plurality of segment sequences; extracting a picture from each segment sequence as its segment characterization picture;
Step 102: extracting tags from the segment characterization pictures with AI labeling;
Step 103: filtering the segment characterization pictures with a CNN convolutional network to obtain the normal-feature segment characterization pictures;
Step 104: processing the normal-feature segment characterization pictures: extracting their semantic vectors and text descriptions with a multi-modal large model and an image-text question-answering model.
Further, the abnormal-feature segment characterization pictures filtered out in step 103 are processed as follows:
comprehensively judging the tags, transition segmentation and text descriptions of the abnormal-feature segment characterization pictures with a visual label model, a picture segmentation model and an image-text question-answering model;
The comprehensive judgment re-checks whether the segment characterization picture has a tag, can undergo transition segmentation, and has a text description; a picture with none of the three is deemed an invalid segment characterization picture; otherwise, the text description extracted by the image-text question-answering model, or the phrase 'no specific meaning', is used as the text description;
and extracting semantic features with the multi-modal large model as the semantic vector.
Further, the multi-modal semantic vector retrieval score is obtained through cosine similarity.
Further, the ES score in step 202 is the average of the ES scores of the individual tokens.
Further, in step 202, when a token is a person name that corresponds to a tag of the video content, or when a token corresponds to a manual tag, the ES score is given a weight greater than 1.
Compared with the prior art, the invention has the following beneficial effects:
(1) Before extracting semantic features from pictures, a CNN convolutional network classifies them, so picture semantic descriptions can be extracted quickly; pictures classified as abnormal are then re-checked, avoiding data loss from misclassification and reducing the multi-modal false detection rate;
(2) While using each engine's own scoring mechanism, the invention brings the ES retrieval score and the multi-modal cosine similarity score into the same data interval through score averaging and numerical normalization, reducing the error of the weighted average, and improves precision by weighting high-hit-rate factors such as person names and manual tags;
(3) A secondary ranking by tag hit count and final average weighted score moves video content with a high hit rate further toward the front, greatly improving the accuracy of the video content retrieval results.
Drawings
Fig. 1 is a general flow chart of the present invention.
Fig. 2 is a flow chart of the video processing before retrieval according to the present invention.
FIG. 3 is a flowchart of the process of the present invention during retrieval.
Detailed Description
As shown in fig. 1, the video content retrieval method based on tag retrieval and multimodal vector provided in this embodiment includes the following steps:
Video processing before retrieval: the video processing mainly provides a data organization form for the video content, namely tags, semantic vectors and text descriptions.
As shown in fig. 2, acquiring this data organization form from the video content specifically comprises the following steps:
Step 101: performing transition cutting on the video content with a CV algorithm to obtain a plurality of segment sequences; extracting a picture from each segment sequence as its segment characterization picture;
Step 102: extracting tags from the segment characterization pictures with AI labeling;
Step 103: classifying the segment characterization pictures with a CNN convolutional network into normal-feature segment characterization pictures and abnormal-feature segment characterization pictures, where abnormal-feature pictures are those showing a solid color or heavy blur, and the rest are normal;
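The patent specifies a CNN convolutional network for this normal/abnormal split but gives no architecture. As a rough illustration of the decision that classifier makes, the sketch below flags solid-color and highly blurred frames with two plain-NumPy heuristics (pixel standard deviation and Laplacian variance); the function name and thresholds are illustrative assumptions, not part of the patent.

```python
import numpy as np

def classify_frame(gray: np.ndarray,
                   solid_std_thresh: float = 5.0,
                   blur_var_thresh: float = 20.0) -> str:
    """Label a grayscale frame 'normal' or 'abnormal'.

    A stand-in for the patent's CNN classifier: a frame whose pixel
    standard deviation is tiny is treated as solid-color, and a frame
    whose Laplacian variance is tiny is treated as highly blurred.
    """
    if gray.std() < solid_std_thresh:          # near-uniform picture
        return "abnormal"
    # 4-neighbour discrete Laplacian over the interior, plain NumPy
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    if lap.var() < blur_var_thresh:            # almost no edges: blurred
        return "abnormal"
    return "normal"
```

A real deployment would learn this boundary from data; the heuristic only mirrors the two abnormal categories the patent names.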
Step 104: processing the normal-feature segment characterization pictures: extracting their semantic vectors and text descriptions with a multi-modal large model and an image-text question-answering model;
Step 105: processing the abnormal-feature segment characterization pictures: extracting the valid pictures among them, i.e., recovering pictures that the CNN convolutional network filtered out even though they are valid, such as a valid picture with a solid-color background;
The tags, transition segmentation and text descriptions of the abnormal-feature segment characterization pictures are comprehensively judged with a visual label model, a picture segmentation model and an image-text question-answering model. The comprehensive judgment re-checks whether a segment characterization picture has a tag, can undergo transition segmentation, and has a text description; a picture with none of the three is deemed an invalid segment characterization picture; otherwise, the text description extracted by the image-text question-answering model, or the phrase 'no specific meaning', is used as its text description, and semantic features extracted with the multi-modal large model are used as its semantic vector.
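The comprehensive judgment above is a three-way re-check whose outcome is either "invalid picture" or a text description. A minimal sketch of that decision logic, with booleans and a string standing in for the outputs of the visual label model, the picture segmentation model and the image-text question-answering model (the function name and return shape are assumptions):

```python
def judge_abnormal_picture(has_tag: bool, can_segment: bool, vqa_text: str):
    """Re-check a picture the CNN flagged as abnormal.

    Returns (is_valid, text_description). A picture with no tag, no
    possible transition segmentation and no VQA text is invalid;
    otherwise the VQA text (or the placeholder 'no specific meaning')
    becomes its text description.
    """
    if not (has_tag or can_segment or vqa_text):
        return False, None
    return True, vqa_text if vqa_text else "no specific meaning"
```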
The extracted tags, text descriptions and semantic vectors of the video content are associated and stored in a database; because they are obtained by processing the video content with several different models, the controllability and accuracy of subsequent content retrieval are improved.
The video content is ranked at retrieval time, and the hit rate of the retrieval results is improved through a combination of single-criterion and hybrid rankings; as shown in fig. 3, this specifically comprises the following steps:
step 201: extracting the semantic vector of the search query input by the user with a multi-modal large model; matching the query semantic vector against the semantic vectors of the video content in the database to obtain a multi-modal semantic vector retrieval score, ranking these scores from high to low, and taking the first N video contents as a data list vec_ret; the multi-modal semantic vector retrieval score is obtained through cosine similarity.
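Step 201 can be sketched as a cosine-similarity scan over the stored semantic vectors. The helper below returns the top-N (index, score) pairs that form vec_ret; the function name and the in-memory matrix are illustrative assumptions, and a production system would use a vector index rather than a full scan.

```python
import numpy as np

def top_n_by_cosine(query_vec: np.ndarray, video_vecs: np.ndarray, n: int):
    """Score each stored video-content vector against the query vector
    with cosine similarity and return the top-n (index, score) pairs,
    highest first, i.e. the data list vec_ret of step 201."""
    q = query_vec / np.linalg.norm(query_vec)
    m = video_vecs / np.linalg.norm(video_vecs, axis=1, keepdims=True)
    scores = m @ q                       # cosine similarity per row
    order = np.argsort(-scores)[:n]      # descending, keep first n
    return [(int(i), float(scores[i])) for i in order]
```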
Step 202: performing entity word segmentation on the search query input by the user; searching for the tokens within the data list vec_ret with an ES search engine to obtain an ES score, normalizing the ES score with a logarithmic function f(x), and re-ranking from high to low to obtain a data list tag_ret;
Since a user's search query usually segments into more than one token, i.e., two or more words, the final ES score in this embodiment is the average of the per-token ES scores.
In addition, when a token is a person name that corresponds to a tag of the video content, or when a token corresponds to a manual tag, this embodiment gives the final ES score a weight greater than 1, improving the hit rate. A person name is more specific, so the retrieval result is more precise; and a manual tag, being annotated by a human, is more accurate than an AI tag. Therefore, when a token falls into either of these two cases, its ES score is boosted.
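Putting the two rules above together, the final ES score of a query is the mean of its per-token ES scores, with a boost factor greater than 1 applied to tokens that are person names matching a video tag or that hit a manual tag. The 1.2 default below is an illustrative value; the patent only requires that the weight exceed 1.

```python
def combined_es_score(token_scores, boost: float = 1.2) -> float:
    """Average per-token ES scores with a >1 boost for flagged tokens.

    token_scores: list of (score, is_person_or_manual) pairs, one per
    entity token of the user's search query.
    """
    weighted = [s * (boost if flagged else 1.0) for s, flagged in token_scores]
    return sum(weighted) / len(weighted)
```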
Because the multi-modal semantic vector retrieval score lies in the interval [0.21, 0.35] while the ES score lies in [3, 25], the application processes the ES score through a logarithmic function f(x) so that the two scores fall in the same interval range: when x is in the interval [3, 35], the function value approaches [0.21, 0.35]; when x is in [1, 3], the function value approaches [0.15, 0.21]; and when x is greater than 35, f(x) is 0.35.
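The patent states this target mapping but not the concrete f(x). One function satisfying the stated anchor points is a logarithmic fit with f(3) = 0.21 and f(35) = 0.35, clamped to 0.35 above 35; conveniently, the same fit gives f(1) ≈ 0.147, close to the stated 0.15 lower bound for the [1, 3] interval. The coefficients below are derived from those two anchors and are an assumption, not taken from the patent.

```python
import math

# Fit f(x) = A * ln(x) + B through the anchors f(3) = 0.21, f(35) = 0.35.
A = (0.35 - 0.21) / math.log(35 / 3)   # ~= 0.0570
B = 0.21 - A * math.log(3)             # ~= 0.1474

def normalize_es(x: float) -> float:
    """Map a raw ES score onto the multimodal-score interval; scores
    above 35 saturate at 0.35."""
    if x > 35:
        return 0.35
    return A * math.log(x) + B
```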
Step 203: the multi-mode semantic vector retrieval score of the data list vec_ret obtained in the step 201 and the ES score of the data list tag_ret in the step 202 are weighted evenly and then reordered to obtain a data list m_ret of the previous M (M < N) video contents;
Step 204: counting the hit number of the tags in the data list m_ret according to the entity word segmentation of the search type input by the user, and sorting the tags with the hit number as a priority, namely sorting the tags more and more before the hit number is, and sorting the tags with the average weighted score when the hit number is the same; and outputting a final sequencing result.
This embodiment realizes multi-dimensional hybrid retrieval of video content, mitigates the weaknesses of any single model on unusual data, and greatly improves the accuracy of video search through fused retrieval over multi-dimensional data; it can be applied to the management of all kinds of video data assets and has high economic value for the secondary use of such assets.
The foregoing is merely a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any modification or substitution based on the technical scheme and inventive concept provided by the present invention shall fall within the scope of the present invention.

Claims (3)

1. A method for retrieving video content based on tag retrieval and multimodal vectors, comprising:
Video processing before retrieval; the video processing refers to extracting labels, semantic vectors and text descriptions in video content;
Ranking and displaying the search results according to the search query input by the user;
the extraction of tags, semantic vectors and text descriptions from the video content comprises the following steps:
Step 101: performing transition cutting on video content by adopting a CV algorithm to obtain a plurality of fragment sequences; extracting pictures in each segment sequence as segment characterization pictures;
Step 102: extracting labels in the fragment characterization picture by adopting AI labels;
step 103: classifying the segment characterization pictures by adopting a CNN convolution network, and dividing the segment characterization pictures into normal characteristic segment characterization pictures and abnormal characteristic segment characterization pictures;
Processing the abnormal-feature segment characterization pictures: comprehensively judging the tags, transition segmentation and text descriptions of the abnormal-feature segment characterization pictures with a visual label model, a picture segmentation model and an image-text question-answering model;
The comprehensive judgment re-checks whether the segment characterization picture has a tag, can undergo transition segmentation, and has a text description; a picture with none of the three is deemed an invalid segment characterization picture; otherwise, the text description extracted by the image-text question-answering model, or the phrase 'no specific meaning', is used as the text description;
extracting semantic features with the multi-modal large model as the semantic vector;
Step 104: processing the normal characteristic fragment representation picture; extracting semantic vectors and text descriptions in the normal feature segment representation pictures by adopting a multi-mode large model and a text question-answering model;
the ranking of search results according to the user's search query proceeds as follows:
step 201: extracting the semantic vector of the search query input by the user with a multi-modal large model; matching the query semantic vector against the semantic vectors of the video content to obtain a multi-modal semantic vector retrieval score, ranking these scores from high to low, and taking the first N video contents as a data list vec_ret;
step 202: performing entity word segmentation on the search query input by the user; searching for the entity tokens within the data list vec_ret with an ES search engine to obtain an ES score, normalizing the ES score with a logarithmic function f(x), and re-ranking from high to low to obtain a data list tag_ret;
when a token is a person name corresponding to a tag of the video content, or when a token corresponds to a manual tag, giving the ES score a weight greater than 1;
Step 203: averaging the multi-modal semantic vector retrieval score of the data list vec_ret from step 201 with the ES score of the data list tag_ret from step 202 as a weighted average, then re-ranking to obtain a data list m_ret of the first M video contents, wherein M is smaller than N;
Step 204: counting, for the entity tokens of the user's search query, how many tags each item in the data list m_ret hits, and ranking with the hit count taking priority: items with more tag hits rank higher, and items with the same hit count are ordered by their average weighted score; and outputting the final ranking result.
2. The method for tag retrieval and multimodal vector based video content retrieval of claim 1, wherein the multimodal semantic vector retrieval score is obtained by cosine similarity.
3. The method for tag retrieval and multimodal vector based video content retrieval of claim 1, wherein in step 202 the ES score is the average of the ES scores of the plurality of tokens.
CN202411163349.7A 2024-08-23 2024-08-23 Video content retrieval method based on label retrieval and multi-modal vector Active CN118673180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411163349.7A CN118673180B (en) 2024-08-23 2024-08-23 Video content retrieval method based on label retrieval and multi-modal vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411163349.7A CN118673180B (en) 2024-08-23 2024-08-23 Video content retrieval method based on label retrieval and multi-modal vector

Publications (2)

Publication Number Publication Date
CN118673180A (en) 2024-09-20
CN118673180B (en) 2024-10-18

Family

ID=92731460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411163349.7A Active CN118673180B (en) 2024-08-23 2024-08-23 Video content retrieval method based on label retrieval and multi-modal vector

Country Status (1)

Country Link
CN (1) CN118673180B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093938A (en) * 2024-02-22 2024-05-28 江西添鹏科技发展有限公司 Video query retrieval method and system based on semantic depth model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8682809B2 (en) * 2012-04-18 2014-03-25 Scorpcast, Llc System and methods for providing user generated video reviews
CN113129312B (en) * 2018-10-15 2022-10-28 华为技术有限公司 Image processing method, device and equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093938A (en) * 2024-02-22 2024-05-28 江西添鹏科技发展有限公司 Video query retrieval method and system based on semantic depth model

Also Published As

Publication number Publication date
CN118673180A (en) 2024-09-20

Similar Documents

Publication Publication Date Title
CN107169049B (en) Application tag information generation method and device
Chen et al. Chinesefoodnet: A large-scale image dataset for chinese food recognition
CN112347244B (en) Yellow-based and gambling-based website detection method based on mixed feature analysis
US20210406549A1 (en) Method and apparatus for detecting information insertion region, electronic device, and storage medium
CN111754302B (en) Video live broadcast interface commodity display intelligent management system based on big data
CN109117777A (en) The method and apparatus for generating information
US20090234842A1 (en) Image search using face detection
CN111967302A (en) Video tag generation method and device and electronic equipment
CN113779308B (en) Short video detection and multi-classification method, device and storage medium
CN110297988A (en) Hot topic detection method based on weighting LDA and improvement Single-Pass clustering algorithm
CN102385592B (en) Image concept detection method and device
CN111581423B (en) Target retrieval method and device
CN104573711B (en) The image understanding method of object and scene based on text objects scene relation
CN107169002A (en) A kind of personalized interface method for pushing and device recognized based on face
CN113849598B (en) Social media false information detection method and detection system based on deep learning
CN108846047A (en) A kind of picture retrieval method and system based on convolution feature
CN101739428A (en) Method for establishing index for multimedia
CN109189965A (en) Pictograph search method and system
CN114187595A (en) Document layout recognition method and system based on fusion of visual features and semantic features
CN111488813A (en) Video emotion marking method and device, electronic equipment and storage medium
CN112434173B (en) Search content output method and device, computer equipment and readable storage medium
CN118673180B (en) Video content retrieval method based on label retrieval and multi-modal vector
CN106354838B (en) Data visualization method based on semantic resonance color
CN115438141B (en) Information retrieval method based on knowledge graph model
CN114912026B (en) Network public opinion monitoring analysis processing method, equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant