CN118673180B - Video content retrieval method based on label retrieval and multi-modal vector - Google Patents


Info

Publication number
CN118673180B
CN118673180B (application CN202411163349.7A)
Authority
CN
China
Prior art keywords
retrieval
score
video content
video
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411163349.7A
Other languages
Chinese (zh)
Other versions
CN118673180A (en)
Inventor
颜涛
余军
朱愚
黄信云
刘让刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Chinamcloud Technology Co ltd
Original Assignee
Chengdu Chinamcloud Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Chinamcloud Technology Co ltd filed Critical Chengdu Chinamcloud Technology Co ltd
Priority to CN202411163349.7A priority Critical patent/CN118673180B/en
Publication of CN118673180A publication Critical patent/CN118673180A/en
Application granted granted Critical
Publication of CN118673180B publication Critical patent/CN118673180B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of video retrieval and discloses a video content retrieval method based on tag retrieval and multi-modal vectors, comprising the following steps: step 100: video processing before retrieval, where video processing refers to extracting tags, semantic vectors and text descriptions from the video content; step 200: ranking and displaying the search results according to the search query input by the user. The method realizes multi-dimensional hybrid retrieval of video content, mitigates the weaknesses of any single model on unusual data, and greatly improves the accuracy of video search through fused retrieval over multi-dimensional data; it can be applied to the management of all kinds of video data assets and has high economic value for the secondary use of such assets.

Description

Video content retrieval method based on label retrieval and multi-modal vector
Technical Field
The invention belongs to the technical field of video retrieval, and particularly relates to a video content retrieval method based on tag retrieval and multi-modal vectors.
Background
With the continuous development of internet technology and the steady growth of network bandwidth, online video has attracted a broad user base with its convenient access, diverse sources and real-time updates, making it an essential part of users' online lives. As the volume of video of every type on the network keeps growing, video users typically locate content of interest through video retrieval.
The development of video content retrieval technology has passed through several distinct stages: from traditional text-based video retrieval, to cross-modal video retrieval based on AI tags, and on to natural-language video retrieval based on large models. These advances have not only increased search efficiency and accuracy but also greatly improved the user experience. Multi-modal feature vector retrieval performs multi-modal feature analysis on video pictures and uses multi-modal large models, such as the multi-modal pre-trained neural network model CLIP, to extract features for vector retrieval; it handles scene-level semantic features well and is widely applied to image retrieval. Its drawback, however, is a high false detection rate: even models regarded as highly accurate in the industry reach only slightly above 90% accuracy.
Disclosure of Invention
The invention aims to provide a video content retrieval method based on tag retrieval and multi-modal vectors that combines tag retrieval with multi-modal feature vector retrieval to achieve accurate retrieval of video content.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A video content retrieval method based on tag retrieval and multimodal vectors, comprising:
Video processing before retrieval; the video processing refers to extracting tags, semantic vectors and text descriptions from the video content;
Ranking and displaying the search results according to the search query input by the user;
The ranking of search results according to the user's search query proceeds as follows: step 201: extracting the semantic vector of the search query input by the user with a multi-modal large model; matching the query semantic vector against the semantic vectors of the video content to obtain a multi-modal semantic vector retrieval score, ranking these scores from high to low, and taking the first N video contents as a data list vec_ret;
step 202: performing entity word segmentation on the search query input by the user; searching for the entity tokens within the data list vec_ret with an ES search engine to obtain an ES score, normalizing the ES score with a logarithmic function f(x), and re-ranking from high to low to obtain a data list tag_ret;
Step 203: averaging the multi-modal semantic vector retrieval score of the data list vec_ret from step 201 with the ES score of the data list tag_ret from step 202 as a weighted average, then re-ranking to obtain a data list m_ret of the first M (M < N) video contents;
Step 204: counting, for the entity tokens of the user's search query, how many tags each item in the data list m_ret hits, and ranking with the hit count taking priority: items with more tag hits rank higher, and items with the same hit count are ordered by their average weighted score; and outputting the final ranking result.
Further, extracting the tags, semantic vectors and text descriptions from the video content comprises the following steps:
Step 101: performing transition cutting on the video content with a CV algorithm to obtain a plurality of segment sequences; extracting a picture from each segment sequence as its segment characterization picture;
Step 102: extracting tags from the segment characterization pictures with AI labeling;
Step 103: filtering the segment characterization pictures with a CNN convolutional network to obtain the normal-feature segment characterization pictures;
Step 104: processing the normal-feature segment characterization pictures: extracting their semantic vectors and text descriptions with a multi-modal large model and an image-text question-answering model.
Further, the abnormal-feature segment characterization pictures filtered out in step 103 are processed as follows:
comprehensively judging the tags, transition segmentation and text descriptions of the abnormal-feature segment characterization pictures with a visual label model, a picture segmentation model and an image-text question-answering model;
The comprehensive judgment re-checks whether the segment characterization picture has a tag, can undergo transition segmentation, and has a text description; a picture with none of the three is deemed an invalid segment characterization picture; otherwise, the text description extracted by the image-text question-answering model, or the phrase 'no specific meaning', is used as the text description;
and extracting semantic features with the multi-modal large model as the semantic vector.
Further, the multi-modal semantic vector retrieval score is obtained through cosine similarity.
Further, the ES score in step 202 is the average of the ES scores of the individual tokens.
Further, in step 202, when a token is a person name that corresponds to a tag of the video content, or when a token corresponds to a manual tag, the ES score is given a weight greater than 1.
Compared with the prior art, the invention has the following beneficial effects:
(1) Before extracting semantic features from pictures, a CNN convolutional network classifies them, so picture semantic descriptions can be extracted quickly; pictures classified as abnormal are then re-checked, avoiding data loss from misclassification and reducing the multi-modal false detection rate;
(2) While using each engine's own scoring mechanism, the invention brings the ES retrieval score and the multi-modal cosine similarity score into the same data interval through score averaging and numerical normalization, reducing the error of the weighted average, and improves precision by weighting high-hit-rate factors such as person names and manual tags;
(3) A secondary ranking by tag hit count and final average weighted score moves video content with a high hit rate further toward the front, greatly improving the accuracy of the video content retrieval results.
Drawings
Fig. 1 is a general flow chart of the present invention.
Fig. 2 is a flow chart of the video processing before retrieval according to the present invention.
FIG. 3 is a flowchart of the process of the present invention during retrieval.
Detailed Description
As shown in fig. 1, the video content retrieval method based on tag retrieval and multimodal vector provided in this embodiment includes the following steps:
Video processing before retrieval: the video processing mainly provides a data organization form for the video content, namely tags, semantic vectors and text descriptions.
As shown in fig. 2, acquiring this data organization form from the video content specifically comprises the following steps:
Step 101: performing transition cutting on the video content with a CV algorithm to obtain a plurality of segment sequences; extracting a picture from each segment sequence as its segment characterization picture;
Step 102: extracting tags from the segment characterization pictures with AI labeling;
Step 103: classifying the segment characterization pictures with a CNN convolutional network into normal-feature segment characterization pictures and abnormal-feature segment characterization pictures, where abnormal-feature pictures are those showing a solid color or heavy blur, and the rest are normal;
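The patent specifies a CNN convolutional network for this normal/abnormal split but gives no architecture. As a rough illustration of the decision that classifier makes, the sketch below flags solid-color and highly blurred frames with two plain-NumPy heuristics (pixel standard deviation and Laplacian variance); the function name and thresholds are illustrative assumptions, not part of the patent.

```python
import numpy as np

def classify_frame(gray: np.ndarray,
                   solid_std_thresh: float = 5.0,
                   blur_var_thresh: float = 20.0) -> str:
    """Label a grayscale frame 'normal' or 'abnormal'.

    A stand-in for the patent's CNN classifier: a frame whose pixel
    standard deviation is tiny is treated as solid-color, and a frame
    whose Laplacian variance is tiny is treated as highly blurred.
    """
    if gray.std() < solid_std_thresh:          # near-uniform picture
        return "abnormal"
    # 4-neighbour discrete Laplacian over the interior, plain NumPy
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    if lap.var() < blur_var_thresh:            # almost no edges: blurred
        return "abnormal"
    return "normal"
```

A real deployment would learn this boundary from data; the heuristic only mirrors the two abnormal categories the patent names.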
Step 104: processing the normal-feature segment characterization pictures: extracting their semantic vectors and text descriptions with a multi-modal large model and an image-text question-answering model;
Step 105: processing the abnormal-feature segment characterization pictures: extracting the valid pictures among them, i.e., recovering pictures that the CNN convolutional network filtered out even though they are valid, such as a valid picture with a solid-color background;
The tags, transition segmentation and text descriptions of the abnormal-feature segment characterization pictures are comprehensively judged with a visual label model, a picture segmentation model and an image-text question-answering model. The comprehensive judgment re-checks whether a segment characterization picture has a tag, can undergo transition segmentation, and has a text description; a picture with none of the three is deemed an invalid segment characterization picture; otherwise, the text description extracted by the image-text question-answering model, or the phrase 'no specific meaning', is used as its text description, and semantic features extracted with the multi-modal large model are used as its semantic vector.
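The comprehensive judgment above is a three-way re-check whose outcome is either "invalid picture" or a text description. A minimal sketch of that decision logic, with booleans and a string standing in for the outputs of the visual label model, the picture segmentation model and the image-text question-answering model (the function name and return shape are assumptions):

```python
def judge_abnormal_picture(has_tag: bool, can_segment: bool, vqa_text: str):
    """Re-check a picture the CNN flagged as abnormal.

    Returns (is_valid, text_description). A picture with no tag, no
    possible transition segmentation and no VQA text is invalid;
    otherwise the VQA text (or the placeholder 'no specific meaning')
    becomes its text description.
    """
    if not (has_tag or can_segment or vqa_text):
        return False, None
    return True, vqa_text if vqa_text else "no specific meaning"
```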
The extracted tags, text descriptions and semantic vectors of the video content are associated and stored in a database; because they are obtained by processing the video content with several different models, the controllability and accuracy of subsequent content retrieval are improved.
The video content is ranked at retrieval time, and the hit rate of the retrieval results is improved through a combination of single-criterion and hybrid rankings; as shown in fig. 3, this specifically comprises the following steps:
step 201: extracting the semantic vector of the search query input by the user with a multi-modal large model; matching the query semantic vector against the semantic vectors of the video content in the database to obtain a multi-modal semantic vector retrieval score, ranking these scores from high to low, and taking the first N video contents as a data list vec_ret; the multi-modal semantic vector retrieval score is obtained through cosine similarity.
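Step 201 can be sketched as a cosine-similarity scan over the stored semantic vectors. The helper below returns the top-N (index, score) pairs that form vec_ret; the function name and the in-memory matrix are illustrative assumptions, and a production system would use a vector index rather than a full scan.

```python
import numpy as np

def top_n_by_cosine(query_vec: np.ndarray, video_vecs: np.ndarray, n: int):
    """Score each stored video-content vector against the query vector
    with cosine similarity and return the top-n (index, score) pairs,
    highest first, i.e. the data list vec_ret of step 201."""
    q = query_vec / np.linalg.norm(query_vec)
    m = video_vecs / np.linalg.norm(video_vecs, axis=1, keepdims=True)
    scores = m @ q                       # cosine similarity per row
    order = np.argsort(-scores)[:n]      # descending, keep first n
    return [(int(i), float(scores[i])) for i in order]
```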
Step 202: performing entity word segmentation on the search query input by the user; searching for the tokens within the data list vec_ret with an ES search engine to obtain an ES score, normalizing the ES score with a logarithmic function f(x), and re-ranking from high to low to obtain a data list tag_ret;
Since a user's search query usually segments into more than one token, i.e., two or more words, the final ES score in this embodiment is the average of the per-token ES scores.
In addition, when a token is a person name that corresponds to a tag of the video content, or when a token corresponds to a manual tag, this embodiment gives the final ES score a weight greater than 1, improving the hit rate. A person name is more specific, so the retrieval result is more precise; and a manual tag, being annotated by a human, is more accurate than an AI tag. Therefore, when a token falls into either of these two cases, its ES score is boosted.
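Putting the two rules above together, the final ES score of a query is the mean of its per-token ES scores, with a boost factor greater than 1 applied to tokens that are person names matching a video tag or that hit a manual tag. The 1.2 default below is an illustrative value; the patent only requires that the weight exceed 1.

```python
def combined_es_score(token_scores, boost: float = 1.2) -> float:
    """Average per-token ES scores with a >1 boost for flagged tokens.

    token_scores: list of (score, is_person_or_manual) pairs, one per
    entity token of the user's search query.
    """
    weighted = [s * (boost if flagged else 1.0) for s, flagged in token_scores]
    return sum(weighted) / len(weighted)
```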
Because the multi-modal semantic vector retrieval score lies in the interval [0.21, 0.35] while the ES score lies in [3, 25], the application processes the ES score through a logarithmic function f(x) so that the two scores fall in the same interval range: when x is in the interval [3, 35], the function value approaches [0.21, 0.35]; when x is in [1, 3], the function value approaches [0.15, 0.21]; and when x is greater than 35, f(x) is 0.35.
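The patent states this target mapping but not the concrete f(x). One function satisfying the stated anchor points is a logarithmic fit with f(3) = 0.21 and f(35) = 0.35, clamped to 0.35 above 35; conveniently, the same fit gives f(1) ≈ 0.147, close to the stated 0.15 lower bound for the [1, 3] interval. The coefficients below are derived from those two anchors and are an assumption, not taken from the patent.

```python
import math

# Fit f(x) = A * ln(x) + B through the anchors f(3) = 0.21, f(35) = 0.35.
A = (0.35 - 0.21) / math.log(35 / 3)   # ~= 0.0570
B = 0.21 - A * math.log(3)             # ~= 0.1474

def normalize_es(x: float) -> float:
    """Map a raw ES score onto the multimodal-score interval; scores
    above 35 saturate at 0.35."""
    if x > 35:
        return 0.35
    return A * math.log(x) + B
```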
Step 203: the multi-mode semantic vector retrieval score of the data list vec_ret obtained in the step 201 and the ES score of the data list tag_ret in the step 202 are weighted evenly and then reordered to obtain a data list m_ret of the previous M (M < N) video contents;
Step 204: counting the hit number of the tags in the data list m_ret according to the entity word segmentation of the search type input by the user, and sorting the tags with the hit number as a priority, namely sorting the tags more and more before the hit number is, and sorting the tags with the average weighted score when the hit number is the same; and outputting a final sequencing result.
This embodiment realizes multi-dimensional hybrid retrieval of video content, mitigates the weaknesses of any single model on unusual data, and greatly improves the accuracy of video search through fused retrieval over multi-dimensional data; it can be applied to the management of all kinds of video data assets and has high economic value for the secondary use of such assets.
The foregoing is merely a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any modification or substitution based on the technical scheme and inventive concept provided by the present invention shall fall within the scope of the present invention.

Claims (3)

1. A method for retrieving video content based on tag retrieval and multimodal vectors, comprising:
Video processing before retrieval; the video processing refers to extracting labels, semantic vectors and text descriptions in video content;
Ranking and displaying the search results according to the search query input by the user;
the extraction of tags, semantic vectors and text descriptions from the video content comprises the following steps:
Step 101: performing transition cutting on video content by adopting a CV algorithm to obtain a plurality of fragment sequences; extracting pictures in each segment sequence as segment characterization pictures;
Step 102: extracting labels in the fragment characterization picture by adopting AI labels;
step 103: classifying the segment characterization pictures by adopting a CNN convolution network, and dividing the segment characterization pictures into normal characteristic segment characterization pictures and abnormal characteristic segment characterization pictures;
Processing the abnormal-feature segment characterization pictures: comprehensively judging the tags, transition segmentation and text descriptions of the abnormal-feature segment characterization pictures with a visual label model, a picture segmentation model and an image-text question-answering model;
The comprehensive judgment re-checks whether the segment characterization picture has a tag, can undergo transition segmentation, and has a text description; a picture with none of the three is deemed an invalid segment characterization picture; otherwise, the text description extracted by the image-text question-answering model, or the phrase 'no specific meaning', is used as the text description;
extracting semantic features with the multi-modal large model as the semantic vector;
Step 104: processing the normal characteristic fragment representation picture; extracting semantic vectors and text descriptions in the normal feature segment representation pictures by adopting a multi-mode large model and a text question-answering model;
the ranking of search results according to the user's search query proceeds as follows:
step 201: extracting the semantic vector of the search query input by the user with a multi-modal large model; matching the query semantic vector against the semantic vectors of the video content to obtain a multi-modal semantic vector retrieval score, ranking these scores from high to low, and taking the first N video contents as a data list vec_ret;
step 202: performing entity word segmentation on the search query input by the user; searching for the entity tokens within the data list vec_ret with an ES search engine to obtain an ES score, normalizing the ES score with a logarithmic function f(x), and re-ranking from high to low to obtain a data list tag_ret;
when a token is a person name corresponding to a tag of the video content, or when a token corresponds to a manual tag, giving the ES score a weight greater than 1;
Step 203: averaging the multi-modal semantic vector retrieval score of the data list vec_ret from step 201 with the ES score of the data list tag_ret from step 202 as a weighted average, then re-ranking to obtain a data list m_ret of the first M video contents, wherein M is smaller than N;
Step 204: counting, for the entity tokens of the user's search query, how many tags each item in the data list m_ret hits, and ranking with the hit count taking priority: items with more tag hits rank higher, and items with the same hit count are ordered by their average weighted score; and outputting the final ranking result.
2. The method for tag retrieval and multimodal vector based video content retrieval of claim 1, wherein the multimodal semantic vector retrieval score is obtained by cosine similarity.
3. The method for tag retrieval and multimodal vector based video content retrieval of claim 1, wherein in step 202 the ES score is the average of the ES scores of the plurality of tokens.
CN202411163349.7A 2024-08-23 2024-08-23 Video content retrieval method based on label retrieval and multi-modal vector Active CN118673180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411163349.7A CN118673180B (en) 2024-08-23 2024-08-23 Video content retrieval method based on label retrieval and multi-modal vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411163349.7A CN118673180B (en) 2024-08-23 2024-08-23 Video content retrieval method based on label retrieval and multi-modal vector

Publications (2)

Publication Number Publication Date
CN118673180A (en) 2024-09-20
CN118673180B (en) 2024-10-18

Family

ID=92731460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411163349.7A Active CN118673180B (en) 2024-08-23 2024-08-23 Video content retrieval method based on label retrieval and multi-modal vector

Country Status (1)

Country Link
CN (1) CN118673180B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093938A (en) * 2024-02-22 2024-05-28 江西添鹏科技发展有限公司 Video query retrieval method and system based on semantic depth model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8682809B2 (en) * 2012-04-18 2014-03-25 Scorpcast, Llc System and methods for providing user generated video reviews
CN113129312B (en) * 2018-10-15 2022-10-28 华为技术有限公司 Image processing method, device and equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093938A (en) * 2024-02-22 2024-05-28 江西添鹏科技发展有限公司 Video query retrieval method and system based on semantic depth model

Also Published As

Publication number Publication date
CN118673180A (en) 2024-09-20

Similar Documents

Publication Publication Date Title
CN107169049B (en) Application tag information generation method and device
Chen et al. Chinesefoodnet: A large-scale image dataset for chinese food recognition
CN112347244B (en) Yellow-based and gambling-based website detection method based on mixed feature analysis
US20210406549A1 (en) Method and apparatus for detecting information insertion region, electronic device, and storage medium
CN111754302B (en) Video live broadcast interface commodity display intelligent management system based on big data
CN109117777A (en) The method and apparatus for generating information
US20090234842A1 (en) Image search using face detection
CN111967302A (en) Video tag generation method and device and electronic equipment
CN113779308B (en) Short video detection and multi-classification method, device and storage medium
CN110297988A (en) Hot topic detection method based on weighting LDA and improvement Single-Pass clustering algorithm
CN102385592B (en) Image concept detection method and device
CN111581423B (en) Target retrieval method and device
CN104573711B (en) The image understanding method of object and scene based on text objects scene relation
CN107169002A (en) A kind of personalized interface method for pushing and device recognized based on face
CN113849598B (en) Social media false information detection method and detection system based on deep learning
CN108846047A (en) A kind of picture retrieval method and system based on convolution feature
CN101739428A (en) Method for establishing index for multimedia
CN109189965A (en) Pictograph search method and system
CN114187595A (en) Document layout recognition method and system based on fusion of visual features and semantic features
CN111488813A (en) Video emotion marking method and device, electronic equipment and storage medium
CN112434173B (en) Search content output method and device, computer equipment and readable storage medium
CN118673180B (en) Video content retrieval method based on label retrieval and multi-modal vector
CN106354838B (en) Data visualization method based on semantic resonance color
CN115438141B (en) Information retrieval method based on knowledge graph model
CN114912026B (en) Network public opinion monitoring analysis processing method, equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant