CN112508776B - Action migration method and device and electronic equipment


Info

Publication number
CN112508776B
Authority
CN
China
Prior art keywords: target, action, feature, information, location
Prior art date
Legal status
Active
Application number
CN202011468573.9A
Other languages
Chinese (zh)
Other versions
CN112508776A
Inventor
唐吉霖
袁燚
胡志鹏
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202011468573.9A
Publication of CN112508776A
Application granted
Publication of CN112508776B


Classifications

    • G06T 3/04 — Geometric image transformations in the plane of the image; context-preserving transformations, e.g. by using an importance map
    • G06F 18/253 — Pattern recognition; fusion techniques of extracted features
    • G06N 3/045 — Neural networks; combinations of networks
    • G06N 3/08 — Neural networks; learning methods
    • G06V 10/40 — Image or video recognition or understanding; extraction of image or video features


Abstract

The invention provides an action migration method, an action migration device and electronic equipment. The method comprises: acquiring position information of specified position points of a target object under a source action and a target action, wherein the target object is divided into a plurality of parts; determining offset information of each part when the source action is converted to the target action based on the position information of the specified position points contained in the part; determining target image features of the part based on the offset information and source image features of the part; and obtaining the target object under the target action through the target image features of the parts. Calculating the offset information part by part yields a finer and more accurate description of how the target action deviates from the original action, and thereby improves the overall effect of action migration.

Description

Action migration method and device and electronic equipment
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for motion migration and an electronic device.
Background
Motion migration is a technique for converting the motion of a person in a source image into a target motion; the motion may also be referred to as a pose or posture. After the person's motion is converted into the target motion, the person's original appearance characteristics are still maintained. In the related art, appearance flow data of the person are obtained based on the original action data and the target action data of the person in the source image; the appearance flow data reflect the overall offset of the target action relative to the original action. A deformation operation is then applied to the image features of the source image according to the appearance flow data to obtain a target image, in which the person performs the target action while retaining the appearance characteristics of the person in the source image. In this motion migration approach, if the target action differs greatly from the original action, the appearance flow data can hardly reflect the offset between the target action and the original action accurately, which affects the overall effect of motion migration.
Disclosure of Invention
In view of the above, the present invention aims to provide a motion migration method, a motion migration device and an electronic device, so as to improve the overall effect of motion migration.
In a first aspect, an embodiment of the present invention provides an action migration method, the method comprising: acquiring position information of specified position points of a target object under a source action and a target action, wherein the target object is divided into a plurality of parts; determining offset information of a part when the source action is converted to the target action based on the position information of the specified position points contained in the part; determining target image features of the part based on the offset information and source image features of the part; and obtaining the target object under the target action through the target image features of the parts.
The step of determining the target image features of the part based on the offset information and the source image features of the part includes: performing a deformation operation on the source image features of the part based on the offset information to obtain intermediate image features of the part; and screening the intermediate image features based on the feature visibility information of the part to obtain the target image features of the part, wherein the feature visibility information is used to indicate the probability that the target image feature of each position point in the part exists in the source image features; the part is composed of a plurality of position points, and the specified position points contained in the part belong to the plurality of position points.
The step of screening the intermediate image features based on the feature visibility information of the part to obtain the target image features of the part includes: obtaining the part action features of the part under the target action from the position information, under the target action, of the specified position points contained in the part; and performing weighted summation processing on the part action features and the intermediate image features based on the feature visibility information of the part to obtain the target image features.
The step of performing weighted summation processing on the part action features and the intermediate image features based on the feature visibility information of the part to obtain the target image features includes: calculating the target image features by the following formula: $\hat{F}_{local} = V_{local} \odot \tilde{F}_{local} + (1 - V_{local}) \odot F_{local}^{p}$, where $\hat{F}_{local}$ represents the target image features; $V_{local}$ represents the feature visibility information of the part; $\tilde{F}_{local}$ represents the intermediate image features; and $F_{local}^{p}$ represents the part action features.
The step of determining the offset information of the part when the source action is converted to the target action based on the position information of the specified position points contained in the part includes: for each part, inputting the position information of the specified position points contained in the part under the source action and the target action into an appearance flow generation network, and outputting the offset information of the part and the feature visibility information of the part.
Before the step of obtaining the target object under the target action through the target image features of each part, the method further comprises: inputting the target image features of the part into a dilated convolution network and outputting the final target image features of the part; wherein the dilated convolution network comprises at least one dilated convolution layer connected in series, and each dilated convolution layer is provided with a preset dilation rate parameter.
The step of obtaining the target object under the target action through the target image characteristics of each part comprises the following steps: obtaining object action characteristics of the target object under the target action through the position information of the designated position point of the target object under the target action; and obtaining the target object under the target action based on the object action characteristics of the target object and the target image characteristics of each part.
The step of obtaining the target object under the target action based on the object action characteristics of the target object and the target image characteristics of each part comprises the following steps: splicing the object action characteristics of the target object and the target image characteristics of each part to obtain splicing characteristics; inputting the spliced features into a feature fusion network, and outputting initial fusion features; based on the initial fusion feature, a target object under the target action is determined.
The step of determining the target object under the target action based on the initial fusion characteristic comprises the following steps: carrying out multi-scale pyramid pooling operation on the initial fusion features to obtain hierarchical features with a plurality of specified scales; aiming at each level feature, carrying out non-local operation on the level feature to obtain an operation result of the level feature; performing up-sampling treatment on the operation result to obtain a sampling result of the hierarchical feature; carrying out fusion processing on the sampling result of each hierarchical feature and the initial fusion feature to obtain a global fusion feature; and decoding the global fusion characteristics to obtain a target object under the target action.
In a second aspect, an embodiment of the present invention provides an action migration apparatus, including: an information acquisition module, configured to acquire position information of specified position points of a target object under a source action and a target action, wherein the target object is divided into a plurality of parts; an information processing module, configured to determine offset information of a part when the source action is converted to the target action based on the position information of the specified position points contained in the part, and to perform a deformation operation on the source image features of the part based on the offset information to obtain the target image features of the part; and an object output module, configured to obtain the target object under the target action through the target image features of the parts.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory, where the memory stores machine executable instructions executable by the processor, and the processor executes the machine executable instructions to implement the foregoing action migration method.
In a fourth aspect, embodiments of the present invention provide a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the above-described action migration method.
The embodiment of the invention has the following beneficial effects:
in the motion migration method and device and the electronic equipment of the embodiments of the invention, the target object is divided into a plurality of parts, and the offset information of each part when the source action is converted to the target action is determined based on the position information of the specified position points contained in the part; a deformation operation is performed on the source image features of the part based on the offset information to obtain the target image features of the part; and the target object under the target action is then obtained through the target image features of the parts. Here, the target object is divided into a plurality of parts, the offset information of each part is obtained based on the position information, under the source action and the target action, of the specified position points contained in the part, and the deformation operation is performed on the image features of the part based on this offset information. Calculating the offset information part by part yields a finer and more accurate description of how the target action deviates from the original action, which in turn improves the overall effect of action migration.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are some embodiments of the invention and that other drawings may be obtained from these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an action migration method according to an embodiment of the present invention;
FIG. 2 is a flowchart of an action migration method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of another motion migration method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a deformation operation based on bilinear interpolation according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an expanded convolutional network according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a multi-scale pyramid pooling operation and a non-local operation according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a training process of an action migration model according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an application process of an action migration model according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of an action migration effect outputted by an action migration model according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an action migration device according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Motion migration is a technique for converting the human body image in a source input picture into a target motion pose. Taking 2D (two-dimensional) space as an example, 2D motion migration requires a source character image and a target human body pose, from which a real and natural character image in the target pose is generated while the original appearance characteristics of the source character image are maintained. At present, this technology is widely applied in fields such as film and television production, animation generation and virtual try-on, and has broad application prospects and huge market value.
For ease of understanding, fig. 1 is a schematic diagram of a motion migration approach in the related art. In the field of motion migration, a technique based on an overall appearance flow is mainly used to transform the human body image in a source input picture into a target motion pose while maintaining the original appearance of the source character image. As shown in fig. 1, the motion migration method based on the overall appearance flow mainly includes two modules: an overall appearance flow generation module and a target picture generation module.
The overall appearance flow generation module receives the source human body pose $P_s$ and the target human body pose $P_t$ as input, and uses a convolutional neural network to compute a global appearance flow $W_{global}$ representing the offsets between the pixels of the whole source picture and those of the target picture. The target picture generation module receives the source human body picture $I_s$ and the generated overall appearance flow $W_{global}$ as input; it first encodes the source human body picture $I_s$ into corresponding source picture features with an encoder network, then performs a deformation operation (also called a warp operation) on the source picture features according to the generated overall appearance flow $W_{global}$, and finally decodes the deformed picture features with a decoder network into the generated target picture $\hat{I}_t$.
In the above motion migration approach, the complete source human body pose $P_s$ and target human body pose $P_t$ are directly taken as input to predict the appearance flow of the whole human body. This cannot reliably and effectively handle the complex cases in which the source human body pose $P_s$ and the target human body pose $P_t$ differ greatly; it is difficult to generate an accurate, high-quality overall appearance flow, which affects the overall effect of motion migration.
Based on the above, the motion migration method, device and electronic equipment provided by the embodiments of the invention can be applied to motion migration of human bodies, animals or other deformable targets; motion migration in the present embodiments may also be referred to as pose migration or the like.
Referring first to the flow chart of a method of action migration shown in fig. 2, the method comprises the steps of:
Step S202, acquiring position information of a specified position point in a target object under a source action and a target action; wherein the target object is divided into a plurality of parts;
the appointed position points in the target object can be pre-appointed, and taking a human body as an example, the appointed position points can be joint points of parts such as head, shoulder, elbow, hand, hip, knee, foot and the like; designated location points may also be referred to as keypoints or articulation points. The position information of the specified position point of the target object under the source action can be collectively called as a source human body gesture P s The method comprises the steps of carrying out a first treatment on the surface of the The position information of each appointed position point of the target object under the target action can be collectively called as target human body gesture P t . The position information of the specified position point in the target object may be represented by a heat map, in a specific implementation manner, a specified position point corresponds to a heat map of a channel, where the position of the specified position point is expressed by a pixel value of a pixel, for example, a pixel value near a center position of the specified position point is relatively high, and a pixel value far from the center position of the specified position point is relatively low, so as to form a heat map in which the pixel value gradually changes with the position. As an example, if the designated location points of the target object include 18, the above-mentioned heat map is 18 channels, and the heat map of the 18 channels encodes the spatial locations of 18 key points of the whole human body.
In actual implementation, the source image generally contains the target object under the source action, and a human body pose estimation open-source library such as OpenPose can be used to extract the position coordinates of the specified position points of the target object under the source action from the source image and then output the heat maps, thereby obtaining the position information of the specified position points of the target object under the source action. Similarly, the position information of the specified position points of the target object under the target action can be extracted, via a human body pose estimation open-source library, from another image, which generally does not contain the target object but only another object performing the target action.
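For illustration, the heat-map encoding described above can be sketched as follows. This is a minimal sketch, not taken from the patent: the Gaussian profile, the sigma value, the 256×192 canvas and the (x, y) row layout of the keypoint array are all assumptions; the excerpt only states that pixel values peak near a keypoint's centre and decay with distance, with one channel per keypoint.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height, width, sigma=6.0):
    """Encode keypoint coordinates as multi-channel heat maps.

    keypoints: (K, 2) array of (x, y) pixel coordinates, one row per
    specified position point. Returns a (K, height, width) array whose
    pixel values peak at each keypoint centre and decay with distance.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for i, (x, y) in enumerate(keypoints):
        if x < 0 or y < 0:  # assumed convention: negative = undetected
            continue
        maps[i] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return maps

# Example: a pose with 18 specified position points on a 256x192 canvas,
# giving the 18-channel heat map mentioned above.
pose = np.random.randint(0, 192, size=(18, 2))
print(keypoints_to_heatmaps(pose, height=256, width=192).shape)  # (18, 256, 192)
```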
The target object is divided into a plurality of parts in advance, and as an example, the target object may be divided into three parts of a head, a trunk, and legs; in other ways, the target object may be divided into a greater or lesser number of regions. After the division of the parts, the present embodiment performs feature processing for each part.
Step S204, determining offset information of a part when the source action is converted to the target action based on the position information of the specified position points contained in the part; and determining target image features of the part based on the offset information and source image features of the part;
Since the target object is divided into a plurality of parts, the above step S204 may be performed for each part. For a given part, the specified position points contained in the part are a subset of the specified position points of the target object; therefore, the position information of the specified position points contained in the part can be extracted from the position information of the specified position points of the target object. Specifically, first position information of a specified position point contained in the part under the source action and second position information of the same point under the target action are acquired; the offset information of the specified position point, which indicates the change of position occurring when the point is converted from the source action to the target action, can then be obtained from the first position information and the second position information by vector calculation.
In an image, the image region corresponding to a part consists of a plurality of closely arranged pixel points, where a pixel point can be understood as a position point; alternatively, the pixel points are screened and a subset of them is used as the position points. Thus, a part comprises, or consists of, a plurality of position points, and the specified position points contained in the part belong to these position points, i.e. the specified position points contained in a part are a subset of the position points that make up the part. The offset information of the part includes the offset information of the specified position points in the part as well as that of the other position points; that is, the offset information of the part includes the offset information of every position point in the part.
From the offset information of the specified position points, the offset information of the positions in the part other than the specified position points can be estimated. Taking position point A as an example, its offset information can be estimated from its position relative to the specified position points and from the offset information of those points; based on the position information of point A under the source action and its offset information, the position information of point A under the target action can be obtained by vector calculation.
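The vector calculation for the specified position points, and one possible way to estimate the offset of a non-specified point A from them, can be sketched as follows. The inverse-distance weighting is an illustrative stand-in and an assumption; in the method itself this estimation is learned by the appearance flow generation network described below.

```python
import numpy as np

def estimate_point_offset(point, keypoints_src, keypoint_offsets):
    # Inverse-distance-weighted estimate (an assumed rule, for illustration)
    # of the offset at an arbitrary position point, from the offsets of the
    # part's specified position points.
    d = np.linalg.norm(keypoints_src - point, axis=1) + 1e-6
    w = 1.0 / d
    w /= w.sum()
    return (w[:, None] * keypoint_offsets).sum(axis=0)

src = np.array([[30.0, 40.0], [60.0, 80.0]])  # specified points, source action
tgt = np.array([[35.0, 42.0], [58.0, 90.0]])  # same points, target action
offsets = tgt - src                           # per-point vector calculation
point_a = np.array([45.0, 60.0])
print(point_a + estimate_point_offset(point_a, src, offsets))  # point A, target
```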
After the offset information of every position point in the part is obtained, the target image features of the part can be obtained based on the offset information and the source image features of the part. The source image features of a part are obtained by performing image segmentation on the source image containing the target object under the source action to obtain a local image of the part, and then extracting image features from the local image of the part with a feature extraction network. The source image features of the part are the image features of each position point of the part while the target object performs the source action; to obtain the target image features of the part under the target action, operations such as deformation must be applied to the source image features using the offset information.
Step S206, obtaining a target object under the target action through the target image characteristics of each part.
The target image features of the whole target object can be obtained by performing processes such as stitching and fusion on the target image features of each part, and the target image containing the target object under the target action can be obtained by performing decoding processing on the target image features of the target object.
In the above-described motion migration method, the target object is divided into a plurality of parts, and the offset information of each part when the source action is converted to the target action is determined based on the position information of the specified position points contained in the part; a deformation operation is performed on the source image features of the part based on the offset information to obtain the target image features of the part; and the target object under the target action is then obtained through the target image features of the parts. In this method, the target object is divided into a plurality of parts, the offset information of each part is obtained based on the position information, under the source action and the target action, of the specified position points contained in the part, and the deformation operation is performed on the image features of the part based on this offset information. Calculating the offset information part by part yields a finer and more accurate description of how the target action deviates from the original action, which in turn improves the overall effect of motion migration.
This is described further below. First, the target object may be divided into a plurality of parts according to the connection relationships between the specified position points of the target object, and the local sub-pose data of each part are obtained from the specified position points contained in that part. As an example, when the target object is divided into the three parts head, torso and legs, $P_s^{local} = \{P_s^{head}, P_s^{torso}, P_s^{leg}\}$ and $P_t^{local} = \{P_t^{head}, P_t^{torso}, P_t^{leg}\}$ represent the source local sub-poses and the target local sub-poses after part division, respectively; $P_s^{head}$, $P_s^{torso}$ and $P_s^{leg}$ are the source local sub-poses of the head, torso and legs, i.e. the position information, under the source action, of the specified position points contained in each of these parts, and $P_t^{head}$, $P_t^{torso}$ and $P_t^{leg}$ are the corresponding target local sub-poses, i.e. the position information of those specified position points under the target action.
After the position information of the specified position points contained in each part is obtained, for each part, the position information of those points under the source action and the target action is input into an appearance flow generation network, which outputs the offset information of the part and the feature visibility information of the part. Referring to the schematic diagram of a motion migration method shown in fig. 3, an appearance flow generation network may be provided for each part. After the source human body pose and the target human body pose are divided into local sub-poses per part (a part may also be referred to as a component), the position information of the specified position points contained in each part is input into the appearance flow generation network corresponding to that part, which outputs the offset information of the part and the feature visibility information of the part; the offset information of the part may be referred to as a local appearance flow map, and the feature visibility information of the part may be referred to as a local visibility map. The part-based appearance flow generation networks may be represented as $G_{flow} = \{G_{flow}^{head}, G_{flow}^{torso}, G_{flow}^{leg}\}$, where $G_{flow}^{head}$, $G_{flow}^{torso}$ and $G_{flow}^{leg}$ are the appearance flow generation networks of the head, torso and legs, respectively. The operating principle of the appearance flow generation networks can be expressed as $(W_{local}, V_{local}) = G_{flow}(P_s^{local}, P_t^{local})$, where $W_{local}$ represents the offset information of a part and $V_{local}$ represents the feature visibility information of the part.
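A minimal PyTorch sketch of one such per-part network follows. The layer widths, kernel sizes, the six-channel sub-pose input and the sigmoid on the visibility head are assumptions for illustration; the excerpt specifies only the inputs (the part's sub-poses under both actions) and the outputs ($W_{local}$, $V_{local}$).

```python
import torch
import torch.nn as nn

class PartFlowNet(nn.Module):
    """Sketch of one per-part appearance flow generation network G_flow."""
    def __init__(self, pose_channels=6):  # channels per sub-pose: an assumption
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(2 * pose_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.flow_head = nn.Conv2d(64, 2, 3, padding=1)  # W_local: (dx, dy)
        self.vis_head = nn.Conv2d(64, 1, 3, padding=1)   # V_local in [0, 1]

    def forward(self, pose_src, pose_tgt):
        h = self.backbone(torch.cat([pose_src, pose_tgt], dim=1))
        return self.flow_head(h), torch.sigmoid(self.vis_head(h))

# One independent network per part, as in the text:
flow_nets = {part: PartFlowNet() for part in ("head", "torso", "leg")}
```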
Taking the target object being a human body as an example, the human body is composed of different parts with different motion complexities with respect to pose changes. Thus, the present embodiment decomposes the human body into different semantic parts, such as the head, torso and legs, and uses separate, independent appearance flow generation networks to estimate the local appearance flows of these parts, i.e. the offset information of each part. In this way, the embodiment not only reduces the difficulty, for the network model, of directly learning complex overall human body pose changes, but also uses a dedicated network to handle the pose change of each human body part more accurately and specifically, thereby improving the accuracy of the offset information and, in turn, the overall effect of motion migration.
After the offset information of each part is obtained, the source image features of the part are processed based on the offset information to finally obtain the target image features of the part. Continuing with fig. 3, after the source human body picture is cropped, source local pictures of the three parts head, torso and legs are obtained; each source local picture passes through its own feature extraction network $E_{head}$, $E_{torso}$ or $E_{leg}$, after which the source local picture features (also referred to as the source image features) of each part are obtained. The source local picture features, the local appearance flow and the local visibility map are input together into a local deformation module, which performs the following processing and outputs the deformed source local picture features, i.e. the target image features.
Specifically, a deformation operation is first performed on the source image features of the part based on the offset information to obtain intermediate image features of the part; this deformation operation may also be referred to as a warp operation. The deformation operation may apply an affine image transformation to the image features of the part based on the offset information to obtain the intermediate image features of the part. The offset information records the coordinate offsets between the source features and the target features of the part, and the offset information of a part may be referred to as a local appearance flow. In a specific implementation, the intermediate image features can be obtained by performing a bilinear-interpolation deformation operation on the source image features based on the offset information of the part; the intermediate image features are then aligned, or matched, with the target motion of the part. The intermediate image features may be represented as $\tilde{F}_{local} = \{\tilde{F}_{head}, \tilde{F}_{torso}, \tilde{F}_{leg}\}$, where $\tilde{F}_{head}$, $\tilde{F}_{torso}$ and $\tilde{F}_{leg}$ are the intermediate image features of the head, torso and legs, respectively. The generation of the intermediate image features $\tilde{F}_{local}$ can be expressed as $\tilde{F}_{local} = G_{warp}(F_{local}, W_{local})$, where $G_{warp}$ represents the deformation operation based on bilinear interpolation and $F_{local} = \{F_{head}, F_{torso}, F_{leg}\}$ represents the source image features of the parts, with $F_{head}$, $F_{torso}$ and $F_{leg}$ the source image features of the head, torso and legs, respectively. Fig. 4 is a schematic diagram of the deformation operation based on bilinear interpolation: after bilinear sampling based on the local appearance flow, the feature at each position point of the source image features is displaced to some extent, yielding the intermediate image features.
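The deformation $G_{warp}$ can be sketched with torch.nn.functional.grid_sample; the convention that the flow stores pixel-unit (dx, dy) offsets is an assumption.

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Bilinear-interpolation deformation of part features.

    feat: (N, C, H, W) source image features of a part.
    flow: (N, 2, H, W) local appearance flow in pixels (dx first, dy second
    -- an assumed convention). Returns the intermediate image features.
    """
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow                    # sampling positions
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0        # normalise to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                 # (N, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
```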
Considering that the source action and the target action of the target object often differ in part visibility, the intermediate image features need to be screened based on the feature visibility information of the part to obtain the target image features of the part, thereby improving the accuracy of the target image features. The feature visibility information indicates, for each position point in the part, the probability that the target image feature of that position point exists in the source image features; the part consists of a plurality of position points, and the specified position points contained in the part belong to these position points. The feature visibility information may be represented as $V_{local} = \{V_{head}, V_{torso}, V_{leg}\}$, where $V_{head}$, $V_{torso}$ and $V_{leg}$ are the feature visibility information of the head, torso and legs, respectively. In a specific example, the feature visibility information stores confidence values between 0 and 1, one per position point; the confidence value indicates whether the target image feature of that position point exists in the source image features, and the larger the confidence value, the higher the probability that it does.
When the intermediate image features are screened based on the feature visibility information of the part, the part action features of the part under the target action are first obtained from the position information, under the target action, of the specified position points contained in the part; this can be implemented by an encoder network consisting of several convolutional layers. The position information of the specified position points contained in the part under the target action may also be called the target local sub-pose data, represented as $P_t^{local}$; the target local sub-pose data are input into the encoder network, which outputs the part action features of the part under the target action, i.e. $F_{local}^{p}$. Since the position information of the specified position points contained in the part under the target action is generally represented by heat maps, this process can also be understood as converting the heat-map representation into a feature representation.
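A minimal sketch of such a convolutional encoder; the channel counts are assumptions.

```python
import torch.nn as nn

# Encodes the target local sub-pose heat maps P_t^local (here assumed to have
# 6 channels) into the part action features F^p_local (here 64 channels).
pose_encoder = nn.Sequential(
    nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
)
```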
The foregoing $F_{local}^{p}$ denotes the target local pose features, i.e. the part action features, obtained by encoding the target local sub-pose data $P_t^{local}$. Then, weighted summation processing is performed on the part action features and the intermediate image features based on the feature visibility information of the part to obtain the target image features, which can further improve their accuracy. Specifically, the target image features can be calculated by the following formula:

$\hat{F}_{local} = V_{local} \odot \tilde{F}_{local} + (1 - V_{local}) \odot F_{local}^{p}$

where $\hat{F}_{local}$ represents the target image features; $V_{local}$ represents the feature visibility information of the part; $\tilde{F}_{local}$ represents the intermediate image features; $F_{local}^{p}$ represents the part action features; and $\odot$ denotes element-wise multiplication. As can be seen from this expression, for a position point with a relatively high confidence value, the feature of $\hat{F}_{local}$ approaches that of $\tilde{F}_{local}$, while for a position point with a lower confidence value it approaches that of $F_{local}^{p}$.
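The formula translates directly into code; the single-channel visibility map broadcasts over the feature channels.

```python
def fuse_by_visibility(v_local, f_warped, f_pose):
    # v_local: (N, 1, H, W) feature visibility information in [0, 1]
    # f_warped: (N, C, H, W) intermediate (warped) image features
    # f_pose:  (N, C, H, W) part action features
    # Where visibility is high the warped source feature is kept; where it is
    # low the feature encoded from the target sub-pose fills in.
    return v_local * f_warped + (1.0 - v_local) * f_pose
```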
The motion migration mode can fully utilize the prior structure position information of a human body or other objects to generate more accurate and high-quality appearance stream data, can effectively process the situation that the source motion and the target motion have great difference, and improves the overall effect of motion migration.
After the target image features of each part are obtained, in order to effectively capture the local semantic correlations that exist among the pixels inside the different parts, a dilated convolution network is further introduced: the target image features of the part are input into the dilated convolution network, which outputs the final target image features of the part. The dilated convolution network comprises at least one dilated convolution layer connected in series, and each dilated convolution layer is provided with a preset dilation rate parameter. The dilation rate parameters of the dilated convolution layers may be the same or different; in a specific example, the dilated convolution network includes two dilated convolution layers with different dilation rate parameters. As shown in fig. 5, the source local image features of each part undergo feature deformation and feature selection (also referred to as feature screening) based on the appearance flow map and the visibility map, yielding the target image features of the part; after the target image features of the part are processed by the hybrid dilated convolution module, the final target image features of the part, i.e. the deformed source local image features, are obtained. The hybrid dilated convolution module corresponds to the dilated convolution network described above; the module comprises two cascaded dilated convolution layers, where the first dilated convolution layer has a dilation rate of 1 and the second a dilation rate of 2.
Processing the target image features with the dilated convolution network enlarges the spatial receptive field of each position point in the target image features and increases information interaction with neighbouring position points. Let $G_{hdcb}$ denote the above dilated convolution network, which may also be referred to as the hybrid dilated convolution module; the final target image features of the different parts (also referred to as the deformed local picture features) $\bar{F}_{local}$ can be obtained by $\bar{F}_{local} = G_{hdcb}(\hat{F}_{local})$.
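A sketch of $G_{hdcb}$ with the two cascaded dilated convolution layers described above (dilation rates 1 and 2); the channel width and kernel size are assumptions.

```python
import torch.nn as nn

class HybridDilatedConv(nn.Module):
    """Sketch of the hybrid dilated convolution module G_hdcb."""
    def __init__(self, channels=64):
        super().__init__()
        self.block = nn.Sequential(
            # dilation 1, padding 1 keeps the spatial size
            nn.Conv2d(channels, channels, 3, padding=1, dilation=1), nn.ReLU(),
            # dilation 2 enlarges the receptive field; padding 2 keeps the size
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2), nn.ReLU(),
        )

    def forward(self, x):
        return self.block(x)
```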
In the related art, the target features at each spatial position are generated independently, and the semantic correlations existing between target features at different spatial positions are in most cases not considered. Taking the target object being a human body as an example, adjacent pixels within a local area belonging to the same human body part generally have semantically related and consistent appearance features. Therefore, unlike the related art, the present embodiment additionally introduces a hybrid dilated convolution module into the network to effectively capture the local semantic correlations existing inside the different human body parts. This approach can fully account for the local and global semantic correlations between target features at different spatial positions, improving the global appearance consistency of the generated result while preserving local appearance details, thereby improving the overall effect of motion migration.
After the target image features of each part are obtained by the above embodiments, the image of the target object under the target action needs to be obtained based on the target image features of the parts. Specifically, the object action features of the target object under the target action are first obtained from the position information of the specified position points of the target object under the target action; the position information of the specified position points of the target object under the target action may be called the target human body pose, represented as $P_t$. $P_t$ is encoded by an encoder consisting of convolutional layers to obtain the object action features, also called the target global pose features; the target global pose features include the features of the position information of all specified position points of the target object, i.e. they provide key information about where the different human body parts are placed in the target image.
Then, the target object under the target action is obtained based on the object action features of the target object and the target image features of each part. When generating the target object under the target action, both the overall object action features and the target image features of each part are referenced; consulting the overall features and the local part features at the same time makes the features more accurate and the motion migration effect better.
Specifically, the object action features of the target object and the target image features of each part are spliced to obtain a spliced feature; the spliced feature is input into a feature fusion network, which outputs an initial fusion feature; based on the initial fusion feature, the target object under the target action is determined. The feature fusion network can be implemented by one or more convolutional layers and is mainly used for feature aggregation of the spliced feature. In a specific implementation, the deformed target image features of the individual parts $\bar{F}_{head}$, $\bar{F}_{torso}$ and $\bar{F}_{leg}$, i.e. the deformed local picture features of the different body parts, and the target global pose features $F_t^{p}$ are spliced together and input into a global fusion module (i.e. the feature fusion network described above), which outputs the initial global fusion feature $F_{fusion}$, i.e. the initial fusion feature described above. The generation of $F_{fusion}$ can be expressed as $F_{fusion} = G_{fusion}(\bar{F}_{head}, \bar{F}_{torso}, \bar{F}_{leg}, F_t^{p})$, where $G_{fusion}$ represents the feature fusion network.
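A minimal sketch of the splicing and fusion step; the channel counts and the single-convolution fusion body are assumptions.

```python
import torch
import torch.nn as nn

class GlobalFusion(nn.Module):
    """Sketch of G_fusion: splice per-part features with the target global
    pose feature along the channel axis and aggregate by convolution."""
    def __init__(self, part_channels=64, pose_channels=64, out_channels=128):
        super().__init__()
        in_ch = 3 * part_channels + pose_channels  # head + torso + legs + pose
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch, out_channels, 3, padding=1), nn.ReLU(),
        )

    def forward(self, f_head, f_torso, f_leg, f_pose):
        return self.fuse(torch.cat([f_head, f_torso, f_leg, f_pose], dim=1))
```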
Due to the inherent symmetry of the human body, semantic correlations also exist between the appearance features of human body parts that are far apart; for example, the appearance features of the left and right legs should be kept consistent. Therefore, this embodiment introduces a lightweight and effective pyramid-pooling-based non-local module to capture the global semantic correlations of different human body parts at different scales. This approach takes into account the semantic correlations between parts that are far apart, thereby improving the motion migration effect.
Specifically, after the initial fusion feature is obtained, a multi-scale pyramid pooling operation is performed on the initial fusion feature to obtain hierarchical features at several specified scales. The multi-scale pyramid pooling operation can adaptively divide the initial fusion feature into a plurality of local regions according to preset scale parameters; the specified scales of the hierarchical features may be the same or different. The multi-scale pyramid pooling operation can also select the most salient global representation from each local region, which can be implemented by a max pooling operation; hierarchical features of different scales are thus generated in parallel from the local regions and their corresponding global representations. For example, the scales of the hierarchical features may be 4×4×c, 6×6×c, etc., where 4×4 and 6×6 denote the spatial resolution of the hierarchical feature and c denotes its number of channels.
Then, for each hierarchical feature, a non-local operation is performed on the hierarchical feature to obtain an operation result; the non-local operation here is mainly used to weight-sum the features at all spatial positions on the feature map of the hierarchical feature to obtain the response value at a particular target position. After the operation result of the hierarchical feature is obtained, it is up-sampled to obtain a sampling result of the hierarchical feature; the scale of the sampling result of each hierarchical feature is the same as that of the initial fusion feature. The sampling results of the hierarchical features and the initial fusion feature are then fused to obtain a global fusion feature: the sampling results and the initial fusion feature are first spliced to obtain a splicing result, and the splicing result is then processed by a convolutional layer to obtain the final global fusion feature. For example, with two hierarchical features whose sampling results both have scale h×w×c and an initial fusion feature of scale h×w×c, the splicing result has scale h×w×3c; after a convolutional layer with a 1×1 convolution kernel, a global fusion feature of scale h×w×c is obtained. The global fusion feature is denoted $F_{global}$; its generation can be expressed as $F_{global} = G_{pnb}(F_{fusion})$, where $F_{fusion}$ represents the initial fusion feature and $G_{pnb}$ represents the multi-scale pyramid pooling and non-local operations described above.
Referring to fig. 6, the initial global fusion feature, i.e. the initial fusion feature, is pyramid-pooled to obtain several local regions of specified scales; each local region undergoes a non-local operation to obtain a hierarchical feature carrying the responses, and the hierarchical features are up-sampled, spliced with the initial global fusion feature and then passed through a convolution operation to obtain the global fusion feature. The non-local operation, also shown in fig. 6, consists of a series of operations on a local region: convolution, reshaping, feature matrix multiplication, softmax processing and element-wise feature addition.
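A simplified sketch of $G_{pnb}$ follows. The 4×4 and 6×6 pool sizes come from the example above; the single-head attention form of the non-local operation and the residual addition are assumptions consistent with the convolution, matrix multiplication, softmax and element-wise addition steps named for fig. 6.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledNonLocal(nn.Module):
    """Sketch of the pyramid-pooling-based non-local module G_pnb."""
    def __init__(self, channels=128, pool_sizes=(4, 6)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.theta = nn.Conv2d(channels, channels // 2, 1)
        self.phi = nn.Conv2d(channels, channels // 2, 1)
        self.g = nn.Conv2d(channels, channels, 1)
        self.merge = nn.Conv2d(channels * (1 + len(pool_sizes)), channels, 1)

    def non_local(self, x):
        n, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (N, HW, C/2)
        k = self.phi(x).flatten(2)                    # (N, C/2, HW)
        v = self.g(x).flatten(2).transpose(1, 2)      # (N, HW, C)
        attn = torch.softmax(q @ k, dim=-1)           # pairwise affinities
        out = (attn @ v).transpose(1, 2).reshape(n, c, h, w)
        return x + out                                # element-wise addition

    def forward(self, x):
        h, w = x.shape[2:]
        outs = [x]
        for s in self.pool_sizes:
            level = F.adaptive_max_pool2d(x, s)       # hierarchical feature
            level = self.non_local(level)             # non-local operation
            outs.append(F.interpolate(level, size=(h, w), mode="bilinear",
                                      align_corners=True))  # up-sampling
        return self.merge(torch.cat(outs, dim=1))     # splice + 1x1 conv
```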
Finally, the global fusion feature is decoded to obtain the target object under the target action. Specifically, the global fusion feature $F_{global}$ is input into a decoder network consisting of several deconvolution layers for decoding, generating the final target picture $\hat{I}_t$, which contains the target object under the target action.
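A sketch of a decoder Dec built from deconvolution (transposed convolution) layers as described; the depth, channel widths and final Tanh are assumptions.

```python
import torch.nn as nn

decoder = nn.Sequential(
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),  # 3-channel RGB target picture
)
```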
With continued reference to fig. 3, fig. 3 describes the above motion migration process in full: the local deformation module outputs the target image features of each part, i.e. the deformed source local picture features; then, a feature extraction network $E_{pose}$ performs feature extraction on the position information of the specified position points of the target object under the target action to obtain the object action features of the target object under the target action. The object action features of the target object may also be called the target global pose features, which include the position information of all specified position points of the target object. The target global pose features and the target image features of each part are input together into the global fusion module, which performs feature fusion and outputs the global fusion feature; the global fusion feature is processed by the decoder Dec to obtain the target picture, which contains the target object under the target action.
The above motion migration method can be implemented by a motion migration model, and fig. 7 shows the training process of the motion migration model. A source sample picture, source pose sample data and target pose sample data are input into an initial model, which outputs a target picture; a loss function is computed from the target picture and the corresponding ground-truth picture to obtain a loss value; the model parameters of the initial model are then updated and optimized by gradient descent based on the loss value. The steps of inputting the source sample picture, the source pose sample data and the target pose sample data into the initial model are repeated until the maximum number of iterations is reached, and the model parameters of the current model are saved to obtain the motion migration model. For a model dedicated to human motion migration, the source sample picture may be a source human body sample picture, and the source pose sample data and target pose sample data may be source human body pose sample data and target human body pose sample data.
In the training stage of the model, the ground-truth picture $I_t$ corresponding to the source human body picture is used to calculate the loss function, guiding the network to update its parameters and continuously improving the quality of the generated results. Specifically, the loss function of the model measures the difference between the generated target picture $\hat{I}_t$ and the real target image $I_t$; the smaller the loss function, the more similar the two images. During training, the weight parameters of the model are iteratively updated and optimized by a gradient descent algorithm so that the model output $\hat{I}_t$ is as consistent as possible with the ground truth $I_t$, ultimately yielding the desired motion migration model.
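A sketch of the fig. 7 training loop under stated assumptions: the L1 reconstruction loss, the Adam optimiser, the data-loader interface and the checkpoint name are illustrative, as the excerpt does not name the concrete loss terms or optimiser.

```python
import torch
import torch.nn.functional as F

def train(model, loader, max_iters, lr=2e-4, device="cpu"):
    # loader is assumed to yield (source picture, source pose, target pose,
    # ground-truth picture) batches.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    it = 0
    while it < max_iters:
        for src_img, src_pose, tgt_pose, true_img in loader:
            pred = model(src_img.to(device), src_pose.to(device),
                         tgt_pose.to(device))
            loss = F.l1_loss(pred, true_img.to(device))  # smaller = more similar
            opt.zero_grad()
            loss.backward()
            opt.step()
            it += 1
            if it >= max_iters:
                break
    torch.save(model.state_dict(), "motion_migration_model.pt")  # assumed name
```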
As shown in fig. 8, after model training is completed, the model can be invoked to perform motion migration tasks. Taking human motion migration as an example, after the model parameters of the motion migration model are loaded, the source human body picture $I_s$, the source human body pose $P_s$ and the target human body pose $P_t$ are input into the motion migration model, which outputs the result image containing the human body image under the target action.
Fig. 9 shows the motion migration effects output by the motion migration model; fig. 9 includes four groups of examples. In each group, the first image in the upper left corner is the source human body image, and target human body images are obtained based on target human body pose data; the human body actions in the target human body images match the actions embodied by the target human body pose data.
Corresponding to the above method embodiment, referring to fig. 10, a schematic structural diagram of an action migration device is shown, where the device includes:
an information obtaining module 102, configured to obtain location information of a specified location point in a target object under a source action and a target action; wherein the target object is divided into a plurality of parts;
an information processing module 104 for determining offset information of the part when the source action is changed to the target action based on the position information of the specified position point included in the part; performing deformation operation on the source image characteristics of the part based on the offset information to obtain target image characteristics of the part;
and the object output module 106 is used for obtaining the target object under the target action through the target image characteristics of each part.
The motion migration device divides the target object into a plurality of parts, and determines the offset information of each part when the source action is converted to the target action based on the position information of the specified position points contained in the part; a deformation operation is performed on the source image features of the part based on the offset information to obtain the target image features of the part; the target object under the target action is then obtained through the target image features of the parts. Here, the target object is divided into a plurality of parts, the offset information of each part is obtained based on the position information, under the source action and the target action, of the specified position points contained in the part, and the deformation operation is performed on the image features of the part based on this offset information. Calculating the offset information part by part yields a finer and more accurate description of how the target action deviates from the original action, which in turn improves the overall effect of action migration.
The above information processing module is further configured to: performing deformation operation on the source image features of the part based on the offset information to obtain intermediate image features of the part; screening the intermediate image features based on the feature visibility information of the parts to obtain target image features of the parts; wherein the feature visibility information is used to indicate: probability that the target image feature of each location point in the part exists in the source image feature; the part is composed of a plurality of position points, and the designated position point contained in the part belongs to the plurality of position points.
The above information processing module is further configured to: obtain the part action features of the part under the target action through the position information, under the target action, of the designated position points contained in the part; and perform weighted summation of the part action features and the intermediate image features based on the feature visibility information of the part to obtain the target image features.
The above information processing module is further configured to calculate the target image features by the following formula:

$$\hat{F}_{local} = V_{local} \odot F^{warp}_{local} + \left(1 - V_{local}\right) \odot F^{pose}_{local}$$

where $\hat{F}_{local}$ represents the target image feature; $V_{local}$ represents the feature visibility information of the part; $F^{warp}_{local}$ represents the intermediate image feature; $F^{pose}_{local}$ represents the part action feature; and $\odot$ denotes element-wise multiplication.
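Under the same shape assumptions as the warping sketch above, the visibility-weighted fusion of this formula can be sketched as:

```python
import torch

def fuse_by_visibility(warped: torch.Tensor,
                       pose_feat: torch.Tensor,
                       visibility: torch.Tensor) -> torch.Tensor:
    """Weighted summation of part action features and intermediate features.

    warped:     (N, C, H, W) intermediate (warped source) image features
    pose_feat:  (N, C, H, W) part action features under the target action
    visibility: (N, 1, H, W) per-point probability that the target feature
                is visible in the source features, in [0, 1]
    """
    return visibility * warped + (1.0 - visibility) * pose_feat
```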
The above information processing module is further configured to: for each part, input the position information, under the source action and the target action, of the designated position points contained in the part into the appearance flow generation network, and output the offset information of the part and the feature visibility information of the part.
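The patent does not fix the layer configuration of the appearance flow generation network; the following is one plausible minimal sketch in which a small CNN consumes the per-part pose representations under both actions and emits a 2-channel flow field plus a 1-channel visibility map (all layer sizes and the 18-channel pose encoding are assumptions):

```python
import torch
import torch.nn as nn

class AppearanceFlowNet(nn.Module):
    """Maps source/target pose maps of one part to (flow, visibility)."""

    def __init__(self, pose_channels: int = 18, hidden: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(2 * pose_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.flow_head = nn.Conv2d(hidden, 2, 3, padding=1)  # (dx, dy) offsets
        self.vis_head = nn.Conv2d(hidden, 1, 3, padding=1)   # visibility logits

    def forward(self, pose_src: torch.Tensor, pose_tgt: torch.Tensor):
        x = self.backbone(torch.cat([pose_src, pose_tgt], dim=1))
        flow = self.flow_head(x)
        visibility = torch.sigmoid(self.vis_head(x))  # probability in [0, 1]
        return flow, visibility
```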
The device further comprises a feature output module, configured to input the target image features of the part into a dilated convolution network and output the final target image features of the part; the dilated convolution network comprises at least one serially connected dilated convolution layer, and each dilated convolution layer is provided with a preset dilation rate parameter.
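A minimal sketch of such a serially connected dilated convolution stack (the layer count and dilation rates (1, 2, 4) are illustrative assumptions; the patent only requires at least one layer with a preset dilation rate):

```python
import torch.nn as nn

# Serial dilated convolutions enlarge the receptive field without shrinking
# the feature map; padding = dilation keeps H and W constant for 3x3 kernels.
def make_dilated_refiner(channels: int = 256,
                         dilation_rates=(1, 2, 4)) -> nn.Sequential:
    layers = []
    for rate in dilation_rates:
        layers += [
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=rate, dilation=rate),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)
```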
The above object output module is further configured to: obtain object action features of the target object under the target action through the position information of the designated position points of the target object under the target action; and obtain the target object under the target action based on the object action features of the target object and the target image features of each part.
The above object output module is further configured to: concatenate the object action features of the target object and the target image features of each part to obtain concatenated features; input the concatenated features into a feature fusion network and output initial fusion features; and determine the target object under the target action based on the initial fusion features.
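A small sketch of this concatenate-and-fuse step (all channel sizes, the part count, and the single 1x1 convolution standing in for the feature fusion network are assumptions):

```python
import torch
import torch.nn as nn

# Illustrative shapes: one object action feature map of 256 channels plus
# 8 part feature maps of 64 channels each, all at the same resolution.
object_feat = torch.randn(1, 256, 64, 64)
part_feats = [torch.randn(1, 64, 64, 64) for _ in range(8)]

concatenated = torch.cat([object_feat, *part_feats], dim=1)    # channel-wise
fusion_net = nn.Conv2d(256 + 8 * 64, 256, kernel_size=1)       # fusion network
initial_fusion = fusion_net(concatenated)                      # (1, 256, 64, 64)
```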
The above object output module is further configured to: perform a multi-scale pyramid pooling operation on the initial fusion features to obtain hierarchical features at a plurality of specified scales; for each hierarchical feature, perform a non-local operation on the hierarchical feature to obtain an operation result, and up-sample the operation result to obtain a sampling result of the hierarchical feature; fuse the sampling results of the hierarchical features with the initial fusion features to obtain global fusion features; and decode the global fusion features to obtain the target object under the target action.
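As a rough sketch of this global fusion step (the scales, channel counts, and the simplified dot-product attention standing in for the non-local operation are all assumptions; the final decoding step is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalFusion(nn.Module):
    """Multi-scale pyramid pooling + per-level non-local + fusion sketch."""

    def __init__(self, channels: int = 256, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.merge = nn.Conv2d(channels * (len(scales) + 1), channels, 1)

    @staticmethod
    def non_local(x: torch.Tensor) -> torch.Tensor:
        # Simplified non-local operation: pairwise attention over positions
        n, c, h, w = x.shape
        flat = x.flatten(2)                                    # (N, C, H*W)
        attn = torch.softmax(flat.transpose(1, 2) @ flat / c ** 0.5, dim=-1)
        out = (flat @ attn.transpose(1, 2)).view(n, c, h, w)
        return x + out                                         # residual

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        n, c, h, w = fused.shape
        outs = [fused]
        for s in self.scales:
            level = F.adaptive_avg_pool2d(fused, s)   # pyramid pooling level
            level = self.non_local(level)             # non-local per level
            outs.append(F.interpolate(level, size=(h, w),
                                      mode="bilinear", align_corners=False))
        # Fuse the up-sampled levels with the initial fusion features
        return self.merge(torch.cat(outs, dim=1))
```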
The embodiment also provides an electronic device, including a processor and a memory, where the memory stores machine executable instructions that can be executed by the processor, and the processor executes the machine executable instructions to implement the action migration method. The electronic device may be a server or a terminal device.
Referring to fig. 11, the electronic device includes a processor 100 and a memory 101, the memory 101 storing machine executable instructions that can be executed by the processor 100, the processor 100 executing the machine executable instructions to implement the above-described action migration method.
Further, the electronic device shown in fig. 11 further includes a bus 102 and a communication interface 103, and the processor 100, the communication interface 103, and the memory 101 are connected through the bus 102.
The memory 101 may include a high-speed random access memory (Random Access Memory, RAM) and may further include a non-volatile memory, such as at least one disk memory. The communication connection between the system network element and at least one other network element is implemented via at least one communication interface 103 (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, and the like. The bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like, and buses may be classified into address buses, data buses, control buses, and so on. For ease of illustration, only one bi-directional arrow is shown in FIG. 11, but this does not mean that there is only one bus or one type of bus.
The processor 100 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 100 or by instructions in the form of software. The processor 100 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic blocks disclosed in the embodiments of the present invention may be implemented or executed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied as being executed directly by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 101, and the processor 100 reads the information in the memory 101 and completes the steps of the method of the foregoing embodiments in combination with its hardware.
The present embodiments also provide a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the above-described action migration method.
The computer program product of the action migration method, the action migration device, and the electronic device provided by the embodiments of the present invention includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the method described in the foregoing method embodiments, and for the specific implementation, reference may be made to the method embodiments, which will not be repeated here.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified and limited, the terms "mounted", "connected", and "coupled" are to be construed broadly; for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or an internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific circumstances.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, an electronic device, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific implementations of the present invention, used to illustrate the technical solutions of the present invention rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions of some of the technical features within the technical scope disclosed by the present invention; such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be covered within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. A method of action migration, the method comprising:
acquiring position information of a designated position point in a target object under a source action and a target action; wherein the target object is divided into a plurality of parts;
determining offset information of the part when the source action is converted to the target action based on position information of a designated position point contained in the part; determining a target image feature of the part based on the offset information and a source image feature of the part; and
obtaining a target object under the target action through the target image features of each part;
wherein the step of determining offset information of the part when the source action is converted to the target action based on the position information of the designated position point contained in the part comprises:
for each part, inputting the position information, under the source action and the target action, of the designated position point contained in the part into an appearance flow generation network, and outputting the offset information of the part and the feature visibility information of the part; wherein the offset information of the part includes offset information of each position point in the part.
2. The method according to claim 1, wherein the step of determining the target image feature of the part based on the offset information and the source image feature of the part comprises:
performing a deformation operation on the source image features of the part based on the offset information to obtain intermediate image features of the part; and
screening the intermediate image features based on the feature visibility information of the part to obtain target image features of the part; wherein the feature visibility information is used to indicate the probability that the target image feature of each position point in the part exists in the source image features; the part is composed of a plurality of position points, and the designated position point contained in the part belongs to the plurality of position points.
3. The method according to claim 2, wherein the step of screening the intermediate image features based on the feature visibility information of the part to obtain the target image features of the part comprises:
obtaining part action features of the part under the target action through position information, under the target action, of the designated position point contained in the part; and
performing weighted summation of the part action features and the intermediate image features based on the feature visibility information of the part to obtain the target image features.
4. The method according to claim 3, wherein the step of performing weighted summation of the part action features and the intermediate image features based on the feature visibility information of the part to obtain the target image features comprises:
calculating the target image feature by the following formula:

$$\hat{F}_{local} = V_{local} \odot F^{warp}_{local} + \left(1 - V_{local}\right) \odot F^{pose}_{local}$$

wherein $\hat{F}_{local}$ represents the target image feature; $V_{local}$ represents the feature visibility information of the part; $F^{warp}_{local}$ represents the intermediate image feature; $F^{pose}_{local}$ represents the part action feature; and $\odot$ denotes element-wise multiplication.
5. The method according to claim 1, wherein, before the step of obtaining the target object under the target action through the target image features of each part, the method further comprises:
inputting the target image features of the part into a dilated convolution network and outputting final target image features of the part; wherein the dilated convolution network comprises at least one serially connected dilated convolution layer, and each dilated convolution layer is provided with a preset dilation rate parameter.
6. The method according to claim 1, wherein the step of obtaining the target object under the target action through the target image features of each part comprises:
obtaining object action features of the target object under the target action through the position information of the designated position points of the target object under the target action; and
obtaining the target object under the target action based on the object action features of the target object and the target image features of each part.
7. The method according to claim 6, wherein the step of obtaining the target object under the target action based on the object action features of the target object and the target image features of each part comprises:
concatenating the object action features of the target object and the target image features of each part to obtain concatenated features;
inputting the concatenated features into a feature fusion network and outputting initial fusion features; and determining the target object under the target action based on the initial fusion features.
8. The method of claim 7, wherein the step of determining a target object under the target action based on the initial fusion feature comprises:
performing multi-scale pyramid pooling operation on the initial fusion features to obtain hierarchical features with a plurality of specified scales;
for each hierarchical feature, performing a non-local operation on the hierarchical feature to obtain an operation result of the hierarchical feature, and performing up-sampling processing on the operation result to obtain a sampling result of the hierarchical feature;
carrying out fusion processing on the sampling result of each hierarchical feature and the initial fusion feature to obtain a global fusion feature; and decoding the global fusion feature to obtain a target object under the target action.
9. An action migration device, the device comprising:
an information acquisition module, configured to acquire position information of designated position points in a target object under a source action and a target action, wherein the target object is divided into a plurality of parts;
an information processing module, configured to determine offset information of the part when the source action is converted to the target action based on the position information of the designated position point contained in the part, and to perform a deformation operation on the source image features of the part based on the offset information to obtain target image features of the part; and
an object output module, configured to obtain a target object under the target action through the target image features of each part;
wherein the information processing module is further configured to: for each part, input the position information, under the source action and the target action, of the designated position point contained in the part into an appearance flow generation network, and output the offset information of the part and the feature visibility information of the part; wherein the offset information of the part includes offset information of each position point in the part.
10. An electronic device comprising a processor and a memory, the memory storing machine executable instructions executable by the processor, the processor executing the machine executable instructions to implement the method of action migration of any one of claims 1-8.
11. A machine-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of action migration of any one of claims 1-8.
CN202011468573.9A 2020-12-11 2020-12-11 Action migration method and device and electronic equipment Active CN112508776B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011468573.9A CN112508776B (en) 2020-12-11 2020-12-11 Action migration method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011468573.9A CN112508776B (en) 2020-12-11 2020-12-11 Action migration method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112508776A (en) 2021-03-16
CN112508776B (en) 2024-02-27

Family

ID=74973074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011468573.9A Active CN112508776B (en) 2020-12-11 2020-12-11 Action migration method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112508776B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114401446B (en) * 2021-12-16 2024-09-24 广州方硅信息技术有限公司 Human body posture migration method, device and system, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106373109A (en) * 2016-08-31 2017-02-01 南方医科大学 Medical image modal synthesis method
CN108960192A (en) * 2018-07-23 2018-12-07 北京旷视科技有限公司 Action identification method and its neural network generation method, device and electronic equipment
CN111626218A (en) * 2020-05-28 2020-09-04 腾讯科技(深圳)有限公司 Image generation method, device and equipment based on artificial intelligence and storage medium
CN111915673A (en) * 2020-07-22 2020-11-10 深圳云天励飞技术有限公司 Image processing method, image processing device, terminal equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11556777B2 (en) * 2017-11-15 2023-01-17 Uatc, Llc Continuous convolution and fusion in neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yining Li et al., "Dense Intrinsic Appearance Flow for Human Pose Transfer," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3693-3702. *

Also Published As

Publication number Publication date
CN112508776A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN113496507B (en) Human body three-dimensional model reconstruction method
CN111160085A (en) Human body image key point posture estimation method
US12067659B2 (en) Generating animated digital videos utilizing a character animation neural network informed by pose and motion embeddings
CN110288681B (en) Character model skin method, device, medium and electronic equipment
AU2018200164A1 (en) Forecasting human dynamics from static images
Li et al. Sketch-R2CNN: an RNN-rasterization-CNN architecture for vector sketch recognition
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
US10891471B2 (en) Method and system for pose estimation
CN113593001A (en) Target object three-dimensional reconstruction method and device, computer equipment and storage medium
CN111783506A (en) Method and device for determining target characteristics and computer-readable storage medium
CN114067088A (en) Virtual wearing method, device, equipment, storage medium and program product
CN111462274A (en) Human body image synthesis method and system based on SMP L model
CN111507184A (en) Human body posture detection method based on parallel cavity convolution and body structure constraint
Hu et al. Humanliff: Layer-wise 3d human generation with diffusion model
CN117372604A (en) 3D face model generation method, device, equipment and readable storage medium
CN112508776B (en) Action migration method and device and electronic equipment
Wang et al. MDISN: Learning multiscale deformed implicit fields from single images
AU2022241513B2 (en) Transformer-based shape models
CN115147508B (en) Training of clothing generation model and method and device for generating clothing image
CN116798127A (en) Taiji boxing whole body posture estimation method, device and medium based on full convolution
US20230326137A1 (en) Garment rendering techniques
CN116363561A (en) Time sequence action positioning method, device, equipment and storage medium
CN115049764A (en) Training method, device, equipment and medium for SMPL parameter prediction model
KR20230083212A (en) Apparatus and method for estimating object posture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant