Bug 1989121 - [CBT][Veeam] Block backup of hosted-engine vm
Summary: [CBT][Veeam] Block backup of hosted-engine vm
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backup-Restore.VMs
Version: 4.4.7.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.0
Target Release: 4.5.0.1
Assignee: Mark Kemel
QA Contact: Evelina Shames
Docs Contact: bugs@ovirt.org
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-02 13:39 UTC by Yury.Panchenko
Modified: 2022-05-11 10:38 UTC
CC List: 13 users

Fixed In Version: ovirt-engine-4.5.0.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-20 06:33:59 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?
pm-rhel: planning_ack?
pm-rhel: devel_ack+
pm-rhel: testing_ack+


Attachments
Engine and vdsm logs (987.09 KB, application/x-7z-compressed)
2021-08-02 13:39 UTC, Yury.Panchenko


Links
Github oVirt ovirt-engine pull 220 (Merged): core: disable incremental backup for HEVM (last updated 2022-04-04 20:11:03 UTC)
Red Hat Issue Tracker RHV-42941 (last updated 2021-10-05 07:34:07 UTC)

Description Yury.Panchenko 2021-08-02 13:39:36 UTC
Created attachment 1810131 [details]
Engine and vdsm logs

Description of problem:
Backup of the hosted-engine VM always fails.

Version-Release number of selected component (if applicable):
vdsm-4.40.70.6-1.el8ev.x86_64
ovirt-imageio-daemon-2.2.0-1.el8ev.x86_64
libvirt-daemon-7.0.0-14.1.module+el8.4.0+11095+d46acebf.x86_64
qemu-kvm-5.2.0-16.module+el8.4.0+11536+725e25d9.2.x86_64

How reproducible:
Start a full backup of the hosted-engine VM.

Actual results:
Backup failed:
VDSErrorException: Failed to StartVmBackupVDS, error = Backup Error: {'vm_id': 'd31f3fac-0c19-4db5-903a-a4af5b54ec59', 'backup': <vdsm.virt.backup.BackupConfig object at 0x7fa45437e898>, 'reason': "Error starting backup: internal error: unable to execute QEMU command 'transaction': Cannot append backup-top filter: Could not open '/run/vdsm/storage/a735a825-e13c-4b54-bfa9-9424f385dbc8/4ebfdfea-31e8-4f15-8d30-559a35f1e375/70d63963-cd9c-452e-97d3-c671b2f83ddb': No such file or directory"}, code = 1600

Expected results:
Successful backup.

Comment 4 Arik 2022-03-15 13:35:07 UTC
Not an RFE - that's a bug

Comment 5 Mark Kemel 2022-03-15 14:09:46 UTC
This is the error thrown by the backupBegin function:

2022-03-15 10:11:13,515+0200 ERROR (jsonrpc/2) [api] FINISH start_backup error=Backup Error: {'vm_id': 'c41a2c1d-24bc-4020-8028-bca9f54bf767', 'backup': <vdsm.virt.backup.BackupConfig object at 0x7f95bc9e97b8>, 'reason': "Error starting backup: internal error: unable to execute QEMU command 'transaction': Source and target image have different sizes"} (api:131)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/backup.py", line 410, in _begin_backup
    dom.backupBegin(backup_xml, checkpoint_xml, flags=flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 159, in call
    return getattr(self._vm._dom, name)(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 797, in backupBegin
    raise libvirtError('virDomainBackupBegin() failed')
libvirt.libvirtError: internal error: unable to execute QEMU command 'transaction': Source and target image have different sizes

It happens because the HE VM disk and the scratch disk are not the same size, which is caused by an incorrect size being recorded for the HE VM disk in the engine DB.
E.g. in my env:

# qemu-img info a2df2288-c33d-4c1b-b447-4077f5000fa5
image: a2df2288-c33d-4c1b-b447-4077f5000fa5
file format: raw
virtual size: 50 GiB (53689188352 bytes)
disk size: 7.13 GiB

And in the DB:
engine=# SELECT size FROM images where image_guid='a2df2288-c33d-4c1b-b447-4077f5000fa5';
    size     
-------------
 54760833024
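
(Note: 54760833024 bytes is exactly 51 GiB (51 * 1024^3), while the actual
virtual size above is 50 GiB - so the DB value is roughly 1 GiB too large.)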

This causes the scratch disk to be created with an incorrect size (from vdsm.log):

2022-03-15 10:10:57,636+0200 INFO  (jsonrpc/0) [vdsm.api] START createVolume(sdUUID='96f3f20c-40a7-4ba8-b4d6-a13fad0135a0', spUUID='457b7702-92f4-11ec-b87a-001a4a160475', imgUUID='079f9427-2243-4285-b4f4-2eabf455a312', size='54760833024', volFormat=4, preallocate=2, diskType='SCRD', volUUID='0ca99d26-236a-4d23-9a96-92950639cce8', desc='{"DiskAlias":"VM HostedEngine backup ddc279d1-76b2-49e7-8b78-257083e6e203 scratch disk for he_virtio_disk","DiskDescription":"Backup ddc279d1-76b2-49e7-8b78-257083e6e203 scratch disk"}', srcImgUUID='00000000-0000-0000-0000-000000000000', srcVolUUID='00000000-0000-0000-0000-000000000000', initialSize=None, addBitmaps=False) from=::ffff:10.46.11.117,51490, flow_id=21362664-5924-41c7-b66f-30abe46036bc, task_id=b6dc0ab6-4d93-4f67-9916-6c5b72944bf6 (api:48)
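
As a minimal sketch, the mismatch can be spotted by comparing the two values
directly (assuming the default 'engine' database name; the UUID is the one
from above):

# Actual virtual size on storage (run in the image directory)
$ qemu-img info --output=json a2df2288-c33d-4c1b-b447-4077f5000fa5 | jq '."virtual-size"'
53689188352

# Size recorded in the engine DB (run on the engine VM)
$ sudo -u postgres psql engine -c "SELECT size FROM images WHERE image_guid='a2df2288-c33d-4c1b-b447-4077f5000fa5';"
54760833024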

Comment 6 Arik 2022-03-15 14:13:47 UTC
Based on comment 5, this is probably something related to the deployment of the hosted-engine and therefore moving to the integration team

Comment 7 Arik 2022-03-15 14:32:59 UTC
OK, hold your horses - we are checking what it takes to measure the image there, so it would also work for existing deployments

Comment 8 Nir Soffer 2022-03-15 16:51:47 UTC
Mark, can you share here:

- The metadata file of this volume (a2df2288-c33d-4c1b-b447-4077f5000fa5.meta)

- Output of "ls -l" on the volume file

- Output of

   sudo vdsm-client Volume getInfo storagedomainID={domain-id} storagepoolID={pool-id} imageID={disk-id} volumeID={volume-id}

- Output of

   sudo vdsm-client StorageDomain dump sd_id=9fcc4284-142b-4d0e-bc22-958e90564d9d | jq '.volumes | .[] | select(.image=="disk-id")'

Comment 9 Arik 2022-03-15 16:58:20 UTC
Sandro, in short we found out that it fails because the virtual size of the HE disk recorded in the engine and VDSM metadata differs from its virtual size on the storage.

This seems to be unique to the hosted engine, because otherwise we (on RHV) fix the virtual size we hold according to the size on the storage.

There is the possibility to query the virtual size from the storage during backup of hosted engine but this would arguably complicate an already complicated flow - is it realistic to fix the virtual size of the hosted engine disk for new and existing deployments in 4.5 on your end?

Comment 10 Mark Kemel 2022-03-16 07:43:11 UTC
1. Metadata: 

# cat a2df2288-c33d-4c1b-b447-4077f5000fa5.meta 
CAP=54760833024
CTIME=1645434868
DESCRIPTION={"DiskAlias":"he_virtio_disk","DiskDescription":"Hosted-Engine disk"}
DISKTYPE=HEVD
DOMAIN=96f3f20c-40a7-4ba8-b4d6-a13fad0135a0
FORMAT=RAW
GEN=0
IMAGE=a18cabd3-b894-4bb9-bcd4-906a5d817faa
LEGALITY=LEGAL
PUUID=00000000-0000-0000-0000-000000000000
TYPE=SPARSE
VOLTYPE=LEAF
EOF

2. # ls -l
-rw-rw----. 1 vdsm kvm 53689188352 Mar 16 09:10 a2df2288-c33d-4c1b-b447-4077f5000fa5

3. # vdsm-client Volume getInfo storagedomainID=96f3f20c-40a7-4ba8-b4d6-a13fad0135a0 storagepoolID=457b7702-92f4-11ec-b87a-001a4a160475 imageID=a18cabd3-b894-4bb9-bcd4-906a5d817faa volumeID=a2df2288-c33d-4c1b-b447-4077f5000fa5
{
    "apparentsize": "53689188352",
    "capacity": "54760833024",
    "children": [],
    "ctime": "1645434868",
    "description": "{\"DiskAlias\":\"he_virtio_disk\",\"DiskDescription\":\"Hosted-Engine disk\"}",
    "disktype": "HEVD",
    "domain": "96f3f20c-40a7-4ba8-b4d6-a13fad0135a0",
    "format": "RAW",
    "generation": 0,
    "image": "a18cabd3-b894-4bb9-bcd4-906a5d817faa",
    "lease": {
        "offset": 0,
        "owners": [
            1
        ],
        "path": "/rhev/data-center/mnt/ntap-rhv-dev-nfs.lab.eng.tlv2.redhat.com:_pub_mkemel_mark-he/96f3f20c-40a7-4ba8-b4d6-a13fad0135a0/images/a18cabd3-b894-4bb9-bcd4-906a5d817faa/a2df2288-c33d-4c1b-b447-4077f5000fa5.lease",
        "version": 6
    },
    "legality": "LEGAL",
    "mtime": "0",
    "parent": "00000000-0000-0000-0000-000000000000",
    "pool": "",
    "status": "OK",
    "truesize": "7657226240",
    "type": "SPARSE",
    "uuid": "a2df2288-c33d-4c1b-b447-4077f5000fa5",
    "voltype": "LEAF"
}

4. # vdsm-client StorageDomain dump sd_id=96f3f20c-40a7-4ba8-b4d6-a13fad0135a0 | jq '.volumes | .[] | select(.image=="a18cabd3-b894-4bb9-bcd4-906a5d817faa")'
{
  "apparentsize": 53689188352,
  "capacity": 54760833024,
  "ctime": 1645434868,
  "description": "{\"DiskAlias\":\"he_virtio_disk\",\"DiskDescription\":\"Hosted-Engine disk\"}",
  "disktype": "HEVD",
  "format": "RAW",
  "generation": 0,
  "image": "a18cabd3-b894-4bb9-bcd4-906a5d817faa",
  "legality": "LEGAL",
  "parent": "00000000-0000-0000-0000-000000000000",
  "status": "OK",
  "truesize": 7657234432,
  "type": "SPARSE",
  "voltype": "LEAF"
}

Comment 11 Sandro Bonazzola 2022-03-16 09:45:34 UTC
(In reply to Arik from comment #9)
> Sandro, in short we found out that it fails because the virtual size of the
> HE disk recorded in the engine and VDSM metadata differs from its virtual
> size on the storage.
> 
> This seems to be unique to the hosted engine, because otherwise we (on RHV)
> fix the virtual size we hold according to the size on the storage.
> 
> There is the possibility to query the virtual size from the storage during
> backup of hosted engine but this would arguably complicate an already
> complicated flow - is it realistic to fix the virtual size of the hosted
> engine disk for new and existing deployments in 4.5 on your end?

Redirecting the question to Asaf

Comment 12 Arik 2022-03-16 10:11:40 UTC
Yury, did you plan how to restore the hosted-engine VM once you are able to back it up?

Comment 13 Yedidyah Bar David 2022-03-16 10:18:50 UTC
(
To clarify: The "normal" flow to backup/restore HE is:
1. Take a backup using engine-backup
2. Restore with 'hosted-engine --deploy --restore-from-file'

The engine VM's disk image itself is never backed-up/restored, as-is.
)
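
For reference, a minimal sketch of that flow (file names are examples):

# On the engine VM: back up the engine configuration and database
$ engine-backup --mode=backup --file=engine.bak --log=engine-backup.log

# On a clean host: redeploy the hosted engine and restore from that file
$ hosted-engine --deploy --restore-from-file=engine.bak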

Comment 14 Yedidyah Bar David 2022-03-16 10:49:44 UTC
One way to make the request in current bug useful: Convert the backed up image to an OVA and provide that to 'hosted-engine --deploy' as the "appliance" image. This would require some refinements here and there, but likely doable. Not sure we want to recommend/support this, though, instead of comment 13 - we'd first like to understand your backup/restore plan and considerations (e.g. if you already considered comment 13, why isn't that enough). Thanks.

Comment 15 Nir Soffer 2022-03-16 10:59:42 UTC
(In reply to Mark Kemel from comment #10)
> 1. Metadata: 
> 
> # cat a2df2288-c33d-4c1b-b447-4077f5000fa5.meta 
> CAP=54760833024
...

> 2. # ls -l
> -rw-rw----. 1 vdsm kvm 53689188352 Mar 16 09:10
> a2df2288-c33d-4c1b-b447-4077f5000fa5

Vdsm metadata is corrupted, contains wrong CAP - for a raw image
CAP must match the file size.

> 3. # vdsm-client Volume getInfo
> storagedomainID=96f3f20c-40a7-4ba8-b4d6-a13fad0135a0
> storagepoolID=457b7702-92f4-11ec-b87a-001a4a160475
> imageID=a18cabd3-b894-4bb9-bcd4-906a5d817faa
> volumeID=a2df2288-c33d-4c1b-b447-4077f5000fa5
> {
>     "apparentsize": "53689188352",
>     "capacity": "54760833024",

The wrong capacity is reported here (expected)

 
> 4. # vdsm-client StorageDomain dump
> sd_id=96f3f20c-40a7-4ba8-b4d6-a13fad0135a0 | jq '.volumes | .[] |
> select(.image=="a18cabd3-b894-4bb9-bcd4-906a5d817faa")'
> {
>   "apparentsize": 53689188352,
>   "capacity": 54760833024,

The issue is reported here as well.

So we have 2 issues:

1. Engine db is corrupted - contains wrong virtual size. This must
   be fixed to match the apparentsize reported by vdsm.

2. Vdsm metadata is corrupted - this can be fixed only manually, vdsm
   does not provide an API to change this value.

It would be good to understand how the wrong metadata was created when copying
the hosted engine disk to storage. If the volume was created using the vdsm
API (Volume.create), the file size should match the metadata size.

The issue could be that the vdsm volume was created with the wrong size, and the
disk was then copied with qemu-img convert without the "-n" flag. In this case
qemu-img convert recreates the target file with the same size as the source file.

This issue should be handled in hosted-engine setup, regardless of this bug.
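
The suspected qemu-img behavior is easy to demonstrate standalone (a
hypothetical reproduction, not taken from this environment):

$ qemu-img create -f raw source.img 50G
$ qemu-img create -f raw target.img 51G    # pre-created with the wrong size
$ qemu-img convert -f raw -O raw source.img target.img
$ qemu-img info target.img                 # virtual size is back to 50G

With "-n", qemu-img convert skips target creation and writes into the
existing file, keeping its size.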

Comment 16 Arik 2022-03-16 11:02:15 UTC
(In reply to Nir Soffer from comment #15)
> This issue should be handle in hosted engine setup, regardless of this bug.

We're back to discussing the fix?
Let's first hear from Yury what the plan is for restoring the hosted engine; this BZ may not be relevant without a proper mechanism for restoring the hosted engine

Comment 17 Arik 2022-03-16 11:06:24 UTC
(In reply to Nir Soffer from comment #15)
> This issue should be handle in hosted engine setup, regardless of this bug.

Ah, regardless :)
Feel free to file another bug on that then

Comment 18 Nir Soffer 2022-03-16 11:25:12 UTC
Regarding the issue of backing up the hosted engine VM - since we have no
way to restore it (restore requires a running engine), I think backing up
the hosted engine VM should be blocked.

If we get a way to back up and restore the hosted engine VM, we can remove
this blocking.

Comment 19 Yedidyah Bar David 2022-03-16 11:28:10 UTC
BTW: For properly hooking into the process of comment 13 (e.g. to install/restore/configure/whatever other stuff, during the HE deploy/restore), see:

https://github.com/oVirt/ovirt-ansible-collection/blob/master/roles/hosted_engine_setup/README.md#make-changes-in-the-engine-vm-during-the-deployment

Comment 21 Yury.Panchenko 2022-03-30 09:26:24 UTC
Hello there
The main problem is that the customer isn't able to back up all KVM data with one backup product. For regular VMs he has to use Veeam RHV backup, and for the engine data he has to use engine-backup.
I'd like to have one solution for both types of data.

> Yury, did you plan how to restore the hosted-engine VM once you are able to back it up?
We can't restore it back to the KVM, but there are a few methods to recover some data from the backup.

Let's imagine that the hosted-engine is down; we have two ways to recover:
1) There is instant VM restore technology in Veeam B&R. A customer can place the engine on a VMware ESXi or a Hyper-V host temporarily, to run engine-backup and use it for recovery, or he can leave it forever on this hypervisor.
2) There is file level restore technology. A customer can go into the appliance files and recover the database data.

Comment 22 Yedidyah Bar David 2022-03-30 09:57:00 UTC
(In reply to Yury.Panchenko from comment #21)
> Hello there
> The main problem is that the customer isn't able to back up all KVM data
> with one backup product. For regular VMs he has to use Veeam RHV backup,
> and for the engine data he has to use engine-backup.
> I'd like to have one solution for both types of data.

You are most welcome to refer to engine-backup as part of oVirt/RHV's API,
and not merely as an end-user-facing tool, IMO - call it from Veeam RHV backup
etc., and tell us if anything is missing for using it like that.

> 
> > Yury, did you plan how to restore the hosted-engine VM once you are able to back it up?
> We can't restore it back to the KVM, but there are a few methods to recover
> some data from the backup.
> 
> Let's imagine that the hosted-engine is down; we have two ways to recover:
> 1) There is instant VM restore technology in Veeam B&R. A customer can place
> the engine on a VMware ESXi or a Hyper-V host temporarily, to run
> engine-backup and use it for recovery, or he can leave it forever on this
> hypervisor.

This is a very reasonable approach if you already have such infra on-site,
but if not, then IMO setting up something just for recovery is a lot of work.

> 2) There is file level restore technology. A customer can go into the
> appliance files and recover the database data.

This will also take quite some work, all of which engine-backup is designed
to do for you.

Comment 23 Yury.Panchenko 2022-03-30 12:18:12 UTC
> You are most welcome to refer to engine-backup as part of oVirt/RHV's API,
> and not merely as an end-user-facing tool, IMO - call it from Veeam RHV backup
> etc., and tell us if anything is missing for using it like that.
It would be great to add this feature to the backup API, so we can store this config backup in our repository

I think at some point in the future we could develop a direct restore of the engine to the RHV host. It won't be easy, but it would allow a customer to recover the engine in case of a hard failure.

Comment 24 Nir Soffer 2022-03-30 12:28:43 UTC
The backup API gives access to VM data. We don't have any API for file-level backup.

To do a file-level backup you can connect to the VM via ssh, or install an agent
on the VM that will prepare and push backup data to the backup system.

Which kind of API is missing for this?
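
A minimal illustration of the ssh variant, where the host name and path are
placeholders:

$ ssh root@engine-vm 'tar czf - /etc/ovirt-engine' > engine-config.tgz

For the engine specifically, engine-backup already packages the right set of
files, as noted in comment 22.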

Comment 25 Yury.Panchenko 2022-03-30 14:23:31 UTC
> To do file level backup you can connect to the VM via ssh, or install an agent
> on the VM that will prepare and push backup data to the the backup system.
> Which kind of API is missing for this?
The feature is nice to have, but you don't have to create it.
We already have an agent backup feature.

In general, I am interested in backing up the full hosted-engine VM.
I guess it will be possible to back it up when you switch to the snapshot backup method.

Comment 26 Arik 2022-03-30 14:30:41 UTC
(In reply to Yury.Panchenko from comment #25)
> In general, I am interested in backing up the full hosted-engine VM.
> I guess it will be possible to back it up when you switch to the snapshot
> backup method.

The new method doesn't apply to the hosted engine VM, so the issue remains

Comment 27 Nir Soffer 2022-03-30 16:00:10 UTC
(In reply to Yury.Panchenko from comment #25)
> > To do a file-level backup you can connect to the VM via ssh, or install an agent
> > on the VM that will prepare and push backup data to the backup system.
> > Which kind of API is missing for this?
> The feature is nice to have, but you don't have to create it.
> We already have an agent backup feature.
> 
> In general, I am interested in backing up the full hosted-engine VM.
> I guess it will be possible to back it up when you switch to the snapshot
> backup method.

Hosted engine does not support snapshots, so the new snapshot-based backup cannot work.

The previous incremental backup also does not work, because of a bug in
hosted-engine deployment that creates invalid volume metadata. If this issue
is fixed, for example by a system upgrade fixing the invalid metadata,
backing up the hosted engine will be possible, but without a way to restore
the VM, I don't see the point in this.

In general the hosted engine VM is a special case - it is part of the RHV
infrastructure, and in the same way you don't back up the RHV hosts, you
don't back up the hosted engine VM.

Comment 28 Yury.Panchenko 2022-03-30 16:28:40 UTC
> In general the hosted engine VM is a special case - it is part of the RHV
> infrastructure, and in the same way you don't back up the RHV hosts, you
> don't back up the hosted engine VM.
You are right.
I can't say that the feature is very important, but it can sometimes help

Comment 29 Arik 2022-04-03 11:32:43 UTC
(In reply to Yury.Panchenko from comment #28)
> > In general the hosted engine VM is a special case - it is part of the RHV
> > infrastructure, and in the same way you don't back up the RHV hosts, you
> > don't back up the hosted engine VM.
> You are right.
> I can't say that the feature is very important, but it can sometimes help

Right, we all seem to agree that it should be possible to back up the hosted engine VM as well; however, it's a special case and users are provided with dedicated tools that facilitate this.
So I'm going to change the scope of this bug - we'll block taking a backup of a hosted engine VM, which doesn't work now anyway, and advise users to leverage engine-backup and hosted-engine --deploy instead

Comment 30 Arik 2022-04-03 11:34:43 UTC
(In reply to Yury.Panchenko from comment #0)
> Created attachment 1810131 [details]
> Engine and vdsm logs
> 
> Description of problem:
> Backup of the hosted-engine VM always fails.
> 
> Version-Release number of selected component (if applicable):
> vdsm-4.40.70.6-1.el8ev.x86_64
> ovirt-imageio-daemon-2.2.0-1.el8ev.x86_64
> libvirt-daemon-7.0.0-14.1.module+el8.4.0+11095+d46acebf.x86_64
> qemu-kvm-5.2.0-16.module+el8.4.0+11536+725e25d9.2.x86_64
> 
> How reproducible:
> Start a full backup of the hosted-engine VM.
> 
> Actual results:
> Backup failed:
> VDSErrorException: Failed to StartVmBackupVDS, error = Backup Error:
> {'vm_id': 'd31f3fac-0c19-4db5-903a-a4af5b54ec59', 'backup':
> <vdsm.virt.backup.BackupConfig object at 0x7fa45437e898>, 'reason': "Error
> starting backup: internal error: unable to execute QEMU command
> 'transaction': Cannot append backup-top filter: Could not open
> '/run/vdsm/storage/a735a825-e13c-4b54-bfa9-9424f385dbc8/4ebfdfea-31e8-4f15-
> 8d30-559a35f1e375/70d63963-cd9c-452e-97d3-c671b2f83ddb': No such file or
> directory"}, code = 1600
> 
> Expected results:
> Successful backup.

So the expected result would then be to fail with an appropriate message

Comment 31 Mark Kemel 2022-04-04 14:33:43 UTC
Verification steps:

1. Start backup for the HostedEngine VM:
POST api/vms/${vmId}/backups
<backup>
    <disks>
       <disk id="${diskId}" />
    </disks>
</backup>

2. Expect the following response:
<fault>
    <detail>[Cannot backup VM. Backup of Hosted Engine VM is not supported]</detail>
    <reason>Operation Failed</reason>
</fault>
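
For example, with curl (credentials and IDs are placeholders):

$ curl -ks -u admin@internal:PASSWORD \
    -H 'Content-Type: application/xml' -H 'Accept: application/xml' \
    -d '<backup><disks><disk id="DISK-ID"/></disks></backup>' \
    https://ENGINE-FQDN/ovirt-engine/api/vms/VM-ID/backups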

Comment 32 Evelina Shames 2022-04-14 08:01:02 UTC
(In reply to Mark Kemel from comment #31)
> Verification steps:
> 
> 1. Start backup for the HostedEngine VM:
> POST api/vms/${vmId}/backups
> <backup>
>     <disks>
>        <disk id="${diskId}" />
>     </disks>
> </backup>
> 
> 2. Expect the following response:
> <fault>
>     <detail>[Cannot backup VM. Backup of Hosted Engine VM is not
> supported]</detail>
>     <reason>Operation Failed</reason>
> </fault>

Verified on ovirt-engine-4.5.0.2-0.7.el8ev, backup of the HostedEngine VM is not supported:
[Cannot backup VM. Backup of Hosted Engine VM is not supported]

Comment 33 Sandro Bonazzola 2022-04-20 06:33:59 UTC
This bugzilla is included in oVirt 4.5.0 release, published on April 20th 2022.

Since the problem described in this bug report should be resolved in oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

