Created attachment 1810131 [details]
Engine and vdsm logs

Description of problem:
Backup of the hosted-engine VM always fails.

Version-Release number of selected component (if applicable):
vdsm-4.40.70.6-1.el8ev.x86_64
ovirt-imageio-daemon-2.2.0-1.el8ev.x86_64
libvirt-daemon-7.0.0-14.1.module+el8.4.0+11095+d46acebf.x86_64
qemu-kvm-5.2.0-16.module+el8.4.0+11536+725e25d9.2.x86_64

How reproducible:
Start a full backup of the hosted-engine VM.

Actual results:
Backup fails:

VDSErrorException: Failed to StartVmBackupVDS, error = Backup Error: {'vm_id': 'd31f3fac-0c19-4db5-903a-a4af5b54ec59', 'backup': <vdsm.virt.backup.BackupConfig object at 0x7fa45437e898>, 'reason': "Error starting backup: internal error: unable to execute QEMU command 'transaction': Cannot append backup-top filter: Could not open '/run/vdsm/storage/a735a825-e13c-4b54-bfa9-9424f385dbc8/4ebfdfea-31e8-4f15-8d30-559a35f1e375/70d63963-cd9c-452e-97d3-c671b2f83ddb': No such file or directory"}, code = 1600

Expected results:
Successful backup.
Not an RFE - that's a bug
This is the error thrown by the backupBegin function:

2022-03-15 10:11:13,515+0200 ERROR (jsonrpc/2) [api] FINISH start_backup error=Backup Error: {'vm_id': 'c41a2c1d-24bc-4020-8028-bca9f54bf767', 'backup': <vdsm.virt.backup.BackupConfig object at 0x7f95bc9e97b8>, 'reason': "Error starting backup: internal error: unable to execute QEMU command 'transaction': Source and target image have different sizes"} (api:131)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/backup.py", line 410, in _begin_backup
    dom.backupBegin(backup_xml, checkpoint_xml, flags=flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 159, in call
    return getattr(self._vm._dom, name)(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 797, in backupBegin
    raise libvirtError('virDomainBackupBegin() failed')
libvirt.libvirtError: internal error: unable to execute QEMU command 'transaction': Source and target image have different sizes

It happens because the HE VM disk and the scratch disk are not the same size, which is caused by an incorrect size of the HE VM disk in the engine DB. E.g.
in my env:

# qemu-img info a2df2288-c33d-4c1b-b447-4077f5000fa5
image: a2df2288-c33d-4c1b-b447-4077f5000fa5
file format: raw
virtual size: 50 GiB (53689188352 bytes)
disk size: 7.13 GiB

And in the DB:

engine=# SELECT size FROM images WHERE image_guid='a2df2288-c33d-4c1b-b447-4077f5000fa5';
    size
-------------
 54760833024

This causes the scratch disk to be created with an incorrect size (from vdsm.log):

2022-03-15 10:10:57,636+0200 INFO (jsonrpc/0) [vdsm.api] START createVolume(sdUUID='96f3f20c-40a7-4ba8-b4d6-a13fad0135a0', spUUID='457b7702-92f4-11ec-b87a-001a4a160475', imgUUID='079f9427-2243-4285-b4f4-2eabf455a312', size='54760833024', volFormat=4, preallocate=2, diskType='SCRD', volUUID='0ca99d26-236a-4d23-9a96-92950639cce8', desc='{"DiskAlias":"VM HostedEngine backup ddc279d1-76b2-49e7-8b78-257083e6e203 scratch disk for he_virtio_disk","DiskDescription":"Backup ddc279d1-76b2-49e7-8b78-257083e6e203 scratch disk"}', srcImgUUID='00000000-0000-0000-0000-000000000000', srcVolUUID='00000000-0000-0000-0000-000000000000', initialSize=None, addBitmaps=False) from=::ffff:10.46.11.117,51490, flow_id=21362664-5924-41c7-b66f-30abe46036bc, task_id=b6dc0ab6-4d93-4f67-9916-6c5b72944bf6 (api:48)
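The mismatch shown above can be checked mechanically: for a RAW image the virtual size equals the file's apparent size, so any recorded capacity (engine DB or .meta) that differs from the file size is wrong. A minimal sketch of such a check - the helper name and the scaled-down sizes are illustrative, not vdsm code:

```python
import os
import tempfile

# Sketch (not vdsm code): for a RAW volume the virtual size is the file
# size itself, so a capacity recorded anywhere else (engine DB, .meta)
# that differs from the file size is wrong. In this bug the file is
# 53689188352 bytes while the DB records 54760833024.

def raw_capacity_mismatch(path, recorded_capacity):
    return os.path.getsize(path) != recorded_capacity

# Demonstrate with a small sparse stand-in for the HE disk volume.
with tempfile.NamedTemporaryFile(delete=False) as f:
    volume = f.name
os.truncate(volume, 1048576)                     # actual file size: 1 MiB

print(raw_capacity_mismatch(volume, 2097152))    # metadata claims 2 MiB -> True
print(raw_capacity_mismatch(volume, 1048576))    # matching capacity -> False
os.unlink(volume)
```

Running this same comparison against the qemu-img and DB numbers above flags the HE disk immediately, since 53689188352 != 54760833024.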
Based on comment 5, this is probably something related to the deployment of the hosted-engine and therefore moving to the integration team
ok, hold your horses - we'll check what it takes to measure the image there, so it would also work for existing deployments
Mark, can you share here:

- The metadata file of this volume (a2df2288-c33d-4c1b-b447-4077f5000fa5.meta)
- Output of "ls -l" on the volume file
- Output of:
  sudo vdsm-client Volume getInfo storagedomainID={domain-id} storagepoolID={pool-id} imageID={disk-id} volumeID={volume-id}
- Output of:
  sudo vdsm-client StorageDomain dump sd_id=9fcc4284-142b-4d0e-bc22-958e90564d9d | jq '.volumes | .[] | select(.image=="disk-id")'
Sandro, in short: we found out that it fails because the virtual size of the HE disk in the engine and VDSM metadata differs from its virtual size on the storage.

This situation seems to be unique to hosted engine, because otherwise we (on RHV) fix the virtual size we hold according to the size on the storage.

There is the possibility to query the virtual size from the storage during backup of the hosted engine, but this would arguably complicate an already complicated flow - is it realistic to fix the virtual size of the hosted engine disk for new and existing deployments in 4.5 on your end?
1. Metadata:

# cat a2df2288-c33d-4c1b-b447-4077f5000fa5.meta
CAP=54760833024
CTIME=1645434868
DESCRIPTION={"DiskAlias":"he_virtio_disk","DiskDescription":"Hosted-Engine disk"}
DISKTYPE=HEVD
DOMAIN=96f3f20c-40a7-4ba8-b4d6-a13fad0135a0
FORMAT=RAW
GEN=0
IMAGE=a18cabd3-b894-4bb9-bcd4-906a5d817faa
LEGALITY=LEGAL
PUUID=00000000-0000-0000-0000-000000000000
TYPE=SPARSE
VOLTYPE=LEAF
EOF

2. # ls -l
-rw-rw----. 1 vdsm kvm 53689188352 Mar 16 09:10 a2df2288-c33d-4c1b-b447-4077f5000fa5

3. # vdsm-client Volume getInfo storagedomainID=96f3f20c-40a7-4ba8-b4d6-a13fad0135a0 storagepoolID=457b7702-92f4-11ec-b87a-001a4a160475 imageID=a18cabd3-b894-4bb9-bcd4-906a5d817faa volumeID=a2df2288-c33d-4c1b-b447-4077f5000fa5
{
    "apparentsize": "53689188352",
    "capacity": "54760833024",
    "children": [],
    "ctime": "1645434868",
    "description": "{\"DiskAlias\":\"he_virtio_disk\",\"DiskDescription\":\"Hosted-Engine disk\"}",
    "disktype": "HEVD",
    "domain": "96f3f20c-40a7-4ba8-b4d6-a13fad0135a0",
    "format": "RAW",
    "generation": 0,
    "image": "a18cabd3-b894-4bb9-bcd4-906a5d817faa",
    "lease": {
        "offset": 0,
        "owners": [
            1
        ],
        "path": "/rhev/data-center/mnt/ntap-rhv-dev-nfs.lab.eng.tlv2.redhat.com:_pub_mkemel_mark-he/96f3f20c-40a7-4ba8-b4d6-a13fad0135a0/images/a18cabd3-b894-4bb9-bcd4-906a5d817faa/a2df2288-c33d-4c1b-b447-4077f5000fa5.lease",
        "version": 6
    },
    "legality": "LEGAL",
    "mtime": "0",
    "parent": "00000000-0000-0000-0000-000000000000",
    "pool": "",
    "status": "OK",
    "truesize": "7657226240",
    "type": "SPARSE",
    "uuid": "a2df2288-c33d-4c1b-b447-4077f5000fa5",
    "voltype": "LEAF"
}

4.
# vdsm-client StorageDomain dump sd_id=96f3f20c-40a7-4ba8-b4d6-a13fad0135a0 | jq '.volumes | .[] | select(.image=="a18cabd3-b894-4bb9-bcd4-906a5d817faa")'
{
  "apparentsize": 53689188352,
  "capacity": 54760833024,
  "ctime": 1645434868,
  "description": "{\"DiskAlias\":\"he_virtio_disk\",\"DiskDescription\":\"Hosted-Engine disk\"}",
  "disktype": "HEVD",
  "format": "RAW",
  "generation": 0,
  "image": "a18cabd3-b894-4bb9-bcd4-906a5d817faa",
  "legality": "LEGAL",
  "parent": "00000000-0000-0000-0000-000000000000",
  "status": "OK",
  "truesize": 7657234432,
  "type": "SPARSE",
  "voltype": "LEAF"
}
(In reply to Arik from comment #9)
> Sandro, in short we found out that it fails because the virtual size of the
> HE disk is different on the engine and VDSM metadata compared to its virtual
> size on the storage.
>
> This is a unique situation for hosted engine it seems, because otherwise we
> fix the virtual size we (on RHV) hold according to the size on the storage.
>
> There is the possibility to query the virtual size from the storage during
> backup of hosted engine but this would arguably complicate an already
> complicated flow - is it realistic to fix the virtual size of the hosted
> engine disk for new and existing deployments in 4.5 on your end?

Redirecting the question to Asaf
Yury, did you plan how to restore the hosted-engine VM once you are able to back it up?
(To clarify: the "normal" flow to back up/restore HE is:

1. Take a backup using engine-backup
2. Restore with 'hosted-engine --deploy --restore-from-file'

The engine VM's disk image itself is never backed up/restored as-is.)
One way to make the request in the current bug useful: convert the backed-up image to an OVA and provide that to 'hosted-engine --deploy' as the "appliance" image. This would require some refinements here and there, but is likely doable.

Not sure we want to recommend/support this, though, instead of comment 13 - we'd first like to understand your backup/restore plan and considerations (e.g. if you already considered comment 13, why isn't that enough). Thanks.
(In reply to Mark Kemel from comment #10)
> 1. Metadata:
>
> # cat a2df2288-c33d-4c1b-b447-4077f5000fa5.meta
> CAP=54760833024
...
> 2. # ls -l
> -rw-rw----. 1 vdsm kvm 53689188352 Mar 16 09:10
> a2df2288-c33d-4c1b-b447-4077f5000fa5

Vdsm metadata is corrupted - it contains a wrong CAP; for a raw image, CAP must be the file size.

> 3. # vdsm-client Volume getInfo
> storagedomainID=96f3f20c-40a7-4ba8-b4d6-a13fad0135a0
> storagepoolID=457b7702-92f4-11ec-b87a-001a4a160475
> imageID=a18cabd3-b894-4bb9-bcd4-906a5d817faa
> volumeID=a2df2288-c33d-4c1b-b447-4077f5000fa5
> {
>     "apparentsize": "53689188352",
>     "capacity": "54760833024",

The reported issue shows up here (expected).

> 4. # vdsm-client StorageDomain dump
> sd_id=96f3f20c-40a7-4ba8-b4d6-a13fad0135a0 | jq '.volumes | .[] |
> select(.image=="a18cabd3-b894-4bb9-bcd4-906a5d817faa")'
> {
>     "apparentsize": 53689188352,
>     "capacity": 54760833024,

The issue shows up here as well.

So we have 2 issues:

1. The engine DB is corrupted - it contains a wrong virtual size. This must be fixed to match the apparentsize reported by vdsm.

2. The vdsm metadata is corrupted - this can be fixed only manually; vdsm does not provide an API to change this value.

It would be good to understand how the wrong metadata was created when copying the hosted engine disk to storage. If the volume was created using the vdsm API (Volume.create), the file size should match the metadata size. The issue could be that the vdsm volume was created with the wrong size, and then the disk was copied with qemu-img convert without the "-n" flag. In that case qemu-img convert creates the target file with the same size as the source file.

This issue should be handled in hosted engine setup, regardless of this bug.
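The manual metadata fix mentioned in item 2 could look roughly like the sketch below: rewrite the CAP line of the volume's .meta file to the RAW file's actual size. This is a hypothetical illustration, not an official vdsm procedure - the KEY=VALUE .meta layout follows the dump in comment 10, and all paths and sizes here are stand-ins:

```python
import os
import tempfile

# Hypothetical sketch of the manual metadata fix from item 2 above:
# rewrite the CAP line in a RAW volume's .meta file to the actual file
# size. NOT an official vdsm procedure; the KEY=VALUE .meta layout
# follows comment 10, and the files below are throwaway placeholders.

def fix_raw_cap(volume_path, meta_path):
    actual = os.path.getsize(volume_path)
    with open(meta_path) as f:
        lines = f.read().splitlines()
    fixed = ["CAP=%d" % actual if line.startswith("CAP=") else line
             for line in lines]
    with open(meta_path, "w") as f:
        f.write("\n".join(fixed) + "\n")
    return actual

# Demo on throwaway files mimicking the broken state (scaled down).
vol = tempfile.NamedTemporaryFile(delete=False)
vol.close()
os.truncate(vol.name, 1048576)                   # real size: 1 MiB
meta = vol.name + ".meta"
with open(meta, "w") as f:
    f.write("CAP=2097152\nFORMAT=RAW\nEOF\n")    # metadata claims 2 MiB

fix_raw_cap(vol.name, meta)
with open(meta) as f:
    print(f.read().splitlines()[0])              # CAP=1048576
os.unlink(vol.name)
os.unlink(meta)
```

On a real setup the volume would presumably have to be inactive while its metadata is edited, and the engine DB side (the size column of the images table queried earlier) would need a matching manual UPDATE.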
(In reply to Nir Soffer from comment #15)
> This issue should be handled in hosted engine setup, regardless of this bug.

We're back to discussing the fix? Let's first hear from Yury what the plan is for restoring the hosted engine; this bz may not be relevant without a proper mechanism for restoring the hosted engine.
(In reply to Nir Soffer from comment #15)
> This issue should be handled in hosted engine setup, regardless of this bug.

Ah, regardless :) Feel free to file another bug on that then
Regarding the issue of backing up the hosted engine VM - since we have no way to restore it (restore requires a running engine), I think backing up the hosted engine VM should be blocked. If we get a way to back up and restore the hosted engine VM, we can remove this blocking.
BTW: For properly hooking into the process of comment 13 (e.g. to install/restore/configure whatever other stuff during the HE deploy/restore), see:

https://github.com/oVirt/ovirt-ansible-collection/blob/master/roles/hosted_engine_setup/README.md#make-changes-in-the-engine-vm-during-the-deployment
Hello there

The main problem is that the customer isn't able to back up the full KVM data with one backup software. For regular VMs he has to use Veeam RHV backup, and for the engine data he has to use engine-backup. I'd like to have one solution for both types of data.

> Yury, did you plan how to restore the hosted-engine VM once you are able to back it up?

We can't restore it back to KVM, but there are a few methods to recover some data from the backup. Let's imagine that the hosted engine is down; we have two ways to recover:

1) There is instant VM restore technology in Veeam B&R. A customer can place the engine on a VMware ESXi or a Hyper-V host temporarily, to run engine-backup and use it for recovery, or he can leave it forever on this hypervisor.

2) There is file-level restore technology. A customer can go into the appliance files and recover the database data.
(In reply to Yury.Panchenko from comment #21)
> Hello there
> The main problem is that the customer isn't able to back up the full KVM
> data with one backup software. For regular VMs he has to use Veeam RHV
> backup, and for the engine data he has to use engine-backup.
> I'd like to have one solution for both types of data.

You are most welcome to refer to engine-backup as part of oVirt/RHV's API, and not merely as an end-user-facing tool, IMO - call it from Veeam RHV backup etc., and tell us if anything is missing for using it like that.

> > Yury, did you plan how to restore the hosted-engine VM once you are able to back it up?
> We can't restore it back to KVM, but there are a few methods to recover
> some data from the backup.
>
> Let's imagine that the hosted engine is down; we have two ways to recover:
> 1) There is instant VM restore technology in Veeam B&R. A customer can place
> the engine on a VMware ESXi or a Hyper-V host temporarily, to run
> engine-backup and use it for recovery, or he can leave it forever on this
> hypervisor.

This is a very reasonable approach if you already have such infra on-site, but if not, then IMO setting up something just for recovery is a lot of work.

> 2) There is file-level restore technology. A customer can go into the
> appliance files and recover the database data.

This will also take quite some work, all of which engine-backup is designed to do for you.
> You are most welcome to refer to engine-backup as part of oVirt/RHV's API,
> and not merely as an end-user-facing tool, IMO - call it from Veeam RHV backup
> etc., and tell us if anything is missing for using it like that.

It'll be great to add this feature to the backup API, so we can store this config backup in our repository.

I think in the future we could develop a direct restore of the engine to the RHV host. It won't be easy, but it would allow a customer to recover the engine in case of a hard failure.
The backup API gives access to VM data. We don't have any API for file-level backup.

To do a file-level backup you can connect to the VM via ssh, or install an agent on the VM that will prepare and push backup data to the backup system. Which kind of API is missing for this?
> To do a file-level backup you can connect to the VM via ssh, or install an agent
> on the VM that will prepare and push backup data to the backup system.
> Which kind of API is missing for this?

The feature is nice to have; you don't have to create it. We already have an agent backup feature.

In general, I'm interested in backing up the full hosted-engine VM. I guess it will become possible when you switch to the snapshot backup method.
(In reply to Yury.Panchenko from comment #25)
> In general, I'm interested in backing up the full hosted-engine VM.
> I guess it will become possible when you switch to the snapshot backup
> method.

The new method doesn't apply to the hosted engine VM, so the issue remains.
(In reply to Yury.Panchenko from comment #25)
> > To do a file-level backup you can connect to the VM via ssh, or install an agent
> > on the VM that will prepare and push backup data to the backup system.
> > Which kind of API is missing for this?
> The feature is nice to have; you don't have to create it.
> We already have an agent backup feature.
>
> In general, I'm interested in backing up the full hosted-engine VM.
> I guess it will become possible when you switch to the snapshot backup
> method.

Hosted engine does not support snapshots, so the new snapshot-based backup cannot work. The previous incremental backup method also does not work, because of a bug in hosted engine deployment that creates invalid volume metadata.

If this issue is fixed, for example by a system upgrade fixing the invalid metadata, backing up the hosted engine will be possible, but without a way to restore the VM, I don't see the point in this.

In general, the hosted engine VM is a special case - it is part of the RHV infrastructure, and in the same way you don't back up the RHV hosts, you don't back up the hosted engine VM.
> In general, the hosted engine VM is a special case - it is part of the RHV
> infrastructure, and in the same way you don't back up the RHV hosts, you
> don't back up the hosted engine VM.

You are right. I can't say that the feature is very important, but it can sometimes help.
(In reply to Yury.Panchenko from comment #28)
> > In general, the hosted engine VM is a special case - it is part of the RHV
> > infrastructure, and in the same way you don't back up the RHV hosts, you
> > don't back up the hosted engine VM.
> You are right.
> I can't say that the feature is very important, but it can sometimes help.

Right, we all seem to agree that it should be possible to back up the hosted engine VM as well; however, it's a special case and users are provided with dedicated tools that facilitate this. So I'm going to change the scope of this bug - we'll rather block taking a backup of a hosted engine VM, which doesn't work now anyway, and advise users to leverage engine-backup and hosted-engine --deploy instead.
(In reply to Yury.Panchenko from comment #0)
> Created attachment 1810131 [details]
> Engine and vdsm logs
>
> Description of problem:
> Backup of hosted-engine vm always failed.
>
> Version-Release number of selected component (if applicable):
> vdsm-4.40.70.6-1.el8ev.x86_64
> ovirt-imageio-daemon-2.2.0-1.el8ev.x86_64
> libvirt-daemon-7.0.0-14.1.module+el8.4.0+11095+d46acebf.x86_64
> qemu-kvm-5.2.0-16.module+el8.4.0+11536+725e25d9.2.x86_64
>
> How reproducible:
> Start full backup of hosted-engine vm
>
> Actual results:
> Backup failed
> VDSErrorException: Failed to StartVmBackupVDS, error = Backup Error:
> {'vm_id': 'd31f3fac-0c19-4db5-903a-a4af5b54ec59', 'backup':
> <vdsm.virt.backup.BackupConfig object at 0x7fa45437e898>, 'reason': "Error
> starting backup: internal error: unable to execute QEMU command
> 'transaction': Cannot append backup-top filter: Could not open
> '/run/vdsm/storage/a735a825-e13c-4b54-bfa9-9424f385dbc8/4ebfdfea-31e8-4f15-
> 8d30-559a35f1e375/70d63963-cd9c-452e-97d3-c671b2f83ddb': No such file or
> directory"}, code = 1600
>
> Expected results:
> Successful backup.

So the expected result would then be to fail with an appropriate message.
Verification steps:

1. Start backup for the HostedEngine VM:

POST api/vms/${vmId}/backups

<backup>
  <disks>
    <disk id="${diskId}" />
  </disks>
</backup>

2. Expect the following response:

<fault>
  <detail>[Cannot backup VM. Backup of Hosted Engine VM is not supported]</detail>
  <reason>Operation Failed</reason>
</fault>
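For completeness, the request body from step 1 can also be constructed programmatically; the sketch below only builds and prints the XML (the disk id is a placeholder taken from earlier comments), leaving the actual authenticated POST to whatever REST client is in use:

```python
import xml.etree.ElementTree as ET

# Sketch: build the <backup> body from step 1. The disk id is a
# placeholder; any disk id of the HostedEngine VM would be used here.
disk_id = "a2df2288-c33d-4c1b-b447-4077f5000fa5"

backup = ET.Element("backup")
disks = ET.SubElement(backup, "disks")
ET.SubElement(disks, "disk", {"id": disk_id})

body = ET.tostring(backup, encoding="unicode")
print(body)
```

After the fix, posting such a body for the HostedEngine VM should yield the Operation Failed fault from step 2 rather than starting a backup.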
(In reply to Mark Kemel from comment #31)
> Verification steps:
>
> 1. Start backup for the HostedEngine VM:
> POST api/vms/${vmId}/backups
> <backup>
>   <disks>
>     <disk id="${diskId}" />
>   </disks>
> </backup>
>
> 2. Expect the following response:
> <fault>
>   <detail>[Cannot backup VM. Backup of Hosted Engine VM is not
> supported]</detail>
>   <reason>Operation Failed</reason>
> </fault>

Verified on ovirt-engine-4.5.0.2-0.7.el8ev: backup of the HostedEngine VM is not supported:

[Cannot backup VM. Backup of Hosted Engine VM is not supported]
This bugzilla is included in oVirt 4.5.0 release, published on April 20th 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.