Page MenuHomeFreeBSD

Warm Migration feature for bhyve [Part 2]
Needs RevisionPublic

Authored by mihaiburcea15_gmail.com on Mar 31 2022, 6:47 AM.
Tags
Referenced Files
Unknown Object (File)
Mon, Jan 6, 5:18 PM
Unknown Object (File)
Thu, Dec 26, 6:14 PM
Unknown Object (File)
Thu, Dec 26, 4:42 PM
Unknown Object (File)
Thu, Dec 26, 4:23 PM
Unknown Object (File)
Dec 7 2024, 2:51 PM
Unknown Object (File)
Dec 1 2024, 12:57 AM
Unknown Object (File)
Nov 20 2024, 7:33 AM
Unknown Object (File)
Nov 19 2024, 7:40 AM

Details

Summary

This second part adds specs migration (HW_MACHINE, HW_MODEL and HW_PAGESIZE), and comparison between hosts.
When we are in Capsicum capability mode, we need to Casper to allow us to grab these specs and connect to the destination. We only need to do this on the source, because on the destination, the migration happens before the process enters capability mode.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

elenamihailescu22_gmail.com changed the visibility from "All Users" to "Public (No Login Required)".Apr 1 2023, 5:00 AM
corvink added inline comments.
usr.sbin/bhyve/migration.c
143
154
172
173
209

Most code is send or recv specific. So, I strongly prefer splitting this into a send_migration_check_specs and a recv_migration_check_specs function.

274–277
303
325
334

If IPv6 isn't supported, don't handle it. Just exit with an error here.

344–345
364
374
383
390
396

You could avoid this conversion if you would return sin_addr from get_migration_host_and_type.

412
463
486

It's hard to read. Maybe use something like Could not recv 'migration completed': remote or received error.

499–503

Is this the right order? You've pause vcpus first and then user_devs. So, shouldn't you resume user_devs first and then vcpus?

525–526

I don't like the goto style but it's widely used in bhyve. So, it can be used here as well.

  error = rc;
  goto cleanup
  ...

cleanup:
  close(con_socket);
  close(s)

  return (error);
531
536

Hard to read as well.

539
mihaiburcea15_gmail.com marked 21 inline comments as done.
mihaiburcea15_gmail.com edited the summary of this revision. (Show Details)

Addressed @corvink's comments and added Casper support.

usr.sbin/bhyve/migration.c
209

I think there's still a lot of duplicate code and it's better to only have one function for it.
No strong feelings though, so it can be split if you think so.

525–526

Got rid of one socket, so I think it's fine now to have a close and a return.

corvink added inline comments.
usr.sbin/bhyve/migration.c
209

IMHO, it's a bit hard to read but it's ok leaving it as is. Maybe some helper functions can help here. E.g.:

static int
migration_send_response(int socket, void *msg, size_t len, enum migration_transfer_req req)
{
  if (req == MIGRATION_SEND_REQ)
    migration_transfer_data(socket, msg, len, MIGRATION_RECV_REQ);
  else
    migration_transfer_data(socket, msg, len, MIGRATION_SEND_REQ);
}
555

Would be nice to explain, why we can't use casper function on recv.

556–560

This code is duplicated three times in this function. I have no strong preference but it might be worth adding a helper function for it.

614–619

Why are you exiting early on transferred < 0 and not on transferred == 0?

This revision is now accepted and ready to land.Jun 19 2023, 7:15 AM

One additional thought: What do you think about sending e.g. a nvlist for the spec check? A nvlist would be more flexible for future addition.

Before proceed this work to commit, could you provide high level design for this approach ?

This revision now requires changes to proceed.Jun 20 2023, 11:35 AM
usr.sbin/bhyve/migration.c
950

commented code "fprintf" ?

afedorov requested changes to this revision.EditedJun 20 2023, 2:40 PM

I think we should bring support for snapshots first. And enable BHYVE_SNAPSHOT by default.

To do this, it remains to agree on the snapshot format: https://lists.freebsd.org/archives/freebsd-virtualization/2023-May/001295.html

I'm not sure whether to tie the transfer of the memory dump to the TCP.

Or add an opportunity similar to 'zfs send' to be independent of transport. For example, share a memory image using SSH or NFS.

I think we should bring support for snapshots first. And enable BHYVE_SNAPSHOT by default.

I agree.

Before proceed this work to commit, could you provide high level design for this approach ?

Are you asking for a high level design for the migration feature? I believe there was a thesis paper that was published by one of the students that describes the approach in detail. Would that be helpful so see?

Before proceed this work to commit, could you provide high level design for this approach ?

If you need documentation regarding migration/live migration, you can find it here:

Please let me know if you need anything more detailed than this. These papers and presentations go from the first ideas (bhyvecon 2019) to the current implementation. The AsiaBSDCon 2023 is the latest one, I cannot find it online, but I can share it with you if you want.

I think we should bring support for snapshots first. And enable BHYVE_SNAPSHOT by default.

To do this, it remains to agree on the snapshot format: https://lists.freebsd.org/archives/freebsd-virtualization/2023-May/001295.html

I'm not sure whether to tie the transfer of the memory dump to the TCP.

Or add an opportunity similar to 'zfs send' to be independent of transport. For example, share a memory image using SSH or NFS.

Not sure how the transfer would be *tied* to the TCP protocol just by having the implementation use TCP by default. Many tools like curl, socat and browsers have figured out that you can use URIs to specify protocols and even authentication when connecting to a service. The migration parameters can be extended to use ssh://, file:/// or even the basic - for piping the binary data to standard output.

I think asking this question more than one year after this review has been open, and more than two after the original review (https://reviews.freebsd.org/D28270) was created is simply meant to delay accepting the changes. I understand that having a good and secure code base is a must, but this change simply asks for "why don't we over engineer this part which has very little to do with the actual virtual machine migration process?", without actually providing a good case for "this is not up to the BSD standards".

I think we should bring support for snapshots first. And enable BHYVE_SNAPSHOT by default.

To do this, it remains to agree on the snapshot format: https://lists.freebsd.org/archives/freebsd-virtualization/2023-May/001295.html

I'm not sure whether to tie the transfer of the memory dump to the TCP.

Or add an opportunity similar to 'zfs send' to be independent of transport. For example, share a memory image using SSH or NFS.

Not sure how the transfer would be *tied* to the TCP protocol just by having the implementation use TCP by default. Many tools like curl, socat and browsers have figured out that you can use URIs to specify protocols and even authentication when connecting to a service. The migration parameters can be extended to use ssh://, file:/// or even the basic - for piping the binary data to standard output.

I think asking this question more than one year after this review has been open, and more than two after the original review (https://reviews.freebsd.org/D28270) was created is simply meant to delay accepting the changes. I understand that having a good and secure code base is a must, but this change simply asks for "why don't we over engineer this part which has very little to do with the actual virtual machine migration process?", without actually providing a good case for "this is not up to the BSD standards".

I had review (D35590) that was opened about a year. This was a fix for snapshot implementation.

We are talking about two things:

  1. Why this feature cannot be done outside of bhyve?
  2. Who will do fixing if this migration review(s) becomes committed?

I suppose, both questions should answered before committing those changes. Because even without this migration additions, current snapshot implementation still has a lot of bugs and, I believe, they should be fixed first before add new functionality for snapshot/resume.

I think we should bring support for snapshots first. And enable BHYVE_SNAPSHOT by default.

To do this, it remains to agree on the snapshot format: https://lists.freebsd.org/archives/freebsd-virtualization/2023-May/001295.html

I'm not sure whether to tie the transfer of the memory dump to the TCP.

Or add an opportunity similar to 'zfs send' to be independent of transport. For example, share a memory image using SSH or NFS.

Not sure how the transfer would be *tied* to the TCP protocol just by having the implementation use TCP by default. Many tools like curl, socat and browsers have figured out that you can use URIs to specify protocols and even authentication when connecting to a service. The migration parameters can be extended to use ssh://, file:/// or even the basic - for piping the binary data to standard output.

I think asking this question more than one year after this review has been open, and more than two after the original review (https://reviews.freebsd.org/D28270) was created is simply meant to delay accepting the changes. I understand that having a good and secure code base is a must, but this change simply asks for "why don't we over engineer this part which has very little to do with the actual virtual machine migration process?", without actually providing a good case for "this is not up to the BSD standards".

I had review (D35590) that was opened about a year. This is fix for current snapshot implementation. We are talking about two things:

  1. Why this feature cannot be done outside of bhyve?
  2. Who will do fixing if this migration review(s) becomes committed?

I suppose, both questions should answered before committing those changes. Because even without this migration additions, current snapshot implementation still has a lot of bugs and, I believe, they should be fixed first before add new functionality for snapshot/resume.

I understand your frustration with stale commits, but I don't quite understand how the migration process would affect the snapshot commits. As far as I'm aware, the two are mostly independent, with migration simply transferring the (raw) device snapshot data to a different host at the very end of the migration process. This means that while migration depends on having a working snapshot mechanism that can provide the correct device data, updating the snapshot mechanism should not become any different from how it is now.

What the migration does additionally, however, is keeping track of memory page changes while the host is running. The *live* migration must be part of bhyve, as the migration process must transfer the virtual machine's memory from one host to the other while the virtual machine runs on the original host.

Of course, some changes to the migration code are unavoidable in case the snapshot process changes in some particular ways, but they should remain minimal in my expectations.

The *live* migration must be part of bhyve, as the migration process must transfer the virtual machine's memory from one host to the other while the virtual machine runs on the original host.

Of course, some changes to the migration code are unavoidable in case the snapshot process changes in some particular ways, but they should remain minimal in my expectations.

I believe that migration code can be outside of bhyve process even with warm or live migration. Really, suppose to have bmigrate process written in C, shell, python (high level language):

a. bhyve dumps diff-pages to a file1.
b. bmigrate sends data to host2 or just places on shared storage
c. bmigrate calls bhyve to suspend without memory.
d. bmigrate sends suspended data to host2. and restore memory 1-st iteration on host2.
e. bmigrate calls vmm to dump diff-pages to file2.
f. bmigrate sends diff-pages to host2 and restore 2-st iteration.
g. bmigrate restore process on host2.

This is rough idea, but it should work w/o adding a lot of code to bhyve.

The same is for warm migration. Why this code should be in bhyve? All states, cpus, etc. could be tracked outside of bhyve.

The *live* migration must be part of bhyve, as the migration process must transfer the virtual machine's memory from one host to the other while the virtual machine runs on the original host.

Of course, some changes to the migration code are unavoidable in case the snapshot process changes in some particular ways, but they should remain minimal in my expectations.

I believe that migration code can be outside of bhyve process even with warm or live migration. Really, suppose to have bmigrate process written in C, shell, python (high level language):

a. bhyve dumps diff-pages to a file1.
b. bmigrate sends data to host2 or just places on shared storage
c. bmigrate calls bhyve to suspend without memory.
d. bmigrate sends suspended data to host2. and restore memory 1-st iteration on host2.
e. bmigrate calls vmm to dump diff-pages to file2.
f. bmigrate sends diff-pages to host2 and restore 2-st iteration.
g. bmigrate restore process on host2.

This is rough idea, but it should work w/o adding a lot of code to bhyve.

The same is for warm migration. Why this code should be in bhyve? All states, cpus, etc. could be tracked outside of bhyve.

I guess that could be done like that too, but I don't really see an advantage of having a different tool that would have to keep track of device state outside of bhyve. It would still have to be tightly linked to the changes to bhyve itself, as any change to how device state is tracked internally would also have to be updated in the tool, or it the virtual machines would break. Having code that deals with the device state inside of bhyve itself makes the most sense to me, as the dependencies would be more readily obvious to anyone looking to update the code or fix a bug.

If the desire is to simply keep the snapshot and migration code separate from the rest of the bhyve code, it may be possible to better segregate it to the snapshot.* and migration.* files. On a personal note, from my experience having more tools to handle something makes using the functionality more confusing to the end user, especially the less initiated. I, for one, think that there is a point where offering more options for power users, but making something options harder to find, will hurt your bottom line in the end. There is a reason why most users prefer graphical interfaces to the command line, despite being arguably more efficient.

The *live* migration must be part of bhyve, as the migration process must transfer the virtual machine's memory from one host to the other while the virtual machine runs on the original host.

Of course, some changes to the migration code are unavoidable in case the snapshot process changes in some particular ways, but they should remain minimal in my expectations.

I believe that migration code can be outside of bhyve process even with warm or live migration. Really, suppose to have bmigrate process written in C, shell, python (high level language):

a. bhyve dumps diff-pages to a file1.
b. bmigrate sends data to host2 or just places on shared storage
c. bmigrate calls bhyve to suspend without memory.
d. bmigrate sends suspended data to host2. and restore memory 1-st iteration on host2.
e. bmigrate calls vmm to dump diff-pages to file2.
f. bmigrate sends diff-pages to host2 and restore 2-st iteration.
g. bmigrate restore process on host2.

This is rough idea, but it should work w/o adding a lot of code to bhyve.

The same is for warm migration. Why this code should be in bhyve? All states, cpus, etc. could be tracked outside of bhyve.

We consider that the migration part is a must-have feature for a mature hypervisor and we really want to have it for bhyve. The end goal is to have a live migration feature for bhyve. It means inspecting the guest memory pages to check for changes between rounds. It also modifies a dirty bit. The inspection (and changes) should be done from within the bhyve process. I'm not sure how this part should be tackled in your suggestion. However, please keep in mind that this approach was discussed with the bhyve's maintainers 4 years ago. We asked for advice and came up with this idea of having a custom dirty bit for migration.

We consider that the migration part is a must-have feature for a mature hypervisor and we really want to have it for bhyve. The end goal is to have a live migration feature for bhyve. It means inspecting the guest memory pages to check for changes between rounds. It also modifies a dirty bit. The inspection (and changes) should be done from within the bhyve process. I'm not sure how this part should be tackled in your suggestion. However, please keep in mind that this approach was discussed with the bhyve's maintainers 4 years ago. We asked for advice and came up with this idea of having a custom dirty bit for migration.

I don't see any problem with that. I will open review with possibility to get diff pages.

So instead of having a lot code in bhyve, I would suggest to have portion of primitives that can help to implement warm, live migration outside of bhyve process.

We consider that the migration part is a must-have feature for a mature hypervisor and we really want to have it for bhyve. The end goal is to have a live migration feature for bhyve. It means inspecting the guest memory pages to check for changes between rounds. It also modifies a dirty bit. The inspection (and changes) should be done from within the bhyve process. I'm not sure how this part should be tackled in your suggestion. However, please keep in mind that this approach was discussed with the bhyve's maintainers 4 years ago. We asked for advice and came up with this idea of having a custom dirty bit for migration.

I don't see any problem with that. I will open review with possibility to get diff pages.

So instead of having a lot code in bhyve, I would suggest to have portion of primitives that can help to implement warm, live migration outside of bhyve process.

Wait. You're casually suggesting that years of work be discarded in favor of an alternate approach. Please be more specific about what you perceive to be the problem with the method taken in this patch series. "I have another idea" isn't a very good reason to go in a completely different direction.

We consider that the migration part is a must-have feature for a mature hypervisor and we really want to have it for bhyve. The end goal is to have a live migration feature for bhyve. It means inspecting the guest memory pages to check for changes between rounds. It also modifies a dirty bit. The inspection (and changes) should be done from within the bhyve process. I'm not sure how this part should be tackled in your suggestion. However, please keep in mind that this approach was discussed with the bhyve's maintainers 4 years ago. We asked for advice and came up with this idea of having a custom dirty bit for migration.

I don't see any problem with that. I will open review with possibility to get diff pages.

So instead of having a lot code in bhyve, I would suggest to have portion of primitives that can help to implement warm, live migration outside of bhyve process.

So, as I understand it, there is no real reason to go for this approach other than it would be a way to rewrite the code. It also seems excessive, as the code base overall would not decrease and adding a process that can directly access the address space of a different process (required in order to transfer the virtual machine's memory) would be a major security concern.

So instead of having a lot code in bhyve, I would suggest to have portion of primitives that can help to implement warm, live migration outside of bhyve process.

Wait. You're casually suggesting that years of work be discarded in favor of an alternate approach. Please be more specific about what you perceive to be the problem with the method taken in this patch series. "I have another idea" isn't a very good reason to go in a completely different direction.

Suggestion was about to help you to move to the right direction. I understand that this patch series ate time to implement, but it should be not just reason to integrate it into bhyve. Do you agree? Instead, I would talk about better implementation with more robust approach.

I suppose the following reasons to discuss about this patch-series:

  1. Snapshot/resume code was totally thrown and "implementing team" forget about the code since it was integrated in 2020. Yes, it was a good job, but it was incomplete and has a lot of issues and nobody wants to fix them.
  2. Who will support integrated code ? This patch-series has a lot code and it potentially has bugs. Will it be the same story just integrate something and forget about code, i.e. do not participate in bug fixing?
  3. Instead of adding a lot of code to bhyve, it is better to place code outside from bhyve source, to make bhyve pretty small and secure.
  4. Warm/Live migrations can have different steps and requirements, hooks, etc. For example, it can be done with storage migration (Storage vMotion) or without, with additional tunning VM like vCPU slowdown or without. And better place for implementing those is another program that will handle all needed requirements and functionalities.

Correct me, if I see something wrong.

So instead of having a lot code in bhyve, I would suggest to have portion of primitives that can help to implement warm, live migration outside of bhyve process.

Wait. You're casually suggesting that years of work be discarded in favor of an alternate approach. Please be more specific about what you perceive to be the problem with the method taken in this patch series. "I have another idea" isn't a very good reason to go in a completely different direction.

Suggestion was about to help you to move to the right direction. I understand that this patch series ate time to implement, but it should be not just reason to integrate it into bhyve. Do you agree? Instead, I would talk about better implementation with more robust approach.

I suppose the following reasons to discuss about this patch-series:

  1. Snapshot/resume code was totally thrown and "implementing team" forget about the code since it was integrated in 2020. Yes, it was a good job, but it was incomplete and has a lot of issues and nobody wants to fix them.
  2. Who will support integrated code ? This patch-series has a lot code and it potentially has bugs. Will it be the same story just integrate something and forget about code, i.e. do not participate in bug fixing?
  3. Instead of adding a lot of code to bhyve, it is better to place code outside from bhyve source, to make bhyve pretty small and secure.
  4. Warm/Live migrations can have different steps and requirements, hooks, etc. For example, it can be done with storage migration (Storage vMotion) or without, with additional tunning VM like vCPU slowdown or without. And better place for implementing those is another program that will handle all needed requirements and functionalities.

Correct me, if I see something wrong.

Do you already have a migration feature/tool implemented (bmigrate)? If so, please let us know if you plan to make it available in the future.

I'll just leave the patches we have proposed since suspend/resume landed here. Our goal was to remove the BHYVE_SNAPSHOT guard as well so that upstreaming the live migration will go smoothly, and that's why we proposed the following reviews. Please keep in mind that these reviews are starting from 2020. We split some of the patches in smaller parts to be easily reviewed. Of course, these reviews are old and no longer useful since the same things were either pushed upstream this year or things go in a different direction (e.g., nvlist), I just wanted to point out our activity.

I think we should bring support for snapshots first. And enable BHYVE_SNAPSHOT by default.

To do this, it remains to agree on the snapshot format: https://lists.freebsd.org/archives/freebsd-virtualization/2023-May/001295.html

Please (you and others) reply to the snapshot format proposal. Otherwise, it's hard to find consensus on that...

Do you already have a migration feature/tool implemented (bmigrate)? If so, please let us know if you plan to make it available in the future.

It doesn't matter whether do we have it or not. My intention is not promote possible my tool. Intention is to create a flexible solution that can be easily improved, extended and supported, instead of having solution with a specific limited functionality.

One more reason do not have whole migration code in bhyve inside - is inability to correct behaviour during migration for already running process.
Imagine, there is a bhyve process that should be migrated to an another host. Preparation steps gets error in bhyve due to some not correct logic handling or some another bug in migration code. And that's all. It cannot be resolved "on the fly". Code for already running process can not be modified. Instead, if have logic and functional code in a dedicated control application, any issue with buggy code path can be resolved, that application can be updated and then bhyve process can be successfully migrated.

I'll just leave the patches we have proposed since suspend/resume landed here. Our goal was to remove the BHYVE_SNAPSHOT guard as well so that upstreaming the live migration will go smoothly, and that's why we proposed the following reviews. Please keep in mind that these reviews are starting from 2020. We split some of the patches in smaller parts to be easily reviewed. Of course, these reviews are old and no longer useful since the same things were either pushed upstream this year or things go in a different direction (e.g., nvlist), I just wanted to point out our activity.

Good, thanks. I just pointed that current snapshot code still has a lot of issues and doesn't work well on CURRENT. And instead of adding new features, it would be nice to make it stable and robust. And again almost all provided links are
adding new functionality, but that is not fixing bugs and issues (or helping in that) I mentioned earlier. If a bhyve process can not be suspended, resumed or a process crashes or even host panics, adding a new feature looks redundant.

I suppose that fixing existing issues and make the code stable is more important for now.

I suppose that fixing existing issues and make the code stable is more important for now.

And if somebody feels that snapshot code is stable enough for adding new features, I would say it is not!

You can simply crash host with the following steps:

  1. Suspend VM
  2. Run "dd if=/dev/zero bs=1K count=1 conv=notrunc of=${snapshot}.kern"
  3. Resume VM

That's all. Kernel crashes with panic: general protection fault. Who wants to fix that ?)

I suppose that fixing existing issues and make the code stable is more important for now.

And if somebody feels that snapshot code is stable enough for adding new features, I would say it is not!

You can simply crash host with the following steps:

  1. Suspend VM
  2. Run "dd if=/dev/zero bs=1K count=1 conv=notrunc of=${snapshot}.kern"
  3. Resume VM

That's all. Kernel crashes with panic: general protection fault. Who wants to fix that ?)

I've read through the various open reviews and your suggestions for a new file format on the mailing list. Would a more robust file format fix that? I.e. adding a checksum or signing the file with a cryptographic hash?

I've recently joined the Enterprise Working Group (https://wiki.freebsd.org/EnterpriseWorkingGroup) and I'm supporting Greg Wallace as project manager for the bhyve+jails management work stream. Migration was raised multiple times as critical point and requirement for enterprise stakeholders.

Might it be helpful to arrange a zoom or Teams call to discuss this in more detail? There's weekly calls on jails and bhyve that could be a good forum to move things forward or alternatively, we can look into arranging a call specifically for snapshot/migration. Usually, on the weekly calls, there's a good mix of developers and admins/users to have a good set of sparring partners. If a more dev-focused call is required, I'd be more than willing to look into setting that up.

Let me know if there's interest!

That's all. Kernel crashes with panic: general protection fault. Who wants to fix that ?)

I've read through the various open reviews and your suggestions for a new file format on the mailing list. Would a more robust file format fix that? I.e. adding a checksum or signing the file with a cryptographic hash?

It was pointing that instead of adding new things to "snapshot" code, it would be better to make it stable first.

I think steps should be:

  1. Discuss file format for snapshot (one file). I suggest to use nvlist internals.
  2. Implement that file format in bhyve and get rid of current approach with three files for snapshot.
  3. Add useful debug and testing functions to bhyve to analyse VM snapshots. Again, using nvlist can make it easy.
  4. Discuss migrations: warm/live.
  5. Add primitives to implement warm/live migration.
  6. Implement warm migration.
  7. Implement live migration.

I've recently joined the Enterprise Working Group (https://wiki.freebsd.org/EnterpriseWorkingGroup) and I'm supporting Greg Wallace as project manager for the bhyve+jails management work stream. Migration was raised multiple times as critical point and requirement for enterprise stakeholders.

Might it be helpful to arrange a zoom or Teams call to discuss this in more detail?

Yes, it is good idea. We arrange a call for that. I can use both Zoom or Teams.

There's weekly calls on jails and bhyve that could be a good forum to move things forward or alternatively,

we can look into arranging a call specifically for snapshot/migration. Usually, on the weekly calls, there's a

good mix of developers and admins/users to have a good set of sparring partners. If a more dev-focused call

is required, I'd be more than willing to look into setting that up.

Let me know if there's interest!

Thank you for your efforts! We can start discussion at the bhyve Call on Thursday.