Page MenuHomeFreeBSD

mi_startup: sort sysinit array using qsort instead of bubble sort
AbandonedPublic

Authored by arichardson on May 1 2023, 8:34 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Dec 26, 8:13 PM
Unknown Object (File)
Dec 6 2024, 11:27 AM
Unknown Object (File)
Dec 2 2024, 3:11 AM
Unknown Object (File)
Nov 29 2024, 11:39 PM
Unknown Object (File)
Nov 14 2024, 3:02 PM
Unknown Object (File)
Oct 15 2024, 1:22 AM
Unknown Object (File)
Oct 13 2024, 12:49 AM
Unknown Object (File)
Sep 29 2024, 10:35 PM

Details

Summary

While booting on an emulator with instruction tracing enabled, I noticed
that a rather large amount of time was being spent inside mi_startup.
It turns out this is due to sorting an array of ~1000 items using bubble
sort (this has been used since at least 1995). Replacing the hand-written
bubble sort with a call to qsort() noticeably speeds up this function by
reducing the sorting cost from 5.5M instructions to 117K (on RISC-V 64):

Sorting 1125 sysinit items
117378 instructions with qsort()
5509157 instructions with bubble

For an x86 VM this also helps noticeably, cperciva@ reports that this
reduces the time mi_startup spends sorting from 6094436 cycles
(2.031 ms) to 585496 cycles (0.195 ms).

Test Plan

Still boots - using an unstable sort shouldn't matter here.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 51291
Build 48182: arc lint + arc unit

Event Timeline

arichardson created this revision.

does qsort's unstable sort actually change the order of anything?

This revision is now accepted and ready to land.May 1 2023, 9:55 PM
In D39916#908603, @imp wrote:

does qsort's unstable sort actually change the order of anything?

I'm not sure if it does since I didn't look at the resulting output. But I'd hope dependencies are captured correctly by the subsystem & order fields.

this is a massively dodgy change, one has to expect some of the real orderings work by accident. i think it would be better to take a look at trimming the list instead.

In D39916#908757, @mjg wrote:

this is a massively dodgy change, one has to expect some of the real orderings work by accident. i think it would be better to take a look at trimming the list instead.

Trimming the list definitely makes sense but I went for the simpler change here. The list is currently somewhat arbitrarily sorted by the linker and I don't think this makes it any worse. We could add a stable sort to libkern but I don't think this should be needed.

In D39916#908757, @mjg wrote:

this is a massively dodgy change, one has to expect some of the real orderings work by accident. i think it would be better to take a look at trimming the list instead.

Trimming the list definitely makes sense but I went for the simpler change here. The list is currently somewhat arbitrarily sorted by the linker and I don't think this makes it any worse. We could add a stable sort to libkern but I don't think this should be needed.

Arbitrary ordering of the linker has bitten us in the past, sometimes it seems like it'd be nice to have the option to introduce some intentional chaos so that we can find potential issues by randomly reordering (within a {subsystem, order}, naturally), but you'd really want to be able to do that *and* find a way to make it reproducible (e.g., feed it this seed via tunable and it'll re-order in such a way that GENERIC dies in a horrible fire)

hselasky added inline comments.
sys/kern/init_main.c
220

You should know you cannot compare integers using subtraction due to sign overflow.

You really need to do if's for every case!

268

You really need to be careful here. The qsort() we have in the kernel comes from libc, and it is not safe in any regards, like execution time and stack usage. It is a recursive function, and in worst case O(N²) - I.E. no better than the current bubble sort.

My opinion is to not add any more consumers of qsort() until the issues about execution time and stack usage are resolved!

This revision now requires changes to proceed.May 19 2023, 3:29 PM
sys/kern/init_main.c
220

Looking at all the SI_SUB_* values shows that the largest value is (1<<28)-1, so in this case it's safe. I don't know if it matters too much but using the subtraction should be faster.

268

That is the worst case though - in my testing it's substantially faster than bubblesort. Ideally these arrays could be sorted at compile-time but this is a rather *huge* speedup (50x faster)

In D39916#908757, @mjg wrote:

this is a massively dodgy change, one has to expect some of the real orderings work by accident. i think it would be better to take a look at trimming the list instead.

Trimming the list definitely makes sense but I went for the simpler change here. The list is currently somewhat arbitrarily sorted by the linker and I don't think this makes it any worse. We could add a stable sort to libkern but I don't think this should be needed.

Arbitrary ordering of the linker has bitten us in the past, sometimes it seems like it'd be nice to have the option to introduce some intentional chaos so that we can find potential issues by randomly reordering (within a {subsystem, order}, naturally), but you'd really want to be able to do that *and* find a way to make it reproducible (e.g., feed it this seed via tunable and it'll re-order in such a way that GENERIC dies in a horrible fire)

I guess you could print the seed during boot to allow reproducing it. Maybe we could also add a tunable to run initializers that have the same subsystem and order in reverse which should catch most of those issues?

sys/kern/init_main.c
220

But this isn't comparing arbitrary integers, it's comparing values between 0 and SI_SUB_LAST/SI_ORDER_ANY, which means between 0 and 0xfffffff, i.e. non-negative values that fit in 28 bits (29 bits with a sign bit). So no, overflow is impossible. But if you really cared, you could always do the subtraction in the unsigned world and cast back at the end.

arichardson added a subscriber: cperciva.

Rebased on latest main.
@cperciva it would be great if you could test how much this helps your usecase.

In D39916#908757, @mjg wrote:

this is a massively dodgy change, one has to expect some of the real orderings work by accident. i think it would be better to take a look at trimming the list instead.

Trimming the list definitely makes sense but I went for the simpler change here. The list is currently somewhat arbitrarily sorted by the linker and I don't think this makes it any worse. We could add a stable sort to libkern but I don't think this should be needed.

Arbitrary ordering of the linker has bitten us in the past, sometimes it seems like it'd be nice to have the option to introduce some intentional chaos so that we can find potential issues by randomly reordering (within a {subsystem, order}, naturally), but you'd really want to be able to do that *and* find a way to make it reproducible (e.g., feed it this seed via tunable and it'll re-order in such a way that GENERIC dies in a horrible fire)

I guess you could print the seed during boot to allow reproducing it. Maybe we could also add a tunable to run initializers that have the same subsystem and order in reverse which should catch most of those issues?

I really, really like that idea. That feels much easier to implement while accomplishing the same goal with fewer logistics issues (e.g., you /could/ print the seed during boot, but then that may get lost in scrollback and you might not have a keyboard to get back to it at the time of failure; though ideally you'd be doing this with a serial console recorded).

Rebased on latest main.
@cperciva it would be great if you could test how much this helps your usecase.

Reduces the time mi_startup spends sorting from 6094436 cycles (2.031 ms) to 585496 cycles (0.195 ms).

BTW you should change the TSENTER2/TSEXIT2 lines from "bubblesort" to "qsort". :-)

sys/kern/init_main.c
220

Hi,

Maybe add a dependency for that, like compile time asserting SI_SUB_LAST and SI_ORDER_ANY being valid for the optimisation?

The code pattern you've written, is incorrect for full 32-bit integer comparison. And the fields in question are enums and not limited to 28 bits from what I can see.

Static code analysis tools don't grasp how values are put into these fields and the enums in kernel.h may change.

Why not write it like it should be done from the start, which will handle all current and future values, regardless of subsystem and order signedness aswell?

if (a > b)
        return (1);
else if (a < b)
        return (-1);
else
        return (0);

The clang compiler, when using -O2, elegantly utilize dedicated CPU instructions for the code above, to avoid any branching or gotos.

To me the optimization looks like a false positive bad coding case (or style issue).

The code you've written looks bad simply put. I'm not saying it is wrong.

268

Like some people already know and have learnt, the current implementation of qsort() may suddenly bite you when you least expect it. And given the world is already full of problems, why leave more behind?

  1. Are you sure you have enough stack for the worst -case recursion inside qsort() in the environment where this function is executing?
  1. I've already proposed bsort(), https://reviews.freebsd.org/D36493, as a safe sorting function.
  1. Talking about performance, my compile time sorting method will give the best performance and debugging experience. Why not go directly there and start working on that, instead of changing the code twice?
  1. Change the bubble sort into an insertion sort algorithm, and keep it that way. When an insertion sort algorithm sees a sorted array, it will be O(N):

https://en.wikipedia.org/wiki/Insertion_sort

This is critical boot code, and qsort() doesn't belong there, as currently implemented.

Sorting these fields should be done at compile time, instead of every time the system boots. That's even better. It is quite trivial to implement.

Sorting 1125 sysinit items
117378 instructions with qsort()
5509157 instructions with bubble

And if the list is already sorted and you use insertion sort there instead of bubble sort, what is the time then? Can you get that into the list too?

I guess it will be something like 500x instead of your 50x.

--HPS

Sorting 1125 sysinit items
117378 instructions with qsort()
5509157 instructions with bubble

And if the list is already sorted and you use insertion sort there instead of bubble sort, what is the time then? Can you get that into the list too?

I guess it will be something like 500x instead of your 50x.

--HPS

Yes that would obviously be faster but I don't have time to work on that. If you are planning to work on this and it's going to be ready for review in the near future I'd be happy to drop this patch.

However, using a pre-existing sorting function to obtain a noticeable speedup seems like a valid approach to me. Just because it could be done in a more optimal way at some hypothetical point in the future shouldn't mean we can't make incremental improvements on the way to an ideal solution. Otherwise things would never improve...

sys/kern/init_main.c
220

I'm fine changing this to avoid the subtraction although it does perform measurable better on simpler ISAs such as RISC-V

268

I'd be more than happy to use a better sorting function, but as of today only qsort exists in the kernel and I'm not going to spend my time writing sorting algorithms.

Considering the tiny size of the input I really don't think the worst case qsort stack usage matters since it's log n. And the worst case time is still no worse than the current code.

Sorting 1125 sysinit items
117378 instructions with qsort()
5509157 instructions with bubble

And if the list is already sorted and you use insertion sort there instead of bubble sort, what is the time then? Can you get that into the list too?

I guess it will be something like 500x instead of your 50x.

--HPS

Yes that would obviously be faster but I don't have time to work on that. If you are planning to work on this and it's going to be ready for review in the near future I'd be happy to drop this patch.

However, using a pre-existing sorting function to obtain a noticeable speedup seems like a valid approach to me. Just because it could be done in a more optimal way at some hypothetical point in the future shouldn't mean we can't make incremental improvements on the way to an ideal solution. Otherwise things would never improve...

Could we make a deal on that? I writeup a patch for the 500x case today, and you can test it and compare against the current results?

sys/kern/init_main.c
268

I think the recursion level for qsort() in the kernel is log2(N) only if the sorting input is properly random. Else you risk a recursion equal to the number of elements you sort. I may be wrong, but from my 10-seconds of look at the kernel's qsort(), it looks like the vanilla and unmodified Quick Sort algorithm.

Could we make a deal on that? I writeup a patch for the 500x case today, and you can test it and compare against the current results?

I can test patches.

sys/kern/init_main.c
268

We recurse on the smaller "half" and iterate on the larger "half". Our worst case stack depth is log2(N).

Sorting 1125 sysinit items
117378 instructions with qsort()
5509157 instructions with bubble

And if the list is already sorted and you use insertion sort there instead of bubble sort, what is the time then? Can you get that into the list too?

I guess it will be something like 500x instead of your 50x.

--HPS

Yes that would obviously be faster but I don't have time to work on that. If you are planning to work on this and it's going to be ready for review in the near future I'd be happy to drop this patch.

However, using a pre-existing sorting function to obtain a noticeable speedup seems like a valid approach to me. Just because it could be done in a more optimal way at some hypothetical point in the future shouldn't mean we can't make incremental improvements on the way to an ideal solution. Otherwise things would never improve...

Could we make a deal on that? I writeup a patch for the 500x case today, and you can test it and compare against the current results?

Sure, I'd be more than happy to test and review patches :)

Sure, I'd be more than happy to test and review patches :)

It took a little more time than anticipated. Here are the dependencies first:

https://cgit.freebsd.org/src/commit/?id=805d759338a2be939fffc8bf3f25cfaab981a9be
https://reviews.freebsd.org/D40190
https://reviews.freebsd.org/D40191
https://reviews.freebsd.org/D40192

And the main patch:

https://reviews.freebsd.org/D40193

I've tested AMD64 GENERIC for now, and verified no SYSINITs are sorted at all during startup. Will do some more build testing, if we can agree this is the way to go.

Some notes:

a) sysuninits suffer from the same issue like sysinits (I fixed this in my patch)
b) A lot of unneeded sorting can be avoided by clearing the "order" field when sysinits are marked done. There is no need to sort executed sysinits further.
c) Merge sorting when adding new modules

sys/kern/init_main.c
268

I see. But the log2(N) stack usage doesn't stop the O(N²) CPU usage - right?