mi_startup: sort sysinit array using qsort instead of bubble sort
AbandonedPublic
Actions

Authored by arichardson on May 1 2023, 8:34 PM.

Details

Reviewers

imp
jhb
jrtc27
• hselasky

Summary

While booting on an emulator with instruction tracing enabled, I noticed
that a rather large amount of time was being spent inside mi_startup.
It turns out this is due to sorting an array of ~1000 items using bubble
sort (this has been used since at least 1995). Replacing the hand-written
bubble sort with a call to qsort() noticeably speeds up this function by
reducing the sorting cost from 5.5M instructions to 117K (on RISC-V 64):

Sorting 1125 sysinit items
117378 instructions with qsort()
5509157 instructions with bubble

For an x86 VM this also helps noticeably, cperciva@ reports that this
reduces the time mi_startup spends sorting from 6094436 cycles
(2.031 ms) to 585496 cycles (0.195 ms).

Test Plan

Still boots - using an unstable sort shouldn't matter here.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 51557
Build 48448: arc lint + arc unit

Event Timeline

arichardson requested review of this revision.May 1 2023, 8:34 PM

arichardson created this revision.

Harbormaster completed remote builds in B51291: Diff 121336.May 1 2023, 8:34 PM

does qsort's unstable sort actually change the order of anything?

This revision is now accepted and ready to land.May 1 2023, 9:55 PM

In D39916#908603, @imp wrote:

does qsort's unstable sort actually change the order of anything?

I'm not sure if it does since I didn't look at the resulting output. But I'd hope dependencies are captured correctly by the subsystem & order fields.

this is a massively dodgy change, one has to expect some of the real orderings work by accident. i think it would be better to take a look at trimming the list instead.

In D39916#908757, @mjg wrote:

this is a massively dodgy change, one has to expect some of the real orderings work by accident. i think it would be better to take a look at trimming the list instead.

Trimming the list definitely makes sense but I went for the simpler change here. The list is currently somewhat arbitrarily sorted by the linker and I don't think this makes it any worse. We could add a stable sort to libkern but I don't think this should be needed.

In D39916#908771, @arichardson wrote:

In D39916#908757, @mjg wrote:

this is a massively dodgy change, one has to expect some of the real orderings work by accident. i think it would be better to take a look at trimming the list instead.

Trimming the list definitely makes sense but I went for the simpler change here. The list is currently somewhat arbitrarily sorted by the linker and I don't think this makes it any worse. We could add a stable sort to libkern but I don't think this should be needed.

Arbitrary ordering of the linker has bitten us in the past, sometimes it seems like it'd be nice to have the option to introduce some intentional chaos so that we can find potential issues by randomly reordering (within a {subsystem, order}, naturally), but you'd really want to be able to do that *and* find a way to make it reproducible (e.g., feed it this seed via tunable and it'll re-order in such a way that GENERIC dies in a horrible fire)

• hselasky added a subscriber: • hselasky.May 19 2023, 3:28 PM

• hselasky added inline comments.

sys/kern/init_main.c
220	You should know you cannot compare integers using subtraction due to sign overflow. You really need to do if's for every case!
269–271	You really need to be careful here. The qsort() we have in the kernel comes from libc, and it is not safe in any regards, like execution time and stack usage. It is a recursive function, and in worst case O(N²) - I.E. no better than the current bubble sort. My opinion is to not add any more consumers of qsort() until the issues about execution time and stack usage are resolved!

• hselasky requested changes to this revision.May 19 2023, 3:29 PM

This revision now requires changes to proceed.May 19 2023, 3:29 PM

arichardson added inline comments.May 19 2023, 5:18 PM

sys/kern/init_main.c
220	Looking at all the SI_SUB_* values shows that the largest value is (1<<28)-1, so in this case it's safe. I don't know if it matters too much but using the subtraction should be faster.
269–271	That is the worst case though - in my testing it's substantially faster than bubblesort. Ideally these arrays could be sorted at compile-time but this is a rather huge speedup (50x faster)

In D39916#914437, @kevans wrote:

In D39916#908771, @arichardson wrote:

In D39916#908757, @mjg wrote:

this is a massively dodgy change, one has to expect some of the real orderings work by accident. i think it would be better to take a look at trimming the list instead.

Trimming the list definitely makes sense but I went for the simpler change here. The list is currently somewhat arbitrarily sorted by the linker and I don't think this makes it any worse. We could add a stable sort to libkern but I don't think this should be needed.

Arbitrary ordering of the linker has bitten us in the past, sometimes it seems like it'd be nice to have the option to introduce some intentional chaos so that we can find potential issues by randomly reordering (within a {subsystem, order}, naturally), but you'd really want to be able to do that *and* find a way to make it reproducible (e.g., feed it this seed via tunable and it'll re-order in such a way that GENERIC dies in a horrible fire)

I guess you could print the seed during boot to allow reproducing it. Maybe we could also add a tunable to run initializers that have the same subsystem and order in reverse which should catch most of those issues?

jrtc27 added inline comments.May 19 2023, 5:23 PM

sys/kern/init_main.c
220	But this isn't comparing arbitrary integers, it's comparing values between 0 and SI_SUB_LAST/SI_ORDER_ANY, which means between 0 and 0xfffffff, i.e. non-negative values that fit in 28 bits (29 bits with a sign bit). So no, overflow is impossible. But if you really cared, you could always do the subtraction in the unsigned world and cast back at the end.

Rebased on latest main.
@cperciva it would be great if you could test how much this helps your usecase.

Harbormaster completed remote builds in B51557: Diff 122129.May 19 2023, 5:24 PM

In D39916#914484, @arichardson wrote:

In D39916#914437, @kevans wrote:

In D39916#908771, @arichardson wrote:

In D39916#908757, @mjg wrote:

this is a massively dodgy change, one has to expect some of the real orderings work by accident. i think it would be better to take a look at trimming the list instead.

Trimming the list definitely makes sense but I went for the simpler change here. The list is currently somewhat arbitrarily sorted by the linker and I don't think this makes it any worse. We could add a stable sort to libkern but I don't think this should be needed.

Arbitrary ordering of the linker has bitten us in the past, sometimes it seems like it'd be nice to have the option to introduce some intentional chaos so that we can find potential issues by randomly reordering (within a {subsystem, order}, naturally), but you'd really want to be able to do that *and* find a way to make it reproducible (e.g., feed it this seed via tunable and it'll re-order in such a way that GENERIC dies in a horrible fire)

I guess you could print the seed during boot to allow reproducing it. Maybe we could also add a tunable to run initializers that have the same subsystem and order in reverse which should catch most of those issues?

I really, really like that idea. That feels much easier to implement while accomplishing the same goal with fewer logistics issues (e.g., you /could/ print the seed during boot, but then that may get lost in scrollback and you might not have a keyboard to get back to it at the time of failure; though ideally you'd be doing this with a serial console recorded).

In D39916#914489, @arichardson wrote:

Rebased on latest main.
@cperciva it would be great if you could test how much this helps your usecase.

Reduces the time mi_startup spends sorting from 6094436 cycles (2.031 ms) to 585496 cycles (0.195 ms).

BTW you should change the TSENTER2/TSEXIT2 lines from "bubblesort" to "qsort". :-)

fix tslog

Harbormaster completed remote builds in B51560: Diff 122134.May 19 2023, 8:03 PM

arichardson edited the summary of this revision. (Show Details)May 19 2023, 8:21 PM

• hselasky added inline comments.May 19 2023, 10:11 PM

sys/kern/init_main.c
220	Hi, Maybe add a dependency for that, like compile time asserting SI_SUB_LAST and SI_ORDER_ANY being valid for the optimisation? The code pattern you've written, is incorrect for full 32-bit integer comparison. And the fields in question are enums and not limited to 28 bits from what I can see. Static code analysis tools don't grasp how values are put into these fields and the enums in kernel.h may change. Why not write it like it should be done from the start, which will handle all current and future values, regardless of subsystem and order signedness aswell? if (a > b) return (1); else if (a < b) return (-1); else return (0); The clang compiler, when using -O2, elegantly utilize dedicated CPU instructions for the code above, to avoid any branching or gotos. To me the optimization looks like a false positive bad coding case (or style issue). The code you've written looks bad simply put. I'm not saying it is wrong.
269–271	Like some people already know and have learnt, the current implementation of qsort() may suddenly bite you when you least expect it. And given the world is already full of problems, why leave more behind? Are you sure you have enough stack for the worst -case recursion inside qsort() in the environment where this function is executing? I've already proposed bsort(), https://reviews.freebsd.org/D36493, as a safe sorting function. Talking about performance, my compile time sorting method will give the best performance and debugging experience. Why not go directly there and start working on that, instead of changing the code twice? Change the bubble sort into an insertion sort algorithm, and keep it that way. When an insertion sort algorithm sees a sorted array, it will be O(N): https://en.wikipedia.org/wiki/Insertion_sort This is critical boot code, and qsort() doesn't belong there, as currently implemented. Sorting these fields should be done at compile time, instead of every time the system boots. That's even better. It is quite trivial to implement.

Sorting 1125 sysinit items
117378 instructions with qsort()
5509157 instructions with bubble

And if the list is already sorted and you use insertion sort there instead of bubble sort, what is the time then? Can you get that into the list too?

I guess it will be something like 500x instead of your 50x.

--HPS

In D39916#914569, @hselasky wrote:
Sorting 1125 sysinit items
117378 instructions with qsort()
5509157 instructions with bubble
And if the list is already sorted and you use insertion sort there instead of bubble sort, what is the time then? Can you get that into the list too?

I guess it will be something like 500x instead of your 50x.

--HPS

Yes that would obviously be faster but I don't have time to work on that. If you are planning to work on this and it's going to be ready for review in the near future I'd be happy to drop this patch.

However, using a pre-existing sorting function to obtain a noticeable speedup seems like a valid approach to me. Just because it could be done in a more optimal way at some hypothetical point in the future shouldn't mean we can't make incremental improvements on the way to an ideal solution. Otherwise things would never improve...

arichardson added inline comments.May 20 2023, 12:54 AM

sys/kern/init_main.c
220	I'm fine changing this to avoid the subtraction although it does perform measurable better on simpler ISAs such as RISC-V
269–271	I'd be more than happy to use a better sorting function, but as of today only qsort exists in the kernel and I'm not going to spend my time writing sorting algorithms. Considering the tiny size of the input I really don't think the worst case qsort stack usage matters since it's log n. And the worst case time is still no worse than the current code.

In D39916#914571, @arichardson wrote:
In D39916#914569, @hselasky wrote:
Sorting 1125 sysinit items
117378 instructions with qsort()
5509157 instructions with bubble
And if the list is already sorted and you use insertion sort there instead of bubble sort, what is the time then? Can you get that into the list too?

I guess it will be something like 500x instead of your 50x.

--HPS
Yes that would obviously be faster but I don't have time to work on that. If you are planning to work on this and it's going to be ready for review in the near future I'd be happy to drop this patch.

However, using a pre-existing sorting function to obtain a noticeable speedup seems like a valid approach to me. Just because it could be done in a more optimal way at some hypothetical point in the future shouldn't mean we can't make incremental improvements on the way to an ideal solution. Otherwise things would never improve...

Could we make a deal on that? I writeup a patch for the 500x case today, and you can test it and compare against the current results?

• hselasky added inline comments.May 20 2023, 8:14 AM

sys/kern/init_main.c
269–271	I think the recursion level for qsort() in the kernel is log2(N) only if the sorting input is properly random. Else you risk a recursion equal to the number of elements you sort. I may be wrong, but from my 10-seconds of look at the kernel's qsort(), it looks like the vanilla and unmodified Quick Sort algorithm.

Could we make a deal on that? I writeup a patch for the 500x case today, and you can test it and compare against the current results?

I can test patches.

sys/kern/init_main.c
269–271	We recurse on the smaller "half" and iterate on the larger "half". Our worst case stack depth is log2(N).

In D39916#914668, @hselasky wrote:
In D39916#914571, @arichardson wrote:
In D39916#914569, @hselasky wrote:
Sorting 1125 sysinit items
117378 instructions with qsort()
5509157 instructions with bubble
And if the list is already sorted and you use insertion sort there instead of bubble sort, what is the time then? Can you get that into the list too?

I guess it will be something like 500x instead of your 50x.

--HPS
Yes that would obviously be faster but I don't have time to work on that. If you are planning to work on this and it's going to be ready for review in the near future I'd be happy to drop this patch.

However, using a pre-existing sorting function to obtain a noticeable speedup seems like a valid approach to me. Just because it could be done in a more optimal way at some hypothetical point in the future shouldn't mean we can't make incremental improvements on the way to an ideal solution. Otherwise things would never improve...
Could we make a deal on that? I writeup a patch for the 500x case today, and you can test it and compare against the current results?

Sure, I'd be more than happy to test and review patches :)

In D39916#914843, @arichardson wrote:

Sure, I'd be more than happy to test and review patches :)

It took a little more time than anticipated. Here are the dependencies first:

https://cgit.freebsd.org/src/commit/?id=805d759338a2be939fffc8bf3f25cfaab981a9be
https://reviews.freebsd.org/D40190
https://reviews.freebsd.org/D40191
https://reviews.freebsd.org/D40192

And the main patch:

https://reviews.freebsd.org/D40193

I've tested AMD64 GENERIC for now, and verified no SYSINITs are sorted at all during startup. Will do some more build testing, if we can agree this is the way to go.

Some notes:

a) sysuninits suffer from the same issue like sysinits (I fixed this in my patch)
b) A lot of unneeded sorting can be avoided by clearing the "order" field when sysinits are marked done. There is no need to sort executed sysinits further.
c) Merge sorting when adding new modules

sys/kern/init_main.c
269–271	I see. But the log2(N) stack usage doesn't stop the O(N²) CPU usage - right?

cy added a subscriber: cy.May 22 2023, 1:20 AM

FYI: Here is my second take on the issue D40463 . Please have a look if you are interested.

cperciva mentioned this in D41075: init_main: Switch from sysinit array to SLIST.Jul 19 2023, 1:03 AM

emaste added a subscriber: emaste.Jul 19 2023, 2:34 PM

This was solved differently in 9a7add6d0 and 71679cf468b.

Herald added a subscriber: olce. · View Herald TranscriptJun 28 2024, 8:19 AM

arichardson abandoned this revision.Aug 23 2024, 9:57 PM

Revision Contents
Changeset List

Path

Size

sys/

kern/

init_main.c

33 lines

Diff 122129

View Options

mi_startup: sort sysinit array using qsort instead of bubble sortAbandonedPublicActions