Page MenuHomeFreeBSD

bitset(9): Introduce BIT_FOREACH_SET and BIT_FOREACH_CLR
ClosedPublic

Authored by markj on Sep 20 2021, 7:11 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jan 19, 12:37 PM
Unknown Object (File)
Sat, Jan 11, 10:01 PM
Unknown Object (File)
Sat, Jan 11, 8:01 PM
Unknown Object (File)
Tue, Jan 7, 10:00 PM
Unknown Object (File)
Dec 17 2024, 2:10 AM
Unknown Object (File)
Dec 17 2024, 2:10 AM
Unknown Object (File)
Dec 16 2024, 6:51 PM
Unknown Object (File)
Dec 1 2024, 8:27 PM
Subscribers

Details

Summary

These allow one to non-destructively iterate over the set or clear bits
in a bitset. The motivation is that we have several code fragments
which iterate over a CPU set like this:

while ((cpu = CPU_FFS(&cpus)) != 0) {
	cpu--;
	CPU_CLR(cpu, &cpus);
	<do something>;
}

This is slow because CPU_FFS begins the search at the beginning of the
bitset each time. On amd64 and arm64, CPU sets have size 256, so there
are four limbs in the bitset and we do a lot of unnecessary scanning.

A second problem is that this is destructive, so code which needs to
preserve the original set has to make a copy. In particular, we have
quite a few functions which take a cpuset_t parameter by value, meaning
that each call has to copy the 32 byte cpuset_t.

An alternate implementation would be something like this:

for (i = 0; i < BITSET_SIZE; i++)
	if (BIT_ISSET(i, bitset))
		<do something>;

Some benchmarking on a new-ish i7 CPU and on a Haswell Xeon indicates
that this alternate implementation is much slower, though. I tried
several microbenchmarks which iterate over a set and store a value to
memory in the loop body. When the set is zero-filled, BIT_FOREACH_SET
is at least an order of magnitude faster; when it is all-ones
BIT_FOREACH_SET is about twice as fast; when it is filled with random
bytes BIT_FOREACH_SET is 5-6x faster. The destructive loop is even
worse than the other two on this hardware.

I get similar results on an EC2 Graviton (arm64) instance, but the
differences are even more pronounced.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

markj requested review of this revision.Sep 20 2021, 7:11 PM
cem added inline comments.
share/man/man9/bitset.9
97

size_t bit might be more consistent with other examples here.

This revision is now accepted and ready to land.Sep 20 2021, 7:28 PM

Might be BIT_FOREACH_ONES and BIT_FOREACH_ZEROS would be more telling? My first thought when I saw the names was that they set and clear all bits somehow.

In D32028#722786, @kib wrote:

Might be BIT_FOREACH_ONES and BIT_FOREACH_ZEROS would be more telling? My first thought when I saw the names was that they set and clear all bits somehow.

How about BIT_FOREACH_ISSET and BIT_FOREACH_ISCLR, for consistency with BIT_ISSET/BIT_ISCLR? _ONES and _ZEROS seems ok too, but the cpuset wrappers CPU_FOREACH_ONES and CPU_FOREACH_ZEROS sound weird to me.

In D32028#722786, @kib wrote:

Might be BIT_FOREACH_ONES and BIT_FOREACH_ZEROS would be more telling? My first thought when I saw the names was that they set and clear all bits somehow.

How about BIT_FOREACH_ISSET and BIT_FOREACH_ISCLR, for consistency with BIT_ISSET/BIT_ISCLR? _ONES and _ZEROS seems ok too, but the cpuset wrappers CPU_FOREACH_ONES and CPU_FOREACH_ZEROS sound weird to me.

Ok with me.

This revision now requires review to proceed.Sep 20 2021, 8:35 PM
This revision is now accepted and ready to land.Sep 20 2021, 8:40 PM
sys/sys/bitset.h
274–279

This nested for loop construct has a booby trap, continue/break don't work, and it isn't documented. A way out of this is to flatten it into one loop, perhaps using helper functions or macros to make it readable.

This macro also has a problem (not new, like most of the others here) that if the bit set size is not a multiple of the word size, garbage will be reported at the end. This may be particularly acute for the BIT_FOREACH_ISCLR macro. I started a patch to fix that in the other macros but didn't finish it: https://github.com/rlibby/freebsd/commit/a205c7d865d3d826b96fe48d7011fd870f3cbb93

I also started and didn't finish some tests: https://github.com/rlibby/freebsd/commit/5dc206eb1b52aad570688b260a7f473b31bb0266

sys/sys/bitset.h
274–279

Nice catch. Re bitset size, we could static assert that? Unless we have a bitset using this construct that isn’t a multiple of word size.

sys/sys/bitset.h
274–279

Re bitset size, we could static assert that? Unless we have a bitset using this construct that isn’t a multiple of word size.

I don't want to distract from the feature too much here since this is an existing issue, just wanted to raise awareness. We can take it up later.

That said: Since not all bitsets have a statically defined size, I don't think it will be easy to enforce static rules about it. On the other hand, I think it's not too hard to fix and potentially has no extra cost for static size sets with nice sizes.

sys/sys/bitset.h
274–279

Thanks for pointing this out, I feel quite dumb for not having thought about it. (Though, I think it is just "break" that is broken and not "continue"?)

I've spent a fair bit of time trying to repair this. It's straightforward to eliminate the nesting if I use a statement expression which conditionally advances to the next non-zero limb of the bitmap. The problem is that clang and gcc both generate quite pessimal code no matter how I try to write it. As a result the new no-nesting macro ends up much slower than before if the bitmap is mostly populated. Most of the time it's slower than a simple loop over each bit[*]. The new macro is still much faster for sparse bitmaps, so maybe that's acceptable.

Now I think it's possible to fix the problem without removing the nesting: if the outer loop additionally tests a condition which is false only if the inner loop terminates without clearing every bit in the copied limb, then "break" works and the loop runs basically as fast as before. So, is there anything other than "break" that I should consider?

If I go ahead with this idea (which works with some basic tests at least), I would include some regression tests, probably using your tests as a starting point.

\* With a full bitmap, testing each bit is faster than anything else I've implemented if I also disable loop unrolling. But with cpusets (size 256) I expect the compiler to do some unrolling anyway.

sys/sys/bitset.h
274–279

I think it's right that it's just break that is broken and continue is fine.

I think your suggestion to check an additional condition in the outer loop should work, I can't think of any other grammar problems. I wouldn't complain about that approach.

You might try something like this if you haven't already:

#define	_BIT_FOREACH_ADVANCE(_s, i, p, op, __iter) __extension__ ({	\
	int __found;							\
	for (;;) {							\
		if (__iter.v != 0) {					\
			int __b = ffsl(__iter.v) - 1;			\
			__iter.v &= ~(1l << __b);			\
			(i) = __iter.i * _BITSET_BITS + __b;		\
			__found = 1;					\
			break;						\
		}							\
		__iter.i++;						\
		if (__iter.i >= __bitset_words(_s)) {			\
			__found = 0;					\
			break;						\
		}							\
    		__iter.v = op((p)->__bits[__iter.i]);			\
	}								\
	__found != 0;							\
)}

struct _bitset_iter {
	ssize_t i;
	long v;
};

#define _BIT_FOREACH(_s, i, p, op)					\
	for (struct _bitset_iter __iter = { -1, 0 };			\
	    _BIT_FOREACH_ADVANCE(_s, i, p, op, __iter); )

(Untested, not compiled, just a sketch.)