Page MenuHomeFreeBSD

sound: Refactor the format conversion framework
Needs ReviewPublic

Authored by christos on Dec 5 2024, 5:57 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Dec 26, 8:30 AM
Unknown Object (File)
Thu, Dec 26, 7:54 AM
Unknown Object (File)
Thu, Dec 26, 7:47 AM
Unknown Object (File)
Thu, Dec 26, 5:54 AM
Unknown Object (File)
Thu, Dec 19, 6:19 PM
Unknown Object (File)
Sat, Dec 14, 4:55 PM
Unknown Object (File)
Dec 8 2024, 5:22 PM

Details

Summary

Merge the PCM_READ|WRITE_* macros defined in pcm/pcm.h, as well as the
intpcm_read|write_* macros defined in pcm/feeder_format.c, into two
inline functions - pcm_sample_read() and pcm_sample_write(). The absence
of macro magic makes the code significantly easier to read, use and
modify.

Since pcm_sample_read() and pcm_sample_write() take the input/output
format as a parameter, get rid of the read() and write() function
pointers defined in struct feed_format_info, as well as the
feeder_format_read|write_op() functions, and use pcm_sample_read() and
pcm_sample_write() directly.

Sponsored by: The FreeBSD Fondation
MFC after: 1 week

Test Plan

Apply D48330 before this patch and run the test.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 61409
Build 58293: arc lint + arc unit

Event Timeline

I tested with all the supported formats and everything seems to work as expected. I suspect we might have even gotten a small performance gain with these changes.

Overall I like the direction of this change, but I still see some major issues:

  • I strongly suggest a unit test for the format conversions.
  • A benchmark wouldn't hurt to check for performance regressions.
  • We have to make sure and check that the inline attribute works as intended.
  • In some places the format is only known at runtime, this will hurt performance.

Also the inline includes do not compile at the moment:

--- feeder_matrix.o ---
In file included from /usr/src/sys/dev/sound/pcm/feeder_matrix.c:47:
In file included from /usr/src/sys/dev/sound/pcm/sound.h:87:
/usr/src/sys/dev/sound/pcm/pcm.h:125:29: error: inline function 'pcm_sample_write' is not defined [-Werror,-Wundefined-inline]
  125 | extern __always_inline void pcm_sample_write(uint8_t *, intpcm_t, int);
      |                             ^
/usr/src/sys/dev/sound/pcm/feeder_matrix.c:171:1: note: used here
  171 | FEEDMATRIX_DECLARE(S, 16, LE)
      | ^
/usr/src/sys/dev/sound/pcm/feeder_matrix.c:127:5: note: expanded from macro 'FEEDMATRIX_DECLARE'
  127 |                                 pcm_sample_write(dst, 0,                \
      |                                 ^
sys/dev/sound/pcm/feeder_format.c
52

Can we move the shift into a function? Like another inline read / write function pair that does the shift depending on format?

54–55

Do we still need this table at runtime? I think it could be replaced by a function.

89–91

These are backwards I think - same for the following *_LE formats.

160–162

Here we only know the audio format at runtime. Inlining the switch here may still be faster than the function pointer call we previously did, but it's hard to tell as it heavily relies on the branch prediction of the CPU.

sys/dev/sound/pcm/feeder_matrix.c
256–257

This is the other place where format is only known at runtime, see comment in feeder_format.c.

sys/dev/sound/pcm/pcm.h
128–129

Does that even work with separate compilation units? I suppose we need the whole function included with a header to force inline?

I found some info on the always_inline attribute and inline in clang, it seems to be a bit more complicated with separate compilation units.

sys/dev/sound/pcm/feeder_format.c
52

This has been in my TODO list already. I was thinking we could just add an additional bool argument to pcm_sample_read|write() to keep things cleaner.

54–55

So if we remove feeder_format_getshift() (see related comment), then this table essentially works as a list of supported formats (see feed_format_init(). I guess we could just turn this table into a switch statement in a new function and inline it inside feed_format_init().

89–91

They are indeed. Good catch. :)

160–162

But I suppose even in the worst case scenario it shouldn't be worse than using a function pointer, or?

sys/dev/sound/pcm/pcm.h
128–129

Strangely enough the current patch compiles on my VM, but I think we could just as well have the functions in pcm/pcm.h. Maybe it'd be make more sense even.

  • I strongly suggest a unit test for the format conversions.

What exactly do we want to test? If converting a sample X from format1 to format2 will give us the intended Y result?

christos added inline comments.
sys/dev/sound/pcm/feeder_format.c
54–55
  • I strongly suggest a unit test for the format conversions.

What exactly do we want to test? If converting a sample X from format1 to format2 will give us the intended Y result?

Yes, that's what I meant. If done the right way it also serves as a check for the conversions on machines with different endianness. Given the extent of refactoring you do, and the fact that we already found some issues with it, that should be enough to justify a unit test. It's really hard to get things like this right for every possible format, and you don't want to find these errors through obscure user bug reports.

sys/dev/sound/pcm/feeder_format.c
54–55

I think you can get rid of the separate implementation for little / big endian here. Shifting the uint8_t around works correctly independent of endianness, as already implemented for most cases. The compiler is smart enough to detect whether these are no-ops, see also sys/sys/endian.h.

54–55

Same as above, this should work without distinction of endianness.

160–162

For modern CPUs, branch prediction also works on function pointers. Which is probably one of the reasons why the function pointer implementation doesn't hurt performance as much as I first expected. Smaller CPUs like a raspberry PI may work differently though, IDK.

Getting the format condition out of this loop would certainly be best, but we get the cross product of read / write formats then, so it may not be worth the efforts.

sys/dev/sound/pcm/feeder_format.c
54–55

I think it'd be more intuitive to check against AFMT_FOO_NE and AFMT_FOO_OE. For the former case we just use the buffer as-is (with the cast), and for the latter, we bswap().

sys/dev/sound/pcm/feeder_format.c
54–55

Possible, but actually less intuitive to me - you still have to handle the 24bit formats differently, and you have to assume things about the platform's endianness. There's platforms with mixed endianness for example, although I don't think FreeBSD currently supports one.

I'd go with the sys/sys/endian.h, uniformly for all formats. Also it is the canonical way that's used in many places for conversions like this.

sys/dev/sound/pcm/feeder_format.c
54–55

To clarify, I meant the sys/sys/endian.h approach, that is also currently used here for 24bit formats. Not to use the actual sys/sys/endian.h implementation.

sys/dev/sound/pcm/feeder_format.c
54–55

So basically, if I understand correctly, what you propose is that for example, AFMT_S16_LE is v = INTPCM_T(src[0] | (int8_t)src[1] << 8);, and AFMT_S16_BE is v = INTPCM_T(src[1] | (int8_t)src[0] << 8);.

sys/dev/sound/pcm/feeder_format.c
54–55

Yes, that should work independent of endianness, and in my experience all compilers are smart enough to optimize this into a simple 16bit-read for native endian samples.

We have to be careful about the meaning of the result values here. As far as I understand these are not correctly signed 32bit or 64bit integers, but some shifted down bit representation of the smaller sample values. An explicit distinction between these and the full integers used for computation would be a huge improvement, IMHO. Separate topic though.

christos marked 7 inline comments as done.
  • Only use PCM_FXSHIFT with 32-bit formats. This is consistent with the previous behavior.
  • I'm not sure if we need to test AC3, MU_LAW and A_LAW, so I left them commented out.
  • @dev_submerge.ch Is there still a problem with compiling if pcm_sample_read|write() are defined in feeder_format.c?
  • We should somehow make sure everything works properly on big-endian machines.
  • Should this live under SND_DEBUG?
christos edited the test plan for this revision. (Show Details)
christos marked 2 inline comments as done.

First of all, could you please update these reviews when you rebase? The diff here still has D47868 included, and thus fails to apply cleanly.

  • Only use PCM_FXSHIFT with 32-bit formats. This is consistent with the previous behavior.
  • I'm not sure if we need to test AC3, MU_LAW and A_LAW, so I left them commented out.
  • @dev_submerge.ch Is there still a problem with compiling if pcm_sample_read|write() are defined in feeder_format.c?

Sure. You cannot compile __always_inline stuff when the definition is not available:

--- feeder_eq.o ---
In file included from /usr/src/sys/dev/sound/pcm/feeder_eq.c:42:
In file included from /usr/src/sys/dev/sound/pcm/sound.h:87:
/usr/src/sys/dev/sound/pcm/pcm.h:126:33: error: inline function 'pcm_sample_read' is not defined [-Werror,-Wundefined-inline]
  126 | extern __always_inline intpcm_t pcm_sample_read(uint8_t *, uint32_t, bool);
      |                                 ^
/usr/src/sys/dev/sound/pcm/feeder_eq.c:217:1: note: used here
  217 | FEEDEQ_DECLARE(S, 16, LE)
      | ^
/usr/src/sys/dev/sound/pcm/feeder_eq.c:161:8: note: expanded from macro 'FEEDEQ_DECLARE'
  161 |                         v = pcm_sample_read(dst, AFMT_##SIGN##BIT##_##ENDIAN,   \
      |                             ^
In file included from /usr/src/sys/dev/sound/pcm/feeder_eq.c:42:
In file included from /usr/src/sys/dev/sound/pcm/sound.h:87:
/usr/src/sys/dev/sound/pcm/pcm.h:127:29: error: inline function 'pcm_sample_write' is not defined [-Werror,-Wundefined-inline]
  127 | extern __always_inline void pcm_sample_write(uint8_t *, intpcm_t, uint32_t, bool);
      |                             ^
/usr/src/sys/dev/sound/pcm/feeder_eq.c:217:1: note: used here
  217 | FEEDEQ_DECLARE(S, 16, LE)
      | ^
/usr/src/sys/dev/sound/pcm/feeder_eq.c:164:4: note: expanded from macro 'FEEDEQ_DECLARE'
  164 |                         pcm_sample_write(dst, v, AFMT_##SIGN##BIT##_##ENDIAN,   \
      |                         ^
2 errors generated.
*** [feeder_eq.o] Error code 1
  • We should somehow make sure everything works properly on big-endian machines.

If we make this a unit test, I suppose it would be run on any big-endian test builders? If there is no CI for those, we could ask someone that works on such an architecture to run them.

  • Should this live under SND_DEBUG?

I'd say a unit test is more appropriate.

sys/dev/sound/pcm/feeder_format.c
54–55

That's already a lot better. For readability I would prefer a separate function that calls this one and does the shift, instead of a boolean flag. That would emphasize the different return values and make it easier to grep what is used in which feeder.

I'm not convinced we actually need two separate functions in the end, the shifted values should work everywhere, since they are of the same magnitude as 32bit (or 24bit) samples. But for refactoring this is ok.

Also the shift here seems contradictory if SND_PCM64 is not set. In that case 32bit samples are shifted down to 24bit magnitude, whereas all other sample formats are shifted up to full 32bit magnitude?

54–55

Apart from putting this into a unit test, I'd suggest to make these tests a lot simpler and more verbose. If you replicate too much logic it's easy to fool yourself (and us reviewers) with a badly written test.

I would start with a byte array like { 0x01, 0x02, 0x03, 0x04 }, read it as a specific sample format and compare it to an explicit value. Then write the value into another array and compare that explicitly. Also repeat this for at least one negative integer value, like { 0x81, 0x82, 0x83, 0x84 } - does that make sense?

sys/dev/sound/pcm/pcm.h
34–41

Some of these comments should be preserved and adjusted. They are valuable hints to understand what we do here.

128–129

You have to put the actual implementation here, otherwise the inlining doesn't work and it won't compile.

christos added inline comments.
sys/dev/sound/pcm/feeder_format.c
54–55

Also the shift here seems contradictory if SND_PCM64 is not set. In that case 32bit samples are shifted down to 24bit magnitude, whereas all other sample formats are shifted up to full 32bit magnitude?

I had the same thought. The reason I kept this was to preserve the current behavior. That being said, I also do not understand this.

/*
 * 24bit integer ?!? This is quite unfortunate, eh? Get the fact straight:
 * Dynamic range for:
 *	1) Human =~ 140db
 *	2) 16bit = 96db (close enough)
 *	3) 24bit = 144db (perfect)
 *	4) 32bit = 196db (way too much)
 *	5) Bugs Bunny = Gazillion!@%$Erbzzztt-EINVAL db
 * Since we're not Bugs Bunny ..uh..err.. avoiding 64bit arithmetic, 24bit
 * is pretty much sufficient for our signed integer processing.
 */

According to this comment from pcm/pcm.h (left diff), 24-bit magnitude is just fine for the human ear, so I suppose we should either:

  1. Not shift down 32-bit formats to 24 bits, and also shift the rest of the formats up to 32 bits.
  2. Shift everything up/down to 24 bits.
54–55

I do not have any strong objection to this approach, but in what way is this test confusing (if at all)? I think it's easier to automate testing of all formats (not sure about AFMT_FLOAT yet, though) this way.

Also I did move it to tests/sys/sound.

sys/dev/sound/pcm/feeder_format.c
54–55

There is this comment though:

To avoid overflow, we truncate 32bit (and only 32bit) samples down to 24bit (see below for the reason), unless SND_PCM_64 is defined.

sys/dev/sound/pcm/feeder_format.c
54–55

Both of these comments apply. In case SND_PCM_64 is set, all calculations are in 64bit integers, there's no risk of overflow, and we may shift all samples up to 32bit magnitude. If SND_PCM_64 is not set, calculations are in 32bit integers and we should avoid anything that exceeds the 24bit magnitude.

And just to "get the fact straight": 32bit precision can be useful in music production setups where you want to preserve the full 24bit precision quality, but need some headroom to accommodate for louder / quieter parts. Thus 24bit everywhere, always, is not an option.

I didn't look at all feeders yet, but there's some nuances in the requirements:

  • Forward feeders that don't do format conversion get by without any sample shift.
  • Format conversion feeders always need the same magnitude for read and write, but won't overflow because there are no calculations. Means we don't have to shift down 32bit samples to 24bit if SND_PCM_64 is not set.
  • Mixers and equalizers do calculations, samples should have the same magnitude for read and write, and we have to shift down 32bit samples to 24bit if SND_PCM_64 is not set.

For the sake of refactoring, I'd suggest to implement separate inline functions for these different requirements. We can call the main sample read / write function from there and shift as necessary per requirement.

54–55

In my experience, unit tests like this should be easily verifiable, test exactly what is required from the implementation, with the least amount of assumptions and dependencies.

For example your current code makes assumptions about the platform's integer memory layout, depends on endianness specific macros, and does many non-obvious (__bswap16(int32_t)?) transformations on the output before checking it. To verify its correctness I have to consider different int sizes, different endianness, different sample formats and that the output is properly checked. BTW, there's an error in there, can you spot it?

While all we want to test is that a certain sample format in a byte array is read to the correct integer value, and vice versa for writing. Put these into a well-structured table of test data, it will be more verbose but also explicit and extendable.

Another reason is that there's already quite some undefined or implementation defined behavior in pcm_sample_read() and pcm_sample_write(), like signed integer shifts or unsigned <-> signed conversions. While hard to avoid there, at least I'd want to keep those out of the tests.

FYI: at some point we had the sound system with the lowest latency. Ariff was comparing MS, OS X and Linux. I do not know if they have catched-up since then, but it may be worth the effort to check the performance / latency for such changes. Unfortunately I haven't found the info about the latency stuff he did, only his resampling quality comparison with other resampler implementations (https://people.freebsd.org/~ariff/z_comparison/). In https://people.freebsd.org/~ariff/old/ he has some old low latency diffs, so some interested soul could check which parts he modified to get lower latency.

FYI: at some point we had the sound system with the lowest latency. Ariff was comparing MS, OS X and Linux. I do not know if they have catched-up since then, but it may be worth the effort to check the performance / latency for such changes. Unfortunately I haven't found the info about the latency stuff he did, only his resampling quality comparison with other resampler implementations (https://people.freebsd.org/~ariff/z_comparison/). In https://people.freebsd.org/~ariff/old/ he has some old low latency diffs, so some interested soul could check which parts he modified to get lower latency.

Thanks for your input. I agree that we should keep an eye on performance, but I'd like to clarify: In our current refactoring of the format conversion we're not touching anything that changes the buffer sizes. And that's the only thing that could possibly affect audio latency. If latency really is of concern, then we'd have to look at smaller buffer sizes and how to make timers and scheduling more reliable. Also I doubt that we ever had the lowest overall latency, if you count in ASIO on Windows. I suspect that was about the latency introduced by the resampler.

What we actually do here is replace the macro-generated explicit specializations for different formats with inline functions and explicit format parameters. This is expected to result in comparable efficiency using modern compilers. I suppose we can have a look at the assembler generated for the unit test. Any suggestion for benchmarking the format conversions?

Thanks for your input. I agree that we should keep an eye on performance, but I'd like to clarify: In our current refactoring of the format conversion we're not touching anything that changes the buffer sizes. And that's the only thing that could possibly affect audio latency. If latency really is of concern, then we'd have to look at smaller buffer sizes and how to make timers and scheduling more reliable. Also I doubt that we ever had the lowest overall latency, if you count in ASIO on Windows. I suspect that was about the latency introduced by the resampler.

We have the possibility to modify the latency / buffer size already. See https://meka.rs/blog/2017/01/25/sing-beastie-sing/
And no, it was not about the resampler. See also https://news.ycombinator.com/item?id=12616183
And it was about 14 years ago or such. So comparing with Windows 2000 / Vista. No idea how much Windows has improved in the latency area since then, but I have some dim memories about "not that much". We should be still quiet good in this area.

What we actually do here is replace the macro-generated explicit specializations for different formats with inline functions and explicit format parameters. This is expected to result in comparable efficiency using modern compilers. I suppose we can have a look at the assembler generated for the unit test. Any suggestion for benchmarking the format conversions?

Benchmarking: including the format conversion code into a userland tool which can be benchmarked, or maybe via dtrace (timestamps between entering and exiting), or hwpmc?

Thanks for your input. I agree that we should keep an eye on performance, but I'd like to clarify: In our current refactoring of the format conversion we're not touching anything that changes the buffer sizes. And that's the only thing that could possibly affect audio latency. If latency really is of concern, then we'd have to look at smaller buffer sizes and how to make timers and scheduling more reliable. Also I doubt that we ever had the lowest overall latency, if you count in ASIO on Windows. I suspect that was about the latency introduced by the resampler.

We have the possibility to modify the latency / buffer size already. See https://meka.rs/blog/2017/01/25/sing-beastie-sing/

Sure, within some limits. There's a lot to fix and improve in that area. Funny enough, Meka's setup was terrible back then in terms of latency.

And no, it was not about the resampler. See also https://news.ycombinator.com/item?id=12616183
And it was about 14 years ago or such. So comparing with Windows 2000 / Vista. No idea how much Windows has improved in the latency area since then, but I have some dim memories about "not that much". We should be still quiet good in this area.

That's not low latency in my book, more like decent enough latency for media consumption. I suppose we do well in that department. For applications that really need low latency, e.g. live sound effects, ASIO on Windows still beats us, and certainly did so 14 years ago. Anyway, different use cases.

What we actually do here is replace the macro-generated explicit specializations for different formats with inline functions and explicit format parameters. This is expected to result in comparable efficiency using modern compilers. I suppose we can have a look at the assembler generated for the unit test. Any suggestion for benchmarking the format conversions?

Benchmarking: including the format conversion code into a userland tool which can be benchmarked, or maybe via dtrace (timestamps between entering and exiting), or hwpmc?

Thanks, I didn't find much info about micro benchmarking in FreeBSD base. DTrace may be an option though if the number of samples processed is large enough.

christos marked 3 inline comments as done.
  • Add pcm_sample_read|write_shift() macros and remove the boolean argument from pcm_sample_read|write().
  • Move tests to sys/tests/sound and re-write according to Florian's suggestions.
  • Hardcode expected results instead of reverse engineering them.
  • Move pcm/sound.h's AFMT_* defines out of ifdef _KERNEL so that the test can use them.

Will rebase dependent reviews when we are done with this first.

Notice that in the test code I left some comments, namely:

  • Since the results are hardcoded, I'm not sure whether this will work on big-endian machines.
  • I left out tests (for now) for when SND_PCM_64 is not set.
  • I'm not sure whether we need to add tests for AFMT_MU_LAW and AFMT_A_LAW. See how pcm/pcm.h handles these formats.
sys/dev/sound/pcm/pcm.h
327–333

Why macros instead of inline functions?

tests/sys/sound/fmtconv.c
68

Shouldn't this value be negative? Same for the following unsigned sample formats?

158–162

That wasn't the idea, it can't work that way. You have to define the format independent of the machine's endianness, here that would be AFMT_U32_LE. We want to test that reading a little endian U32 sample of bytes { 0x81, 0x82, 0x83, 0x84 } always results in the corresponding integer value of 0x04838281. That should be true on any machine, regardless of it's endianness.

For the same reason, the result of the write functions has to be a byte array. We want to test that writing a value of 0x04838281 as a little endian U32 sample results in a byte array of { 0x81, 0x82, 0x83, 0x84 }, regardless of the machine's endianness. Don't compare integer values here, compare the resulting byte arrays. Then our test checks the right thing, without any endianness of the machine running it involved. Only the implementation of the read and write function has to care about endianness. Our test just checks the correctness of the results.

Also you may want to create separate tables for read and write test data. I think you can then fit each test case on one line, tidying up the whole thing.

190

I'd prefer to avoid the undefined behavior of shifting a negative signed integer here, for the sake of testing. Can we add the shifted value to the test data? Alternatively we could reformulate the calculation without undefined behavior.

Please note that I'm working on a unit test which exercises the original sample read / write macros, to be committed before this refactoring. This way we can verify the test procedure and test data with the original code, then adapt it to the refactored code. It's still WIP, but I can already tell that we definitely have some regressions here.

Christos, please have a close look at the precise definition of arithmetic operators in C, regarding integer promotion. There's some subtle nuances about conversion order that affect your code.

BTW, I hope you all had a pleasant start into 2025!

christos marked 4 inline comments as done.
  • Remove unit test in favor of D48330. Work will continue there. Depend on it.
  • Make shift functions inline, instead of macros.
  • Mark read/write functions as __unused to suppress compiler warnings about unused shift functions in various files.
  • Fix regressions with unsigned format conversion. Now passes D48330 tests.
  • Modify D48330 to work with the refactor.

BTW, I hope you all had a pleasant start into 2025!

Happy new year! :-)

christos edited the test plan for this revision. (Show Details)