Page MenuHomeFreeBSD

sound: Allocate vchans on-demand
Needs ReviewPublic

Authored by christos on Dec 4 2024, 4:57 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Jan 6, 9:13 PM
Unknown Object (File)
Thu, Jan 2, 2:01 AM
Unknown Object (File)
Thu, Dec 26, 10:32 PM
Unknown Object (File)
Thu, Dec 26, 10:22 AM
Unknown Object (File)
Thu, Dec 26, 9:07 AM
Unknown Object (File)
Thu, Dec 26, 8:47 AM
Unknown Object (File)
Thu, Dec 26, 8:23 AM
Unknown Object (File)
Thu, Dec 26, 5:40 AM
Subscribers

Details

Summary

Refactor pcm_chnalloc() and merge with parts of vchan_setnew() (now
removed) and dsp_open()’s channel creation into a new dsp_chn_alloc()
function. The function is responsible for either using a free HW channel
(if vchans are disabled), or allocating a new vchan.

Clean up allocated vchans associated with a given dsp_cdevpriv on
dsp_close() instead of leaving them unused.

hw.snd.vchans_enable (previously hw.snd.maxautovchans) and
dev.pcm.X.{play|rec}.vchans now work as tunables to only enable/disable
vchans, as opposed to setting their number and/or (de-)allocating
vchans. Since these sysctls do not trigger any (de-)allocations anymore,
their effect is instantaneous, whereas before we could have frozen the
machine (when trying to allocate new vchans) when setting
dev.pcm.X.{play|rec}.vchans to a very large value.

Create a new "primary" channel sublist so that we do not waste time
looping through all channels in dsp_chn_alloc(), since we are only
looking for a parent channel to either use, or add a new vchan to. This
guarantees a steady traversal speed, as the parent channels are most
likely going to be just a handful (2). What was currently in place was a
loop through the whole channel list, which meant that the traversal
would take longer the more channels were added to that list.

Previously the vchan_setmaxauto() call in pcm_register() would
eventually trigger a call to chn_reset() on the primary (parent)
channels with, usually, whatever \*vchanformat and \*vchanrate were set
to. To compensate for this, set \*vchanformat and \*vchanrate already in
pcm_init() and add a chn_reset() call in chn_init() and assign them as
the channel speed, instead of using the defaults (AFMT_U8 format and
8000 rate).

Sponsored by: The FreeBSD Foundation
MFC after: 1 week

Test Plan
  • Audio tests (snd_uaudio(4): Focusrite Scarlett Solo)
    • Simplex: OK
    • Duplex: OK
    • Same vchanrate: OK
    • Same vchanrate, one direction of vchans disabled: OK
    • Same vchanrate, vchans disabled: OK
    • Different vchanrate: BAD (Most likely what's mentioned in the first paragraph of snd_uaudio.4 BUGS)
    • Different vchanrate, one direction of vchans disabled: BAD (see above)
    • Different vchanrate, vchans disabled: BAD (see above)
  • Benchmark: for i in $(seq 1 X); do cat /dev/random >/dev/dsp & done

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 61517
Build 58401: arc lint + arc unit

Event Timeline

Some notes:

  • I would like a different name than hw.snd.maxautovchans since the functionality is different now. Maybe hw.snd.vchans? That would break compatibility though. In any case, I will add a RELNOTES entry when it's time to commit.
  • @dev_submerge.ch regarding your comment in a previous review: "I'd recommend to cap the total number of vchans at around 10000, to keep the allocation / deallocation time for a single vchan below ~1ms.”. Do you think we still need this limit?

Some notes:

  • I would like a different name than hw.snd.maxautovchans since the functionality is different now. Maybe hw.snd.vchans? That would break compatibility though. In any case, I will add a RELNOTES entry when it's time to commit.

This implementation is more like a default value for the dev.pcm.X.vchans, maybe default_vchans or autovchans?

  • @dev_submerge.ch regarding your comment in a previous review: "I'd recommend to cap the total number of vchans at around 10000, to keep the allocation / deallocation time for a single vchan below ~1ms.”. Do you think we still need this limit?

I guess we still need this limit as long as we traverse the whole list in some of the operations. It's possible that we're doing less list traversals now, but the ballpark should be the same I think.

  • Audio tests (snd_uaudio(4): Focusrite Scarlett Solo)
    • Simplex: OK
    • Duplex: OK
    • Same vchanrate: OK
    • Same vchanrate, one direction of vchans disabled: OK
    • Same vchanrate, vchans disabled: OK
    • Different vchanrate: BAD (Most likely what's mentioned in the first paragraph of snd_uaudio.4 BUGS)
    • Different vchanrate, one direction of vchans disabled: BAD (see above)
    • Different vchanrate, vchans disabled: BAD (see above)

AFAIK there are multiple problems related to these errors:

First of all, it's not just snd_uaudio(4), none of the common sound cards can handle multiple sample rates at the same time.

Then, vchan settings define an audio format as a common base for the vchan conversion, they are not meant to set the audio format of the hardware. It just happens that the feeder composition tries to minimize conversions, and thereby also tries to change the format of the hardware. In fact we should be able to set any vchan format / sample rate that can be converted to the hardware format. Whether that changes the hardware audio format is secondary.

Now from the driver's POV, it has to decide for a certain hardware audio format. Unfortunately the format requests through the kmod interface come in no particular order, and separately for rec / play direction and sample rate, format, channel number, latency. Only when a channel is triggered and the hardware is running can the driver be sure that the hardware format cannot be changed.

And here comes another issue, IIRC the driver does return the actual values for the hardware audio parameters when a format is requested, but some of these return values are ignored.

So, unless there's a regression, I think the problems regarding different vchanrate settings are unrelated to this patch.

Some notes:

  • I would like a different name than hw.snd.maxautovchans since the functionality is different now. Maybe hw.snd.vchans? That would break compatibility though. In any case, I will add a RELNOTES entry when it's time to commit.

This implementation is more like a default value for the dev.pcm.X.vchans, maybe default_vchans or autovchans?

The patch makes it so that maxautovchans behaves like a global switch for vchans, which can be overriden by the individual dev.pcm.X.[play|rec].vchans controls. So IMHO default_vchans doesn't really apply here. autovchans could work, but I think simply vchans is more intuitive. What do you think?

  • @dev_submerge.ch regarding your comment in a previous review: "I'd recommend to cap the total number of vchans at around 10000, to keep the allocation / deallocation time for a single vchan below ~1ms.”. Do you think we still need this limit?

I guess we still need this limit as long as we traverse the whole list in some of the operations. It's possible that we're doing less list traversals now, but the ballpark should be the same I think.

The allocation traversals are significantly less now because we are only looping through 1 or 2 channels, instead of the whole list for each dsp_open(). But it's true that there are still places where we need to loop through the whole list. And 10000 should be enough for pretty much anyone I guess; my test VM with 512M RAM can barely handle ~450. :-)

10000 in both directions, or for each one? I think we could use it as a total cap.

  • Audio tests (snd_uaudio(4): Focusrite Scarlett Solo)

[...]

First of all, it's not just snd_uaudio(4), none of the common sound cards can handle multiple sample rates at the same time.

Then, vchan settings define an audio format as a common base for the vchan conversion, they are not meant to set the audio format of the hardware. It just happens that the feeder composition tries to minimize conversions, and thereby also tries to change the format of the hardware. In fact we should be able to set any vchan format / sample rate that can be converted to the hardware format. Whether that changes the hardware audio format is secondary.

Now from the driver's POV, it has to decide for a certain hardware audio format. Unfortunately the format requests through the kmod interface come in no particular order, and separately for rec / play direction and sample rate, format, channel number, latency. Only when a channel is triggered and the hardware is running can the driver be sure that the hardware format cannot be changed.

And here comes another issue, IIRC the driver does return the actual values for the hardware audio parameters when a format is requested, but some of these return values are ignored.

So, unless there's a regression, I think the problems regarding different vchanrate settings are unrelated to this patch.

I haven't noticed any regression so far, but I will do more tests.

Some notes:

  • I would like a different name than hw.snd.maxautovchans since the functionality is different now. Maybe hw.snd.vchans? That would break compatibility though. In any case, I will add a RELNOTES entry when it's time to commit.

This implementation is more like a default value for the dev.pcm.X.vchans, maybe default_vchans or autovchans?

The patch makes it so that maxautovchans behaves like a global switch for vchans, which can be overriden by the individual dev.pcm.X.[play|rec].vchans controls. So IMHO default_vchans doesn't really apply here. autovchans could work, but I think simply vchans is more intuitive. What do you think?

To me that's the exact definition of a default value ;-)
Maybe vchans_enable is closer to what it does now. But I'm not going to argue, I think a good description / documentation is more important.

  • @dev_submerge.ch regarding your comment in a previous review: "I'd recommend to cap the total number of vchans at around 10000, to keep the allocation / deallocation time for a single vchan below ~1ms.”. Do you think we still need this limit?

I guess we still need this limit as long as we traverse the whole list in some of the operations. It's possible that we're doing less list traversals now, but the ballpark should be the same I think.

The allocation traversals are significantly less now because we are only looping through 1 or 2 channels, instead of the whole list for each dsp_open(). But it's true that there are still places where we need to loop through the whole list. And 10000 should be enough for pretty much anyone I guess; my test VM with 512M RAM can barely handle ~450. :-)

10000 in both directions, or for each one? I think we could use it as a total cap.

10000 in total should be enough, we can then check for some problematic code paths and reevaluate if necessary.

share/man/man4/pcm.4
362–369

We should mention here that it can be overridden. Also the explanation for dev.pcm.X.vchans needs an update regarding that.

christos marked an inline comment as done.
  • Rename hw.snd.maxautovchans to hw.snd.vchans_enable.
  • Add 10000 total channel cap.
  • Update man page.

I will also add a RELNOTES entry for hw.snd.maxautovchans when it's time to commit.

Update sysctl descriptions.

share/man/man4/pcm.4
365

"overridden"

sys/dev/sound/pcm/dsp.c
205–214

Since d is not locked here, and the flags do not represent the current state of opened (v)channels, could we end up with using a busy primary channel here? Or a vchan on top of a directly opened primary channel?

363–364

Increase of the channel counters is not protected under the same mutex lock that we hold here. The comparison should at least be >=, in case multiple channels are opened at once and the total of the counters exceeds 10000.

sys/dev/sound/pcm/vchan.c
813–814

Are we allowed to clear the CHN_F_BUSY flag here? I suppose it could depend on other children's state?

Kernel panics when I play something, stop it, cat /dev/sndstat, and then try to play something again. My tree has D47868, D47917 and D47918 on top of main branch, I suspect it's this one.

Fatal trap 9: general protection fault while in kernel mode
cpuid = 4; apic id = 04
instruction pointer     = 0x20:0xffffffff808c8d70
stack pointer           = 0x28:0xfffffe00d9971bd0
frame pointer           = 0x28:0xfffffe00d9971c00
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1026 (sox)
rdi: fffff800018b8c80 rsi: 0000000000000008 rdx: deadc0dedeadc20e
rcx: deadc0dedeadc0de  r8: 0000000000000002  r9: ffffffffffffffff
rax: fffff800016b5d88 rbx: fffff80007e3bc00 rbp: fffffe00d9971c00
r10: ffffffff81c4a918 r11: 0000000000000000 r12: fffff80009404080
r13: fffff800016b5d80 r14: 0000000000000001 r15: fffff800079ac090
trap number             = 9
panic: general protection fault
cpuid = 4
time = 1734030966
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00d9971910
vpanic() at vpanic+0x136/frame 0xfffffe00d9971a40
panic() at panic+0x43/frame 0xfffffe00d9971aa0
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe00d9971b00
calltrap() at calltrap+0x8/frame 0xfffffe00d9971b00
--- trap 0x9, rip = 0xffffffff808c8d70, rsp = 0xfffffe00d9971bd0, rbp = 0xfffffe00d9971c00 ---
chn_trigger() at chn_trigger+0x150/frame 0xfffffe00d9971c00
chn_write() at chn_write+0x164/frame 0xfffffe00d9971c60
dsp_io_ops() at dsp_io_ops+0x3ae/frame 0xfffffe00d9971cc0
dsp_write() at dsp_write+0x22/frame 0xfffffe00d9971ce0
devfs_write_f() at devfs_write_f+0xf3/frame 0xfffffe00d9971d40
dofilewrite() at dofilewrite+0x81/frame 0xfffffe00d9971d90
sys_write() at sys_write+0xb7/frame 0xfffffe00d9971e00
amd64_syscall() at amd64_syscall+0x158/frame 0xfffffe00d9971f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00d9971f30
--- syscall (4, FreeBSD ELF64, write), rip = 0x82e2b3d3a, rsp = 0x820d384c8, rbp = 0x820d384f0 ---
KDB: enter: panic
[ thread pid 1026 tid 100293 ]
Stopped at      kdb_enter+0x33: movq    $0,0x1055632(%rip)

Kernel panics when I play something, stop it, cat /dev/sndstat, and then try to play something again. My tree has D47868, D47917 and D47918 on top of main branch, I suspect it's this one.

Fatal trap 9: general protection fault while in kernel mode
cpuid = 4; apic id = 04
instruction pointer     = 0x20:0xffffffff808c8d70
stack pointer           = 0x28:0xfffffe00d9971bd0
frame pointer           = 0x28:0xfffffe00d9971c00
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1026 (sox)
rdi: fffff800018b8c80 rsi: 0000000000000008 rdx: deadc0dedeadc20e
rcx: deadc0dedeadc0de  r8: 0000000000000002  r9: ffffffffffffffff
rax: fffff800016b5d88 rbx: fffff80007e3bc00 rbp: fffffe00d9971c00
r10: ffffffff81c4a918 r11: 0000000000000000 r12: fffff80009404080
r13: fffff800016b5d80 r14: 0000000000000001 r15: fffff800079ac090
trap number             = 9
panic: general protection fault
cpuid = 4
time = 1734030966
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00d9971910
vpanic() at vpanic+0x136/frame 0xfffffe00d9971a40
panic() at panic+0x43/frame 0xfffffe00d9971aa0
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe00d9971b00
calltrap() at calltrap+0x8/frame 0xfffffe00d9971b00
--- trap 0x9, rip = 0xffffffff808c8d70, rsp = 0xfffffe00d9971bd0, rbp = 0xfffffe00d9971c00 ---
chn_trigger() at chn_trigger+0x150/frame 0xfffffe00d9971c00
chn_write() at chn_write+0x164/frame 0xfffffe00d9971c60
dsp_io_ops() at dsp_io_ops+0x3ae/frame 0xfffffe00d9971cc0
dsp_write() at dsp_write+0x22/frame 0xfffffe00d9971ce0
devfs_write_f() at devfs_write_f+0xf3/frame 0xfffffe00d9971d40
dofilewrite() at dofilewrite+0x81/frame 0xfffffe00d9971d90
sys_write() at sys_write+0xb7/frame 0xfffffe00d9971e00
amd64_syscall() at amd64_syscall+0x158/frame 0xfffffe00d9971f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00d9971f30
--- syscall (4, FreeBSD ELF64, write), rip = 0x82e2b3d3a, rsp = 0x820d384c8, rbp = 0x820d384f0 ---
KDB: enter: panic
[ thread pid 1026 tid 100293 ]
Stopped at      kdb_enter+0x33: movq    $0,0x1055632(%rip)

Interesting, I am not able to reproduce this. Does this happen with a different player than sox as well? Not sure how this patch could be responsible for this, though. All it does is essentially allocate vchans on open() and delete them on close(), instead of keeping them around after close().

sys/dev/sound/pcm/dsp.c
205–214

d is in fact "locked". SD_F_BUSY is set, which acts as a kind of sleepable lock. Also which flags are you referring to? The PCM or the channel ones?

sys/dev/sound/pcm/vchan.c
813–814

The reason I am clearing it here, is because I see that CHN_F_HAS_VCHAN is also cleared here. The fail label is only reached if a call to chn_reset() or chn_getcaps() on the parent fails. So I guess this code is reached if the parent is (for lack of a better word) unusable in general.

Kernel panics when I play something, stop it, cat /dev/sndstat, and then try to play something again. My tree has D47868, D47917 and D47918 on top of main branch, I suspect it's this one.

Fatal trap 9: general protection fault while in kernel mode
...

Interesting, I am not able to reproduce this. Does this happen with a different player than sox as well? Not sure how this patch could be responsible for this, though. All it does is essentially allocate vchans on open() and delete them on close(), instead of keeping them around after close().

I now isolated the panic to this patch (on top of D47868 which is required to build). It also happens with cat /dev/zero > /dev/dsp instead of sox. The cat /dev/sndstat is not needed to reproduce it. I did a full system build / install with and without this patch, consistent results.

sys/dev/sound/pcm/dsp.c
205–214

I meant the d->flags, but you're right, they are protected through SD_F_BUSY. What I apparently can do is to enable vchans with sysctl dev.pcm.0.play.vchans = 1 while playing on a primary channel, and then open a vchan for playback on top of that. It leads to a timeout and inconsistent state afterwards, not good.

I think there's a conceptual problem here with the flags not representing the actual state of channels running. We have to know whether the primary channel is used directly or only for vchans.

sys/dev/sound/pcm/vchan.c
813–814

I wouldn't equate a chn_reset() error to the parent being completely unusable. And even if that's true, we take these flags as indicator for vchans, it's inconsistent to unset them without checking for other vchans first. See vchan_destroy() below, maybe we can call that or reuse the code there?

FYI, the panic also happens on close:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 02
instruction pointer     = 0x20:0xffffffff808c8dd0
stack pointer           = 0x28:0xfffffe00d974b600
frame pointer           = 0x28:0xfffffe00d974b630
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1059 (cat)
rdi: fffff800017d5fa0 rsi: 0000000000000008 rdx: deadc0dedeadc20e
rcx: deadc0dedeadc0de  r8: 0000000000000002  r9: ffffffffffffffff
rax: fffff80001843d88 rbx: fffff800017d0800 rbp: fffffe00d974b630
r10: ffffffff81c4a450 r11: 0000000000000000 r12: fffffe00ddd4cc00
r13: fffff80001843d80 r14: 00000000ffffffff r15: fffff800017cf0c0
trap number             = 9
panic: general protection fault
cpuid = 2
time = 1734349393
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00d974b340
vpanic() at vpanic+0x136/frame 0xfffffe00d974b470
panic() at panic+0x43/frame 0xfffffe00d974b4d0
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe00d974b530
calltrap() at calltrap+0x8/frame 0xfffffe00d974b530
--- trap 0x9, rip = 0xffffffff808c8dd0, rsp = 0xfffffe00d974b600, rbp = 0xfffffe00d974b630 ---
chn_trigger() at chn_trigger+0x1b0/frame 0xfffffe00d974b630
chn_flush() at chn_flush+0x54/frame 0xfffffe00d974b650
dsp_close() at dsp_close+0x3d6/frame 0xfffffe00d974b6a0
devfs_destroy_cdevpriv() at devfs_destroy_cdevpriv+0xab/frame 0xfffffe00d974b6c0
devfs_close_f() at devfs_close_f+0x63/frame 0xfffffe00d974b6f0
_fdrop() at _fdrop+0x1b/frame 0xfffffe00d974b710
closef() at closef+0x1e3/frame 0xfffffe00d974b7a0
fdescfree() at fdescfree+0x41e/frame 0xfffffe00d974b860
exit1() at exit1+0x49d/frame 0xfffffe00d974b8d0
sigexit() at sigexit+0x16b/frame 0xfffffe00d974bd50
postsig() at postsig+0x1a5/frame 0xfffffe00d974be20
ast_sig() at ast_sig+0x1bb/frame 0xfffffe00d974bed0
ast_handler() at ast_handler+0xe8/frame 0xfffffe00d974bf10
ast() at ast+0x20/frame 0xfffffe00d974bf30
doreti_ast() at doreti_ast+0x1c/frame 0x3076a6210ba0
KDB: enter: panic
[ thread pid 1059 tid 100269 ]
Stopped at      kdb_enter+0x33: movq    $0,0x10554b2(%rip)

Kernel panics when I play something, stop it, cat /dev/sndstat, and then try to play something again. My tree has D47868, D47917 and D47918 on top of main branch, I suspect it's this one.

Fatal trap 9: general protection fault while in kernel mode
...

Interesting, I am not able to reproduce this. Does this happen with a different player than sox as well? Not sure how this patch could be responsible for this, though. All it does is essentially allocate vchans on open() and delete them on close(), instead of keeping them around after close().

I now isolated the panic to this patch (on top of D47868 which is required to build). It also happens with cat /dev/zero > /dev/dsp instead of sox. The cat /dev/sndstat is not needed to reproduce it. I did a full system build / install with and without this patch, consistent results.

I applied just D47868 and this one and still cannot reproduce the panic... Tried all the tests I have in the test section here, as well as what you mentioned, with both vchans enabled or disabled, including bitperfect. Judging from your backtraces, it seems that the panics most likely occur at the CHN_INSERT_HEAD_SAFE()/CHN_REMOVE_SAFE() calls. Is there something worth noting in your setup/test environment?

I will also work on addressing the issues in the inline comments.

Kernel panics when I play something, stop it, cat /dev/sndstat, and then try to play something again. My tree has D47868, D47917 and D47918 on top of main branch, I suspect it's this one.

Fatal trap 9: general protection fault while in kernel mode
...

Interesting, I am not able to reproduce this. Does this happen with a different player than sox as well? Not sure how this patch could be responsible for this, though. All it does is essentially allocate vchans on open() and delete them on close(), instead of keeping them around after close().

I now isolated the panic to this patch (on top of D47868 which is required to build). It also happens with cat /dev/zero > /dev/dsp instead of sox. The cat /dev/sndstat is not needed to reproduce it. I did a full system build / install with and without this patch, consistent results.

I applied just D47868 and this one and still cannot reproduce the panic... Tried all the tests I have in the test section here, as well as what you mentioned, with both vchans enabled or disabled, including bitperfect. Judging from your backtraces, it seems that the panics most likely occur at the CHN_INSERT_HEAD_SAFE()/CHN_REMOVE_SAFE() calls. Is there something worth noting in your setup/test environment?

I will also work on addressing the issues in the inline comments.

Nothing special, really. Just a bhyve VM with CURRENT, a USB card passthrough for the single snd_uaudio device. No non-default settings, I only use this setup for testing.

I had a look at a core dump, with the playback failing at first attempt. There seems to be a garbage pointer at the head of the channels.pcm.busy list. Obviously CHN_INSERT_HEAD_SAFE() trips over that. The pointer references 0xdeadc0dedeadc0de memory, I suppose that's already freed memory?

Ok, I did some more debugging, my current theory is this:

  1. The first (and only) playback vchan is closed while running, we call vchan_destroy().
  2. vchan_destroy() removes the vchan from its parent's children list.
  3. vchan_destroy() calls chn_kill(), which in turn calls chn_trigger(ABORT).
  4. chn_trigger(ABORT) calls vchan_trigger() through the object method.
  5. vchan_trigger() calls chn_notify() on the parent.
  6. chn_notify() fails with ENODEV because the parent's children list is empty now.
  7. Thus vchan_trigger() also fails, ENODEV.
  8. chn_trigger(ABORT) fails early and doesn't remove the vchan from the channels.pcm.busy list.
  9. chn_kill() frees the vchan data, leaving a stale entry in channels.pcm.busy.
  10. Next occasion to traverse channels.pcm.busy list -> panic!

So a precondition of this panic would be that there's one vchan open which gets closed abruptly, instead of being stopped. This seems to hold in my experiments, as I cannot trigger the panic while keeping a vchan open.

Ok, I did some more debugging, my current theory is this:

  1. The first (and only) playback vchan is closed while running, we call vchan_destroy().
  2. vchan_destroy() removes the vchan from its parent's children list.
  3. vchan_destroy() calls chn_kill(), which in turn calls chn_trigger(ABORT).
  4. chn_trigger(ABORT) calls vchan_trigger() through the object method.
  5. vchan_trigger() calls chn_notify() on the parent.
  6. chn_notify() fails with ENODEV because the parent's children list is empty now.
  7. Thus vchan_trigger() also fails, ENODEV.
  8. chn_trigger(ABORT) fails early and doesn't remove the vchan from the channels.pcm.busy list.
  9. chn_kill() frees the vchan data, leaving a stale entry in channels.pcm.busy.
  10. Next occasion to traverse channels.pcm.busy list -> panic!

So a precondition of this panic would be that there's one vchan open which gets closed abruptly, instead of being stopped. This seems to hold in my experiments, as I cannot trigger the panic while keeping a vchan open.

Ok, tested with chn_notify() returning (0) instead of ENODEV. Whether that's the proper fix is open for discussion, but it is a fix. That was a nice puzzle.

Interesting. Does calling chn_abort() on the vchan in dsp_close() before the call to vchan_destroy() fix this?

Interesting. Does calling chn_abort() on the vchan in dsp_close() before the call to vchan_destroy() fix this?

Probably, but to me that would fix the symptoms, not the cause. I think chn_notify() shouldn't fail with an error there, the primary channel having no children is not a real problem - it just means there is nothing to do (no-one to notify). Also maybe an error from vchan_trigger() should be handled differently in chn_trigger(), regarding the busy list.

Interesting. Does calling chn_abort() on the vchan in dsp_close() before the call to vchan_destroy() fix this?

Probably, but to me that would fix the symptoms, not the cause. I think chn_notify() shouldn't fail with an error there, the primary channel having no children is not a real problem - it just means there is nothing to do (no-one to notify). Also maybe an error from vchan_trigger() should be handled differently in chn_trigger(), regarding the busy list.

Yeah, I agree that chn_notify() should return 0 if the children list is empty. My idea was that it'd make sense to first stop the channel, then destroy it. I think we should go with the chn_notify() change instead, though. D48156

My bad. Do not take this version as the finalized one.

christos marked 3 inline comments as done.

Make sure that in case we choose a primary channel to allocate a vchan, it's
not being used directly.

Tested with:

  • vchans enabled, spawn vchan(s): OK
  • vchans enabled, spawn vchan, disable vchans, spawn chan: OK (ENODEV, we need to close the vchan first)
  • vchans disabled, spawn chan: OK
  • vchans disabled, spawn chan, spawn another chan: OK (EBUSY)
  • vchans disabled, spawn chan, enable vchans, spawn vchan: OK (EBUSY)

Also depend on D48183.

sys/dev/sound/pcm/dsp.c
205–214

Do you think the new diff addresses the issue properly? The bug you mention is fixed.

sys/dev/sound/pcm/vchan.c
813–814
sys/dev/sound/pcm/dsp.c
189–190

Since we do not take vchan enabled / disabled into account here, this may select a primary channel that is incompatible with what we need. While this just defers the error in case of one primary channel per direction, it may prevent choosing another (compatible) one if there are multiple primary channels per direction.

202–203

We probably need this condition in the loop too, I'd suggest a variable like bool vchan_enabled = ... for readability.

430–438

What happens in duplex mode when only the recording dsp_chn_alloc() call fails? Is dsp_close() actually called when dsp_open() fails? And if that's the case, we have an inconsistency between priv->wrch and the channels.pcm.opened list. We call CHN_REMOVE() there...

sys/dev/sound/pcm/vchan.c
813–816

We may want to explicitly set *child = NULL; here, to not leave any dangling pointers, even if we also return an error value.

christos added inline comments.
sys/dev/sound/pcm/dsp.c
189–190

I included the vchan enabled/disabled check here as well, as suggested in your comment below. However, if vchans are disabled for a given direction, would having >1 primary channels in that direction make a difference in the first place?

430–438

So, dsp_close() will be called, but because priv->wrch will not have yet been added to the opened list, the CHN_REMOVE() will most likely cause a problem. I think we could retain the wrch and rdch variables, and assign them to priv->wrch and priv->rdch respectively, once we get past the dsp_chn_alloc()s.

sys/dev/sound/pcm/dsp.c
189–190

Now I'm confused - I only see the c->flags & CHN_F_HAS_VCHAN which checks for the presence of vchans, not whether they are enabled? Or do you mean you included these checks locally but didn't update the diff here yet?

And yes, if vchans are disabled, with a vchan still running on the first playback channel, this first playback channel would always be selected because CHN_F_HAS_VCHAN is set. Despite the second playback channel possibly being free to use, with CHN_F_BUSY not set.

sys/dev/sound/pcm/dsp.c
189–190

Sorry, I meant that I applied the change locally. This should be enough IMHO.

christos marked 4 inline comments as done.

Address all remaining comments.

sys/dev/sound/pcm/dsp.c
194–196

Shouldn't this also break out of the loop when vdir_enabled is not set? I think the logic order is mixed up, did you mean (c->flags & CHN_F_BUSY) == 0 || (vdir_enabled && c->flags & CHN_F_HAS_VCHAN)?

430–443

I think we still have an issue here, when allocation of the playback channel succeeds, but recording channel fails. We're basically "leaking" the playback channel now, with no chance for dsp_close() to clean up the channels because it's lacking info about the allocation.

I would suggest to treat the cases for playback and recording separately, handle allocation success or failure independently per direction. Complete the registration of e.g. playback channel in the open list, even if the subsequent recording channel allocation fails. If the data is consistent, we can let dsp_close() do the cleanup afterwards.

christos marked an inline comment as done.

Address comments.

sys/dev/sound/pcm/dsp.c
194–196

Yes, I noticed this right after I submitted. I meant to write

		if ((c->flags & CHN_F_BUSY) == 0 ||
		    (vdir_enabled && (c->flags & CHN_F_HAS_VCHAN)))
430–443

Good idea.