Page MenuHomeFreeBSD

busdma: Avoid leaking bounce pages when destroy DMA tags
AcceptedPublic

Authored by markj on Nov 11 2024, 10:44 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Dec 30, 10:20 PM
Unknown Object (File)
Thu, Dec 12, 6:08 PM
Unknown Object (File)
Nov 23 2024, 9:38 AM
Unknown Object (File)
Nov 18 2024, 3:47 AM
Unknown Object (File)
Nov 17 2024, 2:35 PM
Unknown Object (File)
Nov 15 2024, 12:09 PM
Unknown Object (File)
Nov 14 2024, 3:20 PM
Unknown Object (File)
Nov 12 2024, 2:36 PM

Details

Summary

busdma tags may have an attached "bounce zone" which manages bounce page
allocations. However, it does not get cleaned up when the owning busdma
tag is destroyed. Fix the leak:

  • Add a destroy_bounce_zone() implementation and modify MD busdma code to use it.
  • Make sure that bounce_zone_list accesses are locked by the bounce mutex, previously there was no explicit serialization (though the bus_topo lock probably made this relatively safe).
  • Add refcounting to bounce zones so that busdma tags can safely share them.

PR: 278569
MFC after: 1 month

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 60577
Build 57461: arc lint + arc unit

Event Timeline

This revision is now accepted and ready to land.Nov 12 2024, 12:06 AM

Thanks for handling this one. I had looked at it previously, and the one thing I was uncertain about was the interaction with busdma_thread()/deferred loads. It appears to me there is one unlocked reference to bz->deferred_time which could race with bus_dmamap_destroy(), although it would be exceedingly rare. Can you comment on this?

Thanks for handling this one. I had looked at it previously, and the one thing I was uncertain about was the interaction with busdma_thread()/deferred loads. It appears to me there is one unlocked reference to bz->deferred_time which could race with bus_dmamap_destroy(), although it would be exceedingly rare. Can you comment on this?

I think this race is not new with this patch. That loop can also race with bus_dmamap_destroy() and bus_dma_tag_destroy().

More generally, I believe it's only a problem if consumers misuse the busdma interface. A map gets added to the bounce zone's bounce_map_waitinglist when a bus_dmamap_load() fails to reserve enough bounce pages (presumably because another consumer is using them), and BUS_DMA_WAITOK was specified. Later, when a bus_dmamap_unload() call releases some bounce pages owned by the same zone, the busdma layer can try again to reserve pages; this happens in the busdma thread so that the callback is executed in a safe context.

To trigger the race, a consumer would have to call bus_dmamap_load(callback, BUS_DMA_WAITOK), get EINPROGRESS, and then destroy the map and tag (releasing the last reference on the bounce zone) before the callback is invoked. I'd argue that it's the consumer's responsibility to make sure that this doesn't happen, though the documentation doesn't explicitly state this. If so, I believe there is no bug here.

Thanks for handling this one. I had looked at it previously, and the one thing I was uncertain about was the interaction with busdma_thread()/deferred loads. It appears to me there is one unlocked reference to bz->deferred_time which could race with bus_dmamap_destroy(), although it would be exceedingly rare. Can you comment on this?

I think this race is not new with this patch. That loop can also race with bus_dmamap_destroy() and bus_dma_tag_destroy().

More generally, I believe it's only a problem if consumers misuse the busdma interface. A map gets added to the bounce zone's bounce_map_waitinglist when a bus_dmamap_load() fails to reserve enough bounce pages (presumably because another consumer is using them), and BUS_DMA_WAITOK was specified. Later, when a bus_dmamap_unload() call releases some bounce pages owned by the same zone, the busdma layer can try again to reserve pages; this happens in the busdma thread so that the callback is executed in a safe context.

To trigger the race, a consumer would have to call bus_dmamap_load(callback, BUS_DMA_WAITOK), get EINPROGRESS, and then destroy the map and tag (releasing the last reference on the bounce zone) before the callback is invoked. I'd argue that it's the consumer's responsibility to make sure that this doesn't happen, though the documentation doesn't explicitly state this. If so, I believe there is no bug here.

Agreed; thank you!

FWIW, I do think it was the intended design originally to never destroy these things as the thought was that you wouldn't be unloading the drivers that used them if your system used bounce pages at all. That assumption may no longer hold though and this seems fine to me.

sys/kern/subr_busdma_bounce.c
209–212

This might now start multiple kthreads which would be a problem (I think). That is, you might destroy the all the tags referencing a zone drain this list back to empty, then create a new tag that needs a zone and you will create another kthread. Instead, this probably needs another sentinel, maybe just a global static bool kthread_started; or the like.

220–221

Note that in this case we don't actually clean up the zone but leave it lying around, and we also fail to start the kthread if needed. We'd be better off just panicking here I think.

290

If you do the STAILQ_REMOVE from bounce_zone_list first you can drop the lock sooner. It isn't really needed for any of this other cleanup you are doing.

sys/kern/subr_busdma_bounce.c
176

This was racey before in that it could create multiple zones with the same constraints.

577

Given the lock here, I guess it's ok if there are multiple kthreads, but kind of pointless to have them.

markj marked 3 inline comments as done.

Handle John's comments.

This revision now requires review to proceed.Nov 15 2024, 2:40 AM
This revision is now accepted and ready to land.Nov 15 2024, 5:02 PM
sys/x86/x86/busdma_bounce.c
332

So, it turns out that this patch doesn't fully fix the problem of leaking bounce pages. The root of the problem is here: every time a map is created, we preallocate at least one page for its use, no matter how many free pages are already in the zone. In particular, we don't know how many other maps/tags are sharing the zone, so we don't know exactly how many pages we might actually need.

When destroying the map/tag, we don't free any pages from the zone, so a loop which creates a tag+map requiring bounce pages, and then destroys them without destroying the zone will leak pages.

In particular, my patch fixes the reproducer from bugzilla PR 278569, but the reproducer could be modified slightly to reintroduce the leak (e.g., by having a second busdma tag which holds a reference on the bounce zone).

I wonder if the solution might be to simply get rid of this MIN_ALLOC_COMP mechanism. In particular, this flag is stored in the busdma tag, which might have many DMA maps associated with it, so it only applies to the first map. Do we really need it?

sys/x86/x86/busdma_bounce.c
332

Hmmm, I think it's the pages = MAX(pages, 1) that gets you into trouble?

It looks like the purpose of the BUS_DMA_MIN_ALLOC_COMP flag is that we avoid doing the allocation for the first map if we allocated bounce pages when creating the tag via BUS_DMA_ALLOC_NOW to avoid double allocating for the first map.

I wonder if the || really should &&. The current || is due to a revert:

commit eae22c443014034ab08894be5cb03940869ec977
Author: Svatopluk Kraus <skra@FreeBSD.org>
Date:   Mon Nov 23 11:19:00 2015 +0000

    Revert r291142.
    
    The not quite consistent logic for bounce pages allocation is utilizited
    by re(4) interface which can hang now.
    
    Approved by:    kib (mentor)

Notes:
    svn path=/head/; revision=291193

That reverted commit seems to be fixing this exact edge case, and I'm not sure what it needs to be reverted TBH:

commit 6fa7734d6fbbec1e34bfee33427969ac9a92ff80
Author: Svatopluk Kraus <skra@FreeBSD.org>
Date:   Sat Nov 21 19:55:01 2015 +0000

    Fix BUS_DMA_MIN_ALLOC_COMP flag logic. When bus_dmamap_t map is being
    created for bus_dma_tag_t tag, bounce pages should be allocated
    only if needed.
    
    Before the fix, they were allocated always if BUS_DMA_COULD_BOUNCE flag
    was set but BUS_DMA_MIN_ALLOC_COMP not. As bounce pages are never freed,
    it could cause memory exhaustion when a lot of such tags together with
    their maps were created.
    
    Note that there could be more maps in one tag by current design.
    However BUS_DMA_MIN_ALLOC_COMP flag is tag's flag. It's set after
    bounce pages are allocated. Thus, they are allocated only for first
    tag's map which needs them.
    
    Approved by:    kib (mentor)

Notes:
    svn path=/head/; revision=291142

I wonder what is up with re(4). (And why does a network driver need bounce pages!!!!)

If re(4)'s issue is that it is using bounce pages for RX buffers and that it needs more bounce pages, then I think it is too broken to use bus_dma for its RX buffers. Instead, it just needs to break down and use contigmalloc or something and copy packets into mbufs that it passes up the network stack. Bounce buffers for RX mbufs is the same thing in terms of the cost of memory copies and memory overhead, but round tripping through bus_dma to achieve that would be terrible overhead. I can't really imagine that re(4) is so broken though? It would be good to understand the case where re(4) needs bounce pages. If it is for transmit, then maybe we just need a way for re(4) to explicitly ask for at least one bounce page per map via a new flag on the tag rather than breaking all tags for everyone else.

sys/x86/x86/busdma_bounce.c
332

Yes, it's pages = MAX(pages, 1) that causes problems, sorry for not being clear.

And, apparently it was a bug report from me that caused the revert. I had a system with two re interfaces but different devices. The problematic one was a RTL8169, which of course has since died. Looking at if_re.c, I suspect that the difference is that for non-PCIe devices, the driver sets lowaddr = BUS_SPACE_MAXADDR_32BIT for its root busdma tag, so it'd have BUS_DMA_COULD_BOUNCE set. (And, this particular system had 8 or 16GB of RAM so might indeed end up bouncing.)

After the revert, skra@ sent me a follow-up patch which apparently fixed the problem but never got committed for some reason. The condition for deciding whether to allocate bounce pages effectively becomes:

if ((dmat->bounce_flags & BUS_DMA_MIN_ALLOC_COMP) == 0 ||
     (bz->map_count > 0 && bz->total_bpages < maxpages))

and I believe that would also fix the leak.

sys/x86/x86/busdma_bounce.c
332

Isn't the code you quoted above the current code? Did you paste the wrong thing from the follow-up patch?

sys/x86/x86/busdma_bounce.c
332

Sigh, yes, sorry. It should be:

if (((dmat->bounce_flags & BUS_DMA_MIN_ALLOC_COMP) == 0 ||
     bz->map_count > 0) && bz->total_bpages < maxpages)
sys/x86/x86/busdma_bounce.c
332

So the effect is to always honor bz->total_pages < maxpages? It also seems like this means you don't need the pages = MAX(pages, 1) line after that since pages = MIN(maxpages - bz->total_bpages, pages); will now always ensure pages is non-zero.

sys/x86/x86/busdma_bounce.c
332

Yes. I think I see why the commit was reverted: it changed the check for "do I reserve bounce pages?" to:

if ((dmat->bounce_flags & BUS_DMA_MIN_ALLOC_COMP) == 0 &&
    bz->map_count > 0 && bz->total_bpages < maxpages)

but that's too strict. If we didn't prealloc bounce pages at tag creation time, and if this is the first busdma mapping attached to the bounce zone, then we won't reserve pages since bz->map_count == 0.

I posted https://reviews.freebsd.org/D48182 to fix this.

342

Yet another bug: this increment needs to be atomic, but there's nothing serializing it aside from the bus topology lock (assuming that busdma maps are created from DEVICE_ATTACH, which is usually but not always true).