Page MenuHomeFreeBSD

ifnet: fix use-after-free by ignoring post-detach ifp link events.
AcceptedPublic

Authored by melifaro on Apr 17 2023, 10:49 AM.
Tags
None
Referenced Files
F102075821: D39614.id.diff
Thu, Nov 7, 7:58 AM
Unknown Object (File)
Wed, Nov 6, 4:30 PM
Unknown Object (File)
Tue, Nov 5, 2:16 PM
Unknown Object (File)
Fri, Oct 18, 10:22 AM
Unknown Object (File)
Thu, Oct 17, 4:43 PM
Unknown Object (File)
Thu, Oct 17, 2:17 AM
Unknown Object (File)
Wed, Oct 16, 2:45 PM
Unknown Object (File)
Wed, Oct 16, 10:02 AM

Details

Reviewers
glebius
kp
Group Reviewers
network
Summary

User reported netlink-related panic kern/270813 when unloading if_ena kernel module.

The crash happens when the swi-enqueued link state event is triggered. By the time of execution, the ifp in question and its resources are already freed, triggering the panic.

The following call chain leads to this:

ena_detach()
  ether_ifdetach()
    if_detach()
      if_detach_internal()  # here we drain taskqueue_swi from link events
  ena_destroy_device()
    if_link_state_change()  # here we enqueue new link event

In order to fix the bug, this change proposes ignoring all link state events after ifnet was unlinked. The implementation assumes that the destruction happens in the single thread and avoid lock-based flag checking.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 50954
Build 47845: arc lint + arc unit

Event Timeline

melifaro added reviewers: network, glebius.
melifaro edited the summary of this revision. (Show Details)

I've tested this commit and it solves the issue.
Updated kernel and world to the latest main. => Issue reproduced.
Applied this fix to kernel sources, built and installed kernel => issue doesn't reproduce.

Does that fix the general problem? Can't we end up starting an interface removal after this check and this still end up in do_link_state_change() with a destroyed ifp?

So perhaps we should be looking up the ifp again from the callback, rather than passing the pointer? Something like m_rcvif_serialize()/m_rcvif_restore() dance we do when we enqueue packets.

I do not quite understand this logic:

ena_detach()
  ether_ifdetach()
    if_detach()
      if_detach_internal()  # here we drain taskqueue_swi from link events
  ena_destroy_device()
    if_link_state_change()  # here we enqueue new link event

Why is ether_ifdetach() invoked before if_link_state_change() ???

I don't think the fix is a good one. Why can't we just remove call to if_link_state_change() from ena_destroy_device()? The problem is not even a race, as we are checking the IFF_DYING in the scheduling function, not in the delayed context. I'd rather add assertion that ifp isn't dying in the if_link_state_change().

P.S. At first glance I thought the problem is the race that never happened before but potentially can. See XXXGL comment down below. But it isn't.

In D39614#901740, @kp wrote:

Does that fix the general problem? Can't we end up starting an interface removal after this check and this still end up in do_link_state_change() with a destroyed ifp?

So perhaps we should be looking up the ifp again from the callback, rather than passing the pointer? Something like m_rcvif_serialize()/m_rcvif_restore() dance we do when we enqueue packets.

Well, I guess the answer is "no" for the most general problems I'm thinking of in this domain.
It would be nice to discuss the generalisation, as that's certainly not the only similar thing in the network space.

Personally I see the problem in the lack of clearly-defined KPIs for ifnets: the structures are accessed & modified directly, there are no rules (or even guidelines) on what's possible and what is not.

For now the KPI "sort of" assume we're doing interface control stuff in a single thread, thus mostly avoiding locking for ifnets & only doing locking for the specific things like interface hierarchy.
Should we keep this assumption? Should we do per-ifp sx lock in the KPI? Should we explicitly require locking on the caller side?

The fix to this particular issue depends on the answers of the above (or can be a bandaid as in diff if we want to kick the can down the road).

In D39614#901770, @zlei wrote:

I do not quite understand this logic:

ena_detach()
  ether_ifdetach()
    if_detach()
      if_detach_internal()  # here we drain taskqueue_swi from link events
  ena_destroy_device()
    if_link_state_change()  # here we enqueue new link event

Why is ether_ifdetach() invoked before if_link_state_change() ???

As far as I understand, ena_destroy_device() is called in multiple places, including an init error handler, so it incorporates setting the link down. I think that it would make sense for the driver to move the link state change elsewhere, but I'd prefer to address the stack problem first.

I don't think the fix is a good one. Why can't we just remove call to if_link_state_change() from ena_destroy_device()? The problem is not even a race, as we are checking the IFF_DYING in the scheduling function, not in the delayed context. I'd rather add assertion that ifp isn't dying in the if_link_state_change().

I agree that it's more of a bandaid that a fix - and would love to discuss the proper one. As I wrote in the reply to @kp, I see this more as a side effect of not having clearly defined KPI.
Maybe we don't need to figure out the desired model now fully, and address just this specific issue. For example, by agreeing on the rule that the interface driver must not call the functions { .., if_link_state_change(), .. } after unlinking/detaching the interfaces and indeed add the assert to validate it.

P.S. At first glance I thought the problem is the race that never happened before but potentially can. See XXXGL comment down below. But it isn't.

In D39614#901740, @kp wrote:

Does that fix the general problem? Can't we end up starting an interface removal after this check and this still end up in do_link_state_change() with a destroyed ifp?

So perhaps we should be looking up the ifp again from the callback, rather than passing the pointer? Something like m_rcvif_serialize()/m_rcvif_restore() dance we do when we enqueue packets.

Well, I guess the answer is "no" for the most general problems I'm thinking of in this domain.
It would be nice to discuss the generalisation, as that's certainly not the only similar thing in the network space.

Personally I see the problem in the lack of clearly-defined KPIs for ifnets: the structures are accessed & modified directly, there are no rules (or even guidelines) on what's possible and what is not.

For now the KPI "sort of" assume we're doing interface control stuff in a single thread, thus mostly avoiding locking for ifnets & only doing locking for the specific things like interface hierarchy.
Should we keep this assumption? Should we do per-ifp sx lock in the KPI? Should we explicitly require locking on the caller side?

The fix to this particular issue depends on the answers of the above (or can be a bandaid as in diff if we want to kick the can down the road).

Those are all good questions, but I think in this case our issue is that the ifnet goes away between the task being enqueued and executed. There are a number of other places where we see things like that, and that's the 'general problem' I was referring to.

For that specific issue locking the ifnet isn't going to buy us anything at all.

In D39614#901770, @zlei wrote:

I do not quite understand this logic:

ena_detach()
  ether_ifdetach()
    if_detach()
      if_detach_internal()  # here we drain taskqueue_swi from link events
  ena_destroy_device()
    if_link_state_change()  # here we enqueue new link event

Why is ether_ifdetach() invoked before if_link_state_change() ???

As far as I understand, ena_destroy_device() is called in multiple places, including an init error handler, so it incorporates setting the link down. I think that it would make sense for the driver to move the link state change elsewhere, but I'd prefer to address the stack problem first.

I agree that the call to if_link_state_change() shouldn't happen after ether_ifdetach().
I want to change this in the ena driver.
Looking at many different network drivers I saw that other than mlx5 (from what I found) most drivers don't call if_link_state_change() before ether_ifdetach().
Looking at the implementation and inner calls of both ether_ifdetach() and if_link_state_change() it doesn't seem that ether_ifdetach() does any of the work that if_link_state_change() does.
So my question is this:
Is it necessary to call if_link_state_change() before ether_ifdetach()? because it seems almost no one is doing it.

(looked for answers in git log/blame, mailing lists and forums but couldn't find any)

In D39614#902547, @kp wrote:
In D39614#901740, @kp wrote:

Does that fix the general problem? Can't we end up starting an interface removal after this check and this still end up in do_link_state_change() with a destroyed ifp?

So perhaps we should be looking up the ifp again from the callback, rather than passing the pointer? Something like m_rcvif_serialize()/m_rcvif_restore() dance we do when we enqueue packets.

Well, I guess the answer is "no" for the most general problems I'm thinking of in this domain.
It would be nice to discuss the generalisation, as that's certainly not the only similar thing in the network space.

Personally I see the problem in the lack of clearly-defined KPIs for ifnets: the structures are accessed & modified directly, there are no rules (or even guidelines) on what's possible and what is not.

For now the KPI "sort of" assume we're doing interface control stuff in a single thread, thus mostly avoiding locking for ifnets & only doing locking for the specific things like interface hierarchy.
Should we keep this assumption? Should we do per-ifp sx lock in the KPI? Should we explicitly require locking on the caller side?

The fix to this particular issue depends on the answers of the above (or can be a bandaid as in diff if we want to kick the can down the road).

Those are all good questions, but I think in this case our issue is that the ifnet goes away between the task being enqueued and executed. There are a number of other places where we see things like that, and that's the 'general problem' I was referring to.

Ack! I was looking into it from a different angle. The mental model of "safe detach" to me is a) set the demarcation point that signals "no more data accepted", b) ensure no data is indeed accepted and c) clear the queued data. I'm a bit unsure if we can or should generalise the implementations for the different datapaths and control plane. The current code mostly have everything implemented - IFF_DYING flag is set, marking the end of the era and the taskq is cleaned from the link tasks matching this interface. The remaining part is rejecting new changes.

We can indeed do an MPASS() check in the enqueue code to reject such code patterns if that's what we agree with.

What do you folks think?

For that specific issue locking the ifnet isn't going to buy us anything at all.

In D39614#901770, @zlei wrote:

I do not quite understand this logic:

ena_detach()
  ether_ifdetach()
    if_detach()
      if_detach_internal()  # here we drain taskqueue_swi from link events
  ena_destroy_device()
    if_link_state_change()  # here we enqueue new link event

Why is ether_ifdetach() invoked before if_link_state_change() ???

As far as I understand, ena_destroy_device() is called in multiple places, including an init error handler, so it incorporates setting the link down. I think that it would make sense for the driver to move the link state change elsewhere, but I'd prefer to address the stack problem first.

I agree that the call to if_link_state_change() shouldn't happen after ether_ifdetach().
I want to change this in the ena driver.
Looking at many different network drivers I saw that other than mlx5 (from what I found) most drivers don't call if_link_state_change() before ether_ifdetach().
Looking at the implementation and inner calls of both ether_ifdetach() and if_link_state_change() it doesn't seem that ether_ifdetach() does any of the work that if_link_state_change() does.
So my question is this:
Is it necessary to call if_link_state_change() before ether_ifdetach()? because it seems almost no one is doing it.

I don't think so. Previously we haven't signalled operstate change to the userland. As the interface is anyway going to be destroyed within milliseconds, I don't think it's required.

(looked for answers in git log/blame, mailing lists and forums but couldn't find any)

In D39614#902547, @kp wrote:

The fix to this particular issue depends on the answers of the above (or can be a bandaid as in diff if we want to kick the can down the road).

Those are all good questions, but I think in this case our issue is that the ifnet goes away between the task being enqueued and executed. There are a number of other places where we see things like that, and that's the 'general problem' I was referring to.

Ack! I was looking into it from a different angle. The mental model of "safe detach" to me is a) set the demarcation point that signals "no more data accepted", b) ensure no data is indeed accepted and c) clear the queued data. I'm a bit unsure if we can or should generalise the implementations for the different datapaths and control plane. The current code mostly have everything implemented - IFF_DYING flag is set, marking the end of the era and the taskq is cleaned from the link tasks matching this interface. The remaining part is rejecting new changes.

We can indeed do an MPASS() check in the enqueue code to reject such code patterns if that's what we agree with.

What do you folks think?

So that seems like a sane general approach, but I'm not clear on how it'd work here. We'd have to make sure the ifp sticks around until the task queue no longer has any if_linktasks remaining, and I don't think we have a mechanism for that currently.

In D39614#907626, @kp wrote:
In D39614#902547, @kp wrote:

The fix to this particular issue depends on the answers of the above (or can be a bandaid as in diff if we want to kick the can down the road).

Those are all good questions, but I think in this case our issue is that the ifnet goes away between the task being enqueued and executed. There are a number of other places where we see things like that, and that's the 'general problem' I was referring to.

Ack! I was looking into it from a different angle. The mental model of "safe detach" to me is a) set the demarcation point that signals "no more data accepted", b) ensure no data is indeed accepted and c) clear the queued data. I'm a bit unsure if we can or should generalise the implementations for the different datapaths and control plane. The current code mostly have everything implemented - IFF_DYING flag is set, marking the end of the era and the taskq is cleaned from the link tasks matching this interface. The remaining part is rejecting new changes.

We can indeed do an MPASS() check in the enqueue code to reject such code patterns if that's what we agree with.

What do you folks think?

So that seems like a sane general approach, but I'm not clear on how it'd work here. We'd have to make sure the ifp sticks around until the task queue no longer has any if_linktasks remaining, and I don't think we have a mechanism for that currently.

I think we have something in if.c:if_detach_internal() for that ( https://cgit.freebsd.org/src/tree/sys/net/if.c#n1145 ):

	/*
	 * In any case (destroy or vmove) detach us from the groups
	 * and remove/wait for pending events on the taskq.
	 * XXX-BZ in theory an interface could still enqueue a taskq change?
	 */
	if_delgroups(ifp);

	taskqueue_drain(taskqueue_swi, &ifp->if_linktask);
	taskqueue_drain(taskqueue_swi, &ifp->if_addmultitask);

So I still believe that the remaining part is to reject the new changes. Happy to discuss it more :-)
And anyway, @akiyano_amazon.com, I guess we need the driver to be updated to avoid setting the state when dying. Could you please look at that?

This revision is now accepted and ready to land.Jun 6 2023, 9:07 AM
In D39614#907626, @kp wrote:
In D39614#902547, @kp wrote:

The fix to this particular issue depends on the answers of the above (or can be a bandaid as in diff if we want to kick the can down the road).

Those are all good questions, but I think in this case our issue is that the ifnet goes away between the task being enqueued and executed. There are a number of other places where we see things like that, and that's the 'general problem' I was referring to.

Ack! I was looking into it from a different angle. The mental model of "safe detach" to me is a) set the demarcation point that signals "no more data accepted", b) ensure no data is indeed accepted and c) clear the queued data. I'm a bit unsure if we can or should generalise the implementations for the different datapaths and control plane. The current code mostly have everything implemented - IFF_DYING flag is set, marking the end of the era and the taskq is cleaned from the link tasks matching this interface. The remaining part is rejecting new changes.

We can indeed do an MPASS() check in the enqueue code to reject such code patterns if that's what we agree with.

What do you folks think?

So that seems like a sane general approach, but I'm not clear on how it'd work here. We'd have to make sure the ifp sticks around until the task queue no longer has any if_linktasks remaining, and I don't think we have a mechanism for that currently.

I think we have something in if.c:if_detach_internal() for that ( https://cgit.freebsd.org/src/tree/sys/net/if.c#n1145 ):

	/*
	 * In any case (destroy or vmove) detach us from the groups
	 * and remove/wait for pending events on the taskq.
	 * XXX-BZ in theory an interface could still enqueue a taskq change?
	 */
	if_delgroups(ifp);

	taskqueue_drain(taskqueue_swi, &ifp->if_linktask);
	taskqueue_drain(taskqueue_swi, &ifp->if_addmultitask);

So I still believe that the remaining part is to reject the new changes. Happy to discuss it more :-)
And anyway, @akiyano_amazon.com, I guess we need the driver to be updated to avoid setting the state when dying. Could you please look at that?

Yes we will do that fix in the driver for our next release of the driver.