Paths

Table of Contentst

ifnet: fix use-after-free by ignoring post-detach ifp link events.
AcceptedPublic
Actions

Authored by melifaro on Apr 17 2023, 10:49 AM.

Details

Reviewers

glebius
kp

Group Reviewers

network

Summary

User reported netlink-related panic kern/270813 when unloading if_ena kernel module.

The crash happens when the swi-enqueued link state event is triggered. By the time of execution, the ifp in question and its resources are already freed, triggering the panic.

The following call chain leads to this:

ena_detach()
  ether_ifdetach()
    if_detach()
      if_detach_internal()  # here we drain taskqueue_swi from link events
  ena_destroy_device()
    if_link_state_change()  # here we enqueue new link event

In order to fix the bug, this change proposes ignoring all link state events after ifnet was unlinked. The implementation assumes that the destruction happens in the single thread and avoid lock-based flag checking.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 50954
Build 47845: arc lint + arc unit

Event Timeline

melifaro created this revision.Apr 17 2023, 10:49 AM

Herald added subscribers: glebius, ae, imp. · View Herald TranscriptApr 17 2023, 10:49 AM

melifaro requested review of this revision.Apr 17 2023, 10:49 AM

Harbormaster completed remote builds in B50954: Diff 120451.Apr 17 2023, 10:49 AM

melifaro edited the summary of this revision. (Show Details)Apr 17 2023, 10:58 AM

melifaro added reviewers: network, glebius.

melifaro edited the summary of this revision. (Show Details)

I've tested this commit and it solves the issue.
Updated kernel and world to the latest main. => Issue reproduced.
Applied this fix to kernel sources, built and installed kernel => issue doesn't reproduce.

Does that fix the general problem? Can't we end up starting an interface removal after this check and this still end up in do_link_state_change() with a destroyed ifp?

So perhaps we should be looking up the ifp again from the callback, rather than passing the pointer? Something like m_rcvif_serialize()/m_rcvif_restore() dance we do when we enqueue packets.

I do not quite understand this logic:

ena_detach()
  ether_ifdetach()
    if_detach()
      if_detach_internal()  # here we drain taskqueue_swi from link events
  ena_destroy_device()
    if_link_state_change()  # here we enqueue new link event

Why is ether_ifdetach() invoked before if_link_state_change() ???

I don't think the fix is a good one. Why can't we just remove call to if_link_state_change() from ena_destroy_device()? The problem is not even a race, as we are checking the IFF_DYING in the scheduling function, not in the delayed context. I'd rather add assertion that ifp isn't dying in the if_link_state_change().

P.S. At first glance I thought the problem is the race that never happened before but potentially can. See XXXGL comment down below. But it isn't.

In D39614#901740, @kp wrote:

Does that fix the general problem? Can't we end up starting an interface removal after this check and this still end up in do_link_state_change() with a destroyed ifp?

So perhaps we should be looking up the ifp again from the callback, rather than passing the pointer? Something like m_rcvif_serialize()/m_rcvif_restore() dance we do when we enqueue packets.

Well, I guess the answer is "no" for the most general problems I'm thinking of in this domain.
It would be nice to discuss the generalisation, as that's certainly not the only similar thing in the network space.

Personally I see the problem in the lack of clearly-defined KPIs for ifnets: the structures are accessed & modified directly, there are no rules (or even guidelines) on what's possible and what is not.

For now the KPI "sort of" assume we're doing interface control stuff in a single thread, thus mostly avoiding locking for ifnets & only doing locking for the specific things like interface hierarchy.
Should we keep this assumption? Should we do per-ifp sx lock in the KPI? Should we explicitly require locking on the caller side?

The fix to this particular issue depends on the answers of the above (or can be a bandaid as in diff if we want to kick the can down the road).

In D39614#901770, @zlei wrote:
I do not quite understand this logic:
ena_detach()
  ether_ifdetach()
    if_detach()
      if_detach_internal()  # here we drain taskqueue_swi from link events
  ena_destroy_device()
    if_link_state_change()  # here we enqueue new link event
Why is ether_ifdetach() invoked before if_link_state_change() ???

As far as I understand, ena_destroy_device() is called in multiple places, including an init error handler, so it incorporates setting the link down. I think that it would make sense for the driver to move the link state change elsewhere, but I'd prefer to address the stack problem first.

In D39614#901815, @glebius wrote:

I don't think the fix is a good one. Why can't we just remove call to if_link_state_change() from ena_destroy_device()? The problem is not even a race, as we are checking the IFF_DYING in the scheduling function, not in the delayed context. I'd rather add assertion that ifp isn't dying in the if_link_state_change().

I agree that it's more of a bandaid that a fix - and would love to discuss the proper one. As I wrote in the reply to @kp, I see this more as a side effect of not having clearly defined KPI.
Maybe we don't need to figure out the desired model now fully, and address just this specific issue. For example, by agreeing on the rule that the interface driver must not call the functions { .., if_link_state_change(), .. } after unlinking/detaching the interfaces and indeed add the assert to validate it.

P.S. At first glance I thought the problem is the race that never happened before but potentially can. See XXXGL comment down below. But it isn't.

In D39614#902532, @melifaro wrote:

In D39614#901740, @kp wrote:

Does that fix the general problem? Can't we end up starting an interface removal after this check and this still end up in do_link_state_change() with a destroyed ifp?

So perhaps we should be looking up the ifp again from the callback, rather than passing the pointer? Something like m_rcvif_serialize()/m_rcvif_restore() dance we do when we enqueue packets.

Well, I guess the answer is "no" for the most general problems I'm thinking of in this domain.
It would be nice to discuss the generalisation, as that's certainly not the only similar thing in the network space.

Personally I see the problem in the lack of clearly-defined KPIs for ifnets: the structures are accessed & modified directly, there are no rules (or even guidelines) on what's possible and what is not.

For now the KPI "sort of" assume we're doing interface control stuff in a single thread, thus mostly avoiding locking for ifnets & only doing locking for the specific things like interface hierarchy.
Should we keep this assumption? Should we do per-ifp sx lock in the KPI? Should we explicitly require locking on the caller side?

The fix to this particular issue depends on the answers of the above (or can be a bandaid as in diff if we want to kick the can down the road).

Those are all good questions, but I think in this case our issue is that the ifnet goes away between the task being enqueued and executed. There are a number of other places where we see things like that, and that's the 'general problem' I was referring to.

For that specific issue locking the ifnet isn't going to buy us anything at all.

In D39614#902533, @melifaro wrote:
In D39614#901770, @zlei wrote:
I do not quite understand this logic:
ena_detach()
  ether_ifdetach()
    if_detach()
      if_detach_internal()  # here we drain taskqueue_swi from link events
  ena_destroy_device()
    if_link_state_change()  # here we enqueue new link event
Why is ether_ifdetach() invoked before if_link_state_change() ???
As far as I understand, ena_destroy_device() is called in multiple places, including an init error handler, so it incorporates setting the link down. I think that it would make sense for the driver to move the link state change elsewhere, but I'd prefer to address the stack problem first.

I agree that the call to if_link_state_change() shouldn't happen after ether_ifdetach().
I want to change this in the ena driver.
Looking at many different network drivers I saw that other than mlx5 (from what I found) most drivers don't call if_link_state_change() before ether_ifdetach().
Looking at the implementation and inner calls of both ether_ifdetach() and if_link_state_change() it doesn't seem that ether_ifdetach() does any of the work that if_link_state_change() does.
So my question is this:
Is it necessary to call if_link_state_change() before ether_ifdetach()? because it seems almost no one is doing it.

(looked for answers in git log/blame, mailing lists and forums but couldn't find any)

zlei mentioned this in D38955: al_eth: Finish conversion to IfAPI.Apr 25 2023, 3:30 AM

In D39614#902547, @kp wrote:

In D39614#902532, @melifaro wrote:

In D39614#901740, @kp wrote:

Does that fix the general problem? Can't we end up starting an interface removal after this check and this still end up in do_link_state_change() with a destroyed ifp?

So perhaps we should be looking up the ifp again from the callback, rather than passing the pointer? Something like m_rcvif_serialize()/m_rcvif_restore() dance we do when we enqueue packets.

Well, I guess the answer is "no" for the most general problems I'm thinking of in this domain.
It would be nice to discuss the generalisation, as that's certainly not the only similar thing in the network space.

Personally I see the problem in the lack of clearly-defined KPIs for ifnets: the structures are accessed & modified directly, there are no rules (or even guidelines) on what's possible and what is not.

For now the KPI "sort of" assume we're doing interface control stuff in a single thread, thus mostly avoiding locking for ifnets & only doing locking for the specific things like interface hierarchy.
Should we keep this assumption? Should we do per-ifp sx lock in the KPI? Should we explicitly require locking on the caller side?

The fix to this particular issue depends on the answers of the above (or can be a bandaid as in diff if we want to kick the can down the road).

Those are all good questions, but I think in this case our issue is that the ifnet goes away between the task being enqueued and executed. There are a number of other places where we see things like that, and that's the 'general problem' I was referring to.

Ack! I was looking into it from a different angle. The mental model of "safe detach" to me is a) set the demarcation point that signals "no more data accepted", b) ensure no data is indeed accepted and c) clear the queued data. I'm a bit unsure if we can or should generalise the implementations for the different datapaths and control plane. The current code mostly have everything implemented - IFF_DYING flag is set, marking the end of the era and the taskq is cleaned from the link tasks matching this interface. The remaining part is rejecting new changes.

We can indeed do an MPASS() check in the enqueue code to reject such code patterns if that's what we agree with.

What do you folks think?

For that specific issue locking the ifnet isn't going to buy us anything at all.

In D39614#903233, @akiyano_amazon.com wrote:
In D39614#902533, @melifaro wrote:
In D39614#901770, @zlei wrote:
I do not quite understand this logic:
ena_detach()
  ether_ifdetach()
    if_detach()
      if_detach_internal()  # here we drain taskqueue_swi from link events
  ena_destroy_device()
    if_link_state_change()  # here we enqueue new link event
Why is ether_ifdetach() invoked before if_link_state_change() ???
As far as I understand, ena_destroy_device() is called in multiple places, including an init error handler, so it incorporates setting the link down. I think that it would make sense for the driver to move the link state change elsewhere, but I'd prefer to address the stack problem first.
I agree that the call to if_link_state_change() shouldn't happen after ether_ifdetach().
I want to change this in the ena driver.
Looking at many different network drivers I saw that other than mlx5 (from what I found) most drivers don't call if_link_state_change() before ether_ifdetach().
Looking at the implementation and inner calls of both ether_ifdetach() and if_link_state_change() it doesn't seem that ether_ifdetach() does any of the work that if_link_state_change() does.
So my question is this:
Is it necessary to call if_link_state_change() before ether_ifdetach()? because it seems almost no one is doing it.

I don't think so. Previously we haven't signalled operstate change to the userland. As the interface is anyway going to be destroyed within milliseconds, I don't think it's required.

(looked for answers in git log/blame, mailing lists and forums but couldn't find any)

In D39614#907531, @melifaro wrote:

In D39614#902547, @kp wrote:

The fix to this particular issue depends on the answers of the above (or can be a bandaid as in diff if we want to kick the can down the road).

Those are all good questions, but I think in this case our issue is that the ifnet goes away between the task being enqueued and executed. There are a number of other places where we see things like that, and that's the 'general problem' I was referring to.

Ack! I was looking into it from a different angle. The mental model of "safe detach" to me is a) set the demarcation point that signals "no more data accepted", b) ensure no data is indeed accepted and c) clear the queued data. I'm a bit unsure if we can or should generalise the implementations for the different datapaths and control plane. The current code mostly have everything implemented - IFF_DYING flag is set, marking the end of the era and the taskq is cleaned from the link tasks matching this interface. The remaining part is rejecting new changes.

We can indeed do an MPASS() check in the enqueue code to reject such code patterns if that's what we agree with.

What do you folks think?

So that seems like a sane general approach, but I'm not clear on how it'd work here. We'd have to make sure the ifp sticks around until the task queue no longer has any if_linktasks remaining, and I don't think we have a mechanism for that currently.

emaste added a subscriber: emaste.May 26 2023, 1:36 AM

In D39614#907626, @kp wrote:

In D39614#907531, @melifaro wrote:

In D39614#902547, @kp wrote:

The fix to this particular issue depends on the answers of the above (or can be a bandaid as in diff if we want to kick the can down the road).

Those are all good questions, but I think in this case our issue is that the ifnet goes away between the task being enqueued and executed. There are a number of other places where we see things like that, and that's the 'general problem' I was referring to.

Ack! I was looking into it from a different angle. The mental model of "safe detach" to me is a) set the demarcation point that signals "no more data accepted", b) ensure no data is indeed accepted and c) clear the queued data. I'm a bit unsure if we can or should generalise the implementations for the different datapaths and control plane. The current code mostly have everything implemented - IFF_DYING flag is set, marking the end of the era and the taskq is cleaned from the link tasks matching this interface. The remaining part is rejecting new changes.

We can indeed do an MPASS() check in the enqueue code to reject such code patterns if that's what we agree with.

What do you folks think?

So that seems like a sane general approach, but I'm not clear on how it'd work here. We'd have to make sure the ifp sticks around until the task queue no longer has any if_linktasks remaining, and I don't think we have a mechanism for that currently.

I think we have something in if.c:if_detach_internal() for that ( https://cgit.freebsd.org/src/tree/sys/net/if.c#n1145 ):

	/*
	 * In any case (destroy or vmove) detach us from the groups
	 * and remove/wait for pending events on the taskq.
	 * XXX-BZ in theory an interface could still enqueue a taskq change?
	 */
	if_delgroups(ifp);

	taskqueue_drain(taskqueue_swi, &ifp->if_linktask);
	taskqueue_drain(taskqueue_swi, &ifp->if_addmultitask);

So I still believe that the remaining part is to reject the new changes. Happy to discuss it more :-)
And anyway, @akiyano_amazon.com, I guess we need the driver to be updated to avoid setting the state when dying. Could you please look at that?

kp accepted this revision as: kp.Jun 6 2023, 9:07 AM

This revision is now accepted and ready to land.Jun 6 2023, 9:07 AM

In D39614#919546, @melifaro wrote:
In D39614#907626, @kp wrote:

In D39614#907531, @melifaro wrote:

In D39614#902547, @kp wrote:

The fix to this particular issue depends on the answers of the above (or can be a bandaid as in diff if we want to kick the can down the road).

Those are all good questions, but I think in this case our issue is that the ifnet goes away between the task being enqueued and executed. There are a number of other places where we see things like that, and that's the 'general problem' I was referring to.

Ack! I was looking into it from a different angle. The mental model of "safe detach" to me is a) set the demarcation point that signals "no more data accepted", b) ensure no data is indeed accepted and c) clear the queued data. I'm a bit unsure if we can or should generalise the implementations for the different datapaths and control plane. The current code mostly have everything implemented - IFF_DYING flag is set, marking the end of the era and the taskq is cleaned from the link tasks matching this interface. The remaining part is rejecting new changes.

We can indeed do an MPASS() check in the enqueue code to reject such code patterns if that's what we agree with.

What do you folks think?

So that seems like a sane general approach, but I'm not clear on how it'd work here. We'd have to make sure the ifp sticks around until the task queue no longer has any if_linktasks remaining, and I don't think we have a mechanism for that currently.

I think we have something in if.c:if_detach_internal() for that ( https://cgit.freebsd.org/src/tree/sys/net/if.c#n1145 ):
	/*
	 * In any case (destroy or vmove) detach us from the groups
	 * and remove/wait for pending events on the taskq.
	 * XXX-BZ in theory an interface could still enqueue a taskq change?
	 */
	if_delgroups(ifp);

	taskqueue_drain(taskqueue_swi, &ifp->if_linktask);
	taskqueue_drain(taskqueue_swi, &ifp->if_addmultitask);
So I still believe that the remaining part is to reject the new changes. Happy to discuss it more :-)
And anyway, @akiyano_amazon.com, I guess we need the driver to be updated to avoid setting the state when dying. Could you please look at that?