linuxkpi: races between linux_queue_delayed_work_on() and linux_cancel_delayed_work_sync()
ClosedPublic
Actions

Authored by kib on Nov 4 2023, 4:56 PM.

Details

Reviewers

Commits

rG96cb1d70008e: linuxkpi linux_work: use 'true' instead of 'non-zero'
rG05fe82455f26: linuxkpi: races between linux_queue_delayed_work_on() and…

Summary

1. Suppose that linux_queue_delayed_work_on() is called with
   non-zero delay and found the work.state WORK_ST_IDLE. It
   resets the state to WORK_ST_TIMER and locks timer.mtx. Now, if
   linux_cancel_delayed_work_sync() was also called meantime, read
   state as WORK_ST_TIMER and already taken the mutex, it is executing
   callout_stop() on non-armed callout. Then linux_queue_delayed_work_on()
   continues and schedules callout.  But the return value from cancel() is
   false, making it possible to the requeue from callback to slip in.

2. If linux_cancel_delayed_work_sync() returned true, we need to cancel
   again.  The requeue from callback could have revived the work.

The end result is that we schedule callout that might be freed, since
cancel_delayed_work_sync() claims that everything was stopped.  This
contradicts the way the KPI is used in Linux, where consumers expect
that cancel_delayed_work_sync() is reliable on its own.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

kib created this revision.Nov 4 2023, 4:56 PM

Herald added subscribers: linuxkpi, emaste, imp. · View Herald TranscriptNov 4 2023, 4:56 PM

kib requested review of this revision.Nov 4 2023, 4:56 PM

bz added a subscriber: bz.Nov 5 2023, 7:22 PM

bz added inline comments.

sys/compat/linuxkpi/common/src/linux_work.c
475	I would call that lkpi_cancel_...internal(); lkpi_ for it's an internal function and spelt out _internal.
529	is this simpler? res = false; while (linux_cancel_delayed_work_sync_int(dwork)) res = true;

kib marked an inline comment as done.Nov 5 2023, 7:38 PM

kib added inline comments.

sys/compat/linuxkpi/common/src/linux_work.c
475	All functions in this file are prefixed with _linux, even static. I can use _internal, _int is just shorter.

Write loop simpler.

My suggestions are for pre-existing nits, since I had to read the code anyway. Please feel free to ignore them.

If linux_cancel_delayed_work_sync() returned true, we need to cancel again. The requeue from callback could have revived the work.

Assuming that the first bug is fixed, how can this happen? taskqueue_drain() should only return once the task has finished executing and it's no longer pending. But, if the task queued itself with a timeout, it'll be the callout that's pending, not the task. Flipping the order of callout_drain() and taskqueue_drain() calls does not solve the problem either. Ok.