This started as an exploration of the reasons for the r361967/r362910 assertion failures. Peter Holm was then able to narrow the problem down to a very simple reproduction with timeout(1), which pointed to the big issue with reaping.
There seem to be a lot of problems with the calculation of pg_jobc, which directs SIGHUP/SIGCONT delivery for orphaned process groups:
- Re-calculation of the orphaned status for the children of an exiting parent was wrong, but this went mostly unnoticed while all children were reparented to init(8). When a child can be reparented to a different process, which could affect the child's job control state, this was not properly accounted for in pg_jobc.
- The lockless check of the exiting process' parent process group is racy, because nothing prevents the parent from changing its group membership.
- An exited process is left in its process group until it is waited for. This affects other calculations of pg_jobc.
Split the handling of job control status between a process changing its process group and a process exiting. Calculate increments and decrements for pg_jobc by checking the orphaned status exactly, instead of assuming process group membership for children and parent. Move the call to killjobc() later, under the proctree_lock. Mark the exiting process in killjobc() with a new flag, P_TREE_GRPEXITED, and skip it in all pg_jobc calculations after the flag is set.
Add a checker that independently recalculates the pg_jobc value and compares it with the memoized process group state. This is enabled under INVARIANTS.
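
As a rough illustration of the recalculate-and-compare idea, here is a minimal user-space sketch in C. It is not the kernel code: struct model_proc, model_isjobproc(), model_recalc_jobc() and model_check_jobc() are hypothetical names, the process table is a flat array, and there is no locking. It assumes that a member counts toward its group's job control count when its parent is in a different process group of the same session, and that members carrying the exited mark (standing in for P_TREE_GRPEXITED) are skipped. How an exited or reaped parent is resolved is deliberately not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a process; not the kernel's struct proc. */
struct model_proc {
	int pgid;	/* process group id */
	int sid;	/* session id */
	int parent;	/* index of the parent in the table, -1 if none */
	int grpexited;	/* stands in for P_TREE_GRPEXITED */
};

/*
 * Assumed rule: a parent qualifies a group for job control when it is
 * in a different process group of the same session.
 */
static int
model_isjobproc(const struct model_proc *pp, int pgid, int sid)
{
	return (pp->pgid != pgid && pp->sid == sid);
}

/* Recount the job control value for one group from scratch. */
static int
model_recalc_jobc(const struct model_proc *tab, size_t n, int pgid)
{
	int cnt = 0;

	for (size_t i = 0; i < n; i++) {
		const struct model_proc *q = &tab[i];

		if (q->pgid != pgid)
			continue;
		/* Skip members already marked as exited, or parentless. */
		if (q->grpexited || q->parent < 0)
			continue;
		if (model_isjobproc(&tab[q->parent], q->pgid, q->sid))
			cnt++;
	}
	return (cnt);
}

/* Compare the fresh count with the memoized one, KASSERT-style. */
static void
model_check_jobc(const struct model_proc *tab, size_t n, int pgid,
    int memoized)
{
	assert(model_recalc_jobc(tab, n, pgid) == memoized);
}
```

In this model, a caller would invoke model_check_jobc() after every simulated group change or exit, passing the count it maintains incrementally; any drift between the incremental updates and the from-scratch recount then trips the assertion at once.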
Tested by: pho