vm_map_protect may return an error response after clipping the first or last map entry in the region to be reserved. This creates a pair of matching entries that should have been "simplified" back into one, or never created. This change defers the clipping of those entries until most of the vm_map_protect failure cases have been ruled out.
Details
Diff Detail
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
vm_map.c | ||
---|---|---|
2535 | Should we be accumulating the total swap reserved in this loop and releasing it on an error return? | |
2538 | Should we be delaying the assignment of current->cred until the next loop, since otherwise we may cause two entries to become mergeable, and return early with KERN_RESOURCE_SHORTAGE without either restoring them to their original states or merging them? |
vm_map.c | ||
---|---|---|
2535 | Ideally, I think so. Note though that you would also need to go back and release any acquired cred refs. If you do as you suggest below and defer assignment of current->cred to the second loop, you wouldn't have to worry about this. | |
2582–2583 | We may have reserved swap space for an earlier entry without clipping it. If the entry spanned one of start or end we will have marked the entry as charged by setting entry->cred, but the amount of reserved swap space will be less than entry->end - entry->start. So when the entry is deleted, we may release more swap space than was reserved. At least, I don't see anything preventing this scenario. |
Reduce the swap_reserve calls to one, and not clip or modify map entry fields until it succeeds. This change makes an assumption about the need to check ptoa(obj->size) - that it isn't necessary when you the start and end of the map entry - that I'm believing because alc told me so.
Relative to the previous version, combine the last two loops of vm_map_protect into one.
With D20711.58891.diff I got:
20190622 09:11:08 all (10/70): mmap18.sh panic: no cred for entry 0xfffff8060ee67310 cpuid = 1 time = 1561187513 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00adcfe750 vpanic() at vpanic+0x19d/frame 0xfffffe00adcfe7a0 panic() at panic+0x43/frame 0xfffffe00adcfe800 vm_fault_copy_entry() at vm_fault_copy_entry+0x65a/frame 0xfffffe00adcfe8b0 vm_map_protect() at vm_map_protect+0x496/frame 0xfffffe00adcfe960 kern_mprotect() at kern_mprotect+0xc0/frame 0xfffffe00adcfe990 amd64_syscall() at amd64_syscall+0x291/frame 0xfffffe00adcfeab0 fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe00adcfeab0 --- syscall (74, FreeBSD ELF64, sys_mprotect), rip = 0x80042f00a, rsp = 0x7fffdfbfbf78, rbp = 0x7fffdfbfbfb0 --- KDB: enter: panic [ thread pid 60596 tid 100406 ] Stopped at kdb_enter+0x3b: movq $0,kdb_why db> run pho db:0:pho> set $lines 20000 db:0:pho> run pho1 db:1:pho1> bt Tracing pid 60596 tid 100406 td 0xfffff801f87d8000 kdb_enter() at kdb_enter+0x3b/frame 0xfffffe00adcfe750 vpanic() at vpanic+0x1ba/frame 0xfffffe00adcfe7a0 panic() at panic+0x43/frame 0xfffffe00adcfe800 vm_fault_copy_entry() at vm_fault_copy_entry+0x65a/frame 0xfffffe00adcfe8b0 vm_map_protect() at vm_map_protect+0x496/frame 0xfffffe00adcfe960 kern_mprotect() at kern_mprotect+0xc0/frame 0xfffffe00adcfe990 amd64_syscall() at amd64_syscall+0x291/frame 0xfffffe00adcfeab0
Reorder events in the final loop now that the significance of 'cred' has been partially revealed to me.
I ran all of the mmap() tests I have on D20711.58892.diff for a total of 8 hours.
I also did a buildworld / installworld.
No problems seen with this partial test.
Ah!
https://people.freebsd.org/~pho/stress/log/dougm036.txt is patched with D20711.58892.diff
Don't account for clipping in computing the amount of swap space to reserve. Fix some style glitches.
vm_map.c | ||
---|---|---|
2582–2583 | I'm not clear about whether this comment expresses a concern about a bug that this patch would introduce, or a bug that's already there that this patch is being asked to fix. So far, I'm not trying to fix anything except the cases where vm_map_protect clips or modifies entries and then fails without undoing those changes. Is this about using entry->end - entry->size instead of ptoa(obj->size)? I'm happy to go back to ptoa(obj->size). Somebody told me that they were the same. If that's not it, then I'm not sure what we're talking about. |
vm_map.c | ||
---|---|---|
2582–2583 | Look at how the clip_start entry is handled, for instance for NEED_COPY case. You reserve the amount of swap needed for non-clipped entry, then set cred for clipped. This means that the swap reserved is bigger than the size of the entry, and the difference would be never released. |
vm_map.c | ||
---|---|---|
2508 | Suppose an entry spans end. That entry will be clipped below, but we will reserve space for the new entry created by that clipping without charging that entry for the space. | |
2582–2583 | I was pointing out a bug in the initial version of the patch, which is still present and related to my comment above. Basically, with your patch we are reserving swap space based on the size of unclipped map entries, and I believe that that is incorrect. I will try to provide more context in my comments in the future. |
vm_map.c | ||
---|---|---|
2582–2583 | That's right now. But back up to diff 58892, where the critical line was: to_reserve += ulmin(end, current->end) - ulmax(start, current->start); so I was reserving just the amount needed for clipped entries. But Mark's first comment was made with this line in place, so I didn't understand it to be about accounting-for-clipping. Why did I change this line 12 hours ago? Because a Peter Holm crash suggested I wasn't reserving enough swap space, and I thought this line might be the problem. | |
2582–2583 | Thanks for all the help you can give me. But, as explained in the response to Kostik, I was accounting for clipping at the time you made your comment. I'm just not doing it now. |
vm_map.c | ||
---|---|---|
2582–2583 | The initial version had a different variant of the problem: it accounted for the clipping when reserving swap space, but the first loop could return an error before any clipping actually happened. So the amount of swap space reserved may not have matched the entry size. |
vm_map.c | ||
---|---|---|
2582–2583 | Isn't this what happens now? That is, clip_start calls vm_entry_charge_object, which leaves the two halves of the clipped entry both with the same underlying object and the same charge. Isn't that range being charged twice *now*? I don't see where clip_start is later making two objects, one for each of the two map entries, with reduced charges. |
Don't assume I can ignore ptoa(obj->size). Account for clipping when there's no object.
If vm_map_entry_back is going to be invoked by clipping, then use the size of the untrimmed entry, which is what vm_map_entry_back will set as the size of the object that it creates. Otherwise, if there's no object, there won't be an object made and use the clipped map entry difference. Otherwise, use the size from the pre-existing object, the size of which won't be modified by vm_map_entry_charge_object, or by clip_start or clip_end.
Add an assertion that the amount of swap space to reserved has been calculated correctly.
With D20711.58934.diff I get:
20190624 05:49:07 all (2/15): mmap11.sh panic: vm_map_protect: wrong amount reserved cpuid = 1 time = 1561348157 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00aca04800 vpanic() at vpanic+0x19d/frame 0xfffffe00aca04850 panic() at panic+0x43/frame 0xfffffe00aca048b0 vm_map_protect() at vm_map_protect+0x7b9/frame 0xfffffe00aca04960 kern_mprotect() at kern_mprotect+0xc0/frame 0xfffffe00aca04990 amd64_syscall() at amd64_syscall+0x291/frame 0xfffffe00aca04ab0 fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe00aca04ab0
Move the simplification of entries to a separate, third loop, not because I want to but because I want to see if simplification is what's making my accounting assertion fail, or if the cause is some other thing I don't understand.
So everyone but Peter is free to ignore this version.
Retreat. I know I can improve something, and I know that I'm misunderstanding something else. I'll try to take the small win for now.
vm_map.c | ||
---|---|---|
2541 | Isn't this test redundant? In other words, doesn't vm_map_clip_start() perform the same test? |