Randomize the address of the shared page and store it in struct vmspace.
It is inherited on fork.
The shared page mapping was moved from exec_new_vmspace to exec_map_stack.
The former function was called before the image activator had a chance to parse ASLR-related flags.
The behavior can be changed using the kern.elf64.aslr.shared_page sysctl.
The feature is enabled by default for 64-bit applications.
Details
Tested on arm64 and amd64:
- Run kyua with randomization on/off and compare results.
- Run a buildworld as a form of stress test.
- Play around with the Linuxulator a bit to make sure that no regression was introduced there.
Event Timeline
tests/sys/kern/kern_copyin.c:115
I've been using getpagesize() as it can normally get the page size without a syscall.
tests/sys/kern/kern_copyin.c:115
Oh right, I forgot about that... indeed getpagesize() is the way to go.
tests/sys/kern/kern_copyin.c:115
Oh, but _SC_PAGESIZE will also avoid the syscall:

    case _SC_PAGESIZE:
        return (getpagesize());

and it seems POSIX deprecated getpagesize(), so sysconf(_SC_PAGESIZE) is preferred, I guess.
tests/sys/kern/kern_copyin.c:110
I would suggest rewriting this (drop the "now") to avoid implying changed behaviour ("can now"); in several years it won't be "new" behaviour.
sys/amd64/amd64/exec_machdep.c:207 (On Diff #106468)
This part of the change can be done as a separate commit (once review is done), e.g. as @markj did with 706f4a81a81250a326ea25914e7effe1768f1a37.
Could you please explain the reason for this change? Since the shp is read-only, what is the point of randomization?
sys/vm/vm_map.h:298 (On Diff #106468)
I suggest to make the patch larger, but then the code overall would be more consistent IMO. Store the address of the shared page in struct vmspace, instead of the offset. sv_timekeep_base, sv_fxrng_gen_base, and sv_vdso_base should be replaced by offsets into the shared page (I might have missed something in the enumeration). The first patch in the series should do just that, and initialize vm_shp_base (the name is just an example) from p->p_sysent->sv_shared_page_base. The second patch would do the actual work, by only handling vm_shp_base as needed. There are at least two things I strongly recommend to do:
Sorry for the radio silence. I discovered that this patch, in its current form, breaks the Linuxulator VDSO clock routines.
Basically the problem is that the Linux VDSO glue code needs to read vdso_timekeep, which is stored in the shared page.
I have to figure out a fix for this first, before proceeding.
Once I have something I'll either open a new Phabricator revision or update this one.
It is probably somewhere between too hard and impossible, since the Linux vdso is written in C. That requires PIC code, which almost always requires a GOT and performing relocations before the code can work.
IMO it is enough to exclude the Linux ABI from shared page randomization.
That's what we noticed when HardenedBSD implemented VDSO randomization as well. The way the linuxulator's VDSO is implemented is not too friendly towards ASLR.
Ok, I'll go with that.
I guess the best way to do this is to add an opt-in flag to sysentvec.
Opt-in as in: the shared page address will be randomized only if this flag is set for the ABI.
Well, with some accuracy and care it is possible. At least, all variables should be static or hidden (and I do not think that non-local vars can exist in the vdso at all, at least while the page is mapped rx). So the first step is to build with -fPIC and then mangle the code enough to ensure that no relocations are produced. But this is still quite fragile and uses the compiler in a way it is not intended to be used.
My belief is that the only reliable option is to write it in asm, to have tightly controlled code.
There is one more possible approach to this.
Right now linux binaries have two shared pages mapped, the regular one and the one with the glue code.
I could copy all the relevant bits to the former and modify the code to only map the "glue" page.
This would require some work and could break things.
So IMHO for now disabling shared page randomization in linuxulator is the best way to go.
- Replace PAGE_SIZE with sysconf(_SC_PAGESIZE).
- Store offsets to various shared page segments, instead of their base addresses.
- Store the shared page base address in struct vmspace, instead of its offset from the default location.
I took a look at the linuxulator again, and actually I don't think this patch will cause an issue there.
All ASLR-related features, including this one, are only activated if the SV_ASLR flag is set in sysentvec.
None of the linuxulator sysentvecs have this flag set, so we should be fine here.
On a side note, this means that linuxulator binaries are not using ASLR at all.
sys/amd64/amd64/exec_machdep.c:207 (On Diff #106468)
Sure, will do.
sys/vm/vm_map.h:298 (On Diff #106468)
I don't think it's possible to map a guard page, such as the one used for the stack gap.
For the commit, I still believe that splitting the patch into two parts, one where you switch the code to use base + offsets, and another where the actual randomization occurs, is best. Could you please split the reviews?
sys/kern/imgact_elf.c:1315
I think you do not need a new map flag. The contexts where the flag is set, and then read, all use struct image_params. I suspect you can move the bool into this structure and avoid changing the map_flags type.
sys/kern/kern_exec.c:1273
I suggest to uprintf() something in case of error there, to give a hint to the user. See other examples in imgact_elf.c etc.
sys/kern/kern_sharedpage.c:389 (On Diff #106568)
Arguably you can just copy all offsets unconditionally, regardless of the flags.
sys/vm/vm_map.h:298 (On Diff #106468)
So kern_copyin needs to be changed. It probably would need mmap(MAP_FIXED) to keep it usable. Also, I do not see why this quite specific test should drive design decisions.
I've split the review into three parts:
- Introduce PROC_SIGCODE: https://reviews.freebsd.org/D35392
- Rework how shared page related data is stored: https://reviews.freebsd.org/D35393
- Shared page address randomization (this one)
sys/vm/vm_map.h:298 (On Diff #106468)
That's what I'm doing right now. If we want to use the guard page the test would need to be removed altogether.
sys/kern/imgact_elf.c:1315
The problem is that if the flag is stored in image_params, sysctl_kern_proc_vm_layout won't be able to access it.
I still think that the guard at the location of the fixed shp is required.
sys/kern/kern_exec.c:1173–1174
Would it be useful to mention in the herald comment that this procedure installs the shared page mapping as well? It is less obvious now, from the unchanged name.

tests/sys/kern/kern_copyin.c:117
If the shared page mapping is not randomized, wouldn't this mmap destroy it?
- If address randomization is applied, map a guard page at the usual location of the shared page.
- Remove the kern_copyin test. It doesn't work with the guard page at the top of the user address space.
I agree with that. I wanted to wait for your opinion w.r.t. removing the kern_copyin test, since it won't work with the guard page.
I went ahead and removed it in the latest update of this revision.
tests/sys/kern/kern_copyin.c:117
That's what MAP_EXCL is for. According to the mmap man page: "If MAP_EXCL is not specified, a successful MAP_FIXED request replaces any previous mappings for the process' pages in the range from addr to addr + len. In contrast, if MAP_EXCL is specified, the request will fail if a mapping already exists within the range."
Why? As discussed earlier, if the shared page is randomized and the guard is mapped at the top of the UVA, mmap(MAP_FIXED) something at this location.
Whoops, you're right. I don't know why, but for some reason I thought that you can't replace the guard page.
Something urgent came up at work, so it'll probably take me a couple of days to update this revision.
tests/sys/kern/kern_copyin.c:126
I quite dislike it. I think a much better approach is to provide the explicit shared page address in the uva layout sysctl, as I proposed earlier. Then you check whether the shp address intersects with maxuser - pagesize. A slightly more elaborate but also working approach is to flag the shared page in the kinfo_vmentry output of sysctl KERN_PROC_VMMAP.
Pass the shared page address in the kinfo_vm_layout struct and use it in the kern_copyin test.
sys/kern/kern_exec.c:1248
If amd64_lower_shared_page() was applied, this would map the guard one page below the top of the UVA.

sys/kern/kern_exec.c:1259
The guard should be one page long; there is no reason to use sv_shared_page_len there.

tests/sys/kern/kern_copyin.c:126
This assumes that the shared page mapping size is exactly one page. One day it will have to grow; we are already very low on free space there after the vdsos consumed most of it.
- Pass the shared page size in kinfo_vm_layout and use that information in the kern_copyin test.
- Map the guard page at the top of the UVA, instead of at the usual location of the shared page.
sys/kern/kern_exec.c:1248
The problem here is that amd64_lower_shared_page() also lowers sv_maxuser, which is used to determine the top of the UVA.
sys/kern/kern_exec.c:1262
Why not use vm_map_fixed() there? IMO it is the right KPI, instead of _find().

sys/kern/kern_exec.c:1277
Same there, I think vm_map_fixed() is a better choice than _find(). The original code used _fixed().

tests/sys/kern/kern_copyin.c:126
I think the most correct approach is to check the vm map and see if anything is mapped at maxuser - 1, and if it is, whether the mapping is a guard. But for now let it go.
Use vm_map_fixed() to allocate the guard page.
sys/kern/kern_exec.c:1277
I used _find() here since that's how the stack is allocated.
sys/kern/kern_exec.c:1277
I believe it is a good opportunity to use the proper KPI there. vm_map_fixed() translates directly to vm_map_insert(), while vm_map_find() is convoluted enough even for VMFS_NO_SPACE. It prepares for anonymous clustering, adjusts for ASLR calculations, etc. vm_map_fixed() avoids both the CPU overhead and the cognitive load of recognizing VMFS_NO_SPACE. Other than that, I do not have more comments for the patch.