Page MenuHomeFreeBSD

[RFC] Shared page address randomization
ClosedPublic

Authored by kd on May 30 2022, 12:46 PM.
Tags
None
Referenced Files
F107105949: D35349.diff
Fri, Jan 10, 5:04 AM
Unknown Object (File)
Mon, Jan 6, 4:30 PM
Unknown Object (File)
Sat, Jan 4, 10:29 PM
Unknown Object (File)
Sat, Jan 4, 9:20 PM
Unknown Object (File)
Sat, Jan 4, 4:52 PM
Unknown Object (File)
Thu, Jan 2, 11:44 AM
Unknown Object (File)
Sun, Dec 22, 3:25 PM
Unknown Object (File)
Fri, Dec 20, 2:18 AM

Details

Summary

Randomize the address of shared page and store it in struct vmspace.
It's inherited on fork.
The shared page mapping was moved from exec_new_vmspace to exec_map_stack.
The former function was called before image activator has a chance to parse ASLR related flags.
The behavior can be changed using kern.elf64.aslr.shared_page sysctl.
The feature is enabled by default for 64 bit applications.

Test Plan

Tested on arm64 and amd64:

  • Run kyua with randomization on/off and compare results.
  • Run a buildworld as a form of stress test.
  • Play around with Linuxulator a bit to make sure that no regression was introduced there.

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
kd requested review of this revision.May 30 2022, 12:46 PM
tests/sys/kern/kern_copyin.c
7

@andrew should we use sysconf(_SC_PAGESIZE)?

tests/sys/kern/kern_copyin.c
7

I've been using getpagesize() as it can normally get the page size without a syscall.

tests/sys/kern/kern_copyin.c
7

Oh right I forgot about that... indeed getpagesize() is the way to go.

tests/sys/kern/kern_copyin.c
7

Oh, but _SC_PAGESIZE will also avoid the syscall:

case _SC_PAGESIZE:
        return (getpagesize());

and it seems POSIX deprecated getpagesize(). So sysconf(_SC_PAGESIZE) is preferred I guess.

tests/sys/kern/kern_copyin.c
7

D35352 for a .Xr and POSIX deprecation mention in getpagesize(3).

tests/sys/kern/kern_copyin.c
2

I would suggest rewriting this (drop the "now") to avoid implying changed behaviour ("can now"); in several years it won't be "new" behaviour.

sys/amd64/amd64/exec_machdep.c
207 ↗(On Diff #106468)

This part of the change can be done as a separate commit (once review is done), e.g. as @markj did with 706f4a81a81250a326ea25914e7effe1768f1a37

could you please explain the reason for this change? since the shp is read-only, what is the point of randomization?

sys/vm/vm_map.h
298 ↗(On Diff #106468)

I suggest to make the patch larger, but then the code overall would be more consistent IMO.

Store the address of the shared page in struct vmspace, instead of the offset. sv_timekeep_base, sv_fxrng_gen_base, and sv_vdso_base should be replaced by offsets in the shared page (I might missed something in the enumeration).

First patch in series should do just that, and initialize vm_shp_base (name is just an example) from p->p_sysent->sv_shared_page_base.

Second patch would do the actual work, by only handling vm_shp_base as needed.

There are at least two things I strongly recomment to do:

  • if the shared page is randomized, map a guard at its fixed located at the end of UVA
  • amd64_lower_shared_page() does not make any useful thing anymore if the shared page is randomized and a guard is used

Sorry for the radio silence. I discovered that this patch, in its current form breaks Linuxulator VDSO clock routines.
Basically the problem is that the Linux VDSO glue code needs to read vdso_timekeep, that is stored in the shared page.
I have to figure out a fix for this first, before proceeding with this.
Once I have something I'll either open up a new phabricator revision, or update this one.

In D35349#801982, @mindal_semihalf.com wrote:

Sorry for the radio silence. I discovered that this patch, in its current form breaks Linuxulator VDSO clock routines.
Basically the problem is that the Linux VDSO glue code needs to read vdso_timekeep, that is stored in the shared page.
I have to figure out a fix for this first, before proceeding with this.
Once I have something I'll either open up a new phabricator revision, or update this one.

It is probably between too hard to impossible, due to linux vdso written in C. It requires PIC code, and it is almost always requires GOT and performing relocations before the code can work.

IMO it is enough to exclude linux ABI from shared page randomization.

In D35349#801990, @kib wrote:
In D35349#801982, @mindal_semihalf.com wrote:

Sorry for the radio silence. I discovered that this patch, in its current form breaks Linuxulator VDSO clock routines.
Basically the problem is that the Linux VDSO glue code needs to read vdso_timekeep, that is stored in the shared page.
I have to figure out a fix for this first, before proceeding with this.
Once I have something I'll either open up a new phabricator revision, or update this one.

It is probably between too hard to impossible, due to linux vdso written in C. It requires PIC code, and it is almost always requires GOT and performing relocations before the code can work.

IMO it is enough to exclude linux ABI from shared page randomization.

That's what we noticed when HardenedBSD implemented VDSO randomization as well. The way the linuxulator's VDSO is implemented is not too friendly towards ASLR.

In D35349#801990, @kib wrote:
In D35349#801982, @mindal_semihalf.com wrote:

Sorry for the radio silence. I discovered that this patch, in its current form breaks Linuxulator VDSO clock routines.
Basically the problem is that the Linux VDSO glue code needs to read vdso_timekeep, that is stored in the shared page.
I have to figure out a fix for this first, before proceeding with this.
Once I have something I'll either open up a new phabricator revision, or update this one.

It is probably between too hard to impossible, due to linux vdso written in C. It requires PIC code, and it is almost always requires GOT and performing relocations before the code can work.

IMO it is enough to exclude linux ABI from shared page randomization.

That's what we noticed when HardenedBSD implemented VDSO randomization as well. The way the linuxulator's VDSO is implemented is not too friendly towards ASLR.

Ok, I'll go with that.
I guess that the best way to do this is to add an opt-in flag to sysentvec.

In D35349#802004, @mindal_semihalf.com wrote:
In D35349#801990, @kib wrote:
In D35349#801982, @mindal_semihalf.com wrote:

Sorry for the radio silence. I discovered that this patch, in its current form breaks Linuxulator VDSO clock routines.
Basically the problem is that the Linux VDSO glue code needs to read vdso_timekeep, that is stored in the shared page.
I have to figure out a fix for this first, before proceeding with this.
Once I have something I'll either open up a new phabricator revision, or update this one.

It is probably between too hard to impossible, due to linux vdso written in C. It requires PIC code, and it is almost always requires GOT and performing relocations before the code can work.

IMO it is enough to exclude linux ABI from shared page randomization.

That's what we noticed when HardenedBSD implemented VDSO randomization as well. The way the linuxulator's VDSO is implemented is not too friendly towards ASLR.

Ok, I'll go with that.
I guess that the best way to do this is to add an opt-in flag to sysentvec.

opt-in as in the shared page address will be randomized only if this flag is set for this ABI

In D35349#802004, @mindal_semihalf.com wrote:

Ok, I'll go with that.
I guess that the best way to do this is to add an opt-in flag to sysentvec.

Well, with some accuracy and care it is possible. At least, all variables should be static or hidden (and I do not think that non-local vars can exists in vdso at all, at least while the page is mapped rx). So the first step is to build with -fPIC and then mangle code enough to ensure that no relocations are produced. But this is still quite fragile and depends on the compiler use that it is not intended to.

My belief is that the only reliable option is to write it in asm, to have tight-controlled code.

In D35349#802006, @kib wrote:

Well, with some accuracy and care it is possible. At least, all variables should be static or hidden (and I do not think that non-local vars can exists in vdso at all, at least while the page is mapped rx). So the first step is to build with -fPIC and then mangle code enough to ensure that no relocations are produced. But this is still quite fragile and depends on the compiler use that it is not intended to.

My belief is that the only reliable option is to write it in asm, to have tight-controlled code.

There is one more possible approach to this.
Right now linux binaries have two shared pages mapped, the regular one and the one with the glue code.
I could copy all the relevant bits to the former and modify the code to only map the "glue" page.
This would require some work and could break things.
So IMHO for now disabling shared page randomization in linuxulator is the best way to go.

kd edited the summary of this revision. (Show Details)
  • Replace PAGE_SIZE with sysconf(_SC_PAGESIZE).
  • Store offsets to various shared page segments, instead of their base addresses.
  • Store the shared page base address in struct vmspace, instead its offset to the default location.

I took a look at the linuxulator again and actually I don't think this patch will cause an issue there.
All ASLR related features, including this one are only activated if SV_ASLR flag is set in sysentvec.
None of the linuxulator sysentvecs have this flag set, so we should be fine here.
On a side note this means that the linuxulator binaries are not using ASLR at all.

sys/amd64/amd64/exec_machdep.c
207 ↗(On Diff #106468)

Sure, will do.

sys/vm/vm_map.h
298 ↗(On Diff #106468)

I don't think it's possible to map a guard page, such as one used for stack gap.
The kern_copyin test (see below) needs read access to the top of UVA, so in that scenario we need to have something mapped there.
I could create a RO page filled with something like 0xdeabeef, but imho it's just to send a segfault if a program tries to dereference a hardcoded address of the shared page.
For the same reason amd64_lower_shared_page is still needed.

For commit, I still believe that the split of the patch into two parts, one where you switch code to use base + offsets, and another where the actual randomization occurs, is the best. Could you, please, split the reviews?

sys/kern/imgact_elf.c
1315

I think you do not need a new map flag. The contexts where the flag is set, and then read, all use struct image_params. I suspect you can move the bool into this structure and avoid changing map_flags type.

sys/kern/kern_exec.c
1273

I suggest to uprintf() something in case of error there, to give a hint to user. See other examples in imgact_elf.c etc.

sys/kern/kern_sharedpage.c
389 ↗(On Diff #106568)

Arguably you can just copy all offsets unconditionally, regardless of the flags.

sys/vm/vm_map.h
298 ↗(On Diff #106468)

So kern_copyin needs to be changed. It probably would need map(MAP_FIXED) to keep it usable. Also, I do not see why this quite specific test should drive design decisions.

kd retitled this revision from [RFC] Shared page address randomization PoC to [RFC] Shared page address randomization.

Split the review into three parts.

In D35349#802348, @kib wrote:

For commit, I still believe that the split of the patch into two parts, one where you switch code to use base + offsets, and another where the actual randomization occurs, is the best. Could you, please, split the reviews?

I've split the review into three parts:

  1. Introduce PROC_SIGCODE: https://reviews.freebsd.org/D35392
  2. Rework how shared page related data is stored: https://reviews.freebsd.org/D35393
  3. Shared page address randomization(this one)
sys/vm/vm_map.h
298 ↗(On Diff #106468)

That's what I'm doing right now.
As a part of this review I've modified the test in question to map a page with MAP_ANON | MAP_FIXED | MAP_EXCL.

If we want to use the guard page the test would need to be removed altogether.
I believe that it was originally written by you, see https://reviews.freebsd.org/D3925 for details.
What do you think? Should it just be removed?

sys/kern/imgact_elf.c
1315

The problem is that if the flag is stored in image_params, sysctl_kern_proc_vm_layout won't be able to access it.
At the same time the shared page address randomization can only be toggled globally, with a sysctl.
Because of that maybe there is not need to expose this flag to userspace, on a per process basis.
I'll modify this patch to use imgp, unless someone opposes doing this like that.

Add an error message if mapping shared page fails

sys/kern/imgact_elf.c
1315

Can you check sv_shared_page_base against address stored in vmspace? If they are equal, consider shared page base not randomized.

sys/sys/user.h
640

Perhaps returning the shp mapping address would be useful?

kd marked an inline comment as done.
kd edited the summary of this revision. (Show Details)
  • Move shared page ASLR flag from vmspace to image_params.
sys/kern/imgact_elf.c
1315

Good idea!
Did just that in the latest update to this revision.

sys/sys/user.h
640

There is already the sysctl_kern_proc_sigtramp, which returns the sigcode base address, so IMO it's not really needed.

I still think that the guard at location of fixed shp is required.

sys/kern/kern_exec.c
1173–1174

Would it be useful to mention in the herald comment that this procedure installs shared page mapping as well? It is less obvious now, from the non-changed name.

tests/sys/kern/kern_copyin.c
10

If shared page mapping is not randomized, wouldn't this mmap destroy it?

  • If the address randomization is applied map a guard page to the usual location of the shared page.
  • Remove kern_copyin test. It doesn't work with the guard page at the top of user address space.
In D35349#803266, @kib wrote:

I still think that the guard at location of fixed shp is required.

I agree with that. I wanted to wait for your opinion w.r.t removing the kern_copyin test, since it won't work with the guard page.
I went ahead and removed it in the latest update of this revision.

tests/sys/kern/kern_copyin.c
10

That's what MAP_EXCL is for. According to mmap man page:

If MAP_EXCL is not specified, a successful MAP_FIXED
request replaces any previous mappings for the
process' pages in the range from addr to addr + len.
In contrast, if MAP_EXCL is specified, the request
will fail if a mapping already exists within the
range.
In D35349#803352, @mindal_semihalf.com wrote:
  • Remove kern_copyin test. It doesn't work with the guard page at the top of user address space.

Why? As discussed earlier, if the shared page is randomized and the guard is mapped at the top of the UVA, map(MAP_FIXED) something at this location.

In D35349#803460, @kib wrote:

Why? As discussed earlier, if the shared page is randomized and the guard is mapped at the top of the UVA, map(MAP_FIXED) something at this location.

Whoops, you're right. I don't know why, but for some reason I thought that you can't replace the guard page.
Something urgent came up at work, so it'll probably take me a couple of days to update this revision.

Check if the shared page address is randomized in kern_copyin test.

tests/sys/kern/kern_copyin.c
7

I quite dislike it. I think a much better approach is to provide explicit shared page address in the uva layout sysctl, as I proposed earlier. Then you check if shp address intersects with maxuser - pagesize.

Slightly more elaborate but also working approach is to flag shared page in the kinfo_vmentry output of sysctl KERN_PROC_VMMAP.

Pass the shared page address in kinfo_vm_layout struct and use it in kern_copyin test.

sys/kern/kern_exec.c
1248

If amd64_lower_shared_page() was applied, this would map the guard one page below the top of UVA.

1259

The guard should be one page long, there is no reason to use sv_shared_page_len there.

tests/sys/kern/kern_copyin.c
7

This assumes that the 'shared page mapping' size is exactly one page. One day it have to be grown in size, we are already very low on free space there after vdso's consume most of it.

  • Pass the shared page size in kinfo_vm_layout and use that information in kern_copyin test.
  • Map the guard page at the top of UVA, instead of mapping it at the usual location of the shared page.
sys/kern/kern_exec.c
1248

The problem here is that the amd64_lower_shared_page also lowers the sv_maxuser, which is used to determine that top of a UVA.
We can't really remove amd64_lower_shared_page since the shp address randomization is an opt-out feature.
For now I'll just map the guard at the sv_maxuser - PAGE_SIZE, and change its size to one page.

sys/kern/kern_exec.c
1262

Why not use vm_map_fixed() there? IMO it is the right KPI, instead of _find().

1277

Same there, I think vm_map_fixed() is better use than _find(). Original code used _fixed().

tests/sys/kern/kern_copyin.c
7

I think the most correct approach is to check the vm map and see if anything is mapped at the maxuser-1, and if it is, whether the mapping is guard. But for now let it go.

kd marked 2 inline comments as done.

Use vm_map_fixed to allocate the guard page.

sys/kern/kern_exec.c
1277

I used _find() here since that's how the stack is allocated.
IMO it's better to be consistent here.
I've changed the guard page mapping logic to use _fixed().

sys/kern/kern_exec.c
1277

I believe it is a good opportunity to use the proper KPI there. vm_map_fixed() is directly translated to vm_map_insert(), while vm_map_find() is convoluted enough even for VMFS_NO_SPACE. It prepares for anonymous clustering, adjusts for ASLR calculations etc. vm_map_fixed() avoids both the CPU overhead and cognitive load to recognize VMFS_NO_SPACE.

Other than that, I do not have more comments for the patch.

Use vm_map_fixed to map shared page when no address randomization is applied.

This revision is now accepted and ready to land.Jul 12 2022, 10:19 AM
This revision was automatically updated to reflect the committed changes.