
VM anonymous clustering: be more persistent
Closed, Public

Authored by kib on Apr 27 2023, 7:17 AM.

Details

Summary
It is possible that the VA range right before the current anonymous
clustering location was unmapped; for example, this can happen when
malloc allocations interleave with explicit mmaps of anonymous memory
and the corresponding frees/unmaps.

Mark anon mappings with a new map entry flag MAP_ENTRY_ANON_LOC_UPD, and
update map->anon_loc with the previous entry's end if the freed entry is
flagged.
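
A minimal sketch of the delete-path update this describes (the predecessor helper vm_map_entry_pred() and the exact condition are assumptions drawn from the discussion below, not the committed code):

/*
 * Sketch: in vm_map_delete(), before unlinking a flagged anonymous
 * entry at or below the clustering point, pull anon_loc back to the
 * end of the preceding entry so the freed range can be reused.
 */
if ((entry->eflags & MAP_ENTRY_ANON_LOC_UPD) != 0 &&
    entry->end <= map->anon_loc)
        map->anon_loc = vm_map_entry_pred(entry)->end;
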
ASLR: add a new mode for anonymous clustering control

The value 3 for vm.cluster_anon still initializes the clustering point
at image creation, but never updates it. So, all virtual address
allocations are performed using an address-ordered, first-fit policy,
just as when ASLR is turned off.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

My thought is that upward migration of map->anon_loc could lead to problems under some conditions.
The test code below demonstrates that malloc'ed addresses still shift to higher addresses due to anon_loc updating, forcing jemalloc to eat more memory to store metadata.

vm.cluster_anon=2
setenv MALLOC_CONF 'retain:false'

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main() {
    const size_t CHUNK_SIZE = 10 * 1024 * 1024;

    // Initialize pointers for the two memory chunks
    void* chunk1 = NULL;
    void* chunk2 = NULL;

    while (1) {
        // Allocate memory for chunk1 if it's not already allocated
        if (chunk1 == NULL) {
            chunk1 = malloc(CHUNK_SIZE);
            if (chunk1 == NULL) {
                printf("Failed to allocate memory for chunk1\n");
                return 1;
            }
        }

        // Allocate memory for chunk2 if it's not already allocated
        if (chunk2 == NULL) {
            chunk2 = malloc(CHUNK_SIZE);
            if (chunk2 == NULL) {
                printf("Failed to allocate memory for chunk2\n");
                return 1;
            }
        }

        // Free the chunk with the lower address, ensuring at least one valid allocation
        if (chunk1 < chunk2) {
            free(chunk1);
            chunk1 = NULL;
        } else {
            free(chunk2);
            chunk2 = NULL;
        }

//      sleep(1);
    }
    return 0;
}
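
For reproduction, something along these lines should work on a stock system (the file and program names are placeholders; using procstat(1) to watch the map is my addition):

cc -o anon_creep anon_creep.c
sysctl vm.cluster_anon=2
env MALLOC_CONF=retain:false ./anon_creep &
procstat -v $!

Repeated procstat -v snapshots then show the anonymous entries' start addresses ratcheting upward as anon_loc creeps.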

We cannot provide minimal VA spread with ASLR; after all, that is the point of the 'R' in it. Gluing vm_map_entries and the underlying objects together to avoid their proliferation is the goal of clustering.

In D39845#907052, @kib wrote:

We cannot provide minimal VA spread with ASLR; after all, that is the point of the 'R' in it. Gluing vm_map_entries and the underlying objects together to avoid their proliferation is the goal of clustering.

Agreed, higher VA spread is to be expected with ASLR. The problem I'm trying to address is jemalloc's metadata leak due to the returned VA shifting upward, which could eventually hit userspace OOM.
When there is a big memory hole below anon_loc, it seems to be a good idea to reuse the hole to cluster the request rather than expand the VA range.

When there is a big memory hole below anon_loc, it seems to be a good idea to reuse the hole to cluster the request rather than expand the VA range.

This is exactly what my patch does when there is no map entry placed at anon_loc. Otherwise it essentially disables the clustering, which you could do just as well with the existing knob.

What if the freed hole is between two anonymous map entries (curr_min_addr will fall down to the higher entry->end, right)? Then the hole won't be reused.
I used the test code mentioned earlier to verify your patch, and I can still see anon_loc growing and the memory footprint increasing.

What if the freed hole is between two anonymous map entries (curr_min_addr will fall down to the higher entry->end, right)? Then the hole won't be reused.
I used the test code mentioned earlier to verify your patch, and I can still see anon_loc growing and the memory footprint increasing.

Yes, the hole would not be reused, but it also would not cause VA fragmentation. If jemalloc cannot cope with increasing VAs returned by mmap(2), that seems like some kind of deficiency in jemalloc, and it is better to fix it there. After all, there are other malloc implementations.

I have not dug into the jemalloc implementation.

Reusing the hole might be fine, but we need some criteria for when to stop clustering and go back down to allocate from the lowest available VA. Again, what you are suggesting would simply disable clustering. So you could set vm.cluster_anon to zero and have the same effect, most likely.

Another option would be to pick a random virtual address for anon_loc at execve() time, and never update it. So, all virtual address allocations would be performed using an address-ordered, first-fit policy, just like we do when ASLR is turned off.

In D39845#921390, @alc wrote:

Another option would be to pick a random virtual address for anon_loc at execve() time, and never update it. So, all virtual address allocations would be performed using an address-ordered, first-fit policy, just like we do when ASLR is turned off.

Are you proposing yet another mode for cluster_anon, or do you mean to replace the mode 2 'always cluster' with this variant?

In D39845#921521, @kib wrote:
In D39845#921390, @alc wrote:

Another option would be to pick a random virtual address for anon_loc at execve() time, and never update it. So, all virtual address allocations would be performed using an address-ordered, first-fit policy, just like we do when ASLR is turned off.

Are you proposing yet another mode for cluster_anon, or do you mean to replace the mode 2 'always cluster' with this variant?

The latter.

Since I have been working on superpages-related changes lately, I was surprised to find that we do fewer promotions with ASLR on versus off. I would have expected clustering to result in there being no fewer promotions with ASLR on. However, some large address ranges seem to be misaligned:

1447     0x3a55b29e7000     0x3a55b29ea000 r--    3    6   3   1 CN--- vn /usr/sbin/rpc.statd
1447     0x3a55b29ea000     0x3a55b29ed000 r-x    3    6   3   1 CN--- vn /usr/sbin/rpc.statd
1447     0x3a55b29ed000     0x3a55b29ee000 rw-    1    0   1   0 CN--- vn /usr/sbin/rpc.statd
1447     0x3a55b29ee000     0x3a55b29ef000 rw-    1    1   1   0 C---- sw
1447     0x3a5db31aa000     0x3a5dd318a000 ---    0    0   0   0 ----- gd
1447     0x3a5dd318a000     0x3a5dd31aa000 rw-    6    6   1   0 C--D- sw
1447     0x3a5dd3dc1000     0x3a5dd3dc2000 r-x    1    1  25   0 ----- ph
1447     0x3a5dd3fbd000     0x3a5dd3fde000 rw-   26   26   1   0 C---- sw
1447     0x3a5dd4cad000     0x3a5dd4cb2000 r--    5   10   8   4 CN--- vn /usr/lib/librpcsvc.so.5
1447     0x3a5dd4cb2000     0x3a5dd4cb8000 r-x    6   10   8   4 CN--- vn /usr/lib/librpcsvc.so.5
1447     0x3a5dd4cb8000     0x3a5dd4cb9000 rw-    1    0   1   0 CN--- vn /usr/lib/librpcsvc.so.5
1447     0x3a5dd4cb9000     0x3a5dd4cba000 rw-    1    0   1   0 C---- vn /usr/lib/librpcsvc.so.5
1447     0x3a5dd4e80000     0x3a5dd4f04000 r--   92  416 105  59 CN--- vn /lib/libc.so.7
1447     0x3a5dd4f04000     0x3a5dd503d000 r-x  297  416 105  59 CN--- vn /lib/libc.so.7
1447     0x3a5dd503d000     0x3a5dd5046000 r--    9    0   1   0 CN--- vn /lib/libc.so.7
1447     0x3a5dd5046000     0x3a5dd5047000 rw-    1    0   1   0 CN--- vn /lib/libc.so.7
1447     0x3a5dd5047000     0x3a5dd504d000 rw-    6    0   1   0 C---- vn /lib/libc.so.7
1447     0x3a5dd504d000     0x3a5dd5270000 rw-   18   18   1   0 C---- sw
1447     0x3a5dd5c00000     0x3a5dd5e00000 rw-   10   10   1   0 C---- sw
1447     0x3a5dd6b72000     0x3a5dd6d72000 rw-   66   66   1   0 C---- sw
1447     0x3a5dd7400000     0x3a5dd7800000 rw-    8    8   1   0 C---- sw
1447     0x3a5dd79fc000     0x3a5dd7efc000 rw-  919  919   1   0 CNS-- sw
1447     0x3a5dd84b6000     0x3a5dd8eb6000 rw-    0    0   0   0 ----- --
1447     0x3a5dda600000     0x3a5dea600000 rw-    1    1   1   0 ----- vn /var/db/statd.status
1447     0x5660c31d0000     0x5660c31d7000 r--    7   29  69  23 CN--- vn /libexec/ld-elf.so.1
1447     0x5660c31d7000     0x5660c31ed000 r-x   22   29  69  23 CN--- vn /libexec/ld-elf.so.1
1447     0x5660c31ed000     0x5660c31ee000 r--    1    0   1   0 CN--- vn /libexec/ld-elf.so.1
1447     0x5660c31ee000     0x5660c31f0000 rw-    2    2   1   0 C---- sw
1447     0x7ffffffff000     0x800000000000 ---    0    0   0   0 ----- gd

Misalignment is perhaps a different issue. We can select an aligned top for the stack, and an aligned anon_loc each time we have to choose one.
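
For illustration, choosing an aligned clustering point could be as simple as this sketch (rand_base is a hypothetical freshly randomized address; roundup2() is from sys/param.h and NBPDR is the amd64 2 MB superpage size):

/* Sketch: round the randomized clustering point up to a superpage
 * boundary so large anonymous mappings start out promotable. */
map->anon_loc = roundup2(rand_base, NBPDR);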

I decided to add a new mode '3' with the proposed behavior.

ASLR: add a new mode for anonymous clustering control

The value 3 for vm.cluster_anon still initializes the clustering point
at image creation, but never updates it. So, all virtual address
allocations are performed using an address-ordered, first-fit policy,
just as when ASLR is turned off.

Suggested by:   alc
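
The mode test presumably reduces to something like the following sketch (the name clustering_anon_loc_const() shows up in a code excerpt later in this review; the body is an assumption):

static bool
clustering_anon_loc_const(void)
{
        /* Mode 3: anon_loc is chosen once at image creation and
         * never moved afterward. */
        return (cluster_anon == 3);
}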
sys/vm/vm_map.c
3979

This is fixed in my repo.

In D39845#922683, @kib wrote:

Misalignment is perhaps a different issue. We can select an aligned top for the stack, and an aligned anon_loc each time we have to choose one.

The initial alignment of anon_loc really doesn't matter that much. At most, it makes a difference of one promotion per process, and on average less than that. It is clustering, specifically the lack thereof, that matters.

The following patch, together with the output below, sheds light on what is happening:

diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c
index f5863a9b9939..8ffcac8668d2 100644
--- a/sys/vm/vm_map.c
+++ b/sys/vm/vm_map.c
@@ -2102,6 +2102,8 @@ vm_map_find(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
            ("vm_map_find: non-NULL backing object for stack"));
        MPASS((cow & MAP_REMAP) == 0 || (find_space == VMFS_NO_SPACE &&
            (cow & (MAP_STACK_GROWS_DOWN | MAP_STACK_GROWS_UP)) == 0));
+       vm_offset_t hint = *addr;
+       int save_find_space = find_space;
        if (find_space == VMFS_OPTIMAL_SPACE && (object == NULL ||
            (object->flags & OBJ_COLORED) == 0))
                find_space = VMFS_ANY_SPACE;
@@ -2229,6 +2231,11 @@ vm_map_find(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
        if (rv == KERN_SUCCESS && update_anon)
                map->anon_loc = *addr + length;
 done:
+       if (rv == KERN_SUCCESS && object == NULL) {
+               CTR6(KTR_SPARE5,
+                   "map: %p, hint: %#lx, len: %#lx, (find: %d, align: %#lx), addr: %#lx",
+                   map, hint, length, save_find_space & 255, alignment, *addr);
+       }
        vm_map_unlock(map);
        return (rv);
 }

Here is some representative output from otherwise unmodified sources:

1649 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x300000, (find: 0, align: 0x1000), addr: 0x83d707000
1648 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x200000, (find: 0, align: 0x1000), addr: 0x83cb79000
1647 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x200000, (find: 0, align: 0x200000), addr: 0x83c600000
1646 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x200000, (find: 0, align: 0x1000), addr: 0x83bc95000
1645 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x200000, (find: 0, align: 0x200000), addr: 0x83b200000
1644 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x200000, (find: 0, align: 0x1000), addr: 0x83a3f8000
1643 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x200000, (find: 0, align: 0x200000), addr: 0x839800000
1642 map: 0xfffffe01a3faf208, hint: 0x804259000, len: 0x201000, (find: 2, align: 0), addr: 0x8389db000
1641 map: 0xfffffe01a3faf208, hint: 0x80445a000, len: 0x201000, (find: 2, align: 0), addr: 0x838256000
1640 map: 0xfffffe01a3faf208, hint: 0x80465b000, len: 0x201000, (find: 2, align: 0), addr: 0x837625000
1639 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x200000, (find: 0, align: 0x1000), addr: 0x8366e8000
1638 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x200000, (find: 0, align: 0x200000), addr: 0x833c00000
1637 map: 0xfffffe01a3faf208, hint: 0x80485c000, len: 0x201000, (find: 2, align: 0), addr: 0x833510000
1636 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x1000, (find: 2, align: 0), addr: 0x831af0000
1635 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x280000, (find: 0, align: 0x1000), addr: 0x830f33000
1634 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x1000, (find: 2, align: 0), addr: 0x82f48d000
1633 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x21000, (find: 2, align: 0), addr: 0x82da84000
1632 map: 0xfffffe01a3faf208, hint: 0x804a5d000, len: 0x1000, (find: 2, align: 0), addr: 0x82d52d000
1631 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x400000, (find: 0, align: 0x200000), addr: 0x82c800000
1630 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x200000, (find: 0, align: 0x1000), addr: 0x82b8d2000
1629 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x200000, (find: 0, align: 0x200000), addr: 0x82b600000
1628 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x24000, (find: 2, align: 0), addr: 0x82abaa000
1627 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x1a000, (find: 2, align: 0), addr: 0x82a193000
1626 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x1b000, (find: 2, align: 0), addr: 0x82b2bb000
1625 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x3f0000, (find: 2, align: 0), addr: 0x829ba2000
1624 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x3c000, (find: 2, align: 0), addr: 0x828502000
1623 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x21000, (find: 2, align: 0), addr: 0x82929e000
1622 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x103000, (find: 2, align: 0), addr: 0x828031000
1621 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x2d000, (find: 2, align: 0), addr: 0x8271e8000
1620 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x1c000, (find: 2, align: 0), addr: 0x827eb1000
1619 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x41000, (find: 2, align: 0), addr: 0x826d77000
1618 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x6000, (find: 2, align: 0), addr: 0x825ea9000
1617 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x21000, (find: 2, align: 0), addr: 0x825bf3000
1616 map: 0xfffffe01a3faf208, hint: 0x803e31000, len: 0x20000000, (find: 1, align: 0), addr: 0x804a5e000
sys/vm/vm_map.c
1991–1992

Consider this diagnostic patch:

diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c
index 8ffcac8668d2..492717849d37 100644
--- a/sys/vm/vm_map.c
+++ b/sys/vm/vm_map.c
@@ -2000,6 +2000,11 @@ SYSCTL_LONG(_vm, OID_AUTO, aslr_restarts, CTLFLAG_RD,
     &aslr_restarts, 0,
     "Number of aslr failures");
 
+static long aslr_hint_zeroes;
+SYSCTL_LONG(_vm, OID_AUTO, aslr_hint_zeroes, CTLFLAG_RD,
+    &aslr_hint_zeroes, 0,
+    "Number of aslr XXX");
+
 /*
  * Searches for the specified amount of free space in the given map with the
  * specified alignment.  Performs an address-ordered, first-fit search from
@@ -2236,6 +2241,8 @@ vm_map_find(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
                    "map: %p, hint: %#lx, len: %#lx, (find: %d, align: %#lx), addr: %#lx",
                    map, hint, length, save_find_space & 255, alignment, *addr);
        }
+       if (hint == 0)
+               atomic_add_long(&aslr_hint_zeroes, 1);
        vm_map_unlock(map);
        return (rv);
 }

And here is the output from a machine that has been busy for a couple of hours:

# sysctl vm.aslr_hint_zeroes
vm.aslr_hint_zeroes: 0

In other words, by default we never cluster, because addr is never zero.
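
For reference, the pre-patch mode switch, reconstructed from the unchanged context of the vm_map_find_min() diff later in this review:

static bool
clustering_anon_allowed(vm_offset_t addr)
{
        switch (cluster_anon) {
        case 0:
                return (false);
        case 1:
                /* Never true in practice: kern_mmap() has already
                 * replaced a zero hint with a minimum address. */
                return (addr == 0);
        case 2:
        default:
                return (true);
        }
}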

sys/vm/vm_map.c
1991–1992

Yes, it is relevant for cluster mode 1, which is the default.

vm_map_find(): the address hint is almost always non-zero

Adapt to this observation and fix cluster mode 1 by enabling clustering
if the hint is zero or less than the current clustering location.

Noted by:       alc
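
A sketch of what this intermediate fix amounts to (threading the clustering location in as a parameter is my guess; this patch was later dropped in favor of the MAP_NO_HINT approach proposed below):

static bool
clustering_anon_allowed(vm_offset_t addr, vm_offset_t anon_loc)
{
        switch (cluster_anon) {
        case 0:
                return (false);
        case 1:
                /* Cluster when there is no usable hint: either none
                 * at all, or one below the clustering point. */
                return (addr == 0 || addr < anon_loc);
        case 2:
        default:
                return (true);
        }
}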
sys/vm/vm_map.c
2138
kib marked an inline comment as done.

Plug one more place where anon_loc is effectively changed.

sys/vm/vm_map.c
3979

Consider this alternative approach: Introduce an "update anon_loc on removal" flag on map entries. Set it in vm_map_find/insert() on anonymous mappings when cluster_anon is set to 2. Then, on removal, when the flag is set and the map entry is beneath the current anon_loc, reset anon_loc to the end of the previous map entry. This should not only allow the recovery of more address ranges, but also eliminate the need for the new vm_map_lookup_entry() call in vm_map_find(). See the sketch below.
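
A sketch of the setting side of this suggestion (exactly where the flag is applied is an assumption; the removal side is sketched under the summary above):

/*
 * Sketch: in vm_map_find(), after a successful clustered anonymous
 * mapping is inserted, tag its entry so that vm_map_delete() is
 * allowed to move anon_loc back down when the entry goes away.
 */
if (rv == KERN_SUCCESS && update_anon) {
        if (vm_map_lookup_entry(map, *addr, &entry))
                entry->eflags |= MAP_ENTRY_ANON_LOC_UPD;
        map->anon_loc = *addr + length;
}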

kib marked an inline comment as done.

MAP_ENTRY_ANON_LOC_UPD

Correct the value for MAP_ENTRY_ANON_LOC_UPD

I did some testing with and without the previous version of the patch. I simply did a "buildworld" as the workload.

Without the patch and with the default ASLR settings, I get the following because, as I showed above, clustering doesn't occur:
vm.pmap.pde.promotions: 159374

With ASLR disabled, I get:
vm.pmap.pde.promotions: 176866

With the patch and with the default ASLR settings, I get:
vm.pmap.pde.promotions: 179200

With the patch and with cluster_anon set to 3 ("const anon_loc"), I get:
vm.pmap.pde.promotions: 189771

ASLR with working clustering can "outperform" no-ASLR, our old default, in this respect because the randomization actually results in better segregation between mapped files and anonymous memory. In other words, there is somewhat less fragmentation of the heap for processes, like clang, that have mapped files. Here is an example of the fragmentation from pre-ASLR times:
https://reviews.freebsd.org/P267

sys/vm/vm_map.c
3980–3981

The flag allows us to recognize cases where a clustered allocation is being freed even though it isn't the last allocation, and reset anon_loc to allow reallocation of that address range. That said, I'm having second thoughts about this approach.

The following aspect of the previous version of the patch has an unintended consequence:

else if (!clustering_anon_loc_const() &&
    !vm_map_lookup_entry(map, curr_min_addr, &entry))
        curr_min_addr = entry->end;

With cluster_anon set to its default value of 1, the first time that a clustered allocation is attempted (line 7186), the above code effectively lowers anon_loc to the end of the previous map entry, far from the address originally generated for anon_loc.

7195 map: 0xfffffe018592ec30, hint: 0x33c594562000, len: 0x400000, (find: 0, align: 0x200000), addr: 0x33cbdf200000
7194 map: 0xfffffe018592ec30, anon_loc: 0x33cbdf200000, addr: 0x33cbdf200000
7193 map: 0xfffffe018592ec30, hint: 0x33c594562000, len: 0x200000, (find: 0, align: 0x1000), addr: 0x33cbdf000000
7192 map: 0xfffffe018592ec30, anon_loc: 0x33cbdf000000, addr: 0x33cbdf000000
7191 map: 0xfffffe018592ec30, hint: 0x33c594562000, len: 0x200000, (find: 0, align: 0x200000), addr: 0x33cbdee00000
7190 map: 0xfffffe018592ec30, anon_loc: 0x33cbded54000, addr: 0x33cbdee00000
7189 map: 0xfffffe018592ec30, hint: 0x33c594562000, len: 0x3f0000, (find: 2, align: 0), addr: 0x33c5b69b1000
7188 map: 0xfffffe018592ec30, hint: 0x33c594562000, len: 0x17000, (find: 2, align: 0), addr: 0x33c5b6009000
7187 map: 0xfffffe018592ec30, hint: 0x33c594562000, len: 0x21000, (find: 2, align: 0), addr: 0x33cbded33000
7186 map: 0xfffffe018592ec30, anon_loc: 0x49b32b000000, addr: 0x33cbded33000
7185 map: 0xfffffe018592ec30, hint: 0x33c594562000, len: 0x20000000, (find: 1, align: 0), addr: 0x33c595237000
7184 map: 0xfffffe018592ec30, interp addr: 0x33cbded13000
7183 map: 0xfffffe018592ec30, anon_loc: 0x49b32b000000
7182 map: 0xfffffe018592ec30, dyn addr: 0x33bd94558000
7181 map: 0xfffffe018592ec30, maxv: 0x7ffffffff000

Consequently, the initial alignment of anon_loc is lost.

The latest version of the patch doesn't have this issue.

After mulling this over for a while, I'm going to suggest a different approach to address the clustering problem. Essentially, I'm pushing down the point at which we convert a hint of zero to a minimum address into vm_map_find_min().

diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c
index f5863a9b9939..4fd0aff24b75 100644
--- a/sys/vm/vm_map.c
+++ b/sys/vm/vm_map.c
@@ -1981,14 +1981,14 @@ SYSCTL_INT(_vm, OID_AUTO, cluster_anon, CTLFLAG_RW,
     "Cluster anonymous mappings: 0 = no, 1 = yes if no hint, 2 = always");
 
 static bool
-clustering_anon_allowed(vm_offset_t addr)
+clustering_anon_allowed(vm_offset_t addr, int cow)
 {
 
        switch (cluster_anon) {
        case 0:
                return (false);
        case 1:
-               return (addr == 0);
+               return (addr == 0 || (cow & MAP_NO_HINT) != 0);
        case 2:
        default:
                return (true);
@@ -2111,7 +2111,7 @@ vm_map_find(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
        } else
                alignment = 0;
        en_aslr = (map->flags & MAP_ASLR) != 0;
-       update_anon = cluster = clustering_anon_allowed(*addr) &&
+       update_anon = cluster = clustering_anon_allowed(*addr, cow) &&
            (map->flags & MAP_IS_SUB_MAP) == 0 && max_addr == 0 &&
            find_space != VMFS_NO_SPACE && object == NULL &&
            (cow & (MAP_INHERIT_SHARE | MAP_STACK_GROWS_UP |
@@ -2255,6 +2255,10 @@ vm_map_find_min(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
        int rv;
 
        hint = *addr;
+       if (hint == 0)
+               cow |= MAP_NO_HINT;
+       if (hint < min_addr)
+               *addr = hint = min_addr;
        for (;;) {
                rv = vm_map_find(map, object, offset, addr, length, max_addr,
                    find_space, prot, max, cow);
diff --git a/sys/vm/vm_map.h b/sys/vm/vm_map.h
index 2ac54a39a57b..fd8b606e8ddc 100644
--- a/sys/vm/vm_map.h
+++ b/sys/vm/vm_map.h
@@ -383,6 +383,7 @@ long vmspace_resident_count(struct vmspace *vmspace);
 #define        MAP_CREATE_STACK_GAP_DN 0x00020000
 #define        MAP_VN_EXEC             0x00040000
 #define        MAP_SPLIT_BOUNDARY_MASK 0x00180000
+#define        MAP_NO_HINT             0x00200000
 
 #define        MAP_SPLIT_BOUNDARY_SHIFT 19
 
diff --git a/sys/vm/vm_mmap.c b/sys/vm/vm_mmap.c
index 56345fcaf560..37bf87d813a1 100644
--- a/sys/vm/vm_mmap.c
+++ b/sys/vm/vm_mmap.c
@@ -347,14 +347,16 @@ kern_mmap(struct thread *td, const struct mmap_req *mrp)
                if (addr + size > MAP_32BIT_MAX_ADDR)
                        addr = 0;
 #endif
-       } else {
+       } else if ((flags & MAP_ANON) == 0) {
                /*
                 * XXX for non-fixed mappings where no hint is provided or
                 * the hint would fall in the potential heap space,
                 * place it after the end of the largest possible heap.
                 *
-                * There should really be a pmap call to determine a reasonable
-                * location.
+                * For anonymous mappings within the address space of the
+                * calling process, the absence of a hint is handled at a
+                * lower level in order to implement different clustering
+                * strategies for ASLR.
                 */
                if (addr == 0 ||
                    (addr >= round_page((vm_offset_t)vms->vm_taddr) &&

To be clear, the above change does not address the upward creep of anon_loc. We'll still need to commit a variation on the patch in this review.

Yes, I believe this is just one of the three changes present in my branch. I do not have objections.

In D39845#926656, @kib wrote:

Yes, I believe this is just one of the three changes present in my branch. I do not have objections.

Would you have me create a separate review for my clustering fix, or just reference this review?

In D39845#926725, @alc wrote:
In D39845#926656, @kib wrote:

Yes, I believe this is just one of the three changes present in my branch. I do not have objections.

Would you have me create a separate review for my clustering fix, or just reference this review?

I am fine either way. Perhaps a separate review would be cleaner, because your commit would not close this one.

Rebase. Drop patch for handling non-zero hint.

Let me say up front that the worst of the address space creep has been addressed by the changes in place. However, I still see some behavior that I think should be changed. I'll elaborate on that later, but first I have a question: @kib, it has never been clear to me why we update anon_loc. Is it meant solely as an optimization of the search for free address space, or is it done for some other reason?

In D39845#952791, @alc wrote:

@kib, it has never been clear to me why we update anon_loc. Is it meant solely as an optimization of the search for free address space, or is it done for some other reason?

I probably never ever considered not updating anon_loc. When I asked myself why, the most obvious reason is the failure mode: if looking for free space after anon_loc fails, we need to move the clustering point somewhere; otherwise it would fail on the first try forever.

In D39845#952794, @kib wrote:
In D39845#952791, @alc wrote:

@kib, it has never been clear to me why we update anon_loc. Is it meant solely as an optimization of the search for free address space, or is it done for some other reason?

I probably never ever considered not updating anon_loc. When I asked myself why, the most obvious reason is the failure mode: if looking for free space after anon_loc fails, we need to move the clustering point somewhere; otherwise it would fail on the first try forever.

My observation is that jemalloc performs two types of virtual memory allocations: (1) large chunks of virtual memory, where the chunk size is a multiple of a superpage and explicitly aligned, and (2) small allocations, mostly 128 KB, where no alignment is requested. Typically, it starts with a small allocation, and over time it makes both types of allocation.

With anon_loc being updated on every allocation, we wind up with a repeating pattern of a small allocation, a large gap, and a large, aligned allocation. (As an aside, we wind up allocating a reservation for these small allocations, but it will never fill because the next large, aligned allocation updates anon_loc, leaving a gap that will never be filled with other small allocations.) In contrast, if anon_loc isn't updated on every allocation, both the small and large allocations will be clustered together, and there will only be one gap.
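
A user-space illustration of the two request shapes just described (a sketch only; jemalloc issues these through its own extent allocation, and MAP_ALIGNED_SUPER stands in for its explicit alignment requests):

#include <sys/mman.h>

#define SUPERPAGE       (2UL * 1024 * 1024)

int
demo_jemalloc_pattern(void)
{
        /* (1) Large chunk: a superpage multiple, explicitly aligned.
         * Pre-patch, this bumps anon_loc to its end, stranding the
         * alignment gap below it. */
        void *large = mmap(NULL, 2 * SUPERPAGE, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_PRIVATE | MAP_ALIGNED_SUPER, -1, 0);

        /* (2) Small allocation: 128 KB, no alignment requested; lands
         * at the current anon_loc. */
        void *small = mmap(NULL, 128 * 1024, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_PRIVATE, -1, 0);

        return (large == MAP_FAILED || small == MAP_FAILED);
}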

I'm going to suggest the following as the final resolution for this issue. It addresses the fragmentation that I described in my last message, and it handles the scenario that you brought up wherein we run out of free space at the current anon_loc, perform an ASLR restart, and want to avoid repeating the ASLR restart on the next mapping.

diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c
index 3c7afcb6642f..fa71bb8a01d6 100644
--- a/sys/vm/vm_map.c
+++ b/sys/vm/vm_map.c
@@ -2247,8 +2247,15 @@ vm_map_find(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
                rv = vm_map_insert(map, object, offset, *addr, *addr + length,
                    prot, max, cow);
        }
-       if (rv == KERN_SUCCESS && update_anon)
-               map->anon_loc = *addr + length;
+
+       /*
+        * Update the starting address for clustered anonymous memory mappings
+        * if a starting address was not previously defined or an ASLR restart
+        * placed an anonymous memory mapping at a lower address.
+        */
+       if (update_anon && rv == KERN_SUCCESS && (map->anon_loc == 0 ||
+           *addr < map->anon_loc))
+               map->anon_loc = *addr;
 done:
        vm_map_unlock(map);
        return (rv);
@@ -4041,9 +4048,6 @@ vm_map_delete(vm_map_t map, vm_offset_t start, vm_offset_t end)
                    entry->object.vm_object != NULL)
                        pmap_map_delete(map->pmap, entry->start, entry->end);
 
-               if (entry->end == map->anon_loc)
-                       map->anon_loc = entry->start;
-
                /*
                 * Delete the entry only after removing all pmap
                 * entries pointing to its pages.  (Otherwise, its

With this patch in place, I see a small reduction in reservations allocated (e.g., 1.6% during buildworld), fewer partially populated reservations, a small increase in 64KB page promotions on arm64, and fewer map entries in the heap.

Here is a very simple example. Before the change:

97156     0x2eb08e600000     0x2eb08e64b000 rw-   37   37   1   0 ----- sw
97156     0x2eb08e800000     0x2eb08f021000 rw- 1031 29378   3   0 ----- sw
97156     0x2eb08f021000     0x2eb08f022000 rw-    1 29378   3   0 ----- sw
97156     0x2eb08f022000     0x2eb095f22000 rw- 28346 29378   3   0 --S-- sw
97156     0x2eb096000000     0x2eb09d000000 rw- 20411 20411   1   0 --S-- sw

After the change:

17423     0x258f55e00000     0x258f55e6c000 rw-   38   39   2   0 ----- sw
17423     0x258f55e6c000     0x258f55e6d000 rw-    1   39   2   0 ----- sw
17423     0x258f56000000     0x258f5d700000 rw- 29483 29483   1   0 --S-- sw
17423     0x258f5d800000     0x258f67000000 rw- 27646 27646   1   0 --S-- sw

In particular, a smaller allocation of 0x21000 bytes got folded into the first map entry, instead of getting tacked onto the end of a later entry that would have otherwise ended on a superpage boundary.

In D39845#1039475, @alc wrote:

I'm going to suggest the following as the final resolution for this issue. It addresses the fragmentation that I described in my last message, and it handles the scenario that you brought up wherein we run out of free space at the current anon_loc, perform an ASLR restart, and want to avoid repeating the ASLR restart on the next mapping.

diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c
index 3c7afcb6642f..fa71bb8a01d6 100644
--- a/sys/vm/vm_map.c
+++ b/sys/vm/vm_map.c
@@ -2247,8 +2247,15 @@ vm_map_find(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
                rv = vm_map_insert(map, object, offset, *addr, *addr + length,
                    prot, max, cow);
        }
-       if (rv == KERN_SUCCESS && update_anon)
-               map->anon_loc = *addr + length;
+
+       /*
+        * Update the starting address for clustered anonymous memory mappings
+        * if a starting address was not previously defined or an ASLR restart
+        * placed an anonymous memory mapping at a lower address.
+        */
+       if (update_anon && rv == KERN_SUCCESS && (map->anon_loc == 0 ||
+           *addr < map->anon_loc))
+               map->anon_loc = *addr;
 done:
        vm_map_unlock(map);
        return (rv);
@@ -4041,9 +4048,6 @@ vm_map_delete(vm_map_t map, vm_offset_t start, vm_offset_t end)
                    entry->object.vm_object != NULL)
                        pmap_map_delete(map->pmap, entry->start, entry->end);
 
-               if (entry->end == map->anon_loc)
-                       map->anon_loc = entry->start;
-
                /*
                 * Delete the entry only after removing all pmap
                 * entries pointing to its pages.  (Otherwise, its

Please commit.

This revision was not accepted when it landed; it landed in state Needs Review. Jun 13 2024, 8:16 PM.
This revision was automatically updated to reflect the committed changes.