On amd64, the pmap code passes all_cpus to smp_targeted_tlb_shootdown() when unmapping from the kernel map. This function has an optimized path that sends IPIs to all CPUs but itself, which it intends to take when the target is all CPUs. However, we need to compare the target CPU mask with all_cpus rather than use CPU_ISFULLSET(). The CPU_ISFULLSET() check only succeeds when MAXCPU CPUs are active in the system; otherwise, we send repeated IPIs, one per target CPU, rather than a single IPI to all CPUs but ourselves.
Fixing this should reduce the time spent in native_lapic_ipi_wait, as we will be sending IPIs in parallel rather than one by one. This result is confirmed by dtrace.
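
The difference between the two checks can be sketched with a small userland model. This is not the kernel code or the change itself: the model_* names, the 64-bit mask type, and MODEL_MAXCPU are hypothetical stand-ins for cpuset_t, MAXCPU, and the cpuset macros, chosen only to show why a "full set" test fails when fewer than MAXCPU CPUs are present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the kernel's cpuset API. */
#define MODEL_MAXCPU 64             /* compile-time limit, playing the role of MAXCPU */

typedef uint64_t model_cpuset_t;    /* one bit per possible CPU id */

/* Build the set of CPUs actually present: ids 0 .. ncpus-1. */
static model_cpuset_t
model_all_cpus(unsigned ncpus)
{
    assert(ncpus >= 1 && ncpus <= MODEL_MAXCPU);
    /* Shifting a 64-bit value by 64 is undefined, so handle that case apart. */
    return (ncpus == MODEL_MAXCPU ? UINT64_MAX : (UINT64_C(1) << ncpus) - 1);
}

/* Analogue of a "full set" test: true only if all MODEL_MAXCPU bits are set. */
static bool
model_isfullset(model_cpuset_t mask)
{
    return (mask == UINT64_MAX);
}

/* Analogue of comparing the target mask against the set of present CPUs. */
static bool
model_targets_all(model_cpuset_t mask, model_cpuset_t all_cpus)
{
    return (mask == all_cpus);
}
```

In this model, an 8-CPU machine passing its full set of present CPUs fails model_isfullset() and would miss the optimized path, while model_targets_all() recognizes the same mask as "everyone".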