Introduce fine-grained locking on the dmamap's list of map entries, replacing the use of the domain lock. This is not the most significant source of lock contention, but it is the easiest to address.
Lock profiling of GENERIC-NODEBUG from before this patch:
# lockstat -C netperf -H lip3 -t TCP_MAERTS -i 20 MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lip3 () port 0 AF_INET : +/-2.500% @ 99% conf. : histogram : interval : dirty data : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 65536 32768 32768 10.00 2857.56 Adaptive mutex spin: 62714827 events in 201.680 seconds (310962 events/sec) Count indv cuml rcnt nsec Lock Caller ------------------------------------------------------------------------------- 22577330 36% 36% 0.00 744 dmarhw dmar_domain_unload+0x1a8 22363945 36% 72% 0.00 2122 dmarhw dmar_qi_task+0x10e 8010516 13% 84% 0.00 2123 iodom dmar_domain_free_entry+0x52 4438971 7% 92% 0.00 1606 iodom iommu_bus_dmamap_load_something+0x226 2941346 5% 96% 0.00 1383 iodom iommu_gas_map+0x497 1654506 3% 99% 0.00 605 dmarhw dmar_qi_task+0x1b6 712178 1% 100% 0.00 324 iodom iommu_bus_dmamap_unload+0xca 15949 0% 100% 0.00 628 so_rcv _sleep+0x23e 32 0% 100% 0.00 219 Giant usbd_enum_lock+0x5e 25 0% 100% 0.00 452 Giant usbd_do_request_flags+0x6d4 7 0% 100% 0.00 83 IOMMU_MAP_ENTRY zone_put_bucket+0x28c 6 0% 100% 0.00 4322 process lock proc_to_reap+0x45f 3 0% 100% 0.00 92 IOMMU_MAP_ENTRY cache_alloc+0x596 2 0% 100% 0.00 1501 uhci2 usb_process+0xc1 2 0% 100% 0.00 1390 so_rcv soreceive_generic+0xaf9 2 0% 100% 0.00 287 ix0:TX(3):callout _task_fn_admin+0x17e 1 0% 100% 0.00 138 uhci3 usb_process+0xc1 1 0% 100% 0.00 92 so_rcv tcp_do_segment+0x2f1a 1 0% 100% 0.00 2842 mbuf zone_put_bucket+0x28c 1 0% 100% 0.00 188 ix0:TX(1):callout softclock_call_cc+0xf7 1 0% 100% 0.00 281 ix0:TX(1):callout _task_fn_admin+0x17e 1 0% 100% 0.00 451 ehci1 usb_process+0xc1 1 0% 100% 0.00 87 256 Bucket zone_put_bucket+0x28c ------------------------------------------------------------------------------- Adaptive mutex block: 1944 events in 201.680 seconds (10 events/sec) Count indv cuml rcnt nsec Lock Caller ------------------------------------------------------------------------------- 1925 99% 99% 0.00 3795 dmarhw dmar_qi_task+0x1b6 13 1% 100% 0.00 6375 iodom dmar_domain_free_entry+0x52 5 0% 100% 0.00 7062 dmarhw dmar_qi_task+0x10e 1 0% 100% 0.00 5938 Giant usbd_do_request_flags+0x6d4 ------------------------------------------------------------------------------- ...
This patch eliminates the contention in iommu_bus_dmamap_load_something and iommu_bus_dmamap_unload.
Netperf results for GENERIC-NODEBUG
Best before:
# netperf -H lip3 -t TCP_MAERTS -i 20 MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lip3 () port 0 AF_INET : +/-2.500% @ 99% conf. : histogram : interval : dirty data : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 65536 32768 32768 10.00 3659.93
Best after:
# netperf -H lip3 -t TCP_MAERTS -i 20 MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lip3 () port 0 AF_INET : +/-2.500% @ 99% conf. : histogram : interval : dirty data : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 65536 32768 32768 10.00 3730.82