This is the work referenced by https://reviews.freebsd.org/D29531#663885, plus changes for the review feedback on D29531 that were relevant here as well (specifically, @kib's commenting input and @rpokala's + @imp's copyright header input).
Details
Thus far, just basic testing on top of QEMU 4.2.1 + kernel 5.8.0 on Ubuntu 20.04 on Intel and AMD HW via:
- date
- syscall_timing
- sysctls debug.clock_show_io, debug.clock_do_io, kern.timecounter.hardware, and kern.timecounter.fast_gettime
- Using dtrace to sanity-check that sys_gettimeofday()/sys_clock_gettime() were/weren't being called as expected.
- zzz(8) plus the QEMU system_powerdown and system_wakeup commands (for test-driving the suspend/resume entry points).
I also tried the sysbench invocation in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=216759#c11 and can confirm similar differences between using the HPET and using the kvmclock or TSC.
An example including some timing info:
user@test:~ % sysctl kern.timecounter kern.timecounter.tsc_shift: 1 kern.timecounter.smp_tsc_adjust: 0 kern.timecounter.smp_tsc: 0 kern.timecounter.invariant_tsc: 0 kern.timecounter.fast_gettime: 1 kern.timecounter.tick: 1 kern.timecounter.choice: i8254(0) ACPI-fast(900) HPET(950) kvmclock(975) TSC(-100) dummy(-1000000) kern.timecounter.hardware: kvmclock kern.timecounter.alloweddeviation: 5 kern.timecounter.timehands_count: 2 kern.timecounter.stepwarnings: 0 kern.timecounter.tc.i8254.quality: 0 kern.timecounter.tc.i8254.frequency: 1193182 kern.timecounter.tc.i8254.counter: 59794 kern.timecounter.tc.i8254.mask: 65535 kern.timecounter.tc.ACPI-fast.quality: 900 kern.timecounter.tc.ACPI-fast.frequency: 3579545 kern.timecounter.tc.ACPI-fast.counter: 3543295 kern.timecounter.tc.ACPI-fast.mask: 16777215 kern.timecounter.tc.HPET.quality: 950 kern.timecounter.tc.HPET.frequency: 100000000 kern.timecounter.tc.HPET.counter: 1966409884 kern.timecounter.tc.HPET.mask: 4294967295 kern.timecounter.tc.kvmclock.quality: 975 kern.timecounter.tc.kvmclock.frequency: 1000000000 kern.timecounter.tc.kvmclock.counter: 3373140415 kern.timecounter.tc.kvmclock.mask: 4294967295 kern.timecounter.tc.TSC.quality: -100 kern.timecounter.tc.TSC.frequency: 1607988965 kern.timecounter.tc.TSC.counter: 513410333 kern.timecounter.tc.TSC.mask: 4294967295 user@test:~ % syscall_timing gettimeofday Clock resolution: 0.000000002 test loop time iterations periteration gettimeofday 0 1.000971171 24698568 0.000000040 gettimeofday 1 1.007769518 24866552 0.000000040 gettimeofday 2 1.001071247 24700412 0.000000040 gettimeofday 3 1.006630289 24832878 0.000000040 gettimeofday 4 1.003202891 24769714 0.000000040 gettimeofday 5 1.004513069 24790162 0.000000040 gettimeofday 6 1.005332074 24833956 0.000000040 gettimeofday 7 1.002384739 24735470 0.000000040 gettimeofday 8 1.007455964 24884169 0.000000040 gettimeofday 9 1.000256110 24708625 0.000000040 user@test:~ % sudo sysctl kern.timecounter.fast_gettime=0 kern.timecounter.fast_gettime: 1 -> 0 user@test:~ % syscall_timing gettimeofday Clock resolution: 0.000000002 test loop time iterations periteration gettimeofday 0 1.062566203 8546464 0.000000124 gettimeofday 1 1.037318670 8305737 0.000000124 gettimeofday 2 1.050000947 8410842 0.000000124 gettimeofday 3 1.049828515 8405755 0.000000124 gettimeofday 4 1.062561943 8522758 0.000000124 gettimeofday 5 1.037252041 8295647 0.000000125 gettimeofday 6 1.009902776 8103514 0.000000124 gettimeofday 7 1.029863208 8271323 0.000000124 gettimeofday 8 1.059883267 8462669 0.000000125 gettimeofday 9 1.009913686 8092699 0.000000124 user@test:~ % sudo sysctl kern.timecounter.fast_gettime=1 kern.timecounter.fast_gettime: 0 -> 1 user@test:~ % sudo sysctl kern.timecounter.hardware=TSC kern.timecounter.hardware: kvmclock -> TSC user@test:~ % syscall_timing gettimeofday Clock resolution: 0.000000001 test loop time iterations periteration gettimeofday 0 1.039693055 39156300 0.000000026 gettimeofday 1 1.059332059 39727049 0.000000026 gettimeofday 2 1.040353569 39121632 0.000000026 gettimeofday 3 1.050123063 39481224 0.000000026 gettimeofday 4 1.049670668 39530649 0.000000026 gettimeofday 5 1.062567179 39960000 0.000000026 gettimeofday 6 1.037282737 38908035 0.000000026 gettimeofday 7 1.062564063 39918185 0.000000026 gettimeofday 8 1.037147904 39045152 0.000000026 gettimeofday 9 1.009909895 38041308 0.000000026 user@test:~ % sudo sysctl kern.timecounter.fast_gettime=0 kern.timecounter.fast_gettime: 1 -> 0 user@test:~ % syscall_timing gettimeofday Clock resolution: 0.000000001 test loop time iterations periteration gettimeofday 0 1.000343476 9173210 0.000000109 gettimeofday 1 1.007733227 9233248 0.000000109 gettimeofday 2 1.051675824 9637613 0.000000109 gettimeofday 3 1.003205497 9195701 0.000000109 gettimeofday 4 1.006564792 9226628 0.000000109 gettimeofday 5 1.001095929 9169797 0.000000109 gettimeofday 6 1.007764181 9224751 0.000000109 gettimeofday 7 1.000836593 9169834 0.000000109 gettimeofday 8 1.006850768 9220051 0.000000109 gettimeofday 9 1.002965937 9190431 0.000000109 user@test:~ % sudo sysctl kern.timecounter.fast_gettime=1 kern.timecounter.fast_gettime: 0 -> 1 user@test:~ % sudo sysctl kern.timecounter.hardware=HPET kern.timecounter.hardware: TSC -> HPET user@test:~ % syscall_timing gettimeofday Clock resolution: 0.000000011 test loop time iterations periteration gettimeofday 0 1.001976650 361817 0.000002769 gettimeofday 1 1.007771340 363755 0.000002770 gettimeofday 2 1.007759790 364436 0.000002765 gettimeofday 3 1.002167610 361223 0.000002774 gettimeofday 4 1.005514150 363170 0.000002768 gettimeofday 5 1.004283770 363254 0.000002764 gettimeofday 6 1.003406260 361886 0.000002772 gettimeofday 7 1.006412230 362752 0.000002774 gettimeofday 8 1.049863580 378690 0.000002772 gettimeofday 9 1.006341930 363179 0.000002770 user@test:~ % sudo sysctl kern.timecounter.fast_gettime=0 kern.timecounter.fast_gettime: 1 -> 0 user@test:~ % syscall_timing gettimeofday Clock resolution: 0.000000011 test loop time iterations periteration gettimeofday 0 1.005940780 339403 0.000002963 gettimeofday 1 1.062578460 359351 0.000002956 gettimeofday 2 1.037196880 350767 0.000002956 gettimeofday 3 1.062579500 359486 0.000002955 gettimeofday 4 1.037203790 351224 0.000002953 gettimeofday 5 1.009921970 342757 0.000002946 gettimeofday 6 1.029915260 334432 0.000003079 gettimeofday 7 1.059815440 345772 0.000003065 gettimeofday 8 1.009940770 329601 0.000003064 gettimeofday 9 1.062575370 347237 0.000003060 user@test:~ %
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Passed - Unit
No Test Coverage - Build Status
Buildable 38951 Build 35840: arc lint + arc unit
Event Timeline
Perhaps we drop 11.x from the port what with 11.x soon to be EOL'ed---does that seem reasonable to you, too?
Thanks and sorry for the delay! Mostly nits, I think the only major question is whether you need to use the dma system to allocate the memory for the vcpu time infos. I think it would be fine to just use plain malloc.
lib/libc/x86/sys/__vdso_gettc.c | ||
---|---|---|
104 | I think the inline here is IMO misleading, as you end up taking a pointer to the function when filling the tsc_selector struct, so the inline is just dropped. | |
358 | I think it would be enough to just say that the TSC_STABLE is mandatory because TSCs must be synchronized across CPUs or else using vcpu time info of CPU 0 on for all CPUs would be wrong. | |
sys/dev/kvm_clock/kvm_clock.c | ||
151 | Do you really need to use the dma subsystem for this? Isn't it enough to just malloc a region of memory and use it? You could then also use M_WAITOK maybe? | |
sys/x86/x86/pvclock.c | ||
260 | Given the list of arguments here I think we should consider creating a structure and passing a pointer to it, or rather fill some of the pvclock struct fields by the caller? You could split the fields of pvclock between the public ones to be filled by the caller and the private ones. | |
266 | Nit: I usually try to avoid splitting such log messages because grepping for them afterwards is not possible. I would rather try to place the message starting on the newline as you won't have to split it then. | |
268 | I think this should return EACCES as the file only allows opening in read-only mode. | |
316 | Given the pvclock has a resolution of 1ns and that's fixed I think it would be fine to just hardcode this as 1us. It's not like PVCLOCK_RESOLUTION_NS can or will be modified anyway? | |
317 | Couldn't vDSO became functional at some later point (ie: when migrated to a different host) even if the current vcpu_time_info reports non-stable? I wonder whether we should just check if the stable flag is supported instead of also checking if it's currently set. |
Thanks a lot for this latest round of feedback, @royger! Revised accordingly and/or discussion inline.
lib/libc/x86/sys/__vdso_gettc.c | ||
---|---|---|
358 | I decided to just remove this comment altogether. It was basically just attempting to point out how this flag can theoretically change at any time, which would explain how we could arrive at this location from binuptime() even though binuptime(), indirectly via tk_enabled, just observed PVCLOCK_FLAG_TSC_STABLE as set. But I don't think it's as subtle of a point as I must have when I originally wrote this, and I can see how this comment might confuse if __vdso_pvclock_tsc() is read separately from binuptime(). | |
sys/dev/kvm_clock/kvm_clock.c | ||
151 |
I'd decided against malloc() since it doesn't provide API-level alignment guarantees; while, implementation-wise, it does look like >= PAGE_SIZE-ed malloc(9) allocations will be page-aligned, my read of malloc(9) (particularly section IMPLEMENTATION NOTES) is that this should not be assumed by callers. (FWIW, I had also noted (1) the precedent of hyperv's use of BUS_DMA(9) for its corresponding PV clock in-memory data and (2) internally, this draft's use of BUS_DMA(9) will ultimately do the same thing that the first draft was doing with its direct use of the physical memory pager). Just recapping, the intention is that timeinfos be readily mmap-able---the allocation containing timeinfos should start and end on a page boundary (with any unused portion of the last page zeroed). Does this choice make sense or am I missing something?
This choice was considering use cases where the driver is loaded during boot (whether compiled in or via loader.conf(5)). My thinking was that a failure to satisfy a relatively small allocation like this during boot would be an indication of far greater problems and, as such, it wouldn't be worth adding any blocking+reclaims+retries to the boot process, especially since the system can still function without kvmclock. But I suppose it's also ok to allow the reclaims+retries, so, I've switched bus_dmamem_alloc() from BUS_DMA_NOWAIT to BUS_DMA_WAITOK in this version as per this suggestion. | |
sys/x86/x86/pvclock.c | ||
317 | Right. But pvclock_tc_vdso_timehands{,32}() is called with each vDSO timehands update, so, such scenarios should be supported, no? I think the (pvc->ti_vcpu0_page->flags & PVCLOCK_FLAG_TSC_STABLE) != 0 clause could, indeed, be dropped. But I included it thinking that, in the unstable case, we might as well short-circuit the vDSO codepath sooner than later---with this clause, the vDSO codepath decides to fall back to the syscall codepath after it looks at tk->tk_enabled in binuptime() whereas, without this clause, this decision will happen after the PVCLOCK_FLAG_TSC_STABLE flag check in __vdso_pvclock_tsc(). I obtained a rough measurement of the delta between these two versions of the unstable TSC codepath by inverting the PVCLOCK_FLAG_TSC_STABLE check in __vdso_pvclock_tsc() and then looking at syscall_timing gettimeofday numbers for (1) kern.timecounter.fast_gettime=0 and (2) kern.timecounter.fast_gettime=1. I figure, on my PVCLOCK_FLAG_TSC_STABLE test systems, (1) should roughly simulate the version where the code is left as-is and (2) should roughly simulate the version where this clause is dropped. These syscall_timing gettimeofday numbers from (1) to (2) regressed from 116ns -> 146ns = +30ns on my AMD HW and 119ns -> 141ns = +22ns on my Intel HW. So, potentially non-negligible and makes sense to keep this clause in place (or achieve the same outcome in some other way)? What do you think? |
It looks like we need this for EC2 "T3" and "T3A" family instances, where apparently it's possible for a VM to bounce between CPUs running at different clock speeds(!!!) and the TSC-low timecounter is running into problems (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=256781).
(I don't have anything to contribute to reviewing the change, just wanted to flag the EC2 environment as one where this is relevant.)
Just replied to the most controversial part I think, will try to review the rest tomorrow. Thanks!
sys/dev/kvm_clock/kvm_clock.c | ||
---|---|---|
151 |
Right, TBH I think malloc returning and address not aligned to a page boundary when allocating a region >= PAGE_SIZE would likely be a bug in the implementation.
Well, I think it's overly complicated (and cumbersome due to all the initialization involved) to use bus_dma just to allocate a memory region that's page aligned. I'm sure there are already instances in-kernel that assume that malloc(PAGE_SIZE) returns a page. I wouldn't be opposed to using: sc->timeinfos = malloc(size, ...); if ((sc->timeinfos & PAGE_MASK) != 0) { device_printf(..., "failed to allocate aligned memory region"); KASSERT(0, ("Failed to allocate page aligned region")); return (ENOMEM); } But I understand others might have different opinions. Maybe @kib could provide some insight on whether there's another way to allocate aligned memory regions without having to resort to bus_dma, I don't have that much insight on the VM subsystem.
Likely, if we fail to satisfy such small allocation it's likely the kernel won't be able to finish booting anyway. Adding a comment as to why BUS_DMA_NOWAIT is used (ie: not because the context prevents using BUS_DMA_WAITOK) would be OK also.
|
sys/dev/kvm_clock/kvm_clock.c | ||
---|---|---|
151 | While malloc(9) can't give these guarantees, uma(8) can. Creating the tag, allocating and loading it are basically creating a uma zone behind the scenes and using them directly would be better than using busdma just to get alignment. |
sys/dev/kvm_clock/kvm_clock.c | ||
---|---|---|
151 | We have malloc_domainset_aligned(), added exactly to export the alignment guarantees from the malloc subsystem. |
sys/dev/kvm_clock/kvm_clock.c | ||
---|---|---|
151 | Thanks! That's exactly what's needed here. I think this function has not been added to the malloc(9) man page? Would it make sense to also add some syntactic sugar and introduce a: #define malloc_aligned(s, a, t, f) malloc_domainset_aligned(s, a, t, DOMAINSET_RR(), f) Or similar. |
sys/dev/kvm_clock/kvm_clock.c | ||
---|---|---|
151 | Of course, feel free to add. There were no users for it until now. |
Yes, thanks a lot, @kib, for pointing out malloc_domainset_aligned()! Much better! (Doh---I think I did see this a while back but misinterpreted its relative lack of use and absence from malloc(9) as discouraging new uses).
I've switched to malloc_domainset_aligned() for now so that this work isn't potentially blocked on separate submission(s) to add a malloc_aligned() and update malloc(9) etc, figuring we can easily switch over to such a malloc_aligned() if/when appropriate.
lib/libc/x86/sys/__vdso_gettc.c | ||
---|---|---|
363 | I do not quite follow this. Why do you need to set pvclock_vcpu_info to MAP_FAILED in advance, during init? Basically, this introduces somewhat unpleasant and very hard to diagnose case, where two threads happen to execute e.g. gettimeofday(), and then one of them falls back to syscall. Unpleasant part is that on the next gettimeofday() call this thread would use vdso path, and kernel vs. usermode timecounters might be relatively off. I tried to avoid this with HPET. Also note use of acq/rel atomics to not rely implicitly on x86 TSO. | |
sys/dev/kvm_clock/kvm_clock.c | ||
96 | This blank line is excessive. | |
227 | Excessive blank line. | |
229 | And this one. It would be too much spam to comment about all of them. | |
253 | Why do you need newbus attachment for this thing? Due to the clock kobj interface? But is it required? | |
sys/x86/x86/pvclock.c | ||
244 | Please do not use these linuxisms. I suspect you need load_acq(&wc->version) instead of these two lines ... | |
247 | .. and then atomic_thread_fence_acq() instead of this rmb(). But where is the writer? |
sys/x86/x86/pvclock.c | ||
---|---|---|
247 | The writer is the hypervisor. |
Thanks a lot for the review/feedback, @kib! I'm thinking about the questions raised by the init-related item; this revision attempts to address the other items in the meantime.
lib/libc/x86/sys/__vdso_gettc.c | ||
---|---|---|
363 | hm... Yeah, this was assuming that it's always "ok"---in that a thread will never see a syscall-based time reading that is less than a previous syscall- or vDSO-based time reading---for a thread to fall back to the syscall path and, based on that assumption, was deliberately allowing the described race in exchange for simplified init code. If it can't be assumed to be safe to switch back and forth between the vDSO and syscall codepaths, then maybe we can't currently support the vDSO codepath for this clock source as the timekeeping code presently stands. (I'll think about this more). But doesn't the hyperv PV TSC vDSO codepath make this same assumption, since a thread can end up switching between the vDSO and syscall codepaths at any time based on whether TscSequence == 0 (hyperv_ref_tsc->tsc_seq == 0)? I want to make sure I have a correct and detailed understanding of the specific vDSO/syscall mismatch problem(s) you're referencing. Would you be able to provide example(s) of the specific way(s) that syscall-based and vDSO-based time values can differ that are being referenced here? I think I can see a case where two calls to clock_gettime() for CLOCK_MONOTONIC_FAST or CLOCK_UPTIME_FAST---which only use th_offset without adding the subsequent delta---could lead to the second call's value being less than the first. This would happen if a thread managed to make a syscall-based reading followed by a vDSO-based reading that both occurred after the in-kernel timehands update but before the corresponding vDSO-based timehands update. Is this an example of the sorts of cases you're thinking of? Am I understanding correctly if I'm thinking that this same scenario but for CLOCK_{MONOTONIC,UPTIME}{,_PRECISE}---which use th_offset plus a th_offset_count-relative delta---would not have this problem because the th_offset_count-based delta that would be included in each reading would be relative to its respective th_offset value? | |
sys/dev/kvm_clock/kvm_clock.c | ||
253 | Right, all of this is as per the clock/RTC subsystem (sys/clock.h, kern/subr_rtc.c), which does use the clock kobj interface's entry points. |
Switch the vDSO and syscall codepaths over to a common (and freestanding) version of __vdso_gettc_rdtsc().
(A previous revision accidentally switched the vDSO codepath from __vdso_gettc_rdtsc() to rdtsc(); this fixes that).
Can you extract the userspace/kernel code unification with rdtsc_ordered.h into separate patch?
sys/x86/include/_rdtsc_ordered.h | ||
---|---|---|
50 ↗ | (On Diff #92030) | We do not indent function definitions in macros typically. Please remove these 4 spaces. |
77 ↗ | (On Diff #92030) | There is absolutely no reason to defined this macro. Both kernel and userspace variants are used once, so you could just put the corresponding code directly into libc and kernel, and avoid obfuscation through the macro. In fact, I think that the better approach is to avoid macros at all. Put the code you want to share (as I understand, this is DEFINE_RDTSC_ORDERED_COMMON) into e.g. sys/x86/include/rdtsc_fenced.h directly, and include it into libc "vdso" and kernel rdtsc.h |
Rebase, address @kib's review feedback, and related clean-up.
(Regarding rdtsc_ordered() and rdtscp_aux()---I plan to split these into separate submissions, but, before doing so, figure we might as well wait to see if they are still present in an approved version of this submission).
No functional changes intended. Since the last revision:
- Rebase again to pick up malloc_aligned() now that D31004 has been committed.
- kvm_clock_attach(): Use malloc_aligned() instead of malloc_domainset_aligned().
- Split out freestanding portions into separate submissions.
sys/dev/kvm_clock/kvm_clock.c | ||
---|---|---|
113 | if ((regs[0] & (KVM_FEATURE_CLOCKSOURCE | KVM_FEATURE_CLOCKSOURCE2)) == 0) | |
144 | Should this case be assert? Is it possible that _CLOCKSOURCE feature disappears? | |
149 | malloc() for PAGE_SIZE and larger is a strange choice. Why not use kmem_malloc()? For instance, ldt allocation on x86 uses kmem_malloc(), and it is quite similar to kvm clock case. |
sys/dev/kvm_clock/kvm_clock.c | ||
---|---|---|
149 | I see that kmem_malloc(), internally, always returns page-aligned allocations, but does it formally guarantee this condition at the API level like malloc_{,domainset_}aligned(9) do? There also seem to be so few uses of it relative to malloc(9), especially among drivers---should we maintain that precedent here? |
sys/dev/kvm_clock/kvm_clock.c | ||
---|---|---|
149 | Yes, the kmem_malloc() KPI fundamental property is that it operates on the page granularity. As I said, the intent there (kvmclock) matches the KPI purpose perfectly. It is one-off allocation from the platform-specific driver, which is tightly integrated with the rest of the system. |
sys/dev/kvm_clock/kvm_clock.c | ||
---|---|---|
149 | Ok, sounds good, then. Thanks for the clarification! |