vmstat -i was implementing an interface rather distinct from what the kernel had.
Details
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
Trying for random reviewers since I don't have a good idea of who to ask for review/commit for this.
Ran into this when trying to figure out what isn't right with D36610. The unreported information is crucial to figuring out what is occurring.
usr.bin/vmstat/vmstat.c | ||
---|---|---|
1293–1294 | I actually think if(intrcnt||aflag) { may be better for two reasons. First, an interrupt which has occurred is interesting no matter whether a name has been set or not. Second, this better matches how the man page describes the -a option, "When used with -i, include statistics about interrupts that have never been generated." No where is it mentioned interrupts without names assigned are never reported (despite effecting the total count). |
So what is printed if the interrupt is unnamed? Please show an example of vmstat -i output before/after the change.
Ugh, type up most of how I wanted to respond and then Phabricator drops it on the floor when one thing needs doing...
I don't have proper output after this change since I'm trying to accomplish a task with sysctl("hw.intrnames")/sysctl("hw.intrcnt") and my tree isn't in quite a proper state. I do have an example of rather broken output generated by assumptions vmstat has been adding to the strings produced:
# vmstat -i interrupt total rate cpu69:ast 1 0 Total 9314 274 #
Issue is due to vmstat's assumptions it was losing the titles of the more active interrupts, so one couldn't tell whether that count was evenly split between several interrupts, or mostly on a single interrupt. Nor could one guess whether it was at the start or end. Knowing that can could give better hints as to what is wrong.
I'm going to have to go hunting for the assumptions and squash them.
There is a fair bit of history in vmstat -i. Some portions really needed updates long ago. I think I've fixed the additional major problems. Big one was it was using an odd algorithm to identify interrupt name length which didn't really work too well.
Would you prefer to review the extra 3 small commits in one pass/review? Have each as a separate Phabricator review?
I do not quite understand what would be printed for the "\0" interrupt name (I am trying to avoid libxo as much as I can).
I'm not trying to avoid libxo, but I'm not trying to engage it either; I suspect it is similar to Python's string formatting. Don't think about libxo there though, think like the typical C programmer. intrnames[0] == '\0' => empty string, thus blank followed by numbers.
I finally managed to sort things out with my builds and vmstat. There is a troublesome situation here. vmstat's interpretation of sysctl("hw.intrnames") has gotten badly out of synch with the kernel's presentation of sysctl("hw.intrnames"). I'm distinctly surprised the current version of vmstat -i works at all given how far apart things are.
vmstat's interpretation appears to be sysctl("hw.internames") is a series of asciz strings separated by single \0. Thus intrname += strlen(intrname) + 1; serves to get to the next name.
All 3 kernel interrupt subsystems interpret intrnames (which presently resides behind sysctl("hw.intrnames")) as a series of fixed-length fields. As such intrname += sizeof(intrnames) / nitems(intrcnt) would serve to get to the next name.
Appears vmstat -i continues to work mostly by dumb luck. The first few entries tend to have longer names, thus the field width gets set right. The first few entries are also the ones which accumulate most interrupts. I'm ending up tickling this issue due to D36610.
Which interpretation is correct and uses should conform to? I'm up for either, but I need to know which is correct. Should I create a new Phabricator review for each commit? Should I merge the 4 additional commits be part of this review?
usr.bin/vmstat/vmstat.c | ||
---|---|---|
1293–1294 | Having now done enough builds to figure out the actual issues, my vote is now for if (*intrcnts != 0 || (aflag >= 1 && intrnames[0] != '\0') || aflag >= 2) {. |
You may have four commits and one review. Make sure to send properly split and formatted patches to a committer.
I believe that the ABI compatibility there is the best choice, i.e. vmstat interpretation must be supported by kernel, and not vmstat modified to the kernel behavior.
Hopefully fusing these together doesn't make your job much harder...
Marking the spots. What the FreeBSD kernel has been doing for more than a decade was to return a series of fixed-length strings for sysctl("hw.intrnames"). Yet in these two spots it appears vmstat has been acting as if it was strings separated by zero bytes.
I'm rather astonished vmstat's output wasn't looking like line noise. Yet somehow this had managed to sneak past everyone's attention for years.
usr.bin/vmstat/vmstat.c | ||
---|---|---|
1307 | This is one spot. | |
1352–1353 | Fixed by getting the size before. | |
1359 | This is one spot. |
So lets start with simple part. Can you extract and mail me git-formatted patch that removes ncpus and changes getcpuinfo() return type to void? I think it is the easiest bit to get in.
usr.bin/vmstat/vmstat.c | ||
---|---|---|
1267 | Is there a caller for read_intrcnts() that passes NULL? |
Hmm, not quite how I thought this was going to work. I'm wondering what action, if any, I'm supposed to take with this now.
I guess this is being done this way.
usr.bin/vmstat/vmstat.c | ||
---|---|---|
1267 | There wasn't in the original, but it seemed a sensible case to handle. Then fixing the issue of mishandling the kernel interface is best handled by using this. |
Note this is out there at https://gitlab.com/ehem/freebsd-src.git the branch "D36628" (https://gitlab.com/ehem/freebsd-src/-/commits/D36628/).
You did exactly the opposite to what I suggested: you adjusted userspace to kernel interface, instead of fixing the ABI and keeping old vmstat -i working, by fixing kernel. I do not want to enforce my opinion, but at least I thin we need to get the rationale for going this way instead.
usr.bin/vmstat/vmstat.c | ||
---|---|---|
1264 | else if can be put on the same line | |
1271 | I am not sure that removing the loop altogether is wise. If a new interrupt is added between intrcntlen query and fetching content of hw.intrcnt, vmstat(1) now exits with Out of memory instead of redoing the fetch. | |
1344 | Same note about removing the loop |
And I thought I had responded, but hadn't. Sorry.
I'm troubled by what I see in source history. Both the current kernel and current vmstat implementations appear to predate 2003. That is long enough for there to be multiple userspace uses of the sysctl(). As such I suspect we should be asking are there more implementations of the interface vmstat uses or more implementations of what the kernel has been providing?
I'm inclined to suspect there could be dozens of implementations in the ports collection and they would likely match with current kernel behavior.
Either direction will cause some degree of breakage, I suspect preserving the kernel interface will cause less. If you're wondering, I ran into this due to D36610 (which will then need an update for reading kernel cores).
usr.bin/vmstat/vmstat.c | ||
---|---|---|
1264 | I think this is better since this clause is an alternative to the kread(), not associated with the parent if. Mainly this is to ensure we don't drop an error return from mysysctl(). | |
1271 | The interrupt list doesn't change very much, so this will be extremely rare. The original reason for the loop appears to be this predates the behavior of passing a NULL buffer to sysctl() giving the data length. Though yes people do win lotteries. | |
1344 | And the same response. |
Thanks to @jrtc27 pointing out the key line I had been spacing on, turns out both of these presently match the kernel. All 3 interrupt frameworks had been handling intrnames as a series of fixed-length strings, but had been space-padding them.
This looks like it was meant as a mechanism to transition from packed \0 separated strings to fixed-length strings. Just in the 20 intervening years no one had gotten around to fixing vmstat. I suspect handling them as fixed-length strings is easier for pretty well everyone and it is overdue to finish this transition.
It is entirely possible to continue space-padding, just I suspect this is better for pretty well everyone. Someone needs to make the call for what 14.0 will have, I'm strongly in favor of finishing this transition.
Finally tracked down an actual bug I'd been running into. Since I fear you may wish to take small steps and fix strictly what is buggy, I've split this into a separate commit at the front of my repository. This is clearly a bug with vmstat since this violates the sysctl() interface.
usr.bin/vmstat/vmstat.c | ||
---|---|---|
1355 | Due to wanting to be prepared for a compatibility situation, I finally identified the bug I had been running into. This test is wrong since it will run off the end of the buffer if the table is full. |
Depending on the direction this takes, the bug is already fully fixed by what I've got.
The real issue is "hw.intrnames" isn't documented anywhere and some other program which needed this information might use 3 different strategies to iterate through the names. I can't find enough information to suggest which strategy is the proper one to use and needs compatibility.
- The old interface is\0NUL\0separated\0strings\0. vmstat -i continues to use the strategy of += strlen(name) which matches this. Kernel code pads with spaces to maintain this approach. This is almost certainly detrimental to any other potential user. I do admit this does work, but this ancient interface really should be updated.
- The strings could be interpreted as fixed-length with the length being determined by length("hw.intrnames")/length("hw.intrcnt")*sizeof(u_long). Interpreting both as the same length and using the fixed element size of "hw.intrcnt" to compute the length. This is the strategy I've provided an implementation of.
- The strings could be interpreted as being fixed-length with the length MAXCOMLEN + 1. This is based on the constant being exported by kernel headers. Disadvantageous as stray counters couldn't have longer names, but very simple to use.
I suspect option 3 is the best for the greatest number of users. In which case I need to update D36628 to match, since vmstat -i should serve as an example of what strategy should be used. Padding with spaces maintains compatibility, but compatibility with an interface which is adverse to most uses. I initially implemented option 2 since at first it looked to have nice properties.
Answering based on your description.
Option 2 depends on the arch' sizeof(u_long), which means that e.g. COMPAT32 requires shims or does not work. Option 3 means that we use unrelated static constant which I do not like as well. For me, the option 1 has the advantages of being backward compatible simply by its nature, and not having the drawbacks I listed above.
Also, the races I listed in the previous rounds of review, should be fixed to move this forward.
All 3 are presently backwards-compatible to 2003 FreeBSD kernels (at a minimum) due to how things have been implemented.
The reason sizeof(u_long) matters to option 2 is the number of elements in "hw.intrcnt" is length(sysctl("hw.intrcnt"))/sizeof(u_long (if a process is 32-bit sysctl_intrcnt() already provides a list of uint32_t instead of uint64_t). MAXCOMLEN + 1 is hardly unrelated as all the implementations of intrnames feature that as their entry length (all 3 interrupt subsystems use the same constant which is published in headers).
My big concern with option 2 is there is a race between retrieving "hw.intrcnt" and "hw.intrnames". If one was added or removed between those two, then the size calculation would fail (though that would merely garble output).
Presently this question has a greater effect on implementation, so I'm more worried here.
I'm also pretty sure any other program trying to make use of this would prefer entries of the same size to \0 separated ones.
Make that any kernel after ecee5704ed500a18a49850acb275aa4ff463bdd3, since intrcnt_setname() has been copied too all architectures. This seems painful for pretty well all uses.
I strongly dislike this interface since it is inferior for most consumers. I will admit it can work if bugs get sorted out.
usr.bin/vmstat/vmstat.c | ||
---|---|---|
1352 | This is an outright bug. Without this goes off the end of the buffer provided by sysctl(). |
Could you please update the review summary to exactly describe what is the problem and how is it solved?
usr.bin/vmstat/vmstat.c | ||
---|---|---|
1352 |
usr.bin/vmstat/vmstat.c | ||
---|---|---|
1287 | The changes to this function, by my reading, are non-functional, and just delete unnecessary variables. It's subjective whether or not it's better style, though. |
The summary dated from a point where it was going bonkers due to issues. Two distinct interfaces had been mixed together and I was having a hard time deciding which was correct and should be maintained.
The end of the buffer generated by sysctl() is identified by intrnames + inamlen. Looking for a zero-length string is a bug, since the kernel interface has never guaranteed there would be such at the end. Just in the present state unless the entire in-kernel intrnames table was full, there would be at least one zero-length string at the end of the allocated portion. I was trying to trim what sysctl("hw.intrnames") dumped on userspace and found this overrun.
usr.bin/vmstat/vmstat.c | ||
---|---|---|
1287 | Agreed. I dislike random non-functional variables. If @kib wants to keep them, then so be it. |
usr.bin/vmstat/vmstat.c | ||
---|---|---|
1287 | If you're changing it though it should be in a different patch, as it's a separate function from the one you're making a functional change in. |
vmstat -i was implementing an interface rather distinct from what the kernel had.
This summary is pretty much useless. *What* was wrong about the implementation, and *how* have you fixed it? Once you rip out the separate non-functional print_intrcnts change from this you're left with a one-line diff that's pretty easy to understand (I believe it's _always_ hit when parsing a core file, but otherwise only hit when hw.intrnames perfectly fills a power-of-two buffer >= 1024 bytes in size), but you should still have a proper commit message.
The Phabricator summary got out of synch since I was having to play hunt down the bug(s). There were multiple issues and multiple problems. One concern is, is it better to always have 1 git commit == 1 Phabricator review? Is it perhaps sometimes better to use one review for several closely overlapping issues?
The bug manifests when every entry in sysctl("hw.intrnames") (=> in-kernel intrnames) is allocated. If that is the case the loop will run off the end of the buffer and won't terminate until it runs into two sequential 0-bytes. If this occurs clen could end up large (since strlen() of random memory could be large), which then causes vmstat -i's output to be garbled. I was running into this since D36610 causes sysctl("hw.intrnames") to only provide the valid entries.
Ah, ok, inamlen will have been updated to the actual length returned by the kernel, so the power-of-two thing isn't relevant, other than to determine whether it's immediately an out-of-bounds access or an uninitialised memory access. Both are bad, and the latter will often lead to the former.
And no, it doesn't have to be one review == one commit, but it generally is. When it's not, the set of changes should make sense together, which they may have done if you'd actually got a proper description, but as things stand you have to trawl through the entire diff to figure out what's what. Normally you'd also provide the list of commits in that case, whether as a link to a branch that has the review broken down into commits with proper messages or as just a text dump of the commit messages.
The summary must explain what was wrong, and what was fixed.
Please send me two git-patches with proper commit messages and author metadata. One is for the first chunk, with the removal of unneeded vars (I suppose there is no behavior change from it), and another with the fix itself.