
vm_pageout: Scan inactive dirty pages less aggressively
Needs Review · Public

Authored by markj on Mon, Jan 6, 7:16 PM.
Tags: None
Details

Reviewers
alc
kib
Summary

Consider a database workload where the bulk of RAM is used for a
fixed-size file-backed cache. Any leftover pages are used for
filesystem caching or anonymous memory. As a result, there is little
memory pressure and the inactive queue is rarely scanned.

Once in a while, the free page count dips a bit below the setpoint,
triggering an inactive queue scan. Since almost all of the memory there
is used by the database cache, the scan encounters only referenced
and/or dirty pages, moving them to the active and laundry queues. In
particular, it ends up completely depleting the inactive queue, even for
a small, non-urgent free page shortage.

This scan might process many gigabytes worth of pages in one go,
triggering VM object lock contention (on the DB cache file's VM object)
and consuming CPU, which can cause application latency spikes.

Having observed this behaviour, I believe we should abort
scanning once we've encountered many dirty pages without meeting the
shortage. In general we've tried to make the page daemon's control loops
avoid large bursts of work, and if a scan fails to turn up clean pages,
there's not much use in moving everything to the laundry queue at once.

Modify the inactive scan to abort early if we encounter enough dirty
pages without meeting the shortage. If the shortage hasn't been met,
this will trigger shortfall laundering, wherein the laundry thread
will clean as many pages as needed to meet the instantaneous shortfall.
Laundered pages will be placed near the head of the inactive queue, so
they will be immediately visible to the page daemon during its next scan
of the inactive queue.
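
As a rough sketch of the mechanism (names such as dirty_skipped and
inactive_weight are illustrative and need not match the identifiers used in
the patch):

```c
#include <stdbool.h>

/*
 * Hedged sketch of the early-abort check, not the actual D48337 change.
 * dirty_skipped counts dirty pages moved to the laundry queue during the
 * current scan; page_shortage is the number of pages still to be freed.
 */
static bool
inactive_scan_should_abort(int page_shortage, long dirty_skipped,
    int inactive_weight)
{
	/*
	 * A weight of 0 restores the old behaviour: keep scanning until
	 * the shortage is met or the queue is exhausted.
	 */
	if (inactive_weight == 0)
		return (false);

	/*
	 * Once enough dirty pages have been queued for laundering without
	 * satisfying the shortage, give up.  Shortfall laundering will
	 * clean pages and return them near the head of the inactive queue,
	 * where the next scan will find them.
	 */
	return (page_shortage > 0 &&
	    dirty_skipped >= (long)page_shortage * inactive_weight);
}
```

With a weight of 1 (the default chosen later in this review), the scan would
give up once roughly page_shortage dirty pages have been pushed to the
laundry queue without the shortage being met.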

Since this causes pages to move to the laundry queue more slowly, allow
clustering with inactive pages. I can't see much downside to this in
any case.
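
For illustration only, the clustering relaxation amounts to accepting
inactive pages as cluster members alongside laundry pages; the predicate
below is a hypothetical simplification of the real clustering logic:

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's page queue indices. */
#define	PQ_INACTIVE	0
#define	PQ_LAUNDRY	2

/*
 * Hypothetical predicate: may a neighbouring page in the given queue be
 * added to a pageout cluster?  Previously only laundry pages qualified;
 * the change also admits inactive pages.
 */
static bool
cluster_candidate(int queue)
{
	return (queue == PQ_LAUNDRY || queue == PQ_INACTIVE);
}
```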

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

markj requested review of this revision. Mon, Jan 6, 7:16 PM

This scan might process many gigabytes worth of pages in one go,
triggering VM object lock contention (on the DB cache file's VM object)
and consuming CPU, which can cause application latency spikes.

I meant to note that this is exacerbated by the page daemon being multithreaded on high core count systems - in this case we had 5 threads all processing the inactive queue over several seconds.

As a side note, I think the PPS calculation in vm_pageout_inactive_dispatch() also doesn't work well in this scenario: it counts the number of pages freed, not the number of pages scanned, so a queue full of dirty and/or referenced pages will result in a low PPS score, which makes it more likely that we'll dispatch multiple threads during a shortage.
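
To make that concrete, here is a simplified model of such a heuristic; this
is not the code in vm_pageout_inactive_dispatch(), only an illustration of
why counting freed rather than scanned pages skews the estimate:

```c
/*
 * Pages-per-second estimate based on pages freed by the last scan.  If
 * the queue is full of dirty or referenced pages, pages_freed is small
 * even though the scan itself ran quickly, so the estimate comes out low
 * and the caller is more likely to dispatch additional page daemon
 * threads.  Counting scanned pages would reflect actual scan throughput.
 */
static unsigned long
freed_pages_per_second(unsigned long pages_freed, unsigned long elapsed_ms)
{
	if (elapsed_ms == 0)
		return (0);
	return (pages_freed * 1000 / elapsed_ms);
}
```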

Permit the inactive weight to have a value of 0, which effectively
restores the old behaviour.

Clamp the weights in the sysctl handler to make a multiplication overflow
less likely.
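
A minimal sketch of such a handler, following the usual sysctl_handle_int()
pattern; the handler name, bounds, and wiring shown here are illustrative and
may not match the patch:

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

/*
 * Clamp the weight to a small range so that expressions of the form
 * "shortage * weight" cannot overflow an int.  The upper bound shown is
 * arbitrary.
 */
static int
sysctl_pageout_weight(SYSCTL_HANDLER_ARGS)
{
	int error, val;

	val = *(int *)arg1;
	error = sysctl_handle_int(oidp, &val, 0, req);
	if (error != 0 || req->newptr == NULL)
		return (error);
	if (val < 0 || val > 65536)
		return (EINVAL);
	*(int *)arg1 = val;
	return (0);
}
```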

Set the inactive weight to 1 instead of 2. In my testing, we are still moving
pages to the laundry quite aggressively, see below, so we don't need the extra
multiplier.

Avoid incrementing oom_seq if there's no instantaneous shortage. Otherwise
it's possible to get spurious OOM kills after an acute page shortage: after the
shortage is resolved, the PID controller will still have positive output for a
period of time and thus will scan the queue. If the inactive queue is full of
dirty pages, the OOM controller will infer that the page daemon is failing to
make progress, but if the shortage has already been resolved, this is wrong.

This problem is not new but is easier to trigger now that we move pages to the
laundry less aggressively.
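
A hedged restatement of the adjusted condition; this approximates
vm_pageout_mightbe_oom(), and the field and helper names may not match the
real code exactly:

```c
#include <sys/param.h>
#include <vm/vm.h>
#include <vm/vm_param.h>
#include <vm/vm_pagequeue.h>

static void
pageout_maybe_count_oom(struct vm_domain *vmd, int page_shortage,
    int starting_page_shortage)
{
	if (page_shortage <= 0 || page_shortage < starting_page_shortage) {
		/* The scan made progress or there was no shortage. */
		vmd->vmd_oom_seq = 0;
		return;
	}
	if (!vm_paging_needed(vmd, vmd->vmd_free_count)) {
		/*
		 * The acute shortage has already been resolved; the
		 * remaining page_shortage reflects the PID controller's
		 * decaying output, so do not treat this scan as evidence
		 * that the page daemon is stuck.
		 */
		return;
	}
	vmd->vmd_oom_seq++;
	/*
	 * An OOM kill is triggered elsewhere once the sequence count
	 * exceeds vm.pageout_oom_seq.
	 */
}
```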

As I understand it, the patch causes the inactive scan to stop even if there is still a page_shortage (>0), hoping that the laundry thread will keep up and do the necessary cleaning. Suppose that we have a mix of dirty anon and file pages and, for instance, no swap (or files backed by a slow device). Then it is possible that for a long time, despite queuing the pages for laundry, they cannot be cleaned, so the page_shortage is not going to go away.
Wouldn't such a patch need to ensure that either the laundry thread makes progress or the inactive scan continues? I understand that the scan would be kicked again, but I mean that the laundry thread should kick it as well if it cannot get rid of the page_shortage.

In D48337#1104688, @kib wrote:

As I understand it, the patch causes the inactive scan to stop even if there is still a page_shortage (>0), hoping that the laundry thread will keep up and do the necessary cleaning. Suppose that we have a mix of dirty anon and file pages and, for instance, no swap (or files backed by a slow device). Then it is possible that for a long time, despite queuing the pages for laundry, they cannot be cleaned, so the page_shortage is not going to go away.
Wouldn't such a patch need to ensure that either the laundry thread makes progress or the inactive scan continues? I understand that the scan would be kicked again, but I mean that the laundry thread should kick it as well if it cannot get rid of the page_shortage.

This is a good point; I did not think about such configurations and need to test further.

I suspect it will not be a major problem, for two reasons. First, once a dirty anon page is moved to the laundry queue, it will stay there until it is freed, so the page daemon will quickly remove such pages from the inactive queue. Moreover, when an anon page is first dirtied, it ends up in the active queue, and it takes a long time to deactivate. Thus, in steady-state operation, I believe the inactive queue will not contain many dirty anon pages.

The second reason is that the PID controller will quickly increase page_shortage if the initial demand is not met, due to the integral term of its output, so the page daemon will still scan a large number of pages, albeit a bit more slowly than before.
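
As a toy illustration of why the integral term keeps the scan target high
(the page daemon's controller is pidctrl(9); the gains and update rule below
are simplified to unit coefficients):

```c
/*
 * Minimal discrete PID update.  While the measured free count stays
 * below the setpoint, the accumulated error keeps growing, so the
 * controller keeps demanding page reclamation even after a small
 * instantaneous dip has been absorbed.
 */
struct toy_pid {
	int setpoint;		/* free page target */
	int integral;		/* accumulated error */
	int prev_error;
};

static int
toy_pid_update(struct toy_pid *pc, int free_count)
{
	int error, derivative, output;

	error = pc->setpoint - free_count;	/* > 0 while short of target */
	pc->integral += error;
	derivative = error - pc->prev_error;
	pc->prev_error = error;

	/* Unit gains for illustration only. */
	output = error + pc->integral + derivative;
	return (output > 0 ? output : 0);	/* pages to try to reclaim */
}
```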

Earlier this week, I did some experiments on a system with 64GB of RAM. I used a program which allocates ~60GB of dirty anon pages and puts them in the inactive queue (I set vm.pageout_update_period to a small number to accelerate this; MADV_DONTNEED could also be used). Then I tried creating shortfalls of different sizes (e.g., 100MB below the free page target) to see how the page daemon responds.
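
A rough reconstruction of that kind of test program (not the exact one used);
it dirties a large anonymous mapping and then hints the pages toward the
inactive queue:

```c
#include <sys/mman.h>

#include <err.h>
#include <stddef.h>
#include <unistd.h>

int
main(void)
{
	size_t sz = 60UL << 30;		/* ~60GB; adjust to the machine */
	char *p;

	p = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE,
	    -1, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	for (size_t off = 0; off < sz; off += getpagesize())
		p[off] = 1;		/* dirty every page */
	/*
	 * Either wait for vm.pageout_update_period to deactivate the pages,
	 * or deactivate them explicitly; the description above mentions
	 * both options.
	 */
	if (madvise(p, sz, MADV_DONTNEED) != 0)
		err(1, "madvise");
	pause();			/* keep the mapping alive */
	return (0);
}
```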

Without this patch, the page daemon moves all 60GB to the laundry queue. With the patch, we still move a large number of pages, e.g., 10GB+ in response to a 1GB shortfall. This is because the PID controller becomes quite aggressive if the page shortage cannot be satisfied instantaneously[**]; again, I believe this is mostly due to the integral term. So I am not too worried about the case you described, but it deserves more analysis.

[**] This is related to the change in vm_pageout_mightbe_oom(). Perhaps that should be committed separately. After a page shortage is met, the page daemon may still keep trying to reclaim pages as demanded by the PID controller. This means that a persistent page_shortage > 0 condition is not necessarily a strong signal that we should trigger an OOM kill. We should also take the instantaneous shortage into account.