Web lists-archives.com

Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing




> On May 31, 2019, at 2:47 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> 
> 
> On May 31, 2019, at 2:33 PM, Nadav Amit <namit@xxxxxxxxxx> wrote:
> 
>>> On May 31, 2019, at 2:14 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>> 
>>>> On Thu, May 30, 2019 at 11:37 PM Nadav Amit <namit@xxxxxxxxxx> wrote:
>>>> When we flush userspace mappings, we can defer the TLB flushes, as long
>>>> the following conditions are met:
>>>> 
>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>> cause machine-checks.
>>>> 
>>>> 2. No one would access userspace before flush takes place. Specifically,
>>>> NMI handlers and kprobes would avoid accessing userspace.
>>> 
>>> I think I need to ask the big picture question.  When someone calls
>>> flush_tlb_mm_range() (or the other entry points), if no page tables
>>> were freed, they want the guarantee that future accesses (initiated
>>> observably after the flush returns) will not use paging entries that
>>> were replaced by stores ordered before flush_tlb_mm_range().  We also
>>> need the guarantee that any effects from any memory access using the
>>> old paging entries will become globally visible before
>>> flush_tlb_mm_range().
>>> 
>>> I'm wondering if receipt of an IPI is enough to guarantee any of this.
>>> If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
>>> to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
>>> dirty bit?  An interrupt entry today is fully serializing by the time
>>> it finishes, but interrupt entries are epicly slow, and I don't know
>>> if the APIC waits long enough.  Heck, what if IRQs are off on the
>>> remote CPU?  There are a handful of places where we touch user memory
>>> with IRQs off, and it's (sadly) possible for user code to turn off
>>> IRQs with iopl().
>>> 
>>> I *think* that Intel has stated recently that SMT siblings are
>>> guaranteed to stop speculating when you write to the APIC ICR to poke
>>> them, but SMT is very special.
>>> 
>>> My general conclusion is that I think the code needs to document what
>>> is guaranteed and why.
>> 
>> I think I might have managed to confuse you with a bug I made (last minute
>> bug when I was doing some cleanup). This bug does not affect the performance
>> much, but it might led you to think that I use the APIC sending as
>> synchronization.
>> 
>> The idea is not for us to rely on write to ICR as something serializing. The
>> flow should be as follows:
>> 
>> 
>>   CPU0                    CPU1
>> 
>> flush_tlb_mm_range()
>> __smp_call_function_many()
>> [ prepare call_single_data (csd) ]
>> [ lock csd ] 
>> [ send IPI ]
>>   (*)
>> [ wait for csd to be unlocked ]
>>                   [ interrupt ]
>>                   [ copy csd info to stack ]
>>                   [ csd unlock ]
>> [ find csd is unlocked ]
>> [ continue (**) ]
>>                   [ flush TLB ]
>> 
>> 
>> At (**) the pages might be recycled, written-back to disk, etc. Note that
>> during (*), CPU0 might do some local TLB flushes, making it very likely that
>> CSD will be unlocked by the time it gets there.
>> 
>> As you can see, I don’t rely on any special micro-architectural behavior.
>> The synchronization is done purely in software.
>> 
>> Does it make more sense now?
> 
> Yes.  Have you benchmarked this?

Partially. Numbers are indeed worse. Here are preliminary results, comparing
to v1 (concurrent):

	n_threads	before		concurrent	+async
	---------	------		---------- 	------
	1		661		663		663
	2		1436		1225 (-14%)	1115 (-22%)
	4		1571		1421 (-10%)	1289 (-18%)

Note that the benefit of “async" would be greater if the initiator does not
flush the TLB at all. This might happen in the case of kswapd, for example.
Let me try some micro-optimizations first, run more benchmarks and get back
to you.