Web lists-archives.com

Re: [PATCH 3/8] mm, truncate: Remove all exceptional entries from pagevec under one lock




On Thu, Oct 12, 2017 at 03:33:23PM +0200, Jan Kara wrote:
> >  		return;
> >  
> > -	if (dax_mapping(mapping)) {
> > -		dax_delete_mapping_entry(mapping, index);
> > -		return;
> > +	dax = dax_mapping(mapping);
> > +	if (!dax)
> > +		spin_lock_irq(&mapping->tree_lock);
> > +
> > +	for (i = ei, j = ei; i < pagevec_count(pvec); i++) {
> > +		struct page *page = pvec->pages[i];
> > +		pgoff_t index = indices[i];
> > +
> > +		if (!radix_tree_exceptional_entry(page)) {
> > +			pvec->pages[j++] = page;
> > +			continue;
> > +		}
> > +
> > +		if (unlikely(dax)) {
> > +			dax_delete_mapping_entry(mapping, index);
> > +			continue;
> > +		}
> > +
> > +		__clear_shadow_entry(mapping, index, page);
> >  	}
> > -	clear_shadow_entry(mapping, index, entry);
> > +
> > +	if (!dax)
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +	pvec->nr = j;
> >  }
> 
> When I look at this I think could make things cleaner. I have the following
> observations:
> 
> 1) All truncate_inode_pages(), invalidate_mapping_pages(),
> invalidate_inode_pages2_range() essentially do very similar thing and would
> benefit from a similar kind of batching.
> 

While this is true, the benefit is much more marginal that I didn't feel
the level of churn was justified. Primarily it would help fadvise() and
invalidating when buffered and direct IO is mixed. I didn't think it would
be that much cleaner as a result so I left it.

> 2) As you observed and measured, batching of radix tree operations makes
> sense both when removing pages and shadow entries, I'm very confident it
> would make sense for DAX exceptional entries as well.
> 

True, but I didn't have a suitable setup for testing DAX so I wasn't
comfortable with making the change. dax_delete_mapping_entry can sleep but it
should be as simple as not taking the spinlock in dax_delete_mapping_entry
and always locking in truncate_exceptional_pvec_entries. dax is already
releasing the mapping->tree_lock if it needs to sleep and I didn't spot
any other gotcha but I'd prefer that change was done by someone that can
verify it works properly.

> 3) In all cases (i.e., those three functions and for all entry types) the
> workflow seems to be:
>   * lockless lookup of entries
>   * prepare entry for reclaim (or determine it is not elligible)
>   * lock mapping->tree_lock
>   * verify entry is still elligible for reclaim (otherwise bail)
>   * clear radix tree entry
>   * unlock mapping->tree_lock
>   * final cleanup of the entry
> 
> So I'm wondering whether we cannot somehow refactor stuff so that batching
> of radix tree operations could be shared and we wouldn't have to duplicate
> it in all those cases.
> 
> But it would be rather large overhaul of the code so it may be a bit out of
> scope for these improvements...
> 

I think it would be out of scope for this improvement but I can look into
it if the series is accepted. I think it would be a lot of churn for fairly
marginal benefit though.

> > @@ -409,8 +445,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
> >  			}
> >  
> >  			if (radix_tree_exceptional_entry(page)) {
> > -				truncate_exceptional_entry(mapping, index,
> > -							   page);
> > +				if (ei != PAGEVEC_SIZE)
> > +					ei = i;
> 
> This should be ei == PAGEVEC_SIZE I think.
> 
> Otherwise the patch looks good to me so feel free to add:
> 

Fixed.

> Reviewed-by: Jan Kara <jack@xxxxxxx>

Thanks

-- 
Mel Gorman
SUSE Labs