Web lists-archives.com

Re: [PATCH v1] add: speed up cmd_add() by utilizing read_cache_preload()






On 11/2/2018 11:23 AM, Junio C Hamano wrote:
Ben Peart <peartben@xxxxxxxxx> writes:

From: Ben Peart <benpeart@xxxxxxxxxxxxx>

During an "add", a call is made to run_diff_files() which calls
check_remove() for each index-entry.  The preload_index() code
distributes some of the costs across multiple threads.

Nice.  I peeked around and noticed that we already do this in
builtin_diff_index() before running run_diff_index() when !cached,
and builtin_diff_files(), of course.


Thanks for double checking!

Because the files checked are restricted to pathspec, adding individual
files makes no measurable impact but on a Windows repo with ~200K files,
'git add .' drops from 6.3 seconds to 3.3 seconds for a 47% savings.

;-)

diff --git a/builtin/add.c b/builtin/add.c
index ad49806ebf..f65c172299 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -445,11 +445,6 @@ int cmd_add(int argc, const char **argv, const char *prefix)
  		return 0;
  	}
- if (read_cache() < 0)
-		die(_("index file corrupt"));
-
-	die_in_unpopulated_submodule(&the_index, prefix);
-
  	/*
  	 * Check the "pathspec '%s' did not match any files" block
  	 * below before enabling new magic.

It is not explained why this is not a mere s/read_cache/&_preload/
in the log message.  I can see it is because you wanted to make the
pathspec available to preload to further cut down the preloaded
paths, and I do not think it has any unintended (negatie) side
effect to parse the pathspec before populating the in-core index,
but that would have been a good thing to mention in the proposed log
message.


That is correct. parse_pathspec() was after read_cache() because it _used_ to use the index to determine whether a pathspec is in a submodule. That was fixed by Brandon in Aug 2017 when he cleaned up all submodule config code so it is safe to move read_cache_preload() after the call to parse_pathspec().

How about this for a revised commit message?



During an "add", a call is made to run_diff_files() which calls
check_remove() for each index-entry. The preload_index() code distributes some of the costs across multiple threads.

Limit read_cache_preload() to pathspec, so that it doesn't process more entries than are needed and end up slowing things down instead of speeding them up.

Because the files checked are restricted to pathspec, adding individual
files makes no measurable impact but on a Windows repo with ~200K files,
'git add .' drops from 6.3 seconds to 3.3 seconds for a 47% savings.



@@ -459,6 +454,10 @@ int cmd_add(int argc, const char **argv, const char *prefix)
  		       PATHSPEC_SYMLINK_LEADING_PATH,
  		       prefix, argv);
+ if (read_cache_preload(&pathspec) < 0)
+		die(_("index file corrupt"));
+
+	die_in_unpopulated_submodule(&the_index, prefix);
  	die_path_inside_submodule(&the_index, &pathspec);
if (add_new_files) {

base-commit: 4ede3d42dfb57f9a41ac96a1f216c62eb7566cc2