A Very Bad umount
- Date: Tue, 11 Sep 2018 14:28:30 -0500
- From: "Martin McCormick" <martin.m@xxxxxxxxxxxxxx>
- Subject: A Very Bad umount
This has all the earmarks of a race condition because it is
totally intermittent. It succeeds maybe 80% of the time.
I am using rsync to backup a Linux system to a pair of
thumb drives which both appear to be healthy. The mounting
process goes as follows:
# Combine two 256-GB drives into one 512-GB drive.
mhddfs /rsnapshot1,/rsnapshot2 /var/cache/rsnapshot -o mlimit=100M
If one does
# df -h /var/cache/rsnapshot
Filesystem Size Used Avail Use% Mounted on
/rsnapshot1;/rsnapshot2 463G 173G 267G 40% /var/cache/rsnapshot
That all works as it should. One can run rsnapshot and
get a backup of today's file system.
The /etc/rsnapshot.conf file is set to call the mount
process before rsync runs and then to do the umount after it finishes:
# Specify the path to a script (and any optional arguments) to run right
# after rsnapshot syncs files
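For reference, the relevant hooks in rsnapshot.conf are cmd_preexec and cmd_postexec. A sketch of what that part of the file might look like; the wrapper-script paths are hypothetical placeholders, since the post does not name the real scripts:

```
# Fields in rsnapshot.conf must be separated by tabs, not spaces.
# The two script paths below are hypothetical placeholders.
cmd_preexec	/usr/local/etc/mount_backup
cmd_postexec	/usr/local/etc/umount_backup
```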
My problem may be with how I am unmounting everything,
so here is the command that runs:
umount /var/cache/rsnapshot /rsnapshot2 /rsnapshot1
Normally, this simply works and /var/cache/rsnapshot ends up
empty, but when one of these intermittent explosions happens, I
receive the following:
Date: Tue, 11 Sep 2018 00:06:23 -0500
From: root@wb5agz (Cron Daemon)
Subject: Cron <root@wb5agz> /usr/local/etc/daily_backup
From root@wb5agz Tue Sep 11 00:06:24 2018
/bin/rm: cannot remove '/var/cache/rsnapshot/halfday.1/wb5agz/home/usr/lib/i386-linux-gnu': Transport endpoint is not connected
/bin/rm: cannot remove '/var/cache/rsnapshot/halfday.1/wb5agz/home/usr/lib/libgpgme-pth.so.11': Transport endpoint is not connected
That is the beginning of what was, today, a 152-line
message in which all of the error messages ended in
"Transport endpoint is not connected".
When I have discovered one of these crashes, I have
re-run the script as root and it usually runs perfectly the
second time, defying the definition of madness: doing the same
thing and expecting different results. Here, you frequently do
get different results, in the form of a proper backup.
Today, I manually re-ran the backup and this time it
actually failed from the command line, with the same error
messages for each file mentioned. The spew frequently highlights
a different set of directories each time.
Looking at the two drives later, they are fine except
that one does not get the latest backup: rsync saw the errors,
so you are left with the last good backup.
I did an ls of /var/cache/rsnapshot after the big spew and
got the "Transport endpoint is not connected" error again.
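That error is the signature of a dead FUSE process: once the mhddfs daemon behind the mountpoint has exited, every operation on the mountpoint fails with ENOTCONN. One way to detect that state before rm or rsync trips over it is to stat the mountpoint first. A minimal sketch, where probe_mount is an illustrative helper, not an existing tool:

```shell
#!/bin/sh
# probe_mount is a hypothetical helper: stat the directory and
# treat any failure as an unhealthy mount. When the FUSE daemon
# behind the mountpoint has died, stat fails with
# "Transport endpoint is not connected".
probe_mount() {
    stat "$1" >/dev/null 2>&1
}

if probe_mount /var/cache/rsnapshot; then
    echo "union mount looks healthy"
else
    echo "union mount is broken; remount before backing up" >&2
fi
```

Running a check like this from the backup script would at least turn the intermittent spew into a clear, early failure message.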
I have actually tried
umount /rsnapshot2 /rsnapshot1 /var/cache/rsnapshot
as well as
umount /var/cache/rsnapshot /rsnapshot2 /rsnapshot1
I was thinking that the order might make a difference but have
gotten as many good runs with either order.
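Whatever the order, one thing worth trying is to sync first and retry a busy unmount, taking the union down before its members so mhddfs can exit cleanly. A sketch, assuming the mountpoints from this post; unmount_retry is an illustrative helper, not an rsnapshot feature:

```shell
#!/bin/sh
# unmount_retry is a hypothetical helper: flush dirty pages with
# sync, then retry umount a few times in case the tree is briefly
# busy right after rsync finishes.
unmount_retry() {
    mp=$1
    tries=${2:-5}
    while [ "$tries" -gt 0 ]; do
        sync                              # flush before each attempt
        umount "$mp" 2>/dev/null && return 0
        sleep 2
        tries=$((tries - 1))
    done
    echo "still cannot unmount $mp" >&2
    return 1
}

# Take the union down first so mhddfs quiesces, then the members.
if mountpoint -q /var/cache/rsnapshot; then
    unmount_retry /var/cache/rsnapshot
    unmount_retry /rsnapshot1
    unmount_retry /rsnapshot2
fi
```

For FUSE filesystems such as mhddfs, `fusermount -u /var/cache/rsnapshot` is the usual unprivileged equivalent of umount and could be substituted for the union mountpoint.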
If one looks in /var/log/syslog, one sees the mounting of
the two drives and no errors and there are no errors reported if
you watch it happen.
Are there any ideas on how to do the umount to ensure
that all the inodes are in the state they should be in before the
umount is done?
Normally, umount blocks until everything is flushed and then succeeds.
I have been chasing this rabbit for quite a while now and
it can sometimes be weeks without a spew, just long enough to
think that the last rejiggering of the order for unmounting or
someother futile rearranging of the Titanic's deck chairs actually
made a difference.
Any constructive ideas are appreciated. If I left the
drives mounted all the time, there would be no spew, but since
these are backup drives, having them mounted all the time is not
a good idea.

Martin McCormick WB5AGZ