
[Samba] glusterfs + ctdb + nfs-ganesha: after unplugging the network cable of the serving node, IO takes ~20 minutes to resume




Hi all

We ran some failover/failback tests on 2 nodes (A and B) with the architecture 'glusterfs + ctdb (public address) + nfs-ganesha'.

1st:
During a write, unplug the network cable of serving node A
-> The NFS client took a few seconds to recover and continue writing.

After some minutes, plug the network cable of serving node A back in
-> The NFS client again took a few seconds to recover and continue writing.

2nd:
During a write, unplug the network cable of serving node A
-> The NFS client took 20 minutes to recover and continue writing.
This recovery time is too long for clients to accept.
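
For anyone who wants to reproduce the measurement, the stall seen by the client can be quantified by timestamping each synced write and taking the largest gap between consecutive writes. The following is only a minimal sketch: the function names and the file path in the usage comment are hypothetical, and it is not the tool we used in the tests above.

```python
import os
import time


def longest_stall(timestamps):
    """Return the largest gap (in seconds) between consecutive timestamps."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return max(gaps, default=0.0)


def timed_writes(path, count, chunk=b"x" * 4096, clock=time.monotonic):
    """Append `count` chunks to `path`, syncing each one, and record when each completes."""
    stamps = []
    with open(path, "ab") as handle:
        for _ in range(count):
            handle.write(chunk)
            handle.flush()
            os.fsync(handle.fileno())  # force the write out to the NFS server
            stamps.append(clock())
    return stamps


# Hypothetical usage on the NFS client (the mount path is an assumption):
#   stamps = timed_writes("/mnt/nfs_test/stall_probe.bin", count=100000)
#   print("longest stall: %.1f s" % longest_stall(stamps))
```

Running the write loop while the cable is unplugged should report a longest stall of a few seconds in the good case and roughly 1200 seconds in the bad case.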

According to the CTDB log, during failover and failback the failed node could not kill its TCP connection with the client,
while the recovery node failed to send the 'tickle ack' to the client to re-establish the connection.

So during the first 1~3 s, the takeover failed.
Why did the fast recovery fail, and why did it take 20 minutes to recover successfully?
Does anyone know the reason?
We are looking forward to your reply. Thanks.

-------------------------------------------------------------------------------------------------------------
The following are some test logs and the configuration.
Node A:
cat /var/log/log.ctdb
2019/02/22 18:00:57.468629 ctdbd[18309]: Release of IP 10.10.11.51/24 on interface eth3  node:1
2019/02/22 18:01:02.132565 ctdbd[18309]: Monitoring event was cancelled
2019/02/22 18:01:02.547046 ctdb-eventd[18310]: 10.interface: Killing TCP connection ::ffff:10.10.11.18:951 ::ffff:10.10.11.51:2049
2019/02/22 18:01:02.547112 ctdb-eventd[18310]: 10.interface: Failed sendto (No route to host)
...
2019/02/22 18:01:02.547259 ctdb-eventd[18310]: 10.interface: Failed sendto (No route to host)
2019/02/22 18:01:02.548458 ctdb-eventd[18310]: 10.interface: Failed to kill TCP connections for IP 10.10.11.51 (1/1 remaining)
2019/02/22 18:01:02.680399 ctdb-eventd[18310]: 60.nfs: method return time=1550829662.675715 sender=:1.1803 -> destination=:1.1819 serial=445 reply_serial=2
2019/02/22 18:01:02.680479 ctdb-eventd[18310]: 60.nfs:    boolean true
2019/02/22 18:01:02.680500 ctdb-eventd[18310]: 60.nfs:    string "Started grace period"
2019/02/22 18:01:03.255313 ctdb-eventd[18310]: 60.nfs: Reconfiguring service "nfs"...
2019/02/22 18:01:03.353830 ctdb-recoverd[18402]: Takeover run completed successfully
2019/02/22 18:01:05.345783 ctdbd[18309]: Starting traverse on DB ctdb.tdb (id 9809)
2019/02/22 18:01:05.348204 ctdbd[18309]: Ending traverse on DB ctdb.tdb (id 9809), records 1

Node B:
cat /var/log/log.ctdb
2019/02/22 18:01:02.699755 ctdbd[29541]: Takeover of IP 10.10.11.51/24 on interface eth3
2019/02/22 18:01:02.701360 ctdbd[29541]: Monitoring event was cancelled
2019/02/22 18:01:03.010811 ctdb-eventd[29542]: 60.nfs: removed ‘/mnt/mgt_vol/grp45/nfs_state/nfs-ganesha/.noderefs/10.10.11.51’
2019/02/22 18:01:03.010896 ctdb-eventd[29542]: 60.nfs: ‘/mnt/mgt_vol/grp45/nfs_state/nfs-ganesha/.noderefs/10.10.11.51’ -> ‘/mnt/mgt_vol/grp45/nfs_state/nfs-ganesha/node-4’
2019/02/22 18:01:03.010922 ctdb-eventd[29542]: 60.nfs: method return time=1550829663.005719 sender=:1.192 -> destination=:1.206 serial=438 reply_serial=2
2019/02/22 18:01:03.010937 ctdb-eventd[29542]: 60.nfs:    boolean true
2019/02/22 18:01:03.010973 ctdb-eventd[29542]: 60.nfs:    string "Started grace period"
2019/02/22 18:01:03.065121 ctdbd[29541]: Failed sendto (No route to host)
2019/02/22 18:01:03.065191 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
2019/02/22 18:01:03.303342 ctdb-eventd[29542]: 60.nfs: Reconfiguring service "nfs"...
2019/02/22 18:01:03.347137 ctdb-recoverd[29647]: Reenabling takeover runs
2019/02/22 18:01:04.172108 ctdbd[29541]: Failed sendto (No route to host)
2019/02/22 18:01:04.172180 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
2019/02/22 18:01:05.278093 ctdbd[29541]: Failed sendto (No route to host)
2019/02/22 18:01:05.278159 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
2019/02/22 18:01:05.389656 ctdbd[29541]: Starting traverse on DB ctdb.tdb (id 6238)
2019/02/22 18:01:05.392182 ctdbd[29541]: Ending traverse on DB ctdb.tdb (id 6238), records 1
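
To compare runs more easily, the failure messages in logs like the ones above can be counted with a small script. This is only a sketch under assumptions: the function name `summarize` and the label names are made up for illustration, the message substrings are copied from the log lines in this mail, and the sample input is the Node B excerpt above.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# 2019/02/22 18:01:03.065121 ctdbd[29541]: Failed sendto (No route to host)
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<proc>[\w-]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)

# Substrings taken verbatim from the logs in this mail; labels are hypothetical.
PATTERNS = {
    "sendto_failed": "Failed sendto",
    "tickle_ack_failed": "Failed to send tcp tickle ack",
    "kill_tcp_failed": "Failed to kill TCP connections",
}

SAMPLE = """\
2019/02/22 18:01:03.065121 ctdbd[29541]: Failed sendto (No route to host)
2019/02/22 18:01:03.065191 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
2019/02/22 18:01:04.172108 ctdbd[29541]: Failed sendto (No route to host)
2019/02/22 18:01:04.172180 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
2019/02/22 18:01:05.278093 ctdbd[29541]: Failed sendto (No route to host)
2019/02/22 18:01:05.278159 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
""".splitlines()


def summarize(lines):
    """Count failure messages and return (counts, seconds between first and last failure)."""
    counts = Counter()
    first = last = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip "..." placeholders and non-log lines
        stamp = datetime.strptime(match.group("ts"), "%Y/%m/%d %H:%M:%S.%f")
        for label, needle in PATTERNS.items():
            if needle in match.group("msg"):
                counts[label] += 1
                first = stamp if first is None else min(first, stamp)
                last = stamp if last is None else max(last, stamp)
    span = (last - first).total_seconds() if first is not None else 0.0
    return counts, span


if __name__ == "__main__":
    counts, span = summarize(SAMPLE)
    print(dict(counts), span)
```

For a real run, the lines of /var/log/log.ctdb can be passed in place of `SAMPLE`.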

cat /etc/sysconfig/ctdb
CTDB_RECOVERY_LOCK=/mnt/mgt_vol/grp45/lockfile
CTDB_PUBLIC_INTERFACE=eth3
CTDB_NODES=/mnt/mgt_vol/grp45/nodes
CTDB_PUBLIC_ADDRESSES=/mnt/mgt_vol/grp45/public_addresses
CTDB_MANAGES_SAMBA=yes
CTDB_MANAGES_WINBIND=no
CTDB_MANAGES_VSFTP=yes
CTDB_SAMBA_SKIP_SHARE_CHECK=yes
CTDB_MANAGES_NFS=yes
CTDB_NFS_CALLOUT=/etc/ctdb/nfs-ganesha-callout
CTDB_NFS_STATE_FS_TYPE=glusterfs
CTDB_NFS_CHECKS_DIR=/etc/ctdb/nfs-checks-ganesha.d/
CTDB_NFS_STATE_MNT=/mnt/mgt_vol/grp45/nfs_state
CTDB_NFS_SKIP_SHARE_CHECK=yes
CTDB_SET_KeepaliveLimit=1

cat /mnt/mgt_vol/grp45/nodes
192.168.100.15 #inner network
192.168.100.14 #inner network

cat /mnt/mgt_vol/grp45/public_addresses
10.10.11.50/24 eth3 #extranet network
10.10.11.51/24 eth3 #extranet network


That is all. Thank you in advance.
--------------------------------------------------
**************************************************
Liu Dan
PF Dept
Nanjing Fujitsu Nanda Software Tech.Co.,Ltd.(FNST)
TEL:+86+25-86630566-8512
FUJITSU INTERNAL:79955-8512
EMail: liud.fnst@xxxxxxxxxxxxxx
**************************************************
--------------------------------------------------


