Re: [Samba] ctdb vacuum timeouts and record locks
On Tue, 7 Nov 2017 17:05:27 -0800, Computerisms Corporation via samba
<samba@xxxxxxxxxxxxxxx> wrote:

> >> I am using the 10.external.  ip addr show shows the correct IP addresses
> >> on eth0 in the lxc container.  rebooted the physical machine, this node
> >> is buggered.  shut it down, used ip addr add to put the addresses on the
> >> other node, used ctdb addip and the node took it and node1 is now
> >> functioning with all 4 IPs just fine.  Or so it appears right now.
> >>
> >> something is seriously schizophrenic here...  
> > 
> > I'm wondering why you're using 10.external.  Although we have tested
> > it, we haven't actually seen it used in production before!  10.external
> > is a hack to allow use of CTDB's connection tracking while managing the
> > public IP addresses externally.  That is, you tell CTDB about the
> > public IPs, use "ctdb moveip" to inform CTDB about moved public IPs and
> > it sends grat ARPs and tickle ACKs on the takeover node.  It doesn't
> > actually assign the public IP addresses to nodes.  
> 
> Hm, okay, I was clear that when using 10.external it is a human's
> responsibility to deal with assigning IPs to physical interfaces.  In
> re-reading the docs, I see DeterministicIPs and NoIPFailback are
> required for moveip, which I am not sure are set.  I will check at the
> next opportunity; if they aren't, that might explain the behaviour.
> However, the IPs were correctly assigned using the ip command.

The documentation (CTDB >= 4.6) for moveip says:

       IPAllocAlgorithm != 0

so it will work for the other algorithms but not for the deterministic one.

In 4.5, which is what I assume you're running, the documentation
recommends:

        DeterministicIPs = 0

so, this one needs to be off.

I don't think these options will explain the messages you're seeing.

> The reason I am using 10.external is because when I initially set up my 
> cluster test environment, none of ctdb's automatic networking 
> assignments worked.  ip addr show wouldn't display the addresses as 
> being assigned to the interface.  I never did get down to the bottom of 
> that problem, I had thought perhaps the lxc container was the issue, but 
> don't know why it would be, the ip commands all seem to work fine from 
> the cli.

OK.  CTDB just runs the "ip" command in the event scripts, so in most
cases it should be the same as running it from the cli. I wonder if it
could be an SELinux issue or something?

> While I was trying to find my way around that, I found 10.external.  I 
> found that by adjusting my start scripts to include the appropriate ip 
> addr add commands, it worked fine.  in my test environment I played with 
> the ctdb addip/delip/moveip commands, and manually assigning the 
> addresses, and it all worked fine.  If I turned off a node, I could 
> uncomment a couple lines in the start script in the other node and 
> restart and everything moved to where it was supposed to be.

You shouldn't need to mess with addip and delip.  If the IP addresses
are configured in the public addresses file at startup then moveip
should be sufficient to let ctdbd know that the address has moved.

> > The documentation might not be clear on this but if you're using
> > 10.external then you need to have the DisableIPFailover tunable set to
> > 1 on all nodes so that CTDB doesn't try to move the IPs itself.  
> 
> I do have the DisableIPFailover set.
> 
> from the documentation, I am under the impression that if I do ctdb 
> delip on one node, and ctdb addip on the other node, and make sure the 
> other node shows the correct additional IPs assigned to the physical 
> interface using the ip addr show command, that should move an ip from 
> one node to the other.  But when I do this, I will frequently still see 
> messages like <ip> still hosted during callback, or failed to release 
> <ip> in the logs.  sometimes on startup, I will see log entries like 
> <ip> incorrectly on an interface, when ip addr show shows the address is 
> correctly on an interface, and ctdb ipinfo will show that the ip is 
> assigned to the node.

This message:

  IP 192.168.120.90 still hosted during release IP callback, failing

comes from this block of code in
ctdb/server/ctdb_takeover.c:release_ip_callback():

	if (ctdb->tunable.disable_ip_failover == 0 && ctdb->do_checkpublicip) {
		if  (ctdb_sys_have_ip(state->addr)) {
			DEBUG(DEBUG_ERR,
			      ("IP %s still hosted during release IP callback, failing\n",
			       ctdb_addr_to_str(state->addr)));
			ctdb_request_control_reply(ctdb, state->c,
						   NULL, -1, NULL);
			talloc_free(state);
			return;
		}
	}

So if DisableIPFailover is set to 1, that message can't happen.
Remember that the tunables are not cluster-wide, so they need to be set
on all nodes.

> Does this mean these commands are not working, or could it be that the 
> 10.external doesn't do the magic in these cases?

10.external doesn't do anything for the "releaseip" and "takeip"
events.  It really does depend on the IP address(es) being moved
manually and "moveip" being used...

> > Please let us know if the documentation could be improved...  
> 
> Often documentation isn't straightforward until you have had some 
> experience and gained some of the context that those who wrote it have.  I
> am not sure about improving documentation, but I can say I learned 
> significantly more about how to set things up, what to expect, and what 
> procedures to perform by reading mailing list posts than I did by 
> reading the manuals or the wiki...

Hmmm... we've been trying to turn the wiki content for CTDB into a very
simple how to... but it doesn't look like we're succeeding.  :-(

If you can point to particular things then we'll try to improve them...

peace & happiness,
martin

-- 
To unsubscribe from this list go to the following URL and read the
instructions:  https://lists.samba.org/mailman/options/samba