Sentinel cluster with one node's network interface down: intermittently unable to query the master node, MasterAddr error: context deadline exceeded #3172

kwenzh · 2024-10-28T12:54:40Z

Issue tracker is used for reporting bugs and discussing new features. Please use
stackoverflow for supporting issues.

In a 3-node cluster (3 sentinels + 3 redis-server instances, on nodes named A, B, and C), take the network interface of node C offline, e.g. ifconfig eth0 down. The client then reconnects to Redis Sentinel with NewFailoverClient to find the master address.

Expected Behavior

The redis-server failover completes and the client connects to the new master successfully.

Current Behavior

Intermittently, the lookup fails with context deadline exceeded. When the client tries to connect to the sentinel on node C, the error is returned at https://github.com/redis/go-redis/blob/master/sentinel.go#L559. Although A and B are working normally, the context deadline has already passed by that point: when the shuffle of sentinel addresses places the faulty node C first, C exhausts the context's time budget and leaves no time for A and B.

Possible Solution

When obtaining the master address, instead of querying each sentinel address sequentially, consider querying them concurrently in goroutines, or using a separate context for each query.
Make the context of each iteration independent, for example by deriving a new context with its own deadline (context.WithDeadline) from the caller's context.

for i, sentinelAddr := range c.sentinelAddrs {
	sentinel := NewSentinelClient(c.opt.sentinelOptions(sentinelAddr))

	masterAddr, err := sentinel.GetMasterAddrByName(ctx, c.opt.MasterName).Result()
	if err != nil {
		_ = sentinel.Close()
		if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
			return "", err
		}
		internal.Logger.Printf(ctx, "sentinel: GetMasterAddrByName master=%q failed: %s",
			c.opt.MasterName, err)
		continue
	}

	// Push working sentinel to the top.
	c.sentinelAddrs[0], c.sentinelAddrs[i] = c.sentinelAddrs[i], c.sentinelAddrs[0]
	c.setSentinel(ctx, sentinel)

	addr := net.JoinHostPort(masterAddr[0], masterAddr[1])
	return addr, nil
}

Steps to Reproduce

1. Deploy a cluster with 3 sentinels + 3 redis servers.
2. Take the NIC of one node offline so that the node is unreachable, e.g. ifconfig eth0 down.
3. Have the client connect to the Redis cluster repeatedly with NewFailoverClient.
4. Check whether the master Redis address can be obtained.
5. Intermittently, the lookup fails with the error context deadline exceeded.

Context (Environment)

CentOS 8 with kernel 4.18
go-redis: v9.6.0
ctx timeout: 3s
dialTimeout: default 5s

Detailed Description

I think there are two points:

First, why does the master address lookup query each sentinel sequentially? A failed node near the front of the list can then affect the healthy nodes behind it.
Second, on repeated initialization the shuffle is pseudo-random with a seed of 1, so the sequence of orderings is the same on every run. The failure therefore always occurs at the same fixed round, namely whenever the faulty node is shuffled into first place.


kwenzh · 2024-10-29T01:18:28Z

Simulating the shuffle of the sentinel nodes several times shows that node C lands in the first position in the second round. The results are also the same on every run, because the generator is pseudo-random with a seed of 1.

for cnt := 0; cnt < 10; cnt++ {
		arrs := []string{"A", "B", "C"}
		Shuffle(3, func(i, j int) {
			fmt.Println(">>>>>>>>", i, j)
			arrs[i], arrs[j] = arrs[j], arrs[i]
		})
		fmt.Println(">>>>>>>>", arrs)
	}

output:

>>>>>>>> 2 1
>>>>>>>> 1 1     
>>>>>>>> [A C B] 
>>>>>>>> 2 1     
>>>>>>>> 1 0     
>>>>>>>> [C A B] 
>>>>>>>> 2 1     
>>>>>>>> 1 1     
>>>>>>>> [A C B] 
>>>>>>>> 2 0     
>>>>>>>> 1 0     
>>>>>>>> [B C A] 
>>>>>>>> 2 0     
>>>>>>>> 1 0     
>>>>>>>> [B C A] 
>>>>>>>> 2 1     
>>>>>>>> 1 1     
>>>>>>>> [A C B] 
>>>>>>>> 2 0     
>>>>>>>> 1 0     
>>>>>>>> [B C A] 
>>>>>>>> 2 0     
>>>>>>>> 1 0     
>>>>>>>> [B C A] 
>>>>>>>> 2 0     
>>>>>>>> 1 0                                                                    
>>>>>>>> [B C A]                                                                
>>>>>>>> 2 2                                                                    
>>>>>>>> 1 0                                                                    
>>>>>>>> [B A C]

Simulating repeated initialization of the failover client while node C is down: the error occurs in the second iteration of the loop, which exits because of a context timeout.


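import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// mock_sentinel repeatedly creates a failover client and pings the master.
func mock_sentinel() {
	for i := 0; i < 10; i++ {
		// "A", "B", "C" are placeholders for the sentinel host:port addresses.
		addr := []string{
			"A", "B", "C",
		}
		sent := redis.NewFailoverClient(&redis.FailoverOptions{
			SentinelAddrs: addr,
			MasterName:    "mymaster",
		})
		ctx, cancel := context.WithTimeout(context.Background(), time.Second*3)
		_, err := sent.Ping(ctx).Result()
		// Release the context and the client in every iteration instead of deferring to function exit.
		cancel()
		_ = sent.Close()
		if err != nil {
			panic(err)
		}
		fmt.Println("connect failover client ok", i)
	}
}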

kwenzh changed the title ~~Sentinel cluster set 1 node network iface down, unable to elect a master, context deadline exceeded~~ Sentinel cluster settings 1 node network iface down, Probability unable to query the master node, MasterAddr error： context deadline exceeded Oct 28, 2024
