Redis Cluster Encountering an Unusual Error

Today our Redis cluster, which had been running smoothly for several weeks, suddenly stopped working. According to the logs, Nextcloud is unable to reach one of the Redis servers. I’ve checked thoroughly and confirmed that the server itself is reachable.

I’m seeking assistance in understanding whether the Redis settings need adjustment. Since it’s a cluster, shouldn’t it attempt to connect to another part of the Redis cluster if it can’t reach the IP on port 7000?

The error logged in nextcloud.log is as follows:

"Could not boot workflowengine: read error on connection to 10.255.10.50:7000","exception":{},"CustomMessage":"Could not boot workflowengine: read error on connection to 10.255.10.50:7000"}}

My Redis configuration in nextcloud:

'memcache.local' => '\OC\Memcache\APCu',
'memcache.distributed' => '\OC\Memcache\redis',
'memcache.locking' => '\OC\Memcache\redis',

'redis.cluster' =>
array (
  'seeds' =>
  array (
    0 => '10.255.10.50:7000',
    1 => '10.255.10.50:7001',
    2 => '10.255.10.50:7002',
    3 => '10.255.10.51:7000',
    4 => '10.255.10.51:7001',
    5 => '10.255.10.51:7002',
    6 => '10.255.10.52:7000',
    7 => '10.255.10.52:7001',
    8 => '10.255.10.52:7002',
  ),
  'failover_mode' => 0,
  'timeout' => 0,
  'read_timeout' => 0,
  'password' => 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
  'dbindex' => 0,
),

I tried failing the master over to one of its replicas, but the issue persists. Output of CLUSTER INFO in Redis:

cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:9
cluster_size:3
cluster_current_epoch:7
cluster_my_epoch:2
cluster_stats_messages_ping_sent:3916469
cluster_stats_messages_pong_sent:3629328
cluster_stats_messages_fail_sent:7
cluster_stats_messages_auth-ack_sent:4
cluster_stats_messages_sent:7545808
cluster_stats_messages_ping_received:3629328
cluster_stats_messages_pong_received:3916467
cluster_stats_messages_fail_received:5
cluster_stats_messages_auth-req_received:4
cluster_stats_messages_received:7545804
total_cluster_links_buffer_limit_exceeded:0

Let me know if you need any further information.

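Based on the information provided, your Redis cluster appears to be configured correctly, and the node you queried reports a healthy state (cluster_state:ok, all 16384 slots assigned, no slots in PFAIL or FAIL). Keep in mind that CLUSTER INFO only reflects that one node’s view of the cluster. The error message “Could not boot workflowengine: read error on connection to 10.255.10.50:7000” comes from the client side: it means a read from the connection to that node failed, which usually points to a timeout or a connection that was closed, rather than the host being unreachable. That would be consistent with your finding that the server is accessible.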

Here are a few steps you can take to troubleshoot the issue further:

Check Redis Logs: Look at the logs of the Redis server that is failing to connect (10.255.10.50 on port 7000). There might be more detailed error messages that can give you a clue about what’s going wrong.

Test Connectivity: Use redis-cli or telnet to manually connect to the Redis server on the problematic port from the Nextcloud server to ensure that there are no network issues or firewalls blocking the connection.

redis-cli -h 10.255.10.50 -p 7000 -a yourpassword
# or
telnet 10.255.10.50 7000
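
Since the error names only one node, it can help to run the same check against every seed from the Nextcloud host. The following is a minimal sketch, not an official tool: check_seeds and the REDIS_CLI override variable are names made up for this example, and it assumes redis-cli is installed and the password is supplied through the REDISCLI_AUTH environment variable.

```shell
# Ping every seed listed in config.php and report which ones answer.
# check_seeds is an illustrative helper, not part of redis-cli.
# REDIS_CLI is a hypothetical override so a different binary can be used.
check_seeds() {
  failed=0
  for seed in "$@"; do
    host=${seed%:*}   # text before the last colon
    port=${seed##*:}  # text after the last colon
    reply=$("${REDIS_CLI:-redis-cli}" -h "$host" -p "$port" ping 2>&1) || true
    if [ "$reply" = "PONG" ]; then
      echo "OK   $seed"
    else
      echo "FAIL $seed ($reply)"
      failed=1
    fi
  done
  return "$failed"
}
```

It could be called as check_seeds 10.255.10.50:7000 10.255.10.50:7001 10.255.10.50:7002 and so on for the remaining seeds. A node that answers here but still produces read errors in Nextcloud suggests the problem is intermittent or load-related rather than a firewall issue.
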
  • Check Redis Configuration: Ensure that redis.conf on the problematic node has cluster mode enabled (cluster-enabled yes) and that the node is not flagged as failed by the rest of the cluster.
  • Examine Nextcloud Configuration: Double-check the config.php file for Nextcloud to ensure that the Redis configuration parameters are correct. Pay special attention to the seeds array to make sure all IP addresses and ports are correct.
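  • Failover Handling: If a master goes down, Redis Cluster itself promotes one of its replicas automatically. Separately, failover_mode in the redis.cluster block is a client-side option that Nextcloud passes to the phpredis RedisCluster client; it is not a redis.conf setting. Your value of 0 corresponds to \RedisCluster::FAILOVER_NONE, which means commands are only ever sent to master nodes. Nextcloud’s config.sample.php also lists \RedisCluster::FAILOVER_ERROR, which lets read commands fall back to a replica when the master is unavailable, so changing 'failover_mode' => 0 to 'failover_mode' => \RedisCluster::FAILOVER_ERROR is worth trying. Note that this only affects reads; writes still have to go to a master.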
  • Cluster Configuration: Ensure that the Redis Cluster configuration is consistent across all nodes. You can validate it by pointing redis-cli --cluster check at any node, for example redis-cli --cluster check 10.255.10.50:7000 -a yourpassword.
  • Restart Services: Sometimes, simply restarting the Redis service on the problematic server can resolve transient issues.
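  • Nextcloud Cache Clear: As far as I know, occ does not provide a general cache:clear command, so check which commands your installation actually offers. Restarting the PHP service resets the local APCu cache, while the distributed cache and file locks live in Redis itself.
sudo -u www-data php occ list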
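
To make the cluster health check from the list above repeatable across nodes, the CLUSTER INFO output can be screened automatically. This is a minimal sketch under stated assumptions: check_cluster_info is a name invented for this example, it reads the output on stdin, and it only looks at the four fields shown in the question.

```shell
# Read CLUSTER INFO output on stdin and report fields that look unhealthy.
# check_cluster_info is an illustrative helper, not part of redis-cli.
check_cluster_info() {
  # redis-cli may emit CRLF line endings, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "cluster_state"          && $2 != "ok"  { print "cluster_state is " $2; bad = 1 }
    $1 == "cluster_slots_assigned" && $2 != 16384 { print "only " $2 " of 16384 slots assigned"; bad = 1 }
    $1 == "cluster_slots_pfail"    && $2 != 0     { print $2 " slots in PFAIL"; bad = 1 }
    $1 == "cluster_slots_fail"     && $2 != 0     { print $2 " slots in FAIL"; bad = 1 }
    END { if (!bad) print "cluster info looks healthy"; exit bad + 0 }
  '
}
```

A typical call would be redis-cli -h 10.255.10.50 -p 7000 cluster info | check_cluster_info, with the password supplied through REDISCLI_AUTH. Because each node reports only its own view, it is worth running this against every seed rather than just one.
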

Consult Documentation: Review the Nextcloud documentation for any specific Redis cluster settings or recommendations.
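
For the failover point above, it is also useful to confirm that every master really has a connected replica to fail over to. The sketch below screens CLUSTER NODES output for that; check_cluster_nodes is a name made up for this example, and the field positions follow the documented CLUSTER NODES line format (id, address, flags, master id, ping-sent, pong-recv, config-epoch, link-state, slots).

```shell
# Read CLUSTER NODES output on stdin and report suspicious nodes.
# check_cluster_nodes is an illustrative helper, not part of redis-cli.
check_cluster_nodes() {
  tr -d '\r' | awk '
    NF >= 8 {
      addr = $2
      sub(/@.*/, "", addr)   # drop the cluster-bus port suffix
      if ($3 ~ /fail/)       { print "flagged " $3 ": " addr; bad = 1 }
      if ($8 != "connected") { print "link " $8 ": " addr; bad = 1 }
      if ($3 ~ /master/) {
        master[$1] = addr    # remember masters by node id
      } else if ($4 != "-") {
        replicas[$4]++       # count replicas per master id
      }
    }
    END {
      for (id in master) {
        if (!(id in replicas)) { print "master without replica: " master[id]; bad = 1 }
      }
      exit bad + 0
    }
  '
}
```

It could be used as redis-cli -h 10.255.10.50 -p 7000 cluster nodes | check_cluster_nodes. No output and exit status 0 means nothing suspicious was found in that node’s view; the check does not verify that a listed replica is itself healthy beyond its flags and link state.
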