Monday, March 13, 2023

OLVM - GlusterFS tuning to avoid split-brain

Intro

In any organization, data is the most valuable asset, and many technologies have emerged to protect data and avoid a single point of failure. When storing critical data, having a single point of failure is very risky. While many databases and other applications can spread data out within the context of a single application, GlusterFS provides the flexibility to spread data across multiple bricks.

GlusterFS (Gluster File System) is an open-source distributed file system that can scale out in a building-block fashion to store multiple petabytes of data. GlusterFS is also supported as a cluster file system for OLVM. There are four main types of GlusterFS volumes (a sample create command for a replicated volume follows the list).

Types of GlusterFS Volumes
  • Distributed volume
  • Replicated volume
  • Distributed replicated volume
  • Striped volume and distributed striped volume
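For orientation, here is a minimal sketch of creating a two-brick replicated volume; the brick paths match the dev_gvol0 volume shown later in this article, but adjust them to your environment. Note that gluster warns that replica 2 volumes are prone to split-brain, which is exactly the condition this post is about.

# gluster volume create dev_gvol0 replica 2 KVM01:/nodirectwritedata/glusterfs/dev_brick_01/dev_gvol0 KVM02:/nodirectwritedata/glusterfs/dev_brick_02/dev_gvol0
# gluster volume start dev_gvol0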

When you run a distributed file system, GlusterFS files can end up in a split-brain state. In an OLVM environment, this usually happens for two main reasons (a quick health check follows the list):

  • Network failure on the OLVM management network.
In our OLVM configuration, the management network is used to replicate GlusterFS data from one brick to another.

  • Performing storage changes without first putting the KVM host into maintenance mode.
E.g.: adding new storage to a KVM host and rescanning multipath while the host is still active.
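After either event, the first things I check are peer and volume health; these are standard gluster commands, with dev_gvol0 being the volume used throughout this article:

# gluster peer status
# gluster volume status dev_gvol0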

In this article, I will elaborate on how to tune GlusterFS parameters to avoid split-brain and perform this healing automatically.

How to overcome GlusterFS split-brain in a two-node setup

  1. Configure fencing on the KVM hosts.
  2. Tune the GlusterFS parameters.

Tuning GlusterFS parameters

These tuning parameters help heal split-brain files automatically. If you need to perform manual healing, please read my previous article: https://chanaka-dbhelp.blogspot.com/2023/01/olvm-gluster-data-domain-healing.html.
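For context, manual resolution uses the gluster split-brain subcommands; a typical invocation looks like the sketch below, where the file path is a placeholder rather than a file from this environment:

# gluster volume heal dev_gvol0 split-brain latest-mtime <path-to-file>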

Enable Healing

Enabling the parameters mentioned below turns on self-healing of data, metadata, and directory entries.

Note: Make sure to enable these parameters during off-peak hours.


Get the current parameter settings



## DEV
[root@KVM01 glusterfs]# gluster volume get dev_gvol0 all | grep heal
cluster.background-self-heal-count       8
cluster.metadata-self-heal               off
cluster.data-self-heal                   off
cluster.entry-self-heal                  off
cluster.self-heal-daemon                 on
cluster.heal-timeout                     600
cluster.self-heal-window-size            8
cluster.data-self-heal-algorithm         (null)
cluster.self-heal-readdir-size           1KB
cluster.heal-wait-queue-length           128
storage.health-check-interval            30
storage.health-check-timeout             20
cluster.disperse-self-heal-daemon        enable
disperse.background-heals                8
disperse.heal-wait-qlength               128
cluster.heal-timeout                     600
cluster.granular-entry-heal              no
disperse.self-heal-window-size           32
[root@KVM01 glusterfs]#
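Instead of grepping the full option list, a single option can also be queried directly with standard gluster syntax:

# gluster volume get dev_gvol0 cluster.metadata-self-heal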

Sample commands to enable the parameters



# gluster volume set VOLNAME cluster.data-self-heal on
# gluster volume set VOLNAME cluster.metadata-self-heal on 
# gluster volume set VOLNAME cluster.entry-self-heal on
# gluster volume set VOLNAME cluster.self-heal-daemon on

Enable self-healing parameters



gluster volume set dev_gvol0 cluster.data-self-heal on
gluster volume set dev_gvol0 cluster.metadata-self-heal on
gluster volume set dev_gvol0 cluster.entry-self-heal on
gluster volume set dev_gvol0 cluster.self-heal-daemon on

Expected output



[root@KVM01 glusterfs]# gluster volume set dev_gvol0 cluster.data-self-heal on
volume set: success
[root@KVM01 glusterfs]# gluster volume set dev_gvol0 cluster.metadata-self-heal on
volume set: success
[root@KVM01 glusterfs]# gluster volume set dev_gvol0 cluster.entry-self-heal on
volume set: success
[root@KVM01 glusterfs]#
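To confirm the changes took effect, re-run the get command; the data, metadata, and entry self-heal options should now report on:

# gluster volume get dev_gvol0 all | grep heal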

Expected log output: the log is located under /var/log/glusterfs/.



[2023-01-13 17:52:47.684532] I [glusterfsd-mgmt.c:2170:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: KVM121.sofe.ca:24007
The message "I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-dev_gvol0-client-0: option ping-timeout using set value 42" repeated 2 times between [2023-01-13 17:51:45.745482] and [2023-01-13 17:52:47.685495]
[2023-01-13 17:52:47.685520] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-client-0: option send-gids using set value true
[2023-01-13 17:52:47.685564] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-client-0: option strict-locks using set value off
[2023-01-13 17:52:47.685585] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-dev_gvol0-client-1: option ping-timeout using set value 42
[2023-01-13 17:52:47.685614] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-client-1: option send-gids using set value true
[2023-01-13 17:52:47.685631] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-client-1: option strict-locks using set value off
[2023-01-13 17:52:47.685684] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-replicate-0: option metadata-self-heal using set value on
[2023-01-13 17:52:47.685702] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-dev_gvol0-replicate-0: option data-self-heal using set value on
[2023-01-13 17:52:47.685721] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-replicate-0: option entry-self-heal using set value on
[2023-01-13 17:52:47.686118] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-dht: option lock-migration using set value off
[2023-01-13 17:52:47.686146] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-dht: option force-migration using set value off
[2023-01-13 17:52:47.686225] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-utime: option noatime using set value on
[2023-01-13 17:52:47.686377] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0: option count-fop-hits using set value off
[2023-01-13 17:52:47.686392] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0: option latency-measurement using set value off
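With self-heal enabled, a heal can also be triggered on demand using the standard heal commands; full crawls the entire volume, so run it during off-peak hours:

# gluster volume heal dev_gvol0
# gluster volume heal dev_gvol0 full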

GlusterFS Split-Brain Automatic Healing

Please refer to the My Oracle Support (MOS) note: How to Heal Gluster Split Brain Automatically (Doc ID 2688312.1).

Also, I would recommend consulting Oracle before changing these parameters.

The cluster.favorite-child-policy setting enables GlusterFS to prioritize healing based on ctime, mtime, size, or majority.
For manual recovery we selected the copy with the latest mtime, so for automatic selection we can set the policy to mtime.
If the volume has more than two bricks, we can select majority instead (see the example after the definitions below).

What are ctime, mtime, size, and majority?


ctime   : heal the volume using the file with the latest ctime.
mtime   : heal the volume using the file with the latest mtime.
size    : heal the volume using the file with the largest size.
majority: heal the volume using the file with the same mtime and size on more than half of the bricks in the replica (for volumes with three or more replicas).
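As an example, on a volume with three or more replicas the same set command can point the policy at majority; VOLNAME is a placeholder here, since the two-brick volume in this article uses mtime below:

# gluster volume set VOLNAME cluster.favorite-child-policy majority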

Validate the current parameter



[root@KVM01 ~]# gluster volume get dev_gvol0 all | grep -i child
cluster.favorite-child-policy            none
[root@KVM01 ~]#

Enable the parameter

This parameter can be enabled using the set command.

[root@KVM01 ~]# gluster volume set dev_gvol0 cluster.favorite-child-policy mtime
volume set: success
[root@KVM01 ~]#
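Re-running the earlier get command should now report mtime instead of none:

# gluster volume get dev_gvol0 all | grep -i child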

Validate


[root@KVM01 ~]# gluster volume heal dev_gvol0 info
Brick KVM01:/nodirectwritedata/glusterfs/dev_brick_01/dev_gvol0
Status: Connected
Number of entries: 0

Brick KVM02:/nodirectwritedata/glusterfs/dev_brick_02/dev_gvol0
Status: Connected
Number of entries: 0

[root@KVM01 ~]#
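To list only the entries that are actually in split-brain, rather than all pending heals, the info command also takes a split-brain argument:

# gluster volume heal dev_gvol0 info split-brain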

Conclusion

A Gluster split-brain condition means copies of the same file on different bricks in a volume have mismatched data. To reduce the chance of a split-brain condition, the recommendation is to use a replica 3 volume or an arbiter volume, because both use the client quorum option.

Client quorum is a feature of the Automatic File Replication (AFR) module that prevents split-brain in the I/O path of replicated and distributed-replicated volumes. Combined with the parameters above, this behavior can be tuned further.
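For reference, client quorum is controlled by the cluster.quorum-type option, which accepts none, auto, or fixed (with fixed, cluster.quorum-count sets how many bricks must be up). A minimal sketch, with auto being the usual choice:

# gluster volume set VOLNAME cluster.quorum-type auto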



