Monday, March 13, 2023

OLVM - Glusterfs tuning to avoid split-brain

 




Intro

In any organization, data is the most valuable asset, and many technologies have emerged to protect it and avoid a single point of failure. When storing critical data, a single point of failure is very risky. While many databases and other software let you spread data out in the context of a single application, GlusterFS gives you the flexibility to spread data across multiple bricks.

GlusterFS (Gluster File System) is an open-source distributed file system that can scale out in a building-block fashion to store multiple petabytes of data. GlusterFS is also supported as a cluster file system for OLVM. There are four main types of GlusterFS volumes (a sample volume-creation command follows the list below).

Types of GlusterFS Volumes
  • Distributed volume
  • Replicated volume
  • Distributed replicated volume
  • Striped Volume and Distributed Striped Volume
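
As a quick illustration of a replicated volume, the sketch below creates and starts a two-brick replica using the brick paths that appear later in this article; treat the host names and paths as examples only, and note that a plain replica 2 layout is the one most exposed to split-brain.

## Sketch only: two-brick replicated volume (example host names and brick paths)
# gluster volume create dev_gvol0 replica 2 \
    KVM01:/nodirectwritedata/glusterfs/dev_brick_01/dev_gvol0 \
    KVM02:/nodirectwritedata/glusterfs/dev_brick_02/dev_gvol0
# gluster volume start dev_gvol0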

When you have a distributed file system, GlusterFS files can end up in a split-brain state. In an OLVM environment, this is mainly caused by two things:

  • Network failure on the OLVM management network.
In our OLVM configuration, the management network is used to replicate GlusterFS data from one brick to another.

  • Performing storage changes without first putting the KVM host into maintenance mode.
E.g., adding new storage to a KVM host and rescanning multipath without putting the host into maintenance mode (a quick health check you can run before such changes is sketched after this list).
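
Before making any storage change, it is worth confirming that the Gluster peers and bricks are healthy. This is only a sketch using the dev_gvol0 volume referenced later in this article; adapt the volume name to your environment.

## Sketch only: confirm peers and bricks are healthy before storage changes
# gluster peer status
# gluster volume status dev_gvol0
## Review the multipath layout before and after a rescan
# multipath -ll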

In this article, I will elaborate on how we can tune GlusterFS parameters so that split-brain healing is performed automatically.

How to overcome GlusterFS split-brain in a 2-node setup

  1. Configure fencing on the KVM hosts.
  2. Tune the GlusterFS parameters.

Tuning the GlusterFS parameters

These tuning parameters will help heal split-brain files automatically. If you need to perform manual healing, please read my previous article: https://chanaka-dbhelp.blogspot.com/2023/01/olvm-gluster-data-domain-healing.html.
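
Before changing anything, you can also check whether a volume already has files in split-brain. The command below is a quick sketch using the dev_gvol0 volume from this article.

## Sketch only: list files currently in split-brain
# gluster volume heal dev_gvol0 info split-brain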

Enable Healing

Enabling the parameters mentioned below turns on self-healing of data, metadata, and directory entries.

Note: Make sure to enable these parameters during non-peak hours.


Get the current parameter settings



## DEV
[root@KVM01 glusterfs]# gluster volume get dev_gvol0 all | grep heal
cluster.background-self-heal-count       8
cluster.metadata-self-heal               off
cluster.data-self-heal                   off
cluster.entry-self-heal                  off
cluster.self-heal-daemon                 on
cluster.heal-timeout                     600
cluster.self-heal-window-size            8
cluster.data-self-heal-algorithm         (null)
cluster.self-heal-readdir-size           1KB
cluster.heal-wait-queue-length           128
storage.health-check-interval            30
storage.health-check-timeout             20
cluster.disperse-self-heal-daemon        enable
disperse.background-heals                8
disperse.heal-wait-qlength               128
cluster.heal-timeout                     600
cluster.granular-entry-heal              no
disperse.self-heal-window-size           32
[root@KVM01 glusterfs]#

Sample commands to enable the parameters



# gluster volume set VOLNAME cluster.data-self-heal on
# gluster volume set VOLNAME cluster.metadata-self-heal on 
# gluster volume set VOLNAME cluster.entry-self-heal on
# gluster volume set VOLNAME self-heal-daemon on

Enable self-healing parameters



gluster volume set dev_gvol0 cluster.data-self-heal on
gluster volume set dev_gvol0 cluster.metadata-self-heal on
gluster volume set dev_gvol0 cluster.entry-self-heal on
gluster volume set dev_gvol0 self-heal-daemon on

Expected output



[root@KVM01 glusterfs]# gluster volume set dev_gvol0 cluster.data-self-heal on
volume set: success
[root@KVM01 glusterfs]# gluster volume set dev_gvol0 cluster.metadata-self-heal on
volume set: success
[root@KVM01 glusterfs]# gluster volume set dev_gvol0 cluster.entry-self-heal on
volume set: success
[root@KVM01 glusterfs]#

Expected log output: the log is located under /var/log/glusterfs/



[2023-01-13 17:52:47.684532] I [glusterfsd-mgmt.c:2170:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: KVM121.sofe.ca:24007
The message "I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-dev_gvol0-client-0: option ping-timeout using set value 42" repeated 2 times between [2023-01-13 17:51:45.745482] and [2023-01-13 17:52:47.685495]
[2023-01-13 17:52:47.685520] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-client-0: option send-gids using set value true
[2023-01-13 17:52:47.685564] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-client-0: option strict-locks using set value off
[2023-01-13 17:52:47.685585] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-dev_gvol0-client-1: option ping-timeout using set value 42
[2023-01-13 17:52:47.685614] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-client-1: option send-gids using set value true
[2023-01-13 17:52:47.685631] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-client-1: option strict-locks using set value off
[2023-01-13 17:52:47.685684] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-replicate-0: option metadata-self-heal using set value on
[2023-01-13 17:52:47.685702] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-dev_gvol0-replicate-0: option data-self-heal using set value on
[2023-01-13 17:52:47.685721] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-replicate-0: option entry-self-heal using set value on
[2023-01-13 17:52:47.686118] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-dht: option lock-migration using set value off
[2023-01-13 17:52:47.686146] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-dht: option force-migration using set value off
[2023-01-13 17:52:47.686225] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-utime: option noatime using set value on
[2023-01-13 17:52:47.686377] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0: option count-fop-hits using set value off
[2023-01-13 17:52:47.686392] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0: option latency-measurement using set value off
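
After the change, you can re-run the earlier query to confirm that the self-heal options now report on; this sketch simply repeats the get command used at the start of this section.

## Sketch only: confirm the self-heal options are now enabled
# gluster volume get dev_gvol0 all | grep heal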

GlusterFS Split-Brain Automatic Healing

Please refer to the My Oracle Support (MOS) note: How to Heal Gluster Split Brain Automatically (Doc ID 2688312.1).

Also, I would recommend consulting Oracle Support before changing these parameters.

This setting lets GlusterFS prioritize healing automatically based on ctime, mtime, size, or majority.
For manual recovery we selected the copy with the latest mtime, so for automatic selection we can set the policy to mtime as well.

If we have more than 2 bricks, we can select majority.

What are ctime, mtime, size, and majority?


ctime   : heal the volume using the file with the latest ctime.
mtime   : heal the volume using the file with the latest mtime.
size    : heal the volume using the file with the largest size.
majority: heal the volume using the file whose mtime and size are identical on more than half of the bricks in the replica (for volumes with 3 or more replicas).

Validate the current parameter



[root@KVM01 ~]# gluster volume get dev_gvol0 all | grep -i child
cluster.favorite-child-policy            none
[root@KVM01 ~]#

Enable the parameter

This parameter can be enabled using the set command.

[root@KVM01 ~]# gluster volume set dev_gvol0 cluster.favorite-child-policy mtime
volume set: success
[root@KVM01 ~]#
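
If the volume has three or more replicas, the policy can instead be set to majority. This is a sketch only; the two-node setup in this article uses mtime.

## Sketch only: alternative policy for volumes with 3 or more replicas
# gluster volume set VOLNAME cluster.favorite-child-policy majority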

Validate


[root@KVM01 ~]# gluster volume heal dev_gvol0 info
Brick KVM01:/nodirectwritedata/glusterfs/dev_brick_01/dev_gvol0
Status: Connected
Number of entries: 0

Brick KVM02:/nodirectwritedata/glusterfs/dev_brick_02/dev_gvol0
Status: Connected
Number of entries: 0

[root@KVM01 ~]#

Conclusion

A Gluster split-brain condition means the same file on different bricks of a volume has mismatched data. To reduce the chance of a split-brain condition, the recommendation is to use a replica 3 volume or an arbiter volume, because both use the client quorum option.
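
As an illustration of this recommendation, the sketch below creates a replica 3 arbiter volume; the volume name, host names, and brick paths are hypothetical.

## Sketch only: replica 3 arbiter volume (the third brick stores metadata only)
# gluster volume create gvol_arb replica 3 arbiter 1 \
    KVM01:/bricks/gvol_arb/brick1 \
    KVM02:/bricks/gvol_arb/brick2 \
    KVM03:/bricks/gvol_arb/arbiter
# gluster volume start gvol_arb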

Client quorum is a feature of the Automatic File Replication (AFR) module. It prevents split-brain in the I/O path of replicated and distributed-replicate volumes. Using the parameters above, this behavior can be tuned further.
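
Client quorum is controlled through volume options. The sketch below shows how the current quorum settings can be inspected and how the quorum type can be set to auto; this is an illustration only, so check the defaults for your volume type before changing it.

## Sketch only: inspect and set client quorum options
# gluster volume get dev_gvol0 all | grep quorum
# gluster volume set dev_gvol0 cluster.quorum-type auto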



