Intro
In any organization, data is one of the most valuable assets, and many technologies have emerged to protect it and avoid a single point of failure. Storing critical data behind a single point of failure is very risky. While many databases and other software products let you spread data out within the context of a single application, GlusterFS gives you the flexibility to spread data across multiple bricks.
GlusterFS (Gluster File System) is an open source distributed file system that can scale out in a building-block fashion to store multiple petabytes of data. GlusterFS is also supported as a cluster file system for OLVM. There are four main types of GlusterFS volumes, listed below; a sample volume-creation command follows the list.
Types of GlusterFS Volumes
- Distributed volume
- Replicated volume
- Distributed replicated volume
- Striped volume and distributed striped volume
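As an illustration, a two-brick replicated volume like the dev_gvol0 volume used later in this article could be created roughly as shown below. The hostnames and brick paths are taken from this environment and are examples only; newer Gluster releases may also warn that replica 2 volumes are prone to split-brain and ask for confirmation.
# gluster volume create dev_gvol0 replica 2 \
    KVM01:/nodirectwritedata/glusterfs/dev_brick_01/dev_gvol0 \
    KVM02:/nodirectwritedata/glusterfs/dev_brick_02/dev_gvol0
# gluster volume start dev_gvol0
# gluster volume info dev_gvol0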
With a distributed file system, GlusterFS files can end up in split-brain. In an OLVM environment this usually happens for two main reasons:
- Network failure on the OLVM management network.
- Performing storage changes without putting the KVM host into maintenance mode, for example adding new storage to a KVM host and rescanning multipath while the host is still active.
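To check whether a volume currently has files in split-brain, the heal info commands can be used (shown here against the dev_gvol0 volume used throughout this article):
# gluster volume heal dev_gvol0 info
# gluster volume heal dev_gvol0 info split-brain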
In this article, I will explain how to tune GlusterFS parameters so that this healing happens automatically and split-brain is avoided.
How to overcome GlusterFS split-brain in a two-node setup:
- Configure fencing on the KVM hosts.
- Tune the GlusterFS parameters.
Tuning GlusterFS Parameters
These tuning parameters help heal split-brain files automatically. If you need to perform manual healing, please read my previous article: https://chanaka-dbhelp.blogspot.com/2023/01/olvm-gluster-data-domain-healing.html.
Enable Healing
Enabling the parameters mentioned below turns on self-healing of data, metadata, and directory entries.
Note: Make sure to enable these parameters during off-peak hours.
Get the current parameter settings
## DEV
[root@KVM01 glusterfs]# gluster volume get dev_gvol0 all | grep heal
cluster.background-self-heal-count 8
cluster.metadata-self-heal off
cluster.data-self-heal off
cluster.entry-self-heal off
cluster.self-heal-daemon on
cluster.heal-timeout 600
cluster.self-heal-window-size 8
cluster.data-self-heal-algorithm (null)
cluster.self-heal-readdir-size 1KB
cluster.heal-wait-queue-length 128
storage.health-check-interval 30
storage.health-check-timeout 20
cluster.disperse-self-heal-daemon enable
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
cluster.granular-entry-heal no
disperse.self-heal-window-size 32
[root@KVM01 glusterfs]#
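If you only want to look at a single option rather than grepping the full list, gluster volume get also accepts an option name directly, for example:
# gluster volume get dev_gvol0 cluster.data-self-heal
# gluster volume get dev_gvol0 cluster.metadata-self-heal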
Sample commands to enable the parameters
# gluster volume set VOLNAME cluster.data-self-heal on
# gluster volume set VOLNAME cluster.metadata-self-heal on
# gluster volume set VOLNAME cluster.entry-self-heal on
# gluster volume set VOLNAME cluster.self-heal-daemon on
Enable self-healing parameters
gluster volume set dev_gvol0 cluster.data-self-heal on
gluster volume set dev_gvol0 cluster.metadata-self-heal on
gluster volume set dev_gvol0 cluster.entry-self-heal on
gluster volume set dev_gvol0 cluster.self-heal-daemon on
Expected output
[root@KVM01 glusterfs]# gluster volume set dev_gvol0 cluster.data-self-heal on
volume set: success
[root@KVM01 glusterfs]# gluster volume set dev_gvol0 cluster.metadata-self-heal on
volume set: success
[root@KVM01 glusterfs]# gluster volume set dev_gvol0 cluster.entry-self-heal on
volume set: success
[root@KVM01 glusterfs]#
Expected log output (the log is located under /var/log/glusterfs/):
[2023-01-13 17:52:47.684532] I [glusterfsd-mgmt.c:2170:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: KVM121.sofe.ca:24007
The message "I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-dev_gvol0-client-0: option ping-timeout using set value 42" repeated 2 times between [2023-01-13 17:51:45.745482] and [2023-01-13 17:52:47.685495]
[2023-01-13 17:52:47.685520] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-client-0: option send-gids using set value true
[2023-01-13 17:52:47.685564] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-client-0: option strict-locks using set value off
[2023-01-13 17:52:47.685585] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-dev_gvol0-client-1: option ping-timeout using set value 42
[2023-01-13 17:52:47.685614] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-client-1: option send-gids using set value true
[2023-01-13 17:52:47.685631] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-client-1: option strict-locks using set value off
[2023-01-13 17:52:47.685684] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-replicate-0: option metadata-self-heal using set value on
[2023-01-13 17:52:47.685702] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-dev_gvol0-replicate-0: option data-self-heal using set value on
[2023-01-13 17:52:47.685721] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-replicate-0: option entry-self-heal using set value on
[2023-01-13 17:52:47.686118] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-dht: option lock-migration using set value off
[2023-01-13 17:52:47.686146] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-dht: option force-migration using set value off
[2023-01-13 17:52:47.686225] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0-utime: option noatime using set value on
[2023-01-13 17:52:47.686377] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0: option count-fop-hits using set value off
[2023-01-13 17:52:47.686392] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-dev_gvol0: option latency-measurement using set value off
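After the reconfiguration, the heal-related options can be re-checked with the same get command used earlier (the four options should now report on), and an index heal can be triggered and monitored with the standard heal commands:
# gluster volume get dev_gvol0 all | grep heal
# gluster volume heal dev_gvol0
# gluster volume heal dev_gvol0 info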
GlusterFS Split-Brain Automatic Healing
Please refer to the Oracle My Oracle Support (MOS) note: How to Heal Gluster Split Brain Automatically (Doc ID 2688312.1).
The cluster.favorite-child-policy setting enables GlusterFS to prioritize healing based on ctime, mtime, size, or majority. For manual recovery we selected the copy with the latest mtime, so for automatic selection we can set the policy to mtime. If the volume has more than two bricks, we can select majority instead (an example follows the definitions below).
What are ctime, mtime, size, and majority?
- ctime: heal the volume using the file with the latest ctime.
- mtime: heal the volume using the file with the latest mtime.
- size: heal the volume using the file with the largest size.
- majority: heal the volume using the file with identical mtime and size on more than half of the bricks in the replica (applicable to volumes with 3 or more replicas).
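For example, on a volume with three or more replicas the policy could be set to majority instead, and a policy can be reverted to its default with volume reset. These are standard gluster CLI commands; VOLNAME is a placeholder:
# gluster volume set VOLNAME cluster.favorite-child-policy majority
# gluster volume reset VOLNAME cluster.favorite-child-policy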
Validate the current parameter
[root@KVM01 ~]# gluster volume get dev_gvol0 all | grep -i child
cluster.favorite-child-policy none
[root@KVM01 ~]#
Enable the parameter
This parameter can be enabled using the set command.
[root@KVM01 ~]# gluster volume set dev_gvol0 cluster.favorite-child-policy mtime
volume set: success
[root@KVM01 ~]#
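The new value can be checked directly with the get command; it should now report mtime instead of none:
# gluster volume get dev_gvol0 cluster.favorite-child-policy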
Validate
[root@KVM01 ~]# gluster volume heal dev_gvol0 info
Brick KVM01:/nodirectwritedata/glusterfs/dev_brick_01/dev_gvol0
Status: Connected
Number of entries: 0
Brick KVM02:/nodirectwritedata/glusterfs/dev_brick_02/dev_gvol0
Status: Connected
Number of entries: 0
[root@KVM01 ~]#
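It is also worth confirming that the self-heal daemon is running on both hosts; gluster volume status lists a Self-heal Daemon entry for each brick host:
# gluster volume status dev_gvol0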
Conclusion
A Gluster split-brain condition means that copies of the same file on different bricks of a volume have mismatched data. To reduce the chance of a split-brain condition, the recommendation is to use a replica 3 volume or an arbiter volume, because both use the client quorum option.
Client quorum is a feature of the Automatic File Replication (AFR) module. It prevents split-brain in the I/O path of replicated and distributed-replicated volumes. Using the parameters above, this behaviour can be tuned further.
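As a side note, client quorum is controlled through the cluster.quorum-type and cluster.quorum-count volume options. A minimal sketch of inspecting and setting them is shown below; defaults differ between Gluster versions, so treat the values as examples only:
# gluster volume get VOLNAME cluster.quorum-type
# gluster volume set VOLNAME cluster.quorum-type auto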