Monday, June 6, 2022

ODA - 18.8.0.0 - OVM upgrade issue

 



ODA (Oracle Database Appliance) upgrades are challenging, and you should be prepared to spend extra hours on them because of unexpected failures. As a proactive measure, open a proactive SR with Oracle before you start so that support is ready if one of these failures occurs. In this article, I cover the upgrade issue we encountered when going from 18.3.0.0 to 18.8.0.0. The major component in this upgrade is Grid Infrastructure (GI).

We hit a major issue while upgrading OVM on DOM0. Even though the component list shows the OVM version as up to date, the oakcli orchestration tool revalidates that version by rerunning the required RPMs on DOM0, and that step hit a bug in the opensm component. opensm is "an InfiniBand compliant Subnet Manager and Administration".


After unpacking the patch, run "oakcli update -patch 18.8.0.0.0 --verify" to compare each component's installed version with the proposed patch version.


###### localhost
[root@localhost ~]# oakcli update -patch  18.8.0.0.0 --verify
INFO: 2022-05-27 22:00:05: Reading the metadata file now...
                Component Name            Installed Version         Proposed Patch Version
                ---------------           ------------------        -----------------
                Controller_INT            4.650.00-7176             Up-to-date
                Controller_EXT            13.00.00.00               Up-to-date
                Expander                  0018                      001E
                SSD_SHARED {
                [ c1d20,c1d21,c1d22,      A29A                      Up-to-date
                c1d23 ]
                [ c1d16,c1d17,c1d18,      A29A                      Up-to-date
                c1d19 ]
                             }
                HDD_LOCAL                 A7E0                      Up-to-date
                HDD_SHARED {
                [ c1d0,c1d1,c1d2,c1d      PAG1                      PD51
                3,c1d4,c1d6,c1d7,c1d
                8,c1d9,c1d10,c1d12,c
                1d13,c1d14,c1d15 ]
                [ c1d5,c1d11 ]            PD51                      Up-to-date
                             }
                ILOM                      4.0.2.26.b r125868        4.0.4.52 r132805
                BIOS                      30130500                  30300200
                IPMI                      1.8.12.4                  1.8.15.0
                HMP                       2.4.1.0.14                2.4.5.0.1
                OAK                       18.3.0.0.0                18.8.0.0.0
                OL                        6.10                      Up-to-date
                OVM                       3.4.4                     Up-to-date
                GI_HOME                   18.3.0.0.180717           18.8.0.0.191015
                DB_HOME                   12.1.0.2.160119           12.1.0.2.191015
                ASR                       18.3.1                    19.4.0
[root@localhost ~]#

This OVM issue directly blocked patching from 18.3.0.0 to 18.8.0.0: DOM0 patching failed while trying to update OVM to 3.4.4. During the DOM0 update, opensm is stopped with "service opensmd stop", and the process did not come back up after the network service restarted. The opensm startup commands and their parameters are read from the "/etc/rc.local" file.

Error

Main Server Patching Log:

Below is an excerpt from the main server patching log. The last line is the OVM update error.

2022-05-28 04:44:03: Executing cmd: /u01/app/18.0.0.0/grid/bin/crsctl check cluster
2022-05-28 04:44:03: Command output:
>  CRS-4535: Cannot communicate with Cluster Ready Services
>  CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
>  CRS-4534: Cannot communicate with Event Manager , 
>End Command output
2022-05-28 04:44:03: SUCCESS:  Successfully stopped the clusterware on local node
2022-05-28 04:44:03: Executing cmd: /opt/oracle/oak/bin/oakcli fstop oak
2022-05-28 04:44:06: Command output:
>  2022-05-28 04:44:03.583081370:[init.oak]:[Stopping oakd]
>  2022-05-28 04:44:06.640504545:[init.oak]:[Successfully stopped the oakd..] , 
>End Command output
2022-05-28 04:49:34: WARNING:  Unable to update the device OVM to:  3.4.4


DOM0 Log:

The DOM0 log shows the patch process repeatedly running "pgrep opensm" to get the opensm process ID. This check can only succeed while the opensm service is up and running, and here every attempt failed. For a clear picture of the problem, review “/var/log/opensm.log” and “/var/log/messages”.
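To repeat the same check by hand on DOM0, a small helper like the one below can be used. This is a minimal sketch, not part of oakcli: the function name check_opensm is hypothetical, and it assumes the procps pgrep (for the -d delimiter option) and the two log paths named above. Source the file, then run check_opensm.

```shell
#!/bin/bash
# Hypothetical helper (not an oakcli command): repeat the DOM0 "pgrep opensm" check.
# pgrep exits non-zero when nothing matches, which corresponds to the
# "Failed to run the cmd: pgrep opensm" lines in the DOM0 log.
check_opensm() {
    local name="${1:-opensm}"
    local pids

    # -d ' ' prints all matching PIDs on one line.
    if pids=$(pgrep -d ' ' "$name"); then
        echo "$name is running, PID(s): $pids"
        return 0
    fi

    echo "$name is NOT running" >&2
    # Show recent entries from the two logs that describe the problem.
    if [ -r /var/log/opensm.log ]; then
        echo "== tail of /var/log/opensm.log ==" >&2
        tail -n 20 /var/log/opensm.log >&2
    fi
    if [ -r /var/log/messages ]; then
        echo "== opensm entries in /var/log/messages ==" >&2
        grep -i opensm /var/log/messages | tail -n 20 >&2
    fi
    return 1
}
```

A non-zero return from check_opensm after the network service restart is the same condition the patch process ran into.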


2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm

Solution

Back up the “/etc/rc.local” file before making any changes. Then add the new parameter “-y” to each opensm command in that file.

Refer to the opensm documentation for the full list of opensm parameters. The -y option is described as follows:


-y, --stay_on_fatal

This option will cause SM not to exit on fatal initialization issues: if SM discovers duplicated guids or a 12x link with lane reversal badly configured. By default, the SM will exit on these errors.




-- Before
[root@localhost patch]# cat /etc/rc.local | grep -i opensm
service opensmd stop
opensm -g 0x0010e00001889ad9 -W -p 15 -B
opensm -g 0x0010e00001889ada -W -p 0 -B

-- After 
[root@localhost patch]# cat /etc/rc.local | grep -i opensm
service opensmd stop
opensm -y -g 0x0010e00001889ad9 -W -p 15 -B
opensm -y -g 0x0010e00001889ada -W -p 0 -B
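The backup and edit can also be scripted so the change is identical on both nodes. This is a sketch under stated assumptions: add_stay_on_fatal is a hypothetical helper name, it relies on GNU sed's -i option (the Oracle Linux default), and it only touches lines that start with "opensm -g", matching the Before listing. Check the result against the After listing before rebooting.

```shell
#!/bin/bash
# Hypothetical helper: back up rc.local, then add -y (--stay_on_fatal)
# to every line that starts with "opensm -g", as in the Before listing.
# Assumes GNU sed (in-place editing with -i).
add_stay_on_fatal() {
    local rc_file="${1:-/etc/rc.local}"
    local backup
    backup="${rc_file}.$(date +%Y%m%d%H%M%S).bak"

    # Timestamped copy, preserving mode and ownership, before any edit.
    cp -p "$rc_file" "$backup" || return 1
    echo "Backup written to $backup"

    # Lines already reading "opensm -y -g" do not match, so a rerun is harmless.
    sed -i 's/^opensm -g /opensm -y -g /' "$rc_file" || return 1

    # Same check as the After listing.
    grep -i opensm "$rc_file"
}
```

Usage: source the file and run add_stay_on_fatal /etc/rc.local on each node. Because the pattern requires "opensm -g" at the start of the line, the "service opensmd stop" line is left unchanged.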


Conclusion

Before rebooting, I strongly recommend running these commands manually in a terminal on both nodes. This confirms that the commands start without errors.

Expected output:



-- Node01
[root@localhost patch]# opensm -y -g 0x0010e00001889ad9 -W -p 15 -B
-------------------------------------------------
OpenSM 3.3.19
 Reading Cached Option File: /etc/opensm/opensm.conf
-E- Parsing error in field guid, expected numeric input received: 0x0010e00001889ad9 0x0010e00001889ada
 Unrecognized token: "max_seq_redisc"
 Unrecognized token: "rereg_on_guid_migr"
 Unrecognized token: "aguid_inout_notice"
 Unrecognized token: "sm_assign_guid_func"
 Unrecognized token: "reports"
 Unrecognized token: "per_module_logging"
 Unrecognized token: "consolidate_ipv4_mask"
Command Line Arguments:
 Staying on fatal initialization errors
 Guid 0x10e00001889ad9
 Priority = 15
 Daemon mode
 Log File: /var/log/opensm.log
 
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

[root@localhost patch]# opensm -y -g 0x0010e00001889ada -W -p 0 -B
-------------------------------------------------
OpenSM 3.3.19
 Reading Cached Option File: /etc/opensm/opensm.conf
-E- Parsing error in field guid, expected numeric input received: 0x0010e00001889ad9 0x0010e00001889ada
 Unrecognized token: "max_seq_redisc"
 Unrecognized token: "rereg_on_guid_migr"
 Unrecognized token: "aguid_inout_notice"
 Unrecognized token: "sm_assign_guid_func"
 Unrecognized token: "reports"
 Unrecognized token: "per_module_logging"
 Unrecognized token: "consolidate_ipv4_mask"
Command Line Arguments:
 Staying on fatal initialization errors
 Guid 0x10e00001889ada
 Priority = 0
 Daemon mode
 Log File: /var/log/opensm.log
-------------------------------------------------
[root@pinode0 patch]#
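The manual pre-reboot check can be wrapped in a loop that runs every opensm line from rc.local exactly as written, as was done by hand above. This is a hedged sketch: run_rc_opensm and the DRY_RUN variable are hypothetical names, it uses eval because each line is a complete command copied from rc.local, and, like the manual run, it really starts opensm daemons unless DRY_RUN=1 is set.

```shell
#!/bin/bash
# Hypothetical helper: run each "opensm ..." line from rc.local by hand
# before the reboot and report failures. DRY_RUN=1 only prints the commands.
run_rc_opensm() {
    local rc_file="${1:-/etc/rc.local}"
    local failures=0 line

    # Only lines starting with "opensm " are taken; "service opensmd stop" is skipped.
    while IFS= read -r line; do
        echo "Running: $line"
        [ "${DRY_RUN:-0}" = 1 ] && continue
        # stdin from /dev/null so the command cannot consume the loop's input.
        if ! eval "$line" </dev/null; then
            echo "FAILED: $line" >&2
            failures=$((failures + 1))
        fi
    done < <(grep -E '^opensm ' "$rc_file")

    # Warn if any opensm line is still missing -y (--stay_on_fatal).
    if grep -E '^opensm ' "$rc_file" | grep -qv -e ' -y '; then
        echo "WARNING: an opensm line in $rc_file lacks -y" >&2
    fi
    return "$failures"
}
```

Usage: DRY_RUN=1 run_rc_opensm /etc/rc.local to review the commands, then run_rc_opensm /etc/rc.local on each node; a non-zero return means at least one command failed.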


