ODA (Oracle Database Appliance) upgrades are challenging, and you should be prepared to spend extra hours on them because of unexpected failures. As a proactive measure, open a proactive SR with Oracle before you start, so that support is ready when something fails. In this article, I cover the issue we encountered while upgrading from 18.3.0.0 to 18.8.0.0. The major component in this upgrade is the grid.
We encountered a major issue while upgrading the OVM on DOM0. Even when the component shows that the OVM version is up to date, the oakcli orchestration tool revalidates the version by rerunning the required rpm on dom0, and that is where we hit a bug in the opensm component. opensm is "an InfiniBand compliant Subnet Manager and Administration".
After unpacking the patch, you can verify the installed and proposed component versions with the "oakcli update -patch 18.8.0.0.0 --verify" command.
###### localhost
[root@localhost ~]# oakcli update -patch 18.8.0.0.0 --verify
INFO: 2022-05-27 22:00:05: Reading the metadata file now...
Component Name            Installed Version         Proposed Patch Version
---------------           ------------------        -----------------
Controller_INT            4.650.00-7176             Up-to-date
Controller_EXT            13.00.00.00               Up-to-date
Expander                  0018                      001E
SSD_SHARED {
[ c1d20,c1d21,c1d22,      A29A                      Up-to-date
c1d23 ]
[ c1d16,c1d17,c1d18,      A29A                      Up-to-date
c1d19 ]
}
HDD_LOCAL                 A7E0                      Up-to-date
HDD_SHARED {
[ c1d0,c1d1,c1d2,c1d      PAG1                      PD51
3,c1d4,c1d6,c1d7,c1d
8,c1d9,c1d10,c1d12,c
1d13,c1d14,c1d15 ]
[ c1d5,c1d11 ]            PD51                      Up-to-date
}
ILOM                      4.0.2.26.b r125868        4.0.4.52 r132805
BIOS                      30130500                  30300200
IPMI                      1.8.12.4                  1.8.15.0
HMP                       2.4.1.0.14                2.4.5.0.1
OAK                       18.3.0.0.0                18.8.0.0.0
OL                        6.10                      Up-to-date
OVM                       3.4.4                     Up-to-date
GI_HOME                   18.3.0.0.180717           18.8.0.0.191015
DB_HOME                   12.1.0.2.160119           12.1.0.2.191015
ASR                       18.3.1                    19.4.0
[root@localhost ~]#
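To see at a glance which components the patch will actually change, the verify output can be filtered. The following is only a minimal sketch: the function name pending_components is my own, and it assumes the three-column layout shown above. It handles single-line components only and deliberately ignores the multi-line SSD_SHARED and HDD_SHARED disk groups.

```shell
# Hypothetical helper: reads "oakcli update -patch <version> --verify" output
# on stdin and prints the names of single-line components whose proposed
# version is not "Up-to-date".
pending_components() {
  awk '
    # Skip the INFO banner and the column header line.
    /^INFO:/ || $1 == "Component" { next }
    # Keep rows that start with a letter, have at least three fields,
    # and do not end in "Up-to-date". Disk group rows start with "[",
    # "{" or "}" and are therefore not considered here.
    $1 ~ /^[A-Za-z]/ && NF >= 3 && $NF != "Up-to-date" { print $1 }
  '
}

# Example (run as root on the ODA node):
# oakcli update -patch 18.8.0.0.0 --verify | pending_components
```

Note that OVM would not appear in such a list for the output above, because it is reported as Up-to-date, yet the patching run still touched it.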
This OVM issue directly impacted patching from 18.3.0.0 to 18.8.0.0: DOM0 patching failed while trying to update OVM to 3.4.4. During the DOM0 upgrade, opensm was stopped with the command "service opensmd stop", and the process did not come back up after the network service restart. The opensm startup parameters are read from the “/etc/rc.local” file.
Below is the main server patching log.
Error
Main Server Patching Log :
2022-05-28 04:44:03: Executing cmd: /u01/app/18.0.0.0/grid/bin/crsctl check cluster
2022-05-28 04:44:03: Command output:
> CRS-4535: Cannot communicate with Cluster Ready Services
> CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
> CRS-4534: Cannot communicate with Event Manager ,
>End Command output
2022-05-28 04:44:03: SUCCESS: Successfully stopped the clusterware on local node
2022-05-28 04:44:03: Executing cmd: /opt/oracle/oak/bin/oakcli fstop oak
2022-05-28 04:44:06: Command output:
> 2022-05-28 04:44:03.583081370:[init.oak]:[Stopping oakd]
> 2022-05-28 04:44:06.640504545:[init.oak]:[Successfully stopped the oakd..] ,
>End Command output
2022-05-28 04:49:34: WARNING: Unable to update the device OVM to: 3.4.4
DOM0 Log :
The dom0 log shows the patching process repeatedly running the pgrep opensm command to obtain a value (the opensm process ID). The opensm service needs to be up and running for that command to return a value. Two logs are useful for getting a clear picture of the problem: “/var/log/opensm.log” and “/var/log/messages”.
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,732 root DEBUG Running: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
2022-05-28 04:51:02,813 root DEBUG Failed to run the cmd: pgrep opensm
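The same check can be repeated by hand on dom0 while the patch is stuck. This is a small sketch rather than anything oakcli itself provides: the function name proc_state is hypothetical, and the tail and grep lines are just one way to look at the two log files mentioned above.

```shell
# Hypothetical helper: report whether a process with the given name is
# running, using pgrep in the same way the dom0 patching log does.
proc_state() {
  if pgrep "$1" >/dev/null 2>&1; then
    echo "running"
  else
    echo "not running"
  fi
}

# Example on dom0 (left commented out so the block has no side effects):
# proc_state opensm
# tail -n 50 /var/log/opensm.log
# grep -i opensm /var/log/messages | tail -n 20
```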
Solution
Make sure to back up the “/etc/rc.local” file before making any changes. The fix is to add the new parameter “-y” to the opensm commands in that file. The opensm documentation describes the option as follows:
-y, --stay_on_fatal
This option will cause SM not to exit on fatal initialization issues: if SM discovers duplicated GUIDs or a 12x link with lane reversal badly configured. By default, the SM will exit on these errors.
-- Before
[root@localhost patch]# cat /etc/rc.local | grep -i opensm
service opensmd stop
opensm -g 0x0010e00001889ad9 -W -p 15 -B
opensm -g 0x0010e00001889ada -W -p 0 -B
-- After
[root@localhost patch]# cat /etc/rc.local | grep -i opensm
service opensmd stop
opensm -y -g 0x0010e00001889ad9 -W -p 15 -B
opensm -y -g 0x0010e00001889ada -W -p 0 -B
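The Before/After change above can also be scripted, which makes it easier to apply the same edit on every node. Treat this as a sketch under assumptions: the function name add_stay_on_fatal and the .bak backup suffix are my own choices, and it assumes the opensm lines in the file start with "opensm " exactly as in the listing above.

```shell
# Hypothetical helper: back up the given file to <file>.bak, then add "-y"
# to every line that starts with "opensm " and does not already have it.
add_stay_on_fatal() {
  file=$1
  # Keep a copy of the original file, preserving mode and timestamps.
  cp -p "$file" "$file.bak" || return 1
  # Rewrite from the backup so the edit does not rely on "sed -i".
  # Lines such as "service opensmd stop" do not match and stay unchanged.
  sed '/^opensm -y /!s/^opensm /opensm -y /' "$file.bak" > "$file"
}

# Example (run as root on dom0):
# add_stay_on_fatal /etc/rc.local
# grep -i opensm /etc/rc.local
```

Running the function a second time leaves the opensm lines unchanged, but it overwrites the .bak copy with the already modified file, so keep a separate copy of the original if you need one.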
Conclusion
Before the reboot, I strongly recommend running these commands in a terminal on both nodes. This validates that the modified commands start opensm without failing.
Expected output :
-- Node01
[root@localhost patch]# opensm -y -g 0x0010e00001889ad9 -W -p 15 -B
-------------------------------------------------
OpenSM 3.3.19
Reading Cached Option File: /etc/opensm/opensm.conf
-E- Parsing error in field guid, expected numeric input received: 0x0010e00001889ad9 0x0010e00001889ada
Unrecognized token: "max_seq_redisc"
Unrecognized token: "rereg_on_guid_migr"
Unrecognized token: "aguid_inout_notice"
Unrecognized token: "sm_assign_guid_func"
Unrecognized token: "reports"
Unrecognized token: "per_module_logging"
Unrecognized token: "consolidate_ipv4_mask"
Command Line Arguments:
Staying on fatal initialization errors
Guid 0x10e00001889ad9
Priority = 15
Daemon mode
Log File: /var/log/opensm.log
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
[root@localhost patch]# opensm -y -g 0x0010e00001889ada -W -p 0 -B
-------------------------------------------------
OpenSM 3.3.19
Reading Cached Option File: /etc/opensm/opensm.conf
-E- Parsing error in field guid, expected numeric input received: 0x0010e00001889ad9 0x0010e00001889ada
Unrecognized token: "max_seq_redisc"
Unrecognized token: "rereg_on_guid_migr"
Unrecognized token: "aguid_inout_notice"
Unrecognized token: "sm_assign_guid_func"
Unrecognized token: "reports"
Unrecognized token: "per_module_logging"
Unrecognized token: "consolidate_ipv4_mask"
Command Line Arguments:
Staying on fatal initialization errors
Guid 0x10e00001889ada
Priority = 0
Daemon mode
Log File: /var/log/opensm.log
-------------------------------------------------
[root@pinode0 patch]#
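If you save the output of these commands, you can check it for the line that confirms the “-y” option was picked up. This is only a sketch: the function name confirms_stay_on_fatal and the file path in the example are hypothetical, and the check relies on the "Staying on fatal initialization errors" line shown in the output above.

```shell
# Hypothetical helper: reads captured opensm start-up output on stdin and
# succeeds only if opensm reported that it stays on fatal errors.
confirms_stay_on_fatal() {
  grep -q 'Staying on fatal initialization errors'
}

# Example, assuming the output was saved to a file beforehand:
# confirms_stay_on_fatal < /tmp/opensm_start.out && echo "-y is active"
```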