Every ODA upgrade teaches you something new, and each one brings a fresh obstacle to tackle in a limited time.
While upgrading an ODA from 12.1.2.10 to 12.1.2.12, I found an unusual issue during the pre-patching stage.
Patching was executed from the master node. The master node can be identified by running oakcli show ismaster:
[root@piodadb1 ~]# oakcli show ismaster
OAKD is in Master Mode
[root@piodadb1 ~]#
Patching got stuck on the "/opt/oracle/oak/pkgrepos/System/12.1.2.12.0/bin/copydom0patch.py 12.1.2.12.0 http://192.168.18.1:7882" step.
This step copies the patch bundle into dom0 and extracts it. Once the extraction is complete, the same script is used to apply the upgrade on dom0.
This is the log that did not move for more than 45 minutes:
Error: ODA_BASE
>End Command output
2022-01-22 11:26:52: Setting the DATA:RECO ratio to 43:57
2022-01-22 11:26:52: Executing cmd: /opt/oracle/oak/bin/oakcli modify config -DATA 43
2022-01-22 11:26:52: Executing cmd: python -c 'import sys;sys.path.append("/opt/oracle/oak/adapters/");import common_agentutils as cu;node=cu.getvmagentnodes();print("IP1:%s"%node[0]["IP"]);print("IP2:%s"%node[1]["IP"]);print("port1:%s"%node[0]["port"]);print("port2:%s"%node[1]["port"]);'
2022-01-22 11:26:52: Command output:
> IP1:192.168.18.1
> IP2:192.168.19.1
> port1:7882
> port2:7882 ,
>End Command output
2022-01-22 11:26:52: This is V4 machine
2022-01-22 11:26:52: Executing cmd: /opt/oracle/oak/pkgrepos/System/12.1.2.12.0/bin/copydom0patch.py 12.1.2.12.0 http://192.168.18.1:7882 -- Hanging for 45 min
This is the relevant part of the dom0 log, which is located under /opt/oracle/oak/log/<hostname>/patch/oakpatching_*.log. The log updates were stuck after the extraction.
DOM0 : log
2000-09-21 09:38:32,942 root DEBUG Running: mkdir -p /tmp/otpatchdir
2000-09-21 09:38:32,942 root DEBUG Running: mkdir -p /tmp/otpatchdir
2000-09-21 09:38:32,942 root DEBUG Running: mkdir -p /tmp/otpatchdir
2000-09-21 09:38:33,006 root DEBUG Successfully run the command: mkdir -p /tmp/otpatchdir
2000-09-21 09:38:33,006 root DEBUG Successfully run the command: mkdir -p /tmp/otpatchdir
2000-09-21 09:38:33,006 root DEBUG Successfully run the command: mkdir -p /tmp/otpatchdir
2000-09-21 09:42:47,355 root DEBUG Running: cat /tmp/otpatchdir/* > /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar
2000-09-21 09:42:47,355 root DEBUG Running: cat /tmp/otpatchdir/* > /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar
2000-09-21 09:42:47,355 root DEBUG Running: cat /tmp/otpatchdir/* > /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar
2000-09-21 09:42:56,241 root DEBUG Successfully run the command: cat /tmp/otpatchdir/* > /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar
2000-09-21 09:42:56,241 root DEBUG Successfully run the command: cat /tmp/otpatchdir/* > /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar
2000-09-21 09:42:56,241 root DEBUG Successfully run the command: cat /tmp/otpatchdir/* > /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar
2000-09-21 09:42:56,242 root DEBUG Running: tar -xof /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar -C /opt/oracle/oak
2000-09-21 09:42:56,242 root DEBUG Running: tar -xof /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar -C /opt/oracle/oak
2000-09-21 09:42:56,242 root DEBUG Running: tar -xof /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar -C /opt/oracle/oak
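The transfer itself is a split-and-reassemble pattern: copydom0patch.py (listed in full further down) cuts the bundle into 25,000,000-byte pieces with split, and the dom0 log above shows those pieces being concatenated back with cat before tar extracts them. The following is only a minimal Python 3 sketch of that pattern, not Oracle's code. The function names are hypothetical, and the chunk naming simply imitates what split -d produces for the dom0patchfile.zip prefix.

```python
import os
import tempfile

CHUNK_SIZE = 25000000  # same value the script passes to `split -b`


def split_file(src, dest_dir, chunk_size=CHUNK_SIZE):
    """Write src into dest_dir as numbered chunks and return the chunk paths."""
    paths = []
    index = 0
    with open(src, "rb") as fh:
        while True:
            data = fh.read(chunk_size)
            if not data:
                break
            # Two-digit numeric suffix, like `split -d` (00, 01, 02, ...)
            path = os.path.join(dest_dir, "dom0patchfile.zip%02d" % index)
            with open(path, "wb") as out:
                out.write(data)
            paths.append(path)
            index += 1
    return paths


def join_chunks(chunk_paths, dest):
    """Concatenate chunks in sorted name order, like `cat dir/* > dest`."""
    with open(dest, "wb") as out:
        for path in sorted(chunk_paths):
            with open(path, "rb") as fh:
                out.write(fh.read())


def roundtrip_ok(payload, chunk_size):
    """Split and rejoin a payload in a scratch directory; True if unchanged."""
    with tempfile.TemporaryDirectory() as work:
        src = os.path.join(work, "bundle.tar")
        chunk_dir = os.path.join(work, "otpatchdir")
        os.mkdir(chunk_dir)
        with open(src, "wb") as fh:
            fh.write(payload)
        chunks = split_file(src, chunk_dir, chunk_size)
        rebuilt = os.path.join(work, "rebuilt.tar")
        join_chunks(chunks, rebuilt)
        with open(rebuilt, "rb") as fh:
            return fh.read() == payload
```

The reassembly depends on the chunk names sorting in the order they were written, which is why the suffix is zero-padded.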
Root Cause
The script has an exception clause with a sleep command after the extraction. In our case the system time differed between the ODA_BASE and DOM0 environments: somehow the DOM0 time was set to the year 2000, while ODA_BASE was in 2022.
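The mismatch is visible in the two log excerpts above just by comparing the leading timestamps. As a rough illustration, the Python 3 sketch below parses the first 19 characters of one line from each log and subtracts them. The helper names are hypothetical, and because the two lines were not written at the same instant, this only exposes a gross difference such as the one seen here, not a skew of a few seconds.

```python
from datetime import datetime


def parse_log_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' of a log line (fraction ignored)."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def clock_skew(oda_base_line, dom0_line):
    """Return the ODA_BASE timestamp minus the dom0 timestamp as a timedelta."""
    return parse_log_time(oda_base_line) - parse_log_time(dom0_line)


# One line taken from each of the log excerpts shown in this post
oda_base = "2022-01-22 11:26:52: This is V4 machine"
dom0 = "2000-09-21 09:42:56,242 root DEBUG Running: tar -xof"

skew = clock_skew(oda_base, dom0)
print("difference between the two logs: %d days" % skew.days)
```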
[root@localhost~]# cat /opt/oracle/oak/pkgrepos/System/12.1.2.12.0/bin/copydom0patch.py
#!/usr/bin/python
import os
import sys
import glob
import socket
import time
import httplib
from SimpleXMLRPCServer import SimpleXMLRPCServer
import xmlrpclib
from subprocess import Popen,PIPE
#TODO error handling
OAK_HOME = "/opt/oracle/oak"
version = sys.argv[1]
dom0addr = sys.argv[2]
fileloc = os.path.join(OAK_HOME, "pkgrepos", "orapkgs", "OVS", version, "Base/*")
filelist = glob.glob(fileloc)
srcfile = filelist[0]
tmppatchdest = os.path.join("/", "tmp", "otpatchdir")
tmppatchfile = os.path.join(tmppatchdest, "dom0patchfile.zip")
# create a temporary dir
cmd = "rm -rf " + tmppatchdest
proc = Popen(cmd, shell=True, stderr=PIPE, stdout=PIPE)
proc.wait()
cmd = "mkdir -p " + tmppatchdest
proc = Popen(cmd, shell=True, stderr=PIPE, stdout=PIPE)
proc.wait()
#split the patch file
cmd = "split -b 25000000 -d "+ srcfile + " " +tmppatchfile
proc = Popen(cmd, shell=True, stderr=PIPE, stdout=PIPE)
proc.wait()
# get a server connection object
#proxy = xmlrpclib.ServerProxy("http://192.168.16.24:7881")
proxy = xmlrpclib.ServerProxy(dom0addr);
# bug20645053: First try to ping and see if the DOM0 agent is alive
# We do this only 3 times.
count = 0
while (count < 2):
    try:
        print "trying to ping the dom0 agent"
        proxy.ping()
        count = 6
    except:
        time.sleep(60)  # <- This was causing the issue due to different times on ODA_BASE and dom0
        count = count + 1
try:
    proxy.ping()
    print "successfully pinged the dom0 agent\n"
except (socket.error, xmlrpclib.Fault ,xmlrpclib.ProtocolError, xmlrpclib.ResponseError, httplib.InvalidURL), err:
    print "fatal error: could not ping DOM0 agent"
    print "exception %s", err
    sys.exit(1)
#create the destination area
# bug20645053: If we are unable to create a socket then exit
# with a "fatal error" message.
try:
    out = proxy.mkdest(tmppatchdest)
except (socket.error), err:
    print "fatal error: could not connect to the DOM0 agent"
    print "exception %s", err
    sys.exit(1)
#copy the patchfile to destination
globdir = os.path.join(tmppatchdest, "*")
for file in glob.glob(globdir):
    file = file.strip('\t\n\r')
    handle = open(file, "rb")
    out = proxy.copypatch(tmppatchdest, file, xmlrpclib.Binary(handle.read()))
# remove the temporary dir
cmd = "rm -rf " + tmppatchdest
proc = Popen(cmd, shell=True, stderr=PIPE, stdout=PIPE)
proc.wait()
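For readers who want to experiment with the ping-and-retry part of the script outside the appliance, here is a minimal Python 3 sketch of the same idea: try a fixed number of times, sleep between attempts, and fail loudly at the end. It is not Oracle's code and it does not address the time difference itself. The function name wait_for_agent and its parameters are hypothetical, and the ping and sleep callables are passed in so the loop can be exercised without a dom0 agent.

```python
import time


def wait_for_agent(ping, attempts=3, delay=60.0, sleep=time.sleep):
    """Call ping() up to `attempts` times, sleeping `delay` seconds between tries.

    Returns the number of the attempt that succeeded.
    Raises RuntimeError if every attempt fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            ping()
            return attempt
        except Exception as err:  # a real script would catch narrower errors
            last_error = err
            if attempt < attempts:
                sleep(delay)
    raise RuntimeError(
        "agent did not answer after %d attempts: %s" % (attempts, last_error)
    )


# Possible usage against the agent address seen in the logs (assumption:
# the agent exposes ping() over XML-RPC, as the original script suggests):
#
#   import xmlrpc.client
#   proxy = xmlrpc.client.ServerProxy("http://192.168.18.1:7882")
#   wait_for_agent(proxy.ping)
```

Printing or logging each failed attempt inside the except branch would make a stall like the one described here easier to spot.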
Conclusion
Before starting the update, make sure to validate the system time on all the nodes, including dom0. I would recommend configuring NTP on all the servers, which gives you better control over the server time.
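A pre-check along these lines can be as small as comparing one epoch value per host. The Python 3 sketch below assumes the values were already collected, for example by running date -u +%s on each ODA_BASE and dom0 host. The collection step, the node labels, the function names and the 60-second tolerance are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def clock_spread(epoch_by_node):
    """Return the gap in seconds between the fastest and slowest clock."""
    values = list(epoch_by_node.values())
    return max(values) - min(values)


def clocks_agree(epoch_by_node, tolerance_seconds=60):
    """True if all collected clocks are within the tolerance of each other."""
    return clock_spread(epoch_by_node) <= tolerance_seconds


# Sample values modelled on the timestamps from the logs in this post;
# the node labels are placeholders, not real hostnames.
sample = {
    "oda_base_node0": datetime(2022, 1, 22, 11, 26, 52, tzinfo=timezone.utc).timestamp(),
    "dom0_node0": datetime(2000, 9, 21, 9, 42, 56, tzinfo=timezone.utc).timestamp(),
}

if not clocks_agree(sample):
    print("clock spread too large: %.0f seconds" % clock_spread(sample))
```

The tolerance has to be generous enough to absorb the time it takes to query the hosts one after another.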