Monday, June 6, 2022

12.1.2.10 - 12.1.2.12 - ODA upgrade - Pre-patching issue

 






ODA upgrades elevate technical knowledge to a different level for each upgrade. Each upgrade gives you a new obstacle to tackle in a limited time. 

While upgrading ODA from 12.1.2.10 to 12.1.2.12 found some unusual issue that was hitting during the pre-patching stage.

Patching was executed from the Master - node. The master node can be identified by executing the oakcli show ismaster


[root@piodadb1 ~]# oakcli show ismaster
OAKD is in Master Mode
[root@piodadb1 ~]#

Patching got stuck on the "/opt/oracle/oak/pkgrepos/System/12.1.2.12.0/bin/copydom0patch.py 12.1.2.12.0 http://192.168.18.1:7882" step. 

This step is used to copy the patch bundle into dom0  and extract the bundle. Once the extraction is complete same script will be used to apply the upgrade on dom0.

This was the log that was not moving for more than 45 min. 

Error: ODA_BASE



>End Command output
2022-01-22 11:26:52: Setting the DATA:RECO ratio to 43:57
2022-01-22 11:26:52: Executing cmd: /opt/oracle/oak/bin/oakcli modify config -DATA 43
2022-01-22 11:26:52: Executing cmd: python -c 'import sys;sys.path.append("/opt/oracle/oak/adapters/");import common_agentutils as cu;node=cu.getvmagentnodes();print("IP1:%s"%node[0]["IP"]);print("IP2:%s"%node[1]["IP"]);print("port1:%s"%node[0]["port"]);print("port2:%s"%node[1]["port"]);'
2022-01-22 11:26:52: Command output:
>  IP1:192.168.18.1
>  IP2:192.168.19.1
>  port1:7882
>  port2:7882 ,
>End Command output
2022-01-22 11:26:52: This is V4 machine
2022-01-22 11:26:52: Executing cmd: /opt/oracle/oak/pkgrepos/System/12.1.2.12.0/bin/copydom0patch.py 12.1.2.12.0 http://192.168.18.1:7882  -- Hanging for 45 min 

This is the part of the dom0 log that is located under /opt/oracle/oak/log/<hostname>/patch/oakpatching_*.log log updates were stuck after the extraction.

DOM0 : log



2000-09-21 09:38:32,942 root DEBUG Running: mkdir -p /tmp/otpatchdir
2000-09-21 09:38:32,942 root DEBUG Running: mkdir -p /tmp/otpatchdir
2000-09-21 09:38:32,942 root DEBUG Running: mkdir -p /tmp/otpatchdir
2000-09-21 09:38:33,006 root DEBUG Successfully run the command: mkdir -p /tmp/otpatchdir
2000-09-21 09:38:33,006 root DEBUG Successfully run the command: mkdir -p /tmp/otpatchdir
2000-09-21 09:38:33,006 root DEBUG Successfully run the command: mkdir -p /tmp/otpatchdir
2000-09-21 09:42:47,355 root DEBUG Running: cat /tmp/otpatchdir/* > /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar
2000-09-21 09:42:47,355 root DEBUG Running: cat /tmp/otpatchdir/* > /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar
2000-09-21 09:42:47,355 root DEBUG Running: cat /tmp/otpatchdir/* > /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar
2000-09-21 09:42:56,241 root DEBUG Successfully run the command: cat /tmp/otpatchdir/* > /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar
2000-09-21 09:42:56,241 root DEBUG Successfully run the command: cat /tmp/otpatchdir/* > /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar
2000-09-21 09:42:56,241 root DEBUG Successfully run the command: cat /tmp/otpatchdir/* > /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar
2000-09-21 09:42:56,242 root DEBUG Running: tar -xof  /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar  -C  /opt/oracle/oak
2000-09-21 09:42:56,242 root DEBUG Running: tar -xof  /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar  -C  /opt/oracle/oak
2000-09-21 09:42:56,242 root DEBUG Running: tar -xof  /opt/oracle/oak/pkgrepos/orapkgs/OVS/12.1.2.12.0/Base/OAKPatchBundle_12.1.2.12.0_170926.tar  -C  /opt/oracle/oak

Root Cause


In the script there is the exception clause with a sleep command after the extraction, In our case the system time of both ODA_BASE and DOM0 environment was different, somehow DOM0 time was Set to the year 2000 and ODA_BASE was in 2022.

 
Due to these different server times, the script was not moving and kept issuing the sleep function.



[root@localhost~]# cat /opt/oracle/oak/pkgrepos/System/12.1.2.12.0/bin/copydom0patch.py
#!/usr/bin/python
import os
import sys
import glob
import socket
import time
import httplib
from SimpleXMLRPCServer import SimpleXMLRPCServer
import xmlrpclib
from subprocess import Popen,PIPE

#TODO error handling
OAK_HOME = "/opt/oracle/oak"
version = sys.argv[1]
dom0addr = sys.argv[2]
fileloc = os.path.join(OAK_HOME, "pkgrepos", "orapkgs", "OVS", version, "Base/*")
filelist = glob.glob(fileloc)
srcfile = filelist[0]

tmppatchdest = os.path.join("/", "tmp", "otpatchdir")
tmppatchfile = os.path.join(tmppatchdest, "dom0patchfile.zip")

# create a temporary dir
cmd = "rm -rf " + tmppatchdest
proc = Popen(cmd, shell=True, stderr=PIPE, stdout=PIPE)
proc.wait()
cmd = "mkdir -p " + tmppatchdest
proc = Popen(cmd, shell=True, stderr=PIPE, stdout=PIPE)
proc.wait()

#split the patch file
cmd =  "split -b 25000000 -d "+ srcfile + " " +tmppatchfile
proc = Popen(cmd, shell=True, stderr=PIPE, stdout=PIPE)
proc.wait()

# get a server connection object
#proxy = xmlrpclib.ServerProxy("http://192.168.16.24:7881")
proxy = xmlrpclib.ServerProxy(dom0addr);

# bug20645053: First try to ping and see if the DOM0 agent is alive
# We do this only 3 times.


This was causing the issue due to different times on oda and dom0


count = 0
while (count < 2):
    try:
        print "trying to ping the dom0 agent"
        proxy.ping()
        count = 6
    except:
        time.sleep(60) <- This was causing the issue due to different times on oda and dom0
        count = count + 1

try:
    proxy.ping()
    print "successfully pinged the dom0 agent\n"
except (socket.error, xmlrpclib.Fault ,xmlrpclib.ProtocolError, xmlrpclib.ResponseError, httplib.InvalidURL), err:
    print "fatal error: could not ping DOM0 agent"
    print "exception %s", err
    sys.exit(1)

#create the destination area
# bug20645053: If we are unable to create a socket then exit
# with a "fatal error" message.
try:
    out = proxy.mkdest(tmppatchdest)
except (socket.error), err:
    print "fatal error: could not connect to the DOM0 agent"
    print "exception %s", err
    sys.exit(1)

#copy the patchfile to destination
globdir = os.path.join(tmppatchdest, "*")
for file in glob.glob(globdir):
    file = file.strip('\t\n\r')
    handle = open(file, "rb")
    out = proxy.copypatch(tmppatchdest, file, xmlrpclib.Binary(handle.read()))

# remove the  temporary dir
cmd = "rm -rf " + tmppatchdest
proc = Popen(cmd, shell=True, stderr=PIPE, stdout=PIPE)
proc.wait()


Conclusion

Before starting the update make sure to validate the server timing on all the nodes including dom0. I would recommend configuring NTP on all the servers which gives you better control over the server timing.  

No comments:

Post a Comment

Exacs database creation using dbaascli

  Intro OCI (Oracle Cloud Infrastructure) provides robust automation capabilities for routine maintenance tasks such as patching, ...