TANTI TECHNOLOGIES: Thumb Rules for HACMP

1) Check the status of cluster “clstat -o;clRGinfo”, the cluster & the resource groups should be in stable state. Cluster commands/utilities directory –
‘/usr/es/sbin/cluster/utilities’. “lssrc -a|grep -i clst”: to check if the cluster is running. “lssrc -ls clstrmgr”

2) Use ‘clshowres’ to check the resource group configuration; with this info you can identify the volume group, serviceip, application server(appln.
start/stop scripts) associated with the resource group.

3) When you bring the RG(Resource Group) offline, after the RG is offline, verify the associated resources(service ip, vg, application) are down/offline as
well.

4) Use ‘cllsserv’ to find out the application start/stop scripts. Ensure the appln. start/stop scripts are in sync across nodes[you can use ‘cksum’ and verify].

5) Generate cluster snap; collect cluster information,
###############################
/usr/es/sbin/cluster/utilities/clsnapshot -c -i -n filename                                       /* generates filename.info in the
/usr/es/sbin/cluster/snapshots/’ directory.
/usr/es/sbin/cluster/clstat -o >> /usr/es/sbin/cluster/snapshots/filename.info     /* clstat displays current status of cluster; clstat may not work on some clusters.
/usr/es/sbin/cluster/utilities/clRGinfo >> /usr/es/sbin/cluster/snapshots/filename.info     /* displays the current status of resource groups.
/usr/es/sbin/cluster/utilities/cllsserv >> /usr/es/sbin/cluster/snapshots/filename.info     /* shows the start & stop scripts of the application server.
###############################

6) Collect system information of cluster nodes. The system information will contain the ip/static route information and
other important information.

7) Cluster verification is run on daily basis; look for the ‘fail’ or ‘error’ keyword in the ‘/var/hacmp/clverify/clverify.log’

8) Perform emulate(preview) “verification & synchronization” [smitty hacmp -> Extended Configuration -> Extended Verification and Synchronization-> select
emulate] and see if there are any errors.

9) If any errors reported, especially if it is the VG/lvm out of sync issue, then take downtime(resource groups down)(highly recommended), and then perform
actual sychronization(fixing of sync errors).
[Please note that it is not advised to perform actual sychronization with RG(resource group) being UP; because this could result in the disk lock issue and
could disrupt the future failover].

10) It is recommended that we donot perform direct ‘move’ operation of resource groups; instead perform offline of RG on one node, ensure all the associated
resources are offline, and then perform online of RG on other node. This will ensure that there are no resource conflict and there is clean movement of RG.

11) To monitor the RG online/offline/move: “tail -f /tmp/hacmp.out”

12) All the possible activities(VG/LVM/FS) to be performed via CSPOC(Cluster-single point of control). smitty hacmp -> C-SPOC. Before you perform any C-SPOC activity ensure you do
synchronization(emulation) to see if there any errors. If there are any errors then first fix the errors(with RGs offline), and then proceed with C-SPOC.

13) HACMP logs: ‘/tmp/hacmp.out’ ‘/usr/es/adm/cluster.log’ ‘/var/hacmp/clverify/clverify.log’

14) In case if you have performed the fs change outside C-SPOC, donot panic, we can synch it to other nodes(with RGs offline). In case if the customer
insists to do it online itself then there is procedure to do that as well. (1)

15) At some occasions: after the RG failover, serviceip may not ping/work; this could be due to arp cache, flush the serviceip entry in the arp cache on both nodes.
If the ip issue persists, then engage network team, they may have to refresh at switch level.

16) During the offline process if the RG status reports as ‘error’, then in majority of the cases it would be the filesystem unmount issue; HA maybe unable to unmount filesystem of the RG(filesystem in use by other process), thus it reports error for the RG. Stop the process/application using the filesystem, then unmount the filesystem and rerun RG offline.

17) During HACMP activities, clearly inform the application team that the ‘RG(Resource Group)’ online/startup does start the application(execute application start script automatically), and that they may need to just verify application. This precaution is because many of the application team members are unaware of HACMP/cluster functionality, due to which they may attempt to restart their application, which could harm their application.

18) In certain cases, where the HACMP is in unstable state, you may not be able to bring the RG offline gracefully, the turnaround is to execute the application stop script(‘cllsserv’), which will stop the application gracefully and then reboot the cluster node [this will ensure that the application is not corrupted due to abrupt shutdown].

19) Remember Resouce Group(RG) primarily consists of volume group, serviceip, application server(appln. start/stop scripts) resources.

20) Incase if the cluster is not functioning, and the application team demands their application to be UP asap, then in that adverse scenario, you can manually bring the resources online..that is activate vg, mount filesystems, activate serviceip, execute application start script [this shall bring UP their application(outside hacmp)].

21) If there are 2 or multiple resource groups in HACMP cluster, then verify the naming of mountpoints/filesystems of the volume groups of resource groups; the mountpoints/fsname/lvname should be unique across volume groups of RGs. It should not overlap.

22) You can recycle logs by ” clcycle hacmp.out and clcycle cluster.log” before doing any major HACMP/AIX changes in cluster. This will give an exact error log to begin with. Clcycle will take backup of both log files and will name it as cluster.log.1 and hacmp.out.1 & so on.

23) The shared volume groups in HACMP cluster where nodes are VIO clients(using VIO disks), have to be in “Enhanced Concurrent mode(active/pasive)” to avoid the possibility of filesystems/data corruption.

TANTI TECHNOLOGIES

Tanti Technology

Thursday, 28 September 2017

Thumb Rules for HACMP

No comments:

Post a Comment