Tanti Technology

Bangalore, Karnataka, India
Multi-platform UNIX systems consultant and administrator in mutualized and virtualized environments. I have 4.5+ years of experience in AIX system administration. This site aims to help system administrators in their day-to-day activities; your comments on posts are welcome. This blog is all about the IBM AIX UNIX flavour. It is meant for system admins who use AIX in their work life, and also for newbies who want to get certified in AIX administration. The blog will be updated frequently to help system admins and other new learners.

DISCLAIMER: Please note that the blog owner takes no responsibility of any kind for any type of data loss or damage caused by trying any of the commands/methods mentioned in this blog. You use the commands/methods/scripts at your own responsibility. If you find something useful, a comment would be appreciated to let other viewers know that the solution/method worked for you.

Thursday 28 September 2017

Thumb Rules – AIX ADMIN


1.    Collect System Information before doing any change on your server.

2.    Ensure the current OS level is consistent: “oslevel -s; oslevel -r; instfix -i | grep ML; instfix -i | grep SP; lppchk -v”. [If the OS is inconsistent, first bring the OS level to a consistent state and then proceed with the change.]
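
The consistency check in rule 2 can be sketched as follows. Since a live AIX box is assumed unavailable here, sample strings stand in for the real command output; on a real system you would use `os_s=$(oslevel -s)` and `os_r=$(oslevel -r)` instead.

```shell
# Sketch: verify that the TL reported by 'oslevel -s' agrees with 'oslevel -r'.
# Sample values below are assumptions standing in for live command output.
os_s="7100-04-02-1614"   # e.g. output of: oslevel -s
os_r="7100-04"           # e.g. output of: oslevel -r

# The TL portion of 'oslevel -s' is the first two dash-separated fields;
# it must match 'oslevel -r', otherwise the OS level is inconsistent.
tl_from_s=$(echo "$os_s" | cut -d- -f1-2)

if [ "$tl_from_s" = "$os_r" ]; then
    echo "OS level consistent: $os_s"
else
    echo "OS level INCONSISTENT: oslevel -s=$os_s vs oslevel -r=$os_r"
fi
```

On a real server you would also confirm that `lppchk -v` returns no output before proceeding.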

3.    Check lv/filesystem consistency with “df -k” (df should not hang); all LVs should be in the sync state: “lsvg -o|lsvg -il”.

4.    Check errpt & ‘alog -t console -o’ to see if there are any errors.

5.    Perform ‘pwdck’ & ‘grpck’ to verify password & group file consistency.

6.    Ensure the root user's password has not expired.

7.    Ensure your id on the server is working.

8.    Ensure a mksysb (OS image backup) is taken. Verify the mksysb with “lsmksysb -l -f /mksysbimg” (check the size).

9.    Check the /etc/exclude.rootvg to see if any important filesystem/dir is excluded from the mksysb backup.

10.    Take the TSM backup of mksysb as well.

11.    Ensure the scheduled TSM backup of filesystems is done.

12.    Verify boot list & boot device:   “bootlist -m normal -o”  “ipl_varyon -i”

13.    Ensure hd5 (bootlv) is 32 MB with contiguous PPs. [Very important for migration.]

14.    Remember the Application/Database teams are responsible for their own application/database backup and restore. Alert them so that they take their backups and other precautions before we perform our change.

15.    Ensure the ‘console’ for the server is working.

16.    Initiate the change on console. If there is any network disconnection during the change, you can reconnect to the console and get the menus back.

17.    Remember to have three essential things in place before you perform any change: backup (mksysb), system information, and console access.

18.    Ensure you are on the right server (hostname; ifconfig -a) before you perform the change.

19.    Remember: one change at a time; multiple simultaneous changes can cause problems and complicate troubleshooting.

20.    Pre-inform the monitoring team about your change so that there are no vague alerts/PRs.

21.    Ensure there is enough free filesystem space(/usr, /var, / ), required for the change.

22.    Before and after the change, ensure the filesystems are well below their thresholds (75%).
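
The 75% threshold check above can be scripted. The sketch below uses a sample `df -k`-style listing as an assumption in place of the live command; on a real server you would feed it `df -k` output directly.

```shell
# Sketch: flag any filesystem above a 75% usage threshold.
# The here-string sample stands in for live 'df -k' output (assumption);
# in this layout column 4 is Use% and column 5 is the mount point.
df_sample="Filesystem 1K-blocks Used Use% Mounted
/dev/hd4 262144 100000 39% /
/dev/hd2 2097152 1900000 91% /usr
/dev/hd9var 524288 200000 39% /var"

# int() strips the trailing '%' so the percentage compares numerically.
over=$(echo "$df_sample" | awk 'NR>1 && int($4) > 75 {print $5, $4}')

if [ -n "$over" ]; then
    echo "Over threshold:"
    echo "$over"
else
    echo "All filesystems below 75%"
fi
```

Run before and after the change and compare the two reports.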
23.    Perform changes only for the approved CRs(change requests).

24.    Check the history of the servers (PRs or CRs) to see if there were any issues or change failures on these servers.

25.    Ensure there is no clash of SAs over changes (one change being performed by multiple SAs). It's better to verify with the ‘who -u’ command to see if any SAs are already working on the server.

26.    If there are two disks in rootvg, perform an alt disk clone onto one disk. This is the fastest and safest backout method in case of failure. Even if you perform an alt disk clone, still take a mksysb as well.

27.    For a migration change, check whether SAN storage (IBM/EMC, etc.) is used; if so, follow the procedure of exporting the VGs and uninstalling the sdd* filesets, and after migration reinstall the sdd* filesets, reimport the VGs, etc.

28.    Ensure hd5 (bootlv) is 32 MB with contiguous PPs. [Very important for migration.]

29.    EXPECT THE UNEXPECTED: ensure you have a clear backout plan in place.

30.    Have the patches/filesets pre-downloaded and verified.

31.    Check/verify the repositories on the NIM/central server; check whether these repositories have been tested/used earlier.

32.    Verify the connectivity between the client and the NIM server. In some accounts the admin network has to be exclusively enabled (cabled) for changes.

33.    Take care with network zones (public, private, behind firewall); each zone may have its own NIM server.

34.    Take care of the static routes(which may differ across zones). Ensure the static routes are not disturbed after the change.

35.    Ensure there are no conflicting changes from other departments (SAN, network, firewall, application, etc.) that could jeopardize your change.

36.    Maintain/record the commands run and the console output in a notepad file (named after the change).

37.    Update the change record with the output/artefacts. Prefix commands with the date command so that the date and time of each run is recorded: “date; uptime; oslevel -r”.

38.    If the system has not been rebooted for a long time (> 100 days), perform ‘bosboot’ and then reboot the machine (verify the filesystems/applications after reboot) before commencing the migration/upgrade. [Do not reboot the machine if the bosboot fails!]

39.    Check the filesystem count with “df -k | wc -l”; verify the count again after migration or reboot.
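
A minimal sketch of the before/after count comparison from rule 39. The two here-strings are assumptions standing in for `df -k` output captured before and after the change; on a live system you would use `before=$(df -k | wc -l)`.

```shell
# Sketch: compare the filesystem count before and after a change.
# Sample listings (assumptions) stand in for real 'df -k' snapshots.
df_before="Filesystem
/
/usr
/var
/tmp"
df_after="Filesystem
/
/usr
/var
/tmp"

before=$(echo "$df_before" | wc -l)
after=$(echo "$df_after" | wc -l)

if [ "$before" -eq "$after" ]; then
    echo "Filesystem count unchanged: $before"
else
    echo "MISMATCH: before=$before after=$after - check for missing mounts"
fi
```

A mismatch after a reboot usually means a filesystem failed to mount; check /etc/filesystems and errpt.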

40.    Ensure there are no scheduled reboots in crontab. If there are any, comment them out before you proceed with the change.
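
The crontab scan in rule 40 can be sketched like this. The sample crontab text is an assumption standing in for `crontab -l` output on the live server.

```shell
# Sketch: scan root's crontab for scheduled reboot/shutdown entries.
# The sample (assumption) stands in for: crontab -l
cron_sample='0 2 * * 0 /usr/sbin/shutdown -Fr
30 1 * * * /usr/local/bin/backup.sh'

# Ignore commented lines, then look for reboot-related commands.
reboots=$(echo "$cron_sample" | grep -v '^#' | grep -E 'shutdown|reboot')

if [ -n "$reboots" ]; then
    echo "Scheduled reboot found - comment it out before the change:"
    echo "$reboots"
else
    echo "No scheduled reboots in crontab"
fi
```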

41.    Read log messages carefully; do not ignore warnings.

42.    Perform a preview (TL/SP upgrade) before you perform the actual change; see if any errors are reported in the preview (look for the keywords ‘error’/‘fail’) and read the tail/summary of the messages.

43.    Even though the preview may report ‘OK’ at the header, still look through the messages and read the tail/summary of the preview.
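
The keyword scan from rules 42-43 can be sketched as below. The sample preview text is an assumption standing in for the tail of a real installp preview log.

```shell
# Sketch: scan an installp preview summary for failure keywords.
# The sample log (assumption) stands in for real preview output.
preview_log="Verifying selections...done
Requisite failure: bos.rte.libc 7.1.4.0 is required
SUCCESSES: 10  FAILURES: 1"

# Case-insensitive match on the keywords the rule recommends.
hits=$(echo "$preview_log" | grep -i -E 'error|fail')

if [ -n "$hits" ]; then
    echo "Preview reported problems:"
    echo "$hits"
else
    echo "Preview clean"
fi
```

Even when this scan is clean, still read the preview summary by eye, as rule 43 says.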

44.    If the preview reports any dependency/requisite filesets missing, download those as well.

45.    Ensure you have enough free space in rootvg: a minimum of 1-2 GB free (TL upgrade/OS migration).

46.    Ensure the application team has tested their application on the new TL/SP/OS to which you are upgrading the system.

47.    If you have multiple PuTTY sessions open, name the sessions accordingly [PuTTY -> Behaviour -> Window title]; this will help you quickly get to the right session.

48.    On sev1 issues, update the SDM in the ST multichat at regular intervals.

49.    On the conference voice call, if anyone verbally requests you to perform a change, get the confirmation in writing in the multi ST chat.

50.    Update the PR in regular intervals.

51.    Update your team with the issue status(via mail).

52.    Document any new learnings(from issues/changes) and share it with team.

53.    Ensure proper validated CR procedure is in place;  Precheck -> Installation -> Backout -> Verification

54.    Save the severity multichats; enable autologging of multichats in Sametime.

55.    If a junior SA is performing a change for the first time (first time in any account), the tech lead should pair them with a senior SA.

56.    Any first-time execution by an SA (or a first time in an account) has to be handled carefully; tech leads have to take extra precautions.

57.    Ensure any important information/documents are updated in your account team room.

58.    For TL upgrades, go TL by TL; shortcutting directly to the target TL can sometimes cause problems.

59.    Check whether the server is running any cluster (HACMP); if so, you have to follow a different procedure.


Thumb Rules for HACMP


1) Check the status of the cluster with “clstat -o; clRGinfo”; the cluster and the resource groups should be in a stable state. The cluster commands/utilities directory is ‘/usr/es/sbin/cluster/utilities’. “lssrc -a | grep -i clst” checks whether the cluster is running; “lssrc -ls clstrmgr” shows the cluster manager status.

2) Use ‘clshowres’ to check the resource group configuration; with this information you can identify the volume group, service IP and application server (application start/stop scripts) associated with the resource group.

3) When you bring a RG (Resource Group) offline, verify afterwards that the associated resources (service IP, VG, application) are down/offline as well.

4) Use ‘cllsserv’ to find the application start/stop scripts. Ensure the start/stop scripts are in sync across nodes [you can use ‘cksum’ to verify].
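
The cksum comparison from rule 4 can be sketched as below. Two temp files stand in for the script copies fetched from each node (an assumption, since two cluster nodes are not available here).

```shell
# Sketch: verify the application start script matches across nodes by
# comparing cksum output. Temp files (assumptions) stand in for the
# copies of the script from node A and node B.
printf 'start app\n' > /tmp/start_nodeA.sh
printf 'start app\n' > /tmp/start_nodeB.sh

# cksum prints: <CRC> <byte count> <filename>; compare CRC and size only.
sum_a=$(cksum /tmp/start_nodeA.sh | awk '{print $1, $2}')
sum_b=$(cksum /tmp/start_nodeB.sh | awk '{print $1, $2}')

if [ "$sum_a" = "$sum_b" ]; then
    echo "Scripts in sync"
else
    echo "Scripts DIFFER across nodes - sync them before any failover"
fi
```

In practice you would run `cksum` on each node (e.g. via ssh) and compare the values.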

5) Generate a cluster snap and collect cluster information:
###############################
/usr/es/sbin/cluster/utilities/clsnapshot -c -i -n filename    /* generates filename.info in the ‘/usr/es/sbin/cluster/snapshots/’ directory.
/usr/es/sbin/cluster/clstat -o >> /usr/es/sbin/cluster/snapshots/filename.info    /* clstat displays the current status of the cluster; clstat may not work on some clusters.
/usr/es/sbin/cluster/utilities/clRGinfo >> /usr/es/sbin/cluster/snapshots/filename.info    /* displays the current status of the resource groups.
/usr/es/sbin/cluster/utilities/cllsserv >> /usr/es/sbin/cluster/snapshots/filename.info    /* shows the start & stop scripts of the application server.
###############################

6) Collect system information of the cluster nodes. The system information will contain the IP/static route information and other important details.

7) Cluster verification runs on a daily basis; look for the ‘fail’ or ‘error’ keywords in ‘/var/hacmp/clverify/clverify.log’.
8) Perform an emulated (preview) “verification & synchronization” [smitty hacmp -> Extended Configuration -> Extended Verification and Synchronization -> select emulate] and see if any errors are reported.

9) If any errors are reported, especially VG/LVM out-of-sync issues, take downtime (resource groups down; highly recommended) and then perform the actual synchronization (fixing of sync errors).
[Please note it is not advised to perform the actual synchronization with the RG (resource group) up, because this could result in disk lock issues and could disrupt future failovers.]

10) It is recommended not to perform a direct ‘move’ operation on resource groups; instead, bring the RG offline on one node, ensure all the associated resources are offline, and then bring the RG online on the other node. This ensures there are no resource conflicts and the RG moves cleanly.

11) To monitor RG online/offline/move operations: “tail -f /tmp/hacmp.out”.

12) Perform all possible VG/LVM/FS activities via C-SPOC (Cluster Single Point Of Control): smitty hacmp -> C-SPOC. Before any C-SPOC activity, run an emulated synchronization to see if there are any errors. If there are, first fix the errors (with the RGs offline) and then proceed with C-SPOC.

13) HACMP logs: ‘/tmp/hacmp.out’, ‘/usr/es/adm/cluster.log’, ‘/var/hacmp/clverify/clverify.log’.

14) If you have performed a filesystem change outside C-SPOC, do not panic; it can be synced to the other nodes (with the RGs offline). If the customer insists on doing it online, there is a procedure for that as well.

15) On some occasions the service IP may not ping/work after an RG failover; this could be due to the ARP cache, so flush the service IP entry from the ARP cache on both nodes. If the IP issue persists, engage the network team; they may have to refresh at the switch level.

16) If the RG status reports ‘error’ during the offline process, in the majority of cases it is a filesystem unmount issue: HA may be unable to unmount a filesystem of the RG (filesystem in use by another process), so it reports an error for the RG. Stop the process/application using the filesystem, unmount the filesystem and rerun the RG offline.

17) During HACMP activities, clearly inform the application team that bringing the RG (Resource Group) online starts the application (the application start script is executed automatically), and that they may only need to verify the application. This precaution is needed because many application team members are unaware of HACMP/cluster functionality and may attempt to restart their application, which could harm it.

18) In certain cases where HACMP is in an unstable state, you may not be able to bring the RG offline gracefully. The workaround is to execute the application stop script (see ‘cllsserv’), which stops the application gracefully, and then reboot the cluster node [this ensures the application is not corrupted by an abrupt shutdown].

19) Remember a Resource Group (RG) primarily consists of the volume group, service IP and application server (application start/stop scripts) resources.

20) If the cluster is not functioning and the application team demands that their application be up ASAP, then in that adverse scenario you can manually bring the resources online: activate the VG, mount the filesystems, activate the service IP and execute the application start script [this brings their application up outside HACMP].
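
The manual bring-up order from rule 20 can be illustrated as a dry run. The real AIX/HACMP commands (varyonvg, mount, ifconfig alias, the start script) are stubbed with echo so the sequence can be shown safely outside a cluster; the VG name, mount point, IP and script path are all hypothetical.

```shell
# Dry-run sketch of manually bringing RG resources online outside HACMP.
# All commands are stubs (echo only); names are hypothetical examples.
varyonvg()  { echo "varyonvg $1"; }              # stub for: varyonvg
mount_fs()  { echo "mount $1"; }                 # stub for: mount
alias_ip()  { echo "ifconfig en0 alias $1"; }    # stub for: ifconfig alias
start_app() { echo "run $1"; }                   # stub for: start script

bring_rg_online() {
    varyonvg appvg                      # 1. activate the shared VG
    mount_fs /appdata                   # 2. mount its filesystems
    alias_ip 10.0.0.50                  # 3. bring up the service IP
    start_app /usr/local/bin/app_start  # 4. run the application start script
}
bring_rg_online
```

The order matters: the VG must be active before its filesystems mount, and the application should start last, once storage and the service IP are in place.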

21) If there are two or more resource groups in the HACMP cluster, verify the naming of the mount points/filesystems of the volume groups of the resource groups; the mount point/FS/LV names must be unique across the volume groups of the RGs and must not overlap.

22) You can recycle the logs with “clcycle hacmp.out” and “clcycle cluster.log” before doing any major HACMP/AIX changes in the cluster. This gives you a clean error log to begin with. clcycle backs up both log files, naming them cluster.log.1, hacmp.out.1 and so on.
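
A minimal sketch of what clcycle-style rotation does: the current log is saved with a .1 suffix and a fresh empty log is started, so errors that appear after the change stand out. A temp path stands in for /tmp/hacmp.out (an assumption; clcycle itself is the tool to use on a real cluster).

```shell
# Sketch: rotate a log the way clcycle does (save as .1, start fresh).
# /tmp/demo_hacmp.out is a stand-in path, not the real cluster log.
log=/tmp/demo_hacmp.out
echo "old cluster messages" > "$log"

cycle_log() {
    [ -f "$1" ] && mv "$1" "$1.1"   # keep the previous log as <log>.1
    : > "$1"                        # start a fresh, empty log
}
cycle_log "$log"

echo "fresh log size: $(wc -c < "$log")"
```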


23) The shared volume groups in an HACMP cluster where the nodes are VIO clients (using VIO disks) have to be in “Enhanced Concurrent mode (active/passive)” to avoid the possibility of filesystem/data corruption.