TANTI TECHNOLOGIES: Thumb Rules

1. Collect system information before making any change on your server.

2. Ensure the current OS level is consistent: “oslevel -s; oslevel -r; instfix -i | grep ML; instfix -i | grep SP; lppchk -v”. [If the OS is inconsistent, first bring it to a consistent state, then proceed with the change.]
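
A minimal sketch of the instfix part of rule 2. It assumes the usual `instfix -i` wording, "All filesets for ... were found." versus "Not all filesets for ... were found."; the function name `incomplete_levels` is made up for illustration.

```shell
# incomplete_levels: read `instfix -i` output on stdin and print only the
# ML/SP levels reported as incomplete. Returns 1 if any are found.
# Assumption: incomplete levels are reported as
#   "Not all filesets for <level> were found."
incomplete_levels() {
    awk '/Not all filesets for/ && /_ML|_SP/ { print $5; bad = 1 }
         END { exit bad + 0 }'
}

# On the server (not run here):
#   instfix -i | incomplete_levels || echo "OS level inconsistent, fix first"
```

An empty result together with a clean `lppchk -v` is what you want to see before going further.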

3. Check LV/filesystem consistency: “df -k” (df should not hang); all LVs should be in the syncd state: “lsvg -o | lsvg -il”.
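
The sync check in rule 3 can be sketched as a small filter. It assumes the standard `lsvg -l` column layout, where the LV STATE value (for example open/syncd) is the sixth field of each row; `stale_lvs` is a hypothetical name.

```shell
# stale_lvs: read `lsvg -o | lsvg -il` output on stdin and print the name of
# every logical volume whose LV STATE column (field 6) contains "stale".
stale_lvs() {
    awk '$6 ~ /stale/ { print $1 }'
}

# On the server (not run here):
#   lsvg -o | lsvg -il | stale_lvs
```

Any name printed here is an LV to resync before the change starts.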

4. Check ‘errpt’ and ‘alog -t console -o’ for any errors.

5. Run ‘pwdck’ and ‘grpck’ (for example ‘pwdck -n ALL’ and ‘grpck -n ALL’, which report problems without fixing them) to verify password and group file consistency.

6. Ensure the root account has not expired.

7. Ensure your own ID on the server is working.

8. Ensure a mksysb (OS image backup) is taken. Verify it with “lsmksysb -l -f /mksysbimg” (check the size).

9. Check /etc/exclude.rootvg to see whether any important filesystem/directory is excluded from the mksysb backup.

10. Take a TSM backup of the mksysb as well.

11. Ensure the scheduled TSM backup of the filesystems has completed.

12. Verify the boot list and boot device: “bootlist -m normal -o”; “ipl_varyon -i”.

13. Ensure hd5 (boot LV) is 32MB on contiguous PPs [very important for migration].
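
The contiguity half of rule 13 can be sketched as below. It assumes the `lslv -m hd5` layout of two header lines followed by one row per LP, with the PP number in column 2 and the disk in column 3. Only the first copy is checked and the 32MB size is not verified here; `hd5_contiguous` is a hypothetical name.

```shell
# hd5_contiguous: read `lslv -m hd5` output on stdin and succeed only if the
# PP numbers of the first copy (column 2) increase by exactly one from row
# to row and all rows sit on the same disk (column 3).
hd5_contiguous() {
    awk 'NR > 2 && NF >= 3 {
             pp = $2 + 0
             if (seen && (pp != prev + 1 || $3 != pv)) gap = 1
             prev = pp; pv = $3; seen = 1
         }
         END { if (gap || !seen) exit 1 }'
}

# On the server (not run here):
#   lslv -m hd5 | hd5_contiguous || echo "hd5 is not on contiguous PPs"
```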

14. Remember that the application/database teams are responsible for their own backup and restore. Alert them so that they take their backups and other precautions before we perform our change.

15. Ensure the ‘console’ for the server is working.

16. Initiate the change from the console. If the network disconnects during the change, you can reconnect to the console and get the menus back.

17. Remember that three essentials must be in place before you perform any change: “backup (mksysb); system information; console”.

18. Ensure you are on the right server (“hostname; ifconfig -a”) before you perform the change.
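
One way to make rule 18 hard to skip is a guard at the top of any change script. This is a sketch only: `confirm_host` is a hypothetical helper, and it uses `uname -n` (which prints the node name) in place of `hostname`; swap in whichever your CR records.

```shell
# confirm_host EXPECTED: succeed only when the machine this shell runs on
# reports the expected node name; otherwise complain on stderr and return 1.
confirm_host() {
    expected=$1
    actual=$(uname -n)
    if [ "$actual" = "$expected" ]; then
        echo "OK: on $actual"
    else
        echo "STOP: on $actual, expected $expected" >&2
        return 1
    fi
}

# Typical use at the top of a change script (server name is a placeholder):
#   confirm_host myserver01 || exit 1
```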

19. Remember: one change at a time; multiple changes could cause problems and complicate troubleshooting.

20. Inform the monitoring team about your change in advance, so that it does not raise spurious alerts/PRs.

21. Ensure there is enough free filesystem space (/usr, /var, /) for the change.

22. Before and after the change, ensure the filesystems are well below the threshold (75%).
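
The threshold check in rule 22 can be sketched as a filter over `df -k`. It assumes the AIX column order (Filesystem, 1024-blocks, Free, %Used, Iused, %Iused, Mounted on); `over_threshold` is a hypothetical name, and the default of 75 is the figure from the rule.

```shell
# over_threshold [LIMIT]: read `df -k` output on stdin and print
# "mountpoint used%" for every filesystem whose %Used (column 4) is at or
# above LIMIT. Rows without a percentage (such as /proc) are skipped.
over_threshold() {
    limit=${1:-75}
    awk -v limit="$limit" 'NR > 1 && $4 ~ /%$/ {
        used = $4; sub(/%/, "", used)
        if (used + 0 >= limit + 0) print $7, $4
    }'
}

# On the server (not run here), before and after the change:
#   df -k | over_threshold
```

Mount points containing spaces are not handled by this sketch.
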

23. Perform changes only for approved CRs (change requests).

24. Check the history of the servers (PRs or CRs) to see whether there were earlier issues or change failures on them.

25. Ensure there is no clash of SAs over changes, with one change being performed by multiple SAs. It's better to verify with the ‘who -u’ command whether any SAs are already working on the server.

26. If there are two disks in rootvg, perform an alt disk clone to one disk. This is the fastest and safest backout method in case of failure. Even if you perform an alt disk clone, take a mksysb as well.

27. For a migration, check whether SAN storage (IBM/EMC..) is used; if so, follow the procedure of exporting the VGs and uninstalling the sdd* filesets before the migration, then reinstalling the sdd* filesets and reimporting the VGs afterwards.

28. Before a migration, recheck rule 13: hd5 (boot LV) must be 32MB on contiguous PPs.

29. EXPECT THE UNEXPECTED: Ensure you have a clear backout plan in place.

30. Have the patches/filesets pre-downloaded and verified.

31. Check/verify the repositories on the NIM/central server; check whether these repositories were tested/used earlier.

32. Verify connectivity between the client and the NIM server. In some accounts the admin network has to be specifically enabled (cabled) for changes.

33. Take care of the network zones (public, private, behind firewall); each zone may have its own NIM server.

34. Take care of the static routes (which may differ across zones). Ensure they are intact after the change.

35. Ensure there are no conflicting changes from other departments (SAN, network, firewall, application..) that could disrupt your change.

36. Record the commands run and the console output in a notepad file (named after the change).

37. Update the change record with the output/artefacts. Run commands together with ‘date’ so that the date and time of each run is recorded: “date; uptime; oslevel -r”.
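
Rules 36 and 37 can be combined in a small wrapper that timestamps each command and appends it, with its output, to a log named after the change. This is a sketch; `run_logged` and the `CHANGE_LOG` variable are hypothetical names, not part of any AIX tooling.

```shell
# run_logged CMD [ARGS...]: append a timestamp, the command line, and the
# command's output (stdout and stderr) to $CHANGE_LOG, while still showing
# the output on screen. CHANGE_LOG defaults to ./change.log.
# Note: the command's own exit status is lost behind the pipe to tee.
run_logged() {
    log=${CHANGE_LOG:-./change.log}
    {
        echo "=== $(date) === $*"
        "$@" 2>&1
    } | tee -a "$log"
}

# Example (the CR number is a placeholder):
#   CHANGE_LOG=/tmp/CR12345.log
#   run_logged uptime
#   run_logged oslevel -r
```

The resulting file can be attached to the change record as the artefact.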

38. If the system has not been rebooted for a long time (> 100 days), perform ‘bosboot’ and then reboot the machine (verify the filesystems/applications after the reboot) before commencing the migration/upgrade. [Do not reboot the machine if bosboot fails!]

39. Check the filesystem count with “df -k | wc -l”; verify the count again after the migration or reboot.
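
A count only says that something is missing, not what. As a sketch that goes one step beyond rule 39, the mount points can be saved before the change and compared afterwards. It assumes the mount point is column 7 of `df -k` output; `mounts` and `missing_mounts` are hypothetical names.

```shell
# mounts: read `df -k` output on stdin and print the sorted mount points
# (column 7), skipping the header line.
mounts() {
    awk 'NR > 1 { print $7 }' | sort
}

# missing_mounts BEFORE AFTER: print mount points listed in the BEFORE file
# (saved `mounts` output) that no longer appear in the AFTER file.
missing_mounts() {
    comm -23 "$1" "$2"
}

# On the server (not run here):
#   df -k | mounts > /tmp/fs.before      # before the change
#   df -k | mounts > /tmp/fs.after       # after migration or reboot
#   missing_mounts /tmp/fs.before /tmp/fs.after
```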

40. Ensure there are no scheduled reboots in crontab. If there are any, comment them out before you proceed with the change.
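
A rough sketch of the crontab check in rule 40. It only catches entries that name `reboot` or `shutdown` directly, so a wrapper script that reboots the box would slip through; `scheduled_reboots` is a hypothetical name.

```shell
# scheduled_reboots: read crontab text on stdin and print every active
# (not commented-out) entry that mentions reboot or shutdown.
# Returns non-zero when nothing matches.
scheduled_reboots() {
    grep -v '^[[:space:]]*#' | grep -E 'reboot|shutdown'
}

# On the server (not run here):
#   crontab -l | scheduled_reboots
```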

41. Read the log messages carefully; do not ignore warnings.

42. Run a preview (TL/SP upgrade) before the actual change; check whether the preview reports any errors (look for the keywords ‘error’ / ‘fail’) and read the tail/summary of the messages.

43. Even if the preview reports ‘ok’ in the header, still read through the messages and the tail/summary of the preview.
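
Rules 42 and 43 can be sketched as one helper, assuming the preview output was saved to a file first. `scan_preview` is a hypothetical name, and the 20-line tail is an arbitrary choice; this is an aid for reading the log, not a replacement for reading it.

```shell
# scan_preview FILE: print every line of a saved preview log that contains
# "error" or "fail" (any case) with its line number, then the last 20 lines
# of the file for the summary.
scan_preview() {
    echo "--- lines mentioning error/fail ---"
    grep -n -i -E 'error|fail' "$1"
    echo "--- tail of $1 ---"
    tail -n 20 "$1"
}

# Example (the log path is a placeholder):
#   scan_preview /tmp/preview.log
```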

44. If the preview reports any missing dependency/requisite fileset, download those as well.

45. Ensure there is enough free space in rootvg: a minimum of 1-2 GB free (TL upgrade/OS migration).

46. Ensure the application team has tested their application on the new TL/SP/OS to which you are upgrading.

47. If you have multiple PuTTY sessions open, name the sessions accordingly [PuTTY -> Window -> Behaviour -> Window title]; this helps you get to the right session quickly.

48. On sev1 issues, update the SDM in the ST multichat at regular intervals.

49. If someone verbally asks you over the conference call to perform a change, get the confirmation in writing in the ST multichat.

50. Update the PR at regular intervals.

51. Update your team on the issue status (via mail).

52. Document any new learnings (from issues/changes) and share them with the team.

53. Ensure a properly validated CR procedure is in place: Precheck -> Installation -> Backout -> Verification.

54. Save the severity multichats; enable autologging of multichats in Sametime.

55. If a junior SA is performing the change for the first time (or for the first time in an account), the tech lead should pair him or her with a senior SA.

56. An SA's first execution of a change (or first in an account), or any first-time activity, has to be handled carefully. Tech leads have to take extra precautions.

57. Ensure any important information/document is updated in your account teamroom.

58. For TL upgrades, go TL by TL; jumping directly to the target TL can sometimes cause problems.
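
To plan the steps in rule 58, the intermediate levels can be listed up front. A minimal sketch, assuming levels are written as `oslevel -r` prints them (for example 7100-01) and that both levels belong to the same release; `tl_path` is a hypothetical name, and whether each intermediate TL is actually available in your repository still has to be checked.

```shell
# tl_path CURRENT TARGET: print each technology level to apply, in order,
# when moving from CURRENT to TARGET within the same release.
# Levels use the `oslevel -r` format, e.g. 7100-01.
tl_path() {
    rel=${1%-*}
    [ "$rel" = "${2%-*}" ] || { echo "different release: $1 vs $2" >&2; return 1; }
    from=${1#*-}; to=${2#*-}
    awk -v rel="$rel" -v from="$from" -v to="$to" 'BEGIN {
        for (tl = from + 1; tl <= to + 0; tl++) printf "%s-%02d\n", rel, tl
    }'
}

# Example:
#   tl_path 7100-01 7100-04
```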

59. Check whether the server is part of a cluster (HACMP); if so, you have to follow a different procedure.


Thursday, 28 September 2017

Thumb Rules – AIX ADMIN
