1. Collect system information before making any
change on your server.
2. Ensure the current OS level is
consistent: “oslevel -s; oslevel -r; instfix -i | grep ML;
instfix -i | grep SP; lppchk -v” [If the OS level is inconsistent, first bring
it to a consistent state and only then proceed with the change].
3. Check LV/filesystem consistency with “df
-k” (df should not hang); all LVs should be in the syncd state: “lsvg -o | lsvg -il”.
4. Check ‘errpt’ and ‘alog -t console -o’ to
see if there are any errors.
5. Run ‘pwdck’ and ‘grpck’ to verify
password and group file consistency.
6. Ensure the root user's account has not expired.
7. Ensure your own ID on the server is working.
8. Ensure a mksysb (OS image backup) is taken.
Verify the mksysb with “lsmksysb -l -f /mksysbimg” (check the size).
9. Check /etc/exclude.rootvg to see if any
important filesystem/directory is excluded from the mksysb backup.
10. Take a TSM backup of the mksysb as well.
11. Ensure the scheduled TSM backup of the
filesystems has completed.
12. Verify the boot list and boot
device: “bootlist -m normal -o” and “ipl_varyon -i”.
13. Ensure hd5 (bootlv) is 32 MB (contiguous
PPs) [very important for migration].
14. Remember that the application/database teams are
responsible for their own application/database backup and restore. Alert
them in advance so that they can take their backups and other precautions
before we perform our change.
15. Ensure the ‘console’ for the server is
working.
16. Initiate the change from the console. If there
is a network disconnection during the change, you can reconnect to the
console and get the menus back.
17. Remember that three essentials must be in
place before you perform any change: backup (mksysb), system information, and
console.
18. Ensure you are on the right
server (“hostname; ifconfig -a”) before you perform the change.
19. Remember: one change at a time. Multiple
changes at once can cause problems and complicate troubleshooting.
20. Inform the monitoring team about your
change in advance, so that no unnecessary alerts/PRs are raised.
21. Ensure there is enough free filesystem
space (/usr, /var, /) for the change.
22. Before and after the change, ensure the
filesystems are well below the threshold (75%).
23. Perform changes only for approved CRs (change requests).
24. Check the history of the servers (PRs or
CRs) to see if there were any issues or change failures for these servers.
25. Ensure there is no clash between SAs over
changes, i.e. one change being performed by multiple SAs. It is better to verify
with the ‘who -u’ command whether any other SAs are already working on the
server.
26. If there are two disks in rootvg,
perform an alt disk clone onto one disk. This is the fastest and safest backout
method in case of failure. Even if you perform an alt disk clone, take a
mksysb as well.
27. For a migration change, check whether
SAN storage (IBM/EMC..) is in use. If so, follow the procedure of exporting the
VGs and uninstalling the sdd* filesets before migration, then reinstalling the
sdd* filesets, reimporting the VGs, etc. after migration.
28. Before a migration, recheck that hd5 (bootlv) is 32 MB (contiguous
PPs), as in item 13 [very important for migration].
29. EXPECT THE UNEXPECTED: ensure you have
a clear backout plan in place.
30. Have the patches/filesets pre-downloaded
and verified.
31. Check/verify the repositories on the
NIM/central server; check whether these repositories were tested/used earlier.
32. Verify the connectivity between the client
and the NIM server. In some accounts the admin network has to be specifically
enabled (cabled) for the changes.
33. Take care of the network zones (public,
private, behind firewall); each zone may have its own NIM server.
34. Take care of the static routes (which may
differ across zones). Ensure the static routes are not disturbed by the
change.
35. Ensure there are no conflicting
changes from other departments such as SAN, network, firewall, application..
which could interfere with your change.
36. Record the commands run and the console
output in a notepad file (named after the change).
37. Update the change record with the
output/artefacts. You can run your commands together with the date command, so
that the date and time of execution are recorded: “date; uptime; oslevel -r”.
38. If the system has not been rebooted for
a long time (> 100 days), perform ‘bosboot’ and then reboot the
machine (verify the filesystems/applications after reboot) before commencing
the migration/upgrade. [Do not reboot the machine if the bosboot fails!]
39. Check the filesystem count with “df -k | wc
-l” and verify the count again after the migration or reboot.
40. Ensure there are no scheduled reboots in
crontab. If there are any, comment them out before you proceed with the change.
41. Read the log messages carefully; do not
ignore warnings.
42. Run a preview (TL/SP upgrade) before you
perform the actual change; check whether the preview reports any errors (look
for the keywords ‘error’ / ‘fail’) and read the tail/summary of the messages.
43. Even if the preview reports ‘ok’ in
the header, you still have to go through the messages and read the tail/summary
of the preview.
44. If the preview reports a missing
dependency/requisite fileset, have it downloaded as well.
45. Ensure you have enough free space in
rootvg: a minimum of 1-2 GB free for a TL upgrade/OS migration.
46. Ensure the application team has tested its
application on the new TL/SP/OS to which you are upgrading your system.
47. If you have multiple PuTTY sessions
open, name the sessions accordingly [PuTTY -> under Behaviour ->
Window title]; this helps you get to the right session quickly.
48. On sev1 issues, update the SDM in the ST
multichat at regular intervals.
49. If you are verbally asked over the voice conference call
to perform any change, get the confirmation in writing in the ST
multichat.
50. Update the PR at regular intervals.
51. Update your team on the issue status (via
mail).
52. Document any new learnings (from
issues/changes) and share them with the team.
53. Ensure a properly validated CR procedure is in
place: Precheck -> Installation -> Backout -> Verification.
54. Save the severity multichats; enable
autologging of multichats in Sametime.
55. If a junior SA is performing
the change for the first time (or for the first time in an account), the tech
lead should pair them with a senior SA.
56. Any first-time execution by an SA (or first time in
an account) has to be handled carefully. Tech leads have
to take extra precautions.
57. Ensure that any important information/document
is updated in your account teamroom.
58. For TL upgrades, go TL by
TL; jumping straight to the target TL can sometimes cause problems.
59. Check whether the server is part of a cluster
(HACMP); if so, you have to follow a different procedure.
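
A few of the checks above (items 22, 37 and 39) are easy to wrap in small shell helpers. What follows is only a minimal sketch, not a validated procedure: the function names and file paths are made up for illustration, and fs_over_threshold assumes the usual AIX “df -k” column layout (%Used in column 4, mount point in column 7), which you should confirm on your own system first.

```sh
#!/bin/sh
# Illustrative helpers only; names and paths are assumptions, not AIX tools.

# Item 37: run a command under a timestamp header and append the command,
# its output and its exit status to a log file (also echoed to the terminal).
# Usage: run_logged <logfile> <command> [args...]
run_logged() {
    _log=$1
    shift
    {
        echo "=== $(date) === $*"
        "$@" 2>&1
        echo "--- exit status: $?"
    } | tee -a "$_log"
}

# Item 22: read "df -k" output on stdin and print the mount point and %Used
# of every filesystem above the limit (default 75).
# Assumes %Used is field 4 and the mount point is field 7; lines whose
# %Used is not a percentage (for example /proc, shown as "-") are skipped.
fs_over_threshold() {
    _limit=${1:-75}
    awk -v limit="$_limit" 'NR > 1 && $4 ~ /%$/ {
        used = $4
        sub(/%/, "", used)
        if (used + 0 > limit + 0) print $7, $4
    }'
}

# Item 39: compare the line counts of two saved "df -k" listings.
# Returns 0 when the counts match, non-zero otherwise.
same_line_count() {
    _before=$(awk 'END { print NR }' "$1")
    _after=$(awk 'END { print NR }' "$2")
    [ "$_before" -eq "$_after" ]
}
```

For example, “run_logged /tmp/change.log oslevel -s” records the OS level with a timestamp, “df -k | fs_over_threshold 75” lists the filesystems above 75%, and after saving “df -k” output to two files before and after the change, “same_line_count /tmp/df.before /tmp/df.after || echo filesystem count changed” flags a difference. The /tmp paths here are placeholders.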