
Saturday 19 October 2019

RMC (Resource Monitoring and Control):



RMC is a distributed framework and architecture that allows the HMC to communicate with a managed logical partition. The RMC daemons must be running on the AIX partition for the HMC to be able to perform DLPAR operations on it.

For example "Dynamic LPAR Resource Manager" is an RMC daemon that runs inside the AIX (and VIO server). The HMC uses this capability to remotely execute partition specific commands.

The daemons in the LPARs and the daemons on the HMC must communicate over an external network, not through the Service Processor: an external network that both the partition and the HMC have access to.

For example, if the HMC has a connection to a 9.x.x.x network and I put my AIX partition on that same 9.x.x.x network, then as long as there is network connectivity (the HMC is allowed to communicate with that partition over that network) and the RMC daemon is running on the partition, DLPAR operations are available.

In order for RMC to work, port 657 udp/tcp must be open in both directions between the HMC public interface and the LPAR.
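A quick way to verify from the AIX side that the RMC daemon has its port 657 sockets open (a minimal check; the grep may also match unrelated ports that happen to contain 657):

# netstat -an | grep 657           <-- tcp/udp sockets on port 657 should be listed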

The RMC daemons are part of the Reliable, Scalable Cluster Technology (RSCT) and are controlled by the System Resource Controller (SRC). These daemons run in all LPARs and communicate with equivalent RMC daemons running on the HMC. The daemons start automatically when the operating system starts and synchronize with the HMC RMC daemons.
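For example, because the daemons are SRC-controlled, the RMC subsystem can be queried on the LPAR like any other subsystem:

# lssrc -s ctrmc                   <-- the ctrmc subsystem should be reported as active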

Note: Apart from rebooting, there is no way to stop and start the RMC daemons on the HMC!

----------------------------------------

HMC and LPAR authentication (RSCT authentication)
(RSCT authentication is used to ensure the HMC is communicating with the correct LPAR.)

Authentication is the process of ensuring that another party is who it claims to be.
Authorization is the process by which a cluster software component grants or denies resources based on certain criteria.
The RSCT component that implements authorization is RMC. It uses access control list (ACL) files to control user access to resources.


The RSCT authorization process in detail:
1. On the HMC: DMSRM pushes down a secret key and the HMC IP address to the NVRAM of the server where the AIX LPAR exists.

2. On the AIX LPAR: CSMAgentRM reads the key and the HMC IP address from NVRAM. It then authenticates the HMC. This process is repeated every five minutes on the LPAR to detect new HMCs.

3. On the AIX LPAR: After authenticating the HMC, CSMAgentRM contacts DMSRM on the HMC to create a ManagedNode resource, and then creates a ManagementServer resource on AIX.

4. On the AIX LPAR: After these resources have been created on the HMC and AIX, CSMAgentRM grants the HMC permission to access the necessary resources on the LPAR and changes its ManagedNode status to 1 on the HMC.

5. On the HMC: After the ManagedNode status changes to 1, the HMC establishes a session with the LPAR to query operating system information and DLPAR capabilities, and then waits for DLPAR commands from users.
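Once this handshake has completed, the relationship can be verified from the AIX side (the same command appears in the LPAR checklist further below):

# lsrsrc "IBM.ManagementServer"    <-- each HMC managing this LPAR should appear as a resource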

----------------------------------------

RMC Domain Status

When partitions have active RMC connections, they become managed nodes in a Management Domain. The HMC is then the Management Control Point (MCP) of that Management Domain. You can then use the rmcdomainstatus command to check the status of those managed nodes (i.e. your partitions).

As root on the HMC or on the AIX LPAR you can execute the rmcdomainstatus command as follows:

# /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc

From HMC: You should get a list of all the partitions that the HMC server can reach on the public network on port 657.

Management Domain Status: Managed Nodes
  O a  0xc8bc2c9647c1cef3  0003  9.2.5.241
  I a  0x96586cb4b5fc641c  0002  9.2.5.33


From LPAR: You should get a list of all the Management Control Points

Management Domain Status: Management Control Points
   I A  0xef889c809d9617c7 0001  9.57.24.139


1st column:
-I: Indicates that the partition is "Up" as determined by the RMC heartbeat mechanism (i.e. an active RMC connection exists).
-O: Indicates that the RMC connection is "Down", as determined by the RMC heartbeat mechanism.

2nd column:
-A: Indicates that there are no messages queued to the specified node.
-a: Same as A, but the specified node is executing a version of the RMC daemon that is at a lower code level than the local RMC daemon.

more info: https://www-304.ibm.com/support/docview.wss?uid=isg3T1011508

----------------------------------------

If rmcdomainstatus shows "i" in the 1st column:

It indicates that the partition is "Pending Up": communication has been established, but the initial handshake between the two RMC daemons has not been completed (message authentication is most likely failing).
Authentication problems occur when the partition and HMC identities do not match each other's trusted host lists:

# /usr/sbin/rsct/bin/ctsvhbal        <-- run this command on both the HMC and the logical partition; it lists the current identities
# /usr/sbin/rsct/bin/ctsthl -l       <-- lists the trusted host list on the partition

In the ctsthl output on the HMC there is an entry for the partition, and on the partition there is an entry for the HMC. The HOST_IDENTITY value must match one of the identities listed in the ctsvhbal output of the other side.

----------------------------------------

Things to check at the HMC:

- checking the status of the managed nodes: /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc  (you must be root on the HMC)

- checking connection between HMC and LPAR:
hscroot@hmc10:~> lspartition -dlpar
<#0> Partition:<2 10.10.50.18="" aix10.domain.com="">
       Active:<1>, OS:DCaps:<0x4f9f>, CmdCaps:<0x1b 0x1b="">, PinnedMem:<1452>
<#1> Partition:<4 10.10.50.71="" aix20.domain.com="">
       Active:<0>, OS:DCaps:<0x0>, CmdCaps:<0x1b 0x1b="">, PinnedMem:<656>

For correct DLPAR function:
- the partition must be returned with the correct IP of the LPAR,
- the active value (Active:...) must be higher than zero,
- the DCaps value (DCaps:...) must be higher than 0x0.

(The first entry shows a DLPAR-capable LPAR; the second entry is a non-working LPAR.)

- another way to check RMC connection: lssyscfg -r lpar -F lpar_id,name,state,rmc_state,rmc_ipaddr -m p750
(It should list "active" for the LPARs with active RMC connection.)



----------------------------------------

Things to check at the LPAR:

- checking the status of the managed nodes: /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc

- Checking RMC status:
# lssrc -a | grep rsct
 ctrmc            rsct             8847376      active          <-- it is the RMC subsystem
 IBM.DRM          rsct_rm          6684802      active          <-- it is for executing DLPAR commands on the partition
 IBM.DMSRM        rsct_rm          7929940      active          <-- it is for tracking statuses of partitions
 IBM.ServiceRM    rsct_rm          10223780     active
 IBM.CSMAgentRM   rsct_rm          4915254      active          <-- it is for handshaking between the partition and HMC
 ctcas            rsct                          inoperative     <-- it is for security verification
 IBM.ERRM         rsct_rm                       inoperative
 IBM.AuditRM      rsct_rm                       inoperative
 IBM.LPRM         rsct_rm                       inoperative
 IBM.HostRM       rsct_rm                       inoperative     <-- it is for obtaining OS information

You will see some subsystems active and some inoperative (the key one for DLPAR is IBM.DRM).

- Stopping and starting RMC without erasing configuration:

# /usr/sbin/rsct/bin/rmcctrl -z    <-- it stops the daemons
# /usr/sbin/rsct/bin/rmcctrl -A    <-- it adds an entry to /etc/inittab and starts the daemons
# /usr/sbin/rsct/bin/rmcctrl -p    <-- it enables the daemons for remote client connections

(This is the correct method to stop and start RMC without erasing the configuration.)
Do not use stopsrc and startsrc for these daemons; use the rmcctrl commands instead!

- recfgct: deletes the RMC database, does a discovery, and recreates the RMC configuration
# /usr/sbin/rsct/install/bin/recfgct
(Wait several minutes)
# lssrc -a | grep rsct

(If you see IBM.DRM active, then you have probably resolved the issue)

- lsrsrc "IBM.ManagementServer"    <--it hmcs="" rmc="" shows="" span="" via="">

Sample LVM Procedures

                        Filesystem Procedures

Procedure to create a filesystem using JFS:
·      See the "Logical Volume Procedures" section below for the procedure to create a logical volume and a filesystem using JFS.


Procedure to extend the size of a filesystem using JFS (see the worked example after this list):
1.    "df" to see the filesystem, its current size, % utilization and the name of its logical volume
2.    "lslv <lv_name>" to show information about the logical volume, including its volume group name
3.    "lsvg <vg_name>" to show information about the volume group, including the number of free PPs and the PP size
4.    If there are not enough free PPs, then see below for the procedure to add a disk to a volume group
5.    "chfs -a size=+4194304 <filesystem>" to grow the filesystem by 2 GB (4194304 = 2*1024*1024*1024/512, i.e. the size is given in 512-byte blocks)
·      NOTE:  Growing the file system will automatically grow the logical volume
6.    "df" shows the file system's current size is 2 GB more than before.
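A worked sketch of the above, assuming a hypothetical filesystem /data on logical volume datalv in volume group datavg:

# df -m /data                      <-- current size, %used and the LV name (/dev/datalv)
# lslv datalv                      <-- shows VOLUME GROUP: datavg
# lsvg datavg                      <-- check FREE PPs and PP SIZE
# chfs -a size=+4194304 /data      <-- grow /data by 2 GB (4194304 x 512-byte blocks)
# df -m /data                      <-- confirm the new size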
Troubleshooting extending the size of a filesystem using JFS (a small example follows this list):
·      Error Message:  0516-787 extendlv: Maximum allocation for logical volume is 512.
·      The maximum number of LPs for the logical volume has been exceeded - the allocation must be increased
·      Calculate the number of LPs needed = LV size in MB / LP size in MB
·      chlv -x <new max # of LPs> <lv_name>
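For example (assuming a hypothetical logical volume datalv with 64 MB LPs that has to grow to 40 GB):

# chlv -x 640 datalv               <-- 40960 MB / 64 MB = 640 LPs, so raise the maximum to at least 640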
Procedure to remove a file system (a short sketch follows this list):
1.    Unmount the filesystem
2.    Remove the logical volume: "rmlv <lv_name>"
3.    Remove the filesystem information from /etc/filesystems
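A minimal sketch of these three steps, assuming a hypothetical filesystem /scratch on logical volume scratchlv:

# umount /scratch                  <-- fails if files on it are still open
# rmlv scratchlv                   <-- confirm the prompt; this destroys the data on the LV
# vi /etc/filesystems              <-- delete the /scratch stanza by hand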
Procedure to reduce the size of a file system - /usr/shareold is 8 MB and needs to be reduced to 4 MB
1.    Create the new file system
1.    crfs -v jfs -m /usr/sharenew -g rootvg -a size=8192
2.    this makes a 4 MB logical volume in the root volume group that uses jfs (8192 x 512-byte blocks)
2.    Mount the new file system
1.    mount /usr/sharenew
3.    Move the files from the old file system (/usr/shareold)
1.    cd /usr/shareold
2.    tar cf - . | (cd /usr/sharenew; tar xvf -)
3.    cd
4.    Unmount the file systems
1.    umount /usr/sharenew
2.    umount /usr/shareold
5.    Remove the old file system and its logical volume
1.    rmfs /usr/shareold
6.    Change the mount point of the new file system to the old name
1.    chfs -m /usr/shareold /usr/sharenew
7.    Mount the new filesystem under the old name
1.    mount /usr/shareold
8.    Delete the temporary mount point
1.    rmdir /usr/sharenew

                        Logical Volume Procedures

Procedure to create a logical volume and filesystem in a volume group using JFS (a worked sketch follows this list):
1.    lsvg <vg_name> to determine the size of the PP
2.    lslv on similar logical volumes to determine if mirroring is in effect
3.    Calculate the number of PPs needed for the logical volume
1.    bc
2.    scale=2
3.    <LV size in MB> / <PP size in MB>
4.    quit
4.    mklv -y "<lv_name>" <vg_name> <# of LPs>  --> creates the logical volume
5.    crfs -v jfs -d <lv_name> -m /<mount_point> -A yes   --> makes the filesystem, creates the mountpoint and puts it in /etc/filesystems
6.    mount /<mount_point>  --> mounts the new filesystem
7.    df /<mount_point>  --> verifies the mount and the size of the new filesystem
8.    Check the ownership and permissions of the new mount point
1.    ls -ld /<mount_point>
2.    chown <owner>:<group> /<mount_point>
3.    chmod XXX /<mount_point>
9.    If mirroring is in effect, then mirror this logical volume to another disk (original and 1 mirror):
1.    mklvcopy -s y <lv_name> 2 <hdisk#>
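A worked sketch of the above, assuming a hypothetical 2 GB filesystem /appdata on logical volume applv in volume group datavg with a 128 MB PP size (2048 MB / 128 MB = 16 LPs):

# lsvg datavg                      <-- PP SIZE is 128 MB and there are enough FREE PPs
# mklv -y "applv" datavg 16        <-- create the logical volume with 16 LPs
# crfs -v jfs -d applv -m /appdata -A yes
# mount /appdata
# df -m /appdata                   <-- verify the mount and the size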

Check to see if all of the logical volumes in a volume group are mirrored:
·      lsvg -l <vg_name>

Mirror a logical volume after the fact (a short sketch follows):
·      mklvcopy -s y <lv_name> 2 <hdisk#>
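A minimal sketch, assuming a hypothetical unmirrored logical volume applv that should get a second copy on hdisk3:

# mklvcopy -s y applv 2 hdisk3     <-- add a second copy of applv on hdisk3
# syncvg -l applv                  <-- synchronize the new copy (PPs show as "stale" until done)
# lslv applv                       <-- COPIES should now show 2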

                        Volume Group Procedures

Procedure to create a volume group:
1.    lsdev -C -c disk  --> lists available disks (and the hdisk#) on the server
2.    mkvg -y "<vg_name>" hdisk#  --> creates the volume group on the named hard disk
3.    varyonvg <vg_name>  --> activates the volume group
Procedure to add a disk to a volume group (extend the volume group) - a combined sketch of both procedures follows this list:
·      extendvg <vg_name> <hdisk#>
·      Verify the disk has been successfully added to the vg:
·      lsvg -p <vg_name>
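A combined sketch of the two procedures above, with hypothetical names (volume group datavg created on hdisk2 and later extended with hdisk3):

# lsdev -C -c disk                 <-- hdisk2 and hdisk3 are Available and not yet in use
# mkvg -y "datavg" hdisk2          <-- create the volume group
# varyonvg datavg                  <-- activate the volume group
# extendvg datavg hdisk3           <-- add a second disk later
# lsvg -p datavg                   <-- both hdisk2 and hdisk3 should be listed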

Procedure to mirror the rootvg:
1.    lspv  --> determine the hdisk#
2.    extendvg rootvg hdisk#  --> add the hdisk to the volume group
3.    lspv  -->  verify that the hdisk has been successfully added to the volume group
4.    chvg -Q 'n' rootvg  -->  change the quorum so that the vg will stay active if one of the mirrors fails
5.    mirrorvg -S -c 2 rootvg  --> mirror all of the logical volumes in the volume group
6.    lsvg -l rootvg  --> verify successful mirroring (PPs will appear "stale" until synchronization is complete).
7.    bosboot -a  -->  update the boot image information
8.    bootlist -m normal -o hdisk0 hdisk1  --> create a new bootlist
9.    bootlist -m normal -o  --> verify the bootlist is correct
Procedure to increase the number of LPs available
Assume we receive an error that the maximum number of LPs has been exceeded, and the maximum number of LPs defined was 1100:
1.    "lsvg <vg_name>" to show the total PPs available in the volume group = 1250
2.    "lsvg -l <vg_name>" to show the total PPs used by all logical volumes in that volume group (showed sys1log, the jfs log, was using 2 PPs)
3.    "chlv -x 1248 <lv_name>" to change the maximum number of LPs from 1100 to 1248 (1250 PPs in the volume group - 2 PPs used by the jfs log = 1248 available)

                        Physical Disk Procedures

Procedure to find disks/vpaths that are unallocated
·      lsvpcfg
·      This will show disks/vpaths and the volume group they are allocated to
·      lspv | grep None
·      This will show PVs and whether they are associated with a volume group
·      Note:  For vpaths, the hdisks will show as None, but they may be allocated to a vpath - you must grep each hdisk against the lsvpcfg output

Procedure to make a new LUN available to AIX
·      Allocate the new LUN on the SAN
·      Run "cfgmgr"
·      Verify the new vpath/hdisk by running "lsvpcfg"
·      There should be a new vpath and it should be available with no volume group - if not, rerun cfgmgr

Procedure to list the PVs in a volume group:
·      lsvg -p <vg_name>