
Interview Preparation : Veritas Volume Manager and Cluster Services


Important Keywords

LLT – heartbeat communication over the cluster interconnects
GAB – manages cluster membership
HAD – the VCS engine; manages the agents and service groups, and is itself monitored by the hashadow daemon
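
A quick way to verify that these components are healthy on a node (a minimal sketch; the process names are standard, but exact output varies by VCS version):
# ps -ef | egrep 'had|hashadow' | grep -v grep   <- both had and hashadow should be running
# lltstat -n                                     <- LLT should list every cluster node
# gabconfig -a                                   <- GAB port memberships (port a = GAB, port h = HAD)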

Important Commands

# gabconfig -a : check the status of the various GAB ports on the cluster nodes
# lltstat -nvv : check the detailed status of the LLT links
# gabconfig -c : start GAB
# gabconfig -U : stop GAB
# lltconfig -c : start LLT
# lltconfig -U : stop LLT (GAB needs to be stopped first)
# gabconfig -c -x : seed GAB manually when there are not enough nodes to start VCS (for example, during a maintenance activity)
# hastart : start HAD (the VCS engine)
# hastop -local : stops the service groups and the VCS engine [HAD] on the node where it is run
# hastop -local -evacuate : migrates the service groups to the other node and stops HAD on the current node only
# hastop -local -force : stops HAD on the node where it is run, leaving the services running
# hastop -all -force : stops HAD on all nodes of the cluster, leaving the services running
# hastop -all : stops HAD on all nodes in the cluster and takes the service groups offline
# hagrp -online [service-group] -sys [node] : online the SG on a particular node
# hagrp -offline [service-group] -sys [node] : offline the SG on a particular node
# hagrp -switch [service-group] -to [target-node] : switch the SG to the target node
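
For reference, on a healthy two-node cluster with I/O fencing configured, the output of gabconfig -a looks roughly like the following (generation numbers are illustrative): port a is the GAB membership, port b is the I/O fencing membership and port h is the VCS/HAD membership.
GAB Port Memberships
===============================================================
Port a gen   a36e0003 membership 01
Port b gen   a36e0006 membership 01
Port h gen   a36e0009 membership 01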

Important Files

:::: /etc/llttab <- LLT uses /etc/llttab to set the configuration of the LLT interconnects.
# cat /etc/llttab
set-node node01
set-cluster 02 <- unique cluster number assigned to the entire cluster
link nxge1 /dev/nxge1 - ether - -
link nxge2 /dev/nxge2 - ether - -
link-lowpri nxge0 /dev/nxge0 - ether - -
:::: /etc/llthosts <- contains the cluster-wide unique node numbers
# cat /etc/llthosts
0 node01 <- each node is assigned a unique node number, which can range from 0 to 31
1 node02
:::: /etc/gabtab <- contains the command used to start GAB
# cat /etc/gabtab
/sbin/gabconfig -c -n 4 <- "-n 4" means the number of nodes that must be communicating in order to seed GAB and start VCS

Important Procedures

Adding Service group

haconf -makerw
hagrp -add SG
hagrp -modify SG SystemList node01 0 node02 1
hagrp -modify SG AutoStartList node02
haconf -dump -makero
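
To populate the new service group with resources, reopen the configuration, add and link the resources, then dump it again. The sketch below is only an illustration: the resource names (SG_ip, SG_mnt), the IP address and the mount attributes are made-up examples.
# haconf -makerw
# hares -add SG_ip IP SG                              <- add an IP resource to service group SG
# hares -modify SG_ip Device nxge0
# hares -modify SG_ip Address 192.168.1.100
# hares -add SG_mnt Mount SG                          <- add a Mount resource
# hares -modify SG_mnt BlockDevice /dev/vx/dsk/datadg/datavol
# hares -modify SG_mnt MountPoint /data
# hares -modify SG_mnt FSType vxfs
# hares -link SG_mnt SG_ip                            <- SG_mnt (parent) depends on SG_ip (child)
# hagrp -enableresources SG
# haconf -dump -makero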
:::: freeze/unfreeze ( When you freeze a service group, VCS continues to monitor the service group, but does not allow it or the resources under it to be taken offline or brought online. Failover is also disabled, even when a resource faults. When you unfreeze the SG, it starts behaving in the normal way. )
# hagrp -freeze [service-group] <- temporary freeze
# hagrp -unfreeze [service-group] <- temporary unfreeze
# hagrp -freeze [service-group] -persistent <- persistent freeze
# hagrp -unfreeze [service-group] -persistent <- persistent unfreeze
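
Note that a persistent freeze is recorded in main.cf, so the configuration must be writable first; a typical sequence:
# haconf -makerw
# hagrp -freeze SG -persistent
# haconf -dump -makero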

Adding a node to an active cluster:

1: Set up the hardware
Before adding a node to an existing cluster, the node must be physically connected to the cluster.
Connect the VCS private Ethernet controllers
Connect the node to the shared storage
 
2: Install the VCS software in the node
Install the VCS software and install the license.
 
3: Configure LLT and GAB
Create the LLT and GAB configuration files (/etc/llthosts, /etc/llttab and /etc/gabtab) on the new node and update the files on the existing nodes.
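
For illustration, if a hypothetical third node node03 is being added to the two-node cluster used earlier (cluster number 02), the files could end up like this (device names are examples only):
# cat /etc/llthosts        <- same on all nodes
0 node01
1 node02
2 node03
# cat /etc/llttab          <- on node03; only set-node differs per node
set-node node03
set-cluster 02
link nxge1 /dev/nxge1 - ether - -
link nxge2 /dev/nxge2 - ether - -
# cat /etc/gabtab          <- seed count raised to the new node count
/sbin/gabconfig -c -n 3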
 
4: Add the node to an existing cluster
Perform the tasks given below on any one of the existing nodes of the cluster.
4.1: Change the cluster configuration to R/W mode.
# haconf -makerw
 
4.2: Add the new node to the cluster
# hasys -add <new node name>
 
4.3: Copy the main.cf file from an existing node to the new node
# scp /etc/VRTSvcs/conf/config/main.cf new_node:/etc/VRTSvcs/conf/config/main.cf
 
4.4: Start VCS on the new node
# hastart
 
4.5: Now make the configuration read-only again.
# haconf -dump -makero
 
4.6: Run the GAB configuration command on each node to verify that port a and port h include the new node in the membership.
# /sbin/gabconfig -a

Removing a node (node03) from an active cluster:

1: Backup the configuration file
# cp /etc/VRTSvcs/conf/config/main.cf /etc/VRTSvcs/conf/config/main.cf.orig

2: Check the status of the nodes and the service groups
# hastatus -summary

3: Switch any service group that is online on the node leaving the cluster to another node
# hagrp -switch <service group> -to <node name>

4: Delete the node from the VCS configuration
1: Make the cluster configuration R/W
# haconf -makerw

2: Stop the cluster on the leaving node
# hastop -sys <node>

3: Delete the leaving node from the service group's SystemList attribute.
# hagrp -modify <group> SystemList -delete <node>

4: Delete the node from the cluster
# hasys -delete <node>

5: Now again make the cluster configuration read-only.
# haconf -dump -makero

5: Modify the LLT and GAB configuration files to reflect the change
Modify the /etc/llthosts, /etc/llttab and /etc/gabtab files on the remaining nodes of the cluster, as shown in the example below.
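
For example, after node03 is removed from a three-node cluster, the files on the remaining nodes might look like this (illustrative):
# cat /etc/llthosts        <- node03 entry removed
0 node01
1 node02
# cat /etc/gabtab          <- seed count reduced to the remaining number of nodes
/sbin/gabconfig -c -n 2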

6: Remove VCS configuration on the node leaving the cluster
1: Unconfigure GAB and LLT
# /sbin/gabconfig -U
# /sbin/lltconfig -U
2: Unload the LLT and GAB modules
# modunload -i <gab_module_id>
# modunload -i <llt_module_id>

3: Rename the startup files to prevent LLT, GAB and VCS from starting up in the future.
# mv /etc/rc2.d/S70llt /etc/rc2.d/s70llt
# mv /etc/rc2.d/S92gab /etc/rc2.d/s92gab
# mv /etc/rc3.d/S99vcs /etc/rc3.d/s99vcs
4: Remove the VCS packages from the node
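A hedged example using the Solaris package tools (exact package names depend on the VCS version; VRTSvcs, VRTSgab and VRTSllt are the core ones):
# pkgrm VRTSvcs VRTSgab VRTSllt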

 

How to shut down a node in a VCS cluster?

1) Make the cluster configuration Read/Write
# haconf -makerw

2) Switch over all the service groups that are online on the node being shut down to the remaining nodes
# hagrp -switch <service group> -to <node name>

3) Freeze all the service groups that are online in the cluster
# hagrp -freeze <service group> -persistent

4) Stop the cluster on the node that is going to be brought down
# hastop -local -force

5) Rename the VCS startup script
# cd /etc/rc3.d
# mv S99vcs s99vcs

6) Now reboot the box.

Once the system comes back up after the reboot, follow the instructions given below.

1) Start VCS on this node
# hastart -force
2) Bring the service groups online if they were taken offline before the system went down.
# hagrp -online <service group> -sys <node name>

3) Unfreeze all the service groups which are frozen.
# hagrp -unfreeze <service group> -persistent

4) Now make the cluster configuration Read-Only
# haconf -dump -makero

5) Now again move back the VCS startup script
# cd /etc/rc3.d
# mv s99vcs S99vcs

 

Adding a new low-priority LLT link

( The removal procedure is the same, after modifying the llttab file to remove the link. )

Modify /etc/llttab to add the new link information; llttab is a per-node file, so make the corresponding change on every node. For example, on node02:

# cp /etc/llttab /etc/llttab.bak
# vi /etc/llttab
set-node node02
set-cluster 3
link qfe0 /dev/qfe:0 - ether - -
link qfe1 /dev/qfe:1 - ether - -
link-lowpri e1000g0 /dev/e1000g:0 - ether - - <- new entry in llttab
# haconf -dump -makero
# hastop -all -force (on any one node)
# gabconfig -a

Stop fencing on each node of the cluster:

# /sbin/vxfen-shutdown
# vxfenadm -d
# gabconfig -a

Unconfigure GAB and LLT on each node:

# gabconfig -U
# gabconfig -a
# lltconfig -U
# lltconfig

Now start LLT and GAB on each node

# lltconfig -c
# lltconfig
# sh /etc/gabtab
# gabconfig -a

Start fencing on each node

# /sbin/vxfen-startup
# vxfenadm -d
# gabconfig -a

Now start VCS on each node and verify that everything is running fine

# hastart
# hastatus -sum

Verify with:

# lltstat -nvv
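
Every node should now show the new low-priority link in the UP state; the output is roughly of the following form (node states, link names and MAC addresses are illustrative):
LLT node information:
    Node          State    Link      Status   Address
   * 0 node01     OPEN
                           qfe0      UP       08:00:20:AA:BB:01
                           qfe1      UP       08:00:20:AA:BB:02
                           e1000g0   UP       08:00:20:AA:BB:03   <- new low-pri link
     1 node02     OPEN
                           qfe0      UP       08:00:20:AA:BB:11
                           qfe1      UP       08:00:20:AA:BB:12
                           e1000g0   UP       08:00:20:AA:BB:13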

 
 

How to upgrade Solaris OS in which VCS is running?

1) Stop VCS on this node
Make the VCS configuration R/W
# haconf -makerw

Move all service groups from this node to another node and freeze this node:
# hasys -freeze -persistent -evacuate <node name>

Make the cluster configuration read-only
# haconf -dump -makero

Stop the cluster on this node
# hastop -local -force

2) Stop, unconfigure and uninstall LLT and GAB on this node
Unconfigure GAB
# gabconfig -U

Unconfigure LLT
# lltconfig -U

Now remove the GAB and LLT packages
# pkgrm VRTSgab VRTSllt

3) Now upgrade Solaris and switch to single-user mode

4) Now install and configure LLT and GAB
# pkgadd -d . VRTSgab VRTSllt

5) Now switch to multi-user mode and start VCS
# init 3
# hastart

6) Now unfreeze this node
# haconf -makerw
# hasys -unfreeze -persistent <node name>
# haconf -dump -makero

Jeopardy and Split brain troubleshooting

To recover from jeopardy

Just fix the failed link(s); GAB automatically detects the new link(s) and the jeopardy membership is removed from the node.

The Reason

Split brain occurs when all the LLT links fail simultaneously. The systems in the cluster then fail to
identify whether it is a system failure or an interconnect failure. Each mini-cluster thus formed thinks that
it is the only cluster that is active at the moment and tries to start the service groups of the other mini-cluster, which it thinks is down.
The same thing happens on the other mini-cluster, and this can lead to simultaneous access to the storage and cause data corruption.

IO Fencing

VCS implements an I/O fencing mechanism to avoid a possible split-brain condition. It ensures data integrity and data protection.
I/O fencing driver uses SCSI-3 PGR (persistent group reservations) to fence off the data in case of a possible split brain scenario.

How I/O fencing avoids split brain:

In case of a possible split brain, assume that node01 has key "A" and node02 has key "B" registered on the coordinator disks.
1. Both nodes think that the other node has failed and start racing to write their keys to the coordinator disks.
2. node01 manages to write its key to the majority of the coordinator disks, i.e. 2 of the 3 disks.
3. node02 loses the race and panics.
4. node01 now has a perfect membership, and hence the service groups from node02 can be started on node01.

Difference between MultiNICA and MultiNICB resource types

MultiNICA and IPMultiNIC

– supports active/passive configuration.
– Requires only 1 base IP (test IP).
– Does not require all the IPs to be in the same subnet.

MultiNICB and IPMultiNICB

– supports active/active configuration.
– Faster failover than the MultiNICA.
– Requires IP address for each interface.

Service Group flushing

Flushing of a service group is required when the agents for the resources in the service group appear to be suspended,
waiting for the resources to be brought online or taken offline. Flushing a service group clears any internal wait states and stops VCS
from attempting to bring resources online.
# hagrp -flush [SG] -sys node01 <- flush the service group SG on the cluster node node01

Clearing Resource Faults

For persistent resources
Do not do anything; wait for the next OfflineMonitorInterval (default: 300 seconds) for the resource to come online.

For non-persistent resources

Clear the fault and probe the resource on node01:
# hares -clear [resource_name] -sys node01
# hares -probe [resource_name] -sys node01
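
A typical sequence is to confirm where the resource is faulted before clearing it; app_mnt below is a hypothetical resource name:
# hares -state app_mnt                 <- shows the state of the resource on each node
# hares -clear app_mnt -sys node01
# hares -probe app_mnt -sys node01
# hastatus -sum                        <- confirm the fault has cleared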

VCS related Interview questions

How do you check the status of VERITAS Cluster Server (VCS)?

Ans: hastatus -sum

Which is the main config file for VCS and where it is located?

Ans: main.cf is the main configuration file for VCS and it is located in /etc/VRTSvcs/conf/config.

Which command will you use to check the syntax of the main.cf?

Ans: hacf -verify /etc/VRTSvcs/conf/config

How to switchover the service group in VCS?

Ans: # hagrp -switch <service group> -to <node>

How to online the service groups in VCS?

Ans: # hagrp -online <service group> -sys <node>

How to set the VCS configuration Read-Only?

Ans: # haconf -dump -makero

How to set the VCS configuration Read-Write?

Ans: # haconf -makerw

 How to display the list of all snapshots?

Ans: # hasnap -display -list

How to add a user with cluster administrator/Operator access?

Ans: # hauser -add <user> -priv Administrator/Operator

How to add a user with group administrator/Operator access?

Ans: # hauser -add <user> -priv Administrator/Operator -group <service group>

How to display the status of a service group on a system?

Ans: # hagrp -state <service group> -sys <system>
 

How to display the resources for a specific service group?

Ans: # hagrp -resources <service group>
 

How to display the service group dependencies?

Ans: # hagrp -dep <service group>
 

How to display information about a service group on a system?

Ans: # hagrp -display <service group> -sys <system name>
 

How to display resource dependencies?

Ans: # hares -dep <resource name>
 

How to display information about a resource?

Ans: # hares -display <resource name>
 

How to display resources of a service group?

Ans: # hares -display -group <service group>
 

How to display resources of a resource type?

Ans: # hares -display -type <resource type>
 

How to display the resources on a particular system?

Ans: # hares -display -sys <system name>
 

How to display all resource types?

Ans: # hatype -list
 

How to list the systems in the cluster?

Ans: # hasys -list
 

How to display information about a particular system?

Ans: # hasys -display <system name>
 

How to display information about the cluster?

Ans: # haclus -display
 

How to display the status of all service groups including resources in cluster?

Ans: # hastatus

How to display the status of cluster faults, including faulted service groups, systems, links and agents?

Ans: # hastatus -summary

How to add a service group in a cluster?

Ans: # hagrp -add <service group>

How to delete a service group from a cluster?

Ans: # hagrp -delete <service group>

How to modify a service group attribute such as SystemList, AutoStartList, parallel etc?

Ans: 
(A) How to populate the SystemList attribute of service group groupX with SystemA and SystemB:
# hagrp -modify groupX SystemList -add SystemA 1 SystemB 2
(B) How to populate the AutoStartList attribute of service group groupX with SystemA and SystemB:
# hagrp -modify groupX AutoStartList -add SystemA SystemB
(C) How to define the service group as parallel:
# hagrp -modify <service group> Parallel 1

How to bring a service group online?

Ans: # hagrp -online <service group> -sys <system name>

 

How to take a service group offline?

Ans: # hagrp -offline <service group> -sys <system name>

How to take a service group offline if all resources are probed?

Ans: # hagrp -offline <service group> -ifprobed -sys <system name>

How to switch a service group from one system to another system?

Ans: # hagrp -switch <service group> -to <system name>

How to freeze a service group?

Ans: # hagrp -freeze <service group> -persistent

How to unfreeze a frozen service group?

Ans: # hagrp -unfreeze <service group> -persistent

How to disable a service group?

Ans: # hagrp -disable <service group> -sys <system name>

How to enable a service group?

Ans: # hagrp -enable <service group> -sys <system name>

How to enable all resources in a service group?

Ans: # hagrp -enableresources <service group>

How to disable all resources in a service group?

Ans: # hagrp -disableresources <service group>

How to clear faulted, non-persistent resources in a service group?

Ans: # hagrp -clear <service group> -sys <system name>

How to clear resources in ADMIN_WAIT state in a service group?

Ans: # hagrp -clearadminwait <service group> -sys <system name>

How to flush a service group?

Ans: # hagrp -flush <service group> -sys <system name>

How to link a service group with another?

Ans: # hagrp -link <parent service group> <child service group> <gd_category> <gd_location> <gd_type>
gd_category = Category of group dependency (online/offline)
gd_location = The scope of dependency (local/global/remote)
gd_type = type of group dependency (soft/firm/hard)
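For example, to make a hypothetical application group appsg depend on a database group dbsg with an online local firm dependency:
# hagrp -link appsg dbsg online local firm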

How to unlink a service group with another?

Ans: # hagrp -unlink <parent service group> <child service group>
 


How will you check the status of individual resource of VCS cluster?

Ans: hares -state <resource>

What is the service group in VCS?

Ans: A service group is made up of the resources and their dependency links that are required to maintain high availability of an application.

What is the use of halink command?

Ans: halink is used to link the dependencies of the resources

What is the difference between switchover and failover?

Ans: Switchover is a manual task, whereas failover is automatic. You can switch over a service group from the node where it is online to another cluster node in case of a power outage, hardware failure, or a scheduled shutdown and reboot. Failover moves the service group to the other node automatically when a node faults, for example when the VCS heartbeat links are down, damaged or broken because of some disaster, or when the system hangs.

What is the use of hagrp command?

Ans: hagrp is used for doing administrative actions on service groups like online, offline, switch etc.

How to switchover the service group in VCS?

Ans: hagrp -switch <service group> -to <node>

How to online the service groups in VCS?

Ans: hagrp -online <service group> -sys <node>

How to access the VCS cluster management console?

Ans: VCS cluster management console can be accessed by the below given URLs:
http://Servername:8181/cmc/
or
https://Servername:8443/cmc

How to access the Cluster Manager Java Console?

Ans: #/opt/VRTSvcs/bin/hagui

What is Jeopardy?

Ans: When a node in the cluster has only one cluster interconnect link remaining, it is very difficult for GAB to discriminate between a system failure and a network failure. A special membership category, called jeopardy membership, takes effect in this situation. This membership helps prevent the cluster from getting into a split-brain condition. When a system is placed in jeopardy membership, two actions occur:
1: Service groups running on this node are placed in the auto-disabled state. A service group in the auto-disabled state may fail over on a resource or group fault but cannot fail over on a system fault.
2: VCS operates the cluster as a single-node cluster. Other systems in the cluster are partitioned off into a separate cluster membership.

What is the main daemon of VCS?

Ans: had (high availability daemon) which is started by hashadow daemon.

What is GAB?

Ans: Group Membership Services/Atomic Broadcast (GAB) is responsible for cluster membership and reliable cluster communication. GAB has two major functions:

1: Cluster membership
GAB maintains cluster membership by receiving heartbeats from LLT. When a system no longer receives heartbeats from a cluster peer, GAB marks the node as down.
2: Cluster communication
GAB provides guaranteed delivery of messages to all the systems. The atomic broadcast functionality is used by HAD to ensure that all systems within the cluster receive configuration change messages.

What is LLT?

Ans: Low Latency Transport (LLT) is used for all cluster communication. LLT has 2 major functions:
1: Traffic Distribution
LLT works as the backbone for GAB. LLT distributes all inter-node communication across all the configured network links. If a link fails, traffic is redirected to the remaining links.
2: Heartbeat
LLT is responsible for sending and receiving heartbeat signals.
 

How many network links are supported in LLT?

Ans: 8 links are supported.

How many nodes can join a Cluster?

Ans: Maximum of 32 nodes is supported in VCS.

What is heartbeat?

Ans: Heartbeat is an Ethernet broadcast packet. This packet notifies all other nodes that the sender is functional. This is the only broadcast traffic generated by VCS. Each node sends 2 heartbeat packets per second per interface. Heartbeats are used by GAB to determine cluster membership.

What is split brain condition?

Ans: When all the cluster interconnect links fail, it is possible for one cluster to separate into 2 subclusters, each of which does not know about the other subcluster. The two subclusters could each carry out recovery actions for the departed system. For example, two systems could try to import the same storage and cause data corruption.

What is coordinator disk?

Ans: Coordinator disks are three standard disks or LUNs set aside for I/O fencing during cluster reconfiguration. Coordinator disks do not serve any other storage purpose in the VCS configuration. These disks provide a lock mechanism to determine which nodes get to fence off data drives from other nodes. A node must eject a peer from the coordinator disks before it can fence the peer from the data drives. This concept of racing for control of the coordinator disks to gain the ability to fence data disks is key to understanding prevention of split brain through fencing.
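
A hedged sketch of preparing three coordinator disks with Volume Manager (device and disk group names are examples only; the vxfentsthdw utility shipped with the fencing package can be used first to confirm that the disks support SCSI-3 persistent reservations):
# vxdisksetup -i c1t1d0
# vxdisksetup -i c2t1d0
# vxdisksetup -i c3t1d0
# vxdg init vxfencoorddg c1t1d0 c2t1d0 c3t1d0
# vxdg deport vxfencoorddg          <- the coordinator disk group is kept deported and is not used for data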

What is IO fencing and how to configure IO fencing?

Ans: I/O fencing is a feature that prevents data corruption in the event of a communication breakdown in a cluster. I/O fencing removes the risk associated with a split-brain condition: it allows write access for members of the active cluster and blocks access to storage from non-members, so even a node that is alive is unable to cause damage.
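
A minimal configuration sketch, assuming the coordinator disk group created in the previous answer and a SCSI-3 capable array (file names and contents vary between VCS versions):
# echo "vxfencoorddg" > /etc/vxfendg        <- name of the coordinator disk group
# cat /etc/vxfenmode                        <- VCS 5.x style fencing mode file
vxfen_mode=scsi3
scsi3_disk_policy=dmp
# /sbin/vxfen-startup                       <- start the fencing driver on each node
# vxfenadm -d                               <- verify that fencing is running in SCSI3 mode on all nodes
In addition, the cluster attribute UseFence = SCSI3 must be set in main.cf before HAD is started so that VCS actually uses the fencing membership.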
 
