[toc]
Important Keywords
LLT – heartbeat communication over the cluster interconnects
GAB – manages cluster membership.
HAD – the VCS engine; manages the agents and service groups, and is itself monitored by the hashadow daemon.
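Since HAD and hashadow run as ordinary userland daemons (normally under /opt/VRTSvcs/bin), a quick sanity check is a simple process listing; the grep pattern below is just one illustrative way to do it:
# ps -ef | grep -E 'had|hashadow' | grep -v grep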
Important Commands
# gabconfig -a : check the status of the various GAB ports on the cluster nodes (a sample output follows this list)
# lltstat -nvv : check the detailed status of the LLT links
# gabconfig -c (start GAB)
# gabconfig -U (stop GAB)
# lltconfig -c -> start LLT
# lltconfig -U -> stop LLT (GAB needs to be stopped first)
# gabconfig -c -x –> seed GAB manually when we don’t have a sufficient number of nodes to start VCS (for example, during a maintenance activity)
# hastart –> start HAD or VCS
# hastop -local –> Stops service groups and VCS engine [HAD] on the node where it is fired
# hastop -local -evacuate –> migrates Service groups to the other node and stops HAD on the current node only
# hastop -local -force –> Stops HAD leaving services running on the node where it is fired
# hastop -all -force –> Stops HAD on all the nodes of cluster leaving the services running
# hastop -all –> Stops HAD on all nodes in cluster and takes service groups offline
# hagrp -online [service-group] -sys [node] (Online the SG on a particular node)
# hagrp -offline [service-group] -sys [node] (Offline the SG on particular node)
# hagrp -switch [service-group] -to [target-node]
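For reference, on a healthy two-node cluster gabconfig -a normally reports port a (GAB membership), port b (I/O fencing, if configured) and port h (HAD) with both nodes present. The layout below is only an illustration of the usual output format; the generation numbers and memberships will differ on a real cluster:
# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen a36e0003 membership 01
Port b gen a36e0006 membership 01
Port h gen a36e0009 membership 01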
Important Files
:::: /etc/llttab <– LLT uses /etc/llttab to configure the LLT interconnects.
# cat /etc/llttab
set-node node01
set-cluster 02 <– unique cluster number assigned to the entire cluster
link nxge1 /dev/nxge1 - ether - -
link nxge2 /dev/nxge2 - ether - -
link-lowpri nxge0 /dev/nxge0 - ether - -
:::: /etc/llthosts <– contains cluster-wide unique node number
# cat /etc/llthosts
0 node01 <– each node is assigned a unique node number; it can range from 0 to 31
1 node02
:::: /etc/gabtab <– command to start the GAB
# cat /etc/gabtab
/sbin/gabconfig -c -n 4 <– "-n 4" means the number of nodes that must be communicating in order to seed GAB and start VCS
Operations:
Important Procedures
Adding Service group
haconf -makerw
hagrp -add SG
hagrp -modify SG SystemList node01 0 node02 1
hagrp -modify SG AutoStartList node02
haconf -dump -makero
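The new service group is empty until resources are added to it and linked. The sketch below is only an illustration, using hypothetical resource names (app_nic, app_ip), a hypothetical NIC device and a hypothetical IP address:
# haconf -makerw
# hares -add app_nic NIC SG
# hares -modify app_nic Device e1000g0
# hares -add app_ip IP SG
# hares -modify app_ip Device e1000g0
# hares -modify app_ip Address "192.168.1.10"
# hares -link app_ip app_nic <– app_ip (parent) requires app_nic (child)
# haconf -dump -makero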
:::: freeze/unfreeze ( When you freeze a service group, VCS continues to monitor the service group, but does not allow it or the resources under it to be taken offline or brought online. Failover is also disabled, even when a resource faults. When you unfreeze the SG, it starts behaving in the normal way again. A quick check of the freeze state is shown after the commands below. )
# hagrp -freeze [service-group] <- temporary freeze
# hagrp -unfreeze [service-group] <- temporary unfreeze
# hagrp -freeze [service-group] -persistent <– persistent freeze
# hagrp -unfreeze [service-group] -persistent <– persistent unfreeze
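To verify whether a group is currently frozen, its Frozen (persistent) and TFrozen (temporary) attributes can be inspected; note that persistent freeze/unfreeze operations require the configuration to be writable (haconf -makerw) first. A minimal check, assuming a service group named SG:
# hagrp -display SG -attribute Frozen TFrozen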
Adding a node to an active cluster:
Removing a node (node03) from active cluster:
1: Backup the configuration file
# cp /etc/VRTSvcs/conf/config/main.cf /etc/VRTSvcs/conf/config/main.cf.orig
2: Check the status of the nodes and the service groups
# hastatus -summary
3: Switch any service group which is online on the node leaving the cluster
# hagrp -switch <service group> -to <node name>
4: Delete the node from the VCS configuration
1: Make the cluster configuration R/W
# haconf -makerw
2: Stop the cluster on the leaving node
# hastop -sys <node>
3: Delete the leaving node from the service group’s SystemList attribute.
# hagrp -modify <group> SystemList -delete <node>
4: Delete the node from the cluster
# hasys -delete <node>
5: Now again make the cluster configuration Read Only.
# haconf -dump -makero
5: Modify the LLT and GAB configuration files to reflect changes
Modify the /etc/llthosts, /etc/llttab and /etc/gabtab files on the remaining nodes of the cluster.
6: Remove VCS configuration on the node leaving the cluster
1: Unconfigure and unload LLT and GAB
# /sbin/gabconfig -U
# /sbin/lltconfig -U
2: Unload the LLT and GAB modules
# modunload -i <gab_module_id>
# modunload -i <llt_module_id>
(the module IDs can be obtained with modinfo)
3: Rename the startup files to prevent LLT, GAB and VCS from starting up in future.
# mv /etc/rc2.d/S70llt /etc/rc2.d/s70llt
# mv /etc/rc2.d/S92gab /etc/rc2.d/s92gab
# mv /etc/rc3.d/S99vcs /etc/rc3.d/s99vcs
4: Remove VCS package from the node
How to shut down a node in a VCS cluster?
1) Make the cluster configuration Read/Write
# haconf -makerw
2) Switch over or fail over all the service groups that are online on the node being shut down to the remaining node(s)
# hagrp -switch <service group> -to <node name>
3) Freeze all the service groups which are online in the cluster.
# hagrp -freeze <service group> -persistent
4) Stop the cluster on the node that is going to be shut down.
# hastop -local -force
5) Rename the VCS startup script
# cd /etc/rc3.d
# mv S99vcs s99vcs
6) Now reboot the box.
Once the system comes up after the reboot, follow the steps below.
1) Start VCS on this node
# hastart -force
2) Bring the service groups online if they were taken offline before the shutdown.
# hagrp -online <service group> -sys <node name>
3) Unfreeze all the service groups which are frozen.
# hagrp -unfreeze <service group> -persistent
4) Now make the cluster configuration Read-Only
# haconf -dump -makero
5) Now move the VCS startup script back
# cd /etc/rc3.d
# mv s99vcs S99vcs
Adding a new low-priority LLT link
(the removal procedure is the same, after modifying the llttab file to remove the link)
Modify the llttab on each node of the cluster to add the new link information (the example below is from node02):
# cp /etc/llttab /etc/llttab.bak
# vi /etc/llttab
set-node node02
set-cluster 3
link qfe0 /dev/qfe:0 - ether - -
link qfe1 /dev/qfe:1 - ether - -
link-lowpri e1000g0 /dev/e1000g:0 - ether - - <– new entry in llttab
# haconf -dump -makero
# hastop -all -force (on any one node)
# gabconfig -a
Stop fencing on each node of the cluster:
# /sbin/vxfen-shutdown
# vxfenadm -d
# gabconfig -a
Unconfigure GAB and LLT on each node:
# gabconfig -U
# gabconfig -a
# lltconfig -U
# lltconfig
Now start LLT and GAB on each node
# lltconfig -c
# lltconfig
# sh /etc/gabtab
# gabconfig -a
Start fencing on each node
# /sbin/vxfen-startup
# vxfenadm -d
# gabconfig -a
Now start VCS on each node and verify if everything is running fine
# hastart
# hastatus -sum
Verify with:
# lltstat -nvv
How to upgrade the Solaris OS on a node where VCS is running?
1) Stop VCS on this node
Make the VCS configuration R/W
# haconf -makerw
Move all service groups from this node to another node and freeze this node:
# hasys -freeze -persistent -evacuate <node name>
Make the cluster configuration Read-Only
# haconf -dump -makero
Stop the cluster on this node
# hastop -local -force
2) Stop, unconfigure and uninstall LLT and GAB on this node
Unconfigure GAB
# gabconfig -U
Unconfigure LLT
# lltconfig -U
Now remove the GAB and LLT packages
# pkgrm VRTSgab VRTSllt
3) Now upgrade Solaris and switch to single-user mode
4) Now install and configure LLT and GAB
# pkgadd -d . VRTSgab VRTSllt
5) Now switch to multi-user mode and start VCS
# init 3
# hastart
6) Now unfreeze this node
# haconf -makerw
# hasys -unfreeze -persistent <node name>
# haconf -dump -makero
Jeopardy and Split brain troubleshooting
To recover from jeopardy
Just fix the failed link(s); GAB automatically detects the restored link(s) and the jeopardy membership is removed from the node.
The Reason
Split brain occurs when all the LLT links fail simultaneously. The systems in the cluster then cannot tell whether it is a system failure or an interconnect failure. Each mini-cluster thus formed thinks that it is the only cluster that is active at the moment and tries to start the service groups of the other mini-cluster, which it believes is down. The same thing happens on the other mini-cluster, and this can lead to simultaneous access to the storage and cause data corruption.
IO Fencing
VCS implements an I/O fencing mechanism to avoid a possible split-brain condition. It ensures data integrity and data protection.
The I/O fencing driver uses SCSI-3 PGR (Persistent Group Reservations) to fence off the data disks in case of a possible split-brain scenario.
How I/O fencing avoids split brain:
In case of a possible split brain, assume that node01 has registered key “A” and node02 has registered key “B” on the coordinator disks.
1. Both nodes think that the other node has failed and start racing to write their keys to the coordinator disks.
2. node01 manages to write its key to the majority of the coordinator disks, i.e. 2 out of the 3 disks.
3. node02 loses the race and panics.
4. node01 now has a perfect membership, and hence the service groups from node02 can be started on node01.
Difference between MultiNICA and MultiNICB resource types
MultiNICA and IPMultiNIC
– Supports active/passive configuration.
– Requires only one base IP (test IP).
– Does not require all IPs to be in the same subnet.
MultiNICB and IPMultiNICB
– Supports active/active configuration.
– Faster failover than MultiNICA.
– Requires an IP address for each interface.
Service Group flushing
Flushing a service group is required when the agents for the resources in the service group seem suspended, waiting for resources to be taken online/offline. Flushing a service group clears any internal wait states and stops VCS from attempting to bring the resources online.
# hagrp -flush [SG] -sys node01 <– flush the service group SG on the cluster node node01
Clearing Resource Faults
For persistent resources
Do nothing and wait for the next OfflineMonitorInterval (default: 300 seconds) for the resource to report online again; the fault then clears on its own.
For non-persistent resources
Clear the fault and probe the resource on node01 :
# hares -clear [resource_name] -sys node01
# hares -probe [resource_name] -sys node01
VCS related Interview questions
How do you check the status of VERITAS Cluster Server aka VCS?
Ans: hastatus -sum
Which is the main config file for VCS and where it is located?
Ans: main.cf is the main configuration file for VCS and it is located in /etc/VRTSvcs/conf/config.
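For illustration only, a minimal main.cf follows the structure sketched below; the cluster name, systems, service group, resources and attribute values are all hypothetical, and in practice the file is maintained through the ha* commands (haconf -dump) rather than edited by hand while VCS is running:
include "types.cf"
cluster demo_clus (
        )
system node01 (
        )
system node02 (
        )
group appsg (
        SystemList = { node01 = 0, node02 = 1 }
        AutoStartList = { node01 }
        )
        IP app_ip (
                Device = e1000g0
                Address = "192.168.1.10"
                NetMask = "255.255.255.0"
                )
        NIC app_nic (
                Device = e1000g0
                )
        app_ip requires app_nic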
Which command will you use to check the syntax of the main.cf?
Ans: hacf -verify /etc/VRTSvcs/conf/config
How to switch over a service group in VCS?
Ans: # hagrp -switch <service group> -to <node>
How to bring a service group online in VCS?
Ans: # hagrp -online <service group> -sys <node>
How to set the VCS configuration Read-Only?
Ans: # haconf -dump -makero
How to set the VCS configuration Read-Write?
Ans: # haconf -makerw
How to display the list of all snapshots?
Ans: # hasnap -display -list
How to add a user with cluster administrator/Operator access?
Ans: # hauser -add <user> -priv Administrator/Operator
How to add a user with group administrator/Operator access?
Ans: # hauser -add <user> -priv Administrator/Operator -group <service group>
How to display the status of a service group on a system?
Ans: # hagrp -state <service group> -sys <system>
How to display the resources for a specific service group?
Ans: # hagrp -resources <service group>
How to display the service group dependencies?
Ans: # hagrp -dep <service group>
How to display information about a service group on a system?
Ans: # hagrp -display <service group> -sys <system name>
How to display resource dependencies?
Ans: # hares -dep <resource name>
How to display information about a resource?
Ans: # hares -display <resource name>
How to display resources of a service group?
Ans: # hares -display -group <service group>
How to display resources of a resource type?
Ans: # hares -display -type <resource type>
How to display the resources on a particular system?
Ans: # hares -display -sys <system name>
How to display all resources type?
Ans: # hatype -list
How to list the systems in the cluster?
Ans: # hasys -list
How to display information about a particular system?
Ans: # hasys -display <system name>
How to display information about the cluster?
Ans: # haclus -display
How to display the status of all service groups including resources in cluster?
Ans: # hastatus
How to display the status of cluster faults, including faulted service groups, systems, links and agents?
Ans: # hastatus -summary
How to add a service group in a cluster?
Ans: # hagrp -add <service group>
How to delete a service group from a cluster?
Ans: # hagrp -delete <service group>
How to modify a service group attribute such as SystemList, AutoStartList, parallel etc?
Ans:
(A) How to populate the SystemList attribute of service group groupX with SystemA and SystemB.
# hagrp -modify groupX SystemList -add SystemA 1 SystemB 2
(B) How to populate the AutoStartList attribute of service group groupX with SystemA and SystemB.
# hagrp -modify groupX AutoStartList -add SystemA SystemB
(C) How to define the service group as parallel?
# hagrp -modify <service group> Parallel 1
How to bring a service group online?
Ans: # hagrp -online <service group> -sys <system name>
How to take a service group offline?
Ans: # hagrp -offline <service group> -sys <system name>
How to take a service group offline if all resources are probed?
Ans: # hagrp -offline <service group> -ifprobed -sys <system name>
How to switch a service group from one system to another system?
Ans: # hagrp -switch <service group> -to <system name>
How to freeze a service group?
Ans: # hagrp -freeze <service group> -persistent
How to unfreeze a frozen service group?
Ans: # hagrp -unfreeze <service group> -persistent
How to disable a service group?
Ans: # hagrp -disable <service group> -sys <system name>
How to enable a service group?
Ans: # hagrp -enable <service group> -sys <system name>
How to enable all resources in a service group?
Ans: # hagrp -enableresources <service group>
How to disable all resources in a service group?
Ans: # hagrp -disableresources <service group>
How to clear faulted, non-persistent resources in a service group?
Ans: # hagrp -clear <service group> -sys <system name>
How to clear resources in ADMIN_WAIT state in a service group?
Ans: # hagrp -clearadminwait <service group> -sys <system name>
How to flush a service group?
Ans: # hagrp -flush <service group> -sys <system name>
How to link a service group to another?
Ans: # hagrp -link <parent service group> <child service group> <gd_category> <gd_location> <gd_type>
gd_category = category of the group dependency (online/offline)
gd_location = the scope of the dependency (local/global/remote)
gd_type = type of the group dependency (soft/firm/hard)
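For example, to make a hypothetical application group appsg depend on a database group dbsg with an online local firm dependency (the group names are only illustrative):
# hagrp -link appsg dbsg online local firm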
How to unlink a service group from another?
Ans: # hagrp -unlink <parent service group> <child service group>
How will you check the status of an individual resource of the VCS cluster?
Ans: # hares -state <resource>
What is the service group in VCS?
Ans: A service group is made up of the resources and their links that are normally required to maintain the high availability of an application.
What is the use of halink command?
Ans: halink is used to link resource dependencies.
What is the difference between switchover and failover?
Ans: Switchover is a manual task whereas failover is automatic. You switch over a service group from one cluster node to another for planned events such as hardware maintenance or a scheduled shutdown and reboot. Failover, on the other hand, moves the service group to the other node automatically when the current node faults, for example when the node crashes or hangs, or the VCS heartbeat links are broken because of some disaster.
What is the use of hagrp command?
Ans: hagrp is used for administrative actions on service groups, such as bringing them online, taking them offline, switching them, etc.
How to switchover the service group in VCS?
Ans: hagrp -switch <service group> -to <node>
How to online the service groups in VCS?
Ans: hagrp -online <service group> -sys <node>
How to access the VCS cluster management console?
Ans: VCS cluster management console can be accessed by the below given URLs:
http://Servername:8181/cmc/
or
https://Servername:8443/cmc
How to access the Cluster Manager Java Console?
Ans: # /opt/VRTSvcs/bin/hagui
What is Jeopardy?
Ans: When a node in the cluster has only one interconnect link remaining, it is very difficult for GAB to discriminate between a system failure and a network failure. A special membership category, called jeopardy membership, takes effect in this situation. This membership protects the cluster from a split-brain condition. When a system is placed in jeopardy membership, two actions occur:
1: Service groups running on this node are placed in the autodisabled state. A service group in the autodisabled state may fail over on a resource or group fault but cannot fail over on a system fault.
2: VCS operates the node as a single-node cluster. The other systems in the cluster are partitioned off into a separate cluster membership.
What is the main daemon of VCS?
Ans: had (high availability daemon) which is started by hashadow daemon.
What is GAB?
Ans: Group Membership Services/Atomic Broadcast (GAB) is responsible for cluster membership and reliable cluster communication. GAB has two major functions:
1: Cluster membership
GAB maintains cluster membership by receiving heartbeat from LLT. When a system no longer receives heartbeats from a cluster peer, GAB marks the node as down.
2: Cluster communication
GAB provides guaranteed delivery of messages to all the systems. The atomic broadcast functionality is used by HAD to ensure that all systems within the cluster receive configuration change messages.
What is LLT?
Ans: Low Latency Transport (LLT) is used for all cluster communication. LLT has 2 major functions:
1: Traffic Distribution
LLT works as the backbone for GAB. LLT distributes all inter-node communication across all configured network links. If a link fails, traffic is redirected to the remaining links.
2: Heartbeat
LLT is responsible for sending and receiving heartbeat signals.
How many network links are supported in LLT?
Ans: 8 links are supported.
How many nodes can join a Cluster?
Ans: Maximum of 32 nodes is supported in VCS.
What is heartbeat?
Ans: A heartbeat is an Ethernet broadcast packet. This packet notifies all other nodes that the sender is functional. This is the only broadcast traffic generated by VCS. Each node sends 2 heartbeat packets per second per interface. Heartbeats are used by GAB to determine cluster membership.
What is split brain condition?
Ans: When all the cluster interconnect links fail, it is possible for the cluster to separate into 2 subclusters, each of which does not know about the other. The two subclusters could each carry out recovery actions for the departed systems. For example, two systems could try to import the same storage and cause data corruption.
What is coordinator disk?
Ans:Coordinator disks are three standard disks or LUNs set aside for I/O fencing during cluster reconfiguration. Coordinator disks do not serve any other storage purpose in the VCS configuration. These disks provide a lock mechanism to determine which nodes get to fence off data drives from other nodes. A node must eject a peer from the coordinator disks before it can fence the peer from the data drives. This concept of racing for control of the coordinator disks to gain the ability to fence data disks is key to understanding prevention of split brain through fencing.
What is IO fencing and how to configure IO fencing?
Ans: I/O fencing is a feature that prevents data corruption in the event of a communication breakdown in a cluster. I/O fencing removes the risk associated with the split-brain condition. It allows write access for members of the active cluster and blocks access to the storage from non-members; even a node that is alive is unable to cause damage. A rough configuration outline is given below.
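The steps below are only a sketch of the commonly documented procedure for disk-based SCSI-3 fencing; the coordinator disk group name (vxfencoorddg) and the disk names are purely illustrative:
# vxdg -o coordinator=on init vxfencoorddg c1t1d0 c2t1d0 c3t1d0 <– create a coordinator disk group on 3 disks
# vxdg deport vxfencoorddg <– deport it; the fencing driver uses it directly
# echo "vxfencoorddg" > /etc/vxfendg <– tell the fencing driver which disk group holds the coordinator disks
# cp /etc/vxfen.d/vxfenmode_scsi3_dmp /etc/vxfenmode <– select SCSI-3 fencing with the DMP disk policy
# /sbin/vxfen-startup <– start the fencing driver (port b should now appear in gabconfig -a)
# vxfenadm -d <– verify the fencing state
Finally, set UseFence = SCSI3 in the cluster definition of main.cf and restart VCS on all nodes.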