RAC Attack - Oracle Cluster Database at Home/RAC Attack 12c/Node Fencing
The goal of this lab is to demonstrate Oracle Clusterware’s fencing ability by forcing a configuration that will trigger its built-in fencing features. With Oracle Clusterware, fencing is handled at the node level by rebooting the non-responsive or failed node. This is similar to the Shoot The Other Machine In The Head (STOMITH) algorithm, except that the failed node reboots itself (a suicide) rather than being shot by the other machine. There are many good sources for more information online.
- Start with a normal, running cluster with the database instances up and running.
- Monitor the Clusterware logfiles on each node. On each node, open a new window for each logfile and run one of the following commands:
[oracle@<node_name> ~]$ tail -f /u01/app/12.1.0/grid/log/`hostname -s`/crsd/crsd.log
[oracle@<node_name> ~]$ tail -f /u01/app/12.1.0/grid/log/`hostname -s`/cssd/ocssd.log
In my 12.1.0.2 lab environment, the clusterware log files now reside in diag_dest, under the $ORACLE_BASE/diag/crs/$HOSTNAME/crs/trace directory.
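Because the log location depends on the installed release, a small helper can pick whichever of the two directories mentioned above exists on the node. This is only a sketch: the function name cw_log_dir is made up for this lab, and it checks nothing beyond the two paths given in the text.

```shell
# Print the directory that holds the Clusterware logs for one node.
# Arguments: ORACLE_BASE, Grid home, short hostname.
# Only the two locations named in this lab are checked (hypothetical helper).
cw_log_dir() {
    oracle_base="$1"
    grid_home="$2"
    host="$3"
    adr_dir="$oracle_base/diag/crs/$host/crs/trace"   # diag_dest layout
    legacy_dir="$grid_home/log/$host"                 # older layout under the Grid home
    if [ -d "$adr_dir" ]; then
        echo "$adr_dir"
    elif [ -d "$legacy_dir" ]; then
        echo "$legacy_dir"
    else
        echo "no Clusterware log directory found for $host" >&2
        return 1
    fi
}
```

For example, `ls "$(cw_log_dir /u01/app/oracle /u01/app/12.1.0/grid "$(hostname -s)")"` lists the files you can then follow with tail -f. The ORACLE_BASE value /u01/app/oracle is an assumption about your install; substitute your own.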
- We will simulate “unplugging” the network interface by taking one of the private network interfaces down. On the collabn2 node, take the private network interface down by running the following command (as the root user):
[root@collabn2 ~]# ifconfig eth1 down
Alternatively, you can simulate this by taking the Internal Network interface offline in VirtualBox:
Go to collabn2 -> Settings -> Network -> Adapter 2, uncheck Cable connected and click OK.
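If you prefer the command line on the VirtualBox host, the same cable toggle can be sketched with VBoxManage controlvm and its setlinkstate<N> subcommand, which works on a running VM. The wrapper name set_cable and the VBOXMANAGE override variable are made up for this example; the VM name collabn2 and adapter number 2 come from the step above.

```shell
# Connect or disconnect the virtual cable of one adapter on a running VM.
# Arguments: VM name, adapter number, state (on | off).
# VBOXMANAGE is a hypothetical override for the path to the VBoxManage binary.
set_cable() {
    vm="$1"
    adapter="$2"
    state="$3"
    case "$state" in
        on|off) ;;
        *) echo "state must be 'on' or 'off'" >&2; return 2 ;;
    esac
    "${VBOXMANAGE:-VBoxManage}" controlvm "$vm" "setlinkstate${adapter}" "$state"
}
```

Run `set_cable collabn2 2 off` on the host to unplug the cable, and `set_cable collabn2 2 on` to plug it back in.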
- After running this command, watch the logfiles you began monitoring in step 2 above. You should see errors in those logfiles, and eventually (it can take a minute or two) you will observe one node reboot itself.
If you used ifconfig to trigger the failure, the node will rejoin the cluster and the instance should start automatically.
If you used VirtualBox to trigger the failure, the node will not rejoin the cluster.
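To confirm from the surviving node whether the rebooted node has rejoined, you can look at the node status that olsnodes -s reports. The filter below is a minimal sketch that assumes the usual two-column output (node name, then Active or Inactive); the function name inactive_nodes is made up for this lab.

```shell
# Read `olsnodes -s` output on stdin and print the names of nodes
# whose status column says Inactive (hypothetical helper).
inactive_nodes() {
    awk '$2 == "Inactive" { print $1 }'
}

# Typical use on a cluster node, as the Grid Infrastructure owner:
#   olsnodes -s | inactive_nodes
#   crsctl check cluster -all    # per-node status of the CRS, CSS and EVM daemons
```

An empty result means every node is reported as Active.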
- Which file has the error messages that indicate why the node is not rejoining the cluster?
- Is the node that reboots always the same as the node with the failure? Why or why not?