Problem: The VIP goes offline intermittently, which can even crash other RAC services
Solution: One possible cause of this problem is a change in the network configuration. RAC uses the default gateway as its ping target during the VIP status / health check.
The default gateway is normally the first address in the network or VLAN. For example:
Start network address is 220.127.116.11
VIP address is 18.104.22.168
Default gateway would normally be 22.214.171.124 (unless explicitly changed)
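The "first address in the network" convention can be checked with a few lines of Python. This is only a minimal sketch: the function name first_host is made up for illustration, and the 192.0.2.0/24 network in the usage note is a documentation range chosen as an assumption, since the netmask of the network above is not given here.

```python
import ipaddress


def first_host(network_cidr: str) -> str:
    """Return the first usable address of a network, the conventional gateway.

    strict=False lets the caller pass any address inside the network
    (for example a VIP with its prefix length), not only the network address.
    """
    net = ipaddress.ip_network(network_cidr, strict=False)
    return str(net.network_address + 1)
```

For instance, first_host("192.0.2.57/24") gives "192.0.2.1". If the gateway configured on the nodes differs from this value, that is not an error by itself, but it is worth confirming with the network team that the address is still valid.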
If the VIP check does not receive a ping response from the default gateway, RAC treats this as a network interface failure and tries to fail over the VIP to another available node in the cluster.
Normally, if the default gateway is not reachable from one node in RAC, it is highly unlikely to be accessible from the other node(s), since all of them are in the same network / VLAN. So failover to the other nodes will also fail.
This has a cascading effect on all applications that depend on the VIP.
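To confirm whether a node can actually reach the default gateway, a manual ping test from each node is a reasonable first step. The sketch below is not how RAC performs its own check; it only illustrates the idea. The function names are hypothetical, and the ping flags (-c, -W) assume a Linux-style ping, so they may need adjusting on other platforms.

```python
import subprocess
from typing import Callable


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request; True if the ping command reports success.

    Assumes a Linux-style ping (-c count, -W timeout in seconds).
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The ping binary is missing or could not be started.
        return False
    return result.returncode == 0


def gateway_reachable(
    gateway: str,
    attempts: int = 3,
    probe: Callable[[str], bool] = ping_once,
) -> bool:
    """Return True as soon as any one of the attempts gets a reply."""
    return any(probe(gateway) for _ in range(attempts))
```

Run it on every node against the configured default gateway. If it returns False on all nodes, the gateway address itself has probably changed or been removed, which matches the case described above where failover cannot help.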
For a 2-node cluster, we should look for errors in
The error log will look similar to the following:
2008-02-12 14:48:50.900: [ RACG] [ora.node2.vip]: 144841 [ 29016 ] Checking interface existance
144841 [ 29016 ] Calling getifbyip
144842 [ 29016 ] getifbyip: started for 126.96.36.199
144842 [ 29016 ] Completed getifbyip
144845 [ 29016 ] Completed with initial interface test
144845 [ 29016 ] Performing
2008-02-12 14:48:50.901: [ RACG] [ora.node2.vip]: CRS_STAT testing
144845 [ 29016 ] Completed CRS_STAT testing
144845 [ 29016 ] Interface tests
144845 [ 29016 ] checkIf: start for if=e1000g0
144845 [ 29016 ] checkIf: -z defaultgw
144845 [ 29016 ] defaultgw: started
144845 [ 29016 ] defaultgw: complete
2008-02-12 14:48:50.901: [ RACG] [ora.node2.vip]: d with 188.8.131.52
144845 [ 29016 ] checkIf: -n defaultgw
144845 [ 29016 ] checkIf: checking if=e1000g0 is UP. (host=node2)
144848 [ 29016 ] checkIf: in while, before sleep
144849 [ 29016 ] checkIf: in while, before sleep
144850 [ 29016 ] checkIf: c
2008-02-12 14:48:50.901: [ RACG] [ora.node2.vip]: hecked if=e1000g0 failed
Interface e1000g0 checked failed (host=node2)
144850 [ 29016 ] checkIf: end for if=e1000g0
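The key line in the excerpt above is "Interface e1000g0 checked failed (host=node2)". When the log is long, it can help to pull out only those lines. The following is a small sketch under the assumption that the failure message always has this exact wording; the function name failed_interfaces is made up for illustration.

```python
import re
from typing import List, Tuple

# Matches lines such as: Interface e1000g0 checked failed (host=node2)
FAILED_IF = re.compile(r"Interface (\S+) checked failed \(host=([^)]+)\)")


def failed_interfaces(log_text: str) -> List[Tuple[str, str]]:
    """Return (interface, host) pairs for every failed interface check."""
    return FAILED_IF.findall(log_text)
```

Reading the VIP log file into a string and passing it to failed_interfaces gives one (interface, host) pair per failed check, so repeated failures on the same interface are easy to spot.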