VIP goes offline intermittently

Problem: The VIP goes offline intermittently, which can even crash other RAC services.

Solution: One possible cause of this problem is a change in the network configuration. RAC uses the default gateway as its ping target during the VIP status / health check.

The default gateway is normally the first address in the network or VLAN.

For example:

Start network address is 145.46.43.0
VIP address is 145.46.43.15
Default gateway would normally be 145.46.43.1 (unless explicitly changed)
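
The "first address" rule in the example above can be checked with a few lines of Python. This is only a sketch: the /24 prefix length is an assumption (the post gives just the start address), and the function name is made up for illustration.

```python
import ipaddress

def first_host_address(network_cidr):
    """Return the first address after the network address.

    This is the conventional default gateway, unless it was explicitly changed.
    """
    network = ipaddress.ip_network(network_cidr)
    return network.network_address + 1

# Assumption: the example network is a /24.
example_network = "145.46.43.0/24"
gateway = first_host_address(example_network)
vip = ipaddress.ip_address("145.46.43.15")

print("expected default gateway:", gateway)
print("VIP is inside the network:", vip in ipaddress.ip_network(example_network))
```

If the real gateway on your nodes differs from this value, it was set explicitly and that is the address RAC will ping.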

If the VIP does not receive a ping response from the default gateway, RAC treats this as a network interface failure and tries to fail the VIP over to another available node in the cluster.

Normally, if the default gateway is not accessible from one node in the RAC cluster, it is highly unlikely to be accessible from the other node(s), since they are all in the same network / VLAN. So the failover to the other nodes will also fail.

This has a cascading effect on all applications that depend on the VIP.
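
To see whether the default gateway answers from a given node, a small manual check along these lines can be run on each node. It is a rough sketch and not what RAC runs internally: the function names are hypothetical, and the `ping -c` flag is the Linux form (Solaris `ping` takes different arguments, so adjust the command for your platform).

```python
import subprocess

def build_ping_command(gateway, count=1):
    # Assumption: Linux-style ping flags; change this list for other platforms.
    return ["ping", "-c", str(count), gateway]

def gateway_reachable(gateway, runner=subprocess.run):
    """Return True if a single ping to the gateway succeeds (exit code 0)."""
    result = runner(
        build_ping_command(gateway),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

For the example network, calling `gateway_reachable("145.46.43.1")` on node1 and node2 shows whether both nodes lose the gateway at the same time, which is the case where VIP failover cannot help.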

For a 2-node cluster, we should look for errors in:

$ORA_CRS_HOME/log/node1/racg/ora.node1.vip.log
$ORA_CRS_HOME/log/node2/racg/ora.node2.vip.log

The error log will look similar to the following:

2008-02-12 14:48:50.900: [ RACG][1] [28769][1][ora.node2.vip]: 144841 [ 29016 ] Checking interface existance
144841 [ 29016 ] Calling getifbyip
144842 [ 29016 ] getifbyip: started for 145.46.43.15
144842 [ 29016 ] Completed getifbyip
144845 [ 29016 ] Completed with initial interface test
144845 [ 29016 ] Performing
2008-02-12 14:48:50.901: [ RACG][1] [28769][1][ora.node2.vip]: CRS_STAT testing
144845 [ 29016 ] Completed CRS_STAT testing
144845 [ 29016 ] Interface tests
144845 [ 29016 ] checkIf: start for if=e1000g0
144845 [ 29016 ] checkIf: -z defaultgw
144845 [ 29016 ] defaultgw: started
144845 [ 29016 ] defaultgw: complete
2008-02-12 14:48:50.901: [ RACG][1] [28769][1][ora.node2.vip]: d with 145.46.43.1
144845 [ 29016 ] checkIf: -n defaultgw
144845 [ 29016 ] checkIf: checking if=e1000g0 is UP. (host=node2)
144848 [ 29016 ] checkIf: in while, before sleep
144849 [ 29016 ] checkIf: in while, before sleep
144850 [ 29016 ] checkIf: c
2008-02-12 14:48:50.901: [ RACG][1] [28769][1][ora.node2.vip]: hecked if=e1000g0 failed
Interface e1000g0 checked failed (host=node2)
144850 [ 29016 ] checkIf: end for if=e1000g0
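
Since these logs are long and the lines are interleaved, it can help to pull out just the failure lines. The sketch below looks for the "Interface ... checked failed (host=...)" message seen in the excerpt above; the function name is hypothetical, and the pattern is based only on this one sample, so other versions may word the message differently.

```python
import re

# Matches lines like: Interface e1000g0 checked failed (host=node2)
FAILED_CHECK = re.compile(r"Interface (\S+) checked failed \(host=([^)]+)\)")

def find_failed_interface_checks(lines):
    """Return a list of (interface, host) pairs for each failed check found."""
    failures = []
    for line in lines:
        match = FAILED_CHECK.search(line)
        if match:
            failures.append((match.group(1), match.group(2)))
    return failures

sample = [
    "144850 [ 29016 ] checkIf: c",
    "Interface e1000g0 checked failed (host=node2)",
    "144850 [ 29016 ] checkIf: end for if=e1000g0",
]
print(find_failed_interface_checks(sample))
```

To use it on a real log, open the VIP log file for the node (for example the ora.node2.vip.log path listed earlier, with $ORA_CRS_HOME expanded) and pass the file object to the function.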
