Anybody out there modifying the Oracle RAC parameter called misscount?
I wrote the Oracle on ONTAP best practice guide, also known as TR-3633. My recommendation in TR-3633 about changing the value of misscount is getting controversial again, and I’m getting complaints from Oracle support relayed through customers. This parameter normally has nothing to do with storage IO, but we’ve seen an increasing number of customers with stability problems when running RAC under virtualization or running RAC with the binaries on networked storage.
What seems to happen is that, under some circumstances, the Oracle RAC processes block on some kind of IO against binaries, libraries, or maybe a log file, and that stall interferes with the network heartbeat. If the pause lasts more than “misscount” seconds, then as soon as the processes unblock the node is evicted and rebooted. Increasing the value of misscount lengthens the allowable pause in the network heartbeat, which lets the cluster survive the storage disruption.
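To make the timing argument concrete, here is a toy model of the eviction decision. It is only a sketch of the idea described above, not how Oracle Clusterware is actually implemented. The function name, the node names, and the timestamps are made up for illustration, and the value of 30 seconds is an assumed default used only for comparison.

```python
def evicted_nodes(last_heartbeat, now, misscount):
    """Return the nodes whose last network heartbeat is more than
    `misscount` seconds old at time `now` (all times in seconds)."""
    return sorted(
        node for node, seen in last_heartbeat.items()
        if now - seen > misscount
    )


# Hypothetical cluster state: node2 stalled on IO and has been silent for 45 s.
last_seen = {"node1": 99.0, "node2": 55.0}

# With an assumed misscount of 30, the 45 s pause exceeds the limit.
print(evicted_nodes(last_seen, now=100.0, misscount=30))

# With misscount raised to 200, the same pause is tolerated.
print(evicted_nodes(last_seen, now=100.0, misscount=200))
```

The same 45-second stall is fatal at the lower setting and harmless at the higher one, which is the whole argument for raising the value when the stall itself can't be eliminated.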
Oracle customer support usually starts by insisting that changing the value of misscount is totally unsupported. I see no documentation to support that statement. The parameter is publicly documented, and there are exceptions to the default value referenced on the Oracle support site.
Every time this comes up, I ask the customer to get their Oracle support contact on the phone so we can talk it through. They usually back off the “unsupported” comment and say instead that Oracle RAC binaries should never be on networked storage. That would be nice, but it precludes virtualization of RAC, and some customers like SAN booting. When we raise that, they’ve always conceded that, in this case, increasing the value of misscount is warranted.
If someone has a better explanation or fix for this problem, I’d like to hear it. All I know is that increasing misscount to something like 200 provably resolves the problem.
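For anyone who wants to check what they are running with, here is a small sketch that builds the crsctl command lines for reading and changing the value. `crsctl get css misscount` and `crsctl set css misscount` are the standard Clusterware commands; the Python wrapper, its function names, and the choice to only print the commands are my own assumptions for illustration. Changing the value requires a privileged user, and it is worth clearing with Oracle support first.

```python
import shutil
import subprocess


def misscount_commands(new_value):
    """Build the crsctl command lines that read and then set CSS misscount.

    Nothing is executed here; the caller decides whether to run them.
    """
    if not isinstance(new_value, int) or new_value <= 0:
        raise ValueError("misscount must be a positive number of seconds")
    return [
        ["crsctl", "get", "css", "misscount"],
        ["crsctl", "set", "css", "misscount", str(new_value)],
    ]


def current_misscount_output():
    """Run the read-only 'get' command if crsctl is on PATH, else return None."""
    if shutil.which("crsctl") is None:
        return None  # not on a cluster node, or Grid home not in PATH
    result = subprocess.run(
        ["crsctl", "get", "css", "misscount"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Print the commands for review rather than running the 'set' blindly.
    for cmd in misscount_commands(200):
        print(" ".join(cmd))
```

Printing the commands rather than executing the set step is deliberate: this is a cluster-wide change, so it should be a conscious manual action.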