There ain’t no Best Practices

I got a reply to the previous post about ASM and NFS that contained the dreaded question:

Does oracle/netapp have any best practice on this ?

I’m usually opposed to Best Practices. I get nervous whenever someone declares something a “Best Practice” in the IT world because that’s frequently interpreted as an unbreakable law.

“Best Practices” is an especially bad term to use in reference to an Oracle database. The problem is the variation. I know some huge enterprises running tiny 200GB Oracle databases, and I know customers with databases approaching the 1PB mark. There are lots of filesystem and OS options. I was even dealing with Oracle on OpenVMS a few weeks ago. I’ve seen one customer with an 8-node RAC cluster that only runs one instance at a time. They use it as a sort of load-balancing service. I’m in Munich because of a particular customer with 60,000 databases in production.

So, an Oracle customer could be running one of about 10 different OS’s, the databases vary in size by almost 6 orders of magnitude, there are 4 major options for clustering, and there could be anywhere between 1 and 60,000 of the things to manage. There’s no way you can establish many “Best Practices” that would apply across the board.

When I wrote TR-3633 and all the other database documents on the NetApp web site, I was trying to present what I call ‘design considerations’: the things I want users to think about so they can make up their own minds about what’s right for their specific needs.

Legacy Documentation

A secondary problem with defining a Best Practice for Oracle is that 90% of the documentation out there, including what’s written by Oracle itself, is rooted in ancient and sometimes obsolete practices.

This isn’t a swipe at the database architecture itself. The more you look at it, the more impressed you’ll be by its ability to detect problems and recover from various failure scenarios. Still, it hasn’t changed that much over the years from a storage point of view, which means the so-called “Best Practices” were established based on the hardware and IT infrastructure available at the time. That largely means servers with dumb disks inside.

RAID Levels

The most annoying recommendation permeating Oracle documentation is the nonsense about SAME – Stripe And Mirror Everything. That would have made perfect sense a few decades ago, when most storage was plain hard drives sitting inside a server’s drive bays.

SAME made sense in that world. RAID-4 or RAID-5 would have created a major performance hit because servicing write IO would require some read IO to regenerate parity. Striping would have been the only way to get decent performance.

It has not been the year 1988 for a very long time now. We have these things called “storage arrays” now. Yes, there are actual hardware devices created specifically to host and serve data!

NetApp has never offered RAID-10 for a very good reason. The original RAID-4 implementation combined with the pointer-based WAFL filesystem yielded RAID-4 capacity efficiency with RAID-10 performance. The reason was pretty simple – WAFL avoids the need to perform reads in order to service a write. Writes go out as complete RAID stripes with parity calculated in memory. The same principle applies today with WAFL and RAID-DP and RAID-TEC.

In other words, you don’t lose half your capacity the way you do with RAID-10, and you don’t take the performance hit with overwrites. Furthermore, if you have spinning media, RAID-DP gives you better data protection. You can do the math. You can lose *any* two drives in a RAID-DP array without losing data. With RAID-10, there are many 2-drive failure scenarios which will cause data loss.
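
To put a rough number on that, here’s a back-of-the-envelope sketch for a hypothetical 24-drive pool:

  # 24 drives as RAID-10 = 12 mirror pairs. There are C(24,2) = 276 possible
  # two-drive failure combinations; the 12 that hit both halves of the same
  # pair lose data. The same 24 drives in RAID-DP survive all 276.
  echo "scale=3; 12 / (24 * 23 / 2)" | bc   # => .043, roughly 4% of double failures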

Furthermore, if you mirror and you lose a disk, the IO required to rebuild the mirror puts a lot of load on the one surviving copy of your data. Once the storage system is a few years old, the chance of that surviving drive dropping dead during the rebuild goes up. I assume that’s why Oracle Exadata best practices include triple mirroring, aka ASM High Redundancy.

Nobody who cares about their data would rely on 2-way mirroring to protect spinning media now that drives are the size they are. The rebuild times can take ages. With RAID-DP/RAID-TEC I don’t see any reason to use anything other than external redundancy, with the possible exception of stretched RAC. There are some cases where high redundancy might make sense, such as spreading the CRS/voting resources across 2 production sites and a 3rd tiebreaker site.

Triplicate Control Files

I’ve had customers insist to me that if you lose the controlfiles, the database is useless. That’s wrong. You just recreate them. It’s inconvenient, but certainly doable.
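
As a sketch of how that usually goes (the trace file path here is made up):

  # While the database is healthy, have it dump its own recreation script.
  echo "ALTER DATABASE BACKUP CONTROLFILE TO TRACE AS '/tmp/recreate_cf.sql';" | sqlplus / as sysdba
  # If every controlfile is ever lost: STARTUP NOMOUNT, run the CREATE CONTROLFILE
  # statement from that script, recover if needed, then open the database.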

If you’re located in 1988, multiple controlfiles make more sense. Drives fail from time to time, so it makes sense to protect the controlfiles: a surviving copy greatly reduces the time required to recover from tape, which back then would have been a much more routine requirement.

I know controlfiles are tiny, but there’s still way too much time spent worrying about distributing 3 controlfiles across multiple filesystems, and sometimes it actually interferes with recoverability because you’re now dealing with snapshots of files from different points in time. If the controlfiles are all on the same filesystem you can more easily perform recovery and cloning tasks because snapshot-based backups of those files will be perfectly synchronized.
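
If you want to check where yours live, or consolidate them, it’s quick. This is only a sketch: the target path is made up, the change is written to the spfile, and it only takes effect after you shut down, physically copy the controlfile to the new location, and restart.

  # Where do the controlfiles live today?
  echo 'SELECT name FROM v$controlfile;' | sqlplus -s / as sysdba

  # Hypothetical: consolidate down to a single controlfile on the log filesystem.
  echo "ALTER SYSTEM SET control_files = '/oralog/PROD/control01.ctl' SCOPE=SPFILE;" | sqlplus / as sysdba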

Mirrored Redo Logs

I don’t really understand this either. What is the problem we’re trying to solve with mirrored redo logs? If your data center is in 1988 you certainly need to do this to be ready for inevitable drive failures, but why now?

One rationale I’ve heard is

“If you only have one copy of your redo logs and it’s lost, you’ve lost data”.

That’s technically true, but so is this:

“If you only have one copy of your database and someone does an rm -rf /, you’ve lost data”

There are a lot of things that can go wrong if someone is deleting files off your servers. Why single out the redo logs?

I’d guess about half of the customers I know don’t bother mirroring the redo logs. The other half didn’t make a conscious decision to mirror; they’re just doing what they’ve always done.
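
If you’re not sure which camp your databases fall into, it’s a one-line check (just a sketch; -s only suppresses the sqlplus banner):

  # One row per redo log member: a count above 1 means the group is
  # mirrored (multiplexed) by the database itself.
  echo 'SELECT group#, COUNT(*) AS members FROM v$logfile GROUP BY group# ORDER BY group#;' | sqlplus -s / as sysdba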

The only exceptions I can think of would be some of the huge banking customers. I’ve seen some databases where they placed the datafiles plus one set of redo logs on one fault domain (aggregate) and placed the archive logs and the second set of redo logs on a different aggregate. As long as there’s a backup strategy for the database, you’d be protected even against complete aggregate failure. Obviously the database is going to crash if the datafiles go missing, and it’s going to hang if the archive log location is gone, but there will not be data loss.

Sparse Files

I’m not sure why Oracle doesn’t like sparse files. I suspect it’s a legacy thing. Sparse file behavior used to be a little more variable with different OS’s and filesystems, and Oracle databases used to get really angry if they abruptly ran out of space during datafile extension. It seems to be okay these days.

Doing a dd of a bunch of zeros to a file that will be used as an ASM device shouldn’t really be necessary, but these files aren’t usually large enough for that to be a problem. A large-block dd of zeros is pretty quick.
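
If you do want to zero-fill, it’s a one-liner. The path and the 10GiB size below are made up:

  # Hypothetical 10GiB file to hand to ASM as a disk; 1MiB blocks keep it quick.
  dd if=/dev/zero of=/oradata/asmdisks/disk01 bs=1M count=10240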

The 2-volume layout

There’s no universal best practice, but any time I’m setting something up I’m aiming for simplicity and manageability. That means the 2-volume approach:

  1. One volume containing the datafiles and only the datafiles. Nothing but datafiles, for one and only one database.
  2. One volume containing the archive logs, redo logs, and controlfiles. Don’t mirror the redo logs, and use only one controlfile unless there’s a real benefit (a parameter sketch follows below).
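
Here’s what that layout might look like in terms of init parameters. This is purely a sketch: the mount points (/oradata and /oralog), the PROD directory names, and the use of Oracle Managed Files are all assumptions, and the control_files change only takes effect after a shutdown, a physical copy of the file, and a restart.

  # Hypothetical mount points: /oradata = datafile volume, /oralog = log volume.
  # The directories must already exist.
  echo "
  ALTER SYSTEM SET db_create_file_dest = '/oradata/PROD' SCOPE=SPFILE;
  ALTER SYSTEM SET db_create_online_log_dest_1 = '/oralog/PROD/redo' SCOPE=SPFILE;
  ALTER SYSTEM SET log_archive_dest_1 = 'LOCATION=/oralog/PROD/arch' SCOPE=SPFILE;
  ALTER SYSTEM SET control_files = '/oralog/PROD/control01.ctl' SCOPE=SPFILE;
  " | sqlplus / as sysdba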

The benefits of this approach are as follows:

  1. If I need to restore the database, I can revert the state of the datafiles with volume snaprestore *without* damaging or losing other files.
  2. Any time I snapshot the log volume, I capture a perfectly synchronized set of redo, archive, and controlfile data. That means I can easily use automatic recovery.
  3. Since I can use automatic recovery, I don’t need to use hot backup mode. All I need to recover or clone is a snapshot of the datafile volume that is earlier in time than the state of the log volume. That’s a recoverable backup. I can schedule snapshots directly on the storage system. No need for any external software.
  4. If I want to mirror, I can schedule a more rapid update of the log volume replica and improve the RPO. I can schedule the mirror updates directly on the storage system. No need for any external software.
  5. If the protocol is NFS, just create a couple volumes. You can resize as needed.
  6. If the protocol is FCP/FCoE/iSCSI, place the datafiles on a volume group consisting of 4 to 8 LUNs. Place the combined log data on a volume group of 2 to 4 LUNs. That gives you the ability to grow the filesystem in smaller increments of equal size and enough parallelism through the SCSI layer to deliver good performance (see the LVM sketch just below).
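
On Linux with LVM, that might look something like the sketch below. The multipath device names, LUN counts, stripe size, and choice of XFS are all assumptions; adjust to taste.

  # Hypothetical multipath device names: 8 LUNs for data, 4 for logs.
  pvcreate /dev/mapper/data_lun{1..8} /dev/mapper/log_lun{1..4}
  vgcreate oradata_vg /dev/mapper/data_lun{1..8}
  vgcreate oralog_vg /dev/mapper/log_lun{1..4}

  # Stripe each logical volume across every LUN in its volume group.
  lvcreate -i 8 -I 1m -l 100%FREE -n oradata_lv oradata_vg
  lvcreate -i 4 -I 1m -l 100%FREE -n oralog_lv oralog_vg

  mkfs.xfs /dev/oradata_vg/oradata_lv
  mkfs.xfs /dev/oralog_vg/oralog_lv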
