I’ve been recruited.
A team working with a large service provider is developing a scalable approach to performance management leveraging ONTAP quality-of-service (QoS). They asked for thoughts.
Friends don’t let friends use FlexShare
There was a primitive QoS capability on the older ONTAP 7 systems called FlexShare. It worked well enough, but it was never designed as a scalable, policy-based system. I used to tell customers never to touch FlexShare unless they already had a problem that was clearly identified and understood. In those cases it was great, but configuring it proactively frequently caused more problems than it solved. Among other things, admins tended to forget they were tinkering with FlexShare. Twelve months down the road, database performance tanked because it was butting up against the FlexShare limits they themselves had applied!
FlexShare QoS was completely replaced a couple of years ago. Today, ONTAP 9 allows both IOPS and bandwidth QoS controls on a “storage object,” which can be an SVM, a volume, or a file. It’s flexible and granular.
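As a sketch of how that looks in practice (the SVM, policy group, and volume names below are hypothetical, and exact syntax can vary by ONTAP release), a policy group carrying a limit is created once and then attached to the storage object:

```
# Create a policy group with an IOPS ceiling (all names are examples)
qos policy-group create -policy-group db_qos -vserver svm1 -max-throughput 10000iops

# Attach it to the volume holding the database
volume modify -vserver svm1 -volume db01_data -qos-policy-group db_qos

# Observe per-workload performance
qos statistics performance show
```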
All-SSD requires QoS
One notable side effect of the increasing adoption of all-SSD arrays is consolidation. Go back 10 years, and one reasonably active database could consume all available IOPS because it would slam into the latency limits of spinning media. Remove the spinning media and you remove the latency limitations. That means you can use the full capabilities of a given storage array.
The average database I see peaks around 7K IOPS. Sure, on occasion I see something like this…
Load Profile                 Per Second
Redo size:                      3,809.6
Physical reads:               224,184.9
Physical writes:                    2.6
That’s an actual database with 225K IOPS. If you want 225K IOPS, no problem; I’m just saying very few customers need that. Most of the databases out there aren’t high-performance processing engines; they just store information reliably so applications can access it and users can view it.
Assume you own a storage system that can handle 400K low-latency, predictable IOPS and you’re dealing with databases that don’t generally breach the 10K IOPS level. You can host at least 40 of them. You will, however, have 40 databases competing for the same storage resources. They may try to grab all 400K IOPS at times. Maybe that’s acceptable, but maybe not.
The main problem is that it damages the predictability of the system, which is one of the top reasons to use all-flash in the first place. You need QoS to preserve predictable behavior.
QoS requires All-SSD
QoS would be really difficult to implement without getting rid of spinning media. The problem with spinning media is its lack of predictability. For the most part, you can load up an SSD with all the IO you want and it will continue to service that IO in a consistent fashion. Yes, it will eventually max out, but that’s an edge case.
In contrast, spinning media behavior varies a lot. Performance management therefore gets tough because you have to consider the nature of the IO, data layout, drive head movement, and all sorts of interesting features that may or may not be present in a given drive or firmware version. You’d need a team of AI engineers to get a truly reliable, scalable approach to QoS using spinning media.
Guaranteed QoS
It would be nice to have a guaranteed QoS capability, but it’s not as useful as it seems.
First, think of the consequences. If you have 10 databases with a “guaranteed” or “committed” or “reserved” 25K IOPS then you have to be prepared for all 10 of them running at full speed simultaneously. That’s highly unlikely to happen, and that’s expensive to prepare for.
If it’s absolutely required, address this sort of need via the initial sizing of the solution. For example, an AFF8080 HA pair can deliver about 400K IOPS, so that’s your ceiling. You have 400K guaranteed IOPS to pass out among your workloads. You should also consider setting a maximum on each database to ensure it doesn’t try to consume more than its allotted IOPS.
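A sketch of what such a per-database cap might look like from the ONTAP CLI, assuming hypothetical names and the 25K IOPS allotment from the earlier example:

```
# Give each database its own policy group so each cap is independent
qos policy-group create -policy-group db01_cap -vserver svm1 -max-throughput 25000iops
volume modify -vserver svm1 -volume db01_data -qos-policy-group db01_cap
```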
On the whole, it would be better to leave off the cap. That means:
- A given database could burst to much higher IOPS levels when they are available.
- On the whole, multiple databases should share the available IOPS politely. There would be “contention,” meaning some variation in performance, but each database should still receive its full allotment most of the time.
In some cases, that wouldn’t be desirable and a cap would need to be enforced. For example, in a paying multitenant environment, the end customers ought to be paying for the performance they receive, so applying QoS controls to limit them to what they have paid for makes sense. In the case of a purely internal storage system, that would be silly. Why not use the whole performance potential you’ve paid for? Sure, there will be variation, but artificially capping performance just because you can is counterproductive.
QoS: A practical example
Here’s the first practical QoS example I ran into. A certain customer had a query problem. On occasion, one of their end users was running a query that ran really, REALLY efficiently. It was hammering the storage system and eating all the bandwidth. The user in question wasn’t in any particular hurry for the query to complete, but he had no way to make it go slower.
As a guess, I suggested slapping an MB/sec QoS control on the datafiles in question to impose rate limiting. They limited the MB/sec to something like half the current level. The query then ran about 2X slower, but the other end users weren’t complaining any longer.
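Reproducing that fix today would look something like this (hypothetical names; the 200MB/s figure stands in for “half the observed bandwidth”):

```
# Rate-limit the volume holding the offending datafiles
qos policy-group create -policy-group slow_query -vserver svm1 -max-throughput 200MB/s
volume modify -vserver svm1 -volume hot_data -qos-policy-group slow_query
```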
QoS and a database
Here are my initial thoughts on QoS. If anyone has other thoughts, please comment and I’ll incorporate them.
First, we need to isolate the storage resources into three groups:
- Datafiles
- Redo logs
- Archive logs
The reason is they will usually need different QoS controls to accommodate the different IO patterns of those file types.
The problem with nailing down the right QoS policy is understanding exactly what workload exists. Most DBAs would look at this:
Load Profile                 Per Second
Redo size:                  4,920,766.5
Physical reads:               167,983.2
Physical writes:                1,945.0
and then conclude this is 168K IOPS. This is wrong.
This database was reading about 168K blocks per second, but that almost never translates into 168K IOPS. In this particular database, about 25% of the blocks were read as single-block reads. The other 75% were read as multiblock reads. The actual IOPS level is more like 40K. Interpreting database physical reads as “IOPS” will result in a lot of wasted infrastructure.
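Here is a rough back-of-the-envelope version of that conversion in shell. The 25% single-block share comes from the database above; the figure of 128 blocks per multiblock read is an assumption for illustration (the real value depends on db_file_multiblock_read_count and the workload):

```shell
# Convert AWR "physical reads" (blocks/sec) into an IOPS estimate
blocks_per_sec=168000
single_pct=25     # share of blocks arriving as single-block reads (from the AWR data)
mbrc=128          # assumed blocks per multiblock read (hypothetical)

single_iops=$(( blocks_per_sec * single_pct / 100 ))
multi_iops=$(( blocks_per_sec * (100 - single_pct) / 100 / mbrc ))
echo "Estimated IOPS: $(( single_iops + multi_iops ))"
# prints "Estimated IOPS: 42984" -- in the same ballpark as the ~40K observed
```

The multiblock reads barely register: 126K blocks/sec collapses into fewer than 1K IOPS, which is why counting blocks as IOPS overstates the requirement so badly.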
The current project involves a service provider, which means we need to operate at scale. We need standards.
A typical set of datafiles would need between 5K and 30K IOPS. I’d probably then build tiers as follows:
- Small = 5K IOPS
- Medium = 15K IOPS
- Large = 30K IOPS
In a LUN environment, we need to distribute those IOPS among the LUNs. We have two options. First, we could drop those LUNs into a volume with the desired IOPS control; the control would then apply to the volume as a container for whatever LUNs it holds. Second, we could divide the IOPS among the LUNs themselves. If a database were “small” and had four ASM LUNs, each LUN would then be assigned a 1250 IOPS QoS limit. I’d prefer the volume approach to reduce the number of things being managed.
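Both options might look like this on the CLI (hypothetical names; the per-LUN figure is simply the tier limit divided by the LUN count, 5000 / 4 = 1250):

```
# Option 1: one policy group on the containing volume; the limit
# covers everything inside the volume
qos policy-group create -policy-group db01_small -vserver svm1 -max-throughput 5000iops
volume modify -vserver svm1 -volume db01_asm -qos-policy-group db01_small

# Option 2: per-LUN limits. A policy-group limit is shared by all of its
# member workloads, so each LUN needs its own group to get 1250 apiece
qos policy-group create -policy-group db01_lun0 -vserver svm1 -max-throughput 1250iops
lun modify -vserver svm1 -path /vol/db01_asm/lun0 -qos-policy-group db01_lun0
```

The extra bookkeeping in option 2 (one policy group per LUN, times every database) is exactly why the volume approach scales better.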
In an NFS environment, we’d apply the QoS control to the volume. Things would get complicated if there were multiple volumes, because IO will not be balanced among them. One volume might reach its IOPS threshold while another is only lightly used. As long as the IOPS needs are below 50K, there should be no problem placing the datafiles in a single volume. Ridiculously high IOPS levels sometimes warrant the use of multiple volumes in order to deliver better balancing across CPUs, but that’s only for the paranoid, where every microsecond of latency counts. We’re talking about scale-out here: lots of lower-IO databases.
There might also need to be an MB/sec QoS control. I have mixed feelings on this. On the whole, it’s the raw IOPS that cause performance problems, not consumed bandwidth. Activities that use a lot of bandwidth are usually administrative operations and late-night reports, where some contention between workloads and variation in performance wouldn’t be a problem.
I lean toward using IOPS QoS alone on datafiles and skipping MB/sec, but if I did add MB/sec, I would associate it with the tiers previously defined, meaning we’d have something like this:
- Small = 5K IOPS and 100MB/sec
- Medium = 15K IOPS and 250MB/sec
- Large = 30K IOPS and 500MB/sec
This is somewhat of a guess. Those bandwidth limits might look high, but databases legitimately need bursts of high bandwidth at times. This is sequential IO, which is not very demanding on a storage system, so it really shouldn’t cause a lot of problems.
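If the tiers above were standardized, they could be created once per SVM as policy groups. Newer ONTAP releases accept a combined IOPS and MB/sec ceiling in a single -max-throughput value; all names here are hypothetical:

```
qos policy-group create -policy-group tier_small  -vserver svm1 -max-throughput 5000iops,100MB/s
qos policy-group create -policy-group tier_medium -vserver svm1 -max-throughput 15000iops,250MB/s
qos policy-group create -policy-group tier_large  -vserver svm1 -max-throughput 30000iops,500MB/s
```

Provisioning a database then reduces to attaching the right tier to its datafile volume, which is the kind of repeatable standard a service provider needs.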
Redo Log QoS
I would probably avoid redo log QoS for two reasons:
- Redo log IO is usually minuscule compared to other IO
- Databases really REALLY don’t like any extra latency on redo log IO
If, however, we’re worried about cheating (such as a rogue DBA dropping an index file on a volume that is supposed to be used for redo logging) or about an unexpected bully workload, we could consider something like this:
- Small = 1000 IOPS
- Medium = 2000 IOPS
- Large = 5000 IOPS
What about bandwidth? Most databases hosting user-driven applications that I see run around 5MB/sec of redo logging. High IO would be 50MB/sec or so.
On occasion, we see something like this late at night:
Load Profile                 Per Second
Redo size:                193,637,827.3
Physical reads:               110,275.5
Physical writes:               45,776.2
193MB/sec of redo logging is enormous. We’re talking major-bank, end-of-quarter processing. The point is that there is a lot of variation. Setting a QoS control on a database might work fine 99% of the time, but what if it causes huge problems the other 1% of the time?
If we really wanted to apply both an IOPS and a bandwidth control, and were prepared for some tuning, something like this seems reasonable:
- Small = 1000 IOPS and 10MB/sec
- Medium = 2000 IOPS and 20MB/sec
- Large = 5000 IOPS and 50MB/sec
There is a significant risk in adding an MB/sec QoS control: archive logging. If the MB/sec limit is too strict, it could interfere with copying closed redo logs to the archive log destination. Placing an IOPS control on redo logging should merely slow down the redo logging process at high IO levels, but if an MB/sec control causes the archiving process to fall so far behind that no redo logs are available for reuse, the whole database will come to a standstill until a redo log becomes available again. For that reason, use MB/sec QoS on redo logs with care.
Archive Log QoS
Archive log destinations tend to be large. Sometimes they are larger than the rest of the database combined. This creates a risk of “cheating”, such as placing datafiles in an archive log destination.
Placing a strict QoS control on an archive log destination should be harmless and would discourage cheating. Normally the archive log destination is subject to negligible IOPS. Yes, IOs happen, but they should be huge-block sequential writes as redo data is copied in. A strict IOPS limit such as 500 IOPS should ensure no datafiles appear on the archive log destination, while a generous MB/sec limit would ensure archive logging proceeds without disruption. It should at least match the redo logging MB/sec QoS limit.
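Combining those two ideas in one policy group might look like this (hypothetical names; the 500MB/s figure assumes it matches or exceeds the redo MB/sec limit):

```
# Strict IOPS cap to discourage datafile "cheating", generous MB/s so
# archiving itself is never throttled
qos policy-group create -policy-group arch_qos -vserver svm1 -max-throughput 500iops,500MB/s
volume modify -vserver svm1 -volume db01_arch -qos-policy-group arch_qos
```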
That’s my first batch of thoughts. Questions? Objections?