Storage Unlimited, Hypothetical
large format drive-shelf units + IOMs, patiently waiting for their SFF-8643 SAS3 cabling


Sometimes hypothetical scenarios and questions of intrigue require delving into the realm of imagination, then coming back to the cold, harsh reality of the physical world. Let's focus on a recent Hypothetical Storage Architecture Question: "If you had an unlimited budget, how much data would you store?"

Adjusted Query: how much storage space would the user deploy to an existing multi-Petabyte array, with consideration for the physical limitations of systems design, while prioritizing one of two common architectural decision targets: More or Faster?

(1) maximizing raw-block capacity
(2) maximizing I/O performance

TL;DR

Between 4-5PB is the simplest answer, but those are just numbers. It could be 10x in either direction, but that's not realistic, it's conjecture. The full picture requires insight into the physical limits of the transfer times, storage space, rack space, and rotational speeds involved; the spacetime could be analyzed by the right minds if this were JPL ^_^ .. instead it's limited to hardware I've acquired over a small handful of years.

Hypothetical or Reality

Let's say that I contacted the datacenter and ordered additional power circuits to light up additional drive shelves, which are already installed (minus the drive trays, which are sitting in a massive box on a parts shelf in my home office). Ok, reality checks out there.

The circuit drops would go to the central storage rack, which is currently at 84% of its power limit. I know this because power is monitored and validated via a combination of Check_MK + Prometheus + SNMP exporters, which poll power stats remotely from the cabinet's APC ATS and the rack's per-port metered and switched outlet PDUs; and then, if I really want to, I can look at the DC's circuit stats (via MRTG) which are tracked for the cabinet. Reality checks out there decently enough. Moving on...


The Hypothetical

First we'll cover some raw numbers and theoretical maximums by looking at two high-level approaches required for "More or Faster".

The astute reader may wonder why we're not discussing >100TB QLC NVMe drives or ultra-dense E2 ruler-style 2U chassis. The answer is that those calculations follow a generally similar equation workflow (eg: how many of X can we fit into Y at A, B, C performance/size/latency), but those hardware types are not currently racked in my real-world systems at an interesting scale. Migrating to denser NVMe arrangements will likely be part of an eventual down-scaling forklift operation, whenever the SAS3 arrays are no longer useful for my personal labs or colo bills.

Production is another story, where large format SAS3 is still relevant, still being deployed, and still improving over time; in the last several years we've seen drive capacity double, along with some innovative designs. It's not often considered exciting, but it lasts A LONG TIME, and the NSA loves archiving the internet on this medium over in their Utah storage complex (overheard, surely).

So, back to the math...

Decision: Optimize for Faster (prioritize IOPS + bandwidth)

Aggregate per-rack w/ 10x shelves of raw 4K block = 3,824 TB ≈ 3.82 PB

Shelves       SAS3 Drives   Brand / Model                 Total
8x DE3-24P    192           Seagate EXOS Mach.2 @ 18TB    3,456 TB
2x DE3-24C    48            Oracle SSD @ 7.68TB/drive     368 TB
                                                          == 3,824 TB

Decision: Optimize for RAW Storage Capacity

How about the same hardware focused on max-total raw space, with less regard for per-drive IOPS (perhaps we offset the per-drive performance loss via larger ARC and L2ARC caches on the head-nodes, which becomes a loaded tangent)... then the rack could be upgraded to host 4.98PB @ $800/m MRC ($400/m cabinet + $400/m amperage)**.

We'll also save the conversation about concessions made at contract-signing time, where I might / maybe / surely have saved on recurring costs by locking in a multi-year, multi-rack contract. Sales team, thanks!

What is this MRC, MRR?

MRC in this situation is the 'Monthly Recurring Cost' for power + rack. This value does not include the cost of acquired hardware, which is variable on supply-chain sourcing and other factors. I'm also not going to describe potential MRC or MRR offsets which factor into profit margin potential and/or influence EBITDA.

Ok, what is MRR and EBID..? Fine, at least with MRR it's 'Monthly Recurring Revenue', and one can read all about EBITDA elsewhere.

Shelves       SAS3 Drives   Brand / Model                 Total
8x DE3-24P    192           Seagate EXOS X30 @ 24TB       4,608 TB
2x DE3-24C    48            Oracle SSD @ 7.68TB/drive     368 TB
                                                          == 4,976 TB
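
As a sanity check on the two capacity tables above, here's a minimal arithmetic sketch in Python (names are illustrative, not from any vendor tooling) that reproduces the raw totals; note the tables round the SSD tier down from 368.64 TB to 368 TB.

# Raw-capacity sketch for the two drive mixes above (decimal TB).
def raw_capacity_tb(shelves: int, drives_per_shelf: int, tb_per_drive: float) -> float:
    """Raw block capacity in TB for one tier of shelves."""
    return shelves * drives_per_shelf * tb_per_drive

ssd_tier = raw_capacity_tb(2, 24, 7.68)            # 368.64 TB (tables round to 368)

faster = raw_capacity_tb(8, 24, 18) + ssd_tier     # 8x DE3-24P of 18TB Mach.2
larger = raw_capacity_tb(8, 24, 24) + ssd_tier     # 8x DE3-24P of 24TB HDD

print(f"faster mix: {faster:,.1f} TB  (~{faster / 1000:.2f} PB)")   # ~3,824.6 TB
print(f"larger mix: {larger:,.1f} TB  (~{larger / 1000:.2f} PB)")   # ~4,976.6 TB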

What about Flash-Cache or Optane or Non-Volatile blah blah blah

Sure, those are considerations, where applicable. A few footnotes...

  • It's prudent to deploy storage controllers with SLC NVMe or PMem/NVDIMM persistent cache (often a baseline performance requirement). These are sized anywhere from a cumulative 1-12TB per controller, depending on available PCIe lanes and/or NVDIMM slots.
  • It's a good idea to make scoping decisions like that ahead of time, instead of later realizing that the controllers' lanes are maxed out or all DIMM slots are used (or that you bought 128GB NVDIMMs instead of 512GB, thinking "8x 128GB should be fine for PMem"), or the other fun one: ignoring (or being unaware of) the memory-speed reduction from running 4-DPC in a system that only hits max performance at 2-DPC.
  • Those determinations may be workload dependent; in my personal life they sometimes qualify as the "because I wanted to" type of maximalist methodology, or the "this is relevant for future-proofing my own rationalization of fun hardware" type, or they could be considered a baseline in pre-sales because "they always want more performance eventually; just bake it in from the start". It varies, but remember where you started and where any future requirements for performance headroom exist.

Where's the Math?

Validation on spec baselines can be tedious, but anyway... the DE3-24C and DE3-24P have very shiny and well-made dual-redundant IOMs, each with 4x SAS3 SFF-8643 connectors.

Specs for our SAS3 Fabric with DE3-24 Shelves

  • One SAS-3 lane (PHY): 12 Gbit/s ⇒ 1.5 GB/s (raw line-rate, decimal GB)
  • External SAS x4 port connector: 4 PHY × 12 Gbit/s ⇒ 48 Gbit/s ⇒ 6 GB/s
  • Each DE3-24{C/P} has 2× IOM, with each IOM having 4× SFF-8643 ports
Item                   Value        Notes
SAS-3 lane rate        12 Gbit/s    per PHY
1 PHY raw              1.5 GB/s     12/8
1 x4 port raw          6 GB/s       4 PHY × 1.5
Shelf fabric cap       12N GB/s     dual-IOM
10-shelf fabric cap    120N GB/s    10 shelves

N active x4 ports per IOM    Per-shelf fabric cap (GB/s)    10-shelf fabric cap (GB/s)
1                            12                             120
2                            24                             240
3 (N+1 on 4-port IOM)        36                             360
4 (no spare)                 48                             480
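
To make the 12N / 120N arithmetic concrete, here's a small sketch (Python, hypothetical helper name) assuming dual IOMs per shelf and N active x4 ports per IOM at the raw 6 GB/s per-port figure from the table above.

# SAS3 fabric-cap sketch (raw line-rate, decimal GB): dual IOMs per shelf,
# N active x4 ports per IOM, 6 GB/s per x4 port (4 PHY x 1.5 GB/s).
GBPS_PER_X4_PORT = 4 * 12 / 8      # = 6.0 GB/s

def shelf_fabric_cap_gbps(n_active_x4_per_iom: int, ioms_per_shelf: int = 2) -> float:
    """Raw fabric ceiling for one DE3-24 shelf, in GB/s."""
    return ioms_per_shelf * n_active_x4_per_iom * GBPS_PER_X4_PORT

for n in (1, 2, 3, 4):
    per_shelf = shelf_fabric_cap_gbps(n)
    print(f"N={n}: {per_shelf:.0f} GB/s per shelf, {10 * per_shelf:.0f} GB/s across 10 shelves")
# N=3 (N+1 sparing on a 4-port IOM) -> 36 GB/s per shelf, 360 GB/s for the rack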

Drive Specs - Seagate EXOS & Oracle SSD

Drive                                          Metric                     Value
Seagate Exos ST24000NM007H (24TB HDD)          Max sustained OD (MB/s)    285
Seagate Exos Mach.2 ST18000NM0272 (18TB HDD)   Max sustained OD (MB/s)    554
Mach.2 ST18000NM0272                           4K random @ QD16 (IOPS)    304 read / 560 write
Oracle 7.68TB RI SAS-3 SSD                     Seq (MB/s)                 2150 read / 1980 write
Oracle 7.68TB RI SAS-3 SSD                     4K random (IOPS)           450k read / 95k write

Scenario A – Maximum RAW Block Space

Before going into the theoretical weeds, let's look at some per-shelf media limits.

1) Oracle DE3-24P shelf w/ 24TB Spinners: 24 × 285 MB/s = 6.84 GB/s (media-limited)

2) Oracle DE3-24C shelf w/ SSDs:

    • Read: 24 × 2150 MB/s = 51.60 GB/s → capped by fabric at 36 GB/s (N=3)
    • Write: 24 × 1980 MB/s = 47.52 GB/s → capped by fabric at 36 GB/s (N=3)
Tier                        Shelves   Per-shelf effective (GB/s)   Tier aggregate (GB/s)
DE3-24P (24× 24TB HDD)      8         6.84                         54.72
DE3-24C (24× 7.68TB SSD)    2         36.00 (fabric-capped)        72.00
Total (10 shelves)          10                                     126.72
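
The "per-shelf effective" column is just min(media aggregate, fabric cap). A quick sketch of that logic with the Scenario A numbers (Python, illustrative names; N=3 assumed, so 36 GB/s per shelf):

# Scenario A sketch: effective per-shelf BW = min(media aggregate, fabric cap).
FABRIC_CAP_GBPS = 36.0    # dual IOM, N=3 active x4 ports per IOM

def shelf_effective_gbps(drives: int, per_drive_mbps: float,
                         cap_gbps: float = FABRIC_CAP_GBPS) -> float:
    media_gbps = drives * per_drive_mbps / 1000.0
    return min(media_gbps, cap_gbps)

hdd = shelf_effective_gbps(24, 285)     # 6.84 GB/s, media-limited
ssd = shelf_effective_gbps(24, 2150)    # 51.6 GB/s of media, capped to 36 GB/s

print(f"HDD shelf {hdd:.2f} GB/s, SSD shelf {ssd:.2f} GB/s, rack {8 * hdd + 2 * ssd:.2f} GB/s")
# -> HDD shelf 6.84 GB/s, SSD shelf 36.00 GB/s, rack 126.72 GB/s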

Scenario B – Maximum Performance

Hardware has limits, and we're looking at 8× DE3-24P w/ 18TB Mach.2 HDD + 2× DE3-24C w/ 7.68TB SSD (N=3)

Aggregate sequential bandwidth (N=3)

  • Mach.2 18TB HDD shelf: 24 × 554 MB/s = 13.296 GB/s (media-limited; below 36 GB/s fabric cap)
  • SSD shelf: remains fabric-capped at 36 GB/s read and write (N=3)
Tier                        Shelves   Per-shelf effective (GB/s)   Tier aggregate (GB/s)
DE3-24P (24× Mach.2 HDD)    8         13.296                       106.368
DE3-24C (24× 7.68TB SSD)    2         36.000 (fabric-capped)       72.000
Total (10 shelves)          10                                     178.368

Aggregate Random R/W 4K IOPS (N=3)

Fabric-derived 4K IOPS cap per shelf (upper bound):
36 GB/s @ 4KB = approx 8.79M IOPS

  • SSD shelf
    • Read: 24 × 450k = 10.8M IOPS → capped to 8.79M IOPS
    • Write: 24 × 95k = 2.28M IOPS → media-limited (below fabric cap)
  • Mach.2 HDD shelf (4K @ QD16)
    • Read: 24 × 304 = 7,296 IOPS
    • Write: 24 × 560 = 13,440 IOPS
Tier                              Aggregate Read IOPS   Aggregate Write IOPS
8× DE3-24P (Mach.2 HDD shelves)   58,368                107,520
2× DE3-24C (SSD shelves)          17,580,000            4,560,000
Total (10 shelves)                17,638,368            4,667,520
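
Same idea for random I/O: derive the per-shelf IOPS ceiling from the 36 GB/s fabric cap at a 4 KiB transfer size, then take min(media IOPS, fabric cap) per shelf. A sketch with the Scenario B figures (Python, illustrative names; the table above uses the rounded 8.79M cap):

# Scenario B random-I/O sketch: fabric IOPS ceiling at 4 KiB, then
# per-shelf aggregate = min(media IOPS, fabric IOPS cap).
FABRIC_CAP_GBPS = 36.0
FABRIC_IOPS_CAP = FABRIC_CAP_GBPS * 1e9 / 4096      # ~8.79M 4K IOPS per shelf

def shelf_iops(drives: int, per_drive_iops: float) -> float:
    return min(drives * per_drive_iops, FABRIC_IOPS_CAP)

ssd_read, ssd_write = shelf_iops(24, 450_000), shelf_iops(24, 95_000)   # capped / media-limited
hdd_read, hdd_write = shelf_iops(24, 304), shelf_iops(24, 560)          # 7,296 / 13,440

print(f"rack read : {8 * hdd_read + 2 * ssd_read:,.0f} IOPS")    # ~17.6M
print(f"rack write: {8 * hdd_write + 2 * ssd_write:,.0f} IOPS")  # ~4.67M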

Let's See the Deltas!

What's "a delta"? Sometimes it's where a river meets a bay, distributing across a floodplain into many separate channels, eventually giving way to deeper waters and.... no no, not that kind.

Delta as a value in storage is often used as a marker showing the difference between two measurements, which indicates directional change. At least, that's what comes to mind while I sit here having an ice cream sandwich, but it has some alternate terminology which applies irrespective of ice cream hour:

Delta: uppercase Δ, lowercase δ
Uppercase: a change in any changeable quantity
Lowercase: the central difference of a function

~ The discriminant of a polynomial equation or symmetric difference of two sets
~ The determinant of the matrix of coefficients of a set of linear equations

Scenario B vs Scenario A (with N=3)

  • Total sequential BW: 126.72 → 178.368 GB/s (increase +51.648 GB/s)
    • Primary determinant: per-shelf BW 6.84 → 13.296 GB/s (Mach.2 dual-actuator).
  • SSD tier remains fabric-capped at 36 GB/s per SSD shelf under N=3; increasing N (or adding more independent uplinks with additional HBAs) is what changes that cap.
  1. Faster array = less space @ higher IOPS and BW == 178 GB/s @ 3.8PB
  2. Larger array = more space @ lower IOPS and BW == 126 GB/s @ 4.9PB

Equations w/ decimal units: 1 TB = 1000 GB

  • jaja usually we work in 1024 (but I'm tired today, so not right now)
t_1TB(s)      = 1000 / BW_GBps
t_full(s)     = (Capacity_TB * 1000) / BW_GBps
t_full(hours) = t_full(s) / 3600
BW_density    = BW_GBps / Capacity_TB      (GB/s per TB)
BW_density_MB = 1000 * BW_density          (MB/s per TB)
Metric                          Faster array   Larger array
Bandwidth (GB/s)                178            126
Capacity (TB)                   3824           4976
Time per 1 TB (s)               5.62           7.94
Full sweep time (h)             5.97           10.97
BW density (GB/s per TB)        0.04655        0.02532
BW density (MB/s per TB)        46.55          25.32
Density ratio (faster/larger)   1.84×
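
And here are those equations applied to both builds, which is where the sweep times and the 1.84× density ratio above come from (Python, decimal units as stated; function name is illustrative):

# Sweep-time / BW-density sketch using the decimal-unit equations above.
def sweep_stats(bw_gbps: float, capacity_tb: float) -> dict:
    t_1tb_s  = 1000 / bw_gbps                        # seconds to move 1 TB
    t_full_h = capacity_tb * 1000 / bw_gbps / 3600   # hours for a full sweep
    density  = bw_gbps / capacity_tb                 # GB/s per TB
    return {"t_1TB_s": round(t_1tb_s, 2), "t_full_h": round(t_full_h, 2),
            "MBps_per_TB": round(1000 * density, 2)}

faster = sweep_stats(178, 3824)   # {'t_1TB_s': 5.62, 't_full_h': 5.97, 'MBps_per_TB': 46.55}
larger = sweep_stats(126, 4976)   # {'t_1TB_s': 7.94, 't_full_h': 10.97, 'MBps_per_TB': 25.32}

print(faster, larger)
print(f"density ratio: {faster['MBps_per_TB'] / larger['MBps_per_TB']:.2f}x")   # ~1.84x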

So... what's the answer? More, Faster?

Is bigger better? Sometimes. Sometimes you need it faster. Sometimes you want both.