Digital Storage Architectures: HDD, SSD, and NVMe

Magnetic Hard Disk Drives

A hard disk drive stores data as patterns of magnetic polarity on aluminium or glass platters coated with a ferromagnetic material, typically a cobalt alloy. Multiple platters spin on a shared spindle at a fixed angular velocity — most consumer drives spin at 5,400 or 7,200 revolutions per minute, while high-performance enterprise drives reach 10,000 or 15,000 RPM.

Read-write heads float nanometres above each platter surface on an aerodynamic air bearing created by the platter's rotation. The head assembly pivots on a voice-coil actuator, sweeping the heads across the platter radius to access different tracks. Positioning the head over the correct track (seek) and then waiting for the target sector to rotate beneath it (rotational latency) together constitute the mechanical access time, which ranges from a few milliseconds to over 20 milliseconds for random access patterns.
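
The rotational part of the access time follows directly from the spindle speed: on average the target sector is half a revolution away. The short sketch below works this out for the spindle speeds mentioned above. The function names and the 9 ms average seek figure are illustrative assumptions, not values taken from any particular drive.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: half of one revolution, in milliseconds."""
    seconds_per_revolution = 60.0 / rpm
    return 0.5 * seconds_per_revolution * 1000.0


def avg_access_time_ms(rpm: float, avg_seek_ms: float) -> float:
    """Mechanical access time = average seek + average rotational latency."""
    return avg_seek_ms + avg_rotational_latency_ms(rpm)


ASSUMED_AVG_SEEK_MS = 9.0  # hypothetical seek time, for illustration only

for rpm in (5400, 7200, 10000, 15000):
    latency = avg_rotational_latency_ms(rpm)
    total = avg_access_time_ms(rpm, ASSUMED_AVG_SEEK_MS)
    print(f"{rpm:>6} RPM: {latency:.2f} ms rotational, {total:.2f} ms total")
```

Even at 15,000 RPM the rotational term alone is about 2 ms, which is why random access on spinning disks stays in the millisecond range.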

Exposed hard disk drive showing the aluminium platter stack and actuator arm. Source: Wikimedia Commons (CC BY-SA).

Track and Sector Geometry

Each platter surface is divided into concentric tracks and each track into fixed-size sectors. Modern drives use 4,096-byte Advanced Format sectors (standardised by the International Disk Drive Equipment and Materials Association, IDEMA) rather than the legacy 512-byte sectors. The drive firmware translates the host's Logical Block Address (LBA) space to physical cylinder-head-sector (CHS) locations internally, abstracting the geometry from the operating system.
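
For intuition, the classical fixed-geometry mapping between CHS and LBA can be written in a few lines. This is only the textbook formula: real firmware uses zoned geometry and defect remapping tables, so the mapping inside a modern drive is not this simple. The 16-head, 63-sector geometry below is an assumed example.

```python
def chs_to_lba(cylinder: int, head: int, sector: int,
               heads_per_cylinder: int, sectors_per_track: int) -> int:
    """Classical CHS -> LBA mapping. Sectors are numbered from 1."""
    if not (cylinder >= 0 and 0 <= head < heads_per_cylinder
            and 1 <= sector <= sectors_per_track):
        raise ValueError("CHS address outside the given geometry")
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


def lba_to_chs(lba: int, heads_per_cylinder: int, sectors_per_track: int):
    """Inverse mapping: LBA -> (cylinder, head, sector)."""
    cylinder, remainder = divmod(lba, heads_per_cylinder * sectors_per_track)
    head, sector_index = divmod(remainder, sectors_per_track)
    return cylinder, head, sector_index + 1


HEADS, SECTORS = 16, 63  # assumed legacy geometry, for illustration
print(chs_to_lba(1, 0, 1, HEADS, SECTORS))
print(lba_to_chs(1008, HEADS, SECTORS))
```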

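Zone Bit Recording (ZBR) groups tracks into zones and places more sectors on the outer tracks, which are physically longer, so that linear bit density stays roughly constant across the platter instead of falling towards the edge. Because the platter spins at a fixed angular velocity, more sectors pass under the head per revolution in the outer zones, so outer zones deliver higher sequential throughput than inner zones, a characteristic measurable in standard storage benchmarks.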
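
Because the spindle speed is fixed, the sustained rate on a track is simply the bytes it holds multiplied by revolutions per second. A minimal sketch of that arithmetic follows. The sector counts per track are made-up illustrative values, not figures from a real drive's zone table.

```python
def track_throughput_mb_s(sectors_per_track: int, rpm: float,
                          sector_bytes: int = 4096) -> float:
    """Ideal sequential throughput of one track in MB/s (10^6 bytes)."""
    revolutions_per_second = rpm / 60.0
    return sectors_per_track * sector_bytes * revolutions_per_second / 1e6


# Hypothetical zone table: outer zones hold more sectors per track.
ASSUMED_ZONES = {"outer": 500, "middle": 375, "inner": 250}

for name, sectors in ASSUMED_ZONES.items():
    rate = track_throughput_mb_s(sectors, rpm=7200)
    print(f"{name:>6} zone: {rate:.2f} MB/s")
```

With these assumed numbers the outer zone is twice as fast as the inner zone, matching the ratio of sectors per track.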

Shingled Magnetic Recording

Shingled Magnetic Recording (SMR) increases areal density by writing each track so that it partially overlaps the previously written one, like roof shingles. This works because the read head is narrower than the write head and needs only the exposed strip of each track. Because writing to one track partially overwrites its neighbour, an in-place update forces a read-modify-write at the granularity of a zone (typically several hundred megabytes). This makes SMR drives unsuitable for workloads with frequent random writes, but acceptable for sequential archival storage. Host-managed SMR requires explicit zone management from the operating system; drive-managed SMR handles it internally, staging random writes in a conventionally recorded cache region.
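
The cost of a random overwrite can be illustrated with a toy model of a single sequential-write zone. The `SmrZone` class below is hypothetical and greatly simplified: it tracks a write pointer, accepts only appends, and emulates an in-place update by rewriting everything written so far. Real drives and zoned-storage interfaces are considerably more involved.

```python
class SmrZone:
    """Toy model of one sequential-write zone with a write pointer."""

    def __init__(self, size_blocks: int):
        self.size_blocks = size_blocks
        self.blocks = [None] * size_blocks
        self.write_pointer = 0
        self.blocks_physically_written = 0

    def append(self, data):
        """Writes are only allowed at the write pointer."""
        if self.write_pointer >= self.size_blocks:
            raise RuntimeError("zone is full; reset it before writing again")
        self.blocks[self.write_pointer] = data
        self.write_pointer += 1
        self.blocks_physically_written += 1

    def reset(self):
        """Discard the zone contents and rewind the write pointer."""
        self.blocks = [None] * self.size_blocks
        self.write_pointer = 0

    def update_in_place(self, index: int, data):
        """Emulate a random overwrite as read, modify, reset, rewrite."""
        if not 0 <= index < self.write_pointer:
            raise IndexError("block has not been written yet")
        snapshot = self.blocks[:self.write_pointer]  # read the written part
        snapshot[index] = data                       # modify one block
        self.reset()
        for block in snapshot:                       # rewrite sequentially
            self.append(block)


zone = SmrZone(size_blocks=8)
for i in range(8):
    zone.append(f"block-{i}")
zone.update_in_place(2, "new")
print(zone.blocks_physically_written)  # 8 initial writes + 8 rewritten = 16
```

One logical block update costs a full pass over the zone, which is why sequential, append-mostly workloads suit SMR far better than random writes.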

NAND Flash Solid-State Drives

Solid-state drives store data in arrays of floating-gate or charge-trap transistors. The threshold voltage of each cell determines the stored bit pattern. Flash memory is organised hierarchically: cells are grouped into pages (typically 4 KiB to 64 KiB), pages into blocks (typically 256 to 1,024 pages), and blocks into planes and dies.

NAND flash cell types by bits per cell

SLC (Single-Level Cell): 1 bit per cell. Highest endurance (100,000+ program/erase cycles), used in industrial and write-intensive enterprise applications.
MLC (Multi-Level Cell): 2 bits per cell. Moderate endurance (3,000–10,000 cycles), found in high-performance consumer and enterprise SSDs.
TLC (Triple-Level Cell): 3 bits per cell. Lower endurance (roughly 1,000–3,000 cycles for current 3D TLC), used in most consumer SSDs for cost efficiency.
QLC (Quad-Level Cell): 4 bits per cell. Lowest endurance (on the order of 100–1,000 cycles), suited for read-intensive workloads and near-line storage.
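
Two pieces of arithmetic sit behind this list. Storing n bits per cell requires distinguishing 2^n threshold-voltage states, which is why endurance falls as bits per cell rise, and a drive's rated write volume can be roughly estimated from capacity, program/erase cycles, and write amplification. The sketch below uses hypothetical function names, and the capacity, cycle count, and write amplification factor in the example are assumptions chosen for illustration.

```python
BITS_PER_CELL = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}


def voltage_states(bits_per_cell: int) -> int:
    """Number of distinct threshold-voltage levels a cell must hold."""
    return 2 ** bits_per_cell


def estimated_tbw(capacity_gb: float, pe_cycles: int,
                  write_amplification: float) -> float:
    """Rough host terabytes written before rated P/E cycles are used up."""
    return capacity_gb * pe_cycles / write_amplification / 1000.0


for name, bits in BITS_PER_CELL.items():
    print(f"{name}: {bits} bit(s) per cell, {voltage_states(bits)} voltage states")

# Assumed example: 1,000 GB drive, 1,000 P/E cycles, write amplification 2.0.
print(f"{estimated_tbw(1000, 1000, 2.0):.0f} TB written (rough estimate)")
```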

3D NAND Stacking

Two-dimensional planar NAND hit physical limits around 15–16 nm feature size, where interference between adjacent cells became unmanageable. Three-dimensional NAND stacks memory cells vertically into dozens or hundreds of layers. Samsung's V-NAND, Micron's 3D NAND, and Kioxia's BiCS NAND are commercial implementations of this approach. Layer counts have grown from 24 layers in early 3D NAND to over 200 layers in current generations, increasing capacity per die while keeping die area and cost relatively stable.

Flash Constraints: Erase Before Write

NAND flash cannot overwrite data in place. Before a page can be written, its containing block must be erased. An erase operation resets an entire block to the all-ones state. This asymmetry means the flash translation layer (FTL) in the SSD controller must manage a mapping between logical addresses and physical locations, perform garbage collection (consolidating valid pages from partially-invalidated blocks before erasing them), and distribute writes evenly across the array to prevent premature wear of frequently-written blocks — a process called wear levelling.
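
The interplay of mapping, out-of-place writes, garbage collection, and wear levelling can be sketched with a toy page-mapped FTL. Everything here is an illustrative assumption rather than how any real controller works: the `ToyFTL` class, the greedy victim choice (fewest valid pages), the "least-erased free block first" allocation, and the tiny geometry. Two blocks' worth of pages are held back as over-provisioning so that garbage collection can always make progress.

```python
class ToyFTL:
    """Minimal page-mapped flash translation layer (illustrative only)."""

    def __init__(self, num_blocks: int = 4, pages_per_block: int = 4):
        if num_blocks < 3:
            raise ValueError("need at least 3 blocks")
        self.ppb = pages_per_block
        self.logical_pages = (num_blocks - 2) * pages_per_block
        self.flash = [[None] * pages_per_block for _ in range(num_blocks)]
        self.valid = [[False] * pages_per_block for _ in range(num_blocks)]
        self.erase_counts = [0] * num_blocks
        self.l2p = {}  # logical page number -> (block, page)
        self.free_blocks = list(range(1, num_blocks))
        self.active, self.next_page = 0, 0
        self.host_writes = 0
        self.nand_writes = 0

    def write(self, lpn: int, data):
        if not 0 <= lpn < self.logical_pages:
            raise IndexError("logical page out of range")
        if self.next_page == self.ppb:  # active block is full
            self._open_new_block()
        self._program(lpn, data)
        self.host_writes += 1

    def read(self, lpn: int):
        block, page = self.l2p[lpn]
        return self.flash[block][page][1]

    def _program(self, lpn, data):
        """Write out of place, then invalidate the old copy."""
        block, page = self.active, self.next_page
        self.flash[block][page] = (lpn, data)
        self.valid[block][page] = True
        old = self.l2p.get(lpn)
        if old is not None:
            self.valid[old[0]][old[1]] = False
        self.l2p[lpn] = (block, page)
        self.next_page += 1
        self.nand_writes += 1

    def _open_new_block(self):
        # Simplistic wear levelling: prefer the least-erased free block.
        self.free_blocks.sort(key=lambda b: self.erase_counts[b])
        self.active, self.next_page = self.free_blocks.pop(0), 0
        if not self.free_blocks:  # last spare in use: reclaim space now
            self._garbage_collect()

    def _garbage_collect(self):
        # Greedy policy: pick the full block with the fewest valid pages.
        candidates = [b for b in range(len(self.flash)) if b != self.active]
        victim = min(candidates, key=lambda b: sum(self.valid[b]))
        survivors = [self.flash[victim][p] for p in range(self.ppb)
                     if self.valid[victim][p]]
        for lpn, data in survivors:  # relocate still-valid pages
            self._program(lpn, data)
        self.flash[victim] = [None] * self.ppb  # erase the whole block
        self.valid[victim] = [False] * self.ppb
        self.erase_counts[victim] += 1
        self.free_blocks.append(victim)


ftl = ToyFTL()
for i in range(200):
    ftl.write((i * i) % ftl.logical_pages, f"v{i}")
print("write amplification:", ftl.nand_writes / ftl.host_writes)
print("erase counts per block:", ftl.erase_counts)
```

The ratio of NAND page programs to host writes printed at the end is the write amplification; relocating valid pages during garbage collection is what pushes it above 1.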

Inline Compression in SSDs

Some SSD controllers implement transparent inline data compression between the host interface and the NAND array. Controllers from STEC, Seagate, and others compress incoming data with a lightweight algorithm (typically LZ-based) before writing it to flash. Compressible data occupies fewer NAND pages, which reduces write amplification, leaves more spare area for garbage collection, and extends flash endurance. The host continues to see the full logical capacity regardless of physical utilisation. This technique is found in enterprise SSDs and some consumer drives, though it brings no benefit, and can add write latency, for data that does not compress well, such as already-compressed or encrypted data, which is effectively random.
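
The effect on NAND page usage can be demonstrated with Python's standard `zlib` module standing in for the controller's compressor. This is only an analogy: zlib's DEFLATE is LZ77-based, but real controllers use their own hardware algorithms, and the 4 KiB page size and the `pages_needed` helper are assumptions made for the example.

```python
import os
import zlib

PAGE_SIZE = 4096  # assumed NAND page size in bytes


def pages_needed(payload: bytes, compress: bool) -> int:
    """Number of NAND pages required to store the payload."""
    stored = payload
    if compress:
        candidate = zlib.compress(payload, 1)  # level 1: fast, lightweight
        if len(candidate) < len(payload):      # keep raw data if no gain
            stored = candidate
    return -(-len(stored) // PAGE_SIZE)        # ceiling division


text_like = b"storage " * 8192          # 64 KiB of highly repetitive data
random_like = os.urandom(64 * 1024)     # 64 KiB of incompressible data

for label, payload in (("repetitive", text_like), ("random", random_like)):
    raw = pages_needed(payload, compress=False)
    packed = pages_needed(payload, compress=True)
    print(f"{label:>10}: {raw} pages raw, {packed} pages compressed")
```

Repetitive data shrinks to a fraction of its raw page count, while random data gains nothing and is stored as-is.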

Storage Interfaces: SATA, SAS, and PCIe

The Serial ATA (SATA) interface, which carries the ATA command set, became the standard for consumer and desktop drives through the 2000s and 2010s. SATA III provides a maximum bandwidth of 600 MB/s, which is sufficient for mechanical drives and many NAND flash designs. The ATA command set was designed for mechanical disks and includes a single command queue with a depth of 32 entries (NCQ, Native Command Queuing), adequate for sequential workloads but suboptimal for highly parallel flash arrays.

Interface	Max Bandwidth	Queue Depth	Typical Use
SATA III	600 MB/s	32 (NCQ)	Consumer HDDs, budget SSDs
SAS-3	1,200 MB/s	254	Enterprise HDDs and SSDs (SCSI command set)
PCIe 3.0 x4 (NVMe)	~3,500 MB/s	65,535 queues × 65,535 entries	Performance SSDs
PCIe 4.0 x4 (NVMe)	~7,000 MB/s	65,535 queues × 65,535 entries	High-performance SSDs
PCIe 5.0 x4 (NVMe)	~14,000 MB/s	65,535 queues × 65,535 entries	Enterprise, AI workloads
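
The bandwidth column can be related to the raw signalling rates with a short calculation. SATA III and SAS-3 use 8b/10b line encoding, and PCIe 3.0 and later use 128b/130b. The helper below is a hypothetical illustration that computes only the post-encoding line rate; the approximate PCIe figures in the table are lower because packet and protocol overhead also consume bandwidth.

```python
def link_bandwidth_mb_s(gigatransfers_per_s: float, payload_bits: int,
                        encoded_bits: int, lanes: int = 1) -> float:
    """Usable line rate in MB/s after line encoding, before protocol overhead."""
    raw_bits_per_s = gigatransfers_per_s * 1e9 * lanes
    return raw_bits_per_s * payload_bits / encoded_bits / 8 / 1e6


links = [
    ("SATA III",    6.0,   8,  10, 1),
    ("SAS-3",      12.0,   8,  10, 1),
    ("PCIe 3.0 x4", 8.0, 128, 130, 4),
    ("PCIe 4.0 x4", 16.0, 128, 130, 4),
    ("PCIe 5.0 x4", 32.0, 128, 130, 4),
]

for name, rate, payload, encoded, lanes in links:
    mb_s = link_bandwidth_mb_s(rate, payload, encoded, lanes)
    print(f"{name:<12} {mb_s:>10.1f} MB/s")
```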

NVMe: The Non-Volatile Memory Express Protocol

NVMe (NVM Express specification) was designed from the ground up for flash storage attached directly to the PCIe bus. Unlike AHCI (used over SATA), NVMe exposes up to 65,535 I/O queues, each with up to 65,535 entries. This deep, wide queue structure allows NVMe SSDs to process thousands of simultaneous I/O operations from multiple CPU cores without serialisation bottlenecks.
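
The queueing model can be pictured as one submission/completion ring pair per CPU core, so that cores never contend for a shared queue. The sketch below is a toy model under stated assumptions: the `Ring`, `QueuePair`, and `toy_controller_service` names are invented for illustration, commands are plain dictionaries, and doorbell registers, phase tags, and interrupts are not modelled.

```python
class Ring:
    """Fixed-size circular queue indexed by head and tail."""

    def __init__(self, slots: int):
        self.slots = [None] * slots
        self.head = 0  # next entry to consume
        self.tail = 0  # next free slot to fill

    def is_empty(self) -> bool:
        return self.head == self.tail

    def is_full(self) -> bool:
        # One slot stays unused so that full and empty are distinguishable.
        return (self.tail + 1) % len(self.slots) == self.head

    def push(self, entry):
        if self.is_full():
            raise RuntimeError("queue full")
        self.slots[self.tail] = entry
        self.tail = (self.tail + 1) % len(self.slots)

    def pop(self):
        if self.is_empty():
            raise RuntimeError("queue empty")
        entry = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        return entry


class QueuePair:
    """One submission queue and its completion queue."""

    def __init__(self, qid: int, slots: int):
        self.qid = qid
        self.sq = Ring(slots)
        self.cq = Ring(slots)


def toy_controller_service(qp: QueuePair):
    """Stand-in for the device: consume commands, post completions."""
    while not qp.sq.is_empty():
        command = qp.sq.pop()
        qp.cq.push({"cid": command["cid"], "status": 0})


# One queue pair per core: each core submits without touching the others.
queues = {core: QueuePair(qid=core + 1, slots=8) for core in range(4)}
for core, qp in queues.items():
    for i in range(3):
        qp.sq.push({"cid": i, "opcode": "read", "lba": core * 1000 + i})

for qp in queues.values():
    toy_controller_service(qp)
    while not qp.cq.is_empty():
        print("queue", qp.qid, "completion", qp.cq.pop())
```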

The NVMe command set is significantly leaner than ATA or SCSI: the read and write commands are simpler, and the specification includes explicit support for namespaces, allowing a single physical device to present multiple independent logical volumes to the host. NVMe also defines the NVMe-oF (NVMe over Fabrics) extension, which carries NVMe semantics over RDMA transports (InfiniBand, RoCE, iWARP), Fibre Channel, or TCP, enabling disaggregated storage architectures where flash capacity is pooled and shared across many compute nodes.

M.2 and U.2 Form Factors

NVMe SSDs ship in several physical form factors. The M.2 card format (originally called NGFF, Next Generation Form Factor) is dominant in laptops and desktop motherboards. Depending on the board, an M.2 slot is wired to PCIe lanes from the CPU or from the chipset; CPU-attached slots avoid the extra hop through the chipset link and so offer the lowest latency. The U.2 connector (formerly SFF-8639) is used in enterprise 2.5-inch drives, providing more thermal headroom and supporting hot-swap in server backplanes.

Distributed Storage and Erasure Coding

Large-scale storage systems distribute data across many drives or servers to provide fault tolerance and aggregate bandwidth. Two primary approaches exist: replication (maintaining two or three identical copies across separate failure domains) and erasure coding (encoding data into a larger number of fragments such that any subset of a defined size is sufficient to reconstruct the original).

Reed-Solomon erasure coding, used in RAID-6 and in distributed object stores such as Ceph, is a form of error-correcting code that adds redundant parity fragments. An RS(n, k) code encodes k data fragments into n total fragments; any k fragments suffice for reconstruction, so up to n − k fragments can be lost. This provides stronger fault tolerance than replication at lower storage overhead but requires more CPU for encoding and decoding. RS(14, 10), for example, survives the loss of any four fragments at 1.4× raw storage, whereas three-way replication survives two losses at 3×. Facebook's f4 storage system, described in a 2014 USENIX OSDI paper, demonstrated erasure coding deployed across tens of petabytes of warm BLOB data.
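
The trade-off between overhead and fault tolerance, and the basic idea of rebuilding a lost fragment from the survivors, can be sketched briefly. Note the substitution: the code below is not Reed-Solomon. It uses single XOR parity, the simplest erasure code (n = k + 1, tolerating one lost fragment), because a real Reed-Solomon codec needs finite-field arithmetic that would not fit a short example. All function names are hypothetical.

```python
from functools import reduce


def storage_overhead(n: int, k: int) -> float:
    """Raw bytes stored per byte of user data for an (n, k) code."""
    return n / k


def tolerated_losses(n: int, k: int) -> int:
    """Fragments that may be lost while any k still remain."""
    return n - k


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode_single_parity(fragments):
    """Append one XOR parity fragment to k equal-length data fragments."""
    return list(fragments) + [reduce(xor_bytes, fragments)]


def recover_missing(stripe, missing_index: int) -> bytes:
    """Rebuild one lost fragment by XOR-ing all surviving fragments."""
    survivors = [f for i, f in enumerate(stripe) if i != missing_index]
    return reduce(xor_bytes, survivors)


# Overhead comparison: 3-way replication is the (3, 1) case.
for label, n, k in (("3x replication", 3, 1), ("RS(14, 10)", 14, 10)):
    print(f"{label}: {storage_overhead(n, k):.1f}x storage, "
          f"survives {tolerated_losses(n, k)} lost fragment(s)")

data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = encode_single_parity(data)
print(recover_missing(stripe, 1))  # rebuilds the second data fragment
```

Reed-Solomon generalises this idea: instead of one XOR parity it computes n − k independent parity fragments over a finite field, so any n − k losses can be repaired.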

German cloud infrastructure providers such as IONOS (1&1) and Hetzner operate datacentres in several German cities. Frankfurt hosts DE-CIX, one of the world's largest internet exchange points by peak throughput, which makes the city a natural location for storage-intensive infrastructure serving European markets.

References: NVM Express Base Specification 2.0 (nvmexpress.org), IDEMA Advanced Format standard, JEDEC JESD218 (Flash Endurance), Reed & Solomon (1960) "Polynomial Codes over Certain Finite Fields".