High I/O Wait on Linux Servers: Causes, Diagnosis, and Multiple Solutions

High I/O wait (iowait) is a common Linux performance issue where CPUs stay idle while waiting for disk or storage operations to finish. Although the CPU is not overloaded, high iowait can slow applications, increase latency, and even cause outages. This guide explains what iowait means, the main causes behind it, and practical solutions to fix it based on your environment and requirements.

What Is I/O Wait, Exactly?

Before diagnosing anything, you must understand what the metric means. The Linux kernel tracks CPU time in several buckets. The iowait bucket increments when:

At least one I/O request is outstanding (pending disk or network block device read/write).
The CPU has no other runnable task to execute.

This means iowait is an opportunity cost metric. It tells you the percentage of time the CPU could have been doing useful work but had nothing to run because everything was blocked on I/O. A 30% iowait does not mean 30% of your disk bandwidth is wasted, it means 30% of CPU cycles went unused while storage requests were in flight.

How to observe iowait:

top - look at the 'wa' field in the CPU line: top

mpstat - per-CPU iowait breakdown: mpstat -P ALL 1

vmstat - system-wide view including I/O: vmstat 1

iostat - disk-level throughput and latency: iostat -xz 1

iotop - per-process I/O activity (like top, for disk): iotop -ao

pidstat - per-process I/O stats: pidstat -d 1

Once you have confirmed high iowait, your investigation splits into two tracks: which device is the bottleneck, and which process is responsible.

Identify the busiest disk: iostat -xz 1 | grep -v "^$"

Look for high %util, high await (ms latency), high r/s or w/s

Find the top I/O consumers: iotop -b -n 5 -o

With that foundation in place, let us walk through every major cause and its solutions.

Cause 1: Undersized or Slow Storage Hardware

The Problem: The most fundamental cause of high iowait is simply that the storage medium cannot keep up with the workload's demands. This is especially common with:

Traditional spinning HDDs serving random I/O workloads (databases, logging servers).
Consumer-grade SSDs deployed in production environments where write endurance and IOPS are insufficient.
Network-attached storage (NAS) or SAN with inadequate bandwidth or high latency.
Shared cloud block storage (like AWS gp2 EBS volumes) hitting IOPS limits.

You will typically see await values above 10–20ms for HDDs and above 1–2ms for SSDs in iostat, combined with %util approaching 100%.

Solutions

Solution A: Upgrade to NVMe SSDs: The single highest-impact hardware change you can make. NVMe drives offer sub-millisecond latency and hundreds of thousands of IOPS versus the 100–200 IOPS typical of a 7200 RPM HDD. For databases and random-read workloads, this can reduce iowait from 40–60% down to under 5% without any application changes.

Solution B: Use RAID for throughput or redundancy

RAID 0 (striping): Doubles throughput across two drives. No redundancy.
RAID 10 (striped mirrors): Best of both worlds for databases — redundancy plus parallel I/O.
RAID 5/6: Better space efficiency, but write penalty makes them poor for write-heavy workloads.

Solution C: Provision more IOPS on cloud storage: On AWS, move from gp2 to gp3 or io2 EBS volumes and explicitly provision the IOPS you need. On GCP, switch from standard persistent disks to SSD persistent disks or Hyperdisk Extreme. This is often the fastest fix in cloud environments.

Cause 2: Inefficient Application I/O Patterns

Hardware is fast, but the application is issuing I/O in pathological patterns:

Synchronous, unbuffered writes: Each write waits for disk acknowledgment before proceeding.
Tiny random writes: Thousands of 4KB writes per second saturate IOPS budgets.
Read amplification: An ORM fetching entire rows when only two columns are needed, causing large unnecessary reads.
Missing fsync batching: Databases calling fsync() on every transaction instead of grouping commits.

Solutions

Solution A: Enable write buffering and async I/O

If your application is using O_SYNC or O_DSYNC on file opens, evaluate whether that level of durability is truly required. Many applications default to sync writes out of caution when async writes with periodic flushing are sufficient.

For applications you control, switch to buffered writes and flush explicitly at logical transaction boundaries rather than per-operation:

# Instead of: write every record and sync

for record in records:

f.write(record)

f.flush()

os.fsync(f.fileno()) # Kills performance

# Do: buffer and sync once per batch

for record in records:

f.write(record)

f.flush()

os.fsync(f.fileno()) # One sync for the whole batch

Solution B: Use O_DIRECT with async I/O for databases: Databases like PostgreSQL and MySQL use their own buffer pool. When they also go through the kernel page cache, you get double-buffering and unnecessary cache pressure. Using O_DIRECT bypasses the page cache and lets the database manage its own caching, reducing iowait by eliminating redundant copies.

PostgreSQL already does this automatically. For MySQL/InnoDB, ensure innodb_flush_method = O_DIRECT is set.

Solution C: Batch and coalesce writes: Use write-ahead logging (WAL) patterns to accumulate changes in memory and flush sequentially. Sequential I/O is dramatically faster than random I/O on both HDDs and SSDs. This is why databases use WAL, turning random transactional writes into sequential log writes.

Cause 3: Insufficient RAM / Page Cache Thrashing

Linux uses free RAM as a page cache, recently read disk blocks are kept in memory so subsequent reads are served from RAM rather than disk. When RAM is exhausted, the kernel must evict cached pages to make room for new data or process memory, then re-read evicted pages from disk on next access. This is called page cache thrashing or swap thrashing.

Signs to look for:

Check memory usage: free -h

# If 'available' is near zero and swap is heavily used, RAM is the bottleneck

vmstat 1

Look for high 'si' (swap in) and 'so' (swap out) values

High swap activity combined with high iowait almost always means RAM exhaustion.

Solution A: Add more RAM: The most direct fix. Linux's page cache is extraordinarily effective, adding RAM reduces disk reads proportionally. A database server that fits its working set entirely in RAM can see iowait drop from 40% to under 1%.

Solution B: Tune vm.swappiness: The vm.swappiness kernel parameter (0–100) controls how aggressively the kernel swaps process memory to disk versus reclaiming page cache. The default is 60, which means the kernel will swap out process pages fairly willingly. For servers running databases or caches:

Check current value: cat /proc/sys/vm/swappiness

Set to 10 - prefer keeping process memory in RAM, reclaim page cache first

sysctl -w vm.swappiness=10

Make permanent:

echo "vm.swappiness=10" >> /etc/sysctl.conf

sysctl -p

For Redis and similar in-memory datastores, setting swappiness to 0 is often recommended (though 0 does not fully disable swapping).

Solution C: Use HugePages for database workloads: Databases with large buffer pools benefit enormously from Transparent HugePages (THP) or explicit HugePages. Fewer TLB misses mean faster memory access, reducing the frequency of cache misses that trigger disk reads.

For PostgreSQL and Oracle, configure explicit HugePages:

Calculate required HugePages:

grep -i hugepagesize /proc/meminfo

Set in /etc/sysctl.conf:

vm.nr_hugepages = 1024 # Adjust based on your shared_buffers / hugepagesize

Solution D: Tune the page cache dirty ratios: If iowait spikes are periodic (every few seconds), the culprit is often the kernel flushing dirty pages in large batches. Tune the dirty page thresholds to write more frequently in smaller bursts:

# Default: flush when 20% of RAM is dirty

vm.dirty_ratio = 10 # Start flushing at 10%

vm.dirty_background_ratio = 5 # Background flush starts at 5%

vm.dirty_expire_centisecs = 1000 # Expire dirty pages after 10s

vm.dirty_writeback_centisecs = 500 # Flush every 5s

sysctl -w vm.dirty_ratio=10

sysctl -w vm.dirty_background_ratio=5

Cause 4: Database Bottlenecks

Databases are the single most common source of high iowait on application servers. The specific causes within databases include:

Missing indexes: Full table scans read entire tables from disk.
Oversized result sets: Queries returning far more data than needed.
Checkpoint storms: PostgreSQL or MySQL checkpointing all dirty buffer pool pages to disk simultaneously.
Redo log / WAL contention: Log files too small, causing frequent switches and flushing.
Buffer pool too small: The database cannot cache its working set in memory.

Solution A: Identify and add missing indexes

-- PostgreSQL: find sequential scans on large tables

SELECT schemaname, tablename, seq_scan, seq_tup_read,

idx_scan, n_live_tup

FROM pg_stat_user_tables

WHERE seq_scan > 0

ORDER BY seq_tup_read DESC

LIMIT 20;

-- MySQL: show queries without index usage

SHOW STATUS LIKE 'Select_full_join';

SHOW STATUS LIKE 'Select_scan';

-- Enable slow query log to catch bad queries

SET GLOBAL slow_query_log = 1;

SET GLOBAL long_query_time = 1;

Solution B: Increase buffer pool / shared buffers: For PostgreSQL, shared_buffers defaults to 128MB — embarrassingly small for any production workload. Set it to 25% of available RAM as a starting point:

# postgresql.conf

shared_buffers = 8GB # 25% of 32GB RAM

effective_cache_size = 24GB # Hint to query planner: 75% of RAM

work_mem = 64MB # Per-sort / per-hash operation

For MySQL/InnoDB:

# my.cnf

innodb_buffer_pool_size = 24G # 75% of RAM for dedicated DB servers

innodb_buffer_pool_instances = 8 # One per GB of buffer pool

innodb_log_file_size = 2G # Larger redo logs = fewer checkpoints

innodb_flush_log_at_trx_commit = 2 # Relax durability slightly for performance

Solution C: Smooth out checkpoint storms

PostgreSQL checkpoints can cause massive iowait spikes. Spread the I/O over a longer window:

# postgresql.conf

checkpoint_completion_target = 0.9 # Spread checkpoint I/O over 90% of interval

checkpoint_timeout = 15min # Allow longer between checkpoints

max_wal_size = 4GB # More WAL before forcing a checkpoint

Solution D: Connection pooling to reduce I/O overhead

Each database connection maintains its own I/O context. Thousands of connections create thousands of competing I/O streams. Use PgBouncer (PostgreSQL) or ProxySQL (MySQL) to multiplex many application connections over a small pool of actual database connections.

Install PgBouncer: apt install pgbouncer

Key settings in pgbouncer.ini

pool_mode = transaction # Reuse connections per transaction

max_client_conn = 1000 # Frontend: accept up to 1000 app connections

default_pool_size = 20 # Backend: only 20 real DB connections

Cause 5: Log File and Audit Trail Overload

Applications logging at DEBUG level in production, audit daemons writing every syscall, or web servers logging every request with flush-on-write can generate enough I/O to saturate even fast SSDs. This is particularly insidious because it is invisible in application metrics, everything looks fine, but the OS is buried in log writes.

Identify the culprit:

Find processes writing most to disk: iotop -a -o -b -n 30 | sort -k10 -rn | head -20

Check which files are being written heavily: lsof +D /var/log | awk '{print $1}' | sort | uniq -c | sort -rn

Solution A: Raise log levels in production

Change application log level from DEBUG or INFO to WARN or ERROR. This is the single easiest fix and can reduce log I/O by 90% or more with zero infrastructure changes.

Solution B: Use asynchronous logging: Buffer log writes in memory and flush periodically rather than on every log line. In Java, use Log4j2's AsyncAppender. In Python, use logging.handlers.MemoryHandler. In Go, wrap your logger in a buffered writer.

import logging

from logging.handlers import MemoryHandler

# Buffer up to 1000 log records, flush every 5 seconds or on ERROR+

memory_handler = MemoryHandler(

capacity=1000,

flushLevel=logging.ERROR,

target=file_handler

)

logger.addHandler(memory_handler)

Solution C: Log to tmpfs / ramdisk

For high-frequency ephemeral logs (access logs, debug traces), write to a RAM-backed tmpfs mount and periodically archive to persistent storage:

Create a 2GB ramdisk for logs: mount -t tmpfs -o size=2G tmpfs /var/log/app-tmp

Rsync to persistent storage every 5 minutes: echo "*/5 * * * * rsync -a /var/log/app-tmp/ /var/log/app-archive/" | crontab -

Solution D: Use journald with rate limiting

If systemd's journald is the culprit, rate-limit what it accepts:

# /etc/systemd/journald.conf

RateLimitIntervalSec=30s

RateLimitBurst=10000

SystemMaxUse=2G

RuntimeMaxUse=500M

Conclusion

High I/O wait is never a single-cause problem. It is the intersection of hardware capability, application behavior, OS configuration, and workload patterns. High iowait is solvable, often with nothing more than a sysctl change or a SQL index. Start with the cheapest, highest-impact fix first, measure again, and iterate.

High I/O Wait on Linux Servers: Causes, Diagnosis, and Multiple Solutions

Categories

Support

What Is I/O Wait, Exactly?

How to observe iowait:

Identify the busiest disk: iostat -xz 1 | grep -v "^$"

Cause 1: Undersized or Slow Storage Hardware

Solutions

Cause 2: Inefficient Application I/O Patterns

Solutions

Cause 3: Insufficient RAM / Page Cache Thrashing

Look for high 'si' (swap in) and 'so' (swap out) values

Set to 10 - prefer keeping process memory in RAM, reclaim page cache first

Cause 4: Database Bottlenecks

For MySQL/InnoDB:

Solution C: Smooth out checkpoint storms

Solution D: Connection pooling to reduce I/O overhead

Key settings in pgbouncer.ini

Cause 5: Log File and Audit Trail Overload

Conclusion

Most Popular Articles

High I/O Wait on Linux Servers: Causes, Diagnosis, and Multiple Solutions

Categories

Support

What Is I/O Wait, Exactly?

How to observe iowait:

Identify the busiest disk: iostat -xz 1 | grep -v "^$"

Cause 1: Undersized or Slow Storage Hardware

Solutions

Cause 2: Inefficient Application I/O Patterns

Solutions

Cause 3: Insufficient RAM / Page Cache Thrashing

Look for high 'si' (swap in) and 'so' (swap out) values

Set to 10 - prefer keeping process memory in RAM, reclaim page cache first

Cause 4: Database Bottlenecks

For MySQL/InnoDB:

Solution C: Smooth out checkpoint storms

Solution D: Connection pooling to reduce I/O overhead

Key settings in pgbouncer.ini

Cause 5: Log File and Audit Trail Overload

Conclusion

Most Popular Articles

Generate Password