SLURM Usage Monitor

A high-performance monitoring system that collects and analyzes SLURM job efficiency metrics, optimized for large-scale HPC environments.

Tip

SLURM purges detailed metrics after 30 days. Run slurm-usage collect to preserve your data!

Purpose

SLURM's accounting database purges detailed job metrics (CPU usage, memory usage) after 30 days. This tool captures and preserves that data in efficient Parquet format for long-term analysis of resource utilization patterns.

Key Features

  • 📊 Captures comprehensive efficiency metrics from all job states
  • 💾 Efficient Parquet storage - columnar format optimized for analytics
  • 🔄 Smart incremental processing - tracks completed dates to minimize re-processing
  • 📈 Rich visualizations - bar charts for resource usage, efficiency, and node utilization
  • 👥 Group-based analytics - track usage by research groups/teams
  • 🖥️ Node utilization tracking - analyze per-node CPU and GPU usage
  • Parallel collection - multi-threaded data collection by default
  • Cron-ready - designed for automated daily collection
  • 🎯 Intelligent re-collection - only re-fetches incomplete job states

What It Collects

For each job:

  • Job metadata: ID, user, name, partition, state, node list
  • Time info: submit, start, end times, elapsed duration
  • Allocated resources: CPUs, memory, GPUs, nodes
  • Actual usage: CPU seconds used (TotalCPU), peak memory (MaxRSS)
  • Calculated metrics:
      • CPU efficiency % (actual CPU time / allocated CPU time)
      • Memory efficiency % (peak memory / allocated memory)
      • CPU hours wasted
      • Memory GB-hours wasted
      • Total reserved resources (CPU/GPU/memory hours)
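
The calculated metrics above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the `JobRecord` class, its field names, and the `efficiency_metrics` function are hypothetical. It assumes "allocated CPU time" means allocated CPUs multiplied by elapsed wall-clock time, and that "wasted" means reserved minus used, floored at zero.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical per-job record; field names are illustrative only."""

    alloc_cpus: int            # CPUs allocated to the job
    elapsed_seconds: float     # wall-clock run time
    total_cpu_seconds: float   # CPU seconds actually used (TotalCPU)
    alloc_mem_gb: float        # memory allocated to the job, in GB
    max_rss_gb: float          # peak memory (MaxRSS), in GB


def efficiency_metrics(job: JobRecord) -> dict[str, float]:
    """Derive efficiency and waste figures from one job record."""
    elapsed_hours = job.elapsed_seconds / 3600
    # Assumption: allocated CPU time = allocated CPUs x elapsed time.
    allocated_cpu_seconds = job.alloc_cpus * job.elapsed_seconds

    # Guard against zero-length jobs and missing allocations.
    cpu_eff = (
        100 * job.total_cpu_seconds / allocated_cpu_seconds
        if allocated_cpu_seconds > 0
        else 0.0
    )
    mem_eff = (
        100 * job.max_rss_gb / job.alloc_mem_gb if job.alloc_mem_gb > 0 else 0.0
    )

    cpu_hours_reserved = job.alloc_cpus * elapsed_hours
    mem_gb_hours_reserved = job.alloc_mem_gb * elapsed_hours

    # Assumption: waste = reserved minus used, never negative.
    cpu_hours_wasted = max(0.0, cpu_hours_reserved - job.total_cpu_seconds / 3600)
    mem_gb_hours_wasted = (
        max(0.0, job.alloc_mem_gb - job.max_rss_gb) * elapsed_hours
    )

    return {
        "cpu_efficiency_pct": cpu_eff,
        "memory_efficiency_pct": mem_eff,
        "cpu_hours_reserved": cpu_hours_reserved,
        "memory_gb_hours_reserved": mem_gb_hours_reserved,
        "cpu_hours_wasted": cpu_hours_wasted,
        "memory_gb_hours_wasted": mem_gb_hours_wasted,
    }


# Example: 4 CPUs for 2 hours, 4 CPU-hours actually used, 4 of 16 GB touched.
example = JobRecord(
    alloc_cpus=4,
    elapsed_seconds=7200,
    total_cpu_seconds=14400,
    alloc_mem_gb=16,
    max_rss_gb=4,
)
metrics = efficiency_metrics(example)
```

For this example job, CPU efficiency comes out at 50% and memory efficiency at 25%, with 4 CPU hours and 24 memory GB-hours counted as wasted.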