# SLURM Usage Monitor
A high-performance monitoring system that collects and analyzes SLURM job efficiency metrics, optimized for large-scale HPC environments.
> [!TIP]
> SLURM purges detailed metrics after 30 days. Run `slurm-usage collect` to preserve your data!
## Purpose
SLURM's accounting database purges detailed job metrics (CPU usage, memory usage) after 30 days. This tool captures and preserves that data in efficient Parquet format for long-term analysis of resource utilization patterns.
## Key Features
- 📊 Captures comprehensive efficiency metrics from all job states
- 💾 Efficient Parquet storage - columnar format optimized for analytics
- 🔄 Smart incremental processing - tracks completed dates to minimize re-processing
- 📈 Rich visualizations - bar charts for resource usage, efficiency, and node utilization
- 👥 Group-based analytics - track usage by research groups/teams
- 🖥️ Node utilization tracking - analyze per-node CPU and GPU usage
- ⚡ Parallel collection - multi-threaded data collection by default
- ⏰ Cron-ready - designed for automated daily collection
- 🎯 Intelligent re-collection - only re-fetches incomplete job states
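Because collection is cron-ready, a daily schedule can be a single crontab entry. A minimal sketch (the install path and log location are assumptions; adjust for your site):

```
# m h dom mon dow  command
# Collect yesterday's SLURM metrics every day at 02:00.
# /usr/local/bin/slurm-usage is an assumed install path.
0 2 * * * /usr/local/bin/slurm-usage collect >> /var/log/slurm-usage.log 2>&1
```

Running shortly after midnight gives `sacct` time to finalize the previous day's records while staying well inside the 30-day purge window.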
## What It Collects
For each job:
- Job metadata: ID, user, name, partition, state, node list
- Time info: submit, start, end times, elapsed duration
- Allocated resources: CPUs, memory, GPUs, nodes
- Actual usage: CPU seconds used (`TotalCPU`), peak memory (`MaxRSS`)
- Calculated metrics:
  - CPU efficiency % (actual CPU time / allocated CPU time)
  - Memory efficiency % (peak memory / allocated memory)
  - CPU hours wasted
  - Memory GB-hours wasted
  - Total reserved resources (CPU/GPU/memory hours)
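The calculated metrics above are simple ratios over the collected fields. A minimal sketch in Python (function and parameter names are illustrative, not the tool's actual schema):

```python
# Illustrative formulas only; names like total_cpu_seconds are
# assumptions, not the tool's real column names.

def cpu_efficiency(total_cpu_seconds: float, alloc_cpus: int,
                   elapsed_seconds: float) -> float:
    """CPU efficiency % = actual CPU time / allocated CPU time."""
    allocated = alloc_cpus * elapsed_seconds
    return 100.0 * total_cpu_seconds / allocated if allocated else 0.0

def mem_efficiency(max_rss_gb: float, alloc_mem_gb: float) -> float:
    """Memory efficiency % = peak memory / allocated memory."""
    return 100.0 * max_rss_gb / alloc_mem_gb if alloc_mem_gb else 0.0

def cpu_hours_wasted(total_cpu_seconds: float, alloc_cpus: int,
                     elapsed_seconds: float) -> float:
    """Allocated CPU hours minus CPU hours actually used."""
    return (alloc_cpus * elapsed_seconds - total_cpu_seconds) / 3600.0

# Example: a job allocated 8 CPUs for 2 hours that used 4 CPU-hours
print(cpu_efficiency(4 * 3600, 8, 2 * 3600))   # 25.0
print(cpu_hours_wasted(4 * 3600, 8, 2 * 3600)) # 12.0
```

A low CPU efficiency with high memory efficiency, for instance, usually means the job should request fewer cores rather than less memory.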