## Troubleshooting
### No efficiency data?

- Check if SLURM accounting is configured: `scontrol show config | grep JobAcct`
- Verify jobs have `.batch` steps: `sacct -j JOBID`
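When checking many jobs, `sacct --parsable2` prints pipe-delimited rows that are easy to script. A minimal sketch of the `.batch`-step check (the helper and sample rows below are illustrative, not part of slurm-usage):

```python
def has_batch_step(sacct_lines):
    """Return True if any sacct --parsable2 row is a .batch step,
    which is where TotalCPU/MaxRSS are actually recorded."""
    return any(line.split("|")[0].endswith(".batch") for line in sacct_lines)

# Sample rows as produced by: sacct -j 123456 --parsable2 --format=JobID,State
sample = ["123456|COMPLETED", "123456.batch|COMPLETED"]
print(has_batch_step(sample))  # a job with no .batch row has no usage metrics
```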
### Collection is slow?

- Increase parallel workers: `slurm-usage collect --n-parallel 8`
- The first run processes historical data and will be slower
### Missing user groups?

- Create or update the configuration file in `~/.config/slurm-usage/config.yaml`
- Ungrouped users will appear as "ungrouped" in group statistics
### Script won't run?

- Ensure `uv` is installed: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Check SLURM access: `slurm-usage test` (or `./slurm_usage.py test` if running from source)
## Performance Optimizations
- Date completion tracking: Dates with only finished jobs are marked complete and skipped
- Parallel collection: Default 4 workers fetch different dates simultaneously
- Smart merging: Only updates changed jobs when re-collecting
- Efficient storage: Parquet format provides ~10x compression over CSV
- Date-based partitioning: Data organized by date for efficient queries
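The date-completion and smart-merge ideas above can be sketched roughly as follows (function names, the record layout, and the terminal-state list are illustrative assumptions, not the tool's actual internals):

```python
# States from which a SLURM job can no longer change; a date whose jobs
# are all terminal can be marked complete and skipped on later runs.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY"}

def is_date_complete(jobs_for_date):
    """True if every job on this date has finished, so re-collection
    for the date would never change the stored data."""
    return all(job["State"] in TERMINAL_STATES for job in jobs_for_date)

def merge_jobs(existing, fresh):
    """Smart merge keyed by JobID: overwrite a stored record only
    when the freshly collected record actually differs."""
    merged = dict(existing)
    for job_id, record in fresh.items():
        if merged.get(job_id) != record:
            merged[job_id] = record
    return merged
```

Together these keep repeat collections cheap: completed dates are skipped entirely, and for active dates only changed records are rewritten.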
## Important Notes

- 30-day window: SLURM purges detailed metrics after 30 days. Run collection at least weekly to ensure no data is lost.
- Batch steps: Actual usage metrics (TotalCPU, MaxRSS) are stored in the `.batch` step, not the parent job record.
- State normalization: All CANCELLED variants are normalized to "CANCELLED" for consistency.
- GPU tracking: GPU allocation is extracted from the AllocTRES field.
- Raw data archival: Raw SLURM records are preserved in case reprocessing is needed.
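The state-normalization and GPU-tracking notes can be illustrated with a minimal sketch (the helper names are hypothetical; the AllocTRES sample follows SLURM's comma-separated `name=count` format):

```python
import re

def normalize_state(state):
    """Collapse variants like 'CANCELLED by 12345' to plain 'CANCELLED'."""
    return "CANCELLED" if state.startswith("CANCELLED") else state

def gpus_from_alloc_tres(alloc_tres):
    """Extract the GPU count from an AllocTRES string such as
    'billing=8,cpu=8,gres/gpu=2,mem=32G,node=1'; also matches
    typed entries like 'gres/gpu:a100=2'."""
    match = re.search(r"gres/gpu[^=,]*=(\d+)", alloc_tres)
    return int(match.group(1)) if match else 0
```

Normalizing first means downstream group statistics never split one logical state ("CANCELLED") across several raw spellings.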
## Support
We appreciate your feedback and contributions! If you encounter any issues or have suggestions for improvements, please file an issue on the GitHub repository. We also welcome pull requests for bug fixes or new features.