Troubleshooting

No efficiency data?

  • Check if SLURM accounting is configured: scontrol show config | grep JobAcct
  • Verify jobs have .batch steps: sacct -j JOBID
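As a quick illustration of what to look for in the `sacct` output, the sketch below picks out the `.batch` step from parsable output. The sample text is illustrative, not real output for any job, and `batch_step` is a hypothetical helper, not part of the tool.

```python
# Sketch: select the .batch step from sacct --parsable2 style output.
# The sample below is illustrative, not real sacct output.
sample = """JobID|State|TotalCPU|MaxRSS
12345|COMPLETED||
12345.batch|COMPLETED|01:23:45|4096K
"""

def batch_step(sacct_text: str) -> dict:
    """Return the fields of the .batch step, keyed by the header row."""
    lines = sacct_text.strip().splitlines()
    header = lines[0].split("|")
    for line in lines[1:]:
        fields = dict(zip(header, line.split("|")))
        if fields["JobID"].endswith(".batch"):
            return fields
    return {}

step = batch_step(sample)
print(step["TotalCPU"])  # prints 01:23:45 -- usage metrics live on the .batch step
```

If a job has no `.batch` row at all, efficiency metrics such as TotalCPU and MaxRSS simply are not recorded for it.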

Collection is slow?

  • Increase parallel workers: slurm-usage collect --n-parallel 8
  • The first run processes historical data and will be slower

Missing user groups?

  • Create or update the configuration file in ~/.config/slurm-usage/config.yaml
  • Ungrouped users will appear as "ungrouped" in group statistics
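A minimal config file might look like the sketch below. The exact schema is an assumption, and the group and user names are placeholders:

```yaml
# ~/.config/slurm-usage/config.yaml -- illustrative sketch; the exact
# schema is an assumption, and group/user names are placeholders.
groups:
  physics:
    - alice
    - bob
  biology:
    - carol
```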

Script won't run?

  • Ensure uv is installed: curl -LsSf https://astral.sh/uv/install.sh | sh
  • Check SLURM access: slurm-usage test (or ./slurm_usage.py test if running from source)

Performance Optimizations

  • Date completion tracking: Dates with only finished jobs are marked complete and skipped
  • Parallel collection: Default 4 workers fetch different dates simultaneously
  • Smart merging: Only updates changed jobs when re-collecting
  • Efficient storage: Parquet format provides ~10x compression over CSV
  • Date-based partitioning: Data organized by date for efficient queries
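The "smart merging" idea can be sketched as follows. This is a simplified stand-in for the real implementation, using plain dicts of illustrative job records rather than Parquet data:

```python
# Sketch of smart merging: when re-collecting a date, only replace
# records whose fields actually changed. Simplified stand-in; the
# job dicts are illustrative.
def merge_jobs(existing: dict, fresh: dict) -> tuple[dict, int]:
    """Merge fresh records into existing ones; return (merged, n_updated)."""
    merged = dict(existing)
    updated = 0
    for job_id, record in fresh.items():
        if merged.get(job_id) != record:
            merged[job_id] = record
            updated += 1
    return merged, updated

old = {"1": {"State": "RUNNING"}, "2": {"State": "COMPLETED"}}
new = {"1": {"State": "COMPLETED"}, "2": {"State": "COMPLETED"}}
merged, n = merge_jobs(old, new)
print(n)  # prints 1 -- only the job whose state changed counts as updated
```

Comparing whole records this way means an unchanged, already-finished job costs nothing to re-collect, which is what makes the date-completion skipping above safe.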

Important Notes

  1. 30-day window: SLURM purges detailed metrics after 30 days. Run collection at least weekly to ensure no data is lost.

  2. Batch steps: Actual usage metrics (TotalCPU, MaxRSS) are stored in the .batch step, not the parent job record.

  3. State normalization: All CANCELLED variants are normalized to "CANCELLED" for consistency.

  4. GPU tracking: GPU allocation is extracted from the AllocTRES field.

  5. Raw data archival: Raw SLURM records are preserved in case reprocessing is needed.
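The state normalization in note 3 can be sketched as below. SLURM reports cancellations with variants such as "CANCELLED by 1234"; collapsing them to a single value keeps statistics consistent. This is a stand-in, not the tool's actual function:

```python
# Sketch of state normalization: collapse all CANCELLED variants
# (e.g. "CANCELLED by 1234") into plain "CANCELLED".
def normalize_state(state: str) -> str:
    if state.startswith("CANCELLED"):
        return "CANCELLED"
    return state

print(normalize_state("CANCELLED by 1234"))  # prints CANCELLED
print(normalize_state("COMPLETED"))          # prints COMPLETED
```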

Support

We appreciate your feedback and contributions! If you encounter any issues or have suggestions for improvements, please file an issue on the GitHub repository. We also welcome pull requests for bug fixes or new features.