Usage
CLI Commands
The following commands are available:
Usage: slurm_usage.py [OPTIONS] COMMAND [ARGS]...
SLURM Job Monitor - Collect and analyze job efficiency metrics
╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or │
│ customize the installation. │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────╮
│ collect Collect job data from SLURM using parallel date-based queries. │
│ analyze Analyze collected job data. │
│ status Show monitoring system status. │
│ current Display current cluster usage statistics from squeue. │
│ nodes Display node information from SLURM. │
│ test Run a quick test of the system. │
╰────────────────────────────────────────────────────────────────────────────────────────╯
Example Commands
# Collect data (uses 4 parallel workers by default)
slurm-usage collect
# Collect last 7 days of data
slurm-usage collect --days 7
# Collect with more parallel workers
slurm-usage collect --n-parallel 8
# Analyze collected data
slurm-usage analyze --days 7
# Display current cluster usage
slurm-usage current
# Display node information
slurm-usage nodes
# Check system status
slurm-usage status
# Test system configuration
slurm-usage test
Note: If running from source, use ./slurm_usage.py instead of slurm-usage.
Command Options
collect - Gather job data from SLURM
- --days/-d: Days to look back (default: 1)
- --data-dir: Data directory location (default: ./data)
- --summary/--no-summary: Show analysis after collection (default: True)
- --n-parallel/-n: Number of parallel workers (default: 4)
analyze - Analyze collected data
- --days/-d: Days to analyze (default: 7)
- --data-dir: Data directory location
status - Show system status
--data-dir: Data directory location
current - Display current cluster usage
Shows real-time cluster utilization from squeue, broken down by user and partition.
nodes - Display node information
Shows information about cluster nodes including CPU and GPU counts.
test - Test system configuration
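The collect command's "parallel date-based queries" can be sketched as one sacct query per day, fanned out across workers. This is an illustrative outline, not the tool's actual implementation; the field list and helper names are assumptions, and a stub runner stands in for subprocess.run:

```python
# Sketch of date-based parallel collection (assumed approach; the real
# slurm_usage.py may differ). Each worker queries one day, so the
# --days window can be fetched concurrently with --n-parallel workers.
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def dates_to_collect(days_back: int, today: date) -> list[date]:
    """Return the dates covered by --days, most recent first."""
    return [today - timedelta(days=n) for n in range(days_back)]

def sacct_command(day: date) -> list[str]:
    """Build a sacct invocation for one day's jobs (illustrative fields)."""
    return [
        "sacct", "--allusers", "--parsable2", "--noheader",
        f"--starttime={day:%Y-%m-%d}",
        f"--endtime={day + timedelta(days=1):%Y-%m-%d}",
        "--format=JobID,User,State,Elapsed,AllocCPUS,MaxRSS",
    ]

def collect(days_back: int, n_parallel: int, run) -> list:
    """Map a runner (e.g. subprocess.run) over per-day queries in parallel."""
    days = dates_to_collect(days_back, date.today())
    with ThreadPoolExecutor(max_workers=n_parallel) as pool:
        return list(pool.map(run, (sacct_command(d) for d in days)))
```

Querying per day (rather than one query for the whole window) is what makes the completion tracking described later possible: each day's results can be finalized independently.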
Output Structure
Data Organization
data/
├── raw/ # Raw SLURM data (archived)
│ ├── 2025-08-19.parquet # Daily raw records
│ ├── 2025-08-20.parquet
│ └── ...
├── processed/ # Processed job metrics
│ ├── 2025-08-19.parquet # Daily processed data
│ ├── 2025-08-20.parquet
│ └── ...
└── .date_completion_tracker.json # Tracks fully processed dates
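With one parquet file per day, an analysis window maps directly to a list of file paths. A minimal sketch of that mapping, with the directory layout taken from the tree above (helper name is illustrative):

```python
# Enumerate the processed parquet files an N-day analysis would read.
# Layout follows the data/ tree above; not all files need exist yet.
from datetime import date, timedelta
from pathlib import Path

def processed_files(data_dir: Path, days: int, today: date) -> list[Path]:
    """Paths for the last `days` daily files, oldest first."""
    start = today - timedelta(days=days - 1)
    return [
        data_dir / "processed" / f"{start + timedelta(days=n):%Y-%m-%d}.parquet"
        for n in range(days)
    ]
```

For example, `analyze --days 7` would read the seven most recent daily files under data/processed/.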
Sample Analysis Output
═══ Resource Usage by User ═══
┌─────────────┬──────┬───────────┬──────────────┬───────────┬─────────┬──────────┐
│ User │ Jobs │ CPU Hours │ Memory GB-hrs│ GPU Hours │ CPU Eff │ Mem Eff │
├─────────────┼──────┼───────────┼──────────────┼───────────┼─────────┼──────────┤
│ alice │ 124 │ 12,847 │ 48,291 │ 1,024 │ 45.2% │ 23.7% │
│ bob │ 87 │ 8,234 │ 31,456 │ 512 │ 38.1% │ 18.4% │
└─────────────┴──────┴───────────┴──────────────┴───────────┴─────────┴──────────┘
═══ Node Usage Analysis ═══
┌────────────┬──────┬───────────┬───────────┬───────────┐
│ Node │ Jobs │ CPU Hours │ GPU Hours │ CPU Util% │
├────────────┼──────┼───────────┼───────────┼───────────┤
│ cluster-1 │ 234 │ 45,678 │ 2,048 │ 74.3% │
│ cluster-2 │ 198 │ 41,234 │ 1,536 │ 67.1% │
└────────────┴──────┴───────────┴───────────┴───────────┘
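The per-user table above is a rollup of individual job records. A sketch of that aggregation in pure Python; the field names here are assumptions, not the tool's actual schema:

```python
# Aggregate job records into per-user totals, as in the table above.
# CPU hours = allocated CPUs x elapsed hours; efficiency is a simple
# per-job mean (the real tool may weight it differently).
from collections import defaultdict

def usage_by_user(jobs: list[dict]) -> dict[str, dict]:
    """Roll up job records into per-user totals and mean CPU efficiency."""
    totals = defaultdict(lambda: {"jobs": 0, "cpu_hours": 0.0, "eff_sum": 0.0})
    for job in jobs:
        row = totals[job["user"]]
        row["jobs"] += 1
        row["cpu_hours"] += job["alloc_cpus"] * job["elapsed_hours"]
        row["eff_sum"] += job["cpu_eff"]
    return {
        user: {
            "jobs": row["jobs"],
            "cpu_hours": row["cpu_hours"],
            "cpu_eff": row["eff_sum"] / row["jobs"],
        }
        for user, row in totals.items()
    }
```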
Smart Re-collection
The monitor intelligently handles job state transitions:
- Complete dates: Once all jobs for a date reach final states (COMPLETED, FAILED, CANCELLED, etc.), the date is marked complete and won't be re-queried
- Incomplete jobs: Jobs in states like RUNNING, PENDING, or SUSPENDED are automatically re-collected on subsequent runs
- Efficient updates: Only changed jobs are updated, minimizing processing time
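The completion logic above can be sketched as: a date is "done" once none of its jobs remain in a mutable state, and done dates are persisted so later runs skip them. The real tool keeps this in .date_completion_tracker.json; the helper names and the exact final-state set below are illustrative:

```python
# Sketch of date-completion tracking. A date is marked complete only
# when every job recorded for it has reached a final state; completed
# dates are persisted and skipped on subsequent collect runs.
import json
from pathlib import Path

# Assumed set of terminal SLURM states for this sketch.
FINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY"}

def date_is_complete(job_states: list[str]) -> bool:
    """True when every job for the date reached a final state."""
    return all(state in FINAL_STATES for state in job_states)

def load_tracker(path: Path) -> set[str]:
    """Read the set of completed dates; a missing file means none yet."""
    if path.exists():
        return set(json.loads(path.read_text()))
    return set()

def save_tracker(path: Path, done: set[str]) -> None:
    """Persist completed dates as a sorted JSON list."""
    path.write_text(json.dumps(sorted(done)))
```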
Tracked Incomplete States
The following job states indicate that a job may still change; any job in one of them triggers re-collection on the next run:
- Active: RUNNING, PENDING, SUSPENDED
- Transitional: COMPLETING, CONFIGURING, STAGE_OUT, SIGNALING
- Requeue: REQUEUED, REQUEUE_FED, REQUEUE_HOLD
- Other: RESIZING, REVOKED, SPECIAL_EXIT
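The list above reduces to a simple membership check; since only membership matters, the grouping labels survive only as comments:

```python
# The incomplete-state list above as a lookup set. A job in any of
# these states may still change, so its date cannot be marked complete.
INCOMPLETE_STATES = {
    # Active
    "RUNNING", "PENDING", "SUSPENDED",
    # Transitional
    "COMPLETING", "CONFIGURING", "STAGE_OUT", "SIGNALING",
    # Requeue
    "REQUEUED", "REQUEUE_FED", "REQUEUE_HOLD",
    # Other
    "RESIZING", "REVOKED", "SPECIAL_EXIT",
}

def needs_recollection(state: str) -> bool:
    """True if a job in this state should be re-queried on the next run."""
    return state.upper() in INCOMPLETE_STATES
```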