Usage

CLI Commands

The following commands are available:

 Usage: slurm_usage.py [OPTIONS] COMMAND [ARGS]...

 SLURM Job Monitor - Collect and analyze job efficiency metrics


╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
 --install-completion          Install completion for the current shell.
 --show-completion             Show completion for the current shell, to copy it or
                               customize the installation.
 --help                        Show this message and exit.
╰────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────╮
 collect   Collect job data from SLURM using parallel date-based queries.
 analyze   Analyze collected job data.
 status    Show monitoring system status.
 current   Display current cluster usage statistics from squeue.
 nodes     Display node information from SLURM.
 test      Run a quick test of the system.
╰────────────────────────────────────────────────────────────────────────────────────────╯

Example Commands

# Collect data (uses 4 parallel workers by default)
slurm-usage collect

# Collect last 7 days of data
slurm-usage collect --days 7

# Collect with more parallel workers
slurm-usage collect --n-parallel 8

# Analyze collected data
slurm-usage analyze --days 7

# Display current cluster usage
slurm-usage current

# Display node information
slurm-usage nodes

# Check system status
slurm-usage status

# Test system configuration
slurm-usage test

Note: If running from source, use ./slurm_usage.py instead of slurm-usage.

Command Options

collect - Gather job data from SLURM

  • --days/-d: Days to look back (default: 1)
  • --data-dir: Data directory location (default: ./data)
  • --summary/--no-summary: Show analysis after collection (default: True)
  • --n-parallel/-n: Number of parallel workers (default: 4)

analyze - Analyze collected data

  • --days/-d: Days to analyze (default: 7)
  • --data-dir: Data directory location

status - Show system status

  • --data-dir: Data directory location

current - Display current cluster usage

Shows real-time cluster utilization from squeue, broken down by user and partition.

nodes - Display node information

Shows information about cluster nodes including CPU and GPU counts.

test - Test system configuration

Runs a quick test of the system.

Output Structure

Data Organization

data/
├── raw/                        # Raw SLURM data (archived)
│   ├── 2025-08-19.parquet      # Daily raw records
│   ├── 2025-08-20.parquet
│   └── ...
├── processed/                  # Processed job metrics
│   ├── 2025-08-19.parquet      # Daily processed data
│   ├── 2025-08-20.parquet
│   └── ...
└── .date_completion_tracker.json  # Tracks fully processed dates
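Because the daily files are named by ISO date, working with a lookback window from your own scripts is straightforward. A minimal sketch (the function name is illustrative; it only assumes the YYYY-MM-DD.parquet naming shown above):

```python
from datetime import date, timedelta
from pathlib import Path

def processed_files_for_window(data_dir: str, days: int) -> list[Path]:
    """Return the daily processed parquet files covering the last `days` days."""
    processed = Path(data_dir) / "processed"
    # The last N days, including today, as ISO date strings.
    wanted = {(date.today() - timedelta(days=offset)).isoformat() for offset in range(days)}
    # File stems are ISO dates (YYYY-MM-DD), so a set lookup on the stem suffices.
    return sorted(p for p in processed.glob("*.parquet") if p.stem in wanted)
```

ISO dates sort chronologically as strings, so the returned list is already in date order.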

Sample Analysis Output

═══ Resource Usage by User ═══

┌─────────────┬──────┬───────────┬──────────────┬───────────┬─────────┬──────────┐
│ User        │ Jobs │ CPU Hours │ Memory GB-hrs│ GPU Hours │ CPU Eff │ Mem Eff  │
├─────────────┼──────┼───────────┼──────────────┼───────────┼─────────┼──────────┤
│ alice       │  124 │   12,847  │    48,291    │    1,024  │  45.2%  │  23.7%   │
│ bob         │   87 │    8,234  │    31,456    │      512  │  38.1%  │  18.4%   │
└─────────────┴──────┴───────────┴──────────────┴───────────┴─────────┴──────────┘

═══ Node Usage Analysis ═══

┌────────────┬──────┬───────────┬───────────┬───────────┐
│ Node       │ Jobs │ CPU Hours │ GPU Hours │ CPU Util% │
├────────────┼──────┼───────────┼───────────┼───────────┤
│ cluster-1  │  234 │   45,678  │    2,048  │   74.3%   │
│ cluster-2  │  198 │   41,234  │    1,536  │   67.1%   │
└────────────┴──────┴───────────┴───────────┴───────────┘
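The efficiency columns follow the usual SLURM accounting convention (as tools like seff compute it; the exact formula used here may differ): CPU efficiency is the CPU time a job actually consumed divided by the CPU time it reserved (allocated CPUs × wall-clock time). A sketch of that arithmetic:

```python
def cpu_hours(alloc_cpus: int, elapsed_hours: float) -> float:
    """CPU hours reserved by a job: allocated CPUs times wall-clock hours."""
    return alloc_cpus * elapsed_hours

def cpu_efficiency(used_cpu_hours: float, alloc_cpus: int, elapsed_hours: float) -> float:
    """Percentage of reserved CPU time that was actually used."""
    reserved = cpu_hours(alloc_cpus, elapsed_hours)
    return 100.0 * used_cpu_hours / reserved if reserved else 0.0

# A job that reserved 16 CPUs for 10 hours but only used 72 CPU-hours
# of compute ran at 45% CPU efficiency.
```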

Smart Re-collection

The monitor intelligently handles job state transitions:

  • Complete dates: Once all jobs for a date reach final states (COMPLETED, FAILED, CANCELLED, etc.), the date is marked complete and won't be re-queried
  • Incomplete jobs: Jobs in states like RUNNING, PENDING, or SUSPENDED are automatically re-collected on subsequent runs
  • Efficient updates: Only changed jobs are updated, minimizing processing time
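The completion-tracking logic above can be sketched roughly as follows (the function names and the set of final states here are illustrative, not the tool's actual internals): a date is marked complete only once every job recorded for it has reached a final state, and completed dates are persisted in the JSON tracker file.

```python
import json
from pathlib import Path

# Common SLURM final states (illustrative; sacct reports others too).
FINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY", "NODE_FAIL"}

def date_is_complete(job_states: list[str]) -> bool:
    """True when every job for the date is in a final state.

    States like "CANCELLED by 12345" carry a suffix, so compare the first word.
    """
    return all(state.split()[0] in FINAL_STATES for state in job_states)

def mark_complete(tracker_path: Path, day: str) -> None:
    """Record a finished date in the JSON tracker so it is skipped on later runs."""
    done = set(json.loads(tracker_path.read_text())) if tracker_path.exists() else set()
    done.add(day)
    tracker_path.write_text(json.dumps(sorted(done)))
```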

Tracked Incomplete States

The following job states indicate a job may change and will trigger re-collection:

  • Active: RUNNING, PENDING, SUSPENDED
  • Transitional: COMPLETING, CONFIGURING, STAGE_OUT, SIGNALING
  • Requeue: REQUEUED, REQUEUE_FED, REQUEUE_HOLD
  • Other: RESIZING, REVOKED, SPECIAL_EXIT
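Put together, the re-collection decision reduces to a set membership test on the job's state (a sketch of the idea; the real implementation may differ):

```python
# Job states that may still change, grouped as in the list above.
INCOMPLETE_STATES = {
    # Active
    "RUNNING", "PENDING", "SUSPENDED",
    # Transitional
    "COMPLETING", "CONFIGURING", "STAGE_OUT", "SIGNALING",
    # Requeue
    "REQUEUED", "REQUEUE_FED", "REQUEUE_HOLD",
    # Other
    "RESIZING", "REVOKED", "SPECIAL_EXIT",
}

def needs_recollection(state: str) -> bool:
    """True if the job's state may still change, so it should be re-queried."""
    # Compare only the first word: sacct may report e.g. "CANCELLED by 12345".
    return state.split()[0].upper() in INCOMPLETE_STATES
```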

Automated Collection

Using Cron

# Add to crontab (runs daily at 2 AM)
crontab -e

# If installed with uv tool or pip:
0 2 * * * /path/to/slurm-usage collect --days 2

# Or if running from source:
0 2 * * * /path/to/slurm-usage/slurm_usage.py collect --days 2
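Cron discards command output by default unless your system mails it to you. To keep a record of each nightly run, redirect stdout and stderr to a log file (the log path below is only an example):

```
0 2 * * * /path/to/slurm-usage collect --days 2 >> /var/log/slurm-usage.log 2>&1
```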