Usage

CLI Commands

The following commands are available:

 Usage: slurm_usage.py [OPTIONS] COMMAND [ARGS]...

 SLURM Job Monitor - Collect and analyze job efficiency metrics


╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
 --install-completion          Install completion for the current shell.
 --show-completion             Show completion for the current shell, to copy it or
                               customize the installation.
 --help                        Show this message and exit.
╰────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────╮
 collect   Collect job data from SLURM using parallel date-based queries.
 analyze   Analyze collected job data.
 status    Show monitoring system status.
 current   Display current cluster usage statistics from squeue.
 nodes     Display node information from SLURM.
 test      Run a quick test of the system.
╰────────────────────────────────────────────────────────────────────────────────────────╯

Example Commands

# Collect data (uses 4 parallel workers by default)
slurm-usage collect

# Collect last 7 days of data
slurm-usage collect --days 7

# Collect with more parallel workers
slurm-usage collect --n-parallel 8

# Analyze collected data
slurm-usage analyze --days 7

# Display current cluster usage
slurm-usage current

# Display node information
slurm-usage nodes

# Check system status
slurm-usage status

# Test system configuration
slurm-usage test

Note: If running from source, use ./slurm_usage.py instead of slurm-usage.

Command Options

collect - Gather job data from SLURM

  • --days/-d: Days to look back (default: 1)
  • --data-dir: Data directory location (default: ./data)
  • --summary/--no-summary: Show analysis after collection (default: True)
  • --n-parallel/-n: Number of parallel workers (default: 4)

analyze - Analyze collected data

  • --days/-d: Days to analyze (default: 7)
  • --data-dir: Data directory location

status - Show system status

  • --data-dir: Data directory location

current - Display current cluster usage

Shows real-time cluster utilization from squeue, broken down by user and partition.

nodes - Display node information

Shows information about cluster nodes including CPU and GPU counts.

test - Test system configuration

Runs a quick test of the system.

Output Structure

Data Organization

data/
├── raw/                        # Raw SLURM data (archived)
│   ├── 2025-08-19.parquet      # Daily raw records
│   ├── 2025-08-20.parquet
│   └── ...
├── processed/                  # Processed job metrics
│   ├── 2025-08-19.parquet      # Daily processed data
│   ├── 2025-08-20.parquet
│   └── ...
└── .date_completion_tracker.json  # Tracks fully processed dates
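Because the daily files are named by ISO date, working with a lookback window from your own scripts is straightforward. A minimal sketch (the function name is illustrative; it only assumes the YYYY-MM-DD.parquet naming shown above):

```python
from datetime import date, timedelta
from pathlib import Path

def processed_files_for_window(data_dir: str, days: int) -> list[Path]:
    """Return the daily processed parquet files covering the last `days` days."""
    processed = Path(data_dir) / "processed"
    # The last N days, including today, as ISO date strings.
    wanted = {(date.today() - timedelta(days=offset)).isoformat() for offset in range(days)}
    # File stems are ISO dates (YYYY-MM-DD), so a set lookup on the stem suffices.
    return sorted(p for p in processed.glob("*.parquet") if p.stem in wanted)
```

ISO dates sort chronologically as strings, so the returned list is already in date order.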

Sample Analysis Output

═══ Resource Usage by User ═══

┌─────────────┬──────┬───────────┬──────────────┬───────────┬─────────┬──────────┐
│ User        │ Jobs │ CPU Hours │ Memory GB-hrs│ GPU Hours │ CPU Eff │ Mem Eff  │
├─────────────┼──────┼───────────┼──────────────┼───────────┼─────────┼──────────┤
│ alice       │  124 │   12,847  │    48,291    │    1,024  │  45.2%  │  23.7%   │
│ bob         │   87 │    8,234  │    31,456    │      512  │  38.1%  │  18.4%   │
└─────────────┴──────┴───────────┴──────────────┴───────────┴─────────┴──────────┘

═══ Node Usage Analysis ═══

┌────────────┬──────┬───────────┬───────────┬───────────┐
│ Node       │ Jobs │ CPU Hours │ GPU Hours │ CPU Util% │
├────────────┼──────┼───────────┼───────────┼───────────┤
│ cluster-1  │  234 │   45,678  │    2,048  │   74.3%   │
│ cluster-2  │  198 │   41,234  │    1,536  │   67.1%   │
└────────────┴──────┴───────────┴───────────┴───────────┘
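The efficiency columns follow the usual SLURM accounting convention (as tools like seff compute it; the exact formula used here may differ): CPU efficiency is the CPU time a job actually consumed divided by the CPU time it reserved (allocated CPUs × wall-clock time). A sketch of that arithmetic:

```python
def cpu_hours(alloc_cpus: int, elapsed_hours: float) -> float:
    """CPU hours reserved by a job: allocated CPUs times wall-clock hours."""
    return alloc_cpus * elapsed_hours

def cpu_efficiency(used_cpu_hours: float, alloc_cpus: int, elapsed_hours: float) -> float:
    """Percentage of reserved CPU time that was actually used."""
    reserved = cpu_hours(alloc_cpus, elapsed_hours)
    return 100.0 * used_cpu_hours / reserved if reserved else 0.0

# A job that reserved 16 CPUs for 10 hours but only used 72 CPU-hours
# of compute ran at 45% CPU efficiency.
```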

Smart Re-collection

The monitor intelligently handles job state transitions:

  • Complete dates: Once all jobs for a date reach final states (COMPLETED, FAILED, CANCELLED, etc.), the date is marked complete and won't be re-queried
  • Incomplete jobs: Jobs in states like RUNNING, PENDING, or SUSPENDED are automatically re-collected on subsequent runs
  • Efficient updates: Only changed jobs are updated, minimizing processing time
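The completion-tracking logic above can be sketched roughly as follows (the function names and the set of final states here are illustrative, not the tool's actual internals): a date is marked complete only once every job recorded for it has reached a final state, and completed dates are persisted in the JSON tracker file.

```python
import json
from pathlib import Path

# Common SLURM final states (illustrative; sacct reports others too).
FINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY", "NODE_FAIL"}

def date_is_complete(job_states: list[str]) -> bool:
    """True when every job for the date is in a final state.

    States like "CANCELLED by 12345" carry a suffix, so compare the first word.
    """
    return all(state.split()[0] in FINAL_STATES for state in job_states)

def mark_complete(tracker_path: Path, day: str) -> None:
    """Record a finished date in the JSON tracker so it is skipped on later runs."""
    done = set(json.loads(tracker_path.read_text())) if tracker_path.exists() else set()
    done.add(day)
    tracker_path.write_text(json.dumps(sorted(done)))
```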

Tracked Incomplete States

The following job states indicate a job may change and will trigger re-collection:

  • Active: RUNNING, PENDING, SUSPENDED
  • Transitional: COMPLETING, CONFIGURING, STAGE_OUT, SIGNALING
  • Requeue: REQUEUED, REQUEUE_FED, REQUEUE_HOLD
  • Other: RESIZING, REVOKED, SPECIAL_EXIT
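Put together, the re-collection decision reduces to a set membership test on the job's state (a sketch of the idea; the real implementation may differ):

```python
# Job states that may still change, grouped as in the list above.
INCOMPLETE_STATES = {
    # Active
    "RUNNING", "PENDING", "SUSPENDED",
    # Transitional
    "COMPLETING", "CONFIGURING", "STAGE_OUT", "SIGNALING",
    # Requeue
    "REQUEUED", "REQUEUE_FED", "REQUEUE_HOLD",
    # Other
    "RESIZING", "REVOKED", "SPECIAL_EXIT",
}

def needs_recollection(state: str) -> bool:
    """True if the job's state may still change, so it should be re-queried."""
    # Compare only the first word: sacct may report e.g. "CANCELLED by 12345".
    return state.split()[0].upper() in INCOMPLETE_STATES
```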

Automated Collection

Using Cron

# Add to crontab (runs daily at 2 AM)
crontab -e

# If installed with uv tool or pip:
0 2 * * * /path/to/slurm-usage collect --days 2

# Or if running from source:
0 2 * * * /path/to/slurm-usage/slurm_usage.py collect --days 2
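Cron discards command output by default unless your system mails it to you. To keep a record of each nightly run, redirect stdout and stderr to a log file (the log path below is only an example):

```
0 2 * * * /path/to/slurm-usage collect --days 2 >> /var/log/slurm-usage.log 2>&1
```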