Principal

The principal is the brain of the cdktr system. While agents handle the physical execution of workflows, the principal coordinates everything—managing workflow definitions, scheduling executions, tracking agent health, and providing the central API for all system interactions. It's the single source of truth for the state of your distributed workflow system.

What Does the Principal Do?

The principal serves as the central orchestrator with several critical responsibilities:

  1. Workflow Management: The principal continuously monitors the workflow directory on the filesystem, loading and parsing YAML workflow definitions. It maintains an in-memory workflow store that gets periodically refreshed (by default, every 60 seconds) to pick up new or modified workflow definitions without requiring a restart. This hot-reload capability makes it trivial to deploy new workflows in CI/CD environments.

  2. Intelligent Scheduling: For workflows with cron schedules defined, the principal runs an integrated scheduler that monitors these schedules and automatically triggers workflows at their appointed times. The scheduler uses a priority queue internally to efficiently determine which workflow should execute next, checking every 500 milliseconds whether it's time to launch a workflow.

  3. Work Distribution: At its core, the principal maintains a global task queue of workflows ready for execution. When workflows are triggered (either by schedule, manual request, or external event), they're added to this queue. Agents continuously poll the principal for work, and when they request a workflow, the principal pops one from the queue and assigns it to that agent.

  4. Agent Lifecycle Management: The principal tracks all registered agents and their health status. When an agent starts up, it registers with the principal and begins sending heartbeats every 5 seconds. The principal runs a dedicated heartbeat monitor that checks for agents that haven't checked in within the timeout period (default: 30 seconds). If an agent times out, the principal automatically marks all workflows running on that agent as CRASHED, preventing them from being stuck in a RUNNING state indefinitely.

  5. Persistent State Management: All workflow execution history, task status updates, and logs flow through the principal and get persisted to DuckDB. The principal also periodically persists its task queue to disk (every second by default) so that if the principal crashes and restarts, it can resume processing workflows without losing queued work.

  6. API Gateway: The principal exposes a ZeroMQ-based API that serves as the primary interface for the entire system. The TUI, CLI, external event listeners, and agents all communicate with the principal through this API. It handles requests for listing workflows, triggering executions, querying logs, and checking system status.
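The responsibilities above all revolve around a small amount of shared state: the workflow store, the global task queue, and the agent registry. The sketch below models that state with Rust standard-library types; the struct and method names are illustrative assumptions, not the actual cdktr types.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::Instant;

/// Illustrative sketch of the principal's core in-memory state
/// (not the real cdktr types).
pub struct PrincipalState {
    /// Parsed workflow definitions, keyed by workflow name
    /// (the value is simplified to the raw YAML here).
    pub workflows: HashMap<String, String>,
    /// Global FIFO queue of workflow names ready for execution.
    pub task_queue: VecDeque<String>,
    /// Registered agents and the time of their last heartbeat.
    pub agents: HashMap<String, Instant>,
}

impl PrincipalState {
    pub fn new() -> Self {
        PrincipalState {
            workflows: HashMap::new(),
            task_queue: VecDeque::new(),
            agents: HashMap::new(),
        }
    }

    /// Enqueue a triggered workflow for execution.
    pub fn trigger(&mut self, name: &str) {
        self.task_queue.push_back(name.to_string());
    }

    /// Hand the next queued workflow to a polling agent, if any.
    pub fn next_work(&mut self) -> Option<String> {
        self.task_queue.pop_front()
    }
}
```

A `VecDeque` gives the first-come, first-served ordering described above; agents calling `next_work` simply drain the queue from the front.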

The Workflow Lifecycle from the Principal's Perspective

Understanding how the principal orchestrates workflow execution helps clarify its role in the system:

1. Workflow Discovery and Loading

On startup and periodically thereafter, the principal scans the configured workflow directory (default: workflows/) and loads all YAML files it finds. Each workflow is parsed, validated, and converted into an internal representation including its DAG structure. Invalid workflows are logged but don't prevent the system from starting.
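To make the loading step concrete, here is what such a definition might look like. The field names below are illustrative assumptions, not the exact cdktr schema; the important parts are the cron schedule and the task dependencies that form the DAG.

```yaml
# Hypothetical workflow definition; field names are illustrative,
# not the exact cdktr schema.
name: nightly-etl
schedule: "0 2 * * *"        # cron: run at 02:00 every day
tasks:
  extract:
    command: "python extract.py"
  transform:
    command: "python transform.py"
    depends_on: [extract]
  load:
    command: "python load.py"
    depends_on: [transform]
```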

2. Scheduling and Triggering

For workflows with cron schedules, the scheduler component calculates the next execution time and maintains a priority queue ordered by next run time. When the time arrives, the scheduler triggers the workflow by adding it to the principal's global task queue. Workflows can also be triggered manually via the CLI or TUI, or by external event listeners.
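The priority-queue idea can be sketched with a standard-library min-heap: entries are ordered by next run time, so the workflow due soonest is always at the top. This is a stdlib-only illustration of the technique, not the actual cdktr scheduler.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Min-heap of (next_run_unix_ts, workflow_name): peeking yields the
/// workflow due soonest. A sketch of the scheduler's priority queue,
/// not the actual cdktr implementation.
pub struct Scheduler {
    queue: BinaryHeap<Reverse<(u64, String)>>,
}

impl Scheduler {
    pub fn new() -> Self {
        Scheduler { queue: BinaryHeap::new() }
    }

    /// Register a workflow's next computed run time.
    pub fn schedule(&mut self, next_run_ts: u64, workflow: &str) {
        self.queue.push(Reverse((next_run_ts, workflow.to_string())));
    }

    /// Pop every workflow whose next run time has arrived.
    pub fn due(&mut self, now_ts: u64) -> Vec<String> {
        let mut ready = Vec::new();
        while let Some(Reverse((ts, _))) = self.queue.peek() {
            if *ts > now_ts {
                break; // soonest entry is still in the future
            }
            let Reverse((_, name)) = self.queue.pop().unwrap();
            ready.push(name);
        }
        ready
    }
}
```

In this model, the 500 ms polling loop described above would simply call `due()` with the current time and enqueue whatever it returns, then re-schedule each triggered workflow at its next cron occurrence.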

3. Queue Management

When a workflow is triggered, it enters the principal's global task queue. This queue is the central coordination point—agents don't know what work exists until they ask for it. The principal simply maintains the queue and serves workflows first-come, first-served to agents that request work.

4. Agent Assignment

When an agent polls for work and has available capacity, the principal removes a workflow from the queue and sends it to that agent. The principal records which agent is running which workflow instance, allowing it to track distributed execution across the cluster.
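The pop-and-assign step might look like the following sketch, which also records which agent holds which workflow so the principal can track distributed execution. The function signature and capacity check are assumptions for illustration.

```rust
use std::collections::{HashMap, VecDeque};

/// Illustrative pop-and-assign step (not the real cdktr code).
/// `running` maps agent_id -> workflows currently assigned to it.
pub fn assign_work(
    queue: &mut VecDeque<String>,
    running: &mut HashMap<String, Vec<String>>,
    agent_id: &str,
    capacity: usize,
) -> Option<String> {
    let in_flight = running.get(agent_id).map_or(0, |v| v.len());
    if in_flight >= capacity {
        return None; // agent is saturated; leave the queue untouched
    }
    let wf = queue.pop_front()?;
    running
        .entry(agent_id.to_string())
        .or_insert_with(Vec::new)
        .push(wf.clone());
    Some(wf)
}
```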

5. Status Tracking

As the agent executes the workflow, it sends status updates back to the principal:

  • Workflow started (RUNNING)
  • Individual tasks started (PENDING → RUNNING)
  • Task completions (COMPLETED or FAILED)
  • Final workflow status (COMPLETED, FAILED, or CRASHED)

The principal stores all these updates in DuckDB, building a complete audit trail of execution.
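The states above form a small state machine. The sketch below encodes the statuses named in this document and a plausible set of legal transitions; it is an assumption about the shape of the model, not the actual cdktr state machine.

```rust
/// Status model using the states named in the docs (illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pending,
    Running,
    Completed,
    Failed,
    Crashed,
}

/// Whether a status transition is legal; a sketch, not the actual
/// cdktr state machine.
pub fn valid_transition(from: Status, to: Status) -> bool {
    use Status::*;
    matches!(
        (from, to),
        (Pending, Running)
            | (Running, Completed)
            | (Running, Failed)
            | (Running, Crashed)
    )
}
```

Note that `Crashed` is only reachable from `Running`, which matches the heartbeat monitor's behavior of marking in-flight workflows on a dead agent as CRASHED.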

6. Log Aggregation

Task logs generated by agents flow back to the principal via a dedicated ZeroMQ PUB/SUB channel. The principal's log manager receives these messages and queues them for batch insertion into DuckDB every 30 seconds. This approach balances real-time log capture with database write efficiency.
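The batching side of this pipeline can be sketched with a simple buffer that drains only once the flush interval has elapsed. This stdlib-only sketch stands in for the real log manager, which writes the drained batch to DuckDB.

```rust
use std::time::{Duration, Instant};

/// Buffers incoming log lines and releases them in batches once the
/// flush interval has elapsed; a sketch of the batching idea only.
pub struct LogBatcher {
    buffer: Vec<String>,
    last_flush: Instant,
    interval: Duration,
}

impl LogBatcher {
    pub fn new(interval: Duration) -> Self {
        LogBatcher {
            buffer: Vec::new(),
            last_flush: Instant::now(),
            interval,
        }
    }

    /// Called for every log message received from agents.
    pub fn push(&mut self, line: String) {
        self.buffer.push(line);
    }

    /// Returns the drained batch when the interval has elapsed, else None.
    pub fn maybe_flush(&mut self, now: Instant) -> Option<Vec<String>> {
        if now.duration_since(self.last_flush) < self.interval || self.buffer.is_empty() {
            return None;
        }
        self.last_flush = now;
        Some(std::mem::take(&mut self.buffer))
    }
}
```

Separating `push` (fast, in-memory) from `maybe_flush` (slow, database-bound) is what lets log reception keep pace even when writes stall.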

Background Services

The principal runs several background services concurrently:

Admin Refresh Loop

Runs continuously to refresh workflow definitions from disk and persist the task queue state. This ensures that new workflow files are discovered without manual intervention and that the queue survives principal restarts.

Log Manager

Operates two components: a listener that subscribes to log messages from agents via ZeroMQ, and a persistence loop that batches log messages and writes them to DuckDB every 30 seconds. This architecture decouples log collection from database writes, preventing slow database operations from blocking log reception.

Heartbeat Monitor

Continuously scans registered agents, checking when each last sent a heartbeat. If an agent hasn't checked in within the configured timeout, the monitor marks all its running workflows as CRASHED and removes them from the active workflow tracking map.
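The sweep can be sketched as a pure function over the heartbeat map and the running-workflow map; the names and shapes here are assumptions, but the logic mirrors the description above: find stale agents, remove them, and return their workflows so they can be marked CRASHED.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Illustrative heartbeat sweep: any agent whose last heartbeat is
/// older than `timeout` is considered dead, and its running workflows
/// are returned so they can be marked CRASHED.
pub fn sweep_dead_agents(
    heartbeats: &mut HashMap<String, Instant>,
    running: &mut HashMap<String, Vec<String>>,
    now: Instant,
    timeout: Duration,
) -> Vec<String> {
    // Collect dead agent ids first to avoid mutating while iterating.
    let dead: Vec<String> = heartbeats
        .iter()
        .filter(|(_, last)| now.duration_since(**last) > timeout)
        .map(|(id, _)| id.clone())
        .collect();

    let mut crashed = Vec::new();
    for id in dead {
        heartbeats.remove(&id);
        if let Some(wfs) = running.remove(&id) {
            crashed.extend(wfs); // to be marked CRASHED in the database
        }
    }
    crashed
}
```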

Scheduler (Optional)

When enabled, the scheduler maintains its own workflow refresh loop and continuously monitors cron schedules to trigger workflows at the right time. The scheduler can be disabled via the --no-scheduler flag for testing or when you want pure manual/event-driven workflow execution.

API Server

The ZeroMQ request/reply server runs continuously, handling incoming API requests from agents, the TUI, CLI, and external systems. All requests are processed synchronously—the server receives a request, processes it, sends a response, then waits for the next request.
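Stripped of the transport, the synchronous request/reply pattern reduces to a dispatch function: one request string in, one response string out. The request verbs below are invented for illustration; the actual cdktr API and its ZeroMQ REP socket handling are omitted.

```rust
/// Sketch of synchronous request/reply dispatch. The verbs here are
/// hypothetical, not the actual cdktr API.
pub fn handle_request(request: &str, queue_len: usize) -> String {
    match request {
        "PING" => "PONG".to_string(),
        "QUEUE_LEN" => queue_len.to_string(),
        other => format!("ERR|unknown request: {}", other),
    }
}
```

Because each request is fully processed before the next is received, handlers like this must stay fast; anything slow (database writes, log persistence) is pushed to the background services described above.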

High Availability and Recovery

The principal is designed with resilience in mind:

Task Queue Persistence: Every second, the principal serializes its current task queue to disk. If the principal crashes or is restarted, it reloads this state on startup, allowing queued workflows to continue processing without being lost.
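A minimal version of this persist-and-reload cycle is sketched below, writing one queued workflow name per line. The on-disk format is an assumption for illustration; the real principal's serialization may differ.

```rust
use std::collections::VecDeque;
use std::fs;
use std::io;
use std::path::Path;

/// Persist the task queue as one workflow name per line
/// (illustrative format, not the real cdktr serialization).
pub fn persist_queue(queue: &VecDeque<String>, path: &Path) -> io::Result<()> {
    let contents: String = queue.iter().map(|w| format!("{}\n", w)).collect();
    fs::write(path, contents)
}

/// Reload the queue on startup, preserving order.
pub fn load_queue(path: &Path) -> io::Result<VecDeque<String>> {
    let contents = fs::read_to_string(path)?;
    Ok(contents.lines().map(|l| l.to_string()).collect())
}
```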

Agent Self-Healing: When an agent loses connection to the principal (perhaps due to network issues), it doesn't immediately fail. Instead, it completes any workflows already in progress, buffering logs locally until the principal becomes reachable again. This resilient design prevents cascading failures.

Graceful Degradation: If the database becomes temporarily unavailable, the principal continues operating—workflow execution proceeds, but status updates and logs queue up in memory until the database recovers. This prevents database issues from halting the entire system.

Workflow Refresh: Because workflows are reloaded from disk every minute, you can deploy new workflow definitions while the principal is running. Simply drop new YAML files into the workflow directory, and within a minute, they'll be available for scheduling or manual execution.

Configuration

The principal is configured through environment variables:

  • CDKTR_PRINCIPAL_HOST: Bind address for the API server (default: 0.0.0.0)
  • CDKTR_PRINCIPAL_PORT: API server port (default: 5561)
  • CDKTR_WORKFLOW_DIR: Directory to scan for workflow YAML files (default: workflows)
  • CDKTR_DB_PATH: Path to DuckDB database file (default: $HOME/.cdktr/app.db)
  • CDKTR_AGENT_HEARTBEAT_TIMEOUT_MS: How long to wait before marking an agent as timed out (default: 30000)
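Putting these together, a principal's environment might be set up as follows; the values mirror the defaults above except for a custom workflow directory, and the launch command is an assumption.

```shell
# Example environment for a principal instance.
export CDKTR_PRINCIPAL_HOST=0.0.0.0
export CDKTR_PRINCIPAL_PORT=5561
export CDKTR_WORKFLOW_DIR=/srv/cdktr/workflows
export CDKTR_DB_PATH="$HOME/.cdktr/app.db"
export CDKTR_AGENT_HEARTBEAT_TIMEOUT_MS=30000
# then launch the principal (binary name assumed):
# cdktr principal
```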

See the Configuration section for a complete list of configuration options.

The Principal Is Not a Single Point of Failure

While the principal is the central coordinator, it's designed to minimize the impact of failures. Agents can survive temporary principal outages by completing in-flight work and buffering status updates. When the principal returns, agents re-register and resume normal operation. For production deployments requiring higher availability, you can implement active-passive failover by running a standby principal that takes over if the primary fails, using shared storage for the DuckDB database and task queue persistence files.