ECS Deployment

Developer Reference

This page covers internal implementation details. It is not included in the User Guide.

Deploy the Escher backend on AWS ECS Fargate using the v2-deployment-ecs shell scripts and the v3-escher-deployment CI/CD pipeline.

Verified vs illustrative

Authoritative commands and topology in this page come from v2-deployment-ecs/README.md. Specifics:

  • The end-to-end deploy script is ./deploy-all.sh (not deploy.sh). ./deploy.sh setup-cluster exists for cluster-only setup.
  • Default region is us-west-1 or us-west-2 depending on environment.
  • Internal Route53 zone is escher-prod.internal.
  • Topology includes API Gateway (Lambda) for cloudwatch-logs and feedback services in addition to the ECS Fargate tasks for Gateway / UI Agent / Playbook Store.

The exact YAML task definitions, Cognito UserPoolArn, and security-group rules below are illustrative — they reflect the design pattern in the README but the precise values for your tenant come from v2-deployment-ecs/configs/. Confirm against that repo before applying.


Architecture

Internet
    │
    ▼
ALB (HTTPS :443)
  Cognito OIDC listener rule — validates access_token before forwarding
    │
    ▼
Gateway  (ECS Fargate — us-west-1 — escher-v2 cluster)
  Port 8080, arm64, 2048 CPU / 4096 MB
  Health: GET /q/health → 200
    │  private subnet ───────────────────────────────────────────┐
    ▼                                                             │
Analysis Agent             Playbook Agent                         │
  Port 8081                  Port 8082                            │
  arm64                      arm64                                │
  2048 CPU / 4096 MB         2048 CPU / 4096 MB                   │
                                                                  │
Context Engine (Port 8001) ─────▶ Asset Store (Port 8000)         │
  arm64, 1024 CPU / 2048 MB         arm64, 1024 CPU / 2048 MB     │
                                          │                       │
                                          ▼                       │
                                      SurrealDB                   │
                                  (ECS or managed) ◀──────────────┘

All backend services run in private subnets (no direct internet access). Only the ALB is internet-facing. Services discover each other via AWS Cloud Map DNS (*.escher-prod.internal, the internal Route53 zone noted above).
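The Cloud Map namespace behind those DNS names is created once per VPC. A minimal sketch, assuming the zone name from the README; the VPC and namespace IDs are placeholders:

bash
# Private DNS namespace backing *.escher-prod.internal
aws servicediscovery create-private-dns-namespace \
  --name escher-prod.internal \
  --vpc vpc-XXXXXX \
  --region us-west-1

# One discovery service per task family (Gateway shown)
aws servicediscovery create-service \
  --name gateway \
  --namespace-id ns-XXXXXX \
  --dns-config 'DnsRecords=[{Type=A,TTL=10}]' \
  --region us-west-1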


Prerequisites

  • AWS account with ECS, ECR, ALB, IAM, and Route53 / Cloud Map permissions
  • Docker images built and pushed to ECR (done by v3-escher-deployment CI/CD)
  • VPC with private subnets for ECS tasks and a public subnet pair for the ALB
  • Cognito User Pool and App Client created (for ALB auth rule)
  • Secrets Manager entries for ANTHROPIC_API_KEY (a creation sketch follows this list)
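The secret can be created up front; a minimal sketch, with the name matching the ARNs in the task definitions below and a placeholder value:

bash
aws secretsmanager create-secret \
  --name escher/anthropic-api-key \
  --secret-string "sk-ant-REPLACE_ME" \
  --region us-west-1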

Cluster and VPC configuration

Setting              Value
Cluster name         escher-v2
Region               us-west-1
VPC CIDR             10.0.0.0/16
Capacity provider    FARGATE
Container runtime    arm64 (Graviton)
bash
# Create cluster (one-time)
aws ecs create-cluster \
  --cluster-name escher-v2 \
  --capacity-providers FARGATE \
  --region us-west-1
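Confirm the cluster is active before registering services:

bash
aws ecs describe-clusters --clusters escher-v2 --region us-west-1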

Service task definitions

Each service has a task definition in v2-deployment-ecs/services/. Key parameters per component:

Gateway

yaml
# services/gateway-service.yaml
family: gateway
cpu: "2048"
memory: "4096"
requiresCompatibilities: [FARGATE]
networkMode: awsvpc
runtimePlatform:
  operatingSystemFamily: LINUX
  cpuArchitecture: ARM64
containerDefinitions:
  - name: gateway
    image: ACCOUNT.dkr.ecr.us-west-1.amazonaws.com/escher-gateway:TAG
    portMappings:
      - containerPort: 8080
        protocol: tcp
    healthCheck:
      command: ["CMD-SHELL", "curl -f http://localhost:8080/q/health || exit 1"]
      interval: 30
      timeout: 5
      retries: 3
    environment:
      - name: ANALYSIS_AGENT_URL
        value: http://analysis-agent.escher-prod.internal:8081
      - name: PLAYBOOK_AGENT_URL
        value: http://playbook-agent.escher-prod.internal:8082
      - name: CONTEXT_ENGINE_URL
        value: http://context-engine.escher-prod.internal:8001
    secrets:
      - name: COGNITO_JWKS_URL
        valueFrom: arn:aws:secretsmanager:us-west-1:ACCOUNT:secret:escher/cognito-jwks-url
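The deploy scripts register this definition and create the service; the equivalent CLI is sketched below with placeholder subnet, security-group, registry, and target-group values (the desired count is illustrative too):

bash
# Register the task definition (AWS CLI v2 accepts YAML via --cli-input-yaml)
aws ecs register-task-definition \
  --cli-input-yaml file://services/gateway-service.yaml \
  --region us-west-1

# Create the service in private subnets, attached to the ALB target group
# and the Cloud Map registry
aws ecs create-service \
  --cluster escher-v2 \
  --service-name escher-gateway \
  --task-definition gateway \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-XXXXXX],securityGroups=[sg-XXXXXX],assignPublicIp=DISABLED}' \
  --load-balancers 'targetGroupArn=arn:aws:elasticloadbalancing:...,containerName=gateway,containerPort=8080' \
  --service-registries 'registryArn=arn:aws:servicediscovery:...' \
  --region us-west-1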

Analysis Agent

yaml
family: analysis-agent
cpu: "2048"
memory: "4096"
requiresCompatibilities: [FARGATE]
networkMode: awsvpc
runtimePlatform:
  operatingSystemFamily: LINUX
  cpuArchitecture: ARM64
containerDefinitions:
  - name: analysis-agent
    image: ACCOUNT.dkr.ecr.us-west-1.amazonaws.com/escher-analysis-agent:TAG
    portMappings:
      - containerPort: 8081
    environment:
      - name: CONTEXT_ENGINE_URL
        value: http://context-engine.escher-prod.internal:8001
    secrets:
      - name: ANTHROPIC_API_KEY
        valueFrom: arn:aws:secretsmanager:us-west-1:ACCOUNT:secret:escher/anthropic-api-key

Context Engine + Asset Store

yaml
# context-engine: 1024 CPU / 2048 MB
# asset-store:    1024 CPU / 2048 MB
# Both connect to SurrealDB at SURREAL_URL
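A sketch of the shared SurrealDB wiring; the hostname, scheme, and /rpc path are assumptions, and the real values live in v2-deployment-ecs/configs/:

yaml
environment:
  - name: SURREAL_URL
    # Assumed endpoint shape; port 4000 matches the security-group rules below
    value: ws://surrealdb.escher-prod.internal:4000/rpc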

Load balancer configuration

bash
# Create ALB target group for Gateway
aws elbv2 create-target-group \
  --name escher-gateway-tg \
  --protocol HTTP \
  --port 8080 \
  --vpc-id vpc-XXXXXX \
  --target-type ip \
  --health-check-path /q/health \
  --health-check-interval-seconds 30 \
  --region us-west-1

# Add Cognito auth rule on HTTPS listener
aws elbv2 create-rule \
  --listener-arn arn:aws:elasticloadbalancing:... \
  --priority 1 \
  --conditions '[{"Field":"path-pattern","Values":["/*"]}]' \
  --actions '[
    {
      "Type": "authenticate-cognito",
      "AuthenticateCognitoConfig": {
        "UserPoolArn": "arn:aws:cognito-idp:us-west-1:ACCOUNT:userpool/POOL_ID",
        "UserPoolClientId": "CLIENT_ID",
        "UserPoolDomain": "escher-auth.auth.us-west-1.amazoncognito.com",
        "OnUnauthenticatedRequest": "deny"
      },
      "Order": 1
    },
    {
      "Type": "forward",
      "TargetGroupArn": "arn:aws:elasticloadbalancing:...",
      "Order": 2
    }
  ]'
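Once the rule is attached, confirm the Gateway targets pass health checks (the target-group ARN is elided as above):

bash
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --region us-west-1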

IAM roles

Each ECS task uses a dedicated task role and a shared task execution role.

Task execution role (shared)

Allows ECS agent to pull images from ECR and fetch secrets:

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "*"
    }
  ]
}
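Both the shared execution role and the per-service task roles are assumable only by the ECS tasks service principal, via the standard trust policy:

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ecs-tasks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}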

Task roles (per service)

Service           Extra permissions needed
Gateway           None (routes internally only)
Analysis Agent    None (calls Anthropic via internet egress)
Context Engine    None (calls Asset Store internally)
Asset Store       None (talks to SurrealDB internally)

No service requires AWS API permissions — all cloud estate operations run on the user's machine via the desktop app's local Tauri plugins, not on the backend.


Security groups

escher-alb-sg
  Ingress:  0.0.0.0/0 → 443/tcp
  Egress:   escher-gateway-sg → 8080/tcp

escher-gateway-sg
  Ingress:  escher-alb-sg → 8080/tcp
  Egress:   escher-agent-sg → 8081-8082/tcp
            escher-ce-sg → 8001/tcp

escher-agent-sg
  Ingress:  escher-gateway-sg → 8081-8082/tcp
  Egress:   escher-ce-sg → 8001/tcp
            0.0.0.0/0 → 443/tcp  (Anthropic API egress)

escher-ce-sg
  Ingress:  escher-gateway-sg → 8001/tcp
            escher-agent-sg → 8001/tcp
  Egress:   escher-asset-sg → 8000/tcp

escher-asset-sg
  Ingress:  escher-ce-sg → 8000/tcp
  Egress:   escher-surrealdb-sg → 4000/tcp

escher-surrealdb-sg
  Ingress:  escher-asset-sg → 4000/tcp
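The same rules expressed as CLI calls, shown for the Gateway pair; group and VPC IDs are placeholders:

bash
aws ec2 create-security-group \
  --group-name escher-gateway-sg \
  --description "Escher Gateway ECS tasks" \
  --vpc-id vpc-XXXXXX \
  --region us-west-1

# Only the ALB security group may reach the Gateway on 8080
aws ec2 authorize-security-group-ingress \
  --group-id sg-GATEWAY \
  --protocol tcp \
  --port 8080 \
  --source-group sg-ALB \
  --region us-west-1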

Deployment

Day-to-day deploys are driven by the v2-deployment-ecs scripts: ./deploy.sh setup-cluster for one-time cluster setup and ./deploy-all.sh for the end-to-end deploy (see Verified vs illustrative above). The deploy handles image tagging, task definition registration, and service updates:

bash
cd v2-deployment-ecs

# One-time cluster setup
./deploy.sh setup-cluster

# End-to-end deploy of all services (ordered by dependency); the
# release-tag argument is illustrative, so confirm the exact
# invocation against v2-deployment-ecs/README.md
./deploy-all.sh v2.15.0

For each service, the deploy script (equivalent CLI shown after this list):

  1. Tags the ECR image with the release version
  2. Registers a new task definition revision
  3. Calls aws ecs update-service --force-new-deployment
  4. Waits for the service to reach steady state
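Steps 3 and 4 correspond to two CLI calls, shown here for the Gateway (service names match the rollback example below):

bash
aws ecs update-service \
  --cluster escher-v2 \
  --service escher-gateway \
  --task-definition gateway \
  --force-new-deployment \
  --region us-west-1

# Block until the new deployment reaches steady state
aws ecs wait services-stable \
  --cluster escher-v2 \
  --services escher-gateway \
  --region us-west-1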

In production, deployments are triggered automatically by the CI/CD pipeline. See CI/CD.


Health checking all services

bash
# From inside the VPC (bastion or SSM session)
for svc in gateway:8080/q/health analysis-agent:8081/health playbook-agent:8082/health context-engine:8001/health asset-store:8000/health; do
  host="${svc%%:*}"        # service name (before the first ':')
  endpoint="${svc#*:}"     # port and path (after the first ':')
  echo "$host: $(curl -sf http://$host.escher-prod.internal:$endpoint || echo FAILED)"
done

# From internet (ALB endpoint, after auth)
curl -H "Authorization: Bearer $TOKEN" https://api.escher.tessell.com/q/health

Rollback

bash
# Roll back a service to the previous task definition revision
aws ecs update-service \
  --cluster escher-v2 \
  --service escher-gateway \
  --task-definition gateway:PREVIOUS_REVISION \
  --force-new-deployment \
  --region us-west-1
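To find the PREVIOUS_REVISION number for a family, list its registered task definitions newest-first:

bash
aws ecs list-task-definitions \
  --family-prefix gateway \
  --sort DESC \
  --max-items 5 \
  --region us-west-1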
