Ephemeral Environments: Eliminating Deployment Bottlenecks

Foundry24

A Series C startup came to us with a familiar problem: their engineering team was growing faster than their deployment infrastructure could handle. Developers were constantly stepping on each other’s toes, environment drift was causing “works on my machine” bugs, and the shared staging environment had become a bottleneck.

Two weeks later, they were deploying to production multiple times per day.

The Problem

The team had a single staging environment shared by 15 engineers. The workflow looked like this:

  1. Developer finishes feature on local machine
  2. Developer asks in Slack: “Anyone using staging?”
  3. Developer waits (sometimes hours) for staging to be free
  4. Developer deploys to staging, tests, finds issues
  5. Developer fixes issues, redeploys
  6. Repeat until feature works
  7. Deploy to production (usually batched weekly because it was so painful)

Environment drift made things worse. Staging and production configurations would slowly diverge, leading to bugs that only appeared in production. The team had lost confidence in their testing process.

The Solution: Ephemeral Environments

We designed a system where every pull request automatically gets its own isolated environment. Here’s how it works:

Architecture Overview

PR Opened → GitHub Actions → CloudFormation Stack → ECS Service → Unique URL
PR Closed → GitHub Actions → Stack Deleted → Resources Cleaned Up

Each environment is:

  • Isolated: Own database, own services, own URL
  • Ephemeral: Created on PR open, destroyed on PR close
  • Tagged: CloudFormation stack named and tagged with the Jira ticket number
  • Identical: Same configuration as production

GitHub Actions Workflow

The workflow triggers on PR events and manages the entire lifecycle:

name: Ephemeral Environment

on:
  pull_request:
    types: [opened, synchronize, closed]

jobs:
  deploy:
    if: github.event.action != 'closed'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Credential setup is needed before any AWS call below; the secret names
      # here are placeholders -- an OIDC role via role-to-assume also works
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Extract Jira Ticket
        id: jira
        run: |
          TICKET=$(echo "${{ github.head_ref }}" | grep -oE '[A-Z]+-[0-9]+' | head -1)
          echo "ticket=${TICKET:-pr-${{ github.event.number }}}" >> $GITHUB_OUTPUT

      - name: Build and Push to ECR
        env:
          # Placeholder secret: the registry host, e.g. <account-id>.dkr.ecr.<region>.amazonaws.com
          ECR_REGISTRY: ${{ secrets.ECR_REGISTRY }}
        run: |
          aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY
          docker build -t $ECR_REGISTRY/app:${{ steps.jira.outputs.ticket }} .
          docker push $ECR_REGISTRY/app:${{ steps.jira.outputs.ticket }}

      - name: Deploy CloudFormation Stack
        env:
          VPC_ID: ${{ secrets.VPC_ID }}
          PRIVATE_SUBNET_IDS: ${{ secrets.PRIVATE_SUBNET_IDS }}
          ECS_CLUSTER_ARN: ${{ secrets.ECS_CLUSTER_ARN }}
          EXECUTION_ROLE_ARN: ${{ secrets.EXECUTION_ROLE_ARN }}
          DB_SUBNET_GROUP: ${{ secrets.DB_SUBNET_GROUP }}
        run: |
          aws cloudformation deploy \
            --stack-name env-${{ steps.jira.outputs.ticket }} \
            --template-file infra/ephemeral.yaml \
            --parameter-overrides \
              EnvironmentName=${{ steps.jira.outputs.ticket }} \
              ImageTag=${{ steps.jira.outputs.ticket }} \
              VpcId=$VPC_ID \
              PrivateSubnetIds=$PRIVATE_SUBNET_IDS \
              ECSClusterArn=$ECS_CLUSTER_ARN \
              ExecutionRoleArn=$EXECUTION_ROLE_ARN \
              DBSubnetGroupName=$DB_SUBNET_GROUP \
            --tags \
              JiraTicket=${{ steps.jira.outputs.ticket }} \
              PRNumber=${{ github.event.number }} \
            --no-fail-on-empty-changeset

      - name: Comment PR with URL
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: '🚀 Environment deployed: https://${{ steps.jira.outputs.ticket }}.preview.example.com'
            })

  cleanup:
    if: github.event.action == 'closed'
    runs-on: ubuntu-latest
    steps:
      # Same credential assumption as the deploy job
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Extract Jira Ticket
        id: jira
        run: |
          TICKET=$(echo "${{ github.head_ref }}" | grep -oE '[A-Z]+-[0-9]+' | head -1)
          echo "ticket=${TICKET:-pr-${{ github.event.number }}}" >> $GITHUB_OUTPUT

      - name: Delete CloudFormation Stack
        run: |
          aws cloudformation delete-stack --stack-name env-${{ steps.jira.outputs.ticket }}
          # Wait so the job fails visibly if teardown gets stuck, rather than silently leaving an orphaned stack
          aws cloudformation wait stack-delete-complete --stack-name env-${{ steps.jira.outputs.ticket }}

CloudFormation Template

The template creates an isolated ECS service with its own database. Parameters reference shared infrastructure (VPC, cluster, roles) that lives outside the ephemeral stack; routing from each preview URL to its service is also handled by that shared infrastructure and isn't shown here:

AWSTemplateFormatVersion: '2010-09-09'
Description: Ephemeral environment stack

Parameters:
  EnvironmentName:
    Type: String
    Description: Jira ticket or PR identifier
  ImageTag:
    Type: String
    Description: ECR image tag to deploy
  VpcId:
    Type: AWS::EC2::VPC::Id
    Description: VPC for the environment
  PrivateSubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: Private subnets for ECS tasks
  ECSClusterArn:
    Type: String
    Description: ARN of the shared ECS cluster
  ExecutionRoleArn:
    Type: String
    Description: ARN of the ECS task execution role
  DBSubnetGroupName:
    Type: String
    Description: DB subnet group for RDS

Resources:
  # Security group for the ECS service
  ServiceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: !Sub 'Security group for ${EnvironmentName}'
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 8080
          ToPort: 8080
          CidrIp: 10.0.0.0/8

  # Security group for the database
  DBSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: !Sub 'DB security group for ${EnvironmentName}'
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 5432
          ToPort: 5432
          SourceSecurityGroupId: !Ref ServiceSecurityGroup

  # RDS PostgreSQL instance
  Database:
    Type: AWS::RDS::DBInstance
    DeletionPolicy: Delete
    Properties:
      DBInstanceIdentifier: !Sub 'db-${EnvironmentName}'
      DBInstanceClass: db.t3.micro
      Engine: postgres
      EngineVersion: '15'
      DBName: app
      MasterUsername: !Sub '{{resolve:secretsmanager:ephemeral-db-creds:SecretString:username}}'
      MasterUserPassword: !Sub '{{resolve:secretsmanager:ephemeral-db-creds:SecretString:password}}'
      AllocatedStorage: 20
      DBSubnetGroupName: !Ref DBSubnetGroupName
      VPCSecurityGroups:
        - !Ref DBSecurityGroup
      PubliclyAccessible: false
      BackupRetentionPeriod: 0
      DeleteAutomatedBackups: true

  # ECS Task Definition
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Sub 'task-${EnvironmentName}'
      Cpu: '256'
      Memory: '512'
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      ExecutionRoleArn: !Ref ExecutionRoleArn
      ContainerDefinitions:
        - Name: app
          Image: !Sub '${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/app:${ImageTag}'
          Essential: true
          PortMappings:
            - ContainerPort: 8080
              Protocol: tcp
          Environment:
            - Name: DATABASE_URL
              Value: !Sub 'postgresql://{{resolve:secretsmanager:ephemeral-db-creds:SecretString:username}}:{{resolve:secretsmanager:ephemeral-db-creds:SecretString:password}}@${Database.Endpoint.Address}:5432/app'
            - Name: ENVIRONMENT
              Value: !Ref EnvironmentName
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Ref LogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Ref EnvironmentName

  # CloudWatch Log Group
  LogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub '/ecs/ephemeral/${EnvironmentName}'
      RetentionInDays: 7

  # ECS Service
  Service:
    Type: AWS::ECS::Service
    DependsOn: Database
    Properties:
      ServiceName: !Sub 'svc-${EnvironmentName}'
      Cluster: !Ref ECSClusterArn
      TaskDefinition: !Ref TaskDefinition
      DesiredCount: 1
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          Subnets: !Ref PrivateSubnetIds
          SecurityGroups:
            - !Ref ServiceSecurityGroup
          AssignPublicIp: DISABLED

Outputs:
  ServiceUrl:
    Description: Internal service URL
    Value: !Sub 'http://${EnvironmentName}.internal:8080'
  DatabaseEndpoint:
    Description: RDS endpoint
    Value: !GetAtt Database.Endpoint.Address
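
Once a stack is up, you can read its outputs back for debugging. A minimal sketch with the AWS CLI, assuming configured credentials, the env-<ticket> stack naming from the workflow, and a hypothetical ticket ABC-123:

# List the ephemeral stack's outputs (ServiceUrl, DatabaseEndpoint)
aws cloudformation describe-stacks \
  --stack-name env-ABC-123 \
  --query "Stacks[0].Outputs" \
  --output table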

Key Design Decisions

CloudFormation for Lifecycle Management

CloudFormation’s stack-based model is ideal for ephemeral environments:

  1. Stack deletion: One command cleans up all resources (see the sketch after this list)
  2. Native AWS integration: No external state to manage
  3. Drift detection: Built-in detection of changes made outside the template
  4. Direct GitHub Actions support: No additional tooling required
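
The first and third points map directly onto single CLI calls. A minimal sketch, again assuming the env-<ticket> naming convention and a hypothetical ticket ABC-123:

# One call tears down the service, database, security groups, and log group
aws cloudformation delete-stack --stack-name env-ABC-123

# Drift detection: has anyone changed resources outside the template?
DRIFT_ID=$(aws cloudformation detect-stack-drift \
  --stack-name env-ABC-123 \
  --query StackDriftDetectionId --output text)
aws cloudformation describe-stack-drift-detection-status \
  --stack-drift-detection-id "$DRIFT_ID"

In practice the delete call comes from the cleanup job above; the drift check is mostly useful when a long-lived PR environment starts behaving differently from its template.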

Lesson Learned: Avoid Explicit Resource Names

One gotcha we hit early: explicitly naming resources can break both stack updates and teardown.

When you set properties like DBInstanceIdentifier, ServiceName, or LogGroupName, CloudFormation can't perform updates that require replacing the resource, because the replacement would collide with the existing name. Worse, if a deletion fails partway through, you can end up with orphaned resources whose names block future deployments of the same environment.

# Problematic - explicit name prevents replacement
Database:
  Type: AWS::RDS::DBInstance
  Properties:
    DBInstanceIdentifier: !Sub 'db-${EnvironmentName}'  # Can cause issues

# Better - let CloudFormation generate the name
Database:
  Type: AWS::RDS::DBInstance
  Properties:
    # No DBInstanceIdentifier - CloudFormation generates a unique name
    DBName: app  # This is the database name, not the instance identifier

For ephemeral environments where stacks are constantly created and deleted, letting CloudFormation auto-generate resource names avoids this entire class of problems. The stack name itself provides the logical grouping you need.

We kept explicit names in the template above for clarity, but in production we removed most of them.

Why ECS/Fargate?

ECS/Fargate kept things simple:

  1. Simpler mental model: Services, tasks, containers
  2. No cluster management: Fargate handles infrastructure
  3. Cost effective: Pay per task, not per node
  4. Fast iteration: Changes deploy in minutes, not hours

Database Strategy

Each environment gets its own RDS instance. Yes, this costs more than shared databases, but:

  1. True isolation: No risk of test data leaking
  2. Schema freedom: Developers can run migrations without coordination
  3. Production parity: Same database engine and version
  4. Easy cleanup: Delete stack, delete database

For cost control, we used db.t3.micro instances and set the database's DeletionPolicy to Snapshot (the template above uses Delete to keep the example simple), so a final snapshot survives teardown for debugging; snapshots older than seven days were then cleaned up.
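
CloudFormation won't expire those snapshots on its own, so the pruning ran as a small scheduled script. A sketch of the idea, assuming GNU date, configured AWS credentials, and the db-<ticket> instance naming from the template above:

# Delete final snapshots of ephemeral databases once they pass the 7-day debugging window
CUTOFF=$(date -u -d '7 days ago' +%Y-%m-%d)
aws rds describe-db-snapshots \
  --snapshot-type manual \
  --query "DBSnapshots[?starts_with(DBInstanceIdentifier, 'db-') && SnapshotCreateTime<'${CUTOFF}'].DBSnapshotIdentifier" \
  --output text | tr '\t' '\n' | while read -r SNAPSHOT; do
    echo "Pruning expired snapshot: ${SNAPSHOT}"
    aws rds delete-db-snapshot --db-snapshot-identifier "${SNAPSHOT}"
done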

Results

After two weeks of implementation:

  • Environment conflicts: Zero. Each PR has its own world.
  • Deployment frequency: From weekly batches to multiple daily deploys
  • Bug discovery: Issues found in PR environments, not production
  • Developer velocity: No more waiting for staging access
  • Onboarding: New developers productive on day one

The team’s release confidence improved dramatically. When you can test your exact changes in an isolated environment identical to production, you stop fearing deployments.

Cost Considerations

Ephemeral environments aren’t free, but they’re cheaper than you might think:

  • ECS Fargate: ~$0.04/hour for a small task
  • RDS t3.micro: ~$0.02/hour
  • Average PR lifetime: 2-3 days
  • Average cost per PR: ~$3-5 (roughly $0.06/hour for task plus database, times 48-72 hours)

Compare that to the cost of a production bug or a week of blocked development time.

Conclusion

Ephemeral environments transformed this team’s development workflow. The technical implementation was straightforward—GitHub Actions, ECS, CloudFormation—but the cultural impact was significant.

Developers stopped asking permission to test. QA could review features in isolation. Product managers could see changes before they merged. The entire release process became routine instead of an event.

If your team is fighting over shared environments, consider going ephemeral. The infrastructure cost is minimal compared to the velocity gains.


Need help implementing ephemeral environments for your team? Book a call to discuss your infrastructure challenges.