I run two EKS clusters in the same AWS account: one for staging (which
also hosts our Portainer install) and one for production. I wanted the
Portainer UI to be able to do real-time things on the prod cluster —
kubectl exec, log tailing, deploy from a Git stack — but without
exposing any new public port on the staging side, and without VPC
peering (the two VPCs use the same 192.168.0.0/16 CIDR, which makes
peering a non-starter anyway).
The answer: AWS PrivateLink. Specifically, the prod-cluster Edge agent reaches the Portainer chisel tunnel server through a VPC Interface Endpoint, which lands on an internal Network Load Balancer in the staging VPC. None of it touches the public internet.
This post is half “here is the architecture” and half “here are the six traps I hit on the way to a working tunnel,” because I think the second half is more useful than the first.
The architecture
Two channels go between the prod-cluster Edge agent and the staging Portainer server:
Portainer (staging EKS)
 ├── UI / heartbeat (:9000) ──► public ALB :443
 │                                    ▲
 └── chisel tunnel (:30776) ──► internal NLB :8000
                                      ▲
                                      │ PrivateLink
                                      │ (TCP, AWS network only)
                                      │
                           VPC Interface Endpoint
                                      ▲
                                      │
                      portainer-edge-agent (prod EKS)
- Heartbeat rides the public ALB ingress on port 443. The agent does outbound HTTPS, polls for instructions, posts snapshots. It works fine through NAT, and the public ALB only ever sees HTTPS.
- Chisel reverse tunnel is raw TCP. ALB can’t carry it (L7 only). The natural fit would be a public NLB on TCP/8000, but I didn’t want a public chisel listener on the internet — so an internal NLB plus PrivateLink instead.
The infrastructure shopping list
Conceptually it's small: six AWS resources plus one Kubernetes Service. I'll walk through each.
1. The Service in the cluster
This is the only piece that lives in Git:
# edge-tunnel-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: portainer-edge-tunnel
  namespace: portainer
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internal
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: portainer
  ports:
    - name: edge
      port: 8000
      # Helm chart launches Portainer with --tunnel-port=30776, so the
      # chisel server actually listens on 30776 inside the pod (not the
      # default 8000). Keep the NLB front-end on 8000 to match what's
      # encoded in agents' EDGE_KEYs.
      targetPort: 30776
      protocol: TCP
Two annotations worth flagging:
- aws-load-balancer-type: external is not "internet-facing" — it means "use the AWS Load Balancer Controller, not the in-tree cloud controller". The naming is genuinely confusing.
- aws-load-balancer-scheme: internal is what makes it private. This is the field PrivateLink cares about — internet-facing NLBs are rejected by the VPC Endpoint Service API.
2. A private subnet for the NLB ENIs
If your VPC is set up with public and private subnets, you can probably
skip this — just tag a private subnet with
kubernetes.io/role/internal-elb=1 and call it done. My VPC had only
public subnets (a single-tier setup), so I created one:
aws ec2 create-subnet \
--vpc-id vpc-EXAMPLE \
--cidr-block 192.168.0.0/24 \
--availability-zone eu-west-1b \
--tag-specifications 'ResourceType=subnet,Tags=[
{Key=Name,Value=eks-internal-elb-1b},
{Key=kubernetes.io/cluster/my-cluster,Value=shared},
{Key=kubernetes.io/role/internal-elb,Value=1}
]'
A few notes for the cost-conscious:
- You don't need a NAT gateway. The NLB only accepts inbound; it never originates outbound traffic. The subnet's route table can be literally empty (no 0.0.0.0/0 route at all; see the sketch after this list).
- You don't need a worker node in this subnet. With target-type: ip, the NLB targets the pod's IP directly, and intra-VPC routing handles the path between any subnets. The new subnet is just a parking lot for NLB ENIs.
- One AZ is enough for non-critical stuff like this — the tunnel isn't on the user data path. The NLB has to share at least one AZ with the consumer-side Interface Endpoint, but the consumer can be in any matching AZ.
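For what it's worth, here's a sketch of that empty route table, with hypothetical resource IDs:

# A dedicated route table with no 0.0.0.0/0 route at all; the implicit
# "local" route already covers intra-VPC traffic to the pod subnets.
RTB_ID=$(aws ec2 create-route-table --vpc-id vpc-EXAMPLE \
  --query 'RouteTable.RouteTableId' --output text)
aws ec2 associate-route-table --route-table-id "$RTB_ID" \
  --subnet-id subnet-EXAMPLE  # the eks-internal-elb-1b subnet from above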
3. The PrivateLink endpoint service
Once the Service in step 1 has provisioned its NLB, wrap the NLB in a VPC Endpoint Service:
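One way to recover $NLB_ARN for the next command, assuming the Service name from step 1: read the NLB's DNS name off the Service status, then look the ARN up by DNSName:

NLB_DNS=$(kubectl -n portainer get svc portainer-edge-tunnel \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
NLB_ARN=$(aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?DNSName=='${NLB_DNS}'].LoadBalancerArn" \
  --output text)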
aws ec2 create-vpc-endpoint-service-configuration \
--network-load-balancer-arns "$NLB_ARN" \
--acceptance-required
acceptance-required is a nice safety prompt: when a consumer tries to
connect, you have to explicitly approve the connection request once.
Same-account consumers go through it too, which I appreciate.
4. The Interface Endpoint on the consumer side
In the prod VPC, in a subnet in the same AZ as the NLB:
aws ec2 create-vpc-endpoint \
--vpc-endpoint-type Interface \
--vpc-id vpc-PROD-EXAMPLE \
--service-name com.amazonaws.vpce.eu-west-1.vpce-svc-EXAMPLE \
--subnet-ids subnet-PROD-EXAMPLE \
--security-group-ids sg-EXAMPLE
…then approve the connection on the provider side:
aws ec2 accept-vpc-endpoint-connections \
--service-id "$SVC_ID" \
--vpc-endpoint-ids "$VPCE_ID"
The endpoint exposes a regional DNS name like:
vpce-EXAMPLE-abc.vpce-svc-EXAMPLE.eu-west-1.vpce.amazonaws.com
That’s the address agents will dial as the tunnel target.
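Once the provider side is healthy (see the traps below), a debug pod in the prod cluster gives you a cheap end-to-end check. A sketch, using the endpoint DNS name from above:

# The VPCE security group (sg-EXAMPLE above) must allow TCP/8000 from
# the prod cluster's pod/node CIDR for this to succeed.
kubectl run vpce-probe --rm -i --restart=Never --image=busybox:1.36 \
  --command -- nc -zv -w3 \
  vpce-EXAMPLE-abc.vpce-svc-EXAMPLE.eu-west-1.vpce.amazonaws.com 8000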
The traps
This part is the value of the post, in my opinion. I hit six things in a row, each blocking, each annoying, each obvious in hindsight.
Trap 1: the AWS Load Balancer Controller’s IAM was scoped to ALBs
The Service creation went through, but the LB never appeared. The controller logged:
FailedBuildModel: failed to list subnets by reachability:
api error UnauthorizedOperation: not authorized to perform:
ec2:DescribeRouteTables
For internet-facing ALBs, the controller doesn’t need
DescribeRouteTables — it discovers public subnets via the
kubernetes.io/role/elb tag. For internal LBs, it has to walk route
tables to verify a subnet is actually private (no IGW route). My
controller’s IAM role was set up for ALB ingresses years ago and never
got the broader v2.x policy.
Fix: replace the policy with the canonical one from kubernetes-sigs:
ROLE=alb-ingress-controller
POLICY_ARN=$(aws iam list-attached-role-policies --role-name "$ROLE" \
--query 'AttachedPolicies[?starts_with(PolicyArn, `arn:aws:iam::123456789012`)].PolicyArn' \
--output text)
curl -o /tmp/iam_policy.json \
https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.2/docs/install/iam_policy.json
aws iam create-policy-version --policy-arn "$POLICY_ARN" \
--policy-document file:///tmp/iam_policy.json --set-as-default
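To spot-check that the new default version actually grants the missing permission, IAM's policy simulator does the job. A sketch; the role ARN here is an assumption:

aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/alb-ingress-controller \
  --action-names ec2:DescribeRouteTables \
  --query 'EvaluationResults[0].EvalDecision' --output text
# should print: allowed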
Trap 2: the chisel server isn’t on port 8000
NLB came up, target was registering, then went unhealthy —
Target.FailedHealthChecks. I assumed an SG issue and added an SG rule
(spoiler: also needed, see Trap 3). Health checks still failed.
Probing from a debug pod inside the cluster:
kubectl run nc-probe --rm -i --restart=Never -n portainer \
  --image=busybox:1.36 --command -- nc -zv -w3 portainer 8000
# nc: portainer (10.x.x.x:8000): Connection timed out
kubectl run nc-probe --rm -i --restart=Never -n portainer \
  --image=busybox:1.36 --command -- nc -zv -w3 portainer 9000
# portainer (10.x.x.x:9000) open
Port 9000 (UI) was up, 8000 (chisel) was not. I pulled the deployment spec:
kubectl -n portainer get deploy portainer \
-o jsonpath='{.spec.template.spec.containers[0].args}'
# ["--tunnel-port=30776"]
There it was. The Portainer Helm chart launches the binary with
--tunnel-port=30776 — likely so the in-pod port matches the chart’s
NodePort (which has to be in the 30000–32767 range). The container does
declare port 8000 in its containerPorts, but nothing listens there.
Fix: change targetPort in my Service to 30776 (back-end port). The
front-end stays on 8000 so the value baked into agent EDGE_KEYs doesn’t
need to change.
Trap 3: the cluster security group needs a rule
With target-type: ip on an internal NLB, the AWS LB Controller does
not automatically punch a rule into the cluster security group for
the NLB’s source range. You have to add it yourself:
aws ec2 authorize-security-group-ingress \
--group-id sg-CLUSTER-SG \
--ip-permissions 'IpProtocol=tcp,FromPort=30776,ToPort=30776,
IpRanges=[{CidrIp=192.168.0.0/24,
Description=Portainer chisel tunnel from internal NLB subnet}]'
For internet-facing ALBs created through ingresses, this Just Works because the controller manages the SG for you. With internal NLBs and IP targets, it's a manual step.
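If you're not sure which SG is "the cluster security group" (the one EKS attaches to pod ENIs, and therefore the one that matters with target-type: ip), EKS reports it directly. The cluster name here is an assumption:

aws eks describe-cluster --name my-cluster \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' --output text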
Trap 4: Edge Compute features were disabled
Even with traps 1–3 fixed, the NLB target was still unhealthy. The container declared port 8000 and my targetPort was now 30776, but my debug-pod TCP probe to 30776 also timed out. Why?
Because Portainer doesn’t bind the chisel server unless the Edge Compute features toggle is on:
curl -X PUT -H "X-API-Key: …" -H "Content-Type: application/json" \
-d '{"enableEdgeComputeFeatures": true}' \
https://portainer.example.com/api/settings
Importantly, this setting only takes effect on pod restart:
kubectl -n portainer rollout restart deploy/portainer
After the restart, port 30776 was open, NLB target went healthy, and I moved on to the next trap.
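For completeness, the same debug-pod probe from Trap 2 is the quickest way to confirm the toggle took effect after the restart:

kubectl run nc-probe --rm -i --restart=Never -n portainer \
  --image=busybox:1.36 --command -- nc -zv -w3 portainer 30776
# portainer (10.x.x.x:30776) open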
Trap 5: the install script mangles +, /, and = in the EDGE_KEY
The Portainer UI’s “Add Edge Environment” wizard hands you a curl-pipe script:
curl https://downloads.portainer.io/.../portainer-edge-agent-setup.sh \
| bash -s -- "<EDGE_ID>" "<EDGE_KEY>" "1" "" ""
The script does naive text substitution into a YAML template. If the
EDGE_KEY contains +, /, or = (extremely common — they’re valid
base64 chars and the random secret portion is base64), the substitution
corrupts the value.
The fix is to let the script run — it correctly creates the namespace, ServiceAccount, ClusterRoleBinding, headless Service, and Deployment — then overwrite the env var via the Kubernetes API, which doesn’t go through any text substitution:
kubectl -n portainer set env deploy/portainer-edge-agent \
EDGE_KEY="$EDGE_KEY"
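A quick read-back confirms the value survived intact; kubectl's jsonpath filter syntax makes it a one-liner:

kubectl -n portainer get deploy portainer-edge-agent \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="EDGE_KEY")].value}'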
This was the moment I realised the Portainer install script and I had fundamentally different ideas about what “shell-safe” meant.
Trap 6: the agent uses raw (unpadded) base64
Here’s where the debugging got entertaining. After the
kubectl set env, the agent kept fatal-erroring with:
unable to associate Edge key | error="illegal base64 data at input byte 227"
Position 227. The string was 228 chars, so byte 227 is the last
character — the = padding I’d added. The Portainer agent decodes
EDGE_KEY with raw stdlib base64 and rejects =.
So far so okay — strip the padding:
import base64
new = "<api-url>|<tunnel>|<secret>|<endpoint-id>"
EDGE_KEY = base64.b64encode(new.encode()).decode().rstrip("=")
But before I figured that out, the byte position kept moving between attempts. First attempt: byte 181. Next: 142. Next: 172. Why does the “illegal byte” position vary across attempts with the same key?
Trap 6b: long base64 strings get soft-wrapped on copy-paste
It took me longer than I’d like to admit to realise: the byte position was changing because I was copy-pasting the EDGE_KEY through a soft-wrapping viewer, which silently injected a literal newline into the string at whatever width the viewer happened to be. Different window widths on different attempts → different wrap points → different position of the first illegal byte the agent’s base64 decoder hit.
The fix is to never paste a long base64 string into a kubectl set env EDGE_KEY="…" command directly. Instead, write it to a file via a
single-quoted heredoc (which preserves your +, /, =, and is
robust to shell interpretation), strip whitespace, then reference the
variable:
cat > /tmp/edge_key.b64 <<'EOF'
<paste the long base64 here — newlines from wrap don't matter,
they get stripped in the next line>
EOF
EDGE_KEY=$(tr -d '[:space:]' < /tmp/edge_key.b64)
echo "Length: ${#EDGE_KEY}" # sanity-check length
printf '%s' "$EDGE_KEY" | base64 -d # sanity-check it decodes
kubectl -n portainer set env deploy/portainer-edge-agent EDGE_KEY="$EDGE_KEY"
If the byte-position of your base64 error is shifting from attempt to attempt, it’s not the key — it’s whitespace.
The runbook, distilled
For posterity, here’s the actual end-to-end recipe with all six traps already accounted for:
# ─── On staging (provider side) ──────────────────────────────────
# 1. Internal NLB Service
kubectl apply -f edge-tunnel-service.yaml
# 2. (one-time) Make sure the AWS LB Controller IAM role has the upstream policy
# https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/v2.13.2/docs/install/iam_policy.json
# 3. (one-time) Enable Portainer's Edge Compute features and restart the pod
curl -X PUT -H "X-API-Key: …" -H "Content-Type: application/json" \
-d '{"enableEdgeComputeFeatures": true}' \
https://portainer.example.com/api/settings
kubectl -n portainer rollout restart deploy/portainer
# 4. Add an SG rule for the chisel back-end port from the NLB subnet
aws ec2 authorize-security-group-ingress --group-id sg-CLUSTER-SG \
--protocol tcp --port 30776 --cidr 192.168.0.0/24
# 5. VPC Endpoint Service in front of the NLB
SVC_ID=$(aws ec2 create-vpc-endpoint-service-configuration \
--network-load-balancer-arns "$NLB_ARN" --acceptance-required \
--query 'ServiceConfiguration.ServiceId' --output text)
# ─── On the consumer side (prod VPC) ─────────────────────────────
# 6. Interface Endpoint
VPCE_ID=$(aws ec2 create-vpc-endpoint \
--vpc-endpoint-type Interface \
--vpc-id vpc-PROD-EXAMPLE \
--service-name "com.amazonaws.vpce.eu-west-1.${SVC_ID}" \
--subnet-ids subnet-PROD-1B \
--security-group-ids sg-VPCE-EXAMPLE \
--query 'VpcEndpoint.VpcEndpointId' --output text)
# 7. Accept on the provider side
aws ec2 accept-vpc-endpoint-connections \
--service-id "$SVC_ID" --vpc-endpoint-ids "$VPCE_ID"
VPCE_DNS=$(aws ec2 describe-vpc-endpoints --vpc-endpoint-ids "$VPCE_ID" \
--query 'VpcEndpoints[0].DnsEntries[0].DnsName' --output text)
# ─── Onboard the agent ───────────────────────────────────────────
# 8. Patch the EDGE_KEY locally to point at the VPCE DNS
python3 <<PY
import base64
ORIG = "<key from Portainer UI>"
parts = base64.b64decode(ORIG + '=' * (-len(ORIG) % 4)).decode().split('|')
parts[1] = "${VPCE_DNS}:8000"
print(base64.b64encode('|'.join(parts).encode()).decode().rstrip('='))
PY
# 9. Run Portainer's install script with the ORIGINAL key (let it
# crashloop — we just need the manifests it creates)
curl https://downloads.portainer.io/.../portainer-edge-agent-setup.sh \
| bash -s -- "<EDGE_ID>" "<ORIGINAL_KEY>" "1" "" ""
# 10. Heredoc the patched key, strip whitespace, sanity-check, patch
cat > /tmp/k <<'EOF'
<paste patched key here>
EOF
EDGE_KEY=$(tr -d '[:space:]' < /tmp/k)
kubectl -n portainer set env deploy/portainer-edge-agent EDGE_KEY="$EDGE_KEY"
kubectl -n portainer rollout status deploy/portainer-edge-agent
kubectl -n portainer logs deploy/portainer-edge-agent --tail=30
Look for “edge agent connected” and a 200 from
/api/endpoints/<id>/kubernetes/api/v1/namespaces against the Portainer
API. If you see a 500 "Unable to get the active tunnel", your
patched EDGE_KEY didn’t actually take effect or the NLB target is
unhealthy — both easy to verify.
Things that surprised me, in retrospect
- The AWS LB Controller annotation type: external does not mean what it looks like. It means "external to the in-tree controller", not "internet-facing". I keep tripping over this one and probably always will.
- Portainer's chisel port is non-default under the standard Helm chart. It's worth grepping args: on the deployment whenever you expect a service on a "well-known" port and don't see it.
- PrivateLink is genuinely cheap and isolating. Same-account, peering would have been the obvious alternative (CIDR overlap aside), but the per-resource scope of PrivateLink ("one specific NLB on one specific port, nothing else") makes for a much smaller blast radius than peering's "everything in VPC A can reach everything in VPC B". For a single tunnel like this, it's the right call.
- The Portainer agent EDGE_KEY format is documented essentially nowhere. What it contains, the encoding rules, the install script pitfalls — none of this is in the official docs. The fastest way to find any of it is to base64-decode a key and look at the pipe-separated fields, as in the snippet below.
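A padded decode plus a split on the pipes is all it takes. A sketch; the while loop compensates for the unpadded encoding from Trap 6:

KEY="<key from the Portainer UI>"
# base64 -d insists on padded input, so pad a throwaway copy first
while [ $(( ${#KEY} % 4 )) -ne 0 ]; do KEY="$KEY="; done
printf '%s' "$KEY" | base64 -d | tr '|' '\n'
# field 1: Portainer API URL    field 3: secret
# field 2: tunnel host:port     field 4: endpoint ID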
I lost about half a day to traps 5 and 6, and the rest of the work was well under an hour. If you’re standing up the same architecture, I hope this saves you that half day.