IP address masquerade in AKS
Troubleshooting problematic requests between services running in Azure Kubernetes Service (AKS) clusters in different networks can be difficult because of IP address masquerading. This post shows how we can configure masquerading in AKS to make tracing problems back to their source pod quicker and easier.
Context
For context, my environment is a “hub and spoke” architecture: multiple AKS clusters (the “spokes”), each running in its own virtual network (“vnet”), communicate with services running in a single “hub” AKS cluster, also in its own vnet. Each spoke vnet is peered to the hub vnet, all calls are made over HTTP, and ingress-nginx is the Ingress Controller. The clusters use Azure CNI Node Subnet networking, so each pod is assigned an IP address from the local subnet.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃VNet - spoke 1 (10.x.x.x/15) ┃ ┃VNet - spoke n (10.x.x.x/15) ┃
┃ ┌─────────────────────────┐ ┃ ┃ ┌─────────────────────────┐ ┃
┃ │AKS cluster - spoke 1 │ ┃ ┃ │AKS cluster - spoke n │ ┃
┃ │ │ ┃ ┃ │ │ ┃
┃ │ ┌─ ── ── ── ── ─┐ │ ┃ ┃ │ ┌─ ── ── ── ── ─┐ │ ┃
┃ │ services ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ services │ ┃
┃ │ └─ ── ── ── ── ─┘ │ ┃ ││ ┃ │ └─ ── ── ── ── ─┘ │ ┃
┃ └─────────────────────────┘ ┃ ┃ └─────────────────────────┘ ┃
┗━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┛ ││ ┗━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┛
peering requests peering
┃ ││ ┃
┏━━┻━━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━━━━━━━━┻━━┓
┃VNet - hub ││ (10.x.x.x/15) ┃
┃ ┌──────────── ──────────────────┐ ┃
┃ │ Azure Standard Load Balancer │ ┃
┃ └────────────┬┬──────────────────┘ ┃
┃ ┃
┃ ┌────────────┼┼──────────────────┐ ┃
┃ │AKS cluster │ ┃
┃ │ ┌────┴┴─────────┐ │ ┃
┃ │ │ ingress-nginx │ │ ┃
┃ │ └────┬┬─────────┘ │ ┃
┃ │ │ ┃
┃ │ ┌─ ──┴┴─ ── ── ─┐ │ ┃
┃ │ services │ ┃
┃ │ └─ ── ── ── ── ─┘ │ ┃
┃ └────────────────────────────────┘ ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
It is useful to know exactly where requests to the hub are coming from. For example, I recently dealt with a situation where client applications in the spoke clusters were making abusive calls to a service running in the hub cluster. The ingress-nginx logs showed the IP address of the node the problematic requests were coming from, but it was difficult and time-consuming to locate the source pod because there were many pods running on that node. In the end, I was able to use the client’s user agent string to identify the source pod. If the ingress-nginx logs had shown the source IP of the pod, tracking down the offending client applications would have been a lot quicker and easier.
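As an aside, a node IP can at least be narrowed down to the pods scheduled on that node with a field selector (the node name below is a placeholder), but on a busy node that still leaves a lot of candidates to rule out:
# List every pod running on a given node (node name is a placeholder)
❯ kubectl get pods --all-namespaces -o wide \
    --field-selector spec.nodeName=aks-nodepool1-12345678-vmss000000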
IP address masquerading
AKS runs ip-masq-agent-v2 as a DaemonSet, which is responsible for configuring the IP address masquerade rules on each node. By default, the agent only excludes the cluster’s own address blocks - the node subnet and vnet CIDRs among them - from masquerade, so traffic to any other destination, including other private (RFC 1918) ranges such as a peered vnet, leaves the node with the node’s IP address as its source. The ConfigMap named azure-ip-masq-agent-config-reconciled holds these default exclusions, as shown below:
#❯ kubectl get configmap azure-ip-masq-agent-config-reconciled -o yaml
apiVersion: v1
data:
  ip-masq-agent-reconciled: |
    MasqLinkLocal: true
    NonMasqueradeCIDRs:
    - 10.40.0.0/17 # subnet CIDR
    - 10.0.0.0/16
    - 10.40.0.0/15 # vnet CIDR
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/managed-by: Eno
    component: ip-masq-agent
  name: azure-ip-masq-agent-config-reconciled
  namespace: kube-system
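Under the hood, the agent renders this configuration into rules in the iptables nat table on every node. If you want to see the effect directly, something like the sketch below works: get a root shell on a node via kubectl debug and grep the nat table for the agent’s chain. The node name, debug image, and --profile flag are assumptions for my setup; any other way of getting a privileged shell on the node is fine too.
# Open a privileged debug pod on a node, then list the nat table rules whose
# chain name contains "MASQ" (node name and image are placeholders)
❯ kubectl debug node/aks-nodepool1-12345678-vmss000000 -it \
    --image=busybox --profile=sysadmin -- \
    chroot /host sh -c 'iptables -t nat -S | grep -i masq'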
This ConfigMap cannot be edited (well, it can, but any changes are automatically reverted). But, as we can see from the DaemonSet specification, there is a second, optional ConfigMap that the agent will also read.
#❯ kubectl get daemonsets.apps azure-ip-masq-agent -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Eno
    component: azure-ip-masq-agent
    kubernetes.azure.com/managedby: aks
    tier: node
  name: azure-ip-masq-agent
  namespace: kube-system
spec:
  ...
      containers:
      - image: mcr.microsoft.com/oss/v2/azure/ip-masq-agent-v2:v0.1.15-2
        name: azure-ip-masq-agent
        ...
        volumeMounts:
        - mountPath: /etc/config
          name: azure-ip-masq-agent-config-volume
        - mountPath: /run/xtables.lock
          name: iptableslock
      ...
      volumes:
      - name: azure-ip-masq-agent-config-volume
        projected:
          defaultMode: 420
          sources:
          - configMap:
              items:
              - key: ip-masq-agent # what about this one?
                mode: 444
                path: ip-masq-agent
              name: azure-ip-masq-agent-config
              optional: true
          - configMap:
              items:
              - key: ip-masq-agent-reconciled
                mode: 444
                path: ip-masq-agent-reconciled
              name: azure-ip-masq-agent-config-reconciled # shown above
              optional: true
  ...
...
A ConfigMap named azure-ip-masq-agent-config is not present in AKS by default (at least not in any of my clusters). Creating one with that name and an ip-masq-agent key, following the same structure as the reconciled ConfigMap, lets us add our own IP address blocks (10.0.0.0/8 in this case) to the non-masquerade list:
#❯ kubectl get configmap azure-ip-masq-agent-config -o yaml
apiVersion: v1
data:
  ip-masq-agent: |-
    nonMasqueradeCIDRs:
    - 10.0.0.0/8
kind: ConfigMap
metadata:
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    component: ip-masq-agent
    kubernetes.io/cluster-service: "true"
  name: azure-ip-masq-agent-config
  namespace: kube-system
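For completeness, this is roughly how that ConfigMap was created; it is just the object from the output above, written as a manifest you can kubectl apply:
❯ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: azure-ip-masq-agent-config
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    component: ip-masq-agent
    kubernetes.io/cluster-service: "true"
data:
  ip-masq-agent: |-
    nonMasqueradeCIDRs:
    - 10.0.0.0/8
EOF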
About a minute after that ConfigMap is applied, the agent picks up the change and our custom CIDR shows up in its logs alongside the reconciled entries, confirming that the new IP address block has been added (the overlapping CIDRs are awkward, but they still work):
❯ stern azure-ip-masq-agent-s4wqn
azure-ip-masq-agent-s4wqn azure-ip-masq-agent I0919 20:07:06.673421 1 ip-masq-agent.go:199] using config: {"nonMasqueradeCIDRs":["10.40.0.0/17","10.0.0.0/16","10.40.0.0/15"],"masqLinkLocal":true,"masqLinkLocalIPv6":false}
azure-ip-masq-agent-s4wqn azure-ip-masq-agent I0919 20:08:06.711897 1 ip-masq-agent.go:211] syncing config file "ip-masq-agent" at "/etc/config/"
azure-ip-masq-agent-s4wqn azure-ip-masq-agent I0919 20:08:06.712016 1 ip-masq-agent.go:211] syncing config file "ip-masq-agent-reconciled" at "/etc/config/"
azure-ip-masq-agent-s4wqn azure-ip-masq-agent I0919 20:08:06.712098 1 ip-masq-agent.go:199] using config: {"nonMasqueradeCIDRs":["10.0.0.0/8","10.40.0.0/17","10.0.0.0/16","10.40.0.0/15"],"masqLinkLocal":true,"masqLinkLocalIPv6":false}
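From there it’s easy to confirm the change end to end: make a request from a pod in a spoke cluster, then look for that pod’s IP address in the hub’s ingress-nginx access logs. A rough sketch follows; the target hostname, ingress-nginx namespace, and deployment name are assumptions for my setup.
# In a spoke cluster: run a throwaway pod, note its IP, and call a service in the hub
# (the hostname is a placeholder)
❯ kubectl run curl-test --image=curlimages/curl --restart=Never --command -- \
    curl -s http://some-service.hub.internal/healthz
❯ kubectl get pod curl-test -o wide    # shows the pod's IP address

# In the hub cluster: that pod IP, not the spoke node's IP, should now appear as the
# client address (namespace and deployment name may differ in your install)
❯ kubectl logs -n ingress-nginx deployment/ingress-nginx-controller | grep <pod-ip>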
Conclusion
With an IP address block covering all of our internal networks added to the exclusion list, the ingress-nginx access logs in the hub now show the IP address of the source pod instead of the node, which makes troubleshooting and tracking down the source of errant requests much quicker and easier.