CloudRunbook | Practical Cloud Engineering
Private Endpoints + DNS baseline: stop outages before they happen
A practical baseline for Azure Private Endpoints and DNS: ownership, zone design, resolver routing, and onboarding patterns that prevent midnight outages.
- Put every Private DNS zone in one platform subscription and guard it like any other shared service.
- Use Azure DNS Private Resolver + hub-only links; automation handles zone linking when subscriptions are vended.
- Day 0: deploy zones/resolver, run linking automation, publish the request workflow. Day 30: add policy/monitoring, integrate with ops. Quarterly: audit links vs inventory and re-test rollback playbooks.
- Terraform/Bicep only—portal deployments create outages at 02:00.
- KQL + Policy catch drift; playbooks and rollbacks exist per service.
Why you should care
Private Endpoints without governance are outage factories. The moment someone links a zone to the wrong vNet, deletes the record, or forgets reverse lookups, apps stop resolving. A baseline keeps ownership, automation, and rollback in the platform team instead of relying on every workload engineer remembering the DNS handshake.
What good looks like
- Zones in one place – all Azure Private DNS zones owned by platform, tagged, versioned.
- Hub-based resolver – Azure DNS Private Resolver (inbound/outbound) deployed to the hub, workloads linked centrally.
- Automated linking – subscription vending calls Terraform/Bicep modules to link spokes and register records.
- Policy guardrails – Azure Policy ensures Private Endpoints only land in approved subnets and zones are linked.
- Monitoring + rollback – scripts compare expected vs actual links, alerts fire on deletion, playbooks exist per service.
Baseline decision points
- Zone ownership – Platform subscription is the default; avoids sprawl and simplifies approvals.
- Linking pattern – Hub-only links + resolver vs direct workload links. Hub pattern reduces churn and RBAC bloat.
- Per-service vs shared zones – Usually one zone per Azure service (privatelink.<service>.windows.net); easier for lifecycle and rollback.
- Automation tooling – Terraform/Bicep modules invoked by vending or change pipeline. No portal.
- Resolver strategy – Azure DNS Private Resolver with inbound/outbound endpoints in hub VNets; no custom DNS VMs.
- Subscription vending hook – Decide when the linking script runs (during networking module or Private Endpoint request).
- Monitoring + drift – How you detect missing links, stale records, or unauthorised zones. Use KQL and automation.
Signal vs noise
- Enable now: Zones, resolver, linking automation, policies for subnet restrictions, monitoring scripts, request workflow.
- Enable at Day 30: Reverse lookup zones and advanced alerting once the baseline is stable.
- Probably never: Letting workloads own their own zones, manual linking (“just this once”), or building custom DNS servers because “it feels familiar.”
Phased rollout
- Day 0 baseline – Deploy zones/resolver, build Terraform module for linking, document the request workflow, integrate with vending.
- Day 30 hardening – Enable policy guardrails, add reverse lookup zones, wire drift detection into alerting, start logging.
- Quarterly review – Compare expected vs actual links, rotate credentials for the automation identities, test rollback playbooks, update service catalogue.
Runbook: Private Endpoints + DNS baseline
- Define ownership and landing zones for DNS resources
Put private DNS zones, the Azure DNS Private Resolver, and automation identities in a platform subscription. Document the RBAC split: platform controls records; workloads request via automation or service management.
- Catalogue the services you allow via Private Link
List the Azure services (e.g. Key Vault, Storage, SQL, Container Registry) that are permitted. This drives which private zones you deploy. Keep the list under version control so policy and automation stay aligned.
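One way to keep the catalogue, policy, and automation aligned is to hold the service-to-zone mapping in a single versioned file and render the Terraform input from it. The following is a minimal Python sketch, not a definitive implementation: `CATALOGUE` and `render_tfvars` are hypothetical names, and the zone names are the commonly documented public-cloud Private Link zones for these services, so check them against current Microsoft documentation for your cloud before relying on them.

```python
import json

# Hypothetical catalogue: approved service -> Private DNS zone name.
CATALOGUE = {
    "Key Vault": "privatelink.vaultcore.azure.net",
    "Storage (blob)": "privatelink.blob.core.windows.net",
    "SQL Database": "privatelink.database.windows.net",
    "Container Registry": "privatelink.azurecr.io",
}


def render_tfvars(catalogue: dict[str, str]) -> str:
    """Render JSON suitable for a *.auto.tfvars.json file with a `zones` list."""
    zones = sorted(set(catalogue.values()))
    for zone in zones:
        # Guard against typos: every entry must be a Private Link zone.
        if not zone.startswith("privatelink."):
            raise ValueError(f"not a Private Link zone: {zone}")
    return json.dumps({"zones": zones}, indent=2)


if __name__ == "__main__":
    print(render_tfvars(CATALOGUE))
```

Committing the rendered file alongside the catalogue means a pull request review covers both the approval and the resulting zone list.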
- Deploy standard private DNS zones with IaC
Create the per-service zones ahead of time. Use consistent naming and tag them with owner + lifecycle information. This Terraform snippet keeps the catalogue consistent.
variable "zones" {
  type = list(string)
  default = [
    "privatelink.vaultcore.azure.net",
    "privatelink.blob.core.windows.net",
    "privatelink.database.windows.net"
  ]
}

resource "azurerm_private_dns_zone" "this" {
  for_each            = toset(var.zones)
  name                = each.value
  resource_group_name = "rg-platform-dns"

  tags = {
    owner   = "platform-dns"
    purpose = "PrivateLink"
  }
}
- Provision Azure DNS Private Resolver in the hub
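Deploy inbound/outbound endpoints in the hub vNet. Link the private zones to the hub vNet so queries arriving at the inbound endpoint resolve against them, and attach forwarding rulesets to the outbound endpoint for on-premises or other external domains. This keeps spoke networks lightweight and avoids custom DNS servers.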
- Automate vNet linking and zone registration
Script or pipeline the linking of workload vNets to the relevant zones when a subscription onboards. Avoid manual linking; it inevitably drifts. Store metadata so you can generate reports of which vNets are linked to which zones.
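The stored metadata can be as simple as a list of expected (zone, vNet) pairs generated at onboarding time. The sketch below is illustrative only: `ZoneLink`, `expected_links`, and `report_by_vnet` are hypothetical names, and it assumes every onboarded vNet links to every approved zone (in a hub-only design the vNet list would contain just the hubs).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ZoneLink:
    zone: str
    vnet_id: str

    @property
    def link_name(self) -> str:
        # Deterministic link name derived from the vNet name (last ID segment).
        vnet_name = self.vnet_id.rstrip("/").split("/")[-1]
        return f"link-{vnet_name}".lower()


def expected_links(zones: list[str], vnet_ids: list[str]) -> list[ZoneLink]:
    """Assumption: each onboarded vNet is linked to each approved zone."""
    return [ZoneLink(z, v) for v in sorted(vnet_ids) for z in sorted(zones)]


def report_by_vnet(links: list[ZoneLink]) -> dict[str, list[str]]:
    """Group zone names by vNet so the inventory can be published as a report."""
    report: dict[str, list[str]] = defaultdict(list)
    for link in links:
        report[link.vnet_id].append(link.zone)
    return dict(report)
```

A deterministic link name makes reruns of the automation idempotent and gives the drift check a stable key to compare against.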
- Create a Private Endpoint request workflow
Provide a template (Terraform module, ARM/Bicep snippet, or service catalogue item) that workloads must follow. Include parameters for approval, target subnet, zone linking, and rollback. Make the workflow part of your change process.
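A request template is easier to enforce when the pipeline validates it before anything is deployed. This is a rough sketch under stated assumptions: the `EndpointRequest` fields and the `validate` helper are hypothetical and simply mirror the parameters listed above (approval, target subnet, zone, rollback).

```python
from dataclasses import dataclass


@dataclass
class EndpointRequest:
    service: str            # e.g. "Key Vault"
    target_subnet_id: str   # resource ID of the subnet hosting the endpoint
    zone: str               # Private DNS zone that receives the A record
    approver: str           # who signed off the change
    rollback_playbook: str  # link or path to the per-service playbook


def validate(req: EndpointRequest,
             approved_zones: set[str],
             approved_subnets: set[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors: list[str] = []
    if req.zone.lower() not in {z.lower() for z in approved_zones}:
        errors.append(f"zone not in catalogue: {req.zone}")
    # Azure resource IDs are case-insensitive, so compare lowercased.
    if req.target_subnet_id.lower() not in {s.lower() for s in approved_subnets}:
        errors.append(f"subnet not approved: {req.target_subnet_id}")
    if not req.approver.strip():
        errors.append("missing approver")
    if not req.rollback_playbook.strip():
        errors.append("missing rollback playbook")
    return errors
```

Returning all problems at once, rather than failing on the first, keeps the request loop short for workload teams.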
- Enforce policy guardrails
Use Azure Policy to deny Private Endpoints outside approved subnets, enforce Private DNS zone registration, and audit any endpoint without an RBAC-approved owner. This keeps requests visible.
- Add monitoring and drift detection
Run scheduled scripts (Logic App/Azure Automation) to compare declared zone links against actual ones. Alert when zones are unlinked or when records are missing. Combine with activity log alerts for zone deletions.
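The comparison itself is a set difference between the declared and the observed links. A minimal sketch, assuming both sides are available as (zone, vNet ID) pairs; `detect_drift` is a hypothetical helper, and how the actual links are collected (SDK, CLI, Resource Graph) is deliberately left out.

```python
Link = tuple[str, str]  # (zone name, vNet resource ID)


def _normalise(links: list[Link]) -> set[Link]:
    # DNS names and Azure resource IDs are case-insensitive: compare lowercased.
    return {(zone.lower(), vnet_id.lower()) for zone, vnet_id in links}


def detect_drift(declared: list[Link], actual: list[Link]) -> dict[str, list[Link]]:
    """Return links that should exist but do not, and links nobody declared."""
    declared_set = _normalise(declared)
    actual_set = _normalise(actual)
    return {
        "missing": sorted(declared_set - actual_set),
        "unexpected": sorted(actual_set - declared_set),
    }
```

A scheduled job could alert whenever either list is non-empty: "missing" points at removed links, "unexpected" at manual or unauthorised ones.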
- Feed the pattern into subscription vending
When a new subscription is created, ensure the standard networking template includes the DNS forwarding settings and zone link automation. Workloads should inherit the ability to resolve private endpoints without extra steps.
Policy guardrails
Use Bicep to prevent rogue Private Endpoints and ensure zone linking happens. Replace policy IDs with your own definitions.
// Deploy at management group scope, e.g. `az deployment mg create --management-group-id <id> ...`.
// The target management group becomes the scope of both resources.
targetScope = 'managementGroup'

// Region for the assignment's managed identity.
param location string

var initiativeName = 'private-link-guardrails'

resource initiative 'Microsoft.Authorization/policySetDefinitions@2021-06-01' = {
  name: initiativeName
  properties: {
    displayName: 'Private Link Guardrails'
    policyType: 'Custom'
    policyDefinitions: [
      {
        policyDefinitionReferenceId: 'allowed-subnets'
        policyDefinitionId: '/providers/Microsoft.Authorization/policyDefinitions/<allow-subnets>'
      }
      {
        policyDefinitionReferenceId: 'deploy-zone-link'
        policyDefinitionId: '/providers/Microsoft.Authorization/policyDefinitions/<deploy-zone-link>'
      }
    ]
  }
}

resource assignment 'Microsoft.Authorization/policyAssignments@2021-06-01' = {
  name: initiativeName
  location: location
  // Only needed if the zone-link policy uses deployIfNotExists or modify.
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    displayName: 'Private Link Guardrails'
    policyDefinitionId: initiative.id
  }
}
Validation checks
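KQL makes drift detection painless. This query highlights vNets whose most recent audit event was an unlink, meaning they no longer have the expected Private DNS zone link. The category and column names depend on how your audit events reach the workspace; adjust them, along with the workspace/resource group, before use.
AzureDiagnostics
| where Category == "PrivateDnsAudit"
// Keep only the latest action recorded for each vNet.
| summarize arg_max(TimeGenerated, Action_s) by VirtualNetworkName_s
| where Action_s == "UnlinkVirtualNetwork"
Validation checklist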
- ✓ All approved private DNS zones exist in the platform subscription with correct tags.
- ✓ Azure DNS Private Resolver endpoints are deployed, healthy, and documented.
- ✓ Every workload vNet is linked to the required zones via automation (no manual leftovers).
- ✓ Private Endpoint requests use the standard workflow and land in approved subnets.
- ✓ Azure Policy assignments audit or deny unsupported Private Link usage.
- ✓ Monitoring alerts fire if a zone link is removed or a zone is deleted.
- ✓ Subscription vending pipelines automatically grant workloads DNS resolution from day zero.
- ✓ Rollback procedures for each service (Key Vault, Storage, SQL, etc.) are written and tested.
Common pitfalls
- If each team creates zones and endpoints in the portal, you will never regain control. Use IaC or service catalog items.
- Putting some zones in workload subs and others in platform subs is a support nightmare. Pick one model and document it.
- Logs and diagnostics often need reverse DNS. Ignoring it makes investigations painful.
- If the resolver isn’t in place when workloads arrive, they bake in custom DNS workarounds that are hard to remove later.
- Zone deletions and link removals go unnoticed until production fails. Monitor and alert on drift.
Rollback / back-out plan
- To remove a Private Endpoint: delete the endpoint, clear the DNS A record, and run the automation that restores public connectivity if needed. Expect short outages.
- To revert resolver changes: disable forwarding rules, but note that cached results may linger until TTLs expire. Communicate clearly with workloads before flipping.
- To undo zone links: remove the link and flush DNS on affected hosts. Document why the link was removed and ensure alternatives (public endpoints or other regions) exist.
Rollback is never “instant”: cached DNS answers outlive the change, and services that depend on Private Link may need a maintenance window. Keep playbooks per service.
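Because cached answers linger, a playbook can end with a resolution check run from an affected host rather than assuming the change has taken effect. The sketch below is one possible check, not a definitive one: the function names are hypothetical, and it assumes a Private Endpoint answer is recognisable as a private-range IPv4 address.

```python
import ipaddress
import socket


def resolve_ipv4(hostname: str) -> list[str]:
    """Resolve a hostname using the host's configured resolver."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def all_private(addresses: list[str]) -> bool:
    """True when every answer is a non-public address, as expected while
    the Private Endpoint is in place. Note that `is_private` also covers
    other non-global ranges such as loopback."""
    return bool(addresses) and all(
        ipaddress.ip_address(a).is_private for a in addresses
    )


def rollback_complete(addresses: list[str]) -> bool:
    """True when the name resolves and no answer is private any more."""
    return bool(addresses) and not any(
        ipaddress.ip_address(a).is_private for a in addresses
    )
```

For example, `rollback_complete(resolve_ipv4(name))` can be polled until it returns true or the record's TTL has clearly elapsed, at which point the playbook escalates.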