CloudRunbook | Practical Cloud Engineering
Identity-first Azure: the baseline every landing zone should start with
A practical identity baseline for secure Azure architecture: admin separation, PIM, Conditional Access, workload identities, and secrets. Written as a runbook you can implement.
- Split admin identities from daily identities and enforce least privilege through PIM.
- Group-based RBAC + Conditional Access + logging is the bare minimum; portal role assignments do not scale.
- Day 0: create admin groups, PIM-eligible roles, CA baseline, break-glass. Day 30: convert noisy recommendations into policy, federate workload identities. Quarterly: review PIM activity, CA logs, exemptions.
- Secrets belong in Key Vault with private endpoints; prefer managed identities and OIDC for workloads.
- Treat the directory like code: Terraform for groups/assignments, Bicep/Policy for guardrails, KQL for validation.
Why identity-first is the fastest path to “secure by default”
In Azure, identity is your control plane. If an attacker (or a rushed engineer) can get privileged identity access, all the networking controls and policy guardrails become speed bumps.
An identity baseline is not about locking people out. It’s about creating a predictable model where:
- platform admins can do platform work safely
- workload teams can ship without needing broad rights
- credentials stop living in pipelines and random scripts
- you can explain “who can do what” in 60 seconds
What good looks like
- Admin separation – every privileged user has a dedicated admin identity, MFA, and Conditional Access policies applied.
- Group-based RBAC with PIM – roles are assigned to Entra groups, not individuals, and are eligible (not permanent).
- Documented break-glass – two emergency accounts, monitored, with credentials stored offline and tested quarterly.
- Workload identity pattern – managed identities for Azure hosts, OIDC federation for CI/CD, Key Vault for secrets behind private endpoints.
- Policy + logging – Azure Policy enforcing RBAC hygiene, PIM activity logged to a central workspace, alerts hitting the SOC.
Baseline decision points
- Scope for admin roles – Tenant root vs management group, who owns what, and which roles exist (Platform, Security, Network, Workload).
- PIM configuration – Activation durations, approval requirements, justification text, and notifications.
- Conditional Access baseline – MFA requirements, trusted locations, device compliance expectations, and break-glass exclusions.
- Break-glass process – Who holds credentials, when they’re used, and how they’re monitored and reset after use.
- Workload identity strategy – Managed identities vs service principals, when to use OIDC federation, and how to track secrets.
- Logging + alerting – Where PIM/Sign-in logs land, which alerts go to the SOC/platform, and how exemptions are tracked.
- Cost/trade-offs – Microsoft Entra ID (formerly Azure AD) P1/P2 licensing coverage (PIM and risk-based policies need P2) and training needs for engineers.
Signal vs noise
- Enable now: Separate admin accounts, PIM eligibility, CA baseline, break-glass monitoring, managed identities for new workloads.
- Enable at Day 30: Full OIDC federation for all CI/CD, automatic revocation of unused service principals, policy enforcement of MFA for workload admins.
- Probably never: Permanent Owner at tenant root, secrets scattered in pipelines, or “MFA optional” exceptions without expiry.
Phased rollout
- Day 0 baseline – Create admin groups, PIM assignments, CA policies, break-glass accounts, and Terraform modules to manage them.
- Day 30 hardening – Enable auditing policies, convert recurring identity recommendations into Azure Policy, federate major pipelines.
- Quarterly review – Use KQL to audit PIM activations, check for direct user role assignments, rotate break-glass secrets, and review exemptions.
Runbook: Identity baseline for Azure landing zones
- Create admin separation
Create separate admin accounts for anyone with privileged access.
Baseline:
- One “normal” user identity for daily work
- One “admin” identity for privileged actions (no email use, no browsing)
If you use admin workstations or hardened browser profiles, apply them to admin identities first.
- Define your admin roles (small set, least privilege)
Start with the smallest useful set.
Typical roles:
- Platform Admin: owns management groups, policy, subscription vending
- Security Admin: owns Defender/Sentinel integration and security config
- Network Admin: owns connectivity patterns, DNS zones, firewall/vWAN config
If you automate PIM assignments, keep the Terraform lean:
pim-platform-admin.tf
# Sketch only: confirm resource types and arguments against your azuread provider version before applying.
resource "azuread_group" "platform_admin" {
  display_name       = "alz-platform-admin"
  security_enabled   = true
  assignable_to_role = true # required for a group that holds directory roles
}

data "azuread_directory_role" "platform_admin" {
  display_name = "Privileged Role Administrator"
}

resource "azuread_privileged_role_assignment_schedule" "platform_admin" {
  principal_id         = azuread_group.platform_admin.object_id
  role_definition_id   = data.azuread_directory_role.platform_admin.template_id
  permanent_assignment = false

  schedule {
    type            = "Once"
    start_date_time = "2026-01-12T09:00:00Z"

    expiration {
      type     = "AfterDuration"
      duration = "PT1H"
    }
  }

  justification = "Platform maintenance"
}
Avoid giving everyone Owner. Owner is convenient, but it’s also an incident generator.
- Enable PIM for privileged role elevation
Set a policy that privileged roles are:
- Eligible (not permanent)
- Time-limited activation
- Require MFA
- Require justification (and optionally approval for the highest roles)
Practical defaults:
- 1–4 hour activation for high privilege roles
- approval required for the most sensitive roles (your call)
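The defaults above can be checked mechanically before a role goes live. The following is a minimal sketch, assuming you export each role's PIM settings into a plain dictionary. The key names (`assignment_type`, `max_activation`, `require_mfa`, `require_justification`, `high_sensitivity`, `require_approval`) are hypothetical and do not correspond to any Azure API.

```python
import re
from datetime import timedelta

# Upper bound from the "1-4 hour activation" default; tune for your tenant.
MAX_ACTIVATION = timedelta(hours=4)

# Only handles simple ISO 8601 durations made of hours and minutes.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?$")


def parse_duration(value: str) -> timedelta:
    """Parse a duration such as 'PT1H' or 'PT1H30M'."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes)


def check_pim_settings(settings: dict) -> list:
    """Return a list of findings; an empty list means the baseline is met."""
    findings = []
    if settings.get("assignment_type") != "eligible":
        findings.append("assignment is not eligible-only")
    duration = settings.get("max_activation")
    if duration is None or parse_duration(duration) > MAX_ACTIVATION:
        findings.append("activation window missing or longer than 4 hours")
    if not settings.get("require_mfa", False):
        findings.append("MFA not required on activation")
    if not settings.get("require_justification", False):
        findings.append("justification not required")
    if settings.get("high_sensitivity") and not settings.get("require_approval"):
        findings.append("approval not required for a high-sensitivity role")
    return findings
```

Running this in CI against the exported settings turns the "practical defaults" into a gate rather than a convention.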
- Create and protect break-glass accounts
Break-glass accounts exist for a scenario where Conditional Access or MFA is misconfigured and admins get locked out.
Baseline:
- At least 2 break-glass accounts
- Strong passwords stored offline in a secure process
- Exclude them from Conditional Access policies deliberately and narrowly, so a misconfigured policy cannot lock them out
- Monitor and alert on any sign-in
Document who can use them and what “emergency” means.
- Conditional Access baseline for admins
Keep this practical. You can add nuance later.
Admin baseline:
- Require MFA for admin identities
- Block legacy auth (if applicable)
- Restrict admin access to trusted locations/devices (if your org can)
- Enforce sign-in risk policies if using Identity Protection
The goal is to reduce credential replay and opportunistic admin compromise.
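As an illustration of the admin baseline, here is a small sketch that builds a policy body in the shape Microsoft Graph uses for Conditional Access policies. The function name, the display name, and the rule that at least two break-glass accounts must be excluded are assumptions for this example. The policy starts in report-only mode so you can watch its effect before enforcing it.

```python
def admin_mfa_policy(admin_role_ids: list, break_glass_ids: list) -> dict:
    """Build a report-only 'admins require MFA' policy body.

    admin_role_ids: directory role template IDs to target.
    break_glass_ids: object IDs of the emergency accounts to exclude.
    """
    if len(break_glass_ids) < 2:
        # Assumption for this sketch: refuse to build a policy that could
        # lock out every admin, including the emergency accounts.
        raise ValueError("exclude at least two break-glass accounts")
    return {
        "displayName": "CA001-admins-require-mfa",  # hypothetical naming scheme
        "state": "enabledForReportingButNotEnabled",  # report-only first
        "conditions": {
            "users": {
                "includeRoles": list(admin_role_ids),
                "excludeUsers": list(break_glass_ids),
            },
            "applications": {"includeApplications": ["All"]},
            "clientAppTypes": ["all"],
        },
        "grantControls": {"operator": "OR", "builtInControls": ["mfa"]},
    }
```

The resulting dictionary can be serialized to JSON and submitted through whatever tooling you already use for Graph. Keeping the body in source control is what makes the "reapply via source control" rollback step possible.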
- Standardize RBAC assignments at management group scope
Assign platform roles at the correct scope.
Recommended:
- Platform team roles assigned at Platform management group
- Workload team roles assigned at subscription scope or workload MG scope
If you assign high privilege at tenant root “just to make it work”, you will keep it forever.
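To catch scope drift early, a short script can scan exported role assignments. This sketch assumes each assignment is a dictionary with `principalType`, `principalName`, `roleDefinitionName` and `scope` keys, as in the JSON that `az role assignment list` produces. The set of roles treated as high privilege and the use of `/` as the root scope are assumptions to adjust for your tenant.

```python
# Roles this sketch treats as high privilege; extend as needed.
HIGH_PRIVILEGE = {"Owner", "User Access Administrator", "Contributor"}


def audit_assignments(assignments: list) -> list:
    """Flag direct user assignments and high privilege granted at root scope."""
    findings = []
    for item in assignments:
        role = item.get("roleDefinitionName", "?")
        scope = item.get("scope", "")
        who = item.get("principalName") or item.get("principalId", "?")
        if item.get("principalType") == "User":
            findings.append(f"direct user assignment: {who} has {role} at {scope}")
        if role in HIGH_PRIVILEGE and scope == "/":
            findings.append(f"high privilege at root scope: {who} has {role}")
    return findings
```

An empty result means every assignment in the export is group-based and nothing high-privilege sits at the root.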
- Workload identity: prefer managed identity and federated credentials
For apps and automation:
- Prefer managed identities for Azure-hosted workloads
- Prefer federated credentials (OIDC) for GitHub Actions / CI pipelines where possible
- Avoid long-lived service principal secrets
This drastically reduces the “credential in a pipeline” risk.
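For GitHub Actions, the federated credential comes down to an issuer, a subject and an audience that Entra ID matches against the workflow's OIDC token. The sketch below builds that body for either a branch or a GitHub environment. The function name and the credential naming scheme are hypothetical, and branch names containing characters other than letters, digits and hyphens would need sanitizing before being used in the `name` field.

```python
GITHUB_ISSUER = "https://token.actions.githubusercontent.com"
ENTRA_AUDIENCE = "api://AzureADTokenExchange"


def github_federated_credential(org, repo, *, branch=None, environment=None):
    """Build a federated credential body scoped to one branch or environment."""
    if (branch is None) == (environment is None):
        raise ValueError("pass exactly one of branch or environment")
    if branch is not None:
        subject = f"repo:{org}/{repo}:ref:refs/heads/{branch}"
        suffix = f"branch-{branch}"
    else:
        subject = f"repo:{org}/{repo}:environment:{environment}"
        suffix = f"env-{environment}"
    return {
        "name": f"{repo}-{suffix}",  # hypothetical naming scheme
        "issuer": GITHUB_ISSUER,
        "subject": subject,
        "audiences": [ENTRA_AUDIENCE],
    }
```

Scoping the subject to a single branch or environment keeps the trust narrow: a workflow on any other ref presents a different subject and is not issued a token.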
- Secrets and keys: make them boring
If you must store secrets:
- Put them in Key Vault (or your chosen secret store)
- Control access via RBAC
- Use private connectivity if you’re operating in a private network model
- Rotate secrets on a schedule
The best secret is one you don’t need. The second best is one you can rotate without drama.
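A rotation schedule only helps if something checks it. This is a minimal sketch that flags secrets needing attention, assuming you have already listed secret metadata into dictionaries with `name`, `updated_on` and `expires_on` keys holding timezone-aware datetimes. The 90-day threshold is an illustrative default, not a recommendation from any standard.

```python
from datetime import datetime, timedelta, timezone


def overdue_secrets(secrets, max_age=timedelta(days=90), now=None):
    """Return names of secrets that are stale, expired, or have no expiry set."""
    now = now or datetime.now(timezone.utc)
    overdue = []
    for secret in secrets:
        age = now - secret["updated_on"]
        expires = secret.get("expires_on")
        # A secret with no expiry date is flagged too: nothing forces rotation.
        if age > max_age or expires is None or expires <= now:
            overdue.append(secret["name"])
    return sorted(overdue)
```

Feeding the result into a ticket or a chat alert on a schedule is usually enough to keep rotation boring.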
- Logging and alerting for identity events
You should treat identity events as security signals.
Minimum:
- alert on privileged role activation
- alert on break-glass sign-in
- alert on risky sign-ins (if available)
- audit role assignment changes
If you don’t alert on privilege changes, you’re blind to the highest-risk activity.
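The minimum alert set above maps naturally onto a small classifier. The sketch below assumes events have already been normalized into dictionaries with a `kind` key plus `user`, `risk` or `role` where relevant. That event shape, the account names and the severity labels are all hypothetical and stand in for whatever your SIEM provides.

```python
from typing import Optional, Tuple

# Hypothetical break-glass account names; replace with your own.
BREAK_GLASS = {"bg-01@contoso.example", "bg-02@contoso.example"}


def classify_event(event: dict) -> Optional[Tuple[str, str]]:
    """Return (severity, reason) for events worth alerting on, else None."""
    kind = event.get("kind")
    if kind == "sign_in":
        if event.get("user", "").lower() in BREAK_GLASS:
            return ("critical", "break-glass sign-in")
        if event.get("risk") in {"medium", "high"}:
            return ("high", "risky sign-in")
        return None
    if kind == "pim_activation":
        role = event.get("role", "unknown")
        return ("medium", f"privileged role activated: {role}")
    if kind == "role_assignment_change":
        return ("medium", "role assignment changed")
    return None
```

Any break-glass sign-in is treated as critical regardless of risk score, because by definition it should only happen during a declared emergency or a scheduled test.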
Guardrails via Azure Policy
Enforce identity hygiene with policy initiatives. Use this Bicep skeleton to ensure critical roles are assigned to groups, not individuals, and that PIM eligibility is applied. Replace the policy definition IDs with your own.
// Deploy at management group scope, for example with `az deployment mg create`.
targetScope = 'managementGroup'

var initiativeName = 'identity-guardrails'

resource initiative 'Microsoft.Authorization/policySetDefinitions@2021-06-01' = {
  name: initiativeName
  properties: {
    displayName: 'Identity Guardrails'
    policyType: 'Custom'
    policyDefinitions: [
      {
        policyDefinitionReferenceId: 'deny-user-role-assignment'
        policyDefinitionId: '/providers/Microsoft.Authorization/policyDefinitions/<deny-user-role>'
      }
      {
        policyDefinitionReferenceId: 'audit-pim'
        policyDefinitionId: '/providers/Microsoft.Authorization/policyDefinitions/<audit-pim>'
      }
    ]
  }
}

// The assignment takes its scope from the deployment target, so no scope property is set here.
resource assignment 'Microsoft.Authorization/policyAssignments@2021-06-01' = {
  name: initiativeName
  properties: {
    displayName: 'Identity Guardrails'
    policyDefinitionId: initiative.id
  }
}
Validation checks
Use KQL to spot risky patterns. This query lists directory role assignments created in the last 7 days where the principal is a user account, not a group. Run it in the Log Analytics workspace that receives your Entra audit logs.
AuditLogs
| where TimeGenerated > ago(7d)
| where Category == "RoleManagement"
| where OperationName == "Add member to role"
| extend principalType = tostring(TargetResources[0].type)
| where principalType == "User"
| project TimeGenerated, InitiatedBy, TargetResources
Success criteria for your identity baseline
- ✓ Admins use separate admin identities for privileged work.
- ✓ Privileged roles are eligible and activated just-in-time via PIM.
- ✓ Break-glass accounts exist, are protected, and are monitored.
- ✓ Conditional Access baseline is enforced for admin identities.
- ✓ Workloads use managed identity or OIDC federation; secrets are not sitting in pipelines.
- ✓ Role assignments are group-based and documented.
- ✓ PIM activation logs are stored centrally and reviewed.
Common pitfalls
- Direct user role assignments lead to uncontrolled sprawl. Assign roles to groups only and enforce that via policy.
- If nobody monitors activations, PIM is just extra clicks. Wire alerts to SOC/platform.
- Permanent Conditional Access exclusions for “that one vendor” never get reviewed. Force expiry and document justification.
- Service principals with passwords older than your tenancy keep popping up. Inventory and retire them.
- Break-glass credentials stored in someone’s drawer and never tested defeat the point. Test quarterly and log usage.
Rollback / back-out plan
- PIM misconfiguration – revert the Terraform module, redeploy previous settings, and communicate downtime for activations.
- Conditional Access outage – disable the offending policy via the emergency account, fix offline, reapply via source control.
- Policy guardrail issue – disable the assignment (not the definition), patch parameters, pilot in a test MG, then reapply.
- Workload identity rollback – keep legacy service principal credentials for a limited overlap while OIDC rollouts complete. Document the sunset date.
For every rollback, log who executed it, why, and when the change will be resubmitted. Identity hygiene is only real when it survives change windows.