Senior technical leader driving cloud-native, provider-agnostic infrastructure strategy across hybrid environments (Proxmox VE on-premises, AWS/Azure as primary clouds, GCP as a secondary cloud). Balances technical direction with people leadership so that systems stay portable, resilient, secure, cost-optimized, and developer-friendly.
Responsibilities:
Infrastructure Leadership
- Architect hybrid infrastructure spanning on-premises and multi-cloud.
- Define provider-agnostic standards to avoid lock-in.
- Own IaC strategy with Terraform multi-provider modules.
- Lead disaster recovery, high availability (HA), and SLA/SLO governance.
On-Premises Infrastructure (Proxmox VE)
- Manage Proxmox clusters, HA groups, and VM/LXC provisioning.
- Govern storage (ZFS, Ceph, NFS/iSCSI) and SDN.
- Integrate with IaC/GitOps workflows.
- Operate Proxmox Backup Server and enforce RTO/RPO targets.
- Use Proxmox as the compute layer for Kubernetes clusters.
- Monitor via Prometheus/Grafana and enforce CIS security baselines.
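As an illustration of what enforcing an RPO can mean in practice, the sketch below flags guests whose newest backup is older than the target. It is a minimal example under stated assumptions: the guest names, timestamps, and the 24-hour RPO are hypothetical, and the backup times are assumed to have been collected already (no Proxmox Backup Server API is called here).

```python
from datetime import datetime, timedelta


def rpo_violations(last_backup, now, rpo):
    """Return the names of guests whose newest backup is older than the RPO."""
    return sorted(name for name, ts in last_backup.items() if now - ts > rpo)


# Hypothetical data: guest name -> time of its most recent successful backup.
now = datetime(2024, 1, 1, 12, 0)
last_backup = {
    "vm-101": datetime(2024, 1, 1, 9, 0),    # 3 hours old
    "ct-202": datetime(2023, 12, 31, 6, 0),  # 30 hours old
}

print(rpo_violations(last_backup, now, rpo=timedelta(hours=24)))  # ['ct-202']
```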
Cloud-Native Platform Engineering
- Standardize Kubernetes runtime across environments.
- Drive GitOps-first delivery (ArgoCD/Flux).
- Use Helm/Kustomize for packaging.
- Adopt OpenTelemetry for observability.
- Enforce service mesh (Istio, Cilium, Linkerd).
- Apply policy-as-code (OPA, Kyverno).
- Design provider-agnostic CI/CD pipelines.
Multi-Cloud Strategy
- Select providers based on workload/cost, not inertia.
- Manage GCP for analytics/Kubernetes.
- Onboard additional providers (Alibaba, OCI, Hetzner, etc.) seamlessly.
- Secure cross-cloud networking via WireGuard/Tailscale.
- Centralize identity federation via OIDC/SAML.
Internal Developer Platform (IDP)
- Own IDP as a product with roadmap and SLAs.
- Provide self-service provisioning across environments.
- Enable ephemeral environments on demand.
- Maintain service catalog/developer portal (Backstage, Port.io).
- Enforce RBAC/policies via OPA/Kyverno.
- Measure DevEx via DORA metrics and adoption rates.
FinOps & Cost Governance
- Treat cloud spend as an engineering concern.
- Use provider-agnostic FinOps tooling (Kubecost, OpenCost).
- Apply per-provider governance (AWS/Azure RIs, GCP CUDs, Proxmox TCO).
- Integrate billing APIs into unified dashboards.
People Management & Leadership
- Lead platform engineers organized into cloud-native teams.
- Build skills in open-source tech (Kubernetes, Terraform, Prometheus).
- Develop FinOps and DevEx champions.
- Advocate ROI of cloud-native investment.
Strategic & Cross-Functional
- Own infrastructure roadmap.
- Make workload placement decisions based on cost, latency, and compliance.
- Partner with Security/Compliance/Engineering for unified governance.
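One way to make placement decisions by cost, latency, and compliance repeatable is to treat latency and compliance as hard constraints and cost as the tie-breaker. The sketch below shows that idea only; the provider names, prices, latencies, and certification labels are hypothetical figures, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    monthly_cost: float             # estimated cost for this workload (hypothetical)
    latency_ms: float               # latency to the workload's users (hypothetical)
    certifications: frozenset[str]  # compliance attestations the provider holds


def place_workload(providers, max_latency_ms, required_certs):
    """Return the cheapest provider meeting the latency and compliance limits."""
    eligible = [
        p for p in providers
        if p.latency_ms <= max_latency_ms and required_certs <= p.certifications
    ]
    if not eligible:
        return None  # no provider satisfies the constraints
    return min(eligible, key=lambda p: p.monthly_cost)


candidates = [
    Provider("proxmox-onprem", 400.0, 5.0, frozenset({"iso27001"})),
    Provider("gcp", 650.0, 30.0, frozenset({"iso27001", "soc2"})),
    Provider("hetzner", 300.0, 45.0, frozenset({"iso27001"})),
]

choice = place_workload(
    candidates, max_latency_ms=40.0, required_certs=frozenset({"iso27001"})
)
print(choice.name)  # proxmox-onprem
```

Here the cheapest candidate is excluded because it misses the latency limit, so the next cheapest eligible provider is chosen.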
Qualifications:
Required Skills
- Cloud-Native Platform: Kubernetes, Helm, Kustomize, GitOps.
- IaC & Automation: Terraform, Ansible, Packer, Crossplane.
- Networking/Service Mesh: Istio, Cilium, Linkerd, WireGuard, OVS.
- Observability: OpenTelemetry, Prometheus, Grafana, Jaeger, ELK.
- Policy & Security: OPA, Kyverno, Vault, DevSecOps.
- On-Prem (Proxmox VE): Cluster mgmt, HA, ZFS, Ceph, PBS, SDN.
- Cloud Providers: AWS/Azure (primary), GCP (secondary).
- IDP Tools: Backstage, Port.io, Humanitec.
- CI/CD: GitLab CI, GitHub Actions, ArgoCD, Tekton.
- FinOps: Kubecost, OpenCost, CloudHealth.
- Languages/Scripting: Python, Bash, Go, Terraform HCL, TypeScript.
- Leadership: Team management, roadmap planning, stakeholder communication.
Plus Skills
- Experience integrating non-hyperscaler clouds (Alibaba, OCI, Hetzner, Cloudflare, Huawei, IBM).
- Ability to rapidly onboard providers into cloud-native stack.
Nice to Have – AI-Powered Automation
- AI-assisted IaC generation (e.g., GitHub Copilot for Terraform, Pulumi AI).
- AIOps for observability (Grafana ML, Elastic AIOps).
- AI-powered runbook automation (LangChain, OpenAI).
- ChatOps with AI for infra queries and cost summaries.
- Autonomous infra agents for natural language provisioning.
Key KPIs
- Cloud-native adoption rate (% workloads on Kubernetes).
- IDP adoption rate (% teams using self-service).
- Developer onboarding velocity.
- Deployment frequency & lead time (DORA metrics).
- Proxmox cluster availability ≥99.9%.
- Provider portability score (% workloads redeployable cross-cloud).
- Cost savings across multi-cloud and on-prem environments.
- MTTR for production incidents.
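To make two of these KPIs concrete, the sketch below works out the downtime an availability target allows and computes a portability score as a percentage. The 30-day period and the workload counts are assumptions chosen for illustration.

```python
def downtime_budget_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period at the given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def portability_score(redeployable: int, total: int) -> float:
    """Percentage of workloads that can be redeployed on another provider."""
    if total == 0:
        return 0.0
    return 100 * redeployable / total


# A 99.9% target over an assumed 30-day month leaves about 43 minutes of downtime.
print(round(downtime_budget_minutes(99.9), 1))  # 43.2

# Hypothetical inventory: 45 of 60 workloads are redeployable cross-cloud.
print(portability_score(45, 60))  # 75.0
```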