You’ll work closely with AI Engineers, Backend Engineers, and Product teams to support demanding workloads (including LLM/agent orchestration) while building a best-in-class AI engineering platform.
Your responsibilities will include:
- Architect and evolve InteractiveAI’s core infrastructure (networking, compute, storage, Kubernetes, CI/CD) to support rapid product delivery
- Build and operate cloud-agnostic Kubernetes environments across VPC, on-prem, and hybrid deployments
- Implement and maintain CI/CD for services and infrastructure (GitOps preferred), enabling safe and frequent deployments
- Define “gold standards” for reliability: SLOs/SLIs, incident response, on-call practices, runbooks, and postmortems
- Establish robust observability (metrics, logs, traces) with actionable alerting and performance monitoring
- Drive infrastructure-as-code across environments (Terraform/Pulumi/CloudFormation), including strong review/testing practices
- Own security fundamentals: secrets management, IAM, network segmentation, vulnerability management, and compliance readiness
- Improve developer productivity with internal tooling, self-service environments, and paved paths
- Optimize costs and plan capacity across clusters and workloads (including bursty inference/agent workloads)
- Mentor and grow a small DevOps team, fostering ownership, quality, and pragmatic execution
