Overview

About Finubit:
Finubit is a fast-moving startup creating the bank’s next-generation cloud platform: a modern, Kubernetes-native, AI-driven foundation that powers engineering for over a thousand developers. We’re rethinking how banks build, deploy, and operate systems at scale, combining GitOps, ChatOps, and AI automation to enable self-service, reliability, and observability across every environment. At Finubit, you’ll join a small, expert team building the backbone of a modern engineering organization, from platform automation to AI-based infrastructure orchestration.

About the Role:
As an SRE, you’ll help ensure the reliability, scalability, and performance of a multi-cluster Kubernetes ecosystem that powers the bank’s engineering platform. You’ll combine software engineering, observability, and automation to build systems that detect, prevent, and self-heal, powered by Temporal and AI ChatOps.

What You’ll Do:
* Design reliability systems for multi-cluster Kubernetes environments.
* Build self-healing, failover, and incident-response automation using Argo Workflows + Temporal.
* Define and measure SLOs, SLIs, and reliability metrics.
* Operate observability tools: Prometheus, Grafana, Loki, Tempo.
* Implement incident playbooks and automation within ChatOps.
* Collaborate with developers to build resilience and performance into applications.

What We’re Looking For:
* Understanding of Kubernetes, automation, and container orchestration.
* Familiarity with Terraform/Terragrunt and GitOps.
* Comfort with observability stacks (Prometheus, Grafana, Loki, Tempo).
* Proficiency in Python or Go for tooling.
* Enthusiasm for applying AI and automation to reliability engineering.

Why You’ll Love Working Here:
* Define what reliability means for AI-driven cloud systems.
* Build automation that transforms operations into intelligent workflows.
* Join a collaborative team focused on learning, scale, and impact.
