Overview
About Finubit
Finubit is a fast-moving startup creating the bank’s next-generation cloud platform: a modern, Kubernetes-native and AI-driven foundation that powers engineering for over a thousand developers. We’re rethinking how banks build, deploy, and operate systems at scale, combining GitOps, ChatOps, and AI automation to enable self-service, reliability, and observability across every environment. At Finubit, you’ll join a small, expert team building the backbone of a modern engineering organization, from platform automation to AI-based infrastructure orchestration.

About the Role
As an SRE, you’ll help ensure the reliability, scalability, and performance of a multi-cluster Kubernetes ecosystem that powers the bank’s engineering platform. You’ll combine software engineering, observability, and automation to build systems that detect and prevent failures and self-heal, powered by Temporal and AI ChatOps.

Responsibilities: What You’ll Do
• Design reliability systems for multi-cluster Kubernetes environments.
• Build self-healing, failover, and incident-response automation using Argo Workflows and Temporal.
• Define and measure SLOs, SLIs, and reliability metrics.
• Operate observability tools: Prometheus, Grafana, Loki, and Tempo.
• Implement incident playbooks and automation within ChatOps.
• Collaborate with developers to build resilience and performance into applications.

Requirements: What We’re Looking For
• Understanding of Kubernetes, automation, and container orchestration.
• Familiar with Terraform/Terragrunt and GitOps.
• Comfortable with observability stacks (Prometheus, Grafana, Loki, Tempo).
• Proficient in Python or Go for tooling.
• Excited to apply AI and automation to reliability engineering.

Why You’ll Love Working Here
• Define what reliability means for AI-driven cloud systems.
• Build automation that transforms operations into intelligent workflows.
• Join a collaborative team focused on learning, scale, and impact.