
Site Reliability Engineer (SRE)

Posted 2 weeks ago

Job ID

XF-00095

Category

Information Technology

Job Description

About the Role

We are looking for a talented Site Reliability Engineer (SRE) to join our engineering team in Pune. As an SRE, you will be at the intersection of software engineering and systems administration, ensuring our production systems are reliable, scalable, and performant. You will apply software engineering principles to infrastructure and operations challenges, implement automation at scale, and work closely with development teams to build resilient systems. This role is ideal for engineers who are passionate about reliability, automation, and solving complex technical challenges.

Key Responsibilities

  • Design, build, and maintain highly available, scalable, and fault-tolerant production systems and infrastructure
  • Define Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and track error budgets to quantify system reliability
  • Implement and manage comprehensive monitoring, alerting, and observability solutions using tools like Prometheus, Grafana, and Datadog
  • Lead incident response efforts, conduct blameless post-mortems, and drive root cause analysis to prevent recurrence
  • Develop automation and tooling to reduce toil and improve operational efficiency across the organization
  • Implement and maintain CI/CD pipelines ensuring safe, reliable, and frequent deployments to production
  • Design and implement disaster recovery strategies and conduct regular chaos engineering exercises
  • Collaborate with development teams to improve application reliability, performance, and operability
  • Manage cloud infrastructure on AWS including compute, networking, storage, and managed services
  • Implement Infrastructure as Code using Terraform, CloudFormation, or Pulumi for reproducible environments
  • Participate in on-call rotations and drive improvements to reduce incident frequency and duration
  • Mentor team members on SRE best practices and foster a culture of reliability and continuous improvement

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related technical field
  • 4+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering roles
  • Strong programming skills in Python, Go, or similar languages for automation and tooling development
  • Deep experience with AWS cloud services including EC2, ECS/EKS, Lambda, RDS, S3, and VPC
  • Proficiency with containerization and orchestration technologies (Docker, Kubernetes)
  • Experience with monitoring and observability tools (Prometheus, Grafana, ELK Stack, Datadog)
  • Strong understanding of networking, load balancing, DNS, and distributed systems concepts
  • Experience with Infrastructure as Code tools such as Terraform, Ansible, or CloudFormation
  • Familiarity with incident management processes and post-mortem culture
  • Excellent problem-solving skills and ability to troubleshoot complex distributed systems under pressure

What We Offer

  • Work on challenging reliability and scalability problems
  • Opportunity to implement modern SRE practices and build a reliability culture
  • Access to cutting-edge cloud technologies and tools
  • Professional development support including certifications and conference attendance
  • Collaborative environment with talented engineers who value learning
  • Competitive compensation with comprehensive benefits
  • Flexible work arrangements and focus on work-life balance

Interested in this position?

Take the next step in your career. Submit your application now and our team will review your profile.