Job Description

About the Role

We are looking for a talented Site Reliability Engineer (SRE) to join our engineering team in Pune. As an SRE, you will be at the intersection of software engineering and systems administration, ensuring our production systems are reliable, scalable, and performant. You will apply software engineering principles to infrastructure and operations challenges, implement automation at scale, and work closely with development teams to build resilient systems. This role is ideal for engineers who are passionate about reliability, automation, and solving complex technical challenges.

Key Responsibilities

Design, build, and maintain highly available, scalable, and fault-tolerant production systems and infrastructure
Define and measure Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to quantify system reliability
Implement and manage comprehensive monitoring, alerting, and observability solutions using tools like Prometheus, Grafana, and Datadog
Lead incident response efforts, conduct blameless post-mortems, and drive root cause analysis to prevent recurrence
Develop automation and tooling to reduce toil and improve operational efficiency across the organization
Implement and maintain CI/CD pipelines that enable safe, reliable, and frequent deployments to production
Design and implement disaster recovery strategies and conduct regular chaos engineering exercises
Collaborate with development teams to improve application reliability, performance, and operability
Manage cloud infrastructure on AWS, including compute, networking, storage, and managed services
Implement Infrastructure as Code using Terraform, CloudFormation, or Pulumi for reproducible environments
Participate in on-call rotations and drive improvements to reduce incident frequency and duration
Mentor team members on SRE best practices and foster a culture of reliability and continuous improvement

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related technical field
4+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering roles
Strong programming skills in Python, Go, or similar languages for automation and tooling development
Deep experience with AWS cloud services, including EC2, ECS/EKS, Lambda, RDS, S3, and VPC
Proficiency with containerization and orchestration technologies (Docker, Kubernetes)
Experience with monitoring and observability tools (Prometheus, Grafana, ELK Stack, Datadog)
Strong understanding of networking, load balancing, DNS, and distributed systems concepts
Experience with Infrastructure as Code tools such as Terraform, Ansible, or CloudFormation
Familiarity with incident management processes and post-mortem culture
Excellent problem-solving skills and the ability to troubleshoot complex distributed systems under pressure

What We Offer

Work on challenging reliability and scalability problems
Opportunity to implement modern SRE practices and build a reliability culture
Access to cutting-edge cloud technologies and tools
Professional development support, including certifications and conference attendance
Collaborative environment with talented engineers who value learning
Competitive compensation with comprehensive benefits
Flexible work arrangements and a focus on work-life balance
