Site Reliability Engineer (SRE)

Bengaluru, Karnataka, India Full-time

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. Site Reliability Engineers are responsible for keeping our internally critical and externally visible systems running smoothly 24/7/365. SREs are a blend of operations gearheads and software crafters who apply sound engineering principles, operational discipline, and mature automation. Many specialize in an area of systems, such as networking or the Linux kernel, or have a specific interest in scaling, algorithms, or distributed systems. Practices such as limiting time spent on operational work, blameless postmortems, and proactive identification of potential outages feed the iterative improvement that is key to both product quality and interesting, dynamic day-to-day work.

Go-jek is a unique organization, and it brings unique challenges: it is a super app with a multi-product, multi-country, multi-cloud setup. Millions of drivers interacting with customers, merchants, and banks in a highly concurrent, real-time environment bring their own challenges.

### As an SRE you will:

- Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation and refinement.
- Identify and troubleshoot issues across the entire stack: hardware, software, application and network.
- Manage our infrastructure with Chef, Terraform and Kubernetes.
- Maintain services and overall system health by implementing proper monitoring and symptom-based alerting policies.
- Document every action so your learnings turn into repeatable actions and then into automation.
- Improve the deployment process to make it as boring as possible.
- Design, build and maintain core infrastructure pieces that allow Go-jek's scaling to support millions of concurrent users.
- Debug production issues across services and levels of the stack.
- Plan the growth of Go-jek's infrastructure.

### You may be a fit for this role if you:
- Think about systems - edge cases, failure modes, behaviors, specific implementations.
- Know your way around Linux and the Unix Shell.
- Understand what configuration management systems like Chef are used for.
- Have strong programming skills in Ruby and/or Go.
- Have an urge to collaborate and communicate asynchronously.
- Have an urge to document all the things so you don't need to learn the same thing twice.
- Have a proactive, go-for-it attitude. When you see something broken, you can't help but fix it.
- Have an urge for delivering quickly and iterating fast.
- Have experience with Docker, HAProxy, Go, Kubernetes and Envoy.

### Projects you could work on:
- Better define alerts.
- Create an in-house status dashboard.
- Improve our TICK and Prometheus monitoring, or build new metrics.
- Better manage Kubernetes.
- Migrate Go-jek services to Kubernetes.

### Areas of expertise

- Chef
- Terraform
- Application load balancing with Envoy
- Kubernetes and containerization of our systems
- Product knowledge
- Monitoring and Metrics in Prometheus and integrations with Slack/PagerDuty
- Team organization and planning
- Backend storage management and scaling
- Disaster Recovery and High Availability strategy

Apply for this opening at