Chennai

Posted 10 months ago


Site Reliability Engineer

Full-time | Senior level | Chennai, Tamil Nadu, India | Hybrid Work Culture

Our Company

Rheo is an industrial AI platform that optimizes operations using sensors and machine learning. By combining AI-driven insights with human expertise, Rheo enhances productivity, minimizes downtime, and improves efficiency. It automates monitoring, identifies risks, and provides actionable guidance, ensuring transparency and collaboration across all levels of manufacturing.

The Opportunity

As a Site Reliability Engineer, you will play a crucial role in maintaining and optimizing cloud infrastructure, ensuring high availability, scalability, and performance. Leveraging your expertise in automation, Kubernetes, and CI/CD pipelines, you will drive reliability and efficiency across systems.

This role offers the chance to collaborate with cross-functional teams, improve monitoring and incident response processes, and enhance system resilience. If you’re passionate about cloud technologies, automation, and solving complex infrastructure challenges in a dynamic hybrid work environment, this is your opportunity to make a significant impact.

REQUIREMENTS

  • Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent work experience).
  • At least 2 years of proven experience as a DevOps Engineer, Site Reliability Engineer, or in a similar role.
  • Strong hands-on experience with infrastructure-as-code tools like Terraform, configuration management tools like Ansible, and version control systems like Git.
  • Proficiency in scripting languages such as Python, Bash, or Ruby for automation tasks.
  • In-depth knowledge of CI/CD concepts and experience with CI/CD tools like Jenkins, GitLab CI/CD, CircleCI, or GitHub Actions.
  • Extensive experience working with cloud platforms like AWS, Azure, or GCP.
  • Solid understanding of containerization technologies such as Docker and container orchestration tools like Kubernetes.
  • Familiarity with monitoring and logging solutions such as Prometheus, Grafana, and the ELK stack.
  • Excellent problem-solving skills and the ability to troubleshoot complex issues across different technology stacks.
  • Strong communication and interpersonal skills to effectively collaborate with cross-functional teams. 

WHAT YOU WILL DO

1. AWS Cloud Maintenance:

  • Maintain and optimize AWS Cloud infrastructure to ensure scalability, reliability, and performance.
  • Monitor AWS resources and services to identify and rectify potential issues before they impact the system.

2. Kubernetes Management:

  • Manage and maintain Kubernetes clusters, ensuring high availability and performance.
  • Implement best practices for container orchestration and scaling.

3. Incident Response:

  • Participate in an on-call rotation to provide 24/7 support and respond to critical incidents promptly.
  • Collaborate with cross-functional teams to troubleshoot and resolve system issues efficiently.

4. Bug Tracking and Resolution:

  • Identify and document software and infrastructure bugs, working closely with development teams to prioritize and resolve them.
  • Continuously improve monitoring and alerting systems to proactively detect issues.

5. Performance Optimization:

  • Analyze system performance and implement optimizations to enhance reliability and reduce downtime.

6. Automation:

  • Develop and maintain automation scripts and tools for provisioning, deployment, and monitoring.

7. Documentation:

  • Create and update documentation for systems, processes, and incident response procedures.

8. Security and Compliance:

  • Ensure security best practices are followed and participate in security audits and compliance initiatives.

Job Features

Job Category

Engineering
