Site Reliability Engineer (all genders)

City:  Ho Chi Minh
Job Function:  Tech
Job Area:  Product & IT
Seniority Level:  Mid-Senior level
Date:  Apr 23, 2025

HRS AS A COMPANY

HRS, a pioneer in business travel, aims to elevate every stay through innovative technology. Drawing on over 50 years of experience, HRS has built a digital platform, driven by ProcureTech, TravelTech, and FinTech, that transforms how companies and travelers Stay, Work, and Pay.

ProcureTech digitally revolutionizes lodging procurement, connecting corporations and suppliers in a cutting-edge ecosystem. This enables seamless efficiency and automation, surpassing travelers' expectations.

TravelTech redefines the online lodging experience, offering personalized content from selection to check-in, ensuring an unparalleled journey for corporate travelers.

In FinTech, HRS introduces advancements like mobile banking and digital payments, turning corporate back offices into touchless lodging enablers, eliminating legacy cost barriers. The innovative 2-click book-to-pay feature streamlines interactions for travelers and hoteliers.

Combining these technology propositions, HRS unlocks exponential catalyst effects. Their data-driven focus delivers value-added services and high-return network effects, creating substantial customer value.

Growing exponentially since 1972, HRS now serves over 35% of the global Fortune 500 as well as leading hotel chains.

Join HRS to shape the future of business travel, empowered by a culture of growth and setting new industry standards worldwide.

BUSINESS UNIT

The Site Reliability Engineering (SRE) department at HRS is fundamental to ensuring the reliability, scalability, and performance of our Lodging-as-a-Service (LaaS) platform. Our team collaborates across engineering, operations, and development teams to implement reliability standards, maintain infrastructure architecture, and achieve operational excellence while adhering to our service level objectives (SLOs) and reducing toil.

As an SRE at HRS, a key part of your role will be incident handling. You'll be at the forefront of identifying, responding to, and resolving production issues, ensuring minimal impact on our services. You'll participate in on-call rotations, requiring quick thinking and decisive action during critical incidents. Your ability to remain calm under pressure and make data-driven decisions will be crucial in maintaining our platform's reliability.

You will contribute to the reliability roadmap, support platform observability, and drive automation initiatives to enhance system resilience. Monitoring critical metrics such as error budgets, mean time to recovery (MTTR), and service level indicators (SLIs) will be part of your daily responsibilities to ensure optimal platform performance and availability. This role requires strong technical expertise in cloud infrastructure, distributed systems, and automation, combined with excellent problem-solving and incident management skills.

The department operates according to HRS's leadership principles, prioritizing system reliability and customer experience above all. We embrace a culture of blameless post-mortems, continuous improvement, and proactive problem-solving. As an SRE, you'll actively participate in incident reviews, contributing insights to prevent future occurrences and improve our overall system reliability.

SREs at HRS are innovation contributors, exploring new technologies and methodologies to improve system reliability and operational efficiency. You will work with infrastructure as code, maintain robust monitoring and alerting systems, and develop automation solutions to reduce manual intervention and improve incident response times. Our team takes full ownership of production systems, from capacity planning to disaster recovery, ensuring resilient and scalable infrastructure.

In this role, you will collaborate with team leads and other SREs to implement best practices, refine incident response procedures, and contribute to the overall reliability and performance of our LaaS platform. Your expertise in incident handling, system optimization, and proactive problem-solving will be crucial in maintaining and improving the high standards of our SRE department at HRS.

POSITION

We are seeking a competent Site Reliability Engineer with solid experience to join our team. The ideal candidate will focus on ensuring the reliability and scalability of services, working collaboratively with cross-functional teams to enhance our platform and improve processes.

CHALLENGE

  • Service Reliability: Maintain service availability and system performance, and manage capacity. Help design and implement SLOs and SLIs.
  • System Improvement: Develop and implement solutions to improve system reliability and scalability.
  • Incident Response: Participate in on-call rotations and assist in incident management and resolution. Contribute to post-incident reviews (blameless post-mortems).
  • Collaboration: Work closely with development teams to troubleshoot issues and enhance system performance.
  • Automation: Contribute to the automation of processes to improve efficiency and scalability.
  • Monitoring & Observability: Implement and maintain monitoring solutions using tools like New Relic, Kibana, Prometheus, Grafana, and Elasticsearch.

FOR THIS EXCITING MISSION YOU ARE EQUIPPED WITH...

  • Experience: 3-5 years in site reliability engineering or related areas.
  • Education: Bachelor’s degree in Computer Science, Engineering, or related field.
  • Technical Skills:
    • Proficiency in Java and Python, and familiarity with other programming languages.
    • Experience with AWS cloud services and cloud engineering practices.
    • Knowledge of monitoring tools (New Relic, Kibana, Prometheus, Grafana, Elasticsearch).
    • Strong understanding of software development methodologies.
    • Experience with infrastructure as code tools (e.g., Terraform, CloudFormation).
    • Familiarity with containerization and orchestration (e.g., Docker, Kubernetes).
    • Knowledge of networking and distributed systems.
  • Problem-Solving: Strong analytical skills and the ability to perform root cause analysis.
  • Automation: Experience with scripting and automation to enhance operational efficiency.
  • Teamwork: Ability to work effectively within a team and collaborate with cross-functional teams.
  • Soft Skills:
    • Attention to Detail: High level of accuracy and thoroughness.
    • Communication Skills: Clear and concise communication abilities.
    • Learning Mindset: Eagerness to learn and apply new technologies.
    • Proactive Approach: Initiative to identify issues before they become problems.

PERSPECTIVE

Access to a globally united and mutually responsible “Tribe of Intrapreneurs”, a global network passionately dedicated to renewing the travel industry and, in doing so, reinventing how businesses stay, work, and pay.

Our entrepreneurially driven environment of full ownership and execution focus offers you the playground to contribute to a greater mission while growing personally and professionally throughout this unique journey. You will continuously learn from a radical culture of retrospectives and continuous improvement, and actively contribute to making business life better, smarter, and more sustainable.

LOCATION, MOBILITY, INCENTIVE

The attractive remuneration is in line with the market and, in addition to a fixed monthly salary and all necessary work equipment and mobility, includes an annual or multi-year bonus.

Req ID:  18228

