Site Reliability Engineer (POS-141)

About Us:

As a Senior Site Reliability Engineer at Kenility, you’ll join a tight-knit family of creative developers, engineers, and designers who strive to develop and deliver the highest-quality products to market.

 

Technical Requirements:

  • Bachelor’s degree in Computer Science, Software Engineering, or a related field.
  • Over five years of experience in Site Reliability Engineering, DevOps, or advanced systems engineering roles.
  • Demonstrated ability to build or evolve SRE functions, with a clear understanding of best practices and foundational strategies.
  • Proficient in defining and managing operational metrics, including reporting and analysis of system stability indicators.
  • Advanced knowledge of AWS cloud services and infrastructure.
  • Strong command of secrets management tools such as AWS Secrets Manager, HashiCorp Vault, Keeper, or Infisical.
  • Hands-on experience with infrastructure as code using Terraform, as well as configuration management tools like Chef, Puppet, or Ansible.
  • Solid scripting and automation skills in Python and/or Ruby.
  • Expertise in monitoring, observability, incident response, and ensuring high service reliability.
  • Skilled in establishing and tracking SLIs, SLOs, and other key operational metrics.
  • Strong collaboration skills and a demonstrated commitment to mentoring and team development.
  • Familiarity with the Atlassian suite (Jira, Confluence, Bitbucket) is highly desirable.
  • AWS certifications are a strong plus.
  • Background in regulated industries such as insurance or fintech is considered advantageous.
  • Experience with ITSM tools like incident.io or Jira Service Management is preferred.
  • Understanding of modern DevOps practices and CI/CD pipeline management.
  • English at Upper Intermediate (B2) level at minimum, or Proficient (C1).

 

Tasks and Responsibilities:

  • Establish and implement SRE processes, best practices, and tools to lay the groundwork for long-term operational excellence.
  • Lead the evolution of the NOC into a technically proficient, automation-driven reliability function.
  • Oversee improvements in observability, enhancing monitoring, alerting, and dashboards with platforms such as Grafana, CloudWatch, and Datadog.
  • Drive automation initiatives across infrastructure and operations using tools like Terraform and scripting languages.
  • Define and track service reliability through the creation of SLIs, SLOs, and error budgets.
  • Partner with DevOps, SysOps, and engineering teams to support system reliability and performance, while mentoring NOC personnel.
  • Act as a catalyst for organizational change, fostering a proactive, forward-thinking technical culture.

 

Soft Skills:

  • Responsibility
  • Proactivity
  • Flexibility
  • Great communication skills