Site Reliability Engineer (POS-141)

About Us:

As a Senior Site Reliability Engineer at Kenility, you’ll join a tight-knit family of creative developers, engineers, and designers who strive to develop and deliver the highest-quality products to the market.

Technical Requirements:

Bachelor’s degree in Computer Science, Software Engineering, or a related field.
Over five years of experience in Site Reliability Engineering, DevOps, or advanced systems engineering roles.
Demonstrated ability to build or evolve SRE functions, with a clear understanding of best practices and foundational strategies.
Proficient in defining and managing operational metrics, including reporting and analysis of system stability indicators.
Advanced knowledge of AWS cloud services and infrastructure.
Strong command of secrets management tools such as AWS Secrets Manager, HashiCorp Vault, Keeper, or Infisical.
Hands-on experience with infrastructure as code using Terraform, as well as configuration management tools like Chef, Puppet, or Ansible.
Solid scripting and automation skills in Python and/or Ruby.
Expertise in monitoring, observability, incident response, and ensuring high service reliability.
Skilled in establishing and tracking SLIs, SLOs, and other key operational metrics.
Strong collaboration skills and a demonstrated commitment to mentoring and team development.
Familiarity with the Atlassian suite (Jira, Confluence, Bitbucket) is highly desirable.
AWS certifications are a strong plus.
Background in regulated industries such as insurance or fintech is considered advantageous.
Experience with ITSM tools like incident.io or Jira Service Management is preferred.
Understanding of modern DevOps practices and CI/CD pipeline management.
English proficiency at Upper Intermediate (B2) level or higher.

Tasks and Responsibilities:

Establish and implement SRE processes, best practices, and tools to lay the groundwork for long-term operational excellence.
Lead the evolution of the NOC into a technically proficient, automation-driven reliability function.
Oversee improvements in observability, enhancing monitoring, alerting, and dashboards with platforms such as Grafana, CloudWatch, and Datadog.
Drive automation initiatives across infrastructure and operations using tools like Terraform and scripting languages.
Define and track service reliability through the creation of SLIs, SLOs, and error budgets.
Partner with DevOps, SysOps, and engineering teams to support system reliability and performance, while mentoring NOC personnel.
Act as a catalyst for organizational change, fostering a proactive, forward-thinking technical culture.