Site Reliability Engineer (POS-141)
About Us:
As a Senior Site Reliability Engineer at Kenility, you’ll join a tight-knit family of creative developers, engineers, and designers who strive to develop and deliver the highest-quality products to market.
Technical Requirements:
- Bachelor’s degree in Computer Science, Software Engineering, or a related field.
- Over five years of experience in Site Reliability Engineering, DevOps, or advanced systems engineering roles.
- Demonstrated ability to build or evolve SRE functions, with a clear understanding of best practices and foundational strategies.
- Proficient in defining and managing operational metrics, including reporting and analysis of system stability indicators.
- Advanced knowledge of AWS cloud services and infrastructure.
- Strong command of secrets management tools such as AWS Secrets Manager, HashiCorp Vault, Keeper, or Infisical.
- Hands-on experience with infrastructure as code using Terraform, as well as configuration management tools like Chef, Puppet, or Ansible.
- Solid scripting and automation skills in Python and/or Ruby.
- Expertise in monitoring, observability, incident response, and ensuring high service reliability.
- Skilled in establishing and tracking SLIs, SLOs, and other key operational metrics.
- Strong collaboration skills and a demonstrated commitment to mentoring and team development.
- Familiarity with the Atlassian suite (Jira, Confluence, Bitbucket) is highly desirable.
- AWS certifications are a strong plus.
- Background in regulated industries such as insurance or fintech is considered advantageous.
- Experience with ITSM tools like incident.io or Jira Service Management is preferred.
- Understanding of modern DevOps practices and CI/CD pipeline management.
- English at Upper Intermediate (B2) level at minimum, or Proficient (C1).
Tasks and Responsibilities:
- Establish and implement SRE processes, best practices, and tools to lay the groundwork for long-term operational excellence.
- Lead the evolution of the Network Operations Center (NOC) into a technically proficient, automation-driven reliability function.
- Oversee improvements in observability, enhancing monitoring, alerting, and dashboards with platforms such as Grafana, CloudWatch, and Datadog.
- Drive automation initiatives across infrastructure and operations using tools like Terraform and scripting languages.
- Define and track service reliability through the creation of SLIs, SLOs, and error budgets.
- Partner with DevOps, SysOps, and engineering teams to support system reliability and performance, while mentoring NOC personnel.
- Act as a catalyst for organizational change, fostering a proactive, forward-thinking technical culture.
Soft Skills:
- Responsibility
- Proactivity
- Flexibility
- Great communication skills