About Us:
As a Senior SRE at Kenility, you’ll join a tight-knit family of creative developers, engineers, and designers who strive to develop and deliver the highest-quality products to market.
Technical Requirements:
- Bachelor’s degree in Computer Science, Software Engineering, or a related field.
- At least five years of experience in Site Reliability Engineering, DevOps, or advanced systems engineering positions.
- Demonstrated background in establishing or evolving SRE frameworks and reliability practices within an organization.
- Experience defining, tracking, and analyzing stability and performance metrics to ensure system reliability.
- Strong hands-on expertise working with Amazon Web Services (AWS).
- Solid experience with secrets management solutions such as AWS Secrets Manager, HashiCorp Vault, Keeper, Infisical, or similar tools.
- Familiarity with the Atlassian suite, including Jira, Confluence, and Bitbucket.
- Practical experience implementing infrastructure-as-code using Terraform and configuration management tools such as Chef, Puppet, or Ansible.
- Ability to collaborate closely with development teams to identify and address automation and monitoring requirements.
- Deep knowledge of monitoring, observability practices, incident management, and site reliability engineering.
- Proven ability to define and implement observability standards, incident response processes, and SLIs/SLOs.
- AWS certifications are highly valued.
- Experience with languages and platforms such as Ruby, Python, or .NET to support integrations and legacy systems is a plus.
- Background in regulated industries such as insurance or fintech is desirable.
- Familiarity with IT service management platforms like incident.io, Jira Service Management, or similar tools.
- Experience working with CI/CD pipelines and modern DevOps methodologies.
- English at Upper Intermediate (B2) level or higher, such as Proficient (C1).
Tasks and Responsibilities:
- Establish and shape the SRE function from the ground up, defining standards, processes, and tooling to build a robust reliability practice.
- Design and implement best practices, frameworks, and automation strategies that support long-term system stability and scalability.
- Lead the evolution of the existing NOC into a technically empowered, automation-focused reliability team.
- Oversee and enhance observability by improving monitoring, alerting, and dashboarding capabilities using tools such as Grafana, CloudWatch, and Datadog.
- Define and track SLIs, SLOs, and error budgets to ensure accountability for uptime and system performance objectives.
- Partner with DevOps, SysOps, and engineering teams to strengthen reliability standards and mentor team members, fostering skill development.
- Drive cultural and technical transformation by promoting forward-thinking reliability and automation practices across the organization.
Soft Skills:
- Responsibility
- Proactivity
- Flexibility
- Great communication skills