Senior Site Reliability Engineer - Remote (EU/UK only)
Background

How would you feel about shaping the future of aviation safety while leading our move from legacy infrastructure to AWS?

At SafetyManager365, we move quickly, build with intent, and focus on solving problems that genuinely matter to our customers and the aviation industry. Airlines generate huge volumes of safety reports, operational data, and investigations. Our platform helps safety teams analyse that information, identify risks earlier, and make better operational decisions.

As a newly formed team within the wider Comply365 group, we bring the energy and mindset of a startup with the backing and stability of an established enterprise. Our AI and data processing workloads already run entirely on AWS, and we are now extending that approach as we modernise the rest of the platform.

We are looking for a Senior Site Reliability Engineer to play a key role in this journey: enhancing reliability, shaping our cloud architecture, reducing operational toil, and mentoring our infrastructure team as they adopt modern cloud and SRE practices. This high-impact role sits at the intersection of reliability engineering, cloud infrastructure, and engineering enablement, and is ideal for someone who wants to combine technical leadership with hands-on engineering.

You will help define how we build, operate, and scale the platform. You will partner closely with software engineers to improve production readiness, deployment safety, observability, and operational maturity, reducing future incidents through better system design, simplification, and stronger operational foundations.

The role is full-time (40 hours per week), with core hours from 10am to 5pm CET to ensure strong team alignment and collaboration while still allowing flexibility outside those hours. It can be performed remotely and reports to the technical leadership within the SafetyManager365 group.
Key Responsibilities:
- Take end-to-end ownership of reliability and infrastructure challenges, driving issues through to resolution and ensuring long-term fixes rather than short-term workarounds
- Proactively identify risks, weaknesses, and operational bottlenecks, introducing structured solutions that move the platform from reactive operations to a reliable, strategic footing
- Design, build, and continuously improve AWS infrastructure with a focus on scalability, resilience, performance, and cost efficiency
- Lead the migration of services and workloads from legacy dedicated environments to modern, cloud-native AWS architectures
- Develop and maintain Infrastructure as Code to automate provisioning, reduce manual effort, and ensure consistency across environments
- Enhance CI/CD pipelines to enable safe, fast, and repeatable deployments, improving overall developer productivity and system stability
- Build and evolve observability capabilities across metrics, logging, tracing, and alerting to enable effective monitoring and rapid incident response
- Define and embed best practices for incident management, root cause analysis, and post-incident reviews to strengthen operational maturity
- Partner closely with engineering teams to improve production readiness, resilience, performance, and system design, and contribute to architecture decisions
- Mentor and support infrastructure engineers, promoting modern cloud and SRE practices and using AI tools where appropriate to improve automation, efficiency, and engineering outcomes

Skills & Qualifications:
- A proactive, ownership-driven mindset and the ability to lead from the front: identifying problems, proposing solutions, and driving them through to completion
- Strong strategic thinking combined with hands-on technical execution, balancing immediate operational needs with long-term platform improvements
- Proven ability to operate independently in complex, ambiguous environments, using sound judgement and strong problem-solving skills
- Significant experience running production SaaS platforms in a senior SRE, platform, or infrastructure engineering role
- Deep, hands-on expertise with AWS, including designing and operating scalable, resilient, and cost-efficient cloud architectures
- Demonstrated experience migrating and modernising legacy infrastructure into cloud-native environments
- Strong foundations in Linux, networking, and systems internals, with a clear understanding of production operations at scale
- Experience designing for high availability, fault tolerance, and overall system resilience
- Proficiency with Infrastructure as Code tools such as Terraform, Pulumi, or similar, alongside experience improving CI/CD and automation practices
- Strong communication and collaboration skills and a mentoring mindset, with experience working closely with engineering teams to improve reliability, observability, and operational maturity

Nice to haves:
- A proven track record of defining and operating SLIs, SLOs, and error budgets to drive reliability and performance improvements
- Familiarity with modern observability tooling such as Prometheus, Grafana, and Datadog, and with distributed tracing practices
- Exposure to distributed systems or microservices-based architectures, including service meshes or chaos engineering approaches
- Exposure to AI, ML, or data processing workloads in production, with a pragmatic approach to using AI tools to improve efficiency and outcomes
- A background in regulated, compliance-sensitive, or safety-critical environments, with an interest in aviation or operational risk systems

Benefits:
- Equipment: laptop, monitor, and whatever else you need to be productive, although we strongly prefer the team to work on a MacBook for tool compatibility
- Workspace: WeWork membership, if available in your city
- Annual learning budget: courses, books, conferences, coaching. Your call, with minimal approval overhead
- Conference and speaking budget: attend industry events, and speak at them if that is your thing
- AI tooling budget: pick the tools that make you better and we will cover them
- Annual team offsite and regular in-person meetups

Your application implies your consent to the processing of your personal data as outlined in our Privacy Policy.