Job Description
Roles & Responsibilities
Key Responsibilities
Cloud Infrastructure
- Operate and improve the AWS infrastructure (EC2, ECS, Lambda, RDS, ElastiCache, SQS, IAM, VPC, networking, and Route 53), along with Cloudflare.
- Drive Infrastructure as Code maturity. Terraform is in use today; bring it to a state where every change to production goes through reviewed, version-controlled IaC.
- Right-size the environment. AWS Reserved Instances and a heavy resource footprint mean there is real money on the table - identify, propose, and execute the savings.
- Establish a defensible backup, restore, and disaster recovery posture. Set and test RTO and RPO targets. Run restore drills, not just backup jobs.
- Move monitoring from reactive CloudWatch alarms to real observability - APM, structured logging, per-tenant performance visibility, and alerts that wake people up for the right reasons.
- Manage the on-premises to cloud bridge that connects the Clubspeed client to the cloud via Amazon SQS - keep it healthy and look for ways to simplify it over time.
CI/CD and Developer Experience
- Own GitHub Actions across Clubspeed (.NET, Node.js, React, Next.js) and Resova (PHP / Laravel, Angular, MySQL). Build pipelines that are fast, predictable, and trusted by the team.
- Bake quality and security gates into the pipeline - unit and integration tests, SAST, DAST, dependency scanning, SBOM generation, license checks, container scanning, and secrets detection.
- Lead the move away from long-lived credentials in repos. Stand up real secrets management (AWS Secrets Manager, GitHub OIDC, or equivalent) and rotate everything that needs rotating.
Security and Compliance
- Lead the PCI DSS 4.0.1 remediation program, in partnership with the CTO, the engineering leads, and the Manos Group security team.
- Implement and maintain SBOMs and open-source license tracking across all repos.
- Tighten IAM, network segmentation, and audit logging across AWS. Make least privilege a default, not an aspiration.
- Coordinate third-party penetration tests and own the remediation work that comes out of them.
- Keep Vanta accurate and useful, not a check-the-box exercise.
Internal IT
- Own the day-to-day IT experience for the team - laptops, accounts, onboarding, offboarding, MFA, SSO, and the small things that compound into a good or bad working environment.
- Manage endpoint hygiene - device management, antivirus or EDR, patching, and a sane response to lost or compromised devices.
- Be the practical point person for vendor management on the IT side - licensing, renewals, and consolidation opportunities.
Working With Agentic AI
- Embrace and actively use agentic AI tools in your own work - infrastructure changes, pipeline maintenance, incident response, log analysis, and IT operations.
- Build agentic workflows that automate repetitive DevOps and IT work - alert triage, runbook execution, access provisioning, vulnerability triage, and routine support requests.
- Partner with the CTO and the broader Manos Group on cross-portfolio AI initiatives, and share what works back into the group.
Technical Range
- Fluent in Linux, networking fundamentals, and at least one scripting language used in anger (Python or shell; Node.js or PowerShell a bonus).
- Container experience - Docker, ECS, image hygiene, and an opinion about base images and patching.
- Database operations background sufficient to support engineering on MS SQL Server, MySQL, and Redis - backups, restores, performance triage, and migration support. You do not need to be a DBA, but you should not be afraid of one.
- Observability and monitoring experience - CloudWatch is the starting point; experience with an APM (Datadog, New Relic, Dynatrace, Honeycomb, or similar) is valuable.
- Cloudflare or equivalent CDN / WAF experience.
- Hands-on use of modern AI coding and ops tools - Claude Code, Cursor, GitHub Copilot, MCP-based agents, or similar - in your own day-to-day work.
Operating Style
- Owner, not coordinator. You take problems end to end, including the unglamorous parts.
- Calm under pressure. Incidents, audits, and migrations are part of the job - they should not be a personality test.
- Honest about trade-offs. You do not oversell quick wins and you do not hide the cost of doing things properly.
- Communicates well with engineers, executives, auditors, and end users. The IT side of the job requires patience and clarity.
- Bias to automate. If you have done it three times, the fourth time should be a script, a workflow, or an agent.
Desired Candidate Profile
- 5+ years in a hands-on DevOps, SRE, or platform engineering role, with at least 2 years as the senior or lead DevOps person on a meaningful production environment.
- Direct, hands-on AWS experience across the services in scope - EC2, ECS, Lambda, RDS (SQL Server and MySQL), ElastiCache, SQS, IAM, VPC, CloudWatch. Multi-region experience is a plus.
- Strong Terraform background. You have written and maintained real IaC, not just inherited it.
- GitHub Actions experience at depth - reusable workflows, matrix builds, environments, OIDC to cloud, and the kind of pipeline design that scales beyond a single team.
- Practical security background. You have led or materially contributed to PCI DSS, SOC 2, or equivalent remediation work, and you can read a SAST or dependency report and prioritise it sensibly.