Mistral AI
Mistral Cloud – Site Reliability Engineer Overview
| Company Name | Mistral AI |
| Job Role | Mistral Cloud – Site Reliability Engineer |
| Qualifications | Not Specified |
| Category | IT Jobs |
| Job Type | Full Time |
| Location | London |
Mistral AI is looking for an experienced Site Reliability Engineer to help shape the reliability, scalability, and performance of its cloud platform and customer-facing applications. The work sits within the Engineering & Infra organization and is intended for someone who can partner closely with software engineering and product teams to make sure the platform consistently meets the needs of both internal users and external customers.
The company builds AI systems designed to simplify work, save time, and support learning and creativity. Its platform includes high-performance, optimized, open-source, and cutting-edge models and products that can run on-premises or in cloud environments. The team is distributed across France, the United States, the United Kingdom, Germany, and Singapore, and the culture is described as collaborative, low-ego, creative, and focused on building meaningful impact.
What you will do
- Design, implement, and maintain infrastructure that is scalable, highly available, and resilient to failures.
- Operate production systems and resolve issues in live environments, including incident handling, on-call response, user administration, data extraction, and infrastructure scaling.
- Build and improve monitoring, alerting, and incident-response processes so that performance stays strong and downtime is minimized.
- Develop and maintain the operational workflows and tooling used for customer-facing APIs and large-scale training runs, including CI/CD, containerization, orchestration, logging, and alerting systems.
- Join on-call rotations occasionally to respond to incidents and perform root-cause investigations that help prevent similar issues in the future.
- Advance infrastructure automation, deployment, and orchestration through ongoing improvement work.
- Work with software engineers to create solutions that support safe, repeatable model-training experiments.
- Contribute to the development of the cloud platform by helping define an abstraction layer between research, engineering, and infrastructure.
- Build new tools and workflows that improve reliability, availability, and performance, including scripts, refactored components, API-based features, web applications, and dashboards.
- Collaborate with the security team to ensure infrastructure meets security best practices and compliance requirements.
- Document procedures and operational knowledge so the team can work consistently and share information effectively.
- Contribute beyond the core role through open-source work, research publications, blog posts, and conference participation.
What the team is looking for
- A master's degree in computer science, engineering, or a related field.
- At least five years of experience in DevOps or site reliability engineering.
- Strong experience with bare-metal infrastructure and distributed systems that must remain highly available.
- Hands-on exposure to reliability challenges in critical environments, including live troubleshooting, root-cause analysis, and on-call work.
- Experience working with observability and alerting, and against operational targets such as service-level agreements.
- Practical experience with CI/CD, containerization, and orchestration tools such as Docker and Kubernetes.
- Knowledge of observability tooling such as Prometheus, Grafana, ELK Stack, and Datadog.
- Familiarity with infrastructure-as-code tools such as Terraform or CloudFormation.
- Strong scripting skills in languages such as Python, Go, or Bash, plus an understanding of software development best practices.
- Good knowledge of networking, security, and system administration.
- Strong problem-solving ability and clear communication skills.
- Self-motivation and the ability to thrive in a fast-paced startup setting.
- Additional relevant experience includes work in AI or machine learning environments, exposure to high-performance computing and workload managers such as Slurm, and experience with AI-focused infrastructure providers such as Fluidstack, CoreWeave, or Vast.
Hiring process
- An introductory call lasting 30 minutes.
- A 30-minute interview with the hiring manager.
- A 45-minute technical interview focused on system design.
- A 60-minute technical deep-dive interview.
- A 30-minute culture-fit discussion.
- Reference checks.
Culture
The company says it is building a strong culture around a few core principles: reasoning carefully and rigorously, being bold, helping customers succeed, shipping early and moving quickly, and leaving ego aside.
Location and working arrangement
This position is mainly based in one of the company's European offices, with Paris and London highlighted as the primary locations. Candidates who already live in those locations, or who are willing to relocate, will be prioritized. The company places a strong emphasis on in-person collaboration to support relationships and communication within the team.
Remote candidates may also be considered if they are based in one of the countries listed in the posting: France, the United Kingdom, Germany, Belgium, the Netherlands, Spain, or Italy. For remote hires, the company requires travel to the Paris headquarters, with accommodation and travel expenses covered, during the first week of onboarding and then for at least three days every six weeks.
What is offered
- Competitive salary together with equity.
- Health insurance.
- A sport allowance.
- Meal vouchers.
- A generous parental leave policy.
- Visa sponsorship.
By submitting an application, candidates agree to the Applicant Privacy Policy. The posting also provides a link for applying to the role.
To apply for this job, please visit jobs.lever.co.