AI Engineer for LLMOps & Evaluation (m/f/d)

Full-time  •  IT & Software  •  Munich, Germany

<div class="show-more-less-html__markup show-more-less-html__markup--clamp-after-5 relative overflow-hidden"> <p>You'll join an early-stage, AI-native startup with a product that has already proven market fit. We build cutting-edge AI solutions for Governance, Risk and Compliance (GRC) for enterprises around the world.</p><br/><p>Our customers are auditors, risk managers, and compliance teams, which means evaluation rigor, auditability, and EU AI Act readiness aren't afterthoughts for us. They're product requirements.</p><br/>Tasks<br/><p>As our AI Engineer for LLMOps &amp; Evaluation, you'll own the LLMOps pipeline end-to-end and work directly alongside our founding team.</p><br/><p>You will:</p><br/><ul><br/><li>Own the LLMOps pipeline: Evaluation infrastructure, prompt optimization loop, and the production integration that turns experiments into reliable customer-facing features</li><br/><li>Design evaluation strategy per output type: Decide when to use deterministic evals (exact match, schema validation, embeddings) vs. 
LLM-as-judge, and build the rubrics, test datasets, and human-review loops that make the system trustworthy</li><br/><li>Drive prompt engineering and optimization across all LLM operations in the product: Moving from hand-tuned prompts to a measurable, iterative process</li><br/><li>Pick the right tool for each problem: Some things are LLM problems, some are embedding + classical NLP problems, some are deterministic logic</li><br/><li>Run the production side of AI features: Observability (Langfuse / LangSmith / similar), cost and latency engineering, incident response when an LLM feature degrades</li><br/><li>Build human-in-the-loop workflows: Review queues, feedback ingestion, and labeling, so that production signal feeds back into evals and prompt iteration</li><br/><li>Mentor our AI &amp; Analytics Intern and contribute to how we build the AI team over time</li><br/></ul><br/>Requirements<br/><ul><br/><li>3+ years of hands-on experience building and shipping ML/AI systems in production (we care more about what you've shipped than years on a CV)</li><br/><li>You have shipped an LLM evaluation or prompt optimization pipeline: not just used LLMs in a project, but owned the loop</li><br/><li>Strong hands-on experience with LLM-as-judge, including its variance problems and concrete techniques for controlling them</li><br/><li>Solid foundation in classical NLP and ML ops: Embeddings, semantic similarity, entity matching, classification, fuzzy matching</li><br/><li>Informed opinions on deterministic vs. LLM-based evals, from experience</li><br/><li>Production judgment: You've owned cost and latency tradeoffs, observability, and incident response for an LLM-powered feature. 
You're familiar with prompt regression and have strategies for managing it</li><br/><li>Strong Python</li><br/><li>Excellent English communication, written and verbal: We discuss nuanced technical tradeoffs daily with the founding team and customers</li><br/><li>Comfort with ambiguity: You can run experiments on real data, build intuition for this domain, and know when to stop iterating</li><br/></ul><br/><p><strong>Nice to have</strong></p><br/><ul><br/><li>Hands-on experience with LLM observability and eval tooling (Langfuse, LangSmith, Phoenix/Arize, Helicone, Braintrust, W&amp;B)</li><br/><li>Experience with DSPy or similar prompt optimization frameworks, and opinions on where they do and don't work</li><br/><li>Experience with Azure OpenAI in EU regions, or with EU-sovereign providers (Mistral, Aleph Alpha)</li><br/><li>Exposure to guardrails, content safety, or AI governance</li><br/><li>Exposure to enterprise software, ideally GRC, compliance, audit, or regulated industries</li><br/><li>Familiarity with Java/Spring Boot or Kubernetes on Azure, enough to integrate cleanly</li><br/><li>German</li><br/></ul><br/>Benefits<br/><ul><br/><li>Hands-on ownership of a real AI product used by enterprise customers</li><br/><li>Work directly alongside the founding team from day one</li><br/><li>Hybrid work model: Munich North, minimum one day per week in the office, otherwise flexible (open to strong candidates elsewhere in the EU for the right fit); onboarding will take place in the office</li><br/><li>A steep learning curve at the intersection of LLM engineering, enterprise GRC, and startup operations</li><br/><li>The chance to shape the AI team as we grow</li><br/></ul><br/><p>Auxilius.ai is building AI-powered GRC solutions for enterprises. We're early-stage, fast-growing, and backed by real customers. Our tech stack includes Java &amp; Spring Boot, Angular, Kubernetes on Azure, and OpenAI &amp; Anthropic LLMs.</p> </div>

Job Overview
  • Date posted

    May 20, 2026

  • Category

    IT & Software

  • Job Type

    Full-time

  • Location

    Munich, Germany

  • Employer

    Auxilius.ai

  • Source

    LinkedIn