Data/Evaluations Engineer

Nous Research
Global
Full-time
Data Science / AI / ML · Worldwide · Fully Remote

Posted 1 week ago

Job Description

Summary
We’re looking for a data/evaluations engineer to own our post-training evaluation pipeline. You’ll build and scale evals, in both depth and breadth, that measure model capabilities across diverse tasks, identify failure modes, and drive model improvements.

Responsibilities:
  • Identifying tasks for evaluation coverage
  • Creating, curating, or generating test cases and ways to measure these tasks
  • Implementing evaluation via objective output verification, LLM-as-judge/reward modeling, human evaluation, or any other tricks of the trade you bring to the table
  • Expanding coverage and digging deep into what has actually gone wrong in failure cases
  • Identifying ways to remedy those failure cases
  • Making the evals scalable and accessible internally, e.g. via light GUIs and Slurm scripts for running them
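To make the workflow concrete, here is a minimal sketch of an objective-output-verification harness of the kind described above. All names (`EvalCase`, `run_eval`, `exact_match`) are illustrative assumptions, not Nous internals; the `verify` hook is where an LLM judge or other scorer could be swapped in.

```python
# Hypothetical sketch: score a model's outputs against expected answers.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected: str

def exact_match(output: str, expected: str) -> bool:
    # Objective verification: normalize whitespace and case, then compare.
    return output.strip().lower() == expected.strip().lower()

def run_eval(cases, generate: Callable[[str], str], verify=exact_match):
    """Run each case through the model and verifier; return score and per-case results."""
    results = []
    for case in cases:
        output = generate(case.prompt)
        results.append({"prompt": case.prompt,
                        "output": output,
                        "passed": verify(output, case.expected)})
    score = sum(r["passed"] for r in results) / len(results)
    return score, results

# Usage with a stub "model" standing in for a real inference call:
cases = [EvalCase("2+2=", "4"), EvalCase("Capital of France?", "Paris")]
score, results = run_eval(cases, generate=lambda p: "4" if "2+2" in p else "Lyon")
# score is 0.5: the arithmetic case passes, the geography case fails.
```

In practice the `verify` callable is the extension point: exact match for closed-form answers, an LLM judge or reward model for open-ended outputs.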

Qualifications:
  • Strong experience with evaluation frameworks
  • Experience with both automated and human evaluation methodologies
  • Ability to build evaluation infrastructure from scratch and scale existing systems
Preferred:
  • History of OSS contributions

About the Company

Nous Research

IT / Telecommunication Services

11-50 employees · Founded 2023
Nous Research is an AI research lab creating world-class models out in the open. We are best known for the Hermes series of open-source models, which are general purpose, human-aligned, lightweight models downloaded more than 50 million times on HuggingFace. In the process of developing these models, Nous is building a fully open AI stack, allowing anyone to meaningfully participate in the development of frontier intelligence, beginning with our fully distributed pre-training network, Psyche.