# Freelance Agent Evaluation Engineer

> Mindrift · Mexico (Remote) · Part-time · Posted 2026-05-31

**Workplace:** remote

**Department:** English_HackerRank

## Description

_Please submit your CV in English and indicate your level of English proficiency._

Mindrift connects specialists with project-based AI opportunities for leading tech companies, focused on testing, evaluating, and improving AI systems. **Participation is project-based, not permanent employment.**

**What this opportunity involves**

We're building a dataset that evaluates how well AI coding agents handle real-world developer tasks.

You'll create challenging tasks and evaluation criteria within realistic simulated environments:

-   Build realistic developer environments - a virtual company with codebase, infrastructure, and context (tickets, docs, conversations) that forms a believable development history
-   Design tasks from intermediate states of these environments - craft the prompt, define what "solved" means, and ensure the task is solvable by an AI agent
-   Write tests that verify agent solutions - accept all valid approaches and reject incorrect ones, neither too strict nor too lenient
-   Iterate on tasks and tests based on QA feedback - review agent solutions, analyze failures, and refine until the evaluation is fair and robust

**What this is NOT**

-   Not data labeling
-   Not prompt engineering
-   Not writing code from scratch - the agent writes most of the code; you guide and evaluate

**What we look for**

-   5+ years in software development
-   Core stack: Python (FastAPI), JavaScript/TypeScript (React), Docker, Postgres, Kafka, Redis
-   Experience writing tests (functional, integration)
-   English proficiency - B2+

**Why this is hard**

Frontier models are already good at coding. Creating a task that genuinely challenges the best models is non-trivial. You need to deeply understand where models fail and what scenarios reveal the difference between a good and a bad solution. Tasks have many valid solutions - writing tests that accept all correct solutions and reject incorrect ones is harder than it sounds.
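As a rough illustration of that last point, here is a minimal sketch of a behavior-based check. The task (de-duplicating emails case-insensitively), the function names, and the sample data are all hypothetical and not taken from the actual project; the idea is only that the test asserts on required properties rather than on one exact output.

```python
from typing import Callable, Iterable


def check_dedupe(solution: Callable[[list[str]], Iterable[str]]) -> bool:
    """Return True if `solution` removes case-insensitive duplicate emails.

    Hypothetical acceptance check: any output order and either original
    casing is accepted, as long as each address appears exactly once.
    """
    emails = ["a@x.com", "A@X.com", "b@x.com"]
    result = list(solution(emails))
    normalized = sorted(e.lower() for e in result)
    # Every returned value must come from the input (no invented addresses).
    only_known = all(e in emails for e in result)
    return normalized == ["a@x.com", "b@x.com"] and only_known


def keep_first(emails: list[str]) -> list[str]:
    """Valid approach: keep the first spelling seen, preserve input order."""
    seen: set[str] = set()
    out: list[str] = []
    for e in emails:
        if e.lower() not in seen:
            seen.add(e.lower())
            out.append(e)
    return out


def keep_last(emails: list[str]) -> list[str]:
    """Valid approach: keep the last spelling seen for each address."""
    return list({e.lower(): e for e in emails}.values())


def case_sensitive(emails: list[str]) -> list[str]:
    """Incorrect approach: treats differently cased addresses as distinct."""
    return list(set(emails))


if __name__ == "__main__":
    for fn in (keep_first, keep_last, case_sensitive):
        print(fn.__name__, check_dedupe(fn))
```

A stricter test that compared against a single expected list would wrongly reject `keep_last`, while a looser one that only checked the output length would not tell you whether the right addresses survived.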

**How it works**

Apply → Pass qualification(s) → Join a project → Complete tasks → Get paid

**Effort estimate**

Tasks for this project are estimated to take about 20 hours each, depending on complexity. This is an estimate, not a schedule requirement; you choose when and how to work. To be accepted, a task must be submitted by the deadline and meet the listed acceptance criteria.

**Compensation**

Up to **$21/hr equivalent**, depending on level and pace. Tasks are estimated at ~20 hours each; you set your own schedule.

## Apply

[Apply at Mindrift](https://apply.workable.com/toloka-ai/j/BE99DDF928/apply)
