# AI Evaluation Engineer (Knowledge & Research)

> Gramian Consulting Group · Colombia (Remote) · Contract · Posted 2026-05-04

**Workplace:** remote

**Department:** Partnerships

## Description

**About Us**

Gramian Consulting Group is a boutique firm specializing in IT professional services and engineering talent solutions. With a strong background in software engineering and leadership, we help companies build high-performing teams by matching them with professionals who truly fit their needs.

**Role overview**

We are looking for an **AI Evaluation Engineer with a strong research background** to design and evaluate complex, multi-agent tasks used to benchmark next-generation AI systems.

In this role, you will work at the intersection of **research, data structuring, and AI evaluation**, building high-quality tasks that require deep document understanding, structured reasoning, and multi-step synthesis. You will create datasets and evaluation frameworks that test whether AI agents can truly **read, reason, and extract knowledge from large-scale unstructured data**.

This is a **high-precision, detail-oriented role** requiring strong analytical thinking, structured problem decomposition, and the ability to translate research content into measurable evaluation tasks.

**Commitment required: 8 hours per day, with at least 4 hours overlapping PST.**

**Employment type: Contractor assignment (no medical/paid leave)**

**Contract duration: 5+ weeks**

**Location:** **Bangladesh, Brazil, Colombia, Egypt, Ghana, India, Indonesia, Kenya, Nigeria, Turkey, Vietnam**

**Interview: take-home assessment (60 min)**

## Responsibilities

-   Build multi-agent benchmark tasks that require reading, analyzing, and synthesizing large document collections
-   Curate real-world research corpora — academic papers, case studies, technical reports — and design questions that require comprehensive analysis
-   Write structured ground-truth oracles (JSON) with specific, verifiable answers that prove the agent actually read the source material
-   Design LLM judge prompts that evaluate agent output field-by-field against the oracle
-   Create decomposition guides that split research across multiple parallel sub-agents (one per document, one per domain, then synthesis)
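To give a sense of the oracle-and-judge work above, here is a minimal sketch of a JSON ground-truth oracle and a field-by-field grading check in Python. All field names, values, and the `grade` helper are hypothetical illustrations, not artifacts from a real benchmark task.

```python
import json

# Hypothetical ground-truth oracle for one benchmark task.
# Field names and values are illustrative only.
oracle = {
    "paper_title": "Example Study",
    "sample_size": 1200,
    "primary_metric": "F1",
    "reported_score": 0.87,
}

def grade(agent_answer: dict, oracle: dict) -> dict:
    """Compare an agent's structured answer to the oracle, field by field."""
    return {
        field: agent_answer.get(field) == expected
        for field, expected in oracle.items()
    }

# An agent answer with one deliberately wrong field.
agent_answer = {
    "paper_title": "Example Study",
    "sample_size": 1200,
    "primary_metric": "accuracy",  # wrong on purpose
    "reported_score": 0.87,
}

report = grade(agent_answer, oracle)
print(json.dumps(report, indent=2))
```

In practice an LLM judge prompt would replace or augment the exact-match comparison for free-text fields, but the per-field structure shown here is what makes the evaluation verifiable.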

## Requirements

-   5+ years of experience in **research (academic or industry)** in a scientific, technical, or analytical domain
-   Strong ability to **read, analyze, and extract structured information from unstructured documents**
-   Experience designing or working with **structured data formats (JSON, schemas, validation)**
-   Proficiency in **Python scripting** (data processing, validation, or evaluation scripts)
-   Experience with **AI evaluation, coding benchmarks, or structured reasoning tasks** (e.g., SWE-bench, Terminal-bench, or similar)
-   Experience working with **Docker** (building images, debugging containers)
-   Strong attention to detail, especially when defining **exact, verifiable outputs**
-   Ability to design **complex, multi-step problem-solving workflows**

## Apply

[Apply at Gramian Consulting Group](https://apply.workable.com/gramian/j/98D3802A83/apply)

---
Powered by [Workable](https://www.workable.com)
