如何使用AgentTrove：在Python中流式处理1.7M智能体轨迹并构建干净的ShareGPT SFT数据集

2026-05-30 08:46·18天前·Sana Hassan

AI 摘要

AgentTrove是目前最大的开源智能体交互轨迹集合，包含1.7M行数据，采用ShareGPT风格布局。该Python教程展示了如何在不下载完整数据的情况下流式处理该数据集，具体步骤包括规范化智能体轮次、提取命令、分析轨迹，并将成功的轨迹导出为干净的SFT微调数据集。

原文 · 未翻译

In this tutorial, we explore AgentTrove, one of the largest open-source collections of agentic interaction traces, and learn how to work with it efficiently. Instead of downloading the full dataset, we use streaming to inspect rows, detect the conversation schema, normalize agent turns, and understand how user, assistant, system, and tool messages are structured. We also build utilities to parse command-style assistant outputs, render complete trajectories in a readable format, and study how agents interact with tools across different tasks. Also, we create a lightweight analytical workflow that samples thousands of traces, converts them into a DataFrame, summarizes turn-level statistics, visualizes important dataset patterns, and exports successful traces into a clean ShareGPT-style JSONL format for supervised fine-tuning.

Copy CodeCopiedUse a different Browser

!pip -q install "datasets>=2.19" pandas matplotlib pyarrow huggingface_hub import itertools, json, collections, textwrap, re, random, statistics import pandas as pd import matplotlib.pyplot as plt from datasets import load_dataset REPO = "open-thoughts/AgentTrove" random.seed(0) print(" Imports ready. Target dataset:", REPO) ds = load_dataset(REPO, split="train", streaming=True) print(" Streaming dataset opened.") first = next(iter(ds)) print("\n Columns present in a row:") for k in first.keys(): v = first[k] t = type(v).__name__ preview = (str(v)[:70] + "…") if v is not None and len(str(v)) > 70 else v print(f" • {k:= 1.0 except (TypeError, ValueError): return False out_path = "agenttrove_clean_sft.jsonl" kept, scanned, SCAN, KEEP = 0, 0, 1500, 200 print(f"\n Scanning up to {SCAN} rows, keeping up to {KEEP} successful traces…") with open(out_path, "w") as f: for row in itertools.islice(load_dataset(REPO, split="train", streaming=True), SCAN): scanned += 1 if not is_success(row): continue turns = normalize_turns(row[TRACE_KEY]) conv = [{"from": r, "value": c} for r, c in turns if c.strip()] if len(conv) = KEEP: break print(f" Scanned {scanned} rows → wrote {kept} clean traces to '{out_path}'") def search_traces(keyword=None, source=None, limit=3, scan=3000): """Stream the dataset and yield-print traces matching filters.""" hits = 0 for row in itertools.islice(load_dataset(REPO, split="train", streaming=True), scan): if source and row.get("original_source") != source: continue if keyword: blob = " ".join(c for _, c in normalize_turns(row[TRACE_KEY])) if keyword.lower() not in blob.lower(): continue render_trace(row, max_chars=300) hits += 1 if hits >= limit: break if hits == 0: print("No matches in the scanned window — try increasing `scan`.") print("\n Searching for 'nl2bash' source traces:") search_traces(source="nl2bash", limit=2, scan=4000) print("\n Tutorial complete! Next ideas:") print(" • Increase N / SCAN for bigger analyses.") print(" • Filter by original_source (swesmith, codeforces, r2egym…) for a domain SFT set.") print(" • Feed agenttrove_clean_sft.jsonl into Axolotl / LLaMA-Factory for fine-tuning.")

We define a success filter that retains traces marked as resolved, passed, correct, or positively rewarded. We then export successful trajectories into a clean ShareGPT-style JSONL file for downstream fine-tuning workflows. Also, we add a search utility to find traces by keyword or source, making the dataset easier to explore for specific agentic tasks.

In conclusion, we built a complete, hands-on pipeline to inspect, analyze, filter, and export data from AgentTrove in a Colab-friendly way. We started with streaming access, then progressively added schema detection, turn normalization, command extraction, trajectory rendering, statistical analysis, visualization, success-based filtering, and keyword or source-based search. This workflow helps us understand the internal structure of agentic traces and gives us a reusable foundation for preparing high-quality subsets for fine-tuning or evaluation. We also keep the process scalable by avoiding full dataset downloads and using streamed samples only when needed. Also, we demonstrated how AgentTrove can be used as more than a static dataset: we treated it as a rich source of agent behavior, tool usage, task outcomes, and training-ready conversations that can support future experiments in agent learning, workflow analysis, and domain-specific SFT dataset creation.

Check out the Full Codes with Notebook. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us

The post How to Use AgentTrove: Streaming 1.7M Agentic Traces and Building a Clean ShareGPT SFT Dataset in Python appeared first on MarkTechPost.

智能体教程/实践数据/训练

MarkTechPost（RSS）

如何使用AgentTrove：在Python中流式处理1.7M智能体轨迹并构建干净的ShareGPT SFT数据集

2026-05-30 08:46·18天前·Sana Hassan

AI 摘要

原文 · 保持原样，未翻译

Copy CodeCopiedUse a different Browser