如何使用AgentTrove:在Python中流式处理1.7M智能体轨迹并构建干净的ShareGPT SFT数据集
AgentTrove是目前最大的开源智能体交互轨迹集合,包含1.7M行数据,采用ShareGPT风格布局。该Python教程展示了如何在不下载完整数据的情况下流式处理该数据集,具体步骤包括规范化智能体轮次、提取命令、分析轨迹,并将成功的轨迹导出为干净的SFT微调数据集。
In this tutorial, we explore AgentTrove, one of the largest open-source collections of agentic interaction traces, and learn how to work with it efficiently. Instead of downloading the full dataset, we use streaming to inspect rows, detect the conversation schema, normalize agent turns, and understand how user, assistant, system, and tool messages are structured. We also build utilities to parse command-style assistant outputs, render complete trajectories in a readable format, and study how agents interact with tools across different tasks. Also, we create a lightweight analytical workflow that samples thousands of traces, converts them into a DataFrame, summarizes turn-level statistics, visualizes important dataset patterns, and exports successful traces into a clean ShareGPT-style JSONL format for supervised fine-tuning.
Copy CodeCopiedUse a different Browser
!pip -q install "datasets>=2.19" pandas matplotlib pyarrow huggingface_hub import itertools, json, collections, textwrap, re, random, statistics import pandas as pd import matplotlib.pyplot as plt from datasets import load_dataset REPO = "open-thoughts/AgentTrove" random.seed(0) print(" Imports ready. Target dataset:", REPO) ds = load_dataset(REPO, split="train", streaming=True) print(" Streaming dataset opened.") first = next(iter(ds)) print("\n Columns present in a row:") for k in first.keys(): v = first[k] t = type(v).__name__ preview = (str(v)[:70] + "…") if v is not None and len(str(v)) > 70 else v print(f" • {k:= 1.0 except (TypeError, ValueError): return False out_path = "agenttrove_clean_sft.jsonl" kept, scanned, SCAN, KEEP = 0, 0, 1500, 200 print(f"\n Scanning up to {SCAN} rows, keeping up to {KEEP} successful traces…") with open(out_path, "w") as f: for row in itertools.islice(load_dataset(REPO, split="train", streaming=True), SCAN): scanned += 1 if not is_success(row): continue turns = normalize_turns(row[TRACE_KEY]) conv = [{"from": r, "value": c} for r, c in turns if c.strip()] if len(conv) = KEEP: break print(f" Scanned {scanned} rows → wrote {kept} clean traces to '{out_path}'") def search_traces(keyword=None, source=None, limit=3, scan=3000): """Stream the dataset and yield-print traces matching filters.""" hits = 0 for row in itertools.islice(load_dataset(REPO, split="train", streaming=True), scan): if source and row.get("original_source") != source: continue if keyword: blob = " ".join(c for _, c in normalize_turns(row[TRACE_KEY])) if keyword.lower() not in blob.lower(): continue render_trace(row, max_chars=300) hits += 1 if hits >= limit: break if hits == 0: print("No matches in the scanned window — try increasing `scan`.") print("\n Searching for 'nl2bash' source traces:") search_traces(source="nl2bash", limit=2, scan=4000) print("\n Tutorial complete! Next ideas:") print(" • Increase N / SCAN for bigger analyses.") print(" • Filter by original_source (swesmith, codeforces, r2egym…) for a domain SFT set.") print(" • Feed agenttrove_clean_sft.jsonl into Axolotl / LLaMA-Factory for fine-tuning.")