About SPARTA
SPARTA introduces a groundbreaking benchmark for tree-structured multi-hop question answering (QA) across text and tables, addressing the critical shortcomings of existing datasets like HybridQA and OTT-QA, which suffer from shallow reasoning, annotation errors, and limited scale. By constructing a unified reference fact database that merges source tables with grounding tables derived from unstructured passages, our end-to-end framework automates the generation of thousands of high-fidelity QA pairs—requiring only a quarter of the annotation effort—while incorporating advanced operations like aggregation, grouping, and deep nested predicates. Innovative techniques such as provenance-based refinement and realistic-structure enforcement ensure executable, semantically sound queries that mimic real-world complexity, spanning domains like NBA, movies, and medicine. On SPARTA, state-of-the-art models plummet by over 30 F1 points, exposing gaps in cross-modal reasoning and paving the way for more robust QA systems.
News
The SPARTA benchmark has been released on Hugging Face.
SPARTA has been accepted to ICLR 2026.
Why SPARTA?
In recent years, benchmarks like HybridQA and OTT-QA have pioneered Table-Text QA. However, their reliance on manual curation has led to fundamental limitations: shallow reasoning, high annotation noise, and toy-scale data environments.
As LLMs evolve to handle increasingly complex analytical tasks, we present SPARTA, a scalable and principled framework designed to advance cross-modal reasoning through automated, high-fidelity benchmark construction.
SPARTA redefines the standards of Table-Text QA with three core advancements:
- ✓Real-World Scale: We move beyond tiny web tables (averaging 15 rows) to realistic settings, featuring relational data with up to thousands of rows.
- ✓Tree-Structured Multi-Hop Reasoning: Unlike existing benchmarks limited to linear, shallow chains, SPARTA synthesizes complex queries requiring tree-structured reasoning, including advanced operations like aggregation and grouping.
- ✓High Fidelity & Reliability: By replacing error-prone manual labeling (which has up to a 21% error rate) with provenance-based refinement, SPARTA ensures executable, natural-sounding, and noise-free benchmarks at 4x the construction efficiency.
The challenge is significant: even state-of-the-art LLMs experience a performance drop of over 30 F1 points on SPARTA compared to previous benchmarks, highlighting the critical need for more robust and scalable evaluation.
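The F1 numbers above are typically computed with token-level F1, the standard answer-matching metric for QA evaluation. The sketch below is a minimal, illustrative implementation of that metric, not SPARTA's official scorer, whose exact normalization rules may differ.

```python
# Token-level F1 for QA answer matching (SQuAD-style): harmonic mean of
# token precision and recall between a predicted and a gold answer string.
# A minimal sketch; the benchmark's official scorer may normalize differently.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one-sided empty is a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Chris Paul", "Chris Paul"))  # 1.0
print(token_f1("Paul", "Chris Paul"))        # ~0.667 (recall penalized)
```

A 30-point drop on this scale means predictions that previously matched gold answers almost token-for-token now frequently miss entire answer spans.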
Benchmark Comparison
SPARTA addresses critical limitations of existing Table-Text QA benchmarks
| Benchmark | Avg. Rows | Question Gen. | GROUP BY | Deep Hop | Star-shape Query | Annotation Error (over 100 sampled queries) |
|---|---|---|---|---|---|---|
| TAT-QA | 9.4 | Manual | ✗ | ✗ | ✗ | 30% |
| FinQA | 6.4 | Manual | ✗ | ✗ | ✗ | 27% |
| MultiHierTT | 10.8 | Manual | ✗ | ✗ | ✗ | 26% |
| HybridQA | 15.7 | Manual | ✗ | ✗ | ✗ | 21% |
| OTT-QA | 15.7 | Manual | ✗ | ✗ | ✗ | 21% |
| SPARTA (NBA) | 3,280.5 | Auto (LLM) | ✓ | ✓ | ✓ | 0% |
| SPARTA (Movie) | 10,054.0 | Auto (LLM) | ✓ | ✓ | ✓ | 0% |
| SPARTA (Medical) | 200.0 | Auto (LLM) | ✓ | ✓ | ✓ | 0% |
Data Examples
“Which Point Guards, drafted between 2000 and 2005, had more than 4 three-pointers, more than 8 field goals and more than 1 steal in a game?”
- →Find Point Guards drafted between 2000 and 2005 (Table: nba_player_information)
- →Filter games with >4 three-pointers, >8 field goals, >1 steal (Text: nba_player_game_stats)
- →Join results via nested subquery to find matching players
SELECT player_name
FROM nba_player_game_stats
WHERE player_name IN (
SELECT player_name
FROM nba_player_information
WHERE position = 'Point Guard'
AND draft_year BETWEEN 2000 AND 2005
)
AND number_of_three_point_field_goals_made > 4
AND number_of_field_goals_made > 8
AND number_of_steal > 1

Answer: Chris Paul
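The nested query above can be run end-to-end against the two tables it references. The following sketch executes it with Python's built-in sqlite3 on an in-memory database; the table and column names come from the example, but the inserted rows are toy data for illustration only, not records from the benchmark.

```python
# Minimal sqlite3 sketch of the tree-structured query above.
# Schemas follow the example; the rows are illustrative toy data.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE nba_player_information (
    player_name TEXT, position TEXT, draft_year INTEGER)""")
cur.execute("""CREATE TABLE nba_player_game_stats (
    player_name TEXT,
    number_of_three_point_field_goals_made INTEGER,
    number_of_field_goals_made INTEGER,
    number_of_steal INTEGER)""")
cur.executemany("INSERT INTO nba_player_information VALUES (?, ?, ?)", [
    ("Chris Paul", "Point Guard", 2005),     # satisfies the subquery
    ("Tim Duncan", "Power Forward", 1997),   # filtered out by position/year
])
cur.executemany("INSERT INTO nba_player_game_stats VALUES (?, ?, ?, ?)", [
    ("Chris Paul", 5, 9, 2),    # passes all three stat predicates
    ("Tim Duncan", 0, 10, 1),   # fails the three-pointer and steal filters
])
rows = cur.execute("""
    SELECT player_name FROM nba_player_game_stats
    WHERE player_name IN (
        SELECT player_name FROM nba_player_information
        WHERE position = 'Point Guard'
          AND draft_year BETWEEN 2000 AND 2005)
      AND number_of_three_point_field_goals_made > 4
      AND number_of_field_goals_made > 8
      AND number_of_steal > 1
""").fetchall()
print(rows)  # [('Chris Paul',)]
```

The inner subquery resolves one reasoning branch (the table hop over player information) and the outer predicates resolve the other (the stat filters grounded in text), mirroring how the two branches join at the root of the reasoning tree.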
Have Questions?
We're here to help! Feel free to reach out.
Acknowledgement
This work was partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2025-00517736, 30%); the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. RS-2024-00509258, Global AI Frontier Lab, 50%; No. RS-2024-00454666, Developing a Vector DB for Long-Term Memory Storage of Hyperscale AI Models, 10%); and the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education (No. RS-2024-00415602, 10%).
We extend our sincere gratitude to Jaewon Park, an undergraduate research student, for his significant contributions to the implementation of the leaderboard.
The website design is inspired by the Spider 2.0 benchmark.
Citation
If you use SPARTA in your research, please cite our paper:
@inproceedings{
park2026sparta,
title={{SPARTA}: Scalable and Principled Benchmark of Tree-Structured Multi-hop {QA} over Text and Tables},
author={Sungho Park and Jueun Kim and Wook-Shin Han},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=8KE9qvKhM4}
}

Leaderboard