
About SPARTA

SPARTA introduces a groundbreaking benchmark for tree-structured multi-hop question answering (QA) across text and tables, addressing the critical shortcomings of existing datasets like HybridQA and OTT-QA, which suffer from shallow reasoning, annotation errors, and limited scale. By constructing a unified reference fact database that merges source tables with grounding tables derived from unstructured passages, our end-to-end framework automates the generation of thousands of high-fidelity QA pairs—requiring only a quarter of the annotation effort—while incorporating advanced operations like aggregation, grouping, and deep nested predicates. Innovative techniques such as provenance-based refinement and realistic-structure enforcement ensure executable, semantically sound queries that mimic real-world complexity, spanning domains like NBA, movies, and medicine. On SPARTA, state-of-the-art models plummet by over 30 F1 points, exposing gaps in cross-modal reasoning and paving the way for more robust QA systems.

News

2026-02-19

SPARTA workload has been released on Hugging Face.

2026-01-26

SPARTA has been accepted to ICLR 2026.

Why SPARTA?

In recent years, benchmarks like HybridQA and OTT-QA have pioneered Table-Text QA. However, their reliance on manual curation has led to fundamental limitations: shallow reasoning, high annotation noise, and toy-scale data environments.

As LLMs evolve to handle increasingly complex analytical tasks, we present SPARTA, a scalable and principled framework designed to advance cross-modal reasoning through automated, high-fidelity benchmark construction.

SPARTA redefines the standards of Table-Text QA with three core advancements:

  • Real-World Scale: We move beyond tiny web tables (averaging 15 rows) to realistic settings, featuring relational data with up to thousands of rows.
  • Tree-Structured Multi-Hop Reasoning: Unlike existing benchmarks limited to linear, shallow chains, SPARTA synthesizes complex queries requiring tree-structured reasoning, including advanced operations like aggregation and grouping.
  • High Fidelity & Reliability: By replacing error-prone manual labeling (which has up to a 21% error rate) with provenance-based refinement, SPARTA ensures executable, natural-sounding, and noise-free benchmarks at 4x the construction efficiency.
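The kind of tree-structured query described above can be illustrated with a small sketch. The schema and rows below are invented for illustration (they are not SPARTA data): an outer aggregation with GROUP BY whose filter branches into a nested subquery over a second table, so answering requires both a cross-table hop and a grouping step.

```python
import sqlite3

# Toy illustration of a tree-structured query in the style SPARTA targets.
# Schema and data are hypothetical, not taken from the SPARTA release.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE player_info (player_name TEXT, position TEXT, draft_year INT);
CREATE TABLE game_stats  (player_name TEXT, points INT);
INSERT INTO player_info VALUES
  ('A', 'Point Guard', 2003), ('B', 'Center', 2003), ('C', 'Point Guard', 1999);
INSERT INTO game_stats VALUES ('A', 20), ('A', 30), ('B', 10), ('C', 40);
""")

# "Average points per game for Point Guards drafted after 2000" hops from
# game_stats back to player_info (one branch of the tree), then groups the
# surviving rows (a second reasoning step).
rows = conn.execute("""
SELECT player_name, AVG(points)
FROM game_stats
WHERE player_name IN (
    SELECT player_name FROM player_info
    WHERE position = 'Point Guard' AND draft_year > 2000
)
GROUP BY player_name
""").fetchall()
print(rows)  # [('A', 25.0)]
```

A linear-chain benchmark question can be answered by following one lookup after another; here the grouping branch and the subquery branch must both be resolved before they are combined, which is what makes the reasoning tree-shaped.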

The challenge is significant: even state-of-the-art LLMs experience a performance drop of over 30 F1 points on SPARTA compared to previous benchmarks, highlighting the critical need for more robust and scalable evaluation.
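For readers unfamiliar with the F1 metric used here: QA benchmarks typically score answers with token-level F1, the harmonic mean of precision and recall over the bag-of-words overlap between the predicted and gold answers. The sketch below shows the standard SQuAD-style definition; SPARTA's exact scorer may differ in normalization details.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-level F1: harmonic mean of precision and recall
    over bag-of-words overlap between predicted and gold answers."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Chris Paul", "Chris Paul"))  # 1.0
print(token_f1("Paul", "Chris Paul"))        # 0.666...
```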

Benchmark Comparison

SPARTA addresses critical limitations of existing Table-Text QA benchmarks

| Benchmark        | Avg. Rows | Question Gen. | Annotation Error (over 100 sampled queries) |
|------------------|-----------|---------------|---------------------------------------------|
| TAT-QA           | 9.4       | Manual        | 30%                                         |
| FinQA            | 6.4       | Manual        | 27%                                         |
| MultiHierTT      | 10.8      | Manual        | 26%                                         |
| HybridQA         | 15.7      | Manual        | 21%                                         |
| OTT-QA           | 15.7      | Manual        | 21%                                         |
| SPARTA (NBA)     | 3,280.5   | Auto (LLM)    | 0%                                          |
| SPARTA (Movie)   | 10,054.0  | Auto (LLM)    | 0%                                          |
| SPARTA (Medical) | 200.0     | Auto (LLM)    | 0%                                          |

(GROUP BY, Deep Hop, and Star-shape Query support are indicated by icons in the original table.)

Data Examples

Question (NBA) — Height: 1, Breadth: 1

Which Point Guards, drafted between 2000 and 2005, had more than 4 three-pointers, more than 8 field goals and more than 1 steal in a game?

Required Reasoning
  • Find Point Guards drafted between 2000 and 2005 (Table: nba_player_information)
  • Filter games with >4 three-pointers, >8 field goals, >1 steal (Text: nba_player_game_stats)
  • Join results via nested subquery to find matching players
SQL Query
SELECT player_name
FROM nba_player_game_stats
WHERE player_name IN (
    SELECT player_name
    FROM nba_player_information
    WHERE position = 'Point Guard'
      AND draft_year BETWEEN 2000 AND 2005
  )
  AND number_of_three_point_field_goals_made > 4
  AND number_of_field_goals_made > 8
  AND number_of_steal > 1
Answer

Chris Paul
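As a minimal sketch, the SQL query above can be executed against a toy in-memory database. The table and column names follow the query on this page; the inserted rows are invented for illustration and are not SPARTA data.

```python
import sqlite3

# Toy in-memory database mirroring the schema used by the example query.
# Rows are hypothetical, chosen so exactly one player satisfies all predicates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nba_player_information (player_name TEXT, position TEXT, draft_year INT);
CREATE TABLE nba_player_game_stats (
    player_name TEXT,
    number_of_three_point_field_goals_made INT,
    number_of_field_goals_made INT,
    number_of_steal INT
);
INSERT INTO nba_player_information VALUES
  ('Chris Paul', 'Point Guard', 2005),
  ('Dwight Howard', 'Center', 2004);
INSERT INTO nba_player_game_stats VALUES
  ('Chris Paul', 5, 9, 2),      -- passes all three game-stat predicates
  ('Dwight Howard', 0, 10, 1);  -- wrong position, filtered out by the subquery
""")

rows = conn.execute("""
SELECT player_name
FROM nba_player_game_stats
WHERE player_name IN (
    SELECT player_name
    FROM nba_player_information
    WHERE position = 'Point Guard'
      AND draft_year BETWEEN 2000 AND 2005
  )
  AND number_of_three_point_field_goals_made > 4
  AND number_of_field_goals_made > 8
  AND number_of_steal > 1
""").fetchall()
print(rows)  # [('Chris Paul',)]
```

In the actual benchmark, a model never sees this SQL: it must recover the same reasoning from the table and the unstructured game-log text.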


Have Questions?

We're here to help! Feel free to reach out to the authors with any questions.

Acknowledgement

This work was partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2025-00517736, 30%), by Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. RS-2024-00509258, Global AI Frontier Lab, 50%; No. RS-2024-00454666, Developing a Vector DB for Long-Term Memory Storage of Hyperscale AI Models, 10%), and by the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education (No. RS-2024-00415602, 10%).

We extend our sincere gratitude to Jaewon Park, an undergraduate research student, for his significant contributions to the implementation of the leaderboard.

The website design is inspired by the Spider 2.0 benchmark.

Citation

If you use SPARTA in your research, please cite our paper:

@inproceedings{park2026sparta,
  title={{SPARTA}: Scalable and Principled Benchmark of Tree-Structured Multi-hop {QA} over Text and Tables},
  author={Sungho Park and Jueun Kim and Wook-Shin Han},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=8KE9qvKhM4}
}

Leaderboard

The interactive leaderboard can be filtered by domain (All, NBA, Movie, Medical), data source (Oracle, Retrieval), and metric (EM, F1, P, R).