# Data

## Data location
Our data are hosted on Huggingface. We provide access to the following collections:
| Name | Description | Purpose | Variations |
|---|---|---|---|
| data | A cleaned collection that only contains test-ready releases | Good for LLM benchmarking | `data`, `*-objid`, `*-randid`, `*-70steps` |
| data-intermediate | A full collection with all of our labeling and intermediate files | For digging deeper into data labeling, or deriving further customized versions | `data-intermediate`, `*-objid`, `*-randid`, `*-70steps` |
Note: if your connection to huggingface.co is slow, you can also find us on the Huggingface mirror, hf-mirror.com.
## Folder Structure
Each folder inside `data` contains the cleaned-up files used during LLM inference and result evaluation. Here is the tree structure of the game folder `data/night`:
```
data/night/
├── night.actions.json    # list of mentioned actions
├── night.all2all.json    # all simple paths between any 2 locations
├── night.all_pairs.json  # all connectivity between any 2 locations
├── night.edges.json      # list of all edges
├── night.locations.json  # list of all locations
└── night.walkthrough     # enriched walkthrough exported from Jericho simulator
```
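For illustration, here is a minimal sketch of loading a game's JSON files in Python. The exact JSON schemas are not documented above, so the printed summary is only illustrative:

```python
import json
from pathlib import Path

game_dir = Path("data/night")  # any game folder under data/ works the same way

# Load a few of the per-game files; names follow the tree above.
with open(game_dir / "night.locations.json") as f:
    locations = json.load(f)   # list of all locations
with open(game_dir / "night.edges.json") as f:
    edges = json.load(f)       # list of all edges
with open(game_dir / "night.all_pairs.json") as f:
    all_pairs = json.load(f)   # connectivity between any 2 locations

print(f"night: {len(locations)} locations, {len(edges)} edges, {len(all_pairs)} pairs")
```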
Each folder inside `data-intermediate` contains all intermediate files we used during data annotation and generation. Here is the tree structure of the game folder `data-intermediate/night`:
```
data-intermediate/night/
├── night.all2all.json       # all simple paths between any 2 nodes
├── night.all_pairs.json     # all connectivity between any 2 nodes
├── night.anno2code.json     # annotation to codename mapping
├── night.code2anno.json     # codename to annotation mapping
├── night.edges.json         # list of all edges
├── night.map.human          # human map derived from human annotation
├── night.map.machine        # machine map derived from exported action sequences
├── night.map.reversed       # reverse map derived from human annotation map
├── night.moves              # list of mentioned actions
├── night.nodes.json         # list of all nodes
├── night.valid_moves.csv    # human annotation
├── night.walkthrough        # enriched walkthrough exported from Jericho simulator
└── night.walkthrough_acts   # action sequences exported from Jericho simulator
```
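As a rough sketch, the two annotation mapping files can be inspected like this (assuming they are flat JSON objects, which is not guaranteed by the listing above):

```python
import json

with open("data-intermediate/night/night.code2anno.json") as f:
    code2anno = json.load(f)   # codename -> human annotation
with open("data-intermediate/night/night.anno2code.json") as f:
    anno2code = json.load(f)   # human annotation -> codename

# Print a few entries; assumes a flat key/value mapping.
for code, anno in list(code2anno.items())[:5]:
    print(code, "->", anno)
```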
## Variations

### 70-step vs all-step version
In our paper, we benchmark using the first 70 steps of the walkthrough from each game. We also provide all-step versions of both the `data` and `data-intermediate` collections.
- **70-step**: `data[-intermediate]-70steps.tar.zst` contains the first 70 steps of each walkthrough. If the complete walkthrough is shorter than 70 steps, all steps are used.
- **All-step**: `data[-intermediate].tar.zst` contains all steps of each walkthrough.
### Word-only & Word+ID
- **Word-only**: `data[-intermediate].tar.zst`: nodes are annotated with additional descriptive text to distinguish different locations that have similar names.
- **Word + Object ID**: `data[-intermediate]-objid.tar.zst`: a variation of the word-only version, where nodes are labeled using minimally fixed names with the object ID from the Jericho simulator.
- **Word + Random ID**: `data[-intermediate]-randid.tar.zst`: a variation of the object-ID version, where the Jericho object ID is replaced with a randomly generated integer.
We primarily rely on the word-only version for benchmarking, but also provide the word+ID versions to support diverse benchmark settings.
## How to use
We use `data.tar.zst` as an example here.
### 1. Download from Huggingface

**By direct download**

Download the `.tar.zst` archives directly from the dataset page.
**By git**

Make sure you have git-lfs installed.

```bash
git lfs install
git clone https://huggingface.co/datasets/mango-ttic/data
# or, use hf-mirror if your connection to huggingface.co is slow
# git clone https://hf-mirror.com/datasets/mango-ttic/data
```
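Alternatively, here is a rough sketch using the `huggingface_hub` Python client to fetch a single archive; the filename `data.tar.zst` is an assumption based on the example used in the decompression step below:

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Download one packaged archive from the dataset repository.
archive_path = hf_hub_download(
    repo_id="mango-ttic/data",
    repo_type="dataset",
    filename="data.tar.zst",  # assumed archive name
)
print(archive_path)  # local path to the cached archive
```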
### 2. Decompress

Because some JSON files are huge, we use `tar.zst` to package the data efficiently.

You can get `zstd` from a package manager (e.g. `apt install zstd` or `dnf install zstd`), via `conda install zstd` or `mamba install zstd`, or by using a pre-compiled binary distributed on the `zstd` GitHub page.
Silently decompress:

```bash
tar -I 'zstd -d' -xf data.tar.zst
```

or, verbosely decompress:

```bash
zstd -d -c data.tar.zst | tar -xvf -
```
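If you prefer to stay in Python, a minimal sketch using the third-party `zstandard` package (an assumption, not something the dataset requires) achieves the same extraction into the current directory:

```python
# pip install zstandard
import tarfile
import zstandard

with open("data.tar.zst", "rb") as compressed:
    dctx = zstandard.ZstdDecompressor()
    with dctx.stream_reader(compressed) as reader:
        # Streaming mode ("r|") reads the tar sequentially from the decompressor.
        with tarfile.open(fileobj=reader, mode="r|") as archive:
            archive.extractall(path=".")
```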