The Open Artificial Knowledge (OAK) dataset is a large-scale resource of over 500 million tokens designed to address the challenges of acquiring high-quality, diverse, and ethically sourced training data for Large Language Models (LLMs). OAK leverages an ensemble of state-of-the-art LLMs to generate high-quality text across diverse domains, guided by Wikipedia's main categories.
from datasets import load_dataset

# Load the training split from the Hugging Face Hub
ds = load_dataset("tabularisai/oak", split="train")
# Inspect the first record
ds[0]
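To look at a few records without reading long generated texts in full, a small preview helper can be useful. The following is a minimal sketch, not part of the OAK tooling: the `peek` function and its `n` and `max_chars` parameters are hypothetical names introduced here, and the helper makes no assumption about the dataset's column names, since it simply truncates whatever fields each record contains.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def peek(
    records: Iterable[Dict[str, Any]], n: int = 3, max_chars: int = 80
) -> List[Dict[str, str]]:
    """Return the first n records with each value stringified and cut to max_chars.

    Hypothetical helper for illustration; works on any iterable of dict records.
    """
    preview = []
    for record in islice(records, n):
        # Convert every value to a string so mixed types can be truncated uniformly.
        preview.append({key: str(value)[:max_chars] for key, value in record.items()})
    return preview
```

Applied to the dataset, this could look like `peek(load_dataset("tabularisai/oak", split="train", streaming=True))`, where `streaming=True` iterates over records without downloading the full dataset first.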
@misc{borisov2024open,
title={Open Artificial Knowledge},
author={Vadim Borisov and Richard H. Schreiber},
year={2024},
eprint={2407.14371},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.14371},
}
Users must adhere to ethical guidelines, respect privacy considerations, and be mindful of potential biases in the synthetic data. The OAK dataset is intended for research purposes only.