Open Artificial Knowledge (OAK) Dataset

The Open Artificial Knowledge (OAK) dataset is a large-scale resource of over 500 million tokens, designed to address the challenges of acquiring high-quality, diverse, and ethically sourced training data for Large Language Models (LLMs). OAK uses an ensemble of state-of-the-art LLMs to generate text across diverse domains, guided by Wikipedia's main categories.

653,552,076 tokens

of high-quality synthetic data

Broad knowledge coverage

High-level topics are extracted from Wikipedia.

Generated using

GPT4o, LLaMa3-70B, LLaMa3-8B, Mixtral-8x7B, Gemma-7B, and Gemma-2-9B

> Download

from datasets import load_dataset

# Download the full training split from the Hugging Face Hub, then inspect the first record
ds = load_dataset("tabularisai/oak", split="train")
ds[0]
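
Because the dataset holds several hundred million tokens, it can be convenient to look at a few records before committing to a full download. The following is a minimal sketch, not part of the official instructions: it assumes the `streaming=True` option of `load_dataset` is usable for this repository, and the helper names `peek`, `column_names`, and `stream_oak_sample` are hypothetical. No particular column names are assumed; the helpers only report whatever fields the records contain.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def peek(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n records of any iterable, reading no further than needed."""
    return list(islice(records, n))


def column_names(records: List[Dict[str, Any]]) -> List[str]:
    """Collect the field names seen across the given records, in first-seen order."""
    seen: Dict[str, None] = {}
    for record in records:
        for key in record:
            seen.setdefault(key, None)
    return list(seen)


def stream_oak_sample(n: int = 3) -> List[Dict[str, Any]]:
    """Stream the first n OAK records from the Hub (requires network access).

    Assumption: the repository can be read in streaming mode.
    """
    # Imported lazily so the helpers above remain usable without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset("tabularisai/oak", split="train", streaming=True)
    return peek(ds, n)
```

Calling `column_names(stream_oak_sample())` would then list the fields actually present in the first few records, which is a cheap way to check the schema before loading the whole split.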

> Citation

@misc{borisov2024open,
      title={Open Artificial Knowledge}, 
      author={Vadim Borisov and Richard H. Schreiber},
      year={2024},
      eprint={2407.14371},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.14371}, 
}

> Disclaimer

The OAK dataset is intended for research purposes only. Users must adhere to ethical guidelines, respect privacy considerations, and be mindful of potential biases in the synthetic data.