GReaT: Generating Realistic Tabular Data

Most teams working with data run into the same problem sooner or later: you need data you do not have.

Maybe you have a real dataset, but it is too small. Maybe it contains rare edge cases you cannot reproduce. Maybe it is locked behind compliance approvals that take weeks. Or maybe you simply cannot share it outside your organization.

This is frustrating. You want to experiment, prototype, or build a demo, but the data you need is sensitive, restricted, or simply not big enough.

Synthetic data offers a way out. The idea is straightforward: generate artificial data that behaves like the real thing. Not random noise, and not exact copies. Instead, you create a dataset that:

  • matches the statistical distributions you care about
  • preserves relationships between columns
  • does not leak actual records from production
  • can be generated in any quantity you need

This lets teams prototype before approvals arrive, run experiments without touching sensitive information, stress-test models on rare scenarios, and share realistic data with partners or students.

The core insight is simple: separate the usefulness of data from its sensitivity.

[Figure: GReaT workflow overview]

What is GReaT?

GReaT stands for Generation of Realistic Tabular data. It is an open-source Python library that uses large language models (LLMs) [LINK to transformer explainer page] to generate synthetic versions of your tables.

The approach is based on research published at ICLR 2023 (one of the top machine learning conferences). Today, GReaT has over 140,000 downloads and ranks in the top 10% of all packages on PyPI. You can see the download statistics at clickpy.clickhouse.com/dashboard/be-great and browse the source code at github.com/tabularis-ai/be_great.

In practical terms, GReaT gives you three things:

  1. Structure preserved: The model learns distributions and correlations between columns automatically
  2. Raw data stays put: You move model checkpoints around, not customer rows
  3. Unlimited extra data: Generate as much as you need for experiments, demos, QA, or stress tests

Quick Start

Here is the simplest way to try GReaT:

from be_great import GReaT
from sklearn.datasets import fetch_california_housing

# Load a well-known tabular dataset as a pandas DataFrame
data = fetch_california_housing(as_frame=True).frame

# Fine-tune a small pretrained language model on the table
model = GReaT(llm='distilgpt2', epochs=50)
model.fit(data)

# Generate 100 new synthetic rows
synthetic_data = model.sample(n_samples=100)

This trains a model on the California Housing dataset and generates 100 new synthetic rows. You can then compare distributions, train downstream models, or use the synthetic data however you need.

If this quick test works for your use case, you have a new tool in your workflow. If it does not, you have lost about five minutes.
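A quick way to judge the result is to compare summary statistics between the real and synthetic columns. Here is a minimal stdlib-only sketch of that comparison; the two random draws stand in for a real column and its GReaT-generated counterpart:

```python
import random
import statistics

random.seed(0)
# Mock stand-ins: in practice these would be a real column
# and the matching column from model.sample(...)
real = [random.gauss(4.0, 1.5) for _ in range(1000)]
synthetic = [random.gauss(4.1, 1.4) for _ in range(1000)]

def summary(values):
    """Per-column summary statistics to compare side by side."""
    return {"mean": statistics.mean(values), "stdev": statistics.stdev(values)}

print("real     :", summary(real))
print("synthetic:", summary(synthetic))
```

If means, spreads, and (for categorical columns) value frequencies line up, the synthetic data is usually good enough for prototyping.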

[Figure: Distribution comparison]

How It Works

Most synthetic data tools feel abstract. GReaT is fairly concrete.

The core idea comes from an observation: language models are very good at predicting “what comes next” given some context. If you frame each row of your table as a sentence, you can use a language model to learn the patterns and then generate new sentences (rows).

Here is what happens internally:

Step 1: Convert rows to text

Each row becomes a sentence like: "Age is 42, Income is 73000, City is Berlin, Has_mortgage is True"
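This textual encoding is easy to sketch in a few lines. The helper below is a hypothetical illustration of the idea, not the library's internal function:

```python
def row_to_sentence(row: dict) -> str:
    """Encode one table row as a 'Column is value' sentence."""
    return ", ".join(f"{col} is {val}" for col, val in row.items())

row = {"Age": 42, "Income": 73000, "City": "Berlin", "Has_mortgage": True}
print(row_to_sentence(row))
# → Age is 42, Income is 73000, City is Berlin, Has_mortgage is True
```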

Step 2: Fine-tune a language model

GReaT uses a pretrained model (typically distilgpt2) and fine-tunes it on your table. The model learns: given the beginning of a row, what token comes next?
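The next-token objective can be sketched conceptually: every prefix of an encoded row is paired with the token that follows it. (Whitespace tokenization is used here for illustration; real models like distilgpt2 use subword tokenizers.)

```python
sentence = "Age is 42, Income is 73000"
tokens = sentence.split()

# Each training pair: (context so far, next token to predict)
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs[:3]:
    print(context, "->", target)
```

Fine-tuning adjusts the pretrained model's weights so that, for your table, each context assigns high probability to the correct next token.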

Step 3: Generate new rows

At sampling time, the model writes new “sentences” token by token. These are then parsed back into table format.
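Parsing a generated sentence back into a row is the inverse of the encoding in Step 1. Again, this is a simplified sketch of the idea, not the library's actual parser:

```python
def sentence_to_row(sentence: str) -> dict:
    """Decode a 'Column is value' sentence back into a row dict."""
    row = {}
    for part in sentence.split(", "):
        col, _, val = part.partition(" is ")
        row[col] = val
    return row

print(sentence_to_row("Age is 42, City is Berlin"))
# → {'Age': '42', 'City': 'Berlin'}
```

Note that everything comes back as a string; in practice the values are cast back to the original column dtypes.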

Because this is standard language modeling, several things work naturally:

  • Mixed numerical and categorical data is handled automatically
  • Text columns are not a special case
  • Correlations between columns are learned as part of the sequence

This is the whole approach. No separate encoders for different data types, no complex preprocessing pipelines.

Where Teams Use GReaT

Three patterns come up repeatedly:

Sharing data without sharing data

You cannot send production data to an external vendor, a student intern, or a partner company. Instead, you train GReaT on the real table and share only the synthetic version. This is not a privacy guarantee by itself, but it is a significant improvement over emailing CSVs.

Prototyping before access is approved

Many teams use GReaT to create “shadow datasets”. They build their entire pipeline on synthetic data, get everything working, and then swap in the real table once compliance approves access. This can save weeks of waiting.

Stress-testing on rare cases

If you need to see how a model behaves on specific segments (high-income customers, rare product categories, edge-case combinations), you can generate more data for exactly those slices.

[Figure: Use case overview]

Conditional Generation

Sometimes you do not want random synthetic rows. You want synthetic rows that match specific criteria.

GReaT supports this through conditional sampling:

synthetic_data = model.sample(
    n_samples=100,
    conditions={"Age": ">30", "Income": ">50000"}
)

This generates 100 rows where Age is above 30 and Income is above 50,000. The model fills in the remaining columns based on what it learned about people in that segment.

This is useful for stress-testing models on specific slices, generating balanced datasets, or exploring “what if” scenarios.
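To make the condition strings concrete, here is a hypothetical sketch of how a row can be checked against conditions like ">30" (an illustration of the semantics, not the library's internals):

```python
def satisfies(row: dict, conditions: dict) -> bool:
    """Check whether a row meets every '>x' / '<x' condition."""
    for col, cond in conditions.items():
        op, threshold = cond[0], float(cond[1:])
        value = float(row[col])
        if op == ">" and not value > threshold:
            return False
        if op == "<" and not value < threshold:
            return False
    return True

row = {"Age": 42, "Income": 73000}
print(satisfies(row, {"Age": ">30", "Income": ">50000"}))
# → True
```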

Filling Missing Values

A lesser-known feature: once you have trained a GReaT model, you can use it to impute missing values in your data.

imputed_data = model.impute(incomplete_data, max_length=200)

The model uses what it learned about the joint distribution to make educated guesses about missing entries. No hand-crafted rules required.
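Conceptually, imputation reuses the same text encoding: the known cells become a prompt prefix, and the model completes the missing ones. A hypothetical sketch of the prompt construction (not the library's internal code):

```python
def row_to_prompt(row: dict) -> str:
    """Turn the known cells of a row into a generation prompt;
    missing cells (None) are left for the model to complete."""
    known = [f"{col} is {val}" for col, val in row.items() if val is not None]
    return ", ".join(known) + ", "

incomplete = {"Age": 42, "City": "Berlin", "Income": None}
print(row_to_prompt(incomplete))
# → Age is 42, City is Berlin,
```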

Why Use a Language Model?

We tried the usual approaches first: GANs (like CTGAN), variational autoencoders (TVAE), and classical statistical methods like copulas.

These can work well, but they tend to need more manual tuning, struggle with mixed data types and free-text columns, and are harder to condition in flexible ways.

Language models solve a different problem: “given this context, what comes next?” Once you frame table rows as sequences, you can use that capability directly. The model handles mixed types naturally because everything becomes text.

The research paper goes into more detail: arxiv.org/pdf/2210.06280.

Real-World Applications

We see GReaT used across industries:

  • Healthcare: Generating synthetic patient records for method development and testing
  • Finance: Creating realistic transaction data for fraud detection prototypes
  • Manufacturing: Producing sensor data for anomaly detection experiments
  • Research: Augmenting small datasets when you only have a few thousand rows
  • Product teams: Building demo environments that feel real but contain no actual user data

[Figure: Industry applications]

Getting Started

Install the package:

pip install be-great

Then try it on your own data. The GitHub repository has documentation and examples, including a Google Colab notebook you can run immediately.

If you need help with a specific use case or want a custom solution for your organization, reach out to us at [email protected].

Community and Contributions

GReaT is an active open-source project. If you try it on a dataset where it fails, that feedback is just as valuable as a success story. You can open issues, submit pull requests, or join the community Discord (linked on the GitHub page).

Citation

If you use GReaT in academic work, please cite the original paper:

@inproceedings{borisov2023language,
  title={Language Models are Realistic Tabular Data Generators},
  author={Vadim Borisov and Kathrin Sessler and Tobias Leemann and Martin Pawelczyk and Gjergji Kasneci},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=cEygmQNOeI}
}

Summary

If you work with tabular data and find yourself blocked by access restrictions, privacy concerns, or small sample sizes, GReaT offers a practical path forward.

It will not solve every problem. But for many real projects, it provides “good enough” synthetic data in a few lines of code. And because it is open source, you can inspect everything, modify it for your needs, and contribute back.

Try it on a dataset you already know well. If the synthetic version behaves like the real one, you will quickly see where it fits into your workflow.