Compliance Rating Scheme: A Data Provenance Framework for Generative AI Datasets

Maty Bohacek¹ and Ignacio Vilanova Echavarri²

¹Stanford University  ²Imperial College London

The rapid expansion of Generative AI has outpaced ethical standards for training datasets, often resulting in opaque data collection and significant legal risks such as copyright infringement. To address this gap, we introduce the Compliance Rating Scheme (CRS), a framework for evaluating dataset compliance against core principles of transparency, accountability, and security. In tandem, we open-source a Python library called DatasetSentinel, which implements these principles and integrates into existing pipelines, allowing dataset authors to verify compliance during scraping and practitioners to evaluate existing datasets before use.


Generative AI has experienced exponential growth powered largely by massive datasets built through opaque data collection practices. While most focus remains on model development, the ethical and legal considerations surrounding dataset creation are frequently overlooked. Our work aims to address this by acting on the following insights:

Key Insights

Given these insights, we devise two artifacts that bridge data provenance with AI dataset scraping and evaluation, emphasizing principles of transparency, accountability, and security. We open-source both for the research community and industry practitioners: the Compliance Rating Scheme (CRS), a framework for evaluating dataset compliance, and DatasetSentinel, a Python library that implements CRS in scraping and evaluation pipelines.

Using CRS for Responsible Data Collection

The first application of CRS is during the dataset creation phase. When building a new dataset, authors can leverage the DatasetSentinel library to screen individual datapoints as they are being collected. This proactive approach ensures that only compliant data enters the dataset from the outset, preventing costly legal and ethical issues down the line.

Using CRS During Dataset Creation. The DatasetSentinel library integrates into data scraping pipelines to evaluate individual datapoints before inclusion. By checking provenance metadata against compliance criteria, dataset authors can ensure only ethically and legally compliant data is collected, preventing downstream issues before they occur.
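The screening step described above can be sketched as a filter in a scraping loop. This is a minimal, self-contained illustration of the idea, not DatasetSentinel's actual API: the metadata field names, the allowed-license set, and the compliance checks are all assumptions made for the example.

```python
# Hypothetical sketch of scraping-time compliance screening.
# Field names and criteria are illustrative, not DatasetSentinel's real API.

ALLOWED_LICENSES = {"CC-BY-4.0", "CC0-1.0", "MIT"}

def is_compliant(datapoint: dict) -> bool:
    """Return True if the datapoint's provenance metadata passes basic checks."""
    meta = datapoint.get("metadata", {})
    return (
        meta.get("license") in ALLOWED_LICENSES      # permissive license on record
        and bool(meta.get("source_url"))             # provenance is traceable
        and not meta.get("contains_pii", False)      # no flagged personal data
    )

# Only compliant datapoints enter the dataset as they are collected.
scraped = [
    {"id": 1, "metadata": {"license": "CC-BY-4.0", "source_url": "https://example.org/a"}},
    {"id": 2, "metadata": {"license": "proprietary", "source_url": "https://example.org/b"}},
]
dataset = [d for d in scraped if is_compliant(d)]
```

Filtering at collection time, as opposed to auditing after the fact, is what makes the approach proactive: a non-compliant datapoint (here, the proprietary-licensed one) never enters the dataset.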

Evaluating Existing Datasets with CRS

The second application targets AI practitioners who need to assess datasets before using them for training. Rather than relying solely on license descriptions or author claims, practitioners can use DatasetSentinel to independently verify a dataset's compliance. The tool analyzes the entire dataset and produces a comprehensive CRS score, replacing blind trust with concrete evidence of compliance, or lack thereof.

Using CRS to Assess Existing Datasets. AI practitioners can evaluate datasets they're considering for training by computing a CRS score. The framework analyzes the entire dataset against six compliance criteria, providing a letter grade (A-G) that indicates the dataset's adherence to ethical and legal standards, along with detailed reasoning for the assessment.
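Dataset-level evaluation can be sketched as running a set of per-criterion checks over every datapoint and mapping the results to a letter grade. Everything below is an assumption for illustration: the two example checks stand in for the six CRS criteria, and the grade mapping (count of failed criteria, A for zero failures through G for all six) is a guess at how A-G grading might work, not the scheme's actual scoring rules.

```python
# Hypothetical sketch of grading an existing dataset, CRS-style.
# The checks and the A-G mapping are assumptions, not DatasetSentinel's rules.

def evaluate_dataset(datapoints, criteria_checks):
    """Grade a dataset: a criterion passes only if every datapoint satisfies it."""
    results = {
        name: all(check(d) for d in datapoints)
        for name, check in criteria_checks.items()
    }
    failed = sum(1 for passed in results.values() if not passed)
    grade = chr(ord("A") + failed)  # 0 failures -> A, 6 failures -> G (assumed)
    return grade, results

# Illustrative stand-ins for two of the six criteria.
checks = {
    "C1_license": lambda d: d.get("license") in {"CC-BY-4.0", "CC0-1.0"},
    "C2_provenance": lambda d: bool(d.get("source_url")),
}
data = [
    {"license": "CC-BY-4.0", "source_url": "https://example.org/x"},
    {"license": "CC-BY-4.0", "source_url": ""},
]
grade, results = evaluate_dataset(data, checks)
# One criterion fails (a datapoint lacks a source URL), so under this
# assumed mapping the dataset receives a "B".
```

Returning the per-criterion results alongside the grade mirrors the framework's emphasis on detailed reasoning: the letter summarizes compliance, while the breakdown shows which principle was violated.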

Case Studies: Evaluating Real-World Datasets

We applied the CRS framework to four publicly available datasets from different distribution platforms to demonstrate its practical utility. The results reveal significant compliance gaps across widely-used datasets in the AI community.

Dataset          Source           Modality   C1   C2   C3   C4   C5   C6   CRS Score
SOD4SB           GitHub           Images                                   C
MS COCO          Custom website   Images                                   F
RANDOM People    Hugging Face     Videos                                   B
TikTok Dataset   Kaggle           Videos                                   G

The results demonstrate that even popular, widely used datasets often fail to meet basic compliance standards. Only one dataset (RANDOM People) achieved a "B" rating, while MS COCO, a staple of computer vision research, received an "F" rating, highlighting significant ethical and legal concerns in current dataset practices.

Envisioning CRS Adoption

To facilitate widespread adoption, we envision CRS scores being prominently displayed on dataset distribution platforms. Here are mockups showing how the CRS framework could be integrated into popular platforms like GitHub, Hugging Face, and academic lab websites, making compliance information immediately visible to practitioners evaluating datasets.

Hugging Face Integration. CRS score displayed prominently on dataset cards, allowing practitioners to quickly assess compliance before downloading.

Citation

@inproceedings{bohacek2025compliance,
    title     = {Compliance Rating Scheme: A Data Provenance Framework for Generative AI Datasets},
    author    = {Bohacek, Matyas and Vilanova Echavarri, Ignacio},
    booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
    pages     = {12150--12159},
    year      = {2025}
}