Compliance Rating Scheme: A Data Provenance Framework for Generative AI Datasets

Maty Bohacek¹ and Ignacio Vilanova Echavarri²

¹Stanford University  ²Imperial College London

The rapid expansion of Generative AI has outpaced ethical standards for training datasets, often resulting in opaque data collection and significant legal risks such as copyright infringement. To address this gap, we introduce the Compliance Rating Scheme (CRS), a framework for evaluating dataset compliance against core principles of transparency, accountability, and security. In tandem, we open-source a Python library called DatasetSentinel, which implements these principles and integrates into existing pipelines, allowing dataset authors to verify compliance during scraping and practitioners to evaluate existing datasets before use.


Generative AI has experienced exponential growth powered largely by massive datasets built through opaque data collection practices. While most focus remains on model development, the ethical and legal considerations surrounding dataset creation are frequently overlooked. Our work aims to address this by acting on the following insights:

Key Insights

Given these insights, we devise two artifacts that bridge data provenance with AI dataset scraping and evaluation, emphasizing principles of transparency, accountability, and security. We open-source both for the research community and industry practitioners: the Compliance Rating Scheme (CRS), a framework for evaluating dataset compliance, and DatasetSentinel, a Python library that implements CRS in scraping and evaluation pipelines.

Using CRS for Responsible Data Collection

The first application of CRS is during the dataset creation phase. When building a new dataset, authors can leverage the DatasetSentinel library to screen individual datapoints as they are being collected. This proactive approach ensures that only compliant data enters the dataset from the outset, preventing costly legal and ethical issues down the line.

Using CRS During Dataset Creation. The DatasetSentinel library integrates into data scraping pipelines to evaluate individual datapoints before inclusion. By checking provenance metadata against compliance criteria, dataset authors can ensure only ethically and legally compliant data is collected, preventing downstream issues before they occur.
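The screening step described above can be sketched as a filter in a scraping loop. This is a minimal, self-contained illustration of the idea, not DatasetSentinel's actual API: the metadata field names, the allowed-license set, and the compliance checks are all assumptions made for the example.

```python
# Hypothetical sketch of scraping-time compliance screening.
# Field names and criteria are illustrative, not DatasetSentinel's real API.

ALLOWED_LICENSES = {"CC-BY-4.0", "CC0-1.0", "MIT"}

def is_compliant(datapoint: dict) -> bool:
    """Return True if the datapoint's provenance metadata passes basic checks."""
    meta = datapoint.get("metadata", {})
    return (
        meta.get("license") in ALLOWED_LICENSES      # permissive license on record
        and bool(meta.get("source_url"))             # provenance is traceable
        and not meta.get("contains_pii", False)      # no flagged personal data
    )

# Only compliant datapoints enter the dataset as they are collected.
scraped = [
    {"id": 1, "metadata": {"license": "CC-BY-4.0", "source_url": "https://example.org/a"}},
    {"id": 2, "metadata": {"license": "proprietary", "source_url": "https://example.org/b"}},
]
dataset = [d for d in scraped if is_compliant(d)]
```

Filtering at collection time, as opposed to auditing after the fact, is what makes the approach proactive: a non-compliant datapoint (here, the proprietary-licensed one) never enters the dataset.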

Evaluating Existing Datasets with CRS

The second application targets AI practitioners who need to assess datasets before using them for training. Rather than relying solely on license descriptions or author claims, practitioners can use DatasetSentinel to independently verify a dataset's compliance. The tool analyzes the entire dataset and produces a comprehensive CRS score, replacing blind trust with concrete evidence of compliance, or lack thereof.

Using CRS to Assess Existing Datasets. AI practitioners can evaluate datasets they're considering for training by computing a CRS score. The framework analyzes the entire dataset against six compliance criteria, providing a letter grade (A-G) that indicates the dataset's adherence to ethical and legal standards, along with detailed reasoning for the assessment.
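Dataset-level evaluation can be sketched as running a set of per-criterion checks over every datapoint and mapping the results to a letter grade. Everything below is an assumption for illustration: the two example checks stand in for the six CRS criteria, and the grade mapping (count of failed criteria, A for zero failures through G for all six) is a guess at how A-G grading might work, not the scheme's actual scoring rules.

```python
# Hypothetical sketch of grading an existing dataset, CRS-style.
# The checks and the A-G mapping are assumptions, not DatasetSentinel's rules.

def evaluate_dataset(datapoints, criteria_checks):
    """Grade a dataset: a criterion passes only if every datapoint satisfies it."""
    results = {
        name: all(check(d) for d in datapoints)
        for name, check in criteria_checks.items()
    }
    failed = sum(1 for passed in results.values() if not passed)
    grade = chr(ord("A") + failed)  # 0 failures -> A, 6 failures -> G (assumed)
    return grade, results

# Illustrative stand-ins for two of the six criteria.
checks = {
    "C1_license": lambda d: d.get("license") in {"CC-BY-4.0", "CC0-1.0"},
    "C2_provenance": lambda d: bool(d.get("source_url")),
}
data = [
    {"license": "CC-BY-4.0", "source_url": "https://example.org/x"},
    {"license": "CC-BY-4.0", "source_url": ""},
]
grade, results = evaluate_dataset(data, checks)
# One criterion fails (a datapoint lacks a source URL), so under this
# assumed mapping the dataset receives a "B".
```

Returning the per-criterion results alongside the grade mirrors the framework's emphasis on detailed reasoning: the letter summarizes compliance, while the breakdown shows which principle was violated.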

Case Studies: Evaluating Real-World Datasets

We applied the CRS framework to four publicly available datasets from different distribution platforms to demonstrate its practical utility. The results reveal significant compliance gaps across widely-used datasets in the AI community.

Dataset          Source           Modality   C1   C2   C3   C4   C5   C6   CRS Score
SOD4SB           GitHub           Images                                   C
MS COCO          Custom website   Images                                   F
RANDOM People    Hugging Face     Videos                                   B
TikTok Dataset   Kaggle           Videos                                   G

The results demonstrate that even popular, widely used datasets often fail to meet basic compliance standards. Only one dataset (RANDOM People) achieved a "B" rating, while MS COCO, a staple of computer vision research, received an "F" rating, highlighting significant ethical and legal concerns in current dataset practices.

Envisioning CRS Adoption

To facilitate widespread adoption, we envision CRS scores being prominently displayed on dataset distribution platforms. Here are mockups showing how the CRS framework could be integrated into popular platforms like GitHub, Hugging Face, and academic lab websites, making compliance information immediately visible to practitioners evaluating datasets.

Hugging Face Integration. CRS score displayed prominently on dataset cards, allowing practitioners to quickly assess compliance before downloading.

Citation

@inproceedings{bohacek2025compliance,
    title     = {Compliance Rating Scheme: A Data Provenance Framework for Generative AI Datasets},
    author    = {Bohacek, Matyas and Vilanova Echavarri, Ignacio},
    booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
    pages     = {12150--12159},
    year      = {2025}
}