Data Processing¤

Comprehensive data loading, preprocessing, and augmentation pipelines for training generative models across multiple modalities.

Overview¤

  • Multi-Modal Support: image, text, audio, video, protein, and multimodal datasets
  • Standard Datasets: MNIST, CIFAR, ImageNet, FFHQ, LibriSpeech, and more
  • Augmentation: modality-specific data augmentation pipelines
  • Streaming: WebDataset, TFRecord, and remote data loading

Quick Start¤

Loading Standard Datasets¤

from artifex.data import load_dataset

# Load MNIST
train_data, test_data = load_dataset(
    "mnist",
    batch_size=128,
    split=("train", "test"),
)

# Load CIFAR-10
train_data = load_dataset(
    "cifar10",
    batch_size=64,
    augment=True,
)
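
Continuing the CIFAR-10 snippet above, a quick sanity check is to pull one batch and inspect its shapes. This sketch assumes the loader is iterable and that each batch is a dict with "image" and "label" arrays (mirroring the COCO example later on this page); the actual batch structure may differ.

import numpy as np

# Grab the first training batch (field names are assumptions, not guaranteed).
batch = next(iter(train_data))
print(np.shape(batch["image"]))   # e.g. (64, 32, 32, 3) for CIFAR-10
print(np.shape(batch["label"]))   # e.g. (64,)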

Custom Datasets¤

from artifex.data import ImageDataset

dataset = ImageDataset(
    root="/path/to/images",
    image_size=(256, 256),
    normalize=True,
)

Image Datasets¤

Standard Datasets¤

Dataset | Description | Size
CIFAR-10/100 | 32x32 natural images | 60K / 60K
ImageNet | 1000-class natural images | 1.2M
FFHQ | High-quality face images | 70K
Custom Image | Load from directory | Variable

Loading Images¤

from artifex.data.image import CIFARDataset, FFHQDataset

# CIFAR-10
cifar = CIFARDataset(
    root="./data",
    train=True,
    download=True,
)

# FFHQ
ffhq = FFHQDataset(
    root="/path/to/ffhq",
    resolution=256,
)

Image Augmentation¤

import jax

from artifex.data.augmentation import ImageAugmentation

augment = ImageAugmentation(
    random_flip=True,
    random_crop=True,
    color_jitter=0.1,
    random_rotation=15,
)

# Augmentations are stochastic, so the call takes an explicit PRNG key.
# `images` is a batch of images loaded earlier.
prng_key = jax.random.PRNGKey(0)
augmented = augment(images, key=prng_key)

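Inside a training loop, a fresh key per step keeps augmentations independent across steps. A minimal sketch, continuing the snippet above and assuming augment is a pure function of the key passed to it:

import jax

key = jax.random.PRNGKey(42)
for step in range(1000):
    # Split off a dedicated key for this step's augmentation.
    key, aug_key = jax.random.split(key)
    augmented = augment(images, key=aug_key)
    # ... run one training step on `augmented` ...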

Text Datasets¤

Standard Datasets¤

Dataset | Description | Size
Wikipedia | Wikipedia articles | 6M+ articles
BookCorpus | Book text corpus | 11K books
Custom Text | Load from files | Variable

Loading Text¤

from artifex.data.text import WikipediaDataset

wiki = WikipediaDataset(
    language="en",
    max_length=512,
)

Tokenizers¤

from artifex.data.tokenizers import BPETokenizer, SentencePieceTokenizer

# BPE Tokenizer
tokenizer = BPETokenizer(vocab_size=32000)
tokenizer.fit(corpus)
tokens = tokenizer.encode(text)

# SentencePiece
tokenizer = SentencePieceTokenizer(model_path="model.spm")

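A quick way to validate a trained tokenizer is an encode/decode round trip. The sketch below assumes encode returns a list of integer ids and that a matching decode method exists; neither detail is confirmed on this page.

from artifex.data.tokenizers import BPETokenizer

corpus = ["the quick brown fox", "jumps over the lazy dog"]

tokenizer = BPETokenizer(vocab_size=32000)
tokenizer.fit(corpus)

ids = tokenizer.encode("the quick brown fox")
print(ids)                    # a list of integer token ids (values illustrative)
print(tokenizer.decode(ids))  # decode() assumed; should round-trip the text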

Audio Datasets¤

Standard Datasets¤

Dataset | Description | Hours
LibriSpeech | Read English speech | 1000h
VCTK | Multi-speaker speech | 44h
Custom Audio | Load from files | Variable

Loading Audio¤

from artifex.data.audio import LibriSpeechDataset

librispeech = LibriSpeechDataset(
    root="./data",
    subset="train-clean-100",
    sample_rate=16000,
)

Audio Preprocessing¤

from artifex.data.preprocessing import AudioPreprocessor

preprocessor = AudioPreprocessor(
    sample_rate=16000,
    n_mels=80,
    n_fft=1024,
    hop_length=256,
)

mel_spec = preprocessor.to_mel_spectrogram(audio)

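With the settings above, each mel frame advances by hop_length / sample_rate = 256 / 16000 = 16 ms of audio, so a one-second clip yields roughly 63 frames of 80 mel bins. A rough shape check (the exact frame count depends on the padding convention, which is not specified here):

import numpy as np

sample_rate = 16000
hop_length = 256
n_mels = 80

audio = np.zeros(sample_rate)  # one second of (silent) audio

# Approximate frame count; off by one or two depending on centering/padding.
n_frames = len(audio) // hop_length + 1
print(n_frames, n_mels)  # -> 63 80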

Video Datasets¤

Standard Datasets¤

Dataset | Description | Size
UCF101 | Action recognition | 13K clips
Custom Video | Load from files | Variable

Loading Video¤

from artifex.data.video import UCF101Dataset

ucf = UCF101Dataset(
    root="/path/to/ucf101",
    frames_per_clip=16,
    frame_rate=5,
)


Multimodal Datasets¤

Standard Datasets¤

Dataset | Description | Modalities
COCO | Image + captions | Image, Text
Custom Multimodal | Custom pairs | Variable

Loading Multimodal Data¤

from artifex.data.multimodal import COCODataset

coco = COCODataset(
    root="/path/to/coco",
    split="train2017",
    include_captions=True,
)

for batch in coco:
    images = batch["image"]
    captions = batch["caption"]


Protein Datasets¤

from artifex.data.protein import ProteinDataset

protein_data = ProteinDataset(
    pdb_dir="/path/to/structures",
    max_length=256,
    include_sequence=True,
    include_structure=True,
)

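Because include_sequence and include_structure are both enabled, each item should carry an amino-acid sequence and 3D coordinates. A hypothetical peek at one sample; the indexing and the field names "sequence" and "coordinates" are illustrative, not confirmed by this page:

sample = protein_data[0]

# Field names below are hypothetical; check the ProteinDataset reference.
print(len(sample["sequence"]))        # residue count, capped at max_length=256
print(sample["coordinates"].shape)    # e.g. (n_residues, n_atoms, 3)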

Data Pipeline¤

Pipeline API¤

from artifex.data import DataPipeline

pipeline = DataPipeline()
pipeline.add_step("load", loader)
pipeline.add_step("preprocess", preprocessor)
pipeline.add_step("augment", augmentation)
pipeline.add_step("batch", batcher)

for batch in pipeline(data):
    # Process batch
    pass

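Steps run in the order they are added, with each step receiving the previous step's output. The sketch below fills in a placeholder callable to make that concrete; the step body is illustrative, and the assumption that steps are applied per item is not documented here.

import numpy as np

from artifex.data import DataPipeline

# Toy "dataset": a few fake 8-bit RGB images.
data = [np.random.randint(0, 256, size=(32, 32, 3)) for _ in range(4)]

def preprocess(example):
    # Scale 8-bit pixel values to [0, 1].
    return np.asarray(example, dtype=np.float32) / 255.0

pipeline = DataPipeline()
pipeline.add_step("preprocess", preprocess)

for example in pipeline(data):
    print(example.shape, example.dtype)  # (32, 32, 3) float32, if applied per item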

Collators¤

from artifex.data import DynamicBatchCollator

collator = DynamicBatchCollator(
    pad_token_id=0,
    max_length=512,
)

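A collator turns a list of variable-length examples into a single padded batch. The sketch below assumes the collator is called directly on a list of token-id sequences and pads them to a common length with pad_token_id; the call signature and return structure are assumptions.

from artifex.data import DynamicBatchCollator

collator = DynamicBatchCollator(
    pad_token_id=0,
    max_length=512,
)

# Three sequences of different lengths.
examples = [
    [5, 8, 13],
    [2, 7],
    [4, 4, 9, 21, 6],
]

# Assumed usage: pads every sequence to the longest in the batch (here 5),
# never exceeding max_length.
batch = collator(examples)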

Samplers¤

from artifex.data import DistributedSampler, WeightedSampler

# Distributed training
sampler = DistributedSampler(
    dataset=dataset,
    num_replicas=4,
    rank=0,
)

# Weighted sampling
sampler = WeightedSampler(
    weights=class_weights,
    num_samples=len(dataset),
)

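For WeightedSampler, a common choice is inverse-frequency weights so rare classes are drawn about as often as frequent ones. Since num_samples=len(dataset), the weights here are per example; a minimal NumPy sketch with an illustrative label array:

import numpy as np

# Per-example integer class labels (illustrative).
labels = np.array([0, 0, 0, 1, 1, 2])

counts = np.bincount(labels)          # [3, 2, 1] examples per class
class_weights = 1.0 / counts[labels]  # one weight per example, inversely
                                      # proportional to its class frequency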

Streaming Data¤

WebDataset¤

from artifex.data.streaming import WebDatasetLoader

loader = WebDatasetLoader(
    urls="s3://bucket/data-{000..099}.tar",
    batch_size=64,
    shuffle=True,
)

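The {000..099} range in the URL is standard WebDataset brace notation and expands to 100 tar shards. Usage is a plain loop over the loader defined above, assuming it is iterable:

for batch in loader:
    ...  # each batch holds 64 samples streamed directly from the remote shards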

TFRecord¤

import tensorflow as tf

from artifex.data.streaming import TFRecordLoader

loader = TFRecordLoader(
    pattern="/path/to/data-*.tfrecord",
    features={
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    },
)


Remote Loading¤

from artifex.data.streaming import RemoteLoader

loader = RemoteLoader(
    endpoint="https://api.example.com/data",
    cache_dir="./cache",
)


Module Reference¤

Category | Modules
Image | cifar, custom_image, ffhq, imagenet
Text | bookcorpus, custom_text, wikipedia
Audio | custom_audio, librispeech, vctk
Video | custom_video, ucf101
Multimodal | coco, custom_multimodal
Protein | dataset
Tokenizers | bpe, character, sentencepiece, word
Augmentation | audio, image, text, video
Preprocessing | audio, base, image, text, video
Loaders | audio, base, collators, image, pipeline, protein_dataset, registry, samplers, structured, text, video
Streaming | remote, tfrecord, webdataset