# Data Processing
Comprehensive data loading, preprocessing, and augmentation pipelines for training generative models across multiple modalities.
## Overview

- **Multi-Modal Support**: Image, text, audio, video, protein, and multimodal datasets
- **Standard Datasets**: MNIST, CIFAR, ImageNet, FFHQ, LibriSpeech, and more
- **Augmentation**: Modality-specific data augmentation pipelines
- **Streaming**: WebDataset, TFRecord, and remote data loading
## Quick Start

### Loading Standard Datasets

```python
from artifex.data import load_dataset

# Load MNIST
train_data, test_data = load_dataset(
    "mnist",
    batch_size=128,
    split=("train", "test"),
)

# Load CIFAR-10
train_data = load_dataset(
    "cifar10",
    batch_size=64,
    augment=True,
)
```
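The returned loaders can be iterated directly for a quick sanity check. The sketch below inspects one batch of the CIFAR-10 loader created above; it assumes batches are mappings with `image` and `label` arrays, since the exact batch layout is not documented here.

```python
# Minimal sketch: inspect a single batch.
# Assumption: batches are dict-like with "image" and "label" entries.
for batch in train_data:
    images = batch["image"]  # e.g. (64, 32, 32, 3) for CIFAR-10 at batch_size=64
    labels = batch["label"]  # e.g. (64,)
    break  # look at one batch only
```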
### Custom Datasets

```python
from artifex.data import ImageDataset

dataset = ImageDataset(
    root="/path/to/images",
    image_size=(256, 256),
    normalize=True,
)
```
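A small usage sketch follows. It assumes `ImageDataset` supports the usual dataset conventions of `len()` and integer indexing, which is not stated explicitly above.

```python
# Usage sketch (assumes len() and integer indexing are supported).
print(len(dataset))  # number of images discovered under root
sample = dataset[0]  # first image, resized to 256x256 and normalized
```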
## Image Datasets

### Standard Datasets

| Dataset | Description | Images |
|---|---|---|
| CIFAR-10/100 | 32x32 natural images | 60K / 60K |
| ImageNet | 1000-class natural images | 1.2M |
| FFHQ | High-quality face images | 70K |
| Custom Image | Load from directory | Variable |
### Loading Images

```python
from artifex.data.image import CIFARDataset, FFHQDataset

# CIFAR-10
cifar = CIFARDataset(
    root="./data",
    train=True,
    download=True,
)

# FFHQ
ffhq = FFHQDataset(
    root="/path/to/ffhq",
    resolution=256,
)
```
### Image Augmentation

```python
from artifex.data.augmentation import ImageAugmentation

augment = ImageAugmentation(
    random_flip=True,
    random_crop=True,
    color_jitter=0.1,
    random_rotation=15,
)

augmented = augment(images, key=prng_key)
```
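The augmentation call takes an explicit PRNG key. Assuming the library follows the JAX convention that the `key` argument suggests, a root key can be created once and split per call, as in this sketch.

```python
import jax

# Split a fresh key for each augmentation call so every batch gets
# different random flips, crops, jitter, and rotations.
prng_key = jax.random.PRNGKey(0)
prng_key, step_key = jax.random.split(prng_key)
augmented = augment(images, key=step_key)
```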
## Text Datasets

### Standard Datasets

| Dataset | Description | Size |
|---|---|---|
| Wikipedia | Wikipedia articles | 6M+ articles |
| BookCorpus | Book text corpus | 11K books |
| Custom Text | Load from files | Variable |
### Loading Text

```python
from artifex.data.text import WikipediaDataset

wiki = WikipediaDataset(
    language="en",
    max_length=512,
)
```
### Tokenizers

```python
from artifex.data.tokenizers import BPETokenizer, SentencePieceTokenizer

# BPE tokenizer: learn a 32K-token vocabulary, then encode raw text to ids
tokenizer = BPETokenizer(vocab_size=32000)
tokenizer.fit(corpus)  # corpus: iterable of raw text strings
tokens = tokenizer.encode(text)

# SentencePiece: load a pretrained model from disk
tokenizer = SentencePieceTokenizer(model_path="model.spm")
```
## Audio Datasets

### Standard Datasets

| Dataset | Description | Hours |
|---|---|---|
| LibriSpeech | Read English speech | 1000h |
| VCTK | Multi-speaker speech | 44h |
| Custom Audio | Load from files | Variable |
### Loading Audio

```python
from artifex.data.audio import LibriSpeechDataset

librispeech = LibriSpeechDataset(
    root="./data",
    subset="train-clean-100",
    sample_rate=16000,
)
```
### Audio Preprocessing

```python
from artifex.data.preprocessing import AudioPreprocessor

preprocessor = AudioPreprocessor(
    sample_rate=16000,
    n_mels=80,
    n_fft=1024,
    hop_length=256,
)

mel_spec = preprocessor.to_mel_spectrogram(audio)
```
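A quick way to sanity-check these settings is to relate the hop length to the output frame rate: one mel frame is produced roughly every `hop_length` samples, i.e. 256 / 16000 = 16 ms of audio. The sketch below feeds in one second of dummy audio; the exact frame count can differ by a frame or two depending on the padding convention used internally.

```python
import numpy as np

sample_rate = 16000
audio = np.zeros(sample_rate, dtype=np.float32)  # 1 second of silence

# 256-sample hop at 16 kHz -> one frame every 16 ms, about 62-63 frames/second
frames_per_second = sample_rate / 256
print(frames_per_second)  # 62.5

mel_spec = preprocessor.to_mel_spectrogram(audio)
# Expected size: roughly 80 mel bins by ~63 frames (axis order not documented here)
```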
## Video Datasets

### Standard Datasets

| Dataset | Description | Size |
|---|---|---|
| UCF101 | Action recognition | 13K clips |
| Custom Video | Load from files | Variable |
### Loading Video

```python
from artifex.data.video import UCF101Dataset

ucf = UCF101Dataset(
    root="/path/to/ucf101",
    frames_per_clip=16,
    frame_rate=5,
)
```
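One practical consequence of these settings, assuming `frame_rate` is the sampling rate in frames per second: each clip covers 16 / 5 = 3.2 seconds of source video.

```python
# With frames_per_clip=16 sampled at 5 fps, each clip spans 3.2 s of video.
clip_seconds = 16 / 5  # = 3.2
```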
## Multimodal Datasets

### Standard Datasets

| Dataset | Description | Modalities |
|---|---|---|
| COCO | Image + captions | Image, Text |
| Custom Multimodal | Custom pairs | Variable |
### Loading Multimodal Data

```python
from artifex.data.multimodal import COCODataset

coco = COCODataset(
    root="/path/to/coco",
    split="train2017",
    include_captions=True,
)

for batch in coco:
    images = batch["image"]
    captions = batch["caption"]
```
## Protein Datasets

```python
from artifex.data.protein import ProteinDataset

protein_data = ProteinDataset(
    pdb_dir="/path/to/structures",
    max_length=256,
    include_sequence=True,
    include_structure=True,
)
```
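A minimal iteration sketch, mirroring the multimodal example above. The field names `sequence` and `structure` are assumptions: the dataset is configured to include both, but the exact keys are not documented here.

```python
# Assumed field names: "sequence" and "structure".
for example in protein_data:
    seq = example["sequence"]      # amino-acid sequence, up to 256 residues
    coords = example["structure"]  # structural data parsed from the PDB file
    break
```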
## Data Pipeline

### Pipeline API

```python
from artifex.data import DataPipeline

pipeline = DataPipeline()
pipeline.add_step("load", loader)
pipeline.add_step("preprocess", preprocessor)
pipeline.add_step("augment", augmentation)
pipeline.add_step("batch", batcher)

for batch in pipeline(data):
    # Process batch
    pass
```
### Collators

```python
from artifex.data import DynamicBatchCollator

collator = DynamicBatchCollator(
    pad_token_id=0,
    max_length=512,
)
```
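A sketch of how a collator like this is typically applied: given a list of variable-length token sequences, it pads them to a common length with `pad_token_id` and truncates anything longer than `max_length`. The call signature below (list of examples in, padded batch out) is an assumption, not a documented API.

```python
# Hypothetical usage: pad three variable-length sequences into one batch.
examples = [
    [5, 17, 92],
    [8, 3],
    [11, 44, 7, 21],
]
batch = collator(examples)  # assumed signature: list of examples -> padded batch
# Expected result: shape (3, 4), padded with pad_token_id=0; sequences longer
# than max_length=512 would be truncated.
```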
### Samplers

```python
from artifex.data import DistributedSampler, WeightedSampler

# Distributed training: shard the dataset across 4 replicas (this process is rank 0)
sampler = DistributedSampler(
    dataset=dataset,
    num_replicas=4,
    rank=0,
)

# Weighted sampling, e.g. to rebalance rare classes
sampler = WeightedSampler(
    weights=class_weights,  # class_weights: sampling weights defined elsewhere
    num_samples=len(dataset),
)
```
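As a usage sketch: the iteration below assumes a sampler yields integer indices into the dataset, which is the usual convention for samplers but is not documented here.

```python
# Assumed convention: iterating a sampler yields dataset indices.
for idx in sampler:
    example = dataset[idx]  # assumes integer indexing into the dataset
    break
```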
## Streaming Data

### WebDataset

```python
from artifex.data.streaming import WebDatasetLoader

loader = WebDatasetLoader(
    urls="s3://bucket/data-{000..099}.tar",
    batch_size=64,
    shuffle=True,
)
```
### TFRecord

```python
import tensorflow as tf

from artifex.data.streaming import TFRecordLoader

loader = TFRecordLoader(
    pattern="/path/to/data-*.tfrecord",
    features={
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    },
)
```
### Remote Loading

```python
from artifex.data.streaming import RemoteLoader

loader = RemoteLoader(
    endpoint="https://api.example.com/data",
    cache_dir="./cache",
)
```
## Module Reference

| Category | Modules |
|---|---|
| Image | cifar, custom_image, ffhq, imagenet |
| Text | bookcorpus, custom_text, wikipedia |
| Audio | custom_audio, librispeech, vctk |
| Video | custom_video, ucf101 |
| Multimodal | coco, custom_multimodal |
| Protein | dataset |
| Tokenizers | bpe, character, sentencepiece, word |
| Augmentation | audio, image, text, video |
| Preprocessing | audio, base, image, text, video |
| Loaders | audio, base, collators, image, pipeline, protein_dataset, registry, samplers, structured, text, video |
| Streaming | remote, tfrecord, webdataset |
## Related Documentation
- Data Loading Guide - Complete data loading guide
- Image Modality - Image-specific features
- Text Modality - Text-specific features
- Audio Modality - Audio-specific features