Batch Anonymization
Status: Proposed
Batch anonymization extends the single-document anonymization pipeline to support processing multiple files in one guided session. The workflow follows a state-machine model: upload → extract → review → anonymize → completed.
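The state-machine model can be sketched as follows. This is a minimal illustration, assuming transitions are strictly linear in the order listed above. The names `BatchState`, `next_state`, and `can_transition` are hypothetical, and the actual implementation may add failure states or stay in the extract state across several calls.

```python
from enum import Enum


class BatchState(Enum):
    """States in the documented order (names assumed for illustration)."""
    UPLOAD = "upload"
    EXTRACT = "extract"
    REVIEW = "review"
    ANONYMIZE = "anonymize"
    COMPLETED = "completed"


# Enum members iterate in definition order, which matches the workflow order.
_ORDER = list(BatchState)


def next_state(current: BatchState) -> BatchState:
    """Return the state that follows `current`; COMPLETED is terminal."""
    if current is BatchState.COMPLETED:
        raise ValueError("batch is already completed")
    return _ORDER[_ORDER.index(current) + 1]


def can_transition(current: BatchState, target: BatchState) -> bool:
    """True only for the single forward step allowed from `current`."""
    return current is not BatchState.COMPLETED and next_state(current) is target
```

Rejecting any transition other than the next forward step keeps a client from, for example, requesting anonymization before the entity review has taken place.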
Overview
Users can upload up to 100 files by default (the limit is admin-configurable) in a single request. DocuDesk processes them sequentially, extracting text and entities from each file, and then presents a consolidated entity review before applying anonymization. A CSV audit report is available for download after completion.
Batch state is persisted in Nextcloud ICache with a 2-hour TTL. No batch data is stored permanently; only the anonymized output files are saved to the user's DocuDesk folder.
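To illustrate what a 2-hour TTL means for clients, here is a small in-memory stand-in for a TTL cache. It is only a sketch of the expiry semantics and is not the Nextcloud ICache API; `TransientBatchStore` and the injectable `clock` parameter are hypothetical names used for illustration.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

BATCH_TTL_SECONDS = 2 * 60 * 60  # the 2-hour TTL described above


class TransientBatchStore:
    """In-memory stand-in for a TTL cache (assumption: not the real ICache)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, batch_id: str, state: Any, ttl: int = BATCH_TTL_SECONDS) -> None:
        # Store the absolute expiry time next to the value.
        self._entries[batch_id] = (self._clock() + ttl, state)

    def get(self, batch_id: str) -> Optional[Any]:
        entry = self._entries.get(batch_id)
        if entry is None:
            return None
        expires_at, state = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped, so the batch can no longer be resumed.
            del self._entries[batch_id]
            return None
        return state
```

Under this model, a session left idle past the TTL loses its batch state, while any anonymized output files already written to the user's folder are unaffected.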
Workflow Steps
- Upload: POST /api/anonymization/batch/upload (upload multiple files, receive batchId)
- Extract: POST /api/anonymization/batch/{batchId}/extract (process one file per call until all are extracted)
- Review: entity review on a consolidated entity list with toggle controls
- Anonymize: POST /api/anonymization/batch/{batchId}/anonymize (apply anonymization with the reviewed entity list)
- Report: GET /api/anonymization/batch/{batchId}/report (download the CSV audit report)
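The steps above can be sketched as a small client-side driver. This is a rough sketch under stated assumptions: the `send` transport callable, the `files` and `entities` payload keys, and the function name `run_batch` are hypothetical. Only the paths and the `batchId` field come from this page.

```python
from typing import Any, Callable, Mapping, Sequence

# Hypothetical transport: (method, path, payload) -> decoded JSON response.
Transport = Callable[[str, str, Any], Mapping[str, Any]]

BASE = "/api/anonymization/batch"


def run_batch(send: Transport, files: Sequence[str],
              approved_entities: Sequence[str]) -> str:
    """Drive upload, extract, and anonymize; return the report path."""
    # Upload all files in one request and keep the returned batchId.
    # The "files" payload key is an assumption.
    batch_id = send("POST", f"{BASE}/upload", {"files": list(files)})["batchId"]

    # One extract call per file, as the Extract step describes.
    for _ in files:
        send("POST", f"{BASE}/{batch_id}/extract", None)

    # Review happens in the UI; here the approved list is passed in directly.
    # The "entities" payload key is an assumption.
    send("POST", f"{BASE}/{batch_id}/anonymize",
         {"entities": list(approved_entities)})

    return f"{BASE}/{batch_id}/report"
```

Passing the transport in as a callable keeps the sketch independent of any particular HTTP library and makes the call sequence easy to exercise with a fake.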
API Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /api/anonymization/batch/upload | Upload multiple files; returns batchId |
| POST | /api/anonymization/batch/{batchId}/extract | Extract next unprocessed file in batch |
| GET | /api/anonymization/batch/{batchId}/status | Polling endpoint; returns batch status and per-file progress |
| GET | /api/anonymization/batch/{batchId}/entities | Consolidated entity list for review |
| POST | /api/anonymization/batch/{batchId}/anonymize | Apply anonymization with reviewed entity list |
| GET | /api/anonymization/batch/{batchId}/report | Download CSV audit report (post-completion) |
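A client might poll the status endpoint along these lines. The sketch assumes the response is a JSON object with a `status` field whose final value is `completed`. That field name, the function `wait_for_status`, and its parameters are assumptions made for illustration, not a documented response schema.

```python
import time
from typing import Any, Callable, Mapping


def wait_for_status(fetch_status: Callable[[], Mapping[str, Any]],
                    target: str = "completed",
                    interval: float = 2.0,
                    max_polls: int = 60,
                    sleep: Callable[[float], None] = time.sleep) -> Mapping[str, Any]:
    """Call `fetch_status` until its "status" equals `target` or polls run out."""
    for attempt in range(max_polls):
        body = fetch_status()
        if body.get("status") == target:
            return body
        # Wait between polls, but not after the final attempt.
        if attempt < max_polls - 1:
            sleep(interval)
    raise TimeoutError(f"batch did not reach {target!r} after {max_polls} polls")
```

Here `fetch_status` would wrap a GET request to /api/anonymization/batch/{batchId}/status. Bounding the number of polls avoids waiting indefinitely on a batch whose cached state has expired.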
Audit Report
The CSV report includes: fileName, originalFileId, anonymizedFileId, entityCount, replacementCount, status, timestamp. Entity values are excluded (GDPR data minimization, Recital 26).
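As an illustration of the report layout, the following sketch writes rows using exactly the columns listed above and silently drops any other key, such as entity values. The function name `build_report` and the shape of the input rows are assumptions. The server-side implementation may differ.

```python
import csv
import io
from typing import Iterable, Mapping

# Column names taken from the list above.
REPORT_COLUMNS = [
    "fileName",
    "originalFileId",
    "anonymizedFileId",
    "entityCount",
    "replacementCount",
    "status",
    "timestamp",
]


def build_report(rows: Iterable[Mapping[str, object]]) -> str:
    """Render per-file rows as CSV text with a header line."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys outside REPORT_COLUMNS (e.g. entity values).
    writer = csv.DictWriter(buffer, fieldnames=REPORT_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()
```

Restricting output to a fixed column list means that entity values cannot leak into the report even if the per-file records carry them internally.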
Standards
- GDPR / AVG — Batch state is transient (ICache TTL 2h); entity values excluded from audit report
- WOO — Anonymization profiles aligned with WOO publication requirements
- GEMMA Media-behandelingcomponent
- TEC-DMS-7 (Workflow Management)
Limits
| Parameter | Default | Config Key |
|---|---|---|
| Max files per batch | 100 | docudesk_batch_max_files (IAppConfig) |
| Batch TTL | 2 hours | Hardcoded (ICache) |
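A client can check the file count before uploading to avoid a rejected request. This is a minimal sketch, assuming the server refuses batches above the configured maximum. The function `check_batch_size` is hypothetical, and the default of 100 mirrors the table above.

```python
DEFAULT_MAX_FILES = 100  # default of docudesk_batch_max_files per the table above


def check_batch_size(file_count: int, max_files: int = DEFAULT_MAX_FILES) -> None:
    """Raise ValueError if the batch is empty or exceeds the configured limit."""
    if file_count < 1:
        raise ValueError("a batch needs at least one file")
    if file_count > max_files:
        raise ValueError(
            f"batch has {file_count} files; the configured limit is {max_files}"
        )
```

Because an administrator can change the limit, a real client would read the effective value from the server instead of relying on the default.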