Prague's Municipal Archive, based at Archivní 6 in the Chodovec district, confirmed this week that an internal audit of its digitised photograph collection has identified more than 14,000 duplicate image files accumulated over nearly a decade of scanning projects. The discovery is the most significant quality-control finding since the archive began its large-scale digitisation push in 2017.
The timing matters because the archive is less than six months away from launching a public-facing online portal — currently scheduled for January 2027 — that will allow Prague residents and international researchers to browse the city's photographic holdings for the first time without visiting in person. Pushing duplicate, mislabelled or low-resolution files into that portal would undermine the project's credibility from day one, and staff are now working under a tighter internal deadline to complete what archivists call a "deduplication pass" before the autumn handover to the portal's development team.
How the Duplicates Accumulated
The problem has a straightforward cause. Between 2017 and 2024, at least four separate digitisation contracts were awarded to different vendors, including one project run through the Prague Institute of Planning and Development and another funded under the EU's Integrated Regional Operational Programme. Each contractor used slightly different file-naming conventions and metadata schemas. When outputs were merged into the archive's central storage system, the same physical photograph — sometimes a street scene from Vinohrady or a construction record from the Holešovice riverfront — ended up ingested multiple times, occasionally under different catalogue numbers.
A further complication: some duplicates are not exact copies but near-identical scans at different resolutions, meaning automated deduplication software cannot simply delete the lower-quality file without human review. Archive staff are using open-source image-hashing tools alongside manual checks to flag pairs for decision. The process is slow. As of Thursday, roughly 3,200 of the 14,000-plus flagged files had been resolved, a little under a quarter of the total.
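The archive has not said which hashing tools it uses. As an illustration only, the sketch below shows one common approach, an "average hash", which reduces each image to a 64-bit fingerprint so that the same picture scanned at two resolutions produces nearly the same bits. The function names, the synthetic test images and the five-bit threshold are all assumptions made for this example, not details from the archive's workflow.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixel values, 0-255


def shrink(img: Image, size: int = 8) -> List[List[float]]:
    """Downsample to size x size by averaging equal blocks of pixels.

    Assumes the image is at least `size` pixels wide and tall.
    """
    h, w = len(img), len(img[0])
    out = []
    for by in range(size):
        y0, y1 = by * h // size, (by + 1) * h // size
        row = []
        for bx in range(size):
            x0, x1 = bx * w // size, (bx + 1) * w // size
            block = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(img: Image, size: int = 8) -> int:
    """Build a size*size-bit fingerprint: 1 where a cell is brighter than the mean."""
    cells = [v for row in shrink(img, size) for v in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for v in cells:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def flag_for_review(h1: int, h2: int, threshold: int = 5) -> bool:
    """Flag a pair as a likely duplicate; the threshold is an assumed value."""
    return hamming(h1, h2) <= threshold


# Synthetic stand-ins for scans: a 64x64 gradient, the same picture at
# half resolution (2x2 pixel averages), and an unrelated inverted image.
high_res = [[(x + y) * 2 for x in range(64)] for y in range(64)]
low_res = [
    [
        (high_res[2 * y][2 * x] + high_res[2 * y][2 * x + 1]
         + high_res[2 * y + 1][2 * x] + high_res[2 * y + 1][2 * x + 1]) // 4
        for x in range(32)
    ]
    for y in range(32)
]
inverted = [[252 - v for v in row] for row in high_res]

same_photo = flag_for_review(average_hash(high_res), average_hash(low_res))
different_photo = flag_for_review(average_hash(high_res), average_hash(inverted))
```

A pair flagged this way is only a candidate. Deciding which scan to keep, or whether the two really show the same photograph, still falls to a human reviewer, which is consistent with the manual checks described above.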
The archive houses approximately 2.3 million digitised items in total, a figure that includes maps, official documents and photographs. The duplicate photographs represent roughly 0.6 percent of the full collection, but because the January portal launch will prioritise photographic content, specifically images of Prague neighbourhoods dating from 1880 to 1989, the proportion of affected material in that curated subset is considerably higher.
What the City Is Doing About It
Prague City Hall's Department of Culture and Monument Care, which oversees the archive's budget, approved a supplementary allocation earlier this year to hire two additional contract archivists specifically for digital quality control. Those positions were filled in May, and both contractors started work on the deduplication project in June. The department has not publicly stated the value of that supplementary funding.
The archive is also in contact with Národní knihovna — the National Library of the Czech Republic, based at Klementinum in the Old Town — to compare methodologies. The National Library completed its own large-scale deduplication exercise across its Kramerius digital library system in 2024 and is regarded domestically as having the most developed workflow for this kind of problem.
For residents and researchers who use the archive's existing catalogue system, this week's developments have a practical implication. The archive issued a notice on its website on Wednesday advising that catalogue numbers for approximately 800 photographs of Žižkov and the Žižkov-Vinohrady border area may be reassigned before the portal goes live, as some of those records will be consolidated. Anyone who has cited specific catalogue numbers in academic work or planning applications should check for updates after 1 September, when a revised finding aid is expected to be published.
The archive asks researchers with appointments scheduled for July or August to confirm their visit dates by email, as two of the reading room's five workstations will be taken offline temporarily while additional storage infrastructure is installed to handle the cleaned-up dataset. The reading room at Archivní 6 remains open Monday through Friday, 8 a.m. to 4 p.m.