
The Basics of the creation of digital text

The Basics contain a set of minimum requirements and more advanced recommendations for digital text that is readable by both humans and machines. These guidelines apply to both digitised and born-digital text. Depending on the end use, the content, structure and layout are captured in the digital text and its metadata. The underlying assumption of these guidelines is that the digital text is created and preserved in a sustainable and pliable format.

Three forms of digital texts

Digital text has three possible representations:
  1. Machine-readable text (literally, text that can be searched and interpreted by a machine). For the text to be interpreted correctly, structure is necessary. Examples of this kind of text are texts produced by a word processor, e-books and texts on a website.
  2. Machine-readable text in combination with a digital image. In this case the machine-readable text is mostly used for searching. Examples: digitised newspapers (see for instance Delpher, a search engine for books and newspapers from the Dutch Royal Library), a PDF with an embedded image and OCR output, or Google Books.
  3. Text exclusively as a digital image (as photographed or scanned). Texts of this kind are neither machine-readable nor searchable and, strictly speaking, do not meet the requirements of the Basics for digital text. When the text has sufficient metadata, as is the case with much archival material (which is not easily processed with OCR or transcribed), the source may still be easy to find.
The machine-readable representation can be produced by manual transcription, automatically through OCR (in some cases with manual corrections afterwards), or directly when the text is born digital. The Basics does not distinguish between these processes: each results in a readable and searchable digital text.

Photography or scanning from a paper original

Since photographing or scanning a text is practically the same as digitising a photograph or another original, we advise you to use the Basics for the creation of images.

Minimum guidelines for the creation of text

The minimum set of guidelines contains two requirements:  

(Source: Stadsarchief Amsterdam)

Recommendations for the creation of text

On top of the minimal guidelines, we’ve developed recommendations that provide a more advanced method of preserving your texts in a sustainable and pliable way. 
PDF/A
A PDF/A file can be an alternative to an XML file and is mainly used for born-digital texts. PDF/A has one benefit compared to XML: it fixes and preserves the original layout of the text. This can be essential for some digital collections, such as newspapers or official records. A disadvantage of PDF/A is that conversion to another file format, for instance an e-book, is problematic. An XML format is much easier to convert to another output format. We only recommend the use of PDF/A when it is important to maintain the original layout. PDF/A can also be used to store scans of documents and OCR output together in one file. However, if you want a truly sustainable and pliable digital text, it is advisable to store the scan and the OCR output separately. The digital texts and images can then always be converted independently to any given format.
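To illustrate why structured XML is comparatively easy to convert, the following minimal sketch turns one small XML text into both plain text and simple HTML. The element names (`text`, `title`, `p`) and the sample content are hypothetical and are not prescribed by the Basics; a real collection would follow its own schema.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical structured text; the element names are illustrative only.
SOURCE = """<text>
  <title>Annual report</title>
  <p>First paragraph.</p>
  <p>Second paragraph with &amp; an ampersand.</p>
</text>"""


def to_plain(root):
    """Render the title and paragraphs as plain text separated by blank lines."""
    lines = [root.findtext("title", default="")]
    lines += [p.text or "" for p in root.iter("p")]
    return "\n\n".join(lines)


def to_html(root):
    """Render the same structure as a minimal HTML fragment."""
    parts = ["<h1>%s</h1>" % escape(root.findtext("title", default=""))]
    parts += ["<p>%s</p>" % escape(p.text or "") for p in root.iter("p")]
    return "\n".join(parts)


root = ET.fromstring(SOURCE)
plain = to_plain(root)
html_out = to_html(root)
```

Because the structure is explicit in the source, each output format only needs its own small rendering function; the stored text itself does not change.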

Storage formats for archiving

The Basics advocates the use of an open file format with freely available specifications. The use of an open file format is an important requirement for interoperability and sustainability. For the Basics, we advise the following:

Linking images to OCR output

You can use METS or MPEG-21 DIDL to link image files with OCR files and other files. We discourage the use of file names or folder names to record the structure of the files, because a structure recorded in this way is difficult to change afterwards.
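As a rough illustration of the METS approach, the sketch below builds a minimal METS document in which each page `div` in the structural map points to both its image file and its OCR file. The file paths, MIME types, `USE` values and ID scheme are assumptions made for this example, and the result is not validated against the METS schema; a production profile would add a header and descriptive and administrative metadata.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def m(tag):
    """Return the namespace-qualified name of a METS element."""
    return "{%s}%s" % (METS_NS, tag)


def build_mets(pages):
    """Build a METS tree from (image_path, ocr_path) tuples in reading order."""
    root = ET.Element(m("mets"))
    file_sec = ET.SubElement(root, m("fileSec"))
    # One file group per kind of file (the USE values are an assumption).
    groups = {
        "image": ET.SubElement(file_sec, m("fileGrp"), {"USE": "image"}),
        "ocr": ET.SubElement(file_sec, m("fileGrp"), {"USE": "ocr"}),
    }
    struct_map = ET.SubElement(root, m("structMap"), {"TYPE": "PHYSICAL"})
    document = ET.SubElement(struct_map, m("div"), {"TYPE": "document"})
    for order, (image_path, ocr_path) in enumerate(pages, start=1):
        page = ET.SubElement(
            document, m("div"), {"TYPE": "page", "ORDER": str(order)}
        )
        files = (
            ("image", image_path, "image/tiff"),      # assumed MIME type
            ("ocr", ocr_path, "application/xml"),     # assumed MIME type
        )
        for use, path, mime in files:
            file_id = "%s_%04d" % (use, order)
            entry = ET.SubElement(
                groups[use], m("file"), {"ID": file_id, "MIMETYPE": mime}
            )
            ET.SubElement(
                entry,
                m("FLocat"),
                {"LOCTYPE": "URL", "{%s}href" % XLINK_NS: path},
            )
            # The page div refers to the file by ID, not by its name or folder.
            ET.SubElement(page, m("fptr"), {"FILEID": file_id})
    return root


mets_root = build_mets([
    ("scans/0001.tif", "ocr/0001.xml"),
    ("scans/0002.tif", "ocr/0002.xml"),
])
mets_xml = ET.tostring(mets_root, encoding="unicode")
```

Since the relationship between scan and OCR output lives in the `fptr` references, files can later be renamed or moved by updating only the `FLocat` entries, without touching the structure.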

Liability and contribution to the Basics

This text is a revised version of the Basics. The first version was written in 2008 and reviewed in 2013 during two meetings with Dutch experts working in the field of digital heritage. Professionals are also invited to comment on this text and share their experience with the Basics through or by emailing us:  

Last modified: 12-09-2014
