The Basics of the creation of digital text
The Basics contain a set of minimum requirements and more advanced recommendations for digital text that is readable by both humans and machines. These guidelines apply to both digitised and born-digital text. Depending on the end use, the content, structure and layout are captured in the digital text and its metadata. The underlying assumption of these guidelines is the creation of a digital text that is preserved in a sustainable and pliable format.
Three forms of digital texts
Digital text has three possible representations:
- Machine-readable text (literally, text that can be searched and interpreted by a machine). Structure is necessary for the machine to interpret the text correctly. Examples of this kind of text are texts produced by a word processor, e-books and texts on a website.
- Machine-readable text in combination with a digital image. In this case the machine-readable text is used mostly for searching. Examples: digitised newspapers (see for instance Delpher, a search engine for books and newspapers from the Dutch Royal Library), a PDF with an embedded image and OCR output, or Google Books.
- Text exclusively as a digital image (as photographed or scanned). Texts of this kind are neither machine-readable nor searchable and, strictly speaking, do not meet the requirements of the Basics for digital text. When the text has sufficient metadata, as is the case with much archival material (which is not easily processed with OCR or transcribed), the source may still be easy to find.
Since photographing or scanning a text is practically the same as digitising a photograph or another original, we advise you to use the Basics for the creation of images.
Photography or scanning from a paper original
The minimum set of guidelines contains two requirements:
Minimum guidelines for the creation of text
- Use an open file format for storage.
- Use a standard that is based on Unicode for the text encoding, preferably UTF-8.
(Source: Stadsarchief Amsterdam)
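The UTF-8 requirement above can be checked automatically before a text is archived. The following is a minimal sketch in Python, not a prescribed tool: the function names `is_valid_utf8` and `to_utf8` are hypothetical, and the sketch assumes you already know the legacy encoding of the source (here Latin-1) when a conversion is needed.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known legacy encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# "Café" stored as Latin-1 is not valid UTF-8 until it is re-encoded.
legacy = "Café".encode("latin-1")
converted = to_utf8(legacy, "latin-1")
```

Checking validity first avoids silently archiving text whose accented characters would later display incorrectly.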
On top of the minimum guidelines, we’ve developed recommendations that provide a more advanced method of preserving your texts in a sustainable and pliable way.
Recommendations for the creation of text
- Use XML to add structure to the text and publish the corresponding XML schema.
- A more advanced recommendation is the use of an XML schema that is based on TEI guidelines. The XML schema then uses TEI tags for automated, semantic encoding, which improves the machine-readability. Adding TEI tags can be complicated but the use of a special TEI editor simplifies this process.
- For OCR: use the ALTO XML Schema as a standard data structure to encode the layout of the text. ALTO uses coordinates to match the text with the digital image, so the text can be located in the image.
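To illustrate how ALTO coordinates make text locatable in the image, here is a small sketch in Python. The ALTO fragment is hand-written for illustration and is not taken from a real collection. The function name `find_word_boxes` is hypothetical. The sketch matches elements by local name so that it does not depend on a particular ALTO namespace version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO fragment for illustration only.
SAMPLE_ALTO = """<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="P1" WIDTH="2480" HEIGHT="3508">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine ID="TL1">
            <String CONTENT="Digital" HPOS="200" VPOS="300" WIDTH="310" HEIGHT="60"/>
            <SP/>
            <String CONTENT="heritage" HPOS="530" VPOS="300" WIDTH="380" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def find_word_boxes(alto_xml: bytes, query: str) -> list:
    """Return (HPOS, VPOS, WIDTH, HEIGHT) for each word matching query."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue
        if element.get("CONTENT", "").lower() == query.lower():
            box = tuple(
                float(element.get(key))
                for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            )
            boxes.append(box)
    return boxes


boxes = find_word_boxes(SAMPLE_ALTO.encode("utf-8"), "heritage")
```

A viewer could use the returned boxes to highlight search hits on the scanned page. The unit of the coordinates depends on the measurement unit declared in the ALTO file, which this sketch does not inspect.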
PDF/A
A PDF/A file can be an alternative to an XML file and is mainly used for born-digital texts. PDF/A has one benefit over XML: it fixes and stores the original layout of the text. This can be essential for some digital collections, such as newspapers or official records. A disadvantage of PDF/A is that conversion to another file format, for instance an e-book, is problematic. An XML format is much easier to convert to another output format. We recommend PDF/A only when it is important to maintain the original layout. PDF/A can also be used to store scans of documents and OCR output together in one file. However, if you want a truly sustainable and pliable digital text, it is advisable to store the scan and the OCR output separately. The digital texts and images can then always be converted independently to any given format.
The Basics advocate the use of an open file format with freely available specifications. The use of an open file format is an important requirement for interoperability and sustainability. For the Basics, we advise the following:
Storage formats for archiving
- Machine-readable and structured text files should be stored in an XML file that includes a UTF-8 encoding declaration.
- Please consult the Basics of the creation of images when you want to store your scanned documents.
- OCR files are best stored in plain text or XML.
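As an illustration of a structured XML file with a UTF-8 encoding declaration, the sketch below builds a very small TEI-style document with Python's standard library. It is a simplified example under stated assumptions: the function name `build_document` and the placeholder header texts are hypothetical, and the output has not been validated against a TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def tei(name: str) -> str:
    """Return the namespace-qualified tag name for a TEI element."""
    return f"{{{TEI_NS}}}{name}"


def build_document(title: str, paragraphs: list) -> bytes:
    """Serialise a minimal TEI-style document as UTF-8 encoded XML."""
    # Use TEI as the default namespace so tags are written without a prefix.
    ET.register_namespace("", TEI_NS)
    root = ET.Element(tei("TEI"))

    # Minimal header: title, publication statement, source description.
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication, tei("p")).text = "Unpublished example."
    source = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source, tei("p")).text = "Born-digital example."

    # Body: one <p> element per paragraph of text.
    text = ET.SubElement(root, tei("text"))
    body = ET.SubElement(text, tei("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, tei("p")).text = paragraph

    # xml_declaration=True writes the encoding declaration on the first line.
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


data = build_document("Example", ["Een café in Amsterdam."])
```

The returned bytes can be written to a file opened in binary mode, so that the encoding on disk matches the declaration in the file.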
When you want to use PDF/A, we recommend two modes of storage:
- Born-digital texts: use PDF/A-1a or PDF/A-2a, which allow for a rich, XML-based logical text structure.
- Digitised texts: use PDF/A-1b or PDF/A-2b to store both the image and the OCR file (see our earlier comments on the preferred method of storing OCR files and scans separately).
Linking images to OCR output
You can use METS or MPEG-21 DIDL to link image files with OCR files and other files. We discourage the use of file names or folder names to record the structure of the files because afterwards it is difficult to make any changes.
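The following sketch shows one possible way to record the link between page images and OCR files in a METS document instead of in file or folder names. It is an illustration only: the function name `build_mets`, the file paths, the `USE` and `TYPE` values and the MIME types are assumptions, and a production METS file would normally carry additional metadata sections.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def mets(name: str) -> str:
    """Return the namespace-qualified tag name for a METS element."""
    return f"{{{METS_NS}}}{name}"


def build_mets(pages: list) -> ET.Element:
    """Build a METS tree from (image_path, ocr_path) pairs in page order."""
    ET.register_namespace("mets", METS_NS)
    ET.register_namespace("xlink", XLINK_NS)
    root = ET.Element(mets("mets"))

    # The file section lists every file once, grouped by its use.
    file_sec = ET.SubElement(root, mets("fileSec"))
    image_group = ET.SubElement(file_sec, mets("fileGrp"), USE="image")
    ocr_group = ET.SubElement(file_sec, mets("fileGrp"), USE="ocr")

    # The structural map records which files belong to which page.
    struct_map = ET.SubElement(root, mets("structMap"), TYPE="physical")
    document = ET.SubElement(struct_map, mets("div"), TYPE="document")

    for number, (image_path, ocr_path) in enumerate(pages, start=1):
        page = ET.SubElement(
            document, mets("div"), TYPE="page", ORDER=str(number)
        )
        entries = (
            (image_group, "IMG", "image/tiff", image_path),
            (ocr_group, "OCR", "application/xml", ocr_path),
        )
        for group, kind, mime_type, path in entries:
            file_id = f"{kind}_{number:04d}"
            file_element = ET.SubElement(
                group, mets("file"), ID=file_id, MIMETYPE=mime_type
            )
            ET.SubElement(
                file_element,
                mets("FLocat"),
                {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": path},
            )
            # The page points to the file by ID, not by its name or folder.
            ET.SubElement(page, mets("fptr"), FILEID=file_id)
    return root


tree = build_mets([
    ("scans/0001.tif", "ocr/0001.xml"),
    ("scans/0002.tif", "ocr/0002.xml"),
])
xml_bytes = ET.tostring(tree, encoding="UTF-8", xml_declaration=True)
```

Because pages refer to files by identifier, a file can be renamed or moved by updating a single `FLocat` entry, without touching the recorded structure.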
This text is a revised version of the Basics. The first version was written in 2008 and reviewed in 2013 during two meetings with Dutch experts working in the field of digital heritage. Professionals are also invited to comment on this text and share their experience with the Basics through www.den.nl/debasis or by emailing us: firstname.lastname@example.org.
Liability and contribution to the Basics
Last modified: 12-09-2014