Abderrazak Mkadmi - Archives in the Digital Age

Archiving has become an increasingly complex process. The challenge is no longer how to store the data but how to store it intelligently, in order to exploit it over time, while maintaining its integrity and authenticity.

Digital technologies bring about major transformations, not only in terms of the types of documents that are transferred to and stored in archives, in the behaviors and practices of the humanities and social sciences (digital humanities), but also in terms of the volume of data and the technological capacity for managing and preserving archives (Big Data).

Archives in the Digital Age focuses on the impact of these various digital transformations on archives, and examines how the right to memory and the information of future generations is confronted with the right to be forgotten, a digital prerogative that guarantees individuals their private lives and freedoms.

Archives in the Digital Age: online excerpt

Scanning a document always produces a file in an image format. The nature of these images depends on the original documents and on the subsequent processing. Depending on requirements, the images can be black and white (scanned as such or converted), grayscale or color. Color images can be 8, 16, 24, 30 or 36 bits per pixel. As the resolution increases, so do the clarity and the size of the image.
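As a rough illustration of how resolution and bit depth drive file size, the sketch below estimates the raw (uncompressed) size of a scanned page. The function name and the A4 page dimensions are assumptions made for this example, and row padding and file headers are ignored.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the raw size of a scanned page, in bytes (no headers, no padding)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to a whole number of bytes


# An A4 page is roughly 8.27 x 11.69 inches (assumed page size for the example).
for dpi in (150, 300, 600):
    for bits in (1, 8, 24):
        size = uncompressed_size_bytes(8.27, 11.69, dpi, bits)
        print(f"{dpi} dpi, {bits}-bit: {size / 1_000_000:.1f} MB")
```

Doubling the resolution quadruples the number of pixels, which is why compression becomes necessary for high-resolution color scans.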

Several types of processing can then be applied to make the digitized documents usable:

– Compression: This consists of reducing the size of files, thus reducing the space used on archiving media and facilitating their circulation on networks. Several compression methods exist, depending on the scanning method and the nature of the original documents:
- CCITT G3/G4 compression: Group 4 compression, also known as “G4” or “Modified Modified READ” (MMR), is a lossless image compression method used in Group 4 facsimile machines, as defined in the ITU-T T.6 fax standard. It is only used for bitonal (black and white) images. Group 4 compression is available in many proprietary image file formats, as well as in standard formats such as TIFF (Tagged Image File Format), CALS (Computer-aided Acquisition and Logistics Support), CIT (Intergraph Raster Type 24) and PDF (Portable Document Format);
- JBIG (Joint Bi-level Image Experts Group) compression: this is a bi-level image compression method, in which a single bit is used to express the color value of each pixel. This standard can also be used to code grayscale images and color images with a limited number of bits per pixel. JBIG is designed for images sent using facsimile coding and offers significantly higher compression than Group 3 and Group 4 facsimile coding;
- JPEG (Joint Photographic Experts Group) compression: this algorithm is used to reduce the size of color images. It allows very high compression ratios, but at the expense of image quality: the compression entails a loss of information.

– Optical Character Recognition (OCR): The purpose of OCR is to convert text in image format into a computer-readable text format by translating the groups of dots in a scanned image into characters with the associated formatting. It is carried out by dedicated systems known as OCR engines. The challenge today is to choose, among the many tools of this type, the one that is most efficient and best suited to the application. Among the selection criteria, effectiveness is often cited, measured by the recognition rate. The target is a rate of 100%. However, the recognition rate does not depend solely on the recognition engine, but also on several other factors, such as the physical preparation of the paper document upstream and how well the OCR engine's parameters are adapted to the type of content, taking into account, inter alia, the language, quality and layout of the document.
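Returning to the compression step in the list above: bitonal methods exploit the long runs of identical pixels found in scanned text. The sketch below is a toy run-length coder that shows the principle and the meaning of "lossless". It is not the ITU-T T.6 algorithm, which additionally uses variable-length codes and codes each line relative to the previous one. The function names are assumptions for this example.

```python
def rle_encode(row):
    """Encode a row of 0/1 pixels as run lengths, starting with a run of 0s.

    If the row starts with a 1, the first run length is 0.
    """
    runs = []
    current = 0  # value of the run being counted
    count = 0
    for pixel in row:
        if pixel == current:
            count += 1
        else:
            runs.append(count)
            current = pixel
            count = 1
    runs.append(count)
    return runs


def rle_decode(runs):
    """Rebuild the pixel row from run lengths produced by rle_encode."""
    row = []
    value = 0
    for length in runs:
        row.extend([value] * length)
        value = 1 - value  # runs alternate between 0 and 1
    return row


# A scan line that is mostly white (0) with a short black (1) stroke.
line = [0] * 1000 + [1] * 20 + [0] * 708
encoded = rle_encode(line)
print(encoded)                      # three run lengths instead of 1728 pixels
print(rle_decode(encoded) == line)  # lossless: the original row is recovered
```

A lossy method such as JPEG would not pass the final equality check, since it discards information to reach higher compression ratios.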

OCR can be applied within an ERM system in two ways:

1) Application to whole pages of text, in order to index them in full text with the help of spell checkers.

2) Application to selected areas within the pages (such as titles), in order to use them as an index.

Long-established technologies build on OCR techniques to extract information from these digitized documents and enrich their metadata (category, author, title, date, etc.):

- Automatic Document Recognition (ADR), which consists of distinguishing one type of document from another, according to a few pre-defined parameters. This makes it possible to sort images electronically;
- Automatic document reading: this technology uses artificial intelligence techniques to perform linguistic checks on recognized words and interpret them using text-mining functions, for the purpose of pre-analysis and/or thematic classification of the scanned documents.
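The idea of distinguishing document types "according to a few pre-defined parameters" can be sketched with simple keyword rules applied to the OCR text. The document types, keywords and threshold below are hypothetical and chosen only for illustration. Real ADR products typically also rely on layout and other features.

```python
# Hypothetical rules: each document type is described by a few keywords.
RULES = {
    "invoice": ("invoice", "amount due", "vat"),
    "contract": ("agreement", "hereby", "signatory"),
    "letter": ("dear", "sincerely"),
}


def recognize_document_type(ocr_text, rules=RULES, min_hits=2):
    """Return the type whose keywords match best, or "unknown" below min_hits."""
    text = ocr_text.lower()
    scores = {
        doc_type: sum(keyword in text for keyword in keywords)
        for doc_type, keywords in rules.items()
    }
    best_type = max(scores, key=scores.get)
    return best_type if scores[best_type] >= min_hits else "unknown"


print(recognize_document_type("INVOICE no. 12. Amount due: 100, VAT included"))
print(recognize_document_type("Minutes of the meeting"))
```

Returning "unknown" rather than a weak guess leaves ambiguous pages for manual sorting.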

However, OCR technology remains limited: it depends on the quality of the text to be scanned (whether it is distorted, faded, stained, folded, contains handwritten annotations, etc.) and on the quality of the scan itself. It often produces interpretation errors that require human correction, without which the raw OCR output cannot be reliably read and indexed by search engines. This is why the correction work is generally outsourced to service providers who use low-cost labor or, in the absence of financial means, entrusted to Internet users. The latter alternative, which is increasingly used by library and archive services, is called crowdsourcing. Several OCR correction projects have relied on it, including the correction of digitized newspaper texts for the National Library of Australia, the correction of OCR through gamification for the National Library of Finland and the involuntary correction of OCR via reCAPTCHA for the Google Books service [AND 17].
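One common way to quantify the recognition rate and the errors discussed above is to compare the OCR output with a manually corrected reference, character by character, using the edit (Levenshtein) distance. This is a minimal sketch under that assumption. The source does not specify how the rate is computed, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def recognition_rate(reference, ocr_output):
    """Share of reference characters recognized correctly, between 0.0 and 1.0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


reference = "Archives in the Digital Age"
ocr_output = "Archlves in tbe Digital Age"  # two misread characters
print(f"{recognition_rate(reference, ocr_output):.1%}")
```

A measure of this kind can help decide whether a batch needs manual or crowdsourced correction before it is indexed.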

1.2.2.3. Document indexing

Once the document has been acquired through scanning, exchange and/or production, its content must be described so that it can be found and used. This second stage of electronic document management is the most important one for being able to keep the document and use it later. The description can be made by type (a formal description: author, title, date, etc.), by concepts or keywords chosen freely, or on the basis of a thesaurus in order to harmonize practices. In web documents in HTML format, the description is created through META tags, which allow the creator of these documents to define the relevant keywords representing the content, the subject, the author and so on. There are many metadata-related standards today, such as DC (Dublin Core), RDF (Resource Description Framework), EAD (Encoded Archival Description), EAC (Encoded Archival Context) and LOM (Learning Object Metadata) [MKA 08]. The objective is to make this metadata usable by a large number of search tools.
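As an illustration of describing an HTML document through META tags, the sketch below generates Dublin Core elements using the common "DC." naming convention for HTML meta elements. The helper name and the sample record are assumptions for this example, not part of any of the standards cited.

```python
from html import escape


def dublin_core_meta_tags(metadata):
    """Render a dict of Dublin Core elements as HTML <meta> tags.

    A value may be a single string or a list of strings (repeated element).
    """
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in metadata.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            name = escape(f"DC.{element}", quote=True)
            content = escape(str(item), quote=True)
            lines.append(f'<meta name="{name}" content="{content}">')
    return "\n".join(lines)


# Sample record (illustrative values only).
record = {
    "title": "Archives in the Digital Age",
    "creator": "Abderrazak Mkadmi",
    "subject": ["archives", "digital preservation"],
    "language": "en",
}
print(dublin_core_meta_tags(record))
```

Escaping the values keeps the generated HTML well formed even when a title contains quotation marks or angle brackets.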

1.2.2.4. Storage of documents

1.2.2.4.1. Storage media

Storage, sometimes called archiving (in the primary sense of the term), supports the conservation of documents over time. Implementing an effective storage solution first requires a needs analysis covering, in particular, the volume of data, their importance, the frequency of their consultation, the degree of confidentiality, the level of security required, the length of time they are kept and the interest of putting them online.

To meet the different needs of this conservation function, an ERM system uses several storage media, chosen according to the following criteria:

– criteria relating to the document: types of documents, frequency of consultation, interest in having it online and retention periods;

– criteria relating to the medium: document access time, storage capacity, cost, rewritability or non-rewritability and secure access.
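The two families of criteria above can be modeled as a simple matching step between document needs and medium characteristics. The sketch below is purely illustrative: the class names, the fields retained and every figure in the catalogue are placeholder assumptions, not measurements of real media.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Medium:
    name: str
    access_time_s: float  # typical time to reach a document
    capacity_gb: float
    cost_per_gb: float
    rewritable: bool
    secure_access: bool


@dataclass(frozen=True)
class DocumentNeeds:
    volume_gb: float
    max_access_time_s: float   # derived from frequency of consultation
    must_be_unalterable: bool  # e.g. records kept for legal retention
    confidential: bool


def suitable_media(needs, catalogue):
    """Return the media that satisfy the needs, cheapest first."""
    matches = [
        m for m in catalogue
        if m.capacity_gb >= needs.volume_gb
        and m.access_time_s <= needs.max_access_time_s
        and not (needs.must_be_unalterable and m.rewritable)
        and (m.secure_access or not needs.confidential)
    ]
    return sorted(matches, key=lambda m: m.cost_per_gb)


# Placeholder catalogue: all values are invented for the example.
catalogue = [
    Medium("medium A (online, rewritable)", 0.01, 4000, 0.05, True, True),
    Medium("medium B (offline, rewritable)", 60, 12000, 0.01, True, False),
    Medium("medium C (write-once)", 5, 100, 0.03, False, True),
]
needs = DocumentNeeds(volume_gb=50, max_access_time_s=10,
                      must_be_unalterable=True, confidential=True)
print([m.name for m in suitable_media(needs, catalogue)])
```

In practice an ERM system would combine several media, for instance a fast medium for frequently consulted documents and a cheaper one for long retention.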

There are several storage media that can be classified into generations:

– First generation media are considered to be analog media and have not been used since the late 1990s. They comprise punched cards and punched tape, a system that originated in the 18th century. Their storage capacity is very small, amounting to a few tens of bytes.

– Second generation media are magnetic media and have a digital recording mode, except for magnetic tape, which has both analog and digital recording modes. They include magnetic tapes, cassettes, hard disks, cartridges and diskettes. These media have, however, been able to withstand technological developments over a long period of time [FLE 17].

– Third generation storage media are recordable digital optical media: CDs (Compact Discs), DVDs (Digital Versatile Discs) and Blu-ray Discs, also known as BDs. Several newer optical media are now on the market, such as glass discs, M-Disc (whose main characteristic is that the burning layer is made of synthetic diamond) and nanoform (a disc with a very high resistance to damage).
