Digital libraries

Digital Preservation of Biomedical Documents - State of the Art

Abstract: 

With growing numbers of digital materials in our libraries, predominantly PDF files and audiovisual files, there is a substantial need to preserve these documents for future use. The German National Library of Medicine (ZB MED) carried out a project together with the Technical Information Library in Hanover and the German National Library of Economics in Kiel/Hamburg using the digital preservation system Rosetta.

The project has shown that the technical requirements can be met by such a system, including its ingest and migration processes. It is of utmost importance, however, to identify which documents a library holds and what condition they are in (file types and subtypes).

It has to be discussed who should be responsible for digital preservation in each country, as operating a digital preservation system involves a high workload and high costs.

Session: 
Session E. New technologies
Ref: 
E3
Type of presentation: 
Oral presentation


From digitization towards digital preservation - building a digital library system for medical information users

Abstract: 

Introduction

The development of specialized digital libraries and archives in the Czech Republic is set out in the official document "Conception of permanent preservation of traditional library collections and electronic documents in Czech libraries". The National Medical Library (NML) has been building a digital library since 2009 using a Czech open source tool, the digital library system Kramerius. To maintain and develop the digital library, its content and its services, it has become necessary to establish an encompassing project to build a solid digital preservation system compliant with the Open Archival Information System (OAIS) reference model.

NML has successfully applied for a Ministry of Health Internal Grant Agency (IGA) project funded for the years 2011-2013. The project aims at "creation of a long term digital preservation system in the NML, which will allow permanent archiving and accessibility of full texts of scientific publications in the field of medical disciplines". The main goals of the project are: development of a digital preservation system using the PLATTER (Planning Tool for Trusted Electronic Repositories [1]) process, building a digitization workplace and department, implementation of suitable data storage in NML's datacenter, and digitization of 4,000 IGA final grant reports.

Access to the archived content is provided through the recently developed Medvik web portal (http://www.medvik.cz/bmc), which gives access to the Czech medical bibliography database (BMC) and the medical union catalogues. Links to the full text can therefore be displayed directly at the article level where available. NML has put considerable effort into resolving the legal aspects through special licensing agreements, so that most full texts are freely accessible to every user. The content of the repository comes in several formats and from different sources: in-house digitization, external digitization projects, collaboration with Czech publishers, and direct contributions by authors. The majority of the content from publishers and authors consists of PDF files. There are currently 100 scientific journals and more than 1,000 other documents archived. The complexity of the archived digital objects varies from single page images with OCR in separate text files, through article-level and volume-level PDF files, to whole large documents stored as one PDF file.

Objectives

The selection of suitable data and metadata formats for the repository is essential and affects many operations in the Acquisition, Data and Preservation planning objectives defined by PLATTER. For the long term preservation of the IGA final reports and born-digital scientific publications NML has chosen the PDF/A-1 format [2]. The main reasons are that it is an open, standardized, self-contained and indexable format, suitable for page-oriented textual documents and common in scientific communication. Newly acquired reports in digital form will be compliant with the PDF/A-1b standard, the PDF objects already archived in the repository will be converted to PDF/A-1b, and PDF documents from authors will be converted during the ingest workflow. The conversion process shall use thoroughly tested tool(s) to produce valid and well-formed files. This research aims to evaluate the available PDF to PDF/A converters and to identify those most suitable in terms of consistency and reliability.

Methods

The available technical metadata were not sufficient for the analysis of the digital objects stored in the repository. The content was therefore analyzed using the open source tool JHOVE [3], which performs format-specific identification, validation, and characterization of digital objects. The list of files stored in the repository was extracted from the file system using a simple batch command: repository>dir /b /s /aa > filelist. The list was used as input for simple batch files that call the command line interfaces (CLI) of JHOVE and of the converters.
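The file listing step can be sketched in Python as follows. This is only an illustration, not the batch files used in the study: the function names are hypothetical, the sketch lists every regular file rather than filtering on a file attribute as the dir command does, and the JHOVE options shown (-m, -h, -o with the PDF-hul module) are an assumption that should be checked against the installed JHOVE release. The commands are built but not executed.

```python
import os
import tempfile
from pathlib import Path


def list_files(root):
    """Return the absolute paths of all regular files below root, sorted."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(str(Path(dirpath, name).resolve()))
    return sorted(paths)


def jhove_command(pdf_path, report_path, jhove="jhove"):
    """Build (but do not run) a JHOVE call for one PDF file.

    Assumption: the -m/-h/-o options and the PDF-hul module name match the
    installed JHOVE release; verify them against its documentation.
    """
    return [jhove, "-m", "PDF-hul", "-h", "XML",
            "-o", str(report_path), str(pdf_path)]


# Self-contained demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "journal" / "vol1").mkdir(parents=True)
    (root / "journal" / "vol1" / "article.pdf").write_bytes(b"%PDF-1.4\n")
    (root / "journal" / "vol1" / "page001.txt").write_text("ocr text")

    file_list = list_files(root)
    pdf_files = [p for p in file_list if p.lower().endswith(".pdf")]
    commands = [jhove_command(p, p + ".jhove.xml") for p in pdf_files]
```

Each entry in commands could then be passed to subprocess.run once the options have been confirmed for the local installation.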

Several PDF/A software products, mostly in trial versions, were downloaded and evaluated. The testing was performed in two phases: 1) evaluation of the available converters based on conversion of the PDF files from the Isartor test suite [4]; and 2) test conversion of all PDF files in the repository using the tools selected in the first phase. The converted Isartor files were cross-validated so that the best-performing PDF/A converters could be identified. The most suitable tools advanced to the second phase. The conversion logs and reports were processed and imported into an SQL database for further analysis.
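As an illustration of the last step, the following minimal sketch loads conversion outcomes into an SQLite database and summarizes them per tool. The table layout, the tool names and the records are hypothetical placeholders, not the database or the results of this study.

```python
import sqlite3

# Hypothetical log records: (converter, file name, outcome); not real results.
records = [
    ("tool_a", "a.pdf", "ok"),
    ("tool_a", "b.pdf", "failed"),
    ("tool_b", "a.pdf", "ok"),
    ("tool_b", "b.pdf", "ok"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE conversion (tool TEXT, file TEXT, outcome TEXT)")
conn.executemany("INSERT INTO conversion VALUES (?, ?, ?)", records)

# Per-tool totals, failure counts and success rate in percent.
rows = conn.execute(
    """
    SELECT tool,
           COUNT(*) AS total,
           SUM(outcome = 'failed') AS failed,
           ROUND(100.0 * SUM(outcome = 'ok') / COUNT(*), 1) AS success_pct
    FROM conversion
    GROUP BY tool
    ORDER BY tool
    """
).fetchall()

for tool, total, failed, success_pct in rows:
    print(f"{tool}: {total} files, {failed} failed, {success_pct}% converted")
```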

Results

The analysis revealed 189,917 digital objects in the repository (DjVu, TXT, PDF and JPEG files) with a total size of 43.6 gigabytes (GB), of which 9,079 are PDF files with a total size of 17.6 GB. The JHOVE version used in the analysis supported PDF only up to version 1.6, so the results had to be supplemented with 3-Heights PDF Validator, which can report each file's claimed conformance level. The results of the combined analysis are displayed in Fig. 1 (1b = PDF/A-1b). Fifteen corrupted PDF files were found.

The comparison based on conversion and cross-validation of the Isartor test suite involved processing 203 files (one file, isartor-6-1-12-t01-fail-a, was removed because some tools became unresponsive when converting it). The test suite is specially designed to reveal flaws in PDF/A software tools. All selected tools correctly marked the 203 files as invalid. The comparison of conversion results can be seen in Tab. 1. The total score is the sum of the percentages of successfully passed and validated files (self-validation and the worst result for each converter were excluded). The Intarsys tool was used for validation only because its trial version limits the number of conversions. The Callas tool was unsatisfactorily slow and was not used in the cross-validation. The comparison has shown considerable differences between the tools.
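One possible reading of this scoring rule is sketched below. It assumes that each converter's output was checked by several validators, that a validator belonging to the same product carries the same name as the converter, and that exactly one worst result is dropped. The tool names and percentages are hypothetical and do not reproduce Tab. 1.

```python
def total_score(converter, pass_pct):
    """Sum the pass percentages for one converter.

    pass_pct maps validator name -> percentage of the converter's output
    files that the validator accepted. The converter's own validator
    (self-validation) and the single worst remaining result are excluded.
    """
    others = [pct for validator, pct in pass_pct.items()
              if validator != converter]
    if len(others) > 1:
        others.remove(min(others))  # drop the worst result once
    return sum(others)


# Hypothetical cross-validation figures, for illustration only.
matrix = {
    "tool_a": {"tool_a": 100.0, "tool_b": 90.0, "tool_c": 80.0, "tool_d": 60.0},
    "tool_b": {"tool_a": 70.0, "tool_b": 100.0, "tool_c": 75.0, "tool_d": 50.0},
}

scores = {name: total_score(name, results) for name, results in matrix.items()}
```

Excluding self-validation avoids rewarding a tool for agreeing with itself, and dropping the worst result limits the influence of a single unusually strict or faulty validator.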

The results of the PDF Technologies and VERYDOC products were very poor, and these tools cannot be recommended even for personal use. Solid Documents validated poorly and inconsistently but delivered good conversion results; the tool might do well as a desktop solution for personal or office use. Callas had an unsatisfactory conversion success rate, although it offers many additional functions (which were not relevant for this research).

Three of the tools passed into the second testing phase, but only PDFTron and 3-Heights were finally compared, because the LuraTech product did not provide a CLI (only a graphical interface with folder and subfolder input) and could not process the repository file list; moving the files out of the repository system was not an option. The 3-Heights product successfully converted 99.3% of the files, with 48 failures, and the conversion took 8 hours. PDFTron successfully converted 93.1% of the files, with 610 failures, and the conversion took 1.3 hours. The estimated conversion time based on the first phase test is 1.5 hours for LuraTech and 37.5 hours for Callas.

Discussion

Previous work has tried to determine the chances of success when migrating a collection of PDF files to PDF/A [5], but it used only one commercial tool, which is no longer available. Our research used the Isartor test suite to evaluate currently available commercial software tools; no mature open source software was found. The results showed that the test suite is a good starting point for evaluating PDF/A conversion capabilities and that the choice of tool can affect the conversion result significantly. The final decision must be based on testing the repository's own digital objects and depends very much on specific conversion needs, in our scenario the ability to batch process a list of PDF files. Other features and aspects of the evaluated tools might also influence the decision: OCR enhancement and metadata capabilities, CLI, desktop or server modes of operation, SDK availability, etc. The LuraTech tool has shown promising results, and it might be possible to use its software development kit (SDK) and write custom code for the conversion task. Quality testing of our converted test files remains to be done: before finally selecting the conversion tool we plan to check at least 5% of the converted files for visual identity with the originals.
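Drawing such a quality-control sample could look like the sketch below. The function name, the fixed seed and the placeholder file names are assumptions made for illustration; the visual comparison itself is not covered.

```python
import math
import random


def qa_sample(files, fraction=0.05, seed=2012):
    """Pick at least `fraction` of the files, reproducibly, for visual checks."""
    if not files:
        return []
    size = max(1, math.ceil(len(files) * fraction))
    # A seeded generator makes the selection repeatable between runs.
    return sorted(random.Random(seed).sample(sorted(files), size))


# Placeholder names standing in for the converted files (hypothetical).
converted = [f"doc{i:04d}.pdf" for i in range(9079)]
sample = qa_sample(converted)
print(len(sample), "files selected for visual comparison")
```

Rounding up guarantees that the sample never falls below the 5% threshold, and the fixed seed allows the same files to be re-checked after a repeated conversion run.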

Acknowledgement

This research was supported by the Ministry of Health Internal Grant Agency, Czech Republic, project no. NT12345

Legend Figure: 
Results of format identification and validation
Legend Table: 
Comparison of conversion results
References: 
  1. DigitalPreservationEurope. Repository Planning Checklist and Guidance DPED3.2. 2008 Apr [cited 2012 Apr 27]. Available from: http://www.digitalpreservationeurope.eu/publications/reports/Repository_Planning_Checklist_and_Guidance.pdf
  2. PDF/A-1, PDF for Long-term Preservation, Use of PDF 1.4 [Internet]. Library of Congress, Digital Preservation. [last updated 2009 Feb 25; cited 2012 Apr 27]. Available from: http://www.digitalpreservation.gov/formats/fdd/fdd000125.shtml
  3. JHOVE - JSTOR/Harvard Object Validation Environment [Internet]. JSTOR. [last updated 2009 Feb 25; cited 2012 Apr 27]. Available from: http://hul.harvard.edu/jhove/
  4. Isartor Test Suite [Internet]. PDF Association. 2011 [cited 2012 Apr 27]. Available from: http://www.pdfa.org/2011/08/isartor-test-suite/
  5. Walker FL, Gallagher ME, Thoma GR. PDF File Migration to PDF/A: Technical Considerations. Proc. IS&T Archiving; 2007 May; Arlington, Virginia. 2007 [cited 2012 Apr 27]. p. 6-11. Available from: http://www.lhncbc.nlm.nih.gov/lhc/docs/published/2007/pub2007020.pdf
Type of presentation: 
Poster

Libguides – the librarian created portals to high quality scientific research information

Abstract: 

What is a LibGuide?

The Central Medical Library started to develop LibGuides last year, in collaboration with the University Library Groningen. A LibGuide is a basic guide with a selective overview of resources and tools in a particular field, aimed primarily at students and beginning researchers. The set of all LibGuides of the University of Groningen is available at: http://LibGuides.rug.nl/

Why we chose LibGuides

LibGuides are the successors of the Electronic Literature Guides, which the library previously offered to students and beginning researchers.

These Literature Guides were manually edited overviews of resources per field on the library website. Over time, however, we became less satisfied with the Literature Guides, mainly because keeping them up to date was laborious and our web platform offered no Web 2.0 capabilities such as RSS or integration with social media.

During our search for a better way to create overviews of library resources we found Springshare's LibGuides (http://www.springshare.com/LibGuides/). No other tool met our requirements as well as LibGuides. LibGuides are made by librarians for librarians and are already used by many libraries (so far mainly in the U.S.).

They are specially designed for our purpose: to create a basic guide with resources and tools per subject. Furthermore, they are affordable (approximately $1,000 per year), easy to make and easy to use, mobile friendly, and offer many built-in tools. For example, you can integrate web chat or your virtual helpdesk into your guides. A LibGuide is easy to integrate with social networks such as Twitter and Facebook. You can simply insert web links, RSS feeds, polls, user feedback options and search boxes.

LibGuides are much easier to maintain than our former Electronic Literature Guides. Each element (or better: widget) of a LibGuide can be shared or reused in another LibGuide within the organization. For this purpose, we have made an internal ‘Derive LibGuide’ specifically for the derivation of widgets. Changes to a basic widget become automatically visible in its derived widgets.

The LibGuides can be adjusted to the visual identity of your organization by adding your logo and color.

Available medical LibGuides

To date, the Central Medical Library has compiled five LibGuides: Medicine, Dentistry, Human Movement Sciences, Neuroscience and Nursing (in Dutch), all available at: http://LibGuides.rug.nl/profile/cmb. Upon request, we can easily create more LibGuides in consultation with departments or staff members of the UMCG. The statistics show that the LibGuides meet the needs of our patrons. For example, the Medicine LibGuide had almost 18,000 views in 2011. An evaluation of the LibGuides by our patrons will follow soon.

 

Type of presentation: 
Poster