Lacebuilder
Lacebuilder is a friendly command-line application that generates packages for Lace, the in-browser web application for editing OCR output into TEI. Point it at an image directory and the corresponding hOCR output directory, as well as a simple XML metadata file, and it produces the .xar packages that can be installed in Lace through eXist-db’s drag-and-drop package manager.
Free software: BSD license
Documentation: https://lacebuilder.readthedocs.io.
Features
Generates a base image package for all derived OCR runs, binarizing all images
Generates OCR output packages with the enhanced data used to make editing OCR easy in Lace, including word spellcheck status and dehyphenation
Automatically corrects the word bounding boxes of kraken hOCR output
Examples
lacebuilder offers two subcommands, packimages and packtexts, each with its own parameters. The parameters --outputdir and --metadatafile are common to both subcommands, so they are given before the subcommand name. At present, the subcommands cannot be chained. To see the --help for a subcommand, you must still supply valid values for these two common parameters, thus:
lacebuilder --outputdir /tmp/ --metadatafile /tmp/myfile_meta.xml packtexts --help
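Because the shared options must always precede the subcommand, a small shell function can keep the ordering consistent. The following is only a sketch: the function name lace and the LACEBUILDER override variable are hypothetical conveniences, not part of lacebuilder.

```shell
# Hypothetical wrapper, not part of lacebuilder itself.
# Usage: lace OUTPUTDIR METADATAFILE SUBCOMMAND [SUBCOMMAND OPTIONS...]
# Set LACEBUILDER=echo to print the arguments instead of running the tool.
lace() {
    if [ "$#" -lt 3 ]; then
        echo "usage: lace OUTPUTDIR METADATAFILE SUBCOMMAND [OPTIONS...]" >&2
        return 2
    fi
    outputdir=$1
    metadatafile=$2
    shift 2
    # Shared options go first; "$@" now holds the subcommand and its options.
    "${LACEBUILDER:-lacebuilder}" --outputdir "$outputdir" --metadatafile "$metadatafile" "$@"
}
```

With this defined, lace /tmp/ /tmp/myfile_meta.xml packtexts --help is equivalent to the command above.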
Building an image package:
lacebuilder --outputdir /home/brucerob/ --metadatafile ~/Test_Lacebuilder/552464779_meta.xml packimages --imagedir ~/Test_Tarantella/test
outputdir: /home/brucerob/
generating image xar archive
Binarizing and compressing images
image archive of 111 images saved to /home/brucerob/552464779_images.xar
More information is required to build an hOCR output text package because Lace uses it to store multiple OCR ‘runs’ of a given image set and eventually to search and compile runs that have been completed using the same classifier:
lacebuilder --outputdir /home/brucerob/ --metadatafile ~/Test_Tarantella/552464779_meta.xml packtexts --hocrdir ~/Test_Tarantella/test_hocr_out/ --classifier ~/Downloads/Kraken-Greek-Classifiers-and-Samples/porson_2020-10-10-11-54-25_best.mlmodel --imagexarfile ~/552464779_images.xar
dehyphenating
spellchecking
generating hocr xar
accuracy 91%, Greek acc. 91%; completed 00%, Greek completed 00%
total: 20669 ; total correct: 11369
writing this data to /tmp/tmpo0_6nin6total.xml
text archive from date 2021-01-30-16-05-42 saved to /home/brucerob/552464779-2021-01-30-16-05-42-porson_2020-10-10-11-54-25_best-texts.xar
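Since the subcommands cannot be chained, the two steps have to be run one after the other. The sketch below wraps them in a single shell function. The function name build_lace_packages and the LACEBUILDER override are hypothetical, and the image archive name is an assumption inferred from the sample output above (the metadata file name with _meta.xml replaced by _images.xar); check the path that packimages actually reports before relying on it.

```shell
# Hypothetical helper, not part of lacebuilder itself.
# Usage: build_lace_packages OUTPUTDIR METADATAFILE IMAGEDIR HOCRDIR CLASSIFIER [EXTRA PACKTEXTS OPTIONS...]
# Set LACEBUILDER=echo to print the two argument lists instead of running the tool.
build_lace_packages() {
    if [ "$#" -lt 5 ]; then
        echo "usage: build_lace_packages OUTPUTDIR METADATAFILE IMAGEDIR HOCRDIR CLASSIFIER [OPTIONS...]" >&2
        return 2
    fi
    outputdir=$1
    metadatafile=$2
    imagedir=$3
    hocrdir=$4
    classifier=$5
    shift 5
    lb=${LACEBUILDER:-lacebuilder}

    # Step 1: build the image package; stop if it fails.
    "$lb" --outputdir "$outputdir" --metadatafile "$metadatafile" \
        packimages --imagedir "$imagedir" || return 1

    # ASSUMPTION: the archive is named <id>_images.xar, where <id> is the
    # metadata file name without its _meta.xml suffix, as in the sample output.
    id=$(basename "$metadatafile" _meta.xml)
    imagexar="${outputdir%/}/${id}_images.xar"

    # Step 2: build the text package against that image archive.
    "$lb" --outputdir "$outputdir" --metadatafile "$metadatafile" \
        packtexts --hocrdir "$hocrdir" --classifier "$classifier" \
        --imagexarfile "$imagexar" "$@"
}
```

Any arguments after the fifth are passed through to packtexts unchanged.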
Example Including Archive.org Files and Tesseract Processing
Here is a sequence of bash commands that convert a meta.xml file and zip archive of jp2 image files into Lace packages:
mkdir /tmp/Pliny
cd /tmp/Pliny/
mv ~/Downloads/epistularumlibr00plin_* ./
unzip epistularumlibr00plin_jp2.zip
cd epistularumlibr00plin_jp2/
parallel -P 6 opj_decompress -i {} -o {.}.png ::: *jp2
mkdir epistularumlibr00plin_png
mv *png epistularumlibr00plin_png/
mkdir epistularumlibr00plin_hocr
parallel -P 6 "tesseract -l lat {} epistularumlibr00plin_hocr/{/.} hocr" ::: epistularumlibr00plin_png/*png
cd ..
lacebuilder --outputdir . --metadatafile epistularumlibr00plin_meta.xml packimages --imagedir epistularumlibr00plin_jp2/epistularumlibr00plin_png/
lacebuilder --outputdir . --metadatafile epistularumlibr00plin_meta.xml packtexts --imagexarfile epistularumlibr00plin_images.xar --hocrdir epistularumlibr00plin_jp2/epistularumlibr00plin_hocr/ --ocr-engine tesseract --classifier lat --verbose
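Before packaging, it can be worth confirming that the OCR step produced output for every page image. The following sketch reports any PNG without a non-empty hOCR counterpart. It assumes Tesseract wrote its output as <name>.hocr next to the output base name given above; the function name check_hocr_coverage is hypothetical, and lacebuilder's own file-matching rules are not documented here.

```shell
# Hypothetical pre-flight check, not part of lacebuilder itself.
# Usage: check_hocr_coverage PNGDIR HOCRDIR
# Returns 0 if every PNGDIR/<name>.png has a non-empty HOCRDIR/<name>.hocr.
check_hocr_coverage() {
    pngdir=$1
    hocrdir=$2
    missing=0
    for png in "$pngdir"/*.png; do
        # Skip the unexpanded pattern when the directory holds no PNG files.
        [ -e "$png" ] || continue
        stem=$(basename "$png" .png)
        # ASSUMPTION: the hOCR file uses the .hocr extension.
        if [ ! -s "$hocrdir/$stem.hocr" ]; then
            echo "missing hOCR for $stem" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Run from /tmp/Pliny, this would be called as check_hocr_coverage epistularumlibr00plin_jp2/epistularumlibr00plin_png epistularumlibr00plin_jp2/epistularumlibr00plin_hocr.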
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.