A Linux Guide to Book Scanning

Posted 2013-07-27

Book scanning is the process of taking a physical book, and converting it to a scanned digital document, such as PDF or Djvu. First you must convert the physical pages of the book to digital images on the computer. There are many different scanning techniques to do this. Then you must process these images, fixing rotation, alignment, and margin-size of each image to look good when viewed on screen. Also it is good to convert pure text pages to black and white. This will be more readable, and introduce tremendous file size savings. After this, you must bundle the processed images together in a digital document format, like Djvu or PDF. You can also perform OCR (Optical Character Recognition) to make the document searchable, and add bookmarks to allow easy navigation.

Acquiring

Acquiring images of each page of the book is the most fundamental step of book scanning. Although you can fix a lot in post processing, you can't fix everything, so it is important that your scans are good quality. However, scanning 10000dpi 24bit full color images will take a long time, so you will want to choose settings that will speed up the scan without reducing quality. I normally scan at a 300dpi resolution, and do a scale up when rendering it to pure black and white. However, depending on the speed of your scanner you may want to scan at 600dpi. In any case I would not recommend scanning at a resolution of less than 300dpi. Some people say that they scan at 96dpi and the results look fine, but I contend those people must be blind. If your book contains a lot of tightly rendered line work, then you might want to scan those pages at higher resolutions. The 300dpi figure is for pure text, and for illustrations it might not be the best.
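To get a feel for what these resolutions mean in practice, here is a rough back-of-the-envelope sketch in Python. The 6 x 9 inch page size is just an assumed example, and the sizes are for raw, uncompressed pixels, so real files on disk will usually be smaller.

```python
def scan_pixels(width_in, height_in, dpi):
    """Return (width_px, height_px) for a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page in bytes."""
    w, h = scan_pixels(width_in, height_in, dpi)
    return w * h * bits_per_pixel // 8


# Assumed example: a 6 x 9 inch page, 8 bit gray-scale vs 24 bit color
for dpi in (96, 300, 600):
    w, h = scan_pixels(6, 9, dpi)
    gray = raw_size_bytes(6, 9, dpi, 8)
    color = raw_size_bytes(6, 9, dpi, 24)
    print(f"{dpi:>4} dpi: {w} x {h} px, gray {gray / 1e6:.1f} MB, color {color / 1e6:.1f} MB")
```

Doubling the resolution quadruples the pixel count, which is why 600dpi scans are so much slower and larger than 300dpi ones.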

Although we will be reducing the text to black and white, do not use the black and white setting on your scanner. Use gray-scale instead. A gray-scale scan is faster than a full color scan, and the extra information in the shades of gray is useful when reducing the page to black and white in software post processing. Many times the scanner will produce a black and white scan where the lines are too thin to read, and it is easier to change the algorithm settings for line width when post processing in software than it is to change the scanner's internal black and white setting.
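The toy Python sketch below illustrates why the shades of gray are worth keeping. It uses a plain fixed cutoff, which is only an assumption for illustration and not the algorithm ScanTailor or any scanner actually uses, and the pixel values are made up.

```python
def threshold(gray_rows, cutoff):
    """Turn a gray-scale image (rows of 0-255 values, 0 = black) into
    black and white: 1 marks an ink pixel, 0 marks paper."""
    return [[1 if value < cutoff else 0 for value in row] for row in gray_rows]


# Made-up cross-section of one pen stroke: dark core, gray edges
stroke = [[250, 200, 120, 30, 120, 200, 250]]

for cutoff in (100, 150, 220):
    bw = threshold(stroke, cutoff)
    print(cutoff, bw[0], "stroke width:", sum(bw[0]))
```

With the gray values available, the stroke can be made thinner or thicker after the fact just by moving the cutoff. A scan that is already black and white has thrown that choice away.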

Full color scanning takes a lot longer than gray-scale scanning, and color images will inflate your filesize. Only scan a page in color if it is really necessary. By this I mean that many books use color in places for no real reason. For example many books put section titles in color. For these, I usually don't bother scanning the page in color. To a purist: no, it's not a perfect digital representation of the book. But really, is anything important being lost by having the section titles be the same color as the rest of the text? I think not, and you can see vast improvements in filesize by avoiding the use of color. This same principle can be applied to many color graphs. If the graph consists entirely of shades of one color, then it can easily be mapped to shades of gray. Of course, if there are color photographs or elaborate color diagrams in your book, then you will want to scan these (and only these) pages in color. You'll take a hit in scanning time and file size, but if you need color, then you need it!

Many scanners give you a choice of image format to scan to. Don't scan directly to PDF. The results are usually not acceptable, and you will have to extract the images from the PDF to post process them anyway, so you might as well scan to individual images in the first place. If possible, try to scan directly to a lossless format like png or tiff. However, these scans will take up more disk space, so if that is an issue scanning to JPEG is acceptable. If you are scanning to JPEG make sure to scan at the highest level of quality available. You don't want to introduce compression artifacts, and at max JPEG quality you will still see a reduction in filesize without introducing noticeable compression artifacts.

Required Software

pdfimages
XSane
linux drivers for your scanner

From a PDF

Oftentimes you receive a PDF from a well meaning colleague. That person has taken the time to scan the book, but they have made some mistakes. The file size could be enormous, or they might have pages scanned crooked, or have scanned 2 pages per PDF page, making the document awkward to read. This person was not familiar with the book scanning techniques you will learn on this page. You will want to fix the document, but since it was already scanned you don't want to take the time out to rescan the whole document from scratch. As long as the PDF has pretty good quality -- no JPEG compression artifacts, and a decent resolution, there is a lot you can do to improve the document. The general idea is to deconstruct the PDF into a series of images, and then to treat those images as though they were images you had scanned yourself.

To do this for pdf, use the pdfimages command, which is usually contained within the poppler package of most linux distributions (some distributions call it poppler-utils). Execute the following command: pdfimages -j file.pdf out/pg_. Of course you should replace the filename and output prefix with something appropriate for your own situation. The output prefix will be prepended to the name of each output image. I usually keep those images in a subdirectory to keep things from getting too messy. The -j option tells pdfimages to save jpeg pages as jpeg images. However, if the page is in a different format, then pdfimages will save it as a pnm file (with a .ppm or .pbm extension), not a jpeg. ScanTailor cannot read pnm files, so we will need to convert them to another format before we can begin processing. For this task I would recommend graphicsmagick, as it has better built in parallel support than imagemagick, and so can process a lot of files faster. We will convert all the pnm files to png, a lossless image format, to avoid any reduction in quality introduced through lossy compression like jpeg. In the output directory, run the command gm mogrify -format png *.ppm *.pbm (leaving out whichever pattern has no matching files), and then delete the pnm files when the command finishes.
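If you have several PDFs to take apart, the two commands can be wrapped in a small Python script. This is only a sketch: the function names are my own, it assumes pdfimages and gm are installed and on your PATH, and it assumes the non-JPEG images come out with .ppm or .pbm extensions (it also picks up .pnm just in case).

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, out_dir, prefix="pg_"):
    """Build the pdfimages call from the text: -j keeps JPEG pages as JPEG."""
    return ["pdfimages", "-j", str(pdf_path), str(Path(out_dir) / prefix)]


def mogrify_command(files):
    """Build the GraphicsMagick call that converts the given files to PNG."""
    return ["gm", "mogrify", "-format", "png"] + [str(f) for f in files]


def extract_pages(pdf_path, out_dir):
    """Extract page images, convert non-JPEG output to PNG, remove the originals."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # make sure the subdirectory exists
    subprocess.run(pdfimages_command(pdf_path, out_dir), check=True)
    # Assumption: non-JPEG images are written with one of these extensions
    raw = sorted(p for p in out_dir.iterdir() if p.suffix in (".ppm", ".pbm", ".pnm"))
    if raw:
        subprocess.run(mogrify_command(raw), check=True)
        for p in raw:
            p.unlink()  # only reached if the conversion succeeded
```

Calling extract_pages("file.pdf", "out") would then leave only jpeg and png files in the out directory, ready to be loaded into ScanTailor.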

Flatbed Scanner

Flatbed scanning is an old standby. If a home or business owns a scanner it is usually a flatbed scanner. Flatbed scanning is nondestructive, in that you can scan the book without destroying the book binding. However, it can often be slow and tedious to scan in this way, depending on the speed of the scanner. Your main enemies in flatbed scanning are spine shadows and page curl. Spine shadows are the dark areas in scans that appear near the spine of the book. They occur because the book is not completely flat on the scanner, resulting in darkness at the spine, which is further away from the scanning beam. Page curl is a warping distortion which occurs near the spine of the book. Because the book doesn't lie completely flat, the part of the page near the spine will curl towards the spine, resulting in nonuniform distance from the scanning beam. This results in a warped perspective in the resulting image at that part of the page. Both of these problems can be counteracted by making the book lie as flat as possible on the scanning surface. You will probably be unable to get rid of all spine shadow and page curl, but the goal is to reduce them so they only appear on the margins, not the text. If they are only on the margins, that is easily correctable in post processing, but if they are on the text, that could make the text unreadable.

If the book is small enough it can be a good idea to scan two pages at once. If the book can lie on the scanning surface when it is completely open, with both pages lying flat, then you can do this. This will reduce the time to scan the book by half, and splitting the pages into individual pages is easily done during post processing. However when doing this you need to be extra aware and cautious about spine shadow and page curl, as these are more prevalent with this technique.

When scanning a page you will want to press down slightly on the book's cover, to try and ensure the page is pressed as flat as possible against the scanning surface. Be careful not to fidget or move during the scan, because this can cause ugly scanning artifacts that could make the page unreadable. Don't worry too much about having the book be at a perfect 180 degree angle, as you can easily rotate during post processing. Focus more on keeping the book still and flat.

If you are scanning one page at a time, scan all the even (or odd) pages first, and when you are done with that scan all the odd (or even) pages. It is much more efficient to turn the page and hit next image on your batch scanning program than to have to reorient and reposition the book between each scan. For the first pass, set XSane's starting number to 0 and the increment to 2. Once you have scanned all of one side of the book, go back to XSane, and change the starting number to 1, leaving the increment at 2. In this way the odd and even pages will automatically interleave together when sorting by filename. For example, you start with the even pages being 0.tif, 2.tif, 4.tif. Then you scan the odd pages as 1.tif, 3.tif, 5.tif. Finally when you sort by filename the book will be in correct order, pages 0 - 5.
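The numbering scheme is easy to check with a few lines of Python. The sketch below only generates file names, it does not talk to XSane. The zero-padded four digit counter is my own assumption: a plain sort by name compares text, not numbers, so unpadded names stop sorting correctly once you pass page 9.

```python
def page_names(start, step, count, digits=4, ext="tif"):
    """File names a counter-based batch scan would produce."""
    return [f"{start + i * step:0{digits}d}.{ext}" for i in range(count)]


evens = page_names(start=0, step=2, count=6)   # first pass
odds = page_names(start=1, step=2, count=6)    # second pass

# Padded names interleave into the correct page order
print(sorted(evens + odds))

# Without padding, a plain sort by name puts 10 and 11 before 2
unpadded = page_names(0, 2, 6, digits=1) + page_names(1, 2, 6, digits=1)
print(sorted(unpadded))
```

If your files do end up unpadded, renaming them with leading zeros before loading them into other tools avoids this problem.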

Scanning a book this way can be tedious, so I usually put a movie on that I can watch while scanning. Of course you will want to have some practice before doing two things at once like this, or else you could easily mess up the scan and not realize it.

Formfeed Scanner

A formfeed scanner is probably the fastest scanning mechanism, but it is a destructive scanning method. In order to pass the book through the formfeed scanner, you will have to debind it, and reduce it to a stack of loose papers. Then you pass that paper through the scanner.

Debinding the book is not as simple as it seems. Depending on the quality of the binding it can be quite difficult. The best way in my experience is to use a box cutter, and cut directly through the binding at regular intervals, say every 15-30 pages, depending on the thickness of each page. Once the book is separated into chunks, you can remove the edge binding much more easily, with an industrial paper cutter, or even plain scissors if your paper is thin enough. However, you will want to keep the binding on until you are ready to scan that particular chunk. Doing this will make it easier to keep the pages of the chunk in the correct order, as they will be glued in the right order until you start scanning.

After you have decomposed the book into these chunks, you will want to keep them in order while scanning. Use some kind of physical folder system to keep each chunk in. You want to make sure that you don't lose any pages while scanning, and that all the pages stay in order. It is a lot easier to reorganize a 15 page chunk than it is a 500 page book, after all. Keep each chunk in its own folder, and keep track of the order of the folders, and whether or not you have already scanned them. Organization is very important.

When feeding the form feeder, you will want to be very careful. Even though these scanners advertise that they are able to automatically scan dozens of pages at once, that figure is calculated based off the weight of normal office paper. Book paper is of a different weight, and so this figure may not be accurate. You will have to experiment and see the number of pages that works best for your particular scanner and book. However, don't load the whole chunk at once. Paper jams can be dangerous when book scanning, because if the page gets ruined, you may end up with no recourse but to buy another copy of the book to rescan the one defective page. If you are careful, you can usually avoid paper jams. Another problem I have noticed is that some books are printed on slick, thin paper. These kinds of pages stick together, so you can end up with 2 or 3 pages going through the scanner at once, resulting in those page images being messed up in the scan result. Keep an eye out for this, as you might have to scan these troublesome pages one at a time. If the pages stick together when reading the book, they'll probably stick together when scanning the book.

If you don't have an auto 2 sided form feeder scanner, you will have to take some steps to ensure the pages are numbered correctly. First scan all of one side of the pages (even or odd, depending on the book). Set XSane's starting number to 0 and the increment to 2. Once you have scanned that side of every page in the chunk, scan the other side. Go back to XSane, and change the starting number to 1, leaving the increment at 2. In this way the odd and even pages will automatically interleave together when sorting by filename. For example, you start with the even pages being 0.tif, 2.tif, 4.tif. Then you scan the odd pages as 1.tif, 3.tif, 5.tif. Finally when you sort by filename the book will be in correct order, pages 0 - 5. When you move on to the next chunk, start the numbering so that the next chunk will come after the first chunk when sorting by filename. Don't just start from 0 and 1 again.
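Working out where each chunk's numbering should start is simple arithmetic, sketched here in Python. The chunk sizes are made up, and the sketch assumes every sheet is scanned on both sides and that the second sides go through the feeder in the same page order as the first sides.

```python
def chunk_counters(sheets_per_chunk):
    """For each chunk return (first_side_start, second_side_start, increment).

    Each sheet has two sides, so a chunk of n sheets uses up 2 * n numbers.
    """
    settings = []
    next_free = 0
    for sheets in sheets_per_chunk:
        settings.append((next_free, next_free + 1, 2))
        next_free += 2 * sheets
    return settings


# Made-up book cut into chunks of 15, 20 and 18 sheets
for number, (first, second, step) in enumerate(chunk_counters([15, 20, 18]), start=1):
    print(f"chunk {number}: first sides start at {first}, "
          f"second sides start at {second}, increment {step}")
```

Writing these starting numbers on each chunk's folder before you begin makes it harder to lose your place halfway through the book.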

Handheld (Portable) Scanner

This method involves the use of a handheld portable scanner. The way this type of scanner works is you have a small handheld bar which will scan a document as you move the scanner over the page of the book. The real key to scanning with this method is practice. You will have to experiment with your scanner first to see what will produce the optimal results. Tweak the speed of your movement, direction, angle, etc to see what will produce the best looking scan with your particular scanner. It's also very important to keep hand shaking down to a minimum, so this method will require a very steady hand. When practicing, just scan a few pages at a time, and don't attempt to scan a whole book until you have reasonable proficiency. There's no sense spending the time to scan a whole book, only to realize that your scans are so bad you can't even save them in post-processing.

I have only used this technique once myself. I got poor quality results, and did not really have a chance to practice, as I had borrowed the scanner and had to return it. Still, I have heard people claim they are able to go as fast as 20ppm with practice, so this method may be worth looking into more. If you already own a scanner of this type, I would go ahead and give it a try, but if not there are probably better options to spend your money on.

Camera Based Scanner

Just a brief caveat: I have no firsthand experience using this method, and all my knowledge of it comes from second and third hand sources. However, I will include it here for completeness, and possibly update it later if I gain more experience with the method.

The general idea of this method is to use a highspeed digital camera to take pictures of each page of the book. With a sufficiently high resolution camera you can obtain a high DPI, and capture the image faster than a traditional scanner can. To get good quality photos, you will need lighting fixtures to fully illuminate each page, eliminating shadows. You will also need a piece of glass that will flatten the pages to eliminate page curl. The final piece is the scaffolding that holds each component in place, and angles them correctly to capture the pages.

If you are interested in this method, and handy with tools, you can find plans at the DIYBookScanner forums, or even buy a complete DIY kit. This method is probably the fastest and highest quality nondestructive book scanning setup, but you will need to build it yourself, instead of just buying commercial off the shelf hardware.

Processing

After you have gotten images of the pages of the book digitized, you will need to perform post processing. Post processing the images will clean up the scans, crop them, convert them to black and white, and more generally take them from their raw scanned state to looking more like a document you could actually read. In this page we are using the excellent ScanTailor software to process the scans.

Required Software

ScanTailor

ScanTailor is very powerful software, but it is not magic. As such, you should try to get the best looking scans you can as input to ScanTailor. ScanTailor can correct many common scan issues, like splitting 2 page scans, fixing margin sizes, and rotating, but there are some things it can't fix. Spine shadows and page curl can be difficult for ScanTailor to process, so you should focus on reducing these.

The automatic modes of ScanTailor are very useful, but they are not always perfect. You should always verify that the automatic mode operated correctly on all pages before proceeding to the next step.

Fix Orientation

Often you will need to fix the orientation of your scans. A very common case is when you have scanned even and odd pages separately, the even pages will have a different alignment than the odd pages. Simply rotate the page to the correct orientation, and apply to all pages if all pages need to be rotated, or apply to every other page if the even and odd pages require different rotations.

Split Pages

There are 3 different page layout options in ScanTailor. It can treat the image as one single page, a page with part of another page that needs to be cropped off, or two separate pages that need to be split and processed individually. The automatic mode usually works pretty well, but you'll want to go back and check to make sure it didn't make any mistakes.

Deskew

In this step, you have more fine grained control over the rotation. You can adjust the page so that all the lines of text on the page are straight. The automatic method works pretty well here too, and you don't really need to check this one as carefully. Just look out for any egregious errors when looking through the thumbnails.

Select Content

Another cool thing about ScanTailor is that it can automatically draw out the margins for you. The first step in this is to select the content. This means you must move the box on the screen so it surrounds everything on the page that needs to be in the final document. Run automatic mode first, and then go back and check all the pages. I have found the automatic mode can sometimes misdetect page numbers and graphs, so be especially mindful of these types of pages.

Margins

It can be useful to measure the book with a ruler to determine the actual margins. Then you could just type those in. However, doing this is not strictly necessary, as you can usually visually guesstimate pretty well. The end goal is to make it look good on display, so if it looks good, it is good. One thing to be careful of is that books by default have different sized left and right page margins. In a book the part of the page that is facing the spine has a larger margin than the part facing the edge of the book. This is a precaution on the publisher's part to make sure no part of the text becomes illegible by being too close to the spine. However, this can have an annoying result if you are reading it as a digital document on your screen, as the margins will change every other page. To fix this, take a page and set the left and right margins to somewhere in the middle of the edge facing and spine facing margins. Then apply that margin to all pages, and apply the upper margin alignment to all pages.
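If you did measure the book, picking the middle value is a one line calculation. The Python sketch below uses made-up measurements in millimetres; use whatever unit you enter into ScanTailor.

```python
def uniform_side_margin(spine_side, edge_side):
    """Single left/right margin halfway between the two printed margins."""
    return (spine_side + edge_side) / 2


# Made-up ruler measurements in millimetres
spine_side, edge_side = 22.0, 14.0
margin = uniform_side_margin(spine_side, edge_side)
print(f"use {margin} mm on both the left and the right of every page")
```

Because the two new margins add up to the same total as the two printed ones, the page keeps its original width and only the text block shifts sideways.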

Some pages, especially at the beginning of the book, may require some special margin attention. Keep the same margin size as other pages, but change the alignment to fit accordingly. For example, most dedications should be margin aligned center, not upper.

Output

I usually leave the output resolution on 600 even if I scan at 300. Upscaling like this helps the black and white text to look smoother. However it can be wasteful if that page is not in black and white. For color/grayscale images, you can lower the resolution to reduce filesize.

For the mode setting, you will have to look at the page in question. Black and white should be the preferred mode for all text, and line art. Color/Grayscale should only be used on full page images without any text on the page. Mixed should be used for pages that contain a mix of text and color or grayscale images. Black and white will compress the best, so it should be preferred if possible, followed by Mixed. Only use color/grayscale if you really need it.

In Mixed mode, you can use the automatic picture detection or adjust manually. Go to the Picture zones tab on the right of the screen, and draw zones around the borders of the pictures. Anything inside the zone will be processed as color/gray-scale, anything outside the zone will be processed as black and white.

After this, adjust the thickness setting until the text is readable. The default value of 0 is usually pretty good, but some books have especially thick or thin fonts that show up weird when scanning.

Dewarping allows you to fix problems that arise when the book is not flat when scanned, such as page curl. The auto mode doesn't work very well yet, so you should do your editing in manual mode. Click on the Dewarping tab on the right, and adjust the blue grid until it fits the page. Then ScanTailor will apply a reverse transformation to dewarp the page. This can be time consuming to do, and sometimes the results look a little off, so I would recommend scanning the book as flat as possible from the beginning, and not planning to rely too much on this feature.

Fill zones are zones that ScanTailor will fill with a solid color. This is useful for editing out unwanted or damaged portions of the page, for example incorrect or illegible margin notes. Simply go over to the Fill Zones tab and mark the zones to fill, just as you marked the pictures in the Picture Zones tab. However, while this mode is handy, I will often prefer to do work of this nature later on in GIMP instead of directly in ScanTailor.

Despeckling will automatically remove little groups of pixels that can be introduced through scanner noise, dust on the scanning surface, page blemishes, etc. The default value is usually good, but if your page contains heavy speckling you should adjust it until the result is good. However, be careful about turning it up too high, as you could remove letters, or more commonly punctuation marks like periods, commas, etc.
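To see why an aggressive setting eats punctuation, here is a toy despeckler in Python. It is only a sketch of the general idea (drop every connected group of ink pixels below some size) and not ScanTailor's actual implementation. The tiny page and the min_size parameter are made up for illustration.

```python
from collections import deque


def despeckle(bitmap, min_size):
    """Remove 8-connected groups of ink pixels smaller than min_size.

    bitmap is a list of rows, 1 = ink and 0 = paper. A new bitmap is returned.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    result = [row[:] for row in bitmap]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1 or seen[y][x]:
                continue
            # Flood fill to collect one connected group of ink pixels
            group = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                group.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Groups that are too small are treated as noise and erased
            if len(group) < min_size:
                for gy, gx in group:
                    result[gy][gx] = 0
    return result


page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 1],   # a 6-pixel "letter" and a 1-pixel speck
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],   # a 2-pixel "period"
]

gentle = despeckle(page, min_size=2)   # removes the speck, keeps the period
harsh = despeckle(page, min_size=3)    # removes the period as well
print(gentle)
print(harsh)
```

A period is just a small blob of ink, so once the size cutoff grows past it, the despeckler has no way to tell it apart from dust.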

Finally you can preview your output to OK it. After you apply processing to each page, you should check them all to make sure everything is good before continuing.

External Editing

Sometimes you need to do more extensive editing of a page than ScanTailor has options for. For example, you may find that part of a letter on a page has been cut off. For such a small problem it's not worth rescanning the page, so you can fire up GIMP, copy the same letter intact from a different part of the page, and move it to take the place of the damaged letter. Additionally, you have more fine tuned control to remove things like underlining under the text.

Bundling

Once you have processed your scanned images, you will need to bundle them together into a document format, and include useful things like metadata, indexed bookmarks, OCR, etc. For scanned books the "standard" formats are Djvu and PDF. Linux has a much stronger Djvu toolset, thanks in no small part to the high quality djvulibre project, but making PDF files is still doable, if a bit awkward at times. I would encourage you to read the manuals for these tools if you want to get more in depth, but the instructions here should suffice for more standard situations.

The general idea for both formats is to compress the black and white images from before with a bitonal encoder, like cjb2 for djvu and jbig2enc for pdf. Images with some color/gray-scale (mixed mode from scantailor) are processed using a mixed raster content encoder. This will combine a black and white layer for the text with a color background for the colored component, resulting in a smaller size document than just encoding the whole image as color.

Finally, you can add a text layer via OCR (optical character recognition) to make the document searchable, as well as index bookmarks for easy navigation. Additional metadata (title, author, isbn, publisher, date, etc) can also be inserted at this stage, depending on the document format.

Djvu

Djvu may be less familiar to the layman than PDF, but it is well known among book scanners, and many consider it to be technically superior to PDF. Indeed, it is the de facto standard in Russia, and many of the great Russian book scanning tools treat Djvu as their primary target. The tooling for Djvu on linux is very good, thanks in no small part to the excellent high quality djvulibre project. This is the format I usually keep all the books I scan in. However, if you need to share the document with lots of other people who are not really familiar with book scanning, then PDF might be the better choice. For some reason it is more recognized and accepted as a standard document format by the average person.

Required Software

Overview

For Djvu, we will be using the djvubind program to bundle and perform OCR on the document. Djvubind is a python wrapper and will invoke other tools to help it in its work, such as tesseract or cuneiform for OCR and minidjvu or djvulibre for document compression.

Bundling/OCR

To bundle, simply run the command djvubind in the output directory, and djvubind will generate a file called book.djvu for you. djvubind is also pretty nice in that it will perform the OCR on the document automatically as well (depending on your settings in the configuration file).

Bookmarks

For bookmarks, using a graphical program like djvusmooth is probably best. Open your djvu document with djvusmooth, and add bookmarks at the correct places, making sure to spell things correctly. I like to open a copy of the document in another viewer, and leave that copy on the page containing the table of contents, to make it easy to find where all the sections are that need to be marked. Simply go to each page mentioned in the Table of Contents, add a bookmark in djvusmooth and name it appropriately. Bookmarks can be nested, so you should set up your nesting to follow the format in the TOC. For example, Section 1 of Chapter 1 would be nested under Chapter 1, Section 1.1 would be nested under Section 1, and so on. When you are done, save the file, and then your finished Djvu is ready.

PDF

PDF is a pretty popular document format. Most people are familiar with PDF, and should have a PDF reader installed. However, the tooling on linux to create PDFs is sadly not quite as good, or as easy to find as its proprietary counterparts. Additionally, many tools tend to recompress the pdf, and do not handle JBIG2 images, and so they will convert them internally to JPEG, resulting in a file with less visual quality and a higher filesize. In general, a PDF will have a larger filesize than a Djvu of the same document, but if you follow this guide you will still end up with a reasonably sized file.

Required Software

Overview

For PDF, we will be using the pdfbeads program to bundle the document. pdfbeads produces high quality output, but can be difficult to set up correctly, especially because the main manual is in Russian. pdfbeads is ruby based, so to install it issue the command gem install pdfbeads. You may need root permissions for this, depending on your ruby setup. In order to compress bitonal images you will need a jbig2 encoder. I'd recommend the free jbig2enc. This software uses the familiar autotools ./configure, make, make install process. Finally, to be able to incorporate color pages and MRC pages, you will need an image encoder like imagemagick or graphicsmagick. Graphicsmagick is faster, but doesn't support JPEG2000. For color or the backgrounds of MRC pages, pdfbeads can encode as either JPEG or JPEG2000. JPEG2000 is a newer compression method than JPEG and it results in smaller filesizes and better quality than JPEG.

For bookmarks we use the jpdfbookmarks java program. It has an easy to use gui, and allows one to add bookmarks to a pdf file without recompressing. Don't use ghostscript to add bookmarks to your pdf file, because ghostscript will fully interpret and render your pdf before output and recompress it, which will result in inflated filesize, and reduced image quality.

Unfortunately, I was unable to find a native linux program to do good quality ocr on PDFs. While the actual text recognition was decent with engines like tesseract and cuneiform, the alignment of the text layer in the resulting pdf was very awkward. Words had random fontsizes and didn't match up with the position of the word in the image portion of the PDF. The result was unsatisfactory when using hocr2pdf and pdfbeads. However, there is a piece of gratis windows software called PDF Xchange Viewer, which works well under wine and can add a high quality OCR text layer without recompressing the rest of the document.

Bundling

To bundle, simply open your terminal and go to the directory where you saved your ScanTailor output. In this directory run the command pdfbeads *.tif > out.pdf, where out.pdf is the name of the pdf file you want to create. pdfbeads is pretty fast (faster than djvubind) so you will not have to wait long.
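The same step can be scripted in Python if you bundle books regularly. This is a sketch under a few assumptions: the function names are mine, pdfbeads is installed and on your PATH, and the page images sort into the right order by file name, which is what the shell's *.tif expansion relies on too.

```python
import subprocess
from pathlib import Path


def pdfbeads_command(image_dir, pattern="*.tif"):
    """Build the pdfbeads call with the page images in file name order."""
    pages = sorted(p.name for p in Path(image_dir).glob(pattern))
    if not pages:
        raise ValueError(f"no files matching {pattern} in {image_dir}")
    return ["pdfbeads"] + pages


def bundle_pdf(image_dir, output_name="out.pdf"):
    """Run pdfbeads inside image_dir and write its standard output to output_name."""
    image_dir = Path(image_dir)
    command = pdfbeads_command(image_dir)
    # pdfbeads prints the PDF to standard output, so redirect it into the file
    with open(image_dir / output_name, "wb") as pdf:
        subprocess.run(command, cwd=image_dir, stdout=pdf, check=True)
    return image_dir / output_name
```

Printing the result of pdfbeads_command first is a cheap way to confirm the page order before spending time on the actual encoding.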

OCR

To perform OCR, run PDF Xchange Viewer using wine. Open your out.pdf file, and click on Document->OCR. Choose the highest quality option, and wait for the file to be processed. Once it is done, save the file, and get ready to add bookmarks.

Bookmarks

Run jpdfbookmarks and open up your pdf. I also find it helpful to run a pdf viewer at the same time, and leave it open to the scanned Table of Contents page in your pdf. Your goal here is to copy the structure of that Table of Contents into indexed bookmarks so that you can easily navigate to all the different parts of your document. Simply go to each page mentioned in the Table of Contents, add a bookmark in jpdfbookmarks and name it appropriately. Bookmarks can be nested, so you should set up your nesting to follow the format in the TOC. For example, Section 1 of Chapter 1 would be nested under Chapter 1, Section 1.1 would be nested under Section 1, and so on. When you are done, save the file, and then your finished PDF is ready.
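Before clicking through the gui, it can help to write the outline down first. The Python sketch below is purely a planning aid with made-up titles and page numbers. The indented text it prints is not a file format that jpdfbookmarks or any other tool reads.

```python
def flatten_outline(entries, depth=0):
    """Yield (depth, title, page) for every bookmark, parents before children."""
    for title, page, children in entries:
        yield depth, title, page
        yield from flatten_outline(children, depth + 1)


# Made-up table of contents: (title, page in the file, nested bookmarks)
toc = [
    ("Chapter 1", 9, [
        ("Section 1", 10, [
            ("Section 1.1", 12, []),
        ]),
        ("Section 2", 20, []),
    ]),
    ("Chapter 2", 31, []),
]

for depth, title, page in flatten_outline(toc):
    print("    " * depth + f"{title} ... {page}")
```

Keep in mind that a bookmark points at a page of the file, which is usually offset from the printed page number by however many cover and front matter pages come first.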

Clean Up

Once you are done you will want to clean up your directories and remove space consuming temporary files. I would recommend that you hang on to the raw scan images, the ScanTailor project, and any files you edited with an external editor. The raw scan images take up a lot of space, but when you consider the cost in time of rescanning them it might make more sense to just hang on to them rather than have to scan the book over again. Of course you are free to do as you will.

Distribution

Distribution will depend largely on your circumstances and the copyright status of your book. If your book is in the public domain, you should consider submitting it to the Internet Archive, or Project Gutenberg. Of course if you are doing this make sure that they don't already have a copy of the book, or if they do that your scan is a higher quality than their current copy. For internal documents you should ask your superiors before engaging in any distribution. However you might want to hang on to those documents anyway, as they could have historical significance in the future.