Thursday, November 12, 2009

Batch scanning paper documents to PDF

As with most things in Linux, there are a hundred ways to achieve the same outcome, and batch scanning documents is no exception. My requirements were for a process that starts with one click, uses a fixed format, settings and post-processing steps, and saves the results to a PDF. For an n-page paper document, it should take n+1 clicks or key presses, followed by selecting a folder and typing a file name. Very minimalistic.

Linux offers some very simple ways to automate everyday tasks. The bash script below has worked perfectly for a few years now.

The script repeatedly scans colour A4 pages at 300 dpi. After de-speckling each image, it converts it to a JPEG (quality 50%) and wraps each JPEG in its own single-page PDF. It then joins all the single-page PDFs into one PDF file and saves it to the location the user selected.

There are other, more polished tools for scanning to PDF (e.g. gscan2pdf), but they all require more clicks or settings to be checked. It would be easy enough to rewrite this in Python and add a better UI, but it works fine as it is. Let me know if this script has helped you!

Notes:
  • After each page is scanned, a dialog box appears: hitting "Enter" starts scanning the next page, and "Esc" indicates there are no more pages.
  • Imagemagick performs the image conversion in a background process, so the next page can start scanning immediately.
  • Kdialog (from KDE) is used to request user input. This could easily be swapped for the Gnome equivalent.
  • All temporary files are stored in /tmp/scanpdf. Files more than 1 day old are deleted the next time the script runs.
  • Gwenview is started automatically to show pages as they are scanned. From here it is possible to crop, rotate or reorder pages before saving. 
  • The final pages are joined in alphabetical order - pages can be reordered by renaming them.
  • Your scanner should already be configured and working with other "sane" scanning tools (e.g. Gimp). Replace the vendor name ("epson") at the top of the script to match your scanner type.
  • Prerequisites for scanning and converting to PDF:
    • sudo aptitude install imagemagick sane-utils pdftk
  • Prerequisites for user input and displaying files:
    • sudo aptitude install kdebase-bin gwenview okular
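
The note about alphabetical ordering relies on page numbers being zero-padded, so that page 10 does not sort before page 2. A minimal sketch of that idea, which needs no scanner (page_name and demo_dir are hypothetical names used only for this illustration):

```bash
#!/bin/bash
# Sketch: zero-padded page names keep alphabetical (glob) order equal to scan order.
# page_name is a hypothetical helper, not part of the script in this post.
page_name() {
    # $1 = page counter; prints a name padded to three digits
    printf 'page-%03d' "$1"
}

# Create a few empty stand-in pages in a throwaway directory.
demo_dir=$(mktemp -d)
for n in 0 1 2 10 11; do
    touch "$demo_dir/$(page_name "$n").jpg"
done

# The shell expands globs in sorted order, so the padded names list in scan order.
# Without padding, page-10 would sort before page-2.
for f in "$demo_dir"/*.jpg; do
    basename "$f"
done

rm -r "$demo_dir"
```

Renaming a file in the viewer so that it sorts earlier or later should therefore move that page within the final PDF.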

Paste the following into a new text file, then set the file's "x" (execute) permission, for example with chmod +x.

#!/bin/bash
scanner="epson"
tmp="/tmp/scanpdf"
tmpdir="$tmp/$(date '+%F_%T')"
pagename="page"

#remove files older than 1 day left over from previous runs
mkdir -p "$tmp"
find "$tmp" -type f -mmin +1440 -exec rm {} \;

#setup new temp dir
mkdir -p "$tmpdir"

counter=0
pid=0

#convert to jpg in background
function doConvert {
    pnm=$1
    jpg=${pnm%.pnm}.jpg
    convert -quality 50 -despeckle -interlace Plane -monitor "$pnm" "$jpg"
    rm "$pnm"
}

#load viewer
gwenview $tmpdir &
pid=$!

until [ $counter -eq -1 ]
do
    #scan a page (page number is zero-padded to three digits)
    page="$tmpdir/$pagename-$(printf '%03d' "$counter").pnm"
    scanimage --format pnm --resolution 300 --mode col --quick-format A4 \
        --device "$scanner" --progress > "$page"

    #convert in background
    doConvert "$page" &

    counter=$((counter + 1))

    kdialog --title "Scan To PDF" --yesno "Do you want to scan another page?"
    if [ $? != 0 ]; then
        break
    fi
done

#get filename to save to
outfile=`kdialog --getsavefilename :scanpdf "*.pdf|PDF Documents"`
if [ $? != 0 ]; then exit; fi

#wait for background conversions to finish
while ls "$tmpdir"/*.pnm >/dev/null 2>&1; do sleep 1; done

#dump each jpg into a single-page pdf, keeping jpeg compression
for file in "$tmpdir"/*.jpg; do convert -page A4 -compress jpeg "$file" "$file.pdf"; done

#use pdftk to concatenate pages together
pdftk "$tmpdir"/*.jpg.pdf cat output "$outfile"

#view finished pdf
okular "$outfile" &

#system tray popup
kdialog --title "Scan to PDF" --passivepopup "Complete" 10 
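
The wait for background conversions could also be done by recording each converter's process ID and waiting on those, rather than polling for leftover .pnm files. A minimal sketch, assuming a stand-in slow_job function (hypothetical, in place of doConvert) so that it runs without a scanner:

```bash
#!/bin/bash
# Sketch: wait for specific background jobs by PID instead of polling for files.
# slow_job is a hypothetical stand-in for the script's doConvert function.
slow_job() {
    sleep 1
    echo "converted $1"
}

pids=""
for page in page-000 page-001 page-002; do
    slow_job "$page" &
    pids="$pids $!"    # remember each converter's PID
done

# Wait only for the recorded jobs. A bare `wait` would also block on any
# other background process, such as a long-running viewer.
failed=0
for p in $pids; do
    wait "$p" || failed=$((failed + 1))
done
echo "conversions finished, failures: $failed"
```

This approach also reports how many conversions failed, which file polling cannot tell you.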

