Wednesday, May 6, 2009

Converting JPEGs into PDFs with ImageMagick in Ubuntu 9.04

Before the recent update to Ubuntu 9.04 (Jaunty), the default install of ImageMagick (sudo apt-get install imagemagick) would convert a 300 dpi scanned JPEG image into a PDF with this command:

convert -page A4 image.jpg out.pdf
The resulting PDF would simply embed the JPEG image, making it only a few KB larger. However, with v6.5.4 of ImageMagick the default behaviour of this command is to decompress the JPEG and store it in a lossless format. In my case, a scanned A4 page jumped from 150 KB to 7 MB.

Given the lack of documentation on ImageMagick's output format settings, it took a bit of experimenting to find that the new command to embed a JPEG is:

convert -page A4 -compress jpeg image.jpg out.pdf
The additional -compress option reproduces the original results.
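
A quick way to check which behaviour you are getting is to compare the two file sizes. This is a minimal sketch, assuming a POSIX shell; the function name size_overhead is made up for illustration and is not part of ImageMagick.

```shell
# Hypothetical helper: size_overhead IMAGE.jpg OUT.pdf
# Prints how many bytes larger the PDF is than the JPEG.
# A few KB of overhead suggests the image is stored as JPEG;
# several MB suggests it was stored in a lossless format.
size_overhead() {
    jpg_bytes=$(wc -c < "$1")
    pdf_bytes=$(wc -c < "$2")
    echo $(( $pdf_bytes - $jpg_bytes ))
}
```

For example, run size_overhead image.jpg out.pdf after each of the two convert commands and compare the numbers.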

This is part of a larger custom bash script that automates "one click" scanning of sequential paper pages into a PDF on Ubuntu. Simple, fixed settings, no messing around. Post a comment if you're interested. Update: I have posted this script here.

14 comments:

  1. Thank you VERY MUCH! It was exactly what I needed!

    5 JPEG images of about 800 KB each were becoming a 123 MB PDF file! LOL

  2. Now, how do I convert several JPEG files into a single PDF?

  3. I use pdftk to join multiple PDFs into one file. In its most basic usage:

    pdftk *.pdf cat output outfile.pdf

  4. To convert multiple JPEG files:

    try
    convert *.jpeg test.pdf
    or
    convert *.jpg test.pdf

  5. @Matthew, that works okay for a couple of pages. Last time I tried that with 20 JPEG pages, ImageMagick jammed up trying to request a few GB of RAM.

    pdftk adds some great additional functionality that's worth investigating: http://www.accesspdf.com/pdftk

  6. Hi Rob, I would like to see your bash script. I am interested.

  7. @Manuel, I have posted the complete script I have been using at http://www.rrfx.net/2009/11/batch-scanning-paper-documents-to-pdf.html ...let me know if it helps you out!

  8. Hello,
    I've observed that:

    convert -compress jpeg in.jpg out.pdf

    won't simply put the JPEG image into the output document, but it will instead *recompress* it, thereby losing data.

    Is there a way around this?

  9. Now this is odd:

    tlon:~/pdf-jpeg-test$ convert -compress jpeg original.jpg original.pdf
    tlon:~/pdf-jpeg-test$ v
    total 304
    -rw-r--r-- 1 orbis tertius 186761 2009-11-25 17:59 original.jpg
    -rw-r--r-- 1 orbis tertius 113360 2009-11-25 18:01 original.pdf

    Note that the PDF file is smaller than the JPEG. Extracting the JPEG with pdfimages -j and then comparing it with the original one shows visible differences.


    On the other hand, (re)compressing the JPEG picture before "converting" it into PDF results in the PDF containing the unmodified JPEG data:
    tlon:~/pdf-jpeg-test$ convert -quality 99 original.jpg 99original.jpg
    tlon:~/pdf-jpeg-test$ convert -compress jpeg 99original.jpg 99original.pdf
    tlon:~/pdf-jpeg-test$ v 99*
    -rw-r--r-- 1 orbis tertius 201099 2009-11-25 18:01 99original.jpg
    -rw-r--r-- 1 orbis tertius 207282 2009-11-25 18:02 99original.pdf

    tlon:~/pdf-jpeg-test$ convert -quality 50 original.jpg 50original.jpg
    tlon:~/pdf-jpeg-test$ convert -compress jpeg 50original.jpg 50original.pdf
    tlon:~/pdf-jpeg-test$ v 50*
    -rw-r--r-- 1 orbis tertius 76878 2009-11-25 18:02 50original.jpg
    -rw-r--r-- 1 orbis tertius 79395 2009-11-25 18:02 50original.pdf

  10. Hi Orbis, I was about to (re)post a long reply to that effect. Unfortunately Firefox 3.5.5 is a buggy piece of crap and it crashed while I was waiting for Kdiff3.

    A binary diff between an original test JPEG (8 MB) and the one extracted from a PDF with "pdfimages -j" showed them to be identical for the first 40% of the file and completely different for the other 60%. Odd, but it makes sense that a single-bit difference would then make the rest of the JPEG data different.

    I remember doing tests like this way back when I first set myself up for scanning paper documents. Enough tests to be convinced that the JPEG was as good as being stored. Progressive scan JPEGs were converted to baseline first.

    It seems like ImageMagick stores the quality level in the JPEG. I've noticed the same behaviour in Gimp when you hit "save as" on a JPEG, close it, reopen it and hit "save as" again. However, Gimp doesn't pick up the quality level that ImageMagick seems to have written to the file.

    Given I'm using ImageMagick for all my postprocessing, I've not found it to be a problem. Cheers.

  11. For anyone wanting to test this:

    #convert -quality 66 dsc07857.jpg test.jpg
    #convert -compress jpeg test.jpg test.pdf
    #pdfimages -j test.pdf out

    #ls -l (reordered source->jpg->pdf->extracted jpg)
    -rw-r--r-- 1 rob rob 52352 2008-04-05 14:02 dsc07857e800.jpg
    -rw-r--r-- 1 rob rob 29499 2009-11-26 01:46 test.jpg
    -rw-r--r-- 1 rob rob 32418 2009-11-26 01:46 test.pdf
    -rw-r--r-- 1 rob rob 29481 2009-11-26 01:47 out-000.jpg

    The extracted JPEG is almost the same file size. Binary diff:
    #kdiff3 test.jpg out-000.jpg
    In this case it shows the first 20% of the binary JPEG data to be the same.

    To verify that the decoded image data is the same, convert both files to a bitmap and binary diff them **:
    #convert test.jpg test.bmp
    #convert out-000.jpg out-000.bmp
    #kdiff3 test.bmp out-000.bmp
    Here the bitmap header is different, however the image data is identical.

    **don't try this on large JPEG files

  12. Thanks a lot. I use a GNOME Nautilus script with Ubuntu:
    #!/bin/bash
    IFS='
    '
    convert -page a4 -quality 50 -compress jpeg $NAUTILUS_SCRIPT_SELECTED_FILE_PATHS photos.pdf

  13. Hi, thanks for your observation. I have a few images downloaded from the internet and I want to retain their quality. I used the following command:
    convert image.jpg image.pdf

    I observed that
    convert -page A4 image.jpg out.pdf and
    convert -page A4 -compress jpeg image.jpg out.pdf
    made no difference to the two resulting PDFs. The size of the image is 209.7 KB and the resulting PDF in both cases is 204.3 KB. I see a bit of loss in quality in the converted PDF. Is it possible to retain the image quality somehow?

  14. Your converted PDF is probably smaller than the original JPEG because it uses "progressive" encoding for the stored JPEG. This should be lossless; see Wikipedia: http://en.wikipedia.org/wiki/JPEG

    "It has been found that Baseline Progressive JPEG encoding usually gives better compression as compared to Baseline Sequential JPEG due to the ability to use different Huffman tables"

    and

    "It is also possible to transform between baseline and progressive formats without any loss of quality, since the only difference is the order in which the coefficients are placed in the file"

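
Pulling together comments 3 to 5: converting each JPEG to its own single-page PDF and then joining the pages with pdftk means each convert call only has to handle one image, which should avoid the memory blow-up described in comment 5. This is a minimal sketch, assuming a POSIX shell with convert and pdftk on the PATH. The function name jpegs_to_pdf and the temporary file naming are illustrative only, and this is not the script linked in comment 7.

```shell
# Hypothetical helper: jpegs_to_pdf OUTPUT.pdf PAGE1.jpg [PAGE2.jpg ...]
jpegs_to_pdf() {
    out=$1
    shift
    [ "$#" -ge 1 ] || return 1          # need at least one input image
    tmpdir=$(mktemp -d) || return 1
    n=0
    for img in "$@"; do
        n=$((n + 1))
        # Zero-padded names so the glob below keeps the pages in input order.
        page=$(printf '%s/page-%04d.pdf' "$tmpdir" "$n")
        # One convert call per page, using the flags from the post.
        convert -page A4 -compress jpeg "$img" "$page" || { rm -rf "$tmpdir"; return 1; }
    done
    # Join the single-page PDFs, as in comment 3.
    pdftk "$tmpdir"/page-*.pdf cat output "$out"
    status=$?
    rm -rf "$tmpdir"
    return $status
}
```

For example: jpegs_to_pdf scan.pdf page1.jpg page2.jpg page3.jpg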