reid.ai: Converting JPEG's into PDF's with ImageMagick in Ubuntu 9.04

Wednesday, May 6, 2009

Before the recent update to Ubuntu Jaunty 9.04, the default install of ImageMagick (sudo apt-get install imagemagick) would convert a 300 dpi scanned JPEG image into a PDF with this command:

convert -page A4 image.jpg out.pdf
The resulting PDF simply embedded the JPEG image, making it only a few KB larger. However, with v6.5.4 of ImageMagick, the default behaviour of this command is to decompress the JPEG and store it in a lossless format. In my case, a scanned A4 page jumped from 150 KB to 7 MB.

Given the lack of documentation on ImageMagick's output format settings, it took a bit of experimenting to find that the new command to embed a JPEG is:

convert -page A4 -compress jpeg image.jpg out.pdf
The additional -compress jpeg option reproduces the original results.
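To see the difference on your own scans, a minimal sketch along these lines builds the PDF both ways and prints the size of each file. It assumes ImageMagick's convert is on the PATH; the function name compare_pdf_sizes and the temporary file names are made up for this illustration.

```shell
#!/bin/bash
# Sketch: build an A4 PDF from one JPEG with and without "-compress jpeg"
# and print the size in bytes of the source and of each PDF.
# compare_pdf_sizes is a hypothetical helper name, not part of ImageMagick.
compare_pdf_sizes() {
    local src="$1"
    local tmpdir f
    tmpdir="$(mktemp -d)" || return 1

    # Default behaviour (the large file described above)
    convert -page A4 "$src" "$tmpdir/default.pdf" || return 1
    # Explicitly request JPEG compression inside the PDF
    convert -page A4 -compress jpeg "$src" "$tmpdir/jpeg.pdf" || return 1

    for f in "$src" "$tmpdir/default.pdf" "$tmpdir/jpeg.pdf"; do
        # wc -c reports bytes; the arithmetic expansion strips any padding
        printf '%s %s\n' "$(( $(wc -c < "$f") ))" "$(basename "$f")"
    done
    rm -rf "$tmpdir"
}

# Usage: compare_pdf_sizes image.jpg
```

If the two PDF sizes come out nearly equal, the installed ImageMagick version probably already defaults to JPEG compression for JPEG input.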

This is part of a larger custom bash shell script that automates "one click" scanning of sequential paper pages into a PDF on Ubuntu. Simple, fixed settings, no messing around. ~~Post a comment if you're interested.~~ Update: I have posted this script here: http://www.rrfx.net/2009/11/batch-scanning-paper-documents-to-pdf.html

14 comments:

Unknown, Thursday, July 23, 2009 at 2:14:00 PM PDT
Thank you VERY MUCH! It was exactly what I needed!

5 jpeg images with about 800 KB each were becoming a 123 MB pdf file! LOL

arekm, Tuesday, August 25, 2009 at 4:22:00 AM PDT
Now, how do I convert a few JPEG files into a single PDF?

Rob, Tuesday, August 25, 2009 at 10:55:00 AM PDT
I use pdftk to join multiple PDFs into one file. In its most basic usage:

pdftk *.pdf cat output outfile.pdf

Matt, Thursday, September 10, 2009 at 9:08:00 PM PDT
To convert multiple Jpeg files:

try
convert *.jpeg test.pdf
or
convert *.jpg test.pdf

Rob, Monday, September 14, 2009 at 5:10:00 AM PDT
@Matthew, that works okay for a couple of pages. Last time I tried that with 20 JPEG pages, ImageMagick jammed up trying to request a few GB of RAM.

pdftk adds some great additional functionality that's worth investigating: http://www.accesspdf.com/pdftk
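One way to combine the two suggestions above, and possibly sidestep the memory problem, is to convert each page to its own PDF and only then join the pages with pdftk. A rough sketch, assuming both convert and pdftk are installed; the function name jpegs_to_pdf and the temporary page-NNNN.pdf naming are invented for this example:

```shell
#!/bin/bash
# Sketch: convert each JPEG to a one-page PDF, then join the pages with pdftk.
# Converting one image per convert call avoids loading every page at once.
# jpegs_to_pdf is a hypothetical helper name.
jpegs_to_pdf() {
    local out="$1"
    shift
    # Need at least one input image
    [ "$#" -ge 1 ] || return 1

    local tmpdir n img rc
    tmpdir="$(mktemp -d)" || return 1
    n=0
    for img in "$@"; do
        n=$((n + 1))
        # Zero-padded names so the glob below keeps the page order
        convert -page A4 -compress jpeg "$img" \
            "$tmpdir/$(printf 'page-%04d.pdf' "$n")" || { rm -rf "$tmpdir"; return 1; }
    done

    # Same basic pdftk usage as above: cat all pages into one output file
    pdftk "$tmpdir"/page-*.pdf cat output "$out"
    rc=$?
    rm -rf "$tmpdir"
    return "$rc"
}

# Usage: jpegs_to_pdf outfile.pdf scan-*.jpg
```

The pages are joined in the order the images are passed in, so a sortable naming scheme for the scans keeps the document in sequence.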

mixkey, Thursday, November 12, 2009 at 11:43:00 AM PST
Hi Rob, I would like to see your bash script. I am interested.

Rob, Monday, November 16, 2009 at 12:23:00 PM PST
@Manuel, I have posted the complete script I have been using at http://www.rrfx.net/2009/11/batch-scanning-paper-documents-to-pdf.html ... Let me know if it helps you out!

Unknown, Wednesday, November 25, 2009 at 6:53:00 AM PST
Hello,
I've observed that:

convert -compress jpeg in.jpg out.pdf

won't simply put the JPEG image into the output document, but it will instead *recompress* it, thereby losing data.

Is there a way around this?

Unknown, Wednesday, November 25, 2009 at 9:09:00 AM PST
Now this is odd:

tlon:~/pdf-jpeg-test$ convert -compress jpeg original.jpg original.pdf
tlon:~/pdf-jpeg-test$ v
total 304
-rw-r--r-- 1 orbis tertius 186761 2009-11-25 17:59 original.jpg
-rw-r--r-- 1 orbis tertius 113360 2009-11-25 18:01 original.pdf

Note that the PDF file is smaller than the JPEG. Extracting the JPEG with pdfimages -j and then comparing it with the original one shows visible differences.

On the other hand, (re)compressing the JPEG picture before "converting" it into PDF results in the PDF containing the unmodified JPEG data:
tlon:~/pdf-jpeg-test$ convert -quality 99 original.jpg 99original.jpg
tlon:~/pdf-jpeg-test$ convert -compress jpeg 99original.jpg 99original.pdf
tlon:~/pdf-jpeg-test$ v 99*
-rw-r--r-- 1 orbis tertius 201099 2009-11-25 18:01 99original.jpg
-rw-r--r-- 1 orbis tertius 207282 2009-11-25 18:02 99original.pdf

tlon:~/pdf-jpeg-test$ convert -quality 50 original.jpg 50original.jpg
tlon:~/pdf-jpeg-test$ convert -compress jpeg 50original.jpg 50original.pdf
tlon:~/pdf-jpeg-test$ v 50*
-rw-r--r-- 1 orbis tertius 76878 2009-11-25 18:02 50original.jpg
-rw-r--r-- 1 orbis tertius 79395 2009-11-25 18:02 50original.pdf

Rob, Wednesday, November 25, 2009 at 9:43:00 AM PST
Hi Orbis, I was about to (re)post a long reply to that effect. Unfortunately Firefox 3.5.5 is a buggy piece of crap and it crashed while I was waiting for KDiff3.

A binary diff between an original test JPEG (8 MB) and the one extracted from a PDF with "pdfimages -j" showed the files to be identical for the first 40% and completely different for the other 60%. Odd, but it makes sense that a single bit difference would then make the rest of the JPEG data differ.

I remember doing tests like this way back when I first set myself up for scanning paper documents. Enough tests to be convinced that the JPEG was as good as stored unchanged. Progressive JPEGs were converted to baseline first.

It seems like ImageMagick stores the quality level in the JPEG. I've noticed the same behaviour in GIMP when you hit "save as" on a JPEG, close it, reopen it and hit "save as" again. However, GIMP doesn't pick up the quality level that ImageMagick seems to have written to the file.

Given I'm using ImageMagick for all my postprocessing, I've not found it to be a problem. Cheers.

Rob, Wednesday, November 25, 2009 at 9:57:00 AM PST
For anyone wanting to test this:

#convert -quality 66 dsc07857.jpg test.jpg
#convert -compress jpeg test.jpg test.pdf
#pdfimages -j test.pdf out

#ls -l (reordered source->jpg->pdf->extracted jpg)
-rw-r--r-- 1 rob rob 52352 2008-04-05 14:02 dsc07857e800.jpg
-rw-r--r-- 1 rob rob 29499 2009-11-26 01:46 test.jpg
-rw-r--r-- 1 rob rob 32418 2009-11-26 01:46 test.pdf
-rw-r--r-- 1 rob rob 29481 2009-11-26 01:47 out-000.jpg

The extracted JPEG is almost the same file size. Binary diff:
#kdiff3 test.jpg out-000.jpg
In this case it shows the first 20% of the binary JPEG data to be the same.

To verify the decoded JPEG image data is the same, convert to a bitmap and binary diff **:
#convert test.jpg test.bmp
#convert out-000.jpg out-000.bmp
#kdiff3 test.bmp out-000.bmp
Here the bitmap header is different, however the image data is identical.

** Don't try this on large JPEG files.
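As an alternative to diffing bitmaps by hand, ImageMagick's identify can print a signature hash of the decoded pixel values via the %# format escape, which ignores file headers. A small sketch, assuming identify is installed; same_pixels is a made-up helper name, and the signature may also reflect properties such as bit depth, so treat a mismatch as a prompt to look closer rather than as proof:

```shell
#!/bin/bash
# Sketch: compare two images by ImageMagick's pixel signature (%#),
# which is computed from the decoded pixel values rather than the file bytes.
# same_pixels is a hypothetical helper name.
# Returns 0 if the signatures match, 1 if they differ, 2 if identify fails.
same_pixels() {
    local a b
    a="$(identify -format '%#' "$1")" || return 2
    b="$(identify -format '%#' "$2")" || return 2
    [ "$a" = "$b" ]
}

# Usage:
#   same_pixels test.jpg out-000.jpg && echo "pixel data matches"
```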

test, Sunday, December 27, 2009 at 11:47:00 AM PST
Thanks a lot. I use a GNOME Nautilus script with Ubuntu:
#!/bin/bash
IFS='
'
convert -page a4 -quality 50 -compress jpeg $NAUTILUS_SCRIPT_SELECTED_FILE_PATHS photos.pdf

someone, Friday, January 13, 2012 at 5:24:00 AM PST
Hi, thanks for your observation. I have a few images downloaded from the internet and I want to retain their quality. I used the following command:
convert image.jpg image.pdf

I observed that
convert -page A4 image.jpg out.pdf and
convert -page A4 -compress jpeg image.jpg out.pdf
made no difference to the two resulting PDFs. The image is 209.7 KB and the resulting PDF is 204.3 KB in both cases. I see a bit of loss in quality in the converted PDF. Is it possible to retain the image quality somehow?

Rob, Friday, January 13, 2012 at 10:47:00 PM PST
Your converted PDF is smaller than the original JPEG probably because it uses "progressive" encoding for the stored JPEG. This should be lossless, see Wikipedia: http://en.wikipedia.org/wiki/JPEG

"It has been found that Baseline Progressive JPEG encoding usually gives better compression as compared to Baseline Sequential JPEG due to the ability to use different Huffman tables"

and

"It is also possible to transform between baseline and progressive formats without any loss of quality, since the only difference is the order in which the coefficients are placed in the file"