Recovery of scanned PDF corrupted by email transfer

Occasionally, when my scanner emails me a PDF, the file gets corrupted along the way. Most of the time the individual scanned images inside the PDF are either perfectly valid or can at least be viewed, even though the PDF viewer refuses to open the corrupt file.

The bash script below extracts the images and combines them into a new, valid PDF file. It relies on binwalk, dd and ImageMagick's convert.

#!/bin/bash

# input PDF file: first argument, given without the .pdf extension (default: input.pdf)
if [[ -n $1 ]]
then
	ifn="$1.pdf"
else
	ifn=input.pdf
fi

# output image / PDF file prefix: second argument (default: output)
if [[ -n $2 ]]
then
	ofp="$2"
else
	ofp=output
fi

# get offsets of JPG images in input PDF
b=($(binwalk "$ifn" | grep JPEG | awk '{ print $1 }'))
# length of array is number of JPGs found
n=${#b[@]}
# add total input file length to end of offset array
# this enables the calculation of lengths for all images found
b[n]=$(stat -c%s "$ifn")

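# stop early if binwalk found no JPEG images
if (( n == 0 ))
then
	echo "No JPEG images found in $ifn." >&2
	exit 1
fi

echo "Found $n images, started processing."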

# loop over images
for ((i=0,j=1; i<${n}; i++, j++))
do
	# retrieve offset
	offs=${b[$i]}
	# calculate length as b[i+1]-b[i]
	cnt=$(( ${b[$j]}-${b[$i]} ))
	# use dd to extract and save individual image
	dd if="$ifn" of="$ofp-$(printf "%04d" $i).jpg" bs=1 skip=$offs count=$cnt
done

# combine images into single valid PDF
convert "$ofp"-*.jpg "$ofp.pdf"

echo Finished.

If the scanned PDF contains a different image type (e.g. TIFF), modify the grep, dd and convert commands accordingly.
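
The length calculation in the loop is easier to follow in isolation. The sketch below is only an illustration: segment_lengths is a hypothetical helper that is not part of the script, and the offsets are made-up numbers rather than real binwalk output. Each image is assumed to run from its own offset up to the next offset, and the last one up to the end of the file, so the final image also carries whatever PDF data follows it.

```shell
#!/bin/bash

# Hypothetical helper: takes the start offsets followed by the total
# file size as the last argument and prints "offset length" per image.
segment_lengths() {
	local -a b=("$@")
	# the last entry is the file size, not an image offset
	local n=$(( ${#b[@]} - 1 ))
	local i
	for ((i = 0; i < n; i++))
	do
		# length of image i is b[i+1] - b[i]
		echo "${b[i]} $(( b[i+1] - b[i] ))"
	done
}

# Made-up example: three images at offsets 120, 5300 and 9800
# in a file of 15000 bytes
segment_lengths 120 5300 9800 15000
# prints:
# 120 5180
# 5300 4500
# 9800 5200
```

The two numbers on each line correspond to the skip and count values passed to dd in the script.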
