Docx file version control

Being a big fan of version control systems, I thought it would be fun to put my writing under version control as well. The files themselves go into a VCS just fine, but because a docx file is a zipped archive rather than plain text, the differences in content cannot be read easily. To get that functionality I had to come at the problem from left field: I had to provide the same text in a format that the VCS diff could discern and display.

The answer to my problem (well, the first half at least) was Pandoc. Pandoc is a piece of software that converts files from one format to another. Want to convert from HTML to docx? Pandoc does that. Vice versa? Pandoc can handle that too. I decided to use Markdown as the format that would be displayed in my version control diffs.

So how to make this magic happen? The first step, of course, is to install Pandoc. Most Linux distributions probably already have it available. As for Windows… sorry, not covering that.
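Before going further it can help to check whether the tools are already on the machine. The following is only a minimal sketch: the function names have_tool and check_tools are made up for this illustration, and it relies on nothing more than the standard command -v lookup.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named command can be found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: print one status line for every tool name given.
check_tools() {
    for tool in "$@"; do
        if have_tool "$tool"; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}
```

Calling check_tools pandoc inotifywait would then report on both commands this post relies on (inotifywait comes into play further down). Anything reported as missing has to be installed through the distribution's package manager.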

Once Pandoc is installed, the command to convert is very simple:

pandoc -f [type to convert from] [file to convert] -t [type to convert to] >[file to convert].extension

So to convert docx to Markdown, the command would read:

pandoc -f docx test.docx -t markdown > test.md
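To avoid typing the output name by hand, the same command can be wrapped in a small function. This is a sketch under assumptions, not part of the original setup: md_name_for and docx_to_md are hypothetical names, and it uses Pandoc's -o option to name the output file instead of a shell redirect.

```shell
#!/bin/sh
# Hypothetical helper: derive the Markdown file name for a .docx file
# by swapping only the trailing extension.
md_name_for() {
    printf '%s\n' "${1%.docx}.md"
}

# Hypothetical helper: convert one .docx file to Markdown next to the original.
docx_to_md() {
    if ! command -v pandoc >/dev/null 2>&1; then
        echo "pandoc not found" >&2
        return 1
    fi
    pandoc -f docx -t markdown -o "$(md_name_for "$1")" "$1"
}
```

With these defined, docx_to_md test.docx would write test.md into the same directory as the source file.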

 

My second problem was how to make the conversion continuous so that I wouldn't have to run this command by hand every time. The answer is another package called inotify-tools, and more precisely its inotifywait command. inotify-tools provides the tools needed to watch and react to file system changes (such as creating new files or editing existing ones). By combining these two tools I can create an automated system that watches my files for changes and converts them automatically. The full script ended up being as follows:

#!/bin/bash

MONITORDIR=$1
TARGETDIR=$2
VALIDEXTENSION="docx"
PANDOC=pandoc
SOURCEFORMAT=docx
SOURCE_EXTENSION=docx
TARGETFORMAT=markdown
TARGETEXTENSION=md

E_BADDIR=1
E_CONVERSIONFAIL=2

# quote the variable so that a missing argument is caught as well
if [ ! -d "$TARGETDIR" ]; then
    echo "Directory $TARGETDIR could not be found!"
    exit $E_BADDIR
fi

inotifywait -m -r -e modify -e create --format '%w%f' "${MONITORDIR}" | while IFS= read -r NEWFILE
do
    filename=$(basename "$NEWFILE")
    extension="${filename##*.}"

    # only convert docx files (not the temp files which end in docx#)
    if [ "$extension" = "$VALIDEXTENSION" ]; then
        echo "Converting file $filename"

        # mirror the path below the monitored directory into the target directory
        relative=${NEWFILE#"$MONITORDIR"}
        newpath="$TARGETDIR$relative"
        directory=$(dirname "$newpath")

        if [ ! -d "$directory" ]; then
            echo "Creating directory $directory"
            mkdir -p "$directory"
        fi

        # swap only the trailing extension, not the first "docx" in the name
        convertedFileName="${filename%.$SOURCE_EXTENSION}.$TARGETEXTENSION"

        # test pandoc's exit status: the redirect creates the file even on failure
        if "$PANDOC" -f "$SOURCEFORMAT" "$NEWFILE" -t "$TARGETFORMAT" > "$directory/$convertedFileName"; then
            echo "Converted to $directory/$convertedFileName"
        else
            echo "File conversion failed!"
            exit $E_CONVERSIONFAIL
        fi
    fi
done

All I have to do is run this script with the proper source and target folders, and everything within the source folder (recursively) is converted automatically. Afterwards I just commit both files to version control.
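One thing to keep in mind is that inotifywait only reacts to changes made while it is running, so documents that already exist are not converted until they are next saved. A one-off catch-up pass could look roughly like the sketch below. The names target_for and convert_existing are hypothetical, and the code assumes that every file found lives below the monitored directory.

```shell
#!/bin/sh
# Hypothetical helper: map a .docx path below the monitored directory
# to the matching .md path below the target directory.
# $1 = monitored directory, $2 = target directory, $3 = path to a .docx file
target_for() {
    root=${1%/}                # tolerate a trailing slash on the directory
    rel=${3#"$root"/}          # path relative to the monitored directory
    printf '%s\n' "${2%/}/${rel%.docx}.md"
}

# Hypothetical helper: convert every existing .docx file once.
# $1 = monitored directory, $2 = target directory
convert_existing() {
    find "$1" -type f -name '*.docx' | while IFS= read -r src; do
        dst=$(target_for "$1" "$2" "$src")
        mkdir -p "$(dirname "$dst")"
        pandoc -f docx -t markdown -o "$dst" "$src" || echo "Failed: $src" >&2
    done
}
```

Running convert_existing with the same two folders before starting the watcher would bring the target folder up to date, after which the watcher script keeps it that way.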