NOTE. Since WordPress doesn’t allow proper XML to be shown on its pages, and I can’t even link to simple shell scripts, all sample files and command lines are shown as png images. Sorry about that.
END OF NOTE.
Recently, I wanted to test how large a very large XML file (180 MB in size) would be if it were formatted as a CSV (in this case a semicolon-separated file). To find out, I wrote a little routine in AWK that reads a specified set of tags and their values, and prints them as CSV in the order in which they were specified.
A sample XML file (with quite a few tags removed) would look like this:
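As a minimal stand-in, assuming one value tag per line inside each "record" element, a file of that shape can be created like this. The root element "records" and the tags "id", "name" and "city" are made up for illustration; only "record" and the path data/inputfile.xml come from the rest of this post.

```shell
# Create a small stand-in input file.
# Hypothetical names: records, id, name, city. Only "record" is from the post.
mkdir -p data
cat > data/inputfile.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<records>
  <record>
    <id>1</id>
    <name>Alice</name>
    <city>Oslo</city>
  </record>
  <record>
    <id>2</id>
    <name>Bob</name>
    <city>Bergen</city>
  </record>
</records>
EOF
```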
First, I needed to get hold of the tags of the XML file. All I’m really interested in, as will become evident later, are the value tags inside the “record” element. Extracting all of them was pretty easy.
I redirected the output of the command to a file called elements.txt. Note the order in which the elements are listed. That’s the order in which they’ll appear on each output line later.
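The extraction step described above can be sketched as a small shell function around AWK. This is one possible way to do it, not necessarily the command originally used: the function name list_elements is hypothetical, and it assumes that the opening and closing "record" tags sit on their own lines with no attributes, and that each value tag starts on its own line.

```shell
# list_elements TAG [FILE...]
# Print each distinct tag name found between <TAG> and </TAG>,
# in order of first appearance. Reads stdin when no FILE is given.
# Usage: list_elements record data/inputfile.xml > elements.txt
list_elements() {
  tag=$1
  shift
  awk -v tag="$tag" '
    $0 ~ ("<" tag ">")  { inside = 1; next }   # opening tag of a record
    $0 ~ ("</" tag ">") { inside = 0; next }   # closing tag of a record
    # Inside a record: grab the first opening tag name on the line.
    inside && match($0, "<[A-Za-z_][^ >/]*") {
      name = substr($0, RSTART + 1, RLENGTH - 1)
      if (!(name in seen)) { seen[name] = 1; print name }
    }
  ' "$@"
}
```

Because names are printed in order of first appearance, the resulting elements.txt can be reordered or trimmed by hand to control which columns end up in the CSV.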
Now, here’s the code to run to get all of the XML formatted as CSV.
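What follows is a minimal sketch of such a script, not the original listing. It takes the same four variables used on the command line in this post (INPUTFILE, STARTTAG, ENDTAG, ELEMFILE) and does all its work in a BEGIN block, since the input file is passed as a variable rather than as a file argument. It assumes that the start and end tags carry no attributes, that each value sits on one line as <name>value</name>, and that values contain no semicolons (nothing is quoted or escaped). All internal variable names are my own.

```shell
# Write the AWK program to xmlToCSV.awk (quoted heredoc: no shell expansion).
cat > xmlToCSV.awk <<'EOF'
BEGIN {
    # Read the wanted element names, one per line, keeping their order.
    n = 0
    while ((getline line < ELEMFILE) > 0) {
        gsub(/^[ \t]+|[ \t\r]+$/, "", line)
        if (line != "") order[++n] = line
    }
    close(ELEMFILE)

    open_tag  = "<" STARTTAG ">"
    close_tag = "</" ENDTAG ">"
    inside = 0

    while ((getline line < INPUTFILE) > 0) {
        if (index(line, open_tag)) {       # a new record begins: reset values
            inside = 1
            for (i = 1; i <= n; i++) value[order[i]] = ""
            continue
        }
        if (index(line, close_tag)) {      # record complete: emit one CSV line
            if (inside) {
                out = ""
                for (i = 1; i <= n; i++)
                    out = out (i > 1 ? ";" : "") value[order[i]]
                print out
            }
            inside = 0
            continue
        }
        if (!inside) continue

        # Expect "<name>value</name>" on a single line.
        if (match(line, "<[A-Za-z_][^ >/]*")) {
            name = substr(line, RSTART + 1, RLENGTH - 1)
            rest = substr(line, RSTART + RLENGTH)
            gt   = index(rest, ">")
            stop = index(rest, "</" name ">")
            # Keep the value only if the tag is listed in ELEMFILE.
            if (gt > 0 && stop > gt && (name in value))
                value[name] = substr(rest, gt + 1, stop - gt - 1)
        }
    }
    close(INPUTFILE)
}
EOF
```

Elements that are listed in the element file but missing from a record come out as empty fields, so every output line has the same number of columns.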
The syntax to call the program is as follows; its output should be redirected to a separate output file. Note the “STARTTAG” and “ENDTAG” parameters. They name the tags between which data should be extracted. Note that this script can’t really handle XML in which the same tag name occurs more than once within a record. But that shouldn’t be too hard to add either…
awk -f xmlToCSV.awk -v INPUTFILE=data/inputfile.xml -v STARTTAG=record -v ENDTAG=record -v ELEMFILE=elements.txt