Awking it – how to load a file into an array in awk

Since I’ve been using a lot of awk lately, I figured I should write an entry about one of its most useful features. To me, at least.

That’d be the ability to load data from a file, preferably a csv (comma separated values) file, with data stored in clear text, and keeping it in a key-accessed array in memory. I’ve used this technique to load a whole lot of subscription data (I’m in telecom) into memory and mapping out subscriber data for a migration I’m part of, for example. I’m not pretending to be very skilled at awk, or even to be any good at it at all, but… it’s working great for my purposes.

Now, let’s make this easy. We’ve got an inputfile, conveniently named “inputfile.csv”, with the format as shown below. We want to load this into memory, key it on “id”, and have “name” as the data.

inputfile.csv
id,name
1,Svante
2,Nils
3,Sven
4,Inga
5,Lena

Then we’ve got a second file, the file we want to process. This file contains an “id”, and a (fake) “phonenumber”. Let’s call it “data.csv”.

data.csv
3,555-912421
2,555-812471
4,555-821411
1,555-818811
5,555-999999

As output, we want to get a file that has “id”, “name” and “phonenumber”. Let’s call this file “output.csv”.

output.csv
1,Svante,555-818811
...

This’d be an easy task to do manually, but with 300,000 rows in either file, things turn more interesting at once.

Now, first, let’s set up the simple awk program. I usually have a little Bash first, to set things up and format the inputfiles and so on (since I know more Bash than awk).

#! /bin/bash
INPUTFILE="inputfile.csv"

Now, what’s going on here? Well, at first, I set up the filename we want to use, as a variable in Bash. That’s mostly so I can show how to refer to “external” variables in awk later.


awk 'BEGIN {

Then, the awk program begins. After the keyword “BEGIN”, there must always be a “{” to designate the start of the program. It ends with a closing, final, “}”.


while (getline < "'"$INPUTFILE"'")
{
split($0,ft,",");
id=ft[1];
name=ft[2];

And here, we do the actual reading of the inputfile. Note the “‘” around the Bash variable. That means that awk should just use the variable without interfering with it or trying to parse it in any way.

The split command takes the original row, “$0”, and splits it into an array “ft”, with “,” being the delimiter of fields. To make things easier for reference (I’ve got 26-column files to work with, so it’s easier this way), I take the fields of the “ft” array (numbering starts at 1 (one)) and name them.


key=id;
data=name;
nameArr[key]=data;

Now, since we’ve only got two columns, and the key is just made up of “id”, this is way overkill. But, since I’ve had to combine multiple fields to make up a key and also store multiple columns in one “data” variable (just remember to separate the values with something, and you’ll be OK), it makes sense.

The “nameArr” array will now be constructed, with “key” as key, and “data” as data. As simple as that. No need to declare anything, no initialization needed.


}
close("'"$INPUTFILE"'");
}'

And here’s the end. To be able to read the same file again later, if we’d want to, in the same program, always remember to “close()” the inputfile used.

OK, let’s append some more. Let’s read the “data.csv” file, and output the result in the “output.csv” file, shall we? This time, I’ll just comment inside the code.

#! /bin/bash
INPUTFILE="inputfile.csv"
DATAFILE="data.csv"
OUTFILE="output.csv"

awk 'BEGIN {
while (getline < "'"$INPUTFILE"'")
{
split($0,ft,",");
id=ft[1];
name=ft[2];

key=id;
data=name;
nameArr[key]=data;
}
close("'"$INPUTFILE"'");

while (getline < "'"$DATAFILE"'")
{
split($0,ft,",");
id=ft[1]; # Id is the first column
phone=ft[2]; # Phonenumber is the second
name=nameArr[id]; # Get the name from the nameArr, using "id" as key
print id","name","phone > "'"$OUTFILE"'"; # Print directly to the outputfile
}
}'

And bam, you’ve got a file called “output.csv”, containing the output. It doesn’t need to be harder than that.

output.csv
3,Sven,555-912421
2,Nils,555-812471
4,Inga,555-821411
1,Svante,555-818811
5,Lena,555-999999

Awk is a way more powerful language than what I’ve shown here. I’m using it for a lot of fun stuff during migration of subscribers and processing of data, including arithmetic (you can force a variable to be numeric by doing variablename = variablename+0; on it).

I’ll see if I can show any more of the fun stuff, later. πŸ™‚

Advertisements

About magvar

Technical guy, worked with computers since I was twelve, complete nerd, but managed to get married and am now the happy father of three kids.
This entry was posted in Uncategorized. Bookmark the permalink.

10 Responses to Awking it – how to load a file into an array in awk

  1. George says:

    Thanks, that’s exactly what I was looking for! Apparently it was more uncommon than I thought, so thanks for the article πŸ™‚

  2. repic says:

    Maybe it would be easier to use unix join command, since it is made specifically for joining two files based on a common key. If you want to just paste two files next to each other use unix paste command. Good post though, and it does not hurt to know your way around awk. It is a great general purpose text tool.

    • magvar says:

      There are (quite a few) different solutions to this problem, yes. πŸ™‚ The script I extracted this bit from contained a little more functionality than just the joining of two files, so it’s a little simplified example, and I’ve learnt a lot more awk since I wrote this. What you say about join and paste is definitely true, though.

      Thanks for pointing to these alternatives!

  3. yetish says:

    #!/usr/bin/ksh

    awk -F”,” ‘{
    if(FILENAME ~ “inputfile.csv”)
    {
    arr[$1]=$2 #Id as the index of the array and Name as the content in array.
    }
    if(FILENAME ~ “data.csv”)
    {
    if($1 in arr) #check for the ID in data file
    {
    print $1″,”arr[$1]”,”$2 > output.csv #print ID,name,phone in output file
    }
    }
    }’ inputfile.csv data.csv

    #Earlier in the beginning of the career, didn’t even know it could be so.. simple; also none of the internet sites still show it is possible this way.
    #It has processed as big as 32Million data effortlessly… in dryruns and cutover days of my migration. Cheers!!

    • magvar says:

      Thanks! I should really update this page quite a bit, since I’ve learned a little since I first created it. But your trick here was new to me as well. Really cool! Thanks!

      • Someone says:

        No need to update. The trick in your original post is quite genius and came as a lifesaver to me. I used the same approach for a slightly different problem and it worked like a breeze. Thanks!

      • magvar says:

        Thanks, glad it came in handy.

    • Danne says:

      Beautiful and fast! Thanks a million!

  4. Mark J. Reed says:

    You can use awk’s -v flag to turn external variables into awk variables and avoid messy quoting:
    awk -v filename=”$FILENAME” ‘ BEGIN { while (getline < filename) {…}}'

    • magvar says:

      Yes, learned that a while after writing the post. I’m still using awk, and learnt more along the way. Thanks for the pointer, maybe I should refresh/re-write this whole entry some day.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s