Awking it – how to load a file into an array in awk

Since I’ve been using a lot of awk lately, I figured I should write an entry about one of its most useful features. To me, at least.

That'd be the ability to load data from a file, preferably a CSV (comma-separated values) file with the data stored in clear text, and keep it in a key-accessed array in memory. I've used this technique to load a whole lot of subscription data (I'm in telecom) into memory and map out subscriber data for a migration I'm part of, for example. I'm not pretending to be very skilled at awk, or even to be any good at it at all, but… it's working great for my purposes.

Now, let’s make this easy. We’ve got an inputfile, conveniently named “inputfile.csv”, with the format as shown below. We want to load this into memory, key it on “id”, and have “name” as the data.

inputfile.csv
id,name
1,Svante
2,Nils
3,Sven
4,Inga
5,Lena

Then we’ve got a second file, the file we want to process. This file contains an “id”, and a (fake) “phonenumber”. Let’s call it “data.csv”.

data.csv
3,555-912421
2,555-812471
4,555-821411
1,555-818811
5,555-999999

As output, we want to get a file that has “id”, “name” and “phonenumber”. Let’s call this file “output.csv”.

output.csv
1,Svante,555-818811
...

This would be an easy task to do manually, but with 300,000 rows in each file, things get interesting very quickly.

Now, let's set up the simple awk program. I usually have a little Bash first, to set things up, format the inputfiles and so on (since I know more Bash than awk).

#!/bin/bash
INPUTFILE="inputfile.csv"

Now, what's going on here? Well, first, I set up the filename we want to use as a variable in Bash. That's mostly so I can show how to refer to "external" shell variables from inside awk later.


awk 'BEGIN {

Then, the awk program begins. "BEGIN" is a special pattern whose block runs once, before any input is processed; it's always followed by a "{" that opens the block, and the block ends with a matching, final "}". In this script, everything happens inside that BEGIN block.
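
(If you haven't seen awk's overall layout before, here's a minimal sketch, separate from the script we're building, of the three kinds of blocks an awk program can have. This post only ever uses the BEGIN block, since we read the files ourselves with getline.)

awk 'BEGIN  { print "runs once, before any input is read" }
     /Sven/ { print "runs for every input line matching the pattern" }
     END    { print "runs once, after the last input line" }' inputfile.csv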


    while (getline < "'"$INPUTFILE"'")
    {
        split($0, ft, ",");
        id = ft[1];
        name = ft[2];

And here, we do the actual reading of the inputfile. Note the quoting around the Bash variable: the single quote right before "$INPUTFILE" ends the awk program string, Bash expands the variable (inside its own double quotes, so the value stays intact), and the next single quote resumes the awk program. awk itself never sees a variable at all; by the time it runs, the filename has already been pasted into the program as an ordinary quoted string.
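
To make the quoting concrete, here's a small sketch (using the same $INPUTFILE as above) of what the program looks like before and after the shell has done its part:

INPUTFILE="inputfile.csv"
awk 'BEGIN { while (getline < "'"$INPUTFILE"'") print $0 }'

# What awk actually receives, once Bash has expanded the variable:
#   BEGIN { while (getline < "inputfile.csv") print $0 }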

The split() function takes the current row, "$0", and splits it into an array "ft", with "," as the field delimiter. To make later references easier (I've got 26-column files to work with, so it's easier this way), I take the fields of the "ft" array (numbering starts at 1, not 0) and give them names.
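
As a tiny, stand-alone example of what split() does (just a sketch, outside the script):

awk 'BEGIN {
    n = split("1,Svante", ft, ",");
    print n;        # 2 - the number of fields found
    print ft[1];    # 1
    print ft[2];    # Svante
}'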


        key = id;
        data = name;
        nameArr[key] = data;

Now, since we've only got two columns here, and the key is just "id", this is overkill. But I've had to combine multiple fields to build a key, and to store multiple columns in one "data" variable (just remember to separate the values with something you can split on later, and you'll be OK), so the extra step makes sense.
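
For example, a composite key and a packed data value could look something like this (a sketch with a made-up row, made-up column meanings, and a made-up array name, not taken from the files above):

awk 'BEGIN {
    line = "42,SE,Svante,555-818811,prepaid";    # a made-up input row
    split(line, ft, ",");
    key  = ft[1] ";" ft[2];                      # e.g. subscriber id plus market code
    data = ft[3] ";" ft[4] ";" ft[5];            # use a separator you can split() on again later
    subArr[key] = data;
    print key " -> " subArr[key];                # prints: 42;SE -> Svante;555-818811;prepaid
}'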

Back in our simple case, the "nameArr" array will now be constructed, with "key" as key and "data" as data. As simple as that. No need to declare anything, no initialization needed.


    }
    close("'"$INPUTFILE"'");
}'

And here's the end. If you want to be able to read the same file again later in the same program, always remember to close() the inputfile once you're done with it; getline keeps its position in the file and won't rewind on its own.
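
A quick sketch of why that matters (hypothetical counters, dropped into the same BEGIN block with the same $INPUTFILE as before):

    while (getline < "'"$INPUTFILE"'") count1++;   # reads every line
    while (getline < "'"$INPUTFILE"'") count2++;   # reads nothing - the file is still at end-of-file
    close("'"$INPUTFILE"'");
    while (getline < "'"$INPUTFILE"'") count3++;   # starts over from the first line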

OK, let's add some more. Let's read the "data.csv" file and write the result to "output.csv", shall we? This time, I'll just comment inside the code.

#!/bin/bash
INPUTFILE="inputfile.csv"
DATAFILE="data.csv"
OUTFILE="output.csv"

awk 'BEGIN {
    while (getline < "'"$INPUTFILE"'")
    {
        split($0, ft, ",");
        id = ft[1];
        name = ft[2];

        key = id;
        data = name;
        nameArr[key] = data;
    }
    close("'"$INPUTFILE"'");

    while (getline < "'"$DATAFILE"'")
    {
        split($0, ft, ",");
        id = ft[1];             # Id is the first column
        phone = ft[2];          # Phonenumber is the second
        name = nameArr[id];     # Get the name from the nameArr, using "id" as key
        print id "," name "," phone > "'"$OUTFILE"'";   # Print directly to the outputfile
    }
}'

And bam, you've got a file called "output.csv" with the combined data. It doesn't need to be any harder than that.

output.csv
3,Sven,555-912421
2,Nils,555-812471
4,Inga,555-821411
1,Svante,555-818811
5,Lena,555-999999

Awk is a way more powerful language than what I've shown here. I'm using it for a lot of fun stuff during the subscriber migration and data processing, including arithmetic (you can force a variable into numeric context by doing variablename = variablename + 0 on it).
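
A quick sketch of that trick: values held as strings compare lexically, so adding zero is an easy way to force a numeric comparison.

awk 'BEGIN {
    a = "10"; b = "9";
    if (a < b)         print "string comparison:  \"10\" < \"9\"";   # this prints
    if (a + 0 > b + 0) print "numeric comparison:  10  >  9";        # so does this
}'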

I’ll see if I can show any more of the fun stuff, later. πŸ™‚


12 Responses to Awking it – how to load a file into an array in awk

  1. George says:

    Thanks, that’s exactly what I was looking for! Apparently it was more uncommon than I thought, so thanks for the article πŸ™‚

  2. repic says:

    Maybe it would be easier to use the unix join command, since it is made specifically for joining two files on a common key. If you want to just paste two files next to each other, use the unix paste command. Good post though, and it does not hurt to know your way around awk. It is a great general-purpose text tool.

    • magvar says:

      There are (quite a few) different solutions to this problem, yes. πŸ™‚ The script I extracted this bit from contained a little more functionality than just the joining of two files, so it’s a little simplified example, and I’ve learnt a lot more awk since I wrote this. What you say about join and paste is definitely true, though.

      Thanks for pointing to these alternatives!

  3. yetish says:

    #!/usr/bin/ksh

    awk -F"," '{
        if (FILENAME ~ "inputfile.csv")
        {
            arr[$1] = $2                    # Id as the index of the array, Name as the content
        }
        if (FILENAME ~ "data.csv")
        {
            if ($1 in arr)                  # check that the ID from the data file exists in the array
            {
                print $1 "," arr[$1] "," $2 > "output.csv"   # print ID,name,phone to the output file
            }
        }
    }' inputfile.csv data.csv

    # Earlier in my career I didn't even know it could be so.. simple; none of the internet sites show that it's possible this way.
    # It has processed as much as 32 million records effortlessly… in dry runs and on cutover days of my migration. Cheers!!

    • magvar says:

      Thanks! I should really update this page quite a bit, since I’ve learned a little since I first created it. But your trick here was new to me as well. Really cool! Thanks!

      • Someone says:

        No need to update. The trick in your original post is quite genius and came as a lifesaver to me. I used the same approach for a slightly different problem and it worked like a breeze. Thanks!

      • magvar says:

        Thanks, glad it came in handy.

    • Danne says:

      Beautiful and fast! Thanks a million!

  4. Mark J. Reed says:

    You can use awk’s -v flag to turn external variables into awk variables and avoid messy quoting:
    awk -v filename="$FILENAME" 'BEGIN { while (getline < filename) { … } }'

    • magvar says:

      Yes, learned that a while after writing the post. I’m still using awk, and learnt more along the way. Thanks for the pointer, maybe I should refresh/re-write this whole entry some day.

  5. Nirup says:

    This is great stuff. Thanks for sharing. Very helpful. It's difficult to find these types of examples, which make life easy for so many people. I'm writing something similar (inspired by your example). If I manage to achieve it, I will come back and leave a comment again. Mark J. Reed has a good point as well.

  6. sanketh s says:

    Awesome.. I had been searching for a post like this for a couple of days, and this is just what the doctor ordered.
    Thanks mate..
