Unix Power Tools (Mon Sep 30, 2024)
Unix Power Tools
Previewing data using head and tail
$ pwd
/home/anand/github/anandology/isi-fcp
$ cd data
$ ls
un-data.tsv un.tsvWe are going to work with the un.tsv file.
Let’s see the first few lines of that file.
$ head un.tsv
country region lifeMale lifeFemale infantMortality GDPperCapita
Afghanistan Asia 45.0 46.0 154 2848
Albania Europe 68.0 74.0 32 863
Algeria Africa 67.5 70.3 44 1531
Angola Africa 44.9 48.1 124 355
Argentina America 69.6 76.8 22 8055
Armenia Europe 67.2 74.0 25 354
Australia Oceania 75.4 81.2 6 20046
Austria Europe 73.7 80.1 6 29006
Azerbaijan Asia 66.5 74.5 33 321And the last few lines.
$ tail un.tsv
United.States America 73.4 80.1 7 26037
Uruguay America 69.6 76.1 17 5602
Uzbekistan Asia 64.3 70.7 43 435
Vanuatu Oceania 65.5 69.5 38 1289
Venezuela America 70.0 75.7 21 3496
Viet.Nam Asia 64.9 69.6 37 270
Yemen Asia 57.4 58.4 80 732
Yugoslavia Europe 69.8 75.3 19 1487
Zambia Africa 42.2 43.7 103 382
Zimbabwe Africa 47.6 49.4 68 786How many lines/rows of data do we have? Let’s use the wordcount command wc to find it out.
$ wc un.tsv
189 1134 6407 un.tsv
$ wc -l un.tsv
189 un.tsvIt has 189 lines. If we ignore the first line, which is the header, the file un.tsv has data for 188 countries.
Let’s skip the header so that we can work just with the data.
$ tail -n +2 un.tsv > un-data.tsvThe above command takes the contents of un.tsv from the 2nd line onwards and writes the output to un-data.tsv.
The new file will not have the header and the number of lines will be less by one.
$ head -5 un-data.tsv
Afghanistan Asia 45.0 46.0 154 2848
Albania Europe 68.0 74.0 32 863
Algeria Africa 67.5 70.3 44 1531
Angola Africa 44.9 48.1 124 355
Argentina America 69.6 76.8 22 8055See, the header is gone.
$ wc -l *.tsv
188 un-data.tsv
189 un.tsv
377 totalThe new file has one line less than the original file.
Searching for patterns using grep
We can use the grep command for searching for patterns. The grep command supports regular expressions, but for now we’ll search only for simple strings.
Let’s see some countries in Asia.
$ grep Asia un-data.txt | head -5
Afghanistan Asia 45.0 46.0 154 2848
Azerbaijan Asia 66.5 74.5 33 321
Bahrain Asia 71.1 75.3 18 9073
Bangladesh Asia 58.1 58.2 78 280
Bhutan Asia 51.6 54.9 104 166How many countries are there in Asia?
$ grep Asia un-data.tsv | wc -l
46We could also do the same using the -c option of grep command, which counts the number of matches.
$ grep -c Asia un-data.tsv
46Selecting and Sorting data
We can select a column from tabular data using the cut command.
For example, the following commmand select the 2nd column, the region.
$ cut -f2 un-data.tsv | head -5
Asia
Europe
Africa
Africa
AmericaWhat are the unique regions?
$ cut -f2 un-data.tsv | sort | uniq
Africa
America
Asia
Europe
OceaniaCan we get the counts for each region?
$ cut -f2 un-data.tsv | sort | uniq -c
53 Africa
35 America
46 Asia
40 Europe
14 OceaniaInteresting Questions
Let’s ask some intersting questions now.
Which countries have lowest infant mortality?
Infant Mortality is the 5th column.
Let’s select only the country and infant mortality columns first.
$ cut -f1,5 un-data.tsv | head -5
Afghanistan 154
Albania 32
Algeria 44
Angola 124
Argentina 22Now pipe this to sort to sort based on the last column.
$ cut -f1,5 un-data.tsv | sort -k2 | head
Croatia 10
French.Polynesia 10
Netherlands.Antilles 10
Cambodia 102
Zambia 103
Bhutan 104
Djibouti 106
Equatorial.Guinea 107
Ethiopia 107
Malaysia 11There seems to be an issue. Sort is sorting the numbers alphabetically, not numerically. Let’s fix this by asking sort to sort numerically by passing option -n.
$ cut -f1,5 un-data.tsv | sort -k2 -n | head
Tonga 3
Japan 4
Finland 5
Hong.Kong 5
Iceland 5
Norway 5
Singapore 5
Sweden 5
Switzerland 5
Australia 6Can we find mean life expectency across male and female?
Technically, it is not possible to compute the combined mean from two means unless we know the population size for male and female. For simplicity, let’s assume the the population of male and female is the same. With that assumption, we can compute the combine mean by taking the average of the two values.
We are going to use awk for solving this.
The columns of interest are column 3 (lifeMale) and column 4 (lifeFemale).
$ awk '{print $1, ($3+$4)/2;}' un-data.tsv | head
Afghanistan 45.5
Albania 71
Algeria 68.9
Angola 46.5
Argentina 73.2
Armenia 70.6
Australia 78.3
Austria 76.9
Azerbaijan 70.5
Bahamas 73.8List the rich countries
Let’s consider the countries with GDP per Capita more than $20000 as rich countries.
$ awk '$6 > 20000 {print $1, $6;}' un-data.tsv
Australia 20046
Austria 29006
Belgium 26582
Denmark 33191
Finland 24453
France 26444
Germany 29632
Hong.Kong 22898
Iceland 26217
Japan 41718
Luxembourg 35109
Netherlands 25635
Norway 33734
Singapore 25581
Sweden 26253
Switzerland 42416
United.States 26037Which countries have highest difference between the life expectency of male and female?
Let’s look at the difference between those two values.
$ awk ' {print $1, $3, $4, $3-$4;}' un-data.tsv | head
Afghanistan 45.0 46.0 -1
Albania 68.0 74.0 -6
Algeria 67.5 70.3 -2.8
Angola 44.9 48.1 -3.2
Argentina 69.6 76.8 -7.2
Armenia 67.2 74.0 -6.8
Australia 75.4 81.2 -5.8
Austria 73.7 80.1 -6.4
Azerbaijan 66.5 74.5 -8
Bahamas 70.5 77.1 -6.6We need to find the countries with highest difference.
$ awk ' {print $1, $3, $4, $3-$4;}' un-data.tsv | sort -k4 -nr | tail
Reunion 70.9 79.8 -8.9
Poland 66.7 75.7 -9
Hungary 64.5 73.8 -9.3
Kazakhstan 62.8 72.5 -9.7
Ukraine 63.6 74.0 -10.4
Belarus 64.4 74.8 -10.4
Lithuania 64.9 76.0 -11.1
Estonia 63.9 75.0 -11.1
Latvia 62.5 74.3 -11.8
Russia 58.0 71.5 -13.5What is the average life expectency across countries?
We can’t really compute this becase we don’t have the population of each country. Just for this exercise, assume that the population of each country is the same.
$ awk '{ total += $3; } END { print total/NR }' un-data.tsv
63.5261What if we want to find the average male life expectency only in Asia?
$ awk '$2 == "Asia" { total += $3; } END { print total/NR }' un-data.tsv
15.9957Something seems to be wrong. We are dividing the total by the total number of countries instead of the number of countries in Asia. For that we need to maintain another counter.
$ awk '$2 == "Asia" { total += $3; countries++; } END { print total/countries; }' un-data.tsv
65.3739Average infant mortality by continent
Again, ignoring the population of each country for simiplicity.
$ awk '{ total[$2] += $5; countries[$2]++} END {for (name in total) { print name, total[name]/countries[name]; }}' un-data.tsv
America 26.6571
Africa 86.3208
Asia 43.7826
Europe 11.575
Oceania 24.6429