Unix Power Tools (Mon Sep 30, 2024)
Unix Power Tools
Previewing data using head
and tail
$ pwd
/home/anand/github/anandology/isi-fcp
$ cd data
$ ls un-data.tsv un.tsv
We are going to work with the un.tsv
file.
Let’s see the first few lines of that file.
$ head un.tsv
country region lifeMale lifeFemale infantMortality GDPperCapita
Afghanistan Asia 45.0 46.0 154 2848
Albania Europe 68.0 74.0 32 863
Algeria Africa 67.5 70.3 44 1531
Angola Africa 44.9 48.1 124 355
Argentina America 69.6 76.8 22 8055
Armenia Europe 67.2 74.0 25 354
Australia Oceania 75.4 81.2 6 20046
Austria Europe 73.7 80.1 6 29006 Azerbaijan Asia 66.5 74.5 33 321
And the last few lines.
$ tail un.tsv
United.States America 73.4 80.1 7 26037
Uruguay America 69.6 76.1 17 5602
Uzbekistan Asia 64.3 70.7 43 435
Vanuatu Oceania 65.5 69.5 38 1289
Venezuela America 70.0 75.7 21 3496
Viet.Nam Asia 64.9 69.6 37 270
Yemen Asia 57.4 58.4 80 732
Yugoslavia Europe 69.8 75.3 19 1487
Zambia Africa 42.2 43.7 103 382 Zimbabwe Africa 47.6 49.4 68 786
How many lines/rows of data do we have? Let’s use the wordcount command wc
to find it out.
$ wc un.tsv
189 1134 6407 un.tsv
$ wc -l un.tsv 189 un.tsv
It has 189 lines. If we ignore the first line, which is the header, the file un.tsv
has data for 188 countries.
Let’s skip the header so that we can work just with the data.
$ tail -n +2 un.tsv > un-data.tsv
The above command takes the contents of un.tsv
from the 2nd line onwards and writes the output to un-data.tsv
.
The new file will not have the header and the number of lines will be less by one.
$ head -5 un-data.tsv
Afghanistan Asia 45.0 46.0 154 2848
Albania Europe 68.0 74.0 32 863
Algeria Africa 67.5 70.3 44 1531
Angola Africa 44.9 48.1 124 355 Argentina America 69.6 76.8 22 8055
See, the header is gone.
$ wc -l *.tsv
188 un-data.tsv
189 un.tsv 377 total
The new file has one line less than the original file.
Searching for patterns using grep
We can use the grep
command for searching for patterns. The grep
command supports regular expressions, but for now we’ll search only for simple strings.
Let’s see some countries in Asia.
$ grep Asia un-data.txt | head -5
Afghanistan Asia 45.0 46.0 154 2848
Azerbaijan Asia 66.5 74.5 33 321
Bahrain Asia 71.1 75.3 18 9073
Bangladesh Asia 58.1 58.2 78 280 Bhutan Asia 51.6 54.9 104 166
How many countries are there in Asia?
$ grep Asia un-data.tsv | wc -l 46
We could also do the same using the -c
option of grep command, which counts the number of matches.
$ grep -c Asia un-data.tsv 46
Selecting and Sorting data
We can select a column from tabular data using the cut
command.
For example, the following commmand select the 2nd column, the region.
$ cut -f2 un-data.tsv | head -5
Asia
Europe
Africa
Africa America
What are the unique regions?
$ cut -f2 un-data.tsv | sort | uniq
Africa
America
Asia
Europe Oceania
Can we get the counts for each region?
$ cut -f2 un-data.tsv | sort | uniq -c
53 Africa
35 America
46 Asia
40 Europe 14 Oceania
Interesting Questions
Let’s ask some intersting questions now.
Which countries have lowest infant mortality?
Infant Mortality is the 5th column.
Let’s select only the country and infant mortality columns first.
$ cut -f1,5 un-data.tsv | head -5
Afghanistan 154
Albania 32
Algeria 44
Angola 124 Argentina 22
Now pipe this to sort
to sort based on the last column.
$ cut -f1,5 un-data.tsv | sort -k2 | head
Croatia 10
French.Polynesia 10
Netherlands.Antilles 10
Cambodia 102
Zambia 103
Bhutan 104
Djibouti 106
Equatorial.Guinea 107
Ethiopia 107 Malaysia 11
There seems to be an issue. Sort is sorting the numbers alphabetically, not numerically. Let’s fix this by asking sort
to sort numerically by passing option -n
.
$ cut -f1,5 un-data.tsv | sort -k2 -n | head
Tonga 3
Japan 4
Finland 5
Hong.Kong 5
Iceland 5
Norway 5
Singapore 5
Sweden 5
Switzerland 5 Australia 6
Can we find mean life expectency across male and female?
Technically, it is not possible to compute the combined mean from two means unless we know the population size for male and female. For simplicity, let’s assume the the population of male and female is the same. With that assumption, we can compute the combine mean by taking the average of the two values.
We are going to use awk
for solving this.
The columns of interest are column 3 (lifeMale
) and column 4 (lifeFemale
).
$ awk '{print $1, ($3+$4)/2;}' un-data.tsv | head
Afghanistan 45.5
Albania 71
Algeria 68.9
Angola 46.5
Argentina 73.2
Armenia 70.6
Australia 78.3
Austria 76.9
Azerbaijan 70.5 Bahamas 73.8
List the rich countries
Let’s consider the countries with GDP per Capita more than $20000
as rich countries.
$ awk '$6 > 20000 {print $1, $6;}' un-data.tsv
Australia 20046
Austria 29006
Belgium 26582
Denmark 33191
Finland 24453
France 26444
Germany 29632
Hong.Kong 22898
Iceland 26217
Japan 41718
Luxembourg 35109
Netherlands 25635
Norway 33734
Singapore 25581
Sweden 26253
Switzerland 42416 United.States 26037
Which countries have highest difference between the life expectency of male and female?
Let’s look at the difference between those two values.
$ awk ' {print $1, $3, $4, $3-$4;}' un-data.tsv | head
Afghanistan 45.0 46.0 -1
Albania 68.0 74.0 -6
Algeria 67.5 70.3 -2.8
Angola 44.9 48.1 -3.2
Argentina 69.6 76.8 -7.2
Armenia 67.2 74.0 -6.8
Australia 75.4 81.2 -5.8
Austria 73.7 80.1 -6.4
Azerbaijan 66.5 74.5 -8 Bahamas 70.5 77.1 -6.6
We need to find the countries with highest difference.
$ awk ' {print $1, $3, $4, $3-$4;}' un-data.tsv | sort -k4 -nr | tail
Reunion 70.9 79.8 -8.9
Poland 66.7 75.7 -9
Hungary 64.5 73.8 -9.3
Kazakhstan 62.8 72.5 -9.7
Ukraine 63.6 74.0 -10.4
Belarus 64.4 74.8 -10.4
Lithuania 64.9 76.0 -11.1
Estonia 63.9 75.0 -11.1
Latvia 62.5 74.3 -11.8 Russia 58.0 71.5 -13.5
What is the average life expectency across countries?
We can’t really compute this becase we don’t have the population of each country. Just for this exercise, assume that the population of each country is the same.
$ awk '{ total += $3; } END { print total/NR }' un-data.tsv 63.5261
What if we want to find the average male life expectency only in Asia?
$ awk '$2 == "Asia" { total += $3; } END { print total/NR }' un-data.tsv 15.9957
Something seems to be wrong. We are dividing the total by the total number of countries instead of the number of countries in Asia. For that we need to maintain another counter.
$ awk '$2 == "Asia" { total += $3; countries++; } END { print total/countries; }' un-data.tsv 65.3739
Average infant mortality by continent
Again, ignoring the population of each country for simiplicity.
$ awk '{ total[$2] += $5; countries[$2]++} END {for (name in total) { print name, total[name]/countries[name]; }}' un-data.tsv
America 26.6571
Africa 86.3208
Asia 43.7826
Europe 11.575 Oceania 24.6429