#Linux : data processing with awk

As a data scientist, you spend a lot of time parsing and processing data to transform it into a format that can be easily fed to whichever algorithm you are using. This process is often known as ETL (extract, transform, load).

There are a number of tools for the job, ranging from command-line utilities and language libraries to whole products that go all the way to petabyte scale (e.g., Spark).

Here I will give one tip about a Unix/Linux command-line utility that is often forgotten despite having a lot of functionality. It is fast and memory efficient (it can process tens of thousands of rows per second and scale to multiple cores through Unix pipelines), and it is available pretty much anywhere (while your preferred tool or library might not be). The awesome AWK!

Now, AWK can do pretty much anything, but here are just two bits that are exceptionally useful as a starting point for data processing.

Select by column (e.g., select a field of a CSV file) is achieved with the special variable $i, where i is the position of the column we want to select. For example, to print the second column (the -F separator matches a comma together with any double quotes around it):
#awk -F "\"*,\"*" '{print $2}' filename.csv
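To make the column selection concrete, here is a minimal sketch run against a small made-up CSV file. The file contents and the variable names (csv_file, names, first_field) are assumptions for illustration, not part of the original tip. It also shows one limitation of this separator: the quotes at the very start and end of a line are not next to a comma, so they stay attached to the first and last fields.

```shell
# Hypothetical sample data, written to a temporary file for illustration only.
csv_file="$(mktemp)"
cat > "$csv_file" <<'EOF'
"id","name","city"
"1","Ada","London"
"2","Linus","Helsinki"
EOF

# Split each line on a comma plus any surrounding double quotes,
# then print the second field of every row (header included).
names="$(awk -F '"*,"*' '{print $2}' "$csv_file")"
printf '%s\n' "$names"          # name, Ada, Linus (one per line)

# The leading quote of the first field is not removed by the separator.
first_field="$(printf '%s\n' '"1","Ada","London"' | awk -F '"*,"*' '{print $1}')"
printf '%s\n' "$first_field"    # "1

rm -f "$csv_file"
```

For CSV files with commas inside quoted fields, this simple separator is not enough and a proper CSV parser is a safer choice.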
Select by row (e.g., pick a specific line of a CSV file) is achieved with the built-in variable FNR, which holds the number of the current record within the current file. For example, to print the second row:
#awk 'FNR == 2 {print}' filename.csv
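The two tips combine naturally: a condition on FNR picks the rows, and $i picks the column. The following is a small sketch on made-up data (the file contents and the variable names csv_file, second_row and names are assumptions for illustration) that prints the second row, then skips the header and prints only the second column.

```shell
# Hypothetical sample data, written to a temporary file for illustration only.
csv_file="$(mktemp)"
cat > "$csv_file" <<'EOF'
id,name,city
1,Ada,London
2,Linus,Helsinki
EOF

# A pattern in front of the action restricts it to the rows that match.
second_row="$(awk 'FNR == 2 {print}' "$csv_file")"
printf '%s\n' "$second_row"     # 1,Ada,London

# Skip the header (row 1) and print the second column of every other row.
names="$(awk -F ',' 'FNR > 1 {print $2}' "$csv_file")"
printf '%s\n' "$names"          # Ada, Linus (one per line)

rm -f "$csv_file"
```

Note that FNR restarts at 1 for every input file, while NR keeps counting across all files given on the command line, so FNR == 2 matches the second row of each file.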

For a primer you can look HERE. Happy AWKing! 😎

About whitehatty

Computer Engineer and Scientist interested in Computer Security, Complex Networks, Math, Biology and Medicine. "Think Different" life style. Quake 3 Arena player. NERD by DNA.


Posted on January 22, 2019, in Linux, Tips & Tricks and tagged AWK, Data Processing, Data Science.
