One of the most important tasks in dealing with data is string manipulation. We already saw how to use awk and grep to efficiently sift through text files using command line tools instead of developing ad-hoc code. To step it up a notch, we can also do some heavier preprocessing of the data, such as selecting only the subset of information that matches a particular pattern, to ensure data coming out of our pipeline is of good quality.
In this case, we use a Bash feature called parameter expansion. Assume we have the text data in a variable TEXT_LINE and a pattern to match, written in file-name matching (glob) format. This is a summary of the possible expansions:
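- ${TEXT_LINE#pattern}: delete the shortest match of pattern from the beginning
- ${TEXT_LINE##pattern}: delete the longest match of pattern from the beginning
- ${TEXT_LINE%pattern}: delete the shortest match of pattern from the end
- ${TEXT_LINE%%pattern}: delete the longest match of pattern from the end
- ${TEXT_LINE:offset:length}: get a substring based on position, using numbers
- ${TEXT_LINE/pattern/replacement}: replace the first match of a particular string or pattern (use // instead of / to replace every match)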
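
As a quick illustration of the six forms in the list above, here is a minimal sketch run against a single value. The sample string `archive.tar.gz` is a hypothetical choice, picked only because it makes the difference between shortest and longest match visible. The substring and replace forms are Bash features, so this assumes the script is run with Bash rather than a plain POSIX `sh`.

```bash
# Hypothetical sample value; any string works.
TEXT_LINE="archive.tar.gz"

echo "${TEXT_LINE#*.}"       # shortest match of '*.' deleted from the beginning: tar.gz
echo "${TEXT_LINE##*.}"      # longest match of '*.' deleted from the beginning: gz
echo "${TEXT_LINE%.*}"       # shortest match of '.*' deleted from the end: archive.tar
echo "${TEXT_LINE%%.*}"      # longest match of '.*' deleted from the end: archive
echo "${TEXT_LINE:0:7}"      # 7 characters starting at position 0: archive
echo "${TEXT_LINE/tar/zip}"  # first 'tar' replaced with 'zip': archive.zip.gz
echo "${TEXT_LINE//./_}"     # every '.' replaced with '_': archive_tar_gz
```

Note that the pattern is a glob, not a regular expression, so `.` matches a literal dot and `*` matches any run of characters.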
So for example, to extract only the file name without the extension:
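
A minimal sketch, assuming TEXT_LINE holds a hypothetical file name `report.csv` (the name `FILE_STEM` is also just an illustrative choice):

```bash
# Hypothetical file name.
TEXT_LINE="report.csv"

# Delete the shortest match of '.*' from the end, i.e. the last extension.
FILE_STEM="${TEXT_LINE%.*}"
echo "$FILE_STEM"   # report
```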
or to extract the user name from an email address:
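
Again as a sketch, with a made-up address and an illustrative variable name `USER_NAME`:

```bash
# Hypothetical email address.
TEXT_LINE="jane.doe@example.com"

# Delete the longest match of '@*' from the end, i.e. everything from the '@' on.
USER_NAME="${TEXT_LINE%%@*}"
echo "$USER_NAME"   # jane.doe
```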
or to extract the file name from an absolute path:
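
One way this might look, assuming a hypothetical path and an illustrative variable name `FILE_NAME`:

```bash
# Hypothetical absolute path.
TEXT_LINE="/home/user/data/report.csv"

# Delete the longest match of '*/' from the beginning, i.e. every directory component.
FILE_NAME="${TEXT_LINE##*/}"
echo "$FILE_NAME"   # report.csv
```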
NOTE: You can’t combine two operations in a single expansion; instead, you have to assign the first result to an intermediate variable.
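
To make the note concrete, here is a sketch that strips both the directory and the extension in two steps. The path and the variable names `FILE_NAME` and `FILE_STEM` are hypothetical.

```bash
# Hypothetical absolute path.
TEXT_LINE="/home/user/data/report.csv"

# Step 1: drop the directory part and keep the result in an intermediate variable.
FILE_NAME="${TEXT_LINE##*/}"

# Step 2: drop the extension from the intermediate result.
# Nesting the two, as in ${${TEXT_LINE##*/}%.*}, is rejected by Bash as a bad substitution.
FILE_STEM="${FILE_NAME%.*}"
echo "$FILE_STEM"   # report
```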
As a data scientist, you spend a lot of time parsing and processing data to transform it into a nicer format that can be easily fed to whichever algorithm is being used. This process is often known as ETL (extract, transform, load).
There are a number of tools for the job, ranging from command-line utilities and language libraries to whole products that go all the way to petabyte scale (e.g., Spark).
Here I will give one tip for a Unix/Linux command-line utility that is often forgotten but has a lot of functionality. It is fast and memory efficient (it can process tens of thousands of rows per second and scale to multiple cores through Unix pipelining), and it is available pretty much anywhere (while your preferred tool or library might not be). The awesome AWK!
Now, AWK can do pretty much anything, but here are just two bits that are exceptionally useful as a starting point for our data processing.
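- Selecting by column (e.g., selecting a field of a CSV file) is achieved with the special variable $i, where i is the position of the column we want to select; $2, for example, refers to the second column. AWK splits fields on whitespace by default, so for a CSV file the field separator has to be set to a comma with the -F option.
- Selecting by row (e.g., iterating through a CSV file) is achieved with the built-in variable FNR, which holds the number of the current row within the file being read; the condition FNR == 2, for example, matches the second row.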
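
The two selections above can be sketched as follows. The sample data, the temporary file, and the variable names are all hypothetical, chosen only to make the example self-contained.

```bash
# Hypothetical sample data written to a temporary file.
CSV_FILE="$(mktemp)"
printf '%s\n' 'name,city,age' 'alice,Lisbon,31' 'bob,Oslo,27' > "$CSV_FILE"

# Select by column: -F',' sets the field separator to a comma, $2 is the second field.
SECOND_COLUMN="$(awk -F',' '{ print $2 }' "$CSV_FILE")"
echo "$SECOND_COLUMN"

# Select by row: FNR is the record number within the current file.
SECOND_ROW="$(awk 'FNR == 2 { print }' "$CSV_FILE")"
echo "$SECOND_ROW"   # alice,Lisbon,31

# Clean up the temporary file.
rm -f "$CSV_FILE"
```

Splitting on a plain comma is a simplification: it does not handle quoted fields that contain commas themselves, so it is best suited to simple, well-behaved CSV files.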
For a primer you can look HERE. Happy AWKing! 😎