Blog Archives

#InfluxDB : readable timestamps in the CLI

Time-series database InfluxDB provides a nice CLI, similar to what is available for many other databases.
A key part of interacting with a time series database is working with time, so it is a little surprising that InfluxDB displays time as a nanosecond timestamp, like the following:

892482496000000000

To get human-readable timestamps, invoke the CLI as follows:

$ influx -precision rfc3339

or type the following command at the CLI prompt:

> precision rfc3339

The timestamps will then look like this:

1998-04-13T15:48:16Z

better right? 🙂
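If you are stuck with raw nanosecond timestamps (e.g., in exported data), you can also do the conversion by hand. A minimal sketch using GNU date (integer division simply truncates the nanoseconds down to seconds):

```shell
# Nanosecond timestamp as returned by InfluxDB:
ns=892482496000000000

# Truncate to seconds, then format as RFC 3339 (UTC):
secs=$(( ns / 1000000000 ))
date -u -d "@${secs}" +%Y-%m-%dT%H:%M:%SZ   # -> 1998-04-13T15:48:16Z
```

Note that the `-d @…` syntax is GNU-specific; on BSD/macOS you would use `date -u -r "${secs}"` instead.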

#InfluxDB : drop all measurements

InfluxDB is a popular time series database. Its popularity comes from the fact that it is relatively easy to set up, offers relatively high performance, and provides InfluxQL, a simple SQL-like query language (which is being superseded by Flux for a host of reasons).

That said, database management functions are really important: while playing around with your algorithms (e.g., while doing time series forecasting) you might end up generating quite a few measurements (Influx jargon for table/collection) that you may want to delete all at once. Unfortunately, there is no single command for that.
At least we can drop the whole database (let's say we have a database named forecasting):

DROP DATABASE forecasting

but if that solution does not work for you (e.g., because you do not have the required privileges, or because you have set up specific retention policies), we are left with a couple of alternatives.

Solution 1 does not work on all versions and is slow, but it can be invoked from within the Influx shell:

DROP SERIES FROM /.*/

Solution 2 is a simple bash script:

# NR > 1 skips the CSV header line; double quotes around the
# drop statement let the shell expand ${mes}
for mes in $(influx -username root -password root -database forecasting -execute 'show measurements' -format csv | awk -F "\"*,\"*" 'NR > 1 {print $2}');
do
influx -username root -password root -database forecasting -execute "drop measurement \"${mes}\""
done

#Linux : extract substring in #Bash

One of the most important tasks in dealing with data is string manipulation. We already saw how to use awk and grep to efficiently sift through text files using command line tools instead of developing ad-hoc code. To step it up a notch, we can also do some heavier preprocessing of the data, such as selecting only the subset of information that matches a particular pattern, to ensure data coming out of our pipeline is of good quality.

In this case, we use a Bash feature called parameter expansion. Let's assume we have the text data in a variable TEXT_LINE and a pattern to match (in file-name globbing format); this is a summary of the possible expansions:

  • Delete shortest match of pattern from the beginning
    ${TEXT_LINE#pattern}
  • Delete longest match of pattern from the beginning
    ${TEXT_LINE##pattern}
  • Delete shortest match of pattern from the end
    ${TEXT_LINE%pattern}
  • Delete longest match of pattern from the end
    ${TEXT_LINE%%pattern}
  • Get a substring by position (offset and length)
    ${TEXT_LINE:OFFSET:LENGTH}
  • Replace particular strings or patterns
    ${TEXT_LINE/pattern/replace}

So for example, to extract only the file name without the extension:

${TEXT_LINE%.*}

or to extract the user name from an email address:

${TEXT_LINE%%@*.*}

or to extract the file name from an absolute path:

${TEXT_LINE##*/}

NOTE: You can’t combine two operations in a single expansion; instead, assign the intermediate result to a variable first.
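To make the expansions above concrete, here is a small sketch on a made-up path (the variable contents are hypothetical), including the two-step trick from the note:

```shell
TEXT_LINE="/home/user/report.final.txt"

echo "${TEXT_LINE%.*}"          # shortest '.*' stripped from end:   /home/user/report.final
echo "${TEXT_LINE%%.*}"         # longest '.*' stripped from end:    /home/user/report
echo "${TEXT_LINE##*/}"         # longest '*/' stripped from start:  report.final.txt
echo "${TEXT_LINE:0:5}"         # substring by position (bash):      /home
echo "${TEXT_LINE/user/alice}"  # first 'user' replaced:             /home/alice/report.final.txt

# Two operations can't be combined in one expansion, so go through a variable:
fname="${TEXT_LINE##*/}"        # report.final.txt
echo "${fname%.*}"              # report.final
```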

#Linux : data processing with awk

As a data scientist, you spend a lot of time parsing and processing data to transform it into a nicer format that can be easily fed to whichever algorithm you are using. This process is often known as ETL (extract, transform, load).

There are a number of tools for the job, ranging from command line utilities to language libraries, and even whole products that scale all the way to petabytes (e.g., Spark).

Here I will give a tip about a Unix/Linux command line utility that is often forgotten but packs a lot of functionality: it is fast and memory efficient (it can process tens of thousands of rows per second and scale to multiple cores through Unix pipelining), and it is available pretty much anywhere (while your preferred tool or library might not be). The awesome AWK!

Now, AWK can do pretty much anything, but here are just 2 bits that are exceptionally useful as a starting point of our data processing.

  1. Selecting by column (e.g., a field of a CSV file) is done with the special variable $i, where i is the position of the column we want to select; e.g., to print the second column:
    $ awk -F "\"*,\"*" '{print $2}' filename.csv
  2. Selecting by row (e.g., iterating through a CSV file) is done with the built-in variable FNR; e.g., to print the second row:
    $ awk 'FNR == 2 {print}' filename.csv
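Putting the two together, here is a quick sketch on a throwaway CSV file (the file name and contents are made up for illustration):

```shell
# Create a tiny sample CSV:
printf 'id,name,score\n1,alice,90\n2,bob,85\n' > people.csv

awk -F ',' '{print $2}' people.csv           # column 2 of every row: name, alice, bob
awk 'FNR == 2 {print}' people.csv            # row 2 only: 1,alice,90
awk -F ',' 'FNR == 2 {print $2}' people.csv  # both combined: alice
```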

For a primer you can look HERE. Happy AWKing! 😎
