KISS

Keep It Simple Stupid

Extracting regular data from a file in the terminal


Once in a while I need to write a shell script that extracts some information from a file and does something with it. This is easy to do with the standard UNIX tools, such as cat, grep, head, and sed, once you’ve learned their features. Today’s example: given an index.html file, we need to extract the CSS files it imports, except the commented-out ones and, say, mobile.css.

Here’s our sample index.html:

<!DOCTYPE html>
<html>
    <head>
        <title>Hello there</title>
        <link rel="stylesheet" type="text/css" href="css/main.css" />
        <link rel="stylesheet" type="text/css" href="css/mobile.css" />
        <link rel="stylesheet" type="text/css" href="css/nav.css" />
        <!--<link rel="stylesheet" type="text/css" href="css/old.css" />-->
    </head>
</html>

If you only know grep, here’s a way:

grep '\.css' index.html | grep -Ev '(mobile\.css|<!--.*-->)' | grep -o 'href=".*"' | cut -d'"' -f2
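
Spread over several lines, each stage of this pipeline is easier to read. The following is only a minimal sketch: it recreates the sample file in a temporary path so that it runs on its own, and the variable names html and css_files are chosen here for illustration.

```shell
# Recreate the sample file in a temporary location (illustrative setup).
html=$(mktemp)
cat > "$html" <<'EOF'
<!DOCTYPE html>
<html>
    <head>
        <title>Hello there</title>
        <link rel="stylesheet" type="text/css" href="css/main.css" />
        <link rel="stylesheet" type="text/css" href="css/mobile.css" />
        <link rel="stylesheet" type="text/css" href="css/nav.css" />
        <!--<link rel="stylesheet" type="text/css" href="css/old.css" />-->
    </head>
</html>
EOF

# Stage 1: keep only the lines that mention .css
# Stage 2: drop mobile.css and lines commented out on a single line
#          (grep -Ev is equivalent to egrep -v)
# Stage 3: print only the href="..." part of each remaining line
# Stage 4: take the text between the first pair of double quotes
css_files=$(
    grep '\.css' "$html" |
    grep -Ev '(mobile\.css|<!--.*-->)' |
    grep -o 'href=".*"' |
    cut -d'"' -f2
)

printf '%s\n' "$css_files"
rm -f "$html"
```

With the sample file this prints css/main.css and css/nav.css, one per line.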

It works, but it’s kind of long: keep only the lines containing .css, drop the excluded patterns, keep only the href="…" part, and finally cut out the value between the quotes. It turns out this can be rewritten as a single command using sed:

sed -nE '/\.css/ { /(mobile\.css|<!--.*-->)/ { d; }; s/.*href="([^"]+)".*/\1/; p; }' index.html

Here the -n option suppresses sed’s default printing of every line, and -E enables extended regular expressions. The command reads: for each line containing .css, delete it if it matches an excluded pattern; otherwise replace the whole line with the value inside href="…" and print it. Nice, easy enough, and it should be faster, since it starts one process instead of four. Enjoy!
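
For use inside a larger script, the same sed program can be written across several lines with comments and wrapped in a function. This is a sketch under a few assumptions: list_css is a hypothetical helper name, the sample file is a shortened copy written to a temporary path, and the loop body is just a placeholder for whatever you want to do with each file. As in the one-liner, the <!--.*--> pattern only catches comments that open and close on the same line.

```shell
# list_css FILE: print the href of every active stylesheet except mobile.css.
# (Hypothetical helper name; the sed program is the one-liner, reformatted.)
list_css() {
    sed -nE '
        # Only look at lines that mention .css
        /\.css/ {
            # Skip mobile.css and single-line HTML comments
            /(mobile\.css|<!--.*-->)/d
            # Replace the whole line with the value of href
            s/.*href="([^"]+)".*/\1/
            # Default output is off because of -n, so print explicitly
            p
        }
    ' "$1"
}

# Shortened sample input, written to a temporary file for illustration.
html=$(mktemp)
cat > "$html" <<'EOF'
<link rel="stylesheet" type="text/css" href="css/main.css" />
<link rel="stylesheet" type="text/css" href="css/mobile.css" />
<link rel="stylesheet" type="text/css" href="css/nav.css" />
<!--<link rel="stylesheet" type="text/css" href="css/old.css" />-->
EOF

# Do something with each extracted path; here we only label it.
list_css "$html" | while IFS= read -r css; do
    printf 'stylesheet: %s\n' "$css"
done

rm -f "$html"
```

Keeping the extraction in a function means the filtering rules live in one place, and the loop that consumes the paths can change independently.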
