Quick text manipulation, a practical `sed` example

Suppose you have received a 10k line file of text in a format that is difficult for you to work with, like XML. You want to get some specific information from that file, and realize that getting that information by hand will take a long while.

In some cases an editor like IntelliJ IDEA can be really useful (see footnote), but a full-blown IDE may not be what you are looking for. You may want to use what you have, or you may need it as part of a script. Here I want to show you an example of how I have found sed, the stream editor, quite useful. It is available on Linux and macOS without having to install anything.

Our input file has tags (names) and values for many types of data. We only want the list of values of the identifier tags of a specific type of parent tag. Let's say we are working with some enterprise resource planning applications and are looking for a list of widget identifiers that are in the input file. For example, we want to use these to correlate data between two systems so that we can put together a more complete dataset for a once-yearly report. Luckily for us, the identifier we need is on the line after the opening tag of the widget datatype. Once we have the list of identifiers, we want to use those to get more information from an SQL database.

To clarify, the input data we are interested in looks like this:


...
    <widget>
        <id>6Q</id>
...

Let us look at the different steps and how sed fits in. First we make the output file that we want to store the output in.

touch output.txt;

Then we want to search the input file for the <widget> tags. When we find one, we go to the next line and forget about the line that we found the tag on.

touch output.txt; sed -n '/<widget>/ {n;p;}' < input.xml

Here we have asked sed to go to the next line and print that if we have found <widget>. By specifying -n we ask sed not to print anything other than what we specifically asked it to.
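
To see the effect of n;p in isolation, here is a small sketch that runs the same command against a made-up sample. The sample data, the element names other than widget and id, and the function name extract_id_lines are assumptions for illustration, not part of the real input file.

```shell
# Hypothetical sample input, made up for this illustration.
sample='<widgets>
    <widget>
        <id>6Q</id>
        <name>Sprocket</name>
    </widget>
    <widget>
        <id>7P</id>
        <name>Gear</name>
    </widget>
</widgets>'

# -n suppresses the default printing of every line.
# On a line matching <widget>: n moves on to the next line, p prints it.
extract_id_lines() {
    sed -n '/<widget>/ {n;p;}'
}

printf '%s\n' "$sample" | extract_id_lines
```

Only the two id lines should come out, still indented and still wrapped in their tags. Note that this relies on the id always sitting directly below the opening tag, as it does in our file.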

Now we will have a list of about 2k identifiers, but they still have their tags, like so: <id>6Q</id>. We don't want those tags. Neither do we want the whitespace around the tags.

So let us pipe the stream that sed outputs into the next steps using that vertical thing called the pipe, |. For example, we will tell sed to replace <id> with nothing. We do that by using the pattern s/existing text we don't want/new text we do want/

touch output.txt; sed -n '/<widget>/ {n;p;}' < input.xml | sed 's/<id>//g' | sed 's/<\/id>/,/g'

Now we have removed <id> and replaced </id> with a comma.
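
As a minimal sketch of the substitution on its own, the snippet below applies both replacements to a single made-up line. It also shows that sed lets you pick a different delimiter for s, which avoids escaping the slash in </id>. The variable names are only for illustration.

```shell
# A single made-up line, as it would arrive from the first sed call.
line='        <id>6Q</id>'

# s/old/new/g replaces every match of old on the line with new.
# The / inside </id> is escaped because / is also the delimiter here.
with_slashes=$(printf '%s\n' "$line" | sed 's/<id>//g' | sed 's/<\/id>/,/g')

# Same substitutions with | as the delimiter, so no escaping is needed.
with_bars=$(printf '%s\n' "$line" | sed 's|<id>||g' | sed 's|</id>|,|g')

# Brackets make the leftover leading spaces visible.
printf '[%s]\n[%s]\n' "$with_slashes" "$with_bars"
```

Both variants should print the same thing: the id followed by a comma, with the leading spaces still in place.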

Because we don't use any additional flags in the new sed commands, we can also combine them in one call with the -e flag. I believe that will be faster because there is no more data transfer from one process to another. The version of sed on my Linux computer did not seem to support the -e flag, although -e is part of the POSIX description of sed, so GNU sed on Linux and BSD sed on macOS should both accept it. If your sed does not, keep piping the commands to each other.

touch output.txt; sed -n '/<widget>/ {n;p;}' <input.xml | sed -e 's/<id>//g' -e 's/<\/id>/,/g'
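
A quick way to convince yourself that the piped and the combined form do the same thing is to run both on one made-up line and compare the results. This is only a sketch; the variable names are hypothetical.

```shell
# A single made-up line, as it would arrive from the first sed call.
line='        <id>6Q</id>'

# Two separate sed processes connected by a pipe.
piped=$(printf '%s\n' "$line" | sed 's/<id>//g' | sed 's/<\/id>/,/g')

# One sed process; the -e expressions run in order on every line.
combined=$(printf '%s\n' "$line" | sed -e 's/<id>//g' -e 's/<\/id>/,/g')

if [ "$piped" = "$combined" ]; then
    echo "same result"
else
    echo "different result"
fi
```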

Next we remove the whitespace that we don't want.

touch output.txt; sed -n '/<widget>/ {n;p;}' < input.xml | sed -e 's/<id>//g' -e 's/<\/id>/,/g' -e 's/ //g' | sed ':a;N;$!ba;s/\n/ /g' > output.txt

We have replaced the spaces with nothing. What comes next, :a;N;$!ba;s/\n/ /g, is difficult to read. Why not just use sed 's/\r\n//g' to replace \n or \r\n with nothing? Because sed normally works on the contents of one line at a time, with the trailing newline already stripped, so it never gets to see the newlines. We have to do more work or switch to using tr, which is a simpler way to swap characters. A good explanation of what the pattern does can be found here. In short, :a defines a label named a, and N appends the next line of input to the text sed is currently holding. $!ba means: if we are not at the last line ($!), branch (b) back to label a and keep going. This repeats until the whole file is inside our computer memory, after which s/\n/ /g replaces every newline with a space. This way sed can handle all input in one go, including newlines.
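
The two ways of joining lines can be compared on a few made-up ids. In this sketch the function names join_with_sed and join_with_tr are hypothetical. The sed loop is written with one -e per command because, as far as I know, some implementations such as BSD sed on macOS read a label up to the end of the line and would not accept :a;N;$!ba in a single expression.

```shell
# Join all input lines into one, using the label loop from the text.
join_with_sed() {
    sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g'
}

# Join all input lines into one by translating every newline to a space.
join_with_tr() {
    tr '\n' ' '
}

# Made-up ids, one per line, already stripped of tags and spaces.
ids='1Z,
2G,
3A,'

# Brackets make any trailing space visible.
printf '[%s]\n' "$(printf '%s\n' "$ids" | join_with_sed)"
printf '[%s]\n' "$(printf '%s\n' "$ids" | join_with_tr)"
```

The difference to watch for is at the end of the line: tr also turns the final newline into a space, while the sed loop leaves the final newline alone. That matters for the next step, which removes the last character.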

The result has been written to the file output.txt, which now contains all widget id values separated by commas. The last widget id also ends in a comma, which needs to be removed as well. Here sed comes to the rescue again: $ matches the end of the line, so .$ matches the last character. sed 's/.$//' < output.txt will remove the last comma. Now we can use it to make a select statement on a database table: SELECT name, quantity_sold, quantity_unit FROM widgets WHERE objectid IN (...), where we fill the brackets with the list of ids we created.

echo "SELECT name, quantity_sold, quantity_unit FROM widgets WHERE objectid IN ("; sed 's/.$//' < output.txt; echo ");"

will then result in the following format

SELECT name, quantity_sold, quantity_unit FROM widgets WHERE objectid IN ( 1Z, 2G, 3A, 4T, 5H, 6Q, 7P );

There we have it, the full query.
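
Putting the steps together, the whole pipeline can be sketched end to end on a made-up sample without going through output.txt. The sample data and the variable names are assumptions for illustration; the table and column names are the ones from the query above. The line-joining loop is split over several -e expressions, which should also work on sed versions that do not accept a label followed by a semicolon.

```shell
# Hypothetical sample input, made up for this illustration.
sample='<widgets>
    <widget>
        <id>1Z</id>
    </widget>
    <widget>
        <id>2G</id>
    </widget>
</widgets>'

# 1. print the line after each <widget> tag
# 2. strip <id>, turn </id> into a comma, remove spaces
# 3. join all lines with a space
# 4. drop the last character, which is the trailing comma
id_list=$(printf '%s\n' "$sample" \
    | sed -n '/<widget>/ {n;p;}' \
    | sed -e 's/<id>//g' -e 's/<\/id>/,/g' -e 's/ //g' \
    | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g' \
    | sed 's/.$//')

query="SELECT name, quantity_sold, quantity_unit FROM widgets WHERE objectid IN ( $id_list );"
printf '%s\n' "$query"
```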

The Grymoire probably has the best sed page I have seen, go check it out!

IntelliJ IDEA footnote: It offers Column Selection Mode and multiple carets. Note that the linked blog post is quite old, I suspect more is possible these days.