This sub-series of blog posts will cover shortcuts and tips that I’ve found useful across various projects over the years.  Hopefully they might help save someone else some time too.

HOW TO PRODUCE A DUPLICATE-FREE LIST FROM A COLUMN IN A TAB-DELIMITED TEXT FILE


From a general IT as well as a bioinformatics standpoint, it is sometimes handy to be able to rapidly create duplicate-free lists of items from columns in delimited text files.  When you aren’t working with large, complex datasets that tend to require a database environment, the command line is often more than enough to efficiently get the job done.

Let's say you have a tab-delimited file consisting of five columns, such as this one.  In this example, the file contains a list of 30 gene symbols with some associated numerical data, but it could just as easily have been probes from an expression array or even URLs or IP addresses.  This example file also contains a row of headings and an empty last line.
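To make the later steps concrete, here are a few illustrative rows in the same shape (the actual file isn't reproduced here, so the gene symbols, the headings other than Gene_Symbol and the values are all made up; columns are separated by tabs):

Gene_Symbol	Score	Fold_Change	P_Value	Count
TP53	5.2	1.8	0.003	12
BRCA1	3.1	2.4	0.010	8
TP53	4.7	1.5	0.021	15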


Extracting the first column

To extract the first column we use the cut command (type man cut for more details).

cut -f 1 example_one.tsv

where -f specifies which fields/columns to extract.  This option also accepts ranges, e.g. -f 2-4 for the adjacent columns 2, 3 and 4, or comma-separated lists, e.g. -f 1,3 for the discontinuous columns 1 and 3.
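For instance, to pull out the gene symbols together with the third column:

cut -f 1,3 example_one.tsv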


Changing the delimiter

The default delimiter is the tab.  If your columns are separated by a different delimiter, such as a space or comma, you can specify this using the -d option, e.g. -d " " for spaces or -d "," for commas.
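For example, if the same data were stored as a comma-separated file (example_one.csv is just a hypothetical name here), the first column could be extracted with:

cut -d "," -f 1 example_one.csv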


Excluding empty lines

If your file has a line with no delimiters in it (as this one does), you can use the -s option to exclude it from the output (which is important for counting items later).

cut -f 1 -s example_one.tsv
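You can see the effect by counting the lines with and without -s (the wc command is covered properly below):

cut -f 1 example_one.tsv | wc -l
cut -f 1 -s example_one.tsv | wc -l

For this file, the first count includes the empty line (32), while the second excludes it (31: the heading line plus the 30 gene symbols).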


Excluding specific lines (such as the column headings)

Specific lines can be excluded through the use of an inverted grep match (this may be covered in more detail in a subsequent post), i.e. instead of returning all the lines containing a given keyword, we return the opposite.  In this case, we want everything except the heading line, which contains the word "Gene_Symbol".  You could also exclude other unwanted items, such as particular genes, using this method.  So we pipe the output of the cut command through to grep for further processing.  It is usually a good idea to check the result of a normal grep match first, i.e.

cut -f 1 -s example_one.tsv | grep Gene_Symbol

should return only "Gene_Symbol".  Adding the -v option (inverted match) to grep should return the rest of the gene list.

cut -f 1 -s example_one.tsv | grep -v Gene_Symbol
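The same approach extends to dropping several items at once with multiple -e patterns; here BRCA1 stands in as a hypothetical gene to exclude:

cut -f 1 -s example_one.tsv | grep -v -e Gene_Symbol -e BRCA1

One caveat: grep matches substrings anywhere in the line, so a short pattern can exclude more than intended; the -w (whole word) or -x (whole line) options give stricter matching.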


Counting genes

So now we have the contents of the first column, minus the empty line at the end of the file (excluded using cut -s) and the column headings (excluded using grep -v).  If we want to count how many genes are in the list (on the assumption that each remaining line contains a valid gene symbol), we can pipe the output of the previous step to the wc command, using the -l option to count only the lines (type man wc for more details).

cut -f 1 -s example_one.tsv | grep -v Gene_Symbol | wc -l

This should give us an answer of 30.

Counting all the lines in the original file using wc -l example_one.tsv should return an answer of 32.


Removing redundancy

We now have a list of 30 gene symbols but some of these are repeated.  To produce a non-redundant list, we use the sort command to sort the list while specifying that we want only unique entries returned (sort -u).

cut -f 1 -s example_one.tsv | grep -v Gene_Symbol | sort -u

If we count the items in this new list we should find that there are now 25 non-redundant gene symbols.

cut -f 1 -s example_one.tsv | grep -v Gene_Symbol | sort -u | wc -l

Just as a quick aside, the sort command has numerous other options, including ignoring upper and lower case (-f), ignoring leading blank spaces (-b), a common hazard when dealing with extracted lists, and sorting in dictionary (-d) or numerical (-n) order.  For most simple cases it won't make much difference, but it is worth considering exactly how redundancy itself is being defined in each given context, regardless of whether the work is done on the command line, in a script or in a database.
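As a sketch of the case issue: if the same gene appeared as both TP53 and Tp53 (a made-up example), plain sort -u would keep both entries, whereas adding -f folds upper and lower case together so the two are treated as duplicates:

cut -f 1 -s example_one.tsv | grep -v Gene_Symbol | sort -u -f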

The uniq command also removes duplicates; however, it does not sort the items first, so only duplicates that are adjacent to each other are removed.  sort -u is equivalent to sort | uniq.
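uniq does offer one handy extra: the -c option prefixes each line with its number of occurrences, which makes it easy to see exactly which symbols are duplicated:

cut -f 1 -s example_one.tsv | grep -v Gene_Symbol | sort | uniq -c | sort -rn

The final sort -rn sorts the counts numerically in reverse so that the most repeated symbols appear first.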


Storing the results

To store the results in a new file, or to overwrite an existing one, redirect the output using >, i.e.

cut -f 1 -s example_one.tsv | grep -v Gene_Symbol | sort -u > newfile.txt

To append the results to an already existing file, use >>, i.e.

cut -f 1 -s example_one.tsv | grep -v Gene_Symbol | sort -u >> appendedfile.txt
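As a quick sanity check, counting the lines of the saved file should again give 25:

wc -l newfile.txt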


The final step is to compare the outcome with what you originally expected and resolve any discrepancies.  In real life there are nearly always a few.
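If you have a reference list to check against, comm can make this comparison explicit.  A minimal sketch, assuming a hypothetical expected_genes.txt that is already sorted (comm requires sorted input, which sort -u has guaranteed for our own list):

comm -3 expected_genes.txt newfile.txt

The -3 option suppresses the lines common to both files, leaving only the discrepancies: lines unique to expected_genes.txt in the first column and lines unique to newfile.txt in the second.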