Spliting a fasta file to chromosomes (BioInfo Unix Tricks)

TL;DR;

Sometimes you have a large fasta file (e.g. a whole genome in one file) and you’d like to split it into one file per chromosome.

Here’s how to do so using common unix tools (csplit and sed):

mkdir new
cd new
csplit -s -z /path/to/INPUT.FA '/>/' '{*}'
for i in xx* ; do \
  n=$(sed 's/>// ; s/ .*// ; 1q' "$i") ; \
  mv "$i" "$n.fa" ; \
 done

Content

A Complete example
Detailed Explanation
Improvements and Variations
Final thoughts

A Complete example

Download and decompress the fasta file:

sh $ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz $ gunzip hg38.fa.gz $ grep -c ">" hg38.fa ### Count the number of sequences in the file 455
Split the file into one-file-per-chromosome:

```sh $ mkdir new $ cd new $ csplit -s -z ../hg38.fa ‘/>/’ ‘{*}’

### Resulting in 455 files $ ls xx00 xx01 … xx453 xx454

#### Peek into few of the files #### $ head -n3 xx00 xx01 xx454 ==> xx00 <== >chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

==> xx01 <== >chr10 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

==> xx454 <== >chrY_KI270740v1_random TAATAAATTTTGAAGAAAATGAAGACTGTGTTCTCAGTTCCAGGTGCTTC ATCAGGCTCATTGTGGATCCAGACTACCAGACACAAGACATTACACATTG ```
Rename each new file based on its chromosome content:

```sh

### See below for detailed explanation of this command $ for i in xx* ; do \ n=$(sed ‘s/>// ; s/ .*// ; 1q’ “$i”) ; \ mv “$i” “$n.fa” ; \ done

### Examine resulting files $ ls chr10.fa chr10_GL383545v1_alt.fa chr10_GL383546v1_alt.fa chr10_KI270824v1_alt.fa chr10_KI270825v1_alt.fa chr11.fa chr11_GL383547v1_alt.fa chr11_JH159136v1_alt.fa …

### Ensure the file name matches the chromosome $ head -n2 chr5.fa chrY.fa chr6_GL000255v2_alt.fa ==> chr5.fa <== >chr5 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

==> chrY.fa <== >chrY NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

==> chr6_GL000255v2_alt.fa <== >chr6_GL000255v2_alt TGGCTGTAGGAAACCAGGTCTTTCCCTCCCAAGGGAGGTGAACTACAAGC ```

Detailed Explanation

Step 1: Create a new directory

mkdir new
cd new

When working interactively in unix, we often create many temporary files, and some of them a left-over from previous command attemps. Creating and working in a new directory ensures there are no such left-over files.

See -f option below for alternative usage.

Step 2: use `csplit` to split a text file based on a pattern

The csplit (context split) is part of GNU coreutils and is available on all GNU/Linux systems (e.g. Ubuntu, CentOS, Debian, Fedora).

On Mac it is easy to install it using HomeBrew: brew install coreutils (It will be available as gcsplit). On Windows use the Cygwin environment.

Typical usage is:

csplit -s -z /path/to/INPUT.FA '/>/' '{*}'

First, the options:

The -s option (--silent) prevents csplit from printing the number of byte stored in each file. For the purpose of this tutorial the information is not needed
The -z option (--elide-empty-files) prevents creation of empty files (which can happen if the pattern we split by is encountered as the first line).
See below for more useful options.
Use csplit --help to see all options.
Visit the online manual to learn more.

Second, the input file name:

In out example, the input file is /path/to/FASTA.fa.
If the input file is in the current directory, FASTA.fa is suffucient.
Use - (a single minus character) to instruct csplit to read the input from STDIN (see below for examples).

Last, the patterns to split by:

The last two arguments ('/>/' '{*}') mean the following:

Argument of the form /XXX/ means split by regular expression XXX.
The />/ argument means split by the regular expression pattern >: Any line that contains the > character will start a new file.
The argument {*} tells csplit to repeat the previous pattern as many times as possible. Without it, only the first time a > character is encountered will cause a new file to be created.
The single-quotes characters are needed to prevent shell expansion. The characters >, *, {} have special meaning on the unix shell. Surrounding them with single-quotes removes the special meaning.

Note the implicit assumption about the split pattern >: We know that in a valid FASTA file, only the sequence name lines should contain a > character. That is why we can use such simple pattern. Of course, for more complicated file formats a more elaborate pattern can be used.

By default, csplit reads the input file, and every time the specified pattern is encountered, csplit will start a new file. csplit names the files with an xx prefix, and adds a numeric suffix (e.g. the first 3 files are named xx00, xx01, xx02). See below for examples of changing these defaults.

Here’s a simple example:

### Create an input file with 100 numbers:
$ seq 100 > num.txt
$ head num.txt
1
2
3
4
5
6
7
8
9
10

### Split the input file, start a new file for every line
### that contains the digit `1`:
$ csplit -s -z num.txt '/1/' '{*}'

### These are the output files:
$ ls
num.txt  xx01  xx03  xx05  xx07  xx09  xx11  xx13  xx15  xx17  xx19
xx00     xx02  xx04  xx06  xx08  xx10  xx12  xx14  xx16  xx18

### And their content:
$ head xx*
==> xx00 <==
1
2
3
4
5
6
7
8
9

==> xx01 <==
10

==> xx02 <==
11

...

Step 3: use `sed` to extract the chromosome name from the file

The csplit commands splits the FASTA file into individual chromosome files, and names them xx00, xx01, xx02, etc.

We now want to rename them to their corresponding chromosome name.

Each file contains the chromosome name in its first line:

$ head -n3 xx00 xx01
==> xx00 <==
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

==> xx01 <==
>chr10
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

head -n1 can be used to print the first line of every file:

$ head -n1 xx00
>chr1
$ head -n1 xx01
>chr10

The sed program is an easy way to remove the > character:

$ head -n1 xx00 | sed 's/>//'
chr1

Instead of using head, we can instruct sed to stop processing after the first line. The next command is equivalent to using head -n1:

$ sed 's/>// ; 1q' xx00
chr1

Lastly, many FASTA files contains additional information following the initial identifiers, separated by a space character:

>chr1 NC_000001.11 Homo sapiens chromosome 1, GRCh38.p7 Primary Assembly

We extend our sed command to remove all characters following the space and print only the actual identifier (in case there is any additional information):

$ sed 's/>// ; s/ .*// ; 1q' "$i" xx00
chr1


$ sed 's/>// ; s/ .*// ; 1q' "$i" xx01
chr10

Step 4: use a shell variable to rename the file

Using the above sed command we can store the chromosome name in a shell variable, and later use it in our rename command:

###   Run the sed command, and store its output in variable 'n':
$ n=$(sed 's/>// ; s/ .*// ; 1q' "$i" xx00)

###   show the content of variable 'n':
$ echo $n
chr1

### Use it to rename the file:
$ mv xx00 "$n.fa"

$ ls *.fa
chr1.fa

Step 5: use shell `for` loop to rename all files

The example above extracted the chromosome name from a single file (xx00) and renamed it.

We can now perform the same using a shell for loop, iterating over all files starting with xx prefix (which is csplit’s default prefix):

for i in xx* ; do \
  n=$(sed 's/>// ; s/ .*// ; 1q' "$i") ; \
  mv "$i" "$n.fa" ; \
 done

Improvements and Variations

Split gzip files directly

Instead of decompressing (gunziping) the input file, we can send it directly to csplit:

cat INPUT.fa.gz | gunzip | csplit -s -z - '/>/' '{*}'

The critical change is that now cpslit gets uses - as the input name - meaning STDIN.

Similar invocations:

gunzip  < INPUT.fa.gz | csplit -s -z - '/>/' '{*}'
gzip -d < INPUT.fa.gz | csplit -s -z - '/>/' '{*}'
gzip -dc  INPUT.fa.gz | csplit -s -z - '/>/' '{*}'
zcat      INPUT.fa.gz | csplit -s -z - '/>/' '{*}'

Note: zcat works well on GNU/Linux, will not work as-is on Mac - use gzip -dc on Mac.

Use specific prefix

Instead of using the default xx prefix for output files, we can specify another using -f PREFIX option:

csplit -f foo -s -z INPUT.fa '/>/' '{*}'

The resulting files are:

$ ls
INPUT.fa foo00 foo01 foo02 foo03 ...

Similarly, we can direct the output to another directory:

mkdir /data/
csplit -f /data/foo -s -z INPUT.fa '/>/' '{*}'

$ ls /data/foo*
foo00 foo01 foo02 ....

Use specific suffix / file extension

The default output suffix is two digits, and will automatically increase to more digits if there are more than 99 files.

Using the -n/--digits option we can specifiy the number of digits in the output files:

csplit -s -z -n 5 INPUT.fa '/>/' '{*}'

The resulting files are:

$ ls
INPUT.fa xx00000 xx00002 xx00003 xx00004
xx00005  xx00006 xx00007 xx00008 xx00008
...

Using -b/--suffix option we can control the exact suffix including adding a file extension. The -b option understands the printf syntax:

$ csplit -s -z -b "%05d.fa"  INPUT.fa '/>/' '{*}'

$ ls
input.fa    xx00001.fa  xx00003.fa  xx00005.fa  xx00007.fa  xx00009.fa  xx00011.fa
xx00000.fa  xx00002.fa  xx00004.fa  xx00006.fa  xx00008.fa  xx00010.fa  xx00012.fa
...

Options -f PREFIX and -b SUFFIX can be combined. Using custom prefixes or suffixes are very useful when combining it later with shell globbing (e.g. *.fa).

For example, if splitting a file is just the first step, and we then want to process all the output files, we can use a command like:

csplit -b '%05d.step1' INPUT.fa '/>/' '{*}'

And them perform an action on all *.step1 files.

Removing the `chr` prefix from chromosome names

At times we have FASTA files with chr prefix (e.g. chr12 and chrX) and we want to remove the chr prefix.

In our scenario of splitting a large FASTA files, a good place to remove those prefixes is before feeding the file to csplit. This way, each individual chromosome file will already contain the chromosome name without the chr prefix, and our renaming loop would just work:

cat hg38.fa  |   sed 's/^>chr/>/'  |   csplit -s -z - '/>/' '{*}'

The above sed command replaces all lines that start with >chr with a just the > character - effectively removing the prefix but leaving the chromosome number.

Using `awk` instead of csplit/sed

The following awk program performs the same function: it processes the input file and saves it to separate files based on a matched pattern.

awk '/^>/ { gsub(">","",$1)
            FILE=$1 ".fa"
            print ">" $1 >> FILE
            next}
          { print >> FILE }'  hg38.fa

Final thoughts

There are many other ways to split a fasta files (including dedicated tools, and writing your own short awk/python/perl/ruby scripts).

In the case of genomes downloaded from the UCSC Genome website, many of the genomes are also available as a tar.gz file containing one-file-per-chromosomes.

However this way is a good example of using and learning the common unix tools available on all GNU/Linux systems.

Hope you found this useful.

Follow me at @AGordonX for more articles (questions and topic requests are welcomed).