Spliting a fasta file to chromosomes (BioInfo Unix Tricks)
TL;DR;
Sometimes you have a large fasta file (e.g. a whole genome in one file) and you’d like to split it into one file per chromosome.
Here’s how to do so using common unix tools (csplit
and sed
):
mkdir new
cd new
csplit -s -z /path/to/INPUT.FA '/>/' '{*}'
for i in xx* ; do \
n=$(sed 's/>// ; s/ .*// ; 1q' "$i") ; \
mv "$i" "$n.fa" ; \
done
Content
A Complete example
-
Download and decompress the fasta file:
sh $ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz $ gunzip hg38.fa.gz $ grep -c ">" hg38.fa ### Count the number of sequences in the file 455
-
Split the file into one-file-per-chromosome:
```sh $ mkdir new $ cd new $ csplit -s -z ../hg38.fa ‘/>/’ ‘{*}’
### Resulting in 455 files $ ls xx00 xx01 … xx453 xx454
#### Peek into few of the files #### $ head -n3 xx00 xx01 xx454 ==> xx00 <== >chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
==> xx01 <== >chr10 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
==> xx454 <== >chrY_KI270740v1_random TAATAAATTTTGAAGAAAATGAAGACTGTGTTCTCAGTTCCAGGTGCTTC ATCAGGCTCATTGTGGATCCAGACTACCAGACACAAGACATTACACATTG ```
-
Rename each new file based on its chromosome content:
```sh
### See below for detailed explanation of this command $ for i in xx* ; do \ n=$(sed ‘s/>// ; s/ .*// ; 1q’ “$i”) ; \ mv “$i” “$n.fa” ; \ done
### Examine resulting files $ ls chr10.fa chr10_GL383545v1_alt.fa chr10_GL383546v1_alt.fa chr10_KI270824v1_alt.fa chr10_KI270825v1_alt.fa chr11.fa chr11_GL383547v1_alt.fa chr11_JH159136v1_alt.fa …
### Ensure the file name matches the chromosome $ head -n2 chr5.fa chrY.fa chr6_GL000255v2_alt.fa ==> chr5.fa <== >chr5 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
==> chrY.fa <== >chrY NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
==> chr6_GL000255v2_alt.fa <== >chr6_GL000255v2_alt TGGCTGTAGGAAACCAGGTCTTTCCCTCCCAAGGGAGGTGAACTACAAGC ```
Detailed Explanation
Step 1: Create a new directory
mkdir new
cd new
When working interactively in unix, we often create many temporary files, and some of them a left-over from previous command attemps. Creating and working in a new directory ensures there are no such left-over files.
See -f option below for alternative usage.
Step 2: use csplit
to split a text file based on a pattern
The csplit
(context split) is part of GNU coreutils
and is available on all GNU/Linux systems (e.g. Ubuntu, CentOS, Debian, Fedora).
On Mac it is easy to install it using HomeBrew: brew install coreutils
(It will be available as gcsplit
). On Windows use the Cygwin environment.
Typical usage is:
csplit -s -z /path/to/INPUT.FA '/>/' '{*}'
First, the options:
- The
-s
option (--silent
) prevents csplit from printing the number of byte stored in each file. For the purpose of this tutorial the information is not needed - The
-z
option (--elide-empty-files
) prevents creation of empty files (which can happen if the pattern we split by is encountered as the first line). - See below for more useful options.
- Use
csplit --help
to see all options. - Visit the online manual to learn more.
Second, the input file name:
- In out example, the input file is
/path/to/FASTA.fa
. - If the input file is in the current directory,
FASTA.fa
is suffucient. - Use
-
(a single minus character) to instructcsplit
to read the input from STDIN (see below for examples).
Last, the patterns to split by:
The last two arguments ('/>/' '{*}'
) mean the following:
- Argument of the form
/XXX/
means split by regular expressionXXX
. - The
/>/
argument means split by the regular expression pattern>
: Any line that contains the>
character will start a new file. - The argument
{*}
tellscsplit
to repeat the previous pattern as many times as possible. Without it, only the first time a>
character is encountered will cause a new file to be created. - The single-quotes characters are needed to prevent shell expansion.
The characters
>
,*
,{}
have special meaning on the unix shell. Surrounding them with single-quotes removes the special meaning.
Note the implicit assumption about the split pattern >
:
We know that in a valid FASTA file, only the sequence name lines
should contain a >
character. That is why we can use such simple pattern.
Of course, for more complicated file formats a more elaborate pattern can be used.
By default, csplit
reads the input file, and every time the specified
pattern is encountered, csplit
will start a new file.
csplit
names the files with an xx
prefix, and adds a numeric suffix
(e.g. the first 3 files are named xx00
, xx01
, xx02
).
See below for examples of changing these defaults.
Here’s a simple example:
### Create an input file with 100 numbers:
$ seq 100 > num.txt
$ head num.txt
1
2
3
4
5
6
7
8
9
10
### Split the input file, start a new file for every line
### that contains the digit `1`:
$ csplit -s -z num.txt '/1/' '{*}'
### These are the output files:
$ ls
num.txt xx01 xx03 xx05 xx07 xx09 xx11 xx13 xx15 xx17 xx19
xx00 xx02 xx04 xx06 xx08 xx10 xx12 xx14 xx16 xx18
### And their content:
$ head xx*
==> xx00 <==
1
2
3
4
5
6
7
8
9
==> xx01 <==
10
==> xx02 <==
11
...
Step 3: use sed
to extract the chromosome name from the file
The csplit
commands splits the FASTA file into individual
chromosome files, and names them xx00
, xx01
, xx02
, etc.
We now want to rename them to their corresponding chromosome name.
Each file contains the chromosome name in its first line:
$ head -n3 xx00 xx01
==> xx00 <==
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
==> xx01 <==
>chr10
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
head -n1
can be used to print the first line of every file:
$ head -n1 xx00
>chr1
$ head -n1 xx01
>chr10
The sed
program is an easy way to remove the >
character:
$ head -n1 xx00 | sed 's/>//'
chr1
Instead of using head
, we can instruct sed
to stop processing
after the first line. The next command is equivalent to using head -n1
:
$ sed 's/>// ; 1q' xx00
chr1
Lastly, many FASTA files contains additional information following the initial identifiers, separated by a space character:
>chr1 NC_000001.11 Homo sapiens chromosome 1, GRCh38.p7 Primary Assembly
We extend our sed
command to remove all characters following the space
and print only the actual identifier (in case there is any additional information):
$ sed 's/>// ; s/ .*// ; 1q' "$i" xx00
chr1
$ sed 's/>// ; s/ .*// ; 1q' "$i" xx01
chr10
Step 4: use a shell variable to rename the file
Using the above sed
command we can store the chromosome name
in a shell variable, and later use it in our rename command:
### Run the sed command, and store its output in variable 'n':
$ n=$(sed 's/>// ; s/ .*// ; 1q' "$i" xx00)
### show the content of variable 'n':
$ echo $n
chr1
### Use it to rename the file:
$ mv xx00 "$n.fa"
$ ls *.fa
chr1.fa
Step 5: use shell for
loop to rename all files
The example above extracted the chromosome name
from a single file (xx00
) and renamed it.
We can now perform the same using a shell for
loop,
iterating over all files starting with xx
prefix
(which is csplit
’s default prefix):
for i in xx* ; do \
n=$(sed 's/>// ; s/ .*// ; 1q' "$i") ; \
mv "$i" "$n.fa" ; \
done
Improvements and Variations
Split gzip files directly
Instead of decompressing (gunzip
ing) the input file,
we can send it directly to csplit
:
cat INPUT.fa.gz | gunzip | csplit -s -z - '/>/' '{*}'
The critical change is that now cpslit
gets uses -
as
the input name - meaning STDIN.
Similar invocations:
gunzip < INPUT.fa.gz | csplit -s -z - '/>/' '{*}'
gzip -d < INPUT.fa.gz | csplit -s -z - '/>/' '{*}'
gzip -dc INPUT.fa.gz | csplit -s -z - '/>/' '{*}'
zcat INPUT.fa.gz | csplit -s -z - '/>/' '{*}'
Note: zcat
works well on GNU/Linux, will not work as-is on Mac -
use gzip -dc
on Mac.
Use specific prefix
Instead of using the default xx
prefix for output files,
we can specify another using -f PREFIX
option:
csplit -f foo -s -z INPUT.fa '/>/' '{*}'
The resulting files are:
$ ls
INPUT.fa foo00 foo01 foo02 foo03 ...
Similarly, we can direct the output to another directory:
mkdir /data/
csplit -f /data/foo -s -z INPUT.fa '/>/' '{*}'
$ ls /data/foo*
foo00 foo01 foo02 ....
Use specific suffix / file extension
The default output suffix is two digits, and will automatically increase to more digits if there are more than 99 files.
Using the -n/--digits
option we can specifiy the number of digits
in the output files:
csplit -s -z -n 5 INPUT.fa '/>/' '{*}'
The resulting files are:
$ ls
INPUT.fa xx00000 xx00002 xx00003 xx00004
xx00005 xx00006 xx00007 xx00008 xx00008
...
Using -b/--suffix
option we can control the exact suffix including
adding a file extension. The -b
option understands the printf syntax:
$ csplit -s -z -b "%05d.fa" INPUT.fa '/>/' '{*}'
$ ls
input.fa xx00001.fa xx00003.fa xx00005.fa xx00007.fa xx00009.fa xx00011.fa
xx00000.fa xx00002.fa xx00004.fa xx00006.fa xx00008.fa xx00010.fa xx00012.fa
...
Options -f PREFIX
and -b SUFFIX
can be combined.
Using custom prefixes or suffixes are very useful when combining
it later with shell globbing (e.g. *.fa
).
For example, if splitting a file is just the first step, and we then want to process all the output files, we can use a command like:
csplit -b '%05d.step1' INPUT.fa '/>/' '{*}'
And them perform an action on all *.step1
files.
Removing the chr
prefix from chromosome names
At times we have FASTA files with chr
prefix (e.g. chr12
and chrX
)
and we want to remove the chr
prefix.
In our scenario of splitting a large FASTA files, a good place
to remove those prefixes is before feeding the file to csplit
.
This way, each individual chromosome file will already contain
the chromosome name without the chr
prefix, and our renaming
loop would just work
:
cat hg38.fa | sed 's/^>chr/>/' | csplit -s -z - '/>/' '{*}'
The above sed
command replaces all lines that start with >chr
with a just the >
character - effectively removing the prefix
but leaving the chromosome number.
Using awk
instead of csplit/sed
The following awk
program performs the same function:
it processes the input file and saves it to separate files based on a matched
pattern.
awk '/^>/ { gsub(">","",$1)
FILE=$1 ".fa"
print ">" $1 >> FILE
next}
{ print >> FILE }' hg38.fa
Final thoughts
There are many other ways to split a fasta files (including dedicated tools, and writing your own short awk/python/perl/ruby scripts).
In the case of genomes downloaded from the UCSC Genome website,
many of the genomes are also available as a tar.gz
file containing
one-file-per-chromosomes.
However this way is a good example of using and learning the common unix tools available on all GNU/Linux systems.
Hope you found this useful.
Follow me at @AGordonX for more articles (questions and topic requests are welcomed).