A repeat-aware tool for upgrading de-novo assembly using long reads
If you encounter any issues, please feel free to contact me (Ka-Kit Lam) directly at kklam@eecs.berkeley.edu.
Our paper on this tool is published in
Lam, K. K., LaButti, K., Khalak, A., & Tse, D. (2015). FinisherSC: a repeat-aware tool for upgrading de novo assembly using long reads. Bioinformatics, 31(19), 3207-3209.
https://doi.org/10.1093/bioinformatics/btv280
FinisherSC is released under the MIT License.
If you have a metagenomics dataset, you may also want to visit our MetaFinisherSC: https://github.com/kakitone/MetaFinisherSC
Here is the command to run the tool:
python finisherSC.py destinedFolder mummerPath
If you are running on a server and would like to use multiple threads, the following command runs FinisherSC on 20 threads.
python finisherSC.py -par 20 destinedFolder mummerPath
Sometimes, if the names of the raw reads or contigs contain special characters or unusual formats, FinisherSC/MUMmer may not parse them correctly. In that case, you can quickly rename the contigs/reads in contigs.fasta or raw_reads.fasta using the following commands.
perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' raw_reads.fasta > newRaw_reads.fasta
cp newRaw_reads.fasta raw_reads.fasta
perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' contigs.fasta > newContigs.fasta
cp newContigs.fasta contigs.fasta
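For users who prefer to stay in Python, the renaming can be sketched as follows. This is a minimal sketch, not part of FinisherSC: the function names rename_fasta_headers and rename_fasta_file are hypothetical, and it assumes that every header line starts with ">" (the Perl one-liner above matches ">" anywhere in a line, so the two may differ on unusual files).

```python
def rename_fasta_headers(lines, prefix="Seg"):
    """Yield FASTA lines with each header replaced by >Seg1, >Seg2, ...

    Assumption: a header is any line that starts with '>'.
    Sequence lines are passed through unchanged.
    """
    counter = 0
    for line in lines:
        if line.startswith(">"):
            counter += 1
            yield ">%s%d\n" % (prefix, counter)
        else:
            yield line


def rename_fasta_file(in_path, out_path, prefix="Seg"):
    """Write a renamed copy of in_path to out_path (hypothetical helper)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(rename_fasta_headers(src, prefix))
```

For example, rename_fasta_file("raw_reads.fasta", "newRaw_reads.fasta") would play the role of the first Perl command; copying the result back over raw_reads.fasta is still a separate step.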
Note that FinisherSC assumes that "contigs.fasta" and "raw_reads.fasta" are in the destinedFolder and that mummerPath is the path to your MUMmer installation. Please rename the files in the destinedFolder if needed. For "finisherSC.py", the key output file is improved3.fasta in the destinedFolder.
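Because a long run can fail late if an input file is misnamed, it may help to check the folder before launching. The snippet below is only a sketch of such a check: the helper missing_inputs is hypothetical and not part of FinisherSC; the two file names are the ones stated above.

```python
import os

# File names that FinisherSC expects in the destinedFolder (from the note above).
REQUIRED_FILES = ("contigs.fasta", "raw_reads.fasta")


def missing_inputs(destined_folder):
    """Return the required input files that are absent from destined_folder.

    Hypothetical helper: an empty list means both inputs are in place.
    """
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(destined_folder, name))]
```

A caller could print the returned list and stop if it is non-empty, before invoking finisherSC.py.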
Here is an example run with the pre-installed data files and software (e.g. MUMmer, Gepard). [You may go to https://www.dropbox.com/sh/xjpt8xf5g1xf0ek/bGmZvt9Zfd to download the whole package for a quick test. Note that all data and software besides FinisherSC are copyrighted by their original authors; we provide the sandbox for users' convenience only. All updates and versioning will be on the GitHub page. Users are encouraged to git clone our project for later use after they successfully go through the example run.]
a) Fast mode [-f True] (e.g. for large and complex genomes, you may want to use this mode to get results faster. However, it trades some quality for computation time, so we suggest using it only when you get stuck in the standard mode):
python finisherSC.py -f True destinedFolder mummerPath
[It means that you want to use the fast mode ]
b)Parallel mode. [-par numberOfThreads] You can now use -par 20 to run finisherSC on 20 threads. The command is
python finisherSC.py -par 20 destinedFolder mummerPath
c)Break down large contig file [-l True]. If your contigs.fasta is too big, then you want to use this option to break it down for alignment.
python finisherSC.py -l True destinedFolder mummerPath
d) Pick up from previous jobs [-p pickupFilename] (e.g. you have obtained improved2.fasta but the cluster node timed out, and you want to restart from improved2.fasta instead of from scratch)
python finisherSC.py -p improved2.fasta destinedFolder mummerPath
e)Mapping between old contigs and new contigs [-o referenceName_QueryName ](e.g. you want to know how the contigs from contigs.fasta are mapped to improved3.fasta )
python finisherSC.py -o contigs.fasta_improved3.fasta destinedFolder mummerPath
[It will then output the alignment of improved3.fasta against contigs.fasta. The output will be shown in the terminal and in mappingResults.txt at the destinedFolder]
f)Help [-h]. You can always use -h to get the usage suggestion.
python finisherSC.py -h
g)There is an experimental improvement based on repeat phasing and the further improved file is improved4.fasta. An illustration of that part is given in Fig. 7 of http://arxiv.org/abs/1402.6971 . To experiment with that, first run finisherSC.py as before. After that, you can issue the following command:
python experimental/xPhaser.py destinedFolder mummerPath
h) There is also an experimental improvement that resolves tandem repeats. The further improved file is tademResolved.fasta. To experiment with that, first run finisherSC.py as before. After that, you can issue the following command:
python experimental/tSolver.py destinedFolder mummerPath
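When FinisherSC is launched from a script (for example on a cluster), the options above can be assembled programmatically. The following is a rough sketch under stated assumptions: build_command and run are hypothetical helpers, the flags are exactly those listed above, and whether several flags can be combined in one invocation is an assumption, since the examples above show them one at a time.

```python
import subprocess


def build_command(destined_folder, mummer_path, fast=False, threads=None,
                  large=False, pickup=None, mapping=None):
    """Return the argument list for one finisherSC.py invocation.

    mapping is a (referenceName, queryName) pair and is joined with '_'
    to match the -o referenceName_QueryName form described above.
    """
    cmd = ["python", "finisherSC.py"]
    if fast:
        cmd += ["-f", "True"]          # a) fast mode
    if threads is not None:
        cmd += ["-par", str(threads)]  # b) parallel mode
    if large:
        cmd += ["-l", "True"]          # c) break down large contig file
    if pickup is not None:
        cmd += ["-p", pickup]          # d) pick up from a previous job
    if mapping is not None:
        cmd += ["-o", "%s_%s" % mapping]  # e) old-to-new contig mapping
    cmd += [destined_folder, mummer_path]
    return cmd


def run(cmd):
    """Run the command and return its exit code."""
    return subprocess.call(cmd)
```

For instance, build_command("destinedFolder", "mummerPath", threads=20) reproduces the parallel-mode command shown in b).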
We use the benchmark datasets available online from PacBio DevNet, and FinisherSC processes all of them to completion. For datasets (1, 2, 3), we use the -par 20 option on a server computer, and the runs finish in a couple of hours to a day. For dataset (4), we run without any options on a laptop, and it finishes in a couple of minutes.
The data are supported by the original authors. However, we provide an example of how to download the data and convert the file types, so that it is easier for users to validate FinisherSC on benchmark data.
Download data with links specified in file_list
for f in `cat file_list`; do wget --force-directories $f; done
Download the DEXTRACTOR
git clone https://github.com/thegenemyers/DEXTRACTOR.git
Transform .bax.h5 files to .fasta file
find . -name '*.bax.h5' | xargs DEXTRACTOR/dextract > raw_reads.fasta
As noted earlier, if the names of the raw reads or contigs contain special characters or unusual formats, FinisherSC/MUMmer may not parse them correctly. In that case, rename the contigs/reads in contigs.fasta or raw_reads.fasta using the following commands.
perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' raw_reads.fasta > newRaw_reads.fasta
cp newRaw_reads.fasta raw_reads.fasta
perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' contigs.fasta > newContigs.fasta
cp newContigs.fasta contigs.fasta
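After converting and renaming, a quick sanity check on the resulting FASTA file can catch an empty or truncated conversion before a long run. This is a minimal sketch and not part of FinisherSC; fasta_stats is a hypothetical helper that only counts header lines and sequence characters.

```python
def fasta_stats(lines):
    """Return (number_of_records, total_bases) for an iterable of FASTA lines.

    Header lines start with '>'; every other non-empty line is counted
    as sequence. Blank lines are ignored.
    """
    records = 0
    bases = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records += 1
        else:
            bases += len(line)
    return records, bases
```

It can be applied to a file with, for example, "with open('raw_reads.fasta') as handle: print(fasta_stats(handle))".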
Main functional components of FinisherSC :
finisherSC.py : Starting point of FinisherSC
nonRedundantResolver.py : Filter completely embedded contigs (aka Subroutine 1 in the paper's supplementary material)
overlapResolver.py : Utilize existing overlaps of contigs to resolve repeats and merge contigs (aka Subroutine 2,3 in the paper's supplementary material)
gapFiller.py : Utilize raw reads data to fill gaps(aka Subroutine 4, 5 in the paper's supplementary material)
twoRepeatOneBridgeSolver.py : Resolve repeats with two copies where only one copy is spanned by some reads (aka Subroutine 6,7 in the paper's supplementary material)
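To give a feel for what "filtering completely embedded contigs" means, here is a toy illustration. It is not the code in nonRedundantResolver.py: it uses exact substring matching on short strings, whereas real contigs need an aligner such as MUMmer and some tolerance for errors. The names reverse_complement and drop_embedded_contigs are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def drop_embedded_contigs(contigs):
    """Return a dict without contigs that are exactly contained in another.

    contigs maps names to uppercase sequences. A contig is dropped when it,
    or its reverse complement, is a substring of a longer contig. For
    equal-length duplicates, the contig with the smaller name is kept.
    Toy version: exact matching only, quadratic in the number of contigs.
    """
    kept = {}
    for name, seq in contigs.items():
        embedded = False
        for other_name, other_seq in contigs.items():
            if other_name == name:
                continue
            can_absorb = (len(other_seq) > len(seq) or
                          (len(other_seq) == len(seq) and other_name < name))
            if can_absorb and (seq in other_seq or
                               reverse_complement(seq) in other_seq):
                embedded = True
                break
        if not embedded:
            kept[name] = seq
    return kept
```

With contigs {"a": "ACGTACGTTT", "b": "GTACG", "c": "AAACG", "d": "TTTT"}, contig b is contained in a directly and c through its reverse complement, so only a and d are kept.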
Helper Functions :
IORobot.py : various functions that support reading and writing of data
alignerRobot.py : library containing ways to parse MUMmer results
houseKeeper.py : various functions to support house keeping of data
graphLib.py : library containing various string graph operations
debugging.py : a debugging point for hacking into various functions
unittester.py : integration test package
viewer.py : create dot plots
An older version of the tool (no longer supported) :
finisher.py : an independent and functioning (but no longer supported) version based on a greedy algorithm
Relevant third-party software:
fasta-splitter.pl : a command-line tool to break a FASTA file into smaller chunks
MUMmer : an alignment tool to perform the mapping of reads and contigs
gepard-1.30 : a good tool to create dotplots
Experimental folder:
There are various experimental functions there. The two important ones are the tandem repeat resolver (tSolver.py) and the approximate repeat phaser (xPhaser.py).
The build status indicator shows the state of the current build, powered by Travis-CI. If you issue a pull request, Travis-CI will evaluate your suggestion by automatically running the code on the default test case.