Installation

PreprocessFiles is a tool that obtains read counts of target regions from whole-exome sequencing data. It runs on Unix-based systems. A C++ compiler (e.g. GCC) and CMake are required to compile the PreprocessFiles package. To build the binaries in the bin directory, follow these steps:

tar -xzvf PreprocessFiles.tar.gz

cd PreprocessFiles

cmake .

make

Usage

The following items are required to run PreprocessFiles:

A BAM file
A genome reference file (*.fasta)
A target file

Human genome references (GRCh37, GRCh38) can be obtained from the UCSC Genome Browser. Select the reference to which the sequenced reads were mapped.

The format of the target file is as follows:

Chromosome	Start Position	End Position
1	20138	20294
1	58932	59892
1	218334	218574

A header row is required. The first field denotes the chromosome; the second and third fields define the start and end positions of a target region, respectively. The columns of the target file must be separated by tabs.
Several general target files can be downloaded from the following links:
SureSelect Human All Exon V1
SureSelect Human All Exon V2
SureSelect Human All Exon V4
SureSelect Human All Exon V5
SureSelect Human All Exon 50Mb
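
A custom target file can be sanity-checked against the format described above before it is used. The following is a minimal sketch, not part of PreprocessFiles: the function name check_targets is hypothetical, and the rule that the start position must not exceed the end position is an assumption rather than a documented requirement.

```shell
# check_targets FILE  (hypothetical helper, not shipped with PreprocessFiles)
# Expects a header row followed by tab-separated rows: chromosome, start, end.
check_targets() {
    awk -F'\t' '
        NR == 1 { next }    # skip the required header row
        NF != 3 {
            printf "line %d: expected 3 tab-separated fields, found %d\n", NR, NF
            bad = 1; next
        }
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            printf "line %d: start and end must be integers\n", NR
            bad = 1; next
        }
        $2 + 0 > $3 + 0 {   # assumption: start must not exceed end
            printf "line %d: start is greater than end\n", NR
            bad = 1
        }
        END {
            if (NR < 2) { print "no target regions found"; exit 1 }
            exit bad + 0
        }
    ' "$1"
}
```

Calling check_targets target.txt prints one message per offending line and returns a non-zero status if any line fails. A file saved with Windows line endings is reported as having non-integer end positions.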

At this point, all the files required to run PreprocessFiles should be available.

1. Get the read counts from the BAM file.

2. Obtain the GC content of the target regions from the genome reference file.
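
For orientation, the GC content of a region is the fraction of its bases that are G or C. The sketch below only illustrates that definition on a short sequence string. The function name gc_fraction is hypothetical, and how PreprocessFiles itself treats ambiguous bases such as N is not stated here, so this sketch simply counts them in the length.

```shell
# gc_fraction SEQ  (hypothetical helper, for illustration only)
# Prints the fraction of G and C bases in SEQ, case-insensitively.
gc_fraction() {
    printf '%s\n' "$1" | LC_ALL=C awk '{
        seq = toupper($0)
        n = length(seq)
        gc = gsub(/[GC]/, "", seq)    # gsub returns the number of matches replaced
        if (n > 0) printf "%.3f\n", gc / n
    }'
}
```

For example, gc_fraction ACGTGGCC prints 0.750, since six of the eight bases are G or C.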

Once all BAM files are processed, the *.count files and the *.gc file must be packaged into a single archive.
To package the files, use the following command:

tar -zcvf all.tar.gz patient1.count control1.count control2.count target.gc
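
Before uploading, the archive contents can be listed to confirm that nothing was left out. This is a minimal sketch under stated assumptions: the function name check_archive is hypothetical, and the expectation of at least one *.count file and exactly one *.gc file is inferred from the example command above, not from a documented rule.

```shell
# check_archive ARCHIVE  (hypothetical helper, not shipped with PreprocessFiles)
# Lists the archive members without extracting them and counts file types.
check_archive() {
    members=$(tar -tzf "$1") || return 1
    counts=$(printf '%s\n' "$members" | grep -c '\.count$' || true)
    gcs=$(printf '%s\n' "$members" | grep -c '\.gc$' || true)
    echo "$counts count file(s), $gcs gc file(s)"
    # assumption: one or more .count files and exactly one .gc file
    [ "$counts" -ge 1 ] && [ "$gcs" -eq 1 ]
}
```

Calling check_archive all.tar.gz prints a one-line summary and returns a non-zero status if the archive cannot be read or the expected files are missing.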

Our website only supports files generated by the procedure described above. Upload the packaged file “all.tar.gz” to our website.

For users who do not have control samples, we provide a standard reference file. It was created from samples whose exomes were captured with the Agilent SureSelect Human All Exon 50Mb kit, sequenced on the Illumina HiSeq 2000 platform, and aligned to the UCSC GRCh37 genome reference. Users can use this reference file as a control provided that their samples were sequenced with the same technology and the reads were mapped to the UCSC GRCh37 reference.

For ease of use, we have included a shell script named “run.sh” in the PreprocessFiles package. It performs all the steps with a single command. Before using the script, make it executable with the following command:

chmod a+x run.sh

The following four inputs are required to run the shell script:

A file listing all the paths of BAM files
A genome reference file (*.fasta)
A target file
A directory to save results

An example of the file that lists all the paths of BAM files is given below:

/path/patient1.bam
/path/patient2.bam
/path/patient3.bam
/path/control1.bam
/path/control2.bam
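
A list like the one above can be generated from a directory of BAM files and checked before it is passed to the script. This is a minimal sketch: the function names make_bam_list and check_bam_list are hypothetical, and the use of absolute paths follows the example above and is not a documented requirement.

```shell
# make_bam_list DIR OUT  (hypothetical helper)
# Writes the absolute path of every *.bam file under DIR to OUT, one per line.
make_bam_list() {
    find "$(cd "$1" && pwd)" -type f -name '*.bam' | sort > "$2"
}

# check_bam_list LIST  (hypothetical helper)
# Reports every listed path that is not a readable file.
check_bam_list() {
    rc=0
    while IFS= read -r bam || [ -n "$bam" ]; do
        [ -z "$bam" ] && continue    # ignore blank lines
        if [ ! -r "$bam" ]; then
            echo "missing or unreadable: $bam" >&2
            rc=1
        fi
    done < "$1"
    return "$rc"
}
```

For example, make_bam_list /path bam.list followed by check_bam_list bam.list produces the list and returns a non-zero status if any entry cannot be read.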

Usage:
./run.sh <bam.list> <genome.fasta> <target.txt> <outputDir>

The program writes all files to the directory “outputDir” and automatically generates a packaged file “all.tar.gz”. Users should upload this packaged file to our web server for further analysis.