Guidelines for Bioinformatic Processing of Sequencing Data (NEB #E9500 and #E9530)

Introduction

NEBNext Direct GS libraries have a unique molecular identifier (UMI) followed by a sample barcode in the Read 1 position, with a second sample barcode in the Illumina i7 site (Index 1) and the template read as Read 2. The following sections describe how to partition the sequencing cycles into template reads, sample indexes, and UMIs for NEBNext Direct GS libraries using bcl2fastq or Picard tools.

 

5.1 Guidelines for Demultiplexing NEBNext Direct GS Libraries Using bcl2fastq

5.1.1. Prepare a sample sheet for bcl2fastq

The bcl2fastq software uses the sample barcodes to demultiplex in the order in which they are read by the sequencer. Therefore, the “index” column in the sample sheet must contain the sample barcodes from the Read 1 site, and the “index2” column must contain the sample barcodes from the i7 site. This sample sheet differs from the one used to run the Illumina sequencer. To download a sample sheet with the barcode sequences for use with bcl2fastq, refer to the “Usage Guidelines” located in the “Protocols, Manuals & Usage” tab on the NEBNext Direct Genotyping Solution Target Enrichment Kit (NEB #E9530) product page, www.neb.com/E9530.

5.1.2. Run bcl2fastq

Options

The bcl2fastq software should be run with the following options:

--use-bases-mask y12i8,i8,y75
--mask-short-adapter-reads 10
--barcode-mismatches 0

The first option assigns the first 12 cycles of Read 1 to the UMI and the next 8 cycles to the first sample barcode; the remaining two masks assign the 8 cycles of the i7 index read to the second sample barcode and 75 cycles to the template read. The second option keeps the 12 bp UMI sequences from being identified as adaptor sequence and subsequently masked to “NNNNNNNNNNNN”. To prevent misassignment of a small number of reads (< 3%) to the wrong sample, we recommend allowing zero mismatches for barcode assignment by using the third option, “--barcode-mismatches 0”.
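As an illustration, the three options can be combined into a single call as sketched below. The three paths are hypothetical placeholders, and “--runfolder-dir”, “--output-dir” and “--sample-sheet” are general bcl2fastq options that are not part of the recommendations above; confirm them against “bcl2fastq --help” for your installed version. The sketch only prints the command unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Hypothetical locations: replace with your own run folder, output folder,
# and the bcl2fastq-specific sample sheet described in step 5.1.1.
RUN_DIR="${RUN_DIR:-/path/to/runfolder}"
OUT_DIR="${OUT_DIR:-/path/to/fastq_output}"
SAMPLE_SHEET="${SAMPLE_SHEET:-/path/to/bcl2fastq_sample_sheet.csv}"

# Build the call as an array so that every option stays a separate argument.
# The last three options are the ones recommended in this section.
bcl2fastq_cmd=(
  bcl2fastq
  --runfolder-dir "$RUN_DIR"
  --output-dir "$OUT_DIR"
  --sample-sheet "$SAMPLE_SHEET"
  --use-bases-mask y12i8,i8,y75
  --mask-short-adapter-reads 10
  --barcode-mismatches 0
)

# Show the assembled command; run it only when DRY_RUN=0 is set explicitly.
printf '%s\n' "${bcl2fastq_cmd[*]}"
if [ "${DRY_RUN:-1}" = "0" ]; then
  "${bcl2fastq_cmd[@]}"
fi
```

Printing the command before running it makes it easy to check that the mask and the sample sheet path are the intended ones.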

--use-bases-mask

The “--use-bases-mask” string specifies how to use each of the sequencing cycles:

  • An “n” means ignore the cycle.
  • A “Y” (or “y”) means the cycle is used as a template read.
  • An “I” (or “i”) means the cycle is used as an index read.
  • A number means that the previous character is repeated that many times.
  • An asterisk “*” means that the previous character is repeated until the end of this read or index (length according to the RunInfo.xml).

Each read mask is separated by a comma: “,”. NEBNext Direct GS libraries have three read masks. Note that as the sample index and UMI are combined in the same read mask, there is no comma separating them.
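The rules above can be checked with a small helper that expands a mask into its segments. This is a minimal sketch written for this guide, not part of bcl2fastq; the function name describe_mask is hypothetical, and an asterisk is reported as “to-end-of-read” because the actual length would come from RunInfo.xml.

```shell
#!/usr/bin/env bash
# describe_mask MASK
# Prints one line per mask segment: read number, role, number of cycles.
describe_mask() {
  local mask=$1 read_no=0 part rest role count
  local re='^([YyIiNn])([0-9]+|\*)?(.*)$'
  local -a parts
  # Commas separate the read masks.
  IFS=',' read -r -a parts <<< "$mask"
  for part in "${parts[@]}"; do
    read_no=$((read_no + 1))
    rest=$part
    while [ -n "$rest" ]; do
      if ! [[ $rest =~ $re ]]; then
        echo "unrecognised mask text: $rest" >&2
        return 1
      fi
      case ${BASH_REMATCH[1]} in
        [Yy]) role=template ;;
        [Ii]) role=index ;;
        [Nn]) role=ignored ;;
      esac
      # A letter without a number stands for a single cycle.
      count=${BASH_REMATCH[2]:-1}
      if [ "$count" = "*" ]; then
        count=to-end-of-read
      fi
      printf '%s\t%s\t%s\n' "$read_no" "$role" "$count"
      rest=${BASH_REMATCH[3]}
    done
  done
}

describe_mask "y12i8,i8,y75"
```

For the NEBNext Direct GS mask this lists four segments in three read masks: 12 template cycles (the UMI) and 8 index cycles in the first read mask, 8 index cycles in the second, and 75 template cycles in the third.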

Output

Running bcl2fastq as described above will produce two fastq files: the R1 fastq file will contain the UMI sequences and the R2 fastq file will contain the Read 2 sequences.

 

5.2 Guidelines for Demultiplexing NEBNext Direct GS Libraries Using Picard

Demultiplexing NEBNext Direct GS libraries with Picard tools requires two processing steps. The first step (Step 5.2.2) is to assign each cluster on the sequencer to the appropriate barcode, and the second step (Step 5.2.3) is demultiplexing into fastq or unmapped BAM files.

Read Structure

Steps 5.2.2 and 5.2.3 require a Read Structure, which is a string that defines the mapping of sequencing cycles into sequencing reads, sample indexes, and UMIs. The linked page provides a good description of the concept as well as a number of examples. All read lengths need to be specified explicitly, as Picard does not support variable length operators.

The Read Structure for a NEBNext Direct GS library: 12M8B8B75T
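To see how this string partitions the run, the sketch below splits a Read Structure into its segments and sums the cycles. It assumes Picard's usual operator letters (T for template, B for sample barcode, M for molecular index, S for skipped cycles); the function name describe_read_structure is hypothetical and the helper is not part of Picard.

```shell
#!/usr/bin/env bash
# describe_read_structure STRUCTURE
# Prints "cycles<TAB>role" for each segment, then the total number of cycles.
describe_read_structure() {
  local rs=$1 total=0 role
  local re='^([0-9]+)([TBMS])(.*)$'
  while [ -n "$rs" ]; do
    # Every segment must be an explicit length followed by one operator letter.
    if ! [[ $rs =~ $re ]]; then
      echo "cannot parse read structure at: $rs" >&2
      return 1
    fi
    case ${BASH_REMATCH[2]} in
      T) role="template" ;;
      B) role="sample barcode" ;;
      M) role="molecular index (UMI)" ;;
      S) role="skipped" ;;
    esac
    printf '%s\t%s\n' "${BASH_REMATCH[1]}" "$role"
    # 10# forces base 10 so a length with a leading zero is not read as octal.
    total=$((total + 10#${BASH_REMATCH[1]}))
    rs=${BASH_REMATCH[3]}
  done
  printf 'total\t%s\n' "$total"
}

describe_read_structure 12M8B8B75T
```

For 12M8B8B75T the total is 103 cycles (12 + 8 + 8 + 75), which should equal the number of cycles actually sequenced in the run.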

5.2.1. Prepare a sample_metadata.txt file.

Steps 5.2.2 and 5.2.3 require sample metadata in a format that differs from an Illumina sample sheet. Each step has its own requirements for identifying the metadata, so there are some columns with redundant information. The sample barcodes are defined in the order in which they are read by the sequencer. For NEBNext Direct GS libraries, the inline barcode is defined as barcode1 and the i7 barcode is defined as barcode2.

To download a sample_metadata.txt file for use with the Picard tools in this section, refer to the “Usage Guidelines” located in the “Protocols, Manuals & Usage” tab on the NEBNext Direct Genotyping Solution Target Enrichment Kit (NEB #E9530), www.neb.com/E9530. This file can be edited in Excel and saved as a tab-delimited text file (.txt).
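After editing the file in a spreadsheet program, it can be worth confirming that it is still cleanly tab delimited. The check below is a generic sketch that makes no assumption about the column names in sample_metadata.txt; the function name check_tab_delimited is hypothetical. It flags rows whose field count differs from the header and Windows (CRLF) line endings, which may leave a stray carriage return in the last column.

```shell
#!/usr/bin/env bash
# check_tab_delimited FILE
# Returns non-zero if the header has fewer than two tab-separated columns,
# if any row has a different number of fields than the header,
# or if the file uses Windows (CRLF) line endings.
check_tab_delimited() {
  awk -F '\t' '
    NR == 1 { expected = NF }
    NR == 1 && NF < 2 { print "header has fewer than two tab-separated columns"; bad = 1 }
    /\r$/ && !crlf { print "CRLF line endings found; convert to Unix line endings"; crlf = 1 }
    NF != expected { printf "line %d: %d fields, expected %d\n", NR, NF, expected; bad = 1 }
    END { if (bad || crlf) exit 1 }
  ' "$1"
}

# Usage with the file name from step 5.2.1, if it is in the current directory.
if [ -f sample_metadata.txt ]; then
  check_tab_delimited sample_metadata.txt && echo "sample_metadata.txt looks consistent"
fi
```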

5.2.2. Assign clusters to barcodes.

Assignment of clusters to barcodes is performed by the Picard tool ExtractIlluminaBarcodes. An example call is shown below. This will write the barcode files to the BASECALLS_DIR, compress the output barcode files and use sixteen processors. To use all available processors on a system, set NUM_PROCESSORS=0. Note that the heap size specified (the “-Xmx48g” in the command) should be 2-4G per processor to be used. To prevent misassignment of a small number of reads (< 3%) to the wrong sample, we recommend allowing zero mismatches for barcode assignment by using the option “MAX_MISMATCHES=0”.

java -Xmx48g -jar \
picard.jar ExtractIlluminaBarcodes \
VALIDATION_STRINGENCY=SILENT \
BASECALLS_DIR=Data/Intensities/BaseCalls \
OUTPUT_DIR=Data/Intensities/BaseCalls \
LANE=1 \
READ_STRUCTURE=12M8B8B75T \
BARCODE_FILE=sample_metadata.txt \
METRICS_FILE=Data/Intensities/BaseCalls/barcode_counts.lane-1.metrics.txt \
COMPRESS_OUTPUTS=true \
NUM_PROCESSORS=16 \
MAX_MISMATCHES=0

 

Note that the “BARCODE_FILE” argument points to the sample_metadata.txt file described in step 5.2.1. The “METRICS_FILE” is an output file that has summary statistics about the assignment of clusters to barcodes.
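The heap-size guidance in this step (2-4G per processor) can be turned into a small calculation. The helper below is a sketch with a hypothetical name, heap_flag; it defaults to 3G per processor, which is the ratio used in the example calls in steps 5.2.2 and 5.2.3, and it expects an explicit processor count rather than NUM_PROCESSORS=0.

```shell
#!/usr/bin/env bash
# heap_flag PROCESSORS [GB_PER_PROCESSOR]
# Prints a Java -Xmx option sized at 2, 3 or 4 GB per processor (default 3).
heap_flag() {
  local procs=$1 per=${2:-3}
  case $procs in
    ''|*[!0-9]*|0)
      echo "PROCESSORS must be a positive integer" >&2
      return 1 ;;
  esac
  case $per in
    2|3|4) ;;
    *)
      echo "GB_PER_PROCESSOR must be 2, 3 or 4" >&2
      return 1 ;;
  esac
  # 10# forces base 10 so a value such as 08 is not read as octal.
  printf -- '-Xmx%dg\n' "$((10#$procs * per))"
}

heap_flag 16     # -Xmx48g, matching the ExtractIlluminaBarcodes call
heap_flag 12     # -Xmx36g, matching the IlluminaBasecallsToFastq call
heap_flag 16 2   # -Xmx32g, the lower bound for sixteen processors
```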

5.2.3. Demultiplex samples.

Demultiplexing can be performed by either the Picard tool IlluminaBasecallsToFastq, if you prefer the output to be a fastq file, or IlluminaBasecallsToSam if you prefer an unmapped SAM/BAM file as output. An example call that will write to a fastq file is shown below. There are small differences between the arguments of these two tools. Please see the linked pages for a full description of the available options.

java -Xmx36g -jar \
picard.jar IlluminaBasecallsToFastq \
VALIDATION_STRINGENCY=SILENT \
BASECALLS_DIR=Data/Intensities/BaseCalls \
LANE=1 \
RUN_BARCODE=000000000-BHRLY \
NUM_PROCESSORS=12 \
READ_STRUCTURE=12M8B8B75T \
MULTIPLEX_PARAMS=sample_metadata.txt \
READ_NAME_FORMAT=ILLUMINA \
INCLUDE_NON_PF_READS=false

Note that the “MULTIPLEX_PARAMS” argument points to the sample_metadata.txt file described above.