Navigating 10x Barcoded BAM Files
Note: 10x Genomics does not provide support for community-developed tools and makes no guarantees regarding their function or performance. Please contact tool developers with any questions. If you have feedback about Analysis Guides, please email analysis-guides@10xgenomics.com.
Introduction
Cell Ranger's gene expression algorithm aligns and filters reads to determine which UMIs are included in the filtered feature-barcode matrix to be used in downstream analyses. This tutorial walks through one method for obtaining the counts from the filtered feature barcode matrix starting with the 10x BAM file (i.e. possorted_genome_bam.bam). Cell Ranger generates two matrices as output from the pipeline. This tutorial will focus on the filtered version.
Type | Description |
---|---|
Unfiltered feature-barcode matrix | Contains every barcode from the fixed list of known-good barcode sequences that has at least 1 read. |
Filtered feature-barcode matrix | Contains only detected cellular barcodes. |
The filtered feature-barcode matrix has the cell barcode as the column header, feature name as the row header and the feature count per barcode as the body of the matrix.
This tutorial will not rebuild the matrix but instead it will provide the information needed to do so if desired.
Prerequisite Skills
- Basic knowledge of Universal Molecular Identifiers (UMIs)
- Familiarity with the Linux command line
- Familiarity with the BAM/SAM standard
- Capacity to run Cell Ranger, see System Requirements
Software and Packages Used
- samtools (included in Cell Ranger install)
Learning Goals/Objectives
- Identify differences between the filtered and unfiltered barcode matrixes
- Understand the structure of Filtered Feature Barcode Matrix
- Use the
xf
flag to extract BAM records that are used for UMI counting - Determine the number of times a feature is counted for a cell associated barcode
Data
The data used in this tutorial came from the 10x Genomics public datasets repository and they represent 1000 Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor. Libraries were generated with 10x v3 chemistry and data were analyzed with Cell Ranger version 3.0. These data were selected for demonstration purposes. Larger datasets may take longer to analyze.
1k PBMCs from a Healthy Donor (v3 chemistry)
File Type | File Size | md5sum |
---|---|---|
Genome-aligned BAM | 4.79 GB | b47f986a532d4c4ecd351511b345dae8 |
Feature / cell matrix (filtered) | 9.61 MB | 0bb2f675b59c3d208200aea57a31044b |
Please find data download links and instructions below in the 'Full tutorial'.
Format
The commands in this tutorial are meant to be copy/pasted.
Some of the commands listed in this tutorial contain pipes "|", which sends the output of the first command to the input of the second command without writing the output to disk. This is especially useful in bioinformatics when the datafiles that we work with are large. For example, this will list the files in the current directory, then count them.
ls -l | wc -l
For some of the long conmands containing pipes in this tutorial, we dissect each piece of the command for you using a comment, noted with a "#". The lines preceded by a # are not meant to be run. But if you accidentally run them, nothing will be executed because the # at the beginning makes everything after it silent. On these comment lines, the piece of the command is listed on the left followed by the description of what it does on the right following a tab, like this:
# samtools view pbmc_1k_v3_possorted_genome_bam_filterd.bam View and convert BAM to SAM
Sometimes the line before a command will be commented to explain what the following command does. For example:
# Extract the BAM header and write to header_filted_sam samtools view -H pbmc_1k_v3_possorted_genome_bam.bam > header_filted_sam
Quickstart
Determine Which Barcodes are Cell-Associated
samtools view pbmc_1k_v3_possorted_genome_bam_filterd.bam | cut -f 12- | tr "\t" "\n" | grep "CB:Z:" | uniq > cell_barcodes.txt # samtools view pbmc_1k_v3_possorted_genome_bam_filterd.bam View and convert BAM to SAM # grep 'xf:i:25' Find counted UMIs # cut -f 12- Remove first 12 columns SAM file # tr "\t" "\n" Replace all tabs with a new line # grep 'CB:Z:' Find lines that start with CB:Z
Use Cell-Associated Barcodes to Extract Feature Counts per Barcode
samtools view pbmc_1k_v3_possorted_genome_bam.bam | grep "xf:i:25" | cut -f 12- | grep "CB:Z:CGAGAAGTCAAGTTGC" | tr "\t" "\n" | grep "GX:Z" |awk "{count[$1]++} END {for(j in count) print count[j], j}" | sort -nr > feature_counts_per_barcode.txt # samtools view pbmc_1k_v3_possorted_genome_bam_filterd.bam View and convert BAM to SAM # grep "xf:i:25" Grep counted reads # cut -f 12- Remove first 12 columns SAM file # grep "CB:Z:CGAGAAGTCAAGTTGC" Print lines with "CB:Z:CGAGAAGTCAAGTTGC". ***Replace this with your barcode of interest*** # tr "\t" "\n" Replace all tabs with a new line # grep "CB:Z:"" Grep lines that start with CB:Z # awk "{count[$1]++} END {for(j in count) print count[j], j}" Count each occurrence of a barcode # sort -nr Sort numerically and in reverse order
Full Tutorial
For this tutorial we will use one of the 10x public datasets. You can download the BAM file and count matrix directly by clicking the file name or using a command line downloading utility (e.g. wget
or curl
) as demonstrated below.
wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_possorted_genome_bam.bam wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_filtered_feature_bc_matrix.tar.gz # Unpack the pbmc_1k_v3_analysis.tar.gz files tar -xzvf pbmc_1k_v3_filtered_feature_bc_matrix.tar.gz
Let's start by parsing our BAM files to obtain our count matrix.
Going from the BAM file to the filtered feature barcode matrix requires the following steps:
- Identify cell associated barcodes
- Determine the number of times a feature is counted for each barcode
Let's get started.
1. Identify Cell-Associated Barcodes
For a UMI to be counted it must meet specific criteria:
- Have a MAPQ score of 255 (see the STAR manual, section 4.2).
- Maps to exactly one gene (as shown in the
GX
tag of the BAM file) - Overlaps an exon by at least 50% in a way consistent with annotated splice junctions and strand annotation. Records that align to exons will have an
RE:A:E
tag. - Remove any records with matching UMI and Barcode values that map to different genes.
In Cell Ranger 3, 10x Genomics introduced, xf
, a bitwise alignment flag and the bits are as follows:
Name | Value | Description |
---|---|---|
CONF_MAPPED | 1 | The read is confidently mapped to the transcriptome. |
LOW_SUPPORT_UMI | 2 | This read's (BC, UMI, feature) combination was discarded in favor of a different feature with higher read support. |
GENE_DISCORDANT | 4 | This read pair maps to a discordant pair of genes, and is not treated as a UMI count |
UMI_COUNT | 8 | This read is representative for the molecule and can be treated as a UMI count |
CONF_FEATURE | 16 | Confidently assigned feature barcode |
FILTERED_TARGET_UMI | 32 | This read was removed by targeted UMI filtering. |
BAM records that contribute to UMI counting have an xf:i:25
tag (i.e. CONF_MAPPED + UMI_COUNT + CONF_FEATURE).
In this tutorial we will use the version of samtools
that is bundled with Cell Ranger. To use this samtools you can run the following command:
source /PATH/TO/CellRanger/sourceme.bash
replacing /PATH/TO/CellRanger/
with the path to your Cell Ranger install.
Next, filter the BAM file so that it only includes records with xf:i:25
. This will reduce the size of our BAM file and the time required to sort it.
# Filter BAM file for records with 'xf:i:25' and write that to body_filtered_sam. body_filtered_sam is on the sam format and will not include the BAM header. samtools view pbmc_1k_v3_possorted_genome_bam.bam | LC_ALL=C grep "xf:i:25" > body_filtered_sam # Extract the BAM header and write to header_filted_sam samtools view -H pbmc_1k_v3_possorted_genome_bam.bam > header_filted_sam # Combine the header and extracted records cat header_filted_sam body_filtered_sam > pbmc_1k_v3_possorted_genome_bam_filterd.sam # Convert SAM to BAM samtools view -b pbmc_1k_v3_possorted_genome_bam_filterd.sam > pbmc_1k_v3_possorted_genome_bam_filterd.bam
The xf flag takes care of:
- Collapsing UMIs
- Eliminating chimeric reads
- Identifying records with a MAPQ score of 255
- Eliminating reads that map to more than one gene
- Aligning to an exon by at least 50% in a way consistent with annotated splice junctions and strand annotation.
so we don't have to worry about them. Without the xf tag, several sorting and filtering steps would need to be performed.
Now that we've sorted our BAM such that it only includes records that contribute to UMI counting, let's extract the cell associated barcodes to use in the next step.
samtools view pbmc_1k_v3_possorted_genome_bam_filterd.bam | cut -f 12- | tr "\t" "\n" | grep "CB:Z:" | uniq > cell_barcodes.txt # cut -f 12- Remove first 12 columns SAM file # tr "\t" "\n" Replace all tabs with a new line # grep 'CB:Z:' Grep lines that start with CB:Z # uniq Omit repeated lines
Now we can take a quick look at this file:
# Print first 10 lines of cell_barcodes.txt head cell_barcodes.txt
Your output should look like this:
CB:Z:GCTGCAGGTTGACGGA-1 CB:Z:CTGCCTATCTTGATTC-1 CB:Z:TCACACCCAGCAGAAC-1 CB:Z:GCGAGAAAGTTGTAGA-1 CB:Z:GGTCTGGTCAGACATC-1 CB:Z:TGAGCGCCACTATGTG-1 CB:Z:GCTGAATCACACGGAA-1 CB:Z:CAACAACTCGATACTG-1 CB:Z:GTAATCGGTTGCGAAG-1 CB:Z:TTACTGTGTTAAGGGC-1
2. Determine the Number of Times a Feature Appears for Each Barcode
Now that we have identified our cell-associated barcodes we can look up the genes in each cell and how many times they are seen in our data. Below we do that using a single barcode from our list of cell associated barcodes. You can replace "CB:Z:CGAGAAGTCAAGTTGC"
with your cell barcode of interest.
samtools view pbmc_1k_v3_possorted_genome_bam_filterd.bam | cut -f 12- | grep "CB:Z:CGAGAAGTCAAGTTGC" | tr "\t" "\n" | grep "GX:Z" |awk '{count[$1]++} END {for(j in count) print count[j], j}' | sort -nr > feature_counts_extracted_bam.txt # samtools view pbmc_1k_v3_possorted_genome_bam.bam # # grep "xf:i:25" Print lines with xf:i:25 # cut -f 12- Remove first 12 columns SAM file # grep "CB:Z:"" Grep lines that start with CB:Z at the beginning of the line # tr "\t" "\n" Replace all tabs with a new line # grep "GX:Z" Print lines with GX:Z at the beginning of the line # awk "{count[$1]++} END {for(j in count) print count[j], j}" Count occurence of each barcode # sort -nr` Sort numerically and in reverse order
Now let's take a look at the file we just created:
# Print first 10 lines of feature_counts_per_barcode.txt head feature_counts_extracted_bam.txt
The output will look like this:
10763 GX:Z:ENSG00000211592 1018 GX:Z:ENSG00000198938 964 GX:Z:ENSG00000211896 850 GX:Z:ENSG00000198712 780 GX:Z:ENSG00000198804 671 GX:Z:ENSG00000251562 651 GX:Z:ENSG00000156508 615 GX:Z:ENSG00000211897 554 GX:Z:ENSG00000229117 546 GX:Z:ENSG00000198899
Column 1 is counts and column two is the Ensembl Gene ID proceeded by the GX
BAM tag and Z
tag type.
Tag | Type | Description |
---|---|---|
GX | Z | Semicolon-separated list of gene IDs that are compatible with this alignment. Gene IDs are specified with the gene_id key in the reference GTF attribute column. |
Validation
Now let's compare our counts to those found in our CSV file. This can be a bit of a cumbersome task because the filtered feature-barcode matrix is large and in a format that's not easy to manipulate so let's convert this to a CSV that's a bit easier to work with.
# Print line number along with contents of barcodes.tsv.gz and genes.tsv.gz zcat filtered_feature_bc_matrix/barcodes.tsv.gz | awk -F "\t" 'BEGIN { OFS = "," }; {print NR,$1}' | sort -t, -k 1b,1 > numbered_barcodes.csv zcat filtered_feature_bc_matrix/features.tsv.gz | awk -F "\t" 'BEGIN { OFS = "," }; {print NR,$1,$2,$3}' | sort -t, -k 1b,1 > numbered_features.csv # Skip the header lines and sort matrix.mtx.gz zcat filtered_feature_bc_matrix/matrix.mtx.gz | tail -n +4 | awk -F " " 'BEGIN { OFS = "," }; {print $1,$2,$3}' | sort -t, -k 1b,1 > feature_sorted_matrix.csv zcat filtered_feature_bc_matrix/matrix.mtx.gz | tail -n +4 | awk -F " " 'BEGIN { OFS = "," }; {print $1,$2,$3}' | sort -t, -k 2b,2 > barcode_sorted_matrix.csv # Use join to replace line number with barcodes and genes join -t, -1 1 -2 1 numbered_features.csv feature_sorted_matrix.csv | cut -d, -f 2,3,4,5,6 | sort -t, -k 4b,4 | join -t, -1 1 -2 4 numbered_barcodes.csv - | cut -d, -f 2,3,4,5,6 > matrix.csv # Remove temp files rm -f barcode_sorted_matrix.csv feature_sorted_matrix.csv numbered_barcodes.csv numbered_features.csv
Perfect, now let's take a look at the CSV file:
# Print first 10 lines of final_matrix.csv head matrix.csv
The output will look like this:
AAACCCAAGGAGAGTA-1,ENSG00000000938,FGR,Gene Expression,5 AAACCCAAGGAGAGTA-1,ENSG00000001631,KRIT1,Gene Expression,2 AAACCCAAGGAGAGTA-1,ENSG00000002834,LASP1,Gene Expression,1 AAACCCAAGGAGAGTA-1,ENSG00000002933,TMEM176A,Gene Expression,1 AAACCCAAGGAGAGTA-1,ENSG00000003056,M6PR,Gene Expression,1 AAACCCAAGGAGAGTA-1,ENSG00000003402,CFLAR,Gene Expression,4 AAACCCAAGGAGAGTA-1,ENSG00000003756,RBM5,Gene Expression,2 AAACCCAAGGAGAGTA-1,ENSG00000004059,ARF5,Gene Expression,1 AAACCCAAGGAGAGTA-1,ENSG00000004700,RECQL,Gene Expression,1 AAACCCAAGGAGAGTA-1,ENSG00000004961,HCCS,Gene Expression,1 AAACCCAAGGAGAGTA-1,ENSG00000004975,DVL2,Gene Expression,1 AAACCCAAGGAGAGTA-1,ENSG00000005020,SKAP2,Gene Expression,1 AAACCCAAGGAGAGTA-1,ENSG00000005022,SLC25A5,Gene Expression,2 AAACCCAAGGAGAGTA-1,ENSG00000005075,POLR2J,Gene Expression,2 AAACCCAAGGAGAGTA-1,ENSG00000005243,COPZ2,Gene Expression,1
Great! Now we have a simpler format to work with. Now let's pick some genes to look at. Here are some of my favorites:
Gene name | Ensembl ID |
---|---|
NF-kappa-B essential modulator (NEMO) | ENSG00000269335 |
HLA-B | ENSG00000234745 |
VEGFA | ENSG00000112715 |
To look at these genes, we will create a CSV file with this loop:
# For a string in list perform unix command for i in ENSG00000269335 ENSG00000234745 ENSG00000112715 do # Format and print in list, merge lines one at a time, and write to gene_ids.csv printf '%s\n' $i | paste -sd " " >> gene_ids.csv done
Then we will loop through the CSV file:
# For a string in gene_ids.csv perfom unix command for i in $(cat gene_ids.csv) do # Open final_matrix.csv and search for lines that include CGAGAAGTCAAGTTGC followed by string in gene_ids.csv cat matrix.csv | grep CGAGAAGTCAAGTTGC.*${i} done
The output will look like this:
CGAGAAGTCAAGTTGC-1,ENSG00000269335,IKBKG,Gene Expression,2 CGAGAAGTCAAGTTGC-1,ENSG00000234745,HLA-B,Gene Expression,108 CGAGAAGTCAAGTTGC-1,ENSG00000112715,VEGFA,Gene Expression,2
Now let's compare this to the counts we extracted in the steps above (i.e counts in feature_counts_per_barcode.txt):
# For a string in gene_ids.csv perfom unix command for i in $(cat gene_ids.csv) do # Open feature_counts_per_barcode.txt and search for string in gene_ids.csv cat feature_counts_extracted_bam.txt | grep ${i} done
The output will look like this:
2 GX:Z:ENSG00000269335 108 GX:Z:ENSG00000234745 2 GX:Z:ENSG00000112715
This gives us the number of times these features are counted in a specific cell-associated barcode.
Conclusion
In this tutorial, we have identified differences between the filtered and unfiltered barcode matrixes, explored the structure of the Filtered Feature Barcode Matrix, and used the xf flag to extract BAM records used for UMI counting. We hope that you have found this useful in exploring the data in your own BAM files.
Related Content:
How to convert 10x BAM files to FASTQ files while preserving the barcode information?
What is the AN tag in the BAM file from cellranger count?
How do I get the read counts for each barcode?
References
PBMCs from a Health Donor (v3), Single Cell Gene Expression Dataset by Cell Ranger 3.0.0, 10x Genomics, (2019, November 19).
Zheng, Grace X.Y., Terry, Jessica M., [...] Bielas, Jason H. (2017). Massively parallel digital transcriptional profiling of single cells. Nature Communications. 8: 1-12, doi: 10.1038, ncomms14049