Large dataset enables prediction of repair after CRISPR–Cas9 editing in primary T cells


Understanding of repair outcomes after Cas9-induced DNA cleavage is still limited, especially in primary human cells. We sequence repair outcomes at 1,656 on-target genomic sites in primary human T cells and use these data to train a machine learning model, which we have called CRISPR Repair Outcome (SPROUT). SPROUT accurately predicts the length, probability and sequence of nucleotide insertions and deletions, and will facilitate design of SpCas9 guide RNAs in therapeutically important primary human cells.


Fig. 1: Overview of the method.
Fig. 2: SPROUT predicts DNA repair outcomes.

Data availability

All raw data and analyses are openly available through SRA (BioProject PRJNA486372) and figshare, respectively.

Code availability

The SPROUT software is publicly available. Code is also available in the Supplementary Code.


References

1. Fischbach, M. A., Bluestone, J. A. & Lim, W. A. Sci. Transl. Med. 5, 179ps7 (2013).
2. Simeonov, D. et al. Commun. Biol. 2, 70 (2019).
3. Hultquist, J. F. et al. Nat. Protoc. 14, 1–27 (2019).
4. Lindsay, H. et al. Nat. Biotechnol. 34, 701–702 (2016).
5. van Overbeek, M. et al. Mol. Cell 63, 633–646 (2016).
6. Brinkman, E. K. et al. Mol. Cell 70, 801–813 (2018).
7. Lemos, B. R. et al. Proc. Natl Acad. Sci. USA 115, E2040–E2047 (2018).
8. Deriano, L. & Roth, D. B. Annu. Rev. Genet. 47, 433–455 (2013).
9. Shen, M. W. et al. Nature 563, 646–651 (2018).
10. Allen, F. et al. Nat. Biotechnol. 37, 64–72 (2019).
11. Shin, H. Y. et al. Nat. Commun. 8, 15464 (2017).
12. Kosicki, M., Tomberg, K. & Bradley, A. Nat. Biotechnol. 36, 765–771 (2018).
13. Roth, T. L. et al. Nature 559, 405–409 (2018).
14. Simeonov, D. & Marson, A. Annu. Rev. Immunol. 37, 571–597 (2019).
15. Untergasser, A. et al. Nucleic Acids Res. 40, e115 (2012).
16. Magoč, T. & Salzberg, S. L. Bioinformatics 27, 2957–2963 (2011).
17. Bolger, A. M., Lohse, M. & Usadel, B. Bioinformatics 30, 2114–2120 (2014).
18. Li, H. & Durbin, R. Bioinformatics 25, 1754–1760 (2009).



Acknowledgements

This work was supported by the Chan–Zuckerberg Biohub. J.Z. was supported by a Chan–Zuckerberg Investigator grant and by National Science Foundation grant CRII 1657155. A.M. was supported by a National Institutes of Health (NIH)/NIDA Avenir New Innovator Award (DP2DA042423), NIH/NIGMS funding for the HIV Accessory and Regulatory Complexes (HARC) Center (P50 GM082250; to A.M. and N.J.K.) and gifts from J. Aronov, B. Bakar, K. Jordan, F. Caufield and D. Wolkoff. A.M. holds a Career Award for Medical Scientists from the Burroughs Wellcome Fund, has received funding from the Innovative Genomics Institute (IGI) and the Parker Institute for Cancer Immunotherapy (PICI) and is an investigator at the Chan–Zuckerberg Biohub. A.A. was supported by NIH grant 7R01HG008164-04 and the Stanford Data Science Initiative. J.H. was supported by the UCSF Medical Scientist Training Program. We thank N. Neff and R. Sit for assistance collecting sequence data and A. Sellas for laboratory support. We also thank J. Palacios for statistical discussions.

Author information




Contributions

R.T.L., A.A., J.H., A.M., A.P.M. and J.Z. designed research. R.T.L., A.A., J.H., D.T., T.L.R., R.A., E.S., J.F.H., N.K., Z.W., G.C., H.C. and M.D.L. conducted research. R.T.L., A.A., J.H., A.M., A.P.M., T.L.R. and J.Z. wrote the paper.

Corresponding authors

Correspondence to Alexander Marson or Andrew P. May or James Zou.

Ethics declarations

Competing interests

A.M. is a co-founder of Spotlight Therapeutics. A.M. has served as an advisor to Juno Therapeutics, is a member of the scientific advisory board at PACT Pharma and is an advisor to Sonoma Biotherapeutics. The Marson laboratory has received sponsored research support from Juno Therapeutics, Epinomics and Sanofi, and a gift from Gilead. A.M. and T.L.R. are co-founders of Arsenal Biosciences and T.L.R. is chief scientific officer of the company.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Number of donors per site.

Histogram showing the distribution of unique blood donors per SpCas9 cut site in the T cell data.

Supplementary Figure 2 Distribution of repair outcomes.

(a) Distribution of the edit efficiency (left) and indel diversity (right) of the repair outcomes in T cells. We use the entropy of the distribution of the reads over the indel types as a metric to quantify the diversity of the repair outcomes. If there is exactly one repair outcome in all of the reads, then the entropy is 0. Higher entropy means that the repair outcomes are more diverse. (b) Distribution of the fraction of total reads with an insertion (left) and deletion (right) in T cells. (c) Distribution of the average insertion length given an insertion (left) and average deletion length given a deletion (right) in the repair outcomes of T cells.
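The entropy metric in panel (a) can be illustrated with a short sketch. This is not the authors' code: the function name `indel_entropy`, the string labels for indel types, and the choice of base-2 logarithm (bits) are assumptions made here for illustration, since the caption does not state the log base.

```python
import math
from collections import Counter


def indel_entropy(read_outcomes):
    """Shannon entropy (in bits, an assumed unit) of reads over indel types.

    `read_outcomes` holds one label per sequencing read, e.g. "1bp_ins".
    """
    counts = Counter(read_outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one read is required")
    probabilities = [count / total for count in counts.values()]
    return sum(-p * math.log2(p) for p in probabilities)


# A site where every read carries the same outcome has entropy 0.
single_outcome = ["1bp_ins"] * 100
# A site split evenly between two outcomes is more diverse.
two_outcomes = ["1bp_ins"] * 50 + ["2bp_del"] * 50

print(indel_entropy(single_outcome))
print(indel_entropy(two_outcomes))
```

The first site matches the caption's statement that a single repair outcome gives an entropy of 0; any mixture of outcomes gives a strictly larger value.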

Supplementary Figure 3 Repair outcome similarity.

Jaccard similarity between the top 20 indels of 250 randomly selected SpCas9-targeted sequencing experiments in T cells (of a total of 3,989). Experiments performed on cells from different individuals at the same cut site are placed next to each other in the heatmap. These biological replicates show greater Jaccard similarity in repair outcomes than experiments at distinct cut sites, as can be seen in the darker blocks along the diagonal.
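As a rough illustration of the comparison described here, the sketch below computes the Jaccard similarity between the most frequent indels of two experiments. The helper names `top_indels` and `jaccard`, the string labels, and the convention that two empty sets have similarity 1.0 are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter


def top_indels(read_outcomes, k=20):
    """Return the set of the k most frequent indel labels in one experiment."""
    return {indel for indel, _ in Counter(read_outcomes).most_common(k)}


def jaccard(set_a, set_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = set_a | set_b
    if not union:
        return 1.0  # assumed convention for two empty sets
    return len(set_a & set_b) / len(union)


# Two hypothetical replicate experiments at the same cut site.
donor_1 = ["1bp_ins_A"] * 40 + ["2bp_del"] * 30 + ["7bp_del"] * 5
donor_2 = ["1bp_ins_A"] * 35 + ["2bp_del"] * 25 + ["1bp_del"] * 8

print(jaccard(top_indels(donor_1), top_indels(donor_2)))
```

Computing this value for every pair of experiments yields the kind of symmetric matrix that a heatmap such as the one described can display.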

Supplementary Figure 4 Indel predictions.

Spearman’s rank correlation and the Pearson correlation coefficient were calculated to measure the performance of SPROUT in predicting insertion-containing reads. The mean and standard deviation of the cross-validation results are shown in the table. In each cross-validation, 304 T cell sites and 96 sites in each of the cell lines were used for testing.

Supplementary Figure 5 Example predictions of SPROUT.

Performance of SPROUT on three representative target sites with high, low and medium fractions of insertions. Predicted features are described in the SPROUT column, while actual measured features are presented in the experimental validation column.

Supplementary Figure 6 Ranking cut sites.

SPROUT’s performance in ranking guides within a gene based on predicted repair outcome. (a) Schematic of the guide ranking experiment. Assuming a gene with three potential guides (Guide 1, Guide 2, Guide 3), SPROUT ranks the guides by their likelihood of producing a single-nucleotide insertion (or deletion). In this illustration the algorithm predicts that Guide 2 produces the largest number of reads with a 1-bp insertion/deletion. (b) Guide ranking performance on T cells. The algorithm was trained on 435 genes and tested on the remaining 108 genes. Kendall’s tau (in the range [−1, 1]) measures the rank correlation (higher is better and zero indicates no correlation), “SPROUT (# genes)” indicates the number of genes for which SPROUT predicted exactly the correct ranking across all the guides, and “Rnd Shuffle (# genes)” indicates the number of genes predicted correctly by naïve guessing. (c) Guide ranking performance on HEK293. (d) Guide ranking performance on K562. (e) Guide ranking performance on HCT116. For parts (c–e) the model was trained on all T cell genes and tested on 28 genes from these other cell types. (f) For each gene, we order the target sites from the most likely to the least likely to introduce a frameshift outcome, using SPROUT predictions. The table reports the fraction of genes where SPROUT correctly predicts the top target site, the fraction where SPROUT correctly predicts the complete ordering of all the target sites in the gene, and the correlation between the SPROUT predictions and the experimental validations. The same metrics for random prediction are reported as baselines. Bootstrap mean and standard deviation are shown in the table.
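The two ranking metrics in this caption, Kendall's tau and the exact-ranking match, can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: it uses the tau-a variant (ties count as neither concordant nor discordant), and the function names and the example values are hypothetical.

```python
from itertools import combinations


def kendall_tau(predicted, observed):
    """Kendall's tau-a between two equally long lists of per-guide scores."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        sign = (predicted[i] - predicted[j]) * (observed[i] - observed[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / n_pairs


def ranking_is_exact(predicted, observed):
    """True when both score lists order the guides identically."""
    def order(scores):
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order(predicted) == order(observed)


# Hypothetical 1-bp insertion fractions for three guides in one gene.
predicted = [0.10, 0.50, 0.30]
observed = [0.30, 0.60, 0.20]

print(kendall_tau(predicted, observed))
print(ranking_is_exact(predicted, observed))
```

In this example both lists agree that Guide 2 ranks first, but they disagree on the order of the other two guides, so the ranking is not an exact match even though the rank correlation is positive.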

Supplementary Figure 7 Nucleotide analysis.

(a) Average fraction of indel mutant reads with insertion in target sites grouped by their nucleotide type at location −1 (adjacent to the cut site on the 5′ side). Presence of C or G at location −1 is significantly correlated with a higher deletion proportion (p < 0.004, two-sided t-test) and presence of A or T is significantly correlated with a higher insertion proportion (p < 10⁻⁶, two-sided t-test), consistently across all cell types. We show the results for T cells (left, 1,521 sites) and the aggregate results for HEK293, K562 and HCT116 (right, 96 sites). The results in this supplementary figure are normalized for the background distribution of nucleotide types: we divided by the number of occurrences of each nucleotide when computing the average fraction of indel mutant reads with insertion. As additional controls, we performed the same analysis at the −2, +1 and +2 locations and did not find significant differences in insertion fraction by nucleotide type (p > 0.1, two-sided t-test). (b) Average fraction of indel mutant reads with insertion conditioned on the nucleotide at position +3 (the last nucleotide before, i.e. 5′ of, the PAM sequence). The presence of A at +3 is correlated with a higher fraction of insertions. The analyses here differ from and complement Fig. 2e. The SPROUT importance scores in Fig. 2e capture the nonlinear model’s overall prediction of the impact of each nucleotide and position. The results here ignore the effects of other positions and plot the conditional insertion fractions. Even though the methods are different, the feature importance scores and the conditional fractions give consistent biological findings. The mean and standard error of the mean (SEM) are shown in the tables.
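The grouping described in this caption can be sketched as below. The code assumes a 20-nt protospacer with the SpCas9 cut 3 bp upstream of the PAM, so that positions +1 to +3 are the last three protospacer bases and −1 is the base just 5′ of the cut; the function names and the example sequences are hypothetical. Dividing each group's sum by its number of sites corresponds to the normalization by nucleotide occurrence mentioned in the caption.

```python
from collections import defaultdict

CUT_INDEX = 17  # assumed: cut falls between 0-based indices 16 and 17


def index_for_position(position, cut_index=CUT_INDEX):
    """Map a cut-relative position (−2, −1, +1, +2, +3) to a 0-based index."""
    if position == 0:
        raise ValueError("positions are numbered without a zero")
    return cut_index + position if position < 0 else cut_index + position - 1


def mean_insertion_fraction_by_base(sites, position=-1):
    """Average insertion fraction per nucleotide at the given position.

    `sites` is an iterable of (protospacer, insertion_fraction) pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    idx = index_for_position(position)
    for protospacer, insertion_fraction in sites:
        base = protospacer[idx].upper()
        sums[base] += insertion_fraction
        counts[base] += 1
    return {base: sums[base] / counts[base] for base in sums}


# Hypothetical sites: 16 fixed bases, the base at −1, then three more bases.
prefix = "ACGTACGTACGTACGT"
sites = [
    (prefix + "T" + "CCC", 0.6),
    (prefix + "T" + "CCC", 0.4),
    (prefix + "G" + "CCC", 0.1),
]
print(mean_insertion_fraction_by_base(sites, position=-1))
```

The same function applied with `position=3` gives the conditional fractions of panel (b); significance testing is left out of this sketch.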

Supplementary Figure 8 Homopolymer analysis.

(a) Average fraction of indel mutant reads with deletion in target sites grouped by their homopolymer types. HomP(A) corresponds to target sites that have at least two consecutive A nucleotides adjacent to the cut site, and similarly for HomP(C), HomP(G) and HomP(T). No HomP indicates the remaining target sites, which have no homopolymers. We show the results for T cells (top, 1,521 sites) and the aggregate results for HEK293, K562 and HCT116 (bottom, 96 sites). The mean and standard error of the mean are shown. (b) SPROUT was evaluated with MMEJ and non-MMEJ deletions, and the accuracies are reported here. On T cells we observe improved prediction accuracy for the non-MMEJ model. We also see better generalization performance of the MMEJ model than of the non-MMEJ model in cancer cell lines. Bootstrap mean and standard deviation are shown in the table.
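Several tables in these captions report a bootstrap mean and standard deviation. The sketch below shows one common way to obtain such values; the number of resamples, the fixed seed, and the reading of "standard deviation" as the spread of the resampled means are assumptions made here, since the captions do not specify the procedure.

```python
import random
import statistics


def bootstrap_mean_sd(values, n_resamples=1000, seed=0):
    """Mean and standard deviation of the bootstrap distribution of the mean.

    Each resample draws len(values) items with replacement.
    """
    if n_resamples < 2:
        raise ValueError("need at least two resamples")
    rng = random.Random(seed)
    resampled_means = [
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    ]
    return statistics.fmean(resampled_means), statistics.stdev(resampled_means)


# Hypothetical per-site accuracies (1 = correct, 0 = incorrect).
per_site_correct = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
mean, sd = bootstrap_mean_sd(per_site_correct)
print(f"{mean:.2f} ± {sd:.2f}")
```

Fixing the seed makes the reported values reproducible; a different seed or resample count would shift them slightly.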

Supplementary Figure 9 Methods comparison.

(a) Comparison of the performance of SPROUT, inDelphi and FORECasT in predicting frameshift, precision (defined as one minus the entropy of the deletion frequency) and fraction of indel mutant reads with insertion on four independent validation sets. The results for frameshift and precision are reported in terms of accuracy and the results for fraction of insertions are reported in terms of R². (b) Scatter plot of the experimentally observed versus predicted fraction of indel mutant reads with insertion in T cells. (c) Comparison of the performance of SPROUT with inDelphi and FORECasT in ranking guides within randomly selected genes in the T cell data by the fraction of frameshift outcomes and 1-bp insertions. The original T cell set has 304 sites, new T cells I has 32 sites, new T cells II has 182 sites and the iPSC set has 30 sites. Bootstrap mean and standard deviation are shown in each table.

Supplementary Figure 10 Insertion distribution.

(a) Histogram of the distances of long insertions from the target cut sites. (b) Distribution of the length of the aligned insertions in T cells. (c) Histogram of the distance of the insertion donor sites to the cut sites in intra-chromosomal long insertions. The x axis shows distance in bases on a log10 scale.

Supplementary Figure 11 Long insertion analysis.

Genomic DNA sequences from sites in physical proximity can be inserted at SpCas9 cut sites. (a) All insertions of 25 bases or longer were identified and plotted. (b) Number of cut sites with at least one (aligned) long insertion plotted against the insertion length. (c) Average similarity of the insertion location within the same cut site and gene was compared across donors. This was also performed on a shuffled set of insertions as a control. Average genomic distance quantifies the distance between the donor sites of the long insertions that originate from the same chromosome. Bootstrap mean and standard deviation of 1,521 sites are shown in the table. (d) Quantification of the HiC contact data at the long insertions. Neighboring blocks as well as a randomly selected block were used as controls. The HiC block size was also varied. Bootstrap mean and standard error of the mean are shown in the table. (e) HiC chromosome contact maps were directly compared to the aligned long insertions. The P value was computed using a t-test.

Supplementary Figure 12 Overlap with chromatin states.

Overlap between the insertion donor sites and 15 core chromatin states. We measured the percentage of insertion donor sites that fall within each of the 15 chromatin states (“% donor sites”) across the 1,521 sites. The chromatin states were obtained for primary CD4+ T cells (E043) from the Human Epigenome Roadmap. Here we considered only the long insertions that are aligned to a different chromosome from the SpCas9 target site, to avoid potential confounding due to the target sites being in exons. As a background control, we randomly shuffled each aligned insertion within a ±500-kb window centered at its original location (i.e. the donor site), and report the percentage of the shuffled sites that overlap each chromatin state (“% shuffled sites”). Insertion donors are significantly enriched for chromatin states associated with enhancers and transcription (states 1–7, colored red) compared to the control (P < 10⁻⁵, two-sided t-test). Altogether, 35.8% of inter-chromosome donor sites come from one of states 1 to 7, compared to 29.6% of the control sites.
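The background control described here, relocating each aligned insertion at random within a ±500-kb window around its donor site, might be sketched as follows. The function name, the coordinate convention (0-based start, end exclusive), the uniform sampling, and the clamping at the chromosome start are illustrative assumptions rather than details from the paper.

```python
import random


def shuffle_within_window(donor_start, insertion_length, window=500_000, rng=random):
    """Pick a new start uniformly within ±window of the original donor start.

    Returns (start, end) with the insertion length preserved.
    """
    lowest = max(0, donor_start - window)  # do not run off the chromosome start
    highest = donor_start + window
    new_start = rng.randint(lowest, highest)  # inclusive on both ends
    return new_start, new_start + insertion_length


# Hypothetical donor site: a 40-bp insertion aligned at position 1,000,000.
rng = random.Random(0)
controls = [shuffle_within_window(1_000_000, 40, rng=rng) for _ in range(5)]
print(controls)
```

The percentage of such shuffled intervals that overlap each chromatin state would then serve as the “% shuffled sites” baseline.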

Supplementary Figure 13 Insertion analysis in cell lines.

(a) The average similarity and distance of the chromosomal positions between long insertions. We measure the similarity and distance for long insertions across biological samples at the same cut site (“Within cut site”), across different cut sites within the same gene (“Within gene”), and across random pairs of cut sites (“Shuffled control”). We report the results for each of three previously published datasets¹, with 96 sites each. Bootstrap mean and standard deviation are shown. (b) The HiC contact map at the insertion site locations compared to three control cases in two other cell types (HEK293 and K562) across different HiC block sizes for insertions longer than 25 nucleotides¹. The first two controls average the HiC contact map in the neighboring blocks of the insertion donor and cut site. The third control averages the HiC score among random blocks in the same cut site–donor site chromosome pairs. Bootstrap mean and standard error of the mean are shown.

Supplementary Figure 14 Prediction features.

The top table lists all of the features used by SPROUT to predict frameshift, insertion, deletion and repair entropy. The gRNA and PAM were one-hot encoded as features. We also explored including additional chromatin features (second table). These features did not significantly change SPROUT’s performance for these tasks and are not included in the final model. We additionally explored using longer flanking sequences (up to 50 bp) and DNA melting temperature as features. They did not improve SPROUT’s performance and are not used in the model.
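One-hot encoding of a guide and PAM, as mentioned in this caption, can be sketched as below. The A/C/G/T column order, the flat list output, and the example spacer sequence are assumptions for illustration; the actual SPROUT feature layout is defined in the Supplementary Code.

```python
NUCLEOTIDES = "ACGT"  # assumed column order


def one_hot(sequence):
    """Encode a DNA sequence as a flat list with four 0/1 values per base."""
    features = []
    for base in sequence.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected base: {base!r}")
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features


# Hypothetical 20-nt spacer followed by an NGG PAM (here TGG).
spacer = "GACGTTACCGGATCAATGCA"
pam = "TGG"
features = one_hot(spacer + pam)
print(len(features))  # 23 bases x 4 values per base
```

Each base contributes exactly one non-zero entry, so a tree-based or linear model can weight every nucleotide at every position independently.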

Supplementary Figure 15 Prediction dependence on sample size.

Performance of SPROUT as the training size increases, measured by R². Prediction performance saturates at 5-fold cross-validation. The leave-one-out validation performance is 0.60 ± 0.01. Error bars show the standard deviation across the folds and the curve indicates the mean values.

Supplementary information

Supplementary Information

Supplementary Figs. 1–15 and Supplementary Note

Reporting Summary

Supplementary Code


About this article

Cite this article

Leenay, R.T., Aghazadeh, A., Hiatt, J. et al. Large dataset enables prediction of repair after CRISPR–Cas9 editing in primary T cells. Nat Biotechnol 37, 1034–1037 (2019).
