DeepTCR.DeepTCR

DeepTCR_base

__init__(self, Name, max_length=40, device=0, tf_verbosity=3) special

Initialize Training Object.

Initializes object and sets initial parameters.

All DeepTCR algorithms begin with initializing a training object. This object will contain all methods, data, and results during the training process. One can extract learned features, per-sequence predictions, among other outputs from DeepTCR and use those in their own analyses as well.

This method is included in the three main DeepTCR objects:

  • DeepTCR_U (unsupervised)
  • DeepTCR_SS (supervised sequence classifier/regressor)
  • DeepTCR_WF (supervised repertoire classifier/regressor)
Parameters:
  • Name (str) – Name of the object. This name will be used to create folders with results as well as a folder with parameters and specifications for any models built/trained.

  • max_length (int) – maximum length of CDR3 sequence.

  • device (int) – When using tensorflow-gpu, one can specify the particular GPU device on which to build and train the graph.

  • tf_verbosity (int) – determines how much tensorflow log output to display while training.
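
A minimal sketch of initializing each of the three training objects; the object names and argument values shown here are illustrative, not required:

```python
from DeepTCR.DeepTCR import DeepTCR_U, DeepTCR_SS, DeepTCR_WF

# Unsupervised object (e.g. for training the VAE)
DTCR_U = DeepTCR_U('Tutorial_U')

# Supervised sequence classifier/regressor; build/train the graph on GPU 0
DTCR_SS = DeepTCR_SS('Tutorial_SS', max_length=40, device=0)

# Supervised repertoire classifier/regressor
DTCR_WF = DeepTCR_WF('Tutorial_WF')
```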

Get_Data(self, directory, Load_Prev_Data=False, classes=None, type_of_data_cut='Fraction_Response', data_cut=1.0, n_jobs=40, aa_column_alpha=None, aa_column_beta=None, count_column=None, sep='\t', aggregate_by_aa=True, v_alpha_column=None, j_alpha_column=None, v_beta_column=None, j_beta_column=None, d_beta_column=None, p=None, hla=None, use_hla_supertype=False, keep_non_supertype_alleles=False)

Get Data for DeepTCR

Parse Data into appropriate inputs for neural network from directories where data is stored.

This method can be used when your data is stored in directories and you want to load it from those directories into DeepTCR. This method takes care of all pre-processing of the data including:

  • Combining all CDR3 sequences with the same nucleotide sequence (optional).
  • Removing any sequences with non-IUPAC characters.
  • Removing any sequences that are longer than the max_length set when initializing the training object.
  • Determining how much of the data per file to use (type_of_data_cut)
  • Whether to use HLA/HLA-supertypes during training.

This method is included in the three main DeepTCR objects:

  • DeepTCR_U (unsupervised)
  • DeepTCR_SS (supervised sequence classifier/regressor)
  • DeepTCR_WF (supervised repertoire classifier/regressor)
Parameters:
  • directory (str) – Path to a directory containing sub-folders of tsv/csv files for analysis. Folder names become labels for the files within them. If the directory contains TCRSeq files not organized into classes/labels, DeepTCR will load all files within that directory.

  • Load_Prev_Data (bool) – Loads Previous Data. This allows the user to run the method once, and then set this parameter to True to reload the data from a local pickle file.

  • classes (list) – Optional selection of input of which sub-directories to use for analysis.

  • type_of_data_cut (str) – Method by which one wants to sample from the TCRSeq File.

    Options are:

    • Fraction_Response: A fraction (0 - 1) that samples the top fraction of the file by reads. For example, if one wants to sample the top 25% of reads, one would use this threshold with a data_cut = 0.25. The idea of this sampling is akin to sampling a fraction of cells from the file.

    • Frequency_Cut: If one wants to select clones above a given frequency threshold, one would use this threshold. For example, if one wanted to only use clones above 1%, one would enter a data_cut value of 0.01.

    • Num_Seq: If one wants to take the top N number of clones, one would use this threshold. For example, if one wanted to select the top 10 amino acid clones from each file, they would enter a data_cut value of 10.

    • Read_Cut: If one wants to take amino acid clones with at least a certain number of reads, one would use this threshold. For example, if one wanted to only use clones with at least 10 reads, they would enter a data_cut value of 10.

    • Read_Sum: If one wants to take a given number of reads from each file, one would use this threshold. For example, if one wants to use the sequences comprising the top 100 reads of the file, they would enter a data_cut value of 100.

  • data_cut (float or int) – Value associated with type_of_data_cut parameter.

  • n_jobs (int) – Number of processes to use for parallelized operations.

  • aa_column_alpha (int) – Column where alpha chain amino acid data is stored. (0-indexed).

  • aa_column_beta (int) – Column where beta chain amino acid data is stored.(0-indexed)

  • count_column (int) – Column where counts are stored.

  • sep (str) – Type of delimiter used in file with TCRSeq data.

  • aggregate_by_aa (bool) – Choose to aggregate sequences by unique amino-acid. Defaults to True. If set to False, will allow duplicates of the same amino acid sequence given it comes from different nucleotide clones.

  • v_alpha_column (int) – Column where v_alpha gene information is stored.

  • j_alpha_column (int) – Column where j_alpha gene information is stored.

  • v_beta_column (int) – Column where v_beta gene information is stored.

  • d_beta_column (int) – Column where d_beta gene information is stored.

  • j_beta_column (int) – Column where j_beta gene information is stored.

  • p (multiprocessing pool object) – For parallelized operations, one can pass a multiprocessing pool object to this method.

  • hla (str) – In order to use HLA information as part of the TCR-seq representation, one can provide a csv file where the first column is the file name and the remaining columns hold HLA alleles for each file. By including HLA information for each repertoire being analyzed, one is able to find a representation of TCR-Seq that is more meaningful across repertoires with different HLA backgrounds.

  • use_hla_supertype (bool) – Given the diversity of the HLA loci, training with a full allele may cause over-fitting. And while individuals may have different HLA alleles, these different alleles may bind peptide in a functionally similar way. This notion of HLA supertypes is a method by which HLA gene assignments can be aggregated into 6 HLA-A and 6 HLA-B supertypes. In order to convert the input HLA-allele genes to supertypes, a more biologically functional representation, one can set this parameter to True; if an allele provided is one of the 945 alleles found in the reference below, it will be assigned to a known supertype.

    • For this method to work, alleles must be provided in the following format: A0101 where the first letter of the designation is the HLA loci (A or B) and then the 4 digit gene designation. HLA supertypes only exist for HLA-A and HLA-B. All other alleles will be dropped from the analysis.

    • Sidney, J., Peters, B., Frahm, N., Brander, C., & Sette, A. (2008). HLA class I supertypes: a revised and updated classification. BMC immunology, 9(1), 1.

  • keep_non_supertype_alleles (bool) – If assigning supertypes to HLA alleles, one can choose to keep HLA-alleles that do not have a known supertype (i.e. HLA-C alleles or certain HLA-A or HLA-B alleles) or discard them for the analysis. In order to keep these alleles, one should set this parameter to True. Default is False and non HLA-A or B alleles will be discarded.

Returns:
  • Variables loaded into the training object:

    • self.alpha_sequences (ndarray): array with alpha sequences (if provided)
    • self.beta_sequences (ndarray): array with beta sequences (if provided)
    • self.class_id (ndarray): array with sequence class labels
    • self.sample_id (ndarray): array with sequence file labels
    • self.freq (ndarray): array with sequence frequencies from samples
    • self.counts (ndarray): array with sequence counts from samples
    • self.(v/d/j)_(alpha/beta) (ndarray): array with sequence (v/d/j)-(alpha/beta) usage
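
A minimal sketch of loading data from a directory of per-class folders; the directory path and the column indices below are hypothetical and must match the layout of your own tsv files:

```python
from DeepTCR.DeepTCR import DeepTCR_SS

DTCR = DeepTCR_SS('Tutorial')

# 'Data/' is assumed to contain one sub-folder per class label, each holding tsv files.
DTCR.Get_Data(directory='Data/',
              Load_Prev_Data=False,
              aggregate_by_aa=True,
              aa_column_beta=1,   # column holding the CDR3-beta amino acid sequence
              count_column=2,     # column holding template/read counts
              v_beta_column=7,
              j_beta_column=21,
              sep='\t')

# Parsed arrays are now attributes of the training object.
print(DTCR.beta_sequences[:5], DTCR.class_id[:5], DTCR.counts[:5])
```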

Load_Data(self, alpha_sequences=None, beta_sequences=None, v_beta=None, d_beta=None, j_beta=None, v_alpha=None, j_alpha=None, class_labels=None, sample_labels=None, freq=None, counts=None, Y=None, p=None, hla=None, use_hla_supertype=False, keep_non_supertype_alleles=False, w=None)

Load Data programmatically into DeepTCR.

DeepTCR allows direct user input of sequence data for DeepTCR analysis. By using this method, a user can load numpy arrays with relevant TCRSeq data for analysis.

Tip: One can load data with the Get_Data command from directories and then reload it into another DeepTCR object with the Load_Data command. This can be useful, for example, if you have different labels you want to train to, and you need to change the label programmatically between training each model. In this case, one can load the data first with the Get_Data method and then assign the labels pythonically before feeding them into the DeepTCR object with the Load_Data method.

Of note, this method DOES NOT combine sequences with the same amino acid sequence. Therefore, if one wants this, one should first do it programmatically before feeding the data into DeepTCR with this method.

Another special use case of this method would be for any type of regression task (sequence or repertoire models). In the case that a per-sequence value is fed into DeepTCR (with Y), this value either becomes the per-sequence regression value or the average of all Y over a sample becomes the per-sample regression value. This is another case where one might want to load data with the Get_Data method and then reload it into DeepTCR with regression values.

This method is included in the three main DeepTCR objects:

  • DeepTCR_U (unsupervised)
  • DeepTCR_SS (supervised sequence classifier/regressor)
  • DeepTCR_WF (supervised repertoire classifier/regressor)
Parameters:
  • alpha_sequences (ndarray of strings) – A 1d array with the sequences for inference for the alpha chain.

  • beta_sequences (ndarray of strings) – A 1d array with the sequences for inference for the beta chain.

  • v_beta (ndarray of strings) – A 1d array with the v-beta genes for inference.

  • d_beta (ndarray of strings) – A 1d array with the d-beta genes for inference.

  • j_beta (ndarray of strings) – A 1d array with the j-beta genes for inference.

  • v_alpha (ndarray of strings) – A 1d array with the v-alpha genes for inference.

  • j_alpha (ndarray of strings) – A 1d array with the j-alpha genes for inference.

  • class_labels (ndarray of strings) – A 1d array with class labels for the sequence (i.e. antigen-specificities)

  • sample_labels (ndarray of strings) – A 1d array with sample labels for the sequence. (i.e. when loading data from different samples)

  • counts (ndarray of ints) – A 1d array with the counts for each sequence, in the case they come from samples.

  • freq (ndarray of float values) – A 1d array with the frequencies for each sequence, in the case they come from samples.

  • Y (ndarray of float values) – In the case one wants to regress TCR sequences or repertoires against a numerical label, one can provide these numerical values for this input. For the TCR sequence regressor, each sequence will be regressed to the value denoted for each sequence. For the TCR repertoire regressor, the average of all instance level values will be used to regress the sample. Therefore, if there is one sample level value for regression, one would just repeat that same value for all the instances/sequences of the sample.

  • hla (ndarray of tuples/arrays) – To input the HLA context for each sequence fed into DeepTCR, this will need to be formatted as an ndarray that is (N,) for each sequence where each entry is a tuple or array of strings referring to the alleles seen for that sequence, e.g. ('A*01:01', 'A*11:01', 'B*35:01', 'B*35:02', 'C*04:01').

  • use_hla_supertype (bool) – Given the diversity of the HLA loci, training with a full allele may cause over-fitting. And while individuals may have different HLA alleles, these different alleles may bind peptide in a functionally similar way. This notion of HLA supertypes is a method by which HLA gene assignments can be aggregated into 6 HLA-A and 6 HLA-B supertypes. In order to convert the input HLA-allele genes to supertypes, a more biologically functional representation, one can set this parameter to True; if an allele provided is one of the 945 alleles found in the reference below, it will be assigned to a known supertype.

    • For this method to work, alleles must be provided in the following format: A0101 where the first letter of the designation is the HLA loci (A or B) and then the 4 digit gene designation. HLA supertypes only exist for HLA-A and HLA-B. All other alleles will be dropped from the analysis.

    • Sidney, J., Peters, B., Frahm, N., Brander, C., & Sette, A. (2008). HLA class I supertypes: a revised and updated classification. BMC immunology, 9(1), 1.

  • keep_non_supertype_alleles (bool) – If assigning supertypes to HLA alleles, one can choose to keep HLA-alleles that do not have a known supertype (i.e. HLA-C alleles or certain HLA-A or HLA-B alleles) or discard them for the analysis. In order to keep these alleles, one should set this parameter to True. Default is False and non HLA-A or B alleles will be discarded.

  • p (multiprocessing pool object) – a pre-formed pool object can be passed to method for multiprocessing tasks.

  • w (ndarray) – optional set of weights for training of autoencoder

Returns:
  • Variables loaded into the training object:

    • self.alpha_sequences (ndarray): array with alpha sequences (if provided)
    • self.beta_sequences (ndarray): array with beta sequences (if provided)
    • self.label_id (ndarray): array with sequence class labels
    • self.file_id (ndarray): array with sequence file labels
    • self.freq (ndarray): array with sequence frequencies from samples
    • self.counts (ndarray): array with sequence counts from samples
    • self.(v/d/j)_(alpha/beta) (ndarray): array with sequence (v/d/j)-(alpha/beta) usage
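
A hedged sketch of the workflow described in the tip above: parse files once with Get_Data, derive new labels in Python, and reload the arrays into a fresh object with Load_Data. The directory, column indices, and label names are placeholders:

```python
import numpy as np
from DeepTCR.DeepTCR import DeepTCR_SS

# Parse the files once.
DTCR_parse = DeepTCR_SS('parser')
DTCR_parse.Get_Data(directory='Data/', aa_column_beta=1, count_column=2)

# Derive a new binary label from the original class labels (purely illustrative).
new_labels = np.where(DTCR_parse.class_id == 'ClassA', 'positive', 'negative')

# Reload into a new object for training against the new labels.
DTCR = DeepTCR_SS('relabelled')
DTCR.Load_Data(beta_sequences=DTCR_parse.beta_sequences,
               class_labels=new_labels,
               sample_labels=DTCR_parse.sample_id,
               counts=DTCR_parse.counts)
```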

Sequence_Inference(self, alpha_sequences=None, beta_sequences=None, v_beta=None, d_beta=None, j_beta=None, v_alpha=None, j_alpha=None, p=None, hla=None, batch_size=10000, models=None, return_dist=False)

Predicting outputs of sequence models on new data

This method allows a user to take a pre-trained autoencoder/sequence classifier and generate outputs from the model on new data. For the autoencoder, this returns the features from the latent space. For the sequence classifier, it is the probability of belonging to each class.

In the case that multiple models have been trained via MC or K-fold Cross-Validation strategy for the sequence classifier, this method can use some or all trained models in an ensemble fashion to provide the average prediction per sequence as well as the distribution of predictions from all trained models.

This method is included in the two sequence DeepTCR objects:

  • DeepTCR_U (unsupervised)
  • DeepTCR_SS (supervised sequence classifier/regressor)
Parameters:
  • alpha_sequences (ndarray of strings) – A 1d array with the sequences for inference for the alpha chain.

  • beta_sequences (ndarray of strings) – A 1d array with the sequences for inference for the beta chain.

  • v_beta (ndarray of strings) – A 1d array with the v-beta genes for inference.

  • d_beta (ndarray of strings) – A 1d array with the d-beta genes for inference.

  • j_beta (ndarray of strings) – A 1d array with the j-beta genes for inference.

  • v_alpha (ndarray of strings) – A 1d array with the v-alpha genes for inference.

  • j_alpha (ndarray of strings) – A 1d array with the j-alpha genes for inference.

  • hla (ndarray of tuples/arrays) – To input the HLA context for each sequence fed into DeepTCR, this will need to be formatted as an ndarray that is (N,) for each sequence where each entry is a tuple/array of strings referring to the alleles seen for that sequence, e.g. ('A*01:01', 'A*11:01', 'B*35:01', 'B*35:02', 'C*04:01').

    • If the model used for inference was trained to use HLA supertypes, one should still enter the HLA in the format it was provided to the original model (i.e. A0101). This method will then convert those HLA alleles into the appropriate supertype designation. The HLA alleles DO NOT need to be provided to this method in the supertype designation.
  • p (multiprocessing pool object) – a pre-formed pool object can be passed to method for multiprocessing tasks.

  • batch_size (int) – Batch size for inference.

  • models (list) – In the case of the supervised sequence classifier, if several models were trained (via MC or K-fold cross-validation), one can specify which ones to use for inference. Otherwise, this method uses all trained models found in Name/models/ in an ensemble fashion. The method will output the average of all models as well as the distribution of outputs for the user.

  • return_dist (bool) – If the user wants to also return the distribution of sequence predictions over all models used for inference, one should set this value to True.

Returns:
  • features, features_dist

    • features (array), shape = [N, latent_dim]: An array of N x latent_dim containing features for all sequences. For the VAE, this represents the features from the latent space. For the sequence classifier, this represents the probabilities for every class or the regressed value from the sequence regressor. In the case of multiple models being used for inference in ensemble, this becomes the average prediction over all models.

    • features_dist (array), shape = [n_models, N, latent_dim]: An array that contains the output of all models separately for each input sequence. This output is useful if using multiple models in ensemble to predict on a new sequence. This output describes the distribution of the predictions over all models.
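
A hedged sketch of running inference with a previously trained object; the sequences are placeholders, and it is assumed (per the return description above) that setting return_dist=True yields the (features, features_dist) pair:

```python
import numpy as np

# DTCR: a DeepTCR_SS (or DeepTCR_U) object in which models have already been trained.
beta = np.array(['CASSLGTDTQYF', 'CASSPGQGAYEQYF'])  # placeholder CDR3-beta sequences

features, features_dist = DTCR.Sequence_Inference(beta_sequences=beta,
                                                  batch_size=10000,
                                                  return_dist=True)

print(features.shape)       # [N, latent_dim] for the VAE, [N, n_classes] for the classifier
print(features_dist.shape)  # [n_models, N, latent_dim]
```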

DeepTCR_S_base

AUC_Curve(self, by=None, filename='AUC.tif', title=None, title_font=None, plot=True, diag_line=True, xtick_size=None, ytick_size=None, xlabel_size=None, ylabel_size=None, legend_font_size=None, frameon=True, legend_loc='lower right', figsize=None, set='test', color_dict=None)

AUC Curve for both Sequence and Repertoire/Sample Classifiers
Parameters:
  • by (str) – To show AUC curve for only one class, set this parameter to the name of the class label one wants to plot.

  • filename (str) – Filename to save tif file of AUC curve.

  • title (str) – Optional Title to put on ROC Curve.

  • title_font (int) – Optional font size for title

  • plot (bool) – To suppress plotting and just save the data/figure, set to False.

  • diag_line (bool) – To plot the line/diagonal of y=x defining no predictive power, set to True. To remove from plot, set to False.

  • xtick_size (float) – Size of xticks

  • ytick_size (float) – Size of yticks

  • xlabel_size (float) – Size of xlabel

  • ylabel_size (float) – Size of ylabel

  • legend_font_size (float) – Size of legend

  • frameon (bool) – Whether to show frame around legend.

  • figsize (tuple) – To change the default size of the figure, set this to size of figure (i.e. - (10,10) )

  • set (str) – Which partition of the data to look at performance of model. Options are train/valid/test.

  • color_dict (dict) – An optional dictionary that maps classes to colors in the case user wants to define colors of lines on plot.

Returns:
  • AUC Data

    • self.AUC_DF (Pandas Dataframe): AUC scores are returned for each class.

    In addition to plotting the ROC Curve, the AUCs are saved to a csv file in the results directory called 'AUC.csv'.
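
A minimal sketch of plotting performance after a cross-validation run on a DeepTCR_SS or DeepTCR_WF object; the filename and title shown are illustrative:

```python
# DTCR: a supervised object in which Monte_Carlo_CrossVal or K_Fold_CrossVal has been run.
DTCR.AUC_Curve(set='test', filename='AUC.tif', title='Tutorial model')

# Per-class AUC values are also stored on the object and written to the results directory.
print(DTCR.AUC_DF)
```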

Representative_Sequences(self, top_seq=10, motif_seq=5, make_seq_logos=True, color_scheme='weblogo_protein', logo_file_format='.eps')

Identify most highly predicted sequences for each class and corresponding motifs.

This method allows the user to query which sequences were most predicted to belong to a given class along with the motifs that were learned for these representative sequences. Of note, this method only reports sequences that were in the test set so as not to return highly predicted sequences that were over-fit in the training set. To obtain the highest predicted sequences in all the data, run a K-fold cross-validation or Monte-Carlo cross-validation before running this method. In this way, the predicted probability will have been assigned to a sequence only when it was in the independent test set.

In the case of a regression task, the representative sequences for the 'high' and 'low' values for the regression model are returned in the Rep_Seq Dict.

This method will also determine motifs the network has learned that are highly associated with the label through multinomial linear regression and creates seq logos and fasta files in the results folder. Within a folder for a given class, the motifs are sorted by their linear coefficient. The coefficient is in the file name (i.e. 0_0.125_feature_2.eps reflects the 0th highest feature with a coefficient of 0.125).

Parameters:
  • top_seq (int) – The number of top sequences to show for each class.

  • motif_seq (int) – The number of sequences to use to generate each motif. The more sequences used, the noisier the seq logo may be.

  • make_seq_logos (bool) – In order to make seq logos for visualization of enriched motifs, set this to True. Whether or not this is set to True, the fasta files that define the enriched motifs will still be saved.

  • color_scheme (str) – color scheme to use for LogoMaker. Options are:

    • weblogo_protein
    • skylign_protein
    • dmslogo_charge
    • dmslogo_funcgroup
    • hydrophobicity
    • chemistry
    • charge
    • NajafabadiEtAl2017

  • logo_file_format (str) – The type of image file one wants to save the seqlogo as. Default is vector-based format (.eps)

Returns:
  • Outputs

    • self.Rep_Seq (dictionary of dataframes): This dictionary of dataframes holds, for each class, the top sequences and their respective probabilities for all classes. These dataframes can also be found in the results folder under Rep_Sequences.

    • self.Rep_Seq_Features_(alpha/beta) (dataframe): This dataframe holds information for which features were associated by a multinomial linear model to the predicted probabilities of the neural network. The values in this dataframe are the linear model coefficients. This allows one to see which features were associated with the output of the trained neural network. These are also the same values that are on the motif seqlogo files in the results folder.

Furthermore, the motifs are written in the results directory underneath the Motifs folder. To find the beta motifs for a given class, look under Motifs/beta/class_name/. These fasta/logo files are labeled by the linear coefficient of that given feature for that given class followed by the number name of the feature. These fasta files can then be visualized via weblogos at the following site: "https://weblogo.berkeley.edu/logo.cgi" or are present in the folder for direct visualization.
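
A minimal sketch, assuming cross-validation has already been run so that every sequence carries a test-set prediction; the class key 'ClassA' is a placeholder for one of your own labels:

```python
DTCR.Representative_Sequences(top_seq=25, motif_seq=10, make_seq_logos=True)

# Rep_Seq is a dictionary of dataframes keyed by class label.
print(DTCR.Rep_Seq['ClassA'].head())
```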

Create Residue Sensitivity Logos

This method allows the user to create Residue Sensitivity Logos where a set of provided sequences is perturbed to assess which positions of the CDR3 sequence would, if altered, change the predicted specificity or affinity of the sequence (depending on whether a classification or regression task was trained).

Residue Sensitivity Logos can be created from any supervised model (including sequence and repertoire classifiers). Following the training of one of these models, one can feed into this method a CDR3 input defined by all/any of the alpha/beta CDR3 sequence, V/D/J gene usage, and the HLA context within which the TCR was seen.

The output is a logo created by LogoMaker where the size of the character denotes how sensitive this position is to perturbation and color denotes the consequences of changes at this site. By default, red coloration means changes at this site would generally decrease the predicted value and blue coloration means changes at this site would increase the predicted value.

Parameters:
  • alpha_sequences (ndarray of strings) – A 1d array with the sequences for inference for the alpha chain.

  • beta_sequences (ndarray of strings) – A 1d array with the sequences for inference for the beta chain.

  • v_beta (ndarray of strings) – A 1d array with the v-beta genes for inference.

  • d_beta (ndarray of strings) – A 1d array with the d-beta genes for inference.

  • j_beta (ndarray of strings) – A 1d array with the j-beta genes for inference.

  • v_alpha (ndarray of strings) – A 1d array with the v-alpha genes for inference.

  • j_alpha (ndarray of strings) – A 1d array with the j-alpha genes for inference.

  • hla (ndarray of tuples/arrays) – To input the HLA context for each sequence fed into DeepTCR, this will need to be formatted as an ndarray that is (N,) for each sequence where each entry is a tuple/array of strings referring to the alleles seen for that sequence, e.g. ('A*01:01', 'A*11:01', 'B*35:01', 'B*35:02', 'C*04:01').

  • p (multiprocessing pool object) – a pre-formed pool object can be passed to method for multiprocessing tasks.

  • batch_size (int) – Batch size for inference.

  • models (list) – In the case of the supervised sequence classifier, if several models were trained (via MC or K-fold cross-validation), one can specify which ones to use for inference. Otherwise, this method uses all trained models found in Name/models/ in an ensemble fashion. The method will output the average of all models as well as the distribution of outputs for the user.

  • figsize (tuple) – This specifies the dimensions of the logo.

  • low_color (str) – The color to use when changes at this site would largely result in decreased prediction values.

  • medium_color (str) – The color to use when changes at this site would result in either decreased or increased prediction values.

  • high_color (str) – The color to use when changes at this site would result in increased prediction values.

  • font_name (str) – The font to use for LogoMaker.

  • class_sel (str) – In the case of a model being trained in a multi-class fashion, one must select which class to make the logo for.

  • cmap (matplotlib cmap) – One can also provide a custom cmap for LogoMaker that will be used to denote changes at sites that result in increased or decreased prediction values.

  • min_size (float (0.0 - 1.0)) –

  • edgecolor (str) – The color of the edge of the characters of the logo.

  • edgewidth (float) – The thickness of the edge of the characters.

  • background_color (str) – The background color of the logo.

  • norm_to_seq (bool) – When determining the color intensity of the logo, one can choose to normalize the value to just the characters in that sequence (True) or to all characters in the sequences provided (False).

  • Load_Prev_Data (bool) – Since the first part of the method runs a time-intensive step to get all the predictions for all perturbations at all residue sites, we've incorporated a parameter which can be set to True following running the method once in order to adjust the visual aspects of the plot. Therefore, one should run this method first with this parameter set to False (its default setting), then switch it to True and run again with different visualization parameters (i.e. figsize, etc.).

Returns:
  • Residue Sensitivity Logo

    • (fig, ax): the matplotlib figure and axis/axes.

SRCC(self, s=10, kde=False, title=None)

Spearman's Rank Correlation Coefficient Plot

In the case one is doing a regression-based model for the sequence classifier, one can plot the predicted vs actual labeled values with this method. The method returns a plot for the regression and the value of the correlation coefficient.

Parameters:
  • s (int) – size of points for scatterplot

  • kde (bool) – To do a kernel density estimation per point and plot this as a color-scheme, set to True. Warning: this option will take longer to run.

  • title (str) – Title for the plot.

Returns:
  • SRCC Output

    • corr (float): Spearman's Rank Correlation Coefficient

    • ax (matplotlib axis): axis on which plot is drawn
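
A minimal sketch for a regression-trained sequence model, assuming (per the return description above) that the method returns the correlation value and the matplotlib axis:

```python
corr, ax = DTCR.SRCC(s=10, kde=False, title='Sequence regression')
print("Spearman's rho:", corr)
```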

DeepTCR_SS

Get_Train_Valid_Test(self, test_size=0.25, LOO=None, split_by_sample=False, combine_train_valid=False)

Train/Valid/Test Splits.

Divide data into train, valid, and test sets. Training is used to train model parameters, validation is used to set early stopping, and test acts as a blackbox independent test set.

Parameters:
  • test_size (float) – Fraction of sample to be used for valid and test set.

  • LOO (int) – Number of sequences to leave out in Leave-One-Out Cross-Validation. For example, when set to 20, 20 sequences will be left out for the validation set and 20 sequences will be left out for the test set.

  • split_by_sample (bool) – In the case one wants to train the single sequence classifier but not mix the train/test sets with sequences from different samples, one can set this parameter to True to do the train/test splits by sample.

  • combine_train_valid (bool) – To combine the training and validation partitions into one which will be used for training and updating the model parameters, set this to True. This will also set the validation partition to the test partition. In other words, new train set becomes (original train + original valid) and then new valid = original test partition, new test = original test partition. Therefore, if setting this parameter to True, change one of the training parameters to set the stop training criterion (i.e. train_loss_min) to stop training based on the train set. If one does not change the stop training criterion, the decision of when to stop training will be based on the test data (which is considered a form of over-fitting).
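
A minimal sketch of creating the splits on a loaded DeepTCR_SS object; split_by_sample keeps all sequences from a given sample in the same partition:

```python
# DTCR: a DeepTCR_SS object in which Get_Data or Load_Data has already been run.
DTCR.Get_Train_Valid_Test(test_size=0.25, split_by_sample=True)
```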

K_Fold_CrossVal(self, folds=None, split_by_sample=False, combine_train_valid=False, seeds=None, kernel=5, trainable_embedding=True, embedding_dim_aa=64, embedding_dim_genes=48, embedding_dim_hla=12, num_fc_layers=0, units_fc=12, weight_by_class=False, class_weights=None, use_only_seq=False, use_only_gene=False, use_only_hla=False, size_of_net='medium', graph_seed=None, drop_out_rate=0.0, multisample_dropout=False, multisample_dropout_rate=0.5, multisample_dropout_num_masks=64, batch_size=1000, epochs_min=10, stop_criterion=0.001, stop_criterion_window=10, accuracy_min=None, train_loss_min=None, hinge_loss_t=0.0, convergence='validation', learning_rate=0.001, suppress_output=False, batch_seed=None)

K_Fold Cross-Validation for Single-Sequence Classifier

If the number of sequences is small when training the single-sequence classifier, one can use K-Fold Cross-Validation to train on all but one fold before assessing predictive performance on the held-out fold. After this method is run, the AUC_Curve method can be run to assess the overall performance.

The method also saves the per-sequence predictions at the end of training in the variable self.predicted. These per-sequence predictions are only assessed when the sequences are in the test set.

The multisample parameters are used to implement Multi-Sample Dropout at the final layer of the model as described in "Multi-Sample Dropout for Accelerated Training and Better Generalization" https://arxiv.org/abs/1905.09788. This method has been shown to improve generalization of deep neural networks as well as improve convergence.

Parameters:
  • folds (int) – Number of Folds

  • split_by_sample (bool) – In the case one wants to train the single sequence classifier but not mix the train/test sets with sequences from different samples, one can set this parameter to True to do the train/test splits by sample.

  • combine_train_valid (bool) – To combine the training and validation partitions into one which will be used for training and updating the model parameters, set this to True. This will also set the validation partition to the test partition. In other words, new train set becomes (original train + original valid) and then new valid = original test partition, new test = original test partition. Therefore, if setting this parameter to True, change one of the training parameters to set the stop training criterion (i.e. train_loss_min) to stop training based on the train set. If one does not change the stop training criterion, the decision of when to stop training will be based on the test data (which is considered a form of over-fitting).

  • seeds (nd.array) – In order to set a deterministic train/test split over the K-fold simulations, one can provide an array of seeds, one for each fold. This will result in the same train/test split over the N-fold simulations. This parameter, if provided, should have the same length as the value of folds.

  • kernel (int) – Size of convolutional kernel for first layer of convolutions.

  • trainable_embedding (bool) – Toggle to control whether a trainable embedding layer is used or native one-hot representation for convolutional layers.

  • embedding_dim_aa (int) – Learned latent dimensionality of amino-acids.

  • embedding_dim_genes (int) – Learned latent dimensionality of VDJ genes

  • embedding_dim_hla (int) – Learned latent dimensionality of HLA

  • num_fc_layers (int) – Number of fully connected layers following convolutional layer.

  • units_fc (int) – Number of nodes per fully-connected layers following convolutional layer.

  • weight_by_class (bool) – Option to weight loss by the inverse of the class frequency. Useful for unbalanced classes.

  • class_weights (dict) – In order to specify custom weights for each class during training, one can provide a dictionary with these weights, i.e. {'A':1.0,'B':2.0}.

  • use_only_seq (bool) – To only use sequence features, set to True. This will turn off features learned from gene usage.

  • use_only_gene (bool) – To only use gene-usage features, set to True. This will turn off features from the sequences.

  • use_only_hla (bool) – To only use hla features, set to True.

  • size_of_net (list or str) – The convolutional portion of this network has 3 layers for which the user can modify the number of neurons per layer. The user can either specify the size of the network with the following options:

    • small == [12,32,64] neurons for the 3 respective layers
    • medium == [32,64,128] neurons for the 3 respective layers
    • large == [64,128,256] neurons for the 3 respective layers
    • custom, where the user supplies a list with the number of neurons for the respective layers, i.e. [3,3,3] would have 3 neurons for all 3 layers. One can also adjust the number of layers for the convolutional stack by changing the length of this list: [3,3,3] = 3 layers, [3,3,3,3] = 4 layers.
  • graph_seed (int) – For deterministic initialization of weights of the graph, set this to value of choice.

  • drop_out_rate (float) – drop out rate for fully connected layers

  • multisample_dropout (bool) – Set this parameter to True to implement Multi-Sample Dropout at the final layer.

  • multisample_dropout_rate (float) – The dropout rate for this multi-sample dropout layer.

  • multisample_dropout_num_masks (int) – The number of masks to sample from for the Multi-Sample Dropout layer.

  • batch_size (int) – Size of batch to be used for each training iteration of the net.

  • epochs_min (int) – Minimum number of epochs for training neural network.

  • stop_criterion (float) – Minimum percent decrease in determined interval (below) to continue training. Used as early stopping criterion.

  • stop_criterion_window (int) – The window of data to apply the stopping criterion.

  • accuracy_min (float) – Optional parameter to allow alternative training strategy until minimum training accuracy is achieved, at which point, training ceases.

  • train_loss_min (float) – Optional parameter to allow alternative training strategy until minimum training loss is achieved, at which point, training ceases.

  • hinge_loss_t (float) – The per sequence loss minimum at which the loss of that sequence is not used to penalize the model anymore. In other words, once a per sequence loss has hit this value, it gets set to 0.0.

  • convergence (str) – This parameter determines which loss to assess the convergence criteria on. Options are 'validation' or 'training'. This is useful in the case one wants to assess the convergence criteria on the training data when the training and validation partitions have been combined and used to train the model.

  • learning_rate (float) – The learning rate for training the neural network. Making this value larger will increase the rate of convergence but can introduce instability into training. For most, altering this value will not be necessary.

  • suppress_output (bool) – To suppress command line output with training statistics, set to True.

  • batch_seed (int) – For deterministic batching during training, set this value to an integer of choice.
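
A minimal sketch of a deterministic 5-fold run followed by performance assessment; the seed values shown are illustrative:

```python
import numpy as np

# DTCR: a DeepTCR_SS object in which data has already been loaded.
DTCR.K_Fold_CrossVal(folds=5, seeds=np.arange(5), graph_seed=1, batch_seed=1)
DTCR.AUC_Curve()
```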

Monte_Carlo_CrossVal(self, folds=5, test_size=0.25, LOO=None, split_by_sample=False, combine_train_valid=False, seeds=None, kernel=5, trainable_embedding=True, embedding_dim_aa=64, embedding_dim_genes=48, embedding_dim_hla=12, num_fc_layers=0, units_fc=12, weight_by_class=False, class_weights=None, use_only_seq=False, use_only_gene=False, use_only_hla=False, size_of_net='medium', graph_seed=None, drop_out_rate=0.0, multisample_dropout=False, multisample_dropout_rate=0.5, multisample_dropout_num_masks=64, batch_size=1000, epochs_min=10, stop_criterion=0.001, stop_criterion_window=10, accuracy_min=None, train_loss_min=None, hinge_loss_t=0.0, convergence='validation', learning_rate=0.001, suppress_output=False, batch_seed=None)

Monte Carlo Cross-Validation for Single-Sequence Classifier

If the number of sequences is small when training the single-sequence classifier, one can use Monte Carlo Cross-Validation to train over a number of iterations before assessing predictive performance. After this method is run, the AUC_Curve method can be run to assess the overall performance.

The method also saves the per-sequence predictions at the end of training in the variable self.predicted. These per-sequence predictions are only assessed when the sequences are in the test set. Ideally, after running the classifier with multiple folds, each sequence will have multiple predictions that were collected when it was in the test set.

The multisample parameters are used to implement Multi-Sample Dropout at the final layer of the model as described in "Multi-Sample Dropout for Accelerated Training and Better Generalization" https://arxiv.org/abs/1905.09788. This method has been shown to improve generalization of deep neural networks as well as improve convergence.

Parameters:
  • folds (int) – Number of iterations for Cross-Validation

  • test_size (float) – Fraction of sample to be used for valid and test set.

  • LOO (int) – Number of sequences to leave out in Leave-One-Out Cross-Validation. For example, when set to 20, 20 sequences will be left out for the validation set and 20 sequences will be left out for the test set.

  • split_by_sample (bool) – In the case one wants to train the single sequence classifier but not mix the train/test sets with sequences from different samples, one can set this parameter to True to do the train/test splits by sample.

  • combine_train_valid (bool) – To combine the training and validation partitions into one which will be used for training and updating the model parameters, set this to True. This will also set the validation partition to the test partition. In other words, new train set becomes (original train + original valid) and then new valid = original test partition, new test = original test partition. Therefore, if setting this parameter to True, change one of the training parameters to set the stop training criterion (i.e. train_loss_min) to stop training based on the train set. If one does not change the stop training criterion, the decision of when to stop training will be based on the test data (which is considered a form of over-fitting).

  • seeds (nd.array) – In order to set a deterministic train/test split over the Monte-Carlo simulations, one can provide an array of seeds, one for each MC simulation. This will result in the same train/test split over the N MC simulations. This parameter, if provided, should have the same length as the value of folds.

  • kernel (int) – Size of convolutional kernel for first layer of convolutions.

  • trainable_embedding (bool) – Toggle to control whether a trainable embedding layer is used or native one-hot representation for convolutional layers.

  • embedding_dim_aa (int) – Learned latent dimensionality of amino-acids.

  • embedding_dim_genes (int) – Learned latent dimensionality of VDJ genes

  • embedding_dim_hla (int) – Learned latent dimensionality of HLA

  • num_fc_layers (int) – Number of fully connected layers following convolutional layer.

  • units_fc (int) – Number of nodes per fully-connected layers following convolutional layer.

  • weight_by_class (bool) – Option to weight loss by the inverse of the class frequency. Useful for unbalanced classes.

  • class_weights (dict) – In order to specify custom weights for each class during training, one can provide a dictionary with these weights, i.e. {'A':1.0,'B':2.0}.

  • use_only_seq (bool) – To only use sequence features, set to True. This will turn off features learned from gene usage.

  • use_only_gene (bool) – To only use gene-usage features, set to True. This will turn off features from the sequences.

  • use_only_hla (bool) – To only use hla features, set to True.

  • size_of_net (list or str) – The convolutional portion of this network has 3 layers for which the user can modify the number of neurons per layer. The user can either specify the size of the network with the following options:

    • small == [12,32,64] neurons for the 3 respective layers
    • medium == [32,64,128] neurons for the 3 respective layers
    • large == [64,128,256] neurons for the 3 respective layers
    • custom, where the user supplies a list with the number of neurons for the respective layers, i.e. [3,3,3] would have 3 neurons for all 3 layers. One can also adjust the number of layers for the convolutional stack by changing the length of this list: [3,3,3] = 3 layers, [3,3,3,3] = 4 layers.
  • graph_seed (int) – For deterministic initialization of weights of the graph, set this to value of choice.

  • drop_out_rate (float) – drop out rate for fully connected layers

  • multisample_dropout (bool) – Set this parameter to True to implement Multi-Sample Dropout at the final layer.

  • multisample_dropout_rate (float) – The dropout rate for this multi-sample dropout layer.

  • multisample_dropout_num_masks (int) – The number of masks to sample from for the Multi-Sample Dropout layer.

  • batch_size (int) – Size of batch to be used for each training iteration of the net.

  • epochs_min (int) – Minimum number of epochs for training neural network.

  • stop_criterion (float) – Minimum percent decrease in determined interval (below) to continue training. Used as early stopping criterion.

  • stop_criterion_window (int) – The window of data to apply the stopping criterion.

  • accuracy_min (float) – Optional parameter to allow alternative training strategy until minimum training accuracy is achieved, at which point, training ceases.

  • train_loss_min (float) – Optional parameter to allow alternative training strategy until minimum training loss is achieved, at which point, training ceases.

  • hinge_loss_t (float) – The per sequence loss minimum at which the loss of that sequence is not used to penalize the model anymore. In other words, once a per sequence loss has hit this value, it gets set to 0.0.

  • convergence (str) – This parameter determines which loss to assess the convergence criteria on. Options are 'validation' or 'training'. This is useful in the case one wants to assess the convergence criteria on the training data when the training and validation partitions have been combined and used to train the model.

  • learning_rate (float) – The learning rate for training the neural network. Making this value larger will increase the rate of convergence but can introduce instability into training. For most, altering this value will not be necessary.

  • suppress_output (bool) – To suppress command line output with training statistics, set to True.

  • batch_seed (int) – For deterministic batching during training, set this value to an integer of choice.
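
A minimal sketch of a Monte Carlo cross-validation run; the number of folds and the hyperparameters shown are illustrative:

```python
import numpy as np

# DTCR: a DeepTCR_SS object in which data has already been loaded.
DTCR.Monte_Carlo_CrossVal(folds=25, test_size=0.25,
                          weight_by_class=True, drop_out_rate=0.1,
                          seeds=np.arange(25))
DTCR.AUC_Curve()
DTCR.Representative_Sequences()
```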

Train(self, kernel=5, trainable_embedding=True, embedding_dim_aa=64, embedding_dim_genes=48, embedding_dim_hla=12, num_fc_layers=0, units_fc=12, weight_by_class=False, class_weights=None, use_only_seq=False, use_only_gene=False, use_only_hla=False, size_of_net='medium', graph_seed=None, drop_out_rate=0.0, multisample_dropout=False, multisample_dropout_rate=0.5, multisample_dropout_num_masks=64, batch_size=1000, epochs_min=10, stop_criterion=0.001, stop_criterion_window=10, accuracy_min=None, train_loss_min=None, hinge_loss_t=0.0, convergence='validation', learning_rate=0.001, suppress_output=False, batch_seed=None)

Train Single-Sequence Classifier

This method trains the network and saves feature values at the end of training for downstream analysis.

The method also saves the per sequence predictions at the end of training in the variable self.predicted

The multisample parameters are used to implement Multi-Sample Dropout at the final layer of the model as described in "Multi-Sample Dropout for Accelerated Training and Better Generalization" https://arxiv.org/abs/1905.09788. This method has been shown to improve generalization of deep neural networks as well as improve convergence.

Parameters:
  • kernel (int) – Size of convolutional kernel for first layer of convolutions.

  • trainable_embedding (bool) – Toggle to control whether a trainable embedding layer is used or native one-hot representation for convolutional layers.

  • embedding_dim_aa (int) – Learned latent dimensionality of amino-acids.

  • embedding_dim_genes (int) – Learned latent dimensionality of VDJ genes

  • embedding_dim_hla (int) – Learned latent dimensionality of HLA

  • num_fc_layers (int) – Number of fully connected layers following convolutional layer.

  • units_fc (int) – Number of nodes per fully-connected layers following convolutional layer.

  • weight_by_class (bool) – Option to weight loss by the inverse of the class frequency. Useful for unbalanced classes.

  • class_weights (dict) – In order to specify custom weights for each class during training, one can provide a dictionary with these weights, i.e. {'A':1.0,'B':2.0}.

  • use_only_seq (bool) – To only use sequence features, set to True. This will turn off features learned from gene usage.

  • use_only_gene (bool) – To only use gene-usage features, set to True. This will turn off features from the sequences.

  • use_only_hla (bool) – To only use hla features, set to True.

  • size_of_net (list or str) – The convolutional portion of this network has 3 layers for which the user can modify the number of neurons per layer. The user can either specify the size of the network with the following options:

    • small == [12,32,64] neurons for the 3 respective layers
    • medium == [32,64,128] neurons for the 3 respective layers
    • large == [64,128,256] neurons for the 3 respective layers
    • custom, where the user supplies a list with the number of neurons for the respective layers, i.e. [3,3,3] would have 3 neurons for all 3 layers. One can also adjust the number of layers for the convolutional stack by changing the length of this list: [3,3,3] = 3 layers, [3,3,3,3] = 4 layers.
  • graph_seed (int) – For deterministic initialization of weights of the graph, set this to value of choice.

  • drop_out_rate (float) – drop out rate for fully connected layers

  • multisample_dropout (bool) – Set this parameter to True to implement Multi-Sample Dropout at the final layer.

  • multisample_dropout_rate (float) – The dropout rate for this multi-sample dropout layer.

  • multisample_dropout_num_masks (int) – The number of masks to sample from for the Multi-Sample Dropout layer.

  • batch_size (int) – Size of batch to be used for each training iteration of the net.

  • epochs_min (int) – Minimum number of epochs for training neural network.

  • stop_criterion (float) – Minimum percent decrease in determined interval (below) to continue training. Used as early stopping criterion.

  • stop_criterion_window (int) – The window of data to apply the stopping criterion.

  • accuracy_min (float) – Optional parameter to allow alternative training strategy until minimum training accuracy is achieved, at which point, training ceases.

  • train_loss_min (float) – Optional parameter to allow alternative training strategy until minimum training loss is achieved, at which point, training ceases.

  • hinge_loss_t (float) – The per sequence loss minimum at which the loss of that sequence is not used to penalize the model anymore. In other words, once a per sequence loss has hit this value, it gets set to 0.0.

  • convergence (str) – This parameter determines which loss to assess the convergence criteria on. Options are 'validation' or 'training'. This is useful in the case one wants to assess the convergence criteria on the training data when the training and validation partitions have been combined and used to train the model.

  • learning_rate (float) – The learning rate for training the neural network. Making this value larger will increase the rate of convergence but can introduce instability into training. For most, altering this value will not be necessary.

  • suppress_output (bool) – To suppress command line output with training statistics, set to True.

  • batch_seed (int) – For deterministic batching during training, set this value to an integer of choice.
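
A minimal sketch of a single training run on one train/valid/test split; the architecture settings are illustrative:

```python
# DTCR: a DeepTCR_SS object in which data has already been loaded.
DTCR.Get_Train_Valid_Test(test_size=0.25)
DTCR.Train(kernel=5, num_fc_layers=1, units_fc=64,
           epochs_min=10, stop_criterion=0.001)
DTCR.AUC_Curve(set='test')
```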

DeepTCR_U

KNN_Repertoire_Classifier(self, folds=5, distance_metric='KL', sample=None, n_jobs=1, plot_metrics=False, plot_type='violin', by_class=False, Load_Prev_Data=False, metrics=['Recall', 'Precision', 'F1_Score', 'AUC'])

K-Nearest Neighbor Repertoire Classifier

This method uses a K-Nearest Neighbor Classifier to assess the ability to predict a repertoire label given the structural distribution of the repertoire. The method returns AUC, Precision, Recall, and F1 scores for all classes.

Parameters:
  • folds (int) – Number of folds to train/test K-Nearest Classifier.

  • distance_metric (str) – Provided distance metric to determine repertoire-level distance from cluster proportions. Options include: KL, correlation, euclidean, wasserstein, JS.

  • sample (int) – For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample a number of sequences and then use k-nearest neighbors to assign other sequences.

  • n_jobs (int) – Number of processes to use for parallel operations.

  • plot_metrics (bool) – Toggle to show the performance metrics

  • plot_type (str) – Type of plot as taken by seaborn.catplot for kind parameter: options include (strip,swarm,box,violin,boxen,point,bar,count)

  • by_class (bool) – Toggle to show the performance metrics by class.

  • Load_Prev_Data (bool) – If the method has been run before, one can load previous data from the clustering step to move to the KNN step faster. Can be useful when trying different distance methods to find the optimal distance metric for a given dataset.

  • metrics (list) – List of performance measures one wants to compute. options include AUC, Precision, Recall, F1_Score

Returns:
  • `Performance Metrics

    • self.KNN_Repertoire_DF (Pandas dataframe)` – Dataframe with all metrics of performance organized by the class label, metric (i.e. AUC), k-value (from k-nearest neighbors), and the value of the performance metric.
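
A minimal sketch, assuming Train_VAE has already been run on the DeepTCR_U object so that per-sequence features exist:

```python
DTCR_U.KNN_Repertoire_Classifier(folds=5, distance_metric='KL',
                                 plot_metrics=True, by_class=True)
print(DTCR_U.KNN_Repertoire_DF.head())
```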

KNN_Sequence_Classifier(self, folds=5, k_values=[1, 26, 51, 76, 101, 126, 151, 176, 201, 226, 251, 276, 301, 326, 351, 376, 401, 426, 451, 476], rep=5, plot_metrics=False, by_class=False, plot_type='violin', metrics=['Recall', 'Precision', 'F1_Score', 'AUC'], n_jobs=1, Load_Prev_Data=False)

K-Nearest Neighbor Sequence Classifier

This method uses a K-Nearest Neighbor Classifier to assess the ability to predict a sequence label given its sequence features. The method returns AUC, Precision, Recall, and F1 scores for all classes.

Parameters:
  • folds (int) – Number of folds to train/test K-Nearest Classifier.

  • k_values (list) – List of k for KNN algorithm to assess performance metrics across.

  • rep (int) – Number of iterations to train KNN classifier for each k-value.

  • plot_metrics (bool) – Toggle to show the performance metrics

  • plot_type (str) – Type of plot as taken by seaborn.catplot for kind parameter: options include (strip,swarm,box,violin,boxen,point,bar,count)

  • by_class (bool) – Toggle to show the performance metrics by class.

  • metrics (list) – List of performance measures one wants to compute. Options include AUC, Precision, Recall, F1_Score

  • n_jobs (int) – Number of workers to set for KNeighborsClassifier.

  • Load_Prev_Data (bool) – To make new figures from a previously run analysis, set this value to True after running the method for the first time. This will load the performance metrics from the previous run.

Returns:
  • `Performance Metrics

    • self.KNN_Sequence_DF (Pandas dataframe)` – Dataframe with all metrics of performance organized by the class label, metric (i.e. AUC), k-value (from k-nearest neighbors), and the value of the performance metric.
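
A minimal sketch, again assuming Train_VAE has already been run; the k values shown are illustrative:

```python
DTCR_U.KNN_Sequence_Classifier(folds=5, k_values=[1, 11, 51, 101],
                               plot_metrics=True, metrics=['AUC', 'F1_Score'])
print(DTCR_U.KNN_Sequence_DF.head())
```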

Train_VAE(self, latent_dim=256, kernel=5, trainable_embedding=True, embedding_dim_aa=64, embedding_dim_genes=48, embedding_dim_hla=12, use_only_seq=False, use_only_gene=False, use_only_hla=False, size_of_net='medium', latent_alpha=0.001, sparsity_alpha=None, var_explained=None, graph_seed=None, batch_size=10000, epochs_min=0, stop_criterion=0.01, stop_criterion_window=30, accuracy_min=None, suppress_output=False, learning_rate=0.001, split_seed=None, Load_Prev_Data=False)

Train Variational Autoencoder (VAE)

This method trains the network and saves feature values for sequences for a variety of downstream analyses that can either be done within the DeepTCR framework or by the user by simply extracting the learned representations.

Parameters:
  • latent_dim (int) – Number of latent dimensions for VAE.

  • kernel (int) – The motif k-mer of the first convolutional layer of the graph.

  • trainable_embedding (bool) – Toggle to control whether a trainable embedding layer is used or native one-hot representation for convolutional layers.

  • embedding_dim_aa (int) – Learned latent dimensionality of amino-acids.

  • embedding_dim_genes (int) – Learned latent dimensionality of VDJ genes

  • embedding_dim_hla (int) – Learned latent dimensionality of HLA

  • use_only_seq (bool) – To only use sequence features, set to True.

  • use_only_gene (bool) – To only use gene-usage features, set to True.

  • use_only_hla (bool) – To only use hla features, set to True.

  • size_of_net (list or str) – The convolutional portion of this network has 3 layers for which the user can modify the number of neurons per layer. The user can either specify the size of the network with the following options:

    • small == [12,32,64] neurons for the 3 respective layers
    • medium == [32,64,128] neurons for the 3 respective layers
    • large == [64,128,256] neurons for the 3 respective layers
    • custom, where the user supplies a list with the number of neurons for the respective layers, i.e. [3,3,3] would have 3 neurons for all 3 layers. One can also adjust the number of layers for the convolutional stack by changing the length of this list: [3,3,3] = 3 layers, [3,3,3,3] = 4 layers.
  • latent_alpha (float) – Penalty coefficient for latent loss. This value changes the degree of latent regularization on the VAE.

  • sparsity_alpha (float) – When training an autoencoder, the number of latent nodes required to model the underlying distribution of the data is often arrived at by trial and error and tuning this hyperparameter. In many cases, by using too many latent nodes, one may fit the distribution but downstream analysis tasks may be computationally burdensome with i.e. 256 latent features. Additionally, there can be a high level of collinearity among these latent features. In our implementation of the VAE, we introduce the concept of a sparsity constraint which turns off latent nodes in a soft fashion throughout training and acts as another mode of regularization to find the minimal number of latent features that model the underlying distribution. Following training, one can set the var_explained parameter to select the number of latent nodes required to cover X percent of the variation explained, akin to a PCA analysis. This results in a lower-dimensional and more linearly independent latent space. A good starting value is 1.0.

  • var_explained (float (0-1.0)) – Following training, one can select the number of latent features that explain N% of the variance in the data. The output of the method will be the features in order of the explained variance.

  • graph_seed (int) – For deterministic initialization of weights of the graph, set this to value of choice.

  • batch_size (int) – Size of batch to be used for each training iteration of the net.

  • epochs_min (int) – The minimum number of epochs to train the autoencoder.

  • stop_criterion (float) – Minimum percent decrease in determined interval (below) to continue training. Used as early stopping criterion.

  • stop_criterion_window (int) – The window of data to apply the stopping criterion.

  • accuracy_min (float) – Minimum reconstruction accuracy before terminating training.

  • suppress_output (bool) – To suppress command line output with training statistics, set to True.

  • split_seed (int) – For deterministic batching of data during training, one can set this parameter to value of choice.

  • Load_Prev_Data (bool) – Load previous feature data from prior training.

Returns:
  • VAE Features

    • self.features (array) – An array of shape n x latent_dim containing features for all sequences.

    • self.explained_variance_ (array) – The explained variance for the N latent features, in descending order.

    • self.explained_variance_ratio_ (array) – The explained variance ratio for the N latent features, in descending order.
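
A minimal usage sketch is shown below (the object name, directory, and column indices are illustrative placeholders; it assumes repertoire files can be loaded with Get_Data on an unsupervised DeepTCR_U object):

```python
from DeepTCR.DeepTCR import DeepTCR_U

# Illustrative object and data locations.
DTCRU = DeepTCR_U('VAE_Example')
DTCRU.Get_Data(directory='Data/My_Cohort', aa_column_beta=1, count_column=2, sep='\t')

# Train the VAE with a soft sparsity constraint on the latent nodes and keep
# only the latent features that cover 99% of the explained variance.
DTCRU.Train_VAE(latent_dim=256, sparsity_alpha=1.0, var_explained=0.99, graph_seed=0)

print(DTCRU.features.shape)                  # (n_sequences, n_selected_latent_dims)
print(DTCRU.explained_variance_ratio_[:5])   # top latent features by explained variance
```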


DeepTCR_WF

Get_Train_Valid_Test(self, test_size=0.25, LOO=None, combine_train_valid=False, random_perm=False)

Train/Valid/Test Splits.

Divide data into train, valid, and test sets. In general, the training set is used to train model parameters, the validation set is used for early stopping, and the test set acts as a black-box independent test set.

Parameters:
  • test_size (float) – Fraction of sample to be used for valid and test set. For example, if set to 0.25, 25% of the data will be set aside for validation and testing sets. In other words, 75% of the data is used for training.

  • LOO (int) – Number of samples to leave-out in Leave-One-Out Cross-Validation. For example, when set to 2, 2 samples will be left out for the validation set and 2 samples will be left out for the test set.

  • combine_train_valid (bool) – To combine the training and validation partitions into one partition that is used for training and updating the model parameters, set this to True. This will also set the validation partition to the test partition. In other words, the new train set becomes (original train + original valid), and both the new valid and new test sets become the original test partition. Therefore, if setting this parameter to True, change one of the training parameters that sets the stop-training criterion (i.e. train_loss_min) so training stops based on the train set. If one does not change the stop-training criterion, the decision of when to stop training will be based on the test data (which is considered a form of over-fitting).

  • random_perm (bool) – To do random permutation testing, one can set this parameter to True and this will shuffle the labels.
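
A brief sketch of the two splitting modes (fractional split versus leave-N-out); the object name is illustrative and data is assumed to have been loaded beforehand:

```python
from DeepTCR.DeepTCR import DeepTCR_WF

DTCR_WF = DeepTCR_WF('Repertoire_Classifier')   # illustrative object; data loaded beforehand

# Fractional split: 25% of samples reserved for the validation and test partitions.
DTCR_WF.Get_Train_Valid_Test(test_size=0.25)

# Leave-N-out split: 2 samples held out for validation and 2 for testing.
DTCR_WF.Get_Train_Valid_Test(LOO=2)
```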

K_Fold_CrossVal(self, folds=None, combine_train_valid=False, seeds=None, kernel=5, num_concepts=12, trainable_embedding=True, embedding_dim_aa=64, embedding_dim_genes=48, embedding_dim_hla=12, num_fc_layers=0, units_fc=12, weight_by_class=False, class_weights=None, use_only_seq=False, use_only_gene=False, use_only_hla=False, size_of_net='medium', graph_seed=None, qualitative_agg=True, quantitative_agg=False, num_agg_layers=0, units_agg=12, drop_out_rate=0.0, multisample_dropout=False, multisample_dropout_rate=0.5, multisample_dropout_num_masks=64, batch_size=25, batch_size_update=None, epochs_min=25, stop_criterion=0.25, stop_criterion_window=10, accuracy_min=None, train_loss_min=None, hinge_loss_t=0.0, convergence='validation', learning_rate=0.001, suppress_output=False, loss_criteria='mean', l2_reg=0.0, batch_seed=None, subsample=None, subsample_by_freq=False, subsample_valid_test=False)

K_Fold Cross-Validation for Whole Sample Classifier

If the number of samples is small but one is training the whole sample classifier, one can use K-Fold Cross-Validation to train on all but one fold before assessing predictive performance on the held-out fold. After this method is run, the AUC_Curve method can be run to assess the overall performance.

The two MIL options alter how the predictive signatures in the neural network are aggregated to make a prediction about the repertoire. If qualitative_agg or quantitative_agg is set to True, the corresponding type of aggregation is included in the predictions. One can set either to True, or both to True, allowing a user to incorporate features from multiple modes of aggregation. See below for further details on these methods of aggregation across the sequences of a repertoire.

The multisample parameters are used to implement Multi-Sample Dropout at the final layer of the model, as described in "Multi-Sample Dropout for Accelerated Training and Better Generalization" (https://arxiv.org/abs/1905.09788). This method has been shown to improve generalization of deep neural networks as well as improve convergence.

Parameters:
  • folds (int) – Number of Folds

  • combine_train_valid (bool) – To combine the training and validation partitions into one which will be used for training and updating the model parameters, set this to True. This will also set the validation partition to the test partition. In other words, new train set becomes (original train + original valid) and then new valid = original test partition, new test = original test partition. Therefore, if setting this parameter to True, change one of the training parameters to set the stop training criterion (i.e. train_loss_min) to stop training based on the train set. If one does not chanage the stop training criterion, the decision of when to stop training will be based on the test data (which is considered a form of over-fitting).

  • seeds (nd.array) – In order to set a deterministic train/test split over the K-Fold simulations, one can provide an array of seeds, one per fold. This will result in the same train/test split over the N fold simulations. This parameter, if provided, should have the same size as the value of folds.

  • kernel (int) – Size of convolutional kernel for first layer of convolutions.

  • num_concepts (int) – Number of concepts for the multi-head attention mechanism. Depending on the expected heterogeneity of the repertoires being analyzed, one can adjust this hyperparameter.

  • trainable_embedding (bool) – Toggle to control whether a trainable embedding layer is used or native one-hot representation for convolutional layers.

  • embedding_dim_aa (int) – Learned latent dimensionality of amino-acids.

  • embedding_dim_genes (int) – Learned latent dimensionality of VDJ genes

  • embedding_dim_hla (int) – Learned latent dimensionality of HLA

  • num_fc_layers (int) – Number of fully connected layers following convolutional layer.

  • units_fc (int) – Number of nodes per fully-connected layers following convolutional layer.

  • weight_by_class (bool) – Option to weight loss by the inverse of the class frequency. Useful for unbalanced classes.

  • class_weights (dict) – In order to specify custom weights for each class during training, one can provide a dictionary with these weights, i.e. {'A':1.0,'B':2.0}.

  • use_only_seq (bool) – To only use sequence features, set to True. This will turn off features learned from gene usage.

  • use_only_gene (bool) – To only use gene-usage features, set to True. This will turn off features from the sequences.

  • use_only_hla (bool) – To only use HLA features, set to True.

  • size_of_net (list or str) – The convolutional stack of this network has 3 layers for which the user can modify the number of neurons per layer. The user can specify the size of the network with one of the following options:

    • small == [12,32,64] neurons for the 3 respective layers
    • medium == [32,64,128] neurons for the 3 respective layers
    • large == [64,128,256] neurons for the 3 respective layers
    • custom, where the user supplies a list with the number of neurons for the respective layers, i.e. [3,3,3] would have 3 neurons for all 3 layers.
  • graph_seed (int) – For deterministic initialization of weights of the graph, set this to value of choice.

  • qualitative_agg (bool) – If set to True, the model will aggregate the feature values per repertoire weighted by frequency of each TCR. This is considered a 'qualitative' aggregation as the prediction of the repertoire is based on the relative distribution of the repertoire. In other words, this type of aggregation is a count-independent measure of aggregation. This is the mode of aggregation that has been more thoroughly tested across multiple scientific examples.

  • quantitative_agg (bool) – If set to True, the model will aggregate the feature values per repertoire weighted by counts of each TCR. This is considered a 'quantitative' aggregation as the prediction of the repertoire is based on the absolute distribution of the repertoire. In other words, this type of aggregation is a count-dependent measure of aggregation. If one believes the counts are important for the predictive value of the model, one can set this parameter to True.

  • num_agg_layers (int) – Following the aggregation layer in the network, one can choose to add more fully-connected layers before the final classification layer. This parameter will set how many layers to add after aggregation. This likely is helpful when using both types of aggregation (as detailed above) to combine those feature values.

  • units_agg (int) – For the fully-connected layers after aggregation, this parameter sets the number of units/nodes per layer.

  • drop_out_rate (float) – Dropout rate for the fully-connected layers.

  • multisample_dropout (bool) – To use Multi-Sample Dropout at the final layer of the model (described above), set this to True.

  • multisample_dropout_rate (float) – The dropout rate for the multi-sample dropout layer.

  • multisample_dropout_num_masks (int) – The number of masks to sample from for the Multi-Sample Dropout layer.

  • batch_size (int) – Size of batch to be used for each training iteration of the net.

  • batch_size_update (int) – In the case that the samples are very large, one may not want to update the weights of the network as often as batches are put onto the GPU. To update the weights less often than batches are loaded, set this parameter to something other than None. For example, if batch_size is set to 5 and batch_size_update is set to 30, only 5 samples will be put on the GPU at a time, but the weights will only be updated after 30 samples have been processed. This parameter is only relevant when using GPUs for training and there are memory constraints from very large samples.

  • epochs_min (int) – Minimum number of epochs for training neural network.

  • stop_criterion (float) – Minimum percent decrease in determined interval (below) to continue training. Used as early stopping criterion.

  • stop_criterion_window (int) – The window of data to apply the stopping criterion.

  • accuracy_min (float) – Optional parameter to allow alternative training strategy until minimum training accuracy is achieved, at which point, training ceases.

  • train_loss_min (float) – Optional parameter to allow alternative training strategy until minimum training loss is achieved, at which point, training ceases.

  • hinge_loss_t (float) – The per sample loss minimum at which the loss of that sample is not used to penalize the model anymore. In other words, once a per sample loss has hit this value, it gets set to 0.0.

  • convergence (str) – This parameter determines which loss the convergence criterion is assessed on. Options are 'validation' or 'training'. This is useful in the case one wants to assess convergence on the training data when the training and validation partitions have been combined and used to train the model.

  • learning_rate (float) – The learning rate for training the neural network. Making this value larger will increase the rate of convergence but can introduce instability into training. For most, altering this value will not be necessary.

  • suppress_output (bool) – To suppress command line output with training statistics, set to True.

  • l2_reg (float) – When training the repertoire classifier, it may help to utilize L2 regularization to prevent sample-specific overfitting of the model. By setting the value of this parameter (i.e. 0.01), one will introduce L2 regularization through all but the last layer of the network.

  • batch_seed (int) – For deterministic batching during training, set this value to an integer of choice.

  • subsample (int) – Number of sequences to sub-sample from repertoire during training to improve speed of convergence as well as being a form of regularization.

  • subsample_by_freq (bool) – Whether to sub-sample randomly in the repertoire or as a function of the frequency of the TCR.

  • subsample_valid_test (bool) – Whether to sub-sample the valid/test cohorts while training. This is done mostly to improve speed of convergence and generalizability.
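
As an illustrative sketch (parameter values are placeholders), a small cohort could be trained with 5-fold cross-validation, combining the train and valid partitions and therefore stopping on a training-loss criterion:

```python
# 5-fold cross-validation on a previously loaded DeepTCR_WF object.
DTCR_WF.K_Fold_CrossVal(folds=5,
                        combine_train_valid=True,
                        train_loss_min=0.10,   # stop on train loss since train/valid are combined
                        subsample=5000,        # sub-sample sequences per repertoire for speed
                        graph_seed=0)
DTCR_WF.AUC_Curve()                            # assess overall performance across folds
```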

Monte_Carlo_CrossVal(self, folds=5, test_size=0.25, LOO=None, combine_train_valid=False, random_perm=False, seeds=None, kernel=5, num_concepts=12, trainable_embedding=True, embedding_dim_aa=64, embedding_dim_genes=48, embedding_dim_hla=12, num_fc_layers=0, units_fc=12, weight_by_class=False, class_weights=None, use_only_seq=False, use_only_gene=False, use_only_hla=False, size_of_net='medium', graph_seed=None, qualitative_agg=True, quantitative_agg=False, num_agg_layers=0, units_agg=12, drop_out_rate=0.0, multisample_dropout=False, multisample_dropout_rate=0.5, multisample_dropout_num_masks=64, batch_size=25, batch_size_update=None, epochs_min=25, stop_criterion=0.25, stop_criterion_window=10, accuracy_min=None, train_loss_min=None, hinge_loss_t=0.0, convergence='validation', learning_rate=0.001, suppress_output=False, loss_criteria='mean', l2_reg=0.0, batch_seed=None, subsample=None, subsample_by_freq=False, subsample_valid_test=False)

Monte Carlo Cross-Validation for Whole Sample Classifier

If the number of samples is small but one is training the whole sample classifier, one can use Monte Carlo Cross-Validation to train over a number of iterations before assessing predictive performance. After this method is run, the AUC_Curve method can be run to assess the overall performance.

The two MIL options alter how the predictive signatures in the neural network are aggregated to make a prediction about the repertoire. If qualitative_agg or quantitative_agg is set to True, the corresponding type of aggregation is included in the predictions. One can set either to True, or both to True, allowing a user to incorporate features from multiple modes of aggregation. See below for further details on these methods of aggregation across the sequences of a repertoire.

The multisample parameters are used to implement Multi-Sample Dropout at the final layer of the model, as described in "Multi-Sample Dropout for Accelerated Training and Better Generalization" (https://arxiv.org/abs/1905.09788). This method has been shown to improve generalization of deep neural networks as well as improve convergence.

Parameters:
  • folds (int) – Number of iterations for Cross-Validation

  • test_size (float) – Fraction of sample to be used for valid and test set.

  • LOO (int) – Number of samples to leave-out in Leave-One-Out Cross-Validation

  • combine_train_valid (bool) – To combine the training and validation partitions into one partition that is used for training and updating the model parameters, set this to True. This will also set the validation partition to the test partition. In other words, the new train set becomes (original train + original valid), and both the new valid and new test sets become the original test partition. Therefore, if setting this parameter to True, change one of the training parameters that sets the stop-training criterion (i.e. train_loss_min) so training stops based on the train set. If one does not change the stop-training criterion, the decision of when to stop training will be based on the test data (which is considered a form of over-fitting).

  • random_perm (bool) – To do random permutation testing, one can set this parameter to True and this will shuffle the labels.

  • seeds (nd.array) – In order to set a deterministic train/test split over the Monte-Carlo simulations, one can provide an array of seeds, one per MC simulation. This will result in the same train/test split over the N MC simulations. This parameter, if provided, should have the same size as the value of folds.

  • kernel (int) – Size of convolutional kernel for first layer of convolutions.

  • num_concepts (int) – Number of concepts for the multi-head attention mechanism. Depending on the expected heterogeneity of the repertoires being analyzed, one can adjust this hyperparameter.

  • trainable_embedding (bool) – Toggle to control whether a trainable embedding layer is used or native one-hot representation for convolutional layers.

  • embedding_dim_aa (int) – Learned latent dimensionality of amino-acids.

  • embedding_dim_genes (int) – Learned latent dimensionality of VDJ genes

  • embedding_dim_hla (int) – Learned latent dimensionality of HLA

  • num_fc_layers (int) – Number of fully connected layers following convolutional layer.

  • units_fc (int) – Number of nodes per fully-connected layers following convolutional layer.

  • weight_by_class (bool) – Option to weight loss by the inverse of the class frequency. Useful for unbalanced classes.

  • class_weights (dict) – In order to specify custom weights for each class during training, one can provide a dictionary with these weights, i.e. {'A':1.0,'B':2.0}.

  • use_only_seq (bool) – To only use sequence features, set to True. This will turn off features learned from gene usage.

  • use_only_gene (bool) – To only use gene-usage features, set to True. This will turn off features from the sequences.

  • use_only_hla (bool) – To only use HLA features, set to True.

  • size_of_net (list or str) – The convolutional stack of this network has 3 layers for which the user can modify the number of neurons per layer. The user can specify the size of the network with one of the following options:

    • small == [12,32,64] neurons for the 3 respective layers
    • medium == [32,64,128] neurons for the 3 respective layers
    • large == [64,128,256] neurons for the 3 respective layers
    • custom, where the user supplies a list with the number of neurons for the respective layers, i.e. [3,3,3] would have 3 neurons for all 3 layers.
  • graph_seed (int) – For deterministic initialization of weights of the graph, set this to value of choice.

  • qualitative_agg (bool) – If set to True, the model will aggregate the feature values per repertoire weighted by frequency of each TCR. This is considered a 'qualitative' aggregation as the prediction of the repertoire is based on the relative distribution of the repertoire. In other words, this type of aggregation is a count-independent measure of aggregation. This is the mode of aggregation that has been more thoroughly tested across multiple scientific examples.

  • quantitative_agg (bool) – If set to True, the model will aggregate the feature values per repertoire weighted by counts of each TCR. This is considered a 'quantitative' aggregation as the prediction of the repertoire is based on the absolute distribution of the repertoire. In other words, this type of aggregation is a count-dependent measure of aggregation. If one believes the counts are important for the predictive value of the model, one can set this parameter to True.

  • num_agg_layers (int) – Following the aggregation layer in the network, one can choose to add more fully-connected layers before the final classification layer. This parameter will set how many layers to add after aggregation. This likely is helpful when using both types of aggregation (as detailed above) to combine those feature values.

  • units_agg (int) – For the fully-connected layers after aggregation, this parameter sets the number of units/nodes per layer.

  • drop_out_rate (float) – Dropout rate for the fully-connected layers.

  • multisample_dropout (bool) – To use Multi-Sample Dropout at the final layer of the model (described above), set this to True.

  • multisample_dropout_rate (float) – The dropout rate for the multi-sample dropout layer.

  • multisample_dropout_num_masks (int) – The number of masks to sample from for the Multi-Sample Dropout layer.

  • batch_size (int) – Size of batch to be used for each training iteration of the net.

  • batch_size_update (int) – In the case that the samples are very large, one may not want to update the weights of the network as often as batches are put onto the GPU. To update the weights less often than batches are loaded, set this parameter to something other than None. For example, if batch_size is set to 5 and batch_size_update is set to 30, only 5 samples will be put on the GPU at a time, but the weights will only be updated after 30 samples have been processed. This parameter is only relevant when using GPUs for training and there are memory constraints from very large samples.

  • epochs_min (int) – Minimum number of epochs for training neural network.

  • stop_criterion (float) – Minimum percent decrease in determined interval (below) to continue training. Used as early stopping criterion.

  • stop_criterion_window (int) – The window of data to apply the stopping criterion.

  • accuracy_min (float) – Optional parameter to allow alternative training strategy until minimum training accuracy is achieved, at which point, training ceases.

  • train_loss_min (float) – Optional parameter to allow alternative training strategy until minimum training loss is achieved, at which point, training ceases.

  • hinge_loss_t (float) – The per sample loss minimum at which the loss of that sample is not used to penalize the model anymore. In other words, once a per sample loss has hit this value, it gets set to 0.0.

  • convergence (str) – This parameter determines which loss the convergence criterion is assessed on. Options are 'validation' or 'training'. This is useful in the case one wants to assess convergence on the training data when the training and validation partitions have been combined and used to train the model.

  • learning_rate (float) – The learning rate for training the neural network. Making this value larger will increase the rate of convergence but can introduce instability into training. For most, altering this value will not be necessary.

  • suppress_output (bool) – To suppress command line output with training statistics, set to True.

  • l2_reg (float) – When training the repertoire classifier, it may help to utilize L2 regularization to prevent sample-specific overfitting of the model. By setting the value of this parameter (i.e. 0.01), one will introduce L2 regularization through all but the last layer of the network.

  • batch_seed (int) – For deterministic batching during training, set this value to an integer of choice.

  • subsample (int) – Number of sequences to sub-sample from repertoire during training to improve speed of convergence as well as being a form of regularization.

  • subsample_by_freq (bool) – Whether to sub-sample randomly in the repertoire or as a function of the frequency of the TCR.

  • subsample_valid_test (bool) – Whether to sub-sample the valid/test cohorts while training. This is done mostly to improve speed of convergence and generalizability.

Returns:
  • MC Predictions

    • self.DFs_pred (dict of dataframes) – This method returns the samples in the test sets of the Monte-Carlo simulations and their predicted probabilities for each class.
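
For instance (illustrative values), one might run 25 Monte-Carlo iterations with one sample left out per partition and then inspect the collected test-set predictions:

```python
# Monte-Carlo cross-validation on a previously loaded DeepTCR_WF object.
DTCR_WF.Monte_Carlo_CrossVal(folds=25,
                             LOO=1,
                             qualitative_agg=True,
                             quantitative_agg=True,   # also include count-dependent aggregation
                             epochs_min=25)
DTCR_WF.AUC_Curve()

# DFs_pred collects test-set samples and their predicted probabilities per class.
for key, df in DTCR_WF.DFs_pred.items():
    print(key, df.shape)
```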

Sample_Inference(self, sample_labels=None, alpha_sequences=None, beta_sequences=None, v_beta=None, d_beta=None, j_beta=None, v_alpha=None, j_alpha=None, p=None, hla=None, freq=None, counts=None, batch_size=10, models=None, return_dist=False)

Predicting outputs of sample/repertoire model on new data

This method allows a user to take a pre-trained sample/repertoire classifier and generate outputs from the model on new data. It will return predicted probabilities for the given classes for the new data. If the model has been trained with Monte-Carlo or K-Fold Cross-Validation, a model is created for each iteration of the cross-validation. If the 'models' parameter is left as None, this method will conduct inference with all models trained in cross-validation and output the average predicted value per sample along with the distribution of predictions for further downstream use. For example, by looking at the distribution of predictions for a given sample over all models trained, one can determine which samples have a high level of certainty in their predictions versus those with a lower level of certainty. In essence, by training multiple models in cross-validation schemes, the user can generate a distribution of predictions on a per-sample basis, which provides a better understanding of the prediction. Alternatively, one can fill in the models parameter with a list of models the user wants to use for inference.

To load data from directories, one can use the Get_Data method from the base class to automatically format the data into the proper format to be then input into this method.

One can also use this method to get per-sequence predictions from the sample/repertoire classifier. To do this, provide all inputs except for sample_labels. The method will then return an array of dimensionality [N, n_classes] where N is the number of sequences provided. When using the method in this way, be sure to adjust batch_size to a larger value, as 10 sequences per batch will be rather slow. We recommend increasing it to the order of thousands (i.e. 10-100k).

Parameters:
  • sample_labels (ndarray of strings) – A 1d array with sample labels for the sequence.

  • alpha_sequences (ndarray of strings) – A 1d array with the sequences for inference for the alpha chain.

  • beta_sequences (ndarray of strings) – A 1d array with the sequences for inference for the beta chain.

  • v_beta (ndarray of strings) – A 1d array with the v-beta genes for inference.

  • d_beta (ndarray of strings) – A 1d array with the d-beta genes for inference.

  • j_beta (ndarray of strings) – A 1d array with the j-beta genes for inference.

  • v_alpha (ndarray of strings) – A 1d array with the v-alpha genes for inference.

  • j_alpha (ndarray of strings) – A 1d array with the j-alpha genes for inference.

  • counts (ndarray of ints) – A 1d array with the counts for each sequence.

  • freq (ndarray of float values) – A 1d array with the frequencies for each sequence.

  • hla (ndarray of tuples/arrays) – To input the HLA context for each sequence fed into DeepTCR, this will need to be formatted as an ndarray that is (N,) for each sequence, where each entry is a tuple or array of strings referring to the alleles seen for that sequence, e.g. ('A*01:01', 'A*11:01', 'B*35:01', 'B*35:02', 'C*04:01').

    • If the model used for inference was trained to use HLA-supertypes, one should still enter the HLA in the format it was provided to the original model (i.e. A0101). This method will then convert those HLA alleles into the appropriate supertype designation. The HLA alleles DO NOT need to be provided to this method in the supertype designation.
  • p (multiprocessing pool object) – a pre-formed pool object can be passed to method for multiprocessing tasks.

  • batch_size (int) – Batch size for inference.

  • models (list) – List of models in Name_Of_Object/models to use for inference. If left as None, this method will use all models in that directory.

  • return_dist (bool) – If the user wants to also return the distribution of sample/sequence predictions over all models used for inference, set this value to True.

Returns:
  • Predictions

    • self.Inference_Sample_List (ndarray) – An array with the sample list corresponding to the predicted probabilities.

    • self.Inference_Pred (ndarray) – An array with the predicted probabilities for all classes. These represent the average probability over all models used for inference.

    • self.Inference_Pred_Dict (dict) – A dictionary of predicted probabilities for the respective classes. These represent the average probability over all models used for inference.

    • self.Inference_Pred_Dist (ndarray) – An array with the predicted probabilities for all classes on a per-model basis. shape = [Number of Models, Number of Samples, Number of Classes]

If sample_labels is not provided, the method will perform per-sequence predictions and return an array of [N, n_classes]. If return_dist is set to True, the method will return two outputs: one containing the mean predictions and the other containing the full distribution over all models.
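
A minimal inference sketch on made-up data (the sequences, sample names, and frequencies below are purely illustrative) for a previously trained repertoire classifier:

```python
import numpy as np

beta_sequences = np.array(['CASSLGTDTQYF', 'CASSPGQGAYEQYF', 'CASSLAGDTQYF'])
sample_labels  = np.array(['new_sample_1', 'new_sample_1', 'new_sample_2'])
freq           = np.array([0.7, 0.3, 1.0])

# Uses all models trained during cross-validation since models=None.
DTCR_WF.Sample_Inference(sample_labels=sample_labels,
                         beta_sequences=beta_sequences,
                         freq=freq,
                         batch_size=1000)

print(DTCR_WF.Inference_Sample_List)   # sample names
print(DTCR_WF.Inference_Pred)          # average predicted probabilities per class
```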

Train(self, kernel=5, num_concepts=12, trainable_embedding=True, embedding_dim_aa=64, embedding_dim_genes=48, embedding_dim_hla=12, num_fc_layers=0, units_fc=12, weight_by_class=False, class_weights=None, use_only_seq=False, use_only_gene=False, use_only_hla=False, size_of_net='medium', graph_seed=None, qualitative_agg=True, quantitative_agg=False, num_agg_layers=0, units_agg=12, drop_out_rate=0.0, multisample_dropout=False, multisample_dropout_rate=0.5, multisample_dropout_num_masks=64, batch_size=25, batch_size_update=None, epochs_min=25, stop_criterion=0.25, stop_criterion_window=10, accuracy_min=None, train_loss_min=None, hinge_loss_t=0.0, convergence='validation', learning_rate=0.001, suppress_output=False, loss_criteria='mean', l2_reg=0.0, batch_seed=None, subsample=None, subsample_by_freq=False, subsample_valid_test=False)

Train Whole-Sample Classifier

This method trains the network and saves feature values at the end of training for downstream analysis.

The method also saves the per-sequence predictions at the end of training in the variable self.predicted.

The two MIL options alter how the predictive signatures in the neural network are aggregated to make a prediction about the repertoire. If qualitative_agg or quantitative_agg is set to True, the corresponding type of aggregation is included in the predictions. One can set either to True, or both to True, allowing a user to incorporate features from multiple modes of aggregation. See below for further details on these methods of aggregation across the sequences of a repertoire.

The multisample parameters are used to implement Multi-Sample Dropout at the final layer of the model, as described in "Multi-Sample Dropout for Accelerated Training and Better Generalization" (https://arxiv.org/abs/1905.09788). This method has been shown to improve generalization of deep neural networks as well as improve convergence.

Parameters:
  • kernel (int) – Size of convolutional kernel for first layer of convolutions.

  • num_concepts (int) – Number of concepts for the multi-head attention mechanism. Depending on the expected heterogeneity of the repertoires being analyzed, one can adjust this hyperparameter.

  • trainable_embedding (bool) – Toggle to control whether a trainable embedding layer is used or native one-hot representation for convolutional layers.

  • embedding_dim_aa (int) – Learned latent dimensionality of amino-acids.

  • embedding_dim_genes (int) – Learned latent dimensionality of VDJ genes

  • embedding_dim_hla (int) – Learned latent dimensionality of HLA

  • num_fc_layers (int) – Number of fully connected layers following convolutional layer.

  • units_fc (int) – Number of nodes per fully-connected layers following convolutional layer.

  • weight_by_class (bool) – Option to weight loss by the inverse of the class frequency. Useful for unbalanced classes.

  • class_weights (dict) – In order to specify custom weights for each class during training, one can provide a dictionary with these weights, i.e. {'A':1.0,'B':2.0}.

  • use_only_seq (bool) – To only use sequence features, set to True. This will turn off features learned from gene usage.

  • use_only_gene (bool) – To only use gene-usage features, set to True. This will turn off features from the sequences.

  • use_only_hla (bool) – To only use HLA features, set to True.

  • size_of_net (list or str) – The convolutional stack of this network has 3 layers for which the user can modify the number of neurons per layer. The user can specify the size of the network with one of the following options:

    • small == [12,32,64] neurons for the 3 respective layers
    • medium == [32,64,128] neurons for the 3 respective layers
    • large == [64,128,256] neurons for the 3 respective layers
    • custom, where the user supplies a list with the number of neurons for the respective layers, i.e. [3,3,3] would have 3 neurons for all 3 layers. One can also adjust the number of layers in the convolutional stack by changing the length of this list: [3,3,3] = 3 layers, [3,3,3,3] = 4 layers.
  • graph_seed (int) – For deterministic initialization of weights of the graph, set this to value of choice.

  • qualitative_agg (bool) – If set to True, the model will aggregate the feature values per repertoire weighted by frequency of each TCR. This is considered a 'qualitative' aggregation as the prediction of the repertoire is based on the relative distribution of the repertoire. In other words, this type of aggregation is a count-independent measure of aggregation. This is the mode of aggregation that has been more thoroughly tested across multiple scientific examples.

  • quantitative_agg (bool) – If set to True, the model will aggregate the feature values per repertoire weighted by counts of each TCR. This is considered a 'quantitative' aggregation as the prediction of the repertoire is based on the absolute distribution of the repertoire. In other words, this type of aggregation is a count-dependent measure of aggregation. If one believes the counts are important for the predictive value of the model, one can set this parameter to True.

  • num_agg_layers (int) – Following the aggregation layer in the network, one can choose to add more fully-connected layers before the final classification layer. This parameter will set how many layers to add after aggregation. This likely is helpful when using both types of aggregation (as detailed above) to combine those feature values.

  • units_agg (int) – For the fully-connected layers after aggregation, this parameter sets the number of units/nodes per layer.

  • drop_out_rate (float) – Dropout rate for the fully-connected layers.

  • multisample_dropout (bool) – To use Multi-Sample Dropout at the final layer of the model (described above), set this to True.

  • multisample_dropout_rate (float) – The dropout rate for the multi-sample dropout layer.

  • multisample_dropout_num_masks (int) – The number of masks to sample from for the Multi-Sample Dropout layer.

  • batch_size (int) – Size of batch to be used for each training iteration of the net.

  • batch_size_update (int) – In the case that the samples are very large, one may not want to update the weights of the network as often as batches are put onto the GPU. To update the weights less often than batches are loaded, set this parameter to something other than None. For example, if batch_size is set to 5 and batch_size_update is set to 30, only 5 samples will be put on the GPU at a time, but the weights will only be updated after 30 samples have been processed. This parameter is only relevant when using GPUs for training and there are memory constraints from very large samples.

  • epochs_min (int) – Minimum number of epochs for training neural network.

  • stop_criterion (float) – Minimum percent decrease in determined interval (below) to continue training. Used as early stopping criterion.

  • stop_criterion_window (int) – The window of data to apply the stopping criterion.

  • accuracy_min (float) – Optional parameter to allow alternative training strategy until minimum training accuracy is achieved, at which point, training ceases.

  • train_loss_min (float) – Optional parameter to allow alternative training strategy until minimum training loss is achieved, at which point, training ceases.

  • hinge_loss_t (float) – The per sample loss minimum at which the loss of that sample is not used to penalize the model anymore. In other words, once a per sample loss has hit this value, it gets set to 0.0.

  • convergence (str) – This parameter determines which loss the convergence criterion is assessed on. Options are 'validation' or 'training'. This is useful in the case one wants to assess convergence on the training data when the training and validation partitions have been combined and used to train the model.

  • learning_rate (float) – The learning rate for training the neural network. Making this value larger will increase the rate of convergence but can introduce instability into training. For most, altering this value will not be necessary.

  • suppress_output (bool) – To suppress command line output with training statistics, set to True.

  • l2_reg (float) – When training the repertoire classifier, it may help to utilize L2 regularization to prevent sample-specific overfitting of the model. By setting the value of this parameter (i.e. 0.01), one will introduce L2 regularization through all but the last layer of the network.

  • batch_seed (int) – For deterministic batching during training, set this value to an integer of choice.

  • subsample (int) – Number of sequences to sub-sample from repertoire during training to improve speed of convergence as well as being a form of regularization.

  • subsample_by_freq (bool) – Whether to sub-sample randomly in the repertoire or as a function of the frequency of the TCR.

  • subsample_valid_test (bool) – Whether to sub-sample the valid/test cohorts while training. This is done mostly to improve speed of convergence and generalizability.
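
A single training run might look like the following sketch (values are illustrative), using the split created by Get_Train_Valid_Test on a previously loaded DeepTCR_WF object:

```python
DTCR_WF.Get_Train_Valid_Test(test_size=0.25)
DTCR_WF.Train(kernel=5,
              num_concepts=12,
              multisample_dropout=True,        # Multi-Sample Dropout at the final layer
              multisample_dropout_rate=0.5,
              l2_reg=0.01,                     # L2 regularization on all but the last layer
              graph_seed=0)

per_sequence_predictions = DTCR_WF.predicted   # saved at the end of training
```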

feature_analytics_class

Cluster(self, set='all', clustering_method='phenograph', t=None, criterion='distance', linkage_method='ward', write_to_sheets=False, sample=None, n_jobs=1, order_by_linkage=False)

Clustering Sequences by Latent Features

This method clusters all sequences by learned latent features from either the variational autoencoder or the supervised methods. Several clustering algorithms are included: Phenograph, DBSCAN, and hierarchical clustering. DBSCAN is implemented from the sklearn package; hierarchical clustering is implemented from the scipy package.

Parameters:
  • set (str) – To choose which set of sequences to analyze, enter either 'all', 'train', 'valid', or 'test'. Since the sequences in the train set may be overfit, it is generally preferable to examine the test set on its own.

  • clustering_method (str) – Clustering algorithm to use to cluster TCR sequences. Options include phenograph, dbscan, or hierarchical. When using dbscan or hierarchical clustering and no t value is provided, a variety of thresholds are tried to find an optimum silhouette score before a final clustering threshold is chosen.

  • t (float) – If t is provided, this is used as a distance threshold for hierarchical clustering or the eps value for dbscan.

  • criterion (str) – Clustering criterion as allowed by fcluster function in scipy.cluster.hierarchy module. (Used in hierarchical clustering).

  • linkage_method (str) – method parameter for linkage as allowed by scipy.cluster.hierarchy.linkage

  • write_to_sheets (bool) – To write clusters to separate csv files in folder named 'Clusters' under results folder, set to True. Additionally, if set to True, a csv file will be written in results directory that contains the frequency contribution of each cluster to each sample.

  • sample (int) – For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample a number of sequences and then use k-nearest neighbors to assign other sequences.

  • n_jobs (int) – Number of processes to use for parallel operations.

  • order_by_linkage (bool) – To list sequences in the cluster dataframes by how they are related via Ward's linkage, set this value to True. Otherwise, each cluster dataframe will list the sequences in the order they were loaded into DeepTCR.

Returns:
  • Cluster Results

    • self.Cluster_DFs (list of Pandas dataframes) – Clusters by sequences/label
    • self.var (list) – Variance of lengths in each cluster
    • self.Cluster_Frequencies (Pandas dataframe) – A dataframe containing the frequency contribution of each cluster to each sample.
    • self.Cluster_Assignments (ndarray) – Array with cluster assignments by number.
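
For example (illustrative settings), clustering the test-set sequences of a trained DeepTCR object (DTCR_WF here, continuing the earlier sketches) with Phenograph and writing per-cluster files might look like:

```python
DTCR_WF.Cluster(set='test',
                clustering_method='phenograph',
                write_to_sheets=True,   # one csv per cluster under the results folder
                sample=10000,           # sub-sample for speed; remaining sequences assigned by k-NN
                n_jobs=4)

cluster_dfs = DTCR_WF.Cluster_DFs           # list of dataframes, one per cluster
assignments = DTCR_WF.Cluster_Assignments   # cluster id per sequence
frequencies = DTCR_WF.Cluster_Frequencies   # per-sample cluster frequency contributions
```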

Motif_Identification(self, group, p_val_threshold=0.05, by_samples=False, top_seq=10)

Motif Identification Supervised Classifiers

This method looks for enriched features in the predetermined group and returns fasta files in directory to be used with "https://weblogo.berkeley.edu/logo.cgi" to produce seqlogos.

Parameters:
  • group (string) – Class for analyzing enriched motifs.

  • p_val_threshold (float) – Significance threshold for enriched features/motifs for the Mann-Whitney U test.

  • by_samples (bool) – To run a motif identification that looks for enriched motifs at the sample level instead of the sequence level, set this parameter to True. Otherwise, the enrichment analysis will be done at the sequence level.

  • top_seq (int) – The number of sequences from which to derive the learned motifs. The larger the number, the more noisy the motif logo may be.

Returns:
  • Output

    • self.(alpha/beta)_group_features (Pandas Dataframe) – Sequences used to determine motifs in the fasta files are stored in this dataframe, where column names represent the feature number.
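
A short sketch of a sample-level motif search (the class label 'A' is illustrative):

```python
# Write fasta files of motifs enriched in class 'A' to the results directory;
# these can then be uploaded to https://weblogo.berkeley.edu/logo.cgi for seqlogos.
DTCR_WF.Motif_Identification(group='A',
                             p_val_threshold=0.05,
                             by_samples=True,
                             top_seq=25)
```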

Sample_Features(self, set='all', Weight_by_Freq=True)

Sample-Level Feature Values

This method returns a dataframe with the aggregate sample level features.

Parameters:
  • set (str) – To choose which set of sequences to analyze, enter either 'all', 'train', 'valid', or 'test'. Since the sequences in the train set may be overfit, it is generally preferable to examine the test set on its own.

  • Weight_by_Freq (bool) – Option to weight each sequence used in aggregate measure of feature across sample by its frequency.

Returns:
  • Sample Level Features

    • self.sample_features (pandas dataframe) – This function returns the average feature vector for each sample analyzed. This can be used to make further downstream comparisons such as inter-repertoire distances.

Structural_Diversity(self, sample=None, n_jobs=1)

Structural Diversity Measurements

This method first clusters sequences via the phenograph algorithm before computing the number of clusters and entropy of the data over these clusters to obtain a measurement of the structural diversity within a repertoire.

Parameters:
  • sample (int) – For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample a number of sequences.

  • n_jobs (int) – Number of processes to use for parallel operations.

Returns:
  • Diversity dataframe

    • self.Structural_Diversity_DF (Pandas dataframe) – A dataframe containing the number of clusters and entropy in each sample.

vis_class

HeatMap_Samples(self, set='all', filename='Heatmap_Samples.tif', Weight_by_Freq=True, color_dict=None, labels=True, font_scale=1.0, figsize=(12, 10), legend=False, legend_size=10)

HeatMap of Samples

This method creates a heatmap/clustermap for samples by latent features for the unsupervised deep learning methods.

Parameters:
  • set (str) – To choose which set of sequences to analyze, enter either 'all', 'train', 'valid', or 'test'. Since the sequences in the train set may be overfit, it is generally preferable to examine the test set on its own.

  • filename (str) – Name of file to save heatmap.

  • Weight_by_Freq (bool) – Option to weight each sequence used in aggregate measure of feature across sample by its frequency.

  • color_dict (dict) – Optional dictionary to provide specified colors for classes.

  • labels (bool) – Option to show names of samples on y-axis of heatmap.

  • font_scale (float) – This parameter controls the font size of the row labels. If there are many rows, one can make this value smaller to get better labeling of the rows.

  • figsize (tuple) – This parameter controls the size of the figure.

  • legend (bool) – Whether to show legend for class labels

  • legend_size (int) – Size of legend.

Returns:
  • Sample Features

    • self.sample_features (pandas dataframe) – This function returns the average feature vector for each sample analyzed. This can be used to make further downstream comparisons such as inter-repertoire distances.

HeatMap_Sequences(self, set='all', filename='Heatmap_Sequences.tif', sample_num=None, sample_num_per_class=None, color_dict=None, figsize=(12, 10), legend=False, legend_size=10)

HeatMap of Sequences

This method creates a heatmap/clustermap for sequences by latent features for the unsupervised deep learning methods.

Parameters:
  • set (str) – To choose which set of sequences to analyze, enter either 'all', 'train', 'valid', or 'test'. Since the sequences in the train set may be overfit, it is generally preferable to examine the test set on its own.

  • filename (str) – Name of file to save heatmap.

  • sample_num (int) – Number of events to randomly sample for heatmap.

  • sample_num_per_class (int) – Number of events to randomly sample per class for heatmap.

  • color_dict (dict) – Optional dictionary to provide specified colors for classes.

  • figsize (tuple) – This parameter controls the size of the figure.

  • legend (bool) – Whether to show legend for class labels

  • legend_size (int) – Size of legend.

Repertoire_Dendrogram(self, set='all', distance_metric='KL', sample=None, n_jobs=1, color_dict=None, dendrogram_radius=0.32, repertoire_radius=0.4, linkage_method='ward', gridsize=24, Load_Prev_Data=False, filename=None, sample_labels=False, gaussian_sigma=0.5, vmax=0.01, n_pad=5, lw=None, log_scale=False)

Repertoire Dendrogram

This method creates a visualization that shows and compares the distribution of the sample repertoires via UMAP and a provided distance metric. The underlying algorithm first applies phenograph clustering to determine the proportion of each sample within a given cluster. Then a distance metric is used to compare how far apart two samples are based on their cluster proportions. Various metrics can be provided, such as KL-divergence, correlation, and Euclidean distance.

Parameters:
  • set (str) – To choose which set of sequences to analyze, enter either 'all', 'train', 'valid', or 'test'. Since the sequences in the train set may be overfit, it is generally preferable to examine the test set on its own.

  • distance_metric (str) – Provided distance metric to determine repertoire-level distance from cluster proportions. Options include = (KL,correlation,euclidean,wasserstein,JS).

  • sample (int) – For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample a number of sequences and then use k-nearest neighbors to assign other sequences.

  • n_jobs (int) – Number of processes to use for parallel operations.

  • color_dict (dict) – Optional dictionary to provide specified colors for classes.

  • dendrogram_radius (float) – The radius of the dendrogram in the figure. This will usually require some adjustment given the number of samples.

  • repertoire_radius (float) – The radius of the repertoire plots in the figure. This will usually require some adjustment given the number of samples.

  • linkage_method (str) – linkage method used by scipy's linkage function

  • gridsize (int) – This parameter modifies the granularity of the hexbins for the repertoire density plots.

  • Load_Prev_Data (bool) – If method has been run before, one can load previous data used to construct the figure for faster figure creation. This is helpful when trying to format the figure correctly and will require the user to run the method multiple times.

  • filename (str) – To save dendrogram plot to results folder, enter a name for the file and the dendrogram will be saved to the results directory. i.e. dendrogram.png

  • sample_labels (bool) – To show the sample labels on the dendrogram, set to True.

  • gaussian_sigma (float) – The amount of blur to introduce in the plots.

  • vmax (float) – Highest color density value. Color scales from 0 to vmax (i.e. larger vmax == dimmer plot)

  • lw (float) – The width of the circle edge around each sample.

  • log_scale (bool) – To plot the log of the counts for the UMAP density plot, set this value to True. This can be particularly helpful for visualization if the populations are very clonal.

Returns:
  • Output

    • self.pairwise_distances (Pandas dataframe) – Pairwise distances of all samples.
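
An illustrative call comparing test-set repertoires by KL-divergence over their cluster proportions and saving the figure:

```python
DTCR_WF.Repertoire_Dendrogram(set='test',
                              distance_metric='KL',
                              sample=10000,
                              sample_labels=True,
                              filename='dendrogram.png')

distances = DTCR_WF.pairwise_distances   # pairwise sample distances (dataframe)
```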

UMAP_Plot(self, set='all', by_class=False, by_cluster=False, by_sample=False, freq_weight=False, show_legend=True, scale=100, Load_Prev_Data=False, alpha=1.0, sample=None, sample_per_class=None, filename=None, prob_plot=None, plot_by_class=False)

UMAP visualization of TCR Sequences

This method displays the sequences in a 2-dimensional UMAP where the user can color code points by class label, sample label, or a prior computed clustering solution. The size of points can also be made proportional to the frequency of the sequence within its sample.

Parameters:
  • set (str) – To choose which set of sequences to analyze, enter either 'all', 'train', 'valid', or 'test'. Since the sequences in the train set may be overfit, it is generally preferable to examine the test set on its own.

  • by_class (bool) – To color the points by their class label, set to True.

  • by_sample (bool) – To color the points by their sample label, set to True.

  • by_cluster (bool) – To color the points by the prior computed clustering solution, set to True.

  • freq_weight (bool) – To scale size of points proportionally to their frequency, set to True.

  • show_legend (bool) – To display legend, set to True.

  • scale (float) – To change the size of points, change the scale parameter. This is particularly useful for finding a good display size when points are scaled by frequency.

  • Load_Prev_Data (bool) – If method was run before, one can rerun this method with this parameter set to True to bypass recomputing the UMAP projection. Useful for generating different versions of the plot on the same UMAP representation.

  • alpha (float) – Value between 0-1 that controls transparency of points.

  • sample (int) – Number of events to sub-sample for visualization.

  • sample_per_class (int) – Number of events to randomly sample per class for UMAP.

  • filename (str) – To save umap plot to results folder, enter a name for the file and the umap will be saved to the results directory. i.e. umap.png

  • prob_plot (str) – To plot the predicted probabilities for the sequences as an additional heatmap, specify the class probability one wants to visualize (i.e. if the class of interest is class A, input 'A' as a string). Of note, only probabilities determined from sequences in the test set are displayed, as a means of not showing over-fit probabilities. Therefore, it is best to use this parameter when the set parameter is set to 'test'.
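
An illustrative call that plots test-set sequences colored by class, with point size scaled by clonal frequency:

```python
DTCR_WF.UMAP_Plot(set='test',
                  by_class=True,
                  freq_weight=True,
                  scale=100,
                  sample=25000,          # sub-sample events for a faster projection
                  filename='umap.png')
```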

UMAP_Plot_Samples(self, set='all', filename='UMAP_Samples.tif', Weight_by_Freq=True, scale=5, alpha=1.0)

UMAP visualization of TCR Samples

This method displays the samples in a 2-dimensional UMAP

Parameters:
  • set (str) – To choose which set of sequences to analyze, enter either 'all', 'train', 'valid', or 'test'. Since the sequences in the train set may be overfit, it is generally preferable to examine the test set on its own.

  • Weight_by_Freq (bool) – Option to weight each sequence used in aggregate measure of feature across sample by its frequency.

  • scale (float) – To change size of points, change scale parameter.

  • alpha (float) – Value between 0-1 that controls transparency of points.

  • filename (str) – To save umap plot to results folder, enter a name for the file and the umap will be saved to the results directory. i.e. umap.png