Identifying Hosts of Families of Viruses: A Machine Learning Approach

doi:10.1371/journal.pone.0027631

Figure 1.

Prediction accuracy for Picornaviridae.

A plot of (a) mean AUC vs boosting round, and (b) 95% confidence interval vs boosting round. The mean and standard deviation were computed over 10 folds of held-out data, for Picornaviridae, where . Boosting round 0 corresponds to introducing the offset term into the model. Thus, the boosting round can also be interpreted as one-half the number of decision rules (one-half because each round introduces a decision rule and its negation into the model).
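As a rough illustration of how a per-round summary like this could be produced, the sketch below computes a rank-based AUC for one fold and then a mean and 95% interval across folds. The function names are hypothetical, and the interval uses a normal approximation (mean ± 1.96 × standard error), which is an assumption; the caption does not state how the interval was derived.

```python
import math
from statistics import mean, stdev


def auc(labels, scores):
    """Rank-based AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize_folds(fold_aucs, z=1.96):
    """Mean AUC and an approximate 95% interval over held-out folds.

    Assumption: normal approximation with the sample standard error.
    """
    m = mean(fold_aucs)
    half_width = z * stdev(fold_aucs) / math.sqrt(len(fold_aucs))
    return m, (m - half_width, m + half_width)


def rules_in_model(boosting_round):
    """Each round adds a decision rule and its negation (round 0 is the offset)."""
    return 2 * boosting_round
```

Calling `summarize_folds` once per boosting round, on the list of per-fold AUC values for that round, would give the two curves described in the caption.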


Figure 2.

Prediction accuracy for Rhabdoviridae.

A plot of (a) mean AUC vs boosting round, and (b) 95% confidence interval vs boosting round. The mean and standard deviation were computed over 5 folds of held-out data, for Rhabdoviridae, where . The relatively higher uncertainty for this virus family was likely due to very small sample sizes. Note that the cyan curve lies on top of the red curve.


Table 1.

Validation error for virus subfamilies in Picornaviridae.


Table 2.

Validation error for virus subfamilies in Rhabdoviridae.


Figure 3.

Visualizing predictive subsequences.

A visualization of the mismatch neighborhood of the first 6 k-mers selected in an ADT for Picornaviridae, where . The virus proteomes are grouped vertically by their label with their lengths scaled to . Regions containing elements of the mismatch neighborhood of each k-mer are then indicated on the virus proteome. Note that the proteomes are not aligned along the selected k-mers but merely stacked vertically with their lengths normalized.
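A minimal sketch of the two ingredients behind such a plot: enumerating the mismatch neighborhood of a k-mer (all strings of the same length within a given Hamming distance), and locating where members of that neighborhood occur on a length-normalized sequence. The function names are hypothetical, and scaling positions to the unit interval is an assumption made only for illustration.

```python
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mismatch_neighborhood(kmer, m, alphabet=AMINO_ACIDS):
    """All strings of len(kmer) within Hamming distance m of kmer."""
    neighbors = {kmer}
    for d in range(1, m + 1):
        for positions in combinations(range(len(kmer)), d):
            for letters in product(alphabet, repeat=d):
                chars = list(kmer)
                for p, c in zip(positions, letters):
                    chars[p] = c
                neighbors.add("".join(chars))
    return neighbors


def normalized_hits(sequence, kmer, m):
    """Start positions (as fractions of sequence length) of windows that
    fall in the mismatch neighborhood of kmer.

    A window is in the neighborhood exactly when its Hamming distance to
    kmer is at most m, so the set is never materialized here.
    """
    k = len(kmer)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if sum(a != b for a, b in zip(window, kmer)) <= m:
            hits.append(i / len(sequence))
    return hits
```

Plotting the values returned by `normalized_hits` for each proteome on its own row, with rows grouped by host label, would give an unaligned stacked view of the kind the caption describes.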


Figure 4.

Visualizing predictive subsequences on aligned sequences.

A visualization of the mismatch neighborhood of the first 6 k-mers selected in an ADT for Picornaviridae, where . The virus proteomes are aligned using the multiple alignment algorithm COBALT and the alignments are grouped vertically by their label with gaps in the alignment indicated in grey. Regions containing elements of the mismatch neighborhood of each k-mer are then indicated on the alignment.


Figure 5.

Visualizing predictive regions of protein sequences.

A visualization of the mismatch neighborhood of the first 7 k-mers, selected in all ADTs over 10-fold cross-validation, for Picornaviridae, where . Regions containing elements of the mismatch neighborhood of each selected k-mer are indicated on the virus proteome, with the grayscale intensity on the plot being inversely proportional to the number of cross-validation folds in which some k-mer in that region was selected by Adaboost. Thus, darker spots indicate that some k-mer in that part of the proteome was robustly selected by Adaboost. Furthermore, a vertical cluster of dark spots indicates that the region, selected by Adaboost to be predictive, is also strongly conserved among viruses sharing a common host type.


Figure 6.

Mismatch feature space representation.

The mismatch feature space representation of a segment of a protein sequence (shown on top of figure).
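One way to picture this representation is as a vector with one entry per k-mer, where each window of the sequence contributes to every k-mer within a fixed number of mismatches of it. The sketch below computes such entries for a chosen list of k-mers. The names are hypothetical, and using counts rather than presence/absence indicators is an assumption.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_features(sequence, k, m, feature_kmers):
    """Count, for each k-mer of interest, the windows of the sequence that
    lie within m mismatches of it.

    Assumption: features are counts; replace `+= 1` with `= 1` for a
    binary (presence/absence) representation.
    """
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        for f in feature_kmers:
            if hamming(window, f) <= m:
                counts[f] += 1
    return counts
```

For example, with `k=3` and `m=1`, the segment `"AAGAAT"` contributes twice to the feature for `"AAT"`: once through the exact window `AAT` and once through `AAG`, which differs in a single position.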


Figure 7.

Alternating Decision Tree.

An example of an ADT where rectangles are decision nodes, circles are output nodes and, in each decision node, is the feature associated with the k-mer in sequence . The output nodes connected to each decision node are associated with a pair of binary-valued functions . The binary-valued function corresponding to the highlighted path is given as and the associated .
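To make the structure concrete, the following sketch evaluates a small alternating decision tree: the score is the offset plus the values of every output node reached, and each output node may lead to further decision nodes. The class, field names, and the threshold test on a k-mer feature are hypothetical choices for illustration and are not taken from the figure.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionNode:
    feature: str            # k-mer whose feature value is tested
    threshold: float        # assumed test: feature value >= threshold
    value_if_true: float    # output node reached when the test holds
    value_if_false: float   # output node reached when it does not
    children_if_true: list = field(default_factory=list)
    children_if_false: list = field(default_factory=list)


def adt_score(offset, nodes, features):
    """Sum the offset and the values of all output nodes reached.

    Unlike an ordinary decision tree, every decision node hanging off a
    reached output node is evaluated, so several paths contribute.
    """
    total = offset
    stack = list(nodes)
    while stack:
        node = stack.pop()
        if features.get(node.feature, 0) >= node.threshold:
            total += node.value_if_true
            stack.extend(node.children_if_true)
        else:
            total += node.value_if_false
            stack.extend(node.children_if_false)
    return total
```

The sign of the returned score would give the predicted class, and its magnitude can be read as a measure of confidence.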
