
Four interpretable algorithms that you should use in 2022

To make the best decision, we have to understand them

The new year has begun, and it is time for good resolutions. One of them could be to make your decision-making processes more interpretable. To help you do this, I present four interpretable rule-based algorithms. These four algorithms share the use of an ensemble of decision trees (such as Random Forest, AdaBoost, or Gradient Boosting) as a rule generator. In other words, each of these interpretable algorithms starts by fitting a black-box model and then generates an interpretable rule ensemble model from it.

Even though they are all claimed to be interpretable, they were developed with different ideas of interpretability. As you may know, this concept is difficult to pose mathematically. Therefore, each set of authors designed their interpretable algorithm around their own definition of interpretability.

To avoid going into detail, I assume here that the data are composed of d quantitative features and that the rules are binary variables with the value 1 if the observation lies within a hyperrectangle of the feature space ℝᵈ. Differently said, a rule rₙ is defined as follows: rₙ(x) = 1 if x[j] ∈ Iⱼ,ₙ for all j ∈ {1, …, d}, and rₙ(x) = 0 otherwise,

where x[j] is the value of the observation x for the j-th feature and Iⱼ,ₙ is an interval of ℝ. The rule and the interval are indexed by n, the number of observations in the sample Dₙ. The length of a rule is the number of intervals Iⱼ,ₙ ≠ ℝ in its definition; it may be different from d. For instance, the condition "IF feature_1 in [1, 3] AND feature_3 in [0, 2]" defines a rule of length 2 that equals 1 in the hyperrectangle [1, 3] × ℝ × [0, 2] × ℝᵈ⁻³. This rule can thus be written as r(x) = 𝟙{x ∈ [1, 3] × ℝ × [0, 2] × ℝᵈ⁻³}.


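To make this concrete, here is a minimal Python sketch (my own illustration, not code from any of the papers) of a rule as the indicator function of a hyperrectangle, using the length-2 example above:

```python
import numpy as np

# Illustrative sketch: a rule as the indicator of a hyperrectangle.
# Intervals are given only for the features the rule constrains;
# all other features are implicitly unrestricted (interval = R).

def make_rule(intervals):
    """intervals: dict {feature_index: (low, high)} (0-indexed features)."""
    def rule(x):
        return int(all(low <= x[j] <= high for j, (low, high) in intervals.items()))
    return rule

# The length-2 example from the text: feature_1 in [1, 3] AND feature_3 in [0, 2]
# (features are 1-indexed in the text, hence indices 0 and 2 here).
r = make_rule({0: (1, 3), 2: (0, 2)})
print(r(np.array([2.0, 5.0, 1.0, -4.0])))  # 1: inside the hyperrectangle
print(r(np.array([4.0, 5.0, 1.0, -4.0])))  # 0: feature_1 outside [1, 3]
```

Note that the unconstrained features (here feature_2 and feature_4) play no role in the rule's activation, which is what keeps short rules readable.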
1. RuleFit (2008).

The first algorithm is the most popular and also the oldest. RuleFit was introduced in [1]. It generates a sparse linear model that contains selected interaction effects in the form of decision rules.

Motivation: The idea of this algorithm came from a double observation: the main disadvantage of rule-based models is their difficulty in capturing linear dependencies; on the other hand, linear models do not capture interactions among features. Therefore, the idea of RuleFit is to combine these two types of algorithms by creating a sparse linear model and adding interaction effects in the form of decision rules.

"With ensemble learning, there is no requirement that the basis functions must be generated from the same parametric base learner."

The algorithm consists of two steps. The first is rule generation and the second is rule selection.

Rule generation: The authors proposed using ensemble tree algorithms such as Random Forest, AdaBoost, and Gradient Boosting. Then, each node (interior and terminal) of each tree is extracted as a rule in the form described above.

Rule selection: These extracted rules, along with the original input features, are then fed into an L1-regularized linear model, also called the Lasso, which estimates the effect of each rule and each variable on the output target while shrinking many of these effects to zero. To give each linear term the same a priori influence as a typical rule, the authors suggest normalizing the original input features.
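The two steps can be sketched with scikit-learn (a deliberately simplified illustration, not the reference implementation: a single shallow tree stands in for the ensemble, and only leaf rules are extracted, whereas RuleFit uses all nodes):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Lasso

# Minimal sketch of the RuleFit idea:
# 1) fit a tree model, 2) turn nodes into binary rule features,
# 3) fit a Lasso on [normalized original features, rule features].

X, y = make_regression(n_samples=300, n_features=5, noise=1.0, random_state=0)

# Step 1: a single shallow tree stands in for the full ensemble here.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Step 2: each leaf defines a rule (a hyperrectangle); encode membership as 0/1.
leaf_ids = tree.apply(X)  # leaf index of each sample
rules = np.array([leaf_ids == leaf for leaf in np.unique(leaf_ids)]).T.astype(float)

# Step 3: normalize original features so they compete fairly with the rules,
# then let the L1 penalty shrink most coefficients to zero.
X_norm = (X - X.mean(0)) / X.std(0)
Z = np.hstack([X_norm, rules])
model = Lasso(alpha=0.5).fit(Z, y)
print("non-zero coefficients:", np.sum(model.coef_ != 0), "of", Z.shape[1])
```

The final model is a weighted sum of a few linear terms and a few rules, which is what makes (or is intended to make) the prediction readable.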

Interpretability: The Lasso penalty is known to be selective. By using this regularization, the authors expected many coefficients to be exactly zero, thus generating an interpretable model.

"For purposes of interpretation, it is desirable that the ensemble be comprised of "simple" rules each defined by a small number of variables."

Moreover, the authors proposed many formulas to calculate the importance of any predictor in a RuleFit model and to study the interaction effects. All these metrics are here to help the statistician understand the generated model.

Remarks: In practice, RuleFit is hard to interpret. Since the prediction implied by an activated rule comes from the Lasso fit, a rule can cover positive examples but still have a negative prediction. Moreover, the generated model sometimes has too many rules with non-zero weights, making it uninterpretable for humans.

Packages: RuleFit is implemented in Python in the following GitHub projects: imodels and rulefit. RuleFit has also several R packages.


2. Node harvest (2010).

This algorithm has been presented in [2]. The algorithm considers all nodes and leaves of a Random Forest as rules and solves a linear quadratic problem to fit a weight for each node. Hence, the estimator is a convex combination of the nodes.

Motivation: The idea behind the node harvest algorithm (NH) is to select the best partition of the feature space given a set Q of q rules. The minimization problem can be formally written as

where wₖ is the weight of the rule rₖ and pₖ is the prediction of the rule rₖ (usually the empirical conditional expectation of Y given X ∈ rₖ).

Unfortunately, this problem is very hard to solve, especially because the constraint wₖ ∈ {0, 1} for k ∈ {1, …, q} does not define a convex feasible region. In [2], the author proposes a way to adapt this optimization problem to make it solvable.

"The main idea of NH is that it becomes computationally feasible to solve the optimal empirical partitioning problem if the weights are only constrained to be nonnegative."

Rule generation: In the paper, the author recommends using a Random Forest algorithm and fitting each tree with a subsample of the data of size n/10 instead of the usual bootstrap samples to speed up the computation and increase the diversity of the rule set. Then all nodes of each tree satisfying a given condition on the maximal interaction order (the rule’s length) and a given condition on the minimal size (number of observations covered by the rule) are added to the set of rules Q. Finally, if the number of rules in Q is lower than a chosen number q, the process is repeated.

Rule selection: To solve the optimization problem, the author made two changes to the original formulation. The first was to replace the condition of partitioning the feature space with a condition of empirical partitioning, meaning that each observation must be covered by exactly one selected rule. The second was to relax the constraint on the weight vector w by allowing values in the interval [0, 1] instead of the binary set {0, 1}. Thanks to these changes, the new optimal empirical partitioning problem can be solved with a quadratic program solver.
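The relaxed problem can be illustrated with a toy nonnegative least-squares fit (my own sketch; for brevity it drops the empirical-partition constraint and the [0, 1] upper bound, keeping only the nonnegativity that makes the problem tractable):

```python
import numpy as np
from scipy.optimize import nnls

# Toy sketch of the relaxed node harvest selection step: given q candidate
# rules, find nonnegative weights so that the weighted rule predictions
# fit y. Column k of A holds p_k * 1{x_i in r_k}.
rng = np.random.RandomState(1)
n, q = 200, 30
activation = rng.rand(n, q) < 0.3              # which rules cover which samples
preds = rng.randn(q)                            # rule predictions p_k
A = activation * preds                          # design matrix
y = A[:, :5].sum(axis=1) + 0.1 * rng.randn(n)   # target driven by 5 rules

w, _ = nnls(A, y)                               # weights constrained to w_k >= 0
print("rules with non-zero weight:", np.sum(w > 1e-8), "of", q)
```

As in NH, the nonnegativity constraint alone already tends to drive most weights to zero without an explicit sparsity penalty.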

Interpretability: Interpretability is ensured by the choice of the maximal interaction order, which must be at most 3; indeed, it is very hard to understand a rule involving more than 3 features. Moreover, even if the number of rules in Q is very large, the solution of the optimization problem should put a weight of 0 on the majority of the rules.

"the vast majority of nodes […] will receive a zero weight, without the sparsity being enforced explicitly other than through the constraint of empirical partitioning."

In my experience, however, the number of selected rules is still too high to yield an interpretable model, even though the model remains fairly accurate.

Remarks: In the paper, the author proposes a dimensionality reduction to solve the optimization problem faster whenever the size of Q exceeds the number of observations. He also presents default values for the parameters of NH that work well in practice. Finally, he suggests a regularization to improve interpretability.

Packages: Node harvest has been implemented in R in the package nodeHarvest.


3. SIRUS (2021).

The SIRUS algorithm was introduced in [3] and extended in [4]. It uses a modified Random Forest to generate a large number of rules, then selects those whose frequency of occurrence in the forest exceeds a tuning parameter p₀. To ensure redundancy among the generated rules, the features are discretized.

Motivation: The main strength of this algorithm is its stability: for two independent samples from the same distribution, the sets of selected rules are almost identical. Indeed, for the authors, stability is one of the key characteristics of interpretability.

Rule generation: SIRUS uses a slightly modified Random Forest. First, the features are discretized into q bins using the empirical q-quantiles of their marginal distributions. This discretization is fundamental for stability, and accuracy is preserved as long as q is not too small. To enforce interpretability, the depth of the trees is limited to 2, whereas the usual Random Forest algorithm recommends deep trees. This means that each tree of the SIRUS rule generation process has at most six nodes (not counting the root node). Finally, each tree is fit on a subsample of aₙ observations, where aₙ is a parameter of SIRUS.
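The quantile discretization step can be sketched as follows (illustrative code assuming q = 10 bins, not the sirus package itself):

```python
import numpy as np

# Sketch of SIRUS-style preprocessing: discretize each feature into q bins
# using its empirical quantiles, so that trees fit on independent samples
# split at the same cut points.
def discretize(X, q=10):
    Xd = np.empty(X.shape, dtype=int)
    for j in range(X.shape[1]):
        # bin edges at the empirical 1/q, 2/q, ..., (q-1)/q quantiles of feature j
        edges = np.quantile(X[:, j], np.linspace(0, 1, q + 1)[1:-1])
        Xd[:, j] = np.digitize(X[:, j], edges)
    return Xd

rng = np.random.RandomState(0)
X = rng.randn(500, 3)
Xd = discretize(X, q=10)
print(sorted(np.unique(Xd[:, 0])))  # bin labels 0 through 9
```

Because the cut points are quantiles rather than raw data values, two independent samples from the same distribution yield nearly identical candidate splits, which is what makes stable rule selection possible.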

Rule selection: SIRUS generates many trees (typically 10,000) and selects the rules that are shared by a large portion of them. To identify these rules, SIRUS computes the empirical probability that a tree of the forest contains a particular path, and then selects all the rules with an empirical probability greater than or equal to a chosen parameter p₀ ∈ (0, 1). Unfortunately, this method generates a lot of redundancy in the list of selected rules, so the selection process needs a post-treatment: if a rule r is a linear combination of rules whose paths have a higher frequency in the forest, then r is removed from the set of selected rules.
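The frequency-based selection can be sketched on toy data (the paths and the threshold below are invented for illustration; in SIRUS the probabilities are computed over all paths of a large forest, with one entry per path per tree):

```python
from collections import Counter

# Toy sketch of SIRUS-style rule selection: count how often each path
# appears across the forest, then keep the paths whose empirical
# probability reaches the threshold p0.
paths = [("x1<=0.5",), ("x1<=0.5",), ("x2>1.2",), ("x1<=0.5", "x2>1.2"),
         ("x3<=0.0",), ("x1<=0.5",), ("x2>1.2",)]   # paths from 7 toy trees
p0 = 0.25
counts = Counter(paths)
selected = [p for p, c in counts.items() if c / len(paths) >= p0]
print(selected)  # only the two frequently occurring paths survive
```

The discretization described above is what makes exact path matches like these possible: with continuous cut points, no two trees would ever share a path.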

Interpretability: The post-treatment in the selection process ensures that only the shortest rules are kept. Indeed, a rule can only be a linear combination of rules of lower length.

"In our case, it happens that the parameter p₀ alone is enough to control sparsity"

Remarks: The discretization process is mandatory in SIRUS. First, it increases the stability of the algorithm. Second, it enables the creation of decision trees with common rules: if a feature is continuous, the probability that a decision tree algorithm will split at exactly the same value for the same rule in two independent samples is zero. The authors also proved the consistency of the estimated path probabilities and the asymptotic convergence to perfect stability (i.e., the set of selected rules is exactly the same for two independent samples drawn from the same distribution). Finally, the authors give some techniques for parameter tuning.

Packages: SIRUS has been implemented in R in the package sirus.


4. CoveringAlgorithm (2021).

This algorithm was presented in [5]. It extracts a sparse rule set by considering all nodes and leaves of a tree ensemble and selecting rules according to their statistical properties to form a "quasi-covering" (meaning that the covering is only asymptotic). The covering is then turned into a partition using the so-called partitioning trick, which yields a consistent estimator of the regression function. I have already discussed the partitioning trick in a previous article, "How to deal with overlapping rules?".

Motivation: The idea is to generate an interpretable, consistent rule-based estimator of the regression function. In the literature, a consistent estimator must be generated from rules of length d, and for a large d, the rules become uninterpretable. To get around this limitation, the authors introduced the concepts of significant and insignificant rules. Significant rules are rules where the target variable shows behaviour significantly different from its average behaviour, while insignificant rules are rules where the target variable has a very low variance. The authors also removed the requirement of having a covering: the set of selected rules does not have to cover the feature space, this restriction being replaced by the more flexible one of quasi-coverage. The CoveringAlgorithm is an illustration of these concepts.
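The distinction between the two kinds of rules can be sketched as follows (the thresholds dev_thr and var_thr and the function name are my own illustrative choices, not the significance criteria of the paper):

```python
import numpy as np

# Illustrative sketch: a rule is "significant" when the target's mean inside
# the rule deviates notably from its global mean, and "insignificant" when
# the target's variance inside the rule is very low.
def classify_rule(y, mask, dev_thr=1.0, var_thr=0.1):
    y_in = y[mask]
    if abs(y_in.mean() - y.mean()) >= dev_thr:
        return "significant"
    if y_in.var() <= var_thr:
        return "insignificant"
    return "discarded"

rng = np.random.RandomState(0)
y = rng.randn(1000)
mask = y > 1.0                      # a region where y behaves differently
print(classify_rule(y, mask))       # significant
```

Rules falling in neither category carry little information and would simply not be used by the estimator.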

Rule generation: As in RuleFit, ensemble tree algorithms such as Random Forest, AdaBoost, and Gradient Boosting can be used as rule generators. Then, all nodes (internal and terminal) of each tree are extracted as rules, and only rules with a length lower than or equal to a specified parameter value (usually 3) are kept.

Rule selection: The selection process consists of two steps. First, the algorithm extracts the significant rules, ordered by decreasing coverage, and tries to cover the empirical feature space with them, excluding rules that overlap too much. Second, if the coverage is not complete, the algorithm expands it by adding insignificant rules, ordered by increasing variance.
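The greedy covering step can be sketched as follows (the function name and the max_overlap parameter are mine, not from the paper):

```python
import numpy as np

# Hedged sketch of a greedy covering pass: take rules by decreasing
# coverage and keep a rule only if it does not overlap too much with the
# area already covered.
def greedy_cover(activations, max_overlap=0.5):
    """activations: list of boolean masks, one per rule, sorted by coverage."""
    covered = np.zeros_like(activations[0], dtype=bool)
    kept = []
    for k, act in enumerate(activations):
        overlap = np.logical_and(act, covered).sum() / max(act.sum(), 1)
        if overlap <= max_overlap:
            kept.append(k)
            covered |= act
    return kept, covered.mean()

rng = np.random.RandomState(0)
acts = sorted((rng.rand(100) < p for p in [0.6, 0.5, 0.4, 0.3]),
              key=lambda a: a.sum(), reverse=True)
kept, coverage = greedy_cover(acts)
print(kept, round(coverage, 2))
```

If the resulting coverage stays below 1, the same pass would be repeated with the insignificant rules, ordered by increasing variance, to fill the remaining gaps.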

Interpretability: Interpretability is ensured by the small number of short rules in the selected set. Indeed, it is easier to understand a decision if it is based on two or three short rules. In addition, the notion of significant and insignificant rules helps users identify areas of interest: insignificant rules indicate areas where not much is happening, in contrast to significant rules, which highlight areas of interest.

Remarks: The main result of this paper is a proof of the consistency of an interpretable estimator of the regression function constructed from a set of significant and insignificant rules. As explained above, this algorithm is just an example to show how to use significant and insignificant rules in a predictive algorithm. As a matter of fact, it does not produce a consistent estimator because the rule generator does not fully satisfy some conditions. This algorithm is currently the subject of active research.

Packages: CoveringAlgorithm has been developed in Python in the GitHub package CoveringAlgorithm.


Conclusion

It is interesting to see that each algorithm was developed with a different goal: accuracy for RuleFit and node harvest, stability for SIRUS, and simplicity for CoveringAlgorithm. Benefiting from this previous research, I am working on a quantitative method to measure interpretability based on the triptych of predictability, stability, and simplicity. A first approach dedicated to tree-based and rule-based algorithms was published in the MDPI AI Journal (and a TDS version has been posted here). I am currently working on an extension of this approach that is algorithm-independent, making it possible to compare the interpretability of any predictive algorithms.


About Us

Advestis is a European Contract Research Organization (CRO) with a deep understanding and practice of statistics and interpretable machine learning techniques. The expertise of Advestis covers the modeling of complex systems and predictive analysis for temporal phenomena.

LinkedIn: https://www.linkedin.com/company/advestis/


References

[1] Friedman, J.H. and Popescu, B.E., 2008. Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3), pp.916–954.

[2] Meinshausen, N., 2010. Node harvest. The Annals of Applied Statistics, pp.2049–2072.

[3] Bénard, C., Biau, G., Veiga, S. and Scornet, E., 2021. SIRUS: Stable and interpretable rule set for classification. Electronic Journal of Statistics, 15(1), pp.427–505.

[4] Bénard, C., Biau, G., Veiga, S. and Scornet, E., 2021, March. Interpretable random forests via rule extraction. In International Conference on Artificial Intelligence and Statistics, pp. 937–945. PMLR.

[5] Margot, V., Baudry, J.P., Guilloux, F. and Wintenberger, O., 2021. Consistent regression using data-dependent coverings. Electronic Journal of Statistics, 15(1), pp.1743–1782.

