In this paper we introduce a {simple} algorithm to learn optimal decision trees of bounded depth. This algorithm, \blossom, is as memory and time efficient as heuristics, and yet more efficient than most exact methods on most data sets.

Its worst case time complexity is the same as state-of-the-art dynamic programming methods. However, its anytime behavior is vastly superior.

Experiments show that whereas existing exact methods hardly scale to deep trees, our algorithm learns trees comparable to standard heuristics without significant computational overhead, and can significantly improve their accuracy when given more computation time.

% State-of-the-art exact methods often have poor anytime behavior, and hardly scale to deep trees.

% Experiments show that they are typically orders of magnitude slower than the proposed algorithm to compute optimally accurate classifiers of a given depth.

%On the other hand, \blossom\ finds, without significant computational overhead, solutions comparable to those returned by standard greedy heuristics, and can quickly improve their accuracy when given more computation time.

% the first solution found by \blossom\ is comparable to those found by standard greedy heuristics and that significantly improve upon greedy heuristics. On the

\end{abstract}

\section{Introduction}

In conclusion of their short paper showing that computing decision trees of maximum accuracy is NP-complete, Hyafil and Rivest write: ``Accordingly, it is to be expected that good heuristics for constructing near-optimal binary decision trees will be the best solution to this problem in the near future.''~\cite{NPhardTrees}. Indeed, heuristic approaches such as \cart~\cite{breiman1984classification}, \idthree~\cite{10.1023/A:1022643204877} or \cfour~\cite{c4-5} have remained prevalent long afterward, and are still vastly more commonly used in practice than exact approaches. In this paper, we propose a new exact algorithm (\blossom) which, while being effective at proving optimality, incurs no computational or memory overhead compared to greedy heuristics.

%\medskip

...

...

On the other hand, the dynamic programming algorithms \olddleight~\cite{dl8} and \dleight~\cite{dl85} scale very well to large data sets. Moreover, these algorithms leverage branch independence: sibling subtrees can be optimized independently, which significantly reduces computational complexity. However, \dleight tends to be memory-hungry and, furthermore, is not anytime.
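In our notation (with $\negex,\posex$ the examples of each class and $\mdepth$ the depth bound), the recurrence behind these dynamic programming methods can be sketched as follows. This is our own minimal Python illustration for binary features and two classes; it ignores the itemset caching that \dleight\ relies on:

```python
def dp_error(neg, pos, features, depth):
    """Minimum misclassification error of a depth-bounded decision tree.

    neg, pos: lists of examples (dicts mapping feature name -> bool),
    one list per class; features: candidate feature names.
    """
    leaf = min(len(neg), len(pos))  # best leaf: predict the majority class
    if depth == 0 or leaf == 0:
        return leaf
    best = leaf
    for f in features:
        rest = [g for g in features if g != f]
        n1 = [x for x in neg if x[f]]; n0 = [x for x in neg if not x[f]]
        p1 = [x for x in pos if x[f]]; p0 = [x for x in pos if not x[f]]
        # sibling subtrees are optimized independently (branch independence)
        err = (dp_error(n1, p1, rest, depth - 1)
               + dp_error(n0, p0, rest, depth - 1))
        best = min(best, err)
    return best
```

The two recursive calls computing \texttt{err} are independent of one another, which is precisely the branch-independence property that gives these methods their edge over monolithic search.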

The constraint programming approach of Verhaeghe \textit{et al.} emulates these positive features using dedicated propagation algorithms and search strategies~\cite{verhaeghe2019learning}, while being potentially anytime, although it does not quite match \dleight's efficiency.

Finally, a recently introduced algorithm, \murtree~\cite{DBLP:journals/corr/abs-2007-12652}, improves on the dynamic programming approaches in several ways: like the algorithm introduced in this paper, it explores the search space in a more flexible way. Moreover, it implements several methods dedicated to exploring the whole search space very fast: for instance, delegating feature frequency counts to a specialized algorithm for subtrees of depth two, and implementing an efficient recomputation method for the classification error.

As a result, it outperforms previous exact methods: it is more memory efficient, orders of magnitude faster than \dleight, and has a better anytime behavior. However, experimental results show that for deeper trees, none of these methods can reliably outperform heuristics, whereas \blossom\ does. Moreover, it is more memory efficient than \murtree, and its pseudo-code is significantly simpler.

% % \medskip

%

% In this paper we introduce a relatively \emph{simple} algorithm (\blossom), that is as memory and time efficient as heuristics, and yet more efficient than most exact methods on most data sets.

% This algorithm can be seen as an instance of the more general framework introduced in \cite{DBLP:journals/corr/abs-2007-12652}, however tuned to have the best scalability to large trees and the best anytime behavior as possible.

% %As a result, it is comparable to \murtree on shallow trees, while clearly outperforming the state of the art on deep trees.

In a nutshell, \blossom emulates the dynamic programming algorithm \olddleight~\cite{dl8}, while always expanding non-terminal branches (a.k.a.\ ``buds'') before optimizing grown branches. As a result, this algorithm is in a sense strictly better than both the standard dynamic programming approach (because it is anytime and at least as fast) and classic heuristics (because it emulates them during search, without significant overhead).
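The first ``dive'' of such a bud-first search coincides with a greedy heuristic: every bud is expanded with its heuristically best feature before any alternative split is reconsidered. A rough Python sketch of that first dive, our own simplification using classification error as the split criterion (\blossom\ additionally records the pending alternatives so that later backtracking can improve the tree):

```python
def first_dive(neg, pos, features, depth):
    """Grow one complete tree greedily, expanding every bud before any
    branch is re-optimized; return its misclassification error.
    Examples are dicts mapping feature name -> bool."""
    leaf = min(len(neg), len(pos))
    if depth == 0 or leaf == 0 or not features:
        return leaf

    def split_error(f):
        # classification error if both children of the split were leaves
        n1 = sum(x[f] for x in neg); p1 = sum(x[f] for x in pos)
        return min(n1, p1) + min(len(neg) - n1, len(pos) - p1)

    f = min(features, key=split_error)  # expand the bud with the best feature
    rest = [g for g in features if g != f]
    return (first_dive([x for x in neg if x[f]],
                       [x for x in pos if x[f]], rest, depth - 1)
            + first_dive([x for x in neg if not x[f]],
                         [x for x in pos if not x[f]], rest, depth - 1))
```

Unlike the dynamic programming recursion, a single feature is committed per bud before any branch is revisited, so a complete tree is available after one pass over the data per level.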

%but explores the search space so as to improve its anytime behaviour.

Our experimental results show that it outperforms the state of the art, with the exception of \murtree on relatively shallow trees (typically for maximum depth up to 4), for which its more sophisticated (albeit more complex) algorithmic features can pay off.

%In particular, on data sets that \dleight can tackle, \blossom can always find classifiers at least as accurate faster, and when the former can prove optimality, the latter does it orders of magnitude faster.

...

...

}

\end{center}

\caption{\label{fig:searchtree} The search tree for decision trees. \texttt{DynProg} explores it depth first, whereas \texttt{Blossom} explores the leaves in the order given below.}

%\caption{\label{fig:searchtree} The search tree for decision trees. \dynprog explores it depth first, whereas \blossom explores branches in the order given below the leaves.}

\end{figure}

...

...

\begin{algorithm}[t]

\begin{footnotesize}

\caption{Blossom Algorithm\label{alg:bud}}

\TitleOfAlgo{\blossom}

\KwData{$\negex,\posex, \maxd$}

\KwResult{The minimum error on $\negex,\posex$ for decision trees of depth at most $\maxd$}

$\sequence\gets[]$\;

...

...

% We show that this is true for Algorithm~\ref{alg:bud} by recursion on the maximum depth $\mdepth$.

% For $\mdepth = 0$, both algorithm return $\error[\emptyset]=\min(|\negex|, |\posex|)$.

%

% Now suppose that, $\mdepth \leq d$, \dynprog and \blossom explore exactly the same set of branches: a recursive call of \dynprog ends on the branch $\abranch$ if and only if $\best[\abranch]$ is set in Line~\ref{line:best} of \blossom.

% %every branch $\abranch$ explored by \dynprog (a recursive call ends on this branch) is also explored by \blossom ($\best[\abranch]$ is set in Line~\ref{line:best}), and

% Now, let $\mdepth=d+1$. Without loss of generality, we can take an arbitrary branch $\abranch$ of length $d$ and show that the same extentions of $\abranch$ are explored by both algorithms.

% If $\error[\abranch]=0$, then no extension of $\abranch$ is explored by either algorithm.

% Otherwise, if $\afeat \not\in \abranch$ and $\bar{\afeat}\not\in \abranch$, then

...

...

...

...

The key difference between Algorithms~\ref{alg:dynprog} and \ref{alg:bud} is the order in which branches are explored (see Figure~\ref{fig:searchtree}). In particular, \dynprog must complete the first recursive call before outputting a full tree.

%Therefore, the computation time for finding a first complete tree is $\Theta((\numex+2^{\mdepth})\Perm{\numfeat-1}{\mdepth-1})$, that is $O(\numex(\numfeat-1)^{\mdepth-1})$ time.

Therefore, it finds a first complete tree in $\Theta((\numex+2^{\mdepth})\Perm{\numfeat-1}{\mdepth-1})$, that is $O(\numex(\numfeat-1)^{\mdepth-1})$ time.

On the other hand, \blossom finds a first tree in linear time: $\Theta(2^{\mdepth}+\numex\mdepth)=\Theta(\numex\mdepth)$.

Another difference from actual implementations of Algorithm~\ref{alg:dynprog} (\olddleight\ and \dleight) is that these methods use a cache structure in order to reduce the number of branches that need to be explored.

%Indeed, by using memory,

%there is no need to explore every permutation

...

...

% \begin{algorithm}

% \caption{Anytime Algorithm\label{alg:bud}}

% \TitleOfAlgo{\blossom}

% \KwData{$\negex,\posex, \maxd$}

% \KwResult{The minimum error on $\negex,\posex$ for decision trees of depth at most $\maxd$}

% $\sequence \gets []$\;

...

...

% }

% \end{center}

% \caption{\label{fig:searchtree} The search tree for decision trees. \texttt{DynProg} explores it depth first, whereas \texttt{Bud-first-search} explores branches in the order given below the leaves.}

% %\caption{\label{fig:searchtree} The search tree for decision trees. \dynprog explores it depth first, whereas \blossom explores branches in the order given below the leaves.}

% \end{figure}

...

...

% \label{ex:lb}

%

%

% Figure~\ref{fig:lowerbound} shows a snapshot of the execution of \blossom. Every node is labelled with the feature test on that node, and with the values of $\best[\abranch]$ for the branch $\abranch$ ending on that node. When all subtrees of a branch $\abranch$ have been explored (hence $\opt[\abranch]=1$), this is marked by a ``$^*$''. We assume that the branch considered at Line~\ref{line:fail} is $\abranch = \{r, \bar{a}, \bar{c}, g\}$. For instance, we can suppose that a tree rooted at $\abranch$ with feature $e$ has been found (misclassifying 2 data points). Then, search moved to the sibling branch $\{r, \bar{a}, \bar{c}, \bar{g}\}$, which was then optimized for a total error of $4$, and now the pair $(\abranch,e)$ is popped out of \sequence. For all branches $\abranch'$ of $\abranch$, we give the values of $\lb{\abranch',\abranch}$ and $\best[\abranch']$ between brackets. Since there exists $\abranch'$ such that $\lb{\abranch',\abranch} \geq \best[\abranch']$ (e.g., $\emptyset$ and $\{r, \bar{a}\}$), we know that $\abranch$ cannot belong to an improving solution, and hence there is no need to try to extend it further.

%

% % the current best classifier cannot be improved as long as

%

...

...

% % }

% \end{center}

% \caption{\label{fig:lowerbound} Example of lower bound computation w.r.t. the branch }

% %\caption{\label{fig:searchtree} The search tree for decision trees. \dynprog explores it depth first, whereas \blossom explores branches in the order given below the leaves.}

% \end{figure}

%

% \end{example}

...

...

Finally, we use two preprocessing techniques, one on the data set and one on the features. Although extremely straightforward (and probably not novel), they both have a significant impact.

\paragraph{Dataset reduction.}

It is easy to adapt \blossom\ to handle weighted data sets by redefining the error as follows, given a weight function $\weight$ on $\allex$:
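In code, the reduction and the error that it makes unavoidable can be sketched as follows. This is our own Python illustration, assuming examples are tuples of binary feature values; the names are not those of our implementation:

```python
from collections import Counter

def reduce_dataset(neg, pos):
    """Merge duplicate examples into weighted ones.
    neg, pos: lists of tuples of binary feature values, one list per class.
    Returns a dict mapping each distinct example to its pair of class
    weights (negative count, positive count)."""
    wn, wp = Counter(neg), Counter(pos)
    return {x: (wn[x], wp[x]) for x in set(wn) | set(wp)}

def unavoidable_error(weighted):
    """Weighted error contributed by inconsistent examples: a feature
    vector occurring with both labels is misclassified by any tree,
    whichever class the leaf containing it predicts."""
    return sum(min(n, p) for n, p in weighted.values())
```

Search can then stop as soon as the overall classification error reaches this unavoidable error, rather than having to exhaust the search space.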

\subsection{Computing (optimally) accurate classifiers}

We first compare \blossom to state-of-the-art algorithms, \murtree~\cite{DBLP:journals/corr/abs-2007-12652} and \dleight~\cite{dl85}, as well as the best MIP (\binoct)~\cite{verwer2019learning} and CP (\cp)~\cite{verhaeghe2019learning} models, for computing and proving optimal trees.

The data sets in Table~\ref{tab:summaryaccsmall} are organized

in two classes according to

...

...

We report for both classes and for each depth: the ratio of optimality proofs (opt.); the average training accuracy (acc.); and %the average accuracy (acc.),

%as well as

the average CPU time (cpu) to prove optimality.

Since \dleight and \binoct exceed the memory limit of 50GB in some cases, we also provide, for those two methods, the ratio of runs where at least one tree is found (sol.). For the same reason, we give their accuracy, marked by a ``$^*$'', as the average increase over \blossom's on these ``successful'' data sets.

Similarly, the CPU time for all other methods is given as the average increase over \blossom's on data sets for which both methods prove optimality.

...

...

\end{table}

\blossom is comparable to \murtree for the number of optimality proofs.

% The number of optimality proofs is similar for \blossom and \murtree.

It is slightly less efficient for $\mdepth\leq7$, but slightly more for $\mdepth=10$. The gap on shallow trees can be explained by \murtree's caching, and by the fact that it puts less emphasis on finding good trees early, instead trying to exhaust the search space faster. The gap on deep trees can be partly explained by the removal of inconsistent datapoints: whereas \blossom can stop searching when the overall classification error reaches the number of inconsistent datapoints, \murtree must exhaust the search space.

The difference in CPU time is due to the same phenomenon (when in favor of \blossom), or to a few data sets, e.g., \texttt{mnist\_0}, where caching is probably helpful (when in favor of \murtree). Results on individual data sets (see Appendix) show that otherwise, both algorithms are comparable for proving optimality.

%Despite what a quick look at Table~\ref{tab:summaryaccsmall} may suggest, both methods have similar speed. The large gaps are either due to the same phenomenon described above (when in favor or \blossom), or due to a few data sets, e.g. \texttt{mnist\_0}, where caching is probably helpful (when in favor or \murtree).

% As \dleight does not provide a solution for every data set (on some instance it goes over the memory limit of 50GB), we provide the number of data sets for which a solution was returned (sol.). Moreover,

% instead of absolute values, we provide the average relative difference in error and accuracy w.r.t. \blossom, however, and only for the data sets where a decision tree was found. Similarly, we report the average cpu time ratio w.r.t. \blossom, however, only for instances which were proven optimal by both algorithms\footnote{every instance proven optimal by \dleight is also proven optimal by \blossom and \murtree}.

% \clearpage

...

...

% \end{table}

When proving optimality is hard, however, \blossom is clearly the best in terms of accuracy, especially as the depth and the feature set grow. Notice that the accuracy results in Table~\ref{tab:summaryaccsmall} include data sets for which an optimal tree is found, so the gap on the other data sets is much larger. Moreover, they are averaged over 58 data sets, so a gap of a fraction of a point is significant: the full results in the appendix show that, while the gaps vary, they are consistently in favor of \blossom.

% both algorithms find trees of similar qualities for $\mdepth \leq 5$ and $\numfeat < 100$, however, \blossom is significantly better as these parameters grow.

Other methods are systematically outperformed. \cp has good results on very shallow trees ($\mdepth\leq4$) but is ineffective for deeper trees. Indeed, the accuracy actually \emph{decreases} when $\mdepth$ increases! \dleight can also find optimal trees in most cases

for low values of \numfeat\ and $\mdepth$.

% \numfeat) is low, and for small values of $\mdepth$.

When $\numfeat$ grows, however, it often exceeds the memory limit of 50GB (whereas \blossom does not require more memory than the size of the data set). Finally, \binoct does not produce a single proof and very often exceeds the memory limit.%\footnote{In the experiments in \cite{verwer2019learning} not all datapoints were used.}

Figure~\ref{fig:proofcactus} shows the evolution of the ratio of proofs, averaged across all 58 data sets, over time: \blossom\

proves optimality faster as $\mdepth$ grows, but given enough time, \murtree\ matches it for $\mdepth\leq7$.

...

...

Next, we shift our focus to how fast we can obtain accurate trees and how fast we can improve the accuracy of the basic solutions found by heuristics.

We use a well-known heuristic as a baseline: \cart (we ran its implementation in scikit-learn).

In Table~\ref{tab:summaryspeed}, we report the average error after a given period of time (3 seconds, 10 seconds, 1 minute, or 5 minutes) for both \murtree and \blossom.

...

...

% \medskip

%We can see that the first solution found by \blossom has comparable accuracy to the one found by \cart.

%The implementation of \cart in scikit-learn does not seem to be very efficient computationally. However, this is not so relevant as it is clear that one greedy run of the heuristic can be implemented to be as fast as the first dive of \blossom.

The point of this experiment is threefold. Firstly, it shows that the first solution is very similar to that found by \cart. There is actually a slight advantage for \blossom, which can

be explained by the small difference in the heuristic selection of features: whereas \cart systematically selects the feature with minimum Gini impurity, \blossom does so for all \emph{but the deepest feature test}, for which it selects the feature with least classification error.

Secondly, this first tree

is found extremely quickly, and there is no scaling issue with respect to the depth of the tree or with respect to the size of the data set. Thirdly, even for large data sets and deep trees, the accuracy of the initial classifier can be significantly improved given a reasonable computation time.
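The two split criteria discussed above can be compared concretely. The following sketch (our own illustration for a single binary feature, with weighted Gini impurity as in \cart and classification error as used for \blossom's deepest tests) computes both scores for one candidate split:

```python
def split_scores(neg, pos, f):
    """Return (weighted Gini impurity, classification error) of splitting
    on binary feature f; examples are dicts mapping feature name -> bool."""
    def counts(value):
        n = sum(x[f] == value for x in neg)  # negatives in this child
        p = sum(x[f] == value for x in pos)  # positives in this child
        return n, p

    gini, error, total = 0.0, 0, len(neg) + len(pos)
    for value in (False, True):
        n, p = counts(value)
        if n + p:
            # child impurity weighted by the fraction of examples it holds
            gini += (n + p) / total * (1 - (n / (n + p))**2 - (p / (n + p))**2)
        # a leaf predicting the majority class misclassifies the minority
        error += min(n, p)
    return gini, error
```

Both criteria often agree on the best feature, but Gini impurity rewards purer children even when the majority class is unchanged, whereas classification error only counts misclassified examples, which is exactly the quantity being optimized at the deepest level.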

%Moreover, although we would need larger data sets to be confident about that, it seems that our algorithm is faster than \cart to find this first decision tree. %One can conjecture that \cart uses more sophisticated heuristic choices to explain these two observations.

%Then, in most cases, it is possible to improve the first solution significantly within a few seconds. Notice that for larger depth, improving the initial solution is harder and the 3s time limit is comparatively tighter than for smaller trees, so the gain of \blossom over \cart is more sensible for small trees.

% \begin{table}[htbp]

...

...

Figure~\ref{fig:acccactus} reports the evolution of the average accuracy (across all 58 data sets) over time, giving a good view of the difference between \murtree and \blossom during search. The accuracy of the tree returned by \cart is given for reference.

We can see in those graphs that \murtree finds an initial tree extremely quickly, although its accuracy is very low. This is because \murtree reports progress even when the tree is not complete: e.g., the first solution is always a single node with the most promising feature. We can see in Table~\ref{tab:summaryspeed} that this is indeed always the same first tree, irrespective of the depth.

...

...

% \subsection{Balancing size and accuracy}

%

% Most decision trees toolkits somehow try to balance size and accuracy. \blossom uses the standard approach to bound the maximum depth and searches for the tree with maximum accuracy within that limit. Other methods focus on size rather than depth.

% For instance, the algorithm \gosdt~\cite{NEURIPS2019_ac52c626} optimize a linear combination of classification error and number of leaves.

%

% In order to compare with such approaches, we designed a method to trade accuracy for size based on pruning. Given the tree of accuracy $\alpha$ found by \blossom, and given a target accuracy $\tau \leq \alpha$, we suppress the subtree of size $s_i$ and classification error $\alpha_i$ such that $\alpha_i/s_i$ is minimum, as long as the overall accuracy is not lower than $\tau$. We did not manage, unfortunately, to obtain a relevant comparison with \gosdt, because no setting of the regularization parameter enabled us to obtain trees with more than a dozen leafs. Instead we experimented with \iti~\cite{Utgoff97decisiontree}. We ran it on every data set, and grouped the resulting trees in 4 classes depending on their depths. The first column of Table~\ref{tab:iti} shows the number of data sets in each class. Then for \iti, we report the average classification error and size of the trees.

% For \blossom, we report the same data before and after pruning.

% Over more than half of the data sets, \blossom can find trees that are both smaller and more accurate than those found by \iti. On the first and last classes, however, \iti's trees are slightly smaller, albeit less accurate.

%

%

% \begin{table}[htbp]

...

...

\item If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...

\begin{enumerate}

\item If your work uses existing assets, did you cite the creators?

\answerYes{}

\textbf{We give the source of the data sets in the Appendix, and cite the authors of the algorithms we compared our algorithm to. We used no other asset.}

\item Did you mention the license of the assets?

\answerNA{}

\textbf{N/a}

\item Did you include any new assets either in the supplemental material or as a URL?

\answerTODO{}

\textbf{We will make the few data sets that we binarized ourselves publicly available after publication.}

\item Did you discuss whether and how consent was obtained from people whose data you're using/curating?

\answerNo{}

\textbf{No}

\item Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

\answerNo{}

\textbf{No}

\end{enumerate}

\item If you used crowdsourcing or conducted research with human subjects...

...

...

\section{Appendix}

Section~\ref{appendix:info} provides some information about the chosen data sets. Section~\ref{appendix:lb} provides an example for the lower bound reasoning described in Section~\ref{sec:lb}. In Section~\ref{appendix:extra} we report the results of some experimental evaluation on balancing size and accuracy with \blossom. Finally, in Section~\ref{appendix:full}, we report the raw data from our experiments (for every method and every data set).

\subsection{Information about the data sets}

\label{appendix:info}

The benchmark of classification data sets we used is described in Table~\ref{tab:info}. It consists of 50 data sets commonly used in related work articles (specifically, \cite{narodytska2018learning,dl85,verwer2019learning}), to which we added the following large data sets in order to stress how well the different approaches scale.

\begin{itemize}

\item The data set \texttt{taiwan\_binarised} comes from the \href{https://archive.ics.uci.edu/ml/index.php}{UCI repository} and was discretized using ad-hoc thresholds on continuous features.

\item The data sets \texttt{adult\_discretized} and \texttt{compas\_discretized} [TODO!!]

\item The data sets \texttt{bank}, \texttt{titanic}, \texttt{surgical-deepnet} and \texttt{weather-aus} come from \href{https://www.kaggle.com/}{Kaggle} and were binarized using the one-hot encoding implemented by the authors of \cite{narodytska2018learning}.

\item The data set \texttt{mnist\_0} is the well-known data set of handwritten digits, binarized as follows: every pixel is a binary attribute whose value is 1 if its greyscale value is larger than $0.5$ and 0 otherwise. A data point is positive if it is the digit ``0'' and negative otherwise.

\end{itemize}

We report the number of data points ($|\allex|$), the number of features ($|\features|$), the same parameters after preprocessing (respectively $|\allex|^*$ and $|\features|^*$), and the ``noise'' ratio, that is: $2|\posex\cap\negex|/(|\posex|+|\negex|)$.
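As an illustration, this noise ratio can be computed as follows (a minimal Python sketch under our own naming, not the paper's code; $\posex$ and $\negex$ are represented as sets of feature tuples):

```python
def noise_ratio(pos, neg):
    """Noise ratio 2*|pos & neg| / (|pos| + |neg|): the fraction of data
    points whose feature vector occurs with both labels, i.e., points
    that no classifier on these features can always get right."""
    pos, neg = set(pos), set(neg)
    return 2 * len(pos & neg) / (len(pos) + len(neg))

# One of the four points, (0, 1), appears both as positive and negative.
print(noise_ratio({(0, 1), (1, 1)}, {(0, 1), (1, 0)}))  # 0.5
```

A noise ratio of 0 thus means the data set is perfectly separable in principle.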

\begin{table}[htbp]%

\begin{center}%

\begin{scriptsize}%

\tabcolsep=10pt%

\input{src/tables/datasetinfo.tex}%

\end{scriptsize}%

\end{center}%

\caption{\label{tab:info} Benchmark and preprocessing data}%

\end{table}%

\subsection{Example of lower bound reasoning}

\label{appendix:lb}

\begin{example}[Lower bound reasoning]

\label{ex:lb}

Figure~\ref{fig:lowerbound} shows a snapshot of the execution of \blossom. Every node is labelled with the feature test on that node, and with the values of $\best[\abranch]$ for the branch $\abranch$ ending on that node. When all subtrees of a branch $\abranch$ have been explored (hence $\opt[\abranch]=1$), this is marked by a ``$^*$''. We assume that the branch considered at Line~\ref{line:fail} is $\abranch=\{r, \bar{a}, \bar{c}, g\}$. For instance, we can suppose that a tree rooted at $\abranch$ with feature $e$ has been found (misclassifying 2 data points). Then, search moved to the sibling branch $\{r, \bar{a}, \bar{c}, \bar{g}\}$, which was then optimized for a total error of $4$, and now the pair $(\abranch,e)$ is popped out of \sequence. For every ancestor branch $\abranch'$ of $\abranch$, we give the values of $\lb{\abranch',\abranch}$ and $\best[\abranch']$ in brackets. Since there exists $\abranch'$ such that $\lb{\abranch',\abranch}\geq\best[\abranch']$ (e.g., $\emptyset$ and $\{r, \bar{a}\}$), we know that $\abranch$ cannot belong to an improving solution, and hence there is no need to try to extend it further.

% the current best classifier cannot be improved as long as

...

...

% }

\end{center}

\caption{\label{fig:lowerbound} Example of lower bound computation w.r.t.\ the branch $\abranch=\{r, \bar{a}, \bar{c}, g\}$}

%\caption{\label{fig:searchtree} The search tree for decision trees. \dynprog explores it depth first, whereas \blossom explores branches in the order given below the leaves.}

\end{figure}

\end{example}
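The pruning test in this example can be sketched as follows (an illustrative Python sketch, not the paper's implementation; all names are ours). A branch is discarded as soon as one of its ancestor branches $\abranch'$ satisfies $\lb{\abranch',\abranch}\geq\best[\abranch']$:

```python
def can_prune(ancestors, lb, best):
    """Return True if the current branch cannot belong to an improving
    tree: some ancestor's lower bound lb(b', b) already reaches the
    error best[b'] of the best known subtree rooted at b'."""
    return any(lb[b] >= best[b] for b in ancestors)

# Mirroring the example: the root's lower bound equals its best known
# error (6), so the branch is pruned even though the other ancestor
# still has slack (4 < 5).
print(can_prune(["root", "r,~a"],
                lb={"root": 6, "r,~a": 4},
                best={"root": 6, "r,~a": 5}))  # True
```

In other words, extending the branch could at best match, never beat, the current best tree through that ancestor.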

\subsection{Extra experiments on balancing size and accuracy}

\label{appendix:extra}

Most decision tree toolkits try, in some way, to balance size and accuracy. \blossom uses the standard approach of bounding the maximum depth and searching for the tree with maximum accuracy within that limit. Other methods focus on size rather than depth.

For instance, the algorithm \gosdt~\cite{NEURIPS2019_ac52c626} optimizes a linear combination of classification error and number of leaves.

In order to compare with such approaches, we designed a method to trade accuracy for size based on pruning. Given the tree of accuracy $\alpha$ found by \blossom, and given a target accuracy $\tau\leq\alpha$, we repeatedly suppress the subtree of size $s_i$ and classification error $\alpha_i$ such that $\alpha_i/s_i$ is minimum, as long as the overall accuracy does not drop below $\tau$. Unfortunately, we did not manage to obtain a relevant comparison with \gosdt, because no setting of the regularization parameter enabled us to obtain trees with more than a dozen leaves. Instead, we experimented with \iti~\cite{Utgoff97decisiontree}. We ran it on every data set, and grouped the resulting trees into four classes depending on their depth. The first column of Table~\ref{tab:iti} shows the number of data sets in each class. Then, for \iti, we report the average classification error and size of the trees.

For \blossom, we report the same data before and after pruning.

Over more than half of the data sets, \blossom can find trees that are both smaller and more accurate than those found by \iti. On the first and last classes, however, \iti's trees are slightly smaller, albeit less accurate.
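For illustration, the pruning loop described above can be sketched as follows (a minimal Python sketch under our own assumptions, not the paper's code: each prunable subtree $i$ is summarized by the pair $(\alpha_i, s_i)$, where we read $\alpha_i$ as the extra error incurred by collapsing subtree $i$ to a leaf, and $s_i$ as its size):

```python
def prune_for_size(subtrees, accuracy, tau, total):
    """Greedily remove the subtree with minimum error-to-size ratio
    err_i / size_i while the overall accuracy stays >= the target tau.
    subtrees: list of (err_i, size_i) pairs, err_i being the extra
    misclassifications from collapsing subtree i to a leaf;
    accuracy: accuracy of the unpruned tree; total: number of points.
    Returns the final accuracy and the number of nodes pruned."""
    remaining = sorted(subtrees, key=lambda t: t[0] / t[1])
    pruned_size = 0
    for err, size in remaining:
        if accuracy - err / total < tau:
            break  # pruning this subtree would violate the target tau
        accuracy -= err / total
        pruned_size += size
    return accuracy, pruned_size

# Cheap subtrees go first; stops before accuracy drops below 0.9.
print(prune_for_size([(1, 5), (4, 4), (10, 2)], 0.98, 0.9, 100))
# -> roughly (0.93, 9): the first two subtrees are pruned, the last kept
```

The greedy ratio criterion removes the subtrees that buy the most size reduction per unit of lost accuracy.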

\begin{table}[htbp]

...

...

\subsection{Full experimental results}

\label{appendix:full}

We report here the raw data from our experimental comparison with the state of the art, for $\mdepth=3,4,5,7,10$ and for the four size categories, in the following tables:

\tabcolsep=10pt

% \begin{center}

...

...

\tabcolsep=5pt%

...

...

% We report in Table~\ref{tab:summaryacc} data averaged over the 47 data sets described above, for

%

%

% the average accuracy found within the one hour time limit for \blossom and \murtree

%

% on relatively shallow trees (3,4 and 5) in tables~\ref{tab:d3}, \ref{tab:d4} and \ref{tab:d5}, respectively.

% We give the minimum \emph{error}, the cpu in seconds \emph{time} and size of the search space (\emph{choices}) required to prove optimality (when a proof is given, as markes by a 1 in the column \emph{opt}) or to find the best solution (otherwise).

...

...

\subsection{Factor analysis}

Finally, we report results of three variants of \blossom, in order to analyse the relative contributions of the factors described in Section~\ref{sec:ext}. For each variant, we report the average error (error), the ratio of optimality proofs (opt.), and the CPU time ratio with respect to the default setting, on data sets for which an optimal tree has been found.

In the variant ``No heuristic'', the Gini impurity heuristic described in Section~\ref{sec:heuristic} is disabled and replaced by simply selecting first the feature with minimum error. For shallow trees (depth 3 or 4), since in many cases the search space is completely exhausted, not computing the slightly more costly Gini impurity score may actually be a good choice, and we observe run time reductions of about 15\% to 20\%. However, the accuracy of the trees decreases extremely rapidly for larger maximum depths. As a result, far fewer optimality proofs are obtained, and they take much longer to compute.

...

...

\medskip

When the maximum depth and the number of features are not too large, both algorithms are comparable, although \blossom is systematically faster. However, when the depth or the number of features grows, the best solution found by \dleight is often of much lower quality. In fact, in most cases, it reaches the time or memory limit without outputting a solution (the missing entries correspond to \dleight reaching the 50GB memory limit). Notice that \blossom uses a tiny amount of memory (much less than the size of the data set).