
\section{Introduction}

In conclusion of their short paper showing that computing decision trees of maximum accuracy is NP-complete, Hyafil and Rivest stated: ``Accordingly, it is to be expected that good heuristics for constructing near-optimal binary decision trees will be the best solution to this problem in the near future.''~\cite{NPhardTrees}. Indeed, heuristic approaches such as \cart~\cite{breiman1984classification}, \idthree~\cite{10.1023/A:1022643204877} or \cfour~\cite{c4-5} have been prevalent long afterward, and are still vastly more commonly used in practice than exact approaches.

%\medskip

It is well established, however, that optimal decision trees (for some combination of accuracy, depth and size) generalize better to unseen data than heuristic trees.

This has been confirmed experimentally in several publications\footnote{Hence we shall not reproduce such experiments once again in this paper.}, in particular for the objective criterion considered in this paper: maximizing the accuracy given an upper bound on the depth~\cite{avellanedaefficient,bertsimas2017optimal,bertsimas2007classification,DBLP:conf/ijcai/Hu0HH20,DBLP:journals/corr/abs-2007-12652,dl8}.

Another valuable feature of smaller and/or shallower trees is that interpreting or explaining their predictions is comparatively easier.

%\medskip

...

...


Finally, a recently introduced algorithm, \murtree~\cite{DBLP:journals/corr/abs-2007-12652}, improves on earlier dynamic programming in several ways. As a result, it clearly dominates previous exact methods. It is more memory efficient, orders of magnitude faster than \dleight, and has a better anytime behavior. However, our experimental results show that for deeper trees, none of these methods can reliably outperform heuristics.


% \medskip

In this paper we introduce a relatively \emph{simple} algorithm (\budalg), that is as memory and time efficient as heuristics, and yet more efficient than most exact methods on most data sets.

This algorithm can be seen as an instance of the more general framework introduced in \cite{DBLP:journals/corr/abs-2007-12652}, tuned for the best possible scalability to large trees and the best possible anytime behavior.

...

...


%Moreover, if we also consider data points as conjunctions of features (where every feature appears either positively or negatively),

%given a branch $\abranch \subseteq \features$


Given a data set $\langle\negex,\posex\rangle$, we can associate a data set $\langle\negex[\abranch],\posex[\abranch]\rangle$ to a branch $\abranch$ where $\negex[\abranch]=\{\ex\mid\ex\in\negex, \abranch\subseteq\ex\}$ and $\posex[\abranch]=\{\ex\mid\ex\in\posex, \abranch\subseteq\ex\}$.

% For instance, the branch $\abranch = \{\afeat_i, \bar{\afeat_j}, \bar{\afeat_k}, \afeat_l\}$ has length 4 and

%

We write $\grow{\abranch}{\afeat}$ as a shortcut for $\abranch\cup\{\afeat\}$.


The classification error for branch $\abranch$ is $\error[\abranch]=\min(|\negex[\abranch]|, |\posex[\abranch]|)$.

% %Let $\error[\abranch]$ be $\min(|\negex[\abranch]|, |\posex[\abranch]|)$,

% and we write $\error[\abranch,\afeat]$ for $\error[\grow{\abranch}{\afeat}] + \error[\grow{\abranch}{\bar{\afeat}}]$.

% %A branch \abranch\ is said \emph{pure} iff $\error[\abranch]=0$.
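These definitions are straightforward to prototype. A minimal sketch (not the paper's implementation): we represent each example and each branch as a set of signed feature literals, e.g. $\{+1,-2\}$ for ``feature 1 true, feature 2 false''; the names \texttt{induced} and \texttt{error} are ours.

```python
def induced(branch, examples):
    """Examples that contain every literal of the branch
    (the induced data set for that branch)."""
    return [x for x in examples if branch <= x]

def error(branch, neg, pos):
    """error[branch] = min(|neg[branch]|, |pos[branch]|):
    misclassifications if the branch becomes a majority-class leaf."""
    return min(len(induced(branch, neg)), len(induced(branch, pos)))
```

For instance, with two negative and two positive examples, the empty branch (the root) has error equal to the size of the minority class, and a branch on a perfectly separating literal has error 0.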

...

...


\section{An Anytime Algorithm}


Algorithm~\ref{alg:bud} shows the pseudo-code of an iterative, anytime, version of Algorithm~\ref{alg:dynprog} (highlighted code can be ignored for now). This algorithm

%In a nutshell, Algorithm~\ref{alg:bud}

explores the same search space as Algorithm~\ref{alg:dynprog}: the same branch is never explored twice. However, incomplete branches are expanded before trying alternative features for already explored branches. In other words, instead of optimizing the left subtree before exploring the right subtree as in Algorithm~\ref{alg:dynprog}, Algorithm~\ref{alg:bud} first fully expands a decision tree before exploring alternatives for any branch; see Figure~\ref{fig:searchtree} for an illustration of the branch exploration order.

%For instance, consider a data set with three binary features $\features = \{a,b,c\}$. Figure~\ref{fig:searchtree} shows the branches explored by both Algorithm~\ref{dynprog} and Algorithm~\ref{alg:bud}. Both algorithms explore first the branches $\{a,b\}$ and $\{a,\bar{b}\}$. However, whereas Algorithm~\ref{dynprog} explores next the branches $\{a,c\}$ and $\{a,\bar{c}\}$, Algorithm~\ref{alg:bud} explores next the branches $\{\bar{a},b\}$ and $\{\bar{a},\bar{b}\}$, hence immediately

...

...


\item If $(\abranch,\afeat)\in\sequence$, then every subtree of $\abranch$ starting with a feature test $\aofeat <_{\abranch}\afeat$ has already been explored and $\best[\abranch]$ contains the minimum of their errors. The set $\dom[\abranch]$ contains all \emph{untried} feature tests for branch $\abranch$ ($\dom[\abranch]=\{\aofeat\mid\aofeat\in\features ~\wedge~ \afeat <_{\abranch}\aofeat\}$).

\item If $(\abranch,\afeat)\in\sequence$ but one of its children $\grow{\abranch}{\afeat}$ or $\grow{\abranch}{\bar{\afeat}}$ (call it $\aobranch$) is not in the current tree, then:

\begin{itemize}

\item it is \emph{terminal} ($|\aobranch|=k$ or $\error[\aobranch]=0$), or

...

...


\end{itemize}

\begin{algorithm}[t]

\begin{footnotesize}

\caption{Blossom Algorithm\label{alg:bud}}

\TitleOfAlgo{\budalg}

\KwData{$\negex,\posex, \maxd$}

\KwResult{The minimum error on $\negex,\posex$ for decision trees of depth $\maxd$}

$\sequence\gets[]$\;

% $\bud \gets \emptyset$\;

$\bud\gets\newbud(\emptyset,\emptyset)$\;

% $\bud \gets \{\emptyset\}$\;

% $\dom[\emptyset] \gets \features$\;

% $\best[\emptyset] \gets \min(\negex, \posex)$\;

% \HiLi $\opt[\emptyset] \gets \texttt{false}$\;

\While{$|\sequence| + |\bud| > 0$}{

\lnl{line:dive}\If{$\bud\neq\emptyset$}{

%$\abranch \gets \select{\bud}$\;

\lnl{line:budchoice}pick and remove $\abranch$ from $\bud$\;

% \lnl{line:leaves}\eIf{$|\abranch| = \maxd$ or $\error[\abranch] = 0$} {

\lnl{line:splitting}compute $\negex[\abranch]$ and $\posex[\abranch]$ \colorbox{yellow!50}{and $p(\afeat,\negex[\abranch])$ and $p(\afeat,\posex[\abranch]), \forall\afeat\in\features$}\;

\lnl{line:domain}$\dom[\abranch]\gets\features\setminus\{\afeat\mid\afeat\in\abranch ~\vee~ \bar{\afeat}\in\abranch\}$ \colorbox{yellow!50}{sorted by increasing Gini score}\;

As long as there is a bud ($\bud\neq\emptyset$), we pick any one $\abranch\in\bud$ at Line~\ref{line:budchoice} and check whether it can or needs to be expanded at Line~\ref{line:notterminal}. %If its length is $\mdepth$ the error at this leaf is recorded in $\best[\abranch]$.

If so, we pick a feature $\afeat$ marked as \emph{untried} for \abranch, unmark it,

...

...


Otherwise, it is optimal since all features have been tried, and $\best[\abranch]$ contains the minimum error for any subtree of branch $\abranch$.

%and its error is the sum of the errors of its best subtrees.

This branch will never be expanded again, since it is not added back to $\bud$.

When the algorithm ends, $\best[\emptyset]$ contains the minimum error of any decision tree of depth at most $\mdepth$. % on the data set.

% Algorithm~\ref{alg:bud} starts from a singleton set \bud\ of open branches or \emph{buds},

...

...


% \medskip

To simplify the pseudo-code, we use branches to index array-like data structures in Algorithm~\ref{alg:bud} (e.g., $\dom[\abranch]$). In practice, a set of \emph{indices} (at most $2^{\mdepth}$ in the worst case) is used as a proxy for branches in all contexts, since the current tree cannot have more than $2^{\mdepth}$ branches. At Line~\ref{line:storebest}, the indices for $\grow{\abranch}{\afeat}$ and $\grow{\abranch}{\bar{\afeat}}$ are released, and a free index is marked as used when expanding a branch at Line~\ref{line:branching}. Moreover, the pseudo-code in Algorithm~\ref{alg:bud} does not show how the best subtrees of optimal branches are recorded, nor how the overall best error is updated when completing a new decision tree at Line~\ref{line:else}.
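The index mechanism alluded to above can be sketched as a simple free list (a hypothetical helper, not the paper's code; the paper only states that at most $2^{\mdepth}$ indices are live at once):

```python
class IndexPool:
    """Pool of branch indices: at most 2**max_depth branches can be
    live simultaneously, so indices can be recycled."""

    def __init__(self, max_depth):
        self.free = list(range(2 ** max_depth))  # all indices start free

    def acquire(self):
        """Mark a free index as used (when a branch is expanded)."""
        return self.free.pop()

    def release(self, idx):
        """Return an index to the pool (when a branch's two children
        have been fully processed)."""
        self.free.append(idx)
```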

%The worst case space complexity of the algorithm is therefore in $\Theta(2^{\mdepth}\numfeat)$. Under the standard assumption that $2^{\mdepth} \leq \numex$, this is less than the size of the input.

% where $\sizetree \leq 2^{\mdepth}$ is the maximum size of the explored tree, that is the maximum length of $\sequence$.

...

...


\begin{proof}

From the invariants, we can see that Algorithm~\ref{alg:bud} explores the same set of $\Perm{\numfeat}{\mdepth}2^\mdepth$ branches (i.e., the $2^{\mdepth}$ outcomes of each permutation of ${\mdepth}$ features).

%

Moreover, the ``yes'' branch of Condition~\ref{line:dive} dominates the time complexity since at most one element is added to $\sequence$, whereas Loop~\ref{line:backtrack} suppresses exactly one element of $\sequence$ at every iteration (and each of its iterations is in constant time).

The time complexity is therefore dominated by the splitting procedure whereby $\negex[\grow{\abranch}{\afeat}]$, $\negex[\grow{\abranch}{\bar{\afeat}}]$, $\posex[\grow{\abranch}{\afeat}]$ and $\posex[\grow{\abranch}{\bar{\afeat}}]$ are computed from $\negex[\abranch]$ and $\posex[\abranch]$. As discussed earlier, this takes linear time amortized over the $2^{\mdepth}$ branches sharing the same set of $\mdepth$ features. Therefore, the overall time complexity for the splitting operations is in $\Theta(\Perm{\numfeat}{\mdepth}\numex)$.
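The splitting procedure itself is a single linear pass over the parent's induced set. A sketch, assuming the signed-literal representation in which every example contains either $\afeat$ or $\bar{\afeat}$ (the name \texttt{split} is ours):

```python
def split(examples, f):
    """Partition examples[branch] into examples[branch + {f}] and
    examples[branch + {not f}] in one linear pass."""
    with_f, without_f = [], []
    for x in examples:
        (with_f if f in x else without_f).append(x)
    return with_f, without_f
```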

...

...


Since branches can be stored in constant space (an index, the parent branch and the two children),

the worst-case space complexity is $\Theta(2^{\mdepth}\numfeat)$, needed to record which features have been tried (the sets $\dom$).

\hfill$\square$

\end{proof}

...

...




\subsection{Heuristic Ordering}

\label{sec:heuristic}

In order to quickly find accurate trees, it is important to select the most promising features first. We experimented with three heuristics based on scores to minimize: the \emph{classification error}, the \emph{entropy}~\cite{10.1023/A:1022643204877}, and the \emph{Gini impurity}~\cite{breiman1984classification}.

Each of these heuristics associates a score to a feature $\afeat$ at a branch $\abranch$:


%We order the possible features for branch $\abranch$ in non-decreasing order with respect to a score above and

%explore the features in that order in Line~\ref{line:assignment}.


Computing the frequencies $p(\afeat,{\negex[\abranch]})$ and $p(\afeat,{\posex[\abranch]})$ of every feature $\afeat$ can be done in $\Theta(\numfeat\numex)$ time where

$\numex= |\negex[\abranch]|+|\posex[\abranch]|$.\footnote{$p(\bar{\afeat},{\negex[\abranch]})= |\negex[\abranch]| - p({\afeat},{\negex[\abranch]})$ and $p(\bar{\afeat},{\posex[\abranch]})= |\posex[\abranch]| - p({\afeat},{\posex[\abranch]})$ can then be queried in constant time} In other words this is more expensive than the splitting procedure by a factor $\numfeat$, but can be similarly amortized. However, since the depth of the branches is effectively reduced by one, the number of terminal branches is reduced by the same factor $\numfeat$, hence this incurs no asymptotic increase in complexity.
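The frequency counts and a Gini-based score can be prototyped directly from these definitions. A hedged sketch (the names \texttt{counts} and \texttt{gini\_split} are ours; it computes the standard weighted Gini impurity of the split on $\afeat$, which may differ in minor details from the exact score used by \budalg):

```python
def counts(f, examples):
    """p(f, examples): number of examples containing feature f positively."""
    return sum(1 for x in examples if f in x)

def gini_split(f, neg, pos):
    """Weighted Gini impurity of splitting <neg, pos> on feature f."""
    def gini(n, p):
        t = n + p
        if t == 0:
            return 0.0
        return 1.0 - (n / t) ** 2 - (p / t) ** 2
    n1, p1 = counts(f, neg), counts(f, pos)   # child where f is true
    n0, p0 = len(neg) - n1, len(pos) - p1     # child where f is false
    total = len(neg) + len(pos)
    return ((n1 + p1) * gini(n1, p1) + (n0 + p0) * gini(n0, p0)) / total
```

A perfectly separating feature scores 0, and an uninformative one scores the impurity of the unsplit data, so sorting by increasing score tries the most promising features first.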

Furthermore, ordering the features (at Line~\ref{line:domain})

%Computing this order

costs $\Theta(\numfeat\log\numfeat)$ for each of the $2^{\mdepth-1}\numfeat^{\mdepth-1}$ branches added to $\bud$ at Line~\ref{line:branching}. Again, since the depth of the branches is effectively reduced by one, the resulting complexity

...

...


\subsection{Lower Bound}

\label{sec:lb}

It is possible to fail early using a lower bound on the error given prior decisions, as \dleight\ does~\cite{dl8}.

%, following the idea introduced in \cite{dl8}.

When some subtrees along a branch $\abranch$ are optimal and the sum of their errors is larger than the current upper bound (the best solution found so far), there is no need to continue exploring branch $\abranch$.

%Line~\ref{line:leaves} can be changed to ``\textbf{If} $\bud \neq \emptyset ~\& \not\exists \abranch \in \bud, \dominated{\abranch}$ \textbf{then}''. %Notice that when a branch is ``pruned'' in this way, its

...

...


%In this case, we can fail by forcing


First, observe that $\best[\abranch]$ is an upper bound on the classification error for any subtree rooted at $\abranch$, since this value comes from an actual tree (of depth $\mdepth- |\abranch|$ for the data set $\langle\negex[\abranch],\posex[\abranch]\rangle$). It is possible to propagate this upper bound to parent nodes efficiently (in $O(|\abranch|)$ time). Here we assume that this is done recursively for the parent branch, every time the value $\best[\abranch]$ is updated. %, by recursively applying the same update procedure to the parent.

Now, when the condition in Line~\ref{line:optimal} fails for a branch $\abranch$, it means that $\best[\abranch]$ is \emph{optimal}: there is no subtree rooted at $\abranch$ of maximum depth $\mdepth- |\abranch|$ whose classification error is lower than $\best[\abranch]$. This is true either because every subtree has been explored, or, with the changes described in Section~\ref{sec:heuristic}, because $\mdepth- |\abranch| =1$ and the feature $\afeat$ with least

...

...


In plain words, $\lb{\abranch',\abranch}$ is the sum of the errors of optimal ``sibling'' branches between $\abranch'$ and $\abranch$. We illustrate this bound in Example~\ref{ex:lb}.

As long as these choices of feature tests stand (i.e., as long as $\abranch$ belongs to the current tree), these subtrees cannot be improved, hence this lower bound is correct.
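This bound amounts to a walk from $\abranch$ up to $\abranch'$ accumulating $\best$ over siblings that are already optimal. A sketch under assumed bookkeeping (dictionaries for parent and sibling pointers and for the $\best$/$\opt$ arrays; the function name is ours):

```python
def lower_bound(b_prime, b, parent, sibling, best, opt):
    """lb(b', b): sum of best[] over the optimal sibling subtrees
    encountered while walking up from branch b to its ancestor b'."""
    lb = 0
    node = b
    while node != b_prime:
        s = sibling[node]
        if opt.get(s, False):   # only optimal siblings contribute
            lb += best[s]
        node = parent[node]
    return lb
```

The branch $\abranch$ can then be pruned as soon as $\lb{\abranch',\abranch} \geq \best[\abranch']$ for some ancestor $\abranch'$.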

...

...


% The procedure $\dominated{\abranch}$ can therefore simply check, for all parent $\abranch'$ of $\abranch$ up until the root ($\emptyset$), whether $\lb{\abranch',\abranch} \geq \best[\abranch']$. As a result, a branch $\abranch$ which is guaranteed, by this reasoning, to never belong to a non-dominated tree will not be explored further.

\begin{example}[Lower bound reasoning]

\label{ex:lb}

Figure~\ref{fig:lowerbound} shows a snapshot of the execution of \budalg. Every node is labelled with the feature test on that node, and with the values of $\best[\abranch]$ for the branch $\abranch$ ending on that node. When all subtrees of a branch $\abranch$ have been explored (hence $\opt[\abranch]=1$), this is marked by a ``$^*$''. We assume that the branch considered at Line~\ref{line:fail} is $\abranch=\{r, \bar{a}, \bar{c}, g\}$. For instance, we can suppose that a tree rooted at $\abranch$ with feature $e$ has been found (misclassifying 2 data points). Then, search moved to the sibling branch $\{r, \bar{a}, \bar{c}, \bar{g}\}$, which was then optimized for a total error of $4$, and now the pair $(\abranch,e)$ is popped out of \sequence. For all ancestors $\abranch'$ of $\abranch$, we give the values of $\lb{\abranch',\abranch}$ and $\best[\abranch']$ between brackets. Since there exists $\abranch'$ such that $\lb{\abranch',\abranch}\geq\best[\abranch']$ (e.g., $\emptyset$ and $\{r, \bar{a}\}$), we know that $\abranch$ cannot belong to an improving solution, and hence there is no need to try to extend it further.

% the current best classifier cannot be improved as long as

\caption{\label{fig:lowerbound} Example of lower bound computation w.r.t. the branch }

%\caption{\label{fig:searchtree} The search tree for decision trees. \dynprog explores it depth first, whereas \budalg explores branches in the order given below the leaves.}

\end{figure}

\end{example}


% \medskip

This reasoning is more effective when good upper bounds are found early, hence the feature ordering heuristic discussed in the previous section has an impact. Moreover, the order in which buds are expanded (the choice of branch at Line~\ref{line:budchoice}) has an impact as well. We found that the simplest branch selection strategy was also the one giving the best results: we expand first the branch that was inserted into \bud\ first (i.e., \bud\ is \emph{FIFO}). One possible explanation is that by avoiding unnecessary ``jumps'' to different parts of the decision tree, this strategy promotes optimizing sibling subtrees first, and therefore deeper trees earlier.

% intuitivelly, one want to optimize the branches of the decision trees with the largest error first, in order to benefit from larger lower bounds earlier. To this end, it

...

...


\paragraph{Dataset reduction.}

It is easy to adapt \budalg (or most decision tree classifiers, actually) to handle weighted data sets by redefining the error as follows, given a weight function $\weight$ on $\allex$:

We can use the weighted version to handle noisy data, by merging duplicated datapoints and suppressing inconsistent datapoints.
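This reduction can be sketched as follows (our own prototype, not the paper's code: data points are frozensets of signed literals; an inconsistent point, appearing in both classes, keeps only its surplus weight in the majority class, and the cancelled weight becomes an unavoidable base error):

```python
from collections import Counter

def reduce_dataset(neg, pos):
    """Merge duplicated data points into weighted ones and suppress
    inconsistent weight; returns weighted classes and the base error."""
    cn, cp = Counter(map(frozenset, neg)), Counter(map(frozenset, pos))
    wneg, wpos, base_error = {}, {}, 0
    for x in set(cn) | set(cp):
        n, p = cn[x], cp[x]
        base_error += min(n, p)      # error no classifier can avoid
        if n > p:
            wneg[x] = n - p
        elif p > n:
            wpos[x] = p - n
    return wneg, wpos, base_error
```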

...

...


\paragraph{Feature reduction.}


A feature $\afeat$ is redundant if there exists another feature $\afeat'$ such that either: $\forall x \in\allex, \afeat\in x \iff\afeat' \in x$, or $\forall x \in\allex, \afeat\in x \iff\afeat' \not\in x$.

%We simply remove such redundant features.

They can be found by comparing pairs of rows of the data set via bitset operations, and therefore in time $O(\numex\numfeat^2)$.
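A sketch of this detection, using Python integers as bitsets (an assumed encoding: bit $i$ of \texttt{columns[f]} is set iff feature $\afeat$ holds in data point $i$; for brevity we deduplicate through a hash set rather than the explicit pairwise comparison described above):

```python
def redundant_features(columns, num_points):
    """Keep one feature per equivalence class of equal or
    complemented bitset columns; drop the redundant ones."""
    mask = (1 << num_points) - 1      # bitmask over all data points
    keep, seen = [], set()
    for f, col in enumerate(columns):
        if col in seen or (col ^ mask) in seen:
            continue                  # equal or complemented: redundant
        seen.add(col)
        keep.append(f)
    return keep
```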

Removing redundant features may appear very naive; however, it turns out that the binarization techniques (one-hot encoding) used to turn general data sets into binary data sets are often not optimized, and many redundant features do exist. The number of features ($\numfeat$) has a huge impact on the complexity: the branching factor of the algorithm is $2\numfeat$ (see Figure~\ref{fig:searchtree}).

Moreover, at every branch, ``informationless'' features (i.e., features $\afeat$ such that $(\forall x \in\posex[\abranch], \afeat\in x)\iff(\forall x \in\negex[\abranch], \afeat\in