@@ -349,7 +349,7 @@ Let $\afeat_i <_{\abranch} \afeat_j$ if and only if feature $\afeat_i$ is select
\begin{itemize}
\item\sequence\ represents the current decision tree: if $(\abranch,\afeat)\in\sequence$, then the current tree tests feature $\afeat$ at the extremity of branch $\abranch$. We say that the branch $\abranch$ is in the current tree, and that feature $\afeat$ is tested on branch $\abranch$.
\item If $(\abranch,\afeat)\in\sequence$, then every subtree of $\abranch$ starting with a feature test $\aofeat <_{\abranch}\afeat$ has already been explored and $\best[\abranch]$ contains the minimum of their errors. The set $\dom[\abranch]$ contains all \emph{untried} feature tests for branch $\abranch$ ($\dom[\abranch]=\{\aofeat\mid\aofeat\in\features ~\wedge~ \afeat <_{\abranch}\aofeat\}$). These invariants are illustrated by the sketch following this list.
\item If $(\abranch,\afeat)\in\sequence$ but one of its children $\grow{\abranch}{\afeat}$ or $\grow{\abranch}{\bar{\afeat}}$ (call it $\aobranch$) is not in the current tree, then either:
...
...
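For concreteness, the following Python sketch pictures the state described by these invariants. The representation of a branch as a tuple of signed feature tests, and all identifiers, are illustrative choices of ours rather than the actual implementation.
\begin{verbatim}
import math

# Illustrative representation: a branch is a tuple of signed tests,
# e.g. ((3, True), (7, False)) tests feature 3 positively, then 7
# negatively.
sequence = []  # stack of (branch, feature) pairs: the current tree
best = {}      # best[b]: minimum error over the explored subtrees of b
dom = {}       # dom[b]: untried feature tests for b, in exploration order

def expand(branch):
    """Test the next untried feature at the extremity of `branch`.

    Every feature ordered before it in dom[branch] has already been
    tried, so best[branch] already accounts for those subtrees.
    """
    feature = dom[branch].pop(0)
    sequence.append((branch, feature))
    for child in (branch + ((feature, True),),
                  branch + ((feature, False),)):
        best.setdefault(child, math.inf)  # nothing explored below child
\end{verbatim}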
@@ -370,7 +370,7 @@ expand the tree with the test $\afeat$ at branch $\abranch$. The two children $\
If there is no bud ($\bud=\emptyset$), then the current tree is complete: every branch $\abranch$ is either terminal or optimal. In that case we pop the last assignment $(\abranch,\afeat)$ from \sequence\
%, mark the feature $\afeat$ as tried for branch $\abranch$
and update the best error of its subtrees. If there is at least one untried feature for branch $\abranch$, we add $\abranch$ to $\bud$.
Otherwise, the branch is optimal since all features have been tried, and $\best[\abranch]$ contains the minimum error over all subtrees of branch $\abranch$.
%and its error is the sum of the errors of its best subtrees.
This branch will never be expanded again, since it is not added back to $\bud$.
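Continuing the illustrative sketch above, this backtracking step could read as follows (here \bud\ is assumed to be a set of open branches):
\begin{verbatim}
import math

def backtrack(sequence, best, dom, bud):
    """Pop the last assignment when bud is empty and update bounds."""
    branch, feature = sequence.pop()
    pos = branch + ((feature, True),)
    neg = branch + ((feature, False),)
    # the subtree testing `feature` at `branch` combines both children
    best[branch] = min(best.get(branch, math.inf),
                       best[pos] + best[neg])
    if dom[branch]:        # untried features remain: re-open the branch
        bud.add(branch)
    # otherwise best[branch] is now the optimum over all features and
    # the branch is never expanded again
\end{verbatim}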
...
...
@@ -652,17 +652,13 @@ Notice that to simplify the pseudo-code, we use branches to index array-like dat
\KwData{$\negex,\posex, \maxd$}
\KwResult{The minimum error on $\negex,\posex$ for decision trees of depth $\maxd$}
\lnl{line:splitting}compute $\negex(\abranch)$ and $\posex(\abranch)$\colorbox{yellow!50}{and $p(\afeat,\negex(\abranch))$ and $p(\afeat,\posex(\abranch)), \forall\afeat\in\features$}\;
\lnl{line:domain}$\dom(\abranch)\gets\features\setminus\{\afeat\mid\afeat\in\abranch ~\vee~ \bar{\afeat}\in\abranch\}$\colorbox{yellow!50}{sorted by increasing Gini score}\;
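To picture Lines~\ref{line:splitting} and~\ref{line:domain}, here is a Python sketch of the frequency computation and the Gini ordering. It assumes examples are represented as sets of the features that hold in them, and it does not reflect the amortized computation discussed later.
\begin{verbatim}
def gini_order(features, neg, pos):
    """Sort candidate features by increasing Gini score of their split.

    p(f, E) is the number of examples of E in which feature f holds.
    """
    def gini(n_neg, n_pos):
        tot = n_neg + n_pos
        if tot == 0:
            return 0.0
        return 1.0 - (n_neg / tot) ** 2 - (n_pos / tot) ** 2

    def score(f):
        pn = sum(1 for x in neg if f in x)  # p(f, neg(branch))
        pp = sum(1 for x in pos if f in x)  # p(f, pos(branch))
        qn, qp = len(neg) - pn, len(pos) - pp
        total = len(neg) + len(pos)
        return ((pn + pp) * gini(pn, pp)
                + (qn + qp) * gini(qn, qp)) / total

    return sorted(features, key=score)
\end{verbatim}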
@@ -1031,7 +1041,7 @@ The feature tests at Line~\ref{line:assignment} of Algorithm~\ref{alg:bud} are e
In the data sets we used, the Gini impurity was significantly better, and hence all reported experimental results use Gini impurity unless stated otherwise. For branches of length $\mdepth-1$, however, we use the error instead. Indeed, the optimal feature $\afeat$ for a branch $\abranch$ that cannot be extended further is the one minimizing the error of the two leaves it creates, $\min(|\negex(\grow{\abranch}{\afeat})|,|\posex(\grow{\abranch}{\afeat})|)+\min(|\negex(\grow{\abranch}{\bar{\afeat}})|,|\posex(\grow{\abranch}{\bar{\afeat}})|)$.
This means that we actually do not have to try other features for that node. This is implemented by the highlighted code at Line~\ref{line:optimal}: since one cannot improve on the first feature for the test at depth $\mdepth$, branches of length $\mdepth-1$ do not have to be put back into \bud, and can be backtracked upon.
% which means that we effectively restrict search to branches of length $\mdepth-1$.
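The sketch below spells out this last-level rule under the same illustrative representation: each leaf errs exactly on its minority class, so the error of testing a feature can be read off the counts directly, and the minimizer is returned without trying the others.
\begin{verbatim}
def best_last_test(features, neg, pos):
    """Pick the test for a branch of length maxdepth - 1.

    Both children are leaves, so the exact error of testing f follows
    directly from the counts p(f, .): a leaf predicts the majority
    class and errs on the minority one.
    """
    def error(f):
        pn = sum(1 for x in neg if f in x)  # p(f, neg(branch))
        pp = sum(1 for x in pos if f in x)  # p(f, pos(branch))
        return min(pn, pp) + min(len(neg) - pn, len(pos) - pp)

    return min(features, key=error)
\end{verbatim}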
...
...
@@ -1039,12 +1049,14 @@ This means that we actually do not have to try other features for that node. Thi
%We order the possible features for branch $\abranch$ in non-decreasing order with respect to a score above and
%explore the features in that order in Line~\ref{line:assignment}.
Computing the frequencies $p(\afeat,{\negex(\abranch)})$ and $p(\afeat,{\posex(\abranch)})$ of every feature $\afeat$ can be done in $\Theta(\numfeat\numex)$ time where
$\numex= |\negex(\abranch)|+|\posex(\abranch)|$.\footnote{$p(\bar{\afeat},{\negex(\abranch)})= |\negex(\abranch)| - p({\afeat},{\negex(\abranch)})$ and $p(\bar{\afeat},{\posex(\abranch)})= |\posex(\abranch)| - p({\afeat},{\posex(\abranch)})$ can then be queried in constant time.} In other words, this is more expensive than the splitting procedure by a factor $\numfeat$, but can be similarly amortized. However, since the depth of the branches is effectively reduced by one, the number of terminal branches is reduced by the same factor $\numfeat$, hence this incurs no asymptotic increase in complexity.
Furthermore, ordering the features (at Line~\ref{line:domain})
%Computing this order
costs $\Theta(\numfeat\log\numfeat)$ for each of the $2^{\mdepth-1}\numfeat^{\mdepth-1}$ branches added to $\bud$ at Line~\ref{line:branching}. Again, since the depth of the branches is effectively reduced by one, the resulting complexity
%(excluding the time for splitting the data set)
is $O((\numex+2^{\mdepth}\log\numfeat)\numfeat^{\mdepth})$. This very slight increase is usually inconsequential, as $\numex$ is often the dominating term.
% long as we have $\numex \geq 2^{\mdepth} \log \numfeat$.
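As an illustration of this accounting, the sketch below computes the counts once per branch in $\Theta(\numfeat\numex)$ time and answers the complement queries of the footnote in constant time (representation assumptions as before):
\begin{verbatim}
def frequencies(features, examples):
    """Compute p(f, examples) for every feature f in Theta(k*m) time,
    for k features and m examples given as sets of features."""
    p = {f: 0 for f in features}
    for x in examples:   # each example touches every feature it holds
        for f in x:
            p[f] += 1
    return p

# Toy usage: the complement p(f-bar, E) is a constant-time query.
features = [0, 1, 2]
neg = [{0, 2}, {1}]                # negative examples
p_neg = frequencies(features, neg)
assert len(neg) - p_neg[0] == 1    # p(0-bar, neg) = |neg| - p(0, neg)
\end{verbatim}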
The feature ordering has a very significant impact on how quickly the algorithm can improve the accuracy of the classifier. It also has an indirect, and much less significant, impact on the computational time necessary to explore the whole search space and prove optimality, because of the lower bound technique detailed in the next section.
...
...
@@ -1052,7 +1064,9 @@ The feature ordering has a very significant impact on how quickly the algorithm
\subsection{Lower Bound}
\label{sec:lb}
It is possible to fail early using a lower bound on the error given prior decisions in the same way as \dleight~\cite{dl8}.
%, following the idea introduced in \cite{dl8}.
The idea is that once some subtrees along a branch $\abranch$ are optimal and the sum of their errors is larger than the current upper bound (the best solution found so far), then there is no need to continue exploring branch $\abranch$.
%Line~\ref{line:leaves} can be changed to ``\textbf{If} $\bud \neq \emptyset ~\& \not\exists \abranch \in \bud, \dominated{\abranch}$ \textbf{then}''. %Notice that when a branch is ``pruned'' in this way, its
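One possible reading of this pruning test is sketched below; the dictionary \texttt{optimal}, flagging the branches $\abranch$ whose value $\best[\abranch]$ is exact, is our own illustrative bookkeeping rather than how the implementation necessarily records it.
\begin{verbatim}
def dominated(branch, best, optimal, upper_bound):
    """Sum the errors of the optimal sibling subtrees along `branch`;
    if they already reach the upper bound, no extension of `branch`
    can improve on the best tree found so far."""
    lb = 0
    for i, (feature, polarity) in enumerate(branch):
        sibling = branch[:i] + ((feature, not polarity),)
        if optimal.get(sibling, False):  # sibling fully explored
            lb += best[sibling]
    return lb >= upper_bound
\end{verbatim}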
...
...
@@ -1064,23 +1078,30 @@ It is possible to fail early using a lower bound on the error given prior decisi
First, observe that $\best[\abranch]$ is an upper bound on the classification error for any subtree rooted at $\abranch$, since this value comes from an actual tree (of depth $\mdepth- |\abranch|$ for the data set $\langle\negex(\abranch),\posex(\abranch)\rangle$). It is possible to propagate this upper bound to parent nodes efficiently (in $O(|\abranch|)$ time). Here we assume that this is done every time the value $\best[\abranch]$ is actually updated, by recursively applying the same update procedure to the parent.
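This propagation might be sketched as follows; the recursion shortens the branch by one test at each step, which gives the $O(|\abranch|)$ bound.
\begin{verbatim}
import math

def update_best(best, branch, value):
    """Propagate an improved bound to the ancestors of `branch`:
    testing the last feature of `branch` at its parent yields a tree
    whose error is the sum of the two children's current bounds."""
    if value >= best.get(branch, math.inf):
        return                           # not an improvement: stop
    best[branch] = value
    if branch:                           # recurse on the parent
        feature, polarity = branch[-1]
        parent = branch[:-1]
        sibling = parent + ((feature, not polarity),)
        update_best(best, parent, value + best.get(sibling, math.inf))
\end{verbatim}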
Now, when the condition in Line~\ref{line:optimal} fails