DISCOVER: A GPU-Accelerated Symbolic Regression Algorithm¶
Physics-Informed Scientific Discovery with Flexible Feature Constraints¶
The Core Problem: \(L_0\)-Regularized Regression¶
All search methods in DISCOVER aim to solve the same fundamental problem. Given a large set of \(M\) candidate symbolic features \(\Phi = \{\Phi_1, \Phi_2, \dots, \Phi_M\}\) and a target property \(\mathbf{y}\), our goal is to find a simple linear model, using at most \(D\) features, that best predicts the target. This is known as the \(L_0\)-norm regularized least-squares problem: $$ \min_{\beta} \; \|\mathbf{y} - \Phi\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_0 \le D $$
Here, \(\|\beta\|_0\) is the "\(L_0\)-norm," which just counts the number of non-zero coefficients in the vector \(\beta\). Because this problem is NP-hard, we need a variety of heuristic, approximate, or specialized algorithms to find a good solution.
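To make the objective concrete, here is a minimal NumPy sketch (the toy arrays `Phi` and `y` and the helper `l0_objective` are hypothetical, not part of DISCOVER) that evaluates the two quantities every search algorithm below trades off: the residual sum of squares of a candidate model and its \(L_0\) "norm":

```python
import numpy as np

# Hypothetical toy data: n samples, M candidate symbolic features, target y.
rng = np.random.default_rng(0)
n, M, D = 50, 10, 2
Phi = rng.normal(size=(n, M))                       # candidate feature matrix
y = 3.0 * Phi[:, 1] - 1.5 * Phi[:, 4] + 0.1 * rng.normal(size=n)

def l0_objective(beta, Phi, y):
    """Return the residual sum of squares and ||beta||_0 (non-zero count)."""
    rss = float(np.sum((y - Phi @ beta) ** 2))
    return rss, int(np.count_nonzero(beta))

beta = np.zeros(M)
beta[[1, 4]] = [3.0, -1.5]                          # a candidate two-feature model
rss, sparsity = l0_objective(beta, Phi, y)
print(f"RSS = {rss:.3f}, ||beta||_0 = {sparsity}")  # feasible when sparsity <= D
```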
Search Algorithms: The Sparsifying Operator¶
The following methods are implemented in `search.py` to select the best possible subset of \(D\) features from all the candidates.
Brute-Force Search (`_find_best_models_brute_force`)¶
Methodology¶
This is the most direct and exhaustive approach. It systematically checks every single combination of \(D\) features from the total pool of \(M\) candidates.
- For a given model size \(D\), generate all \(\binom{M}{D}\) unique feature combinations.
- For each combination, create a feature matrix \(X_D\).
- Fit an Ordinary Least Squares (OLS) model to find the coefficients \(\beta\). $$ \beta = X_D^\dagger \mathbf{y} \quad \text{(pseudo-inverse in case \(X_D\) is rank-deficient)} $$
- Calculate the model's error, typically the Root Mean Squared Error (RMSE). $$ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = \sqrt{\frac{1}{n} \|\mathbf{y} - X_D\beta\|_2^2} $$
- The combination with the lowest RMSE is chosen as the best model for that size \(D\).
Application in Symbolic Regression¶
This method guarantees finding the globally optimal set of \(D\) symbolic features. However, its computational cost, \(O(\binom{M}{D})\), makes it practical only for very small numbers of features (\(M\)) and model sizes (\(D\)).
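As a rough illustration of the exhaustive strategy (a plain NumPy sketch, not the GPU-accelerated implementation; the function name and arguments are hypothetical):

```python
import itertools
import numpy as np

def brute_force_search(Phi, y, D):
    """Test every D-feature subset of Phi and return the one with the lowest RMSE."""
    best_rmse, best_idx, best_beta = np.inf, None, None
    n, M = Phi.shape
    for idx in itertools.combinations(range(M), D):
        X_D = Phi[:, idx]
        # lstsq returns the pseudo-inverse solution, so rank-deficient X_D is handled.
        beta, *_ = np.linalg.lstsq(X_D, y, rcond=None)
        rmse = np.sqrt(np.mean((y - X_D @ beta) ** 2))
        if rmse < best_rmse:
            best_rmse, best_idx, best_beta = rmse, idx, beta
    return best_rmse, best_idx, best_beta
```

The loop performs \(\binom{M}{D}\) least-squares fits, which is exactly why this approach only scales to small \(M\) and \(D\).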
Greedy Search (`_find_best_models_greedy`)¶
Methodology¶
This is a forward selection algorithm that builds the model one feature at a time. It's much faster than brute force but might not find the absolute best solution.
- Dimension 1: It finds the single best feature (\(\Phi_1^*\)) by fitting \(M\) different one-feature models and picking the one with the lowest error. $$ \Phi_1^* = \underset{\Phi_j \in \Phi}{\text{argmin}} \left( \min_{\beta_0, \beta_j} \|\mathbf{y} - (\beta_0 + \beta_j \Phi_j)\|_2^2 \right) $$
- Dimension 2: Keeping \(\Phi_1^*\) in the model, it searches through the remaining \(M-1\) features to find the second feature (\(\Phi_2^*\)) that provides the best two-feature model. $$ \Phi_2^* = \underset{\Phi_j \in \Phi \setminus \{\Phi_1^*\}}{\text{argmin}} \left( \min_{\beta_0, \beta_1, \beta_j} \|\mathbf{y} - (\beta_0 + \beta_1 \Phi_1^* + \beta_j \Phi_j)\|_2^2 \right) $$
- Dimension D: The process continues, adding the feature that offers the biggest improvement at each step, until the model has \(D\) features.
Application in Symbolic Regression¶
The greedy search quickly finds a good—though not necessarily optimal—combination of symbolic features. At each step, it picks the symbolic expression that best explains the variance that the existing features haven't already accounted for.
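A compact sketch of forward selection (hypothetical function, NumPy only, with an explicit intercept column):

```python
import numpy as np

def greedy_search(Phi, y, D):
    """Forward selection: add the feature giving the largest error reduction at each step."""
    n, M = Phi.shape
    selected = []
    for _ in range(D):
        best_err, best_j = np.inf, None
        for j in range(M):
            if j in selected:
                continue
            # Refit intercept + previously selected features + candidate j.
            X = np.column_stack([np.ones(n), Phi[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            err = np.sum((y - X @ beta) ** 2)
            if err < best_err:
                best_err, best_j = err, j
        selected.append(best_j)
    return selected
```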
Orthogonal Matching Pursuit (OMP) (`_find_best_models_omp`)¶
Methodology¶
OMP is a more robust variant of the greedy approach. It selects each new feature by its correlation with the current residual and then re-solves the least-squares problem over all selected features at every step.
- Initialize the residual \(\mathbf{r}_0 = \mathbf{y}\) and an empty feature set \(\mathcal{S}_0 = \emptyset\).
- For each step \(k\) from 1 to \(D\):
- Find the feature \(\Phi_k^*\) that's most correlated with the current residual \(\mathbf{r}_{k-1}\). $$ \Phi_k^* = \underset{\Phi_j \notin \mathcal{S}_{k-1}}{\text{argmax}} \left| \langle \mathbf{r}_{k-1}, \Phi_j \rangle \right| $$
- Add this feature to the active set: \(\mathcal{S}_k = \mathcal{S}_{k-1} \cup \{\Phi_k^*\}\).
- Solve a new least-squares problem using all features in the current set \(\mathcal{S}_k\) to get updated coefficients \(\beta_k\). $$ \beta_k = \Phi_{\mathcal{S}_k}^\dagger \mathbf{y} \quad \text{(least-squares solution / pseudo-inverse)} $$
- Update the residual for the next iteration: \(\mathbf{r}_k = \mathbf{y} - \Phi_{\mathcal{S}_k} \beta_k\).
Application in Symbolic Regression¶
OMP provides a more stable solution than a simple greedy search. By re-fitting the entire model at each step, it better handles correlations between symbolic features, often leading to a more physically meaningful final equation.
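The core loop can be sketched in a few lines of NumPy (hypothetical function; assumes the columns of `Phi` have been standardized so that the correlations in step 1 are comparable):

```python
import numpy as np

def omp(Phi, y, D):
    """Orthogonal Matching Pursuit: correlate, select, then refit the whole active set."""
    residual = y.copy()
    support, beta = [], None
    for _ in range(D):
        corr = np.abs(Phi.T @ residual)     # correlation of each feature with the residual
        corr[support] = -np.inf             # never reselect an active feature
        support.append(int(np.argmax(corr)))
        X_S = Phi[:, support]
        beta, *_ = np.linalg.lstsq(X_S, y, rcond=None)   # refit all selected coefficients
        residual = y - X_S @ beta
    return support, beta
```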
SISSO++ (Breadth-First QR Search) (`_find_best_models_sisso_pp`)¶
Methodology¶
This highly efficient breadth-first search avoids the slow, combinatorial nature of brute-force search by using clever linear algebra updates. It keeps a "beam" (a list) of the best models at each dimension.
- Dimension 1: It evaluates all 1D models and keeps the top \(N_{\text{beam}}\) models with the lowest Residual Sum of Squares (RSS).
- Dimension 2: For each of the top models from step 1, it tries adding every other available feature.
- The QR Update: Its speed comes from using QR decomposition. For a feature set \(X_D = Q_D R_D\), the RSS is easy to calculate: $$ \text{RSS}_D = \|\mathbf{y}\|_2^2 - \|Q_D^T \mathbf{y}\|_2^2 $$ To check a new model with an added feature \(\mathbf{x}_{\text{new}}\), it doesn't refit. Instead, it computes the component of \(\mathbf{x}_{\text{new}}\) that is orthogonal to the space spanned by \(Q_D\): $$ \mathbf{w}_{\text{new}} = \mathbf{x}_{\text{new}} - Q_D Q_D^T \mathbf{x}_{\text{new}} $$ The reduction in RSS from adding the new feature is then calculated directly: $$ \Delta \text{RSS} = \frac{(\mathbf{r}_D^T \mathbf{w}_{\text{new}})^2}{\|\mathbf{w}_{\text{new}}\|_2^2} $$ where \(\mathbf{r}_D = \mathbf{y} - Q_D Q_D^T \mathbf{y}\) is the residual of the current model.
- A new beam of top \(N_{\text{beam}}\) 2D models is created, and the process repeats up to the maximum desired dimension, \(D_{\text{max}}\).
Application in Symbolic Regression¶
SISSO++ explores a much wider range of feature combinations than a greedy search without the massive cost of brute-force. It's excellent at finding high-quality, low-dimensional symbolic models that a simple greedy approach might miss.
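The key trick, the rank-one RSS update, takes only a few lines. Below is a minimal NumPy sketch (hypothetical helper; `Q_D` is the orthonormal factor of the current model and `residual` its residual vector):

```python
import numpy as np

def rss_reduction(Q_D, residual, x_new, eps=1e-12):
    """Delta RSS from adding x_new to a model with orthonormal basis Q_D, without refitting."""
    # Component of the candidate feature orthogonal to the current model space.
    w = x_new - Q_D @ (Q_D.T @ x_new)
    w_norm2 = float(w @ w)
    if w_norm2 < eps:                # x_new is (numerically) already in span(Q_D)
        return 0.0
    return float((residual @ w) ** 2 / w_norm2)
```

A beam step scores every unused feature against every model in the beam with this update and keeps the \(N_{\text{beam}}\) expansions with the largest reduction.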
Random Mutation Hill Climbing (RMHC) & Simulated Annealing (SA)¶
Methodology¶
These are stochastic search algorithms that explore the solution space by making random changes to a potential model.
- RMHC (`_find_best_models_rmhc`):
  - Start with a decent initial model (e.g., one found by a greedy search).
  - In each iteration, randomly swap one feature in the model with one feature out of the model.
  - If the new model is better (lower error), keep the change.
  - If it's worse, discard the change.
  - Repeat many times, restarting periodically to avoid getting stuck in local optima.
- SA (`_find_best_models_sa`): Simulated Annealing is similar to RMHC but with a key difference: it can accept worse solutions to escape local optima. The probability of accepting a worse solution (with error increase \(\Delta E\)) is: $$ P(\text{accept}) = e^{-\frac{\Delta E}{T}} $$ Here, \(T\) is a "temperature" that starts high (allowing lots of bad moves) and is slowly lowered (the "annealing schedule"), forcing the search to converge on a good solution.
Application in Symbolic Regression¶
These methods are great for refining a model found by a faster algorithm like a greedy search. They can escape local optima that trap simpler methods, potentially finding a feature that, while not great on its own, unlocks a much better overall model when combined with others.
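A compact sketch of the annealing loop (hypothetical function and parameter names; RMHC corresponds to never accepting a worse swap):

```python
import numpy as np

def simulated_annealing(Phi, y, D, n_steps=2000, T0=1.0, cooling=0.999, seed=0):
    """Search over D-feature subsets by random single-feature swaps with SA acceptance."""
    rng = np.random.default_rng(seed)
    n, M = Phi.shape

    def rss(idx):
        X = Phi[:, sorted(idx)]
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return float(np.sum((y - X @ beta) ** 2))

    current = set(rng.choice(M, size=D, replace=False).tolist())
    current_err, T = rss(current), T0
    for _ in range(n_steps):
        # Propose a move: swap one feature inside the model for one outside it.
        out_f = rng.choice(sorted(current))
        in_f = rng.choice([j for j in range(M) if j not in current])
        candidate = (current - {out_f}) | {in_f}
        delta = rss(candidate) - current_err
        # Always accept improvements; accept worse moves with probability exp(-delta/T).
        if delta < 0 or rng.random() < np.exp(-delta / T):
            current, current_err = candidate, current_err + delta
        T *= cooling                 # annealing schedule: slowly lower the temperature
    return sorted(current), current_err
```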
Mixed-Integer Quadratic Programming (MIQP) (`_find_best_models_miqp`)¶
Methodology¶
This method provides a mathematically rigorous, exact solution to the \(L_0\) problem by reformulating it for specialized solvers like Gurobi.
- Variables: It defines two types of variables:
- Continuous variables (\(\beta_j\)) for the coefficients.
- Binary variables (\(z_j \in \{0, 1\}\)), where \(z_j = 1\) means feature \(\Phi_j\) is in the model.
- Objective Function: Minimize the RSS, which is a quadratic function of \(\beta\). $$ \min_{\beta, \mathbf{z}} \quad (\mathbf{y} - \Phi\beta)^T (\mathbf{y} - \Phi\beta) \quad \equiv \quad \min_{\beta, \mathbf{z}} \quad \beta^T (\Phi^T\Phi) \beta - 2(\mathbf{y}^T\Phi)\beta + \mathbf{y}^T \mathbf{y} $$
- Constraints:
- Cardinality Constraint: This directly enforces the \(L_0\) norm, ensuring exactly \(D\) features are chosen: $$ \sum_{j=1}^{M} z_j = D $$
- Big-M Constraints: These link the binary and continuous variables. They ensure that if a feature isn't selected (\(z_j=0\)), its coefficient must also be zero (\(\beta_j=0\)). $$ -M_{\text{big}} \cdot z_j \le \beta_j \le M_{\text{big}} \cdot z_j \quad \text{for } j=1, \dots, M $$ Here, \(M_{\text{big}}\) is a large constant (unrelated to the feature count \(M\)) chosen to exceed any plausible coefficient magnitude.
Application in Symbolic Regression¶
When it's computationally feasible, MIQP is the gold standard. It guarantees that the selected combination of \(D\) symbolic features is provably the best one possible from the candidate pool.
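When a MIQP solver is available, the formulation above maps almost directly onto code. The sketch below assumes Gurobi's Python interface (`gurobipy`) and its matrix API; the function name and the big-M value are illustrative, not DISCOVER's actual implementation:

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def miqp_subset_selection(Phi, y, D, big_m=1e3):
    """Exact L0-constrained least squares as a MIQP (illustrative sketch)."""
    n, M = Phi.shape
    Q = Phi.T @ Phi                  # quadratic part of the RSS
    c = Phi.T @ y                    # linear part of the RSS

    m = gp.Model("l0_regression")
    beta = m.addMVar(M, lb=-GRB.INFINITY, name="beta")   # continuous coefficients
    z = m.addMVar(M, vtype=GRB.BINARY, name="z")         # feature indicator variables

    # RSS = beta^T (Phi^T Phi) beta - 2 (Phi^T y)^T beta + const  (constant dropped)
    m.setObjective(beta @ Q @ beta - 2 * c @ beta, GRB.MINIMIZE)
    m.addConstr(z.sum() == D, name="cardinality")        # exactly D features selected
    m.addConstr(beta <= big_m * z, name="bigM_upper")    # z_j = 0  =>  beta_j = 0
    m.addConstr(beta >= -big_m * z, name="bigM_lower")
    m.optimize()

    selected = np.flatnonzero(z.X > 0.5)
    return selected, beta.X[selected]
```

The choice of `big_m` matters: it must exceed the magnitude of every optimal coefficient, but values that are far too large weaken the relaxation and slow the solver.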
Geometric Greedy Search (`_find_best_models_ch_greedy`)¶
Methodology¶
This is a specialized greedy search used for classification tasks. Instead of minimizing RMSE, its goal is to maximize the geometric separation between classes.
- At each step, it adds the new feature that minimizes the overlap between the convex hulls of the different classes in the feature space.
- The overlap is estimated using Monte Carlo integration:
- A large number of random points are sampled within the data's bounding box.
- It checks how many class hulls each random point falls inside.
- The overlap score is the fraction of points that are inside more than one hull. $$ \text{Overlap Fraction} = \frac{\text{Points in } \ge 2 \text{ hulls}}{\text{Points in } \ge 1 \text{ hull}} $$
- The search greedily selects the feature that produces the lowest overlap fraction at each step.
Application in Symbolic Regression¶
This method finds a set of symbolic features that create a new space where different classes are as geometrically separable as possible. The result isn't a single equation but a set of transformations (the features) that define this optimal classification space.
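The overlap estimate itself can be sketched with SciPy (hypothetical helper; assumes at least two selected features so that the Delaunay-based point-in-hull test is defined):

```python
import numpy as np
from scipy.spatial import Delaunay

def hull_overlap_fraction(class_points, n_samples=20000, seed=0):
    """Monte Carlo estimate of convex-hull overlap.

    class_points: list of (n_i, d) arrays, one per class, in the selected feature space.
    Returns (# sampled points inside >= 2 hulls) / (# inside >= 1 hull).
    """
    rng = np.random.default_rng(seed)
    all_pts = np.vstack(class_points)
    lo, hi = all_pts.min(axis=0), all_pts.max(axis=0)
    samples = rng.uniform(lo, hi, size=(n_samples, all_pts.shape[1]))  # bounding box

    # Delaunay triangulation gives a fast point-in-convex-hull test:
    # find_simplex(p) >= 0 means p lies inside the hull of that class.
    hulls = [Delaunay(pts) for pts in class_points]
    inside_counts = sum((h.find_simplex(samples) >= 0).astype(int) for h in hulls)

    in_any = int(np.count_nonzero(inside_counts >= 1))
    in_overlap = int(np.count_nonzero(inside_counts >= 2))
    return in_overlap / in_any if in_any else 0.0
```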
Post-Search: Non-Linear Refinement¶
After a linear model is found, an optional step can refine it by introducing non-linear parameters.
- Methodology: It takes a discovered model, like \(y \approx \beta_0 + \beta_1 \Phi_1 + \beta_2 \Phi_2\), and checks if replacing a feature \(\Phi_i\) with a parameterized version, like \(e^{-p \Phi_i}\) or \(\Phi_i^p\), improves the fit. It uses a numerical optimization algorithm (L-BFGS-B) to find the best values for both the linear coefficients (\(\beta_i\)) and the new non-linear parameters (\(p\)).
- Application in Symbolic Regression: This allows DISCOVER to find even more complex and accurate formulas that aren't just simple linear combinations of the base features, such as discovering optimal exponents or decay constants in the final equation.
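For instance, refining an exponent on one feature of a two-feature model could be sketched as follows (hypothetical helper using SciPy's L-BFGS-B; the sign/abs trick keeps non-integer powers defined for negative feature values, and the parameter bounds are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def refine_exponent(phi1, phi2, y):
    """Fit y ~ b0 + b1*phi1 + b2*sign(phi2)*|phi2|**p with L-BFGS-B over (b0, b1, b2, p)."""
    def loss(theta):
        b0, b1, b2, p = theta
        pred = b0 + b1 * phi1 + b2 * np.sign(phi2) * np.abs(phi2) ** p
        return float(np.sum((y - pred) ** 2))

    theta0 = np.array([0.0, 1.0, 1.0, 1.0])   # start from the plain linear model (p = 1)
    result = minimize(loss, theta0, method="L-BFGS-B",
                      bounds=[(None, None)] * 3 + [(0.1, 4.0)])
    return result.x, result.fun               # refined parameters and final RSS
```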