discover

Getting Started: A Step-by-Step Tutorial

This tutorial guides you through your first analysis with DISCOVER: installation, data preparation, configuration, execution, and interpretation of the output. The running example uses config.json.

Step 1: Installation

Make sure Python 3.8 or later is installed. Using a virtual environment is strongly recommended.

  1. Clone the Repository:
    git clone <your-repository-url>
    cd discover-project
    
  2. Create and Activate a Virtual Environment:
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    
  3. Install Core Dependencies:
    pip install pandas numpy sympy scikit-learn matplotlib seaborn joblib
    
  4. Install Optional Dependencies:
    • For GPU Acceleration (NVIDIA):
      # Replace XX with your CUDA version (e.g., 118 for CUDA 11.8)
      pip install cupy-cudaXX
      
    • For GPU Acceleration (Apple Silicon):
      pip install torch torchvision torchaudio
      
    • For Unit-Awareness:
      pip install pint
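
Once the packages are installed, a quick import check can confirm the environment is complete. The following is a minimal sketch, not part of DISCOVER itself: the helper name `missing_modules` is hypothetical, and the module lists simply mirror the pip commands above (note that scikit-learn is imported as `sklearn` and the CuPy wheels as `cupy`).

```python
from importlib.util import find_spec

# Import names for the packages installed in Step 1.
CORE = ["pandas", "numpy", "sympy", "sklearn", "matplotlib", "seaborn", "joblib"]
OPTIONAL = ["cupy", "torch", "pint"]


def missing_modules(names):
    """Return the top-level modules from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


print("Missing core modules:", missing_modules(CORE) or "none")
print("Missing optional modules:", missing_modules(OPTIONAL) or "none")
```

Any name listed under "Missing core modules" should be installed before continuing; missing optional modules only matter if you plan to use GPU acceleration or unit-awareness.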
      

Step 2: Prepare Your Data

DISCOVER expects your data in a single CSV file.

In this example, sample_dataset.csv contains feature columns such as R_A (ionic radius) and ELN_A (electronegativity), along with the target property E_a (migration barrier).
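
Before running an analysis, it can help to verify that the CSV has the expected shape. The sketch below uses only the standard library and assumes that every feature and target cell should be numeric, which this tutorial does not state explicitly. The helper name `check_dataset` is hypothetical, and the sample rows are synthetic placeholders, not real measurements.

```python
import csv
import io


def check_dataset(csv_text, property_key, non_feature_cols=()):
    """Return the feature column names after basic sanity checks."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("dataset has no rows")
    columns = list(rows[0].keys())
    if property_key not in columns:
        raise ValueError(f"target column {property_key!r} not found")
    features = [c for c in columns
                if c != property_key and c not in non_feature_cols]
    # Assumption: feature and target values must all parse as floats.
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for col in features + [property_key]:
            try:
                float(row[col])
            except (TypeError, ValueError):
                raise ValueError(
                    f"line {line_no}, column {col!r}: not numeric: {row[col]!r}"
                )
    return features


# Synthetic example data using the column names from this tutorial.
sample = "Name,R_A,ELN_A,E_a\nA1,1.0,2.0,0.5\nA2,1.2,1.8,0.7\n"
print(check_dataset(sample, "E_a", non_feature_cols=("Name",)))
```

For a file on disk, read its contents with `open(path).read()` and pass the text to the same function.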

Step 3: Prepare the Analysis

The workflow is managed by config.json. Let’s break down the critical parts for an initial run.

{
    // 1. DATA & I/O
    "data_file": "manuel_data.csv",
    "property_key": "E_a",
    "non_feature_cols": ["Name", "Delta_E", "."],
    "workdir": "manuel_data",

    // 2. FEATURE CONSTRUCTION
    "depth": 3,
    "op_rules": [
      {"op": "add"}, {"op": "sub"}, {"op": "mul"}, {"op": "div"},
      {"op": "sqrt"}, {"op": "sq"}
    ],

    // 3. SEARCH & SELECTION
    "max_D": 2,
    "search_strategy": "sa",
    "selection_method": "bic",

    // 4. MODEL & VALIDATION
    "task_type": "regression",
    "fix_intercept": false,

    // 5. COMPUTATIONAL
    "n_jobs": -1,
    "device": "cpu",
    "random_state": 42
}
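
Python's built-in `json` module rejects `//` comments, so a commented config like this one has to be cleaned before it can be parsed by other tools. How DISCOVER itself reads the file is not shown in this tutorial; the sketch below is one possible approach, assuming comments always occupy their own line. The `load_config` helper and the `REQUIRED_KEYS` list are hypothetical.

```python
import json
import re

# Assumed minimum set of keys for an initial run (hypothetical).
REQUIRED_KEYS = ("data_file", "property_key", "workdir")


def load_config(text):
    """Parse JSON text after removing full-line // comments."""
    # Only lines that start with // (after indentation) are removed,
    # so string values containing "//" are left untouched.
    cleaned = re.sub(r"^[ \t]*//.*$", "", text, flags=re.MULTILINE)
    config = json.loads(cleaned)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"missing config keys: {missing}")
    return config


example = """
{
    // 1. DATA & I/O
    "data_file": "manuel_data.csv",
    "property_key": "E_a",
    "workdir": "manuel_data",
    // 3. SEARCH & SELECTION
    "max_D": 2
}
"""
config = load_config(example)
print(config["property_key"], config["max_D"])
```

Checking the file this way before a long run catches misplaced braces and missing keys early.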

Step 4: Run the Analysis

With your configuration set, run the run_discover.py script from the command line:

python run_discover.py config.json

DISCOVER then prints live progress as it works through the following stages:

  1. Loading data and config.
  2. Creating features iteratively, reporting at each depth.
  3. Printing the top-screened features at each iteration.
  4. Executing the selected search strategy for each dimension D.
  5. Evaluating the models and selecting the best dimension.
  6. Printing the final summary report to the console.
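
To illustrate the dimension selection in stage 5 when `selection_method` is `"bic"`, the sketch below scores candidate models with the standard Gaussian-likelihood form of the Bayesian Information Criterion, BIC = n ln(RSS/n) + k ln(n), and picks the lowest score. DISCOVER's exact formula is not given in this tutorial, so treat this as an assumption. The function names and the residual values are hypothetical and purely illustrative.

```python
import math


def bic(n_samples, rss, n_params):
    """Standard BIC for a least-squares fit (lower is better)."""
    return n_samples * math.log(rss / n_samples) + n_params * math.log(n_samples)


def select_dimension(n_samples, rss_by_dim):
    """Score each dimension D and return (best D, all scores)."""
    # Assumption: a D-dimensional model has D coefficients plus an intercept.
    scores = {d: bic(n_samples, rss, d + 1) for d, rss in rss_by_dim.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Made-up residual sums of squares for D = 1, 2, 3 on 50 samples.
rss_by_dim = {1: 10.0, 2: 4.0, 3: 3.9}
best_dim, scores = select_dimension(50, rss_by_dim)
for d in sorted(scores):
    print(f"D={d}: BIC={scores[d]:.2f}")
print("Selected dimension:", best_dim)
```

In this made-up example the third term barely reduces the residual, so the penalty for the extra parameter outweighs the gain and D=2 is selected.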

Step 5: Understanding the Output

All output is saved to the directory specified by workdir (e.g., manuel_data/).

Key Console Output: The console output ends with the summary report, which is the most important result:

========================= FINAL MODEL REPORT =========================
DISCOVER Summary
=================
TASK_TYPE: regression
MODEL_SELECTION_METHOD: BIC
SELECTED_DIMENSION: 2
SEARCH_STRATEGY: SA
.

******* BEST MODEL (D=2) ********
Score BIC Score: -123.456
R2/Accuracy on full data: 0.9512

============================================================
Final Symbolic Model: regression
============================================================
P = -0.1234 + 0.5678*sqrt(R_A/R_B) + 0.9012*Abs(ELN_A - ELN_B)
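
Because the final model is a closed-form expression, it can be applied to new data with a few lines of code. The sketch below hard-codes the example expression from the report above; the function name `predict` and the input values are hypothetical, chosen only to show the arithmetic.

```python
import math


def predict(r_a, r_b, eln_a, eln_b):
    """Evaluate the example model from the report above."""
    return (-0.1234
            + 0.5678 * math.sqrt(r_a / r_b)
            + 0.9012 * abs(eln_a - eln_b))


# Illustrative inputs: sqrt(4/1) = 2 and |1 - 3| = 2, so the result is
# -0.1234 + 0.5678*2 + 0.9012*2.
value = predict(r_a=4.0, r_b=1.0, eln_a=1.0, eln_b=3.0)
print(f"Predicted property: {value:.4f}")
```

For your own run, replace the coefficients and descriptors with those printed in your report.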

Important Files in the workdir:

What’s Next?

Congratulations! You have run your first DISCOVER analysis and now have an interpretable, symbolic model that describes your data.

To explore more advanced features, see our How-To Guides, where you can learn how to: