Auto Evaluation is currently in beta. Features and scoring methodologies may evolve based on feedback.
Auto Evaluation provides an LLM-driven search quality evaluation pipeline that automatically tests your search configuration across hundreds of queries and multiple customer profiles. This helps you identify search quality issues, track improvements over time, and make data-driven decisions about search tuning.

What you can do

  • Run automated evaluations using your top searches from analytics or seeded test data.
  • Test search quality across multiple customer profiles simultaneously.
  • View detailed scoring metrics including relevance, intent, attribute compliance, brand handling, and result diversity.
  • Compare evaluation runs to track improvements or regressions over time.
  • Export results to CSV or JSON for further analysis.
  • Re-run evaluations with the same queries or fresh data.

Start a new evaluation

  1. Go to Evaluate and select Auto Evaluation.
  2. Configure your evaluation parameters:
    • Query Source: Choose “Top Searches (Analytics)” to use real search queries from your store, or “Seeded Data” to use pre-configured test queries.
    • Date Range: Select how many days of search data to analyze (1-90 days).
    • Query Limit: Set the maximum number of queries to evaluate (10-1000).
    • Profiles per Query: Choose how many customer profiles to test each query against (1-10).
  3. Optionally expand “Search Tuning Parameters” to customize search behavior for this evaluation.
  4. Click Start Evaluation.
The evaluation runs in the background and results appear as they are processed. You can navigate away and return later to view completed results.
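
If it helps to see the run parameters together, the sketch below captures one valid combination along with the allowed ranges from step 2. The field names are illustrative only and are not part of any documented API.

```python
# Illustrative only: these field names are not a documented API.
# They mirror the parameters described in step 2 above.
evaluation_config = {
    "query_source": "top_searches",   # or "seeded_data"
    "date_range_days": 30,            # allowed range: 1-90
    "query_limit": 200,               # allowed range: 10-1000
    "profiles_per_query": 3,          # allowed range: 1-10
}

# Sanity-check the documented ranges before starting a run.
assert 1 <= evaluation_config["date_range_days"] <= 90
assert 10 <= evaluation_config["query_limit"] <= 1000
assert 1 <= evaluation_config["profiles_per_query"] <= 10
```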

Search tuning parameters

When starting an evaluation, you can customize search parameters to test different configurations:
  • Textual Weight: Controls the influence of text-based matching (0-1). Higher values prioritize keyword and semantic text matches.
  • Visual Weight: Controls the influence of image similarity (0-1). Higher values prioritize visual similarity in results.
  • Minimum Match Score: Sets the minimum relevance threshold for results to be included (0-1).
  • Top K Factor: Limits the number of vector matches considered during semantic aggregation.
  • Multiple Factor: A multiplier applied to the search limit when retrieving candidates, so more products are considered before final ranking.
  • Product Filter Prompt: Natural language instructions for filtering products when generating evaluation profiles. Use this to exclude certain product types like memberships, gift cards, or warranty products from the evaluation.
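
To make the ranges concrete, the sketch below shows one possible combination of these parameters. The key names and values are illustrative only, not a documented API; they simply mirror the fields listed above.

```python
# Illustrative only: key names are not a documented API; they mirror the
# "Search Tuning Parameters" fields described above. Values are examples.
tuning_params = {
    "textual_weight": 0.7,        # 0-1: emphasis on keyword and semantic text matches
    "visual_weight": 0.3,         # 0-1: emphasis on image similarity
    "minimum_match_score": 0.4,   # 0-1: relevance threshold for inclusion
    "top_k_factor": 50,           # example cap on vector matches during semantic aggregation
    "multiple_factor": 3,         # example multiplier applied to the search limit for candidate retrieval
    "product_filter_prompt": (
        "Exclude memberships, gift cards, and warranty products "
        "from the evaluation profiles."
    ),
}
```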

Understanding scores

Each query evaluation produces multiple scores that measure different aspects of search quality:

Overall Score

A weighted combination of all individual scores, providing a single metric for search quality. This is the primary metric for tracking improvements over time.
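
The exact weights are internal to the evaluation pipeline, but the combination works like a standard weighted average. The sketch below uses made-up weights and scores purely to show the arithmetic; it does not reflect the actual weighting.

```python
# Illustrative weights only; the real pipeline's weights are not documented here.
weights = {
    "relevance": 0.30,
    "intent": 0.25,
    "attribute": 0.20,
    "brand": 0.15,
    "diversity": 0.10,
}

# Example per-query scores for the five individual metrics.
scores = {
    "relevance": 0.82,
    "intent": 0.75,
    "attribute": 0.90,
    "brand": 0.60,
    "diversity": 0.70,
}

# Weighted combination of the individual scores into a single overall score.
overall = sum(weights[name] * scores[name] for name in weights)
print(f"{overall:.2f}")  # ~0.77
```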

Relevance Score

Measures how well the returned products match the search query. High relevance scores indicate that products in the results are semantically related to what the customer searched for.

Intent Score

Evaluates how well the results satisfy the likely shopping intent behind the query. This considers whether the results help the customer accomplish their goal, not just match keywords.

Attribute Score

Assesses whether product attributes in the results comply with any explicit or implicit attribute requirements in the query. For example, a search for “red dress” should return products where color attributes match.

Brand Score

Measures how well brand-related queries are handled. This evaluates whether brand searches return the correct brand products and whether brand diversity is appropriate for non-brand queries.

Diversity Score

Evaluates the variety of results returned. Good diversity means showing different product types, styles, or options rather than repetitive similar items.

View evaluation results

After an evaluation completes, the results page shows:
  • Summary metrics: Average scores across all evaluated queries.
  • Score distribution: A histogram showing how scores are distributed across all results.
  • Metrics by bucket: Performance breakdown by query classification (e.g., brand queries, category queries, attribute queries).
  • Metrics by profile type: Performance breakdown by customer profile type.
  • Lowest scoring queries: Queries that may need attention, sorted by overall score.
  • Highest scoring queries: Your best performing queries.
  • Detailed results table: Browse individual query results with filtering and sorting options.
Click any query result to view detailed information, including the LLM judge’s rationale for the scores and the actual products returned.

Compare evaluation runs

  1. Go to Evaluate and select Auto Evaluation.
  2. Click Compare Runs.
  3. Select two completed evaluation runs to compare.
  4. Review the comparison showing:
    • Overall score changes between runs.
    • Number of improved, regressed, and unchanged queries.
    • Top improvements and top regressions with score deltas.
Comparing runs helps you understand the impact of search configuration changes and identify which queries improved or regressed.
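
You can also perform roughly the same comparison outside the UI using two JSON exports. The sketch below assumes a hypothetical export layout (a list of per-query records with "query" and "overall_score" fields); adjust the field names to match your actual export.

```python
import json

# Hypothetical export layout: a list of per-query records such as
# {"query": "red dress", "overall_score": 0.81, ...}. Adjust field names
# to match the actual JSON export.
def load_scores(path):
    with open(path) as f:
        return {row["query"]: row["overall_score"] for row in json.load(f)}

baseline = load_scores("run_a.json")
candidate = load_scores("run_b.json")

# Score deltas for queries present in both runs.
deltas = {
    q: candidate[q] - baseline[q]
    for q in baseline.keys() & candidate.keys()
}

# Count improved, regressed, and unchanged queries (0.01 is an arbitrary threshold).
improved = sum(1 for d in deltas.values() if d > 0.01)
regressed = sum(1 for d in deltas.values() if d < -0.01)
unchanged = len(deltas) - improved - regressed

top_improvements = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:5]
top_regressions = sorted(deltas.items(), key=lambda kv: kv[1])[:5]

print(improved, regressed, unchanged)
print(top_improvements)
print(top_regressions)
```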

Re-run an evaluation

You can re-run a completed evaluation in two modes:
  • Fresh Data: Runs a new evaluation with the same parameters but fetches fresh query data and generates new profiles.
  • Reuse Existing Data: Runs the evaluation using the exact same queries and profiles from the original run, allowing you to test configuration changes with controlled variables.
To re-run an evaluation, open a completed evaluation and select Re-run Evaluation from the actions menu.

Export results

Export evaluation results for further analysis or reporting:
  • CSV Export: Downloads a spreadsheet with all query results and scores.
  • JSON Export: Downloads structured data including full result details and metadata.
To export, open a completed evaluation and select the export format from the actions menu.
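
As a quick example of further analysis, the sketch below loads a CSV export with pandas and averages overall scores per query bucket. The column names ("bucket", "overall_score") are assumptions about the export layout, not documented fields; rename them to match the header row of your file.

```python
import pandas as pd

# Assumed column names ("bucket", "overall_score"); check the header row of
# your actual CSV export and rename accordingly.
results = pd.read_csv("auto_evaluation_results.csv")

# Average overall score per query classification bucket,
# worst-performing buckets first.
by_bucket = (
    results.groupby("bucket")["overall_score"]
    .agg(["mean", "count"])
    .sort_values("mean")
)
print(by_bucket)
```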

Notes

  • Evaluations use LLM-based scoring, which provides nuanced quality assessment but may have some variability between runs.
  • The “Top Searches” source requires analytics data collection to be enabled for your store.
  • Large evaluations (many queries with many profiles) may take several minutes to complete.
  • Evaluation results are stored and can be accessed at any time from the Auto Evaluation page.