Auto Evaluation is currently in beta. Features and scoring methodologies may evolve based on feedback.
What you can do
- Run automated evaluations using your top searches from analytics or seeded test data.
- Test search quality across multiple customer profiles simultaneously.
- View detailed scoring metrics including relevance, intent, attribute compliance, brand handling, and result diversity.
- Compare evaluation runs to track improvements or regressions over time.
- Export results to CSV or JSON for further analysis.
- Re-run evaluations with the same queries or fresh data.
Start a new evaluation
- Go to Evaluate and select Auto Evaluation.
- Configure your evaluation parameters (see the sketch after these steps):
  - Query Source: Choose between “Top Searches (Analytics)” to use real search queries from your store, or “Seeded Data” to use pre-configured test queries.
  - Date Range: Select how many days of search data to analyze (1-90 days).
  - Query Limit: Set the maximum number of queries to evaluate (10-1000).
  - Profiles per Query: Choose how many customer profiles to test each query against (1-10).
- Optionally expand “Search Tuning Parameters” to customize search behavior for this evaluation.
- Click Start Evaluation.
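Each parameter above accepts a bounded value. As a rough illustration of those bounds (the dictionary shape and field names below are hypothetical, not the product's actual API), a pre-flight check on a planned configuration might look like this:

```python
# Hypothetical representation of an evaluation configuration.
# Only the documented ranges are real; the field names are illustrative.
config = {
    "query_source": "top_searches",  # or "seeded"
    "date_range_days": 30,           # 1-90
    "query_limit": 200,              # 10-1000
    "profiles_per_query": 3,         # 1-10
}

BOUNDS = {
    "date_range_days": (1, 90),
    "query_limit": (10, 1000),
    "profiles_per_query": (1, 10),
}
for key, (lo, hi) in BOUNDS.items():
    assert lo <= config[key] <= hi, f"{key} must be in [{lo}, {hi}]"
```

Note that total work scales with Query Limit × Profiles per Query: the configuration above would score 200 queries against 3 profiles each, or 600 individual evaluations.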
Search tuning parameters
When starting an evaluation, you can customize search parameters to test different configurations (see the sketch after this list):
- Textual Weight: Controls the influence of text-based matching (0-1). Higher values prioritize keyword and semantic text matches.
- Visual Weight: Controls the influence of image similarity (0-1). Higher values prioritize visual similarity in results.
- Minimum Match Score: Sets the minimum relevance threshold for results to be included (0-1).
- Top K Factor: Limits the number of vector matches considered during semantic aggregation.
- Multiple Factor: Adjusts how the search limit is multiplied for candidate retrieval.
- Product Filter Prompt: Natural language instructions for filtering products when generating evaluation profiles. Use this to exclude certain product types like memberships, gift cards, or warranty products from the evaluation.
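To make the weights concrete, here are two illustrative tuning configurations you might evaluate against each other. The field names are hypothetical; only the 0-1 ranges come from the list above:

```python
# Two hypothetical tuning configurations for an A/B-style comparison.
# All values are within the documented 0-1 ranges.
text_heavy = {
    "textual_weight": 0.8,    # favor keyword/semantic text matches
    "visual_weight": 0.2,
    "min_match_score": 0.5,   # drop weakly relevant candidates
}
visual_heavy = {
    "textual_weight": 0.3,
    "visual_weight": 0.7,     # favor image similarity
    "min_match_score": 0.5,
}
```

Running one evaluation per configuration, ideally in the “Reuse Existing Data” re-run mode described below, keeps queries and profiles fixed so that score differences reflect only the tuning change.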
Understanding scores
Each query evaluation produces multiple scores that measure different aspects of search quality:
Overall Score
A weighted combination of all individual scores, providing a single metric for search quality. This is the primary metric for tracking improvements over time (a toy example of the weighting follows this section).
Relevance Score
Measures how well the returned products match the search query. High relevance scores indicate that products in the results are semantically related to what the customer searched for.
Intent Score
Evaluates how well the results satisfy the likely shopping intent behind the query. This considers whether the results help the customer accomplish their goal, not just match keywords.
Attribute Score
Assesses whether product attributes in the results comply with any explicit or implicit attribute requirements in the query. For example, a search for “red dress” should return products whose color attributes match.
Brand Score
Measures how well brand-related queries are handled. This evaluates whether brand searches return the correct brand’s products and whether brand diversity is appropriate for non-brand queries.
Diversity Score
Evaluates the variety of results returned. Good diversity means showing different product types, styles, or options rather than repeating similar items.
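How the individual scores are weighted into the Overall Score is not documented here, so treat the following as a purely illustrative sketch that assumes equal weights:

```python
# Illustrative only: assumes equal weights for the five component scores.
# The product's actual weighting may differ.
WEIGHTS = {
    "relevance": 0.2,
    "intent": 0.2,
    "attribute": 0.2,
    "brand": 0.2,
    "diversity": 0.2,
}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted combination of component scores, each in 0-1."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

print(overall_score({"relevance": 0.9, "intent": 0.8, "attribute": 1.0,
                     "brand": 0.7, "diversity": 0.6}))  # ≈ 0.8
```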
View evaluation results
After an evaluation completes, the results page shows:
- Summary metrics: Average scores across all evaluated queries.
- Score distribution: A histogram showing how scores are distributed across all results (reproduced in a short sketch after this list).
- Metrics by bucket: Performance breakdown by query classification (e.g., brand queries, category queries, attribute queries).
- Metrics by profile type: Performance breakdown by customer profile type.
- Lowest scoring queries: Queries that may need attention, sorted by overall score.
- Highest scoring queries: Your best performing queries.
- Detailed results table: Browse individual query results with filtering and sorting options.
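The score distribution is easy to reproduce offline as well, for example to compare histograms across evaluations. A minimal sketch, assuming you have extracted a list of 0-1 overall scores from an export (see Export results below):

```python
from collections import Counter

def score_histogram(scores: list[float], buckets: int = 10) -> Counter:
    """Bucket 0-1 scores into equal-width bins."""
    hist = Counter()
    for s in scores:
        # Clamp so a perfect 1.0 lands in the top bucket.
        hist[min(int(s * buckets), buckets - 1)] += 1
    return hist

scores = [0.42, 0.87, 0.91, 0.65, 0.88]
for idx, count in sorted(score_histogram(scores).items()):
    print(f"{idx / 10:.1f}-{(idx + 1) / 10:.1f}: {'#' * count}")
```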
Compare evaluation runs
- Go to Evaluate and select Auto Evaluation.
- Click Compare Runs.
- Select two completed evaluation runs to compare.
- Review the comparison showing (a scripted equivalent appears after these steps):
  - Overall score changes between runs.
  - Number of improved, regressed, and unchanged queries.
  - Top improvements and top regressions with score deltas.
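If you export both runs (see Export results below), the same diff can be scripted. A minimal sketch, assuming each export is a JSON array of rows with hypothetical query and overall_score fields:

```python
import json

def load_scores(path: str) -> dict[str, float]:
    """Map each query to its overall score from an exported JSON run."""
    with open(path) as f:
        return {row["query"]: row["overall_score"] for row in json.load(f)}

def compare_runs(path_a: str, path_b: str, eps: float = 0.01) -> None:
    """Count improved/regressed queries and show the largest deltas."""
    a, b = load_scores(path_a), load_scores(path_b)
    deltas = {q: b[q] - a[q] for q in a.keys() & b.keys()}
    improved = sum(d > eps for d in deltas.values())
    regressed = sum(d < -eps for d in deltas.values())
    print(f"improved={improved} regressed={regressed} "
          f"unchanged={len(deltas) - improved - regressed}")
    ranked = sorted(deltas.items(), key=lambda kv: kv[1])
    print("top regressions:", ranked[:3])
    print("top improvements:", ranked[-3:][::-1])
```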
Re-run an evaluation
You can re-run a completed evaluation in two modes:
- Fresh Data: Runs a new evaluation with the same parameters but fetches fresh query data and generates new profiles.
- Reuse Existing Data: Runs the evaluation using the exact same queries and profiles from the original run, allowing you to test configuration changes with controlled variables.
Export results
Export evaluation results for further analysis or reporting (a loading sketch follows this list):
- CSV Export: Downloads a spreadsheet with all query results and scores.
- JSON Export: Downloads structured data including full result details and metadata.
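For example, the CSV export drops straight into a short script for surfacing the queries that need attention, mirroring the lowest-scoring list on the results page. The column names here are assumptions about the export layout, not documented names:

```python
import csv

def lowest_scoring(path: str, n: int = 10) -> list[tuple[str, float]]:
    """Return the n lowest-scoring queries from a CSV export.

    Assumes (hypothetically) columns named "query" and "overall_score".
    """
    with open(path, newline="") as f:
        rows = [(r["query"], float(r["overall_score"]))
                for r in csv.DictReader(f)]
    return sorted(rows, key=lambda r: r[1])[:n]

for query, score in lowest_scoring("evaluation_results.csv"):
    print(f"{score:.2f}  {query}")
```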
Notes
- Evaluations use LLM-based scoring, which provides nuanced quality assessment but may have some variability between runs.
- The “Top Searches” source requires analytics data collection to be enabled for your store.
- Large evaluations (many queries with many profiles) may take several minutes to complete.
- Evaluation results are stored and can be accessed at any time from the Auto Evaluation page.