How to build a customer feedback tagging system

TL;DR

You can build a working feedback tagging system in a single afternoon using a spreadsheet, a defined taxonomy, and an optional LLM API call to automate the repetitive parts. The full setup takes two to four hours for your first 200 responses. You need raw feedback data, Google Sheets or Airtable, and, if you want the automation layer, an OpenAI API key.

What You Need Before You Start

  • Raw feedback already collected: NPS verbatims, support tickets, app store reviews, user interview notes, or survey open-ends. At least 50 responses to make tagging worthwhile.
  • A spreadsheet tool: Google Sheets (free) or Airtable (free tier supports up to 1,000 records per base).
  • Optional: Python 3.10+ with pandas and openai installed if you want to automate tagging at scale.
  • Optional: An OpenAI API key with access to gpt-4o-mini (roughly $0.15 per 1M input tokens as of 2026, well within free credit for a first run).
  • At least one other stakeholder who touches the feedback: a CS lead, designer, or fellow PM. You need a second pair of eyes on your taxonomy before you tag anything.
  • A rough sense of your product’s core feature areas. You do not need a finalized list yet.

Step 1: Export and Centralize All Your Raw Feedback

Before you can tag anything, everything needs to live in one place. Pull your feedback from every source: Intercom export as CSV, Typeform results download, App Store review scraper, Zendesk ticket export.

Open a new Google Sheet and create a single tab called raw_feedback. Add these columns in order: response_id, source, date, verbatim, respondent_segment (if you have it), and rating (for NPS or CSAT scores).

Paste all your rows in. Use a simple formula to generate sequential IDs if your export does not include them:

=ROW()-1

Drop that in A2 and fill down. It gives you 1, 2, 3... without gaps.

Keep the verbatim text exactly as written. Do not clean it up or correct typos. You want the signal preserved. If a response is in another language, add a language column and flag it now.

You should now see one flat table with every piece of feedback in the verbatim column, a unique ID on each row, and consistent column headers across all sources.
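If your exports are CSV files, the same consolidation can be scripted instead of pasted by hand. The following is a minimal sketch using only the Python standard library. The centralize function, the sample rows, and the assumption that every export already uses the column names date, verbatim, respondent_segment, and rating are illustrative; real exports will likely need their columns renamed first.

```python
import csv
import io

def centralize(exports):
    """Merge per-source rows into one flat table with sequential IDs.

    `exports` maps a source name to an iterable of dicts,
    for example rows produced by csv.DictReader.
    """
    rows = []
    for source, records in exports.items():
        for record in records:
            rows.append({
                "response_id": len(rows) + 1,  # 1, 2, 3... without gaps
                "source": source,
                "date": record.get("date", ""),
                "verbatim": record.get("verbatim", ""),  # kept exactly as written
                "respondent_segment": record.get("respondent_segment", ""),
                "rating": record.get("rating", ""),
            })
    return rows

# Hypothetical sample exports, inlined so the sketch runs without files.
intercom_csv = io.StringIO(
    "date,verbatim,rating\n2026-01-05,It crashed twice this week,3\n"
)
typeform_csv = io.StringIO(
    "date,verbatim\n2026-01-06,I wish I could export to PDF\n"
)

table = centralize({
    "intercom": csv.DictReader(intercom_csv),
    "typeform": csv.DictReader(typeform_csv),
})
for row in table:
    print(row["response_id"], row["source"], row["verbatim"])
```

Missing columns are filled with empty strings rather than dropped, so every source ends up with the same headers.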

Step 2: Define Your Tagging Taxonomy

This step is where most teams go wrong by moving too fast. Your taxonomy is the backbone of the entire system.

Open a second tab called taxonomy. Create three columns: tag_name, definition, and example.

Start by reading through 30 to 50 responses without tagging anything. Just read. Write down the recurring themes you notice: pricing complaints, onboarding confusion, missing features, performance issues, competitor mentions. These become your top-level tags.

Aim for 8 to 15 top-level tags. Fewer than 8 usually means you are merging things that are genuinely different. More than 15 creates inter-rater disagreement. A realistic starter taxonomy for a SaaS product looks like this:

tag_name        | definition                                      | example
onboarding      | confusion during initial setup or first use     | “I couldn’t figure out how to connect my first account”
pricing         | cost, billing, plan complaints or compliments   | “The pro plan is way too expensive for a 2-person team”
performance     | speed, reliability, downtime                    | “It crashed twice this week”
missing_feature | request for something that does not exist       | “I wish I could export to PDF”

Write out every definition. Vague definitions cause tagging drift the moment a second person touches the data.

You should now see a complete taxonomy tab with at minimum 8 tags, each with a written definition and a real verbatim example pulled from your data.
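If you export the taxonomy tab, the completeness rules from this step (8 to 15 tags, each with a definition and an example) can be checked mechanically. This is a rough sketch, not a required part of the workflow. The check_taxonomy function and the sample rows are hypothetical.

```python
def check_taxonomy(taxonomy, min_tags=8, max_tags=15):
    """Return a list of human-readable problems found in the taxonomy rows."""
    problems = []
    if not min_tags <= len(taxonomy) <= max_tags:
        problems.append(
            f"expected {min_tags}-{max_tags} tags, found {len(taxonomy)}"
        )
    seen = set()
    for row in taxonomy:
        name = row.get("tag_name", "").strip()
        if name in seen:
            problems.append(f"duplicate tag: {name}")
        seen.add(name)
        # Every tag needs a written definition and a real verbatim example.
        for field in ("definition", "example"):
            if not row.get(field, "").strip():
                problems.append(f"{name}: missing {field}")
    return problems

# Hypothetical draft taxonomy: too short, and one tag lacks an example.
draft = [
    {"tag_name": "onboarding",
     "definition": "confusion during initial setup or first use",
     "example": "I couldn't figure out how to connect my first account"},
    {"tag_name": "pricing",
     "definition": "cost, billing, plan complaints or compliments",
     "example": ""},
]
for problem in check_taxonomy(draft):
    print(problem)
```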

Step 3: Set Up Your Tagging Sheet

Back on the raw_feedback tab, add a primary_tag column and a secondary_tag column after verbatim. Add a notes column at the end for anything ambiguous.

Create a dropdown validation on primary_tag so taggers cannot type freeform text. In Google Sheets: select the primary_tag column, go to Data > Data validation > Criteria: Dropdown from a range, and point it at taxonomy!A2:A so the tag_name header itself is not offered as an option.

Do the same for secondary_tag. Some feedback genuinely covers two themes. A comment like “the pricing page is confusing” touches both pricing and onboarding. Use secondary_tag for the less dominant theme.

Add an is_actionable column with a Yes/No dropdown. This single boolean will save you hours when you pull reports later, because not all tagged feedback requires a product response.

You should now see your tagging sheet with constrained dropdowns, no freeform entry possible, and a clear column structure ready for human tagging.

Step 4: Tag Your First 100 Responses Manually

Do this yourself. Do not delegate it to a junior team member for the first pass. The goal is not just to label data. It is to stress-test your taxonomy against real responses.

Work in batches of 25. After each batch, check whether any responses felt hard to categorize. If you are stuck on more than 3 responses per batch, your taxonomy needs a new tag or a sharper definition.

Track your time. 100 responses typically take 45 to 90 minutes depending on response length.

When you hit an ambiguous response, use the notes column and tag it anyway with your best guess. You will reconcile these in the next step.

You should now see 100 tagged rows, with a handful of notes flagging edge cases, and an emerging sense of which tags are overrepresented versus sparse.
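The "more than 3 stuck responses per batch" rule can be applied to the notes column after the fact. The sketch below assumes that a non-empty note means the row was hard to categorize. That convention and the batches_needing_review function are illustrative.

```python
def batches_needing_review(notes, batch_size=25, max_stuck=3):
    """Return 1-based batch numbers where more than max_stuck rows carry a note."""
    flagged = []
    for start in range(0, len(notes), batch_size):
        batch = notes[start:start + batch_size]
        # A non-empty note is treated as "this row was hard to tag".
        stuck = sum(1 for note in batch if (note or "").strip())
        if stuck > max_stuck:
            flagged.append(start // batch_size + 1)
    return flagged

# Hypothetical notes column: batch 2 has four flagged rows, the others none.
notes = [""] * 25 + ["unclear: pricing or onboarding?"] * 4 + [""] * 71
print(batches_needing_review(notes))
```

A flagged batch is a prompt to revisit the taxonomy before tagging further, not a measure of tagging quality.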

Step 5: Run an Inter-Rater Reliability Check

This is the quality gate most product teams skip. Have one other person (CS lead, designer, or fellow PM) independently tag a random 20-row sample from your 100 already-tagged responses. They should not see your tags.

Once they are done, compare using this formula in a helper column:

=IF(C2=H2,"match","mismatch")

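Here column C holds your tag, column H holds theirs, and the formula itself sits in column I; adjust the letters to match wherever you placed the two sets of tags. Count the matches: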

=COUNTIF(I2:I21,"match")/20

You want 70% agreement or higher before you proceed to automation. Below 70% means your taxonomy definitions are not tight enough. Go back to Step 2 and rewrite the definitions for the tags where you disagreed most.

You should now see an agreement score. If it is at or above 70%, your taxonomy is solid. If not, you have a specific list of tags to tighten up.
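The same check can be run in Python, which also makes it easy to list the tags that caused the most disagreement. This is a sketch of simple percent agreement, the same measure as the spreadsheet formula, not a chance-corrected statistic. The agreement_report function and the sample tags are hypothetical.

```python
from collections import Counter

def agreement_report(mine, theirs):
    """Return (share of matching rows, disputed tags sorted by frequency)."""
    if not mine or len(mine) != len(theirs):
        raise ValueError("both raters must tag the same non-empty sample")
    matches = 0
    disputed = Counter()
    for my_tag, their_tag in zip(mine, theirs):
        if my_tag == their_tag:
            matches += 1
        else:
            # Count both sides of a mismatch: either definition may be loose.
            disputed[my_tag] += 1
            disputed[their_tag] += 1
    return matches / len(mine), disputed.most_common()

# Hypothetical 20-row sample tagged independently by two people.
mine = ["pricing"] * 8 + ["onboarding"] * 6 + ["performance"] * 6
theirs = (["pricing"] * 6 + ["onboarding"] * 8
          + ["performance"] * 5 + ["missing_feature"])

score, disputed_tags = agreement_report(mine, theirs)
print(f"agreement: {score:.0%}")
for tag, count in disputed_tags:
    print(tag, count)
```

The disputed list gives you the specific definitions to revisit in Step 2 when the score falls below the 70% threshold.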

Step 6: Build a Tag Frequency Report

Add a third tab called reporting. Use COUNTIF to count how many responses carry each primary tag:

=COUNTIF(raw_feedback!E:E, taxonomy!A2)

Fill that down for every tag in your taxonomy. Add a percentage column, adjusting the range so it ends at your last tag row:

=B2/SUM($B$2:$B$15)

Build a bar chart from this. Sort descending. You now have a frequency distribution that tells you where the largest volume of feedback is concentrated.

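Combine this with the rating column from your source data using AVERAGEIF to get average satisfaction score per tag. The last range must point at the rating column: with the Step 3 layout, primary_tag is column E and rating is column H (or column G if you left out respondent_segment):

=AVERAGEIF(raw_feedback!E:E, taxonomy!A2, raw_feedback!H:H)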

Low rating + high volume = your top priority. High rating + high volume = a strength to protect.

You should now see a simple dashboard tab with a sorted frequency chart and average scores per tag. This is deliverable-ready for a sprint review or product review meeting.
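For teams that prefer to build the report from an exported CSV, the COUNTIF and AVERAGEIF logic translates directly. The following sketch uses only the standard library. The tag_report function and the sample rows are hypothetical, and blank ratings are skipped in the average.

```python
from collections import defaultdict

def tag_report(rows):
    """Return (tag, count, share, average rating) tuples sorted by count."""
    counts = defaultdict(int)
    ratings = defaultdict(list)
    for row in rows:
        tag = row.get("primary_tag", "")
        if not tag:
            continue  # untagged rows are left out of the report
        counts[tag] += 1
        if row.get("rating") not in ("", None):
            ratings[tag].append(float(row["rating"]))
    total = sum(counts.values())
    report = []
    for tag, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        scores = ratings[tag]
        average = sum(scores) / len(scores) if scores else None
        report.append((tag, count, count / total, average))
    return report

# Hypothetical tagged rows.
rows = [
    {"primary_tag": "pricing", "rating": "3"},
    {"primary_tag": "pricing", "rating": "5"},
    {"primary_tag": "performance", "rating": "2"},
    {"primary_tag": "onboarding", "rating": ""},
    {"primary_tag": "pricing", "rating": ""},
]
for tag, count, share, average in tag_report(rows):
    print(tag, count, f"{share:.0%}", average)
```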

Step 7: Automate Tagging With an LLM

Once your taxonomy is validated, you can automate new incoming feedback. This Python script sends each verbatim to gpt-4o-mini and returns a tag from your defined list:

import os

import openai
import pandas as pd

# Read the key from the environment instead of hardcoding it in the script.
client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Keep this list identical to the tag_name column of your taxonomy tab.
TAGS = [
    "onboarding", "pricing", "performance",
    "missing_feature", "billing", "integrations",
    "support", "data_export", "ui_ux", "other"
]

def tag_feedback(verbatim: str) -> str:
    tag_list = ", ".join(TAGS)
    prompt = f"""You are a product analyst. Tag the following customer feedback with exactly one tag from this list: {tag_list}.

Feedback: {verbatim}

Reply with only the tag name, nothing else."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=20,
        temperature=0
    )
    tag = (response.choices[0].message.content or "").strip().lower()
    # Fall back to "other" if the model replies with anything outside the list.
    return tag if tag in TAGS else "other"

df = pd.read_csv("raw_feedback.csv")
df["auto_tag"] = df["verbatim"].apply(tag_feedback)
df.to_csv("tagged_feedback.csv", index=False)

Run this on your untagged backlog, then bring the resulting auto_tag column into your sheet alongside your manual primary_tag column. Spot-check 20 rows to verify accuracy.

You should now see a tagged CSV with an auto_tag column populated for every row, ready to be imported back into your sheet or Airtable base.
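To make the spot-check repeatable, you can draw the 20 rows at random and, for rows you already tagged by hand, measure how often the automated tag matches yours. This is one possible sketch. The function names and sample rows are hypothetical, and it assumes each row is a dict with auto_tag and, where available, primary_tag.

```python
import random

def spot_check_sample(rows, k=20, seed=0):
    """Pick up to k rows at random; a fixed seed keeps the sample reproducible."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))

def auto_tag_accuracy(rows):
    """Share of manually tagged rows where auto_tag equals primary_tag."""
    labeled = [row for row in rows if row.get("primary_tag")]
    if not labeled:
        return None  # nothing to compare against
    hits = sum(1 for row in labeled if row["auto_tag"] == row["primary_tag"])
    return hits / len(labeled)

# Hypothetical rows: four tagged by hand, one with only an automated tag.
tagged = [
    {"verbatim": "It crashed twice this week",
     "primary_tag": "performance", "auto_tag": "performance"},
    {"verbatim": "I wish I could export to PDF",
     "primary_tag": "missing_feature", "auto_tag": "data_export"},
    {"verbatim": "The pro plan is way too expensive",
     "primary_tag": "pricing", "auto_tag": "pricing"},
    {"verbatim": "I couldn't connect my first account",
     "primary_tag": "onboarding", "auto_tag": "onboarding"},
    {"verbatim": "Support never replied",
     "primary_tag": "", "auto_tag": "support"},
]
print(auto_tag_accuracy(tagged))
print(len(spot_check_sample(tagged, k=3)))
```

Comparing against the 100 rows from Step 4 gives you a baseline before you trust the automated tags on unlabeled feedback.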

Step 8: Set Up a Recurring Intake Pipeline

A one-time tagging exercise decays fast. You need a repeatable process for new feedback.

The simplest version: a weekly 30-minute calendar block where you export the past week’s feedback, run the Python script, append the rows to your master sheet, and check the updated frequency chart.

If you are using Notion as your product workspace, you can paste your reporting tab output as a linked database view in your product review doc template. That keeps the data visible without extra tool-switching.

For teams handling more than 200 new responses per week, connect your feedback source directly to Airtable via Zapier or Make. Set up an automation that fires the tag script via a webhook whenever a new row appears. See our guide on automating qualitative data workflows for a step-by-step setup.

You should now see a calendar event, a documented process, and a running master sheet that grows week over week without requiring manual bulk exports.
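The weekly append step is easy to get wrong when export windows overlap. The sketch below adds one safeguard that the process above does not mention: skipping rows already present in the master table. The append_new_feedback function and the choice of (source, date, verbatim) as the duplicate key are assumptions for illustration.

```python
def append_new_feedback(master, new_rows):
    """Append unseen rows to master, continuing the ID sequence.

    Returns the number of rows actually added.
    """
    seen = {(row["source"], row["date"], row["verbatim"]) for row in master}
    next_id = max((row["response_id"] for row in master), default=0) + 1
    added = 0
    for row in new_rows:
        key = (row["source"], row["date"], row["verbatim"])
        if key in seen:
            continue  # already imported in an earlier week
        seen.add(key)
        master.append({**row, "response_id": next_id})
        next_id += 1
        added += 1
    return added

# Hypothetical master table and an overlapping weekly export.
master = [
    {"response_id": 1, "source": "intercom", "date": "2026-01-05",
     "verbatim": "It crashed twice this week"},
]
this_week = [
    {"source": "intercom", "date": "2026-01-05",
     "verbatim": "It crashed twice this week"},
    {"source": "typeform", "date": "2026-01-12",
     "verbatim": "I wish I could export to PDF"},
]
print(append_new_feedback(master, this_week))
```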

Step 9: Share and Socialize the Output

Tagged data that lives only in your sheet helps no one. Export a weekly snapshot as a two-slide format: one slide showing the top 5 tags by volume, one slide showing the three verbatims that best illustrate your biggest theme.

Drop this into your team Slack channel or your sprint retrospective doc every week. Consistency matters more than depth here. A short, reliable signal beats a quarterly deep-dive that nobody acts on.

Tag-based insights pair well with quantitative metrics. If your missing_feature tag volume spikes by 40% in a given week, cross-reference that against your activation rate. You now have a qualitative signal attached to a quantitative one.

You should now see a two-slide or two-paragraph summary you can share in under five minutes, giving the rest of the team direct access to the tagged output.
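Week-over-week spikes like the 40% example can be flagged automatically from two lists of tags. This is a minimal sketch. The tag_spikes function, the 40% default threshold, and the decision to skip tags with no prior-week baseline are illustrative choices.

```python
from collections import Counter

def tag_spikes(last_week, this_week, threshold=0.40):
    """Return {tag: relative change} for tags that grew by at least threshold."""
    previous = Counter(last_week)
    current = Counter(this_week)
    spikes = {}
    for tag, count in current.items():
        baseline = previous.get(tag, 0)
        if baseline == 0:
            continue  # no baseline, so a relative change is undefined
        change = (count - baseline) / baseline
        if change >= threshold:
            spikes[tag] = change
    return spikes

# Hypothetical primary tags from two consecutive weeks.
last_week = ["missing_feature"] * 10 + ["pricing"] * 5
this_week = ["missing_feature"] * 14 + ["pricing"] * 5

for tag, change in tag_spikes(last_week, this_week).items():
    print(f"{tag}: +{change:.0%}")
```

With small weekly volumes a jump from 2 to 3 responses also counts as a 50% rise, so treat flagged tags as a reason to read the verbatims rather than as a conclusion.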

Common Mistakes To Avoid

  • Skipping the inter-rater check. If only one person defines and applies the taxonomy, you will tag based on intuition rather than a consistent rule. The moment you hand it to someone else, it falls apart.
  • Creating tags that are too broad. “Product issues” is not a tag. “Performance” is. Broad tags absorb too many responses and give you no actionable signal.
  • Using more than two levels of tags from day one. Sub-tags like onboarding > account_setup > oauth_error sound useful but create maintenance nightmares. Start flat and add hierarchy only when a single tag is generating more than 25% of all responses.
  • Tagging manually forever. Human tagging does not scale past 500 responses per month. Set up the LLM automation at Step 7 before you hit that ceiling, not after.
  • Not versioning your taxonomy. When you add or rename a tag, old rows become inconsistent. Keep a taxonomy_changelog tab with date, change made, and the affected tag. Retroactively re-tag affected rows or document why you did not.
  • Ignoring the other category. If other grows past 15% of all responses, your taxonomy has a gap. Audit other monthly and promote recurring themes to first-class tags.

When To Level Up

A spreadsheet-based system handles roughly 300 to 500 responses per month before the cracks show. Past that threshold, manual spot-checking becomes unreliable, automation errors accumulate undetected, and the taxonomy needs governance that a sheet cannot provide.

At that point you are looking at dedicated qualitative analysis tools: Label Studio for annotation workflows with multiple reviewers, Dovetail or Grain for interview-heavy research programs, or a purpose-built feedback aggregation layer like Productboard or Canny.

The cost jump is real: Dovetail starts at $29 per user per month, Productboard at $20 per maker per month. But if your team is spending three or more hours per week managing the spreadsheet, the time savings justify it within the first month.

For a full comparison of tools at this tier, browse the research methodology tools category for current reviews. Also worth reading: our guide to analyzing NPS verbatim data and best tools for qualitative data analysis in 2026.

Frequently Asked Questions

How many tags should my taxonomy have?
Eight to fifteen top-level tags is the practical range for most SaaS products. Fewer than eight means you are conflating distinct problem areas. More than fifteen makes inter-rater agreement difficult and reduces the reporting clarity you are building the system for.

Can I use ChatGPT directly instead of the API?
You can paste batches manually into ChatGPT and get usable output, but it does not scale past 50 or 60 responses per session and introduces inconsistency between sessions. The API approach in Step 7 gives you reproducible, logged, batch-scalable results for a few cents per 1,000 responses.
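To sanity-check the cost figure for your own data, the arithmetic can be sketched as follows. The input price is the roughly $0.15 per 1M tokens quoted earlier in this guide. The output price and the per-response token counts are assumptions you should replace with current pricing and your own measurements.

```python
def estimate_cost(
    n_responses,
    input_tokens_per_response=150,   # assumption: prompt plus a short verbatim
    output_tokens_per_response=5,    # assumption: a single tag name
    input_price_per_million=0.15,    # USD, figure quoted in this guide
    output_price_per_million=0.60,   # USD, assumption: check current pricing
):
    """Rough API cost in USD for tagging n_responses, one call per response."""
    input_cost = n_responses * input_tokens_per_response * input_price_per_million
    output_cost = n_responses * output_tokens_per_response * output_price_per_million
    return (input_cost + output_cost) / 1_000_000

print(f"${estimate_cost(1000):.4f} per 1,000 responses")
```

Under these assumptions, 1,000 responses use 150,000 input tokens and 5,000 output tokens, which works out to roughly two and a half cents.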

What if my feedback is in multiple languages?
Flag each row with a language column at import time (Step 1). The gpt-4o-mini model generally handles multilingual input well and can apply your English tag list to non-English verbatims, but do not assume it is accurate for every language. Verify with a native speaker on a 20-row spot-check before you trust the automation output at scale.

How often should I review and update the taxonomy?
Monthly. Set a recurring calendar event. Check whether any tag is carrying more than 25% of all responses (too broad), whether other has grown past 15% (missing tag), and whether any tag has fewer than 5 responses over the past three months (possibly redundant).
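The three monthly checks can be expressed as one small function. This sketch assumes you pass in the primary tags from the past three months. The taxonomy_health function, its labels, and the sample data are hypothetical, while the thresholds are the ones given above.

```python
from collections import Counter

def taxonomy_health(tags, all_tags, broad_share=0.25, other_share=0.15, min_count=5):
    """Return (tag, finding) pairs for tags that need a closer look."""
    counts = Counter(tags)
    total = len(tags)
    findings = []
    for tag in all_tags:
        count = counts.get(tag, 0)
        share = count / total if total else 0.0
        if tag == "other":
            if share > other_share:
                findings.append((tag, "taxonomy may be missing a tag"))
        elif share > broad_share:
            findings.append((tag, "possibly too broad"))
        elif count < min_count:
            findings.append((tag, "possibly redundant"))
    return findings

# Hypothetical three months of primary tags (100 responses).
recent_tags = (["pricing"] * 30 + ["onboarding"] * 20 + ["performance"] * 20
               + ["integrations"] * 2 + ["other"] * 28)
taxonomy_tags = ["onboarding", "pricing", "performance", "integrations", "other"]

for tag, finding in taxonomy_health(recent_tags, taxonomy_tags):
    print(tag, "->", finding)
```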

Should tags reflect problems or features?
Either can work, but pick one and keep it consistent. Tag by feature area (onboarding, billing, integrations) or by problem type (confusion, frustration, delight). Mixing the two in the same taxonomy creates overlap that makes reporting ambiguous. Most product teams find feature-area tags more actionable because they map directly to team ownership.

Bottom Line

Building a feedback tagging system is not a one-day project you finish and forget. It is a discipline: a defined taxonomy, a consistent process, a recurring review cycle, and automation to keep it from becoming a bottleneck. Start with 50 to 100 responses and a flat taxonomy of 8 to 12 tags. Validate it with a second reviewer. Then automate the repetitive tagging work with a simple API call so your time goes toward reading the signal, not generating it. The output, a ranked frequency table with average satisfaction scores per theme, is immediately useful in sprint reviews, product strategy docs, and roadmap discussions. For more tools and frameworks that complement this workflow, browse the research methodology category on this site.