What Is Shippability in Localization? A New Quality Framework from Smartling and OpenAI

A quality score tells you whether the translation passed a defined check. It counts errors, checks thresholds and looks backward. The business, however, is asking something different: are we ready to ship this to that market?

That distinction is the fault line running through most enterprise localization programs right now, and it was the central argument at one of LocWorld55 Dublin's most attended sessions. The room was packed because artificial intelligence (AI) has made shipping faster, but the challenge now is making sure reliability, evaluation, and deployment practices keep pace. After all, that’s where gaps can cost teams real money and market credibility.

At LocWorld55, Kathy Mok, Head of Localization at OpenAI, and Olga Beregovaya, Smartling's VP of AI, co-presented "Would You Ship This? Reframing Translation Quality for the AI Era."


Olga Beregovaya (Smartling) and Kathy Mok (OpenAI) on stage at LocWorld55 Dublin.

For those who weren't in the room, here are the ideas worth taking back to your program:

When the Dashboard Lies (Sort Of)

Kathy opened with a scenario most localization managers will recognize immediately. You've launched three languages quickly. The Multidimensional Quality Metrics (MQM) scorecard is green, Service Level Agreements (SLAs) are met, and all three languages have passed their thresholds. Then the feedback starts coming in: Japan marketing says the creative asset isn't good enough, a Spanish-speaking stakeholder flags the call-to-action (CTA) as feeling low quality, and a growth product manager quietly starts looking for their own French agency.

The uncomfortable part is that the dashboard still says green. The MQM score isn't the problem here; it's answering the question it was designed to answer: whether the translation passed a defined linguistic check. The business is asking whether this experience is ready for a real market, with real users who make real decisions based on what they read. Those two questions are not the same thing, and treating them as equivalent is exactly how technically correct translations produce commercially broken experiences.

Quality Models Built for a Slower World

This doesn’t mean that traditional quality measurement is wrong. It means it was designed for a pace that no longer exists. Brands now ship content globally on a daily cadence, AI-first translation has become the operational norm, and vendor partners are retraining workflows in real time to keep up. In that environment, post-delivery error counting becomes a lagging indicator at best. By the time a Linguistic Quality Assurance (LQA) review confirms that something was wrong, the content is often already in market.

The deeper issue is structural. Traditional quality models ask reviewers to find defects, but they weren't designed to ask whether a given defect matters, to whom it matters, on which surface, in which market, and at what level of risk. That granular error-tagging work has its place, but it doesn't reliably predict whether a campaign will convert, whether a safety message will be trusted, or whether a checkout flow will cause someone to abandon the transaction entirely.

Introducing Shippability

This shift is referred to as shippability: the practice of treating quality review not as a backward-looking defect audit, but as a forward-looking launch-readiness decision. The core question changes from "how many errors did we find?" to "would a local user trust this enough to continue?" It sounds like a small shift in wording, but the operational implications are significant.

Framed this way, the reviewer's job changes entirely. Rather than policing language against a taxonomy, reviewers take local ownership of a shipping decision by evaluating four things:

Meaning (is the original intent intact?)
Market fit (is this appropriate for this specific audience and context?)
Risk (does it mislead, block an action, or erode trust?)
Action (what happens next: ship it, improve it post-launch, or hold it for a fix before release)

That last dimension matters, because without a clear, actionable output, shippability becomes another abstract quality framework that changes nothing in practice. The three shipping calls are designed to prevent exactly that: fix before ship, ship then improve, or ready to ship. Each one tells a team what to do, not just how the translation scored.
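To make the shape of that output concrete, here is a minimal sketch of how a single reviewer decision could be recorded. It is an illustration only, not the work surface Smartling built: the class names, field names, and sample values are all hypothetical, and only the four dimensions and the three shipping calls come from the session.

```python
from dataclasses import dataclass
from enum import Enum


class ShippingCall(Enum):
    """The three shipping calls named in the session."""
    FIX_BEFORE_SHIP = "fix before ship"
    SHIP_THEN_IMPROVE = "ship then improve"
    READY_TO_SHIP = "ready to ship"


@dataclass(frozen=True)
class ShippabilityReview:
    """One reviewer decision for one piece of content in one market.

    Field names are hypothetical; they mirror the four dimensions.
    """
    content_id: str
    locale: str
    meaning_intact: bool   # Meaning: is the original intent intact?
    fits_market: bool      # Market fit: right for this audience and context?
    risk_note: str         # Risk: what could mislead, block, or erode trust
    call: ShippingCall     # Action: the reviewer's shipping call

    @property
    def blocks_release(self) -> bool:
        # Only "fix before ship" holds the launch.
        return self.call is ShippingCall.FIX_BEFORE_SHIP


# Made-up example: the CTA is accurate but reads as flat for the market.
review = ShippabilityReview(
    content_id="checkout-cta",
    locale="es-ES",
    meaning_intact=True,
    fits_market=False,
    risk_note="CTA is understandable but feels low quality",
    call=ShippingCall.SHIP_THEN_IMPROVE,
)

print(review.call.value, "| blocks release:", review.blocks_release)
```

The point of the structure is that the record ends in an instruction a team can act on, with the other three fields explaining why.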

The Threshold Moves with the Market

One of the more practically useful arguments from the session is that shippability isn't a universal standard. It's a risk-calibrated one, and the right risk level depends entirely on what's being translated and for whom it's intended. A low-visibility help center article, a paid acquisition headline, a safety instruction, and a pricing screen represent four very different risk profiles. Applying the same review depth to all of them means either over-investing in the places that don't warrant it or under-investing in the ones that do.

Market personas also shift the threshold in meaningful ways. For example, AI-cautious audiences require higher confidence cues and more deliberate tone, while utility-first markets prioritize task clarity over stylistic polish, and quality-sensitive locales have higher expectations around nuance and register. The localization decisions that work well for one audience profile can actively underperform for another, which is why local ownership of the shippability call matters as much as having the framework in the first place.
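One way to picture a risk-calibrated threshold is as a lookup that combines content type and market persona. The sketch below is a hypothetical illustration: the session did not assign depth levels to content types or say how much a persona shifts them, so every value in these tables is a placeholder that a client organization would set for itself.

```python
from enum import IntEnum


class ReviewDepth(IntEnum):
    """Hypothetical review depth levels, lowest to highest."""
    SPOT_CHECK = 1
    STANDARD = 2
    FULL_CONTEXT = 3


# Placeholder baseline per content type (assumed, not from the session).
BASE_DEPTH = {
    "help_center_article": ReviewDepth.SPOT_CHECK,
    "paid_acquisition_headline": ReviewDepth.STANDARD,
    "pricing_screen": ReviewDepth.FULL_CONTEXT,
    "safety_instruction": ReviewDepth.FULL_CONTEXT,
}

# What each persona asks reviewers to prioritize (paraphrased from the text).
PERSONA_FOCUS = {
    "ai_cautious": "confidence cues and deliberate tone",
    "utility_first": "task clarity",
    "quality_sensitive": "nuance and register",
}

# Assumed adjustment: how many levels a persona raises the baseline.
PERSONA_BUMP = {"ai_cautious": 1, "utility_first": 0, "quality_sensitive": 1}


def review_plan(content_type: str, persona: str) -> tuple[ReviewDepth, str]:
    """Return the review depth and reviewer focus for one content/market pair."""
    raised = BASE_DEPTH[content_type] + PERSONA_BUMP[persona]
    depth = ReviewDepth(min(raised, ReviewDepth.FULL_CONTEXT))  # cap at the top level
    return depth, PERSONA_FOCUS[persona]


depth, focus = review_plan("help_center_article", "quality_sensitive")
print(depth.name, "|", focus)
```

The same help center article gets a deeper review in a quality-sensitive locale than in a utility-first one, which is the behavior the framework describes, even if the real calibration is a judgment call rather than a table.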

How Smartling and OpenAI Built It in Practice

Olga brought the second half of the session into the operational reality of what it actually takes to run a program this way. The Smartling and OpenAI partnership started at 20 locales, expanded to 60+, and now operates at full coverage across ChatGPT and OpenAI's full product suite. That scale, sustained at that speed, is the real stress test for any quality framework.

The translator's role had to be rethought almost completely. Inside the shippability model, a linguist isn't processing strings in a queue. Instead, they're functioning more like an in-country product manager, reading the full context without pre-judgment, assessing it against the market persona and risk framework, then making and recording a clear decision. Those decisions feed back into the system as signals that inform what's automatable, where human review still moves outcomes, and what needs to change in the underlying workflow or model behavior over time.

Smartling built a purpose-designed work surface to support this model, one that is minimal, plain-language, and structured around the three shipping calls rather than traditional error categorization. The design reflects the philosophy directly: no complex scoring grids, no elaborate defect tagging. The interface asks reviewers to read in full context, assess holistically, and decide. That simplicity is intentional, because cognitive overhead in the review step is one of the things that slows programs down and dilutes the quality of the signal coming back.

Starting Without a Full Program Rebuild

The session’s Q&A surfaced a predictable concern: this sounds right, but where do you even begin? Kathy recommended starting with one lane: one market, one content type, and one changed question. Instead of asking reviewers how many errors they found, ask whether they'd ship this to their market. Track what comes back in four simple categories: 1) ship; 2) not yet; 3) why; and 4) what action was triggered. That's the signal, and it's more useful than a granular error count because it maps directly to a business decision.
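A pilot like this needs very little tooling. As a rough sketch, assuming responses are collected as simple records, the four categories could be tallied as follows. The field names, the `summarize` helper, and the sample responses are all made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PilotResponse:
    """One reviewer answer to "would you ship this to your market?"."""
    content_id: str
    would_ship: bool   # categories 1 and 2: ship vs. not yet
    why: str           # category 3: the reviewer's reason
    action: str        # category 4: what action was triggered


def summarize(responses: list[PilotResponse]) -> dict:
    """Tally ship vs. not-yet and count the actions triggered by not-yet calls."""
    total = len(responses)
    shipped = sum(1 for r in responses if r.would_ship)
    actions = Counter(r.action for r in responses if not r.would_ship)
    return {
        "ship_rate": shipped / total if total else 0.0,
        "not_yet": total - shipped,
        "actions": actions,
    }


# Made-up sample data for one market and one content type.
responses = [
    PilotResponse("headline-01", True, "reads naturally", "none"),
    PilotResponse("headline-02", False, "CTA feels machine-written", "rewrite CTA"),
    PilotResponse("headline-03", True, "clear and on-brand", "none"),
    PilotResponse("headline-04", False, "pricing claim is ambiguous", "escalate to legal"),
]

summary = summarize(responses)
print(summary)
```

Even at this size, the summary points at a decision (what was held and what was done about it) rather than at an error count.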

The ownership split matters too. The client organization sets the business context and defines the risk appetite for each content type and market. The vendor partner is responsible for enabling the judgment: getting the right reviewers in place, building workflows that can operate at the required pace, and making sure the tooling supports clear decisions rather than burdening reviewers with process overhead. Both sides have to do their part, because shippability decisions require someone who understands what's at stake commercially and someone who can structure the program to make those decisions consistently at scale.

The data from even a modest pilot starts surfacing the real shape of risk in a program: where teams are over-reviewing content that doesn't warrant it, where they're under-reviewing content that does, and what a quality strategy built for their actual shipping pace would look like in practice.

The Lasting Question

The session closed with a restaurant analogy. A menu can be translated correctly, but the question isn't whether the words are right. The question is whether guests will order with confidence, trust what they're reading, and feel comfortable enough to come back.

A slightly awkward phrase in the dessert description is a very different problem from a misunderstanding about allergens. Both are technically errors, but only one of them represents a risk serious enough to stop a launch. Knowing the difference, and structuring a quality program around that distinction, is what shippability is designed to do.

Localization quality is not a language perfection topic. It is a launch-confidence topic. The Smartling and OpenAI session at LocWorld55 made that case in concrete terms, grounded in a real program running at real scale. If your current quality process can't reliably answer whether a translation is ready for its market, that's the most useful place to start.