A field guide for engineers who use LLMs as architectural peers, not autocomplete. We trace the problem from the math of the decoding layer all the way down to a {% for %} loop that takes down a storefront, and we show how to override the failure mode with constraint-driven prompting.
1. Introduction: The Dichotomy of LLM Intelligence
Ask a frontier model (Claude, GPT, Gemini) to design a feature and the output is frequently excellent. Propose an "advanced multi-faceted collection filter for a Shopify storefront" and you will get a coherent concept: faceted navigation with AND/OR semantics across product type, vendor, price band, and metafield-driven attributes; instant client-side feedback; URL-encoded filter state for shareable links; accessible keyboard navigation; and a UI that respects the Polaris design language down to the spacing tokens. The concept is standard-compliant, well-decomposed, and often genuinely creative in how it composes known patterns into something new.
Then ask the same model, in the same breath, to implement it. The output collapses. You get a single Liquid template that iterates collection.products inside a nested {% for %} loop, or a React component that fetches the entire catalog on mount and filters it in a useMemo. It is syntactically perfect. It passes a linter. It works flawlessly in a demo store with 30 products. And it falls over the moment it meets a real catalog of 2,000+ products with 3 to 5 variants each.
This is the dichotomy. The same system that reasons about the problem space with apparent depth defaults, at generation time, to the most locally probable implementation, which is almost always the textbook one. The textbook implementation is locally optimal (it is the cleanest expression of the algorithm) and globally catastrophic (it ignores the runtime environment the algorithm will actually execute in).
The paradox is sharper when you consider the machinery. The model that produced both answers is one of the most aggressively non-linear function approximators ever built. Its internal representation of "collection filter" is a high-dimensional vector that has been bent, gated, and recombined through dozens of non-linear transformations. That non-linearity is precisely what lets it synthesize a novel feature concept. And yet, without explicit constraints, the generation phase flattens into something that behaves as if it were linear and memoryless about scale. The intelligence is non-linear; the default code is linear, in the worst sense.
The thesis of this article: this is not a flaw you fix with a better model. It is a structural property of how these systems decode, and you manage it the way you manage any structural property of a tool, by understanding the mechanism and engineering around it. We will (1) deconstruct why the decision engine is non-linear and stochastic, (2) explain why code generation nonetheless feels linear and rigid and is afflicted by what we will call scale blindness, (3) walk a concrete Shopify case study from naive failure to architected solution, and (4) give you the exact prompt blueprint that forces the non-linear engine to optimize for performance rather than aesthetics.
2. Deconstructing the LLM Decision Engine: Non-Linearity and Stochastic Processes
To understand why the design is good and the default code is bad, you have to look at two different things the model is doing with the same weights: building a rich conditional representation of your request, and sampling a sequence of tokens from that representation. The first is where the creativity lives. The second is where scale goes to die.
2.1 The Non-Linear Foundation
A Transformer is a stack of L identical blocks, each composed of multi-head self-attention followed by a position-wise feed-forward network (FFN), with residual connections and layer normalization wrapped around both sublayers. The single most important architectural fact for our purposes is this: if you removed the non-linearities, the entire stack would algebraically collapse into one linear projection.
Consider why. A composition of linear maps is itself a linear map: if f(x) = W_1 x and g(y) = W_2 y, then g(f(x)) = W_2 W_1 x = W x for some single matrix W. Stacking 96 linear layers buys you nothing over a single layer. The depth is only meaningful because each block inserts a non-linear function that breaks this collapse. Two places do the work:
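The collapse argument can be checked numerically. The following is a minimal sketch with hand-picked 2×2 weights (the matrices `W1` and `W2`, the input `x`, and the helper names are illustrative assumptions, not taken from any real model); ReLU stands in for the non-linearity only because it is the simplest to write.

```javascript
// Multiply a matrix (array of rows) by a vector.
function matVec(W, x) {
  return W.map((row) => row.reduce((sum, w, j) => sum + w * x[j], 0));
}

// Multiply two matrices: (A B)[i][j] = sum over k of A[i][k] * B[k][j].
function matMul(A, B) {
  return A.map((row) =>
    B[0].map((_, j) => row.reduce((sum, a, k) => sum + a * B[k][j], 0))
  );
}

// Element-wise ReLU, used here as a stand-in non-linearity.
const relu = (v) => v.map((t) => Math.max(0, t));

// Illustrative weights chosen by hand.
const W1 = [[1, -1], [2, 0]];
const W2 = [[0, 1], [1, 1]];
const W = matMul(W2, W1); // the single matrix the two linear layers collapse into

const x = [1, 3];

const twoLinearLayers = matVec(W2, matVec(W1, x));        // g(f(x))
const oneCollapsedLayer = matVec(W, x);                   // W x
const withNonLinearity = matVec(W2, relu(matVec(W1, x))); // g(relu(f(x)))

// twoLinearLayers and oneCollapsedLayer are both [2, 0];
// withNonLinearity is [2, 2], which no single matrix W reproduces for all x.
console.log(twoLinearLayers, oneCollapsedLayer, withNonLinearity);
```

The first two results agree for every input, which is the sense in which depth without non-linearity buys nothing. The third differs as soon as any pre-activation is negative.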
The FFN activation. Each feed-forward sublayer computes something of the form FFN(x) = W_2 · phi(W_1 x + b_1) + b_2, where phi is a non-linear activation. Modern models use smooth, gated activations rather than the older ReLU:
- GeLU (Gaussian Error Linear Unit): GeLU(x) = x · Phi(x), where Phi is the standard Gaussian CDF. It is a smooth gate that lets a neuron's own magnitude probabilistically decide how much of its signal passes.
- Swish / SiLU: Swish(x) = x · sigmoid(beta·x), and its gated variant SwiGLU, now common in large models, which splits the projection into a value path and a Swish-activated gate path and multiplies them elementwise. The multiplication itself is a non-linearity (even with no activation at all, a product of two learned linear projections is quadratic in the input).
These activations are what give a single FFN the capacity to approximate a curved decision boundary instead of a flat one. Stack L of them and the set of functions the network can represent becomes astronomically rich.
The attention softmax. Self-attention computes softmax(Q Kᵀ / sqrt(d_k)) V. The softmax is non-linear and, crucially, input-dependent: the weights with which the model mixes other tokens are a non-linear function of the tokens themselves. This is dynamic, content-addressed routing. The same FFN weights get fed radically different inputs depending on context, because attention re-weights what flows into them per token, per position.
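The activations named above are short enough to write out. This is a scalar sketch under stated assumptions: JavaScript has no built-in error function, so `gelu` uses the widely used tanh approximation of x · Phi(x) rather than the exact Gaussian CDF, and the `wGate` / `wValue` numbers in `swiglu` are hypothetical stand-ins for learned projections.

```javascript
const sigmoid = (x) => 1 / (1 + Math.exp(-x));

// Swish / SiLU: x * sigmoid(beta * x). beta = 1 gives SiLU.
const swish = (x, beta = 1) => x * sigmoid(beta * x);

// GeLU via the tanh approximation of x * Phi(x) (not the exact CDF form).
const gelu = (x) =>
  0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x ** 3)));

// SwiGLU on scalars: a Swish-activated gate path times a linear value path.
// wGate and wValue are hypothetical weights standing in for learned projections.
const swiglu = (x, wGate, wValue) => swish(wGate * x) * (wValue * x);

// A linear map f would satisfy f(1) + f(2) === f(3); these activations do not.
const additivityGap = gelu(1) + gelu(2) - gelu(3);
console.log({ gelu3: gelu(3), geluMinus3: gelu(-3), additivityGap });
```

Large positive inputs pass almost unchanged and large negative inputs are squashed toward zero, which is the "smooth gate" behavior described above. The non-zero `additivityGap` is a direct check that the function is not linear.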
The practical consequence: the model's internal representation of "collection filter on a Shopify storefront" is not a lookup. It is a point in a high-dimensional embedding space that has been folded through dozens of non-linear transformations until concepts that are semantically related (faceted search, URL state, debounce, ARIA roles, Polaris tokens) sit near each other in directions the network can act on. Feature synthesis (proposing a design that is more than the sum of memorized snippets) is this folding at work. Non-linearity is not incidental to the model's creativity; it is the substrate of it.
2.2 Stochastic Process, Not Pure Randomness
Engineers loosely call LLM output "random." It is not random in the mathematical sense, and the distinction matters for how you steer it.
Pure randomness is a draw from a distribution that is independent of history. A fair die is memoryless: P(X_n = 6) is 1/6 regardless of every prior roll. Formally, draws are i.i.d. (independent and identically distributed). There is no conditioning on the past.
A stochastic process is a sequence of random variables that is conditioned on history. Token generation is exactly this. The model defines a probability distribution over the next token given everything seen so far:
P(x_{n+1} | x_1, x_2, ..., x_n)
This is autoregressive. The next token is a random variable, but its distribution is strictly shaped by the prior context window. It is best understood as a high-order Markov process: the classical Markov property says P(X_{n+1} | X_n) depends only on the immediately prior state, and a Transformer generalizes this so the "state" is the entire visible context up to the model's window length. The dependence on the past is the whole point. It is why the model stays on topic, closes the brackets it opened, and finishes the function signature it started.
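Written as factorizations of the joint probability of a length-N sequence, the contrast between the two cases looks like this. Both lines are standard probability identities, restated in the notation used above (the sequence length N is the only symbol introduced here).

```latex
% i.i.d. draws: every factor is the same marginal, with no conditioning on history
P(x_1, \dots, x_N) = \prod_{n=1}^{N} P(x_n)

% autoregressive generation: every factor is conditioned on the full prefix (chain rule)
P(x_1, \dots, x_N) = \prod_{n=1}^{N} P(x_n \mid x_1, \dots, x_{n-1})
```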
This reframes the design-versus-code gap precisely. When you ask for a concept, the conditional distribution P(next token | "design an advanced collection filter...") has wide, relatively flat probability mass over many plausible and creative continuations, and sampling explores that space. When you ask for an implementation, the conditional distribution P(next token | "here is the Liquid template:") becomes extremely peaked. Given {% for product in, the next tokens are almost deterministically collection.products %} because that is the overwhelmingly most probable continuation in the training distribution. The process is the same; the entropy of the conditional distribution is what changed. Code generation is a low-entropy regime, and low entropy means the model funnels toward the single most-trodden path, which is the textbook pattern.
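The word "entropy" here has a concrete meaning that is easy to compute. The sketch below compares two made-up next-token distributions over four candidate tokens; the probabilities are illustrative assumptions chosen to mimic a flat design-phase distribution and a peaked code-phase one, not measurements from any model.

```javascript
// Shannon entropy, in bits, of a discrete probability distribution.
function entropyBits(probs) {
  return probs.reduce((h, p) => (p > 0 ? h - p * Math.log2(p) : h), 0);
}

// Illustrative distributions over four candidate next tokens.
const designPhase = [0.25, 0.25, 0.25, 0.25]; // many plausible continuations
const codePhase = [0.97, 0.01, 0.01, 0.01];   // one dominant continuation

const designEntropy = entropyBits(designPhase); // 2 bits: the maximum for four outcomes
const codeEntropy = entropyBits(codePhase);     // roughly a quarter of a bit

console.log({ designEntropy, codeEntropy });
```

With the peaked distribution, sampling returns the dominant token about 97 times in 100 regardless of how the sampler is configured within reasonable bounds. That is the funnel toward the most-trodden path.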
2.3 The Decoding Layer: Logits, Softmax, and Controlled Stochasticity
The final hidden state for a position is projected to a vector of logits, one real number per vocabulary token. Logits are unnormalized scores. The softmax converts them into a probability distribution:
P(token_i) = exp(z_i / T) / Σ_j exp(z_j / T)
Here z_i is the logit for token i and T is the temperature. This single equation is the control surface for everything practical:
- Temperature T. At T → 0 the distribution collapses onto its argmax (greedy decoding): deterministic, repetitive, "safe." At T = 1 you sample the model's native distribution. At T > 1 you flatten the distribution, raising the odds of low-probability tokens: more diverse, more novel, more likely to wander off the most-trodden path, and also more likely to be wrong. Temperature is literally a knob on the spread (the entropy) of the next-token distribution.
- Top-k sampling. Restrict sampling to the k highest-probability tokens, renormalize, then sample. Caps the tail.
- Top-p (nucleus) sampling. Restrict to the smallest set of tokens whose cumulative probability exceeds p, then sample. The candidate set size adapts to how confident the model is: small when the distribution is peaked, large when it is flat.
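The three controls can be sketched in a few lines. This is a minimal illustration, not any provider's decoder: the function names and the four-token `logits` vector are assumptions made for the example, `softmax` requires T > 0, and `topP` keeps tokens until the cumulative probability reaches or exceeds p.

```javascript
// Softmax with temperature T (T > 0); subtracting the max keeps exp() stable.
function softmax(logits, T = 1) {
  const scaled = logits.map((z) => z / T);
  const max = Math.max(...scaled);
  const exps = scaled.map((z) => Math.exp(z - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Rescale so the remaining probabilities sum to 1.
function renormalize(probs) {
  const total = probs.reduce((a, b) => a + b, 0);
  return probs.map((p) => p / total);
}

// Token indices sorted from most to least probable.
function rankedIndices(probs) {
  return probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
}

// Top-k: keep the k most probable tokens, zero the rest, renormalize.
function topK(probs, k) {
  const keep = new Set(rankedIndices(probs).slice(0, k));
  return renormalize(probs.map((p, i) => (keep.has(i) ? p : 0)));
}

// Top-p: keep the smallest prefix of ranked tokens whose mass reaches p.
function topP(probs, p) {
  const keep = new Set();
  let cumulative = 0;
  for (const i of rankedIndices(probs)) {
    keep.add(i);
    cumulative += probs[i];
    if (cumulative >= p) break;
  }
  return renormalize(probs.map((q, i) => (keep.has(i) ? q : 0)));
}

const logits = [4, 2, 1, 0];        // illustrative scores for four tokens
const native = softmax(logits, 1);  // the model's own distribution
const cold = softmax(logits, 0.1);  // nearly all mass on the argmax
const hot = softmax(logits, 2);     // flatter: the top token loses mass

console.log({ native, cold, hot, top2: topK(native, 2), nucleus: topP(native, 0.9) });
```

Note what none of these functions can do: they only redistribute mass among continuations that are already in `logits`. A continuation the context never made plausible stays implausible at every temperature.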
The design takeaway is counterintuitive but important. The same sampling settings that make the design phase creative make the implementation phase dangerous. Higher temperature and a wide nucleus help the model propose a genuinely novel filter UX, because in that regime the conditional distribution is broad and exploration pays off. But in the code phase the broad-exploration setting does not magically discover a scalable architecture. The scalable architecture is not a high-probability continuation that the sampler is narrowly missing; it carries negligible probability anywhere in the local distribution unless scale constraints are in the context. Sampling differently cannot conjure information the prompt never supplied. This is the hinge on which the rest of the article turns: you do not fix scale-blind code by turning a knob on the decoder. You fix it by changing the conditioning, that is, the prompt.
3. The Implementation Bottleneck: Sequential Code Generation and Scale Blindness
This is the core mechanical section. We are going to take apart exactly why a non-linear engine emits linear-feeling, scale-oblivious code, and we will go deeper than "it predicts the next token." There are four distinct mechanisms stacked on top of each other, and they compound.
3.1 Code Generation Is a Low-Entropy, Left-to-Right Constraint Satisfaction Problem
Natural language is forgiving. There are thousands of acceptable ways to phrase a sentence, and the conditional distribution over the next word is broad. Code is not forgiving. It is a formal language with a grammar, a type system, scope rules, and a compiler or interpreter that rejects the output categorically if a single token is wrong. The model has internalized this. Empirically and structurally, the conditional distributions during code generation are far more peaked than during prose, because the space of syntactically valid and idiomatic continuations at any point is narrow.
Now layer on the autoregressive constraint. The model emits tokens strictly left to right and cannot revise a token once emitted. There is no backtracking, no second pass, no "actually, let me restructure the data flow now that I see where this is going." Every token is committed. This is the deep reason code generation feels linear and rigid even though the engine producing it is not: the model is performing online constraint satisfaction under an irreversibility constraint. To stay valid, it must, at each step, choose the continuation that is most likely to keep the program well-formed given what it has already committed to.
The consequence is a powerful bias toward canonical structure established early. Suppose the first architectural token committed is a server-rendered Liquid loop. Every subsequent token is now conditioned on "we are inside a Liquid loop." The model will faithfully, fluently, and correctly complete a Liquid loop, because that is now the high-probability path consistent with its own prefix. The decision that mattered (loop versus paginated async fetch versus prebuilt index) was made in the first few tokens, under a distribution that favored the most common pattern, and was then locked in by autoregression. The model does not "decide on an architecture" and then implement it. It stumbles into an architecture token-by-token and then is trapped by it. This is why the first sentence of your prompt's constraints matters more than the last: it shifts the distribution before the irreversible commitments happen.
3.2 Greedy Local Coherence Versus Global Optimality
Decoding optimizes a local objective: maximize (roughly) the probability of the next token given the prefix. Even with beam search or sampling, the horizon is short and the objective is sequence likelihood, not runtime performance. There is no term anywhere in the decoding objective for "p95 latency," "main-thread blocking time," or "API cost points consumed." The loss the model was trained on was next-token cross-entropy over a corpus of human-written code. That corpus is dominated by examples optimized for readability and correctness in small contexts: tutorials, Stack Overflow answers, library quickstarts, demo apps. Scale-hardened code (the kind with cursor pagination, request coalescing, backpressure, and cache invalidation) is a small minority of the training signal, and it is rarely the simplest expression of a given feature.
So you have a generator whose objective is local-likelihood and whose training distribution over-represents simple patterns. The product of those two is a strong pull toward the implementation that a competent developer would write for a demo. That implementation is locally coherent (it reads beautifully), passes type checks, and is globally wrong for production. The model is not making a mistake by its own objective. It is doing exactly what it was optimized to do. The mismatch is between its objective (likely, valid, idiomatic tokens) and yours (code that holds up at 2,000+ products under a performance budget).
3.3 Scale Blindness: The Absence of Implicit Simulation
Here is the mechanism that engineers most often miss, and it is the heart of section 3.
When a senior engineer reads {% for product in collection.products %} followed by a nested loop over variants, something fires in their head automatically. They simulate. They think: "collection.products caps at 50 per page in Liquid, so either this silently truncates or it is wrapped in pagination I am not seeing; if it is the full catalog, that is 2,000 products times 4 variants, that is 8,000 iterations of server-side string rendering inside a single request, and Liquid rendering has a wall-clock budget before the storefront times out." That simulation is a learned, embodied model of a runtime. It runs in the background whether the engineer wants it to or not, because they have felt the pain of a page that times out.
The LLM has no such simulator. This is not a metaphor; it is a literal architectural fact. During generation:
- It does not execute the code it is writing. There is no interpreter in the loop, no runtime, no event loop, no rendering pipeline.
- It does not run load tests, profile a flame graph, or measure Interaction to Next Paint. It has never experienced a blocked main thread.
- It does not maintain a cost counter against a leaky bucket, or a wall-clock budget against a Liquid render timeout, or a memory ceiling against a mobile device's heap.
What it has instead is a statistical association between certain code shapes and certain words ("this can be slow," "consider pagination for large datasets") that appears in its training data near those shapes. That association only surfaces in the output if the context makes it the probable thing to say. Absent an explicit prompt about scale, the context "implement a collection filter" does not make "first, reason about the cost of this at 8,000 iterations" a high-probability continuation, because most training examples of "implement a collection filter" do not contain that reasoning. So the model writes the loop, with no internal alarm, because there is no internal alarm to ring.
This is scale blindness: the model optimizes for local logic correctness and design-spec compliance (does it match the Polaris guidelines, does it implement faceted AND/OR semantics correctly, does it produce valid JSX) while being structurally incapable of implicitly simulating execution behavior at volume unless that behavior is made explicit in its context. Two corollaries follow, and both are important:
- Compliance is not performance. A model is very good at matching a design system, because the spec ("use Polaris spacing tokens, use the Filters component, follow the IndexFilters interaction pattern") is declarative, present in the prompt, and verifiable from the text alone. Performance is none of those things by default. It is emergent, environmental, and invisible in the source. The model will reliably satisfy the part of the request that is checkable from the text and silently miss the part that is only observable at runtime.
- Correctness at N=30 is the trap. The naive code is genuinely correct. It returns the right filtered set. It looks right in review. It demos perfectly. The defect is not in the logic; it is in the logic's interaction with volume, latency, and platform quotas, none of which are represented in the artifact the reviewer is reading. Scale-blind bugs are invisible precisely where humans look for bugs.
3.4 Why Attention and the Context Window Make This Worse at Architectural Scope
There is a fourth, subtler mechanism. Architectural quality is a long-range, global property of a codebase: the decision to paginate interacts with the cache layer, which interacts with the URL state encoding, which interacts with how the frontend hydrates. But attention, while it can in principle attend across the whole context window, is in practice strongest over local and recently-emitted tokens, and the training signal that rewards global structural coherence across a large file is weaker and rarer than the signal that rewards local line-by-line correctness. The model is far better at "this line is correct given the previous line" than at "this module's data-flow architecture is correct given the system's load profile," because the former is densely supervised by the corpus and the latter is sparsely supervised.
Put concretely: even when an LLM does emit a paginated fetch in one place, it will often, three hundred tokens later, write a .filter() over the assembled full array on the client, quietly reintroducing the very O(N) main-thread pass that pagination was supposed to avoid. The two facts ("we paginate to avoid loading everything" and "we then filter everything in memory") are not contradictory at the local token level; each is individually idiomatic. The contradiction is global, and global is exactly where the generator is weakest. This is why scale-blind failures are frequently partial and inconsistent: the model applies a scalable pattern in the spot where it is most cued and abandons it where it is not.
3.5 The Summary of the Bottleneck
Stack the four mechanisms:
- Low-entropy, irreversible decoding funnels the model onto the canonical pattern and locks it in early.
- A local-likelihood objective trained on demo-grade code makes the canonical pattern the demo pattern.
- The absence of an implicit runtime simulator means no internal alarm fires when that pattern is non-performant at volume.
- Weak long-range structural supervision means even partially-good architectures are reintroduced as O(N) somewhere the model was not cued.
None of these is fixed by a bigger model or a different temperature. All four are addressed by the same intervention: inject the runtime, the scale, and the platform quotas into the context, so that the high-probability continuation becomes the scalable one and so that the model has something to "simulate" against. That is section 5. First, let us watch the failure and the fix concretely.
4. Case Study: Designing a Shopify Feature at Scale (2,000+ Products)
The feature: an advanced multi-faceted collection filter for Discount Prime, our Shopify discount and pricing app. On a collection page, the merchant's customers should be able to filter 2,000+ live products (3 to 5 variants each) across several facets at once: product type, vendor, price range, availability, and a discount-eligibility metafield that Discount Prime writes per product. The UI must follow Polaris conventions, feel instant, and keep filter state in the URL so a filtered view is shareable. The performance budget is non-negotiable: sub-second load, no main-thread jank, and strict respect for Shopify's platform quotas.
Watch what an unconstrained model produces, why it fails, and what the same model produces once scale is in the context.
4.1 The Naive LLM Approach (The Linear Failure)
Asked plainly to "build the filter," the model commits, in its first tokens, to server-side rendering over the full collection. The output looks like this.
{% comment %} collection-filter.liquid : naive, scale-blind version {% endcomment %}
<div class="dp-filter-results">
  {% assign type = current_tags %}
  {% for product in collection.products %}
    {% assign show = true %}

    {% comment %} Facet 1: product type {% endcomment %}
    {% if filter_type != blank and product.type != filter_type %}
      {% assign show = false %}
    {% endif %}

    {% comment %} Facet 2: vendor {% endcomment %}
    {% if filter_vendor != blank and product.vendor != filter_vendor %}
      {% assign show = false %}
    {% endif %}

    {% comment %} Facet 3: price band : recompute the min price across every variant {% endcomment %}
    {% assign vmin = 999999 %}
    {% for variant in product.variants %}
      {% if variant.price < vmin %}{% assign vmin = variant.price %}{% endif %}
    {% endfor %}
    {% if filter_price_max != blank and vmin > filter_price_max %}
      {% assign show = false %}
    {% endif %}

    {% comment %} Facet 4: discount-eligibility metafield written by Discount Prime {% endcomment %}
    {% if product.metafields.discount_prime.eligible.value != true %}
      {% assign show = false %}
    {% endif %}

    {% if show %}
      {% render 'product-card', product: product %}
    {% endif %}
  {% endfor %}
</div>
And on the client, the React variant the model reaches for when asked to make it "dynamic":
// FilterPanel.jsx : naive, scale-blind version
import { useEffect, useMemo, useState } from "react";

export default function FilterPanel() {
  const [allProducts, setAllProducts] = useState([]);
  const [filters, setFilters] = useState({ type: null, vendor: null, maxPrice: null });

  // Pull the ENTIRE catalog on mount.
  useEffect(() => {
    fetch("/products.json?limit=2000") // one giant payload
      .then((r) => r.json())
      .then((d) => setAllProducts(d.products));
  }, []);

  // Re-filter the entire array on every keystroke, on the main thread.
  const visible = useMemo(() => {
    return allProducts.filter((p) => {
      if (filters.type && p.product_type !== filters.type) return false;
      if (filters.vendor && p.vendor !== filters.vendor) return false;
      if (filters.maxPrice && Math.min(...p.variants.map((v) => +v.price)) > filters.maxPrice) return false;
      return true;
    });
  }, [allProducts, filters]);

  return <ResultsGrid products={visible} />;
}
Both are correct. Both demo perfectly with 30 products. Here is the mathematical analysis of why they break at 2,000+, which is exactly the analysis the model did not perform because nothing in its context asked it to.
Liquid path failure.
- collection.products is capped at 50 items per page in Liquid, and {% for %} loops are themselves limited to 50 iterations per page. So the naive loop is not just slow, it is silently wrong: it renders at most the first 50 products and quietly drops the other 1,950+. The merchant sees a "working" filter that omits 97% of the catalog, and nobody notices in a small demo store. To process the whole catalog you must use the {% paginate %} tag (max 250 per page, paginating no further than the 25,000th item), which means the single-template-loop architecture is structurally incapable of filtering the full set in one render.
- Even bounded at 250 per page, the inner per-variant min-price loop multiplies iterations: 250 products times 4 variants is 1,000 inner iterations of string-rendering work, per page, inside one synchronous request. Liquid renders server-side within a wall-clock budget; heavy nested loops push render time up and risk the storefront render timeout, returning a degraded or error page.
- Faceting server-side this way also defeats caching and CDN edge delivery, because the rendered HTML now varies by every combination of filter parameters.
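The arithmetic behind the Liquid failure can be reproduced with a small helper. This is a back-of-the-envelope sketch rather than a measurement: `liquidCoverage` and its parameter names are hypothetical, and the 50-item cap and 250-item page size are simply the figures quoted above.

```javascript
// Rough iteration and coverage arithmetic for a nested Liquid loop.
function liquidCoverage({ catalogSize, variantsPerProduct, pageSize }) {
  const unpaginatedCap = 50; // items rendered without {% paginate %}, per the text above
  const rendered = Math.min(catalogSize, unpaginatedCap);
  return {
    // Products the naive loop never renders.
    droppedWithoutPaginate: catalogSize - rendered,
    // Separate page renders needed to cover the whole catalog.
    pagesNeeded: Math.ceil(catalogSize / pageSize),
    // Inner per-variant iterations within one full page.
    innerIterationsPerPage: Math.min(pageSize, catalogSize) * variantsPerProduct,
    // Inner per-variant iterations across the whole catalog.
    innerIterationsTotal: catalogSize * variantsPerProduct,
  };
}

const estimate = liquidCoverage({ catalogSize: 2000, variantsPerProduct: 4, pageSize: 250 });
// droppedWithoutPaginate: 1950, pagesNeeded: 8,
// innerIterationsPerPage: 1000, innerIterationsTotal: 8000
console.log(estimate);
```

This is the calculation a senior engineer runs in their head on reading the loop, and it is the one the model skips unless the catalog size appears in its context.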
Client path failure.
- fetch("/products.json?limit=2000") pulls a multi-megabyte JSON payload over the wire on first paint. That is a direct hit to LCP (the largest contentful element waits on the network) and a large transfer cost on mobile.
- JSON.parse of a multi-megabyte body and the subsequent setState happen on the main thread, blocking it. Then Array.prototype.filter over 2,000 objects, each spawning p.variants.map(...) and Math.min(...), runs again on the main thread on every keystroke. This is the classic INP killer: each interaction schedules a long task, the next paint is delayed, and the UI feels frozen. A long task over several thousand objects routinely blows past the 200 ms "good INP" threshold.
- If you reach instead for the Storefront GraphQL API to fetch everything, you collide with the leaky bucket: the Storefront API bucket holds 2,000 cost points and refills at 1,000 per second, and a single query cannot exceed 1,000 points. A query that requests 2,000 products with variants, prices, images, and metafields has a per-field cost that, multiplied across the connection, blows the single-query ceiling outright, and a naive "load everything on mount" pattern across many users drains the bucket and gets throttled with 429/throttled responses.
The defect in every case is identical: the artifact is logically correct and environmentally catastrophic, and the environment is invisible in the source.
4.2 The Architected LLM Approach (The Non-Linear Solution)
Now give the model the scale, the budgets, and the platform boundaries (the section 5 blueprint), and force it to write the bottleneck analysis before any code. The architecture it produces is qualitatively different. The shape of the solution is: do not render or transport the whole catalog; build a compact index, paginate the network, push filtering off the main thread, and hydrate progressively.
1. Server-side: emit a compact, cacheable facet index, not the product HTML. Instead of rendering 2,000 product cards, the Liquid layer emits a small JSON index of just the fields the filter needs (id, type, vendor, min price as an integer, eligibility flag, handle). This is built once, paginated, and cacheable because it does not vary by filter state.
{% comment %} facet-index.liquid : emit a compact index, paginate to cover the full catalog {% endcomment %}
{% paginate collection.products by 250 %}
  <script type="application/json" data-dp-facet-page="{{ paginate.current_page }}">
  [
    {%- for product in collection.products -%}
      {%- assign vmin = product.price_min -%}
      {
        "id": {{ product.id }},
        "h": {{ product.handle | json }},
        "t": {{ product.type | json }},
        "v": {{ product.vendor | json }},
        "p": {{ vmin }},
        "e": {{ product.metafields.discount_prime.eligible.value | default: false }}
      }{%- unless forloop.last -%},{%- endunless -%}
    {%- endfor -%}
  ]
  </script>
{% endpaginate %}
Note the use of product.price_min, a precomputed property, instead of an inner per-variant loop. The N×M iteration is gone. The payload per product drops from a full card to six short fields.
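Because each rendered page carries only its own slice of the index, the client has to stitch the slices together. The sketch below shows one way that could look. `mergeFacetPages` is a hypothetical helper, and the sample entries are invented. In a browser the input strings might be read from the `textContent` of the `script[data-dp-facet-page]` elements; here they are passed in directly so the logic runs anywhere.

```javascript
// Merge per-page facet index JSON into one array, de-duplicating by product id.
function mergeFacetPages(pageJsonStrings) {
  const byId = new Map();
  for (const json of pageJsonStrings) {
    for (const entry of JSON.parse(json)) {
      byId.set(entry.id, entry); // a later page overwrites an earlier duplicate
    }
  }
  return [...byId.values()];
}

// Invented sample data in the compact shape emitted by the template above.
const pageOne = '[{"id":1,"h":"red-shoe","t":"Shoes","v":"Acme","p":4999,"e":true},' +
                '{"id":2,"h":"blue-hat","t":"Hats","v":"Acme","p":1999,"e":false}]';
const pageTwo = '[{"id":3,"h":"green-shoe","t":"Shoes","v":"Zenith","p":5999,"e":true}]';

const facetIndex = mergeFacetPages([pageOne, pageTwo]);
console.log(facetIndex.length, facetIndex.map((p) => p.id));
```

Keying by id is a defensive choice: if a product shifts between pages while the pages are being collected, it is kept once rather than twice.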
2. Network: cursor-based GraphQL pagination, never a single mega-query. When the index must come from the Storefront API rather than Liquid, the model now generates a cursor-paginated fetch that respects the leaky bucket by requesting bounded pages and only the fields it needs, keeping each query well under the 1,000-point ceiling.
// fetchFacetIndex.js : cursor-based pagination, minimal field cost
const PAGE = `
  query FacetPage($cursor: String) {
    collection(handle: "all") {
      products(first: 250, after: $cursor) {
        pageInfo { hasNextPage endCursor }
        nodes {
          id
          handle
          productType
          vendor
          priceRange { minVariantPrice { amount } }
          eligible: metafield(namespace: "discount_prime", key: "eligible") { value }
        }
      }
    }
  }`;

export async function fetchFacetIndex(client) {
  let cursor = null;
  let hasNext = true;
  const index = [];
  while (hasNext) {
    const data = await client.request(PAGE, { cursor }); // bounded cost per call
    const conn = data.collection.products;
    for (const n of conn.nodes) {
      index.push({
        id: n.id,
        h: n.handle,
        t: n.productType,
        v: n.vendor,
        p: Number(n.priceRange.minVariantPrice.amount),
        e: n.eligible?.value === "true",
      });
    }
    hasNext = conn.pageInfo.hasNextPage;
    cursor = conn.pageInfo.endCursor;
    // Yield between pages so we never monopolize the bucket or the main thread.
    await new Promise((r) => setTimeout(r, 0));
  }
  return index;
}
We request only the fields the filter consumes (low per-query cost), page in bounded chunks of 250 (each query stays well under 1,000 points), and use endCursor/hasNextPage for stable cursor-based traversal instead of offset paging, which Shopify connections are built for.
3. Compute: move filtering off the main thread into a Web Worker. The index is small, but filtering it on every keystroke still belongs off the main thread to protect INP. The worker holds the index and answers filter queries; the main thread only ever touches the result slice it needs to paint.
// filter.worker.js
let index = [];

self.onmessage = (e) => {
  const { type, payload } = e.data;
  if (type === "init") { index = payload; return; }
  if (type === "query") {
    const f = payload;
    const out = [];
    for (let i = 0; i < index.length; i++) {
      const p = index[i];
      if (f.type && p.t !== f.type) continue;
      if (f.vendor && p.v !== f.vendor) continue;
      if (f.maxPrice && p.p > f.maxPrice) continue;
      if (f.eligibleOnly && !p.e) continue;
      out.push(p.id);
    }
    self.postMessage({ ids: out }); // hand back ids only; hydrate lazily
  }
};
// FilterPanel.jsx : architected version
import { useEffect, useRef, useState } from "react";

export default function FilterPanel({ facetIndex }) {
  const workerRef = useRef(null);
  const [visibleIds, setVisibleIds] = useState([]);

  useEffect(() => {
    const w = new Worker(new URL("./filter.worker.js", import.meta.url));
    w.onmessage = (e) => setVisibleIds(e.data.ids);
    w.postMessage({ type: "init", payload: facetIndex });
    workerRef.current = w;
    return () => w.terminate();
  }, [facetIndex]);

  const onChange = (filters) => {
    // Debounced, off-main-thread; the keystroke never blocks paint.
    workerRef.current?.postMessage({ type: "query", payload: filters });
    syncFiltersToURL(filters); // shareable, debounced history update
  };

  // Only hydrate cards for the current viewport page of ids (lazy/windowed).
  return <LazyResultsGrid ids={visibleIds} pageSize={24} onChange={onChange} />;
}
4. Delivery and state: lazy hydration, windowing, and edge-cacheable index. Product cards are hydrated only for the visible window of result ids (e.g., 24 at a time) rather than all matches, keeping the DOM and the hydration cost bounded regardless of how many products match. The compact facet index is cacheable at the edge because it is filter-independent, so most visitors get it from the CDN rather than recomputing it. Filter state is encoded in the URL and updated with a debounced history.replaceState, so views are shareable without thrashing navigation.
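The three mechanics in this step (debouncing, URL-encoded filter state, and windowing) can each be sketched in a few lines. Everything named here is an illustrative assumption: `debounce`, `encodeFilters`, `decodeFilters`, and `windowIds` are hypothetical helpers, and the filter keys mirror the ones the worker above reads.

```javascript
// Returns a wrapper that delays fn until waitMs pass without another call.
function debounce(fn, waitMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Serialize only the active filters into a query string.
function encodeFilters(filters) {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(filters)) {
    const inactive = value === null || value === undefined || value === "" || value === false;
    if (!inactive) params.set(key, String(value));
  }
  params.sort(); // stable key order, so equal filter sets give equal URLs
  return params.toString();
}

// Parse a query string back into the filter shape the worker expects.
function decodeFilters(query) {
  const params = new URLSearchParams(query);
  return {
    type: params.get("type"),
    vendor: params.get("vendor"),
    maxPrice: params.has("maxPrice") ? Number(params.get("maxPrice")) : null,
    eligibleOnly: params.get("eligibleOnly") === "true",
  };
}

// Slice out the ids for one window of results (zero-based page number).
function windowIds(ids, page, pageSize = 24) {
  return ids.slice(page * pageSize, (page + 1) * pageSize);
}

const query = encodeFilters({ type: "Shoes", vendor: null, maxPrice: 5000, eligibleOnly: true });
console.log(query, decodeFilters(query));

// In a browser, the URL update could be wired up along these lines:
// const syncFiltersToURL = debounce(
//   (filters) => history.replaceState(null, "", "?" + encodeFilters(filters)),
//   250
// );
```

Sorting the parameters is a small design choice that pays off for caching and sharing: two users who apply the same filters in a different order end up on the same URL.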
The transformation is total, and it came from the same model. What changed was not the weights and not the temperature. What changed is that the context now contained the data scale, the performance budget, and the platform's quota mechanics, so the high-probability continuation shifted from "render the loop" to "build the index, paginate the cursor, offload the compute." We made the runtime visible, and the generator could finally "simulate" against it.
| Dimension | Naive (scale-blind) | Architected (constraint-driven) |
|---|---|---|
| Catalog handling | Single Liquid loop, silently capped at 50, or full-catalog client fetch | Compact paginated facet index, cursor-based GraphQL |
| Main thread | Parse + filter thousands of objects per keystroke | Filtering in a Web Worker, ids only to main thread |
| Network | One multi-MB payload on mount | Bounded 250-item pages, minimal fields, edge-cached index |
| Shopify quotas | Single mega-query exceeds 1,000-point ceiling, drains bucket | Each query well under ceiling, paced between pages |
| Core Web Vitals | High LCP, INP > 200 ms | Sub-second LCP, INP within budget, no long tasks |
| Correctness at 2,000+ | Logically right, environmentally broken | Holds under load |
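The "paced between pages" entry in the Shopify quotas row can be made concrete with a small planning function. The following is a sketch under stated assumptions: `planPacing` is a hypothetical helper, the bucket capacity, restore rate, and per-page cost are caller-supplied numbers rather than official figures, and the model conservatively ignores any points restored while a request is in flight.

```javascript
// Hypothetical pacing planner for a leaky-bucket quota.
// All numbers are inputs; check the platform's documentation for real limits.
function planPacing({ pages, costPerPage, capacity, restorePerSecond }) {
  if (costPerPage > capacity) {
    throw new RangeError("a single page costs more than the bucket can ever hold");
  }
  let available = capacity; // assume the bucket starts full
  const waits = [];
  for (let i = 0; i < pages; i++) {
    // Points we are short by before this page can be requested.
    const deficit = Math.max(0, costPerPage - available);
    waits.push(deficit / restorePerSecond); // seconds to wait before this page
    // After waiting, exactly `deficit` points have been restored; then spend the page cost.
    available = available + deficit - costPerPage;
  }
  return waits;
}

// Example with illustrative numbers: 4 pages at 300 points each,
// a 1,000-point bucket, and an assumed restore rate of 50 points per second.
const waits = planPacing({ pages: 4, costPerPage: 300, capacity: 1000, restorePerSecond: 50 });
console.log(waits); // [0, 0, 0, 4]
```

The first three pages fit in the full bucket; the fourth is 200 points short, so the client waits 4 seconds instead of being throttled.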
5. Practical Solution: Overriding Scale Blindness via Constraint-Driven Prompt Engineering
The fix follows directly from section 3. Because the model has no implicit runtime simulator and decodes toward the most probable (demo-grade) pattern, you must supply the runtime, the scale, and the quotas as explicit context so that the scalable pattern becomes the probable one and the model has something concrete to reason against. You are not coaxing a better mood out of the model. You are changing the conditional distribution by changing what it is conditioned on.
Four ingredients turn a scale-blind prompt into a constraint-driven one:
- Role prompting that names the constraint domain. "Principal Software Architect and Performance Optimization Expert specializing in Shopify" shifts the distribution toward the corner of training data where scale reasoning actually lives. Role is not flavor; it is a prior over which subset of patterns the model samples from.
- Exact data scale, stated as numbers. "2,000+ products, 3 to 5 variants each" gives the model the operands for the arithmetic it must be forced to do. Vague scale ("a lot of products") does not move the distribution; concrete N does.
- Hard platform boundaries. Name the leaky bucket, the 1,000-point single-query ceiling, the Liquid 50/250 limits, and the Core Web Vitals budgets. These are the "physics" of the target runtime; stating them gives the model the walls to design within.
- Worst-case-first execution order. Force a written bottleneck analysis before any code. This is the single highest-leverage instruction, because it makes the model commit its scale reasoning to tokens before the irreversible architectural commitment described in section 3.1 happens. Once "this naive loop is roughly 8,000 iterations (2,000 products at about 4 variants each) and exceeds the render budget" is in the context, the next architectural token is no longer the loop.
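The arithmetic these ingredients are meant to force is short enough to write down. A minimal sketch, assuming the scale figures used throughout this article (2,000 products, 3 to 5 variants, 250-item pages); `estimateNaiveLoad` is a hypothetical name, and the function only counts rows and pages, it does not measure render time.

```javascript
// Hypothetical estimator: turns the stated scale into the operands of a bottleneck analysis.
function estimateNaiveLoad({ products, minVariants, maxVariants, pageSize }) {
  return {
    // Iterations of a nested products-by-variants loop, best to worst case.
    variantRowsMin: products * minVariants,
    variantRowsMid: products * ((minVariants + maxVariants) / 2),
    variantRowsMax: products * maxVariants,
    // Round trips needed to walk the catalog with cursor pagination.
    pagesNeeded: Math.ceil(products / pageSize),
  };
}

const load = estimateNaiveLoad({ products: 2000, minVariants: 3, maxVariants: 5, pageSize: 250 });
console.log(load);
// { variantRowsMin: 6000, variantRowsMid: 8000, variantRowsMax: 10000, pagesNeeded: 8 }
```

The 8,000 figure in the last bullet is the midpoint case; the worst case the analysis should plan for is 10,000 rows.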
5.1 The Gold-Standard Prompt Blueprint
Use this as a template. The bracketed slot is the only part you change per feature.
Role: You are a Principal Software Architect and Performance Optimization Expert specializing in the Shopify Platform (Polaris Design System, Liquid Engine, and Storefront/GraphQL APIs).
Problem Statement: I need to design a highly dynamic [INSERT FEATURE NAME, e.g., Advanced Multi-Faceted Collection Filter]. The UI must strictly adhere to Shopify Polaris design guidelines and offer a highly innovative UX, but it must be architected for extreme scale.
Hard Engineering Constraints:
- Data Scale: The storefront contains 2,000+ live products, with an average of 3 to 5 variants per product.
- Performance Budgets: The solution must maintain sub-second Page Load Times, zero main-thread blocking (Optimized Interaction to Next Paint - INP), and minimal Largest Contentful Paint (LCP) impact.
- Platform Boundaries: The architecture must absolutely respect Shopify API Rate Limits (Leaky Bucket Algorithm) and avoid heavy, nested Liquid loops or un-paginated server-side rendering cascades.
Execution Steps: Before generating any codebase, schemas, or UI mockups, write an isolated section titled "Performance & Scale Bottleneck Analysis". In this section, mathematically analyze why naive, textbook implementation patterns fail under a 2,000-product load on Shopify. Following this analysis, provide a production-ready architectural blueprint leveraging Lazy Loading, GraphQL Cursor-Based Pagination, and Client-Side Hydration.
Each clause earns its place by countering one of the failure mechanisms from section 3:
- The role line counters the demo-grade training bias (3.2) by shifting the sampling prior toward production-hardened patterns.
- The numeric data scale gives the model operands so the bottleneck math is possible at all (3.3).
- The performance budget and platform boundaries supply the runtime that the model otherwise cannot simulate (3.3); they are the missing physics.
- The "analysis before code" ordering is the antidote to irreversible early commitment (3.1): it forces scale reasoning into the context before the architecture is chosen, and as a bonus it gives you an auditable artifact you can check.
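If the blueprint is reused across features, it can help to generate it from structured inputs so that vague scale never reaches the model. The sketch below is one way to do that, not a prescribed tool: `buildConstraintPrompt` and its field names are hypothetical, the prompt wording is abbreviated from the template above, and the validation rules are assumptions about what counts as "concrete".

```javascript
// Hypothetical helper: field names are illustrative, not part of any Shopify or LLM API.
function buildConstraintPrompt({ feature, productCount, variantRange, budgets, boundaries }) {
  if (typeof feature !== "string" || feature.trim() === "") {
    throw new TypeError("feature must be a non-empty string");
  }
  // Reject vague scale ("a lot of products"): the prompt needs concrete numbers.
  if (!Number.isInteger(productCount) || productCount <= 0) {
    throw new TypeError("productCount must be a positive integer");
  }
  const [minVariants, maxVariants] = variantRange;
  if (!Number.isInteger(minVariants) || !Number.isInteger(maxVariants) || minVariants > maxVariants) {
    throw new TypeError("variantRange must be [min, max] integers");
  }
  if (!budgets.length || !boundaries.length) {
    throw new TypeError("budgets and boundaries must each list at least one item");
  }
  return [
    "Role: You are a Principal Software Architect and Performance Optimization Expert specializing in the Shopify Platform.",
    `Problem Statement: I need to design a highly dynamic ${feature}, architected for extreme scale.`,
    "Hard Engineering Constraints:",
    `- Data Scale: ${productCount}+ live products, ${minVariants} to ${maxVariants} variants per product.`,
    ...budgets.map((b) => `- Performance Budget: ${b}`),
    ...boundaries.map((b) => `- Platform Boundary: ${b}`),
    // The analysis instruction comes last so it governs everything the model writes next.
    'Execution Steps: Before generating any code, write a section titled "Performance & Scale Bottleneck Analysis", then provide the architectural blueprint.',
  ].join("\n");
}

const prompt = buildConstraintPrompt({
  feature: "Advanced Multi-Faceted Collection Filter",
  productCount: 2000,
  variantRange: [3, 5],
  budgets: ["Sub-second page load", "No main-thread blocking (INP)"],
  boundaries: ["Respect API rate limits (leaky bucket)", "No nested Liquid loops"],
});
console.log(prompt);
```

Throwing on a non-numeric product count is deliberate: it makes the "concrete N" rule a property of the tooling rather than something each author has to remember.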
5.2 The Mindset Shift: LLMs as Peer Architects Under Explicit SLAs
The deepest change is not in the prompt template; it is in how you frame the collaboration. Treating an LLM as autocomplete (handing it a function signature and accepting the most probable body) gets you exactly the scale-blind default, because autocomplete is, definitionally, the high-probability local continuation. Treating it as a peer architect means you owe it what you would owe a senior engineer joining the project: the data volumes, the latency budgets, the platform quotas, the failure modes you have already hit. In other words, you give it the engineering SLA up front.
A useful discipline: write the non-functional requirements into the prompt the way you would write them into an Architecture Decision Record: data scale, throughput, p95 latency target, error budget, platform rate limits, plus the explicit instruction to analyze worst-case behavior before proposing a design. The model's non-linear engine is fully capable of synthesizing an enterprise-grade architecture; the case study shows it producing one from the same weights. It simply will not do so unprompted, because nothing in an unconstrained request makes the scalable pattern more probable than the textbook one, and nothing gives it a runtime to simulate against.
The takeaway for senior engineers is therefore precise and actionable. The model's brilliance at design and its naivety at implementation are two faces of one mechanism: a non-linear representation engine sampled by a local-likelihood, scale-blind decoder. You cannot change the mechanism. You can change its conditioning. Put the runtime in the prompt, force the bottleneck analysis before the code, and the same system that wrote the loop that takes down your storefront will write the index, the cursor, and the worker that keeps it up.

