LLM streaming with partial JSON reconstruction keeps your AI interfaces fast. Learn to parse incomplete tokens and update UI components in real time.
Last month, I spent about three days debugging a "stuttering" chat interface that felt sluggish despite using high-speed streaming. The issue wasn't the API latency; it was that we were waiting for the entire JSON payload to finish before rendering anything. If you're building production AI tools, LLM streaming isn't just about showing text character-by-character; it’s about providing immediate, structured feedback to the user.
When you need to stream structured data, the standard JSON.parse() approach fails immediately because the stream is, by definition, invalid JSON until the very last byte arrives. To bridge this gap, you need to implement a parser that can handle partial chunks.
We first tried simply concatenating chunks and wrapping JSON.parse() in a try-catch block. It worked for small objects, but as soon as the LLM generated a nested list or a long string, the parser threw on every intermediate chunk, and the UI stayed blank until the stream finished.
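To see that failure concretely, here is a minimal sketch of the naive strategy. The chunk contents and the `naiveParseLog` helper are made up for illustration; only `JSON.parse()` comes from the text above.

```javascript
// Hypothetical chunks, as an LLM might emit them for a small structured reply.
const chunks = ['{"title": "Str', 'eaming", "tags": ["js", ', '"react"]}'];

// Naive strategy: concatenate and try JSON.parse() after every chunk.
function naiveParseLog(incoming) {
  let accumulated = '';
  const log = [];
  for (const chunk of incoming) {
    accumulated += chunk;
    try {
      log.push({ ok: true, value: JSON.parse(accumulated) });
    } catch (err) {
      // Every prefix of the payload is invalid JSON, so we land here
      // for all chunks except the last one.
      log.push({ ok: false, value: null });
    }
  }
  return log;
}

const log = naiveParseLog(chunks);
console.log(log.map((entry) => entry.ok)); // [ false, false, true ]
```

Nothing renderable exists until the third chunk, which is exactly the blank-UI behavior described above.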
If you want to master the basics of this approach, I highly recommend checking out my guide on LLM Streaming Structured Data: Real-Time Parsing Guide. It covers the fundamental state machine approach required to handle these edge cases.
To make this work, you need a library that tolerates incomplete input. I’ve had success with jsonrepair or similar libraries that attempt to close open brackets and quotes automatically.
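If you are curious what "closing open brackets and quotes" looks like, here is a deliberately small sketch. It is not how jsonrepair works internally and it is not our production parser; `closePartialJson` and this `parsePartialJson` are hypothetical names for illustration. It assumes the stream was cut either inside a string value or between elements. Other cut points (mid-key, right after a colon, inside `true`/`null`) still throw, and the caller is expected to keep the previous good state.

```javascript
// Appends whatever closers the truncated text still owes.
function closePartialJson(text) {
  const closers = []; // stack of '}' / ']' still owed
  let inString = false;
  let escaped = false;

  for (const ch of text) {
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === '{') closers.push('}');
    else if (ch === '[') closers.push(']');
    else if (ch === '}' || ch === ']') closers.pop();
  }

  let repaired = text;
  if (inString) {
    // A dangling backslash would escape the quote we are about to add.
    if (escaped) repaired = repaired.slice(0, -1);
    repaired += '"';
  }
  // A trailing comma means the next element has not started yet.
  repaired = repaired.replace(/,\s*$/, '');
  // Close the innermost container first.
  return repaired + closers.reverse().join('');
}

// Throws (like JSON.parse) when the repaired text is still not valid JSON.
function parsePartialJson(text) {
  return JSON.parse(closePartialJson(text));
}

console.log(parsePartialJson('{"title": "Str'));   // { title: 'Str' }
console.log(parsePartialJson('{"tags": ["js", ')); // { tags: [ 'js' ] }
```

A real library covers far more cases than this, which is the main reason to use one instead of maintaining your own.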
Here is a simplified pattern for how we handle this in our React components:
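```javascript
import { useEffect, useState } from 'react';
import { parsePartialJson } from './utils/parser';

const useStreamingJson = (stream) => {
  const [data, setData] = useState({});

  useEffect(() => {
    let cancelled = false;
    let accumulated = '';
    const reader = stream.getReader();
    // Reuse one decoder so multi-byte characters split across chunks decode correctly.
    const decoder = new TextDecoder();

    const read = async () => {
      while (!cancelled) {
        const { done, value } = await reader.read();
        if (done) break;
        accumulated += decoder.decode(value, { stream: true });
        try {
          setData(parsePartialJson(accumulated));
        } catch (e) {
          // Not parseable yet; keep the last good state and wait for more data.
        }
      }
    };
    read();

    // Stop reading if the component unmounts or the stream changes.
    return () => {
      cancelled = true;
      reader.cancel().catch(() => {});
    };
  }, [stream]);

  return data;
};
```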
This approach allows the UI to update as the model generates tokens. The key is making sure frontend performance doesn't tank because you're triggering a re-render on every single token. We usually add a debounce or a requestAnimationFrame throttle to cap updates at roughly one per frame (~60fps on most displays).
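For the requestAnimationFrame variant, a minimal sketch could look like the following. `createFrameThrottle` is a hypothetical helper, not something from React or from our codebase, and the `schedule` parameter exists only so the function can be exercised outside a browser.

```javascript
// Coalesces rapid updates so `apply` runs at most once per scheduled frame,
// always with the most recent value. Intermediate values are dropped.
function createFrameThrottle(apply, schedule = (cb) => requestAnimationFrame(cb)) {
  let latest;
  let scheduled = false;

  return function push(value) {
    latest = value;
    if (scheduled) return; // a frame is already pending; just remember the value
    scheduled = true;
    schedule(() => {
      scheduled = false;
      apply(latest);
    });
  };
}
```

In a hook like the one above, you would create the throttle once per stream, for example `const pushData = createFrameThrottle(setData)`, and call `pushData(partial)` instead of `setData(partial)`. Dropping intermediate snapshots is safe here because each partial parse already contains everything the previous one did.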
Even with a good parser, your structured output might drift. If your schema expects a number but the LLM starts spitting out a string, your UI will crash. I’ve written extensively about Getting reliable structured output from an LLM in production to help mitigate these common failure modes.
When you're dealing with token generation speeds that can hit 50-100 tokens per second, validation becomes expensive. I prefer to validate the final object fully only once the stream completes. During the stream, I treat the data as "optimistic" and keep the UI in a "loading/streaming" state.
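One way to model that split between optimistic snapshots and a single final validation is sketched below. The state shape, the function names, and the `{ title, score }` schema are all assumptions for illustration; in practice the `validate` callback could wrap whatever schema library you already use.

```javascript
// Hypothetical schema check: { title: string, score: number }.
function isValidSummary(value) {
  return (
    typeof value === 'object' && value !== null &&
    typeof value.title === 'string' &&
    typeof value.score === 'number'
  );
}

// During the stream: store the partial snapshot without validating it.
function onPartial(state, partial) {
  return { ...state, status: 'streaming', data: partial };
}

// Once the stream ends: strict parse, then one full validation pass.
function onComplete(state, fullText, validate) {
  try {
    const parsed = JSON.parse(fullText);
    return validate(parsed)
      ? { status: 'complete', data: parsed }
      : { status: 'invalid', data: state.data };
  } catch (err) {
    // The final payload is not even valid JSON; keep the last snapshot.
    return { status: 'invalid', data: state.data };
  }
}
```

The UI can then render `data` whenever it exists and use `status` to decide whether to show a streaming indicator, the final result, or an error state.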
There's a hidden cost to partial JSON parsing. By attempting to fix broken JSON on the fly, you might accidentally treat a truncated or hallucinated value as a valid field.
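A tiny illustration of that risk, using a made-up payload: if the stream is cut in the middle of a number, closing the object produces perfectly valid JSON with the wrong value.

```javascript
// Hypothetical final payload from the model.
const full = '{"amount": 1500, "currency": "USD"}';

// Simulate the stream being cut after the first digit of the number.
const prefix = full.slice(0, 12); // '{"amount": 1'

// A repair step that simply closes the open object yields valid JSON...
const repaired = JSON.parse(prefix + '}');

// ...but the value is not what the model was in the middle of writing.
console.log(repaired.amount);         // 1
console.log(JSON.parse(full).amount); // 1500
```

Nothing in the repaired object signals that `amount` is incomplete, which is one reason to treat mid-stream values as provisional until the final validation.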
If you find that your schemas are becoming too complex, it’s often better to switch to a tool that handles the heavy lifting, such as Zod. You can see how we handle that in Structured output: Implementing Deterministic JSON Schema Validation.
I’m still not 100% happy with how we handle "interrupted" streams: when the network cuts out, we’re left with a half-baked object that isn't quite valid. Next time, I think I’ll implement a more robust buffer that persists the last valid state to a local store, rather than relying solely on memory.
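I haven't built that buffer yet, so treat the following as a rough sketch of one possible shape rather than something we run. `createSnapshotBuffer` is a hypothetical helper, and `storage` is assumed to be any object with the `getItem`/`setItem`/`removeItem` methods of the Web Storage API, such as `window.localStorage`.

```javascript
function createSnapshotBuffer(storage, key) {
  return {
    // Save the latest successfully parsed partial object.
    save(partial) {
      storage.setItem(key, JSON.stringify({ savedAt: Date.now(), partial }));
    },
    // Return the last saved partial, or null if none exists or it is corrupt.
    restore() {
      const raw = storage.getItem(key);
      if (raw === null) return null;
      try {
        return JSON.parse(raw).partial;
      } catch (err) {
        return null;
      }
    },
    // Call when the stream finishes cleanly so stale data is not restored later.
    clear() {
      storage.removeItem(key);
    },
  };
}
```

Calling `save()` after each successful partial parse and `restore()` on reconnect would at least give the UI something valid to show. Writing to storage on every token would be wasteful, so the same throttling idea used for rendering would probably apply here too.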
Streaming is a game of managing expectations. If you show the user the data as it arrives, they’re much more patient with the model’s generation time. Just don't let the complexity of the parser become the bottleneck that slows down your application.