
# StoryDraw
StoryDraw is an interactive storytelling web application that merges real-time computer vision with generative AI. The user draws objects on a canvas, a ResNet-18 model classifies each sketch locally in the browser, and a language model weaves those objects into a dynamic, narrated story, chapter by chapter.
## How It Works
The experience unfolds across 5 chapters. Each chapter ends with a drawing prompt, and what the user draws (and how confidently the model recognizes it) directly shapes the narrative.
```
Story style selected by user
        ↓
LLM generates Chapter 1 + suggests 3 drawable objects
        ↓
User draws one of the suggested objects on the canvas
        ↓
ResNet-18 (ONNX, runs locally in the browser) classifies the sketch
        ↓
Top prediction + confidence score fed back to LLM as narrative context
        ↓
LLM generates next chapter, shaped by what was drawn (and how clearly)
        ↓
Repeat for 5 chapters → Story complete
```
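The "top prediction + confidence score" step can be sketched as a softmax over the classifier's logits followed by an argmax. This is an illustrative reconstruction, not the app's actual code; the class list here is a subset of the 20 categories.

```typescript
// Illustrative subset of the 20 trained classes.
const CLASSES = ["cat", "dog", "bird", "fish", "face"];

/** Convert raw model logits into a probability distribution. */
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits); // subtract max for numerical stability
  const exps = logits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

/** Pick the top class and its confidence from the classifier output. */
function topPrediction(logits: number[]): { label: string; confidence: number } {
  const probs = softmax(logits);
  let best = 0;
  for (let i = 1; i < probs.length; i++) {
    if (probs[i] > probs[best]) best = i;
  }
  return { label: CLASSES[best], confidence: probs[best] };
}
```

The resulting `{ label, confidence }` pair is what gets forwarded to the LLM as narrative context.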
The model's confidence score is not just a metric: it actively influences the story. A sketch recognized with high confidence produces a clear, prominent character or event. An ambiguous sketch produces a mysterious, uncertain element in the narrative. This turns a technical limitation into a creative feature.
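One way this confidence-to-narrative mapping could look in code is a simple threshold function that emits a directive for the LLM prompt. The thresholds (0.8 / 0.5) and the wording are assumptions for illustration, not the app's real values:

```typescript
// Hypothetical sketch: map classifier confidence to a narrative directive.
// Threshold values (0.8 / 0.5) are illustrative assumptions.
function narrativeHint(label: string, confidence: number): string {
  if (confidence >= 0.8) {
    return `The ${label} appears clearly and plays a prominent role in this chapter.`;
  }
  if (confidence >= 0.5) {
    return `Something resembling a ${label} enters the story, half-glimpsed.`;
  }
  return `A mysterious shape appears; it might be a ${label}, but no one is sure.`;
}
```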
## Core Technologies
- Frontend: Next.js + React · Canvas UI, chapter display, state management
- Computer Vision: ResNet-18 → ONNX (runs in browser) · Sketch classification, confidence scoring
- LLM: Groq API · Chapter generation, narrative continuity
- Text-to-Speech: Browser Speech / ElevenLabs / Google TTS · Chapter narration
- Deploy: Vercel · Frontend + API routes
## The Model
### Architecture
ResNet-18 trained from scratch on a curated subset of the Quick, Draw! dataset by Google.
Transfer learning from ImageNet was deliberately avoided. Quick Draw sketches are grayscale binary strokes, fundamentally different from the photographic textures and colors that ImageNet-pretrained weights encode. Training from scratch on the correct domain produces better results and removes the need to upsample 28×28 drawings to 224×224.
### Training Details
- Dataset: Quick, Draw! numpy bitmap format (28×28 grayscale, pre-rendered)
- Categories: 20 curated classes (see Category Design below)
- Samples: ~5,000 per class → 100,000 total training examples
- Input: 1-channel 28×28 images (first Conv layer adapted from 3→1 channels)
- Output: 20-class softmax
- Training environment: Google Colab
### Export Pipeline
PyTorch (`.pth`) → ONNX → onnxruntime-web (browser inference)
Running inference in the browser eliminates the need for a Python backend, removes cold start latency, and means the model works offline after the initial page load.
### Preprocessing: A Critical Detail
The HTML5 canvas draws dark strokes on a white background. Quick Draw data is the opposite: white strokes on black. The canvas output is inverted before inference to match the training distribution. Skipping this step causes silent, consistent misclassification.
```
Canvas output (black on white) → Grayscale → Resize 28×28 → Invert → Normalize → Model
```
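The invert + normalize steps can be sketched as a small pure function. This assumes simple [0, 1] scaling after inversion; the exact resize step and normalization constants in the real app may differ:

```typescript
// Sketch of the canvas → model preprocessing, assuming [0, 1] normalization.
// Input: 784 grayscale pixels (28×28), 0 = black stroke, 255 = white background.
function preprocess(gray28x28: Uint8ClampedArray): Float32Array {
  const out = new Float32Array(28 * 28);
  for (let i = 0; i < out.length; i++) {
    const inverted = 255 - gray28x28[i]; // match Quick Draw: white stroke on black
    out[i] = inverted / 255;             // scale to [0, 1]
  }
  return out; // reshaped to [1, 1, 28, 28] when handed to the ONNX runtime
}
```

Without the inversion, every pixel lands far outside the training distribution, which is exactly the silent misclassification described above.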
## Category Design
20 categories were selected from Quick Draw not just for model performance, but for narrative coherence. Every category has a role in the story engine.
- Nature & Places: mountain, ocean, tree, sun, moon · Scene setting, environment
- Characters & Creatures: cat, dog, bird, fish, face · Protagonists, companions
- Tools & Objects: clock, umbrella, car, airplane, boat · Plot devices, transitions
- Conflict & Mystery: eye, hand, flower, star, house · Tension, resolution, symbolism
Categories that are visually similar (e.g. face and eye) were kept in different narrative groups to prevent confusion from affecting story coherence. If the model confuses them, the story still makes sense because they serve different roles.
## Server-Side Category Management
The three drawing options presented after each chapter are not random. They are managed server-side to ensure story quality across all 5 chapters.
Logic in `lib/story/categoryGroups.ts`:
- Categories are organized into thematic groups
- Chapter position influences which groups are prioritized (Chapter 1 favors Places or Nature; later chapters introduce Characters and Conflict)
- A `usedCategories` set prevents repeats across the full story arc
- The backend pre-selects `nextOptions` and passes them to the LLM as constraints, so the LLM focuses entirely on prose, not on deciding what to suggest next
This separation of concerns (server handles structure, LLM handles creativity) keeps the API responses consistent and prevents the model from hallucinating category names that don't exist in the classifier.
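A minimal sketch of this selection logic might look like the following. The group names mirror the README's taxonomy, but the per-chapter priority table and the deterministic ordering are assumptions; the actual `categoryGroups.ts` may shuffle or weight differently:

```typescript
// Hypothetical sketch of server-side option selection.
// Groups mirror the README; the priority table below is an assumption.
const GROUPS: Record<string, string[]> = {
  nature: ["mountain", "ocean", "tree", "sun", "moon"],
  characters: ["cat", "dog", "bird", "fish", "face"],
  objects: ["clock", "umbrella", "car", "airplane", "boat"],
  conflict: ["eye", "hand", "flower", "star", "house"],
};

// Which groups each chapter draws from (1-indexed chapter number).
const PRIORITY: Record<number, string[]> = {
  1: ["nature"],
  2: ["nature", "characters"],
  3: ["characters", "objects"],
  4: ["objects", "conflict"],
  5: ["conflict", "characters"],
};

/** Pick up to 3 unused categories for the given chapter. */
function nextOptions(chapter: number, used: Set<string>): string[] {
  const pool = PRIORITY[chapter]
    .flatMap((group) => GROUPS[group])
    .filter((category) => !used.has(category));
  return pool.slice(0, 3); // deterministic here; the real app may randomize
}
```

Because every returned name is drawn from the classifier's own label set, the LLM can never be asked to prompt for an object the model cannot recognize.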
## Design Decisions Worth Noting
Why train from scratch instead of fine-tuning? ImageNet weights encode photographic features: texture, color, depth. Quick Draw sketches have none of these. With 100k training examples available, training from scratch on the correct domain outperforms fine-tuning a mismatched prior.
Why ONNX in the browser instead of a Python backend? Eliminates cold starts, removes infrastructure complexity, and makes the app fully deployable on Vercel without a separate model server. Inference on a 28ร28 input is fast enough that browser execution adds no perceptible latency.
Why Groq instead of OpenAI? Groq's inference speed is significantly faster, which matters for a real-time storytelling experience where the user is waiting for the next chapter before drawing again.
Why separate category selection from LLM generation? Giving the LLM full control over what to suggest next produced inconsistent category names and occasional suggestions outside the trained classes. Moving selection server-side and passing options as constraints keeps the LLM focused on prose quality.
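Passing the pre-selected options to the LLM as hard constraints could be as simple as templating them into the chapter prompt. The prompt wording here is purely illustrative, not the app's actual prompt:

```typescript
// Hypothetical sketch: build a chapter prompt with server-chosen constraints.
// All wording is illustrative; only the constraint pattern matters.
function buildChapterPrompt(
  chapter: number,
  drawn: string,
  confidence: number,
  nextOptions: string[]
): string {
  return [
    `Continue the story with Chapter ${chapter}.`,
    `The reader drew: ${drawn} (classifier confidence ${(confidence * 100).toFixed(0)}%).`,
    `End the chapter by inviting the reader to draw one of exactly these objects: ${nextOptions.join(", ")}.`,
    `Do not suggest any other objects.`,
  ].join("\n");
}
```

Because the option list is injected verbatim, the LLM has no opportunity to invent a category name outside the classifier's 20 trained classes.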
## What This Project Demonstrates
- Integrating a custom-trained vision model into a production web app without a Python backend
- Using model uncertainty (confidence distribution) as a creative input rather than discarding it
- Designing a category taxonomy with both ML performance and product experience in mind
- Combining computer vision and language models in a coherent user-facing product
- Understanding when to use transfer learning and when training from scratch is the better choice