
# StoryDraw
StoryDraw is an interactive storytelling web application that merges real-time computer vision with generative AI. The user draws objects on a canvas, a ResNet-18 model classifies each sketch locally in the browser, and a language model weaves those objects into a dynamic, narrated story, chapter by chapter.
## How It Works
The experience unfolds across 5 chapters. Each chapter ends with a drawing prompt, and what the user draws (and how confidently the model recognizes it) directly shapes the narrative.
```
Story style selected by user
        ↓
LLM generates Chapter 1 + suggests 3 drawable objects
        ↓
User draws one of the suggested objects on the canvas
        ↓
ResNet-18 (ONNX, runs locally in the browser) classifies the sketch
        ↓
Top prediction + confidence score fed back to LLM as narrative context
        ↓
LLM generates next chapter, shaped by what was drawn (and how clearly)
        ↓
Repeat for 5 chapters → Story complete
```
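The "top prediction + confidence score" step can be sketched as a softmax over the classifier's logits followed by an argmax. This is an illustrative reconstruction, not the app's actual code; the class list here is a subset of the 20 categories.

```typescript
// Illustrative subset of the 20 trained classes.
const CLASSES = ["cat", "dog", "bird", "fish", "face"];

/** Convert raw model logits into a probability distribution. */
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits); // subtract max for numerical stability
  const exps = logits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

/** Pick the top class and its confidence from the classifier output. */
function topPrediction(logits: number[]): { label: string; confidence: number } {
  const probs = softmax(logits);
  let best = 0;
  for (let i = 1; i < probs.length; i++) {
    if (probs[i] > probs[best]) best = i;
  }
  return { label: CLASSES[best], confidence: probs[best] };
}
```

The resulting `{ label, confidence }` pair is what gets forwarded to the LLM as narrative context.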
The model's confidence score is not just a metric: it actively influences the story. A sketch recognized with high confidence produces a clear, prominent character or event. An ambiguous sketch produces a mysterious, uncertain element in the narrative. This turns a technical limitation into a creative feature.
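One way this confidence-to-narrative mapping could look in code is a simple threshold function that emits a directive for the LLM prompt. The thresholds (0.8 / 0.5) and the wording are assumptions for illustration, not the app's real values:

```typescript
// Hypothetical sketch: map classifier confidence to a narrative directive.
// Threshold values (0.8 / 0.5) are illustrative assumptions.
function narrativeHint(label: string, confidence: number): string {
  if (confidence >= 0.8) {
    return `The ${label} appears clearly and plays a prominent role in this chapter.`;
  }
  if (confidence >= 0.5) {
    return `Something resembling a ${label} enters the story, half-glimpsed.`;
  }
  return `A mysterious shape appears; it might be a ${label}, but no one is sure.`;
}
```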
## Core Technologies
- Frontend: Next.js + React · Canvas UI, chapter display, state management
- Computer Vision: ResNet-18 → ONNX (runs in browser) · Sketch classification, confidence scoring
- LLM: Groq API · Chapter generation, narrative continuity
- Text-to-Speech: Browser Speech / ElevenLabs / Google TTS · Chapter narration
- Deploy: Vercel · Frontend + API routes
## The Model
### Architecture
ResNet-18 trained from scratch on a curated subset of the Quick, Draw! dataset by Google.
Transfer learning from ImageNet was deliberately avoided. Quick Draw sketches are grayscale binary strokes, fundamentally different from the photographic textures and colors that ImageNet-pretrained weights encode. Training from scratch on the correct domain produces better results and removes the need to upsample 28×28 drawings to 224×224.
### Training Details
- Dataset: Quick, Draw! numpy bitmap format (28×28 grayscale, pre-rendered)
- Categories: 20 curated classes (see Category Design below)
- Samples: ~5,000 per class → 100,000 total training examples
- Input: 1-channel 28×28 images (first Conv layer adapted from 3→1 channels)
- Output: 20-class softmax
- Training environment: Google Colab
### Export Pipeline
PyTorch (`.pth`) → ONNX → onnxruntime-web (browser inference)
Running inference in the browser eliminates the need for a Python backend, removes cold start latency, and means the model works offline after the initial page load.
### Preprocessing: A Critical Detail
The HTML5 canvas draws dark strokes on a white background. Quick Draw data is the opposite: white strokes on black. The canvas output is inverted before inference to match the training distribution. Skipping this step causes silent, consistent misclassification.
```
Canvas output (black on white) → Grayscale → Resize 28×28 → Invert → Normalize → Model
```
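The invert + normalize steps can be sketched as a small pure function. This assumes simple [0, 1] scaling after inversion; the exact resize step and normalization constants in the real app may differ:

```typescript
// Sketch of the canvas → model preprocessing, assuming [0, 1] normalization.
// Input: 784 grayscale pixels (28×28), 0 = black stroke, 255 = white background.
function preprocess(gray28x28: Uint8ClampedArray): Float32Array {
  const out = new Float32Array(28 * 28);
  for (let i = 0; i < out.length; i++) {
    const inverted = 255 - gray28x28[i]; // match Quick Draw: white stroke on black
    out[i] = inverted / 255;             // scale to [0, 1]
  }
  return out; // reshaped to [1, 1, 28, 28] when handed to the ONNX runtime
}
```

Without the inversion, every pixel lands far outside the training distribution, which is exactly the silent misclassification described above.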
## Category Design
20 categories were selected from Quick Draw not just for model performance, but for narrative coherence. Every category has a role in the story engine.
- Nature & Places: mountain, ocean, tree, sun, moon · Scene setting, environment
- Characters & Creatures: cat, dog, bird, fish, face · Protagonists, companions
- Tools & Objects: clock, umbrella, car, airplane, boat · Plot devices, transitions
- Conflict & Mystery: eye, hand, flower, star, house · Tension, resolution, symbolism
Categories that are visually similar (e.g. face and eye) were kept in different narrative groups to prevent confusion from affecting story coherence. If the model confuses them, the story still makes sense because they serve different roles.
## Server-Side Category Management
The three drawing options presented after each chapter are not random. They are managed server-side to ensure story quality across all 5 chapters.
Logic in `lib/story/categoryGroups.ts`:
- Categories are organized into thematic groups
- Chapter position influences which groups are prioritized (Chapter 1 favors Places or Nature; later chapters introduce Characters and Conflict)
- A `usedCategories` set prevents repeats across the full story arc
- The backend pre-selects `nextOptions` and passes them to the LLM as constraints, so the LLM focuses entirely on prose, not on deciding what to suggest next
This separation of concerns (server handles structure, LLM handles creativity) keeps the API responses consistent and prevents the model from hallucinating category names that don't exist in the classifier.
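A minimal sketch of this selection logic might look like the following. The group names mirror the README's taxonomy, but the per-chapter priority table and the deterministic ordering are assumptions; the actual `categoryGroups.ts` may shuffle or weight differently:

```typescript
// Hypothetical sketch of server-side option selection.
// Groups mirror the README; the priority table below is an assumption.
const GROUPS: Record<string, string[]> = {
  nature: ["mountain", "ocean", "tree", "sun", "moon"],
  characters: ["cat", "dog", "bird", "fish", "face"],
  objects: ["clock", "umbrella", "car", "airplane", "boat"],
  conflict: ["eye", "hand", "flower", "star", "house"],
};

// Which groups each chapter draws from (1-indexed chapter number).
const PRIORITY: Record<number, string[]> = {
  1: ["nature"],
  2: ["nature", "characters"],
  3: ["characters", "objects"],
  4: ["objects", "conflict"],
  5: ["conflict", "characters"],
};

/** Pick up to 3 unused categories for the given chapter. */
function nextOptions(chapter: number, used: Set<string>): string[] {
  const pool = PRIORITY[chapter]
    .flatMap((group) => GROUPS[group])
    .filter((category) => !used.has(category));
  return pool.slice(0, 3); // deterministic here; the real app may randomize
}
```

Because every returned name is drawn from the classifier's own label set, the LLM can never be asked to prompt for an object the model cannot recognize.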
## Design Decisions Worth Noting
Why train from scratch instead of fine-tuning? ImageNet weights encode photographic features: texture, color, depth. Quick Draw sketches have none of these. With 100k training examples available, training from scratch on the correct domain outperforms fine-tuning a mismatched prior.
Why ONNX in the browser instead of a Python backend? Eliminates cold starts, removes infrastructure complexity, and makes the app fully deployable on Vercel without a separate model server. Inference on a 28ร28 input is fast enough that browser execution adds no perceptible latency.
Why Groq instead of OpenAI? Groq's inference speed is significantly faster, which matters for a real-time storytelling experience where the user is waiting for the next chapter before drawing again.
Why separate category selection from LLM generation? Giving the LLM full control over what to suggest next produced inconsistent category names and occasional suggestions outside the trained classes. Moving selection server-side and passing options as constraints keeps the LLM focused on prose quality.
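Passing the pre-selected options to the LLM as hard constraints could be as simple as templating them into the chapter prompt. The prompt wording here is purely illustrative, not the app's actual prompt:

```typescript
// Hypothetical sketch: build a chapter prompt with server-chosen constraints.
// All wording is illustrative; only the constraint pattern matters.
function buildChapterPrompt(
  chapter: number,
  drawn: string,
  confidence: number,
  nextOptions: string[]
): string {
  return [
    `Continue the story with Chapter ${chapter}.`,
    `The reader drew: ${drawn} (classifier confidence ${(confidence * 100).toFixed(0)}%).`,
    `End the chapter by inviting the reader to draw one of exactly these objects: ${nextOptions.join(", ")}.`,
    `Do not suggest any other objects.`,
  ].join("\n");
}
```

Because the option list is injected verbatim, the LLM has no opportunity to invent a category name outside the classifier's 20 trained classes.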
## What This Project Demonstrates
- Integrating a custom-trained vision model into a production web app without a Python backend
- Using model uncertainty (confidence distribution) as a creative input rather than discarding it
- Designing a category taxonomy with both ML performance and product experience in mind
- Combining computer vision and language models in a coherent user-facing product
- Understanding when to use transfer learning and when training from scratch is the better choice