pwa react llm webgpu transformers

Experiment - Case Study

LiteSpark

LiteSpark is a privacy-first AI workspace that runs entirely in your browser. Using WebGPU and WASM, it can run LLMs completely offline with no installation or setup. Just load the page, load a model, and go.

LiteSpark - An experiment towards a complete AI toolkit in the browser, with zero installations.

This was a project I started in the hope of building a complete local AI toolkit in the browser that runs offline, with zero installations. You cache some models once and then run them completely locally. Little did I know just how much pain this would be.

The current Tech Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Runtime/Build | Bun | High-speed JavaScript toolkit for dependency management and bundling. |
| Local Inference | Transformers.js | Runs AI models in the browser via WebGPU and WASM. |
| Database | PGlite | A full Postgres database running entirely in the browser for local persistence. A vector store via pgvector is also planned. |
| Frontend | React 19 | The latest React features, with TanStack Router for SPA navigation. |

Initial Approach

Wllama

The first idea was to make llama.cpp bindings work in the browser. This sounded like a great way to go because llama.cpp is one of the most popular local inference engines out there, and GGUF models are very portable and easy to quantize, so they run with less VRAM (very important in web browsers). I found the perfect project for this: Wllama.

However, at the time Wllama did not support WebGPU, which meant models would run entirely on CPU and RAM. CPUs are not designed for the highly parallel computations (the matrix operations in ML models) that GPUs handle well. So this was a deal breaker.
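
To make the CPU-versus-GPU distinction concrete, here is a minimal sketch of how a page could probe for WebGPU at startup and fall back to WASM when it is missing. The names `pickBackend`, `NavigatorLike`, `AdapterLike` and `Backend` are hypothetical; the small interfaces only mirror the parts of `navigator.gpu` that the sketch touches, so it compiles without WebGPU typings. This is not LiteSpark's actual code.

```typescript
// Minimal structural types (assumptions) covering only what the probe needs.
interface AdapterLike {
  features: { has(name: string): boolean };
}
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<AdapterLike | null> };
}

type Backend = { device: "webgpu" | "wasm"; fp16: boolean };

// Probe for a usable WebGPU adapter; otherwise fall back to WASM on the CPU.
async function pickBackend(nav: NavigatorLike): Promise<Backend> {
  if (!nav.gpu) return { device: "wasm", fp16: false };
  try {
    const adapter = await nav.gpu.requestAdapter();
    if (!adapter) return { device: "wasm", fp16: false };
    // "shader-f16" is an optional WebGPU feature, so it is checked explicitly.
    return { device: "webgpu", fp16: adapter.features.has("shader-f16") };
  } catch {
    // Any failure while requesting an adapter is treated as "no GPU".
    return { device: "wasm", fp16: false };
  }
}
```

In a browser this would be called as `pickBackend(navigator)`, possibly with a type cast depending on the TypeScript setup. An engine without a WebGPU backend always ends up on the `"wasm"` branch, which is the situation described above.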

TensorFlow.js / LiteRT

Google rebranded TensorFlow Lite as LiteRT and, with the launch of the Gemma 4 model family, is making a huge push towards edge LM inference. While TensorFlow was never a first choice for LMs compared to PyTorch, it was somewhat okay.

The problem here was that TensorFlow.js was being discontinued and the JS bindings for LiteRT were not yet out. So that was off the table.

ONNX - Transformers.js

The Open Neural Network Exchange (ONNX) is an open-source AI ecosystem of technology companies and research organizations that establishes open standards for representing machine learning models, so that different tools and runtimes can share a common format.

Transformers.js has built-in support for ONNX model inference. This seemed like a huge advantage, as ONNX works not only for LLMs but also for CNNs and other niche vision, audio, classification, and prediction models. This meant I could integrate a whole ecosystem of models to create a comprehensive tool that does more than just chat with LLMs.

That includes things like voice support, live vision, and even agents.

Architecture and development

  • The application is designed as a progressive web app, so once it loads in the browser, the front-end code and the service workers are cached and can run offline.
  • I tried integrating the Qwen3.5 family, the Gemma 4 family, and LFM models.

I quickly realised that Transformers.js with ONNX models is not straightforward. Each model has a slightly different implementation. So I decided on an adapter factory pattern, where each model family gets a custom adapter that smooths over the differences and exposes a unified common model interface. But this was way more challenging than I expected.
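
A minimal sketch of the adapter factory idea, under stated assumptions: `ModelAdapter`, `registerAdapter`, `createAdapter` and `stubAdapter` are hypothetical names, the substring matching on the model id is a simplification, and `generate` is a placeholder that does not call Transformers.js. The two loader class names are the ones used for Qwen3.5 and Gemma 4 in this write-up.

```typescript
// Hypothetical unified interface that the rest of the app would code against.
interface ModelAdapter {
  readonly family: string;
  readonly loaderClass: string; // name of the loader class this family needs
  generate(prompt: string): Promise<string>;
}

type AdapterFactory = () => ModelAdapter;

// Registry of family matchers, checked in registration order.
const adapterRegistry: Array<{ match: string; create: AdapterFactory }> = [];

function registerAdapter(match: string, create: AdapterFactory): void {
  adapterRegistry.push({ match: match.toLowerCase(), create });
}

// Pick the first adapter whose matcher appears in the model id.
function createAdapter(modelId: string): ModelAdapter {
  const id = modelId.toLowerCase();
  for (const entry of adapterRegistry) {
    if (id.indexOf(entry.match) !== -1) return entry.create();
  }
  throw new Error(`No adapter registered for model: ${modelId}`);
}

// Placeholder adapter: a real one would load the model and run inference.
function stubAdapter(family: string, loaderClass: string): ModelAdapter {
  return {
    family,
    loaderClass,
    generate: async (prompt) => `[${family}] would answer: ${prompt}`,
  };
}

registerAdapter("qwen3.5", () => stubAdapter("Qwen3.5", "Qwen3_5ForConditionalGeneration"));
registerAdapter("gemma-4", () => stubAdapter("Gemma 4", "Gemma4ForConditionalGeneration"));
```

The UI would only ever call `createAdapter(modelId)` and then `generate(...)`. Everything family-specific (loader class, file names, image handling) stays inside the adapter, which is also where the effort piles up.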

The challenge

The biggest problem is that each model has a very different implementation of core model functionality.

File naming conventions

  • Though there is some common ground in file naming, I still noticed differences like:
vision_encoder_q4.onnx vs vision_q4.onnx
pre_processor vs processor
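
One way to paper over naming differences like these is an alias table that is tried in order. The sketch below is illustrative only: `COMPONENT_ALIASES` and `resolveFile` are hypothetical names, and the alias lists contain just the two pairs mentioned above, not a survey of real model repositories.

```typescript
// Known spellings per logical component (assumption: only the pairs listed above).
const COMPONENT_ALIASES: Record<string, string[]> = {
  vision: ["vision_encoder", "vision"],
  processor: ["pre_processor", "processor"],
};

// Return the first file in the repo listing that matches any alias plus suffix.
function resolveFile(
  files: string[],
  component: string,
  suffix: string,
): string | undefined {
  const aliases = COMPONENT_ALIASES[component] ?? [];
  for (const alias of aliases) {
    const wanted = `${alias}${suffix}`;
    // Accept the bare file name or the same name inside a subfolder.
    const hit = files.find((f) => f === wanted || f.endsWith(`/${wanted}`));
    if (hit !== undefined) return hit;
  }
  return undefined;
}
```

For example, `resolveFile(files, "vision", "_q4.onnx")` finds either `vision_encoder_q4.onnx` or `vision_q4.onnx`, whichever the model ships. Every new naming variant still has to be added to the table by hand.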

Quantization Mismatches

  • The most common quantizations are q4, q6, fp16, fp32 and q4f16.
  • The problem is that most models don't ship all of them.
  • Even worse, a model sometimes has vision_encoder in q4 but text_decoder in fp16.
  • There is no built-in quant chooser or fallback mechanism.
  • Some browsers don't support fp16.
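
Since there is no built-in chooser, a hand-rolled fallback could look roughly like the sketch below. It is an assumption-heavy illustration: `chooseDtypes`, `PREFERENCE` and `NEEDS_FP16` are hypothetical names, the preference order (smallest footprint first) is my own choice, and treating q4f16 as requiring fp16 support is an assumption. The component names in the usage note are the two from the list above.

```typescript
type Dtype = "q4" | "q4f16" | "q6" | "fp16" | "fp32";

// Preference order is an assumption: prefer the smallest memory footprint.
const PREFERENCE: Dtype[] = ["q4f16", "q4", "q6", "fp16", "fp32"];

// Dtypes assumed to need fp16 support in the browser.
const NEEDS_FP16: Dtype[] = ["q4f16", "fp16"];

// For each component, pick the most preferred dtype that the model actually
// ships and that the device can run; fail loudly if nothing is usable.
function chooseDtypes(
  available: Record<string, Dtype[]>,
  fp16Supported: boolean,
): Record<string, Dtype> {
  const chosen: Record<string, Dtype> = {};
  for (const component of Object.keys(available)) {
    const shipped = available[component] ?? [];
    const usable = PREFERENCE.filter(
      (d) =>
        shipped.indexOf(d) !== -1 &&
        (fp16Supported || NEEDS_FP16.indexOf(d) === -1),
    );
    const best = usable[0];
    if (best === undefined) {
      throw new Error(`No usable dtype for component: ${component}`);
    }
    chosen[component] = best;
  }
  return chosen;
}
```

Given `{ vision_encoder: ["q4", "fp16"], text_decoder: ["fp16", "fp32"] }` and no fp16 support, this picks q4 for the vision encoder and falls back to fp32 for the text decoder. The result is a per-component map, so mixed quantizations within one model are handled naturally.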

Core Implementation Differences

This is by far the worst problem

  • Some models implement core functionality, interfaces, and methods very differently:
Gemma4ForConditionalGeneration // Gemma 4
MultiModalityCausalLM // Janus
AutoModelForCausalLM // Llama - LFM
Qwen3_5ForConditionalGeneration // Qwen3.5
AutoModelForImageTextToText // Gemma 3

Note that these are just the model loader classes, and they are already vastly different.

Each model also has a very different implementation of image input and multimodality.

And some of these are only available in different versions of Transformers.js.

Conclusions

  • While it is technically possible to build a web LLM experience that supports all major models, it is not practical.
  • The ONNX runtime with Transformers.js is, as of now, highly disorganized, which makes a unified model inference implementation infeasible.
  • I have to explore more runtimes, such as newer versions of Wllama and the upcoming LiteRT implementation for the web.