Edge Microapps: Host a Recommendation Engine on Raspberry Pi for Local Networks


thecode
2026-01-24 12:00:00
11 min read

Build a privacy-first dining recommender on a Raspberry Pi 5 + AI HAT+ for your group: local inference, low latency, and a step-by-step deploy guide.

Stop shipping sensitive recommendations to the cloud — run a tiny, private recommender on a Pi 5

Decision fatigue and privacy concerns are real: you want a fast, helpful dining recommender for your office or friends without handing personal tastes and chat data to third-party APIs. In 2026 the combination of the Raspberry Pi 5 and the new AI HAT+ family (NPU-backed accelerators released in late 2025) makes it practical to run a full micro recommendation service on a single board inside a local network.

This guide walks you through building and deploying a privacy-first micro recommendation engine (think: Where2Eat, but local), including hardware, software, model choices, code snippets (backend + frontend), deployment, performance tuning and operational tips — all targeted to ship fast and run reliably on a Pi 5 for a small group (10–50 users).

Why run your recommender at the edge in 2026?

  • Privacy-first: No third-party inference, no external user telemetry.
  • Low latency: sub-100 ms recommendations on the local network with NPU acceleration.
  • Cost control: one-time hardware spend vs. ongoing cloud inference costs.
  • Resilience: Works even if your internet drops — ideal for offices or remote cabins.
  • Personalization: Fine-grained control over your data, your models, and their ownership.

What we'll build — high level

A microservice that serves dining recommendations over the LAN. Components:

  • Data store: SQLite for items and user feedback.
  • Embedding model: small, quantized embedding model running locally (TFLite/ONNX backed by the AI HAT+).
  • Recommendation logic: content-based + lightweight collaborative updates (cosine similarity of embeddings; thumbs-up / thumbs-down feedback updates user profile vector).
  • API: FastAPI (Python) providing /recommend, /feedback, and admin endpoints.
  • Frontend: small single-file web app served from the Pi or static host; client interacts over HTTPS on the LAN.

Hardware & software checklist

  • Raspberry Pi 5 (4–8 GB recommended; 8 GB for more headroom)
  • AI HAT+ (2025/26 edition) for NPU acceleration
  • 64GB or larger NVMe/SD (Pi 5 supports NVMe via adapter — recommended for durability)
  • Power supply, Ethernet cable (for stable LAN), optional USB keyboard for first-boot
  • OS: 64-bit Raspberry Pi OS or Ubuntu Server 24.04+ (ARM64) — pick the distro you maintain
  • Docker (ARM64), python3.11+, pip, git

Note on the AI HAT+

The AI HAT+ provides an NPU accelerator and vendor SDK that exposes TFLite or ONNX execution providers. In late 2025 vendors shipped drivers and optimized runtimes that integrate with ONNX Runtime and TFLite delegates for ARM NPUs — leverage those to reduce inference time and power. See practical tips in the edge LLM/acceleration playbook for quantization and runtime choices.

Step 1 — Prepare the Pi

  1. Flash your OS (Raspberry Pi Imager or your preferred tool). Use the 64-bit build.
    • Optional: configure SSH and set a static IP for stable access.
  2. Install system updates and essentials:
    sudo apt update && sudo apt upgrade -y
    sudo apt install -y git curl build-essential docker.io docker-compose python3-pip
    sudo usermod -aG docker $USER
  3. Install the AI HAT+ SDK and drivers per vendor docs. The SDK typically exposes a TFLite delegate or ONNX execution provider — test with the vendor sample models first.
  4. Install ONNX Runtime or tflite-runtime with the NPU provider, e.g.:
pip install onnxruntime  # or the vendor package/wheel that includes the NPU execution provider

If your vendor ships an ONNX Runtime execution provider for the NPU, pass it when creating inference sessions so calls are routed to the accelerator; a quick check is sketched below.
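
A minimal sketch of that check, assuming the vendor runtime is installed (NPUExecutionProvider is a placeholder name; use whatever provider your AI HAT+ SDK actually registers):

import onnxruntime as ort

# List the execution providers compiled into this onnxruntime build.
available = ort.get_available_providers()
print('Available providers:', available)

# 'NPUExecutionProvider' is a placeholder; substitute the provider name
# your AI HAT+ vendor SDK registers (check the vendor docs and samples).
preferred = ['NPUExecutionProvider', 'CPUExecutionProvider']
providers = [p for p in preferred if p in available] or ['CPUExecutionProvider']
print('Creating sessions with:', providers)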

Step 2 — Choose a local embedding and inference approach

For a small dataset (dozens to a few hundred restaurants) you don't need a giant embedding model. Use a compact sentence embedding model from the sentence-transformers family converted to ONNX or TFLite. The benefits:

  • Fast, deterministic embeddings for both items and user inputs.
  • Works well with content-based and hybrid recommendation strategies.
  • Easy to quantize for NPU acceleration.

Conversion workflow (high-level): take a lightweight model such as an all-MiniLM variant, convert it to ONNX, then to TFLite if your NPU works with TFLite delegates. Quantize to int8 or fp16 and test accuracy against speed. See MLOps guidance for model conversion and feature handling in MLOps playbooks.
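
A minimal sketch of the quantization step, assuming you have already exported the model to ONNX (for example with Hugging Face Optimum's optimum-cli export onnx command) and saved it as mini_emb.onnx; the output file name is arbitrary:

from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic int8 quantization: weights are quantized offline, activations at runtime.
quantize_dynamic(
    model_input='mini_emb.onnx',
    model_output='mini_emb_int8.onnx',
    weight_type=QuantType.QInt8,
)

Re-run a handful of representative queries against both models and compare the top-K results before committing to the quantized file.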

Embedding example (Python with ONNXRuntime)

import onnxruntime as ort
import numpy as np

# Create session with NPU provider when available
providers = ['NPUExecutionProvider', 'CPUExecutionProvider']  # vendor-specific name
sess = ort.InferenceSession('mini_emb.onnx', providers=providers)

def embed_text(text: str):
    # tokenization will depend on your converted model — this is a placeholder
    tokens = tokenize(text)
    inputs = {sess.get_inputs()[0].name: np.array([tokens], dtype=np.int64)}
    out = sess.run(None, inputs)
    vec = out[0][0]
    return vec / np.linalg.norm(vec)

Step 3 — Data model & recommendation algorithm

Keep it simple and explainable. For a dining recommender, use these tables in SQLite (a minimal schema sketch follows the list):

  • items (id, name, attrs JSON, embedding BLOB)
  • users (id, profile_embedding BLOB, metadata JSON)
  • feedback (user_id, item_id, score, ts)
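
A minimal schema sketch using Python's sqlite3 module; embeddings are stored as raw float32 bytes (ndarray.tobytes()), and the data/app.db path is an example that matches the ./data volume used later in the compose file:

import sqlite3

conn = sqlite3.connect('data/app.db')  # example path; adjust to your layout
conn.executescript('''
CREATE TABLE IF NOT EXISTS items (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    attrs     TEXT,               -- JSON string (cuisine, price range, tags, ...)
    embedding BLOB                -- float32 vector bytes (ndarray.tobytes())
);
CREATE TABLE IF NOT EXISTS users (
    id                INTEGER PRIMARY KEY,
    profile_embedding BLOB,       -- float32 vector bytes
    metadata          TEXT        -- JSON string (declared preferences, etc.)
);
CREATE TABLE IF NOT EXISTS feedback (
    user_id INTEGER NOT NULL REFERENCES users(id),
    item_id INTEGER NOT NULL REFERENCES items(id),
    score   INTEGER NOT NULL,     -- +1 thumbs-up, -1 thumbs-down
    ts      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
''')
conn.commit()
conn.close()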

Recommendation flow:

  1. If user has a profile, compute cosine similarity between profile vector and item vectors and return top-K.
  2. On first use, compute a profile from declared preferences (cuisines, price range) and recent thumbs-up.
  3. When a user gives feedback, update their profile vector with a simple exponential moving average (EMA), sketched in code below:
    new_profile = normalize(alpha * item_vec + (1 - alpha) * old_profile)
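
A minimal sketch of that update (alpha around 0.2-0.3 is a reasonable starting point; flipping the direction for thumbs-down is an extension of the base formula, not part of it):

import numpy as np

def update_profile(old_profile: np.ndarray, item_vec: np.ndarray,
                   score: int, alpha: float = 0.25) -> np.ndarray:
    # EMA toward liked items; for thumbs-down (score < 0) push away instead.
    direction = item_vec if score > 0 else -item_vec
    new_profile = alpha * direction + (1 - alpha) * old_profile
    return new_profile / np.linalg.norm(new_profile)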

Cosine similarity brute force — good enough for micro apps

With <500 items, simple brute-force dot products are efficient and easy to maintain. Example (numpy):

import numpy as np

def recommend(profile_vec, item_vecs, items, k=10):
    # item_vecs: ndarray (n, d); vectors are assumed L2-normalized,
    # so the dot product below equals cosine similarity
    scores = item_vecs @ profile_vec
    idx = np.argsort(-scores)[:k]
    return [items[i] for i in idx]

Step 4 — Backend: FastAPI microservice

FastAPI keeps the stack lightweight and production-ready. Provide endpoints:

  • GET /recommend?user_id= — returns top-K items
  • POST /feedback — {user_id, item_id, score}
  • POST /admin/items — upload new items (admin protected)

Minimal FastAPI example (core logic)

from fastapi import FastAPI, HTTPException
import sqlite3
import numpy as np

app = FastAPI()
# DB helpers omitted for brevity — use connection pooling in production

@app.get('/recommend')
def recommend_endpoint(user_id: int, k: int = 10):
    profile = load_user_profile(user_id)  # numpy array
    items, item_vecs = load_all_items_vecs()  # list and ndarray
    recs = recommend(profile, item_vecs, items, k)
    return {'results': recs}

@app.post('/feedback')
def feedback(payload: dict):
    # payload: user_id, item_id, score
    update_feedback(payload)
    update_user_profile(payload['user_id'], payload['item_id'], payload['score'])
    return {'status': 'ok'}

Step 5 — Frontend: tiny privacy-first web app

A single-page app that fetches /recommend and lets users give thumbs up/down. Serve it from the Pi (FastAPI static files or nginx). Keep UI minimal — focus on fast responses and explicit consent for storing preferences.
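
If you serve the page straight from FastAPI, a minimal sketch using Starlette's StaticFiles (assuming index.html lives in a static/ directory next to the app):

from fastapi.staticfiles import StaticFiles

# Serve index.html and assets from ./static at the site root.
# Register the API routes before this mount so /recommend and /feedback still resolve.
app.mount('/', StaticFiles(directory='static', html=True), name='static')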

<!-- index.html snippet -->
<div id="list"></div>
<script>
async function load(){
  const r = await fetch('/recommend?user_id=1');
  const j = await r.json();
  const list = document.getElementById('list');
  j.results.forEach(i => {
    const el = document.createElement('div');
    el.innerHTML = `${i.name} <button onclick="fb(${i.id}, 1)">👍</button> <button onclick="fb(${i.id}, -1)">👎</button>`;
    list.appendChild(el);
  });
}
async function fb(item_id, score){
  await fetch('/feedback', {method:'POST', headers:{'Content-Type':'application/json'}, body: JSON.stringify({user_id:1, item_id, score})});
}
load();
</script>

Step 6 — Containerize and deploy

Use multi-arch Docker builds and compose for simplicity. Example Dockerfile tips:

  • Base on python:3.11-slim for arm64
  • Install the vendor runtime in the image (if required) and include the ONNX/TFLite binary that supports the NPU
  • Expose ports and mount persistent volumes for SQLite and models

docker-compose.yml (concept)

version: '3.8'
services:
  recommender:
    image: yourorg/recommender:arm64-latest
    restart: unless-stopped
    volumes:
      - ./data:/app/data
      - ./models:/app/models
    ports:
      - 8000:8000
    devices:
      - /dev/ai_hat:/dev/ai_hat  # if needed by vendor

Deploy: git pull on the Pi, docker-compose pull, docker-compose up -d. For continuous delivery, you can run a small GitHub Actions workflow that builds multi-arch images and pushes to a private registry, then trigger a secure pull on the Pi. For runtime and container trends see Kubernetes runtime trends coverage.

Operational tips & performance tuning

  • Quantize the embedding model: int8 or fp16 gives big speedups on NPUs. Validate quality loss — see edge LLM quantization notes in the edge LLM playbook.
  • Cache item vectors in memory to avoid DB roundtrips (see the sketch after this list).
  • Batch embedding requests if multiple texts arrive (e.g., import of 100 items).
  • Rate-limit endpoints to protect the Pi from accidental overload.
  • Backups: snapshot the SQLite file nightly to a trusted backup server or encrypted USB; storage workflows are covered in creator storage workflows.
  • Monitoring: expose a simple health endpoint and, if you need it, forward lightweight metrics to a local Prometheus or similar aggregator; see practical observability tips in observability for offline features.
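
A minimal sketch of the caching and health-check tips, building on the FastAPI app and the load_all_items_vecs helper from the API example (call refresh_items at startup and again after admin imports):

import time

_ITEM_CACHE = {'items': [], 'vecs': None, 'loaded_at': 0.0}

def refresh_items():
    # Pull all item rows and embedding vectors into memory once,
    # instead of hitting SQLite on every /recommend call.
    items, item_vecs = load_all_items_vecs()
    _ITEM_CACHE.update(items=items, vecs=item_vecs, loaded_at=time.time())

@app.get('/health')
def health():
    return {
        'status': 'ok',
        'items_cached': len(_ITEM_CACHE['items']),
        'cache_age_s': round(time.time() - _ITEM_CACHE['loaded_at'], 1),
    }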

Security and privacy hardening

  • Serve the API over HTTPS on the LAN using a local CA (mkcert) or a self-signed cert + pinned trust for clients.
  • Put the Pi on a segmented VLAN for IoT devices if you worry about lateral movement.
  • Disable outbound telemetry in all libraries and the OS where possible.
  • Rotate admin keys and store them on a hardware-backed secret manager if you require stronger guarantees (YubiKey or local HashiCorp Vault instance) — related identity operational practices are discussed in passwordless & key management playbooks.
  • Document data retention: how long feedback and profile embeddings live, and provide a deletion endpoint (sketched below).
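
A minimal sketch of such a deletion endpoint, building on the FastAPI app above (the data/app.db path is an example; add admin authentication before exposing this beyond a trusted LAN):

@app.delete('/users/{user_id}')
def delete_user(user_id: int):
    # Remove the user's profile embedding and all of their feedback rows.
    conn = sqlite3.connect('data/app.db')
    try:
        conn.execute('DELETE FROM feedback WHERE user_id = ?', (user_id,))
        conn.execute('DELETE FROM users WHERE id = ?', (user_id,))
        conn.commit()
    finally:
        conn.close()
    return {'status': 'deleted', 'user_id': user_id}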

Scaling patterns — when to add more Pis or move to cloud

This microapp is designed for a small population. If your usage grows beyond ~50 concurrent users or you need high availability, consider:

  • Horizontal scale: run multiple Pi nodes with a lightweight service discovery (mDNS or Consul) and a reverse proxy on a stronger local host.
  • Split read/write responsibilities: one Pi handles inference; another handles ingestion and backups.
  • Hybrid mode: keep sensitive profile vectors on-device, but move heavy offline training to a private cloud batch job and push updated item embeddings back to the Pi.

Case study — a friend group dining app (real-world insight)

Inspired by the micro-app trend (non-developers building single-purpose apps), imagine a 12-person friend group that wants a quick, fair way to pick dinner. They ran a Pi 5 + AI HAT+ on the household Wi-Fi, tucked under the couch: the host bootstrapped a 150-restaurant seed list, everyone rated a few places, and the recommender generated personalized options during group chats. The result: fewer arguments over where to eat, better matches to dietary needs, and no external data leakage.

"We got useful picks within minutes — and nobody's data ever left the house." — small-group pilot, Fall 2025

By 2026 three forces make local micro-recommenders practical: (1) compact, high-quality embedding models and quantization techniques matured in 2024–25; (2) hardware accelerators like AI HAT+ shipped stable NPUs and vendor runtimes in late 2025; (3) regulatory and user-demand pressure for privacy-first services pushed more microapps to the edge. Together these make it realistic to run personalized inference on cheap, energy-efficient hardware without cloud dependence.

Testing checklist before you open it to users

  • Load test the API with expected concurrency; tune worker count and NPU provider settings (a minimal load-test sketch follows this list).
  • Verify embedding quality: compare top-K before and after quantization for representative queries.
  • Test offline recovery: simulate power loss and confirm DB integrity and backup restore.
  • Validate privacy: confirm no outbound network calls and remove unused telemetry libraries.
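
A minimal concurrency smoke test, sketched with asyncio and httpx (any HTTP load tool works just as well; the URL, user IDs, and request counts below are placeholders to adjust for your setup):

import asyncio
import time

import httpx

async def load_test(base_url: str = 'http://raspberrypi.local:8000',
                    concurrency: int = 20, requests_per_worker: int = 25):
    async def worker(client: httpx.AsyncClient, user_id: int):
        latencies = []
        for _ in range(requests_per_worker):
            t0 = time.perf_counter()
            r = await client.get(f'{base_url}/recommend', params={'user_id': user_id})
            r.raise_for_status()
            latencies.append(time.perf_counter() - t0)
        return latencies

    async with httpx.AsyncClient(timeout=10.0) as client:
        results = await asyncio.gather(*(worker(client, uid) for uid in range(concurrency)))

    flat = sorted(t for res in results for t in res)
    print(f'p50={flat[len(flat) // 2] * 1000:.1f} ms  p95={flat[int(len(flat) * 0.95)] * 1000:.1f} ms')

asyncio.run(load_test())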

Advanced extensions (when you’re ready)

  • Contextual prompts: add ephemeral context (today's weather, lunch budgets) to bias recommendations by computing a context embedding and combining it with the user profile (see the sketch after this list).
  • Small LLM for explanations: run a quantized LLM on the Pi to generate human-friendly explanations ("Because you liked spicy ramen and outdoor seating"). Use only locally hosted models for privacy — see the edge LLM playbook at trainmyai.uk.
  • Voice UI: the AI HAT+ often supports audio I/O; add a voice interface for kiosk-style environments — field recorder ops and edge audio tips are covered in Field Recorder Ops 2026.
  • Federated updates: if multiple Pis exist, use secure aggregation to share anonymized model improvements and push global item embedding updates.
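
A minimal sketch of the context-blending idea from the first bullet, reusing embed_text from the embedding example (the 0.3 weight is an arbitrary starting point; the context vector is never persisted):

import numpy as np

def blend_context(profile_vec: np.ndarray, context_vec: np.ndarray,
                  weight: float = 0.3) -> np.ndarray:
    # Bias the query toward today's context (weather, budget, ...)
    # without mutating the stored profile.
    blended = (1 - weight) * profile_vec + weight * context_vec
    return blended / np.linalg.norm(blended)

# Usage: query = blend_context(profile, embed_text('rainy evening, cheap eats'))
# then pass query to recommend() in place of the raw profile vector.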

Actionable takeaways

  • Start small: build with SQLite, a compact ONNX/TFLite embedding model, and brute-force cosine similarity.
  • Use the AI HAT+ SDK to offload embeddings to the NPU and quantize models for better throughput.
  • Keep the app private by default — HTTPS on the LAN, explicit opt-in for any data sharing, and nightly encrypted backups.
  • Plan for graceful scale: caching, batching, and optional horizontal scaling across multiple Pi nodes. Edge caching patterns are discussed in edge caching & cost control.

Final notes — tradeoffs you should be aware of

Running locally reduces cloud costs and improves privacy, but it increases maintenance responsibility: OS updates, driver compatibility for new AI HAT revisions, and hardware failures. For many small groups and offices, the tradeoff is worth it — especially given 2026 hardware and model improvements.

Start your micro recommendation project

If you want a starter repo, a tested Dockerfile for ARM64, and a turn-key FastAPI + vanilla JS template tuned for Pi 5 + AI HAT+, check thecode.website’s starter kit or clone a starter repo to your Pi and run the included setup script. Expect to have a functioning recommender in under a day once your hardware is plugged in. For hands-on guides and model conversion tips, see the edge fine-tuning playbook and MLOps guidance at databricks.cloud.

Build locally, keep your users' data private, and iterate quickly — microapps in 2026 are about fast utility, real privacy guarantees, and leveraging capable edge hardware. Ready to put your recommender on the LAN?

Call to action

Clone the starter repo, try the pre-quantized embedding model, and share your results. Have questions about NPU tuning, model conversion, or scaling across multiple Pis? Reach out via thecode.website discussion board — we publish example configs and the exact vendor commands for popular AI HAT+ models in the repo.
