Kira: Someone in the Slack asked me something last week that I haven't stopped thinking about. She said — "I was on a ferry from Split to Hvar, three hours, no signal, and I had a client summary due by the time I docked. I just... sat there. Staring at my laptop. All my tools need the internet."
Santi: Three hours of dead laptop.
Kira: Three hours. And she's not even talking about some obscure workflow. Summarize a call transcript. Tag it. Draft a follow-up. That's it. Stuff a seven-billion-parameter model can do in its sleep.
Santi: On her laptop. Right there on the ferry.
Kira: On her laptop! But everything she's built runs through cloud APIs. OpenAI, Anthropic, whatever. No signal, no work.
Santi: And this isn't a ferry problem. This is an airport problem, a rural Airbnb problem, a train-through-the-Alps problem. I was in the Algarve last month — gorgeous place, terrible connectivity — and I watched my content repurposing tool just spin for forty minutes because the café wifi couldn't hold a connection to Claude.
Kira: The Algarve. In Portugal. Where you live.
Santi: Where I live! This isn't some remote island edge case. This is Tuesday.
Kira: So the question is — what if your AI stack just... worked? No signal required. You drop a file, it transcribes, summarizes, tags, queues everything up, and the moment you get wifi again, it syncs.
Santi: That's what we built. And it runs on hardware most of us already own.
Kira: Every travel day you spend without an offline AI stack is a day you're paying for cloud APIs you can't reach and missing deadlines you could have hit from your laptop alone. That's the tax nobody talks about — not the flights, not the visa runs. The dead hours.
Santi: Today we're building the fix. A portable offline AI stack — local models, local transcription, a SQLite queue that holds everything until you're back online. The whole thing ships in a weekend.
Santi: So here's why this conversation is happening now and not two years ago. Running a large language model on your laptop used to mean fighting with Python environments, compiling from source, praying your CUDA drivers cooperated. It was a weekend project for engineers, not a tool for operators.
Kira: And now?
Santi: Now you download a desktop app. LM Studio ships a full GUI — Mac, Windows, Linux — with a model browser built in. You click a model, it downloads, you run it. It exposes a local API that's compatible with the OpenAI format, so anything you've already built against GPT-4 or Claude can point at localhost instead.
Kira: Wait — same API format? So my Make scenarios that call OpenAI could just... swap the endpoint?
Santi: Same format. You change the base URL to localhost, pick your model, and it works offline. GPT4All does the same thing — desktop app, local API server, no cloud keys. And Ollama, which used to be terminal-only, shipped a Windows GUI last year. So all three major runners now have point-and-click interfaces.
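What that swap looks like in practice, as a minimal sketch. The port and model tag here are assumptions (LM Studio's local server defaults to port 1234; Ollama's OpenAI-compatible endpoint is typically 11434), so match them to whatever your runner shows:

```python
# Minimal sketch: point an existing OpenAI-style client at a local runner.
# Assumes LM Studio's local server on its default port (1234); for Ollama,
# the OpenAI-compatible endpoint is typically http://localhost:11434/v1.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # localhost instead of api.openai.com
    api_key="not-needed-locally",         # local runners ignore the key
)

resp = client.chat.completions.create(
    model="7b-instruct-q4",  # hypothetical tag: whichever model you loaded
    messages=[{"role": "user", "content": "Summarize this call transcript: ..."}],
)
print(resp.choices[0].message.content)
```

Anything already written against the OpenAI client only needs the `base_url` and `model` lines changed; the rest of the code stays as it was.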
Kira: Okay, but what size models are we actually running on a travel laptop? Because I'm not carrying a gaming rig through airport security.
Santi: Right, and this is where people over-promise. You're not running seventy-billion-parameter models on a MacBook Air. That needs sixty-four gigs of RAM minimum. What you are running — comfortably — is a seven or eight billion parameter model quantized to four-bit. That fits in about five to eight gigs of RAM. Ollama's docs say eight gigs minimum for a seven-B model. LM Studio recommends sixteen gigs for comfortable inference. So if your laptop has sixteen gigs of RAM, you're in good shape for the tasks we're talking about — summaries, tags, short drafts.
Kira: And thirteen-billion-parameter models?
Santi: Sixteen gigs of RAM, ideally with some GPU offload if you have a discrete card. But here's the thing — for travel day work, you probably don't need thirteen-B. A seven-B instruct model at four-bit quantization handles summarization and tagging just fine. Save the bigger models for when you're plugged in at a coworking space.
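To ground those RAM numbers, the arithmetic is simple: four bits is half a byte per parameter, plus a couple of gigs of headroom for the KV cache, the runner, and the OS. A rough sanity check, not a benchmark:

```python
# Back-of-envelope RAM estimate for a 4-bit quantized model. Real quantized
# files (e.g. GGUF Q4 variants) run slightly larger than params/2 because
# some layers stay at higher precision; the overhead figure is a guess.
def approx_ram_gb(params_billion: float, bits: int = 4, overhead_gb: float = 2.0) -> float:
    weights_gb = params_billion * 1e9 * bits / 8 / 1e9  # bytes for the weights
    return weights_gb + overhead_gb                      # KV cache + runner headroom

print(f"7B  @ 4-bit ~ {approx_ram_gb(7):.1f} GB")   # ~5.5 GB, inside the 5-8 GB range
print(f"13B @ 4-bit ~ {approx_ram_gb(13):.1f} GB")  # ~8.5 GB, hence 16 GB laptops
print(f"70B @ 4-bit ~ {approx_ram_gb(70):.1f} GB")  # ~37 GB, not travel hardware
```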
Kira: And this is the important part: quantization isn't just about fitting the model in memory. It's about battery. Hugging Face and NVIDIA's engineering docs both point to the same thing: moving from full precision to four-bit cuts the memory traffic per token, and with it the power draw. Less data movement means less heat, less fan noise, longer battery life.
Santi: You've been reading the quantization docs.
Kira: I have a flight to Oaxaca next week. I'm motivated.
Santi: Okay, so the architecture. Four pieces. A local LLM runner — pick one, LM Studio, GPT4All, Ollama. A local speech-to-text engine — whisper.cpp or faster-whisper. A SQLite database running in WAL mode as your work queue. And a sync layer — Litestream pushing to S3-compatible storage — that wakes up when you get connectivity.
Kira: Walk me through the flow. I'm on a plane. I have a voice memo from a client call.
Santi: You drop the audio file into an inbox folder on your laptop. A file watcher — Watchman from Meta, or fswatch if you're on Mac or Linux — detects the new file and enqueues a transcription job into your SQLite queue. The worker picks it up, runs it through whisper.cpp — fully offline, no API call — and writes the transcript back to the database. Then it automatically chains two more jobs: summarize and tag. Those hit your local LLM through the localhost API. Summary goes into the docs table, tags go into a JSON array on the same row. All of this happens without a single byte leaving your laptop.
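A sketch of that watcher-to-queue handoff, using Python's watchdog library as a cross-platform stand-in for Watchman or fswatch. The jobs table and its columns are illustrative, not the exact schema from the show-notes SOP:

```python
# Sketch of the inbox watcher: new audio file in, transcription job queued.
# Assumes the jobs table sketched a bit further down already exists.
import sqlite3
import time
from pathlib import Path
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

DB, INBOX = "stack.db", Path("inbox")

class InboxHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory:
            return
        con = sqlite3.connect(DB)
        # Priority 5 = local work; the worker chains summarize/tag jobs later.
        con.execute(
            "INSERT INTO jobs (task, payload, status, priority) "
            "VALUES (?, ?, 'pending', 5)",
            ("transcribe", event.src_path),
        )
        con.commit()
        con.close()

observer = Observer()
observer.schedule(InboxHandler(), str(INBOX), recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)  # watcher callbacks run on a background thread
except KeyboardInterrupt:
    observer.stop()
observer.join()
```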
Kira: And the queue is the key piece here, right? Because without it, you're just running models manually.
Santi: The queue is everything. SQLite in WAL mode, write-ahead logging, means readers and writers don't block each other, so the worker can write results while the watcher keeps enqueuing. It's the same database engine running on billions of devices. Your phone uses it. Your browser uses it. It's not exotic infrastructure. It's the most battle-tested embedded database on the planet.
Kira: And it fits in your backpack.
Santi: It fits in a single file. Your entire work queue, your document store, your sync state — one file. Encrypted at rest with SQLCipher if you want, which you should, because a stolen laptop with client transcripts on it is a nightmare you don't want.
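What that one file looks like, as a sketch: WAL mode plus a jobs table and a docs table shaped like the flow just described. Table and column names are ours, not the SOP's verbatim schema:

```python
# One-file work queue: WAL mode plus illustrative jobs and docs tables.
import sqlite3

con = sqlite3.connect("stack.db")
con.execute("PRAGMA journal_mode=WAL")  # readers and writers don't block each other
con.executescript("""
CREATE TABLE IF NOT EXISTS jobs (
  id         INTEGER PRIMARY KEY,
  task       TEXT NOT NULL,            -- 'transcribe' | 'summarize' | 'tag'
  payload    TEXT NOT NULL,            -- file path or doc id
  status     TEXT DEFAULT 'pending',   -- pending | running | done | failed | parked
  priority   INTEGER DEFAULT 5,        -- 5 = local, 7 = cloud-deferred
  attempts   INTEGER DEFAULT 0,
  idem_key   TEXT UNIQUE,              -- hash(file + task), see the sync discussion
  created_at TEXT DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS docs (
  id          INTEGER PRIMARY KEY,
  source_path TEXT,
  transcript  TEXT,
  summary     TEXT,
  tags        TEXT DEFAULT '[]',       -- JSON array, merged as a set on sync
  origin      TEXT DEFAULT 'local',    -- local vs cloud, for last-writer-wins
  updated_at  TEXT DEFAULT (datetime('now'))
);
""")
con.commit()
```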
Kira: So we set up two run profiles for this, and I want to explain why. Imagine you're boarding in twelve minutes. Battery at forty percent. You're not going to fire up a thirteen-B model and melt your laptop before takeoff.
Santi: Right. Battery Saver mode — seven to eight-B model, four-bit quantization, CPU only, context window capped at two to four thousand tokens. You're doing summaries, tags, short drafts. If your tokens per second drops below fifteen, you're pushing the hardware too hard — scale down or close other apps.
Kira: And if you're on a six-hour train ride with a power outlet?
Santi: Throughput mode. Thirteen-B model, partial GPU offload if you have it, context window up to eight thousand. You can do longer drafts, more complex summarization. But even here — and this is where I'll push back on myself — you're not doing heavy code generation, you're not doing vision tasks, you're not doing retrieval-augmented generation over huge document sets. Those stay in the cloud queue.
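The two profiles as plain config. The numbers are the ones just discussed; the key names are illustrative (the gpu_layers-style offload knob follows llama.cpp convention), and your runner's own settings UI does the same job:

```python
# Two run profiles as config. Values come from the discussion above;
# the dict shape and key names are our sketch, not a runner's real API.
PROFILES = {
    "battery_saver": {
        "model": "7b-instruct-q4",    # hypothetical model tag
        "gpu_layers": 0,              # CPU only
        "max_context_tokens": 4096,   # cap the window at 2-4k
        "min_tokens_per_sec": 15,     # below this, scale down or close apps
    },
    "throughput": {
        "model": "13b-instruct-q4",
        "gpu_layers": 20,             # partial GPU offload if you have a card
        "max_context_tokens": 8192,
        "min_tokens_per_sec": 15,
    },
}
```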
Kira: That's the split that makes this whole thing work. You're not trying to replace your cloud stack. You're partitioning. Summaries, tags, short drafts — local. Heavy codegen, image analysis, anything that needs a context window bigger than eight thousand tokens — that waits in the queue and gets processed when you land and have wifi.
Santi: And the queue handles that gracefully. You can set priority levels — local tasks at priority five, cloud-deferred tasks at priority seven. When your sync worker comes online, it pushes the deferred jobs to your cloud endpoint automatically.
Kira: No manual cleanup. No "oh wait, I forgot to send that summary."
Santi: Nothing manual. The worker drains the queue in priority order. Done items get marked done. Failed items retry up to three times with backoff. After three failures, they park for you to review — but that almost never happens with the local tasks because there's no network to fail.
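A sketch of that drain loop: priority order, three attempts with exponential backoff, then park for review. The max_priority argument is our illustration of the local-versus-cloud split, leaving cloud-deferred jobs untouched until the sync worker is online; the handler functions are placeholders:

```python
# Worker drain loop sketch: priority order, retries with backoff, parking.
import sqlite3
import time

MAX_ATTEMPTS = 3

def drain(con: sqlite3.Connection, handlers: dict, max_priority: int = 5):
    # Offline: max_priority=5 drains only local jobs; cloud-deferred (7) wait.
    while True:
        row = con.execute(
            "SELECT id, task, payload, attempts FROM jobs "
            "WHERE status = 'pending' AND priority <= ? "
            "ORDER BY priority, created_at LIMIT 1",
            (max_priority,),
        ).fetchone()
        if row is None:
            break  # queue is drained
        job_id, task, payload, attempts = row
        try:
            handlers[task](payload)  # e.g. transcribe / summarize / tag functions
            con.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
        except Exception:
            attempts += 1
            status = "parked" if attempts >= MAX_ATTEMPTS else "pending"
            con.execute(
                "UPDATE jobs SET status = ?, attempts = ? WHERE id = ?",
                (status, attempts, job_id),
            )
            time.sleep(2 ** attempts)  # simple exponential backoff
        con.commit()
```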
Kira: Okay, but this is where I was skeptical. Because sync is where offline-first systems get messy. You've been working offline for three hours. Your cloud tools have been getting updates from your team. You come back online and now you have two versions of reality.
Santi: Yeah, and I'm not going to pretend this is trivial. But the conflict policy is simpler than people think for this use case. Your local work is new content — summaries, tags, drafts that didn't exist before. You're not editing shared documents offline. You're creating artifacts and queuing them. So the conflict surface is small.
Kira: What about the edge case where someone on your team also tagged the same document while you were offline?
Santi: Tags merge as a set union. Duplicates get removed. Summaries use last-writer-wins based on timestamp, with an origin flag — local versus cloud — so you can always see which version came from where. And every queue item has an idempotency key — a hash of the file plus the task type — so you never process the same job twice even if the sync pushes a duplicate.
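Those three rules fit in a few lines. A sketch, with the hash and merge details as our assumptions about one reasonable implementation:

```python
# Merge rules sketch: tag set union, last-writer-wins summaries with an
# origin flag, and an idempotency key hashed from file content plus task.
import hashlib
import json

def idempotency_key(file_path: str, task: str) -> str:
    with open(file_path, "rb") as f:
        file_hash = hashlib.sha256(f.read()).hexdigest()
    return hashlib.sha256(f"{file_hash}:{task}".encode()).hexdigest()

def merge_tags(local_tags: str, cloud_tags: str) -> str:
    merged = set(json.loads(local_tags)) | set(json.loads(cloud_tags))  # set union
    return json.dumps(sorted(merged))  # duplicates gone, stable order

def merge_summary(local: dict, cloud: dict) -> dict:
    # Last-writer-wins by timestamp; each dict keeps its origin flag
    # ('local' or 'cloud') so you can always audit which version won.
    return local if local["updated_at"] >= cloud["updated_at"] else cloud
```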
Kira: Okay. That's cleaner than I expected.
Santi: The sync itself is Litestream pushing WAL changes to S3-compatible storage. You start a Docker container, point it at your database, and it streams incremental changes whenever it has connectivity. When you're offline, it just idles. No errors, no crashes. It picks up where it left off.
Kira: And the encryption piece — because I know someone's going to ask — SQLCipher encrypts the entire database file. Keys come from your OS keychain, not from an environment file sitting in your project folder. So even if someone grabs your laptop at the airport, they get an encrypted blob.
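A sketch of that key handling, assuming the keyring library for OS keychain access and pysqlcipher3 as the SQLCipher binding. Both are stand-ins for whichever equivalents you prefer:

```python
# Sketch: open the encrypted database with a key from the OS keychain.
# keyring reads the macOS Keychain / Windows Credential Manager;
# pysqlcipher3 is one of several SQLCipher bindings.
import keyring
from pysqlcipher3 import dbapi2 as sqlcipher

key = keyring.get_password("offline-ai-stack", "db-key")  # never in an .env file
con = sqlcipher.connect("stack.db")
con.execute(f"PRAGMA key = '{key}'")   # unlock before any other statement
con.execute("PRAGMA journal_mode=WAL")
# ... normal queue operations from here on
```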
Santi: So here's what I actually did. Two weeks ago, I disconnected everything — wifi off, Bluetooth off, phone in airplane mode — and ran this stack for two hours straight.
Kira: What'd you throw at it?
Santi: One ten-minute audio recording — a mock client call. Two markdown notes, about eight hundred words each. Dropped them all into the inbox folders.
Kira: And?
Santi: The file watchers picked them up within seconds. Whisper.cpp transcribed the audio in about four minutes on my M2 MacBook Pro, using the medium English model, fully offline. The worker then summarized and tagged all three documents. Total queue drain time from first drop to last completed job: eleven minutes.
Kira: Eleven minutes for three documents, including a ten-minute audio transcription.
Santi: Eleven minutes. Battery went from eighty-two percent to seventy-six percent. Six percent drain for the whole session. And when I turned wifi back on, Litestream synced the database to S3 in under thirty seconds. Everything was there — summaries, tags, the transcript. No conflicts, no manual intervention.
Kira: And this is why we say do the drill before you need it. Because the first time you try this should not be on the ferry to Hvar with a deadline in ninety minutes.
Santi: Run it on a Saturday. Disconnect for two hours. Drop some files. Watch the queue. If something breaks, you fix it at your desk, not in a panic.
Kira: I do want to name the real limitations though, because I think it's easy to oversell this.
Santi: Go.
Kira: Local models are slower. A seven-B model on CPU is not going to match GPT-4 turbo on summarization quality. The summaries are good enough for travel-day triage — you'll clean them up later — but they're not final-draft quality.
Santi: That's fair. And battery is a real constraint. We don't have hard independent benchmarks for twenty-twenty-six laptops running local inference versus cloud API calls — that data doesn't exist yet in a rigorous form. What we have is the physics — quantization reduces power draw per token — and my anecdotal six-percent-over-two-hours number. Your mileage will vary based on your hardware, your model choice, and how aggressively you're running inference.
Kira: And the sync complexity is real. It's manageable for this use case — new artifacts, small conflict surface — but if you tried to do collaborative editing offline with multiple team members, you'd need a much more sophisticated conflict resolution strategy than last-writer-wins.
Santi: Agreed. This is a single-operator pattern. One laptop, one queue, one sync target. It's not a distributed team workflow. It's your personal travel-day safety net.
Kira: Which is exactly what that woman on the ferry needed.
Santi: Exactly what she needed. And the macro trend supports investing the weekend to build it. Gartner projects fifty-five percent of PC shipments in twenty-twenty-six will be AI PCs — that's about a hundred and forty-three million units with dedicated neural processing hardware. IDC flagged the same shift at Mobile World Congress this year. The laptops are getting better at this, not worse.
Kira: So the stack you build this weekend gets faster every time you upgrade your hardware. That's a compounding investment.
Santi: So that ferry from Split to Hvar — three hours, no signal. With this stack running, she drops her client call recording into the inbox folder before she even boards. By the time she's watching the Dalmatian coast go by, the transcript's done, the summary's written, the tags are in the database. She docks, her phone finds a cell tower, Litestream pushes everything to S3, and her client gets the deliverable on time. No panic. No dead hours.
Kira: And the whole system — the runner, the queue, the watchers, the sync, the encryption — it's all in the Offline-First AI SOP we put in the show notes. Every script, every schema, the conflict policy, the Travel Day Mode toggles for Mac and Windows. It's the exact stack we just walked through, ready to copy and ship in a weekend.
Santi: One thing to do this week. Pick a Saturday. Install one runner — LM Studio, GPT4All, Ollama, whichever one you want. Download a seven-B model. Disconnect your wifi. Drop a file. Watch it process. That's it. That one drill will tell you whether your hardware can handle it, and once you see it work, you'll never fly without it again.
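If you want that drill as a script, here is a minimal version that also reports rough tokens per second against the fifteen-per-second floor mentioned earlier. The port and model tag are assumptions; match them to your runner:

```python
# The Saturday drill in one script: with wifi off, time one summarization
# and report rough throughput. Port/model are assumptions for LM Studio.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="local")
text = open("sample_note.md").read()  # any note you have lying around

start = time.time()
resp = client.chat.completions.create(
    model="7b-instruct-q4",  # hypothetical tag; use the model you downloaded
    messages=[{"role": "user", "content": f"Summarize in three bullets:\n\n{text}"}],
)
elapsed = time.time() - start

# Most local runners populate usage in the OpenAI format; if yours doesn't,
# estimate token count from the word count of the output instead.
tokens = resp.usage.completion_tokens
print(resp.choices[0].message.content)
print(f"~{tokens / elapsed:.1f} tokens/sec (below 15? scale down, per Battery Saver)")
```

If that prints a summary with wifi off, the drill passed and your hardware is in the game.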
Kira: Stop losing travel days. Build the stack.
Santi: See you Wednesday.