Running LLMs Locally with Ollama

Testing which LLMs my NVIDIA GeForce RTX 4060 Ti can run locally.

Tags: AI, LLM, Ollama, Linux

This blog post belongs to a two-part series: Ollama Local LLMs

  1. Running LLMs Locally with Ollama (this page)
  2. Building Something with Local LLMs

Is there any value in running small LLMs locally when all the big tech companies offer free credits for truly large LLMs? Will my graphics card explode under my desk?

Running Our First Ollama Inference

I have the following system specs, displayed courtesy of screenfetch.

System Specs

The key takeaway is that I have 16 GB of VRAM available on my GPU, so I’m limited to models around that size. (It turns out the sweet spot for model size is about half my available VRAM, because inference over a large context eats up the remaining memory.)
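As a back-of-the-envelope sketch of that sizing rule (the 0.5 bytes-per-parameter figure assumes 4-bit quantization, and the overhead factor is my own rough allowance, not an Ollama-published number):

```python
def estimated_vram_gb(params_billion, bytes_per_param=0.5, overhead=1.1):
    """Rough weight-memory estimate: Q4 quantization stores ~0.5 bytes
    per parameter; the overhead factor is a loose allowance for runtime
    buffers. The KV cache for long contexts can add much more on top."""
    return params_billion * bytes_per_param * overhead

# A 24B-parameter model at Q4 needs roughly 13 GB for the weights alone,
# which is why it only just fits in 16 GB of VRAM.
print(f"{estimated_vram_gb(24):.1f} GB")
```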

To begin, I simply Googled “run llms locally” and chose to follow this useful guide from Jeremy Morgan. Here are the two commands I needed to get up and running. (Beware of piping commands into sh.)

alexshank@alex-desktop:~/Desktop/main/repos/blog-astro$ curl -fsSL https://ollama.com/install.sh | sh
>>> Installing ollama to /usr/local
[sudo] password for alexshank:
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA GPU installed.
alexshank@alex-desktop:~/Desktop/main/repos/blog-astro$ ollama run mistral-small3.1
pulling manifest
pulling 1fa8532d986d... 100% ▕████████████████████████████████▏  15 GB
pulling 6db27cd4e277... 100% ▕████████████████████████████████▏  695 B
pulling 70a4dab5e1d1... 100% ▕████████████████████████████████▏ 1.5 KB
pulling a00920c28dfd... 100% ▕████████████████████████████████▏   17 B
pulling 9b6ac0d4e97e... 100% ▕████████████████████████████████▏  494 B
verifying sha256 digest
writing manifest
success
>>> hello
Hello! How can I assist you today?

>>> What's the capital of italy?
The capital of Italy is Rome.
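The interactive session above can also be driven programmatically through Ollama’s local HTTP server, which listens on port 11434 by default. A minimal standard-library sketch (the endpoint and fields are Ollama’s documented API; the helper names are mine):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(prompt, model="mistral-small3.1"):
    # "stream": False asks the server for one complete JSON object
    # instead of a stream of partial responses.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="mistral-small3.1"):
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires the ollama service running locally):
# generate("What's the capital of Italy?")
```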

Browsing Available Models

Before I ran the above prompt, I looked at the most popular models on the Ollama site and chose mistral-small3.1. There’s a good amount of buzz around Mistral, and it even has some vision capabilities.

Mistral Small 3.1 Model

I’m not overly concerned with the quality of the individual models we’re demoing here. I’m primarily interested in exploring the models’ capabilities, along with Ollama’s integrations. Let’s explore the following capabilities using models I selected after only brief research:

  • Audio -> Text Generation (Audio Transcription)
  • Multi-Modal Embeddings (for RAG and Vector DBs)
  • Integrations (e.g., Ollama server API, Model Context Protocol, Claude Code drop-in, etc.)

Let’s see if we can do anything interesting in these areas. Quickly browsing Ollama, I’ve landed on these models.

Model Name | Parameters | Capabilities
mistral-small3.1 | 24 billion | Text + Image -> Text Generation
deepseek-ocr:3b | 3 billion | Text + Image -> Text Generation (OCR)
x/flux2-klein:4b | 4 billion | Text -> Image Generation (No Linux support yet from Ollama 🙁)
gemma3n:e4b | 4 billion (effective) | Text + Image + Audio -> Text Generation (No audio input support yet from Ollama 🙁)
nomic-embed-text-v2-moe:latest | 305 million | Text Embeddings
MedAIBase/Qwen3-VL-Embedding:2b | 2 billion | Multi-Modal Embeddings (Throwing an inscrutable error for me 🙁)

Image generation and audio input aren’t fully supported in Ollama yet, even though the underlying models can be downloaded. We’ll explore Whisper and Stable Diffusion Web UI to fill in the gaps.

Text Generation

TODO.

Ensuring We’re On the GPU

TODO.

Tokens Per Second

TODO.
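One thing worth noting ahead of the benchmarks: Ollama’s /api/generate response includes an eval_count field (tokens generated) and an eval_duration field (reported in nanoseconds), so throughput can be derived directly. A small sketch of the conversion:

```python
def tokens_per_second(eval_count, eval_duration_ns):
    # Ollama reports eval_duration in nanoseconds; convert to seconds
    # before dividing to get generation throughput.
    return eval_count / (eval_duration_ns / 1e9)

# 100 tokens generated over 2 seconds -> 50.0 tokens/second
print(tokens_per_second(100, 2_000_000_000))
```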

Whisper

TODO.

Stable Diffusion Web UI

TODO.

Vector DBs and RAG

TODO.
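As a starting point for this section: Ollama exposes an embeddings endpoint, and comparing two embeddings by cosine similarity is the core operation behind vector search. A standard-library sketch (the /api/embeddings request shape is Ollama’s documented API at the time of writing; the helpers and the model choice are mine):

```python
import json
import math
import urllib.request

EMBED_URL = "http://localhost:11434/api/embeddings"  # Ollama's embeddings endpoint

def embed(text, model="nomic-embed-text-v2-moe:latest"):
    """Fetch an embedding vector from a locally running Ollama server."""
    data = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(
        EMBED_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]

def cosine_similarity(a, b):
    # Dot product over the product of magnitudes: 1.0 means identical
    # direction, 0.0 means orthogonal (unrelated) vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Usage (requires the ollama service running locally):
# cosine_similarity(embed("Rome is a city"), embed("The capital of Italy"))
```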