How to Build Your Own Private AI: Deploying Phi-3 with Ollama and a WebUI on Ubuntu 24.04
Large Language Models (LLMs) have fundamentally changed how we interact with artificial intelligence, powering everything from advanced coding assistants to everyday conversational bots. However, relying on cloud-based giants like OpenAI or Google means sending your private data over the internet and paying ongoing API fees. The solution is running AI locally. Highly efficient models like Microsoft's Phi-3 are leading this shift, proving that you no longer need massive data centers to achieve incredible results. Phi-3 delivers state-of-the-art reasoning and performance while being compact enough to run smoothly on your own private GPU server.
To harness the power of models like Phi-3 on an Ubuntu 24.04 GPU server, we will use Ollama. Ollama is a powerful, developer-friendly tool that completely simplifies the process of downloading, managing, and running open-source LLMs. Instead of wrestling with complex Python environments, complicated dependencies, and manually loading model weights, Ollama acts as a streamlined local API server. It handles the heavy lifting of model inference in the background, automatically utilizing your server's GPU to ensure lightning-fast generation speeds.
Finally, an AI model isn't very accessible if you can only interact with it through a command-line terminal. That's where the WebUI comes in. In this tutorial, we will build a lightweight, Flask-based Web User Interface that connects directly to your local Ollama instance. This setup will give you a sleek, browser-based chat experience just like ChatGPT but hosted entirely on your own hardware. By the end of this guide, you will have a fully functional, highly performant AI ecosystem that offers complete control over your privacy, performance, and costs.
Prerequisites
- An Ubuntu 24.04 server with an NVIDIA GPU.
- A non-root user or a user with sudo privileges.
- NVIDIA drivers installed on your server.
Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA GPU installed.
source ~/.bashrc
sudo systemctl enable ollama
sudo systemctl start ollama
curl http://localhost:11434
Ollama is running
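You can also verify the service from a script rather than the shell. Below is a minimal sketch using only the Python standard library; it checks the same base URL as the curl command above (the `is_ollama_up` helper is an illustrative name, not part of Ollama):

```python
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default listen address

def is_ollama_up(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if the Ollama server answers on its base URL."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            # The root endpoint returns the plain text "Ollama is running".
            return resp.status == 200 and b"Ollama is running" in resp.read()
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    print("Ollama is up" if is_ollama_up() else "Ollama is not reachable")
```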
Run a Phi-3 Model with Ollama
ollama run phi3
If Phi-3 isn’t already on your system, this command will:
- Download the model locally
- Launch an interactive terminal session
- Let you type messages for Phi-3 to respond to in real time
>>> What is Ollama?
Ollama seems to be a term that doesn't correspond with well-known technology or concepts as of my last update in April 2023. It might either refer to an emerging tool, application, system, project, acronym, organization, or even fictional element not recognized within the industry up until then.
Press Ctrl+D (or type /bye) to exit the session.
Use Ollama Programmatically with cURL
Beyond the terminal, Ollama exposes a local REST API, which is perfect for:
- Building custom tools
- Integrating with your apps
- Testing prompts with curl
curl http://localhost:11434/api/generate -d '{
"model": "phi3",
"prompt": "Explain what is GPU? Answer in short paragraph",
"stream": false
}'
{"model":"phi3","created_at":"2025-07-30T14:44:34.276761484Z","response":"A Graphics Processing Unit, or GPU, is a specialized electronic circuit designed to accelerate the creation of computer graphics and facilitate complex calculations for scientific research. Traditionally used solethy primarily by professional graphic designers and video game developers as it excels at processing large blocks of data simultaneously which allows rendering high-quality images with great speed; modern GPUs have evolved into multipurpose processors that can run a wide array of applications including machine learning, cryptocurrency mining, media encoding/decoding, scientific simulations. They are typically found on personal computers as graphics cards or in mobile devices such as smartphones and tablets where computational efficiency is paramount for performance but their use extends far beyond just rendering visuals - they can perform general-purpose calculations quickly due to a parallel processing architecture that's significantly more efficient at handling the matrix and vector operations commonly found in today’s computing 
tasks.","done":true,"done_reason":"stop","context":[32010,29871,13,9544,7420,825,338,22796,29973,673,297,3273,14880,32007,29871,13,32001,29871,13,29909,29247,10554,292,13223,29892,470,22796,29892,338,263,4266,1891,27758,11369,8688,304,15592,403,278,11265,310,6601,18533,322,16089,10388,4280,17203,363,16021,5925,29889,18375,17658,1304,14419,21155,19434,491,10257,3983,293,2874,414,322,4863,3748,18777,408,372,5566,1379,472,9068,2919,10930,310,848,21699,607,6511,15061,1880,29899,29567,4558,411,2107,6210,29936,5400,22796,29879,505,15220,1490,964,6674,332,4220,1889,943,393,508,1065,263,9377,1409,310,8324,3704,4933,6509,29892,24941,542,10880,1375,292,29892,5745,8025,29914,7099,3689,29892,16021,23876,29889,2688,526,12234,1476,373,7333,23226,408,18533,15889,470,297,10426,9224,1316,408,15040,561,2873,322,1591,1372,988,26845,19201,338,1828,792,363,4180,541,1009,671,4988,2215,8724,925,15061,7604,29879,448,896,508,2189,2498,29899,15503,4220,17203,9098,2861,304,263,8943,9068,11258,393,29915,29879,16951,901,8543,472,11415,278,4636,322,4608,6931,15574,1476,297,9826,30010,29879,20602,9595,29889],"total_duration":5109776741,"load_duration":6017051,"prompt_eval_count":19,"prompt_eval_duration":20751507,"eval_count":189,"eval_duration":5082086586}
Create a Flask WebUI for Chat Interface
Make interacting with Phi-3 more intuitive by creating a simple Flask WebUI. This interface lets you:
- Send prompts to the model
- View responses instantly
- Experience Phi-3 like a local ChatGPT!
python3 -m venv ollama-env
source ollama-env/bin/activate
pip install flask flask-cors requests
nano app.py
from flask import Flask, request, jsonify, render_template
import requests

app = Flask(__name__)

OLLAMA_API = "http://localhost:11434/api/generate"

@app.route("/")
def home():
    return render_template("index.html")

@app.route("/chat", methods=["POST"])
def chat():
    user_input = request.json.get("prompt")
    response = requests.post(OLLAMA_API, json={
        "model": "phi3",
        "prompt": user_input,
        "stream": False
    })
    return jsonify(response.json())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
- Make a folder named templates
- Inside it, create a file called index.html
This will hold the web interface for your Flask WebUI.
mkdir templates
nano templates/index.html
<!DOCTYPE html>
<html>
<head>
<title>Phi-3 Chatbot</title>
</head>
<body>
<h1>Chat with Phi-3</h1>
<textarea id="input" rows="4" cols="50"></textarea><br>
<button onclick="send()">Send</button>
<pre id="output"></pre>
<script>
async function send() {
const prompt = document.getElementById("input").value;
const response = await fetch("/chat", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ prompt })
});
const data = await response.json();
document.getElementById("output").textContent = data.response;
}
</script>
</body>
</html>
python3 app.py
Then open http://your-server-ip:5000 in your browser and try a prompt such as:
“Give me 3 startup ideas related to climate change.”
Click Send, and Phi-3's response will appear right below the input box in your WebUI.
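If you later want to script against the WebUI rather than clicking through the browser, the /chat route can be called directly. Here is a minimal stdlib-only sketch, assuming the Flask app from app.py is running on port 5000 (`ask` and `build_chat_body` are illustrative names):

```python
import json
import urllib.request

WEBUI_CHAT = "http://localhost:5000/chat"  # Flask dev server from app.py

def build_chat_body(prompt: str) -> bytes:
    """JSON body expected by the /chat route in app.py."""
    return json.dumps({"prompt": prompt}).encode("utf-8")

def ask(prompt: str, url: str = WEBUI_CHAT) -> str:
    """POST a prompt to the Flask WebUI and return Phi-3's reply text."""
    req = urllib.request.Request(
        url,
        data=build_chat_body(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # app.py forwards Ollama's JSON, so the text lives under "response"
        return json.loads(resp.read()).get("response", "")

if __name__ == "__main__":
    print(ask("Give me 3 startup ideas related to climate change."))
```

Adjust the host in `WEBUI_CHAT` if you are calling the server from another machine.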