LangChain agent server on a VPS: building a self-hosted internal AI tool

TL;DR

LangChain (Python) is a framework for building AI agents quickly: it connects the LLM, tools, memory, and retrieval behind a single REST API server.
FastAPI + LangChain + LangGraph give you a stateful agent that can plan, execute tools, and replan based on the results.
Running on a Cloud VPS 40 (399k VND/month) is enough for a team of 30 concurrent users; no GPU is needed if you use the Claude/OpenAI API.
Real use cases: an internal assistant answering HR policy questions, a sales bot looking up the CRM, a dev assistant looking up docs, a BI bot querying SQL.
Bonus: integrate Slack/MS Teams via webhook so users can chat with the agent directly.

Every company has "knowledge silos": internal documents scattered across Google Drive, Notion, and Confluence; data in the CRM, HR system, and databases; SOPs in PDFs stored locally. New hires take 2-3 months to learn where to find information. A LangChain agent addresses this: you build one internal chatbot that understands context, calls tools to fetch data from multiple sources, and answers concisely and in context.

I built 4 LangChain agents for companies of 50-200 employees in 2025: an HR bot, a sales lookup bot, a BI SQL bot, and a dev knowledge bot. Operating cost was under 200k VND/month for the VPS plus 50-200k VND for the LLM API, depending on usage. They save 30-50% of the time spent searching for context. This post shares the full-stack setup.

1. Agent server architecture

User (Slack/Web)
  -> REST API FastAPI
    -> LangGraph orchestrator
      -> LLM (Claude/GPT-4o/Local)
        -> Tool 1: search_docs (RAG ChromaDB)
        -> Tool 2: query_database (Postgres MCP)
        -> Tool 3: fetch_jira_issue
        -> Tool 4: lookup_employee_info
      -> Memory (Redis conversation history)
      -> Output: structured JSON response
  <- Stream tokens via SSE

2. Installation and base stack

# VPS Cloud 40 (2GB RAM, 399k VND)
dnf install python3.12 python3-pip -y

mkdir -p /opt/ai-agent && cd /opt/ai-agent
python3.12 -m venv venv
source venv/bin/activate

# quote the extras so the shell does not glob the brackets
pip install fastapi "uvicorn[standard]" langchain langchain-anthropic \
    langchain-openai langchain-community langgraph chromadb \
    sentence-transformers psycopg2-binary redis sse-starlette \
    python-dotenv slack-sdk requests \
    langchain-chroma langgraph-checkpoint-redis slowapi langfuse  # imported in later sections

3. Basic FastAPI server

# /opt/ai-agent/main.py
from fastapi import FastAPI
from pydantic import BaseModel
from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent
from tools import tool_set
import os

app = FastAPI(title="Internal AI Agent")

llm = ChatAnthropic(
    model="claude-sonnet-4-7-20260520",
    api_key=os.environ['ANTHROPIC_API_KEY'],
    temperature=0
)

agent = create_react_agent(llm, tools=tool_set)

class Query(BaseModel):
    user_id: str
    message: str
    thread_id: str = "default"

@app.post("/chat")
async def chat(q: Query):
    result = await agent.ainvoke(
        {"messages": [("user", q.message)]},
        config={"configurable": {"thread_id": q.thread_id}}
    )
    return {"response": result['messages'][-1].content}

@app.get("/health")
def health():
    return {"status": "ok"}

4. Defining tools for the agent

# /opt/ai-agent/tools.py
from langchain_core.tools import tool
import psycopg2, requests, os

@tool
def search_company_docs(query: str) -> str:
    """Search internal company documents and policies. Use for HR policy, IT setup, SOP."""
    from chromadb import HttpClient
    client = HttpClient(host="localhost", port=8000)
    col = client.get_collection("company_docs")
    results = col.query(query_texts=[query], n_results=5)
    return "\n\n".join(results['documents'][0])

@tool
def query_employee_info(employee_email: str) -> dict:
    """Look up employee info by email: department, manager, start date, position."""
    conn = psycopg2.connect(os.environ['HR_DB_URL'])
    cur = conn.cursor()
    cur.execute("SELECT name, dept, manager, start_date, position FROM employees WHERE email=%s", (employee_email,))
    row = cur.fetchone()
    if not row:
        return {"error": "Employee not found"}
    return {"name": row[0], "department": row[1], "manager": row[2], "start_date": str(row[3]), "position": row[4]}

@tool
def fetch_jira_issue(issue_key: str) -> dict:
    """Fetch Jira issue details. Use for project status, bug info."""
    r = requests.get(
        f"https://your-domain.atlassian.net/rest/api/3/issue/{issue_key}",
        auth=(os.environ['JIRA_USER'], os.environ['JIRA_TOKEN'])
    )
    return r.json()

@tool
def run_sql_readonly(sql: str) -> list:
    """Run read-only SQL query on analytics database. Only SELECT allowed."""
    if not sql.strip().upper().startswith('SELECT'):
        return [{"error": "Only SELECT queries allowed"}]
    conn = psycopg2.connect(os.environ['ANALYTICS_DB_URL'])
    try:
        # The prefix check alone is easy to bypass, so also have Postgres enforce read-only
        conn.set_session(readonly=True)
        with conn.cursor() as cur:
            cur.execute(sql)
            cols = [d[0] for d in cur.description]
            return [dict(zip(cols, row)) for row in cur.fetchmany(100)]
    finally:
        conn.close()

tool_set = [search_company_docs, query_employee_info, fetch_jira_issue, run_sql_readonly]
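
The "only SELECT" rule for the SQL tool can be tightened a little before the query ever reaches the database. The following is a minimal sketch using only the standard library; `is_single_select` and the keyword list are hypothetical names chosen for illustration, and the check is deliberately conservative (it rejects CTEs and any query that mentions a write keyword, even inside a string literal). It is defence in depth, not a replacement for a database role that only has SELECT grants.

```python
import re

# Keywords that should never appear in a query the agent is allowed to run.
# Assumption: a conservative list for illustration, not an exhaustive one.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|copy|into)\b",
    re.IGNORECASE,
)


def is_single_select(sql: str) -> bool:
    """Return True only for one statement that starts with SELECT and has no write keywords."""
    stmt = sql.strip().rstrip(";").strip()
    if not stmt or ";" in stmt:
        # empty input, or more than one statement chained with ';'
        return False
    if not re.match(r"select\b", stmt, re.IGNORECASE):
        return False
    return FORBIDDEN.search(stmt) is None


print(is_single_select("SELECT name FROM products LIMIT 5"))
print(is_single_select("SELECT 1; DROP TABLE users"))
```

A check like this would run at the top of `run_sql_readonly`, before a connection is opened.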

5. RAG: indexing documents into ChromaDB

# /opt/ai-agent/index_docs.py
from langchain_community.document_loaders import (
    DirectoryLoader, PyPDFLoader, UnstructuredMarkdownLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_chroma import Chroma

def index():
    loader = DirectoryLoader('/data/company-docs', glob="**/*.md",
                             loader_cls=UnstructuredMarkdownLoader)
    docs = loader.load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(docs)

    embed = SentenceTransformerEmbeddings(model_name="BAAI/bge-m3")
    db = Chroma(
        collection_name="company_docs",
        embedding_function=embed,
        persist_directory="/opt/ai-agent/chroma-data"
    )
    db.add_documents(chunks)
    print(f"Indexed {len(chunks)} chunks")

if __name__ == "__main__":
    index()

BGE-M3 is a multilingual embedding model that handles Vietnamese well. A nightly cron job re-runs the indexer to pull in new documents.
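
To make the `chunk_size=1000, chunk_overlap=100` setting concrete, here is a simplified sketch of overlapping chunking. It is a plain sliding window, not the actual `RecursiveCharacterTextSplitter` algorithm (which also tries to break on paragraph and sentence separators); `split_with_overlap` is a hypothetical helper written for illustration.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list:
    """Cut text into fixed-size windows; each window re-reads the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # this window already reached the end of the text
            break
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
pieces = split_with_overlap(sample)
print([len(p) for p in pieces])  # [1000, 1000, 700]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the price of storing some text twice.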

6. Memory: conversation history with Redis

from langgraph.checkpoint.redis import RedisSaver

memory = RedisSaver.from_conn_string(
    redis_url="redis://localhost:6379/0"
)

agent = create_react_agent(
    llm,
    tools=tool_set,
    checkpointer=memory  # auto save state per thread_id
)

Each user/session gets its own thread_id. The agent remembers the context of the earlier conversation, so follow-up questions can resolve references such as "he" or "that one".
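
The idea of per-thread memory can be illustrated without Redis. The sketch below is an in-memory stand-in, not how `RedisSaver` works internally: `ThreadHistory` is a hypothetical class that keeps the most recent messages for each `thread_id` and keeps threads isolated from one another.

```python
from collections import defaultdict, deque


class ThreadHistory:
    """Keeps the last max_messages messages per thread_id (in memory only)."""

    def __init__(self, max_messages: int = 20):
        self._threads = defaultdict(lambda: deque(maxlen=max_messages))

    def append(self, thread_id: str, role: str, content: str) -> None:
        self._threads[thread_id].append((role, content))

    def messages(self, thread_id: str) -> list:
        return list(self._threads[thread_id])


history = ThreadHistory(max_messages=4)
history.append("slack-U1", "user", "Who manages the IT team?")
history.append("slack-U1", "assistant", "The IT manager is listed in the HR database.")
history.append("slack-U1", "user", "What is his email?")
history.append("slack-U2", "user", "How many leave days do I have?")

print(len(history.messages("slack-U1")))  # 3
print(len(history.messages("slack-U2")))  # 1
```

Because the follow-up "What is his email?" is stored in the same thread as the earlier question, the model receives both and can resolve the pronoun.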

7. System prompt to steer the agent

SYSTEM_PROMPT = """You are the internal AI assistant of the company TND, and your name is Tina.

Capabilities:
- Answer questions about HR policy, IT, and company processes.
- Look up employee information by email.
- Fetch Jira issues for project updates.
- Query the SQL analytics database (read-only).

Principles:
- Always answer in clear, concise Vietnamese, under 5 sentences unless a list is needed.
- If you don't know, say "I couldn't find that information, please ask your manager."
- Never make up data. Always cite the source when answering from a document.
- Security: never reveal passwords, keys, or tokens in a response.
- Tone: friendly, professional, not overly formal."""

agent = create_react_agent(
    llm,
    tools=tool_set,
    checkpointer=memory,
    state_modifier=SYSTEM_PROMPT
)

8. Deploy with systemd and Caddy

# /etc/systemd/system/ai-agent.service
[Unit]
Description=AI Agent FastAPI
After=network.target

[Service]
User=aiagent
WorkingDirectory=/opt/ai-agent
EnvironmentFile=/opt/ai-agent/.env
ExecStart=/opt/ai-agent/venv/bin/uvicorn main:app --host 127.0.0.1 --port 8001 --workers 2
Restart=on-failure

[Install]
WantedBy=multi-user.target

# Caddyfile
ai.your-domain.com {
    reverse_proxy 127.0.0.1:8001 {
        flush_interval -1
        transport http {
            response_header_timeout 120s
        }
    }
    # /chat* (not /chat) so that /chat/stream is covered as well
    basicauth /chat* {
        team $2a$14$your_caddy_hash
    }
}

9. Slack slash command integration

import asyncio
import requests
from fastapi import Request

@app.post("/slack/command")
async def slack_command(req: Request):
    form = await req.form()
    user_id = form['user_id']
    text = form['text']

    # Slack expects a response within 3s, so acknowledge first and answer later
    asyncio.create_task(process_and_respond(user_id, text, form['response_url']))
    return {"response_type": "ephemeral", "text": "Tina is thinking..."}

async def process_and_respond(user_id, message, response_url):
    result = await agent.ainvoke(
        {"messages": [("user", message)]},
        config={"configurable": {"thread_id": f"slack-{user_id}"}}
    )
    answer = result['messages'][-1].content
    # requests is blocking, so run it in a worker thread to keep the event loop free
    await asyncio.to_thread(
        requests.post, response_url,
        json={"response_type": "in_channel", "text": answer}, timeout=10
    )

In your Slack workspace, install a Slack app and point the slash command /tina at https://ai.your-domain.com/slack/command. A user who types /tina what is the remote work policy? sees Tina's response within 3-10 seconds.
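
The `/slack/command` endpoint sits outside the basic-auth rule, so it is worth checking that each request really comes from Slack. Slack signs every request with the app's signing secret and sends the result in the `X-Slack-Signature` and `X-Slack-Request-Timestamp` headers. Below is a minimal sketch of that check using only the standard library; `verify_slack_signature` is a hypothetical helper name, and the 5-minute replay window is an assumed value.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(signing_secret: str, timestamp: str, body: bytes,
                           signature: str, max_age_s: int = 300,
                           now: Optional[float] = None) -> bool:
    """Recompute Slack's v0 signature over the raw body and compare in constant time."""
    now = time.time() if now is None else now
    try:
        ts = int(timestamp)
    except ValueError:
        return False
    if abs(now - ts) > max_age_s:
        # too old (or too far in the future): treat as a possible replay
        return False
    base = b"v0:" + timestamp.encode() + b":" + body
    expected = "v0=" + hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

In the FastAPI handler this would be called with the raw request body (`await req.body()`) before parsing the form, returning a 401 when it fails.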

10. Streaming responses over SSE

Streaming gives better UX: the user sees tokens appear gradually instead of waiting 10s for the full response:

from sse_starlette.sse import EventSourceResponse

@app.post("/chat/stream")
async def chat_stream(q: Query):
    async def event_generator():
        async for chunk in agent.astream(
            {"messages": [("user", q.message)]},
            config={"configurable": {"thread_id": q.thread_id}},
            stream_mode="updates"  # one event per completed step; use "messages" for per-token chunks
        ):
            for k, v in chunk.items():
                if k == 'agent':
                    content = v['messages'][-1].content
                    if content:
                        yield {"event": "token", "data": content}
        yield {"event": "done", "data": ""}
    return EventSourceResponse(event_generator())
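
For reference, this is roughly what one of those events looks like on the wire. The sketch below formats an event by hand following the Server-Sent Events text format; `format_sse` is a hypothetical helper for illustration only, since `EventSourceResponse` does the framing for you (and may use different line endings).

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Serialise one Server-Sent Event: optional 'event:' line, one 'data:' line per text line."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # a blank line terminates the event
    return "\n".join(lines) + "\n\n"


print(format_sse("Annual leave is 12 days.", event="token"), end="")
print(format_sse("", event="done"), end="")
```

On the browser side, an `EventSource` or a `fetch` stream reader splits on the blank line and dispatches by event name.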

11. Observability: tracing with LangSmith or Langfuse

Self-host Langfuse to track every request: prompt input, tool calls, LLM cost, latency. Set up Langfuse with Docker:

# /opt/langfuse/docker-compose.yml
services:
  langfuse-server:
    image: langfuse/langfuse:3
    environment:
      DATABASE_URL: postgresql://lf:lfpass@db:5432/langfuse
      # Compose does not run $(...); generate once with `openssl rand -base64 32` and export NEXTAUTH_SECRET
      NEXTAUTH_SECRET: ${NEXTAUTH_SECRET}
      NEXTAUTH_URL: https://lf.your-domain.com
    ports: ["3001:3000"]
  db:
    image: postgres:16
    environment: { POSTGRES_USER: lf, POSTGRES_PASSWORD: lfpass, POSTGRES_DB: langfuse }
    volumes: [lf-data:/var/lib/postgresql/data]
volumes:
  lf-data:

In the agent code, initialise the Langfuse callback:

from langfuse.callback import CallbackHandler
lf = CallbackHandler(public_key="pk_lf_...", secret_key="sk_lf_...", host="https://lf.your-domain.com")
result = await agent.ainvoke(input, config={"callbacks": [lf]})

12. Cost monitoring and rate limiting

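from datetime import date

import redis
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
# assumed Redis location; reuse whatever instance holds your cost counters
redis_client = redis.Redis.from_url("redis://localhost:6379/0")

@app.post("/chat")
@limiter.limit("20/minute")
async def chat(request: Request, q: Query):
    # check user budget remaining
    today = date.today().isoformat()
    daily_cost = redis_client.get(f"cost:{q.user_id}:{today}")
    if daily_cost and float(daily_cost) > 2.0:
        return {"error": "Daily budget exceeded (2 USD)"}
    ...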

Each user is capped at 2 USD/day in API cost and 20 requests/minute. This keeps a single user from eating the whole team's quota.
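
The budget check only works if something writes the per-user cost after each call. Here is a minimal in-memory sketch of that bookkeeping; the per-token prices are assumed placeholder values (substitute your provider's current rates), and `request_cost`, `record_usage`, and `over_budget` are hypothetical helpers. In production the counter would live in Redis under the same `cost:{user_id}:{date}` key the endpoint reads.

```python
from collections import defaultdict
from datetime import date
from typing import Optional

# Assumed prices in USD per million tokens; replace with your provider's real rates.
PRICE_IN_USD_PER_MTOK = 3.0
PRICE_OUT_USD_PER_MTOK = 15.0
DAILY_LIMIT_USD = 2.0

_spend = defaultdict(float)  # (user_id, day) -> USD spent that day


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one LLM call in USD from its token counts."""
    return (input_tokens * PRICE_IN_USD_PER_MTOK
            + output_tokens * PRICE_OUT_USD_PER_MTOK) / 1_000_000


def record_usage(user_id: str, input_tokens: int, output_tokens: int,
                 day: Optional[date] = None) -> float:
    """Add one call's cost to the user's daily total and return the new total."""
    key = (user_id, day or date.today())
    _spend[key] += request_cost(input_tokens, output_tokens)
    return _spend[key]


def over_budget(user_id: str, day: Optional[date] = None) -> bool:
    return _spend[(user_id, day or date.today())] > DAILY_LIMIT_USD


print(request_cost(2000, 1000))
```

Token counts are usually available on the LLM response metadata, so `record_usage` can be called right after each agent run.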

13. Multi-agent: a specialised agent per domain

A single monolithic agent that handles everything becomes unwieldy. Split it into several specialised agents, each with its own toolset, coordinated by a router agent:

HR Agent: tools = [search_hr_docs, query_employee, leave_balance]
Sales Agent: tools = [crm_lookup, deal_status, customer_history]
Dev Agent: tools = [search_code, jira_issue, ci_status, deploy_logs]
Router Agent: receives the query, classifies intent, routes to the matching agent

LangGraph workflow design pattern: one subgraph per specialised agent, with a supervisor agent making the routing decision.
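
The routing step can be pictured with a toy classifier. The sketch below uses keyword matching as a stand-in for the LLM-based intent classification a real router or supervisor agent would perform; the keyword lists, the `route` function, and the default choice are all hypothetical.

```python
import re

# Hypothetical keyword lists; a real router would ask the LLM to classify intent.
ROUTES = {
    "hr": ("leave", "policy", "payroll", "onboarding"),
    "sales": ("deal", "customer", "crm", "quote"),
    "dev": ("deploy", "jira", "build", "bug"),
}


def route(query: str, default: str = "hr") -> str:
    """Pick the agent whose keywords appear most often in the query."""
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    scores = {agent: sum(k in words for k in kws) for agent, kws in ROUTES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


print(route("What is the status of deal 42 for customer ACME?"))  # sales
print(route("Why did the deploy fail on the last build?"))        # dev
```

Whatever does the classification, the important design point is the same: each specialised agent only ever sees its own toolset.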

14. Security: preventing prompt injection

Input validation: reject prompts that are too long (> 5000 characters) or contain suspicious characters.
Output filter: regex checks so that passwords, API keys, and tokens do not leak.
Tool sandboxing: the SQL tool allows only SELECT, no EXEC/DROP. The file tool can read only /data, not /etc.
Role-based tool access: the HR Agent cannot call financial tools.
Log every request and alert on unusual patterns (e.g. 100 queries/minute from one user).
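
The first two items can be sketched in a few lines. This is an illustration under stated assumptions, not a complete defence: "suspicious characters" is interpreted here as control characters, the secret patterns are example regexes rather than a full catalogue, and `validate_prompt` and `redact` are hypothetical helper names.

```python
import re

MAX_PROMPT_CHARS = 5000

# Example patterns only; extend them for the secret formats used in your company.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{16,}"),  # API-key-like strings (assumed format)
    re.compile(r"(?i)\b(password|passwd|token|api[_-]?key)\b\s*[:=]\s*\S+"),
]


def validate_prompt(text: str) -> bool:
    """Reject over-long prompts and prompts containing control characters."""
    if len(text) > MAX_PROMPT_CHARS:
        return False
    return not any(ord(c) < 32 and c not in "\t\n\r" for c in text)


def redact(text: str) -> str:
    """Replace anything that looks like a credential before the response leaves the server."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


print(validate_prompt("What is the VPN setup procedure?"))  # True
print(redact("The wifi password: hunter2 is on the wiki"))
```

`validate_prompt` would run before the agent is invoked and `redact` on the final answer, just before it is returned or streamed.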

15. Testing the agent with an eval framework

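# /opt/ai-agent/eval.py
from main import agent  # assumes the agent object is defined in main.py (section 3)

test_cases = [
    # "What is the annual leave policy?" (the Vietnamese answer should mention "12 days")
    {"q": "Chế độ nghỉ phép năm là gì?", "expected_tool": "search_company_docs", "expected_keyword": "12 ngày"},
    # "What is the IT department manager's email?"
    {"q": "Email của manager phòng IT là gì?", "expected_tool": "query_employee_info"},
    # "Top 5 best-selling products this month?"
    {"q": "Top 5 sản phẩm bán chạy tháng này?", "expected_tool": "run_sql_readonly"},
]

passed = 0
for i, tc in enumerate(test_cases):
    result = agent.invoke(
        {"messages": [("user", tc['q'])]},
        config={"configurable": {"thread_id": f"eval-{i}"}},
    )
    messages = result['messages']
    # collect the names of every tool the agent called during this run
    called = {c['name'] for m in messages for c in (getattr(m, 'tool_calls', None) or [])}
    keyword = tc.get('expected_keyword')  # optional: not every case has one
    ok = tc['expected_tool'] in called and (keyword is None or keyword in messages[-1].content)
    passed += ok
    print(f"{'PASS' if ok else 'FAIL'}: {tc['q']}")

print(f"\n{passed}/{len(test_cases)} passed")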

16. Realistic cost for a 100-user team

Resource	Cost/month
Cloud VPS 40 (2GB RAM)	399,000 VND
Claude API (500 queries/day, ~3000 tokens avg)	~2,500,000 VND
Embedding (BGE-M3 local, free)	0 VND
Storage for docs + Chroma	included in VPS
Total	~2,900,000 VND

Compared with hiring one part-time internal support employee at 5-8 million VND/month, the agent is clearly cheaper. The ROI is clear for companies with 50+ employees.
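
The figures in the table can be sanity-checked with a few lines of arithmetic. The only assumption added here is a 30-day month; everything else comes from the table.

```python
QUERIES_PER_DAY = 500
TOKENS_PER_QUERY = 3000      # average from the table
DAYS_PER_MONTH = 30          # assumption for this estimate

VPS_VND = 399_000
LLM_VND = 2_500_000          # approximate Claude API bill from the table

monthly_tokens = QUERIES_PER_DAY * TOKENS_PER_QUERY * DAYS_PER_MONTH
total_vnd = VPS_VND + LLM_VND
vnd_per_1k_tokens = LLM_VND / (monthly_tokens / 1000)
vnd_per_query = LLM_VND / (QUERIES_PER_DAY * DAYS_PER_MONTH)

print(monthly_tokens)               # 45000000
print(total_vnd)                    # 2899000
print(round(vnd_per_1k_tokens, 1))  # 55.6
print(round(vnd_per_query, 1))      # 166.7
```

So the estimate corresponds to about 45 million tokens a month, and the LLM bill, not the VPS, is what dominates the total.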

17. Caching layer to cut LLM cost

Many internal questions repeat (e.g. "how many days of annual leave do we get", "office wifi password"). Caching responses cuts LLM cost by 30-50%:

from langchain.cache import RedisSemanticCache
from langchain.globals import set_llm_cache

set_llm_cache(RedisSemanticCache(
    redis_url="redis://localhost:6379/1",
    embedding=embed,
    score_threshold=0.95
))

A semantic cache matches questions with equivalent meaning (e.g. "annual leave" and "days off per year") and serves them the same cached response. Use a 24h TTL for answers tied to documents that change rarely.
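
The matching logic behind a semantic cache is a nearest-neighbour lookup with a cut-off. The sketch below shows it with hand-made 2-D vectors in place of real embeddings; `TinySemanticCache` is a hypothetical class, and the threshold here is a cosine similarity (higher means stricter). Some libraries express the threshold as a distance instead, where lower means stricter, so it is worth checking which convention yours uses.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinySemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def put(self, embedding, answer: str) -> None:
        self.entries.append((embedding, answer))

    def get(self, embedding):
        """Return the cached answer of the most similar entry, if similar enough."""
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(embedding, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = TinySemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "Annual leave is 12 days.")  # toy embedding of "annual leave"
print(cache.get([0.99, 0.05]))  # near-identical meaning: cache hit
print(cache.get([0.0, 1.0]))    # unrelated question: None
```

A threshold that is too loose returns the wrong cached answer for a different question, so it is safer to start strict and relax it while watching the hit rate.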

18. Lessons after 1 year of running agents in production

Users only use the agent when the UX is fast, with responses in under 5s. Streaming is mandatory.
Building trust takes time. In the first month users test it with trick questions, in months 2-3 they gradually start using it for real, and from month 4 onward it becomes a habit.
Tool design matters more than the choice of LLM. When tools are clearly described and return concise data, the agent uses them correctly on its own.
Update the RAG index weekly; otherwise documents that are 6 months out of date make the agent answer incorrectly.
Always log user feedback (thumbs up/down). Use it for small-scale RLHF-style training to improve response quality.

An agent server is not a silver bullet. It is a multiplier for the internal support team, taking 30-50% of repetitive questions off their load, but humans are still needed for complex cases. Done right, the ROI is very high for companies with 50+ employees.

VPS for running a LangChain agent server for businesses

Cloud VPS TND ships with AlmaLinux 9, Ubuntu 22/24, and Debian 12/13 ready to use. Ceph SSD storage, 1-click snapshots, daily backups, 200 Mbps domestic network. Cloud VPS 40 (2GB RAM, 399k VND) is enough to run an agent server for a team of 50-100 users; scale up to Cloud VPS 80 when the vector DB grows.

FAQ

Is LangChain or LlamaIndex better?

LangChain is more general (agents, tools, chains). LlamaIndex specialises in RAG (retrieval, indexing). For agents with complex tool use, LangChain is the better fit. For systems focused on document search, LlamaIndex is stronger. You can also combine both: LlamaIndex for RAG, LangChain for orchestration.

Does the VPS need a GPU?

No, if you use the Claude/OpenAI/Gemini API. The LLM runs in the provider's cloud. The VPS only needs enough CPU for FastAPI and a small embedding model (BGE-M3 runs fine on CPU). If you want to self-host the LLM (Llama, Qwen), you need a GPU or a powerful CPU.

Are there alternatives to LangChain?

Yes. LangGraph (same team as LangChain, lower-level), Haystack (Pythonic, long-established), Semantic Kernel (Microsoft), Flowise (no-code UI), LlamaIndex agents. The choice depends on project complexity and team preference. LangChain is the most popular and has many tutorials.

How do you prevent the agent from hallucinating?

1) Temperature=0 for more deterministic output. 2) A system prompt that requires "if you don't know, say you don't know". 3) RAG that supplies source citations. 4) Tools that return structured data instead of free text. 5) Running the eval suite regularly. The agent can still be wrong; it is never 100% reliable, so critical decisions need human review.

Can it integrate with MS Teams instead of Slack?

Yes, using Microsoft Bot Framework or a Power Automate webhook. The workflow is similar: the user types a message, Teams sends a POST request to the agent's API endpoint, and the agent processes it and replies. Setting up a Teams app is more complex than Slack and takes 1-2 days.