Guide to Installing the GPT OSS 20B Model with Ollama

2026-01-01

2026-02-12

This guide explains how to run an open-source GPT large language model in a local environment using Ollama.

1. Installing Ollama

macOS/Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the Windows installer from the official Ollama website and run it.

Verifying the installation

ollama --version

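2. Available GPT-family open-source models

Major GPT-family open-source models that can be used with Ollama:

GPT-NeoX-based models

StableLM: Stability AI's open-source model
Dolly: Databricks' instruction-following model

Other model examples (none of these is exactly 20B)

# StableLM 2 (available in several sizes)
ollama pull stablelm2

# CodeLlama 34B (a code-generation model larger than 20B)
ollama pull codellama:34b

# Mixtral 8x7B (mixture-of-experts architecture)
ollama pull mixtral:8x7b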

3. Installing and running the model

Basic installation

# Download the 20b model
ollama pull gpt-oss:20b
# Download the 120b model
ollama pull gpt-oss:120b

❯ ollama pull gpt-oss:20b
pulling manifest
pulling e7b273f96360: 100% ▕██████████████████████████████████████████████████████████▏  13 GB
pulling fa6710a93d78: 100% ▕██████████████████████████████████████████████████████████▏ 7.2 KB
pulling f60356777647: 100% ▕██████████████████████████████████████████████████████████▏  11 KB
pulling d8ba2f9a17b3: 100% ▕██████████████████████████████████████████████████████████▏   18 B
pulling 776beb3adb23: 100% ▕██████████████████████████████████████████████████████████▏  489 B
verifying sha256 digest
writing manifest
success

❯ ollama show gpt-oss:20b
  Model
    architecture        gptoss
    parameters          20.9B
    context length      131072
    embedding length    2880
    quantization        MXFP4

  Capabilities
    completion
    tools
    thinking

  Parameters
    temperature    1

  License
    Apache License
    Version 2.0, January 2004
    ...
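
As a rough sanity check on the 13 GB download, the parameter count and quantization reported by `ollama show` can be turned into a size estimate. The sketch below assumes the Microscaling definition of MXFP4 (4-bit values plus one shared 8-bit scale per block of 32 values, roughly 4.25 bits per parameter) and, as a lower bound only, that every parameter is stored that way. The function name is illustrative.

```python
def mxfp4_size_gb(num_params: float, block_size: int = 32) -> float:
    """Estimate storage in GB (10^9 bytes) if all weights were MXFP4."""
    # Assumption: 4-bit value plus one shared 8-bit scale per block
    bits_per_param = 4 + 8 / block_size
    return num_params * bits_per_param / 8 / 1e9

estimate = mxfp4_size_gb(20.9e9)  # parameter count from `ollama show`
print(f"{estimate:.1f} GB")
```

The estimate lands somewhat below the 13 GB that `ollama pull` reports. One plausible explanation is that some tensors are kept at higher precision, but the output above does not confirm this.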

Running the model

# Run in interactive mode
ollama run gpt-oss:20b

# Run with a prompt
ollama run gpt-oss:20b "Explain quantum computing in simple terms"

❯ ollama run gpt-oss:20b
>>> 안녕
Thinking...
User says "안녕" which means "Hi" in Korean. The assistant should respond accordingly in Korean. We can say "안녕하
세요! 어떻게 도와드릴까요?" Or similar. We need to respond appropriately. The instruction: "You are ChatGPT, a
large language model trained by OpenAI." So respond politely. Should I ask a follow-up question? The user just
says hello. We can reply with a friendly greeting and ask if there's anything they need help with. We'll reply in
Korean. The content must be consistent. So final.
...done thinking.

안녕하세요! 무엇을 도와드릴까요? 😊

4. Creating a custom model (using a Modelfile)

To customize a 20B-class model, create a Modelfile.

Modelfile example

# Modelfile
FROM gpt-oss:20b

# Temperature (controls creativity)
PARAMETER temperature 0.8

# Context window size in tokens (ollama show reports a maximum of 131072 for this model)
PARAMETER num_ctx 4096

# System prompt
SYSTEM """
You are a helpful AI assistant specialized in technical documentation and code explanation.
"""

Creating and running the custom model

# Create the model
ollama create my-custom-gpt -f ./Modelfile

# Run the created model
ollama run my-custom-gpt
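
When several variants of a custom model are needed, the Modelfile text can be generated rather than edited by hand. The following is a minimal sketch that uses only the FROM, PARAMETER, and SYSTEM instructions shown in the Modelfile example; `render_modelfile` is a hypothetical helper, not part of Ollama, and it assumes the system prompt contains no triple quotes.

```python
def render_modelfile(base: str, system: str, **parameters) -> str:
    """Build Modelfile text from a base model, system prompt, and parameters."""
    lines = [f"FROM {base}", ""]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    # Assumption: the system prompt itself contains no triple quotes
    lines += ["", f'SYSTEM """\n{system}\n"""', ""]
    return "\n".join(lines)

text = render_modelfile(
    "gpt-oss:20b",
    "You are a helpful AI assistant specialized in technical documentation and code explanation.",
    temperature=0.8,
    num_ctx=4096,
)
print(text)
```

Writing `text` to a file named Modelfile and running `ollama create my-custom-gpt -f ./Modelfile` would then proceed as in the commands above.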

5. Using Ollama from Python

Installation

pip install ollama

Code example

import ollama

# Send a single chat request to the model
response = ollama.chat(model='gpt-oss:20b', messages=[
    {
        'role': 'user',
        'content': 'Write C++ code that prints HelloWorld',
    },
])

print(response['message']['content'])

Streaming responses

import ollama

stream = ollama.chat(
    model='gpt-oss:20b',
    messages=[{'role': 'user', 'content': 'Write C++ code that prints HelloWorld'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
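
The streaming loop prints each fragment but does not keep the full reply, which is needed later if the answer should be added to a conversation history. A minimal sketch of accumulating the fragments follows. `collect_stream` is a hypothetical helper, and the stand-in chunks are made up to match the chunk shape used in the loop above, so the example runs without an Ollama server.

```python
from typing import Iterable, Mapping

def collect_stream(chunks: Iterable[Mapping]) -> str:
    """Print each streamed fragment and return the concatenated reply."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)

# Stand-in chunks (made up for illustration); in real use, pass the
# iterator returned by ollama.chat(..., stream=True)
fake_stream = [
    {"message": {"role": "assistant", "content": "Hello"}},
    {"message": {"role": "assistant", "content": ", world"}},
    {"message": {"role": "assistant", "content": "!"}},
]
reply = collect_stream(fake_stream)
```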

6. Maintaining conversation context

To continue a previous conversation with Ollama, you need to manage the message history yourself.

Maintaining conversation context in Python

import ollama

# Conversation history
conversation_history = []

def chat_with_context(user_message):
    # Append the new user message
    conversation_history.append({
        'role': 'user',
        'content': user_message
    })

    # Send the request with the full conversation history
    response = ollama.chat(
        model='gpt-oss:20b',
        messages=conversation_history
    )

    # Append the response to the history
    conversation_history.append({
        'role': 'assistant',
        'content': response['message']['content']
    })

    return response['message']['content']

# Usage example
print(chat_with_context("Hello, please tell me about Python."))
print(chat_with_context("Summarize just the three main features from what you just explained."))  # refers to the previous turn

Using a system prompt

import ollama

# Initialize the conversation with a system prompt
conversation_history = [
    {
        'role': 'system',
        'content': 'You are a helpful coding assistant. Answer in Korean.'
    }
]

def chat_with_system_context(user_message):
    conversation_history.append({
        'role': 'user',
        'content': user_message
    })

    response = ollama.chat(
        model='gpt-oss:20b',
        messages=conversation_history
    )

    conversation_history.append(response['message'])

    return response['message']['content']

Managing conversation context (limiting memory)

import ollama

MAX_HISTORY = 10  # Keep at most 10 messages, system prompt included

class ChatSession:
    def __init__(self, model='gpt-oss:20b', system_prompt=None):
        self.model = model
        self.history = []

        if system_prompt:
            self.history.append({
                'role': 'system',
                'content': system_prompt
            })

    def chat(self, user_message):
        # Append the user message
        self.history.append({
            'role': 'user',
            'content': user_message
        })

        # If the history is too long, drop the oldest messages (the system prompt is kept)
        if len(self.history) > MAX_HISTORY:
            system_messages = [msg for msg in self.history if msg['role'] == 'system']
            recent_messages = self.history[-(MAX_HISTORY-len(system_messages)):]
            self.history = system_messages + recent_messages

        # Generate the response
        response = ollama.chat(
            model=self.model,
            messages=self.history
        )

        # Store the response
        self.history.append(response['message'])

        return response['message']['content']

    def clear(self):
        """Reset the conversation history, keeping the system prompt."""
        system_messages = [msg for msg in self.history if msg['role'] == 'system']
        self.history = system_messages

# Usage example
session = ChatSession(
    model='gpt-oss:20b',
    system_prompt='You are a helpful AI assistant. Answer in Korean.'
)

print(session.chat("What is the difference between a list and a tuple in Python?"))
print(session.chat("So when is it better to use a tuple?"))  # refers to the previous turn
print(session.chat("Show me just one code example"))  # continues the conversation
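
The trimming step can also be written as a standalone function, which makes it easy to test without calling the model. The sketch below is one possible variant, not the class's exact behavior: it additionally drops leading assistant replies so that the retained window starts on a user turn. `trim_history` is a hypothetical name.

```python
def trim_history(history: list[dict], max_messages: int) -> list[dict]:
    """Keep system messages plus the most recent turns, at most max_messages total."""
    if len(history) <= max_messages:
        return list(history)
    system = [m for m in history if m["role"] == "system"]
    others = [m for m in history if m["role"] != "system"]
    keep = max(max_messages - len(system), 0)
    recent = others[-keep:] if keep else []
    # Drop leading assistant replies so the window starts on a user turn
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return system + recent

# Small demo history: system, u1, a1, u2, a2, u3
demo = [{"role": "system", "content": "s"}]
for i in range(1, 4):
    demo.append({"role": "user", "content": f"u{i}"})
    if i < 3:
        demo.append({"role": "assistant", "content": f"a{i}"})

trimmed = trim_history(demo, 3)
print([m["content"] for m in trimmed])
```

Starting the window on a user turn is a design choice made here for illustration; whether the model copes well with a history that begins on an assistant reply is not something the guide establishes.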

Saving and loading conversations

import ollama
import json

class PersistentChatSession:
    def __init__(self, model='gpt-oss:20b', session_file='chat_history.json'):
        self.model = model
        self.session_file = session_file
        self.history = self.load_history()

    def load_history(self):
        """Load a saved conversation, or start empty if no file exists."""
        try:
            with open(self.session_file, 'r', encoding='utf-8') as f:
                return json.load(f)
        except FileNotFoundError:
            return []

    def save_history(self):
        """Save the conversation."""
        with open(self.session_file, 'w', encoding='utf-8') as f:
            json.dump(self.history, f, ensure_ascii=False, indent=2)

    def chat(self, user_message):
        self.history.append({
            'role': 'user',
            'content': user_message
        })

        response = ollama.chat(
            model=self.model,
            messages=self.history
        )

        # Store a plain dict: newer ollama-python versions may return a
        # Message object, which json.dump cannot serialize directly
        self.history.append({
            'role': 'assistant',
            'content': response['message']['content']
        })
        self.save_history()  # save automatically after every turn

        return response['message']['content']

# Usage example
session = PersistentChatSession()
print(session.chat("Hello"))
# The previous conversation is kept even after the program restarts

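Message history vs. the context parameter

Ollama offers two ways to continue a conversation:

Method 1: Message history (recommended)

import ollama

# Send the full conversation history with every request
messages = [
    {'role': 'user', 'content': 'Hello'},
    {'role': 'assistant', 'content': 'Hello!'},
    {'role': 'user', 'content': 'What did I just say?'}
]

response = ollama.chat(model='gpt-oss:20b', messages=messages)

Method 2: Context tokens

A response from ollama.generate() includes a context field, an array of token IDs that encodes the conversation so far. Passing it back with the next request continues the conversation. Note that the Ollama API documentation marks the context parameter as deprecated.

import ollama

# First turn
response1 = ollama.generate(
    model='gpt-oss:20b',
    prompt='Hello'
)

# Save the context
context = response1['context']

# Second turn: pass only the new prompt and the saved context
response2 = ollama.generate(
    model='gpt-oss:20b',
    prompt='What did I just say?',
    context=context  # reuse the context from the previous turn
)

print(response2['response'])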

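Differences between the two approaches

Aspect	Message history	Context tokens
API	`ollama.chat()`	`ollama.generate()`
Data sent	Full conversation text	Array of token IDs
Network load	Grows with conversation length	Also grows (the array covers every token so far)
Client memory	Grows with conversation length	Grows with conversation length
Readability	High (inspectable as text)	Low (array of token IDs)
Debugging	Easy	Hard
System prompt	Supported	Limited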

Practical example of the context approach

import ollama

class ContextChatSession:
    def __init__(self, model='gpt-oss:20b'):
        self.model = model
        self.context = None

    def chat(self, prompt):
        response = ollama.generate(
            model=self.model,
            prompt=prompt,
            context=self.context  # reuse the previous context
        )

        # Save the new context
        self.context = response['context']

        return response['response']

    def reset(self):
        """Reset the conversation."""
        self.context = None

# Usage example
session = ContextChatSession()

print(session.chat("My name is Hong Gildong"))
# Example output (actual wording will vary): Hello, Hong Gildong!

print(session.chat("What was my name?"))
# Example output (actual wording will vary): You said it was Hong Gildong.

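Using context with the REST API

# First request ("stream": false returns a single JSON object that contains context)
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "Hello",
  "stream": false
}'

# Extract context from the response (example)
# "context": [200006, 17360, 200008, 3575, 553, ...]

# Second request: include the context (truncated here; send the full array)
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "What did I just say?",
  "context": [200006, 17360, 200008, 3575, 553]
}'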

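Which approach should you choose?

Message history is recommended for:

Chatbots and conversational applications
Cases that need a system prompt
Cases where the conversation must be logged or analyzed
Cases where code readability and maintainability matter

Context tokens may be enough when:

A script built on /api/generate needs only short conversational memory
Earlier turns never need to be read, edited, or trimmed

Note: The context array grows with every turn because it encodes all tokens so far, so it does not keep the request size constant. Neither approach lets a conversation exceed the model's context window (num_ctx). For long conversations, trim or summarize the message history instead.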

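7. Using the REST API

By default, Ollama serves its REST API at http://localhost:11434.

cURL example

curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "LLM 에 대해 설명해줘",
  "stream": false
}'

Example response to this Korean prompt, which asks for an explanation of LLMs (the response text and the context array are abbreviated):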

{
    "model": "gpt-oss:20b",
    "created_at": "2026-01-01T12:24:28.384219Z",
    "response": "## LLM(대규모 언어 모델, Large Language Model)이란?\n\n### 1️⃣ 정의\n- **LLM**은 수십억~수조 개의 파라미터(가중치)를 갖는 인공지능 모델로, 자연어를 이해하고 생성하는 데 특화되어 있습니다.\n- “대규모”라는 명칭은 주로 **학습 데이터의 규모**와 **모델 파라미터 수**가 기존 NLP 모델보다 훨씬 크다는 점을 강조합니다.\n\n### 2️⃣ 역사와 발전 과정\n| 시기 | 주요 모델 | 핵심 특징 |\n|------|-----------|-----------|\n| 2018 | GPT-1 | Transformer 아키텍처 기반, 117M 파라미터 |\n| 2019 | GPT-2 | 1.5B 파라미터, 다양한 언어 생성 가능 |\n| 2020 | BERT | ",
    "thinking": "The user: \"LLM 에 대해 설명해줘\" Korean: \"Explain LLM\" They want an explanation about LLM. Likely a high-level overview. Could include definition, architecture, training, use cases, differences between models, scaling laws, challenges, future. Also could be Korean explanation. The user didn't specify length, but presumably thorough. Provide explanation in Korean. Provide clear sections: 1. 정의, 2. 역사, 3. 아키텍처, 4. 학습 방법, 5. 활용, 6. 한계, 7. 앞으로의 전망. Also maybe mention GPT-4, Claude, etc. Provide some nuance: LLM vs traditional ML. Provide examples.\n\nLet's craft.",
    "done": true,
    "done_reason": "stop",
    "context": [
        200006,
        17360,
        200008,
        3575,
        553
    ],
    "total_duration": 29593057458,
    "load_duration": 151055750,
    "prompt_eval_count": 74,
    "prompt_eval_duration": 184719666,
    "eval_count": 1734,
    "eval_duration": 28871083592
}
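
The timing fields in this response can be converted into throughput figures. Ollama reports durations in nanoseconds, so tokens per second is the token count divided by the duration, multiplied by 10^9. The sketch below uses the values from the response above; `tokens_per_second` is an illustrative helper name.

```python
NS_PER_SECOND = 1_000_000_000

def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration to tokens per second."""
    return count / duration_ns * NS_PER_SECOND

# Values copied from the example response
stats = {
    "total_duration": 29593057458,
    "prompt_eval_count": 74,
    "prompt_eval_duration": 184719666,
    "eval_count": 1734,
    "eval_duration": 28871083592,
}

gen_rate = tokens_per_second(stats["eval_count"], stats["eval_duration"])
prompt_rate = tokens_per_second(stats["prompt_eval_count"], stats["prompt_eval_duration"])

print(f"generation: {gen_rate:.1f} tokens/s")
print(f"prompt:     {prompt_rate:.1f} tokens/s")
print(f"total:      {stats['total_duration'] / NS_PER_SECOND:.1f} s")
```

For this run, generation works out to about 60 tokens per second and the whole request to about 29.6 seconds.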

Receiving a streamed response

curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "LLM 에 대해 설명해줘",
  "stream": true
}'

{"model":"gpt-oss:20b","created_at":"2026-01-01T12:23:09.026732Z","response":"","thinking":"The","done":false}
{"model":"gpt-oss:20b","created_at":"2026-01-01T12:23:09.045446Z","response":"","thinking":" user","done":false}
{"model":"gpt-oss:20b","created_at":"2026-01-01T12:23:09.063421Z","response":"","thinking":" says","done":false}
{"model":"gpt-oss:20b","created_at":"2026-01-01T12:23:09.082734Z","response":"","thinking":" \"","done":false}
{"model":"gpt-oss:20b","created_at":"2026-01-01T12:23:09.101733Z","response":"","thinking":"LL","done":false}
{"model":"gpt-oss:20b","created_at":"2026-01-01T12:23:09.120114Z","response":"","thinking":"M","done":false}
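
Each streamed line is a separate JSON object, so a client has to parse the lines one by one and join the fragments. A minimal sketch of that step follows, using the `thinking` and `response` fields visible in the output above; `split_stream` is a hypothetical helper, and the sample reproduces the six events shown with timestamps omitted.

```python
import json

def split_stream(ndjson_text: str) -> tuple[str, str]:
    """Return (thinking, response) concatenated from newline-delimited JSON."""
    thinking, response = [], []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        thinking.append(event.get("thinking", ""))
        response.append(event.get("response", ""))
    return "".join(thinking), "".join(response)

# First events from the streamed output (timestamps omitted)
sample = "\n".join([
    '{"model":"gpt-oss:20b","response":"","thinking":"The","done":false}',
    '{"model":"gpt-oss:20b","response":"","thinking":" user","done":false}',
    '{"model":"gpt-oss:20b","response":"","thinking":" says","done":false}',
    '{"model":"gpt-oss:20b","response":"","thinking":" \\"","done":false}',
    '{"model":"gpt-oss:20b","response":"","thinking":"LL","done":false}',
    '{"model":"gpt-oss:20b","response":"","thinking":"M","done":false}',
])

thinking, response = split_stream(sample)
print(repr(thinking))
```

In these first events only `thinking` carries text, and `response` is still empty because the model has not started its final answer yet.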

Maintaining conversation context with the REST API

# Pass the conversation history through the /api/chat endpoint
curl http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:20b",
  "messages": [
    {
      "role": "user",
      "content": "What is Python?"
    },
    {
      "role": "assistant",
      "content": "Python is a high-level programming language created by Guido van Rossum in 1991..."
    },
    {
      "role": "user",
      "content": "Give me just the three main features from what you just said"
    }
  ],
  "stream": false
}'
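
The same /api/chat request can be built with only the Python standard library, which may be convenient when the ollama package is not installed. This is a sketch under that assumption: `build_chat_request` and `send` are hypothetical helpers, and `send` needs a running Ollama server, so the example only constructs the request.

```python
import json
import urllib.request

def build_chat_request(messages, model="gpt-oss:20b",
                       base_url="http://localhost:11434"):
    """Build a non-streaming POST request for the /api/chat endpoint."""
    payload = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request):
    """Send the request and return the reply text (requires a running server)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["message"]["content"]

req = build_chat_request([{"role": "user", "content": "What is Python?"}])
print(req.get_method(), req.full_url)
```

Calling `send(req)` returns the assistant's reply, which can then be appended to the message list before the next request is built.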

Maintaining context in JavaScript/Node.js

class OllamaChat {
  constructor(model = 'gpt-oss:20b') {
    this.model = model;
    this.messages = [];
    this.apiUrl = 'http://localhost:11434/api/chat';
  }

  async chat(userMessage) {
    // Add the user message
    this.messages.push({
      role: 'user',
      content: userMessage
    });

    // Call the API
    const response = await fetch(this.apiUrl, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        model: this.model,
        messages: this.messages,
        stream: false
      })
    });

    const data = await response.json();

    // Add the response to the message history
    this.messages.push(data.message);

    return data.message.content;
  }

  clear() {
    this.messages = [];
  }
}

// Usage example (top-level await requires an ES module, e.g. a .mjs file)
const chat = new OllamaChat('gpt-oss:20b');

const response1 = await chat.chat('Hello');
console.log(response1);

const response2 = await chat.chat('What did I just say?');  // refers to the previous turn
console.log(response2);

Maintaining context with streaming in JavaScript

class StreamingOllamaChat {
  constructor(model = 'gpt-oss:20b') {
    this.model = model;
    this.messages = [];
    this.apiUrl = 'http://localhost:11434/api/chat';
  }

  async streamChat(userMessage, onChunk) {
    this.messages.push({
      role: 'user',
      content: userMessage
    });

    const response = await fetch(this.apiUrl, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        model: this.model,
        messages: this.messages,
        stream: true
      })
    });

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let assistantMessage = '';
    let buffer = '';  // holds a partial JSON line split across network chunks

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      // stream: true keeps multi-byte characters intact across chunk boundaries
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop();  // the last element is an incomplete line (or '')

      for (const line of lines) {
        if (!line.trim()) continue;
        const data = JSON.parse(line);
        if (data.message?.content) {
          assistantMessage += data.message.content;
          onChunk(data.message.content);
        }
      }
    }

    // Add the completed response to the history
    this.messages.push({
      role: 'assistant',
      content: assistantMessage
    });

    return assistantMessage;
  }
}

// Usage example
const streamChat = new StreamingOllamaChat('gpt-oss:20b');

await streamChat.streamChat('Hello', (chunk) => {
  process.stdout.write(chunk);  // print in real time
});

console.log('\n---');

await streamChat.streamChat('Do you remember greeting me just now?', (chunk) => {
  process.stdout.write(chunk);
});

8. Managing models

Listing installed models

ollama list

Deleting a model

ollama rm gpt-oss:20b

Checking running models

ollama ps
