AI Glossary for Java Developers


AI introduces many new terms, acronyms, and techniques you must understand to build a good AI-based system. That makes it hard for many Java developers to learn how to integrate AI into their applications using Spring AI, LangChain4j, or another library.

I ran into the same issue when I started learning about AI.

In this article, I did my best to explain the most important terms and acronyms to Java developers who are not already AI experts.

Foundational AI Concepts

Generative AI

Generative AI refers to models trained on huge data sets to find patterns, which they then use to generate content such as text, images, audio, and video.

LLM – Large Language Model

LLMs are a specific kind of generative AI specialized in text-based data.

NLP (Natural Language Processing)

NLP aims to enable programs to understand and generate human language. It’s a subfield of computer science that uses linguistics, statistical modeling, and machine learning.

LLMs represent a modern form of NLP, replacing traditional rule-based systems with statistical models trained on massive text corpora.

AI Prompt

An AI prompt is the input you send to an AI system. It includes instructions, questions, context, or retrieved information. You typically combine system prompts, user prompts, and additional data from a RAG pipeline to guide the model toward a specific output.

Training

Training is the process of teaching a model to recognize patterns by exposing it to large datasets. During training, the model adjusts its internal parameters (weights) to reduce prediction errors. Modern LLMs are typically trained in two stages: a general pre-training phase on large text corpora and, optionally, a domain-specific fine-tuning phase. The quality and size of the training data directly impact the model’s accuracy, reasoning abilities, and robustness.

Fine-tuning

Fine-tuning adapts an already trained LLM to a specific task by training it further on a smaller, task-specific dataset. This improves the model’s accuracy on specialized tasks.

Weight

A weight is a numerical parameter inside a neural network that determines how strongly input values influence the model’s predictions. During training, the model adjusts its weights to reduce errors and learn patterns from the training data. LLMs contain billions of weights, and their arrangement defines the model’s behavior and capabilities.

Open-Weight Model

An open-weight model provides access to the model’s weights, but not necessarily the full training data or training code. You can download and run the model locally or on your own infrastructure. This gives you more control over performance, privacy, and deployment without requiring you to reproduce the original training process. Many modern LLMs such as Llama or Mistral fall into this category.

Open-Source Model

An open-source model is released under a license that grants access not only to its weights, but also to the training code and, often, the entire training pipeline. This allows full reproducibility, community-driven improvements, and transparent evaluation. Only a few LLMs are truly open source because publishing the complete training data is rare.

LLM Fundamentals

Token

Tokens are the basic units that an LLM processes when reading or generating text. They can represent words, sub-words, punctuation marks, or individual characters. The number of tokens in a prompt determines how much content fits into the model’s context window and directly affects performance and cost.

Context Window

The context window defines how many tokens a model can process in a single request. It includes the system prompt, user prompt, chat history, and the generated response. Large context windows are important when you provide extensive RAG context or long documents.

System Prompt

The system prompt defines the model’s role, behavior, and response style. It sets the overall direction for the conversation and influences how the model interprets all following messages. You typically use it to define rules, structure responses, or provide stable context.

User Prompt

The user prompt contains the actual question or instruction sent by the user. It is combined with the system prompt and other messages to build the complete request.
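
A minimal sketch of how both prompt types are combined with Spring AI’s ChatClient fluent API (it assumes an auto-configured ChatModel bean; package names and API details can vary between Spring AI versions):

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.model.ChatModel;

// Combines a system prompt (role and rules) with a user prompt (the actual
// question) in a single request.
class PromptExample {

    String ask(ChatModel chatModel, String question) {
        ChatClient chatClient = ChatClient.builder(chatModel).build();
        return chatClient.prompt()
                .system("You are a helpful assistant for Java developers. Answer concisely.")
                .user(question)   // the user prompt with the actual question
                .call()
                .content();       // the generated answer as plain text
    }
}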

Cutoff Date

The cutoff date defines the point in time when the model’s training data ends. Information published after this date is unknown to the model unless you add it through RAG or custom fine-tuning.

Multi-Modality

Multi-modal models process different types of input such as text, images, and audio. They support use cases like document extraction, image analysis, and visual Q&A. Spring AI supports multimodal prompts if the model provider exposes them.

Sampling & Decoding Strategies

Sampling

Sampling describes how the model selects the next token. Techniques such as temperature, Top-K, and Top-P influence randomness, creativity, and determinism.

Top-K Sampling

Top-K sampling limits token selection to the K most probable next tokens. It helps control randomness and avoid extremely unlikely outputs by preventing the model from choosing tokens with a low probability.

Top-P (Nucleus Sampling)

Top-P sampling restricts choices to the smallest set of tokens whose cumulative probability exceeds a given threshold. It adapts dynamically and often produces more stable results.
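
To make the difference concrete, here is a small, self-contained Java sketch with a toy probability distribution (not a real inference engine) that shows how Top-K keeps a fixed number of candidate tokens while Top-P keeps a probability mass:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch: restrict a toy next-token distribution with Top-K and
// Top-P before sampling the next token from the remaining candidates.
class SamplingSketch {

    public static void main(String[] args) {
        Map<String, Double> probs = Map.of(
                "cat", 0.40, "dog", 0.30, "bird", 0.15, "fish", 0.10, "rock", 0.05);

        System.out.println(topK(probs, 2));   // [cat, dog]
        System.out.println(topP(probs, 0.8)); // [cat, dog, bird] (cumulative 0.85 >= 0.8)
    }

    // Top-K: keep only the K most probable tokens.
    static List<String> topK(Map<String, Double> probs, int k) {
        return probs.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Top-P: keep the smallest set of tokens whose cumulative probability
    // reaches the threshold p.
    static List<String> topP(Map<String, Double> probs, double p) {
        List<Map.Entry<String, Double>> sorted = probs.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .collect(Collectors.toList());

        List<String> kept = new ArrayList<>();
        double cumulative = 0.0;
        for (Map.Entry<String, Double> entry : sorted) {
            kept.add(entry.getKey());
            cumulative += entry.getValue();
            if (cumulative >= p) {
                break;
            }
        }
        return kept;
    }
}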

Prompt Engineering

Zero-shot prompting / direct prompting

Zero-shot or direct prompting means using a prompt without any examples to let an AI system perform a task its model hasn’t been explicitly trained for. The model then uses the patterns it learned during training to create the result.

This is the simplest form of a prompt. It works well for simple tasks or when examples would increase token usage too much.

One-shot prompting / few-shot prompting / multi-shot prompting

A prompt that includes one or more examples to let an AI system perform a task its model hasn’t been explicitly trained for is called a one-shot, few-shot, or multi-shot prompt. The model uses the provided examples to infer the structure of the expected result and combines them with the patterns it learned during training.
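
A minimal, hypothetical few-shot prompt for a sentiment classification task could look like this (written as a Java text block):

String fewShotPrompt = """
        Classify the sentiment of the review as POSITIVE or NEGATIVE.

        Review: "The course explained JPA mappings really well."
        Sentiment: POSITIVE

        Review: "The examples did not compile, and nobody answered my questions."
        Sentiment: NEGATIVE

        Review: "The new chapter on Spring AI saved me hours of research."
        Sentiment:
        """;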

Prompt Stuffing

Prompt stuffing means adding context or instructions into a single prompt. You use it to provide the model with everything it needs: chat memory, metadata, retrieved segments from RAG, or multiple tasks. Excessive stuffing can lead to confused or overly long responses.

Chain-of-Thought (CoT)

Chain-of-Thought encourages the model to break down a task or problem into intermediate reasoning steps. This makes the solving process transparent and often improves the quality of reasoning.
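
A hypothetical user prompt that triggers this behavior simply asks the model to show its intermediate steps:

String chainOfThoughtPrompt = """
        A batch job processes 1200 records per minute and runs for 2.5 hours.
        How many records does it process in total?
        Think step by step and show your reasoning before giving the final answer.
        """;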

Self-Consistency

Self-consistency generates multiple Chain-of-Thought responses and selects the most consistent result. It improves reasoning at the cost of higher token usage.

Structured Output Prompting

Structured output prompting instructs the model to return output in a predefined structure, such as JSON or XML. Spring AI supports this using JSON Schema and automatic object binding.
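
A minimal sketch of how this can look with Spring AI’s ChatClient (the BookSummary record is a hypothetical target type; the exact binding API may differ between Spring AI versions):

import org.springframework.ai.chat.client.ChatClient;

// Hypothetical target type: Spring AI derives a JSON Schema from it and binds
// the model's JSON response to an instance of the record.
record BookSummary(String title, String author, int publicationYear) {}

class StructuredOutputExample {

    BookSummary summarize(ChatClient chatClient, String bookDescription) {
        return chatClient.prompt()
                .user("Extract the title, author, and publication year: " + bookDescription)
                .call()
                .entity(BookSummary.class); // structured JSON output mapped to the record
    }
}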

Safety

Guardrails

Guardrails enforce rules on prompts and responses. They help you block unsafe content, enforce formatting, and protect internal system prompts.

Prompt Injection

Prompt injection embeds malicious instructions in the user prompt or external content. It tries to manipulate the model’s behavior or bypass rules. You need validation and guardrails to mitigate this risk.

Jailbreak

Jailbreaking attempts to bypass a model’s safety rules. These prompts try to force the model to produce restricted or unsafe output.

Retrieval-Augmented Generation (RAG) & Search

Retrieval-Augmented Generation (RAG)

Enriching the prompt with facts from a trusted source that was not part of the training data is called Retrieval-Augmented Generation (RAG). This is often used to improve the quality of the generated result by ensuring that the LLM uses the most current and reliable information.

You can use RAG to improve accuracy and reduce the risk of hallucinations.

Embeddings

Embeddings are vector representations of data, like words, images, or documents, enabling a model to compare and find similar data.
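
Similarity between embeddings is usually measured with a metric such as cosine similarity. Here is a self-contained sketch with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

// Illustrative sketch: cosine similarity between two tiny, made-up embedding
// vectors. A value close to 1.0 means the embedded items are very similar.
class EmbeddingSimilarity {

    public static void main(String[] args) {
        double[] entityMapping = {0.12, 0.87, 0.33};
        double[] jpaAnnotation = {0.10, 0.80, 0.40};
        System.out.println(cosineSimilarity(entityMapping, jpaAnnotation));
    }

    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}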

Chunking

Chunking splits long documents into smaller segments before embedding. It ensures that each retrieval unit fits within the model’s context window.

Chunk Overlap

Chunk overlap adds shared content between neighboring chunks to avoid splitting important information.
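
A simple, character-based sketch that shows both chunking and overlap (real pipelines typically split on tokens, sentences, or paragraphs; the sketch assumes overlap is smaller than chunkSize):

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: split a text into fixed-size chunks where neighboring
// chunks share 'overlap' characters.
class ChunkingSketch {

    public static void main(String[] args) {
        String text = "Spring AI integrates large language models into Spring applications.";
        chunk(text, 30, 10).forEach(System.out::println);
    }

    static List<String> chunk(String text, int chunkSize, int overlap) {
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap; // must be > 0
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break;
            }
        }
        return chunks;
    }
}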

Vector Store

A vector store manages vectors generated by embedding models and provides APIs for storing, indexing, and querying them efficiently. It enables semantic search, similarity comparisons, and retrieval for RAG pipelines. Most vector stores support metadata filtering, approximate nearest neighbor search, and different index structures such as HNSW or IVF. You typically use a vector store to store document embeddings and retrieve the most relevant chunks for a user query.
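
A minimal sketch of the typical interaction with Spring AI’s VectorStore abstraction (it assumes an auto-configured VectorStore bean; method names may vary slightly between Spring AI versions):

import java.util.List;
import java.util.Map;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;

// Stores two documents (they get embedded automatically) and retrieves the
// ones most similar to a user query.
class VectorStoreExample {

    void storeAndSearch(VectorStore vectorStore) {
        vectorStore.add(List.of(
                new Document("JPA entities map Java classes to database tables.",
                        Map.of("topic", "jpa")),
                new Document("Spring AI integrates LLMs into Spring applications.",
                        Map.of("topic", "spring-ai"))));

        List<Document> results = vectorStore.similaritySearch("How do I map an entity?");
        results.forEach(System.out::println);
    }
}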

Approximate Nearest Neighbor (ANN)

ANN algorithms find vectors close to a query without scanning the entire dataset. They provide fast similarity search with minimal accuracy loss.

Metadata Filtering

Metadata filtering restricts retrieval results based on attributes such as tenant ID, document type, or language. It is essential in multi-tenant or domain-specific environments.

Hallucination

A hallucination is false information that an AI system presents as fact. Providing accurate context via RAG or adding validation logic helps reduce the risk.

Evaluation & Quality Assurance

Response Evaluation

Response evaluation checks the quality of the model’s output. It focuses on correctness, groundedness, relevance, and adherence to the requested structure.

Model Evaluation

Model evaluation measures how well an LLM performs for a specific use case. It uses metrics, datasets, and evaluation prompts to compare models or detect regressions.

Model Architectures & Types

Decoder-Only Model

Decoder-only models such as GPT or Llama generate text one token at a time. They are optimized for chat and generative tasks.

Encoder-Only Model

Encoder-only models like BERT generate embeddings but cannot produce text. They are commonly used for classification and retrieval.

Encoder-Decoder Model

Encoder-decoder models such as T5 handle text transformation tasks like summarization and translation.

Mixture of Experts (MoE)

Mixture of Experts models route tokens to specialized internal “experts.” This increases capacity without increasing inference cost for every token.

Deployment, Hosting & Performance Optimization

Inference Server

An inference server hosts a model and exposes APIs for generation and embedding. Popular options include vLLM, TGI, and Ollama.

Ollama

Ollama is a local model runner that downloads, manages, and executes LLMs on your machine. It exposes an OpenAI-compatible API, and Spring AI provides an integration. It supports GGUF models and provides optimizations for consumer hardware.

LM Studio

LM Studio is a desktop application for running local LLMs with a graphical interface. It includes model management, quantization options, and a built-in OpenAI-compatible API server.

Docker Model Runner

A Docker model runner is a containerized environment for hosting and serving LLMs. It bundles the model, dependencies, and inference engine into a reproducible container image. Many inference servers, including vLLM, TGI, and Ollama, provide official Docker images.

GGUF

GGUF is a model format optimized for local inference engines such as llama.cpp. It stores quantized weights and supports high-performance execution on consumer hardware.

Quantization

Quantization reduces the precision of model weights, such as converting FP16 to INT8. It improves performance and reduces memory usage with minimal accuracy loss.
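
As a rough illustration of the idea, here is a self-contained sketch of symmetric INT8 quantization of floating-point weights (real schemes, such as the block formats used by GGUF, are more sophisticated):

// Illustrative sketch: map float weights to the INT8 range [-127, 127] with a
// single scale factor and show the precision loss after dequantization.
class QuantizationSketch {

    public static void main(String[] args) {
        float[] weights = {0.51f, -1.20f, 0.03f, 2.40f, -0.75f};

        float maxAbs = 0;
        for (float w : weights) {
            maxAbs = Math.max(maxAbs, Math.abs(w));
        }
        float scale = maxAbs / 127f; // the largest weight maps to 127

        byte[] quantized = new byte[weights.length];
        for (int i = 0; i < weights.length; i++) {
            quantized[i] = (byte) Math.round(weights[i] / scale);
        }

        for (int i = 0; i < weights.length; i++) {
            System.out.printf("%.4f -> %d -> %.4f%n",
                    weights[i], quantized[i], quantized[i] * scale);
        }
    }
}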

Agents, Tools & Spring AI Integration

Chat Memory

Chat memory stores previous messages so the model can continue a conversation with context. It allows the LLM to understand follow-up questions, maintain state across multiple requests, and build coherent multi-step interactions. Chat memory can be short-lived, stored in the prompt, or managed externally using databases or vector stores.

In Spring AI, you can integrate chat memory to provide context between requests without manually stuffing previous messages into each prompt.
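
A sketch of that setup could look like this (class names and builders differ between Spring AI versions, so treat this as an illustration rather than copy-and-paste code):

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.MessageChatMemoryAdvisor;
import org.springframework.ai.chat.memory.InMemoryChatMemory;
import org.springframework.ai.chat.model.ChatModel;

// Builds a ChatClient that keeps the conversation history in memory so the
// model can answer follow-up questions with context.
class ChatMemoryExample {

    ChatClient buildClient(ChatModel chatModel) {
        return ChatClient.builder(chatModel)
                .defaultAdvisors(new MessageChatMemoryAdvisor(new InMemoryChatMemory()))
                .build();
    }
}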

Tool Call (Function Call)

Tool calling allows the model to request calls to your own APIs and services. Spring AI maps these calls to your code so the model can trigger real actions.
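
A sketch of a tool in Spring AI (the OrderTools class and getOrderStatus method are hypothetical; annotation and registration details may vary between Spring AI versions):

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.tool.annotation.Tool;

// Spring AI derives a tool schema from the annotated method and invokes it
// when the model requests the tool call.
class OrderTools {

    @Tool(description = "Returns the current status of an order")
    String getOrderStatus(String orderId) {
        return "Order " + orderId + " has been shipped."; // replace with a real service call
    }
}

class ToolCallExample {

    String ask(ChatClient chatClient) {
        return chatClient.prompt()
                .user("What is the status of order 4711?")
                .tools(new OrderTools())
                .call()
                .content();
    }
}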

MCP (Model Context Protocol)

MCP standardizes how tools expose their capabilities to LLMs. It allows models to discover, validate, and call external tools based on a defined schema.

Action / Observation Loop

Agents follow an action-observation loop. They choose an action, execute it through a tool, observe the result, and decide the next step.

Advisors

Advisors modify or enrich prompts before sending them to the model and process the model’s response. They help you enforce rules, add context, or automatically insert system messages.

RAG Pipeline

A RAG pipeline groups the steps of retrieval-augmented generation. It performs chunking, embedding, searching, and prompt assembly.
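
A hand-rolled sketch of these steps with Spring AI (retrieval plus prompt stuffing; Spring AI also ships advisors that automate this, and the exact Document accessors vary between versions):

import java.util.List;
import java.util.stream.Collectors;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;

// Retrieves the most similar chunks for a question and stuffs them into the
// prompt so the model answers based on the provided context.
class RagPipelineExample {

    String answer(ChatClient chatClient, VectorStore vectorStore, String question) {
        List<Document> chunks = vectorStore.similaritySearch(question);
        String context = chunks.stream()
                .map(Document::toString) // simplified; extract the chunk text in real code
                .collect(Collectors.joining("\n---\n"));

        return chatClient.prompt()
                .system("Answer only based on the provided context.")
                .user("Context:\n" + context + "\n\nQuestion: " + question)
                .call()
                .content();
    }
}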

Summary

You have to be familiar with many new terms, concepts, and abbreviations when you learn how to integrate AI into your application.

I tried explaining the most important ones in this article. If you want me to add something, please post a comment below.