AI Glossary
Training Data
The text and examples that shape an AI model
Definition
Training data is the large collection of text, images, or other data from which an AI model learns. For LLMs, this includes books, websites, code, scientific papers, and more, often totaling trillions of tokens. The quality, diversity, and recency of training data heavily influence a model's capabilities and biases. Training data also has a knowledge cutoff: the model has no built-in knowledge of anything that happened after that date.
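The knowledge cutoff can be pictured as a date filter over the corpus. The sketch below is only an illustration: the `Document` class, the `apply_cutoff` helper, the sample documents, and the cutoff date are all hypothetical and do not describe any real model's data pipeline.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    """A hypothetical training document with a publication date."""
    text: str
    published: date


def apply_cutoff(docs, cutoff):
    """Keep only documents published on or before the cutoff date."""
    return [d for d in docs if d.published <= cutoff]


# Made-up corpus for illustration only.
corpus = [
    Document("Older encyclopedia article", date(2019, 5, 1)),
    Document("Recent news report", date(2024, 11, 3)),
]

# Assumed cutoff date, not taken from any real model.
CUTOFF = date(2023, 12, 31)

training_set = apply_cutoff(corpus, CUTOFF)
print([d.text for d in training_set])
```

Anything dated after the cutoff never reaches the training set, so a model trained on it has nothing to learn that material from.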