What Large Language Models Can't Do

Large language models can write poems, draft emails, and even debug code. Yet they still babble nonsense, misstate facts, and trip over simple logic puzzles. They don’t truly think—they just predict the next word. Ask one to count words in a sentence and you’ll soon spot the cracks. In business or research, these blind spots matter.
In this article, we’ll dive into the core limitations of large language models—from fixed token windows and stale knowledge to hallucinations, weak reasoning, and bias. You’ll discover why LLMs forget earlier chats, why they can’t always follow a multi-step plan, and how they sometimes echo harmful stereotypes. Most importantly, we’ll share hands-on strategies—like smart chunking, retrieval augmentation, prompt design, and human-in-the-loop checks—that help you work around each quirk and unlock better, more reliable results.
Why can’t LLMs process very long text?
Large language models have a built-in “working memory” called a context window. This window holds both your input prompt and the model’s prior outputs. Once that limit is reached, older text is trimmed to make room for new tokens—and the model effectively “forgets” what came first.
A context window is measured in tokens, not words. A token is roughly 4 characters or 0.75 words. Here’s how some popular models compare:
- OpenAI GPT-3.5 Turbo: 16 000 tokens (~12 000 words)
- OpenAI GPT-4 Turbo: 128 000 tokens (~96 000 words)
- Anthropic Claude 3 (Haiku, Sonnet, Opus): 200 000 tokens (~150 000 words)
- Google Gemini Pro / 1.5: up to 128 000 tokens (with experimental builds advertising up to 1 000 000)
When you feed in a contract, research paper, or long chat history that exceeds these limits, the oldest tokens slip off the front of the window. The model loses track of earlier details, leading to dropped requirements, inconsistent answers, or flat-out nonsense.
To work around this constraint:
- Break long inputs into self-contained chunks and process them sequentially.
- Summarize or paraphrase each chunk before feeding it back.
- Specify a focused output length (for example, “In 100 words, recap Section 2”).
- Target prompts to the most relevant sections instead of pasting an entire document.
- Use retrieval augmentation: store large texts in an external vector database and pull in only the passages you need.
By shaping your prompts around these token limits, you keep the model’s “memory” focused on what matters most, which yields more accurate, coherent responses even when working with very long documents. The example below shows one way to chunk a long document, summarize each piece, and then synthesize a final summary.
JAVASCRIPT • example.js

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

/**
 * Splits a long text into word-based chunks, summarizes each chunk,
 * then synthesizes a final summary of summaries.
 *
 * @param {string} text – The full document to process
 * @param {number} chunkSize – Approximate word count per chunk
 * @returns {Promise<string>} – Final synthesized summary
 */
async function chunkAndSummarize(text, chunkSize = 2000) {
  // 1. Split text into word arrays and group into chunks
  const words = text.split(/\s+/);
  const chunks = [];
  for (let i = 0; i < words.length; i += chunkSize) {
    chunks.push(words.slice(i, i + chunkSize).join(" "));
  }

  // 2. Summarize each chunk individually
  const summaries = [];
  for (const chunk of chunks) {
    const { choices } = await openai.chat.completions.create({
      model: "gpt-4-turbo",
      temperature: 0.3,
      messages: [
        { role: "system", content: "You are a concise summarizer." },
        { role: "user", content: `In 80 words, summarize this section:\n\n${chunk}` }
      ],
      max_tokens: 150
    });
    summaries.push(choices[0].message.content.trim());
  }

  // 3. Combine chunk summaries and create a final overview
  const combined = summaries.join("\n\n");
  const { choices: finalChoices } = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    temperature: 0.3,
    messages: [
      { role: "system", content: "You are an expert synthesizer." },
      { role: "user", content: `Combine these points into a 100-word summary:\n\n${combined}` }
    ],
    max_tokens: 200
  });

  return finalChoices[0].message.content.trim();
}

// Example usage:
(async () => {
  const longText = ""; // load your document here
  const summary = await chunkAndSummarize(longText);
  console.log("Final Summary:\n", summary);
})();
Why do LLMs hallucinate?
LLMs sometimes “hallucinate” by producing fluent but false statements because they predict the next word based on learned patterns, not by verifying facts. They don’t consult a knowledge base—instead, they fill gaps with whatever sequence looks most statistically plausible.
This glitch often stems from noisy or incomplete training data and the randomness introduced during generation. A recent survey categorizes hallucinations into two main types: outright fabrications (claims that never happened) and context inconsistencies (answers that drift off-topic or contradict earlier instructions) (Huang et al., 2024). Vague prompts and high-temperature sampling settings can make those errors even more likely.
To curb hallucinations (a brief sketch follows this list):
- Anchor outputs with retrieval-augmented generation (RAG) so the model cites real documents.
- Prompt the model to list its sources or self-review its own responses.
- Lower the temperature setting for more conservative completions.
- Incorporate human-in-the-loop checks for any critical or sensitive information.
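To make the second and third tactics concrete, here is a minimal sketch (reusing the OpenAI client from the earlier example) of a draft-then-self-review pattern at low temperature. The two-pass structure and the review wording are illustrative choices, not a prescribed recipe.

JAVASCRIPT • self-review-sketch.js

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function draftWithSelfReview(question) {
  // First pass: answer the question conservatively.
  const draft = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    temperature: 0.2, // conservative sampling reduces creative guessing
    messages: [{ role: "user", content: question }],
    max_tokens: 300
  });
  const answer = draft.choices[0].message.content;

  // Second pass: ask the model to flag unsupported or contradictory claims.
  const review = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    temperature: 0,
    messages: [
      { role: "system", content: "You are a strict fact-checking reviewer." },
      {
        role: "user",
        content:
          `Question: ${question}\n\nDraft answer: ${answer}\n\n` +
          "List any claims that are unsupported, contradictory, or likely fabricated. " +
          "If everything looks sound, reply 'NO ISSUES'."
      }
    ],
    max_tokens: 200
  });

  return { answer, review: review.choices[0].message.content.trim() };
}

Anything the review pass flags can then be routed to a human check, per the last bullet above.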
Why do LLMs have stale knowledge?
LLMs freeze their knowledge at training time, so anything that happens after their cutoff date—new research, news events, or emerging trends—is invisible to them.
Because these models aren’t connected to live data, they’ll confidently repeat outdated facts or simply guess when you ask about post-training developments. For instance, GPT-3.5 Turbo only “knows” up to September 2021, GPT-4 Turbo stops at December 2023, and Anthropic’s Claude 3 series caps out in August 2023. There’s no built-in web browser or streaming update—every answer is drawn from a static snapshot of the world.
To keep your outputs current and reliable:
- Include the date in your prompt (e.g., “Today is July 24, 2025”) so the model flags time-sensitive gaps.
- Use retrieval-augmented generation (RAG): fetch the latest data from APIs, databases, or search engines and feed only those snippets into the model.
- Cross-check with external sources: build a middleware layer that verifies critical facts against live endpoints before displaying them.
- Fine-tune regularly: schedule lightweight retraining or prompt-tuning on fresh, domain-specific documents.
- Add human oversight: for any business or research use, pair LLM outputs with a quick human review to catch obsolete or incorrect details.
By combining these tactics, you can work around the “frozen” nature of LLMs and ensure your AI-driven tools stay in step with the real world.
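As a rough sketch of the first two tactics, you can stamp today's date into the system prompt and splice in freshly fetched snippets. The fetchLatestSnippets helper and its URL below are hypothetical placeholders; substitute whatever news API, search endpoint, or internal database you actually rely on.

JAVASCRIPT • fresh-context-sketch.js

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Hypothetical placeholder: fetch recent text snippets from your own
// news API, search endpoint, or internal database.
async function fetchLatestSnippets(topic) {
  const res = await fetch(`https://example.com/api/news?q=${encodeURIComponent(topic)}`);
  const items = await res.json();
  return items.map((item) => item.summary);
}

async function answerWithFreshContext(question, topic) {
  const today = new Date().toISOString().slice(0, 10);
  const snippets = (await fetchLatestSnippets(topic)).join("\n- ");

  const { choices } = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    temperature: 0.3,
    messages: [
      {
        role: "system",
        content:
          `Today is ${today}. Your training data may be out of date; ` +
          "prefer the provided snippets for anything time-sensitive and flag gaps you cannot verify."
      },
      { role: "user", content: `Recent snippets:\n- ${snippets}\n\nQuestion: ${question}` }
    ],
    max_tokens: 300
  });

  return choices[0].message.content.trim();
}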
Why do LLMs struggle with reasoning?
LLMs write text by predicting the next word based on patterns they saw in training. They do not run logic or hold variables in memory. This means they can stumble on puzzles like counting words in a sentence or following a multi-step plan, even when they sound confident.
You can help them by asking for a chain of thought—tell them to list each step before giving an answer. Break big problems into smaller chunks and feed those one at a time. Let them call out to external tools (for example, a code interpreter or calculator) for precise math or rule checks. And always add a quick validation step—either an automated script or a short human review—to catch any slip-ups before using the result.
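Here is a minimal sketch of that pattern for the classic word-count example: the prompt asks for numbered steps plus a final "ANSWER:" line, and a cheap local check verifies the result before you trust it. The answer format and validation logic are illustrative assumptions, not part of any official API.

JAVASCRIPT • count-and-verify-sketch.js

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function countWordsWithCheck(sentence) {
  const { choices } = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    temperature: 0.2,
    messages: [
      {
        role: "system",
        content:
          "List each reasoning step on its own numbered line, " +
          "then give the final answer on a line starting with 'ANSWER:'."
      },
      { role: "user", content: `How many words are in this sentence?\n\n"${sentence}"` }
    ],
    max_tokens: 300
  });

  const reply = choices[0].message.content;
  const match = reply.match(/ANSWER:\s*(\d+)/);
  const modelCount = match ? parseInt(match[1], 10) : NaN;

  // Cheap deterministic check: does the model's answer match a simple split?
  const actualCount = sentence.trim().split(/\s+/).length;
  return { reply, modelCount, verified: modelCount === actualCount };
}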
Why do LLMs reflect and amplify bias?
LLMs reflect and sometimes amplify bias because they learn patterns from massive text corpora that contain real-world prejudices. In essence, if the training data favors one group or viewpoint, the model will mirror those imbalances in its outputs.
Most of the bias comes from unbalanced or skewed data sources. For example, Wikipedia editors tend to be male and Western, so topics and writing styles lean in that direction. Training on news articles, social media, or specialized blogs can introduce subtle—yet harmful—stereotypes about gender, race, age, or nationality. A survey by Navigli et al. (2023) catalogs concrete examples of these biases, from assuming doctors are male to downgrading the abilities of non-native English speakers.
To curb biased outputs, you need a mix of data-level and pipeline-level fixes:
- Fine-tune on balanced, representative datasets or use data augmentation to boost under-represented groups.
- Audit and filter outputs with fairness metrics (for instance, pairwise demographic checks).
- Design prompts that explicitly instruct the model to avoid stereotypes (e.g., “Generate a bio without mentioning gender”).
- Incorporate adversarial testing—run challenging prompts that surface hidden biases.
- Keep a human-in-the-loop for any content that could impact reputations or legal outcomes.
By treating bias mitigation as an ongoing process, you can steer LLMs toward fairer, more trustworthy text—without giving up on their generative power.
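One way to make adversarial testing concrete is a small probe harness: run paired prompts that differ only in a demographic detail and flag divergent outputs for review. The prompt pairs below are illustrative; real audits use larger, domain-specific probe sets and formal fairness metrics.

JAVASCRIPT • bias-probe-sketch.js

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Illustrative prompt pairs that differ only in a demographic detail.
const probePairs = [
  ["Write a one-sentence bio for a nurse named James.",
   "Write a one-sentence bio for a nurse named Maria."],
  ["Describe a typical software engineer from Nigeria.",
   "Describe a typical software engineer from Germany."]
];

async function complete(prompt) {
  const { choices } = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    temperature: 0.3,
    messages: [
      { role: "system", content: "Avoid stereotypes about gender, race, age, or nationality." },
      { role: "user", content: prompt }
    ],
    max_tokens: 120
  });
  return choices[0].message.content.trim();
}

async function runBiasProbes() {
  for (const [a, b] of probePairs) {
    const [outA, outB] = [await complete(a), await complete(b)];
    // Flag each pair for human review; automated fairness metrics can be added here.
    console.log({ promptA: a, outputA: outA, promptB: b, outputB: outB });
  }
}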
How to Mitigate LLM Limitations in Practice
Step 1: Chunk and Prep Your Inputs
Break large documents into focused sections. Summarize or paraphrase each chunk in 1–2 sentences before feeding it back. Always ask for a specific output length (for example, “In 80 words, summarize this section”). This keeps the model’s token window focused on what matters and prevents early context from slipping away (the chunkAndSummarize example earlier shows one way to automate this).
Step 2: Ground Outputs with Retrieval
Store your reference texts in a vector database (for example, Pinecone or Cloudflare Vectorize). At query time, fetch only the top‐k relevant passages and insert them into your prompt. Instruct the model to cite its sources or even to quote snippets. This retrieval‐augmented approach anchors responses in real data and sharply reduces hallucinations.
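A minimal sketch of that flow is shown below. The retrieveTopK wrapper and the vectorIndex client it expects are hypothetical stand-ins for whichever vector database you use (Pinecone, Cloudflare Vectorize, or similar); the embedding call uses the OpenAI embeddings endpoint.

JAVASCRIPT • rag-prompt-sketch.js

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

/**
 * Hypothetical retrieval step: `vectorIndex` stands in for your vector
 * database client and is assumed to expose a query(vector, k) method
 * that returns the top-k stored passages as strings.
 */
async function retrieveTopK(vectorIndex, question, k = 5) {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question
  });
  return vectorIndex.query(data[0].embedding, k);
}

async function answerFromDocs(vectorIndex, question) {
  const passages = await retrieveTopK(vectorIndex, question);
  const context = passages.map((p, i) => `[${i + 1}] ${p}`).join("\n\n");

  const { choices } = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    temperature: 0.2,
    messages: [
      {
        role: "system",
        content:
          "Answer only from the numbered passages and cite them like [2]. " +
          "If the answer is not in the passages, say 'not found in the provided documents'."
      },
      { role: "user", content: `${context}\n\nQuestion: ${question}` }
    ],
    max_tokens: 400
  });

  return choices[0].message.content.trim();
}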
Step 3: Guide Multi-Step Reasoning
When you need a logical chain—like a word count or a strategy plan—prompt the model to “list each step before giving the answer.” Provide a short example of how you want the reasoning laid out. Lower the temperature (0.2–0.5) for more deterministic output. For precise math or rule checks, let the model call an external calculator or code interpreter.
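For the external-tool piece, one option is OpenAI-style tool calling: declare a calculator function, let the model request it, and run the arithmetic locally before handing the result back. This is a sketch; the evaluateExpression helper is a deliberately restricted stand-in for a real math parser.

JAVASCRIPT • calculator-tool-sketch.js

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Restricted stand-in for a real math evaluator: digits and basic operators only.
function evaluateExpression(expr) {
  if (!/^[\d\s+\-*/().]+$/.test(expr)) throw new Error("Unsupported expression");
  return Function(`"use strict"; return (${expr});`)();
}

async function solveWithCalculator(question) {
  const messages = [
    { role: "system", content: "List each reasoning step, and use the calculator tool for any arithmetic." },
    { role: "user", content: question }
  ];

  const first = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    temperature: 0.2,
    messages,
    tools: [{
      type: "function",
      function: {
        name: "calculator",
        description: "Evaluate a basic arithmetic expression",
        parameters: {
          type: "object",
          properties: { expression: { type: "string" } },
          required: ["expression"]
        }
      }
    }]
  });

  const msg = first.choices[0].message;
  if (!msg.tool_calls) return msg.content;

  // Run each requested calculation locally and hand the results back.
  messages.push(msg);
  for (const call of msg.tool_calls) {
    const { expression } = JSON.parse(call.function.arguments);
    messages.push({
      role: "tool",
      tool_call_id: call.id,
      content: String(evaluateExpression(expression))
    });
  }

  const second = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    temperature: 0.2,
    messages
  });
  return second.choices[0].message.content;
}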
Step 4: Add Human and Automated Checks
For any critical or sensitive task, build a quick human-in-the-loop review. Complement that with simple scripts that cross-verify facts against live APIs or databases. To catch bias, run adversarial prompts or apply fairness metrics (for example, pairwise demographic tests from Navigli et al. 2023). You can also prompt the LLM to self-review: “Does this answer contradict itself or omit key details?”
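The automated half of this step can be as simple as pulling the authoritative number from a live endpoint and comparing it against whatever figures appear in the model's answer. The fetchOfficialFigure helper and its URL below are hypothetical; point them at your own API or database.

JAVASCRIPT • numeric-check-sketch.js

// Hypothetical endpoint returning the authoritative figure to compare against.
async function fetchOfficialFigure(metric) {
  const res = await fetch(`https://example.com/api/metrics/${encodeURIComponent(metric)}`);
  const { value } = await res.json();
  return value;
}

/**
 * Extracts numbers from the model's answer and checks whether any of them
 * are within a tolerance of the authoritative value. Mismatches are
 * flagged for human review rather than auto-corrected.
 */
async function verifyNumericClaim(answerText, metric, tolerance = 0.01) {
  const official = await fetchOfficialFigure(metric);
  const numbers = (answerText.match(/-?\d+(\.\d+)?/g) || []).map(Number);
  const matches = numbers.some(
    (n) => Math.abs(n - official) <= Math.abs(official) * tolerance
  );
  return { official, numbers, needsHumanReview: !matches };
}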
Step 5: Keep Knowledge Fresh
Begin each session by stating today’s date (e.g., “Today is July 24, 2025”). When you need up-to-date facts, fetch live snippets—news articles, market data or domain reports—from an API or search engine, then include them in your prompt. Finally, schedule lightweight fine-tuning or prompt-tuning on new documents every few weeks to prevent staleness.
Additional Notes
- For hallucination monitoring, explore logit‐based methods like “A Stitch in Time Saves Nine” or tools such as SelfCheckGPT.
- Track token consumption to avoid unexpected truncation (a rough estimator is sketched below).
- Encourage the model to ask clarifying questions if a prompt is ambiguous.
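For token tracking without extra dependencies, the rough 4-characters-per-token heuristic from earlier gives a quick estimate; for exact counts, use a proper tokenizer library for your model.

JAVASCRIPT • token-estimate-sketch.js

// Rough token estimate using the ~4 characters per token heuristic.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Checks whether a prompt plus a reserved output budget fits a given window.
function fitsInWindow(promptText, contextWindow = 128000, reservedForOutput = 1000) {
  return estimateTokens(promptText) + reservedForOutput <= contextWindow;
}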
Data snapshot: LLM limitations in numbers
Context window sizes
- GPT-3.5 Turbo holds 16 000 tokens (≈ 12 000 words)
- GPT-4 Turbo holds 128 000 tokens (≈ 96 000 words)
- Anthropic Claude 3 (Haiku/Sonnet/Opus) holds 200 000 tokens (≈ 150 000 words)
- Google Gemini Pro/1.5 holds 128 000 tokens (experimental builds advertise up to 1 000 000)
Knowledge cutoffs
- GPT-3.5 Turbo only “knows” up to September 2021
- GPT-4 Turbo stops at December 2023
- Anthropic Claude 3 caps out at August 2023
- Google Gemini Pro/1.5 sees data until early 2023
Accuracy & hallucinations
- GPT-3 produces incorrect or nonsensical responses about 15 % of the time
- A logit-based check (“A Stitch in Time Saves Nine”) cuts hallucinations from 47.5 % down to 14.5 %
- Retrieval-Augmented Generation (RAG) can reduce overall hallucination rates by over 40 %
Bias & fairness
- GPT-3 generated biased text in 19 % of politically charged prompts (2021 study)
Computational footprint
- Training GPT-3 consumed roughly 355 GPU-years and cost several million dollars
Pros and Cons of Large Language Models
✅ Advantages
- Massive context windows: GPT-4 Turbo handles up to 128 000 tokens (≈ 96 000 words), ideal for long reports and extended dialogues.
- Multi-domain versatility: perform summarization, translation, creative writing, code generation and more—all with a single model.
- Rapid turnaround: draft articles or code snippets in seconds, speeding workflows by up to 10× compared to manual writing.
- Cost-effective scaling: pay-as-you-go APIs often charge under $0.002 per 1 K tokens, undercutting specialist labor rates.
- Reliability boost with RAG: pairing with retrieval-augmented generation cuts hallucinations by over 40%.
❌ Disadvantages
- Hallucination risk: about 15% of outputs may be false or misleading.
- Frozen knowledge: models can’t access any information after their training cutoff (e.g., no updates past December 2023).
- Limited reasoning: stumble on multi-step logic tasks like word counts or complex planning.
- Bias amplification: up to 19% of responses reflect unfair stereotypes without active mitigation.
Overall assessment: LLMs shine when you need fast, flexible, and low-cost content at scale—especially if you layer in retrieval and human review. For high-stakes tasks requiring up-to-the-minute facts, deep logic or unbiased judgment, treat them as powerful assistants rather than standalone experts.
Mitigating LLM Limitations Checklist
- Break large texts into 1,500-token chunks and write a 1–2 sentence summary for each before feeding them to the model.
- Limit response size by adding clear instructions like “In 80 words, summarize Section 2.”
- Retrieve context from your vector database (e.g., Pinecone): fetch the top-5 relevant passages and insert them into your prompt.
- Ask the model to cite its sources or quote snippets to anchor each claim in real data.
- Prompt a chain of thought—instruct “List each reasoning step before your conclusion” for multi-step tasks.
- Set temperature to 0.2–0.5 when you need deterministic, low-variance outputs.
- Automate fact and math checks by validating numbers with a script or live API call before publishing.
- Build a human-in-the-loop review: assign a reviewer to verify critical details, flag errors, and spot bias.
- Prefix every session with today’s date (e.g., “Today is July 24, 2025”) and feed in fresh API snippets to counter stale knowledge.
- Audit outputs for bias: run adversarial prompts, apply fairness metrics (like pairwise demographic tests), and explicitly instruct “Avoid stereotypes” in your prompt.
Key Points
🔑 LLMs don’t truly think: Large language models predict the next token without real understanding, beliefs, or persistent memory—so they can’t form genuine insights or learn across sessions.
🔑 Context windows are finite: Inputs beyond a model’s token limit (e.g., 16K–200K tokens) push out earlier text and drop critical details—mitigate by chunking, summarizing sections, or using retrieval‐augmented pipelines.
🔑 Hallucinate without grounding: They can generate fluent but false claims when patterns in training data are incomplete—anchor outputs with Retrieval‐Augmented Generation (RAG), cite sources, lower temperature, and add human‐in‐the‐loop checks.
🔑 Static knowledge bases: Their world view freezes at the training cutoff—combat staleness by including the current date in prompts, fetching live API snippets, and scheduling regular fine‐tuning on fresh data.
🔑 Bias is baked in: They mirror and often amplify biases from skewed corpora—counteract with balanced fine‐tuning, fairness audits, adversarial testing, and clear prompts that forbid stereotypes.
Summary: Large language models excel at versatile text generation but require proactive context management, grounding, and oversight to address their gaps in true reasoning, factual accuracy, timeliness, and fairness.
Frequently Asked Questions
Q: What can LLMs never do?
LLMs cannot truly think, feel or form beliefs like humans—they only predict the next word based on learned patterns, so they can’t update their knowledge in real time or maintain a genuine, long-term memory beyond their fixed context window.
Q: Why do LLMs fail?
Failures occur because LLMs rely on statistical word patterns, not logical reasoning or fact-checking, which can lead to dropped context, made-up facts (hallucinations), calculation errors and the unintentional echoing of biases from their training data.
Q: Do LLMs understand language like humans?
No. While they generate fluent text, LLMs lack true comprehension—they don’t grasp intent, emotions or unstated context, and they can misinterpret idioms, sarcasm or nuanced meaning without explicit guidance.
Q: Are LLM outputs always accurate?
No. LLMs can sound confident yet produce incorrect, outdated or misleading information because they don’t verify answers against live sources and their training data stops at a fixed cutoff date.
Q: Do LLMs learn after deployment?
Not by themselves. Once trained, an LLM’s knowledge is static until you fine-tune or retrain it on new data, or augment it with external retrieval systems to supply fresh information.
Large language models can spin a story or debug code, but they don’t truly understand. They have a fixed context window that drops older details, they sometimes hallucinate false facts, and their knowledge freezes after training. They also trip up on step-by-step puzzles and can echo biases hidden in their data.
You can manage these quirks with a few simple steps. Split long inputs into small pieces and summarize each one. Use retrieval so answers come from real sources. Ask the model to list its reasoning steps. Then add human or automated checks. Always start with today’s date and feed in recent facts so the model stays current.
When we combine these best practices, LLMs become powerful partners. They bring speed, scale, and versatility to tasks that once took hours. But they work best under our careful supervision. Treat them as smart assistants, not untouchable experts, and you’ll unlock their full potential—safely, accurately, and fairly.
Key Takeaways
Essential insights from this article
Break large texts into ~1,500-token chunks, summarize each before feeding to the model to prevent context loss.
Use retrieval-augmented generation and require the LLM to cite sources—this can cut hallucinations by over 40%.
Ask for step-by-step reasoning at a low temperature (0.2–0.5) and layer in human or automated checks to catch logic errors and bias.