Here is the uncomfortable thing about writing for AI answer engines. The model that decides whether to cite you almost never sees your page. It sees a fragment of it, a few hundred words pulled out of the surrounding article, stripped of the headline that framed them and the paragraph that set them up. The model judges that fragment alone. If the fragment answers the question on its own, you get cited. If it only makes sense as part of the whole piece you so carefully wrote, you do not.
That is the entire game, and most content advice misses it because it is still optimizing for a reader who scrolls. A human reader arrives at the top, takes in your design, follows your argument, and forgives a slow opening because the payoff is three paragraphs down. A retrieval system does none of that. It chops your page into pieces before it ever reasons about them. Writing for agents means writing for the piece, not the page.
This post builds the advice from the bottom up, from how large language models actually ingest a web page, because the tactics only make sense once you see the machine.
Where this problem came from
For thirty years, the unit of web content was the page. You published a page, search engines indexed the page, and a ranking pointed a human at the whole thing. The page was atomic. Everything about content production assumed it: the hero image, the introduction that built tension, the narrative that rewarded a full read.
Retrieval-augmented generation broke the page into pieces. When you ask ChatGPT or Perplexity or Google's AI Mode a question that needs current information, the system does not read whole web pages and think about them. It cannot. A model has a fixed context window, and stuffing ten full articles into it is slow, expensive, and counterproductive. So the pipeline does something else. It fetches candidate pages, splits each one into chunks, converts every chunk into a vector that captures its meaning, and retrieves only the handful of chunks whose vectors sit closest to the meaning of the question. Those few chunks, not the pages they came from, are what the model reads when it composes an answer.
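The fetch, split, embed, retrieve sequence can be sketched in a few lines. This is a toy illustration, not how any particular engine works: the function names are invented for this example, and the "embedding" is a plain word count standing in for the learned vector model a real system would use.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, max_words: int = 40) -> list[str]:
    """Fixed-size chunker: cut every max_words words, ignoring meaning."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts (an assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, pages: list[str], top_k: int = 2) -> list[str]:
    """Chunk every page, then return the top_k chunks closest to the question."""
    chunks = [chunk for page in pages for chunk in split_into_chunks(page)]
    query_vector = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:top_k]


pages = [
    "Enterprise plans start at 2,000 dollars a month. Billing is annual.",
    "Our team enjoys hiking on weekends and baking bread.",
]
best = retrieve("How much do enterprise plans cost a month?", pages, top_k=1)
print(best[0])
```

Note what the model downstream would receive here: only the winning chunk, with no page title and no surrounding article.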
The consequence is structural. Your page is no longer the unit of content. The chunk is. And you do not control where the chunk boundaries fall. A chunker might split your page every 500 tokens, or at every heading, or wherever it detects a topic shift. Whatever it does, it will produce passages that get judged in isolation. The skill of content for agents is making sure that wherever the cut lands, the piece on either side still stands up.
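To see why the cut matters, here is a small sketch comparing two hypothetical chunking strategies on the same text. Both functions are assumptions made up for illustration, and the sample document uses Markdown-style `##` headings only because they are easy to detect.

```python
def chunk_by_heading(markdown: str) -> list[str]:
    """Start a new chunk at every line that begins with '#'."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def chunk_by_window(markdown: str, size: int = 8) -> list[str]:
    """Cut every `size` words, wherever that happens to land."""
    words = markdown.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


doc = (
    "## What does Plan A cost\n"
    "Plan A costs 20 dollars a month.\n"
    "## What does Plan B cost\n"
    "Plan B costs 50 dollars a month."
)

print(chunk_by_heading(doc)[0])
print(chunk_by_window(doc)[1])  # prints: costs 20 dollars a month. ## What does
```

The heading-based chunk keeps the question and its answer together. The second fixed-window chunk carries a price but has lost the name of the plan it belongs to, which is exactly the kind of fragment that cannot be cited.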
How a machine actually reads your page
Three facts about that pipeline change how you should write. None of them is intuitive if you are used to writing for people.
The first fact: most AI crawlers do not run JavaScript. This is the one that quietly sinks a lot of modern sites. When Vercel and MERJ analyzed AI crawler behavior in late 2024, across hundreds of millions of fetches, they found that the major bots fetch JavaScript files but do not execute them. GPTBot pulled JS in about 11 percent of its requests and ClaudeBot in about 24 percent, but neither one ran the code. Google's crawler, by contrast, renders JavaScript with a real browser. So if your content is assembled in the browser by a client-side framework, a crawler that does not render sees the empty shell that arrives before the script runs. The product description, the article body, the answer a model might have cited: to that crawler, none of it exists. You cannot be extracted from a page that, as far as the machine is concerned, is blank. Server-side rendering or static generation is not a performance nicety here. It is the difference between being readable and being invisible.
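A rough way to see what a non-rendering crawler gets is to extract the text from the raw HTML without executing anything. The sketch below uses Python's standard `html.parser`; the class name, the function name, and both sample pages are invented for illustration, and real crawlers will differ in the details of what they keep.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text present in the raw HTML, skipping script and style bodies."""

    SKIP = {"script", "style", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def text_without_js(raw_html: str) -> str:
    """Return the text a crawler could read if it never ran any script."""
    parser = VisibleText()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical client-rendered page: the content only exists inside a script.
client_rendered = (
    '<html><body><div id="root"></div>'
    '<script>render("Enterprise plans start at 2,000 dollars a month.")</script>'
    "</body></html>"
)
# Hypothetical server-rendered page: the content is in the markup itself.
server_rendered = (
    "<html><body><main><h1>Pricing</h1>"
    "<p>Enterprise plans start at 2,000 dollars a month.</p></main></body></html>"
)

print(repr(text_without_js(client_rendered)))  # prints ''
print(text_without_js(server_rendered))
```

In practice `raw_html` would come from a plain HTTP GET of your own URL, for example with `urllib.request.urlopen`. If the result is empty or close to it, that is the page the crawler indexed.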
The second fact: the machine wants your text, not your design. Cloudflare, when it shipped a feature that serves pages to agents as Markdown, published the numbers behind the decision. One of its own blog posts took 16,180 tokens as raw HTML and 3,150 tokens as clean Markdown, an 80 percent reduction. Every AI pipeline does some version of that conversion, stripping navigation, styling, scripts, and layout down to the words and their structure. What survives the strip is your headings, your paragraphs, your lists, your tables, and the semantic bones of the document. What gets discarded is everything you spent design effort on. Write for what survives.
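The arithmetic behind that figure is worth a quick check, using only the two token counts quoted above. The function name is an assumption for this example.

```python
def token_reduction(before: int, after: int) -> float:
    """Fraction of tokens removed when going from `before` to `after`."""
    return 1 - after / before


html_tokens = 16_180      # raw HTML, as quoted above
markdown_tokens = 3_150   # clean Markdown, as quoted above

print(f"{token_reduction(html_tokens, markdown_tokens):.1%}")  # prints 80.5%
```

Put the other way round, the HTML version costs a little over five times as many tokens as the Markdown version of the same post.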
The third fact: position inside a chunk matters, and the middle is where content goes to be ignored. In 2023, researchers from Stanford, Berkeley, and Samaya AI published a study with a title that became shorthand for the whole effect: "Lost in the Middle." They found that when the information a model needs sits at the start or end of its input, the model uses it well. When the same information sits in the middle, accuracy drops, sometimes sharply. Models attend hardest to the edges. The study measured this across long contexts, and the same logic plausibly applies inside a single chunk. A fact buried in the fourth sentence of a dense paragraph is structurally disadvantaged against the same fact stated first.
Put those three facts together and the brief writes itself. Ship text that exists without JavaScript. Make the structure carry the meaning, because the structure is what reaches the model. And lead with the answer, because the front of a passage is the part that gets read.
What this looks like on the page
The principles are concrete. Here is how each one translates into something you can actually do.
Answer first, then explain. The single highest-leverage habit is to state the answer in the first sentence or two under any heading, then expand. A reader will tolerate a wind-up. A retrieval system treats the opening of a section as the most likely thing to extract and the most heavily weighted thing to judge. If a section is headed "How much does X cost," the first line should give a number or a range, not a paragraph about how pricing is nuanced. The nuance can follow. It just cannot go first. This is close to the inverted pyramid that newswriting has used for a century, and it works for the same reason: it puts the payload where attention is highest.
Write self-contained sections. Because chunk boundaries often track headings, treat every section as if it might be read entirely alone, with no title above it and nothing below. That means resolving pronouns and references inside the section. A passage that opens "This approach has three weaknesses" is broken once it is extracted, because "this approach" is now undefined. Name the thing. Repeat the subject. The light redundancy that an editor would trim from a print article is, for machine extraction, load-bearing. Each section should make one point completely rather than half a point that depends on the section before it.
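A crude lint for this can be automated. The sketch below flags sections whose first words point backward at something outside the section. The opener list and the function name are assumptions, and the check will produce false positives (a section can legitimately start with "This post"), so treat a hit as a prompt to reread, not a verdict.

```python
# Openers that usually refer to something stated earlier (an illustrative list).
DANGLING_OPENERS = ("this ", "that ", "these ", "those ", "it ", "they ", "such ")


def opens_with_dangling_reference(section: str) -> bool:
    """True when a section's first words lean on text outside the section."""
    return section.strip().lower().startswith(DANGLING_OPENERS)


sections = [
    "This approach has three weaknesses.",
    "Client-side rendering has three weaknesses.",
]
for section in sections:
    print(opens_with_dangling_reference(section), "|", section)
```

The first section is flagged and the second is not, because the second names its subject.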
Use headings that state the content, not tease it. A heading is a strong signal both for where a chunk gets cut and for what the chunk is about. "The pricing problem" tells a machine very little. "Enterprise plans start at 2,000 dollars a month" tells it exactly what lives below. Headings phrased as the real questions people ask, or as plain declarative claims, give the retrieval system a clean handle. Clever headings cost you that handle.
Lead with semantic HTML. Use a single h1, then h2 and h3 in a logical nesting. Mark lists as lists and tables as tables. Put paragraphs in paragraph tags. This sounds obvious, and it is exactly the part that breaks when content is assembled by a page builder that renders everything as nested divs styled to look like headings. To a human those divs look fine. To the parser that converts your page to Markdown, a div is not a heading and your document loses its outline. The hierarchy you can see has to also be the hierarchy in the markup.
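The difference shows up as soon as you try to recover an outline from the markup. This sketch pulls heading tags into Markdown-style outline lines using the standard `html.parser`; the class, the function, the CSS class names, and the sample markup are all hypothetical, and a production converter would handle far more than headings.

```python
from html.parser import HTMLParser


class Outline(HTMLParser):
    """Collects h1..h6 elements as Markdown-style outline lines."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self._level = None
        self._buffer: list[str] = []
        self.lines: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            title = "".join(self._buffer).strip()
            self.lines.append("#" * self._level + " " + title)
            self._level = None


def markdown_outline(raw_html: str) -> list[str]:
    parser = Outline()
    parser.feed(raw_html)
    parser.close()
    return parser.lines


semantic = "<h1>Pricing</h1><h2>Enterprise plans</h2><p>From 2,000 dollars a month.</p>"
div_soup = (
    '<div class="title-xl">Pricing</div>'
    '<div class="title-lg">Enterprise plans</div>'
    "<div>From 2,000 dollars a month.</div>"
)

print(markdown_outline(semantic))  # prints ['# Pricing', '## Enterprise plans']
print(markdown_outline(div_soup))  # prints []
```

Both pages can look identical in a browser. Only one of them has an outline a parser can find.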
Put structured facts in tables and lists. When information is genuinely a set (comparisons, specifications, steps, options), formatting it as a table or a list does real work. It survives the conversion to Markdown cleanly, it isolates each fact into its own row or item where nothing buries it, and it is unambiguous in a way a flowing sentence is not. A specification sentence with five clauses forces a model to parse relationships. A five-row table hands them over. The same instinct that makes a list scannable for a hurried human makes it extractable for a machine.
Raise the factual density. AI engines are drawn to passages that carry specific, checkable information: numbers, dates, named entities, direct figures. The Princeton-led GEO study, presented at the KDD conference in 2024, ran controlled experiments on which content changes lift visibility inside generative engines. Adding relevant statistics and adding well-placed quotations were among the methods that helped, while keyword stuffing actually hurt. A useful test: read a paragraph and ask what a model could quote from it as a standalone fact. If the answer is nothing, the paragraph is decoration. Decoration does not get cited.
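That test can be roughed out in code. The sketch below treats any sentence containing a digit as potentially quotable, which is a deliberately blunt proxy chosen for this example: it misses named entities and will count trivial numbers, so it is a first-pass filter and nothing more. The function name is an assumption.

```python
import re


def quotable_sentences(paragraph: str) -> list[str]:
    """Return sentences that contain at least one digit (a crude fact proxy)."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [s for s in sentences if re.search(r"\d", s)]


decoration = "Pricing is nuanced. Every business is different."
dense = "Pricing is nuanced. Enterprise plans start at 2,000 dollars a month."

print(quotable_sentences(decoration))  # prints []
print(quotable_sentences(dense))
```

A paragraph that returns an empty list is a candidate for either a rewrite or a cut.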
What to stop doing
Some long-standing content habits actively work against extraction. They are worth naming because they rarely feel like mistakes.
Stop burying the answer. The slow build, the context before the conclusion, the "to understand X we first need to understand Y," all of it pushes the payload into the low-attention middle. If a section answers a question, answer it at the top.
Stop writing sections that lean on each other. Content built as one continuous argument, where paragraph six only means something if you read paragraphs one through five, does not survive chunking. Each piece, extracted alone, should still inform.
Stop hiding facts in prose. A statistic wrapped in three subordinate clauses is hard for a model to lift cleanly. Give important facts their own sentence, their own list item, their own table cell.
Stop relying on client-side rendering for content that matters. If the words only appear after JavaScript runs, assume the major non-Google AI crawlers cannot see them.
Stop chasing llms.txt as a shortcut. The proposed llms.txt file, a curated map of your site for language models, has had real attention but little payoff. SE Ranking analyzed roughly 300,000 domains and found no relationship between having the file and how often a domain gets cited in AI answers. Google's Search team has said plainly that it does not use it, with John Mueller comparing it to the long-dead keywords meta tag. It is cheap to add and unlikely to hurt, but it is not where the work is.
Two more cautions, because content for agents has its own failure modes. Schema markup, the structured JSON-LD that labels your content as a product, an FAQ, an article, a recipe, is genuinely useful for giving machines an unambiguous read of entities and facts. It is not a magic citation lever, and the experienced view is that it supports AI understanding rather than guaranteeing inclusion. Use it to remove ambiguity, not as a trick. And do not let extraction-friendly structure collapse into the flat, sectioned sameness that AI systems produce by default and that readers and search engines alike have started to discount. The goal is a page that is easy for a machine to parse and still worth a human's time. Answer-first structure and original, specific substance are not in tension. A thin page organized perfectly is still thin.
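For a sense of what that markup looks like, here is a minimal sketch that builds FAQ markup with the schema.org `FAQPage`, `Question`, and `Answer` types. The helper name and the sample question and answer are hypothetical, and the output is only as useful as the visible page content it mirrors.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build schema.org FAQPage JSON-LD from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2)


# Hypothetical content, reusing the pricing example from earlier in this post.
markup = faq_jsonld(
    [("How much do enterprise plans cost?",
      "Enterprise plans start at 2,000 dollars a month.")]
)
print(markup)
```

The resulting string goes inside a `<script type="application/ld+json">` element in the page's HTML, alongside the same question and answer in visible text.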
Where this is heading
The direction of travel is toward more extraction, not less. A growing share of the audience for any page is now a machine deciding what to surface to a person who may never click through. Semrush, studying 230,000 prompts and more than 100 million AI citations across mid to late 2025, found citation patterns shifting week to week as engines retuned which sources they favored. The surface is unstable. What is stable is the mechanism underneath it: chunk, retrieve, judge the chunk.
Two forces will sharpen this. One is agentic commerce and agentic research, where an autonomous agent gathers and compares information on a person's behalf. That agent has even less patience for layout and narrative than a retrieval pipeline. It wants facts it can lift and act on. The other is provenance and labeling, with the EU AI Act's transparency rules under Article 50 taking effect on 2 August 2026, pushing machine-readable marking of synthetic content. The web is being rebuilt, slowly, to be read by software, and structured, honest, clearly marked content is the version of a page that software can use.
For teams producing content at scale, the practical move is to treat machine-readability as a property you can check, not hope for. Fetch your own page the way a crawler does, with JavaScript turned off, and see what text remains. Convert it to Markdown and read the result: if the outline is wrong or the facts are gone, a model sees the same broken version. This is exactly the kind of repetitive structural audit an agent handles well, running across a content library and flagging the pages that fall apart on extraction. At Perform Digital, that auditing layer is part of how we think about content systems for clients: not just whether a page reads well, but whether it survives being taken apart.
The mental shift is the whole thing. Stop picturing a reader who arrives at the top of your page and scrolls. Picture a machine that grabbed one section out of the middle, with no title and no run-up, and has to decide right now whether that section answers a question. Write so that, wherever the cut lands, the answer is yes.
Council summary
This post argues that AI answer engines never judge your page, only a chunk of it, so content should be written for the fragment: answer-first, self-contained sections, semantic HTML, facts in tables, and text that exists without JavaScript. The council verified every figure against primary sources. The Vercel and MERJ crawler study (GPTBot fetching JavaScript in roughly 11 percent of requests, ClaudeBot in roughly 24 percent, only Google rendering it), Cloudflare's 80 percent token reduction, the "Lost in the Middle" paper, the Princeton-led GEO study at KDD 2024, the SE Ranking llms.txt analysis of about 300,000 domains, the Semrush citation study, and the EU AI Act Article 50 date of 2 August 2026 all check out as stated. Wording was tightened in one place for clarity, and no claims were cut. The takeaway for a busy reader: fetch your own page with JavaScript off, convert it to Markdown, and fix whatever falls apart, because that broken version is what the model sees.