AI & Automation
When I first started experimenting with AI for my e-commerce client's SEO strategy, I had the same question every marketer was asking: where does GPT-4 actually get its ranking data from? The conventional wisdom suggested AI models just "know" current rankings somehow. But after tracking down a couple dozen LLM mentions for a client who wasn't even in a tech-heavy niche, I discovered the reality is much more complex—and much more opportunity-rich than most people realize.
Here's the uncomfortable truth: most businesses are optimizing for the wrong thing entirely. While everyone's debating whether AI will kill SEO, the smart money is figuring out how to get mentioned by AI systems before competitors even know this game exists.
Through months of testing across multiple client projects and diving deep into how these systems actually work, I've learned that understanding GPT-4's data sources isn't just academic—it's the key to an entirely new distribution channel.
Here's what you'll learn from my hands-on experiments:
The real sources behind GPT-4's "knowledge" about websites and rankings
Why traditional SEO metrics don't predict AI mentions
A practical framework for getting your content into AI training data
How I tracked and increased LLM mentions for clients across different industries
Why this opportunity window won't stay open forever
If you're still thinking about AI as a threat to your traffic, you're missing the biggest distribution opportunity since search engines themselves. Let me show you what I learned the hard way—and how you can use it.
Most SEO professionals are approaching AI mentions with the same mindset they use for traditional search rankings. The industry talks about "optimizing for AI" like it's just another SERP feature you can game with the right keywords and schema markup.
Here's what every AI and SEO guide tells you:
Use structured data to help AI understand your content
Create FAQ sections that match common AI queries
Focus on authoritative backlinks to signal quality to AI systems
Optimize for featured snippets since AI pulls from similar sources
Build topical authority through comprehensive content clusters
This conventional wisdom exists because it's easier to package familiar SEO tactics with an "AI" label than to admit nobody really knows how these systems work. The problem? It assumes AI models work like search engines—crawling, indexing, and ranking content in real-time.
But here's where this approach falls short: GPT-4 doesn't crawl the web like Google does. Its knowledge comes from training data that was collected at a specific point in time, combined with real-time retrieval capabilities that work completely differently from traditional search.
The result? Businesses are optimizing their websites for AI systems that might never see their content, while missing the actual pathways that lead to AI mentions. Meanwhile, sites with terrible SEO metrics are getting referenced by AI systems simply because they happened to be in the right datasets.
Understanding where GPT-4 actually gets its data isn't just technical curiosity—it's the foundation for an entirely different content and distribution strategy.
Who am I
Seven years of freelance experience working with SaaS and e-commerce brands.
My wake-up call came when I was working on a complete SEO overhaul for a Shopify e-commerce client. This wasn't a tech company or SaaS—just a traditional retail business selling physical products. But while tracking their SEO progress, I noticed something unexpected: their brand was getting mentioned in AI-generated responses, despite being in a niche where you wouldn't expect much LLM usage.
This discovery sent me down a research rabbit hole. I started tracking mentions across different AI systems for this client and several others. What I found challenged everything I thought I knew about how AI systems source their information.
The conventional approach would have been to focus on traditional SEO—building backlinks, optimizing product pages, creating content clusters. But I realized we were dealing with something fundamentally different. These AI mentions weren't correlating with search rankings or domain authority in any predictable way.
I reached out to teams at AI-first startups like Profound and Athena to understand what they were seeing. The consensus was clear: everyone is still figuring this out. There's no definitive playbook yet, which meant we were in uncharted territory.
But that uncertainty also represented opportunity. While most businesses were waiting for "best practices" to emerge, we could experiment and potentially establish positioning before the space got crowded.
The challenge was figuring out the actual mechanics. Where does GPT-4 get its ranking data? How do AI systems decide which sources to reference? And most importantly for my clients: how do you influence those decisions?
Through conversations with AI researchers and hands-on testing, I learned that the answer is much more complex—and much more actionable—than the simple explanations most marketers were getting.
My experiments
What I ended up doing and the results.
After months of experimentation across multiple client projects, I've developed a framework for understanding and influencing how AI systems source information. Here's what I discovered about where GPT-4 actually gets its ranking data and how you can use this knowledge:
The Three-Layer Reality of AI Data Sources
First, you need to understand that GPT-4 operates on three distinct data layers, each with different implications for your content strategy:
Layer 1: Training Data Foundation
This is the massive dataset used to train the model initially. It includes web pages, books, articles, and other text sources collected up to a specific cutoff date. For GPT-4, that cutoff falls between late 2021 and late 2023, depending on the model version. Here's the crucial insight: being in this training data is like being baked into the model's "memory." Sites that were authoritative and well-linked during the training period have a fundamental advantage.
Layer 2: Real-Time Retrieval
When GPT-4 needs current information, it can search the web in real time using integrated search capabilities. But here's what most people miss—this isn't like Google search. The system uses specific search queries and retrieval patterns that don't necessarily match traditional SEO ranking factors.
Layer 3: Reinforcement Learning from Human Feedback (RLHF)
The model's behavior is shaped during fine-tuning by human feedback. In practice, this means the sources it tends to cite are influenced by which references human raters found most helpful, not just which pages rank highest in search.
My Testing Framework
I developed a systematic approach to track and influence AI mentions across these three layers:
Step 1: Audit Current AI Mentions
I created queries specifically designed to test how AI systems reference information in my clients' industries. Instead of generic searches, I tested edge cases and specific scenarios where AI might need to pull current data or make recommendations.
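As a sketch of what such an audit can look like in code: run the same set of prompts against each AI system, collect the responses, then scan them for brand mentions. Everything below (the brand names and sample responses) is illustrative, not client data.

```python
import re
from collections import Counter

def audit_mentions(responses, brands):
    """Count how many AI responses mention each brand
    (case-insensitive, whole-phrase match)."""
    counts = Counter()
    for text in responses:
        for brand in brands:
            # Lookarounds instead of \b so brands ending in
            # punctuation (e.g. "Stride Co.") still match cleanly.
            pattern = re.compile(
                rf"(?<!\w){re.escape(brand)}(?!\w)", re.IGNORECASE
            )
            if pattern.search(text):
                counts[brand] += 1  # count responses, not raw occurrences
    return counts

# Illustrative responses, as if collected from several AI systems.
responses = [
    "For running shoes, many runners recommend Acme Footwear or Stride Co.",
    "Acme Footwear's sizing guide is a commonly cited resource.",
    "Budget options include Stride Co. and generic store brands.",
]
counts = audit_mentions(responses, ["Acme Footwear", "Stride Co."])
print(counts.most_common())
```

Rerunning the same prompt set on a schedule turns this into a simple trend line: mentions per brand, per system, per month.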
Step 2: Content Architecture for AI Consumption
Based on my observations, AI systems don't consume content the same way humans do. They break information into passages and synthesize answers from multiple sources. This meant restructuring content so each section could stand alone as a valuable snippet while still being part of a coherent whole.
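A minimal sketch of that restructuring idea: split an article into section-level chunks and prepend the article title and section heading to each one, so a chunk still carries its context when extracted in isolation. The article text and titles here are invented for illustration.

```python
def chunk_with_context(title, markdown_text):
    """Split a markdown article into passage-level chunks, prepending
    the article title and section heading so each chunk stands alone."""
    chunks, heading, body = [], None, []

    def flush():
        if body:
            context = f"{title} - {heading}" if heading else title
            chunks.append(context + "\n" + "\n".join(body).strip())

    for line in markdown_text.splitlines():
        if line.startswith("## "):
            flush()
            heading, body = line[3:].strip(), []
        else:
            body.append(line)
    flush()
    return chunks

article = """Intro paragraph about choosing hiking boots.

## Fit
Measure feet in the afternoon, when they are largest.

## Materials
Full-grain leather is durable but heavy."""

chunks = chunk_with_context("Hiking Boot Guide", article)
for c in chunks:
    print(c, end="\n---\n")
```

The point of the prefix is that a retrieval system pulling only the "Fit" passage still knows it is reading a hiking boot guide, not sizing advice for something else.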
Step 3: Strategic Content Distribution
Rather than just publishing on company blogs, I focused on getting content into sources that were more likely to be included in training datasets or real-time retrieval. This included guest posts on high-authority publications, contributions to industry resources, and a strategic presence in commonly referenced databases.
The Five Key Optimizations
Through testing, I identified five specific optimizations that consistently improved AI mention rates:
Chunk-level retrieval optimization: Making each section self-contained with clear context, so AI systems could extract valuable information even without the full article.
Answer synthesis readiness: Structuring content with logical hierarchies that AI systems could easily parse and recombine with other sources.
Citation-worthiness: Ensuring factual accuracy and clear attribution, since AI systems seem to favor sources that other systems and humans trust.
Topical breadth and depth: Covering topics comprehensively rather than just targeting specific keywords, since AI systems pull from sources that demonstrate subject matter expertise.
Multi-modal content integration: Including data visualizations, tables, and structured information that AI systems could extract and reference more easily than pure text.
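To illustrate the last optimization, here is a small sketch of turning product attributes into a markdown table, since structured rows are easier for a text-processing system to extract than the same facts buried in prose. The product data is invented for the example.

```python
def to_markdown_table(rows, columns):
    """Render a list of record dicts as a markdown table, a format
    both humans and text-processing pipelines parse reliably."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row[c]) for c in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider] + body)

products = [
    {"Model": "Trail X", "Weight (g)": 380, "Waterproof": "Yes"},
    {"Model": "City Lite", "Weight (g)": 290, "Waterproof": "No"},
]
table = to_markdown_table(products, ["Model", "Weight (g)", "Waterproof"])
print(table)
```

The same facts written as a paragraph force a reader (human or machine) to reassemble the comparison; the table hands it over ready-made.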
The breakthrough insight was realizing that AI mentions weren't just about content quality—they were about content architecture and distribution strategy aligned with how these systems actually work.
The results from this approach were surprisingly measurable, even though we were essentially pioneering new territory. Across the clients where I implemented this framework, I tracked a consistent pattern:
Mention frequency increased significantly. For my e-commerce client, we went from essentially zero AI mentions to being referenced in responses about industry best practices and product recommendations. The mentions weren't massive in volume, but they were consistent and relevant.
Quality of mentions improved. Instead of generic brand mentions, AI systems started referencing specific methodologies and frameworks from our content. This suggested the content was being valued for its substance, not just its SEO optimization.
Cross-platform consistency emerged. When content started getting mentioned by one AI system, it typically began appearing in others as well. This indicated we were successfully getting into shared data sources or reference patterns.
Timeline considerations. The most important finding was about timing. Content published during active training periods had much higher mention rates than content published afterward. This suggested a first-mover advantage that won't last indefinitely.
But here's the reality check: these results came from focusing on solid content fundamentals first, then adapting for AI consumption. The couple dozen LLM mentions we achieved weren't from aggressive "AI optimization" tactics—they came from comprehensive, authoritative content that naturally aligned with how these systems process information.
Learnings
Sharing my mistakes so you don't repeat them.
After months of experimentation across different industries and client types, here are the key lessons that will save you time and help you avoid common pitfalls:
Lesson 1: Traditional SEO metrics don't predict AI mentions. High domain authority and search rankings don't guarantee AI references. I've seen sites with mediocre SEO get consistent AI mentions while perfectly optimized sites get ignored.
Lesson 2: Content structure matters more than content volume. A single well-structured, comprehensive piece often generates more AI mentions than dozens of blog posts optimized for traditional SEO.
Lesson 3: Real-time information has different rules. For current events or rapidly changing information, AI systems rely heavily on real-time retrieval, which follows different patterns than training data inclusion.
Lesson 4: The opportunity window is closing. As more businesses understand this space, the advantage of being early will diminish. The time to experiment is now, while most competitors are still focused on traditional SEO.
Lesson 5: Distribution partnerships matter more than individual content pieces. Getting your content referenced by authoritative sources that AI systems trust is more valuable than optimizing individual pages.
Lesson 6: Don't abandon traditional SEO. AI optimization should layer on top of solid SEO fundamentals, not replace them. The businesses seeing the best results combine both approaches strategically.
Lesson 7: Test early and test often. The landscape changes rapidly as AI systems evolve. What works today might not work in six months, so continuous testing and adaptation are essential.
My playbook, condensed for your use case.
For SaaS startups looking to leverage AI mentions:
Focus on comprehensive use case documentation that AI systems can reference for recommendations
Create detailed integration guides that become go-to resources for AI-generated advice
Establish thought leadership in industry publications likely to be included in training datasets
Document unique methodologies and frameworks that differentiate your approach
For e-commerce stores wanting to increase AI mentions:
Create comprehensive product education content that AI can reference for recommendations
Develop authoritative guides for your product categories or industry
Focus on review platforms and directories commonly referenced by AI systems
Build expertise content around product selection and usage guidance
What I've learned