Website Intel

Company Research

Scrape any website and extract structured data using a custom schema.

Tags: crawler, scraping, website, structured-data

Overview

Website Intel is an MCP server that scrapes any public website and returns structured JSON data based on a schema you define. It handles JavaScript-heavy SPAs, dynamic content, and can crawl multiple pages following links. Under the hood, it uses a headless browser with crawl4ai for intelligent data extraction — you describe what you want, and it pulls exactly that from any page. Whether you need pricing tables, team directories, product feature lists, or blog metadata, Website Intel transforms unstructured web pages into clean, typed data ready for your sales workflows.

Currently macOS only. Windows and Linux support coming soon.

Use Cases

Extract Pricing Page Data for Competitive Analysis

Your sales team needs to understand how a competitor structures their pricing tiers. Instead of manually copying pricing details from their website, you point Website Intel at their pricing page with a schema that defines fields like tier name, price, features included, and limits. The MCP renders the JavaScript-heavy pricing page, extracts every tier into structured JSON, and returns it in seconds.

Expected outcome: A clean JSON object with each pricing tier, its monthly and annual cost, feature list, and usage limits — ready to paste into a competitive battle card or feed into a comparison spreadsheet.
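A request for this use case might look like the following sketch. The argument names (url, schema, prompt, mode) come from the Parameters table below; the schema field names (tiers, price_monthly, and so on) and the URL are illustrative, not prescribed by the tool.

```python
# JSON Schema describing the pricing data we want back.
# Field names here are illustrative; define whatever your battle card needs.
pricing_schema = {
    "type": "object",
    "properties": {
        "tiers": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price_monthly": {"type": "string"},
                    "price_annual": {"type": "string"},
                    "features": {"type": "array", "items": {"type": "string"}},
                    "limits": {"type": "string"},
                },
            },
        }
    },
}

# Tool arguments as the Parameters table defines them.
request = {
    "url": "https://example.com/pricing",
    "schema": pricing_schema,
    "prompt": "Extract every pricing tier with its name, monthly and "
              "annual price, included features, and usage limits.",
    "mode": "scrape",  # single page, full JS rendering
}
```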

Build a Prospect List from a Conference Speaker Page

A major industry conference publishes its speaker lineup on a dynamic webpage. You want to build a prospect list from the speakers — names, titles, companies. You define a schema with those fields and set Website Intel to crawl the speaker directory, following pagination links across multiple pages.

Expected outcome: A structured list of every speaker with their name, job title, and company — typically 50 to 200 contacts from a single conference page, ready for outreach sequencing.
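In crawl mode the same request shape applies, with mode and limit set so the crawler follows pagination links. A hedged sketch, with an illustrative URL and schema:

```python
# Schema for the speaker directory; field names are illustrative.
speaker_schema = {
    "type": "object",
    "properties": {
        "speakers": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "title": {"type": "string"},
                    "company": {"type": "string"},
                },
            },
        }
    },
}

request = {
    "url": "https://example.com/conference/speakers",
    "schema": speaker_schema,
    "prompt": "Extract each speaker's name, job title, and company.",
    "mode": "crawl",  # follow links to additional pages
    "limit": 10,      # maximum pages allowed per request
}
```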

Scrape Product Feature Lists for Qualification

Before reaching out to a prospect, you need to understand what their product actually does. You point Website Intel at their product or features page and ask it to extract the feature categories, individual features, and any integration mentions. This tells you whether they are a fit for your solution.

Expected outcome: A categorized list of the prospect's product features and integrations, enabling your rep to write a personalized first email that references specific capabilities.

Capabilities

  • Scrapes any public URL including JavaScript-rendered single-page applications
  • Crawls multiple pages by following links with configurable page limits (1–10 pages)
  • Accepts user-defined JSON schemas to extract exactly the data you need
  • Renders dynamic content using a headless browser (Playwright under the hood)
  • Uses LLM-powered extraction to intelligently map page content to your schema
  • Handles pagination, tabs, accordions, and other interactive UI elements
  • Returns clean, typed JSON output ready for downstream processing

Data Sources

  • Any Website: Scrapes and extracts structured data from any public URL

Tools

Scrapes a webpage or crawls multiple pages and extracts structured data as JSON using a custom schema. Supports single-page scraping with full JS rendering and multi-page crawling that follows links.

Parameters

  • url (string, required): Full URL to process (must include the http/https protocol)
  • schema (object, required): JSON Schema defining the desired output data structure
  • prompt (string, required): Natural-language extraction instructions describing what data to pull
  • mode (string, optional): "scrape" for a single page with JS rendering, or "crawl" for multi-page link following. Default: "scrape"
  • limit (integer, optional): Maximum pages to crawl in crawl mode. Range: 1–10. Default: 5
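The constraints above can be checked client-side before calling the tool. The helper below is a hypothetical sketch, not part of Website Intel itself; it simply encodes the rules stated in the Parameters table:

```python
def validate_args(args: dict) -> list[str]:
    """Check tool arguments against the documented parameter constraints."""
    errors = []
    for field in ("url", "schema", "prompt"):
        if field not in args:
            errors.append(f"{field} is required")
    if not str(args.get("url", "")).startswith(("http://", "https://")):
        errors.append("url must include the http/https protocol")
    if args.get("mode", "scrape") not in ("scrape", "crawl"):
        errors.append('mode must be "scrape" or "crawl"')
    if not 1 <= args.get("limit", 5) <= 10:
        errors.append("limit must be between 1 and 10")
    return errors
```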

Response Fields

  • JSON structured according to the user-provided schema

Dependencies

  • macOS (Windows and Linux support coming soon)
  • Python 3.10+
  • Node.js 20+
  • LLM API key (OpenAI, Anthropic, or compatible provider)
  • crawl4ai (installed automatically during setup)

Works With

Run Website Intel first to extract company details from a prospect's website (pricing pages, team pages, product features), then chain Techstack Intel to detect their tools and Social Intel to research key contacts on LinkedIn.

Quick Setup

git clone https://github.com/ekas-io/open-sales-stack.git
cd open-sales-stack
./scripts/setup.sh
./scripts/add-to-claude.sh --website-intel
See full setup instructions on GitHub →

Frequently Asked Questions

Does Website Intel work on websites that require JavaScript to render?
Yes. Website Intel uses a headless browser (Playwright via crawl4ai) to fully render pages before extraction. This means single-page applications built with React, Vue, Angular, or any other JavaScript framework are fully supported. The page is rendered just like a real browser would see it.
What happens if the website blocks scraping or requires login?
Website Intel works with publicly accessible pages. If a website requires authentication or blocks automated access, the extraction will fail gracefully. It does not support logging into websites or bypassing access controls. For gated content, you would need to provide an alternative data source.
How accurate is the data extraction compared to manual copy-paste?
The LLM-powered extraction is highly accurate for structured content like pricing tables, team directories, feature lists, and product catalogs. For less structured content like blog posts or marketing copy, accuracy depends on how specific your schema and extraction prompt are. More specific prompts yield better results.
Can I crawl an entire website?
You can crawl up to 10 pages per request in crawl mode. The crawler follows links from the starting page and extracts data from each page it visits. For larger sites, you can run multiple crawl requests targeting different sections (e.g., /pricing, /team, /blog).
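Splitting a larger site into one crawl request per section can be sketched as follows; the base URL, section paths, and placeholder schema are illustrative:

```python
# One crawl request per site section, each within the 10-page limit.
base = "https://example.com"
sections = ["/pricing", "/team", "/blog"]

requests_batch = [
    {
        "url": base + path,
        "schema": {"type": "object"},  # use a section-specific schema in practice
        "prompt": f"Extract structured data from the {path} section.",
        "mode": "crawl",
        "limit": 10,
    }
    for path in sections
]
```

Each request's results can then be merged downstream, since every response is JSON shaped by the schema you supplied.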
Do I need to pay for an API key to use Website Intel?
Yes. Website Intel requires an LLM API key (OpenAI, Anthropic, or compatible provider) for the intelligent extraction step. The scraping and crawling itself is free — you only pay for the LLM calls used to interpret page content against your schema.
What is the difference between scrape mode and crawl mode?
Scrape mode processes a single URL — it loads the page, renders JavaScript, and extracts data. Crawl mode starts at a URL and follows links to discover additional pages, extracting data from each one. Use scrape for a single page like a pricing table, and crawl for multi-page content like a blog archive or team directory.

Need help with this MCP?

This MCP is open source. Need help integrating it into your sales stack, or want us to build something custom?

Book a Call →