URL crawler
Author: g | 2025-04-25
A web crawler gathers URLs, retrieves the pages behind them, and analyzes what it finds: page content, hyperlinks, and meta tags. The notes below collect documentation and commentary on several crawlers, starting with the spatie/crawler PHP package.
🕸 Crawl the web using PHP 🕷

This package provides a class to crawl links on a website. Under the hood, Guzzle promises are used to crawl multiple URLs concurrently. Because the crawler can execute JavaScript, it can crawl JavaScript-rendered sites. Under the hood, Chrome and Puppeteer are used to power this feature.

Support us

We invest a lot of resources into creating best-in-class open source packages. You can support us by buying one of our paid products. We highly appreciate you sending us a postcard from your hometown, mentioning which of our packages you are using. You'll find our address on our contact page. We publish all received postcards on our virtual postcard wall.

Installation

This package can be installed via Composer:

```bash
composer require spatie/crawler
```

Usage

The crawler can be instantiated like this:

```php
use Spatie\Crawler\Crawler;

Crawler::create()
    ->setCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);
```

The argument passed to setCrawlObserver must be an object that extends the \Spatie\Crawler\CrawlObservers\CrawlObserver abstract class:

```php
namespace Spatie\Crawler\CrawlObservers;

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;

abstract class CrawlObserver
{
    /*
     * Called when the crawler will crawl the url.
     */
    public function willCrawl(UriInterface $url, ?string $linkText): void
    {
    }

    /*
     * Called when the crawler has crawled the given url successfully.
     */
    abstract public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText,
    ): void;

    /*
     * Called when the crawler had a problem crawling the given url.
     */
    abstract public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /**
     * Called when the crawl has ended.
     */
    public function finishedCrawling(): void
    {
    }
}
```

Using multiple observers

You can set multiple observers with setCrawlObservers:

```php
Crawler::create()
    ->setCrawlObservers([
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        ...
    ])
    ->startCrawling($url);
```

Alternatively, you can set multiple observers one by one with addCrawlObserver:

```php
Crawler::create()
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);
```
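As a concrete illustration, a minimal observer might simply log each result. This is a sketch, not part of the package documentation: the class name and echo-based logging are ours, but the method signatures follow the abstract class above.

```php
use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

// Hypothetical example observer: writes one line per crawled URL.
class LoggingCrawlObserver extends CrawlObserver
{
    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        echo "[{$response->getStatusCode()}] {$url}" . PHP_EOL;
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        echo "Failed: {$url} ({$requestException->getMessage()})" . PHP_EOL;
    }
}
```

You would then pass `new LoggingCrawlObserver()` to setCrawlObserver() or addCrawlObserver().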
Executing JavaScript

By default, the crawler will not execute JavaScript. This is how you can enable the execution of JavaScript:

```php
Crawler::create()
    ->executeJavaScript()
    ...
```

In order to make it possible to get the body HTML after the JavaScript has been executed, this package depends on our Browsershot package. Browsershot uses Puppeteer under the hood; its documentation has pointers on how to install Puppeteer on your system. Browsershot will make an educated guess as to where its dependencies are installed. By default, the crawler will instantiate a new Browsershot instance. You may find the need to set a custom created instance using the setBrowsershot(Browsershot $browsershot) method:

```php
Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript()
    ...
```

Note that the crawler will still work even if you don't have the system dependencies required by Browsershot. These system dependencies are only required if you're calling executeJavaScript().

Filtering certain urls

You can tell the crawler not to visit certain URLs by using the setCrawlProfile function. That function expects an object that extends Spatie\Crawler\CrawlProfiles\CrawlProfile:

```php
/*
 * Determine if the given url should be crawled.
 */
public function shouldCrawl(UriInterface $url): bool;
```

This package comes with three CrawlProfiles out of the box:

- CrawlAllUrls: this profile will crawl all URLs on all pages, including URLs to an external site.
- CrawlInternalUrls: this profile will only crawl the internal URLs on the pages of a host.
- CrawlSubdomains: this profile will only crawl the internal URLs and its subdomains on the pages of a host.
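If the built-in profiles don't fit, you can write your own. A minimal sketch, with a hypothetical class name and path rule that are not from the package docs:

```php
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

// Hypothetical profile: only crawl URLs under /docs on a single host.
class OnlyDocsProfile extends CrawlProfile
{
    public function __construct(private string $host)
    {
    }

    public function shouldCrawl(UriInterface $url): bool
    {
        return $url->getHost() === $this->host
            && str_starts_with($url->getPath(), '/docs');
    }
}
```

An instance of such a class is passed to setCrawlProfile(), just like the built-in profiles.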
Custom link extraction

You can customize how links are extracted from a page by passing a custom UrlParser to the crawler:

```php
Crawler::create()
    ->setUrlParserClass(<class that implements \Spatie\Crawler\UrlParsers\UrlParser>::class)
    ...
```

By default, the LinkUrlParser is used. This parser will extract all links from the href attribute of a tags. There is also a built-in SitemapUrlParser that will extract and crawl all links from a sitemap. It does support sitemap index files.

```php
Crawler::create()
    ->setUrlParserClass(SitemapUrlParser::class)
    ...
```

Ignoring robots.txt and robots meta

By default, the crawler will respect robots data. It is possible to disable these checks like so:

```php
Crawler::create()
    ->ignoreRobots()
    ...
```

Robots data can come from either a robots.txt file, meta tags, or response headers. More information on the spec can be found in the robots.txt specification. Parsing robots data is done by our package spatie/robots-txt.

Accept links with rel="nofollow" attribute

By default, the crawler will reject all links containing the attribute rel="nofollow". It is possible to disable these checks like so:

```php
Crawler::create()
    ->acceptNofollowLinks()
    ...
```

Using a custom User Agent

In order to respect robots.txt rules for a custom User Agent, you can specify your own custom User Agent:

```php
Crawler::create()
    ->setUserAgent('my-agent')
```

You can add your specific crawl rule group for 'my-agent' in robots.txt. This example disallows crawling the entire site for crawlers identified by 'my-agent':

```
// Disallow crawling for my-agent
User-agent: my-agent
Disallow: /
```

Setting the number of concurrent requests

To improve the speed of the crawl, the package concurrently crawls 10 URLs by default. If you want to change that number, you can use the setConcurrency method:

```php
Crawler::create()
    ->setConcurrency(1) // now all urls will be crawled one by one
```
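Putting a few of these options together, here is a sketch of a politely configured crawl built only from the methods shown above; the agent name, concurrency value, and target URL are placeholders, and LoggingCrawlObserver is the hypothetical observer sketched earlier.

```php
use Spatie\Crawler\Crawler;

Crawler::create()
    ->setUserAgent('my-agent')      // placeholder agent name; match your robots.txt group
    ->setConcurrency(2)             // keep concurrency low to be gentle on the target site
    ->acceptNofollowLinks()         // optional: also follow rel="nofollow" links
    ->setCrawlObserver(new LoggingCrawlObserver())
    ->startCrawling('https://example.com');
```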
Defining Crawl and Time Limits

By default, the crawler continues until it has crawled every page it can find. This behavior might cause issues if you are working in an environment with limitations, such as a serverless environment. The crawl behavior can be controlled with the following options:

- Total Crawl Limit (setTotalCrawlLimit): this limit defines the maximal count of URLs to crawl.
- Current Crawl Limit (setCurrentCrawlLimit): this defines how many URLs are processed during the current crawl.
- Total Execution Time Limit (setTotalExecutionTimeLimit): this limit defines the maximal execution time of the crawl.
- Current Execution Time Limit (setCurrentExecutionTimeLimit): this limits the execution time of the current crawl.

Let's take a look at some examples to clarify the difference between setTotalCrawlLimit and setCurrentCrawlLimit. The difference between setTotalExecutionTimeLimit and setCurrentExecutionTimeLimit is the same.

Example 1: Using the total crawl limit

The setTotalCrawlLimit method allows you to limit the total number of URLs to crawl, no matter how often you call the crawler.

```php
$queue = /* your crawl queue */;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);
```

Example 2: Using the current crawl limit

The setCurrentCrawlLimit method sets a limit on how many URLs will be crawled per execution. This piece of code will process 5 pages with each execution, without a total limit of pages to crawl.

```php
$queue = /* your crawl queue */;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);
```
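Before moving on to combining the crawl limits, note that the execution-time limits listed above follow the same total/current split. A sketch of how they might be used together; we are assuming the limits are expressed in seconds, so check the package documentation for the exact unit:

```php
// Assumption: execution-time limits are given in seconds.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalExecutionTimeLimit(300)   // stop for good after roughly 5 minutes of crawling in total
    ->setCurrentExecutionTimeLimit(30)  // each invocation works for at most roughly 30 seconds
    ->startCrawling($url);
```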
Crawl web content

Use the Norconex open-source enterprise web crawler to collect web site content for your search engine or any other data repository. Run it on its own, or embed it in your own application. It works on any operating system, is fully documented, and is packaged with sample crawl configurations that run out of the box to get you started quickly.

Features

There are multiple reasons for using Norconex Web Crawler. The following is a partial list of features:

- Multi-threaded.
- Supports full and incremental crawls.
- Supports different hit intervals according to different schedules.
- Can crawl millions of pages on a single server of average capacity.
- Extracts text out of many file formats (HTML, PDF, Word, etc.).
- Extracts metadata associated with documents.
- Supports pages rendered with JavaScript.
- Supports deduplication of crawled documents.
- Language detection.
- Many content and metadata manipulation options.
- OCR support on images and PDFs.
- Page screenshots.
- Extracts the page "featured" image.
- Translation support.
- Dynamic title generation.
- Configurable crawling speed.
- URL normalization.
- Detects modified and deleted documents.
- Supports different frequencies for re-crawling certain pages.
- Supports various web site authentication schemes.
- Supports sitemap.xml (including "lastmod" and "changefreq").
- Supports robot rules.
- Supports canonical URLs.
- Can filter documents based on URL, HTTP headers, content, or metadata.
- Can treat embedded documents as distinct documents.
- Can split a document into multiple documents.
- Can store crawled URLs in different database engines.
- Can re-process or delete URLs no longer linked by other crawled pages.
- Supports different URL extraction strategies for different content types.
- Fires many crawler event types for custom event listeners.
- Date parsers/formatters.
Example 3: Combining the total and crawl limit

Both limits can be combined to control the crawler:

```php
$queue = /* your crawl queue */;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);
```

Example 4: Crawling across requests

You can use setCurrentCrawlLimit to break up long-running crawls. The following example demonstrates a (simplified) approach. It's made up of an initial request and any number of follow-up requests continuing the crawl.

Initial Request

To start crawling across different requests, you will need to create a new queue of your selected queue driver. Start by passing the queue instance to the crawler. The crawler will start filling the queue as pages are processed and new URLs are discovered. Serialize and store the queue reference after the crawler has finished (using the current crawl limit).

```php
// Create a queue using your queue driver.
$queue = /* your crawl queue */;

// Crawl the first set of URLs.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue.
$serializedQueue = serialize($queue);
```

Subsequent Requests

For any following requests you will need to unserialize your original queue and pass it to the crawler:

```php
// Unserialize the queue.
$queue = unserialize($serializedQueue);

// Crawls the next set of URLs.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue again.
$serializedQueue = serialize($queue);
```

The behavior is based on the information in the queue. The limits only work as described if the same queue instance is passed in. When a completely new queue is passed in, the limits of previous crawls won't apply, even for the same website. An example with more details can be found in the package repository.
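Where the serialized queue is stored between requests is left open by the README. A minimal sketch using a local file; the path and the file-based approach are ours, purely illustrative:

```php
// Hypothetical persistence around the serialize()/unserialize() calls above.
$queueFile = __DIR__ . '/storage/crawl-queue.bin'; // illustrative path

// After a crawl run: persist the queue.
file_put_contents($queueFile, serialize($queue));

// Before the next run: restore the previously stored queue if it exists.
if (file_exists($queueFile)) {
    $queue = unserialize(file_get_contents($queueFile));
}
```

Any other storage (a database row, a cache entry) works the same way, as long as the exact same queue instance is restored and passed back to the crawler.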
Setting the maximum crawl depth

By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the depth of the crawler, you can use the setMaximumDepth method:

```php
Crawler::create()
    ->setMaximumDepth(2)
```

Setting the maximum response size

Most HTML pages are quite small, but the crawler could accidentally pick up large files such as PDFs and MP3s. To keep memory usage low in such cases, the crawler will only use responses that are smaller than 2 MB. If, when streaming a response, it becomes larger than 2 MB, the crawler will stop streaming the response and an empty response body will be assumed. You can change the maximum response size:

```php
// let's use a 3 MB maximum.
Crawler::create()
    ->setMaximumResponseSize(1024 * 1024 * 3)
```

Add a delay between requests

In some cases you might get rate-limited when crawling too aggressively. To circumvent this, you can use the setDelayBetweenRequests() method to add a pause between every request. This value is expressed in milliseconds:

```php
Crawler::create()
    ->setDelayBetweenRequests(150) // After every page crawled, the crawler will wait for 150ms
```
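Since the delay is given in milliseconds, you can derive it from a target request rate. A small sketch, where the target rate is an arbitrary example and the other values are placeholders; note that the configured concurrency also affects the effective request rate:

```php
// Aim for at most 2 requests per second, i.e. 500 ms between requests.
$maxRequestsPerSecond = 2;
$delayInMs = (int) (1000 / $maxRequestsPerSecond);

Crawler::create()
    ->setMaximumDepth(2)
    ->setMaximumResponseSize(1024 * 1024 * 3) // 3 MB
    ->setDelayBetweenRequests($delayInMs)
    ->startCrawling($url);
```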
…supported by this tool. Unsupported types may be present and valid on the page, and may appear in Search results, but they will not appear in this tool.

Additional response data: to see additional response data such as the rendered raw HTML, HTTP headers, JavaScript console output, and all loaded page resources, click View crawled page. Additional response information is only available for URLs with the status "URL is on Google" or "URL is on Google, but has issues". The crawler used to generate the data depends on where you open the side panel from: when opened from the top level of the report, the HTTPS sub-report, or any structured data sub-report under Enhancements & Experience, the crawler type is shown under Page availability > Crawled > Crawled as; when opened from the AMP sub-report, the crawler type is Googlebot smartphone. A screenshot of the rendered page is only available in the live test.

Live URL test: run a live test for a URL on your property to check for indexing issues, structured data, and more. The live test is useful when you are fixing a page, to verify whether the problem has been resolved. To run a live test for potential indexing errors: inspect the URL (note: it does not matter if the page has not been indexed yet, or failed to be indexed, but the page must be reachable from the internet without requiring a login), then click Test live URL and read "Understanding the live test results" to interpret the report. You can switch between the live test result and the indexed result by clicking Google Index or Live Test on the page. To re-run the live test, click the run test again button on the test page.
Given

A page linking to a tel: URI:

```html
<html lang="en">
<head>
    <title>Norconex test</title>
</head>
<body>
    <a href="tel:123">Phone Number</a>
</body>
</html>
```

And the following config:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="test-collector">
  <crawlers>
    <crawler id="test-crawler">
      <startURLs>
        <url></url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>
```

Expected

The collector should not follow this link, or that of any other scheme it can't actually process.

Actual

The collector tries to follow the tel: link.

```
INFO [AbstractCollectorConfig] Configuration loaded: id=test-collector; logsDir=./logs; progressDir=./progress
INFO [JobSuite] JEF work directory is: ./progress
INFO [JobSuite] JEF log manager is : FileLogManager
INFO [JobSuite] JEF job status store is : FileJobStatusStore
INFO [AbstractCollector] Suite of 1 crawler jobs created.
INFO [JobSuite] Initialization...
INFO [JobSuite] No previous execution detected.
INFO [JobSuite] Starting execution.
INFO [AbstractCollector] Version: Norconex HTTP Collector 2.4.0-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Collector Core 1.4.0-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Importer 2.5.0-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex JEF 4.0.7 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Committer Core 2.0.3 (Norconex Inc.)
INFO [JobSuite] Running test-crawler: BEGIN (Fri Jan 08 16:21:17 CET 2016)
INFO [MapDBCrawlDataStore] Initializing reference store ./work/crawlstore/mapdb/test-crawler/
INFO [MapDBCrawlDataStore] ./work/crawlstore/mapdb/test-crawler/: Done initializing databases.
INFO [HttpCrawler] test-crawler: RobotsTxt support: true
INFO [HttpCrawler] test-crawler: RobotsMeta support: true
INFO [HttpCrawler] test-crawler: Sitemap support: true
INFO [HttpCrawler] test-crawler: Canonical links support: true
INFO [HttpCrawler] test-crawler: User-Agent:
INFO [SitemapStore] test-crawler: Initializing sitemap store...
INFO [SitemapStore] test-crawler: Done initializing sitemap store.
INFO [HttpCrawler] 1 start URLs identified.
INFO [CrawlerEventManager] CRAWLER_STARTED
INFO [AbstractCrawler] test-crawler: Crawling references...
INFO [CrawlerEventManager] DOCUMENT_FETCHED:
INFO [CrawlerEventManager] CREATED_ROBOTS_META:
INFO [CrawlerEventManager] URLS_EXTRACTED:
INFO [CrawlerEventManager] DOCUMENT_IMPORTED:
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD:
INFO [CrawlerEventManager] REJECTED_NOTFOUND:
INFO [AbstractCrawler] test-crawler: Re-processing orphan references (if any)...
INFO [AbstractCrawler] test-crawler: Reprocessed 0 orphan references...
INFO [AbstractCrawler] test-crawler: 2 reference(s) processed.
INFO [CrawlerEventManager] CRAWLER_FINISHED
INFO [AbstractCrawler] test-crawler: Crawler completed.
INFO [AbstractCrawler] test-crawler: Crawler executed in 6 seconds.
INFO [MapDBCrawlDataStore] Closing reference store: ./work/crawlstore/mapdb/test-crawler/
INFO [JobSuite] Running test-crawler: END (Fri Jan 08 16:21:17 CET 2016)
```
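Norconex itself is configured in Java and XML, but the underlying check the report asks for is simple: only enqueue links whose scheme the fetcher can actually process. A language-agnostic illustration of that idea, sketched here in PHP (this is not Norconex code):

```php
// Illustrative scheme allowlist check, not part of Norconex.
function shouldEnqueue(string $link): bool
{
    $scheme = parse_url($link, PHP_URL_SCHEME);

    // Follow only schemes the fetcher can actually process.
    return in_array(strtolower((string) $scheme), ['http', 'https'], true);
}

var_dump(shouldEnqueue('https://example.com/contact')); // bool(true)
var_dump(shouldEnqueue('tel:123'));                     // bool(false)
```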
Spider is the fastest and most affordable crawler and scraper that returns LLM-ready data. The spider.cloud landing page describes it as "The World's Fastest and Cheapest Crawler API", with proxy rotation, agent headers, anti-bot avoidance, headless Chrome, and Markdown LLM responses, powered by spider-rs with full concurrency ("do 20,000 pages in seconds"). Loading that page through the Spider loader returns a Document whose metadata looks like this:

```
metadata={'description': 'Collect data rapidly from any website. Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 33743, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler built for AI Agents and Large Language Models', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}
```

The params parameter is a dictionary that can be passed to the loader; see the Spider documentation for all available parameters.

How do web crawlers work? Web crawlers typically start with a set of known URLs, called seed URLs. The crawler first visits the web pages at these URLs; during the visit, it extracts the hyperlinks found on each page and adds them to the list of URLs to visit next.
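To make the seed-URL loop concrete, here is a minimal, generic crawl loop sketched in PHP. It is illustrative only (no politeness delays, robots.txt handling, or proper URL resolution), and example.com is a placeholder seed:

```php
// Minimal breadth-first crawl sketch: seed URLs -> fetch -> extract links -> enqueue.
$queue    = ['https://example.com/'];   // seed URLs (placeholder)
$visited  = [];
$maxPages = 10;

while ($queue && count($visited) < $maxPages) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue;
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);   // naive fetch; a real crawler would use an HTTP client
    if ($html === false) {
        continue;
    }

    // Extract absolute http(s) links from href attributes.
    if (preg_match_all('/href="(https?:\/\/[^"]+)"/i', $html, $matches)) {
        foreach ($matches[1] as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
}

print_r(array_keys($visited));
```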
Limiting which content-types to parse

By default, every found page will be downloaded (up to setMaximumResponseSize() in size) and parsed for additional links. You can limit which content types should be downloaded and parsed by calling setParseableMimeTypes() with an array of allowed types:

```php
Crawler::create()
    ->setParseableMimeTypes(['text/html', 'text/plain'])
```

This will prevent downloading the body of pages that have different mime types, like binary files, audio/video, etc., which are unlikely to have links embedded in them. This feature mostly saves bandwidth.

Using a custom crawl queue

When crawling a site, the crawler will put URLs to be crawled in a queue. By default, this queue is stored in memory using the built-in ArrayCrawlQueue. When a site is very large, you may want to store that queue elsewhere, maybe a database. In such cases, you can write your own crawl queue. A valid crawl queue is any class that implements the Spatie\Crawler\CrawlQueues\CrawlQueue interface. You can pass your custom crawl queue via the setCrawlQueue method on the crawler:

```php
Crawler::create()
    ->setCrawlQueue(<implementation of \Spatie\Crawler\CrawlQueues\CrawlQueue>)
```

Existing implementations include:

- ArrayCrawlQueue
- RedisCrawlQueue (third-party package)
- CacheCrawlQueue for Laravel (third-party package)
- Laravel Model as Queue (third-party example app)

Change the default base url scheme

By default, the crawler will set the base URL scheme to http if none is given. You have the ability to change that with setDefaultScheme:

```php
Crawler::create()
    ->setDefaultScheme('https')
```

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Testing

First, install the Puppeteer dependency, or your tests will fail. To run the tests you'll have to start the included node-based server in a separate terminal window:

```bash
cd tests/server
npm install
node server.js
```

With the server running, you can start testing.

Security

If you've found a bug regarding security, please mail security@spatie.be instead of using the issue tracker.

Postcardware

You're free to use this package, but if it makes it to your production environment we highly appreciate you sending us a postcard from your hometown, mentioning which of our packages you are using. Our address is: Spatie, Kruikstraat 22, 2018 Antwerp, Belgium. We publish all received postcards on our company website.

Credits

Freek Van der Herten and all contributors.

License

The MIT License (MIT). Please see the License File for more information.
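To close out the spatie/crawler notes, here is a short sketch that passes the built-in in-memory queue explicitly and combines it with the content-type filter described above. The ArrayCrawlQueue class name comes from the README; its exact namespace is assumed here and worth verifying against the package source.

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue; // assumed namespace

Crawler::create()
    ->setCrawlQueue(new ArrayCrawlQueue())               // explicit in-memory queue (the default)
    ->setParseableMimeTypes(['text/html', 'text/plain']) // skip bodies unlikely to contain links
    ->startCrawling('https://example.com');              // placeholder URL
```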
Frequently Asked Questions

How does Moz link data compare to other indexes like Ahrefs and Majestic?

Every index uses its own crawler to gather data and will build up a slightly different picture of the web based on the links indexed. Many SEOs use a combination of different indexes. You can read more about comparing the big link indexes and tool features on the Backlinko blog.

How often does Moz's Link Index update?

The index that powers Link Explorer is constantly updating to provide fresh link data. This does not mean that DA and PA will change with every data update; they will only change if we find new links to a respective site. Read more about how we index the web.

What's covered? Moz's Link Index crawler

Our link index data is gathered by crawling and indexing links, just like Googlebot does to populate Google's search results. This data allows us to understand how Google rankings work and calculate metrics like Page Authority and Domain Authority. Our web crawler, Dotbot, is built on a machine-learning-based model that is optimized to select pages like those that appear in our collection of Google SERPs. We feed the machine learning model with features of the URL, like the backlink counts for the URL and the PLD (pay-level domain), features about the URL, like its length and how many subdirectories it has, and features on the quality of the domains linking to the URL and PLD. So the results are not based on any one particular metric; we're training the crawler to start with high-value links.

How often does the Moz Link Index update?

The index that powers Link Explorer is constantly updating to provide fresh link data. This includes updating the data which powers each section of Link Explorer, including Linking Domains, Discovered and Lost, and Inbound Links. When discovered or lost links are found, we'll update our database to reflect those changes in your scores and link counts. We prioritize the links we crawl based on a machine learning algorithm to mimic Google's index. This does not mean that DA and PA will change with every data update; they will only change if we find new links to a respective site.

How old is Moz Link Index data?

Links which are newly discovered by our crawlers should be populated in Link Explorer and the Links section of your Campaign within about three days of discovery.
Web crawling is growing increasingly common due to its use in competitor price analysis, search engine optimization (SEO), competitive intelligence, and data mining.

Table of Contents
1. How Is a Crawler Detected?
2. Why Was Your Crawler Detected?
3. How To Avoid Web Crawler Detection

While web crawling has significant benefits for users, it can also significantly increase the load on websites, leading to bandwidth or server overloads. Because of this, many websites can now identify crawlers and block them. Techniques used in traditional computer security aren't used much for web scraping detection, because the problem is not related to malicious code execution like viruses or worms; it's all about the sheer number of requests a crawling bot sends. Therefore, websites have other mechanisms in place to detect crawler bots. This guide discusses why your crawler may have been detected and how to avoid detection during web scraping.

Web crawlers typically use the User-Agent header in an HTTP request to identify themselves to a web server. This header identifies the browser used to access a site. It can be any text but commonly includes the browser type and version number. It can also be more generic, such as "bot" or "page-downloader." Website administrators examine the web server log and check the User-Agent field to find out which crawlers have previously visited the website and how often. In some instances, the User-Agent field also contains a URL. Using this information, the website administrator can find out more about the crawling bot.

Because checking the web server log for each request is a tedious task, many site administrators use certain tools to track, verify, and identify web crawlers. Crawler traps are one such tool. These traps are web pages that trick a web crawler into crawling an infinite number of irrelevant URLs. If your web crawler stumbles upon such a page, it will either crash or need to be manually terminated. When your scraper gets stuck in one of these traps, the site administrator can then identify the trapped crawler through its User-Agent identifier.

Such tools are used by website administrators for several reasons. For one, if a crawler bot is sending too many requests to a website, it may overload the server. In this case, knowing the crawler's identity allows the website administrator to contact the owner and troubleshoot with them. Website administrators can also perform crawler detection by embedding JavaScript or PHP code in HTML pages to "tag" web crawlers. The code is executed in the browser when it renders the web pages. The main purpose of doing this is to identify the User-Agent of the web crawler and prevent it from accessing future pages on the website, or at least to limit its access as much as possible.
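Tying this back to the spatie/crawler options shown earlier: a crawler that wants to be identifiable rather than evasive can declare a descriptive User-Agent and slow itself down. A sketch using only methods documented above; the agent string, contact URL, and delay are placeholders:

```php
use Spatie\Crawler\Crawler;

// Identify the bot clearly so site administrators can see who is crawling
// and reach out instead of simply blocking it.
Crawler::create()
    ->setUserAgent('my-agent/1.0 (+https://example.com/bot-info)') // placeholder name and contact URL
    ->setDelayBetweenRequests(500)                                 // pause between requests to reduce server load
    ->startCrawling('https://example.com');
```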