ID W3AF:0E451999712B6E0A680E017A9C9A2ABB
Type w3af
Reporter andresriancho
Modified 2019-09-06T21:15:55


This plugin does a search in archive.org and parses the results. It then uses the results to find new URLs in the target site. This plugin is a time machine!
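The parsing step relies on a regular expression that pulls Wayback Machine snapshot links for the target domain out of the archive.org HTML. A minimal standalone sketch of that extraction (the HTML fragment and domain are made up for illustration):

```python
import re

# Same pattern shape the plugin uses; the target domain is interpolated in.
INTERESTING_URLS_RE = r'<a href="(http://web\.archive\.org/web/\d*?/https?://%s/.*?)"'

# Hypothetical fragment of an archive.org results page.
html = ('<a href="http://web.archive.org/web/20120101000000/'
        'http://example.com/about.html">snapshot</a>')

pattern = INTERESTING_URLS_RE % re.escape('example.com')
matches = re.findall(pattern, html)
print(matches[0])  # the full Wayback snapshot URL for the target domain
```

Only links whose embedded original URL belongs to the target domain match, so snapshots of unrelated sites on the results page are ignored.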

Plugin type: crawl



Name | Type | Default Value | Description | Help
max_depth | integer | 3 | Maximum recursion depth for spidering process | The plugin will spider the archive.org site related to the target site with the maximum depth specified in this parameter.
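The effect of max_depth can be illustrated with a small depth-limited spider over a made-up link graph: a dict stands in for HTTP fetches and a plain set for the plugin's bloom filter, all page names are hypothetical:

```python
# Made-up link graph: page 'a' links to 'b' and 'c', and so on.
PAGES = {
    'a': ['b', 'c'],
    'b': ['d'],
    'c': [],
    'd': ['e'],
}

def spider(urls, max_depth, seen=None):
    """Depth-limited spidering: stop recursing when max_depth runs out."""
    seen = set() if seen is None else seen   # stands in for the bloom filter
    found = []
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        links = PAGES.get(url, [])
        found.extend(links)
        if max_depth - 1 > 0:                # same depth check as the plugin
            found.extend(spider(links, max_depth - 1, seen))
    return sorted(set(found))

print(spider(['a'], 2))  # depth 2 reaches 'd' but not 'e'
```

Raising max_depth discovers deeper links at the cost of more requests to archive.org.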


For more information about this plugin and the associated tests, there's always the source code to understand exactly what's under the hood:
Plugin source code
Unittest source code


This plugin has no dependencies.


Copyright 2006 Andres Riancho

This file is part of w3af, http://w3af.org/ .

w3af is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation version 2 of the License.

w3af is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with w3af; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA

import re
from itertools import izip, repeat

import w3af.core.controllers.output_manager as om
from w3af.core.controllers.plugins.crawl_plugin import CrawlPlugin
from w3af.core.controllers.misc.is_private_site import is_private_site
from w3af.core.controllers.exceptions import RunOnce
from w3af.core.data.options.opt_factory import opt_factory
from w3af.core.data.options.option_list import OptionList
from w3af.core.data.parsers.doc.url import URL
from w3af.core.data.bloomfilter.scalable_bloom import ScalableBloomFilter
from w3af.core.controllers.core_helpers.fingerprint_404 import is_404
from w3af.core.data.request.fuzzable_request import FuzzableRequest

class archive_dot_org(CrawlPlugin):
    """
    Search archive.org to find new pages in the target site.

    :author: Andres Riancho
    :author: Darren Bilby, thanks for the good idea!
    """

    ARCHIVE_START_URL = 'http://web.archive.org/web/*/%s'
    INTERESTING_URLS_RE = '<a href="(http://web\.archive\.org/web/\d*?/https?://%s/.*?)"'
    NOT_IN_ARCHIVE = "<p>Wayback Machine doesn't have that page archived.</p>"

    def __init__(self):
        CrawlPlugin.__init__(self)
        # Internal variables
        self._already_crawled = ScalableBloomFilter()
        self._already_verified = ScalableBloomFilter()

        # User configured parameters
        self._max_depth = 3

    def crawl(self, fuzzable_request, debugging_id):
        """
        Does a search in archive.org and searches for links in the HTML. Then
        it searches for those URLs in the target site. This is a time machine!

        :param debugging_id: A unique identifier for this call to crawl()
        :param fuzzable_request: A fuzzable_request instance that contains
                                 (among other things) the URL to test.
        """
        domain = fuzzable_request.get_url().get_domain()

        if is_private_site(domain):
            msg = 'There is no point in searching archive.org for "%s"'\
                  ' because it is a private site that will never be indexed.'
            om.out.information(msg % domain)
            raise RunOnce(msg % domain)

        # Initial check to verify if the domain is in the archive
        start_url = self.ARCHIVE_START_URL % fuzzable_request.get_url()
        start_url = URL(start_url)
        http_response = self._uri_opener.GET(start_url, cache=True)

        if self.NOT_IN_ARCHIVE in http_response.body:
            msg = 'There is no point in searching archive.org for "%s"'
            msg += ' because they are not indexing this site.'
            om.out.information(msg % domain)
            raise RunOnce(msg % domain)

        references = self._spider_archive(
            [start_url, ], self._max_depth, domain)
        self._analyze_urls(references)

    def _analyze_urls(self, references):
        """
        Analyze which references are cached by archive.org.

        :return: A list of query string objects for the URLs that are in
                 the cache AND are in the target web site.
        """
        real_urls = []

        # Translate archive.org URLs to normal URLs
        for url in references:
            try:
                url = url.url_string[url.url_string.index('http', 1):]
            except ValueError:
                pass
            else:
                real_urls.append(URL(url))

        real_urls = list(set(real_urls))

        if len(real_urls):
            om.out.debug('Archive.org cached the following pages:')
            for u in real_urls:
                om.out.debug('- %s' % u)
        else:
            om.out.debug('Archive.org did not find any pages.')

        # Verify if they exist in the target site and add them to
        # the result if they do. Send the requests using threads:
        self.worker_pool.map(self._exists_in_target, real_urls)

    def _spider_archive(self, url_list, max_depth, domain):
        """
        Perform a classic web spidering process.

        :param url_list: The list of URL strings
        :param max_depth: The max link depth that we have to follow.
        :param domain: The domain name we are checking
        """
        # Start the recursive spidering
        res = []

        def spider_worker(url, max_depth, domain):
            if url in self._already_crawled:
                return []

            self._already_crawled.add(url)

            try:
                http_response = self._uri_opener.GET(url, cache=True)
            except Exception:
                return []

            # Filter the ones we need
            url_regex_str = self.INTERESTING_URLS_RE % domain
            matched_urls = re.findall(url_regex_str, http_response.body)
            new_urls = [URL(u) for u in matched_urls]
            new_urls = [u.remove_fragment() for u in new_urls]
            new_urls = set(new_urls)

            res.extend(new_urls)

            # Go recursive
            if max_depth - 1 > 0:
                if new_urls:
                    res.extend(self._spider_archive(new_urls,
                                                    max_depth - 1,
                                                    domain))
            else:
                msg = 'Some sections of the archive.org site were not analyzed'
                msg += ' because of the configured max_depth.'
                om.out.debug(msg)

            return new_urls

        args = izip(url_list, repeat(max_depth), repeat(domain))
        self.worker_pool.map_multi_args(spider_worker, args)

        return list(set(res))

    def _exists_in_target(self, url):
        """
        Check if a resource still exists in the target web site.

        :param url: The resource to verify.
        :return: None, the result is stored in self.output_queue
        """
        if url in self._already_verified:
            return

        self._already_verified.add(url)

        response = self._uri_opener.GET(url, cache=True)

        if not is_404(response):
            msg = 'The URL: "%s" was found at archive.org and is'\
                  ' STILL AVAILABLE in the target site.'
            om.out.debug(msg % url)

            fr = FuzzableRequest(response.get_uri())
            self.output_queue.put(fr)
        else:
            msg = 'The URL: "%s" was found at archive.org and was'\
                  ' DELETED from the target site.'
            om.out.debug(msg % url)

    def get_options(self):
        """
        :return: A list of option objects for this plugin.
        """
        ol = OptionList()

        d = 'Maximum recursion depth for spidering process'
        h = 'The plugin will spider the archive.org site related to the target'\
            ' site with the maximum depth specified in this parameter.'
        o = opt_factory('max_depth', self._max_depth, d, 'integer', help=h)
        ol.add(o)

        return ol

    def set_options(self, options_list):
        """
        This method sets all the options that are configured using the user
        interface generated by the framework using the result of get_options().

        :param options_list: A dictionary with the options for the plugin.
        :return: No value is returned.
        """
        self._max_depth = options_list['max_depth'].get_value()

    def get_long_desc(self):
        """
        :return: A DETAILED description of the plugin functions and features.
        """
        return """
        This plugin does a search in archive.org and parses the results. It
        then uses the results to find new URLs in the target site. This plugin
        is a time machine!
        """
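As a closing note, the Wayback-to-original URL translation performed in _analyze_urls can be sketched standalone: the snapshot URL embeds the original URL after the timestamp, so the original URL begins at the second occurrence of 'http' in the string (the snapshot URL below is made up):

```python
# A hypothetical Wayback snapshot URL; the original URL is embedded after
# the timestamp, so str.index('http', 1) finds its starting position.
snapshot = 'http://web.archive.org/web/20120101000000/http://example.com/a?x=1'
original = snapshot[snapshot.index('http', 1):]
print(original)  # http://example.com/a?x=1
```

The plugin then requests each translated URL against the live target and reports the ones that are not 404s.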