google_spider

2013-06-10T23:02:10
ID W3AF:5EF74878480454FAE4ECC4A3A04D7E86
Type w3af
Reporter andresriancho
Modified 2014-09-03T21:59:27

Description

This plugin finds new URLs using Google. It searches for "site:domain.com" and sends GET requests to all the URLs found in the results. One configurable parameter exists:

  • result_limit
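
The workflow described above can be sketched in a few lines of standalone Python 3. This is only an illustration under stated assumptions, not the plugin's implementation: `build_site_query`, `crawl_with_search`, and the `search` and `fetch` callables are hypothetical names introduced here, and no w3af API is used.

```python
from urllib.parse import urlparse


def build_site_query(url):
    """Build the "site:<host>" query for the host part of a URL."""
    host = urlparse(url).hostname
    if not host:
        raise ValueError('URL has no host: %r' % (url,))
    return 'site:' + host


def crawl_with_search(url, search, fetch, result_limit=300):
    """
    Hypothetical helper: ask a search backend for indexed URLs and fetch them.

    search(query, limit) must return an iterable of URL strings.
    fetch(url) performs the GET request and returns whatever it likes.
    """
    query = build_site_query(url)
    # Never process more than result_limit URLs, even if the backend
    # returns extra entries.
    found = list(search(query, result_limit))[:result_limit]
    return [fetch(found_url) for found_url in found]
```

Passing the search and fetch steps in as callables keeps the sketch independent of any particular search engine or HTTP client.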

Plugin type

Crawl

Options

Name | Type | Default Value | Description | Help
---|---|---|---|---
result_limit | integer | 300 | Fetch the first "result_limit" results from the Google search | No detailed help available
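
As a rough illustration of how an integer option such as `result_limit` might be held and updated, the sketch below uses a hypothetical `IntegerOption` class. It is a stand-in written for this example, not the w3af option class, and the rule that rejects negative values is an assumption rather than documented plugin behavior.

```python
class IntegerOption(object):
    """Hypothetical stand-in for a plugin option; not a w3af class."""

    def __init__(self, name, default, description):
        self.name = name
        self.description = description
        self._value = int(default)

    def set_value(self, raw):
        # int() raises ValueError for non-numeric input such as 'abc'
        value = int(raw)
        if value < 0:
            # Assumption for this sketch: a negative limit makes no sense
            raise ValueError('%s must be non-negative' % self.name)
        self._value = value

    def get_value(self):
        return self._value


result_limit = IntegerOption(
    'result_limit', 300,
    'Fetch the first "result_limit" results from the Google search')

# User-supplied values often arrive as strings, so they are converted here
result_limit.set_value('50')
```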

Source

For more information about this plugin and the associated tests, there's always the source code to understand exactly what's under the hood:
Plugin source code
Unittest source code

Dependencies

This plugin has no dependencies.

                                        
"""
google_spider.py

Copyright 2006 Andres Riancho

This file is part of w3af, http://w3af.org/ .

w3af is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation version 2 of the License.

w3af is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with w3af; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA

"""
from w3af.core.data.options.opt_factory import opt_factory
from w3af.core.data.options.option_list import OptionList
from w3af.core.data.search_engines.google import google

from w3af.core.controllers.plugins.crawl_plugin import CrawlPlugin
from w3af.core.controllers.exceptions import BaseFrameworkException, RunOnce
from w3af.core.controllers.misc.is_private_site import is_private_site
from w3af.core.controllers.misc.decorators import runonce


class google_spider(CrawlPlugin):
    """
    Search Google for "site:<domain>" to discover new URLs
    :author: Andres Riancho (andres.riancho@gmail.com)
    """

    def __init__(self):
        CrawlPlugin.__init__(self)

        # User variables
        self._result_limit = 300

    @runonce(exc_class=RunOnce)
    def crawl(self, fuzzable_request):
        """
        :param fuzzable_request: A fuzzable_request instance that contains
                                    (among other things) the URL to test.
        """
        google_se = google(self._uri_opener)

        domain = fuzzable_request.get_url().get_domain()
        if is_private_site(domain):
            msg = 'There is no point in searching google for "site:%s".'\
                  ' Google doesn\'t index private pages.'
            raise BaseFrameworkException(msg % domain)

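        try:
            g_results = google_se.get_n_results('site:' + domain,
                                                self._result_limit)
        except Exception:
            # Best effort: if the search fails there is nothing to crawl,
            # so the plugin finishes without reporting an error.
            pass
        else:
            # Fetch and parse every result URL using the worker pool
            self.worker_pool.map(self.http_get_and_parse,
                                 [r.URL for r in g_results])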

    def get_options(self):
        """
        :return: A list of option objects for this plugin.
        """
        ol = OptionList()

        d = 'Fetch the first "result_limit" results from the Google search'
        o = opt_factory('result_limit', self._result_limit, d, 'integer')
        ol.add(o)

        return ol

    def set_options(self, options_list):
        """
        This method sets all the options that are configured using the user
        interface generated by the framework using the result of get_options().

        :param options_list: A dictionary with the options for the plugin.
        :return: No value is returned.
        """
        self._result_limit = options_list['result_limit'].get_value()

    def get_long_desc(self):
        """
        :return: A DETAILED description of the plugin functions and features.
        """
        return """
        This plugin finds new URLs using Google. It searches for
        "site:domain.com" and sends GET requests to all the URLs found in the
        results.

        One configurable parameter exists:
            - result_limit
        """
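
The `crawl` method above refuses to search for private sites because Google does not index them. The standalone Python 3 sketch below shows one simplified way such a guard could work for IP literals. `is_private_literal` is a hypothetical helper written for this example and is not the w3af `is_private_site` function. Hostnames are not resolved here and are treated as public, which is an assumption of the sketch.

```python
import ipaddress


def is_private_literal(host):
    """Return True if host is an IP literal in a private or loopback range."""
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (for example a hostname). This sketch does no
        # DNS resolution, so such hosts are treated as public.
        return False
    return address.is_private
```

A caller could run this check before building the "site:" query and skip the search when it returns True.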