Cancelled

Web scraper in Python with Scrapy ([url removed, login to view]) for Google

I need to scrape Google search results, using Python with Scrapy ([url removed, login to view]).

My problem is that Google blocks automated scraping.

I need help working out how to configure the scraper (increase the scraping delay?) and/or an anonymous proxy (such as Tor+Privoxy) so that I can scrape Google search results.
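As a starting point for the delay question, here is a minimal sketch of a Scrapy settings fragment that slows the crawl down. The setting names are real Scrapy settings; the values are illustrative assumptions, not tested against Google, and throttling alone may not be enough to avoid being blocked.

```python
# settings.py (fragment): throttling options provided by Scrapy.
# All values below are illustrative assumptions, not tuned recommendations.

DOWNLOAD_DELAY = 10                 # base delay in seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # wait a random 0.5x to 1.5x of DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # never hit the same domain in parallel

AUTOTHROTTLE_ENABLED = True         # adapt the delay to observed response latency
AUTOTHROTTLE_START_DELAY = 10       # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 120        # upper bound for the adaptive delay

COOKIES_ENABLED = False             # do not carry a cookie session between requests
```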

What I have so far:

1) Simple Google parser:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        if [url removed, login to view]('[url removed, login to view]'):
            for url in [url removed, login to view]('//div[@id="ires"]/ol/li//h3[@class="r"]/a/@href').extract():
                ...  # Here parse google links
            for url in [url removed, login to view]('//a[@id="pnnext"]/@href').extract():
                url = "https://" + [url removed, login to view]('/')[2] + url
                yield Request(url)

This simple parser, without any proxy, gets recognized as an automated scraper and blocked.
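The parser builds the next-page URL by hand from the scheme, the host of the current page and the relative link. A minimal sketch of the same step using the standard library's urljoin, assuming the redacted expression refers to the URL of the current response; the function name and the example URLs are hypothetical:

```python
from urllib.parse import urljoin


def absolute_next_url(current_url, next_href):
    """Resolve a relative "next page" href against the URL of the current page.

    current_url is assumed to be the URL of the response being parsed,
    next_href the value extracted from the //a[@id="pnnext"]/@href XPath.
    """
    return urljoin(current_url, next_href)


# Hypothetical example values, for illustration only.
example = absolute_next_url(
    "https://www.google.com/search?q=scrapy",
    "/search?q=scrapy&start=10",
)
```

Unlike the string concatenation, urljoin keeps the scheme of the current page and also handles hrefs that are already absolute.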

2) I installed Tor+Privoxy, with this middleware class:

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            [url removed, login to view]['proxy'] = "http://localhost:8118"

configured in the settings:

    DOWNLOADER_MIDDLEWARES = {
        '[url removed, login to view]': 110,
        '[url removed, login to view]': 100,
    }
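For reference, a minimal sketch of how the middleware and its registration usually fit together. It assumes the redacted expression in the middleware is the request's meta dictionary, which is where Scrapy's built-in HttpProxyMiddleware looks for a 'proxy' key. The module path myproject.middlewares is hypothetical, and which of the two redacted entries carried 100 and which 110 is also an assumption.

```python
class ProxyMiddleware(object):
    """Downloader middleware that routes every request through a local HTTP proxy."""

    # Address used in the post; 8118 is Privoxy's default listening port.
    PROXY_URL = "http://localhost:8118"

    def process_request(self, request, spider):
        # Assumption: the redacted expression was request.meta.
        request.meta['proxy'] = self.PROXY_URL
        return None  # let Scrapy continue processing the request


# settings.py (fragment). 'myproject.middlewares' is a hypothetical module path.
# The HttpProxyMiddleware path below is the one used by current Scrapy releases;
# older releases shipped it under scrapy.contrib.downloadermiddleware.httpproxy.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
```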

But Scrapy does not seem to work with Tor+Privoxy on HTTPS pages (over plain HTTP, Scrapy+Tor+Privoxy works, but Google now only serves results over HTTPS).
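One way to narrow this down is to check, outside Scrapy, whether the proxy itself can relay HTTPS. The sketch below uses only the standard library and sends requests through the same Privoxy address; the function names are hypothetical and it is a diagnostic aid, not a fix.

```python
import urllib.request

PRIVOXY_URL = "http://localhost:8118"  # proxy address used in the middleware above


def build_proxy_opener(proxy_url=PRIVOXY_URL):
    """Return an opener that sends both HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_status(url, proxy_url=PRIVOXY_URL, timeout=30):
    """Fetch url through the proxy and return the HTTP status code."""
    opener = build_proxy_opener(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        return response.status
```

With Tor and Privoxy running, calling fetch_status("https://www.google.com/") exercises the HTTPS path through the proxy. If that call succeeds while the spider still fails, the problem is more likely in how Scrapy talks to the proxy than in the Tor+Privoxy setup.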

So what I actually need is a sample project with a detailed proxy configuration (Tor/Privoxy or something else) that shows how to avoid being blocked by Google for automated scraping.

Skills: PHP, Information Systems Architecture


About the Employer:
(19 reviews) Biella, Italy

Project ID: #4253255

5 freelancers have bid an average of $190 for this job

SigmaVisual

I can help with your project. Please check PMB and our ratings/reviews to get an idea of our experience. Please let me know if you have any queries.

$199 USD in 5 days
(218 reviews)
7.7

bob1982

Is PHP OK with you? Thanks.

$250 USD in 5 days
(337 reviews)
6.8

mantislin

Hi sir, please check PM, thx Kimi.

$250 USD in 5 days
(118 reviews)
6.4

exprtsolution

I can do this project. Please check PM. Thanks.

$150 USD in 7 days
(5 reviews)
2.3

AstreyLabs

Hi, I have a solution for your task. Go ahead.

$100 USD in 1 day
(0 reviews)
0.0