In progress

linux craigslist crawler / scraper / harvester

I want this script written for Linux, to be run at the command prompt. No GUI is needed; I won't be running it from a web browser, just through shell access. You can recommend whichever programming language you feel would be best.

I have a .csv file of Craigslist URLs that I need scraped and parsed. The script will parse the email address, city, subject line of the ad, and the date the ad was posted. I need to be able to specify a date range for the script to scrape, as well as an option to scrape everything. If you go to any of the links in the .csv file, there is usually a link at the bottom that says "next 100 postings" ([url removed, login to view] is an example - just scroll down to the bottom). When the script encounters this link, it will follow it and continue on to the next page until no more such links are found. It follows the links all the way to the end only when I have chosen to scrape everything. If I am scraping a specific date range, the script will still have to follow the 'next 100 postings' link at times, but it won't need to continue until the links run out.
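
The paging rule above could be sketched roughly as follows, in Python. This assumes that listing pages show the newest posts first, which the brief does not state; `in_range`, `should_follow_next`, and their parameters are hypothetical names, and extracting the dates from a page is left out.

```python
from datetime import date
from typing import Iterable, Optional


def in_range(posted: date, start: Optional[date], end: Optional[date]) -> bool:
    """Return True if a post date falls inside the requested range.

    A bound of None means "no limit on that side".
    """
    if start is not None and posted < start:
        return False
    if end is not None and posted > end:
        return False
    return True


def should_follow_next(
    page_dates: Iterable[date],
    start: Optional[date],
    has_next_link: bool,
) -> bool:
    """Decide whether to fetch the 'next 100 postings' page.

    Assumption: posts are listed newest first, so once the oldest post
    on a page is older than `start`, later pages cannot be in range.
    """
    if not has_next_link:
        return False  # nothing left to follow
    if start is None:
        return True  # "scrape everything" mode: follow until links run out
    dates = list(page_dates)
    if not dates:
        return True  # no dated posts on this page, so no evidence to stop
    return min(dates) >= start
```

With a range starting on 2009-09-20, a page whose oldest post is dated 2009-09-19 would end the paging for that URL, while in "scrape everything" mode only a missing link would.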

The script must be multi-threaded (must be able to handle up to 500 simultaneous threads), and must support the usage of http/https/socks4/socks5 proxies. I will have a text file of proxies, and the script will randomly grab a proxy for each URL that it scrapes.
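
One possible shape for the threading and proxy requirement, sketched in Python. The proxy file format (one proxy URL per line, `#` for comments) is an assumption, and `load_proxies`, `scrape_urls`, and the `fetch` callback are hypothetical names. The actual HTTP request is deliberately left to `fetch`: Python's standard `urllib` handles HTTP/HTTPS proxies only, so SOCKS4/SOCKS5 support would need a third-party library that is not shown here.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Sequence, Tuple

MAX_THREADS = 500  # upper bound taken from the requirement above


def load_proxies(lines: Iterable[str]) -> List[str]:
    """Keep non-empty, non-comment lines; one proxy URL per line (assumed)."""
    cleaned = (line.strip() for line in lines)
    return [line for line in cleaned if line and not line.startswith("#")]


def scrape_urls(
    urls: Sequence[str],
    proxies: Sequence[str],
    fetch: Callable[[str, str], str],
    max_workers: int = MAX_THREADS,
) -> List[Tuple[str, str, str]]:
    """Run fetch(url, proxy) for every URL, picking a random proxy each time.

    Returns (url, proxy, result) tuples in the same order as `urls`.
    """
    def task(url: str) -> Tuple[str, str, str]:
        proxy = random.choice(proxies)  # a fresh random proxy per URL
        return url, proxy, fetch(url, proxy)

    # Never start more threads than there are URLs, and never fewer than one.
    workers = max(1, min(max_workers, len(urls)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, urls))
```

Passing the fetcher in as a callback keeps the thread and proxy handling separate from whichever HTTP client ends up being used.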

The .csv file will have 3 columns in it:

1. The URL to begin scraping

2. The Country that is being scraped

3. The City that is being scraped

The script will use the country value to place the data scraped from that country into its own folder, and it will use the city value in the .csv files that it outputs after it parses each page. As an example:

[url removed, login to view],USA,Austin

[url removed, login to view],Canada,Vancouver

[url removed, login to view],Australia,Canberra

[url removed, login to view],UK,Cambridge
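
Reading the three-column input file could look something like this Python sketch. The `Target` type and `read_targets` function are hypothetical names, and the URLs in the sample are placeholders because the real ones are not shown in this posting.

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class Target(NamedTuple):
    url: str      # column 1: the URL to begin scraping
    country: str  # column 2: decides the output folder
    city: str     # column 3: written into each output row


def read_targets(handle: Iterable[str]) -> List[Target]:
    """Parse rows of url,country,city, skipping blank or short rows."""
    targets = []
    for row in csv.reader(handle):
        if len(row) < 3:
            continue  # blank line or malformed row
        url, country, city = (field.strip() for field in row[:3])
        targets.append(Target(url, country, city))
    return targets


# Placeholder URLs stand in for the ones removed from the posting.
sample = io.StringIO(
    "http://example.invalid/austin,USA,Austin\n"
    "\n"
    "http://example.invalid/vancouver,Canada,Vancouver\n"
)
targets = read_targets(sample)
```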

In this sample, the script will go to [url removed, login to view], and it will see numerous posts. If I have it set to only scrape a specific date range, it will only parse the URLs that are in that date range. If not, it will parse all of those URLs, as well as go to the 'next 100 postings' link and do the same, etc.

As of the time I wrote this, the very first link to be parsed is the "Expanding Firm Hiring - Marketing & Management" link - [url removed, login to view] The script will parse this link and save the data to a .csv file called [url removed, login to view], in a folder called USA. This is what the output of the [url removed, login to view] file will look like, just from scraping that link:

email_address_here,Austin,Expanding Firm Hiring - Marketing & Management (AUSTIN),9/23/2009
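
Writing a row like the one above into a per-country folder might be sketched as follows in Python. The file name `austin.csv` is purely illustrative, since the real output file name is not visible in this posting, and `append_row` is a hypothetical helper.

```python
import csv
import tempfile
from pathlib import Path
from typing import Sequence


def append_row(base_dir: str, country: str, filename: str, row: Sequence[str]) -> Path:
    """Append one parsed ad to <base_dir>/<country>/<filename>."""
    folder = Path(base_dir) / country
    folder.mkdir(parents=True, exist_ok=True)  # create the country folder on demand
    path = folder / filename
    with path.open("a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow(row)
    return path


# Demo in a throwaway directory; "austin.csv" is an assumed file name.
with tempfile.TemporaryDirectory() as tmp:
    written = append_row(
        tmp,
        "USA",
        "austin.csv",
        [
            "email_address_here",
            "Austin",
            "Expanding Firm Hiring - Marketing & Management (AUSTIN)",
            "9/23/2009",
        ],
    )
    demo_text = written.read_text(encoding="utf-8").strip()
```

With many threads writing at once, appends to the same file would need a lock or a single writer thread, which this sketch leaves out.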

I know that the date is shown as 2009-09-23, but whatever format the date is in, I need it converted to the format in the above example (month/day/year).
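
The date conversion could be done along these lines in Python, assuming the source date always arrives as YYYY-MM-DD; `to_us_date` is a hypothetical name. The month and day are written without leading zeros to match the sample row.

```python
from datetime import datetime


def to_us_date(iso_date: str) -> str:
    """Convert 'YYYY-MM-DD' to 'M/D/YYYY' without zero padding."""
    parsed = datetime.strptime(iso_date, "%Y-%m-%d")
    return f"{parsed.month}/{parsed.day}/{parsed.year}"


print(to_us_date("2009-09-23"))  # prints 9/23/2009
```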

I also need the option to scrape either all countries or only certain countries. For instance, I might want to scrape just the USA, or the USA, Canada, and Australia, etc.
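
On the command line, the country selection might be exposed as in the Python sketch below. The `--countries` flag and the `build_parser` and `select_targets` names are assumptions for illustration, not part of the brief; leaving the flag out means "all countries".

```python
import argparse
from typing import List, Optional, Sequence, Tuple

Row = Tuple[str, str, str]  # (url, country, city), as in the input .csv


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Craigslist scraper (sketch)")
    parser.add_argument(
        "--countries",
        nargs="*",
        default=None,
        help="countries to scrape; omit to scrape all of them",
    )
    return parser


def select_targets(rows: Sequence[Row], countries: Optional[Sequence[str]]) -> List[Row]:
    """Keep only rows whose country was requested (case-insensitive)."""
    if not countries:
        return list(rows)  # no filter given: scrape every country
    wanted = {name.lower() for name in countries}
    return [row for row in rows if row[1].lower() in wanted]
```

A run limited to two countries would then look like `--countries USA Canada` on whatever script name is chosen.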

The script will do the exact same thing for the other 3 examples, in Canada, Australia, and the UK.

I will own the exclusive rights to this script; you will not be able to re-sell it, and I will obtain full rights to this script.

If you have any questions, please don't hesitate to ask.

Skills: Linux, PHP, Web Scraping


About the employer:
(32 reviews) Doral, Costa Rica

Project ID: #520096

Awarded to:

zeke

Dear Customer! This is my favourite kind of project and I have a lot of experience writing crawlers/scrapers/web bots/etc. Please see PMB for examples of my previous work in this field. Ready to start right now and f...

$300 USD in 3 days
(152 reviews)
6.8

13 freelancers are bidding on average $355 for this job

srinichal

I can do this in bash using wget

$220 USD in 3 days
(100 reviews)
7.1
pgcoding

Please check PMB.

$400 USD in 15 days
(25 reviews)
6.2
LanceGuru

Hi, Please see the private message. Thank You

$400 USD in 3 days
(23 reviews)
6.2
rapuk

Hej, Steve. I'm very much interested in this project. I'll get this job done and meet all your requirements. If you want, I can make a demo. I prefer to use Java for this scraper.

$250 USD in 5 days
(30 reviews)
5.1
Scorpio1987

Please check PM.

$350 USD in 4 days
(11 reviews)
4.5
ukshumi

Please check PM. I already have something similar.

$250 USD in 4 days
(10 reviews)
3.7
AlexeyKaplin

I can do it in Perl.

$0 USD in 0 days
(2 reviews)
0.8
Sapron

Please check your PMB.

$200 USD in 5 days
(0 reviews)
0.0
tech2trade

Hi, we have already built a similar crawler for Citysearch's web site using Microsoft technologies, and we have all the data from it. Please feel free to call me on 001 408 218 8015 or mail me your contact information to...

$900 USD in 20 days
(0 reviews)
0.0
bubble1000

Hi, I have read your requirements carefully. I have experience with this kind of work and can take this job. Thanks.

$600 USD in 14 days
(0 reviews)
0.0
ppan279

Hi, I have no reviews to show, as I registered recently, but I have about 4 years of rich scraping experience, during which I have scraped no fewer than 500 sites of all hues and [url removed, login to view] me with this work a...

$350 USD in 7 days
(0 reviews)
0.0
badousoft

Please see PM.

$400 USD in 7 days
(0 reviews)
0.0