I want this script to run on Linux from the command prompt. No GUI is needed; I won't be running it from a web browser, just through shell access. You can recommend whichever programming language you feel would be best.
I have a .csv file of Craigslist URLs that I need scraped and parsed. The script will extract the email address, city, subject line of the ad, and the date that the ad was posted. I need to be able to specify a date range for the script to scrape, as well as the option to scrape everything. If you go to any of the links in the .csv file, there is usually a link at the bottom that says "next 100 postings" ([url removed, login to view] is an example - just scroll down to the bottom); when the script encounters this link, it will automatically follow it and continue on to the next page, until no more such links are found. Following every such link to the end is only needed when I have selected to scrape everything. If I am only scraping a specific date range, the script will still have to follow the 'next 100 postings' link at times, but it won't need to continue until no 'next 100 postings' links remain.
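A minimal sketch of the paging rule described above, in Python. It assumes listings are ordered newest first, and it takes a hypothetical `fetch_page` callable (not part of any library) that returns the posts on a page plus the URL behind "next 100 postings", or `None` when that link is absent. The `Post` type and `iter_posts` name are illustrative only.

```python
from datetime import date
from typing import Callable, Iterator, List, NamedTuple, Optional, Tuple


class Post(NamedTuple):
    url: str       # link to the individual ad
    posted: date   # date shown for the ad in the listing


# Hypothetical signature: listing URL -> (posts on that page, next-page URL or None)
FetchPage = Callable[[str], Tuple[List[Post], Optional[str]]]


def iter_posts(
    start_url: str,
    fetch_page: FetchPage,
    date_from: Optional[date] = None,
    date_to: Optional[date] = None,
) -> Iterator[Post]:
    """Yield posts inside [date_from, date_to]; with no dates, yield everything."""
    url: Optional[str] = start_url
    while url is not None:
        posts, next_url = fetch_page(url)
        for post in posts:
            if date_to is not None and post.posted > date_to:
                continue  # newer than the range, keep scanning
            if date_from is not None and post.posted < date_from:
                return  # assumption: newest first, so the rest is older
            yield post
        url = next_url  # follow "next 100 postings" until it disappears
```

With no dates given, the loop only ends when a page has no next link, which matches the scrape-everything mode. With a range, it may still cross page boundaries but stops at the first post older than the range.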
The script must be multi-threaded (must be able to handle up to 500 simultaneous threads), and must support the usage of http/https/socks4/socks5 proxies. I will have a text file of proxies, and the script will randomly grab a proxy for each URL that it scrapes.
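One way the threading and proxy requirements might be sketched in Python with the standard library. The one-proxy-per-line format (for example `socks5://host:port`) is an assumption about the proxy file, and `fetch` is a hypothetical callable standing in for the real HTTP request; whether SOCKS proxies actually work depends on the HTTP client chosen, which is left out here.

```python
import random
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 500  # upper bound taken from the job description


def load_proxies(lines):
    """Keep non-empty lines; each is assumed to look like 'socks5://host:port'."""
    return [line.strip() for line in lines if line.strip()]


def scrape_all(urls, proxies, fetch, max_threads=MAX_THREADS):
    """Call fetch(url, proxy) for every URL, choosing a random proxy per URL.

    `fetch` is a hypothetical function supplied by the caller. Results are
    returned in the same order as `urls`.
    """
    workers = max(1, min(max_threads, len(urls)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(fetch, url, random.choice(proxies)) for url in urls
        ]
        return [future.result() for future in futures]
```

A thread pool caps concurrency at the requested number of threads while still letting each URL draw its own proxy. In practice, `load_proxies` would be passed an open file object for the proxy list.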
The .csv file will have 3 columns in it:
1. The URL to begin scraping
2. The Country that is being scraped
3. The City that is being scraped
The script will use the country value to place the data scraped from that country into its own folder, and it will use the city value in the .csv files that it outputs after it parses each page. As an example:
[url removed, login to view],USA,Austin
[url removed, login to view],Canada,Vancouver
[url removed, login to view],Australia,Canberra
[url removed, login to view],UK,Cambridge
In this sample, the script will go to [url removed, login to view], and it will see numerous posts. If I have it set to only scrape a specific date range, it will only parse the URLs that are in that date range. If not, it will parse all of those URLs, as well as go to the 'next 100 postings' link and do the same, etc.
As of the time I wrote this, the very first link to be parsed is the "Expanding Firm Hiring - Marketing & Management" link - [url removed, login to view]. The script will parse this link, and will save this data to a .csv file called [url removed, login to view], in a folder called USA. This is what the output of the [url removed, login to view] file will look like, just from scraping that link:
email_address_here,Austin,Expanding Firm Hiring - Marketing & Management (AUSTIN),9/23/2009
I know that the date is shown as 2009-09-23, but whatever format the date is in, I need it reformatted as in the above example (month/day/year).
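A short Python sketch of the output line and date conversion described above. It assumes the scraped date arrives as `YYYY-MM-DD`, as in the example; other input formats would need their own parsing. The function names are illustrative.

```python
import csv
import io
from datetime import datetime


def format_date(iso_date):
    """Convert '2009-09-23' to '9/23/2009' (month/day/year, no zero padding)."""
    parsed = datetime.strptime(iso_date, "%Y-%m-%d")
    return f"{parsed.month}/{parsed.day}/{parsed.year}"


def output_row(email, city, subject, iso_date):
    """Build one CSV line in the order: email, city, subject, date."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([email, city, subject, format_date(iso_date)])
    return buffer.getvalue()
```

Using the `csv` module rather than joining with commas means a subject line that itself contains a comma is quoted instead of breaking the columns.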
I also need the option to scrape either all countries or just certain countries: for instance, just the USA, or the USA, Canada, and Australia.
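A possible Python sketch of reading the three-column input file and applying the country selection. The case-insensitive country match and the `load_targets` name are assumptions, and the URLs in the demo are placeholders, not the real listing URLs.

```python
import csv
import io


def load_targets(csv_file, countries=None):
    """Read (url, country, city) rows; keep all rows when countries is None.

    Country names are compared case-insensitively (an assumption).
    """
    wanted = None if countries is None else {c.strip().lower() for c in countries}
    targets = []
    for row in csv.reader(csv_file):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        url, country, city = (field.strip() for field in row[:3])
        if wanted is None or country.lower() in wanted:
            targets.append((url, country, city))
    return targets


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    demo = io.StringIO(
        "http://example.invalid/austin,USA,Austin\n"
        "http://example.invalid/vancouver,Canada,Vancouver\n"
    )
    for target in load_targets(demo, countries=["USA"]):
        print(target)
```

The country value of each kept row can then name the output folder, and the city value goes into each output line.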
The script will do the exact same thing for the other 3 examples, in Canada, Australia, and the UK.
I will own full and exclusive rights to this script; you will not be able to resell it.
If you have any questions, please don't hesitate to ask.
13 freelancers are bidding on average $355 for this job
Hi, Steve. I'm very much interested in this project. I'll get this job done and meet all your requirements. If you want, I can make a demo. I prefer to use Java for this scraper.