I need a custom web crawler to scrape data from a large business contact database with millions of members. The program must be able to circumvent server detection, either through bandwidth throttling or some other mechanism. The database will require multiple extraction templates, but the end user must be able to set the specific crawling rules, keywords, and crawl depth. The output must be convertible to an Excel file, a Microsoft Access database, or a MySQL database.
Features that are desired:
1. Multiple Data Types in Single Extraction Template
(e.g., Free Text, Tabled Information, Multiple Tables)
2. Multiple Types of File Lists or Data Inputs
(e.g., Excel, Access, MySQL, SQL)
3. Multiple Extraction Datastores
(e.g., Excel, Access, MySQL, SQL)
4. Automatic Table Creation during Extraction
(Supported in Excel, MySQL, SQL)
5. SQL 2005 Express Instance
(Stores Meta-Data, Program Variables, and can store Extracted Data)
6. Comprehensive Meta-Data Logging
(For Auditing, Data Cleansing, and Data Joining)
7. Wizard-driven DataSet Initialization String Creation
(The HTML for Extraction Area Start and Row Start)
8. Manual Editing of DataSet Initialization String
(User Defined DataSet HTML)
9. Automatic Table Row Count Calculation
(Automatically Calculates Number of Table Rows on each HTML Page)
10. Wizard-driven Field Creation
(The HTML for Data Extraction Start and Column Start)
11. Manual Editing of Field Start and Stop HTML
(User Defined Start and Stop Tags)
12. Supports Optional Fields
(Accurate Extraction of data that appears in some rows, but not in others)
13. Built In Data Cleansing
(Remove HTML, Preserve Text Whitespace, Full URL from Relative, and more)
14. Test Extraction w/Step by Step Replay for Troubleshooting
15. One-Click Save to Datastore Option
(Extract while browsing in the DataPage Editor)
16. Basic Automation Wizard
(Simple Extraction Automation via File List from Excel, Access, MySQL and SQL)
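As a rough illustration of items 9 through 13 (start/stop tags, row counting, optional fields, and cleansing), here is a minimal Python sketch. The function names, the sample HTML, the field definitions, and the example.com base URL are all made up for illustration; a real template would be built against the actual page markup, and this sketch collapses whitespace rather than preserving it.

```python
import html
import re
from urllib.parse import urljoin


def extract_between(text, start, stop, pos=0):
    """Return (value, end_index) for the text between start and stop, or (None, pos)."""
    i = text.find(start, pos)
    if i == -1:
        return None, pos
    i += len(start)
    j = text.find(stop, i)
    if j == -1:
        return None, pos
    return text[i:j], j + len(stop)


def clean(value):
    """Strip tags, decode HTML entities, and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", value)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def extract_rows(page, row_start, row_stop, fields, base_url):
    """fields maps a field name to (start_tag, stop_tag, is_url)."""
    rows, pos = [], 0
    while True:
        row, next_pos = extract_between(page, row_start, row_stop, pos)
        if row is None:
            break
        pos = next_pos
        record = {}
        for name, (start, stop, is_url) in fields.items():
            raw, _ = extract_between(row, start, stop)
            if raw is None:
                record[name] = None  # optional field absent from this row
            elif is_url:
                record[name] = urljoin(base_url, raw.strip())  # relative -> full URL
            else:
                record[name] = clean(raw)
        rows.append(record)
    return rows


# Hypothetical sample page and template, for illustration only.
SAMPLE = (
    '<tr><td class="name"><b>Acme &amp; Sons</b></td>'
    '<td class="sales">$1 to 2.5 million</td><a href="/c/acme">more</a></tr>'
    '<tr><td class="name">Beta  Spice Co.</td><a href="/c/beta">more</a></tr>'
)
FIELDS = {
    "name": ('<td class="name">', "</td>", False),
    "sales": ('<td class="sales">', "</td>", False),
    "link": ('<a href="', '"', True),
}
rows = extract_rows(SAMPLE, "<tr>", "</tr>", FIELDS, "http://example.com/list")
print(len(rows), rows)
```

The row count falls out of the number of row-start matches, and a field missing from a row is stored as None instead of shifting the other columns.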
1. WinHTTP Stack
(Server-quality HTTP platform that allows downloading up to 10 pages per second)
2. Multi-Step Task Execution
(Simulate user tasks like Log-in, get Cookie or SessionID, Submit Searches)
3. Bandwidth Throttling
(Scale from 10 requests/second down to 1 request/hour to simulate a real user)
4. Download Images and Files
(Edit File Path and File Naming Conventions)
5. Customize User Agent, Referrer URL, Relative URL, Cookies, and more.
6. Powerful SQL based File List Manipulation and Concatenation
7. Package Run Scheduling
(Run Normally or Silently from Windows Scheduler or other program interface)
9. Create URL File Lists
(Manually or using Excel, Access, MySQL, and SQL)
10. Advanced Web Crawler
(Control Depth, Number of pages, and parameters of Link to be crawled or ignored)
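To make the throttling and crawl-control items concrete, here is a small Python sketch of a request-rate limiter plus a breadth-first crawl bounded by depth, page count, and a link filter. The class and function names, the in-memory SITE map, and the example.com URLs are assumptions for illustration; a real crawler would fetch pages over HTTP where this one looks links up in a dictionary.

```python
import time
from collections import deque
from urllib.parse import urlparse


class Throttle:
    """Enforce a minimum delay between requests (interval = 1 / rate)."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / requests_per_second
        self.clock = clock
        self.sleep = sleep
        self.last = None

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


def crawl(start_url, get_links, max_depth, max_pages, allow, throttle):
    """Breadth-first crawl; get_links(url) returns the links found on that page."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        throttle.wait()  # pace every page request
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen and allow(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for real pages.
SITE = {
    "http://example.com/": ["http://example.com/a", "http://other.test/x"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": ["http://example.com/c"],
}


def same_host(url):
    return urlparse(url).netloc == "example.com"


pages = crawl("http://example.com/", lambda u: SITE.get(u, []), 2, 10, same_host, Throttle(10))
print(pages)
```

Throttle(10) corresponds to 10 requests/second and Throttle(1 / 3600) to 1 request/hour, the two ends of the range given in item 3.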
This program must be able to avoid automated detection or blocking from the host. Remember, it must be able to extract entries in the millions at very high speeds. The database which I need to scrape is [url removed, login to view]
Payment will be transferred via escrow once three successful tests of the program have been completed to my full specifications.
I will use the web crawler to search for companies by U.S. NAICS industry description ([url removed, login to view]). Once the companies listed under that industry appear in [url removed, login to view], I will need their business information scraped and put into my database.
The information required includes:
Alt. Company Name (DBA)
Est. Annual Sales
Est. # of Employees
Est. # Employees at Loc.
State of Incorp.
For example, if I want to collect the business information of every company in the NAICS industry "Spices and Seasonings", I will enter a search in [url removed, login to view] for "Spices and Seasonings". Every related company will be listed, and that information will need to be scraped and compiled into my own database.
A company listed under "Spices and Seasonings" ([url removed, login to view])
For the test of the finished product, I will require sample Excel, Microsoft Access, and MySQL databases for the companies in the following industries: "Spices and Seasonings", "Soft Drink Manufacturing", and "Snack Food Manufacturing".
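As a sketch of what the export step might look like, the Python below writes extracted records to CSV (which Excel opens directly) and to a SQL table that is created automatically. It uses SQLite from the standard library purely as a stand-in, since MySQL and Access need external drivers. The column names are hypothetical renderings of the five fields listed above, and the sample record contains placeholder values, not real data.

```python
import csv
import io
import sqlite3

# Hypothetical column names for the five required fields.
COLUMNS = [
    "alt_name",
    "annual_sales",
    "employees",
    "employees_at_location",
    "state_of_incorporation",
]


def to_csv(records):
    """Return the records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_sqlite(conn, table, records):
    """Create the table if it is missing, then insert the records."""
    # The table name is interpolated, so it must come from trusted configuration.
    cols = ", ".join(f'"{c}" TEXT' for c in COLUMNS)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    placeholders = ", ".join(f":{c}" for c in COLUMNS)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', records)
    conn.commit()


# Placeholder record for illustration only.
sample = [{
    "alt_name": "Example Spice Co.",
    "annual_sales": "n/a",
    "employees": "n/a",
    "employees_at_location": "n/a",
    "state_of_incorporation": "n/a",
}]

csv_text = to_csv(sample)
conn = sqlite3.connect(":memory:")
to_sqlite(conn, "spices_and_seasonings", sample)
row_count = conn.execute('SELECT COUNT(*) FROM "spices_and_seasonings"').fetchone()[0]
print(csv_text, row_count)
```

Keeping one table per industry, as here, is only one possible layout; a single table with an industry column would work equally well.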
Some NAICS industry descriptions may return more results than Manta can list, which will require a specialized solution that crawls further by category or description. For example, Snack Food Manufacturing has 4,862 companies, but not all of them can be listed at one time. [url removed, login to view]