
Website crawlers (wrappers) for 12 sites

$100-500 USD

Cancelled
Posted almost 19 years ago


Paid on delivery
Use HTMLUnit to write programs to crawl each of the sites below. Generate a pipe-delimited file from each site with one row per record; the file formats are given below. You don't need to crawl each entire site while testing, but it's your responsibility to make sure that your program works over the entire site. Build in an N-second delay between page fetches, where N is a command-line parameter. Retry each page request 4 times before skipping (with an N-second delay between each retry). If you skip a page, write the fact that you skipped it to a logfile (an illustrative fetch sketch appears after the first group of site specs below). Note that I am *not* asking you to collect email addresses from any of these sites.

## Deliverables

1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.

2) Deliverables must be in ready-to-run condition as follows: source code must be written in Java and compilable under JDK 1.5 with the HTMLUnit and Log4J jar files. Using HTMLUnit and its associated XPath libraries will make this project much easier to write and maintain over time, so it is required that you use it.

3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased.

4) The work must be completed by June 1, 2005.

5) The specific program requirements follow. Note that in total, 12 programs are to be delivered. Specs for the last 3 programs are in the attached file.

For each URL specified below, write a program that starts from that URL and generates a file according to the crawl instructions below, where each row has the format "Type|LinkText|URL" and Type is "Localities", "Surnames", or "Topics".

1. [login to view URL]
   * Follow the Localities, Surnames, and Topics links.
   * Under Localities, follow the links for each location recursively. Location links are preceded by a yellow folder icon. Write out any message board links you find; message board links appear after a grey horizontal line on the page.
   * For example, on the page [login to view URL], you would recursively follow the 14 location links, and write out the two message board links:
     * Localities|General|[login to view URL]
     * Localities|CanadaGenWeb|[login to view URL]
   * Under Surnames, follow the 1-, 2-, and 3-character name prefix links recursively, and write out any message board links you find.
   * For example, on the page [login to view URL], you would recursively follow the 26 3-character name prefix links Sta..Stz, and write out the approximately 60 message board links, such as:
     * Surnames|St. Ama|[login to view URL]
   * Handle the Topics message boards similarly.

2. [login to view URL]
   * Follow the links under Surnames, Regional, and General Topics.
   * On each of the 26 surname pages, write out the surname links in the list. For example, on the page [login to view URL], you would write out approximately 200 rows, the first of which is:
     * Surnames|Qafzezi|[login to view URL]
   * There are only two links to follow under Regional: U.S. States and Countries. On each of those two pages, write out the links for each location (US state or country). For example, on the page [login to view URL], you would write out roughly 100-150 rows, the first of which is:
     * Localities|Albania|[login to view URL]
   * There's really just one page with Topics links: [login to view URL]. On this page, capture all links under the various headings and subheadings in the list. For example:
     * Topics|General Genealogy|[login to view URL]
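The fetch discipline above (N-second delay, 4 retries, logged skips) is the same for all 12 programs, so here is a minimal sketch of how it might look. This is an illustration for bidders, not the deliverable: it assumes a later HTMLUnit release than the 2005-era API plus Log4J 1.x, and the class name PoliteFetcher is invented for the example.

```java
import org.apache.log4j.Logger;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

/** Hypothetical helper: fetches pages with an N-second delay and 4 retries. */
public class PoliteFetcher {
    private static final Logger log = Logger.getLogger(PoliteFetcher.class);
    private static final int MAX_ATTEMPTS = 5; // first try plus 4 retries, per the spec

    private final WebClient client = new WebClient();
    private final long delayMillis;

    public PoliteFetcher(long delaySeconds) { // N is taken from the command line
        this.delayMillis = delaySeconds * 1000L;
    }

    /** Returns the page, or null if every attempt failed (the skip is logged). */
    public HtmlPage fetch(String url) throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            Thread.sleep(delayMillis); // N-second delay before each fetch and retry
            try {
                return (HtmlPage) client.getPage(url);
            } catch (Exception e) {
                log.warn("Attempt " + attempt + " of " + MAX_ATTEMPTS
                        + " failed for " + url + ": " + e);
            }
        }
        log.error("Skipping page after " + MAX_ATTEMPTS + " attempts: " + url);
        return null;
    }
}
```

Callers check for null and simply move on to the next URL; the skip has already been written to whatever Log4J appender is configured as the logfile.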
For each URL specified below, write a program that starts from that URL and generates a file according to the crawl instructions below, where each row has the format "LinkText|URL".

1. [login to view URL]
   * This one is easy; just capture the link text and URL for each of the links in the list. For example:
     * Louisiana, 1718-1925 Marriage Index|[login to view URL]

2. [login to view URL]
   * Follow the links recursively for each of the 1-, 2-, and sometimes 3-character prefixes. Write out the databases listed on each page. Instead of writing out just the link text, write out the entire line as the link text.
   * For example, on page [login to view URL],a,ab&firstTitle=0, you would write out 5 rows, the first of which is:
     * Abandoned iron mines of Andover and Byram Townships, Sussex County, New Jersey|[login to view URL]

For each URL specified below, write a program that starts from that URL and generates a file according to the crawl instructions below, where each row has the format "Location|LinkText|URL" and Location is the state, province, or country.

1. [login to view URL]
   * Follow all links from "Western United States and Canada" to "Additional Lists"; on the linked-to pages, capture information for the archives & libraries. Example:
     * Alaska|Alaska State Archives|[login to view URL]
     * Alaska|Alaska State Library. Alaska Historical Collections|[login to view URL]

2. [login to view URL]
   * Get data from this one page only; capture information from the list of state archives and records programs. Skip the links to the State Coordinator and SHRAB. Example:
     * Alabama|Alabama Department of Archives and History|[login to view URL]
     * Alaska|Alaska Division of Libraries and Archives|[login to view URL]
     * Alaska|Archives and Records Management|[login to view URL]
     * Arizona|Arizona History and Archives Division|[login to view URL]

3. [login to view URL]
   * Follow links from "Communal" to "State and Regional"; on the linked-to pages, capture information in the Links section, and continue following the links in the Categories section.
   * Get the Location field from the Category; it's OK if the Category is not a state, country, or province - just capture whatever it is.
   * Example: here are the first couple of links from the Communal : Northern America : USA : Alabama category:
     * Alabama|Amador County - Archives|[login to view URL]
     * Alabama|Birmingham Public Library - Archival Resources|[login to view URL]

4. [login to view URL]
   * Follow the links from "Africa" to "Northern America", using the same crawl strategy as for site (3).
   * Example: here is the link from the Communities : Associations : Africa : Senegal category:
     * Senegal|Association des Amis des Archives du Senegal (AMIAS)|[login to view URL]

5. [login to view URL]
   * Get data from this one page only; capture information from the state libraries and organizations. Example:
     * Alabama|Alabama Department of Archives & History|[login to view URL]
     * Alabama|Alabama Public Library Service|[login to view URL]

## Platform

Must run on Java 1.5, using HTMLUnit and Dom4J. (An illustrative XPath extraction sketch follows.)
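To make the row formats concrete, here is a rough sketch of XPath-based extraction with HTMLUnit. Again illustrative rather than normative: the //a[@href] expression is a placeholder (the real programs need expressions matched to each site's markup, such as anchors after the grey horizontal rule), the start URL and output filename are arbitrary, and getByXPath/asText assume a later HTMLUnit release than the 2005-era API.

```java
import java.io.PrintWriter;
import java.util.Iterator;
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

/** Hypothetical single-page extractor: one pipe-delimited row per link. */
public class RowWriterSketch {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        HtmlPage page = (HtmlPage) client.getPage(args[0]); // start URL from the command line
        PrintWriter out = new PrintWriter("rows.txt");      // placeholder output file

        // Placeholder XPath; real programs would target each site's markup.
        List anchors = page.getByXPath("//a[@href]");
        for (Iterator it = anchors.iterator(); it.hasNext();) {
            HtmlAnchor a = (HtmlAnchor) it.next();
            // Row format for the first group of programs: Type|LinkText|URL.
            out.println("Localities|" + a.asText().trim() + "|" + a.getHrefAttribute());
        }
        out.close();
    }
}
```

Each of the 12 programs would pair a fetch loop like the PoliteFetcher sketch above with site-specific XPath expressions and a recursion rule (location folders, name prefixes, or category links), differing only in the row prefix and the pages they follow.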
Project ID: 3727444

About the project

Remote project
Active 19 years ago

About the client

United States
5.0 (39 reviews)
Joined May 20, 2005
