Cancelled

Website crawler/scraper for Open Source Plucker Project

The existing community of Plucker users will benefit from a Perl-based HTML spider/crawler that takes a parent URL, follows its links to a specified depth, and converts those pages to the standard Plucker document format:

Plucker Document Format:

[url removed, login to view]

Plucker Workshop:

[url removed, login to view]

Plucker Homepage:

[url removed, login to view]

The spider will take several arguments, matching those of the existing Python spider currently used for this task.

I started to rewrite the Python spider in Perl several years ago, and had to give up on the effort due to time constraints.

The arguments include (but are not limited to):

--url: Home/parent url for root document (starting page)

--file: Final output filename for the completed document

--maxdepth: Maximum depth to spider the content

--bpp: Bits per pixel for images; 1, 2, 4, 8, 16

--no-urlinfo: Do not include info about the URLs

--compression: none, doc, zlib

--stayonhost: Do not follow external URLs

--stayondomain: Do not follow URLs off of this domain

--staybelow: Stay below this URL prefix

--launchable, --not-launchable: Set/unset the Launchable attribute

--backup, --no-backup: Set/unset the Backup attribute

--beamable, --not-beamable: Set/unset the Beam attribute
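For illustration, the option set above could be sketched as a command-line interface like the following. This is a hypothetical Python sketch mirroring the flags listed, not the actual spider's implementation; all names and defaults are assumptions:

```python
import argparse

def build_parser():
    # Hypothetical sketch of the option set described above; the real
    # spider's flag handling and defaults may differ.
    p = argparse.ArgumentParser(description="Plucker web spider (sketch)")
    p.add_argument("--url", required=True, help="home/parent URL (starting page)")
    p.add_argument("--file", required=True, help="final output filename")
    p.add_argument("--maxdepth", type=int, default=1, help="maximum depth to spider")
    p.add_argument("--bpp", type=int, choices=[1, 2, 4, 8, 16], default=1,
                   help="bits per pixel for images")
    p.add_argument("--no-urlinfo", action="store_true",
                   help="do not include info about the URLs")
    p.add_argument("--compression", choices=["none", "doc", "zlib"], default="zlib")
    p.add_argument("--stayonhost", action="store_true",
                   help="do not follow external URLs")
    p.add_argument("--stayondomain", action="store_true",
                   help="do not follow URLs off of this domain")
    p.add_argument("--staybelow", metavar="PREFIX",
                   help="stay below this URL prefix")
    # Paired set/unset flags for the Palm document attributes
    for attr, neg in (("launchable", "not-launchable"),
                      ("backup", "no-backup"),
                      ("beamable", "not-beamable")):
        p.add_argument(f"--{attr}", dest=attr.replace("-", "_"), action="store_true")
        p.add_argument(f"--{neg}", dest=attr.replace("-", "_"), action="store_false")
    return p

args = build_parser().parse_args(
    ["--url", "http://example.com", "--file", "out.pdb", "--maxdepth", "2"])
```

A bidder's spider would feed the parsed options into the crawl loop and the Plucker document writer; the paired attribute flags share one destination so the last one given wins.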

The document format is open and documented, and I can provide examples and resources to make the work as easy as possible.

At a bare minimum, the spider must be able to handle HTML, RSS and text formats.
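As a sketch of that minimum requirement, the fetcher could dispatch on the HTTP Content-Type header before parsing. The function name and the exact MIME-type mapping below are assumptions for illustration, not part of the spec:

```python
def classify(content_type):
    """Map an HTTP Content-Type header to one of the three formats
    the spider must handle at a bare minimum: HTML, RSS, or text."""
    # Strip any parameters such as "; charset=utf-8"
    ct = content_type.split(";", 1)[0].strip().lower()
    if ct in ("text/html", "application/xhtml+xml"):
        return "html"
    # Treating generic XML as RSS is a heuristic; a real spider would
    # sniff the document root element as well.
    if ct in ("application/rss+xml", "application/xml", "text/xml"):
        return "rss"
    if ct.startswith("text/"):
        return "text"
    return "unsupported"
```

Anything classified as unsupported would simply be skipped (or, for images, routed to the --bpp conversion path instead).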

Whoever takes this on must leverage as many upstream CPAN modules as possible (LWP, GD, XML::RSS, Parallel::UserAgent, etc.) to keep the spider itself small and compact.

For an experienced Perl developer, this should be easy to complete in a short period of time.

When bidding, please provide examples of previous and relevant work you've done in this area.

The entire Plucker open-source community of several thousand users would use this tool if it were available, so please make the code as clean, extensible, and well-commented as possible; it will be seen and reviewed by thousands of eyes across the planet.

Thanks for your consideration, and good luck bidding!

Skills: Data Processing, Linux, Mobile App Development, Perl, Python


About the employer:
( 0 reviews ) New London, United States

Project ID: #198874

4 freelancers are bidding on average $84 for this job

pgcoding

Please check PMB.

$100 USD in 7 days
(8 reviews)
4.9
winterspade

Please see PMB for more information. Thanks

$100 USD in 7 days
(5 reviews)
4.4
dhoss

Hi there, I'd love to help! -Devin Austin Founder/Head Developer of [url removed, login to view]

$75 USD in 14 days
(1 review)
2.9
bhoga

We are a group of young and dynamic software developers with rich experience in developing web crawlers. We have pre-designed crawlers that can be modified/redesigned according to your needs. Hence we assure you that… More

$60 USD in 4 days
(0 reviews)
0.0