Scrape data from the SEC EDGAR database: form DEF 14A - repost

There has been a change of plan. I no longer want to scrape the Form 4 filings. I now want to scrape forms with the code DEF 14A. These come out on an annual basis. I would like to scrape data for as many companies as possible. This would include the constituents of the S&P 500 and maybe some of the larger indices. I also need the historical data: you will have to get data from the DEF 14A filing for every available year. I want to extract just two fields: the name of the company and the percentage of the shares owned by company insiders.

Extracting this information will be fairly challenging because the DEF 14A filings do not follow a set pattern. You will need to create a program capable of scanning through the document and locating the relevant text. You will have to employ some kind of proximity system and make use of "wildcard" searches.
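One possible reading of the "proximity system" with "wildcard" searches is sketched below. It is only an illustration: the function name `proximity_hits`, the 12-word window and the wildcard patterns are all assumptions, not part of the job specification. It looks for a word matching the first pattern and keeps the snippet only if every other pattern also matches a word within the same window.

```python
import re
from fnmatch import fnmatchcase

def proximity_hits(text, patterns, window=12):
    """Return snippets in which every wildcard pattern matches a word
    inside the same `window`-word span, starting at a match of patterns[0]."""
    # Split into words and number-like tokens; punctuation is dropped.
    words = re.findall(r"[A-Za-z]+|\d[\d,.]*%?", text)
    lowered = [w.lower() for w in words]
    hits = []
    for start, word in enumerate(lowered):
        if not fnmatchcase(word, patterns[0]):
            continue
        chunk = lowered[start:start + window]
        # Every remaining pattern must match at least one word in the span.
        if all(any(fnmatchcase(w, p) for w in chunk) for p in patterns[1:]):
            hits.append(" ".join(words[start:start + window]))
    return hits

# Hypothetical usage on one of the sample rows quoted in this posting.
sample = ("All directors and current executive officers as a group "
          "(12 persons) 3,991,056 6,348,957 10,340,013 2.6%")
hits = proximity_hits(sample, ["director*", "officer*", "group", "person*"])
```

A wider window tolerates more intervening words (dates, footnote markers) at the cost of more false positives.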

Here is an example. It is the most recent DEF 14A filing submitted by Microsoft: [url removed, login to view]. The relevant table appears on page 11. The table lists the names of the managers and the percentage of the company that they personally own. Collectively the managers own 9.46 percent of the company. In this example I would be looking to extract the word "Microsoft" and the number 9.46.

I would like to do this for every form DEF 14A filing in Microsoft's history. The big problem is that not all companies use the same language. Here is a list of different phrases that various companies use:

All Directors and Executive Officers as a group (including Named Executives) (32 persons) beneficially owned 1.64% of Ford common stock or securities convertible into Ford common stock as of February 1, 2012 6 1,226,353 9,476,285 13.3%

All directors and current executive officers as a group (12 persons) 3,991,056 6,348,957 10,340,013 2.6%

Directors, nominees and Named Executive Officers as a group (11 persons) 6,185,034 (12) 25.18

Executive officers and directors as a group (13 persons)(19) 1,490,847 6.7%

All executive officers and directors as a group (14 persons) 4,782,931 6.4

All directors and executive officers as a group (18 persons) 661,671 1,440,299 269,802,371,776 4.3%

All directors and executive officers as a group (12 persons)(6) 870,542 1.07

All Company directors and executive officers as a group (19 persons) 433,960 596,312 1,030,272 1.5%

All nominees, continuing directors and executive officers as a group (20 persons) 5,944,103 16,824,264 139,82 8,03,926,234 (4) 23%

All directors, director nominees and executive officers as a group (12 persons) 13,412,40 17.0%

All current executive officers and directors as a group (10 persons) (7)...................... 19,059,809 1,275,405 52.1%

All directors and executive officers as of November 13, 2012 as a group (13 persons) 17,011,477 624,969 17,636,446 54.8%

Certain words do appear to recur most of the time. You could create a set of rules that said something like:

LOCATE TEXT WITH THE WORDS: "as a group (?? persons)"


If you managed to devise rules like that, you would extract the correct percentage in many cases. The failure rate would still be fairly high. I do not mind if a fairly high percentage of the data is either missing or incorrect.
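The "as a group (?? persons)" rule could be expressed as a regular expression along the following lines. This is a minimal sketch, not a complete extractor: the names `GROUP_RE`, `PERCENT_RE` and `insider_percent` are made up for illustration, and the choice to take the first number followed by a percent sign after the phrase is an assumption. Rows that print the percentage without a "%" sign return None, which is consistent with the tolerated failure rate.

```python
import re

# "as a group", optionally followed by other parentheticals,
# then "(NN persons)". Whitespace and letter case are flexible.
GROUP_RE = re.compile(
    r"as\s+a\s+group\s*(?:\([^)]*\)\s*)*?\(\s*\d+\s+persons?\s*\)",
    re.IGNORECASE,
)
# A number of up to three integer digits followed by "%"; the lookbehind
# avoids starting inside a larger figure such as 10,340,013.
PERCENT_RE = re.compile(r"(?<![\d.,])(\d{1,3}(?:\.\d+)?)\s*%")

def insider_percent(line):
    """Return the first percentage after the group phrase, or None."""
    group = GROUP_RE.search(line)
    if group is None:
        return None
    percent = PERCENT_RE.search(line, group.end())
    return float(percent.group(1)) if percent else None

# Two of the sample rows quoted above.
row_a = ("All directors and current executive officers as a group "
         "(12 persons) 3,991,056 6,348,957 10,340,013 2.6%")
row_b = ("All directors and executive officers as of November 13, 2012 "
         "as a group (13 persons) 17,011,477 624,969 17,636,446 54.8%")
print(insider_percent(row_a))  # 2.6
print(insider_percent(row_b))  # 54.8
```

The function assumes each table row has already been flattened to a single line of text. Where two rows have run together on one line, only the first group phrase is used.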

Skills: data mining


About the employer:
(38 reviews) EDINBURGH, United Kingdom

Project ID: #4246334

5 freelancers have bid on this job


Hello, this sounds like an interesting and challenging project. If you are interested in my bid, please contact me so I can continue with the research. Samples of previous work will be provided upon request. Best regards.

£1000 GBP in 10 days
(19 reviews)

I can help with your project. Please check PMB and our ratings/reviews to get an idea of our experience. Please let me know if you have any queries.

£350 GBP in 7 days
(6 reviews)

Thank you for inviting us to place a bid on your project.

£600 GBP in 10 days
(2 reviews)

Please see PM.

£700 GBP in 30 days
(0 reviews)

Got a bevy of bots scraping full text breaking news 24/7 using rules-based text processing systems for similarly difficult-to-scrape data (for similar reasons, too).

£260 GBP in 7 days
(0 reviews)