I need a Python script that extracts only the questions from a PDF of a deposition transcript and saves them as a numbered list in a Word document. The PDF can range from 5 pages to over 1,000 pages in length.
The text will look [almost - **] exactly like the below (except it will be in a PDF). I have attached a sample PDF to this proposal:
1. A. Yes.
2. Q. And you're familiar with Navistar; right?
3. A. Yes, ma'am.
4. Q. And you've been familiar with Navistar for decades?
5. A. Yeah. All these old things sound bad but, yes,
7. Q. And this engine that was in the Nolans' Excursion that
8. was manufactured by Navistar; right?
9. A. Correct.
The above would be scraped from the PDF, the line numbering removed, and the result saved in a Word document as:
Q. And you're familiar with Navistar; right?
Q. And you've been familiar with Navistar for decades?
Q. And this engine that was in the Nolans' Excursion that
was manufactured by Navistar; right?
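A minimal sketch of that filtering step, assuming the PDF text has already been extracted into a list of lines. The names `question_lines`, `LINE_NO` and `SPEAKER` are hypothetical, and the patterns assume each line starts with a one- or two-digit line number and that every turn begins with "Q." / "Q:" or "A." / "A:":

```python
import re

# Assumption: a leading line number looks like "4." or "4" followed by a space.
LINE_NO = re.compile(r"^\s*\d{1,2}\.?\s+")
# Assumption: a new speaker turn starts with "Q." / "Q:" or "A." / "A:".
SPEAKER = re.compile(r"^([QA])[.:](?:\s|$)")


def question_lines(lines):
    """Return the lines that belong to questions, with line numbers removed."""
    kept = []
    in_question = False
    for raw in lines:
        text = LINE_NO.sub("", raw).rstrip()
        turn = SPEAKER.match(text)
        if turn:
            # A "Q" turn opens a question; an "A" turn closes it.
            in_question = turn.group(1) == "Q"
        if in_question and text:
            kept.append(text)
    return kept


sample = [
    "1. A. Yes.",
    "2. Q. And you're familiar with Navistar; right?",
    "3. A. Yes, ma'am.",
    "4. Q. And you've been familiar with Navistar for decades?",
    "5. A. Yeah. All these old things sound bad but, yes,",
    "7. Q. And this engine that was in the Nolans' Excursion that",
    "8. was manufactured by Navistar; right?",
    "9. A. Correct.",
]

for line in question_lines(sample):
    print(line)
```

Lines without a "Q" or "A" marker are treated as continuations of whatever turn came before them, which is how the two-line Excursion question stays together.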
ISSUES - **
1. Each page of the transcript has its lines numbered from 1 to 25 on the left-hand side. Any attempt to scrape from "Q:" until reaching "?" will grab the line numbers as well. These will need to be removed from the output.
2. Some questions which are near the end of one page will continue onto the next page. At the bottom of each page of the PDF is a footer and a page number. At the top of the next page is a header. NONE of that should be saved. I will leave it up to the coder to decide the best method to deal with this.
I will note that it may be easier to convert the PDF to some other type first (.txt) which will strip the header/footer, but leave the page and line numbers which have to be dealt with. It will be up to you how to deal with this issue.
3. The question text will need to be reformatted to remove all hard-coded returns at the end of lines EXCEPT those that follow the ending "?". Below is an example of before and after:
10. Q: Okay. Yesterday, as you recall, you were in
11. here for a different case, Jones versus Ford, and you
12. have a binder that looks almost identical today?
The lines above, if scraped or converted to .txt format, will retain the shortened text structure which needs to be revised. The above should be reformatted and saved as:
Q: Okay. Yesterday, as you recall, you were in here for a different case, Jones versus Ford, and you have a binder that looks almost identical today?
and NOT saved as:
Q: Okay. Yesterday, as you recall, you were in
here for a different case, Jones versus Ford, and you
have a binder that looks almost identical today?
4. Processing of the PDF/Document should stop when the words "ERRATA SHEET" are encountered. That is the end of the file. At times there may be more pages beyond this - they should be ignored and all work saved once the above phrase is found. This exact phrase (including being in caps) only appears once per document. There may be times where this phrase is not found and the document simply ends. Either scenario should be accounted for.
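One possible way to handle issues 1 through 4 together is sketched below. It assumes the transcript text has already been pulled out of the PDF as a single string, and it relies on a guess about the layout: that only real transcript lines begin with a line number from 1 to 25, so header, footer and page-number lines can be dropped because they lack one. The function name `extract_questions` and the sample header and footer text in the usage example are hypothetical.

```python
import re

END_MARKER = "ERRATA SHEET"
# Assumption: transcript lines start with a line number (1-25), then the text.
NUMBERED = re.compile(r"^\s*(\d{1,2})\.?\s+(.*\S)\s*$")
# Assumption: a new speaker turn starts with "Q." / "Q:" or "A." / "A:".
SPEAKER = re.compile(r"^([QA])[.:](?:\s|$)")


def extract_questions(text):
    """Return each question as one string with the hard line breaks removed."""
    # Issue 4: ignore everything from the marker onward; if the marker is
    # absent, split() returns the whole text and processing runs to the end.
    text = text.split(END_MARKER, 1)[0]

    questions = []
    current = None  # pieces of the question being collected, or None
    for raw in text.splitlines():
        m = NUMBERED.match(raw)
        # Issue 2: lines without a 1-25 line number are assumed to be
        # header, footer or page-number lines and are skipped.
        if not m or not 1 <= int(m.group(1)) <= 25:
            continue
        body = m.group(2)  # Issue 1: the line number is left behind here
        turn = SPEAKER.match(body)
        if turn:
            if current is not None:
                # Issue 3: join the wrapped lines into a single paragraph.
                questions.append(" ".join(current))
                current = None
            if turn.group(1) == "Q":
                current = [body]
        elif current is not None:
            current.append(body)  # continuation, possibly from the next page
    if current is not None:
        questions.append(" ".join(current))
    return questions


pages = "\n".join([
    "DEPOSITION HEADER (hypothetical)",
    "24. A. Yes.",
    "25. Q: Okay. Yesterday, as you recall, you were in",
    "Footer text (hypothetical)",
    "Page 12",
    "DEPOSITION HEADER (hypothetical)",
    "1. here for a different case, Jones versus Ford, and you",
    "2. have a binder that looks almost identical today?",
    "3. A. Yes.",
    "ERRATA SHEET",
    "4. Q. This line is past the marker and is ignored?",
])
print(extract_questions(pages))
```

In this sketch a question ends where the next "Q" or "A" turn begins instead of at the first "?", which is a design choice: it keeps multi-sentence questions whole, but interjections by other speakers (for example an objection) are not recognized and would be folded into the question. A footer that happens to begin with a small number would also be mistaken for transcript text, so the patterns would need checking against the sample PDF.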
1. Simple method to tell the program where the PDF is located. I would prefer NOT to have to type in the exact directory location, instead be able to navigate to it and then the program know the location from that. Drag and drop is fine, navigation by user is fine - whatever is easier.
2. Save the output in the same directory as the input file. The file should be saved in Word format (.doc or .docx is fine). Use the same filename as the input file with _questions added to the filename. If the original file is "[login to view URL]" the saved file should be "[login to view URL]" or "deposition_questions.docx." Obviously, if drag and drop is implemented, the output directory can simply be the "documents" folder located at "C:\Users\Username\Documents", with the username to be entered by the user.
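The two requirements above could be approached as in the following sketch. It uses the standard-library tkinter file dialog for navigation and the third-party python-docx package (an assumption; any library that writes .docx would do) for the numbered output. The function names `choose_pdf`, `output_path_for` and `save_questions` are hypothetical.

```python
from pathlib import Path


def choose_pdf():
    """Open a file-picker dialog and return the chosen path (empty if cancelled)."""
    # Imported here so the other helpers still work without a display.
    from tkinter import Tk, filedialog

    root = Tk()
    root.withdraw()  # hide the empty main window; only the dialog is shown
    path = filedialog.askopenfilename(
        title="Select deposition transcript",
        filetypes=[("PDF files", "*.pdf")],
    )
    root.destroy()
    return path


def output_path_for(pdf_path):
    """Same directory and base name as the input, with _questions.docx appended."""
    src = Path(pdf_path)
    return src.with_name(src.stem + "_questions.docx")


def save_questions(questions, out_path):
    """Write one numbered paragraph per question to a .docx file."""
    from docx import Document  # third-party: pip install python-docx

    doc = Document()
    for question in questions:
        # "List Number" is a built-in style in python-docx's default template.
        doc.add_paragraph(question, style="List Number")
    doc.save(str(out_path))


# Example wiring (not run automatically):
#   pdf = choose_pdf()
#   if pdf:
#       save_questions(["Q. Example question?"], output_path_for(pdf))
```

Because the dialog returns the full path, the output location follows from the input file and no directory or username has to be typed in.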
The program will probably grow to more than one phase; this is just a starting point.