AllenLowe Posted October 7, 2009 Report Posted October 7, 2009 I have my jazz history book in published form, and due to a lot of stupid things the original disc that it was on may not be accessible - can the book pages be scanned in such a way as the pages can be put into a Word file (or some such thing) and edited? or do I have to re-type the damn thing? Quote
Robert J Posted October 7, 2009 Report Posted October 7, 2009 OCR software is pretty good these days. I have used a trial version of ABBYY FineReader that works well - it has lots of output functions. http://www.scanstore.com/Scanning_Software...p?ITEM_ID=18288 As with any software there are price points. Here's a list of the popular ones. There's probably other free stuff out there, but maybe not as accurate http://www.simpleocr.com/OCR_Software_Guide.asp Quote
jostber Posted October 7, 2009 Report Posted October 7, 2009 Some tips here: http://en.wikipedia.org/wiki/Book_scanning http://www.bookscanbureau.co.uk/ http://www.npr.org/blogs/library/2009/04/t...sc=fb&cc=fp http://www.techcrunch.com/2009/06/07/scan-...them-on-google/ Quote
Dan Gould Posted October 7, 2009 Report Posted October 7, 2009 I can't recall the name but I bought cheap software at Office Depot that converts PDF files into Word docs, so if you scan the pages as PDF files that would do the trick to make them editable. Quote
AllenLowe Posted October 7, 2009 Author Report Posted October 7, 2009 groovy; thanks guys. As I hit my old age it occurs to me that I need to take a little more control of my work; I have three books written, a fourth I am trying to finish. I've never found a decent publisher who will handle my stuff, and I've made more money selling it myself anyway (basically I occupy a middle position, in publishing limbo; I sell enough to make it worthwhile to ME, but not enough for a major trade press). thanks again - Quote
ejp626 Posted October 7, 2009 Report Posted October 7, 2009 [Robert J wrote: OCR software is pretty good these days. I have used a trial version of ABBYY FineReader that works well - it has lots of output functions. http://www.scanstore.com/Scanning_Software...p?ITEM_ID=18288 As with any software there are price points. Here's a list of the popular ones. There's probably other free stuff out there, but maybe not as accurate http://www.simpleocr.com/OCR_Software_Guide.asp] I used to really like OmniPage (I think I had version 8 or 9). I've heard relatively good things about the software through OmniPage 12, but then the company was bought out, the customer support/service went completely to hell, and none of it worked well with Vista (no big surprise there). OmniPage 16 supposedly bites. I'm totally bummed because my home computer finally died, and I can't find the installation CDs, so I have to get new software. I'm leaning towards ABBYY FineReader. I'm pretty sure that if you buy it (not just use the trial version) you can save out multiple pages. Any insight into this? Second, does it offer a "straighten page" option? For some of my scanned material, I have no option but to use this feature. Thanks for any thoughts. Quote
rockefeller center Posted October 7, 2009 Report Posted October 7, 2009 [Dan Gould wrote: I can't recall the name but I bought cheap software at Office Depot that converts PDF files into Word docs, so if you scan the pages as PDF files that would do the trick to make them editable.] Impossible. PDF encapsulates a description of a document that includes text, fonts, images, and 2D vector graphics which compose the document. Scanning a page results in a bitmap (image). You're on Windows, right? Try http://jocr.sourceforge.net/ which is released under the GNU General Public License: no trial or shareware teaser - simply download and use the software. Quote
Dan Gould Posted October 7, 2009 Report Posted October 7, 2009 [Dan Gould wrote: I can't recall the name but I bought cheap software at Office Depot that converts PDF files into Word docs, so if you scan the pages as PDF files that would do the trick to make them editable.] [rockefeller center wrote: Impossible. PDF encapsulates a description of a document that includes text, fonts, images, and 2D vector graphics which compose the document. Scanning a page results in a bitmap (image).] Scansoft PDF Converter 4 takes PDF files and converts them to Word Doc or Publisher and other formats. For my scanner, when scanning printed pages, the default output option is PDF. Quote
spinlps Posted October 8, 2009 Report Posted October 8, 2009 [rockefeller center wrote: Impossible. PDF encapsulates a description of a document that includes text, fonts, images, and 2D vector graphics which compose the document. Scanning a page results in a bitmap (image).] Not impossible. In fact, it's fairly common for scanners, especially document imaging units (what we used to call copiers), to output to PDF format in addition to JPG, GIF, TIFF, BMP, etc... Both my office Ricoh copier and Canon desktop scanner not only provide output to PDF but also the option to perform in-line OCR. Current OCR and crawling technology is truly amazing. Our enterprise search, in addition to full-text search on Office documents and PDFs, also performs full-text OCR on image files. You might expect a CAD drawing to be indexed, but it will also OCR a JPG or GIF, "sense" text, and index the content accordingly. Can you tell I'm an IT guy? Jeez... Quote
Serioza Posted October 8, 2009 Report Posted October 8, 2009 [AllenLowe wrote: I have my jazz history book in published form, and due to a lot of stupid things the original disc that it was on may not be accessible - can the book pages be scanned in such a way as the pages can be put into a Word file (or some such thing) and edited? or do I have to re-type the damn thing?] do re-type the damn thing Quote
rockefeller center Posted October 8, 2009 Report Posted October 8, 2009 (edited) Alright, I didn't think of in-line OCR. Not so sure that PDF makes sense as output format when I want to have the text in an editable format. So if the scan unit provides in-line OCR it should be possible to save that OCR string as a text file. Edited October 8, 2009 by rockefeller center Quote
rockefeller center Posted October 8, 2009 Report Posted October 8, 2009 [Dan Gould wrote: I can't recall the name but I bought cheap software at Office Depot that converts PDF files into Word docs, so if you scan the pages as PDF files that would do the trick to make them editable.] [rockefeller center wrote: Impossible. PDF encapsulates a description of a document that includes text, fonts, images, and 2D vector graphics which compose the document. Scanning a page results in a bitmap (image).] [Dan Gould wrote: Scansoft PDF Converter 4 takes PDF files and converts them to Word Doc or Publisher and other formats. For my scanner, when scanning printed pages, the default output option is PDF.] Which scanner model do you use? Quote
ejp626 Posted October 8, 2009 Report Posted October 8, 2009 [rockefeller center wrote: Alright, I didn't think of in-line OCR. Not so sure that PDF makes sense as output format when I want to have the text in an editable format. So if the scan unit provides in-line OCR it should be possible to save that OCR string as a text file.] I generally set my scanner to output TIFs, then after the editing and processing I usually just save to Word documents, but some people prefer saving to PDFs. I think the PDF-direct output is for people who just need a permanent record but don't intend to edit (receipts, accounting stuff and contracts tend to fall in this category). Quote
rockefeller center Posted October 8, 2009 Report Posted October 8, 2009 [rockefeller center wrote: Alright, I didn't think of in-line OCR. Not so sure that PDF makes sense as output format when I want to have the text in an editable format. So if the scan unit provides in-line OCR it should be possible to save that OCR string as a text file.] [ejp626 wrote: I generally set my scanner to output TIFs, then after the editing and processing I usually just save to Word documents, but some people prefer saving to PDFs. I think the PDF-direct output is for people who just need a permanent record but don't intend to edit (receipts, accounting stuff and contracts tend to fall in this category).] Yeah, I'd still like to know what the "PDF-direct" output looks like in Dan's case: embedded image or text. Quote
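One way to answer the "embedded image or text" question for a given scanner PDF is to check whether any text can be extracted from it. The following is only a rough sketch under stated assumptions: it relies on the third-party pypdf package (not mentioned in the thread), and the function names and the `min_chars` threshold are made up for illustration. A page that went through in-line OCR will show up as "text" even though it also carries the page image.

```python
def extract_page_texts(pdf_path):
    """Return the extractable text of each page of a PDF.

    Assumption: the third-party pypdf package is installed (pip install pypdf).
    It is imported lazily so the helper below works without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]


def classify_pages(page_texts, min_chars=20):
    """Label each page 'text' or 'image-only'.

    min_chars is a hypothetical threshold: pages with fewer non-whitespace
    characters than this are treated as having no usable text layer.
    """
    labels = []
    for text in page_texts:
        visible = "".join(text.split())  # drop all whitespace before counting
        labels.append("text" if len(visible) >= min_chars else "image-only")
    return labels
```

Typical use would be `classify_pages(extract_page_texts("scan.pdf"))`, where `scan.pdf` is a placeholder file name. If every page comes back "image-only", the PDF is just a wrapper around bitmaps and still needs OCR before it can be edited.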
AllenLowe Posted October 8, 2009 Author Report Posted October 8, 2009 (edited) all right, now you all lost me - I'm on Windows so I was thinking of using Rockefeller Center's link, the JOCR thing - am I missing anything here? (and thanks, Serioza, but retyping 100,000 + words is not my favorite option) - hey, here's an idea - anyone here want to hire out (for a reasonable fee, I hope) to do this? Edited October 8, 2009 by AllenLowe Quote
Dan Gould Posted October 8, 2009 Report Posted October 8, 2009 Define "reasonable" and I'm your man. Quote
Dan Gould Posted October 8, 2009 Report Posted October 8, 2009 [rockefeller center wrote: Alright, I didn't think of in-line OCR. Not so sure that PDF makes sense as output format when I want to have the text in an editable format. So if the scan unit provides in-line OCR it should be possible to save that OCR string as a text file.] [ejp626 wrote: I generally set my scanner to output TIFs, then after the editing and processing I usually just save to Word documents, but some people prefer saving to PDFs. I think the PDF-direct output is for people who just need a permanent record but don't intend to edit (receipts, accounting stuff and contracts tend to fall in this category).] [rockefeller center wrote: Yeah, I'd still like to know what the "PDF-direct" looks like in Dan's case: embedded image or text.] Microtek ScanWizard 5. We got it from my wife's aunt, who got it as part of a free "bundle" when she bought her PC and had no use for it. Like I said, if I scan text like a birth certificate, the default setting for the output is PDF. I've never tried to output a PDF when I scan a photograph or photo + text. And if there was any question, the Scansoft software did a perfect job converting PDF to an editable file, keeping images intact and text boxes editable. Quote
AllenLowe Posted October 8, 2009 Author Report Posted October 8, 2009 well, let's figure based on time - how long would it take to scan a 350 page book? Quote
spinlps Posted October 8, 2009 Report Posted October 8, 2009 It shouldn't take long IF you have an extra copy of the book that you are willing to unbind. That way the pages could be placed in the automatic document feeder in reasonable increments (25, 50, 75p, etc...). Scan duplex and you're on your way! Quote
ejp626 Posted October 8, 2009 Report Posted October 8, 2009 [AllenLowe wrote: well, let's figure based on time - how long would it take to scan a 350 page book?] You've got a few choices. Do you still have the page proofs? If so, I would use those. If not, and if your scanner supports multiple pages in a scan, then I would probably scan 10 pages at a time -- to avoid losing data due to crashes, etc. I'm in an unfortunate situation, since the scan settings reset for every single job, so in my case, I actually make a copy of the thing I am scanning, then automatically feed the copies through (this counts as one job). Make sure the settings are at least 300 dpi - 400 dpi is better if the output files aren't too large for your available storage. If you have page proofs, or have already copied the book, I think converting to TIF files will take 30 minutes, running batches through the scanner (this assumes you have a copier/scanner). Again, do this in installments (no more than 35-40 pages at a time). If you are scanning the book page by page, it might take 2-3 hours (or more if you have the software attempt OCR on the spot -- better to just save the files out and process later). Then you will run the OCR software. If you mostly have text and few or no footnotes and pictures, then maybe 1 day of going through and cleaning up. It could be more. That's a lot of pages. If the scan isn't clean or you have tables, footnotes, etc., then I'd say 2-3 days of intense work. That's my general experience. [spinlps wrote: It shouldn't take long IF you have an extra copy of the book that you are willing to unbind. That way the pages could be placed in the automatic document feeder in reasonable increments (25, 50, 75p, etc...). Scan duplex and you're on your way!] Yeah, pretty much my advice. This does assume Allen has a multiple page scanner and not a flatbed! Quote
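The batching and dpi advice above can be turned into quick arithmetic. This is a back-of-the-envelope sketch, not a measurement: the 350-page count and the 40-page batch limit come from the thread, while the 6 x 9 inch page size and 8-bit grayscale depth are assumptions chosen only for illustration, and the function names are made up. Real TIF files are usually compressed and will be smaller than these raw figures.

```python
import math


def plan_batches(total_pages, batch_size):
    """Return (first_page, last_page) ranges that cover the book in batches."""
    return [
        (start, min(start + batch_size - 1, total_pages))
        for start in range(1, total_pages + 1, batch_size)
    ]


def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw (uncompressed) size in bytes of one scanned page image."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return math.ceil(pixels * bits_per_pixel / 8)


if __name__ == "__main__":
    pages = 350                        # page count mentioned in the thread
    batches = plan_batches(pages, 40)  # 40 pages per batch, the upper limit suggested
    print(len(batches), "batches; the last covers pages", batches[-1])
    for dpi in (300, 400):
        # Assumed for illustration: 6 x 9 inch page, 8-bit grayscale
        per_page = uncompressed_scan_bytes(6, 9, dpi, 8)
        total_gb = per_page * pages / 1e9
        print(dpi, "dpi:", round(total_gb, 2), "GB uncompressed for the whole book")
```

Because pixel count grows with the square of the resolution, going from 300 to 400 dpi multiplies the raw image size by (400/300)^2, roughly 1.78, which is why the storage caveat matters.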
spinlps Posted October 8, 2009 Report Posted October 8, 2009 [spinlps wrote: It shouldn't take long IF you have an extra copy of the book that you are willing to unbind. That way the pages could be placed in the automatic document feeder in reasonable increments (25, 50, 75p, etc...). Scan duplex and you're on your way!] [ejp626 wrote: Yeah, pretty much my advice. This does assume Allen has a multiple page scanner and not a flatbed!] Well - He could always go to Kinko's or a local mom & pop copy shop to see if they can do it for him. Kinko's may ask him for proof that he's the copyright holder though. Quote
Dan Gould Posted October 8, 2009 Report Posted October 8, 2009 Since my scanner is flatbed and pretty damn slow, I'd probably be faster if I re-typed it. And I'm a fast typer. Quote
AllenLowe Posted October 8, 2009 Author Report Posted October 8, 2009 thanks for the continued advice - I am thinking I should approach Kinkos first and see what they say - the copyright is clearly marked as mine on the title page - and I have enough books to take one apart - Quote
rockefeller center Posted October 8, 2009 Report Posted October 8, 2009 With GOCR I'm not sure whether a wrapper exists that lets you "batch OCR" multiple images (for example, all TIFFs in the folder my_book: /images/my_book/*.TIF) and stream the results into ONE text file. On my platform I've been using tesseract in combination with ocube, which lets you do exactly that. There should be a tesseract binary for Windows, but I don't know if there's a wrapper equivalent to ocube. Maybe try a web search for windows batch ocr or something like that. Quote
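The "batch OCR into ONE text file" idea can be sketched in a few lines of Python. This is not the ocube wrapper mentioned above, just a minimal stand-in under stated assumptions: it assumes the `tesseract` command-line tool is installed and on the PATH, that the installed version accepts `stdout` as its output name, and that page images are named so their numbers give the page order. The function names are made up for illustration.

```python
import re
import subprocess
from pathlib import Path


def page_order_key(path):
    """Sort key so that page2.tif comes before page10.tif."""
    parts = re.split(r"(\d+)", Path(path).name.lower())
    return [int(p) if p.isdigit() else p for p in parts]


def ocr_with_tesseract(image_path):
    """Run the tesseract CLI on one image and return the recognized text.

    Assumption: `tesseract <image> stdout` prints the text to standard output.
    """
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_folder_to_one_file(folder, out_file, ocr=ocr_with_tesseract):
    """OCR every .tif/.tiff in folder, in page order, into a single text file."""
    images = sorted(
        (p for p in Path(folder).iterdir() if p.suffix.lower() in (".tif", ".tiff")),
        key=page_order_key,
    )
    with open(out_file, "w", encoding="utf-8") as out:
        for image in images:
            # A blank line separates the text of consecutive pages
            out.write(ocr(image).rstrip() + "\n\n")
    return len(images)
```

For the folder in the post this would be called as `ocr_folder_to_one_file("/images/my_book", "my_book.txt")`, where `my_book.txt` is a placeholder output name. The `ocr` parameter exists so a different engine (GOCR, for instance) could be swapped in without touching the loop.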