Why no OCR?
A typewritten or printed label like this should be easily amenable to optical character recognition, which you could then have volunteers correct. That would greatly reduce the labor involved and give you results more quickly.
I wondered the same thing, but then I remembered that OCR software is far from perfect, especially when the image is less than perfect too. First, the same info is often in different locations on the labels (I've only done the plant specimens, but looking at some of the insect images, the data/info seems to be all over the place!). Secondly, I have come across some hand-written info; OCR usually ignores that at best or produces 'garbage' at worst.
Note also that sometimes the largest amount of text is actually split between the "Location" and the "Habitat & Description" fields. But sometimes some of the info is applicable to both fields, so writing additional rules/scripts to decide what goes where could be very time consuming. Finally, converting the label info into text is only part of the problem. Getting it into the correct fields of the database is maybe even more important. While humans will certainly make mistakes, they are still the easiest and certainly the most abundant resource for that procedure when the info is highly irregular in presentation. But like most of the projects at Zooniverse, there will be multiple people entering info on the same image. Comparing the data from multiple records IS a simpler task for a computer. Comparing nearly identical items/records can be extremely fast with 'stupid' computers that never tire of looking at what might be identical objects. And they are also well suited to spotting the one or two objects that are different from all the others. So, a human mistake on data entry can be easily spotted and discarded, along with all but one of the 'correct' ones.
This is one of the benefits of 'shared computing' or, more accurately in this instance, 'shared input.' Besides, it's fun and even educational for our human 'computers!' 😉
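The comparison step described above could look something like the following. This is only a rough sketch of a majority vote over one field, not how Zooniverse actually reconciles records; the function names, the 50% threshold, and the sample entries are all invented for illustration.

```python
from collections import Counter

def normalize(value):
    # Collapse whitespace and case so trivial differences do not split the vote.
    return " ".join(value.split()).lower()

def consensus(entries, min_share=0.5):
    """Return (value, agreed) for one field typed in by several volunteers.

    `agreed` is True only when more than `min_share` of the entries match
    after normalization (the threshold is an assumption, not a project rule).
    """
    if not entries:
        return None, False
    counts = Counter(normalize(e) for e in entries)
    winner, votes = counts.most_common(1)[0]
    agreed = votes / len(entries) > min_share
    # Keep one original spelling of the winning answer and drop the rest.
    original = next(e for e in entries if normalize(e) == winner)
    return original, agreed

# Invented sample: four volunteers, one typo, one stray double space.
transcriptions = [
    "Yolo Co., Davis",
    "Yolo Co., Davis",
    "Yolo Co.,  davis",
    "Yolo Co., Davies",
]
value, agreed = consensus(transcriptions)
print(value, agreed)
```

When no answer gets a clear majority, `agreed` comes back False, which would be the cue to show the image to another person rather than guess.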
having worked with OCR software in a news environment for a number of years, I can attest to the fact that OCR is not perfect - it must always be followed up by a human
that being said, i strongly urge that every block chosen by the reviewers be OCR'd and that data be made available for copy/paste by the reviewer, encouraging the user to ensure the accuracy of the OCR.
in the ones i've worked on so far, i can see that a number of them are typewritten and, if the OCR of that text was available in an open text field, searching for a species' or collector's name would be immediately available. some of us might like to say "i'll work with all of those collected by X please", or "i'd like to do virginia"
in the system we work with, that i helped build, we also used OCR'd data for suggestions for fields - country, state, and date are easy enough, some other fields are doable, if tricky
this would obviously not be true for all of them, but working out the infrastructure to make it happen could be useful for any scanned data of this type.
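to give a feel for the "suggestions for fields" idea, here is a rough sketch and not the system mentioned above: it looks for a date and a state name in raw OCR text and offers them as suggestions only. the lookup table, the date pattern, and the sample label are all made up for illustration, and a real version would need far more patterns.

```python
import re

# Hypothetical, deliberately tiny lookup; a real one would cover every state
# and the many abbreviations collectors used.
STATE_NAMES = {
    "virginia": "Virginia",
    "va.": "Virginia",
    "california": "California",
    "calif.": "California",
}

MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
# Matches dates written as day, month name, four-digit year, e.g. "14 June 1952".
DATE_RE = re.compile(
    rf"\b(\d{{1,2}})\s+((?:{MONTHS})[a-z]*)\.?\s+(\d{{4}})\b", re.IGNORECASE
)

def suggest_fields(ocr_text):
    """Return field suggestions found in OCR text; the volunteer still confirms them."""
    suggestions = {}
    match = DATE_RE.search(ocr_text)
    if match:
        day, month, year = match.groups()
        suggestions["date"] = f"{int(day)} {month[:3].title()} {year}"
    # Take the first token that matches a known state name or abbreviation.
    for token in ocr_text.lower().replace(",", " ").split():
        if token in STATE_NAMES:
            suggestions["state"] = STATE_NAMES[token]
            break
    return suggestions

# Invented label text standing in for OCR output.
label = "VIRGINIA, Giles Co.\nMountain Lake, 14 June 1952\nColl. J. Smith"
print(suggest_fields(label))
```

anything the patterns miss simply comes back empty, so the reviewer fills it in by hand as they do now.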
actually, in my experience, the toughest part is getting a good scan
let me know if you'd be interested in any help in this endeavor, and thank you so much for all you've put into this system so far
by bumishness admin
Like xairbusdriver and chmbrin have stated, it isn't so much about retrieving the raw text, but about providing contextual information, i.e. separating the blocks of text into the different fields. Though the actual transcribing is important too! Especially on the handwritten records, where OCR fails pretty miserably.
This reminds me that Distributed Proofreaders scans books in the public domain with OCR, then has proofing done by several levels of readers. Eventually the books are posted online at Project Gutenberg. It's a huge job to proofread, but the goal there is to get a book format. Our bugs and plants are aiming at a different final format.
Anyway, there are lots of experienced people around.
What? No OCR? Why, I beg to differ. 😃 There are plenty of OCR apps that will read right off your screen. This is the one I use. Of course, you have to check over the text carefully afterwards to fix "scanno" errors but I'd say it halves the time and effort to do the transcriptions.
It only works on typewritten or printed output. Anything handwritten, even neatly hand printed, comes out as gibberish. OK... other than Robert Kral's pre-mid-1970's typewriter... ooh I hate that rickety broken thing.
by poboyski scientist
Some very keen observations. OCR has long been on our list of tools to develop for transcribing label data. As several of you pointed out, there are several challenges: 1) to find the text in the image, 2) to read the text (turn squiggles on a page into text), and 3) to interpret what the text means (i.e., populate the proper fields in a database).
The CalBug team is collaborating with a computer science team that studies "Text in the Wild" (i.e., finding street signs, restaurant names, etc. in a photograph and extracting the text). We now have a pretty good handle on this, but handwritten text remains a challenge. The bigger challenge is parsing the data to the proper fields in the database. Because our insect labels are so small, the information is highly abbreviated, but not in consistent ways. The order, or even the presence, of information on the label is also idiosyncratic (100 years of collectors, each with their own way of doing things). In the plant world a lot of progress has been made to semi-automate reading and transcribing text to a database (there are at least three software packages that are pretty good at this). But they still require someone at the console guiding the transcription.
We have been talking about adding an OCR function to Notes from Nature. We are not sure what form it would take or how it would be used, but it seems like a smart way to go. A screenshot reader would be awesome! In the meantime, we appreciate your ideas - keep 'em coming!