Cross-breeding/hybrid info?

by xairbusdriver

If I'm reading these labels correctly, the newer one is simply proposing a name ("pro sp.") for the cross-bred E. capillifolium and the E. perfoliatum mentioned in the original specimen label. I'm assuming that the later label is claiming that the specimen is actually the hybrid and NOT either of the two "parents." Is that correct? Or does it really matter?! 😉

Frankly, the "X" in the "Eupatorium X pinnatifidum" 'name' is confusing to me (although that is a 'state' I often visit!).

Secondarily, is the "Ell." in the newer label the abbreviated name of the "Scientific Author?" Should that be included?

Posted April 27, 2013 10:44 PM
by okopho

My interpretation (after a bit of research on teh interwebs):

The top label indicates that these plants were once considered to be a species (Eupatorium pinnatifidum), but are now considered to be a hybrid of E. capillifolium and E. perfoliatum. The hybrid retains its original species name, but with an X infix. The (pro sp.) is an abbreviation of the term pro specie, which means the name has been transferred from the species to the hybrid (the opposite term is pro hybrida). The author who originally gave the name Eupatorium pinnatifidum to these plants was Stephen Elliott (1771-1830).

The guidance says that the scientific name should "include at least genus and species as written". So I would conclude that Eupatorium X pinnatifidum would be the most minimal acceptable entry. The equivalent for the scientific author would be Ell.
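To make the structure of such a name concrete, here is a rough Python sketch that splits a label string into genus, hybrid marker, epithet, author and parenthetical note. The pattern and the function names (parse_name, minimal_entry) are assumptions made up for illustration; they are not part of the project's software and will not cover every label format.

```python
import re

# Hypothetical pattern: Genus [X] epithet [Author] [(note)]
PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+"       # genus, e.g. Eupatorium
    r"(?:(?P<hybrid>[Xx×])\s+)?"        # optional hybrid marker
    r"(?P<epithet>[a-z-]+)"             # species (or hybrid) epithet
    r"(?:\s+(?P<author>[^()]+?))?"      # optional author, e.g. Ell.
    r"(?:\s*\((?P<note>[^()]*)\))?\s*$" # optional note, e.g. (pro sp.)
)

def parse_name(label):
    """Return the named parts of a label as a dict, or None if it does not fit."""
    match = PATTERN.match(label.strip())
    return match.groupdict() if match else None

def minimal_entry(label):
    """Genus and epithet as written, keeping the X when it is present."""
    parts = parse_name(label)
    if parts is None:
        return None
    marker = f" {parts['hybrid']} " if parts["hybrid"] else " "
    return parts["genus"] + marker + parts["epithet"]

print(parse_name("Eupatorium X pinnatifidum Ell. (pro sp.)"))
print(minimal_entry("Eupatorium X pinnatifidum Ell. (pro sp.)"))
```

With the label in this thread, the author part comes out as Ell. and the minimal entry as Eupatorium X pinnatifidum.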

Probably other volunteers would make different choices about what information to enter and where. But that's the nature of crowd-sourcing. There are no 100% "right" answers. Everybody just gives it their best shot, and it's then up to the software to separate the wheat from the chaff.

Posted April 28, 2013 12:34 AM
by ghewson

There was a suggestion (which I can't find now) from a scientist a couple of days ago saying to put just the genus, i.e., Eupatorium in this case. This will indicate at a later date that the record needs fixing up. I'm also tagging these with #cross.

I again call on the scientists to write an FAQ...

Posted April 28, 2013 8:25 AM
by okopho in response to ghewson's comment.

If you're referring to this thread, then that suggestion would only apply if this is a mixed sheet. However, it's not apparent from the labels shown above that this is the case - the top label is clearly referring to a single (albeit hybrid) specimen. Of course, without seeing the original image, it's hard to be sure about this.

But, yes, if it is a mixed sheet, another option would be to enter only the genus - but then what should be entered for the scientific author (given that there will probably be a different one for each specimen)?

As for a FAQ: I don't think it will really solve anything, because many (most?) of the volunteers who contribute will never read it (or even visit the forum). Instead, I would hope that the focus of the effort is concentrated on improving the design of the interface so that volunteers never need to read a FAQ. The technical details can be interesting (if you like that sort of thing), but a good crowd-sourcing project should never require that level of interest from its volunteers. Ideally, everyone should be able to make a valuable contribution without any special knowledge at all.

Posted April 28, 2013 4:10 PM
by xairbusdriver

Ah yes, "ideally" and "should" (often followed, un-necessarily, by "be"). 8) I suspect that the planners of this project did not imagine the seemingly infinite variations that are in these labels and 'sub' labels (and 'sub-sub' labels!). Frankly I'm amazed at how well most collectors seemed to put in 99% of the info! As I proved in another thread, by my own stupidity, you just can't make something idiot pruff! LOL!

Unfortunately, I agree that a FAQ is seldom read by even the majority, much less by those who could benefit from it. And in this project, it still raises the question of how many possible problems might need addressing.

As for details, I'd like to know if clicking the "Skip this Field" actually places some non-printing character (a "null" for example) rather than leaving it blank. Any character would be better than nothing to show that there may be a problem with that specimen. It shows that a conscious decision was made not to enter anything rather than simply being an error. Who knows (followed closely by 'who cares!')? 😃

Posted April 28, 2013 10:25 PM
by okopho in response to xairbusdriver's comment.

Well I'm going to be contrary and suggest that the planners know only too well how much variation there is in the data. The very fact that they are using a crowd-sourcing solution is testament to that. If the data was nice and regular, they could (and would) have simply automated the whole thing.

The power of crowd-sourcing is that, counter-intuitively, both "wrong" and "right" transcriptions can be valuable. Each record is processed by several volunteers, with all the transcriptions going into a pool to be analysed by the software. Effectively, the volunteers are "voting" on what they think each field should be. The software has no idea what the "correct" inputs are - it just counts "votes". If there's a high level of agreement, the input will be accepted; if there's a low level of agreement, the input can be flagged as requiring further attention.
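The "voting" idea can be sketched in a few lines of Python. This is only an illustration of the general technique: the function name consensus, the light normalisation (case and whitespace) and the 60% agreement threshold are all assumptions, not a description of how the project's software actually works.

```python
from collections import Counter

def consensus(transcriptions, threshold=0.6):
    """Count the 'votes' for one field.

    Returns (most_common_value, accepted). accepted is False when the
    level of agreement is below the (assumed) threshold, meaning the
    field would be flagged for further attention.
    """
    if not transcriptions:
        return None, False
    # Ignore case and extra whitespace so trivial differences do not split the vote.
    votes = Counter(" ".join(t.split()).casefold() for t in transcriptions)
    value, count = votes.most_common(1)[0]
    return value, count / len(transcriptions) >= threshold

# Two of three volunteers agree, so the value is accepted.
print(consensus(["Eupatorium", "eupatorium ", "Eupatorum"]))
# No agreement at all, so the field is flagged.
print(consensus(["Ell.", "Elliott", "L."]))
```

Note that nothing here knows the "correct" answer; the function only measures how far the volunteers agree with each other.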

I think people worry far too much about making what they see as mistakes. Really, the people who run projects like this are grateful for every transcription, "right" or "wrong", that is contributed. The system is designed to handle input with a very high degree of redundancy, and so, above all, it wants lots and lots of contributions from a large and diverse pool of volunteers. Every "vote" counts!

I hope this makes it clear that the skipping of fields is a total non-issue. There are certain "key" fields that the software expects to be present on every record, and so it will ask for confirmation if they are skipped. Now, if the majority of volunteers who process the record skip the same "key" fields, then that should be sufficient to highlight any problems. No need for any special flagging.

TL;DR

Stop wasting time worrying and do more transcriptions! 😉

Posted April 29, 2013 2:18 AM
by ghewson in response to okopho's comment.

I've been wondering how the software deals with the freeform descriptions, which are almost bound to be different from one volunteer's transcription to the next.

Posted April 29, 2013 8:10 AM
by robgur scientist, admin

We are wondering too! This will be an area of active research in the next weeks and months. Unlike more bounded transcription in tools such as Old Weather, there are some long passages and there is also some flexibility in what is considered e.g. habitat and locality. We'll be posting a lot more about how we manage to collect and harmonize the data we are getting etc. and we'll probably be looking for great ideas here too!

Posted April 29, 2013 2:28 PM
by okopho in response to ghewson's comment.

Yes, that's a good point. I suppose the software could use fuzzy-matching to eliminate minor differences. Even more simply, it could just require a relatively high number of volunteers to transcribe each record: bigger samples tend to provide more accurate results (up to a point).
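As a rough idea of what fuzzy-matching could look like, the sketch below groups free-text transcriptions that are nearly identical, using Python's standard difflib module. The helper names (similar, group_similar), the 0.9 similarity cutoff and the sample habitat strings are assumptions chosen for illustration; the project may well use something quite different.

```python
from difflib import SequenceMatcher

def normalise(text):
    """Collapse whitespace and ignore case before comparing."""
    return " ".join(text.split()).casefold()

def similar(a, b, cutoff=0.9):
    """True when two transcriptions differ only in minor ways."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio() >= cutoff

def group_similar(entries, cutoff=0.9):
    """Put each entry into the first group whose first member it resembles."""
    groups = []
    for entry in entries:
        for group in groups:
            if similar(entry, group[0], cutoff):
                group.append(entry)
                break
        else:
            groups.append([entry])
    return groups

entries = [
    "Dry sandy roadside, pine woods",
    "dry sandy  roadside, pine woods.",
    "Wet ditch near river",
]
for group in group_similar(entries):
    print(len(group), group[0])
```

The size of each group can then be treated as its number of "votes", so minor punctuation or spacing differences no longer count as disagreement.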

This issue is the main reason why sticking to verbatim transcriptions is probably the best overall approach. It's the easiest way to get high levels of agreement, because there are no special rules to learn: you just type what is written. With this approach, volunteers can accurately transcribe printed texts in a language that is completely unknown to them! In fact, it's probably easier to transcribe text that you don't understand (which is the reason why many citizen science projects tend to have quite minimal tutorials).

The more difficult problem is deciding which parts of the transcription go in which fields, because that necessarily requires some degree of interpretation. Inevitably, every volunteer, however knowledgeable, will make mistakes. But that's perfectly okay. If the average error rate is 10%, by the end of the project, 90% of the work will be done!

Posted April 29, 2013 3:02 PM
by jaymoore

I've started a collection called 'crosses' to gather these in when I come across them, and someone else suggested adding the hashtag #cross to them, so I've started doing that as well.

Posted April 29, 2013 4:41 PM