Tuesday, April 6, 2010

Today's Wolfram Alpha query

Today's query is:


So Alpha is coming along (some folks have taken to calling him HAL), though for now the Alpha team has to actively curate the information, which means they are scrubbing all the data sets he has to work with. This is obviously not ideal - he should be acquiring and scrubbing his own data sets. Turns out, though, that this is mind-bogglingly difficult, for several reasons.
  1. Veracity of a data source is hard to quantify: Wikipedia? *snort* (only good 85% of the time, ish)
  2. Scoping a given data set is also hard - where are the edges? Fuzzy sets require gobs of resources.
  3. Standardizing the interfaces of data sets - particularly widely disparate ones - is simply daunting.
  4. The decision matrix for keeping a data set current is highly subjective and expensive in terms of resource allocation.
  5. And what to do with punch outs? Sometimes the best answer is to point at something else. So far, though, the Alpha team is doing no punch outs. This has to be because the rules are just too complex. (Might be an issue of not wanting to feel like ask.com, which offers punch outs instead of curated data sets.)

  6. Seems to me that a very important next step for these guys is to implement a feedback system so that a user can tell Alpha directly how valuable his answer was. Once they do this, he can begin learning by SARSA methods. (SARSA is a reinforcement learning algorithm; the name stands for State, Action, Reward, State, Action.)

    Using SARSA, Alpha can begin trying new things with users, and when they rate the results highly (give him a big Reward) he will have a better idea of how to answer (what Action to take) when he later receives that same question (is in the same State again).
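For the curious, here is a rough sketch of what that rate-and-learn loop might look like as tabular SARSA. To be clear, this is my own toy illustration and not anything the Alpha team has described: the query categories, the answer styles, the `simulated_rating` stand-in for real user feedback, and all the parameter values are assumptions made up for the example.

```python
import random
from collections import defaultdict


class SarsaAgent:
    """Tabular SARSA: learns Q(state, action) from (S, A, R, S', A') tuples."""

    def __init__(self, actions, alpha=0.1, gamma=0.5, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # discount on the value of the next step
        self.epsilon = epsilon    # how often to try something new
        self.rng = random.Random(seed)
        self.q = defaultdict(float)  # (state, action) -> estimated value

    def greedy(self, state):
        """The action currently believed to be best in this state."""
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def choose(self, state):
        """Epsilon-greedy: mostly exploit, occasionally explore."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return self.greedy(state)

    def update(self, s, a, reward, s_next, a_next):
        """SARSA update, using the action actually chosen for the next state."""
        target = reward + self.gamma * self.q[(s_next, a_next)]
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])


# Hypothetical stand-in for real users: the answer style each kind of query "likes".
PREFERRED = {"math": "compute", "geography": "curated_table", "news": "punch_out"}
ACTIONS = ["curated_table", "compute", "punch_out"]


def simulated_rating(state, action):
    """Pretend user feedback: a big Reward for the preferred style, nothing otherwise."""
    return 1.0 if PREFERRED[state] == action else 0.0


def train(steps=5000, seed=0):
    rng = random.Random(seed)
    agent = SarsaAgent(ACTIONS, seed=seed)
    states = sorted(PREFERRED)
    s = rng.choice(states)          # State: the kind of question asked
    a = agent.choose(s)             # Action: how to answer it
    for _ in range(steps):
        r = simulated_rating(s, a)  # Reward: the user's rating
        s_next = rng.choice(states) # next State: the next question to arrive
        a_next = agent.choose(s_next)  # next Action
        agent.update(s, a, r, s_next, a_next)
        s, a = s_next, a_next
    return agent


if __name__ == "__main__":
    trained = train()
    for state in sorted(PREFERRED):
        print(state, "->", trained.greedy(state))
```

In this toy setup the next question does not depend on how the last one was answered, so it is close to a simple bandit problem; the SARSA form only starts to earn its keep once an answer can influence what the user asks next.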

    While this will mitigate some of the difficulties above (the curators don't need to decide boundaries, interfaces, etc.; Alpha can figure them out for himself), it creates an entirely new - and very large - problem set. It probably has some fancy name in the machine learning literature which I've not learned yet, but what it is, is parenting.

    I'll put together some thoughts about that for my next note.

    -j
