
I’ve recently switched mobile broadband provider away from O2 and I was keen to ensure that I could save copies of all the SMS messages I had received whilst using the service. With older mobile handsets this is also a major hassle, as I found out when decommissioning an old Sony Ericsson phone.

On Mac OS X liberating a copy of your text messages isn’t an issue, since the Mobile Connect software O2 provide can export your messages (either the ones on the SIM or the ones saved locally) to a CSV file directly from the application. For Windows machines running the O2 Connection Manager application there is no direct way to get at either the messages on the SIM or those saved to the local machine. To get the messages off the SIM I ended up using a Mac with Mobile Connect, but this was not going to help with the locally saved messages: whilst O2 Connection Manager could copy messages from the SIM into the saved folder, it couldn’t do it the other way. So I started to look for where this data might be stored on my machine.

This process was very frustrating, as none of the standard locations where I’d expect to find the database file contained anything that looked like a text message database. Along the way I found some interesting files, such as one containing a list of various fast food restaurants and coffee shops that provide WiFi access points, but in none of the standard configuration-file locations could I find the SMS message store. I even tried checking the last modified times of files in those locations in case it was somewhere non-obvious, but this also drew a blank. In the end I decided to be a bit more sneaky: in Windows it’s possible to find out what file handles a given process has open, and in the case of O2 Connection Manager the process is called tscui.exe.

There are many programs that will inspect the file handles held open by a given process, but I got good results with the freeware ProcessActivityView. It has a session recording feature that is very useful when a process rapidly opens, writes to and closes a file, since that behaviour is tricky to catch with programs that simply provide a snapshot of the current state.

So what did I find? Well, the files that were accessed when I opened the SMS component of Connection Manager were in %APPDATA%\Tatara Systems, where %APPDATA% is an environment variable that expands to whatever Application Data directory belongs to your local login (for me it was C:\Documents and Settings\axiomsofchoice\Application Data). I have no idea what connection Tatara Systems has to the Connection Manager software, but I’d found what I was looking for and, fortunately, it was an easily parsed XML file and not, say, a binary database or something encrypted.
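For anyone attempting the same rescue, the last step is a few lines of scripting. Below is a minimal sketch using Python’s standard-library XML parser. Be warned that the file name and the element and attribute names here are hypothetical stand-ins – I’m not reproducing the actual schema – but the approach of walking the tree and writing each message out to CSV carries over directly:

    import csv
    import os
    import xml.etree.ElementTree as ET

    # Hypothetical names throughout: the real file under %APPDATA%\Tatara Systems
    # has a different name and schema, but the shape of the task is the same.
    sms_store = os.path.join(os.environ["APPDATA"], "Tatara Systems", "sms_store.xml")

    tree = ET.parse(sms_store)
    with open("saved_sms.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["sender", "timestamp", "text"])
        for msg in tree.getroot().iter("message"):  # one element per saved SMS
            writer.writerow([
                msg.get("sender"),       # attribute names assumed, not verified
                msg.get("timestamp"),
                (msg.findtext("body") or "").strip(),
            ])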

I hope this post will be of help to anyone else who finds themselves in a similar position. Remember, data portability is a primary use case for any application in which your personal data is stored.

I wonder if you’ve seen any of the coverage of the Authors Guild vs. HathiTrust case (in case not, there is a round-up of it here)? The AG, by identifying candidates that should not be considered as potential orphan works, have been attempting to show that the HathiTrust’s process for identifying orphan works is not fulfilling its due diligence as effectively as it should. As such this represents a serious threat to any efforts intended to address the problem of orphaned works. The British Library estimates that about 40% of in-copyright works are orphaned, so the issue for scholarship is acute.

As pointed out here, the efforts the AG are going to actually show that the system works (when you have enough eyes, of course – as they now do), though it could do with some improvement. The titles they’ve identified so far have been removed from the candidates list, which is given here and reproduced on the blog post I linked to above.

I’m just wondering if there’s any mileage in learning from the successes of this crowdsourcing effort and using it to compile a more effective set of tools for working out whether a work is truly an orphan work – certainly one better than the process the HathiTrust have outlined. In particular, the sources that those helping with the current crowdsourcing effort are using seem to speed up the process considerably. Might it be useful to set up an EtherPad to help compile this information? Do you know of anyone who might like to help crowdsource rights holders, crowdsource the process itself, or who might benefit from having information about these resources?

So one of my metrics rants (the other main one being the lack of objectivity in the term “impact,” but that’s for another blog post :) ) is that you cannot expect any system of metrics to predict in any reliable way “the next Einstein,” because there aren’t enough instances of out-of-nowhere breakthroughs in science to learn from (cf. The mismeasurement of science). It’s just not sensible to try to make predictions about outliers. However, there is enough available data to make sensible predictions about what Thomas Kuhn calls Normal Science.

So it was with some interest that I stumbled upon the Thomson Reuters Citation Laureates, an on-going attempt to predict the next set of Nobel prize winners on the basis of citations amassed by top-flight academics. My understanding is that Nobel prizes in science are awarded for specific advances in research rather than lifetime achievement. This is the reason, for instance, why Stephen Hawking, despite having a glittering academic career and many awards to his name, does not have a Nobel prize: the predictions his theoretical work makes have yet to be fully borne out by experimental results. The interesting thing about the Citation Laureates project is that it has actually been moderately successful.

Of course this isn’t totally surprising: to make any significant progress in any field you need to have put in a huge amount of time and effort mastering it, so it’s to be expected that breakthroughs correlate with mastery of a subject, and just tracking the masters increases your probability of success. But not all masters of a subject make breakthroughs in it, and it’s precisely the breakthroughs that the Nobel prizes reward. Conversely, I would say that such predictions might actually be more appropriate for awards that recognise a body of work, such as the Fields Medal or the Turing Award, since key career milestones are more easily tracked and predictions made from them.

We will doubtless some day see the next Einstein, an individual who at a stroke changes the way we understand the physical world around us, but by the time we’ve realised that they’ve arrived, no one will care who called it first.

The single key idea I wish to promote in this blog post is that the scholarly literature, despite attempts to recast it in other molds, is a single continuously evolving network-structured object. As such I would like to suggest that the only present-day technology that naturally matches this model is the World Wide Web as originally conceived.

It has often been remarked how curious it is that the cradle of the Internet and the World Wide Web was academia, and yet academia has been one of the few spaces in which these two technologies have yet to truly disrupt the pre-existing technologies. An important driving force behind this has been the way these prior technologies have so thoroughly engulfed our thinking about how knowledge is formed, transmitted, debated, acquired and transformed that we require of any new technology that it be cast in superficially the same mold. This print-publication model of literature, though disruptive in its day, has obvious inherent shortcomings which, as print publication became commonplace and these shortcomings consequently became less obvious, have insidiously manifested themselves as net inhibitors of scholarly discourse.

In contrast to the written word, the printed word gave almost total confidence in the consistency of texts between disparate copies, which in turn gave those words far greater authority than the spoken word could ever achieve. However, this also gives rise to the implicit assumption that the ideas embodied in those texts have a certain degree of fixity, irrespective of whether or not this is warranted. Deference to a canonical source invites us to absolve ourselves of any critical thinking. It turns us into passive consumers of content.

The task of identifying such canonical sources within the literature becomes a progressively more foolhardy enterprise the further we pursue it. Those who seek such sources face, in general, an insurmountable problem in reconciling the many views on any given idea, for the simple reason that there are at least as many conceptions of ideas as there are individuals, and there is an inexhaustible supply of things we can disagree on. Put another way, each of us has our own “name space” and no one of these has a claim to a privileged position amongst all the others.

If none of these viewpoints is privileged then we must consider ways of interacting with the literature that take as many of them into account as possible, and further we should understand how they relate to each other, i.e. the network of interrelated ideas. It is in this way that I would like us to consider the literature as a single network-structured object. I’ve already remarked that ideas and knowledge are never fixed, so the other fundamental aspect of the global view of the literature as a single object is that it is continuously evolving.

The print publication may have a physical permanence derived, again, from an inherent shortcoming of its technology, but in practical terms it has a finite shelf-life, afforded it by its links with the core subset of the corpus that present-day scholars find useful or can agree fits in with current norms. As those links are broken we reach a point where the physical item falls into disuse, even though the ideas embodied within it live on in the literature as a whole through the acts of transmission, debate and transformation we have already identified.

Consider now the World Wide Web, which I would consider to be antithetical to any model of permanence. By its very nature it defies all practical means of preservation. If I link to an image hosted elsewhere I have no guarantee that the image will be the same every time someone visits my page. Indeed, I may wish this to be the case. This dynamic behaviour models very closely the way ideas are disseminated through the informal discourse that happens all around the formal print-publication record.

The Web has given us the beginnings of a toolset that will increasingly allow us to see the literature in its entirety, as a single object. The individual components such as web servers, web pages, agents, tweets, &c., &c., may appear to have more or less importance than other individual components on the web, but it’s their connections and relations within the whole network that give them a context for the importance we assign to them. An importance that is continuously evolving.

A particular Wikipedia page may, for instance, give the clearest description of a particular subject of interest of any page on the Web at a particular instant in time. However, a Wikipedia page on a controversial subject is next to useless unless you’re able to assess its contents relative to other pages on the Web that cover the same subject from alternative standpoints. Again, it’s the collection of pages and ideas which is important here, not any particular one that might assume a privileged position. When we start to think in these terms we start to see moves to stake a claim to a piece of this “knowledge real estate” as essentially a waste of time, not least because the only way to define appropriate boundaries – by fixating on a particular composition of text – ignores the inherent shortcomings of doing so. As has often been remarked, information wants to be free, so the network treats censorship as a form of damage and routes around it, which it can easily do by, for example, finding a new form of words to express the same idea.

What I mean by these latter two statements is that if we wish to see an end to copyright we need to begin to think of the literature as a single continuously evolving network-structured object, one that exists natively on the Web and into which content and ideas can be freely formed, transmitted, debated, acquired and transformed.


I’ve just come back from two days of hacking at HackCamp 2010, hosted at Google in London. It was great to see the diversity of projects people were working on. The project I decided to tackle turned out to be far more ambitious than was possible in under 24 hours of coding, but my collaborator @leipie and I made great progress with identifying the necessary components in the stack for a future implementation. Since next weekend is Science Hack Day 2010 I believe this project would be suitable to take on over that weekend too. I intend to resolve most of the additional problems raised by this weekend’s work in future blog posts during the week, with the next one in particular discussing the chosen components and the thinking behind each. In the meantime, below is a dump of the idea and its motivation:

  • There are plenty of great examples of long-standing open problems in theoretical computer science and math; for many of these there is a strong belief, based on past experience, that the solution (should one actually exist) requires thinking somehow “outside of the box”.
  • I make this “thinking outside of the box” concept more concrete in the following way: almost all examples of purported solutions to these open problems follow standard patterns, although their details differ. Hence if, given a steady stream of these potential solutions, you can find a way to annotate each new one and compare its pattern of proof with those found in the ones you’ve already received, then as soon as you encounter a solution that doesn’t follow the pattern it will stick out like a sore thumb (see the sketch after this list).
  • These anomalous solutions may not actually solve the problem but they may signal potential new avenues of attack, hopefully meaning the solution is reached far quicker. In fact certain lemmas within the proof may be entirely correct but the rest of the proof totally bogus – there should be a way of reusing just those parts that were correct, assuming they actually help with finding a solution.
  • Anyone will be allowed to submit as many solutions (read: published papers) as they wish. Annotation will be done by the community, and anyone in the community will be able to contribute annotations. As a result these solutions are judged on their merit against each other, and the winners of this “competition” are those solutions which contribute novel ideas, in the sense that they rise to the top of any listing of solutions.
  • The annotations are at a fairly coarse granularity (compared to formal proofs supplied to verification procedures), roughly at the level of proof technique (e.g. “this section is proof by induction” or “here they used diagonalisation”). Another way to think of this is that it’s kind of at the level of “hand-wavy” styles of proof :)
  • (To allow reuse there could be a concept of ‘forking’ someone else’s “paper” submitted to the system)
  • This system is supposed to contrast directly with the arXiv, where such “out there” solutions are less likely to appear due to the filtering process, although Perelman’s recent solution to the Poincaré conjecture is one very notable exception.
  • The situation is even worse with traditional publishing, because nobody gets to see the rejects and so these cannot be used to train a filter. The reason for this is simple: a small group of reviewers simply does not scale. I actively want this system to attract many of those who might be considered “cranks” or “nut jobs”, i.e. the sorts of people who think they have solved the hardest problem in a certain field in a single page of text with no equations. Even if they haven’t, they may actually have interesting ideas that are worth filtering on. I want this system to be scalable enough to cope with, and indeed thrive on, the data they provide.
  • The key open problem I have chosen here is “P vs. NP”, because there are lots of papers out there from a diversity of sources that I can use to test the system from the outset.
  • One implementation issue I’d like to address is the centralized gatekeeper-like nature of arXiv, although in the first instance the system will have to be centralized so that code development can be bootstrapped. The eventual hope is that anyone who wishes to submit solutions can do so on a server of their choice and this system will aggregate feeds from each such server.
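To make the annotate-and-compare step sketched above a little more concrete, here is a minimal toy version in Python. The papers and technique tags are entirely made up for illustration; the point is just that once each submission is reduced to the bag of community-assigned tags, a submission whose tags are rare across the corpus ranks highly:

    from collections import Counter

    # Each submission reduced to its set of community-assigned technique tags.
    # All of these entries are invented purely for illustration.
    submissions = {
        "paper-001": {"diagonalisation", "relativisation"},
        "paper-002": {"diagonalisation", "counting argument"},
        "paper-003": {"diagonalisation", "relativisation"},
        "paper-004": {"geometric complexity theory", "algebraic geometry"},
    }

    # How often each technique appears across the corpus.
    tag_counts = Counter(tag for tags in submissions.values() for tag in tags)
    total = len(submissions)  # needs at least two submissions

    def novelty(tags):
        """Mean rarity of a submission's tags: 1.0 = seen nowhere else."""
        return sum(1 - (tag_counts[t] - 1) / (total - 1) for t in tags) / len(tags)

    # Submissions whose proof pattern "sticks out like a sore thumb" rank first.
    for paper, tags in sorted(submissions.items(), key=lambda kv: -novelty(kv[1])):
        print(f"{paper}: novelty={novelty(tags):.2f} tags={sorted(tags)}")

A real system would obviously need something sturdier than mean tag rarity (tf-idf weighting or a proper outlier model, say), but even this toy makes the anomalous proof pattern rank first.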


Last Saturday I was saddened to hear of the passing of Martin Gardner. My own relationship with Martin’s work was through his republication of, and commentary on, Silvanus P. Thompson’s Calculus Made Easy. In this little book, over the course of a summer holiday (I can’t remember now if it was 2002 or 2003), I learned about calculus for the first time. Martin’s commentary helped to put the subject into perspective in light of the radical changes in style that had taken place since Thompson first published the work, emphasizing that the core ideas were essentially no different. I can’t emphasize enough just how pivotal this event was in my journey through mathematics.

When I was at school doing A-Levels the first time around I had little interest in studying the subject. In fact I was far more interested in creating art using my computer. However, as I got further into the techniques of computer graphics and tried to code my own computer graphics tools, I very quickly realized I didn’t know enough math to progress any further. There was no getting around it: before I could do any of the cool things I wanted to do with my computer I’d have to learn calculus, and the book that was recommended to me (I think it was in some SIGGRAPH course notes) was Calculus Made Easy – what one fool can do, another can.

There are several obituaries and tributes you can read to learn more about his life and work (for instance here, here, here and here). However the best way to appreciate his work is just to pick up one of his books and start reading.

News of Gardner’s death broke late on Saturday evening. The following day I headed over to the Jam Factory in Oxford for a day of coding at Oxford Geek Jam 6. Leading up to the day we had discussed various formats we could follow but hadn’t really agreed on any one of these so I suggested that we code up one of Martin Gardner’s puzzles as a tribute to him.

Once a critical mass of coders had arrived at the event, and after a little searching, we decided that the Game of Hip would work really well. I think we were all initially more interested in coding up solutions to puzzles, since at the previous geek jam, which took the form of a coding dojo, we’d worked on a problem that we could tackle computationally. However, we could only find one example where someone had coded the game before, and that was in Pascal. We therefore also agreed we would implement a JavaScript UI for the game, with SVG rendered on an HTML5 canvas. This would make it possible for us to get a version of the game running on the iPhone (although it turned out that there were some HTML5 and/or SVG issues which meant it didn’t really work on Android).
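For anyone unfamiliar with the puzzle: in Hip the losing condition is completing a square with four of your own counters, and tilted squares count too, which is what makes it so hard for humans. As a rough sketch of the check at the heart of our game engine (here in Python for brevity, rather than the JavaScript we actually wrote), four points form a square exactly when their six pairwise squared distances are four equal sides plus two diagonals of twice that value:

    from itertools import combinations

    def is_square(p1, p2, p3, p4):
        """True if four board coordinates form a square (tilted ones included)."""
        d2 = sorted((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
                    for a, b in combinations((p1, p2, p3, p4), 2))
        # Four equal non-zero sides, two diagonals each twice the side (squared).
        return d2[0] > 0 and d2[:4] == [d2[0]] * 4 and d2[4:] == [2 * d2[0]] * 2

    def loses(own_counters, new_counter):
        """Does placing new_counter complete a square with three existing ones?"""
        return any(is_square(new_counter, *trio)
                   for trio in combinations(own_counters, 3))

    # A tilted square: three corners already down, the fourth loses the game.
    print(loses([(0, 1), (1, 3), (3, 2)], (2, 0)))  # True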

[Photo: Oxford Geek Jam 6 in full flow]

This was the hottest weekend of the year so far, and the Sunday felt even hotter than the Saturday, but the pitchers of Pimms got us through. (Incidentally, that reminds me that the Cambridge puntcon is coming up fairly soon. Unfortunately I missed it last year, and it might again be on a weekend I can’t make this year. Perhaps we need an Oxford puntcon?) Even after getting kicked out of the Jam Factory earlier than normal for staff training, we managed to find a pub with good wifi and continued coding. You can see the end result of our day here.

The format on Sunday differed somewhat from previous Oxford Geek Jams (I’d previously only been to Oxford Geek Jam 5), where hackers mostly worked individually on projects, although there was some pair programming at the last one. As we launched into developing the Game of Hip we decided that we would tackle the problem on three fronts: the core game engine, the front-end UI and the game-playing AI (sadly we had insufficient time to fully develop the latter). We then sub-divided the five-man team into smaller groups around these themes and programmed the various parts in pairs. As a result I feel we really gelled as a team, and in part this was helped by our use of the Oxford Geek Jam svn code repository. We committed changes fairly frequently, although everything got committed into the project trunk, so we often found we needed to resolve conflicts. In one respect I think this shows just how closely integrated we were working as a team, though it’s clearly not ideal to have to fix conflicts manually. However, I believe it provides a good reason for trying out git next time round, to compare the effect on our workflow.

I’m really proud of what we were able to achieve and the day has reaffirmed my belief in the principle that if you get enough bright and motivated people in a room collaborating on a great idea you can do amazing things.


The following post is a response to Peter Murray-Rust’s post “Time flies like an arrow; fruit flies like a banana. Or do they?”.

Peter, I fully agree on the fundamental importance of NLP to AI (for me it’s the most important of the so-called AI-complete problems). Indeed, it’s interesting to note that Chomsky’s work on natural language linguistics gave rise to the subject of formal languages, which includes all the computer languages in which AI solutions must somehow be written. Clearly, for efficient human-computer interaction NLP would be extremely beneficial.

However, I strongly believe we should be continually striving for increasing formalism in the end products of our labours, independently of how we arrived at them (I’m mainly thinking of scientific end products here). My definition of formalism in this context includes some reduction in ambiguity, achieved through some degree of agreement on the meaning and prescribed usage of terms.

I base this last assertion on what I think is the key scientific example of the importance of clarity in meaning, and of the logical consequences of that meaning: the revolution in thought about space and time brought about by Einstein’s Theory of Relativity. Post-Einstein, wherever you needed to talk about “time flying” (at least in scientific discourse, and more specifically physics), what was meant by that was necessarily fundamentally different from what it had been before. The previous sloppy usage was now simply unacceptable.

All of which affords me the opportunity to quote in extenso the following passage from Eddington’s Mathematical Theory of Relativity (p. 8):

Those who still insist on the existence of a unique “true time” generally rely on the possibility that the resources of experiment are not yet exhausted and that some day a discriminating test may be found. But the off-chance that a future generation may discover some significance in our utterances is scarcely an excuse for making meaningless noises.

Conversely, I think NLP techniques have even greater potential, beyond simply working out what someone has said, in the following two ways:

  1. In a grammatically accurate phrase such as “All men are mortal. Socrates is a man. Socrates is not mortal.” it should be possible for a machine to identify the obvious logical error. Much scientific discourse essentially comes down to formulae expressible in simple logics, in which it is possible for a machine to tease out seemingly subtle flaws (a toy sketch of such a check follows this list). Any formal structure captured in this process must form the basis of the end product if it is to be worthwhile.
  2. Although ambiguity is problematic if we are trying to understand what a person has said in an automated way, there are classes of case in which ambiguity has beneficial consequences. I can make an analogy here with an abstract algebra such as Group Theory, where the ambiguity in exactly what sort of thing the elements of a group are enables one to prove general theorems about groups that apply to arbitrary types of elements. Alternatively, we can take the example of Dirac’s bra-ket notation, where the individual components of a complete bra-ket have different interpretations, meaning we can view the complete bra-ket ambiguously from different perspectives, although it turns out that they are in any case equivalent.

    So my hope would be that a machine that encounters such ambiguity is able either to abstract away from it to a more general concept, or to let the ambiguity pass whilst acknowledging that the interpretations it admits are equally valid and possibly all intended. Without this latter allowance much of poetry would be impossible.
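As promised above, here is a toy sketch (in Python, with a deliberately minimal, made-up representation) of point 1: once the sentences have been parsed into simple predicate form – the genuinely hard NLP part, which I’m waving away here – very little machinery is needed to catch the error in the Socrates example:

    # Toy knowledge base after (hypothetical) NLP parsing of:
    #   "All men are mortal. Socrates is a man. Socrates is not mortal."
    rules = [("man", "mortal")]              # all X: man(X) -> mortal(X)
    facts = {("Socrates", "man", True),
             ("Socrates", "mortal", False)}

    # Forward-chain the rules to a fixed point, then look for P(x) and not-P(x).
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for pre, post in rules:
            for entity, pred, truth in list(derived):
                if pred == pre and truth and (entity, post, True) not in derived:
                    derived.add((entity, post, True))
                    changed = True

    contradictions = {(e, p) for (e, p, t) in derived if (e, p, not t) in derived}
    print(contradictions)  # {('Socrates', 'mortal')}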

Ping

So once again I return to this blog, and I’m going to start another post with an empty promise to myself to update it more often. At some point since the last post I silently changed the name and subtitle of this blog – feel free to lampoon as you see fit. Here are five things I’ve been up to since last time:

  1. Handwriting tweets using the Wacom Bamboo Pen and Touch tablet.
  2. Creating a revised version of the chem visualiser gadget – it’s not ready for the samples gallery yet but I’m working on it.
  3. I started a FriendFeed conversation about minimal scientific artefacts, which I then synthesised into a Wave – I think the minimality criterion is actually quite powerful for reasons I go into in the Wave.
  4. We had a Wave hack day at RAL which was very successful.
  5. I bought a larger antenna for my wifi card – I know this sounds really minor but it’s made a massive difference to my workstation connectivity, which in turn has made me a happy bunny.

So prepare for the deluge – I’m back in the blogosphere and this time I have something to say. Hopefully…

Please note: this post is likely to change a few times before it stabilizes as I try out different ideas.

Last night I fixed some configuration problems with the \LaTeX settings for this blog and so I thought it might be a good opportunity to get to know how to use xy-pic better.

This example is the associativity axiom of composition for categories and by definition it commutes:

\xymatrix{A \ar[r]^f \ar[dr]_{f \circ g}& B \ar[d]^g \ar[dr]^{g \circ h} \\& C \ar[r]_h & D }

It clearly shows how associativity is really just a higher-order form of commutativity. Commutativity normally denotes the exchangeability of operands; associativity denotes the commutativity of operator application.
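One way to make this precise (my own gloss, so take the notation with a pinch of salt): write L_a for composition with a on the left, L_a(x) = a \circ x, and R_c for composition with c on the right, R_c(x) = x \circ c. Then for any b,

\[ (L_a \circ R_c)(b) = a \circ (b \circ c) \qquad\text{and}\qquad (R_c \circ L_a)(b) = (a \circ b) \circ c, \]

so the associativity of \circ is precisely the statement that the operators L_a and R_c commute with each other.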

I was interested to see if I could draw 2-morphisms from n-category theory, mainly because I wanted to write some notes as I watched a video lecture (about which more later). With a bit of help from here I got it working, although it did require a couple of tweaks to the site’s config; here it is:

 \xymatrix{ A\rtwocell^f_g & B }

For now all of this works well enough, but I have been struck by how heavily xy-pic focuses on the presentational aspects of drawing diagrams. I would prefer a package that allows me to think and express myself in terms of the diagram’s structure, i.e. what connects to what and how, and then renders that definition according to certain general rules. It could be argued that this lack of semantic emphasis stems from the fact that the package aims to be very general-purpose in the types of diagrams it allows (I have seen cobordisms and knots typeset using it), but my gut feeling is that these structures have enough in common to support a shared semantic diagram description language. It almost certainly already exists.


I’m on my way to the Oxford Social Media Convention 2009, which is being held at the Saïd Business School in the University. I have to say a big thanks to the organizers of the event for squeezing me in at the last minute; it’s good to hear that they have been over-subscribed.

The itinerary for the day is given here. I’m particularly interested to see Parallel Session II: Making science public: data-sharing, dissemination and public engagement with science. Hopefully they will allow live-blogging so I can post thoughts during the sessions, but either way it should be a fun day.



Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales
This work by Daniel Hagon is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales.