I’ve recently switched mobile broadband provider away from O2 and I was keen to ensure that I could save copies of all the SMS messages I had received whilst using the service. With older mobile handsets this is also a major hassle, as I found out when decommissioning an old Sony Ericsson phone.

On Mac OS X liberating a copy of your text messages isn’t an issue, since the Mobile Connect software they provide can export your messages (either the ones on the SIM or the ones saved locally) to a CSV file directly from the application. For Windows machines running the O2 Connection Manager application there is no direct way to get at either the messages on the SIM or those saved to the local machine. To get the messages off the SIM I ended up using a Mac with Mobile Connect, but that was not going to help with the locally saved messages: whilst O2 Connection Manager could copy messages from the SIM into the saved folder, it couldn’t do the reverse. So I started to look for where this data might be stored on my machine.

This process was very frustrating, as none of the standard locations where I’d expect to find the database file contained anything that looked like a text message database. Along the way I found some interesting files, such as one containing a list of various fast food restaurants and coffee shops that provide WiFi access points, but in none of the standard locations for configuration files could I find the SMS text message store. I even checked the last modified times of files in those locations in case it was somewhere non-obvious, but this also drew a blank. In the end I decided to be a bit more sneaky: in Windows it’s possible to find out what file handles a given process has open. In the case of O2 Connection Manager the process is called tscui.exe.

There are many programs that will inspect the file handles held open by a given process, but I found I got good results with the freeware ProcessActivityView. It has a session recording feature that is very useful when a process rapidly opens, writes to and closes a file, since that would normally be tricky to detect with programs that simply provide a current snapshot.

So what did I find? Well, the files that were accessed when I opened the SMS component of Connection Manager were in %APPDATA%\Tatara Systems, where %APPDATA% is an environment variable that expands to whatever Application Data directory your local login uses (for me it was C:\Documents and Settings\axiomsofchoice\Application Data). I have no idea what connection Tatara Systems has to the Connection Manager software, but I’d found what I was looking for, and fortunately it was an easily parsed XML file and not, e.g., a binary database or something encrypted.
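Once you’ve found a file like this, converting it to CSV yourself is a few lines of Python. This is only a sketch: I haven’t documented the actual schema of the Tatara Systems file here, so the element names below (message, sender, timestamp, text) are placeholders – inspect your own file and adjust them to match.

```python
# Hypothetical sketch: export SMS messages from an XML store to CSV.
# The element names ("message", "sender", "timestamp", "text") are
# assumptions, not the real Tatara Systems schema -- adjust to taste.
import csv
import xml.etree.ElementTree as ET

def export_sms_to_csv(xml_path, csv_path):
    tree = ET.parse(xml_path)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["sender", "timestamp", "text"])
        # Walk every <message> element, wherever it sits in the tree.
        for msg in tree.getroot().iter("message"):
            writer.writerow([
                msg.findtext("sender", default=""),
                msg.findtext("timestamp", default=""),
                msg.findtext("text", default=""),
            ])
```

The nice thing about a CSV export is that it opens straight into a spreadsheet, which is exactly the format Mobile Connect gives you on the Mac.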

I hope this post will be of help to anyone else who finds themselves in a similar position. Remember, data portability is a primary use case of any application you use where personal data is stored.

I had thought the post I would write for Ada Lovelace Day this year would be about the amazing achievements of women in science and technology over the years (similar to the brilliant blog posts you can find over at Finding Ada – http://findingada.com/) and whilst I may yet write such a post, instead I would like to explore a little what problems still exist in our current technological culture that might cause women to be held back from even greater achievements.

First let’s talk numbers. Recent figures from the BCS put the percentage of women in IT in the UK at around 14% (http://www.womenintechnology.co.uk/news/greater-numbers-of-women-are-working-in-it-news-800574277). Now normally I’m the first person to point out that gender balance isn’t really a numbers game but about striving to provide equal opportunities for everyone; all things being equal you wouldn’t expect exact parity at all times, as the numbers should fluctuate, perhaps by several percent, since you can’t force someone to want to do Computer Science just to make up the numbers. However, this statistic is too extreme to be dismissed on the basis of the numbers ‘finding their own levels.’ The conclusion, then, is that we are failing to provide equal opportunities. Actually, we aren’t merely failing; we suck really badly at it.

It turns out that this percentage doesn’t seem to have changed much since Ellen Spertus wrote a highly cited MIT Technical Report titled “Why Are There So Few Female Computer Scientists?” around twenty years ago (http://people.mills.edu/spertus/Gender/why.html). I note that there are many parallels between the points she raises in her report and a recent analysis of why so few Wikipedia edits come from women (http://blog.wikimedia.org/2011/07/15/shedding-light-on-women-who-edit-wikipedia/), in particular the comments from Sue Gardner of the Wikimedia Foundation (http://suegardner.org/2011/02/19/nine-reasons-why-women-dont-edit-wikipedia-in-their-own-words/). (Note, in contrast, I’d say that in the case of Wikipedia it very much is a numbers game, as the goal is a balanced NPOV.) The aspect I’d like to focus on in this blog post is what Spertus terms ‘Sex-Correlated Differences’ (http://people.mills.edu/spertus/Gender/pap/node12.html#SECTION00430000000000000000) and Gardner characterizes as ‘Wikipedia’s sometimes-fighty culture.’ Though it goes by different terms the effect is much the same. Here I’ll use the term ‘technological intimidation’, and I’m ashamed to say I’m not above blame for this myself.

Even if we just confine our discussion of technology to computing, there is vastly more to learn about how computers work and what they are capable of than one individual could possibly hope to learn in a lifetime. I mean even just the actual machine that’s probably sat on your desktop, not some theoretical construct: you’d think that given enough time you could work out how all its components worked, but you’d be wrong. Computers are now so complex that the best we can hope to learn is the small set of functionality that most interests us individually. I’d like to think I know a lot about computers, but in reality I know relatively little, and so does everyone else.

How stupid then, it seems to me, that there is this perceived need to defend, often in very ardent and entrenched terms, some small part of the technological landscape, perhaps simply because it’s the flavour of the day. This feels nothing short of dogma (http://twitter.com/#!/axiomsofchoice/status/122034400237076481), and the adversarial way ‘religious wars’ are fought over things like programming languages, operating systems, hardware and data protocols is clearly exactly the sort of cultural put-off Spertus and Gardner were describing in their pieces. The tone these technological discussions take must be intimidating for many people and, I would suggest, although I cannot experience this directly myself, for women in particular. All technology is a set of carefully balanced trade-offs and there is almost never a right answer, even for an expert.

There will always be some technology at which each of us is a n00b, so to talk down to or ostracize someone just because they do not have the same level of know-how about a technology you’re fluent in is ultimately counter-productive. We should be as willing and keen to learn what someone else knows as we are to teach them what we know. Anything else runs entirely against the true principles of the Hacker Ethic (http://en.wikipedia.org/wiki/Hacker_ethic#Sharing).

We must tear down any barriers to technology that purposefully make anyone (women and men included) feel too intimidated by their apparent lack of knowledge or skill to explore, play or hack. As Emma Mulqueeny blogged recently (http://mulqueeny.wordpress.com/2011/08/10/year-8-is-too-late/): ‘Don’t Be Scared.’ One brilliant project I heard about recently is Girl Develop It (http://girldevelopit.com/), whose aim is precisely to address this perception. We need more projects like this. We need to make the opportunities promised by technology available to everyone.

I wonder if you’ve seen any of the coverage of the Authors Guild vs the HathiTrust case (in case not, there is a round-up of it here)? The AG, by identifying candidates that should not be considered as potential orphan works, have been attempting to show that the HathiTrust’s process of identifying orphan works is not fulfilling its due diligence as effectively as it should. As such this represents a serious threat to any efforts intended to address the problem of orphaned works. The British Library estimates that about 40% of in-copyright works are orphaned, so the issue for scholarship is acute.

As pointed out here, the efforts the AG are going to actually show that the system works (when you have enough eyes, of course – as they now do), though it could do with some improvement. The titles they’ve identified so far have been removed from the candidates list, which is given here and reproduced on the blog post I linked to above.

I’m just wondering if there’s any mileage in learning from the successes of this crowdsourcing effort and using it to compile a more effective set of tools for working out whether a work is truly an orphan work, certainly one better than the HathiTrust have outlined. In particular, the sources that those helping with the current crowdsourcing effort are using seem to speed up the process considerably. Might it be useful to set up an EtherPad to help compile this information? Would you know of anyone who might like to help crowdsource rights holders, crowdsource the process, or who might benefit from having information about these resources?

So one of my metrics rants (the other main one being the lack of objectivity in the term “impact,” but that’s for another blog post :) ) is that you cannot expect any system of metrics to predict in any reliable way “the next Einstein” because there aren’t enough instances of out-of-nowhere breakthroughs in science (cf. The mismeasurement of science). It’s just not sensible to try to make predictions about outliers. However there is enough available data to make sensible predictions about what Thomas Kuhn calls Normal Science.

So it’s with some interest that I stumbled upon the Thomson Reuters Citation Laureates, an ongoing attempt to predict the next set of Nobel prize winners on the basis of citations amassed by top-flight academics. My understanding is that Nobel prizes in science are awarded for specific research advances rather than lifetime achievement. This is the reason, for instance, why Stephen Hawking, despite having a glittering academic career and many awards to his name, does not have a Nobel prize: the predictions his theoretical work makes have yet to be fully borne out by experimental results. The interesting thing about the Citation Laureates project is that they’ve actually been moderately successful.

Of course this isn’t totally surprising: to make any significant progress in any field you need to have put in a huge amount of time and effort mastering it, so it’s to be expected that breakthroughs correlate with mastery of a subject, and just tracking the masters increases your probability of success. But not all masters of a subject make breakthroughs in it, and it’s precisely the breakthroughs that the Nobel prizes reward. Conversely, however, I would say that such predictions might actually be more appropriate for career-recognition awards such as the Fields Medal or the Turing Award, since key career milestones are more easily tracked and predictions made from them.

We will doubtless some day see the next Einstein, an individual who at a stroke changes the way we understand the physical world around us, but by the time we’ve realised that they’ve arrived, no one will care who called it first.

The single key idea I wish to promote in this blog post is that the scholarly literature, despite attempts to recast it in other molds, is a single continuously evolving network-structured object. As such I would like to suggest that the only technology that at present matches this model most naturally is the World Wide Web as originally conceived.

It has often been remarked how curious it is that the cradle of the Internet and the World Wide Web was academia, and yet academia has been one of the few spaces in which these two technologies have yet to truly disrupt the pre-existing technologies. An important driving force behind this is the way these prior technologies have so thoroughly engulfed our thinking about how knowledge is formed, transmitted, debated, acquired and transformed that we require any new technology to be cast in superficially the same mold. This print-publication model of literature, though disruptive in its day, has obvious inherent shortcomings which have, as print publication became commonplace and these shortcomings consequently became less obvious, insidiously manifested themselves as net inhibitors of scholarly discourse.

In contrast to the written word, the printed word gave almost total confidence in the consistency of texts between disparate copies, which in turn gave those words far greater authority than the spoken word could ever achieve. However, this also gives rise to the implicit assumption that the ideas embodied in those texts have a certain degree of fixity, irrespective of whether or not this is warranted. Deference to a canonical source invites us to absolve ourselves of any critical thinking. It turns us into passive consumers of content.

The task of identifying such canonical sources within the literature becomes a progressively more foolhardy enterprise the further we pursue it. Those who seek such sources face an insurmountable problem: reconciling the many views on any given idea is, in general, impossible, for the simple reason that there are at least as many conceptions of ideas as there are individuals, and there is an inexhaustible supply of things we can disagree on. Put another way, each of us has our own “name space” and no one of these has a claim to a privileged position amongst all the others.

If none of these view points is privileged then we must consider ways of interacting with the literature that take into account as many as possible, and further we should understand how they relate to each other, i.e. the network of interrelated ideas. It is in this way that I would like us to consider the literature as a single network-structured object. I’ve already remarked that ideas and knowledge are never fixed, so the other fundamental aspect of the global view of literature as a single object is that it is continuously evolving.

The print publication may have a physical permanence, derived again from an inherent shortcoming in its form of technology, but in practical terms it has a finite shelf-life afforded it by its links with the core subset of the corpus that present-day scholars find useful or can agree fits in with current norms. As those links are broken we reach a point where the physical item falls into disuse, even though the ideas embodied within it live on in the literature as a whole through the acts of transmission, debate and transformation we have already identified.

Consider now the World Wide Web, which I would consider to be antithetical to any model of permanence. It defies all practical means of preservation by its very nature. If I link to an image hosted elsewhere I have no guarantee that the image will be the same every time someone visits my page. Indeed, I may wish this to be the case. This dynamic behaviour models very closely the way ideas are disseminated through the informal discourse that happens all around the formal print-publication record.

The Web has given us the beginnings of a toolset that will increasingly allow us to see the literature in its entirety, as a single object. The individual components, such as web servers, web pages, agents, tweets, &c., &c., may appear to have more or less importance than other individual components on the web, but it’s their connections and relations within the whole network that give them a context for the importance we assign to them. An importance that is continuously evolving.

A particular Wikipedia page may, for instance, give the clearest description of a particular subject of interest of any page on the Web at a particular instant in time. However, a Wikipedia page on a controversial subject is next to useless unless you’re able to assess its contents relative to other web pages that cover the same subject from alternative standpoints. Again, it’s the collection of pages and ideas which is important here, not any particular one that might assume a privileged position. When we start to think in these terms we start to see moves to stake a claim to a piece of this “knowledge real estate” as essentially a waste of time, not least because the only way to define appropriate boundaries (by fixating on a particular composition of text) ignores the inherent shortcomings of doing so. As has often been remarked, information wants to be free, so the network interprets censorship as a form of damage and routes around it, which it can easily do by, e.g., finding a new form of words to express the same idea.

What I mean by these latter two statements is that if we wish to see an end to copyright we need to begin to think of the literature as a single continuously evolving network-structured object, one that exists natively on the Web and into which content and ideas can be freely formed, transmitted, debated, acquired and transformed.


After a tweet from @manuscript there was a brief but lively debate about the correct usage of the word “data” in the plural versus in the singular-collective. The following is a rather lengthy reply to this FriendFeed thread.

Etymologically speaking “data” comes from the Latin verb “to give” so “data” and “datum” mean something along the lines of “that which is given” and “something given” respectively – literally the thing(s) we know about.

That’s the common usage, but if the term is to be used scientifically it requires a corresponding scientific definition, probably along the following lines: we can take the set-theoretic view that what we’re interested in is some (maybe real-world) set over a certain domain.

To take a concrete example, think of the occupiers of a park bench throughout the course of a year. This set consists of tuples. The first component of each tuple is an element from the uncountable number of instants of time (elements from the continuum of timings), during each of which some subset of the set of things ‘placeable on a park bench’ makes the predicate ‘placed on the park bench’ hold. Hence the second component is one of these subsets.

What we would like is a complete description of this set of tuples, but since there are an uncountable number of them (given what we’ve already said about time being uncountable) the best we can do is to sample the set at certain points. Obviously the more points we have the more we know, and whilst we can get a sufficiently accurate sampling for a given purpose, we cannot in general gain a complete description of the set (see also). The elements we sample from the set are also known as “points”.

The important thing to realise about these samplings is that they involve the actual objects these sets are concerned with. For instance, the two birds occupying the park bench at noon on 1st January, together with that timing itself, are, in a very literal sense, the sample. If that sounds crazy it’s probably because what you were thinking of was something you could store in memory on a computer – that is, a representation of the elements of the set. A representation of something is very different from the thing itself, though. For any given set there are an indeterminate number of representations of its elements, each more or less suitable for a given purpose. Almost all practical representations are digital numbers.

In order to make sense of these numbers we need to know something of the format they take, and this information is encoded in the metadata, or “data about data”. (Interestingly, the etymology of “meta” is Greek whereas “data”, as we have already seen, is Latin.) This definition of metadata is consistent with the usage that applies to the dataset as a whole, e.g. who conducted the sampling, what instrument they used to do it, &c., &c., if we think of these metadata as giving sufficient information to allow us to correctly interpret the collection of samplings.

So what of the words Data and Datum? Well, the Data are the complete set of samplings – so the term “Dataset”, over whose definition so many in the data science community continue to argue, is in fact totally superfluous. A single sampling is thus a datum, which when represented in a suitable form is normally a collection of numbers whose form is given by metadata.

Obviously by this definition a single datum constitutes a complete, though probably uninteresting, dataset. A single number in general cannot be thought of as data unless the set we’re sampling is, say, the integers or real numbers. Even in this latter case we cannot in practice get away with a single number, since computers cannot represent all elements from these countably (and uncountably, resp.) infinite sets, and so we require at least some metadata to allow us to make sense of which representation we’ve employed.
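To make the park bench example concrete, here is a small Python sketch of these definitions. Everything in it (the metadata fields, the choice of a timestamp/frozenset tuple as the representation of a datum) is purely illustrative; the point is only the shape of the definitions: a datum is one sampling, the data are the collection of samplings, and the metadata tell us how to interpret the representation.

```python
# Illustrative sketch of the definitions above. A datum is a single
# sampling (a timestamp paired with the set of occupants observed);
# the data are the collection of such samplings; the metadata tell
# us how to interpret the representation. All names are made up.
from datetime import datetime

metadata = {
    "domain": "occupants of a park bench",
    "time_format": "ISO 8601",
    "sampled_by": "an observer with a notebook",  # assumption for illustration
}

# Each datum: (timestamp, frozenset of things placed on the bench).
data = {
    (datetime(2011, 1, 1, 12, 0), frozenset({"blackbird", "robin"})),
    (datetime(2011, 1, 1, 12, 30), frozenset()),
}

# A single datum is itself a complete, if uninteresting, dataset.
datum = next(iter(data))
singleton_dataset = {datum}
```

Note that what the computer holds is the representation (a tuple of numbers and strings), not the birds or the instant of time themselves, which is precisely the distinction drawn above.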


Continuing my Science Hack Day 2010 theme for this week’s posts I’m cross posting the following idea I came up with this evening:

Science-friendly URL shortener. The idea would be that scientists could use per-article URLs for the references they cite – much as advertisers give specific codes to different types of ad during each ad campaign to enable them to track people’s interest in a product. Whenever somebody reads their article and follows a URL they are first taken to the URL-shortener site, which then forwards them to the actual URL cited. The site should only allow people to shorten URLs if they have registered, which they must do via some type of single sign-on system such as OpenID or ORCID. This registration means that whenever someone follows a link the system knows exactly who it was that followed it, and allows the scientist who wrote the article to gather detailed information about the interests of those who read their articles. Of course, anybody would be allowed to unshorten, although this still allows stats to be gathered. The actual format of the shortened URL needs careful thought in order to ensure it’s acceptable to journals.
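The core of the service is tiny; it’s the registration and the stats that carry the value. The following is a minimal in-memory sketch of the shorten/follow/log cycle – the class name, the six-character code length and the dict-based storage are all assumptions of mine, and real registration (OpenID/ORCID) and persistence are deliberately out of scope.

```python
# Minimal sketch of the per-article shortener idea. Storage is
# in-memory; a real service would persist these tables and
# authenticate the registered user via OpenID/ORCID.
import secrets
import string

class Shortener:
    ALPHABET = string.ascii_letters + string.digits

    def __init__(self):
        self.codes = {}   # short code -> target URL
        self.owners = {}  # short code -> registered user who created it
        self.hits = []    # (short code, visitor id) access log

    def shorten(self, registered_user, url):
        # Only registered users may shorten; registration itself is
        # out of scope for this sketch.
        code = "".join(secrets.choice(self.ALPHABET) for _ in range(6))
        self.codes[code] = url
        self.owners[code] = registered_user
        return code

    def follow(self, code, visitor_id):
        # Log who followed the link, then forward to the cited URL.
        self.hits.append((code, visitor_id))
        return self.codes[code]
```

The access log is what the citing scientist would later query for per-reference readership stats; unshortening is just a lookup in the codes table, so it stays open to everyone while still being counted.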


I’ve just come back from two days of hacking at HackCamp 2010, hosted at Google in London. It was so great to see the diversity of projects people were working on. The project I decided to tackle turned out to be far more ambitious than was possible in under 24 hours of coding, but my collaborator @leipie and I made great progress with identifying the necessary components in the stack for putting together a future implementation. Since next weekend is Science Hack Day 2010 I believe this project would be suitable to take on over that weekend. I intend to resolve most of the additional problems raised by this weekend’s work in future blog posts during the week, with the next one in particular discussing the chosen components and the thinking behind each. In the meantime, below is a dump of the idea and its motivation:

  • There are plenty of great examples of long-standing open problems in theoretical computer science and math; for many of these there is a strong belief, based on past experience, that the solution (should one actually exist) requires thinking somehow “outside of the box”
  • I make this “thinking outside of the box” concept more concrete in the following way: almost all examples of purported solutions to these open problems follow standard patterns although their details differ. Hence if, given a steady stream of these potential solutions, you can find a way to annotate each new one and compare the pattern of proof with those found in ones you’ve already received, as soon as you encounter a solution that doesn’t follow this pattern it will stick out like a sore thumb.
  • These anomalous solutions may not actually solve the problem but they may signal potential new avenues of attack, hopefully meaning the solution is reached far quicker. In fact certain lemmas within the proof may be entirely correct but the rest of the proof totally bogus – there should be a way of reusing just those parts that were correct, assuming they actually help with finding a solution.
  • Anyone will be allowed to submit as many solutions (read: published papers) as they wish. Annotations will be done by the community, and anyone in the community may contribute annotations. As a result these solutions are judged on their merit against each other, and the winners of this “competition” are those solutions which contribute novel ideas, in the sense that they rise to the top in any listing of solutions.
  • The annotations are at a fairly coarse granularity (compared to formal proofs supplied to verification procedures), roughly at the level of proof technique (e.g. “this section is proof by induction” or “here they used diagonalisation”). Another way to think of this is that it’s kind of at the level of “hand-wavy” styles of proof :)
  • (To allow reuse there could be a concept of ‘forking’ someone else’s “paper” submitted to the system)
  • This system is supposed to contrast directly with the arXiv, where such “out there” solutions are less likely to appear due to the filtering process, although Perelman’s recent solution to the Poincaré conjecture is one very notable exception to this.
  • The situation is even worse with traditional publishing because nobody gets to see the rejects and so these cannot be used to train a filter. The reason for this is simple: a small group of reviewers simply does not scale. I actively want this system to attract many of those who might be considered “cranks” or “nut jobs”, i.e. the sorts of people who think they have solved the hardest problem in a certain field in a single page of text with no equations. Even if they haven’t, they may actually have interesting ideas that are worth filtering on. I want this system to be scalable enough to cope with, and indeed thrive on, the data they provide.
  • The key open problem I have chosen here is “P vs. NP” because there are lots of papers out there from a diversity of sources that I can use to test the system from the outset.
  • One implementation issue I’d like to address is the centralized gatekeeper-like nature of arXiv, although in the first instance the system will have to be centralized so that code development can be bootstrapped. The eventual hope is that anyone that wishes to submit solutions can do so on a server of their choice and this system will aggregate feeds from each such server.
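The anomaly-detection step in the list above can be sketched very simply. Suppose each submission is annotated with a set of coarse proof-technique tags; then one way (and the scoring rule here is just my assumption, the simplest thing that could work) to make a submission “stick out like a sore thumb” is to score it by how rare its tags are relative to everything seen so far:

```python
# Toy sketch of the anomaly idea: each submission carries a set of
# coarse proof-technique tags, and a submission whose tags are rare
# relative to the corpus so far scores as anomalous. The scoring
# rule (average tag rarity) is an illustrative assumption.
from collections import Counter

def anomaly_score(submission_tags, corpus):
    """Average rarity of the submission's tags across the corpus."""
    counts = Counter(tag for tags in corpus for tag in tags)
    total = len(corpus)
    if not submission_tags:
        return 1.0  # an unannotated submission is maximally suspicious
    return sum(1 - counts[t] / total for t in submission_tags) / len(submission_tags)

# Hypothetical corpus of previously annotated submissions.
corpus = [
    {"induction", "diagonalisation"},
    {"induction", "counting"},
    {"diagonalisation", "counting"},
]
routine = anomaly_score({"induction", "counting"}, corpus)
novel = anomaly_score({"geometric complexity"}, corpus)
```

A submission reusing familiar techniques scores low, while one introducing a technique the corpus has never seen scores at the maximum, which is exactly the “sticks out” behaviour the idea calls for. A real system would of course want something richer than bag-of-tags, but the coarse granularity of the annotations keeps even this naive version meaningful.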


In May Craig Venter announced to an assembled press pack that he and his team had successfully created a synthetic genome and implanted it into a cell to create the world’s first synthetic bacterium, M. mycoides JCVI-syn1.0. The work was published in the journal Science in the article “Creation of a Bacterial Cell Controlled by a Chemically Synthesized Genome”.

In this press conference he gave a few details of certain “watermarks” placed in the base pair sequence for the purpose of clearly distinguishing the synthetic organism from any potential contamination in their experimental results, but did not reveal their entire contents nor give details of how they were encoded. Instead he threw open a challenge for anyone to decode the watermarks and uncover an email address, to which they were invited to send an email to prove that they had indeed correctly decoded it. The watermarks are given in the supporting online materials of the article cited above.
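To give a flavour of what “encoding text in a base pair sequence” even means, here is a toy scheme of my own invention. To be absolutely clear: this is NOT the actual JCVI watermark code (which is precisely what the challenge asks you to discover); it just shows the general shape of such codes, mapping each character to a triplet of the four bases so that a message becomes a DNA-like string recoverable by the inverse map.

```python
# Purely illustrative -- NOT the actual JCVI watermark scheme.
# Map each character of a small alphabet to a triplet of bases by
# writing its index in base 4 with A/C/G/T as the digits.
import string

ALPHABET = string.ascii_lowercase + " @."  # 29 symbols, fits in 4^3 codons
BASES = "ACGT"

def char_to_codon(i):
    return BASES[i // 16] + BASES[(i // 4) % 4] + BASES[i % 4]

ENCODE = {ch: char_to_codon(i) for i, ch in enumerate(ALPHABET)}
DECODE = {codon: ch for ch, codon in ENCODE.items()}

def encode(message):
    return "".join(ENCODE[ch] for ch in message)

def decode(sequence):
    # Read the sequence back three bases at a time.
    return "".join(DECODE[sequence[i:i+3]] for i in range(0, len(sequence), 3))
```

With a scheme like this an email address survives as nothing more than a run of A/C/G/T; cracking a watermark is then the cryptanalysis problem of recovering the unknown codon-to-character table.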

The challenge is also described in the following video:

(Jump to the specific section.)

On Sunday I decided to see if I could crack the code and after about six hours of coding I managed to do it! It turns out I was the 44th person to do so. Rather than explain the technique I used I’ll simply present the following as conclusive proof that I did indeed crack the code, though of course you’re going to have to crack it yourself to verify this.


If you’re lucky enough to be going to one of the barcamps I’m also going to this summer I’ll most likely present the method in one of my talks and eventually on this blog.



Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales
This work by Daniel Hagon is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales.