Facing Tagging Data Scattering
Oscar Díaz, Jon Iturrioz, and Cristóbal Arellano
ONEKIN Research Group, University of the Basque Country, San Sebastián, Spain {oscar.diaz,jon.iturrioz,cristobal-arellano}@ehu.es http://www.onekin.org/
Abstract. Web2.0 has brought tagging to the forefront of user practices for organizing and locating resources. Unfortunately, these tagging efforts suffer from one main drawback: lack of interoperability. This situation hinders tag sharing (e.g. making tags introduced at del.icio.us available at Flickr) and, in practice, leaves tagging data locked into tagging sites. This work argues that, for tagging to reach its full potential, tag management systems are needed that provide a common way to handle tags no matter which tagging site (e.g. del.icio.us, Flickr) acted as the front end for the tagging. This paper introduces TAGMAS (TAG MAnagement System), which offers a global view of your tagging data no matter where it is located. By capitalizing on TAGMAS, tagging applications can be built more quickly and more robustly. Using measurements and one use case, we demonstrate the practicality and performance of TAGMAS.
Key words: Tagging sites, Web2.0, Datasets
Introduction
Tagging, i.e. the activity of associating keywords with resources, has proved to be an effective mechanism for locating and organizing user resources [11]. Places where tagging is conducted, i.e. tagging sites, can be numerous. Indeed, it is very common for users to keep accounts at distinct tagging sites depending on a broad range of issues: the resource type (e.g. del.icio.us for bookmarks, Youtube for video), the utilities offered by the site (e.g. if LaTeX references need to be obtained, CiteULike might be a better option than del.icio.us), the supporting community (e.g. if music-related resources such as mp3, videos or lyrics are the resources to tag, lastfm.com could be an appropriate site), confidentiality (e.g. if restricted sharing is an issue, you might prefer www.bookmarks2.com to del.icio.us, where private bookmarks are cumbersome to handle), etc. Therefore, taggable resources will most likely be scattered throughout the Web.

Unfortunately, these tagging efforts suffer from one main drawback: lack of interoperability (i.e. the ability of two or more systems or components to exchange information and to use the information that has been exchanged [4]). Both the tagging process and the tagging description differ across tagging sites. Communication protocols, tagging schemas, APIs and graphical user interfaces all exhibit considerable variation. This causes tagging sites to become silos with, at best, a proprietary API. This situation jeopardizes both sharing (e.g. making tags introduced at del.icio.us available at Flickr) and holistic viewing (e.g. posing global queries such as "resources tagged as Poznan, no matter which tagging site keeps them"). Indeed, a tag set stands for a user's conceptual model of how to describe the content (e.g. tag Ajax), purpose (e.g. tag forProject1) or quality (e.g. tag interesting) of resources [8,9]. This conceptual model is site independent. Unfortunately, there is currently no support for such a holistic view.

This paper describes a tag management component, TAGMAS, that allows for a common way to handle tags no matter which tagging site (e.g. del.icio.us, Flickr) acted as the front end for the tags. TAGMAS is proposed as an application for WindowsOS that offers a global view of your tagging data. W3C-backed SPARQL [5] and SPARUL [6] serve to query and update tagging data, respectively, and in doing so black-box the heterogeneity of the underlying tagging sites. There is no longer any need to learn proprietary APIs or messaging protocols. The aim is to support tagging interoperability, whereby tagging data produced in one site can be seamlessly used in another. Even more importantly, global queries can now be posed that span the whole tagsphere. Last but not least, the TAGMAS API streamlines application development by abstracting the application from the location of tagging data. As a proof of concept, one application has been implemented, tagfolio, which provides a front end to SPARQL query specification on top of TAGMAS.

The rest of the paper is organized as follows. Section 2 motivates this work through an example. Section 3 describes query specification in TAGMAS by offering a global view over tagging sites. Sections 4 and 5 go down to design and implementation details by describing how TAGMAS query expressions are mapped to the proprietary APIs. We evaluate the performance of TAGMAS in Section 6. Related work and some conclusions end the paper.
Motivation
Suppose you have to collect information about Poznan, no matter what format this information takes: a podcast, a picture or a website. Tagging sites help you by putting the wisdom of the crowds into your hands: type the Poznan tag into your favourite tagging sites, and you will recover a handful of resources. Although this implies moving across distinct sites (e.g. del.icio.us, Youtube and Flickr), the effort may well be worthwhile. However, the difficulty frequently rests on finding the right tags to ask for. It is not always easy to find a sensible collection of tags, especially if you are a novice. But, after all, the wisdom of the crowds is there for novices, not for experts, who already have the background to find the right resources by themselves. Rather than explicitly providing the tags themselves, novices can tap into someone else's tags. But tags do not exist independently; they are attached to a resource. Hence, we tap into a given resource (e.g. the Wikipedia entry for Poznan), recover how it has been tagged in del.icio.us (e.g. OstrowTumski, OldBrewery) and use these tags to query Youtube or Flickr. This approach certainly facilitates tag location for novices, but it severely complicates the procedure for resource retrieval: for each tag that characterises the resource http://en.wikipedia.org/wiki/Poznan at del.icio.us, recover those pictures at Flickr that carry that tag.

The tricky thing about the previous example is that not only does it access two different sites (i.e. del.icio.us and Flickr), but these accesses are intertwined. In terms of relational algebra, the query is not just a union but a join. Conducted manually, this query is very tiresome: going back and forth between del.icio.us and Flickr is really not an option. What is needed is a declarative query language that hides much of the distribution and diversity of the tagsphere. This is the purpose of TAGMAS.

On top of such a query language, applications such as tagfolio can be constructed. Broadly, a tagfolio is a desktop folder that is defined through a query over the tagsphere. When a tagfolio is opened, the query is executed, and the outcome populates the folder. Figure 1 illustrates a tagfolio that keeps photos related to Poznan at Flickr. The lower panel shows the content of the folder (initially empty). The upper panel serves to specify a query à la Query-By-Example (QBE) [13], i.e. each row denotes a selection on a single site, and join variables (identified by a leading question mark) are denoted by using the same variable name in two different rows. Back to Figure 1, the first row states that photos from Flickr should be retrieved (notice the tick in the first column). The condition to be fulfilled is provided by the second row: the photo should share at least one tag (through the ?tag variable) with the resource http://en.wikipedia.org/wiki/Poznan kept at del.icio.us. The outcome is then a set of photo references together with the tags fulfilling the condition.

Fig. 1. The Poznan tagfolio.

This use case illustrates tag interoperability at work. Instead of each application having to face tag interoperability on its own, this work advocates a tag management component, TAGMAS, that makes the location of tagging data transparent. This paper focuses on the query capability: query specification, query transformation and query execution over dispersed and heterogeneous tagging sites.

Fig. 2. TagOnt ontology and a sample individual.
TAGMAS Query Specification
TAGMAS offers a global view over heterogeneous tagging sites. As known from the database community, a key point in integrating different data sources is a formal description of each data source that permits its automatic integration by machines and offers a common model in which the user can express queries. This work uses RDF as the data model, and introduces TagOnt as an ontology to integrate the distinct tagging conventions [10]. TagOnt rests on the observation that current tagging sites have no formal semantic agreement on how to represent the notion of tagging: every system uses a different format to publish its tagging data, which prevents interoperability and does not allow for machine-processability [10]. Figure 2 depicts the main constructs of TagOnt together with an individual. The central element of the ontology is a Tagging, i.e. a tuple (resource, tag, time, tagger, domain). The following properties are introduced: hasTaggedResource (which holds the resource URL); hasTagLabel (i.e. the resource's tags); isTaggedOn (i.e. the date and time of the tagging); hasTagger (i.e. the person who did the tagging); and hasServiceDomain, which specifies the tagging site. The latter allows tagging data to be converted from existing applications without losing its original context. In doing so, TagOnt allows for cross-application tagging, which is a must in our scenario. Once the ontology is defined, operations should be available to query and populate the knowledge base. To this end, SPARQL and the SPARQL/Update Language (SPARUL) are used. Figure 3 shows distinct examples based on TagOnt, namely:
Fig. 3. SPARQL/SPARUL Query examples.
(A) obtain pictures at Flickr with carellano001 as the tagger, where at least one of their tags has also been used to tag the bookmark http://en.wikipedia.org/wiki/Poznan at del.icio.us;
(B) attach tag Poznan to picture http://farm4.static.flickr.com/3299/3663279424_d73d853ceb.jpg located at Flickr, with carellano001 as the tagger;
(C) delete the tags associated with picture http://farm4.static.flickr.com/3299/3663279424_d73d853ceb.jpg, which is kept at Flickr, with carellano001 as the tagger;
(D) rename tag Poznan to WISEVenue with carellano001 as the tagger, so that all resources are re-tagged (no matter the tagging site).
Variables are denoted with a leading question mark, e.g. ?tag.
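As a rough illustration of the Tagging tuple and of query (A), the following Python sketch models a Tagging as a plain record and evaluates the join in memory. It is not a reproduction of Figure 3: the SPARQL text is only one plausible rendering, the `tagont:` prefix URI is a placeholder rather than the published TagOnt namespace, and the sample taggings, photo identifiers and the helper name `query_a` are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical SPARQL rendering of query (A). The prefix URI is a placeholder
# and the use of plain literals for tags, taggers and sites is an assumption.
QUERY_A = """
PREFIX tagont: <http://example.org/tagont#>
SELECT ?photo ?tag WHERE {
  ?t1 tagont:hasTaggedResource <http://en.wikipedia.org/wiki/Poznan> ;
      tagont:hasTagLabel ?tag ;
      tagont:hasServiceDomain "del.icio.us" .
  ?t2 tagont:hasTaggedResource ?photo ;
      tagont:hasTagLabel ?tag ;
      tagont:hasTagger "carellano001" ;
      tagont:hasServiceDomain "Flickr" .
}
"""


@dataclass(frozen=True)
class Tagging:
    """One TagOnt-style tuple (resource, tag, time, tagger, domain)."""
    resource: str  # hasTaggedResource
    tag: str       # hasTagLabel
    time: str      # isTaggedOn
    tagger: str    # hasTagger
    domain: str    # hasServiceDomain


POZNAN = "http://en.wikipedia.org/wiki/Poznan"

# Made-up sample data, standing in for what the tagging sites would return.
TAGGINGS = [
    Tagging(POZNAN, "OstrowTumski", "2009-06-01", "someuser", "del.icio.us"),
    Tagging(POZNAN, "OldBrewery", "2009-06-01", "someuser", "del.icio.us"),
    Tagging("photo-1", "OldBrewery", "2009-06-02", "carellano001", "Flickr"),
    Tagging("photo-2", "Warsaw", "2009-06-03", "carellano001", "Flickr"),
    Tagging("photo-3", "OstrowTumski", "2009-06-04", "otheruser", "Flickr"),
]


def query_a(taggings, bookmark, tagger):
    """Flickr resources by `tagger` sharing at least one tag with `bookmark` at del.icio.us."""
    bookmark_tags = {t.tag for t in taggings
                     if t.domain == "del.icio.us" and t.resource == bookmark}
    return sorted((t.resource, t.tag) for t in taggings
                  if t.domain == "Flickr" and t.tagger == tagger
                  and t.tag in bookmark_tags)


RESULT_A = query_a(TAGGINGS, POZNAN, "carellano001")
```

The shared `?tag` variable in the query text corresponds to the set membership test in `query_a`; with the sample data only `photo-1` satisfies both the tagger and the shared-tag condition.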
Fig. 4. Example case of transformation: (1) SPARQL query; (2) Datasets; (3) Calls; (4) Execution plan.
TAGMAS Query Transformation
Once the query has been created in SPARQL/SPARUL using the TagOnt model, it is TAGMAS' responsibility to transform SPARQL/SPARUL requests down to Calls to proprietary APIs. The next paragraphs outline the distinct steps (some of which are executed in parallel) that TAGMAS follows to realize this process. The example query is used to illustrate the details.

Group triples in Datasets. First, the SPARQL query is re-arranged into Datasets (see Figure 4(1) and (2)). Datasets are a proposal for a query to span distinct RDF graphs, where each graph consists of triples with subject, predicate and object. When querying a collection of graphs, the GRAPH keyword is used to match patterns against named graphs (i.e. a Dataset). Conceptually, the vision is as if each tagging site were an RDF-graph provider. Unfortunately, this is not yet the case. However, we would like to provide such an illusion, since it amounts to aggregating SPARQL triples by their tagging site. Therefore, objects of the hasServiceDomain property become GRAPH clauses. This GRAPH-based query is the input to the next step.

Transform Datasets in Calls. In this step, SPARQL triples belonging to the same GRAPH (Dataset) are transformed into a single Call. Each tagging site offers a different way to access the site's resources. To insulate TAGMAS from this heterogeneity, a homogeneous syntactic API has been defined. Figure 5 shows the methods of this interface. The interface is tightly coupled with the TagOnt ontology: basically, one method has been defined for each property of the ontology. This interface provides a site-independent way to handle resources (e.g. getTags specifies the recovery of a resource's tags regardless of how this operation is finally realised by the site at hand). The transformation of these methods into the concrete API method of the site is delayed until the execution step. Table 1 summarises all possible combinations of triples that can appear in a Dataset based on TagOnt properties. Column (a) holds predicate combinations of the TagOnt ontology. Column (b) shows their Call counterparts, where ? elements stand for variables and upper-case elements represent constants (e.g. URL). This table therefore describes how to obtain Calls out of Datasets. An example is given in Figure 4(2) and (3): the del.icio.us Dataset becomes the Call ?tag=getTags(http://en.wikipedia.org/wiki/Poznan, null) in accordance with the rule in row 6 (highlighted in Table 1). Analogously, the Flickr Dataset produces the Call ?photos=getResources(?tag, null) in accordance with the rule in row 11, where the variable ?tag is instantiated with the tags obtained from the previous del.icio.us operation.

Table 1. From SPARQL triples to proprietary API operations (highlights are for our running example).

Fig. 5. Site independent Call interface.

Create the Execution plan. Now it is time to consider dependencies among Calls, i.e. whether the output parameters of one Call become input parameters of another. Two situations can arise: (i) if no dependency exists, a UNION expression is constructed, which means that the tagging sites are invoked concurrently and the results are merged independently; (ii) if the output variable of a Call C1 is used as an input variable of another Call C2, a JOIN expression is created with C1 and C2 as the outer and inner loops, respectively. The latter case arises in our running example: the tags recovered after executing ?tag=getTags(http://en.wikipedia.org/wiki/Poznan, null) at del.icio.us are used as the input parameter for invoking ?photos=getResources(?tag, null) at Flickr (see Figure 4(4)). The JOIN expression implies that, for each tag recovered from del.icio.us, a request is issued to Flickr. The JOIN outputs the union of the sets of Flickr pictures obtained in each iteration. The Execution plan ends with the projection of the ?photos variable.
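The three steps above can be pictured with a small Python sketch. It is only an approximation under stated assumptions: triples are plain tuples, only the two Call rules used by the running example are covered (Table 1 lists more), the second Call parameter (null in the running example) is dropped, and the sites are replaced by an in-memory table (`FAKE_SITES`) with made-up photo identifiers. All function and variable names are hypothetical.

```python
from collections import defaultdict

POZNAN = "http://en.wikipedia.org/wiki/Poznan"

# Triples of the running example: (subject, predicate, object); "?x" marks a variable.
TRIPLES = [
    ("?t1", "hasTaggedResource", POZNAN),
    ("?t1", "hasTagLabel", "?tag"),
    ("?t1", "hasServiceDomain", "del.icio.us"),
    ("?t2", "hasTaggedResource", "?photos"),
    ("?t2", "hasTagLabel", "?tag"),
    ("?t2", "hasServiceDomain", "Flickr"),
]


def is_var(term):
    return term.startswith("?")


def group_into_datasets(triples):
    """Step 1: group triples by the site named in their subject's hasServiceDomain."""
    site_of = {s: o for s, p, o in triples if p == "hasServiceDomain"}
    datasets = defaultdict(dict)
    for s, p, o in triples:
        if p != "hasServiceDomain":
            datasets[site_of[s]][p] = o
    return dict(datasets)


def dataset_to_call(site, props):
    """Step 2: map one Dataset to one Call (only the two rules of the example)."""
    resource, tag = props["hasTaggedResource"], props["hasTagLabel"]
    if not is_var(resource) and is_var(tag):
        return {"site": site, "op": "getTags", "arg": resource, "out": tag}
    if is_var(resource):
        return {"site": site, "op": "getResources", "arg": tag, "out": resource}
    raise NotImplementedError("combination not covered by this sketch")


def build_plan(calls):
    """Step 3: JOIN when one Call's output feeds another's input, else UNION."""
    for outer in calls:
        for inner in calls:
            if outer is not inner and is_var(inner["arg"]) and inner["arg"] == outer["out"]:
                return ("JOIN", outer, inner)
    return ("UNION", *calls)


# Stand-in for the remote sites; real access would go through tagging drivers.
FAKE_SITES = {
    ("del.icio.us", "getTags"): {POZNAN: ["OstrowTumski", "OldBrewery"]},
    ("Flickr", "getResources"): {"OldBrewery": ["photo-1"],
                                 "OstrowTumski": ["photo-1", "photo-3"]},
}


def invoke(call, arg):
    return FAKE_SITES[(call["site"], call["op"])].get(arg, [])


def execute(plan):
    kind, *calls = plan
    if kind == "JOIN":
        outer, inner = calls
        results = set()
        for value in invoke(outer, outer["arg"]):  # outer loop: one value per tag
            results.update(invoke(inner, value))   # inner loop: one request per tag
        return sorted(results)
    return sorted({r for c in calls for r in invoke(c, c["arg"])})


DATASETS = group_into_datasets(TRIPLES)
CALLS = [dataset_to_call(site, props) for site, props in DATASETS.items()]
PLAN = build_plan(CALLS)
PHOTOS = execute(PLAN)
```

Because the Flickr Call takes `?tag` as input and the del.icio.us Call produces it, `build_plan` yields a JOIN with del.icio.us as the outer loop; collecting results in a set mirrors the union of the per-tag picture sets described above.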
TAGMAS Query Execution
At this point, an Execution plan is available, but it cannot yet be enacted since it is described in terms of Calls, which are not understood by tagging servers. "Tagging drivers" are needed to map Calls onto the specificities of each server. Each "tagging driver" encapsulates the peculiarities of the tagging site at hand (e.g. protocol, data format, etc). Since tagging sites do not provide such drivers, TAGMAS provides native support for del.icio.us and Flickr. The management of "tagging drivers" is realized by the TaggingDriverManager component. This component hosts the drivers, and supports the interaction with the distinct tagging sites using JTBC (Java TaggingSite Base Connectivity). JTBC mimics the JDBC specification, with "tagging drivers" encapsulating the specificities of each tagging site.
Fig. 6. The structure of the Java TaggingSite Base Connectivity (JTBC).
This approach allows new sites (e.g. CiteULike) to be introduced through interface realization. If queries are to be extended to CiteULike, the interfaces TaggingDriver, TaggingConnection and TaggingStatement should be implemented so that they encapsulate the specificities of CiteULike (see Figure 6). The latter, TaggingStatement, is realised by one implementation per operation of the tagging site (e.g. DeliciousGetTagsTaggingStatement, DeliciousGetDatesTaggingStatement). The DeliciousGetTagsTaggingStatement class encapsulates the protocol, envelope strategy and parameter details specific to the del.icio.us getTags invocation. Figure 7 describes how TAGMAS processes the previous del.icio.us Call ?tag=getTags(http://en.wikipedia.org/wiki/Poznan, null).
Fig. 7. Example snippet of a del.icio.us Call at execution stage.
First, the driver is loaded (3) and the site's name is used to obtain the specific connection object that links to it (4, 5); the numbers in parentheses refer to lines of the snippet in Figure 7. This connection acts as a factory (software pattern) and makes it possible to obtain the concrete statement object for an abstract getTags operation (6). Finally, the statement is executed (10) with the parameters specified in statements (7) and (8, 9).
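The driver, connection and statement roles can be sketched as follows. This is a loose Python analogue of the Java-based JTBC, not the code of Figure 7: the three interface names come from the text, but every method name (`accepts`, `connect`, `create_statement`, `execute`, `register`, `get_connection`) is an assumption, and the stub statement returns canned tags instead of contacting del.icio.us.

```python
from abc import ABC, abstractmethod


class TaggingStatement(ABC):
    """One site-specific realisation of an abstract operation such as getTags."""
    @abstractmethod
    def execute(self, *params):
        ...


class TaggingConnection(ABC):
    """Factory for the statements of one tagging site."""
    @abstractmethod
    def create_statement(self, operation):
        ...


class TaggingDriver(ABC):
    @abstractmethod
    def accepts(self, site):
        ...

    @abstractmethod
    def connect(self, site):
        ...


class TaggingDriverManager:
    """Hosts the registered drivers and hands out connections by site name."""
    _drivers = []

    @classmethod
    def register(cls, driver):
        cls._drivers.append(driver)

    @classmethod
    def get_connection(cls, site):
        for driver in cls._drivers:
            if driver.accepts(site):
                return driver.connect(site)
        raise LookupError(f"no tagging driver for {site}")


class StubGetTagsStatement(TaggingStatement):
    # Canned answer; a real statement would wrap the site's protocol and data format.
    CANNED = {"http://en.wikipedia.org/wiki/Poznan": ["OstrowTumski", "OldBrewery"]}

    def execute(self, resource, second=None):
        # `second` mirrors the null second Call parameter; its meaning is not given here.
        return list(self.CANNED.get(resource, []))


class StubDeliciousConnection(TaggingConnection):
    _statements = {"getTags": StubGetTagsStatement}

    def create_statement(self, operation):
        return self._statements[operation]()


class StubDeliciousDriver(TaggingDriver):
    def accepts(self, site):
        return site == "del.icio.us"

    def connect(self, site):
        return StubDeliciousConnection()


# Load the driver, obtain the connection, obtain the statement, execute it.
TaggingDriverManager.register(StubDeliciousDriver())
connection = TaggingDriverManager.get_connection("del.icio.us")
statement = connection.create_statement("getTags")
TAGS = statement.execute("http://en.wikipedia.org/wiki/Poznan", None)
```

Under this arrangement, supporting another site would mean adding one driver, one connection and one statement class per operation, without touching the manager or the code that builds the Execution plan.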
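Table 2. TAGMAS overhead.

Tagging site       | Query                                    | Total (ms) | TAGMAS (ms) | TAGMAS (%)
-------------------|------------------------------------------|------------|-------------|-----------
Delicious          | getResources(tag)                        | 2331       | 405         | 17.4
Flickr             | getResources(tag)                        | 1032       | 427         | 41.4
Delicious + Flickr | getTags(resource) JOIN getResources(tag) | 3818       | 474         | 12.4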
Evaluation
This section evaluates TAGMAS by measuring the additional latency it adds compared to direct access without TAGMAS. For our measurements, TAGMAS has been deployed on an AMD Turion 64 X2 2 GHz CPU with 4 GB of memory. The experiments were run over a domestic 6 Mbps WiFi LAN. We have not spent much effort on optimizing TAGMAS; we have reused some general-purpose modules (i.e. the Jena parser) that can be improved in future prototypes. Our main goal was to demonstrate the viability of our experimental system. Nonetheless, our results demonstrate that the performance of our current prototype is competitive with other remote-access Web technologies and is fast enough to be usable in practice. We measured the latency for two query types: simple selections and joins. Table 2 shows the outcome. The first two rows correspond to a selection query (i.e. getResources(tag)) for two remote sites, del.icio.us and Flickr. The last row corresponds to a join involving Flickr and del.icio.us (it is actually our running query). For each query, we collected the total elapsed time in milliseconds (ms) (third column) and the TAGMAS latency once network time is removed (fourth column). The last column holds the percentage of the total time attributable to TAGMAS. The first insight is that TAGMAS latency stays roughly constant at around 420 ms, no matter the query. Although the join expression takes around 50 ms more to work out, we do not think this is especially significant. Even in the presence of very efficient servers such as Flickr, TAGMAS accounts for only 41% of the total time. Not surprisingly, network latency dominates the query time. This is especially so for joins, where sites are invoked several times. For our running example, network latency accounts for as much as 87.6%. Even so, getting the results back for our example query takes less than 4 seconds, a time most users would accept from a prototype tool.
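The percentages quoted above follow directly from the measured times. The short Python sketch below recomputes them from the values of Table 2; the dictionary keys and function name are illustrative, and the only assumption is that the percentage column is the TAGMAS time divided by the total time, rounded to one decimal.

```python
# (total elapsed ms, TAGMAS ms) per query, as reported in Table 2.
MEASUREMENTS = {
    "Delicious getResources(tag)": (2331, 405),
    "Flickr getResources(tag)": (1032, 427),
    "Delicious + Flickr JOIN": (3818, 474),
}


def overhead_percent(total_ms, tagmas_ms):
    """Share of the total elapsed time spent inside TAGMAS, in percent."""
    return round(100.0 * tagmas_ms / total_ms, 1)


OVERHEAD = {name: overhead_percent(*row) for name, row in MEASUREMENTS.items()}

# Whatever is not TAGMAS time is attributed to the network and the remote sites.
NETWORK_SHARE_JOIN = round(100.0 - OVERHEAD["Delicious + Flickr JOIN"], 1)
```

For the join, 474 ms out of 3818 ms is about 12.4%, which leaves the 87.6% attributed to network latency in the text.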
Related Work
Starting with desktops striving to integrate resources disseminated across tagging sites, Menagerie introduces a framework that supports uniform naming, protection, and access for personal objects stored by Web services (e.g. tagging sites) [7]. It perceives a tagging site as a file system: resources (e.g. Flickr photos) are files, and folders are derived from the local structures of the server (e.g. albums for Flickr, bundles for del.icio.us). The rationale is similar to that of TAGMAS. The main differences stem from how tagging sites are perceived. In a sense, Menagerie replicates the tagging server structure in the desktop file system. By contrast, TAGMAS perceives tagging sites as sources of resources, and folders as views, i.e. queries over these tagging sites. This has two important implications. First, the very same resource, e.g. a picture, can be virtually located in different tagfolios. You are no longer confined to locating a picture in a single album. Second, tagfolios are based on tags: if photos are recovered through tags when accessing Flickr directly, the user will likely prefer to use tags when accessing Flickr from the desktop too, rather than being forced to remember the album where the photos are kept. These two important advantages rest on the existence of a global schema for tagging data.

Moving to the Web, ActiveRDF [12] is an object-oriented API for managing RDF data that offers full manipulation and querying of RDF data. The aim is to embed Semantic Web data into object-oriented languages. Here, resources and their descriptions are conceived as an RDF graph that, with the help of ActiveRDF, can be integrated into OO languages. The integration is programmatic (i.e. through an API to manipulate RDF structures). Unfortunately, tagging sites do not offer their data as RDF graphs but through their own proprietary APIs. This is precisely one of the endeavours of TAGMAS, i.e. to abstract away from this heterogeneity, and to provide an integrated RDF view of the tagging data, no matter where it is located. ActiveRDF applications can then capitalize on TAGMAS as a supplier of RDF graphs for tagging data. Therefore, the role of ActiveRDF in our architecture would be at the application layer.

Keotag [1] is a tag-based metasearch web site that allows searching for resources annotated with tags in fourteen different tagging sites (del.icio.us, Technorati, Youtube, Digg etc.). The user introduces a tag, clicks on the corresponding tagging-site icon, and related resources are displayed. Queries are then single-site and multi-tag. By contrast, Xoocle [3] allows searching in some predefined tagging sites (del.icio.us, Flickr, Technorati) based on the tags you keep in your Stumbleupon account. The user enters his Stumbleupon username, and Xoocle displays a list of all his Stumbleupon tags [2]. Next, the user selects one tag, and Xoocle obtains all del.icio.us, Flickr and Technorati resources annotated with this tag. Xoocle is limited to single-tag queries that always span the same tagging sites. TAGMAS extends Xoocle's efforts by providing an integrated and extensible architecture that allows for multi-tag, multi-site queries... and updates.
Conclusions
This paper describes TAGMAS, an application for WindowsOS that encapsulates the heterogeneity of tagging sites through SPARQL and SPARUL. The final aim is to streamline the development of tagging-aware desktop applications, which no longer have to face such diversity. This will hopefully promote a new crop of tagging tools that capitalize on tags as the main conduit for locating and organizing user resources in a holistic way. As a proof of concept, an application, tagfolio, has been developed on top of TAGMAS (i.e. using the TAGMAS API). Design decisions were taken to facilitate the incorporation of additional tagging sites into TAGMAS. Finally, our measurements demonstrate the practicality of our approach for medium-scale environments. Future work includes building drivers for other popular tagging sites (e.g. Youtube, CiteULike), and providing cache strategies to speed up TAGMAS query processing. For instance, the set of user tags tends to consolidate as time goes by, which makes this set a good candidate for caching. Depending on the query patterns, some caching strategies can be envisaged for materializing fractions of the tagging data, in a similar way to those available for data warehousing.

Acknowledgements
This work is co-supported by the Spanish Ministry of Science and Innovation, and the European Social Fund under contract TIN2008-06507-C02-01/TIN (MODELINE), and the Avanza I+D initiative of the Ministry of Industry, Tourism and Commerce under contract TSI-020100-2008-415. Arellano has a doctoral grant from the Spanish Ministry of Science & Innovation.
References
1. Keotag. http://www.keotag.com/.
2. Stumbleupon. http://www.stumbleupon.com/.
3. Xoocle. http://www.xoocle.com/.
4. IEEE Standard Glossary of Computer Networking Terminology, May 1995.
5. SPARQL Query Language for RDF, 2007. http://www.w3.org/TR/rdf-sparql-query/.
6. SPARQL Update, 2008. http://www.w3.org/Submission/2008/SUBM-SPARQL-Update-20080715/.
7. R. Geambasu, C. Cheung, A. Moshchuk, S. D. Gribble, and H. M. Levy. Organizing and Sharing Distributed Personal Web-Service Data. In World Wide Web Conference (WWW'08), 2008.
8. S. A. Golder and B. A. Hubermann. The Structure of Collaborative Tagging System. Technical report, HP Labs, 2006. http://www.hpl.hp.com/research/idl/papers/tags/tags.pdf.
9. M. E.I. Kipp. @toread and Cool: Subjective, Affective and Associative Factors in Tagging. In Canadian Association for Information Science (CAIS'08), 2008.
10. T. Knerr. Tagging Ontology - Towards a Common Ontology for Folksonomies, 2007. http://tagont.googlecode.com/files/TagOntPaper.pdf.
11. B. Lund, T. Hammond, M. Flack, and T. Hannay. Social Bookmarking Tools (II). D-Lib Magazine, 2005. http://www.dlib.org/dlib/april05/lund/04lund.html.
12. E. Oren, A. Haller, M. Hauswirth, B. Heitmann, S. Decker, and C. Mesnage. A Flexible Integration Framework for Semantic Web 2.0 Applications. IEEE Software, pages 64-71, 2007.
13. R. Ramakrishnan and J. Gehrke. Database Management Systems, chapter 6, pages 177-192. McGraw-Hill, 2003. http://pages.cs.wisc.edu/~dbbook/openAccess/thirdEdition/qbe.pdf.