I had to take a few days off work last week because of my aching back. It was really a fog of pain for a few days, but this week I'm on the mend and in beautiful Banff for the WWW 2007 conference. Actually, I'm mostly here for the AIRweb workshop, but I'm staying a few extra days to hear what folks are thinking about the future of the web, online information retrieval, humanity, and so on.
The AIRweb submissions included a lot of web-graph-related research. Some of it makes quite intuitive sense: web spammers will link to their spam sites as well as to legitimate sites (camouflage), but legitimate sites don't link to web spam sites. So some of the talks discussed the underlying linear algebra of these phenomena (Anti-TrustRank and BadRank) or their inapplicability to identifying spam (TrustRank). The presentations about temporal patterns, spam term density, the effects of on-the-fly re-ranking and JavaScript redirection were quite interesting.
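For the curious, here's a rough sketch of the distrust-propagation idea behind something like Anti-TrustRank: start from a seed set of known spam pages and push "distrust" backwards along links, so that pages linking to spam pick up a score while pages that merely get linked to by spammers (the camouflage targets) don't. This is my own toy version, not code from any of the talks; the graph, the page names, the damping value and the function name are all made up for illustration.

```python
def anti_trust_rank(links, spam_seeds, damping=0.85, iterations=50):
    """Propagate distrust from known spam pages to the pages that link to them.

    links: dict mapping each page to the list of pages it links to.
    spam_seeds: set of pages already known to be spam (hypothetical labels).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # In-degree in the original graph is the out-degree in the reversed graph,
    # so it is the divisor when a page hands its distrust back to its linkers.
    in_degree = {page: 0 for page in pages}
    for targets in links.values():
        for target in targets:
            in_degree[target] += 1

    # The teleport vector is concentrated on the spam seeds.
    seed = {p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0) for p in pages}
    score = dict(seed)

    for _ in range(iterations):
        updated = {}
        for page in pages:
            # A page inherits distrust from every page it links to.
            inherited = sum(score[t] / in_degree[t] for t in links.get(page, ()))
            updated[page] = (1 - damping) * seed[page] + damping * inherited
        score = updated
    return score


# Toy graph: the spam farm links to its hub and to a legitimate news site
# (camouflage); an honest blog links only to the news site.
toy_links = {
    "spam_farm": ["spam_hub", "news"],
    "blog": ["news"],
}
scores = anti_trust_rank(toy_links, spam_seeds={"spam_hub"})
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

In this toy graph the spam farm inherits distrust because it links to the seeded hub, while the news site and the honest blog stay at zero: being linked to by a spammer costs nothing, linking to one does.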
A lot of these rank-demotion and web graph heuristics aren't really central to the efforts we have at Technorati for thwarting splogs. We instrument the data streams for baseline behaviors of various features. It's more like an intrusion detection system because, fundamentally, web spammers can't behave like "normal" publishers and still succeed; they have to compensate for their absence of popularity with all kinds of abnormal behaviors, and those behaviors are quite intrusive if you're listening for them. And so we are. This is by no means perfect, but we're doing way better than 80-20. It's my belief that as the web becomes more participatory and there are incentives and opportunities to inject junk into it, intrusion detection will be as vital a capability as search relevance rank demotion for maintaining a high-quality experience. At the close of the workshop, I proposed that the web spam research community tell us what they want; what can we do to help? I can only imagine that Technorati's data streams could prove useful for the growing challenges of the participant-driven and temporally sensitive web.
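To give a flavor of what "baseline behaviors" can mean, here's a minimal sketch of the general intrusion-detection approach: learn what normal looks like for a handful of features, then flag publishers who sit far outside it. To be clear, this is not Technorati's actual pipeline; the feature names, the sample numbers and the threshold are all hypothetical, and a plain z-score is just the simplest stand-in for whatever a real system would use.

```python
from statistics import mean, stdev


def z_scores(baseline, observed):
    """Return how many standard deviations each observed feature is from normal.

    baseline: dict of feature name -> list of values seen from ordinary publishers.
    observed: dict of feature name -> value for the publisher being checked.
    """
    scores = {}
    for feature, history in baseline.items():
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # No variation in the baseline: any difference at all is an outlier.
            scores[feature] = 0.0 if observed[feature] == mu else float("inf")
        else:
            scores[feature] = (observed[feature] - mu) / sigma
    return scores


def looks_abnormal(baseline, observed, threshold=3.0):
    """Flag a publisher if any single feature is far outside the baseline."""
    return any(abs(z) > threshold for z in z_scores(baseline, observed).values())


# Hypothetical baseline gathered from ordinary blogs.
baseline = {
    "pings_per_hour": [1, 2, 1, 3, 2, 1, 2],
    "outbound_links_per_post": [3, 5, 4, 6, 2, 4, 5],
}

ordinary_blog = {"pings_per_hour": 2, "outbound_links_per_post": 4}
ping_flooder = {"pings_per_hour": 120, "outbound_links_per_post": 40}

print(looks_abnormal(baseline, ordinary_blog))
print(looks_abnormal(baseline, ping_flooder))
```

The point isn't the statistics, it's the posture: instead of asking "is this page spam?", you ask "is this publisher behaving the way publishers behave?"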
So that was yesterday.
This morning, Tim Berners-Lee kicked off with a keynote that touched on the successive innovations of email, the web, wikis and blogs. On the iterative nature of technological and social change, he drew a cyclical diagram of the needs that emerge when changes occur and enjoy widespread adoption, and of the collaborative/creative forces that drive innovation. He laid out how the Semantic Web is the next iteration, in which complex meaning will be readily accessible on the web. OK, that's all well and good. However, I just don't buy this idea that the Semantic Web is ... the Web at all. We have a web for people (he acknowledged as much at the beginning of the talk), but the idea of having tons of detailed data representations for generalized browsers of really complex data... I just don't get why folks won't end up building domain-specific apps anyway. Building UIs for "general data representation" means that you'll never really be able to represent the domain-specific qualities within some part of The Ontology. At least, I've never seen those things work. Useful apps need domain experts (champions of the end user, e.g. product managers) and engineers to build something that works for that domain. Generic UIs break down when dealing with the nuances of specific domains. I want a data-rich web for humans that is machine consumable (microformats), not a parallel-universe web of machine-oriented RDF. Anyway, thanks for inventing the web, TBL, and good luck, all you Semantic Webbers. I think you'll need it.
I almost fell out of my chair, though, when TBL said that blog spam isn't really a problem. I'll surmise that he has a set feed reader repertoire (or old-school bookmarks) and doesn't use blog search much. While I think we've done a pretty good job of spam-scrubbing Technorati, the fact remains that there is a veritable ocean of pinging rubbish mongers engaging in underhanded payola schemes, kleptotorial and other nefarious endeavors out there. What spam you do see on Technorati is the tip of the iceberg. Tim, use our site, despite the iceberg tip :)
Side notes: when you're in Canada, going to "google.com" gets redirected to "google.ca", which includes a toggle to search "The Web"/"Pages from Canada" ... amusing, ergo the graphic in this post. Also, I can't believe how long the days are here; about 3 hours more daylight than the San Francisco Bay Area!
So thanks to Brian Davison, Carlos Castillo and Kumar Chellapilla for putting together a great AIRweb program. Good work, guys! I'm heading home tomorrow.
www2007 w3c airweb webspam search spam splogs splog ping technorati webgraphs linear algebra microformats semantic web tim berners-lee intrusion detection google banff canada
( May 09 2007, 09:44:35 PM PDT ) Permalink