What's That Noise?! [Ian Kallen's Weblog]


20041218 Saturday December 18, 2004

PHP's mbstring is a 2 bit thief

Normally, PHP's strlen function reports back the number of bytes in a string. When dealing with western characters, that will equal the number of characters as well. However, strange things can happen when PHP's mbstring extension for multi-byte character support is enabled to alias that function.

If you have mbstring.func_overload configured to alias mb_strlen for strlen (i.e. when the 2 bit is flipped), then strlen starts counting characters, not bytes. If you need to count the number of bytes, it's not obvious how you're supposed to do it.

This is how I did it:
In places where I really needed to know the number of bytes, I used a homebrewed function byte_count instead of strlen. Here's the function definition for byte_count.

     function byte_count($val) {  
         $len = (function_exists('mb_strlen')) ? 
             mb_strlen($val, 'latin1') :
             strlen($val);
         return $len;
     }

Perl is hokey about it too. The length function is supposed to count the number of characters, but if you want to force it to count bytes, you need to use the bytes pragma. From the manpage:

           $x = chr(400);
           print "Length is ", length $x, "\n";     # "Length is 1"
           printf "Contents are %vd\n", $x;         # "Contents are 400"
           {
               use bytes;
               print "Length is ", length $x, "\n"; # "Length is 2"
               printf "Contents are %vd\n", $x;     # "Contents are 198.144"
           }

Java is not without its pickiness but at least it has byte and char as distinct primitives.
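For comparison, here is a minimal Java sketch of the same bytes-versus-characters distinction. The class and method names are made up for illustration (byteCount just mirrors the PHP helper above), and UTF-8 is an assumed encoding; pick whatever encoding you actually put on the wire.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, named after the PHP byte_count function above.
class ByteCount {
    // Number of bytes the string occupies once encoded as UTF-8 (assumed encoding).
    static int byteCount(String val) {
        return val.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points; String.length() would count UTF-16 units instead.
    static int charCount(String val) {
        return val.codePointCount(0, val.length());
    }
}
```

With the chr(400) character from the Perl example, byteCount reports 2 and charCount reports 1.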

( Dec 18 2004, 12:50:23 AM PST ) Permalink


20041215 Wednesday December 15, 2004

Tool to defeat a DoS against an Apache server farm

Throttling the number of requests that come in to a single web server instance isn't rocket science, but it requires keeping track of state shared amongst processes/threads. But how would you efficiently throttle requests that come to a set of servers?

Apparently, there's some Apache goodness available for this now. At least I think it sounds good! Ian Holsman has written mod_ip_count for Apache 2.0. It uses the APR portability layer and memcached for shared state (actually apr_memcache from Paul Querna). This would enable a whole server farm to keep track of request rates from specific IP addresses and throttle them.
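To make the idea concrete, here is a rough Java sketch of per-IP throttling with a fixed time window. This is not how mod_ip_count is implemented (I haven't read its source); the class name, the window scheme and the in-JVM map are all assumptions for illustration. In a farm, the counter map would live in memcached so that every server sees the same counts.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical fixed-window throttle; the map stands in for a shared store such as memcached.
class IpThrottle {
    private final ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();
    private final int maxPerWindow;
    private final long windowMillis;

    IpThrottle(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    // Returns true if the request should be served, false if it should be throttled.
    boolean allow(String ip, long nowMillis) {
        // One counter per IP per window; a real shared cache would expire old
        // windows on its own, this sketch never evicts them.
        String key = ip + ":" + (nowMillis / windowMillis);
        int seen = counts.merge(key, 1, Integer::sum);
        return seen <= maxPerWindow;
    }
}
```

The time is passed in rather than read from the clock so the behavior is easy to exercise: with a limit of 2 per second, the third request from one address inside the same second is refused while other addresses are unaffected.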

( Dec 15 2004, 04:24:13 PM PST ) Permalink


20041214 Tuesday December 14, 2004

Greener Pastures

This weekend Technorati's network and server infrastructure is going to move. In one big fell swoop. Well, hopefully nothing will fall.

The home page sez: "Movin' on up" 'cause Technorati is substituting the Jeffersons' theme song for the old ops/facilities anthem, the Talking Heads' "Burning Down The House".

( Dec 14 2004, 12:47:43 AM PST ) Permalink


20041212 Sunday December 12, 2004

l10n development practices

I've used Java's handy-dandy ResourceBundles to do some proof-of-concept localizations. But my past proofs were limited; they only used other European languages that utilized ISO-8859-1 characters. I don't recall having to do anything special with the property files in those cases.

Working on a recent Japanese localization project was an eye-opening experience. It turns out that java.util.Properties expects ISO-8859-1 characters. I guess that's the downside of having a super-simple file format. I got the localized display bootstrapped by using native2ascii to get the UTF-8 localization text rendered as escaped unicode. On a one-off basis, that's easy enough. But collaborative development always begs the tools question: how do folks typically manage this?
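For illustration, here is a small Java sketch of what that escaping step amounts to. It is not native2ascii itself; the class and method names are hypothetical, and it only covers the escaping direction. Properties.load(InputStream) reads ISO-8859-1 and understands \uXXXX escapes, which is why the round trip works.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper that mimics the escaping native2ascii performs.
class EscapedProperties {
    // Replace every non-ASCII UTF-16 unit with a \\uXXXX escape.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    // Escape the text, then load it the way a .properties file would be loaded.
    static Properties loadEscaped(String propertiesText) {
        Properties props = new Properties();
        byte[] bytes = escapeNonAscii(propertiesText).getBytes(StandardCharsets.ISO_8859_1);
        try {
            props.load(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

After escaping, the file content is pure ASCII, so it survives any editor or version control setup, at the cost of being unreadable to translators.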

What about input encoding? If there's an HTML form on a page and the input has multibyte characters in the query string (or POST data), are characters escaped to ISO-8859-1? My recollection was that HTTP headers must be ISO-8859-1.... but looking at the docs for PHP's mbstring and the encoding_translation parameter, it looks like server-side handling of the request needs to account for other character set encodings. Do browsers honor charset specification as a form attribute, like

<form action=... method=... accept-charset="UTF-8">
(looks like Struts supports this) or is it presumed that the browser always escapes unicode? Or perhaps they simply URL encode the characters so it's a non-issue? On the server side, must the request handling do this
request.setCharacterEncoding("UTF-8");
String raw = request.getParameter("foo");
String clean = new String(raw.getBytes("ISO-8859-1"), "UTF-8");
or is it all supposed to transparently just work (obviating String cleansing) if request.setCharacterEncoding("UTF-8") is used? ...for all of the hand-waving in the docs for ResourceBundle, etc establishing a clear practice for input String handling in a webapp remains murky.
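To show what the "String cleansing" line is doing, here is a self-contained Java sketch with no servlet container involved. It assumes the container decoded UTF-8 request bytes as ISO-8859-1 (the assumption baked into the snippet above); the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of the ISO-8859-1 to UTF-8 re-decoding trick.
class ParamRecoder {
    // Simulates a container that read UTF-8 bytes but decoded them as ISO-8859-1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Recovers the original text: ISO-8859-1 maps every byte to one char,
    // so the raw bytes can be pulled back out and decoded properly.
    static String recode(String raw) {
        byte[] bytes = raw.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The catch is that recode is only safe on strings that really were misdecoded. Run it on a parameter that was already decoded correctly and any character outside ISO-8859-1 gets replaced, which is one more reason to want a single clear practice rather than cleansing by habit.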

As far as sending responses, is it safe to always just send UTF-8 and include "charset=UTF-8" in the Content-type header? Is it standard practice to presume that the client will send a request header Accept-Charset (which indicates what an acceptable response is)? If they send it and UTF-8 isn't on the list, must the server go through a big String re-writing exercise to encode response to the browser's preference or is UTF-8 presumed to be implicitly acceptable at all times?

So many questions... I'm still digging for answers.

( Dec 12 2004, 11:51:01 PM PST ) Permalink


20041207 Tuesday December 07, 2004

Runtime insertion of struts tiles in Velocity

I've been impressed with TilesTool, which comes in the Velocity Tools package. It runs Velocity views through the struts MVC machine for processing reusable "subviews". However, there's no support for runtime insertion of components!

You can do this in tiles-defs.xml

   <definition name=".dog" extends=".animal.layout">
     <put name="body"    value=".dog.display" /> 
     <put name="head"    value=".dog.head" /> 
   </definition>
   <definition name=".dog.head" extends=".head">
     <put name="titleKey" value="dog.title" />
   </definition>
   <definition name=".dog.display" 
     controllerUrl="/dog.do"
     path="/tile/dog.vm"
     />
and so forth. Declarative tile composition works just fine. But what about programmatic composition at runtime?

With JSTL and struts, I can do this:

<c:forEach var="bit" items="${kibble}">
  <tiles:insert page="/tile/bark.jsp">
    <tiles:put name="bit" beanName="bit" />
  </tiles:insert>
</c:forEach>
I would imagine that the Velocity equivalent would look like this:
<ol>
#foreach ($bit in $kibble)
 $tiles.put("/tile/bark.vm", { "bit" : $bit })
#end
</ol>
but alas, it's not implemented by TilesTool. I can work around this by moving "bark.vm" to its own velocimacro but that is fugly as hell. I would prefer parameterized components.

( Dec 07 2004, 06:53:07 AM PST ) Permalink


20041206 Monday December 06, 2004

Servlet container forward from inside Velocity

I ported some JSP UI code to Velocity; it's been fun learning the Velocity paradigm (being able to cleanly process template components outside the container rocks). One of the things in the JSP UI handled forwarding requests that bypassed the struts controller back through struts.

In JSP with struts tags, it looks like this (assume web.xml has "struts-logic" mapped):

<%@ taglib uri="struts-logic" prefix="logic" %>
<logic:redirect forward="home"/>
But what about Velocity? Well, it turns out that the VelocityViewServlet stuffs the basic servlet container things into the Velocity context, much like JSTL does in JSPville. Ergo, the $request object itself can be invoked like this:
$request.getRequestDispatcher("/home.do").forward($request,$response)
Seems kinda grotty to not be able to use the struts symbolic name, but so far that's where my read of the Velocity docs has taken me. As I unpeel the onion, I may be inspired to subclass the VelocityViewServlet as a StrutsViewServlet... it seems like however you're invoking the rendering, you should be able to access, if present, other runtime services such as struts, spring, etc.

( Dec 06 2004, 10:05:35 AM PST ) Permalink


20041205 Sunday December 05, 2004

memcached in a service oriented functionality ecosystem

Among my efforts over recent months have been those focused on decoupling. Technorati has a very high update rate as it taps the ping streams, fetches update contents, analyzes links and keyword indexes the substance of posts in the blogosphere. Such a system doesn't work well when components are closely coupled; the availability of the whole system is subject to the whim of the system's weakest links. Often, weaknesses are combinatorial; the weakness of the whole is greater than the weakness of the parts. That's what I'm focused on undoing. Fixing weaknesses in the components is important but decoupling them first is more so.

When folks say "service oriented architecture" it still connotes monolithicism to me. An architecture implies a level of structure definition that sounds rigid; can you re-pour that foundation to adapt to redrawn plans? Software development agility and loose coupling should reinforce each other. I prefer to think of architectures and ecosystems. A service oriented functionality ecosystem supplies application functionality as a suite of services. Supporting requirements (as opposed to the core business requirements) such as security, logging, persistence, redundancy and caching are each handled independently; they in turn may be provisioned as services that higher level services rely on. This is part of the evolution under way at Technorati; some of the changes are evident in Dave's recent posts but some are just revisions that we're quietly rolling out.

Queues and distributed memory caches are natural elements of such an environment. In the December issue of Linux Journal, Technorati's use of open source building blocks such as memcached is discussed by Doc Searls.

This is the game:
A memcached server (or a set of servers) can be accessed over the network to store things in a table kept in RAM. When storing things, you can specify a maximum age for the cache entry -- if you go back to fetch it and the elapsed time since it was stored exceeds that age, it gets treated as a cache miss.

Storing things in memcached with the timeout parameter and invalidating cache entries works as long as you have a consistent mechanism for calculating the key. If internally you're managing "stories" and each one has an "id" attribute that is unique (a primary key), that's a good candidate to store them with. So for instance putting memcache inside a content management system (CMS) "content service" seems natural. In babytalk code:

  public Story fetchStory(int storyId) {
      Story story = memc.get(storyId);
      if (story != null) // cache hit; perhaps more rigorous validation of the fetched object
          return story;
      // cache miss: load from the database and cache it for next time
      story = StoryDB.findById(storyId);
      memc.put(storyId, story, AGE);
      return story;
  }

If it's difficult to determine whether something is new or an update because it doesn't have an id and uniqueness is determined by some combination of attributes, then the lookup cycle can be helped by caching with composite keys. It gets a little more complicated:
  public Story fetchStory(Map attrs) {
      // encapsulate whatever attributes uniquely identify a thing
      CacheKey key = new CacheKey(attrs);
      Story story = memc.get(key);
      if (story != null) // cache hit
          return story;
      // cache miss: load from the database and cache it for next time
      story = StoryDB.findByAttrs(attrs);
      memc.put(key, story, AGE);
      return story;
  }
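The babytalk above leans on classes that aren't shown, so here is a runnable Java sketch of the same look-aside pattern. Everything in it is a stand-in: TtlCache is a hypothetical in-memory map playing the role of a memcached client, stories are plain Strings, the "database" is a function passed in, and the composite key is just the list of attribute values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical in-memory stand-in for a memcached client with a max age per entry.
class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAt;
        Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();

    // Returns null on a miss; an entry past its max age counts as a miss.
    V get(K key, long nowMillis) {
        Entry<V> e = entries.get(key);
        if (e == null || nowMillis >= e.expiresAt) {
            entries.remove(key);
            return null;
        }
        return e.value;
    }

    void put(K key, V value, long maxAgeMillis, long nowMillis) {
        entries.put(key, new Entry<>(value, nowMillis + maxAgeMillis));
    }
}

// Hypothetical content service: check the cache, fall back to the store, then cache.
class StoryService {
    private final TtlCache<List<String>, String> cache = new TtlCache<>();
    private final Function<List<String>, String> backingStore; // stands in for StoryDB
    private final long maxAgeMillis;
    int storeLookups = 0; // exposed only so the effect of caching is visible

    StoryService(Function<List<String>, String> backingStore, long maxAgeMillis) {
        this.backingStore = backingStore;
        this.maxAgeMillis = maxAgeMillis;
    }

    String fetchStory(long nowMillis, String... attrs) {
        // Composite key: a copied list has value-based equals/hashCode.
        List<String> key = new ArrayList<>(Arrays.asList(attrs));
        String story = cache.get(key, nowMillis);
        if (story != null) {
            return story; // cache hit
        }
        storeLookups++;
        story = backingStore.apply(key);
        cache.put(key, story, maxAgeMillis, nowMillis);
        return story;
    }
}
```

Fetching the same attribute combination twice within the max age hits the backing store once; once the age has elapsed, the next fetch goes back to the store and refreshes the entry.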

We're in the process of evolving Technorati's infrastructure to one that is loosely coupled, redundant and robust. Our use of memcached is one of the enabling technologies of that evolution.

( Dec 05 2004, 09:22:23 AM PST ) Permalink


20041123 Tuesday November 23, 2004

Adding a file system hierarchy into a CVS module

I had a whole bunch of code that needed to be added to a CVS repository. Normally with just a new directory or two and a few files, doing "cvs add" for each one is no big deal. But when there's a whole file system hierarchy, that can be a real PITA. This is where "cvs import" comes in handy.

I usually only use "cvs import" to create a new CVS module but it can also be used to do a "bulk add." Maybe it's common knowledge for CVS jockeys but it's easy to forget about unless oft-used. Here's the scenario:

There, that's a lot easier than individually adding directories and files.

( Nov 23 2004, 11:09:35 AM PST ) Permalink


20041119 Friday November 19, 2004

Editing Perl with Eclipse and EPIC

I've been bouncing between development in Perl and Java a lot this week, two languages that I both love and loathe at various times. One of the things that I love about developing in Java is using Eclipse. This week I decided to give the Eclipse support for Perl a spin. Well, I'll be saving my love for the Java support.

The only gripe I've heard about Eclipse that I haven't had a good answer for is the absence of Emacs key bindings. Otherwise, what's there not to dig about Eclipse?

Dollar costs: zilch
Support for refactoring: renaming and changing method signatures, moving them around... all with dependent references kept intact
Extracting interfaces: OK, that falls under refactoring but worthy of its own mention
Source and javadoc stub automation: spontaneous method stub creation, constructor and accessor creation, javadoc stubs
Test case generation and running: JUnit and ant awareness, yum
Syntactical and semantic error highlighting: fixing errors early and often is easy because they're usually obvious

I was hopeful that the EPIC plugin would provide at least some of those things for Perl development. This is what I found:

OK, that's a pretty good start. But some things I wanted, like syntax completion (e.g. when editing Java, you can type "for", hit shift-space to pull up options to loop over an array or a Collection and voila: a for loop is materialized), are missing. So basically all of the goodness you get for Java development is lacking for Perl. Nonetheless, I think it's a promising start. I'll be trying the EPIC updates from time to time as new occasions to develop in Perl present themselves.

In the meantime, you can enjoy the fruits of this week's labor by pulling it off of CPAN; that's where you can get WebService::Technorati. It's also part of the latest release of the Technorati web services SDK. Thanks to David Wheeler for turning me on to Pod::Simple::HTML ...I'm still trying to figure out how he gets it to output nice docs from pod; mine didn't come out nearly that purty. Ah well, I guess that'll be part of next week's Perl fun.

( Nov 19 2004, 10:59:25 PM PST ) Permalink


20041117 Wednesday November 17, 2004

XML::Parser on Mac OS X

I needed to fiddle with XML::XPath on my powerbook today; it depends on XML::Parser. Complacent with how most unixy things I want to do JFW on Mac OS X, I dropped down to my CPAN shell and typed "install XML::Parser" -- bzzzt!

It turns out that expat is not installed, grrr. So I fired up Fink Commander and had it gimme some expat lovin'. Tried it again -- bzzzt! This is what I did in the CPAN shell

cpan> o conf makepl_arg "EXPATLIBPATH=/sw/lib EXPATINCPATH=/sw/include"
cpan> install XML::Parser
-- ding-ding-ding! We have a winner! XML::Parser installed! Thereafter, XML::XPath JFW'd and I'm on my way.

( Nov 17 2004, 04:53:44 PM PST ) Permalink


20041107 Sunday November 07, 2004

Tomcat's "Content-type" header parsing busted?

One bit of fun this week was trying to figure out why some XML output I was working on was mangling characters. I thought I was doing all of the right things as far as handling the data goes. Well, I think I was but Tomcat 5.0.28 wasn't.

I poked around the Jakarta bug database and the only mention I could find that came close was PR 31442, which described having this

<%@ page language="java" contentType="text/html; charset=UTF-8" %>
<%@page pageEncoding="UTF-8"%>
and saying that the text was coming back ISO8859-1 when the page is requested as a GET but not as a POST. Well, someone from the Jakarta project marked the bug INVALID glibly saying to ask on the user's mailing list and look at the Connector configuration because it's not a bug. WTF? Are you kidding?

Now I looked around in the Connector stanzas that come in the server.xml and see no mention of encoding configuration attributes. I've got a real simple test case.

<% response.setContentType("text/xml"); %>
triggers no funny encoding behavior, I get the data out as good old utf8 just as I wanted but if I did this
<% response.setContentType("text/xml; charset=UTF-8"); %>
....kablooey! Mangled encoding! That's just wrong. And if it's not wrong, I think it warrants a better answer than RTFM on the Connectors.

And the problem may not just be isolated to JSP handling. Judging from other reports that are turning up in Google's index pertaining to SetCharacterEncodingFilter, it's affecting the filter implementation as well.

( Nov 07 2004, 02:46:58 AM PST ) Permalink


20041106 Saturday November 06, 2004

What are all of these stupid people doing in my country?

There were some severe system problems last week that pretty much knocked this site out of commission; I'm hoping it's all in the past now.

In the meantime, the Big Lie that waging war on Iraq has some relationship to 9/11 and terrorism apparently has been successfully Jedi mind-tricked into the American psyche and we're destined to have four more years of high crimes and misdemeanors. It just makes me wonder what is up with the rest of the country. Plenty of folks abroad are, evidently, equally perplexed by this election, as we see in a recent Daily Mail cover.

If you're single, there are some Canadians offering asylum. I'm thinking of packing up the family and moving to New Zealand or something.
Just to keep track of where I don't want to be, I'm reckoning with the map:

Strong Kerry (146)
Weak Kerry (37)
Barely Kerry (69)
Exactly tied (0)
Barely Bush (30)
Weak Bush (66)
Strong Bush (183)
Needed to win: 270
 
Do you live in a state of stupidity?
Apparently 59,054,087 of you do.

Rank State Avg. IQ 2004 vote
1 Connecticut 113 Kerry
2 Massachusetts 111 Kerry
3 New Jersey 111 Kerry
4 New York 109 Kerry
5 Rhode Island 107 Kerry
6 Hawaii 106 Kerry
7 Maryland 105 Kerry
8 New Hampshire 105 Kerry
9 Illinois 104 Kerry
10 Delaware 103 Kerry
11 Minnesota 102 Kerry
12 Vermont 102 Kerry
13 Washington 102 Kerry
14 California 101 Kerry
15 Pennsylvania 101 Kerry
16 Maine 100 Kerry
17 Virginia 100 Bush
18 Wisconsin 100 Kerry
19 Colorado 99 Bush
20 Iowa 99 Bush
21 Michigan 99 Kerry
22 Nevada 99 Bush
23 Ohio 99 Bush
24 Oregon 99 Kerry
25 Alaska 98 Bush
26 Florida 98 Bush
27 Missouri 98 Bush
28 Kansas 96 Bush
29 Nebraska 95 Bush
30 Arizona 94 Bush
31 Indiana 94 Bush
32 Tennessee 94 Bush
33 North Carolina 93 Bush
34 West Virginia 93 Bush
35 Arkansas 92 Bush
36 Georgia 92 Bush
37 Kentucky 92 Bush
38 New Mexico 92 Bush
39 North Dakota 92 Bush
40 Texas 92 Bush
41 Alabama 90 Bush
42 Louisiana 90 Bush
43 Montana 90 Bush
44 Oklahoma 90 Bush
45 South Dakota 90 Bush
46 South Carolina 89 Bush
47 Wyoming 89 Bush
48 Idaho 87 Bush
49 Utah 87 Bush
50 Mississippi 85 Bush
There you have it: the closer you are to the coasts or Lake Michigan, the more likely you're not a dumbass.
I was never particularly enamored with John Kerry, in fact I would've been happy with a Wesley Clark-Howard Dean ticket. Nonetheless, I don't think Kerry would have been so driven to a war that he would have disregarded counter-indicative intelligence and the advice of allies to wage one.
Another idea that no longer seems entirely ridiculous is to secede from the union. Seriously, who wants to be part of this country when California's vote is underrepresented in the electoral college and yet our youth are being sent to Fallujah to wage war against a culture and people most people here know little of? We're the country's vegetable stand and its cannon fodder. I don't think so. Suddenly, Ecotopia sounds like a reasonable proposition.
Independence for California!

( Nov 06 2004, 01:12:57 PM PST ) Permalink


20041023 Saturday October 23, 2004

Rock stars and actresses

What is it with rock stars and actresses? Hopefully there are no homemade videos of them but apparently Lars Ulrich and Connie Nielsen are a thing.

Remember Lucilla in Gladiator? Yea, that's Connie Nielsen.

Here's a guy with two sons and a wife of 7 or 8 years going to fashion shows, art auctions and movie premieres with his Danish girlfriend. Oh, Lars: you're so damned hollywood! Apparently the paparazzi in Denmark have kept tabs on them as well.

Back in the old days o' Metallica we had loads of fun but didn't go to fashion shows, art auctions and movie premieres. We didn't sip fine wines either. Oh well, I hope the dude is happy.

( Oct 23 2004, 07:29:35 PM PDT ) Permalink


20041022 Friday October 22, 2004

Technorati's referer-driven cosmos

There's an it's-so-simple-it's-cool feature on Technorati. You can construct a link to get the cosmos for a page simply by linking from it to a specific URL.

After raising the notion with Tantek, he plugged in the trivial bit to enable this on the Technorati site.
Check it out: http://www.technorati.com/cosmos/referer.html (ok, so I'm not very popular in this big ol' cosmos but anyway...). This is what you do:


Try it!

( Oct 22 2004, 07:54:11 PM PDT ) Permalink


20041021 Thursday October 21, 2004

Technorati Party

All work and no play makes Ian a dull boy, so to keep me jazzed Technorati is having a party!

OK, I lied. It ain't about me, it's about our new office and the major milestones that Technorati is achieving, the agony of startup setbacks and the ecstasy of... having fun! The details:

WHEN: Thursday, October 28, 7 p.m.
WHERE: Technorati, 665 Third Street #207, San Francisco
WHAT: A party to catch up, and celebrate the move to our new offices!
RSVP: rsvp@technorati.com. As space is always limited, please be sure to RSVP.
Bring Your Own Lampshade

Here's Dave's original post.

( Oct 21 2004, 04:34:11 PM PDT ) Permalink