Creating a Rudimentary Tokenizer for the Delicious Data Set

To tokenize all the files in the Delicious data set and make them searchable, I came up with the following basic scheme:

  1. Decode all HTML entities, leaving only the characters they represent (see html_entity_decode).
  2. Strip out all HTML tags (see strip_tags).
  3. Convert everything to lower case (see strtolower).
  4. Strip out all numbers using the \b\d+\b regular expression (see preg_replace).
  5. Replace all remaining non-alphabetic characters with spaces.
  6. Collapse runs of whitespace into a single space.

This strategy translates very cleanly into PHP as shown below:

class WordTokenizer
{
	public static function tokenizeFile($file)
	{
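		$contents = file_get_contents($file);
		
		// decode HTML entities into the characters they represent
		$contents = html_entity_decode($contents);
		
		// strip HTML tags, then convert to lower case
		$contents = strtolower(strip_tags($contents));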
		
		// strip numbers
		$contents = preg_replace('/\b\d+\b/', ' ', $contents);
		
		// strip punctuation symbols & non-alphabetic symbols
		$contents = preg_replace('/[^a-z]/', ' ', $contents);
		
		// replace multiple spaces with just one
		$contents = preg_replace('/\s+/', ' ', $contents);
		
		return $contents;
	}
}

The full source code is available in the information retrieval repository.

Setting Up a Tag Database for the Delicious Data Set

One of my assignments this past semester involved performing search operations on the tags supplied with the Delicious data set. I created a database named “contentindex” and then a table for the tag information from the XML file using the following SQL:

CREATE TABLE `tag` (
	`docname` VARCHAR(255) NOT NULL,
	`tagname` VARCHAR(255) NOT NULL,
	`weight` INT NOT NULL,
	UNIQUE index_doc_tag_pair_weight (`docname`, `tagname`)
)
ENGINE = InnoDB;

I set up an Eclipse Java project for the assignment. The MySQL Connector/J JDBC driver is required for interaction with the database. You will need to add the JAR file from the unzipped download to the Delicious Eclipse project. To do so, right click on the project -> Build Path -> Add External Archives…
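
With the driver on the build path, writing a row into the `tag` table could look something like the sketch below. This is only an illustration, not the code from the assignment: the class name `TagTableWriterSketch`, the method `upsertTag`, and the decision to overwrite the weight when a (docname, tagname) pair already exists are all my assumptions. The `ON DUPLICATE KEY UPDATE` clause is MySQL-specific and relies on the unique index defined above.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper for illustration; not taken from the original project.
class TagTableWriterSketch {
    // Insert a row, or overwrite the weight if the (docname, tagname) pair
    // already exists. Overwriting is an assumption; skipping or summing the
    // weights would be equally plausible policies.
    static final String UPSERT_SQL =
        "INSERT INTO tag (docname, tagname, weight) VALUES (?, ?, ?) "
        + "ON DUPLICATE KEY UPDATE weight = VALUES(weight)";

    // The caller supplies an open connection, for example one obtained with
    // DriverManager.getConnection("jdbc:mysql://localhost:3306/contentindex", user, password),
    // where the host, port, and credentials are placeholders.
    static int upsertTag(Connection conn, String docname, String tagname, int weight)
            throws SQLException {
        // try-with-resources closes the statement even if the update fails
        try (PreparedStatement stmt = conn.prepareStatement(UPSERT_SQL)) {
            stmt.setString(1, docname);
            stmt.setString(2, tagname);
            stmt.setInt(3, weight);
            return stmt.executeUpdate();
        }
    }
}
```

Using a prepared statement keeps tag names containing quotes or other special characters from breaking the SQL.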

A list of XML parsers that could be used on the tag info XML file is available on the SAX website. I settled on the Xerces parser. Setup was straightforward: extract the download into a folder, create a corresponding Java project in Eclipse, and add the Xerces project to the Delicious Eclipse project’s build path (right click on the project -> Build Path -> Configure Build Path… -> “Projects” tab -> Add…).
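
To give a feel for what SAX parsing of the tag info might involve, here is a minimal sketch that turns XML into (docname, tagname, weight) rows matching the `tag` table. The XML layout it expects (`document` elements with a `name` attribute containing `tag` elements with `name` and `weight` attributes) is a made-up stand-in, not the real structure of the Delicious file, and the class and method names are hypothetical. It uses the JAXP parser bundled with the JDK rather than a standalone Xerces build, though the handler code would be the same either way.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; the element and attribute names are assumptions.
class TagInfoSaxSketch {
    // One row destined for the `tag` table.
    static final class TagRow {
        final String docname;
        final String tagname;
        final int weight;

        TagRow(String docname, String tagname, int weight) {
            this.docname = docname;
            this.tagname = tagname;
            this.weight = weight;
        }
    }

    static List<TagRow> parse(String xml) {
        final List<TagRow> rows = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private String currentDoc; // document currently being read, if any

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("document".equals(qName)) {
                    currentDoc = attrs.getValue("name");
                } else if ("tag".equals(qName) && currentDoc != null) {
                    // A missing or non-numeric weight raises NumberFormatException.
                    int weight = Integer.parseInt(attrs.getValue("weight"));
                    rows.add(new TagRow(currentDoc, attrs.getValue("name"), weight));
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("document".equals(qName)) {
                    currentDoc = null;
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse tag info XML", e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String xml = "<taginfo><document name=\"a.html\">"
            + "<tag name=\"php\" weight=\"3\"/></document></taginfo>";
        for (TagRow row : parse(xml)) {
            System.out.println(row.docname + " " + row.tagname + " " + row.weight);
        }
    }
}
```

Because SAX streams through the file and fires callbacks as elements open and close, the handler only has to remember the current document name, which keeps memory use low even for a large XML file.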

The source code for the XML tag info parser is available on my GitHub repo.