I went through the chapter on spidering and clustering. Using the downloaded code, I generated a second dataset from the same list of feeds. Below are two dendrograms, the first generated using the original dataset (that appears in the book) and the second generated using my dataset.
The structures of the dendrograms are noticeably different, but there are similar clusters. The first dendrogram contains a cluster of five blogs (The Unofficial Apple Weblog, Download Squad, Autoblog, Joystiq, and Engadget) that all remain close together in the second, though a few other blogs are grouped together with them. This group is indicated by the blue box in both diagrams. We can also see a larger cluster of many web-related blogs, most of which (including Blogger's Blog, Publishing 2.0, and several Google and search-related blogs) remain close in the second dendrogram. This group, indicated in green, seems to vary more between the two diagrams.
2. Clustering: Capitol Words data
I looked through the APIs available at programmableweb.com and decided to build a dataset using Capitol Words, which tracks word usage by members of the U.S. Congress. I obtained a list of legislators from the Sunlight Labs API, and used the Python interface to the Capitol Words API to build a list of the most frequently used words of 112 current and recent senators. For each senator, I downloaded their 250 most used words along with the number of times they've used those words. The resulting dataset contains 5969 distinct words. For each senator and each word, it records the number of times the senator has used that word, according to CW, if the word is in their top 250 words (and 0 otherwise).
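The construction described above can be sketched as follows. This is a minimal illustration, not the code I actually ran: the function name `build_matrix` and the sample counts are made up, and it assumes the per-senator word counts have already been downloaded into a dictionary.

```python
def build_matrix(top_words):
    """Turn {senator: {word: count}} into a senator-by-word count matrix.

    top_words is assumed to hold each senator's top words only (hypothetical
    input format; the real data came from the Capitol Words API).
    """
    # Columns: the sorted union of every senator's top words.
    vocabulary = sorted({w for counts in top_words.values() for w in counts})
    names = sorted(top_words)
    # A word outside a senator's top list is recorded as 0.
    rows = [[top_words[name].get(w, 0) for w in vocabulary] for name in names]
    return names, vocabulary, rows


# Made-up sample data, for illustration only.
sample = {
    "Senator A": {"health": 12, "energy": 7},
    "Senator B": {"energy": 9, "tax": 4},
}
names, vocabulary, rows = build_matrix(sample)
```

Each row is one senator and each column is one word from the union, which is why most entries end up being 0.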
We can use this dataset and the PCI code to run several clustering algorithms. For example, k-means clustering with k = 5 returns this result:
['"Sen. Charles E. Grassley(R--IA)"', '"Sen. Daniel Kahikina Akaka(D--HI)"', '"Sen. Susan M. Collins(R--ME)"', '"Sen. Christopher J. Dodd(D--CT)"', '"Sen. Hillary Rodham Clinton(D--NY)"', '"Sen. Tim P. Johnson(D--SD)"']
['"Sen. Michael Bennet(D--CO)"', '"Sen. Arlen Specter(R--PA)"', '"Sen. James H. Webb(D--VA)"', '"Sen. Evan Bayh(D--IN)"', '"Sen. Russell D. Feingold(D--WI)"', '"Sen. Kay R. Hagan(D--NC)"']
['"Sen. Jeanne Shaheen(D--NH)"', '"Sen. Pete V. Domenici(R--NM)"', '"Sen. Harry M. Reid(D--NV)"', '"Sen. Robert C. Byrd(D--WV)"', '"Sen. Thomas Harkin(D--IA)"', '"Sen. Mark E. Udall(D--CO)"', '"Sen. Bernard Sanders(I--VT)"', '"Sen. Daniel K. Inouye(D--HI)"', '"Sen. Barbara A. Mikulski(D--MD)"', '"Sen. Charles T. Hagel(R--NE)"', '"Sen. Edward M. Kennedy(D--MA)"', '"Sen. Sheldon Whitehouse(D--RI)"', '"Sen. John D. Rockefeller(D--WV)"', '"Sen. Thomas Allen Coburn(R--OK)"', '"Sen. Thomas Richard Carper(D--DE)"', '"Sen. Dianne Feinstein(D--CA)"', '"Sen. Sherrod C. Brown(D--OH)"', '"Sen. Mark R. Warner(D--VA)"', '"Sen. Joseph R. Biden(D--DE)"', '"Sen. John E. Sununu(R--NH)"', '"Sen. Carl Levin(D--MI)"', '"Sen. Jeff Bingaman(D--NM)"', '"Sen. John H. Isakson(R--GA)"', '"Sen. Amy Klobuchar(D--MN)"', '"Sen. Max S. Baucus(D--MT)"', '"Sen. Tom S. Udall(D--NM)"', '"Sen. Frank R. Lautenberg(D--NJ)"', '"Sen. Mary L. Landrieu(D--LA)"', '"Sen. Richard G. Lugar(R--IN)"', '"Sen. Benjamin L. Cardin(D--MD)"', '"Sen. Robert Menendez(D--NJ)"', '"Sen. Elizabeth H. Dole(R--NC)"', '"Sen. Jon Tester(D--MT)"', '"Sen. E. Benjamin Nelson(D--NE)"', '"Sen. Ted Stevens(R--AK)"', '"Sen. Barack H. Obama(D--IL)"', '"Sen. Bill Nelson(D--FL)"', '"Sen. Maria Cantwell(D--WA)"', '"Sen. Larry E. Craig(R--ID)"', '"Sen. Patty Murray(D--WA)"', '"Sen. Edward E. Kaufman(D--DE)"', '"Sen. Ken Salazar(D--CO)"', '"Sen. Mark Begich(D--AK)"', '"Sen. John William Warner(R--VA)"', '"Sen. Roland W. Burris(D--IL)"']
['"Sen. Pat Roberts(R--KS)"', '"Sen. Jim W. DeMint(R--SC)"', '"Sen. Samuel D. Brownback(R--KS)"', '"Sen. Michael B. Enzi(R--WY)"', '"Sen. George V. Voinovich(R--OH)"', '"Sen. Richard C. Shelby(R--AL)"', '"Sen. Kent Conrad(D--ND)"', '"Sen. Robert F. Bennett(R--UT)"', '"Sen. Melquiades Rafael Martinez(R--FL)"', '"Sen. Claire McCaskill(D--MO)"', '"Sen. Bob Corker(R--TN)"', '"Sen. Lisa A. Murkowski(R--AK)"', '"Sen. Wayne A. Allard(R--CO)"', '"Sen. Lamar Alexander(R--TN)"', '"Sen. John Francis Reed(D--RI)"', '"Sen. John Eric Ensign(R--NV)"', '"Sen. Olympia Jean Snowe(R--ME)"', '"Sen. David B. Vitter(R--LA)"', '"Sen. C. Saxby Chambliss(R--GA)"', '"Sen. Lindsey O. Graham(R--SC)"', '"Sen. Charles E. Schumer(D--NY)"', '"Sen. John R. Thune(R--SD)"', '"Sen. Mitch McConnell(R--KY)"', '"Sen. Kay Bailey Hutchison(R--TX)"', '"Sen. James E. Risch(R--ID)"', '"Sen. Byron L. Dorgan(D--ND)"', '"Sen. John Sidney McCain(R--AZ)"', '"Sen. Jon Kyl(R--AZ)"', '"Sen. John Barrasso(R--WY)"', '"Sen. Jim Bunning(R--KY)"', '"Sen. John Cornyn(R--TX)"', '"Sen. Roger F. Wicker(R--MS)"', '"Sen. Christopher S. Bond(R--MO)"', '"Sen. Mike O. Johanns(R--NE)"', '"Sen. Joseph I. Lieberman(I--CT)"', '"Sen. Thad Cochran(R--MS)"']
['"Sen. Michael D. Crapo(R--ID)"', '"Sen. Gordon Harold Smith(R--OR)"', '"Sen. Barbara Boxer(D--CA)"', '"Sen. Patrick J. Leahy(D--VT)"', '"Sen. Mark Lunsford Pryor(D--AR)"', '"Sen. Debbie Ann Stabenow(D--MI)"', '"Sen. Herbert H. Kohl(D--WI)"', '"Sen. Norm Coleman(R--MN)"', '"Sen. Judd A. Gregg(R--NH)"', '"Sen. Richard M. Burr(R--NC)"']
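For reference, the k-means step that produced the five groups above can be sketched roughly like this. It is not the PCI code itself: the function names are my own, it uses a Pearson-based distance as an assumption about what suits word-count rows, and it seeds the centroids from existing rows rather than from random points.

```python
import random
from math import sqrt


def pearson_distance(v1, v2):
    """Return 1 - Pearson correlation, so 0 means perfectly correlated."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    p_sum = sum(a * b for a, b in zip(v1, v2))
    # max(..., 0.0) guards against tiny negative values from rounding.
    var1 = max(sum(x * x for x in v1) - sum1 ** 2 / n, 0.0)
    var2 = max(sum(x * x for x in v2) - sum2 ** 2 / n, 0.0)
    den = sqrt(var1 * var2)
    if den == 0:
        return 1.0  # a constant vector has no defined correlation
    return 1.0 - (p_sum - sum1 * sum2 / n) / den


def kmeans(rows, k, distance=pearson_distance, max_iter=100, seed=0):
    """Group row indices into k clusters; returns a list of index lists."""
    rng = random.Random(seed)
    # Assumption: start from k distinct rows as the initial centroids.
    centroids = [list(r) for r in rng.sample(rows, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every row to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: distance(centroids[c], row))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Move each centroid to the mean of its members (empty ones stay put).
        for c in range(k):
            members = [rows[i] for i, a in enumerate(assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [[i for i, a in enumerate(assignment) if a == c] for c in range(k)]


# Made-up rows, for illustration only.
demo_rows = [[1, 2, 3, 0, 0, 0], [2, 4, 6, 0, 0, 0],
             [0, 0, 0, 1, 2, 3], [0, 0, 0, 3, 6, 9]]
demo_clusters = kmeans(demo_rows, k=2)
```

Because the starting centroids are random, different seeds can give different groupings, so the lists above are one run's result rather than a fixed answer.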
We can also use hierarchical clustering, which generates the following dendrogram:
We can also use the book's method for visualizing data in two dimensions:
All of these methods show that the senators, as represented by this dataset, are roughly evenly spaced. However, there are some small clusters in which one party predominates, or a few locations in which senators from the same areas are close together. For example, Arizona Republicans John McCain and Jon Kyl are grouped together by both algorithms. This is particularly evident in the dendrogram, and I've marked a few clusters which show this.
Part of the reason that the senators are so evenly spaced may be that, because of the way the dataset was constructed, there is not enough overlapping data between them. The only data used to construct the dataset was each senator's top 250 words, so each senator has a maximum of 250 nonzero entries. (The reason I constructed it in this way is that the CW API doesn't provide a direct way to get the number of times a senator has used a particular word. The only option is to get the top X words a senator has used, and check to see if the particular word is in the list.)
A better dataset might be obtained by increasing the number of words to search through for each senator. However, including more words in the wordlist would quickly make the dataset larger and more difficult to process. A good strategy might be to take as the wordlist the union of each senator's top 250 (or possibly even fewer) words, and to check that wordlist against each senator's top 500 (or 1000, or more) words to obtain frequencies.
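A rough sketch of that strategy is below. I haven't run this against the real API: `fetch_top_words` is a hypothetical stand-in for whatever call returns a senator's top N words as (word, count) pairs sorted by descending count, and the sample data is invented.

```python
def build_wide_matrix(senators, fetch_top_words, vocab_n=250, lookup_n=1000):
    """Build counts over a small vocabulary using a deeper per-senator lookup.

    fetch_top_words(senator, n) is a hypothetical callable returning up to n
    (word, count) pairs, most frequent first.
    """
    # Fetch once at the larger depth; the top vocab_n words are its prefix.
    ranked = {s: fetch_top_words(s, lookup_n) for s in senators}
    # Vocabulary: union of every senator's top vocab_n words.
    vocabulary = sorted({w for s in senators for w, _ in ranked[s][:vocab_n]})
    rows = []
    for s in senators:
        counts = dict(ranked[s])  # lookup over the deeper list
        rows.append([counts.get(w, 0) for w in vocabulary])
    return vocabulary, rows


# Invented data standing in for API responses.
fake_data = {
    "A": [("health", 9), ("energy", 5), ("tax", 2)],
    "B": [("tax", 8), ("energy", 6), ("war", 1)],
}


def fake_fetch(senator, n):
    return fake_data[senator][:n]


shallow_vocab, shallow_rows = build_wide_matrix(["A", "B"], fake_fetch, 1, 1)
deep_vocab, deep_rows = build_wide_matrix(["A", "B"], fake_fetch, 1, 3)
```

In the toy example the vocabulary stays the same size in both runs, but the deeper lookup fills in a count that the shallow one records as 0, which is the extra overlap the strategy is after.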

