I was reading Slashdot yesterday, and some (inane) comments about people’s user ID numbers made me curious about the overall distribution of IDs in Slashdot discussions. It didn’t take long to apply what I’d learned about the Beautiful Soup screen-scraping library, from writing my Twitter status backup script, to pull out some simple information.

Here’s a quick and crude little Python script that does the job against Slashdot’s HTML/CSS scheme as it stood yesterday. It prints one line per non-anonymous comment, with the ID number followed by the username, which you can then pipe through the Unix sort command with various options, e.g.:

slashdot_info.py '[slashdot story url]' | sort -un

I’m placing this code into the public domain after disclaiming any responsibility for your use of it. Suggested enhancement: add an option to count the comments for each user, and allow sorting by that count as well.
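As a rough sketch of that suggested enhancement (not part of the original script), the counting could be done on the script's output, assuming each line has the form "ID username". The names `count_comments` and `report`, and the sample lines at the bottom, are made up for illustration.

```python
from collections import Counter


def count_comments(lines):
    """Count how often each 'ID username' line occurs."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # split on the first space only, since usernames may contain spaces
        user_id, _, username = line.partition(' ')
        counts[(int(user_id), username)] += 1
    return counts


def report(counts):
    """Return (count, id, username) rows, busiest commenter first, ties by id."""
    rows = [(n, uid, name) for (uid, name), n in counts.items()]
    return sorted(rows, key=lambda row: (-row[0], row[1]))


# hypothetical sample lines standing in for the script's real output
sample_lines = ['12345 alice', '678 bob smith', '12345 alice']
for n, uid, name in report(count_comments(sample_lines)):
    print('%d %d %s' % (n, uid, name))
```

To run it on real data, `sample_lines` could be replaced with `sys.stdin` (after `import sys`) and the scraper's output piped in.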

(Update: This only grabs the first 50 comments. I’m not invested enough in the problem at the moment to want to figure out how to grab the entire discussion. That’s another exercise left to you.)

#!/usr/bin/python3

# list slashdot usernames and ids for story
# can slice/dice with "sort"

import sys
import re
import datetime
from urllib.request import urlopen
# Beautiful Soup 4 (package "bs4"); the old "BeautifulSoup" module is Python 2 only
from bs4 import BeautifulSoup

if len(sys.argv) > 1:
    url = sys.argv[1]
else:
    print('url is required', file=sys.stderr)
    sys.exit(1)

pattern = r'''(?x)        # verbose mode
    >                     # end of "a href"
    ([^>]+)               # username to capturing group 1
    (?:                   # non-capturing group for user id + etc
      \s                  # whitespace between username and user id
      \(                  # literal ( starting user id
      ([0-9]+)            # user id to capturing group 2
      \)                  # literal ) ending user id
    )
    </a>                  '''
    # matches <span class="by">by <a href="//slashdot.org/%7Eusername">
    #         username (12345)</a></span>

print('url: %s\nat:  %s' % (url, datetime.datetime.today()), file=sys.stderr)

num_comments = 0
num_ac = 0
num_unmatched = 0

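# f = open('slashdot_user_stats_test.htm', 'rb')  # local copy for testing
with urlopen(url) as f:
    # name the parser explicitly; Beautiful Soup 4 warns if it has to guess
    soup = BeautifulSoup(f.read(), 'html.parser')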
commenters = soup.findAll('span', {'class': 'by'})
if commenters:
    r = re.compile(pattern)
    for by in commenters:
        by = str(by.renderContents().strip(), 'utf8')
        num_comments += 1
        if by == 'by Anonymous Coward':
            num_ac += 1
        else:
            m = r.search(by)
            if m:
                # user id first, then username, so "sort -n" orders by id
                print('%s %s' % (m.group(2), m.group(1)))
            else:
                print('oops, not AC and does not match expected pattern: %s' %
                      by, file=sys.stderr)
                num_unmatched += 1

print('%d comments\n%d anonymous cowards\n%d not matched' %
      (num_comments, num_ac, num_unmatched), file=sys.stderr)
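The regular expression can be checked on its own, without fetching a page. The sketch below reuses the pattern from the script and runs it against the byline format quoted in the script's comment. The helper name `parse_byline` and the sample strings are invented for this example, and real Slashdot markup may differ from them.

```python
import re

# same pattern as in the script above
pattern = r'''(?x)        # verbose mode
    >                     # end of "a href"
    ([^>]+)               # username to capturing group 1
    (?:                   # non-capturing group for user id + etc
      \s                  # whitespace between username and user id
      \(                  # literal ( starting user id
      ([0-9]+)            # user id to capturing group 2
      \)                  # literal ) ending user id
    )
    </a>                  '''

byline_re = re.compile(pattern)


def parse_byline(html):
    """Return (user_id, username) for a byline, or None if it does not match."""
    m = byline_re.search(html)
    if m is None:
        return None
    return int(m.group(2)), m.group(1)


# made-up bylines in the format the script expects
print(parse_byline('by <a href="//slashdot.org/%7Eusername">username (12345)</a>'))
print(parse_byline('by Anonymous Coward'))
```

Because group 1 is greedy and the user ID must sit directly before `</a>`, a username that itself contains parentheses still yields the trailing number as the ID.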