Python script: Extract slashdot user names and ID numbers from a story discussion

I was reading slashdot yesterday and some (inane) comments about people's user ID numbers made me curious about the overall distribution of IDs in slashdot discussions. It didn't take long to apply what I'd learned about the Beautiful Soup screen-scraping library, from writing my Twitter status backup script, and pull out some simple information.

Here's a quick and crude little Python script that does the job as of yesterday's slashdot HTML/CSS scheme. It prints the ID number and username for each non-anonymous comment, which you can then pipe through the Unix sort command with various options, e.g.:

slashdot_info.py '[slashdot story url]' | sort -un

I'm placing this code into the public domain after disclaiming any responsibility for your use of it. Suggested enhancement: Add option to get number of comments for each user and allow sorting by that variable also.
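For anyone who wants to try the suggested enhancement, here is a minimal sketch of one way to do it as a separate filter, rather than as an option in the script itself. It assumes input lines in the "ID username" form the script prints. The function names tally and report and the sample lines are made up for illustration.

```python
from collections import Counter


def tally(lines):
    """Count comments per (id, username) from 'id username' lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first space only, so usernames may contain spaces.
        user_id, _, name = line.partition(' ')
        counts[(int(user_id), name)] += 1
    return counts


def report(counts):
    """Return 'count id username' rows, busiest commenter first."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0][0]))
    return ['%d %d %s' % (n, uid, name) for (uid, name), n in ranked]


# Hypothetical sample input, standing in for the script's output.
sample = ['12345 alice', '678 bob', '12345 alice']
for row in report(tally(sample)):
    print(row)
```

To use it on real output, pass sys.stdin to tally() instead of the sample list, and feed it the script's raw output. Don't run it through sort -u first, since that collapses the duplicate lines you want to count.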

(Update: This only grabs the first 50 comments. I'm not invested enough in the problem at the moment to want to figure out how to grab the entire discussion. That's another exercise left to you.)

#!/usr/bin/python3

# list slashdot usernames and ids for story
# can slice/dice with "sort"

import sys
import re
import datetime
from urllib.request import urlopen
from bs4 import BeautifulSoup  # Beautiful Soup 4 (the "beautifulsoup4" package)

if len(sys.argv) > 1:
    url = sys.argv[1]
else:
    print('url is required', file=sys.stderr)
    sys.exit(1)

pattern = r'''(?x)        # verbose mode
    >                     # end of "a href"
    ([^>]+)               # username to capturing group 1
    (?:                   # non-capturing group for user id + etc
      \s                  # whitespace between username and user id
      \(                  # literal ( starting user id
      ([0-9]+)            # user id to capturing group 2
      \)                  # literal ) ending user id
    )
    </a>                  '''
    # matches <span class="by">by <a href="//slashdot.org/%7Eusername">
    #         username (12345)</a></span>

print('url: %s\nat:  %s' % (url, datetime.datetime.today()), file=sys.stderr)

num_comments = 0
num_ac = 0
num_unmatched = 0

f = urlopen(url)
# f = open('slashdot_user_stats_test.htm', 'rb')
soup = BeautifulSoup(f.read())
f.close()
commenters = soup.findAll('span', {'class': 'by'})
if len(commenters) > 0:
    r = re.compile(pattern)
    for by in commenters:
        by = str(by.renderContents().strip(), 'utf8')
        num_comments += 1
        if by == 'by Anonymous Coward':
            num_ac += 1
        else:
            m = r.search(by)
            if m:
                # strip stray whitespace/newlines around the username
                print('%s %s' % (m.group(2), m.group(1).strip()))
            else:
                print('oops, not AC and does not match expected pattern: %s' %
                      by, file=sys.stderr)
                num_unmatched += 1

print('%d comments\n%d anonymous cowards\n%d not matched' %
      (num_comments, num_ac, num_unmatched), file=sys.stderr)
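If you want to check the regular expression without fetching a page, here is a small self-contained sketch. It repeats the pattern in condensed form and runs it against a made-up byline modeled on the sample markup in the script's comment. The helper name parse_byline is hypothetical, and real slashdot markup may differ from this sample.

```python
import re

# Condensed copy of the script's pattern, repeated so this runs on its own.
PATTERN = re.compile(r'''(?x)   # verbose mode
    >               # end of the opening "a href" tag
    ([^>]+)         # username to capturing group 1
    \s\(            # whitespace, then literal ( starting user id
    ([0-9]+)        # user id to capturing group 2
    \)              # literal ) ending user id
    </a>''')


def parse_byline(by):
    """Return (user_id, username), or None if the byline doesn't match."""
    m = PATTERN.search(by)
    if m is None:
        return None
    return int(m.group(2)), m.group(1).strip()


# Made-up bylines for illustration only.
sample = 'by <a href="//slashdot.org/%7Eusername">username (12345)</a>'
print(parse_byline(sample))
print(parse_byline('by Anonymous Coward'))
```

Because the username group is greedy and backtracks only as far as the final " (digits)" before the closing tag, a username that itself contains spaces should still land whole in group 1.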
