Python Programming/Internet

From Wikibooks, the open-content textbooks collection



The urllib module, which is bundled with Python, can be used for web interaction. It provides a file-like interface for web URLs.

An example of reading the contents of a web page:

import urllib
pageText = urllib.urlopen("http://www.spam.org/eggs.html").read()
print pageText
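
The code above is written for Python 2. As a rough sketch of the same file-like interface in Python 3, where urlopen lives in urllib.request, the following reads a temporary local file through a file:// URL so that it does not depend on a live web server. The helper read_url and the file contents are assumptions made for this illustration.

```python
import tempfile
import urllib.request
from pathlib import Path


def read_url(url):
    """Return the body behind url, decoded as UTF-8 text (hypothetical helper)."""
    # The response object behaves like a file: it can be read and closed
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


# Stand-in for a web page: a temporary local file addressed by a file:// URL
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "eggs.html"
    page.write_text("<html><body>spam and eggs</body></html>", encoding="utf-8")
    page_text = read_url(page.as_uri())

print(page_text)
```

Passing an http:// URL to read_url would work the same way, provided the server answers and the page is UTF-8 encoded.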

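The GET and POST methods can be used, too. urllib.urlencode turns a dictionary of parameters into a query string; urlopen sends it with GET when it is appended to the URL, and with POST when it is passed as the second argument.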

import urllib
params = urllib.urlencode({"plato":1, "socrates":10, "sophokles":4, "arkhimedes":11})
 
# Using GET method
pageText = urllib.urlopen("http://international-philosophy.com/greece?%s" % params).read()
print pageText
 
# Using POST method
pageText = urllib.urlopen("http://international-philosophy.com/greece", params).read()
print pageText
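
To see what is actually sent in each case, the requests can be built without being sent. This is a minimal sketch in Python 3, where urlencode lives in urllib.parse and Request in urllib.request; the variable names base_url, get_request and post_request are chosen for this illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = urlencode({"plato": 1, "socrates": 10, "sophokles": 4, "arkhimedes": 11})
base_url = "http://international-philosophy.com/greece"

# GET: the encoded parameters become part of the URL
get_request = Request(base_url + "?" + params)

# POST: the encoded parameters travel in the request body (bytes in Python 3)
post_request = Request(base_url, data=params.encode("ascii"))

print(params)
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

Nothing is transmitted here; passing either request object to urllib.request.urlopen would perform the actual exchange.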

However, urlopen doesn't work for all URLs. If you try urllib.urlopen("http://www.google.com") and print the result, you should get a 403 error. The reason is the User-Agent header that Python sends to the site by default. A workaround is to send a browser-like User-Agent with urllib2:

import urllib2

def getsource(pagereq):
    # Present a browser-like User-Agent instead of Python's default one
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    header = {'User-Agent': user_agent}
    # The second argument is the POST data; None makes this a GET request
    req = urllib2.Request(pagereq, None, header)
    response = urllib2.urlopen(req)
    page = response.read()
    return page

def main():
    html = getsource("http://www.google.com/")
    print html

if __name__ == '__main__':
    main()

This makes Python identify itself as Internet Explorer 5.5 through the User-Agent header, which Google will accept.
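
The difference between the default header and the overridden one can be inspected without contacting any server. The following is a small sketch in Python 3, where urllib2's Request and build_opener are found in urllib.request; the names default_agent and req are chosen for this illustration.

```python
import urllib.request

url = "http://www.google.com/"
user_agent = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"

# Headers that a freshly built opener adds to every request by default
default_agent = dict(urllib.request.build_opener().addheaders)["User-agent"]

# A request carrying a browser-like User-Agent; it is built but not sent
req = urllib.request.Request(url, None, {"User-Agent": user_agent})

print(default_agent)                 # begins with "Python-urllib/"
print(req.get_header("User-agent"))  # header names are stored capitalized as "User-agent"
```

The first line printed is what a site sees when no header is supplied; the second is what it sees once the User-Agent is replaced.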
