Python Programming/Internet
From Wikibooks, the open-content textbooks collection
The urllib module, which is bundled with Python, can be used for web interaction. It provides a file-like interface for web URLs.
An example of reading the contents of a webpage:
    import urllib

    pageText = urllib.urlopen("http://www.spam.org/eggs.html").read()
    print pageText
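For readers on Python 3, where urlopen lives in urllib.request and read() returns bytes, the same file-like reading might look like the sketch below. The helper name fetch_text and the utf-8 default are assumptions made for illustration, and a data: URL stands in for a real page so that the sketch runs without network access.

```python
import urllib.request


def fetch_text(url, encoding="utf-8"):
    """Open a URL like a file and return its decoded contents.

    fetch_text is a hypothetical helper name; the encoding default is
    an assumption, since real pages declare their own charset.
    """
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding)


# A data: URL keeps this sketch runnable offline; an http:// URL
# such as the one used above would be read in the same way.
page_text = fetch_text("data:text/plain,spam%20and%20eggs")
print(page_text)
```

The with statement closes the response once the body has been read, much as it would for an ordinary file.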
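The HTTP GET and POST methods can be used, too. Encode the parameters with urllib.urlencode, then either append them to the URL as a query string (GET) or pass them as the second argument to urlopen (POST).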
    import urllib

    params = urllib.urlencode({"plato":1, "socrates":10, "sophokles":4, "arkhimedes":11})

    # Using GET method
    pageText = urllib.urlopen("http://international-philosophy.com/greece?%s" % params).read()
    print pageText

    # Using POST method
    pageText = urllib.urlopen("http://international-philosophy.com/greece", params).read()
    print pageText
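To make the difference between the two methods visible without contacting a server, the following Python 3 sketch builds both requests and inspects them instead of sending them. It assumes Python 3.7 or later (so the dictionary keeps its insertion order) and uses urllib.parse.urlencode and urllib.request.Request, the Python 3 homes of the functions used above.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Encode the parameters as a query string
params = urlencode({"plato": 1, "socrates": 10, "sophokles": 4, "arkhimedes": 11})
base_url = "http://international-philosophy.com/greece"

# GET: the parameters travel in the URL itself
get_request = Request(base_url + "?" + params)

# POST: the parameters travel in the request body, which must be bytes
post_request = Request(base_url, data=params.encode("ascii"))

print(params)
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

Passing either Request object to urllib.request.urlopen would actually send it; supplying a data argument is what switches the request from GET to POST.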
However, urlopen does not work for all URLs. If you try urllib.urlopen("http://www.google.com") and print the result, you should get a 403 (Forbidden) error. The cause is the default User-Agent header that Python sends, which identifies the client as a script rather than a browser. A workaround is to use the following:
    import urllib2

    def getsource(pagereq):
        # Send a browser-like User-Agent instead of Python's default
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        header = {'User-Agent': user_agent}
        req = urllib2.Request(pagereq, None, header)
        response = urllib2.urlopen(req)
        page = response.read()
        return page

    def main():
        html = getsource("http://www.google.com/")
        print html

    if __name__ == '__main__':
        main()
This disguises Python as a web browser (the User-Agent string above is that of Internet Explorer 5.5), which Google will accept.
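As a rough Python 3 counterpart, where urllib2's Request and urlopen are found in urllib.request, the sketch below shows the default User-Agent that Python would send next to the overridden one. The names USER_AGENT and build_request are illustrative assumptions; getsource is defined but not called here because it needs network access.

```python
import urllib.request

USER_AGENT = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request carrying a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def getsource(url):
    """Fetch the page body as bytes with the custom header (needs network)."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()


# What Python sends when no header is supplied, e.g. "Python-urllib/3.x"
default_agent = dict(urllib.request.build_opener().addheaders)["User-agent"]
print(default_agent)

# Request normalises header names with str.capitalize(), hence "User-agent"
req = build_request("http://www.google.com/")
print(req.get_header("User-agent"))
```

Calling getsource("http://www.google.com/") would then perform the request with the browser-like header in place of the default.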

