Python Programming/Web

From Wikibooks, open books for an open world

Making web requests and parsing the results in Python is simple, and there are several must-have modules to help with this.

Urllib

Urllib is the built-in Python module for HTTP requests; the main article is Python Programming/Internet.

try:
    import urllib2  # Python 2
except (ModuleNotFoundError, ImportError): #ModuleNotFoundError is 3.6+
    import urllib.request as urllib2  # urlopen lives in urllib.request on Python 3

url = 'https://www.google.com'
u = urllib2.urlopen(url)
content = u.read() #content now holds all of the HTML from google.com (as bytes on Python 3)

Requests

requests
Python HTTP for Humans
PyPi Link https://pypi.python.org/pypi/requests
Pip command pip install requests

The Python requests library simplifies HTTP requests. It provides a function for each HTTP method:

  • GET (requests.get)
  • POST (requests.post)
  • HEAD (requests.head)
  • PUT (requests.put)
  • DELETE (requests.delete)
  • OPTIONS (requests.options)
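These functions differ only in the HTTP method they send. One way to see this without touching the network is to build a request with requests.Request and prepare() but never send it; the URL and form fields below are made up for illustration.

```python
import requests

# Build (but do not send) a POST request to inspect what would go on the wire.
req = requests.Request('POST', 'http://example.com/login',
                       data={'user': 'alice', 'pw': 'secret'})
prepared = req.prepare()
print(prepared.method)  # POST
print(prepared.body)    # user=alice&pw=secret
```

The prepared request shows the method and the form-encoded body that requests.post would transmit.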

Basic request

import requests

url = 'https://www.google.com'
r = requests.get(url)

The response object

The response object returned by these functions exposes many attributes and methods.

>>> import requests
>>> r = requests.get('https://www.google.com')
>>> print(r)
<Response [200]>
>>> dir(r) # dir() lists all of the attributes and methods available on r
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
  • r.content holds the raw response body as bytes, while r.text holds it decoded to a string; r.text is usually what you want for HTML.
  • r.encoding displays the encoding of the website.
  • r.headers shows the headers returned by the website.
  • r.is_redirect and r.is_permanent_redirect show whether the original link was a redirect.
  • r.iter_content iterates over the response body in chunks of bytes. To convert the bytes to a string, decode them with the encoding in r.encoding.
  • r.iter_lines is like r.iter_content, but iterates over each line of the body. It also yields bytes.
  • r.json() converts the body to a Python dict if the response is JSON.
  • r.raw returns the underlying urllib3.response.HTTPResponse object.
  • r.status_code returns the HTTP status code sent by the server. Code 200 means success, while codes in the 400s and 500s indicate errors. r.raise_for_status() raises an exception if the status code indicates an error.
  • r.url returns the final URL of the request.
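The JSON conversion can be sketched with the standard json module: parsing the body text into Python objects is essentially what happens when a response's JSON is read. The body string below is a made-up example, not a real response.

```python
import json

# Roughly the parsing step behind reading a JSON response body:
# turn the body text into Python objects. (Made-up example body.)
body = '{"args": {"q": "test"}, "url": "https://example.com/search?q=test"}'
data = json.loads(body)
print(data['args']['q'])    # test
print(type(data).__name__)  # dict
```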

Authentication

Requests has built-in authentication. Here is an example with basic authentication.

import requests

r = requests.get('http://example.com', auth=requests.auth.HTTPBasicAuth('username', 'password'))

If it is Basic Authentication, you can just pass a tuple.

import requests

r = requests.get('http://example.com', auth=('username', 'password'))

The other authentication types are covered in the requests documentation.
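Under the hood, basic authentication simply base64-encodes username:password into an Authorization header. A minimal sketch of the header value that gets sent:

```python
import base64

# Build the Basic auth header value by hand (what the library does internally).
user, password = 'username', 'password'
token = base64.b64encode(f'{user}:{password}'.encode()).decode()
header = f'Basic {token}'
print(header)  # Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```

Note that base64 is an encoding, not encryption, which is why basic authentication should only be used over HTTPS.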

Queries

Query strings pass values in a URL. For example, when you make a Google search, the search URL takes the form https://www.google.com/search?q=My+Search+Here&.... Everything after the ? is the query string, which has the form url?name1=value1&name2=value2.... Requests can build these query strings automatically.

>>> import requests
>>> query = {'q':'test'}
>>> r = requests.get('https://www.google.com/search', params = query)
>>> print(r.url) #prints the final url
https://www.google.com/search?q=test

The real power shows with multiple entries.

>>> import requests
>>> query = {'name':'test', 'fakeparam': 'yes', 'anotherfakeparam': 'yes again'}
>>> r = requests.get('http://example.com', params = query)
>>> print(r.url) #prints the final url
http://example.com/?name=test&fakeparam=yes&anotherfakeparam=yes+again

Not only does it pass these values, it also URL-encodes special characters and whitespace.
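The standard library's urllib.parse.urlencode produces the same kind of encoding, so you can see exactly what happens to the params dict without making a request:

```python
from urllib.parse import urlencode

# Encode a params-style dict into a query string; the space becomes '+'.
query = {'name': 'test', 'fakeparam': 'yes', 'anotherfakeparam': 'yes again'}
print(urlencode(query))  # name=test&fakeparam=yes&anotherfakeparam=yes+again
```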

BeautifulSoup4

beautifulsoup4
Screen-scraping library
PyPi Link https://pypi.python.org/pypi/beautifulsoup4
Pip command pip install beautifulsoup4
Import command import bs4

BeautifulSoup4 is a powerful HTML parsing library. Let's try it on some example HTML.

>>> import bs4
>>> example_html = """<!DOCTYPE html>
... <html>
... <head>
... <title>Testing website</title>
... <style>.b{color: blue;}</style>
... </head>
... <body>
... <h1 class='b' id = 'hhh'>A Blue Header</h1>
... <p> I like blue text, I like blue text... </p>
... <p class = 'b'> This text is blue, yay yay yay!</p>
... <p class = 'b'>Check out the <a href = '#hhh'>Blue Header</a></p>
... </body>
... </html>
... """
>>> bs = bs4.BeautifulSoup(example_html, 'html.parser') #specify a parser to avoid a warning
>>> print(bs)
<!DOCTYPE html>
<html><head><title>Testing website</title><style>.b{color: blue;}</style></head><body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body></html>
>>> print(bs.prettify()) #adds in newlines
<!DOCTYPE html>
<html>
 <head>
  <title>
   Testing website
  </title>
  <style>
   .b{color: blue;}
  </style>
 </head>
 <body>
  <h1 class="b" id="hhh">
   A Blue Header
  </h1>
  <p>
   I like blue text, I like blue text...
  </p>
  <p class="b">
   This text is blue, yay yay yay!
  </p>
  <p class="b">
   Check out the
   <a href="#hhh">
    Blue Header
   </a>
  </p>
 </body>
</html>

Getting elements

There are two ways to access elements. The first way is to manually type in the tags, going down in order, until you get to the tag you want.

>>> print(bs.html)
<html><head><title>Testing website</title><style>.b{color: blue;}</style></head><body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body></html>
>>> print(bs.html.body)
<body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body>
>>> print(bs.html.body.h1)
<h1 class="b" id="hhh">A Blue Header</h1>

However, this is inconvenient with large HTML documents. The find_all function finds all instances of a certain element. It takes an HTML tag name, such as h1 or p, and returns every instance of it.

>>> p = bs.find_all('p')
>>> p
[<p> I like blue text, I like blue text... </p>, <p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

This is still inconvenient on a large website, because there will be thousands of entries. You can narrow the search by filtering on classes or ids. Because class is a reserved word in Python, the keyword argument is spelled class_ with a trailing underscore.

>>> blue = bs.find_all('p', class_='b')
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

You can also write your own filter over the results of find_all. Each tag's attributes are available through its get method.

>>> p = bs.find_all('p')
>>> p
[<p> I like blue text, I like blue text... </p>, <p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
>>> blue = [tag for tag in p if 'b' in tag.get('class', [])]
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

This reads each element's class attribute (defaulting to an empty list when the tag has no classes) and keeps the tags whose classes include b. From the list, we can work with each element, such as retrieving the text inside.

>>> b = blue[0].text
>>> print(b)
 This text is blue, yay yay yay!
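A common scraping task is collecting every link on a page. A sketch using find_all on a small made-up snippet, pairing each anchor's href attribute with its text:

```python
import bs4

# Extract (href, text) pairs from every <a> tag in a snippet of HTML.
html = """<body>
<p class='b'>Check out the <a href='#hhh'>Blue Header</a>
and <a href='https://example.com'>this site</a>.</p>
</body>"""
soup = bs4.BeautifulSoup(html, 'html.parser')
links = [(a['href'], a.text) for a in soup.find_all('a')]
print(links)  # [('#hhh', 'Blue Header'), ('https://example.com', 'this site')]
```

Combined with requests to fetch the page, this pattern is the core of a simple web crawler.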