Python Web Scraping - Processing CAPTCHA



In this chapter, let us understand how to perform web scraping and processing CAPTCHA that is used for testing a user for human or robot.

What is CAPTCHA?

The full form of CAPTCHA is Completely Automated Public Turing test to tell Computers and Humans Apart, which clearly suggests that it is a test to determine whether the user is human or not.

A CAPTCHA is a distorted image which is usually not easy to detect by computer program but a human can somehow manage to understand it. Most of the websites use CAPTCHA to prevent bots from interacting.

Loading CAPTCHA with Python

Suppose we want to do registration on a website and there is form with CAPTCHA, then before loading the CAPTCHA image we need to know about the specific information required by the form. With the help of next Python script we can understand the form requirements of registration form on website named http://example.webscrapping.com.

import lxml.html
import urllib.request as urllib2
import pprint
import http.cookiejar as cookielib
def form_parsing(html):
   tree = lxml.html.fromstring(html)
   data = {}
   for e in tree.cssselect('form input'):
      if e.get('name'):
         data[e.get('name')] = e.get('value')
   return data
REGISTER_URL = '<a target="_blank" rel="nofollow" 
   href="http://example.webscraping.com/user/register">http://example.webscraping.com/user/register'</a>
ckj = cookielib.CookieJar()
browser = urllib2.build_opener(urllib2.HTTPCookieProcessor(ckj))
html = browser.open(
   '<a target="_blank" rel="nofollow" 
      href="http://example.webscraping.com/places/default/user/register?_next">
      http://example.webscraping.com/places/default/user/register?_next</a> = /places/default/index'
).read()
form = form_parsing(html)
pprint.pprint(form)

In the above Python script, first we defined a function that will parse the form by using lxml python module and then it will print the form requirements as follows −

{
   '_formkey': '5e306d73-5774-4146-a94e-3541f22c95ab',
   '_formname': 'register',
   '_next': '/places/default/index',
   'email': '',
   'first_name': '',
   'last_name': '',
   'password': '',
   'password_two': '',
   'recaptcha_response_field': None
}

You can check from the above output that all the information except recpatcha_response_field are understandable and straightforward. Now the question arises that how we can handle this complex information and download CAPTCHA. It can be done with the help of pillow Python library as follows;

Pillow Python Package

Pillow is a fork of the Python Image library having useful functions for manipulating images. It can be installed with the help of following command −

pip install pillow

In the next example we will use it for loading the CAPTCHA −

from io import BytesIO
import lxml.html
from PIL import Image
def load_captcha(html):
   tree = lxml.html.fromstring(html)
   img_data = tree.cssselect('div#recaptcha img')[0].get('src')
   img_data = img_data.partition(',')[-1]
   binary_img_data = img_data.decode('base64')
   file_like = BytesIO(binary_img_data)
   img = Image.open(file_like)
   return img

The above python script is using pillow python package and defining a function for loading CAPTCHA image. It must be used with the function named form_parser() that is defined in the previous script for getting information about the registration form. This script will save the CAPTCHA image in a useful format which further can be extracted as string.

OCR: Extracting Text from Image using Python

After loading the CAPTCHA in a useful format, we can extract it with the help of Optical Character Recognition (OCR), a process of extracting text from the images. For this purpose, we are going to use open source Tesseract OCR engine. It can be installed with the help of following command −

pip install pytesseract

Example

Here we will extend the above Python script, which loaded the CAPTCHA by using Pillow Python Package, as follows −

import pytesseract
img = get_captcha(html)
img.save('captcha_original.png')
gray = img.convert('L')
gray.save('captcha_gray.png')
bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
bw.save('captcha_thresholded.png')

The above Python script will read the CAPTCHA in black and white mode which would be clear and easy to pass to tesseract as follows −

pytesseract.image_to_string(bw)

After running the above script we will get the CAPTCHA of registration form as the output.

Advertisements