2009
10.07

Scraping Google Images

Here is some rudimentary code for scraping Google Images. Later on, I’ll add features for search options & perhaps for limiting the number of images this thing downloads. Leave a comment below if you’re interested in these features or if this is useful to you.

Put this code into a file like gimages.py:

import urllib
import urllib2
import os
import sys
#import pdb

import demjson # a powerful python json decoder/encoder. Necessary for decoding garbled google JSON output within a reasonable development time frame

############################
# The meat of the project
############################
def search(term):
    """returns a results object for getting images."""
    return results(term)

class results:
    """iterable list of image results.
    here are the properties of each image
    [0] # google images page for image
    [1] # unknown
    [2] # unknown
    [3] # image url
    [4] # google images thumbnail width
    [5] # google images thumbnail height
    [6] # title text describing relevance of image to query somehow. Not sure of ruleset for this
    [7] # unknown
    [8] # unknown
    [9] # dimensions & size
    [10] # filetype
    [11] # original domain
    [12] # unknown
    [13] # unknown
    [14] # server that url contained in [0] resides on
    [15] # unknown
    [16] # unknown
    [17] # unknown

    still unknown: how to get the alt text stored in the image without having to visit the actual page.

    results.stats_text will give you some html containing the time it took to load the page, the total images retrieved, etc.
    searching by size, type & color will come later"""

    def __init__(self, term):
        self.term = urllib.quote(term) # the original search term, url-escaped
        self.index = 0 # position in self.images of the next image to return
        self.images = [] # stash/return images (from) here
        self.cur_page_num = 0 # value of the start parameter for the last page requested (an offset into the results, not a page count)
        self.max_images = 1000 # only retrieve first 1000 images due to restrictions placed by google. Google says it can find hundreds of millions of images, but, it will only return the first 1000 results. Such a crude example of unnoticed false advertising, IMHO.
        # retrieve initial images
        page = self.fetch_page(self.cur_page_num)
        self.images.extend(page['images'])
        self.stats_text = page['sd']

    def fetch_page(self, start):
        """download & decode one page of results beginning at offset start."""
        url = 'http://images.google.com/images?hl=en&q=%s&ijn=page&start=%d' % (self.term, start) # Google uses an internal json api to retrieve images :)  Yup.
        page = get(url)
        page = page.replace('/*', '') # strip the comment markers around the json
        page = page.replace('*/', '')
        return demjson.decode(page)

    def __iter__(self):
        return self

    def next(self): # return the next image, fetching another page when we run out
        if self.index >= self.max_images:
            raise StopIteration

        if self.index >= len(self.images): # if we need to go to the next google images page
            # get the next page, starting right after the images we already have
            self.cur_page_num = len(self.images)
            page = self.fetch_page(self.cur_page_num)
            if not page['images']: # google has nothing more for this query
                raise StopIteration
            self.images.extend(page['images']) # add to existing list of images

        image = self.images[self.index]
        self.index = self.index + 1
        return image # return next image!

# @TODO Make gimages.get(url) keep trying if the server says it's down.
# @TODO Add support for searching by size, type & color.
# @TODO In results object, extract total number of images retrieved & other stats out of HTML, rather than make user do that. Same with each image: get the width, height & size out of image for the user.
# @TODO Make cookie file optional in case script is being run from a read only directory.

# This is some boilerplate code for using urllib with cookies
# at the end, we get a nice get(url) function that has its own cookie file

COOKIEFILE = 'cookies.lwp'
# the path and filename to save your cookies in

cj = None
ClientCookie = None
cookielib = None

# Let's see if cookielib is available
try:
    import cookielib
except ImportError:
    # If importing cookielib fails
    # let's try ClientCookie
    try:
        import ClientCookie
    except ImportError:
        # ClientCookie isn't available either
        urlopen = urllib2.urlopen
        Request = urllib2.Request
    else:
        # imported ClientCookie
        urlopen = ClientCookie.urlopen
        Request = ClientCookie.Request
        cj = ClientCookie.LWPCookieJar()

else:
    # importing cookielib worked
    urlopen = urllib2.urlopen
    Request = urllib2.Request
    cj = cookielib.LWPCookieJar()
    # This is a subclass of FileCookieJar
    # that has useful load and save methods

if cj is not None:
# we successfully imported
# one of the two cookie handling modules

    if os.path.isfile(COOKIEFILE):
        # if we have a cookie file already saved
        # then load the cookies into the Cookie Jar
        cj.load(COOKIEFILE)

    # Now we need to get our Cookie Jar
    # installed in the opener;
    # for fetching URLs
    if cookielib is not None:
        # if we use cookielib
        # then we get the HTTPCookieProcessor
        # and install the opener in urllib2
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        urllib2.install_opener(opener)

    else:
        # if we use ClientCookie
        # then we get the HTTPCookieProcessor
        # and install the opener in ClientCookie
        opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cj))
        ClientCookie.install_opener(opener)

def get(url, txdata=None):
    try:
        # fake a user agent, some websites (like google) don't like automated exploration
        txheaders =  {'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

        #txdata = None
        # if we were making a POST type request,
        # we could encode a dictionary of values here,
        # using urllib.urlencode(somedict)

        req = Request(url, txdata, txheaders)
        # create a request object

        handle = urlopen(req)
        # and open it to return a handle on the url

    except IOError, e:
        print 'We failed to open "%s".' % url
        if hasattr(e, 'code'):
            print 'We failed with error code - %s.' % e.code
        elif hasattr(e, 'reason'):
            print "The error object has the following 'reason' attribute :"
            print e.reason
            print "This usually means the server doesn't exist,"
            print "is down, or we don't have an internet connection."
        sys.exit()

    else:
        #print 'Here are the headers of the page :'
        #print handle.info()
        page = handle.read()
        # handle.read() returns the page
        # handle.geturl() returns the true url of the page fetched
        # (in case urlopen has followed any redirects, which it sometimes does)

        # save the cookies before returning; code placed after the return never runs
        if cj is not None:
            #print 'These are the cookies we have received so far :'
            #for index, cookie in enumerate(cj):
            #    print index, '  :  ', cookie
            cj.save(COOKIEFILE)                 # save the cookies again
        return page

Then go like this:

import gimages
results = gimages.search('asdf')
for image in results:
    print image[3]

Each image row has these family values:

    [0] # google images page for image
    [1] # unknown
    [2] # unknown
    [3] # image url
    [4] # google images thumbnail width
    [5] # google images thumbnail height
    [6] # title text describing relevance of image to query somehow. Not sure of ruleset for this
    [7] # unknown
    [8] # unknown
    [9] # dimensions & size
    [10] # filetype
    [11] # original domain
    [12] # unknown
    [13] # unknown
    [14] # server that url contained in [0] resides on
    [15] # unknown
    [16] # unknown
    [17] # unknown
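
Remembering that image[3] is the URL gets old fast. As a rough sketch (not part of gimages.py), a small helper can give the known positions names. The names in FIELD_NAMES and the function name_fields are made up for illustration; only the positions come from the list above, and the sample row is invented.

```python
# Hypothetical names for the documented positions; the unknown slots are left out.
FIELD_NAMES = {
    0: 'page_url',
    3: 'image_url',
    4: 'thumb_width',
    5: 'thumb_height',
    6: 'title',
    9: 'dimensions_and_size',
    10: 'filetype',
    11: 'domain',
    14: 'page_server',
}

def name_fields(row):
    """Return a dict holding only the known fields of one image row."""
    return dict((name, row[i]) for i, name in FIELD_NAMES.items() if i < len(row))

# A made-up 18-item row, just to show the shape of the result.
sample_row = [''] * 18
sample_row[3] = 'http://example.com/cat.jpg'
sample_row[10] = 'jpg'
named = name_fields(sample_row)
```

With the real module, name_fields(image)['image_url'] would stand in for image[3] inside the loop.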

You’ll also have to download demjson. It’s a really good JSON library, and its forgiving decoder is what makes it possible to parse the garbled JSON that Google’s internal API returns.
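
To make the decoding step concrete, here is a minimal sketch of what the results class does with each response: strip the /* and */ markers, then decode what is left. It uses the standard json module on a made-up payload that happens to be strict JSON. That is an assumption for the sake of a runnable example; the real responses are the reason demjson is needed. The function name unwrap_and_decode and the sample text are hypothetical.

```python
import json

def unwrap_and_decode(text, decode=json.loads):
    """Strip the /* */ markers, then decode the remaining text."""
    text = text.replace('/*', '').replace('*/', '')
    return decode(text)

# Invented payload with the two keys gimages.py reads: 'images' and 'sd'.
sample = '/*{"images": [["page", "", "", "http://example.com/a.jpg"]], "sd": "Results 1 - 1"}*/'
page = unwrap_and_decode(sample)
first_image_url = page['images'][0][3]
```

Passing decode=demjson.decode swaps the forgiving decoder back in for real responses.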

Scraping Google Search

With this code:

import urllib
import urllib2
import os
import sys
#from xml.etree.ElementTree import ElementTree
#from xml.etree.ElementTree import XMLTreeBuilder
import lxml.html
import pdb

############################
# The meat of the project
############################
def search(term):
    """returns a results object for getting searches."""
    return results(term)

class results:
    def parse_page(self, page):
        tree = lxml.html.document_fromstring(page)
        searches = []
        for i in tree.find_class('g'):
            temp = {}
            temp['url'] = i.find_class('l')[0].get('href')
            temp['title'] = i.find_class('l')[0].text_content()
            temp['desc'] = i.find_class('s')[0].text_content()
            searches.append(temp)
        return searches

    def get_stats_text(self, page):
        tree = lxml.html.document_fromstring(page)
        stats_text = tree.get_element_by_id('ssb').text_content()
        return stats_text

    def __init__(self, term):
        self.term = urllib.quote(term) # the original search term, url-escaped
        self.index = 0 # position in self.searches of the next result to return
        self.searches = [] # stash/return results (from) here
        self.cur_page_num = 0 # value of the start parameter for the last page requested (an offset into the results, not a page count)
        self.max_searches = 1000 # only retrieve first 1000 results due to restrictions placed by google. Google says it can find hundreds of millions of results, but, it will only return the first 1000. Such a crude example of unnoticed false advertising, IMHO.
        # retrieve initial results
        page = self.fetch_page(self.cur_page_num)
        # xpaths noted while poking at the page:
        # "/html/body/div[2]/div/p" # search info
        # "/html/body/div[2]/div[3]/div/ol" # array of searches
        self.searches.extend(self.parse_page(page))
        self.stats_text = self.get_stats_text(page)

    def fetch_page(self, start):
        """download the html of one page of results beginning at offset start."""
        url = 'http://www.google.com/search?q=%s&start=%d' % (self.term, start)
        return get(url)

    def __iter__(self):
        return self

    def next(self): # return the next result, fetching another page when we run out
        if self.index >= self.max_searches:
            raise StopIteration

        if self.index >= len(self.searches): # if we need to go to the next google results page
            # get the next page; google returns 10 results per page
            self.cur_page_num = self.cur_page_num + 10
            more = self.parse_page(self.fetch_page(self.cur_page_num))
            if not more: # google has nothing more for this query
                raise StopIteration
            self.searches.extend(more) # add to existing list of results

        result = self.searches[self.index]
        self.index = self.index + 1
        return result # return next result!

COOKIEFILE = 'cookies.lwp'
# the path and filename to save your cookies in

cj = None
ClientCookie = None
cookielib = None

# Let's see if cookielib is available
try:
    import cookielib
except ImportError:
    # If importing cookielib fails
    # let's try ClientCookie
    try:
        import ClientCookie
    except ImportError:
        # ClientCookie isn't available either
        urlopen = urllib2.urlopen
        Request = urllib2.Request
    else:
        # imported ClientCookie
        urlopen = ClientCookie.urlopen
        Request = ClientCookie.Request
        cj = ClientCookie.LWPCookieJar()

else:
    # importing cookielib worked
    urlopen = urllib2.urlopen
    Request = urllib2.Request
    cj = cookielib.LWPCookieJar()
    # This is a subclass of FileCookieJar
    # that has useful load and save methods

if cj is not None:
# we successfully imported
# one of the two cookie handling modules

    if os.path.isfile(COOKIEFILE):
        # if we have a cookie file already saved
        # then load the cookies into the Cookie Jar
        cj.load(COOKIEFILE)

    # Now we need to get our Cookie Jar
    # installed in the opener;
    # for fetching URLs
    if cookielib is not None:
        # if we use cookielib
        # then we get the HTTPCookieProcessor
        # and install the opener in urllib2
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        urllib2.install_opener(opener)

    else:
        # if we use ClientCookie
        # then we get the HTTPCookieProcessor
        # and install the opener in ClientCookie
        opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cj))
        ClientCookie.install_opener(opener)

def get(url, txdata=None):
    try:
        # fake a user agent, some websites (like google) don't like automated exploration
        txheaders =  {'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

        #txdata = None
        # if we were making a POST type request,
        # we could encode a dictionary of values here,
        # using urllib.urlencode(somedict)

        req = Request(url, txdata, txheaders)
        # create a request object

        handle = urlopen(req)
        # and open it to return a handle on the url

    except IOError, e:
        print 'We failed to open "%s".' % url
        if hasattr(e, 'code'):
            print 'We failed with error code - %s.' % e.code
        elif hasattr(e, 'reason'):
            print "The error object has the following 'reason' attribute :"
            print e.reason
            print "This usually means the server doesn't exist,"
            print "is down, or we don't have an internet connection."
        sys.exit()

    else:
        #print 'Here are the headers of the page :'
        #print handle.info()
        page = handle.read()
        # handle.read() returns the page
        # handle.geturl() returns the true url of the page fetched
        # (in case urlopen has followed any redirects, which it sometimes does)

        # save the cookies before returning; code placed after the return never runs
        if cj is not None:
            #print 'These are the cookies we have received so far :'
            #for index, cookie in enumerate(cj):
            #    print index, '  :  ', cookie
            cj.save(COOKIEFILE)                 # save the cookies again
        return page

With this code you’ll be able to scrape Google search results. The procedure is the same as above, except that you’ll need lxml, which you can get by typing easy_install lxml. It’s a very good library for scraping. It allows me to develop facebook scraping code quickly and efficiently ;) Place the code into something like gsearches.py & do the following:

import gsearches
for result in gsearches.search('asdf'):
    print result

Results have the following format:
{'url': 'http://scipp.ucsc.edu/groups/babar/charm2007.ppt', 'desc': 'File Format: Microsoft Powerpoint – View as HTML1. D0-D0 Mixing at BaBar. Charm 2007 August, 2007. Abe Seiden. University of California at Santa Cruz. for. The BaBar Collaboration …scipp.ucsc.edu/groups/babar/charm2007.ppt – Similar', 'title': 'aSDf'}

Like I said, later on I’ll add some more features like using a proxy server or limiting the number of results. Perhaps I’ll release my facebook scraping code too at some point.
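
Until a built-in limit exists, one way to cap the number of results is to wrap the iterator with itertools.islice from the standard library, which stops asking for more once the limit is reached. A minimal sketch, using a stand-in generator (fake_search, hypothetical) in place of gsearches.search so that it runs without touching Google:

```python
from itertools import islice

def fake_search(term):
    """Stand-in for gsearches.search: yields endless made-up results."""
    n = 0
    while True:
        yield {'url': 'http://example.com/%s/%d' % (term, n), 'title': term, 'desc': ''}
        n += 1

def first_n(results, n):
    """Return at most n items from any iterable of results."""
    return list(islice(results, n))

limited = first_n(fake_search('asdf'), 5)
```

With the real module this would read first_n(gsearches.search('asdf'), 5), and no page beyond the ones needed for five results would be fetched.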
