Wednesday, 27 August 2014

How do you scrape AJAX pages?

Overview:

All screen scraping starts with a manual review of the page you want to extract data from. When dealing with AJAX, you usually need to analyze a bit more than just the HTML.

With AJAX, the value you want is not in the initial HTML document you requested; instead, JavaScript is executed that asks the server for the extra information you want.

You can therefore usually analyze the JavaScript, see which request it makes, and simply call that URL yourself from the start.

Example:

Take this as an example: assume the page you want to scrape has the following script:

<script type="text/javascript">
function ajaxFunction()
{
var xmlHttp;
try
  {
  // Firefox, Opera 8.0+, Safari
  xmlHttp=new XMLHttpRequest();
  }
catch (e)
  {
  // Internet Explorer
  try
    {
    xmlHttp=new ActiveXObject("Msxml2.XMLHTTP");
    }
  catch (e)
    {
    try
      {
      xmlHttp=new ActiveXObject("Microsoft.XMLHTTP");
      }
    catch (e)
      {
      alert("Your browser does not support AJAX!");
      return false;
      }
    }
  }
  xmlHttp.onreadystatechange=function()
    {
    if(xmlHttp.readyState==4)
      {
      document.myForm.time.value=xmlHttp.responseText;
      }
    }
  xmlHttp.open("GET","time.asp",true);
  xmlHttp.send(null);
  }
</script>

Then all you need to do is make an HTTP request to time.asp on the same server directly. (Example from w3schools.)
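
For instance, a minimal Python sketch of that direct request might look like this (assuming the requests library is installed; the host name is a placeholder for the site you are scraping):

import requests

# Call the endpoint the page's JavaScript would have called (host is a placeholder).
response = requests.get("http://www.example.com/time.asp")
response.raise_for_status()

# The body is whatever the script would have written into document.myForm.time.value.
print(response.text)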


Source: http://stackoverflow.com/questions/260540/how-do-you-scrape-ajax-pages

Using Perl to scrape a website


I am interested in writing a Perl script that goes to the following link and extracts the number 1975: https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219

That page shows the number of white men born in the year 1923 who lived in San Diego County, California, in 1940. I am trying to do this in a loop structure to generalize over multiple counties and birth years.

In the file locations.txt, I put the list of counties, such as San Diego County.

The current code runs, but instead of the number 1975, it displays "unknown". The number 1975 should be in $val.

I would very much appreciate any help!

#!/usr/bin/perl

use strict;
use LWP::Simple;

open(L, "locations26.txt");

my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3A%22California%22%20%2Bevent_place_level_2%3A%22%LOCATION%%22%20%2Bbirth_year%3A%YEAR%-%YEAR%~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';

open(O, ">out26.txt");
my $oldh = select(O);
$| = 1;
select($oldh);
while (my $location = <L>) {
    chomp($location);
    $location =~ s/ /+/g;
    foreach my $year (1923..1923) {
        my $u = $url;
        $u =~ s/%LOCATION%/$location/;
        $u =~ s/%YEAR%/$year/;
        #print "$u\n";
        my $content = get($u);
        my $val = 'unknown';
        if ($content =~ / of .strong.([0-9,]+)..strong. /) {
            $val = $1;
        }
        $val =~ s/,//g;
        $location =~ s/\+/ /g;
        print "'$location',$year,$val\n";
        print O "'$location',$year,$val\n";
    }
}

Update: An API is not a viable solution. I have been in contact with the site developer. The API does not apply to that part of the webpage. Hence, any solution pertaining to JSON will not be applicable.



Source: http://stackoverflow.com/questions/14654288/using-perl-to-scrape-a-website

Tuesday, 26 August 2014

Data Scraping using PHP


Here is my code

    <?php
    $ip=$_SERVER['REMOTE_ADDR'];
    $url=file_get_contents("http://whatismyipaddress.com/ip/$ip");
    preg_match_all('/<th>(.*?)<\/th><td>(.*?)<\/td>/s',$url,$output,PREG_SET_ORDER);
    $isp=$output[1][2];
    $city=$output[9][2];
    $state=$output[8][2];
    $zipcode=$output[12][2];
    $country=$output[7][2];
    ?>
    <body>
    <table align="center">
    <tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
    <tr><td>City :</td><td><?php echo $city;?></td></tr>
    <tr><td>State :</td><td><?php echo $state;?></td></tr>
    <tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
    <tr><td>Country :</td><td><?php echo $country;?></td></tr>
    </table>
    </body>

How do I find out the ISP of a person viewing a PHP page?

Is it possible to use PHP to track or reveal it?

Error: http://i.imgur.com/LGWI8.png

cURL Scraping

<?php
$curl_handle=curl_init();
curl_setopt( $curl_handle, CURLOPT_FOLLOWLOCATION, true );
$url='http://www.whatismyipaddress.com/ip/132.123.23.23';
curl_setopt($curl_handle, CURLOPT_URL,$url);
curl_setopt($curl_handle, CURLOPT_HTTPHEADER, Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Your application name');
$query = curl_exec($curl_handle);

curl_close($curl_handle);
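// Note: the preg_match_all() below runs the regex against the URL string in $url,
// not against the downloaded HTML stored in $query.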
preg_match_all('/<th>(.*?)<\/th><td>(.*?)<\/td>/s',$url,$output,PREG_SET_ORDER);
echo $query;
$isp=$output[1][2];
$city=$output[9][2];
$state=$output[8][2];
$zipcode=$output[12][2];
$country=$output[7][2];
?>
<body>
<table align="center">
<tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
<tr><td>City :</td><td><?php echo $city;?></td></tr>
<tr><td>State :</td><td><?php echo $state;?></td></tr>
<tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
<tr><td>Country :</td><td><?php echo $country;?></td></tr>
</table>
</body>

Error: http://i.imgur.com/FJIq6.png

What is wrong with my code here? Is there any alternative code that I can use?

I am not able to scrape that data as described here. http://i.imgur.com/FJIq6.png

P.S. Please post full code. It would be easier for me to understand.



Source: http://stackoverflow.com/questions/10461088/data-scraping-using-php

PDF scraping using R


I have been using the XML package successfully for extracting HTML tables but want to extend to PDFs. From previous questions it does not appear that there is a simple R solution, but I wondered if there had been any recent developments.

Failing that, is there some way in Python (in which I am a complete novice) to obtain and manipulate PDFs so that I could finish the job off with the R XML package?

Extracting text from PDFs is hard, and nearly always requires lots of care.

I'd start with the command line tools such as pdftotext and see what they spit out. The problem is that PDFs can store the text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined up 'ff' and 'ij' that you see in proper typesetting) to throw you.

pdftotext is installable on any Linux system
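
If you want to drive pdftotext from Python rather than the shell, a rough sketch (assuming the poppler pdftotext binary is on your PATH; input.pdf is a placeholder file name) could be:

import subprocess

# Run pdftotext and capture the extracted text ("-" tells pdftotext to write to stdout).
text = subprocess.check_output(["pdftotext", "input.pdf", "-"]).decode("utf-8")
print(text)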



Source: http://stackoverflow.com/questions/7918718/pdf-scraping-using-r

Monday, 25 August 2014

PHP Scraping data from a website

I am very new to programming and need a little help with getting data from a website and passing it into my PHP script.

The website is http://www.birthdatabase.com/.

I would like to plug in a name (First and Last) and retrieve the result. I know you can query the site by passing the name in the URL, but I am having problems scraping the results.

http://www.birthdatabase.com/cgi-bin/query.pl?textfield=FIRST&textfield2=LAST&age=&affid=

I am using the file_get_contents($URL) function to get the page but need help after that. Specifically, I would like to scrape only the results from a certain state if there are multiple results for that name.



You need the awesome simple_html_dom class.

With this class you can query the webpage's DOM in a similar way to jQuery.

First include the class in your page, then get the page content with this snippet:

$html = file_get_html('http://www.birthdatabase.com/cgi-bin/query.pl?textfield=' . $first . '&textfield2=' . $last . '&age=&affid=');

Then you can use CSS selectors to scrape your data (something like this):

$n = 0;
foreach($html->find('table tbody tr td div font b table tbody') as $element) {
    @$row[$n]['tr']  = $element->find('tr')->text;
    $n++;
}

// output your data
print_r($row);



Source: http://stackoverflow.com/questions/15601584/php-scraping-data-from-a-website

Obtaining reddit data


I am interested in obtaining data from different reddit subreddits. Does anyone know if there is a reddit (or other) API, similar to what Twitter provides, for crawling all the pages?


Yes, reddit has an API that can be used for a variety of purposes such as data collection, automatic commenting bots, or even to assist in subreddit moderation.

There are a few places to discover information on reddit's API:

    github reddit wiki -- provides the overview and rules for using reddit's API (follow the rules)
    automatically generated API docs -- provides information on the requests needed to access most of the API endpoints
    /r/redditdev -- the reddit community dedicated to answering questions both about reddit's source code and about reddit's API

If there is a particular programming language you are already familiar with, you should check out the existing set of API wrappers for various languages. Despite my bias (I am the package maintainer), I am quite certain PRAW, for Python, has support for the largest number of reddit API features.
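
To illustrate, here is a minimal sketch that is not from the original answer; the exact method names vary between PRAW versions, and the credentials are placeholders you would obtain by registering an app with reddit:

import praw

# Placeholder credentials from a registered reddit "script" app.
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     user_agent="research data collection by /u/your_username")

# Print the titles of the ten hottest submissions in a subreddit.
for submission in reddit.subreddit("redditdev").hot(limit=10):
    print(submission.title)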



Source: http://stackoverflow.com/questions/14322834/obtaining-reddit-data

Tuesday, 19 August 2014

What is the right way of storing screen-scraping data?


I'm working on a web site. It scrapes product details (names, features, prices, etc.) from various web sites, then processes and displays them. I'm considering running an update script each day to keep the data fresh.

    scrape data
    process them
    store on database
    read(from db) and display them

I'm already storing all the data in an SQL schema, but I'm not sure about it. After each update, all the old records vanish, so if the newly scraped data comes in corrupted somehow, there is nothing to show.

So, is there any common way to archive the old data? Which is more convenient: separate SQL schemas or XML files? Or something else?
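
One common approach, sketched below with Python's built-in sqlite3 module (a rough illustration only; the table and column names are made up), is to tag every scrape run with a timestamp instead of overwriting the table, so the previous run is still there if the new one comes back corrupted:

import sqlite3
import datetime

conn = sqlite3.connect("products.db")
conn.execute("CREATE TABLE IF NOT EXISTS products (scraped_at TEXT, name TEXT, price REAL)")

# Tag every row with the run's timestamp instead of replacing the table.
run_stamp = datetime.datetime.utcnow().isoformat()
rows = [("Example Widget", 9.99), ("Example Gadget", 19.99)]  # placeholder scraped data
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [(run_stamp, name, price) for name, price in rows])
conn.commit()

# The display code reads only the latest run; older runs remain as an archive.
latest = conn.execute("SELECT MAX(scraped_at) FROM products").fetchone()[0]
for name, price in conn.execute("SELECT name, price FROM products WHERE scraped_at = ?", (latest,)):
    print(name, price)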

Source: http://stackoverflow.com/questions/13686474/what-is-the-right-way-of-storing-screen-scraping-data

Scraping dynamic data


I am scraping profiles on ask.fm for a research question. The problem is that only the most recent questions are viewable and I have to click "view more" to see the next 15.

The source code for clicking view more looks like this:

<input class="submit-button-more submit-button-more-active" name="commit" onclick="return Forms.More.allowSubmit(this)" type="submit" value="View more" />

What is an easy way of calling this 4 times before scraping it? I want the most recent 60 posts on the site. Python is preferable.

You could probably use selenium to browse to the website and click on the button/link a few times. You can get that here:

    https://pypi.python.org/pypi/selenium

Or you might be able to do it with mechanize:

    http://wwwsearch.sourceforge.net/mechanize/

I have also heard good things about twill, but never used it myself:

    http://twill.idyll.org/
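
A rough sketch of the selenium route (this is not from the original answer; the button selector is taken from the HTML snippet quoted in the question, and the profile URL is a placeholder):

from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("http://ask.fm/some_profile")  # placeholder profile URL

for _ in range(4):  # click "View more" four times to expose roughly 60 posts
    button = driver.find_element_by_css_selector("input.submit-button-more")
    button.click()
    time.sleep(2)  # crude wait for the next batch of questions to load

html = driver.page_source  # hand this off to your parsing code
driver.quit()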



Source: http://stackoverflow.com/questions/19437782/scraping-dynamic-data

Web Scraping data from different sites


I am looking for a few ideas on how I can solve a design problem I'm going to face when building a web scraper to scrape multiple sites. Writing the scraper(s) is not the problem; matching the data from different sites (which may have small differences) is.

For the sake of being generic assume that I am scraping something like this from two or more different sites:

    public class Data {
        public int id;
        public String firstname;
        public String surname;
        ....
    }

If I scrape this from two different sites, I will encounter the situation where I could have the following:

Site A: id=100, firstname=William, surname=Doe

Site B: id=1974, firstname=Bill, surname=Doe

Essentially, I would like to consider these two sets of data the same (they are the same person but with their name slightly different on each site). I am looking for possible design solutions that can handle this.

The only idea I've come up with is scraping the data from a third location and using it as a reference list. Then, when I scrape site A or B, I can, over time, build up a list of failures and store them in a list for each scraper so that it can know (if I find id=100 then I know that the firstname will be William, etc.). I can't help but feel this is a rubbish idea!

If you need any more info, or if you think my description is a bit naff, let me know!

Thanks,

DMcB
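
As a rough illustration of the reference-list idea described above (everything here is made up for the example; it is not from the original question):

# Map known nicknames to a canonical first name so records from different
# sites can be compared on a normalised key.
FIRSTNAME_ALIASES = {"bill": "william", "will": "william", "bob": "robert"}

def normalise(record):
    first = record["firstname"].strip().lower()
    return (FIRSTNAME_ALIASES.get(first, first), record["surname"].strip().lower())

site_a = {"id": 100, "firstname": "William", "surname": "Doe"}
site_b = {"id": 1974, "firstname": "Bill", "surname": "Doe"}

# Both records normalise to ("william", "doe"), so they can be treated as one person.
print(normalise(site_a) == normalise(site_b))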


Source: http://stackoverflow.com/questions/23970057/web-scraping-data-from-different-sites

Scrape Data Point Using Python


I am looking to scrape a data point using Python off of the URL http://www.cavirtex.com/orderbook.

The data point I am looking to scrape is the lowest bid offer, which at the current moment looks like this:

<tr>
 <td><b>Jan. 19, 2014, 2:37 a.m.</b></td>
 <td><b>0.0775/0.1146</b></td>
 <td><b>860.00000</b></td>
 <td><b>66.65 CAD</b></td>
</tr>

The relevant point being the 860.00 . I am looking to build this into a script which can send me an email to alert me of certain price differentials compared to other exchanges.

I'm quite a newbie, so if in your explanations you could offer your thought process on why you've done certain things, it would be very much appreciated.

Thank you in advance!

Edit: This is what I have so far, which returns the title correctly; I'm having trouble grabbing the table data though.

import urllib2, sys
from bs4 import BeautifulSoup

site= "http://cavirtex.com/orderbook"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup.title



Here is the code for scraping the lowest bid from the 'Buying BTC' table:

from selenium import webdriver

fp = webdriver.FirefoxProfile()
browser = webdriver.Firefox(firefox_profile=fp)
browser.get('http://www.cavirtex.com/orderbook')

lowest_bid = float('inf')
elements = browser.find_elements_by_xpath('//div[@id="orderbook_buy"]/table/tbody/tr/td')

for element in elements:
    text = element.get_attribute('innerHTML').strip('<b>|</b>')
    try:
        bid = float(text)
        if lowest_bid > bid:
            lowest_bid = bid
    except:
        pass

browser.quit()
print lowest_bid

In order to install Selenium for Python on your Windows-PC, run from a command line:

pip install selenium (or pip install selenium --upgrade if you already have it).

If you want the 'Selling BTC' table instead, then change "orderbook_buy" to "orderbook_sell".

If you want the 'Last Trades' table instead, then change "orderbook_buy" to "orderbook_trades".

Note:

If you consider performance critical, then you can implement the data-scraping via URL-Connection instead of Selenium, and have your program running much faster. However, your code will probably end up being a lot "messier", due to the tedious XML parsing that you'll be obliged to apply...

Here is the code for sending the previous output in an email from yourself to yourself:

import smtplib,ssl

def SendMail(username,password,contents):
    server = Connect(username)
    try:
        server.login(username,password)
        server.sendmail(username,username,contents)
    except smtplib.SMTPException,error:
        Print(error)
    Disconnect(server)

def Connect(username):
    serverName = username[username.index("@")+1:username.index(".")]
    while True:
        try:
            server = smtplib.SMTP(serverDict[serverName])
        except smtplib.SMTPException,error:
            Print(error)
            continue
        try:
            server.ehlo()
            if server.has_extn("starttls"):
                server.starttls()
                server.ehlo()
        except (smtplib.SMTPException,ssl.SSLError),error:
            Print(error)
            Disconnect(server)
            continue
        break
    return server

def Disconnect(server):
    try:
        server.quit()
    except smtplib.SMTPException,error:
        Print(error)

serverDict = {
    "gmail"  :"smtp.gmail.com",
    "hotmail":"smtp.live.com",
    "yahoo"  :"smtp.mail.yahoo.com"
}

SendMail("your_username@your_provider.com","your_password",str(lowest_bid))

The above code should work if your email provider is either gmail or hotmail or yahoo.

Please note that depending on your firewall configuration, it may ask for your permission the first time you try it...



Source: http://stackoverflow.com/questions/21217034/scrape-data-point-using-python

What is Web Scraping and is Python the best language to use for this? [closed]


What is Web Scraping and is Python the best language to use for this? If so, why is Python the best?



    What is Web Scraping

The process of making HTTP requests to websites and then extracting data from HTML documents (as opposed to using an official API)

    and is Python the best language to use for this?

Subjective. Argumentative. Insert favourite language (Perl Perl Perl) here.



Web scraping is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding certain full-fledged Web browsers. Web scraping is closely related to Web indexing, which indexes Web content using a bot and is a universal technique adopted by most search engines. In contrast, Web scraping focuses more on the transformation of unstructured Web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to Web automation, which simulates human Web browsing using computer software. Uses of Web scraping include online price comparison, weather data monitoring, website change detection, Web research, Web content mashup and Web data integration. (Wikipedia)

Like any language, Python has certain advantages and disadvantages. For example, Perl is an easier language to do regular expressions with. Then again, BeautifulSoup, a module for Python, makes web scraping really, really easy.
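
As a minimal illustration (not from the original answer; it assumes the requests and beautifulsoup4 packages are installed, with example.com standing in for a real target):

import requests
from bs4 import BeautifulSoup

# Fetch a page and print the text and href of every link on it.
response = requests.get("http://example.com/")
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.find_all("a"):
    print(link.get_text(strip=True), link.get("href"))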




Web scraping is the process of automatically collecting Web information.

Python and Perl have vast libraries for web scraping. Even though I have a personal preference towards Perl when it comes to data extraction, because of the ease with which you can use regex, technically Python is no less capable. However, based on the active community, Python would be a better choice.

Python:

BeautifulSoup -- module which can extract HTML content.

Scrapy -- open-source framework for web scraping in Python. It is a very elegant framework.

html5lib -- HTML parser based on the HTML5 specification.

Perl

WWW::Mechanize -- very powerful module for fetching and parsing web pages, and easy to use.

Win32::IE::Mechanize -- if you are using Windows and want to scrape JavaScript-based pages.

Mozilla::Mechanize -- scrape JavaScript-based pages from Linux, but you need to install GTK.

If you want to use Ruby for this task, Nokogiri and Mechanize are probably the right tools for the job.



If you are looking into web scraping with Python, you may also want to look at the lxml package. It has the same ElementTree API as the stdlib parser, but is much more powerful and fast.

Plus, the lxml package provides the brilliant lxml.cssselect module.
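
A tiny sketch of lxml with a CSS selector (it assumes the lxml and cssselect packages are installed; the HTML string is made up):

from lxml import html

page = html.fromstring("<div><p class='title'>Hello</p><p>World</p></div>")

# cssselect lets you use CSS selectors on top of lxml's fast ElementTree API.
for node in page.cssselect("p.title"):
    print(node.text_content())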

Depending on your needs, you may want to start with either bare lxml or some web-scraping library like Scrapy, mechanize or SmartHTTP.


Source: http://stackoverflow.com/questions/5926863/what-is-web-scraping-and-is-python-the-best-language-to-use-for-this

Scraping Data from Table Python


I'm trying to scrape data from a website's table using Python.

from bs4 import BeautifulSoup
from mechanize import Browser

BASE_URL = "http://www.ggp.com/properties/mall-directory"

def main():
    mech = Browser()
    url = "http://www.ggp.com/properties/mall-directory"
    page1 = mech.open(url)
    html1 = page1.read()
    soup1 = BeautifulSoup(html1)
    extract(soup1, 2007)


def extract(soup,year):
    table = soup.find("table")
    for row in table.findAll('option'):
        print row


main()

Row prints out:

<option value="184">Yakima, WA</option>
<option value="896">Yankton, SD</option>
<option value="851">Yazoo City, MS</option>
<option value="113">York-Hanover, PA</option>
<option value="87">Youngstown-Warren, OH-PA</option>
<option value="235">Yuba City, CA</option>
<option value="205">Yuma, AZ</option>
<option value="424">Zanesville, OH</option>

But what I need is

Yakima, WA
Yankton, SD
Yazoo City, MS
York-Hanover, PA
etc...

I've tried row.findAll('option value') but this doesn't work...
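
For what it's worth, a hedged sketch of one way to get just the text: BeautifulSoup exposes an element's text via get_text(), so the loop can print the city names rather than the whole tags.

from bs4 import BeautifulSoup

html = '<option value="184">Yakima, WA</option><option value="896">Yankton, SD</option>'
soup = BeautifulSoup(html, "html.parser")

# get_text() returns only the element's text, not the surrounding markup.
for option in soup.findAll("option"):
    print(option.get_text())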



Source: http://stackoverflow.com/questions/24124291/scraping-data-from-table-python

web scraping groupon


I want to scrape groupon.com. My problem is that such sites, when you load them for the first time, ask you to join their email service, but when you reload the page they show you the content directly. How do I handle that? I am using PHP for my scripting.

Also, if anyone could suggest a framework or library in PHP which makes scraping easy, that would be great.


I would investigate the cURL library for grabbing website content. I'm not sure on the exact information you want to scrape, or if the refresh will cause an issue, but hopefully this launches your attempt.

We use iMacros. PRO: works in the browser, works with any website. CON: not as fast as cURL. Of course, nothing stops you from using both.



Must you stick with PHP for the scraping? TestPlan makes this type of testing easy. You can either access the page again, or simply use TestPlan to sign up for their email list to gain extended access to their site.

Here's a rough example that takes you to the main page and closes the little popup:

GotoURL http://www.groupon.com/
Click id:step_one

SubmitForm with
    %Params:subscription[email_address]% somewhere@test.domain.xx
end


They have an API http://www.groupon.com/pages/api if that helps.


Source: http://stackoverflow.com/questions/3843733/web-scraping-groupon

Getting Different Results For Web Scraping


I was trying to do web scraping and was using the following code:

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"

br =  mechanize.Browser()
htmltext = br.open(url).read()

link_dictionary = {}
soup = BeautifulSoup(htmltext)

for tag_li in soup.findAll('li', attrs={"data-section":"Chennai"}):
    for link in tag_li.findAll('a'):
        link_dictionary[link.string] = link.get('href')
        print link_dictionary[link.string]
        urlnew = link_dictionary[link.string]

        brnew =  mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()

        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text

        print articletext

I was unable to get any printed values by using this. But on using attrs={"data-section":"Business"} instead of attrs={"data-section":"Chennai"} I was able to get the desired output. Can someone help me?


READ THE TERMS OF SERVICE OF THE WEBSITE BEFORE SCRAPING

If you are using Firebug or Inspect Element in Chrome, you might see some content that will not be seen if you are using mechanize or urllib2.

For example, when you view the source code of the page (right-click, View Source in Chrome) and search for the data-section tag, you won't see any tags with "Chennai". I am not 100% sure, but I would say that content needs to be populated by JavaScript etc., which requires the functionality of a browser.

If I were you, I would use selenium to open up the page and then get the page source from there; the HTML collected that way will be more like what you see in a browser.

Cited here

from selenium import webdriver
from bs4 import BeautifulSoup
import time   

driver = webdriver.Firefox()
driver.get("URL GOES HERE")
# I noticed there is an ad here, sleep til page fully loaded.
time.sleep(10)

soup = BeautifulSoup(driver.page_source)
print len(soup.findAll('li', attrs={"data-section":"Chennai"}))
# or you can work directly in selenium     
...

driver.close()

And the output for me is 8

Source: http://stackoverflow.com/questions/19918153/getting-different-results-for-web-scraping

Monday, 11 August 2014

Scrape Online Selling - Estate Sales

You can find a number of real bargains and a few great items to vend at estate sales. True estate sales are exactly what the name denotes: when someone tries to sell an entire estate or house full of items. The items are generally sold individually or in lots. An estate sale transpires when somebody may have passed away and no one knows what to do with all their junk. Perhaps the relatives can't take all the stuff with them. They hire a company to perform an estate sale or perform the estate sale themselves.

At a typical sale you will see the house and everything attached to it will be for sale. These sales are publicized in your local newspaper's classified ads section. Estate sales are more modest than garage sales as there are lots of deals to be made.

Estate Sales Tips

· Get there early! Just as with garage sales, the most valuable items are going to be gone in the first few hours.

· Haggle. It is still a good idea to try to lower the sales price of items you are thinking of buying.

· Return to the sale the next day. Most of the time, these types of sales are two-day events. I have found that on the second day the operators usually lower their prices significantly. The incentive is also there for the operators to unload everything that is left over.

· Introduce yourself. Usually these types of sales are performed by companies on behalf of the family of the estate. You can leave your business card with the operators and ask them to give you a call about future estate sales. In fact, most of the companies that perform these sales let you sign up to be notified by e-mail of upcoming sales.

Used Bookstores

If you have any used bookstores in your area, you can use them to find used books to sell online. Most used bookstores are inundated with books but have limited space in which to store them all. Because of this, they may be willing to sell you some of their overstocked inventory at a deep discount.

It never hurts to speak to the owner or manager and try to get them to sell you some of their inventory. You can leave them with your business card and let them know that you are always in the market for volume sales. Most owners will always listen to a buyer who is willing to buy in volume since shelf space is a large problem for traditional brick and mortar used bookstores.

Source: http://ezinearticles.com/?Scrape-Online-Selling---Estate-Sales&id=7413912

Three Common Methods For Web Data Extraction

Probably the most common technique used traditionally to extract data from web pages is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.
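
As a small illustration of that regular-expression approach (a sketch only; the pattern and HTML are made up, and real markup is rarely this tidy):

import re

html = '<a href="http://example.com/news/1">First headline</a> <a href="http://example.com/news/2">Second headline</a>'

# Pull out (URL, link title) pairs.
pattern = re.compile(r'<a href="([^"]+)">([^<]+)</a>')
for url, title in pattern.findall(html):
    print(url, title)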

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:


- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.

- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.

- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).

- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:

- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.

- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.

- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.

- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:

- You create it once and it can more or less extract the data from any page within the content domain you're targeting.

- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).

- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.

- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.

- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software


Advantages:

- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.

- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.

- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.

- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.

- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract out the individual pieces we're after. Once the data has been extracted we then insert it into a database.

Source: http://ezinearticles.com/?Three-Common-Methods-For-Web-Data-Extraction&id=165416

Avail Professional and Affordable Web Research and Data Mining Services From Data Experts

The web can be considered as an unstructured database of a very large size containing a huge amount of information. Tons of data is available freely and easily accessible on the internet, there is just a need to thoroughly analyze it and categorize it to help meet the needs related to the content. Thus, information retrieval and analysis has become crucial, to use and apply that information further.

Today, web research and data mining services have become increasingly important to almost all individuals, sectors and businesses. These services include applying methodologies like business intelligence, web scraping services and data extraction to get the desired results. Web researchers use Web search engines (keyword queries) or specific means to surf the web to get specific results. However, not all the results obtained are relevant, as keyword search gives a lot of irrelevant material. When searching the internet or web for information on any given topic, it usually ends with a million pages to look through, which can be a frustrating experience and takes a lot of time.

Today, we are living in a data-driven world where data has become the driving force behind all business enterprises regardless of their size. The complete credit goes to the arrival of cutting-edge technological inventions. Data mining is the process of analyzing huge amounts of data available on the web to create up-to-date and useable information. It is mainly used for risk management and business intelligence.

Data mining and web research have become essential for almost all industries such as education, telecommunications, retail, insurance, healthcare, banking, real estate, travel, and E-Commerce. The initial process related to data mining is Knowledge Discovery in Databases (KDD); it involves extracting data from the web and other sources to transform it into valuable information which is worth billions.

Web Research involves searching, collecting, understanding, evaluating and exploring information. Web Research is the detailed study of a subject in order to discover information or achieve a new understanding of it. Web research services are mainly availed by students, researchers, and others who are seeking some particular information. Web-based market research solutions are used by media, publishing, public relations, finance, transportation, automation, automobile, government, FMCG, healthcare and various other industries.

There are a variety of web research services offered by eminent companies:
  •     Data Mining
  •     Information Retrieval
  •     Online Data Research
  •     Web Content Filtering
  •     Web Data Extraction
  •     Web Research
Web market research is especially effective where an organization needs to improve an existing product/service or wants to create completely new business product/service. Web-based research helps in establishing a positive and interactive relationship with employees, customers and business associates.

Special features of outsourcing data mining and web research to experts:
  •     Competitive pricing and flexible hiring options
  •     Quick turn-around time
  •     High quality and effective results
  •     Data mining support and multidimensional analysis
  •     Highly skilled, experienced and knowledgeable professionals
The several advantages of web research make it an ideal technique for gathering precious market and industry information, which helps companies and businesses in decision-making.

Source: http://ezinearticles.com/?Compensation-to-Outsource-Data-Entry-Work&id=3486446