Data Discovery vs. Data Extraction

 
Author: Todd Wilson

Looking at screen-scraping at a simplified level, there are two primary stages involved: data discovery and data extraction. Data discovery deals with navigating a web site to arrive at the pages containing the data you want, and data extraction deals with actually pulling that data off of those pages. Generally when people think of screen-scraping they focus on the data extraction portion of the process, but my experience has been that data discovery is often the more difficult of the two.

The data discovery step in screen-scraping might be as simple as requesting a single URL. For example, you might just need to go to the home page of a site and extract the latest news headlines. At the other end of the spectrum, data discovery may involve logging in to a web site, traversing a series of pages in order to get needed cookies, submitting a POST request on a search form, traversing the search results pages, and finally following all of the details links within those results pages to get to the data you're actually after. In the simple case, a short Perl script will often work just fine. For anything much more complex than that, though, a commercial screen-scraping tool can be an incredible time-saver. Especially for sites that require logging in, writing your own screen-scraping code can be a nightmare, because you have to manage cookies and other session state yourself.
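To make the more involved end of that spectrum concrete, here is a minimal Python sketch of the log-in, keep-the-cookies, POST-a-search sequence, using only the standard library. The base URL, the /login and /search paths, and the form field names are all hypothetical; a real site will use its own, and may need extra steps such as hidden form tokens.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical site layout: none of these URLs or field names come from a real site.
BASE_URL = "https://example.com"


def make_session():
    """Return an opener that stores cookies and sends them back on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_post(path, fields):
    """Build a POST request carrying form-encoded fields."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(urllib.parse.urljoin(BASE_URL, path), data=data)


def discover(opener, username, password, query):
    """Log in, submit a search, and return the HTML of the first results page."""
    # Step 1: log in. Any session cookie lands in the jar attached to the opener.
    opener.open(build_post("/login", {"user": username, "pass": password})).close()
    # Step 2: submit the search form as a POST. Stored cookies are sent automatically.
    with opener.open(build_post("/search", {"q": query})) as response:
        return response.read().decode("utf-8", errors="replace")
```

Nothing here touches the network until discover() is called. Following the details links in the returned page would be one more loop over opener.open(), which is where the amount of hand-written code starts to grow.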

In the data extraction phase you've already arrived at the page containing the data you're interested in, and you now need to pull it out of the HTML. Traditionally this has involved creating a series of regular expressions that match the pieces of the page you want (e.g., URLs and link titles). Regular expressions can be a bit complex to deal with, so most screen-scraping applications will hide these details from you, even though they may use regular expressions behind the scenes.
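As an illustration of the regular-expression approach, the following Python sketch pulls URLs and link titles out of a fragment of HTML. The pattern is deliberately simple and rests on assumptions: it expects double-quoted href values and plain text inside each link, and the sample markup is made up.

```python
import re

# Assumes double-quoted href values and no nested tags inside the link text.
LINK_PATTERN = re.compile(
    r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return (url, title) pairs for every anchor that LINK_PATTERN matches."""
    return [(url, title.strip()) for url, title in LINK_PATTERN.findall(html)]


# Made-up sample page fragment.
sample = (
    '<ul>'
    '<li><a href="/news/1">First headline</a></li>'
    '<li><a class="x" href="/news/2"> Second headline </a></li>'
    '</ul>'
)

for url, title in extract_links(sample):
    print(url, title)
```

A pattern like this breaks easily when the markup changes, which is part of why many tools keep the expressions out of sight.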

As an addendum, I should probably mention a third phase that is often ignored: what do you do with the data once you've extracted it? Common examples include writing the data to a CSV or XML file, or saving it to a database. In the case of a live web site you might even scrape the information and display it in the user's web browser in real time. When shopping around for a screen-scraping tool you should make sure that it gives you the flexibility you need to work with the data once it's been extracted.
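For the CSV case, a minimal Python sketch might look like the following. The column names and sample rows are hypothetical, and the data is written to an in-memory buffer here; a real script would more likely pass a file opened with newline="".

```python
import csv
import io


def write_csv(rows, out):
    """Write (url, title) rows to the file-like object out, with a header row."""
    writer = csv.writer(out)
    writer.writerow(["url", "title"])  # hypothetical column names
    writer.writerows(rows)


# Made-up rows standing in for scraped data.
rows = [
    ("/news/1", "First headline"),
    ("/news/2", "Second, longer headline"),
]

buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

The csv module quotes the second title because it contains a comma, which is the kind of detail that is easy to get wrong when building the lines by hand.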

Author Bio:
Todd Wilson is a well-known scripter. Todd likes to create articles about this industry.
Copyright © 2006, www.bumpyjump.com