Python: module scrapeHotelReviewData

scrapeHotelReviewData

Python module for scraping hotel reviews.

This module scrapes hotel review data (date, rating and review text) from Tripadvisor or Orbitz for all hotels in (and close to) the given list of cities in a US state. No personal information about the reviewers, such as their account name or address, is scraped.

It does this in 5 steps:
    Step 1: Get the list of cities and the URLs listing hotels for each city. (Only needed for TA - Tripadvisor)
    Step 2: Get the list of URLs for each hotel in a city.
    Step 3: Get the list of review pages for each hotel.
    Step 4: From each hotel get the review information needed.
    Step 5: Get the stats for each hotel.

It can also be interrupted and (effectively) resume from where it stopped without having to redownload all the previous files. It does this by storing the webpages it has downloaded and compacting the downloaded pages into a format that is easy to process.

NOTE: This becomes essential since Tripadvisor starts dropping requests if you download too many pages too fast, and hence the script may need to be restarted after pausing for a few seconds.

Usage:
    scrapeHotelReviewData.py [-h] -state STATE -cities CITIES [-delay DELAY] -site SITE -o OUTPUT -path PATH

Inputs:
    -h, --help        show this help message and exit
    -state STATE      State for which the city data is required.
    -cities CITIES    Filename containing list of cities for which data is required
    -delay DELAY      Amount of time to pause after downloading a website
    -site SITE        Either tripadvisor or orbitz
    -o OUTPUT         Path to output file for reviews
    -path PATH        Directory where the webpages should be downloaded

Key Outputs:
    - TSV file containing the review information
    - Condensed set of information downloaded
    - List of cities not found on website.
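The usage line above can be reproduced with argparse roughly as follows. This is a minimal sketch, not the module's actual parser: the function name build_parser, the float type and 0.0 default for -delay, and the use of choices for -site are all assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented usage line (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="scrapeHotelReviewData.py")
    parser.add_argument("-state", required=True,
                        help="State for which the city data is required.")
    parser.add_argument("-cities", required=True,
                        help="Filename containing list of cities for which data is required")
    # Assumption: the delay is a number; the 0.0 default is illustrative only.
    parser.add_argument("-delay", type=float, default=0.0,
                        help="Amount of time to pause after downloading a website")
    # Assumption: the two documented site names are enforced through choices.
    parser.add_argument("-site", required=True, choices=["tripadvisor", "orbitz"],
                        help="Either tripadvisor or orbitz")
    parser.add_argument("-o", dest="output", required=True,
                        help="Path to output file for reviews")
    parser.add_argument("-path", required=True,
                        help="Directory where the webpages should be downloaded")
    return parser
```

Calling build_parser().parse_args() with the six documented options yields a namespace with the attributes state, cities, delay, site, output and path.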

Modules

argparse
bisect
os
sys
time
urllib2

Functions


analyzeReviewPage(contents, hName, option, outF)
Analyzes the review page, extracts the review details and writes them to the output file.
Inputs:
    - contents : Content string
    - hName : Name of the hotel
    - option : Tripad/Orbitz
    - outF : File to write to
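Writing one review as a TSV row might look like the sketch below. The helper name write_review_row and the column order (hotel, date, rating, text) are hypothetical; the documentation only states that the date, rating and review text end up in a TSV file.

```python
import csv


def write_review_row(out_f, hotel_name, date, rating, text):
    """Append one review to an open text file as a tab-separated row."""
    writer = csv.writer(out_f, delimiter="\t", lineterminator="\n")
    # Collapse tabs and newlines inside the review so it stays on one row.
    clean_text = " ".join(text.split())
    writer.writerow([hotel_name, date, rating, clean_text])
```

Collapsing whitespace in the review text is a design choice of this sketch; it keeps every review on a single line so the file can be read back line by line.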

checkIfExists(hUrl)
Checks whether the current URL has already been scraped.

downloadToFile(url, fileName, force=False)
Downloads a URL to a file.
Inputs:
    - url : URL to be downloaded
    - fileName : Name of the file to write to
    - force : Optional boolean argument which, if true, will overwrite the file even if it exists
Returns:
    - Pair indicating whether the file was downloaded and a list of the contents of the file split up line-by-line
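The download-or-reuse behaviour described here is what lets an interrupted run resume without refetching pages. A minimal sketch under stated assumptions: the fetch parameter is a hypothetical callable (URL in, text out) injected so the caching logic can be exercised offline, and it is not part of the documented signature.

```python
import os


def download_to_file(url, file_name, fetch, force=False):
    """Return (downloaded, lines), reusing file_name when it already exists.

    `fetch` is a hypothetical callable that takes a URL and returns its text.
    """
    if os.path.exists(file_name) and not force:
        # Cached copy found: skip the network and read the stored page.
        with open(file_name, encoding="utf-8") as f:
            return False, f.read().splitlines()
    text = fetch(url)
    with open(file_name, "w", encoding="utf-8") as f:
        f.write(text)
    return True, text.splitlines()
```

On a second call with the same file name the function returns without calling fetch, unless force is true.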

getAddress(s)
Gets the city.
Inputs:
    - s : Content string
Returns:
    - City

getAllOrbitzReviews(cityList, outF)
Gets all the reviews from orbitz

getAllTAReviews(cityList, outF, path)
Gets all the reviews from tripadvisor

getCities(cityF)
Reads the file and returns the list of cities.
Inputs:
    - cityF : File containing the list of cities (one per line)
Returns:
    - List of cities
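Reading a one-city-per-line file can be sketched as below. The name get_cities is illustrative, and skipping blank lines and trimming surrounding whitespace are assumptions of this sketch rather than documented behaviour.

```python
def get_cities(city_path):
    """Read one city per line, trimming whitespace and skipping blank lines."""
    with open(city_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```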

getCityHotelListPage(content)
Gets the links to the pages listing the hotels/B&Bs/rentals from the pruned search page contents.
Inputs:
    - content : Content list of pruned search page

getFileContentFromWeb(url)
Downloads data from a website.
Inputs:
    - url : URL to be downloaded
Returns:
    - Content of the URL
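The module imports urllib2, which exists only in Python 2. A Python 3 sketch of the same idea, using urllib.request and pausing after each download, could look like this. The delay parameter, the UTF-8 decoding and the function name are assumptions for illustration.

```python
import time
import urllib.request


def get_file_content_from_web(url, delay=0.0):
    """Fetch url, pause for `delay` seconds, and return the body as text."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
    # Pausing after each download reduces the chance of being throttled.
    time.sleep(delay)
    # Assumption: pages are UTF-8; undecodable bytes are replaced, not fatal.
    return raw.decode("utf-8", errors="replace")
```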

getFullAddress(s)
Gets the complete address.
Inputs:
    - s : Content string

getHotelListInsertIndex(s)
Tripadvisor only: Gets where the hotel index should be inserted

getNumberOfHotels(content)
Tripadvisor only: Gets the number of hotels in this city.
Inputs:
    - content : Website content
Returns:
    - Number of hotels

getNumberOfReviews(content)
Tripadvisor only: Gets the number of reviews for this hotel.
Inputs:
    - content : Website content
Returns:
    - Number of reviews

getOrbitzHotels(content)
For Orbitz, gets all hotels in a city.
Inputs:
    - content : Pruned content of the hotel list page

getOrbitzReviewsForHotel(revUrl, hName, hInd, city, outF)
Function to get all reviews for a particular hotel from Orbitz

getTAHotels(content)
For Tripadvisor, gets all hotels in a city.
Inputs:
    - content : Pruned content of the hotel list page

getTAReviewsForHotel(revUrl, hName, city, outF)
Function to get all reviews for a particular hotel from tripadvisor

isCharInt(c)
Checks to see if the character is an integer or not.
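A check of this kind can be written in one line. The sketch below is one possible reading, assuming that "integer" means a single ASCII digit; the name is_char_int is illustrative.

```python
def is_char_int(c):
    """Return True when c is exactly one ASCII digit character."""
    return len(c) == 1 and c in "0123456789"
```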

pruneCitySearchPage(content)
Prunes the page listing the search for the city to find the lines containing links to the page listing the hotels/B&Bs/rentals.
Inputs:
    - content : Content list

pruneHotelListFile(hlContents, fileName, option, city)
Prunes the hotel list file contents from the entire website to only what is required

pruneOrbitzHotelListPage(content)
Prunes the page returned by Orbitz listing all hotels in a city.
Inputs:
    - content : Content list of the original hotel list page

pruneOrbitzReviewPage(contents)
For Orbitz, prunes the page containing all the reviews down to just the vital lines containing the required information.
Inputs:
    - contents : Pruned content of the review page

pruneReviewFile(revContents, fileName, option, city)
Prunes the review file

pruneSearchFile(searchContents, fileName, city)
Prunes the search page for the city down to the lines containing links listing the hotels/B&Bs/rentals and writes them to a file.
Inputs:
    - city : City name
    - searchContents : Content list
    - fileName : Output file
Returns:
    - Pruned contents containing the desired links

pruneTAHotelListPage(content)
Prunes the page returned by Tripadvisor listing all hotels in a city.
Inputs:
    - content : Content list of the original hotel list page

pruneTAReviewPage(contents)
For Tripadvisor, prunes the page containing all the reviews down to just the vital lines containing the required information.
Inputs:
    - contents : Pruned content of the review page