scrapeHotelReviewData
Python Module for scraping hotel reviews.
This module scrapes hotel review data (date, rating and review text) from Tripadvisor/Orbitz for all hotels in (and close to) the given list of cities in a US state.
Note that no personal information about the reviewers, such as their account name or address, is scraped.
The module does this in 5 steps:
Step 1: Get the list of cities and the URLs listing the hotels for each of them. (Only needed for TA - Tripadvisor)
Step 2: Get the list of URLs for each hotel in a city.
Step 3: Get the list of review pages for each hotel.
Step 4: For each hotel, get the review information needed.
Step 5: Get the stats for each hotel.
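The flow of the five steps can be sketched roughly as nested loops. This is only an illustration, not the module's actual code: the function `scrape_state` and the three callables it receives are hypothetical stand-ins for the real page-fetching and parsing logic, and the choice of "stats" (review count and mean rating) is an assumption, since the text does not say which stats are computed.

```python
def scrape_state(cities, list_hotels, list_review_pages, parse_reviews):
    """Walk cities -> hotels -> review pages -> reviews, then summarize.

    All four parameters are hypothetical: `cities` is the result of Step 1,
    and the callables stand in for Steps 2, 3 and 4.
    """
    reviews = []
    for city in cities:                                    # Step 1 output
        for hotel_url in list_hotels(city):                # Step 2
            for page_url in list_review_pages(hotel_url):  # Step 3
                for review in parse_reviews(page_url):     # Step 4
                    reviews.append({"city": city, "hotel": hotel_url, **review})

    # Step 5 (assumed definition of "stats"): count and mean rating per hotel.
    totals = {}
    for review in reviews:
        count, total = totals.get(review["hotel"], (0, 0.0))
        totals[review["hotel"]] = (count + 1, total + review["rating"])
    stats = {
        hotel: {"count": count, "mean_rating": total / count}
        for hotel, (count, total) in totals.items()
    }
    return reviews, stats
```

Passing the steps in as callables keeps the sketch independent of either site's page layout, which is not described here.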
The module can also be interrupted and then (effectively) resume from where it stopped, without having to redownload all the previous files. It does this by storing the webpages it has downloaded and compacting the downloaded pages into a format that is easy to process.
NOTE: This becomes essential because Tripadvisor starts dropping requests if you download too many pages too fast, so the script may need to be restarted after pausing for a few seconds.
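A minimal sketch of the store-and-resume idea, assuming pages are cached on disk and keyed by a hash of their URL. The function name `fetch_cached`, the file naming scheme and the `download` callable are all hypothetical; the module's real cache layout is not described here.

```python
import hashlib
import time
from pathlib import Path


def fetch_cached(url, cache_dir, download, delay=0.0):
    """Return the page body for `url`, downloading only if it is not cached.

    `download` is a hypothetical callable that takes a URL and returns text.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme: one file per URL, named by the URL's SHA-256 hash.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cache_file = cache_dir / name
    if cache_file.exists():
        # Downloaded in an earlier run: no request and no pause needed.
        return cache_file.read_text(encoding="utf-8")
    body = download(url)
    cache_file.write_text(body, encoding="utf-8")
    # Pause only after a real download, to avoid hitting the site too fast.
    time.sleep(delay)
    return body
```

With this shape, restarting after an interruption simply replays the same URLs: pages already on disk are read back immediately, and only the missing ones trigger a request and a pause.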
Usage:
scrapeHotelReviewData.py [-h] -state STATE -cities CITIES [-delay DELAY] -site SITE -o OUTPUT -path PATH
Inputs:
-h, --help        show this help message and exit
-state STATE      State for which the city data is required.
-cities CITIES    Filename containing list of cities for which data is required
-delay DELAY      Amount of time to pause after downloading a website
-site SITE        Either tripadvisor or orbitz
-o OUTPUT         Path to output file for reviews
-path PATH        Directory where the webpages should be downloaded
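For illustration, a command-line interface with the same options could be declared with `argparse` roughly as follows. This is a sketch and not the script's actual parser: the `float` type and `0.0` default for `-delay`, and the use of `choices` to validate `-site`, are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the usage line above (details assumed)."""
    parser = argparse.ArgumentParser(
        prog="scrapeHotelReviewData.py",
        description="Scrape hotel reviews for the cities of a US state.",
    )
    parser.add_argument("-state", required=True,
                        help="State for which the city data is required.")
    parser.add_argument("-cities", required=True,
                        help="Filename containing list of cities for which data is required")
    # Assumption: the delay is a number with a default of no pause.
    parser.add_argument("-delay", type=float, default=0.0,
                        help="Amount of time to pause after downloading a website")
    # Assumption: the site name is validated against the two supported values.
    parser.add_argument("-site", required=True, choices=["tripadvisor", "orbitz"],
                        help="Either tripadvisor or orbitz")
    parser.add_argument("-o", dest="output", required=True,
                        help="Path to output file for reviews")
    parser.add_argument("-path", required=True,
                        help="Directory where the webpages should be downloaded")
    return parser
```

For example, `build_parser().parse_args(["-state", "NY", "-cities", "cities.txt", "-site", "orbitz", "-o", "reviews.tsv", "-path", "pages"])` yields a namespace whose `output` attribute is `"reviews.tsv"`; the state, file and directory names in that call are made up.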
Key Outputs:
- TSV File containing the review information
- Condensed set of information downloaded
- List of cities not found on website.
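As an illustration of the first output, reviews could be written to a TSV file with the standard `csv` module. The column names and the dictionary keys below are hypothetical; only the fields themselves (date, rating and review text) come from the description above.

```python
import csv


def write_reviews_tsv(path, reviews):
    """Write review dicts to a tab-separated file with a header row.

    Each dict is assumed to have the keys "hotel", "date", "rating", "review".
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["hotel", "date", "rating", "review"])
        for review in reviews:
            # Collapse tabs and newlines in the text so each review stays on one row.
            text = " ".join(review["review"].split())
            writer.writerow([review["hotel"], review["date"], review["rating"], text])
```

Collapsing whitespace in the review text is a design choice made for this sketch, so that the file can be read line by line without a CSV-aware parser.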