Web Scraping using Python with memcached deployed on Oracle Cloud
In this blog I will demonstrate web scraping using Python, and use memcached to cache specific data for a duration of one day.
I used Oracle Cloud to deploy the application, and the instance is accessible from the internet. These steps can be done on a local machine as well.
The Oracle Cloud setup is not detailed here; for setting up the Oracle instance, refer to the Oracle documentation.
- Briefly, once the instance is created from the Oracle web UI:
- Set up the SSH key so the instance can be accessed using PuTTY.
- Open the firewall so the instance can be accessed from the internet (see the example below).
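On Oracle Linux the firewall is typically managed by firewalld; a minimal sketch for opening the application's port (8084 is the port used by the app later in this post, adjust as needed):
sudo firewall-cmd --permanent --add-port=8084/tcp
sudo firewall-cmd --reload
Note that the ingress rules of the VCN security list in Oracle Cloud must also allow traffic on the same port.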
Install the memcached service in Oracle Linux
- To install the memcached server on the Linux machine, use the command below:
sudo dnf install memcached
- Now we need to start the memcached service. Initially the service will be disabled; this can be checked with the command below:
sudo systemctl status memcached
- Start the service using the command below, then check the status once again using the command above:
sudo systemctl start memcached
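Optionally, to have memcached start automatically on boot, the service can also be enabled:
sudo systemctl enable memcached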
Install the memcache client in Python
- To install the Python package pymemcache, use the command below. This package provides the client used to access memcached:
sudo pip3 install pymemcache
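A quick way to verify that the client and server work together is a small set/get round trip; a minimal sketch, assuming memcached is running locally on its default port 11211:
from pymemcache.client.base import Client

# connect to the local memcached server (default port 11211)
client = Client('localhost')

# store a value with a 60-second TTL, then read it back
client.set('greeting', 'hello', expire=60)
print(client.get('greeting'))  # prints b'hello' - values come back as bytes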
NOTE:-
Refer to the Oracle documentation for installing memcached on Oracle Linux.
If we need to flush all the values that were cached in the memcached server, we can use the command below; 11211 is the port used by the memcached server:
$ echo "flush_all" > nc localhost 11211
Python code to scrape a website
- The code will use the requests-html module to scrape https://travel.state.gov and extract the immigration data for the EB2 and EB3 India categories.
- The code will dynamically fill in the URL template below (see the sketch that follows):
https://travel.state.gov/content/travel/en/legal/visa-law0/visa-bulletin/{YEAR}/visa-bulletin-for-{NAME-OF-MONTH}-{YEAR}.html
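For instance, the URL for the current month's bulletin can be built like this (the same logic appears in the full app further below):
import calendar
from datetime import date

today = date.today()
# calendar.month_name gives the full month name, e.g. 'August'
month_name = calendar.month_name[today.month].lower()
url = (f"https://travel.state.gov/content/travel/en/legal/visa-law0/"
       f"visa-bulletin/{today.year}/visa-bulletin-for-{month_name}-{today.year}.html")
print(url)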
INFO
To install the Python requests-html package, use the command below. For more details, refer to the PyPI documentation.
pip install requests-html
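As a quick smoke test of requests-html (a hypothetical example, assuming network access), the sketch below fetches a page and counts the tables it contains:
from requests_html import HTMLSession

session = HTMLSession()
# fetch a page; elements can then be queried with CSS selectors or XPath
response = session.get('https://travel.state.gov')
tables = response.html.find('table')
print(f"status={response.status_code}, tables found={len(tables)}")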
Info about the code
- The Python code uses an HTTP server to handle GET requests. This code is meant to demonstrate the use of a cache and scraping a website.
- The memcached server should be started and running during the execution of the Python app below.
- When a GET request is received, the code checks memcache for the key {year}-{month} (e.g. 2022-8); if a value exists, it is served as the response.
- The code scrapes the website for the specific data set and stores it in the cache with a TTL (expiration in seconds) of one day.
- Create a file app.py with the content below.
import os
from requests_html import HTML, HTMLSession
from datetime import date
from datetime import datetime
import calendar
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from pymemcache.client.base import Client
import ast
PORT_NUMBER = int(os.environ.get('PORT', 8084))
##########
#### POC to scrape the website and get info (visabulletin)
#### This is not perfect code; it requires improvement, possibly tuning.
#### Challenge: calling a function within a function, where I had to use @staticmethod
#### The web link http://localhost:8084 renders the raw non-formatted html string
#### this can be tuned to be converted to json if needed - next step
#########
# HTTPRequestHandler class
class testHTTPServer_RequestHandler(BaseHTTPRequestHandler):
    # better to apply the DRY principle
@staticmethod
def toFetchTitleA(response, debug = False):
xpathStrA='/html/body/div[3]/div[7]/div[2]/div[1]/div[2]/div[3]'
tbl = response.html.xpath(xpathStrA)
tmptxt=''
if debug :
print (tbl)
for tb in tbl:
if debug:
print (tb)
ptags = tb.find("p")
            cntr = 0
for ptag in ptags:
cntr = cntr +1
if cntr ==14 :
tmptxt = str(ptag.text).replace('\n',' ')
#print(tmptxt)
return tmptxt
@staticmethod
def toFetchTitleB (response, debug = False) :
xpathStrB='/html/body/div[3]/div[7]/div[2]/div[1]/div[2]/div[5]'
tbl = response.html.xpath(xpathStrB)
tmptxt=''
if debug :
print (tbl)
for tb in tbl:
if debug :
print (tb)
ptags = tb.find("p")
            cntr = 0
for ptag in ptags:
cntr = cntr +1
if cntr ==3 :
tmptxt = str(ptag.text).replace('\n',' ')
#print(tmptxt)
return tmptxt
@staticmethod
def getWebContent():
session = HTMLSession()
toDate = date.today()
currentMonth = toDate.month
currentYear = toDate.year
monthName = calendar.month_name
currentMonthName = monthName[currentMonth].lower()
#print(f"{monthName[currentMonth].lower()} and {currentYear}")
url = f"https://travel.state.gov/content/travel/en/legal/visa-law0/visa-bulletin/{currentYear}/visa-bulletin-for-{currentMonthName}-{currentYear}.html"
print(url)
# form output to list
outputList = []
response = session.get(url)
tables= response.html.find("table")
colCnt = 0
IndCheckCol = 5
CategoryCheckCol=1
rowCnt = 0
checkRow = 4
requiredInfo= ''
printTableInfo = False
employTitleACnt = 1
employTitleBCnt = 2
tblCnt =0
toDebug = False
for table in tables:
trs = table.find("tr") #,first=True)
#tds = trs.find("td")
firstTd = table.find("td", first=True)
#print(td.text)
output = ''
colCnt =0
rowCnt =0
requiredInfo= ''
if "Employment" in firstTd.text :
tblCnt +=1
if tblCnt == employTitleACnt:
ret = testHTTPServer_RequestHandler.toFetchTitleA(response)
requiredInfo = str(tblCnt)+'@title :- ' + requiredInfo + ret + ' ## '
if toDebug :
print (requiredInfo)
if tblCnt == employTitleBCnt:
ret = testHTTPServer_RequestHandler.toFetchTitleB(response)
requiredInfo = str(tblCnt)+'@title :- ' + requiredInfo + ret + ' ## '
if toDebug :
print (requiredInfo)
tmp = str(firstTd.text).replace('\n',' ')
for tr in trs:
# we use only till rowcount 3
rowCnt = rowCnt+1
headtd = tr.find("td")
colCnt = 0;
for head in headtd :
colCnt = colCnt+1;
tmp = str(head.text).replace('\n',' ')
output += tmp+ " | "
if colCnt == CategoryCheckCol and rowCnt <= checkRow:
requiredInfo += str(tblCnt)+'@'+tmp + ' :- '
if colCnt == IndCheckCol and rowCnt <= checkRow :
requiredInfo += tmp + ' ## '
## prints the info in tabular format
if printTableInfo == True :
print (output)
output= ''
outputList.append(requiredInfo)
return outputList
@staticmethod
def parseToJson(resultList, debug = False) :
toDate = date.today()
currentMonth = toDate.month
currentYear = toDate.year
monthName = calendar.month_name
cacheClient = Client('localhost')
isCacheResult = cacheClient.get(f'{currentYear}-{currentMonth}')
if isCacheResult :
            # the cached value is returned as bytes; it is decoded in do_GET before being served
return isCacheResult
key = ''
value=''
iter=0
temp = {}
result = {}
if debug:
print(f"resultList input := {resultList}")
for item in resultList:
if debug :
print(f"processing ... {item}")
temp = {}
codedItem = item.split("##")
if debug:
print (codedItem)
for item in codedItem :
if len(item) > 0 and len(item.split(":-")) > 1 :
if debug:
print (f"item = {item} && length = {len(item)}")
codedKey = item.split(":-")[0]
value = item.split(":-")[1]
if debug:
print(f"value = {value} && codedKey = {codedKey}")
if '@' in codedKey:
iter= codedKey.split('@')[0]
key= codedKey.split('@')[1].strip()
if debug:
print(f"key = {key}")
temp[key]=value.strip()
result[iter] = temp
result['timestamp']= str(datetime.now())
# cache only for 1 day
cacheClient.set(f'{currentYear}-{currentMonth}',result,86400)
return result
# GET
def do_HEAD(self):
# Send response status code
self.send_response(200)
self.send_header('Content-type', 'text/html')
self.end_headers()
return
# GET
def do_GET(self):
# Send response status code
self.send_response(200)
# Send headers
        self.send_header('Content-type','application/json')
self.end_headers()
# Send message back to client
outMessage = testHTTPServer_RequestHandler.getWebContent()
#print(type(outMessage))
#print (outMessage)
message = ""
json_data = testHTTPServer_RequestHandler.parseToJson(outMessage, False)
#print(json_data)
if json_data and not isinstance(json_data,bytes):
#Python pretty print JSON
message = json.dumps(json_data, indent=4)
# below condition to convert the bytes type since cache value is of this type
if json_data and isinstance(json_data,bytes):
            # convert the string form of the dictionary value to a dictionary type
dict_str = json_data.decode("utf-8").replace("'",'"')
# String converted to dict type, using ast
dict_type_val = ast.literal_eval(dict_str)
# pretty the json value back
message = json.dumps(dict_type_val, indent=4)
message = str(message)
#print(f"message :- {message}")
#message = "Hello world!"
# Write content as utf-8 data
self.wfile.write(bytes(message, "utf8"))
return
def run():
print('starting server...')
# Server settings
    # The server listens on PORT_NUMBER (default 8084); for port 80, which is normally used for an http server, you need root access
server_address = ('0.0.0.0', PORT_NUMBER)
httpd = HTTPServer(server_address, testHTTPServer_RequestHandler)
print('running server...')
httpd.serve_forever()
if __name__ == '__main__':
    run()
Note:
To set up memcached on Windows, refer to the memcached documentation.
In Windows 10, I edited the registry entry for memcached using regedit and included the additional arguments: -m for memory and -p for port (see the reference on running memcached on a specific port).
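For example, the service's command line might end with arguments like the following (the values are illustrative; -m sets the memory limit in MB and -p the TCP port):
memcached.exe -m 512 -p 11211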
Run the Python code
- Since the Oracle Linux instance has Python installed, we can use the command below to run the process in the background:
$ nohup python app.py &
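Once the server is running, a quick check from the instance itself (assuming the default port 8084):
$ curl http://localhost:8084/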
Output
- Since the Oracle instance is accessible via the internet, I can use its IP address to reach the instance and serve the response.
- The URL in this case is http://192.9.244.94:8084/, and the response is a JSON payload that looks like the one below:
{
" 1": {
"title": "A. FINAL ACTION DATES FOR\u00a0EMPLOYMENT-BASED\u00a0PREFERENCE CASES",
"Employment- based": "INDIA",
"1st": "C",
"2nd": "01DEC14",
"3rd": "15FEB12"
},
" 2": {
"title": "B. DATES FOR FILING OF EMPLOYMENT-BASED\u00a0VISA\u00a0APPLICATIONS",
"Employment- based": "INDIA",
"1st": "C",
"2nd": "01JAN15",
"3rd": "22FEB12"
},
"timestamp": "2022-08-09 19:17:47.274246"
}
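To confirm the result was cached, the raw memcached text protocol can be queried directly; the key below assumes an August 2022 run, matching the {year}-{month} key format used in the code:
$ echo "get 2022-8" | nc localhost 11211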
Points to note
- The web scraping in the Python code is mostly hard-coded (XPath expressions and row/column positions), so the code is not very generic.
- The Oracle instance is only accessible via its IP address; no DNS service is used to resolve the IP address.