SAP HANA One - Continuing my Python Based Loads with PyHDB

1. Background and Creating Tables in HANA

I'm continuing the series that I started on my blogger site (http://www.sapbasisconsultant.net/).

Originally I loaded MariaDB with data extracted from financial websites, hoping I could then import that data from MariaDB into a HANA One instance on AWS. After spending some fruitless hours trying to get ODBC working correctly, I decided to push the data into HANA using SAP's Python client module for HANA, 'PyHDB' (available here --> https://github.com/SAP/PyHDB).

The first thing I did was to create some tables. Since I'm trying to complete almost everything using Python, I created this short script to create what I needed:

# Name....: mk_tables.py
# Author..: J. Haynes
# Purpose.: Create needed tables in HDB

# Pull in the needed libraries and setup the connection to the AWS instance
import pyhdb
conn = pyhdb.connect(
    host="54.226.200.xxx",
    port=30015,
    user="hanauser",
    password="xxxxx"
)

# Setup a cursor object
cur = conn.cursor()

# Create tables in HANA
cur.execute('CREATE TABLE F_STATS("SymId" INTEGER, "SymName" \
VARCHAR (10) null,"DateText" VARCHAR (20) null, "HeadLine" \
VARCHAR (30) null, "DataLine" VARCHAR (30) null)')

cur.execute('CREATE TABLE Symbols("SymId" INTEGER, "Symbol" VARCHAR (10) null, \
"Location" VARCHAR (20) null, "Name" VARCHAR (30) null)')

cur.execute('CREATE TABLE SymData("SymId" INTEGER, "SymName" VARCHAR (10) null, \
"DateText" VARCHAR (12) null, "Open" DECIMAL (15,5), "High" DECIMAL (15,5), \
"LOW" DECIMAL (15,5), "CLOSE" DECIMAL (15,5), "VOLUME" DECIMAL (15,5), "ADJCLOSE" DECIMAL (15,5))')

# Create the sequences the load scripts below use to generate "SymId" values
cur.execute('CREATE SEQUENCE id_seq START WITH 1')
cur.execute('CREATE SEQUENCE Symid_seq START WITH 1')



This script opens a connection to the HANA instance, sets up a cursor object, and then creates each of the three
tables we will work with below.
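One thing worth noting: HANA preserves the case of double-quoted column names, so the DDL strings have to be built carefully. As a sketch (this helper is my own convenience function, not part of PyHDB), the statements above can be generated from a column list instead of typed by hand:

```python
# Build a CREATE TABLE statement from an ordered list of
# (column, type) pairs, double-quoting each column name so that
# HANA preserves its mixed case exactly.
def build_create(table, columns):
    cols = ", ".join('"%s" %s' % (name, ctype) for name, ctype in columns)
    return 'CREATE TABLE %s(%s)' % (table, cols)

stmt = build_create("Symbols", [
    ("SymId", "INTEGER"),
    ("Symbol", "VARCHAR (10) null"),
    ("Location", "VARCHAR (20) null"),
    ("Name", "VARCHAR (30) null"),
])
print(stmt)
```

The generated string matches the hand-written statement above, and the same column list can later be reused when building the matching INSERT.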

2. Loading the Market Symbols

I took the script that extracts market symbols (from my earlier posts) and updated the connection settings to point to the HANA One instance.

# Name.....: updt_sym.py
# Author...: J. Haynes
# Purpose..: Import the market symbols into HANA

# Open needed Python libs
import pyhdb
import string,io
import urllib2
from bs4 import BeautifulSoup

# Establish the connection and setup the cursor object
conn = pyhdb.connect(
    host="54.234.128.xxx",
    port=30015,
    user="hanauser",
    password="xxxxx"
)
cur = conn.cursor()

# Extract info from the main Wikipedia page
url_part1 = 'http://en.wikipedia.org/wiki/Companies_listed_on_the_New_York_Stock_Exchange_'
url = url_part1 + '(0-9)'

aurl = urllib2.urlopen(url)
soup = BeautifulSoup(aurl.read(),"html.parser")

# Clean out previous entries in the Symbols table
cur.execute("DELETE FROM Symbols")
conn.commit()

# step through each row in the table and build each record for the insert
for row in soup('table')[1]('tr'):
   tds = row('td')
   if (len(tds) > 0) :
       tmpname = (tds[0].string or u"No Name")
       tmpname = tmpname.replace("'", "")[:30]
       stmt = 'INSERT INTO Symbols("SymId", "Symbol", "Location", "Name") '
       stmt = stmt +  "VALUES (id_seq.NEXTVAL,'" + tds[1].string + "','" + tds[2].string + "','" + tmpname + "')"
       print stmt
       cur.execute(stmt)
       conn.commit()

for letter in string.uppercase:
    url = url_part1 + '(' + letter + ')'

    aurl = urllib2.urlopen(url)
    soup = BeautifulSoup(aurl.read(),"html.parser")

    # step through each row in the table and build each record for the insert
    for row in soup('table')[1]('tr'):
        tds = row('td')
        if (len(tds) > 0) :
           tmpname = (tds[0].string or u"No Name")
           tmpname = tmpname.replace("'", "")[:30]
           stmt = 'INSERT INTO Symbols("SymId", "Symbol", "Location", "Name") '
           stmt = stmt +  "VALUES (id_seq.NEXTVAL,'" + tds[1].string + "','" + tds[2].string + "','" + tmpname + "')"
           #print stmt
           cur.execute(stmt)
           conn.commit()

This script scrapes information from the Wikipedia pages, builds an insert statement for each table row, and then executes that statement through the cursor object named 'cur'.
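Because the insert is built by plain string concatenation, a company name containing an apostrophe would terminate the SQL literal early and break the statement. The sanitizing step can be sketched in isolation like this (the helper names and the sample values are my own, not from PyHDB; the 30-character cap matches the Name column's VARCHAR (30)):

```python
# Strip single quotes (which would end the SQL string literal early)
# and cap the value at the column width before concatenating it into
# the INSERT statement.
def clean_name(raw, width=30):
    return (raw or u"No Name").replace("'", "")[:width]

def build_symbol_insert(symbol, location, name):
    stmt = 'INSERT INTO Symbols("SymId", "Symbol", "Location", "Name") '
    stmt += "VALUES (id_seq.NEXTVAL,'%s','%s','%s')" % (symbol, location, clean_name(name))
    return stmt

print(build_symbol_insert("MMM", "New York", "3M Company"))
```

A cleaner long-term fix would be PyHDB's parameter binding (`cur.execute(sql, params)`), which avoids hand-escaping entirely, but the sketch above mirrors the concatenation style used in the script.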

The data can then be validated in the HANA Studio using the Data Preview function:

Here is a view of the updated table:

3. Loading the Market Data

In my previous blog posts, I used a script to download market data (into CSV files) for the stocks listed on a Wikipedia page (that original blog post is here --> http://www.sapbasisconsultant.net/2015/08/sap-hana-preparing-my-stock-ma...). Once that data was collected, I imported all of the CSV files into MariaDB and, as a second test, into Hadoop using Hive. Now I just need to push that same data into HANA, so I updated the original Hive script as shown here:

# Name.....: updt_dat.py
# Author...: J. Haynes
# Purpose..: Copy stock data files into Hana

# Open needed libs
import sys
import pyhdb
import string,io
import urllib2
from bs4 import BeautifulSoup

# Example of the raw data to load; the prepared '.csv_new' files
# have the stock symbol prepended to each record
#Date,Open,High,Low,Close,Volume,Adj Close
#2010-01-28,6.38,6.38,6.12,6.30,61000,5.68551

# Set the location of the import files
fdir = "/home/hadoopuser/stockdata/history/new/"

# Establish the connection and setup the cursor object
conn = pyhdb.connect(
    host="54.234.128.xxx",
    port=30015,
    user="hanauser",
    password="xxxxxx"
)

cur1 = conn.cursor()

# Clean out the current SymData table
cur1.execute("DELETE from SymData")
conn.commit()

# Select Symbol data from the database and step through to load each file
stmt = "SELECT * from Symbols"
cur1.execute(stmt)
rows = cur1.fetchall()

for row in rows:
   fname = fdir + row[1] + ".csv_new"
   print "Importing File --> " + fname

   # read the contents of the file and step through each record
   f = open(fname, 'r')
   file_content = f.readlines()

   for frow in file_content:
     Symbol, zDate, Open, High, Low, Close, Volume, AdjClose = frow.strip().split(",")
     stmt = 'INSERT INTO SymData("SymId", "SymName", "DateText", "Open", "High", "LOW", "CLOSE", "VOLUME", "ADJCLOSE") '
     stmt = stmt +  "VALUES (Symid_seq.NEXTVAL,'" + Symbol + "','" + zDate + "'," + Open + "," + High + "," + Low  + "," + Close  + "," + Volume  + "," + AdjClose + ")"
     cur1.execute(stmt)
     conn.commit()

This script pulls the list of market symbols from HANA, opens a specific file for a given stock, opens and steps through the file, and then creates/executes an insert into the HANA table for each record in that file.
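The per-record work can be sketched on its own: split a prepared '.csv_new' line (symbol plus the seven Yahoo history fields) and assemble the insert. The sample row here is made up:

```python
# Parse one record of a prepared .csv_new file (stock symbol
# prepended to the seven price-history fields) and build the
# corresponding INSERT statement for the SymData table.
def parse_row(line):
    symbol, zdate, o, h, l, c, vol, adj = line.strip().split(",")
    stmt = 'INSERT INTO SymData("SymId", "SymName", "DateText", "Open", "High", "LOW", "CLOSE", "VOLUME", "ADJCLOSE") '
    stmt += "VALUES (Symid_seq.NEXTVAL,'%s','%s',%s,%s,%s,%s,%s,%s)" % (symbol, zdate, o, h, l, c, vol, adj)
    return stmt

print(parse_row("ABC,2010-01-28,6.38,6.38,6.12,6.30,61000,5.68551"))
```

Note that the numeric fields are concatenated unquoted, so HANA receives them as DECIMAL literals rather than strings.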

As before, the data upload can be validated in the HANA Studio:

4. Loading the Fundamental Statistics

This data is loaded using the same process as in the two steps above. The only difference is that this script scrapes statistics from company pages on Yahoo Finance:

# Name.....: updt_by.py
# Author...: J. Haynes
# Purpose..: Generate a list of book values and insert into a Hana table

# Load any needed libraries
import pyhdb
import csv,os.path,time
import string
import urllib2
from bs4 import BeautifulSoup

# For now this is a test URL
#url = "http://finance.yahoo.com/q/ks?s=GOOG+Key+Statistics"

# Open a connection to the db and pull the list of stocks
# Change this to pull from hive
# Establish the connection and setup two cursor objects
conn = pyhdb.connect(
    host="54.146.253.xxx",
    port=30015,
    user="hanauser",
    password="xxxxxx"
)
cur1 = conn.cursor()
cur2 = conn.cursor()

# Pull the list of market symbols from HANA
stmt = "SELECT * from Symbols"
cur1.execute(stmt)
rows = cur1.fetchall()

# step through each symbol and generate the URL for the parsing process
for row in rows:
    url = "http://finance.yahoo.com/q/ks?s=" + row[1] + "+Key+Statistics"
    sname=row[1]

    # Open the URL and parse
    aurl = urllib2.urlopen(url)
    soup = BeautifulSoup(aurl.read(),"html.parser")

    # Create lists of data parsed by using BeautifulSoup
    head1 = soup.find_all("td", class_="yfnc_tablehead1")
    data1 = soup.find_all("td", class_="yfnc_tabledata1")

    # Create a list containing td header content
    hlist = [""]
    for row in head1:
        headline = row.contents[0]
        hlist.append(headline)

    # Create a list containing td data content
    dlist = [""]
    for row in data1:
        dataline = row.string
        dlist.append(dataline)

    # Remove any data already loaded for today
    stmt = "DELETE FROM F_STATS where \"DateText\"='" + time.strftime("%Y%m%d") + "'"
    cur2.execute(stmt)
    conn.commit()
    # step through the parallel lists and insert each headline/data pair
    j = 1
    while j < len(hlist):
        stmt = 'INSERT INTO F_STATS("SymId", "SymName", "DateText", "HeadLine", "DataLine") '
        stmt = stmt + "VALUES (id_seq.NEXTVAL,'" + sname + "','" + time.strftime("%Y%m%d") + \
            "','" + str(hlist[j])[:30] + "','" + str(dlist[j] or u"")[:30] + "')"
        cur2.execute(stmt)
        conn.commit()
        j = j + 1

As I described before, this script pulls the data from two columns on the Yahoo page and creates parallel lists of information. Then it steps through those lists, and creates/executes an 'insert' statement. Here is a quick look at the data:
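The parallel-list walk is easier to see in isolation. The header list holds each statistic's label and the data list its value at the same index; pairing them with zip makes the relationship explicit (the sample values below are made up):

```python
# hlist[j] holds a statistic's label and dlist[j] its value; the
# leading "" entries mirror how the scraper seeds both lists.
# Pairing them shows each row that would go into F_STATS, capped
# at the 30-character column width.
hlist = ["", "Market Cap (intraday)", "Trailing P/E", "Price/Book (mrq)"]
dlist = ["", "432.12B", "27.53", "4.12"]

rows = [(h[:30], d[:30]) for h, d in zip(hlist, dlist) if h]
for headline, dataline in rows:
    print(headline, "->", dataline)
```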

5. Conclusion

Now I can finally start having fun with the data in HANA. Here is a quick look at how to filter and sort the data to show stocks with low P/B ratios (exciting!):
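The filter-and-sort in HANA Studio boils down to "stocks with Price/Book below some threshold, cheapest first." In Python terms, with invented numbers, the idea looks like this:

```python
# Toy illustration of the Studio filter: keep stocks with a P/B
# ratio under 1.0 and sort ascending (symbols and values invented).
stats = [("AA", 0.85), ("BB", 3.10), ("CC", 0.42), ("DD", 1.70)]
low_pb = sorted([s for s in stats if s[1] < 1.0], key=lambda s: s[1])
print(low_pb)
```

In HANA itself this is just an ORDER BY with a WHERE clause over the F_STATS values, once they are cast to numbers.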

For me, the next step is to create a view and add the full company name. From there we can create reports, start graphing the historical data, and play with some of the projection functions provided by HANA.