The A – Z of the PyAthena Library

A database plays an important role in almost every application. When solving real-world problems, we cannot think only about the front end, the back end, or the database; we need all three, working together seamlessly. Today, we will discuss one such Python library that provides an easy interface for interacting with Amazon Athena: PyAthena.

PyAthena is a Python library that implements the Python DB API 2.0 for Amazon Athena. It also lets us query Athena through SQLAlchemy or pandas, provided the corresponding extras are installed. Run the following commands to install it on your machine:

Installation

pip install pyathena
pip install PyAthena[Pandas]
pip install PyAthena[SQLAlchemy]

Usage

Now that the installation is complete, let's look at how to use the library.

Querying the Database in PyAthena

from pyathena import connect

cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                 region_name="us-west-2").cursor()
# Fetching a single row
cursor.execute("SELECT * FROM one_row")
print(cursor.description)
print(cursor.fetchall())

# Fetching multiple rows (limited to 10)
cursor.execute("SELECT * FROM many_rows LIMIT 10")
for row in cursor:
    print(row)

In the example above, we first imported the connect function and used it to connect to Amazon Athena. Once valid credentials are available (here, via the default AWS credential chain), the connection is established. We then executed a query to fetch a row from the database and printed its description and contents. In the second query, we fetched rows from a larger table, limited the result to 10, and iterated over the cursor to print them. The pattern is the same for any SQL: pass the statement to the execute() method, and the results become available through the cursor.
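Because PyAthena implements the standard Python DB API 2.0, this execute/fetch/iterate pattern is the same one exposed by any compliant driver. The sketch below demonstrates the identical cursor workflow with the standard library's sqlite3 module, so it runs locally without AWS credentials (the table name `many_rows` just mirrors the example above):

```python
import sqlite3

# An in-memory SQLite database stands in for Athena; the cursor API is the same
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE many_rows (a INTEGER)")
cursor.executemany("INSERT INTO many_rows VALUES (?)", [(i,) for i in range(25)])

# Same DB API pattern: execute, inspect description, then fetch or iterate
cursor.execute("SELECT * FROM many_rows LIMIT 10")
rows = cursor.fetchall()
print(len(rows))                  # number of rows fetched
print(cursor.description[0][0])   # first column's name
```

The only Athena-specific parts of the earlier example are the connect() arguments; everything from execute() onward transfers unchanged.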

Querying the Database with Parameters in PyAthena

Now, once we understand how to build the connection and run the simple query, let’s see how to query the database with parameters.

from pyathena import connect

cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                 region_name="us-west-2").cursor()
cursor.execute("""
               SELECT col_string FROM one_row_complex
               WHERE col_string = %(param)s
               """, {"param": "a string"})
print(cursor.fetchall())

In the example above, we again created a cursor and ran a query after connecting, but this time we did not simply run a fixed query: we passed a parameter. PyAthena uses the DB API's "pyformat" parameter style, so the %(param)s placeholder in the SQL is filled from the dictionary passed as the second argument to execute().
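Conceptually, the pyformat placeholder behaves like Python's own dictionary-based `%` interpolation. The snippet below is an illustration only: the real driver also handles quoting and escaping for you, which is exactly why you should pass parameters to execute() instead of formatting SQL strings yourself:

```python
# Illustration of the pyformat placeholder; quoting is done by hand here,
# whereas a real driver escapes values safely on your behalf.
template = "SELECT col_string FROM one_row_complex WHERE col_string = %(param)s"
filled = template % {"param": "'a string'"}
print(filled)
```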

With the basics in place, let's see how to query the database using different frameworks.

SQLAlchemy for Querying Athena

As we already know, SQLAlchemy is a Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL. PyAthena[SQLAlchemy] is the extra that provides a SQLAlchemy dialect for AWS Athena. Let's see how we can use it.

Example: Reading Data from the Database

from urllib.parse import quote_plus  # PY2: from urllib import quote_plus
from sqlalchemy.engine import create_engine
from sqlalchemy.sql.expression import select
from sqlalchemy.sql.functions import func
from sqlalchemy.sql.schema import Table, MetaData

# Specifying the connection in conn_str variable
conn_str = "awsathena+rest://{aws_access_key_id}:{aws_secret_access_key}@athena.{region_name}.amazonaws.com:443/"\
           "{schema_name}?s3_staging_dir={s3_staging_dir}"

# Creating engine on which query runs
engine = create_engine(conn_str.format(
    aws_access_key_id=quote_plus("YOUR_ACCESS_KEY_ID"),
    aws_secret_access_key=quote_plus("YOUR_SECRET_ACCESS_KEY"),
    region_name="us-west-2",
    schema_name="default",
    s3_staging_dir=quote_plus("s3://YOUR_S3_BUCKET/path/to/")))

# Running SQLAlchemy queries (MetaData(bind=...) and autoload are SQLAlchemy 1.x style)
many_rows = Table("many_rows", MetaData(bind=engine), autoload=True)
print(select([func.count("*")], from_obj=many_rows).scalar())

In the example above, we first imported all the necessary modules. We then built a connection string containing the required credentials; note that each credential is passed through quote_plus() so that special characters survive inside the URL. Next, we created an engine variable that holds the connection between SQLAlchemy and AWS Athena. Once the connection is built, we reflect the table and run our query using SQLAlchemy syntax.
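The quote_plus() calls matter because access keys, secret keys, and S3 paths can contain characters such as `/`, `+`, and `:` that would otherwise break the connection URL. A small runnable check of what the encoding actually produces:

```python
from urllib.parse import quote_plus

# The S3 staging dir contains ':' and '/', which must be percent-encoded in a URL
s3_dir = quote_plus("s3://YOUR_S3_BUCKET/path/to/")
print(s3_dir)

# AWS secret keys frequently contain '/' and '+'
secret = quote_plus("abc/def+ghi")
print(secret)
```

Without this encoding, a slash inside the secret key would be parsed as a path separator in the connection string and the engine would fail to connect.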

Pandas for Querying Athena

However, one may have no knowledge of SQLAlchemy; in that case, PyAthena provides other interfaces. One of them uses pandas, and it is one of the easiest ways to query your data. The data you receive comes back as a pandas DataFrame, which you can then use however you like. Let's see how.

Example: Reading Data from Database

from pyathena import connect
import pandas as pd

conn = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
               region_name="us-west-2")
df = pd.read_sql_query("SELECT * FROM many_rows", conn)
print(df.head())

In the example above, we used the read_sql_query() method to execute our query, passing the SQL as the first argument and the connection as the second. The result is stored as a DataFrame object in the df variable.
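read_sql_query() is a generic pandas entry point that accepts any DB API connection, so the same call can be tried locally against stdlib sqlite3 before pointing it at Athena (this sketch assumes pandas is installed; only the connect() line differs from the Athena version):

```python
import sqlite3
import pandas as pd

# Local stand-in for the Athena connection
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE many_rows (a INTEGER, b TEXT)")
conn.executemany("INSERT INTO many_rows VALUES (?, ?)",
                 [(1, "x"), (2, "y"), (3, "z")])

# Identical call shape to the Athena example above
df = pd.read_sql_query("SELECT * FROM many_rows", conn)
print(df.shape)
print(list(df.columns))
```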

Example: Writing Data to Database – to_sql() Method

So, till now, we have seen how we can read data from the database. We understood how to run a query. Now, it’s time to see how to write data to the database. For that, we will use the to_sql() method. Let’s see that.

import pandas as pd
from pyathena import connect
from pyathena.pandas.util import to_sql

conn = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
               region_name="us-west-2")
df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})
to_sql(df, "YOUR_TABLE", conn, "s3://YOUR_S3_BUCKET/path/to/",
       schema="YOUR_SCHEMA", index=False, if_exists="replace")

In the example above, we used the to_sql() method to send the data we had stored as a DataFrame, specifying both the connection and the S3 destination. One important thing to note: depending on your PyAthena version, to_sql() may rely on SQLAlchemy under the hood, so it is safest to have it installed as well.

Creating and Dropping Tables in PyAthena

Now that you have seen how to use PyAthena to read and write data with SQL queries, it's time to look at DDL commands such as creating or dropping a table. For that, we again use the cursor's execute() method, which takes the SQL statement as a parameter and runs it against our database.

from pyathena import connect

cursor = connect(s3_staging_dir='s3://[bucket_name]', region_name='us-east-1').cursor()

# Creating a table (Athena tables are external tables backed by files in S3,
# so the DDL needs a LOCATION clause and Athena types such as string)
cursor.execute("CREATE EXTERNAL TABLE IF NOT EXISTS db.students "
               "(roll_no int, name string, age int, salary int) "
               "LOCATION 's3://[bucket_name]/students/'")

# Dropping a table
cursor.execute("DROP TABLE IF EXISTS db.table_name")
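Because Athena DDL statements are assembled as plain SQL strings, it can help to build them programmatically when the column list comes from elsewhere. The helper below is purely illustrative (its name and the column list are our own, not part of PyAthena), and it runs locally since it only produces the statement:

```python
def build_create_external_table(database, table, columns, location):
    """Assemble an Athena CREATE EXTERNAL TABLE statement (illustrative helper)."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} ({cols}) "
        f"LOCATION '{location}'"
    )

ddl = build_create_external_table(
    "db", "students",
    [("roll_no", "int"), ("name", "string"), ("age", "int"), ("salary", "int")],
    "s3://YOUR_S3_BUCKET/path/to/students/",
)
print(ddl)
```

The resulting string is what you would pass to cursor.execute(). In real code, validate or whitelist identifiers before interpolating them, since DDL cannot use bound parameters.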

PyAthena Performance

PyAthena provides a good interface for connecting to AWS Athena, but its performance depends heavily on the conditions under which it is used. It also depends on which commands or methods we use for querying our data, so it is essential to prefer the cheaper calls where possible; this will boost performance in terms of time. One more highly important factor is understanding Athena's architecture (queries run serverlessly over data in S3, with results staged back to S3) and working with it rather than against it.

PyAthena vs. Boto3

Both are used for the same purpose, i.e., providing an API for AWS Athena, so it is a recurring question which one is faster. The answer varies with your needs, but since PyAthena itself depends on boto3, the two should perform similarly unless PyAthena adds significant overhead on top of it, which is unlikely. In practice, you can usually gain more by tuning your queries and data layout than by switching clients.
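For comparison, the raw boto3 flow that PyAthena wraps is start_query_execution, then polling get_query_execution until the query finishes, then get_query_results. The sketch below writes that loop against any client object, so it can be exercised locally with a stub (the three API call names are boto3's real Athena operations; the stub class is ours):

```python
import time

def run_athena_query(client, query, database, output_location, poll_seconds=1.0):
    """Run a query via the raw Athena API shape: start, poll until done, fetch."""
    qid = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        status = client.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Query finished in state {state}")
    return client.get_query_results(QueryExecutionId=qid)

# Minimal stub standing in for boto3.client("athena"), so the flow runs locally
class FakeAthenaClient:
    def start_query_execution(self, **kwargs):
        return {"QueryExecutionId": "qid-1"}
    def get_query_execution(self, QueryExecutionId):
        return {"QueryExecution": {"Status": {"State": "SUCCEEDED"}}}
    def get_query_results(self, QueryExecutionId):
        return {"ResultSet": {"Rows": [{"Data": [{"VarCharValue": "1"}]}]}}

results = run_athena_query(FakeAthenaClient(), "SELECT 1", "default",
                           "s3://YOUR_S3_BUCKET/path/to/")
print(results["ResultSet"]["Rows"][0]["Data"][0]["VarCharValue"])
```

This is the bookkeeping (polling, state handling, result paging) that PyAthena's cursor hides behind execute() and fetchall(), which is usually the deciding factor between the two libraries rather than speed.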

Conclusion

In this article, we learned about the Python PyAthena library. We saw how to move data between our local machine and AWS Athena, worked through several examples to understand its usage, and looked at the different libraries it supports for sending and receiving data. I hope this article has helped you.
