Accessing data using Python
Here are some examples of how to access data using Python. You can find a similar
example for R in the Accessing data using R section.
Importing packages
First we need to import the packages we will be using. We will use requests
to download data from the API and urljoin (from urllib.parse) to build
URLs. In the end, we will load the data into a pandas DataFrame.
import json
import requests
import pandas as pd
from urllib.parse import urljoin
Deal with data and pagination in Python
We can define some utility functions to deal with data and pagination. Those functions are general and can be used to access any endpoint of the SMARTER API:
base_url = "https://webserver.ibba.cnr.it"
session = requests.Session()
def read_url(session, url, params=None):
    response = session.get(url, params=params)
    # check errors: SMARTER-backend is supposed to return JSON objects
    if response.headers['Content-Type'] != 'application/json':
        raise Exception("API did not return json")
    # parse json data
    parsed = response.json()
    # check for errors
    if response.status_code != 200:
        raise Exception(
            f"SMARTER API returned an error [{response.status_code}]: "
            f"'{parsed['message']}'")
    return parsed
def get_smarter_data(url, params=None, session=session):
    # do the request and parse data with our function
    parsed = read_url(session, url, params)
    # track results
    results = parsed["items"]
    # check for pagination
    while parsed["next"]:
        # append next value to base url
        url = urljoin(base_url, parsed["next"])
        # query arguments are already in url: get next page
        parsed = read_url(session, url)
        # append new results to results list
        results += parsed["items"]
    return results
base_url is the base URL of the SMARTER API, and session keeps track of the
cookies and headers shared across requests. The read_url function parses the
API response and raises an exception if the response is not JSON or the status
code is not 200. The get_smarter_data function collects the items of the first
page and then follows the next link, page after page, until no pages are left.
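The pagination loop can be tried out without touching the network. The following is a minimal sketch, not part of the SMARTER code: fake_pages and fake_read_url are hypothetical stand-ins for the API and for read_url, and the relative form of the next link is an assumption made for illustration.

```python
from urllib.parse import urljoin

base_url = "https://webserver.ibba.cnr.it"

# Hypothetical two-page response, keyed by the full URL that would be requested.
# The "items" and "next" keys mirror the ones read by get_smarter_data; the
# exact form of the "next" value is an assumption.
fake_pages = {
    "https://webserver.ibba.cnr.it/smarter-api/breeds": {
        "items": [{"code": "AAA"}, {"code": "BBB"}],
        "next": "/smarter-api/breeds?page=2",
    },
    "https://webserver.ibba.cnr.it/smarter-api/breeds?page=2": {
        "items": [{"code": "CCC"}],
        "next": None,
    },
}


def fake_read_url(url):
    """Stand-in for read_url: return the canned page for this URL."""
    return fake_pages[url]


def collect_items(url):
    """Follow the 'next' links until they run out, accumulating 'items'."""
    parsed = fake_read_url(url)
    results = list(parsed["items"])
    while parsed["next"]:
        # urljoin resolves the relative link against the base URL
        url = urljoin(base_url, parsed["next"])
        parsed = fake_read_url(url)
        results += parsed["items"]
    return results


all_items = collect_items(urljoin(base_url, "smarter-api/breeds"))
print(len(all_items))  # 3
```

Because the query arguments travel inside the next link, only the first request needs the params argument.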
Read data with Python
Now we can read data from the API using the functions we defined before. We can
get the data from the API and transform it into a pandas.DataFrame by creating
a custom function in order to call the API with different parameters:
def get_smarter_datasets(params=None):
    url = urljoin(base_url, "smarter-api/datasets")
    results = get_smarter_data(url, params)
    if not results:
        print("No results found")
        return None
    return pd.json_normalize(results)
By calling the function get_smarter_datasets we can collect all the dataset
objects from the Dataset endpoint and transform them into a pandas.DataFrame.
Similarly, we can define a function to get the data from the Breed endpoint:
def get_smarter_breeds(params=None):
    url = urljoin(base_url, "smarter-api/breeds")
    results = get_smarter_data(url, params)
    if not results:
        print("No results found")
        return None
    return pd.json_normalize(results)
The get_smarter_breeds and get_smarter_datasets functions can be used to return
all the SMARTER datasets and breeds. However, you can pass additional parameters to
the endpoints using the params parameter (which can be a dictionary, or a list
of tuples when the same parameter has to be specified multiple times). For
example, you could retrieve all goat breeds using the species option:
goat_breeds = get_smarter_breeds(params={'species': 'Goat'})
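Both helper functions hand the list of records to pd.json_normalize, which flattens nested objects into columns with dotted names. The sketch below shows the effect on two made-up records; the field names are hypothetical and do not reflect the real SMARTER schema.

```python
import pandas as pd

# Hypothetical records: the field names are illustrative only
records = [
    {"name": "Breed A", "species": "Goat",
     "metadata": {"code": "AAA", "country": "Italy"}},
    {"name": "Breed B", "species": "Goat",
     "metadata": {"code": "BBB", "country": "France"}},
]

# One row per record; nested keys become 'metadata.code', 'metadata.country'
df = pd.json_normalize(records)
print(df.columns.tolist())
print(df[["name", "metadata.code"]])
```

Nested columns can then be selected like any other column, for example df["metadata.code"].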
Here’s another example of how to get the foreground genotypes from the Dataset
endpoint using the functions we defined before. Here we pass a list of tuples
to the function, since the parameter type is required for both terms and a
Python dict cannot hold the same key multiple times:
foreground_genotypes = get_smarter_datasets(
    params=[('type', 'genotypes'), ('type', 'foreground')])
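The difference between the two forms is easy to see by encoding them by hand. This sketch uses urlencode from the standard library only as an illustration; it does not go through requests, although requests is documented to accept both a list of tuples and a dict whose values are lists.

```python
from urllib.parse import urlencode

pairs = [('type', 'genotypes'), ('type', 'foreground')]

# A list of tuples keeps both occurrences of the 'type' parameter
query_from_pairs = urlencode(pairs)

# Converting to a dict keeps only the last value for each key
query_from_dict = urlencode(dict(pairs))

# A dict with a list value, expanded with doseq=True, also repeats the key
query_from_list_value = urlencode(
    {'type': ['genotypes', 'foreground']}, doseq=True)

print(query_from_pairs)       # type=genotypes&type=foreground
print(query_from_dict)        # type=foreground
print(query_from_list_value)  # type=genotypes&type=foreground
```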
For a full list of the parameters available for each endpoint,
check the API documentation at https://webserver.ibba.cnr.it/smarter-api/docs.
Let’s define another function that can query either the sheep or the goat samples
endpoint, depending on its species argument, and then run a simple query
filtering on the breed_code parameter:
def get_smarter_samples(species, params=None):
    # mind that species is lowercase in endpoint url
    species = species.lower()
    url = urljoin(base_url, f"smarter-api/samples/{species}")
    results = get_smarter_data(url, params)
    if not results:
        print("No results found")
        return None
    return pd.json_normalize(results)
goat_landrace = get_smarter_samples(
    "Goat", params={'breed_code': "LNR"})
We can refine the query by adding more parameters. For example, we
can get the goat samples which have both locations (GPS coordinates) and phenotypes
defined (note the double underscore in locations__exists and
phenotype__exists):
goat_landrace = get_smarter_samples(
    "Goat",
    params={
        'breed_code': "LNR",
        'locations__exists': True,
        'phenotype__exists': True}
)
From the resulting dataframe, we can extract the smarter_id and breed_code columns
to get a list of our samples, which can then be used to subset the full genotype file with plink:
samples = goat_landrace[['smarter_id', 'breed_code']]
samples.to_csv("samples.csv", index=False)
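If the list is meant for plink's --keep option, note that plink 1.9 reads a whitespace-delimited file with the family ID in the first column and the individual ID in the second. The sketch below writes such a file under the assumption that the genotype file uses breed_code as family ID and smarter_id as individual ID; check your .fam file before relying on it. The dataframe and the sample IDs are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the goat_landrace dataframe (made-up sample IDs)
goat_landrace = pd.DataFrame({
    "smarter_id": ["SAMPLE-0001", "SAMPLE-0002"],
    "breed_code": ["LNR", "LNR"],
})

# Assumption: FID = breed_code and IID = smarter_id in the genotype file
keep = goat_landrace[["breed_code", "smarter_id"]]

# Tab-separated, no header, no index: one "FID IID" pair per line
keep.to_csv("samples_keep.txt", sep="\t", header=False, index=False)
```

The resulting file could then be passed to plink with --keep samples_keep.txt.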
Here’s another example, this time retrieving information on variants. In this case we select the goat variants on chromosome 1 between positions 1 and 1000000 in the ARS1 assembly:
def get_smarter_variations(species, assembly, params=None):
    # mind that species is lowercase in endpoint url, while assembly is uppercase
    species = species.lower()
    assembly = assembly.upper()
    url = urljoin(base_url, f"smarter-api/variants/{species}/{assembly}")
    results = get_smarter_data(url, params)
    if not results:
        print("No results found")
        return None
    return pd.json_normalize(results)
selected_goat_variations = get_smarter_variations(
    species="Goat",
    assembly="ARS1",
    params={
        "size": 100,
        "region": "1:1-1000000"
    }
)
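When several regions have to be queried, the region string can be built programmatically. make_region below is a hypothetical helper, not part of the SMARTER code; it only reproduces the chromosome:start-end layout used in the query above.

```python
def make_region(chrom, start, end):
    """Build a 'chrom:start-end' string (hypothetical helper)."""
    if start > end:
        raise ValueError(f"start ({start}) is greater than end ({end})")
    return f"{chrom}:{start}-{end}"


# Same parameters as in the query above
params = {"size": 100, "region": make_region(1, 1, 1_000_000)}
print(params["region"])  # 1:1-1000000
```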
Hint
We are planning to simplify the variants response by returning only the list of the selected SNPs, so that it can be used directly when subsetting a genotype file with plink.
Warning
Be careful when using the variants endpoints: getting all the variants will take a lot of time and could fill all your available memory. Avoid requesting all variants in your Python session, unless you know what you are doing.
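One way to limit memory usage is to consume the results lazily and stop early, instead of accumulating every page in a list. The following is a sketch under stated assumptions: iter_items is a hypothetical generator variant of get_smarter_data, fake_fetch simulates an endpoint that never runs out of pages, and the page query parameter in the next link is made up.

```python
from itertools import islice
from urllib.parse import urljoin

base_url = "https://webserver.ibba.cnr.it"
calls = []  # records every simulated request


def fake_fetch(url):
    """Simulate an endless paginated endpoint returning two items per page."""
    calls.append(url)
    page = int(url.rsplit("=", 1)[1]) if "page=" in url else 1
    return {
        "items": [f"variant-{page}-{i}" for i in range(2)],
        "next": f"/smarter-api/variants/goat/ARS1?page={page + 1}",
    }


def iter_items(first_url, fetch):
    """Yield items one by one, requesting a new page only when needed."""
    url = first_url
    while url:
        parsed = fetch(url)
        yield from parsed["items"]
        url = urljoin(base_url, parsed["next"]) if parsed["next"] else None


first_url = urljoin(base_url, "smarter-api/variants/goat/ARS1")

# Take only the first five items: the third page is the last one requested
first_five = list(islice(iter_items(first_url, fake_fetch), 5))
print(len(first_five), len(calls))  # 5 3
```

With the real API, fetch could be a small wrapper around read_url; the cap passed to islice then bounds both the number of requests and the memory used.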