Accessing data using Python

Here are some examples of how to access data using Python. You can find a similar example for R in the Accessing data using R section.

Importing packages

First, import the packages we need: requests to download data from the API, urljoin from urllib.parse to build URLs, and pandas to turn the results into a DataFrame.

import json
import requests

import pandas as pd
from urllib.parse import urljoin
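
As a quick illustration of what urljoin does with the base URL used in this page, here is a minimal sketch. The relative path matches the endpoints used below, while the absolute path with a query string is only an assumed shape for a pagination link, not something taken from the API documentation.

```python
from urllib.parse import urljoin

base_url = "https://webserver.ibba.cnr.it"

# a relative path is appended to the base url
datasets_url = urljoin(base_url, "smarter-api/datasets")

# an absolute path (hypothetical pagination link) replaces the path
# of the base url and keeps its own query string
next_url = urljoin(base_url, "/smarter-api/datasets?page=2&size=10")

print(datasets_url)
print(next_url)
```

In both cases the scheme and host come from base_url, so the same call works for endpoint paths and for links returned by the API.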

Handling data and pagination in Python

We can define some utility functions to handle data and pagination. These functions are general and can be used with any endpoint of the SMARTER API:

base_url = "https://webserver.ibba.cnr.it"
session = requests.Session()


def read_url(session, url, params=None):
   response = session.get(url, params=params)

   # check errors: SMARTER-backend is supposed to return JSON objects
   # (the header may carry a suffix such as "; charset=utf-8")
   content_type = response.headers.get('Content-Type', '')
   if not content_type.startswith('application/json'):
      raise Exception("API did not return json")

   # parse json data
   parsed = response.json()

   # check for errors
   if response.status_code != 200:
      raise Exception(
         f"SMARTER API returned an error [{response.status_code}]: "
         f"'{parsed['message']}'")

   return parsed


def get_smarter_data(url, params=None, session=session):
   # do the request and parse data with our function
   parsed = read_url(session, url, params)

   # track results
   results = parsed["items"]

   # check for pagination
   while parsed["next"]:
      # append next value to base url
      url = urljoin(base_url, parsed["next"])

      # query arguments are already in url: get next page
      parsed = read_url(session, url)

      # append new results to results list
      results += parsed["items"]

   return results

base_url is the base URL of the SMARTER API. The session is shared across requests, so cookies and headers are kept and the underlying connection can be reused. read_url performs a request, checks that the response is JSON, and raises an exception if the API reports an error. get_smarter_data calls read_url and keeps following the next link until the items of all pages have been collected into a single list.
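
The pagination loop described above can also be written as a generator, which yields items as pages arrive instead of building the whole list first. The following is only a sketch under stated assumptions: iter_items and fetch_page are hypothetical names, and the fake pages simply mimic the "items" and "next" keys used by get_smarter_data so that the example runs offline.

```python
def iter_items(fetch_page, first_url):
    """Yield items one by one, following "next" links lazily.

    fetch_page is any callable that takes a url and returns the parsed
    page as a dict with "items" and "next" keys (hypothetical helper,
    for example a small wrapper around read_url).
    """
    url = first_url
    while url:
        page = fetch_page(url)
        yield from page["items"]
        # a missing or empty "next" ends the loop
        url = page.get("next")


# offline demo: two fake pages stand in for real API responses
fake_pages = {
    "/datasets?page=1": {"items": [{"id": 1}, {"id": 2}],
                         "next": "/datasets?page=2"},
    "/datasets?page=2": {"items": [{"id": 3}], "next": None},
}

items = list(iter_items(fake_pages.__getitem__, "/datasets?page=1"))
print(items)
```

With the real API, fetch_page could be something like lambda url: read_url(session, urljoin(base_url, url)). The next page is requested only when the consumer asks for more items.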

Read data with Python

Now we can read data from the API using the functions defined above. To call an endpoint with different parameters and get the results back as a pandas.DataFrame, we wrap the endpoint in a small function:

def get_smarter_datasets(params=None):
   url = urljoin(base_url, "smarter-api/datasets")
   results = get_smarter_data(url, params)
   if not results:
      print("No results found")
      return None
   return pd.json_normalize(results)

Calling get_smarter_datasets collects all the dataset objects from the Dataset endpoint and returns them as a pandas.DataFrame. Similarly, we can define a function to get data from the Breed endpoint:

def get_smarter_breeds(params=None):
   url = urljoin(base_url, "smarter-api/breeds")
   results = get_smarter_data(url, params)
   if not results:
      print("No results found")
      return None
   return pd.json_normalize(results)

By default, the get_smarter_breeds and get_smarter_datasets functions return all the SMARTER breeds and datasets. However, you can pass additional parameters to the endpoints through the params argument, which can be a dictionary or, when the same parameter has to be specified more than once, a list of tuples. For example, you can retrieve all goat breeds using the species option:

goat_breeds = get_smarter_breeds(params={'species': 'Goat'})

Here’s another example, which gets the foreground genotypes from the Dataset endpoint using the same functions. This time we pass a list of tuples, since the type parameter is needed for both terms and a Python dict cannot hold the same key more than once:

foreground_genotypes = get_smarter_datasets(
   params=[('type', 'genotypes'), ('type', 'foreground')])
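
To see why a list of tuples is needed, the sketch below encodes both forms with urlencode from the standard library. This is only a stand-in to show the resulting query strings: requests builds the query string itself, so the snippet illustrates the idea and is not what requests runs internally.

```python
from urllib.parse import urlencode

# a dict cannot hold the same key twice: the last value wins
as_dict = {'type': 'genotypes', 'type': 'foreground'}

# a list of tuples keeps both values, in order
as_tuples = [('type', 'genotypes'), ('type', 'foreground')]

dict_query = urlencode(as_dict)
tuples_query = urlencode(as_tuples)

print(dict_query)    # type=foreground
print(tuples_query)  # type=genotypes&type=foreground
```

With the dict, the genotypes term is lost before any request is made, so the query would match on foreground only.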

For a full list of the parameters available for each endpoint, check the API documentation at https://webserver.ibba.cnr.it/smarter-api/docs. Let’s define another function that works for both the sheep and goat samples endpoints, and then run a simple query for a given species and breed_code:

def get_smarter_samples(species, params=None):
   # mind that species is lowercase in endpoint url
   species = species.lower()
   url = urljoin(base_url, f"smarter-api/samples/{species}")
   results = get_smarter_data(url, params)
   if not results:
      print("No results found")
      return None
   return pd.json_normalize(results)

goat_landrace = get_smarter_samples(
   "Goat", params={'breed_code': "LNR"})

We can refine the query by adding more parameters. For example, we can select the goat samples that have both locations (GPS coordinates) and phenotypes defined (note the double underscore in locations__exists and phenotype__exists):

goat_landrace = get_smarter_samples(
   "Goat",
   params={
      'breed_code': "LNR",
      'locations__exists': True,
      'phenotype__exists': True}
)

From the resulting DataFrame we can extract the smarter_id and breed_code columns to get a list of our samples, which can then be used to subset the full genotype file with plink:

samples = goat_landrace[['smarter_id', 'breed_code']]
samples.to_csv("samples.csv", index=False)
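
plink's --keep option expects a whitespace-delimited file without a header, with the family ID (FID) in the first column and the individual ID (IID) in the second. A minimal sketch of writing such a file follows. It assumes that the genotype file uses breed_code as FID and smarter_id as IID, which should be checked against your .fam file. The write_keep_file helper and the sample IDs are made up for illustration.

```python
import csv
import os
import tempfile


def write_keep_file(records, path):
    """Write a two-column, tab-delimited file with no header.

    Assumption: FID is breed_code and IID is smarter_id.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for record in records:
            writer.writerow([record["breed_code"], record["smarter_id"]])


# made-up records standing in for goat_landrace.to_dict("records")
records = [
    {"smarter_id": "SAMPLE-0001", "breed_code": "LNR"},
    {"smarter_id": "SAMPLE-0002", "breed_code": "LNR"},
]

keep_path = os.path.join(tempfile.mkdtemp(), "samples_keep.txt")
write_keep_file(records, keep_path)
```

The resulting file could then be passed to plink with --keep, together with the options that select your genotype file.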

Here’s another example, this time to get information on variants. In this case we select the goat variants on chromosome 1 between positions 1 and 1000000 in the ARS1 assembly:

def get_smarter_variations(species, assembly, params=None):
   # mind that species is lowercase in endpoint url, while assembly is uppercase
   species = species.lower()
   assembly = assembly.upper()

   url = urljoin(base_url, f"smarter-api/variants/{species}/{assembly}")
   results = get_smarter_data(url, params)
   if not results:
      print("No results found")
      return None
   return pd.json_normalize(results)

selected_goat_variations = get_smarter_variations(
   species="Goat",
   assembly="ARS1",
   params={
      "size": 100,
      "region": "1:1-1000000"
   }
)
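
The region value follows a chromosome:start-end pattern. If you build many such queries, a small helper can assemble and sanity-check the string. This is a sketch only: make_region is a hypothetical name, and the check that positions start at 1 is an assumption based on the example above, not on the API documentation.

```python
def make_region(chrom, start, end):
    """Build a region string such as "1:1-1000000" (hypothetical helper)."""
    start, end = int(start), int(end)
    # assumption: positions are 1-based and start must not exceed end
    if start < 1 or end < start:
        raise ValueError(f"invalid interval: {start}-{end}")
    return f"{chrom}:{start}-{end}"


params = {"size": 100, "region": make_region(1, 1, 1_000_000)}
print(params["region"])  # 1:1-1000000
```

The params dictionary can then be passed to get_smarter_variations as in the call above.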

Hint

We are planning to simplify the variants response so that it returns only the list of selected SNPs, which could then be used to subset a genotype file with plink.

Warning

Be careful when using the variants endpoints: retrieving all the variants takes a long time and could fill all your available memory. Avoid requesting all variants in your Python session unless you know what you are doing.
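
One way to limit memory use is to write each page to disk as it arrives and to stop after a fixed number of pages. The sketch below shows the idea with a fake, endless endpoint so that it runs offline. dump_pages, fetch_page, and max_pages are hypothetical names, and the page layout ("items" and "next" keys) is assumed to match the one handled by get_smarter_data.

```python
import json
import os
import tempfile


def dump_pages(fetch_page, first_url, path, max_pages=10):
    """Write items to a JSON Lines file page by page.

    fetch_page is any callable taking a url and returning a dict with
    "items" and "next" keys (hypothetical, e.g. built on read_url).
    Stops after max_pages even if more pages are available.
    Returns the number of items written.
    """
    written = 0
    url = first_url
    with open(path, "w") as handle:
        for _ in range(max_pages):
            if not url:
                break
            page = fetch_page(url)
            for item in page["items"]:
                handle.write(json.dumps(item) + "\n")
                written += 1
            url = page.get("next")
    return written


# offline demo: a fake endpoint that never runs out of pages
def fake_fetch(url):
    page = int(url.rsplit("=", 1)[1])
    items = [{"page": page, "n": n} for n in range(2)]
    return {"items": items, "next": f"/variants?page={page + 1}"}


out_path = os.path.join(tempfile.mkdtemp(), "variants.jsonl")
count = dump_pages(fake_fetch, "/variants?page=1", out_path, max_pages=3)
print(count)  # 6
```

Only one page is held in memory at a time, and the max_pages cap guards against accidentally downloading a whole assembly.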