Accessing data using Python
===========================

.. toctree::
   :maxdepth: 4

Here are some examples of how to access data using Python. You can find a
similar example for ``R`` in the :ref:`Accessing data using R` section.

Importing packages
------------------

First we need to import the packages we will be using. We will use
``requests`` to download the data from the internet and ``urljoin`` from
``urllib.parse`` to build URLs. Finally, we will transform the data into a
``pandas`` DataFrame.

.. code-block:: python

   import json
   import requests
   import pandas as pd

   from urllib.parse import urljoin

Deal with data and pagination in Python
---------------------------------------

We can define some utility functions to deal with data and pagination. These
functions are general and can be used to access any endpoint of the SMARTER
API:

.. code-block:: python

   base_url = "https://webserver.ibba.cnr.it"
   session = requests.Session()

   def read_url(session, url, params=None):
       """Request a single page and return the parsed JSON content."""
       response = session.get(url, params=params)

       # check errors: SMARTER-backend is supposed to return JSON objects
       if response.headers['Content-Type'] != 'application/json':
           raise Exception("API did not return json")

       # parse json data
       parsed = response.json()

       # check for errors
       if response.status_code != 200:
           raise Exception(
               f"SMARTER API returned an error [{response.status_code}]: "
               f"'{parsed['message']}'")

       return parsed

   def get_smarter_data(url, params=None, session=session):
       """Collect the items of all pages returned by an endpoint."""
       # do the request and parse data with our function
       parsed = read_url(session, url, params)

       # track results
       results = parsed["items"]

       # check for pagination
       while parsed["next"]:
           # append next value to base url
           url = urljoin(base_url, parsed["next"])

           # query arguments are already in url: get next page
           parsed = read_url(session, url)

           # append new results to results list
           results += parsed["items"]

       return results

``base_url`` is the base URL of the SMARTER API. We define a ``session`` to
keep track of the cookies and headers of the requests.
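
The pagination loop relies on ``urljoin`` to turn the ``next`` value of a page
into a full URL. The following is a minimal sketch of that step only: the
``next`` values used here are hypothetical, since the exact format of the links
returned by the API is an assumption.

.. code-block:: python

   from urllib.parse import urljoin

   base_url = "https://webserver.ibba.cnr.it"

   # hypothetical "next" link, written as a path plus query string
   next_link = "/smarter-api/datasets?page=2&size=10"

   # the path and the query string are appended to the base URL
   next_url = urljoin(base_url, next_link)
   print(next_url)
   # https://webserver.ibba.cnr.it/smarter-api/datasets?page=2&size=10

   # if the link is already a full URL, urljoin returns it unchanged
   absolute = urljoin(base_url, next_url)
   print(absolute == next_url)
   # True

Since the query arguments travel inside the ``next`` link, the following pages
can be requested without passing ``params`` again.
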
We define a function ``read_url`` that parses the response of the API and
checks for errors, and a function ``get_smarter_data`` that collects the data
from the API, following pagination until the last page.

Read data with Python
---------------------

Now we can read data from the API using the functions we defined before. To
call the API with different parameters and transform the results into a
``pandas.DataFrame``, we create a custom function:

.. code-block:: python

   def get_smarter_datasets(params=None):
       url = urljoin(base_url, "smarter-api/datasets")
       results = get_smarter_data(url, params)

       if not results:
           print("No results found")
           return None

       return pd.json_normalize(results)

Calling ``get_smarter_datasets`` collects all the *dataset* objects from the
*Dataset* endpoint and returns them as a ``pandas.DataFrame``. Similarly, we
can define a function to get the data from the *Breed* endpoint:

.. code-block:: python

   def get_smarter_breeds(params=None):
       url = urljoin(base_url, "smarter-api/breeds")
       results = get_smarter_data(url, params)

       if not results:
           print("No results found")
           return None

       return pd.json_normalize(results)

Called without arguments, ``get_smarter_breeds`` and ``get_smarter_datasets``
return all the SMARTER *breeds* and *datasets*. However, you can pass
additional parameters to the endpoints using the ``params`` argument, which can
be a dictionary or a list of tuples (the latter is needed when the same
parameter is specified multiple times). For example, you could retrieve all
goat breeds using the ``species`` option:

.. code-block:: python

   goat_breeds = get_smarter_breeds(params={'species': 'Goat'})

Here's another example, which gets the *foreground genotypes* from the
*Dataset* endpoint using the functions we defined before. This time we pass a
list of tuples, since the ``type`` parameter is required for both terms and a
Python *dict* cannot hold the same key multiple times:

.. code-block:: python

   foreground_genotypes = get_smarter_datasets(
       params=[('type', 'genotypes'), ('type', 'foreground')])

For a full list of the parameters available for each endpoint, check the
SMARTER API documentation.

Let's define another function that can query both the sheep and goat samples
endpoints, and then run a simple query using the ``species`` and
``breed_code`` parameters:

.. code-block:: python

   def get_smarter_samples(species, params=None):
       # mind that species is lowercase in endpoint url
       species = species.lower()

       url = urljoin(base_url, f"smarter-api/samples/{species}")
       results = get_smarter_data(url, params)

       if not results:
           print("No results found")
           return None

       return pd.json_normalize(results)

   goat_landrace = get_smarter_samples(
       "Goat", params={'breed_code': "LNR"})

We can refine the query by adding more parameters. For example, we can get
the goat samples which have both locations (GPS coordinates) and phenotypes
defined (note the double ``_`` in ``locations__exists`` and
``phenotype__exists``):

.. code-block:: python

   goat_landrace = get_smarter_samples(
       "Goat",
       params={
           'breed_code': "LNR",
           'locations__exists': True,
           'phenotype__exists': True}
   )

From the resulting dataframe, we can extract the ``smarter_id`` and
``breed_code`` columns to get a list of our samples, which can be used to
subset the full genotype file using ``plink``:

.. code-block:: python

   samples = goat_landrace[['smarter_id', 'breed_code']]
   samples.to_csv("samples.csv", index=False)

Here's another example that can be applied to get information on variants. In
this case we select the goat variants on chromosome *1* within positions
*1-1000000* in the *ARS1* assembly:

.. code-block:: python

   def get_smarter_variations(species, assembly, params=None):
       # mind that species is lowercase in endpoint url,
       # while assembly is uppercase
       species = species.lower()
       assembly = assembly.upper()

       url = urljoin(base_url, f"smarter-api/variants/{species}/{assembly}")
       results = get_smarter_data(url, params)

       if not results:
           print("No results found")
           return None

       return pd.json_normalize(results)

   selected_goat_variations = get_smarter_variations(
       species="Goat",
       assembly="ARS1",
       params={
           "size": 100,
           "region": "1:1-1000000"
       }
   )

.. hint::

   We are planning to simplify the variants response by returning only the
   list of the selected SNPs, so that it can be used when subsetting a
   genotype file using ``plink``.

.. warning::

   Be careful when using the variants endpoints: getting all the variants
   will take a lot of time and could fill all your available memory. Avoid
   requesting all variants in your Python session, unless you know what you
   are doing.
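
One possible way to limit memory usage when exploring the variants endpoints
is to read pages lazily and stop as soon as enough items have been collected.
The following is a minimal sketch of this idea, not part of the SMARTER API
itself: ``iter_smarter_items``, the ``fetch`` callable and the ``name`` field
are hypothetical names introduced for illustration, and the two pages are
written by hand so that the example runs offline.

.. code-block:: python

   from itertools import islice
   from urllib.parse import urljoin

   base_url = "https://webserver.ibba.cnr.it"

   def iter_smarter_items(fetch, url, params=None):
       """Yield items page by page instead of building one big list.

       ``fetch`` is a hypothetical callable ``fetch(url, params)`` returning
       a parsed page, i.e. a dict with ``items`` and ``next`` keys.
       """
       parsed = fetch(url, params)

       while True:
           yield from parsed["items"]

           if not parsed["next"]:
               break

           # query arguments are already in the "next" link
           parsed = fetch(urljoin(base_url, parsed["next"]), None)

   # offline demonstration: two hand-written pages stand in for the API
   start_url = urljoin(base_url, "smarter-api/variants/goat/ARS1")
   fake_pages = {
       start_url: {
           "items": [{"name": "snp1"}, {"name": "snp2"}],
           "next": "/smarter-api/variants/goat/ARS1?page=2",
       },
       start_url + "?page=2": {
           "items": [{"name": "snp3"}],
           "next": None,
       },
   }
   requested = []

   def fake_fetch(url, params=None):
       # record every request so we can see how many pages were read
       requested.append(url)
       return fake_pages[url]

   # take only the first two items: the second page is never requested
   first_items = list(islice(iter_smarter_items(fake_fetch, start_url), 2))
   print(first_items)
   print(len(requested))

With the functions defined in this page, ``fetch`` could be a small wrapper
such as ``lambda url, params: read_url(session, url, params)``, so that each
page is downloaded only when the loop actually needs it.
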