Accessing data using R ====================== .. toctree:: :maxdepth: 4 .. hint:: We have a *R* package that can be used to interact with SMARTER-backend API, see `SMARTER R package `_ for more information. Here are some examples on how to interact with SMARTER-backend API using ``R``. You can find a similar example for ``Python`` in the :ref:`Accessing data using Python` section. Importing R libraries --------------------- First of all, let's start with importing some ``R`` libraries (maybe you will need to install some of them first): .. code-block:: r library(httr) library(jsonlite) library(dplyr) ``httr`` is required to send requests and get response from SMARTER-backend API; ``jsonlite`` is required to parse ``JSON`` output, which is the default format of the API response. ``dplyr`` is useful to manage dataframes, for examples when they have different columns (like response from SMARTER-backend) Deal with data and pagination in R ---------------------------------- Next, before starting query SMARTER-backend, we can define more utility functions (as suggested by `Best practices for API packages `_) in order to deal with pagination and API errors. We will read our data with ``jsonlite`` package in order to **flatten** our results (read nested object and add them as columns in the resulting dataframe): .. code-block:: r base_url <- "https://webserver.ibba.cnr.it" read_url <- function(url, query = list()) { # make a GET request to the API by combining parameters (if any) resp <- httr::GET(url, query = query) # check errors: SMARTER-backend is supposed to return JSON objects if (http_type(resp) != "application/json") { stop("API did not return json", call. = FALSE) } # parse a JSON response. fromJSON to flatten results parsed <- jsonlite::fromJSON( content(resp, "text", encoding = "utf-8"), flatten = TRUE ) # deal with API errors: not "200 Ok" status if (httr::http_error(resp)) { stop( sprintf( "SMARTER API returned an error [%s]: '%s'", status_code(resp), parsed$message ), call. = FALSE ) } return(parsed) } get_smarter_data <- function(url, query = list()) { # do the request and parse data with our function parsed <- read_url(url, query) # track results in df results <- parsed$items # check for pagination while (!is.null(parsed$`next`)) { # append next value to base url next_url <- httr::modify_url(base_url, path = parsed$`next`) # query arguments are already in url: get next page parsed <- read_url(next_url) # append new results to df. Deal with different columns results <- dplyr::bind_rows(results, parsed$items) } # return an S3 obj with the data we got structure(list( content = parsed, url = url, results = results ), class = "smarter_api") } ``base_url`` is defined for simplicity in order to make all our request to the same server. Our functions will take an ``url`` parameter, which will be our API endpoint, and a ``query`` parameter, which will be a list of parameters that will enhance our queries as described in :ref:`Query parameters` Read data with R ---------------- Next we can try to read data from our API by defining custom functions around the desired endpoint. This function will call the functions previously defined and will return all the results in a *dataframe*. Here's a sample function to deal with datasets objects by querying the *datasets* endpoint: .. code-block:: r get_smarter_datasets <- function(query=list()) { url <- httr::modify_url(base_url, path = "/smarter-api/datasets") data <- get_smarter_data(url, query) # returning only the results dataframe data$results } all_datasets <- get_smarter_datasets() By calling the defined ``get_smarter_datasets`` function you will retrieve all datasets and you will store them in the ``all_datasets`` dataframe. Similarly, to deal with the *Breed* endpoint you could define the ``get_smarter_breeds`` function: .. code-block:: r get_smarter_breeds <- function(query = list()) { # setting the URL endpoint url <- httr::modify_url(base_url, path = "/smarter-api/breeds") # reading our data data <- get_smarter_data(url, query) # returning only the results dataframe data$results } goat_breeds <- get_smarter_breeds(query = list(species = "Goat")) ``get_smarter_breeds`` and ``get_smarter_datasets`` functions can be used to return all the SMARTER *datasets* and *breeds*. However you can pass additional parameters to the endpoint using the ``query`` parameter (which need to be a ``list``). For example, you could retrieve all the *genotypes* datasets using the ``type`` parameter: .. code-block:: r genotypes_datasets <- get_smarter_datasets(query = list(type="genotypes")) Since query accepts ``list``, you can specify the same parameter multiple times (if the endpoints supports this type of query, see `api docs `_ to get more information). For example, if you need only the *foreground genotypes*, you can select dataset like this: .. code-block:: r foreground_genotypes_datasets <- get_smarter_datasets( query = list(type="genotypes", type="foreground")) You can add other parameters to refine your query, for example if you want to select only the *Goat* breeds, you can specify ``species = "Goat"`` in the ``query`` parameter. If you need also to search for the *land* term in the *breed* name, you will call the same function by adding a new parameter: .. code-block:: r search_goat_breeds <- get_smarter_breeds(query = list( species = "Goat", search = "land") ) ``search_goat_breeds`` will be a dataframe with the same results of the query URL:: https://webserver.ibba.cnr.it/smarter-api/breeds?species=Goat&search=land We can select only the column we need by subsetting dataframe columns, or using ``dplyr`` `select `_: .. code-block:: r search_goat_breeds <- search_goat_breeds %>% select(name, code) Breed code and names can be used to get from samples from the proper endpoint. Let's define another function that could be used for sheep and goat samples endpoints relying on parameters: .. code-block:: r get_smarter_samples <- function(species, query = list()) { # mind that species is lowercase in endpoint url species <- tolower(species) url <- modify_url(base_url, path = sprintf("/smarter-api/samples/%s", species)) data <- get_smarter_data(url, query) # returning only the results dataframe data$results } landrace_samples <- get_smarter_samples( species = "Goat", query = list(breed_code = "LNR") ) We can refine our query, for example by selecting Landrace goat samples which have a locations (GPS coordinates) and phenotypes defined (mind to the double ``_`` in ``locations__exists`` and ``phenotype__exists``): .. code-block:: r selected_landrace_samples <- get_smarter_samples( species = "Goat", query = list( breed_code = "LNR", locations__exists = TRUE, phenotype__exists = TRUE) ) As before we can select the ``smarter_id`` and ``breed_code`` columns, to have a list of our samples in order to subset the full genotype file using ``plink``: .. code-block:: r selected_landrace_samples %>% select(smarter_id, breed_code) Here's another example that could be applied in order to get information on variants. In this case we will select the goat variants on chromosome *1* within *1-1000000* positions in *ARS1* assembly: .. code-block:: r get_smarter_variations <- function(species, assembly, query = list()) { # mind that species is lowercase in endpoint url, while assembly is uppercase species <- tolower(species) assembly <- toupper(assembly) url <- modify_url(base_url, path = sprintf("/smarter-api/variants/%s/%s", species, assembly)) data <- get_smarter_data(url, query) # returning only the results dataframe data$results } selected_goat_variations <- get_smarter_variations( species = "Goat", assembly = "ARS1", query = list( size = 100, region = "1:1-1000000" ) ) .. hint:: We are planning to simplify the variants response by returning a SNP list of the selected SNPs only, in order to be used when subsetting a genotype file using plink .. warning:: Be careful when using the variants endpoints: getting all the variants will takes a lot of time and could fill all your available memory. Avoid to request all variants in your R session, unless you know what you are doing