Accessing data using R
Hint
We have an R package that can be used to interact with the SMARTER-backend API; see SMARTER R package for more information.
Here are some examples of how to interact with the SMARTER-backend API using R.
You can find a similar example for Python in the Accessing data using Python section.
Importing R libraries
First of all, let’s import some R libraries (you may need
to install some of them first):
library(httr)
library(jsonlite)
library(dplyr)
httr is required to send requests to the SMARTER-backend API and get its responses;
jsonlite is required to parse JSON output, which is the default format
of the API response. dplyr is useful to manage
dataframes, for example when they have different columns (like the responses from
SMARTER-backend).
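As a minimal sketch of why dplyr helps with differing columns, the example below binds two small dataframes that share only some of their columns. The page contents are hypothetical and are not real API output.

```r
library(dplyr)

# Two hypothetical result pages whose records do not share all fields
page_1 <- data.frame(name = "Breed A", code = "AAA")
page_2 <- data.frame(name = "Breed B", country = "Italy")

# bind_rows() matches columns by name and fills the gaps with NA
combined <- bind_rows(page_1, page_2)
print(combined)
```

Base R rbind() would stop with an error here because the column names differ, which is why bind_rows() is the more convenient choice when appending pages.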
Deal with data and pagination in R
Next, before querying SMARTER-backend, we can define some utility functions
(as suggested by Best practices for API packages)
to deal with pagination and API errors. We will read our data with the
jsonlite package in order to flatten our results (nested objects are read and
added as columns in the resulting dataframe):
base_url <- "https://webserver.ibba.cnr.it"

read_url <- function(url, query = list()) {
  # make a GET request to the API by combining parameters (if any)
  resp <- httr::GET(url, query = query)

  # check errors: SMARTER-backend is supposed to return JSON objects
  if (httr::http_type(resp) != "application/json") {
    stop("API did not return json", call. = FALSE)
  }

  # parse a JSON response. fromJSON to flatten results
  parsed <- jsonlite::fromJSON(
    httr::content(resp, "text", encoding = "utf-8"),
    flatten = TRUE
  )

  # deal with API errors: not "200 Ok" status
  if (httr::http_error(resp)) {
    stop(
      sprintf(
        "SMARTER API returned an error [%s]: '%s'",
        httr::status_code(resp),
        parsed$message
      ),
      call. = FALSE
    )
  }

  return(parsed)
}

get_smarter_data <- function(url, query = list()) {
  # do the request and parse data with our function
  parsed <- read_url(url, query)

  # track results in df
  results <- parsed$items

  # check for pagination
  while (!is.null(parsed$`next`)) {
    # append next value to base url
    next_url <- httr::modify_url(base_url, path = parsed$`next`)

    # query arguments are already in url: get next page
    parsed <- read_url(next_url)

    # append new results to df. Deal with different columns
    results <- dplyr::bind_rows(results, parsed$items)
  }

  # return an S3 obj with the data we got
  structure(
    list(
      content = parsed,
      url = url,
      results = results
    ),
    class = "smarter_api"
  )
}
base_url is defined for simplicity, so that all our requests go to the
same server.
Our functions take a url parameter, which is the API endpoint,
and a query parameter, which is a list of parameters that refine our queries
as described in Query parameters.
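To illustrate what flatten = TRUE does in read_url, here is a small offline sketch that parses a JSON string instead of a live response. The payload and its field names (name, location, lat, lon) are assumptions made for illustration and do not reproduce the real API schema.

```r
library(jsonlite)

# Hypothetical payload shaped like a paginated response: an "items" array
# whose records contain a nested object, plus a "next" link set to null
payload <- '{
  "items": [
    {"name": "sample-1", "location": {"lat": 45.1, "lon": 9.2}},
    {"name": "sample-2", "location": {"lat": 46.3, "lon": 8.7}}
  ],
  "next": null
}'

nested <- fromJSON(payload)
flat <- fromJSON(payload, flatten = TRUE)

# Without flatten, "location" is a data frame stored inside a column;
# with flatten = TRUE its fields become separate top-level columns
print(names(nested$items))
print(names(flat$items))

# A missing or null "next" evaluates to NULL, which ends the pagination loop
print(is.null(flat$`next`))
```

Flattened columns are easier to filter and select with dplyr, at the cost of longer column names.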
Read data with R
Next, we can read data from the API by defining custom functions around the desired endpoints. Each of these functions calls the functions defined above and returns all the results in a dataframe. Here’s a sample function that deals with dataset objects by querying the datasets endpoint:
get_smarter_datasets <- function(query = list()) {
  url <- httr::modify_url(base_url, path = "/smarter-api/datasets")

  data <- get_smarter_data(url, query)

  # returning only the results dataframe
  data$results
}

all_datasets <- get_smarter_datasets()
By calling the defined get_smarter_datasets function you will retrieve all
datasets and you will store them in the all_datasets dataframe. Similarly,
to deal with the Breed endpoint you could define the get_smarter_breeds function:
get_smarter_breeds <- function(query = list()) {
  # setting the URL endpoint
  url <- httr::modify_url(base_url, path = "/smarter-api/breeds")

  # reading our data
  data <- get_smarter_data(url, query)

  # returning only the results dataframe
  data$results
}

goat_breeds <- get_smarter_breeds(query = list(species = "Goat"))
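Because read_url reports problems through stop(), a failed request interrupts the calling script. One possible way to handle this, sketched below, is to wrap the call in tryCatch. Both failing_request and try_request are hypothetical names introduced here, and the error text is only an example in the format used by read_url.

```r
# Hypothetical stand-in for a request helper that fails like read_url() does
failing_request <- function() {
  stop("SMARTER API returned an error [404]: 'Not found'", call. = FALSE)
}

# Wrap a call so that an error yields NULL plus a message instead of
# stopping the whole script
try_request <- function(request) {
  tryCatch(
    request(),
    error = function(e) {
      message("Request failed: ", conditionMessage(e))
      NULL
    }
  )
}

result <- try_request(failing_request)
print(is.null(result))
```

With the functions defined in this section, the same wrapper could be used as try_request(function() get_smarter_breeds(query = list(species = "Goat"))).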
The get_smarter_breeds and get_smarter_datasets functions can be used to return
all the SMARTER datasets and breeds. However, you can pass additional parameters to
the endpoint using the query parameter (which needs to be a list). For
example, you could retrieve all the genotypes datasets using the type parameter:
genotypes_datasets <- get_smarter_datasets(query = list(type="genotypes"))
Since query accepts a list, you can specify the same parameter multiple times
(if the endpoint supports this type of query; see the api docs
for more information). For example, if you need only the foreground genotypes,
you can select datasets like this:
foreground_genotypes_datasets <- get_smarter_datasets(
  query = list(type = "genotypes", type = "foreground")
)
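To see how a list with a repeated name ends up in the request, the sketch below builds the URL with httr::modify_url without sending anything. It assumes that httr::GET composes the query string in the same way, which is worth checking against the URL of an actual response.

```r
library(httr)

base_url <- "https://webserver.ibba.cnr.it"

# Build (without sending) the URL for a query that repeats the "type" key
url <- modify_url(
  base_url,
  path = "/smarter-api/datasets",
  query = list(type = "genotypes", type = "foreground")
)
print(url)
```

Each list element becomes its own key=value pair, so the repeated key appears twice in the query string rather than being overwritten.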
You can add other parameters to refine your query. For example,
if you want to select only the Goat breeds, you can specify
species = "Goat" in the query parameter. If you also need to search
for the term land in the breed name, call the same function
with an additional parameter:
search_goat_breeds <- get_smarter_breeds(
  query = list(species = "Goat", search = "land")
)
search_goat_breeds will be a dataframe with the same results as the query URL:
https://webserver.ibba.cnr.it/smarter-api/breeds?species=Goat&search=land
We can select only the columns we need by subsetting the dataframe columns, or by using
dplyr select:
search_goat_breeds <- search_goat_breeds %>% select(name, code)
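For comparison, here is a minimal sketch of the base R subsetting mentioned above next to the dplyr form. The breeds dataframe is a hypothetical stand-in for the API result, with made-up values.

```r
# Hypothetical stand-in for the dataframe returned by get_smarter_breeds()
breeds <- data.frame(
  name = c("Breed A", "Breed B"),
  code = c("AAA", "BBB"),
  species = c("Goat", "Goat")
)

# Base R: select columns by name
by_name <- breeds[, c("name", "code")]

# dplyr equivalent, namespaced so that library(dplyr) is not required
by_select <- dplyr::select(breeds, name, code)

print(names(by_name))
print(names(by_select))
```

Both forms keep the same rows and return only the requested columns.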
Breed codes and names can be used to get samples from the proper endpoint. Let’s define another function that can be used for both the sheep and goat samples endpoints, depending on its parameters:
get_smarter_samples <- function(species, query = list()) {
  # mind that species is lowercase in endpoint url
  species <- tolower(species)

  url <- httr::modify_url(
    base_url,
    path = sprintf("/smarter-api/samples/%s", species)
  )

  data <- get_smarter_data(url, query)

  # returning only the results dataframe
  data$results
}

landrace_samples <- get_smarter_samples(
  species = "Goat",
  query = list(breed_code = "LNR")
)
We can refine our query, for example by selecting
Landrace goat samples which have locations (GPS coordinates) and phenotypes
defined (mind the double underscore in locations__exists and
phenotype__exists):
selected_landrace_samples <- get_smarter_samples(
  species = "Goat",
  query = list(
    breed_code = "LNR",
    locations__exists = TRUE,
    phenotype__exists = TRUE
  )
)
As before, we can select the smarter_id and breed_code columns
to get a list of our samples, which can be used to subset the full genotype file using plink:
selected_landrace_samples %>% select(smarter_id, breed_code)
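One possible way to hand this list to plink is to write it to a text file for the --keep option, which expects a family ID and an individual ID on each line. The sketch below assumes that the genotype file uses breed_code as family ID and smarter_id as individual ID; verify this against your own .fam or .ped file. The samples dataframe and its IDs are made up.

```r
# Hypothetical stand-in for selected_landrace_samples (IDs are made up)
samples <- data.frame(
  smarter_id = c("SAMPLE-0001", "SAMPLE-0002"),
  breed_code = c("LNR", "LNR")
)

# Assumption: the genotype file uses breed_code as family ID and
# smarter_id as individual ID, in that order
keep_file <- tempfile(fileext = ".txt")
write.table(
  samples[, c("breed_code", "smarter_id")],
  file = keep_file,
  quote = FALSE,
  sep = "\t",
  row.names = FALSE,
  col.names = FALSE
)

print(readLines(keep_file))
```

The resulting file can then be passed to plink with --keep, together with whatever input, output and species options your genotype file requires.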
Here’s another example that can be applied to get information on variants. In this case we will select the goat variants on chromosome 1 between positions 1 and 1000000 in the ARS1 assembly:
get_smarter_variations <- function(species, assembly, query = list()) {
  # mind that species is lowercase in endpoint url, while assembly is uppercase
  species <- tolower(species)
  assembly <- toupper(assembly)

  url <- httr::modify_url(
    base_url,
    path = sprintf("/smarter-api/variants/%s/%s", species, assembly)
  )

  data <- get_smarter_data(url, query)

  # returning only the results dataframe
  data$results
}

selected_goat_variations <- get_smarter_variations(
  species = "Goat",
  assembly = "ARS1",
  query = list(
    size = 100,
    region = "1:1-1000000"
  )
)
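When the region is built from numeric variables rather than typed by hand, note that R may convert large round numbers to scientific notation. The sketch below uses a hypothetical helper, make_region, which is not part of the API or of the SMARTER R package, to produce the chromosome:start-end string used above.

```r
# Hypothetical helper (not part of the API or of the SMARTER R package)
make_region <- function(chrom, start, end) {
  # format() with scientific = FALSE avoids strings such as "1e+06"
  sprintf(
    "%s:%s-%s",
    chrom,
    format(start, scientific = FALSE, trim = TRUE),
    format(end, scientific = FALSE, trim = TRUE)
  )
}

region <- make_region(1, 1, 1000000)
print(region)

# Naive pasting, for comparison: the end position is converted
# with R's default number formatting
print(paste0("1:1-", 1000000))
```

The value returned by make_region can be passed as region in the query list of get_smarter_variations.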
Hint
We are planning to simplify the variants response by returning a list of the selected SNPs only, so that it can be used when subsetting a genotype file using plink.
Warning
Be careful when using the variants endpoints: getting all the variants will take a lot of time and could fill all your available memory. Avoid requesting all variants in your R session unless you know what you are doing.
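One way to limit the amount of data retrieved is to cap the number of pages that are followed. The sketch below shows the idea offline, using a hypothetical in-memory set of pages in place of the API. The names fake_pages, fetch_fake, get_limited and max_pages are all introduced here for illustration.

```r
library(dplyr)

# Hypothetical in-memory "API": three pages linked through `next`
fake_pages <- list(
  "/page/1" = list(items = data.frame(id = 1:2), `next` = "/page/2"),
  "/page/2" = list(items = data.frame(id = 3:4), `next` = "/page/3"),
  "/page/3" = list(items = data.frame(id = 5:6), `next` = NULL)
)

fetch_fake <- function(path) fake_pages[[path]]

# Follow `next` links like get_smarter_data(), but stop after max_pages
get_limited <- function(first, fetch, max_pages = 2) {
  parsed <- fetch(first)
  results <- parsed$items
  pages <- 1

  while (!is.null(parsed$`next`) && pages < max_pages) {
    parsed <- fetch(parsed$`next`)
    results <- bind_rows(results, parsed$items)
    pages <- pages + 1
  }

  results
}

limited <- get_limited("/page/1", fetch_fake, max_pages = 2)
print(nrow(limited))
```

The same condition could be added to the while loop of get_smarter_data, so that a broad variants query stops after a known number of requests.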