Fetching Data from APIs¶
This guide covers how to pull data from REST APIs into your Odibi pipelines, whether the source is a public API like openFDA or an internal company API.
What is an API?¶
An API (Application Programming Interface) is a way for programs to talk to each other. When you hear "REST API" in data engineering, it usually means:
- A URL you can call to get data (like https://api.fda.gov/food/enforcement.json)
- The data comes back as JSON (a text format that looks like nested key-value pairs)
- You might need an API key or token to authenticate
Example: What an API Response Looks Like¶
When you call https://api.fda.gov/food/enforcement.json?limit=2, you get:
{
  "meta": {
    "results": {
      "skip": 0,
      "limit": 2,
      "total": 25847
    }
  },
  "results": [
    {
      "recall_number": "F-0276-2017",
      "status": "Terminated",
      "city": "Davie",
      "state": "FL",
      "classification": "Class II"
    },
    {
      "recall_number": "F-0865-2017",
      "status": "Terminated",
      "city": "Millbrae",
      "state": "CA",
      "classification": "Class II"
    }
  ]
}
Notice:
- The actual data is nested inside "results"
- There's metadata telling you there are 25,847 total records
- You only got 2 because of limit=2
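To make the three observations concrete, here is a minimal Python sketch that parses a trimmed copy of the sample response. It uses only the standard library and is an illustration of the JSON layout, not of how Odibi reads it internally.

```python
import json

# Trimmed copy of the sample response shown above.
raw = """
{
  "meta": {"results": {"skip": 0, "limit": 2, "total": 25847}},
  "results": [
    {"recall_number": "F-0276-2017", "state": "FL"},
    {"recall_number": "F-0865-2017", "state": "CA"}
  ]
}
"""

payload = json.loads(raw)
records = payload["results"]                 # the actual rows
total = payload["meta"]["results"]["total"]  # how many rows exist server-side

print(f"got {len(records)} of {total} records")
```

The gap between the two numbers is the reason pagination is needed.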
How Odibi Fetches API Data¶
Odibi uses format: api to fetch data from REST APIs. It handles:
- Pagination - Automatically fetches all pages of data
- Response parsing - Extracts the data from nested JSON
- Retries - Handles temporary failures gracefully
- Rate limiting - Respects API limits
Quick Start Example¶
Here's a complete working example that fetches FDA food recall data:
project: my_api_project
connections:
  # HTTP connection - defines the base URL and auth
  my_api:
    type: http
    base_url: "https://api.fda.gov"
    headers:
      User-Agent: "my-app/1.0"
  local_storage:
    type: local
    base_path: ./data
story:
  connection: local_storage
  path: stories
system:
  connection: local_storage
  path: _system
pipelines:
  - pipeline: fetch_fda_data
    nodes:
      - name: fda_recalls
        read:
          connection: my_api
          format: api # <-- This tells Odibi to use the API fetcher
          path: /food/enforcement.json # <-- The API endpoint
          options:
            params: # Query parameters to send
              limit: 1000
            pagination: # How to get all pages
              type: offset_limit
              offset_param: skip
              limit_param: limit
              limit: 1000
              max_pages: 10
            response: # Where to find the data in JSON
              items_path: results
        write:
          connection: local_storage
          format: parquet
          path: bronze/fda_recalls.parquet
          mode: overwrite
Run it with python -m odibi run followed by the path to your YAML file (the command form shown in Step 6).
Understanding API Configuration¶
Connection Setup¶
First, define an HTTP connection with the API's base URL:
connections:
  my_api:
    type: http
    base_url: "https://api.example.com"
    headers:
      User-Agent: "odibi-pipeline/1.0"
      Accept: "application/json"
Authentication¶
Different APIs use different authentication methods:
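No Authentication (Public APIs)¶
Public APIs such as openFDA need no auth block: define the connection with just type and base_url (plus optional headers), as in the Quick Start example.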
API Key in Header¶
connections:
  api_with_key:
    type: http
    base_url: "https://api.example.com"
    auth:
      mode: api_key
      api_key: "${MY_API_KEY}" # Use environment variable
      header_name: "X-API-Key" # Default is X-API-Key
Bearer Token¶
connections:
  api_with_token:
    type: http
    base_url: "https://api.example.com"
    auth:
      mode: bearer
      token: "${API_TOKEN}"
Basic Auth (Username/Password)¶
connections:
  api_with_basic:
    type: http
    base_url: "https://api.example.com"
    auth:
      mode: basic
      username: "${API_USER}"
      password: "${API_PASSWORD}"
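For intuition, the three auth modes correspond to standard HTTP request headers. The sketch below shows that mapping in plain Python; the function name auth_headers is hypothetical and the code is an assumption about what ends up on the wire, not Odibi's actual implementation.

```python
import base64


def auth_headers(auth: dict) -> dict:
    """Translate an auth config (as a dict) into HTTP request headers."""
    mode = auth.get("mode")
    if mode == "api_key":
        # header_name falls back to X-API-Key, the default noted above
        return {auth.get("header_name", "X-API-Key"): auth["api_key"]}
    if mode == "bearer":
        return {"Authorization": f"Bearer {auth['token']}"}
    if mode == "basic":
        # Basic auth is base64("username:password")
        raw = f"{auth['username']}:{auth['password']}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    return {}  # no auth block: public API


print(auth_headers({"mode": "bearer", "token": "abc"}))
print(auth_headers({"mode": "basic", "username": "user", "password": "pass"}))
```

These are the same headers the curl commands in Step 1 pass with -H, which makes curl a convenient way to confirm which mode an API expects.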
Pagination Strategies¶
Most APIs don't return all data at once; they paginate (return data in chunks). Odibi supports 4 pagination types:
1. Offset/Limit (Most Common)¶
Used by: openFDA, many REST APIs
The API uses skip (or offset) and limit parameters:
- Page 1: ?skip=0&limit=100
- Page 2: ?skip=100&limit=100
- Page 3: ?skip=200&limit=100
options:
  pagination:
    type: offset_limit
    offset_param: skip # Parameter name for offset (default: "offset")
    limit_param: limit # Parameter name for limit (default: "limit")
    limit: 1000 # Records per page
    max_pages: 100 # Safety limit
    stop_on_empty: true # Stop when no more results (default: true)
    start_offset: 0 # Starting offset value (default: 0, use 1 for 1-indexed APIs)
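The loop these options describe can be sketched in a few lines of Python. This is an illustration under assumptions: offset_limit_pages and fake_fetch are hypothetical names, the "API" is an in-memory list instead of real HTTP, and Odibi's own loop may differ in detail.

```python
from typing import Callable, Dict, Iterator, List


def offset_limit_pages(
    fetch: Callable[[Dict[str, int]], List[dict]],
    offset_param: str = "offset",
    limit_param: str = "limit",
    limit: int = 100,
    max_pages: int = 100,
    start_offset: int = 0,
    stop_on_empty: bool = True,
) -> Iterator[List[dict]]:
    """Yield one page of records at a time, advancing the offset by `limit`."""
    offset = start_offset
    for _ in range(max_pages):  # max_pages is the safety limit
        page = fetch({offset_param: offset, limit_param: limit})
        if not page and stop_on_empty:
            break
        yield page
        offset += limit


# Hypothetical in-memory "API" holding 250 records.
DATA = [{"id": i} for i in range(250)]


def fake_fetch(params: Dict[str, int]) -> List[dict]:
    start = params["skip"]
    return DATA[start : start + params["limit"]]


pages = list(offset_limit_pages(fake_fetch, offset_param="skip", limit=100))
print([len(p) for p in pages])  # [100, 100, 50]
```

The fourth request returns an empty page, which is what stops the loop before max_pages is reached.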
2. Page Number¶
Used by: Some older APIs
The API uses page numbers:
- Page 1: ?page=1&per_page=100
- Page 2: ?page=2&per_page=100
options:
  pagination:
    type: page_number
    page_param: page # Parameter name for page number
    page_size_param: per_page # Parameter name for page size
    page_size: 100 # Records per page
    start_page: 1 # First page number (usually 1)
    max_pages: 50
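As a rough sketch, the options above translate into the following sequence of query parameters. The generator name page_number_params is hypothetical, and the code only shows the parameters that would be sent, not the requests themselves.

```python
from itertools import islice


def page_number_params(page_param="page", page_size_param="per_page",
                       page_size=100, start_page=1, max_pages=50):
    """Yield the query parameters for each successive page request."""
    for page in range(start_page, start_page + max_pages):
        yield {page_param: page, page_size_param: page_size}


first_three = list(islice(page_number_params(), 3))
print(first_three)
```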
3. Cursor-Based¶
Used by: Twitter, Slack, modern APIs
The API returns a "cursor" or "next_token" in each response:
options:
  pagination:
    type: cursor
    cursor_param: next_token # Parameter name to send cursor
    cursor_path: meta.next_cursor # Where to find cursor in response
    max_pages: 100
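A minimal sketch of the cursor loop, assuming the cursor is read from the response at cursor_path and sent back as cursor_param on the next request. The helper names and the two-page fake API are hypothetical; the real fetcher may handle edge cases differently.

```python
from typing import Any, Callable, Dict, List, Optional


def get_path(obj: Any, dotted: str) -> Any:
    """Follow a dotted path such as 'meta.next_cursor'; return None if missing."""
    for key in dotted.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def fetch_all_cursor(
    fetch: Callable[[Dict[str, str]], dict],
    cursor_param: str,
    cursor_path: str,
    items_path: str,
    max_pages: int = 100,
) -> List[dict]:
    records: List[dict] = []
    params: Dict[str, str] = {}  # the first request carries no cursor
    for _ in range(max_pages):
        body = fetch(params)
        records.extend(get_path(body, items_path) or [])
        cursor: Optional[str] = get_path(body, cursor_path)
        if not cursor:
            break  # no cursor means this was the last page
        params = {cursor_param: cursor}
    return records


# Hypothetical two-page API keyed by the cursor value it receives.
PAGES = {
    None: {"data": [{"id": 1}, {"id": 2}], "meta": {"next_cursor": "abc"}},
    "abc": {"data": [{"id": 3}], "meta": {"next_cursor": None}},
}


def fake_fetch(params: Dict[str, str]) -> dict:
    return PAGES[params.get("next_token")]


rows = fetch_all_cursor(fake_fetch, "next_token", "meta.next_cursor", "data")
print([r["id"] for r in rows])  # [1, 2, 3]
```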
4. Link Header (GitHub Style)¶
Used by: GitHub, GitLab
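The API returns a Link header with the next URL. Set type: link_header with link_rel: next under pagination (Example 2 below shows the full config).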
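To show what a Link header contains, here is a small parsing sketch. The function next_link and the example.com URLs are hypothetical, and splitting on commas is a simplification that would break on URLs that themselves contain commas.

```python
import re
from typing import Optional


def next_link(link_header: str, rel: str = "next") -> Optional[str]:
    """Return the URL tagged with the given rel in a Link header, or None."""
    for part in link_header.split(","):
        match = re.match(r"\s*<([^>]+)>\s*;(.*)", part)
        if match and re.search(rf'rel="?{re.escape(rel)}"?(?:\s|;|$)', match.group(2)):
            return match.group(1)
    return None


header = (
    '<https://api.example.com/items?page=2>; rel="next", '
    '<https://api.example.com/items?page=5>; rel="last"'
)
print(next_link(header))          # https://api.example.com/items?page=2
print(next_link(header, "prev"))  # None
```

Pagination stops when the response no longer carries a link with the requested rel.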
No Pagination¶
For APIs that return all data in one response, set the pagination type to none (type: none) so that a single request is made.
Response Parsing¶
APIs return data in different structures. Tell Odibi where to find the actual records:
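Data at Root Level¶
When the response is a bare array, [{...}, {...}], use items_path: "".
Data in a Key¶
When the array sits under a single key, {"results": [{...}]}, use items_path: results.
Nested Data¶
When the array is nested, {"data": {"items": [{...}]}}, use a dotted path: items_path: data.items.
OData APIs¶
OData responses put the rows under "value", so use items_path: value (see Example 3).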
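The idea behind items_path for the four layouts in this section can be sketched as a dotted-path walk. The function extract_items is a hypothetical illustration; it deliberately leaves out the bracket notation and the other response options described in this section.

```python
from typing import Any, List


def extract_items(body: Any, items_path: str) -> List[Any]:
    """Walk a dotted items_path; an empty path means the body itself."""
    node = body
    for key in filter(None, items_path.split(".")):
        node = node[key]  # raises KeyError if the path is wrong
    return node


print(extract_items([{"id": 1}], ""))                                 # root level
print(extract_items({"results": [{"id": 2}]}, "results"))             # in a key
print(extract_items({"data": {"items": [{"id": 3}]}}, "data.items"))  # nested
print(extract_items({"value": [{"id": 4}]}, "value"))                 # OData
```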
Array Indexing¶
Extract specific elements from arrays using bracket notation:
Or with nested paths:
Wrapping Single Objects¶
Some APIs return a single object instead of an array, for example {"temperature": 72, ...}.
Use wrap_single: true to convert it to a 1-row table:
options:
  response:
    items_path: ""
    wrap_single: true # Wraps single dict in array [{"temperature": 72, ...}]
Array-of-Arrays to Dict¶
Some APIs return arrays of arrays (like OHLCV candles):
[
[1640000000, 50000, 51000, 49000, 50500, 123.45],
[1640003600, 50500, 52000, 50000, 51500, 234.56]
]
Map array elements to named fields:
options:
  response:
    items_path: ""
    array_row_fields: ["timestamp", "open", "high", "low", "close", "volume"]
Result:
[
{"timestamp": 1640000000, "open": 50000, "high": 51000, ...},
{"timestamp": 1640003600, "open": 50500, "high": 52000, ...}
]
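Conceptually, this mapping pairs each position in the inner array with a field name. A minimal Python equivalent, offered as an illustration and not as Odibi's code:

```python
fields = ["timestamp", "open", "high", "low", "close", "volume"]
candles = [
    [1640000000, 50000, 51000, 49000, 50500, 123.45],
    [1640003600, 50500, 52000, 50000, 51500, 234.56],
]

# zip pairs field names with values by position
rows = [dict(zip(fields, candle)) for candle in candles]
print(rows[0])
```

The order of array_row_fields therefore has to match the order documented by the API.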
Converting Dict to Rows¶
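When the data is a dict keyed by name, such as {"rates": {"USD": 1.0, "EUR": 0.85, "GBP": 0.73}}, extract the dict values as rows with dict_to_list: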
options:
  response:
    items_path: "rates"
    dict_to_list: true # Extracts values with keys preserved as _key
Result:
[
{"_key": "USD", "_value": 1.0},
{"_key": "EUR", "_value": 0.85},
{"_key": "GBP", "_value": 0.73}
]
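For scalar values, the transformation that produces this result is equivalent to the sketch below. The input body is an assumed example shape; how values that are themselves dicts get flattened is not covered here.

```python
body = {"rates": {"USD": 1.0, "EUR": 0.85, "GBP": 0.73}}

# Each key/value pair of the dict becomes one row.
rows = [{"_key": key, "_value": value} for key, value in body["rates"].items()]
print(rows)
```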
Adding Metadata Fields¶
You can add fields to every record using static values or date variables:
options:
  response:
    items_path: results
    add_fields:
      _fetched_at: "$now" # Adds current UTC timestamp
      _load_date: "$today" # Adds today's date
      _source: "fda_api" # Adds a constant value
Date Variables¶
Use date variables in add_fields OR params for dynamic dates at runtime. There are three syntaxes available:
1. Global Syntax: ${date:expression} (Recommended)¶
Works anywhere in YAML, not just API configs:
params:
  start_date: ${date:-7d} # 7 days ago
  end_date: ${date:today} # Today
  compact: ${date:today:%Y%m%d} # Custom format: 20240115
response:
  add_fields:
    _fetched_at: ${date:now}
| Expression | Description | Example Output |
|---|---|---|
| ${date:now} | Current datetime | 2024-01-15 14:30:45 |
| ${date:today} | Today's date | 2024-01-15 |
| ${date:yesterday} | Yesterday | 2024-01-14 |
| ${date:-7d} | 7 days ago | 2024-01-08 |
| ${date:-30d} | 30 days ago | 2023-12-16 |
| ${date:-1m} | ~1 month ago | 2023-12-15 |
| ${date:start_of_month} | First of month | 2024-01-01 |
| ${date:today:%Y%m%d} | Custom format | 20240115 |
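To check what an expression will resolve to, the arithmetic can be reproduced with the standard library. The resolver below is a hypothetical sketch covering only some of the expressions in the table (it skips -1m), with the format after the second colon treated as a strftime pattern; it is not Odibi's parser.

```python
import re
from datetime import datetime, timedelta


def resolve_date(expr: str, now: datetime) -> str:
    """Resolve a subset of the date expressions, e.g. '-7d' or 'today:%Y%m%d'."""
    name, _, fmt = expr.partition(":")  # optional strftime format after ':'
    if name == "now":
        return now.strftime(fmt or "%Y-%m-%d %H:%M:%S")
    if name == "today":
        day = now.date()
    elif name == "yesterday":
        day = now.date() - timedelta(days=1)
    elif name == "start_of_month":
        day = now.date().replace(day=1)
    elif (m := re.fullmatch(r"-(\d+)d", name)):
        day = now.date() - timedelta(days=int(m.group(1)))
    else:
        raise ValueError(f"unsupported expression: {expr}")
    return day.strftime(fmt or "%Y-%m-%d")


now = datetime(2024, 1, 15, 14, 30, 45)  # fixed "current time" for the demo
for expr in ["now", "today", "yesterday", "-7d", "-30d", "start_of_month", "today:%Y%m%d"]:
    print(f"${{date:{expr}}} -> {resolve_date(expr, now)}")
```

With the fixed timestamp above, the printed values match the Example Output column.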
2. Shortcut Syntax: $variable (API params/add_fields only)¶
Quick access to common date values:
| Variable | Format | Example |
|---|---|---|
| $now | ISO datetime | 2024-01-15T10:30:00+00:00 |
| $today | YYYY-MM-DD | 2024-01-15 |
| $yesterday | YYYY-MM-DD | 2024-01-14 |
| $7_days_ago | YYYY-MM-DD | 2024-01-08 |
| $30_days_ago | YYYY-MM-DD | 2023-12-16 |
| $today_compact | YYYYMMDD | 20240115 |
| $7_days_ago_compact | YYYYMMDD | 20240108 |
3. Expression Syntax: {expression} (API params/add_fields only)¶
Flexible relative dates with optional format:
Example: Tracking Data Freshness¶
options:
  response:
    items_path: results
    add_fields:
      _fetched_at: ${date:now} # When this record was fetched
      _load_date: ${date:today} # Date-based partitioning key
      _month_start: ${date:start_of_month} # For monthly aggregations
Example: Dynamic Date Filtering (openFDA)¶
read:
  format: api
  path: /food/enforcement.json
  options:
    params:
      # Fetch recalls from last 30 days
      search: "report_date:[${date:-30d:%Y%m%d}+TO+${date:today:%Y%m%d}]"
    response:
      items_path: results
      add_fields:
        _fetched_at: ${date:now}
Injecting Custom Dates¶
You can override dates via environment variables:
Why use date variables?
- Track when data was ingested (not just when it was created)
- Debug stale data issues
- Implement incremental processing based on fetch time
- Create audit trails for compliance
- Build self-adjusting pipelines that always fetch "last 30 days"
📖 Full documentation: See Variable Substitution Guide for all variable types including environment variables and custom vars.
HTTP Settings¶
Configure timeouts, retries, and rate limiting:
options:
  http:
    timeout_s: 60 # Request timeout in seconds (default: 30)
    retries:
      max_attempts: 5 # Total attempts including first (default: 3)
      backoff:
        base_s: 1.0 # Initial wait between retries (default: 1.0)
        max_s: 60.0 # Maximum wait time (default: 60.0)
        exponential_base: 2.0 # Backoff multiplier (default: 2.0) - delay = base_s * (exponential_base ^ attempt)
      retry_on_status: # HTTP codes to retry (defaults shown below)
        - 429 # Too Many Requests
        - 500 # Server Error
        - 502 # Bad Gateway
        - 503 # Service Unavailable
        - 504 # Gateway Timeout
    rate_limit:
      type: auto # Respects Retry-After headers (default)
      # OR fixed rate:
      # type: fixed
      # requests_per_second: 2
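The backoff formula in the comment above can be worked through numerically. This sketch assumes the attempt counter starts at 0 for the first retry and that max_s caps each delay; both are assumptions drawn from the option descriptions, and backoff_delays is a hypothetical name.

```python
def backoff_delays(max_attempts=5, base_s=1.0, max_s=60.0, exponential_base=2.0):
    """Wait time before each retry: base_s * exponential_base ** attempt, capped at max_s."""
    # max_attempts counts the first try, so there are max_attempts - 1 retries.
    return [
        min(max_s, base_s * exponential_base ** attempt)
        for attempt in range(max_attempts - 1)
    ]


print(backoff_delays())                 # [1.0, 2.0, 4.0, 8.0]
print(backoff_delays(max_attempts=10))  # later delays are capped at 60.0
```

Under these assumptions, max_attempts: 5 adds at most 15 seconds of waiting per request on top of the request time itself.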
Query Parameters¶
Send parameters with your request:
options:
  params:
    limit: 1000
    status: active
    sort: created_at
    # For search/filter syntax (varies by API):
    search: "report_date:[20240101+TO+20241231]"
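For reference, this is how a params mapping turns into a URL query string with Python's standard library. The base URL is a placeholder, and this is not necessarily the encoder Odibi uses.

```python
from urllib.parse import urlencode

params = {"limit": 1000, "status": "active", "sort": "created_at"}
url = "https://api.example.com/data?" + urlencode(params)
print(url)  # https://api.example.com/data?limit=1000&status=active&sort=created_at
```

Note that urlencode percent-encodes characters such as + and [ in values, so for search syntax like the example above it is worth checking the final URL against what the API expects.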
Complete Examples¶
Example 1: openFDA Food Recalls¶
project: fda_recalls
connections:
  openfda:
    type: http
    base_url: "https://api.fda.gov"
    headers:
      User-Agent: "odibi-pipeline/1.0"
  bronze:
    type: local
    base_path: ./data/bronze
story:
  connection: bronze
  path: stories
system:
  connection: bronze
  path: _system
pipelines:
  - pipeline: food_recalls
    nodes:
      - name: fda_food_recalls
        read:
          connection: openfda
          format: api
          path: /food/enforcement.json
          options:
            params:
              limit: 1000
            pagination:
              type: offset_limit
              offset_param: skip
              limit_param: limit
              limit: 1000
              max_pages: 100
            response:
              items_path: results
              add_fields:
                _fetched_at: "$now"
        write:
          connection: bronze
          format: parquet
          path: fda_food_recalls.parquet
          mode: overwrite
Example 2: GitHub API (Link Header Pagination)¶
project: github_data
connections:
  github:
    type: http
    base_url: "https://api.github.com"
    headers:
      Accept: "application/vnd.github+json"
    auth:
      mode: bearer
      token: "${GITHUB_TOKEN}"
  bronze:
    type: local
    base_path: ./data/bronze
story:
  connection: bronze
  path: stories
system:
  connection: bronze
  path: _system
pipelines:
  - pipeline: repo_issues
    nodes:
      - name: issues
        read:
          connection: github
          format: api
          path: /repos/henryodibi11/Odibi/issues
          options:
            params:
              state: all
              per_page: 100
            pagination:
              type: link_header
              link_rel: next
              max_pages: 10
            response:
              items_path: "" # GitHub returns array at root
        write:
          connection: bronze
          format: parquet
          path: github_issues.parquet
          mode: overwrite
Example 3: OData API¶
project: odata_example
connections:
  odata_api:
    type: http
    base_url: "https://services.odata.org/V4/Northwind/Northwind.svc"
  bronze:
    type: local
    base_path: ./data/bronze
story:
  connection: bronze
  path: stories
system:
  connection: bronze
  path: _system
pipelines:
  - pipeline: customers
    nodes:
      - name: customers
        read:
          connection: odata_api
          format: api
          path: /Customers
          options:
            params:
              $format: json
              $top: 100
            pagination:
              type: offset_limit
              offset_param: $skip
              limit_param: $top
              limit: 100
              max_pages: 10
            response:
              items_path: value
        write:
          connection: bronze
          format: parquet
          path: customers.parquet
          mode: overwrite
Example 4: Cursor-Based API¶
project: cursor_api
connections:
  my_api:
    type: http
    base_url: "https://api.example.com"
    auth:
      mode: bearer
      token: "${API_TOKEN}"
  bronze:
    type: local
    base_path: ./data/bronze
story:
  connection: bronze
  path: stories
system:
  connection: bronze
  path: _system
pipelines:
  - pipeline: fetch_data
    nodes:
      - name: records
        read:
          connection: my_api
          format: api
          path: /v1/records
          options:
            params:
              page_size: 500
            pagination:
              type: cursor
              cursor_param: cursor
              cursor_path: pagination.next_cursor
              max_pages: 50
            response:
              items_path: data
        write:
          connection: bronze
          format: parquet
          path: records.parquet
          mode: overwrite
Step-by-Step: Figuring Out a New API¶
When you get a request to pull data from an API you've never used, here's the workflow:
Step 1: Test the API in Your Browser or Terminal¶
Before writing any YAML, test the API works:
# Simple test - just hit the endpoint
curl "https://api.fda.gov/food/enforcement.json?limit=1"
# With an API key (if required)
curl -H "X-API-Key: your-key-here" "https://api.example.com/data"
# With a bearer token
curl -H "Authorization: Bearer your-token" "https://api.example.com/data"
Or just paste the URL in your browser for public APIs.
Step 2: Look at the Response¶
Save the response to a file (for example, by adding -o response.json to the curl command) and examine it.
Open response.json and find:
- Where is the data array? Look for [] brackets
  - At root level: [{...}, {...}] → items_path: ""
  - In a key: {"results": [{...}]} → items_path: results
  - Nested: {"data": {"items": [{...}]}} → items_path: data.items
- Is there pagination info? Look for:
  - "total": 25000 → There's more data than you got
  - "next_cursor": "abc123" → Cursor pagination
  - "skip": 0, "limit": 100 → Offset pagination
Step 3: Find Pagination from the Docs¶
Read the API docs and look for:
| Docs say... | Pagination type | Example params |
|---|---|---|
| "skip/limit", "offset/limit" | offset_limit | ?skip=0&limit=100 |
| "page/per_page", "page/size" | page_number | ?page=1&per_page=100 |
| "cursor", "next_token", "continuation" | cursor | ?cursor=abc123 |
| "Link header", "RFC 5988" | link_header | Check response headers |
Step 4: Test Pagination Manually¶
Verify pagination works before automating:
# Page 1
curl "https://api.example.com/data?skip=0&limit=10"
# Page 2 - should return different data
curl "https://api.example.com/data?skip=10&limit=10"
Step 5: Write Your Odibi YAML¶
Now you know:
- ✅ Base URL
- ✅ Endpoint path
- ✅ Auth method (if any)
- ✅ Pagination type and params
- ✅ Where data lives in response (items_path)
Write the YAML and test with a small max_pages: 2 first.
Step 6: Run and Verify¶
# Test run with limited pages
python -m odibi run my_pipeline.yaml
# Check the output
python -c "import pandas as pd; df = pd.read_parquet('output.parquet'); print(len(df), df.columns.tolist()[:5])"
How to Find API Documentation¶
When working with a new API, you need to find:
- Base URL - Where the API lives (e.g., https://api.fda.gov)
- Endpoints - The paths for different data (e.g., /food/enforcement.json)
- Authentication - How to prove who you are
- Pagination - How to get all the data
- Response format - Where the data is in the JSON
Tips for Reading API Docs¶
| Look for... | Maps to... |
|---|---|
| "Base URL" or "API Host" | base_url in connection |
| "Endpoints" or "Resources" | path in read config |
| "Authentication", "API Keys" | auth in connection |
| "Pagination", "skip/limit", "page/per_page" | pagination in options |
| "Response", "Example response" | items_path in response |
Common API Documentation Sites¶
- openFDA: https://open.fda.gov/apis/
- GitHub: https://docs.github.com/en/rest
- OData: https://www.odata.org/documentation/
Troubleshooting¶
"HTTP Error 401: Unauthorized"¶
- Check your API key or token
- Verify the auth mode matches what the API expects
- Make sure environment variables are set
"HTTP Error 404: Not Found"¶
- Check the endpoint path
- Some APIs need a trailing slash, some don't
- Verify the base URL is correct
"HTTP Error 429: Too Many Requests"¶
- The API is rate limiting you
- Add rate limiting config in the rate_limit block, for example type: fixed with requests_per_second: 2 (see HTTP Settings)
No data returned¶
- Check items_path matches where data is in the response
- Test the API URL in a browser or with curl
- Look at the raw response (for example, with the curl commands from Step 1)
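One way to inspect a raw response is to list every place in it that holds an array of objects, since those are the candidates for items_path. The helper list_paths is a hypothetical sketch; in practice the sample dict would be replaced by json.load on a saved response file.

```python
from typing import Any, Iterator, Tuple


def list_paths(node: Any, prefix: str = "") -> Iterator[Tuple[str, int]]:
    """Yield (dotted_path, length) for every list of objects inside a JSON value."""
    if isinstance(node, list):
        if node and all(isinstance(item, dict) for item in node):
            yield prefix, len(node)
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from list_paths(value, f"{prefix}.{key}" if prefix else key)


sample = {
    "meta": {"results": {"skip": 0, "limit": 2, "total": 25847}},
    "results": [{"recall_number": "F-0276-2017"}, {"recall_number": "F-0865-2017"}],
}
print(list(list_paths(sample)))  # [('results', 2)]
```

An empty path in the output means the array is at the root, which corresponds to items_path: "". If the configured items_path is not among the printed paths, that is the likely reason no rows came back.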
"Connection timed out"¶
- Increase the timeout with timeout_s in the http options (default: 30 seconds; see HTTP Settings)
Quick Reference¶
read:
  connection: my_api # HTTP connection name
  format: api # Required for API fetching
  path: /endpoint # API endpoint path
  options:
    method: GET # HTTP method: GET (default), POST, PUT, PATCH, DELETE
    params: # Query parameters (GET) or merged into body (POST)
      key: value
    request_body: # JSON body for POST/PUT/PATCH requests
      filters:
        status: ["active"]
    pagination:
      type: offset_limit # offset_limit | page_number | cursor | link_header | none
      # ... pagination-specific options
      max_pages: 100 # Safety limit
      start_offset: 0 # Starting offset (use 1 for 1-indexed APIs)
    response:
      items_path: results # Dotted path to data array
      add_fields: # Optional fields to add
        _fetched_at: "${date:now}"
    http:
      timeout_s: 60 # Request timeout (default: 30)
      retries:
        max_attempts: 5 # Total attempts (default: 3)
        backoff:
          base_s: 1.0 # Initial delay (default: 1.0)
          max_s: 60.0 # Max delay (default: 60.0)
          exponential_base: 2.0 # Backoff multiplier (default: 2.0)
        retry_on_status: [429, 500, 502, 503, 504] # HTTP codes to retry
      rate_limit:
        type: auto # auto (default) | fixed
        # requests_per_second: 2 # For type: fixed
POST APIs (Advanced)¶
Some APIs use POST requests with JSON body for complex queries. Odibi supports this:
read:
  connection: my_api
  format: api
  path: /v1/search
  options:
    method: POST # Use POST instead of GET
    request_body: # JSON body to send
      filters:
        Classification: ["Class I"]
        PostedDateFrom: ["${date:-30d}"]
      columns:
        - RecallID
        - FirmName
        - ProductDescription
    pagination:
      type: offset_limit
      offset_param: start # These go into the JSON body for POST
      limit_param: rows
      limit: 1000
      start_offset: 1 # Some APIs are 1-indexed
    response:
      items_path: result
For POST requests, pagination parameters are automatically added to the JSON body instead of the URL query string.
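As a rough picture of what gets sent, the sketch below builds the JSON body for the first two pages of the config above. It assumes the pagination keys are merged at the top level of request_body, which is a guess at the behavior; post_bodies is a hypothetical name.

```python
import json


def post_bodies(request_body, offset_param, limit_param, limit, start_offset=0, pages=3):
    """Yield the JSON body for each page: the fixed body plus offset/limit keys."""
    for page in range(pages):
        yield {
            **request_body,
            offset_param: start_offset + page * limit,
            limit_param: limit,
        }


base = {"filters": {"Classification": ["Class I"]}, "columns": ["RecallID"]}
bodies = list(post_bodies(base, "start", "rows", limit=1000, start_offset=1, pages=2))
for body in bodies:
    print(json.dumps(body))
```

With start_offset: 1 and limit: 1000, the second page starts at 1001, which is the behavior a 1-indexed API expects.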
Next Steps¶
- Try the openFDA example to see it in action
- Learn about incremental loading for weekly API pulls
- Set up alerting to know when API pulls fail