Fetching Data from APIs

This guide covers how to pull data from REST APIs into your Odibi pipelines, whether the source is a public API like openFDA or an internal company API.

What is an API?

An API (Application Programming Interface) is a way for programs to talk to each other. When you hear "REST API" in data engineering, it usually means:

  • A URL you can call to get data (like https://api.fda.gov/food/enforcement.json)
  • The data comes back as JSON (a text format that looks like nested key-value pairs)
  • You might need an API key or token to authenticate

Example: What an API Response Looks Like

When you call https://api.fda.gov/food/enforcement.json?limit=2, you get:

{
  "meta": {
    "results": {
      "skip": 0,
      "limit": 2,
      "total": 25847
    }
  },
  "results": [
    {
      "recall_number": "F-0276-2017",
      "status": "Terminated",
      "city": "Davie",
      "state": "FL",
      "classification": "Class II"
    },
    {
      "recall_number": "F-0865-2017",
      "status": "Terminated",
      "city": "Millbrae",
      "state": "CA",
      "classification": "Class II"
    }
  ]
}

Notice:

  • The actual data is nested inside "results"
  • There's metadata telling you there are 25,847 total records
  • You only got 2 because of limit=2
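To make that structure concrete, here is a minimal Python sketch that parses a trimmed copy of the response above, pulls the records out of "results", and estimates how many requests a full download would take at this page size. The variable names are illustrative only and are not part of Odibi.

```python
import json
import math

# Trimmed copy of the sample response shown above.
raw = """
{
  "meta": {"results": {"skip": 0, "limit": 2, "total": 25847}},
  "results": [
    {"recall_number": "F-0276-2017", "state": "FL"},
    {"recall_number": "F-0865-2017", "state": "CA"}
  ]
}
"""

payload = json.loads(raw)
records = payload["results"]            # the actual data rows
paging = payload["meta"]["results"]     # metadata about this page

# Number of requests needed to read every record at this page size.
pages_needed = math.ceil(paging["total"] / paging["limit"])

print(len(records), "records on this page")
print(pages_needed, "requests needed at limit =", paging["limit"])
```

This is the bookkeeping that pagination automates: find the rows, then keep requesting until the total is covered.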

How Odibi Fetches API Data

Odibi uses format: api to fetch data from REST APIs. It handles:

  1. Pagination - Automatically fetches all pages of data
  2. Response parsing - Extracts the data from nested JSON
  3. Retries - Handles temporary failures gracefully
  4. Rate limiting - Respects API limits

Quick Start Example

Here's a complete working example that fetches FDA food recall data:

project: my_api_project

connections:
  # HTTP connection - defines the base URL and auth
  my_api:
    type: http
    base_url: "https://api.fda.gov"
    headers:
      User-Agent: "my-app/1.0"

  local_storage:
    type: local
    base_path: ./data

story:
  connection: local_storage
  path: stories

system:
  connection: local_storage
  path: _system

pipelines:
  - pipeline: fetch_fda_data
    nodes:
      - name: fda_recalls
        read:
          connection: my_api
          format: api                    # <-- This tells Odibi to use the API fetcher
          path: /food/enforcement.json   # <-- The API endpoint
          options:
            params:                      # Query parameters to send
              limit: 1000

            pagination:                  # How to get all pages
              type: offset_limit
              offset_param: skip
              limit_param: limit
              limit: 1000
              max_pages: 10

            response:                    # Where to find the data in JSON
              items_path: results

        write:
          connection: local_storage
          format: parquet
          path: bronze/fda_recalls.parquet
          mode: overwrite

Run it:

python -m odibi run my_pipeline.yaml

Understanding API Configuration

Connection Setup

First, define an HTTP connection with the API's base URL:

connections:
  my_api:
    type: http
    base_url: "https://api.example.com"
    headers:
      User-Agent: "odibi-pipeline/1.0"
      Accept: "application/json"

Authentication

Different APIs use different authentication methods:

No Authentication (Public APIs)

connections:
  public_api:
    type: http
    base_url: "https://api.fda.gov"

API Key in Header

connections:
  api_with_key:
    type: http
    base_url: "https://api.example.com"
    auth:
      mode: api_key
      api_key: "${MY_API_KEY}"      # Use environment variable
      header_name: "X-API-Key"       # Default is X-API-Key

Bearer Token

connections:
  api_with_token:
    type: http
    base_url: "https://api.example.com"
    auth:
      mode: bearer
      token: "${API_TOKEN}"

Basic Auth (Username/Password)

connections:
  api_with_basic:
    type: http
    base_url: "https://api.example.com"
    auth:
      mode: basic
      username: "${API_USER}"
      password: "${API_PASSWORD}"
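As a rough mental model, each auth mode boils down to one HTTP header. The sketch below shows the headers these three modes conventionally map to. The function auth_headers is hypothetical, and the assumption that Odibi sends exactly these headers is mine, not something the configuration above states.

```python
import base64


def auth_headers(auth: dict) -> dict:
    """Return the HTTP headers implied by an auth block (illustrative only)."""
    mode = auth.get("mode")
    if mode == "api_key":
        # header_name falls back to X-API-Key, matching the default noted above.
        return {auth.get("header_name", "X-API-Key"): auth["api_key"]}
    if mode == "bearer":
        return {"Authorization": f"Bearer {auth['token']}"}
    if mode == "basic":
        # Standard HTTP Basic auth: base64 of "username:password".
        pair = f"{auth['username']}:{auth['password']}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(pair).decode("ascii")}
    return {}  # no auth configured


print(auth_headers({"mode": "bearer", "token": "abc"}))
```

Knowing the resulting header makes it easy to reproduce a request with curl when debugging a 401.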

Pagination Strategies

Most APIs don't return all data at once. Instead they paginate, returning data in chunks. Odibi supports 4 pagination types:

1. Offset/Limit (Most Common)

Used by: openFDA, many REST APIs

The API uses skip (or offset) and limit parameters:

  • Page 1: ?skip=0&limit=100
  • Page 2: ?skip=100&limit=100
  • Page 3: ?skip=200&limit=100

options:
  pagination:
    type: offset_limit
    offset_param: skip        # Parameter name for offset (default: "offset")
    limit_param: limit        # Parameter name for limit (default: "limit")
    limit: 1000               # Records per page
    max_pages: 100            # Safety limit
    stop_on_empty: true       # Stop when no more results (default: true)
    start_offset: 0           # Starting offset value (default: 0, use 1 for 1-indexed APIs)
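The loop behind offset/limit pagination can be sketched in a few lines of Python. This is an illustration of the general technique, not Odibi's implementation. fetch_offset_limit and fake_fetch are hypothetical names, and fake_fetch stands in for a real HTTP call.

```python
from typing import Callable


def fetch_offset_limit(
    fetch_page: Callable[[int, int], list],
    limit: int = 100,
    max_pages: int = 100,
    start_offset: int = 0,
) -> list:
    """Collect rows page by page until a page is empty or max_pages is hit."""
    rows = []
    offset = start_offset
    for _ in range(max_pages):
        page = fetch_page(offset, limit)
        if not page:        # same idea as stop_on_empty
            break
        rows.extend(page)
        offset += limit     # next page starts where this one ended
    return rows


# Stand-in for an HTTP call: 250 fake records served in slices.
DATA = [{"id": i} for i in range(250)]


def fake_fetch(skip: int, limit: int) -> list:
    return DATA[skip:skip + limit]


all_rows = fetch_offset_limit(fake_fetch, limit=100)
print(len(all_rows))  # 250
```

Note how max_pages acts as a hard stop even if the API never returns an empty page.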

2. Page Number

Used by: Some older APIs

The API uses page numbers:

  • Page 1: ?page=1&per_page=100
  • Page 2: ?page=2&per_page=100

options:
  pagination:
    type: page_number
    page_param: page          # Parameter name for page number
    page_size_param: per_page # Parameter name for page size
    page_size: 100            # Records per page
    start_page: 1             # First page number (usually 1)
    max_pages: 50
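For comparison with offset/limit, here is a small Python sketch of the query parameters a page-number strategy would send on each request. The generator page_params is a hypothetical helper written for this guide. Only the option names mirror the configuration above.

```python
def page_params(page_size=100, start_page=1, max_pages=50,
                page_param="page", page_size_param="per_page"):
    """Yield the query parameters for each successive page request."""
    for page in range(start_page, start_page + max_pages):
        yield {page_param: page, page_size_param: page_size}


first_three = list(page_params(max_pages=3))
print(first_three)
```

The page size stays constant and only the page counter moves, which is why start_page matters for APIs that count from 0.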

3. Cursor-Based

Used by: Twitter, Slack, modern APIs

The API returns a "cursor" or "next_token" in each response:

options:
  pagination:
    type: cursor
    cursor_param: next_token       # Parameter name to send cursor
    cursor_path: meta.next_cursor  # Where to find cursor in response
    max_pages: 100
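Cursor pagination is easiest to understand as "keep asking until the response stops handing you a cursor". The Python sketch below illustrates that loop under stated assumptions: the rows live under a hypothetical "data" key, the cursor lives at meta.next_cursor, and PAGES is an in-memory stand-in for the API.

```python
def fetch_with_cursor(fetch_page, cursor_path="meta.next_cursor", max_pages=100):
    """Follow cursors until the response no longer contains one."""
    rows, cursor = [], None
    for _ in range(max_pages):
        response = fetch_page(cursor)
        rows.extend(response.get("data", []))
        # Walk the dotted cursor_path, e.g. response["meta"]["next_cursor"].
        node = response
        for key in cursor_path.split("."):
            node = node.get(key) if isinstance(node, dict) else None
        cursor = node
        if not cursor:      # missing or empty cursor means last page
            break
    return rows


# Stand-in API: three pages chained by cursors.
PAGES = {
    None: {"data": [1, 2], "meta": {"next_cursor": "a"}},
    "a": {"data": [3, 4], "meta": {"next_cursor": "b"}},
    "b": {"data": [5], "meta": {"next_cursor": None}},
}

result = fetch_with_cursor(lambda cursor: PAGES[cursor])
print(result)  # [1, 2, 3, 4, 5]
```

In a real request the cursor value would be sent back under the name given by cursor_param.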

4. Link Header

Used by: GitHub, GitLab

The API returns a Link header with the next URL:

options:
  pagination:
    type: link_header
    link_rel: next            # Usually "next"
    max_pages: 50
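To see what "follow the Link header" means in practice, here is a minimal Python sketch that extracts the URL for a given rel value. It is a simplification that splits on commas, so it would misread URLs that themselves contain commas. The function next_link and the example header are hypothetical.

```python
import re
from typing import Optional


def next_link(link_header: str, rel: str = "next") -> Optional[str]:
    """Return the URL tagged with the given rel in a Link header, if any."""
    for part in link_header.split(","):
        match = re.match(r'\s*<([^>]+)>\s*;\s*rel="([^"]+)"', part)
        if match and match.group(2) == rel:
            return match.group(1)
    return None


header = (
    '<https://api.github.com/repositories/1/issues?page=2>; rel="next", '
    '<https://api.github.com/repositories/1/issues?page=5>; rel="last"'
)
print(next_link(header))
```

When no entry carries rel="next", the function returns None, which is the signal that the last page has been reached.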

No Pagination

For APIs that return all data in one response:

options:
  pagination:
    type: none

Response Parsing

APIs return data in different structures. Tell Odibi where to find the actual records:

Data at Root Level

[{"id": 1}, {"id": 2}]

options:
  response:
    items_path: ""   # Empty = root is the array

Data in a Key

{"results": [{"id": 1}, {"id": 2}]}

options:
  response:
    items_path: results

Nested Data

{"data": {"items": [{"id": 1}, {"id": 2}]}}

options:
  response:
    items_path: data.items

OData APIs

{"value": [{"id": 1}, {"id": 2}], "@odata.nextLink": "..."}

options:
  response:
    items_path: value

Array Indexing

Extract specific elements from arrays using bracket notation:

[
  {"meta": "ignored"},
  [{"id": 1}, {"id": 2}]
]

options:
  response:
    items_path: "[1]"   # Get second element (0-indexed)

Or with nested paths:

{"series": [{"data": [{"id": 1}, {"id": 2}]}]}

options:
  response:
    items_path: "series[0].data"   # Navigate series[0] then .data
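The path forms shown so far (empty, single key, dotted, and bracket-indexed) can all be resolved with one small walker. The Python sketch below is a hypothetical reimplementation to illustrate the idea. It is not Odibi's parser and it does no error handling for missing keys.

```python
import re


def resolve_path(payload, items_path: str):
    """Walk a path such as 'data.items', '[1]' or 'series[0].data'."""
    node = payload
    # Each token is either a key name or a bracketed integer index.
    for key, index in re.findall(r"([^.\[\]]+)|\[(\d+)\]", items_path):
        node = node[key] if key else node[int(index)]
    return node


doc = {"series": [{"data": [{"id": 1}, {"id": 2}]}]}
print(resolve_path(doc, "series[0].data"))  # [{'id': 1}, {'id': 2}]
```

An empty path yields no tokens, so the payload itself is returned, which matches the "root is the array" case.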

Wrapping Single Objects

Some APIs return a single object instead of an array:

{"temperature": 72, "humidity": 45}

Use wrap_single: true to convert it to a 1-row table:

options:
  response:
    items_path: ""
    wrap_single: true   # Wraps single dict in array [{"temperature": 72, ...}]

Array-of-Arrays to Dict

Some APIs return arrays of arrays (for example, OHLCV candles):

[
  [1640000000, 50000, 51000, 49000, 50500, 123.45],
  [1640003600, 50500, 52000, 50000, 51500, 234.56]
]

Map array elements to named fields:

options:
  response:
    items_path: ""
    array_row_fields: ["timestamp", "open", "high", "low", "close", "volume"]

Result:

[
  {"timestamp": 1640000000, "open": 50000, "high": 51000, ...},
  {"timestamp": 1640003600, "open": 50500, "high": 52000, ...}
]
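The mapping is positional: the first field name pairs with the first element of each inner array, and so on. A minimal Python sketch of the same transformation, using the sample data above:

```python
fields = ["timestamp", "open", "high", "low", "close", "volume"]
candles = [
    [1640000000, 50000, 51000, 49000, 50500, 123.45],
    [1640003600, 50500, 52000, 50000, 51500, 234.56],
]

# Pair each field name with the value at the same position.
rows = [dict(zip(fields, candle)) for candle in candles]
print(rows[0])
```

Because zip stops at the shorter input, a candle with extra trailing values would silently lose them in this sketch. Whether Odibi behaves the same way is not stated here.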

Converting Dict to Rows

Extract dict values as rows with dict_to_list:

{"rates": {"USD": 1.0, "EUR": 0.85, "GBP": 0.73}}

options:
  response:
    items_path: "rates"
    dict_to_list: true   # Extracts values with keys preserved as _key

Result:

[
  {"_key": "USD", "_value": 1.0},
  {"_key": "EUR", "_value": 0.85},
  {"_key": "GBP", "_value": 0.73}
]
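In Python terms, the reshaping shown above is equivalent to iterating over the dict's items. This is a sketch of the transformation only, with the _key and _value names taken from the result shown:

```python
rates = {"USD": 1.0, "EUR": 0.85, "GBP": 0.73}

# One output row per dict entry, keeping the original key alongside the value.
rate_rows = [{"_key": key, "_value": value} for key, value in rates.items()]
print(rate_rows)
```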

Adding Metadata Fields

You can add fields to every record using static values or date variables:

options:
  response:
    items_path: results
    add_fields:
      _fetched_at: "$now"         # Adds current UTC timestamp
      _load_date: "$today"        # Adds today's date
      _source: "fda_api"          # Adds a constant value

Date Variables

Use date variables in add_fields or params to compute dates at runtime. There are three syntaxes available:

1. ${date:expression} Syntax

Works anywhere in YAML, not just API configs:

params:
  start_date: ${date:-7d}              # 7 days ago
  end_date: ${date:today}              # Today
  compact: ${date:today:%Y%m%d}        # Custom format: 20240115
response:
  add_fields:
    _fetched_at: ${date:now}

| Expression | Description | Example Output |
|---|---|---|
| ${date:now} | Current datetime | 2024-01-15 14:30:45 |
| ${date:today} | Today's date | 2024-01-15 |
| ${date:yesterday} | Yesterday | 2024-01-14 |
| ${date:-7d} | 7 days ago | 2024-01-08 |
| ${date:-30d} | 30 days ago | 2023-12-16 |
| ${date:-1m} | ~1 month ago | 2023-12-15 |
| ${date:start_of_month} | First of month | 2024-01-01 |
| ${date:today:%Y%m%d} | Custom format | 20240115 |
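To show how these expressions relate to ordinary date arithmetic, here is a minimal Python sketch that resolves a subset of them (now, today, yesterday, and -Nd, each with an optional strftime format). The function resolve_date is hypothetical and only approximates the behaviour in the table. It is not Odibi's resolver and does not handle -1m or start_of_month.

```python
from datetime import datetime, timedelta


def resolve_date(expr: str, now: datetime) -> str:
    """Resolve 'now', 'today', 'yesterday' or '-Nd', with an optional ':format'."""
    name, _, fmt = expr.partition(":")
    if name == "now":
        return now.strftime(fmt or "%Y-%m-%d %H:%M:%S")
    if name == "today":
        day = now.date()
    elif name == "yesterday":
        day = now.date() - timedelta(days=1)
    elif name.startswith("-") and name.endswith("d"):
        day = now.date() - timedelta(days=int(name[1:-1]))
    else:
        raise ValueError(f"unsupported expression: {expr}")
    return day.strftime(fmt or "%Y-%m-%d")


# Fixed reference time so the outputs match the example column above.
reference = datetime(2024, 1, 15, 14, 30, 45)
for expression in ["now", "today", "-7d", "-30d", "today:%Y%m%d"]:
    print(expression, "->", resolve_date(expression, reference))
```

Passing the reference time explicitly keeps the sketch deterministic. A real pipeline would use the current time instead.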

2. Shortcut Syntax: $variable (API params/add_fields only)

Quick access to common date values:

| Variable | Format | Example |
|---|---|---|
| $now | ISO datetime | 2024-01-15T10:30:00+00:00 |
| $today | YYYY-MM-DD | 2024-01-15 |
| $yesterday | YYYY-MM-DD | 2024-01-14 |
| $7_days_ago | YYYY-MM-DD | 2024-01-08 |
| $30_days_ago | YYYY-MM-DD | 2023-12-16 |
| $today_compact | YYYYMMDD | 20240115 |
| $7_days_ago_compact | YYYYMMDD | 20240108 |

3. Expression Syntax: {expression} (API params/add_fields only)

Flexible relative dates with optional format:

params:
  start: "{-30d}"              # 30 days ago
  end: "{today:%Y%m%d}"        # Today in YYYYMMDD

Example: Tracking Data Freshness

options:
  response:
    items_path: results
    add_fields:
      _fetched_at: ${date:now}              # When this record was fetched
      _load_date: ${date:today}             # Date-based partitioning key
      _month_start: ${date:start_of_month}  # First day of the month, for monthly aggregations

Example: Dynamic Date Filtering (openFDA)

read:
  format: api
  path: /food/enforcement.json
  options:
    params:
      # Fetch recalls from last 30 days
      search: "report_date:[${date:-30d:%Y%m%d}+TO+${date:today:%Y%m%d}]"
    response:
      items_path: results
      add_fields:
        _fetched_at: ${date:now}

Injecting Custom Dates

You can override dates via environment variables:

export REPORT_DATE=2024-01-01

params:
  date: ${REPORT_DATE}    # Uses your injected value

Why use date variables?

  • Track when data was ingested (not just when it was created)
  • Debug stale data issues
  • Implement incremental processing based on fetch time
  • Create audit trails for compliance
  • Build self-adjusting pipelines that always fetch "last 30 days"

📖 Full documentation: See Variable Substitution Guide for all variable types including environment variables and custom vars.

HTTP Settings

Configure timeouts, retries, and rate limiting:

options:
  http:
    timeout_s: 60                 # Request timeout in seconds (default: 30)

    retries:
      max_attempts: 5             # Total attempts including first (default: 3)
      backoff:
        base_s: 1.0               # Initial wait between retries (default: 1.0)
        max_s: 60.0               # Maximum wait time (default: 60.0)
        exponential_base: 2.0     # Backoff multiplier (default: 2.0) - delay = base_s * (exponential_base ^ attempt)
      retry_on_status:            # HTTP codes to retry (defaults shown below)
        - 429                     # Too Many Requests
        - 500                     # Server Error
        - 502                     # Bad Gateway
        - 503                     # Service Unavailable
        - 504                     # Gateway Timeout

    rate_limit:
      type: auto                  # Respects Retry-After headers (default)
      # OR fixed rate:
      # type: fixed
      # requests_per_second: 2
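The backoff formula in the comment above is easier to reason about with numbers. This Python sketch computes the wait before each retry for a given configuration. It assumes the attempt counter starts at 0 for the first retry and that max_s is applied as a cap. Both are my assumptions about how the formula is applied, and backoff_delays is a hypothetical helper.

```python
def backoff_delays(max_attempts=5, base_s=1.0, max_s=60.0, exponential_base=2.0):
    """Waits before each retry: base_s * exponential_base ** attempt, capped at max_s."""
    retries = max_attempts - 1  # the first attempt is not a retry
    return [
        min(max_s, base_s * exponential_base ** attempt)
        for attempt in range(retries)
    ]


print(backoff_delays())                       # [1.0, 2.0, 4.0, 8.0]
print(sum(backoff_delays(max_attempts=10)))   # total worst-case wait in seconds
```

Summing the list gives a quick estimate of how long a node could stall on a failing endpoint before giving up, which helps when choosing max_attempts.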

Query Parameters

Send parameters with your request:

options:
  params:
    limit: 1000
    status: active
    sort: created_at
    # For search/filter syntax (varies by API):
    search: "report_date:[20240101+TO+20241231]"

Complete Examples

Example 1: openFDA Food Recalls

project: fda_recalls

connections:
  openfda:
    type: http
    base_url: "https://api.fda.gov"
    headers:
      User-Agent: "odibi-pipeline/1.0"

  bronze:
    type: local
    base_path: ./data/bronze

story:
  connection: bronze
  path: stories

system:
  connection: bronze
  path: _system

pipelines:
  - pipeline: food_recalls
    nodes:
      - name: fda_food_recalls
        read:
          connection: openfda
          format: api
          path: /food/enforcement.json
          options:
            params:
              limit: 1000
            pagination:
              type: offset_limit
              offset_param: skip
              limit_param: limit
              limit: 1000
              max_pages: 100
            response:
              items_path: results
              add_fields:
                _fetched_at: "$now"
        write:
          connection: bronze
          format: parquet
          path: fda_food_recalls.parquet
          mode: overwrite

Example 2: GitHub Issues

project: github_data

connections:
  github:
    type: http
    base_url: "https://api.github.com"
    headers:
      Accept: "application/vnd.github+json"
    auth:
      mode: bearer
      token: "${GITHUB_TOKEN}"

  bronze:
    type: local
    base_path: ./data/bronze

story:
  connection: bronze
  path: stories

system:
  connection: bronze
  path: _system

pipelines:
  - pipeline: repo_issues
    nodes:
      - name: issues
        read:
          connection: github
          format: api
          path: /repos/henryodibi11/Odibi/issues
          options:
            params:
              state: all
              per_page: 100
            pagination:
              type: link_header
              link_rel: next
              max_pages: 10
            response:
              items_path: ""    # GitHub returns array at root
        write:
          connection: bronze
          format: parquet
          path: github_issues.parquet
          mode: overwrite

Example 3: OData API

project: odata_example

connections:
  odata_api:
    type: http
    base_url: "https://services.odata.org/V4/Northwind/Northwind.svc"

  bronze:
    type: local
    base_path: ./data/bronze

story:
  connection: bronze
  path: stories

system:
  connection: bronze
  path: _system

pipelines:
  - pipeline: customers
    nodes:
      - name: customers
        read:
          connection: odata_api
          format: api
          path: /Customers
          options:
            params:
              $format: json
              $top: 100
            pagination:
              type: offset_limit
              offset_param: $skip
              limit_param: $top
              limit: 100
              max_pages: 10
            response:
              items_path: value
        write:
          connection: bronze
          format: parquet
          path: customers.parquet
          mode: overwrite

Example 4: Cursor-Based API

project: cursor_api

connections:
  my_api:
    type: http
    base_url: "https://api.example.com"
    auth:
      mode: bearer
      token: "${API_TOKEN}"

  bronze:
    type: local
    base_path: ./data/bronze

story:
  connection: bronze
  path: stories

system:
  connection: bronze
  path: _system

pipelines:
  - pipeline: fetch_data
    nodes:
      - name: records
        read:
          connection: my_api
          format: api
          path: /v1/records
          options:
            params:
              page_size: 500
            pagination:
              type: cursor
              cursor_param: cursor
              cursor_path: pagination.next_cursor
              max_pages: 50
            response:
              items_path: data
        write:
          connection: bronze
          format: parquet
          path: records.parquet
          mode: overwrite

Step-by-Step: Figuring Out a New API

When you get a request to pull data from an API you've never used, here's the workflow:

Step 1: Test the API in Your Browser or Terminal

Before writing any YAML, check that the API works:

# Simple test - just hit the endpoint
curl "https://api.fda.gov/food/enforcement.json?limit=1"

# With an API key (if required)
curl -H "X-API-Key: your-key-here" "https://api.example.com/data"

# With a bearer token
curl -H "Authorization: Bearer your-token" "https://api.example.com/data"

Or just paste the URL in your browser for public APIs.

Step 2: Look at the Response

Save the response and examine it:

curl "https://api.fda.gov/food/enforcement.json?limit=2" > response.json

Open response.json and find:

  1. Where is the data array? Look for [] brackets:
     • At root level: [{...}, {...}] → items_path: ""
     • In a key: {"results": [{...}]} → items_path: results
     • Nested: {"data": {"items": [{...}]}} → items_path: data.items

  2. Is there pagination info? Look for:
     • "total": 25000 → There's more data than you got
     • "next_cursor": "abc123" → Cursor pagination
     • "skip": 0, "limit": 100 → Offset pagination
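If the response is large, scanning it by eye for arrays gets tedious. The Python sketch below walks a parsed JSON document and lists every path that holds a list, along with its length, as candidates for items_path. The helper list_paths is hypothetical and does not descend into lists, so arrays nested inside other arrays are not reported.

```python
import json


def list_paths(node, prefix=""):
    """Yield (path, length) for every list found in a parsed JSON document."""
    if isinstance(node, list):
        yield prefix, len(node)
    elif isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else key
            yield from list_paths(value, child)


sample = json.loads(
    '{"meta": {"results": {"total": 25847}}, "results": [{"id": 1}, {"id": 2}]}'
)
candidates = list(list_paths(sample))
print(candidates)  # [('results', 2)]
```

To run it against the file saved earlier, load response.json with json.load instead of the inline sample. An empty path in the output means the array sits at the root.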

Step 3: Find Pagination from the Docs

Read the API docs and look for:

| Docs say... | Pagination type | Example params |
|---|---|---|
| "skip/limit", "offset/limit" | offset_limit | ?skip=0&limit=100 |
| "page/per_page", "page/size" | page_number | ?page=1&per_page=100 |
| "cursor", "next_token", "continuation" | cursor | ?cursor=abc123 |
| "Link header", "RFC 5988" | link_header | Check response headers |

Step 4: Test Pagination Manually

Verify pagination works before automating:

# Page 1
curl "https://api.example.com/data?skip=0&limit=10"

# Page 2 - should return different data
curl "https://api.example.com/data?skip=10&limit=10"

Step 5: Write Your Odibi YAML

Now you know:

  • ✅ Base URL
  • ✅ Endpoint path
  • ✅ Auth method (if any)
  • ✅ Pagination type and params
  • ✅ Where data lives in response (items_path)

Write the YAML and test with a small max_pages: 2 first.

Step 6: Run and Verify

# Test run with limited pages
python -m odibi run my_pipeline.yaml

# Check the output (replace output.parquet with the path your write step produced)
python -c "import pandas as pd; df = pd.read_parquet('output.parquet'); print(len(df), df.columns.tolist()[:5])"

How to Find API Documentation

When working with a new API, you need to find:

  1. Base URL - Where the API lives (e.g., https://api.fda.gov)
  2. Endpoints - The paths for different data (e.g., /food/enforcement.json)
  3. Authentication - How to prove who you are
  4. Pagination - How to get all the data
  5. Response format - Where the data is in the JSON

Tips for Reading API Docs

| Look for... | Maps to... |
|---|---|
| "Base URL" or "API Host" | base_url in connection |
| "Endpoints" or "Resources" | path in read config |
| "Authentication", "API Keys" | auth in connection |
| "Pagination", "skip/limit", "page/per_page" | pagination in options |
| "Response", "Example response" | items_path in response |

Common API Documentation Sites

  • openFDA: https://open.fda.gov/apis/
  • GitHub: https://docs.github.com/en/rest
  • OData: https://www.odata.org/documentation/

Troubleshooting

"HTTP Error 401: Unauthorized"

  • Check your API key or token
  • Verify the auth mode matches what the API expects
  • Make sure environment variables are set

"HTTP Error 404: Not Found"

  • Check the endpoint path
  • Some APIs need a trailing slash, some don't
  • Verify the base URL is correct

"HTTP Error 429: Too Many Requests"

  • The API is rate limiting you
  • Add rate limiting config:
    http:
      rate_limit:
        type: fixed
        requests_per_second: 1
    

No data returned

  • Check items_path matches where data is in the response
  • Test the API URL in a browser or with curl
  • Look at the raw response:
    curl "https://api.fda.gov/food/enforcement.json?limit=1"
    

"Connection timed out"

  • Increase timeout:
    http:
      timeout_s: 120
    

Quick Reference

read:
  connection: my_api           # HTTP connection name
  format: api                  # Required for API fetching
  path: /endpoint              # API endpoint path
  options:
    method: GET                # HTTP method: GET (default), POST, PUT, PATCH, DELETE
    params:                    # Query parameters (GET) or merged into body (POST)
      key: value
    request_body:              # JSON body for POST/PUT/PATCH requests
      filters:
        status: ["active"]

    pagination:
      type: offset_limit       # offset_limit | page_number | cursor | link_header | none
      # ... pagination-specific options
      max_pages: 100           # Safety limit
      start_offset: 0          # Starting offset (use 1 for 1-indexed APIs)

    response:
      items_path: results      # Dotted path to data array
      add_fields:              # Optional fields to add
        _fetched_at: "${date:now}"

    http:
      timeout_s: 60            # Request timeout (default: 30)
      retries:
        max_attempts: 5        # Total attempts (default: 3)
        backoff:
          base_s: 1.0          # Initial delay (default: 1.0)
          max_s: 60.0          # Max delay (default: 60.0)
          exponential_base: 2.0  # Backoff multiplier (default: 2.0)
        retry_on_status: [429, 500, 502, 503, 504]  # HTTP codes to retry
      rate_limit:
        type: auto             # auto (default) | fixed
        # requests_per_second: 2  # For type: fixed

POST APIs (Advanced)

Some APIs use POST requests with a JSON body for complex queries. Odibi supports this:

read:
  connection: my_api
  format: api
  path: /v1/search
  options:
    method: POST              # Use POST instead of GET
    request_body:             # JSON body to send
      filters:
        Classification: ["Class I"]
        PostedDateFrom: ["${date:-30d}"]
      columns:
        - RecallID
        - FirmName
        - ProductDescription
    pagination:
      type: offset_limit
      offset_param: start     # These go into the JSON body for POST
      limit_param: rows
      limit: 1000
      start_offset: 1         # Some APIs are 1-indexed
    response:
      items_path: result

For POST requests, pagination parameters are automatically added to the JSON body instead of the URL query string.
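As an illustration of what "added to the JSON body" means, the Python sketch below merges the offset and limit into a copy of the request body for each page. The helper build_post_body is hypothetical. The way offsets advance (1, 1001, 2001 for start_offset: 1 and limit: 1000) is my assumption about how a 1-indexed API would be paged, not documented Odibi behaviour.

```python
def build_post_body(request_body, offset, limit,
                    offset_param="start", limit_param="rows"):
    """Return a copy of the body with the pagination values merged in."""
    body = dict(request_body)   # shallow copy so the template is untouched
    body[offset_param] = offset
    body[limit_param] = limit
    return body


template = {"filters": {"Classification": ["Class I"]}, "columns": ["RecallID"]}
pages = [build_post_body(template, offset, 1000) for offset in (1, 1001, 2001)]
print(pages[1])
```

The filters and columns are sent unchanged on every request and only the two pagination keys differ between pages.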

Next Steps