# Bulk Enrichment

This guide covers how to efficiently enrich large datasets — from thousands to hundreds of thousands of companies — using the HG API.

## How It Works

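The `/companies/enrich` endpoint accepts up to 25 companies per request, and the API allows 25 requests per second. At that ceiling you can enrich up to **625 companies per second** (25 × 25), or roughly **100,000 companies in under 3 minutes**. The times below are best-case figures that assume full batches and a sustained 25 requests per second: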

| Dataset Size | Estimated Time |
| --- | --- |
| 1,000 companies | ~2 seconds |
| 10,000 companies | ~16 seconds |
| 100,000 companies | ~3 minutes |
| 500,000 companies | ~14 minutes |
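
The estimates in the table follow from simple arithmetic: split the dataset into batches of 25, then divide the number of batches by 25 requests per second. A minimal sketch of that calculation is shown here; `estimate_seconds` is an illustrative name, and the result is a lower bound that ignores network latency and retries:

```python
import math

BATCH_SIZE = 25           # companies per request
REQUESTS_PER_SECOND = 25  # API rate limit

def estimate_seconds(num_companies: int) -> float:
    """Best-case enrichment time in seconds, ignoring latency and retries."""
    num_requests = math.ceil(num_companies / BATCH_SIZE)
    return num_requests / REQUESTS_PER_SECOND

for size in (1_000, 10_000, 100_000, 500_000):
    print(f"{size:>7,} companies: ~{estimate_seconds(size):.0f} s")
```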


## Before You Start

### 1. Check your credit balance

Large enrichments consume significant credits. Use the Credits API to verify you have enough before starting:

```bash
curl https://api.hginsights.com/data-api/v2/credits \
  -H "Authorization: Bearer $HG_API_KEY"
```

### 2. Only request what you need

Each field group adds to your credit consumption. If you only need firmographics, don't request technographics and spend:

```json
{
  "companies": {"domains": ["walmart.com"]},
  "fields": ["firmographics"]
}
```

### 3. Use domains or HG IDs

Identify companies by domain or HG ID when you enrich. If you only have company names, use the **Company Match** endpoint first to resolve them, then enrich using the returned IDs.

## Implementation

### Step 1: Split into batches of 25

```python
import requests

API_KEY = "your_api_key"
BASE_URL = "https://api.hginsights.com/data-api/v2/companies/enrich"
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

domains = ["walmart.com", "google.com", ...]  # your full list

# Split into chunks of 25
batches = [domains[i:i+25] for i in range(0, len(domains), 25)]
```
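
Duplicate or inconsistently formatted domains in the input may be sent, and possibly charged, more than once. As a sketch, you might clean the list before batching. The helper names `normalize_domains` and `chunked` are illustrative rather than part of the API, and the normalization rules are an assumption about what your input looks like:

```python
from typing import Iterable, Iterator

def normalize_domains(raw: Iterable[str]) -> list[str]:
    """Lowercase, trim, and de-duplicate domains, keeping first-seen order."""
    seen = set()
    cleaned = []
    for domain in raw:
        d = domain.strip().lower()
        if d and d not in seen:
            seen.add(d)
            cleaned.append(d)
    return cleaned

def chunked(items: list[str], size: int = 25) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

example = normalize_domains(["Walmart.com", " walmart.com ", "google.com", ""])
example_batches = list(chunked(example, size=25))
```

A generator like `chunked` also avoids building every batch in memory up front, which helps with very large input lists.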

### Step 2: Send requests with rate limiting

Stay within the 25 requests/second limit. A simple approach is to add a small delay between requests:

```python
import time

results = []

for batch in batches:
    payload = {
        "companies": {"domains": batch},
        "fields": ["firmographics", "technographics", "spend"]
    }

    response = requests.post(BASE_URL, json=payload, headers=HEADERS)

    if response.status_code == 429:
        # Rate limited: wait, then retry once
        time.sleep(2)
        response = requests.post(BASE_URL, json=payload, headers=HEADERS)

    # Raise on any remaining error (including a second 429) instead of
    # silently skipping the batch or parsing an error body
    response.raise_for_status()
    results.extend(response.json()["companies"])

    time.sleep(0.05)  # at most ~20 requests/second, safely under the limit
```

### Step 3: Handle rate limits gracefully

If you hit a 429 (Too Many Requests) response, back off and retry:

```python
def enrich_with_retry(payload, max_retries=3):
    for attempt in range(max_retries):
        response = requests.post(BASE_URL, json=payload, headers=HEADERS)

        if response.status_code == 200:
            return response.json()["companies"]
        elif response.status_code == 429:
            wait_time = 2 ** attempt  # 1s, 2s, 4s
            time.sleep(wait_time)
        else:
            response.raise_for_status()

    raise RuntimeError(f"Still rate limited after {max_retries} attempts")
```
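
Two common refinements to this backoff are honoring a `Retry-After` header when the server sends one, and adding random jitter so that parallel workers don't all retry at the same instant. Whether the HG API returns `Retry-After` is an assumption here; the sketch falls back to exponential backoff when the header is absent or unparseable. `backoff_delay` is a hypothetical helper name:

```python
import random
from typing import Optional

def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait before retrying after attempt number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Assumes Retry-After is a number of seconds; the HTTP-date
            # form is not parsed in this sketch
            return max(0.0, min(float(retry_after), cap))
        except ValueError:
            pass
    # Exponential backoff (1s, 2s, 4s, ...) capped at `cap`,
    # plus up to 50% random jitter on top
    delay = min(base * (2 ** attempt), cap)
    return delay + random.uniform(0, delay / 2)
```

Inside a retry loop, this could replace a fixed wait with `time.sleep(backoff_delay(attempt, response.headers.get("Retry-After")))`.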

## Parallel Requests

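For maximum throughput, send requests in parallel using a thread pool. This example sends up to 20 concurrent requests. Note that the worker count caps concurrency, not request rate: if responses come back in well under a second, 20 workers can exceed 25 requests per second, so keep the 429 handling in place: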

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def enrich_batch(batch):
    payload = {
        "companies": {"domains": batch},
        "fields": ["firmographics", "technographics"]
    }
    response = requests.post(BASE_URL, json=payload, headers=HEADERS)
    if response.status_code == 429:
        time.sleep(2)
        response = requests.post(BASE_URL, json=payload, headers=HEADERS)
    response.raise_for_status()
    return response.json()["companies"]

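results = []
failed_batches = []  # batches to retry later
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = {executor.submit(enrich_batch, batch): batch for batch in batches}
    for future in as_completed(futures):
        try:
            results.extend(future.result())
        except requests.RequestException:
            # Keep going; record the batch so it can be retried on its own
            failed_batches.append(futures[future])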
```
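
Because a thread pool limits concurrency rather than request rate, one option is to share a small rate limiter across the workers. The sketch below is one way to do that, not an HG-provided utility; `RateLimiter` and its parameters are illustrative names. It reserves evenly spaced time slots so that calls start roughly `rate` times per second at most:

```python
import threading
import time

class RateLimiter:
    """Thread-safe limiter that spaces calls about 1/rate seconds apart."""

    def __init__(self, rate: float, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def acquire(self) -> None:
        """Reserve the next free time slot, then wait until it arrives."""
        with self._lock:
            now = self._clock()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval
        # Sleep outside the lock so other threads can reserve their slots
        wait = slot - now
        if wait > 0:
            self._sleep(wait)

limiter = RateLimiter(rate=20)  # a little under the 25 requests/second limit
```

To use it, call `limiter.acquire()` immediately before each `requests.post(...)` in the worker function. Setting the rate below 25 leaves headroom for scheduling delays and retries.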

## Handling Large Technographics Responses

Some companies have thousands of technology installs. The API paginates technographics results — use the `pagination` parameter to retrieve all installs:

```json
{
  "companies": {"domains": ["walmart.com"]},
  "fields": ["technographics"],
  "pagination": {
    "technographics": {"limit": 100, "offset": 0}
  }
}
```

Increment `offset` by `limit` until you've retrieved all installs (check `installs_count` in the response).
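
The paging loop described above can be sketched as follows. The request function is passed in so the loop stays independent of HTTP details. `iter_installs`, `fetch_page`, and the `installs` key are hypothetical names, since the exact response shape isn't shown in this guide; only `limit`, `offset`, and `installs_count` come from the text above:

```python
from typing import Callable, Iterator

def iter_installs(fetch_page: Callable[[int, int], dict],
                  limit: int = 100) -> Iterator[dict]:
    """Yield every install by advancing `offset` in steps of `limit`.

    `fetch_page(limit, offset)` is assumed to return a dict shaped like
    {"installs_count": <total>, "installs": [...]} (hypothetical shape).
    """
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        installs = page.get("installs", [])
        yield from installs
        offset += limit
        # Stop once past the reported total, or on an empty page
        if not installs or offset >= page.get("installs_count", 0):
            break
```

In practice, `fetch_page` would build the request body with the `pagination` block, send it, and pull the relevant fields out of the response for a single company.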

## Tips

- **Start small** — Test with 100 companies first to validate your pipeline and estimate credit usage
- **Log failures** — Track which batches fail so you can retry them without re-processing the entire dataset
- **Store results** — Write results to a database or file as you go, rather than holding everything in memory
- **Filter technographics** — If you only need specific vendors or products, use filters to reduce response size and credit consumption
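
The logging and storage tips can be combined in a small driver loop. This is a sketch under assumptions: `run_batches` is an illustrative name, `enrich_fn` stands in for any function that takes a batch of domains and returns a list of company records, and JSON Lines files are just one convenient storage choice:

```python
import json
from typing import Callable, Iterable

def run_batches(batches: Iterable[list],
                enrich_fn: Callable[[list], list],
                results_path: str = "results.jsonl",
                failures_path: str = "failed_batches.jsonl") -> int:
    """Write results as they arrive and record failed batches for a later retry.

    Returns the number of failed batches.
    """
    failed = 0
    with open(results_path, "a", encoding="utf-8") as out, \
         open(failures_path, "a", encoding="utf-8") as err:
        for batch in batches:
            try:
                companies = enrich_fn(batch)
            except Exception as exc:  # broad on purpose: log and keep going
                err.write(json.dumps({"batch": batch, "error": str(exc)}) + "\n")
                failed += 1
                continue
            for company in companies:
                out.write(json.dumps(company) + "\n")
    return failed
```

Because both files are opened in append mode, a later retry run can read the batches from the failures file and add its results to the same results file.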


## Need Recurring Large-Scale Data?

If you're enriching large datasets on a regular basis, the API may not be the most cost-effective option. HG Insights also offers bulk data deliveries tailored to the companies you care about. Contact your account manager to learn about dataset licensing options.