
API and data snapshot access

By Euan
4 articles

Document vs PDF metadata in the API

How “PDF” and “Policy document” objects in the snapshots and APIs are related, and what metadata is shared between them.

In the Overton web interface every Policy document object is made up of one or more PDF objects.

How Overton parses publication landing pages

Imagine a publication landing page representing a report on a government website. Even though it all relates to the one report, that landing page may link out to a number of different documents:

- An executive summary
- The actual report
- An appendix containing tables and figures

Upon scraping this website Overton would create a Policy document (the entire report) and then three PDF objects, one for each of the three child documents.

Alternatively, one landing page may link out to the “same” document in different languages:

- English
- French

Here Overton would still create a single Policy document and two PDF objects, one for each language.

Despite their name, PDF objects can represent content in other formats too, e.g. Word or HTML.

Metadata shared between vs unique to individual PDFs

Usually the basic metadata for the Policy document (its title, publication date, snippet and so on) comes from the landing page itself. You can learn more about the ways we look for this data in the guidelines for publishers. In some cases we derive this from the first PDF (for example when there’s no clear publication date or title in the landing page metadata).

Those basic metadata fields include:

- title
- translated_title
- policy_document_url
- policy_document_series
- authors
- snippet
- published_on
- policy_source information

Each PDF “inherits” the metadata from the parent Policy document.
Each PDF also has some unique metadata, and more is added as it moves through Overton’s data processing pipeline:

- pdf_url
- pdf_title (where available)
- pdf_thumbnail (where available)
- Language
- Topics, entities and subject areas
- Outgoing references to scholarly or other policy documents
- SDG classifications
- Lists of people / institution pairs mentioned in the text

In turn, in the app each Policy document is considered to contain the union of all of its child PDFs’ subject areas, classifications, outgoing references etc.

Overton’s processing pipeline works on PDFs, not Policy documents, and the snapshots reflect this: each item in the snapshot is a PDF. However, in the overton.io web application & API we group PDFs by their inherited policy_document_id to make things friendlier for end users, and so each item is a Policy document. If the same query matched both the executive summary and the appendix of the example report at the top of this help page, the application & API would return only one result (the common parent Policy document).

Relationship and identifiers

Every Policy document object has at least one PDF child object. If there are no documents linked from the landing page, Overton will either not create a Policy document or scrape the HTML of the landing page itself, as appropriate.

Policy documents are identified by their policy_document_id field in the API and snapshot data. PDFs are identified by their pdf_document_id field. In general the pdf_document_id contains the parent policy_document_id. For example, the PDFs:

- committee_house-2048b061d68144d56dd7bbb006b50f8b-1acf210163a36c9d273b9c703003f8fa
- committee_house-2048b061d68144d56dd7bbb006b50f8b-b0f68ff2a09f76d54daf94cce0dff991

are both child PDFs of the Policy document:

- committee_house-2048b061d68144d56dd7bbb006b50f8b
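
The grouping described above can be sketched in a few lines of Python. This is an illustration only, not Overton's own implementation: the function names are hypothetical, the records are plain dictionaries shaped like snapshot items, and the fallback that strips the final hyphen-separated segment of a pdf_document_id relies on the "in general" ID pattern shown above, so the explicit policy_document_id field should be preferred whenever it is present.

```python
from collections import defaultdict


def parent_id_from_pdf_id(pdf_document_id):
    """Guess the parent ID by dropping the last hyphen-separated segment.

    Assumption: the ID follows the <policy_document_id>-<pdf hash> pattern
    from the example above, which is only said to hold "in general".
    """
    return pdf_document_id.rsplit("-", 1)[0]


def group_by_policy_document(pdf_records):
    """Group PDF records by parent and union their outgoing DOI references."""
    grouped = defaultdict(lambda: {"pdf_document_ids": [], "dois_cited": set()})
    for record in pdf_records:
        pdf_id = record["pdf_document_id"]
        # Prefer the explicit field; fall back to the ID pattern if it is missing.
        policy_id = record.get("policy_document_id") or parent_id_from_pdf_id(pdf_id)
        parent = grouped[policy_id]
        parent["pdf_document_ids"].append(pdf_id)
        # Fields may be null or absent, so treat both as an empty list.
        parent["dois_cited"].update(record.get("dois_cited") or [])
    return dict(grouped)
```

The same pattern extends to any other list-valued field whose items are plain strings, such as policy_document_ids_cited.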

Last updated on Jun 25, 2026

Overton data snapshots

A description of Overton’s bulk data snapshot format.

Data snapshots provide access to the Overton database in a machine-readable format, allowing you to import it into your own database or BI system.

Data snapshots are not publicly available. We generate them for a defined list of customers. If you are unsure whether you should have access, please contact your organization’s account holder. If your organization does not yet have a subscription to Overton Index, please reach out via our subscriptions page.

Each snapshot is a tar.gz file that contains a set of JSONL files with document metadata and a single text file with the citation graph. We provide the text file for convenience, but you can also rebuild the graph directly from the JSON metadata of each document.

Document metadata

The JSONL files are named overton_docs_xxxxxxxxx.json (where xxxxxxxxx is a zero-padded sequence number starting at 1), are UTF-8 encoded and contain one JSON record per line. Every file other than the final one should contain 999 records.

Each JSON record contains the metadata for a different policy PDF indexed in Overton.

IMPORTANT: a single policy document may have multiple PDFs associated with it, e.g. an executive summary, an appendix or different language versions. Overton indexes each one separately but aggregates them in the web interface. You can do this too, by grouping on the policy_document_id key.

Metadata schema

Be careful: it is not guaranteed that every record contains every field, and empty fields may contain null or empty strings / arrays. Fields may appear in any order inside a record.

pdf_document_id

This is the primary key of the record.

policy_document_id

Every PDF belongs to exactly one policy document in Overton. Each policy document has a unique ID shown in this field.

title, snippet, authors, published_on, policy_document_url

These are the title, abstract (where available), authors, publication date and web address of the relevant parent policy document.
The publication date uses YYYY-MM-DD format.

translated_title, language

Overton tries to detect languages automatically, but falls back to English. The language codes are in ISO 639-2 format (three-letter codes, e.g. “eng” for English). Where the language isn’t English we provide a machine-translated version of the title in the translated_title field.

policy_document_series, overton_policy_document_series, classifications, topics, entities

Often a policy source will group documents into a series (“Commodity Market Reports”, “Working Papers” etc.). This is stored where available in the policy_document_series field. Because languages, names, and spellings of these series types vary across and even within sources, Overton maps common series types (working papers, blogs, transcripts, and clinical guidelines) to a low-cardinality overton_policy_document_series field.

Classifications (subject areas), topics and entities are JSON arrays and are described in more detail here.

pdf_url, pdf_title, pdf_thumbnail

This metadata is specific to PDFs. It includes the URL where Overton found the PDF, its title (if available; this field is usually empty), and a thumbnail image. We can provide thumbnails in a separate snapshot file if required, but please do not hotlink them from overton.io.

policy_source_id, policy_source_title, policy_source_type, policy_source_region, policy_source_country

These fields contain more information about the source of the policy document: a unique ID (policy_source_id), its title, type and the country and region it is from.

policy_document_ids_cited, mentions_people, dois_cited

This is the citation graph.

policy_document_ids_cited is a JSON array containing the set of policy_document_id keys representing policy documents cited by this PDF (note: these are not pdf_document_id keys, they are the keys of the parent policy documents).

dois_cited is a JSON array containing the DOIs that are cited by this PDF.
mentions_people is a JSON array containing any academic name mentions that we’ve found in the document. Note that this isn’t the same as entity extraction. You can read more about our name finding process on the relevant help page.

overton_document_url

This is the web address of the policy document on overton.io.

Citation graph

You can build the citation graph directly from the JSONL files but we also include a text file for convenience.

The text file is tab delimited and contains one citing document -> cited document pair per line. The file has three columns:

- Citing document ID
- Citation type
- Cited document ID

The citation type is either ‘doc’ (a policy document citing another policy document) or ‘doi’ (a policy document citing a scholarly document). When the type is ‘doi’, the cited document ID is a DOI. Note that these are not all Crossref DOIs; many come from DataCite or the EU Publications Office.

See the data gotcha below for details on entries in the mapping that do not appear in the snapshot files.

Data gotchas

Odd character encoding & the UTF-8 replacement character in titles

Overton makes a best-effort guess at policy document titles from the available metadata, but sometimes falls back on parsing text from the body of a PDF or using OCR. This method works poorly on non-English documents and on files already OCRed at the source by an older system, where the text is often sufficient for searching but not suitable for humans. As a result you may encounter UTF-8 encoding errors in strings. We currently strip out the UTF-8 replacement character in the data dump to make parsing easier.

Missing document IDs in policy_document_ids_cited

Occasionally the policy_document_ids_cited field may contain policy document IDs that aren’t in the dump or the Overton web app. This occurs when a solid citation (usually a link) points to a document in our database whose metadata we do not trust.
This often happens because its title or date fails our data sanity checks, or because we could not fully fetch it from the source website. We do not include these ‘ghost’ documents in database dumps, and they do not appear in the web interface. We keep their IDs in policy_document_ids_cited because the citation is valid, even though we cannot display the policy document being cited.
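
To make the snapshot layout more concrete, here is a minimal Python sketch that streams records out of a snapshot archive, rebuilds citation edges from the JSON metadata, and lists cited IDs that have no record of their own (the ‘ghost’ documents described above). It is a sketch under assumptions, not an official loader: the function names are hypothetical, the archive path is whatever file you were given, and keying edges on the citing policy_document_id is a choice made here that may not match the columns of the bundled text file exactly.

```python
import json
import tarfile


def parse_jsonl(lines):
    """Yield one record per non-blank line of a JSONL stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_snapshot_records(snapshot_path):
    """Stream every PDF record from the overton_docs_*.json members."""
    with tarfile.open(snapshot_path, "r:gz") as archive:
        for member in archive:
            # Members may sit inside a folder, so compare the base name only.
            name = member.name.rsplit("/", 1)[-1]
            if not member.isfile():
                continue
            if not (name.startswith("overton_docs_") and name.endswith(".json")):
                continue
            handle = archive.extractfile(member)
            yield from parse_jsonl(raw.decode("utf-8") for raw in handle)


def citation_edges(records):
    """Yield (citing policy_document_id, citation type, cited ID) tuples."""
    for record in records:
        citing = record.get("policy_document_id")
        if not citing:
            continue
        # Fields may be missing or null, so fall back to an empty list.
        for cited in record.get("policy_document_ids_cited") or []:
            yield (citing, "doc", cited)
        for doi in record.get("dois_cited") or []:
            yield (citing, "doi", doi)


def ghost_ids(records):
    """Return cited policy document IDs that have no record of their own.

    Note: this holds all records in memory; for a full snapshot, collecting
    the two ID sets in a single streaming pass would be lighter.
    """
    records = list(records)
    known = {r.get("policy_document_id") for r in records}
    cited = {c for r in records for c in (r.get("policy_document_ids_cited") or [])}
    return cited - known
```

Because several PDFs can share one parent, the same edge may be yielded more than once; wrapping the result in a set removes the duplicates if you need them gone.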

Last updated on Jun 25, 2026

Overview of using Overton’s API

The data in the Overton Index is available in a machine-readable format through our REST API, which sits between our database and the web application. The API returns JSON and requires an API key.

This article covers how to access the API and interpret results. For our technical documentation on the API and how to find your API key, please see our Swagger document.

See: [Technical Documentation for Overton's REST API (Swagger)](https://app.overton.io/swagger.php?)

Our general guide on using our API will be helpful for users wanting to understand how the API works and what it can be used for.

See: [Guide - How to use Overton's API](https://www.overton.io/guide-how-to-use-overtons-api)

Access

API access is not enabled by default on accounts, and your subscription type determines access. To check if your account has API access, go to the grey action bar above the search results and hover over ‘Export’. If you see the option ‘Generate API call’, then your account has API access. If you do not see this option and want to know if API access is available, please contact [email protected].

To generate the API call for the page you are currently viewing in the app, click ‘Generate API call’.

Best practice and return codes

Call the API no more than once per second. We enforce rate limiting with some leeway: occasional faster calls are fine, but if you exceed the limit too often, the system may automatically block your API key. When rate limiting occurs, the API returns a 429 HTTP status code and empty results.

Interpreting results

Search API results are generally broken up into three sections:

The query

The query object shows the number of pages that the search can return (note that your account may include a page limit). To select a page use the &page=x parameter, where x is a valid page number.

Facets

The Facets key contains roll-up information for various fields.
This is what is displayed in the left hand sidebar on Overton’s search pages. Please note that facets aren’t included by default. If you would like them added to your queries by default, let us know by contacting [email protected].

Results

The Results key contains the actual results of your query.

The **pdf_document_id** is the unique key for the document: a single policy document may contain multiple PDFs when e.g. there’s an executive summary, or different language versions.

**document_url** is the landing page for the policy document (the web page that typically shows authors, an abstract etc.), while **pdf_url** is the link to the actual PDF.

For licensing reasons the API does not include full text content of PDFs. To obtain this you must use the **pdf_url** field and fetch and process the relevant PDF yourself. For some documents **pdf_url** is not present, or is the same as the **document_url**: these are policy documents that are only available in HTML and so data must be scraped from there.

Please do not hotlink to PDF thumbnails, as the paths may change without notice.

Topics, entities, classifications and COFOG (also referred to as subject areas) are covered in more detail in a separate help article.

API vs. data snapshots

For large projects or where you need to get at a lot of data quickly the data snapshot may be a better option than using the API.

See: [Overton's data snapshots](https://help.overton.io/en/articles/4235577-overton-data-snapshots)

If you think a data snapshot will be more useful for your project, please contact [email protected] to discuss your needs.

API Resources

- Technical documentation for Overton’s REST API
- Guide – How to get started using Overton’s API
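
The paging and rate-limit guidance in this article (at most one call per second, a 429 status when throttled, and the &page=x parameter) can be sketched in Python using only the standard library. This is a hedged illustration rather than an official client: all function names are hypothetical, the URL is assumed to be one you copied from ‘Generate API call’ (and so already carries whatever parameters your account needs), and the retry count and back-off times are arbitrary choices.

```python
import json
import time
import urllib.error
import urllib.request
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return the URL with its page query parameter set to the given page."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "page"]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def http_get_json(url):
    """Fetch a URL and return (HTTP status, parsed JSON or None on error)."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.status, json.load(response)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses, including 429.
        return err.code, None


def fetch_pages(url, pages, get=http_get_json, sleep=time.sleep,
                min_interval=1.0, max_retries=3):
    """Yield the parsed JSON for each page, pausing between calls."""
    for page in pages:
        for attempt in range(max_retries + 1):
            status, body = get(with_page(url, page))
            sleep(min_interval)  # stay at or below one call per second
            if status != 429:
                break
            sleep(min_interval * 2 ** attempt)  # extra back-off when throttled
        if status != 200:
            raise RuntimeError(f"page {page} failed with HTTP status {status}")
        yield body
```

The get and sleep arguments exist so the pacing logic can be exercised without touching the network; in normal use the defaults are enough, for example `for payload in fetch_pages(generated_url, range(1, 4)): ...`, where generated_url is the link copied from the app.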

Last updated on Jun 25, 2026