Scrape the Drug Definitions and Links from the NCI Drug Dictionary

Run the full sequence that scrapes, parses, and stores the NCI Drug Dictionary found at cancer.gov, along with any correlates to the NCI Thesaurus, in a Postgres database.

get_dictionary_and_links(
  conn,
  max_page = 50,
  sleep_time = 3,
  verbose = TRUE,
  render_sql = TRUE,
  crawl_delay = 5,
  size = 10000
)

Arguments

conn	Postgres connection object.
max_page	Maximum page number to iterate the scrape over in the "https://www.cancer.gov/publications/dictionaries/cancer-drug?expand=ALL&page=" path. Default: 50
sleep_time	Time in seconds for the system to sleep before each scrape with `read_html`. Default: 3
verbose	When reading from a slow connection, this prints some output on every iteration so you know it's working. Default: TRUE
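
As a rough illustration of how `max_page`, `sleep_time`, and `verbose` interact, the sketch below builds the paged URLs and walks over them. The helpers `build_page_urls()` and `scrape_pages()` and the `fetch` argument are hypothetical and not part of the package, and the sketch assumes pages are numbered consecutively from 1. The network call is replaced by a stand-in so the example runs offline.

```r
# Hypothetical helper (not part of the package): build the page URLs
# that a scrape bounded by `max_page` would visit.
build_page_urls <- function(max_page = 50) {
  base_url <- "https://www.cancer.gov/publications/dictionaries/cancer-drug?expand=ALL&page="
  # Assumption: pages are numbered consecutively starting at 1.
  paste0(base_url, seq_len(max_page))
}

# Hypothetical helper: iterate over the pages, sleeping before each request.
# `fetch` stands in for the real reader (for example `xml2::read_html`).
scrape_pages <- function(max_page = 50, sleep_time = 3, verbose = TRUE,
                         fetch = identity) {
  urls <- build_page_urls(max_page)
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    Sys.sleep(sleep_time)  # pause before each scrape
    if (verbose) message("Scraping page ", i, " of ", length(urls))
    results[[i]] <- fetch(urls[[i]])
  }
  results
}

# Offline demonstration: `identity` simply returns each URL.
out <- scrape_pages(max_page = 3, sleep_time = 0, fetch = identity)
```

Passing the reader in as a function keeps the loop testable without touching the website.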

Value

Any differences found between the scraped data and the existing data in the Drug Dictionary and Drug Link Tables are appended to their respective tables with the local timestamp.
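
The append-only update described above can be sketched with plain data frames. This is a minimal illustration, not the package's implementation: `append_new_rows()`, the column names `drug`, `definition`, and `scraped_datetime`, and the sample rows are all made up for the example, and a real run would compare against the Postgres tables instead.

```r
# Hypothetical helper (not part of the package): append rows of `scraped`
# that are absent from `existing`, stamping them with the local time.
append_new_rows <- function(existing, scraped, key_cols) {
  # Collapse the key columns into one comparison string per row.
  make_key <- function(df) do.call(paste, c(df[key_cols], sep = "\r"))
  is_new <- !(make_key(scraped) %in% make_key(existing))
  new_rows <- scraped[is_new, , drop = FALSE]
  if (nrow(new_rows) == 0) return(existing)
  new_rows$scraped_datetime <- Sys.time()  # assumed timestamp column name
  rbind(existing, new_rows)
}

# Made-up sample data standing in for the stored table.
existing <- data.frame(
  drug = c("drugA", "drugB"),
  definition = c("definition A", "definition B"),
  scraped_datetime = as.POSIXct("2020-01-01 00:00:00", tz = "UTC"),
  stringsAsFactors = FALSE
)

# Made-up sample data standing in for a fresh scrape.
scraped <- data.frame(
  drug = c("drugA", "drugB", "drugC"),
  definition = c("definition A", "definition B", "definition C"),
  stringsAsFactors = FALSE
)

updated <- append_new_rows(existing, scraped, key_cols = c("drug", "definition"))

# A second pass with the same scrape finds no differences.
unchanged <- append_new_rows(updated, scraped, key_cols = c("drug", "definition"))
```

Because both columns form the key, a drug whose definition text changes is treated as a difference and appended as a new timestamped row, while the earlier row is kept.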

Details

Scrapes the definitions and the links to each drug page from the main Drug Dictionary pages at https://www.cancer.gov/publications/dictionaries/cancer-drug and stores the parsed responses in the Drug Dictionary and Drug Link Tables, respectively.

Web Source Types

The NCI Drug Dictionary has 2 data sources that run in parallel. The first source is the Drug Dictionary itself at https://www.cancer.gov/publications/dictionaries/cancer-drug. The other source is the set of individual drug pages, called Drug Detail Links in skyscraper, which contain tables of synonyms, including investigational names.

Drug Dictionary