Run the full sequence that scrapes, parses, and stores the NCI Drug Dictionary found at CancerGov.org, along with any correlates to the NCI Thesaurus, in a Postgres database.

get_dictionary_and_links(
  conn,
  max_page = 50,
  sleep_time = 3,
  verbose = TRUE,
  render_sql = TRUE,
  crawl_delay = 5,
  size = 10000
)

Arguments

conn

Postgres connection object.

max_page

Maximum page number to iterate the scrape over; each page number is appended to the "https://www.cancer.gov/publications/dictionaries/cancer-drug?expand=ALL&page=" path. Default: 50

sleep_time

Time in seconds for the system to sleep before each scrape with read_html. Default: 3

verbose

When reading from a slow connection, this prints some output on every iteration so you know it's working. Default: TRUE

Value

Any differences found between the scraped data and the existing data in the Drug Dictionary and Drug Link Tables are appended to their respective tables with the local timestamp.
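
The append-only update described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's actual implementation: `rows_to_append()`, the column names `drug` and `definition`, and the timestamp column `scrape_datetime` are all hypothetical.

```r
# Hypothetical helper: return the scraped rows that are not already present
# in the existing table, stamped with the local time.
rows_to_append <- function(scraped, existing, key_cols) {
  # Collapse the comparison columns into one key string per row.
  make_key <- function(df) {
    do.call(paste, c(unname(as.list(df[key_cols])), sep = "\r"))
  }
  is_new <- !(make_key(scraped) %in% make_key(existing))
  new_rows <- scraped[is_new, , drop = FALSE]
  # Assumed timestamp column name; the real table may use a different one.
  new_rows$scrape_datetime <- rep(Sys.time(), nrow(new_rows))
  new_rows
}

existing <- data.frame(drug = c("a", "b"),
                       definition = c("x", "y"),
                       stringsAsFactors = FALSE)
scraped <- data.frame(drug = c("a", "b", "c"),
                      definition = c("x", "y2", "z"),
                      stringsAsFactors = FALSE)

# Only the changed definition for "b" and the new drug "c" are kept.
to_append <- rows_to_append(scraped, existing, c("drug", "definition"))
```

Because existing rows are never overwritten, a changed definition shows up as an additional row with a later timestamp. The resulting data frame could then be written with a call such as `DBI::dbAppendTable()`, although the function documented here may use a different mechanism.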

Details

Scrapes the definitions and the links to each Drug Page from the main Drug Dictionary pages at https://www.cancer.gov/publications/dictionaries/cancer-drug and stores the parsed responses in the Drug Dictionary and Drug Link Tables, respectively.
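
The paged scrape controlled by max_page, sleep_time, and verbose can be sketched roughly as follows. This is a minimal illustration, not the package's actual implementation: `build_page_urls()` and `scrape_pages()` are hypothetical helpers, page numbering is assumed to start at 1, and the `fetch` argument is an assumption added so the loop can be tried without network access.

```r
# Hypothetical helper: one URL per dictionary page, assuming pages start at 1.
build_page_urls <- function(max_page = 50) {
  base_url <- paste0(
    "https://www.cancer.gov/publications/dictionaries/",
    "cancer-drug?expand=ALL&page="
  )
  paste0(base_url, seq_len(max_page))
}

# Hypothetical helper: visit each URL in turn, sleeping before every request.
# `fetch` defaults to xml2::read_html but can be swapped for a stub.
scrape_pages <- function(urls, sleep_time = 3, verbose = TRUE,
                         fetch = xml2::read_html) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    Sys.sleep(sleep_time)  # pause before each scrape
    if (verbose) {
      message(sprintf("[%d/%d] %s", i, length(urls), urls[i]))
    }
    results[[i]] <- fetch(urls[i])
  }
  results
}

# Dry run with a stub in place of the real HTTP request.
urls <- build_page_urls(max_page = 3)
pages <- scrape_pages(urls, sleep_time = 0, verbose = FALSE,
                      fetch = function(u) nchar(u))
```

Sleeping before each request, rather than after, means even the first page is fetched only after the pause, which matches the description of sleep_time above.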

Web Source Types

The NCI Drug Dictionary has two data sources that run in parallel. The first source is the Drug Dictionary itself at https://www.cancer.gov/publications/dictionaries/cancer-drug. The other source is the set of individual drug pages, called Drug Detail Links in skyscraper, which contain tables of synonyms, including investigational names.

Drug Dictionary

The listed drug names and their definitions are scraped from the Drug Dictionary HTML and written to a Drug Dictionary Table in a cancergov schema.

See also