R/cg-internals.R
get_dictionary_and_links.Rd
Run the full sequence that scrapes, parses, and stores the NCI Drug Dictionary found at cancer.gov, along with any correlates to the NCI Thesaurus, in a Postgres database.
get_dictionary_and_links(conn, max_page = 50, sleep_time = 3, verbose = TRUE, render_sql = TRUE, crawl_delay = 5, size = 10000)
Argument | Description |
---|---|
conn | Postgres connection object. |
max_page | Maximum page number to iterate the scrape over in the "https://www.cancer.gov/publications/dictionaries/cancer-drug?expand=ALL&page=" path. Default: 50 |
sleep_time | Time in seconds for the system to sleep before each scrape with |
verbose | When reading from a slow connection, this prints some output on every iteration so you know it's working. |
Any differences found between the scraped data and the existing data in the Drug Dictionary and Drug Link Tables are appended to their respective tables with the local timestamp.
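The diff-and-append behavior can be illustrated with a minimal sketch in which in-memory data frames stand in for the Postgres tables. The helper `new_rows()`, the column names (`drug`, `definition`, `scraped_datetime`), and the sample rows are all hypothetical; the package's actual comparison logic and table layout may differ.

```r
# Hypothetical stand-ins for the existing table and a fresh scrape.
existing <- data.frame(
  drug = c("Drug A", "Drug B"),
  definition = c("Definition A", "Definition B"),
  stringsAsFactors = FALSE
)
scraped <- data.frame(
  drug = c("Drug A", "Drug B", "Drug C"),
  definition = c("Definition A", "Revised definition B", "Definition C"),
  stringsAsFactors = FALSE
)

# Return the scraped rows that have no exact match in the existing data,
# stamped with the local time at which the difference was detected.
new_rows <- function(scraped, existing) {
  # Build a composite key per row so rows can be compared as a whole.
  key <- function(df) paste(df$drug, df$definition, sep = "\t")
  diff <- scraped[!(key(scraped) %in% key(existing)), , drop = FALSE]
  # rep() keeps the assignment valid even when there are zero new rows.
  diff$scraped_datetime <- rep(Sys.time(), nrow(diff))
  diff
}

to_append <- new_rows(scraped, existing)
print(to_append)
```

Against a live database, rows like `to_append` could plausibly be written with `DBI::dbAppendTable()`, so earlier versions of a definition stay in the table alongside the newer timestamped ones.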
Scrapes the definitions and the links to each drug page from the main Drug Dictionary pages at https://www.cancer.gov/publications/dictionaries/cancer-drug and stores the parsed responses in the Drug Dictionary and Drug Link Tables, respectively.
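How `max_page`, `sleep_time`, and `verbose` might interact during the page iteration can be sketched as follows. The base URL is the path given in the `max_page` argument description; `page_urls()`, `scrape_pages()`, and the `visit` callback are hypothetical names introduced for illustration and are not part of the package. The default `visit` simply returns the URL, so the sketch runs without network access.

```r
base_url <- "https://www.cancer.gov/publications/dictionaries/cancer-drug?expand=ALL&page="

# Build one URL per dictionary page, from page 1 up to max_page.
page_urls <- function(max_page = 50) {
  paste0(base_url, seq_len(max_page))
}

# Visit each URL in turn, sleeping before every request.
# `visit` is a placeholder for whatever function fetches and parses a page.
scrape_pages <- function(urls, sleep_time = 3, verbose = TRUE,
                         visit = function(url) url) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    Sys.sleep(sleep_time)
    if (verbose) {
      # Progress output on every iteration, useful on slow connections.
      message(sprintf("[%d/%d] %s", i, length(urls), urls[i]))
    }
    results[[i]] <- visit(urls[i])
  }
  results
}

urls <- page_urls(max_page = 3)
pages <- scrape_pages(urls, sleep_time = 0, verbose = FALSE)
```

In a real run, `visit` would be replaced by a function that reads the HTML at the URL and extracts the drug names, definitions, and links.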
The NCI Drug Dictionary has two data sources that run in parallel. The first source is the Drug Dictionary itself at https://www.cancer.gov/publications/dictionaries/cancer-drug. The other source is the set of individual drug pages, called Drug Detail Links in skyscraper, which contain tables of synonyms, including investigational names.
The listed drug names and their definitions are scraped from the Drug Dictionary HTML and written to a Drug Dictionary Table in a cancergov schema.