National Register and the National Archives

[[File:Excel thumbnail.jpg|none|thumb]]
=== Overview ===
This page describes a data analysis project to understand the scope of the National Register nomination form files made available on the [https://www.archives.gov/research/electronic-records/historic-preservation National Archives website].
 
The National Park Service provides an Excel file titled 'Everything' on their [https://www.nps.gov/subjects/nationalregister/data-downloads.htm Data Downloads] page. This file includes all National Register of Historic Places Properties ("Listed/Returned/Removed/eligible/ineligible/Approved/Accepted/Rejected"). The columns in this spreadsheet include the following headings: Property Name, State, County, City, Street & Number, Status, Listing Type, Status Date, Restricted Address, Area of Significance, Category of Property, External Link, Level of Significance, Listed Date, Name of Multiple Property Listing, NHL Designated Date, Other Names, Park Name, and Property ID. Other metadata, such as architect name, or [[wikipedia:National_Register_of_Historic_Places#Criteria|National Register Criteria for Evaluation A, B, C or D]], are ''not'' stored in this file.
 
The 'External Link' column has an entry for some (but not all) properties that links to the corresponding file of the National Register nomination form stored on the National Archives website. For example, the Wainwright Building in St. Louis, Missouri, listed in 1968, points to https://catalog.archives.gov/id/63818176. On the NARA page, the PDF is automatically loaded within a frame on the page. A user is also given the option to download the file directly to their desktop.
 
Some of the PDFs on the NARA website contain not just the National Register nomination form, but also related documents such as correspondence to/from SHPOs, public petitions for/against listings, and other archival research and records relevant to the building being nominated. PDF files of nomination forms downloaded through other NPS websites (such as the [https://npgallery.nps.gov/NRHP NPGallery]) or corresponding SHPO online repositories likely don't include these supplementary documents. The National Register nomination form for the Wainwright Building is just 12 pages long, but the PDF stored on the NARA website is 401 pages, and includes a publication titled ''[https://archive.org/details/randall-1967-the-wainwright-building-a-public-appeal-for-preservation The Wainwright Building, A Public Appeal For Preservation].'' A search for the term "A Public Appeal For Preservation" on a search engine like Google shows references to this document and library repositories that hold the title, but no apparent links to download a digitized copy of it.
 
The goals of this analysis project were to:
 
-Gauge the feasibility of downloading all publicly available PDFs to a local drive
 
-Analyze basic characteristics of these PDFs in aggregate
 
-Propose further in-depth content analysis of the files
 
=== Automating the download of all PDFs ===
A command-line script automated the following steps. A copy of the script is available on request from hello@openpreservation.xyz.
 
For each listing in the national-register-everything-20240710.xlsx file available at https://www.nps.gov/subjects/nationalregister/data-downloads.htm:
 
-Look up the corresponding NARA address and load the page's HTML source.
 
-Search within the HTML and record the direct link to the PDF download.
 
For example, listing #9001229 points to NARA https://catalog.archives.gov/id/75320568, which has a direct PDF download link of https://s3.amazonaws.com/NARAprodstorage/lz/electronic-records/rg-079/NPS_NY/09001229.pdf .
 
-Once the list of all NARA pages with PDF links was complete, a separate script downloaded each file individually to a local drive.
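
The lookup and download steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function names are hypothetical, and it assumes the direct link appears verbatim in the page's raw HTML and ends in ".pdf". If the NARA page builds its content client-side, the link may not be present in the raw source and a different lookup would be needed.

```python
import re
import urllib.request

# Assumption: the PDF link is a plain http(s) URL ending in ".pdf".
PDF_LINK = re.compile(r'https?://[^\s"\'<>]+?\.pdf', re.IGNORECASE)


def find_pdf_links(html):
    """Return the unique PDF URLs found in an HTML string, in order of appearance."""
    links = []
    for url in PDF_LINK.findall(html):
        if url not in links:
            links.append(url)
    return links


def fetch_html(url, timeout=30):
    """Load a page and return its source as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def download(url, dest_path, timeout=60):
    """Stream one file to disk in 1 MiB chunks to keep memory use low."""
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest_path, "wb") as out:
        while True:
            chunk = resp.read(1 << 20)
            if not chunk:
                break
            out.write(chunk)
```

In use, each 'External Link' value would be passed to `fetch_html`, the result to `find_pdf_links`, and each recorded link to `download`. Pausing between requests would be a reasonable courtesy to the server.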
 
The Excel file, which includes the links to the NARA website and also the direct PDF link, can be downloaded here: [[:File:NRHP NARA PDF download 2025.04.18.xlsx|https://openpreservation.xyz/wiki/File:NRHP_NARA_PDF_download_2025.04.18.xlsx]]
 
=== Basic characteristics of all NRHP PDFs ===
Once the files were stored on a local disk, a script extracted information about each PDF, including:
 
-file size
 
-page count
 
-other PDF metadata (title, subject, keywords, file creation and modification date, etc.).
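
One way to gather these fields for a single file is sketched below. It assumes the third-party pypdf package, which may not be what the original script used, and the function name `pdf_record` is hypothetical. The optional `reader` argument exists only so the function can be exercised without a real PDF.

```python
import os


def pdf_record(path, reader=None):
    """Collect file size, page count, and a few metadata fields for one PDF."""
    if reader is None:
        # Assumption: pypdf is installed (pip install pypdf).
        from pypdf import PdfReader
        reader = PdfReader(path)
    info = reader.metadata  # may be None when the PDF has no info dictionary
    return {
        "file": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "page_count": len(reader.pages),
        "title": getattr(info, "title", None),
        "subject": getattr(info, "subject", None),
        # Raw date strings are kept as-is; scanned files often carry malformed dates.
        "created": getattr(info, "creation_date_raw", None),
        "modified": getattr(info, "modification_date_raw", None),
    }
```

Running this over every downloaded file and writing one row per record would yield a table of per-file characteristics.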
 
The 'Calculations' tab in the NRHP_NARA_PDF_download_2025.04.18.xlsx file linked above includes a chart of Page Count distribution, as well as some basic summary statistics:
 
-Total page count: 3,124,141
 
-Total documents: 76,092
 
-Average pages per document: 41
 
-Total file size: 3,759 GiB
 
-Average file size: 51 MiB
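
The averages follow directly from the totals. The sketch below shows how such a summary could be computed from per-file records and cross-checks the two reported averages. The `summarize` function and its input shape are illustrative assumptions, not the spreadsheet's actual formulas.

```python
def summarize(records):
    """Summarize an iterable of (page_count, size_bytes) tuples, one per PDF."""
    records = list(records)
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    total_pages = sum(pages for pages, _ in records)
    total_bytes = sum(size for _, size in records)
    return {
        "documents": n,
        "total_pages": total_pages,
        "avg_pages": total_pages / n,
        "total_gib": total_bytes / 2**30,
        "avg_mib": total_bytes / n / 2**20,
    }


# Cross-check of the reported averages using the reported totals.
print(round(3_124_141 / 76_092))     # pages per document -> 41
print(round(3_759 * 1024 / 76_092))  # MiB per document -> 51
```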
 
=== Future work ===
The above data analysis was carried out in April 2025, with the data made available by NPS through July 2024. More recent data through June 2025 still needs to be analyzed using the above steps.
 
A more in-depth analysis of the content of the PDFs is needed. The ~3 million pages of NRHP files likely contain some insightful documents, such as the ''[https://archive.org/details/randall-1967-the-wainwright-building-a-public-appeal-for-preservation The Wainwright Building, A Public Appeal For Preservation]'' example mentioned above. Because these documents are not cataloged or indexed, the exact nature of the content of these 3 million pages is largely unknown. Filtering the NRHP_NARA_PDF_download_2025.04.18.xlsx file by page count is one way to locate the PDF files most likely to contain these hidden gems, since unusually long files tend to include supplementary material beyond the nomination form itself.
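
A page-count filter of this kind could look like the sketch below. It assumes the spreadsheet has been exported to CSV. The column names "Property ID" and "Page Count" and the function name are hypothetical and would need to match the actual file.

```python
import csv
import io


def longest_documents(csv_text, min_pages, page_col="Page Count"):
    """Return rows with at least min_pages pages, longest first."""
    matches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            pages = int(row[page_col])
        except (ValueError, TypeError):
            continue  # skip rows with a blank or non-numeric page count
        if pages >= min_pages:
            matches.append((pages, row))
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in matches]
```

For example, calling `longest_documents(text, 300)` would list only files of 300 pages or more, which could then be reviewed by hand.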
 
Future data analysis of these 76,092 PDFs might include automated keyword extraction and tabulation, as well as other textual analysis, possibly using Large Language Models.
 
The digitized versions of the scanned nomination forms include Optical Character Recognition (OCR) text. OCR software has likely improved since the forms were first digitized. A sample of the 3 million pages could be run through newer OCR technology to see if better results are possible.
 
The files stored on the NARA website are often much larger than the casual researcher needs, and can cause sluggishness when more than one PDF is open on older devices. The roughly 4 TB of data could likely be reduced in size without affecting the legibility of the documents. A project is in progress to compress all files so that they fit on a 1 TB thumb drive or SSD.
 
Reach out to hello@openpreservation.xyz if you have data analysis ideas about the 76,092 PDFs of National Register nomination forms available from the NARA website.