HACKER Q&A
📣 rllearner

How to Structure Gnarly PDFs


I'm trying to compile a time series of publicly listed stocks stretching back to 2005. I'm doing this by parsing the semi-annual reports (N-CSR filings) from a mutual fund complex that includes a large index fund (VTI). The reports are HTML, with very different formats over the years. Each one renders to 500 PDF pages.

I initially tried passing the full PDF to the well-known parsing platforms, without much luck. I then manually located the holdings tables I'm interested in (50 of the 500 pages in each PDF) and ran only those pages through the same platforms, again without much luck.

Any advice from the community?


  👤 AlbertoGP Accepted Answer ✓
I might be missing something, but parsing the HTML, even with the different formats, should be much simpler than the PDF form.

Over 20 years I would guess they used no more than 20 formats, which is doable even if you write the XPath by hand (perhaps CSS selectors would suffice).
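
To give a rough idea of what I mean, here is a minimal sketch that uses only Python's standard library `html.parser` instead of XPath, just so it runs without installing anything. The column layout (name, shares, market value), the `SAMPLE` markup, and the names `TableExtractor`, `parse_number` and `extract_holdings` are all my assumptions for illustration, not the actual filing format. It also ignores nested tables, which older filings may well use, so treat it as one per-format extractor out of the handful you would end up writing.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell text.

    Nested tables are not handled in this sketch.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables
        self._rows = None  # rows of the table currently being read
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace (including non-breaking spaces).
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def parse_number(text):
    """Turn '1,234' or '(1,234)' into an int; return None if not numeric."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    cleaned = cleaned.strip("()")
    if not cleaned.isdigit():
        return None
    return -int(cleaned) if negative else int(cleaned)


def extract_holdings(html):
    """Return (name, shares, value) tuples for rows that look like holdings.

    Assumed layout: after dropping empty spacer cells, a holdings row has
    exactly three cells: name, shares, market value.
    """
    parser = TableExtractor()
    parser.feed(html)
    holdings = []
    for table in parser.tables:
        for row in table:
            cells = [c for c in row if c]  # drop empty spacer cells
            if len(cells) != 3:
                continue
            shares, value = parse_number(cells[1]), parse_number(cells[2])
            if shares is not None and value is not None:
                holdings.append((cells[0], shares, value))
    return holdings


# Made-up markup standing in for one report format.
SAMPLE = """
<table>
  <tr><th>Common Stocks</th><th>Shares</th><th>Market Value ($000)</th></tr>
  <tr><td>Example Corp.</td><td>&nbsp;</td><td>1,200</td><td>45,678</td></tr>
  <tr><td>Sample Inc.</td><td>300</td><td>9,001</td></tr>
</table>
"""

for name, shares, value in extract_holdings(SAMPLE):
    print(name, shares, value)
```

The header row drops out on its own because its "Shares" cell is not numeric. If you do go the XPath or CSS selector route with a third-party library, the structure would probably stay the same: one small extractor per format, all returning the same tuple shape, so the time series code never has to care which year a report came from.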

Do you mean that the mutual fund complex includes many funds, and that you get as many different formats for the same time period?