I initially tried passing the full PDF to the well-known parsing platforms, without much luck. I then manually located the holdings tables I'm interested in (50 of the 500 pages in each PDF) and tried the same platforms on just those pages, again without much luck.
Any advice from the community?
Over 20 years, I would guess they used no more than 20 formats, which is doable even if you write the XPath by hand (perhaps CSS selectors would suffice).
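To make that concrete, here is a rough sketch of the "one hand-written XPath per format" idea, assuming the filings can be obtained as well-formed HTML/XML markup. The layout names, XPath expressions, column positions, and sample markup are all made up for illustration; real filings would likely need a more forgiving HTML parser than the standard library's.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: one entry per layout the fund complex has used.
# Each entry holds the XPath that selects the table rows and the column
# positions of the fields we care about.
FORMATS = {
    "layout_a": {"rows": ".//table[@class='holdings']/tr", "name": 0, "value": 2},
    "layout_b": {"rows": ".//div[@id='portfolio']/table/tr", "name": 1, "value": 3},
}

def extract_holdings(markup, layout):
    """Return (security name, value) pairs using the XPath for one layout."""
    spec = FORMATS[layout]
    root = ET.fromstring(markup)
    holdings = []
    for row in root.findall(spec["rows"]):
        cells = ["".join(td.itertext()).strip() for td in row.findall("td")]
        if len(cells) <= max(spec["name"], spec["value"]):
            continue  # header or spacer row without enough data cells
        holdings.append((cells[spec["name"]], cells[spec["value"]]))
    return holdings

# Made-up sample in the shape that "layout_a" expects.
SAMPLE = """<html><body><table class="holdings">
<tr><th>Security</th><th>Shares</th><th>Value</th></tr>
<tr><td>Acme Corp</td><td>100</td><td>1,234</td></tr>
<tr><td>Widget Inc</td><td>50</td><td>567</td></tr>
</table></body></html>"""

print(extract_holdings(SAMPLE, "layout_a"))
```

With roughly 20 layouts, the registry stays small enough to maintain by hand; the remaining work is deciding which layout each filing uses.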
Do you mean that the mutual fund complex includes many funds, and that you get as many different formats for the same time period?