Monday, September 18, 2017

How Apache Tika helped me to extract open data from random files

I love open data. Recently we had the idea to check whether the aldermen were attending the legislative sessions. When we got to the town website, the bad news was that each session's information was provided as a PDF or XLS file that had to be downloaded. That was really bad, and we thought we should do something about it.
Sessão means session; behind each session there was a link to the file containing that session's information...

Our first goal was to download all the files so we could parse them later; a simple bash script solved that. But then we faced the challenge of reading all the files and transforming them into JavaScript so we could show the data on a web page.
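The post used a bash script for this step; for consistency with the rest of the code here, this is a rough Java equivalent. The URL pattern and file names are illustrative, not the town website's real layout.

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SessionDownloader {
    // Downloads one session file to a local path, overwriting if present.
    public static void download(String fileUrl, String target) throws Exception {
        try (InputStream in = new URL(fileUrl).openStream()) {
            Files.copy(in, Paths.get(target), StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical link list; the real script scraped these from the site.
        for (int i = 1; i <= 3; i++) {
            download("http://example.gov.br/sessao" + i + ".pdf", "sessao" + i + ".pdf");
        }
    }
}
```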
Well, we thought: for PDF we use iText, for XLS we use JExcel. Issue solved? No.

Reading the files into text would require more work, especially with the JExcel API, which lets us walk through the spreadsheet cells, read the text of each cell, and so on. But that was fine; that is the exciting part about dealing with open data: hacking. We even used the Strategy design pattern, how cool is that?

The Strategy pattern: in our case the Strategy was the parse method, and the concrete classes were XLSParser and PDFParser. Image source.
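A minimal sketch of that Strategy setup. The interface and method names here are illustrative, not the project's actual code; the real parsers would call JExcel and iText respectively.

```java
import java.io.File;

// Strategy: one parse contract, one concrete class per file format.
interface SessionParser {
    String parse(File file);
}

class XLSParser implements SessionParser {
    @Override
    public String parse(File file) {
        // The real version walked the spreadsheet cells with JExcel.
        return "xls content of " + file.getName();
    }
}

class PDFParser implements SessionParser {
    @Override
    public String parse(File file) {
        // The real version extracted text with iText.
        return "pdf content of " + file.getName();
    }
}

class SessionReader {
    // Picks the concrete strategy from the file extension.
    static SessionParser parserFor(String name) {
        return name.toLowerCase().endsWith(".xls") ? new XLSParser() : new PDFParser();
    }
}
```

The caller only sees `SessionParser`, so adding a new format means adding one class, not touching the reading code.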

However... Yes, you guessed it: the spreadsheets were not regular, so we could not easily guess which cells held the information we were looking for. I would have stopped there because, to be honest, this kind of volunteer work brings neither money nor recognition; we do it because we live in a country that needs a lot of improvement, but when it starts to interfere with work or family commitments, we have to stop.
Then I remembered that, while working on KieML, I had found a reference to the Apache Tika project when checking the OpenNLP sources, and I was also looking for another project to integrate with KieML.

Visit Apache Tika website.

I was afraid of using Apache Tika because it has a lot of dependencies.
org.apache.tika:tika-parsers dependencies 

BUT this ETL part of the project was not meant for runtime, and Apache Tika has a good reason for having so many dependencies; quoting its website: "The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF)."
So I decided to use it, and you can see the code I used here:

All the code I needed to parse any file to text. You can see it on my GitHub.
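For readers who don't want to follow the link, a minimal sketch of that Tika usage, assuming `org.apache.tika:tika-parsers` is on the classpath (the class name `AnyFileToText` is illustrative):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class AnyFileToText {
    // Tika detects the file type (XLS, PDF, ...) and picks the right parser.
    public static String parseToText(String path) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(path))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }
}
```

The same call handles every session file, which is exactly what made the XLSParser/PDFParser split unnecessary.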

Yes, that is all. These few lines of code parse the XLS or PDF into a String that I can easily transform into JSON, and the JSON is then used on a web page.

The information spread across files is now better organized on a web page

My conclusion here is simple: Apache Tika is a great project, and if you are dealing with content from different sources you should at least check it out!
