
How Apache Tika helped me to extract open data from random files

I love open data. Recently we had the idea of checking whether aldermen were attending the legislative sessions. When we went to the town website, the bad news was that each session's information was only available as a PDF or XLS file that had to be downloaded. That was really bad, and we thought we should do something about it.
Sessão means session; each session has a link to the file that contains the session information.

Our first goal was to download all the files so we could parse them later. A simple bash script solved that, but then we faced the challenge of reading every file and transforming its content into JavaScript so we could show it on a web page.
Well, we thought: for PDF we use iText, for XLS we use JExcel. Issue solved? No.

Reading the files into text would require more work, especially with the JExcel API, which has us walk through the spreadsheet cells and read the text of each one. But that was fine; that is the exciting part of dealing with open data: hacking. We even used the Strategy design pattern, how cool is that?

The Strategy pattern: in our case the strategy was the parse method, and the concrete classes were XLSParser and PDFParser.
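To make the idea concrete, here is a minimal sketch of how that Strategy setup could look. Only the `parse` method and the class names `XLSParser` and `PDFParser` come from the description above; the `SessionParser` interface, the `SessionParsers` selector and the method signatures are assumptions for illustration, and the parser bodies are placeholders rather than real JExcel or iText calls.

```java
import java.util.Locale;

// The strategy contract: every file format gets its own parse implementation.
// The interface name and the signature are assumptions, not the original code.
interface SessionParser {
    String parse(String fileName);
}

// Placeholder: a real version would walk the spreadsheet cells with JExcel.
class XLSParser implements SessionParser {
    @Override
    public String parse(String fileName) {
        return "text extracted from spreadsheet " + fileName;
    }
}

// Placeholder: a real version would read the page text with iText.
class PDFParser implements SessionParser {
    @Override
    public String parse(String fileName) {
        return "text extracted from PDF " + fileName;
    }
}

// Hypothetical selector that picks the strategy from the file extension.
class SessionParsers {
    static SessionParser forFile(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".xls")) {
            return new XLSParser();
        }
        if (lower.endsWith(".pdf")) {
            return new PDFParser();
        }
        throw new IllegalArgumentException("Unsupported file type: " + fileName);
    }
}
```

The calling code only ever sees `SessionParser`, so supporting a new format means adding one class and one branch in the selector.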

However... Yes, you guessed it: the spreadsheets were not regular, so we could not easily guess which cells held the information we were looking for. I was ready to stop there because, TBH, this kind of volunteer work brings you neither money nor recognition. We do it because we live in a country that needs a lot of improvement, but when it starts to get in the way of our work or family commitments, we have to stop.
Then I remembered that, while working on KieML, I had found a reference to the Apache Tika project when checking the OpenNLP sources, at a time when I was also looking for other projects to integrate with KieML.

Visit the Apache Tika website.

I was afraid of using Apache Tika because it has a lot of dependencies.
org.apache.tika:tika-parsers dependencies 

BUT this ETL part of the project was not meant for runtime, and Apache Tika has a good reason to have so many dependencies: "The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF)."
So I decided to use it, and you can see the code I used here:

All the code I needed to parse any file to text. You can see it on my GitHub.

Yes, this is all. These lines of code parse the XLS or PDF into a String that I can easily transform to JSON. The JSON is then used on a web page.
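For readers who cannot see the code image, here is a minimal sketch of what such a file-to-text step can look like. It is not the exact code from the repository. It assumes the `Tika` facade class from tika-core, with the tika-parsers module on the classpath so that PDF and XLS files are recognized; the class name `FileToText` and the method `extractText` are hypothetical names chosen for illustration.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper, not the code from the original repository.
class FileToText {

    // The Tika facade detects the file type and delegates to a matching parser.
    private static final Tika TIKA = new Tika();

    // Returns the plain text extracted from the given file (PDF, XLS, ...).
    static String extractText(File file) throws IOException, TikaException {
        return TIKA.parseToString(file);
    }

    // Prints the extracted text of every file path passed on the command line.
    public static void main(String[] args) throws Exception {
        for (String path : args) {
            System.out.println(extractText(new File(path)));
        }
    }
}
```

The same call works for both formats, so the separate per-format parsers are no longer needed. One thing to watch for: by default the facade limits the returned string to 100,000 characters, and `setMaxStringLength(-1)` should lift that limit for larger files.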

The information spread across files is now better organized in a web page

My conclusion here is simple: Apache Tika is a great project, and if you are dealing with content from different sources you should at least check it out!
