Monday, September 18, 2017

How Apache Tika helped me to extract open data from random files

I love open data. Recently we had the idea of checking whether aldermen were attending the legislative sessions. When we went to the town website we got the bad news: the information for each session was provided in a PDF or XLS file that had to be downloaded. That was really bad, and we thought we should do something about it.
Sessão means session; behind each session we had a link to the file that contains the session information...

Our first goal was to download all the files so we could parse them later. A simple bash script solved that, but then we had the challenge of reading all the files and transforming them into JavaScript so we could show the data on a web page.
Well, we thought: for PDF we use iText, for XLS we use JExcel. Issue solved? No.

Reading the files into text would require more work, especially with the JExcel API, which lets us go through the spreadsheet cells, read the text of each cell, etc. But that was fine; that is the exciting part of dealing with open data: hacking. We even used the Strategy design pattern, how cool is that?

Strategy, but in our case the Strategy was the parse method and the concrete classes were XLSParser and PDFParser. Image source.
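
A minimal sketch of that Strategy setup might look like the code below. Only the names XLSParser, PDFParser and the parse method come from our project; the SessionParser interface, the parserFor helper and the placeholder method bodies are assumptions made up for illustration, since the real parsers delegated to iText and JExcel.

```java
import java.util.Locale;

// Hypothetical sketch of the Strategy pattern; the parser bodies are placeholders.
class ParserStrategyDemo {

    /** The strategy: each file format provides its own parse implementation. */
    interface SessionParser {
        String parse(byte[] fileContent);
    }

    /** Placeholder PDF strategy; a real one would delegate to a PDF library. */
    static class PDFParser implements SessionParser {
        @Override
        public String parse(byte[] fileContent) {
            return "PDF text (" + fileContent.length + " bytes read)";
        }
    }

    /** Placeholder XLS strategy; a real one would walk the spreadsheet cells. */
    static class XLSParser implements SessionParser {
        @Override
        public String parse(byte[] fileContent) {
            return "XLS text (" + fileContent.length + " bytes read)";
        }
    }

    /** Picks the strategy from the file extension (an assumed selection rule). */
    static SessionParser parserFor(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) {
            return new PDFParser();
        }
        if (lower.endsWith(".xls")) {
            return new XLSParser();
        }
        throw new IllegalArgumentException("Unsupported file type: " + fileName);
    }

    public static void main(String[] args) {
        byte[] fakeContent = {1, 2, 3}; // stands in for a downloaded file
        for (String name : new String[] {"sessao-01.pdf", "sessao-02.xls"}) {
            System.out.println(name + " -> " + parserFor(name).parse(fakeContent));
        }
    }
}
```

The calling code only depends on SessionParser, so adding a new file format means adding one more class and one more branch in parserFor.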

However... Yes, you guessed it: the spreadsheets were not regular, so we could not easily guess which cells had the information we were looking for. I would have stopped there because, TBH, this kind of volunteer work is not something that will bring you money or recognition. We do it because we live in a country that needs a lot of improvements, but when it starts to disturb our work or family commitments we have to stop.
Then I remembered that while working on KieML I had found a reference to the Apache Tika project when checking the OpenNLP sources; I was also looking for other projects to integrate with KieML.

Visit Apache Tika website.

I was afraid of using Apache Tika because it has a lot of dependencies.
org.apache.tika:tika-parsers dependencies 

BUT this ETL part of the project was not for runtime, and Apache Tika has a good reason to have so many dependencies: The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
So I decided to use it, and you can see the code I used here:

All the code I needed to parse any file to text. You can see it on my GitHub.

Yes, this is all. These lines of code parse the XLS or PDF into a String that I can easily transform into JSON. The JSON is then used on a web page.
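
The String-to-JSON step is not shown in this post. As a rough sketch, assuming we only want each non-blank line of the extracted text as one entry of a JSON array, it could be done with plain Java as below. The TextToJson class and its method names are hypothetical, and the real project may well structure its JSON differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns extracted text into a JSON array of its lines.
class TextToJson {

    /** Escapes a string so it can sit inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Emits one JSON string per non-blank line of the extracted text. */
    static String toJsonArray(String extractedText) {
        List<String> items = new ArrayList<>();
        for (String line : extractedText.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                items.add("\"" + escape(trimmed) + "\"");
            }
        }
        return "[" + String.join(",", items) + "]";
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for what the parser returns.
        String text = "Sessão 12\n\nVereador \"A\"\tpresente\n";
        System.out.println(toJsonArray(text));
    }
}
```

In a real project a JSON library would be a safer choice than hand-written escaping; the sketch only shows how little is left to do once the file is plain text.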

The information spread across files is now better organized in a web page

My conclusion here is simple: Apache Tika is a great project and if you are dealing with content from different sources you should at least check it!

Sunday, August 13, 2017

Java MQTT Client with Eclipse Paho

This is a quick post to share some sample code using Eclipse Paho for MQTT.

Recommended: First steps with MQTT using Mosquitto

Some may prefer to use REST or WebSockets, but MQTT is a standard protocol for connecting things to the Internet. Even a small NodeMCU can read and publish on MQTT topics. You can configure a Raspberry Pi to be an MQTT server or even use a cloud MQTT server, such as the ones from Paho, Mosquitto and other servers in the cloud. There are also paid services such as Amazon AWS.
Here's some very simple code to say hello from a Java application.

Tuesday, July 4, 2017

KieML: Running your Machine Learning models in the cloud

Nowadays Artificial Intelligence is in the news every day. Actually, not only in the news: almost every month we see some impressive scientific paper related to deep learning or machine learning, as you can follow on the Two Minute Papers YouTube channel. Consequently we have many libraries and products based on AI, and as a Java programmer I highlight DeepLearning4J, Weka, TensorFlow, OpenNLP, and Apache Mahout.

We can say that AI is already the new electricity for big tech companies, but what about your company or about your customers? How can Machine Learning or AI improve their business and how will you achieve that?

If you check the list of solutions I provided above, you will notice that we may choose a library for one solution, but it may not fit well for other problems. For example, training a neural network to classify customer feedback may not be the best solution, so you decide to use OpenNLP instead of DeepLearning4J.

Another problem is dynamically updating your machine learning model or pre-trained neural network in production. Let's say the model currently deployed in production has a precision of 87% and your AI specialists improve it to 88%. How can you make the new model available to all the production systems that use it? And what if you decide to change the provider, let's say from OpenNLP to DeepLearning4J?

We've been working on a project called KieML to provide a solution for these problems.


KieML is a project built on top of the Kie API, which is used in JBoss projects such as Drools, OptaPlanner, and jBPM. Using KieML you can package your models in a JAR (also called a kJAR), along with a kmodule.xml descriptor and a models-descriptor.xml.


Notice that the models-descriptor.xml should point to the model binary and the possible labels. You can also provide specific parameters for your provider.
The kmodule.xml can also describe Drools resources, but for KieML you can keep it with the following content:


Once the JAR is saved in the Maven repository, you can use the API to load it:

Input input = new Input();
// load a file or use a text input
Result result = KieMLContainer.newContainer(GAV)
                           .predict("yourModelId", input);

KieML server

Since we are using the Kie API, we can extend Kie Server so you can easily manage your models using the JMS or REST API.

If you want to know more about extending Kie Server you can follow Maciej's great series of articles. KieML also has a client extension so that you can remotely call Kie Server using Java:

KieServicesConfiguration configuration = KieServicesFactory
        .newRestConfiguration("http://localhost:8080/rest/server", "kieserver", "kieserver1!");
KieServicesClient client = KieServicesFactory.newKieServicesClient(configuration);
Input input = new Input("some input");
KieServerMLClient mlClient = client.getServicesClient(KieServerMLClient.class);
System.out.println(mlClient.getModel(CONTAINER_ID, "my model").getResult());

New model JARs can be placed in a Maven repository that you manually copy to the production server's Maven repository (no Maven installation is required, just the repository), or you can use a centralized Nexus repository to push the new JARs. Using the KieScanner feature we can keep a published container updated with the latest version of a kJAR.

Finally, to make KieML available in the cloud you can build a WildFly Swarm JAR and then deploy it on OpenShift, Amazon EC2 or any other service that simply allows Java execution. You can read more about running Kie Server on WildFly Swarm in this post by Maciej.

Kie Server is also easily managed from the Drools/jBPM (or BPM Suite and BRMS) web console when it is used in managed mode. You can manage as many servers as you want and put them behind a load balancer, because KieML should work in a stateless way. Finally, you may check the Kie Server documentation to learn more about its great REST/JMS and Java client APIs.

If you want to try it now you just need Maven and Java 8; the source and the instructions to build and run it locally are on my GitHub. The project is in constant development and still in its early stages; every contribution, suggestion and comment is welcome!

I hope Kie team won't be mad because I took the "KIE" name for this project 0_0

Friday, June 16, 2017

K-means and decision tree using Weka and JavaFX

Weka is one of the best-known tools for Machine Learning in Java, and it has a great Java API, including an API for k-means clustering. Using JavaFX it is possible to visualize unclassified data, classify the data using Weka APIs and then visualize the result in a JavaFX chart, such as the scatter chart.

In this post we will show a simple application that allows you to load data and show it without series distinction using a JavaFX scatter chart; then we use Weka to group the data into a defined number of clusters, and finally we separate the clustered data by chart series. We will be using the Iris.2D.arff file that comes with the Weka download.

K-means clustering using Weka is really simple and requires only a few lines of code as you can see in this post. In our application we will build 3 charts for the Iris dataset:

  1. Data without class distinction (no classes)
  2. The data with the ground truth classification
  3. Data clustered using weka
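
For readers who have never seen what happens behind the third chart, here is a plain-Java sketch of k-means (Lloyd's algorithm) on 2D points. It is not Weka's API and not the code of this application; the KMeansSketch class, the choice of the first k points as initial centroids and the sample points are all assumptions made to keep the example short.

```java
import java.util.Arrays;

// Plain-Java illustration of k-means (Lloyd's algorithm); this is NOT Weka's API.
class KMeansSketch {

    /**
     * Clusters 2D points and returns, for each point, the index of its cluster.
     * Assumption for brevity: the first k points are used as initial centroids.
     */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: attach every point to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = points[i][0] - centroids[c][0];
                    double dy = points[i][1] - centroids[c][1];
                    double dist = dx * dx + dy * dy;
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // no point switched cluster, so we have converged
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                sums[assignment[i]][0] += points[i][0];
                sums[assignment[i]][1] += points[i][1];
                counts[assignment[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    centroids[c][0] = sums[c][0] / counts[c];
                    centroids[c][1] = sums[c][1] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Made-up points forming two obvious groups (not the Iris data).
        double[][] points = {
            {1.0, 1.0}, {5.0, 5.0}, {1.2, 0.8}, {5.1, 4.9}, {0.9, 1.1}, {4.8, 5.2}
        };
        System.out.println(Arrays.toString(cluster(points, 2, 100)));
    }
}
```

Each returned index maps naturally to one chart series, which is how the clustered data can be separated by series for the scatter chart.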

As you can see the clustered data is really close to the real one (the data with correct labels). The code to build the clustered data:

After creating these 3 charts I also modified the whole code to add a decision tree classifier using Weka's J48 algorithm implementation. Right after the chart you can see the tree that I built out of the Iris 2D data:

When you click on any chart, a new item is added; it is classified on the center chart using the decision tree and on the clustered chart using the k-means clusters.

We use both our generated decision tree and the clusterer to classify data. As you can see in the image above, the clusterer classifies some data differently from the decision tree.

I think it is particularly interesting how easy it is to visualize data with JavaFX. The full code for this project can be found on my GitHub, but here is the main class code: