Source Code & Data

On this page, you can find references to the code and the data I have used in most of my publications.

JedAI Toolkit

This is a new toolkit for Entity Resolution (ER) that applies uniformly to both structured data (e.g., relational databases and CSV files) and semi-structured data (RDF and XML files). It can be used in three different ways:

  1. As an open-source Java library for expert users that combines state-of-the-art methods for blocking and matching into an end-to-end Entity Resolution workflow (GitHub repository).
  2. As a user-friendly desktop application with a wizard-like interface that allows even lay users to build complex ER workflows with out-of-the-box solutions, i.e., without the need to fine-tune any configuration parameters (GitHub repository).
  3. As a workbench for comparing the performance of various ER workflows over a series of established benchmark datasets.

JedAI Toolkit is developed in collaboration with the University of Athens, SciFY, Paris Descartes University, and NCSR Demokritos.

Blocking Framework

This framework comprises the Java code that I have used in my research on Information Integration. I have implemented algorithms for (blocking-based) Entity Resolution both from my own papers and from the literature. Most methods apply uniformly to both structured data (e.g., databases) and semi-structured data, i.e., entity profiles extracted from Web data. This is ensured by modeling every entity as a set of uniquely identified name-value pairs. Apart from the code, the framework also contains established datasets for testing all methods.
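To illustrate why the name-value model lets one method handle both kinds of data, here is a minimal sketch. The class and method names (EntityProfileSketch, Attribute, EntityProfile, tokenBlocking) are hypothetical and are not taken from the framework; token blocking is used only as a simple example of a schema-agnostic blocking method from the literature, and the sample records are made up.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class EntityProfileSketch {

    /** One attribute of an entity: a name-value pair (hypothetical type). */
    record Attribute(String name, String value) {}

    /** An entity profile: a unique identifier plus a set of name-value pairs. */
    record EntityProfile(String id, Set<Attribute> attributes) {}

    /**
     * Token blocking: every token appearing in any attribute value becomes a
     * block key. Attribute names are ignored, so no common schema is needed.
     */
    static Map<String, Set<String>> tokenBlocking(List<EntityProfile> profiles) {
        Map<String, Set<String>> blocks = new TreeMap<>();
        for (EntityProfile profile : profiles) {
            for (Attribute attribute : profile.attributes()) {
                for (String token : attribute.value().toLowerCase().split("\\W+")) {
                    if (!token.isEmpty()) {
                        blocks.computeIfAbsent(token, k -> new TreeSet<>()).add(profile.id());
                    }
                }
            }
        }
        // A block with a single entity produces no comparisons, so drop it.
        blocks.values().removeIf(ids -> ids.size() < 2);
        return blocks;
    }

    public static void main(String[] args) {
        // A row from a relational table and a profile extracted from RDF,
        // described with different attribute names.
        EntityProfile relational = new EntityProfile("db:1", Set.of(
                new Attribute("name", "John Smith"),
                new Attribute("city", "Athens")));
        EntityProfile web = new EntityProfile("rdf:42", Set.of(
                new Attribute("foaf:name", "Smith, John"),
                new Attribute("basedNear", "Paris")));

        System.out.println(tokenBlocking(List.of(relational, web)));
    }
}
```

Because only the values are tokenized, the two profiles end up in the same blocks ("john" and "smith") even though their attribute names differ.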

Surfing Prediction Framework

This framework comprises practically all methods that I have used in my research on Web Usage Mining. The Java code implements the three layers of revisitation prediction that are described in the survey paper “Methods for web revisitation prediction: survey and experimentation”. All methods cover both client-side and server-side revisitation prediction. The framework also contains guidelines for downloading 4 real-world datasets, 2 for each type of revisitation prediction.

Text Models

This framework comprises the Java code that I have used in my research on Text Mining. In essence, it implements the bag and the graph representation models that can be used for topic classification. They are described in detail in the paper “Graph vs. bag representation models for the topic classification of web documents”. In the near future, I will enrich the framework with some of the datasets I have used in my experiments.
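As a rough illustration of the difference between the two families of models, the sketch below builds a bag and a graph representation from the same text. It is a generic example under simplifying assumptions (word tokens, edges only between directly adjacent tokens); the class and method names are hypothetical, and the models in the paper and the framework may differ in granularity and weighting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TextModelSketch {

    /** Lowercase the text and split it on non-word characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Bag model: token -> frequency. Word order is discarded. */
    static Map<String, Integer> bagOfTokens(String text) {
        Map<String, Integer> bag = new TreeMap<>();
        for (String token : tokenize(text)) {
            bag.merge(token, 1, Integer::sum);
        }
        return bag;
    }

    /** Graph model: edge "a->b" weighted by how often b directly follows a. */
    static Map<String, Integer> tokenGraph(String text) {
        Map<String, Integer> edges = new TreeMap<>();
        List<String> tokens = tokenize(text);
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String edge = tokens.get(i) + "->" + tokens.get(i + 1);
            edges.merge(edge, 1, Integer::sum);
        }
        return edges;
    }

    public static void main(String[] args) {
        String first = "dog bites man";
        String second = "man bites dog";

        // Same tokens, same counts: the bags are identical.
        System.out.println(bagOfTokens(first).equals(bagOfTokens(second)));

        // The graphs keep the order of neighbouring tokens, so they differ.
        System.out.println(tokenGraph(first));
        System.out.println(tokenGraph(second));
    }
}
```

The two sample sentences have identical bags but different graphs, which is the kind of contextual information a graph model retains and a bag model loses.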
