On this page you can find references to the code and the data I have used in most of my publications.
This is a new toolkit for Entity Resolution (ER) that applies uniformly to both structured data (e.g., relational databases, CSV files) and semi-structured data (e.g., RDF and XML files). It can be used in three different ways:
- As an open-source library (in Java) for expert users that combines state-of-the-art methods for blocking and matching into an end-to-end Entity Resolution workflow (GitHub repository).
- As a user-friendly desktop application with a wizard-like interface that allows even lay users to build complex ER workflows with out-of-the-box solutions, i.e., without the need to fine-tune any configuration parameters (GitHub repository).
- As a workbench for comparing the performance of various ER workflows over a series of established benchmark datasets.
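To illustrate what such an end-to-end workflow looks like, the sketch below pairs a simple blocking step (token blocking) with a Jaccard-based matching step. All class and method names here are my own simplifications for illustration, not the toolkit's actual API.

```java
import java.util.*;

// A minimal sketch of a blocking-then-matching ER workflow.
// All names are illustrative simplifications, not the toolkit's actual API.
public class ErWorkflowSketch {

    // Token blocking: entities sharing at least one token fall into the same block.
    static Map<String, Set<String>> tokenBlocking(Map<String, String> entities) {
        Map<String, Set<String>> blocks = new HashMap<>();
        entities.forEach((id, text) -> {
            for (String token : text.toLowerCase().split("\\W+")) {
                blocks.computeIfAbsent(token, t -> new TreeSet<>()).add(id);
            }
        });
        return blocks;
    }

    // Matching: Jaccard similarity of token sets, computed only within blocks.
    static double jaccard(String a, String b) {
        Set<String> sa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
        Set<String> sb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+")));
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Map<String, String> entities = new LinkedHashMap<>();
        entities.put("e1", "John Smith New York");
        entities.put("e2", "J. Smith New York City");
        entities.put("e3", "Mary Jones Boston");

        Set<String> matches = new TreeSet<>();
        for (Set<String> block : tokenBlocking(entities).values()) {
            List<String> ids = new ArrayList<>(block);
            for (int i = 0; i < ids.size(); i++) {
                for (int j = i + 1; j < ids.size(); j++) {
                    if (jaccard(entities.get(ids.get(i)), entities.get(ids.get(j))) >= 0.4) {
                        matches.add(ids.get(i) + "-" + ids.get(j));
                    }
                }
            }
        }
        System.out.println(matches); // prints [e1-e2]
    }
}
```

The key design point is that blocking restricts the quadratic comparison space: similarity is only computed for entity pairs that share a block.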
This framework comprises the code in Java that I have used in my research on Information Integration. I have implemented algorithms for (blocking-based) Entity Resolution both from my own papers and from the literature. Most methods apply uniformly to both structured data (e.g., databases) and semi-structured data, i.e., entity profiles extracted from the Web Data. This is ensured by modeling every entity as a set of uniquely identified name-value pairs. Apart from the code, the framework also contains established datasets for testing all methods.
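A minimal sketch of this uniform entity model follows; the class and method names are my own, chosen for illustration rather than taken from the framework:

```java
import java.util.*;

// Sketch of the uniform entity model: a uniquely identified set of
// name-value pairs. Names are illustrative, not the framework's actual API.
public class EntityProfile {
    private final String entityId;  // unique identifier of the entity
    private final Set<Map.Entry<String, String>> attributes = new HashSet<>();

    public EntityProfile(String entityId) { this.entityId = entityId; }

    // Works equally for a relational tuple (name = column name)
    // and for an RDF resource (name = predicate URI).
    public void addAttribute(String name, String value) {
        attributes.add(Map.entry(name, value));
    }

    public String getEntityId() { return entityId; }
    public Set<Map.Entry<String, String>> getAttributes() { return attributes; }

    public static void main(String[] args) {
        EntityProfile p = new EntityProfile("http://example.org/person/1");
        p.addAttribute("name", "John Smith");    // e.g., from a CSV column...
        p.addAttribute("foaf:name", "J. Smith"); // ...or from an RDF predicate
        System.out.println(p.getEntityId() + " has " + p.getAttributes().size() + " attributes");
    }
}
```

Because the set abstraction imposes no schema, the same blocking and matching methods can process a database row and a crawled Web entity without modification.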
This framework comprises practically all methods that I have used in my research on Web Usage Mining. The Java code implements the three layers of revisitation prediction that are described in the survey paper “Methods for web revisitation prediction: survey and experimentation”. The methods cover both client-side and server-side revisitation prediction. The framework also contains guidelines for downloading four real-world datasets, two for each type of revisitation prediction.
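To give a flavor of the task, the toy predictor below ranks previously visited URLs by recency, a common baseline formulation of revisitation prediction. It is only a sketch of the general problem; it does not reproduce any of the framework's actual methods or class names.

```java
import java.util.*;

// Toy recency-based revisitation predictor: rank previously visited URLs
// so that the most recently seen ones come first. This is a baseline sketch
// of the task, not one of the framework's actual methods.
public class RecencyPredictor {
    private final LinkedHashSet<String> history = new LinkedHashSet<>();

    // Record a visit; re-inserting moves the URL to the most recent position.
    public void visit(String url) {
        history.remove(url);
        history.add(url);
    }

    // Top-k predictions: most recently visited pages first.
    public List<String> predict(int k) {
        List<String> urls = new ArrayList<>(history);
        Collections.reverse(urls); // newest first
        return urls.subList(0, Math.min(k, urls.size()));
    }

    public static void main(String[] args) {
        RecencyPredictor p = new RecencyPredictor();
        p.visit("a.com"); p.visit("b.com"); p.visit("a.com"); p.visit("c.com");
        System.out.println(p.predict(2)); // prints [c.com, a.com]
    }
}
```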
This framework comprises the Java code that I have used in my research on Text Mining. In essence, it implements the bag and graph representation models that can be used for topic classification; they are analytically described in the paper “Graph vs. bag representation models for the topic classification of web documents”. In the near future, I will enrich the framework with some of the datasets used in my experiments.
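The sketch below contrasts the two models in their most common formulation: a bag maps each term to its frequency, while a graph links terms that co-occur within a sliding window. Method names and the window-based edge definition are my own illustrative choices, not the paper's exact implementation.

```java
import java.util.*;

// Sketch of the two document models: a bag (term -> frequency) and a graph
// whose edges link terms co-occurring within a sliding window.
// Illustrative only; not the framework's actual implementation.
public class RepresentationModels {

    static Map<String, Integer> bagOfWords(String[] tokens) {
        Map<String, Integer> bag = new HashMap<>();
        for (String t : tokens) bag.merge(t, 1, Integer::sum);
        return bag;
    }

    // Undirected co-occurrence graph, encoded as edge "a|b" -> co-occurrence count.
    static Map<String, Integer> graphOfWords(String[] tokens, int window) {
        Map<String, Integer> edges = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            for (int j = i + 1; j < Math.min(i + window, tokens.length); j++) {
                String a = tokens[i], b = tokens[j];
                String edge = a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
                edges.merge(edge, 1, Integer::sum);
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        String[] doc = "data mining of web data".split(" ");
        System.out.println(bagOfWords(doc));
        System.out.println(graphOfWords(doc, 2));
    }
}
```

The difference for classification is that the bag discards word order entirely, whereas the graph retains local term proximity, which the edges expose as extra features.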