Temporal entity extraction from historical texts
Thursday, 9 June
13:45 to 14:45
Room 2120
Friday, 10 June
13:45 to 14:45
Room 2120
In my thesis project I will address temporal aspects of historical texts, i.e. the automated extraction and normalisation of temporal information. Time is a crucial dimension in the humanities: a description of a person, place or linguistic entity should, for example, contain temporal terms. Historical texts may be rich in temporal expressions. Manual extraction of this information is time-consuming, and for this reason some facts might still be undiscovered and thus unknown to researchers. My research will contribute to the development of an effective tool for temporal entity extraction from historical texts, assisting historical text mining.
This project is funded by the Swiss Law Sources Foundation. As material for my research I will use historical Swiss law texts kindly provided by the Foundation.
This organization has been publishing critical editions of Swiss historical legal texts in all languages of Switzerland for over a hundred years, for a total of 118 volumes. The texts date from the 10th to the 18th century, so the state of the written language varies greatly across them.
There are no temporal entity extraction systems for historical texts. Existing temporal information extraction systems use either rule-based techniques or a combination of machine learning and rule-based methods. Normally, a temporal information extraction system performs the following tasks: detection of a temporal expression, its classification (whether it refers to a date, a time, etc.) and its normalisation to an ISO format.
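The three tasks named above (detection, classification, normalisation) can be illustrated with a minimal rule-based sketch. It is not the system developed in this project: the month lexicon, the single pattern and the function name `extract_dates` are assumptions made for illustration, and a real system for historical texts would need far more spelling variants and rules.

```python
import re

# Illustrative month lexicon (an assumption, not project data): modern German
# names plus one older variant, "Hornung" for February.
MONTHS = {
    "januar": 1, "februar": 2, "hornung": 2, "märz": 3, "april": 4,
    "mai": 5, "juni": 6, "juli": 7, "august": 8, "september": 9,
    "oktober": 10, "november": 11, "dezember": 12,
}

# Detection: day, optional full stop, month name, three- or four-digit year.
DATE_PATTERN = re.compile(
    r"\b(\d{1,2})\.?\s+(" + "|".join(MONTHS) + r")\s+(\d{3,4})\b",
    re.IGNORECASE,
)


def extract_dates(text):
    """Return detected date expressions with a type and an ISO-style value."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        day = int(match.group(1))
        month = MONTHS[match.group(2).lower()]
        year = int(match.group(3))
        if not 1 <= day <= 31:
            continue  # crude plausibility check only, no calendar logic
        found.append({
            "text": match.group(0),
            "type": "DATE",  # classification: the only class this sketch knows
            "value": f"{year:04d}-{month:02d}-{day:02d}",  # normalisation
        })
    return found
```

Called on a sentence such as "Geben zu Bern am 9. Juni 1528.", the function returns one entry of type DATE with the normalised value 1528-06-09. Any expression outside this one pattern goes undetected.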
In the first stage of my research I explored rule-based methods, starting with experiments on the rule-based tagger HeidelTime. It soon became clear that, because of high data sparsity, rule-based methods require a lot of manual work: collecting new expressions and writing new rules. To overcome these problems, in the current stage of my project I am developing a system trained with machine learning techniques that is capable of learning annotation patterns from the manually annotated Gold Standard Corpus.
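One common way to prepare manually annotated spans for a sequence-labelling learner is BIO encoding. The sketch below assumes that representation purely for illustration; the abstract does not state which learning method or annotation format the project uses, and the function name `to_bio` and the span format are hypothetical.

```python
def to_bio(tokens, spans):
    """Convert span annotations into one BIO tag per token.

    tokens: list of token strings.
    spans:  list of (start, end, label) with token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)  # default: outside any temporal expression
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the expression
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the expression
    return tags


# Hypothetical annotated sentence: tokens 2 to 4 form one DATE expression.
tokens = ["Geben", "am", "9.", "Juni", "1528", "zu", "Bern"]
tags = to_bio(tokens, [(2, 5, "DATE")])
```

Each token and tag pair can then serve as a training instance, so new expressions are covered by adding annotations rather than by writing new rules.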
At the end of my project I will assess the cross-domain interoperability of the system by evaluating the best-scoring system on out-of-domain historical corpora. Once the extraction of temporal expressions is complete, I will conclude the project by mapping this temporal information to the existing entries in the database of historical places and persons of Switzerland, which I am developing from the back-of-the-book indexes of the published volumes of law texts.