
Text and data mining

What is text and data mining?

The terms text and data mining (TDM) cover a range of research methods. Data mining focuses on data that is usually already available in structured form. Text mining focuses on textual data, e.g. full texts from scientific journals or all the novels published in a century.

These data sets and text collections are first prepared systematically in a machine-readable format. Computer-aided analyses can then automatically identify patterns or correlations or, for example, summarise the central statements of large quantities of documents.
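The two steps described above (preparing texts in a machine-readable form, then analysing them automatically) can be illustrated with a minimal sketch. The toy corpus, the short stop-word list and the function names are assumptions made for this example only. A real project would work with licensed or open access full texts and usually with more capable tooling.

```python
import re
from collections import Counter

# Hypothetical toy corpus; a real project would load licensed or open access texts.
corpus = {
    "doc1": "Text mining identifies patterns in large text collections.",
    "doc2": "Data mining identifies patterns in structured data.",
}

# Small illustrative stop-word list (an assumption, not a standard resource).
STOP_WORDS = {"in", "the", "a", "of", "and"}


def tokenize(text):
    """Preparation step: lower-case the text and keep alphabetic tokens
    that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def term_frequencies(docs):
    """Analysis step: count how often each token occurs across all documents."""
    counts = Counter()
    for text in docs.values():
        counts.update(tokenize(text))
    return counts


frequencies = term_frequencies(corpus)
print(frequencies.most_common(3))
```

Counting terms is only the simplest kind of pattern detection, but the same prepare-then-analyse structure carries over to larger corpora and more advanced methods.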

Quick check

Since the 2018 amendment to the German Copyright Act (UrhG), § 60d UrhG has legally permitted text and data mining for researchers. However, legal and licensing requirements must still be observed.

The right to TDM also includes the storage and processing of data and texts for analysis as well as the necessary digitisation, normalisation, structuring, categorisation, annotation, combination, etc. After completion of the research, the underlying corpus may be handed over for permanent storage for preservation and quality control (see also research data management).

Although TDM is generally permitted, there are certain limits:

  • The research may only serve non-commercial purposes.
  • You must have lawful access to the data (through a licence agreement or open access publications).
  • No existing copy protection may be circumvented.
  • Many licensors also prohibit the automated mass downloading of PDF files via crawlers, scripts, bots, etc.

Because such a mass download can lead the publisher to block access to its content for the entire university, please find out in advance about alternative interfaces and contact the publisher or us at ub-publizieren@uni-passau.de.

The DOI registry Crossref and some publishers offer special interfaces where you can obtain full texts for your TDM projects:
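As an illustration of such an interface, the following minimal sketch uses the public Crossref REST API, where the metadata record for a DOI can list full-text links marked with the intended application "text-mining". The DOI, the e-mail address and the sample record below are placeholders invented for this example, and the helper names are hypothetical. Whether a listed link may actually be used still depends on your licence.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_API = "https://api.crossref.org/works/"


def work_url(doi, mailto=None):
    """Build the Crossref REST API URL for a single DOI.
    The optional mailto parameter identifies you to Crossref."""
    url = CROSSREF_API + urllib.parse.quote(doi, safe="/")
    if mailto:
        url += "?" + urllib.parse.urlencode({"mailto": mailto})
    return url


def text_mining_links(message):
    """Return the full-text URLs that a work record flags for text mining."""
    return [
        link["URL"]
        for link in message.get("link", [])
        if link.get("intended-application") == "text-mining" and "URL" in link
    ]


def fetch_work(doi, mailto=None):
    """Fetch the metadata record for a DOI (performs one network request)."""
    with urllib.request.urlopen(work_url(doi, mailto), timeout=30) as response:
        return json.load(response)["message"]


# Offline example: a hand-written record shaped like a Crossref response.
sample = {
    "DOI": "10.1234/example",  # placeholder DOI
    "link": [
        {"URL": "https://example.org/fulltext.xml",
         "content-type": "application/xml",
         "intended-application": "text-mining"},
        {"URL": "https://example.org/fulltext.pdf",
         "content-type": "application/pdf",
         "intended-application": "similarity-checking"},
    ],
}
print(text_mining_links(sample))
```

For a real record, `text_mining_links(fetch_work(doi, mailto="you@example.org"))` would issue one request per DOI. Keep the request rate low and follow the terms of the interface you use.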

In addition to content that requires a licence, there are also freely accessible databases that allow the use of TDM, including
