LibGuides: Manage Research Data: Text and data mining

What is text mining

Text mining 101
What is text mining, how does it work and why is it useful? This article will help you understand the basics in just a few minutes.
OpenMinTeD
OpenMinTeD is an open and sustainable Text and Data Mining (TDM) platform and infrastructure where researchers can discover, collaboratively create, share and re-use knowledge from a wide range of text-based scientific and scholarly sources.

Text and data mining methods

There are many techniques for mining and analysing text data. Which ones you select will depend on the goal of the project.

Analysis:

Word frequency: a list of all words contained in the text and how often each is used.

Collocation: how often particular words occur together.

Concordance: locates a given word and shows the context it is used in.

N-grams: common clusters of two, three, four or more consecutive words.

Part of speech tagging: tags each word as a noun, adjective, verb, etc. based on its position and definition.

Named entity recognition: identifies names, locations, dates and similar entities.

These methods are often used in combination on a given text data set to offset the downsides of each. For example, a word frequency count may be followed by a collocation and concordance search of the important terms. This shows how each term is being used, whether it has multiple meanings within the text, and whether it is often linked with other terms.
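The combination described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a replacement for a dedicated text mining tool. The function names, the sample text, the simple tokenizer and the context window sizes are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequency(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, keyword, width=3):
    """Return each occurrence of keyword with up to `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok.upper()] + right))
    return lines


def collocates(tokens, keyword, window=2):
    """Count the words that appear within `window` tokens of keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Hypothetical sample text in which "bank" has two meanings.
sample = ("The bank approved the loan. We walked along the river bank. "
          "The bank raised its interest rate.")
tokens = tokenize(sample)

print(word_frequency(tokens).most_common(3))       # most frequent words
print(Counter(ngrams(tokens, 2)).most_common(2))   # most frequent 2-grams
for line in concordance(tokens, "bank"):           # keyword in context
    print(line)
print(collocates(tokens, "bank").most_common(3))   # words found near "bank"
```

The frequency count flags "bank" as a prominent term, and the concordance and collocation output then show that it is used in both a financial and a geographical sense.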

Text classification

Text classification tags text according to various topics, allowing a structure and classification system to emerge from the text. It makes use of some of the techniques of Natural Language Processing (NLP).

Topic analysis: tags the main themes or topics of a text.

Sentiment analysis: tags the underlying emotion of a piece of text.

Language detection: identifies pieces of text that may be in another language.

Intent detection: identifies the intent behind the text. For example, if you are analysing a data set consisting of emails, is a given email providing feedback or requesting information?
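As a rough illustration of the email example above, the sketch below labels a message by counting cue words. It is a toy rule-based approach under stated assumptions: the labels, the cue word lists and the `detect_intent` function are hypothetical, and real projects more commonly train a classifier on labelled examples.

```python
import re

# Hypothetical cue words for each label. A real project would derive these
# from labelled examples or use a trained model instead.
INTENT_CUES = {
    "request_information": {"how", "when", "where", "could", "please", "?"},
    "feedback": {"thanks", "great", "disappointed", "enjoyed", "poor", "helpful"},
}


def detect_intent(text):
    """Return the label whose cue words occur most often, or "unknown" if none match."""
    # Keep question marks as tokens because they are a useful cue for requests.
    tokens = re.findall(r"[a-z']+|\?", text.lower())
    scores = {
        label: sum(tok in cues for tok in tokens)
        for label, cues in INTENT_CUES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


for email in [
    "Could you tell me when the library opens?",
    "Thanks, the workshop was great and very helpful.",
    "Meeting moved to Tuesday.",
]:
    print(detect_intent(email), "<-", email)
```

The same pattern of tokenising, scoring against each label and picking the best score also underlies simple lexicon-based sentiment analysis.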

Sources of text data

The true power of text mining comes not from analysing single texts but from using analysis to generate new information and insights from larger sets of text data. These insights would be difficult to find through slow reading. There are many possible sources of text data; a sample is listed below.

Text created or gathered as part of your research: surveys, interviews, transcriptions, primary resources, and articles from your literature review.
Web scraping: text taken from social media, news sites, and other websites.

Use JSTOR to build a custom dataset: define and request datasets for content on JSTOR, download a sample dataset for teaching text-mining techniques, or request a large dataset for intensive research.
The digital humanities toychest: a listing of document collections suitable for potential text data mining.
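Text gathered by web scraping usually arrives as HTML, so the readable text has to be separated from the markup before it can be analysed. The sketch below shows one way to do this with Python's standard `html.parser` module. The `TextExtractor` class, the `html_to_text` function and the sample page are hypothetical, and fetching the pages themselves is left out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, skipping scripts and styles."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script and style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page content used only for illustration.
page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>News</h1><p>Text mining is <b>useful</b>.</p>"
    "<script>var x=1;</script></body></html>"
)
text = html_to_text(page)
print(text)
```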

Text mining using Scopus and Elsevier sources

All Elsevier journals and books can be text and data mined. The Elsevier API enables researchers to bulk download the content they would like to analyse, making the process more efficient and consistent. The Elsevier text and data mining page has more information about the API and links to the developer portal, where you may obtain API access for non-commercial research. The API may also be used to mine metadata, such as titles and abstracts, indexed in the Scopus database.

Text mining with PubMed

Text mining within PubMed is supported by many free tools that make it easy to see the interactions and links between terms in the research literature. These tools help researchers gain insights and see patterns in their area of research.

PubVenn

Generates a Venn diagram and list of articles based on search terms and articles listed in PubMed.

PubReMiner

Ranks the frequency of words and terms found in the abstracts and titles of articles indexed in PubMed and displays the results in a table. Other frequency displays include the journals and authors most associated with the terms. You can also look up human gene names.

Coremine Medical

Requires a free account to use. Results are displayed on a dashboard showing relationships in a graphic network.

MeSH on demand

Enter free text to highlight and identify MeSH terms. PubMed articles identified as similar to the text are also displayed.

VOSviewer

Visualise patterns and relationships within a bibliometric network. Networks can be formed from authors, journals and other bibliometric details, and text mining can be used to map the relationships between terms.

Web based text mining tools

Voyant tools

Voyant is a free web-based text reading and analysis environment that enables visualisation and interpretation of text for scholarly purposes. It accepts HTML, .txt, .pdf, and Word documents, and no programming experience is required. Voyant analyses your text and presents the results in a dashboard, with interactivity between the various tools.

Additional resources

Constellate
Constellate, the text and data analytics service from JSTOR and Portico, is a platform for learning and performing text analysis, building datasets, and sharing analytics course materials.
Goldstone-Underwood Stoplist
A list of 6032 stop words which may be adapted for your project. Stop words are words that carry little useful information and that you want your analysis to ignore.
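The effect of a stop list can be sketched as follows. This is a minimal example: the eight-word `STOP_WORDS` set and the `content_word_counts` function are hypothetical stand-ins, and in practice you would load a full list such as the one above and adapt it to your project.

```python
import re
from collections import Counter

# A tiny hypothetical stop list used only for illustration.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "it"}


def content_word_counts(text, stop_words=STOP_WORDS):
    """Count word frequencies after discarding any token found in stop_words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tok for tok in tokens if tok not in stop_words)


counts = content_word_counts("The history of the book and the future of the library")
print(counts.most_common())
```

Without the filter, "the" and "of" would dominate the frequency count. With it, only the content words remain.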
