BACKGROUND
Given the rapid rate at which text data are being digitally gathered in many domains of science, there is a growing need for automated tools that can analyze, classify, and interpret this kind of data.
Text mining techniques can be applied to create a structured representation of text, making its content more accessible for researchers. Applications of text mining are everywhere: social media, web search, advertising, emails, customer service, healthcare, marketing, etc. This course offers an extensive exploration into text mining with Python.
This four-day summer school has a strong practical, hands-on focus. Participants will gain experience in applying text mining to real data, for example from the social sciences and healthcare, and in interpreting the results. Through lectures and practical sessions, they will learn the skills needed to design, implement, and understand their own text mining pipeline.
WORKSHOP GOAL
During the summer school, we address the following topics:
- Review the fundamental approaches to text mining.
- Understand and apply current methods for analyzing texts.
- Define a text mining pipeline given a practical data science problem.
- Implement all steps in a text mining pipeline: feature extraction, feature selection, model learning, model evaluation.
- Understand and apply state-of-the-art methods in text mining.
- Explore deep learning techniques for text analysis and how they can be applied to solve advanced text-based problems.
The course starts by reviewing basic concepts of text mining and then moves on to implementing advanced concepts in natural language processing.
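As a rough illustration of the pipeline steps named above (feature extraction, feature selection, model learning, model evaluation), the following minimal sketch runs all four on a tiny invented corpus. It is not taken from the course materials: the example sentences, the function names, the document-frequency threshold, and the choice of a multinomial Naive Bayes classifier are all assumptions made for illustration, and it uses only the Python standard library.

```python
import math
import re
from collections import Counter, defaultdict

# Toy labelled corpus, invented purely for illustration.
TRAIN = [
    ("the patient reported chest pain and fever", "health"),
    ("the doctor prescribed medication for the fever", "health"),
    ("nurses monitor the patient after surgery", "health"),
    ("voters elected a new parliament this year", "politics"),
    ("the parliament passed a new election law", "politics"),
    ("the minister addressed voters before the election", "politics"),
]
TEST = [
    ("the patient needs medication after surgery", "health"),
    ("voters criticised the new law", "politics"),
]


def extract_features(text):
    """Feature extraction: lowercase, tokenize, count words (bag of words)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def select_features(docs, min_df=2):
    """Feature selection: keep words that occur in at least min_df documents."""
    doc_freq = Counter()
    for text, _ in docs:
        doc_freq.update(set(extract_features(text)))
    return {word for word, n in doc_freq.items() if n >= min_df}


def train_naive_bayes(docs, vocab):
    """Model learning: multinomial Naive Bayes with add-one smoothing."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for text, label in docs:
        for word, n in extract_features(text).items():
            if word in vocab:
                word_counts[label][word] += n
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        log_likelihood = {
            word: math.log((word_counts[label][word] + 1) / (total + len(vocab)))
            for word in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model


def predict(model, text):
    """Return the label with the highest log-probability score."""
    features = extract_features(text)
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(
            n * log_likelihood[word]
            for word, n in features.items()
            if word in log_likelihood  # words outside the vocabulary are ignored
        )
    return max(scores, key=scores.get)


def evaluate(model, docs):
    """Model evaluation: fraction of documents labelled correctly."""
    correct = sum(predict(model, text) == label for text, label in docs)
    return correct / len(docs)


vocab = select_features(TRAIN)
model = train_naive_bayes(TRAIN, vocab)
accuracy = evaluate(model, TEST)
print("Selected features:", sorted(vocab))
print("Accuracy on held-out examples:", accuracy)
```

With a corpus this small, the accuracy figure says nothing about real performance; the point is only to show how the four steps connect. In practice, a library such as scikit-learn would typically replace the hand-written functions.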
WORKSHOP CONTENT
PART 1
- Basics of Python (basic data types, containers, functions, NumPy, ...)
- Practical exercise
PART 2
- What is Text Mining?
- Text Preprocessing
- Vector Space Model
- Practical exercise
PART 3
- Classification basics
- Text Classification Algorithms
- Evaluating classifiers
- Practical exercise
PART 4
- How to do feature selection (FS) for text data?
- Text Preprocessing
- Is Principal Component Analysis (PCA) a FS method for text?
- Other methods?
- Practical exercise
PART 5
- What is text clustering?
- What are the applications?
- How to cluster text data?
- Practical exercise
PART 6
- Word representation
- Vector representation
- Words as vectors
- Practical exercise
PART 7
- Language modeling
- Feed-forward neural networks
- Recurrent neural networks
- Practical exercise
PART 8
- Convolutional Neural Networks
- Transformers and BERT
- Practical exercise
TARGET AUDIENCE & PRIOR KNOWLEDGE
This interdisciplinary summer school is ideal for learners who are comfortable with Python programming, wish to acquire skills in text mining approaches, and have a foundational understanding of machine learning. Participants from various fields such as sociology, psychology, education, business, biology, geosciences, political science, and communication sciences will find this course beneficial.
TECHNICAL REQUIREMENTS
- Participants are requested to bring their own laptop for the practical exercises. Additionally, participants should have an internet connection available to fully engage in the activities and access any necessary resources.
- A Wi-Fi connection is required to use Python in JupyterHub (e.g. via eduroam: https://www.uni-bremen.de/en/zfn/wifi/overview-wifi).
- Participants should have a basic knowledge of data science and programming and be motivated to script and program in Python.
ABOUT THE TRAINER
Maryam Movahedifar is a data scientist for training and consulting at the DSC. She holds a PhD in Statistics and has extensive experience in Interpretable Machine Learning. With a strong foundation in statistical methods and practical experience in applying these techniques to real-world problems, she is well-equipped to teach complex machine learning concepts. Her expertise includes making advanced models understandable and accessible.