Internet-based Information Extraction Technologies

 

Teacher:  Fang Li

Office: SEIEE Building No. 3, Room 533; Tel: 34205423

Office Hours: Thursday afternoon (14:00-16:00)

 

Teaching Assistant: Zhe Ye

 

Lecture Time & Venue: 

Every Thursday (10:00-11:40 AM) from September 2017 to December 2017.

  Place:

Dong Xia Yuan 213 (东下院213): week 1 and weeks 8 to 15

  Dong Shang Yuan 115 (东上院115): weeks 2 to 7

 

Textbook: 

Information Extraction: Algorithms and Prospects in a Retrieval Context, by Marie-Francine Moens. Published by Springer (P.O. Box 17, 3300 AA Dordrecht, The Netherlands). ISBN-13 978-1-4020-4993-4 (e-book).

 

References:

1)    Sunita Sarawagi. Information Extraction. Foundations and Trends in Databases, vol. 1, no. 3 (2007), pp. 261-377.

2)    Jerry R. Hobbs, Ellen Riloff. Information Extraction. Chapter 21 of the Handbook of Natural Language Processing (2010).

3)    Ralph Grishman. Information Extraction: Capabilities and Challenges (2012)

 

Introduction:

Internet-based information extraction (IE) is the task of deriving structured information from unstructured text and semi-structured web pages. More succinctly, information extraction finds the names of entities, relations, and events in Internet content.
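To make the definition concrete, here is a minimal sketch of turning unstructured text into structured records. It is only an illustration, not the method taught in the lectures: the `Employment` record, the `PATTERN` regular expression, and the sample sentences are all hypothetical, and a single hand-written pattern like this covers only a tiny fraction of real text.

```python
import re
from typing import List, NamedTuple


class Employment(NamedTuple):
    """A structured record for one extracted employment relation (hypothetical schema)."""
    person: str
    organization: str


# Hypothetical hand-written pattern: "<Person> works for|works at|joined <Organization>".
# A person is two or more capitalized words; an organization is one or more.
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) "
    r"(?:works (?:for|at)|joined) "
    r"(?P<org>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)


def extract_employment(text: str) -> List[Employment]:
    """Return every (person, organization) pair matched by PATTERN in the text."""
    return [
        Employment(m.group("person"), m.group("org"))
        for m in PATTERN.finditer(text)
    ]


# Made-up sample text, used only to exercise the pattern.
text = (
    "Alice Smith works for Acme Corp. "
    "Last year Bob Jones joined Example Labs in Shanghai."
)
records = extract_employment(text)
for record in records:
    print(record.person, "->", record.organization)
```

The free text goes in, and a list of typed records comes out. Rule-based patterns of this kind are brittle, which is one reason the course also covers machine-learning approaches.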

The course gives an overview of the history and technologies of information extraction. It presents state-of-the-art research methods and focuses on real-world applications.

Readings will be based on the textbook and references. Grades will be based on class participation and a project. There is no final examination for this course. Students are encouraged to form groups to complete a project and write a report. There are three tasks. Each group chooses one of the tasks and presents its project in the class workshop held at the end of the semester.

 

Course Topics and Readings

Weeks

Topics

Slides

Readings

1st

Motivation & Course Introduction

Lecture 1

NELL system

2nd

Basic Knowledge for IE

Lecture 2

 

POSforChinese

WordVector1

WordVector2

3rd

IE Concepts

Lecture 3

WordVectorTutorial

PPT(TA)

Chapter 1 of textbook

Chapter 2 of textbook

Chapter 8 of textbook

4th

Named Entity Extraction (rule-based)

Lecture 4

Chapter 4 of textbook

Reference

5th

Named Entity Extraction (machine learning)

Lecture 5

Chapter 5 of textbook

CRFmodelforORG

6th

Relation Extraction (pattern-based, supervised)

Lecture 6

Reference, SVMguide

7th

Relation Extraction (semi-supervised, distantly supervised, deep learning)

Lecture 7

Video

DistantSupervisionMethod

TransE(deepLearning)

8th

Event extraction

Lecture 8

Chapter 9 of textbook

Template-based Event extraction without template

9th

Opinion Mining

Lecture 9

Sentiment Analysis

SA2016competitionRef

10th

Opinion words mining

Lecture 10

Turney Algorithm

Inducing Domain-specific Sentiment Lexicons from Unlabeled Corpora

11th

Webpage IE

Lecture 11

surveyofwebIE

12th

IE system (1)

Lecture 12

Lixto, RoadRunner

13th

IE System (2)

Lecture 13 

KnowItAll

14th

IE System (3)

Lecture 14

TextRunner

15th

Student Workshop

(Employment Relation Extraction)

1)      顾仁杰 group

2)      陈鼎 group

3)      顾乡 group

4)      吕明辉 group

5)      王贺 group

6)      王政晖 (task 3)

Each group presents its work, covering: the task, its problems, and analysis (2 minutes); the general approach (3 minutes); results (3 minutes); open questions and challenges (2 minutes); and Q&A (5 minutes).

16th

Student Workshop (Sentiment Analysis)

1)   仇伟

2)   王睿杰 group

3)   朱肇国 group

4)   陈翔宇 group

5)   谢昕宇 group

6)   顾嘉凡 group

Same presentation format as above.

Note:

The content of each lecture may change. The slides above give only general information about each lecture. Classroom exercises and discussions are not included in these slides.

References from Industry:  

1)     Knowledge-based Information Extraction, taught by an expert from Alibaba in 2013.

2)  Information Extraction in E-commerce, taught by an expert from Alibaba in 2013.

Prerequisites

Data Structures, Programming Languages, Natural Language Processing

Grading:

1.      Attendance & Classroom Discussions (40%)  (weeks 1 to 14)

2.      Workshop Presentation (20%)  (weeks 15 and 16)

3.      Evaluation of Algorithm or System (40%)  (week 17)

Project tasks:

1)    Specific Relation Extraction. Please see the training data (for employment, chief-of, and location extraction), another set of training data (for four kinds of employment relationship extraction), and student work1 and student work2 presented in previous years for reference. Or

2)    Positive and Negative Sentiment Analysis. Please see the training data for an example. Or

3)    Exploring a new extraction task for a particular application, such as news extraction.

About the evaluation (the time and place will be announced later):

1)    Task 1 (employment relation extraction): input file format and output file format.

2)    Task 2 (positive and negative sentiment analysis): input file format and output file format.

3)    Specification for evaluation