Prof. Meng - Emrecan Tarakci

From CS486wiki
== Introduction ==


After discussing and agreeing on the project with Professor Weiyi Meng, I, Emrecan Tarakci, started working on the project known as Publication Analysis on Google Scholar. The project fulfills the requirements of the Senior Project I & II courses. The program focuses on extracting records from Google Scholar: author names, paper title, year of publication, publication venue, and citation count. After extracting and storing the records, the program computes the number of self-citations and non-self-citations, the i10-index, the H-index, and the number of publications per year for each academician. Since the program will first be used for the Watson faculty members, it also computes the following across the faculty: the total citation count, the total non-self-citation count, the average citation count, the average non-self-citation count, the total i10-index, the total i10-index based on non-self-citations, the average i10-index, the average i10-index based on non-self-citations, the average H-index, the average H-index based on non-self-citations, and the ratio of total non-self-citations to total citations.

== Technical Details ==

The program is developed in Visual Studio 2013 for Desktop, using C# as the programming language. Microsoft SQL Server is used for database management.

== Project Requirements ==

This project extracts and analyzes the publications and citations of university faculty based on their Google Scholar pages.

Stage 1: Extract basic publication and citation information for a given faculty member.
Input: The URL of the faculty member's Google Scholar profile page.
Outputs: Every individual publication of the faculty member. For each publication, extract its title, authors, publication venue (conference name for conference publications; journal name and volume/issue numbers for journal publications), page numbers, publication year, citation count, and citation link. Individual author names should be separated. The citation link is the URL of the (first) page that lists the publications citing the publication under consideration. The extracted records are exported to an XML file and an Excel file.
Requirement: Minimize the number of query submissions/downloads from the Google Scholar site.
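
One way to approach the "minimize downloads" requirement is to cache every page that has been fetched and to pause between real downloads. The sketch below is only an illustration in Python (the project itself is written in C#). The class name <code>CachedFetcher</code>, the caller-supplied <code>fetch</code> function, and the 5-second default delay are all assumptions, not details of the actual program.

```python
import time
from typing import Callable, Dict


class CachedFetcher:
    """Fetch each URL at most once and pause between real downloads."""

    def __init__(self, fetch: Callable[[str], str], delay_seconds: float = 5.0,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._fetch = fetch          # hypothetical downloader supplied by the caller
        self._delay = delay_seconds  # assumed pause, not a value from the project
        self._sleep = sleep          # injectable so the delay can be skipped in tests
        self._cache: Dict[str, str] = {}
        self.downloads = 0           # number of real downloads performed so far

    def get(self, url: str) -> str:
        # Serve repeated requests from memory instead of hitting the site again.
        if url in self._cache:
            return self._cache[url]
        # Wait before every download except the first one.
        if self.downloads > 0:
            self._sleep(self._delay)
        page = self._fetch(url)
        self.downloads += 1
        self._cache[url] = page
        return page
```

With such a wrapper, requesting the same citation page twice costs only one download.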

Stage 2: Extract information on all publications that cite a given publication P and determine whether each citation is a self-citation.
Input: A given publication P and the URL L of the (first) page that lists the publications citing P.
Output: The number of self-citations and non-self-citations for P among the publications that cite P. A publication p1 that cites P is a self-citation if p1 and P share at least one author.
Requirement: Minimize the number of query submissions/downloads from the Google Scholar site.
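
The self-citation rule above (p1 and P share at least one author) can be sketched as a set intersection. This is a minimal Python illustration, not the project's C# code. The function names and the name-matching rule (lower-casing and collapsing whitespace) are assumptions; real author names on Google Scholar may need more careful matching.

```python
from typing import Iterable, List, Tuple


def normalize(name: str) -> str:
    """Lower-case a name and collapse whitespace (an assumed matching rule)."""
    return " ".join(name.lower().split())


def is_self_citation(cited_authors: Iterable[str], citing_authors: Iterable[str]) -> bool:
    """Return True if the two author lists share at least one author."""
    cited = {normalize(a) for a in cited_authors}
    citing = {normalize(a) for a in citing_authors}
    return not cited.isdisjoint(citing)


def count_citations(cited_authors: List[str],
                    citing_author_lists: List[List[str]]) -> Tuple[int, int]:
    """Return (self-citation count, non-self-citation count) for one publication."""
    self_count = sum(
        1 for authors in citing_author_lists
        if is_self_citation(cited_authors, authors)
    )
    return self_count, len(citing_author_lists) - self_count
```

For example, with P written by "A Author" and "B Writer", a citing paper by "a  author" and "C Other" counts as a self-citation, while one written only by "D Someone" does not.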

Stage 3: Combine the Stage 1 and Stage 2 programs to find the non-self-citation count for every publication of a given faculty member from the Google Scholar site.
Input: The URL of the faculty member's Google Scholar profile page.
Output: The same as for Stage 1, except that the non-self-citation count for each publication is added to the result.

Stage 4: Compute the i10-index (the number of publications that have at least 10 citations) and the H-index (the largest number h such that there are h papers each having at least h citations), based on both total citations and non-self-citations. Also compute the ratio of non-self-citations to total citations.
Input: The output of Stage 3.
Output: The i10-index based on total citations, the i10-index based on non-self-citations, the H-index based on total citations, the H-index based on non-self-citations, and the ratio of non-self-citations to total citations.
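
The two index definitions above translate directly into code. The following Python sketch is an illustration only (the project itself is written in C#); the function names are assumptions. Each function takes a list of per-publication citation counts, so the same functions can be applied to total citation counts and to non-self-citation counts.

```python
from typing import List


def i10_index(citations: List[int]) -> int:
    """Number of publications with at least 10 citations."""
    return sum(1 for c in citations if c >= 10)


def h_index(citations: List[int]) -> int:
    """Largest h such that h publications each have at least h citations."""
    h = 0
    # Walk the counts from most cited to least cited; rank is 1-based.
    for rank, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h


def non_self_ratio(total: List[int], non_self: List[int]) -> float:
    """Ratio of non-self-citations to total citations (0.0 if there are none)."""
    total_sum = sum(total)
    return sum(non_self) / total_sum if total_sum else 0.0
```

For the citation counts 25, 12, 10, 4, 3, 0, the i10-index is 3 (three publications have at least 10 citations) and the H-index is 4 (four publications have at least 4 citations each, but there are not five with at least 5).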

Stage 5: Divide the publication records of a given faculty member by year.
Input: The output of Stage 1.
Output: The input divided by year, with the publications for more recent years listed first.
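
Grouping by year with the most recent year first can be sketched as follows. This is a Python illustration, not the project's C# code. It assumes each publication record is a dictionary with an integer <code>"year"</code> key, which is a hypothetical record layout.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_year(publications: List[dict]) -> Dict[int, List[dict]]:
    """Group publication records by year, most recent year first."""
    groups = defaultdict(list)
    for pub in publications:
        groups[pub["year"]].append(pub)
    # Dictionaries keep insertion order, so inserting years in descending
    # order yields a result that iterates from newest to oldest.
    return {year: groups[year] for year in sorted(groups, reverse=True)}
```

Within each year the records keep the order in which they appeared in the input.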

Stage 6: For a list of Google Scholar faculty profiles, compute the total citation count, the total non-self-citation count, the average citation count, the average non-self-citation count, the total i10-index, the total i10-index based on non-self-citations, the average i10-index, the average i10-index based on non-self-citations, the average H-index, the average H-index based on non-self-citations, and the ratio of total non-self-citations to total citations.
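
The Stage 6 aggregates can be sketched as sums and averages over per-faculty results. The Python code below is an illustration only (the project itself is written in C#). The <code>FacultyMetrics</code> record and its field names are hypothetical, and the sketch assumes that "average" means the average per faculty member.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FacultyMetrics:
    """Per-faculty results, as produced by the earlier stages (hypothetical layout)."""
    citations: int
    non_self_citations: int
    i10: int
    i10_non_self: int
    h: int
    h_non_self: int


def summarize(faculty: List[FacultyMetrics]) -> Dict[str, float]:
    """Compute totals, per-faculty averages, and the non-self-citation ratio."""
    n = len(faculty)
    if n == 0:
        raise ValueError("at least one faculty profile is required")
    total_cit = sum(f.citations for f in faculty)
    total_non_self = sum(f.non_self_citations for f in faculty)
    total_i10 = sum(f.i10 for f in faculty)
    total_i10_non_self = sum(f.i10_non_self for f in faculty)
    return {
        "total_citations": total_cit,
        "total_non_self_citations": total_non_self,
        "avg_citations": total_cit / n,
        "avg_non_self_citations": total_non_self / n,
        "total_i10": total_i10,
        "total_i10_non_self": total_i10_non_self,
        "avg_i10": total_i10 / n,
        "avg_i10_non_self": total_i10_non_self / n,
        "avg_h": sum(f.h for f in faculty) / n,
        "avg_h_non_self": sum(f.h_non_self for f in faculty) / n,
        # Guard against division by zero when nobody has been cited.
        "non_self_ratio": total_non_self / total_cit if total_cit else 0.0,
    }
```

The H-index is only averaged, not summed, which matches the list of outputs in Stage 6.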

== Weekly Progress ==

Since I completed the first two phases during the first semester, I started the spring semester with the third phase.

=== Week 1 & 2 & 3 & 4 ===

Working on Phase 3

The major difficulty was that sending many requests to Google's server resulted in being banned by Google (essentially a ban on the local IP address).

=== Week 5 & 6 & 7 ===

Working on Phase 4

=== Week 8 & 9 ===

Working on Phase 5

=== Week 10 & 11 & 12 ===

Working on Phase 6

== Charts ==

=== First Semester ===
[[File:Screen_Shot_2014-12-16_at_11.35.59_PM1.png]]

=== Second Semester ===
[[File:Screen_Shot_2014-12-17_at_12.37.20_AM.png]]

Latest revision as of 21:27, 2 May 2015
